HDP Analyst Data Science

Our classes are always live and instructor-led, delivered from our Exton, PA or EPIC Partner locations. Springhouse AnywhereLive options require Internet access. Select classes are Guaranteed to Run (GTR).






This course provides instruction on the processes and practice of data science, including machine learning and natural language processing. It covers tools and programming languages (Python, IPython, Mahout, Pig, NumPy, pandas, SciPy, scikit-learn), the Natural Language Toolkit (NLTK), and Spark MLlib.

Intended Audience

Architects, software developers, analysts and data scientists who need to apply data science and machine learning on Hadoop.

At Completion

  • Recognize use cases for data science on Hadoop
  • Describe the Hadoop and YARN architecture
  • Describe the differences between supervised and unsupervised learning
  • Use Mahout to run a machine learning algorithm on Hadoop
  • Describe the data science life cycle
  • Use Pig to transform and prepare data on Hadoop
  • Write a Python script
  • Describe options for running Python code on a Hadoop cluster
  • Write a Pig User-Defined Function in Python
  • Use Pig streaming on Hadoop with a Python script
  • Use machine learning algorithms
  • Describe use cases for Natural Language Processing (NLP)
  • Use the Natural Language Toolkit (NLTK)
  • Describe the components of a Spark application
  • Write a Spark application in Python
  • Run machine learning algorithms using Spark MLlib
  • Take data science into production


Prerequisites

Students must have experience with at least one programming or scripting language, knowledge of statistics and/or mathematics, and a basic understanding of big data and Hadoop principles. Students new to Hadoop are encouraged to attend the HDP Overview: Apache Hadoop Essentials course.

Exams & Certifications


  • 50% Lecture/Discussion
  • 50% Hands-on Labs

Course Outline

Hands-On Labs

  • Lab: Setting Up a Development Environment
  • Demo: Block Storage
  • Lab: Using HDFS Commands
  • Demo: MapReduce
  • Lab: Using Apache Mahout for Machine Learning
  • Demo: Apache Pig
  • Lab: Getting Started with Apache Pig
  • Lab: Exploring Data with Pig
  • Lab: Using the IPython Notebook
  • Demo: The NumPy Package
  • Demo: The pandas Library
  • Lab: Data Analysis with Python
  • Lab: Interpolating Data Points
  • Lab: Defining a Pig UDF in Python
  • Lab: Streaming Python with Pig
  • Demo: Classification with Scikit-Learn
  • Lab: Computing K-Nearest Neighbor
  • Lab: Generating a K-Means Clustering
  • Lab: POS Tagging Using a Decision Tree
  • Lab: Using NLTK for Natural Language Processing
  • Lab: Classifying Text using Naive Bayes
  • Lab: Using Spark Transformations and Actions
  • Lab: Using Spark MLlib
  • Lab: Creating a Spam Classifier with MLlib





Springhouse Education & Consulting Services

Corporate HQ: Eagleview Corporate Park
707 Eagleview Boulevard
Suite 207
Exton, PA 19341

610-321-3500 - info@springhouse.com