Data Science Crash Course - DataWorks Summit - Munich 2017
Robert Hryniewicz
Developer Advocate
@RobertH8z
rhryniewicz@hortonworks.com

© Hortonworks Inc. 2011 – 2016. All Rights Reserved

What is Data Science?
• Extracting knowledge/insights from data
– Data: structured or unstructured
• Continuation of
– statistics
– machine learning
– data mining
– predictive analytics

What is Machine Learning?
Machine Learning: “science of how computers learn without being explicitly programmed”

“AI is the new electricity.”
“AI needs to be company wide strategic decision.”
Andrew Ng
Chief Data Scientist
Co-founder of Coursera
Prof. at Stanford

A Brief History of AI
Antiquity – An Ancient Wish to Forge the Gods
1940 (Digital Computer, scientists discuss electronic brain)
1954 – 73 (Marvin Minsky et al. in Dartmouth College)
1973 – 80
1980 – 87 (Japanese gov.)
1987 – 93
1993 – 2000
2000 → Present

AI in Media & Pop Culture

What is AI?
• General or Pure AI
• Narrow or Pragmatic AI

“Big Data”
• Internet of Anything (IoT)
– Wind Turbines, Oil Rigs
– Beacons, Wearables
– Smart Cars
• User Generated Content (Social, Web & Mobile)
– Twitter, Facebook, Snapchat
– Clickstream
– Paypal, Venmo
44 ZB in 2020

Visualizing 44ZB
Scale: 100 pixels = 1M TB
At this scale, 44 ZB fills roughly a 5M-pixel screen
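
The arithmetic behind that scale can be checked with a few lines. This is a minimal sketch that assumes decimal units (1 ZB = 10^9 TB), which the slide does not state; the function name is made up for illustration.

```python
# Assumption: decimal units, so 1 ZB = 10**21 bytes and 1 TB = 10**12 bytes.
TB_PER_ZB = 10**9
PIXELS_PER_MILLION_TB = 100  # scale stated on the slide


def pixels_for(zettabytes):
    """Return how many pixels represent the given volume at the slide's scale."""
    terabytes = zettabytes * TB_PER_ZB
    return terabytes / 1_000_000 * PIXELS_PER_MILLION_TB


print(pixels_for(44))  # 4400000.0
```

About 4.4M pixels are needed, which is consistent with the 5M-pixel screen mentioned on the slide.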

Key drivers behind AI Explosion
• Exponential data growth
• Faster distributed systems
• Smarter algorithms

Major Trends in AI Technologies
• Knowledge Engineering
• Machine Learning
• Deep Learning
• Image Analysis
• Natural Language Processing & Generation
• Robotics & Automation

Creating Value with AI
• Cognitive insights
• Cognitive engagement
• Cognitive automation

Machine Learning Use Cases
• Healthcare
– Predict diagnosis
– Prioritize screenings
– Reduce re-admittance rates
• Financial services
– Fraud detection/prevention
– Predict underwriting risk
– New account risk screens
• Public Sector
– Analyze public sentiment
– Optimize resource allocation
– Law enforcement & security
• Retail
– Product recommendation
– Inventory management
– Price optimization
• Telco/mobile
– Predict customer churn
– Predict equipment failure
– Customer behavior analysis
• Oil & Gas
– Predictive maintenance
– Seismic data management
– Predict well production levels

What Is Apache Spark?
• Apache open source project originally developed at AMPLab (University of California, Berkeley)
• Unified data processing engine that operates across varied data workloads and platforms

Why Apache Spark?
• Elegant Developer APIs
– Single environment for data munging, data wrangling, and Machine Learning (ML)
• Fast: in-memory computation model
– Effective for iterative computations
• Machine Learning
– Implementation of distributed ML algorithms

Spark SQL: Structured Data
Spark Streaming: Near Real-time
Spark MLlib: Machine Learning
GraphX: Graph Analysis

More Flexible, Better Storage and Performance

Spark SQL Overview
• Spark module for structured data processing (e.g. DB tables, JSON files, CSV)
• Three ways to manipulate data:
– DataFrames API
– SQL queries
– Datasets API

DataFrames
• Distributed collection of data organized into named columns
• Conceptually equivalent to a table in a relational DB or a data frame in R/Python
• API available in Scala, Java, Python, and R
[Figure: a DataFrame drawn as a grid of rows and columns Col1, Col2, …, ColN]
Data is described as a DataFrame with rows, columns, and a schema

DataFrames
[Figure: CSV, Avro, Hive, and JSON sources read through Spark SQL into a DataFrame of rows and columns Col1, Col2, …, ColN]

Visualizations

Data Visualization: Twitter
Source: https://medium.com/@swainjo/us-presidential-election-2016-twitter-analysis-7596606853e5#.dozwu2bhd

Simple line chart

Horizontal plot of three line charts

Streaming data into a line chart

Plotting Iris data features in one plot

Comparing Iris data distributions

Algorithms

What is a ML Model?
• Mathematical formula with a number of parameters that need to be learned from the data. Fitting a model to the data is a process known as model training.
• E.g. linear regression
– Goal: fit a line y = mx + c to data points
– After model training: y = 2x + 5
Input → Model → Output
1, 0, 7, 2, … → 7, 5, 19, 9, …
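
To make "model training" concrete, here is a minimal sketch that fits y = mx + c to the four input/output pairs shown on the slide using ordinary least squares. The deck does not say which fitting method it has in mind, so least squares is an assumption, and `fit_line` is an illustrative helper name.

```python
def fit_line(xs, ys):
    """Fit y = m*x + c by ordinary least squares and return (m, c)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance(x, y) / variance(x)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx
    c = mean_y - m * mean_x
    return m, c


inputs = [1, 0, 7, 2]     # input values from the slide
outputs = [7, 5, 19, 9]   # output values from the slide
m, c = fit_line(inputs, outputs)
print(m, c)  # 2.0 5.0
```

The learned parameters match the trained model y = 2x + 5 quoted above.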

START
• Classification: Logistic Regression, Support Vector Machines (SVM), Random Forest (RF), Naïve Bayes
• Regression: Linear Regression
• Collaborative Filtering: Alternating Least Squares (ALS)
• Clustering: K-Means, LDA
• Dimensionality Reduction: Principal Component Analysis (PCA)

CLASSIFICATION
Identifying which category an object belongs to
Examples: spam detection, diabetes diagnosis, text labeling
Algorithms:
• Logistic Regression
– Fast training, linear model
– Classes expressed in probabilities
• Support Vector Machines (SVM)
– “Best” supervised learning algorithm, effective
– More robust to outliers than Logistic Regression
– Handles non-linearity
• Random Forest
– Fast training
– Handles categorical features
– Does not require feature scaling
– Captures non-linearity and feature interaction
• Naïve Bayes
– Good for text classification
– Assumes independent variables

Visual Intro to Decision Trees
• http://www.r2d3.us/visual-intro-to-machine-learning-part-1

REGRESSION
Predicting a continuous-valued output
Example: Predicting house prices based on number of bedrooms and square footage
Algorithms: Linear Regression

CLUSTERING
Automatic grouping of similar objects into sets (clusters)
Example: market segmentation – automatically group customers into different market segments
Algorithms: K-means, LDA

COLLABORATIVE FILTERING
Fill in the missing entries of a user-item association matrix
Applications: Product/movie recommendation
Algorithms: Alternating Least Squares (ALS)

DIMENSIONALITY REDUCTION
Reducing the number of redundant features/variables
Applications:
• Removing noise in images by selecting only “important” features
• Removing redundant features, e.g. MPH & KPH are linearly dependent
Algorithms: Principal Component Analysis (PCA)

START
Tasks: Regression, Classification, Deep Learning, Clustering, Dimensionality Reduction, Collaborative Filtering
Algorithms:
• XGBoost (Extreme Gradient Boosting)
• Classification and regression trees (CART)
• Recurrent Neural Network (RNN)
• Convolutional Neural Network (CNN)
• Yinyang K-Means
• t-Distributed Stochastic Neighbor Embedding (t-SNE)
• Local Regression (LOESS)
• Weighted Alternating Least Squares (WALS), for Collaborative Filtering

Hyperparameters
• Define higher-level model properties, e.g. complexity or learning rate
• Cannot be learned during training → need to be predefined
• Can be decided by
– setting different values
– training different models
– choosing the values that test better
• Hyperparameter examples
– Number of leaves or depth of a tree
– Number of latent factors in a matrix factorization
– Learning rate (in many models)
– Number of hidden layers in a deep neural network
– Number of clusters in k-means clustering
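
The "set values, train, keep what tests better" loop can be sketched as follows. This is only an illustration under stated assumptions: the data is synthetic (generated from y = 2x + 5), the model is a line trained by batch gradient descent, and the learning-rate candidates are arbitrary picks rather than recommended values.

```python
def train_line(xs, ys, learning_rate, epochs=200):
    """Fit y = m*x + c by batch gradient descent on mean squared error."""
    m, c = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        grad_m = sum(2 * (m * x + c - y) * x for x, y in zip(xs, ys)) / n
        grad_c = sum(2 * (m * x + c - y) for x, y in zip(xs, ys)) / n
        m -= learning_rate * grad_m
        c -= learning_rate * grad_c
    return m, c


def mse(model, xs, ys):
    """Mean squared error of a fitted (m, c) model on the given data."""
    m, c = model
    return sum((m * x + c - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Hypothetical data: points on y = 2x + 5 with x in [0, 1)
xs = [i / 20 for i in range(20)]
ys = [2 * x + 5 for x in xs]
train_x, train_y = xs[0::2], ys[0::2]   # even rows for training
valid_x, valid_y = xs[1::2], ys[1::2]   # odd rows for validation

# The learning rate is the hyperparameter: it is fixed before training starts.
candidates = [0.001, 0.01, 0.1, 0.5]
scores = {
    lr: mse(train_line(train_x, train_y, lr), valid_x, valid_y)
    for lr in candidates
}
best_lr = min(scores, key=scores.get)
print(best_lr, scores[best_lr])
```

Each candidate produces a separately trained model, and the value with the lowest validation error is kept.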

Predictive Analytics Pre-requisites

Predictive Analytics Process and Tools

Asking Relevant Questions
• Specific (can you think of a clear answer?)
• Measurable (quantifiable? data driven?)
• Actionable (if you had an answer, could you do something with it?)
• Realistic (can you get an answer with data you have?)
• Timely (answer in reasonable timeframe?)

With that in mind…
• No simple formula for “good questions”, only general guidelines
• The right data is better than lots of data
• Understanding relationships matters

Data Preparation
1. Data analysis (audit for anomalies/errors)
2. Creating an intuitive workflow (formulate seq. of prep operations)
3. Validation (correctness evaluated against sample representative dataset)
4. Transformation (actual prep process takes place)
5. Backflow of cleaned data (replace original dirty data)
Approx. 80% of Data Analyst’s job is Data Preparation!
Example of multiple values used for U.S. states: California, CA, Cal., Cal
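
A cleanup step for that kind of inconsistency might look like the sketch below. The alias table and the `normalize_state` helper are hypothetical; a real mapping would cover every state and be validated against a sample of the data first (step 3 above).

```python
# Hypothetical alias table covering only the variants shown on the slide.
STATE_ALIASES = {
    "california": "CA",
    "ca": "CA",
    "cal.": "CA",
    "cal": "CA",
}


def normalize_state(raw):
    """Map a free-text state value to its two-letter code when it is known."""
    cleaned = raw.strip()
    # Unknown values are returned unchanged so they can be audited later.
    return STATE_ALIASES.get(cleaned.lower(), cleaned)


dirty = ["California", "CA", "Cal.", " Cal ", "Oregon"]
print([normalize_state(value) for value in dirty])
# ['CA', 'CA', 'CA', 'CA', 'Oregon']
```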

Detailed Research and Operational Workflows

Training Set → Learning Algorithm → h (hypothesis/model)
input → h → output
• Ingest / Enrich Data
• Clean / Transform / Filter
• Select / Create New Features
• Evaluate Accuracy / Score

Building Spark ML pipelines
Train: Input DataFrame → Pipeline (Feature transform 1 → Feature transform 2 → Combine features → Linear Regression) → Pipeline Model
Predict: Input DataFrame → Pipeline Model → Output DataFrame
Export Model

Spark ML Pipeline
• fit() is for training
• transform() is for prediction
Train: Input DataFrame (TRAIN) → Pipeline → fit() → Pipeline Model
Predict: Input DataFrame (TEST) → Pipeline Model → transform() → Output DataFrame (PREDICTIONS)

Sample Spark ML Pipeline
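from pyspark.ml import Pipeline
from pyspark.ml.classification import RandomForestClassifier

# Feature stages (their definitions are elided on the slide)
indexer = …
parser = …
hashingTF = …
vecAssembler = …
rf = RandomForestClassifier(numTrees=100)  # classifier is the final stage
pipe = Pipeline(stages=[indexer, parser, hashingTF, vecAssembler, rf])
model = pipe.fit(trainData)  # Train: returns a fitted PipelineModel
results = model.transform(testData)  # Predict on the test DataFrame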
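
The fit()/transform() pattern can be illustrated without a Spark cluster. The sketch below is a toy stand-in written in plain Python, not Spark's API: `MinMaxScaler`, `MeanThreshold`, and `ToyPipeline` are made-up classes, and unlike Spark (where fit() returns a separate PipelineModel) this toy pipeline simply returns itself once fitted.

```python
class MinMaxScaler:
    """Toy stage: learns min/max in fit(), rescales to [0, 1] in transform()."""

    def fit(self, values):
        self.low, self.high = min(values), max(values)
        return self

    def transform(self, values):
        span = (self.high - self.low) or 1.0  # avoid dividing by zero
        return [(v - self.low) / span for v in values]


class MeanThreshold:
    """Toy classifier: predicts 1 when a value exceeds the training mean."""

    def fit(self, values):
        self.cutoff = sum(values) / len(values)
        return self

    def transform(self, values):
        return [1 if v > self.cutoff else 0 for v in values]


class ToyPipeline:
    """Chains stages: fit() trains each stage in order, transform() applies them."""

    def __init__(self, stages):
        self.stages = stages

    def fit(self, data):
        for stage in self.stages:
            # Each stage is trained on the output of the previous one.
            data = stage.fit(data).transform(data)
        return self

    def transform(self, data):
        for stage in self.stages:
            data = stage.transform(data)
        return data


train_data = [0, 10, 20, 30]
test_data = [3, 27]
model = ToyPipeline([MinMaxScaler(), MeanThreshold()]).fit(train_data)
predictions = model.transform(test_data)
print(predictions)  # [0, 1]
```

Everything learned from the training data (the min/max and the cutoff) is reused unchanged at prediction time, which is the point of keeping the stages in one pipeline.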

Exporting ML Models - PMML
• Predictive Model Markup Language (PMML)
• Supported models
– K-Means
– Linear Regression
– Ridge Regression
– Lasso
– SVM
– Binary Logistic Regression

HDCloud

Hortonworks Cloud Solutions
Providers: Microsoft, AWS, Google
Managed: Azure HDInsight
Non-Managed / Marketplace: Hortonworks Data Cloud for AWS
Cloud IaaS: Hortonworks Data Platform (via Ambari and via Cloudbreak)

Zeppelin
Ambari
Spark History Server
Files View

• Zeppelin → Interactive notebook
• Spark
• YARN → Resource Management
• HDFS → Distributed Storage Layer
[Diagram: APIs (Scala, Java, Python, R) over Spark SQL, Spark Streaming, MLlib, and GraphX on the Spark Core Engine, running on YARN across HDFS nodes 1 … N]

Spark and HDP

Labs / Tutorials

Scatter 2D Data Visualized
scatterData ← DataFrame
+-----+--------+
|label|features|
+-----+--------+
|-12.0|  [-4.9]|
| -6.0|  [-4.5]|
| -7.2|  [-4.1]|
| -5.0|  [-3.2]|
| -2.0|  [-3.0]|
| -3.1|  [-2.1]|
| -4.0|  [-1.5]|
| -2.2|  [-1.2]|
| -2.0|  [-0.7]|
|  1.0|  [-0.5]|
| -0.7|  [-0.2]|
...

Linear Regression Model Training (one feature)
Coefficients: 2.81
Intercept: 3.05
Training result: y = 2.81x + 3.05

Linear Regression (two features)
Coefficients: [0.464, 0.464]
Intercept: 0.0563

ML Lab
• Residuals
– The residual of an observed value is the difference between the observed value and the estimated value
• R² (R squared) – Coefficient of Determination
– Indicates goodness of fit
– R² of 1 means the regression line perfectly fits the data
• RMSE (Root Mean Square Error)
– Measure of the differences between the values predicted by a model and the values actually observed
– Good measure of accuracy, but only for comparing the forecasting errors of different models on the same variable, because RMSE is scale-dependent
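
The three quantities above can be computed directly from observed and predicted values. This is a minimal sketch with made-up numbers; `regression_metrics` is an illustrative helper, not a function from the lab notebook.

```python
import math


def regression_metrics(observed, predicted):
    """Return (residuals, R squared, RMSE) for paired observations and predictions."""
    residuals = [o - p for o, p in zip(observed, predicted)]
    ss_res = sum(r * r for r in residuals)            # residual sum of squares
    mean_obs = sum(observed) / len(observed)
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)  # total sum of squares
    r_squared = 1 - ss_res / ss_tot
    rmse = math.sqrt(ss_res / len(observed))
    return residuals, r_squared, rmse


# Hypothetical values for illustration only
observed = [3.0, 5.0, 7.0, 9.0]
predicted = [2.5, 5.5, 7.0, 9.5]
residuals, r_squared, rmse = regression_metrics(observed, predicted)
print(residuals)            # [0.5, -0.5, 0.0, -0.5]
print(round(r_squared, 4))  # 0.9625
print(round(rmse, 4))       # 0.433
```

Because RMSE carries the units of the target variable, the 0.433 here only means something relative to other models predicting the same quantity.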

Demo: Stock Portfolio Simulation using Monte Carlo method
Monte Carlo Simulation
1. Define a domain of possible inputs
2. Randomly generate inputs from a probability distribution over the domain
3. Perform a computation on the inputs
4. Aggregate the results
Example: approximating the value of π after placing 30K random points. Error < 0.07% of actual value.
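
The π example maps onto the four steps as sketched below. The fixed seed and the `estimate_pi` name are choices made for this illustration, and the error of any single run will differ from the figure quoted on the slide.

```python
import math
import random


def estimate_pi(num_points, seed=42):
    """Estimate pi from the share of random points that land in a quarter circle."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    inside = 0
    for _ in range(num_points):
        # Steps 1-2: draw a point uniformly from the unit square
        x, y = rng.random(), rng.random()
        # Step 3: check whether it falls inside the quarter circle of radius 1
        if x * x + y * y <= 1.0:
            inside += 1
    # Step 4: the hit ratio approximates pi / 4
    return 4.0 * inside / num_points


pi_estimate = estimate_pi(30_000)
relative_error = abs(pi_estimate - math.pi) / math.pi
print(pi_estimate, relative_error)
```

The same structure carries over to the stock portfolio demo: only the input distribution and the per-sample computation change.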

Demo: Text Classification with Naïve Bayes

Diabetes Dataset – Decision Trees / Random Forest
Labeled set with 8 Features
-1 1:-0.294118 2:0.487437 3:0.180328 4:-0.292929 5:-1 6:0.00149028 7:-0.53117 8:-0.0333333
+1 1:-0.882353 2:-0.145729 3:0.0819672 4:-0.414141 5:-1 6:-0.207153 7:-0.766866 8:-0.666667
-1 1:-0.0588235 2:0.839196 3:0.0491803 4:-1 5:-1 6:-0.305514 7:-0.492741 8:-0.633333
+1 1:-0.882353 2:-0.105528 3:0.0819672 4:-0.535354 5:-0.777778 6:-0.162444 7:-0.923997 8:-1
-1 1:-1 2:0.376884 3:-0.344262 4:-0.292929 5:-0.602837 6:0.28465 7:0.887276 8:-0.6
+1 1:-0.411765 2:0.165829 3:0.213115 4:-1 5:-1 6:-0.23696 7:-0.894962 8:-0.7
-1 1:-0.647059 2:-0.21608 3:-0.180328 4:-0.353535 5:-0.791962 6:-0.0760059 7:-0.854825 8:-0.833333
...
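
The rows above look like the LIBSVM sparse text format (a label followed by index:value pairs). The slide does not name the format, so that reading is an assumption, and `parse_libsvm_line` is an illustrative helper rather than a library function.

```python
def parse_libsvm_line(line):
    """Split one 'label index:value ...' row into (label, {index: value})."""
    parts = line.split()
    label = float(parts[0])  # "+1" and "-1" both parse as floats
    features = {}
    for item in parts[1:]:
        index, value = item.split(":")
        features[int(index)] = float(value)
    return label, features


# First row of the dataset shown above
sample = ("-1 1:-0.294118 2:0.487437 3:0.180328 4:-0.292929 "
          "5:-1 6:0.00149028 7:-0.53117 8:-0.0333333")
label, features = parse_libsvm_line(sample)
print(label, len(features))  # -1.0 8
```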

TensorFlowOnSpark

Feature Selection

Feature Selection
• Also known as variable or attribute selection
• Why important?
– simplification of models → easier to interpret by researchers/users
– shorter training times
– enhanced generalization by reducing overfitting
• Dimensionality reduction vs feature selection
– Dimensionality reduction: create new combinations of attributes
– Feature selection: include/exclude attributes in data without changing them
Q: Which features should you use to create a predictive model?

Feature Selection
• Methods
– Filter
– Wrapper
– Embedded
Goal: Identify and remove unneeded, irrelevant and redundant features that do not contribute to, or may decrease, the accuracy of a predictive model.
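
As one illustration of the filter approach, features can be ranked by a score computed independently of any model. The sketch below uses absolute Pearson correlation with the target as that score, which is an assumption (the slide does not say which score to use), and the feature values are made up.

```python
import math


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either input is constant."""
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    denom = math.sqrt(sxx * syy)
    return sxy / denom if denom else 0.0


def rank_features(features, target):
    """Return feature names sorted by |correlation with target|, strongest first."""
    scores = {name: abs(pearson(values, target)) for name, values in features.items()}
    return sorted(scores, key=scores.get, reverse=True), scores


# Hypothetical data: 'a' tracks the target exactly, 'b' loosely, 'c' not at all
target = [1, 2, 3, 4, 5]
features = {
    "a": [2, 4, 6, 8, 10],
    "b": [5, 3, 4, 1, 2],
    "c": [1, -1, 0, -1, 1],
}
ranking, scores = rank_features(features, target)
print(ranking)  # ['a', 'b', 'c']
```

A filter like this is cheap but scores each feature in isolation; wrapper and embedded methods instead judge features by their effect on a trained model.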

Feature Selection Traps
• Feature selection is another key part of the applied machine learning process, like model selection. You cannot fire and forget.
• It is important to consider feature selection a part of the model selection process. If you do not, you may inadvertently introduce bias into your models, which can result in overfitting.
• For example, you must include feature selection within the inner loop when you are using accuracy estimation methods such as cross-validation. This means that feature selection is performed on the prepared fold right before the model is trained. A mistake would be to perform feature selection first to prepare your data, then perform model selection and training on the selected features.
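
The inner-loop placement can be sketched as below. This is a simplified illustration under several assumptions: the selection rule (keep the single feature most correlated with the target), the one-feature linear model, the interleaved fold split, and the tiny dataset are all invented for the example.

```python
import math


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either input is constant."""
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    denom = math.sqrt(sxx * syy)
    return sxy / denom if denom else 0.0


def fit_line(xs, ys):
    """Least-squares fit of y = m*x + c on one feature."""
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx if sxx else 0.0
    return m, mean_y - m * mean_x


def cross_validated_mse(X, y, k=3):
    """k-fold CV where the feature is re-selected inside every fold."""
    fold_errors = []
    for fold in range(k):
        test_idx = [i for i in range(len(y)) if i % k == fold]
        train_idx = [i for i in range(len(y)) if i % k != fold]
        y_train = [y[i] for i in train_idx]
        # Selection sees ONLY the training rows of this fold.
        best = max(
            range(len(X[0])),
            key=lambda j: abs(pearson([X[i][j] for i in train_idx], y_train)),
        )
        m, c = fit_line([X[i][best] for i in train_idx], y_train)
        errors = [(m * X[i][best] + c - y[i]) ** 2 for i in test_idx]
        fold_errors.append(sum(errors) / len(errors))
    return sum(fold_errors) / len(fold_errors)


# Hypothetical data: feature 0 determines y, feature 1 is noise
X = [[0, 1], [1, 0], [2, 1], [3, 0], [4, 0], [5, 1]]
y = [2 * row[0] + 1 for row in X]
cv_error = cross_validated_mse(X, y)
print(cv_error)
```

Selecting the feature once on all rows and then cross-validating would let the held-out rows influence the selection, which is the trap described above.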

Feature Selection Checklist
1. Do you have domain knowledge? If yes, construct a better set of “ad hoc” features.
2. Are your features commensurate? If no, consider normalizing them.
3. Do you suspect interdependence of features? If yes, expand your feature set by constructing conjunctive features or products of features, as much as your computer resources allow you.
4. Do you need to prune the input variables (e.g. for cost, speed or data understanding reasons)? If no, construct disjunctive features or weighted sums of features.
5. Do you need to assess features individually (e.g. to understand their influence on the system or because their number is so large that you need to do a first filtering)? If yes, use a variable ranking method; else, do it anyway to get baseline results.
6. Do you need a predictor? If no, stop.
7. Do you suspect your data is “dirty” (has a few meaningless input patterns and/or noisy outputs or wrong class labels)? If yes, detect the outlier examples using the top ranking variables obtained in step 5 as representation; check and/or discard them.
8. Do you know what to try first? If no, use a linear predictor. Use a forward selection method with the “probe” method as a stopping criterion, or use the 0-norm embedded method. For comparison, following the ranking of step 5, construct a sequence of predictors of the same nature using increasing subsets of features. Can you match or improve performance with a smaller subset? If yes, try a non-linear predictor with that subset.
9. Do you have new ideas, time, computational resources, and enough examples? If yes, compare several feature selection methods, including your new idea, correlation coefficients, backward selection and embedded methods. Use linear and non-linear predictors. Select the best approach with model selection.
10. Do you want a stable solution (to improve performance and/or understanding)? If yes, subsample your data and redo your analysis for several “bootstraps”.

AI Investment Landscape

Only $100k investment needed to start with AI

Report from IDC Analyst firm
Spending on AI
• $12.5B in 2017
• $4.5B on apps for threat detection, fraud analysis, public safety, and pharmaceutical research
• $46B+ by 2020

Closing thoughts on AI

The Future of Cognitive Computing / MI
– Machine
• Deep Learning
• Discovery
• Large-scale math
• Fact checking
– Human
• Compassion
• Intuition
• Design
• Value judgements
• Common Sense

What’s new in HDP 2.6 – Spark & Zeppelin
Spark
• Spark 1.6.3 GA
• Spark 2.1 GA
• REST API (Livy) GA
• Spark Thrift Server doAs GA
• SparkSQL – Row/Column Security (GA)
• Spark Streaming + Kafka over SSL
• Multi Cluster HBase support for SHC
• Package support in PySpark & SparkR
Zeppelin 0.7.x
• Spark 2.x support
• Improved Livy integration
• No password in clear text
• JDBC interpreter improvements
• SmartSense integration
• Knox proxy Zeppelin UI

Thanks!
Robert Hryniewicz
@RobertH8z
rhryniewicz@hortonworks.com
