Page 1 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
Data Science with Hadoop
Fall, 2014
Ajay Singh
Director, Technical Alliance
Agenda
•  Data Science
•  Machine Learning – quick overview
•  Data Science with Hadoop
•  Demo
Data Science
What is Data Science?
Data: facts and statistics collected together for reference or analysis.
Science: the intellectual and practical activity encompassing the systematic study of the structure and behavior of the physical and natural world through observation and experiment.
Data Science: the scientific exploration of data to extract meaning or insight, and the construction of software systems to utilize such insight in a business context.
Someone who does this … Data Scientist
Where Can We Use Data Science?
Healthcare
• Predict diagnosis
• Prioritize screenings
• Reduce re-admittance rates
Financial services
• Fraud Detection/prevention
• Predict underwriting risk
• New account risk screens
Public Sector
• Analyze public sentiment
• Optimize resource allocation
• Law enforcement & security
Retail
• Product recommendation
• Inventory management
• Price optimization
Telco/mobile
• Predict customer churn
• Predict equipment failure
• Customer behavior analysis
Oil & Gas
• Predictive maintenance
• Seismic data management
• Predict well production levels
Data Science is an Iterative Activity
Formulate the question
Acquire Data
Clean Data
Visualize, Explore
Hypothesize; Model
Measure/Evaluate
Deploy
Data Science combines proficiencies…
The data science process comprises three main tasks that require different skill types: technical, analytical, and programming. A data scientist needs to be proficient in all these tasks.
Data Exploration: Reporting, Visualization, Data Quality
Pre-processing (Feature Engineering): Raw Transforms, Signal Processing, OCR, Geo-spatial, Normalize, Transform/aggregate, Sample, Dimensionality reduction, Feature Selection, NLP, Mutual Information
Data Modeling:
•  Supervised Learning: Regression, Classification
•  Unsupervised Learning: Frequent Itemset, Anomaly Detection, Clustering
•  Collaborative Filter
Data Science with Big Data…
Very large raw datasets are
now available:
-  Log files
-  Sensor data
-  Sentiment information
With more raw data, we can
build better models with
improved predictive
performance.
To handle the larger datasets
we need a scalable processing
platform like Hadoop and YARN
Data scientists master many skills
Applied Science
•  Statistics, applied math
•  Machine Learning
•  Tools: Python, R, SAS, SPSS
Big data engineering
•  Big data pipeline engineering
•  Statistics and machine learning over large datasets
•  Tools: Hadoop, Pig, Hive, Cascading, Solr, etc.
Business Analysis
•  Data Analysis, BI
•  Business/domain expertise
•  Tools: SQL, Excel, EDW
Data engineering
•  Database technologies
•  Computer science
•  Tools: Java, Scala, Python, C++
Which makes them hard to find…
The Data Science Team
Business Analyst
Data Engineer
Applied Scientist
Machine Learning Overview
What is Machine Learning?
WALL-E was a machine that learned how to feel emotions after
700 years of experiences on Earth collecting human artifacts.
Machine learning is the science of getting
computers to learn from data and act without being
explicitly programmed.
•  Machine learning is about the construction and
study of systems that can learn from data.
•  The core of machine learning deals with
representation and generalization so that the
system will perform well on unseen data
instances and predict unknown events.
•  There is a wide variety of machine learning
tasks and successful applications.
Supervised vs. Unsupervised learning
Data Modeling:
•  Supervised Learning: Regression, Classification
•  Unsupervised Learning: Frequent Itemset, Anomaly Detection, Clustering
•  Collaborative Filter
Supervised learning:
Applications in which the training data is a set of “labeled” examples: input vectors along with their corresponding target variables (labels).
Unsupervised learning:
Applications in which the training data comprises examples of input vectors WITHOUT any corresponding target variables. The goal is to unearth “naturally occurring patterns” in the data, as in clustering.
Collaborative filtering:
(recommendation engines) Uses techniques from both the supervised and unsupervised worlds.
Supervised Learning: learn from examples
Labeled dataset (Feature Matrix):
Patient Age   Tumor Size   Clump Thickness   …   Malignant?
55            5            3                     TRUE
70            4            7                     TRUE
85            4            6                     FALSE
35            2            1                     FALSE
…             …            …                 …   FALSE
…             …            …                 …   TRUE
Test data (Feature Vectors):
Patient Age   Tumor Size   Clump Thickness   …   Malignant?
72            3            3                     ?
66            4            4                     ?
Cancer model: F(k1, k2, k3, k4)
Target function: f(V1, V2, V3, …) = ?
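To make the idea of learning a target function from labeled examples concrete, here is a minimal sketch that uses the four labeled rows shown on this slide. It assumes a 1-nearest-neighbour rule, which is a stand-in chosen for brevity and not the cancer model the slide refers to; the function names are hypothetical.

```python
# Labeled rows from the slide: (patient age, tumor size, clump thickness) -> malignant?
TRAINING = [
    ((55, 5, 3), True),
    ((70, 4, 7), True),
    ((85, 4, 6), False),
    ((35, 2, 1), False),
]


def squared_distance(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def predict(features):
    """Return the label of the closest training row (1-nearest-neighbour)."""
    _, label = min(TRAINING, key=lambda row: squared_distance(row[0], features))
    return label


# Score the two unlabeled feature vectors from the test data.
for vector in [(72, 3, 3), (66, 4, 4)]:
    print(vector, predict(vector))
```

The features are left unscaled here, so age dominates the distance; a real model would normalize features first.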
Classification: predicting a category
Some techniques:
-  Naïve Bayes
-  Decision Tree
-  Logistic Regression
-  SGD
-  Support Vector Machines
-  Neural Network
-  Ensembles
Regression: predict a continuous value
Some techniques:
-  Linear Regression / GLM
-  Decision Trees
-  Support vector regression
-  SGD
-  Ensembles
Example: Ad Click-Through Rates in Ad Search
Rank = bid * CTR
Predict CTR for each ad
to determine placement,
based on:
-  Historical CTR
-  Keyword match
-  Etc…
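The ranking rule on this slide can be sketched in a few lines. The `Ad` class, the ad names, the bids and the predicted click-through rates below are all hypothetical; in practice the CTR would come from a regression model trained on features such as historical CTR and keyword match.

```python
from dataclasses import dataclass


@dataclass
class Ad:
    name: str
    bid: float            # advertiser's bid per click
    predicted_ctr: float  # model output, between 0 and 1


def rank_ads(ads):
    """Order ads by rank = bid * CTR, highest first."""
    return sorted(ads, key=lambda ad: ad.bid * ad.predicted_ctr, reverse=True)


ads = [Ad("A", 2.00, 0.01), Ad("B", 0.50, 0.08), Ad("C", 1.00, 0.03)]
ordered = [ad.name for ad in rank_ads(ads)]
print(ordered)
```

Note that the highest bidder ("A") ends up last because its predicted CTR is low.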
Unsupervised Learning: detect natural patterns
Age   State   Annual Income   Marital status
25    CA      $80,000         M
45    NY      $150,000        D
55    WA      $100,500        M
18    TX      $85,000         S
…     …       …               …
No labels → Model → Naturally occurring (hidden) structure
Clustering: detect similar instance groupings
Some techniques:
-  k-means
-  Spectral clustering
-  DBSCAN
-  Hierarchical clustering
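As an illustration of the first technique in the list, here is a minimal k-means sketch (Lloyd's algorithm) in plain Python. The sample points reuse the age and annual income (in thousands of dollars) from the unlabeled table earlier in the deck; the function name, iteration count and seed are assumptions made for this example.

```python
import random


def kmeans(points, k, iterations=20, seed=0):
    """Lloyd's algorithm: alternate point assignment and centroid update."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            # Assign each point to its nearest centroid (squared Euclidean distance).
            nearest = min(
                range(k),
                key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])),
            )
            clusters[nearest].append(p)
        # Move each centroid to the mean of its cluster; keep it if the cluster is empty.
        centroids = [
            tuple(sum(dim) / len(c) for dim in zip(*c)) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters


# (age, annual income in $1000s)
points = [(25, 80.0), (45, 150.0), (55, 100.5), (18, 85.0)]
centroids, clusters = kmeans(points, k=2)
print(centroids)
```

With real data the features would be normalized first so that income does not dominate the distance.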
Example: market segmentation
Outlier Detection: identify abnormal patterns
Example: identify engine anomalies
Features:
-  Heat generated
-  Vibration of engine
Outlier Detection Target Function: outlier factor
Outlier factor (OF): 0…1
ID    Total$   Age   City   OF
101   $200     25    SF     0.1
102   $350     35    LA     0.05
103   $25      15    LA     0.2
…     …        …     …      0.1
(OF values shown for further rows: 0.9, 0.2, 0.15, 0.1)
Some techniques:
-  Statistical techniques
-  Local outlier factor
-  One-class SVM
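The simplest of the statistical techniques is a z-score test, sketched below. This is an assumption made for illustration: it produces an unbounded score rather than the 0…1 outlier factor shown on the slide, and the transaction totals and the threshold of 2.0 are hypothetical.

```python
from statistics import mean, pstdev


def z_scores(values):
    """Absolute distance of each value from the mean, in standard deviations."""
    mu, sigma = mean(values), pstdev(values)
    return [abs(v - mu) / sigma if sigma else 0.0 for v in values]


amounts = [200, 350, 25, 180, 220, 5000]  # hypothetical transaction totals
scores = z_scores(amounts)

# Flag values that lie more than two standard deviations from the mean.
outliers = [a for a, z in zip(amounts, scores) if z > 2.0]
print(outliers)
```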
Example: Credit Card Fraud Detection
Affinity Analysis: identifying frequent item sets
Transactions (Tx 1, Tx 2, Tx 3, Tx 4, Tx 5, …) against items (Item1, Item2, Item3, Item4, Item5, …); Y marks an item that appears in a transaction:
Y N N Y N
Y N N Y N
Y Y N Y N
N N Y Y Y
Goal: identify frequent item sets
Techniques: FP-Growth, Apriori
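What "frequent item set" means can be shown with a brute-force count. This sketch enumerates every item combination of a given size, which is neither FP-Growth nor Apriori (both exist to avoid this exhaustive enumeration on large data); the baskets and the support threshold are hypothetical.

```python
from collections import Counter
from itertools import combinations


def frequent_itemsets(transactions, size, min_support):
    """Return itemsets of the given size that occur in at least min_support of transactions."""
    counts = Counter()
    for basket in transactions:
        for itemset in combinations(sorted(set(basket)), size):
            counts[itemset] += 1
    n = len(transactions)
    return {items: c / n for items, c in counts.items() if c / n >= min_support}


baskets = [  # hypothetical transactions
    {"bread", "milk"},
    {"bread", "milk", "eggs"},
    {"bread", "eggs"},
    {"milk", "eggs", "bread"},
]
pairs = frequent_itemsets(baskets, size=2, min_support=0.75)
print(pairs)
```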
Example: Affinity Analysis
Use affinity analysis for:
-  Store layout design
-  Coupons
Product recommendation: predicting “preference”
Collaborative Filtering
Identify users with similar “taste”
Collaborative filtering -> matrix completion
Observed ratings (? = not yet rated):
User   Harry Potter   X-Men   Hobbit   Argo   Pirates
101    5              2       4        ?      ?
102    ?              ?       5        2      ?
103    1              2       ?        ?      3
104, 105, …
Completed ratings:
User   Harry Potter   X-Men   Hobbit   Argo   Pirates
101    5              2       4        1      3
102    4              1       5        2      3
103    1              2       4        1      3
104, 105, …
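One way to fill a missing cell is user-based collaborative filtering, sketched below on the three incomplete rows from this slide. It assumes the rows belong to users 101 to 103 and the columns follow the order of the movie list. The weighting scheme (cosine similarity over co-rated movies) is an assumption for illustration and is not expected to reproduce the completed values shown on the slide.

```python
from math import sqrt

# Columns: Harry Potter, X-Men, Hobbit, Argo, Pirates. None = not yet rated.
ratings = {
    101: [5, 2, 4, None, None],
    102: [None, None, 5, 2, None],
    103: [1, 2, None, None, 3],
}


def similarity(a, b):
    """Cosine similarity computed over the movies both users have rated."""
    common = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not common:
        return 0.0
    dot = sum(x * y for x, y in common)
    norm = sqrt(sum(x * x for x, _ in common)) * sqrt(sum(y * y for _, y in common))
    return dot / norm


def predict(user, item):
    """Similarity-weighted average of other users' ratings for the item."""
    weighted = total = 0.0
    for other, row in ratings.items():
        if other == user or row[item] is None:
            continue
        weight = similarity(ratings[user], row)
        weighted += weight * row[item]
        total += weight
    return weighted / total if total else None


print(predict(101, 3))  # user 101's predicted rating for Argo
```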
Example: Netflix
Hadoop and Data Science
Hadoop Improves Data Scientist Productivity
• Data Lake: all the data in one place
– Ability to store ALL the data in raw format
– Data silo convergence
• Data/compute capabilities available as shared asset
– Data scientists can quickly prototype a new idea without an up-front request for funding
– YARN enables multiple processing applications
“Schema on read” Accelerates Data Innovation
“Schema change” project (Start, 6 months, 9 months):
-  “I need new data”
-  “Finally, we start collecting”
-  “Let me see… is it any good?”
Schema on read (3 months):
-  “Let’s just put it in a folder on HDFS”
-  “Let me see… is it any good?”
-  “My model is awesome!”
Hadoop is ideal for pre-processing
Pre-processing (Feature Engineering): Raw Transforms, Signal Processing, OCR, Geo-spatial, Normalize, Transform/aggregate, Sample, Dimensionality reduction, Feature Selection, NLP, Mutual Information
Data Modeling: Supervised Learning (Regression, Classification), Unsupervised Learning (Frequent Itemset, Anomaly Detection, Clustering), Collaborative Filter
Build a better feature matrix
-  More/new features
-  More instances
-  Faster and at larger scale
Training a Supervised Learning model with Hadoop
•  Typically the “training set” is not that large
–  In this case, it’s very common to train on a high-memory node
–  Using existing tools: R, Python scikit-learn, or SAS
•  For really large training sets that don’t fit in memory
–  SAS
–  Spark MLlib is a promising (albeit new) solution
–  Mahout is workable in some cases (but its future is unclear)
•  Hadoop is also useful in parameter tuning:
–  Grid search: optimizing the model’s parameters
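Grid search lends itself to a cluster because every parameter combination can be evaluated independently. The sketch below shows the sequential core of the idea; the parameter names and the `validation_score` function are hypothetical stand-ins for "train on the training set, then score on the validation set".

```python
from itertools import product


def grid_search(param_grid, score):
    """Evaluate every parameter combination and return the best one with its score."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        current = score(params)
        if current > best_score:
            best_params, best_score = params, current
    return best_params, best_score


def validation_score(params):
    """Hypothetical score that peaks at depth=4, learning_rate=0.1."""
    return -((params["depth"] - 4) ** 2) - (params["learning_rate"] - 0.1) ** 2


grid = {"depth": [2, 4, 8], "learning_rate": [0.01, 0.1, 1.0]}
best, best_value = grid_search(grid, validation_score)
print(best)
```

On Hadoop, each call to the scoring function could run as a separate task, with only the scores collected at the end.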
Scoring a Supervised Learning Model with Hadoop
•  Scoring of a single instance is usually fast
•  Some use cases require frequent batch re-scoring of a large population (e.g., 20M customers):
-  Use a PMML scoring engine (e.g., Zementis, Pattern)
-  Custom implementation with Python, R, Java, etc.
Unsupervised learning with Hadoop
•  Clustering:
–  Many clustering algorithms are parallelizable
–  Distributed k-means is popular and available in Spark MLlib and Mahout
•  Collaborative Filtering:
–  Alternating Least Squares (ALS) is very parallelizable
–  ALS is implemented in Mahout, Spark MLlib, and others
–  Item-based or user-based collaborative filtering is available in Mahout
Deployment Considerations: Hadoop and Spark
•  User runs a Spark (or MLlib) job directly from the edge node
•  Scala API or Java API
•  Python API is also good
•  Spark runs directly as a YARN job
•  No need to install anything else
[Diagram: Spark and MLlib on the edge node submit jobs to Spark processes running on YARN]
Deployment Considerations: Hadoop and R
•  R and relevant packages installed on each node
•  User runs R on a high-memory node
•  RStudio or RStudio Server
•  RCloud
•  Interfaces to Hadoop
•  RMR: run map-reduce with R
•  RHDFS: access HDFS files from R
•  RHive: run Hive queries from R
•  RHBase: HBase from R
•  RODBC
[Diagram: a high-memory node running R (RStudio, RCloud, RHadoop, RHive) alongside R installed on the nodes of the YARN cluster]
Deployment Considerations: Hadoop and Python
•  Python and relevant packages installed on each node and on high-memory nodes
•  User runs Python on a high-memory node
•  IPython notebook is a great UI
•  Interfaces to Hadoop
•  PyDoop: access HDFS from Python
•  Map-reduce jobs with Hadoop streaming
•  Python UDFs with Pig
[Diagram: a high-memory node running Python (IPython, Pandas, Scikit-learn, NumPy, SciPy, Matplotlib, PyDoop) alongside Python, Scikit-learn and Pandas installed on the nodes of the YARN cluster]
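Hadoop streaming runs any executable that reads records from standard input and writes tab-separated key/value lines to standard output, which is why plain Python scripts work as mappers and reducers. The mapper below is a minimal sketch; the two-column `origin,delay` input layout is a hypothetical example, not a format taken from the slides.

```python
import sys


def map_lines(lines):
    """Emit 'origin<TAB>1' for each CSV record; a reducer would sum the counts per key."""
    for line in lines:
        fields = line.rstrip("\n").split(",")
        # Skip malformed rows and the header line.
        if len(fields) < 2 or fields[0] == "origin":
            continue
        yield f"{fields[0]}\t1"


if __name__ == "__main__":
    # Under Hadoop streaming, each mapper task receives its input split on stdin.
    for record in map_lines(sys.stdin):
        print(record)
```

Keeping the logic in a generator function makes the mapper testable locally before it is submitted to the cluster.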
Supervised Learning with Hadoop
More details + demo
Supervised Learning Workflow
Training: Raw Data (Train) + Labels → Feature Extraction → Feature Matrix → Train the Model → Model → Eval Model
Predicting: New Data → Feature Extraction → Feature Vector → Model → Predict → Labels
Close up: Feature Extraction
Raw Data (TB, PB) → Feature Extraction → Feature Matrix (MB, GB)
ID    Total$   Age   City   Target
101   200      25    SF
102   350      35    LA
103   25       15    LA
…     …        …     …
Feature Engineering: Raw Transforms, Signal Processing, OCR, Geo-spatial, Normalize, Transform/aggregate, Sample, Dimensionality reduction, Feature Selection, NLP, Mutual Information
How Big is your Feature Matrix?
Example:
•  10M rows, 100 features
•  Each feature = 8 bytes (double)
•  Total memory = ~7.5GB
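The arithmetic behind this estimate is easy to check. The sketch assumes a dense matrix of doubles with no per-row overhead and reports the result in GiB (2^30 bytes), which is where the figure of roughly 7.5 comes from.

```python
rows = 10_000_000
features = 100
bytes_per_value = 8  # one IEEE 754 double

total_bytes = rows * features * bytes_per_value
total_gib = total_bytes / 2**30

print(f"{total_bytes:,} bytes = {total_gib:.2f} GiB")
```

A matrix of this size fits on a single high-memory node, which is why training outside the cluster is often practical.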
Close-Up: Training the Model
Training Set → Train the Model → Model → Eval Model (on the Validation Set) → Metric
-  Feature matrix is randomly split into a “training” set (70%) and a “validation” set (30%)
-  Model is built using the training set, and the error measure is computed over the validation set
-  Iterative process or grid search to determine the best algorithm and choice of parameters so that:
-  We get optimal model accuracy
-  We prevent over-fitting
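The 70/30 random split described above can be sketched with the standard library alone. The function name, the fixed seed and the toy rows are assumptions for this example; libraries such as scikit-learn provide an equivalent helper.

```python
import random


def train_validation_split(rows, train_fraction=0.7, seed=42):
    """Shuffle a copy of the rows and cut it into training and validation sets."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


rows = list(range(100))  # stand-in for the rows of a feature matrix
train, validation = train_validation_split(rows)
print(len(train), len(validation))
```

Fixing the seed keeps the split reproducible, so that different algorithms and parameter choices are compared on the same validation set.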
Evaluating Performance of a Classifier
•  Determine the “confusion matrix”
•  Compute metrics: precision, recall, accuracy and specificity
Confusion Matrix
                Actual Yes        Actual No
Predicted Yes   True positives    False positives
Predicted No    False negatives   True negatives
From the confusion matrix, we can compute these metrics:
Precision = % of positive predictions that are correct
Recall = % of positive instances that were predicted as positive
Specificity = % of negative instances that were predicted as negative
F1 score = a measure of a test’s accuracy, combining precision and recall
Accuracy = % of correct classifications
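These metrics follow directly from the four cells of the confusion matrix. The sketch below uses the standard definitions, with F1 as the harmonic mean of precision and recall; the counts passed in are hypothetical, and the function does not guard against empty rows or columns (division by zero).

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute standard classifier metrics from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "precision": precision,
        "recall": recall,
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "f1": 2 * precision * recall / (precision + recall),
    }


# Hypothetical counts: 40 true positives, 10 false positives,
# 20 false negatives, 30 true negatives.
metrics = classification_metrics(tp=40, fp=10, fn=20, tn=30)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```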
Demo overview
•  Datasets:
– Airline delay data (we’re using only the years 2007 and 2008): http://stat-computing.org/dataexpo/2009/the-data.html
– Weather data from http://ncdc.noaa.gov/
•  Goal:
– Predict delay (delayTime >= 15 mins) in flights
– For simplicity, limited to flights originating from ORD
•  Tools:
– Pre-process: Pig or Spark on Hadoop
– Modeling: scikit-learn, Spark MLlib, or R
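The labeling step of the demo can be sketched as follows. This is not the demo's own code: it assumes the airline CSV exposes `Origin` and `DepDelay` columns, that departure delay is the delay being predicted, and that missing delays are written as `NA`. The sample rows are made up for illustration.

```python
import csv
import io

# Hypothetical sample in the assumed column layout.
SAMPLE = """Year,Month,Origin,Dest,DepDelay
2007,1,ORD,SFO,20
2007,1,ORD,JFK,3
2007,1,SFO,ORD,45
2007,1,ORD,LAX,NA
"""


def label_ord_flights(csv_text, threshold=15):
    """Keep flights departing ORD and add a boolean 'delayed' label (delay >= threshold)."""
    rows = []
    for record in csv.DictReader(io.StringIO(csv_text)):
        if record["Origin"] != "ORD" or record["DepDelay"] == "NA":
            continue
        rows.append({**record, "delayed": int(record["DepDelay"]) >= threshold})
    return rows


labeled = label_ord_flights(SAMPLE)
print([(r["Dest"], r["delayed"]) for r in labeled])
```

At full scale the same filter and label would be expressed in Pig or Spark, as the tools list suggests.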
Demo Flow
Training: Raw Data (Train): Airline, Weather (2007) + Labels → Feature Extraction → ORD_2007 → Train the Model → Model
Prediction: Raw Data (Test): Airline, Weather (2008) → Feature Extraction → ORD_2008 → Predict / Score (Model) → Labels
DEMO now!
Q&A, Open discussion

More Related Content

What's hot

Introduction of Data Science
Introduction of Data ScienceIntroduction of Data Science
Introduction of Data ScienceJason Geng
 
Data science presentation
Data science presentationData science presentation
Data science presentationMSDEVMTL
 
Data science presentation 2nd CI day
Data science presentation 2nd CI dayData science presentation 2nd CI day
Data science presentation 2nd CI dayMohammed Barakat
 
Data science & data scientist
Data science & data scientistData science & data scientist
Data science & data scientistVijayMohan Vasu
 
Python For Data Analysis | Python Pandas Tutorial | Learn Python | Python Tra...
Python For Data Analysis | Python Pandas Tutorial | Learn Python | Python Tra...Python For Data Analysis | Python Pandas Tutorial | Learn Python | Python Tra...
Python For Data Analysis | Python Pandas Tutorial | Learn Python | Python Tra...Edureka!
 
What Is Data Science? | Introduction to Data Science | Data Science For Begin...
What Is Data Science? | Introduction to Data Science | Data Science For Begin...What Is Data Science? | Introduction to Data Science | Data Science For Begin...
What Is Data Science? | Introduction to Data Science | Data Science For Begin...Simplilearn
 
Big data and data science overview
Big data and data science overviewBig data and data science overview
Big data and data science overviewColleen Farrelly
 
Introduction of data science
Introduction of data scienceIntroduction of data science
Introduction of data scienceTanujaSomvanshi1
 
Data Science With Python | Python For Data Science | Python Data Science Cour...
Data Science With Python | Python For Data Science | Python Data Science Cour...Data Science With Python | Python For Data Science | Python Data Science Cour...
Data Science With Python | Python For Data Science | Python Data Science Cour...Simplilearn
 
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data ScienceEdureka!
 
The Future of Data Science
The Future of Data ScienceThe Future of Data Science
The Future of Data ScienceDataWorks Summit
 
1. Data Analytics-introduction
1. Data Analytics-introduction1. Data Analytics-introduction
1. Data Analytics-introductionkrishna singh
 
Data Science With Python
Data Science With PythonData Science With Python
Data Science With PythonMosky Liu
 
Machine Learning
Machine LearningMachine Learning
Machine LearningKumar P
 
Big Data Analytics | What Is Big Data Analytics? | Big Data Analytics For Beg...
Big Data Analytics | What Is Big Data Analytics? | Big Data Analytics For Beg...Big Data Analytics | What Is Big Data Analytics? | Big Data Analytics For Beg...
Big Data Analytics | What Is Big Data Analytics? | Big Data Analytics For Beg...Simplilearn
 
Introduction to data science.pptx
Introduction to data science.pptxIntroduction to data science.pptx
Introduction to data science.pptxSadhanaParameswaran
 
Big Data Analytics in Government
Big Data Analytics in GovernmentBig Data Analytics in Government
Big Data Analytics in GovernmentDeepak Ramanathan
 

What's hot (20)

Introduction to Data Analytics
Introduction to Data AnalyticsIntroduction to Data Analytics
Introduction to Data Analytics
 
Introduction of Data Science
Introduction of Data ScienceIntroduction of Data Science
Introduction of Data Science
 
Data science presentation
Data science presentationData science presentation
Data science presentation
 
Data science presentation 2nd CI day
Data science presentation 2nd CI dayData science presentation 2nd CI day
Data science presentation 2nd CI day
 
Data science & data scientist
Data science & data scientistData science & data scientist
Data science & data scientist
 
Python For Data Analysis | Python Pandas Tutorial | Learn Python | Python Tra...
Python For Data Analysis | Python Pandas Tutorial | Learn Python | Python Tra...Python For Data Analysis | Python Pandas Tutorial | Learn Python | Python Tra...
Python For Data Analysis | Python Pandas Tutorial | Learn Python | Python Tra...
 
What Is Data Science? | Introduction to Data Science | Data Science For Begin...
What Is Data Science? | Introduction to Data Science | Data Science For Begin...What Is Data Science? | Introduction to Data Science | Data Science For Begin...
What Is Data Science? | Introduction to Data Science | Data Science For Begin...
 
Big data and data science overview
Big data and data science overviewBig data and data science overview
Big data and data science overview
 
Machine learning
Machine learning Machine learning
Machine learning
 
Introduction of data science
Introduction of data scienceIntroduction of data science
Introduction of data science
 
Data Science With Python | Python For Data Science | Python Data Science Cour...
Data Science With Python | Python For Data Science | Python Data Science Cour...Data Science With Python | Python For Data Science | Python Data Science Cour...
Data Science With Python | Python For Data Science | Python Data Science Cour...
 
Big data ppt
Big data pptBig data ppt
Big data ppt
 
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data Science
 
The Future of Data Science
The Future of Data ScienceThe Future of Data Science
The Future of Data Science
 
1. Data Analytics-introduction
1. Data Analytics-introduction1. Data Analytics-introduction
1. Data Analytics-introduction
 
Data Science With Python
Data Science With PythonData Science With Python
Data Science With Python
 
Machine Learning
Machine LearningMachine Learning
Machine Learning
 
Big Data Analytics | What Is Big Data Analytics? | Big Data Analytics For Beg...
Big Data Analytics | What Is Big Data Analytics? | Big Data Analytics For Beg...Big Data Analytics | What Is Big Data Analytics? | Big Data Analytics For Beg...
Big Data Analytics | What Is Big Data Analytics? | Big Data Analytics For Beg...
 
Introduction to data science.pptx
Introduction to data science.pptxIntroduction to data science.pptx
Introduction to data science.pptx
 
Big Data Analytics in Government
Big Data Analytics in GovernmentBig Data Analytics in Government
Big Data Analytics in Government
 

Viewers also liked

L2. Evaluating Machine Learning Algorithms I
L2. Evaluating Machine Learning Algorithms IL2. Evaluating Machine Learning Algorithms I
L2. Evaluating Machine Learning Algorithms IMachine Learning Valencia
 
Airline flights delay prediction- 2014 Spring Data Mining Project
Airline flights delay prediction- 2014 Spring Data Mining ProjectAirline flights delay prediction- 2014 Spring Data Mining Project
Airline flights delay prediction- 2014 Spring Data Mining ProjectHaozhe Wang
 
Big Data Use Cases
Big Data Use CasesBig Data Use Cases
Big Data Use Casesboorad
 
Multi-faceted Classification of Big Data Use Cases and Proposed Architecture ...
Multi-faceted Classification of Big Data Use Cases and Proposed Architecture ...Multi-faceted Classification of Big Data Use Cases and Proposed Architecture ...
Multi-faceted Classification of Big Data Use Cases and Proposed Architecture ...Geoffrey Fox
 
Machine Learning: Applications, Process and Techniques
Machine Learning: Applications, Process and TechniquesMachine Learning: Applications, Process and Techniques
Machine Learning: Applications, Process and TechniquesRui Pedro Paiva
 
From Digital Analytics to Insight
From Digital Analytics to InsightFrom Digital Analytics to Insight
From Digital Analytics to InsightPithan Rojanawong
 
GPU Accelerated Natural Language Processing by Guillermo Molini
GPU Accelerated Natural Language Processing by Guillermo MoliniGPU Accelerated Natural Language Processing by Guillermo Molini
GPU Accelerated Natural Language Processing by Guillermo MoliniBig Data Spain
 
Natural Language Processing with Python
Natural Language Processing with PythonNatural Language Processing with Python
Natural Language Processing with PythonBenjamin Bengfort
 
The New Era of Cognitive Computing
The New Era of Cognitive ComputingThe New Era of Cognitive Computing
The New Era of Cognitive ComputingIBM Research
 
Sentiment Analysis in Twitter
Sentiment Analysis in TwitterSentiment Analysis in Twitter
Sentiment Analysis in TwitterAyushi Dalmia
 
Introduction to Machine Learning
Introduction to Machine LearningIntroduction to Machine Learning
Introduction to Machine LearningLior Rokach
 
Data analytics telecom churn final ppt
Data analytics telecom churn final ppt Data analytics telecom churn final ppt
Data analytics telecom churn final ppt Gunvansh Khanna
 
Social bots Présentation Générale
Social bots   Présentation Générale Social bots   Présentation Générale
Social bots Présentation Générale Social Bots
 
DAMA Webinar - Big and Little Data Quality
DAMA Webinar - Big and Little Data QualityDAMA Webinar - Big and Little Data Quality
DAMA Webinar - Big and Little Data QualityDATAVERSITY
 
Analytics Trends 2016: The next evolution
Analytics Trends 2016: The next evolutionAnalytics Trends 2016: The next evolution
Analytics Trends 2016: The next evolutionDeloitte United States
 
Booz Allen Field Guide to Data Science
Booz Allen Field Guide to Data Science Booz Allen Field Guide to Data Science
Booz Allen Field Guide to Data Science Booz Allen Hamilton
 
Working With Big Data
Working With Big DataWorking With Big Data
Working With Big DataSeth Familian
 

Viewers also liked (20)

L2. Evaluating Machine Learning Algorithms I
L2. Evaluating Machine Learning Algorithms IL2. Evaluating Machine Learning Algorithms I
L2. Evaluating Machine Learning Algorithms I
 
Airline flights delay prediction- 2014 Spring Data Mining Project
Airline flights delay prediction- 2014 Spring Data Mining ProjectAirline flights delay prediction- 2014 Spring Data Mining Project
Airline flights delay prediction- 2014 Spring Data Mining Project
 
BIG DATA and USE CASES
BIG DATA and USE CASESBIG DATA and USE CASES
BIG DATA and USE CASES
 
Big Data Use Cases
Big Data Use CasesBig Data Use Cases
Big Data Use Cases
 
Multi-faceted Classification of Big Data Use Cases and Proposed Architecture ...
Multi-faceted Classification of Big Data Use Cases and Proposed Architecture ...Multi-faceted Classification of Big Data Use Cases and Proposed Architecture ...
Multi-faceted Classification of Big Data Use Cases and Proposed Architecture ...
 
Machine Learning: Applications, Process and Techniques
Machine Learning: Applications, Process and TechniquesMachine Learning: Applications, Process and Techniques
Machine Learning: Applications, Process and Techniques
 
From Digital Analytics to Insight
From Digital Analytics to InsightFrom Digital Analytics to Insight
From Digital Analytics to Insight
 
Cognitive computing 2016
Cognitive computing 2016Cognitive computing 2016
Cognitive computing 2016
 
GPU Accelerated Natural Language Processing by Guillermo Molini
GPU Accelerated Natural Language Processing by Guillermo MoliniGPU Accelerated Natural Language Processing by Guillermo Molini
GPU Accelerated Natural Language Processing by Guillermo Molini
 
Natural Language Processing with Python
Natural Language Processing with PythonNatural Language Processing with Python
Natural Language Processing with Python
 
The New Era of Cognitive Computing
The New Era of Cognitive ComputingThe New Era of Cognitive Computing
The New Era of Cognitive Computing
 
Sentiment Analysis in Twitter
Sentiment Analysis in TwitterSentiment Analysis in Twitter
Sentiment Analysis in Twitter
 
Introduction to Machine Learning
Introduction to Machine LearningIntroduction to Machine Learning
Introduction to Machine Learning
 
Data analytics telecom churn final ppt
Data analytics telecom churn final ppt Data analytics telecom churn final ppt
Data analytics telecom churn final ppt
 
Churn Predictive Modelling
Churn Predictive ModellingChurn Predictive Modelling
Churn Predictive Modelling
 
Social bots Présentation Générale
Social bots   Présentation Générale Social bots   Présentation Générale
Social bots Présentation Générale
 
DAMA Webinar - Big and Little Data Quality
DAMA Webinar - Big and Little Data QualityDAMA Webinar - Big and Little Data Quality
DAMA Webinar - Big and Little Data Quality
 
Analytics Trends 2016: The next evolution
Analytics Trends 2016: The next evolutionAnalytics Trends 2016: The next evolution
Analytics Trends 2016: The next evolution
 
Booz Allen Field Guide to Data Science
Booz Allen Field Guide to Data Science Booz Allen Field Guide to Data Science
Booz Allen Field Guide to Data Science
 
Working With Big Data
Working With Big DataWorking With Big Data
Working With Big Data
 

Similar to Data science workshop

Enterprise Data Science at Scale
Enterprise Data Science at ScaleEnterprise Data Science at Scale
Enterprise Data Science at ScaleArtem Ervits
 
Spark-Zeppelin-ML on HWX
Spark-Zeppelin-ML on HWXSpark-Zeppelin-ML on HWX
Spark-Zeppelin-ML on HWXKirk Haslbeck
 
Splunk-hortonworks-risk-management-oct-2014
Splunk-hortonworks-risk-management-oct-2014Splunk-hortonworks-risk-management-oct-2014
Splunk-hortonworks-risk-management-oct-2014Hortonworks
 
Enterprise Data Science at Scale @ Princeton, NJ 14-Nov-2017
Enterprise Data Science at Scale @ Princeton, NJ 14-Nov-2017Enterprise Data Science at Scale @ Princeton, NJ 14-Nov-2017
Enterprise Data Science at Scale @ Princeton, NJ 14-Nov-2017Timothy Spann
 
Enterprise data science at scale
Enterprise data science at scaleEnterprise data science at scale
Enterprise data science at scaleCarolyn Duby
 
Extending BI with Big Data Analytics
Extending BI with Big Data AnalyticsExtending BI with Big Data Analytics
Extending BI with Big Data AnalyticsDatameer
 
How to Become an Analytics Ready Insurer - with Informatica and Hortonworks
How to Become an Analytics Ready Insurer - with Informatica and HortonworksHow to Become an Analytics Ready Insurer - with Informatica and Hortonworks
How to Become an Analytics Ready Insurer - with Informatica and HortonworksHortonworks
 
How Can Analytics Improve Business?
How Can Analytics Improve Business?How Can Analytics Improve Business?
How Can Analytics Improve Business?Inside Analysis
 
Building Data Science Teams
Building Data Science TeamsBuilding Data Science Teams
Building Data Science TeamsEMC
 
Overcoming the AI hype — and what enterprises should really focus on
Overcoming the AI hype — and what enterprises should really focus onOvercoming the AI hype — and what enterprises should really focus on
Overcoming the AI hype — and what enterprises should really focus onDataWorks Summit
 
A #Pink14 Presentation: Optimizing for the #SDDC
A #Pink14 Presentation: Optimizing for the #SDDCA #Pink14 Presentation: Optimizing for the #SDDC
A #Pink14 Presentation: Optimizing for the #SDDCTeamQuest Corporation
 
The Data Lake: Empowering Your Data Science Team
The Data Lake: Empowering Your Data Science TeamThe Data Lake: Empowering Your Data Science Team
The Data Lake: Empowering Your Data Science TeamSenturus
 
Agile data science
Agile data scienceAgile data science
Agile data scienceJoel Horwitz
 
Credit fraud prevention on hwx stack
Credit fraud prevention on hwx stackCredit fraud prevention on hwx stack
Credit fraud prevention on hwx stackKirk Haslbeck
 
Analytics: The Next Killer App for Optimizing IT? #GartnerIOM
Analytics: The Next Killer App for Optimizing IT? #GartnerIOMAnalytics: The Next Killer App for Optimizing IT? #GartnerIOM
Analytics: The Next Killer App for Optimizing IT? #GartnerIOMTeamQuest Corporation
 

Data science workshop

  • 1. Page 1 © Hortonworks Inc. 2011 – 2014. All Rights Reserved Data Science with Hadoop Fall, 2014 Ajay Singh Director, Technical Alliance
  • 2. Agenda: Data Science; Machine Learning (quick overview); Data Science with Hadoop; Demo
  • 3. Data Science
  • 4. What is Data Science? Data: facts and statistics collected together for reference or analysis. Science: the intellectual and practical activity encompassing the systematic study of the structure and behavior of the physical and natural world through observation and experiment. Data Science: the scientific exploration of data to extract meaning or insight, and the construction of software systems to utilize such insight in a business context. Data Scientist: someone who does this.
  • 5. Where Can We Use Data Science? Healthcare: predict diagnosis, prioritize screenings, reduce re-admittance rates. Financial services: fraud detection/prevention, predict underwriting risk, new account risk screens. Public sector: analyze public sentiment, optimize resource allocation, law enforcement & security. Retail: product recommendation, inventory management, price optimization. Telco/mobile: predict customer churn, predict equipment failure, customer behavior analysis. Oil & Gas: predictive maintenance, seismic data management, predict well production levels.
  • 6. Data Science is an Iterative Activity. Stages: formulate the question; acquire data; clean data; visualize and explore; hypothesize and model; measure/evaluate; deploy.
  • 7. Data Science combines proficiencies… The data science process consists of three main tasks requiring different skill types, including technical, analytical and programming: data exploration (reporting, visualization, data quality); pre-processing, covering raw transforms and feature engineering (signal processing, OCR, geo-spatial, normalize, transform/aggregate, sample, dimensionality reduction, feature selection, NLP, mutual information); and data modeling, covering supervised and unsupervised learning (frequent itemset, anomaly detection, clustering, collaborative filter, regression, classification). A data scientist needs to be proficient in all these tasks.
  • 8. Data Science with Big Data… Very large raw datasets are now available: log files, sensor data, sentiment information. With more raw data, we can build better models with improved predictive performance. To handle the larger datasets we need a scalable processing platform like Hadoop and YARN.
  • 9. Data scientists master many skills. Applied science: statistics, applied math; machine learning; tools: Python, R, SAS, SPSS. Big data engineering: big data pipeline engineering; statistics and machine learning over large datasets; tools: Hadoop, Pig, Hive, Cascading, Solr, etc. Business analysis: data analysis, BI; business/domain expertise; tools: SQL, Excel, EDW. Data engineering: database technologies; computer science; tools: Java, Scala, Python, C++.
  • 10. Which makes them hard to find… Applied science: statistics, applied math; machine learning; tools: Python, R, SAS, SPSS. Business analysis: data analysis, BI; business/domain expertise; tools: SQL, Excel, EDW. Data engineering: database technologies; computer science; tools: Java, Scala, Python, C++. Big data engineering: big data pipeline engineering; statistics and machine learning over large datasets; tools: Hadoop, Pig, Hive, Cascading, Solr, etc.
  • 11. The Data Science Team: Business Analyst, Data Engineer, Applied Scientist
  • 12. Machine Learning Overview
  • 13. What is Machine Learning? WALL-E was a machine that learned how to feel emotions after 700 years of experiences on Earth collecting human artifacts. Machine learning is the science of getting computers to learn from data and act without being explicitly programmed. Machine learning is about the construction and study of systems that can learn from data. The core of machine learning deals with representation and generalization, so that the system will perform well on unseen data instances and predict unknown events. There is a wide variety of machine learning tasks and successful applications.
  • 14. Supervised vs. Unsupervised learning. Data modeling techniques shown: frequent itemset, anomaly detection, clustering, collaborative filter, regression, classification. Supervised learning: applications in which the training data is a set of “labeled” examples of the input vectors along with their corresponding target variable (labels). Unsupervised learning: applications in which the training data comprises examples of input vectors WITHOUT any corresponding target variables; the goal is to unearth “naturally occurring patterns” in the data, such as in clustering. Collaborative filtering (recommendation engines) uses techniques from both the supervised and unsupervised worlds.
  • 15. Supervised Learning: learn from examples. Labeled dataset (feature matrix), columns Patient Age, Tumor Size, Clump Thickness, …, Malignant?: (55, 5, 3, TRUE); (70, 4, 7, TRUE); (85, 4, 6, FALSE); (35, 2, 1, FALSE); … Test data (feature vectors), columns Patient age, Tumor size, Clump, …, Malignant: (72, 3, 3, ?); (66, 4, 4, ?). Cancer model (target function): F(k1, k2, k3, k4), i.e. f(V1, V2, V3, …) = ?
  • 16. Classification: predicting a category. Some techniques: Naïve Bayes, decision tree, logistic regression, SGD, support vector machines, neural network, ensembles.
  • 17. Regression: predict a continuous value. Some techniques: linear regression / GLM, decision trees, support vector regression, SGD, ensembles.
  • 18. Example: Ad Click-Through Rates in Ad Search. Rank = bid * CTR. Predict CTR for each ad to determine placement, based on: historical CTR, keyword match, etc.
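The ranking rule on this slide (Rank = bid * CTR) can be illustrated with a minimal sketch. The ad records, bids and predicted CTR values below are hypothetical; in practice the CTR would come from a trained model rather than a hard-coded number.

```python
def rank_ads(ads):
    """Sort ads by bid * predicted CTR, highest first."""
    return sorted(ads, key=lambda ad: ad["bid"] * ad["ctr"], reverse=True)

# Hypothetical ads: bid in dollars, ctr as a predicted click probability.
ads = [
    {"id": "A", "bid": 2.00, "ctr": 0.010},
    {"id": "B", "bid": 0.50, "ctr": 0.050},
    {"id": "C", "bid": 1.00, "ctr": 0.015},
]
ranked_ids = [ad["id"] for ad in rank_ads(ads)]
print(ranked_ids)
```

Note that the highest bid does not win on its own: ad B bids the least but has the highest predicted CTR, so its product is the largest.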
  • 19. Unsupervised Learning: detect natural patterns. Dataset with no labels, columns Age, State, Annual Income, Marital status: (25, CA, $80,000, M); (45, NY, $150,000, D); (55, WA, $100,500, M); (18, TX, $85,000, S); … The model finds naturally occurring (hidden) structure.
  • 20. Clustering: detect similar instance groupings. Some techniques: k-means, spectral clustering, DBSCAN, hierarchical clustering.
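As a rough illustration of the first technique listed, here is a minimal k-means sketch (Lloyd's algorithm) in plain Python. The points and starting centroids are made up for the example, and the fixed iteration count is an assumption; library implementations add smarter initialization and convergence checks.

```python
def kmeans(points, centroids, iterations=10):
    """Lloyd's algorithm on 2-D points, starting from the given centroids."""
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            distances = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centroids]
            clusters[distances.index(min(distances))].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        centroids = [
            (sum(p[0] for p in cl) / len(cl), sum(p[1] for p in cl) / len(cl)) if cl else c
            for cl, c in zip(clusters, centroids)
        ]
    return centroids, clusters

# Hypothetical data: two visually separate groups of points.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, clusters = kmeans(points, centroids=[(0, 0), (10, 10)])
print(centroids)
```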
  • 21. Example: market segmentation
  • 22. Outlier Detection: identify abnormal patterns. Example: identify engine anomalies. Features: heat generated, vibration of engine.
  • 23. Outlier Detection. Target function: outlier factor (0…1). Example, columns ID, Total$, Age, City, OF: (101, $200, 25, SF, 0.1); (102, $350, 35, LA, 0.05); (103, $25, 15, LA, 0.2); … Plotted outlier factors: 0.1, 0.9, 0.2, 0.15, 0.1. Some techniques: statistical techniques, local outlier factor, one-class SVM.
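One of the simplest statistical techniques is a z-score test, sketched below. This is only one possible choice and is not the 0…1 outlier factor shown on the slide; the transaction totals and the threshold of 2 standard deviations are assumptions for illustration.

```python
from statistics import mean, pstdev

def zscore_outliers(values, threshold=2.0):
    """Return the values lying more than `threshold` standard deviations from the mean."""
    mu = mean(values)
    sigma = pstdev(values)
    return [v for v in values if sigma > 0 and abs(v - mu) / sigma > threshold]

# Hypothetical transaction totals in dollars; one is far larger than the rest.
totals = [200, 350, 25, 180, 220, 5000]
outliers = zscore_outliers(totals)
print(outliers)
```

A z-score test looks at one feature at a time and is itself distorted by extreme values, which is one reason density-based methods such as local outlier factor are also listed.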
  • 24. Example: Credit Card Fraud Detection
  • 25. Affinity Analysis: identifying frequent item sets. Transaction-by-item matrix (Tx 1 … Tx 5 …; Item1 … Item5 …) with Y/N entries: Y N N Y N Y N N Y N Y Y N Y N N N Y Y Y. Goal: identify frequent item sets. Techniques: FP-Growth, Apriori.
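To make "frequent item set" concrete, the sketch below counts the support of every single item and item pair by brute force. It is not FP-Growth or Apriori (both prune the search to scale to large catalogs), and the baskets and the 0.6 support threshold are hypothetical.

```python
from collections import Counter
from itertools import combinations

def frequent_itemsets(transactions, min_support, max_size=2):
    """Count every itemset up to max_size; keep those meeting min_support."""
    counts = Counter()
    for basket in transactions:
        items = sorted(set(basket))
        for size in range(1, max_size + 1):
            counts.update(combinations(items, size))
    n = len(transactions)
    return {itemset: c / n for itemset, c in counts.items() if c / n >= min_support}

# Hypothetical baskets, one set of items per transaction.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
    {"bread", "milk", "beer"},
]
result = frequent_itemsets(transactions, min_support=0.6)
print(result)
```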
  • 26. Example: Affinity Analysis. Use affinity analysis for store layout design and coupons.
  • 27. Product recommendation: predicting “preference”. Collaborative filtering: identify users with similar “taste”.
  • 28. Collaborative filtering -> matrix completion. Ratings matrix with users 101, 102, 103, 104, 105, … as rows and Harrypotter, X-Men, Hobbit, Argo, Pirates as columns. Observed ratings: (5, 2, 4, ?, ?); (?, ?, 5, 2, ?); (1, 2, ?, ?, 3). Completed matrix: (5, 2, 4, 1, 3); (4, 1, 5, 2, 3); (1, 2, 4, 1, 3).
  • 29. Example: Netflix
  • 30. Hadoop and Data Science
  • 31. Hadoop Improves Data Scientist Productivity. Data lake: all the data in one place; ability to store ALL the data in raw format; data silo convergence. Data/compute capabilities available as a shared asset: data scientists can quickly prototype a new idea without an up-front request for funding; YARN enables multiple processing applications.
  • 32. “Schema on read” Accelerates Data Innovation. With a “schema change” project (timeline: start, 6 months, 9 months): “I need new data”; “Finally, we start collecting”; “Let me see… is it any good?” With schema on read (timeline: 3 months): “Let’s just put it in a folder on HDFS”; “Let me see… is it any good?”; “My model is awesome!”
  • 33. Hadoop is ideal for pre-processing. Pre-processing covers raw transforms and feature engineering (signal processing, OCR, geo-spatial, normalize, transform/aggregate, sample, dimensionality reduction, feature selection, NLP, mutual information), ahead of data modeling. Build a better feature matrix: more/new features; more instances; faster and at more scale.
  • 34. Training a Supervised Learning model with Hadoop. Typically the “training set” is not that large; in this case, it’s very common to train on a high-memory node using existing tools: R, Python scikit-learn or SAS. For really large training sets that don’t fit in memory: SAS; Spark MLlib is a promising (albeit new) solution; Mahout is workable in some cases (but its future is unclear). Hadoop is also useful in parameter tuning: grid search to optimize the model’s parameters.
  • 35. Scoring a Supervised Learning Model with Hadoop. Scoring of a single instance is usually fast. Some use cases require frequent batch re-scoring of a large population (e.g., 20M customers): use a PMML scoring engine (e.g., Zementis, Pattern) or a custom implementation with Python, R, Java, etc.
  • 36. Unsupervised learning with Hadoop. Clustering: many clustering algorithms are parallelizable; distributed k-means is popular and available in Spark MLlib and Mahout. Collaborative filtering: Alternating Least Squares (ALS) is very parallelizable; ALS is implemented in Mahout, Spark MLlib and others; item-based or user-based collaborative filtering is available in Mahout.
  • 37. Deployment Considerations: Hadoop and Spark. User runs Spark (or MLlib) job directly from the edge node; Scala API or Java API; Python API also good. Spark runs directly as a YARN job; no need to install anything else.
  • 38. Deployment Considerations: Hadoop and R. R and relevant packages installed on each node. User runs R on a high-memory node: RStudio or RStudio Server; RCloud. Interfaces to Hadoop: RMR (run map-reduce with R); RHDFS (access HDFS files from R); RHive (run Hive queries from R); RHBase (HBase from R); RODBC.
  • 39. Deployment Considerations: Hadoop and Python. Python and relevant packages (Pandas, Scikit-learn, NumPy, SciPy, Matplotlib) installed on each node and on high-memory nodes. User runs Python on a high-memory node; IPython notebook is a great UI. Interfaces to Hadoop: PyDoop (access HDFS from Python); map-reduce jobs with Hadoop streaming; Python UDFs with Pig.
  • 40. Supervised Learning with Hadoop: more details + demo
  • 41. Supervised Learning Workflow. Training: raw data (train) and labels -> feature extraction -> feature matrix -> train the model -> model -> eval model. Predicting: new data -> feature extraction -> feature vector -> model -> predict -> labels.
  • 42. Close up: Feature Extraction. Raw data (TB, PB) -> feature extraction (raw transforms and feature engineering: signal processing, OCR, geo-spatial, normalize, transform/aggregate, sample, dimensionality reduction, feature selection, NLP, mutual information) -> feature matrix (MB, GB). Example, columns ID, Total$, Age, City, Target: (101, 200, 25, SF); (102, 350, 35, LA); (103, 25, 15, LA); …
  • 43. How Big is your Feature Matrix? Example: 10M rows, 100 features; each feature = 8 bytes (double); total memory = ~7.5GB.
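The sizing arithmetic on this slide can be checked directly. The sketch assumes a dense matrix with no per-row overhead, and reads the slide's "GB" as binary gigabytes (GiB), which is what gives a figure close to 7.5.

```python
rows = 10_000_000        # 10M rows
features = 100           # 100 features per row
bytes_per_value = 8      # one double-precision float

total_bytes = rows * features * bytes_per_value
total_gib = total_bytes / 2**30   # binary gigabytes

print(f"{total_bytes:,} bytes = {total_gib:.2f} GiB")
```

At roughly 7.45 GiB, a matrix of this shape fits on a single high-memory node, which is why training outside the cluster is often practical once feature extraction has reduced the raw data.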
  • 44. Close-Up: Training the Model. The feature matrix is randomly split into a “training” set (70%) and a “validation” set (30%). The model is built using the training set, and the error measure is computed over the validation set. An iterative process or grid search determines the best algorithm and choice of parameters so that we get optimal model accuracy and prevent over-fitting.
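The 70/30 random split described here can be sketched with the standard library alone. The function name, the fixed seed and the use of row indices as stand-in data are assumptions made for a reproducible example.

```python
import random

def train_validation_split(rows, train_fraction=0.7, seed=42):
    """Shuffle a copy of the rows, then cut it into training and validation parts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)   # seeded so the split is repeatable
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Stand-in for a feature matrix: 100 row identifiers.
train, validation = train_validation_split(range(100))
print(len(train), len(validation))
```

Keeping the validation rows out of training is what makes the error measure an honest estimate of how the model behaves on unseen data.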
  • 45. Evaluating Performance of a Classifier. Determine the “confusion matrix”, then compute metrics: precision, recall, accuracy and specificity. Confusion matrix (predicted vs. actual): predicted Yes, actual Yes = true positives; predicted Yes, actual No = false positives; predicted No, actual Yes = false negatives; predicted No, actual No = true negatives. From the confusion matrix, we can compute these metrics: Precision = % of positive predictions that are correct; Recall = % of positive instances that were predicted as positive; F1 score = a measure of a test’s accuracy, combining precision and recall; Accuracy = % of correct classifications.
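The metrics named on this slide follow directly from the four confusion-matrix counts. The sketch below uses the standard definitions, with F1 as the harmonic mean of precision and recall; the counts passed in are hypothetical, and the function assumes no denominator is zero.

```python
def classifier_metrics(tp, fp, fn, tn):
    """Compute common metrics from confusion-matrix counts."""
    precision = tp / (tp + fp)          # share of positive predictions that are correct
    recall = tp / (tp + fn)             # share of actual positives that were found
    return {
        "precision": precision,
        "recall": recall,
        "specificity": tn / (tn + fp),  # share of actual negatives that were found
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "f1": 2 * precision * recall / (precision + recall),
    }

# Hypothetical counts for 100 validation instances.
metrics = classifier_metrics(tp=40, fp=10, fn=20, tn=30)
print(metrics)
```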
  • 46. Demo overview. Datasets: airline delay data (we’re using only the years 2007 and 2008), http://stat-computing.org/dataexpo/2009/the-data.html; weather data from http://ncdc.noaa.gov/. Goal: predict delay (delayTime >= 15 mins) in flights; for simplicity, limited to flights originating from ORD. Tools: pre-process with Pig or Spark on Hadoop; modeling with Scikit-learn, Spark MLlib or R.
  • 47. Demo Flow. Training: raw data (train), airline and weather data for 2007 -> feature extraction -> ORD_2007 with labels -> train the model -> model. Prediction: raw data (test), airline and weather data for 2008 -> feature extraction -> ORD_2008 with labels -> predict / score.
  • 48. DEMO now!
  • 49. Q&A, Open discussion. Architecting the Future of Big Data.