Data Science with Hadoop - A primer

© Hortonworks Inc. 2013
Hortonworks
Hadoop Summit, June 2013
Ofer Mendelevitch
ofer@hortonworks.com
@ofermend

Who am I?
currently <- c(
  role = "director of data sciences",
  company = "Hortonworks")
• Previously: Nor1, Yahoo!, Risk Insight, Quiver, etc…
• Blog: www.achessdad.com

What will I be talking about?
• What is Data Science?
• Hadoop and Data Science
• Use-cases: data science with Hadoop
• How to get started?

What is Data Science?
The art of building data products.
Data product: a software product whose core functionality relies on applying statistical (or machine learning) methods to data.
What is a data scientist?
A person who does this.

Data science & big data

With Hadoop…
The time and cost of building large-scale data products are dramatically reduced

An Apache Hadoop Platform: Hortonworks Data Platform (HDP)
• Hadoop Core: distributed storage & processing (HDFS, MapReduce)
• Platform Services: enterprise readiness (HA, DR, snapshots, security, …)
• Data Services: store, process and access data (HCatalog, Hive, Pig, HBase, Sqoop, Flume)
• Operational Services: manage & operate at scale (Oozie, Ambari)
• Deployment options: OS / VM, cloud, appliance

A typical Big Data Architecture
• Applications: business analytics, custom applications, packaged applications
• Data systems: traditional repos (RDBMS, EDW, MPP) alongside the Hortonworks Data Platform
• Data sources: mobile data; OLTP and POS systems
– Traditional sources (RDBMS, OLTP, OLAP)
– New sources (web logs, email, sensor data, social media)
• Dev & data tools: build & test
• Operational tools: manage & monitor

Keys to Hadoop’s power
• Computation co-located with data
– Data and computation systems are co-designed to work together
• Affordable at scale
– Use “commodity” hardware nodes
– Self-healing; failure handled by software
– Very good at batch processing of large datasets

Hadoop improves productivity of data scientists
• All data in one place
– Ability to store all the data in raw format
– Data silo convergence
– Data scientists will find innovative uses of combined data assets
• Data/compute capabilities available as a shared asset
– Data scientists can quickly prototype a new idea without an up-front request for funding

Data-driven innovation is accelerated since Hadoop is “schema on read”
• Traditional “schema change” project (timeline: start, 6 months, 9 months):
“I need new data” → “Finally, we start collecting” → “Let me see… is it any good?”
• With Hadoop (timeline: 3 months):
“Let’s just put it in a folder on HDFS” → “Let me see… is it any good?” → “My model is awesome!”

Hadoop is ideal for pre-processing of large raw datasets
Raw data → processed data, via steps such as:
• Strip away HTML/PDF/DOC/PPT
• Entity resolution
• Term normalization
• Document vector generation
• Sampling, filtering
• Joins
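
Three of the steps above (stripping markup, term normalization, document vector generation) can be sketched in a few lines of Python. This is a simplified illustration, not a production pipeline: the regular expressions are crude, the function names are made up for this example, and on a cluster the same per-document function would typically run inside a map task.

```python
import re
from collections import Counter

TAG_RE = re.compile(r"<[^>]+>")       # crude HTML tag matcher
TOKEN_RE = re.compile(r"[a-z0-9]+")   # lowercase alphanumeric terms


def strip_html(text):
    """Replace HTML tags with spaces, leaving the text content."""
    return TAG_RE.sub(" ", text)


def normalize_terms(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return TOKEN_RE.findall(text.lower())


def document_vector(raw_html):
    """Map one raw document to a sparse term-count vector."""
    return Counter(normalize_terms(strip_html(raw_html)))


vec = document_vector("<p>Hadoop stores <b>raw</b> data; hadoop processes it.</p>")
```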

In machine learning, very often:
more data -> better outcomes
Banko & Brill, 2001
• More examples to learn from
• More possible feature types
– We’re looking for the ones most useful for our task

Use-cases

A (partial) map of data science “tasks”
Discovery
• Clustering: detect natural groupings
• Outlier detection: detect anomalies
• Affinity analysis: co-occurrence patterns
Prediction
• Classification: predict a category
• Regression: predict a value
• Recommendation: predict a preference
Big Data Science: high energy physics, genomics, etc.

Use-case: product recommendation
• Inputs:
– Explicit product ratings (when provided)
– Implicit information: purchase transactions, page views, comments
Data sources: ratings, page views, forum comments
Ratings matrix (rows: users U101, U102, U103, U104, U105, …; columns: products; ? = unknown):
Epic  X-Men  Hobbit  Argo  Pirates
5     2      4       ?     ?
?     ?      5       2     ?
1     2      ?       ?     3
?     2      3       1     5

Goal: predict a preference
Rows: users U101, U102, U103, U104, U105, …; columns: products.
Observed ratings (? = unknown):
Epic  X-Men  Hobbit  Argo  Pirates
5     2      4       ?     ?
?     ?      5       2     ?
1     2      ?       ?     3
?     2      3       1     5
Predicted ratings (unknowns filled in):
Epic  X-Men  Hobbit  Argo  Pirates
5     2      4       1     3
4     1      5       2     3
1     2      4       1     3
3     2      3       1     5
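
One simple way to fill in the unknown entries is user-based collaborative filtering. The slides do not say which algorithm is used, so the Python sketch below is an assumption chosen for illustration: it predicts a missing rating as a similarity-weighted average of other users' ratings for the same product. The cosine, predict, and fill helpers are hypothetical names, and the numbers it produces will not necessarily match the predicted matrix on the slide.

```python
from math import sqrt

# Observed ratings from the slide; None marks an unknown ("?") entry.
# Rows are users; row labels are omitted because the slide lists more
# user IDs than matrix rows.
ITEMS = ["Epic", "X-Men", "Hobbit", "Argo", "Pirates"]
RATINGS = [
    [5, 2, 4, None, None],
    [None, None, 5, 2, None],
    [1, 2, None, None, 3],
    [None, 2, 3, 1, 5],
]


def cosine(u, v):
    """Cosine similarity over the items both users have rated."""
    shared = [(a, b) for a, b in zip(u, v) if a is not None and b is not None]
    if not shared:
        return 0.0
    dot = sum(a * b for a, b in shared)
    norm_u = sqrt(sum(a * a for a, _ in shared))
    norm_v = sqrt(sum(b * b for _, b in shared))
    return dot / (norm_u * norm_v)


def predict(ratings, user, item):
    """Similarity-weighted average of other users' ratings for `item`."""
    num = den = 0.0
    for other, row in enumerate(ratings):
        if other == user or row[item] is None:
            continue
        weight = cosine(ratings[user], row)
        num += weight * row[item]
        den += weight
    if den == 0.0:
        # No comparable user rated this item: fall back to the user's mean.
        known = [r for r in ratings[user] if r is not None]
        return sum(known) / len(known)
    return num / den


def fill(ratings):
    """Return a copy of the matrix with every unknown entry predicted."""
    return [
        [r if r is not None else round(predict(ratings, u, i), 2)
         for i, r in enumerate(row)]
        for u, row in enumerate(ratings)
    ]


filled = fill(RATINGS)
```

At Hadoop scale the pairwise similarity computation is the expensive part, which is why it is usually run as a batch job rather than at serving time.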

Using Hadoop for recommendation
• Data sources: transactions, page views, content
• HDFS and MapReduce: pre-process, recommend (SQL, custom logic)
• Online serving
With Hadoop, we can process very large preference datasets

Use-case: failure prediction
• Inputs:
– Equipment history: install date, model, past issues
– Equipment sensor data
– Product catalog: product families, expected lifetime
Data sources: history, sensor data, product catalog
SKU     Install date  Service person ID  Zip code  Avg temp  TTF (days)
113454  5/1/2011      1345               94002     72        180
998323  5/3/2009      3234               88321     68        450
345375  8/2/2005      1112               53323     82        332
…

Building a prediction model
Labeled data → model; unseen data → model → TTF
Labeled data:
SKU     Install date  Service person ID  Zip code  Avg temp  TTF (days)
113454  5/1/2011      1345               94002     72        180
998323  5/3/2009      3234               88321     68        450
345375  8/2/2005      1112               53323     82        332
…
Unseen data:
SKU     Install date  Service person ID  Zip code  Avg temp
332456  3/3/2013      1345               94005     71
442343  6/6/2013      1112               77485     67
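
The train-then-predict flow can be sketched with a one-feature linear regression in plain Python. This is a toy illustration, not the model from the talk: it assumes average temperature is the only feature, uses just the three labeled rows shown on the slide, and fit_line is a hypothetical helper. A real model would use many more rows and features.

```python
# Labeled rows from the slide: (average temperature, TTF in days).
LABELED = [(72, 180), (68, 450), (82, 332)]
# Unseen rows from the slide: average temperature only.
UNSEEN = [71, 67]


def fit_line(points):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# "Train" on the labeled data, then predict TTF for the unseen rows.
intercept, slope = fit_line(LABELED)
predicted_ttf = [intercept + slope * x for x in UNSEEN]
```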

Using Hadoop for failure prediction
• HDFS: central repository for all data
– Service records (Word, PDF, etc.)
– Equipment purchase transaction data
– Product catalog: SKUs, model numbers, etc.
• Pre-process
– Convert service records to item features: remove PDF formatting, detect entities in records
– Normalize data using service records, product catalog
– Create feature matrix, ready for the modeling algorithm
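
The last pre-processing step, turning cleaned records into a feature matrix, might look like the Python sketch below. It makes several assumptions that are not in the slides: dates are month/day/year, equipment age is measured against a hypothetical reference date (AS_OF), and only two numeric features are kept. The to_features helper is illustrative only.

```python
from datetime import date, datetime

# Rows from the slide's table. Dates are assumed to be month/day/year.
RECORDS = [
    {"sku": "113454", "install_date": "5/1/2011", "avg_temp": 72},
    {"sku": "998323", "install_date": "5/3/2009", "avg_temp": 68},
    {"sku": "345375", "install_date": "8/2/2005", "avg_temp": 82},
]

AS_OF = date(2013, 6, 1)  # hypothetical reference date for computing age


def to_features(record, as_of=AS_OF):
    """Turn one record into a numeric row: [age in days, average temp]."""
    installed = datetime.strptime(record["install_date"], "%m/%d/%Y").date()
    age_days = (as_of - installed).days
    return [float(age_days), float(record["avg_temp"])]


# One numeric row per piece of equipment, ready for a modeling algorithm.
feature_matrix = [to_features(r) for r in RECORDS]
```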

Use-case: SaaS application security
• Inputs:
– Click-stream: user interaction with application
Data sources: user data, clicks
User ID  User since  Logins/month  Avg DL KB/day  …
123456   1/3/2004    6             30
998323   5/3/2009    1             5
345375   8/2/2005    22            120
…

Detecting anomalous behavior records
• User access profile modeled as vector of features
• Detect anomalies in application access patterns
– Rule-based
– Machine-learning-based (determine an “outlier factor” between 0 and 1)
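
As a rough sketch of an "outlier factor", the Python below scores each user's feature vector by its standardized distance from the average profile and squashes that distance into the 0 to 1 range. The scoring formula is an assumption made for illustration, not the method from the talk, and outlier_factors is a hypothetical helper; the feature values are the three rows from the earlier table.

```python
from math import exp, sqrt
from statistics import mean, pstdev

# Feature vectors from the slide's table: (logins/month, avg DL KB/day).
PROFILES = {
    "123456": (6, 30),
    "998323": (1, 5),
    "345375": (22, 120),
}


def outlier_factors(profiles):
    """Score each user in [0, 1): higher means further from the average."""
    columns = list(zip(*profiles.values()))
    centers = [mean(c) for c in columns]
    spreads = [pstdev(c) or 1.0 for c in columns]  # avoid dividing by zero
    scores = {}
    for user, vec in profiles.items():
        dist = sqrt(sum(((v - m) / s) ** 2
                        for v, m, s in zip(vec, centers, spreads)))
        scores[user] = 1.0 - exp(-dist)  # squash distance into [0, 1)
    return scores


scores = outlier_factors(PROFILES)
most_anomalous = max(scores, key=scores.get)
```

With only three users this is a toy; the point is that each user becomes a vector and each vector gets a single comparable score.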

Using Hadoop for anomaly detection
• HDFS: central repository for all raw data
– Raw user-access logs
– User information (organization, demographics)
• Pre-process
– Build a behavioral access profile for each user
• Detect anomalies
– In Hadoop
– Using existing tools: R, SAS, rules engine, etc
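
The pre-process step, building a per-user access profile from raw logs, maps naturally onto map and reduce functions. The Python sketch below imitates that flow in memory. The log format ("user_id,event,kilobytes"), the sample lines, and all function names are hypothetical; on a cluster the grouping would be done by the MapReduce shuffle rather than a dictionary.

```python
from collections import defaultdict

# Hypothetical raw access-log lines: "user_id,event,kilobytes".
RAW_LOG = [
    "123456,login,0",
    "123456,download,30",
    "998323,login,0",
    "123456,login,0",
    "998323,download,5",
]


def map_line(line):
    """Map step: emit (user_id, (is_login, kilobytes)) for one log line."""
    user_id, event, kilobytes = line.split(",")
    return user_id, (1 if event == "login" else 0, int(kilobytes))


def reduce_user(values):
    """Reduce step: fold one user's values into a behavioral profile."""
    logins = sum(is_login for is_login, _ in values)
    downloaded = sum(kb for _, kb in values)
    return {"logins": logins, "download_kb": downloaded}


def build_profiles(lines):
    """Run map, group by user, then reduce, all in memory."""
    grouped = defaultdict(list)  # stands in for the shuffle phase
    for line in lines:
        user_id, value = map_line(line)
        grouped[user_id].append(value)
    return {user_id: reduce_user(values) for user_id, values in grouped.items()}


profiles = build_profiles(RAW_LOG)
```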

How do I get started?

Getting started with data science on Hadoop
1. Pick a good use-case that delivers immediate business value
2. Implement a proof-of-value (POV)
3. Build a team (hire/train)

Implement a proof-of-value
• Put together a Hadoop cluster
• Define the POV business use-case
• Pull the raw data you need into the cluster
• Build it
• Show the business value of your data assets
Contact us. We can help!

Build a team:
The data scientist skillset continuum
Roles shown on the continuum: software engineer, research scientist, data engineer, data scientist, applied scientist
Role: Data Engineer
• Function: builds production-grade data products
• Good at:
– Data and systems architecture
– Hadoop, Pig/Hive, MapReduce, Mahout
– Java, Python, Perl, SQL, C++, etc.
– NoSQL (HBase, Cassandra, Mongo)
Role: Applied Scientist
• Function: finds signal/meaning in the data; applies statistical/ML models and tunes the algorithm
• Good at:
– Statistics, machine learning
– Text processing, NLP
– R, Matlab, SAS, SQL
– Scripting, prototyping
– Visualization / telling the story

Thank you!
Any Questions?
Ofer Mendelevitch
Director, Data Sciences @ Hortonworks
ofer@hortonworks.com
@ofermend
We’re hiring!
Data Science training: www.hortonworks.com/training
