1. Page 1 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
Predictive Analytics and Machine Learning
…with SAS and Apache Hadoop
Spring 2014
Version 1.5
We do Hadoop.
2. Page 2 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
Your speakers…
Ofer Mendelevitch, Director of Data Science
Hortonworks
Wayne Thompson, Chief Data Scientist
SAS
3. Page 3 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
A data architecture under pressure from new data
APPLICATIONS: Business Analytics, Custom Applications, Packaged Applications
DATA SYSTEM, REPOSITORIES: RDBMS, EDW, MPP
SOURCES: Existing Sources (CRM, ERP, Clickstream, Logs)
Data sources: OLTP, ERP, CRM Systems; Unstructured documents, emails; Clickstream; Server logs; Sentiment, Web Data; Sensor, Machine Data; Geo-location
Source: IDC
• 2.8 ZB in 2012
• 85% from New Data Types
• 15x Machine Data by 2020
• 40 ZB by 2020
4. Page 4 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
Hadoop within an emerging Modern Data Architecture
OPERATIONS TOOLS: Provision, Manage & Monitor
DEV & DATA TOOLS: Build & Test
APPLICATIONS: Business Analytics, Custom Applications, Packaged Applications
DATA SYSTEM: Governance & Integration, Security, Operations, Data Access, Data Management
REPOSITORIES: RDBMS, EDW, MPP
SOURCES: OLTP, ERP, CRM Systems; Documents, Emails; Web Logs, Click Streams; Social Networks; Machine Generated; Sensor Data; Geolocation Data
Data Lake
An architectural shift in the data center that uses Hadoop to deliver deeper insight across a large, broad, diverse set of data at efficient scale
5. Page 5 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
Hadoop unlocks a new approach: Iterative Analytics
Current Reality (SQL)
• Single Query Engine
• Repeatable Linear Process: Determine list of questions, Design solutions, Collect structured data, Ask questions from list, Detect additional questions
• Apply schema on write
• Dependent on IT
✚
Augment w/ Hadoop
• Multiple Query Engines: Batch, Interactive, Real-time, Streaming
• Iterative Process: Explore, Transform, Analyze
• Apply schema on read
• Support range of access patterns to data stored in HDFS: polymorphic access
6. Page 6 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
Apache Hadoop for Data Science
• Hadoop’s schema on read reduces cycle times
• Hadoop is ideal for pre-processing of raw data
• Improved models with larger datasets
7. Page 7 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
Hadoop’s schema-on-read accelerates innovation
“Schema change” project (timeline: Start, 6 months, 9 months)
• “I need new data”
• “Finally, we start collecting”
• “Let me see… is it any good?”
With Hadoop (timeline: 3 months)
• “Let’s just put it in a folder on HDFS”
• “Let me see… is it any good?”
• “My model is awesome!”
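To make the schema-on-read idea above concrete, here is a minimal sketch in Python. It is an illustration only, not code from the presentation: the raw event lines, the field names, and the helper `read_with_schema` are all hypothetical. The raw lines are kept exactly as they arrived, and each analysis applies its own schema at read time.

```python
import json

# Hypothetical raw events, stored exactly as they arrived (no upfront schema).
RAW_LINES = [
    '{"user": "U101", "page": "/home", "ms": 120}',
    '{"user": "U102", "page": "/cart", "ms": "95", "referrer": "ad"}',
    'not json at all',
]


def read_with_schema(lines, schema):
    """Apply `schema` (field name -> converter) while reading.

    Rows that cannot be parsed, or that lack a requested field, are skipped
    instead of blocking the load, which is the point of schema on read.
    """
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue
        if not isinstance(record, dict):
            continue
        try:
            yield {field: cast(record[field]) for field, cast in schema.items()}
        except (KeyError, TypeError, ValueError):
            continue


# Two different "schemas" over the same raw data, chosen at read time.
latency_view = list(read_with_schema(RAW_LINES, {"user": str, "ms": int}))
referrer_view = list(read_with_schema(RAW_LINES, {"user": str, "referrer": str}))
```

A new question only needs a new schema dictionary; the stored data is not reloaded or migrated.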
8. Page 8 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
Hadoop is ideal for large-scale pre-processing
Raw Data → Join, Normalize, OCR, Sample, Aggregate, NLP, Transform → Feature Matrix
9. Page 9 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
Why big data science?
Larger datasets → better outcomes
Banko & Brill, 2001
• More examples
• More features
10. Page 10 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
A (partial) map of data science “tasks”
Discovery
• Clustering: Detect natural groupings
• Outlier detection: Detect anomalies
• Association rule mining: Co-occurrence patterns
Prediction
• Classification: Predict a category
• Regression: Predict a value
• Recommendation: Predict a preference
Big Data Science: High energy physics, Genomics, etc.
11. Page 11 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
Typical iterative flow in data science
Acquire Data → Clean Data → Visualize, Explore → Hypothesize; Model → Measure/Evaluate → Deploy & Monitor
12. Page 12 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
SAS in-memory and Visual Statistics
HDP 2.1
Hortonworks Data Platform
OPERATIONS: Provision, Manage & Monitor (Ambari, Zookeeper); Scheduling (Oozie)
GOVERNANCE & INTEGRATION: Data Workflow, Lifecycle & Governance (Falcon, Sqoop, Flume, NFS, WebHDFS)
DATA ACCESS: Batch (Map Reduce); Script (Pig); SQL (Hive/Tez, HCatalog); NoSQL (HBase, Accumulo); Stream (Storm); Search (Solr); Others (In-Memory Analytics, ISV engines)
YARN: Data Operating System
DATA MANAGEMENT: HDFS (Hadoop Distributed File System), nodes 1 … N
SECURITY: Authentication, Authorization, Accounting, Data Protection
• Storage: HDFS
• Resources: YARN
• Access: Hive, …
• Pipeline: Falcon
• Cluster: Knox
Deployment Choice: Linux, Windows, On-Premise, Cloud
SAS® Visual Statistics
SAS® In-Memory Statistics for Hadoop
• Provide powerful advanced analytics integrated directly on HDP
13. Copyright © 2012, SAS Institute Inc. All rights reserved.
BIG ANALYTICS + HORTONWORKS DATA PLATFORM (HDP) = BIG OPPORTUNITIES
14. Copyright © 2012, SAS Institute Inc. All rights reserved.
WHAT IS IT?
Provides a single interactive analytical platform on
Hadoop to perform
• analytical data preparation
• variable transformations
• exploratory analysis
• statistical modeling and machine learning
• integrated modeling comparison and scoring
• Takes advantage of distributed in-memory computing
optimized for analytical workloads
SAS® IN-MEMORY ANALYTICS: PREPARE DATA, EXPLORE DATA, DEVELOP MODELS, SCORE, TEXT
Governance & Integration; Security; Operations; Data Access; Data Management
15. Copyright © 2012, SAS Institute Inc. All rights reserved.
SAS® IN-MEMORY ANALYTICS
INTEGRATED USER EXPERIENCE
Data Preparation, Exploration/Visualization, Modeling, Deployment
STATISTICIAN: SAS® Visual Statistics (GUI)
DATA SCIENTIST / PROGRAMMER: SAS® In-Memory Statistics for Hadoop (PROGRAMMING)
16. Copyright © 2012, SAS Institute Inc. All rights reserved.
SAS IN-MEMORY STATISTICS FOR HADOOP
Data Management
• Aggregate
• Compute
• Update
• Append
• Set
• Schema
• DeleteRows
• DropTables
• PurgeTempTables
Data Exploration
• Boxplot
• Corr
• Crosstab
• Distinct
• Fetch
• Frequency
• Histogram
• KDE
• MDSummary
• Percentile
• Summary
• TopK
Descriptive Modeling
• Association
• Path Analysis
• Clustering (k-means)
• Clustering (DBSCAN)
Evaluation, Deployment
• Assess: Misclassification matrix; Lift, ROC, Concordance
• Score
• Training / Validation
ANALYTICAL LIFE CYCLE: Data Management & Exploration, Modeling, Model Evaluation & Deployment
Utilities
• Where
• GroupBy
• TableInfo, ColumnInfo, ServerInfo
• Partition, Balance
• Store, Replay, Free
• Table, Promote
Text Analytics
• Parsing
• SVD
• Topic generation
• Document projection
Recommendation Systems
• Association
• Clustering
• kNN
• SVD
• Ensemble
Predictive Modeling
• Decision Tree
• Forecast
• Gen Linear Model
• Linear Regression
• Logistic Regression
• Random Forests
HDFS I/O
• Sasiola
• Sashdat
• Anyfile Reader
17. Copyright © 2012, SAS Institute Inc. All rights reserved.
SAS ON HADOOP
IN-MEMORY, CLIENT-SERVER, WEB-BASED
• Web Clients: SAS® In-Memory Analytic Products (SAS® Visual Analytics, SAS® Visual Statistics, SAS® In-Memory Statistics)
• Edge Node
• SAS® LASR™ Analytic Server: Head node, Data Nodes (Memory)
• Hortonworks Data Platform: Data (×4)
19. Copyright © 2012, SAS Institute Inc. All rights reserved.
SAS ON HADOOP
IN-MEMORY, CLIENT-SERVER, WEB-BASED
(Architecture diagram as on slide 17, with “task” and “result” labels added.)
20. Copyright © 2012, SAS Institute Inc. All rights reserved.
SAS ON HADOOP
SUMMARY STATISTICS
proc imstat;
table dat1;
summary X / mean;
run;
Edge Node (Web Clients, SAS® In-Memory Analytic Products)
• Send request SampleMean(X) to LASR
• Waiting..
• Receive $\bar{X}$
• OUTPUT
SAS® LASR™ Analytic Server (Head node broadcasts a task to the Data Nodes, which hold the data in Memory and return a result)
A) Request $S_X = \sum_i x_i$ from data nodes (Broadcast..)
B) Data node $j$ computes $S_{X,j} = \sum_i x_{i,j}$, $j = 1, 2, 3, 4$
C) Aggregate $\bar{X} = \sum_j S_{X,j} / N$
D) Send $\bar{X}$ back to Edge
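Steps A) to D) can be sketched in a few lines of Python. This is only an illustration of the partial-sum pattern, not how the LASR server is implemented: the function names and the four data slices are hypothetical, and the sketch assumes each node also reports its row count so the head node can obtain N.

```python
def partial_sum(values):
    """Step B: a data node returns the sum (and count) of its local slice."""
    return sum(values), len(values)


def distributed_mean(partitions):
    """Steps A, C, D: request partial sums, aggregate them, return the mean."""
    partials = [partial_sum(p) for p in partitions]  # A) broadcast, B) compute
    total = sum(s for s, _ in partials)              # C) sum of S_{X,j}
    n = sum(count for _, count in partials)          # N, total number of rows
    return total / n                                 # D) value sent to the edge


# Hypothetical values of X spread across four data nodes (j = 1, 2, 3, 4).
NODE_DATA = [
    [1.0, 2.0, 3.0],
    [4.0, 5.0],
    [6.0],
    [7.0, 8.0, 9.0, 10.0],
]

x_bar = distributed_mean(NODE_DATA)
```

Only one number per node (plus a count) travels to the head node, so the raw rows never leave the memory of the node that holds them.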
21. Copyright © 2012, SAS Institute Inc. All rights reserved.
SAS ON HADOOP
PRINCIPLES OF THE DESIGN
Web Clients (SAS® In-Memory Analytic Products)
• Thin Clients
• Multi-user
• Interactive
• Real-time
• Point-and-click or programming
SAS® LASR™ Analytic Server (Head node, Data Nodes, Memory)
• Receive requests from a UI or SAS program.
• NO MAP REDUCE
• One data copy
• Concurrency
• Temporary tables or columns
• MPP or SMP
Work on light computations (interactive trees)
22. Page 22 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
Use Case #1: Recommendation systems
Why recommender systems?
• 5 – 20% increase in sales
• 60% use “recommendations” to determine suitable product
• In 2011, 15% of customers admitted to buying recommended products; in 2013, nearly 30%
36 Million subscribers
60-70% view results from
recommendation
Tens of Billions “Thumbs up”
60 Million active users
3.8 billion hours of music (last quarter)
47% uptick in active users
67% increase in music served
25% YOY Growth
Trip Advisor collaborates with EBAY, ORBITZ and others.
23. Page 23 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
Pre-processing raw data for recommendation
• Inputs:
• Explicit product ratings (when provided)
• Implicit information: purchase transactions, page views, comments
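Sources of input: Ratings, Page views, Forum Comments
Users: U101, U102, U103, U104, U105, …
Items: Epic, X-Men, Hobbit, Argo, Pirates
Ratings matrix (rows: users, columns: items, ? = unknown):
5 2 4 ? ?
? ? 5 2 ?
1 2 ? ? 3
? 2 3 1 5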
24. Page 24 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
Goal: predict a preference
Users: U101, U102, U103, U104, U105, …
Items: Epic, X-Men, Hobbit, Argo, Pirates
Observed ratings (rows: users, columns: items, ? = unknown):
5 2 4 ? ?
? ? 5 2 ?
1 2 ? ? 3
? 2 3 1 5
Predicted ratings:
5 2 4 1 3
4 1 5 2 3
1 2 4 1 3
3 2 3 1 5
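As a rough illustration of "predict a preference", the sketch below fills the unknown cells of the ratings matrix shown on this slide. It uses a simple baseline predictor (global mean plus a user offset plus an item offset), which is an assumption made for this example and not the method used in the SAS demo, so its output will not match the predicted matrix on the slide. The function name `baseline_predict` is hypothetical.

```python
from statistics import mean

ITEMS = ["Epic", "X-Men", "Hobbit", "Argo", "Pirates"]
# Observed ratings from the slide; None marks an unknown ("?") rating.
RATINGS = [
    [5, 2, 4, None, None],
    [None, None, 5, 2, None],
    [1, 2, None, None, 3],
    [None, 2, 3, 1, 5],
]


def baseline_predict(ratings, lo=1.0, hi=5.0):
    """Fill unknown cells with global mean + user offset + item offset."""
    known = [r for row in ratings for r in row if r is not None]
    mu = mean(known)
    # How far each user's average rating sits above or below the global mean.
    user_bias = [mean(r for r in row if r is not None) - mu for row in ratings]
    # Same idea per item (column); items with no ratings get a zero offset.
    item_bias = []
    for j in range(len(ratings[0])):
        col = [row[j] for row in ratings if row[j] is not None]
        item_bias.append(mean(col) - mu if col else 0.0)
    filled = []
    for u, row in enumerate(ratings):
        out = []
        for j, r in enumerate(row):
            if r is not None:
                out.append(float(r))  # keep observed ratings unchanged
            else:
                guess = mu + user_bias[u] + item_bias[j]
                out.append(min(hi, max(lo, guess)))  # clip to the rating scale
        filled.append(out)
    return filled


predicted = baseline_predict(RATINGS)
```

In practice the baseline is usually a starting point; methods listed elsewhere in the deck (kNN, SVD, ensembles) refine it.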
25. Copyright © 2012, SAS Institute Inc. All rights reserved.
MACHINE LEARNING INTEGRATION
PREDICTIVE ANALYTICS & MACHINE LEARNING
RECOMMENDATION SYSTEM DEMO
DATA WRANGLING (Data Director*): Convert JSON Files, Load LASR, Standardize
SAS In-Memory Statistics
• Users: Tony, Patty, George
• Business: Tony’s Bar, Trees Lounge, The Tropicana, Blue Parrot
• Business: Beer & Wine, Chinese Food, Mexican Food
• REVIEWS
• Topics: LOUNGE, PUB, BEER, DRINK, GAME, MUSIC, PINT, BAND, PLAY, GLASS, VODKA, PATIO, KARAOKE, COCKTAIL, WINGS, LIQUOR, ALCOHOL, BARTENDER, DRAFT, TAP, FUN, LIVE, SCENE, POOL
SAS Visual Analytics
Deployment: Relevant, Real-time, Interactions
* New SAS Product
26. Copyright © 2012, SAS Institute Inc. All rights reserved.
PREDICTIVE ANALYTICS & MACHINE LEARNING
RECOMMENDATION SYSTEM DEMO
John Clark
Recommendation History / Review History
1. Oyster Bar
2. The Brick
3. Trees Lounge
4. Blue Parrot
5. Winchester Club
6. Starlight Lounge
7. Tony’s Bar
8. Lucy’s
9. The Tropicana
Recommendation Rank: 1, 2, 3, …
27. Page 27 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
Use Case # 2: Building a prediction model
Labeled Data
Customer ID   Age   Gender   Loyalty Card   More features…   Buys organic
11001         45    M        Yes            …                Yes
11002         43    M        No             …                Yes
11003         65    F        Yes            …                No
…             …     …        …
Labeled Data → Model
Unseen data
Customer ID   Age   Gender   Home Owner   More features…
11004         33    M        No           …
11005         25    F        No           …
Unseen data → Model → Buys organic
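The train-then-score flow on this slide can be sketched with a deliberately tiny model. The example below fits a one-split decision stump on the three labeled rows shown above and scores the two unseen rows. Everything here is an assumption for illustration: the stump is not the model from the SAS demo, only Age and Gender are used because they are the features both tables share, and three rows are far too few for a real model.

```python
from collections import Counter

# Labeled rows from the slide: (age, gender, buys_organic).
LABELED = [(45, "M", "Yes"), (43, "M", "Yes"), (65, "F", "No")]
# Unseen rows from the slide: (customer_id, age, gender).
UNSEEN = [(11004, 33, "M"), (11005, 25, "F")]


def encode(age, gender):
    """Turn a row into a numeric feature vector (gender: M=0, F=1)."""
    return [float(age), 1.0 if gender == "F" else 0.0]


def majority(labels, default):
    """Most common label, or `default` when the side of a split is empty."""
    return Counter(labels).most_common(1)[0][0] if labels else default


def fit_stump(X, y):
    """Pick the single feature/threshold split with the fewest training errors."""
    best = None
    fallback = majority(y, None)
    for f in range(len(X[0])):
        for t in sorted({row[f] for row in X}):
            left = [lab for row, lab in zip(X, y) if row[f] <= t]
            right = [lab for row, lab in zip(X, y) if row[f] > t]
            l_lab, r_lab = majority(left, fallback), majority(right, fallback)
            errors = sum(lab != l_lab for lab in left)
            errors += sum(lab != r_lab for lab in right)
            if best is None or errors < best[0]:
                best = (errors, f, t, l_lab, r_lab)
    return best[1:]  # (feature index, threshold, left label, right label)


def predict(stump, x):
    f, t, l_lab, r_lab = stump
    return l_lab if x[f] <= t else r_lab


X = [encode(age, gender) for age, gender, _ in LABELED]
y = [label for _, _, label in LABELED]
stump = fit_stump(X, y)
predictions = {cid: predict(stump, encode(age, gender)) for cid, age, gender in UNSEEN}
```

On this toy data the stump splits on age, so both unseen customers fall on the "Yes" side; with realistic data volumes the same fit/predict separation applies to the decision trees, regressions, and forests listed earlier.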
28. Page 28 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
Demo #2: Predicting who buys organic products?
• Dataset: grocery transaction and customer data
• Goals:
• Understand customer propensity to buy organic products
• Develop segments using an interactive decision
• Develop stratified models to predict organic purchases
• Why is it useful?
• Inventory strategy
• Store layout planning
• Provider management
29. Copyright © 2012, SAS Institute Inc. All rights reserved.
SAS VISUAL STATISTICS 6.4 – ORGANICS PURCHASE DEMO
PREDICTIVE
ANALYTICS &
MACHINE LEARNING
30. Page 30 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
Wrap up: SAS and Hortonworks Data Platform
• Increase productivity for data scientists
• Users can concurrently & interactively analyze traditional & new data sets in HDP to help
businesses quickly discover and capitalize on new business insights from their data
• Increase efficiency
• Avoid unnecessary, multiple passes through the data
• SAS in-memory infrastructure running on top of Hadoop eliminates costly data movement and
persists data in-memory for the entire analytics session
• Capture and analyze new data types
• HDP + SAS enables data scientists to look at more of their enterprise data
• Leverage 100 percent open-source Apache Hadoop
• SAS customers can now embrace Hadoop as a core platform in their data architecture
31. Page 31 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
How should you get started? Next steps…
• Get the Data
• Formulate a well-defined business objective
• Data exploration: integrate and fuse heterogeneous data types
• Pre-process: generate features from raw data
• Manage the long-tail distribution and data imbalance
• Modeling: remember model building is cyclical
• Evaluate your results
• Work with IT to move analytics from research and into operations
32. Page 32 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
More details..
Download the Hortonworks Sandbox
Learn Hadoop
Build Your Analytic App
Try Hadoop 2
More about SAS Software & Hortonworks
http://hortonworks.com/partner/SAS/
Contact us: events@hortonworks.com