Predictive Analytics and Machine Learning …with SAS and Apache Hadoop


In this interactive webinar, we walk through use cases showing how you can use advanced analytics such as SAS Visual Statistics and SAS In-Memory Statistics for Hadoop with the Hortonworks Data Platform (HDP) to reveal insights in your big data and redefine how your organization solves complex problems.

Published in: Software, Technology, Education


  1. Predictive Analytics and Machine Learning …with SAS and Apache Hadoop. Spring 2014, Version 1.5. © Hortonworks Inc. 2011 – 2014. All Rights Reserved. We do Hadoop.
  2. Your speakers… Ofer Mendelevitch, Director of Data Science, Hortonworks; Wayne Thompson, Chief Data Scientist, SAS
  3. A data architecture under pressure from new data
     • Applications: Business Analytics, Custom Applications, Packaged Applications
     • Data system (repositories): RDBMS, EDW, MPP
     • Existing sources (CRM, ERP, clickstream, logs)
     • Sources: OLTP, ERP, CRM systems; unstructured documents, emails; clickstream; server logs; sentiment, web data; sensor and machine data; geo-location
     • Source: IDC. 2.8 ZB in 2012; 85% from new data types; 15x machine data by 2020; 40 ZB by 2020
  4. Hadoop within an emerging Modern Data Architecture
     • Operations tools: provision, manage & monitor
     • Dev & data tools: build & test
     • Applications: Business Analytics, Custom Applications, Packaged Applications
     • Data system (repositories): RDBMS, EDW, MPP
     • Governance & Integration, Security, Operations, Data Access, Data Management
     • Sources: OLTP, ERP, CRM systems; documents, emails; web logs, click streams; social networks; machine generated; sensor data; geolocation data
     • Data Lake: an architectural shift in the data center that uses Hadoop to deliver deeper insight across a large, broad, diverse set of data at efficient scale
  5. Hadoop unlocks a new approach: Iterative Analytics
     • Current reality (SQL): single query engine; repeatable linear process (determine list of questions, design solutions, collect structured data, ask questions from list, detect additional questions); apply schema on write; dependent on IT
     • Augment with Hadoop: multiple query engines (batch, interactive, real-time, streaming); iterative process (explore, transform, analyze); apply schema on read; support a range of access patterns to data stored in HDFS: polymorphic access
  6. Apache Hadoop for Data Science • Hadoop’s schema on read reduces cycle times • Hadoop is ideal for pre-processing of raw data • Improved models with larger datasets
  7. Hadoop’s schema-on-read accelerates innovation
     • Schema on write: “I need new data” (start); “schema change” project; “Finally, we start collecting” (6 months); “Let me see… is it any good?” (9 months)
     • Schema on read: “Let’s just put it in a folder on HDFS”; “Let me see… is it any good?”; “My model is awesome!” (3 months)
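As a rough illustration of schema on read, the sketch below stores raw lines untouched and applies a schema only when the data is read. It is plain Python rather than anything Hadoop-specific; the sample records, the `SCHEMA` field names, and the `read_with_schema` helper are all hypothetical.

```python
import csv
import io

# Hypothetical raw clickstream lines, stored as-is (no schema enforced on write).
RAW_LINES = """\
U101,2014-03-01T10:00:00,/products/42,17
U102,2014-03-01T10:00:05,/home,3
U101,2014-03-01T10:01:10,/cart,not_a_number
"""

# The schema lives with the reader, not with the storage layer.
SCHEMA = [("user_id", str), ("timestamp", str), ("url", str), ("dwell_seconds", int)]


def read_with_schema(text, schema):
    """Parse raw delimited text, applying the schema only at read time."""
    rows, rejected = [], []
    for fields in csv.reader(io.StringIO(text)):
        try:
            rows.append(
                {name: parse(value) for (name, parse), value in zip(schema, fields)}
            )
        except ValueError:
            rejected.append(fields)  # malformed rows are set aside, not fatal
    return rows, rejected


rows, rejected = read_with_schema(RAW_LINES, SCHEMA)
print(len(rows), "rows parsed;", len(rejected), "rejected")
```

Changing the analysis then only means changing `SCHEMA` and re-reading; the stored data is never migrated.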
  8. Hadoop ideal for large scale pre-processing: raw data is turned into a feature matrix through steps such as join, normalize, OCR, sample, aggregate, NLP, and transform
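A minimal sketch of two of these steps (aggregate and normalize) turning raw events into a feature matrix. It runs on a single machine in plain Python; on a cluster the same logic would be distributed. The event tuples, feature names, and `build_feature_matrix` helper are assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical raw events: (customer_id, event_type, amount)
EVENTS = [
    ("11001", "purchase", 20.0),
    ("11001", "page_view", 0.0),
    ("11002", "purchase", 5.0),
    ("11001", "purchase", 10.0),
    ("11002", "page_view", 0.0),
    ("11002", "page_view", 0.0),
]


def build_feature_matrix(events):
    """Aggregate raw events per customer, then normalize spend to [0, 1]."""
    totals = defaultdict(lambda: {"purchases": 0, "page_views": 0, "spend": 0.0})
    for customer_id, event_type, amount in events:
        features = totals[customer_id]
        if event_type == "purchase":
            features["purchases"] += 1
            features["spend"] += amount
        elif event_type == "page_view":
            features["page_views"] += 1

    ids = sorted(totals)
    # Guard against an all-zero spend column.
    max_spend = max(totals[c]["spend"] for c in ids) or 1.0
    matrix = [
        [totals[c]["purchases"], totals[c]["page_views"], totals[c]["spend"] / max_spend]
        for c in ids
    ]
    return ids, matrix


ids, matrix = build_feature_matrix(EVENTS)
for customer_id, row in zip(ids, matrix):
    print(customer_id, row)
```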
  9. Why big data science? Larger datasets → better outcomes (Banko & Brill, 2001) • More examples • More features
  10. A (partial) map of data science “tasks”
     • Discovery: clustering (detect natural groupings); outlier detection (detect anomalies); association rule mining (co-occurrence patterns)
     • Prediction: classification (predict a category); regression (predict a value); recommendation (predict a preference)
     • Big Data Science: high energy physics, genomics, etc.
  11. Typical iterative flow in data science: Acquire Data; Clean Data; Visualize, Explore; Hypothesize, Model; Measure/Evaluate; Deploy & Monitor
  12. SAS in-memory and Visual Statistics on HDP 2.1 (Hortonworks Data Platform)
     • SAS® Visual Statistics and SAS® In-Memory Statistics for Hadoop: provide powerful advanced analytics integrated directly on HDP
     • Governance & Integration: data workflow, lifecycle & governance (Falcon, Sqoop, Flume, NFS, WebHDFS)
     • Data Access: batch (MapReduce); script (Pig); SQL (Hive/Tez, HCatalog); NoSQL (HBase, Accumulo); stream (Storm); search (Solr); others (in-memory analytics, ISV engines)
     • Data Management: YARN (data operating system); HDFS (Hadoop Distributed File System) across nodes 1 to N
     • Security: authentication, authorization, accounting, data protection. Storage: HDFS; Resources: YARN; Access: Hive, …; Pipeline: Falcon; Cluster: Knox
     • Operations: provision, manage & monitor (Ambari, Zookeeper); scheduling (Oozie)
     • Deployment choice: Linux, Windows, on-premise, cloud
  13. BIG ANALYTICS + HORTONWORKS DATA PLATFORM (HDP) = BIG OPPORTUNITIES. Copyright © 2012, SAS Institute Inc. All rights reserved.
  14. SAS® IN-MEMORY ANALYTICS: WHAT IS IT?
     Provides a single interactive analytical platform on Hadoop to perform
     • analytical data preparation
     • variable transformations
     • exploratory analysis
     • statistical modeling and machine learning
     • integrated modeling comparison and scoring
     Takes advantage of distributed in-memory computing optimized for analytical workloads.
     Workflow: Prepare Data, Explore Data, Develop Models, Score
     Diagram labels: Governance & Integration, Security, Operations, Data Access, Data Management
  15. SAS® IN-MEMORY ANALYTICS: INTEGRATED USER EXPERIENCE
     • Stages: Data Preparation, Exploration/Visualization, Modeling, Deployment
     • SAS® Visual Statistics: GUI (statistician)
     • SAS® In-Memory Statistics for Hadoop: programming (data scientist/programmer)
  16. SAS IN-MEMORY STATISTICS FOR HADOOP
     • Analytical life cycle: Data Management & Exploration; Modeling; Model Evaluation & Deployment
     • Data Management: Aggregate, Compute, Update, Append, Set, Schema, DeleteRows, DropTables, PurgeTempTables
     • Data Exploration: Boxplot, Corr, Crosstab, Distinct, Fetch, Frequency, Histogram, KDE, MDSummary, Percentile, Summary, TopK
     • Descriptive Modeling: Association, Path Analysis, Clustering (k-means), Clustering (DBSCAN)
     • Predictive Modeling: Decision Tree, Forecast, Gen Linear Model, Linear Regression, Logistic Regression, Random Forests
     • Text Analytics: Parsing, SVD, Topic generation, Document projection
     • Recommendation Systems: Association, Clustering, kNN, SVD, Ensemble
     • Evaluation, Deployment: Assess (misclassification matrix, lift, ROC, concordance), Score, Training/Validation
     • Utilities: Where; GroupBy; TableInfo, ColumnInfo, ServerInfo; Partition, Balance; Store, Replay, Free; Table, Promote
     • HDFS I/O: Sasiola, Sashdat, Anyfile Reader
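The Assess entry above names a misclassification matrix among its outputs. As a sketch of what such a matrix contains for a binary target, written in plain Python and not taken from any SAS procedure (the function name and the sample labels are hypothetical):

```python
def misclassification_matrix(actual, predicted):
    """Count true/false positives and negatives for binary labels (1 = event)."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for a, p in zip(actual, predicted):
        if a == 1 and p == 1:
            counts["tp"] += 1
        elif a == 0 and p == 1:
            counts["fp"] += 1
        elif a == 0 and p == 0:
            counts["tn"] += 1
        else:
            counts["fn"] += 1
    return counts


# Hypothetical validation labels and model predictions.
actual = [1, 1, 0, 0, 1, 0, 1, 0]
predicted = [1, 0, 0, 1, 1, 0, 1, 0]

matrix = misclassification_matrix(actual, predicted)
error_rate = (matrix["fp"] + matrix["fn"]) / len(actual)
print(matrix, "misclassification rate:", error_rate)
```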
  17. SAS ON HADOOP: IN-MEMORY, CLIENT-SERVER, WEB-BASED. Components: web clients; SAS® In-Memory Analytic Products (SAS® Visual Analytics, SAS® Visual Statistics, SAS® In-Memory Statistics); edge node; SAS® LASR™ Analytic Server (head node and data nodes, with data held in memory); Hortonworks Data Platform
  18. SAS ON HADOOP: IN-MEMORY, CLIENT-SERVER, WEB-BASED. Components: web clients; SAS® In-Memory Analytic Products (SAS® Visual Analytics, SAS® Visual Statistics, SAS® In-Memory Statistics); edge node; SAS® LASR™ Analytic Server (head node and data nodes, with data held in memory); Hortonworks Data Platform
  19. SAS ON HADOOP: IN-MEMORY, CLIENT-SERVER, WEB-BASED. Components: web clients; SAS® In-Memory Analytic Products (SAS® Visual Analytics, SAS® Visual Statistics, SAS® In-Memory Statistics); edge node; SAS® LASR™ Analytic Server (head node and data nodes, with data held in memory); Hortonworks Data Platform. Arrows: task, result
  20. SAS ON HADOOP: SUMMARY STATISTICS
     • Web client program: proc imstat; table dat1; summary X / mean; run;
     • Edge node: send request SampleMean(X) to LASR; wait; receive $\bar{X}$ as output
     • SAS® LASR™ Analytic Server head node, which broadcasts the task to the data nodes (data held in memory) and collects the results:
       A) Request $S_X = \sum_i x_i$ from the data nodes
       B) Data node $j$ computes $S_{X,j} = \sum_i x_{i,j}$, $j = 1, 2, 3, 4$
       C) Aggregate $\bar{X} = \sum_j S_{X,j} / N$
       D) Send $\bar{X}$ back to the edge node
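Steps A to D can be mimicked in a few lines of plain Python. This is only a sketch of the partial-sum idea, not LASR code: the four partitions stand in for the four data nodes, and having each node also return its row count so the head node can form N is an assumption not stated on the slide.

```python
def node_partial_sum(partition):
    """Step B: a data node j computes S_{X,j} over its local rows (plus a row count)."""
    return sum(partition), len(partition)


def distributed_mean(partitions):
    """Steps A, C, D: request partial sums, aggregate them, return the mean."""
    partials = [node_partial_sum(p) for p in partitions]  # A) broadcast the request
    total = sum(s for s, _ in partials)                   # C) sum of S_{X,j}
    n = sum(count for _, count in partials)               # N, assumed reported per node
    return total / n                                      # D) X-bar sent to the edge


# Hypothetical column X spread across four data nodes.
PARTITIONS = [[1.0, 2.0, 3.0], [4.0, 5.0], [6.0], [7.0, 8.0, 9.0, 10.0]]

mean_x = distributed_mean(PARTITIONS)
print(mean_x)  # 5.5
```

Only one number per node crosses the network, which is why this kind of statistic suits an in-memory, distributed design.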
  21. SAS ON HADOOP: PRINCIPLES OF THE DESIGN
     • Web clients: thin clients; multi-user; interactive; real-time; point-and-click or programming
     • Receive requests from a UI or SAS program
     • Work on light computations (interactive trees)
     • SAS® LASR™ Analytic Server (head node broadcasts tasks to data nodes, results come back; data in memory): no MapReduce; one data copy; concurrency; temporary tables or columns; MPP or SMP
  22. Use Case #1: Recommendation systems
     Why recommender systems?
     • 5 – 20% increase in sales
     • 60% use “recommendations” to determine suitable product
     • In 2011, 15% of customers admitted to buying recommended products; in 2013, nearly 30%
     Examples:
     • 36 million subscribers; 60-70% view results from recommendation
     • Tens of billions of “thumbs up”; 60 million active users; 3.8 billion hours of music (last quarter)
     • 47% uptick in active users; 67% increase in music served
     • 25% YOY growth; Trip Advisor collaborates with eBay, Orbitz and others
  23. Pre-processing raw data for recommendation
     • Inputs: explicit product ratings (when provided); implicit information (purchase transactions, page views, comments)
     • Raw data: ratings, page views, forum comments
     • Rating matrix over users U101, U102, U103, U104, U105, … and items Epic, X-Men, Hobbit, Argo, Pirates (“?” = unknown), values in rows of five:
         5 2 4 ? ?
         ? ? 5 2 ?
         1 2 ? ? 3
         ? 2 3 1 5
  24. Goal: predict a preference (users U101, U102, U103, U104, U105, …; items Epic, X-Men, Hobbit, Argo, Pirates)
     • Observed ratings (“?” = unknown), values in rows of five:
         5 2 4 ? ?
         ? ? 5 2 ?
         1 2 ? ? 3
         ? 2 3 1 5
     • Ratings with predictions filled in:
         5 2 4 1 3
         4 1 5 2 3
         1 2 4 1 3
         3 2 3 1 5
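One common way to fill in the unknown cells is user-based collaborative filtering, sketched below in plain Python. This is not necessarily the method used in the webinar demo. The `RATINGS` dictionary assumes the slide's values read as one row per user and one column per movie, which is a guess about the original layout, and the fallback to the user's own mean is also an assumption.

```python
import math

# Assumed reading of the slide's matrix: one row per user, "?" cells omitted.
RATINGS = {
    "U101": {"Epic": 5, "X-Men": 2, "Hobbit": 4},
    "U102": {"Hobbit": 5, "Argo": 2},
    "U103": {"Epic": 1, "X-Men": 2, "Pirates": 3},
    "U104": {"X-Men": 2, "Hobbit": 3, "Argo": 1, "Pirates": 5},
}


def cosine(a, b):
    """Cosine similarity over the items both users have rated."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[i] * b[i] for i in shared)
    norm_a = math.sqrt(sum(a[i] ** 2 for i in shared))
    norm_b = math.sqrt(sum(b[i] ** 2 for i in shared))
    return dot / (norm_a * norm_b)


def predict(ratings, user, item):
    """Similarity-weighted average of other users' ratings for the item."""
    if item in ratings[user]:
        return float(ratings[user][item])  # known ratings are returned unchanged
    num = den = 0.0
    for other, theirs in ratings.items():
        if other == user or item not in theirs:
            continue
        weight = cosine(ratings[user], theirs)
        num += weight * theirs[item]
        den += weight
    if den > 0:
        return num / den
    own = list(ratings[user].values())
    return sum(own) / len(own)  # fallback: the user's own mean rating


print(round(predict(RATINGS, "U101", "Argo"), 2))
```

With so few overlapping ratings the similarities are crude; the sketch is meant to show the mechanics, not to reproduce the filled-in matrix on the slide.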
  25. PREDICTIVE ANALYTICS & MACHINE LEARNING: RECOMMENDATION SYSTEM DEMO (MACHINE LEARNING INTEGRATION)
     • Data wrangling: Data Director* (convert JSON files, standardize, load LASR)
     • SAS In-Memory Statistics; SAS Visual Analytics
     • Deployment: relevant, real-time interactions
     • Users: Tony, Patty, George
     • Business: Tony’s Bar, Trees Lounge, The Tropicana, Blue Parrot
     • Business categories: Beer & Wine, Chinese Food, Mexican Food
     • Reviews
     • Topics: LOUNGE, PUB, BEER, DRINK, GAME, MUSIC, PINT, BAND, PLAY, GLASS, VODKA, PATIO, KARAOKE, COCKTAIL, WINGS, LIQUOR, ALCOHOL, BARTENDER, DRAFT, TAP, FUN, LIVE, SCENE, POOL
     * New SAS Product
  26. PREDICTIVE ANALYTICS & MACHINE LEARNING: RECOMMENDATION SYSTEM DEMO. User: John Clark. Review history and recommendations (rank 1, 2, 3, …): 1. Oyster Bar 2. The Brick 3. Trees Lounge 4. Blue Parrot 5. Winchester Club 6. Starlight Lounge 7. Tony’s Bar 8. Lucy’s 9. The Tropicana
  27. Use Case #2: Building a prediction model
     Labeled data (used to build the model; target: Buys organic):
         Customer ID   Age   Gender   Loyalty Card   More features…   Buys organic
         11001         45    M        Yes                             Yes
         11002         43    M        No                              Yes
         11003         65    F        Yes                             No
         …             …     …        …
     Unseen data (the model predicts Buys organic):
         Customer ID   Age   Gender   Home Owner   More features…
         11004         33    M        No           …
         11005         25    F        No           …
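The train-then-score pattern on this slide can be sketched with a one-split decision tree (a decision stump) in plain Python. The stump is a stand-in chosen for brevity, not the model built in the demo, and the `LABELED` rows are hypothetical values loosely modeled on the table above.

```python
# Hypothetical labeled rows: (age, has_loyalty_card, buys_organic)
LABELED = [
    (45, 1, 1), (43, 0, 1), (65, 1, 0), (30, 1, 1),
    (70, 0, 0), (58, 0, 0), (35, 0, 1), (62, 1, 0),
]


def fit_stump(rows):
    """Pick the (feature, threshold, label) split with the fewest training errors."""
    best = None
    n_features = len(rows[0]) - 1
    for f in range(n_features):
        for threshold in sorted({r[f] for r in rows}):
            for label_if_le in (0, 1):
                errors = sum(
                    (label_if_le if r[f] <= threshold else 1 - label_if_le) != r[-1]
                    for r in rows
                )
                if best is None or errors < best[0]:
                    best = (errors, f, threshold, label_if_le)
    return best[1:]


def predict(stump, features):
    """Score one unseen row with the fitted split."""
    f, threshold, label_if_le = stump
    return label_if_le if features[f] <= threshold else 1 - label_if_le


stump = fit_stump(LABELED)
# Unseen customers (age, has_loyalty_card); values are illustrative.
for unseen in [(33, 0), (25, 0), (66, 1)]:
    print(unseen, "->", predict(stump, unseen))
```

On this toy data the best split is on age, so the two younger unseen customers are predicted to buy organic and the older one is not.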
  28. Demo #2: Predicting who buys organic products
     • Dataset: grocery transaction and customer data
     • Goals:
       • Understand customer propensity to buy organic products
       • Develop segments using an interactive decision tree
       • Develop stratified models to predict organic purchases
     • Why is it useful?
       • Inventory strategy
       • Store layout planning
       • Provider management
  29. PREDICTIVE ANALYTICS & MACHINE LEARNING: SAS VISUAL STATISTICS 6.4 – ORGANICS PURCHASE DEMO
  30. Wrap up: SAS and Hortonworks Data Platform
     • Increase productivity for data scientists
       • Users can concurrently and interactively analyze traditional and new data sets in HDP to help businesses quickly discover and capitalize on new business insights from their data
     • Increase efficiency
       • Avoid unnecessary, multiple passes through the data
       • SAS in-memory infrastructure running on top of Hadoop eliminates costly data movement and persists data in-memory for the entire analytics session
     • Capture and analyze new data types
       • HDP + SAS enables data scientists to look at more of their enterprise data
     • Leverage 100 percent open-source Apache Hadoop
       • SAS customers can now embrace Hadoop as a core platform in their data architecture
  31. How should you get started? Next steps…
     • Get the data
     • Formulate a well-defined business objective
     • Data exploration: integrate and fuse heterogeneous data types
     • Pre-process: generate features from raw data
     • Manage the long-tail distribution and data imbalance
     • Modeling: remember model building is cyclical
     • Evaluate your results
     • Work with IT to move analytics from research and into operations
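For the data imbalance step, one simple option among several is to oversample the minority class before modeling. The sketch below repeats minority rows deterministically until the classes are balanced; the slide does not name a technique, so the approach, the `oversample_minority` helper, and the sample rows are all assumptions.

```python
from collections import Counter
from itertools import cycle, islice


def oversample_minority(rows, label_index=-1):
    """Repeat rows of under-represented classes until every class matches the largest."""
    counts = Counter(r[label_index] for r in rows)
    target = max(counts.values())
    balanced = list(rows)
    for label, n in counts.items():
        if n < target:
            pool = [r for r in rows if r[label_index] == label]
            # Cycle through the minority rows to add exactly (target - n) copies.
            balanced.extend(islice(cycle(pool), target - n))
    return balanced


# Hypothetical rows: (age, buys_organic), with far fewer positive examples.
ROWS = [(61, 0), (58, 0), (70, 0), (66, 0), (49, 0), (55, 0), (33, 1), (28, 1)]

balanced = oversample_minority(ROWS)
print(Counter(label for _, label in balanced))
```

Oversampling should be applied to training data only, so that validation results still reflect the real class mix.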
  32. More details…
     • Download the Hortonworks Sandbox
     • Learn Hadoop
     • Build Your Analytic App
     • Try Hadoop 2
     • More about SAS Software & Hortonworks: http://hortonworks.com/partner/SAS/
     • Contact us: events@hortonworks.com
