
MAD Skills: New Analysis Practices for Big Data

Published in: Technology, Business

  1. MAD SKILLS: NEW ANALYSIS PRACTICES FOR BIG DATA. © Copyright 2012 EMC Corporation. All rights reserved.
  2. Big Data Has Arrived. THE DIGITAL UNIVERSE WILL GROW 44X IN THE NEXT 10 YEARS. Source: 2011 IDC Digital Universe Study.
  3. Big Data: Hype or Reality?
     • Do we have a Big Data problem in New Zealand?
     • Do we have a Big Data problem in my organisation?
     • Do I really need to care?
  4. Big Data: Hype or Reality?
     • Do we have a Big Data problem in New Zealand? Maybe.
     • Do we have a Big Data problem in my organisation? Maybe.
     • Do I really need to care? ABSOLUTELY.
     • Big Data Practices is about widely applicable
  5. Today [diagram labels: data sources; images; shadow systems; slow-moving models; slow-moving data; 'shallow' Business Intelligence; EDW; static schemas accrete over time; departmental warehouses]
  6. A Common Analytics Environment: SAS/ACCESS; Data Warehouses, File Systems
  7. A Common Analytics Environment: SAS/ACCESS; SAS/CONNECT; Data Warehouse, File Systems; Cluster Computer
  8. A Common Analytics Environment: SAS/ACCESS; SAS/CONNECT; Data Warehouse, File Systems; Cluster Computer
     • Bag of Tricks: DATA Step Bravado, SQL Pushdown, SASFILE, SAS Views, Compression
  9. A Leaner Configuration: SAS/ACCESS; Parallel Database (Greenplum)
  10. An Integrated Architecture: SAS/ACCESS; SAS/CONNECT; Parallel Database (Greenplum)
  11. SAS High Performance Appliance on Greenplum: SAS/ACCESS; SAS/CONNECT; Parallel Greenplum Database
  12. SAS-GP High-Performance Analytics (Master; Worker Nodes 1 to N): Analytical computation and data request sent to the worker nodes.
  13. SAS-GP High-Performance Analytics (Master; Worker Nodes 1 to N): Data request sent to the database, data slice moved into memory.
  14. SAS-GP High-Performance Analytics (Master; Worker Nodes 1 to N): Analytic processing with internode communication.
  15. SAS-GP High-Performance Analytics (Master; Worker Nodes 1 to N): Worker node results returned to the Master Node, which finalizes the computation.
  16. SAS-GP High-Performance Analytics (Root Node; Worker Nodes 1 to N): Result returned to the client.
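The five steps on slides 12 to 16 have a scatter/gather shape. The following is a minimal sketch of that shape in plain Python, not SAS or Greenplum code. It assumes hypothetical in-memory lists standing in for each worker's data slice and uses mean and variance as the analytic; the internode communication step from slide 14 is omitted because these partial sums do not need it.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical data slices, one per worker node (stand-ins for database segments).
SLICES = [[1.0, 2.0, 3.0], [4.0, 5.0], [6.0, 7.0, 8.0, 9.0]]

def worker(slice_id):
    """Receive the request, read the local slice, and compute partial sums."""
    data = SLICES[slice_id]
    return len(data), sum(data), sum(x * x for x in data)

def master():
    """Send the request to every worker, gather partial results, finalize."""
    with ThreadPoolExecutor(max_workers=len(SLICES)) as pool:
        partials = list(pool.map(worker, range(len(SLICES))))
    n = sum(p[0] for p in partials)
    total = sum(p[1] for p in partials)
    total_sq = sum(p[2] for p in partials)
    mean = total / n
    variance = total_sq / n - mean * mean  # population variance
    return mean, variance

# The value returned here is what would go back to the client.
mean, variance = master()
```

Only three numbers per worker travel back to the master, regardless of how large each slice is.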
  17. How do we sort out this mess? [diagram labels: data sources; images; shadow systems; slow-moving models; slow-moving data; 'shallow' Business Intelligence; static schemas accrete over time; departmental warehouses]
  18. Keep the Enterprise Data Warehouse
      • Single Source of Truth
      • Heavy data governance and quality
      • Operational reporting
      • Financial consolidation
  19. Add an Analytics Data Cloud as a Complement (SAS/Greenplum/Hadoop on commodity hardware, virtual machines, public cloud)
      • Enterprise Data Warehouse: Single Source of Truth; heavy data governance and quality; operational reporting; financial consolidation
      • Analytics Data Cloud: source of all raw data (often 10X size of EDW); self-service infrastructure to support multiple marts and sandboxes; ad hoc, business-led analytics solutions
  20. MAD Analytics
  21. Magnetic [diagram labels: simple linear models; trend analysis; analytics; model design; model selection; data; Chorus; fast ETL/ELT; agile; ADC PLATFORM]
  22. Agile: get data into the cloud; analyze and model in the cloud; push results back into the cloud
  23. Deep
      • Past, facts: What happened, where and when?
      • Past, interpretation: How and why did it happen?
      • Future, facts: What will happen?
      • Future, interpretation: How can we do better?
  24. Different Phases of Analytics (Data Prep, Data Exploration, Modeling, Model Fit, Scoring)
      • DATA EXPLORATION: Frequency, Histogram, Bar Chart, Box Plot Chart, Correlation Matrix
      • PREDICTIVE MODEL: Linear Regression, Logistic Regression, Naïve Bayes Classifier, Decision Trees, Neural Networks
      • SCORING: Linear Regression, Logistic Regression, Naïve Bayes Classifier, Decision Trees, Neural Networks
      • TRANSFORMATION: Aggregation, Row Filtering, Deriving New Variables, Pivoting, Normalizing
      • DATA MINING: Association Rule, K-means Clustering
      • MODEL FIT STATISTICS: Goodness of Fit, ROC, Significance statistics for all independent variables
  25. In-Database Machine Learning
      • Goal: Build models using all available data.
      • Principle: Avoid using samples if possible.
      • Principle: Bring computation to data, not the other way round.
      • In practice: Write machine learning algorithms in (parallelised) data languages like SQL, SAS, and MapReduce.
  26. Design Pattern – Online Learning
      • Process data one record at a time using an incrementally maintained model; adjust the model every time we make a prediction error.
      • Examples: perceptron, online SVMs, Bayesian filters, etc.
      • Such algorithms can be implemented using SAS DATA steps or SQL aggregate functions.
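The online learning pattern on slide 26 can be sketched with a perceptron. This is an illustrative sketch in plain Python rather than a SAS DATA step or SQL aggregate; the toy data stream and the function names are assumptions made for the example. The `update` function plays the role of an aggregate's state-transition function: it takes the current model and one record, and returns the next model.

```python
from functools import reduce

def predict(model, features):
    """Return +1 or -1 from the sign of the linear score."""
    weights, bias = model
    score = sum(w * f for w, f in zip(weights, features)) + bias
    return 1 if score > 0 else -1

def update(model, example):
    """Fold one labelled record into the model; change it only on a prediction error."""
    features, label = example  # label is +1 or -1
    if predict(model, features) == label:
        return model
    weights, bias = model
    new_weights = [w + label * f for w, f in zip(weights, features)]
    return new_weights, bias + label

# Hypothetical, linearly separable stream: only the point (1, 1) is positive.
stream = [((0.0, 0.0), -1), ((0.0, 1.0), -1), ((1.0, 0.0), -1), ((1.0, 1.0), 1)]

# Several passes over the stream, one record at a time.
initial = ([0.0, 0.0], 0.0)
model = reduce(update, stream * 10, initial)
```

Because the model is the only state carried between records, the same logic fits naturally into a row-at-a-time DATA step or a user-defined aggregate.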
  27. Design Pattern – Parallel Ensemble Learning
      • Break a (large) dataset into i.i.d. subsets residing on each node, learn a model on each subset in parallel, and then combine the models appropriately.
      • Examples: random forests, ensembles of SVMs, etc.
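As a rough illustration of slide 27, the sketch below shuffles a toy dataset into three subsets, fits a one-dimensional threshold classifier (a decision stump) on each subset in parallel, and combines the three models by majority vote. The data, the stump learner, and the voting rule are all assumptions chosen for brevity; the slide's own examples (random forests, ensembles of SVMs) use richer base models.

```python
import random
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def fit_stump(subset):
    """Learn a threshold halfway between the largest negative and smallest positive.

    Assumes the subset contains both classes and is separable by a threshold.
    """
    negatives = [x for x, label in subset if label == 0]
    positives = [x for x, label in subset if label == 1]
    return (max(negatives) + min(positives)) / 2.0

def ensemble_predict(thresholds, x):
    """Combine the per-subset models by majority vote."""
    votes = Counter(1 if x > t else 0 for t in thresholds)
    return votes.most_common(1)[0][0]

# Hypothetical dataset: label is 1 when x > 0.5.
points = [i / 300.0 for i in range(300)]
data = [(x, 1 if x > 0.5 else 0) for x in points]

# A seeded shuffle followed by striding gives three random subsets, one per "node".
random.Random(0).shuffle(data)
subsets = [data[k::3] for k in range(3)]

# Fit one model per subset in parallel.
with ThreadPoolExecutor(max_workers=3) as pool:
    thresholds = list(pool.map(fit_stump, subsets))
```

No subset ever sees another subset's rows; only the fitted thresholds are brought together at the end.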
  28. Design Pattern – MapReduce
      • Repeatedly apply a Map function to transform (local) chunks of data and then use a Reduce function to consolidate the transformed results.
      • Examples: parallel LDA, k-Means, Naive Bayes, etc.
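Taking k-Means from slide 28's examples, a minimal single-process sketch of the map/reduce structure might look like the following. The one-dimensional data, the chunking, and the function names are hypothetical; a real deployment would run the map step on each node against its local chunk.

```python
from functools import reduce

def map_chunk(chunk, centroids):
    """Map: for one local chunk, emit a (sum, count) pair per centroid."""
    partial = [(0.0, 0) for _ in centroids]
    for x in chunk:
        nearest = min(range(len(centroids)), key=lambda k: abs(x - centroids[k]))
        s, c = partial[nearest]
        partial[nearest] = (s + x, c + 1)
    return partial

def reduce_partials(a, b):
    """Reduce: merge two lists of per-centroid (sum, count) pairs."""
    return [(sa + sb, ca + cb) for (sa, ca), (sb, cb) in zip(a, b)]

def kmeans(chunks, centroids, iterations=5):
    """Repeat map and reduce, moving each centroid to the mean of its points."""
    for _ in range(iterations):
        partials = [map_chunk(chunk, centroids) for chunk in chunks]
        totals = reduce(reduce_partials, partials)
        # Keep the old centroid if no point was assigned to it.
        centroids = [s / c if c else old for (s, c), old in zip(totals, centroids)]
    return centroids

# Hypothetical chunks, as if each lived on a different node.
chunks = [[1.0, 2.0, 9.0], [1.5, 10.0, 11.0], [0.5, 10.0]]
centroids = kmeans(chunks, [0.0, 5.0])
```

The reduce step works because sums and counts are associative, so partial results can be merged in any order.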
  29. Design Pattern – Prediction Markets
  30. Japanese Telco: What People Are Talking About!
  31. Traffic Network Modelling
  32. Massively Parallel Model Learning
      • Solving tens of thousands of statistical modelling problems, one for each road in the city, in parallel:
          libname adc greenplm server=gplum db=traffic port=5432 user=keesiong …
          proc sql;
            select origin, dest,
                   linregr(travel_time, array[peak_period(entry_time), …, origin_vol, dest_vol])
            from adc.route_travel_info
            group by origin, dest;
      • A model: t(x) = 466 + 7.72 peakPeriod(x) + 22.5 workDay(x) + 0.378 originVol(x) + 0.691 destVol(x)
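The idea behind slide 32's query, one regression per `(origin, dest)` group computed inside the database, can be approximated without Greenplum. The sketch below uses Python's built-in `sqlite3` as a stand-in and replaces the slide's `linregr` aggregate with plain SQL sums for a single predictor. The table and column names follow the slide; the rows, the single-predictor simplification, and the closed-form fit in Python are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE route_travel_info "
    "(origin TEXT, dest TEXT, origin_vol REAL, travel_time REAL)"
)

# Two hypothetical routes with known, noise-free linear relationships.
rows = []
for v in range(1, 11):
    rows.append(("A", "B", float(v), 100.0 + 2.0 * v))
    rows.append(("B", "C", float(v), 50.0 + 0.5 * v))
conn.executemany("INSERT INTO route_travel_info VALUES (?, ?, ?, ?)", rows)

# The database computes the sufficient statistics for every route in one pass.
query = """
SELECT origin, dest, COUNT(*), SUM(origin_vol), SUM(travel_time),
       SUM(origin_vol * origin_vol), SUM(origin_vol * travel_time)
FROM route_travel_info
GROUP BY origin, dest
"""

# Only one small row per route leaves the database; the fit is closed-form.
models = {}
for origin, dest, n, sx, sy, sxx, sxy in conn.execute(query):
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    models[(origin, dest)] = (intercept, slope)
```

In a parallel database the `GROUP BY` is what lets each route's model be fitted independently across segments, which is the point the slide makes about solving many modelling problems at once.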
  33. It’s MAD, but is it Mad? (SAS/Greenplum/Hadoop on commodity hardware, virtual machines, public cloud)
      • Enterprise Data Warehouse: Single Source of Truth; 1 Logical Model; heavy data governance and quality; operational reporting; financial consolidation
      • Analytics Data Cloud: source of all the raw data (often 10X size of the EDW); self-service infrastructure to support multiple marts and sandboxes; rapid analytic iteration, and business-led solutions
  34. Public Cloud Computing Services
  35. Democratisation of Data
  36. Democratisation of Data
  37. Democratisation of Data
  38. Helping Organizations Evolve From This… [diagram labels: Line of Business User; Business Intelligence Analyst; Business; I.T. Department; Database Administrator]
  39. To This… [diagram labels: Line of Business User; Business Intelligence Analysts; Data Scientists; Data Platform Administrator]
  40. Top 3 Steps You Should Take On Your Journey To Big Data Analytics
  41. 3. Put all your data to work.
  42. 2. Have a data strategy. Model less, iterate more.
  43. 1. First invest in people, then technology.
  44. A journey of a thousand miles begins with a single step - Lao-tzu, Chinese philosopher (531 BC). First Step: Walk towards the SAS-Greenplum Technical Session.
