BIG Data Science: A Path Forward

This talk uses a case study to demonstrate core data science capabilities in Big Data, infrastructure requirements, and talent profiles that translate to early success. Using the challenge of classifying events in a consumer-oriented website, the discussion is for a wide audience:

- Practitioners will learn two key techniques for early success
- Technologists will learn how teams rely on key infrastructure and where engineers play a valuable role in data science
- Hiring managers will expand their knowledge of the skills required to bring business value with data

Published in: Technology, Education

  1. BIG DATA SCIENCE: A PATH FORWARD (June 2013)
  2. CONFIDENTIAL | linkedin.com/in/danmallinger/ | @danmallinger | www.thinkbiganalytics.com
     - Data Science Lead @ Think Big
     - Product/Brand Obsessive
     - Teacher
     - Occasional Engineer
  3. TODAY: High-level exploration of the skills, tools, and techniques needed to achieve early success and to help you build your data science practice.
  4. INFRASTRUCTURE, TALENT, & CAPABILITIES: Understand our organizational needs for data science.
     - Infrastructure: Technological tools and platforms (Hadoop, NoSQL, Analytics, SQL/MPP, Real Time).
     - Talent: Staff hired and trained (Scripting, MapReduce, Data Exploration, Basic Modeling, PhD Math).
     - Capabilities: Data science techniques utilized (Visualization, Clustering, Categorization, Continuous Models, Text Analysis).
  5. ANALYTICS TOOLS
     - Boxed Solutions: Mahout & Platform
     - Toolkits: RHadoop, Scikit, etc.
     - You will need toolkits to solve unique problems, but smart techniques make that easier.
     - Boxed solutions are limited but can be a good source of early velocity.
  6. DATA
     - Presenter: Gigabytes from Stack Overflow. Questions from users, with metadata. Users have reputations. Questions are open or closed.
     - Audience: Follow along, thinking about your data, to learn in a familiar context and plan.
  7. STEP 1: EXPLORE
     - Patterns through Hive: select count(1) as total, sum(has_code), avg(body_count), stddev_samp(body_count), corr(reputation, owner_questions), histogram_numeric(body_count, 10) from questions;
     - Patterns through Tableau
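The aggregates in the Hive query on this slide can be mirrored on a small local sample, which is a handy way to sanity-check a query before running it at scale. The following is a minimal sketch, not the presenter's code: the sample rows are invented for illustration, the column names are taken from the query, and the histogram uses equal-width bins, whereas Hive's `histogram_numeric` chooses non-uniform bins.

```python
import math
from collections import Counter

# Hypothetical sample rows; the field names follow the slide's Hive query.
questions = [
    {"has_code": 1, "body_count": 120, "reputation": 50, "owner_questions": 3},
    {"has_code": 0, "body_count": 40, "reputation": 10, "owner_questions": 1},
    {"has_code": 1, "body_count": 300, "reputation": 900, "owner_questions": 12},
    {"has_code": 0, "body_count": 80, "reputation": 200, "owner_questions": 5},
]

def mean(xs):
    return sum(xs) / len(xs)

def stddev_samp(xs):
    """Sample standard deviation (divides by n - 1)."""
    m = mean(xs)
    return math.sqrt(sum((x - m) ** 2 for x in xs) / (len(xs) - 1))

def corr(xs, ys):
    """Pearson correlation coefficient."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def histogram(xs, bins):
    """Equal-width bin counts (a simplification of histogram_numeric)."""
    lo, hi = min(xs), max(xs)
    width = (hi - lo) / bins or 1.0
    counts = Counter(min(int((x - lo) / width), bins - 1) for x in xs)
    return [counts.get(b, 0) for b in range(bins)]

def explore(rows, bins=10):
    body = [r["body_count"] for r in rows]
    return {
        "total": len(rows),
        "sum_has_code": sum(r["has_code"] for r in rows),
        "avg_body_count": mean(body),
        "stddev_body_count": stddev_samp(body),
        "corr_reputation_owner_questions": corr(
            [r["reputation"] for r in rows],
            [r["owner_questions"] for r in rows],
        ),
        "histogram_body_count": histogram(body, bins),
    }

stats = explore(questions)
```

Calling `explore(questions)` returns one dictionary per table, analogous to the single row the Hive query produces.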
  8. STEP 2: FEATURE BUILDING
     - Summaries of unstructured data
     - Time-since metrics
     - select transform(…) using ‘python …’
     - Clustering: Browsing cohorts
     - /bin/mahout canopy
     - SQL Windowing
     - Cross-Record Features
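Two of the feature-building ideas on this slide can be sketched in a few lines of Python. The first function is the kind of row-at-a-time script a Hive `transform(…) using` clause could stream rows through; the second computes a time-since metric per user, similar in spirit to a SQL window function. Both are illustrations under assumptions: the two-column input layout (id, body), the `<code>` marker used to detect code, and all function names are hypothetical and do not come from the talk.

```python
import sys

def featurize(line):
    """Summarize one tab-separated row (question id, body) as simple features."""
    question_id, body = line.rstrip("\n").split("\t", 1)
    has_code = int("<code>" in body)   # assumed marker for embedded code
    body_count = len(body.split())     # whitespace-delimited token count
    return "\t".join([question_id, str(has_code), str(body_count)])

def transform(lines, out=sys.stdout):
    """Streaming loop: one output row per non-empty input row."""
    for line in lines:
        if line.strip():
            out.write(featurize(line) + "\n")

def time_since_previous(events):
    """Cross-record feature: seconds since the same owner's previous question.

    events is an iterable of (owner_id, timestamp) pairs; the first question
    from each owner gets None.
    """
    last_seen = {}
    rows = []
    for owner, ts in sorted(events, key=lambda e: e[1]):
        prev = last_seen.get(owner)
        rows.append((owner, ts, None if prev is None else ts - prev))
        last_seen[owner] = ts
    return rows

# Under Hive streaming, the script would finish with: transform(sys.stdin)
```

Keeping the per-row logic in `featurize` makes it easy to test locally before wiring the script into a query.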
  9. PARALLEL MODELS IN HADOOP
     - Sample (don’t parallelize)
     - Naturally parallel
     - SVD
     - Random Forests
     - Estimators and Ensembles
     - Bootstrapping
     - Localizing
     - Advanced Parallelization
     - Linear models with SGD
     - Neural networks
  10. STEP 3: STRUCTURED MODEL (BAGGING)
     - A single R model is run many times over samples and the results are aggregated: m <- C5.0(status ~ …)
     - Bagging a Model
     - Mapper 1: Define n reducer keys. Send any record to reducer i with probability p.
     - Reducer 1: Key: Id of sample. Value: List of records. Perform analysis over records.
     - Reducer 2: Key: One. Value: List of models. Aggregate the models (e.g. average).
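The mapper and reducer flow for bagging can be simulated in a single process to see how the pieces fit together. This is a minimal sketch under stated assumptions: a one-feature threshold rule stands in for the C5.0 tree named on the slide, records are `(x, label)` pairs, and every function name here is hypothetical.

```python
import random
from collections import defaultdict

def mapper(records, n_samples, p, rng):
    """Mapper 1: send each record to each of n sample keys with probability p."""
    for record in records:
        for sample_id in range(n_samples):
            if rng.random() < p:
                yield sample_id, record

def fit_stump(records):
    """Stand-in model: the threshold on x that misclassifies the fewest records."""
    best = None
    for threshold in sorted({x for x, _ in records}):
        errors = sum((x >= threshold) != bool(y) for x, y in records)
        if best is None or errors < best[0]:
            best = (errors, threshold)
    return best[1]

def reducer_fit(pairs):
    """Reducer 1: group records by sample id and fit one model per sample."""
    groups = defaultdict(list)
    for sample_id, record in pairs:
        groups[sample_id].append(record)
    return [fit_stump(recs) for recs in groups.values()]

def reducer_aggregate(models):
    """Reducer 2: all models arrive under one key; average them."""
    return sum(models) / len(models)

def bagged_threshold(records, n_samples=10, p=0.5, seed=0):
    rng = random.Random(seed)
    models = reducer_fit(mapper(records, n_samples, p, rng))
    return reducer_aggregate(models)

# Toy data: the label is 1 when x is at least 5.
records = [(x, int(x >= 5)) for x in range(10)]
threshold = bagged_threshold(records)
```

Averaging works here because each model is a single number; for tree models such as C5.0, aggregation would more typically average or vote over predictions.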
  11. WHERE ARE WE? Using Big Data, we’ve created a structured model to flag questions that won’t be closed. But we haven’t used unstructured data.
  12. TEXT ANALYSIS
     - Is “the big dog” really different from “dog is big”?
     - How about “I like eggs but hate tofu” and “I hate eggs but like tofu”?
     - Language has lexical and syntactical features
     - Different techniques leverage these in different ways
     - Bag of Words: Structure doesn’t matter
     - n-gram: Structure matters (but not that much)
     - Feature Extraction: BACON! BACON! BACON!
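The eggs-and-tofu pair from this slide makes the difference between the two representations concrete. The sketch below assumes simple lowercase whitespace tokenization, which is a simplification; the function names are illustrative.

```python
from collections import Counter

def tokens(text):
    """Lowercase, whitespace-delimited tokens."""
    return text.lower().split()

def bag_of_words(text):
    """Token counts only; word order is discarded."""
    return Counter(tokens(text))

def ngrams(text, n=2):
    """Counts of n consecutive tokens; local word order is retained."""
    toks = tokens(text)
    return Counter(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))

a = "I like eggs but hate tofu"
b = "I hate eggs but like tofu"

same_bag = bag_of_words(a) == bag_of_words(b)   # identical word counts
same_bigrams = ngrams(a) == ngrams(b)           # bigrams such as (like, eggs) differ
```

The two sentences are indistinguishable as bags of words, while their bigram counts differ, which is what "structure matters (but not that much)" refers to.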
  13. STEP 4: UNSTRUCTURED MODEL
     - Similar to Hadoop’s WordCount
     - Create counts for token/category pairs
     - Use counts to calculate Information Gain
     - MR Job 1: Calculate information gain (IG) for all tokens.
     - MR Job 2: Select tokens with largest IG. Create structured data for record, tokens: question #4 | 0 | 1 | 0 | 1 | 1
     - MR Job 3: Build a classifier over the newly structured data (prior slides)
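Information gain for a token is the entropy of the category distribution minus the weighted entropy after splitting documents by whether they contain the token. The sketch below computes it in memory from token/category counts and then builds the 0/1 rows described for MR Job 2. The toy documents and function names are invented for illustration; a real job would compute the same counts with MapReduce.

```python
import math
from collections import Counter

def entropy(counts):
    """Shannon entropy in bits of a collection of category counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c)

def information_gain(docs, token):
    """docs is a list of (set_of_tokens, category) pairs."""
    all_cats = Counter(cat for _, cat in docs)
    with_tok = Counter(cat for toks, cat in docs if token in toks)
    without_tok = all_cats - with_tok
    gain = entropy(all_cats.values())
    for part in (with_tok, without_tok):
        size = sum(part.values())
        if size:
            gain -= (size / len(docs)) * entropy(part.values())
    return gain

def top_tokens(docs, k):
    """Select the k tokens with the largest information gain."""
    vocab = sorted({t for toks, _ in docs for t in toks})
    return sorted(vocab, key=lambda t: information_gain(docs, t), reverse=True)[:k]

def to_row(doc_tokens, selected):
    """Structured 0/1 row: does the document contain each selected token?"""
    return [int(t in doc_tokens) for t in selected]

# Hypothetical toy corpus of questions labeled closed or open.
docs = [
    ({"homework", "plz"}, "closed"),
    ({"plz", "urgent"}, "closed"),
    ({"python", "error"}, "open"),
    ({"python", "traceback"}, "open"),
]
selected = top_tokens(docs, 2)
rows = [to_row(toks, selected) for toks, _ in docs]
```

The resulting rows are ordinary structured data, so the classifier from the earlier step can be trained on them unchanged.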
  14. WHERE ARE WE? We’ve created two models: one structured, one unstructured. But they don’t work together.
  15. STEP 5: ENSEMBLE MODEL
     - Join many models together by using their output as input to an ensemble model.
     - Best when models perform differently
     - Exploit differences with nonlinearities, like interaction effects.
     - Ensembling
     - Mapper 1: Load multiple models. Score the models per record and output.
     - Reducer 1: Key: Id of record. Value: List of model outputs. Join model outputs to make new records.
     - MR Job 2: Build a model over the output data as if it were raw data.
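The ensembling flow can be simulated locally: score every record with each base model, join the outputs by record id, then fit a second-stage model on those outputs. This is a sketch under assumptions, not the talk's implementation: the two base models are hypothetical rules, and the second-stage model is a lookup table of majority labels per output combination, chosen because it captures interactions between base models.

```python
from collections import Counter, defaultdict

def score_records(records, models):
    """Mapper 1: emit (record_id, (model_name, output)) for every model."""
    for rec_id, features in records:
        for name, model in models.items():
            yield rec_id, (name, model(features))

def join_outputs(pairs, model_names):
    """Reducer 1: join model outputs per record into one new row."""
    grouped = defaultdict(dict)
    for rec_id, (name, output) in pairs:
        grouped[rec_id][name] = output
    return {
        rec_id: tuple(outs[n] for n in model_names)
        for rec_id, outs in grouped.items()
    }

def fit_meta_model(rows, labels, default=0):
    """MR Job 2: learn the majority label for each combination of outputs."""
    votes = defaultdict(Counter)
    for rec_id, row in rows.items():
        votes[row][labels[rec_id]] += 1
    table = {row: c.most_common(1)[0][0] for row, c in votes.items()}
    return lambda row: table.get(row, default)

# Hypothetical base models: one structured, one text-based.
models = {
    "structured": lambda f: int(f["reputation"] > 100),
    "text": lambda f: int("plz" not in f["text"].split()),
}
records = [
    (1, {"reputation": 500, "text": "python traceback"}),
    (2, {"reputation": 5, "text": "plz help"}),
    (3, {"reputation": 500, "text": "plz urgent"}),
    (4, {"reputation": 5, "text": "python error"}),
]
labels = {1: 1, 2: 0, 3: 0, 4: 0}  # 1 means the question stays open

rows = join_outputs(score_records(records, models), list(models))
ensemble = fit_meta_model(rows, labels)
```

In this toy data the label is 1 only when both base models agree on 1, an interaction that neither base model expresses alone but the lookup table picks up.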
  16. WHERE ARE WE? We’ve created two models, one structured and one unstructured, and have ensembled them to create a single, powerful model and solve a practical business problem.
  17. HOW DID WE GET HERE? This required simple infrastructure, a blend of analysis and scripting skills, and an understanding of BIG data science techniques, but not a team of PhDs or a billion dollars.
  18. Questions? www.thinkbiganalytics.com | @danmallinger
