- Practitioners will learn two key techniques for early success

- Technologists will learn how teams rely on key infrastructure and where engineers play a valuable role in data science

- Hiring managers will expand their knowledge of the skills required to bring business value with data

- 1. BIG DATA SCIENCE: A PATH FORWARD. June 2013
- 2. CONFIDENTIAL | linkedin.com/in/danmallinger/ | @danmallinger | www.thinkbiganalytics.com | Data Science Lead @ Think Big, Product/Brand Obsessive, Teacher, Occasional Engineer
- 3. TODAY: High-level exploration of the skills, tools, and techniques needed to achieve early success and to help you build your data science practice.
- 4. INFRASTRUCTURE, TALENT, & CAPABILITIES: Understand our organizational needs for data science.
  - Infrastructure: Technological tools and platforms (Hadoop, NoSQL, Analytics, SQL/MPP, Real Time).
  - Talent: Staff hired and trained (Scripting, MapReduce, Data Exploration, Basic Modeling, PhD Math).
  - Capabilities: Data science techniques utilized (Visualization, Clustering, Categorization, Continuous Models, Text Analysis).
- 5. ANALYTICS TOOLS
  - Boxed Solutions: Mahout & Platform
  - Toolkits: RHadoop, Scikit, etc.
  - You will need toolkits to solve unique problems, but smart techniques make that easier.
  - Boxed solutions are limited but can be a good source of early velocity.
- 6. DATA
  - Presenter: Gigabytes from Stack Overflow. Questions from users, with metadata. Users have reputations. Questions are open or closed.
  - Audience: Follow along, thinking about your data, to learn in a familiar context and plan.
- 7. STEP 1: EXPLORE
  - Patterns through Hive: `select count(1) as total, sum(has_code), avg(body_count), stddev_samp(body_count), corr(reputation, owner_questions), histogram_numeric(body_count, 10) from questions;`
  - Patterns through Tableau
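The aggregates in the slide 7 Hive query can be reproduced on a small local sample to see what each one returns. The sketch below is only an illustration: the rows are made up, the column order is an assumption, and the histogram uses equal-width bins, whereas Hive's `histogram_numeric` picks non-uniformly spaced bins.

```python
import math
import statistics

# Made-up rows standing in for the `questions` table (assumed column order):
# (has_code, body_count, reputation, owner_questions)
ROWS = [
    (1, 120, 10, 1),
    (0, 45, 200, 8),
    (1, 300, 1500, 40),
    (0, 80, 50, 3),
    (1, 150, 700, 20),
]


def pearson(xs, ys):
    """Pearson correlation, the quantity Hive's corr() reports."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def equal_width_histogram(values, bins):
    """Count values per equal-width bin (a simplification of histogram_numeric)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1.0  # avoid a zero width when all values match
    counts = [0] * bins
    for v in values:
        counts[min(int((v - lo) / width), bins - 1)] += 1
    return counts


def summarize(rows, bins=10):
    has_code, body, reputation, owner_questions = zip(*rows)
    return {
        "total": len(rows),
        "sum_has_code": sum(has_code),
        "avg_body_count": statistics.mean(body),
        "stddev_samp_body_count": statistics.stdev(body),  # sample std dev
        "corr_reputation_owner_questions": pearson(reputation, owner_questions),
        "histogram_body_count": equal_width_histogram(body, bins),
    }


summary = summarize(ROWS)
print(summary)
```

Running the same numbers locally on a sample is a cheap way to sanity-check a query before paying for a full cluster run.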
- 8. STEP 2: FEATURE BUILDING
  - Summaries of unstructured data
  - Time-since metrics
  - `select transform(…) using 'python …'`
  - Clustering: browsing cohorts (`/bin/mahout canopy`)
  - SQL windowing
  - Cross-record features
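A "time-since" metric is a cross-record feature: each row needs the previous row for the same owner. The sketch below shows one way to compute it in plain Python, roughly what a SQL window function or a script called through `select transform(…)` would do. The field names (`owner_id`, `question_id`, `created_ts`) and the sample rows are hypothetical, not taken from the deck.

```python
from itertools import groupby
from operator import itemgetter


def time_since_previous(rows):
    """Map each question id to the seconds since the same owner's previous question.

    rows: iterable of (owner_id, question_id, created_ts) tuples (assumed layout).
    The first question of each owner has no predecessor and maps to None.
    """
    result = {}
    # Sort by owner, then timestamp, so each owner's questions are contiguous
    # and in chronological order (like PARTITION BY owner ORDER BY created).
    ordered = sorted(rows, key=itemgetter(0, 2))
    for _, group in groupby(ordered, key=itemgetter(0)):
        previous_ts = None
        for _, question_id, created_ts in group:
            result[question_id] = (
                None if previous_ts is None else created_ts - previous_ts
            )
            previous_ts = created_ts
    return result


rows = [
    ("u1", "q1", 100),
    ("u2", "q2", 150),
    ("u1", "q3", 400),
    ("u1", "q4", 460),
]
features = time_since_previous(rows)
print(features)
```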
- 9. PARALLEL MODELS IN HADOOP
  - Sample (don't parallelize)
  - Naturally parallel
  - SVD
  - Random Forests
  - Estimators and Ensembles
  - Bootstrapping
  - Localizing
  - Advanced Parallelization
  - Linear models with SGD
  - Neural networks
- 10. STEP 3: STRUCTURED MODEL (BAGGING)
  - Single R model run many times over samples and aggregated: `m <- C5.0(status ~ …)`
  - Bagging a Model
    - Mapper 1: Define n reducer keys. Send any record to reducer i with probability p.
    - Reducer 1: Key: ID of sample. Value: List of records. Perform analysis over records.
    - Reducer 2: Key: One. Value: List of models. Aggregate the models (e.g. average).
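The mapper/reducer recipe on slide 10 can be simulated in a single process to see how the pieces fit. This is a sketch under stated assumptions, not the deck's implementation: the C5.0 tree in R is swapped for a one-feature threshold stump, the records are synthetic `(x, label)` pairs, and all function names are made up for illustration. Note that sending each record to each sample with probability p gives subsamples without replacement, which differs from a classic bootstrap.

```python
import random
from collections import defaultdict


def mapper(records, n_samples, p, rng):
    """Mapper 1: send each record to each of n sample keys with probability p."""
    for record in records:
        for sample_id in range(n_samples):
            if rng.random() < p:
                yield sample_id, record


def train_stump(sample):
    """Reducer 1 stand-in model: threshold on x with the fewest training errors."""
    best_threshold, best_errors = None, None
    for threshold in sorted({x for x, _ in sample}):
        errors = sum((x >= threshold) != bool(y) for x, y in sample)
        if best_errors is None or errors < best_errors:
            best_threshold, best_errors = threshold, errors
    return best_threshold


def bag(records, n_samples=5, p=0.5, seed=0):
    """Group mapper output by sample id, then fit one model per sample."""
    rng = random.Random(seed)
    groups = defaultdict(list)  # plays the role of the shuffle phase
    for sample_id, record in mapper(records, n_samples, p, rng):
        groups[sample_id].append(record)
    return [train_stump(sample) for sample in groups.values()]


def bagged_predict(thresholds, x):
    """Reducer 2: aggregate the models by averaging their 0/1 votes."""
    votes = [x >= t for t in thresholds]
    return sum(votes) / len(votes)


# Synthetic data: the label is 1 exactly when x >= 100.
records = [(x, int(x >= 100)) for x in range(200)]
thresholds = bag(records)
print(thresholds, bagged_predict(thresholds, 150))
```

Averaging votes is the simplest aggregation; with real models the second reducer could average probabilities or coefficients instead.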
- 11. WHERE ARE WE? Using Big Data, we've created a structured model to flag questions that won't be closed. But we haven't used unstructured data.
- 12. TEXT ANALYSIS
  - Is "the big dog" really different from "dog is big"?
  - How about "I like eggs but hate tofu" and "I hate eggs but like tofu"?
  - Language has lexical and syntactical features
  - Different techniques leverage these in different ways
    - Bag of Words: Structure doesn't matter
    - n-gram: Structure matters (but not that much)
    - Feature Extraction: BACON! BACON! BACON!
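The eggs-and-tofu example from slide 12 can be checked directly. The sketch below uses a naive whitespace tokenizer (an assumption; real pipelines usually handle punctuation and stemming) and shows that a bag of words cannot tell the two sentences apart, while bigrams can.

```python
from collections import Counter


def bag_of_words(text):
    """Token counts only: word order is discarded."""
    return Counter(text.lower().split())


def ngrams(text, n=2):
    """Counts of n consecutive tokens: local word order is kept."""
    tokens = text.lower().split()
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


a = "I like eggs but hate tofu"
b = "I hate eggs but like tofu"

print(bag_of_words(a) == bag_of_words(b))  # True: same words, same counts
print(ngrams(a) == ngrams(b))              # False: ("like", "eggs") vs ("hate", "eggs")
```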
- 13. STEP 4: UNSTRUCTURED MODEL
  - Information Gain
    - Similar to Hadoop's WordCount
    - Create counts for token/category pairs
    - Use counts to calculate Information Gain
  - MR Job 1: Calculate information gain (IG) for all tokens.
  - MR Job 2: Select tokens with largest IG. Create structured data for record, tokens: `question #4 | 0 | 1 | 0 | 1 | 1`
  - MR Job 3: Build a classifier over the newly structured data (prior slides)
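The first two jobs on slide 13 can be sketched in memory: compute information gain per token from token/category counts, keep the top tokens, and emit one 0/1 row per record. The four documents, their labels, and the function names below are invented for illustration, and ties in IG are broken alphabetically, which is an arbitrary choice.

```python
import math
from collections import Counter


def entropy(counts):
    """Shannon entropy (in bits) of a list of category counts."""
    counts = list(counts)
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c)


def information_gain(docs, token):
    """IG of splitting docs on token presence. docs: list of (token_set, label)."""
    labels = Counter(label for _, label in docs)
    present = Counter(label for tokens, label in docs if token in tokens)
    absent = Counter(label for tokens, label in docs if token not in tokens)
    conditional = sum(
        sum(part.values()) / len(docs) * entropy(part.values())
        for part in (present, absent)
        if part
    )
    return entropy(labels.values()) - conditional


def select_tokens(docs, k):
    """Job 2, first half: keep the k tokens with the largest IG."""
    vocabulary = set().union(*(tokens for tokens, _ in docs))
    return sorted(vocabulary, key=lambda t: (-information_gain(docs, t), t))[:k]


def to_row(tokens, selected):
    """Job 2, second half: one 0/1 column per selected token."""
    return [int(t in tokens) for t in selected]


# Invented toy corpus: (tokens in the question, open/closed status).
docs = [
    ({"java", "error", "stacktrace"}, "open"),
    ({"python", "error", "traceback"}, "open"),
    ({"java", "homework", "please"}, "closed"),
    ({"python", "homework", "urgent"}, "closed"),
]
selected = select_tokens(docs, k=2)
rows = [to_row(tokens, selected) for tokens, _ in docs]
print(selected, rows)
```

The resulting rows are ordinary structured data, so the bagging approach from the earlier step applies to them unchanged.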
- 14. WHERE ARE WE? We've created two models: one structured, one unstructured. But they don't work together.
- 15. STEP 5: ENSEMBLE MODEL
  - Ensembling
    - Join many models together by using their output as input to an ensemble model.
    - Best when models perform differently
    - Exploit differences with nonlinearities, like interaction effects.
  - Mapper 1: Load multiple models. Score the models per record and output.
  - Reducer 1: Key: ID of record. Value: List of model outputs. Join model outputs to make new records.
  - MR Job 2: Build a model over the output data as if it were raw data.
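The ensembling flow on slide 15 can be sketched end to end in a few functions. Everything here is a stand-in chosen for illustration: the two base models are hard-coded rules, the records and labels are invented, and the second-stage model is a lookup table keyed on the joined outputs. A lookup table is a deliberately simple nonlinear learner, so it can capture the interaction where a question is closed only when both base signals point that way.

```python
from collections import Counter, defaultdict


# Hypothetical base models: each maps a record to a 0/1 score.
def structured_model(record):
    return int(record["reputation"] >= 100)


def text_model(record):
    return int("homework" in record["body"])


MODELS = {"structured": structured_model, "text": text_model}


def mapper(records):
    """Mapper 1: score every loaded model per record, keyed by record id."""
    for record_id, record in records.items():
        for name, model in MODELS.items():
            yield record_id, (name, model(record))


def reducer(pairs):
    """Reducer 1: join the model outputs for each record into one new row."""
    joined = defaultdict(dict)
    for record_id, (name, score) in pairs:
        joined[record_id][name] = score
    return {
        record_id: tuple(scores[name] for name in sorted(MODELS))
        for record_id, scores in joined.items()
    }


def train_meta_model(rows, labels):
    """MR Job 2 stand-in: majority label for each combination of base outputs."""
    votes = defaultdict(Counter)
    for record_id, row in rows.items():
        votes[row][labels[record_id]] += 1
    return {row: counts.most_common(1)[0][0] for row, counts in votes.items()}


# Invented records and labels.
records = {
    1: {"reputation": 500, "body": "how do I parse json"},
    2: {"reputation": 5, "body": "homework due tomorrow"},
    3: {"reputation": 800, "body": "homework style question on algorithms"},
    4: {"reputation": 3, "body": "why is my loop slow"},
}
labels = {1: "open", 2: "closed", 3: "open", 4: "open"}

rows = reducer(mapper(records))  # columns ordered as (structured, text)
meta_model = train_meta_model(rows, labels)
print(rows, meta_model)
```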
- 16. WHERE ARE WE? We've created two models, one structured and one unstructured, and have ensembled them to create a single, powerful model and solve a practical business problem.
- 17. HOW DID WE GET HERE? This required:
  - simple infrastructure
  - a blend of analysis and scripting skills
  - an understanding of BIG data science techniques
  - but not a team of PhDs or a billion dollars.
- 18. Questions? www.thinkbiganalytics.com | @danmallinger
