Big Data Analytics -
The Best of the Worst
Krishna Sankar
@ksankar
https://www.linkedin.com/in/ksankar

Big Data Analytics - Best of the Worst : Anti-patterns & Antidotes


Slides for my pydata talk http://seattle.pydata.org/schedule/presentation/20/

Published in: Data & Analytics


  1. Big Data Analytics - The Best of the Worst
     Krishna Sankar, @ksankar
     https://www.linkedin.com/in/ksankar
  2. About Me
     o Data Scientist
       • Decision Data Science & Product Data Science [Data Science Folk Knowledge: http://goo.gl/O4svPx]
       • Insights = Intelligence + Inference + Interface [https://goo.gl/s2KB6L]
       • Predicting NFL with Elo, like Nate Silver & 538 [NFL: http://goo.gl/Q2OgeJ, NBA ’15: https://goo.gl/aUhdo3]
     o Have been speaking at OSCON [http://goo.gl/1MJLu], PyCon, PyData [http://vimeo.com/63270513, http://www.slideshare.net/ksankar/pydata-19] …
     o Have done lots of things:
       • Big Data (Retail, Bioinformatics, Financial, AdTech), starting MS-CFRM, University of WA
       • Written books (Web 2.0, Wireless, Java, …), standards, some work in AI
       • Guest Lecturer at Naval PG School, …
     o Studying MS-CFRM (Computational Finance/Risk Management), UWA
     o Full-day Spark workshop “Advanced Data Science w/ Spark”, Spark Summit-E ’15 [https://goo.gl/7SBKTC]
     o Co-author: “Fast Data Processing with Spark”, Packt Publishing [http://goo.gl/eNtXpT]
     o Reviewer: “Machine Learning with Spark”, Packt Publishing
     o Volunteer as Robotics Judge at FIRST LEGO League World Competitions
     o @ksankar, doubleclix.wordpress.com
  3. Background – Top 5
     http://tcapp2.publishpath.com/rabbithole
     http://conservationmagazine.org/wordpress/wp-content/uploads/2013/05/context-matters.jpg
  4. 1) Data Science
     The art of building a model with known knowns,
     which, when let loose, works with unknown unknowns.
     Donald Rumsfeld is an armchair Data Scientist! http://smartorg.com/2013/07/valuepoint19/
     Matrix: The World (Knowns, Unknowns) vs. You (Known, Unknown)
     o Others know, you don’t: Model Evolution/DevOps to capture this
     o Capture in Models
     o Facts, outcomes or scenarios we have not encountered, nor considered
     o “Black swans”, outliers, long tails of probability distributions
     o Lack of experience, imagination
     o Potential facts, outcomes we are aware of, but not with certainty
     o Stochastic processes, Probabilities
     o Known Knowns
       • There are things we know that we know
     o Known Unknowns
       • That is to say, there are things that we now know we don't know
     o But there are also Unknown Unknowns
       • There are things we do not know we don't know
     Goal of Big Data is Analytics
  5. 2) The pipeline is the context
     Stages: Collect, Store, Transform (Data Management); Reason, Model, Deploy (Data Science); Explore, Visualize, Recommend, Predict
     o Scalable Model Deployment
     o Big Data automation & purpose-built appliances (soft/hard)
     o Manage SLAs & response times
     o Volume
     o Velocity
     o Streaming Data
     o Canonical form
     o Data catalog
     o Data Fabric across the organization
     o Access to multiple sources of data
     o Think Hybrid – Big Data Apps, Appliances & Infrastructure
     o Metadata
     o Monitor counters & Metrics
     o Structured vs. Multi-structured
     o Flexible & Selectable
       § Data Subsets
       § Attribute sets
     o Refine model with
       § Extended Data subsets
       § Engineered Attribute sets
     o Validation run across a larger data set
     o Dynamic Data Sets
     o 2-way key-value tagging of datasets
     o Extended attribute sets
     o Advanced Analytics
     o Performance
     o Scalability
     o Refresh Latency
     o In-memory Analytics
     o Advanced Visualization
     o Interactive Dashboards
     o Map Overlay
     o Infographics
     ¤ Bytes to Business a.k.a. Build the full stack
     ¤ Find Relevant Data For Business
     ¤ Connect the Dots
  6. 3) Mind Your “I”s, “C”s & “V”s
     Volume, Velocity, Variety
     Context, Connectedness
     Intelligence, Interface, Inference
     o Three Amigos
     o Interface = Cognition
     o Intelligence = Compute (CPU) & Computational (GPU)
     o Infer Significance & Causality
     CURATED SIGNALS > APPLIED INTELLIGENCE > STRATIFIED INFERENCE
  7. 4) Model Evolution & Concept Drift
     Axes: Complexity, Value
     o Dynamic dashboards
     o Multi-dimensional pivots w/ customization
     o Selectable algorithms on data subsets, e.g. “Cluster Customer for 5 thanksgiving seasons”
     o Learning Models: automatic feature selection & hyperparameter optimization as the model gets more data
     o Dynamic Models: model selection based on context
     o Automated Analytics: let the data tell the story (Feature Learning, AI, Deep Learning)
     Concept Drift: validate model assumptions + hyperparameters + features in the current context, after they are in production
     Ref: Prof. Josh Bloom, Keynote: A Systems View of Machine Learning, #pydata Seattle ’15
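The concept-drift point on slide 7 (re-validate a model's assumptions after it is in production) can be illustrated with a small check. This is only a minimal sketch, not the speaker's method: it assumes a single numeric feature, uses a simple z-score of the live mean against the training baseline, and the names `mean_shift_z`, `has_drifted`, the threshold of 3.0 and all sample values are made up for illustration.

```python
import math
import statistics


def mean_shift_z(train, live):
    """z-score of the live sample mean against the training baseline."""
    mu = statistics.fmean(train)
    sigma = statistics.stdev(train)
    standard_error = sigma / math.sqrt(len(live))
    return (statistics.fmean(live) - mu) / standard_error


def has_drifted(train, live, threshold=3.0):
    """Flag drift when the live mean is more than `threshold` standard errors away."""
    return abs(mean_shift_z(train, live)) > threshold


# Illustrative values only (hypothetical basket sizes, not from the talk).
train_basket = [10.0, 11.0, 9.0, 10.5, 9.5, 10.0, 10.2, 9.8]
steady_week = [10.1, 9.9, 10.3, 9.7]
holiday_week = [13.0, 12.5, 13.5, 12.8]

print(has_drifted(train_basket, steady_week))
print(has_drifted(train_basket, holiday_week))
```

A production check would likely compare whole distributions and several features at once; the mean-shift test is used here only because it fits in a few lines.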
  8. 5) The Sense & Sensibility of a DataScientist DevOps
     o Analytics in the lab = Investigative
       • Interactive, Iterative, Explorative
       • Output is usually decision data science
     o Analytics in the factory = Operational
       • Automated, systemic, transparent & explainable
       • Output is embedded intelligence
       • Embedded in customer-facing decision systems
     There is a chasm between Model/Reason and Deploy
     Ref: Josh Wills, From the labs to the factory
     https://doubleclix.wordpress.com/2013/11/17/of-building-data-products/
     http://doubleclix.wordpress.com/2014/05/11/the-sense-sensibility-of-a-data-scientist-devops/
  9. 6) Data is your product, regardless of what you sell
     o Data is the lens through which you see the business and feel the pulse
     o Collect the right data through “Thoughtful Data Design”
     o Give Data Back in a Powerful Way
     o But don’t confuse or overwhelm the users
       • The users have to feel safe
       • The users have to feel they are in control
     o Never try to launch a complicated data product on a fixed schedule
     o Offer progressively sophisticated products, leveraging the data & insights, across the different user population segments
       • Customer segmentation & stratification is not just for retail!
     Ref: Josh Wills, From the labs to the factory
     https://doubleclix.wordpress.com/2013/11/17/of-building-data-products/
     http://doubleclix.wordpress.com/2014/05/11/the-sense-sensibility-of-a-data-scientist-devops/
  10. “There are no routine statistical questions, only questionable statistical routines” - David Cox
     Ref: Gabriele Corno, Natural History Museum in #London, by George Thalassinos
     Big Data Analytics - The Best of the Worst
  11. Data Swamp
     Blue Pill
     o Typical case of “ungoverned data stores addressing a limited data science audience”
     o The company has proudly crossed the chasm to the big data world with a shiny new Hadoop infrastructure.
     o Now everyone starts putting their data into this “lake”.
     o After a few months, the disks are full; Hadoop is replicating 3 copies; some bytes are even falling off the wires onto the floor. But no one has any clue what data is in there, or about its consistency and semantic coherence.
     Red Pill - Data Curation
     o Data Curation
       • A consistent published schema
     o Data quality & data lineage, “descriptive metadata and an underlying mechanism to maintain it”, are all part of the data curation layer …
     o Semantic consistency across diverse multi-structured, multi-temporal transactions requires a level of data curation & discipline
     o Design for the right “Data Gravity” & “Data Mass”, as Van Lindberg mentioned yesterday in his keynote
       • Not Data Molasses!
  12. Data Swamp (continued)
     https://www.linkedin.com/pulse/data-lakes-udls-vs-analytics-platforms-gargi-adhav
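As a concrete reading of “a consistent published schema”, records can be checked against the schema at the landing zone instead of being dumped straight into the lake. The sketch below is an assumption-laden illustration: the schema, its field names, and the helpers `schema_errors` and `land` are hypothetical and do not come from the slides.

```python
# Hypothetical published schema for a "transactions" feed (illustrative only).
PUBLISHED_SCHEMA = {"customer_id": int, "amount": float, "day": str}


def schema_errors(record, schema=PUBLISHED_SCHEMA):
    """Return a list of human-readable schema violations for one record."""
    errors = []
    for field, expected_type in schema.items():
        if field not in record:
            errors.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            errors.append(f"wrong type for {field}: {type(record[field]).__name__}")
    for field in record:
        if field not in schema:
            errors.append(f"unpublished field: {field}")
    return errors


def land(records):
    """Split incoming records into curated rows and quarantined (row, errors) pairs."""
    curated, quarantined = [], []
    for record in records:
        errors = schema_errors(record)
        if errors:
            quarantined.append((record, errors))
        else:
            curated.append(record)
    return curated, quarantined


incoming = [
    {"customer_id": 42, "amount": 19.99, "day": "2015-07-24"},
    {"customer_id": "42", "amount": 19.99, "day": "2015-07-24"},  # id arrives as text
    {"customer_id": 7, "amount": 5.0},                            # no day recorded
]
curated, quarantined = land(incoming)
for record, errors in quarantined:
    print(record, errors)
```

Keeping the rejected rows together with their error messages gives a starting point for the data quality and lineage metadata the slide asks for.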
  13. Big Data To Nowhere
     Blue Pill
     o IT sees an opportunity and starts building the infrastructure, sometimes massive, and puts petabytes of data in the Big Data Hub or lake or pool or … But there are no relevant business-facing apps.
     o A conversation goes like this …
       • Business: I heard that we have a big data infrastructure, cool. When can I show a demo to our customers?
       • IT: We have petabytes of data and I can show the Hadoop admin console. We even have the Spark UI!
       • Business: … (unprintable)
     Red Pill - Full Stack MVP (see next slide)
     o Build the full stack, i.e. bits to business …
     o Build incremental Decision Data Science & Product Data Science layers, as appropriate …
     o The following conversation is a lot better …
       • Business: I heard that we have a big data infrastructure, cool. When can I show a demo to our customers?
       • IT: Actually, we don’t have all the data. But from the transaction logs and customer data, we can infer that males aged 34-36 buy a lot of stuff from us between 11:00 PM & 2:00 AM!
       • Business: That is interesting … Show me a graph. BTW, do you know what the revenue and the profit margin from these buys are?
       • IT: Graph is no problem. We have a Shiny app with the dynamic model over the web logs.
       • IT: With the data we have, we only know that they comprise ~30% of our volume by transaction. But we do not have the order data in our Hadoop yet. We can … let me send out a budget request …
  14. ML Engine: NumPy, SciPy, Pandas, Spark, Azure ML, MPP/Impala
     Stages
     o Collect
     o Store
     o Transform
     o Report
     o Visualize
     o Recommend
     o Predict
     o Reason
     o Model
     o Explore
     Tools & techniques
     o R/Python
     o Compositional Analysis
     Components
     o Data Hub: Curated Data
     o Storage: HDFS, Parquet
     o Compute: Hadoop MR, Spark
     o Landing Zone
     o ETL
     o Reporting Hub
     o Analytics Hub
     o In-Memory Hub
     o Real-Time: Kafka …
     o Dashboards
     o APIs
     Workloads (columns: Reporting Hub, Analytics Hub, Hadoop MR)
     o Long-running complex jobs: yearly pivots, multi-dimensional exact uniques ✔ ✔
     o Real-time ad-hoc pivots, approximate uniques (HLL) ✔
     o Fast response with aggregated data subsets ✔
  15. Build The E2E Analytics MVP Stack
     (architecture as on slide 14)
     https://www.linkedin.com/pulse/why-how-make-mvp-analytics-ruoyu-bao
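The workload table on slide 14 contrasts exact uniques (long-running jobs) with approximate uniques via HLL (real-time ad-hoc pivots). The sketch below shows the idea behind HyperLogLog in plain Python. It is a simplified illustration rather than a production implementation: the register count (2^10), the use of SHA-1 as the hash, and the `visitors` data are all assumptions made for the example.

```python
import hashlib
import math


class HyperLogLog:
    """Tiny HyperLogLog sketch: 2**p registers, 64-bit hash taken from SHA-1."""

    def __init__(self, p=10):
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m

    def add(self, item):
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")           # 64-bit hash value
        index = h >> (64 - self.p)                       # first p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)            # remaining 64 - p bits
        rank = (64 - self.p) - rest.bit_length() + 1     # position of the leftmost 1 bit
        self.registers[index] = max(self.registers[index], rank)

    def count(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)            # bias constant, valid for m >= 128
        raw = alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            return self.m * math.log(self.m / zeros)     # small-range correction
        return raw


# 30,000 events from 10,000 distinct (made-up) visitors.
visitors = [f"user-{i % 10_000}" for i in range(30_000)]

exact = len(set(visitors))      # exact uniques: memory grows with cardinality
hll = HyperLogLog()
for visitor in visitors:
    hll.add(visitor)
approx = hll.count()            # approximate uniques: fixed 1,024 registers

print(exact, round(approx))
```

With 1,024 registers the typical relative error is around 3%, which is the trade the slide points at: a fixed, small memory footprint and mergeable state in exchange for an approximate answer.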
  16. A Data Too Far
     Blue Pill
     o You might get a few .gz files, a few .csv files and, of course, parquet files, in multiple systems
     o Some will have IDs, some names, some aggregated by week, some aggregated by day and others purely transactional.
     o The challenge is that we have the data, but there is no easy way to combine them for interesting inferences …
     Red Pill - Data Curation
     o “..The most creative things that happen with data are less about sophisticated algorithms and vast computation (though those are nice) than it is about putting together different pieces of data that were previously locked up in different silos.”
     o Data pipelines (e.g. Kafka) with in-line processing to ensure correctness, semantic and temporal congruence & integrity
     Ref: Jay Kreps, Announcing Confluent
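One small instance of the temporal congruence problem on slide 16 is joining a feed aggregated by day with a feed aggregated by week. A minimal sketch, assuming ISO weeks are the agreed common grain; the feeds, the numbers and the helper names `week_key` and `roll_up_to_weeks` are hypothetical.

```python
from collections import defaultdict
from datetime import date


def week_key(day):
    """Map a calendar date to its (ISO year, ISO week) pair."""
    iso = day.isocalendar()
    return (iso[0], iso[1])


def roll_up_to_weeks(daily_rows):
    """Aggregate (date, amount) rows to weekly totals keyed by ISO week."""
    totals = defaultdict(float)
    for day, amount in daily_rows:
        totals[week_key(day)] += amount
    return dict(totals)


# Hypothetical feeds: sales arrive per day, returns arrive already aggregated per week.
daily_sales = [
    (date(2015, 7, 20), 100.0),
    (date(2015, 7, 22), 50.0),
    (date(2015, 7, 27), 25.0),
]
weekly_returns = {(2015, 30): 10.0, (2015, 31): 5.0}

weekly_sales = roll_up_to_weeks(daily_sales)
net_by_week = {
    week: weekly_sales.get(week, 0.0) - weekly_returns.get(week, 0.0)
    for week in sorted(set(weekly_sales) | set(weekly_returns))
}
print(net_by_week)
```

The join only works because both sides were first brought to the same grain and key; the same normalization is needed for IDs versus names before any cross-silo inference.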
  17. Where is the Tofu?
     Blue Pill
     o It is very simple to produce “reasonable” recommendations
     o But extremely difficult to improve them to become “great”
     o And there is a huge difference in business value between “reasonable” & “great” …
     Ref: Xavier Amatriain, when he talked about the Netflix Prize
     Red Pill - Data Curation
     o The Antidote: the insights and the algorithms should be relevant and scalable …
     o There is a huge gap between Model/Reason and Deploy …
     o Statistical significance need not mean business significance
     o “Don't confuse the statistical significance of an experiment with the magnitude of the result, even though the word "significance" is often used for both” - Peter Norvig
     "Knowledge is a process of piling up facts; wisdom lies in their simplification." - Martin Fischer
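The gap between statistical and business significance on slide 17 is easy to see with a large A/B test. The sketch below uses a standard two-proportion z-test; the conversion counts are invented for illustration and the function name `two_proportion_z_test` is just a label for this example.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (absolute lift, z statistic, two-sided p-value) for B versus A."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    standard_error = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / standard_error
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail probability
    return p_b - p_a, z, p_value


# Made-up experiment: one million users per arm, 10.0% vs 10.2% conversion.
lift, z, p_value = two_proportion_z_test(100_000, 1_000_000, 102_000, 1_000_000)
print(f"lift={lift:.4f}  z={z:.2f}  p={p_value:.2e}")
```

With samples this large the p-value is tiny, yet the lift is only 0.2 percentage points. Whether that magnitude matters is a business question the test cannot answer.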
  18. Analytics - miscues
     o Don’t Torture the Data
  19. Design Principles
     1. Start with needs*
     2. Do less
     3. Design with data
     4. Do the hard work to make it simple
     5. Iterate. Then iterate again.
     6. Build for inclusion
     7. Understand context
     8. Build digital services, not websites
     9. Be consistent, not uniform
     10. Make things open: it makes things better
     https://www.gov.uk/design-principles
     Image: Down the rabbit hole, art by frostyshadows
     http://frostyshadows.deviantart.com/art/Down-the-Rabbit-Hole-358090601
  20. Data Alone is not enough
     o Data alone is not enough
       • Induction, not deduction: every learner should embody some knowledge or assumptions beyond the data it is given in order to generalize beyond it
     o Machine Learning is not magic: one cannot get something from nothing
       • In order to infer, one needs the knobs & the dials
       • One also needs a rich, expressive dataset
     o Data Scientists are not Data Alchemists
       • Don’t expect Analytic Gold from a pack of data lead
     Ref: A few useful things to know about machine learning, by Pedro Domingos
     http://dl.acm.org/citation.cfm?id=2347755
     https://www.flickr.com/photos/bionerd/3123155390
  21. More Data Beats a Cleverer Algorithm
     o More Data Beats a Cleverer Algorithm
       • Or conversely, select algorithms that improve with data
       • Don’t optimize prematurely without getting more data
     o Learn many models, not just one
       • Ensembles! Change the hypothesis space
       • Netflix prize
       • E.g. Bagging, Boosting, Stacking
     o Simplicity does not necessarily imply Accuracy
     o Representable does not imply Learnable
       • Just because a function can be represented does not mean it can be learned
     o Correlation does not imply Causation
     o http://doubleclix.wordpress.com/2014/03/07/a-glimpse-of-google-nasa-peter-norvig/
     o A few useful things to know about machine learning, by Pedro Domingos
       § http://dl.acm.org/citation.cfm?id=2347755
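“Learn many models, not just one” can be sketched with bagging: train the same weak learner on bootstrap resamples and let the models vote. The example below is a toy under stated assumptions: a one-dimensional threshold “stump” as the weak learner, a tiny synthetic dataset, and illustrative helper names (`fit_stump`, `fit_bagged_stumps`, `predict`) that are not from the talk.

```python
import random
from collections import Counter


def fit_stump(sample):
    """Pick the threshold t (predict 1 when x >= t) with the fewest errors on the sample."""
    candidates = sorted({x for x, _ in sample}) + [float("inf")]

    def errors(t):
        return sum((x >= t) != bool(y) for x, y in sample)

    return min(candidates, key=errors)


def fit_bagged_stumps(data, n_models=25, seed=0):
    """Bagging: fit one stump per bootstrap resample (sampling with replacement)."""
    rng = random.Random(seed)
    return [fit_stump([rng.choice(data) for _ in data]) for _ in range(n_models)]


def predict(thresholds, x):
    """Majority vote over the individual stump predictions."""
    votes = Counter(int(x >= t) for t in thresholds)
    return votes.most_common(1)[0][0]


# Toy data: the label is 1 exactly when x >= 5.
data = [(x, int(x >= 5)) for x in range(10)]
ensemble = fit_bagged_stumps(data)

print(predict(ensemble, 1), predict(ensemble, 8))
```

Each stump sees a slightly different sample and so lands on a slightly different threshold; the vote averages out that variance, which is the sense in which an ensemble changes the hypothesis space.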
  22. In short …
     o Build the full stack, iteratively building capabilities
     o Identify the ‘Right’ Business Problems
     o Create Valuable Data Perspectives
     o Frame problems & bring analytics together with non-quantitative information to build compelling stories
     o Embed Inference & Intelligence in products
     https://www.linkedin.com/pulse/article/20141108013125-1290064-winning-at-analytics-takes-more-than-technology
     http://www.kdnuggets.com/2014/09/hiring-data-scientist-what-to-look-for.html
  23. Thank You
     Image: Ogilvy & Mather Advertising: Morning view from the Ogilvy & Mather NY office, nicknamed the Chocolate Factory #TravelTuesday
