
Data Science in 2016: Moving Up

A keynote presentation for Big Data Spain 2015 in Madrid, 2015-10-15 http://www.bigdataspain.org/program/


  1. 1. Data Science in 2016: Moving Up 2015-10-15 • Madrid • http://bigdataspain.org/ Paco Nathan, @pacoid
 O’Reilly Media
  2. 2. • general patterns • trends and analysis: the discipline, the jobs • some good examples: moving up into use cases • glimpses ahead: an emerging context • a proposed theme Data Science 2016: Moving Up
  3. 3. Design Patterns
  4. 4. Design Patterns Methodology for cloud-computing architecture
 (2008-06-29) http://ceteri.blogspot.com/2008/06/methodology-for-cloud-computing.html
  5. 5. cluster scheduler data pipes some cloud containers analytics search/index elastic compute elastic storage Design Patterns
  6. 6. Design Patterns some cloud
  7. 7. Design Patterns some cloud DataStax $189.7M Confluent $30.9M Databricks $47M Jupyter $6M Elastic $104M Docker $162M Mesosphere $48.75M
  8. 8. Design Patterns: Issues some cloud • integration could be better • that implies sharing markets • VCs in Silicon Valley dislike that • customers need integration
  9. 9. some cloud Design Patterns: Where?
  10. 10. Design Patterns: Where? some cloud
  11. 11. Design Patterns: Where? some cloud
  12. 12. Design Patterns: Where? some cloud
  13. 13. Design Patterns: Where? some cloud
  14. 14. Design Patterns: Where? some cloud • that playing field becomes overly crowded, soon… • what happens at that point?
  15. 15. • so much emphasis on plumbing: `data engineering` • not enough on domain expertise, which trumps all Much activity in Big Data seems awkwardly focused at the bottom of the tech stack: infrastructure, not domain However, that may be changing… Design Patterns: Opinion
  16. 16. Interesting Trends
  17. 17. Interesting Trends There are many possible trends to discuss, but let’s 
 concentrate on four of these going into 2016: • leveraging multicore and large memory spaces • generalized libraries for frequently repeated work • workflows blend the best of people and computing • framework for a big leap ahead, not just incremental
  18. 18. Original definitions for what became relational databases had less to do with dedicated SQL products, more similarity with something like 
 Spark SQL Interesting Trend #1: Contemporary Hardware A relational model of data 
 for large shared data banks
 Edgar Codd
 Communications of the ACM (1970)
 dl.acm.org/citation.cfm?id=362685
  19. 19. Interesting Trend #1: Contemporary Hardware [diagram, from Databricks] Unified API, One Engine, Automatically Optimized • language frontend: Python, Java/Scala, R, SQL, … • DataFrame Logical Plan • Tungsten backend: LLVM, JVM, GPU, NVRAM, …
  20. 20. Deep Dive into Project Tungsten: 
 Bringing Spark Closer to Bare Metal
 Josh Rosen
 spark-summit.org/2015/events/deep-dive-into-project-tungsten-bringing-spark-closer-to-bare-metal/ Physical Execution: CPU Efficient Data Structures Keep data closer to CPU cache Interesting Trend #1: Contemporary Hardware from Databricks
  21. 21. Interesting Trend #2: Generalized Libraries Tensors are a good way to handle time-series 
 geo-spatially distributed linked data with lots 
 of N-dimensional attributes In other words, nearly a general case for handling much of the data that we’re likely to encounter That’s better than attempting to shoehorn data into matrix representation, then writing lots of custom code to support it
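One way to picture the point about tensors versus matrices is the minimal sketch below. It is an illustration only, not code from the talk: the readings, the mode order (day, station, attribute), and the helper names `slice_mode` and `mean_over` are all hypothetical. It stores a sparse 3-way tensor as a dictionary keyed by index tuples, so time, location, and attribute each stay a first-class dimension instead of being flattened into matrix columns.

```python
from collections import defaultdict

# Hypothetical sparse 3-way tensor: (day, station, attribute) -> value.
# All values here are made up for illustration.
readings = {
    ("2015-10-14", "madrid", "temp_c"): 19.0,
    ("2015-10-14", "madrid", "humidity"): 0.55,
    ("2015-10-14", "sevilla", "temp_c"): 24.0,
    ("2015-10-15", "madrid", "temp_c"): 17.0,
    ("2015-10-15", "sevilla", "temp_c"): 23.0,
    ("2015-10-15", "sevilla", "humidity"): 0.40,
}


def slice_mode(tensor, axis, label):
    """Fix one mode to a label, returning a tensor of one lower order."""
    return {
        key[:axis] + key[axis + 1:]: value
        for key, value in tensor.items()
        if key[axis] == label
    }


def mean_over(tensor, axis):
    """Average out one mode, returning a tensor of one lower order."""
    groups = defaultdict(list)
    for key, value in tensor.items():
        groups[key[:axis] + key[axis + 1:]].append(value)
    return {key: sum(values) / len(values) for key, values in groups.items()}


# Slice along the attribute mode, then average along the time mode.
temps = slice_mode(readings, 2, "temp_c")    # keys: (day, station)
mean_temp_by_station = mean_over(temps, 0)   # keys: (station,)
print(mean_temp_by_station)
```

The same two helpers work on any mode, which is the contrast with a matrix layout: there, each new question (per day, per station, per attribute) tends to need its own reshaping code.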
  22. 22. Tensor factorization may be problematic, but probabilistic approaches seem to provide relatively general-case solutions: The Tensor Renaissance in Data Science
 Anima Anandkumar @UC Irvine
 radar.oreilly.com/2015/05/the-tensor-renaissance-in-data-science.html Spacey Random Walks and 
 Higher Order Markov Chains
 David Gleich @Purdue
 slideshare.net/dgleich/spacey-random-walks-and-higher-order-markov-chains Interesting Trend #2: Generalized Libraries
  23. 23. Interesting Trend #3: Leveraging Workflows [workflow diagram, circa 2010; columns: representation, optimization, evaluation; labels: ETL into cluster/cloud, data, Data Prep, Features, Learners, Parameters, Unsupervised Learning, Explore, train set, test set, models, Evaluate, Optimize, Scoring, production data, visualize, reporting, use cases, data pipelines, actionable results, decisions, feedback, developers, algorithms] APIs, algorithms, developer-centric template thinking: 
 these only go so far; the overall context is a workflow…
  24. 24. Interesting Trend #3: Leveraging Workflows [same workflow diagram as the previous slide] look beyond an API, beyond a code repo … think of people and machines working together
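The stages named in the workflow diagram can be sketched end to end in a few lines. This is a hedged illustration, not the talk's own code: the toy data, the midpoint-threshold "learner", the deterministic split, and the `margin` rule that routes uncertain cases to a person are all assumptions, chosen only to show data prep, a train/test split, a learner, evaluation, and scoring with a human in the loop.

```python
def prepare(rows):
    """Data prep: drop records with a missing feature value."""
    return [r for r in rows if r["x"] is not None]


def split(rows, every=3):
    """Deterministic split: every third record goes to the test set."""
    test = rows[::every]
    train = [r for i, r in enumerate(rows) if i % every != 0]
    return train, test


def train_threshold(train_set):
    """Learner: place a threshold midway between the two class means."""
    pos = [r["x"] for r in train_set if r["label"] == 1]
    neg = [r["x"] for r in train_set if r["label"] == 0]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2


def evaluate(threshold, test_set):
    """Evaluate: fraction of held-out records classified correctly."""
    hits = sum((r["x"] > threshold) == bool(r["label"]) for r in test_set)
    return hits / len(test_set)


def score(threshold, x, margin=1.0):
    """Scoring: decide automatically when confident, else ask a person."""
    if abs(x - threshold) < margin:
        return "needs human review"
    return "positive" if x > threshold else "negative"


# Hypothetical raw data, including one incomplete record.
raw = [{"x": x, "label": int(x > 5)} for x in (1, 7, 2, 8, 3, 9, 4, 10)]
raw.append({"x": None, "label": 0})

clean = prepare(raw)
train_set, test_set = split(clean)
threshold = train_threshold(train_set)
accuracy = evaluate(threshold, test_set)

print(round(threshold, 2), accuracy)
print(score(threshold, 9.0), "|", score(threshold, 5.0))
```

No single function here is interesting on its own. The point is the overall shape: each stage hands a result to the next, and the last stage hands the uncertain cases to a person rather than forcing a decision.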
  25. 25. Chris Ré, @Stanford
 https://www.macfound.org/fellows/943/ Drugs, DNA, and Dinosaurs: Building High Quality Knowledge Bases with DeepDive
 Strata CA (2015) The Thorn in the Side of Big Data: too few artists
 Strata CA (2014) Interesting Trend #4: A Leap Ahead
  26. 26. Chris Ré https://www.macfound.org/fellows/943/ Drugs, DNA, and Dinosaurs: Building High Quality Knowledge Bases with DeepDive Strata CA (2015) The Thorn in the Side of Big Data: too few artists Strata CA (2014) Interesting Trend #4: A Leap Ahead cognitive computing “flywheel”: probabilistic reasoning about complex data and predictions together
  27. 27. Chris Ré https://www.macfound.org/fellows/943/ Drugs, DNA, and Dinosaurs: Building High Quality Knowledge Bases with DeepDive Strata CA (2015) The Thorn in the Side of Big Data: too few artists Strata CA (2014) Interesting Trend #4: A Leap Ahead
  28. 28. Data Scientists
  29. 29. William Cleveland 
 “Data Science: an Action Plan for Expanding 
 the Technical Areas of the Field of Statistics,” 
 International Statistical Review (2001), 69, 21-26 http://www.stat.purdue.edu/~wsc/papers/datascience.pdf Leo Breiman
 “Statistical modeling: the two cultures”, 
 Statistical Science (2001), 16:199-231 http://projecteuclid.org/euclid.ss/1009213726 …also good to mention John Tukey Data Scientists: Primary Sources
  30. 30. Data Scientists: Five Years of Strata Conference
  31. 31. One 2015 report (RJMetrics) tallied a minimum of 
 11,400 data scientists worldwide by scraping LinkedIn So many suddenly, really? Perhaps that’s doubtful… Comparing surveys: O’Reilly Media conducts salary surveys 
 for data scientists, along with exploring the tools used 2013 – tools, trends, not all data is “Big”, coding scripts! 2014 – correlation of tools and skills, rapid evolution 2015 – divide blurring between open source and proprietary Data Scientists: Everywhere, all the time?
  32. 32. http://radar.oreilly.com/2015/09/2015-data-science-salary-survey.html John King, Roger Magoulas Data Scientists: 2015 Survey
  33. 33. Data Scientists: 2015 Survey
  34. 34. Moving Up
  35. 35. Enlitic http://www.enlitic.com/ deep learning to assist doctors treating cancer Moving Up: Medicine
  36. 36. Moving Up: Medicine “Whatever the models might discover or predict, Howard isn’t suggesting they’ll do away with a doctor’s judgment. Rather, artificially intelligent computers could provide strong, unbiased second opinions, or perhaps lead a doctor down 
 a path of investigation she otherwise wouldn’t have considered.” With Enlitic, a veteran data scientist plans 
 to fight disease using deep learning
 GigaOM (2014-08-22)
 https://gigaom.com/2014/08/22/with-enlitic-a-veteran-data-scientist-plans-to-fight-disease-using-deep-learning/
  37. 37. Moving Up: Political Platform http://www.predikon.ch/en/voting-patterns/residents
  38. 38. Moving Up: Political Platform Mining Democracy
 Matthias Grossglauser @EPFL
 ICT Labs (2015)
 http://ictlabs-summer-school.sics.se/slides/mining%20democracy.pdf What if a political candidate could cluster political positions in a multi-dimensional data space, to optimize for being recommended to voters? http://www.predikon.ch/en/voting-patterns/residents
  39. 39. Moving Up: Government Ethics The White House has a plan to help society through data analysis
 Fortune (2015-09-30)
 http://fortune.com/2015/09/30/dj-patil-white-house-data/
  40. 40. Moving Up: Government Ethics The White House has a plan to help society through data analysis
 Fortune (2015-09-30)
 http://fortune.com/2015/09/30/dj-patil-white-house-data/ “Opening up government data about child labor to concerned data scientists; recruiting folks to help analyze data about suicide prevention, social injustice and incarceration; a call for mandatory and `intrinsic` ethics instruction in every course teaching students data science; and an effort to help the transgender community create its own census of sorts, so that members and society can get a better grasp on the issues that matter to the group.”
  41. 41. Moving Up: Neuroscience Analytics + Visualization for Neuroscience: Spark, Thunder, Lightning Jeremy Freeman
 2015-01-29 youtu.be/cBQm4LhHn9g?t=28m55s
  42. 42. For excellent examples of Science and Data together see CodeNeuro, particularly for 
 use of Jupyter notebooks + Apache Spark Moving Up: Neuroscience
  43. 43. Learning
  44. 44. Learning: What About MOOCs?
  45. 45. Massive Open Online Courses – 
 seven year trend, beginning with: Connectivism and Connective Knowledge
 George Siemens, Stephen Downes
 University of PEI (2008)
 http://cck11.mooc.ca/ Learning: What About MOOCs? Adios EdTech. Hola something else
 George Siemens (2015-09-09)
 http://www.elearnspace.org/blog/2015/09/09/adios-ed-tech-hola-something-else/
  46. 46. Online education: MOOCs taken by educated few
 Ezekiel Emanuel, Nature 503, 342 (2013-11-21) • 80% students already have an advanced degree • 80% come from the richest 6% of the population Michael Shanks @Stanford: “retrenchment around traditional disciplines will make disparities even more pronounced” An Early Report Card on Massive Open Online Courses
 Geoffrey Fowler, WSJ (2013-10-08) Amherst, Duke, etc., have rejected edX Learning: What About MOOCs?
  47. 47. Learning: What About MOOCs? So then, what else works better?
  48. 48. How to Flip a Class 
 CTL @UT/Austin
 http://ctl.utexas.edu/teaching/flipping-a-class/how 1. identify where the flipped classroom model makes 
 the most sense for your course 2. spend class time engaging students in application activities with feedback 3. clarify connections between inside and outside 
 of class learning 4. adapt your materials for students to acquire course content in preparation of class 5. extend learning beyond class through individual 
 and collaborative practice Learning: Inverted Classroom
  49. 49. Scalable Learning
 David Black-Schaffer @Uppsala
 Sverker Janson @KTH SICS https://www.scalable-learning.com/ • active learning: Flipped Classroom and Just-in-Time Teaching • exams built directly into specific diagrams within videos • metrics for where in video+code that students get stuck • instructor can customize subsequent classroom discussions 
 (active teaching phase) based on stuck/unstuck metrics Learning: Inverted Classroom
  50. 50. Learning programming at scale Philip Guo 
 O’Reilly Radar (2015-08-13) http://radar.oreilly.com/2015/08/learning-programming-at-scale.html • Python Tutor • Codechella Tutors could keep an eye on around 
 50 learners during a 30-minute session, 
 start 12 chat conversations, and 
 help 3 learners at once Learning: Collaborative Learning
  51. 51. Data-driven Education and the Quantified Student Lorena Barba @GWU PyData Seattle (2015) https://youtu.be/2YIZ2SY9mW4 • keynote talk: abstract, slides • homepage • Open edX Universities Symposium, DC 2015-11-11 Learning: If you study just one link from this talk…
  52. 52. If by some bizarre chance you haven’t used 
 it already, go to https://jupyter.org/ • 50+ different language kernels • new funding 2015-07 • UC Berkeley, Cal Poly • nbgrader autograder by Jess Hamrick • jupyterhub multi-user server • curating a list of examples • repeatable science! see also:
 Teaching with Jupyter Notebooks
 http://tinyurl.com/scipy2015-education Learning: Jupyter Project
  53. 53. Embracing Jupyter Notebooks at O'Reilly
 Andrew Odewahn
 O’Reilly Media (2015-05-07) https://beta.oreilly.com/ideas/jupyter-at-oreilly O’Reilly Media is using our Atlas platform 
 to make Jupyter Notebooks a first class authoring environment for our publishing program Jupyter, Thebe, Atlas, Docker, etc. Learning: O’Reilly Media
  54. 54. Learning: O’Reilly Media https://beta.oreilly.com/
  55. 55. in-person blended on-demand Mostly Synchronous Mostly Asynch Inverted Classroom Subscription Free Content Learning: Audience Patterns
  56. 56. Is it possible to measure “distance” between 
 a learner and a subject community? From Amateurs to Connoisseurs:
 Modeling the Evolution of User 
 Expertise through Online Reviews
 Julian McAuley, Jure Leskovec
 http://i.stanford.edu/~julian/pdfs/www13.pdf Learning: Machine Learning about People Learning
  57. 57. Learning, Assessment, Team Building, Diversity – these can be accomplished together, in situ Collective Intelligence in Human Groups
 Anita Williams Woolley @CMU
 https://youtu.be/Bz1dDiW2mvM • balance of participation (no one dominates) • 2+ women engaging within the group • group size < 9 • diversity of formal backgrounds Learning: Machine Learning about People Learning
  58. 58. People + Automation
  59. 59. Data Science teams apply machine learning (automation) to help arrive at key insights, to learn what is important 
 in data sets – finding the proverbial needle in the haystack Cognitive Computing exhibits people + automation 
 as a process, in a learning context That’s also a basic tenet of workflows in general: 
 people + automation And a key aspect of the emerging gig economy too… People + Automation
  60. 60. People + Automation: Gig Economy
  61. 61. People + Automation: Gig Economy http://orchestra.unlimitedlabs.com/ “Workflows with humans and machines”
  62. 62. People + Automation: Gig Economy Workers in a World of Continuous Partial Employment Tim O’Reilly Medium (2015-08-31)
 https://medium.com/the-wtf-economy/workers-in-a-world-of-continuous-partial-employment-4d7b53f18f96 http://conferences.oreilly.com/next-economy
  63. 63. Learning is key. Effective use of Data Science in these new economic conditions requires people + automation, learning together – albeit in different ways. Plus, there’s an excellent framework for that: Autopoiesis and Cognition
 Humberto Maturana, Francisco Varela
 Springer (1973) https://books.google.es/books?id=nVmcN9Ja68kC People + Automation
  64. 64. I’d like to leave this as a theme for you to consider about 
 Data Science 2016, Moving Up into use cases… We see an intersection of key points in both the emerging Cognitive Computing context and the Gig Economy in general: systems of people + automation, learning together It posits an interesting duality for us to leverage With that I wish you a great conference here at Big Data Spain! People + Automation
  65. 65. Gracias
  66. 66. contact: Just Enough Math O’Reilly (2014) justenoughmath.com
 preview: youtu.be/TQ58cWgdCpA monthly newsletter for updates, 
 events, conf summaries, etc.: liber118.com/pxn/ Intro to Apache Spark
 O’Reilly (2015)
 shop.oreilly.com/product/0636920036807.do
