
GalvanizeU Seattle: Eleven Almost-Truisms About Data



http://www.meetup.com/Seattle-Data-Science/events/223445403/

Almost a dozen almost-truisms about Data that almost everyone should consider carefully as they embark on a journey into Data Science. There are a number of preconceptions about working with data at scale where the realities beg to differ. This talk estimates that number to be at least eleven, though probably much larger. At least that number has a great line from a movie going for it. Let's consider some of the less-intuitive directions in which this field is heading, along with likely consequences and corollaries – especially for those who are just now beginning to study the technologies, the processes, and the people involved.



  1. 1. Eleven Almost-Truisms About Data 2015-07-24 • Seattle Paco Nathan, @pacoid
 O’Reilly Learning
  2. 2. Set and Setting: Almost a Dozen Almost-Truisms about Data …
 to consider when embarking on a journey 
 into Data Science There are a number of preconceptions about working with data at scale, where the realities beg to differ We’ll crank this number up to eleven – even though the actual number is of course much larger, that’s perhaps for another day
  3. 3. Almost a Dozen Almost-Truisms about Data …
 to consider when embarking on a journey 
 into Data Science Let’s discuss some less-intuitive directions, along with likely consequences and corollaries This is not intended to prove a set of points, rather to provide a set of launching points Set and Setting:
  4. 4. #01: Because Rates
  5. 5. The rates of data being stored and analyzed jumped quite dramatically in the late 1990s 
 to early 2000s … partly because storage became incredibly cheap … partly because internetworked machines suddenly started producing much more machine data Fifteen years later, the rates jump again, this time by orders of magnitude … Because IoT It’s almost like this thing has a pulse? #01: Because Rates
  6. 6. In other words, to paraphrase von Schelling, experience precedes analysis Typically, we’re swimming in data, and we tend to respond by struggling to understand its structure and dynamics That, in contrast to the myth that our analysis drives data collection #01: Because Rates
  7. 7. Four independent teams were working toward horizontal 
 scale-out of workflows based on commodity hardware This effort prepared the way for huge Internet successes during
 the 1997 holiday season… AMZN, EBAY, Inktomi (YHOO Search), then GOOG MapReduce on clusters of commodity hardware and the 
 Apache Hadoop open source stack emerged from this context #01: Because Rates – 1997 Q3 Inflection Point
  8. 8. Amazon “Early Amazon: Splitting the website” – Greg Linden glinden.blogspot.com/2006/02/early-amazon-splitting-website.html eBay “The eBay Architecture” – Randy Shoup, Dan Pritchett addsimplicity.com/adding_simplicity_an_engi/2006/11/you_scaled_your.html addsimplicity.com.nyud.net:8080/downloads/eBaySDForum2006-11-29.pdf Inktomi (YHOO Search) “Inktomi’s Wild Ride” – Eric Brewer (0:05:31 ff) youtu.be/E91oEn1bnXM Google “Underneath the Covers at Google” – Jeff Dean (0:06:54 ff) youtu.be/qsan-GQaeyk perspectives.mvdirona.com/2008/06/11/JeffDeanOnGoogleInfrastructure.aspx #01: Because Rates – 1997 Q3 Inflection Point
  9. 9. [architecture diagram, circa 2001: Customers interacting with Web Apps via Middleware (servlets, models); customer transactions landing in an RDBMS serving SQL queries and result sets; Logs capturing event history; DW + ETL feeding aggregation and dashboards for Product, Engineering, UX, and Stakeholders; Algorithmic Modeling producing recommenders + classifiers] #01: Because Rates – Circa 2001, post e-commerce success
  10. 10. [the same architecture diagram, with the modeling loop annotated as “data products”] #01: Because Rates – Circa 2001, post e-commerce success
  11. 11. Primary sources for the notion: Cleveland,W. S., 
 “Data Science: an Action Plan for Expanding 
 the Technical Areas of the Field of Statistics,” 
 International Statistical Review (2001), 69, 21-26. http://cm.bell-labs.com/stat/doc/datascience.ps Breiman L., 
 “Statistical modeling: the two cultures”, 
 Statistical Science (2001), 16:199-231. http://projecteuclid.org/euclid.ss/1009213726 …also good to mention John Tukey #01: Because Rates – Whither Data Science?
  12. 12. Rashomon, the 1950 Japanese period drama 
 by Akira Kurosawa, symbolizes a long-standing tension in Statistics, one which Mark Twain described ever so succinctly… wikipedia.org/wiki/Rashomon: “The film is known for a plot device
 which involves various characters
 providing alternative, self-serving
 and contradictory versions of the
 same incident.” #01: Because Rates – A Sea Change
  13. 13. Because IoT! (exabytes/day per sensor) bits.blogs.nytimes.com/2013/06/19/g-e-makes-the-machine-and-then-uses-sensors-to-listen-to-it/ #01: Because Rates – A Sea Change, Redux
  14. 14. #02: Batch Defenestration
  15. 15. #02: Batch Defenestration
  16. 16. #02: Batch Defenestration Batch Analytics: going strong since 1944 
 Been there, done that
  17. 17. Businesses want to join the 21c., 
 and level up to streaming analytics “I saw what you did … in batch,”
 now performed a zillion times faster #02: Batch Defenestration – Infrastructure, Remodeled [chart: Contributors per Month to Spark, 2011–2015] The most active project at Apache, with more than 500 known production deployments
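As a minimal sketch of what that looks like in practice, here is a streaming word count in the Spark 1.x DStream API of the era. The SparkContext `sc` and the socket source on localhost:9999 are assumptions for illustration, not from the talk:

```python
# Hypothetical example: streaming word count over 5-second micro-batches.
# Assumes an existing SparkContext `sc` and a text source on localhost:9999.
from pyspark.streaming import StreamingContext

ssc = StreamingContext(sc, batchDuration=5)      # 5-second micro-batches
lines = ssc.socketTextStream("localhost", 9999)

counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()                                   # print each batch's counts

ssc.start()
ssc.awaitTermination()
```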
  18. 18. Tuning Spark Streaming for Throughput Gerard Maas, 2014-12-22 virdata.com/tuning-spark/ #02: Batch Defenestration – “Team Apache”, $316.4M funding
  19. 19. Can Spark Streaming survive Chaos Monkey? Bharat Venkat, Prasanna Padmanabhan, 
 Antony Arokiasamy, Raju Uppalapati techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html #02: Batch Defenestration – Resiliency, at the edge of Comp Sci
  20. 20. #03: Circa 1904
  21. 21. Trending interests: • electric cars • organic farm-to-table cuisine • permaculture • sustainable urbanism #03: Circa 1904
  22. 22. Speaking of batch windows… The last century or two of statistics represent an enormous mess Let’s start the clock over, then move forward into a more real-time near-future #03: Circa 1904
  23. 23. #03: Circa 1904 – Lies, Damn Lies, Statistics, Data Science Probability got going, formally, in the 16th c. – 
 although interesting mathematical estimations 
 trace back to classical times Arabs in the 9th c. used frequency analysis – 
 later rediscovered by Europeans during the 
 early Italian Renaissance Statistics followed, originally more about what 
 we might call demographics – through 18th c.
  24. 24. Laplace, Gauss, et al., bridged prob & stats in the 
 late 18th c. using distributions (what we studied 
 in Stats 101) to infer the probability of errors 
 in estimates Much of the 19th/20th c. work was about using goodness of fit tests, etc., justifying some distribution • generally speaking, those require samples • that, in turn, implies batch windows #03: Circa 1904 – Lies, Damn Lies, Statistics, Data Science
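To see why that implies batch windows, consider a small illustrative sketch (not from the talk): fit a normal distribution to a sample, then run a goodness-of-fit test to justify it – the test cannot run until the whole sample is in hand:

```python
# Fit a normal distribution to a batch of samples, then test the fit.
import numpy as np
from scipy import stats

sample = np.random.normal(loc=10.0, scale=2.0, size=1000)  # stand-in for one batch window

mu, sigma = sample.mean(), sample.std(ddof=1)              # estimate the parameters
ks_stat, p_value = stats.kstest(sample, "norm", args=(mu, sigma))
print(f"KS statistic: {ks_stat:.4f}, p-value: {p_value:.4f}")
# The point: the test needs the entire sample first -- i.e., a batch window.
```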
  25. 25. While 19th/20th c. stats work focused on defensibility 21st c. work, w.r.t. Big Data apps, focuses more 
 on predictability – plus there’s a shift in how we make estimates… BTW, doesn’t it seem weird to crunch through piles of data in large batch jobs, at large expense, when the results ultimately get used to approximate features? Why not perform that in-stream? #03: Circa 1904 – Lies, Damn Lies, Statistics, Data Science
  26. 26. A fascinating, relatively new area pioneered by relatively few people – e.g., Philippe Flajolet Provides approximation with error bounds, using far fewer resources (RAM, CPU, etc.) highlyscalable.wordpress.com/2012/05/01/probabilistic-structures-web-analytics-data-mining/ #03: Circa 1904 – Lies, Damn Lies, Statistics, Data Science
  27. 27. algorithm – use case (each links to example code in the original slide): 
 • Bloom Filter – set membership 
 • MinHash – set similarity 
 • HyperLogLog – set cardinality 
 • Count-Min Sketch – frequency summaries 
 • DSQ – streaming quantiles 
 • SkipList – ordered sequence search #03: Circa 1904 – Lies, Damn Lies, Statistics, Data Science
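To make one row of that table concrete, here is a hand-rolled Bloom filter sketch for set membership – illustrative only; the “code” links in the original slide point to production-quality implementations:

```python
# Toy Bloom filter: false positives are possible, false negatives are not.
import hashlib

class BloomFilter:
    def __init__(self, size=1024, num_hashes=3):
        self.size = size
        self.num_hashes = num_hashes
        self.bits = [False] * size

    def _positions(self, item):
        # Derive num_hashes bit positions from salted SHA-256 digests.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = True

    def might_contain(self, item):
        return all(self.bits[pos] for pos in self._positions(item))

bf = BloomFilter()
bf.add("spark")
print(bf.might_contain("spark"))   # True
print(bf.might_contain("hadoop"))  # almost certainly False
```

The memory story is the point: a fixed bit array stands in for the full set, trading a tunable false-positive rate for a dramatically smaller footprint.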
  28. 28. E.g., ±4% could buy you two orders of magnitude reduction in the required memory footprint for 
 an analytics app OSS projects such as Algebird and BlinkDB provide for this newer approach to the math of approximations at scale #03: Circa 1904 – Lies, Damn Lies, Statistics, Data Science
  29. 29. #04: Your API is an Illusion
  30. 30. IMO, many notions of “API” are illusions – arguably, reductionist shell games And that imposes limitations on how we work, and even how we think… #04: Your API is an Illusion
  31. 31. [workflow diagram, circa 2010: ETL into cluster/cloud; data prep, features; learners and parameters; unsupervised learning; explore; train set and test set; models; evaluate, optimize; scoring against production data; data pipelines; use cases; actionable results; decisions, feedback – with “foo algorithms” and “bar developers” covering only the representation, evaluation, and optimization stages] Algorithms and developer-centric template thinking only go so far in a workflow… Results are shown in blue, while the real work 
 is highlighted in red #04: Your API is an Illusion – The Libraries: Alexandria, Redux
  32. 32. On the other hand, Physics does well to teach modeling – I like to hire physicists to work on Data teams… They tend to get the interdisciplinary aspects: 
 got the math background, coding experience, 
 generally good at systems engineering, etc. Not saying we must all rush out to get Physics 
 degrees – there’s something to be learned there, 
 vital for the work and priorities ahead #04: Your API is an Illusion –The Interzone
  33. 33. “The impact of computing extends far beyond
 science… affecting all aspects of our lives. 
 To flourish in today's world, everyone needs
 computational thinking.” – Jeannette Wing, CMU Computing now ranks alongside the proverbial Reading, Writing, and Arithmetic… Center for Computational Thinking @ CMU
 http://www.cs.cmu.edu/~CompThink/ Exploring Computational Thinking @ Google
 https://www.google.com/edu/computational-thinking/ #04: Your API is an Illusion – Antidote: Computational Thinking
  34. 34. #05: Code Inceptionism
  35. 35. Even so, do we really need to 
 write code for WordCount 
 10^N times? #05: Code Inceptionism
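For reference, the WordCount in question – a minimal PySpark sketch, written here for the 10^N+1-th time. The SparkContext `sc` and the input path are assumptions for illustration:

```python
# Canonical word count in PySpark. Assumes an existing SparkContext `sc`;
# the input path is hypothetical.
from operator import add

counts = (sc.textFile("hdfs:///path/to/input.txt")
            .flatMap(lambda line: line.split())
            .map(lambda word: (word, 1))
            .reduceByKey(add))

for word, count in counts.take(10):
    print(word, count)
```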
  36. 36. Inceptionism: Going Deeper into 
 Neural Networks
 Alexander Mordvintsev, 
 Christopher Olah, Mike Tyka
 Google (2015-06-17) googleresearch.blogspot.com/2015/06/inceptionism-going-deeper-into-neural.html Artificial Neural Networks have spurred remarkable recent progress in image classification and speech recognition. But even though these are very useful tools based on well-known mathematical methods, we actually understand surprisingly little of why certain models work and others don’t. So let’s take a look at some simple techniques for peeking inside these networks. #05: Code Inceptionism
  37. 37. Sidebar: Claire Le Goues, automating software repair 
 Imagine data mining GitHub commit histories of popular open source projects, then applying genetic programming to evolve patches for other OSS projects… 
 In other words, brilliant: Claire Le Goues
 cmu.edu GenProg: A Generic Method for Automatic Software Repair
 Claire Le Goues, ThanhVu Nguyen, Stephanie Forrest, Westley Weimer
 IEEE TSE (2012)
 www.cs.cmu.edu/~clegoues/docs/legoues-tse-genprog12.pdf We describe the algorithm and report experimental results of its success on 16 programs totaling 1.25M lines of C code and 120K lines of module code, spanning eight classes of defects, in 357 seconds, on average. 
 We analyze the generated repairs qualitatively and quantitatively to demonstrate that the process efficiently produces evolved programs that repair the defect, are not fragile input memorizations, and do not lead to serious degradation in functionality. #05: Code Inceptionism
  38. 38. #06: Database Extinction?
  39. 39. Are databases going extinct? Distributed file systems that can be accessed as column stores are generally quite useful There’s an old saying in Computer Science: 
 it’s difficult to distinguish a really good file system from a database, and vice versa #06: Database Extinction?
  40. 40. Original definitions for what became relational databases had less to do with dedicated SQL products, and more in common with something like Spark SQL: A relational model of data for 
 large shared data banks
 Edgar Codd
 Communications of the ACM (1970)
 dl.acm.org/citation.cfm?id=362685 #06: Database Extinction?
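To illustrate, a minimal Spark SQL sketch in the Spark 1.x API of the era – the relational model expressed over files in a distributed file system, with no dedicated SQL product in sight. The SQLContext `sqlContext`, the events file, and its schema are assumptions for illustration:

```python
# Query a JSON file relationally via Spark SQL. Assumes an existing
# SQLContext `sqlContext`; the file path and columns are hypothetical.
df = sqlContext.read.json("hdfs:///path/to/events.json")
df.registerTempTable("events")

top_users = sqlContext.sql("""
    SELECT user_id, COUNT(*) AS n
    FROM events
    GROUP BY user_id
    ORDER BY n DESC
    LIMIT 10
""")
top_users.show()
```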
  41. 41. #06: Database Extinction? [diagram: the Spark stack – SQL, Python, R, Streaming, and DataFrame APIs, plus advanced analytics, atop Tungsten execution] Physical execution: CPU-efficient data structures, keeping data closer to the CPU cache
  42. 42. #07: “N Dims good, 2 Dims baa-d”
  43. 43. Consider: matrices, pivot tables, etc. Our thinking about data representation 
 is often quite two-dimensional… #07: “N Dims good, 2 Dims baa-d”
  44. 44. • many real-world problems are often represented as graphs • graphs can generally be converted into sparse matrices (bridge to linear algebra) • eigenvectors find the stable points in 
 a system defined by matrices – which 
 may be more efficient to compute • beyond simpler graphs, complex data 
 may require work with tensors #07: “N Dims good, 2 Dims baa-d”
  45. 45. Suppose we have a graph as shown below: [diagram: a small undirected graph on vertices u, v, w, x] We call x a vertex (sometimes called a node) An edge (sometimes called an arc) is any line connecting two vertices #07: “N Dims good, 2 Dims baa-d”
  46. 46. We can represent this kind of graph as an adjacency matrix: • label the rows and columns based 
 on the vertices • entries get a 1 if an edge connects the corresponding vertices, or 0 otherwise 
   u v w x 
 u 0 1 0 1 
 v 1 0 1 1 
 w 0 1 0 1 
 x 1 1 1 0 #07: “N Dims good, 2 Dims baa-d”
  47. 47. An adjacency matrix always has certain properties: • it is symmetric, i.e., A = Aᵀ • it has real eigenvalues Therefore algebraic graph theory bridges between linear algebra and graph theory #07: “N Dims good, 2 Dims baa-d”
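Both properties are easy to check with NumPy, using the adjacency matrix from the previous slide (an illustrative sketch, not from the talk):

```python
import numpy as np

# Adjacency matrix for the graph on vertices u, v, w, x.
A = np.array([[0, 1, 0, 1],
              [1, 0, 1, 1],
              [0, 1, 0, 1],
              [1, 1, 1, 0]])

assert (A == A.T).all()              # symmetric: A equals its transpose

eigenvalues = np.linalg.eigvalsh(A)  # eigvalsh handles symmetric matrices and returns real eigenvalues
print(eigenvalues)
```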
  48. 48. Tensors are a good way to handle time-series, geo-spatially distributed, linked data with lots of N-dimensional attributes In other words, potentially a general case 
 for handling much of the data that we’re likely to encounter, as sketched below #07: “N Dims good, 2 Dims baa-d”
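A minimal sketch of that tensor idea, with hypothetical shapes and names throughout: a 3-way NumPy array indexed by (time, location, attribute):

```python
# A small 3-way tensor of synthetic data: 24 time steps, 50 locations,
# 8 attributes per (time, location) pair. All names here are illustrative.
import numpy as np

n_timesteps, n_locations, n_attributes = 24, 50, 8
X = np.random.rand(n_timesteps, n_locations, n_attributes)

# Slicing recovers familiar 2-D views of the same data:
snapshot = X[0]        # all locations and attributes at one time step
history  = X[:, 7, :]  # one location's attributes over time
print(snapshot.shape, history.shape)  # (50, 8) (24, 8)
```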
  49. 49. Although tensor factorization is considered problematic, it may provide more general case solutions: The Tensor Renaissance in Data Science
 Anima Anandkumar @UC Irvine
 radar.oreilly.com/2015/05/the-tensor-renaissance-in-data-science.html Spacey Random Walks and 
 Higher Order Markov Chains
 David Gleich @Purdue
 slideshare.net/dgleich/spacey-random-walks-and-higher-order-markov-chains #07: “N Dims good, 2 Dims baa-d”
  50. 50. #08: Science … and Data
  51. 51. There is Science … and there is Data Data Science is largely about interdisciplinary teams, largely about crossing boundaries (organizational, cognitive) that might otherwise preclude arriving at crucial insights – In other words, about learning It’s also about the repeatability and predictive aspects of science, where workflows combine people + automation NB: may conflict with large portions of academia which tend to decontextualize subjects #08: Science … and Data
  52. 52. The Science in Data Science tends to rely on the phenomenology and modeling of complex systems (did we already mention Physics?) Speaking of science and predictions, two important works to include: • Charles Sanders Peirce – one of the most prolific scientists in the US, and also one of the most fierce critics (abduction, etc.) • Karl Popper – who articulated some 
 of the inherent risks of mixing “science”, “history”, and politics #08: Science … and Data
  53. 53. For excellent examples of Science and Data together, see CodeNeuro, particularly for use of notebooks: #08: Science … and Data
  54. 54. #09: Learning Curves are Forever
  55. 55. Learning Curves are forever – 
 the part you need to manage more carefully than just about anything else, especially within
 a social context In some sense, this is the essence of Data Science: How well do you learn? Much of the risk in managing 
 a Data Science team is about budgeting for the learning curve #09: Learning Curves are Forever
  56. 56. In contrast, IT has a long history of practicing a flavor of engineering “conservatism”: highly structured process, strictly codified practices People learn a few things well, then avoid having to struggle with learning many new things perpetually… That leads to enormous teams and low ROI, among other badness [diagram: complexity rising with scale] #09: Learning Curves are Forever
  57. 57. Throw Your Life a Curve
 Whitney Johnson blogs.hbr.org/johnson/2012/09/throw-your-life-a-curve.html Aggressively Pro-Active Learning: • deconstruction of the cognitive bias One Size Fits All • “makes a compelling case for personal disruption” • “plan your career around learning curves” • hire people who learn/re-learn efficiently #09: Learning Curves are Forever
  58. 58. #09: Learning Curves are Forever Education is more than just lessons, exams, certifications, instructor evaluations, etc., … though some tools would try to reduce it 
 to that level What’s even more interesting is to leverage ML to understand the “distance” between the learner and the subject material
  59. 59. #10: Books, not so much, sadly…
  60. 60. Speaking as a former alt bookstore owner… Sadly, we don’t use books quite as much 
 these days: • above ~35: buy it on Kindle • below ~35: watch it on YouTube #10: Books, not so much, sadly…
  61. 61. From a publisher perspective, consider some of the risks: • fewer people buy the titles • search engines surface oh-so-much noise • increasingly, it’s more difficult for experts to take time to author good content and keep it updated #10: Books, not so much, sadly…
  62. 62. However, it’s unlikely that Kindle, etc., represent the end-all-be-all of publishing… Here’s an idea: your next “book” or “video” should be able to compute something useful #10: Books, not so much, sadly…
  63. 63. Interactive notebooks: Sharing the code Helen Shen Nature (2014-11-05) nature.com/news/interactive-notebooks-sharing-the-code-1.16261 #10: Books, not so much – Repeatable Science
  64. 64. Embracing Jupyter Notebooks at O'Reilly
 Andrew Odewahn, 2015-05-07 https://beta.oreilly.com/ideas/jupyter-at-oreilly “O'Reilly Media is using our Atlas platform to 
 make Jupyter Notebooks a first class authoring environment for our publishing program.” Jupyter, Thebe, Docker, etc. #10: Books, not so much – Something Borrowed, Something New
  65. 65. #10: Books, not so much – Something Borrowed, Something New
  66. 66. #11: A MOOCish Edumacation?
  67. 67. MOOCs have become popular, some are quite useful … even so, these tend to have 
 a very low completion rate Don’t hold your breath waiting for MOOCs to replace other modes of education Learning generally requires a social context: for reinforcement, peer insights/modeling, and frankly some people really feel a need to be given permission to learn #11: A MOOCish Edumacation?
  68. 68. One problem with university study is that disciplines tend to decontextualize GalvanizeU is a rare opportunity in that way: accredited, with contextualized hands-on experience #11: A MOOCish Edumacation?
  69. 69. A significant improvement may be found in the notion of “flipped” 
 or inverted classrooms For a good example, see: Caltech Offers Online Course with 
 Live Lectures in Machine Learning Yaser Abu-Mostafa (2012-03-30) http://www.caltech.edu/news/caltech-offers-online-course-live-lectures-machine-learning-4248 #11: A MOOCish Edumacation?
  70. 70. So a good bit of advice about learning and Data Science … is to invert your classrooms, recontextualize, cross the boundaries to do things that matter, and leverage the hands-on social aspects of learning Like here at GalvanizeU Summary…
  71. 71. Thank You
  72. 72. contact: Just Enough Math O’Reilly (2014) justenoughmath.com
 preview: youtu.be/TQ58cWgdCpA monthly newsletter for updates, 
 events, conf summaries, etc.: liber118.com/pxn/ Intro to Apache Spark
 O’Reilly (2015)
 shop.oreilly.com/product/0636920036807.do
  73. 73. Sometimes A Strange Notion
  74. 74. After we’ve cleaned up data, formulated workflows in terms of monoids, used graph representation, and parallelized with a wealth of linear algebra, much of the heavy-lifting that remains on the clusters is in optimization For example, deep learning @Google 
 uses many layers of neural nets trained 
 with gradient descent optimization Taming Latency Variability and Scaling Deep Learning
 Jeff Dean @Google (2013)
 youtu.be/S9twUcX1Zp0 Vector Quantization:
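To ground the optimization point, a bare-bones gradient descent sketch in plain NumPy – minimizing mean squared error for a linear model, nothing like Google-scale deep learning, but the same basic update rule. All names and numbers here are illustrative:

```python
import numpy as np

# Synthetic regression problem: recover true_w from noisy observations.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + rng.normal(scale=0.1, size=100)

w = np.zeros(3)
learning_rate = 0.1
for step in range(500):
    grad = 2.0 / len(y) * X.T @ (X @ w - y)   # gradient of mean squared error
    w -= learning_rate * grad                  # descend along the gradient

print(w)  # converges near [2.0, -1.0, 0.5]
```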
  75. 75. One advantage of quantum algorithms is 
 to run large gradient descent problems in constant time… As high-ROI apps get reworked to leverage lots of ML and large clusters, 
 SGD comes to represent the datacenter cost basis, notably the part that scales… Want to slash costs exponentially? 
 Plug in quantum for a game-changer,
 maybe Fast quantum algorithm for 
 numerical gradient estimation
 Stephen P. Jordan
 Phys. Rev. Lett. 95, 050501 (2005)
 arxiv.org/abs/quant-ph/0405146 dwavesys.com Vector Quantization:
  76. 76. Proposal: let’s drop clusters of quantum devices into lunar polar craters, so we 
 can handle massive vector quantization workloads • micro-kelvin environs • near perpetual sunlight 
 for energy sources • park routers at L4 • approx. $15B to finance, 
 i.e., ~6 days DoD budget Vector Quantization:
  77. 77. We’ll just put this here… 
 a couple o’ Googly projects in progress: qCraft: Quantum Physics In Minecraft
 plus.google.com/u/1/+QuantumAILab/posts/grMbaaDGChH Vector Quantization: “We’re going back to the Moon. For good.” lunar.xprize.org
