Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

A New Year in Data Science: ML Unpaused

19,831 views

Published on

Data Day Texas 2015 keynote talk http://datadaytexas.com/

Published in: Technology

A New Year in Data Science: ML Unpaused

  1. 1. A New Year in Data Science: 
 ML Unpaused Data Day Texas
 Austin, 2015-01-10 Paco Nathan, @pacoid
  2. 2. Observations about Machine Learning, Data Science, Big Data, Open Source, Cluster Computing, Notebooks, etc., over the past year … plus, a look ahead
  3. 3. Backstory
  4. 4. Backstory: The Sun Also Rises Some wake early in the morning and go build buildings
  5. 5. Backstory: The Sun Also Rises Some wake early in the morning and go build buildings
  6. 6. Backstory: The Sun Also Rises Some gaze into the heavens, sit back, and explain the process…
  7. 7. Backstory: The Sun Also Rises Some gaze into the heavens, sit back, and explain the process… Clearly, provably, 
 our Sun revolves around the Earth 
 at an observable rate
  8. 8. Backstory: The Sun Also Rises Others create and evaluate models to predict the Earth’s orbit of the Sun
  9. 9. Backstory: The Sun Also Rises Sometimes, when 
 the sky gods become angry and obscure the Sun as our due punishment… We grow scared and react: sacrifices must be offered, our plans must change, etc.
  10. 10. Backstory: The Sun Also Rises Sometimes, when the sky gods become angry and obscure the Sun punishment… We grow scared and react: sacrifices must be offered, our plans must These points are what 
 I’d like to discuss today
  11. 11. Whither Data Science?
  12. 12. Whither Data Science? twitter.com/josh_wills/status/198093512149958656
  13. 13. Feel free to disagree, but I find that definition 
 to be flawed… Whither Data Science?
  14. 14. Feel free to disagree, but I find that definition 
 to be flawed… 1. That ignores DevOps (how’s that working out?) 
 and Visualization/Design (ditto) Whither Data Science?
  15. 15. Feel free to disagree, but I find that definition 
 to be flawed… 1. That ignores DevOps (how’s that working out?) 
 and Visualization/Design (ditto) 2. When the CEO asks you to help explain why 
 revenue nose-dived over the past month… neither field has a clue about how to model business phenomena Whither Data Science?
  16. 16. Software Engineering: 
 implement and test a model that somebody selected …almost ignores the matter of modeling entirely, 
 at least not since old school types like Dijkstra ! Statistics: 
 measure and justify a model that somebody selected …was never particularly good at teaching how to 
 model problems – as two renowned statisticians, 
 William Cleveland and Leo Breiman, noted Whither Data Science?
  17. 17. Software Engineering: implement and test a model that somebody selected …almost ignores the matter of modeling entirely, at least not since old school types like ! Statistics: measure and justify a model that somebody selected …was never particularly good at teaching how to model problems – as two renowned statisticians, William Cleveland Whither Data Science? Both fields are necessary, but not sufficient
  18. 18. TheThorn in the Side of Big Data: too few artists
 Christopher Ré, Stanford
 safaribooksonline.com/library/view/strata-conference-santa/9781491900321/ part92.html Whither Data Science?
  19. 19. TheThorn in the Side of Big Data: too few artists Christopher Ré, Stanford safaribooksonline.com/library/view/strata-conference-santa/9781491900321/ part92.html Whither Data Science? “You should think about features and not algorithms”
  20. 20. Remember EJBs?
  21. 21. Floyd Marinescu observed about the aftermath 
 of EJBs in Brief History… Intended for building framework components,
 e.g., for IBM, Oracle, Sun, but not many others Based on RMI, prior to notions 
 like RESTful web services Enterprise Java Beans: Lessons from hate-watch reality television
  22. 22. Maybe a handful of people in the world would 
 ever actually need to use EJBs, but those few people wanted a spec Then, for tragic political reasons (MSFT envy), 
 Sun Microsystems made EJBs prominent in 
 their Java APIs Enterprise Java Beans: Lessons from hate-watch reality television
  23. 23. Fortunately, we evolved: Spring, JBoss, etc., 
 those came along as relatively more sane tech Now we see the Docker thing soar, with notions such as microservices displacing legacy cruft (BTW, if you haven’t yet, check out Weave) Enterprise Java Beans: Lessons from hate-watch reality television
  24. 24. I mention this because, to me, EJB represented 
 a convoluted form of template thinking: Enterprise Java Beans: Lessons from hate-watch reality television developing complex web apps 
 for the sake of 
 developing complex web apps
  25. 25. Enterprise Java Beans: Lessons from hate-watch reality television IRL developers and template thinking don’t determine public policy… right?
  26. 26. Enterprise Java Beans: Lessons from hate-watch reality television To paraphrase Dean Wampler, consider WordCount a simple apps written for MapReduce in Hadoop … ~50 lines of unapologetic Java that feels hella like writing EJBs:
  27. 27. Enterprise Java Beans: Lessons from hate-watch reality television Compare that with functional programming, where 
 the same WC app is three lines of easily-read Scala when run in Apache Spark:
  28. 28. Enterprise Java Beans: Lessons from hate-watch reality television Check out Dean’s talk at 11:00, 
 “Why Scala isTaking Over 
 the Big DataWorld” Compare that with functional programming, where 
 the same WC app is three lines of easily-read Scala when run in Apache Spark:
  29. 29. Enterprise Java Beans: Lessons from hate-watch reality television Hadoop suffers because, IMHO, that convoluted 
 EJB style of developer-centric template thinking staged a coup Perhaps we could “donate” some OSS talent… Send a pull request… Or something.
  30. 30. Lies, Damn Lies, 
 Statistics, and 
 Data Science
  31. 31. Probability got going, formally, in the 16th c. – 
 although interesting mathematical estimations 
 trace back to classical times Arabs in the 9th c. used frequency analysis – 
 later rediscovered by Europeans during the 
 early Italian Renaissance Statistics followed, originally more about what 
 we might call demographics – through 18th c. Lies, Damn Lies, Statistics, Data Science
  32. 32. Laplace, Gauss, et al., bridged the fields in the 
 late 18th c. using distributions (what we studied 
 in Stats 101) to infer the probability of errors 
 in estimates ! ! Much of the 19th/20th c. work was about using goodness of fit tests, etc., justifying some distribution • generally speaking, that require samples • that, in turn, implies batch windows Lies, Damn Lies, Statistics, Data Science
  33. 33. Lies, Damn Lies, Statistics, Data Science That kind of template thinking in action
 really lurvs it some batch windows
  34. 34. While 19th/20th c. stats work focused on defensibility 21st c. work, w.r.t. Big Data apps, focuses more 
 on predictability – plus there’s a shift in how we make estimates… Lies, Damn Lies, Statistics, Data Science BTW, doesn’t it seem weird to crunch through piles of data in large batch jobs, at large expense, when the results get used to approximate features ultimately? Why not perform that in stream?
  35. 35. A fascinating, relatively new area pioneered by relatively few people – e.g., Philippe Flajolet Provides approximation with error bounds using much less resources (RAM, CPU, etc.) highlyscalable.wordpress.com/2012/05/01/ probabilistic-structures-web-analytics- data-mining/ Lies, Damn Lies, Statistics, Data Science
  36. 36. algorithm use case example Bloom Filter set membership code MinHash set similarity code HyperLogLog set cardinality code Count-Min Sketch frequency summaries code DSQ streaming quantiles code SkipList ordered sequence search code Lies, Damn Lies, Statistics, Data Science
  37. 37. Lies, Damn Lies, Statistics, Data Science E.g., ±4% could buy you two orders of magnitude reduction in the required memory footprint for 
 an analytics app ! OSS projects such as Algebird and BlinkDB provide for this newer approach to the math of approximations at scale
  38. 38. Lies, Damn Lies, Statistics, Data Science E.g., ±4% could buy you two orders of magnitude reduction in the required memory footprint for an analytics app ! OSS projects such as provide for this newer approach to the math of approximati Oscar Boykin at 14:00, 
 “Aggregators: Modeling 
 Data Queries Functionally” co-author of Algebird, Scalding
  39. 39. The Interzone
  40. 40. Data Science is inherently interdisciplinary To paraphrase Chris Ré, emphasis on algorithms 
 is relatively minor in the grand scheme – Especially when compared to needs for modeling business problems effectively To wit: beyond phenomenology, leading 
 into quantitative analysis and repeatable results On the one hand, CS + Stats do not quite address those needs… The Interzone
  41. 41. On the other hand, Physics does well to teach modeling – I like to hire physicists to work on Data teams… The Interzone They tend to get the interdisciplinary aspects: 
 got the math background, coding experience, generally good at systems engineering, etc. Not saying we should all rush out to get Physics degrees; there’s something to be learned there, 
 vital for the work and priorities ahead
  42. 42. I mention this because we are at a crossroads, 
 which has more to do with the physical world – 
 some talks here at DDTx15 help illustrate that Vast implications for Health Care, Transportation, Agriculture, Energy, Gov, Manufacturing in general… More about that 
 in a bit – The Interzone
  43. 43. The Libraries
  44. 44. Most of the ML libraries that one encounters 
 today focus on two general kinds of solutions: • convex optimization • matrix factorization The Libraries: Alexandria Redux
  45. 45. One might think of the convex optimization 
 in this case as a kind of curve fitting – generally 
 with some regularization term to avoid overfitting, 
 which is not good Good Bad The Libraries: Alexandria Redux
  46. 46. For supervised learning, used to create classifiers: 1. categorize the expected data into N classes 2. split a sample of the data into train/test sets 3. use learners to optimize classifiers based on
 the training set, to label the data into N classes 4. evaluate the classifiers against the test set, measuring error in predicted vs. expected labels The Libraries: Alexandria Redux
  47. 47. Bokay, great for security problems with simply two classes: good guys vs. bad guys How do you decide what the classes are 
 for more complex problems in business? That’s where the matrix factorization parts come in handy… The Libraries: Alexandria Redux
  48. 48. For unsupervised learning, which is often used 
 to reduce dimension: 1. create a covariance matrix of the data 2. solve for the eigenvectors and eigenvalues 
 of the matrix 3. select the top N eigenvectors, based on diminishing returns for how they explain variance in the data 4. those eigenvectors define your N classes The Libraries: Alexandria Redux
  49. 49. An excellent overview of ML definitions 
 (up to this point) is given in: The Libraries: Alexandria Redux To wit: 
 Generalization = Representation + Optimization + Evaluation A Few UsefulThings to Know about Machine Learning
 Pedro Domingos
 CACM 55:10 (Oct 2012)
 http://dl.acm.org/citation.cfm?id=2347755
  50. 50. evaluationoptimizationrepresentationcirca 2010 ETL into cluster/cloud data data visualize, reporting Data Prep Features Learners, Parameters Unsupervised Learning Explore train set test set models Evaluate Optimize Scoring production data use cases data pipelines actionable results decisions, feedback bar developers foo algorithms Algorithms and developer-centric template thinking only go so far in a workflow… Results are shown in blue, and the real work 
 is highlighted in red The Libraries: Alexandria Redux
  51. 51. evaluationoptimizationrepresentationcirca 2010 ETL into cluster/cloud data data visualize, reporting Data Prep Features Learners, Parameters Unsupervised Learning Explore train set test set models Evaluate Optimize Scoring production data use cases data pipelines actionable results decisions, feedback bar developers foo algorithms Algorithms and developer-centric template thinking only go so far Results are shown in is highlighted in 1. focus on features not algorithms 2. learn how to model business problems by leveraging data 3. notice the workflows needed? 4. leave the dev-centric thinking 
 for odd city council meetings The Libraries: Alexandria Redux
  52. 52. evaluationoptimizationrepresentationcirca 2010 ETL into cluster/cloud data data visualize, reporting Data Prep Features Learners, Parameters Unsupervised Learning Explore train set test set models Evaluate Optimize Scoring production data use cases data pipelines actionable results decisions, feedback bar developers foo algorithms Algorithms and developer-centric template thinking only go so far Results are shown in is highlighted in The Libraries: Alexandria Redux Matthew Kirk 12:00
 “Lessons Learned: Machine Learning andTechnical Debt” Ted Dunning 13:00
 “Computing with Chaos” Julia Evans 15:00
 “Data Pipelines.They're a lot of work!” Christopher Johnson 16:00
 “Scala Data Pipelines for Music Recommendations”
  53. 53. Even so, business demands exceed far beyond what classifiers and labels alone can give us… Businesses lurv Optimization, gobs of it; in 
 that context ML libraries today merely scratch the surface Round hole, square peg The Libraries: Alexandria Redux
  54. 54. Imagine that you compete with FedEx… how do you optimize delivery routes for airplanes, trucks, trains, nanodrones, hoverboards, etc.? Which do you optimize: fuel cost, delivery time, maintenance schedules, minimizing lost packages? Doesn’t sound much like online advertising, social networks, or 
 any episode of Silicon Valley The Libraries: Alexandria Redux
  55. 55. ML, Unpaused
  56. 56. What were the origins of machine learning? • Marvin Minsky @MIT, 1950s • Support Vector Machines @Bell Labs, 1990s • Google @Stanford, 1990s • Ray Kurzweil, 2000s Nope… ML, Unpaused
  57. 57. ML has been an aspect of AI research for a 
 long while, through several different vectors A good early history (up to 1980s) is given in: ML, Unpaused Machine Learning:A Historical and Methodological Analysis
 Jaime Carbonell, Ryszard Michalski, Tom Mitchell
 AI Magazine 4:3 (1983)
 http://dx.doi.org/10.1609/aimag.v4i3.406 To wit: task-oriented studies, knowledge acquisition, cognitive simulation, theoretical exploration … overall, a much 
 broader class of optimization problems
  58. 58. An era of anticipation – AI was making inroads… • emphasis on capturing/representing knowledge 
 and expertise – production use cases in medicine • Fifth Generation Computing (parallel h/w) 
 in Japan MCC, etc. However: • few outside academia had enough cluster compute power – aside from 3-letter agencies and AT&T • meanwhile ML was not yet considered “academic” enough within academia Circa early 1980s:
  59. 59. Stock market “corrected” in 1987: But…
  60. 60. Some fundamental tech platforms emerge… • Hubble Space Telescope, Human Genome Project, WWW, electric cars relaunched And throughout that decade: • Linux, Java @Sun, JavaScript @Netscape • Firefly, an initial commercial ML app 
 on teh interwebs @MIT Media Lab • Rise of e-commerce leveraging horizontal 
 scale-out with commodity hardware Circa early 1990s:
  61. 61. Stock market “tumbled” in 2000: But…
  62. 62. GOOG AMZN EBAY YHOO LNKD NFLX FB TWTR emerged out of the dust… • web apps dominated for search, e-commerce, 
 social networks, etc. • did we mention EJBs and template thinking? • mobile picked up traction • recommender systems went mainstream • AI picked up with semantic web efforts… Circa early 2000s:
  63. 63. Stock market “went free-fall” in 2008: But…
  64. 64. Successful e-commerce firms have IPO’ed and are now busy building skyscrapers in downtown SF… Circa mid 2010s: LinkedIn, 350 Bush Transbay Transit Salesforce, 415 Mission
  65. 65. An odd truism about the hubris of the uber-wealthy and the timing of their skyscraper projects… But… Sears Tower, Chicago Lehman Brothers, London Fontainebleau, Las Vegas
  66. 66. An odd truism about the hubris of the uber-wealthy and the timing of their skyscraper projects… But…
  67. 67. Businesses lurv Optimization, lots of it… • ML circa 1985 focused on those needs, but got knocked back to something inevitably more aristotelian and predictable • Outside of SiliconValley, we’ve made big strides • One danger: next downturn cycle,VCs might 
 reshape tech industry, reverting to “safe bets” Circa mid 2010s: Back to the Future However, a few extremely interesting aspects have emerged…
  68. 68. evaluationoptimizationrepresentationcirca 2010 ETL into cluster/cloud data data visualize, reporting Data Prep Features Learners, Parameters Unsupervised Learning Explore train set test set models Evaluate Optimize Scoring production data use cases data pipelines actionable results decisions, feedback bar developers foo algorithms We have approximation, deep learning and symbolic regression to assist on “Features” evaluationoptimizationrepresentationcirca 2010 ETL into cluster/cloud data data visualize, reporting Data Prep Features Learners, Parameters Unsupervised Learning Explore train set test set models Evaluate Optimize Scoring production data use cases data pipelines actionable results decisions, feedback bar developers foo algorithms Or, maybe, cognitive computing will help on several of the more difficult aspects of this… Circa mid 2010s: Extremely Interesting Emerging Aspects
  69. 69. Circa mid 2010s: Extremely Interesting Emerging Aspects DeepDive @Stanford http://deepdive.stanford.edu/ Knowledge Graph @Google http://www.google.com/insidesearch/ features/search/knowledge.html IBM Watson http://www.ibm.com/ smarterplanet/us/en/ibmwatson/ Scaled Inference https://scaledinference.com/
  70. 70. Circa mid 2010s: Extremely Interesting Emerging Aspects Rhetorical postures: “Is AI a good idea, or potentially harmful?” 
 – per Elon Musk, et al.
  71. 71. Circa mid 2010s: Extremely Interesting Emerging Aspects Clearly: good idea 
 brewbot.io Rhetorical postures: “Is AI a good idea, or potentially harmful?” 
 – per Elon Musk, et al.
  72. 72. Circa mid 2010s: Extremely Interesting Emerging Aspects Speaking of which, a highly recommended podcast 
 by actual data scientists drinking really good beers: partiallyderivative.com
  73. 73. Circa mid 2010s: Extremely Interesting Emerging Aspects 2015: Notebooks in Containers in the Cloud “Keep simple things simple and complex things possible.” databricks.com/product PublishingWorkflows for Jupyter Andrew Odewahn, Kyle Kelley, Rune Madsen odewahn.github.io/publishing-workflows-for-jupyter IPython Interactive Demo
 Nature Magazine + Rackspace nature.com/news/ipython-interactive-demo-7.21492
  74. 74. 2015: Notebooks in Containers in the Cloud “Keep simple things simple and complex things possible.” databricks.com/product PublishingWorkflows for Jupyter Andrew Odewahn odewahn.github.io/publishing-workflows-for-jupyter IPython Interactive Demo Nature Magazine + Rackspace nature.com/news/ipython-interactive-demo-7.21492 Circa mid 2010s: Extremely Interesting Emerging Aspects Makes me wonder about the “data engineer” role … notebooks simplify ops needs, while ultimately the domain experts wield the real power with data
  75. 75. Frontstory
  76. 76. Frontstory: The Sun Also Rises Some wake early in the morning and go build buildings dev-centric templates
  77. 77. Some gaze into the heavens, sit back, and explain the process… 20th c. stats Frontstory: The Sun Also Rises
  78. 78. Sometimes, when the sky gods become angry and obscure the Sun as our due punishment… VCs during recessions Frontstory: The Sun Also Rises
  79. 79. Others create and evaluate models to predict the Earth’s orbit of the Sun What’s needed most Frontstory: The Sun Also Rises
  80. 80. Forward Motion: SV trend: early data scientists displace old-school product managers Because there are hard 
 problems to be solved… Because we need 
 new eyes on target… Because use cases…
  81. 81. Because Use Cases
  82. 82. Because Use Cases: Health Care “In fact, using ourTopological Data Analysis system, they were able to discover multiple types of Type 2 diabetes … huge impact on all the hundreds of millions of people” – Ayasdi “Nobody knows what to do with those archives …They’re just sitting there, costing money. This is just seen as a big opportunity. It’s like,‘Oh, this is what we were saving this up for!’” – Enlitic “Sloan-Kettering is also trainingWatson on 1,500 real-world lung cancer cases, helping it to decipher physician notes and learn from the hospital’s expertise in treating cancer.” – IBM Watson Employing tech such as deep learning and cognitive computing for vital use cases in 
 health care:
  83. 83. Because Use Cases: Transportation http://automatic.com/ ! Detects events like hard braking, acceleration – uploaded in real-time with geolocation to a Spark Streaming pipeline … data trends indicate road hazards, blind intersections, bad signal placement, and other input to improve traffic planning. Also detects inefficient vehicle operation, under-inflated tires, poor driving behaviors, aggressive acceleration, etc.
  84. 84. Because Use Cases: Education https://databricks.com/blog/2014/12/08/ pearson… ! Integrates Kafka + Spark Streaming + Cassandra + Blur, running within aYARN cluster on AWS to provide a scalable, reliable, cloud-based platform for services that analyze student performance across product and institution boundaries. Delivers immersive learning experiences designed for how students read, think, and learn; as well as efficacy insights to both learners and institutions which were not possible before. ! Reliability features handle Kafka node failures, receiver failures, leader changes, committed offset in ZK, plus adjustable data-rate throughput.
  85. 85. Because Use Cases: Language, everywhere http://idibon.com/ ! ! ! Our social fabric is encoded as text documents, and similarly it get tested, deployed, maintained, and monitored there – it’s the launch point for cognitive computing. http://digitalreasoning.com/
  86. 86. http://digitalreasoning.com/ Because Use Cases: Language, everywhere http://idibon.com/ ! ! ! Our social fabric is encoded as text documents, and similarly it get tested, deployed, maintained, and monitored there – it’s the launch point for cognitive computing. Robert Munroe, 12:00 “Building Better Experts: co-optimization of human and machine intelligence at Idibon” AndrewTrask, David Gilmore 11:00 “Deep Learning for Natural Language Processing”
  87. 87. Because Use Cases: Geospatial Advanced geo uses cases throughout all levels of gov 
 and industry for Big Data, machine learning, graph algorithms, approximations, etc. If you roll trucks you probably use licenses from ESRI. Also consider the IoT sensor data, e.g., from National Instruments' customers – where does it go, what do organizations use to analyze it? These are the large-scale optimization problems you were looking for… http://esri.github.io/gis-tools-for-hadoop/ (and Spark) http://thunderheadxpler.blogspot.com/ http://geotrellis.io/ http://www.oculusinfo.com/tiles/ https://databricks.com/blog/2014/12/03/app...
  88. 88. Because Use Cases: Telecom,Travel, Banking, etc. http://spark-summit.org/2014/talk/ stratio-streaming… Stratio represents one of the most sophisticated integrations for Spark Streaming – the union of a real-time messaging bus with a complex event processing engine: Kafka, Spark Streaming, Cassandra, along with the Siddhi CEP engine Telecom, in particular, is leveraging this new streaming technology as a big win near-term http://www.openstratio.org/
 https://github.com/stratio https://github.com/Stratio/streaming- cep-engine BTW if you’re in Madrid next fall 
 check out Big Data Hispano
  89. 89. Because Use Cases… Common theme: many of those use cases are powered by Apache Spark – Especially notice Spark Streaming, which is a big game-changer for analytics across industry
  90. 90. Because Use Cases… Common theme: many of those use cases are powered by Especially notice game-changer for analytics across industry Taylor Goetz 11:00
 “Beyond theTweetingToaster: IoT Streaming AnalyticsWith Apache Storm, Kafka, and Arduino” Hari Shreedharan 12:00
 “RealTime Data Processing Using Spark Streaming”
  91. 91. Because Use Cases: Agriculture Ag+Data Issues
 http://radar.oreilly.com/2014/04/agdata.html Data Guild whitepaper: Ag Systems + Data Outlook
 http://goo.gl/OK8RFf • livelihood for 40% of world population • $15T/year annual GDP globally • data-intensive issues, much legal impasse Over a half billion small farms worldwide, and most 
 are family-run farms that rely on rain-fed agriculture Nudge, and I just might propose DWave clusters 
 into cold craters on the Lunar South Pole with 
 routers @L5 and an LLO skyhook… to handle
 the vector quantization demands. Or something. airships e.g., JP Aerospace, 40 km atmostats e.g.,Titan Aerospace, 20 km microsats e.g., Planet Labs, 400 km robots e.g., Blue River, 1 m sensors e.g., Hortau, -0.3 m drones e.g., HoneyComb, 120 m Layered Sensing Networks
  92. 92. Resources
  93. 93. Apache Spark developer certificate program • http://oreilly.com/go/sparkcert • defined by Spark experts @Databricks • assessed by O’Reilly Media • establishes the bar for Spark expertise certification:
  94. 94. MOOCs: Anthony Joseph
 UC Berkeley begins 2015-02-23 edx.org/course/uc-berkeleyx/uc- berkeleyx-cs100-1x- introduction-big-6181 Ameet Talwalkar
 UCLA begins 2015-04-14 edx.org/course/uc-berkeleyx/ uc-berkeleyx-cs190-1x- scalable-machine-6066
  95. 95. community: spark.apache.org/community.html events worldwide: goo.gl/2YqJZK ! video+preso archives: spark-summit.org resources: databricks.com/spark-training-resources workshops: databricks.com/spark-training
  96. 96. http://spark-summit.org/
  97. 97. confs: Strata CA
 San Jose, Feb 18-20
 strataconf.com/strata2015 Spark Summit East
 NYC, Mar 18-19
 spark-summit.org/east Big Data Tech Con
 Boston, Apr 26-28
 bigdatatechcon.com Strata EU
 London, May 5-7
 strataconf.com/big-data-conference-uk-2015 Spark Summit 2015
 SF, Jun 15-17
 spark-summit.org
  98. 98. books: Fast Data Processing 
 with Spark
 Holden Karau
 Packt (2013)
 shop.oreilly.com/product/ 9781782167068.do Spark in Action
 Chris Fregly
 Manning (2015*)
 sparkinaction.com/ Learning Spark
 Holden Karau, 
 Andy Konwinski, Matei Zaharia
 O’Reilly (2015*)
 shop.oreilly.com/product/ 0636920028512.do
  99. 99. presenter: Just Enough Math O’Reilly, 2014 justenoughmath.com
 preview: youtu.be/TQ58cWgdCpA monthly newsletter for updates, 
 events, conf summaries, etc.: liber118.com/pxn/ Enterprise Data Workflows with Cascading O’Reilly, 2013 shop.oreilly.com/product/ 0636920028536.do

×