Big Data in Texas: Then, Now, and Ahead
 


Plenary talk from Data Day Texas 2013 http://datadaytexas.com/ in Austin



Usage Rights

CC Attribution-ShareAlike License

Big Data in Texas: Then, Now, and Ahead Presentation Transcript

  • 1. “Big Data in Texas: Then, Now, and Ahead” Paco Nathan, Evil Mad Scientist @Concurrent, Inc.
  • 2. Then, Now, and Ahead: THEN 1. Keep Austin Weird? 2. Something Called Data Science 3. Rise Of The Machine Data 4. A Cambrian Explosion 5. Eat, Drink, Be Merry… 6. Data-Driven In TX 7. Roll Up Your Sleeves
  • 3. observations… Lynn asked me to talk about Data here today. A few weeks ago we stepped back for a moment to reflect on what we’d seen happen in Austin over the years. Both of us ran alternative bookstores in Austin, twenty or so years ago, and we participated as the Internet thing exploded in the 1990s. That was a blast.
  • 8. observations… We noticed a trend. Thinking about some of those who kept showing up whenever interesting things were afoot…
  • 10. “curation and metadata”
  • 11. observations… Overall, it’s about systems thinking. We have a wealth of that here, at UT/Austin in particular… Ilya Prigogine spent years here, which is just incredible. School of Architecture, with leading work in VR, GIS, etc. Interactive innovations at ACTLab… Quantitative emphasis at McCombs… major intellectual resources here
  • 12. Then, Now, and Ahead: NOW 1. Keep Austin Weird? 2. Something Called Data Science 3. Rise Of The Machine Data 4. A Cambrian Explosion 5. Eat, Drink, Be Merry… 6. Data-Driven In TX 7. Roll Up Your Sleeves
  • 13. Data Science [team roles diagram over a background of application event logs] Domain Expert: business process, stakeholder. Data Scientist: data science (data prep, discovery, modeling, etc.). App Dev: software engineering, automation. Ops: systems engineering, availability. Legend: introduced capability
  • 14. Data Science in Texas…
  • 15. references… by DJ Patil: Data Jujitsu, O’Reilly, 2012, amazon.com/dp/B008HMN5BE; Building Data Science Teams, O’Reilly, 2011, amazon.com/dp/B005O4U3ZE
  • 16. Enterprise Data Workflows [flow diagram with nodes: Document Collection, Tokenize (Regex, token), Scrub (token), HashJoin Left against Stop Word List (RHS), GroupBy (token), Count, Word Count; stages labeled M and R] cascading.org
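The word-count flow named on this slide (tokenize, scrub, drop stop words, group by token, count) can be sketched in a few lines of plain Python. This is only an illustration of the data flow, not the Cascading API: the function names, the regex, the stop-word list, and the sample documents are all assumptions made for the example.

```python
import re
from collections import Counter

# Hypothetical stop-word list; the slide only says "Stop Word List".
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "is", "to"}

# Assumed token pattern: runs of lowercase letters, digits, apostrophes.
TOKEN_RE = re.compile(r"[a-z0-9']+")


def scrub(text):
    """Normalize raw text before tokenizing (trim and lowercase)."""
    return text.strip().lower()


def tokenize(text):
    """Split scrubbed text into tokens with a regex."""
    return TOKEN_RE.findall(text)


def word_count(documents, stop_words=STOP_WORDS):
    """Count tokens across documents, skipping stop words.

    The membership test plays the role of the left join against the
    stop-word list: only tokens with no match on the right-hand side
    are kept. Counter does the group-by-token and count.
    """
    counts = Counter()
    for doc in documents:
        for token in tokenize(scrub(doc)):
            if token not in stop_words:
                counts[token] += 1
    return counts


# Made-up sample documents, purely for illustration.
docs = ["The rain in Texas is big.", "Big data, big sky."]
counts = word_count(docs)
print(counts.most_common(3))
```

On a cluster the same steps would be split into map-side work (scrub, tokenize, stop-word join) and reduce-side work (group and count); here everything runs in one process.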
  • 17. Enterprise Data Workflows Over the past 5+ years, we’ve seen many large-scale Enterprise production deployments based on Cascading, Cascalog, Scalding, PyCascading, Cascading.JRuby, etc. Enterprise data workflows, Machine learning at scale, Big Data… Why? amazon.com/dp/1449358721
  • 18. Then, Now, and Ahead: NOW 1. Keep Austin Weird? 2. Something Called Data Science 3. Rise Of The Machine Data 4. A Cambrian Explosion 5. Eat, Drink, Be Merry… 6. Data-Driven In TX 7. Roll Up Your Sleeves
  • 19. Three broad categories of data (Curt Monash, 2010, dbms2.com/2010/01/17/three-broad-categories-of-data) • Human/Tabular data: human-generated data which fits well into tables/arrays • Human/Nontabular data: all other data generated by humans • Machine-Generated data
  • 20. Three broad categories of data (Curt Monash, 2010, dbms2.com/2010/01/17/three-broad-categories-of-data) • Human/Tabular data: human-generated data which fits well into tables/arrays • Human/Nontabular data: all other data generated by humans • Machine-Generated data • Adjusted Data: Dr. Don Easterbrook, Senate witness
  • 21. Q3 1997: inflection point. Four independent teams were working toward horizontal scale-out of workflows based on commodity hardware. This effort prepared the way for huge Internet successes in the 1997 holiday season… AMZN, EBAY, Inktomi (YHOO Search), then GOOG. MapReduce and the Apache Hadoop open source stack emerged from this.
  • 22. Circa 1996: pre-inflection point [diagram labels: Stakeholder Customers Excel pivot tables PowerPoint slide decks strategy BI Product Analysts requirements SQL Query optimized Engineering code Web App result sets transactions RDBMS]
  • 23. Circa 1996: pre-inflection point [same diagram, with callout “Throw it over the wall”: Stakeholder Customers Excel pivot tables PowerPoint slide decks strategy BI Product Analysts requirements SQL Query optimized Engineering code Web App result sets transactions RDBMS]
  • 24. Circa 2001: post-big ecommerce successes [diagram labels: Stakeholder Product Customers dashboards UX Engineering models servlets recommenders Algorithmic + Web Apps Modeling classifiers Middleware aggregation event SQL Query history result sets customer transactions Logs DW ETL RDBMS]
  • 25. Circa 2001: post-big ecommerce successes [same diagram, with callout “Data products”: Stakeholder Product Customers dashboards UX Engineering models servlets recommenders Algorithmic + Web Apps Modeling classifiers Middleware aggregation event SQL Query history result sets customer transactions Logs DW ETL RDBMS]
  • 26. Circa 2013: clusters everywhere [diagram labels: Data Products Customers business Domain process Prod Expert Workflow dashboard metrics data Web Apps, s/w History services science Mobile, etc. dev Data Scientist Planner social discovery interactions + optimized transactions, Eng modeling taps capacity content App Dev Use Cases Across Topologies Hadoop, Log In-Memory etc. Events Data Grid Ops DW Ops batch near time Cluster Scheduler introduced existing capability SDLC RDBMS RDBMS]
  • 27. Circa 2013: clusters everywhere [same diagram, with callout “Optimizing topologies”: Data Products Customers business Domain process Prod Expert Workflow dashboard metrics data Web Apps, s/w History services science Mobile, etc. dev Data Scientist Planner social discovery interactions + optimized transactions, Eng modeling taps capacity content App Dev Use Cases Across Topologies Hadoop, Log In-Memory etc. Events Data Grid Ops DW Ops batch near time Cluster Scheduler introduced existing capability SDLC RDBMS RDBMS]
  • 28. references… • Lambda Architecture: blending topologies • Big Data by Nathan Marz, James Warren • manning.com/marz (source: Nathan Marz)
  • 29. references… by Leo Breiman: Statistical Modeling: The Two Cultures, Statistical Science, 2001, bit.ly/eUTh9L
  • 30. references… Amazon: “Early Amazon: Splitting the website”, Greg Linden, glinden.blogspot.com/2006/02/early-amazon-splitting-website.html eBay: “The eBay Architecture”, Randy Shoup, Dan Pritchett, addsimplicity.com/adding_simplicity_an_engi/2006/11/you_scaled_your.html addsimplicity.com.nyud.net:8080/downloads/eBaySDForum2006-11-29.pdf Inktomi (YHOO Search): “Inktomi’s Wild Ride”, Eric Brewer (0:05:31 ff), youtube.com/watch?v=E91oEn1bnXM Google: “Underneath the Covers at Google”, Jeff Dean (0:06:54 ff), youtube.com/watch?v=qsan-GQaeyk perspectives.mvdirona.com/2008/06/11/JeffDeanOnGoogleInfrastructure.aspx
  • 31. Then, Now, and Ahead: NOW 1. Keep Austin Weird? 2. Something Called Data Science 3. Rise Of The Machine Data 4. A Cambrian Explosion 5. Eat, Drink, Be Merry… 6. Data-Driven In TX 7. Roll Up Your Sleeves
  • 32. Displacement. Geoffrey Moore (Mohr Davidow Ventures, author of Crossing The Chasm), Hadoop Summit, 2012: what Amazon did to the retail sector… has put the entire Global 1000 on notice over the next decade; data as the major force… mostly through apps: verticals, leveraging domain expertise. Michael Stonebraker (INGRES, PostgreSQL, Vertica, VoltDB, Paradigm4, etc.), XLDB, 2012: complex analytics workloads are now displacing SQL as the basis for Enterprise apps
  • 33. Drivers: algorithmic modeling + machine data + curation, metadata + Open Data; data products, as feedback into automation; evolution of feedback loops, a big part of the science in data science… internet of things + complex analytics; accelerated evolution, additional feedback loops; taking this out into a highly social dimension
  • 34. “A kind of Cambrian explosion” (source: National Geographic)
  • 35. Internet of Things
  • 36. A Thought Exercise Consider that when a company like Caterpillar moves into data science, they won’t be building the world’s next search engine or social network. They will most likely be optimizing supply chain, optimizing fuel costs, automating data feedback loops integrated into their equipment… Operations Research: crunching amazing amounts of data
  • 37. A Thought Exercise That’s a $50B company, in a market segment worth $250B. Upcoming: tractors as drones, guided by complex, distributed data apps
  • 38. Alternatively… climate.com
  • 39. Two Avenues to the App Layer. Enterprise: must contend with complexity at scale every day… incumbents extend current practices and infrastructure investments. Start-ups: crave complexity and scale to become viable… new ventures move into Enterprise space to compete using relatively lean staff. [chart axes: complexity ➞, scale ➞]
  • 40. Then, Now, and Ahead: AHEAD 1. Keep Austin Weird? 2. Something Called Data Science 3. Rise Of The Machine Data 4. A Cambrian Explosion 5. Eat, Drink, Be Merry… 6. Data-Driven In TX 7. Roll Up Your Sleeves
  • 41. For instance… Let’s drill down on that intersection of tractors and crops, as a focus… Some of the largest use cases for large-scale data workflows which we encounter are in Agriculture. Here’s a sector which integrates some of those themes from the Internet of Things, Caterpillar, Climate Corp, etc.
  • 42. Data and Agriculture, Ahead • single largest employer, livelihood for 40% globally • 500 million small farms worldwide • most family farmers rely on rain-fed agriculture • approx $2T agricultural real estate in US alone • high annual rate of soil depletion • cycles of flooding, drought, desertification • high resolution from private satellite networks, e.g., skyboximaging.com • SMS networks for “business intelligence” among family farmers in Ethiopia agrepedia.com • microfinance, e.g., kiva.org, slowmoney.org
  • 43. Data and Agriculture, Ahead Consider the emerging reality of drone tractors, guided by satellite feeds, with predictive analytics accessing remote cloud-based clusters, crunching data for crops planted per-plot, based on years of history evaluated in time series analysis. It would be difficult to identify a bigger Big Data problem in the world
  • 44. Data and Agriculture, Ahead You’ve heard about Peak Oil, Peak Phosphorus? How about Peak Snow? In other words, rising variance of snow pack levels, increasingly earlier peak snow in the mountains… which stresses the watersheds, infrastructure, etc., which in turn stress agriculture, energy, transportation, financial markets, tax basis, etc. Jeff Dozier, William Gail, “The Emerging Science of Environmental Applications”, The Fourth Paradigm, 2009 (source: J. Dozier, et al., UCSB)
  • 45. Data and Agriculture, Ahead Variance in the timing of the water cycle causes stress on natural resources and infrastructure: reservoirs, aqueducts, river ways, aquifers, levees, farm lands, seawater incursion, etc. Even in the face of so much IoT data looming, we lack adequate data and modeling of snowpack, snow melt, runoff, evaporation, water basins, etc., to understand the impact of these changes, which is now needed to forecast where to change infrastructure or strategies. There’s not much machine data up in the mountain peaks, and satellite data only goes so far… new opportunities for Big Data (source: J. Dozier, et al., UCSB)
  • 46. Data and Agriculture, Ahead
  • 47. Data and Agriculture, Ahead We can resolve these kinds of problems; however, solutions must leverage huge amounts of data
  • 48. Then, Now, and Ahead: AHEAD 1. Keep Austin Weird? 2. Something Called Data Science 3. Rise Of The Machine Data 4. A Cambrian Explosion 5. Eat, Drink, Be Merry… 6. Data-Driven In TX 7. Roll Up Your Sleeves
  • 49. Everything’s Bigger in Texas Agriculture is just one sector, one set of problems to tackle. We have much, much more here in Texas. For example, Houston is a major center for Maritime work… check out: marinexplore.org
  • 50. Everything’s Bigger in Texas There’s also the not so small matter of the Energy and Transportation sectors. GE is putting sensors in each and every wind generator, each and every jet engine: again, the Internet of Things. I’ve heard rumors there are a few of those wind turbines out in West Texas?
  • 51. Everything’s Bigger in Texas Another of the fastest growing use cases we see for large-scale predictive modeling is in Telecom. Think about the stream of CDRs, billions of us bipeds wandering about the planet with our phones… The firehose for that makes Twitter look like MySpace! The value of location services as data products for local businesses and communities is astounding
  • 52. Then, Now, and Ahead: AHEAD 1. Keep Austin Weird? 2. Something Called Data Science 3. Rise Of The Machine Data 4. A Cambrian Explosion 5. Eat, Drink, Be Merry… 6. Data-Driven In TX 7. Roll Up Your Sleeves
  • 53. What is needed? Approximately 80% of the costs for data-related projects get spent on data preparation, mostly on cleaning up data quality issues: ETL, log file analysis, etc. Unfortunately, data-related budgets for many companies tend to go into frameworks which can only be used after cleanup. Most valuable skills: ‣ learn to use programmable tools that prepare data ‣ learn to generate compelling data visualizations ‣ learn to estimate the confidence for reported results ‣ learn to automate work, making analysis repeatable (source: D3)
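One of the skills listed above, estimating the confidence for reported results, can be illustrated with a percentile bootstrap. This is a minimal sketch of one common technique, not something the talk prescribes: the function name, its parameters, and the sample values are all made up for the example.

```python
import random
import statistics


def bootstrap_ci(values, stat=statistics.mean, n_resamples=2000,
                 alpha=0.05, seed=42):
    """Percentile-bootstrap confidence interval for a statistic.

    Resamples `values` with replacement, recomputes `stat` on each
    resample, and returns the (alpha/2, 1 - alpha/2) percentiles of
    those estimates. A fixed seed keeps the analysis repeatable.
    """
    rng = random.Random(seed)
    n = len(values)
    estimates = sorted(
        stat([rng.choice(values) for _ in range(n)])
        for _ in range(n_resamples)
    )
    lower = estimates[int((alpha / 2) * n_resamples)]
    upper = estimates[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Made-up measurements, purely for illustration.
sample = [12.1, 9.8, 11.4, 10.2, 13.7, 9.5, 10.9, 12.6, 11.0, 10.4]
point = statistics.mean(sample)
lower, upper = bootstrap_ci(sample)
print(f"mean = {point:.2f}, 95% CI = ({lower:.2f}, {upper:.2f})")
```

Reporting the interval alongside the point estimate, and fixing the random seed, also touches the last skill on the list: the analysis can be rerun and gives the same answer.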
  • 54. What else do we need? • more emphasis on statistical thinking • not SQL vs. NoSQL, but instead a focus on apps as the process of structuring data • multi-disciplinary teams, not cubicles and silos • evolving more feedback loops, to drive more automation • oddly enough, we need automation to be able to employ more people in intelligent, productive ways • otherwise, we’re left with… (source: Schwa Corporation)
  • 55. (source: Twentieth Century Fox)
  • 56. Thank you very much! (source: Twentieth Century Fox)