“Big Data in Texas:
          Then, Now, and Ahead”

Paco Nathan,
Evil Mad Scientist @
Concurrent, Inc.


                                  1
Then, Now, and Ahead




                                             THEN
1. Keep Austin Weird?
2. Something Called Data Science
3. Rise Of The Machine Data
4. A Cambrian Explosion
5. Eat, Drink, Be Merry…
6. Data-Driven In TX
7. Roll Up Your Sleeves



                                                          2
observations…




  Lynn asked me to talk about Data here today
  A few weeks ago we stepped back for a moment
  to reflect on what we’d seen happen in Austin
  over the years
  Both of us ran alternative bookstores in Austin,
  twenty or so years ago, and we participated as
  the Internet thing exploded in the 1990s
  That was a blast –




                                                     3
observations…




  We noticed a trend
  Thinking about some of those who kept
  showing up whenever interesting things
  were afoot…




                                           8
9
“curation and metadata”


                          10
observations…




  Overall, it’s about systems thinking
  We have a wealth of that here, at UT/Austin in particular…
  Ilya Prigogine spent years here, which is just incredible
  School of Architecture, with leading work in VR, GIS, etc.
  Interactive innovations at ACTLab…
  Quantitative emphasis at McCombs…
     major intellectual resources here




                                                               11
Then, Now, and Ahead




                                             NOW
1. Keep Austin Weird?
2. Something Called Data Science
3. Rise Of The Machine Data
4. A Cambrian Explosion
5. Eat, Drink, Be Merry…
6. Data-Driven In TX
7. Roll Up Your Sleeves



                                                          12
Data Science

       Domain Expert           business process, stakeholder

       Data Scientist          data prep, discovery, modeling, etc.

       App Dev                 software engineering, automation

       Ops                     systems engineering, availability

       labels: data science; introduced capability

       [figure: rotated list of application event labels; not recoverable as text]


                                                                                                                                                 13
Data Science in Texas…




                         14
references…


   by DJ Patil

   Data Jujitsu
   O’Reilly, 2012
   amazon.com/dp/B008HMN5BE

   Building Data Science Teams
   O’Reilly, 2011
   amazon.com/dp/B005O4U3ZE

                                 15
Enterprise Data Workflows


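   [diagram: word count workflow]
   Document Collection ➞ Tokenize ➞ Scrub token ➞ HashJoin Left ➞ Regex token ➞ GroupBy token ➞ Count ➞ Word Count
   Stop Word List ➞ HashJoin Left (RHS)
   (M and R mark the map and reduce phases)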




cascading.org
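
The workflow in the diagram above can be sketched in plain Python. This is a minimal illustration, not Cascading's API: the function names, the regex, the sample documents, and the stop words are all assumptions made for the example. It follows the same stages: tokenize, scrub, filter against a stop word list (the left join that keeps only rows with an empty right-hand side), then group by token and count.

```python
import re
from collections import Counter

# Hypothetical inputs, invented for illustration only.
DOCUMENTS = [
    "Keep Austin weird",
    "The rise of the machine data",
    "Data science in Texas: data, data everywhere",
]
STOP_WORDS = {"the", "of", "in", "a"}

TOKEN_RE = re.compile(r"[^\W\d_]+")  # runs of letters


def tokenize(text):
    """Split a document into raw tokens (runs of letters)."""
    return TOKEN_RE.findall(text)


def scrub(token):
    """Normalize a token: trim whitespace and lowercase."""
    return token.strip().lower()


def word_count(documents, stop_words):
    """Tokenize, scrub, drop stop words, then group by token and count."""
    counts = Counter()
    for doc in documents:
        for raw in tokenize(doc):
            token = scrub(raw)
            # Equivalent to a left join against the stop word list,
            # keeping only rows whose right-hand side is empty.
            if token and token not in stop_words:
                counts[token] += 1
    return counts


if __name__ == "__main__":
    for token, n in word_count(DOCUMENTS, STOP_WORDS).most_common():
        print(token, n)
```

On a cluster the grouping and counting would run as a distributed reduce step; here a single in-memory Counter stands in for it.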


                                                                                         16
Enterprise Data Workflows




   Over the past 5+ years, we’ve seen many large-
   scale Enterprise production deployments based
   on Cascading, Cascalog, Scalding, PyCascading,
   Cascading.JRuby, etc.
   Enterprise data workflows,
   Machine learning at scale,
   Big Data…
   Why?




                                                    amazon.com/dp/1449358721
                                                                               17
Then, Now, and Ahead




                                             NOW
1. Keep Austin Weird?
2. Something Called Data Science
3. Rise Of The Machine Data
4. A Cambrian Explosion
5. Eat, Drink, Be Merry…
6. Data-Driven In TX
7. Roll Up Your Sleeves



                                                          18
Three broad categories of data
Curt Monash, 2010
dbms2.com/2010/01/17/three-broad-categories-of-data

• Human/Tabular data – human-generated data which fits well into tables/arrays

• Human/Nontabular data – all other data generated by humans

• Machine-Generated data

• Adjusted Data – Dr. Don Easterbrook, Senate witness




                                                                                20
Q3 1997: inflection point




   Four independent teams were working toward horizontal
   scale-out of workflows based on commodity hardware
   This effort prepared the way for huge Internet successes
   in the 1997 holiday season… AMZN, EBAY, Inktomi
   (YHOO Search), then GOOG

   MapReduce and the Apache Hadoop open source stack
   emerged from this




                                                              21
Circa 1996: pre- inflection point

                                       Stakeholder                   Customers

                Excel pivot tables
              PowerPoint slide decks        strategy




        “Throw it over the wall”
                    BI
                                           Product
                  Analysts


                                          requirements



                  SQL Query                              optimized
                                         Engineering       code         Web App
                   result sets



                                                                        transactions




                                                                        RDBMS




                                                                                       23
Circa 2001: post- big ecommerce successes

               Stakeholder                    Product                   Customers




                  “Data products”
                 dashboards                                                  UX
                           Engineering

                               models                        servlets

                                             recommenders
               Algorithmic                          +                   Web Apps
                Modeling                        classifiers


                                                                        Middleware
                               aggregation
                                                              event
                SQL Query                                    history
                 result sets                                               customer
                                                                         transactions
                                                Logs



                   DW                             ETL                    RDBMS




                                                                                        25
Circa 2013: clusters everywhere

                                             Data Products                                      Customers
                               business
       Domain                  process                                                                                       Prod
       Expert                                 Workflow
                                 dashboard
                                  metrics
                       data
                                                                                               Web Apps,               s/w
                                                History                   services
                     science                                                                   Mobile, etc.            dev
        Data
      Scientist
                                               Planner                                     social
                               discovery                                                interactions
                                   +                      optimized                                    transactions,
                                                                                                                              Eng
                               modeling           taps     capacity                                       content

       App Dev

                                  “Optimizing topologies”
                                                  Use Cases Across Topologies


                                                Hadoop,                 Log                      In-Memory
                                                  etc.                 Events                     Data Grid
         Ops                          DW                                                                                      Ops
                                                                                batch     near time


                                                                      Cluster Scheduler
       introduced                                                                                                            existing
        capability                                                                                                            SDLC

                                                                                                       RDBMS
                                                                                                        RDBMS


                                                                                                                                        27
references…

   • Lambda Architecture: blending topologies
   • Big Data by Nathan Marz, James Warren
   • manning.com/marz




              source: Nathan Marz
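
As a rough sketch of what blending topologies means in the Lambda Architecture: a batch layer periodically recomputes a view over the full history, a speed layer keeps a view over events that arrived since the last batch run, and queries merge the two. The event format, function names, and cutoff below are hypothetical, chosen only for illustration; this is not code from the book.

```python
from collections import Counter

# Hypothetical event log: (timestamp, page) pairs, invented for illustration.
EVENTS = [
    (1, "/home"), (2, "/docs"), (3, "/home"),
    (4, "/home"), (5, "/docs"), (6, "/pricing"),
]


def batch_view(events, cutoff):
    """Batch layer: recompute counts from all events up to the cutoff."""
    return Counter(page for ts, page in events if ts <= cutoff)


def realtime_view(events, cutoff):
    """Speed layer: count only events newer than the last batch run."""
    return Counter(page for ts, page in events if ts > cutoff)


def query(page, batch, realtime):
    """Serving side: merge the batch and real-time views at query time."""
    return batch[page] + realtime[page]


if __name__ == "__main__":
    cutoff = 4  # timestamp of the last completed batch run (assumed)
    batch = batch_view(EVENTS, cutoff)
    recent = realtime_view(EVENTS, cutoff)
    for page in ("/home", "/docs", "/pricing"):
        print(page, query(page, batch, recent))
```

In a real deployment the speed layer would update incrementally as events arrive rather than rescanning a list; the rescan here just keeps the sketch short.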




                                                28
references…




   by Leo Breiman
   Statistical Modeling: The Two Cultures
   Statistical Science, 2001
   bit.ly/eUTh9L




                                            29
references…

  Amazon
  “Early Amazon: Splitting the website” – Greg Linden
  glinden.blogspot.com/2006/02/early-amazon-splitting-website.html

  eBay
  “The eBay Architecture” – Randy Shoup, Dan Pritchett
  addsimplicity.com/adding_simplicity_an_engi/2006/11/you_scaled_your.html
  addsimplicity.com.nyud.net:8080/downloads/eBaySDForum2006-11-29.pdf

  Inktomi (YHOO Search)
  “Inktomi’s Wild Ride” – Eric Brewer (0:05:31 ff)
  youtube.com/watch?v=E91oEn1bnXM

  Google
  “Underneath the Covers at Google” – Jeff Dean (0:06:54 ff)
  youtube.com/watch?v=qsan-GQaeyk
  perspectives.mvdirona.com/2008/06/11/JeffDeanOnGoogleInfrastructure.aspx




                                                                             30
Then, Now, and Ahead




                                             NOW
1. Keep Austin Weird?
2. Something Called Data Science
3. Rise Of The Machine Data
4. A Cambrian Explosion
5. Eat, Drink, Be Merry…
6. Data-Driven In TX
7. Roll Up Your Sleeves



                                                          31
Displacement

               Geoffrey Moore
               Mohr Davidow Ventures, author of Crossing The Chasm
               Hadoop Summit, 2012:

               what Amazon did to the retail sector… has put the
               entire Global 1000 on notice over the next decade
               data as the major force… mostly through apps –
               verticals, leveraging domain expertise

               Michael Stonebraker
               INGRES, PostgreSQL, Vertica, VoltDB, Paradigm4, etc.
               XLDB, 2012:

               complex analytics workloads are now displacing
               SQL as the basis for Enterprise apps



                                                                     32
Drivers




   algorithmic modeling + machine data
     + curation, metadata + Open Data
       data products, as feedback into automation
       evolution of feedback loops

   a big part of the science in data science…


   internet of things + complex analytics
       accelerated evolution, additional feedback loops

   taking this out into a highly social dimension




                                                          33
“A kind of Cambrian explosion”



                                 source: National Geographic

                                                               34
Internet of Things




                     35
A Thought Exercise




   Consider that when a company like Caterpillar moves
   into data science, they won’t be building the world’s
   next search engine or social network
   They will most likely be optimizing supply chain,
   optimizing fuel costs, automating data feedback
   loops integrated into their equipment…
   Operations Research –
   crunching amazing amounts of data




                                                           36
A Thought Exercise




   That’s a $50B company,
   in a market segment worth $250B
   Upcoming: tractors as drones –
   guided by complex, distributed data apps




                                              37
Alternatively…




       climate.com




                     38
Two Avenues to the App Layer

   Enterprise: must contend with
   complexity at scale every day…
   incumbents extend current practices and
   infrastructure investments




                                             complexity ➞
   Start-ups: crave complexity and
   scale to become viable…
   new ventures move into Enterprise space                  scale ➞
   to compete using relatively lean staff


                                                                      39
Then, Now, and Ahead




                                             AHEAD
1. Keep Austin Weird?
2. Something Called Data Science
3. Rise Of The Machine Data
4. A Cambrian Explosion
5. Eat, Drink, Be Merry…
6. Data-Driven In TX
7. Roll Up Your Sleeves



                                                          40
For instance…




   Let’s drill down on that intersection of tractors and
   crops, as a focus…
   Some of the largest use cases for large-scale data
   workflows which we encounter are in Agriculture

   Here’s a sector which integrates some of those
   themes from the Internet of Things, Caterpillar,
   Climate Corp, etc.




                                                           41
Data and Agriculture, Ahead

    • single largest employer, livelihood for 40% globally
    • 500 million small farms worldwide
    • most family farmers rely on rain-fed agriculture

    • approx $2T agricultural real estate in US alone
    • high annual rate of soil depletion
    • cycles of flooding, drought, desertification

    • high resolution from private satellite networks,
      e.g., skyboximaging.com
    • SMS networks for “business intelligence” among
      family farmers in Ethiopia, agrepedia.com
    • microfinance, e.g., kiva.org, slowmoney.org

                                                             42
Data and Agriculture, Ahead




   Consider the emerging reality of drone tractors,
   guided by satellite feeds, with predictive analytics
   accessing remote cloud-based clusters, crunching
   data for crops planted per-plot, based on years of
   history evaluated in time series analysis
   It would be difficult to identify a bigger Big Data
   problem in the world




                                                          43
Data and Agriculture, Ahead




   You’ve heard about Peak Oil, Peak Phosphorus?
   How about Peak Snow?

   In other words, rising variance of snow pack levels,
   increasingly earlier peak snow in the mountains…
   which stresses the watersheds, infrastructure, etc.,
   which in turn stress agriculture, energy, transportation,
   financial markets, tax basis, etc.
   Jeff Dozier, William Gail
   “The Emerging Science of Environmental Applications”
   The Fourth Paradigm, 2009




                                                               source: J. Dozier, et al., UCSB

                                                                                                 44
Data and Agriculture, Ahead

   Variance in the timing of the water cycle causes
   stress on natural resources and infrastructure:
   reservoirs, aqueducts, river ways, aquifers, levees,
   farm lands, seawater incursion, etc.
   Even in the face of so much IoT data looming,
   we lack adequate data and modeling of snowpack,
   snow melt, runoff, evaporation, water basins, etc.,
   to understand the impact of these changes – now
   needed to forecast where to change infrastructure
   or strategies
   There’s not much machine data up in the mountain
   peaks, and satellite data only serves so far…
      new opportunities for Big Data



                                                          source: J. Dozier, et al., UCSB

                                                                                            45
Data and Agriculture, Ahead




              We can resolve these kinds of
              problems; however, solutions
              must leverage huge amounts
              of data




                                              47
Then, Now, and Ahead




                                             AHEAD
1. Keep Austin Weird?
2. Something Called Data Science
3. Rise Of The Machine Data
4. A Cambrian Explosion
5. Eat, Drink, Be Merry…
6. Data-Driven In TX
7. Roll Up Your Sleeves



                                                          48
Everything’s Bigger in Texas




   Agriculture is just one sector, one set of
   problems to tackle
   We have much, much more here in Texas
   For example, Houston is a major center
   for Maritime work…

   check out:
   marinexplore.org




                                                49
Everything’s Bigger in Texas




   There’s also the not so small matter of the
   Energy and Transportation sectors

   GE is putting sensors in each and every
   wind generator, each and every jet engine –
   again, the Internet of Things.
   I’ve heard rumors there are a few of those
   wind turbines out in West Texas?




                                                 50
Everything’s Bigger in Texas




   Another of the fastest growing use cases we
   see for large-scale predictive modeling is in
   Telecom

   Think about the stream of CDRs (call detail records), billions of us
   bipeds wandering about the planet with our
   phones…
   The firehose for that makes Twitter look like MySpace!
   The value of location services as data products
   for local businesses and communities is astounding




                                                        51
Then, Now, and Ahead




                                             AHEAD
1. Keep Austin Weird?
2. Something Called Data Science
3. Rise Of The Machine Data
4. A Cambrian Explosion
5. Eat, Drink, Be Merry…
6. Data-Driven In TX
7. Roll Up Your Sleeves



                                                          52
What is needed?




   Approximately 80% of the costs for data-related projects
   get spent on data preparation – mostly on cleaning up
   data quality issues: ETL, log file analysis, etc.

   Unfortunately, data-related budgets for many companies tend
   to go into frameworks which can only be used after cleanup

   Most valuable skills:
     ‣ learn to use programmable tools that prepare data
     ‣ learn to generate compelling data visualizations
     ‣ learn to estimate the confidence for reported results
     ‣ learn to automate work, making analysis repeatable
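
   A minimal sketch of how three of these skills might fit together in a few lines of Python: prepare messy input with a programmable tool, estimate the confidence for a reported mean, and keep the run repeatable. The function names, the sample readings, and the choice of a percentile bootstrap are all assumptions made for illustration; none of them come from the talk.

```python
import math
import random
import statistics


def clean(raw_values):
    """Data prep: keep only entries that parse as finite numbers."""
    cleaned = []
    for raw in raw_values:
        try:
            value = float(raw.strip())
        except (ValueError, AttributeError):
            continue  # skip malformed entries such as "n/a", "" or None
        if math.isfinite(value):
            cleaned.append(value)
    return cleaned


def bootstrap_ci(values, n_resamples=2000, level=0.95, seed=42):
    """Percentile-bootstrap confidence interval for the mean.

    A fixed seed makes the analysis repeatable from run to run.
    """
    rng = random.Random(seed)
    means = sorted(
        statistics.fmean(rng.choices(values, k=len(values)))
        for _ in range(n_resamples)
    )
    lo_idx = int((1 - level) / 2 * n_resamples)
    hi_idx = int((1 + level) / 2 * n_resamples) - 1
    return means[lo_idx], means[hi_idx]


# Hypothetical raw readings, with the kinds of defects found in log files.
raw = ["12.5", " 11.0", "n/a", "13.2", "", "12.1", None, "nan"]
values = clean(raw)
low, high = bootstrap_ci(values)
print(f"mean = {statistics.fmean(values):.2f}, 95% CI = ({low:.2f}, {high:.2f})")
```

   The bootstrap is used here only because it needs no distributional assumptions and fits in the standard library; a real project would pick the interval method that suits its data.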

                                                                 source: D3
                                                                              53
What else do we need?

    • more emphasis on statistical thinking
    • not SQL vs. NoSQL, but instead a focus
      on apps as the process of structuring data

    • multi-disciplinary teams,
      not cubicles and silos

    • evolving more feedback loops,
      to drive more automation

    • oddly enough, we need automation
      to be able to employ more people
      in intelligent, productive ways

    • otherwise, we’re left with…


                                                   source: Schwa Corporation

                                                                               54
source: Twentieth Century Fox

                                55
Thank you very much!



                       source: Twentieth Century Fox

                                                       56

  • 47. Data and Agriculture, Ahead We can resolve these kinds of problems; however, solutions must leverage huge amounts of data 47
  • 48. Then, Now, and Ahead AHEAD 1. Keep Austin Weird? 2. Something Called Data Science 3. Rise Of The Machine Data 4. A Cambrian Explosion 5. Eat, Drink, Be Merry… 6. Data-Driven In TX 7. Roll Up Your Sleeves 48
  • 49. Everything’s Bigger in Texas Agriculture is just one sector, one set of problems to tackle We have much, much more here in Texas For example, Houston is a major center for Maritime work… check out: marinexplore.org 49
  • 50. Everything’s Bigger in Texas There’s also the not so small matter of the Energy and Transportation sectors GE is putting sensors in each and every wind generator, each and every jet engine – again, the Internet of Things. I’ve heard rumors there are a few of those wind turbines out in West Texas? 50
  • 51. Everything’s Bigger in Texas Another of the fastest growing use cases we see for large-scale predictive modeling is in Telecom. Think about the stream of call detail records (CDRs), billions of us bipeds wandering about the planet with our phones… The firehose for that makes Twitter look like MySpace! The value of location services as data products for local businesses and communities is astounding 51
  • 52. Then, Now, and Ahead AHEAD 1. Keep Austin Weird? 2. Something Called Data Science 3. Rise Of The Machine Data 4. A Cambrian Explosion 5. Eat, Drink, Be Merry… 6. Data-Driven In TX 7. Roll Up Your Sleeves 52
  • 53. What is needed? Approximately 80% of the costs for data-related projects get spent on data preparation – mostly on cleaning up data quality issues: ETL, log file analysis, etc. Unfortunately, data-related budgets for many companies tend to go into frameworks which can only be used after cleanup. Most valuable skills: ‣ learn to use programmable tools that prepare data ‣ learn to generate compelling data visualizations ‣ learn to estimate the confidence for reported results ‣ learn to automate work, making analysis repeatable source: D3 53
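One of the skills listed on slide 53, estimating the confidence for reported results, can be illustrated with a percentile bootstrap. The following is a minimal sketch and not something shown in the talk: the function name `bootstrap_ci`, its parameters, and the sample values are all hypothetical, chosen only for illustration, and it uses just the Python standard library.

```python
import random
import statistics


def bootstrap_ci(values, stat=statistics.mean, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for stat(values).

    Hypothetical helper for illustration; resamples with replacement
    and reads the interval off the sorted resampled statistics.
    """
    if not values:
        raise ValueError("values must be non-empty")
    rng = random.Random(seed)  # fixed seed keeps the analysis repeatable
    n = len(values)
    estimates = sorted(
        stat([values[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_resamples)
    )
    lo_idx = int((alpha / 2) * n_resamples)
    hi_idx = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))
    return estimates[lo_idx], estimates[hi_idx]


# Made-up measurements, for illustration only.
sample = [12.1, 9.8, 11.4, 10.7, 13.2, 9.5, 10.9, 11.8, 12.6, 10.2]
low, high = bootstrap_ci(sample)
print(f"mean = {statistics.mean(sample):.2f}, 95% CI = ({low:.2f}, {high:.2f})")
```

Reporting the interval alongside the point estimate, with a fixed seed, also touches the last skill on the slide: the same script reruns to the same answer.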
  • 54. What else do we need? • more emphasis on statistical thinking • not SQL vs. NoSQL, but instead a focus on apps as the process of structuring data • multi-disciplinary teams, not cubicles and silos • evolving more feedback loops, to drive more automation • oddly enough, we need automation to be able to employ more people in intelligent, productive ways • otherwise, we’re left with… source: Schwa Corporation 54
  • 56. Thank you very much! source: Twentieth Century Fox 56