Big data, big deal?

         February 2013




         Matt Turck
       Twitter: @mattturck
    Blog: http://mattturck.com
Background: I prepared this slide deck for a couple of
“Big Data 101” guest lectures I gave in February 2013 at
New York University’s Stern School of Business and at
The New School. They’re intended for a college-level,
non-technical audience, as a first exposure to Big
Data and related concepts. I have re-used a number of
stats, graphics, cartoons and other materials freely
available on the internet. Thanks to the authors of those
materials.
What does Target know about
     pregnant women?
Hype

    Data is…
   “the new gold”
   “the new black”
   “the new plastic”
   “the new oil”
   “the new frontier”
Isn’t that what computers have always
                done?
What’s different this time?

         Volume.
         Variety.
         Velocity.
Facebook warehouses 180 petabytes
          of data a year
Twitter manages 1.2 million deliveries
            per second
New sources of data
Open Government Data
Big data is data that exceeds the
processing capacity of conventional
database systems. The data is too
big, moves too fast, or doesn’t fit the
strictures of your database
architectures. To gain value from this
data, you must choose an alternative
way to process it.

               Edd Dumbill, O’Reilly
A new breed of technologies
Big Data Landscape
   Infrastructure: NoSQL Databases, NewSQL Databases, MPP Databases, Hadoop Related, Management / Monitoring, Cluster Services, Security, Storage, Crowdsourcing, Collection / Transport
   Analytics: Analytics Solutions, Data Visualization, Statistical Computing, Social Media, Sentiment Analysis, Analytics Services, Location / People / Events, Big Data Search, IT Analytics, Real-Time, Crowdsourced Analytics, SMB Analytics
   Applications: Ad Optimization, Publisher Tools, Marketing, Industry Applications, Application Service Providers
   Data Sources: Data Marketplaces, Data Sources, Personal Data
   Cross Infrastructure / Analytics
   Open Source Projects: Framework, Query / Data Flow, Data Access, Coordination / Workflow, Real-Time, Statistical Tools, Machine Learning, Cloud Deployment

   Matt Turck (@mattturck) and Shivon Zilis (@shivonz)
A new breed of people:
    Data scientists
    [Venn diagram of overlapping circles labeled engineering, math, comp sci and hacking; the overlaps are labeled “nerds” and the center “awesome nerds”]
                                       Credit: Hilary Mason, Bitly
Sexy nerds?




          “Data Scientist:
The Sexiest Job of the 21st Century”
           October 2012
Nerd talent shortage
Terms worth remembering

Structured vs. unstructured data
            Hadoop
        Cloud computing
       Data visualization
       Machine learning
      Predictive analytics
So what do you do with all that
        technology?
Lending
Trading
Insurance
Agriculture
Healthcare
Energy
Music
Education
But what about small data?
Moneyball is (relatively) small data
Nate Silver is (relatively) small data
Most companies only have small data
It’s not about big data
for the sake of big data
Data-driven management



“In God we trust. Everyone else, bring data”
Data-driven culture
Easier than ever for any business to be
           truly data-driven
Thanks!



           Learn more:

  NYC Data Business Meetup

meetup.com/NYC-Data-Business-Meetup/


Editor's Notes

  • #2 This is going to be a talk for people who love the internet.
  • #4 The true story of bitly, engineering, data science, love. How to do data science at scale. Building teams and keeping people happy. Clever tricks.
  • #5 Very different perspective: we have constrained resources, little time, and an expectation that what we do is relevant to the real world in some way. We build the system on this data, and then scale it for production use.
  • #17 Asking questions.