1




 Big Data
 the next frontier

RVC Seminar                                Leonid Zhukov
Moscow, 08/02/2013   Professor Higher School of Economics
2
Big data




+ Graph of terms popularity




                              www.visibletechologies.com
3
McKinsey, May 2011




                     www.mckinsey.com
4
Headlines




            Data driven business

            Data democratization

            Data scientists
5
The White House



+ $200M initiative
+ NSF: core techniques
+ NIH: 1000 genomes
+ DOE: advanced computing
+ DOD: data to decisions
+ USGS: Earth system


                            www.whitehouse.gov
6
Gartner Hype Cycle




                     www.gartner.com
7
 Market Forecast




+ Market forecasts:
  + IDC: 2015 - $16.9B
  + Gartner: 2016 - $55B
+ Venture money invested (Reuters):
  + 2009 - $1.1B
  + 2010 - $1.53B
  + 2011 - $2.47B
                                                      www.wikibon.com
8
Big Data Revenue 2012




 + Big Business:
    +   IBM
    +   HP
    +   Oracle
    +   Teradata
    +   EMC             www.wikibon.com
9
Big Data Vendors!




    + Hadoop:
      + Cloudera
      + MapR Technologies
      + HortonWorks          www.wikibon.com
10
Forrester Wave




                 www.forrester.com
11
What is big data




+ Big data:
  + “Data you can’t process by traditional tools”
  + “A phenomenon defined by the rapid acceleration in the
     expanding volume of high velocity, complex and diverse
     types of data.”

  + “Refers to a collection of tools, techniques and technologies
     for working with data productively, at any scale.”
12
What is Big data

 + 3V
    + Volume: petabytes (1000TB) to exabytes (1000PB)
    + Variety: structured, semi-structured, unstructured
    + Velocity: Tb/s data streams
 + Requires distributed processing
 + Big data = storage + processing
 + Big data = Hadoop (not only)
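
A rough back-of-the-envelope sketch of why volumes at this scale require distributed processing. The 100 MB/s per-disk read rate and the 1000-node cluster size are illustrative assumptions, not figures from the slides.

```python
# Assumed sequential read throughput of one disk (illustrative, not from the slides).
DISK_MB_PER_S = 100
TB_PER_PB = 1000            # as on the slide: 1 PB = 1000 TB
MB_PER_TB = 1000 * 1000


def scan_hours(petabytes: float, nodes: int = 1) -> float:
    """Hours needed to read `petabytes` once, spread evenly over `nodes` disks."""
    total_mb = petabytes * TB_PER_PB * MB_PER_TB
    seconds = total_mb / (DISK_MB_PER_S * nodes)
    return seconds / 3600


single = scan_hours(1)           # one disk
cluster = scan_hours(1, 1000)    # hypothetical 1000-node cluster
print(f"1 PB on one disk: {single / 24:.0f} days")
print(f"1 PB on 1000 nodes: {cluster:.1f} hours")
```

Under these assumptions a single full scan drops from months to hours, which is the practical meaning of "storage + processing" on the same cluster.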
13
Big data Glossary


+ Hadoop, MapReduce, Hive, Pig, Cascading,
  HBase, Hypertable, Cassandra, Flume, Sqoop,
  Mongo, Voldemort, Storm, Kafka, Drill, Dremel,
  Impala, Zookeeper, Ambari, Oozie, Yarn, Redis,
  Riak, Pregel, Gremlin, Giraph, Solr, Lucene, R,
  Mahout, Weka
14
How big is Big?

+ Google
  + 24 PB data processed daily
+ Twitter
  + 340 mln daily tweets
  + 1.6 bln search queries
  + 7 TB added daily
+ Facebook
  + 750 mln users
  + 12 TB daily content
  + 2.7 bln “likes” and comments daily
15
Sources of Big Data




                      www.ibm.com
16
Supercomputing


+ National Labs, Universities, Military
+ Processing power, flops, MPI
+ Parallel computing:
   + Cray, IBM SP, SGI
   + Beowulf cluster (Linux commodity)
17
New realities


+ Yahoo, AltaVista, Inktomi, Google
+ Consumer web companies:
   + web search (crawling, indexing)
   + advertising
   + email services
   + ecommerce


   + Commodity hardware
18
Google




  2003   2004
19
GFS/HDFS

+ Distributed replicated data blocks (64 MB)
+ Master-slave architecture (Name Node, Data Nodes)
+ Not a general file system
+ Access via command line utils and API
+ Files can’t be modified after they are written
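
A minimal sketch of the block idea: a file is cut into fixed-size blocks and each block is copied to several Data Nodes, while the Name Node only keeps the block-to-node map. The replication factor of 3, the node names, and the round-robin placement are assumptions for illustration; real HDFS placement is rack-aware.

```python
import math

BLOCK_MB = 64        # block size from the slide
REPLICATION = 3      # assumed replication factor


def split_into_blocks(file_mb: int) -> int:
    """Number of fixed-size blocks needed to hold a file of `file_mb` megabytes."""
    return math.ceil(file_mb / BLOCK_MB)


def place_replicas(num_blocks: int, data_nodes: list[str]) -> dict[int, list[str]]:
    """Assign each block to REPLICATION distinct nodes, round-robin (toy policy)."""
    if len(data_nodes) < REPLICATION:
        raise ValueError("need at least REPLICATION data nodes")
    placement = {}
    for block in range(num_blocks):
        placement[block] = [
            data_nodes[(block + r) % len(data_nodes)] for r in range(REPLICATION)
        ]
    return placement


nodes = ["node-a", "node-b", "node-c", "node-d"]   # hypothetical Data Nodes
blocks = split_into_blocks(200)                    # a 200 MB file
block_map = place_replicas(blocks, nodes)          # what a Name Node would track
for block_id, holders in block_map.items():
    print(block_id, holders)
```

Losing any single node in this sketch still leaves at least two copies of every block.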
20
  MapReduce


+ MapReduce programming model:
  + functional programming
  + like UNIX pipeline
+ Master-slave architecture
  + Master: divide, schedule, monitor work
  + Slave: actual processing
+ Scalable:
  + no file IO
  + no networking
  + no synchronization
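
The programming model can be sketched in plain Python as a word count. This is a single-process illustration of the map, shuffle, and reduce steps, not Hadoop's API; the function names and the toy input are assumptions.

```python
from collections import defaultdict
from functools import reduce


def map_phase(line: str):
    """Mapper: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reducer: fold all counts for one word into a single total."""
    return key, reduce(lambda a, b: a + b, values)


lines = ["big data is big", "data is data"]   # toy input
mapped = (pair for line in lines for pair in map_phase(line))
word_counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(word_counts)
```

The user writes only the mapper and the reducer; in a real cluster the grouping step, file access, and scheduling are handled by the framework, which is what the "no file IO, no networking, no synchronization" bullets refer to.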
21
 Data movement




+ store and process data on the same nodes
+ bring code to data, data “locality”
                                             www.cloudera.com
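
A small sketch of "bring code to data": when assigning tasks, prefer a free node that already stores the block. The greedy policy, block names, and node names are hypothetical and only illustrate the idea; this is not Hadoop's actual scheduler.

```python
def schedule(block_locations: dict[str, list[str]], free_nodes: list[str]) -> dict[str, str]:
    """Assign one task per block, preferring a free node that already holds it."""
    available = list(free_nodes)
    assignment = {}
    for block, holders in block_locations.items():
        if not available:
            raise ValueError("not enough free nodes")
        local = [node for node in available if node in holders]
        # Fall back to any free node; the block must then travel over the network.
        chosen = local[0] if local else available[0]
        assignment[block] = chosen
        available.remove(chosen)
    return assignment


locations = {"b0": ["n1", "n2"], "b1": ["n2", "n3"], "b2": ["n1", "n3"]}
plan = schedule(locations, ["n1", "n2", "n3"])
local_tasks = sum(plan[b] in locations[b] for b in plan)
print(plan, f"{local_tasks}/{len(plan)} tasks run where their data lives")
```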
22
Hadoop
+ Doug Cutting
  + Search indexer - Lucene
  + Web crawler - Nutch
  + Hadoop
     + HDFS
     + MapReduce
23
Yahoo!
+ 40,000 servers
+ 170PB storage
+ 1000+ active users
+ 5M+ monthly jobs
+ email spam filters
+ categorization, personalization
+ computational advertising
24
NoSQL Database Revolution
+ Needed:
   + fast read/write time
   + high concurrency
   + easy horizontal scalability
+ Flat data structure
+ Sacrificed:
   + DB Schema
   + SQL
   + Transactions
25
NoSQL World

+ Key-value: Dynamo, Voldemort, Redis, Riak
+ Column (tabular): HBase, Hypertable, Cassandra
+ Document store: CouchDB, MongoDB
+ Graph: Neo4J, FlockDB
+ 120+ products (2012)
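
To make the four families concrete, here is one hypothetical user record expressed in each data model using plain Python structures. The field names and values are invented for illustration and do not reflect the storage format of any listed product.

```python
import json

# Key-value: an opaque value looked up by a single key.
key_value = {"user:42": '{"name": "Ann", "city": "Moscow"}'}

# Column (tabular): row key -> column family -> column -> value.
column = {"42": {"profile": {"name": "Ann", "city": "Moscow"}}}

# Document: one nested, self-describing document per record.
document = {"_id": 42, "name": "Ann", "address": {"city": "Moscow"}, "follows": [7]}

# Graph: nodes plus typed edges between them.
graph = {
    "nodes": {42: {"name": "Ann"}, 7: {"name": "Bob"}},
    "edges": [(42, "FOLLOWS", 7)],
}


def followers_of(g, user_id):
    """Traverse edges to find who follows `user_id`."""
    return [src for src, rel, dst in g["edges"] if rel == "FOLLOWS" and dst == user_id]


# The key-value store cannot look inside the value; the client must decode it.
city = json.loads(key_value["user:42"])["city"]
print(city, column["42"]["profile"]["city"], document["address"]["city"])
print(followers_of(graph, 7))
```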
26
Hadoop stack




               www.hortonworks.com
27
Hadoop tools

+ Pig
  + high level scripting language (PigLatin)
  + converts to MapReduce jobs
+ Hive
  + SQL-like queries on data in HDFS
  + converts to MapReduce jobs
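
As a rough illustration of how a SQL-like query maps onto a MapReduce job, the sketch below hand-translates one GROUP BY aggregation into a mapper and a reducer. The table, column names, and helper functions are hypothetical; this is not how Hive's or Pig's planners are actually implemented.

```python
from collections import defaultdict

# Query being illustrated (SQL-like):
#   SELECT dept, SUM(amount) FROM sales GROUP BY dept
sales = [  # toy rows standing in for a file in HDFS
    {"dept": "toys", "amount": 10},
    {"dept": "books", "amount": 7},
    {"dept": "toys", "amount": 15},
]


def mapper(row):
    """The GROUP BY column becomes the key, the aggregated column the value."""
    yield row["dept"], row["amount"]


def reducer(key, values):
    """The aggregate function (SUM) runs once per key."""
    return key, sum(values)


def run_query(rows):
    groups = defaultdict(list)
    for row in rows:
        for key, value in mapper(row):
            groups[key].append(value)
    return dict(reducer(key, values) for key, values in groups.items())


totals = run_query(sales)
print(totals)
```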
28

Hadoop data movement




                       www.cloudera.com
29
Typical Hadoop usage
 +   Text mining
 +   Pattern recognition
 +   Recommendation systems (collaborative filtering)
 +   Prediction models
 +   Risk assessment
 +   Sentiment analysis
 +   Customer churn prediction
 +   Customer segmentation
 +   Point of Sale Transaction analysis
 +   Data “sandbox”
30

Application fields

+ Science: sensors, genome, weather, satellite,
   imaging

+ Engineering: log analytics, status feeds, network
   messages, spam filters, etc.

+ Product: financial, pharmaceutical, insurance,
   energy, retail, ecommerce, healthcare, telecom

+ Business: analytics, BI
31
Business analytics



+ Analytic
+ Operational




        Capture, analyze, learn from data
                                            www.datasciencecentral.com
32
Who uses Hadoop?




                   www.cloudera.com
33
Why Hadoop?




              www.thinkbiganalytics.com
34
Cloudera




+ Enterprise support for Apache Hadoop
+ Founded 2008, funding $141 M
+ Employees: 230
+ Products:
  + CDH 4 (Cloudera distribution of Hadoop)
  + Impala
  + Consulting and training
                                           www.cloudera.com
35
MapR




+ Founded 2009, funding $20M
+ MapR Technologies is engineering game-changing
  Map/Reduce-related technologies

+ Products:
  + M3,M5,M7
  + NFS, no single point of failure
  + NOT open source !
                                             www.mapr.com
36
HortonWorks




+ Founded 2011
+ Yahoo spin-off
+ Products:
  + HDP distribution
  + tools

                       www.hortonworks.com
37
Hadoop Ecosystem




                   www.datameer.com
38
Big Data Landscape




                     www.bigdatalandscape.com
39
Splunk




+ Founded 2003, raised $230M, IPO 2012, Market cap $3.35B
+ Machine logs analysis, operational intelligence
+ Collecting, searching, monitoring




                                                            www.splunk.com
40
Datameer




+ Founded 2009, funding $17.8M

+ Big data:
  + Data integration
  + Data Analytics
  + Data Visualization
                         www.datameer.com
41
Datasift




+ Founded 2010, funding $29.7M
+ Data platform for social web
+ Aggregate and filter data



                                 www.datasift.com
42
Infochimps




+ Founded 2009, funding $5.5M
+ Transitioned from data marketplace to big data platform
+ End-to-end big data solution, real time




                                                        www.infochimps.com
43
Tableau software




+ Founded 2003, funding $15M
+ Big data analytics
+ Big data visualization

                               www.tableau.com
44
Big data startups 2012

+ Platfora, in memory BI on Hadoop
+ Sumologic, log file analysis
+ Hadapt, Hadoop + RDBMS
+ Metamarkets, patterns in data flow
+ DataStax, consulting, training
+ Karmasphere, BI, analytics on Hadoop
45
Big data startups 2013!


+ 10gen, MongoDB
+ ClearStory, big data aggregation + analytics
+ Continuuity, Hadoop API
+ Parstream, database analytics
+ Zoomdata, data visualization
+ Climate corporation, predictive analytics
46
Big data by industry




                       www.gartner.com
47
Big data Processing

                     Batch processing    Interactive                Stream

Query time           minutes to hours    milliseconds to seconds    continuous

Data volume          TB to PB            GB to PB                   continuous

Programming model    MapReduce           Queries                    DAG

Users                Developers          Analysts                   Developers

Open Source          Hadoop MapReduce    Drill, Impala              Storm, Kafka
48
New technologies

+ Real-time querying
  + Drill (based on Google Dremel)
  + Impala (Cloudera)


+ Data stream processing
  + Storm (Twitter), real time analytics
  + Kafka (LinkedIn), messaging system
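
Stream processing differs from batch in that results are emitted continuously as events arrive. The sketch below counts events per fixed time window over a generator; the event source, event names, and window size are assumptions, and it does not use the Storm or Kafka APIs. It also assumes events arrive in timestamp order.

```python
from collections import Counter


def event_stream():
    """Hypothetical unbounded source; here a short finite stand-in."""
    yield from [(0, "login"), (1, "click"), (1, "click"), (5, "login"), (6, "click")]


def windowed_counts(events, window_seconds):
    """Yield (window_start, counts) each time a fixed-size time window closes."""
    current_start, counts = None, Counter()
    for timestamp, kind in events:
        start = timestamp - timestamp % window_seconds
        if current_start is not None and start != current_start:
            yield current_start, dict(counts)   # window closed: emit its result
            counts = Counter()
        current_start = start
        counts[kind] += 1
    if current_start is not None:
        yield current_start, dict(counts)       # flush the last open window


results = list(windowed_counts(event_stream(), window_seconds=5))
for window_start, counts in results:
    print(window_start, counts)
```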
49
Machine learning

 + Predictive analytics
 + Patterns discovery
 + Data mining
 + Tools:
    + Mahout
    + R
50
Big data revolution

+ Google: GFS, MapReduce, BigTable
+ Yahoo: Hadoop
+ Amazon: DynamoDB
+ Facebook: Cassandra, HBase
+ Twitter: FlockDB, Storm
+ LinkedIn: Voldemort, Kafka
51
Observations

+ Game changing technologies come from big companies
+ Open Source (!)
+ Start-up ecosystem
+ Less general, more specialized
+ Next step: big data analytics and visualization
52
Data scientist

+ Machine Learning
+ Data Mining
+ Statistics
+ Software Engineering
+ Hadoop/MapReduce/HBase/Hive/Pig
+ Java, Python, C/C++, SQL

“By 2018, the United States alone could face a shortage of 140,000 to 190,000
people with deep analytical skills as well as 1.5 million managers and analysts with
the know-how to use the analysis of big data to make effective decisions.”
53
Big Data Products MindMap




                    www.garycrawford.co.uk
54
Contacts


+ Leonid Zhukov, Ph.D.
+ School of Applied Mathematics and Information Science
   Higher School of Economics, NRU-HSE

+ lzhukov@hse.ru
+ www.leonidzhukov.ru
