1




 Big Data
 the next frontier

RVC Seminar                                Leonid Zhukov
Moscow, 08/02/2013   Professor Higher School of Economics
2
Big data




+ Graph of terms popularity




                              www.visibletechologies.com
3
McKinsey, May 2011




                     www.mckinsey.com
4
Headlines




            Data driven business

            Data democratization

            Data scientists
5
The White House



+ $200M initiative
+ NSF: core techniques
+ NIH: 1000 genomes
+ DOE: advanced computing
+ DOD: data to decisions
+ USGS: Earth system


                            www.whitehouse.gov
6
Gartner Hype Cycle




                     www.gartner.com
7
 Market Forecast




+ Market forecasts:
  + IDC: 2015 - $16.9B
  + Gartner: 2016 - $55B
+ Venture money invested (Reuters):
  + 2009 - $1.1B
  + 2010 - $1.53B
  + 2011 - $2.47B
                                                      www.wikibon.com
8
Big Data Revenue 2012




 + Big Business:
    +   IBM
    +   HP
    +   Oracle
    +   Teradata
    +   EMC             www.wikibon.com
9
Big Data Vendors!




    + Hadoop:
      + Cloudera
      + MapR Technologies
      + HortonWorks          www.wikibon.com
10
Forrester Wave




                 www.forrester.com
11
What is big data




+ Big data:
  + “Data you can’t process by traditional tools”
  + “A phenomenon defined by the rapid acceleration in the
     expanding volume of high velocity, complex and diverse
     types of data.”

  + “Refers to a collection of tools, techniques and technologies
     for working with data productively, at any scale.”
12
What is Big data

 + 3V
    + Volume: petabytes (1000TB) to exabytes (1000PB)
    + Variety: structured, semi-structured, unstructured
    + Velocity: Tb/s data streams
 + Requires distributed processing
 + Big data = storage + processing
 + Big data = Hadoop (not only)
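
A rough back-of-the-envelope calculation suggests why volumes at this scale require distributed processing. The 100 MB/s per-disk read rate and the 1000-node cluster size below are illustrative assumptions, not figures from the slides.

```python
# Back-of-the-envelope scan times; throughput and cluster size are assumed values.
TB = 10**12                 # bytes in a terabyte (decimal units)
PB = 1000 * TB              # 1 PB = 1000 TB, as on the slide
EB = 1000 * PB              # 1 EB = 1000 PB

DISK_READ_BYTES_PER_SEC = 100 * 10**6   # assumption: 100 MB/s sequential read


def scan_seconds(volume_bytes: int, nodes: int = 1) -> float:
    """Time to read the whole volume once, split evenly across `nodes` disks."""
    return volume_bytes / (DISK_READ_BYTES_PER_SEC * nodes)


single_disk_days = scan_seconds(PB) / 86_400
cluster_hours = scan_seconds(PB, nodes=1000) / 3_600

print(f"1 PB on one disk:    {single_disk_days:.0f} days")
print(f"1 PB on 1000 disks:  {cluster_hours:.1f} hours")
```

Under these assumptions a single disk needs roughly 116 days to read one petabyte once, while 1000 disks reading in parallel need under 3 hours.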
13
Big data Glossary


+ Hadoop, MapReduce, Hive, Pig, Cascading,
  HBase, Hypertable, Cassandra, Flume, Sqoop,
  Mongo, Voldemort, Storm, Kafka, Drill, Dremel,
  Impala, Zookeeper, Ambari, Oozie, YARN, Redis,
  Riak, Pregel, Gremlin, Giraph, Solr, Lucene, R,
  Mahout, Weka
14
How big is Big?

+ Google
  + 24 PB data processed daily
+ Twitter
  + 340 mln daily tweets
  + 1.6 bln search queries
  + 7 TB added daily
+ Facebook
  + 750 mln users
  + 12 TB of new content daily
  + 2.7 bln “likes” and comments daily
15
Sources of Big Data




                      www.ibm.com
16
Supercomputing


+ National Labs, Universities, Military
+ Processing power, flops, MPI
+ Parallel computing:
   + Cray, IBM SP, SGI
   + Beowulf cluster (Linux commodity)
17
New realities


+ Yahoo, AltaVista, Inktomi, Google
+ Consumer web companies:
   + web search (crawling, indexing)
   + advertising
   + email services
   + ecommerce


   + Commodity hardware
18
Google




  2003: Google File System (GFS) paper   2004: MapReduce paper
19
GFS/HDFS

+ Distributed replicated data blocks (64 MB)
+ Master-slave architecture (NameNode, DataNodes)
+ Not a general file system
+ Access via command line utils and API
+ Files can’t be modified after they are written
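
The block and master-slave ideas above can be sketched as a toy model. This is not the actual GFS or HDFS placement policy: the replication factor of 3, the round-robin placement, and the node names are assumptions made for illustration.

```python
# Toy model of block splitting and replica placement; not the real HDFS policy.
from itertools import cycle

BLOCK_SIZE = 64 * 1024 * 1024   # 64 MB blocks, as on the slide
REPLICATION = 3                 # assumption: three copies of every block


def split_into_blocks(file_size: int) -> list[int]:
    """Return the size of each block a file of `file_size` bytes occupies."""
    full, rest = divmod(file_size, BLOCK_SIZE)
    return [BLOCK_SIZE] * full + ([rest] if rest else [])


def place_replicas(num_blocks: int, data_nodes: list[str]) -> dict[int, list[str]]:
    """Assign each block to REPLICATION distinct data nodes, round-robin."""
    if len(data_nodes) < REPLICATION:
        raise ValueError("need at least REPLICATION data nodes")
    ring = cycle(data_nodes)
    return {b: [next(ring) for _ in range(REPLICATION)] for b in range(num_blocks)}


# The "name node" only keeps this block-to-node map; data nodes hold the bytes.
blocks = split_into_blocks(200 * 1024 * 1024)          # a 200 MB file
block_map = place_replicas(len(blocks), ["dn1", "dn2", "dn3", "dn4"])
print(len(blocks), block_map)
```

A 200 MB file occupies three full 64 MB blocks plus one 8 MB block, and losing any single node still leaves two copies of every block.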
20
  MapReduce

+ MapReduce programming model:
  + functional programming
  + like UNIX pipeline
+ Master-slave architecture
  + Master: divide, schedule, monitor work
  + Slave: actual processing
+ Scalable:
  + no file IO
  + no networking
  + no synchronization
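
The programming model can be illustrated with a minimal single-process sketch of word counting. It only mimics the map, shuffle, and reduce stages in memory; the function names and input lines are made up, and a real framework would distribute each stage across the cluster.

```python
# Single-process sketch of the map -> shuffle -> reduce flow (word count).
from collections import defaultdict
from typing import Iterable, Iterator


def map_phase(line: str) -> Iterator[tuple[str, int]]:
    """Mapper: emit (word, 1) for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[tuple[str, int]]) -> dict[str, list[int]]:
    """Group values by key; in a real cluster the framework does this step."""
    groups: dict[str, list[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key: str, values: list[int]) -> tuple[str, int]:
    """Reducer: combine all values emitted for one key."""
    return key, sum(values)


lines = ["big data big tools", "data at any scale"]
mapped = (pair for line in lines for pair in map_phase(line))
counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(counts)
```

The mapper and reducer are pure functions with no file IO, networking, or synchronization code of their own, which is what lets a framework run many copies of them in parallel.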
21
 Data movement




+ store and process data on the same nodes
+ bring code to data, data “locality”
                                             www.cloudera.com
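
"Bring code to data" can be sketched as a scheduler that prefers a node already holding the block. This is a toy illustration; the function name, node names, and block map are hypothetical and do not reflect Hadoop's actual scheduler.

```python
# Toy locality-aware scheduler; node names and block map are made up.
def pick_node(block_id: int, block_locations: dict[int, list[str]],
              free_nodes: set[str]) -> tuple[str, bool]:
    """Return (node, is_local): prefer a free node that already stores the block."""
    if not free_nodes:
        raise ValueError("no free nodes")
    for node in block_locations.get(block_id, []):
        if node in free_nodes:
            return node, True            # code moves to the data
    return min(free_nodes), False        # fallback: data must cross the network


block_locations = {0: ["dn1", "dn2"], 1: ["dn3", "dn4"]}
print(pick_node(0, block_locations, {"dn2", "dn3"}))
print(pick_node(1, block_locations, {"dn1", "dn2"}))
```

The first task runs locally on a node that stores the block; the second has no free local node, so the block would have to be read over the network.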
22
Hadoop
+ Doug Cutting
  + Search indexer - Lucene
  + Web crawler - Nutch
  + Hadoop
     + HDFS
     + MapReduce
23
Yahoo!
+ 40,000 servers
+ 170PB storage
+ 1000+ active users
+ 5M+ monthly jobs
+ email spam filters
+ categorization, personalization
+ computational advertising
24
Data Base NoSQL Revolution
+ Needed:
   + fast read/write time
   + high concurrency
   + easy horizontal scalability
+ Flat data structure
+ Sacrificed:
   + DB Schema
   + SQL
   + Transactions
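
The trade-off can be illustrated with a toy hash-sharded key-value store. The class and method names are hypothetical and the shards are plain in-memory dictionaries; it is a sketch of the idea, not any particular product's design.

```python
# Toy hash-sharded key-value store; class and method names are hypothetical.
import hashlib
from typing import Optional


class ShardedKV:
    """Flat key -> value store spread across `num_shards` in-memory dicts."""

    def __init__(self, num_shards: int) -> None:
        self.shards: list[dict[str, str]] = [{} for _ in range(num_shards)]

    def _shard_for(self, key: str) -> dict[str, str]:
        # A stable hash, so the same key always lands on the same shard.
        digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
        return self.shards[int(digest, 16) % len(self.shards)]

    def put(self, key: str, value: str) -> None:
        self._shard_for(key)[key] = value

    def get(self, key: str, default: Optional[str] = None) -> Optional[str]:
        return self._shard_for(key).get(key, default)


store = ShardedKV(num_shards=4)
store.put("user:1", "alice")
store.put("user:2", "bob")
print(store.get("user:1"), store.get("user:3"))
```

Each read or write touches exactly one shard, which keeps operations fast and easy to spread over more machines, but there is no schema, no query language, and no transaction spanning several keys.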
25
NoSQL World

+ Key-value: Dynamo, Voldemort, Redis, Riak
+ Column (tabular): HBase, Hypertable, Cassandra
+ Document store: CouchDB, MongoDB
+ Graph: Neo4J, FlockDB
+ 120+ products (2012)
26
Hadoop stack




               www.hortonworks.com
27
Hadoop tools

+ Pig
  + high level scripting language (PigLatin)
  + converts to MapReduce jobs
+ Hive
  + SQL-like queries on data in HDFS
  + converts to MapReduce jobs
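
To give a feel for what "converts to MapReduce jobs" means, here is a hand-written sketch of how an aggregate query could be expressed as a map step and a reduce step. The table, its columns, and its rows are invented, and the plans that Pig or Hive actually generate are more involved.

```python
# Hand-written map/reduce equivalent of:
#   SELECT dept, SUM(salary) FROM emp GROUP BY dept
# The table and its rows are invented for illustration.
from itertools import groupby
from operator import itemgetter

emp = [
    {"name": "ann", "dept": "eng", "salary": 120},
    {"name": "bob", "dept": "ops", "salary": 90},
    {"name": "cat", "dept": "eng", "salary": 100},
]


def mapper(row: dict) -> tuple[str, int]:
    """Emit the GROUP BY column as the key and the aggregated column as the value."""
    return row["dept"], row["salary"]


def reducer(key: str, values: list[int]) -> tuple[str, int]:
    """Apply the aggregate function (SUM) to all values sharing a key."""
    return key, sum(values)


# Sorting by key stands in for the shuffle phase that groups equal keys together.
pairs = sorted(map(mapper, emp), key=itemgetter(0))
result = [reducer(k, [v for _, v in grp]) for k, grp in groupby(pairs, key=itemgetter(0))]
print(result)
```

The GROUP BY column becomes the map output key and the aggregate function becomes the reducer.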
28

Hadoop data movement




                       www.cloudera.com
29
Typical hadoop usage
 +   Text mining
 +   Pattern recognition
 +   Recommendation systems (collaborative filtering)
 +   Prediction models
 +   Risk assessment
 +   Sentiment analysis
 +   Customer churn prediction
 +   Customer segmentation
 +   Point of Sale Transaction analysis
 +   Data “sandbox”
30

Application fields

+ Science: sensors, genome, weather, satellite,
   imaging

+ Engineering: log analytics, status feeds, network
   messages, spam filters

+ Product: financial, pharmaceutical, insurance,
   energy, retail, ecommerce, healthcare, telecom

+ Business: analytics, BI
31
Business analytics



+ Analytic
+ Operational




        Capture, analyze, learn from data
                                            www.datasciencecentral.com
32
Who uses Hadoop?




                   www.cloudera.com
33
Why Hadoop?




              www.thinkbiganalytics.com
34
Cloudera




+ Enterprise support for Apache Hadoop
+ Founded 2008, funding $141 M
+ Employees: 230
+ Products:
  + CDH 4 (Cloudera's Distribution Including Apache Hadoop)
  + Impala
  + Consulting and training
                                           www.cloudera.com
35
MapR




+ Founded 2009, funding $20M
+ MapR Technologies is engineering game-changing
  Map/Reduce related technologies

+ Products:
  + M3,M5,M7
  + NFS, no single point of failure
  + NOT open source!
                                             www.mapr.com
36
HortonWorks




+ Founded 2011
+ Yahoo spin-off
+ Products:
  + HDP distribution
  + tools

                       www.hortonworks.com
37
Hadoop Ecosystem




                   www.datameer.com
38
Big Data Landscape




                     www.bigdatalandscape.com
39
Splunk




+ Founded 2003, raised $230M, IPO 2012, Market cap $3.35B
+ Machine logs analysis, operational intelligence
+ Collecting, searching, monitoring




                                                            www.splunk.com
40
Datameer




+ Founded 2009, funding $17.8M

+ Big data:
  + Data integration
  + Data Analytics
  + Data Visualization
                         www.datameer.com
41
Datasift




+ Founded 2010, funding $29.7M
+ Data platform for social web
+ Aggregate and filter data



                                 www.datasift.com
42
Infochimps




+ Founded 2009, funding $5.5M
+ Transitioned from data marketplace to big data platform
+ End-to-end big data solution, real time




                                                        www.infochimps.com
43
Tableau software




+ Founded 2003, funding $15M
+ Big data analytics
+ Big data visualization

                               www.tableau.com
44
Big data Startups 2012

+ Platfora, in memory BI on Hadoop
+ Sumologic, log file analysis
+ Hadapt, Hadoop+RDBMS
+ Metamarkets, patterns in data flow
+ DataStax, consulting, training
+ Karmasphere, BI, analytics on Hadoop
45
Big data startups 2013!


+ 10gen, MongoDB
+ ClearStory, big data aggregation + analytics
+ Continuuity, Hadoop API
+ Parstream, database analytics
+ Zoomdata, data visualization
+ Climate corporation, predictive analytics
46
Big data by industry




                       www.gartner.com
47
Big data Processing

                    Batch processing    Interactive                Stream
Query time          minutes to hours    milliseconds to seconds    continuous
Data volume         TB to PB            GB to PB                   continuous
Programming model   MapReduce           Queries                    DAG
Users               Developers          Analysts                   Developers
Open Source         Hadoop MapReduce    Drill, Impala              Storm, Kafka
48
New technologies

+ Real-time querying
  + Drill (based on Google Dremel)
  + Impala (Cloudera)


+ Data stream processing
  + Storm (Twitter), real time analytics
  + Kafka (LinkedIn), messaging system
49
Machine learning

 + Predictive analytics
 + Patterns discovery
 + Data mining
 + Tools:
    + Mahout
    + R
50
Big data revolution

+ Google: GFS, MapReduce, BigTable
+ Yahoo: Hadoop
+ Amazon: DynamoDB
+ Facebook: Cassandra, HBase
+ Twitter: FlockDB, Storm
+ LinkedIn: Voldemort, Kafka
51
Observations

+ Game changing technologies come from big companies
+ Open Source (!)
+ Start-up ecosystem
+ Less general, more specialized
+ Next step: big data analytics and visualization
52
Data scientist

+ Machine Learning
+ Data Mining
+ Statistics
+ Software Engineering
+ Hadoop/MapReduce/HBase/Hive/Pig
+ Java, Python, C/C++, SQL

“By 2018, the United States alone could face a shortage of 140,000 to 190,000
people with deep analytical skills as well as 1.5 million managers and analysts with
the know-how to use the analysis of big data to make effective decisions.”
53
Big Data Products MindMap




                    www.garycrawford.co.uk
54
Contacts


+ Leonid Zhukov, Ph.D.
+ School of Applied Mathematics and Information Science
   Higher School of Economics, NRU-HSE

+ lzhukov@hse.ru
+ www.leonidzhukov.ru

 
The Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and SalesThe Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and Sales
 
GraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge Graph
 
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
 
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
 
UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3
 
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
 
PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)
 
UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4
 
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdfFIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
 
Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !
 
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
 
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdfFIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
 
Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........
 
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdfFIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
 
Welocme to ViralQR, your best QR code generator.
Welocme to ViralQR, your best QR code generator.Welocme to ViralQR, your best QR code generator.
Welocme to ViralQR, your best QR code generator.
 
Assure Contact Center Experiences for Your Customers With ThousandEyes
Assure Contact Center Experiences for Your Customers With ThousandEyesAssure Contact Center Experiences for Your Customers With ThousandEyes
Assure Contact Center Experiences for Your Customers With ThousandEyes
 
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
 
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
 

Business of Big Data

  • 1. 1 Big Data: the next frontier. RVC Seminar, Moscow, 08/02/2013. Leonid Zhukov, Professor, Higher School of Economics
  • 2. 2 Big data + Graph of terms popularity www.visibletechologies.com
  • 3. 3 McKinsey, May 2011 www.mckinsey.com
  • 4. 4 Headlines Data driven business Data democratization Data scientists
  • 5. 5 The White House + $200M initiative + NSF: core techniques + NIH: 1000 genomes + DOE: advanced computing + DOD: data to decisions + USGS: Earth system www.whitehouse.gov
  • 6. 6 Gartner Hype Cycle www.gartner.com
  • 7. 7 Market Forecast + Market forecasts: + IDC: 2015 - $16.9B + Gartner: 2016 - $55B + Venture money invested (Reuters): + 2009 - $1.1B + 2010 - $1.53B + 2011 - $2.47B www.wikibon.com
  • 8. 8 Big Data Revenue 2012 + Big Business: + IBM + HP + Oracle + Teradata + EMC www.wikibon.com
  • 9. 9 Big Data Vendors! + Hadoop: + Cloudera + MapR Technologies + HortonWorks www.wikibon.com
  • 10. 10 Forrester Wave www.forrester.com
  • 11. 11 What is big data + Big data: + “Data you can’t process by traditional tools” + “A phenomenon defined by the rapid acceleration in the expanding volume of high velocity, complex and diverse types of data.” + “Refers to a collection of tools, techniques and technologies for working with data productively, at any scale.”
  • 12. 12 What is Big data + 3V + Volume: petabytes (1000TB) to exabytes (1000PB) + Variety: structured, semi-structured, unstructured + Velocity: Tb/s data streams + Requires distributed processing + Big data = storage + processing + Big data = Hadoop (not only)
  • 13. 13 Big data Glossary + Hadoop, MapReduce, Hive, Pig, Cascading, HBase, Hypertable, Cassandra, Flume, Sqoop, Mongo, Voldemort, Storm, Kafka, Drill, Dremel, Impala, Zookeeper, Ambari, Oozie, Yarn, Redis, Riak, Pregel, Gremlin, Giraph, Solr, Lucene, R, Mahout, Weka
  • 14. 14 How big is Big? + Google + 24 PB data processed daily + Twitter + 340 mln daily tweets + 1.6 bln search queries + 7 TB added daily + Facebook + 750 mln users + 12 TB daily content + 2.7 bln “likes” and comments daily
  • 15. 15 Sources of Big Data www.ibm.com
  • 16. 16 Supercomputing + National Labs, Universities, Military + Processing power, flops, MPI + Parallel computing: + Cray, IBM SP, SGI + Beowulf cluster (Linux commodity)
  • 17. 17 New realities + Yahoo, AltaVista, Inktomi, Google + Consumer web companies: + web search (crawling, indexing) + advertising + email services + ecommerce + Commodity hardware
  • 19. 19 GFS/HDFS + Distributed replicated data blocks (64 MB) + Master-slave architecture (Name Node, Data Nodes) + Not a general file system + Access via command line utils and API + Files can’t be modified once written
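The block and replication idea on this slide can be made concrete with a little arithmetic. The following is only a rough sketch: the 64 MB block size comes from the slide, while the replication factor of 3 and the function names `block_count` and `raw_storage` are assumptions chosen for illustration.

```python
BLOCK_SIZE = 64 * 1024 ** 2   # 64 MB block size, as stated on the slide
REPLICATION = 3               # assumption: replication factor is not given in the slides


def block_count(file_size_bytes, block_size=BLOCK_SIZE):
    """Number of blocks needed to hold a file (the last block may be partly filled)."""
    # Integer ceiling division avoids floating-point rounding on very large files.
    return -(-file_size_bytes // block_size)


def raw_storage(file_size_bytes, replication=REPLICATION):
    """Bytes consumed across the cluster when every block is stored `replication` times."""
    return file_size_bytes * replication


if __name__ == "__main__":
    one_gb = 1024 ** 3
    print(block_count(one_gb), "blocks,", raw_storage(one_gb), "bytes of raw storage")
```

Under these assumptions a 1 GB file is split into 16 blocks and occupies 3 GB of raw disk space across the data nodes.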
  • 20. 20 MapReduce + Scalable: + no file IO + no networking + no synchronization + Master-slave architecture + Master: divide, schedule, monitor work + Slave: actual processing + MapReduce programming model: + functional programming + like UNIX pipeline
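The programming model on this slide can be illustrated with the usual word-count example. This is a single-process sketch in plain Python, not Hadoop code: the function names `map_phase`, `shuffle`, `reduce_phase` and `word_count` are hypothetical, and the grouping step that a real framework performs across machines is simulated here with a dictionary.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would do."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values collected for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce in sequence over an iterable of text lines."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["big data big tools", "data tools"]))
```

The user writes only the map and reduce functions; in the functional style the slide mentions, neither touches files, sockets or locks, which is what lets the master split and schedule the work freely.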
  • 21. 21  Data movement + store and process data on the same nodes + bring code to data, data “locality” www.cloudera.com
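"Bring code to data" can be sketched as a scheduling preference: run each task on a node that already holds a replica of its block, and fall back to a remote node only when none is free. The function `assign_tasks` and the node and block names below are hypothetical, and the greedy strategy is an assumption for illustration rather than the scheduler Hadoop actually uses.

```python
def assign_tasks(block_locations, free_nodes):
    """Greedy, locality-first assignment of one task per block.

    block_locations maps a block id to the list of nodes holding a replica.
    Returns a dict: block id -> (chosen node, True if the data is local).
    """
    free = set(free_nodes)
    assignment = {}
    for block, replicas in block_locations.items():
        local = [node for node in replicas if node in free]
        if local:
            node, is_local = local[0], True          # data already on this node
        elif free:
            node, is_local = sorted(free)[0], False  # block must travel over the network
        else:
            break                                    # no capacity left
        free.discard(node)
        assignment[block] = (node, is_local)
    return assignment


if __name__ == "__main__":
    blocks = {"b1": ["n1", "n2"], "b2": ["n1"], "b3": ["n3"]}
    print(assign_tasks(blocks, ["n1", "n2", "n4"]))
```

Being greedy, this sketch can miss a better plan (here, placing b1 on n2 would have left n1 free for b2); it is only meant to show the preference for local data.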
  • 22. 22 Hadoop + Doug Cutting + Search indexer - Lucene + Web crawler - Nutch + Hadoop + HDFS + MapReduce
  • 23. 23 Yahoo! + 40,000 servers + 170PB storage + 1000+ active users + 5M+ monthly jobs + email spam filters + categorization, personalization + computational advertising
  • 24. 24 NoSQL Data Base Revolution + Needed: + fast read/write time + high concurrency + easy horizontal scalability + Flat data structure + Sacrificed: + DB Schema + SQL + Transactions
  • 25. 25 NoSQL World + Key-value: Dynamo, Voldemort, Redis, Riak + Column (tabular): HBase, Hypertable, Cassandra + Document store: CouchDB, MongoDB + Graph: Neo4J, FlockDB + 120+ products (2012)
  • 26. 26 Hadoop stack www.hortonworks.com
  • 27. 27 Hadoop tools + Pig + high level scripting language (PigLatin) + converts to MapReduce jobs + Hive + SQL-like queries on data in HDFS + converts to MapReduce jobs
  • 28. 28 Hadoop data movement www.cloudera.com
  • 29. 29 Typical hadoop usage + Text mining + Pattern recognition + Recommendation systems (collaborative filtering) + Prediction models + Risk assessment + Sentiment analysis + Customer churn prediction + Customer segmentation + Point of Sale Transaction analysis + Data “sandbox”
  • 30. 30 Application fields + Science: sensors, genome, weather, satellite, imaging + Engineering: log analytics, status feeds, network messages, spam filters.. + Product: financial, pharmaceutical, insurance, energy, retail, ecommerce, healthcare, telecom + Business: analytics, BI
  • 31. 31 Business analytics + Analytic + Operational Capture, analyze, learn from data www.datasciencecentral.com
  • 32. 32 Who uses Hadoop? www.cloudera.com
  • 33. 33 Why Hadoop? www.thinkbiganalytics.com
  • 34. 34 Cloudera + Enterprise support for Apache Hadoop + Founded 2008, funding $141M + Employees: 230 + Products: + CDH 4 (Cloudera distribution of Hadoop) + Impala + Consulting and training www.cloudera.com
  • 35. 35 MapR + Founded 2009, funding $20M + MapR Technologies is engineering game-changing Map/Reduce related technologies + Products: + M3, M5, M7 + NFS, no single node failure + NOT open source! www.mapr.com
  • 36. 36 HortonWorks + Founded 2011 + Yahoo spin-off + Products: + HDP distribution + tools www.hortonworks.com
  • 37. 37 Hadoop Ecosystem www.datameer.com
  • 38. 38 Big Data Landscape www.bigdatalandscape.com
  • 39. 39 Splunk + Founded 2003, raised $230M, IPO 2011, Market cap $3.35B + Machine logs analysis, operational intelligence + Collecting, searching, monitoring www.splunk.com
  • 40. 40 Datameer + Founded 2009, funding $17.8M + Big data: + Data integration + Data Analytics + Data Visualization www.datameer.com
  • 41. 41 Datasift + Founded 2010, funding $29.7M + Data platform for social web + Aggregate and filter data www.datasift.com
  • 42. 42 Infochimps + Founded 2009, funding $5.5M + Transitioned from data marketplace to big data platform + End-to-end big data solution, real time www.infochimps.com
  • 43. 43 Tableau software + Founded 2003, funding $15M + Big data analytics + Big data visualization www.tableau.com
  • 44. 44 Big data Startups 2012 + Platfora, in-memory BI on Hadoop + Sumologic, log file analysis + Hadapt, Hadoop+RDBMS + Metamarkets, patterns in data flow + DataStax, consulting, training + Karmasphere, BI, analytics on Hadoop
  • 45. 45 Big data startups 2013! + 10gen, MongoDB + ClearStory, big data aggregation + analytics + Continuuity, Hadoop API + Parstream, database analytics + Zoomdata, data visualization + Climate corporation, predictive analytics
  • 46. 46 Big data by industry www.gartner.com
  • 47. 47 Big data Processing
        Processing          Batch               Interactive                Stream
        Query time          minutes to hours    milliseconds to seconds    continuous
        Data volume         TB to PB            GB to PB                   continuous
        Programming model   MapReduce           Queries                    DAG
        Users               Developers          Analysts                   Developers
        Open Source         Hadoop MapReduce    Drill, Impala              Storm, Kafka
  • 48. 48 New technologies + Real time querying + Drill (based on Google Dremel) + Impala (Cloudera) + Data stream processing + Storm (Twitter), real time analytics + Kafka (LinkedIn), messaging system
  • 49. 49 Machine learning + Predictive analytics + Patterns discovery + Data mining + Tools: + Mahout + R
  • 50. 50 Big data revolution + Google: GFS, MapReduce, BigTable + Yahoo: Hadoop + Amazon: DynamoDB + Facebook: Cassandra, HBase + Twitter: FlockDB, Storm + LinkedIn: Voldemort, Kafka
  • 51. 51 Observations + Game changing technologies come from big companies + Open Source (!) + Start-up ecosystem + Less general, more specialized + Next step: big data analytics and visualization
  • 52. 52 Data scientist + Machine Learning + Data Mining + Statistics + Software Engineering + Hadoop/MapReduce/HBase/Hive/Pig + Java, Python, C/C++, SQL + “By 2018, the United States alone could face a shortage of 140,000 to 190,000 people with deep analytical skills as well as 1.5 million managers and analysts with the know-how to use the analysis of big data to make effective decisions.”
  • 53. 53 Big Data Products MindMap www.garycrawford.co.uk
  • 54. 54 Contacts + Leonid Zhukov, Ph.D. + School of Applied Mathematics and Information Science Higher School of Economics, NRU-HSE + lzhukov@hse.ru + www.leonidzhukov.ru