Business of Big Data

1

Big Data
the next frontier

RVC Seminar, Moscow, 08/02/2013
Leonid Zhukov, Professor, Higher School of Economics

2
Big data

+ Graph of terms popularity

www.visibletechologies.com

3
McKinsey, May 2011

www.mckinsey.com

4
Headlines

Data driven business

Data democratization

Data scientists

5
The White House

+ $200M initiative
+ NSF: core techniques
+ NIH: 1000 genomes
+ DOE: advanced computing
+ DOD: data to decisions
+ USGS: Earth system

www.whitehouse.gov

6
Gartner Hype Cycle

www.gartner.com

7
Market Forecast

+ Market forecasts:
+ IDC: 2015 - $16.9B
+ Gartner: 2016 - $55B
+ Venture money invested (Reuters):
+ 2009 - $1.1B
+ 2010 - $1.53B
+ 2011 - $2.47B
www.wikibon.com

8
Big Data Revenue 2012

+ Big Business:
+ IBM
+ HP
+ Oracle
+ Teradata
+ EMC
www.wikibon.com

9
Big Data Vendors!

+ Hadoop:
+ Cloudera
+ MapR Technologies
+ HortonWorks
www.wikibon.com

10
Forrester Wave

www.forrester.com

11
What is big data

+ Big data:
+ “Data you can’t process by traditional tools”
+ “A phenomenon defined by the rapid acceleration in the
expanding volume of high velocity, complex and diverse
types of data.”

+ “Refers to a collection of tools, techniques and technologies
for working with data productively, at any scale.”

12
What is Big data

+ 3V
+ Volume: petabytes (1000TB) to exabytes (1000PB)
+ Variety: structured, semi-structured, unstructured
+ Velocity: Tb/s data streams
+ Requires distributed processing
+ Big data = storage + processing
+ Big data = Hadoop (not only)
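
Why volume at this scale requires distributed processing can be seen with back-of-the-envelope arithmetic. The sketch below estimates how long one full read of 1 PB takes on a single machine versus a cluster. The 100 MB/s per-node read rate and the 1000-node cluster size are illustrative assumptions, not figures from the slides.

```python
# Back-of-the-envelope: time to read 1 PB once, sequentially.
# ASSUMPTION: 100 MB/s sustained read per node (illustrative only).

PB = 1000 ** 5           # bytes in a petabyte (1 PB = 1000 TB, as on the slide)
MB = 1000 ** 2           # bytes in a megabyte
NODE_MB_PER_S = 100      # assumed per-node read throughput


def scan_seconds(total_bytes, nodes, mb_per_s=NODE_MB_PER_S):
    """Seconds to read total_bytes when the data is split evenly over `nodes`."""
    return total_bytes / (nodes * mb_per_s * MB)


single = scan_seconds(PB, nodes=1)
cluster = scan_seconds(PB, nodes=1000)   # assumed cluster size

print(f"1 node:     {single / 86400:.0f} days")
print(f"1000 nodes: {cluster / 3600:.1f} hours")
```

Under these assumptions a single node needs roughly 116 days, while 1000 nodes reading in parallel need under 3 hours.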

13
Big data Glossary

+ Hadoop, MapReduce, Hive, Pig, Cascading,
HBase, Hypertable, Cassandra, Flume, Sqoop,
MongoDB, Voldemort, Storm, Kafka, Drill, Dremel,
Impala, ZooKeeper, Ambari, Oozie, YARN, Redis,
Riak, Pregel, Gremlin, Giraph, Solr, Lucene, R,
Mahout, Weka

14
How big is Big?

+ Google
+ 24 PB data processed daily
+ Twitter
+ 340 mln daily tweets
+ 1.6 bln search queries
+ 7 TB added daily
+ Facebook
+ 750 mln users
+ 12 TB daily content
+ 2.7 bln “likes” and comments daily

15
Sources of Big Data

www.ibm.com

16
Supercomputing

+ National Labs, Universities, Military
+ Processing power, flops, MPI
+ Parallel computing:
+ Cray, IBM SP, SGI
+ Beowulf cluster (Linux commodity)

17
New realities

+ Yahoo, AltaVista, Inktomi, Google
+ Consumer web companies:
+ web search (crawling, indexing)
+ advertising
+ email services
+ ecommerce

+ Commodity hardware

19
GFS/HDFS

+ Distributed replicated data blocks (64 MB)
+ Master-slave architecture (Name Node, Data Nodes)
+ Not a general file system
+ Access via command line utils and API
+ Files can't be modified once written
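
The block-and-replica idea above can be sketched in a few lines. This is a toy model, not HDFS code: the replication factor of 3, the node names, and the round-robin placement are assumptions made for illustration (a real cluster uses a more elaborate, rack-aware placement policy).

```python
import math

BLOCK_SIZE = 64 * 1024 ** 2   # 64 MB block, as on the slide
REPLICATION = 3               # ASSUMPTION: three replicas per block


def block_layout(file_bytes, data_nodes, block_size=BLOCK_SIZE, replication=REPLICATION):
    """Return {block_index: [node, ...]} using simple round-robin placement.

    Hypothetical placement rule for illustration only.
    """
    if replication > len(data_nodes):
        raise ValueError("need at least as many data nodes as replicas")
    n_blocks = math.ceil(file_bytes / block_size)
    layout = {}
    for b in range(n_blocks):
        # Replicas of one block go to `replication` distinct, consecutive nodes.
        layout[b] = [data_nodes[(b + r) % len(data_nodes)] for r in range(replication)]
    return layout


nodes = ["dn1", "dn2", "dn3", "dn4"]             # hypothetical data nodes
layout = block_layout(200 * 1024 ** 2, nodes)    # a 200 MB file

print(len(layout), "blocks")
for block, replicas in layout.items():
    print(block, replicas)
```

A 200 MB file occupies four blocks here (the last one only partly filled), and losing any single node still leaves two copies of every block.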

20
MapReduce

+ Scalable:
+ no file IO
+ no networking
+ no synchronization

+ Master-slave architecture:
+ Master: divide, schedule, monitor work
+ Slave: actual processing
+ MapReduce programming model:
+ functional programming
+ like UNIX pipeline
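
The programming model can be illustrated with the classic word count, run here in a single process. This is a sketch of the map, shuffle, and reduce steps in plain Python, not Hadoop's API; the function names and input lines are made up for the example. Note that the user-written functions contain no file IO, networking, or synchronization code.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map: emit (word, 1) for every word in one input record."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group values by key (the framework does this on a real cluster)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values collected for one key."""
    return key, sum(values)


lines = ["big data is big", "data about data"]

mapped = chain.from_iterable(map(map_phase, lines))
counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(counts)
```

The shape mirrors a UNIX pipeline such as `cat | tr | sort | uniq -c`: map transforms records, the sort/group step brings equal keys together, and reduce aggregates each group.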

21
Data movement

+ store and process data on the same nodes
+ bring code to data, data “locality”
www.cloudera.com
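
"Bring code to data" amounts to a scheduling preference: run each task on a node that already stores the block it reads. The greedy scheduler below is a hypothetical sketch of that preference, not Hadoop's scheduler; node names, block names, and slot counts are invented for the example.

```python
def schedule(block_locations, free_slots):
    """Assign each block's task to a node, preferring nodes that hold the block.

    block_locations: {block_id: [node, ...]}  nodes storing a replica
    free_slots:      {node: int}              available task slots per node
    Returns {block_id: (node, is_local)}.
    """
    slots = dict(free_slots)   # copy so the caller's dict is not modified
    plan = {}
    for block, holders in block_locations.items():
        # First choice: a replica holder with a free slot (data-local task).
        node = next((n for n in holders if slots.get(n, 0) > 0), None)
        is_local = node is not None
        if node is None:
            # Fallback: any free node; the block must travel over the network.
            node = next((n for n, s in slots.items() if s > 0), None)
            if node is None:
                raise RuntimeError("no free slots left")
        slots[node] -= 1
        plan[block] = (node, is_local)
    return plan


blocks = {"b0": ["dn1", "dn2"], "b1": ["dn1", "dn3"], "b2": ["dn1", "dn2"]}
plan = schedule(blocks, {"dn1": 1, "dn2": 1, "dn3": 1})
for block, (node, is_local) in plan.items():
    print(block, node, "local" if is_local else "remote")
```

In this run all three tasks land on a node that already holds their block, so no block data crosses the network.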

22
Hadoop
+ Doug Cutting
+ Search indexer - Lucene
+ Web crawler - Nutch
+ Hadoop
+ HDFS
+ MapReduce

23
Yahoo!
+ 40,000 servers
+ 170PB storage
+ 1000+ active users
+ 5M+ monthly jobs
+ email spam filters
+ categorization, personalization
+ computational advertising

24
Database NoSQL Revolution
+ Needed:
+ fast read/write time
+ high concurrency
+ easy horizontal scaling
+ Flat data structure
+ Sacrificed:
+ DB Schema
+ SQL
+ Transactions
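
The trade-off above can be made concrete with a toy key-value store that spreads keys over several "nodes" by hashing. This is an illustrative sketch only: the class name, the in-memory dicts standing in for servers, and the hash-modulo placement are assumptions, and no specific product works exactly this way.

```python
import hashlib


class ShardedKV:
    """Toy key-value store split across in-memory 'nodes' (plain dicts)."""

    def __init__(self, n_shards):
        self.shards = [{} for _ in range(n_shards)]

    def _shard(self, key):
        # A stable hash of the key decides which node owns it.
        digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
        return self.shards[int(digest, 16) % len(self.shards)]

    def put(self, key, value):
        self._shard(key)[key] = value

    def get(self, key, default=None):
        return self._shard(key).get(key, default)


store = ShardedKV(n_shards=4)
store.put("user:42", {"name": "Ann", "likes": 7})
store.put("user:43", {"name": "Bob"})   # no schema: records may differ in shape

print(store.get("user:42"))
print(store.get("user:99"))             # missing key returns None
```

Each read or write touches exactly one node, which is what makes it fast and easy to scale out. The price is visible too: there is no schema, no query language beyond get and put, and no way to update two keys atomically.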

25
NoSQL World

+ Key-value: Dynamo, Voldemort, Redis, Riak
+ Column (tabular): HBase, Hypertable, Cassandra
+ Document store: CouchDB, MongoDB
+ Graph: Neo4J, FlockDB
+ 120+ products (2012)

26
Hadoop stack

www.hortonworks.com

27
Hadoop tools

+ Pig
+ high level scripting language (PigLatin)
+ converts to MapReduce jobs
+ Hive
+ SQL-like queries on data in HDFS
+ converts to MapReduce jobs

28

Hadoop data movement

www.cloudera.com

29
Typical hadoop usage
+ Text mining
+ Pattern recognition
+ Recommendation systems (collaborative filtering)
+ Prediction models
+ Risk assessment
+ Sentiment analysis
+ Customer churn prediction
+ Customer segmentation
+ Point of Sale Transaction analysis
+ Data “sandbox”

30

Application fields

+ Science: sensors, genome, weather, satellite,
imaging

+ Engineering: log analytics, status feeds, network
messages, spam filters

+ Product: financial, pharmaceutical, insurance,
energy, retail, ecommerce, healthcare, telecom

+ Business: analytics, BI

31
Business analytics

+ Analytic
+ Operational

Capture, analyze, learn from data
www.datasciencecentral.com

32
Who uses Hadoop?

www.cloudera.com

33
Why Hadoop?

www.thinkbiganalytics.com

34
Cloudera

+ Enterprise support for Apache Hadoop
+ Founded 2008, funding $141 M
+ Employees: 230
+ Products:
+ CDH 4 (Cloudera's Distribution Including Apache Hadoop)
+ Impala
+ Consulting and training
www.cloudera.com

35
MapR

+ Founded 2009, funding $20M
+ MapR Technologies is engineering game-changing
Map/Reduce related technologies

+ Products:
+ M3,M5,M7
+ NFS, no single node failure
+ NOT open source !
www.mapr.com

36
HortonWorks

+ Founded 2011
+ Yahoo spin-off
+ Products:
+ HDP distribution
+ tools

www.hortonworks.com

37
Hadoop Ecosystem

www.datameer.com

38
Big Data Landscape

www.bigdatalandscape.com

39
Splunk

+ Founded 2003, raised $230M, IPO 2012, Market cap $3.35B
+ Machine logs analysis, operational intelligence
+ Collecting, searching, monitoring

www.splunk.com

40
Datameer

+ Founded 2009, funding $17.8M

+ Big data:
+ Data integration
+ Data Analytics
+ Data Visualization
www.datameer.com

41
Datasift

+ Founded 2010, funding $29.7M
+ Data platform for social web
+ Aggregate and filter data

www.datasift.com

42
Infochimps

+ Founded 2009, funding $5.5M
+ Transitioned from data marketplace to big data platform
+ End-to-end big data solution, real time

www.infochimps.com

43
Tableau software

+ Founded 2003, funding $15M
+ Big data analytics
+ Big data visualization

www.tableau.com

44
Big data startups 2012

+ Platfora, in memory BI on Hadoop
+ Sumologic, log file analysis
+ Hadapt, Hadoop + RDBMS
+ Metamarkets, patterns in data flow
+ DataStax, consulting, training
+ Karmasphere, BI, analytics on Hadoop

45
Big data startups 2013!

+ 10gen, MongoDB
+ ClearStory, big data aggregation + analytics
+ Continuuity, Hadoop API
+ Parstream, database analytics
+ Zoomdata, data visualization
+ Climate corporation, predictive analytics

46
Big data by industry

www.gartner.com

47
Big data Processing

                    Batch processing     Interactive               Stream processing
Query time          minutes to hours     milliseconds to seconds   continuous
Data volume         TB to PB             GB to PB                  continuous
Programming model   MapReduce            Queries                   DAG
Users               Developers           Analysts                  Developers
Open source         Hadoop MapReduce     Drill, Impala             Storm, Kafka

48
New technologies

+ Real time querying
+ Drill (based on Google Dremel)
+ Impala (Cloudera)

+ Data stream processing
+ Storm (Twitter), real time analytics
+ Kafka (LinkedIn), messaging system
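
What sets stream processing apart from batch is that results are updated per event instead of recomputed over stored data. The sliding-window counter below is a minimal single-process sketch of that idea; it does not use the Storm or Kafka APIs, and the class name, window length, and timestamps are invented for the example.

```python
from collections import deque


class SlidingWindowCounter:
    """Count events seen in the last `window` seconds, updated per event."""

    def __init__(self, window):
        self.window = window
        self.events = deque()   # timestamps, oldest first

    def add(self, timestamp):
        """Record one event; assumes timestamps arrive in non-decreasing order."""
        self.events.append(timestamp)
        self._expire(timestamp)

    def _expire(self, now):
        # Drop events that have fallen out of the window (now - window, now].
        while self.events and self.events[0] <= now - self.window:
            self.events.popleft()

    def count(self, now):
        self._expire(now)
        return len(self.events)


counter = SlidingWindowCounter(window=10)
for t in [1, 2, 3, 11, 12]:
    counter.add(t)

print(counter.count(now=12))   # events at t=3, 11, 12 are still inside the window
```

Each event costs a small, roughly constant amount of work, so the answer is always current; a batch job would have to rescan the stored events to produce the same number.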

49
Machine learning

+ Predictive analytics
+ Patterns discovery
+ Data mining
+ Tools:
+ Mahout
+ R

50
Big data revolution

+ Google: GFS, MapReduce, BigTable
+ Yahoo: Hadoop
+ Amazon: DynamoDB
+ Facebook: Cassandra, HBase
+ Twitter: FlockDB, Storm
+ LinkedIn: Voldemort, Kafka

51
Observations

+ Game changing technologies come from big companies
+ Open Source (!)
+ Start-up ecosystem
+ Less general, more specialized
+ Next step: big data analytics and visualization

52
Data scientist

+ Machine Learning
+ Data Mining
+ Statistics
+ Software Engineering
+ Hadoop/MapReduce/HBase/Hive/Pig
+ Java, Python, C/C++, SQL

“By 2018, the United States alone could face a shortage of 140,000 to 190,000
people with deep analytical skills as well as 1.5 million managers and analysts with
the know-how to use the analysis of big data to make effective decisions.”

53
Big Data Products MindMap

www.garycrawford.co.uk

54
Contacts

+ Leonid Zhukov, Ph.D.
+ School of Applied Mathematics and Information Science
Higher School of Economics, NRU-HSE

+ lzhukov@hse.ru
+ www.leonidzhukov.ru
