Business of Big Data

Published in: Technology
Transcript of "Business of Big Data"

  1. Big Data: the next frontier. RVC Seminar, Moscow, 08/02/2013. Leonid Zhukov, Professor, Higher School of Economics
  2. Big data (www.visibletechologies.com)
     + Graph of term popularity
  3. McKinsey, May 2011 (www.mckinsey.com)
  4. Headlines: data-driven business, data democratization, data scientists
  5. The White House (www.whitehouse.gov)
     + $200M initiative
     + NSF: core techniques
     + NIH: 1000 genomes
     + DOE: advanced computing
     + DOD: data to decisions
     + USGS: Earth system
  6. Gartner Hype Cycle (www.gartner.com)
  7. Market Forecast (www.wikibon.com)
     + Venture money invested (Reuters):
       + 2009: $1.1B
       + 2010: $1.53B
       + 2011: $2.47B
     + Market forecasts:
       + IDC: 2015, $16.9B
       + Gartner: 2016, $55B
  8. Big Data Revenue 2012 (www.wikibon.com)
     + Big business: IBM, HP, Oracle, Teradata, EMC
  9. Big Data Vendors! (www.wikibon.com)
     + Hadoop: Cloudera, MapR Technologies, Hortonworks
  10. Forrester Wave (www.forrester.com)
  11. What is big data
      + Big data:
        + “Data you can’t process by traditional tools”
        + “A phenomenon defined by the rapid acceleration in the expanding volume of high velocity, complex and diverse types of data.”
        + “Refers to a collection of tools, techniques and technologies for working with data productively, at any scale.”
  12. What is Big data
      + 3V
        + Volume: petabytes (1000 TB) to exabytes (1000 PB)
        + Variety: structured, semi-structured, unstructured
        + Velocity: Tb/s data streams
      + Requires distributed processing
      + Big data = storage + processing
      + Big data = Hadoop (but not only)
  13. Big data Glossary
      + Hadoop, MapReduce, Hive, Pig, Cascading, HBase, Hypertable, Cassandra, Flume, Sqoop, MongoDB, Voldemort, Storm, Kafka, Drill, Dremel, Impala, ZooKeeper, Ambari, Oozie, YARN, Redis, Riak, Pregel, Gremlin, Giraph, Solr, Lucene, R, Mahout, Weka
  14. How big is Big?
      + Google
        + 24 PB of data processed daily
      + Twitter
        + 340 million tweets daily
        + 1.6 billion search queries
        + 7 TB added daily
      + Facebook
        + 750 million users
        + 12 TB of content daily
        + 2.7 billion “likes” and comments daily
  15. Sources of Big Data (www.ibm.com)
  16. Supercomputing
      + National labs, universities, military
      + Processing power, flops, MPI
      + Parallel computing:
        + Cray, IBM SP, SGI
        + Beowulf cluster (Linux on commodity hardware)
  17. New realities
      + Yahoo, AltaVista, Inktomi, Google
      + Consumer web companies:
        + web search (crawling, indexing)
        + advertising
        + email services
        + ecommerce
      + Commodity hardware
  18. Google: 2003, 2004
  19. GFS/HDFS
      + Distributed, replicated data blocks (64 MB)
      + Master-slave architecture (NameNode, DataNodes)
      + Not a general-purpose file system
      + Access via command-line utilities and API
      + Files can’t be modified once written
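The block-and-replica idea on this slide can be illustrated with a small sketch. It is not HDFS code: the replication factor of 3, the node names and the round-robin placement are assumptions made for the example (actual HDFS placement is rack-aware), and `split_into_blocks` and `place_replicas` are hypothetical helpers.

```python
BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB, the block size quoted on the slide
REPLICATION = 3                # assumption: not stated on the slide


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the byte length of each block for a file of file_size bytes."""
    full, rest = divmod(file_size, block_size)
    return [block_size] * full + ([rest] if rest else [])


def place_replicas(num_blocks, data_nodes, replication=REPLICATION):
    """Map each block id to `replication` distinct data nodes, round-robin.

    This plays the role of the master's block table in a toy form.
    """
    if replication > len(data_nodes):
        raise ValueError("need at least as many data nodes as replicas")
    placement = {}
    for block_id in range(num_blocks):
        start = block_id % len(data_nodes)
        placement[block_id] = [
            data_nodes[(start + i) % len(data_nodes)] for i in range(replication)
        ]
    return placement


blocks = split_into_blocks(200 * 1024 * 1024)  # a 200 MB file
block_table = place_replicas(len(blocks), ["dn1", "dn2", "dn3", "dn4"])
```

A 200 MB file becomes three full 64 MB blocks plus one 8 MB block, and losing any single node still leaves two copies of every block.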
  20. MapReduce
      + Scalable:
        + no file IO
        + no networking
        + no synchronization
      + Master-slave architecture
        + Master: divide, schedule, monitor work
        + Slave: actual processing
      + MapReduce programming model:
        + functional programming
        + like a UNIX pipeline
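The programming model can be sketched in a few lines of plain Python. This is a single-process illustration of the map, shuffle and reduce steps using the usual word-count example; it is not Hadoop code, and the function names are chosen for the example only.

```python
from collections import defaultdict


def map_phase(line):
    """Mapper: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reducer: sum the counts collected for one word."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce in sequence over the input lines."""
    mapped = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())


counts = word_count(["big data", "Big Data big"])
```

The mapper and reducer contain no file IO, networking or synchronization, which matches the "Scalable" bullet: in a real cluster those concerns belong to the framework, so the same two functions can run on many nodes.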
  21. Data movement (www.cloudera.com)
      + store and process data on the same nodes
      + bring code to data, data “locality”
  22. Hadoop
      + Doug Cutting
        + Search indexer: Lucene
        + Web crawler: Nutch
        + Hadoop
          + HDFS
          + MapReduce
  23. Yahoo!
      + 40,000 servers
      + 170 PB storage
      + 1000+ active users
      + 5M+ monthly jobs
      + email spam filters
      + categorization, personalization
      + computational advertising
  24. Database revolution: NoSQL
      + Needed:
        + fast read/write time
        + high concurrency
        + easy horizontal scalability
      + Flat data structure
      + Sacrificed:
        + DB schema
        + SQL
        + Transactions
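The trade-off on this slide can be seen in a toy key-value store that spreads keys across shards by hash. This is a sketch under assumptions: `ShardedStore` is a hypothetical class, the shards are in-memory dictionaries standing in for separate servers, and simple modulo hashing is used for brevity.

```python
import hashlib


class ShardedStore:
    """Toy key-value store that spreads keys over several in-memory shards."""

    def __init__(self, num_shards):
        self.shards = [{} for _ in range(num_shards)]

    def _shard_for(self, key):
        # A stable hash, so the same key always maps to the same shard
        digest = hashlib.md5(key.encode("utf-8")).hexdigest()
        return self.shards[int(digest, 16) % len(self.shards)]

    def put(self, key, value):
        """Store a value under a key; no schema is enforced on the value."""
        self._shard_for(key)[key] = value

    def get(self, key, default=None):
        """Look a key up on the single shard that owns it."""
        return self._shard_for(key).get(key, default)


store = ShardedStore(num_shards=4)
store.put("user:42", {"name": "Ann"})
profile = store.get("user:42")
```

Each read or write touches exactly one shard, which is what makes adding servers straightforward. The cost is the "Sacrificed" list: there is no schema, no query language, and no transaction spanning keys on different shards. Modulo hashing also remaps most keys when the shard count changes, a problem production systems typically address with schemes such as consistent hashing.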
  25. NoSQL World
      + Key-value: Dynamo, Voldemort, Redis, Riak
      + Column (tabular): HBase, Hypertable, Cassandra
      + Document store: CouchDB, MongoDB
      + Graph: Neo4j, FlockDB
      + 120+ products (2012)
  26. Hadoop stack (www.hortonworks.com)
  27. Hadoop tools
      + Pig
        + high-level scripting language (Pig Latin)
        + converts to MapReduce jobs
      + Hive
        + SQL-like queries on data in HDFS
        + converts to MapReduce jobs
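To make "converts to MapReduce jobs" concrete, here is a rough sketch of how a SQL-like aggregate maps onto a map step and a reduce step. The `sales` table, its columns and the function names are hypothetical, and real Hive query planning is considerably more involved than this single-process imitation.

```python
from collections import defaultdict

# Hypothetical table "sales" with columns (country, amount)
sales = [("RU", 10.0), ("US", 5.0), ("RU", 2.5)]

# Query being imitated (SQL-like, as on the slide):
#   SELECT country, SUM(amount) FROM sales GROUP BY country;


def mapper(row):
    """Emit the GROUP BY column as the key and the aggregated column as the value."""
    country, amount = row
    return country, amount


def reducer(country, amounts):
    """Apply the aggregate function (SUM) to all values of one key."""
    return country, sum(amounts)


def run_group_by(rows):
    """Group mapper output by key, then reduce each group."""
    grouped = defaultdict(list)
    for key, value in map(mapper, rows):
        grouped[key].append(value)
    return dict(reducer(k, v) for k, v in grouped.items())


totals = run_group_by(sales)
```

The GROUP BY column becomes the map output key and the aggregate becomes the reducer, which is why such queries fit the MapReduce model naturally.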
  28. Hadoop data movement (www.cloudera.com)
  29. Typical Hadoop usage
      + Text mining
      + Pattern recognition
      + Recommendation systems (collaborative filtering)
      + Prediction models
      + Risk assessment
      + Sentiment analysis
      + Customer churn prediction
      + Customer segmentation
      + Point-of-sale transaction analysis
      + Data “sandbox”
  30. Application fields
      + Science: sensors, genome, weather, satellite, imaging
      + Engineering: log analytics, status feeds, network messages, spam filters, ...
      + Product: financial, pharmaceutical, insurance, energy, retail, ecommerce, healthcare, telecom
      + Business: analytics, BI
  31. Business analytics (www.datasciencecentral.com)
      + Analytic
      + Operational
      Capture, analyze, learn from data
  32. Who uses Hadoop? (www.cloudera.com)
  33. Why Hadoop? (www.thinkbiganalytics.com)
  34. Cloudera (www.cloudera.com)
      + Enterprise support for Apache Hadoop
      + Founded 2008, funding $141M
      + Employees: 230
      + Products:
        + CDH 4 (Cloudera distribution of Hadoop)
        + Impala
        + Consulting and training
  35. MapR (www.mapr.com)
      + Founded 2009, funding $20M
      + MapR Technologies is engineering game-changing Map/Reduce related technologies
      + Products:
        + M3, M5, M7
        + NFS, no single point of failure
        + NOT open source!
  36. Hortonworks (www.hortonworks.com)
      + Founded 2011
      + Yahoo spin-off
      + Products:
        + HDP distribution
        + tools
  37. Hadoop Ecosystem (www.datameer.com)
  38. Big Data Landscape (www.bigdatalandscape.com)
  39. Splunk (www.splunk.com)
      + Founded 2003, raised $230M, IPO 2011, market cap $3.35B
      + Machine log analysis, operational intelligence
      + Collecting, searching, monitoring
  40. Datameer (www.datameer.com)
      + Founded 2009, funding $17.8M
      + Big data:
        + Data integration
        + Data analytics
        + Data visualization
  41. Datasift (www.datasift.com)
      + Founded 2010, funding $29.7M
      + Data platform for the social web
      + Aggregate and filter data
  42. Infochimps (www.infochimps.com)
      + Founded 2009, funding $5.5M
      + Transitioned from data marketplace to big data platform
      + End-to-end big data solution, real time
  43. Tableau Software (www.tableau.com)
      + Founded 2003, funding $15M
      + Big data analytics
      + Big data visualization
  44. Big data startups 2012
      + Platfora, in-memory BI on Hadoop
      + Sumologic, log file analysis
      + Hadapt, Hadoop + RDBMS
      + Metamarkets, patterns in data flow
      + DataStax, consulting, training
      + Karmasphere, BI, analytics on Hadoop
  45. Big data startups 2013!
      + 10gen, MongoDB
      + ClearStory, big data aggregation + analytics
      + Continuuity, Hadoop API
      + Parstream, database analytics
      + Zoomdata, data visualization
      + Climate Corporation, predictive analytics
  46. Big data by industry (www.gartner.com)
  47. Big data processing

      |                   | Batch            | Interactive             | Stream       |
      |-------------------|------------------|-------------------------|--------------|
      | Query time        | minutes to hours | milliseconds to seconds | continuous   |
      | Data volume       | TB to PB         | GB to PB                | continuous   |
      | Programming model | MapReduce        | Queries                 | DAG          |
      | Users             | Developers       | Analysts                | Developers   |
      | Open source       | Hadoop MapReduce | Drill, Impala           | Storm, Kafka |
  48. New technologies
      + Real-time querying
        + Drill (based on Google Dremel)
        + Impala (Cloudera)
      + Data stream processing
        + Storm (Twitter), real-time analytics
        + Kafka (LinkedIn), messaging system
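What sets stream processing apart from batch jobs is that results are updated per event instead of after a full pass over stored data. A minimal sketch of that idea, using a plain Python generator and not the Storm or Kafka APIs (the hashtag events are made up for the example):

```python
from collections import Counter


def running_counts(events):
    """Yield the updated count for each hashtag as soon as its event arrives."""
    counts = Counter()
    for tag in events:
        counts[tag] += 1
        yield tag, counts[tag]


updates = list(running_counts(["#hadoop", "#storm", "#hadoop"]))
```

Each incoming event produces an immediate update, so a consumer sees `#hadoop` reach a count of 2 without waiting for the stream to end.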
  49. Machine learning
      + Predictive analytics
      + Pattern discovery
      + Data mining
      + Tools:
        + Mahout
        + R
  50. Big data revolution
      + Google: GFS, MapReduce, BigTable
      + Yahoo: Hadoop
      + Amazon: DynamoDB
      + Facebook: Cassandra, HBase
      + Twitter: FlockDB, Storm
      + LinkedIn: Voldemort, Kafka
  51. Observations
      + Game-changing technologies come from big companies
      + Open source (!)
      + Start-up ecosystem
      + Less general, more specialized
      + Next step: big data analytics and visualization
  52. Data scientist
      + Machine Learning
      + Data Mining
      + Statistics
      + Software Engineering
      + Hadoop/MapReduce/HBase/Hive/Pig
      + Java, Python, C/C++, SQL
      “By 2018, the United States alone could face a shortage of 140,000 to 190,000 people with deep analytical skills as well as 1.5 million managers and analysts with the know-how to use the analysis of big data to make effective decisions.”
  53. Big Data Products MindMap (www.garycrawford.co.uk)
  54. Contacts
      + Leonid Zhukov, Ph.D.
      + School of Applied Mathematics and Information Science, Higher School of Economics, NRU-HSE
      + lzhukov@hse.ru
      + www.leonidzhukov.ru