Big Data Ecosystem 
Ivo Vachkov 
Xi Group Ltd.
Presentation for the 2014 IDC Big Data and Business Intelligence forum in Sofia, Bulgaria, 2014-09-18

Published in: Technology

  1. Big Data Ecosystem
       Ivo Vachkov, Xi Group Ltd.
  2. Big Data ???
       • Definition
       • The 3Vs:
           • Volume
           • Velocity
           • Variety
       • Added later:
           • Veracity
           • Variability
           • Complexity
  3. Processing Paradigms
       • Batch Processing
           • Large volumes
           • Lower volatility
           • Incremental updates
       • Real-time Processing
           • Smaller volumes
           • Higher volatility
           • Possible full regeneration
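One possible reading of the contrast on slide 3, as a minimal Python sketch: a large batch view is updated incrementally by folding in only the newly arrived batch, while a small real-time view is cheap enough to regenerate in full on every event. The function names, the event strings, and the window size of 3 are all hypothetical and are not taken from the presentation.

```python
from collections import Counter, deque


def update_batch_view(view: Counter, new_batch: list) -> Counter:
    """Incremental update: fold only the newly arrived batch into the existing view."""
    view.update(new_batch)
    return view


def regenerate_realtime_view(window: deque) -> Counter:
    """Full regeneration: recount everything currently held in the small window."""
    return Counter(window)


# Batch side: the historical view grows by merging each new batch.
batch_view = Counter()
update_batch_view(batch_view, ["login", "view", "view"])
update_batch_view(batch_view, ["view", "logout"])

# Real-time side: only the last 3 events are kept (assumed window size),
# and the view is rebuilt from scratch after every event.
window = deque(maxlen=3)
realtime_view = Counter()
for event in ["login", "view", "view", "logout"]:
    window.append(event)
    realtime_view = regenerate_realtime_view(window)
```

The batch view never rereads old data, which matters when the volume is large; the real-time view always reflects exactly what is in the window, which is affordable only because the window is small.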
  4. The Data Path
       • From Collection …
       • … to Processing …
       • … to Query:
           • Consumption
           • Visualization
           • [Predictive] Analysis
           • Monitoring / Validation
       • ETL, anyone?!
  5. The Data Path
  6. Data Path / Collection
       • Multiple sources (RDBMS, logs, activity streams, message queues, time series, etc.)
       • Multiple types (structured, unstructured, free text, bags of words, raw, normalized, etc.)
       • Collection starts with raw data and produces digital artifacts suitable for machine processing.
  7. Data Path / Collection
       • Wide variety of components and technologies:
           • Flat files and binary formats (CSV, Avro, etc.) on a typical file system
           • Cluster-specific file systems
           • RDBMS/SQL, NoSQL, NewSQL, MPP DBs, Graph Databases, Document Databases
           • Column Stores
           • Key-Value Stores
           • Time Series Stores
           • Streaming and transformation engines
  8. Data Path / Processing
       • Different processing paradigms:
           • Batch Processing
           • Real-time Processing
       • Multiple expected outcomes:
           • Data
           • Action
       • Different destinations:
           • Data stores
           • Data-driven Control Planes
  9. Data Path / Processing
       • Smaller number of technologies:
           • Map / Reduce (Hadoop, CouchDB, MongoDB, Riak)
           • Cluster Computing (PVM, MPI, LAM, OpenMP, etc.)
           • HPC / Supercomputing
       • Data parallelism is the key!
       • Data locality is important!
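To make the Map / Reduce and data-parallelism points on slide 9 concrete, here is a minimal single-process Python sketch of a word count. It only illustrates the shape of the paradigm and is not the API of Hadoop or any other system named in the slides; the function names and the sample partitions are hypothetical.

```python
from collections import defaultdict


def map_phase(partition):
    """Emit a (word, 1) pair for every word in one partition of the input."""
    return [(word.lower(), 1) for line in partition for word in line.split()]


def shuffle(mapped_partitions):
    """Group the emitted values by key across all partitions."""
    groups = defaultdict(list)
    for pairs in mapped_partitions:
        for key, value in pairs:
            groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Collapse each key's list of values into a single count."""
    return {key: sum(values) for key, values in groups.items()}


# Each partition stands in for a block of data stored on a different node.
partitions = [["big data", "data path"], ["data locality"]]

# The map calls share no state, so they could run in parallel,
# each on the node that already holds its partition.
mapped = [map_phase(p) for p in partitions]
word_counts = reduce_phase(shuffle(mapped))
```

Because `map_phase` touches only its own partition, the work can be moved to where the data lives, which is the data-locality point; only the much smaller (key, value) pairs need to cross the network in the shuffle step.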
  10. Data Path / Processing
       • The importance of M/R
       • Self-hosted solutions:
           • Apache Hadoop
           • Cloudera, Hortonworks, etc.
       • Cloud-based solutions:
           • AWS EMR (+Data Pipeline, +Kinesis, +S3, +Dynamo)
           • Joyent Manta
           • … many others …
  11. Data Path / Query
       • Processing will create digital artifacts
       • Extremely high variety of technologies, components, and services to deal with those artifacts:
           • SQL interfaces on top of NoSQL stores
           • NoSQL to NoSQL
           • NoSQL to RDBMS
           • Output to 3rd party API services
           • Output to proprietary interfaces
           • … a lot more …
  12. Data Path / Query
       • “Query-friendly” stores:
           • Classical RDBMS, NewSQL
           • Big Table & Column Stores
           • Key-Value Stores
           • Search-oriented services
       • Visualization:
           • 3rd party services
           • Tableau
           • HTML5 / JavaScript Dashboards
           • Programming languages / Visualization libraries
  13. Data Path / Query
       • Analysis
           • Reports
           • Trends / Predictions
           • Real-time analytics
       • Data-driven Control Plane
       • Classical Business Intelligence
       • Machine Learning (Mahout)
       • Data Science (usually a fancy term for Statistics)
  14. Big Data & Monitoring
       • Infrastructure Monitoring
           • Well understood
           • Many products
       • Full-Stack Application Monitoring
           • Technical challenges
           • No “one size fits all” solutions
       • Data Quality Monitoring
           • Emerging technologies
           • Home-grown solutions
  15. Big Data & Monitoring
       • Infrastructure Monitoring
  16. Big Data & Monitoring
       • Application Monitoring
  17. Big Data & Monitoring
       • Data Quality Monitoring
  18. … a bag of acronyms …
       • Flume, Scribe, Chukwa, Sqoop, MapReduce, YARN, HDFS, HBase, Pig Latin, Hive, HAWQ, Impala, Presto, Phoenix, Spire, Drill, Storm, Samza, Malhar, Cassandra, Redis, Voldemort, Accumulo, Oozie, Azkaban, Lipstick, Hue, OpenTSDB, Mahout, Giraph, Lily, ZooKeeper, Datameer, Tableau, Pentaho, Sumo Logic, MongoDB, CouchDB, Riak, Pregel, Lucene, Solr, Elasticsearch, Neo4j, OrientDB, Memcached, FoundationDB, …
       • AWS: Data Pipeline, EMR, Kinesis, DynamoDB, S3, Redshift, ElastiCache, SQS, SWF
       • Joyent: Manta
  19. Piece of advice …
       • Collect relevant data!
         Collecting data for data’s sake only costs money …
       • Use the processing technology that best matches your business case!
         Hadoop is pointless if your clients only want fast geospatial searches …
       • Consume wisely!
         Knowing that 100% of X is Y means nothing when there is only one X …
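The "Consume wisely" advice can be illustrated with a small Python sketch that reports a proportion together with an uncertainty range. The Wilson score interval used here is one common choice picked for this illustration, not a method named in the presentation; the function name and the 95% level (z = 1.96) are assumptions.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96):
    """Return (low, high) bounds of the Wilson score interval for a proportion.

    z = 1.96 corresponds to roughly 95% confidence (an assumed default).
    """
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    # Clamp to [0, 1] to guard against floating-point overshoot.
    return max(0.0, center - half), min(1.0, center + half)


# "100% of X is Y" observed on a single X versus on a thousand of them.
single = wilson_interval(1, 1)
many = wilson_interval(1000, 1000)
```

Both cases show 100% in a report, but the interval for the single observation spans most of the range from 0 to 1, while the interval for a thousand observations stays close to 1. Showing the sample size or the interval next to the percentage keeps the difference visible to whoever consumes the result.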
  20. Conclusion: Q & A
