Big Data Ecosystem


  1. Big Data Ecosystem
     Ivo Vachkov, Xi Group Ltd.
  2. Big Data ???
     • Definition
     • The 3Vs: Volume, Velocity, Variety
     • Added later: Veracity, Variability, Complexity
  3. Processing Paradigms
     • Batch Processing: large volumes, lower volatility, incremental updates
     • Real-time Processing: smaller volumes, higher volatility, possible full regeneration
  4. The Data Path
     • From Collection …
     • … to Processing …
     • … to Query: Consumption, Visualization, [Predictive] Analysis
     • Monitoring / Validation
     • ETL, anyone?!
  5. The Data Path
  6. Data Path / Collection
     • Multiple sources (RDBMS, logs, activity streams, message queues, time series, etc.)
     • Multiple types (structured, unstructured, free text, bags of words, raw, normalized, etc.)
     • Collection starts with raw data and produces digital artifacts suitable for machine processing.
  7. Data Path / Collection
     Wide variety of components and technologies:
     • Flat files and binary formats (CSV, Avro, etc.) on a typical file system
     • Cluster-specific file systems
     • RDBMS/SQL, NoSQL, NewSQL, MPP DBs, Graph Databases, Document Databases
     • Column Stores
     • Key-Value Stores
     • Time Series Stores
     • Streaming and transformation engines
  8. Data Path / Processing
     • Different processing paradigms: Batch Processing, Real-time Processing
     • Multiple expected outcomes: Data, Action
     • Different destinations: Data stores, Data-driven Control Planes
  9. Data Path / Processing
     Smaller number of technologies:
     • Map / Reduce (Hadoop, CouchDB, MongoDB, Riak)
     • Cluster Computing (PVM, MPI, LAM, OpenMP, etc.)
     • HPC / Supercomputing
     Data parallelism is the key! Data locality is important!
  10. Data Path / Processing
     The importance of M/R
     • Self-hosted solutions: Apache Hadoop; Cloudera, Hortonworks, etc.
     • Cloud-based solutions: AWS EMR (+Data Pipeline, +Kinesis, +S3, +DynamoDB); Joyent Manta
     • … many others …
  11. Data Path / Query
     Processing creates digital artifacts. An extremely high variety of technologies, components, and services deals with those artifacts:
     • SQL interfaces on top of NoSQL stores
     • NoSQL to NoSQL
     • NoSQL to RDBMS
     • Output to 3rd party API services
     • Output to proprietary interfaces
     • … a lot more …
  12. Data Path / Query
     • “Query-friendly” stores: Classical RDBMS, NewSQL; Big Table & Column Stores; Key-Value Stores; Search-oriented services
     • Visualization: 3rd party services; Tableau; HTML5 / JavaScript Dashboards; Programming languages / Visualization libraries
  13. Data Path / Query
     Analysis:
     • Reports
     • Trends / Predictions
     • Real-time analytics
     • Data-driven Control Plane
     • Classical Business Intelligence
     • Machine Learning (Mahout)
     • Data Science (usually a fancy term for Statistics)
  14. Big Data & Monitoring
     • Infrastructure Monitoring: well understood, many products
     • Full-Stack Application Monitoring: technical challenges, no “one size fits all” solutions
     • Data Quality Monitoring: emerging technologies, home-grown solutions
  15. Big Data & Monitoring: Infrastructure Monitoring
  16. Big Data & Monitoring: Application Monitoring
  17. Big Data & Monitoring: Data Quality Monitoring
  18. … a bag of acronyms …
     • Flume, Scribe, Chukwa, Sqoop, MapReduce, YARN, HDFS, HBase, Pig Latin, Hive, HAWQ, Impala, Presto, Phoenix, Spire, Drill, Storm, Samza, Malhar, Cassandra, Redis, Voldemort, Accumulo, Oozie, Azkaban, Lipstick, Hue, OpenTSDB, Mahout, Giraph, Lily, ZooKeeper, Datameer, Tableau, Pentaho, Sumo Logic, MongoDB, CouchDB, Riak, Pregel, Lucene, Solr, Elasticsearch, Neo4j, OrientDB, Memcached, FoundationDB, …
     • AWS: Data Pipeline, EMR, Kinesis, DynamoDB, S3, Redshift, ElastiCache, SQS, SWF
     • Joyent: Manta
  19. Piece of advice …
     • Collect relevant data! Collecting data for data’s sake only costs money …
     • Use the processing technology that best matches your business case! Hadoop is pointless if your clients only want fast geospatial searches …
     • Consume wisely! Knowing that 100% of X is Y means nothing when there is only one X …
  20. Conclusion
     Q & A
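
The Map / Reduce model named on slides 9 and 10 can be pictured with a small, single-process sketch. It is only an illustration of the three phases (map, shuffle, reduce), not how Hadoop or any other listed product implements them; the function names and the sample log lines are hypothetical. Because each record is mapped independently, the map phase is the part a real cluster would spread across nodes, which is what "data parallelism" refers to.

```python
from collections import defaultdict


def map_phase(record):
    """Map: turn one input record into (key, value) pairs, here (word, 1)."""
    return [(word.lower(), 1) for word in record.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single result."""
    return key, sum(values)


def map_reduce(records):
    """Run the three phases sequentially over an iterable of text records."""
    # Each record is mapped on its own, so this loop could run in parallel.
    mapped = [pair for record in records for pair in map_phase(record)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    # Hypothetical sample input: three log lines.
    logs = ["error disk full", "warning disk slow", "error network down"]
    print(map_reduce(logs))
```

In a distributed setting the records would usually be mapped on the nodes that already store them, which is where the "data locality" remark on slide 9 comes in.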

Editor's Notes

  • Intro, Abstract, Who am I
  • Big Data = data sets with sizes beyond the ability of commonly used software tools to capture, curate, manage, and process within a tolerable elapsed time.

    Although Gartner’s definition (the 3Vs) is still widely used, the growing maturity of the concept supports a sharper distinction between big data and Business Intelligence, regarding data and their use:[18]
    Business Intelligence uses descriptive statistics on data with high information density to measure things, detect trends, etc.;
    Big data uses inductive statistics and concepts from nonlinear system identification [19] to infer laws (regressions, nonlinear relationships, and causal effects) from large sets of data with low information density[20], in order to reveal relationships and dependencies and to predict outcomes and behaviors.[19][21]
    Big data can also be defined as a large volume of unstructured data that cannot be handled by standard database management systems such as DBMS, RDBMS or ORDBMS.

  • Two distinct processing paradigms that drive different technologies

    Why one? Why the other?

    Use cases …
  • Comes from ETL after all, specific but known.
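
One way to picture the two processing paradigms from the notes is to compute the same statistic both ways. The sketch below is an assumption-laden illustration, not something taken from the slides: `batch_mean` and `RunningMean` are hypothetical names, and a mean is used only because it is easy to check. The batch version needs the whole data set at once, while the real-time version updates its answer as each event arrives and keeps no history.

```python
def batch_mean(values):
    """Batch style: the full data set is available before processing starts."""
    values = list(values)
    return sum(values) / len(values) if values else 0.0


class RunningMean:
    """Real-time style: update the result per event without storing the events."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0

    def update(self, value):
        # Incremental mean: shift the current mean toward the new value.
        self.count += 1
        self.mean += (value - self.mean) / self.count
        return self.mean


if __name__ == "__main__":
    # Hypothetical event values, e.g. request latencies in milliseconds.
    events = [10, 20, 30, 40]

    print("batch:", batch_mean(events))

    stream = RunningMean()
    for event in events:
        print("after", event, "->", stream.update(event))
```

Both approaches agree once all events have been seen; they differ in when an answer is available and in how much data must be held to produce it.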