Big Data Ecosystem

Ivo Vachkov
Xi Group Ltd.

Big Data ???
 Definition
 The 3Vs:
 Volume
 Velocity
 Variety
 Added later:
 Veracity
 Variability
 Complexity

Processing Paradigms
 Batch Processing
 Large volumes
 Lower volatility
 Incremental updates
 Real-time Processing
 Smaller volumes
 Higher volatility
 Possible full regeneration
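
The contrast between the two paradigms can be sketched in a few lines. This is a minimal illustration, not a production design: the names batch_update and RealTimeView are hypothetical, and the fixed-size window is an assumption standing in for "smaller volumes".

```python
from collections import Counter, deque


def batch_update(totals, new_batch):
    """Batch style: fold a new batch into existing totals (incremental update)."""
    totals.update(new_batch)
    return totals


class RealTimeView:
    """Real-time style: keep only the most recent events and regenerate
    the whole result from that small window on every new event."""

    def __init__(self, size):
        # deque with maxlen silently drops the oldest event when full
        self.window = deque(maxlen=size)

    def observe(self, event):
        self.window.append(event)
        return Counter(self.window)  # full regeneration over the window
```

The batch totals only ever grow from the previous state, while the real-time view is cheap to rebuild because the window is small.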

The Data Path
 From Collection …
 … to Processing …
 … to Query:
 Consumption
 Visualization
 [Predictive] Analysis
 Monitoring / Validation
 ETL, anyone?!

Data Path / Collection
 Multiple sources (RDBMS, Logs, activity streams, message
queues, time series, etc.)
 Multiple types (structured, unstructured, free text, bags of
words, raw, normalized, etc.)
 Collection starts with raw data and produces digital
artifacts suitable for machine processing.
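
As a small illustration of turning raw data into an artifact suitable for machine processing, the sketch below parses one free-text log line into a structured record. The log layout (timestamp, level, message) and the name collect are assumptions made for the example, not something the slides prescribe.

```python
import re

# Assumed line layout: "<timestamp> <LEVEL> <free-text message>"
LINE = re.compile(r"(?P<ts>\S+) (?P<level>[A-Z]+) (?P<message>.*)")


def collect(raw_line):
    """Return a dict for a well-formed line, or None if it does not parse."""
    match = LINE.match(raw_line.strip())
    return match.groupdict() if match else None
```

Lines that do not match are returned as None so that the caller can decide whether to drop, count, or quarantine them.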

Data Path / Collection
 Wide variety of components and technologies:
 Flat files and binary formats (CSV, Avro, etc.) on a typical file
system
 Cluster-specific file systems
 RDBMS/SQL, NoSQL, NewSQL, MPP DBs, Graph Databases,
Document Databases
 Column Stores
 Key-Value Stores
 Time Series Stores
 Streaming and transformation engines

Data Path / Processing
 Different processing paradigms:
 Batch Processing
 Real-time Processing
 Multiple expected outcomes:
 Data
 Action
 Different destinations:
 Data stores
 Data-driven Control Planes

 Smaller number of technologies:
 Map / Reduce (Hadoop, CouchDB, MongoDB, Riak)
 Cluster Computing (PVM, MPI, LAM, OpenMP, etc.)
 HPC / Supercomputing
 Data parallelism is the key!
 Data locality is important!
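
The Map / Reduce idea and its reliance on data parallelism can be shown with a single-process word count. This is only a sketch of the programming model: the function names are hypothetical, the partitions stand in for data blocks that a real cluster would keep on separate nodes, and nothing here uses the Hadoop API.

```python
from collections import defaultdict


def map_phase(partition):
    """Map: emit (word, 1) pairs; each partition is processed independently."""
    return [(word.lower(), 1) for line in partition for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: collapse each key's values into a single result."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(partitions):
    mapped = [pair for part in partitions for pair in map_phase(part)]
    return reduce_phase(shuffle(mapped))
```

Because map_phase touches only its own partition, the map step could run wherever each partition is stored, which is where data locality comes in.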

 The importance of M/R
 Self-hosted solutions:
 Apache Hadoop
 Cloudera, Hortonworks, etc.
 Cloud-based solutions:
 AWS EMR (+Data Pipeline, +Kinesis, +S3, +DynamoDB)
 Joyent Manta
 … many others …

Data Path / Query
 Processing creates digital artifacts
 Extremely high variety of technologies, components,
services to deal with those artifacts:
 SQL interfaces on top of NoSQL stores
 NoSQL to NoSQL
 NoSQL to RDBMS
 Output to 3rd party API services
 Output to proprietary interfaces
 … a lot more …

Data Path / Query
 “Query-friendly” stores:
 Classical RDBMS, NewSQL
 Big Table & Column Stores
 Key-Value Stores
 Search-oriented services
 Visualization:
 3rd party services
 Tableau
 HTML5 / JavaScript Dashboards
 Programming languages / Visualization libraries

Data Path / Query
 Analysis
 Reports
 Trends / Predictions
 Real-time analytics
 Data-driven Control Plane
 Classical Business Intelligence
 Machine Learning (Mahout)
 Data Science (usually a fancy term for Statistics)

Big Data & Monitoring
 Infrastructure Monitoring
 Well understood
 Many products
 Full-Stack Application Monitoring
 Technical challenges
 No “one size fits all” solutions
 Data Quality Monitoring
 Emerging technologies
 Home-grown solutions
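
A home-grown data quality check can start very small. The sketch below reports, per required field, how many records are missing a value and whether that stays under a threshold. The name quality_report and the default 5% threshold are assumptions chosen for illustration.

```python
def quality_report(records, required_fields, max_missing_ratio=0.05):
    """Count missing or empty values per field and flag fields over the limit."""
    total = len(records)
    report = {}
    for field in required_fields:
        missing = sum(1 for record in records if record.get(field) in (None, ""))
        ratio = missing / total if total else 0.0
        report[field] = {
            "missing": missing,
            "ratio": ratio,
            "ok": ratio <= max_missing_ratio,  # hypothetical acceptance rule
        }
    return report
```

Run after each collection or processing step, a report like this could feed the same alerting used for infrastructure monitoring.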

 Infrastructure Monitoring

 Application Monitoring

 Data Quality Monitoring

… a bag of acronyms …
 Flume, Scribe, Chukwa, Sqoop, MapReduce, YARN, HDFS,
HBase, Pig Latin, Hive, HAWQ, Impala, Presto, Phoenix,
Spire, Drill, Storm, Samza, Malhar, Cassandra, Redis,
Voldemort, Accumulo, Oozie, Azkaban, Lipstick, Hue,
OpenTSDB, Mahout, Giraph, Lily, ZooKeeper, Datameer,
Tableau, Pentaho, SumoLogic, MongoDB, CouchDB,
Riak, Pregel, Lucene, Solr, Elasticsearch, Neo4j, OrientDB,
Memcached, FoundationDB, …
 AWS: Data Pipeline, EMR, Kinesis, DynamoDB, S3, Redshift,
ElastiCache, SQS, SWF
 Joyent: Manta

Piece of advice …
 Collect relevant data!
Collecting data for data’s sake only costs money …
 Use the processing technology that best matches your
business case!
Hadoop is pointless if your clients only want fast
geospatial searches …
 Consume wisely!
Knowing that 100% of X is Y means nothing when there
is only one X …
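
The last point can be made concrete by always reporting a percentage together with its sample size. This is a sketch only: the name share and the min_sample cut-off of 30 are arbitrary assumptions, not a statistical rule from the slides.

```python
def share(matching, total, min_sample=30):
    """Format a percentage together with the sample size it is based on."""
    if total == 0:
        return "no data"
    pct = 100.0 * matching / total
    # Flag results drawn from too few observations (threshold is an assumption)
    note = "" if total >= min_sample else " (sample too small to be meaningful)"
    return f"{pct:.0f}% of {total}{note}"
```

With one X, share(1, 1) returns "100% of 1 (sample too small to be meaningful)", which is harder to misread than a bare "100%".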
