6. Data Path / Collection
Multiple sources (RDBMS, Logs, activity streams, message
queues, time series, etc.)
Multiple types (structured, unstructured, free text, bags of
words, raw, normalized, etc.)
Collection starts with raw data and produces digital
artifacts suitable for machine processing.
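As a minimal sketch of that step, the snippet below turns one raw log line into a structured record. The log layout, the field names, and the `parse_log_line` helper are assumptions made for illustration, not a format these slides prescribe.

```python
import json
import re
from datetime import datetime

# Hypothetical raw log layout: "<ISO timestamp> <LEVEL> <free-text message>"
LOG_PATTERN = re.compile(r"^(?P<ts>\S+)\s+(?P<level>[A-Z]+)\s+(?P<message>.*)$")


def parse_log_line(line):
    """Turn one raw log line into a dict, or return None if it cannot be parsed."""
    match = LOG_PATTERN.match(line.strip())
    if match is None:
        return None
    try:
        timestamp = datetime.fromisoformat(match.group("ts"))
    except ValueError:
        # The first token is not a valid ISO timestamp.
        return None
    return {
        "timestamp": timestamp.isoformat(),
        "level": match.group("level"),
        "message": match.group("message"),
    }


raw = "2014-03-01T12:00:05 ERROR disk /dev/sda1 is 95% full"
record = parse_log_line(raw)
print(json.dumps(record))
```

Lines that do not fit the assumed layout come back as `None`, so a collector can count or quarantine them instead of silently dropping them.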
7. Data Path / Collection
Wide variety of components and technologies:
Flat files (CSV, etc.) and binary formats (Avro, etc.) on a typical file
system
Cluster-specific file systems
RDBMS/SQL, NoSQL, NewSQL, MPP DBs, Graph Databases,
Document Databases
Column Stores
Key-Value Stores
Time Series Stores
Streaming and transformation engines
8. Data Path / Processing
Different processing paradigms:
Batch Processing
Real-time Processing
Multiple expected outcomes:
Data
Action
Different destinations:
Data stores
Data-driven Control Planes
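The contrast above can be sketched in a few lines: the batch function sees the whole data set and returns data, while the streaming monitor sees one reading at a time and returns an action. The `StreamingMonitor` class, the window size, and the threshold are hypothetical names and values chosen for illustration only.

```python
from collections import deque


def batch_average(readings):
    """Batch: the whole data set is available; the outcome is data."""
    return sum(readings) / len(readings)


class StreamingMonitor:
    """Real-time: readings arrive one at a time; the outcome is an action."""

    def __init__(self, window, threshold):
        self.recent = deque(maxlen=window)  # keep only the last `window` readings
        self.threshold = threshold

    def observe(self, value):
        """Add one reading; return "alert" when the windowed average is too high."""
        self.recent.append(value)
        average = sum(self.recent) / len(self.recent)
        return "alert" if average > self.threshold else "ok"


readings = [10, 12, 11, 40, 45, 50]
result = batch_average(readings)  # one result, once all data is in
print(result)

monitor = StreamingMonitor(window=3, threshold=30)
actions = [monitor.observe(r) for r in readings]  # one decision per reading
print(actions)
```

The batch result would typically be written to a data store, while the per-reading decisions are the kind of output a data-driven control plane would consume.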
9. Data Path / Processing
Smaller number of technologies:
Map / Reduce (Hadoop, CouchDB, MongoDB, Riak)
Cluster Computing (PVM, MPI, LAM, OpenMP, etc.)
HPC / Supercomputing
Data parallelism is the key!
Data locality is important!
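A minimal single-machine sketch of the Map / Reduce idea, assuming a word-count task: each partition is mapped independently (data parallelism) and only the small partial results are merged. The partitions and function names are hypothetical, and the thread pool merely stands in for cluster nodes that would each process the partition they already hold (data locality).

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def map_partition(lines):
    """Map step: count words inside a single partition only."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts


def merge_counts(left, right):
    """Reduce step: merge two partial results."""
    return left + right


# Hypothetical partitions; in a cluster each would live on a different node.
partitions = [
    ["big data is big", "data is data"],
    ["collect relevant data"],
]

# Each partition is mapped independently of the others.
with ThreadPoolExecutor() as pool:
    partials = list(pool.map(map_partition, partitions))

# Only the partial counts, not the raw lines, are combined.
totals = reduce(merge_counts, partials, Counter())
print(totals)
```

Because the map step never looks outside its own partition, adding partitions and workers scales the job without changing the code.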
10. Data Path / Processing
The importance of M/R
Self-hosted solutions:
Apache Hadoop
Cloudera, Hortonworks, etc.
Cloud-based solutions:
AWS EMR (+Data Pipeline, +Kinesis, +S3, +DynamoDB)
Joyent Manta
… many others …
11. Data Path / Query
Processing creates digital artifacts
Extremely high variety of technologies, components, and
services to deal with those artifacts:
SQL interfaces on top of NoSQL stores
NoSQL to NoSQL
NoSQL to RDBMS
Output to 3rd party API services
Output to proprietary interfaces
… a lot more …
19. Pieces of advice …
Collect relevant data!
Collecting data for data’s sake only costs money …
Use the processing technology that best matches your
business case!
Hadoop is pointless if your clients only want fast
geospatial searches …
Consume wisely!
Knowing that 100% of X is Y means nothing when there
is only one X …
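One way to make the last point concrete is to report an interval alongside the percentage. The sketch below uses the Wilson score interval, a technique chosen here for illustration rather than one these slides name; `wilson_interval` is a hypothetical helper and `z=1.96` assumes a 95% confidence level.

```python
import math


def wilson_interval(successes, total, z=1.96):
    """Approximate 95% Wilson score interval for an observed proportion."""
    if total == 0:
        raise ValueError("no observations")
    p = successes / total
    denom = 1 + z * z / total
    centre = p + z * z / (2 * total)
    margin = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total))
    return (centre - margin) / denom, (centre + margin) / denom


# "100% of X is Y" with a single X versus with a thousand of them.
low_one, high_one = wilson_interval(1, 1)
low_many, high_many = wilson_interval(1000, 1000)
print(round(low_one, 2), round(high_one, 2))
print(round(low_many, 3), round(high_many, 3))
```

With one observation the plausible range runs from roughly 0.21 to 1.0, so the "100%" carries almost no information; with a thousand observations the lower bound stays above 0.99.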
Big Data = data sets with sizes beyond the ability of commonly used software tools to capture, curate, manage, and process within a tolerable elapsed time.
Although Gartner’s definition (the 3Vs) is still widely used, the growing maturity of the concept supports a sounder distinction between big data and Business Intelligence with regard to data and their use:[18]
Business Intelligence uses descriptive statistics on data with high information density to measure things, detect trends, etc.;
Big data uses inductive statistics and concepts from nonlinear system identification[19] to infer laws (regressions, nonlinear relationships, and causal effects) from large sets of data with low information density[20], in order to reveal relationships and dependencies and to predict outcomes and behaviors.[19][21]
Big data can also be defined as a large volume of unstructured data that cannot be handled by standard database management systems such as DBMS, RDBMS, or ORDBMS.
Two distinct processing paradigms that drive different technologies
Why one? Why the other?
Use cases …