Survey of Big Data
Infrastructures
Malina Kirn, PhD
Principal Technical Architect, Cox Automotive
Co-Founder, Ithryn
VT Code Camp
2018-09-15
Shameless Plug
My team at Dealer.com is hiring! If you’re interested in working with big
data, email me: malina.kirn@coxautoinc.com
There are several talks from Dealer.com today:
Exploring User Mental Models of Data Sharing, Amy Chess
Maintaining Investment Grade Technical Debt, Peter Vile
Pushing Data to the Cloud and Serving it Up Fast!, Rama Kocherlakota
What is “Big Data” anyway?
• By “Big Data”, I mean any data
sample that requires more than one
“node” to process it.
• What is a “node” anyway?
• I’m using “node” to mean any piece of
infrastructure (virtualized or physical)
that runs its own Operating System.
• Defining “Big Data” by data volume is
a hopeless exercise.
• I’ve worked with up to O(10 PB) and as
little as O(100 GB) and needed to
process these volumes across multiple
nodes.
Big Data Infrastructures
• Distributed Relational Databases
• NoSQL Databases
• ETL: Extract, Transform, Load
• HTC: High Throughput Computing
• HPC: High Performance Computing
• Hybrid High Throughput/Performance Systems
Relational Databases
Persistent data store that represents data using a relational data model, provides
SQL semantics for processing data, and provides transactional guarantees.
Relational data
User
display_name    email               id
Malina Kirn     m.a.kirn@gmail.com  123456
Jane Doe        doe@email.dom       987654

Event
creator_id  description    id
987654      VT Code Camp   1
987654      BTV JS Meetup  2

EventGuest
event_id  user_id  rsvp
1         987654   Yes
2         123456   Maybe
1         987654   Yes
Image Source: http://tech.favoritemedium.com/2011/06/using-rabl-in-rails-json-web-api.html
SQL semantics
select User.id, User.display_name, count(*)
from User
join EventGuest on EventGuest.user_id = User.id
where EventGuest.rsvp = 'Yes'
group by User.id, User.display_name
Distributed Relational Databases
• Relational databases began as single-node applications for storing, representing,
and processing data
• Distributed (multi-node) relational databases were created to accommodate the
growth of data
• Distributed databases are generally less performant than their single node
counterparts
• Because SQL is declarative, distributed relational databases are one of the easiest
ways to store and process big data
• Used in almost every industry and academic domain for a wide variety of
purposes
• Many popular distributed relational databases exist: Oracle, Vertica, Snowflake,
Spanner, …
• They’re generally expensive and not open source
NoSQL Databases
Multi-node, persistent data store that provides fast access to data using simple
access patterns.
NoSQL Databases
• NoSQL databases typically do not provide one or more of:
• Relational data (especially constraint enforcement)
• SQL semantics
• Transactional guarantees (often immediate consistency is sacrificed for eventual consistency)
• Distributed relational databases often cannot meet high performance
requirements, such as O(100 ms) query response time
• Frequently used by businesses in client-facing applications to perform fast
searches of large data volumes
• Many popular NoSQL databases exist: Bigtable, Elasticsearch, Cassandra,
DynamoDB, …
• NoSQL implementations vary widely in terms of available features
• Some are free and open source; others are available only as paid licenses or
managed services
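
The "simple access patterns" above usually boil down to key-based lookups. A minimal, hypothetical Python sketch of the get/put interface most NoSQL stores expose (the class and keys are illustrative, not any real store's API):

```python
# Hypothetical sketch of the key/value access pattern most NoSQL stores
# expose: values are fetched by key, not by arbitrary relational queries.
class ToyKeyValueStore:
    """An in-memory stand-in for a NoSQL store's get/put interface."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        # A real store hashes the key to pick which node holds the value.
        self._data[key] = value

    def get(self, key, default=None):
        # Fast lookup by key; there is no "join" or "where" clause.
        return self._data.get(key, default)


store = ToyKeyValueStore()
store.put("user:123456", {"display_name": "Malina Kirn"})
print(store.get("user:123456"))  # {'display_name': 'Malina Kirn'}
```

Trading away relational queries for this narrow interface is what lets these stores shard data across nodes and answer lookups in O(100 ms) or less.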
Indexing and sorting
Indexing Sorting
To achieve high performance, most NoSQL databases utilize one or more
indexes and typically rely heavily on ordering data
Image Sources: https://www.lucassaldanha.com/elasticsearch-101/ ; https://www.elastic.co/blog/index-sorting-elasticsearch-6-0
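
The indexing half of the slide above can be sketched in a few lines: an inverted index maps each term to the set of record ids containing it, so a search touches only the matching ids rather than scanning every record. A minimal sketch with hypothetical data:

```python
from collections import defaultdict

# Build an inverted index: term -> set of document ids containing the term.
docs = {
    1: "VT Code Camp",
    2: "BTV JS Meetup",
}

index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.lower().split():
        index[term].add(doc_id)

# A lookup now touches only the ids in the term's posting set,
# instead of scanning every document.
print(sorted(index["code"]))    # [1]
print(sorted(index["meetup"]))  # [2]
```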
ETL: Extract, Transform, Load
Systems that extract data from multiple sources, process the data to transform it,
then load the data into a data store for persistent storage.
ETL
• ETL doesn’t have to be multi-node, but usually is
• Frequently used by businesses to transform data in a representation
that’s relevant for one product into a representation that’s relevant
for a different product
• Used in academics to process data collected in raw formats into
formats that can be used to support or refute scientific hypotheses
• Most ETL implementations are open source, but many commercial
offerings and managed services are available
• Batch vs streaming
Batch ETL
• Batch ETL processes have a start
and stop and process N>>1 rows
of data; runtimes can be anywhere
from a few minutes to a few days
• Batch ETL is scheduled or triggered
by orchestrators; batch ETL
execution is often chained
• Orchestrators: Airflow, Azkaban,
Oozie, Step Functions, …
• ETL engines: Spark, Talend,
Informatica, Pentaho, SAS, …
Image Source: http://www.pachyderm.io/use_cases.html
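
The extract → transform → load steps of a batch run can be sketched with the standard library alone; the CSV data and the transform below are hypothetical stand-ins for a real source and engine:

```python
import csv
import io

# Hypothetical batch ETL run: extract raw rows, transform them,
# load the result into a destination "table" (here, a plain list).

raw_csv = "display_name,id\nMalina Kirn,123456\nJane Doe,987654\n"

def extract(source):
    """Extract: read all N>>1 rows from the source (a CSV here)."""
    return list(csv.DictReader(io.StringIO(source)))

def transform(rows):
    """Transform: derive first_name from display_name."""
    for row in rows:
        row["first_name"] = row["display_name"].split()[0]
    return rows

def load(rows, destination):
    """Load: append the transformed rows to the destination store."""
    destination.extend(rows)
    return destination

warehouse = load(transform(extract(raw_csv)), [])
print(warehouse[0]["first_name"])  # Malina
```

Note the job has a clear start and stop; an orchestrator would schedule this whole pipeline, then trigger downstream jobs that depend on `warehouse`.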
Streaming ETL
• Streaming ETL processes run
continuously and process
N ≈ 1 rows of data as the
data arrives
• Typically utilize a combo of
pseudo-persistent data stores
(distributed queues) and
streaming ETL engines
• Distributed queues: Kafka,
Kinesis, Flink, *MQ
• ETL engines: Spark, Kafka
SQL, Kinesis data analysis,
Flink SQL
Image Source: https://www.confluent.io/blog/apache-kafka-for-service-architectures/
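
In contrast to a batch run, a streaming job is a loop that transforms each record the moment it arrives from a queue. A hypothetical sketch using a plain Python queue as a stand-in for Kafka or Kinesis:

```python
import queue

# Stand-in for a distributed queue (Kafka, Kinesis, ...): records arrive
# one at a time and are transformed immediately; there is no defined "end".
events = queue.Queue()
for name in ["Malina Kirn", "Jane Doe"]:
    events.put({"display_name": name})
events.put(None)  # sentinel so this sketch terminates; real streams don't

results = []
while True:
    record = events.get()
    if record is None:
        break
    # Transform each record as it arrives (N ≈ 1 rows at a time).
    record["first_name"] = record["display_name"].split()[0]
    results.append(record)

print([r["first_name"] for r in results])  # ['Malina', 'Jane']
```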
HTC: High Throughput Computing
Multi-node systems designed to process large volumes of data in parallel with little
or no interactions between nodes.
High Throughput Computing
• AKA embarrassingly parallel computing, grid computing, high latency computing
• Memory is not shared; data shuffling operations require network transfers
• Code that processes data is typically sent to the nodes hosting the data using job
schedulers and managers
• Many open source job schedulers and managers are available: Condor, Globus,
PBS, Sun/Oracle Grid Engine, Oozie, …
• Frequently used in Physics and Engineering research for statistical analysis
• Most academic HTC systems are custom-built using open source tools and are
moderately complicated to use; some work on heterogeneous nodes
• Hadoop (open source) is the most popular HTC implementation and is typically
used in academia and business for ETL use cases
• HTC systems are gradually being replaced by batch ETL systems
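
The "little or no interaction between nodes" property is what makes HTC embarrassingly parallel: each piece of data can be processed independently. A single-machine sketch using a process pool as a stand-in for a multi-node job scheduler (the data and function are hypothetical):

```python
from concurrent.futures import ProcessPoolExecutor

def first_name(display_name):
    # Each task is independent: no shared memory and no cross-task
    # messages, which is exactly what lets an HTC scheduler farm
    # tasks out to whichever nodes host the data.
    return display_name.split()[0]

names = ["Malina Kirn", "Jane Doe", "Jane Doe"]

if __name__ == "__main__":
    with ProcessPoolExecutor() as pool:
        # map() here stands in for a scheduler dispatching work to nodes.
        print(list(pool.map(first_name, names)))
```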
MapReduce
display_name id
Malina Kirn 123456
Jane Doe 987654
Jane Doe 135791
Map
display_name id first_name
Malina Kirn 123456 Malina
Jane Doe 987654 Jane
Jane Doe 135791 Jane
data['first_name'] =
data['display_name'].map(lambda x: x.split()[0])
display_name id
Malina Kirn 123456
Jane Doe 987654
Jane Doe 135791
Reduce
display_name count
Malina Kirn 1
Jane Doe 2
data.groupby('display_name').count()
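
The map and reduce steps above can be combined into one runnable sketch; `rows` here is plain Python standing in for a real distributed dataset:

```python
from collections import Counter

rows = [
    {"display_name": "Malina Kirn", "id": 123456},
    {"display_name": "Jane Doe", "id": 987654},
    {"display_name": "Jane Doe", "id": 135791},
]

# Map step: derive first_name from each row independently; in a real
# MapReduce system every node runs this over its local rows.
mapped = [
    {**row, "first_name": row["display_name"].split()[0]} for row in rows
]

# Reduce step: group by display_name and count; a real system first
# shuffles all rows with the same key to the same reducer node.
counts = Counter(row["display_name"] for row in mapped)
print(counts["Jane Doe"])     # 2
print(counts["Malina Kirn"])  # 1
```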
HPC: High Performance
Computing
Systems designed to process data with many interactions on single nodes and
between nodes.
High Performance Computing
• AKA parallel computing, supercomputing, big algorithms/compute, shared
memory computing, distributed memory computing
• Performance is measured in terms of FLOPS (floating point operations per
second)
• Frequently used in research for a variety of purposes such as nuclear dynamics,
protein folding, and weather modeling; often by solving systems of differential
equations
• Algorithms are typically written exclusively for the HPC system upon which they
run
• HPC systems are custom-built, extremely expensive, and are exceptionally
complicated to maintain; some require their own power plants
• Cray does offer “Supercomputing as a Service” and some supercomputers allow
time rental
Shared vs Distributed Memory
Image Source: “Introduction to High Performance Computing”, Jay Boisseau
Shared Memory Distributed Memory
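
The two diagrams can be caricatured in code: shared memory means workers reading one address space, while distributed memory means workers exchanging explicit messages. A minimal standard-library sketch (the values are hypothetical):

```python
import threading
from multiprocessing import Pipe, Process

# Shared memory: threads see the same list; no data is copied.
shared = [0]

def bump():
    shared[0] += 1

t = threading.Thread(target=bump)
t.start()
t.join()
print(shared[0])  # 1

# Distributed memory: data must be sent explicitly between processes,
# analogous to network transfers between HPC nodes.
def worker(conn):
    conn.send(conn.recv() + 1)
    conn.close()

if __name__ == "__main__":
    parent, child = Pipe()
    p = Process(target=worker, args=(child,))
    p.start()
    parent.send(41)
    print(parent.recv())  # 42
    p.join()
```

On real HPC systems the message-passing half is typically written against MPI rather than `multiprocessing`, but the shape of the communication is the same.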
Hybrid HTC/HPC
Multi-node systems designed to process large volumes of data with many
interactions on single nodes and limited interactions between nodes.
Hybrid HTC/HPC
• AKA Beowulf Cluster, distributed shared memory computing
• Typically network-distributed Graphics Processing Units (GPUs) on commodity hardware
• A single CPU core executes only one computation at a time and modern CPUs have only a handful of
cores
• A single GPU simultaneously executes many computations, enabling a shared memory architecture for
parallel processes executed by the same GPU
• Shared memory GPU used on a single node, data transferred between nodes via network layer
• Frequently used in business for a variety of purposes such as image classification, self-
driving cars, voice recognition, advertising, and financial modeling; often using deep
learning algorithms
• Hybrid systems are typically custom-built using open source tools; however “Machine
Learning as a service” is becoming widely available, some of which leverage hybrid
infrastructure
• TensorFlow (open source) is the most popular implementation and some ML-as-a-service
offerings support TensorFlow
Deep Learning on GPUs
Image Source: “Unsupervised Learning of Hierarchical Representations with Convolutional Deep Belief Networks”
ICML 2009 & Comm. ACM 2011. Lee, Grosse, Ranganath, Ng.
Summary
• Many big data infrastructures are available
• Distributed Relational Databases
• NoSQL Databases
• ETL: Extract, Transform, Load
• HTC: High Throughput Computing
• HPC: High Performance Computing
• Hybrid High Throughput/Performance Systems
• Choosing the right system is a function of
• What you want to do with it
• How much you want to spend (in licensing or in operations)
