Survey of Big Data
Infrastructures
Malina Kirn, PhD
Principal Technical Architect, Cox Automotive
Co-Founder, Ithryn
VT Code Camp
2018-09-15
Shameless Plug
My team at Dealer.com is hiring! If you’re interested in working with big
data, email me: malina.kirn@coxautoinc.com
There are several talks from Dealer.com today:
Exploring User Mental Models of Data Sharing, Amy Chess
Maintaining Investment Grade Technical Debt, Peter Vile
Pushing Data to the Cloud and Serving it Up Fast!, Rama Kocherlakota
What is “Big Data” anyway?
• By “Big Data”, I mean any data
sample that requires more than one
“node” to process it.
• What is a “node” anyway?
• I’m using “node” to mean any piece of
infrastructure (virtualized or physical)
that runs its own Operating System.
• Defining “Big Data” by data volume is
a hopeless exercise.
• I’ve worked with up to O(10 PB) and as
little as O(100 GB) and needed to
process these volumes across multiple
nodes.
Big Data Infrastructures
• Distributed Relational Databases
• NoSQL Databases
• ETL: Extract, Transform, Load
• HTC: High Throughput Computing
• HPC: High Performance Computing
• Hybrid High Throughput/Performance Systems
Relational Databases
Persistent data store that represents data using a relational data model, provides
SQL semantics for processing data, and provides transactional guarantees.
Relational data
User
display_name    email               id
Malina Kirn     m.a.kirn@gmail.com  123456
Jane Doe        doe@email.dom       987654

Event
creator_id  description    id
987654      VT Code Camp   1
987654      BTV JS Meetup  2

EventGuest
event_id  user_id  rsvp
1         987654   Yes
2         123456   Maybe
1         987654   Yes
Image Source: http://tech.favoritemedium.com/2011/06/using-rabl-in-rails-json-web-api.html
SQL semantics
select User.id, User.display_name, count(*)
from User
join EventGuest on EventGuest.user_id = User.id
where EventGuest.rsvp = 'Yes'
group by User.id, User.display_name
Distributed Relational Databases
• Relational databases began as single-node applications for storing, representing,
and processing data
• Distributed (multi-node) relational databases were created to accommodate the
growth of data
• Distributed databases are generally less performant than their single node
counterparts
• Because SQL is declarative, distributed relational databases are one of the easiest
ways to store and process big data
• Used in almost every industry and academic domain for a wide variety of
purposes
• Many popular distributed relational databases exist: Oracle, Vertica, Snowflake,
Spanner, …
• They’re generally expensive and not open source
NoSQL Databases
Multi-node, persistent data store that provides fast access to data using simple
access patterns.
NoSQL Databases
• NoSQL databases typically do not provide one or more of:
• Relational data (especially constraint enforcement)
• SQL semantics
• Transactional guarantees (often immediate consistency is sacrificed for eventual consistency)
• Distributed relational databases often cannot meet high performance
requirements, such as O(100 ms) query response time
• Frequently used by businesses in client-facing applications to perform fast
searches of large data volumes
• Many popular NoSQL databases exist: Bigtable, Elasticsearch, Cassandra,
DynamoDB, …
• NoSQL implementations vary widely in terms of available features
• Some are free and open source; others are available only as paid licenses or
managed services
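
The "simple access patterns" above usually boil down to key-based lookups. A minimal, hypothetical Python sketch of the get/put interface most NoSQL stores expose (the class and keys are illustrative, not any real store's API):

```python
# Hypothetical sketch of the key/value access pattern most NoSQL stores
# expose: values are fetched by key, not by arbitrary relational queries.
class ToyKeyValueStore:
    """An in-memory stand-in for a NoSQL store's get/put interface."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        # A real store hashes the key to pick which node holds the value.
        self._data[key] = value

    def get(self, key, default=None):
        # Fast lookup by key; there is no "join" or "where" clause.
        return self._data.get(key, default)


store = ToyKeyValueStore()
store.put("user:123456", {"display_name": "Malina Kirn"})
print(store.get("user:123456"))  # {'display_name': 'Malina Kirn'}
```

Trading away relational queries for this narrow interface is what lets these stores shard data across nodes and answer lookups in O(100 ms) or less.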
Indexing and sorting
Indexing Sorting
To achieve high performance, most NoSQL databases utilize one or more
indexes and typically rely heavily on ordering data
Image Sources: https://www.lucassaldanha.com/elasticsearch-101/ ; https://www.elastic.co/blog/index-sorting-elasticsearch-6-0
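
The indexing half of the slide above can be sketched in a few lines: an inverted index maps each term to the set of record ids containing it, so a search touches only the matching ids rather than scanning every record. A minimal sketch with hypothetical data:

```python
from collections import defaultdict

# Build an inverted index: term -> set of document ids containing the term.
docs = {
    1: "VT Code Camp",
    2: "BTV JS Meetup",
}

index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.lower().split():
        index[term].add(doc_id)

# A lookup now touches only the ids in the term's posting set,
# instead of scanning every document.
print(sorted(index["code"]))    # [1]
print(sorted(index["meetup"]))  # [2]
```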
ETL: Extract, Transform, Load
Systems that extract data from multiple sources, process the data to transform it,
then load the data into a data store for persistent storage.
ETL
• ETL doesn’t have to be multi-node, but usually is
• Frequently used by businesses to transform data in a representation
that’s relevant for one product into a representation that’s relevant
for a different product
• Used in academics to process data collected in raw formats into
formats that can be used to support or refute scientific hypotheses
• Most ETL implementations are open source, but many commercial
offerings and managed services are available
• Batch vs streaming
Batch ETL
• Batch ETL processes have a start
and stop and process N>>1 rows
of data; runtimes can be anywhere
from a few minutes to a few days
• Batch ETL is scheduled or triggered
by orchestrators; batch ETL
execution is often chained
• Orchestrators: Airflow, Azkaban,
Oozie, Step Functions, …
• ETL engines: Spark, Talend,
Informatica, Pentaho, SAS, …
Image Source: http://www.pachyderm.io/use_cases.html
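
The extract → transform → load steps of a batch run can be sketched with the standard library alone; the CSV data and the transform below are hypothetical stand-ins for a real source and engine:

```python
import csv
import io

# Hypothetical batch ETL run: extract raw rows, transform them,
# load the result into a destination "table" (here, a plain list).

raw_csv = "display_name,id\nMalina Kirn,123456\nJane Doe,987654\n"

def extract(source):
    """Extract: read all N>>1 rows from the source (a CSV here)."""
    return list(csv.DictReader(io.StringIO(source)))

def transform(rows):
    """Transform: derive first_name from display_name."""
    for row in rows:
        row["first_name"] = row["display_name"].split()[0]
    return rows

def load(rows, destination):
    """Load: append the transformed rows to the destination store."""
    destination.extend(rows)
    return destination

warehouse = load(transform(extract(raw_csv)), [])
print(warehouse[0]["first_name"])  # Malina
```

Note the job has a clear start and stop; an orchestrator would schedule this whole pipeline, then trigger downstream jobs that depend on `warehouse`.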
Streaming ETL
• Streaming ETL processes run
continuously and process
N ≈ 1 rows of data as the
data arrives
• Typically utilize a combo of
pseudo-persistent data stores
(distributed queues) and
streaming ETL engines
• Distributed queues: Kafka,
Kinesis, Flink, *MQ
• ETL engines: Spark, Kafka
SQL, Kinesis data analysis,
Flink SQL
Image Source: https://www.confluent.io/blog/apache-kafka-for-service-architectures/
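
In contrast to a batch run, a streaming job is a loop that transforms each record the moment it arrives from a queue. A hypothetical sketch using a plain Python queue as a stand-in for Kafka or Kinesis:

```python
import queue

# Stand-in for a distributed queue (Kafka, Kinesis, ...): records arrive
# one at a time and are transformed immediately; there is no defined "end".
events = queue.Queue()
for name in ["Malina Kirn", "Jane Doe"]:
    events.put({"display_name": name})
events.put(None)  # sentinel so this sketch terminates; real streams don't

results = []
while True:
    record = events.get()
    if record is None:
        break
    # Transform each record as it arrives (N ≈ 1 rows at a time).
    record["first_name"] = record["display_name"].split()[0]
    results.append(record)

print([r["first_name"] for r in results])  # ['Malina', 'Jane']
```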
HTC: High Throughput Computing
Multi-node systems designed to process large volumes of data in parallel with little
or no interactions between nodes.
High Throughput Computing
• AKA embarrassingly parallel computing, grid computing, high latency computing
• Memory is not shared; data shuffling operations require network transfers
• Code that processes data is typically sent to the nodes hosting the data using job
schedulers and managers
• Many open source job schedulers and managers are available: Condor, Globus,
PBS, Sun/Oracle Grid Engine, Oozie, …
• Frequently used in Physics and Engineering research for statistical analysis
• Most academic HTC systems are custom-built using open source tools and are
moderately complicated to use; some work on heterogeneous nodes
• Hadoop (open source) is the most popular HTC implementation and is typically
used in academia and business for ETL use cases
• HTC systems are gradually being replaced by batch ETL systems
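
The "little or no interaction between nodes" property is what makes HTC embarrassingly parallel: each piece of data can be processed independently. A single-machine sketch using a process pool as a stand-in for a multi-node job scheduler (the data and function are hypothetical):

```python
from concurrent.futures import ProcessPoolExecutor

def first_name(display_name):
    # Each task is independent: no shared memory and no cross-task
    # messages, which is exactly what lets an HTC scheduler farm
    # tasks out to whichever nodes host the data.
    return display_name.split()[0]

names = ["Malina Kirn", "Jane Doe", "Jane Doe"]

if __name__ == "__main__":
    with ProcessPoolExecutor() as pool:
        # map() here stands in for a scheduler dispatching work to nodes.
        print(list(pool.map(first_name, names)))
```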
MapReduce
display_name id
Malina Kirn 123456
Jane Doe 987654
Jane Doe 135791
Map
display_name id first_name
Malina Kirn 123456 Malina
Jane Doe 987654 Jane
Jane Doe 135791 Jane
data['first_name'] =
data['display_name'].map(lambda x: x.split()[0])
display_name id
Malina Kirn 123456
Jane Doe 987654
Jane Doe 135791
Reduce
display_name count
Malina Kirn 1
Jane Doe 2
data.groupby('display_name').count()
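
The map and reduce steps above can be combined into one runnable sketch; `rows` here is plain Python standing in for a real distributed dataset:

```python
from collections import Counter

rows = [
    {"display_name": "Malina Kirn", "id": 123456},
    {"display_name": "Jane Doe", "id": 987654},
    {"display_name": "Jane Doe", "id": 135791},
]

# Map step: derive first_name from each row independently; in a real
# MapReduce system every node runs this over its local rows.
mapped = [
    {**row, "first_name": row["display_name"].split()[0]} for row in rows
]

# Reduce step: group by display_name and count; a real system first
# shuffles all rows with the same key to the same reducer node.
counts = Counter(row["display_name"] for row in mapped)
print(counts["Jane Doe"])     # 2
print(counts["Malina Kirn"])  # 1
```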
HPC: High Performance
Computing
Systems designed to process data with many interactions on single nodes and
between nodes.
High Performance Computing
• AKA parallel computing, supercomputing, big algorithms/compute, shared
memory computing, distributed memory computing
• Performance is measured in terms of FLOPS (floating point operations per
second)
• Frequently used in research for a variety of purposes such as nuclear dynamics,
protein folding, and weather modeling; often by solving systems of differential
equations
• Algorithms are typically written exclusively for the HPC system upon which they
run
• HPC systems are custom-built, extremely expensive, and are exceptionally
complicated to maintain; some require their own power plants
• Cray does offer “Supercomputing as a Service” and some supercomputers allow
time rental
Shared vs Distributed Memory
Image Source: “Introduction to High Performance Computing”, Jay Boisseau
Shared Memory Distributed Memory
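
The two diagrams can be caricatured in code: shared memory means workers reading one address space, while distributed memory means workers exchanging explicit messages. A minimal standard-library sketch (the values are hypothetical):

```python
import threading
from multiprocessing import Pipe, Process

# Shared memory: threads see the same list; no data is copied.
shared = [0]

def bump():
    shared[0] += 1

t = threading.Thread(target=bump)
t.start()
t.join()
print(shared[0])  # 1

# Distributed memory: data must be sent explicitly between processes,
# analogous to network transfers between HPC nodes.
def worker(conn):
    conn.send(conn.recv() + 1)
    conn.close()

if __name__ == "__main__":
    parent, child = Pipe()
    p = Process(target=worker, args=(child,))
    p.start()
    parent.send(41)
    print(parent.recv())  # 42
    p.join()
```

On real HPC systems the message-passing half is typically written against MPI rather than `multiprocessing`, but the shape of the communication is the same.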
Hybrid HTC/HPC
Multi-node systems designed to process large volumes of data with many
interactions on single nodes and limited interactions between nodes.
Hybrid HTC/HPC
• AKA Beowulf Cluster, distributed shared memory computing
• Typically network-distributed Graphics Processing Units (GPUs) on commodity hardware
• A single CPU core executes only one computation at a time and modern CPUs have only a handful of
cores
• A single GPU simultaneously executes many computations, enabling a shared memory architecture for
parallel processes executed by the same GPU
• Shared memory GPU used on a single node, data transferred between nodes via network layer
• Frequently used in business for a variety of purposes such as image classification, self-
driving cars, voice recognition, advertising, and financial modeling; often using deep
learning algorithms
• Hybrid systems are typically custom-built using open source tools; however “Machine
Learning as a service” is becoming widely available, some of which leverage hybrid
infrastructure
• TensorFlow (open source) is the most popular implementation and some ML-as-a-service
offerings support TensorFlow
Deep Learning on GPUs
Image Source: “Unsupervised Learning of Hierarchical Representations with Convolutional Deep Belief Networks”
ICML 2009 & Comm. ACM 2011. Lee, Grosse, Ranganath, Ng.
Summary
• Many big data infrastructures are available
• Distributed Relational Databases
• NoSQL Databases
• ETL: Extract, Transform, Load
• HTC: High Throughput Computing
• HPC: High Performance Computing
• Hybrid High Throughput/Performance Systems
• Choosing the right system is a function of
• What you want to do with it
• How much you want to spend (in licensing or in operations)
