Simon Ambridge
Data Pipelines With Spark & DSE
An Introduction To Building Agile, Flexible and Scalable Big Data and Data Science Pipelines
Version 0.8
Certified Apache Cassandra and DataStax enthusiast who enjoys
explaining that the traditional approaches to data management just
don’t cut it anymore in the new always-on, no-single-point-of-failure,
high-volume, high-velocity, real-time distributed data management
world.
Previously 25 years implementing Oracle relational data management
solutions. Certified in Exadata, Oracle Cloud, Oracle Essbase, Oracle
Linux and OBIEE
simon.ambridge@datastax.com
@stratman1958
Simon Ambridge
Pre-Sales Solution Engineer, DataStax UK
Introduction To Big Data Pipelines
Big, Static Data
Fast, Streaming Data
Big Data Pipelining: Classification
Big Data Pipelines can mean different things to different people
Repeated analysis on a static but massive dataset
• An element of research – e.g. genomics, clinical trial,
demographic data
• Typically repetitive, iterative, shared amongst data
scientists for analysis
Real-time analytics on streaming data
• Industrialised or commercial processes – sensors, tick
data, bioinformatics, transactional data, real-time
personalisation
• Happening in real-time, data cannot be dropped or lost
Static Datasets
All You Can Eat?
Really.
Static Data Analytics: Traditional Tools
Repeated iterations at each stage; the run/debug cycle can be slow
Typical traditional ‘static’ data analysis model:
Data → Sampling → Modeling → Tuning → Interpretation → Reporting → Results (then re-sample and repeat)
Static Data Analytics: Scale-Up Challenges
Sampling and analysis often run on a single machine
• CPU and memory limitations
Only limited sampling of a large dataset is possible because of data size limits
• Multiple iterations over large datasets are frequently not an ideal approach
Static Data Analytics: Traditional Scaling
As data grows from MB to GB to TB, the traditional answer is bigger hardware: small datasets on small servers, large datasets on large servers
Static Data Analytics: Big Data Problems
Data is getting Really Big!
• Data volumes are getting larger!
• The number of data sources is exploding!
• More data is arriving faster!
Scaling up is becoming impractical
• Physical limits
• Data limits
• The validity of the analysis becomes obsolete, faster
Static Data Analytics: Big Data Needs
We need scalable infrastructure + distributed technologies
• Data volumes can be scaled
• Distribute the data across multiple low-cost machines
• Faster processing
• More complex processing
• No single point of failure
Static Data Analytics: DSE Delivers
Building a distributed data processing framework can be a complex task! It needs:
• Scalability
• Fast in-memory processing
• Replication for resiliency
• Batch and real-time data feeds
• Ad-hoc queries
DataStax delivers an integrated analytics platform
Cassandra: THE Web, IoT & Cloud Database
What is Apache Cassandra?
• Very fast
• Extremely resilient
• Across multiple data centres
• No single point of failure
• Continuous Availability, Disaster Avoidance
• Linear scale
• Easy to operate
Enterprise Cassandra platform from DataStax
DataStax
Enterprise
DataStax Enterprise: Editions
DataStax Enterprise Standard
• DSE Standard is DataStax’s entry-level
commercial database offering
• Represents the minimum recommended to
deploy Cassandra in a production environment
DataStax Enterprise Max
• DSE Max is DataStax’s advanced commercial
database offering
• Designed for production Cassandra
environments that have mixed workload
requirements
Spark: THE Analytics Engine
What is Apache Spark?
• Distributed in-memory analytic processing
• Batch and streaming analytics
• Fast - 10x-100x faster than Hadoop MapReduce
• Rich Scala, Java and Python APIs
Tightly integrated with DSE
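To give a flavour of the API, here is a minimal Scala sketch of a distributed word count, assuming a SparkContext `sc` as provided by the DSE Spark shell or a notebook, and a hypothetical input path:

```scala
// Minimal Spark RDD example: a distributed word count.
// Assumes an existing SparkContext `sc`; the input path is illustrative.
val lines = sc.textFile("/data/sample.txt")   // distributed read
val counts = lines
  .flatMap(_.split("\\s+"))                   // split lines into words
  .map(word => (word, 1))                     // pair each word with a count of 1
  .reduceByKey(_ + _)                         // sum the counts per word, in parallel
counts.take(10).foreach(println)              // bring a sample back to the driver
```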
Spark: Daytona GraySort Contest
The Daytona GraySort benchmark tests how fast a system can sort 100 TB
of data (1 trillion records)
• Previous world record held by Hadoop MapReduce cluster of 2100
nodes, in 72 minutes
• 2014: Spark completed the benchmark in 23 minutes on just 206 EC2
nodes = 3X faster using 10X fewer machines
• Spark sorted 1 PB (10 trillion records) on 190 machines in < 4 hours.
Previous Hadoop MapReduce time of 16 hours on 3800 machines = 4X
faster using 20X fewer machines
DataStax Enterprise: Analytics Integration
• Apache Cassandra for Distributed Persistent Storage
• Integrated Apache Spark for Distributed Real-Time Analytics
• Analytics nodes close to data - no ETL required
Separate Spark and Cassandra clusters joined by ETL:
• Loose integration
• Data separate from processing
• Millisecond response times
Spark integrated with the Cassandra cluster (DSE):
• Tight integration
• Data locality
• Microsecond response times
“Latency when transferring data is unavoidable. The trick is to reduce the latency to as close to zero as possible…”
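In code, the tight integration is visible through the DataStax Spark-Cassandra connector, which exposes Cassandra tables as Spark RDDs and evaluates filters close to the data. A minimal sketch (keyspace, table and column names are hypothetical):

```scala
import com.datastax.spark.connector._  // DataStax Spark-Cassandra connector

// Read a Cassandra table directly into a Spark RDD - no ETL step.
// Keyspace, table and column names here are illustrative only.
val readings = sc.cassandraTable("sensors", "raw_readings")
  .select("sensor_id", "reading_time", "value")
  .where("sensor_id = ?", "s-001")     // predicate pushed down to Cassandra

println(readings.count())
```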
Static Data Analytics: Requirements
Valid data pipeline analysis methods must be:
Auditable
• Reproducible
• Documented
Controlled
• Version control
Collaborative
• Accessible
Notebooks: Features
What are Notebooks?
• Drive your data analysis from the browser
• Highly interactive
• Tight integration with Apache Spark
• Handy tools for analysts:
• Reproducible visual analysis
• Code in Scala, CQL, SparkSQL, Python
• Charting – pie, bar, line, etc.
• Extensible with custom libraries
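To give a feel for it, a notebook cell might mix Scala and SparkSQL along these lines (a sketch: `sqlContext` is assumed to be provided by the notebook, and the data is invented):

```scala
// A typical notebook cell: turn a small RDD into a table and query it with SparkSQL.
case class Trade(symbol: String, price: Double, volume: Long)

val trades = sc.parallelize(Seq(
  Trade("ABC", 101.5, 2000),
  Trade("XYZ",  47.2, 1500)
))

import sqlContext.implicits._      // enables .toDF on RDDs of case classes
val df = trades.toDF()
df.registerTempTable("trades")     // Spark 1.x temp-table API

// The notebook renders results like this as a table or chart:
sqlContext.sql("SELECT symbol, avg(price) FROM trades GROUP BY symbol").show()
```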
Example: Spark Notebook
(Screenshot: a notebook page showing cells, markdown, output and interactive controls)
Static Data Analytics: Approach
Example architecture & requirements
1. Optimised source data format
2. Distributed in-memory analytics
3. Interactive and flexible data analysis tool
4. Persistent data store
5. Visualisation tools
Static Data Analytics: Example
Genome research platform - ADST (Agile Data Science Toolkit)
(Pipeline: ADAM source data → Notebook → Persistent storage / OLTP database → Visualisation)
Static Data Analytics: Pipeline Process Flow
1. Source data
2. Interactive, flexible and reproducible analysis
3. Persistent data storage
4. Visualise and analyse
Static Data Analytics: Pipeline Scalability
• Add more (physical or virtual) nodes as
required to add capacity
• Container tools ease configuration
management and deployment
• Scale out quickly
Static Data Analytics: Now
• No longer an iterative process constrained by hardware limitations
• Now a more scalable, resilient, dynamic, interactive process, easily shareable
The new model for large-scale static data analytics: Load → Analyse → Share, with processing scaled out and distributed
Real-Time Datasets
If it’s Not “Now”, Then It’s Probably Already Too Late
Big Data Pipelining: Why Real-Time?
In a highly connected world:
• React to customers faster and with more accuracy
• Reduce risk through a more accurate understanding of the market
• Optimise return on marketing investment
• Faster time to market
• Improve efficiency
In most cases ‘real-time’ means data changing at sub-second (<1s) intervals
Big Data Pipelining: Real-Time Analytics
What problem are we trying to solve?
• Capture, prepare, and process fast streaming data
• A different approach from traditional batch processing
• The speed of now – it cannot wait
• Immediate insight, instant decisions
Big Data Pipelining: Real-Time Use Cases
Use cases for streaming analytics:
• Sensor data (IoT)
• Transactional data
• User experience
• Social media
Big Data Analytics: Streams
Data streams? Data torrents? Data tidal waves!
Netflix:
• Ingests petabytes of data per day
• Over 1 TRILLION transactions per day (>10 million per second) into DSE
Big Data Pipelining: Real-Time Architecture
Streaming analytics architecture - what do we need?
• Analytics in real-time, at scale
• Fast processing: distributed, in-memory
• Scalable, distributed, resilient
Increasingly this means a technology stack comprising Kafka, Spark and Cassandra
Kafka: Architecture
How Does Kafka Work?
Kafka ‘de-couples’ producers and consumers in data pipelines:
‘Producers’ send messages to the Kafka cluster, which in turn serves them up to
‘Consumers’
• Kafka maintains feeds of messages in categories called topics
• A Kafka cluster comprises one or more servers called brokers
(Diagram: multiple Producers send messages into the Kafka Cluster, which serves them out to multiple Consumers)
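A minimal producer sketch in Scala against the standard Kafka client API (the broker address, topic and message are placeholders):

```scala
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

// Minimal Kafka producer: publish one keyed message to a topic.
// Broker address, topic name and payload are illustrative.
val props = new Properties()
props.put("bootstrap.servers", "broker1:9092")
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

val producer = new KafkaProducer[String, String](props)
producer.send(new ProducerRecord[String, String]("temperature", "sensor-42", "21.7"))
producer.close()
```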
Kafka: Streaming With Spark
Kafka writes, Spark reads
• Topics can have multiple partitions
• Each topic partition stored as a log (an ordered set of messages)
• Messages are simply byte arrays, so can store any object in any format
• Each message in a partition is assigned a unique offset
Spark consumes messages as a stream, in micro-batches, saved as RDDs
(Diagram: a Temperature topic with two partitions and a Rainfall topic with one; each partition is an ordered log of offset-numbered messages, read by Temperature and Rainfall consumers)
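On the consuming side, a hedged sketch using the Spark 1.x direct-stream API (broker list, topic name and batch interval are assumptions):

```scala
import kafka.serializer.StringDecoder
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

// Consume a Kafka topic as a stream of micro-batches.
// Assumes a SparkContext `sc`; broker and topic names are illustrative.
val ssc = new StreamingContext(sc, Seconds(5))          // 5-second micro-batches

val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")
val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, Set("temperature"))

// Each micro-batch arrives as an RDD of (key, value) pairs.
stream.foreachRDD(rdd => println(s"received ${rdd.count()} messages"))

ssc.start()
ssc.awaitTermination()
```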
DataStax Enterprise: Streaming Schematic
Streaming data ingest: Sensor Network → Signal Aggregation Services → Messaging Queue (Sensor Data Queue, with broker management) → Collection Service → Data Storage (OLTP persistence layer)
DataStax Enterprise: Streaming Analytics
Real-time analytics over persistent storage (the OLTP database), driving personalisation, actionable insight and monitoring through web, analytics and BI front ends
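Persisting such a stream into Cassandra is close to a one-liner with the connector's streaming support. A sketch, assuming the (key, value) DStream from the earlier Kafka example and a pre-existing, matching Cassandra table (names are hypothetical):

```scala
import com.datastax.spark.connector.streaming._  // adds saveToCassandra to DStreams
import org.apache.spark.streaming.dstream.DStream

// Case class mapped onto a Cassandra table; keyspace, table and columns illustrative.
case class Reading(sensorId: String, value: Double)

// `kafkaStream` is a (key, value) DStream such as the one built earlier.
def persist(kafkaStream: DStream[(String, String)]): Unit =
  kafkaStream
    .map { case (id, v) => Reading(id, v.toDouble) }
    .saveToCassandra("sensors", "readings")      // writes every micro-batch
```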
DataStax Enterprise: Multi-DC Uses
• Real-time active-active geo-replication across physical datacentres (e.g. DC: USA ↔ DC: EUROPE)
• Workload separation via virtual datacentres: an OLTP ring (Cassandra) replicating to an Analytics ring (Cassandra + Spark)
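Per-datacentre replication is declared on the keyspace. A hedged CQL sketch, executed here through the connector's session helper (the datacentre names and replica counts are examples only):

```scala
import com.datastax.spark.connector.cql.CassandraConnector

// Three replicas in the OLTP datacentre, two in the analytics datacentre.
// Keyspace and datacentre names are illustrative.
CassandraConnector(sc.getConf).withSessionDo { session =>
  session.execute(
    """CREATE KEYSPACE IF NOT EXISTS sensors
      |WITH replication = {
      |  'class': 'NetworkTopologyStrategy',
      |  'DC_OLTP': 3,
      |  'DC_Analytics': 2
      |}""".stripMargin)
}
```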
Real-Time Analytics: DSE Multi-DC
Workload Management and Separation With DSE
A mixed-load OLTP and analytics platform: 100% uptime, global scale
• The OLTP datacentre serves app, web, social media, IoT, personalisation and persistence workloads
• Replication feeds OLTP data to the analytics datacentre
• The analytics datacentre serves real-time analytics (actionable insight, monitoring) and analytics/BI tools over JDBC/ODBC
• Separation of OLTP from analytics
DSE & Analytics: Summary
Static, Massive Data
Scalable data pipelines:
1. Optimised data storage formats
2. Scalable, distributed technologies
3. Flexible and interactive analysis tools
4. Resilient, persistent storage
Real-Time Streaming Data
Scalable data pipelines:
1. Scalable, distributed technologies
2. De-coupled producers and consumers
3. Real-time analytics
4. Resilient, persistent storage
Technology stack: Spark, Mesos, Akka, Cassandra, Kafka
Thank you!
