Building Scalable Data Pipelines
Evan Chan
Who am I
Distinguished Engineer, Tuplejump
@evanfchan
http://velvia.github.io
User and contributor to Spark since 0.9
Co-creator and maintainer of Spark Job Server
TupleJump - Big Data Dev Partners
Instant Gratification
I want insights now
I want to act on news right away
I want stuff personalized for me (?)
Fast Data, not Big Data
How Fast do you Need to Act?
Financial trading - milliseconds
Dashboards - seconds to minutes
BI / Reports - hours to days?
What’s Your App?
Concurrent video viewers
Anomaly detection
Clickstream analysis
Live geospatial maps
Real-time trend detection & learning
Common Components
[Diagram labels: Message Queue, Events, Stream Processing Layer, State / Database, Happy Users]
Example: Real-time trend detection
Events: time, OS, location, asset/product ID
Analyze 1-5 second batches of new “hot” data in the stream processor
Combine with recent and historical top K feature vectors in the database
Update the recent feature vectors in the database
Serve to users
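The trend detection loop above can be sketched in a few lines. This is a minimal illustration rather than the pipeline from the talk: plain counts stand in for feature vectors, a Counter stands in for the database, and the names update_trends, recent_scores and the decay factor are assumptions made for the example.

```python
from collections import Counter

def update_trends(recent, batch, k=2, decay=0.5):
    """Fold one micro-batch of events into the recent scores and return the top K."""
    # Age existing scores so older activity counts for less (decay is an assumed knob).
    for asset in list(recent):
        recent[asset] *= decay
    # Score the new "hot" data from this batch.
    for event in batch:
        recent[event["asset_id"]] += 1
    return recent.most_common(k)

# Hypothetical state that would normally live in the database.
recent_scores = Counter({"video-1": 6, "video-2": 2})

# One short batch of events (time, OS, location, asset ID).
batch = (
    [{"time": 1000 + i, "os": "ios", "location": "SEA", "asset_id": "video-3"} for i in range(4)]
    + [{"time": 1004, "os": "android", "location": "PDX", "asset_id": "video-2"}]
)

top = update_trends(recent_scores, batch)
print(top)  # the new asset overtakes the older ones
```

The decay step is one simple way to let "recent" outweigh "historical"; a real system would pick its own weighting.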
Example 2: Smart Cities
Smart City Streaming Data
City buses - regular telemetry (position, velocity, timestamp)
Street sweepers - regular telemetry
Transactions from rail, subway, buses, smart cards
311 info
911 info - new emergencies
Citizens want to know…
Where and for how long can I park my car?
Are transportation options affected by 311 and 911 events?
How long will it take the next bus to get here?
Where is the closest bus to where I am?
Cities want to know…
How can I maximize parking revenue?
More granular updates to parking spots that don't need sweeping
How does traffic affect waiting times in public transit, and revenue?
Patterns in subway train times - is a breakdown coming?
Population movement - where should new transit routes be placed?
[Diagram labels: 311, 911, Buses, Metro, Message Queue, Stream Processing Layer, Event storage, Ad-Hoc, Short term telemetry, Models, Dashboard]
The HARD Principle
Highly Available, Resilient, Distributed
Flexibility - do as many transformations as possible with as few components as possible
Real-time: “NoETL”
Community: best of breed OSS projects with huge adoption and commercial support
Message Queue
[Diagram labels: Message Queue, Events, Stream Processing Layer, State / Database, Happy Users]
Why a message queue?
Centralized publish-subscribe of events
Need more processing? Add another consumer
Buffer traffic spikes
Replay events in case of failure
Message Queues help distribute data
[Diagram labels: key ranges A-F, G-M, N-S, T-Z; Input 1, Input 2, Input 3, Input 4; four Processing stages]
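One way to read the diagram above is that each event key is routed to one of four partitions by its first letter, so each processing stage sees only its own slice. The sketch below assumes exactly that range scheme; the function name partition_for and the sample keys are made up. Kafka's own default partitioner hashes the key instead of using letter ranges, so treat this as an illustration of the idea only.

```python
import bisect

# Last letter of each key range shown on the slide: A-F, G-M, N-S, T-Z.
RANGE_ENDS = ["F", "M", "S", "Z"]

def partition_for(key):
    """Return the index of the range that the key's first letter falls into."""
    first = key[0].upper()
    index = bisect.bisect_left(RANGE_ENDS, first)
    # Clamp anything that sorts after "Z" into the last partition.
    return min(index, len(RANGE_ENDS) - 1)

# Route some hypothetical event keys to their partitions.
partitions = [[] for _ in RANGE_ENDS]
for key in ["alice", "george", "nina", "tom", "bob", "zoe"]:
    partitions[partition_for(key)].append(key)

print(partitions)
```

Because every event for a given key lands in the same partition, one consumer can keep per-key state without coordinating with the others.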
Intro to Apache Kafka
Kafka is a distributed publish-subscribe system
It uses a commit log to track changes
Kafka was originally created at LinkedIn
Open sourced in 2011
Graduated to a top-level Apache project in 2012
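The commit log idea can be sketched as an append-only list that consumers read by offset. The classes CommitLog and Consumer below are a toy model written for this example, not Kafka's client API; they only show why independent consumers and replay fall out of the design.

```python
class CommitLog:
    """Append-only log; records are never modified once written."""

    def __init__(self):
        self._records = []

    def append(self, record):
        self._records.append(record)
        return len(self._records) - 1  # offset of the new record

    def read(self, offset, max_records=100):
        return self._records[offset:offset + max_records]


class Consumer:
    """Each consumer tracks its own position in the shared log."""

    def __init__(self, log):
        self.log = log
        self.offset = 0

    def poll(self):
        records = self.log.read(self.offset)
        self.offset += len(records)
        return records

    def seek(self, offset):
        self.offset = offset  # rewind to replay after a failure


log = CommitLog()
for event in ["bus-7 moved", "911 call", "bus-9 moved"]:
    log.append(event)

dashboard, archiver = Consumer(log), Consumer(log)
seen_by_dashboard = dashboard.poll()
seen_by_archiver = archiver.poll()  # a second consumer sees the same events

dashboard.seek(1)                   # pretend processing failed after offset 0
replayed = dashboard.poll()
print(replayed)
```

Adding another consumer does not disturb existing ones, since the log itself holds no per-reader state in this sketch.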
On being HARD
Many Big Data projects are open source implementations of closed source products
Unlike Hadoop, HBase or Cassandra, Kafka actually isn't a clone of an existing closed source product
The same codebase being used for years at LinkedIn answers the questions:
Does it scale?
Is it robust?
Ad Hoc ETL
Decoupled ETL
Avro Schemas And Schema Registry
Keys and values in Kafka are stored as byte arrays; serializers handle Strings and other types
Avro is a serialization format used extensively with Kafka and Big Data
The Confluent Schema Registry, a separate service used alongside Kafka, keeps track of Avro schemas
Verifies that the correct schemas are being used
Consumer Groups
Commit Logs
Kafka Resources
Official docs - https://kafka.apache.org/documentation.html
The Design section is a really good read
http://www.confluent.io/product
Includes schema registry
Stream Processing
[Diagram labels: Message Queue, Events, Stream Processing Layer, State / Database, Happy Users]
Types of Stream Processors
Event by Event: Apache Storm, Apache Flink, Intel Gearpump, Akka
Micro-batch: Apache Spark
Hybrid? Google Dataflow
Apache Storm and Flink
Transform one message at a time
Very low latency
State and more complex analytics difficult
Akka and Gearpump
Actor to actor messaging. Local state.
Used for extreme low latency (ad networks, etc.)
Dynamically reconfigurable topology
Configurable fault tolerance and failure recovery
Cluster or local mode - you don’t always need distribution!
Spark Streaming
Data processed as stream of micro batches
Higher latency (seconds), higher throughput, more complex analysis / ML possible
Same programming model as batch
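The micro-batch idea can be sketched without Spark: slice the event stream into fixed intervals and run ordinary batch code on each slice. The helper names micro_batches and word_count and the sample events are assumptions for illustration; this is not the Spark Streaming API.

```python
from collections import Counter, defaultdict

def word_count(lines):
    """Ordinary batch logic: count words in an iterable of lines."""
    return Counter(word for line in lines for word in line.split())

def micro_batches(events, batch_seconds):
    """Group (timestamp, line) events into consecutive fixed-length batches."""
    buckets = defaultdict(list)
    for timestamp, line in events:
        buckets[int(timestamp // batch_seconds)].append(line)
    # Intervals with no events are simply skipped in this sketch.
    return [buckets[i] for i in sorted(buckets)]

events = [(0.2, "bus late"), (0.9, "bus full"), (1.4, "train late"), (3.0, "bus late")]

# The same word_count function serves both the streaming and the batch path.
per_batch = [word_count(b) for b in micro_batches(events, batch_seconds=1)]
total = word_count(line for _, line in events)
print(per_batch)
```

Latency is bounded below by the batch interval, which is the trade-off the slide describes.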
Why Spark?
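Word count in Spark (Scala):
val file = spark.textFile("hdfs://...")  // spark is the SparkContext

val counts = file.flatMap(line => line.split(" "))  // split each line into words
    .map(word => (word, 1))                         // pair each word with a count of 1
    .reduceByKey(_ + _)                             // sum the counts per word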
The same word count as a Hadoop MapReduce job (Java):
package org.myorg;

import java.io.IOException;
import java.util.*;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCount {

    // Emits (word, 1) for every token in each input line.
    public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        @Override
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                context.write(word, one);
            }
        }
    }

    // Sums the counts emitted for each word.
    public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Job.getInstance replaces the deprecated Job constructor.
        Job job = Job.getInstance(conf, "wordcount");
        job.setJarByClass(WordCount.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);

        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        job.waitForCompletion(true);
    }

}
Spark Production Deployments
Explosion of Specialized Systems
Spark and Berkeley AMP Lab
Benefits of Unified Libraries
Optimizations can be shared between libraries
Core: Project Tungsten
MLlib: Shared statistics libraries
Spark Streaming: GC and memory management
Mix and match modules
Easily go from DataFrames (SQL) to MLlib / statistics, for example:
scala> import org.apache.spark.mllib.stat.Statistics
scala> val numMentions = df.select("NumMentions").map(row => row.getInt(0).toDouble)
numMentions: org.apache.spark.rdd.RDD[Double] = MapPartitionsRDD[100] at map at DataFrame.scala:848
scala> val numArticles = df.select("NumArticles").map(row => row.getInt(0).toDouble)
numArticles: org.apache.spark.rdd.RDD[Double] = MapPartitionsRDD[104] at map at DataFrame.scala:848
scala> val correlation = Statistics.corr(numMentions, numArticles, "pearson")
Spark Worker Failure
Rebuild RDD Partitions on Worker from Lineage
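Lineage-based recovery can be illustrated with a toy dataset that stores how each partition is derived rather than relying on the data itself surviving. The Dataset class below is a sketch written for this example (its method names are assumptions, not Spark's RDD API); dropping a cached partition stands in for a worker failure.

```python
class Dataset:
    """A toy immutable, partitioned dataset that remembers how it was derived."""

    def __init__(self, num_partitions, compute):
        self.num_partitions = num_partitions
        self._compute = compute  # partition index -> list of rows
        self._cache = {}         # partitions currently held "on a worker"

    @classmethod
    def from_source(cls, partitions):
        return cls(len(partitions), lambda i: list(partitions[i]))

    def map(self, fn):
        parent = self
        # The child records only its parent and the function: that is the lineage.
        return Dataset(self.num_partitions,
                       lambda i: [fn(x) for x in parent.partition(i)])

    def partition(self, i):
        if i not in self._cache:
            self._cache[i] = self._compute(i)  # recompute from lineage on demand
        return self._cache[i]

    def lose_partition(self, i):
        self._cache.pop(i, None)  # simulate losing the worker that held it


source = Dataset.from_source([[1, 2], [3, 4]])
doubled = source.map(lambda x: x * 2)

before = doubled.partition(1)
doubled.lose_partition(1)
after = doubled.partition(1)  # rebuilt from the parent, only this partition
print(before, after)
```

Only the lost partition is recomputed; the other partitions are untouched, which is why immutability matters here.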
Spark SQL & DataFrames
DataFrames & Catalyst Optimizer
Catalyst Optimizations
Column and partition pruning (Column filters)
Predicate pushdowns (Row filters)
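The two optimizations above amount to handing the data source a list of required columns and a row filter, so that less data leaves the source. The scan function and the sample rows below are hypothetical and only mimic the idea; they are not the Spark SQL Data Sources API.

```python
ROWS = [
    {"bus": "A", "speed": 30, "lat": 47.60, "lon": -122.33},
    {"bus": "B", "speed": 12, "lat": 47.61, "lon": -122.34},
    {"bus": "C", "speed": 45, "lat": 47.62, "lon": -122.35},
]

def scan(rows, required_columns, predicate=None):
    """Yield only the requested columns of the rows that pass the predicate."""
    for row in rows:
        # Predicate pushdown: the row filter runs at the source.
        if predicate is None or predicate(row):
            # Column pruning: unrequested columns are never returned.
            yield {col: row[col] for col in required_columns}

# Roughly: SELECT bus, speed FROM rows WHERE speed > 25
result = list(scan(ROWS, ["bus", "speed"], lambda r: r["speed"] > 25))
print(result)
```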
Spark SQL Data Sources API
Enables custom data sources to participate in SparkSQL = DataFrames + Catalyst
Production Impls
spark-csv (Databricks)
spark-avro (Databricks)
spark-cassandra-connector (DataStax)
elasticsearch-hadoop (Elastic.co)
Spark Streaming
Streaming Sources
Basic: Files, Akka actors, queues of RDDs, Socket
Advanced
Kafka
Kinesis
Flume
Twitter firehose
DStreams = micro-batches
Streaming Fault Tolerance
Incoming data is replicated to 1 other node
Write Ahead Log for sources that support ACKs
Checkpointing for recovery if Driver fails
Direct Kafka Streaming: KafkaRDD
No single Receiver
Parallelizable
No Write Ahead Log
Kafka *is* the Write Ahead Log!
KafkaRDD stores Kafka offsets
KafkaRDD partitions recover from offsets
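The reason no separate write ahead log is needed can be sketched as follows: if a batch is defined purely by offset ranges into a durable log, any lost piece can be re-read deterministically. OffsetRange and read_range here are illustrative names for this sketch, and the dictionary stands in for Kafka topic partitions.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class OffsetRange:
    partition: int
    start: int  # inclusive
    end: int    # exclusive

def read_range(log_partitions, offset_range):
    """Fetch exactly the records the range describes from the durable log."""
    records = log_partitions[offset_range.partition]
    return records[offset_range.start:offset_range.end]

# Stand-in for a topic with two partitions.
log_partitions = {0: ["e0", "e1", "e2", "e3"], 1: ["f0", "f1"]}

# A batch is just a list of offset ranges, one per log partition.
batch = [OffsetRange(0, 1, 3), OffsetRange(1, 0, 2)]

first_attempt = [read_range(log_partitions, r) for r in batch]
# After a failure, the same ranges are read again and give the same data.
retry = [read_range(log_partitions, r) for r in batch]
print(first_attempt)
```

Each range can be read by a different task, which is where the parallelism on the slide comes from.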

Spark MLlib & GraphX
Spark MLlib Common Algos
Classifiers: DecisionTree, RandomForest
Clustering: K-Means, Streaming K-Means
Collaborative Filtering: Alternating Least Squares (ALS)
Spark Text Processing Algos
TF-IDF
LDA
Word2Vec
*Pro-Tip: Use Stanford CoreNLP!
Spark ML Pipelines
Modeled after scikit-learn
Spark GraphX
PageRank: Top Influencers
Connected Components: Measure of clusters
Triangle Counting: Measure of cluster density
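Two of the graph measures above are small enough to sketch on a single machine. The functions below are plain-Python illustrations over a made-up edge list, not GraphX calls; GraphX runs distributed versions of the same ideas.

```python
from itertools import combinations

def connected_components(edges):
    """Union-find over an undirected edge list; returns a list of vertex sets."""
    parent = {}

    def find(v):
        parent.setdefault(v, v)
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving keeps trees shallow
            v = parent[v]
        return v

    for a, b in edges:
        parent[find(a)] = find(b)

    groups = {}
    for v in list(parent):
        groups.setdefault(find(v), set()).add(v)
    return list(groups.values())

def triangle_count(edges):
    """Count vertex triples in which all three pairs are connected."""
    neighbours = {}
    for a, b in edges:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
    return sum(
        1
        for a, b, c in combinations(sorted(neighbours), 3)
        if b in neighbours[a] and c in neighbours[a] and c in neighbours[b]
    )

# Hypothetical graph: one tight cluster of three, plus one separate pair.
edges = [(1, 2), (2, 3), (1, 3), (4, 5)]
components = connected_components(edges)
triangles = triangle_count(edges)
print(components, triangles)
```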
Handling State
[Diagram labels: Message Queue, Events, Stream Processing Layer, State / Database, Happy Users]
What Kind of State?
Non-persistent / in-memory: concurrent viewers
Short term: latest trends
Longer term: raw event & aggregate storage
ML Models, predictions, scored data
Spark RDDs
Immutable, cache in memory and/or on disk
Spark Streaming: updateStateByKey
IndexedRDD - can update bits of data
Snapshotting for recovery
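The per-key state update on this slide can be modelled as a function that takes the new values for a key plus its previous state and returns the next state. The helper below imitates those semantics in plain Python for the concurrent viewers case; update_state_by_key and viewers are names chosen for this sketch, and the event format is an assumption.

```python
def update_state_by_key(state, batch, update):
    """Return a new state dict after folding one batch of (key, value) pairs."""
    grouped = {}
    for key, value in batch:
        grouped.setdefault(key, []).append(value)

    new_state = dict(state)  # the old state is left untouched (immutable style)
    for key in set(state) | set(grouped):
        result = update(grouped.get(key, []), state.get(key))
        if result is None:
            new_state.pop(key, None)  # returning None drops the key
        else:
            new_state[key] = result
    return new_state

def viewers(new_values, old_count):
    """+1 when a viewer joins, -1 when one leaves; drop videos with no viewers."""
    total = (old_count or 0) + sum(new_values)
    return total if total > 0 else None

state0 = {}
state1 = update_state_by_key(state0, [("v1", 1), ("v1", 1), ("v2", 1)], viewers)
state2 = update_state_by_key(state1, [("v2", -1), ("v1", 1)], viewers)
print(state1, state2)
```

Keeping each state as a fresh dictionary is what makes periodic snapshots cheap to reason about: any earlier state can be saved and restored as is.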
• Massively Scalable
• High Performance
• Always On
• Masterless
Scale
Apache Cassandra
• Scales Linearly to as many nodes as you need
• Scales whenever you need
Performance
Apache Cassandra
• It’s Fast
• Built to sustain massive data insertion rates in irregular pattern spikes
Fault Tolerance & Availability
Apache Cassandra
• Automatic Replication
• Multi Datacenter
• Decentralized - no single point of failure
• Survive regional outages
• New nodes automatically add themselves to the cluster
• DataStax drivers automatically discover new nodes
Architecture
Apache Cassandra
• Distributed, Masterless Ring Architecture
• Network Topology Aware
• Flexible, Schemaless - your data structure can evolve seamlessly over time
To download:
https://cassandra.apache.org/download/
https://github.com/pcmanus/ccm
^ Highly recommended for local testing/cluster setup
Cassandra Data Modeling
Primary key = (partition keys, clustering keys)
Fast queries = fetch single partition
Range scans by clustering key
Must model for query patterns
[Diagram: grid with rows Partition 1, Partition 2, Partition 3 and columns Clustering 1, Clustering 2, Clustering 3]
City Bus Data Modeling Example
Primary key = (Bus UUID, timestamp)
Easy queries: location and speed of single bus for a range of time
Can also query most recent location + speed of all buses (slower)
[Diagram: grid with rows Bus A, Bus B, Bus C and columns 1020 s, 1010 s, 1000 s; each cell holds speed, GPS]
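The bus model above can be mimicked in memory to show why the two queries differ in cost: one partition per bus, with rows kept sorted by timestamp inside it. BusTable is a toy stand-in written for this example, not a Cassandra client, and the sample readings are invented.

```python
import bisect
from collections import defaultdict

class BusTable:
    """Partition key = bus id; clustering key = timestamp (rows sorted by it)."""

    def __init__(self):
        self._partitions = defaultdict(list)  # bus_id -> [(timestamp, speed, gps)]

    def write(self, bus_id, timestamp, speed, gps):
        rows = self._partitions[bus_id]
        i = bisect.bisect_left(rows, (timestamp,))
        if i < len(rows) and rows[i][0] == timestamp:
            rows[i] = (timestamp, speed, gps)  # same primary key: overwrite
        else:
            rows.insert(i, (timestamp, speed, gps))

    def range_scan(self, bus_id, start, end):
        """Fast path: one partition, contiguous slice by clustering key."""
        rows = self._partitions.get(bus_id, [])
        lo = bisect.bisect_left(rows, (start,))
        hi = bisect.bisect_left(rows, (end,))
        return rows[lo:hi]

    def latest_all(self):
        """Slower path: has to touch every partition."""
        return {bus: rows[-1] for bus, rows in self._partitions.items() if rows}


table = BusTable()
table.write("A", 1010, 28, (47.61, -122.33))
table.write("A", 1000, 30, (47.60, -122.33))
table.write("A", 1020, 25, (47.62, -122.33))
table.write("B", 1000, 12, (47.70, -122.30))

window = table.range_scan("A", 1000, 1020)  # end is exclusive in this sketch
latest = table.latest_all()
print(window, latest)
```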
Using Cassandra for Short Term Storage
The idea is to store and read small values
Idempotent writes + huge write capacity = ideal for streaming ingestion
For example, store last few (latest + last N) snapshots of buses, taxi locations, recent traffic info
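Idempotent writes matter because a stream processor may replay a batch after a failure. The sketch below keys every write by (vehicle, timestamp) and keeps only the last N snapshots, so applying the same batch twice leaves the same state. The function apply_writes, the keep_last parameter and the sample data are assumptions for illustration; a nested dictionary stands in for the table.

```python
import copy
from collections import defaultdict

def apply_writes(table, writes, keep_last=3):
    """Upsert snapshots keyed by (vehicle, timestamp), keeping the newest few."""
    for vehicle, timestamp, position in writes:
        table[vehicle][timestamp] = position  # keyed upsert, never an append
        # Trim everything older than the newest keep_last timestamps.
        for old in sorted(table[vehicle])[:-keep_last]:
            del table[vehicle][old]
    return table

table = defaultdict(dict)
writes = [
    ("bus-A", 1000, (47.60, -122.33)),
    ("bus-A", 1010, (47.61, -122.33)),
    ("bus-A", 1020, (47.62, -122.33)),
    ("bus-A", 1030, (47.63, -122.33)),
]

apply_writes(table, writes)
after_first = copy.deepcopy(table)

apply_writes(table, writes)  # replay the whole batch, as after a failure
after_replay = copy.deepcopy(table)
print(sorted(after_replay["bus-A"]))
```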
But Mommy! What about longer term data?
I need to read lots of data, fast!!
- Ad hoc analytics of events
- More specialized / geospatial
- Building ML models from large quantities of data
- Storing scored/classified data from models
- OLAP / Data Warehousing
Can Cassandra Handle Batch?
Cassandra tables are much better at lots of small reads than big data scans
You CAN store data efficiently in C*
Files seem easier for long term storage and analysis
But are files compatible with streaming?
Lambda Architecture
Lambda is Hard and Expensive
Very high TCO - Many moving parts - KV store, real time, batch
Lots of monitoring, operations, headache
Running similar code in two places
Lower performance - lots of shuffling data, network hops, translating domain objects
Reconcile queries against two different places
NoLambda
A unified system
Real-time processing and reprocessing
No ETLs
Fault tolerance
Everything is a stream
Can Cassandra do batch and ad-hoc?
Yes, it can actually be competitive with Hadoop…
If you know how to be creative with storing your data!
Tuplejump/SnackFS - HDFS for Cassandra
github.com/tuplejump/FiloDB - analytics database
Store your data using Protobuf / Avro / etc.
Introduction to FiloDB
Efficient columnar storage - 5-10x better
Scan speeds competitive with Parquet - 100x faster than regular Cassandra tables
Very fine grained filtering for sub-second concurrent queries
Easy BI and ad-hoc analysis via Spark SQL/DataFrames (JDBC etc.)
Uses Cassandra for robust, proven storage
Combining FiloDB + Cassandra
Regular Cassandra tables for highly concurrent, aggregate / key-value lookups (dashboards)
FiloDB + C* + Spark for efficient long term event storage
Ad hoc / SQL / BI
Data source for MLlib / building models
Data storage for classified / predicted / scored data
[Diagram labels: Message Queue, Events, Spark Streaming, Short term storage, K-V, Adhoc, SQL, ML, Cassandra, FiloDB: Events, ad-hoc, batch, Spark, Dashboards, maps]
[Diagram labels: Message Queue, Events, Spark Streaming, Models, Cassandra, FiloDB: Long term event storage, Spark, Learned Data]
FiloDB + Cassandra
Robust, peer to peer, proven storage platform
Use for short term snapshots, dashboards
Use for efficient long term event storage & ad hoc querying
Use as a source to build detailed models
Thank you!
@evanfchan
http://tuplejump.com

Building Scalable Data Pipelines - 2016 DataPalooza Seattle

The other Apache Technologies your Big Data solution needs
The other Apache Technologies your Big Data solution needsThe other Apache Technologies your Big Data solution needs
The other Apache Technologies your Big Data solution needs
gagravarr
 
Cloud lunch and learn real-time streaming in azure
Cloud lunch and learn real-time streaming in azureCloud lunch and learn real-time streaming in azure
Cloud lunch and learn real-time streaming in azure
Timothy Spann
 
Apache Spark Streaming
Apache Spark StreamingApache Spark Streaming
Apache Spark Streaming
Bartosz Jankiewicz
 
Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...
Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...
Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...
Helena Edelson
 
2014 sept 26_thug_lambda_part1
2014 sept 26_thug_lambda_part12014 sept 26_thug_lambda_part1
2014 sept 26_thug_lambda_part1
Adam Muise
 
Why apache Flink is the 4G of Big Data Analytics Frameworks
Why apache Flink is the 4G of Big Data Analytics FrameworksWhy apache Flink is the 4G of Big Data Analytics Frameworks
Why apache Flink is the 4G of Big Data Analytics Frameworks
Slim Baltagi
 
HBaseCon2017 Efficient and portable data processing with Apache Beam and HBase
HBaseCon2017 Efficient and portable data processing with Apache Beam and HBaseHBaseCon2017 Efficient and portable data processing with Apache Beam and HBase
HBaseCon2017 Efficient and portable data processing with Apache Beam and HBase
HBaseCon
 
Spark (Structured) Streaming vs. Kafka Streams
Spark (Structured) Streaming vs. Kafka StreamsSpark (Structured) Streaming vs. Kafka Streams
Spark (Structured) Streaming vs. Kafka Streams
Guido Schmutz
 
Current and Future of Apache Kafka
Current and Future of Apache KafkaCurrent and Future of Apache Kafka
Current and Future of Apache Kafka
Joe Stein
 
Kafka Streams for Java enthusiasts
Kafka Streams for Java enthusiastsKafka Streams for Java enthusiasts
Kafka Streams for Java enthusiasts
Slim Baltagi
 
20170126 big data processing
20170126 big data processing20170126 big data processing
20170126 big data processing
Vienna Data Science Group
 
Real time cloud native open source streaming of any data to apache solr
Real time cloud native open source streaming of any data to apache solrReal time cloud native open source streaming of any data to apache solr
Real time cloud native open source streaming of any data to apache solr
Timothy Spann
 
Kafka for data scientists
Kafka for data scientistsKafka for data scientists
Kafka for data scientists
Jenn Rawlins
 
Python Streaming Pipelines on Flink - Beam Meetup at Lyft 2019
Python Streaming Pipelines on Flink - Beam Meetup at Lyft 2019Python Streaming Pipelines on Flink - Beam Meetup at Lyft 2019
Python Streaming Pipelines on Flink - Beam Meetup at Lyft 2019
Thomas Weise
 
Flink history, roadmap and vision
Flink history, roadmap and visionFlink history, roadmap and vision
Flink history, roadmap and vision
Stephan Ewen
 
A Data Streaming Architecture with Apache Flink (berlin Buzzwords 2016)
A Data Streaming Architecture with Apache Flink (berlin Buzzwords 2016)A Data Streaming Architecture with Apache Flink (berlin Buzzwords 2016)
A Data Streaming Architecture with Apache Flink (berlin Buzzwords 2016)
Robert Metzger
 
Data Summer Conf 2018, “Building unified Batch and Stream processing pipeline...
Data Summer Conf 2018, “Building unified Batch and Stream processing pipeline...Data Summer Conf 2018, “Building unified Batch and Stream processing pipeline...
Data Summer Conf 2018, “Building unified Batch and Stream processing pipeline...
Provectus
 
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Guido Schmutz
 
Leverage Kafka to build a stream processing platform
Leverage Kafka to build a stream processing platformLeverage Kafka to build a stream processing platform
Leverage Kafka to build a stream processing platform
confluent
 
Koalas: Unifying Spark and pandas APIs
Koalas: Unifying Spark and pandas APIsKoalas: Unifying Spark and pandas APIs
Koalas: Unifying Spark and pandas APIs
Takuya UESHIN
 

Similar to Building Scalable Data Pipelines - 2016 DataPalooza Seattle (20)

The other Apache Technologies your Big Data solution needs
The other Apache Technologies your Big Data solution needsThe other Apache Technologies your Big Data solution needs
The other Apache Technologies your Big Data solution needs
 
Cloud lunch and learn real-time streaming in azure
Cloud lunch and learn real-time streaming in azureCloud lunch and learn real-time streaming in azure
Cloud lunch and learn real-time streaming in azure
 
Apache Spark Streaming
Apache Spark StreamingApache Spark Streaming
Apache Spark Streaming
 
Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...
Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...
Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...
 
2014 sept 26_thug_lambda_part1
2014 sept 26_thug_lambda_part12014 sept 26_thug_lambda_part1
2014 sept 26_thug_lambda_part1
 
Why apache Flink is the 4G of Big Data Analytics Frameworks
Why apache Flink is the 4G of Big Data Analytics FrameworksWhy apache Flink is the 4G of Big Data Analytics Frameworks
Why apache Flink is the 4G of Big Data Analytics Frameworks
 
HBaseCon2017 Efficient and portable data processing with Apache Beam and HBase
Building Scalable Data Pipelines - 2016 DataPalooza Seattle

  • 2. Who am I Distinguished Engineer, Tuplejump @evanfchan http://velvia.github.io User and contributor to Spark since 0.9 Co-creator and maintainer of Spark Job Server
  • 3. TupleJump - Big Data Dev Partners 3
  • 4. Instant Gratification I want insights now I want to act on news right away I want stuff personalized for me (?)
  • 6. How Fast do you Need to Act? Financial trading - milliseconds Dashboards - seconds to minutes BI / Reports - hours to days?
  • 7. What’s Your App? Concurrent video viewers Anomaly detection Clickstream analysis Live geospatial maps Real-time trend detection & learning
  • 9. Example: Real-time trend detection Events: time, OS, location, asset/product ID Analyze 1-5 second batches of new “hot” data in stream processor Combine with recent and historical top K feature vectors in database Update database recent feature vectors Serve to users
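The trend-detection loop on slide 9 can be sketched in plain Scala, without Spark or a database. This is only an illustration under stated assumptions: the `Event` fields mirror the slide, simple per-asset counts stand in for the "feature vectors", the stored state is an in-memory `Map`, and the decay factor is a made-up parameter.

```scala
object TrendSketch {
  // Hypothetical event shape, loosely following the fields named on the slide.
  final case class Event(timeMs: Long, os: String, location: String, assetId: String)

  // Count how often each asset appears in one micro-batch of "hot" events.
  def batchCounts(batch: Seq[Event]): Map[String, Long] =
    batch.groupBy(_.assetId).map { case (id, evs) => id -> evs.size.toLong }

  // Merge fresh counts into the stored counts, decaying the old ones so that
  // recent activity outweighs history. The decay factor is an assumption.
  def merge(stored: Map[String, Long], fresh: Map[String, Long], decay: Double = 0.5): Map[String, Long] =
    (stored.keySet ++ fresh.keySet).map { id =>
      val old = (stored.getOrElse(id, 0L) * decay).toLong
      id -> (old + fresh.getOrElse(id, 0L))
    }.toMap

  // Top K assets by count; ties are broken by asset ID to keep the output stable.
  def topK(counts: Map[String, Long], k: Int): Seq[(String, Long)] =
    counts.toSeq.sortBy { case (id, n) => (-n, id) }.take(k)
}
```

In a real pipeline the `stored` map would live in the database and `merge` would run once per 1-5 second batch; the result of `topK` is what gets served to users.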
  • 11. Smart City Streaming Data City buses - regular telemetry (position, velocity, timestamp) Street sweepers - regular telemetry Transactions from rail, subway, buses, smart cards 311 info 911 info - new emergencies
  • 12. Citizens want to know… Where and for how long can I park my car? Are transportation options affected by 311 and 911 events? How long will it take the next bus to get here? Where is the closest bus to where I am?
  • 13. Cities want to know… How can I maximize parking revenue? More granular updates to parking spots that don't need sweeping How does traffic affect waiting times in public transit, and revenue? Patterns in subway train times - is a breakdown coming? Population movement - where should new transit routes be placed?
  • 15. The HARD Principle Highly Available, Resilient, Distributed Flexibility - do as many transformations as possible with as few components as possible Real-time: “NoETL” Community: best of breed OSS projects with huge adoption and commercial support
  • 18. Why a message queue? Centralized publish-subscribe of events Need more processing? Add another consumer Buffer traffic spikes Replay events in cases of failure
  • 19. Message Queues help distribute data [Diagram: Inputs 1-4; key ranges A-F, G-M, N-S, T-Z; one Processing node per key range]
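A minimal sketch of the key-range routing that the slide 19 diagram suggests, assuming keys are routed by their first letter into the four ranges shown (A-F, G-M, N-S, T-Z). The object and method names are hypothetical. Note that Kafka's default partitioner hashes the key rather than using alphabetical ranges; the range scheme here simply follows the slide.

```scala
object KeyRangePartitioner {
  // Inclusive upper bounds of the four key ranges on the slide: A-F, G-M, N-S, T-Z.
  private val upperBounds = Vector('F', 'M', 'S', 'Z')

  // Index of the partition whose range contains the key's first letter,
  // or None for keys that do not start with a letter A-Z.
  def partitionFor(key: String): Option[Int] =
    key.headOption
      .map(_.toUpper)
      .filter(c => c >= 'A' && c <= 'Z')
      .map(c => upperBounds.indexWhere(bound => c <= bound))

  // Group keys by partition; each group could be handled by its own consumer.
  def route(keys: Seq[String]): Map[Int, Seq[String]] =
    keys.flatMap(k => partitionFor(k).map(p => p -> k))
      .groupBy(_._1)
      .map { case (p, pairs) => p -> pairs.map(_._2) }
}
```

Because all events for a given key land in the same partition, each processing node sees a consistent slice of the data and more nodes can be added by splitting ranges.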
  • 20. Intro to Apache Kafka Kafka is a distributed publish-subscribe system It uses a commit log to track changes Kafka was originally created at LinkedIn Open sourced in 2011 Graduated to a top-level Apache project in 2012
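The commit-log idea behind slides 18 and 20 can be illustrated with a toy in-memory log. This is a sketch of the concept only, not Kafka's API: `CommitLog` and `Consumer` are hypothetical names, there is a single partition, and nothing is persisted. It shows why adding a consumer or replaying after a failure is cheap: reads never remove records, and each consumer tracks its own offset.

```scala
import scala.collection.mutable.ArrayBuffer

// A toy append-only log for one partition. Records are never modified in place;
// each one is identified by its offset (its position in the log).
final class CommitLog[A] {
  private val records = ArrayBuffer.empty[A]

  // Appends a record and returns the offset it was written at.
  def append(record: A): Long = {
    records += record
    records.size - 1L
  }

  // Reads up to `max` records starting at `offset`, without removing them.
  def read(offset: Long, max: Int): Seq[A] =
    records.slice(offset.toInt, offset.toInt + max).toList
}

// Each consumer keeps its own position, so adding a consumer or rewinding
// one does not affect the others.
final class Consumer[A](log: CommitLog[A], start: Long = 0L) {
  private var position = start

  def poll(max: Int): Seq[A] = {
    val batch = log.read(position, max)
    position += batch.size
    batch
  }

  // Rewind (or skip ahead) to replay events, for example after a failure.
  def seek(offset: Long): Unit = { position = offset }
}
```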
  • 21. On being HARD Many Big Data projects are open source implementations of closed source products Unlike Hadoop, HBase or Cassandra, Kafka actually isn't a clone of an existing closed source product The same codebase being used for years at LinkedIn answers the questions: Does it scale? Is it robust?
  • 24. Avro Schemas And Schema Registry Keys and values in Kafka can be Strings or byte arrays Avro is a serialization format used extensively with Kafka and Big Data A Schema Registry (part of the Confluent Platform rather than Apache Kafka itself) keeps track of Avro schemas Verifies that the correct schemas are being used
  • 27. Kafka Resources Official docs - https://kafka.apache.org/documentation.html Design section is a really good read http://www.confluent.io/product Includes schema registry
  • 30. Types of Stream Processors Event by Event: Apache Storm, Apache Flink, Intel GearPump, Akka Micro-batch: Apache Spark Hybrid? Google Dataflow
  • 31. Apache Storm and Flink Transform one message at a time Very low latency State and more complex analytics difficult
  • 32. Akka and Gearpump Actor to actor messaging. Local state. Used for extreme low latency (ad networks, etc) Dynamically reconfigurable topology Configurable fault tolerance and failure recovery Cluster or local mode - you don’t always need distribution!
  • 33. Spark Streaming Data processed as stream of micro batches Higher latency (seconds), higher throughput, more complex analysis / ML possible Same programming model as batch
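To make "stream of micro batches" and "same programming model as batch" concrete, here is a plain-Scala sketch that does not use Spark. It cuts a sequence of events into fixed-interval batches and runs an ordinary batch word count on each. One simplifying assumption: batches are cut by the event's own timestamp (non-negative milliseconds), whereas a real receiver batches by arrival time. All names are hypothetical.

```scala
object MicroBatchSketch {
  final case class Event(timeMs: Long, word: String)

  // Assign every event to the batch that covers its timestamp.
  // Batch n covers the half-open interval [n * intervalMs, (n + 1) * intervalMs).
  def toBatches(events: Seq[Event], intervalMs: Long): Seq[(Long, Seq[Event])] =
    events.groupBy(_.timeMs / intervalMs).toSeq.sortBy(_._1)

  // The per-batch logic is ordinary batch code: here, a word count.
  def wordCount(batch: Seq[Event]): Map[String, Int] =
    batch.groupBy(_.word).map { case (w, es) => w -> es.size }

  // Run the batch logic once per interval, in time order.
  def run(events: Seq[Event], intervalMs: Long): Seq[(Long, Map[String, Int])] =
    toBatches(events, intervalMs).map { case (n, batch) => n -> wordCount(batch) }
}
```

The latency cost is visible in the structure: no result for a batch can be produced until its interval has closed, which is why micro-batching trades latency (seconds) for throughput.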
  • 34. Why Spark? Word count in Spark (Scala):

        val file = spark.textFile("hdfs://...")
        file.flatMap(line => line.split(" "))
          .map(word => (word, 1))
          .reduceByKey(_ + _)

      The same job in Hadoop MapReduce (Java):

        package org.myorg;

        import java.io.IOException;
        import java.util.*;

        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.conf.*;
        import org.apache.hadoop.io.*;
        import org.apache.hadoop.mapreduce.*;
        import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
        import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
        import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
        import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

        public class WordCount {

          // Mapper: emits (word, 1) for every token in the input line.
          public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
            private final static IntWritable one = new IntWritable(1);
            private Text word = new Text();

            public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
              String line = value.toString();
              StringTokenizer tokenizer = new StringTokenizer(line);
              while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                context.write(word, one);
              }
            }
          }

          // Reducer: sums the counts emitted for each word.
          public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {

            public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
              int sum = 0;
              for (IntWritable val : values) {
                sum += val.get();
              }
              context.write(key, new IntWritable(sum));
            }
          }

          public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();

            Job job = new Job(conf, "wordcount");

            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);

            job.setMapperClass(Map.class);
            job.setReducerClass(Reduce.class);

            job.setInputFormatClass(TextInputFormat.class);
            job.setOutputFormatClass(TextOutputFormat.class);

            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));

            job.waitForCompletion(true);
          }
        }
  • 38. Benefits of Unified Libraries Optimizations can be shared between libraries Core: Project Tungsten MLlib: shared statistics libraries Spark Streaming: GC and memory management
  • 39. Mix and match modules Easily go from DataFrames (SQL) to MLlib / statistics, for example:

        scala> import org.apache.spark.mllib.stat.Statistics
        scala> val numMentions = df.select("NumMentions").map(row => row.getInt(0).toDouble)
        numMentions: org.apache.spark.rdd.RDD[Double] = MapPartitionsRDD[100] at map at DataFrame.scala:848
        scala> val numArticles = df.select("NumArticles").map(row => row.getInt(0).toDouble)
        numArticles: org.apache.spark.rdd.RDD[Double] = MapPartitionsRDD[104] at map at DataFrame.scala:848
        scala> val correlation = Statistics.corr(numMentions, numArticles, "pearson")
  • 40. Spark Worker Failure Rebuild RDD Partitions on Worker from Lineage
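The "rebuild from lineage" idea on slide 40 can be sketched without Spark. In this toy model, which is an assumption-laden illustration and not how RDDs are implemented, a dataset records how each partition is derived from its parent rather than storing results. A lost cached partition is then simply recomputed from the source by replaying the recorded transformations. All names (`Dataset`, `Cached`, `loseWorker`) are hypothetical.

```scala
object LineageSketch {
  // A dataset is described by how to compute each partition, not by its data.
  sealed trait Dataset[A] {
    def numPartitions: Int
    def compute(partition: Int): Seq[A]
    def map[B](f: A => B): Dataset[B] = Mapped(this, f)
    def filter(p: A => Boolean): Dataset[A] = Filtered(this, p)
  }

  // Durable input, for example files or a replayable queue.
  final case class Source[A](partitions: Vector[Seq[A]]) extends Dataset[A] {
    def numPartitions: Int = partitions.size
    def compute(partition: Int): Seq[A] = partitions(partition)
  }

  final case class Mapped[A, B](parent: Dataset[A], f: A => B) extends Dataset[B] {
    def numPartitions: Int = parent.numPartitions
    def compute(partition: Int): Seq[B] = parent.compute(partition).map(f)
  }

  final case class Filtered[A](parent: Dataset[A], p: A => Boolean) extends Dataset[A] {
    def numPartitions: Int = parent.numPartitions
    def compute(partition: Int): Seq[A] = parent.compute(partition).filter(p)
  }

  // Cached partition results, as if held in worker memory.
  final class Cached[A](ds: Dataset[A]) {
    private val cache = scala.collection.mutable.Map.empty[Int, Seq[A]]
    var recomputations = 0

    // Serve from cache when possible; otherwise rebuild from lineage.
    def get(partition: Int): Seq[A] =
      cache.getOrElseUpdate(partition, { recomputations += 1; ds.compute(partition) })

    // Simulate losing the worker that held this partition.
    def loseWorker(partition: Int): Unit = { cache.remove(partition); () }
  }
}
```

Only the lost partition is recomputed; the other cached partitions are untouched, which is what makes this cheaper than replicating every intermediate result.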
  • 41. Spark SQL & DataFrames
  • 43. Catalyst Optimizations Column and partition pruning (Column filters) Predicate pushdowns (Row filters)
  • 44. Spark SQL Data Sources API Enables custom data sources to participate in SparkSQL = DataFrames + Catalyst Production Impls spark-csv (Databricks) spark-avro (Databricks) spark-cassandra-connector (DataStax) elasticsearch-hadoop (Elastic.co)
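Column pruning and predicate pushdown (slides 43 and 44) both come down to handing the data source the list of needed columns and the row filters, so that it returns only what the query uses. The sketch below is a plain-Scala illustration of that contract; `TableSource`, `EqualTo` and `scan` are hypothetical names chosen for this example, not the Spark Data Sources API.

```scala
object PushdownSketch {
  type Row = Map[String, Any]

  // A filter expressed as data, so the source can inspect and apply it itself.
  final case class EqualTo(column: String, value: Any)

  final class TableSource(rows: Seq[Row]) {
    // Returns only rows matching every filter (predicate pushdown),
    // and only the requested columns of those rows (column pruning).
    def scan(columns: Seq[String], filters: Seq[EqualTo]): Seq[Row] =
      rows
        .filter(r => filters.forall(f => r.get(f.column).contains(f.value)))
        .map(r => columns.flatMap(c => r.get(c).map(v => c -> v)).toMap)
  }
}
```

A query such as "speed of bus A" would call `scan(Seq("speed"), Seq(EqualTo("bus", "A")))`; rows for other buses and the unused columns never reach the query engine.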
  • 46. Streaming Sources Basic: Files, Akka actors, queues of RDDs, Socket Advanced: Kafka, Kinesis, Flume, Twitter firehose
  • 48. Streaming Fault Tolerance Incoming data is replicated to 1 other node Write Ahead Log for sources that support ACKs Checkpointing for recovery if Driver fails
  • 49. Direct Kafka Streaming: KafkaRDD No single Receiver Parallelizable No Write Ahead Log Kafka *is* the Write Ahead Log! KafkaRDD stores Kafka offsets KafkaRDD partitions recover from offsets

  • 50. Spark MLlib & GraphX
  • 51. Spark MLlib Common Algos Classifiers: DecisionTree, RandomForest Clustering: K-Means, Streaming K-Means Collaborative Filtering: Alternating Least Squares (ALS)
  • 52. Spark Text Processing Algos TF/IDF LDA Word2Vec *Pro-Tip: Use Stanford CoreNLP!
  • 53. Spark ML Pipelines Modeled after scikit-learn
  • 54. Spark GraphX PageRank: top influencers Connected Components: measure of clusters Triangle Counting: measure of cluster density
  • 57. What Kind of State? Non-persistent / in-memory: concurrent viewers Short term: latest trends Longer term: raw event & aggregate storage ML Models, predictions, scored data
  • 58. Spark RDDs Immutable, cache in memory and/or on disk Spark Streaming: UpdateStateByKey IndexedRDD - can update bits of data Snapshotting for recovery
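The stateful-streaming pattern mentioned on slide 58 can be sketched in plain Scala. The update function below has the same shape as the one passed to Spark Streaming's `updateStateByKey` (new values for a key plus the optional previous state, returning the optional new state), but `step` is a hypothetical single-machine stand-in that works on an immutable `Map` rather than a DStream. The example state is the "concurrent viewers" case from slide 57.

```scala
object StateSketch {
  // New values for a key in this batch, plus the previous state if any.
  // Returning None drops the key from the state.
  type Update[V, S] = (Seq[V], Option[S]) => Option[S]

  // Applies one micro-batch to the state. Keys with existing state are updated
  // even when the batch holds no new values for them.
  def step[K, V, S](state: Map[K, S], batch: Seq[(K, V)], update: Update[V, S]): Map[K, S] = {
    val grouped: Map[K, Seq[V]] =
      batch.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2) }
    (state.keySet ++ grouped.keySet).flatMap { k =>
      update(grouped.getOrElse(k, Seq.empty), state.get(k)).map(s => k -> s)
    }.toMap
  }

  // Concurrent viewers per video: +1 when a viewer joins, -1 when one leaves.
  val viewers: Update[Int, Int] = (deltas, prev) => {
    val n = prev.getOrElse(0) + deltas.sum
    if (n > 0) Some(n) else None
  }
}
```

Each call to `step` yields a new immutable state, which mirrors the slide's point that RDDs are immutable: state changes by producing a new version per batch, and older versions can be snapshotted for recovery.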
  • 59. • Massively Scalable • High Performance • Always On • Masterless
  • 60. Scale Apache Cassandra • Scales Linearly to as many nodes as you need • Scales whenever you need
  • 61. Performance Apache Cassandra • It’s Fast • Built to sustain massive data insertion rates in irregular pattern spikes
  • 62. Fault Tolerance & Availability Apache Cassandra • Automatic Replication • Multi Datacenter • Decentralized - no single point of failure • Survive regional outages • New nodes automatically add themselves to the cluster • DataStax drivers automatically discover new nodes
  • 63. Architecture Apache Cassandra • Distributed, Masterless Ring Architecture • Network Topology Aware • Flexible, Schemaless - your data structure can evolve seamlessly over time
  • 65. Cassandra Data Modeling Primary key = (partition keys, clustering keys) Fast queries = fetch single partition Range scans by clustering key Must model for query patterns [Diagram: Partitions 1-3 against Clustering 1-3]
  • 66. City Bus Data Modeling Example Primary key = (Bus UUID, timestamp) Easy queries: location and speed of single bus for a range of time Can also query most recent location + speed of all buses (slower) [Diagram: rows for Bus A, Bus B, Bus C; cells (speed, GPS) at 1020 s, 1010 s, 1000 s]
  • 67. Using Cassandra for Short Term Storage Idea is store and read small values Idempotent writes + huge write capacity = ideal for streaming ingestion For example, store last few (latest + last N) snapshots of buses, taxi locations, recent traffic info
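The bus data model from slides 65-67 can be mimicked with nested maps to show why some queries are fast and why writes are idempotent. This is an in-memory sketch, not Cassandra: the outer `Map` plays the role of the partition key, the inner sorted `TreeMap` plays the role of the clustering key, and a `String` bus ID stands in for the UUID. All names are hypothetical.

```scala
import scala.collection.immutable.TreeMap

object BusStore {
  final case class Reading(speedKmh: Double, lat: Double, lon: Double)

  // Partition key (bus ID) -> rows sorted by clustering key (timestamp in seconds).
  type Table = Map[String, TreeMap[Long, Reading]]

  // Writing the same (busId, timestamp) twice overwrites the row rather than
  // adding a duplicate, so replayed events are harmless.
  def upsert(t: Table, busId: String, ts: Long, r: Reading): Table =
    t.updated(busId, t.getOrElse(busId, TreeMap.empty[Long, Reading]).updated(ts, r))

  // Fast path: one partition, a contiguous range of clustering keys (inclusive).
  def range(t: Table, busId: String, from: Long, to: Long): Seq[(Long, Reading)] =
    t.get(busId).toSeq.flatMap(_.range(from, to + 1).toSeq)

  // Slower path: touches every partition to find each bus's latest row.
  def latest(t: Table): Map[String, (Long, Reading)] =
    t.collect { case (bus, rows) if rows.nonEmpty => bus -> rows.last }
}
```

Because a replayed write lands on the same primary key, a stream processor can safely re-send a batch after a failure, which is what makes this model a good fit for streaming ingestion.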
  • 68. But Mommy! What about longer term data?
  • 69. I need to read lots of data, fast!! - Ad hoc analytics of events - More specialized / geospatial - Building ML models from large quantities of data - Storing scored/classified data from models - OLAP / Data Warehousing
  • 70. Can Cassandra Handle Batch? Cassandra tables are much better at lots of small reads than big data scans You CAN store data efficiently in C* Files seem easier for long term storage and analysis But are files compatible with streaming?
  • 72. Lambda is Hard and Expensive Very high TCO - Many moving parts - KV store, real time, batch Lots of monitoring, operations, headache Running similar code in two places Lower performance - lots of shuffling data, network hops, translating domain objects Reconcile queries against two different places
  • 73. NoLambda A unified system Real-time processing and reprocessing No ETLs Fault tolerance Everything is a stream
  • 74. Can Cassandra do batch and ad-hoc? Yes, it can be competitive with Hadoop actually…. If you know how to be creative with storing your data! Tuplejump/SnackFS - HDFS for Cassandra github.com/tuplejump/FiloDB - analytics database Store your data using Protobuf / Avro / etc.
  • 75. Introduction to FiloDB Efficient columnar storage - 5-10x better Scan speeds competitive with Parquet - 100x faster than regular Cassandra tables Very fine grained filtering for sub-second concurrent queries Easy BI and ad-hoc analysis via Spark SQL/ Dataframes (JDBC etc.) Uses Cassandra for robust, proven storage
  • 76. Combining FiloDB + Cassandra Regular Cassandra tables for highly concurrent, aggregate / key-value lookups (dashboards) FiloDB + C* + Spark for efficient long term event storage Ad hoc / SQL / BI Data source for MLLib / building models Data storage for classified / predicted / scored data
  • 77. [Architecture diagram. Labels: Message Queue; Events; Spark Streaming; Cassandra (short term storage, K-V); FiloDB (events, ad-hoc, batch); Spark (ad hoc, SQL, ML); Dashboards, maps]
  • 79. FiloDB + Cassandra Robust, peer to peer, proven storage platform Use for short term snapshots, dashboards Use for efficient long term event storage & ad hoc querying Use as a source to build detailed models