A noETL
Parallel Streaming Transformation Loader
using Kafka, Spark & Vertica
Jack Gudenkauf
CEO & Founder of
BigDataInfra
https://www.linkedin.com/in/jackglinkedin
JackGudenkauf@gmail.com
WARNING!
Slides that follow
violate Powerpoint best practices
in favor of providing densely
packed information for later review
Agenda
1. Background
2. PSTL overview
Parallelized Streaming Transformation Loader
3. Parallelism in
Kafka, Spark, Vertica
4. PSTL drill down
Parallelized Streaming Transformation Loader
5. Vertica Performance!
Agenda
1. Background
My Background
Playtika, VP of Big Data
Flume, Kafka, Spark, ZK, Yarn, Vertica. [Jay Kreps (Kafka), Michael Armbrust (Spark SQL), Chris Bowden (Dev Demigod)]
MIS Director of several start-up companies
Dataflex a 4GL RDBMS. [E.F. Codd]
Self-employed Consultant
Intercept Dataflex db calls to store and retrieve data to/from Btrieve and IBM DB2 Mainframe
FoxPro, Sybase, MSSQL Server beta
Design Patterns: Elements of Reusable Object-Oriented Software [The Gang of Four]
Microsoft; Dev Manager, Architect CLR/.Net Framework,
Product Unit Manager Technical Strategy Group
Inventor of “Shuttle”, a Microsoft product in use since 1999
A distributed ETL based on MSMQ which influenced MSSQL DTS (SQL SSIS)
[Joe Celko, Ralph Kimball, Steve Butler (Microsoft Architect)]
Twitter, Manager of Analytics Data Warehouse
Core Storage; Hadoop, HBase, Cassandra, Blob Store
Analytics Infra; MySQL, PIG, Vertica (n-Petabyte capacity with Multi-DC DR)
[Prof. Michael Stonebraker, Ron Cormier (Vertica Internals)]
A Quest
With attributes of
• Operational Robustness
  - High Availability
  - Stronger durability guarantees
  - Idempotent (an operation that is safe to repeat)
• Productivity
  - Analytics
  - Streaming, Machine Learning, BI, BA, Data Science
  - Rich Development env.
  - Strongly typed, OO, Functional, with support for set based logic and aggregations (SQL)
• Performance
  - Scalable in every tier
  - MPP for Transformations, Reads & Writes
A Unified Data Pipeline with Parallelism
from Streaming Data
through Data Transformations
to Data Storage (Semi-Structured, Structured, and Relational Data)
https://en.wikipedia.org/wiki/Extract,_transform,_load
ELT
“Extract, Load, Transform is an alternative to Extract,
transform, load (ETL) used with data lake implementations.
In ELT models the data is not processed on entry to the data
lake which enables faster loading times.
But does require sufficient processing within the data
processing engine to carry out the transform on demand and
return the results to the consumer in a timely manner.
Since the data is not processed on entry to the data lake the
query and schema do not need to be defined a-priori (often
the schema will be available during load since many data
sources are extracts from databases or similar structured data
systems and hence have an associated schema).”
https://en.wikipedia.org/wiki/Extract,_load,_transform
Lambda Architecture
“Essentially, the speed layer (streaming) is responsible for filling the "gap" caused by the batch
layer's lag in providing views based on the most recent data” -
https://en.wikipedia.org/wiki/Lambda_architecture
Questioning the Lambda Architecture
by Jay Kreps
The Lambda Architecture has its merits, but alternatives
are worth exploring.
“As someone who designs infrastructure, I think the
glaring question is this: why can’t the stream processing
system just be improved to handle the full problem set in
its target domain? Why do you need to glue on another
system? Why can’t you do both real-time processing and
also handle the reprocessing when code changes? Stream
processing systems already have a notion of parallelism;
why not just handle reprocessing by increasing the
parallelism and replaying history very, very fast? The
answer is that you can do this, and I think this it is
actually a reasonable alternative architecture if you are
building this type of system today.”
Original Architecture (ETL)
Playtika Santa Monica original ETL architecture (Extract, Transform, Load)
[Diagram components, numbered 1 to 5 on the slide]
Game Applications (GameX, GameY, GameZ) with Local Data Warehouses
   UserId: INT, SessionId: UUId (36)
   UserId: INT, SessionId: UUId (32)
   UserId: varchar(32), SessionId: varchar(255)
REST API, Unified Schema JSON
Apache Flume™
ETL: JAVA™ Parser & Loader
COPY into HP Vertica™ Cluster (MPP Columnar DW), UserId <-> UserGId
Analytics of Relational Data: Structured Relational and Aggregated Data
Single Source of Truths to Global SOT
Agenda
1. Background
2. PSTL overview
Parallelized Streaming Transformation Loader
New PSTL Architecture
[Diagram components, numbered 1 to 5 on the slide]
Game Applications with Local Data Warehouses
   Bingo Blitz: UserId: INT, SessionId: UUId (36)
   Slotomania: UserId: INT, SessionId: UUId (32)
   WSOP: UserId: varchar(32), SessionId: varchar(255)
REST API or Local Kafka, Unified Schema JSON
Real-Time Messaging: Apache Kafka™
Parallelized Streaming Transformation Loader: Apache Spark™ (Resilient Distributed Datasets), Hadoop™, Parquet™
MPP Columnar DW: HP Vertica™
Analytics of [semi]Structured [non]Relational Data Stores
   Real-Time Streaming, Machine Learning
   Semi-Structured Raw JSON Data
   Structured (non)relational Parquet Data
   Structured Relational and Aggregated Data
Agenda
1. Background
2. PSTL overview
Parallelized Streaming Transformation Loader
3. Parallelism in
Kafka, Spark, Vertica
Apache Kafka™ is a distributed, partitioned, replicated commit log service
[Diagram: Producers publish to the Kafka Cluster (Broker); Consumers read from it]
A topic is a category or feed name to which messages are published.
For each topic, the Kafka cluster maintains a partitioned log.
Each partition is an ordered, immutable sequence of messages that is continually appended to: a commit log.
The messages in each partition are assigned a sequential id number, called the offset, that uniquely identifies each message within the partition.
The Kafka cluster retains all published messages, whether or not they have been consumed, for a configurable period of time.
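The partition and offset model described above can be sketched in a few lines of plain Python. This is a toy in-memory model, not the Kafka client API: the `Topic` and `Partition` classes and the modulo partitioner are assumptions made for illustration, and retention is left out.

```python
from dataclasses import dataclass, field


@dataclass
class Partition:
    """An append-only log; a message's offset is its position in the list."""
    messages: list = field(default_factory=list)

    def append(self, message):
        self.messages.append(message)
        return len(self.messages) - 1  # offset assigned to this message

    def read(self, start, end):
        """Return the messages in the half-open offset range [start, end)."""
        return self.messages[start:end]


class Topic:
    def __init__(self, name, num_partitions):
        self.name = name
        self.partitions = [Partition() for _ in range(num_partitions)]

    def publish(self, key, message):
        # Toy partitioner (an assumption): integer key modulo the partition count.
        p = key % len(self.partitions)
        return p, self.partitions[p].append(message)


topic = Topic("appId_1", num_partitions=3)
# Each publish returns (partition, offset).
placed = [topic.publish(user, {"userId": user}) for user in (0, 1, 2, 3, 4)]

# Reading does not consume anything: the same offset range can be read again.
first_read = topic.partitions[0].read(0, 2)
second_read = topic.partitions[0].read(0, 2)
```

Because a read is just an offset range over an immutable log, a consumer can replay a range after a failure and get the same messages back.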
Spark RDD
A Resilient Distributed Dataset [in Memory]
Represents an immutable, partitioned collection of elements that can be operated on in parallel
[Diagram: RDD partitions spread across Node 1, Node 2, Node 3, Node…]
RDD 1: Partition 1, Partition 2, Partition 3
RDD 2: Partitions 1 to 64, 65 to 128, 129 to 192, 193 to 256
RDD 3: Partition 1, Partition 2, Partition 3
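A rough sketch of "a partitioned collection of elements that can be operated on in parallel", using only the Python standard library rather than Spark. The helper names `partition_ranges` and `map_partitions` are hypothetical, and the 1 to 256 range split four ways mirrors the RDD 2 labels in the diagram.

```python
from concurrent.futures import ThreadPoolExecutor


def partition_ranges(first, last, num_partitions):
    """Split the inclusive range [first, last] into equal contiguous partitions.

    Assumes the element count divides evenly by num_partitions.
    """
    size = (last - first + 1) // num_partitions
    return [range(first + i * size, first + (i + 1) * size)
            for i in range(num_partitions)]


def map_partitions(partitions, fn):
    """Apply fn to every partition concurrently, one task per partition."""
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        return list(pool.map(fn, partitions))


partitions = partition_ranges(1, 256, 4)
bounds = [(p[0], p[-1]) for p in partitions]  # first and last element of each partition
partial_sums = map_partitions(partitions, sum)
total = sum(partial_sums)
```

Each partition is an immutable `range`, so a task can only produce a new value from it. Spark schedules such per-partition tasks across cluster nodes instead of threads.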
Vertica Hashing & Partitioning
An Initiator Node shuffles data to storage nodes
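A minimal sketch of hash-based placement, assuming each row is owned by the node selected by a hash of its key. CRC32 and the modulo step are stand-ins chosen for illustration; they are not Vertica's actual hash function or segmentation scheme.

```python
import zlib
from collections import defaultdict


def node_for(key, num_nodes):
    """Pick a storage node from a deterministic hash of the key (CRC32 stand-in)."""
    return zlib.crc32(str(key).encode("utf-8")) % num_nodes


def segment(rows, key_field, num_nodes):
    """Group rows by the node that owns their segmentation key."""
    by_node = defaultdict(list)
    for row in rows:
        by_node[node_for(row[key_field], num_nodes)].append(row)
    return dict(by_node)


rows = [{"userGId": g, "event": "login"} for g in range(1, 101)]
placement = segment(rows, "userGId", num_nodes=4)
```

The same key always maps to the same node, so a loader that applies the hash before sending data can deliver each row directly to its owner and leave less for an initiator node to shuffle.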
Agenda
1. Background
2. PSTL overview
Parallelized Streaming Transformation Loader
3. Parallelism in
Kafka, Spark, Vertica
4. PSTL drill down
Parallelized Streaming Transformation Loader
{"appId": 3, "sessionId": "7", "userId": "42"}
{"appId": 3, "sessionId": "6", "userId": "42"}
Node 1 Node 2 Node 3 Node 4
3 Import recent Sessions
Apache Kafka Cluster
Topic: “appId_1” Topic: “appId_2” Topic: “appId_3”
old new
Kafka Table (appId, TopicOffsetRange, Batch_Id)
SessionMax Table (sessionGIdMax Int)
UserMax Table (userGIdMax Int)
appSessionMap_RDD (appId: Int, sessionId: String, sessionGId: Int)
appUserMap_RDD (appId: Int, userId: String, userGId: Int)
appSession (appId: Int, sessionId: varchar(255), sessionGId: Int)
appUser (appId: Int, userId: varchar(255), userGId: Int)
1 Start a Spark Driver per APP
Node 1 Node 2 Node 3
4 Spark Kafka [non]Streaming job per APP (read partition/offset range)
5 select for update; update max GId
5 Assign userGIds to userId, sessionGIds to sessionId
6 Hash(userGId) to RDD partitions with affinity to Vertica Node(s)
7 userGIdRDD.foreachPartition {…stream.writeTo(socket)...}
8 Idempotent: Write Raw JSON to hdfs
9 Idempotent: Write Parsed JSON to .ORC hdfs
10 Update MySQL Kafka Offsets
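One way to read steps 4, 8 and 10 together: if each batch is identified by its offset range and the stored offset is updated last, a crashed batch can be rerun without duplicating output. The sketch below uses a local directory in place of HDFS and a dict in place of the MySQL offsets table; `batch_id`, `write_batch` and `run_batch` are hypothetical names, not the author's code.

```python
import json
import os
import tempfile


def batch_id(app_id, start_offset, end_offset):
    """Derive a deterministic id from the app and its offset range [start, end)."""
    return f"app{app_id}_{start_offset}_{end_offset}"


def write_batch(out_dir, app_id, start_offset, end_offset, messages):
    """Write one batch to a file named after its id; a rerun overwrites the same file."""
    path = os.path.join(out_dir, batch_id(app_id, start_offset, end_offset) + ".json")
    tmp_path = path + ".tmp"
    with open(tmp_path, "w", encoding="utf-8") as f:
        for message in messages:
            f.write(json.dumps(message, sort_keys=True) + "\n")
    os.replace(tmp_path, path)  # rename into place so readers never see a partial file
    return path


def run_batch(out_dir, offsets, app_id, log, batch_size):
    """Process the next offset range, then record the new offset as the last step."""
    start = offsets.get(app_id, 0)
    end = min(start + batch_size, len(log))
    write_batch(out_dir, app_id, start, end, log[start:end])
    offsets[app_id] = end  # stand-in for step 10 (update the stored Kafka offsets)
    return start, end


log = [{"appId": 1, "sessionId": str(i), "userId": "JG"} for i in range(5)]
with tempfile.TemporaryDirectory() as out_dir:
    offsets = {}
    first = run_batch(out_dir, offsets, 1, log, batch_size=3)
    offsets[1] = 0  # simulate a crash before the offset update was saved
    retry = run_batch(out_dir, offsets, 1, log, batch_size=3)
    files_after_retry = sorted(os.listdir(out_dir))
```

The retry covers the same offset range and lands on the same file name, so the output directory holds one copy of the batch no matter how many times it runs.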
{"appId": 2, "sessionId": "4", "userId": "KA"}
{"appId": 2, "sessionId": "3", "userId": "KY"}
{"appId": 1, "sessionId": "2", "userId": "CB"}
{"appId": 1, "sessionId": "1", "userId": "JG"}
4 appId {Game events, Users, Sessions,…} Partition 1..n RDDs
5 appId Users & Sessions Partition 1..n RDDs
5 appId appUserMap_RDD.union(assignedID_RDD)
6 appId Users & Sessions Partition 1..n RDDs
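Steps 5 and 6 can be sketched with plain dictionaries standing in for the RDDs and the max-GId table. The functions `assign_gids` and `partition_by_gid` are hypothetical, the modulo partitioner is an assumption, and the "select for update" locking that protects the max GId across concurrent drivers is not modeled.

```python
def assign_gids(existing_map, max_gid, incoming_ids):
    """Return (new_assignments, new_max) for ids not already in existing_map."""
    new_assignments = {}
    for app_id, user_id in incoming_ids:
        key = (app_id, user_id)
        if key not in existing_map and key not in new_assignments:
            max_gid += 1
            new_assignments[key] = max_gid
    return new_assignments, max_gid


def partition_by_gid(mapping, num_partitions):
    """Place each (key, gid) pair in the partition chosen by gid modulo the count."""
    partitions = [[] for _ in range(num_partitions)]
    for key, gid in sorted(mapping.items(), key=lambda kv: kv[1]):
        partitions[gid % num_partitions].append((key, gid))
    return partitions


# Known users (the appUserMap role) and the current maximum global id.
app_user_map = {(1, "JG"): 1, (1, "CB"): 2}
user_gid_max = 2

# (appId, userId) pairs seen in the current batch, including repeats.
incoming = [(1, "JG"), (2, "KY"), (2, "KA"), (3, "42"), (3, "42")]

assigned, user_gid_max = assign_gids(app_user_map, user_gid_max, incoming)
app_user_map = {**app_user_map, **assigned}  # plays the role of the union
partitions = partition_by_gid(app_user_map, num_partitions=3)
```

Known ids keep their global id and only unseen (appId, userId) pairs draw a new one, so replaying a batch against the updated map assigns nothing new.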
7 copy jackg.DIM_USER with source SPARK(port='12345', nodes='node0001:4, node0002:4, node0003:4') direct;
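The slide does not say what the `:4` suffix in the `nodes` parameter means. Assuming it is the number of partitions (or connections) each Vertica node accepts, partition-to-node affinity could be derived as below. Both the interpretation and the helper names `parse_nodes` and `partition_affinity` are assumptions.

```python
def parse_nodes(nodes_param):
    """Parse 'name:count, name:count' into a list of (name, count) pairs."""
    pairs = []
    for item in nodes_param.split(","):
        name, count = item.strip().split(":")
        pairs.append((name, int(count)))
    return pairs


def partition_affinity(nodes_param):
    """Map each partition index to a node, giving every node `count` consecutive partitions."""
    affinity = []
    for name, count in parse_nodes(nodes_param):
        affinity.extend([name] * count)
    return affinity


# Index i of the result is the node that partition i would write to.
affinity = partition_affinity("node0001:4, node0002:4, node0003:4")
```

Under that reading, the three-node example yields twelve partitions, four per node, each of which would open its own stream to the node it is pinned to.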
2 Import Users
Apache Hadoop™
Spark™ Cluster
HPE Vertica™ Cluster
Agenda
1. Background
2. PSTL overview
Parallelized Streaming Transformation Loader
3. Parallelism in
Kafka, Spark, Vertica
4. PSTL drill down
Parallelized Streaming Transformation Loader
5. Vertica Performance!
Impressive Parallel COPY Performance
Loaded 2.42 Billion Rows (451 GB)
in 7min 35sec on an 8 Node Cluster
Key Takeaways
Parallel Kafka reads to Spark RDDs (in memory), with parallel writes to Vertica via a TCP server, ROCKS!
COPY 36 TB/Hour with 81 Node cluster
No ephemeral nodes needed for ingest
Kafka read parallelism to Spark RDD partitions
A priori hash() in Spark RDD Partitions (in Memory)
TCP Server as a Vertica User-Defined Copy Source
Single COPY does not preallocate Memory across nodes
http://www.vertica.com/2014/09/17/how-vertica-met-facebooks-35-tbhour-ingest-sla/
* 270 Nodes ( 215 Data Nodes )
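The load figures above can be turned into throughput with a little arithmetic. The sketch assumes decimal units (1 TB = 1000 GB) and, for the last line, that throughput scales linearly with node count; that extrapolation is an assumption for comparison, not a measurement from the slides.

```python
rows = 2.42e9              # rows loaded
gigabytes = 451            # data volume loaded
seconds = 7 * 60 + 35      # 7 min 35 sec
nodes = 8                  # cluster size for the measured load

gb_per_hour = gigabytes / seconds * 3600
tb_per_hour = gb_per_hour / 1000           # decimal terabytes (an assumption)
tb_per_hour_per_node = tb_per_hour / nodes
rows_per_second = rows / seconds

# Linear extrapolation to 81 nodes (an assumption, not a measurement).
tb_per_hour_81_nodes = tb_per_hour_per_node * 81

print(f"{tb_per_hour:.2f} TB/hour on {nodes} nodes")
print(f"{rows_per_second / 1e6:.2f} million rows/second")
print(f"{tb_per_hour_81_nodes:.1f} TB/hour extrapolated to 81 nodes")
```

Under these assumptions the 8-node load works out to about 3.57 TB/hour, and scaling that per-node rate to 81 nodes gives about 36.1 TB/hour, which is close to the 36 TB/hour figure quoted in the takeaways.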
THANK YOU

More Related Content

What's hot

Spark as part of a Hybrid RDBMS Architecture-John Leach Cofounder Splice Machine
Spark as part of a Hybrid RDBMS Architecture-John Leach Cofounder Splice MachineSpark as part of a Hybrid RDBMS Architecture-John Leach Cofounder Splice Machine
Spark as part of a Hybrid RDBMS Architecture-John Leach Cofounder Splice MachineData Con LA
 
What's inside the black box? Using ML to tune and manage Kafka. (Matthew Stum...
What's inside the black box? Using ML to tune and manage Kafka. (Matthew Stum...What's inside the black box? Using ML to tune and manage Kafka. (Matthew Stum...
What's inside the black box? Using ML to tune and manage Kafka. (Matthew Stum...confluent
 
ETL as a Platform: Pandora Plays Nicely Everywhere with Real-Time Data Pipelines
ETL as a Platform: Pandora Plays Nicely Everywhere with Real-Time Data PipelinesETL as a Platform: Pandora Plays Nicely Everywhere with Real-Time Data Pipelines
ETL as a Platform: Pandora Plays Nicely Everywhere with Real-Time Data Pipelinesconfluent
 
Kappa Architecture on Apache Kafka and Querona: datamass.io
Kappa Architecture on Apache Kafka and Querona: datamass.ioKappa Architecture on Apache Kafka and Querona: datamass.io
Kappa Architecture on Apache Kafka and Querona: datamass.ioPiotr Czarnas
 
Kafka Summit SF 2017 - Query the Application, Not a Database: “Interactive Qu...
Kafka Summit SF 2017 - Query the Application, Not a Database: “Interactive Qu...Kafka Summit SF 2017 - Query the Application, Not a Database: “Interactive Qu...
Kafka Summit SF 2017 - Query the Application, Not a Database: “Interactive Qu...confluent
 
A Marriage of Lambda and Kappa: Supporting Iterative Development of an Event ...
A Marriage of Lambda and Kappa: Supporting Iterative Development of an Event ...A Marriage of Lambda and Kappa: Supporting Iterative Development of an Event ...
A Marriage of Lambda and Kappa: Supporting Iterative Development of an Event ...confluent
 
Introduction to Streaming Distributed Processing with Storm
Introduction to Streaming Distributed Processing with StormIntroduction to Streaming Distributed Processing with Storm
Introduction to Streaming Distributed Processing with StormBrandon O'Brien
 
Performance Tuning RocksDB for Kafka Streams' State Stores (Dhruba Borthakur,...
Performance Tuning RocksDB for Kafka Streams' State Stores (Dhruba Borthakur,...Performance Tuning RocksDB for Kafka Streams' State Stores (Dhruba Borthakur,...
Performance Tuning RocksDB for Kafka Streams' State Stores (Dhruba Borthakur,...confluent
 
Apache kafka-a distributed streaming platform
Apache kafka-a distributed streaming platformApache kafka-a distributed streaming platform
Apache kafka-a distributed streaming platformconfluent
 
Streaming Analytics with Spark, Kafka, Cassandra and Akka by Helena Edelson
Streaming Analytics with Spark, Kafka, Cassandra and Akka by Helena EdelsonStreaming Analytics with Spark, Kafka, Cassandra and Akka by Helena Edelson
Streaming Analytics with Spark, Kafka, Cassandra and Akka by Helena EdelsonSpark Summit
 
How To Use Kafka and Druid to Tame Your Router Data (Rachel Pedreschi, Imply ...
How To Use Kafka and Druid to Tame Your Router Data (Rachel Pedreschi, Imply ...How To Use Kafka and Druid to Tame Your Router Data (Rachel Pedreschi, Imply ...
How To Use Kafka and Druid to Tame Your Router Data (Rachel Pedreschi, Imply ...confluent
 
Low-latency data applications with Kafka and Agg indexes | Tino Tereshko, Fir...
Low-latency data applications with Kafka and Agg indexes | Tino Tereshko, Fir...Low-latency data applications with Kafka and Agg indexes | Tino Tereshko, Fir...
Low-latency data applications with Kafka and Agg indexes | Tino Tereshko, Fir...HostedbyConfluent
 
Bullet: A Real Time Data Query Engine
Bullet: A Real Time Data Query EngineBullet: A Real Time Data Query Engine
Bullet: A Real Time Data Query EngineDataWorks Summit
 
Kafka Tiered Storage | Satish Duggana and Sriharsha Chintalapani, Uber
Kafka Tiered Storage | Satish Duggana and Sriharsha Chintalapani, UberKafka Tiered Storage | Satish Duggana and Sriharsha Chintalapani, Uber
Kafka Tiered Storage | Satish Duggana and Sriharsha Chintalapani, UberHostedbyConfluent
 
Introduction to Apache Kafka and Confluent... and why they matter
Introduction to Apache Kafka and Confluent... and why they matterIntroduction to Apache Kafka and Confluent... and why they matter
Introduction to Apache Kafka and Confluent... and why they matterconfluent
 
The Netflix Way to deal with Big Data Problems
The Netflix Way to deal with Big Data ProblemsThe Netflix Way to deal with Big Data Problems
The Netflix Way to deal with Big Data ProblemsMonal Daxini
 
Building Large-Scale Stream Infrastructures Across Multiple Data Centers with...
Building Large-Scale Stream Infrastructures Across Multiple Data Centers with...Building Large-Scale Stream Infrastructures Across Multiple Data Centers with...
Building Large-Scale Stream Infrastructures Across Multiple Data Centers with...DataWorks Summit/Hadoop Summit
 
Use ksqlDB to migrate core-banking processing from batch to streaming | Mark ...
Use ksqlDB to migrate core-banking processing from batch to streaming | Mark ...Use ksqlDB to migrate core-banking processing from batch to streaming | Mark ...
Use ksqlDB to migrate core-banking processing from batch to streaming | Mark ...HostedbyConfluent
 

What's hot (20)

Spark as part of a Hybrid RDBMS Architecture-John Leach Cofounder Splice Machine
Spark as part of a Hybrid RDBMS Architecture-John Leach Cofounder Splice MachineSpark as part of a Hybrid RDBMS Architecture-John Leach Cofounder Splice Machine
Spark as part of a Hybrid RDBMS Architecture-John Leach Cofounder Splice Machine
 
What's inside the black box? Using ML to tune and manage Kafka. (Matthew Stum...
What's inside the black box? Using ML to tune and manage Kafka. (Matthew Stum...What's inside the black box? Using ML to tune and manage Kafka. (Matthew Stum...
What's inside the black box? Using ML to tune and manage Kafka. (Matthew Stum...
 
ETL as a Platform: Pandora Plays Nicely Everywhere with Real-Time Data Pipelines
ETL as a Platform: Pandora Plays Nicely Everywhere with Real-Time Data PipelinesETL as a Platform: Pandora Plays Nicely Everywhere with Real-Time Data Pipelines
ETL as a Platform: Pandora Plays Nicely Everywhere with Real-Time Data Pipelines
 
Kappa Architecture on Apache Kafka and Querona: datamass.io
Kappa Architecture on Apache Kafka and Querona: datamass.ioKappa Architecture on Apache Kafka and Querona: datamass.io
Kappa Architecture on Apache Kafka and Querona: datamass.io
 
Kafka Summit SF 2017 - Query the Application, Not a Database: “Interactive Qu...
Kafka Summit SF 2017 - Query the Application, Not a Database: “Interactive Qu...Kafka Summit SF 2017 - Query the Application, Not a Database: “Interactive Qu...
Kafka Summit SF 2017 - Query the Application, Not a Database: “Interactive Qu...
 
Data Pipeline at Tapad
Data Pipeline at TapadData Pipeline at Tapad
Data Pipeline at Tapad
 
A Marriage of Lambda and Kappa: Supporting Iterative Development of an Event ...
A Marriage of Lambda and Kappa: Supporting Iterative Development of an Event ...A Marriage of Lambda and Kappa: Supporting Iterative Development of an Event ...
A Marriage of Lambda and Kappa: Supporting Iterative Development of an Event ...
 
Introduction to Streaming Distributed Processing with Storm
Introduction to Streaming Distributed Processing with StormIntroduction to Streaming Distributed Processing with Storm
Introduction to Streaming Distributed Processing with Storm
 
Performance Tuning RocksDB for Kafka Streams' State Stores (Dhruba Borthakur,...
Performance Tuning RocksDB for Kafka Streams' State Stores (Dhruba Borthakur,...Performance Tuning RocksDB for Kafka Streams' State Stores (Dhruba Borthakur,...
Performance Tuning RocksDB for Kafka Streams' State Stores (Dhruba Borthakur,...
 
Apache kafka-a distributed streaming platform
Apache kafka-a distributed streaming platformApache kafka-a distributed streaming platform
Apache kafka-a distributed streaming platform
 
Streaming Analytics with Spark, Kafka, Cassandra and Akka by Helena Edelson
Streaming Analytics with Spark, Kafka, Cassandra and Akka by Helena EdelsonStreaming Analytics with Spark, Kafka, Cassandra and Akka by Helena Edelson
Streaming Analytics with Spark, Kafka, Cassandra and Akka by Helena Edelson
 
How To Use Kafka and Druid to Tame Your Router Data (Rachel Pedreschi, Imply ...
How To Use Kafka and Druid to Tame Your Router Data (Rachel Pedreschi, Imply ...How To Use Kafka and Druid to Tame Your Router Data (Rachel Pedreschi, Imply ...
How To Use Kafka and Druid to Tame Your Router Data (Rachel Pedreschi, Imply ...
 
Low-latency data applications with Kafka and Agg indexes | Tino Tereshko, Fir...
Low-latency data applications with Kafka and Agg indexes | Tino Tereshko, Fir...Low-latency data applications with Kafka and Agg indexes | Tino Tereshko, Fir...
Low-latency data applications with Kafka and Agg indexes | Tino Tereshko, Fir...
 
Bullet: A Real Time Data Query Engine
Bullet: A Real Time Data Query EngineBullet: A Real Time Data Query Engine
Bullet: A Real Time Data Query Engine
 
Data Pipeline with Kafka
Data Pipeline with KafkaData Pipeline with Kafka
Data Pipeline with Kafka
 
Kafka Tiered Storage | Satish Duggana and Sriharsha Chintalapani, Uber
Kafka Tiered Storage | Satish Duggana and Sriharsha Chintalapani, UberKafka Tiered Storage | Satish Duggana and Sriharsha Chintalapani, Uber
Kafka Tiered Storage | Satish Duggana and Sriharsha Chintalapani, Uber
 
Introduction to Apache Kafka and Confluent... and why they matter
Introduction to Apache Kafka and Confluent... and why they matterIntroduction to Apache Kafka and Confluent... and why they matter
Introduction to Apache Kafka and Confluent... and why they matter
 
The Netflix Way to deal with Big Data Problems
The Netflix Way to deal with Big Data ProblemsThe Netflix Way to deal with Big Data Problems
The Netflix Way to deal with Big Data Problems
 
Building Large-Scale Stream Infrastructures Across Multiple Data Centers with...
Building Large-Scale Stream Infrastructures Across Multiple Data Centers with...Building Large-Scale Stream Infrastructures Across Multiple Data Centers with...
Building Large-Scale Stream Infrastructures Across Multiple Data Centers with...
 
Use ksqlDB to migrate core-banking processing from batch to streaming | Mark ...
Use ksqlDB to migrate core-banking processing from batch to streaming | Mark ...Use ksqlDB to migrate core-banking processing from batch to streaming | Mark ...
Use ksqlDB to migrate core-banking processing from batch to streaming | Mark ...
 

Viewers also liked

Oracle 12.2 sharded database management
Oracle 12.2 sharded database managementOracle 12.2 sharded database management
Oracle 12.2 sharded database managementLeyi (Kamus) Zhang
 
Tajolabigdatacamp2014 140618135810-phpapp01 hyunsik-choi
Tajolabigdatacamp2014 140618135810-phpapp01 hyunsik-choiTajolabigdatacamp2014 140618135810-phpapp01 hyunsik-choi
Tajolabigdatacamp2014 140618135810-phpapp01 hyunsik-choiData Con LA
 
Big Data Day LA 2015 - Tips for Building Self Service Data Science Platform b...
Big Data Day LA 2015 - Tips for Building Self Service Data Science Platform b...Big Data Day LA 2015 - Tips for Building Self Service Data Science Platform b...
Big Data Day LA 2015 - Tips for Building Self Service Data Science Platform b...Data Con LA
 
Big Data Day LA 2015 - Data mining, forecasting, and BI at the RRCC by Benjam...
Big Data Day LA 2015 - Data mining, forecasting, and BI at the RRCC by Benjam...Big Data Day LA 2015 - Data mining, forecasting, and BI at the RRCC by Benjam...
Big Data Day LA 2015 - Data mining, forecasting, and BI at the RRCC by Benjam...Data Con LA
 
Getting started with Spark & Cassandra by Jon Haddad of Datastax
Getting started with Spark & Cassandra by Jon Haddad of DatastaxGetting started with Spark & Cassandra by Jon Haddad of Datastax
Getting started with Spark & Cassandra by Jon Haddad of DatastaxData Con LA
 
Big Data Day LA 2015 - Machine Learning on Largish Data by Szilard Pafka of E...
Big Data Day LA 2015 - Machine Learning on Largish Data by Szilard Pafka of E...Big Data Day LA 2015 - Machine Learning on Largish Data by Szilard Pafka of E...
Big Data Day LA 2015 - Machine Learning on Largish Data by Szilard Pafka of E...Data Con LA
 
Big Data Day LA 2015 - What's new and next in Apache Tez by Bikas Saha of Hor...
Big Data Day LA 2015 - What's new and next in Apache Tez by Bikas Saha of Hor...Big Data Day LA 2015 - What's new and next in Apache Tez by Bikas Saha of Hor...
Big Data Day LA 2015 - What's new and next in Apache Tez by Bikas Saha of Hor...Data Con LA
 
Big Data Day LA 2015 - The AWS Big Data Platform by Michael Limcaco of Amazon
Big Data Day LA 2015 - The AWS Big Data Platform by Michael Limcaco of AmazonBig Data Day LA 2015 - The AWS Big Data Platform by Michael Limcaco of Amazon
Big Data Day LA 2015 - The AWS Big Data Platform by Michael Limcaco of AmazonData Con LA
 
Data science and good questions eric kostello
Data science and good questions eric kostelloData science and good questions eric kostello
Data science and good questions eric kostelloData Con LA
 
The Future of Hadoop by Arun Murthy, PMC Apache Hadoop & Cofounder Hortonworks
The Future of Hadoop by Arun Murthy, PMC Apache Hadoop & Cofounder HortonworksThe Future of Hadoop by Arun Murthy, PMC Apache Hadoop & Cofounder Hortonworks
The Future of Hadoop by Arun Murthy, PMC Apache Hadoop & Cofounder HortonworksData Con LA
 
Big Data Day LA 2015 - Data Science ≠ Big Data by Jim McGuire of ZestFinance
Big Data Day LA 2015 - Data Science ≠ Big Data by Jim McGuire of ZestFinanceBig Data Day LA 2015 - Data Science ≠ Big Data by Jim McGuire of ZestFinance
Big Data Day LA 2015 - Data Science ≠ Big Data by Jim McGuire of ZestFinanceData Con LA
 
Big Data Day LA 2015 - Scalable and High-Performance Analytics with Distribut...
Big Data Day LA 2015 - Scalable and High-Performance Analytics with Distribut...Big Data Day LA 2015 - Scalable and High-Performance Analytics with Distribut...
Big Data Day LA 2015 - Scalable and High-Performance Analytics with Distribut...Data Con LA
 
Big Data Day LA 2016/ Data Science Track - Enabling Cross-Screen Advertising ...
Big Data Day LA 2016/ Data Science Track - Enabling Cross-Screen Advertising ...Big Data Day LA 2016/ Data Science Track - Enabling Cross-Screen Advertising ...
Big Data Day LA 2016/ Data Science Track - Enabling Cross-Screen Advertising ...Data Con LA
 
Big Data Day LA 2016/ NoSQL track - Architecting Real Life IoT Architecture, ...
Big Data Day LA 2016/ NoSQL track - Architecting Real Life IoT Architecture, ...Big Data Day LA 2016/ NoSQL track - Architecting Real Life IoT Architecture, ...
Big Data Day LA 2016/ NoSQL track - Architecting Real Life IoT Architecture, ...Data Con LA
 
Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Deep Learning at Scale - A...
Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Deep Learning at Scale - A...Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Deep Learning at Scale - A...
Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Deep Learning at Scale - A...Data Con LA
 
Big Data Day LA 2016/ Use Case Driven track - BI is broken, Dave Fryer, Produ...
Big Data Day LA 2016/ Use Case Driven track - BI is broken, Dave Fryer, Produ...Big Data Day LA 2016/ Use Case Driven track - BI is broken, Dave Fryer, Produ...
Big Data Day LA 2016/ Use Case Driven track - BI is broken, Dave Fryer, Produ...Data Con LA
 
Big Data Day LA 2016/ Big Data Track - Warner Bros. Digital Consumer Intellig...
Big Data Day LA 2016/ Big Data Track - Warner Bros. Digital Consumer Intellig...Big Data Day LA 2016/ Big Data Track - Warner Bros. Digital Consumer Intellig...
Big Data Day LA 2016/ Big Data Track - Warner Bros. Digital Consumer Intellig...Data Con LA
 

Viewers also liked (17)

Oracle 12.2 sharded database management
Oracle 12.2 sharded database managementOracle 12.2 sharded database management
Oracle 12.2 sharded database management
 
Tajolabigdatacamp2014 140618135810-phpapp01 hyunsik-choi
Tajolabigdatacamp2014 140618135810-phpapp01 hyunsik-choiTajolabigdatacamp2014 140618135810-phpapp01 hyunsik-choi
Tajolabigdatacamp2014 140618135810-phpapp01 hyunsik-choi
 
Big Data Day LA 2015 - Tips for Building Self Service Data Science Platform b...
Big Data Day LA 2015 - Tips for Building Self Service Data Science Platform b...Big Data Day LA 2015 - Tips for Building Self Service Data Science Platform b...
Big Data Day LA 2015 - Tips for Building Self Service Data Science Platform b...
 
Big Data Day LA 2015 - Data mining, forecasting, and BI at the RRCC by Benjam...
Big Data Day LA 2015 - Data mining, forecasting, and BI at the RRCC by Benjam...Big Data Day LA 2015 - Data mining, forecasting, and BI at the RRCC by Benjam...
Big Data Day LA 2015 - Data mining, forecasting, and BI at the RRCC by Benjam...
 
Getting started with Spark & Cassandra by Jon Haddad of Datastax
Getting started with Spark & Cassandra by Jon Haddad of DatastaxGetting started with Spark & Cassandra by Jon Haddad of Datastax
Getting started with Spark & Cassandra by Jon Haddad of Datastax
 
Big Data Day LA 2015 - Machine Learning on Largish Data by Szilard Pafka of E...
Big Data Day LA 2015 - Machine Learning on Largish Data by Szilard Pafka of E...Big Data Day LA 2015 - Machine Learning on Largish Data by Szilard Pafka of E...
Big Data Day LA 2015 - Machine Learning on Largish Data by Szilard Pafka of E...
 
Big Data Day LA 2015 - What's new and next in Apache Tez by Bikas Saha of Hor...
Big Data Day LA 2015 - What's new and next in Apache Tez by Bikas Saha of Hor...Big Data Day LA 2015 - What's new and next in Apache Tez by Bikas Saha of Hor...
Big Data Day LA 2015 - What's new and next in Apache Tez by Bikas Saha of Hor...
 
Big Data Day LA 2015 - The AWS Big Data Platform by Michael Limcaco of Amazon
Big Data Day LA 2015 - The AWS Big Data Platform by Michael Limcaco of AmazonBig Data Day LA 2015 - The AWS Big Data Platform by Michael Limcaco of Amazon
Big Data Day LA 2015 - The AWS Big Data Platform by Michael Limcaco of Amazon
 
Data science and good questions eric kostello
Data science and good questions eric kostelloData science and good questions eric kostello
Data science and good questions eric kostello
 
The Future of Hadoop by Arun Murthy, PMC Apache Hadoop & Cofounder Hortonworks
The Future of Hadoop by Arun Murthy, PMC Apache Hadoop & Cofounder HortonworksThe Future of Hadoop by Arun Murthy, PMC Apache Hadoop & Cofounder Hortonworks
The Future of Hadoop by Arun Murthy, PMC Apache Hadoop & Cofounder Hortonworks
 
Big Data Day LA 2015 - Data Science ≠ Big Data by Jim McGuire of ZestFinance
Big Data Day LA 2015 - Data Science ≠ Big Data by Jim McGuire of ZestFinanceBig Data Day LA 2015 - Data Science ≠ Big Data by Jim McGuire of ZestFinance
Big Data Day LA 2015 - Data Science ≠ Big Data by Jim McGuire of ZestFinance
 
Big Data Day LA 2015 - Scalable and High-Performance Analytics with Distribut...
Big Data Day LA 2015 - Scalable and High-Performance Analytics with Distribut...Big Data Day LA 2015 - Scalable and High-Performance Analytics with Distribut...
Big Data Day LA 2015 - Scalable and High-Performance Analytics with Distribut...
 
Big Data Day LA 2016/ Data Science Track - Enabling Cross-Screen Advertising ...
Big Data Day LA 2016/ Data Science Track - Enabling Cross-Screen Advertising ...Big Data Day LA 2016/ Data Science Track - Enabling Cross-Screen Advertising ...
Big Data Day LA 2016/ Data Science Track - Enabling Cross-Screen Advertising ...
 
Big Data Day LA 2016/ NoSQL track - Architecting Real Life IoT Architecture, ...
Big Data Day LA 2016/ NoSQL track - Architecting Real Life IoT Architecture, ...Big Data Day LA 2016/ NoSQL track - Architecting Real Life IoT Architecture, ...
Big Data Day LA 2016/ NoSQL track - Architecting Real Life IoT Architecture, ...
 
Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Deep Learning at Scale - A...
Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Deep Learning at Scale - A...Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Deep Learning at Scale - A...
Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Deep Learning at Scale - A...
 
Big Data Day LA 2016/ Use Case Driven track - BI is broken, Dave Fryer, Produ...
Big Data Day LA 2016/ Use Case Driven track - BI is broken, Dave Fryer, Produ...Big Data Day LA 2016/ Use Case Driven track - BI is broken, Dave Fryer, Produ...
Big Data Day LA 2016/ Use Case Driven track - BI is broken, Dave Fryer, Produ...
 
Big Data Day LA 2016/ Big Data Track - Warner Bros. Digital Consumer Intellig...
Big Data Day LA 2016/ Big Data Track - Warner Bros. Digital Consumer Intellig...Big Data Day LA 2016/ Big Data Track - Warner Bros. Digital Consumer Intellig...
Big Data Day LA 2016/ Big Data Track - Warner Bros. Digital Consumer Intellig...
 

A noETL Parallel Streaming Transformation Loader using Spark, Kafka & Vertica

  • 1. A noETL Parallel Streaming Transformation Loader using Kafka, Spark & Vertica. Jack Gudenkauf, CEO & Founder of BigDataInfra. https://www.linkedin.com/in/jackglinkedin JackGudenkauf@gmail.com
  • 2. WARNING! The slides that follow violate PowerPoint best practices in favor of providing densely packed information for later review.
  • 3. Agenda: 1. Background; 2. PSTL overview (Parallelized Streaming Transformation Loader); 3. Parallelism in Kafka, Spark, Vertica; 4. PSTL drill down (Parallelized Streaming Transformation Loader); 5. Vertica Performance!
  • 5. My Background. Playtika, VP of Big Data: Flume, Kafka, Spark, ZK, Yarn, Vertica [Jay Kreps (Kafka), Michael Armbrust (Spark SQL), Chris Bowden (Dev Demigod)]. MIS Director of several start-up companies: Dataflex, a 4GL RDBMS [E.F. Codd]. Self-employed Consultant: intercept Dataflex db calls to store and retrieve data to/from Btrieve and IBM DB2 Mainframe; FoxPro, Sybase, MSSQL Server beta; Design Patterns: Elements of Reusable Object-Oriented Software [The Gang of Four]. Microsoft: Dev Manager, Architect CLR/.Net Framework, Product Unit Manager Technical Strategy Group; inventor of “Shuttle”, a Microsoft product in use since 1999, a distributed ETL based on MSMQ which influenced MSSQL DTS (SQL SSIS) [Joe Celko, Ralph Kimball, Steve Butler (Microsoft Architect)]. Twitter, Manager of Analytics Data Warehouse: Core Storage (Hadoop, HBase, Cassandra, Blob Store); Analytics Infra (MySQL, PIG, Vertica, n-Petabyte capacity with Multi-DC DR) [Prof. Michael Stonebraker, Ron Cormier (Vertica Internals)].
  • 6. A Quest. With attributes of: Operational Robustness; High Availability; stronger durability guarantees; Idempotent (an operation that is safe to repeat); Productivity; Analytics; Streaming, Machine Learning, BI, BA, Data Science; Rich Development environment; strongly typed, OO, functional, with support for set-based logic and aggregations (SQL); Performance; scalable in every tier; MPP for Transformations, Reads & Writes. A Unified Data Pipeline with Parallelism from Streaming Data through Data Transformations to Data Storage (Semi-Structured, Structured, and Relational Data).
  • 8. ELT. “Extract, Load, Transform is an alternative to Extract, transform, load (ETL) used with data lake implementations. In ELT models the data is not processed on entry to the data lake which enables faster loading times. But does require sufficient processing within the data processing engine to carry out the transform on demand and return the results to the consumer in a timely manner. Since the data is not processed on entry to the data lake the query and schema do not need to be defined a-priori (often the schema will be available during load since many data sources are extracts from databases or similar structured data systems and hence have an associated schema).” - https://en.wikipedia.org/wiki/Extract,_load,_transform
  • 9. Lambda Architecture. “Essentially, the speed layer (streaming) is responsible for filling the "gap" caused by the batch layer's lag in providing views based on the most recent data” - https://en.wikipedia.org/wiki/Lambda_architecture
  • 10. Questioning the Lambda Architecture, by Jay Kreps. The Lambda Architecture has its merits, but alternatives are worth exploring. “As someone who designs infrastructure, I think the glaring question is this: why can’t the stream processing system just be improved to handle the full problem set in its target domain? Why do you need to glue on another system? Why can’t you do both real-time processing and also handle the reprocessing when code changes? Stream processing systems already have a notion of parallelism; why not just handle reprocessing by increasing the parallelism and replaying history very, very fast? The answer is that you can do this, and I think this it is actually a reasonable alternative architecture if you are building this type of system today.”
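The reprocessing idea in the quote can be sketched in a few lines. This is a toy illustration only, not code from the talk: the retained `log`, the `process` helper, and the two transform versions are all hypothetical names, and a plain Python list stands in for a retained message history.

```python
def process(log, transform, start_offset=0):
    """Apply transform to every retained message from start_offset onward."""
    return [transform(message) for message in log[start_offset:]]


# Hypothetical retained input history; in a real system this would be the
# message log kept by the streaming platform.
log = ["3", "5", "8"]


def transform_v1(message):
    # Original processing code.
    return int(message)


def transform_v2(message):
    # Changed processing code that should apply to all history.
    return int(message) * 2


# Normal operation with the original code.
live_view = process(log, transform_v1)

# After a code change: replay the same history from offset 0 with the new code,
# instead of maintaining a separate batch system for recomputation.
rebuilt_view = process(log, transform_v2)
```

Because the input history is never modified, the replay can be repeated safely, and in a partitioned system each partition could be replayed by its own worker to speed it up.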
  • 11. Original Architecture (ETL): Playtika Santa Monica original ETL architecture [diagram, steps numbered 1 to 5]. Components shown: game applications (GameX, GameY, GameZ) with local data warehouses; REST API carrying unified-schema JSON; Apache Flume™; ETL Java™ parser & loader (Extract, Transform, Load); COPY into the MPP columnar DW (HP Vertica™ cluster) for analytics of structured relational and aggregated data; single sources of truth mapped to a global SOT (UserId <-> UserGId). ID types shown per game: UserId: INT, SessionId: UUId (36); UserId: INT, SessionId: UUId (32); UserId: varchar(32), SessionId: varchar(255).
  • 12. Agenda 1. Background 2. PSTL overview Parallelized Streaming Transformation Loader
  • 13. New PSTL Architecture (Parallelized Streaming Transformation Loader). Diagram (numbered steps 1 to 5 in the original figure): Game Applications (Bingo Blitz: UserId: INT, SessionId: UUID (36); Slotomania: UserId: INT, SessionId: UUID (32); WSOP: UserId: varchar(32), SessionId: varchar(255)); Local Data Warehouses; REST API or Local Kafka; Unified Schema JSON; Real-Time Messaging: Apache Kafka™; Resilient Distributed Datasets: Apache Spark™, Hadoop™, Parquet™; MPP Columnar DW: HP Vertica™. Analytics of [semi]Structured [non]Relational Data Stores. Legend as shown on the slide: ✓ Real-Time Streaming, ✓ Machine Learning, ✓ Semi-Structured Raw JSON Data, ✖ Structured (non)relational Parquet Data, Structured Relational and Aggregated Data.
  • 14. Agenda 1. Background 2. PSTL overview Parallelized Streaming Transformation Loader 3. Parallelism in Kafka, Spark, Vertica
  • 15. Apache Kafka™ is a distributed, partitioned, replicated commit log service. Diagram: Producers, Kafka Cluster (Broker), Consumers.
  • 16. Apache Kafka™: A topic is a category or feed name to which messages are published. For each topic, the Kafka cluster maintains a partitioned log. Each partition is an ordered, immutable sequence of messages that is continually appended to: a commit log. The messages in the partitions are each assigned a sequential id number called the offset that uniquely identifies each message within the partition. The Kafka cluster retains all published messages, whether or not they have been consumed, for a configurable period of time.
  • 17. Spark RDD: A Resilient Distributed Dataset [in memory] represents an immutable, partitioned collection of elements that can be operated on in parallel. Diagram: partitions spread across Node 1, Node 2, Node 3, Node…; RDD 1 has Partitions 1 to 3, RDD 2 has Partitions 1 to 256 in four groups (1 to 64, 65 to 128, 129 to 192, 193 to 256), and RDD 3 has Partitions 1 to 3.
  • 18. Vertica Hashing & Partitioning: An Initiator Node shuffles data to storage nodes.
  • 19. Agenda 1. Background 2. PSTL overview Parallelized Streaming Transformation Loader 3. Parallelism in Kafka, Spark, Vertica 4. PSTL drill down Parallelized Streaming Transformation Loader
  • 20. PSTL drill down. Diagram: Apache Kafka Cluster, Apache Hadoop™ Spark™ Cluster, HPE Vertica™ Cluster (cluster nodes are drawn as Node 1 to Node 4 and Node 1 to Node 3). Kafka topics, with log positions labeled "old" to "new": Topic “appId_1”: {"appId": 1, "sessionId": "1", "userId": "JG"}, {"appId": 1, "sessionId": "2", "userId": "CB"}; Topic “appId_2”: {"appId": 2, "sessionId": "3", "userId": "KY"}, {"appId": 2, "sessionId": "4", "userId": "KA"}; Topic “appId_3”: {"appId": 3, "sessionId": "6", "userId": "42"}, {"appId": 3, "sessionId": "7", "userId": "42"}. Tables: Kafka Table (appId, TopicOffsetRange, Batch_Id); SessionMax Table (sessionGIdMax: Int); UserMax Table (userGIdMax: Int); appSession (appId: Int, sessionId: varchar(255), sessionGId: Int); appUser (appId: Int, userId: varchar(255), userGId: Int). RDDs: appSessionMap_RDD (appId: Int, sessionId: String, sessionGId: Int); appUserMap_RDD (appId: Int, userId: String, userGId: Int). Steps: (1) Start a Spark Driver per APP. (2) Import Users. (3) Import recent Sessions. (4) Spark Kafka [non]Streaming job per APP (read partition/offset range); result: appId {Game events, Users, Sessions,…} Partition 1..n RDDs. (5) select for update; update max GId; assign userGIds to userId and sessionGIds to sessionId; result: appId Users & Sessions Partition 1..n RDDs; appId appUserMap_RDD.union(assignedID_RDD). (6) Hash(userGId) to RDD partitions with affinity to Vertica Node(s); result: appId Users & Sessions Partition 1..n RDDs. (7) userGIdRDD.foreachPartition {…stream.writeTo(socket)...}; copy jackg.DIM_USER with source SPARK(port='12345', nodes='node0001:4, node0002:4, node0003:4') direct; (8) Idempotent: write raw JSON to HDFS. (9) Idempotent: write parsed JSON to .ORC on HDFS. (10) Update MySQL Kafka offsets.
  • 21. Agenda 1. Background 2. PSTL overview Parallelized Streaming Transformation Loader 3. Parallelism in Kafka, Spark, Vertica 4. PSTL drill down Parallelized Streaming Transformation Loader 5. Vertica Performance!
  • 22. Impressive Parallel COPY Performance: loaded 2.42 billion rows (451 GB) in 7 min 35 sec on an 8-node cluster. Key Takeaways: Parallel Kafka reads to Spark RDD (in memory) with parallel writes to Vertica via a TCP server ROCKS! COPY 36 TB/hour with an 81-node cluster. No ephemeral nodes needed for ingest. Kafka read parallelism to Spark RDD partitions. A priori hash() in Spark RDD partitions (in memory). TCP server as a Vertica User Defined Copy Source. Single COPY does not preallocate memory across nodes. http://www.vertica.com/2014/09/17/how-vertica-met-facebooks-35-tbhour-ingest-sla/ * 270 Nodes (215 Data Nodes)
  • 23. A noETL Parallel Streaming Transformation Loader using Kafka, Spark & Vertica. Jack Gudenkauf, CEO & Founder of BigDataInfra. https://www.linkedin.com/in/jackglinkedin JackGudenkauf@gmail.com THANK YOU
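
Steps 5 and 6 of the drill-down slide (assign global ids, then hash them to partitions with node affinity) can be sketched in plain Python. This is only an illustration under stated assumptions, not the deck's Spark job: `assign_gids` and `partition_by_gid` are hypothetical helper names, a dictionary stands in for appUserMap_RDD, an in-memory counter stands in for the "select for update" on the UserMax table, a modulo stands in for Hash(userGId), and reading `node0001:4` as four partitions per node is an interpretation of the COPY statement.

```python
from collections import defaultdict


def assign_gids(events, user_map, max_gid):
    """Step 5 sketch: give every unseen (appId, userId) pair the next global id.

    user_map stands in for appUserMap_RDD and max_gid for userGIdMax;
    the names come from the slide, the logic is an assumption.
    """
    new_map = dict(user_map)  # copy, so the caller's mapping is untouched
    for event in events:
        key = (event["appId"], event["userId"])
        if key not in new_map:
            max_gid += 1
            new_map[key] = max_gid
    return new_map, max_gid


def partition_by_gid(events, user_map, nodes, partitions_per_node):
    """Step 6 sketch: route each event to a partition owned by exactly one node."""
    total = len(nodes) * partitions_per_node
    partitions = defaultdict(list)
    for event in events:
        gid = user_map[(event["appId"], event["userId"])]
        index = gid % total  # stand-in for Hash(userGId)
        node = nodes[index // partitions_per_node]
        partitions[(node, index)].append({**event, "userGId": gid})
    return dict(partitions)


# Sample messages taken from the slide's Kafka topics.
events = [
    {"appId": 1, "sessionId": "1", "userId": "JG"},
    {"appId": 1, "sessionId": "2", "userId": "CB"},
    {"appId": 2, "sessionId": "3", "userId": "KY"},
    {"appId": 2, "sessionId": "4", "userId": "KA"},
    {"appId": 3, "sessionId": "6", "userId": "42"},
    {"appId": 3, "sessionId": "7", "userId": "42"},
]

user_map, max_gid = assign_gids(events, user_map={}, max_gid=0)
partitions = partition_by_gid(
    events, user_map, ["node0001", "node0002", "node0003"], partitions_per_node=4
)

for (node, index), rows in sorted(partitions.items()):
    print(node, index, [row["userGId"] for row in rows])
```

Because ids are only handed out to pairs that are not yet in the map, replaying the same batch leaves the mapping and the counter unchanged, which is the kind of repeat-safety the slide labels "Idempotent" for its HDFS writes. Both sessions of user "42" land in the same partition, so all rows for one userGId would travel over the same socket.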

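The load figures on the performance slide can be turned into rates with a few lines of arithmetic. This is a back-of-the-envelope sketch, assuming decimal units (1 TB = 1000 GB) and perfectly linear scaling per node; the slide does not say how its 36 TB/hour figure for 81 nodes was derived, so the projection below is only a consistency check.

```python
# Figures quoted on the slide.
rows = 2.42e9               # rows loaded
gigabytes = 451             # data volume in GB
seconds = 7 * 60 + 35       # 7 min 35 sec
nodes = 8                   # cluster size for the measured load

rows_per_second = rows / seconds
gb_per_second = gigabytes / seconds
tb_per_hour = gb_per_second * 3600 / 1000    # assumes 1 TB = 1000 GB
tb_per_hour_per_node = tb_per_hour / nodes

# Assumption: throughput scales linearly with node count.
projected_81_nodes = tb_per_hour_per_node * 81

print(f"{rows_per_second:,.0f} rows/s")
print(f"{tb_per_hour:.2f} TB/hour on {nodes} nodes")
print(f"{projected_81_nodes:.1f} TB/hour projected on 81 nodes")
```

Under these assumptions the 8-node run works out to roughly 3.57 TB/hour, and the linear projection to 81 nodes comes to roughly 36.1 TB/hour, which is in line with the slide's "COPY 36 TB/Hour with 81 Node cluster".
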
Editor's Notes

  1. BigDataInfra: compelled to start a company to help companies with the same issues I have seen throughout my career. BigDataInfra specializes in the system architectures we will be discussing. IMHO, noETL is the latest discussion of noE, as T&L will always need to eventually happen, even with “schema-less” schema on read, Data Lakes, etc. noSQL should really be noRel, and that is being proven out by everyone putting SQL consumption on non-RDBMS.
  2. My experience and influencers framed my architectural decisions
  3. http://kafka.apache.org/documentation.html Authored at LinkedIn, used by Twitter, …
  4. http://kafka.apache.org/documentation.html Authored at LinkedIn, used by Twitter, …
  5. We use Spark RDD partitioned data to parallelize operations to/from affinitized Vertica nodes, e.g., 3 Kafka partitions would be read in parallel into 3 Spark RDD partitions. Typically you want 2-4 partitions for each CPU in your cluster. Normally, Spark tries to set the number of partitions automatically based on your cluster. However, you can also set it manually by passing it as a second parameter to parallelize (e.g. sc.parallelize(data, 10)).
  6. SHUFFLE!
  7. BigDataInfra: compelled to start a company to help companies with the same issues I have seen throughout my career. BigDataInfra specializes in the system architectures we will be discussing. IMHO, noETL is the latest discussion of noE, as T&L will always need to eventually happen, even with “schema-less” schema on read, Data Lakes, etc. noSQL should really be noRel, and that is being proven out by everyone putting SQL consumption on non-RDBMS.
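
The partition guidance in note 5 (2-4 partitions per CPU, or an explicit count passed to parallelize) can be illustrated without a Spark cluster. The sketch below is plain Python under stated assumptions: `partition_count_range` and `split_into_partitions` are hypothetical helpers, and the even contiguous split only approximates the idea of dividing a collection into a requested number of slices; it is not Spark's own implementation.

```python
def partition_count_range(cpu_cores, low=2, high=4):
    """Return the (min, max) partition counts implied by the 2-4 per CPU guideline."""
    return cpu_cores * low, cpu_cores * high


def split_into_partitions(items, num_partitions):
    """Split a list into num_partitions contiguous slices of near-equal size."""
    n = len(items)
    return [
        items[i * n // num_partitions:(i + 1) * n // num_partitions]
        for i in range(num_partitions)
    ]


# A hypothetical 8-core cluster.
print(partition_count_range(8))

# Ten records split three ways, mirroring 3 Kafka partitions -> 3 RDD partitions.
parts = split_into_partitions(list(range(10)), 3)
for index, part in enumerate(parts):
    print(index, part)
```

Each slice could then be processed independently, which is the property the deck relies on when it pairs Kafka partitions with Spark RDD partitions and, further downstream, with Vertica nodes.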