ETL, ELT, and Lambda architectures have evolved into a [non]streaming, general-purpose data ingestion pipeline that is scalable through distributed processing, for Big Data analytics over hybrid data warehouses in Hadoop and MPP columnar stores like HPE Vertica.
Bio: Jack Gudenkauf (https://www.linkedin.com/in/jackglinkedin) has over twenty-nine years of experience designing and implementing Internet-scale distributed systems. Jack is currently the CEO & Founder of the startup BigDataInfra. He was previously VP of Big Data at Playtika and a hands-on manager of the Twitter Analytics Data Warehouse team, spent 15 years at Microsoft shipping 15 products, and before Microsoft ran his own consulting company after beginning his career as an MIS Director at several startup companies.
A noETL Parallel Streaming Transformation Loader using Spark, Kafka & Vertica
1. A noETL
Parallel Streaming Transformation Loader
using Kafka, Spark & Vertica
Jack Gudenkauf
CEO & Founder of
BigDataInfra
https://www.linkedin.com/in/jackglinkedin
JackGudenkauf@gmail.com
2. WARNING!
Slides that follow
violate PowerPoint best practices
in favor of providing densely
packed information for later review
5. My Background
Playtika, VP of Big Data
Flume, Kafka, Spark, ZK, Yarn, Vertica. [Jay Kreps (Kafka), Michael Armbrust (Spark SQL),
Chris Bowden (Dev Demigod)]
MIS Director of several start-up companies
Dataflex, a 4GL RDBMS. [E.F. Codd]
Self-employed Consultant
Intercepted Dataflex DB calls to store and retrieve data to/from Btrieve and mainframe IBM DB2
FoxPro, Sybase, MSSQL Server beta
Design Patterns: Elements of Reusable Object-Oriented Software [The Gang of Four]
Microsoft: Dev Manager, Architect of the CLR/.NET Framework,
Product Unit Manager in the Technical Strategy Group
Inventor of “Shuttle”, a Microsoft product in use since 1999
A distributed ETL based on MSMQ, which influenced MS SQL DTS (SQL SSIS)
[Joe Celko, Ralph Kimball, Steve Butler (Microsoft Architect)]
Twitter, Manager of Analytics Data Warehouse
Core Storage; Hadoop, HBase, Cassandra, Blob Store
Analytics Infra; MySQL, PIG, Vertica (n-Petabyte capacity with Multi-DC DR)
[Prof. Michael Stonebraker, Ron Cormier (Vertica Internals)]
6. A Quest
With attributes of
Operational Robustness
High Availability
Stronger durability guarantees
Idempotent (an operation that is safe to repeat)
Productivity
Analytics
Streaming, Machine Learning, BI, BA, Data Science
Rich Development env.
Strongly typed, OO, Functional, with support for set based logic and
aggregations (SQL)
Performance
Scalable in every tier
MPP for Transformations, Reads & Writes
A Unified Data Pipeline with Parallelism
from Streaming Data
through Data Transformations
to Data Storage (Semi-Structured, Structured, and Relational Data)
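The idempotency attribute above (an operation that is safe to repeat) can be sketched in a few lines. This is a hypothetical illustration keyed by Kafka-style (partition, offset) coordinates, not the actual PSTL implementation:

```python
# Hypothetical sketch: an idempotent load keyed by (partition, offset).
# Re-applying the same message is a no-op, so replays after a failure
# cannot double-count rows. All names here are illustrative.

def idempotent_load(store, partition, offset, row):
    """Apply `row` only if this (partition, offset) has not been seen."""
    key = (partition, offset)
    if key in store["seen"]:
        return False          # duplicate delivery: safe to repeat
    store["seen"].add(key)
    store["rows"].append(row)
    return True

store = {"seen": set(), "rows": []}
idempotent_load(store, 0, 42, {"user_id": 7})
idempotent_load(store, 0, 42, {"user_id": 7})   # replayed message, ignored
```

Because replays are harmless, the pipeline can restart from an earlier offset after a crash without deduplication downstream.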
8. ELT
“Extract, Load, Transform is an alternative to Extract,
transform, load (ETL) used with data lake implementations.
In ELT models the data is not processed on entry to the data
lake which enables faster loading times.
But [it] does require sufficient processing within the data
processing engine to carry out the transform on demand and
return the results to the consumer in a timely manner.
Since the data is not processed on entry to the data lake the
query and schema do not need to be defined a-priori (often
the schema will be available during load since many data
sources are extracts from databases or similar structured data
systems and hence have an associated schema).”
https://en.wikipedia.org/wiki/Extract,_load,_transform
9. Lambda Architecture
“Essentially, the speed layer (streaming) is responsible for filling the "gap" caused by the batch
layer's lag in providing views based on the most recent data” -
https://en.wikipedia.org/wiki/Lambda_architecture
10. Questioning the Lambda Architecture
by Jay Kreps
The Lambda Architecture has its merits, but alternatives
are worth exploring.
“As someone who designs infrastructure, I think the
glaring question is this: why can’t the stream processing
system just be improved to handle the full problem set in
its target domain? Why do you need to glue on another
system? Why can’t you do both real-time processing and
also handle the reprocessing when code changes? Stream
processing systems already have a notion of parallelism;
why not just handle reprocessing by increasing the
parallelism and replaying history very, very fast? The
answer is that you can do this, and I think it is
actually a reasonable alternative architecture if you are
building this type of system today.”
11. Playtika Santa Monica Original Architecture (ETL)
[Diagram: Game applications (GameX, GameY, GameZ) send Unified Schema JSON via a REST API to Apache Flume™; a Java™ ETL Parser & Loader performs Extract, Transform, Load (including UserId <-> UserGId resolution) and COPYs into an HP Vertica™ Cluster, an MPP columnar DW, for analytics of structured, relational, and aggregated data. Local Data Warehouses act as Single Sources of Truth feeding a Global SOT.]
The games' identifier schemas differ:
UserId: INT, SessionId: UUID (36)
UserId: INT, SessionId: UUID (32)
UserId: varchar(32), SessionId: varchar(255)
13. New PSTL Architecture
[Diagram: Game applications send Unified Schema JSON via a REST API or local Kafka to Apache Kafka™ (Real-Time Messaging); Apache Spark™ Resilient Distributed Datasets form the Parallelized Streaming Transformation Loader, which writes in parallel (MPP) to the HP Vertica™ MPP columnar DW and to Parquet™ on Hadoop™. Enables analytics of [semi]structured [non]relational data stores: ✓ Real-Time Streaming, ✓ Machine Learning, ✓ Semi-Structured Raw JSON Data, ✖ Structured (non)relational Parquet Data, plus Structured Relational and Aggregated Data in Local Data Warehouses.]
Per-game identifier schemas:
Bingo Blitz: UserId INT, SessionId UUID (36)
Slotomania: UserId INT, SessionId UUID (32)
WSOP: UserId varchar(32), SessionId varchar(255)
15. Apache Kafka™ is a distributed, partitioned, replicated commit log service.
[Diagram: multiple Producers publish to the Kafka Cluster (Brokers); multiple Consumers read from it.]
16. Apache Kafka™
A topic is a category or feed name to which messages are published.
For each topic, the Kafka cluster maintains a partitioned log.
Each partition is an ordered, immutable sequence of messages that is continually appended to: a commit log.
The messages in a partition are each assigned a sequential id number, called the offset, that uniquely identifies each message within the partition.
The Kafka cluster retains all published messages, whether or not they have been consumed, for a configurable period of time.
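The partitioned-log model described above can be sketched in a few lines of plain Python. `TopicLog` is a hypothetical stand-in for illustration, not Kafka's actual API:

```python
# Minimal sketch of a topic as a set of append-only partition logs,
# where each message gets a sequential offset and consumers track
# their own read position. Not Kafka's real API.

class TopicLog:
    def __init__(self, num_partitions):
        self.partitions = [[] for _ in range(num_partitions)]

    def append(self, partition, message):
        log = self.partitions[partition]
        log.append(message)
        return len(log) - 1           # the message's offset

    def read(self, partition, offset):
        # Messages are retained regardless of consumption, so any
        # consumer can (re)read from any offset it chooses.
        return self.partitions[partition][offset:]

topic = TopicLog(num_partitions=2)
topic.append(0, "login")              # offset 0 in partition 0
topic.append(0, "spin")               # offset 1 in partition 0
topic.append(1, "logout")             # offset 0 in partition 1
```

Retention by time rather than by consumption is what makes reprocessing (replaying history from an earlier offset) possible.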
17. Spark RDD
A Resilient Distributed Dataset [in memory] represents an immutable, partitioned collection of elements that can be operated on in parallel.
[Diagram: RDD partitions spread across cluster nodes, e.g. RDD 1's partitions 1-3 and RDD 3's partitions 1-3 on Nodes 1-3, and RDD 2's 256 partitions split into ranges of 64 per node (1-64, 65-128, 129-192, 193-256).]
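The idea of operating on a partitioned collection in parallel can be illustrated with a plain-Python stand-in: rows are hash-assigned to a fixed number of partitions so that each partition can later be processed, or written to an affinitized target node, independently. The function names are illustrative, not Spark's API:

```python
# Illustrative sketch of hash-partitioning rows up front, so every
# partition can be handled in parallel and mapped to one target node
# with no shuffle at load time. Pure-Python stand-in, not Spark code.

def hash_partition(rows, key_fn, num_partitions):
    partitions = [[] for _ in range(num_partitions)]
    for row in rows:
        p = hash(key_fn(row)) % num_partitions
        partitions[p].append(row)
    return partitions

rows = [{"user_id": i, "event": "spin"} for i in range(100)]
parts = hash_partition(rows, key_fn=lambda r: r["user_id"], num_partitions=4)
```

Because the partition assignment is a pure function of the key, every worker agrees on where a row belongs without coordination.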
22. Impressive Parallel COPY Performance
Loaded 2.42 billion rows (451 GB) in 7 min 35 sec on an 8-node cluster
Key Takeaways
Parallel Kafka reads into Spark RDDs (in memory) with parallel writes to Vertica via a TCP server ROCKS!
COPY at 36 TB/hour with an 81-node cluster
No ephemeral nodes needed for ingest
Kafka read parallelism carries through to Spark RDD partitions
A priori hash() in Spark RDD partitions (in memory)
TCP server as a Vertica User-Defined Copy Source
A single COPY does not preallocate memory across nodes
http://www.vertica.com/2014/09/17/how-vertica-met-facebooks-35-tbhour-ingest-sla/
* 270 Nodes ( 215 Data Nodes )
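The quoted rates can be sanity-checked with simple arithmetic: 451 GB in 7 min 35 sec is roughly 3.6 TB/hour on 8 nodes, which, extrapolated linearly to 81 nodes, gives about 36 TB/hour, matching the slide:

```python
# Sanity check of the slide's numbers: 2.42 billion rows (451 GB) in
# 7 min 35 sec on 8 nodes, extrapolated linearly to 81 nodes.

gb_loaded = 451
seconds = 7 * 60 + 35                              # 455 s
rate_tb_per_hour_8_nodes = gb_loaded / seconds * 3600 / 1000
rate_tb_per_hour_81_nodes = rate_tb_per_hour_8_nodes * 81 / 8

print(round(rate_tb_per_hour_8_nodes, 2))          # ~3.57 TB/hour
print(round(rate_tb_per_hour_81_nodes, 1))         # ~36.1 TB/hour
```

Linear scaling is of course an assumption, but it shows the 8-node measurement and the 81-node claim are consistent with each other.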
23. A noETL
Parallel Streaming Transformation Loader
using Kafka, Spark & Vertica
Jack Gudenkauf
CEO & Founder of
BigDataInfra
https://www.linkedin.com/in/jackglinkedin
JackGudenkauf@gmail.com
THANK YOU
Editor's Notes
BigDataInfra – compelled to start a company to help companies with the same issues I have seen throughout my career.
BigDataInfra specializes in the system architectures we will be discussing
Imho, noETL is really the latest discussion of "noE", as the T&L will always eventually need to happen, even with "schema-less" schema-on-read, Data Lakes, etc.
noSQL should really be "noRel", as is being proven out by everyone putting SQL consumption layers on non-RDBMS stores.
My experience and influencers framed my architectural decisions
http://kafka.apache.org/documentation.html
Authored at LinkedIn, used by Twitter, …
We use Spark RDD partitioned data to parallelize operations to/from affinitized Vertica nodes
e.g., 3 Kafka Partitions would read in parallel into 3 Spark RDD Partitions
Typically you want 2-4 partitions for each CPU in your cluster.
Normally, Spark tries to set the number of partitions automatically based on your cluster.
However, you can also set it manually by passing it as a second parameter to parallelize (e.g., sc.parallelize(data, 10))
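The rule of thumb above can be sketched in plain Python; `split_into_partitions` is only an illustrative stand-in for what sc.parallelize(data, n) does, not Spark's actual slicing:

```python
# Sketch of the partition-sizing rule of thumb (2-4 partitions per CPU),
# plus a stand-in for splitting a collection into n roughly equal slices.
# Spark slices contiguously; round-robin is used here for simplicity.

def suggested_partitions(total_cpus, per_cpu=3):
    return total_cpus * per_cpu

def split_into_partitions(data, n):
    slices = [[] for _ in range(n)]
    for i, item in enumerate(data):
        slices[i % n].append(item)
    return slices

n = suggested_partitions(total_cpus=16)            # 48 partitions
parts = split_into_partitions(list(range(100)), 10)
```

More partitions than CPUs keeps every core busy and lets stragglers be balanced out, at the cost of some per-partition overhead.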
SHUFFLE!