Strata EU tutorial - Architectural Considerations for Hadoop Applications
1. Architectural Considerations for Hadoop Applications
Strata+Hadoop World Barcelona – November 19th 2014
tiny.cloudera.com/app-arch-slides
Mark Grover | @mark_grover
Ted Malaska | @TedMalaska
Jonathan Seidman | @jseidman
Gwen Shapira | @gwenshap
50. 50
Data Storage – Format Considerations
[Diagram: raw log files stored as plain text]
51. 51
Data Storage – Format Considerations
[Diagram: many plain-text log files accumulating as more web servers generate more logs]
52. 52
Data Storage – Compression
[Diagram: weighing compression codecs – Snappy is appealing but not splittable on its own; splittable alternatives come with their own trade-offs]
54. 54
Hadoop File Types
• Formats designed specifically to store and process data on Hadoop:
– File based – SequenceFile
– Serialization formats – Thrift, Protocol Buffers, Avro
– Columnar formats – RCFile, ORC, Parquet
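Not from the slides themselves, but as a rough sketch of how a columnar format fits into this flow: the snippet below (Scala, using the newer Spark SparkSession API; the paths and record layout are invented for illustration) reads raw plain-text logs and writes the parsed records out as Parquet for the processed data sets.

```scala
import org.apache.spark.sql.SparkSession

object LogsToParquet {
  // Hypothetical layout for a parsed log record.
  case class LogRecord(ip: String, timestamp: Long, url: String)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("logs-to-parquet").getOrCreate()
    import spark.implicits._

    // Raw data: plain-text web logs, assumed here to be tab-separated after pre-processing.
    val records = spark.read.textFile("hdfs:///data/raw/weblogs").map { line =>
      val f = line.split("\t")
      LogRecord(f(0), f(1).toLong, f(2))
    }

    // Processed data: stored in a columnar format for analytical queries.
    records.write.parquet("hdfs:///data/processed/weblogs")
    spark.stop()
  }
}
```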
96. 96
Processing Engines
• MapReduce
• Abstractions
• Spark
• Spark Streaming
• Impala
97. 97
MapReduce
• Oldie but goody
• Restrictive framework, but innovative workarounds
• Extreme Batch
98. 98
MapReduce Basic High Level
[Diagram: mappers read blocks of data from HDFS (replicated), spill partitioned, sorted data to the local (native) file system; reducers copy their partitions locally and write the output files]
99. 99
MapReduce Innovation
• Mapper Memory Joins
• Reducer Memory Joins
• Bucketed Sorted Joins
• Cross Task Communication
• Windowing
• And Much More
100. 100
Abstractions
• SQL
– Hive
• Script/Code
– Pig: Pig Latin
– Crunch: Java/Scala
– Cascading: Java/Scala
101. 101
Spark
• The New Kid that isn’t that New Anymore
• Easily 10x less code
• Extremely Easy and Powerful API
• Very good for machine learning
• Scala, Java, and Python
• RDDs
• DAG Engine
102. 102
Spark - DAG
103. 103
Spark - DAG
[Diagram: DAG of RDD operations – textFile → filter → keyBy on one branch, textFile → keyBy on the other, then join → filter → take]
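The slides don't include code for this DAG, but a rough Scala sketch of the same shape (the input paths and field layout are made up for illustration) might be:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object DagExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("dag-example"))

    // One branch: textFile -> filter -> keyBy; the other: textFile -> keyBy.
    val clicks = sc.textFile("hdfs:///data/raw/clicks")
      .filter(_.nonEmpty)
      .keyBy(_.split("\t")(0))          // key by the user id in the first column
    val users = sc.textFile("hdfs:///data/raw/users")
      .keyBy(_.split("\t")(0))

    // join -> filter -> take, mirroring the tail of the DAG on the slide.
    val sample = clicks.join(users)
      .filter { case (_, (_, user)) => user.contains("premium") }
      .take(10)

    sample.foreach(println)
    sc.stop()
  }
}
```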
104. 104
Spark - DAG
[Diagram: the same DAG annotated for fault recovery – completed partitions are marked good, a lost block is recomputed by replaying only its lineage, and not-yet-computed stages remain in the future]
105. 105
Spark Streaming
• Calling Spark in a Loop
• Extends RDDs with DStream
• Very Little Code Changes from ETL to Streaming
106. 106
Spark Streaming
[Diagram: for each batch interval (pre-first, first, second batch) the receiver turns source data into an RDD, and a single pass of filter → count → print runs over that batch's RDD]
107. 107
Spark Streaming
[Diagram: the same per-batch filter → count → print pipeline, but each batch also merges its result into a stateful RDD (stateful RDD 1, stateful RDD 2, …) that carries state from one batch to the next]
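No code appears on these slides, but a minimal Spark Streaming sketch of the per-batch pipeline and the stateful-RDD idea (the socket source, batch interval, and paths are assumptions) could look like this:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("streaming-sketch")
    val ssc  = new StreamingContext(conf, Seconds(10))   // 10-second batches
    ssc.checkpoint("hdfs:///tmp/streaming-checkpoints")  // required for stateful operations

    // Hypothetical source: lines of clickstream data arriving on a socket.
    val lines = ssc.socketTextStream("localhost", 9999)

    // The per-batch pipeline from the slide: filter -> count -> print.
    val errors = lines.filter(_.contains("ERROR"))
    errors.count().print()

    // The "stateful RDD": carry a running total across batches.
    val runningTotal = errors.map(_ => ("errors", 1L))
      .updateStateByKey[Long] { (newCounts: Seq[Long], state: Option[Long]) =>
        Some(state.getOrElse(0L) + newCounts.sum)
      }
    runningTotal.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```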
108. 108
Impala
• MPP Style SQL Engine on top of Hadoop
• Very Fast
• High Concurrency
• Analytic window functions (as of CDH 5.2 / Impala 2.0)
109. 109
Impala – Broadcast Join
[Diagram: broadcast join – every Impala daemon caches 100% of the smaller table; each daemon streams its local data blocks of the bigger table through the hash join function and emits output]
110. 110
Impala – Partitioned Hash Join
[Diagram: partitioned hash join – each Impala daemon caches roughly 33% of the smaller table; hash partitioners route rows from both tables to the daemon owning that partition, where the hash join function emits output]
111. 111
Impala vs Hive
• Very different approaches
• We may see convergence at some point
• But for now
– Impala for speed
– Hive for batch
115. 115
Why sessionize?
Helps answer questions like:
• What is my website’s bounce rate?
– i.e. what percentage of visitors don’t go past the landing page? (See the sketch below.)
• Which marketing channels (e.g. organic search, display ads, etc.) are leading to the most sessions?
– Which of those lead to the most conversions (e.g. people buying things, signing up, etc.)?
• Do attribution analysis – which channels are responsible for the most conversions?
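Not from the original deck, but to make the bounce-rate definition above concrete, here is a minimal Scala sketch over already-sessionized data; the (sessionId, pageCount) shape and the sample values are invented for illustration:

```scala
object BounceRate {
  def main(args: Array[String]): Unit = {
    // Hypothetical input: one record per session with the number of pages viewed.
    val sessions: Seq[(String, Int)] = Seq(
      ("244.157.45.12+1413580110", 1),   // bounced: never left the landing page
      ("244.157.45.12+1413583199", 4),
      ("10.1.2.3+1413583200", 1)
    )

    // Bounce rate = sessions that never went past the landing page / all sessions.
    val bounces    = sessions.count { case (_, pageCount) => pageCount == 1 }
    val bounceRate = bounces.toDouble / sessions.size
    println(f"Bounce rate: ${bounceRate * 100}%.1f%%")   // prints: Bounce rate: 66.7%
  }
}
```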
116. 116
Sessionization
244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463 "http://bestcyclingreviews.com/top_online_shops" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36"
244.157.45.12+1413580110
244.157.45.12 - - [17/Oct/2014:21:59:59 ] "GET /Store/cart.jsp?productID=1023 HTTP/1.0" 200 3757 "http://www.casualcyclist.com" "Mozilla/5.0 (Linux; U; Android 2.3.5; en-us; HTC Vision Build/GRI40) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1"
244.157.45.12+1413583199
117. 117
How to Sessionize?
1. Given a list of clicks, determine which clicks came from the same user (partitioning, ordering)
2. Given a particular user's clicks, determine if a given click is part of a new session or a continuation of the previous session (identifying session boundaries)
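The deck doesn't show an implementation, but a minimal Spark (Scala) sketch of both steps might look like the following; the tab-separated input layout, the 30-minute timeout, and all paths are assumptions:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object Sessionize {
  // Hypothetical parsed click: source IP, epoch timestamp in seconds, requested URL.
  case class Click(ip: String, timestamp: Long, url: String)

  val SessionTimeoutSecs = 30 * 60   // assumed 30-minute inactivity timeout

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("sessionize"))

    val clicks = sc.textFile("hdfs:///data/raw/clicks_tsv").map { line =>
      val f = line.split("\t")          // assumed layout: ip \t epochSeconds \t url
      Click(f(0), f(1).toLong, f(2))
    }

    val sessionized = clicks
      // Step 1: group clicks by user (approximated here by IP) and order them by time.
      .groupBy(_.ip)
      .flatMap { case (ip, userClicks) =>
        val ordered = userClicks.toSeq.sortBy(_.timestamp)
        // Step 2: a click starts a new session if it follows a gap longer than the timeout.
        var prevTs    = -1L
        var sessionId = ""
        ordered.map { click =>
          if (prevTs < 0 || click.timestamp - prevTs > SessionTimeoutSecs)
            sessionId = s"$ip+${click.timestamp}"   // e.g. 244.157.45.12+1413580110
          prevTs = click.timestamp
          (sessionId, click)
        }
      }

    sessionized.saveAsTextFile("hdfs:///data/processed/sessions")
    sc.stop()
  }
}
```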
131. 131
Orchestrating Clickstream
• Data arrives through Flume
• Triggers a processing event:
– Sessionize
– Enrich – Location, marketing channel…
– Store as Parquet
• Each day we process events from the previous day
155. 157
Visit us at the Booth #408
Highlights:
• Hear what’s new with 5.2 including Impala 2.0
• Learn how Cloudera is setting the standard for Hadoop in the Cloud
Book signings, theater sessions, technical demos, giveaways
Talk about the confusion, even among knowledgeable users, about how all the components fit together to implement applications.
Think about kittens/cars getting challenged
Data ingestion – what requirements do we have for moving data into our processing flow?
Data storage – what requirements do we have for the storage of data, both incoming raw data and processed data?
Data processing – how do we need to process the data to meet our functional requirements?
Workflow orchestration – how do we manage and monitor all the processing?
We have a farm of web servers – this could be tens of servers, or hundreds of servers, and each of these servers is generating multiple logs every day. This may just be a few GB per server, but the total log volume over time can quickly become terabytes of data.
As traffic on our websites increases, we add more web servers, which means even more logs.
We may also decide we need to bring in additional data sources, for example CRM data, or data stored in our operational data stores. Additionally, we may determine that there’s valuable data in Hadoop that we want to bring in to external data stores – for example info to enrich our customer records.
Add title slide for storage reqs
Data needs to be stored in its raw form with full fidelity. This allows us to reprocess the data based on changing or new requirements.
Data needs to be stored in a format that facilitates access by data processing frameworks on Hadoop.
Data needs to be compressed to reduce storage requirements.
So simple! We can just write a quick bash script, schedule it in cron and we are done. This is actually not a bad way to start a project – it shows value very quickly. The important part is to know when to ditch the script.
I typically ditch the script the moment additional requirements arrive. The first few are still simple enough in bash, but soon enough…
There’s a need for a more sophisticated approach, or we’ll be drowning in bash scripts
Even if we use an engine that allows for complex workflows like Spark – orchestration makes things like recovering from errors, managing dependencies and reusing components easier.
Now that we understand our requirements, we need to look at considerations for meeting these requirements, starting with data storage.
Random access to data doesn’t provide any benefit for our workloads, so HBase is not a good choice.
We may later decide that HBase has a place in our architecture, but would add unnecessary complexity right now.
Recall that we’ll be dealing with both raw data being ingested from our web servers, as well as data that’s the output of processing. These two types of data will have different requirements and considerations.
We’ll start by discussing the raw data.
We could store the logs as plain text. This is well supported by Hadoop, and will allow processing by all processing frameworks that run on Hadoop.
This will quickly consume considerable storage in Hadoop though. This may also not be optimal for processing.
SequenceFiles are well suited as a container for data stored in Hadoop, and were specifically designed to work with MapReduce.
SequenceFiles provide Block compression, which will compress a block of records once they reach a specific size.
Block level compression with sequence files allows us to use a non-splittable compression format like Gzip or Snappy, and make it splittable.
Important to note that SequenceFile blocks refer to a block of records compressed within a SequenceFile, and are different than HDFS blocks.
What’s not shown here is a sync marker that’s written before each block of data, which allows readers of the file to sync to block boundaries.
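Not part of the original notes, but a small sketch of writing a block-compressed SequenceFile with the Hadoop API (Scala; assumes the Snappy codec is available on the cluster, and the path and records are invented):

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{LongWritable, SequenceFile, Text}
import org.apache.hadoop.io.SequenceFile.CompressionType
import org.apache.hadoop.io.compress.SnappyCodec

object SequenceFileWriteSketch {
  def main(args: Array[String]): Unit = {
    val conf = new Configuration()
    // Block compression compresses a block of records at a time (not the whole file),
    // which is what lets a non-splittable codec like Snappy be used splittably.
    val writer = SequenceFile.createWriter(
      conf,
      SequenceFile.Writer.file(new Path("hdfs:///data/raw/weblogs.seq")),
      SequenceFile.Writer.keyClass(classOf[LongWritable]),
      SequenceFile.Writer.valueClass(classOf[Text]),
      SequenceFile.Writer.compression(CompressionType.BLOCK, new SnappyCodec())
    )
    try {
      // Hypothetical records: (line offset, raw log line).
      writer.append(new LongWritable(0L), new Text("244.157.45.12 - - ... 200 4463"))
      writer.append(new LongWritable(1L), new Text("244.157.45.12 - - ... 200 3757"))
    } finally {
      writer.close()
    }
  }
}
```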
Avro can be seen as a more advanced SequenceFile
Avro files store the metadata in the header using JSON.
An important feature of Avro is that schemas can evolve, so the schema used to read the file doesn’t need to match the schema used to write the file.
The Avro format is very compact, and also supports splittable compression.
Recall that much of our access to the processed data will be through analytical queries that need to access multiple rows, and often only select columns from those rows.
Access to /data often needs to be controlled, since it contains business critical data sets. Generally only automated processes write to this directory, and different business groups will have read access to only required sub-directories.
/app will be used for things like artifacts required to run Oozie workflows.
Note that partitions are actually directories.
Indicate this applies to raw and processed
Typically we are looking at a few files landing at the FTP site once a day; scheduling a job to run on an edge node of the cluster once a day to fetch the files and stream them to HDFS is fine.
If an import fails, it will fail completely and you can retry.
Ease of deploy and management is important
Customer will not write code
Interceptors are important
Data-push is important
Data will always end up in Hadoop
Many planned consumers
High availability is critical
You have control over sources of data
You are happy to write producers yourself
Does not require programming.
Only without a debugger
Multiple agents acting as collectors provides reliability – if one node goes down we’ll still be able to ingest events.
Flume provides support for load balancing such as round robin.
One or two Hive actions would do, maybe some error handling
The workflow is simple enough to work in any tool – Bash, Azkaban…
but Oozie’s dataset triggers make it a good fit for this use-case
Note that if we were to use Kafka, the workflow would be even simpler and we wouldn’t use time-based scheduling.
There are a lot of orchestration tools out there, and ETL tools typically do orchestration too. We want to focus on the open-source systems that were built to work with Hadoop – they were built to scale with the cluster without a single node as a bottleneck.
Not having to write XML is huge for many people.
Oozie also makes recovery from errors easier:
Data sets are immutable, actions are idempotent, and Oozie supports restarting a workflow and re-running only the failed actions.
This is something you’ll need to add yourself, possibly using an RDBMS and a custom Java action, if advanced metrics are important.