Architectural Considerations for Hadoop Applications
Strata+Hadoop World Barcelona – November 19th 2014 
tiny.cloudera.com/app-arch-slides 
Mark Grover | @mark_grover 
Ted Malaska | @TedMalaska 
Jonathan Seidman | @jseidman 
Gwen Shapira | @gwenshap
2 
About the book 
• @hadooparchbook 
• hadooparchitecturebook.com 
• github.com/hadooparchitecturebook 
• slideshare.com/hadooparchbook 
3 
About the presenters
Ted Malaska
• Principal Solutions Architect at Cloudera
• Previously, lead architect at FINRA
• Contributor to Apache Hadoop, HBase, Flume, Avro, Pig and Spark
Jonathan Seidman
• Senior Solutions Architect/Partner Enablement at Cloudera
• Previously, Technical Lead on the big data team at Orbitz Worldwide
• Co-founder of the Chicago Hadoop User Group and Chicago Big Data
4 
About the presenters
Gwen Shapira
• Solutions Architect turned Software Engineer at Cloudera
• Contributor to Sqoop, Flume and Kafka
• Formerly a senior consultant at Pythian, Oracle ACE Director
Mark Grover
• Software Engineer at Cloudera
• Committer on Apache Bigtop, PMC member on Apache Sentry (incubating)
• Contributor to Apache Hadoop, Spark, Hive, Sqoop, Pig and Flume
5 
Logistics 
• Break: 3:30–4:00 PM
• Questions at the end of each section 
6 
Case Study 
Clickstream Analysis
7–13
Analytics
(series of image slides)
14 
Web Logs – Combined Log Format 
244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463 "http://bestcyclingreviews.com/top_online_shops" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36"
244.157.45.12 - - [17/Oct/2014:21:59:59 ] "GET /Store/cart.jsp?productID=1023 HTTP/1.0" 200 3757 "http://www.casualcyclist.com" "Mozilla/5.0 (Linux; U; Android 2.3.5; en-us; HTC Vision Build/GRI40) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1"
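A line in this format can be pulled apart with a single regular expression; a minimal Python sketch (the group names are our own choice):

```python
import re

# Combined Log Format: ip, identd, user, [timestamp], "request",
# status, size, "referrer", "user agent". Group names are ours.
COMBINED_LOG = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referrer>[^"]*)" "(?P<user_agent>[^"]*)"')

line = ('244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" '
        '200 4463 "http://bestcyclingreviews.com/top_online_shops" "Mozilla/5.0"')

m = COMBINED_LOG.match(line)
if m:
    print(m.group('ip'), m.group('status'), m.group('referrer'))
```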
15 
Clickstream Analytics 
244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463 "http://bestcyclingreviews.com/top_online_shops" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36"
16 
Similar use-cases 
• Sensors – heart, agriculture, etc. 
• Casinos – tracking a person's session at a table
17 
Pre-Hadoop 
Architecture 
Clickstream Analysis
18 
Click Stream Analysis (Before Hadoop)
Web logs (full fidelity, 2 weeks) → Transform/Aggregate → Data Warehouse → Business Intelligence
Tape Archive (older data)
19 
Problems with Pre-Hadoop Architecture 
• Full-fidelity data is stored for only a short time (~weeks).
• Older data is sent to tape or, even worse, deleted!
• Inflexible workflow – all aggregations have to be thought of beforehand
20 
Effects of Pre-Hadoop Architecture 
• Regenerating aggregates is expensive or, worse, impossible
• Can't correct bugs in the workflow/aggregation logic
• Can't experiment on existing data
21 
Why is Hadoop a Great Fit?
Clickstream Analysis
22 
Why is Hadoop a great fit? 
• Volume of clickstream data is huge
• Velocity at which it comes in is high
• Variety of data is diverse – semi-structured data
• Hadoop enables
  – active archiving of data
  – aggregation jobs
  – querying of aggregates or full-fidelity data
23
Click Stream Analysis (with Hadoop)
Web logs → Hadoop → Business Intelligence
• Active archive (no tape)
• Aggregation engine
• Querying engine
24–25
Challenges of Hadoop Implementation
(image slides)
26 
Other challenges - Architectural Considerations 
• Storage managers? 
– HDFS? HBase? 
• Data storage and modeling: 
– File formats? Compression? Schema design? 
• Data movement 
– How do we actually get the data into Hadoop? How do we get it out? 
• Metadata 
– How do we manage data about the data? 
• Data access and processing 
– How will the data be accessed once in Hadoop? How can we transform it? How do 
we query it? 
• Orchestration 
– How do we manage the workflow for all of this? 
27 
Case Study 
Requirements 
Overview of Requirements
28
Overview of Requirements
Data Sources → Ingestion → Raw Data Storage (Formats, Schema) → Processing → Processed Data Storage (Formats, Schema) → Data Consumption
Orchestration (Scheduling, Managing, Monitoring) spans the entire pipeline
29 
Case Study 
Requirements 
Data Ingestion
30 
Data Ingestion Requirements
Web Servers (many) → Logs (e.g. 244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463 …) → Hadoop
CRM Data and ODS → Hadoop
31 
Data Ingestion Requirements 
• So we need to be able to support: 
– Reliable ingestion of large volumes of semi-structured event data arriving with 
high velocity (e.g. logs). 
– Timeliness of data availability – data needs to be available for processing to 
meet business service level agreements. 
– Periodic ingestion of data from relational data stores. 
32 
Case Study 
Requirements 
Data Storage
33 
Data Storage Requirements 
• Store all the data
• Make the data accessible for processing
• Compress the data
34 
Case Study 
Requirements 
Data Processing
35 
Processing requirements 
Be able to answer questions like: 
• What is my website's bounce rate?
  – i.e. what percentage of visitors don't go past the landing page?
• Which marketing channels are leading to the most sessions?
• Do attribution analysis
  – Which channels are responsible for the most conversions?
36 
Sessionization 
(website visits grouped into sessions: Visitor 1 – Session 1, Visitor 1 – Session 2, Visitor 2 – Session 1; a gap of > 30 minutes starts a new session)
37 
Case Study 
Requirements 
Orchestration
38 
Orchestration is simple 
We just need to execute actions, one after another
39
Actually, we also need to handle errors and user notifications…
And… 
• Re-start workflows after errors 
• Reuse of actions in multiple workflows 
• Complex workflows with decision points 
• Trigger actions based on events 
• Tracking metadata 
• Integration with enterprise software 
• Data lifecycle 
• Data quality control 
• Reports 
40
41
OK, maybe we need a product to help us do all that
42 
Architectural 
Considerations 
Data Modeling
43 
Data Modeling Considerations 
• We need to consider the following in our architecture: 
– Storage layer – HDFS? HBase? Etc. 
– File system schemas – how will we lay out the data? 
– File formats – what storage formats to use for our data, both raw and 
processed data? 
– Data compression formats? 
44 
Architectural 
Considerations 
Data Modeling – Storage Layer
45 
Data Storage Layer Choices 
• Two likely choices for raw data: 
46 
Data Storage Layer Choices
HDFS:
• Stores data directly as files
• Fast scans
• Poor random reads/writes
HBase:
• Stores data as HFiles on HDFS
• Slow scans
• Fast random reads/writes
47 
Data Storage – Storage Manager Considerations 
• Incoming raw data: 
– Processing requirements call for batch transformations across multiple 
records – for example sessionization. 
• Processed data: 
– Access to processed data will be via things like analytical queries – again 
requiring access to multiple records. 
• We choose HDFS 
– Processing needs in this case served better by fast scans. 
48 
Architectural 
Considerations 
Data Modeling – Raw Data Storage
49
Storage Formats – Raw Data and Processed Data
(diagram: Raw Data and Processed Data stores)
50 
Data Storage – Format Considerations 
Logs (plain text)
51 
Data Storage – Format Considerations 
(diagram: many plain-text log files)
52 
Data Storage – Compression 
• snappy – well, maybe. But not splittable. ✗
• Another codec – splittable. Getting better…
• Another codec – splittable, but no... Hmmm….
53 
Raw Data Storage – More About Snappy 
• Designed at Google to provide high compression speeds with reasonable 
compression. 
• Not the highest compression, but provides very good performance for 
processing on Hadoop. 
• Snappy is not splittable though, which brings us to… 
54 
Hadoop File Types 
• Formats designed specifically to store and process data on Hadoop: 
– File based – SequenceFile 
– Serialization formats – Thrift, Protocol Buffers, Avro 
– Columnar formats – RCFile, ORC, Parquet 
55 
SequenceFile 
• Stores records as binary key/value pairs.
• SequenceFile "blocks" can be compressed.
• This enables splittability with non-splittable compression.
56 
Avro 
• Kind of a SequenceFile on steroids.
• Self-documenting – stores its schema in the file header.
• Provides very efficient storage.
• Supports splittable compression.
57 
Our Format Recommendations for Raw Data… 
• Avro with Snappy
  – Snappy provides optimized compression.
  – Avro provides compact storage, self-documenting files, and support for schema evolution.
  – Avro also provides better failure handling than other choices.
• SequenceFiles would also be a good choice, and are directly supported by ingestion tools in the ecosystem.
  – But they only support Java.
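To make the recommendation concrete, a minimal sketch of writing Avro with the snappy codec, assuming the Python avro and python-snappy packages (the Click schema is invented for this example, and avro API naming varies slightly across library versions):

```python
import avro.schema
from avro.datafile import DataFileWriter
from avro.io import DatumWriter

# Illustrative schema for one parsed click record; field names are ours.
SCHEMA = avro.schema.parse("""
{
  "type": "record", "name": "Click",
  "fields": [
    {"name": "ip",        "type": "string"},
    {"name": "timestamp", "type": "long"},
    {"name": "url",       "type": "string"}
  ]
}
""")

# The schema travels in the file header (self-documenting), and snappy
# compresses each block while the container file stays splittable.
writer = DataFileWriter(open("clicks.avro", "wb"), DatumWriter(), SCHEMA,
                        codec="snappy")
writer.append({"ip": "244.157.45.12", "timestamp": 1413580110,
               "url": "/seatposts"})
writer.close()
```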
58 
But Note… 
• For simplicity, we’ll use plain text for raw data in our example. 
59 
Architectural 
Considerations 
Data Modeling – Processed Data Storage
60
Storage Formats – Raw Data and Processed Data
(diagram: Raw Data and Processed Data stores)
61 
Access to Processed Data 
Column A | Column B | Column C | Column D
Value | Value | Value | Value
Value | Value | Value | Value
Value | Value | Value | Value
Value | Value | Value | Value
Analytical Queries
62 
Columnar Formats 
• Eliminates I/O for columns that are not part of a query. 
• Works well for queries that access a subset of columns. 
• Often provide better compression. 
• These add up to dramatically improved performance for many 
queries. 
Row layout:
1 2014-10-13 abc
2 2014-10-14 def
3 2014-10-15 ghi

Column layout:
1 2 3
2014-10-13 2014-10-14 2014-10-15
abc def ghi
63 
Columnar Choices – RCFile 
• Designed to provide efficient processing for Hive queries. 
• Only supports Java. 
• No Avro support. 
• Limited compression support. 
• Sub-optimal performance compared to newer columnar formats. 
64 
Columnar Choices – ORC 
• A better RCFile. 
• Also designed to provide efficient processing of Hive queries. 
• Only supports Java. 
65 
Columnar Choices – Parquet 
• Designed to provide efficient processing across Hadoop 
programming interfaces – MapReduce, Hive, Impala, Pig. 
• Multiple language support – Java, C++ 
• Good object model support, including Avro. 
• Broad vendor support. 
• These features make Parquet a good choice for our processed data. 
66 
Architectural 
Considerations 
Data Modeling – Schema Design
67 
HDFS Schema Design – One Recommendation 
/user/<username> - User specific data, jars, conf files 
/etl – Data in various stages of ETL workflow 
/tmp – temp data from tools or shared between users 
/data – processed data to be shared with the entire organization
/app – everything but data: UDF jars, HQL files, Oozie workflows
68 
Partitioning 
• Split the dataset into smaller consumable chunks. 
• Rudimentary form of “indexing”. Reduces I/O needed to process 
queries. 
69 
Partitioning 
Un-partitioned HDFS directory structure:
dataset/
  file1.txt
  file2.txt
  …
  filen.txt

Partitioned HDFS directory structure:
dataset/
  col=val1/file.txt
  col=val2/file.txt
  …
  col=valn/file.txt
70 
Partitioning considerations 
• What column should we partition by?
  – Don't have too many partitions (<10,000)
  – Don't have too many small files in the partitions
  – Good to have partition sizes of at least ~1 GB
• We'll partition by timestamp. This applies to both our raw and processed data.
71 
Partitioning For Our Case Study 
• Raw dataset:
  – /etl/BI/casualcyclist/clicks/rawlogs/year=2014/month=10/day=10
• Processed dataset:
  – /data/bikeshop/clickstream/year=2014/month=10/day=10
72 
Architectural 
Considerations 
Data Ingestion
Typical Clickstream data sources 
73
• Omniture data on FTP 
• Apps 
• App Logs 
• RDBMS
Getting Files from FTP 
74
Don’t over-complicate things 
curl ftp://myftpsite.com/sitecatalyst/myreport_2014-10-05.tar.gz \
  --user name:password \
  | hdfs dfs -put - /etl/clickstream/raw/sitecatalyst/myreport_2014-10-05.tar.gz

75
Event Streaming – Flume and Kafka 
Reliable, distributed, and highly available systems that allow streaming events into Hadoop
76
77
Flume:
• Many available data collection sources
• Well integrated into Hadoop
• Supports file transformations
• Can implement complex topologies
• Very low latency
• No programming required
78
We use Flume when:
"We just want to grab data from this directory and write it to HDFS"
79
Kafka is:
• Very high-throughput publish-subscribe messaging
• Highly available
• Stores data and can replay it
• Can support many consumers with no extra latency
80
Use Kafka when:
"Kafka is awesome. We heard it cures cancer"
81
Actually, why choose?
• Use Flume with a Kafka Source
• Lets you get data from Kafka, run some transformations, and write to HDFS, HBase, or Solr
82
In Our Example…
• We want to ingest events from log files
• Flume's Spooling Directory source fits
• With the HDFS Sink
• We would have used Kafka if…
  – we wanted the data in non-Hadoop systems too
83 
Short Intro to Flume
Flume Agent:
• Sources – Twitter, logs, JMS, webserver, Kafka
• Interceptors – mask, re-format, validate…
• Selectors – DR, critical
• Channels – memory, file, Kafka
• Sinks – HDFS, HBase, Solr
84 
Configuration 
• Declarative
  – No coding required.
  – Configuration specifies how components are wired together.
85 
Interceptors 
• Mask fields
• Validate information against an external source
• Extract fields
• Modify data format
• Filter or split events
86
Any sufficiently complex configuration is indistinguishable from code
87 
A Brief Discussion of Flume Patterns – Fan-in 
• A Flume agent runs on each of our servers.
• These client agents send data to multiple agents to provide reliability.
• Flume provides support for load balancing.
88 
A Brief Discussion of Flume Patterns – Splitting 
• A common need is to split data on ingest.
• For example:
  – sending data to multiple clusters for DR
  – sending data to multiple destinations
• Flume also supports partitioning, which is key to our implementation.
89 
Flume Demo – Client Tier
Web Server → Flume Agent:
Web Logs → Spooling Dir Source → Timestamp Interceptor → File Channel → Avro Sinks (×3)
90 
Flume Demo – Collector Tier
Flume Agents (×2): Avro Source → File Channel → HDFS Sinks (×2) → HDFS
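A hedged sketch of what the client-tier agent could look like as a Flume properties file (agent and component names, paths, and hostnames are placeholders, not the actual demo config):

```
# Client tier: spooling dir source -> timestamp interceptor -> file channel
# -> load-balanced Avro sinks pointing at the collector tier.
client.sources  = r1
client.channels = c1
client.sinks    = k1 k2

client.sources.r1.type = spooldir
client.sources.r1.spoolDir = /var/log/weblogs
client.sources.r1.interceptors = ts
client.sources.r1.interceptors.ts.type = timestamp
client.sources.r1.channels = c1

client.channels.c1.type = file

client.sinks.k1.type = avro
client.sinks.k1.hostname = collector1.example.com
client.sinks.k1.port = 4141
client.sinks.k1.channel = c1

client.sinks.k2.type = avro
client.sinks.k2.hostname = collector2.example.com
client.sinks.k2.port = 4141
client.sinks.k2.channel = c1

client.sinkgroups = g1
client.sinkgroups.g1.sinks = k1 k2
client.sinkgroups.g1.processor.type = load_balance
```

A collector-tier agent would pair an Avro source and file channel with an HDFS sink whose hdfs.path uses the year=%Y/month=%m/day=%d escape sequences; the timestamp interceptor above supplies the event header those escapes read.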
What if…. We were to use Kafka? 
• Add Kafka producer to our webapp 
• Send clicks and searches as messages 
• Flume can ingest events from Kafka 
• We can add a second consumer for real-time processing in Spark Streaming
• Another consumer for alerting…
• And maybe a batch consumer too
91
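A hedged sketch of the producer side, using the kafka-python package (broker address and topic name are invented placeholders):

```python
from kafka import KafkaProducer

producer = KafkaProducer(bootstrap_servers="broker1.example.com:9092")

# One click event per message on an illustrative 'clicks' topic; any number
# of consumers (Flume, Spark Streaming, alerting, batch) can read it.
producer.send("clicks", b'244.157.45.12 - - [17/Oct/2014:21:08:30 ] '
                        b'"GET /seatposts HTTP/1.0" 200 4463')
producer.flush()
```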
92
The Kafka Channel
Kafka Producers (Producer A, B, C) → Kafka (as the Flume channel) → Sinks (HDFS, HBase, Solr)
93
The Kafka Channel
Sources (Twitter, logs, JMS, webserver) → Interceptors (mask, re-format, validate…) → Kafka (as the channel) → Kafka Consumers (Consumer A, B, C)
94
The Kafka Channel
Sources (Twitter, logs, JMS, webserver) → Interceptors (mask, re-format, validate…) → Selectors (DR, critical) → Kafka channel → Sinks (HDFS, HBase, Solr)
95 
Architectural 
Considerations 
Data Processing – Engines 
tiny.cloudera.com/app-arch-slides
96 
Processing Engines 
• MapReduce 
• Abstractions 
• Spark 
• Spark Streaming 
• Impala 
97 
MapReduce 
• Oldie but goodie
• Restrictive framework / innovative workarounds
• Extreme batch
98 
MapReduce Basic High Level 
(diagram: HDFS (replicated) → block of data → Mapper → temp spill data on the native file system → partitioned, sorted data → local copy → Reducers → output file)
99 
MapReduce Innovation 
• Mapper memory joins
• Reducer memory joins
• Bucketed sorted joins
• Cross-task communication
• Windowing
• And much more
100 
Abstractions 
• SQL 
– Hive 
• Script/Code 
– Pig: Pig Latin 
– Crunch: Java/Scala 
– Cascading: Java/Scala 
101 
Spark 
• The New Kid that isn’t that New Anymore 
• Easily 10x less code 
• Extremely Easy and Powerful API 
• Very good for machine learning 
• Scala, Java, and Python 
• RDDs 
• DAG Engine 
102 
Spark - DAG 
103 
Spark - DAG 
(DAG: TextFile → Filter → KeyBy and TextFile → KeyBy, feeding Join → Filter → Take)
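The pictured DAG corresponds to a pipeline along these lines – a hedged PySpark sketch, assuming an existing SparkContext sc (paths and key logic are placeholders):

```python
# Two text inputs; one is filtered, both are keyed, then joined.
left = (sc.textFile("/data/left")
          .filter(lambda line: "GET" in line)        # Filter
          .keyBy(lambda line: line.split(" ")[0]))   # KeyBy (first field)
right = sc.textFile("/data/right").keyBy(lambda line: line.split(" ")[0])

result = (left.join(right)                              # Join
              .filter(lambda kv: kv[1][0] != kv[1][1])  # Filter
              .take(10))                                # Take
```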
104 
Spark - DAG 
(same DAG annotated for fault recovery: completed partitions are marked "good"; when a block is lost, only its lineage is replayed; not-yet-computed stages remain "future")
105 
Spark Streaming 
• Calling Spark in a Loop 
• Extends RDDs with DStream 
• Very Little Code Changes from ETL to Streaming 
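To make "very little code changes" concrete, a minimal hedged sketch (assuming an existing SparkContext sc; the socket source and 10-second batch interval are our own choices):

```python
from pyspark.streaming import StreamingContext

ssc = StreamingContext(sc, 10)                   # 10-second micro-batches
lines = ssc.socketTextStream("localhost", 9999)  # illustrative source

# The same filter/count logic a batch job would run, applied per batch.
lines.filter(lambda line: "GET" in line).count().pprint()

ssc.start()
ssc.awaitTermination()
```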
106 
Spark Streaming 
(diagram: Source → Receiver → RDDs per batch, then a single pass of Filter → Count → Print, shown for the pre-first, first, and second batches)
107 
Spark Streaming 
(same diagram with state: each batch's single pass of Filter → Count feeds a Stateful RDD, which carries over into the next batch before Print)
108 
Impala 
• MPP-style SQL engine on top of Hadoop
• Very fast
• High concurrency
• Analytical windowing functions (CDH 5.2)
109 
Impala – Broadcast Join
(diagram: the smaller table is 100% cached on every Impala daemon; each daemon streams its local blocks of the bigger table through a hash join function against the cached copy and emits output)
110 
Impala – Partitioned Hash Join
(diagram: each Impala daemon caches ~33% of the smaller table; hash partitioners route both tables so matching partitions meet at the same daemon's hash join function, each emitting output)
111 
Impala vs Hive 
• Very different approaches
• We may see convergence at some point
• But for now:
  – Impala for speed
  – Hive for batch
112 
Architectural 
Considerations 
Data Processing – Patterns and 
Recommendations
113 
What processing needs to happen? 
• Sessionization 
• Filtering 
• Deduplication 
• BI / Discovery
114 
Sessionization 
(website visits grouped into sessions: Visitor 1 – Session 1, Visitor 1 – Session 2, Visitor 2 – Session 1; a gap of > 30 minutes starts a new session)
115 
Why sessionize? 
Helps answer questions like:
• What is my website's bounce rate?
  – i.e. what percentage of visitors don't go past the landing page?
• Which marketing channels (e.g. organic search, display ads, etc.) are leading to the most sessions?
  – Which of those lead to the most conversions (e.g. people buying things, signing up, etc.)?
• Do attribution analysis – which channels are responsible for the most conversions?
116 
Sessionization 
244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463 "http://bestcyclingreviews.com/top_online_shops" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36" → 244.157.45.12+1413580110
244.157.45.12 - - [17/Oct/2014:21:59:59 ] "GET /Store/cart.jsp?productID=1023 HTTP/1.0" 200 3757 "http://www.casualcyclist.com" "Mozilla/5.0 (Linux; U; Android 2.3.5; en-us; HTC Vision Build/GRI40) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1" → 244.157.45.12+1413583199
117 
How to Sessionize?
1. Given a list of clicks, determine which clicks came from the same user (partitioning, ordering)
2. Given a particular user's clicks, determine whether a given click is part of a new session or a continuation of the previous session (identifying session boundaries)
118 
#1 – Which clicks are from same user? 
• We can use:
  – IP address (244.157.45.12)
  – Cookies (A9A3BECE0563982D)
  – IP address (244.157.45.12) and user agent string ((KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36")
119
#1 – Which clicks are from same user?
(repeat of the previous slide)
120
#1 – Which clicks are from same user?
244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463 "http://bestcyclingreviews.com/top_online_shops" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36"
244.157.45.12 - - [17/Oct/2014:21:59:59 ] "GET /Store/cart.jsp?productID=1023 HTTP/1.0" 200 3757 "http://www.casualcyclist.com" "Mozilla/5.0 (Linux; U; Android 2.3.5; en-us; HTC Vision Build/GRI40) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1"
121
#2 – Which clicks are part of the same session?
244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463 "http://bestcyclingreviews.com/top_online_shops" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36"
244.157.45.12 - - [17/Oct/2014:21:59:59 ] "GET /Store/cart.jsp?productID=1023 HTTP/1.0" 200 3757 "http://www.casualcyclist.com" "Mozilla/5.0 (Linux; U; Android 2.3.5; en-us; HTC Vision Build/GRI40) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1"
> 30 mins apart = different sessions
Sessionization engine recommendation
• We have sessionization code in MR and Spark on GitHub. The complexity of the code varies, depending on the expertise in the organization.
• We chose MR
  – The MR API is stable and widely known
  – No Spark + Oozie (orchestration engine) integration currently
122
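For illustration only (this is not the GitHub code), a hedged PySpark sketch of the two steps; user_id() and event_time() are hypothetical parsing helpers and sc is an existing SparkContext:

```python
SESSION_TIMEOUT = 30 * 60  # 30 minutes, in seconds

def sessionize(clicks):
    # clicks: iterable of (timestamp, line) tuples for one user
    clicks = sorted(clicks)                  # order this user's clicks
    sessions, current, last_ts = [], [], None
    for ts, line in clicks:
        if last_ts is not None and ts - last_ts > SESSION_TIMEOUT:
            sessions.append(current)         # >30 min gap: session boundary
            current = []
        current.append((ts, line))
        last_ts = ts
    if current:
        sessions.append(current)
    return sessions

logs = sc.textFile("/etl/BI/casualcyclist/clicks/rawlogs/year=2014/month=10/day=10")
sessions = (logs
    .map(lambda line: (user_id(line), (event_time(line), line)))  # step 1
    .groupByKey()                                                 # clicks per user
    .flatMap(lambda kv: sessionize(kv[1])))                       # step 2
```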
123 
Filtering – filter out incomplete records 
244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463 "http://bestcyclingreviews.com/top_online_shops" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36"
244.157.45.12 - - [17/Oct/2014:21:59:59 ] "GET /Store/cart.jsp?productID=1023 HTTP/1.0" 200 3757 "http://www.casualcyclist.com" "Mozilla/5.0 (Linux; U…
Filtering – filter out records from bots/spiders 
244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463 "http://bestcyclingreviews.com/top_online_shops" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36"
209.85.238.11 - - [17/Oct/2014:21:59:59 ] "GET /Store/cart.jsp?productID=1023 HTTP/1.0" 200 3757 "http://www.casualcyclist.com" "Mozilla/5.0 (Linux; U; Android 2.3.5; en-us; HTC Vision Build/GRI40) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1" ← Google spider IP address
124
Filtering recommendation 
• Bot/Spider filtering can be done easily in any of the engines 
• Incomplete records are harder to filter in schema systems like 
Hive, Impala, Pig, etc. 
• Flume interceptors can also be used 
• Pretty close choice between MR, Hive and Spark 
• Can be done in Spark using rdd.filter() 
• We can simply embed this in our MR sessionization job 
125
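A hedged sketch of the rdd.filter() option (the completeness check and bot list are illustrative assumptions; logs is an RDD of raw lines, e.g. as loaded in the earlier sketch):

```python
BOT_IPS = {"209.85.238.11"}  # e.g. known spider addresses

def is_complete(line):
    # Combined Log Format has three quoted fields, so a full record
    # splits into at least 7 parts on the '"' character.
    return len(line.split('"')) >= 7

def is_bot(line):
    return line.split(" ")[0] in BOT_IPS

clean = logs.filter(lambda line: is_complete(line) and not is_bot(line))
```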
126 
Deduplication – remove duplicate records 
244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463 "http://bestcyclingreviews.com/top_online_shops" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36"
244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463 "http://bestcyclingreviews.com/top_online_shops" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36"
Deduplication recommendation
• Can be done in all engines.
• We already have a Hive table with all the columns; a simple DISTINCT query will perform deduplication
• reduceByKey()/distinct() in Spark
• We use Pig
127
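For comparison with the Pig and Hive options, the Spark equivalent is a one-liner, since exact duplicates compare equal as strings (a sketch; clean is the filtered RDD from the previous step):

```python
# distinct() shuffles identical records together and keeps one copy of
# each – effectively what a Hive SELECT DISTINCT does.
deduped = clean.distinct()
```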
BI/Discovery engine recommendation 
• Main requirements for this are: 
– Low latency 
– SQL interface (e.g. JDBC/ODBC) 
– Users don’t know how to code 
• We chose Impala 
– It’s a SQL engine 
– Much faster than other engines 
– Provides standard JDBC/ODBC interfaces 
128
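As a hedged illustration of that kind of access, a small Python example using the impyla package (host, database, and table schema are invented placeholders):

```python
from impala.dbapi import connect

conn = connect(host="impalad.example.com", port=21050)
cur = conn.cursor()

# Bounce rate: the share of sessions with exactly one page view.
cur.execute("""
    SELECT AVG(CASE WHEN page_views = 1 THEN 1 ELSE 0 END) AS bounce_rate
    FROM bikeshop.sessions
""")
print(cur.fetchall())
```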
End-to-end processing
Deduplication → Filtering → Sessionization → BI tools
129
130 
Architectural 
Considerations 
Orchestration
131 
Orchestrating Clickstream 
• Data arrives through Flume 
• Triggers a processing event: 
– Sessionize 
– Enrich – Location, marketing channel… 
– Store as Parquet 
• Each day we process events from the previous day
132
Choosing Right
• Workflow is fairly simple
• Need to trigger workflow based on data
• Be able to recover from errors
• Perhaps notify on the status
• And collect metrics for reporting
133 
Oozie or Azkaban? 
134
Oozie Architecture
135
Oozie features
• Part of all major Hadoop distributions
• Hue integration
• Built-in actions – Hive, Sqoop, MapReduce, SSH
• Complex workflows with decisions
• Event- and time-based scheduling
• Notifications
• SLA monitoring
• REST API
136
Oozie Drawbacks 
• Overhead in launching jobs 
• Steep learning curve 
• XML Workflows
137
Azkaban Architecture
(diagram: Client → Azkaban Web Server (HDFS viewer plugin) and Azkaban Executor Server (job types plugin) → Hadoop, with MySQL as the shared backing store)
138
Azkaban features
• Simplicity
• Great UI – including pluggable visualizers
• Lots of plugins – Hive, Pig…
• Reporting plugin
139
Azkaban Limitations
• Doesn't support workflow decisions
• Can't represent data dependencies
140
Choosing…
• Workflow is fairly simple
• Need to trigger workflow based on data
• Be able to recover from errors
• Perhaps notify on the status
• And collect metrics for reporting
Easier in Oozie
Choosing the right Orchestration Tool 
• Workflow is fairly simple 
• Need to trigger workflow based on data 
• Be able to recover from errors 
• Perhaps notify on the status 
• And collect metrics for reporting 
Better in Azkaban 
141
Important Decision Consideration! 
The best orchestration tool 
is the one you are an expert on 
142
145 
Putting It All 
Together 
Final Architecture
146
Final Architecture – High Level Overview
Data Sources → Ingestion → Raw Data Storage (Formats, Schema) → Processing → Processed Data Storage (Formats, Schema) → Data Consumption
Orchestration (Scheduling, Managing, Monitoring) spans the entire pipeline
147
Final Architecture – High Level Overview
(overview diagram repeated)
148
Final Architecture – Ingestion/Storage
Web Servers (each running a Flume agent) → collector Flume Agents → HDFS (/etl/BI/casualcyclist/clicks/rawlogs/year=2014/month=10/day=10)
• Fan-in pattern
• Multiple agents for failover and rolling restarts
149
Final Architecture – High Level Overview
(overview diagram repeated)
150
Final Architecture – Processing and Storage
/etl/BI/casualcyclist/clicks/rawlogs/year=2014/month=10/day=10 …
→ dedup → filtering → sessionization → parquetize →
/data/bikeshop/clickstream/year=2014/month=10/day=10 …
151
Final Architecture – High Level Overview
(overview diagram repeated)
152
Final Architecture – Data Access
• Hive/Impala → JDBC/ODBC → BI/Analytics tools
• Sqoop → DWH
• DB import tool → local disk → R, etc.
Demo 
153
154 
Stay in touch! 
@hadooparchbook 
hadooparchitecturebook.com 
slideshare.com/hadooparchbook
155
Join the Discussion
Get community help or provide feedback: cloudera.com/community
156 
Try Hadoop Now 
cloudera.com/live
157
Visit us at the Booth #408
Highlights:
• Hear what's new with 5.2, including Impala 2.0
• Learn how Cloudera is setting the standard for Hadoop in the Cloud
Book signings, theater sessions, technical demos, giveaways
158 
Free books and office hours! 
• Book signings 
– Nov 20, 3:15–3:45 PM in Expo Hall – Cloudera Booth (#408)
– Nov 20, 6:25–6:55 PM in Expo Hall – O'Reilly Booth
• Office Hours
– Mark and Ted, Nov 20, 1:45 PM, Table A
Thank you, 
Friends
Intelligent Home Wi-Fi Solutions | ThinkPalmSujith Sukumaran
 
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASEBATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASEOrtus Solutions, Corp
 
How to Track Employee Performance A Comprehensive Guide.pdf
How to Track Employee Performance A Comprehensive Guide.pdfHow to Track Employee Performance A Comprehensive Guide.pdf
How to Track Employee Performance A Comprehensive Guide.pdfLivetecs LLC
 
Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...
Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...
Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...Natan Silnitsky
 
Xen Safety Embedded OSS Summit April 2024 v4.pdf
Xen Safety Embedded OSS Summit April 2024 v4.pdfXen Safety Embedded OSS Summit April 2024 v4.pdf
Xen Safety Embedded OSS Summit April 2024 v4.pdfStefano Stabellini
 
英国UN学位证,北安普顿大学毕业证书1:1制作
英国UN学位证,北安普顿大学毕业证书1:1制作英国UN学位证,北安普顿大学毕业证书1:1制作
英国UN学位证,北安普顿大学毕业证书1:1制作qr0udbr0
 
MYjobs Presentation Django-based project
MYjobs Presentation Django-based projectMYjobs Presentation Django-based project
MYjobs Presentation Django-based projectAnoyGreter
 
Unveiling the Future: Sylius 2.0 New Features
Unveiling the Future: Sylius 2.0 New FeaturesUnveiling the Future: Sylius 2.0 New Features
Unveiling the Future: Sylius 2.0 New FeaturesŁukasz Chruściel
 
Unveiling Design Patterns: A Visual Guide with UML Diagrams
Unveiling Design Patterns: A Visual Guide with UML DiagramsUnveiling Design Patterns: A Visual Guide with UML Diagrams
Unveiling Design Patterns: A Visual Guide with UML DiagramsAhmed Mohamed
 
办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样
办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样
办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样umasea
 
What are the key points to focus on before starting to learn ETL Development....
What are the key points to focus on before starting to learn ETL Development....What are the key points to focus on before starting to learn ETL Development....
What are the key points to focus on before starting to learn ETL Development....kzayra69
 

Recently uploaded (20)

Best Web Development Agency- Idiosys USA.pdf
Best Web Development Agency- Idiosys USA.pdfBest Web Development Agency- Idiosys USA.pdf
Best Web Development Agency- Idiosys USA.pdf
 
Cyber security and its impact on E commerce
Cyber security and its impact on E commerceCyber security and its impact on E commerce
Cyber security and its impact on E commerce
 
2.pdf Ejercicios de programación competitiva
2.pdf Ejercicios de programación competitiva2.pdf Ejercicios de programación competitiva
2.pdf Ejercicios de programación competitiva
 
A healthy diet for your Java application Devoxx France.pdf
A healthy diet for your Java application Devoxx France.pdfA healthy diet for your Java application Devoxx France.pdf
A healthy diet for your Java application Devoxx France.pdf
 
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte Germany
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte GermanySuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte Germany
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte Germany
 
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
 
Buds n Tech IT Solutions: Top-Notch Web Services in Noida
Buds n Tech IT Solutions: Top-Notch Web Services in NoidaBuds n Tech IT Solutions: Top-Notch Web Services in Noida
Buds n Tech IT Solutions: Top-Notch Web Services in Noida
 
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed DataAlluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
 
Intelligent Home Wi-Fi Solutions | ThinkPalm
Intelligent Home Wi-Fi Solutions | ThinkPalmIntelligent Home Wi-Fi Solutions | ThinkPalm
Intelligent Home Wi-Fi Solutions | ThinkPalm
 
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASEBATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
 
How to Track Employee Performance A Comprehensive Guide.pdf
How to Track Employee Performance A Comprehensive Guide.pdfHow to Track Employee Performance A Comprehensive Guide.pdf
How to Track Employee Performance A Comprehensive Guide.pdf
 
Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...
Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...
Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...
 
Xen Safety Embedded OSS Summit April 2024 v4.pdf
Xen Safety Embedded OSS Summit April 2024 v4.pdfXen Safety Embedded OSS Summit April 2024 v4.pdf
Xen Safety Embedded OSS Summit April 2024 v4.pdf
 
Hot Sexy call girls in Patel Nagar🔝 9953056974 🔝 escort Service
Hot Sexy call girls in Patel Nagar🔝 9953056974 🔝 escort ServiceHot Sexy call girls in Patel Nagar🔝 9953056974 🔝 escort Service
Hot Sexy call girls in Patel Nagar🔝 9953056974 🔝 escort Service
 
英国UN学位证,北安普顿大学毕业证书1:1制作
英国UN学位证,北安普顿大学毕业证书1:1制作英国UN学位证,北安普顿大学毕业证书1:1制作
英国UN学位证,北安普顿大学毕业证书1:1制作
 
MYjobs Presentation Django-based project
MYjobs Presentation Django-based projectMYjobs Presentation Django-based project
MYjobs Presentation Django-based project
 
Unveiling the Future: Sylius 2.0 New Features
Unveiling the Future: Sylius 2.0 New FeaturesUnveiling the Future: Sylius 2.0 New Features
Unveiling the Future: Sylius 2.0 New Features
 
Unveiling Design Patterns: A Visual Guide with UML Diagrams
Unveiling Design Patterns: A Visual Guide with UML DiagramsUnveiling Design Patterns: A Visual Guide with UML Diagrams
Unveiling Design Patterns: A Visual Guide with UML Diagrams
 
办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样
办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样
办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样
 
What are the key points to focus on before starting to learn ETL Development....
What are the key points to focus on before starting to learn ETL Development....What are the key points to focus on before starting to learn ETL Development....
What are the key points to focus on before starting to learn ETL Development....
 

  • 20. 20 Effects of Pre-Hadoop Architecture • Regenerating aggregates is expensive or worse, impossible • Can’t correct bugs in the workflow/aggregation logic • Can’t do experiments on existing data ©2014 Cloudera, Inc. All Rights Reserved.
  • 21. 21 Why is Hadoop A Great Fit? Clickstream Analysis
  • 22. 22 Why is Hadoop a great fit? • Volume of clickstream data is huge • Velocity at which it comes in is high • Variety of data is diverse - semi-structured data • Hadoop enables – active archival of data – Aggregation jobs – Querying the above aggregates or the full-fidelity raw data ©2014 Cloudera, Inc. All Rights Reserved.
  • 23. Intelligence 23 Click Stream Analysis (with Hadoop) Web logs Hadoop Business ©2014 Cloudera, Inc. All Rights Reserved. Active archive (no tape) Aggregation engine Querying engine
  • 24. 24 Challenges of Hadoop Implementation ©2014 Cloudera, Inc. All Rights Reserved.
  • 25. 25 Challenges of Hadoop Implementation ©2014 Cloudera, Inc. All Rights Reserved.
  • 26. 26 Other challenges - Architectural Considerations • Storage managers? – HDFS? HBase? • Data storage and modeling: – File formats? Compression? Schema design? • Data movement – How do we actually get the data into Hadoop? How do we get it out? • Metadata – How do we manage data about the data? • Data access and processing – How will the data be accessed once in Hadoop? How can we transform it? How do we query it? • Orchestration – How do we manage the workflow for all of this? ©2014 Cloudera, Inc. All Rights Reserved.
  • 27. 27 Case Study Requirements Overview of Requirements
  • 28. 28 Overview of Requirements [diagram: Data Sources → Ingestion → Raw Data Storage (Formats, Schema) → Data Processing → Processed Data Storage (Formats, Schema) → Data Consumption, with Orchestration (Scheduling, Managing, Monitoring) spanning the whole pipeline] ©2014 Cloudera, Inc. All Rights Reserved.
  • 29. 29 Case Study Requirements Data Ingestion
  • 30. 30 Data Ingestion Requirements [diagram: farms of web servers emitting logs (244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463 …) plus CRM data and an ODS, all flowing into Hadoop] ©2014 Cloudera, Inc. All Rights Reserved.
  • 31. 31 Data Ingestion Requirements • So we need to be able to support: – Reliable ingestion of large volumes of semi-structured event data arriving with high velocity (e.g. logs). – Timeliness of data availability – data needs to be available for processing to meet business service level agreements. – Periodic ingestion of data from relational data stores. ©2014 Cloudera, Inc. All Rights Reserved.
  • 32. 32 Case Study Requirements Data Storage
  • 33. 33 Data Storage Requirements ©2014 Cloudera, Inc. All Rights Reserved. Store all the data Make the data accessible for processing Compress the data
  • 34. 34 Case Study Requirements Data Processing
  • 35. 35 Processing requirements Be able to answer questions like: • What is my website’s bounce rate? – i.e. how many % of visitors don’t go past the landing page? • Which marketing channels are leading to most sessions? • Do attribution analysis – Which channels are responsible for most conversions? ©2014 Cloudera, Inc. All Rights Reserved.
  • 36. 36 Sessionization ©2014 Cloudera, Inc. All Rights Reserved. Website visit Visitor 1 Session 1 Visitor 1 Session 2 Visitor 2 Session 1 > 30 minutes
  • 37. 37 Case Study Requirements Orchestration
  • 38. 38 Orchestration is simple We just need to execute actions One after another ©2014 Cloudera, Inc. All Rights Reserved.
  • 39. ©2014 Cloudera, Inc. All Rights Reserved. 39 Actually, we also need to handle errors And user notifications ….
  • 40. And… • Re-start workflows after errors • Reuse of actions in multiple workflows • Complex workflows with decision points • Trigger actions based on events • Tracking metadata • Integration with enterprise software • Data lifecycle • Data quality control • Reports ©2014 Cloudera, Inc. All Rights Reserved. 40
  • 41. ©2014 Cloudera, Inc. All Rights Reserved. 41 OK, maybe we need a product To help us do all that
  • 43. 43 Data Modeling Considerations • We need to consider the following in our architecture: – Storage layer – HDFS? HBase? Etc. – File system schemas – how will we lay out the data? – File formats – what storage formats to use for our data, both raw and processed data? – Data compression formats? ©2014 Cloudera, Inc. All Rights Reserved.
  • 44. 44 Architectural Considerations Data Modeling – Storage Layer
  • 45. 45 Data Storage Layer Choices • Two likely choices for raw data: ©2014 Cloudera, Inc. All Rights Reserved.
  • 46. 46 Data Storage Layer Choices • HDFS – Stores data directly as files • Fast scans • Poor random reads/writes • HBase – Stores data as HFiles on HDFS • Slow scans • Fast random reads/writes ©2014 Cloudera, Inc. All Rights Reserved.
  • 47. 47 Data Storage – Storage Manager Considerations • Incoming raw data: – Processing requirements call for batch transformations across multiple records – for example sessionization. • Processed data: – Access to processed data will be via things like analytical queries – again requiring access to multiple records. • We choose HDFS – Processing needs in this case served better by fast scans. ©2014 Cloudera, Inc. All Rights Reserved.
  • 48. 48 Architectural Considerations Data Modeling – Raw Data Storage
  • 49. Storage Formats – Raw Data and Processed Data 49 ©2014 Cloudera, Inc. All Rights Reserved. Processed Data Raw Data
  • 50. 50 Data Storage – Format Considerations Logs (plain text)
  • 51. 51 Data Storage – Format Considerations [diagram: the plain-text logs multiplying into many small log files]
  • 52. 52 Data Storage – Compression [diagram comparing codecs: Snappy is fast but not splittable on its own; the splittable alternatives each involve trade-offs]
  • 53. 53 Raw Data Storage – More About Snappy • Designed at Google to provide high compression speeds with reasonable compression. • Not the highest compression, but provides very good performance for processing on Hadoop. • Snappy is not splittable though, which brings us to… ©2014 Cloudera, Inc. All Rights Reserved.
  • 54. 54 Hadoop File Types • Formats designed specifically to store and process data on Hadoop: – File based – SequenceFile – Serialization formats – Thrift, Protocol Buffers, Avro – Columnar formats – RCFile, ORC, Parquet
  • 55. 55 SequenceFile • Stores records as binary key/value pairs. • SequenceFile “blocks” can be compressed. • This enables splittability with non-splittable compression. ©2014 Cloudera, Inc. All Rights Reserved.
  • 56. 56 Avro • Kinda SequenceFile on Steroids. • Self-documenting – stores schema in header. • Provides very efficient storage. • Supports splittable compression. ©2014 Cloudera, Inc. All Rights Reserved.
  • 57. 57 Our Format Recommendations for Raw Data… • Avro with Snappy – Snappy provides optimized compression. – Avro provides compact storage, self-documenting files, and supports schema evolution. – Avro also provides better failure handling than other choices. • SequenceFiles would also be a good choice, and are directly supported by ingestion tools in the ecosystem. – But only supports Java. ©2014 Cloudera, Inc. All Rights Reserved.
  • 58. 58 But Note… • For simplicity, we’ll use plain text for raw data in our example. ©2014 Cloudera, Inc. All Rights Reserved.
  • 59. 59 Architectural Considerations Data Modeling – Processed Data Storage
  • 60. Storage Formats – Raw Data and Processed Data 60 ©2014 Cloudera, Inc. All Rights Reserved. Processed Data Raw Data
  • 61. 61 Access to Processed Data [diagram: a table with Columns A–D full of values; analytical queries touch only a subset of the columns] ©2014 Cloudera, Inc. All Rights Reserved.
  • 62. 62 Columnar Formats • Eliminates I/O for columns that are not part of a query. • Works well for queries that access a subset of columns. • Often provide better compression. • These add up to dramatically improved performance for many queries. [diagram: rows (1, 2014-10-13, abc), (2, 2014-10-14, def), (3, 2014-10-15, ghi) stored row-wise vs column-wise] ©2014 Cloudera, Inc. All Rights Reserved.
  • 63. 63 Columnar Choices – RCFile • Designed to provide efficient processing for Hive queries. • Only supports Java. • No Avro support. • Limited compression support. • Sub-optimal performance compared to newer columnar formats. ©2014 Cloudera, Inc. All Rights Reserved.
  • 64. 64 Columnar Choices – ORC • A better RCFile. • Also designed to provide efficient processing of Hive queries. • Only supports Java. ©2014 Cloudera, Inc. All Rights Reserved.
  • 65. 65 Columnar Choices – Parquet • Designed to provide efficient processing across Hadoop programming interfaces – MapReduce, Hive, Impala, Pig. • Multiple language support – Java, C++ • Good object model support, including Avro. • Broad vendor support. • These features make Parquet a good choice for our processed data. ©2014 Cloudera, Inc. All Rights Reserved.
  • 66. 66 Architectural Considerations Data Modeling – Schema Design
  • 67. 67 HDFS Schema Design – One Recommendation /user/<username> - User specific data, jars, conf files /etl – Data in various stages of ETL workflow /tmp – temp data from tools or shared between users /data – processed data to be shared with the entire organization /app – Everything but data: UDF jars, HQL files, Oozie workflows ©2014 Cloudera, Inc. All Rights Reserved.
  • 68. 68 Partitioning • Split the dataset into smaller consumable chunks. • Rudimentary form of “indexing”. Reduces I/O needed to process queries. ©2014 Cloudera, Inc. All Rights Reserved.
  • 69. 69 Partitioning Un-partitioned HDFS directory structure: dataset/file1.txt, file2.txt, …, filen.txt Partitioned HDFS directory structure: dataset/col=val1/file.txt, col=val2/file.txt, …, col=valn/file.txt ©2014 Cloudera, Inc. All Rights Reserved.
  • 70. 70 Partitioning considerations • What column to partition by? – Don’t have too many partitions (<10,000) – Don’t have too many small files in the partitions – Good to have partition sizes at least ~1 GB • We’ll partition by timestamp. This applies to both our raw and processed data. ©2014 Cloudera, Inc. All Rights Reserved.
  • 71. 71 Partitioning For Our Case Study • Raw dataset: – /etl/BI/casualcyclist/clicks/rawlogs/year=2014/month=10/day=10 • Processed dataset: – /data/bikeshop/clickstream/year=2014/month=10/day=10 ©2014 Cloudera, Inc. All Rights Reserved.
  • 73. Typical Clickstream data sources ©2014 Cloudera, Inc. All rights reserved. 73 • Omniture data on FTP • Apps • App Logs • RDBMS
  • 74. Getting Files from FTP ©2014 Cloudera, Inc. All rights reserved. 74
  • 75. Don’t over-complicate things curl ftp://myftpsite.com/sitecatalyst/ myreport_2014-10-05.tar.gz --user name:password | hdfs -put - /etl/clickstream/raw/ sitecatalyst/myreport_2014-10-05.tar.gz ©2014 Cloudera, Inc. All rights reserved. 75
  • 76. Event Streaming – Flume and Kafka Reliable, distributed and highly available systems That allow streaming events to Hadoop ©2014 Cloudera, Inc. All rights reserved. 76
  • 77. • Many available data collection sources • Well integrated into Hadoop • Supports file transformations • Can implement complex topologies • Very low latency • No programming required ©2014 Cloudera, Inc. All rights reserved. 77 Flume:
  • 78. “We just want to grab data from this directory and write it to HDFS” ©2014 Cloudera, Inc. All rights reserved. 78 We use Flume when:
  • 79. • Very high-throughput publish-subscribe messaging • Highly available • Stores data and can replay • Can support many consumers with no extra latency ©2014 Cloudera, Inc. All rights reserved. 79 Kafka is:
  • 80. “Kafka is awesome. We heard it cures cancer” ©2014 Cloudera, Inc. All rights reserved. 80 Use Kafka When:
  • 81. ©2014 Cloudera, Inc. All rights reserved. 81 Actually, why choose? • Use Flume with a Kafka Source • Lets us get data from Kafka, run some transformations, and write to HDFS, HBase or Solr
  • 82. • We want to ingest events from log files • Flume’s Spooling Directory source fits • With HDFS Sink • We would have used Kafka if… – We wanted the data in non-Hadoop systems too ©2014 Cloudera, Inc. All rights reserved. 82 In Our Example…
  • 83. 83 Short Intro to Flume [diagram of a Flume Agent: Sources (Twitter, logs, JMS, webserver, Kafka) → Interceptors (mask, re-format, validate…) → Selectors (DR, critical) → Channels (memory, file, Kafka) → Sinks (HDFS, HBase, Solr)]
  • 84. 84 Configuration • Declarative – No coding required. – Configuration specifies how components are wired together. ©2014 Cloudera, Inc. All Rights Reserved.
  • 85. 85 Interceptors • Mask fields • Validate information against external source • Extract fields • Modify data format • Filter or split events ©2014 Cloudera, Inc. All rights reserved.
  • 86. ©2014 Cloudera, Inc. All rights reserved. 86 Any sufficiently complex configuration Is indistinguishable from code
  • 87. 87 A Brief Discussion of Flume Patterns – Fan-in • Flume agent runs on each of our servers. • These client agents send data to multiple agents to provide reliability. • Flume provides support for load balancing. ©2014 Cloudera, Inc. All Rights Reserved.
  • 88. 88 A Brief Discussion of Flume Patterns – Splitting • Common need is to split data on ingest. • For example: – Sending data to multiple clusters for DR. – To multiple destinations. • Flume also supports partitioning, which is key to our implementation. ©2014 Cloudera, Inc. All Rights Reserved.
  • 89. 89 Flume Demo – Client Tier [diagram: Web Server → Web Logs → Flume Agent (Spooling Dir Source → Timestamp Interceptor → File Channel → three Avro Sinks)]
  • 90. 90 Flume Demo – Collector Tier [diagram: two collector Flume Agents (Avro Source → File Channel → two HDFS Sinks) writing to HDFS]
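Because Flume agents are wired together declaratively (slide 84), the client tier above maps almost line-for-line onto an agent properties file. A minimal sketch; hostnames, ports and directories are illustrative:

# client-agent.properties - a sketch of the client-tier agent
client.sources  = r1
client.channels = c1
client.sinks    = k1 k2 k3

# Spooling Directory source watching the web server's rolled logs
client.sources.r1.type     = spooldir
client.sources.r1.spoolDir = /var/log/httpd/spool
client.sources.r1.channels = c1
client.sources.r1.interceptors         = ts
client.sources.r1.interceptors.ts.type = timestamp

# Durable file channel so events survive an agent restart
client.channels.c1.type          = file
client.channels.c1.checkpointDir = /var/lib/flume/checkpoint
client.channels.c1.dataDirs      = /var/lib/flume/data

# Three Avro sinks, load-balanced across the collector tier
client.sinks.k1.type     = avro
client.sinks.k1.hostname = collector01
client.sinks.k1.port     = 4141
client.sinks.k1.channel  = c1
client.sinks.k2.type     = avro
client.sinks.k2.hostname = collector02
client.sinks.k2.port     = 4141
client.sinks.k2.channel  = c1
client.sinks.k3.type     = avro
client.sinks.k3.hostname = collector03
client.sinks.k3.port     = 4141
client.sinks.k3.channel  = c1

client.sinkgroups = g1
client.sinkgroups.g1.sinks          = k1 k2 k3
client.sinkgroups.g1.processor.type = load_balance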
  • 91. What if… we were to use Kafka? • Add Kafka producer to our webapp • Send clicks and searches as messages • Flume can ingest events from Kafka • We can add a second consumer for real-time processing in Spark Streaming • Another consumer for alerting… • And maybe a batch consumer too ©2014 Cloudera, Inc. All rights reserved. 91
  • 92. 92 The Kafka Channel [diagram: Kafka Producers (Producer A, B, C) → Kafka as the Flume channel → Sinks (HDFS, HBase, Solr)]
  • 93. 93 The Kafka Channel [diagram: Sources (Twitter, logs, JMS, webserver) → Interceptors (mask, re-format, validate…) → Kafka as the Flume channel → Kafka Consumers (Consumer A, B, C)]
  • 94. 94 The Kafka Channel [diagram: Sources (Twitter, logs, JMS, webserver) → Interceptors (mask, re-format, validate…) → Selectors (DR, critical) → Kafka as the Flume channel → Sinks (HDFS, HBase, Solr)]
  • 95. 95 Architectural Considerations Data Processing – Engines tiny.cloudera.com/app-arch-slides
  • 96. 96 Processing Engines • MapReduce • Abstractions • Spark • Spark Streaming • Impala
  • 97. 97 MapReduce • Oldie but goody • Restrictive framework / innovative workarounds • Extreme batch
  • 98. 98 MapReduce Basic High Level [diagram: a block of data in HDFS (replicated) feeds a Mapper; temp spill data and partitioned, sorted data land on the native file system, are locally copied to the Reducers, and the Reducers write the output file back to HDFS]
  • 99. 99 MapReduce Innovation • Mapper Memory Joins • Reducer Memory Joins • Bucketed Sorted Joins • Cross Task Communication • Windowing • And Much More
  • 100. 100 Abstractions • SQL – Hive • Script/Code – Pig: Pig Latin – Crunch: Java/Scala – Cascading: Java/Scala
  • 101. 101 Spark • The New Kid that isn't that New Anymore • Easily 10x less code • Extremely Easy and Powerful API • Very good for machine learning • Scala, Java, and Python • RDDs • DAG Engine
  • 102. 102 Spark - DAG
  • 103. 103 Spark - DAG [diagram: two TextFile inputs; one is filtered, both are keyed (KeyBy), then joined, filtered again, and sampled with Take]
  • 104. 104 Spark - DAG [diagram: the same DAG under failure – a lost block is recomputed by replaying only its lineage, already-computed partitions are reused, and downstream stages remain future work]
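For readers who want to map the pictures to code, the DAG on slides 103-104 is only a few lines of Spark. A minimal Scala sketch; the paths and key extraction are illustrative. Nothing runs until take(), the action, and a lost block is rebuilt by replaying only its lineage:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._

object DagExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("dag-example"))

    // TextFile -> Filter -> KeyBy
    val clicks = sc.textFile("/etl/clicks")
      .filter(_.nonEmpty)
      .keyBy(_.split(" ")(0))          // key on the first field (IP address)

    // TextFile -> KeyBy
    val users = sc.textFile("/etl/users")
      .keyBy(_.split(",")(0))

    // Join -> Filter -> Take: take() is the action that executes the DAG
    val sample = clicks.join(users)
      .filter { case (_, (click, _)) => click.contains("GET") }
      .take(10)

    sample.foreach(println)
    sc.stop()
  }
}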
  • 105. 105 Spark Streaming • Calling Spark in a Loop • Extends RDDs with DStream • Very Little Code Changes from ETL to Streaming
  • 106. 106 Spark Streaming [diagram: a Receiver turns the source into one RDD per batch; each batch runs a single pass of Filter → Count → Print, shown for the pre-first, first, and second batches]
  • 107. 107 Spark Streaming [diagram: the same per-batch pass, plus a stateful RDD that is carried forward and updated from batch to batch]
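In code, "calling Spark in a loop" means the streaming context produces one RDD per batch interval and re-runs the same pass over each. A minimal Scala sketch of the Filter → Count → Print pass from the diagrams; the socket source, host and 10-second batch interval are illustrative:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingExample {
  def main(args: Array[String]): Unit = {
    // One RDD is produced per 10-second batch
    val ssc = new StreamingContext(
      new SparkConf().setAppName("streaming-example"), Seconds(10))

    val lines = ssc.socketTextStream("collector01", 9999)

    // The same single pass runs over every batch's RDD
    lines.filter(_.contains("GET"))
      .count()
      .print()

    ssc.start()
    ssc.awaitTermination()
  }
}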
  • 108. 108 Impala • MPP Style SQL Engine on top of Hadoop • Very Fast • High Concurrency • Analytical windowing functions (as of CDH 5.2)
  • 109. 109 Impala – Broadcast Join [diagram: every Impala daemon caches 100% of the smaller table; each daemon streams its blocks of the bigger table through a hash join function and emits output]
  • 110. 110 Impala – Partitioned Hash Join [diagram: both tables are hash-partitioned across the daemons, so each daemon caches only ~33% of the smaller table and joins the matching partitions of the bigger table]
  • 111. 111 Impala vs Hive • Very different approaches • We may see convergence at some point • But for now – Impala for speed – Hive for batch
  • 112. 112 Architectural Considerations Data Processing – Patterns and Recommendations
  • 113. 113 What processing needs to happen? • Sessionization • Filtering • Deduplication • BI / Discovery
  • 114. 114 Sessionization Website visit Visitor 1 Session 1 Visitor 1 Session 2 Visitor 2 Session 1 > 30 minutes
  • 115. 115 Why sessionize? Helps answer questions like: • What is my website's bounce rate? – i.e. what % of visitors don't go past the landing page? • Which marketing channels (e.g. organic search, display ad, etc.) are leading to most sessions? – Which ones of those lead to most conversions (e.g. people buying things, signing up, etc.) • Do attribution analysis – which channels are responsible for most conversions?
  • 116. 116 Sessionization 244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463 "http://bestcyclingreviews.com/top_online_shops" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36" 244.157.45.12+1413580110 244.157.45.12 - - [17/Oct/2014:21:59:59 ] "GET /Store/cart.jsp?productID=1023 HTTP/1.0" 200 3757 "http://www.casualcyclist.com" "Mozilla/5.0 (Linux; U; Android 2.3.5; en-us; HTC Vision Build/GRI40) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1" 244.157.45.12+1413583199
  • 117. 117 How to Sessionize? 1. Given a list of clicks, determine which clicks came from the same user (Partitioning, ordering) 2. Given a particular user's clicks, determine if a given click is a part of a new session or a continuation of the previous session (Identifying session boundaries)
  • 118. 118 #1 – Which clicks are from same user? • We can use: – IP address (244.157.45.12) – Cookies (A9A3BECE0563982D) – IP address (244.157.45.12)and user agent string ((KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36") ©2014 Cloudera, Inc. All Rights Reserved.
  • 119. 119 #1 – Which clicks are from same user? • We can use: – IP address (244.157.45.12) – Cookies (A9A3BECE0563982D) – IP address (244.157.45.12)and user agent string ((KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36") ©2014 Cloudera, Inc. All Rights Reserved.
  • 120. 244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463 "http:// bestcyclingreviews.com/top_online_shops" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36” 244.157.45.12 - - [17/Oct/2014:21:59:59 ] "GET /Store/cart.jsp?productID=1023 HTTP/1.0" 200 3757 "http://www.casualcyclist.com" "Mozilla/5.0 (Linux; U; Android 2.3.5; en-us; HTC Vision Build/GRI40) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1” 120 #1 – Which clicks are from same user? ©2014 Cloudera, Inc. All Rights Reserved.
  • 121. > 30 mins apart = different sessions 244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463 "http:// bestcyclingreviews.com/top_online_shops" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36” 244.157.45.12 - - [17/Oct/2014:21:59:59 ] "GET /Store/cart.jsp?productID=1023 HTTP/1.0" 200 3757 "http://www.casualcyclist.com" "Mozilla/5.0 (Linux; U; Android 2.3.5; en-us; HTC Vision Build/GRI40) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1” 121 #2 – Which clicks part of the same session? ©2014 Cloudera, Inc. All Rights Reserved.
  • 122. Sessionization engine recommendation • We have sessionization code in MR and Spark on github. The complexity of the code varies and depends on the expertise in the organization. • We choose MR – The MR API is stable and widely known – No Spark + Oozie (orchestration engine) integration currently ©2014 Cloudera, Inc. All rights reserved. 122
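The deck's sessionization code (MR and Spark versions) lives on the hadooparchitecturebook github; the sketch below is not that code, just a compact Spark illustration of the two steps from slide 117: key clicks by user (IP here; IP plus user agent in practice), order them by time, and start a new session after a 30-minute gap. The parsing and output path are simplified assumptions:

import java.text.SimpleDateFormat
import java.util.Locale
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._

case class Click(ip: String, ts: Long, line: String)

object Sessionize {
  val SessionGapMs = 30 * 60 * 1000L    // > 30 minutes apart = new session

  def parse(line: String): Option[Click] = {
    val f = line.split(" ")
    if (f.length < 4) return None       // drop obviously incomplete records
    val fmt = new SimpleDateFormat("dd/MMM/yyyy:HH:mm:ss", Locale.ENGLISH)
    try Some(Click(f(0), fmt.parse(f(3).stripPrefix("[")).getTime, line))
    catch { case _: Exception => None }
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("sessionize"))
    val clicks = sc
      .textFile("/etl/BI/casualcyclist/clicks/rawlogs/year=2014/month=10/day=10")
      .flatMap(parse)

    // Step 1: partition clicks by user; Step 2: walk them in time order
    val sessions = clicks.keyBy(_.ip).groupByKey().flatMap { case (ip, cs) =>
      val ordered = cs.toSeq.sortBy(_.ts)
      var lastTs = 0L; var start = 0L
      ordered.map { c =>
        if (start == 0L || c.ts - lastTs > SessionGapMs) start = c.ts
        lastTs = c.ts
        (s"$ip+${start / 1000}", c.line)   // session key, e.g. 244.157.45.12+1413580110
      }
    }
    sessions.saveAsTextFile("/data/bikeshop/clickstream/sessions")
    sc.stop()
  }
}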
  • 123. 123 Filtering – filter out incomplete records 244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463 "http:// bestcyclingreviews.com/top_online_shops" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36” 244.157.45.12 - - [17/Oct/2014:21:59:59 ] "GET /Store/cart.jsp?productID=1023 HTTP/1.0" 200 3757 "http://www.casualcyclist.com" "Mozilla/5.0 (Linux; U… ©2014 Cloudera, Inc. All Rights Reserved.
  • 124. Filtering – filter out records from bots/spiders 244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463 "http:// bestcyclingreviews.com/top_online_shops" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36” 209.85.238.11 - - [17/Oct/2014:21:59:59 ] "GET /Store/cart.jsp?productID=1023 HTTP/1.0" 200 3757 "http://www.casualcyclist.com" "Mozilla/5.0 (Linux; U; Android 2.3.5; en-us; HTC Vision Build/GRI40) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1” 124 ©2014 Cloudera, Inc. All Rights Reserved. Google spider IP address
  • 125. Filtering recommendation • Bot/spider filtering can be done easily in any of the engines • Incomplete records are harder to filter in schema systems like Hive, Impala, Pig, etc. • Flume interceptors can also be used • Pretty close choice between MR, Hive and Spark • Can be done in Spark using rdd.filter(), as sketched below • We can simply embed this in our MR sessionization job ©2014 Cloudera, Inc. All rights reserved. 125
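A sketch of both filters with rdd.filter(), as promised above; the bot IP list and the 12-field minimum for a complete combined-log record are illustrative assumptions:

import org.apache.spark.rdd.RDD

object LogFilters {
  // Known spider/bot IPs; in practice this list would be maintained elsewhere
  val botIps = Set("209.85.238.11")

  def cleanLogs(raw: RDD[String]): RDD[String] =
    raw.filter(_.split(" ").length >= 12)                     // drop incomplete records
       .filter(line => !botIps.contains(line.split(" ")(0)))  // drop bot/spider clicks
}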
  • 126. 126 Deduplication – remove duplicate records 244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463 "http:// bestcyclingreviews.com/top_online_shops" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36” 244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463 "http:// bestcyclingreviews.com/top_online_shops" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36” ©2014 Cloudera, Inc. All Rights Reserved.
  • 127. Deduplication recommendation • Can be done in all engines. • We already have a Hive table with all the columns, so a simple DISTINCT query will perform deduplication • distinct() in Spark • We use Pig, as sketched below ©2014 Cloudera, Inc. All rights reserved. 127
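The Pig version of the dedup step is essentially a single DISTINCT. A sketch; the input/output paths and the one-column schema are illustrative:

-- dedup.pig: remove exact duplicate log lines
rawlogs = LOAD '/etl/BI/casualcyclist/clicks/rawlogs/year=2014/month=10/day=10'
          AS (line:chararray);
deduped = DISTINCT rawlogs;
STORE deduped INTO '/etl/BI/casualcyclist/clicks/deduped/year=2014/month=10/day=10';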
  • 128. BI/Discovery engine recommendation • Main requirements for this are: – Low latency – SQL interface (e.g. JDBC/ODBC) – Users don’t know how to code • We chose Impala – It’s a SQL engine – Much faster than other engines – Provides standard JDBC/ODBC interfaces ©2014 Cloudera, Inc. All rights reserved. 128
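Since Impala implements the HiveServer2 protocol (default port 21050), "standard JDBC/ODBC interfaces" means a BI query is ordinary java.sql code. A Scala sketch of the bounce-rate question from slide 115; the host, the unsecured connection string, and a sessionized clickstream table with a session_id column are all assumptions, and the Hive JDBC driver must be on the classpath:

import java.sql.DriverManager

object BounceRate {
  def main(args: Array[String]): Unit = {
    val conn = DriverManager.getConnection("jdbc:hive2://impalad01:21050/;auth=noSasl")
    // Bounce rate: share of sessions that contain exactly one page view
    val rs = conn.createStatement().executeQuery(
      """SELECT COUNT(CASE WHEN hits = 1 THEN 1 END) / COUNT(*) AS bounce_rate
        |FROM (SELECT session_id, COUNT(*) AS hits
        |      FROM clickstream GROUP BY session_id) s""".stripMargin)
    while (rs.next()) println(s"bounce rate: ${rs.getDouble(1)}")
    conn.close()
  }
}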
  • 129. End-to-end processing [diagram: Deduplication → Filtering → Sessionization, with BI tools consuming the result] ©2014 Cloudera, Inc. All rights reserved. 129
  • 131. 131 Orchestrating Clickstream • Data arrives through Flume • Triggers a processing event: – Sessionize – Enrich – Location, marketing channel… – Store as Parquet • Each day we process events from the previous day
  • 132. • Workflow is fairly simple • Need to trigger workflow based on data • Be able to recover from errors • Perhaps notify on the status • And collect metrics for reporting ©2014 Cloudera, Inc. All rights reserved. 132 Choosing Right
  • 133. 133 Oozie or Azkaban? ©2014 Cloudera, Inc. All rights reserved.
  • 134. ©2014 Cloudera, Inc. All rights reserved. 134 Oozie Architecture
  • 135. • Part of all major Hadoop distributions • Hue integration • Built-in actions – Hive, Sqoop, MapReduce, SSH • Complex workflows with decisions • Event and time based scheduling • Notifications • SLA Monitoring • REST API ©2014 Cloudera, Inc. All rights reserved. 135 Oozie features
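The event- and time-based scheduling above is what makes Oozie a fit here: a coordinator can fire daily and wait for the previous day's dataset to land before launching the workflow. A minimal sketch of such a coordinator; the names, paths and dates are illustrative:

<coordinator-app name="clickstream-daily" frequency="${coord:days(1)}"
                 start="2014-10-10T00:00Z" end="2015-10-10T00:00Z" timezone="UTC"
                 xmlns="uri:oozie:coordinator:0.4">
  <datasets>
    <dataset name="rawlogs" frequency="${coord:days(1)}"
             initial-instance="2014-10-10T00:00Z" timezone="UTC">
      <uri-template>/etl/BI/casualcyclist/clicks/rawlogs/year=${YEAR}/month=${MONTH}/day=${DAY}</uri-template>
    </dataset>
  </datasets>
  <input-events>
    <!-- trigger on yesterday's data, once it has fully landed -->
    <data-in name="input" dataset="rawlogs">
      <instance>${coord:current(-1)}</instance>
    </data-in>
  </input-events>
  <action>
    <workflow>
      <app-path>/app/clickstream/workflow.xml</app-path>
    </workflow>
  </action>
</coordinator-app>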
  • 136. ©2014 Cloudera, Inc. All rights reserved. 136 Oozie Drawbacks • Overhead in launching jobs • Steep learning curve • XML Workflows
  • 137. 137 Azkaban Architecture [diagram: a client talks to the Azkaban Web Server (with an HDFS viewer plugin) and the Azkaban Executor Server (with job type plugins), both backed by MySQL, submitting work to Hadoop] ©2014 Cloudera, Inc. All rights reserved.
  • 138. • Simplicity • Great UI – including pluggable visualizers • Lots of plugins – Hive, Pig… • Reporting plugin ©2014 Cloudera, Inc. All rights reserved. 138 Azkaban features
  • 139. • Doesn’t support workflow decisions • Can’t represent data dependency ©2014 Cloudera, Inc. All rights reserved. 139 Azkaban Limitations
  • 140. • Workflow is fairly simple • Need to trigger workflow based on data • Be able to recover from errors • Perhaps notify on the status • And collect metrics for reporting ©2014 Cloudera, Inc. All rights reserved. 140 Choosing… Easier in Oozie
  • 141. Choosing the right Orchestration Tool • Workflow is fairly simple • Need to trigger workflow based on data • Be able to recover from errors • Perhaps notify on the status • And collect metrics for reporting Better in Azkaban ©2014 Cloudera, Inc. All rights reserved. 141
  • 142. Important Decision Consideration! The best orchestration tool is the one you are an expert on ©2014 Cloudera, Inc. All rights reserved. 142
  • 143. 145 Putting It All Together Final Architecture
  • 144. 146 Final Architecture – High Level Overview [diagram: Data Sources → Ingestion → Raw Data Storage (Formats, Schema) → Data Processing → Processed Data Storage (Formats, Schema) → Data Consumption, with Orchestration (Scheduling, Managing, Monitoring) spanning the pipeline] ©2014 Cloudera, Inc. All Rights Reserved.
  • 145. 147 Final Architecture – High Level Overview [diagram: the same high-level overview, repeated to introduce the ingestion and storage stages] ©2014 Cloudera, Inc. All Rights Reserved.
  • 146. 148 Final Architecture – Ingestion/Storage [diagram: many web servers, each running a client Flume agent, fan in to multiple collector Flume agents (for failover and rolling restarts), which write to HDFS under /etl/BI/casualcyclist/clicks/rawlogs/year=2014/month=10/day=10] ©2014 Cloudera, Inc. All Rights Reserved.
  • 147. 149 Final Architecture – High Level Overview [diagram: the same high-level overview, repeated to introduce the processing stage] ©2014 Cloudera, Inc. All Rights Reserved.
  • 148. 150 Final Architecture – Processing and Storage [diagram: /etl/BI/casualcyclist/clicks/rawlogs/year=2014/month=10/day=10 → dedup → filtering → sessionization → parquetize → /data/bikeshop/clickstream/year=2014/month=10/day=10] ©2014 Cloudera, Inc. All Rights Reserved.
  • 149. 151 Final Architecture – High Level Overview [diagram: the same high-level overview, repeated to introduce the data access stage] ©2014 Cloudera, Inc. All Rights Reserved.
  • 150. 152 Final Architecture – Data Access [diagram: Hive/Impala serve BI and analytics tools over JDBC/ODBC; Sqoop exports to the DWH; a DB import tool pulls extracts to local disk for R, etc.] ©2014 Cloudera, Inc. All Rights Reserved.
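The Sqoop arrow to the DWH in the diagram above is typically a single scheduled command. A sketch; the JDBC URL, credentials, target table and field delimiter are illustrative:

# Export the processed, sessionized data to the warehouse
sqoop export \
  --connect jdbc:oracle:thin:@dwh01:1521/DWH \
  --username etl --password-file /user/etl/.dwh_password \
  --table CLICKSTREAM_SESSIONS \
  --export-dir /data/bikeshop/clickstream/year=2014/month=10/day=10 \
  --input-fields-terminated-by '\t'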
  • 151. Demo ©2014 Cloudera, Inc. All rights reserved. 153
  • 152. 154 Stay in touch! @hadooparchbook hadooparchitecturebook.com slideshare.com/hadooparchbook
  • 153. 155 Join the Discussion Get community help or provide feedback cloudera.com/community
  • 154. 156 Try Hadoop Now cloudera.com/live
  • 155. 157 Visit us at Booth #408 Highlights: Hear what's new with CDH 5.2, including Impala 2.0 Learn how Cloudera is setting the standard for Hadoop in the Cloud BOOK SIGNINGS THEATER SESSIONS TECHNICAL DEMOS GIVEAWAYS
  • 156. 158 Free books and office hours! • Book signings – Nov 20, 3:15 – 3:45 PM in Expo Hall – Cloudera Booth (#408) – Nov 20, 6:25 – 6:55PM in Expo Hall - O'Reilly Booth • Office Hours – Mark and Ted, Nov 20, 1:45 PM Table A ©2014 Cloudera, Inc. All Rights Reserved.

Editor's Notes

  1. Talk about confusion even with knowledgeable users about how all the components fit together to implement applications.
  2. Think about kittens/cars getting challenged
  3. Data ingestion – what requirements do we have for moving data into our processing flow? Data storage – what requirements do we have for the storage of data, both incoming raw data and processed data? Data processing – how do we need to process the data to meet our functional requirements? Workflow orchestration – how do we manage and monitor all the processing?
  4. We have a farm of web servers – this could be tens of servers, or hundreds of servers, and each of these servers is generating multiple logs every day. This may just be a few GB per server, but the total log volume over time can quickly become terabytes of data. As traffic on our websites increases, we add more web servers, which means even more logs. We may also decide we need to bring in additional data sources, for example CRM data, or data stored in our operational data stores. Additionally, we may determine that there’s valuable data in Hadoop that we want to bring in to external data stores – for example info to enrich our customer records.
  5. Add title slide for storage reqs
  6. Data needs to be stored in its raw form with full fidelity. This allows us to reprocess the data based on changing or new requirements. Data needs to be stored in a format that facilitates access by data processing frameworks on Hadoop. Data needs to be compressed to reduce storage requirements.
  7. So simple! We can just write a quick bash script, schedule it in cron and we are done. This is actually not a bad way to start a project – it shows value very quickly. The important part is to know when to ditch the script.
  8. I typically ditch the script the moment additional requirements arrive. The first few are still simple enough in bash, but soon enough…
  9. There’s a need for a more sophisticated approach, or we’ll be drowning in bash scripts
  10. Even if we use an engine that allows for complex workflows like Spark – orchestration makes things like recovering from errors, managing dependencies and reusing components easier.
  11. Now that we understand our requirements, we need to look at considerations for meeting these requirements, starting with data storage.
  12. Now that we understand our requirements, we need to look at considerations for meeting these requirements, starting with data storage.
  13. Random access to data doesn’t provide any benefit for our workloads, so HBase is not a good choice. We may later decide that HBase has a place in our architecture, but would add unnecessary complexity right now.
  14. Recall that we’ll be dealing with both raw data being ingested from our web servers, as well as data that’s the output of processing. These two types of data will have different requirements and considerations. We’ll start by discussing the raw data.
  15. We could store the logs as plain text. This is well supported by Hadoop, and will allow processing by all processing frameworks that run on Hadoop. This will quickly consume considerable storage in Hadoop though. This may also not be optimal for processing.
  16. We could store the logs as plain text. This is well supported by Hadoop, and will allow processing by all processing frameworks that run on Hadoop. This will quickly consume considerable storage in Hadoop though. This may also not be optimal for processing.
  17. SequenceFiles are well suited as a container for data stored in Hadoop, and were specifically designed to work with MapReduce. SequenceFiles provide Block compression, which will compress a block of records once they reach a specific size. Block level compression with sequence files allows us to use a non-splittable compression format like Gzip or Snappy, and make it splittable. Important to note that SequenceFile blocks refer to a block of records compressed within a SequenceFile, and are different than HDFS blocks. What’s not shown here is a sync marker that’s written before each block of data, which allows readers of the file to sync to block boundaries.
  18. Avro can be seen as a more advanced SequenceFile Avro files store the metadata in the header using JSON. An important feature of Avro is that schemas can evolve, so the schema used to read the file doesn’t need to match the schema used to write the file. The Avro format is very compact, and also supports splittable compression.
  19. Recall that much of our access to the processed data will be through analytical queries that need to access multiple rows, and often only select columns from those rows.
  20. Access to /data often needs to be controlled, since it contains business critical data sets. Generally only automated processes write to this directory, and different business groups will have read access to only required sub-directories. /app will be used for things like artifacts required to run Oozie workflows.
  21. Note that partitions are actually directories.
  22. Note that partitions are actually directories.
  23. Indicate this applies to raw and processed
  24. Typically we are looking at a few files landing at the FTP site once a day; scheduling a job on an edge node of the cluster once a day to fetch the files and stream them to HDFS is fine. If an import fails, it will fail completely and you can retry.
  25. Ease of deploy and management is important Customer will not write code Interceptors are important Data-push is important Data will always end up in Hadoop
  26. Many planned consumers High availability is critical You have control over sources of data You are happy to write producers yourself
  27. Does not require programming.
  28. Only without a debugger
  29. Multiple agents acting as collectors provides reliability – if one node goes down we’ll still be able to ingest events. Flume provides support for load balancing such as round robin.
  30. Does not require programming.
  31. Does not require programming.
  32. Does not require programming.
  33. There’s a need for a more sophisticated approach, or we’ll be drowning in bash scripts
  34. There’s a need for a more sophisticated approach, or we’ll be drowning in bash scripts
  35. There’s a need for a more sophisticated approach, or we’ll be drowning in bash scripts
  36. There’s a need for a more sophisticated approach, or we’ll be drowning in bash scripts
  37. There’s a need for a more sophisticated approach, or we’ll be drowning in bash scripts
  38. One or two hive actions would do, maybe some error handling The workflow is simple enough to work in any tool – Bash, Azkaban… but Oozie’s dataset triggers make it a good fit for this use-case Note that if we were to use Kafka, the workflow would be even simpler and we wouldn’t use time-based scheduling
  39. There are a lot of Orchestration tools out there. And ETL tools typically do orchestration too. We want to focus on the open-source systems that were built to work with Hadoop – they were built to scale with the cluster without a single node as a bottleneck
  40. Not having to write XML is huge for many people.
  41. One or two hive actions would do, maybe some error handling The workflow is simple enough to work in any tool – Bash, Azkaban… but Oozie’s dataset triggers make it a good fit for this use-case Oozie also makes recovery from errors easier: Data sets are immutable, actions are idempotent and oozie supports restarting workflow and running only the failed action.
  42. This is something you’ll need to add yourself, possibly using an RDBMS and custom java action, if advanced metrics are important
  43. We should probably do this with Hue