SlideShare a Scribd company logo
1 of 157
Download to read offline
Architectural 
Considerations 
for Hadoop 
Applications 
Strata/Hadoop World 2014 ā€“ October 15th 2014 
Mark Grover | @mark_grover 
Ted Malaska | @TedMalaska 
Jonathan Seidman | @jseidman 
Gwen Shapira | @gwenshap
2 
About the book 
ā€¢ @hadooparchbook 
ā€¢ hadooparchitecturebook.com 
ā€¢ github.com/hadooparchitecturebook 
ā€¢ slideshare.com/hadooparchbook 
Ā©2014 Cloudera, Inc. All Rights Reserved.
3 
About the presenters 
ā€¢ Principal Solutions 
Architect at Cloudera 
ā€¢ Previously, lead architect 
at FINRA 
ā€¢ Contributor to Apache 
Flume, Avro, YARN, Pig 
and Spark 
Ted Malaska Jonathan Seidman 
ā€¢ Senior Solutions 
Architect/Partner 
Enablement at Cloudera 
ā€¢ Previously, Technical 
Lead on the big data 
team at Orbitz Worldwide 
ā€¢ Co-founder of the 
Chicago Hadoop User 
Group and Chicago Big 
Data 
Ā©2014 Cloudera, Inc. All Rights Reserved.
4 
About the presenters 
Gwen Shapira Mark Grover 
ā€¢ Solutions Architect turned 
Software Engineer at 
Cloudera 
ā€¢ Contributor to Sqoop, 
Flume and Kafka 
ā€¢ Formerly a senior 
consultant at Pythian, 
Oracle ACE Director 
ā€¢ Software Engineer at 
Cloudera 
ā€¢ Committer on Apache 
Bigtop, PMC member on 
Apache Sentry 
(incubating) 
ā€¢ Contributor to Apache 
Hadoop, Spark, Hive, 
Sqoop, Pig and Flume 
Ā©2014 Cloudera, Inc. All Rights Reserved.
5 
Logistics 
ā€¢ Coffee break around 10:30 
ā€¢ Questions at the end of each section 
Ā©2014 Cloudera, Inc. All Rights Reserved.
6 
Case Study 
Clickstream Analysis
7 
Analytics 
Ā©2014 Cloudera, Inc. All Rights Reserved.
8 
Analytics 
Ā©2014 Cloudera, Inc. All Rights Reserved.
9 
Analytics 
Ā©2014 Cloudera, Inc. All Rights Reserved.
10 
Analytics 
Ā©2014 Cloudera, Inc. All Rights Reserved.
11 
Analytics 
Ā©2014 Cloudera, Inc. All Rights Reserved.
12 
Analytics 
Ā©2014 Cloudera, Inc. All Rights Reserved.
13 
Analytics 
Ā©2014 Cloudera, Inc. All Rights Reserved.
14 
Web Logs ā€“ Combined Log Format 
244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 
200 4463 "http://bestcyclingreviews.com/top_online_shops" "Mozilla/ 
5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, 
like Gecko) Chrome/36.0.1944.0 Safari/537.36ā€ 
244.157.45.12 - - [17/Oct/2014:21:59:59 ] "GET /Store/cart.jsp? 
productID=1023 HTTP/1.0" 200 3757 "http://www.casualcyclist.com" 
"Mozilla/5.0 (Linux; U; Android 2.3.5; en-us; HTC Vision Build/ 
GRI40) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile 
Safari/533.1ā€ 
Ā©2014 Cloudera, Inc. All Rights Reserved.
15 
Clickstream Analytics 
Ā©2014 Cloudera, Inc. All Rights Reserved. 
244.157.45.12 - - [17/Oct/ 
2014:21:08:30 ] "GET /seatposts 
HTTP/1.0" 200 4463 "http:// 
bestcyclingreviews.com/ 
top_online_shops" "Mozilla/5.0 
(Macintosh; Intel Mac OS X 10_9_2) 
AppleWebKit/537.36 (KHTML, like 
Gecko) Chrome/36.0.1944.0 Safari/ 
537.36ā€
16 
Similar use-cases 
ā€¢ Sensors ā€“ heart, agriculture, etc. 
ā€¢ Casinos ā€“ session of a person at a table 
Ā©2014 Cloudera, Inc. All Rights Reserved.
17 
Pre-Hadoop 
Architecture 
Clickstream Analysis
18 
Click Stream Analysis (Before Hadoop) 
Transform/ 
Aggregate Business 
Ā©2014 Cloudera, Inc. All Rights Reserved. 
Web logs 
(full fidelity) 
(2 weeks) 
Data Warehouse 
Intelligence 
Tape Archive
19 
Problems with Pre-Hadoop Architecture 
ā€¢ Full fidelity data is stored for small amount of time (~weeks). 
ā€¢ Older data is sent to tape, or even worse, deleted! 
ā€¢ Inflexible workflow - think of all aggregations beforehand 
Ā©2014 Cloudera, Inc. All Rights Reserved.
20 
Effects of Pre-Hadoop Architecture 
ā€¢ Regenerating aggregates is expensive or worse, impossible 
ā€¢ Canā€™t correct bugs in the workflow/aggregation logic 
ā€¢ Canā€™t do experiments on existing data 
Ā©2014 Cloudera, Inc. All Rights Reserved.
21 
Why is Hadoop A 
Great Fit? 
Clickstream Analysis
22 
Why is Hadoop a great fit? 
ā€¢ Volume of clickstream data is huge 
ā€¢ Velocity at which it comes in is high 
ā€¢ Variety of data is diverse - semi-structured data 
ā€¢ Hadoop enables 
ā€“ active archival of data 
ā€“ Aggregation jobs 
ā€“ Querying the above aggregates or raw fidelity data 
Ā©2014 Cloudera, Inc. All Rights Reserved.
Intelligence 
23 
Click Stream Analysis (with Hadoop) 
Web logs Hadoop Business 
Ā©2014 Cloudera, Inc. All Rights Reserved. 
Active archive (no tape) 
Aggregation engine 
Querying engine
24 
Challenges of Hadoop Implementation 
Ā©2014 Cloudera, Inc. All Rights Reserved.
25 
Challenges of Hadoop Implementation 
Ā©2014 Cloudera, Inc. All Rights Reserved.
26 
Other challenges - Architectural Considerations 
ā€¢ Storage managers? 
ā€“ HDFS? HBase? 
ā€¢ Data storage and modeling: 
ā€“ File formats? Compression? Schema design? 
ā€¢ Data movement 
ā€“ How do we actually get the data into Hadoop? How do we get it out? 
ā€¢ Metadata 
ā€“ How do we manage data about the data? 
ā€¢ Data access and processing 
ā€“ How will the data be accessed once in Hadoop? How can we transform it? How do 
we query it? 
ā€¢ Orchestration 
ā€“ How do we manage the workflow for all of this? 
Ā©2014 Cloudera, Inc. All Rights Reserved.
27 
Case Study 
Requirements 
Overview of Requirements
28 
Overview of Requirements 
ā€¢ Data ingestion ā€“ what requirements do we have for moving data into 
our processing flow? 
ā€¢ Data storage ā€“ what requirements do we have for the storage of 
data, both incoming raw data and processed data? 
ā€¢ Data processing ā€“ how do we need to process the data to meet our 
functional requirements? 
ā€¢ Workflow orchestration ā€“ how do we manage and monitor all the 
processing? 
Ā©2014 Cloudera, Inc. All Rights Reserved.
29 
Case Study 
Requirements 
Data Ingestion
30 
Data Ingestion Requirements 
Ā©2014 Cloudera, Inc. All Rights Reserved. 
Web Servers 
Web Servers 
Web Servers 
Web Servers 
Web Servers 
Logs 
244.157.45.12 - - [17/ 
Oct/2014:21:08:30 ] 
"GET /seatposts HTTP/ 
1.0" 200 4463 ā€¦ 
CRM 
Data 
ODS 
Web Servers 
Web Servers 
Web Servers 
Web Servers 
Web Servers 
Logs 
Hadoop
31 
Data Ingestion Requirements 
ā€¢ So we need to be able to support: 
ā€“ Reliable ingestion of large volumes of semi-structured event data arriving with 
high velocity (e.g. logs). 
ā€“ Timeliness of data availability ā€“ data needs to be available for processing to 
meet business service level agreements. 
ā€“ Periodic ingestion of data from relational data stores. 
Ā©2014 Cloudera, Inc. All Rights Reserved.
32 
Case Study 
Requirements 
Data Storage
33 
Data Storage Requirements 
Ā©2014 Cloudera, Inc. All Rights Reserved. 
Store all the data 
Make the data 
accessible for 
processing 
Compress 
the data
34 
Case Study 
Requirements 
Data Processing
35 
Processing requirements 
Be able to answer questions like: 
ā€¢ What is my websiteā€™s bounce rate? 
ā€“ i.e. how many % of visitors donā€™t go past the landing page? 
ā€¢ Which marketing channels are leading to most sessions? 
ā€¢ Do attribution analysis 
ā€“ Which channels are responsible for most conversions? 
Ā©2014 Cloudera, Inc. All Rights Reserved.
36 
Sessionization 
Ā©2014 Cloudera, Inc. All Rights Reserved. 
Website visit 
Visitor 1 
Session 1 
Visitor 1 
Session 2 
Visitor 2 
Session 1 
> 30 minutes
37 
Case Study 
Requirements 
Orchestration
38 
Orchestration is simple 
We just need to execute actions 
One after another 
Ā©2014 Cloudera, Inc. All Rights Reserved.
Ā©2014 Cloudera, Inc. All Rights Reserved. 39 
Actually, 
we also need to handle errors 
And user notifications 
ā€¦.
Andā€¦ 
ā€¢ Re-start workflows after errors 
ā€¢ Reuse of actions in multiple workflows 
ā€¢ Complex workflows with decision points 
ā€¢ Trigger actions based on events 
ā€¢ Tracking metadata 
ā€¢ Integration with enterprise software 
ā€¢ Data lifecycle 
ā€¢ Data quality control 
ā€¢ Reports 
Ā©2014 Cloudera, Inc. All Rights Reserved. 40
Ā©2014 Cloudera, Inc. All Rights Reserved. 41 
OK, maybe we need a product 
To help us do all that
42 
Architectural 
Considerations 
Data Modeling
43 
Data Modeling Considerations 
ā€¢ We need to consider the following in our architecture: 
ā€“ Storage layer ā€“ HDFS? HBase? Etc. 
ā€“ File system schemas ā€“ how will we lay out the data? 
ā€“ File formats ā€“ what storage formats to use for our data, both raw and 
processed data? 
ā€“ Data compression formats? 
Ā©2014 Cloudera, Inc. All Rights Reserved.
44 
Architectural 
Considerations 
Data Modeling ā€“ Storage Layer
45 
Data Storage Layer Choices 
ā€¢ Two likely choices for raw data: 
Ā©2014 Cloudera, Inc. All Rights Reserved.
46 
Data Storage Layer Choices 
ā€¢ Stores data directly as files 
ā€¢ Fast scans 
ā€¢ Poor random reads/writes 
ā€¢ Stores data as Hfiles on 
HDFS 
ā€¢ Slow scans 
ā€¢ Fast random reads/writes 
Ā©2014 Cloudera, Inc. All Rights Reserved.
47 
Data Storage ā€“ Storage Manager Considerations 
ā€¢ Incoming raw data: 
ā€“ Processing requirements call for batch transformations across multiple 
records ā€“ for example sessionization. 
ā€¢ Processed data: 
ā€“ Access to processed data will be via things like analytical queries ā€“ again 
requiring access to multiple records. 
ā€¢ We choose HDFS 
ā€“ Processing needs in this case served better by fast scans. 
Ā©2014 Cloudera, Inc. All Rights Reserved.
48 
Architectural 
Considerations 
Data Modeling ā€“ Raw Data Storage
Storage Formats ā€“ Raw Data and Processed Data 
49 
Ā©2014 Cloudera, Inc. All Rights Reserved. 
Processed 
Data 
Raw 
Data
50 
Raw Data Storage ā€“ Format Considerations 
ā€¢ Store as plain text? 
ā€“ Sure, well supported by Hadoop. 
ā€“ Text can easily be processed by MapReduce, loaded into Hive for analysis, 
and so on. 
ā€¢ Butā€¦ 
ā€“ Will begin to consume lots of space in HDFS. 
ā€“ May not be optimal for processing by tools in the Hadoop ecosystem. 
Ā©2014 Cloudera, Inc. All Rights Reserved.
51 
Raw Data Storage ā€“ Format Considerations 
ā€¢ But, we can compress the text filesā€¦ 
ā€“ Gzip ā€“ supported by Hadoop, but not splittable. 
ā€“ Bzip2 ā€“ hey, splittable! Great compression! But decompression is slooowww. 
ā€“ LZO ā€“ splittable (with some work), good compress/de-compress performance. 
Good choice for storing text files on Hadoop. 
ā€“ Snappy ā€“ provides a good tradeoff between size and speed. 
Ā©2014 Cloudera, Inc. All Rights Reserved.
52 
Raw Data Storage ā€“ More About Snappy 
ā€¢ Designed at Google to provide high compression speeds with reasonable 
compression. 
ā€¢ Not the highest compression, but provides very good performance for 
processing on Hadoop. 
ā€¢ Snappy is not splittable though, which brings us toā€¦ 
Ā©2014 Cloudera, Inc. All Rights Reserved.
53 
SequenceFile 
ā€¢ Stores records as 
binary key/value pairs. 
ā€¢ SequenceFile ā€œblocksā€ 
can be compressed. 
ā€¢ This enables 
splittability with non-splittable 
compression. 
Ā©2014 Cloudera, Inc. All Rights Reserved.
54 
Avro 
ā€¢ Kinda SequenceFile on 
Steroids. 
ā€¢ Self-documenting ā€“ 
stores schema in 
header. 
ā€¢ Provides very efficient 
storage. 
ā€¢ Supports splittable 
compression. Ā©2014 Cloudera, Inc. All Rights Reserved.
55 
Our Format Recommendations for Raw Dataā€¦ 
ā€¢ Avro with Snappy 
ā€“ Snappy provides optimized compression. 
ā€“ Avro provides compact storage, self-documenting files, and supports schema 
evolution. 
ā€“ Avro also provides better failure handling than other choices. 
ā€¢ SequenceFiles would also be a good choice, and are directly 
supported by ingestion tools in the ecosystem. 
ā€“ But only supports Java. 
Ā©2014 Cloudera, Inc. All Rights Reserved.
56 
But Noteā€¦ 
ā€¢ For simplicity, weā€™ll use plain text for raw data in our example. 
Ā©2014 Cloudera, Inc. All Rights Reserved.
57 
Architectural 
Considerations 
Data Modeling ā€“ Processed Data Storage
Storage Formats ā€“ Raw Data and Processed Data 
58 
Ā©2014 Cloudera, Inc. All Rights Reserved. 
Processed 
Data 
Raw 
Data
59 
Access to Processed Data 
Column A Column B Column C Column D 
Value Value Value Value 
Value Value Value Value 
Value Value Value Value 
Value Value Value Value 
Ā©2014 Cloudera, Inc. All Rights Reserved. 
Analytical 
Queries
60 
Columnar Formats 
ā€¢ Eliminates I/O for columns that are not part of a query. 
ā€¢ Works well for queries that access a subset of columns. 
ā€¢ Often provide better compression. 
ā€¢ These add up to dramatically improved performance for many 
queries. 
Ā©2014 Cloudera, Inc. All Rights Reserved. 
1 2014-10-1 
3 
abc 
2 2014-10-1 
4 
def 
3 2014-10-1 
5 
ghi 
1 2 3 
2014-10-1 
2014-10-1 
3 
4 
2014-10-1 
5 
abc def ghi
61 
Columnar Choices ā€“ RCFile 
ā€¢ Provides efficient processing for MapReduce applications, but 
generally only used with Hive. 
ā€¢ Only supports Java. 
ā€¢ No Avro support. 
ā€¢ Limited compression support. 
ā€¢ Sub-optimal performance compared to newer columnar formats. 
Ā©2014 Cloudera, Inc. All Rights Reserved.
62 
Columnar Choices ā€“ ORC 
ā€¢ A better RCFile. 
ā€¢ Supports Hive, but currently limited support for other interfaces. 
ā€¢ Only supports Java. 
Ā©2014 Cloudera, Inc. All Rights Reserved.
63 
Columnar Choices ā€“ Parquet 
ā€¢ Supports multiple programming interfaces ā€“ MapReduce, Hive, 
Impala, Pig. 
ā€¢ Multiple language support. 
ā€¢ Broad vendor support. 
ā€¢ These features make Parquet a good choice for our processed data. 
Ā©2014 Cloudera, Inc. All Rights Reserved.
64 
Architectural 
Considerations 
Data Modeling ā€“ Schema Design
65 
HDFS Schema Design ā€“ One Recommendation 
/user/<username> - User specific data, jars, conf files 
/etl ā€“ Data in various stages of ETL workflow 
/tmp ā€“ temp data from tools or shared between users 
/data ā€“ processed data to be shared data with the entire organization 
/app ā€“ Everything but data: UDF jars, HQL files, Oozie workflows 
Ā©2014 Cloudera, Inc. All Rights Reserved.
66 
Partitioning 
ā€¢ Split the dataset into smaller consumable chunks. 
ā€¢ Rudimentary form of ā€œindexingā€. Reduces I/O needed to process 
queries. 
Ā©2014 Cloudera, Inc. All Rights Reserved. 
dataset 
col=val1/file.txt 
col=val2/file.txt 
ā€¦ 
col=valn/file.txt 
Un-partitioned HDFS directory 
dataset 
file1.txt 
file2.txt 
ā€¦ 
filen.txt 
structure 
Partitioned HDFS directory 
structure
67 
Partitioning considerations 
ā€¢ What column to partition by? 
ā€“ Donā€™t have too many partitions (<10,000) 
ā€“ Donā€™t have too many small files in the partitions (more than block size 
generally) 
ā€¢ Weā€™ll partition by timestamp. This applies to both our raw and 
processed data. 
Ā©2014 Cloudera, Inc. All Rights Reserved.
68 
Partitioning For Our Case Study 
ā€¢ Raw dataset: 
ā€“ /etl/BI/casualcyclist/clicks/rawlogs/year=2014/month=10/day=10! 
ā€¢ Processed dataset: 
ā€“ /data/bikeshop/clickstream/year=2014/month=10/day=10! 
Ā©2014 Cloudera, Inc. All Rights Reserved.
69 
A Word About Partitioning Alternatives 
year=2014/month=10/day=10 dt=2014-10-10 
Ā©2014 Cloudera, Inc. All Rights Reserved.
70 
Architectural 
Considerations 
Data Ingestion
Typical Clickstream data sources 
Ā©2014 Cloudera, Inc. All rights reserved. 71 
ā€¢ Omniture data on FTP 
ā€¢ Apps 
ā€¢ App Logs 
ā€¢ RDBMS
Getting Files from FTP 
Ā©2014 Cloudera, Inc. All rights reserved. 72
Donā€™t over-complicate things 
curl ftp://myftpsite.com/sitecatalyst/ 
myreport_2014-10-05.tar.gz 
--user name:password | hdfs -put - /etl/clickstream/raw/ 
sitecatalyst/myreport_2014-10-05.tar.gz 
Ā©2014 Cloudera, Inc. All rights reserved. 73
Event Streaming ā€“ Flume and Kafka 
Reliable, distributed and highly available systems 
That allow streaming events to Hadoop 
Ā©2014 Cloudera, Inc. All rights reserved. 74
ā€¢ Many available data collection sources 
ā€¢ Well integrated into Hadoop 
ā€¢ Supports file transformations 
ā€¢ Can implement complex topologies 
ā€¢ Very low latency 
ā€¢ No programming required 
Ā©2014 Cloudera, Inc. All rights reserved. 75 
Flume:
ā€œWe just want to grab data from this directory and 
write it to HDFSā€ 
Ā©2014 Cloudera, Inc. All rights reserved. 76 
We use Flume when:
ā€¢ Very high-throughput publish-subscribe messaging 
ā€¢ Highly available 
ā€¢ Stores data and can replay 
ā€¢ Can support many consumers with no extra latency 
Ā©2014 Cloudera, Inc. All rights reserved. 77 
Kafka is:
ā€œKafka is awesome. 
We heard it cures cancerā€ 
Ā©2014 Cloudera, Inc. All rights reserved. 78 
Use Kafka When:
Ā©2014 Cloudera, Inc. All rights reserved. 79 
Actually, why choose? 
ā€¢ Use Flume with a Kafka Source 
ā€¢ Allows to get data from Kafka, 
run some transformations 
write to HDFS, HBase or Solr
ā€¢ We want to ingest events from log files 
ā€¢ Flumeā€™s Spooling Directory source fits 
ā€¢ With HDFS Sink 
ā€¢ We would have used Kafka ifā€¦ 
ā€“ We wanted the data in non-Hadoop systems too 
Ā©2014 Cloudera, Inc. All rights reserved. 80 
In Our Exampleā€¦
81 
Short Intro to Flume 
Sources Interceptors Selectors Channels Sinks 
Flume Agent 
Twitter, logs, JMS, 
webserver 
Mask, re-format, 
validateā€¦ 
DR, critical Memory, file HDFS, Hbase, 
Solr
82 
Configuration 
ā€¢ Declarative 
ā€“ No coding required. 
ā€“ Configuration specifies 
how components are 
wired together. 
Ā©2014 Cloudera, Inc. All Rights Reserved.
83 
Interceptors 
ā€¢ Mask fields 
ā€¢ Validate information 
against external source 
ā€¢ Extract fields 
ā€¢ Modify data format 
ā€¢ Filter or split events 
Ā©2014 Cloudera, Inc. All rights reserved.
Ā©2014 Cloudera, Inc. All rights reserved. 84 
Any sufficiently complex configuration 
Is indistinguishable from code
85 
A Brief Discussion of Flume Patterns ā€“ Fan-in 
ā€¢ Flume agent runs on 
each of our servers. 
ā€¢ These client agents 
send data to multiple 
agents to provide 
reliability. 
ā€¢ Flume provides support 
for load balancing. 
Ā©2014 Cloudera, Inc. All Rights Reserved.
86 
A Brief Discussion of Flume Patterns ā€“ Splitting 
ā€¢ Common need is to split 
data on ingest. 
ā€¢ For example: 
ā€“ Sending data to multiple 
clusters for DR. 
ā€“ To multiple destinations. 
ā€¢ Flume also supports 
partitioning, which is key 
to our implementation. 
Ā©2014 Cloudera, Inc. All Rights Reserved.
Flume Demo 
Ā©2014 Cloudera, Inc. All rights reserved. 87
88 
Flume Demo ā€“ Architecture 
Web Server Flume Agent 
Web Server Flume Agent 
Web Server Flume Agent 
Web Server Flume Agent 
Web Server Flume Agent 
Web Server Flume Agent 
Web Server Flume Agent 
Web Server Flume Agent 
Flume Agent 
Flume Agent 
Flume Agent 
Flume Agent 
Fan-in 
Pattern 
Multi Agents for 
Failover and rolling 
restarts 
HDFS 
Ā©2014 Cloudera, Inc. All Rights Reserved.
89 
Flume Agent 
Web Logs 
Spooling 
Dir 
Source 
Timestamp 
Interceptor 
File 
Channel 
Avro Sink 
Avro Sink 
Avro Sink 
Flume Demo ā€“ Client Tier 
Web Server Flume Agent
90 
Flume Demo ā€“ Collector Tier 
Flume Agent 
Flume Agent 
HDFS 
Avro 
Source 
File 
Channel HDFS Sink 
HDFS Sink
What ifā€¦. We were to use Kafka? 
ā€¢ Add Kafka producer to our webapp 
ā€¢ Send clicks and searches as messages 
ā€¢ Run an MR job once a day to grab all new events 
ā€¢ We can add a second consumer for real-time 
ā€¢ Another consumer for alertingā€¦ 
Ā©2014 Cloudera, Inc. All rights reserved. 91
Ā©2014 Cloudera, Inc. All rights reserved. 92 
Camus
93 
Architectural 
Considerations 
Data Processing ā€“ Engines 
tiny.cloudera.com/app-arch-slides
94 
Processing Engines 
ā€¢ MapReduce 
ā€¢ Abstractions 
ā€¢ Spark 
ā€¢ Spark Streaming 
ā€¢ Impala 
Confidentiality Information Goes Here
95 
MapReduce 
ā€¢ Oldie but goody 
ā€¢ Restrictive Framework / Innovated Work Around 
ā€¢ Extreme Batch 
Confidentiality Information Goes Here
96 
MapReduce Basic High Level 
Confidentiality Information Goes Here 
Mapper 
HDFS 
(Replicated) 
Native File System 
Block of 
Data 
Temp Spill 
Data 
Partitioned 
Sorted Data 
Reducer 
Reducer 
Local Copy 
Output File
97 
MapReduce Innovation 
ā€¢ Mapper Memory Joins 
ā€¢ Reducer Memory Joins 
ā€¢ Buckets Sorted Joins 
ā€¢ Cross Task Communication 
ā€¢ Windowing 
ā€¢ And Much More 
Confidentiality Information Goes Here
98 
Abstractions 
ā€¢ SQL 
ā€“ Hive 
ā€¢ Script/Code 
ā€“ Pig: Pig Latin 
ā€“ Crunch: Java/Scala 
ā€“ Cascading: Java/Scala 
Confidentiality Information Goes Here
99 
Spark 
ā€¢ The New Kid that isnā€™t that New Anymore 
ā€¢ Easily 10x less code 
ā€¢ Extremely Easy and Powerful API 
ā€¢ Very good for machine learning 
ā€¢ Scala, Java, and Python 
ā€¢ RDDs 
ā€¢ DAG Engine 
Confidentiality Information Goes Here
100 
Spark - DAG 
Confidentiality Information Goes Here
101 
Spark - DAG 
Confidentiality Information Goes Here 
Filter KeyBy 
KeyBy 
TextFile 
TextFile 
Join Filter Take
102 
Spark - DAG 
Confidentiality Information Goes Here 
Filter KeyBy 
KeyBy 
TextFile 
TextFile 
Join Filter Take 
Good 
Good 
Good 
Good 
Good 
Good 
Good-Replay 
Good-Replay 
Good-Replay 
Good 
Good-Replay 
Good 
Good-Replay 
Lost Block 
Replay 
Good-Replay 
Lost Block 
Good 
Future 
Future 
Future 
Future
103 
Spark Streaming 
ā€¢ Calling Spark in a Loop 
ā€¢ Extends RDDs with DStream 
ā€¢ Very Little Code Changes from ETL to Streaming 
Confidentiality Information Goes Here
104 
Spark Streaming 
Confidentiality Information Goes Here 
Single Pass 
Source Receiver RDD 
Source Receiver RDD 
RDD 
Filter Count Print 
Source Receiver RDD 
RDD 
RDD 
Single Pass 
Filter Count Print 
Pre-first 
Batch 
First 
Batch 
Second 
Batch
105 
Spark Streaming 
Confidentiality Information Goes Here 
Single Pass 
Source Receiver RDD 
Source Receiver RDD 
RDD 
Filter Count 
Print 
Source Receiver RDD 
RDD 
RDD 
Stateful RDD 1 
Single Pass 
Filter Count 
Pre-first 
Batch 
First 
Batch 
Second 
Batch 
Stateful RDD 1 
Stateful RDD 2 
Print
106 
Impala 
Confidentiality Information Goes Here 
ā€¢ MPP Style SQL Engine on top of Hadoop 
ā€¢ Very Fast 
ā€¢ High Concurrency
107 
Impala ā€“ Broadcast Join 
Confidentiality Information Goes Here 
Impala Daemon 
Smaller Table 
Data Block 
100% Cached 
Smaller Table 
Smaller Table 
Data Block 
Impala Daemon 
100% Cached 
Smaller Table 
Impala Daemon 
100% Cached 
Smaller Table 
Impala Daemon 
Hash Join Function 
Bigger Table 
Data Block 
100% Cached 
Smaller Table 
Output 
Impala Daemon 
Hash Join Function 
Bigger Table 
Data Block 
100% Cached 
Smaller Table 
Output 
Impala Daemon 
Hash Join Function 
Bigger Table 
Data Block 
100% Cached 
Smaller Table 
Output
108 
Impala ā€“ Broadcast Join 
Confidentiality Information Goes Here 
Impala Daemon 
Smaller Table 
Data Block 
~33% Cached 
Smaller Table 
Impala Daemon 
Smaller Table 
Data Block 
~33% Cached 
Smaller Table 
Impala Daemon 
~33% Cached 
Smaller Table 
Hash Partitioner Hash Partitioner 
Impala Daemon 
BiggerTable 
Data Block 
Impala Daemon Impala Daemon 
Hash Partitioner 
Hash Join Function 
33% Cached 
Smaller Table 
Hash Join Function 
33% Cached 
Smaller Table 
Hash Join Function 
33% Cached 
Smaller Table 
Output Output Output 
Hash Partitioner 
BiggerTable 
Data Block 
Hash Partitioner 
BiggerTable 
Data Block
109 
Impala vs Hive 
Confidentiality Information Goes Here 
ā€¢ Very different approaches and 
ā€¢ We may see convergence at some point 
ā€¢ But for now 
ā€“ Impala for speed 
ā€“ Hive for batch
110 
Architectural 
Considerations 
Data Processing ā€“ Patterns and 
Recommendations
111 
What processing needs to happen? 
Confidentiality Information Goes Here 
ā€¢ Sessionization 
ā€¢ Filtering 
ā€¢ Deduplication 
ā€¢ BI / Discovery
112 
Sessionization 
Confidentiality Information Goes Here 
Website visit 
Visitor 1 
Session 1 
Visitor 1 
Session 2 
Visitor 2 
Session 1 
> 30 minutes
113 
Why sessionize? 
Helps answers questions like: 
ā€¢ What is my websiteā€™s bounce rate? 
ā€“ i.e. how many % of visitors donā€™t go past the landing page? 
ā€¢ Which marketing channels (e.g. organic search, display ad, etc.) are 
leading to most sessions? 
ā€“ Which ones of those lead to most conversions (e.g. people buying things, 
Confidentiality Information Goes Here 
signing up, etc.) 
ā€¢ Do attribution analysis ā€“ which channels are responsible for most 
conversions?
114 
Sessionization 
244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463 "http:// 
bestcyclingreviews.com/top_online_shops" "Mozilla/5.0 (Macintosh; Intel Mac OS X 
10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36ā€ 
244.157.45.12+1413580110 
244.157.45.12 - - [17/Oct/2014:21:59:59 ] "GET /Store/cart.jsp?productID=1023 HTTP/ 
1.0" 200 3757 "http://www.casualcyclist.com" "Mozilla/5.0 (Linux; U; Android 2.3.5; 
en-us; HTC Vision Build/GRI40) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 
Mobile Safari/533.1ā€ 244.157.45.12+1413583199 
Confidentiality Information Goes Here
115 
How to Sessionize? 
Confidentiality Information Goes Here 
1. Given a list of clicks, determine which clicks 
came from the same user (Partitioning, 
ordering) 
2. Given a particular user's clicks, determine if a 
given click is a part of a new session or a 
continuation of the previous session (Identifying 
session boundaries)
116 
#1 ā€“ Which clicks are from same user? 
ā€¢ We can use: 
ā€“ IP address (244.157.45.12) 
ā€“ Cookies (A9A3BECE0563982D) 
ā€“ IP address (244.157.45.12)and user agent string ((KHTML, like 
Gecko) Chrome/36.0.1944.0 Safari/537.36") 
Ā©2014 Cloudera, Inc. All Rights Reserved.
117 
#1 ā€“ Which clicks are from same user? 
ā€¢ We can use: 
ā€“ IP address (244.157.45.12) 
ā€“ Cookies (A9A3BECE0563982D) 
ā€“ IP address (244.157.45.12)and user agent string ((KHTML, like 
Gecko) Chrome/36.0.1944.0 Safari/537.36") 
Ā©2014 Cloudera, Inc. All Rights Reserved.
244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463 "http:// 
bestcyclingreviews.com/top_online_shops" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) 
AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36ā€ 
244.157.45.12 - - [17/Oct/2014:21:59:59 ] "GET /Store/cart.jsp?productID=1023 HTTP/1.0" 
200 3757 "http://www.casualcyclist.com" "Mozilla/5.0 (Linux; U; Android 2.3.5; en-us; HTC 
Vision Build/GRI40) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1ā€ 
118 
#1 ā€“ Which clicks are from same user? 
Ā©2014 Cloudera, Inc. All Rights Reserved.
> 30 mins apart = different 
sessions 
244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463 "http:// 
bestcyclingreviews.com/top_online_shops" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) 
AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36ā€ 
244.157.45.12 - - [17/Oct/2014:21:59:59 ] "GET /Store/cart.jsp?productID=1023 HTTP/1.0" 
200 3757 "http://www.casualcyclist.com" "Mozilla/5.0 (Linux; U; Android 2.3.5; en-us; HTC 
Vision Build/GRI40) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1ā€ 
119 
#2 ā€“ Which clicks part of the same session? 
Ā©2014 Cloudera, Inc. All Rights Reserved.
Sessionization engine recommendation 
ā€¢ We have sessionization code in MR, Spark and Hive on github. 
The complexity of the code varies, depends on the expertise in the 
organization. 
ā€¢ We choose Hive, since itā€™s a fairly small query to maintain instead 
of large pieces of code. 
Ā©2014 Cloudera, Inc. All rights reserved. 120
121 
Filtering ā€“ filter out incomplete records 
244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463 "http:// 
bestcyclingreviews.com/top_online_shops" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) 
AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36ā€ 
244.157.45.12 - - [17/Oct/2014:21:59:59 ] "GET /Store/cart.jsp?productID=1023 HTTP/1.0" 
200 3757 "http://www.casualcyclist.com" "Mozilla/5.0 (Linux; Uā€¦ 
Ā©2014 Cloudera, Inc. All Rights Reserved.
Filtering ā€“ filter out records from bots/spiders 
244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463 "http:// 
bestcyclingreviews.com/top_online_shops" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) 
AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36ā€ 
209.85.238.11 - - [17/Oct/2014:21:59:59 ] "GET /Store/cart.jsp?productID=1023 HTTP/1.0" 
200 3757 "http://www.casualcyclist.com" "Mozilla/5.0 (Linux; U; Android 2.3.5; en-us; HTC 
Vision Build/GRI40) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1ā€ 
122 
Ā©2014 Cloudera, Inc. All Rights Reserved. 
Google spider IP address
Filtering recommendation 
ā€¢ Bot/Spider filtering can be done easily in any of the engines 
ā€¢ Incomplete records are harder to filter in schema systems like 
Hive, Impala, Pig, etc. 
ā€¢ Pretty close choice between MR, Hive and Spark 
ā€¢ Can be done in Spark using rdd.filter() 
ā€¢ We can simply embed this in our sessionization job 
Ā©2014 Cloudera, Inc. All rights reserved. 123
124 
Deduplication ā€“ remove duplicate records 
244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463 "http:// 
bestcyclingreviews.com/top_online_shops" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) 
AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36ā€ 
244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463 "http:// 
bestcyclingreviews.com/top_online_shops" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) 
AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36ā€ 
Ā©2014 Cloudera, Inc. All Rights Reserved.
Deduplication recommendation 
ā€¢ Can be done in all engines. 
ā€¢ We already have a Hive table with all the columns, a simple 
DISTINCT query will perform deduplication 
ā€¢ reduce() in spark 
ā€¢ We use Pig 
Ā©2014 Cloudera, Inc. All rights reserved. 125
BI/Discovery engine recommendation 
ā€¢ Main requirements for this are: 
ā€“ Low latency 
ā€“ SQL interface (e.g. JDBC/ODBC) 
ā€“ Users donā€™t know how to code 
ā€¢ We chose Impala 
ā€“ Itā€™s a SQL engine 
ā€“ Much faster than other engines 
ā€“ Provides standard JDBC/ODBC interfaces 
Ā©2014 Cloudera, Inc. All rights reserved. 126
Processing Demo 
Ā©2014 Cloudera, Inc. All rights reserved. 127
128 
Architectural 
Considerations 
Orchestration
129 
Orchestrating Clickstream 
ā€¢ Data arrives through Flume 
ā€¢ Triggers a processing event: 
ā€“ Sessionize 
ā€“ Enrich ā€“ Location, marketing channelā€¦ 
ā€“ Store as Parquet 
ā€¢ Each day we process events from the previous day
ā€¢ Workflow is fairly simple 
ā€¢ Need to trigger workflow based on data 
ā€¢ Be able to recover from errors 
ā€¢ Perhaps notify on the status 
ā€¢ And collect metrics for reporting 
Ā©2014 Cloudera, Inc. All rights reserved. 130 
Choosing Right
131 
Oozie or Azkaban? 
Ā©2014 Cloudera, Inc. All rights reserved.
Ā©2014 Cloudera, Inc. All rights reserved. 132 
Oozie Architecture
ā€¢ Part of all major Hadoop distributions 
ā€¢ Hue integration 
ā€¢ Built -in actions ā€“ Hive, Sqoop, MapReduce, SSH 
ā€¢ Complex workflows with decisions 
ā€¢ Event and time based scheduling 
ā€¢ Notifications 
ā€¢ SLA Monitoring 
ā€¢ REST API 
Ā©2014 Cloudera, Inc. All rights reserved. 133 
Oozie features
Ā©2014 Cloudera, Inc. All rights reserved. 134 
Oozie Drawbacks 
ā€¢ Overhead in launching jobs 
ā€¢ Steep learning curve 
ā€¢ XML Workflows
Client Hadoop 
Ā©2014 Cloudera, Inc. All rights reserved. 135 
Azkaban Architecture 
Azkaban Executor Server 
Azkaban Web Server 
HDFS viewer 
plugin 
Job types plugin 
MySQL
ā€¢ Simplicity 
ā€¢ Great UI ā€“ including pluggable visualizers 
ā€¢ Lots of plugins ā€“ Hive, Pigā€¦ 
ā€¢ Reporting plugin 
Ā©2014 Cloudera, Inc. All rights reserved. 136 
Azkaban features
ā€¢ Doesnā€™t support workflow decisions 
ā€¢ Canā€™t represent data dependency 
Ā©2014 Cloudera, Inc. All rights reserved. 137 
Azkaban Limitations
ā€¢ Workflow is fairly simple 
ā€¢ Need to trigger workflow based on data 
ā€¢ Be able to recover from errors 
ā€¢ Perhaps notify on the status 
ā€¢ And collect metrics for reporting 
Ā©2014 Cloudera, Inc. All rights reserved. 138 
Choosingā€¦ 
Easier in Oozie
Choosing the right Orchestration Tool 
ā€¢ Workflow is fairly simple 
ā€¢ Need to trigger workflow based on data 
ā€¢ Be able to recover from errors 
ā€¢ Perhaps notify on the status 
ā€¢ And collect metrics for reporting 
Better in Azkaban 
Ā©2014 Cloudera, Inc. All rights reserved. 139
Important Decision Consideration! 
The best orchestration tool 
is the one you are an expert on 
Ā©2014 Cloudera, Inc. All rights reserved. 140
Orchestration Patterns ā€“ Fan Out 
Ā©2014 Cloudera, Inc. All rights reserved. 141
Capture & Decide Pattern 
Ā©2014 Cloudera, Inc. All rights reserved. 142
Oozie Demo 
Ā©2014 Cloudera, Inc. All rights reserved. 143
144 
Putting It All 
Together 
Final Architecture
145 
Final Architecture ā€“ High Level Overview 
Data 
Sources Ingestion Data Storage/ 
Processing 
Data 
Reporting/ 
Analysis 
Ā©2014 Cloudera, Inc. All Rights Reserved.
146 
Final Architecture ā€“ High Level Overview 
Data 
Sources Ingestion Data Storage/ 
Processing 
Data 
Reporting/ 
Analysis 
Ā©2014 Cloudera, Inc. All Rights Reserved.
147 
Final Architecture ā€“ Ingestion 
Web Server Flume Agent 
Web Server Flume Agent 
Web Server Flume Agent 
Web Server Flume Agent 
Web Server Flume Agent 
Web Server Flume Agent 
Web Server Flume Agent 
Web Server Flume Agent 
Flume Agent 
Flume Agent 
Flume Agent 
Flume Agent 
Fan-in 
Pattern 
Multi Agents for 
Failover and rolling 
restarts 
HDFS 
Ā©2014 Cloudera, Inc. All Rights Reserved.
148 
Final Architecture ā€“ High Level Overview 
Data 
Sources Ingestion Data Storage/ 
Processing 
Data 
Reporting/ 
Analysis 
Ā©2014 Cloudera, Inc. All Rights Reserved.
149 
Final Architecture ā€“ Storage and Processing 
/etl/BI/casualcyclist/clicks/rawlogs/ 
year=2014/month=10/day=10 
ā€¦ 
Data Processing 
Deduplication 
Filtering 
Sessionization 
Etc. 
/data/bikeshop/clickstream/ 
year=2014/month=10/day=10 
ā€¦ 
Ā©2014 Cloudera, Inc. All Rights Reserved.
150 
Final Architecture ā€“ High Level Overview 
Data 
Sources Ingestion Data Storage/ 
Processing 
Data 
Reporting/ 
Analysis 
Ā©2014 Cloudera, Inc. All Rights Reserved.
151 
Final Architecture ā€“ Data Access 
Hive/ 
Impala 
BI/ 
Analytics 
Tools 
JDBC/ODBC 
DWH 
Sqoop 
Local 
Disk 
R, etc. 
DB import tool 
Ā©2014 Cloudera, Inc. All Rights Reserved.
152 
Stay in touch! 
@hadooparchbook 
hadooparchitecturebook.com 
slideshare.com/hadooparchbook
Confidentiality Information Goes Here 153 
Join the Discussion 
Get community 
help or provide 
feedback 
cloudera.com/ 
community
154 
Try Hadoop Now 
cloudera.com/live
155 
Visit us at 
Booth 
#305 
Highlights: 
Hear whatā€™s new with 
5.2 including Impala 2.0 
Learn how Cloudera is 
setting the standard for 
Hadoop in the Cloud 
BOOK SIGNINGS THEATER SESSIONS 
TECHNICAL DEMOS GIVEAWAYS
156 
Free books and office hours! 
ā€¢ Book signing #1 
ā€“ Tomorrow, 10/16, 3:45 ā€“ 4:15pm at Expo Hall - O'Reilly Authors Booth 
ā€¢ Book signing #2 
ā€“ Tomorrow, 10/16, 6:35 ā€“ 7:00pm at Cloudera Booth #305 
ā€¢ Office Hours, tomorrow, 10/16, 10:30 ā€“ 11:00 am 
ā€“ Mark and Ted in Room: Table A 
ā€“ Gwen and Jonathan in Room: Table C 
Ā©2014 Cloudera, Inc. All Rights Reserved.
Thank you

More Related Content

What's hot

Application Architectures with Hadoop - UK Hadoop User Group
Application Architectures with Hadoop - UK Hadoop User GroupApplication Architectures with Hadoop - UK Hadoop User Group
Application Architectures with Hadoop - UK Hadoop User Grouphadooparchbook
Ā 
Architecting application with Hadoop - using clickstream analytics as an example
Architecting application with Hadoop - using clickstream analytics as an exampleArchitecting application with Hadoop - using clickstream analytics as an example
Architecting application with Hadoop - using clickstream analytics as an examplehadooparchbook
Ā 
Architecting applications with Hadoop - Fraud Detection
Architecting applications with Hadoop - Fraud DetectionArchitecting applications with Hadoop - Fraud Detection
Architecting applications with Hadoop - Fraud Detectionhadooparchbook
Ā 
Top 5 mistakes when writing Streaming applications
Top 5 mistakes when writing Streaming applicationsTop 5 mistakes when writing Streaming applications
Top 5 mistakes when writing Streaming applicationshadooparchbook
Ā 
Hadoop Application Architectures - Fraud Detection
Hadoop Application Architectures - Fraud  DetectionHadoop Application Architectures - Fraud  Detection
Hadoop Application Architectures - Fraud Detectionhadooparchbook
Ā 
Intro to hadoop tutorial
Intro to hadoop tutorialIntro to hadoop tutorial
Intro to hadoop tutorialmarkgrover
Ā 
Application architectures with Hadoop ā€“Ā Big Data TechCon 2014
Application architectures with Hadoop ā€“Ā Big Data TechCon 2014Application architectures with Hadoop ā€“Ā Big Data TechCon 2014
Application architectures with Hadoop ā€“Ā Big Data TechCon 2014hadooparchbook
Ā 
Hadoop application architectures - Fraud detection tutorial
Hadoop application architectures - Fraud detection tutorialHadoop application architectures - Fraud detection tutorial
Hadoop application architectures - Fraud detection tutorialhadooparchbook
Ā 
Architecting a Next Generation Data Platform
Architecting a Next Generation Data PlatformArchitecting a Next Generation Data Platform
Architecting a Next Generation Data Platformhadooparchbook
Ā 
Streaming architecture patterns
Streaming architecture patternsStreaming architecture patterns
Streaming architecture patternshadooparchbook
Ā 
What no one tells you about writing a streaming app
What no one tells you about writing a streaming appWhat no one tells you about writing a streaming app
What no one tells you about writing a streaming apphadooparchbook
Ā 
Architecting a next generation data platform
Architecting a next generation data platformArchitecting a next generation data platform
Architecting a next generation data platformhadooparchbook
Ā 
Hadoop application architectures - Fraud detection tutorial
Hadoop application architectures - Fraud detection tutorialHadoop application architectures - Fraud detection tutorial
Hadoop application architectures - Fraud detection tutorialhadooparchbook
Ā 
Data warehousing with Hadoop
Data warehousing with HadoopData warehousing with Hadoop
Data warehousing with Hadoophadooparchbook
Ā 
Architecting a Fraud Detection Application with Hadoop
Architecting a Fraud Detection Application with HadoopArchitecting a Fraud Detection Application with Hadoop
Architecting a Fraud Detection Application with HadoopDataWorks Summit
Ā 
Fraud Detection with Hadoop
Fraud Detection with HadoopFraud Detection with Hadoop
Fraud Detection with Hadoopmarkgrover
Ā 
Application architectures with hadoop ā€“Ā big data techcon 2014
Application architectures with hadoop ā€“Ā big data techcon 2014Application architectures with hadoop ā€“Ā big data techcon 2014
Application architectures with hadoop ā€“Ā big data techcon 2014Jonathan Seidman
Ā 
Hadoop Security and Compliance - StampedeCon 2016
Hadoop Security and Compliance - StampedeCon 2016Hadoop Security and Compliance - StampedeCon 2016
Hadoop Security and Compliance - StampedeCon 2016StampedeCon
Ā 
2015 nov 27_thug_paytm_rt_ingest_brief_final
2015 nov 27_thug_paytm_rt_ingest_brief_final2015 nov 27_thug_paytm_rt_ingest_brief_final
2015 nov 27_thug_paytm_rt_ingest_brief_finalAdam Muise
Ā 
Application Architectures with Hadoop
Application Architectures with HadoopApplication Architectures with Hadoop
Application Architectures with Hadoophadooparchbook
Ā 

What's hot (20)

Application Architectures with Hadoop - UK Hadoop User Group
Application Architectures with Hadoop - UK Hadoop User GroupApplication Architectures with Hadoop - UK Hadoop User Group
Application Architectures with Hadoop - UK Hadoop User Group
Ā 
Architecting application with Hadoop - using clickstream analytics as an example
Architecting application with Hadoop - using clickstream analytics as an exampleArchitecting application with Hadoop - using clickstream analytics as an example
Architecting application with Hadoop - using clickstream analytics as an example
Ā 
Architecting applications with Hadoop - Fraud Detection
Architecting applications with Hadoop - Fraud DetectionArchitecting applications with Hadoop - Fraud Detection
Architecting applications with Hadoop - Fraud Detection
Ā 
Top 5 mistakes when writing Streaming applications
Top 5 mistakes when writing Streaming applicationsTop 5 mistakes when writing Streaming applications
Top 5 mistakes when writing Streaming applications
Ā 
Hadoop Application Architectures - Fraud Detection
Hadoop Application Architectures - Fraud  DetectionHadoop Application Architectures - Fraud  Detection
Hadoop Application Architectures - Fraud Detection
Ā 
Intro to hadoop tutorial
Intro to hadoop tutorialIntro to hadoop tutorial
Intro to hadoop tutorial
Ā 
Application architectures with Hadoop ā€“Ā Big Data TechCon 2014
Application architectures with Hadoop ā€“Ā Big Data TechCon 2014Application architectures with Hadoop ā€“Ā Big Data TechCon 2014
Application architectures with Hadoop ā€“Ā Big Data TechCon 2014
Ā 
Hadoop application architectures - Fraud detection tutorial
Hadoop application architectures - Fraud detection tutorialHadoop application architectures - Fraud detection tutorial
Hadoop application architectures - Fraud detection tutorial
Ā 
Architecting a Next Generation Data Platform
Architecting a Next Generation Data PlatformArchitecting a Next Generation Data Platform
Architecting a Next Generation Data Platform
Ā 
Streaming architecture patterns
Streaming architecture patternsStreaming architecture patterns
Streaming architecture patterns
Ā 
What no one tells you about writing a streaming app
What no one tells you about writing a streaming appWhat no one tells you about writing a streaming app
What no one tells you about writing a streaming app
Ā 
Architecting a next generation data platform
Architecting a next generation data platformArchitecting a next generation data platform
Architecting a next generation data platform
Ā 
Hadoop application architectures - Fraud detection tutorial
Hadoop application architectures - Fraud detection tutorialHadoop application architectures - Fraud detection tutorial
Hadoop application architectures - Fraud detection tutorial
Ā 
Data warehousing with Hadoop
Data warehousing with HadoopData warehousing with Hadoop
Data warehousing with Hadoop
Ā 
Architecting a Fraud Detection Application with Hadoop
Architecting a Fraud Detection Application with HadoopArchitecting a Fraud Detection Application with Hadoop
Architecting a Fraud Detection Application with Hadoop
Ā 
Fraud Detection with Hadoop
Fraud Detection with HadoopFraud Detection with Hadoop
Fraud Detection with Hadoop
Ā 
Application architectures with hadoop ā€“Ā big data techcon 2014
Application architectures with hadoop ā€“Ā big data techcon 2014Application architectures with hadoop ā€“Ā big data techcon 2014
Application architectures with hadoop ā€“Ā big data techcon 2014
Ā 
Hadoop Security and Compliance - StampedeCon 2016
Hadoop Security and Compliance - StampedeCon 2016Hadoop Security and Compliance - StampedeCon 2016
Hadoop Security and Compliance - StampedeCon 2016
Ā 
2015 nov 27_thug_paytm_rt_ingest_brief_final
2015 nov 27_thug_paytm_rt_ingest_brief_final2015 nov 27_thug_paytm_rt_ingest_brief_final
2015 nov 27_thug_paytm_rt_ingest_brief_final
Ā 
Application Architectures with Hadoop
Application Architectures with HadoopApplication Architectures with Hadoop
Application Architectures with Hadoop
Ā 

Viewers also liked

Hadoop application architectures - using Customer 360 as an example
Hadoop application architectures - using Customer 360 as an exampleHadoop application architectures - using Customer 360 as an example
Hadoop application architectures - using Customer 360 as an examplehadooparchbook
Ā 
How to develop Big Data Pipelines for Hadoop, by Costin Leau
How to develop Big Data Pipelines for Hadoop, by Costin LeauHow to develop Big Data Pipelines for Hadoop, by Costin Leau
How to develop Big Data Pipelines for Hadoop, by Costin LeauCodemotion
Ā 
Spark Streaming & Kafka-The Future of Stream Processing
Spark Streaming & Kafka-The Future of Stream ProcessingSpark Streaming & Kafka-The Future of Stream Processing
Spark Streaming & Kafka-The Future of Stream ProcessingJack Gudenkauf
Ā 
Architecting next generation big data platform
Architecting next generation big data platformArchitecting next generation big data platform
Architecting next generation big data platformhadooparchbook
Ā 
Top 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applicationsTop 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applicationshadooparchbook
Ā 
Architectural Patterns for Streaming Applications
Architectural Patterns for Streaming ApplicationsArchitectural Patterns for Streaming Applications
Architectural Patterns for Streaming Applicationshadooparchbook
Ā 
Impala Architecture presentation
Impala Architecture presentationImpala Architecture presentation
Impala Architecture presentationhadooparchbook
Ā 
How Auto Microcubes Work with Indexing & Caching to Deliver a Consistently Fa...
How Auto Microcubes Work with Indexing & Caching to Deliver a Consistently Fa...How Auto Microcubes Work with Indexing & Caching to Deliver a Consistently Fa...
How Auto Microcubes Work with Indexing & Caching to Deliver a Consistently Fa...Remy Rosenbaum
Ā 
SQL Engines for Hadoop - The case for Impala
SQL Engines for Hadoop - The case for ImpalaSQL Engines for Hadoop - The case for Impala
SQL Engines for Hadoop - The case for Impalamarkgrover
Ā 
Real-Time Data Pipelines with Kafka, Spark, and Operational Databases
Real-Time Data Pipelines with Kafka, Spark, and Operational DatabasesReal-Time Data Pipelines with Kafka, Spark, and Operational Databases
Real-Time Data Pipelines with Kafka, Spark, and Operational DatabasesSingleStore
Ā 
Chicago Data Summit: Flume: An Introduction
Chicago Data Summit: Flume: An IntroductionChicago Data Summit: Flume: An Introduction
Chicago Data Summit: Flume: An IntroductionCloudera, Inc.
Ā 
Hadoop - Integration Patterns and Practices__HadoopSummit2010
Hadoop - Integration Patterns and Practices__HadoopSummit2010Hadoop - Integration Patterns and Practices__HadoopSummit2010
Hadoop - Integration Patterns and Practices__HadoopSummit2010Yahoo Developer Network
Ā 
Architecting a Scalable Hadoop Platform: Top 10 considerations for success
Architecting a Scalable Hadoop Platform: Top 10 considerations for successArchitecting a Scalable Hadoop Platform: Top 10 considerations for success
Architecting a Scalable Hadoop Platform: Top 10 considerations for successDataWorks Summit
Ā 
Hadoop World 2011: Storing and Indexing Social Media Content in the Hadoop Ec...
Hadoop World 2011: Storing and Indexing Social Media Content in the Hadoop Ec...Hadoop World 2011: Storing and Indexing Social Media Content in the Hadoop Ec...
Hadoop World 2011: Storing and Indexing Social Media Content in the Hadoop Ec...Cloudera, Inc.
Ā 
Spring for Apache Hadoop
Spring for Apache HadoopSpring for Apache Hadoop
Spring for Apache Hadoopzenyk
Ā 
Deploying Apache Flume to enable low-latency analytics
Deploying Apache Flume to enable low-latency analyticsDeploying Apache Flume to enable low-latency analytics
Deploying Apache Flume to enable low-latency analyticsDataWorks Summit
Ā 

Viewers also liked (16)

Hadoop application architectures - using Customer 360 as an example
Hadoop application architectures - using Customer 360 as an exampleHadoop application architectures - using Customer 360 as an example
Hadoop application architectures - using Customer 360 as an example
Ā 
How to develop Big Data Pipelines for Hadoop, by Costin Leau
How to develop Big Data Pipelines for Hadoop, by Costin LeauHow to develop Big Data Pipelines for Hadoop, by Costin Leau
How to develop Big Data Pipelines for Hadoop, by Costin Leau
Ā 
Spark Streaming & Kafka-The Future of Stream Processing
Spark Streaming & Kafka-The Future of Stream ProcessingSpark Streaming & Kafka-The Future of Stream Processing
Spark Streaming & Kafka-The Future of Stream Processing
Ā 
Architecting next generation big data platform
Architecting next generation big data platformArchitecting next generation big data platform
Architecting next generation big data platform
Ā 
Top 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applicationsTop 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applications
Ā 
Architectural Patterns for Streaming Applications
Architectural Patterns for Streaming ApplicationsArchitectural Patterns for Streaming Applications
Architectural Patterns for Streaming Applications
Ā 
Impala Architecture presentation
Impala Architecture presentationImpala Architecture presentation
Impala Architecture presentation
Ā 
How Auto Microcubes Work with Indexing & Caching to Deliver a Consistently Fa...
How Auto Microcubes Work with Indexing & Caching to Deliver a Consistently Fa...How Auto Microcubes Work with Indexing & Caching to Deliver a Consistently Fa...
How Auto Microcubes Work with Indexing & Caching to Deliver a Consistently Fa...
Ā 
SQL Engines for Hadoop - The case for Impala
SQL Engines for Hadoop - The case for ImpalaSQL Engines for Hadoop - The case for Impala
SQL Engines for Hadoop - The case for Impala
Ā 
Real-Time Data Pipelines with Kafka, Spark, and Operational Databases
Real-Time Data Pipelines with Kafka, Spark, and Operational DatabasesReal-Time Data Pipelines with Kafka, Spark, and Operational Databases
Real-Time Data Pipelines with Kafka, Spark, and Operational Databases
Ā 
Chicago Data Summit: Flume: An Introduction
Chicago Data Summit: Flume: An IntroductionChicago Data Summit: Flume: An Introduction
Chicago Data Summit: Flume: An Introduction
Ā 
Hadoop - Integration Patterns and Practices__HadoopSummit2010
Hadoop - Integration Patterns and Practices__HadoopSummit2010Hadoop - Integration Patterns and Practices__HadoopSummit2010
Hadoop - Integration Patterns and Practices__HadoopSummit2010
Ā 
Architecting a Scalable Hadoop Platform: Top 10 considerations for success
Architecting a Scalable Hadoop Platform: Top 10 considerations for successArchitecting a Scalable Hadoop Platform: Top 10 considerations for success
Architecting a Scalable Hadoop Platform: Top 10 considerations for success
Ā 
Hadoop World 2011: Storing and Indexing Social Media Content in the Hadoop Ec...
Hadoop World 2011: Storing and Indexing Social Media Content in the Hadoop Ec...Hadoop World 2011: Storing and Indexing Social Media Content in the Hadoop Ec...
Hadoop World 2011: Storing and Indexing Social Media Content in the Hadoop Ec...
Ā 
Spring for Apache Hadoop
Spring for Apache HadoopSpring for Apache Hadoop
Spring for Apache Hadoop
Ā 
Deploying Apache Flume to enable low-latency analytics
Deploying Apache Flume to enable low-latency analyticsDeploying Apache Flume to enable low-latency analytics
Deploying Apache Flume to enable low-latency analytics
Ā 

Similar to Strata NY 2014 - Architectural considerations for Hadoop applications tutorial

Architecting Applications with Hadoop
Architecting Applications with HadoopArchitecting Applications with Hadoop
Architecting Applications with Hadoopmarkgrover
Ā 
Application Architectures with Hadoop | Data Day Texas 2015
Application Architectures with Hadoop | Data Day Texas 2015Application Architectures with Hadoop | Data Day Texas 2015
Application Architectures with Hadoop | Data Day Texas 2015Cloudera, Inc.
Ā 
Hadoop and Hive in Enterprises
Hadoop and Hive in EnterprisesHadoop and Hive in Enterprises
Hadoop and Hive in Enterprisesmarkgrover
Ā 
Hadoop Essentials -- The What, Why and How to Meet Agency Objectives
Hadoop Essentials -- The What, Why and How to Meet Agency ObjectivesHadoop Essentials -- The What, Why and How to Meet Agency Objectives
Hadoop Essentials -- The What, Why and How to Meet Agency ObjectivesCloudera, Inc.
Ā 
å¤§ę•°ę®ę•°ę®ę²»ē†åŠę•°ę®å®‰å…Ø
å¤§ę•°ę®ę•°ę®ę²»ē†åŠę•°ę®å®‰å…Øå¤§ę•°ę®ę•°ę®ę²»ē†åŠę•°ę®å®‰å…Ø
å¤§ę•°ę®ę•°ę®ę²»ē†åŠę•°ę®å®‰å…ØJianwei Li
Ā 
Big data - Online Training
Big data - Online TrainingBig data - Online Training
Big data - Online TrainingLearntek1
Ā 
Twitter with hadoop for oow
Twitter with hadoop for oowTwitter with hadoop for oow
Twitter with hadoop for oowGwen (Chen) Shapira
Ā 
Simplifying Real-Time Architectures for IoT with Apache Kudu
Simplifying Real-Time Architectures for IoT with Apache KuduSimplifying Real-Time Architectures for IoT with Apache Kudu
Simplifying Real-Time Architectures for IoT with Apache KuduCloudera, Inc.
Ā 
Hadoop in 2015: Keys to Achieving Operational Excellence for the Real-Time En...
Hadoop in 2015: Keys to Achieving Operational Excellence for the Real-Time En...Hadoop in 2015: Keys to Achieving Operational Excellence for the Real-Time En...
Hadoop in 2015: Keys to Achieving Operational Excellence for the Real-Time En...MapR Technologies
Ā 
Intro to Hadoop Presentation at Carnegie Mellon - Silicon Valley
Intro to Hadoop Presentation at Carnegie Mellon - Silicon ValleyIntro to Hadoop Presentation at Carnegie Mellon - Silicon Valley
Intro to Hadoop Presentation at Carnegie Mellon - Silicon Valleymarkgrover
Ā 
NYC HUG - Application Architectures with Apache Hadoop
NYC HUG - Application Architectures with Apache HadoopNYC HUG - Application Architectures with Apache Hadoop
NYC HUG - Application Architectures with Apache Hadoopmarkgrover
Ā 
Hadoop operations-2014-strata-new-york-v5
Hadoop operations-2014-strata-new-york-v5Hadoop operations-2014-strata-new-york-v5
Hadoop operations-2014-strata-new-york-v5Chris Nauroth
Ā 
Applications on Hadoop
Applications on HadoopApplications on Hadoop
Applications on Hadoopmarkgrover
Ā 
Turning Petabytes of Data into Profit with Hadoop for the Worldā€™s Biggest Ret...
Turning Petabytes of Data into Profit with Hadoop for the Worldā€™s Biggest Ret...Turning Petabytes of Data into Profit with Hadoop for the Worldā€™s Biggest Ret...
Turning Petabytes of Data into Profit with Hadoop for the Worldā€™s Biggest Ret...Cloudera, Inc.
Ā 
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3tcloudcomputing-tw
Ā 
What it takes to bring Hadoop to a production-ready state
What it takes to bring Hadoop to a production-ready stateWhat it takes to bring Hadoop to a production-ready state
What it takes to bring Hadoop to a production-ready stateClouderaUserGroups
Ā 
Vmware Serengeti - Based on Infochimps Ironfan
Vmware Serengeti - Based on Infochimps IronfanVmware Serengeti - Based on Infochimps Ironfan
Vmware Serengeti - Based on Infochimps IronfanJim Kaskade
Ā 
Hitachi Data Systems Hadoop Solution
Hitachi Data Systems Hadoop SolutionHitachi Data Systems Hadoop Solution
Hitachi Data Systems Hadoop SolutionHitachi Vantara
Ā 
Data Science and CDSW
Data Science and CDSWData Science and CDSW
Data Science and CDSWJason Hubbard
Ā 
Enterprise Metadata Integration, Cloudera
Enterprise Metadata Integration, ClouderaEnterprise Metadata Integration, Cloudera
Enterprise Metadata Integration, ClouderaNeo4j
Ā 

Similar to Strata NY 2014 - Architectural considerations for Hadoop applications tutorial (20)

Architecting Applications with Hadoop
Architecting Applications with HadoopArchitecting Applications with Hadoop
Architecting Applications with Hadoop
Ā 
Application Architectures with Hadoop | Data Day Texas 2015
Application Architectures with Hadoop | Data Day Texas 2015Application Architectures with Hadoop | Data Day Texas 2015
Application Architectures with Hadoop | Data Day Texas 2015
Ā 
Hadoop and Hive in Enterprises
Hadoop and Hive in EnterprisesHadoop and Hive in Enterprises
Hadoop and Hive in Enterprises
Ā 
Hadoop Essentials -- The What, Why and How to Meet Agency Objectives
Hadoop Essentials -- The What, Why and How to Meet Agency ObjectivesHadoop Essentials -- The What, Why and How to Meet Agency Objectives
Hadoop Essentials -- The What, Why and How to Meet Agency Objectives
Ā 
å¤§ę•°ę®ę•°ę®ę²»ē†åŠę•°ę®å®‰å…Ø
å¤§ę•°ę®ę•°ę®ę²»ē†åŠę•°ę®å®‰å…Øå¤§ę•°ę®ę•°ę®ę²»ē†åŠę•°ę®å®‰å…Ø
å¤§ę•°ę®ę•°ę®ę²»ē†åŠę•°ę®å®‰å…Ø
Ā 
Big data - Online Training
Big data - Online TrainingBig data - Online Training
Big data - Online Training
Ā 
Twitter with hadoop for oow
Twitter with hadoop for oowTwitter with hadoop for oow
Twitter with hadoop for oow
Ā 
Simplifying Real-Time Architectures for IoT with Apache Kudu
Simplifying Real-Time Architectures for IoT with Apache KuduSimplifying Real-Time Architectures for IoT with Apache Kudu
Simplifying Real-Time Architectures for IoT with Apache Kudu
Ā 
Hadoop in 2015: Keys to Achieving Operational Excellence for the Real-Time En...
Hadoop in 2015: Keys to Achieving Operational Excellence for the Real-Time En...Hadoop in 2015: Keys to Achieving Operational Excellence for the Real-Time En...
Hadoop in 2015: Keys to Achieving Operational Excellence for the Real-Time En...
Ā 
Intro to Hadoop Presentation at Carnegie Mellon - Silicon Valley
Intro to Hadoop Presentation at Carnegie Mellon - Silicon ValleyIntro to Hadoop Presentation at Carnegie Mellon - Silicon Valley
Intro to Hadoop Presentation at Carnegie Mellon - Silicon Valley
Ā 
NYC HUG - Application Architectures with Apache Hadoop
NYC HUG - Application Architectures with Apache HadoopNYC HUG - Application Architectures with Apache Hadoop
NYC HUG - Application Architectures with Apache Hadoop
Ā 
Hadoop operations-2014-strata-new-york-v5
Hadoop operations-2014-strata-new-york-v5Hadoop operations-2014-strata-new-york-v5
Hadoop operations-2014-strata-new-york-v5
Ā 
Applications on Hadoop
Applications on HadoopApplications on Hadoop
Applications on Hadoop
Ā 
Turning Petabytes of Data into Profit with Hadoop for the Worldā€™s Biggest Ret...
Turning Petabytes of Data into Profit with Hadoop for the Worldā€™s Biggest Ret...Turning Petabytes of Data into Profit with Hadoop for the Worldā€™s Biggest Ret...
Turning Petabytes of Data into Profit with Hadoop for the Worldā€™s Biggest Ret...
Ā 
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3
Ā 
What it takes to bring Hadoop to a production-ready state
What it takes to bring Hadoop to a production-ready stateWhat it takes to bring Hadoop to a production-ready state
What it takes to bring Hadoop to a production-ready state
Ā 
Vmware Serengeti - Based on Infochimps Ironfan
Vmware Serengeti - Based on Infochimps IronfanVmware Serengeti - Based on Infochimps Ironfan
Vmware Serengeti - Based on Infochimps Ironfan
Ā 
Hitachi Data Systems Hadoop Solution
Hitachi Data Systems Hadoop SolutionHitachi Data Systems Hadoop Solution
Hitachi Data Systems Hadoop Solution
Ā 
Data Science and CDSW
Data Science and CDSWData Science and CDSW
Data Science and CDSW
Ā 
Enterprise Metadata Integration, Cloudera
Enterprise Metadata Integration, ClouderaEnterprise Metadata Integration, Cloudera
Enterprise Metadata Integration, Cloudera
Ā 

Recently uploaded

Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
Ā 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsJoaquim Jorge
Ā 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
Ā 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
Ā 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
Ā 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
Ā 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia
Ā 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
Ā 
šŸ¬ The future of MySQL is Postgres šŸ˜
šŸ¬  The future of MySQL is Postgres   šŸ˜šŸ¬  The future of MySQL is Postgres   šŸ˜
šŸ¬ The future of MySQL is Postgres šŸ˜RTylerCroy
Ā 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
Ā 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
Ā 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEarley Information Science
Ā 
Finology Group ā€“ Insurtech Innovation Award 2024
Finology Group ā€“ Insurtech Innovation Award 2024Finology Group ā€“ Insurtech Innovation Award 2024
Finology Group ā€“ Insurtech Innovation Award 2024The Digital Insurer
Ā 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxKatpro Technologies
Ā 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUK Journal
Ā 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessPixlogix Infotech
Ā 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
Ā 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfEnterprise Knowledge
Ā 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
Ā 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
Ā 

Recently uploaded (20)

Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
Ā 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
Ā 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Ā 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
Ā 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
Ā 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
Ā 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Ā 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Ā 
šŸ¬ The future of MySQL is Postgres šŸ˜
šŸ¬  The future of MySQL is Postgres   šŸ˜šŸ¬  The future of MySQL is Postgres   šŸ˜
šŸ¬ The future of MySQL is Postgres šŸ˜
Ā 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
Ā 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
Ā 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
Ā 
Finology Group ā€“ Insurtech Innovation Award 2024
Finology Group ā€“ Insurtech Innovation Award 2024Finology Group ā€“ Insurtech Innovation Award 2024
Finology Group ā€“ Insurtech Innovation Award 2024
Ā 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Ā 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Ā 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your Business
Ā 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
Ā 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
Ā 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
Ā 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
Ā 

Strata NY 2014 - Architectural considerations for Hadoop applications tutorial

  • 1. Architectural Considerations for Hadoop Applications Strata/Hadoop World 2014 ā€“ October 15th 2014 Mark Grover | @mark_grover Ted Malaska | @TedMalaska Jonathan Seidman | @jseidman Gwen Shapira | @gwenshap
  • 2. 2 About the book ā€¢ @hadooparchbook ā€¢ hadooparchitecturebook.com ā€¢ github.com/hadooparchitecturebook ā€¢ slideshare.com/hadooparchbook Ā©2014 Cloudera, Inc. All Rights Reserved.
  • 3. 3 About the presenters ā€¢ Principal Solutions Architect at Cloudera ā€¢ Previously, lead architect at FINRA ā€¢ Contributor to Apache Flume, Avro, YARN, Pig and Spark Ted Malaska Jonathan Seidman ā€¢ Senior Solutions Architect/Partner Enablement at Cloudera ā€¢ Previously, Technical Lead on the big data team at Orbitz Worldwide ā€¢ Co-founder of the Chicago Hadoop User Group and Chicago Big Data Ā©2014 Cloudera, Inc. All Rights Reserved.
  • 4. 4 About the presenters Gwen Shapira Mark Grover ā€¢ Solutions Architect turned Software Engineer at Cloudera ā€¢ Contributor to Sqoop, Flume and Kafka ā€¢ Formerly a senior consultant at Pythian, Oracle ACE Director ā€¢ Software Engineer at Cloudera ā€¢ Committer on Apache Bigtop, PMC member on Apache Sentry (incubating) ā€¢ Contributor to Apache Hadoop, Spark, Hive, Sqoop, Pig and Flume Ā©2014 Cloudera, Inc. All Rights Reserved.
  • 5. 5 Logistics ā€¢ Coffee break around 10:30 ā€¢ Questions at the end of each section Ā©2014 Cloudera, Inc. All Rights Reserved.
  • 6. 6 Case Study Clickstream Analysis
  • 7. 7 Analytics Ā©2014 Cloudera, Inc. All Rights Reserved.
  • 8. 8 Analytics Ā©2014 Cloudera, Inc. All Rights Reserved.
  • 9. 9 Analytics Ā©2014 Cloudera, Inc. All Rights Reserved.
  • 10. 10 Analytics Ā©2014 Cloudera, Inc. All Rights Reserved.
  • 11. 11 Analytics Ā©2014 Cloudera, Inc. All Rights Reserved.
  • 12. 12 Analytics Ā©2014 Cloudera, Inc. All Rights Reserved.
  • 13. 13 Analytics Ā©2014 Cloudera, Inc. All Rights Reserved.
  • 14. 14 Web Logs ā€“ Combined Log Format 244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463 "http://bestcyclingreviews.com/top_online_shops" "Mozilla/ 5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36ā€ 244.157.45.12 - - [17/Oct/2014:21:59:59 ] "GET /Store/cart.jsp? productID=1023 HTTP/1.0" 200 3757 "http://www.casualcyclist.com" "Mozilla/5.0 (Linux; U; Android 2.3.5; en-us; HTC Vision Build/ GRI40) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1ā€ Ā©2014 Cloudera, Inc. All Rights Reserved.
  • 15. 15 Clickstream Analytics Ā©2014 Cloudera, Inc. All Rights Reserved. 244.157.45.12 - - [17/Oct/ 2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463 "http:// bestcyclingreviews.com/ top_online_shops" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/ 537.36ā€
  • 16. 16 Similar use-cases ā€¢ Sensors ā€“ heart, agriculture, etc. ā€¢ Casinos ā€“ session of a person at a table Ā©2014 Cloudera, Inc. All Rights Reserved.
  • 17. 17 Pre-Hadoop Architecture Clickstream Analysis
  • 18. 18 Click Stream Analysis (Before Hadoop) Transform/ Aggregate Business Ā©2014 Cloudera, Inc. All Rights Reserved. Web logs (full fidelity) (2 weeks) Data Warehouse Intelligence Tape Archive
  • 19. 19 Problems with Pre-Hadoop Architecture ā€¢ Full fidelity data is stored for small amount of time (~weeks). ā€¢ Older data is sent to tape, or even worse, deleted! ā€¢ Inflexible workflow - think of all aggregations beforehand Ā©2014 Cloudera, Inc. All Rights Reserved.
  • 20. 20 Effects of Pre-Hadoop Architecture ā€¢ Regenerating aggregates is expensive or worse, impossible ā€¢ Canā€™t correct bugs in the workflow/aggregation logic ā€¢ Canā€™t do experiments on existing data Ā©2014 Cloudera, Inc. All Rights Reserved.
  • 21. 21 Why is Hadoop A Great Fit? Clickstream Analysis
  • 22. 22 Why is Hadoop a great fit? ā€¢ Volume of clickstream data is huge ā€¢ Velocity at which it comes in is high ā€¢ Variety of data is diverse - semi-structured data ā€¢ Hadoop enables ā€“ active archival of data ā€“ Aggregation jobs ā€“ Querying the above aggregates or raw fidelity data Ā©2014 Cloudera, Inc. All Rights Reserved.
  • 23. Intelligence 23 Click Stream Analysis (with Hadoop) Web logs Hadoop Business Ā©2014 Cloudera, Inc. All Rights Reserved. Active archive (no tape) Aggregation engine Querying engine
  • 24. 24 Challenges of Hadoop Implementation Ā©2014 Cloudera, Inc. All Rights Reserved.
  • 25. 25 Challenges of Hadoop Implementation Ā©2014 Cloudera, Inc. All Rights Reserved.
  • 26. 26 Other challenges - Architectural Considerations ā€¢ Storage managers? ā€“ HDFS? HBase? ā€¢ Data storage and modeling: ā€“ File formats? Compression? Schema design? ā€¢ Data movement ā€“ How do we actually get the data into Hadoop? How do we get it out? ā€¢ Metadata ā€“ How do we manage data about the data? ā€¢ Data access and processing ā€“ How will the data be accessed once in Hadoop? How can we transform it? How do we query it? ā€¢ Orchestration ā€“ How do we manage the workflow for all of this? Ā©2014 Cloudera, Inc. All Rights Reserved.
  • 27. 27 Case Study Requirements Overview of Requirements
  • 28. 28 Overview of Requirements ā€¢ Data ingestion ā€“ what requirements do we have for moving data into our processing flow? ā€¢ Data storage ā€“ what requirements do we have for the storage of data, both incoming raw data and processed data? ā€¢ Data processing ā€“ how do we need to process the data to meet our functional requirements? ā€¢ Workflow orchestration ā€“ how do we manage and monitor all the processing? Ā©2014 Cloudera, Inc. All Rights Reserved.
  • 29. 29 Case Study Requirements Data Ingestion
  • 30. 30 Data Ingestion Requirements Ā©2014 Cloudera, Inc. All Rights Reserved. Web Servers Web Servers Web Servers Web Servers Web Servers Logs 244.157.45.12 - - [17/ Oct/2014:21:08:30 ] "GET /seatposts HTTP/ 1.0" 200 4463 ā€¦ CRM Data ODS Web Servers Web Servers Web Servers Web Servers Web Servers Logs Hadoop
  • 31. 31 Data Ingestion Requirements ā€¢ So we need to be able to support: ā€“ Reliable ingestion of large volumes of semi-structured event data arriving with high velocity (e.g. logs). ā€“ Timeliness of data availability ā€“ data needs to be available for processing to meet business service level agreements. ā€“ Periodic ingestion of data from relational data stores. Ā©2014 Cloudera, Inc. All Rights Reserved.
  • 32. 32 Case Study Requirements Data Storage
  • 33. 33 Data Storage Requirements Ā©2014 Cloudera, Inc. All Rights Reserved. Store all the data Make the data accessible for processing Compress the data
  • 34. 34 Case Study Requirements Data Processing
  • 35. 35 Processing requirements Be able to answer questions like: ā€¢ What is my websiteā€™s bounce rate? ā€“ i.e. how many % of visitors donā€™t go past the landing page? ā€¢ Which marketing channels are leading to most sessions? ā€¢ Do attribution analysis ā€“ Which channels are responsible for most conversions? Ā©2014 Cloudera, Inc. All Rights Reserved.
  • 36. 36 Sessionization Ā©2014 Cloudera, Inc. All Rights Reserved. Website visit Visitor 1 Session 1 Visitor 1 Session 2 Visitor 2 Session 1 > 30 minutes
  • 37. 37 Case Study Requirements Orchestration
  • 38. 38 Orchestration is simple We just need to execute actions One after another Ā©2014 Cloudera, Inc. All Rights Reserved.
  • 39. Ā©2014 Cloudera, Inc. All Rights Reserved. 39 Actually, we also need to handle errors And user notifications ā€¦.
  • 40. Andā€¦ ā€¢ Re-start workflows after errors ā€¢ Reuse of actions in multiple workflows ā€¢ Complex workflows with decision points ā€¢ Trigger actions based on events ā€¢ Tracking metadata ā€¢ Integration with enterprise software ā€¢ Data lifecycle ā€¢ Data quality control ā€¢ Reports Ā©2014 Cloudera, Inc. All Rights Reserved. 40
  • 41. Ā©2014 Cloudera, Inc. All Rights Reserved. 41 OK, maybe we need a product To help us do all that
  • 43. 43 Data Modeling Considerations ā€¢ We need to consider the following in our architecture: ā€“ Storage layer ā€“ HDFS? HBase? Etc. ā€“ File system schemas ā€“ how will we lay out the data? ā€“ File formats ā€“ what storage formats to use for our data, both raw and processed data? ā€“ Data compression formats? Ā©2014 Cloudera, Inc. All Rights Reserved.
  • 44. 44 Architectural Considerations Data Modeling ā€“ Storage Layer
  • 45. 45 Data Storage Layer Choices ā€¢ Two likely choices for raw data: Ā©2014 Cloudera, Inc. All Rights Reserved.
  • 46. 46 Data Storage Layer Choices ā€¢ Stores data directly as files ā€¢ Fast scans ā€¢ Poor random reads/writes ā€¢ Stores data as Hfiles on HDFS ā€¢ Slow scans ā€¢ Fast random reads/writes Ā©2014 Cloudera, Inc. All Rights Reserved.
  • 47. 47 Data Storage ā€“ Storage Manager Considerations ā€¢ Incoming raw data: ā€“ Processing requirements call for batch transformations across multiple records ā€“ for example sessionization. ā€¢ Processed data: ā€“ Access to processed data will be via things like analytical queries ā€“ again requiring access to multiple records. ā€¢ We choose HDFS ā€“ Processing needs in this case served better by fast scans. Ā©2014 Cloudera, Inc. All Rights Reserved.
  • 48. 48 Architectural Considerations Data Modeling ā€“ Raw Data Storage
  • 49. Storage Formats ā€“ Raw Data and Processed Data 49 Ā©2014 Cloudera, Inc. All Rights Reserved. Processed Data Raw Data
  • 50. 50 Raw Data Storage ā€“ Format Considerations ā€¢ Store as plain text? ā€“ Sure, well supported by Hadoop. ā€“ Text can easily be processed by MapReduce, loaded into Hive for analysis, and so on. ā€¢ Butā€¦ ā€“ Will begin to consume lots of space in HDFS. ā€“ May not be optimal for processing by tools in the Hadoop ecosystem. Ā©2014 Cloudera, Inc. All Rights Reserved.
  • 51. 51 Raw Data Storage ā€“ Format Considerations ā€¢ But, we can compress the text filesā€¦ ā€“ Gzip ā€“ supported by Hadoop, but not splittable. ā€“ Bzip2 ā€“ hey, splittable! Great compression! But decompression is slooowww. ā€“ LZO ā€“ splittable (with some work), good compress/de-compress performance. Good choice for storing text files on Hadoop. ā€“ Snappy ā€“ provides a good tradeoff between size and speed. Ā©2014 Cloudera, Inc. All Rights Reserved.
  • 52. 52 Raw Data Storage ā€“ More About Snappy ā€¢ Designed at Google to provide high compression speeds with reasonable compression. ā€¢ Not the highest compression, but provides very good performance for processing on Hadoop. ā€¢ Snappy is not splittable though, which brings us toā€¦ Ā©2014 Cloudera, Inc. All Rights Reserved.
  • 53. 53 SequenceFile ā€¢ Stores records as binary key/value pairs. ā€¢ SequenceFile ā€œblocksā€ can be compressed. ā€¢ This enables splittability with non-splittable compression. Ā©2014 Cloudera, Inc. All Rights Reserved.
  • 54. 54 Avro ā€¢ Kinda SequenceFile on Steroids. ā€¢ Self-documenting ā€“ stores schema in header. ā€¢ Provides very efficient storage. ā€¢ Supports splittable compression. Ā©2014 Cloudera, Inc. All Rights Reserved.
  • 55. 55 Our Format Recommendations for Raw Dataā€¦ ā€¢ Avro with Snappy ā€“ Snappy provides optimized compression. ā€“ Avro provides compact storage, self-documenting files, and supports schema evolution. ā€“ Avro also provides better failure handling than other choices. ā€¢ SequenceFiles would also be a good choice, and are directly supported by ingestion tools in the ecosystem. ā€“ But only supports Java. Ā©2014 Cloudera, Inc. All Rights Reserved.
  • 56. 56 But Noteā€¦ ā€¢ For simplicity, weā€™ll use plain text for raw data in our example. Ā©2014 Cloudera, Inc. All Rights Reserved.
  • 57. 57 Architectural Considerations Data Modeling ā€“ Processed Data Storage
  • 58. Storage Formats ā€“ Raw Data and Processed Data 58 Ā©2014 Cloudera, Inc. All Rights Reserved. Processed Data Raw Data
  • 59. 59 Access to Processed Data Column A Column B Column C Column D Value Value Value Value Value Value Value Value Value Value Value Value Value Value Value Value Ā©2014 Cloudera, Inc. All Rights Reserved. Analytical Queries
  • 60. 60 Columnar Formats ā€¢ Eliminates I/O for columns that are not part of a query. ā€¢ Works well for queries that access a subset of columns. ā€¢ Often provide better compression. ā€¢ These add up to dramatically improved performance for many queries. Ā©2014 Cloudera, Inc. All Rights Reserved. 1 2014-10-1 3 abc 2 2014-10-1 4 def 3 2014-10-1 5 ghi 1 2 3 2014-10-1 2014-10-1 3 4 2014-10-1 5 abc def ghi
  • 61. 61 Columnar Choices ā€“ RCFile ā€¢ Provides efficient processing for MapReduce applications, but generally only used with Hive. ā€¢ Only supports Java. ā€¢ No Avro support. ā€¢ Limited compression support. ā€¢ Sub-optimal performance compared to newer columnar formats. Ā©2014 Cloudera, Inc. All Rights Reserved.
  • 62. 62 Columnar Choices ā€“ ORC ā€¢ A better RCFile. ā€¢ Supports Hive, but currently limited support for other interfaces. ā€¢ Only supports Java. Ā©2014 Cloudera, Inc. All Rights Reserved.
  • 63. 63 Columnar Choices ā€“ Parquet ā€¢ Supports multiple programming interfaces ā€“ MapReduce, Hive, Impala, Pig. ā€¢ Multiple language support. ā€¢ Broad vendor support. ā€¢ These features make Parquet a good choice for our processed data. Ā©2014 Cloudera, Inc. All Rights Reserved.
  • 64. 64 Architectural Considerations Data Modeling ā€“ Schema Design
  • 65. 65 HDFS Schema Design ā€“ One Recommendation /user/<username> - User specific data, jars, conf files /etl ā€“ Data in various stages of ETL workflow /tmp ā€“ temp data from tools or shared between users /data ā€“ processed data to be shared data with the entire organization /app ā€“ Everything but data: UDF jars, HQL files, Oozie workflows Ā©2014 Cloudera, Inc. All Rights Reserved.
  • 66. 66 Partitioning ā€¢ Split the dataset into smaller consumable chunks. ā€¢ Rudimentary form of ā€œindexingā€. Reduces I/O needed to process queries. Ā©2014 Cloudera, Inc. All Rights Reserved. dataset col=val1/file.txt col=val2/file.txt ā€¦ col=valn/file.txt Un-partitioned HDFS directory dataset file1.txt file2.txt ā€¦ filen.txt structure Partitioned HDFS directory structure
  • 67. 67 Partitioning considerations ā€¢ What column to partition by? ā€“ Donā€™t have too many partitions (<10,000) ā€“ Donā€™t have too many small files in the partitions (more than block size generally) ā€¢ Weā€™ll partition by timestamp. This applies to both our raw and processed data. Ā©2014 Cloudera, Inc. All Rights Reserved.
  • 68. 68 Partitioning For Our Case Study ā€¢ Raw dataset: ā€“ /etl/BI/casualcyclist/clicks/rawlogs/year=2014/month=10/day=10! ā€¢ Processed dataset: ā€“ /data/bikeshop/clickstream/year=2014/month=10/day=10! Ā©2014 Cloudera, Inc. All Rights Reserved.
  • 69. 69 A Word About Partitioning Alternatives year=2014/month=10/day=10 dt=2014-10-10 Ā©2014 Cloudera, Inc. All Rights Reserved.
  • 71. Typical Clickstream data sources Ā©2014 Cloudera, Inc. All rights reserved. 71 ā€¢ Omniture data on FTP ā€¢ Apps ā€¢ App Logs ā€¢ RDBMS
  • 72. Getting Files from FTP Ā©2014 Cloudera, Inc. All rights reserved. 72
  • 73. Donā€™t over-complicate things curl ftp://myftpsite.com/sitecatalyst/ myreport_2014-10-05.tar.gz --user name:password | hdfs -put - /etl/clickstream/raw/ sitecatalyst/myreport_2014-10-05.tar.gz Ā©2014 Cloudera, Inc. All rights reserved. 73
  • 74. Event Streaming ā€“ Flume and Kafka Reliable, distributed and highly available systems That allow streaming events to Hadoop Ā©2014 Cloudera, Inc. All rights reserved. 74
  • 75. ā€¢ Many available data collection sources ā€¢ Well integrated into Hadoop ā€¢ Supports file transformations ā€¢ Can implement complex topologies ā€¢ Very low latency ā€¢ No programming required Ā©2014 Cloudera, Inc. All rights reserved. 75 Flume:
  • 76. ā€œWe just want to grab data from this directory and write it to HDFSā€ Ā©2014 Cloudera, Inc. All rights reserved. 76 We use Flume when:
  • 77. ā€¢ Very high-throughput publish-subscribe messaging ā€¢ Highly available ā€¢ Stores data and can replay ā€¢ Can support many consumers with no extra latency Ā©2014 Cloudera, Inc. All rights reserved. 77 Kafka is:
  • 78. ā€œKafka is awesome. We heard it cures cancerā€ Ā©2014 Cloudera, Inc. All rights reserved. 78 Use Kafka When:
  • 79. Ā©2014 Cloudera, Inc. All rights reserved. 79 Actually, why choose? ā€¢ Use Flume with a Kafka Source ā€¢ Allows to get data from Kafka, run some transformations write to HDFS, HBase or Solr
  • 80. ā€¢ We want to ingest events from log files ā€¢ Flumeā€™s Spooling Directory source fits ā€¢ With HDFS Sink ā€¢ We would have used Kafka ifā€¦ ā€“ We wanted the data in non-Hadoop systems too Ā©2014 Cloudera, Inc. All rights reserved. 80 In Our Exampleā€¦
  • 81. 81 Short Intro to Flume Sources Interceptors Selectors Channels Sinks Flume Agent Twitter, logs, JMS, webserver Mask, re-format, validateā€¦ DR, critical Memory, file HDFS, Hbase, Solr
  • 82. 82 Configuration ā€¢ Declarative ā€“ No coding required. ā€“ Configuration specifies how components are wired together. Ā©2014 Cloudera, Inc. All Rights Reserved.
  • 83. 83 Interceptors ā€¢ Mask fields ā€¢ Validate information against external source ā€¢ Extract fields ā€¢ Modify data format ā€¢ Filter or split events Ā©2014 Cloudera, Inc. All rights reserved.
  • 84. Ā©2014 Cloudera, Inc. All rights reserved. 84 Any sufficiently complex configuration Is indistinguishable from code
  • 85. 85 A Brief Discussion of Flume Patterns ā€“ Fan-in ā€¢ Flume agent runs on each of our servers. ā€¢ These client agents send data to multiple agents to provide reliability. ā€¢ Flume provides support for load balancing. Ā©2014 Cloudera, Inc. All Rights Reserved.
  • 86. 86 A Brief Discussion of Flume Patterns ā€“ Splitting ā€¢ Common need is to split data on ingest. ā€¢ For example: ā€“ Sending data to multiple clusters for DR. ā€“ To multiple destinations. ā€¢ Flume also supports partitioning, which is key to our implementation. Ā©2014 Cloudera, Inc. All Rights Reserved.
  • 87. Flume Demo Ā©2014 Cloudera, Inc. All rights reserved. 87
  • 88. 88 Flume Demo ā€“ Architecture Web Server Flume Agent Web Server Flume Agent Web Server Flume Agent Web Server Flume Agent Web Server Flume Agent Web Server Flume Agent Web Server Flume Agent Web Server Flume Agent Flume Agent Flume Agent Flume Agent Flume Agent Fan-in Pattern Multi Agents for Failover and rolling restarts HDFS Ā©2014 Cloudera, Inc. All Rights Reserved.
  • 89. 89 Flume Agent Web Logs Spooling Dir Source Timestamp Interceptor File Channel Avro Sink Avro Sink Avro Sink Flume Demo ā€“ Client Tier Web Server Flume Agent
  • 90. 90 Flume Demo ā€“ Collector Tier Flume Agent Flume Agent HDFS Avro Source File Channel HDFS Sink HDFS Sink
  • 91. What ifā€¦. We were to use Kafka? ā€¢ Add Kafka producer to our webapp ā€¢ Send clicks and searches as messages ā€¢ Run an MR job once a day to grab all new events ā€¢ We can add a second consumer for real-time ā€¢ Another consumer for alertingā€¦ Ā©2014 Cloudera, Inc. All rights reserved. 91
  • 92. Ā©2014 Cloudera, Inc. All rights reserved. 92 Camus
  • 93. 93 Architectural Considerations Data Processing ā€“ Engines tiny.cloudera.com/app-arch-slides
  • 94. 94 Processing Engines ā€¢ MapReduce ā€¢ Abstractions ā€¢ Spark ā€¢ Spark Streaming ā€¢ Impala Confidentiality Information Goes Here
  • 95. 95 MapReduce ā€¢ Oldie but goody ā€¢ Restrictive Framework / Innovated Work Around ā€¢ Extreme Batch Confidentiality Information Goes Here
  • 96. 96 MapReduce Basic High Level Confidentiality Information Goes Here Mapper HDFS (Replicated) Native File System Block of Data Temp Spill Data Partitioned Sorted Data Reducer Reducer Local Copy Output File
  • 97. 97 MapReduce Innovation ā€¢ Mapper Memory Joins ā€¢ Reducer Memory Joins ā€¢ Buckets Sorted Joins ā€¢ Cross Task Communication ā€¢ Windowing ā€¢ And Much More Confidentiality Information Goes Here
  • 98. 98 Abstractions ā€¢ SQL ā€“ Hive ā€¢ Script/Code ā€“ Pig: Pig Latin ā€“ Crunch: Java/Scala ā€“ Cascading: Java/Scala Confidentiality Information Goes Here
  • 99. 99 Spark ā€¢ The New Kid that isnā€™t that New Anymore ā€¢ Easily 10x less code ā€¢ Extremely Easy and Powerful API ā€¢ Very good for machine learning ā€¢ Scala, Java, and Python ā€¢ RDDs ā€¢ DAG Engine Confidentiality Information Goes Here
  • 100. 100 Spark - DAG Confidentiality Information Goes Here
  • 101. 101 Spark - DAG Confidentiality Information Goes Here Filter KeyBy KeyBy TextFile TextFile Join Filter Take
  • 102. 102 Spark - DAG Confidentiality Information Goes Here Filter KeyBy KeyBy TextFile TextFile Join Filter Take Good Good Good Good Good Good Good-Replay Good-Replay Good-Replay Good Good-Replay Good Good-Replay Lost Block Replay Good-Replay Lost Block Good Future Future Future Future
  • 103. 103 Spark Streaming ā€¢ Calling Spark in a Loop ā€¢ Extends RDDs with DStream ā€¢ Very Little Code Changes from ETL to Streaming Confidentiality Information Goes Here
  • 104. 104 Spark Streaming Confidentiality Information Goes Here Single Pass Source Receiver RDD Source Receiver RDD RDD Filter Count Print Source Receiver RDD RDD RDD Single Pass Filter Count Print Pre-first Batch First Batch Second Batch
  • 105. 105 Spark Streaming Confidentiality Information Goes Here Single Pass Source Receiver RDD Source Receiver RDD RDD Filter Count Print Source Receiver RDD RDD RDD Stateful RDD 1 Single Pass Filter Count Pre-first Batch First Batch Second Batch Stateful RDD 1 Stateful RDD 2 Print
  • 106. 106 Impala Confidentiality Information Goes Here ā€¢ MPP Style SQL Engine on top of Hadoop ā€¢ Very Fast ā€¢ High Concurrency
  • 107. 107 Impala ā€“ Broadcast Join Confidentiality Information Goes Here Impala Daemon Smaller Table Data Block 100% Cached Smaller Table Smaller Table Data Block Impala Daemon 100% Cached Smaller Table Impala Daemon 100% Cached Smaller Table Impala Daemon Hash Join Function Bigger Table Data Block 100% Cached Smaller Table Output Impala Daemon Hash Join Function Bigger Table Data Block 100% Cached Smaller Table Output Impala Daemon Hash Join Function Bigger Table Data Block 100% Cached Smaller Table Output
  • 108. 108 Impala ā€“ Broadcast Join Confidentiality Information Goes Here Impala Daemon Smaller Table Data Block ~33% Cached Smaller Table Impala Daemon Smaller Table Data Block ~33% Cached Smaller Table Impala Daemon ~33% Cached Smaller Table Hash Partitioner Hash Partitioner Impala Daemon BiggerTable Data Block Impala Daemon Impala Daemon Hash Partitioner Hash Join Function 33% Cached Smaller Table Hash Join Function 33% Cached Smaller Table Hash Join Function 33% Cached Smaller Table Output Output Output Hash Partitioner BiggerTable Data Block Hash Partitioner BiggerTable Data Block
  • 109. 109 Impala vs Hive Confidentiality Information Goes Here ā€¢ Very different approaches and ā€¢ We may see convergence at some point ā€¢ But for now ā€“ Impala for speed ā€“ Hive for batch
  • 110. 110 Architectural Considerations Data Processing ā€“ Patterns and Recommendations
  • 111. 111 What processing needs to happen? Confidentiality Information Goes Here ā€¢ Sessionization ā€¢ Filtering ā€¢ Deduplication ā€¢ BI / Discovery
  • 112. 112 Sessionization Confidentiality Information Goes Here Website visit Visitor 1 Session 1 Visitor 1 Session 2 Visitor 2 Session 1 > 30 minutes
  • 113. 113 Why sessionize? Helps answers questions like: ā€¢ What is my websiteā€™s bounce rate? ā€“ i.e. how many % of visitors donā€™t go past the landing page? ā€¢ Which marketing channels (e.g. organic search, display ad, etc.) are leading to most sessions? ā€“ Which ones of those lead to most conversions (e.g. people buying things, Confidentiality Information Goes Here signing up, etc.) ā€¢ Do attribution analysis ā€“ which channels are responsible for most conversions?
  • 114. 114 Sessionization 244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463 "http:// bestcyclingreviews.com/top_online_shops" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36ā€ 244.157.45.12+1413580110 244.157.45.12 - - [17/Oct/2014:21:59:59 ] "GET /Store/cart.jsp?productID=1023 HTTP/ 1.0" 200 3757 "http://www.casualcyclist.com" "Mozilla/5.0 (Linux; U; Android 2.3.5; en-us; HTC Vision Build/GRI40) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1ā€ 244.157.45.12+1413583199 Confidentiality Information Goes Here
  • 115. 115 How to Sessionize? Confidentiality Information Goes Here 1. Given a list of clicks, determine which clicks came from the same user (Partitioning, ordering) 2. Given a particular user's clicks, determine if a given click is a part of a new session or a continuation of the previous session (Identifying session boundaries)
  • 116. 116 #1 ā€“ Which clicks are from same user? ā€¢ We can use: ā€“ IP address (244.157.45.12) ā€“ Cookies (A9A3BECE0563982D) ā€“ IP address (244.157.45.12)and user agent string ((KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36") Ā©2014 Cloudera, Inc. All Rights Reserved.
  • 117. 117 #1 ā€“ Which clicks are from same user? ā€¢ We can use: ā€“ IP address (244.157.45.12) ā€“ Cookies (A9A3BECE0563982D) ā€“ IP address (244.157.45.12)and user agent string ((KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36") Ā©2014 Cloudera, Inc. All Rights Reserved.
  • 118. 244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463 "http:// bestcyclingreviews.com/top_online_shops" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36ā€ 244.157.45.12 - - [17/Oct/2014:21:59:59 ] "GET /Store/cart.jsp?productID=1023 HTTP/1.0" 200 3757 "http://www.casualcyclist.com" "Mozilla/5.0 (Linux; U; Android 2.3.5; en-us; HTC Vision Build/GRI40) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1ā€ 118 #1 ā€“ Which clicks are from same user? Ā©2014 Cloudera, Inc. All Rights Reserved.
  • 119. > 30 mins apart = different sessions 244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463 "http:// bestcyclingreviews.com/top_online_shops" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36ā€ 244.157.45.12 - - [17/Oct/2014:21:59:59 ] "GET /Store/cart.jsp?productID=1023 HTTP/1.0" 200 3757 "http://www.casualcyclist.com" "Mozilla/5.0 (Linux; U; Android 2.3.5; en-us; HTC Vision Build/GRI40) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1ā€ 119 #2 ā€“ Which clicks part of the same session? Ā©2014 Cloudera, Inc. All Rights Reserved.
  • 120. Sessionization engine recommendation ā€¢ We have sessionization code in MR, Spark and Hive on github. The complexity of the code varies, depends on the expertise in the organization. ā€¢ We choose Hive, since itā€™s a fairly small query to maintain instead of large pieces of code. Ā©2014 Cloudera, Inc. All rights reserved. 120
  • 121. 121 Filtering ā€“ filter out incomplete records 244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463 "http:// bestcyclingreviews.com/top_online_shops" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36ā€ 244.157.45.12 - - [17/Oct/2014:21:59:59 ] "GET /Store/cart.jsp?productID=1023 HTTP/1.0" 200 3757 "http://www.casualcyclist.com" "Mozilla/5.0 (Linux; Uā€¦ Ā©2014 Cloudera, Inc. All Rights Reserved.
  • 122. Filtering ā€“ filter out records from bots/spiders 244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463 "http:// bestcyclingreviews.com/top_online_shops" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36ā€ 209.85.238.11 - - [17/Oct/2014:21:59:59 ] "GET /Store/cart.jsp?productID=1023 HTTP/1.0" 200 3757 "http://www.casualcyclist.com" "Mozilla/5.0 (Linux; U; Android 2.3.5; en-us; HTC Vision Build/GRI40) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1ā€ 122 Ā©2014 Cloudera, Inc. All Rights Reserved. Google spider IP address
  • 123. Filtering recommendation ā€¢ Bot/Spider filtering can be done easily in any of the engines ā€¢ Incomplete records are harder to filter in schema systems like Hive, Impala, Pig, etc. ā€¢ Pretty close choice between MR, Hive and Spark ā€¢ Can be done in Spark using rdd.filter() ā€¢ We can simply embed this in our sessionization job Ā©2014 Cloudera, Inc. All rights reserved. 123
  • 124. 124 Deduplication ā€“ remove duplicate records 244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463 "http:// bestcyclingreviews.com/top_online_shops" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36ā€ 244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463 "http:// bestcyclingreviews.com/top_online_shops" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36ā€ Ā©2014 Cloudera, Inc. All Rights Reserved.
  • 125. Deduplication recommendation ā€¢ Can be done in all engines. ā€¢ We already have a Hive table with all the columns, a simple DISTINCT query will perform deduplication ā€¢ reduce() in spark ā€¢ We use Pig Ā©2014 Cloudera, Inc. All rights reserved. 125
  • 126. BI/Discovery engine recommendation ā€¢ Main requirements for this are: ā€“ Low latency ā€“ SQL interface (e.g. JDBC/ODBC) ā€“ Users donā€™t know how to code ā€¢ We chose Impala ā€“ Itā€™s a SQL engine ā€“ Much faster than other engines ā€“ Provides standard JDBC/ODBC interfaces Ā©2014 Cloudera, Inc. All rights reserved. 126
  • 127. Processing Demo Ā©2014 Cloudera, Inc. All rights reserved. 127
  • 129. 129 Orchestrating Clickstream ā€¢ Data arrives through Flume ā€¢ Triggers a processing event: ā€“ Sessionize ā€“ Enrich ā€“ Location, marketing channelā€¦ ā€“ Store as Parquet ā€¢ Each day we process events from the previous day
  • 130. ā€¢ Workflow is fairly simple ā€¢ Need to trigger workflow based on data ā€¢ Be able to recover from errors ā€¢ Perhaps notify on the status ā€¢ And collect metrics for reporting Ā©2014 Cloudera, Inc. All rights reserved. 130 Choosing Right
  • 131. 131 Oozie or Azkaban? Ā©2014 Cloudera, Inc. All rights reserved.
  • 132. Ā©2014 Cloudera, Inc. All rights reserved. 132 Oozie Architecture
  • 133. ā€¢ Part of all major Hadoop distributions ā€¢ Hue integration ā€¢ Built -in actions ā€“ Hive, Sqoop, MapReduce, SSH ā€¢ Complex workflows with decisions ā€¢ Event and time based scheduling ā€¢ Notifications ā€¢ SLA Monitoring ā€¢ REST API Ā©2014 Cloudera, Inc. All rights reserved. 133 Oozie features
  • 134. Ā©2014 Cloudera, Inc. All rights reserved. 134 Oozie Drawbacks ā€¢ Overhead in launching jobs ā€¢ Steep learning curve ā€¢ XML Workflows
  • 135. Client Hadoop Ā©2014 Cloudera, Inc. All rights reserved. 135 Azkaban Architecture Azkaban Executor Server Azkaban Web Server HDFS viewer plugin Job types plugin MySQL
  • 136. ā€¢ Simplicity ā€¢ Great UI ā€“ including pluggable visualizers ā€¢ Lots of plugins ā€“ Hive, Pigā€¦ ā€¢ Reporting plugin Ā©2014 Cloudera, Inc. All rights reserved. 136 Azkaban features
  • 137. ā€¢ Doesnā€™t support workflow decisions ā€¢ Canā€™t represent data dependency Ā©2014 Cloudera, Inc. All rights reserved. 137 Azkaban Limitations
  • 138. ā€¢ Workflow is fairly simple ā€¢ Need to trigger workflow based on data ā€¢ Be able to recover from errors ā€¢ Perhaps notify on the status ā€¢ And collect metrics for reporting Ā©2014 Cloudera, Inc. All rights reserved. 138 Choosingā€¦ Easier in Oozie
  • 139. Choosing the right Orchestration Tool ā€¢ Workflow is fairly simple ā€¢ Need to trigger workflow based on data ā€¢ Be able to recover from errors ā€¢ Perhaps notify on the status ā€¢ And collect metrics for reporting Better in Azkaban Ā©2014 Cloudera, Inc. All rights reserved. 139
  • 140. Important Decision Consideration! The best orchestration tool is the one you are an expert on Ā©2014 Cloudera, Inc. All rights reserved. 140
  • 141. Orchestration Patterns ā€“ Fan Out Ā©2014 Cloudera, Inc. All rights reserved. 141
  • 142. Capture & Decide Pattern Ā©2014 Cloudera, Inc. All rights reserved. 142
  • 143. Oozie Demo Ā©2014 Cloudera, Inc. All rights reserved. 143
  • 144. 144 Putting It All Together Final Architecture
  • 145. 145 Final Architecture ā€“ High Level Overview Data Sources Ingestion Data Storage/ Processing Data Reporting/ Analysis Ā©2014 Cloudera, Inc. All Rights Reserved.
  • 146. 146 Final Architecture ā€“ High Level Overview Data Sources Ingestion Data Storage/ Processing Data Reporting/ Analysis Ā©2014 Cloudera, Inc. All Rights Reserved.
  • 147. 147 Final Architecture ā€“ Ingestion Web Server Flume Agent Web Server Flume Agent Web Server Flume Agent Web Server Flume Agent Web Server Flume Agent Web Server Flume Agent Web Server Flume Agent Web Server Flume Agent Flume Agent Flume Agent Flume Agent Flume Agent Fan-in Pattern Multi Agents for Failover and rolling restarts HDFS Ā©2014 Cloudera, Inc. All Rights Reserved.
  • 148. 148 Final Architecture ā€“ High Level Overview Data Sources Ingestion Data Storage/ Processing Data Reporting/ Analysis Ā©2014 Cloudera, Inc. All Rights Reserved.
  • 149. 149 Final Architecture ā€“ Storage and Processing /etl/BI/casualcyclist/clicks/rawlogs/ year=2014/month=10/day=10 ā€¦ Data Processing Deduplication Filtering Sessionization Etc. /data/bikeshop/clickstream/ year=2014/month=10/day=10 ā€¦ Ā©2014 Cloudera, Inc. All Rights Reserved.
  • 150. 150 Final Architecture ā€“ High Level Overview Data Sources Ingestion Data Storage/ Processing Data Reporting/ Analysis Ā©2014 Cloudera, Inc. All Rights Reserved.
  • 151. 151 Final Architecture ā€“ Data Access Hive/ Impala BI/ Analytics Tools JDBC/ODBC DWH Sqoop Local Disk R, etc. DB import tool Ā©2014 Cloudera, Inc. All Rights Reserved.
  • 152. 152 Stay in touch! @hadooparchbook hadooparchitecturebook.com slideshare.com/hadooparchbook
  • 153. Confidentiality Information Goes Here 153 Join the Discussion Get community help or provide feedback cloudera.com/ community
  • 154. 154 Try Hadoop Now cloudera.com/live
  • 155. 155 Visit us at Booth #305 Highlights: Hear whatā€™s new with 5.2 including Impala 2.0 Learn how Cloudera is setting the standard for Hadoop in the Cloud BOOK SIGNINGS THEATER SESSIONS TECHNICAL DEMOS GIVEAWAYS
  • 156. 156 Free books and office hours! ā€¢ Book signing #1 ā€“ Tomorrow, 10/16, 3:45 ā€“ 4:15pm at Expo Hall - O'Reilly Authors Booth ā€¢ Book signing #2 ā€“ Tomorrow, 10/16, 6:35 ā€“ 7:00pm at Cloudera Booth #305 ā€¢ Office Hours, tomorrow, 10/16, 10:30 ā€“ 11:00 am ā€“ Mark and Ted in Room: Table A ā€“ Gwen and Jonathan in Room: Table C Ā©2014 Cloudera, Inc. All Rights Reserved.