Submit Search
Upload
Hadoop Application Architectures tutorial at Big DataService 2015
•
3 likes
•
3,371 views
H
hadooparchbook
Follow
Hadoop Application Architectures tutorial at Big DataService 2015
Read less
Read more
Technology
Report
Share
Report
Share
1 of 160
Download now
Download to read offline
Recommended
Strata NY 2014 - Architectural considerations for Hadoop applications tutorial
Strata NY 2014 - Architectural considerations for Hadoop applications tutorial
hadooparchbook
Application architectures with Hadoop and Sessionization in MR
Application architectures with Hadoop and Sessionization in MR
markgrover
Architectural considerations for Hadoop Applications
Architectural considerations for Hadoop Applications
hadooparchbook
Hadoop Application Architectures tutorial - Strata London
Hadoop Application Architectures tutorial - Strata London
hadooparchbook
Strata EU tutorial - Architectural considerations for hadoop applications
Strata EU tutorial - Architectural considerations for hadoop applications
hadooparchbook
Application Architectures with Hadoop
Application Architectures with Hadoop
hadooparchbook
Application Architectures with Hadoop - UK Hadoop User Group
Application Architectures with Hadoop - UK Hadoop User Group
hadooparchbook
Application Architectures with Hadoop - Big Data TechCon SF 2014
Application Architectures with Hadoop - Big Data TechCon SF 2014
hadooparchbook
Recommended
Strata NY 2014 - Architectural considerations for Hadoop applications tutorial
Strata NY 2014 - Architectural considerations for Hadoop applications tutorial
hadooparchbook
Application architectures with Hadoop and Sessionization in MR
Application architectures with Hadoop and Sessionization in MR
markgrover
Architectural considerations for Hadoop Applications
Architectural considerations for Hadoop Applications
hadooparchbook
Hadoop Application Architectures tutorial - Strata London
Hadoop Application Architectures tutorial - Strata London
hadooparchbook
Strata EU tutorial - Architectural considerations for hadoop applications
Strata EU tutorial - Architectural considerations for hadoop applications
hadooparchbook
Application Architectures with Hadoop
Application Architectures with Hadoop
hadooparchbook
Application Architectures with Hadoop - UK Hadoop User Group
Application Architectures with Hadoop - UK Hadoop User Group
hadooparchbook
Application Architectures with Hadoop - Big Data TechCon SF 2014
Application Architectures with Hadoop - Big Data TechCon SF 2014
hadooparchbook
Fraud Detection using Hadoop
Fraud Detection using Hadoop
hadooparchbook
Architecting application with Hadoop - using clickstream analytics as an example
Architecting application with Hadoop - using clickstream analytics as an example
hadooparchbook
Architecting applications with Hadoop - Fraud Detection
Architecting applications with Hadoop - Fraud Detection
hadooparchbook
Intro to hadoop tutorial
Intro to hadoop tutorial
markgrover
Hadoop Application Architectures - Fraud Detection
Hadoop Application Architectures - Fraud Detection
hadooparchbook
Architecting a Next Generation Data Platform
Architecting a Next Generation Data Platform
hadooparchbook
Streaming architecture patterns
Streaming architecture patterns
hadooparchbook
Application architectures with Hadoop – Big Data TechCon 2014
Application architectures with Hadoop – Big Data TechCon 2014
hadooparchbook
Hadoop application architectures - Fraud detection tutorial
Hadoop application architectures - Fraud detection tutorial
hadooparchbook
Top 5 mistakes when writing Streaming applications
Top 5 mistakes when writing Streaming applications
hadooparchbook
What no one tells you about writing a streaming app
What no one tells you about writing a streaming app
hadooparchbook
Architecting a next generation data platform
Architecting a next generation data platform
hadooparchbook
Hadoop application architectures - Fraud detection tutorial
Hadoop application architectures - Fraud detection tutorial
hadooparchbook
Architecting a Fraud Detection Application with Hadoop
Architecting a Fraud Detection Application with Hadoop
DataWorks Summit
Fraud Detection with Hadoop
Fraud Detection with Hadoop
markgrover
Data warehousing with Hadoop
Data warehousing with Hadoop
hadooparchbook
Application Architectures with Hadoop
Application Architectures with Hadoop
hadooparchbook
Application architectures with hadoop – big data techcon 2014
Application architectures with hadoop – big data techcon 2014
Jonathan Seidman
Hadoop Security and Compliance - StampedeCon 2016
Hadoop Security and Compliance - StampedeCon 2016
StampedeCon
SQL Engines for Hadoop - The case for Impala
SQL Engines for Hadoop - The case for Impala
markgrover
Hadoop application architectures - using Customer 360 as an example
Hadoop application architectures - using Customer 360 as an example
hadooparchbook
Top 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applications
hadooparchbook
More Related Content
What's hot
Fraud Detection using Hadoop
Fraud Detection using Hadoop
hadooparchbook
Architecting application with Hadoop - using clickstream analytics as an example
Architecting application with Hadoop - using clickstream analytics as an example
hadooparchbook
Architecting applications with Hadoop - Fraud Detection
Architecting applications with Hadoop - Fraud Detection
hadooparchbook
Intro to hadoop tutorial
Intro to hadoop tutorial
markgrover
Hadoop Application Architectures - Fraud Detection
Hadoop Application Architectures - Fraud Detection
hadooparchbook
Architecting a Next Generation Data Platform
Architecting a Next Generation Data Platform
hadooparchbook
Streaming architecture patterns
Streaming architecture patterns
hadooparchbook
Application architectures with Hadoop – Big Data TechCon 2014
Application architectures with Hadoop – Big Data TechCon 2014
hadooparchbook
Hadoop application architectures - Fraud detection tutorial
Hadoop application architectures - Fraud detection tutorial
hadooparchbook
Top 5 mistakes when writing Streaming applications
Top 5 mistakes when writing Streaming applications
hadooparchbook
What no one tells you about writing a streaming app
What no one tells you about writing a streaming app
hadooparchbook
Architecting a next generation data platform
Architecting a next generation data platform
hadooparchbook
Hadoop application architectures - Fraud detection tutorial
Hadoop application architectures - Fraud detection tutorial
hadooparchbook
Architecting a Fraud Detection Application with Hadoop
Architecting a Fraud Detection Application with Hadoop
DataWorks Summit
Fraud Detection with Hadoop
Fraud Detection with Hadoop
markgrover
Data warehousing with Hadoop
Data warehousing with Hadoop
hadooparchbook
Application Architectures with Hadoop
Application Architectures with Hadoop
hadooparchbook
Application architectures with hadoop – big data techcon 2014
Application architectures with hadoop – big data techcon 2014
Jonathan Seidman
Hadoop Security and Compliance - StampedeCon 2016
Hadoop Security and Compliance - StampedeCon 2016
StampedeCon
SQL Engines for Hadoop - The case for Impala
SQL Engines for Hadoop - The case for Impala
markgrover
What's hot
(20)
Fraud Detection using Hadoop
Fraud Detection using Hadoop
Architecting application with Hadoop - using clickstream analytics as an example
Architecting application with Hadoop - using clickstream analytics as an example
Architecting applications with Hadoop - Fraud Detection
Architecting applications with Hadoop - Fraud Detection
Intro to hadoop tutorial
Intro to hadoop tutorial
Hadoop Application Architectures - Fraud Detection
Hadoop Application Architectures - Fraud Detection
Architecting a Next Generation Data Platform
Architecting a Next Generation Data Platform
Streaming architecture patterns
Streaming architecture patterns
Application architectures with Hadoop – Big Data TechCon 2014
Application architectures with Hadoop – Big Data TechCon 2014
Hadoop application architectures - Fraud detection tutorial
Hadoop application architectures - Fraud detection tutorial
Top 5 mistakes when writing Streaming applications
Top 5 mistakes when writing Streaming applications
What no one tells you about writing a streaming app
What no one tells you about writing a streaming app
Architecting a next generation data platform
Architecting a next generation data platform
Hadoop application architectures - Fraud detection tutorial
Hadoop application architectures - Fraud detection tutorial
Architecting a Fraud Detection Application with Hadoop
Architecting a Fraud Detection Application with Hadoop
Fraud Detection with Hadoop
Fraud Detection with Hadoop
Data warehousing with Hadoop
Data warehousing with Hadoop
Application Architectures with Hadoop
Application Architectures with Hadoop
Application architectures with hadoop – big data techcon 2014
Application architectures with hadoop – big data techcon 2014
Hadoop Security and Compliance - StampedeCon 2016
Hadoop Security and Compliance - StampedeCon 2016
SQL Engines for Hadoop - The case for Impala
SQL Engines for Hadoop - The case for Impala
Viewers also liked
Hadoop application architectures - using Customer 360 as an example
Hadoop application architectures - using Customer 360 as an example
hadooparchbook
Top 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applications
hadooparchbook
Architecting next generation big data platform
Architecting next generation big data platform
hadooparchbook
Architectural Patterns for Streaming Applications
Architectural Patterns for Streaming Applications
hadooparchbook
How Auto Microcubes Work with Indexing & Caching to Deliver a Consistently Fa...
How Auto Microcubes Work with Indexing & Caching to Deliver a Consistently Fa...
Remy Rosenbaum
Real-Time Data Pipelines with Kafka, Spark, and Operational Databases
Real-Time Data Pipelines with Kafka, Spark, and Operational Databases
SingleStore
How to develop Big Data Pipelines for Hadoop, by Costin Leau
How to develop Big Data Pipelines for Hadoop, by Costin Leau
Codemotion
Spark Streaming & Kafka-The Future of Stream Processing
Spark Streaming & Kafka-The Future of Stream Processing
Jack Gudenkauf
Patricia Yurena: la 1ª Miss España abiertamente Lesbiana
Patricia Yurena: la 1ª Miss España abiertamente Lesbiana
Valkyrja5
Portafolio alvaro romero
Portafolio alvaro romero
alvjor
All Events in City - VIBRANT START-UP CHALLENGE
All Events in City - VIBRANT START-UP CHALLENGE
Amit Panchal
Instructions DOCTER®sight Red Dot | Optics Trade
Instructions DOCTER®sight Red Dot | Optics Trade
Optics-Trade
Ensayo
Ensayo
Jesús Bush Paredes
Perspectiva Fotográfica: Marlins vs. Phillies (23 de Septiembre de 2014)
Perspectiva Fotográfica: Marlins vs. Phillies (23 de Septiembre de 2014)
Rodrigo Lema González
Viewers also liked
(14)
Hadoop application architectures - using Customer 360 as an example
Hadoop application architectures - using Customer 360 as an example
Top 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applications
Architecting next generation big data platform
Architecting next generation big data platform
Architectural Patterns for Streaming Applications
Architectural Patterns for Streaming Applications
How Auto Microcubes Work with Indexing & Caching to Deliver a Consistently Fa...
How Auto Microcubes Work with Indexing & Caching to Deliver a Consistently Fa...
Real-Time Data Pipelines with Kafka, Spark, and Operational Databases
Real-Time Data Pipelines with Kafka, Spark, and Operational Databases
How to develop Big Data Pipelines for Hadoop, by Costin Leau
How to develop Big Data Pipelines for Hadoop, by Costin Leau
Spark Streaming & Kafka-The Future of Stream Processing
Spark Streaming & Kafka-The Future of Stream Processing
Patricia Yurena: la 1ª Miss España abiertamente Lesbiana
Patricia Yurena: la 1ª Miss España abiertamente Lesbiana
Portafolio alvaro romero
Portafolio alvaro romero
All Events in City - VIBRANT START-UP CHALLENGE
All Events in City - VIBRANT START-UP CHALLENGE
Instructions DOCTER®sight Red Dot | Optics Trade
Instructions DOCTER®sight Red Dot | Optics Trade
Ensayo
Ensayo
Perspectiva Fotográfica: Marlins vs. Phillies (23 de Septiembre de 2014)
Perspectiva Fotográfica: Marlins vs. Phillies (23 de Septiembre de 2014)
Similar to Hadoop Application Architectures tutorial at Big DataService 2015
Hadoop Essentials -- The What, Why and How to Meet Agency Objectives
Hadoop Essentials -- The What, Why and How to Meet Agency Objectives
Cloudera, Inc.
Application Architectures with Hadoop | Data Day Texas 2015
Application Architectures with Hadoop | Data Day Texas 2015
Cloudera, Inc.
Simplifying Real-Time Architectures for IoT with Apache Kudu
Simplifying Real-Time Architectures for IoT with Apache Kudu
Cloudera, Inc.
Big Data Day LA 2015 - Brainwashed: Building an IDE for Feature Engineering b...
Big Data Day LA 2015 - Brainwashed: Building an IDE for Feature Engineering b...
Data Con LA
Architecting Applications with Hadoop
Architecting Applications with Hadoop
markgrover
Introducing the data science sandbox as a service 8.30.18
Introducing the data science sandbox as a service 8.30.18
Cloudera, Inc.
Unlock Hadoop Success with Cloudera Navigator Optimizer
Unlock Hadoop Success with Cloudera Navigator Optimizer
Cloudera, Inc.
Turning Petabytes of Data into Profit with Hadoop for the World’s Biggest Ret...
Turning Petabytes of Data into Profit with Hadoop for the World’s Biggest Ret...
Cloudera, Inc.
Enterprise Metadata Integration, Cloudera
Enterprise Metadata Integration, Cloudera
Neo4j
Unlocking Big Data Insights with MySQL
Unlocking Big Data Insights with MySQL
Matt Lord
Putting the Ops in DataOps: Orchestrate the Flow of Data Across Data Pipelines
Putting the Ops in DataOps: Orchestrate the Flow of Data Across Data Pipelines
DATAVERSITY
Large-Scale Data Science on Hadoop (Intel Big Data Day)
Large-Scale Data Science on Hadoop (Intel Big Data Day)
Uri Laserson
Big Data Fundamentals 6.6.18
Big Data Fundamentals 6.6.18
Cloudera, Inc.
Big Data Fundamentals
Big Data Fundamentals
Cloudera, Inc.
Cloudera Analytics and Machine Learning Platform - Optimized for Cloud
Cloudera Analytics and Machine Learning Platform - Optimized for Cloud
Stefan Lipp
Big Data with KNIME is as easy as 1, 2, 3, ...4!
Big Data with KNIME is as easy as 1, 2, 3, ...4!
KNIMESlides
Big Data as easy as 1, 2, 3, ... 4 ... with KNIME
Big Data as easy as 1, 2, 3, ... 4 ... with KNIME
Rosaria Silipo
Hadoop and Manufacturing
Hadoop and Manufacturing
Cloudera, Inc.
Cassandra Summit 2014: Internet of Complex Things Analytics with Apache Cassa...
Cassandra Summit 2014: Internet of Complex Things Analytics with Apache Cassa...
DataStax Academy
Big data journey to the cloud 5.30.18 asher bartch
Big data journey to the cloud 5.30.18 asher bartch
Cloudera, Inc.
Similar to Hadoop Application Architectures tutorial at Big DataService 2015
(20)
Hadoop Essentials -- The What, Why and How to Meet Agency Objectives
Hadoop Essentials -- The What, Why and How to Meet Agency Objectives
Application Architectures with Hadoop | Data Day Texas 2015
Application Architectures with Hadoop | Data Day Texas 2015
Simplifying Real-Time Architectures for IoT with Apache Kudu
Simplifying Real-Time Architectures for IoT with Apache Kudu
Big Data Day LA 2015 - Brainwashed: Building an IDE for Feature Engineering b...
Big Data Day LA 2015 - Brainwashed: Building an IDE for Feature Engineering b...
Architecting Applications with Hadoop
Architecting Applications with Hadoop
Introducing the data science sandbox as a service 8.30.18
Introducing the data science sandbox as a service 8.30.18
Unlock Hadoop Success with Cloudera Navigator Optimizer
Unlock Hadoop Success with Cloudera Navigator Optimizer
Turning Petabytes of Data into Profit with Hadoop for the World’s Biggest Ret...
Turning Petabytes of Data into Profit with Hadoop for the World’s Biggest Ret...
Enterprise Metadata Integration, Cloudera
Enterprise Metadata Integration, Cloudera
Unlocking Big Data Insights with MySQL
Unlocking Big Data Insights with MySQL
Putting the Ops in DataOps: Orchestrate the Flow of Data Across Data Pipelines
Putting the Ops in DataOps: Orchestrate the Flow of Data Across Data Pipelines
Large-Scale Data Science on Hadoop (Intel Big Data Day)
Large-Scale Data Science on Hadoop (Intel Big Data Day)
Big Data Fundamentals 6.6.18
Big Data Fundamentals 6.6.18
Big Data Fundamentals
Big Data Fundamentals
Cloudera Analytics and Machine Learning Platform - Optimized for Cloud
Cloudera Analytics and Machine Learning Platform - Optimized for Cloud
Big Data with KNIME is as easy as 1, 2, 3, ...4!
Big Data with KNIME is as easy as 1, 2, 3, ...4!
Big Data as easy as 1, 2, 3, ... 4 ... with KNIME
Big Data as easy as 1, 2, 3, ... 4 ... with KNIME
Hadoop and Manufacturing
Hadoop and Manufacturing
Cassandra Summit 2014: Internet of Complex Things Analytics with Apache Cassa...
Cassandra Summit 2014: Internet of Complex Things Analytics with Apache Cassa...
Big data journey to the cloud 5.30.18 asher bartch
Big data journey to the cloud 5.30.18 asher bartch
Recently uploaded
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
Michael W. Hawkins
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
debabhi2
Slack Application Development 101 Slides
Slack Application Development 101 Slides
praypatel2
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
hans926745
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Drew Madelung
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Igalia
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
Results
🐬 The future of MySQL is Postgres 🐘
🐬 The future of MySQL is Postgres 🐘
RTylerCroy
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
Delhi Call girls
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
Principled Technologies
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
Igalia
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
The Digital Insurer
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
Joaquim Jorge
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
Martijn de Jong
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
Delhi Call girls
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
UK Journal
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
HampshireHUG
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
Delhi Call girls
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your Business
Pixlogix Infotech
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
Khem
Recently uploaded
(20)
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
Slack Application Development 101 Slides
Slack Application Development 101 Slides
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
🐬 The future of MySQL is Postgres 🐘
🐬 The future of MySQL is Postgres 🐘
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your Business
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
Hadoop Application Architectures tutorial at Big DataService 2015
1.
1 © Cloudera,
Inc. All rights reserved. Architectural Considera8ons for Hadoop Applica8ons Mark Grover | @mark_grover Jonathan Seidman | @jseidman Gwen Shapira | @gwenshap IEEE BigDataService 2015 | March 30th, 2015 8ny.cloudera.com/app-‐arch-‐slides
2.
2 © Cloudera,
Inc. All rights reserved. About the book • @hadooparchbook • hadooparchitecturebook.com • github.com/hadooparchitecturebook • slideshare.com/hadooparchbook
3.
3 © Cloudera,
Inc. All rights reserved. About the presenters • Principal Solu8ons Architect at Cloudera • Previously, lead architect at FINRA • Contributor to Apache Hadoop, HBase, Flume, Avro, Pig and Spark • Senior Solu8ons Architect/ Partner Enablement at Cloudera • Previously, Technical Lead on the big data team at Orbitz Worldwide • Co-‐founder of the Chicago Hadoop User Group and Chicago Big Data Ted Malaska Jonathan Seidman
4.
4 © Cloudera,
Inc. All rights reserved. About the presenters • Solu8ons Architect turned So]ware Engineer at Cloudera • Commi^er on Apache Sqoop • Contributor to Apache Flume and Apache Kaaa • So]ware Engineer at Cloudera • Commi^er on Apache Bigtop, PMC member on Apache Sentry (incuba8ng) • Contributor to Apache Hadoop, Spark, Hive, Sqoop, Pig and Flume Gwen Shapira Mark Grover
5.
5 © Cloudera,
Inc. All rights reserved. Logis8cs • Break at 10:30-‐11:00 PM • Ques8ons at the end of each sec8on
6.
6 © Cloudera,
Inc. All rights reserved. Case Study Clickstream Analysis
7.
7 © Cloudera,
Inc. All rights reserved. Analy8cs
8.
8 © Cloudera,
Inc. All rights reserved. Analy8cs
9.
9 © Cloudera,
Inc. All rights reserved. Analy8cs
10.
10 © Cloudera,
Inc. All rights reserved. Analy8cs
11.
11 © Cloudera,
Inc. All rights reserved. Analy8cs
12.
12 © Cloudera,
Inc. All rights reserved. Analy8cs
13.
13 © Cloudera,
Inc. All rights reserved. Analy8cs
14.
14 © Cloudera,
Inc. All rights reserved. Web Logs – Combined Log Format 244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463 "http://bestcyclingreviews.com/top_online_shops" "Mozilla/ 5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36” 244.157.45.12 - - [17/Oct/2014:21:59:59 ] "GET /Store/cart.jsp? productID=1023 HTTP/1.0" 200 3757 "http://www.casualcyclist.com" "Mozilla/5.0 (Linux; U; Android 2.3.5; en-us; HTC Vision Build/ GRI40) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1”
15.
15 © Cloudera,
Inc. All rights reserved. Clickstream Analy8cs 244.157.45.12 - - [17/Oct/ 2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463 "http:// bestcyclingreviews.com/ top_online_shops" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/ 537.36”
16.
16 © Cloudera,
Inc. All rights reserved. Similar use-‐cases • Sensors – heart, agriculture, etc. • Casinos – session of a person at a table
17.
17 © Cloudera,
Inc. All rights reserved. Pre-‐Hadoop Architecture Clickstream Analysis
18.
18 © Cloudera,
Inc. All rights reserved. Click Stream Analysis (Before Hadoop) Web logs (full fidelity) (2 weeks) Data Warehouse Transform/ Aggregate Business Intelligence Tape Archive
19.
19 © Cloudera,
Inc. All rights reserved. Problems with Pre-‐Hadoop Architecture • Full fidelity data is stored for small amount of 8me (~weeks). • Older data is sent to tape, or even worse, deleted! • Inflexible workflow -‐ think of all aggrega8ons beforehand
20.
20 © Cloudera,
Inc. All rights reserved. Effects of Pre-‐Hadoop Architecture • Regenera8ng aggregates is expensive or worse, impossible • Can’t correct bugs in the workflow/aggrega8on logic • Can’t do experiments on exis8ng data
21.
21 © Cloudera,
Inc. All rights reserved. Why is Hadoop A Great Fit? Clickstream Analysis
22.
22 © Cloudera,
Inc. All rights reserved. Why is Hadoop a great fit? • Volume of clickstream data is huge • Velocity at which it comes in is high • Variety of data is diverse -‐ semi-‐structured data • Hadoop enables • ac8ve archival of data • Aggrega8on jobs • Querying the above aggregates or raw fidelity data
23.
23 © Cloudera,
Inc. All rights reserved. Click Stream Analysis (with Hadoop) Web logs Hadoop Business Intelligence Ac8ve archive (no tape) Aggrega8on engine Querying engine
24.
24 © Cloudera,
Inc. All rights reserved. Challenges of Hadoop Implementa8on
25.
25 © Cloudera,
Inc. All rights reserved. Challenges of Hadoop Implementa8on
26.
26 © Cloudera,
Inc. All rights reserved. Other challenges -‐ Architectural Considera8ons • Storage managers? • HDFS? HBase? • Data storage and modeling: • File formats? Compression? Schema design? • Data movement • How do we actually get the data into Hadoop? How do we get it out? • Metadata • How do we manage data about the data? • Data access and processing • How will the data be accessed once in Hadoop? How can we transform it? How do we query it? • Orchestra8on • How do we manage the workflow for all of this?
27.
27 © Cloudera,
Inc. All rights reserved. Case Study Requirements Overview of Requirements
28.
28 © Cloudera,
Inc. All rights reserved. Overview of Requirements Data Sources Inges8on Raw Data Storage (Formats, Schema) Processed Data Storage (Formats, Schema) Processing Data Consump8on Orchestra8on (Scheduling, Managing, Monitoring)
29.
29 © Cloudera,
Inc. All rights reserved. Case Study Requirements Data Inges8on
30.
30 © Cloudera,
Inc. All rights reserved. Data Inges8on Requirements Web Servers Web Servers Web Servers Web Servers Web Servers Logs 244.157.45.12 - - [17/ Oct/2014:21:08:30 ] "GET /seatposts HTTP/ 1.0" 200 4463 … CRM Data ODS Web Servers Web Servers Web Servers Web Servers Web Servers Logs Hadoop
31.
31 © Cloudera,
Inc. All rights reserved. Data Inges8on Requirements • So we need to be able to support: • Reliable inges8on of large volumes of semi-‐structured event data arriving with high velocity (e.g. logs). • Timeliness of data availability – data needs to be available for processing to meet business service level agreements. • Periodic inges8on of data from rela8onal data stores.
32.
32 © Cloudera,
Inc. All rights reserved. Case Study Requirements Data Storage
33.
33 © Cloudera,
Inc. All rights reserved. Data Storage Requirements Store all the data Make the data accessible for processing Compress the data
34.
34 © Cloudera,
Inc. All rights reserved. Case Study Requirements Data Processing
35.
35 © Cloudera,
Inc. All rights reserved. Processing requirements Be able to answer ques8ons like: • What is my website’s bounce rate? • i.e. how many % of visitors don’t go past the landing page? • Which marke8ng channels are leading to most sessions? • Do a^ribu8on analysis • Which channels are responsible for most conversions?
36.
36 © Cloudera,
Inc. All rights reserved. Sessioniza8on Website visit Visitor 1 Session 1 Visitor 1 Session 2 Visitor 2 Session 1 > 30 minutes
37.
37 © Cloudera,
Inc. All rights reserved. Case Study Requirements Orchestra8on
38.
38 © Cloudera,
Inc. All rights reserved. Orchestra8on is simple We just need to execute ac8ons One a]er another
39.
39 © Cloudera,
Inc. All rights reserved. Actually, we also need to handle errors And user no8fica8ons ….
40.
40 © Cloudera,
Inc. All rights reserved. And… • Re-‐start workflows a]er errors • Reuse of ac8ons in mul8ple workflows • Complex workflows with decision points • Trigger ac8ons based on events • Tracking metadata • Integra8on with enterprise so]ware • Data lifecycle • Data quality control • Reports
41.
41 © Cloudera,
Inc. All rights reserved. OK, maybe we need a product To help us do all that
42.
42 © Cloudera,
Inc. All rights reserved. Architectural Considera8ons Data Modeling
43.
43 © Cloudera,
Inc. All rights reserved. Data Modeling Considera8ons • We need to consider the following in our architecture: • Storage layer – HDFS? HBase? Etc. • File system schemas – how will we lay out the data? • File formats – what storage formats to use for our data, both raw and processed data? • Data compression formats?
44.
44 © Cloudera,
Inc. All rights reserved. Architectural Considera8ons Data Modeling – Storage Layer
45.
45 © Cloudera,
Inc. All rights reserved. Data Storage Layer Choices • Two likely choices for raw data:
46.
46 © Cloudera,
Inc. All rights reserved. Data Storage Layer Choices • Stores data directly as files • Fast scans • Poor random reads/writes • Stores data as Hfiles on HDFS • Slow scans • Fast random reads/writes
47.
47 © Cloudera,
Inc. All rights reserved. Data Storage – Storage Manager Considera8ons • Incoming raw data: • Processing requirements call for batch transforma8ons across mul8ple records – for example sessioniza8on. • Processed data: • Access to processed data will be via things like analy8cal queries – again requiring access to mul8ple records. • We choose HDFS • Processing needs in this case served be^er by fast scans.
48.
48 © Cloudera,
Inc. All rights reserved. Architectural Considera8ons Data Modeling – Raw Data Storage
49.
49 © Cloudera,
Inc. All rights reserved. Storage Formats – Raw Data and Processed Data Processed Data Raw Data
50.
50 © Cloudera,
Inc. All rights reserved. Data Storage – Format Considera8ons Logs (plain text)
51.
51 © Cloudera,
Inc. All rights reserved. Data Storage – Format Considera8ons Logs (plain text) Logs (plain text) Logs (plain text) Logs (plain text) Logs (plain text) Logs (plain text) Logs Logs Logs Logs Logs
52.
52 © Cloudera,
Inc. All rights reserved. Data Storage – Compression snappy Well, maybe. But not spli^able. X Spli^able. Gewng be^er… Hmmm…. Spli^able, but no...
53.
53 © Cloudera,
Inc. All rights reserved. Raw Data Storage – More About Snappy • Designed at Google to provide high compression speeds with reasonable compression. • Not the highest compression, but provides very good performance for processing on Hadoop. • Snappy is not spli^able though, which brings us to…
54.
54 © Cloudera,
Inc. All rights reserved. Hadoop File Types • Formats designed specifically to store and process data on Hadoop: • File based – SequenceFile • Serializa8on formats – Thri], Protocol Buffers, Avro • Columnar formats – RCFile, ORC, Parquet
55.
55 © Cloudera,
Inc. All rights reserved. SequenceFile • Stores records as binary key/value pairs. • SequenceFile “blocks” can be compressed. • This enables spli^ability with non-‐spli^able compression.
56.
56 © Cloudera,
Inc. All rights reserved. Avro • Kinda SequenceFile on Steroids. • Self-‐documen8ng – stores schema in header. • Provides very efficient storage. • Supports spli^able compression.
57.
57 © Cloudera,
Inc. All rights reserved. Our Format Recommenda8ons for Raw Data… • Avro with Snappy • Snappy provides op8mized compression. • Avro provides compact storage, self-‐documen8ng files, and supports schema evolu8on. • Avro also provides be^er failure handling than other choices. • SequenceFiles would also be a good choice, and are directly supported by inges8on tools in the ecosystem. • But only supports Java.
58.
58 © Cloudera,
Inc. All rights reserved. But Note… • For simplicity, we’ll use plain text for raw data in our example.
59.
59 © Cloudera,
Inc. All rights reserved. Architectural Considera8ons Data Modeling – Processed Data Storage
60.
60 © Cloudera,
Inc. All rights reserved. Storage Formats – Raw Data and Processed Data Processed Data Raw Data
61.
61 © Cloudera,
Inc. All rights reserved. Access to Processed Data Column A Column B Column C Column D Value Value Value Value Value Value Value Value Value Value Value Value Value Value Value Value Analy8cal Queries
62.
62 © Cloudera,
Inc. All rights reserved. Columnar Formats • Eliminates I/O for columns that are not part of a query. • Works well for queries that access a subset of columns. • O]en provide be^er compression. • These add up to drama8cally improved performance for many queries. 1 2014-‐10-‐13 abc 2 2014-‐10-‐14 def 3 2014-‐10-‐15 ghi 1 2 3 2014-‐10-‐13 2014-‐10-‐14 2014-‐10-‐15 abc def ghi
63.
63 © Cloudera,
Inc. All rights reserved. Columnar Choices – RCFile • Designed to provide efficient processing for Hive queries. • Only supports Java. • No Avro support. • Limited compression support. • Sub-‐op8mal performance compared to newer columnar formats.
64.
64 © Cloudera,
Inc. All rights reserved. Columnar Choices – ORC • A be^er RCFile. • Also designed to provide efficient processing of Hive queries. • Only supports Java.
65.
65 © Cloudera,
Inc. All rights reserved. Columnar Choices – Parquet • Designed to provide efficient processing across Hadoop programming interfaces – MapReduce, Hive, Impala, Pig. • Mul8ple language support – Java, C++ • Good object model support, including Avro. • Broad vendor support. • These features make Parquet a good choice for our processed data.
66.
66 © Cloudera,
Inc. All rights reserved. Architectural Considera8ons Data Modeling – Schema Design
67.
67 © Cloudera,
Inc. All rights reserved. HDFS Schema Design – One Recommenda8on /etl – Data in various stages of ETL workflow /data – processed data to be shared data with the en8re organiza8on /tmp – temp data from tools or shared between users /user/<username> -‐ User specific data, jars, conf files /app – Everything but data: UDF jars, HQL files, Oozie workflows
68.
68 © Cloudera,
Inc. All rights reserved. Par88oning • Split the dataset into smaller consumable chunks. • Rudimentary form of “indexing”. Reduces I/O needed to process queries.
69.
69 © Cloudera,
Inc. All rights reserved. Par88oning dataset col=val1/file.txt col=val2/file.txt … col=valn/file.txt dataset file1.txt file2.txt … filen.txt Un-‐par88oned HDFS directory structure Par88oned HDFS directory structure
70.
70 © Cloudera,
Inc. All rights reserved. Par88oning considera8ons • What column to par88on by? • Don’t have too many par88ons (<10,000) • Don’t have too many small files in the par88ons • Good to have par88on sizes at least ~1 GB, generally a mul8ple of block size. • We’ll par88on by 6mestamp. This applies to both our raw and processed data.
71.
71 © Cloudera,
Inc. All rights reserved. Par88oning For Our Case Study • Raw dataset: • /etl/BI/casualcyclist/clicks/rawlogs/year=2014/month=10/day=10! • Processed dataset: • /data/bikeshop/clickstream/year=2014/month=10/day=10!
72.
72 © Cloudera,
Inc. All rights reserved. Architectural Considera8ons Data Inges8on
73.
73 © Cloudera,
Inc. All rights reserved. • Omniture data on FTP • Apps • App Logs • RDBMS Typical Clickstream data sources
74.
74 © Cloudera,
Inc. All rights reserved. Gewng Files from FTP
75.
75 © Cloudera,
Inc. All rights reserved. curl ftp://myftpsite.com/sitecatalyst/ myreport_2014-10-05.tar.gz --user name:password | hdfs -put - /etl/clickstream/raw/ sitecatalyst/myreport_2014-10-05.tar.gz Don’t over-‐complicate things
76.
76 © Cloudera,
Inc. All rights reserved. Apache NiFi
77.
77 © Cloudera,
Inc. All rights reserved. Reliable, distributed and highly available systems That allow streaming events to Hadoop Event Streaming – Flume and Kaaa
78.
78 © Cloudera,
Inc. All rights reserved. • Many available data collec8on sources • Well integrated into Hadoop • Supports file transforma8ons • Can implement complex topologies • Very low latency • No programming required Flume:
79.
79 © Cloudera,
Inc. All rights reserved. “We just want to grab data from this directory and write it to HDFS” We use Flume when:
80.
80 © Cloudera,
Inc. All rights reserved. • Very high-‐throughput publish-‐subscribe messaging • Highly available • Stores data and can replay • Can support many consumers with no extra latency Kaaa is:
81.
81 © Cloudera,
Inc. All rights reserved. “Kaaa is awesome. We heard it cures cancer” Use Kaaa When:
82.
82 © Cloudera,
Inc. All rights reserved. • Use Flume with a Kaaa Source • Allows to get data from Kaaa, run some transforma8ons write to HDFS, HBase or Solr Actually, why choose?
83.
83 © Cloudera,
Inc. All rights reserved. • We want to ingest events from log files • Flume’s Spooling Directory source fits • With HDFS Sink • We would have used Kaaa if… • We wanted the data in non-‐Hadoop systems too In Our Example…
84.
84 © Cloudera,
Inc. All rights reserved. Sources Interceptors Selectors Channels Sinks Flume Agent Short Intro to Flume Twi^er, logs, JMS, webserver, Kaaa Mask, re-‐format, validate… DR, cri8cal Memory, file, Kaaa HDFS, HBase, Solr
85.
85 © Cloudera,
Inc. All rights reserved. Configura8on • Declara8ve • No coding required. • Configura8on specifies how components are wired together.
86.
86 © Cloudera,
Inc. All rights reserved. Interceptors • Mask fields • Validate informa8on against external source • Extract fields • Modify data format • Filter or split events
87.
87 © Cloudera,
Inc. All rights reserved. Any sufficiently complex configura8on Is indis8nguishable from code
88.
88 © Cloudera,
Inc. All rights reserved. A Brief Discussion of Flume Pa^erns – Fan-‐in • Flume agent runs on each of our servers. • These client agents send data to mul8ple agents to provide reliability. • Flume provides support for load balancing.
89.
89 © Cloudera,
Inc. All rights reserved. A Brief Discussion of Flume Pa^erns – Spliwng • Common need is to split data on ingest. • For example: • Sending data to mul8ple clusters for DR. • To mul8ple des8na8ons. • Flume also supports par88oning, which is key to our implementa8on.
90.
90 © Cloudera,
Inc. All rights reserved. Flume Agent Web Logs Spooling Dir Source Timestamp Interceptor File Channel Avro Sink Avro Sink Avro Sink Flume Architecture – Client Tier Web Server Flume Agent
91.
91 © Cloudera,
Inc. All rights reserved. Flume Architecture – Collector Tier Flume Agent Flume Agent HDFS Avro Source File Channel HDFS Sink HDFS Sink
92.
92 © Cloudera,
Inc. All rights reserved. • Add Kaaa producer to our webapp • Send clicks and searches as messages • Flume can ingest events from Kaaa • We can add a second consumer for real-‐8me processing in SparkStreaming • Another consumer for aler8ng… • And maybe a batch consumer too What if…. We were to use Kaaa?
93.
93 © Cloudera,
Inc. All rights reserved. Channels Sinks Flume Agent The Kaaa Channel Kafka HDFS, HBase, Solr Producer A Producer B Producer C Kaaa Producers
94.
94 © Cloudera,
Inc. All rights reserved. Sources Interceptors Channels Flume Agent The Kaaa Channel Twi^er, logs, JMS, webserver Mask, re-‐format, validate… Kafka Consumer A Kaaa Consumers Consumer B Consumer C
95.
95 © Cloudera,
Inc. All rights reserved. Sources Interceptors Selectors Channels Sinks Flume Agent The Kaaa Channel Twi^er, logs, JMS, webserver Mask, re-‐format, validate… DR, cri8cal Kafka HDFS, HBase, Solr
96.
96 © Cloudera,
Inc. All rights reserved. Architectural Considera8ons Data Processing – Engines tiny.cloudera.com/app-‐arch-‐slides
97.
97 © Cloudera,
Inc. All rights reserved. Processing Engines • MapReduce • Abstrac8ons • Spark • Spark Streaming • Impala
98.
98 © Cloudera,
Inc. All rights reserved. Data Processing Engines MapReduce and Abstrac8ons
99.
99 © Cloudera,
Inc. All rights reserved. MapReduce • Oldie but goody • Restric8ve Framework / Innovated Work Around • Extreme Batch
100.
100 © Cloudera,
Inc. All rights reserved. MapReduce Basic High Level Mapper HDFS (Replicated) Na8ve File System Block of Data Temp Spill Data Par88oned Sorted Data Reducer Reducer Local Copy Output File
101.
101 © Cloudera,
Inc. All rights reserved. MapReduce Innova8on • Mapper Memory Joins • Reducer Memory Joins • Buckets Sorted Joins • Cross Task Communica8on • Windowing • And Much More
102.
102 © Cloudera,
Inc. All rights reserved. Abstrac8ons • SQL • Hive • Script/Code • Pig: Pig La8n • Crunch: Java/Scala • Cascading: Java/Scala
103.
103 © Cloudera,
Inc. All rights reserved. Data Processing Engines Spark and Spark Streaming
104.
104 © Cloudera,
Inc. All rights reserved. Spark • The New Kid that isn’t that New Anymore • Easily 10x less code • Extremely Easy and Powerful API • Very good for machine learning • Scala, Java, and Python • RDDs • DAG Engine
105.
105 © Cloudera,
Inc. All rights reserved. Spark -‐ DAG
106.
106 © Cloudera,
Inc. All rights reserved. Spark -‐ DAG Filter KeyBy KeyBy TextFile TextFile Join Filter Take
107.
107 © Cloudera,
Inc. All rights reserved. Spark -‐ DAG Filter KeyBy KeyBy TextFile TextFile Join Filter Take Good Good Good Good Good Good Good-‐Replay Good-‐Replay Good-‐Replay Good Good-‐Replay Good Good-‐Replay Lost Block Replay Good-‐Replay Lost Block Good Future Future Future Future
108.
108 © Cloudera,
Inc. All rights reserved. Spark Streaming • Calling Spark in a Loop • Extends RDDs with DStream • Very Li^le Code Changes from ETL to Streaming
109.
109 © Cloudera,
Inc. All rights reserved. Spark Streaming Single Pass Source Receiver RDD Source Receiver RDD RDD Filter Count Print Source Receiver RDD RDD RDD Single Pass Filter Count Print Pre-‐first Batch First Batch Second Batch
110.
110 © Cloudera,
Inc. All rights reserved. Spark Streaming Single Pass Source Receiver RDD Source Receiver RDD RDD Filter Count Print Source Receiver RDD RDD RDD Single Pass Filter Count Pre-‐first Batch First Batch Second Batch Stateful RDD 1 Print Stateful RDD 2 Stateful RDD 1
111.
111 © Cloudera,
Inc. All rights reserved. Data Processing Engines Impala
112.
112 © Cloudera,
Inc. All rights reserved. Impala • MPP Style SQL Engine on top of Hadoop • Very Fast • High Concurrency • Analy8cal windowing func8ons (C5.2).
113.
113 © Cloudera,
Inc. All rights reserved. Impala – Broadcast Join Impala Daemon Smaller Table Data Block 100% Cached Smaller Table Smaller Table Data Block Impala Daemon 100% Cached Smaller Table Impala Daemon 100% Cached Smaller Table Impala Daemon Hash Join Func8on Bigger Table Data Block 100% Cached Smaller Table Output Impala Daemon Hash Join Func8on Bigger Table Data Block 100% Cached Smaller Table Output Impala Daemon Hash Join Func8on Bigger Table Data Block 100% Cached Smaller Table Output
114.
114 © Cloudera,
Inc. All rights reserved. Impala – Par88oned Hash Join Impala Daemon Smaller Table Data Block ~33% Cached Smaller Table Smaller Table Data Block Impala Daemon ~33% Cached Smaller Table Impala Daemon ~33% Cached Smaller Table Hash Par88oner Hash Par88oner Impala Daemon BiggerTable Data Block Impala Daemon Impala Daemon Hash Par88oner Hash Join Func8on 33% Cached Smaller Table Hash Join Func8on 33% Cached Smaller Table Hash Join Func8on 33% Cached Smaller Table Output Output Output BiggerTable Data Block Hash Par88oner BiggerTable Data Block Hash Par88oner
115.
115 © Cloudera,
Inc. All rights reserved. Impala vs Hive • Very different approaches and • We may see convergence at some point • But for now • Impala for speed • Hive for batch
116.
116 © Cloudera,
Inc. All rights reserved. Architectural Considera8ons Data Processing – Pa^erns and Recommenda8ons
117.
117 © Cloudera,
Inc. All rights reserved. What processing needs to happen? • Sessioniza8on • Filtering • Deduplica8on • BI / Discovery
118.
118 © Cloudera,
Inc. All rights reserved. Sessioniza8on Website visit Visitor 1 Session 1 Visitor 1 Session 2 Visitor 2 Session 1 > 30 minutes
119.
119 © Cloudera,
Inc. All rights reserved. Why sessionize? Helps answers ques8ons like: • What is my website’s bounce rate? • i.e. how many % of visitors don’t go past the landing page? • Which marke8ng channels (e.g. organic search, display ad, etc.) are leading to most sessions? • Which ones of those lead to most conversions (e.g. people buying things, signing up, etc.) • Do a^ribu8on analysis – which channels are responsible for most conversions?
120.
120 © Cloudera,
Inc. All rights reserved. Sessioniza8on 244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463 "http:// bestcyclingreviews.com/top_online_shops" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36” 244.157.45.12+1413580110 244.157.45.12 - - [17/Oct/2014:21:59:59 ] "GET /Store/cart.jsp?productID=1023 HTTP/ 1.0" 200 3757 "http://www.casualcyclist.com" "Mozilla/5.0 (Linux; U; Android 2.3.5; en-us; HTC Vision Build/GRI40) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1” 244.157.45.12+1413583199
121.
121 © Cloudera,
Inc. All rights reserved. How to Sessionize? 1. Given a list of clicks, determine which clicks came from the same user (Par88oning, ordering) 2. Given a par8cular user's clicks, determine if a given click is a part of a new session or a con8nua8on of the previous session (Iden8fying session boundaries)
122.
122 © Cloudera,
Inc. All rights reserved. #1 – Which clicks are from same user? • We can use: • IP address (244.157.45.12) • Cookies (A9A3BECE0563982D) • IP address (244.157.45.12)and user agent string ((KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36")
123.
123 © Cloudera,
Inc. All rights reserved. #1 – Which clicks are from same user? • We can use: • IP address (244.157.45.12) • Cookies (A9A3BECE0563982D) • IP address (244.157.45.12)and user agent string ((KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36")
124.
124 © Cloudera,
Inc. All rights reserved. #1 – Which clicks are from same user? 244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463 "http:// bestcyclingreviews.com/top_online_shops" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36” 244.157.45.12 - - [17/Oct/2014:21:59:59 ] "GET /Store/cart.jsp?productID=1023 HTTP/1.0" 200 3757 "http://www.casualcyclist.com" "Mozilla/5.0 (Linux; U; Android 2.3.5; en-us; HTC Vision Build/GRI40) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1”
125.
125 © Cloudera,
Inc. All rights reserved. #2 – Which clicks part of the same session? 244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463 "http:// bestcyclingreviews.com/top_online_shops" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36” 244.157.45.12 - - [17/Oct/2014:21:59:59 ] "GET /Store/cart.jsp?productID=1023 HTTP/1.0" 200 3757 "http://www.casualcyclist.com" "Mozilla/5.0 (Linux; U; Android 2.3.5; en-us; HTC Vision Build/GRI40) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1” > 30 mins apart = different sessions
126.
126 © Cloudera,
Inc. All rights reserved. Sessioniza8on engine recommenda8on • We have sessioniza8on code in MR and Spark on github. The complexity of the code varies, depends on the exper8se in the organiza8on. • We choose MR • MR API is stable and widely known • No Spark + Oozie (orchestra8on engine) integra8on currently
127.
127 © Cloudera,
Inc. All rights reserved. Filtering – filter out incomplete records 244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463 "http:// bestcyclingreviews.com/top_online_shops" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36” 244.157.45.12 - - [17/Oct/2014:21:59:59 ] "GET /Store/cart.jsp?productID=1023 HTTP/1.0" 200 3757 "http://www.casualcyclist.com" "Mozilla/5.0 (Linux; U…
128.
128 © Cloudera,
Inc. All rights reserved. Filtering – filter out records from bots/spiders 244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463 "http:// bestcyclingreviews.com/top_online_shops" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36” 209.85.238.11 - - [17/Oct/2014:21:59:59 ] "GET /Store/cart.jsp?productID=1023 HTTP/1.0" 200 3757 "http://www.casualcyclist.com" "Mozilla/5.0 (Linux; U; Android 2.3.5; en-us; HTC Vision Build/GRI40) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1” Google spider IP address
129.
129 © Cloudera,
Inc. All rights reserved. Filtering recommenda8on • Bot/Spider filtering can be done easily in any of the engines • Incomplete records are harder to filter in schema systems like Hive, Impala, Pig, etc. • Flume interceptors can also be used • Pre^y close choice between MR, Hive and Spark • Can be done in Spark using rdd.filter() • We can simply embed this in our MR sessioniza8on job
130.
130 © Cloudera,
Inc. All rights reserved. Deduplica8on – remove duplicate records 244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463 "http:// bestcyclingreviews.com/top_online_shops" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36” 244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463 "http:// bestcyclingreviews.com/top_online_shops" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36”
131.
131 © Cloudera,
Inc. All rights reserved. Deduplica8on recommenda8on • Can be done in all engines. • We already have a Hive table with all the columns, a simple DISTINCT query will perform deduplica8on • reduce() in spark • We use Pig
132.
132 © Cloudera,
Inc. All rights reserved. BI/Discovery engine recommenda8on • Main requirements for this are: • Low latency • SQL interface (e.g. JDBC/ODBC) • Users don’t know how to code • We chose Impala • It’s a SQL engine • Much faster than other engines • Provides standard JDBC/ODBC interfaces
133.
133 © Cloudera,
Inc. All rights reserved. End-‐to-‐end processing Sessioniza8on Filtering Deduplica8on BI tools
134.
134 © Cloudera,
Inc. All rights reserved. Architectural Considera8ons Orchestra8on
135.
135 © Cloudera,
Inc. All rights reserved. • Data arrives through Flume • Triggers a processing event: • Sessionize • Enrich – Loca8on, marke8ng channel… • Store as Parquet • Each day we process events from the previous day Orchestra8ng Clickstream
136.
136 © Cloudera,
Inc. All rights reserved. • Workflow is fairly simple • Need to trigger workflow based on data • Be able to recover from errors • Perhaps no8fy on the status • And collect metrics for repor8ng Choosing Right
137.
137 © Cloudera,
Inc. All rights reserved. Oozie or Azkaban?
138.
138 © Cloudera,
Inc. All rights reserved. Oozie Architecture
139.
139 © Cloudera,
Inc. All rights reserved. • Part of all major Hadoop distribu8ons • Hue integra8on • Built -‐in ac8ons – Hive, Sqoop, MapReduce, SSH • Complex workflows with decisions • Event and 8me based scheduling • No8fica8ons • SLA Monitoring • REST API Oozie features
140.
140 © Cloudera,
Inc. All rights reserved. • Overhead in launching jobs • Steep learning curve • XML Workflows Oozie Drawbacks
141.
141 © Cloudera,
Inc. All rights reserved. Azkaban Architecture Azkaban Executor Server Azkaban Web Server HDFS viewer plugin Job types plugin MySQL Client Hadoop
142.
142 © Cloudera,
Inc. All rights reserved. • Simplicity • Great UI – including pluggable visualizers • Lots of plugins – Hive, Pig… • Repor8ng plugin Azkaban features
143.
143 © Cloudera,
Inc. All rights reserved. • Doesn’t support workflow decisions • Can’t represent data dependency Azkaban Limita8ons
144.
144 © Cloudera,
Inc. All rights reserved. • Workflow is fairly simple • Need to trigger workflow based on data • Be able to recover from errors • Perhaps no8fy on the status • And collect metrics for repor8ng Choosing… Easier in Oozie
145.
145 © Cloudera,
Inc. All rights reserved. • Workflow is fairly simple • Need to trigger workflow based on data • Be able to recover from errors • Perhaps no8fy on the status • And collect metrics for repor8ng Choosing the right Orchestra8on Tool Be^er in Azkaban
146.
146 © Cloudera,
Inc. All rights reserved. The best orchestra8on tool is the one you are an expert on Important Decision Considera8on!
147.
147 © Cloudera,
Inc. All rights reserved. Orchestra8on Pa^erns – Fan Out
148.
148 © Cloudera,
Inc. All rights reserved. Capture & Decide Pa^ern
149.
149 © Cloudera,
Inc. All rights reserved. Puwng It All Together Final Architecture
150.
150 © Cloudera,
Inc. All rights reserved. Final Architecture – High Level Overview Data Sources Inges8on Raw Data Storage (Formats, Schema) Processed Data Storage (Formats, Schema) Processing Data Consump8on Orchestra8on (Scheduling, Managing, Monitoring)
151.
151 © Cloudera,
Inc. All rights reserved. Final Architecture – High Level Overview Data Sources Inges8on Raw Data Storage (Formats, Schema) Processed Data Storage (Formats, Schema) Processing Data Consump8on Orchestra8on (Scheduling, Managing, Monitoring)
152.
152 © Cloudera,
Inc. All rights reserved. Final Architecture – Inges8on/Storage Web Server Flume Agent Web Server Flume Agent Web Server Flume Agent Web Server Flume Agent Web Server Flume Agent Web Server Flume Agent Web Server Flume Agent Web Server Flume Agent Flume Agent Flume Agent Flume Agent Flume Agent Fan-‐in Pa^ern Mul8 Agents for Failover and rolling restarts HDFS /etl/BI/casualcyclist/ clicks/rawlogs/ year=2014/month=10/ day=10!
153.
153 © Cloudera,
Inc. All rights reserved. Final Architecture – High Level Overview Data Sources Inges8on Raw Data Storage (Formats, Schema) Processed Data Storage (Formats, Schema) Processing Data Consump8on Orchestra8on (Scheduling, Managing, Monitoring)
154.
154 © Cloudera,
Inc. All rights reserved. Final Architecture – Processing and Storage /etl/BI/ casualcyclist/clicks/ rawlogs/year=2014/ month=10/day=10 … dedup-‐>filtering-‐ >sessioniza8on /data/bikeshop/ clickstream/ year=2014/ month=10/day=10 … parque8ze
155.
155 © Cloudera,
Inc. All rights reserved. Final Architecture – High Level Overview Data Sources Inges8on Raw Data Storage (Formats, Schema) Processed Data Storage (Formats, Schema) Processing Data Consump8on Orchestra8on (Scheduling, Managing, Monitoring)
156.
156 © Cloudera,
Inc. All rights reserved. Final Architecture – Data Access Hive/ Impala BI/Analy8cs Tools DWH Sqoop Local Disk R, etc. DB import tool JDBC/ODBC
157.
157 © Cloudera,
Inc. All rights reserved. Demo
158.
158 © Cloudera,
Inc. All rights reserved. Join the Discussion Get community help or provide feedback cloudera.com/community
159.
159 © Cloudera,
Inc. All rights reserved. Try Cloudera Live today—cloudera.com/live
160.
160 © Cloudera,
Inc. All rights reserved. Thank you hadooparchitecturebook.com 8ny.cloudera.com/app-‐arch-‐slides
Download now