Application Architectures with Hadoop
Hadoop Users Group UK – November, 2014
slideshare.com/hadooparchbook
Mark Grover | @mark_grover
Ted Malaska | @TedMalaska
About the book
•  @hadooparchbook
•  hadooparchitecturebook.com
•  github.com/hadooparchitecturebook
•  slideshare.com/hadooparchbook
©2014 Cloudera, Inc. All Rights Reserved.
About Us
•  Mark
–  Software Engineer
–  Committer on Apache Bigtop, committer and PPMC member on Apache
Sentry (incubating).
–  Contributor to Hadoop, Hive, Spark, Sqoop, Flume.
•  Ted
–  Principal Solutions Architect
–  Previously Lead Architect at FINRA
–  Contributor to Apache Hadoop, HBase, Spark, Flume, Avro and Pig
Case Study
Clickstream Analysis
Analytics
Web Logs – Combined Log Format
244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463 "http://bestcyclingreviews.com/top_online_shops" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36"
244.157.45.12 - - [17/Oct/2014:21:59:59 ] "GET /Store/cart.jsp?productID=1023 HTTP/1.0" 200 3757 "http://www.casualcyclist.com" "Mozilla/5.0 (Linux; U; Android 2.3.5; en-us; HTC Vision Build/GRI40) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1"
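A minimal sketch of how log lines like the ones above could be split into fields. This is an illustration in Python, not code from the talk. The field names and the `parse_line` helper are assumptions, and the timezone offset is treated as optional because the sample lines omit it.

```python
import re
from datetime import datetime

# Regex for the Combined Log Format fields. The timezone offset is optional
# here because the sample lines on the slide do not include one.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]\s]+)\s*(?P<tz>[+-]\d{4})?\s*\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<user_agent>[^"]*)"'
)

def parse_line(line):
    """Return a dict of fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    # %b expects English month abbreviations (the default C locale).
    record["timestamp"] = datetime.strptime(record["time"], "%d/%b/%Y:%H:%M:%S")
    record["status"] = int(record["status"])
    return record

sample = ('244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" '
          '200 4463 "http://bestcyclingreviews.com/top_online_shops" '
          '"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 '
          '(KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36"')

record = parse_line(sample)
print(record["ip"], record["path"], record["status"])
```

Returning None for lines that do not match makes it easy to drop incomplete records later in the pipeline.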
Clickstream Analytics
244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463 "http://bestcyclingreviews.com/top_online_shops" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36"
Challenges of Hadoop Implementation
Hadoop Architectural Considerations
•  Storage managers?
–  HDFS? HBase?
•  Data storage and modeling:
–  File formats? Compression? Schema design?
•  Data movement
–  How do we actually get the data into Hadoop? How do we get it out?
•  Metadata
–  How do we manage data about the data?
•  Data access and processing
–  How will the data be accessed once in Hadoop? How can we transform it? How do
we query it?
•  Orchestration
–  How do we manage the workflow for all of this?
Architectural
Considerations
Data Storage and Modeling
Data Modeling Considerations
•  We need to consider the following in our architecture:
–  Storage layer – HDFS? HBase? Etc.
–  File system schemas – how will we lay out the data?
–  File formats – what storage formats to use for our data, both raw and
processed data?
–  Data compression formats?
Architectural
Considerations
Data Modeling – Storage Layer
Data Storage Layer Choices
•  Two likely choices for raw data:
•  HDFS
–  Stores data directly as files
–  Fast scans
–  Poor random reads/writes
•  HBase
–  Stores data as HFiles on HDFS
–  Slow scans
–  Fast random reads/writes
Data Storage – Storage Manager Considerations
•  Incoming raw data:
–  Processing requirements call for batch transformations across multiple
records – for example sessionization.
•  Processed data:
–  Access to processed data will be via things like analytical queries – again
requiring access to multiple records.
•  We choose HDFS
–  Processing needs in this case served better by fast scans.
Architectural
Considerations
Data Modeling – Data Storage Format
Our Format Choices…
•  Raw data
–  Avro with Snappy
•  Processed data
–  Parquet
Architectural
Considerations
Data Modeling – HDFS Schema Design
Recommended HDFS Schema Design
•  How to lay out data on HDFS?
/user/<username> - User specific data, jars, conf files
/etl – Data in various stages of ETL workflow
/tmp – temp data from tools or shared between users
/data – shared data for the entire organization
/app – Everything but data: UDF jars, HQL files, Oozie workflows
Architectural
Considerations
Data Modeling – Advanced HDFS Schema
Design
Partitioning
Un-partitioned HDFS directory structure:
dataset
  file1.txt
  file2.txt
  …
  filen.txt
Partitioned HDFS directory structure:
dataset
  col=val1/file.txt
  col=val2/file.txt
  …
  col=valn/file.txt
Partitioning considerations
•  What column to partition by?
–  Don’t have too many partitions (<10,000)
–  Don’t have too many small files in the partitions
–  Good to have partition sizes at least ~1 GB
•  We’ll partition by timestamp. This applies to both our raw and
processed data.
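As a rough illustration of partitioning by timestamp, the sketch below builds a partition directory in the col=val style shown earlier. The base directory `/data/clickstream`, the year/month/day granularity, and the `partition_path` helper are all assumptions for the example. The right granularity depends on data volume, given the guidance above on partition count and size.

```python
from datetime import datetime

def partition_path(base_dir, ts):
    """Build a col=val style partition directory from a timestamp."""
    return "{}/year={:04d}/month={:02d}/day={:02d}".format(
        base_dir, ts.year, ts.month, ts.day)

# Hypothetical base directory; prints /data/clickstream/year=2014/month=10/day=17
print(partition_path("/data/clickstream", datetime(2014, 10, 17, 21, 8, 30)))
```

Daily partitions over ten years give roughly 3,650 directories, which stays under the 10,000-partition guideline. Hourly partitions over the same period would not.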
Architectural
Considerations
Data Ingestion
File Transfers
•  "hadoop fs -put <file>"
•  Reliable, but not resilient to failure.
•  Other options are mountable HDFS, for example NFSv3.
Streaming Ingestion
•  Flume
–  Reliable, distributed, and available system for efficient collection, aggregation
and movement of streaming data, e.g. logs.
•  Kafka
–  Reliable and distributed publish-subscribe messaging system.
Flume vs. Kafka
•  Flume
–  Purpose-built for Hadoop data ingest.
–  Pre-built sinks for HDFS, HBase, etc.
–  Supports transformation of data in-flight.
•  Kafka
–  General pub-sub messaging framework.
–  Just a message transport.
–  Have to use a third-party tool to ingest.
Flume and Kafka
•  Kafka Source
•  Kafka Channel
Short Intro to Flume
Flume Agent: Sources, Interceptors, Selectors, Channels, Sinks
•  Sources: Twitter, logs, JMS, webserver
•  Interceptors: Mask, re-format, validate…
•  Selectors: DR, critical
•  Channels: Memory, file, Kafka
•  Sinks: HDFS, HBase, Solr
A Brief Discussion of Flume Patterns – Fan-in
•  Flume agent runs on
each of our servers.
•  These agents send
data to multiple agents
to provide reliability.
•  Flume provides support
for load balancing.
Ingestion Decisions
•  Historical Data
–  File transfer
•  Incoming Data
–  Flume with the spooling directory source.
•  Relational Data Sources – ODS, CRM, etc.
–  Sqoop
Architectural
Considerations
Data Processing – Engines
Processing Engines
•  MapReduce
•  Abstractions – Pig, Hive, Cascading, Crunch
•  Spark
•  Impala
MapReduce
•  Oldie but goody
•  Restrictive Framework / Innovated Work Around
•  Extreme Batch
MapReduce Basic High Level
[Diagram labels: Mapper; HDFS (Replicated); Native File System; Block of Data; Temp Spill Data; Partitioned Sorted Data; Reducer; Local Copy; Output File]
Abstractions
•  SQL
–  Hive
•  Script/Code
–  Pig: Pig Latin
–  Crunch: Java/Scala
–  Cascading: Java/Scala
Spark
•  The New Kid that isn’t that New Anymore
•  Easily 10x less code
•  Extremely Easy and Powerful API
•  Very good for machine learning
•  Scala, Java, and Python
•  RDDs
•  DAG Engine
Impala
• Real-time open source MPP style engine for Hadoop
• Doesn’t build on MapReduce
• Written in C++, uses LLVM for run-time code generation
• Can create tables over HDFS or HBase data
• Accesses Hive metastore for metadata
• Access available via JDBC/ODBC
Architectural
Considerations
Data Processing – What processing needs to
happen?
What processing needs to happen?
•  Sessionization
•  Filtering
•  Deduplication
•  BI / Discovery
Sessionization
[Diagram: website visits split into sessions (Visitor 1: Session 1 and Session 2; Visitor 2: Session 1), with a gap of > 30 minutes marked between sessions]
Why sessionize?
Helps answer questions like:
•  What is my website’s bounce rate?
–  i.e. what percentage of visitors don’t go past the landing page?
•  Which marketing channels (e.g. organic search, display ad, etc.) lead to the most sessions?
–  Which of those lead to the most conversions (e.g. people buying things, signing up, etc.)?
•  Do attribution analysis – which channels are responsible for the most conversions?
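Once every click carries a session ID, a question like bounce rate becomes a simple aggregation. The sketch below is illustrative only and assumes a bounce is a session with exactly one click. The `bounce_rate` helper and the session IDs are hypothetical.

```python
from collections import Counter

def bounce_rate(session_ids):
    """session_ids: one session ID per click.

    Returns the fraction of sessions that contain exactly one click.
    """
    clicks_per_session = Counter(session_ids)
    if not clicks_per_session:
        return 0.0
    bounces = sum(1 for count in clicks_per_session.values() if count == 1)
    return bounces / len(clicks_per_session)

# Session "a" has two clicks; "b" and "c" have one each, so 2 of 3 bounce.
print(bounce_rate(["a", "a", "b", "c"]))
```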
How to Sessionize?
1.  Given a list of clicks, determine which clicks
came from the same user
2.  Given a particular user's clicks, determine if a
given click is a part of a new session or a
continuation of the previous session
#1 – Which clicks are from the same user?
•  We can use:
–  IP address (244.157.45.12)
–  Cookies (A9A3BECE0563982D)
–  IP address (244.157.45.12) and user agent string ((KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36)
244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463 "http://bestcyclingreviews.com/top_online_shops" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36"
244.157.45.12 - - [17/Oct/2014:21:59:59 ] "GET /Store/cart.jsp?productID=1023 HTTP/1.0" 200 3757 "http://www.casualcyclist.com" "Mozilla/5.0 (Linux; U; Android 2.3.5; en-us; HTC Vision Build/GRI40) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1"
#2 – Which clicks are part of the same session?
244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463 "http://bestcyclingreviews.com/top_online_shops" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36"
244.157.45.12 - - [17/Oct/2014:21:59:59 ] "GET /Store/cart.jsp?productID=1023 HTTP/1.0" 200 3757 "http://www.casualcyclist.com" "Mozilla/5.0 (Linux; U; Android 2.3.5; en-us; HTC Vision Build/GRI40) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1"
> 30 mins apart = different sessions
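The two sessionization steps can be sketched in a few lines of single-machine Python. This is not the MR or Spark code referenced in the talk. It assumes a user is identified by IP address plus user agent string and that a gap of more than 30 minutes starts a new session. The `sessionize` helper and the session ID format are made up for the example.

```python
from collections import defaultdict
from datetime import datetime, timedelta

SESSION_TIMEOUT = timedelta(minutes=30)

def sessionize(clicks):
    """clicks: iterable of (ip, user_agent, timestamp) tuples.

    Returns a list of (ip, user_agent, timestamp, session_id) tuples.
    """
    # Step 1: group clicks by user, identified here by IP + user agent.
    by_user = defaultdict(list)
    for ip, user_agent, ts in clicks:
        by_user[(ip, user_agent)].append(ts)

    # Step 2: walk each user's clicks in time order and start a new
    # session whenever the gap to the previous click exceeds the timeout.
    result = []
    for (ip, user_agent), timestamps in by_user.items():
        timestamps.sort()
        session_number = 0
        previous = None
        for ts in timestamps:
            if previous is None or ts - previous > SESSION_TIMEOUT:
                session_number += 1
            # Hypothetical session ID format: ip|user_agent|counter.
            session_id = "{}|{}|{}".format(ip, user_agent, session_number)
            result.append((ip, user_agent, ts, session_id))
            previous = ts
    return result

clicks = [
    ("244.157.45.12", "Chrome", datetime(2014, 10, 17, 21, 8, 30)),
    ("244.157.45.12", "Chrome", datetime(2014, 10, 17, 21, 20, 0)),
    ("244.157.45.12", "Chrome", datetime(2014, 10, 17, 21, 59, 59)),
]
for row in sessionize(clicks):
    print(row[2], row[3])
```

In a distributed engine, step 1 corresponds to grouping by the user key and step 2 to a sorted pass over each group.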
Sessionization engine recommendation
•  We have sessionization code in both MR and Spark on GitHub. The complexity of the code varies, and the right choice depends on the expertise in the organization.
•  We choose MR, since it’s fairly simple and maintainable code.
Filtering – filter out incomplete records
244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463 "http://bestcyclingreviews.com/top_online_shops" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36"
244.157.45.12 - - [17/Oct/2014:21:59:59 ] "GET /Store/cart.jsp?productID=1023 HTTP/1.0" 200 3757 "http://www.casualcyclist.com" "Mozilla/5.0 (Linux; U…
Filtering – filter out records from bots/spiders
244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463 "http://bestcyclingreviews.com/top_online_shops" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36"
209.85.238.11 - - [17/Oct/2014:21:59:59 ] "GET /Store/cart.jsp?productID=1023 HTTP/1.0" 200 3757 "http://www.casualcyclist.com" "Mozilla/5.0 (Linux; U; Android 2.3.5; en-us; HTC Vision Build/GRI40) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1"
209.85.238.11: Google spider IP address
Filtering recommendation
•  Bot/Spider filtering can be done easily in any of the engines
•  Incomplete records are harder to filter in schema systems like
Hive, Impala, Pig, etc.
•  Pretty close choice between MR, Hive and Spark
•  Can be done in Flume interceptors as well
•  We can simply embed this in our sessionization job
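A hedged sketch of both filters as plain predicates that could be embedded in a sessionization job. It assumes each record is a dict with the field names below (an assumption, not the talk's schema), uses the spider IP from the example above as a one-entry blocklist, and adds a user-agent keyword check purely as an illustrative heuristic.

```python
# Assumed field names for a parsed log record.
REQUIRED_FIELDS = ("ip", "timestamp", "path", "user_agent")

# Spider IP taken from the example above; a real list would be much longer.
BOT_IPS = {"209.85.238.11"}

# Assumption: a simple keyword heuristic on the user agent string.
BOT_AGENT_MARKERS = ("bot", "spider", "crawler")

def is_complete(record):
    """True if every required field is present and non-empty."""
    return all(record.get(field) for field in REQUIRED_FIELDS)

def is_bot(record):
    """True if the IP is blocklisted or the user agent looks like a bot."""
    agent = (record.get("user_agent") or "").lower()
    return record.get("ip") in BOT_IPS or any(m in agent for m in BOT_AGENT_MARKERS)

def keep(record):
    """Keep only complete records that do not come from bots/spiders."""
    return is_complete(record) and not is_bot(record)

records = [
    {"ip": "244.157.45.12", "timestamp": "17/Oct/2014:21:08:30",
     "path": "/seatposts", "user_agent": "Mozilla/5.0 (Macintosh)"},
    {"ip": "209.85.238.11", "timestamp": "17/Oct/2014:21:59:59",
     "path": "/Store/cart.jsp?productID=1023", "user_agent": "Mozilla/5.0 (Linux)"},
    {"ip": "244.157.45.12", "timestamp": "17/Oct/2014:21:59:59",
     "path": "/Store/cart.jsp?productID=1023", "user_agent": None},
]
print([r["path"] for r in records if keep(r)])
```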
Deduplication – remove duplicate records
244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463 "http://bestcyclingreviews.com/top_online_shops" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36"
244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463 "http://bestcyclingreviews.com/top_online_shops" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36"
Deduplication recommendation
•  Can be done in all engines.
•  We already have a Hive table with all the columns, so a simple DISTINCT query will perform deduplication.
•  We use Pig
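For intuition only, the effect of a DISTINCT over whole records can be sketched on a single machine as below. This is not the Pig job mentioned above, and the `deduplicate` helper is hypothetical. A distributed engine would instead group identical records across the cluster.

```python
def deduplicate(lines):
    """Yield each distinct line once, preserving first-seen order."""
    seen = set()
    for line in lines:
        if line not in seen:
            seen.add(line)
            yield line

log_lines = [
    '244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463',
    '244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463',
    '244.157.45.12 - - [17/Oct/2014:21:59:59 ] "GET /Store/cart.jsp?productID=1023 HTTP/1.0" 200 3757',
]
print(len(list(deduplicate(log_lines))))
```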
BI/Discovery engine recommendation
•  Main requirements for this are:
–  Low latency
–  SQL interface (e.g. JDBC/ODBC)
–  Users don’t know how to code
•  We chose Impala
–  It’s a SQL engine
–  Much faster than other engines
–  Provides standard JDBC/ODBC interfaces
Architectural
Considerations
Orchestration
Choosing the Right Orchestration Tool
•  Workflow is fairly simple
•  Need to trigger workflow based on data
•  Be able to recover from errors
•  Perhaps notify on the status
•  And collect metrics for reporting
Easier in Oozie
Better in Azkaban
Important Decision Consideration!
•  The best orchestration tool is the one you are an expert on
–  Oozie
–  Spark Streaming, etc. don’t require an orchestration tool
Putting It All
Together
Final Architecture
Final architecture
1. Ingestion
–  Web logs (website users, web servers) via Flume
–  Operational Data Store and CRM System via Sqoop
2. Processing
–  Hadoop Cluster
3. Accessing
–  BI/Visualization tool (e.g. MicroStrategy) for BI analysts
–  Spark for machine learning and graph processing
–  R/Python for statistical analysis
–  Custom apps
4. Orchestration
Thank you

Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3
 
Hadoop project design and a usecase
Hadoop project design and  a usecaseHadoop project design and  a usecase
Hadoop project design and a usecase
 
Intro to Hadoop Presentation at Carnegie Mellon - Silicon Valley
Intro to Hadoop Presentation at Carnegie Mellon - Silicon ValleyIntro to Hadoop Presentation at Carnegie Mellon - Silicon Valley
Intro to Hadoop Presentation at Carnegie Mellon - Silicon Valley
 
Hadoop operations-2014-strata-new-york-v5
Hadoop operations-2014-strata-new-york-v5Hadoop operations-2014-strata-new-york-v5
Hadoop operations-2014-strata-new-york-v5
 
Hadoop Essentials -- The What, Why and How to Meet Agency Objectives
Hadoop Essentials -- The What, Why and How to Meet Agency ObjectivesHadoop Essentials -- The What, Why and How to Meet Agency Objectives
Hadoop Essentials -- The What, Why and How to Meet Agency Objectives
 
Apache hadoop: POSH Meetup Palo Alto, CA April 2014
Apache hadoop: POSH Meetup Palo Alto, CA April 2014Apache hadoop: POSH Meetup Palo Alto, CA April 2014
Apache hadoop: POSH Meetup Palo Alto, CA April 2014
 
Improvements in Hadoop Security
Improvements in Hadoop SecurityImprovements in Hadoop Security
Improvements in Hadoop Security
 
HDFS- What is New and Future
HDFS- What is New and FutureHDFS- What is New and Future
HDFS- What is New and Future
 
Building data pipelines with kite
Building data pipelines with kiteBuilding data pipelines with kite
Building data pipelines with kite
 
Bi with apache hadoop(en)
Bi with apache hadoop(en)Bi with apache hadoop(en)
Bi with apache hadoop(en)
 
Meta scale kognitio hadoop webinar
Meta scale kognitio hadoop webinarMeta scale kognitio hadoop webinar
Meta scale kognitio hadoop webinar
 
Hitachi Data Systems Hadoop Solution
Hitachi Data Systems Hadoop SolutionHitachi Data Systems Hadoop Solution
Hitachi Data Systems Hadoop Solution
 
Big data Hadoop
Big data  Hadoop   Big data  Hadoop
Big data Hadoop
 
List of Engineering Colleges in Uttarakhand
List of Engineering Colleges in UttarakhandList of Engineering Colleges in Uttarakhand
List of Engineering Colleges in Uttarakhand
 
Hadoop.pptx
Hadoop.pptxHadoop.pptx
Hadoop.pptx
 
Hadoop.pptx
Hadoop.pptxHadoop.pptx
Hadoop.pptx
 
Backup and Disaster Recovery in Hadoop
Backup and Disaster Recovery in HadoopBackup and Disaster Recovery in Hadoop
Backup and Disaster Recovery in Hadoop
 

Application Architectures with Hadoop - UK Hadoop User Group

  • 1. Application Architectures with Hadoop Hadoop Users Group UK – November, 2014 slideshare.com/hadooparchbook Mark Grover | @mark_grover Ted Malaska | @TedMalaska
  • 2. About the book •  @hadooparchbook •  hadooparchitecturebook.com •  github.com/hadooparchitecturebook •  slideshare.com/hadooparchbook ©2014 Cloudera, Inc. All Rights Reserved.
  • 3. About Us •  Mark –  Software Engineer –  Committer on Apache Bigtop, committer and PPMC member on Apache Sentry (incubating) –  Contributor to Hadoop, Hive, Spark, Sqoop, Flume •  Ted –  Principal Solutions Architect –  Previously Lead Architect at FINRA –  Contributor to Apache Hadoop, HBase, Spark, Flume, Avro and Pig
  • 5. Analytics
  • 6. Analytics
  • 7. Web Logs – Combined Log Format 244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463 "http://bestcyclingreviews.com/top_online_shops" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36" 244.157.45.12 - - [17/Oct/2014:21:59:59 ] "GET /Store/cart.jsp?productID=1023 HTTP/1.0" 200 3757 "http://www.casualcyclist.com" "Mozilla/5.0 (Linux; U; Android 2.3.5; en-us; HTC Vision Build/GRI40) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1"
  • 8. Clickstream Analytics 244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463 "http://bestcyclingreviews.com/top_online_shops" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36"
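A minimal sketch of how a Combined Log Format line like the sample above could be split into named fields. The field names, the `parse_line` helper, and the regular expression are illustrative assumptions, not code from the talk; the sample on the slide omits the timezone offset, so only the date and time are parsed.

```python
import re
from datetime import datetime

# Combined Log Format:
# host ident authuser [timestamp] "request" status bytes "referrer" "user-agent"
# Field names below are hypothetical labels chosen for this sketch.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<timestamp>[^\]]+)\] '
    r'"(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<user_agent>[^"]*)"'
)


def parse_line(line):
    """Return a dict of fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    # Keep only "dd/Mon/yyyy:HH:MM:SS" (20 characters); any timezone
    # offset that follows is ignored in this sketch.
    record["timestamp"] = datetime.strptime(
        record["timestamp"].strip()[:20], "%d/%b/%Y:%H:%M:%S"
    )
    record["status"] = int(record["status"])
    return record


SAMPLE = (
    '244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" '
    '200 4463 "http://bestcyclingreviews.com/top_online_shops" '
    '"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 '
    '(KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36"'
)
print(parse_line(SAMPLE)["request"])
```

Lines that do not match the pattern come back as `None`, which gives later stages a simple way to spot incomplete records.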
  • 9. Challenges of Hadoop Implementation
  • 10. Challenges of Hadoop Implementation
  • 11. Hadoop Architectural Considerations •  Storage managers? –  HDFS? HBase? •  Data storage and modeling: –  File formats? Compression? Schema design? •  Data movement –  How do we actually get the data into Hadoop? How do we get it out? •  Metadata –  How do we manage data about the data? •  Data access and processing –  How will the data be accessed once in Hadoop? How can we transform it? How do we query it? •  Orchestration –  How do we manage the workflow for all of this?
  • 13. Data Modeling Considerations •  We need to consider the following in our architecture: –  Storage layer – HDFS? HBase? Etc. –  File system schemas – how will we lay out the data? –  File formats – what storage formats to use for our data, both raw and processed data? –  Data compression formats?
  • 15. Data Storage Layer Choices •  Two likely choices for raw data:
  • 16. Data Storage Layer Choices •  HDFS: –  Stores data directly as files –  Fast scans –  Poor random reads/writes •  HBase: –  Stores data as HFiles on HDFS –  Slow scans –  Fast random reads/writes
  • 17. Data Storage – Storage Manager Considerations •  Incoming raw data: –  Processing requirements call for batch transformations across multiple records – for example sessionization. •  Processed data: –  Access to processed data will be via things like analytical queries – again requiring access to multiple records. •  We choose HDFS –  Processing needs in this case served better by fast scans.
  • 19. Our Format Choices… •  Raw data –  Avro with Snappy •  Processed data –  Parquet
  • 21. Recommended HDFS Schema Design •  How to lay out data on HDFS?
  • 22. Recommended HDFS Schema Design •  /user/<username> – User specific data, jars, conf files •  /etl – Data in various stages of ETL workflow •  /tmp – Temp data from tools or shared between users •  /data – Shared data for the entire organization •  /app – Everything but data: UDF jars, HQL files, Oozie workflows
  • 24. Partitioning •  Un-partitioned HDFS directory structure: dataset/file1.txt, dataset/file2.txt, …, dataset/filen.txt •  Partitioned HDFS directory structure: dataset/col=val1/file.txt, dataset/col=val2/file.txt, …, dataset/col=valn/file.txt
  • 25. Partitioning considerations •  What column to partition by? –  Don’t have too many partitions (<10,000) –  Don’t have too many small files in the partitions –  Good to have partition sizes at least ~1 GB •  We’ll partition by timestamp. This applies to both our raw and processed data.
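As a rough illustration of partitioning by timestamp, the sketch below derives a `col=val` style directory for a record. The `year`/`month`/`day` partition columns, the `clickstream` dataset name, and the `partition_path` helper are assumptions made for this example; the slides only say that the data is partitioned by timestamp.

```python
from datetime import datetime


def partition_path(base_dir, dataset, ts):
    """Build a partitioned directory path in the col=val layout.

    Daily granularity is an assumption; pick whatever granularity
    keeps partitions large and their total count manageable.
    """
    return "{}/{}/year={:04d}/month={:02d}/day={:02d}".format(
        base_dir, dataset, ts.year, ts.month, ts.day
    )


# Hypothetical dataset name under the shared /data directory.
print(partition_path("/data", "clickstream", datetime(2014, 10, 17, 21, 8, 30)))
```

Daily partitions produce a few hundred directories per year, which stays well under the partition-count guideline above as long as each day holds enough data to avoid small files.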
  • 27. File Transfers •  "hadoop fs -put <file>" •  Reliable, but not resilient to failure. •  Other options are mountable HDFS, for example NFSv3.
  • 28. Streaming Ingestion •  Flume –  Reliable, distributed, and available system for efficient collection, aggregation and movement of streaming data, e.g. logs. •  Kafka –  Reliable and distributed publish-subscribe messaging system.
  • 29. Flume vs. Kafka •  Flume: –  Purpose built for Hadoop data ingest. –  Pre-built sinks for HDFS, HBase, etc. –  Supports transformation of data in-flight. •  Kafka: –  General pub-sub messaging framework. –  Just a message transport. –  Have to use a third-party tool to ingest.
  • 30. Flume and Kafka •  Kafka Source •  Kafka Channel
  • 31. Short Intro to Flume •  Flume Agent: –  Sources: Twitter, logs, JMS, webserver –  Interceptors: mask, re-format, validate… –  Selectors: DR, critical –  Channels: memory, file, Kafka –  Sinks: HDFS, HBase, Solr
  • 32. A Brief Discussion of Flume Patterns – Fan-in •  Flume agent runs on each of our servers. •  These agents send data to multiple agents to provide reliability. •  Flume provides support for load balancing.
  • 33. Ingestion Decisions •  Historical Data –  File transfer •  Incoming Data –  Flume with the spooling directory source. •  Relational Data Sources – ODS, CRM, etc. –  Sqoop
  • 35. Processing Engines •  MapReduce •  Abstractions – Pig, Hive, Cascading, Crunch •  Spark •  Impala
  • 36. MapReduce •  Oldie but goody •  Restrictive Framework / Innovated Work Around •  Extreme Batch
  • 37. MapReduce Basic High Level [Diagram labels: Mapper, HDFS (Replicated), Native File System, Block of Data, Temp Spill Data, Partitioned Sorted Data, Reducer, Reducer Local Copy, Output File]
  • 38. Abstractions •  SQL –  Hive •  Script/Code –  Pig: Pig Latin –  Crunch: Java/Scala –  Cascading: Java/Scala
  • 39. Spark •  The New Kid that isn’t that New Anymore •  Easily 10x less code •  Extremely Easy and Powerful API •  Very good for machine learning •  Scala, Java, and Python •  RDDs •  DAG Engine
  • 40. Impala •  Real-time open source MPP style engine for Hadoop •  Doesn’t build on MapReduce •  Written in C++, uses LLVM for run-time code generation •  Can create tables over HDFS or HBase data •  Accesses Hive metastore for metadata •  Access available via JDBC/ODBC
  • 41. Architectural Considerations Data Processing – What processing needs to happen?
  • 42. What processing needs to happen? •  Sessionization •  Filtering •  Deduplication •  BI / Discovery
  • 43. Sessionization [Diagram: website visits grouped into Visitor 1 Session 1, Visitor 1 Session 2 and Visitor 2 Session 1; a gap of > 30 minutes separates sessions]
  • 44. Why sessionize? Helps answer questions like: •  What is my website’s bounce rate? –  i.e. what percentage of visitors don’t go past the landing page? •  Which marketing channels (e.g. organic search, display ad, etc.) are leading to most sessions? –  Which of those lead to most conversions (e.g. people buying things, signing up, etc.)? •  Do attribution analysis – which channels are responsible for most conversions?
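To make the bounce-rate question concrete, here is a small sketch that computes it from sessionized clicks. It assumes one common definition (the share of sessions containing a single click) and a hypothetical input shape of one `(user_key, session_number)` pair per click; neither is specified in the slides.

```python
from collections import Counter


def bounce_rate(click_sessions):
    """Percentage of sessions that contain exactly one click.

    click_sessions: iterable of (user_key, session_number), one per click.
    """
    clicks_per_session = Counter(click_sessions)
    if not clicks_per_session:
        return 0.0
    bounced = sum(1 for count in clicks_per_session.values() if count == 1)
    return 100.0 * bounced / len(clicks_per_session)


# Three sessions in total; two of them have a single click.
clicks = [("a", 1), ("a", 1), ("a", 2), ("b", 1)]
print(round(bounce_rate(clicks), 1))
```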
  • 45. How to Sessionize? 1.  Given a list of clicks, determine which clicks came from the same user 2.  Given a particular user's clicks, determine if a given click is a part of a new session or a continuation of the previous session
  • 46. #1 – Which clicks are from same user? •  We can use: –  IP address (244.157.45.12) –  Cookies (A9A3BECE0563982D) –  IP address (244.157.45.12) and user agent string ("(KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36")
  • 48. #1 – Which clicks are from same user? 244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463 "http://bestcyclingreviews.com/top_online_shops" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36" 244.157.45.12 - - [17/Oct/2014:21:59:59 ] "GET /Store/cart.jsp?productID=1023 HTTP/1.0" 200 3757 "http://www.casualcyclist.com" "Mozilla/5.0 (Linux; U; Android 2.3.5; en-us; HTC Vision Build/GRI40) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1"
  • 49. #2 – Which clicks are part of the same session? 244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463 "http://bestcyclingreviews.com/top_online_shops" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36" 244.157.45.12 - - [17/Oct/2014:21:59:59 ] "GET /Store/cart.jsp?productID=1023 HTTP/1.0" 200 3757 "http://www.casualcyclist.com" "Mozilla/5.0 (Linux; U; Android 2.3.5; en-us; HTC Vision Build/GRI40) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1" •  > 30 mins apart = different sessions
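The two sessionization steps can be sketched in plain Python as follows. This is not the MapReduce or Spark code from the book's GitHub repository; it is a single-machine illustration that assumes clicks arrive as `(user_key, timestamp)` pairs, where `user_key` is whatever identifies a user (IP address, cookie, or IP address plus user agent).

```python
from datetime import datetime, timedelta
from itertools import groupby

SESSION_TIMEOUT = timedelta(minutes=30)


def sessionize(clicks):
    """Assign a per-user session number to every click.

    clicks: iterable of (user_key, timestamp) pairs.
    Returns a list of (user_key, timestamp, session_number) tuples.
    """
    result = []
    # Step 1: bring each user's clicks together, in time order.
    ordered = sorted(clicks, key=lambda click: (click[0], click[1]))
    for user_key, user_clicks in groupby(ordered, key=lambda click: click[0]):
        session = 0
        previous = None
        # Step 2: start a new session on the first click and whenever
        # the gap since the previous click exceeds the timeout.
        for _, timestamp in user_clicks:
            if previous is None or timestamp - previous > SESSION_TIMEOUT:
                session += 1
            previous = timestamp
            result.append((user_key, timestamp, session))
    return result


clicks = [
    ("244.157.45.12", datetime(2014, 10, 17, 21, 59, 59)),
    ("244.157.45.12", datetime(2014, 10, 17, 21, 8, 30)),
]
for row in sessionize(clicks):
    print(row)
```

The two sample clicks are more than 30 minutes apart, so they land in different sessions. In a distributed engine the sort-and-group step corresponds to the shuffle keyed by user.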
  • 50. Sessionization engine recommendation •  We have sessionization code in MR and Spark on GitHub. The complexity of the code varies, and the right choice depends on the expertise in the organization. •  We choose MR, since it’s fairly simple and maintainable code.
  • 51. Filtering – filter out incomplete records 244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463 "http://bestcyclingreviews.com/top_online_shops" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36" 244.157.45.12 - - [17/Oct/2014:21:59:59 ] "GET /Store/cart.jsp?productID=1023 HTTP/1.0" 200 3757 "http://www.casualcyclist.com" "Mozilla/5.0 (Linux; U…
  • 52. Filtering – filter out records from bots/spiders 244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463 "http://bestcyclingreviews.com/top_online_shops" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36" 209.85.238.11 - - [17/Oct/2014:21:59:59 ] "GET /Store/cart.jsp?productID=1023 HTTP/1.0" 200 3757 "http://www.casualcyclist.com" "Mozilla/5.0 (Linux; U; Android 2.3.5; en-us; HTC Vision Build/GRI40) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1" •  209.85.238.11 is a Google spider IP address
  • 53. Filtering recommendation •  Bot/Spider filtering can be done easily in any of the engines •  Incomplete records are harder to filter in schema systems like Hive, Impala, Pig, etc. •  Pretty close choice between MR, Hive and Spark •  Can be done in Flume interceptors as well •  We can simply embed this in our sessionization job
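A minimal sketch of the two filters, written as a predicate that could be embedded in a sessionization job. The record layout (a dict keyed by hypothetical field names), the `REQUIRED_FIELDS` tuple, and the one-entry blocklist are assumptions for illustration; a real job would load a maintained list of known crawler addresses or user agents.

```python
# Hypothetical field names for a parsed click record.
REQUIRED_FIELDS = ("ip", "timestamp", "request", "user_agent")

# Illustrative blocklist holding only the spider address shown on the slide.
BOT_IPS = {"209.85.238.11"}


def keep_record(record, bot_ips=BOT_IPS):
    """Return True for complete records that do not come from a known bot."""
    if record is None:
        return False  # the line could not be parsed at all
    if any(not record.get(field) for field in REQUIRED_FIELDS):
        return False  # a required field is missing or empty
    return record["ip"] not in bot_ips


records = [
    {"ip": "244.157.45.12", "timestamp": "17/Oct/2014:21:08:30",
     "request": "GET /seatposts HTTP/1.0", "user_agent": "Mozilla/5.0"},
    {"ip": "209.85.238.11", "timestamp": "17/Oct/2014:21:59:59",
     "request": "GET /Store/cart.jsp?productID=1023 HTTP/1.0",
     "user_agent": "Mozilla/5.0"},
    {"ip": "244.157.45.12", "timestamp": "17/Oct/2014:21:59:59",
     "request": "GET /Store/cart.jsp?productID=1023 HTTP/1.0"},
]
print([keep_record(r) for r in records])
```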
  • 54. Deduplication – remove duplicate records 244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463 "http://bestcyclingreviews.com/top_online_shops" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36" 244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463 "http://bestcyclingreviews.com/top_online_shops" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36"
  • 55. Deduplication recommendation •  Can be done in all engines. •  We already have a Hive table with all the columns; a simple DISTINCT query will perform deduplication •  We use Pig
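For readers who want to see the logic outside a SQL or Pig engine, exact-duplicate removal amounts to keeping the first occurrence of each record. The sketch below is a single-machine Python illustration of that idea, not the Pig script used in the talk, and it assumes records are compared as whole strings.

```python
def deduplicate(lines):
    """Drop exact duplicate records, keeping first occurrences in order."""
    seen = set()
    unique = []
    for line in lines:
        if line not in seen:
            seen.add(line)
            unique.append(line)
    return unique


log_lines = [
    '244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463',
    '244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463',
    '244.157.45.12 - - [17/Oct/2014:21:59:59 ] "GET /Store/cart.jsp?productID=1023 HTTP/1.0" 200 3757',
]
print(len(deduplicate(log_lines)))
```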
  • 56. BI/Discovery engine recommendation •  Main requirements for this are: –  Low latency –  SQL interface (e.g. JDBC/ODBC) –  Users don’t know how to code •  We chose Impala –  It’s a SQL engine –  Much faster than other engines –  Provides standard JDBC/ODBC interfaces
  • 58. Choosing… •  Workflow is fairly simple •  Need to trigger workflow based on data •  Be able to recover from errors •  Perhaps notify on the status •  And collect metrics for reporting [Slide annotation: Easier in Oozie]
  • 59. Choosing the right Orchestration Tool •  Workflow is fairly simple •  Need to trigger workflow based on data •  Be able to recover from errors •  Perhaps notify on the status •  And collect metrics for reporting [Slide annotation: Better in Azkaban]
  • 60. Important Decision Consideration! •  The best orchestration tool is the one you are an expert on –  Oozie –  Spark Streaming, etc. don’t require an orchestration tool
  • 62. Final architecture •  1. Ingestion: website users, web servers, web logs via Flume; Operational Data Store and CRM System via Sqoop •  2. Processing: Hadoop Cluster •  3. Accessing: BI/Visualization tool (e.g. MicroStrategy) for BI Analysts; Spark for machine learning and graph processing; R/Python statistical analysis; custom apps •  4. Orchestration
  • 63. Thank you