SlideShare a Scribd company logo
1 of 72
REMINDER
Check in on the
COLLABORATE mobile app
Architectural Considerations for
Data Warehousing with Hadoop
Prepared by:
Mark Grover, Software Engineer
Jonathan Seidman, Solutions Architect
Cloudera, Inc.
github.com/hadooparchitecturebook/h
adoop-arch-book/tree/master/ch11-
data-warehousing
Session ID#: 10251
@mark_grover
@jseidman
About Us
■ Mark
▪ Software Engineer at
Cloudera
▪ Committer on Apache
Bigtop, PMC member on
Apache Sentry
(incubating)
▪ Contributor to Apache
Hadoop, Spark, Hive,
Sqoop, Pig and Flume
■ Jonathan
▪ Senior Solutions
Architect/Partner
Engineering at Cloudera
▪ Previously, Technical Lead
on the big data team at
Orbitz Worldwide
▪ Co-founder of the Chicago
Hadoop User Group and
Chicago Big Data
About the Book
■ @hadooparchbook
■ hadooparchitecturebook.com
■ github.com/hadooparchitectur
ebook
■ slideshare.com/hadooparchbo
ok
Agenda
■ Typical data warehouse architecture.
■ Challenges with the existing data warehouse architecture.
■ How Hadoop complements an existing data warehouse
architecture.
■ (Very) Brief intro to Hadoop.
■ Example use case.
■ Walkthrough of example use case implementation.
Typical Data Warehouse
Architecture
Example High Level Data Warehouse
Architecture
Extract
Data
Staging
Area
Operational
Source
Systems
Load
Data
Warehouse
Data
Analysis/Visu
alization Tools
Transformations
Challenges with the Data
Warehouse Architecture
Challenge – ETL/ELT Processing
OLTP
Enterprise
Applications
ODS
Data Warehouse
QueryExtract
Transform
Load
Business
Intelligence
Transform
Challenges – ETL/ELT Processing
OLTP
Enterprise
Applications
ODS
Data Warehouse
QueryExtract
Transform
Load
Business
Intelligence
Transform
1 Slow Data Transformations = Missed ETL SLAs.
2 Slow Queries = Frustrated Business Users.
1
2
1
Challenges – Data Archiving
Data
Warehouse
Tape
Archive
■ Full-fidelity data only kept for a short duration
■ Expensive or sometimes impossible to look at historical raw
data
Challenge – Disparate Data Sources
Data
Warehouse
■ How do you join data from disparate sources with EDW?
Business
Intelligence
???
Challenge – Lack of Agility
■ Responding to changing requirements, mistakes, etc.
requires lengthy processes.
Challenge – Exploratory Analysis in the
EDW
■ Difficult for users to do exploratory analysis of data in the data
warehouse.
Business
Users
Developers Analysts
Data
Warehouse
Complementing the EDW with
Hadoop
Data Warehouse Architecture with Hadoop
Extract
Hadoop
Operational
Source
Systems
EDW
BI/Analytics Tools
Logs,
machine
data, etc.
Extract
Transformation/Analysis
Load
Hadoop
ETL/ELT Optimization with Hadoop
OLTP
Enterprise
Applications
ODS
Business
Intelligence
Transform
Query
Store
ETL
Data Warehouse
Query
(High $/Byte)
Active Archiving with Hadoop
Data
Warehouse
Hadoop
Joining Disparate Data Sources with
Hadoop
Data
Warehouse
Business
IntelligenceHadoop
Agile Data Access with Hadoop
Schema-on-Write (RDBMS):
• Prescriptive Data Modeling:
• Create static DB schema
• Transform data into RDBMS
• Query data in RDBMS format
• New columns must be added
explicitly before new data can
propagate into the system.
• Good for Known Unknowns
(Repetition)
Schema-on-Read (Hadoop):
• Descriptive Data Modeling:
• Copy data in its native format
• Create schema + parser
• Query Data in its native format
• New data can start flowing any time
and will appear retroactively once the
schema/parser properly describes it.
• Good for Unknown Unknowns
(Exploration)
Exploratory Analysis with Hadoop
Hadoop
Business
Users
Developers Analysts
Data
Warehouse
A Very Brief Intro to Hadoop
What is Apache Hadoop?
Has the Flexibility to Store
and Mine Any Type of Data
 Ask questions across structured and
unstructured data that were previously
impossible to ask or solve
 Not bound by a single schema
Excels at
Processing Complex Data
 Scale-out architecture divides
workloads across multiple nodes
 Flexible file system eliminates ETL
bottlenecks
Scales
Economically
 Can be deployed on commodity
hardware
 Open source platform guards
against vendor lock
Hadoop
Distributed File
System (HDFS)
Self-Healing, High
Bandwidth Clustered
Storage
Parallel
Processing
(MapReduce,
Spark, Impala,
etc.)
Distributed Computing
Frameworks
Apache Hadoop is an open source
platform for data storage and
processing that is…
 Scalable
 Fault tolerant
 Distributed
CORE HADOOP SYSTEM COMPONENTS
Oracle Big Data Appliance
■ All of the capabilities we’re talking about here are available as
part of the Oracle BDA.
Challenges of Hadoop Implementation
Challenges of Hadoop Implementation
Other Challenges – Architectural
Considerations
Data
Sources
Ingestion
Raw Data
Storage
(Formats,
Schema)
Processed
Data
Storage
(Formats,
Schema)
Processing
Data
Consumption
Orchestration
(Scheduling,
Managing,
Monitoring)
Metadata
Management
Hadoop Third Party Ecosystem
Data
Systems
Applications
Infrastructure
Operational
Tools
Walkthrough of Example Use
Case
Use-case
■ Movielens dataset
■ Users register by entering some demographic information
▪ Users can update demographic information later on
■ Rate movies
▪ Ratings can be updated later on
■ Auxillary information about movies available
▪ e.g. release date, IMDB URL, etc.
Movielens data set
u.user
user id | age | gender | occupation | zip code
1|24|M|technician|85711
2|53|F|other|94043
3|23|M|writer|32067
4|24|M|technician|43537
5|33|F|other|15213
6|42|M|executive|98101
7|57|M|administrator|91344
Movielens data set
u.item
movie id | movie title | release date | video release date
| IMDb URL | unknown | Action | Adventure | Animation
|Children's | Comedy | Crime | Documentary | Drama |
Fantasy | Film-Noir | Horror | Musical | Mystery | Romance
| Sci-Fi | Thriller | War | Western |
1|Toy Story (1995)|01-Jan-
1995||http://us.imdb.com/M/title-
exact?Toy%20Story%20(1995)|0|0|0|1|1|1|0|0|0|0|0|0|0|0|0|0
|0|0|0
2|GoldenEye (1995)|01-Jan-
1995||http://us.imdb.com/M/title-
exact?GoldenEye%20(1995)|0|1|1|0|0|0|0|0|0|0|0|0|0|0|0|0|1
|0|0
3|Four Rooms (1995)|01-Jan-
1995||http://us.imdb.com/M/title-
exact?Four%20Rooms%20(1995)|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|
0|1|0|0
Movielens data set
u.data
user id | item id | rating | timestamp
196|242|3|881250949
186|302|3|891717742
22|377|1|878887116
244 51|2|880606923
166|346|1|886397596
OLTP schema
Movielens data set - OLTP
Data Modeling
Data Modeling Considerations
■ We need to consider the following in our architecture:
▪ Storage layer – HDFS? HBase? Etc.
▪ File system schemas – how will we lay out the data?
▪ File formats – what storage formats to use for our data, both raw
and processed data?
▪ Data compression formats?
■ Hadoop is not a database, so these considerations will be
different from an RDBMS.
Denormalization
■ Why denormalize?
■ When to do denormalize?
■ How much to denormalize?
Why Denormalize?
■ Regular joins are expensive in Hadoop
■ When you have 2 data sets, no guarantees that
corresponding records will be present on the same
■ Such a guarantee exists when storing such data in a single
data set
When to Denormalize?
■ Well, it’s difficult to say
■ It depends
Movielens Data Set - Denormalization
Denormalize Denormalize
Data Set in Hadoop
Tracking Updates (CDC)
■ Can’t update data in-place in HDFS
■ HDFS is append-only filesystem
■ We have to track all updates
Tracking Updates in Hadoop
Hadoop File Types
■ Formats designed specifically to store and process data on
Hadoop:
▪ File based – SequenceFile
▪ Serialization formats – Thrift, Protocol Buffers, Avro
▪ Columnar formats – RCFile, ORC, Parquet
Final Schema in Hadoop
Our Storage Format Recommendation
■ Columnar format (Parquet) for merged/compacted data sets
▪ user, user_rating, movie
■ Row format (Avro) for history/append-only data sets
▪ user_history, user_rating_fact
Ingestion
Sources Interceptors Selectors Channels Sinks
Flume Agent
Ingestion – Apache Flume
Twitter, logs, JMS,
webserver, Kafka
Mask, re-format,
validate…
DR, critical
Memory, file,
Kafka
HDFS, HBase,
Solr
Ingestion – Apache Kafka
Source System Source System Source System Source System
Hadoop
Security
Systems
Real-time
monitoring
Data
Warehouse
Kafka
Ingestion – Apache Sqoop
■ Apache project designed to ease import and export of data
between Hadoop and external data stores such as an
RDBMS.
■ Provides functionality to do bulk imports and exports of data.
■ Leverages MapReduce to transfer data in parallel.
Client Sqoop
MapReduce Map Map Map
Hadoop
Run import Collect metadata
Generate code,
Execute MR job
Pull data
Write to Hadoop
Sqoop Import Example – Movie
sqoop import --connect 
jdbc:mysql://mysql_server:3306/movielens 
--username myuser --password mypass --query 
'SELECT movie.*, group_concat(genre.name)
FROM movie
JOIN movie_genre ON (movie.id =
movie_genre.movie_id)
JOIN genre ON (movie_genre.genre_id = genre.id)
WHERE ${CONDITIONS}
GROUP BY movie.id' 
--split-by movie.id --as-avrodatafile 
--target-dir /data/movielens/movie
Data Processing
Popular Processing Engines
■ MapReduce
▪ Programming paradigm
■ Pig
▪ Workflow language based
■ Hive
▪ Batch SQL-engine
■ Impala
▪ Near real-time concurrent SQL engine
■ Spark
▪ DAG engine
Final Schema in Hadoop
Merge Updates
hive>INSERT OVERWRITE TABLE user_tmp
SELECT user.*
FROM user
LEFT OUTER JOIN user_upserts
ON (user.id = user_upserts.id)
WHERE
user_upserts.id IS NULL
UNION ALL
SELECT
id, age, occupation, zipcode,
TIMESTAMP(last_modified)
FROM user_upserts;
Aggregations
user_rating_fact
user_rating
user_history
movie
user
Merge updates
One
record/user/movie
Merge updates
One record/user
avg_movie_rating
latest_trending_
movies
Aggregations
hive>CREATE TABLE avg_movie_rating AS
SELECT
movie_id,
ROUND(AVG(rating), 1) AS rating
FROM
user_rating
GROUP BY
movie_id;
Export to Data Warehouse
user_rating_fact
user_rating
user_history
movie
user
Merge updates
One
record/user/movie
Merge updates
One record/user
Data
Warehouse
avg_movie_rating
latest_trending_
movies
Export
Sqoop Export
sqoop export --connect 
jdbc:mysql:/mysql_server:3306/movie_dwh 
--username myuser --password mypass 
--table avg_movie_rating --export-dir 
/user/hive/warehouse/avg_movie_rating 
-m 16 --update-key movie_id --update-mode 
allowinsert --input-fields-terminated-by 
'001’ --lines-terminated-by 'n'
Final Architecture
Final Architecture
Please complete the session
evaluation
Thank you!
@hadooparchbook
You may complete the session evaluation either
on paper or online via the mobile app
This is a slide title that can be up to two
lines of text without losing readability
■ This is the first bullet of text
■ This is the second bullet of text at the same level
▪ This is a sub-bullet of text
— This is a secondary sub-bullet of text (and should be as far sub-
bullets indent)
– This tertiary sub-bullet will be seldom used, but available
▪ This is another sub-bullet of text
■ And this is the third bullet of text
This is a slide title (one or two lines)
■ This is the first bullet of text
▪ This is a sub-bullet of text
■ This is the second bullet of
text at the same level
▪ This is a sub-bullet of text
— This is a secondary sub-
bullet of text
– This tertiary sub-bullet
that can be use
▪ This is another sub-bullet
of text
■ Senior Solutions
Architect/Partner
Enablement at Cloudera
■ Previously, Technical Lead
on the big data team at
Orbitz Worldwide
This is a slide title (one or two lines)
■ This is the first bullet of text
▪ This is a sub-bullet of text
■ This is the second bullet of
text at the same level
▪ This is a sub-bullet of text
— This is a secondary sub-bullet
of text
– This tertiary sub-bullet
that can be use
▪ This is another sub-bullet of
text
■ This is the first bullet of text
■ This is the second bullet of
text at the same level
▪ This is a sub-bullet of text
▪ This is a sub-bullet of text
— This secondary sub-bullet
▪ This is another sub-bullet of
text
■ And this is another bullet of
text
Subject number one Subject number two
This is a slide title for a slide with just the
title line (e.g., images/diagrams below)
What is Hadoop?
Hadoop is an open-source system designed
To store and process petabyte scale data.
That’s pretty much what you need to know.
Well almost…
Compression Codecs
snappy
Well, maybe.
Not splittable.
X
Splittable.
Getting
better…
Very good
choice
Splittable,
but no...
Our Compression Codec Recommendation
■ Snappy for all data sets (columnar as well as row based)
File Format Choices
Data set Storage format Compression Codec
movie Parquet Snappy
user_history Avro Snappy
user Parquet Snappy
user_rating_fact Avro Snappy
user_rating Parquet Snappy

More Related Content

What's hot

How to build a streaming Lakehouse with Flink, Kafka, and Hudi
How to build a streaming Lakehouse with Flink, Kafka, and HudiHow to build a streaming Lakehouse with Flink, Kafka, and Hudi
How to build a streaming Lakehouse with Flink, Kafka, and HudiFlink Forward
 
Chicago Data Summit: Apache HBase: An Introduction
Chicago Data Summit: Apache HBase: An IntroductionChicago Data Summit: Apache HBase: An Introduction
Chicago Data Summit: Apache HBase: An IntroductionCloudera, Inc.
 
File Format Benchmarks - Avro, JSON, ORC, & Parquet
File Format Benchmarks - Avro, JSON, ORC, & ParquetFile Format Benchmarks - Avro, JSON, ORC, & Parquet
File Format Benchmarks - Avro, JSON, ORC, & ParquetOwen O'Malley
 
The Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization OpportunitiesThe Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization OpportunitiesDatabricks
 
Zeus: Uber’s Highly Scalable and Distributed Shuffle as a Service
Zeus: Uber’s Highly Scalable and Distributed Shuffle as a ServiceZeus: Uber’s Highly Scalable and Distributed Shuffle as a Service
Zeus: Uber’s Highly Scalable and Distributed Shuffle as a ServiceDatabricks
 
Real-Time Data Flows with Apache NiFi
Real-Time Data Flows with Apache NiFiReal-Time Data Flows with Apache NiFi
Real-Time Data Flows with Apache NiFiManish Gupta
 
Choosing an HDFS data storage format- Avro vs. Parquet and more - StampedeCon...
Choosing an HDFS data storage format- Avro vs. Parquet and more - StampedeCon...Choosing an HDFS data storage format- Avro vs. Parquet and more - StampedeCon...
Choosing an HDFS data storage format- Avro vs. Parquet and more - StampedeCon...StampedeCon
 
Hive+Tez: A performance deep dive
Hive+Tez: A performance deep diveHive+Tez: A performance deep dive
Hive+Tez: A performance deep divet3rmin4t0r
 
Hudi architecture, fundamentals and capabilities
Hudi architecture, fundamentals and capabilitiesHudi architecture, fundamentals and capabilities
Hudi architecture, fundamentals and capabilitiesNishith Agarwal
 
Free Training: How to Build a Lakehouse
Free Training: How to Build a LakehouseFree Training: How to Build a Lakehouse
Free Training: How to Build a LakehouseDatabricks
 
Building Data Product Based on Apache Spark at Airbnb with Jingwei Lu and Liy...
Building Data Product Based on Apache Spark at Airbnb with Jingwei Lu and Liy...Building Data Product Based on Apache Spark at Airbnb with Jingwei Lu and Liy...
Building Data Product Based on Apache Spark at Airbnb with Jingwei Lu and Liy...Databricks
 
Databricks Delta Lake and Its Benefits
Databricks Delta Lake and Its BenefitsDatabricks Delta Lake and Its Benefits
Databricks Delta Lake and Its BenefitsDatabricks
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkDatabricks
 
Sqoop on Spark for Data Ingestion
Sqoop on Spark for Data IngestionSqoop on Spark for Data Ingestion
Sqoop on Spark for Data IngestionDataWorks Summit
 
Migrating your clusters and workloads from Hadoop 2 to Hadoop 3
Migrating your clusters and workloads from Hadoop 2 to Hadoop 3Migrating your clusters and workloads from Hadoop 2 to Hadoop 3
Migrating your clusters and workloads from Hadoop 2 to Hadoop 3DataWorks Summit
 
Big data architectures and the data lake
Big data architectures and the data lakeBig data architectures and the data lake
Big data architectures and the data lakeJames Serra
 
Hadoop Security Architecture
Hadoop Security ArchitectureHadoop Security Architecture
Hadoop Security ArchitectureOwen O'Malley
 
Parquet performance tuning: the missing guide
Parquet performance tuning: the missing guideParquet performance tuning: the missing guide
Parquet performance tuning: the missing guideRyan Blue
 

What's hot (20)

How to build a streaming Lakehouse with Flink, Kafka, and Hudi
How to build a streaming Lakehouse with Flink, Kafka, and HudiHow to build a streaming Lakehouse with Flink, Kafka, and Hudi
How to build a streaming Lakehouse with Flink, Kafka, and Hudi
 
An Overview of Ambari
An Overview of AmbariAn Overview of Ambari
An Overview of Ambari
 
Chicago Data Summit: Apache HBase: An Introduction
Chicago Data Summit: Apache HBase: An IntroductionChicago Data Summit: Apache HBase: An Introduction
Chicago Data Summit: Apache HBase: An Introduction
 
File Format Benchmarks - Avro, JSON, ORC, & Parquet
File Format Benchmarks - Avro, JSON, ORC, & ParquetFile Format Benchmarks - Avro, JSON, ORC, & Parquet
File Format Benchmarks - Avro, JSON, ORC, & Parquet
 
The Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization OpportunitiesThe Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization Opportunities
 
Zeus: Uber’s Highly Scalable and Distributed Shuffle as a Service
Zeus: Uber’s Highly Scalable and Distributed Shuffle as a ServiceZeus: Uber’s Highly Scalable and Distributed Shuffle as a Service
Zeus: Uber’s Highly Scalable and Distributed Shuffle as a Service
 
Real-Time Data Flows with Apache NiFi
Real-Time Data Flows with Apache NiFiReal-Time Data Flows with Apache NiFi
Real-Time Data Flows with Apache NiFi
 
Choosing an HDFS data storage format- Avro vs. Parquet and more - StampedeCon...
Choosing an HDFS data storage format- Avro vs. Parquet and more - StampedeCon...Choosing an HDFS data storage format- Avro vs. Parquet and more - StampedeCon...
Choosing an HDFS data storage format- Avro vs. Parquet and more - StampedeCon...
 
Hive+Tez: A performance deep dive
Hive+Tez: A performance deep diveHive+Tez: A performance deep dive
Hive+Tez: A performance deep dive
 
Hudi architecture, fundamentals and capabilities
Hudi architecture, fundamentals and capabilitiesHudi architecture, fundamentals and capabilities
Hudi architecture, fundamentals and capabilities
 
Free Training: How to Build a Lakehouse
Free Training: How to Build a LakehouseFree Training: How to Build a Lakehouse
Free Training: How to Build a Lakehouse
 
Building Data Product Based on Apache Spark at Airbnb with Jingwei Lu and Liy...
Building Data Product Based on Apache Spark at Airbnb with Jingwei Lu and Liy...Building Data Product Based on Apache Spark at Airbnb with Jingwei Lu and Liy...
Building Data Product Based on Apache Spark at Airbnb with Jingwei Lu and Liy...
 
Apache Spark Crash Course
Apache Spark Crash CourseApache Spark Crash Course
Apache Spark Crash Course
 
Databricks Delta Lake and Its Benefits
Databricks Delta Lake and Its BenefitsDatabricks Delta Lake and Its Benefits
Databricks Delta Lake and Its Benefits
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
 
Sqoop on Spark for Data Ingestion
Sqoop on Spark for Data IngestionSqoop on Spark for Data Ingestion
Sqoop on Spark for Data Ingestion
 
Migrating your clusters and workloads from Hadoop 2 to Hadoop 3
Migrating your clusters and workloads from Hadoop 2 to Hadoop 3Migrating your clusters and workloads from Hadoop 2 to Hadoop 3
Migrating your clusters and workloads from Hadoop 2 to Hadoop 3
 
Big data architectures and the data lake
Big data architectures and the data lakeBig data architectures and the data lake
Big data architectures and the data lake
 
Hadoop Security Architecture
Hadoop Security ArchitectureHadoop Security Architecture
Hadoop Security Architecture
 
Parquet performance tuning: the missing guide
Parquet performance tuning: the missing guideParquet performance tuning: the missing guide
Parquet performance tuning: the missing guide
 

Similar to Data warehousing with Hadoop

Application Architectures with Hadoop
Application Architectures with HadoopApplication Architectures with Hadoop
Application Architectures with Hadoophadooparchbook
 
Application Architectures with Hadoop | Data Day Texas 2015
Application Architectures with Hadoop | Data Day Texas 2015Application Architectures with Hadoop | Data Day Texas 2015
Application Architectures with Hadoop | Data Day Texas 2015Cloudera, Inc.
 
Architecting Applications with Hadoop
Architecting Applications with HadoopArchitecting Applications with Hadoop
Architecting Applications with Hadoopmarkgrover
 
Trafodion overview
Trafodion overviewTrafodion overview
Trafodion overviewRohit Jain
 
Architecting application with Hadoop - using clickstream analytics as an example
Architecting application with Hadoop - using clickstream analytics as an exampleArchitecting application with Hadoop - using clickstream analytics as an example
Architecting application with Hadoop - using clickstream analytics as an examplehadooparchbook
 
Application Architectures with Hadoop - UK Hadoop User Group
Application Architectures with Hadoop - UK Hadoop User GroupApplication Architectures with Hadoop - UK Hadoop User Group
Application Architectures with Hadoop - UK Hadoop User Grouphadooparchbook
 
Application Architectures with Hadoop
Application Architectures with HadoopApplication Architectures with Hadoop
Application Architectures with Hadoophadooparchbook
 
Tugdual Grall - Real World Use Cases: Hadoop and NoSQL in Production
Tugdual Grall - Real World Use Cases: Hadoop and NoSQL in ProductionTugdual Grall - Real World Use Cases: Hadoop and NoSQL in Production
Tugdual Grall - Real World Use Cases: Hadoop and NoSQL in ProductionCodemotion
 
Building data pipelines with kite
Building data pipelines with kiteBuilding data pipelines with kite
Building data pipelines with kiteJoey Echeverria
 
2014 july 24_what_ishadoop
2014 july 24_what_ishadoop2014 july 24_what_ishadoop
2014 july 24_what_ishadoopAdam Muise
 
Webinar: Selecting the Right SQL-on-Hadoop Solution
Webinar: Selecting the Right SQL-on-Hadoop SolutionWebinar: Selecting the Right SQL-on-Hadoop Solution
Webinar: Selecting the Right SQL-on-Hadoop SolutionMapR Technologies
 
Differentiate Big Data vs Data Warehouse use cases for a cloud solution
Differentiate Big Data vs Data Warehouse use cases for a cloud solutionDifferentiate Big Data vs Data Warehouse use cases for a cloud solution
Differentiate Big Data vs Data Warehouse use cases for a cloud solutionJames Serra
 
Hadoop Demystified + MapReduce (Java and C#), Pig, and Hive Demos
Hadoop Demystified + MapReduce (Java and C#), Pig, and Hive DemosHadoop Demystified + MapReduce (Java and C#), Pig, and Hive Demos
Hadoop Demystified + MapReduce (Java and C#), Pig, and Hive DemosLester Martin
 
Microsoft Data Platform - What's included
Microsoft Data Platform - What's includedMicrosoft Data Platform - What's included
Microsoft Data Platform - What's includedJames Serra
 
Azure Cafe Marketplace with Hortonworks March 31 2016
Azure Cafe Marketplace with Hortonworks March 31 2016Azure Cafe Marketplace with Hortonworks March 31 2016
Azure Cafe Marketplace with Hortonworks March 31 2016Joan Novino
 
Introduction To Big Data & Hadoop
Introduction To Big Data & HadoopIntroduction To Big Data & Hadoop
Introduction To Big Data & HadoopBlackvard
 

Similar to Data warehousing with Hadoop (20)

2014 08-20-pit-hug
2014 08-20-pit-hug2014 08-20-pit-hug
2014 08-20-pit-hug
 
Application Architectures with Hadoop
Application Architectures with HadoopApplication Architectures with Hadoop
Application Architectures with Hadoop
 
Application Architectures with Hadoop | Data Day Texas 2015
Application Architectures with Hadoop | Data Day Texas 2015Application Architectures with Hadoop | Data Day Texas 2015
Application Architectures with Hadoop | Data Day Texas 2015
 
Architecting Applications with Hadoop
Architecting Applications with HadoopArchitecting Applications with Hadoop
Architecting Applications with Hadoop
 
Trafodion overview
Trafodion overviewTrafodion overview
Trafodion overview
 
Architecting application with Hadoop - using clickstream analytics as an example
Architecting application with Hadoop - using clickstream analytics as an exampleArchitecting application with Hadoop - using clickstream analytics as an example
Architecting application with Hadoop - using clickstream analytics as an example
 
Application Architectures with Hadoop - UK Hadoop User Group
Application Architectures with Hadoop - UK Hadoop User GroupApplication Architectures with Hadoop - UK Hadoop User Group
Application Architectures with Hadoop - UK Hadoop User Group
 
Application Architectures with Hadoop
Application Architectures with HadoopApplication Architectures with Hadoop
Application Architectures with Hadoop
 
Twitter with hadoop for oow
Twitter with hadoop for oowTwitter with hadoop for oow
Twitter with hadoop for oow
 
Tugdual Grall - Real World Use Cases: Hadoop and NoSQL in Production
Tugdual Grall - Real World Use Cases: Hadoop and NoSQL in ProductionTugdual Grall - Real World Use Cases: Hadoop and NoSQL in Production
Tugdual Grall - Real World Use Cases: Hadoop and NoSQL in Production
 
Building data pipelines with kite
Building data pipelines with kiteBuilding data pipelines with kite
Building data pipelines with kite
 
2014 july 24_what_ishadoop
2014 july 24_what_ishadoop2014 july 24_what_ishadoop
2014 july 24_what_ishadoop
 
Drill at the Chicago Hug
Drill at the Chicago HugDrill at the Chicago Hug
Drill at the Chicago Hug
 
Apache drill
Apache drillApache drill
Apache drill
 
Webinar: Selecting the Right SQL-on-Hadoop Solution
Webinar: Selecting the Right SQL-on-Hadoop SolutionWebinar: Selecting the Right SQL-on-Hadoop Solution
Webinar: Selecting the Right SQL-on-Hadoop Solution
 
Differentiate Big Data vs Data Warehouse use cases for a cloud solution
Differentiate Big Data vs Data Warehouse use cases for a cloud solutionDifferentiate Big Data vs Data Warehouse use cases for a cloud solution
Differentiate Big Data vs Data Warehouse use cases for a cloud solution
 
Hadoop Demystified + MapReduce (Java and C#), Pig, and Hive Demos
Hadoop Demystified + MapReduce (Java and C#), Pig, and Hive DemosHadoop Demystified + MapReduce (Java and C#), Pig, and Hive Demos
Hadoop Demystified + MapReduce (Java and C#), Pig, and Hive Demos
 
Microsoft Data Platform - What's included
Microsoft Data Platform - What's includedMicrosoft Data Platform - What's included
Microsoft Data Platform - What's included
 
Azure Cafe Marketplace with Hortonworks March 31 2016
Azure Cafe Marketplace with Hortonworks March 31 2016Azure Cafe Marketplace with Hortonworks March 31 2016
Azure Cafe Marketplace with Hortonworks March 31 2016
 
Introduction To Big Data & Hadoop
Introduction To Big Data & HadoopIntroduction To Big Data & Hadoop
Introduction To Big Data & Hadoop
 

More from hadooparchbook

Architecting a next generation data platform
Architecting a next generation data platformArchitecting a next generation data platform
Architecting a next generation data platformhadooparchbook
 
Architecting a next-generation data platform
Architecting a next-generation data platformArchitecting a next-generation data platform
Architecting a next-generation data platformhadooparchbook
 
Top 5 mistakes when writing Streaming applications
Top 5 mistakes when writing Streaming applicationsTop 5 mistakes when writing Streaming applications
Top 5 mistakes when writing Streaming applicationshadooparchbook
 
Architecting a Next Generation Data Platform
Architecting a Next Generation Data PlatformArchitecting a Next Generation Data Platform
Architecting a Next Generation Data Platformhadooparchbook
 
What no one tells you about writing a streaming app
What no one tells you about writing a streaming appWhat no one tells you about writing a streaming app
What no one tells you about writing a streaming apphadooparchbook
 
Architecting next generation big data platform
Architecting next generation big data platformArchitecting next generation big data platform
Architecting next generation big data platformhadooparchbook
 
Hadoop application architectures - using Customer 360 as an example
Hadoop application architectures - using Customer 360 as an exampleHadoop application architectures - using Customer 360 as an example
Hadoop application architectures - using Customer 360 as an examplehadooparchbook
 
Top 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applicationsTop 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applicationshadooparchbook
 
Streaming architecture patterns
Streaming architecture patternsStreaming architecture patterns
Streaming architecture patternshadooparchbook
 
Hadoop application architectures - Fraud detection tutorial
Hadoop application architectures - Fraud detection tutorialHadoop application architectures - Fraud detection tutorial
Hadoop application architectures - Fraud detection tutorialhadooparchbook
 
Hadoop application architectures - Fraud detection tutorial
Hadoop application architectures - Fraud detection tutorialHadoop application architectures - Fraud detection tutorial
Hadoop application architectures - Fraud detection tutorialhadooparchbook
 
Top 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applicationsTop 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applicationshadooparchbook
 
Architectural Patterns for Streaming Applications
Architectural Patterns for Streaming ApplicationsArchitectural Patterns for Streaming Applications
Architectural Patterns for Streaming Applicationshadooparchbook
 
Hadoop Application Architectures - Fraud Detection
Hadoop Application Architectures - Fraud  DetectionHadoop Application Architectures - Fraud  Detection
Hadoop Application Architectures - Fraud Detectionhadooparchbook
 
Architecting applications with Hadoop - Fraud Detection
Architecting applications with Hadoop - Fraud DetectionArchitecting applications with Hadoop - Fraud Detection
Architecting applications with Hadoop - Fraud Detectionhadooparchbook
 
Fraud Detection using Hadoop
Fraud Detection using HadoopFraud Detection using Hadoop
Fraud Detection using Hadoophadooparchbook
 
Hadoop Application Architectures tutorial - Strata London
Hadoop Application Architectures tutorial - Strata LondonHadoop Application Architectures tutorial - Strata London
Hadoop Application Architectures tutorial - Strata Londonhadooparchbook
 
Hadoop Application Architectures tutorial at Big DataService 2015
Hadoop Application Architectures tutorial at Big DataService 2015Hadoop Application Architectures tutorial at Big DataService 2015
Hadoop Application Architectures tutorial at Big DataService 2015hadooparchbook
 
Architectural considerations for Hadoop Applications
Architectural considerations for Hadoop ApplicationsArchitectural considerations for Hadoop Applications
Architectural considerations for Hadoop Applicationshadooparchbook
 
Strata EU tutorial - Architectural considerations for hadoop applications
Strata EU tutorial - Architectural considerations for hadoop applicationsStrata EU tutorial - Architectural considerations for hadoop applications
Strata EU tutorial - Architectural considerations for hadoop applicationshadooparchbook
 

More from hadooparchbook (20)

Architecting a next generation data platform
Architecting a next generation data platformArchitecting a next generation data platform
Architecting a next generation data platform
 
Architecting a next-generation data platform
Architecting a next-generation data platformArchitecting a next-generation data platform
Architecting a next-generation data platform
 
Top 5 mistakes when writing Streaming applications
Top 5 mistakes when writing Streaming applicationsTop 5 mistakes when writing Streaming applications
Top 5 mistakes when writing Streaming applications
 
Architecting a Next Generation Data Platform
Architecting a Next Generation Data PlatformArchitecting a Next Generation Data Platform
Architecting a Next Generation Data Platform
 
What no one tells you about writing a streaming app
What no one tells you about writing a streaming appWhat no one tells you about writing a streaming app
What no one tells you about writing a streaming app
 
Architecting next generation big data platform
Architecting next generation big data platformArchitecting next generation big data platform
Architecting next generation big data platform
 
Hadoop application architectures - using Customer 360 as an example
Hadoop application architectures - using Customer 360 as an exampleHadoop application architectures - using Customer 360 as an example
Hadoop application architectures - using Customer 360 as an example
 
Top 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applicationsTop 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applications
 
Streaming architecture patterns
Streaming architecture patternsStreaming architecture patterns
Streaming architecture patterns
 
Hadoop application architectures - Fraud detection tutorial
Hadoop application architectures - Fraud detection tutorialHadoop application architectures - Fraud detection tutorial
Hadoop application architectures - Fraud detection tutorial
 
Hadoop application architectures - Fraud detection tutorial
Hadoop application architectures - Fraud detection tutorialHadoop application architectures - Fraud detection tutorial
Hadoop application architectures - Fraud detection tutorial
 
Top 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applicationsTop 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applications
 
Architectural Patterns for Streaming Applications
Architectural Patterns for Streaming ApplicationsArchitectural Patterns for Streaming Applications
Architectural Patterns for Streaming Applications
 
Hadoop Application Architectures - Fraud Detection
Hadoop Application Architectures - Fraud  DetectionHadoop Application Architectures - Fraud  Detection
Hadoop Application Architectures - Fraud Detection
 
Architecting applications with Hadoop - Fraud Detection
Architecting applications with Hadoop - Fraud DetectionArchitecting applications with Hadoop - Fraud Detection
Architecting applications with Hadoop - Fraud Detection
 
Fraud Detection using Hadoop
Fraud Detection using HadoopFraud Detection using Hadoop
Fraud Detection using Hadoop
 
Hadoop Application Architectures tutorial - Strata London
Hadoop Application Architectures tutorial - Strata LondonHadoop Application Architectures tutorial - Strata London
Hadoop Application Architectures tutorial - Strata London
 
Hadoop Application Architectures tutorial at Big DataService 2015
Hadoop Application Architectures tutorial at Big DataService 2015Hadoop Application Architectures tutorial at Big DataService 2015
Hadoop Application Architectures tutorial at Big DataService 2015
 
Architectural considerations for Hadoop Applications
Architectural considerations for Hadoop ApplicationsArchitectural considerations for Hadoop Applications
Architectural considerations for Hadoop Applications
 
Strata EU tutorial - Architectural considerations for hadoop applications
Strata EU tutorial - Architectural considerations for hadoop applicationsStrata EU tutorial - Architectural considerations for hadoop applications
Strata EU tutorial - Architectural considerations for hadoop applications
 

Recently uploaded

Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessPixlogix Infotech
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 

Recently uploaded (20)

Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your Business
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 

Data warehousing with Hadoop

  • 1. REMINDER Check in on the COLLABORATE mobile app Architectural Considerations for Data Warehousing with Hadoop Prepared by: Mark Grover, Software Engineer Jonathan Seidman, Solutions Architect Cloudera, Inc. github.com/hadooparchitecturebook/h adoop-arch-book/tree/master/ch11- data-warehousing Session ID#: 10251 @mark_grover @jseidman
  • 2. About Us ■ Mark ▪ Software Engineer at Cloudera ▪ Committer on Apache Bigtop, PMC member on Apache Sentry (incubating) ▪ Contributor to Apache Hadoop, Spark, Hive, Sqoop, Pig and Flume ■ Jonathan ▪ Senior Solutions Architect/Partner Engineering at Cloudera ▪ Previously, Technical Lead on the big data team at Orbitz Worldwide ▪ Co-founder of the Chicago Hadoop User Group and Chicago Big Data
  • 3. About the Book ■ @hadooparchbook ■ hadooparchitecturebook.com ■ github.com/hadooparchitectur ebook ■ slideshare.com/hadooparchbo ok
  • 4. Agenda ■ Typical data warehouse architecture. ■ Challenges with the existing data warehouse architecture. ■ How Hadoop complements an existing data warehouse architecture. ■ (Very) Brief intro to Hadoop. ■ Example use case. ■ Walkthrough of example use case implementation.
  • 6. Example High Level Data Warehouse Architecture Extract Data Staging Area Operational Source Systems Load Data Warehouse Data Analysis/Visu alization Tools Transformations
  • 7. Challenges with the Data Warehouse Architecture
  • 8. Challenge – ETL/ELT Processing OLTP Enterprise Applications ODS Data Warehouse QueryExtract Transform Load Business Intelligence Transform
  • 9. Challenges – ETL/ELT Processing OLTP Enterprise Applications ODS Data Warehouse QueryExtract Transform Load Business Intelligence Transform 1 Slow Data Transformations = Missed ETL SLAs. 2 Slow Queries = Frustrated Business Users. 1 2 1
  • 10. Challenges – Data Archiving Data Warehouse Tape Archive ■ Full-fidelity data only kept for a short duration ■ Expensive or sometimes impossible to look at historical raw data
  • 11. Challenge – Disparate Data Sources Data Warehouse ■ How do you join data from disparate sources with EDW? Business Intelligence ???
  • 12. Challenge – Lack of Agility ■ Responding to changing requirements, mistakes, etc. requires lengthy processes.
  • 13. Challenge – Exploratory Analysis in the EDW ■ Difficult for users to do exploratory analysis of data in the data warehouse. Business Users Developers Analysts Data Warehouse
  • 14. Complementing the EDW with Hadoop
  • 15. Data Warehouse Architecture with Hadoop Extract Hadoop Operational Source Systems EDW BI/Analytics Tools Logs, machine data, etc. Extract Transformation/Analysis Load
  • 16. Hadoop ETL/ELT Optimization with Hadoop OLTP Enterprise Applications ODS Business Intelligence Transform Query Store ETL Data Warehouse Query (High $/Byte)
  • 17. Active Archiving with Hadoop Data Warehouse Hadoop
  • 18. Joining Disparate Data Sources with Hadoop Data Warehouse Business IntelligenceHadoop
  • 19. Agile Data Access with Hadoop Schema-on-Write (RDBMS): • Prescriptive Data Modeling: • Create static DB schema • Transform data into RDBMS • Query data in RDBMS format • New columns must be added explicitly before new data can propagate into the system. • Good for Known Unknowns (Repetition) Schema-on-Read (Hadoop): • Descriptive Data Modeling: • Copy data in its native format • Create schema + parser • Query Data in its native format • New data can start flowing any time and will appear retroactively once the schema/parser properly describes it. • Good for Unknown Unknowns (Exploration)
  • 20. Exploratory Analysis with Hadoop Hadoop Business Users Developers Analysts Data Warehouse
  • 21. A Very Brief Intro to Hadoop
  • 22. What is Apache Hadoop? Has the Flexibility to Store and Mine Any Type of Data  Ask questions across structured and unstructured data that were previously impossible to ask or solve  Not bound by a single schema Excels at Processing Complex Data  Scale-out architecture divides workloads across multiple nodes  Flexible file system eliminates ETL bottlenecks Scales Economically  Can be deployed on commodity hardware  Open source platform guards against vendor lock Hadoop Distributed File System (HDFS) Self-Healing, High Bandwidth Clustered Storage Parallel Processing (MapReduce, Spark, Impala, etc.) Distributed Computing Frameworks Apache Hadoop is an open source platform for data storage and processing that is…  Scalable  Fault tolerant  Distributed CORE HADOOP SYSTEM COMPONENTS
  • 23. Oracle Big Data Appliance ■ All of the capabilities we’re talking about here are available as part of the Oracle BDA.
  • 24. Challenges of Hadoop Implementation
  • 25. Challenges of Hadoop Implementation
  • 26. Other Challenges – Architectural Considerations Data Sources Ingestion Raw Data Storage (Formats, Schema) Processed Data Storage (Formats, Schema) Processing Data Consumption Orchestration (Scheduling, Managing, Monitoring) Metadata Management
  • 27. Hadoop Third Party Ecosystem Data Systems Applications Infrastructure Operational Tools
  • 29. Use-case ■ Movielens dataset ■ Users register by entering some demographic information ▪ Users can update demographic information later on ■ Rate movies ▪ Ratings can be updated later on ■ Auxillary information about movies available ▪ e.g. release date, IMDB URL, etc.
  • 30. Movielens data set u.user user id | age | gender | occupation | zip code 1|24|M|technician|85711 2|53|F|other|94043 3|23|M|writer|32067 4|24|M|technician|43537 5|33|F|other|15213 6|42|M|executive|98101 7|57|M|administrator|91344
  • 31. Movielens data set u.item movie id | movie title | release date | video release date | IMDb URL | unknown | Action | Adventure | Animation |Children's | Comedy | Crime | Documentary | Drama | Fantasy | Film-Noir | Horror | Musical | Mystery | Romance | Sci-Fi | Thriller | War | Western | 1|Toy Story (1995)|01-Jan- 1995||http://us.imdb.com/M/title- exact?Toy%20Story%20(1995)|0|0|0|1|1|1|0|0|0|0|0|0|0|0|0|0 |0|0|0 2|GoldenEye (1995)|01-Jan- 1995||http://us.imdb.com/M/title- exact?GoldenEye%20(1995)|0|1|1|0|0|0|0|0|0|0|0|0|0|0|0|0|1 |0|0 3|Four Rooms (1995)|01-Jan- 1995||http://us.imdb.com/M/title- exact?Four%20Rooms%20(1995)|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0| 0|1|0|0
  • 32. Movielens data set u.data user id | item id | rating | timestamp 196|242|3|881250949 186|302|3|891717742 22|377|1|878887116 244 51|2|880606923 166|346|1|886397596
  • 36. Data Modeling Considerations ■ We need to consider the following in our architecture: ▪ Storage layer – HDFS? HBase? Etc. ▪ File system schemas – how will we lay out the data? ▪ File formats – what storage formats to use for our data, both raw and processed data? ▪ Data compression formats? ■ Hadoop is not a database, so these considerations will be different from an RDBMS.
  • 37. Denormalization ■ Why denormalize? ■ When to do denormalize? ■ How much to denormalize?
  • 38. Why Denormalize? ■ Regular joins are expensive in Hadoop ■ When you have 2 data sets, no guarantees that corresponding records will be present on the same ■ Such a guarantee exists when storing such data in a single data set
  • 39. When to Denormalize? ■ Well, it’s difficult to say ■ It depends
  • 40. Movielens Data Set - Denormalization Denormalize Denormalize
  • 41. Data Set in Hadoop
  • 42. Tracking Updates (CDC) ■ Can’t update data in-place in HDFS ■ HDFS is append-only filesystem ■ We have to track all updates
  • 44. Hadoop File Types ■ Formats designed specifically to store and process data on Hadoop: ▪ File based – SequenceFile ▪ Serialization formats – Thrift, Protocol Buffers, Avro ▪ Columnar formats – RCFile, ORC, Parquet
  • 45. Final Schema in Hadoop
  • 46. Our Storage Format Recommendation ■ Columnar format (Parquet) for merged/compacted data sets ▪ user, user_rating, movie ■ Row format (Avro) for history/append-only data sets ▪ user_history, user_rating_fact
  • 48. Sources Interceptors Selectors Channels Sinks Flume Agent Ingestion – Apache Flume Twitter, logs, JMS, webserver, Kafka Mask, re-format, validate… DR, critical Memory, file, Kafka HDFS, HBase, Solr
  • 49. Ingestion – Apache Kafka Source System Source System Source System Source System Hadoop Security Systems Real-time monitoring Data Warehouse Kafka
  • 50. Ingestion – Apache Sqoop ■ Apache project designed to ease import and export of data between Hadoop and external data stores such as an RDBMS. ■ Provides functionality to do bulk imports and exports of data. ■ Leverages MapReduce to transfer data in parallel. Client Sqoop MapReduce Map Map Map Hadoop Run import Collect metadata Generate code, Execute MR job Pull data Write to Hadoop
  • 51. Sqoop Import Example – Movie sqoop import --connect jdbc:mysql://mysql_server:3306/movielens --username myuser --password mypass --query 'SELECT movie.*, group_concat(genre.name) FROM movie JOIN movie_genre ON (movie.id = movie_genre.movie_id) JOIN genre ON (movie_genre.genre_id = genre.id) WHERE ${CONDITIONS} GROUP BY movie.id' --split-by movie.id --as-avrodatafile --target-dir /data/movielens/movie
  • 53. Popular Processing Engines ■ MapReduce ▪ Programming paradigm ■ Pig ▪ Workflow language based ■ Hive ▪ Batch SQL-engine ■ Impala ▪ Near real-time concurrent SQL engine ■ Spark ▪ DAG engine
  • 54. Final Schema in Hadoop
  • 55. Merge Updates hive>INSERT OVERWRITE TABLE user_tmp SELECT user.* FROM user LEFT OUTER JOIN user_upserts ON (user.id = user_upserts.id) WHERE user_upserts.id IS NULL UNION ALL SELECT id, age, occupation, zipcode, TIMESTAMP(last_modified) FROM user_upserts;
  • 57. Aggregations hive>CREATE TABLE avg_movie_rating AS SELECT movie_id, ROUND(AVG(rating), 1) AS rating FROM user_rating GROUP BY movie_id;
  • 58. Export to Data Warehouse
  • 59. user_rating_fact user_rating user_history movie user Merge updates One record/user/movie Merge updates One record/user Data Warehouse avg_movie_rating latest_trending_ movies Export
  • 60. Sqoop Export sqoop export --connect jdbc:mysql:/mysql_server:3306/movie_dwh --username myuser --password mypass --table avg_movie_rating --export-dir /user/hive/warehouse/avg_movie_rating -m 16 --update-key movie_id --update-mode allowinsert --input-fields-terminated-by '001’ --lines-terminated-by 'n'
  • 63. Please complete the session evaluation Thank you! @hadooparchbook You may complete the session evaluation either on paper or online via the mobile app
  • 64. This is a slide title that can be up to two lines of text without losing readability ■ This is the first bullet of text ■ This is the second bullet of text at the same level ▪ This is a sub-bullet of text — This is a secondary sub-bullet of text (and should be as far sub- bullets indent) – This tertiary sub-bullet will be seldom used, but available ▪ This is another sub-bullet of text ■ And this is the third bullet of text
  • 65. This is a slide title (one or two lines) ■ This is the first bullet of text ▪ This is a sub-bullet of text ■ This is the second bullet of text at the same level ▪ This is a sub-bullet of text — This is a secondary sub- bullet of text – This tertiary sub-bullet that can be use ▪ This is another sub-bullet of text ■ Senior Solutions Architect/Partner Enablement at Cloudera ■ Previously, Technical Lead on the big data team at Orbitz Worldwide
  • 66. This is a slide title (one or two lines) ■ This is the first bullet of text ▪ This is a sub-bullet of text ■ This is the second bullet of text at the same level ▪ This is a sub-bullet of text — This is a secondary sub-bullet of text – This tertiary sub-bullet that can be use ▪ This is another sub-bullet of text ■ This is the first bullet of text ■ This is the second bullet of text at the same level ▪ This is a sub-bullet of text ▪ This is a sub-bullet of text — This secondary sub-bullet ▪ This is another sub-bullet of text ■ And this is another bullet of text Subject number one Subject number two
  • 67. This is a slide title for a slide with just the title line (e.g., images/diagrams below)
  • 68.
  • 69. What is Hadoop? Hadoop is an open-source system designed To store and process petabyte scale data. That’s pretty much what you need to know. Well almost…
  • 70. Compression Codecs snappy Well, maybe. Not splittable. X Splittable. Getting better… Very good choice Splittable, but no...
  • 71. Our Compression Codec Recommendation ■ Snappy for all data sets (columnar as well as row based)
  • 72. File Format Choices Data set Storage format Compression Codec movie Parquet Snappy user_history Avro Snappy user Parquet Snappy user_rating_fact Avro Snappy user_rating Parquet Snappy