1

INFO-H-419: Data Warehouses project

Hadoop in Data Warehousing
by Alexey Grigorev
2

Hadoop: In this Presentation
1. Introduction
2. Origins
3. MapReduce
4. Hadoop as MapReduce Implementation
5. Data Warehouse on Hadoop
6. Hadoop and Data Warehousing
7. Conclusions
3

Why?
• Lots of data
• How to deal with it?
• Hadoop to the rescue!
• When to use?
• When not to use?
• Curiosity
4

MapReduce: Origins
• Functional Programming
• Higher-order functions that operate on lists
• map
• apply to each element of the list
• reduce = fold = accumulate
• aggregate a list and produce one value of output
• No side effects
5

MapReduce: Origins
• (define (1+ el) (+ el 1))
• (map 1+ (list 1 2 3)) ⇒ (list 2 3 4)
• (reduce + 0 (list 2 3 4)) ⇒ 9
• (reduce + 0 (map 1+ (list 1 2 3))) ⇒ 9
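The same idea (add one to each element of a list, then sum the results) can be sketched in Python. This is only an illustration: the built-in `map` and `functools.reduce` stand in for the Lisp functions, and `add_one` is a name chosen here for the increment function.

```python
from functools import reduce
from operator import add


def add_one(el):
    """Return el + 1, the increment function applied by map."""
    return el + 1


mapped = list(map(add_one, [1, 2, 3]))  # apply add_one to each element
total = reduce(add, mapped, 0)          # fold the mapped list with +, starting at 0

print(mapped)  # [2, 3, 4]
print(total)   # 9
```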
6

MapReduce: Origins
• These functions have no side effects
• So they can be parallelized easily
• Can split the input data into chunks:
• (list 1 2 3 4) ⇒ (list 1 2) and (list 3 4)
• Apply map to each chunk separately, and then combine (reduce) the results
7

MapReduce: Origins
• Mapping separately:
• (define res1 (reduce + 0 (map 1+ (list 1 2))))
• (reduce + res1 (map 1+ (list 3 4)))
• This is the same as (reduce + 0 (map 1+ (list 1 2 3 4)))
• Note that for reduce the function must be additive, i.e. associative like +, so that partial results can be combined
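A small Python sketch of the chunked computation, assuming the same increment-and-sum example: the first chunk is reduced starting from 0, and the second chunk continues from that partial result. The helper name `map_reduce` is made up for this illustration.

```python
from functools import reduce
from operator import add


def add_one(el):
    return el + 1


def map_reduce(chunk, initial):
    """Map add_one over one chunk, then fold the results into `initial`."""
    return reduce(add, map(add_one, chunk), initial)


res1 = map_reduce([1, 2], 0)         # first chunk, starting from 0
chunked = map_reduce([3, 4], res1)   # second chunk continues from res1
whole = map_reduce([1, 2, 3, 4], 0)  # the unsplit list

print(res1, chunked, whole)  # 5 14 14
```

Splitting works here because + is associative: the grouping of the partial sums does not change the total.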
8

MapReduce
• A map function
• takes a key-value pair (in_key, in_val)
• produces zero or more key-value pairs: intermediate results
• intermediate results are grouped by key
• A reduce function
• for each group in the intermediate results
• aggregates and produces the final output
9

MapReduce Stages
Each MapReduce job is executed in 3 stages:
• map stage: apply map to each key-value pair
• group together the intermediate results by key
• reduce stage: apply reduce to each group
10

MapReduce Stages
[Figure: four data sources each feed a map task; the grouped map output feeds three reduce tasks]
map: (in_key, in_val) -> [(out_key, out_val)]
reduce: (out_key, [out_val]) -> [res_val]
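The three stages can be sketched as a tiny in-memory engine in Python. This is a single-process illustration only, not how Hadoop is implemented; `run_job`, the sales example and its data are all hypothetical, and for brevity the reduce function here returns one value per key rather than a list.

```python
from collections import defaultdict


def run_job(records, map_fn, reduce_fn):
    """Run one in-memory MapReduce job over (in_key, in_val) records."""
    # Map stage: each input pair yields zero or more (out_key, out_val) pairs.
    intermediate = []
    for in_key, in_val in records:
        intermediate.extend(map_fn(in_key, in_val))

    # Group stage: collect all values that share an out_key.
    groups = defaultdict(list)
    for out_key, out_val in intermediate:
        groups[out_key].append(out_val)

    # Reduce stage: each group is aggregated independently of the others.
    return {key: reduce_fn(key, vals) for key, vals in groups.items()}


# Hypothetical example job: total sales per city.
def sales_map(order_id, order):
    city, amount = order
    return [(city, amount)]


def sales_reduce(city, amounts):
    return sum(amounts)


orders = [(1, ("Brussels", 10)), (2, ("Ghent", 5)), (3, ("Brussels", 7))]
print(run_job(orders, sales_map, sales_reduce))
# {'Brussels': 17, 'Ghent': 5}
```

Because each group is reduced on its own, the reduce calls could run on different machines, which is what the reduce boxes in the figure suggest.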
11

Lorem ipsum dolor sit amet, consectetur adipiscing elit. Aenean dictum justo est, quis
sagittis leo tincidunt sit amet. Donec scelerisque rutrum quam non sagittis. Phasellus sem
nisi, cursus eu lacinia eu, tempor ac eros. Class aptent taciti sociosqu ad litora torquent per
conubia nostra, per inceptos himenaeos. In mollis elit quis orci congue, quis aliquet mauris
mollis. Interdum et malesuada fames ac ante ipsum primis in faucibus.

Proin euismod non quam vitae pretium. Quisque vel nisl et leo volutpat rhoncus quis ac eros.
Sed lacus tellus, aliquam non ullamcorper in, dictum at magna. Vestibulum consequat
egestas lacinia. Proin tempus rhoncus mi, et lacinia elit ornare auctor. Sed sagittis euismod
massa ut posuere. Interdum et malesuada fames ac ante ipsum primis in faucibus. Duis
fringilla dolor ornare mi dictum ornare.
12

MapReduce Example
def map(String input_key, String doc):
  for each word w in doc:
    EmitIntermediate(w, 1)

def reduce(String output_key, Iterator output_vals):
  int res = 0
  for each v in output_vals:
    res += v
  Emit(res)
13

MapReduce Example

• map stage: output 1 for each word w
• group a list of (w, 1) pairs into (w, [1, 1, ..., 1])
• reduce stage: for each w calculate how many ones there are
14

MapReduce Example: Result
• amet: 2
• ante: 2
• aptent: 1
• consectetur: 1
• dictum: 3
• dolor: 2
• elit: 3
• ...
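The word count can be reproduced with a short Python sketch run on the sample text from the earlier slide. It assumes words are lowercased and split on anything that is not a letter; the slides do not say how the text was tokenized, and the names `map_fn`, `reduce_fn` and the key "slide-11" are chosen here for illustration.

```python
import re
from collections import defaultdict

TEXT = """\
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Aenean dictum justo est, quis
sagittis leo tincidunt sit amet. Donec scelerisque rutrum quam non sagittis. Phasellus sem
nisi, cursus eu lacinia eu, tempor ac eros. Class aptent taciti sociosqu ad litora torquent per
conubia nostra, per inceptos himenaeos. In mollis elit quis orci congue, quis aliquet mauris
mollis. Interdum et malesuada fames ac ante ipsum primis in faucibus.

Proin euismod non quam vitae pretium. Quisque vel nisl et leo volutpat rhoncus quis ac eros.
Sed lacus tellus, aliquam non ullamcorper in, dictum at magna. Vestibulum consequat
egestas lacinia. Proin tempus rhoncus mi, et lacinia elit ornare auctor. Sed sagittis euismod
massa ut posuere. Interdum et malesuada fames ac ante ipsum primis in faucibus. Duis
fringilla dolor ornare mi dictum ornare.
"""


def map_fn(input_key, doc):
    """Map: emit (w, 1) for each word w in doc."""
    for w in re.findall(r"[a-z]+", doc.lower()):
        yield (w, 1)


def reduce_fn(output_key, output_vals):
    """Reduce: count how many ones were emitted for one word."""
    res = 0
    for v in output_vals:
        res += v
    return res


# Group stage: (w, 1) pairs become (w, [1, 1, ..., 1]).
groups = defaultdict(list)
for w, one in map_fn("slide-11", TEXT):
    groups[w].append(one)

counts = {w: reduce_fn(w, ones) for w, ones in groups.items()}
for w in ["amet", "ante", "aptent", "consectetur", "dictum", "dolor", "elit"]:
    print(f"{w}: {counts[w]}")
```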
http://flickr.com/photos/erikeldridge/3614786392/

Hadoop
16

Hadoop
“... is a framework that allows for the distributed processing of large data
sets across clusters of computers using simple programming models. It is
designed to scale up from single servers to thousands of machines, each
offering local computation and storage. Rather than rely on hardware to
deliver high-availability, the library itself is designed to detect and handle
failures at the application layer, so delivering a highly-available service on
top of a cluster of computers, each of which may be prone to failures.”
17

Hadoop
• Open Source implementation of MapReduce
• "Hadoop":
• HDFS
• Hadoop MapReduce
• HBase
• Hive
• ... many others
18

Hadoop Cluster: Terminology
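• Name Node: the master node; keeps track of where data is stored in HDFS (in Hadoop 1.x the jobs themselves are scheduled by the JobTracker)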
• Workers: nodes that do the computation
• Mappers do the map phase
• Reducers do the reduce phase
19

Hadoop
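[Figure: data flow of a Hadoop job. Mapper side: Read file from HDFS, Map, Combine, write to mapper local storage. Reducer side: Copy (pull) map output to reducer local storage, Sort, Reduce, write result to HDFS.]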
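The Combine step in the figure can be illustrated with a small Python sketch: each mapper pre-aggregates its own output before anything is copied to the reducers, so fewer pairs have to be moved. The word-count job, the function names and the two input splits are assumptions made for this example.

```python
from collections import Counter


def map_words(doc):
    """Map: emit (word, 1) for each word of one input split."""
    return [(w, 1) for w in doc.split()]


def combine(pairs):
    """Combine: pre-aggregate pairs on the mapper before they are copied."""
    local = Counter()
    for word, n in pairs:
        local[word] += n
    return list(local.items())


def reduce_counts(all_pairs):
    """Reduce: sum the (already partially summed) counts per word."""
    total = Counter()
    for word, n in all_pairs:
        total[word] += n
    return dict(total)


splits = ["a b a a", "b b a"]         # two input splits, one per mapper
raw = [map_words(s) for s in splits]
combined = [combine(p) for p in raw]

print(sum(len(p) for p in raw))       # pairs without a combiner: 7
print(sum(len(p) for p in combined))  # pairs with a combiner: 4
final = reduce_counts(p for part in combined for p in part)
print(final)                          # {'a': 4, 'b': 3}
```

A combiner is only safe when the reduce operation is associative and commutative, as summation is here.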
20

http://escience.washington.edu/get-help-now/what-hadoop
27

Fault-Tolerance ≈ Load-Balancing
• No execution plan
• Node done ⇒ Another task assigned
• Node failed ⇒ Task reassigned
• No communication costs
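The "node done" and "node failed" rules can be sketched as a toy scheduler in Python. This is a simplified simulation, not Hadoop's actual scheduler: tasks are handed out round-robin, and the hypothetical `failing_worker` fails the first time it is given a task.

```python
from collections import deque


def run_tasks(tasks, workers, failing_worker=None):
    """Hand out tasks one at a time; requeue any task whose worker fails.

    Returns a dict mapping each task to the worker that completed it.
    """
    pending = deque(tasks)
    alive = list(workers)
    done = {}
    turn = 0
    while pending:
        if not alive:
            raise RuntimeError("no workers left")
        worker = alive[turn % len(alive)]
        task = pending.popleft()
        if worker == failing_worker:
            # Node failed: drop the node and put its task back in the queue.
            alive.remove(worker)
            pending.append(task)
            continue
        # Node done: record the result; the next iteration assigns another task.
        done[task] = worker
        turn += 1
    return done


result = run_tasks(["t1", "t2", "t3", "t4"], ["w1", "w2", "w3"],
                   failing_worker="w2")
print(result)
```

The same queue serves both purposes: idle nodes pull the next task (load balancing) and tasks from failed nodes simply return to it (fault tolerance).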
28

Advantages
• Simple, especially for programmers who know FP
• Fault tolerant
• No schema, can process any data
• Flexible
• Cheap and runs on commodity hardware
29

Disadvantages
• No declarative high-level language like SQL
• Performance issues:
• Map and Reduce are blocking
• Name Node: single point of failure
• It's young
30

Disadvantages

[Abouzeid, Azza et al 2009]
31

Hadoop as a Data Warehouse
• Cheetah
• Hive
32

Cheetah
• Typical DW relation-like schemas
• ... But not exactly
• They call it virtual views
33

Cheetah
34

Cheetah
• Virtual views consist of columns that can be queried
• Everything inside is entirely denormalized
• Append-only design and slowly changing dimensions
• Proprietary
35

Hive
• A data warehousing solution built by Facebook
• For Big data analysis:
• in 2010 (4 years ago!), 30+ PB
• Has its own data model
• HiveQL: a declarative SQL-like language for ad-hoc querying
36

HiveQL
Tables
STATUS_UPDATES(userid int, status string, ds string)
PROFILES(userid int, school string, gender int)

LOAD DATA LOCAL INPATH 'logs/status_updates'
INTO TABLE status_updates
PARTITION (ds='2009-03-20')
37

HiveQL
FROM
  (SELECT a.status, b.school, b.gender
   FROM status_updates a JOIN profiles b
   ON (a.userid = b.userid and a.ds='2009-03-20')) subq1
INSERT OVERWRITE TABLE gender_summary
  PARTITION (ds='2009-03-20')
  SELECT subq1.gender, count(1)
  GROUP BY subq1.gender
INSERT OVERWRITE TABLE school_summary
  PARTITION (ds='2009-03-20')
  SELECT subq1.school, count(1)
  GROUP BY subq1.school
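What this query computes can be sketched in plain Python: join status updates with profiles for one day, then count the joined rows once per gender and once per school. The sample rows below are invented for illustration; only the column names come from the table definitions.

```python
from collections import Counter

# Hypothetical sample rows; column order follows the table definitions.
status_updates = [  # (userid, status, ds)
    (1, "hello", "2009-03-20"),
    (2, "exams!", "2009-03-20"),
    (1, "old post", "2009-03-19"),
    (3, "hi", "2009-03-20"),
]
profiles = [  # (userid, school, gender)
    (1, "ULB", 0),
    (2, "ULB", 1),
    (3, "VUB", 1),
]

# subq1: join on userid, restricted to one partition (one day).
by_user = {userid: (school, gender) for userid, school, gender in profiles}
subq1 = [
    (status, *by_user[userid])
    for userid, status, ds in status_updates
    if ds == "2009-03-20" and userid in by_user
]

# The joined rows are computed once and feed both summaries,
# like the two INSERT clauses that share one FROM.
gender_summary = Counter(gender for _, _, gender in subq1)
school_summary = Counter(school for _, school, _ in subq1)

print(dict(gender_summary))  # {0: 1, 1: 2}
print(dict(school_summary))  # {'ULB': 2, 'VUB': 1}
```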
39

HiveQL
REDUCE subq2.school, subq2.meme, subq2.cnt
  USING 'top10.py' AS (school, meme, cnt)
FROM (
  SELECT subq1.school, subq1.meme, count(1) as cnt
  FROM
    (MAP b.school, a.status
     USING 'meme_extractor.py'
     AS (school, meme)
     FROM status_updates a JOIN profiles b
     ON (a.userid = b.userid)) subq1
  GROUP BY subq1.school, subq1.meme
  DISTRIBUTE BY school, meme
  SORT BY school, meme, cnt desc
) subq2
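The slides do not show the scripts themselves. As a rough idea of what a script such as 'top10.py' might contain, here is a hypothetical Python sketch. It assumes Hive hands rows to the script as tab-separated lines on standard input and reads tab-separated lines back from standard output; the function name `top_n` and the sample rows are made up.

```python
import heapq
from collections import defaultdict


def top_n(lines, n=10):
    """Yield the n most frequent memes per school as tab-separated lines."""
    per_school = defaultdict(list)
    for line in lines:
        school, meme, cnt = line.rstrip("\n").split("\t")
        per_school[school].append((int(cnt), meme))
    for school in sorted(per_school):
        # nlargest returns the entries with the highest counts first.
        for cnt, meme in heapq.nlargest(n, per_school[school]):
            yield f"{school}\t{meme}\t{cnt}"


# As a streaming script one would loop over sys.stdin instead:
#     for out in top_n(sys.stdin): print(out)
rows = ["ULB\tcat\t3", "ULB\tdog\t5", "VUB\tcat\t1"]
for out in top_n(rows, n=1):
    print(out)
```

With `n=1` the sample prints the single most frequent meme for each school.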
http://www.flickr.com/photos/mrflip/5150336351/in/photos

Hadoop + Data Warehouse
41

Hadoop + Data Warehouse
• Hadoop and Data Warehouses can co-exist
• DW: OLAP, BI, transactional data
• Hadoop: Raw, unstructured data
42

ETL
• Extract: load to HDFS, parse, prepare
• Run some analysis
• Transform: clean data and transform to some structured format
• with MapReduce
• Load: extract from HDFS, load to DW
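The Transform and Load steps can be sketched in Python on a single machine. On a cluster the transform would run as a MapReduce job, as the slide says; the raw log format, the field names and the CSV output below are assumptions made for this illustration.

```python
import csv
import io
import re

# Hypothetical raw log format: "<date> <customer id> <free text>".
LINE = re.compile(r"^(\d{4}-\d{2}-\d{2}) (\d+) (.*)$")


def transform(raw_lines):
    """Clean raw lines and turn them into (date, customer, text) rows."""
    for line in raw_lines:
        match = LINE.match(line.strip())
        if match is None:
            continue  # drop lines that do not parse
        date, customer, text = match.groups()
        yield (date, int(customer), text.strip().lower())


def load(rows):
    """Serialize rows as CSV, a format a DW bulk loader could ingest."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["date", "customer_id", "text"])
    writer.writerows(rows)
    return buf.getvalue()


raw = ["2013-12-15 42  Great Service ", "garbage", "2013-12-16 7 Too slow"]
print(load(transform(raw)))
```

Unparseable lines are dropped here for simplicity; a real pipeline would more likely route them to a reject file for inspection.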
43

ETL: examples
• Text processing
• Call center records analysis
• extract sentiment
• link to profile
• which customers are more important to keep?
• Image processing
44

Active Storage
• Don't delete the data after processing
• Hadoop storage is cheap: it can store anything
• Run more analysis when needed
• For example: extract new keywords/features from the old dataset
45

Active Storage - 2
• Up to 80% of data is dormant (or cold)
• Hadoop storage can be much cheaper than high-cost data management
solutions
• Move this data to Hadoop
• When needed, quickly analyze it there or move it back to the DW
46

⇒

Analytical Sandbox
http://www.flickr.com/photos/pasukaru76/9824401426/
http://www.flickr.com/photos/pasukaru76/4977447932/
49

Analytical Sandbox
• What are we looking for in this data?
• No structure - hard to know
• Run ad-hoc Hive queries to see what's there
50

Conclusions
• Hadoop is becoming more and more popular
• Many companies plan to adopt it
• Best used with existing DW solutions
• as an ETL tool
• as Active Storage
• as Analytical Sandbox
51

References
1. Lee, Kyong-Ha, et al. "Parallel data processing with MapReduce: a survey." ACM SIGMOD Record 40.4 (2012): 11-20.
[pdf]
2. "MapReduce vs Data Warehouse". Webpage, [link]. Accessed 15/12/2013.
3. Ordonez, Carlos, Il-Yeol Song, and Carlos Garcia-Alvarado. "Relational versus non-relational database systems for
data warehousing." Proceedings of the ACM 13th international workshop on Data warehousing and OLAP. ACM, 2010.
[pdf]
4. A. Awadallah, D. Graham. "Hadoop and the Data Warehouse: When to Use Which." (2011). [pdf] (by Cloudera and
Teradata)
5. Thusoo, Ashish, et al. "Hive: a warehousing solution over a map-reduce framework." Proceedings of the VLDB
Endowment 2.2 (2009): 1626-1629. [pdf]
6. Chen, Songting. "Cheetah: a high performance, custom data warehouse on top of MapReduce." Proceedings of the
VLDB Endowment 3.1-2 (2010): 1459-1468. [pdf]
52

References
7. "How (and Why) Hadoop is Changing the Data Warehousing Paradigm." Webpage [link]. Accessed 15/12/2013.
8. P. Russom. "Integrating Hadoop into Business Intelligence and Data Warehousing." (2013). [pdf]
9. M. Ferguson. "Offloading and Accelerating Data Warehouse ETL Processing Using Hadoop." [pdf]
10. Dean, Jeffrey, and Sanjay Ghemawat. "MapReduce: simplified data processing on large clusters." Communications of
the ACM 51.1 (2008): 107-113. [pdf]
11. "What is Hadoop?" Webpage [link]. Accessed 15/12/2013.
12. Apache Hadoop project home page, url: [link].
13. Apache HBase home page, [link].
14. Apache Mahout home page, [link].
15. "How Hadoop Cuts Big Data Costs" [link]. Accessed 05/01/2014.
16. "The Impact of Data Temperature on the Data Warehouse." whitepaper by Teradata (2012). [pdf]
17. Abouzeid, Azza, et al. "HadoopDB: an architectural hybrid of MapReduce and DBMS technologies for analytical
workloads." Proceedings of the VLDB Endowment 2.1 (2009): 922-933. [pdf]
Thank you
Prepared with Shower

Training state-of-the-art general text embeddingZilliz
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machinePadma Pradeep
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Manik S Magar
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 

Recently uploaded (20)

Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piece
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embedding
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 

Hadoop in Data Warehousing

  • 1. INFO-H-419: Data Warehouses project. Hadoop in Data Warehousing, by Alexey Grigorev
  • 2. Hadoop: In this Presentation 1. Introduction 2. Origins 3. MapReduce 4. Hadoop as MapReduce Implementation 5. Data Warehouse on Hadoop 6. Hadoop and Data Warehousing 7. Conclusions
  • 3. Why? • Lots of data • How to deal with it? • Hadoop to the rescue! • When to use? • When not to use? • Curiosity
  • 4. MapReduce: Origins • Functional Programming • Higher-order functions to operate on lists • map: apply to each element of the list • reduce = fold = accumulate: aggregate a list and produce one value of output • No side effects
  • 5. MapReduce: Origins
      (define (+1 el) (+ el 1))
      (map +1 (list 1 2 3)) ⇒ (list 2 3 4)
      (reduce + 0 (list 2 3 4)) ⇒ 9
      (reduce + 0 (map +1 (list 1 2 3))) ⇒ 9
  • 6. MapReduce: Origins • These functions do not have side effects • And can be parallelized easily • Can split the input data into chunks: (list 1 2 3 4) ⇒ (list 1 2) and (list 3 4) • Apply map to each chunk separately, and then combine (reduce) them together
  • 7. MapReduce: Origins • Mapping separately:
      (define res1 (reduce + 0 (map +1 (list 1 2))))
      (reduce + res1 (map +1 (list 3 4)))
    • This is the same as (reduce + 0 (map +1 (list 1 2 3 4))) • Note that for reduce the function must be additive
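The chunked evaluation on slides 5 to 7 can be checked with a minimal Python sketch. The names `inc`, `chunk_a`, and `chunk_b` are illustrative and not from the slides; `inc` stands in for the `+1` helper, and `functools.reduce` plays the role of `reduce`.

```python
from functools import reduce
from operator import add


def inc(x):
    """Stand-in for the slides' `+1` helper (name is an assumption)."""
    return x + 1


data = [1, 2, 3, 4]

# Sequential: map over the whole list, then reduce with + starting at 0.
whole = reduce(add, map(inc, data), 0)

# Chunked: reduce the first chunk on its own, then use that partial
# result as the starting value when reducing the second chunk.
chunk_a, chunk_b = data[:2], data[2:]
res1 = reduce(add, map(inc, chunk_a), 0)
chunked = reduce(add, map(inc, chunk_b), res1)

print(whole, chunked)
```

Both values agree because addition lets partial results be combined in any grouping; an operator without that property would not split across chunks this way.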
  • 8. MapReduce • A map function • takes a key-value pair (in_key, in_val) • produces zero or more key-value pairs: intermediate results • intermediate results are grouped by key • A reduce function • for each group in the intermediate results • aggregates and produces the final output
  • 9. MapReduce Stages: each MapReduce job is executed in 3 stages • map stage: apply map to each key-value pair • group together the intermediate results by key • reduce stage: apply reduce to each group
  • 11. Lorem ipsum dolor sit amet, consectetur adipiscing elit. Aenean dictum justo est, quis sagittis leo tincidunt sit amet. Donec scelerisque rutrum quam non sagittis. Phasellus sem nisi, cursus eu lacinia eu, tempor ac eros. Class aptent taciti sociosqu ad litora torquent per conubia nostra, per inceptos himenaeos. In mollis elit quis orci congue, quis aliquet mauris mollis. Interdum et malesuada fames ac ante ipsum primis in faucibus. Proin euismod non quam vitae pretium. Quisque vel nisl et leo volutpat rhoncus quis ac eros. Sed lacus tellus, aliquam non ullamcorper in, dictum at magna. Vestibulum consequat egestas lacinia. Proin tempus rhoncus mi, et lacinia elit ornare auctor. Sed sagittis euismod massa ut posuere. Interdum et malesuada fames ac ante ipsum primis in faucibus. Duis fringilla dolor ornare mi dictum ornare.
  • 12. MapReduce Example
      def map(String input_key, String doc):
        for each word w in doc:
          EmitIntermediate(w, 1)

      def reduce(String output_key, Iterator output_vals):
        int res = 0
        for each v in output_vals:
          res += v
        Emit(res)
  • 13. MapReduce Example • map stage: output (w, 1) for each word w • group a list of (w, 1) pairs into (w, [1, 1, ..., 1]) • reduce stage: for each w calculate how many ones there are
  • 14. MapReduce Example: Result • amet: 2 • ante: 2 • aptent: 1 • consectetur: 1 • dictum: 3 • dolor: 2 • elit: 3 • ...
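The word-count pseudocode and its three stages can be simulated on a single machine with a short Python sketch. The `run_mapreduce` driver, the sample `docs`, and the whitespace tokenization are assumptions made for illustration; unlike the slide's `reduce`, this version also returns the key so the result can be collected into a dictionary.

```python
from collections import defaultdict


def map_fn(input_key, doc):
    """Map stage: emit (w, 1) for every word w in the document."""
    for word in doc.split():  # naive whitespace tokenization (assumption)
        yield word, 1


def reduce_fn(output_key, output_vals):
    """Reduce stage: sum the ones collected for a single word."""
    return output_key, sum(output_vals)


def run_mapreduce(inputs, map_fn, reduce_fn):
    """Hypothetical single-process driver: map, group by key, reduce."""
    groups = defaultdict(list)
    for key, value in inputs:
        for out_key, out_val in map_fn(key, value):
            groups[out_key].append(out_val)  # group stage
    return dict(reduce_fn(k, vs) for k, vs in groups.items())


docs = [("doc1", "lorem ipsum dolor sit amet"), ("doc2", "dolor sit amet")]
counts = run_mapreduce(docs, map_fn, reduce_fn)
print(counts)
```

In a real cluster the group stage is a distributed shuffle and each call to `map_fn` or `reduce_fn` may run on a different node; the sketch only mirrors the data flow.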
  • 16. “Hadoop ... is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures.”
  • 17. Hadoop • Open Source implementation of MapReduce • "Hadoop": • HDFS • Hadoop MapReduce • HBase • Hive • ... many others
  • 18. Hadoop Cluster: Terminology • Name Node: orchestrates the process • Workers: nodes that do the computation • Mappers do the map phase • Reducers do the reduce phase
  • 27. Fault-Tolerance ≈ Load-Balancing • No execution plan • Node done ⇒ Another task assigned • Node failed ⇒ Task reassigned • No communication costs
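The idea that fault tolerance and load balancing come from the same mechanism can be sketched as a toy scheduler in Python. This is not Hadoop's scheduler: the round-robin worker order, the `fails_once` set, and all task and worker names are assumptions chosen to keep the simulation deterministic.

```python
from collections import deque


def run_tasks(tasks, workers, fails_once):
    """Toy pull-based scheduler with no up-front execution plan.

    `fails_once` is a hypothetical set of tasks whose first attempt fails.
    Returns a log of (worker, task, outcome) tuples.
    """
    pending = deque(tasks)
    already_failed = set()
    log = []
    turn = 0
    while pending:
        worker = workers[turn % len(workers)]  # next free worker (round-robin)
        turn += 1
        task = pending.popleft()
        if task in fails_once and task not in already_failed:
            already_failed.add(task)
            pending.append(task)  # node failed => task goes back to the queue
            log.append((worker, task, "failed"))
        else:
            log.append((worker, task, "done"))  # node done => it takes another task
    return log


log = run_tasks(["t1", "t2", "t3", "t4"], ["w1", "w2"], fails_once={"t2"})
for entry in log:
    print(entry)
```

Because workers simply take whatever is pending, a failed task is handled the same way as any other unfinished task, which is the sense in which the two properties coincide.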
  • 28. Advantages • Simple, especially for programmers who know FP • Fault tolerant • No schema, can process any data • Flexible • Cheap and runs on commodity hardware
  • 29. Disadvantages • No declarative high-level language like SQL • Performance issues: • Map and Reduce are blocking • Name Node: single point of failure • It's young
  • 31. Hadoop as a Data Warehouse • Cheetah • Hive
  • 32. Cheetah • Typical DW relation-like schemas • ... But not exactly • They call them virtual views
  • 34. Cheetah • Virtual views consist of columns that can be queried • Everything inside is entirely denormalized • Append-only design and slowly changing dimensions • Proprietary
  • 35. Hive • A data warehousing solution built by Facebook • For Big data analysis: • in 2010 (4 years ago!), 30+ PB • Has its own data model • HiveQL: a declarative SQL-like language for ad-hoc querying
  • 36. HiveQL Tables
      STATUS_UPDATES(userid int, status string, ds string)
      PROFILES(userid int, school string, gender int)

      LOAD DATA LOCAL INPATH 'logs/status_updates'
      INTO TABLE status_updates
      PARTITION (ds='2009-03-20')
  • 37. HiveQL
      FROM
        (SELECT a.status, b.school, b.gender
         FROM status_updates a JOIN profiles b
         ON (a.userid = b.userid and a.ds='2009-03-20')) subq1
      INSERT OVERWRITE TABLE gender_summary
        PARTITION (ds='2009-03-20')
        SELECT subq1.gender, count(1)
        GROUP BY subq1.gender
      INSERT OVERWRITE TABLE school_summary
        PARTITION (ds='2009-03-20')
        SELECT subq1.school, count(1)
        GROUP BY subq1.school
  • 38. HiveQL: the query from slide 37, repeated
  • 39. HiveQL
      REDUCE subq2.school, subq2.meme, subq2.cnt
        USING 'top10.py' AS (school, meme, cnt)
      FROM (
        SELECT subq1.school, subq1.meme, count(1) as cnt
        FROM
          (MAP b.school, a.status
           USING 'meme_extractor.py'
           AS (school, meme)
           FROM status_updates a JOIN profiles b
           ON (a.userid = b.userid)) subq1
        GROUP BY subq1.school, subq1.meme
        DISTRIBUTE BY school, meme
        SORT BY school, meme, cnt desc
      ) subq2
  • 41. Hadoop + Data Warehouse • Hadoop and Data Warehouses can co-exist • DW: OLAP, BI, transactional data • Hadoop: Raw, unstructured data
  • 42. ETL • Extract: load to HDFS, parse, prepare • Run some analysis • Transform: clean data and transform to some structured format • with MapReduce • Load: extract from HDFS, load to DW
  • 43. ETL: examples • Text processing • Call center records analysis • extract sentiment • link to profile • which customers are more important to keep? • Image processing
  • 44. Active Storage • Don't delete the data after processing • Hadoop storage is cheap: it can store anything • Run more analysis when needed • For example: extract new keywords/features from the old dataset
  • 45. Active Storage - 2 • Up to 80% of data is dormant (or cold) • Hadoop storage can be way cheaper than high-cost data management solutions • Move this data to Hadoop • When needed, quickly analyze it there or move it back to the DW
  • 49. Analytical Sandbox • What are we looking for in this data? • No structure: hard to know • Run ad-hoc Hive queries to see what's there
  • 50. Conclusions • Hadoop is becoming more and more popular • Many companies plan to adopt it • Best used with existing DW solutions • for ETL • as Active Storage • as an Analytical Sandbox
  • 51. References 1. Lee, Kyong-Ha, et al. "Parallel data processing with MapReduce: a survey." ACM SIGMOD Record 40.4 (2012): 11-20. 2. "MapReduce vs Data Warehouse." Webpage. Accessed 15/12/2013. 3. Ordonez, Carlos, Il-Yeol Song, and Carlos Garcia-Alvarado. "Relational versus non-relational database systems for data warehousing." Proceedings of the ACM 13th international workshop on Data warehousing and OLAP. ACM, 2010. 4. A. Awadallah, D. Graham. "Hadoop and the Data Warehouse: When to Use Which." (2011). (by Cloudera and Teradata) 5. Thusoo, Ashish, et al. "Hive: a warehousing solution over a map-reduce framework." Proceedings of the VLDB Endowment 2.2 (2009): 1626-1629. 6. Chen, Songting. "Cheetah: a high performance, custom data warehouse on top of MapReduce." Proceedings of the VLDB Endowment 3.1-2 (2010): 1459-1468.
  • 52. References 7. "How (and Why) Hadoop is Changing the Data Warehousing Paradigm." Webpage. Accessed 15/12/2013. 8. P. Russom. "Integrating Hadoop into Business Intelligence and Data Warehousing." (2013). 9. M. Ferguson. "Offloading and Accelerating Data Warehouse ETL Processing Using Hadoop." 10. Dean, Jeffrey, and Sanjay Ghemawat. "MapReduce: simplified data processing on large clusters." Communications of the ACM 51.1 (2008): 107-113. 11. "What is Hadoop?" Webpage. Accessed 15/12/2013. 12. Apache Hadoop project home page. 13. Apache HBase home page. 14. Apache Mahout home page. 15. "How Hadoop Cuts Big Data Costs." Accessed 05/01/2014. 16. "The Impact of Data Temperature on the Data Warehouse." Whitepaper by Teradata (2012). 17. Abouzeid, Azza, et al. "HadoopDB: an architectural hybrid of MapReduce and DBMS technologies for analytical workloads." Proceedings of the VLDB Endowment 2.1 (2009): 922-933.