1

INFO-H-419: Data Warehouses project

Hadoop in Data Warehousing
by Alexey Grigorev
2

Hadoop: In this Presentation
1. Introduction
2. Origins
3. MapReduce
4. Hadoop as MapReduce Implementation
5. Data Warehouse on Hadoop
6. Hadoop and Data Warehousing
7. Conclusions
3

Why?
• Lots of data
• How to deal with it?
• Hadoop to the rescue!
• When to use?
• When not to use?
• Curiosity
4

MapReduce: Origins
• Functional Programming
• Higher-order functions that operate on lists
• map
  • apply a function to each element of the list
• reduce = fold = accumulate
  • aggregate a list and produce one value of output
• No side effects
5

MapReduce: Origins
• (define (+1 el) (+ el 1))
• (map +1 (list 1 2 3)) ⇒ (list 2 3 4)
• (reduce + 0 (list 2 3 4)) ⇒ 9
• (reduce + 0 (map +1 (list 1 2 3))) ⇒ 9
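For readers who do not know Lisp, the same computation can be sketched in Python (an illustrative addition, not part of the original slides), using the built-in map and functools.reduce:

    from functools import reduce

    def plus1(el):
        return el + 1                                     # (define (+1 el) (+ el 1))

    list(map(plus1, [1, 2, 3]))                           # => [2, 3, 4]
    reduce(lambda a, b: a + b, [2, 3, 4], 0)              # => 9
    reduce(lambda a, b: a + b, map(plus1, [1, 2, 3]), 0)  # => 9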
6

MapReduce: Origins
• These functions do not have side effects
• And can be parallelized easily
• Can split the input data into chunks:
  • (list 1 2 3 4) ⇒ (list 1 2) and (list 3 4)
• Apply map to each chunk separately, and then combine (reduce) them together
7

MapReduce: Origins
• Mapping separately:
  • (define res1 (reduce + 0 (map +1 (list 1 2))))
  • (reduce + res1 (map +1 (list 3 4)))
• This is the same as (reduce + 0 (map +1 (list 1 2 3 4)))
• Note that for reduce the combining function must be associative, so that partial results can be combined (a Python sketch of this chunked evaluation follows below)
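A minimal Python sketch of the chunked evaluation described above (illustrative only, assuming + as the combining function): each chunk is mapped and reduced independently, and the partial results are then combined; the answer matches processing the whole list at once.

    from functools import reduce

    def plus1(el):
        return el + 1

    def add(a, b):
        return a + b

    data = [1, 2, 3, 4]
    chunks = [[1, 2], [3, 4]]                      # split the input into chunks

    # map and reduce each chunk separately (this part could run in parallel)
    partials = [reduce(add, map(plus1, chunk), 0) for chunk in chunks]

    # combine the partial results; same as reducing over the whole mapped list
    assert reduce(add, partials, 0) == reduce(add, map(plus1, data), 0)   # both are 14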
8

MapReduce
• A map function
  • takes a key-value pair (in_key, in_val)
  • produces zero or more key-value pairs: intermediate results
• intermediate results are grouped by key
• A reduce function
  • for each group in the intermediate results
  • aggregates and produces the final output
9

MapReduce Stages
Each MapReduce job is executed in 3 stages:
• map stage: apply map to each key-value pair
• group stage: group the intermediate results together by key
• reduce stage: apply reduce to each group
(a minimal Python simulation of these stages is sketched below)
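A minimal Python simulation of the three stages (an illustrative sketch, not Hadoop's actual API); map_fn and reduce_fn follow the contracts from the previous slide:

    from collections import defaultdict

    def run_mapreduce(records, map_fn, reduce_fn):
        # map stage: apply map_fn to every input (key, value) pair
        intermediate = []
        for key, value in records:
            intermediate.extend(map_fn(key, value))

        # group stage: group the intermediate (key, value) pairs by key
        groups = defaultdict(list)
        for key, value in intermediate:
            groups[key].append(value)

        # reduce stage: apply reduce_fn to each group
        results = []
        for key, values in groups.items():
            results.extend(reduce_fn(key, values))
        return results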
10

MapReduce Stages
[Diagram: several data sources feed map tasks; the grouped intermediate results feed reduce tasks]

map:    (in_key, in_val) -> [(out_key, out_val)]
reduce: (out_key, [out_val]) -> [res_val]
11

Example input document (used for the word-count example on the following slides):
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Aenean dictum justo est, quis
sagittis leo tincidunt sit amet. Donec scelerisque rutrum quam non sagittis. Phasellus sem
nisi, cursus eu lacinia eu, tempor ac eros. Class aptent taciti sociosqu ad litora torquent per
conubia nostra, per inceptos himenaeos. In mollis elit quis orci congue, quis aliquet mauris
mollis. Interdum et malesuada fames ac ante ipsum primis in faucibus.

Proin euismod non quam vitae pretium. Quisque vel nisl et leo volutpat rhoncus quis ac eros.
Sed lacus tellus, aliquam non ullamcorper in, dictum at magna. Vestibulum consequat
egestas lacinia. Proin tempus rhoncus mi, et lacinia elit ornare auctor. Sed sagittis euismod
massa ut posuere. Interdum et malesuada fames ac ante ipsum primis in faucibus. Duis
fringilla dolor ornare mi dictum ornare.
12

MapReduce Example
def map(String input_key, String doc):
    for each word w in doc:
        EmitIntermediate(w, 1)

def reduce(String output_key, Iterator output_vals):
    int res = 0
    for each v in output_vals:
        res += v
    Emit(res)
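The same word-count job written as runnable Python (a sketch based on the pseudocode above, usable with the run_mapreduce simulation sketched earlier):

    def wc_map(doc_id, doc):
        # map: emit (word, 1) for every word in the document
        return [(word, 1) for word in doc.split()]

    def wc_reduce(word, ones):
        # reduce: sum the ones emitted for this word
        return [(word, sum(ones))]

    # e.g. run_mapreduce([("doc1", "Lorem ipsum dolor sit amet ...")], wc_map, wc_reduce)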
13

MapReduce Example

• map stage: output (w, 1) for each word w
• group stage: group the list of (w, 1) pairs into (w, [1, 1, ..., 1])
• reduce stage: for each w, calculate how many ones there are
14

MapReduce Example: Result
• amet: 2
• ante: 2
• aptent: 1
• consectetur: 1
• dictum: 3
• dolor: 2
• elit: 3
• ...
http://flickr.com/photos/erikeldridge/3614786392/

Hadoop
16

Hadoop
“... is a framework that allows for the distributed processing of large data
sets across clusters of computers using simple programming models. It is
designed to scale up from single servers to thousands of machines, each
offering local computation and storage. Rather than rely on hardware to
deliver high-availability, the library itself is designed to detect and handle
failures at the application layer, so delivering a highly-available service on
top of a cluster of computers, each of which may be prone to failures.”
17

Hadoop
• Open Source implementation of MapReduce
• "Hadoop":
• HDFS
• Hadoop MapReduce
• HBase
• Hive
• ... many others
18

Hadoop Cluster: Terminology
• Name Node: orchestrates the process
• Workers: nodes that do the computation
• Mappers do the map phase
• Reducers do the reduce phase
19

Hadoop
[Diagram of a MapReduce job on a Hadoop cluster: the mapper reads the file from HDFS (Read), applies Map and a local Combine, and writes to its local storage; the reducer copies (Copy) and sorts (Sort) that intermediate data, applies Reduce, and the result is pulled from HDFS]
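The Combine step in this figure is a mapper-side, local pre-aggregation. A hedged Python sketch of the idea (not Hadoop's API): when the reduce operation is a commutative, associative aggregation such as a sum, applying it locally shrinks the intermediate data that has to be copied to the reducers.

    from collections import defaultdict

    def combine(mapper_output):
        # locally pre-aggregate (word, count) pairs before they leave the mapper
        local = defaultdict(int)
        for word, count in mapper_output:
            local[word] += count
        return list(local.items())

    # combine([("dolor", 1), ("sit", 1), ("dolor", 1)])  =>  [("dolor", 2), ("sit", 1)]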
20

http://escience.washington.edu/get-help-now/what-hadoop
27

Fault-Tolerance ≈ Load-Balancing
• No execution plan
• Node failed ⇒ task reassigned
• Node done ⇒ another task assigned
• No communication costs
28

Advantages
• Simple, especially for programmers who know FP
• Fault tolerant
• No schema, can process any data
• Flexible
• Cheap and runs on commodity hardware
29

Disadvantages
• No declarative high-level language like SQL
• Performance issues:
• Map and Reduce are blocking
• Name Node: single point of failure
• It's young
30

Disadvantages

[Figure from Abouzeid, Azza et al. 2009]
31

Hadoop as a Data Warehouse
• Cheetah
• Hive
32

Cheetah
• Typical DW relation-like schemas
• ... But not exactly
• They call it virtual views
33

Cheetah
34

Cheetah
• Virtual views consist of columns that can be queried
• Everything inside is entirely denormalized
• Append-only design and slowly changing dimensions
• Proprietary
35

Hive
• A data warehousing solution built by Facebook
• For Big data analysis:
• in 2010 (4 years ago!), 30+ PB
• Has its own data model
• HiveQL: a declarative SQL-like language for ad-hoc querying
36

HiveQL
Tables
STATUS_UPDATES(userid int, status string, ds string)
PROFILES(userid int, school string, gender int)

LOAD DATA LOCAL INPATH '/logs/status_updates'
INTO TABLE status_updates
PARTITION (ds='2009-03-20')
37

HiveQL
FROM (SELECT a.status, b.school, b.gender
      FROM status_updates a JOIN profiles b
           ON (a.userid = b.userid and a.ds='2009-03-20')
     ) subq1
INSERT OVERWRITE TABLE gender_summary
  PARTITION(ds='2009-03-20')
SELECT subq1.gender, COUNT(1)
GROUP BY subq1.gender
INSERT OVERWRITE TABLE school_summary
  PARTITION(ds='2009-03-20')
SELECT subq1.school, COUNT(1)
GROUP BY subq1.school
38

HiveQL
(the same multi-insert query as on the previous slide)
39

HiveQL
REDUCE subq2.school, subq2.meme, subq2.cnt
  USING 'top10.py' AS (school, meme, cnt)
FROM (
  SELECT subq1.school, subq1.meme, COUNT(1) AS cnt
  FROM (
    MAP b.school, a.status
    USING 'meme_extractor.py' AS (school, meme)
    FROM status_updates a JOIN profiles b
         ON (a.userid = b.userid)
  ) subq1
  GROUP BY subq1.school, subq1.meme
  DISTRIBUTE BY school, meme
  SORT BY school, meme, cnt desc
) subq2
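The MAP ... USING and REDUCE ... USING clauses stream rows through external scripts: Hive serializes each row as a tab-separated line on the script's stdin and reads tab-separated lines back from its stdout. A minimal sketch of what a script like 'meme_extractor.py' could look like (the actual extraction logic is not given in the slides; the hashtag rule below is a placeholder assumption):

    import sys

    # reads tab-separated (school, status) rows from stdin and
    # writes tab-separated (school, meme) rows to stdout
    for line in sys.stdin:
        school, status = line.rstrip("\n").split("\t", 1)
        for token in status.split():
            if token.startswith("#"):          # placeholder: treat hashtags as memes
                print(school + "\t" + token)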
http://www.flickr.com/photos/mrflip/5150336351/in/photos

Hadoop + Data Warehouse
41

Hadoop + Data Warehouse
• Hadoop and Data Warehouses can co-exist
• DW: OLAP, BI, transactional data
• Hadoop: Raw, unstructured data
42

ETL
• Extract: load to HDFS, parse, prepare
• Run some analysis
• Transform: clean the data and convert it to a structured format
• with MapReduce (a small mapper sketch follows this list)
• Load: extract from HDFS, load to DW
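As an illustration of the Transform step (a sketch, not from the original slides), a Hadoop Streaming-style mapper that turns raw JSON log lines into clean, tab-separated records ready to load into the warehouse; the input field names are assumptions:

    import sys
    import json

    # Hadoop Streaming-style mapper: raw JSON log lines in, clean TSV records out
    for line in sys.stdin:
        try:
            event = json.loads(line)
            print("\t".join([str(event["user_id"]), event["ts"], event["event"]]))
        except (ValueError, KeyError):
            pass   # drop malformed records as part of cleaning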
43

ETL: examples
• Text processing
• Call center records analysis
• extract sentiment
• link to profile
• which customers are more important to keep?
• Image processing
44

Active Storage
• Don't delete the data after processing
• Hadoop storage is cheap: it can store anything
• Run more analysis when needed
• For example: extract new keywords/features from the old dataset
45

Active Storage - 2
• Up to 80% of data is dormant (or cold)
• Hadoop storage can be way cheaper than high-cost data management
solutions
• Move this data to Hadoop
• When needed, quickly analyze it there or move it back to the DW
46


Analytical Sandbox
http://www.flickr.com/photos/pasukaru76/9824401426/
http://www.flickr.com/photos/pasukaru76/4977447932/
49

Analytical Sandbox
• What are we looking for in this data?
• No structure - hard to know
• Run ad-hoc Hive queries to see what's there
50

Conclusions
• Hadoop is becoming more and more popular
• Many companies plan to adopt it
• Best used alongside existing DW solutions
• as an ETL
• as Active Storage
• as Analytical Sandbox
51

References
1. Lee, Kyong-Ha, et al. "Parallel data processing with MapReduce: a survey." ACM SIGMOD Record 40.4 (2012): 11-20.
[pdf]
2. "MapReduce vs Data Warehouse". Webpage, [link]. Accessed 15/12/2013.
3. Ordonez, Carlos, Il-Yeol Song, and Carlos Garcia-Alvarado. "Relational versus non-relational database systems for
data warehousing." Proceedings of the ACM 13th international workshop on Data warehousing and OLAP. ACM, 2010.
[pdf]
4. A. Awadallah, D. Graham. "Hadoop and the Data Warehouse: When to Use Which." (2011). [pdf] (by Cloudera and
Teradata)
5. Thusoo, Ashish, et al. "Hive: a warehousing solution over a map-reduce framework." Proceedings of the VLDB
Endowment 2.2 (2009): 1626-1629. [pdf]
6. Chen, Songting. "Cheetah: a high performance, custom data warehouse on top of MapReduce." Proceedings of the
VLDB Endowment 3.1-2 (2010): 1459-1468. [pdf]
52

References
7. "How (and Why) Hadoop is Changing the Data Warehousing Paradigm." Webpage [link]. Accessed 15/12/2013.
8. P. Russom. "Integrating Hadoop into Business Intelligence and Data Warehousing." (2013). [pdf]
9. M. Ferguson. "Offloading and Accelerating Data Warehouse ETL Processing Using Hadoop." [pdf]
10. Dean, Jeffrey, and Sanjay Ghemawat. "MapReduce: simplified data processing on large clusters." Communications of
the ACM 51.1 (2008): 107-113. [pdf]
11. "What is Hadoop?" Webpage [link]. Accessed 15/12/2013.
12. Apache Hadoop project home page, url: [link].
13. Apache HBase home page, [link].
14. Apache Mahout home page, [link].
15. "How Hadoop Cuts Big Data Costs" [link]. Accessed 05/01/2014.
16. "The Impact of Data Temperature on the Data Warehouse." whitepaper by Terradata (2012). [pdf]
17. Abouzeid, Azza, et al. "HadoopDB: an architectural hybrid of MapReduce and DBMS technologies for analytical
workloads." Proceedings of the VLDB Endowment 2.1 (2009): 922-933. [pdf]
Thank you
Prepared with Shower
