HadoopDB

A brief introduction to a new approach for handling large amounts of data

Transcript

  • 1. HadoopDB Miguel Angel Pastor Olivar miguelinlas3 at gmail dot com http://miguelinlas3.blogspot.com http://twitter.com/miguelinlas3
  • 2. Contents
    • Introduction, objectives and background
    • HadoopDB architecture
    • Results
    • Conclusions
  • 6. Introduction
  • 7. General
    • Analytics are important today
    • The amount of data is exploding
    • The previous problem leads to shared-nothing architectures
    • Approaches:
      • Parallel databases
      • Map/Reduce systems
  • 12. Desired properties
    • Performance
      • Cheaper upgrades
      • Pricing model (cloud)
    • Fault tolerance
      • Transactional workloads: recovery
      • Analytics environments: do not restart queries
      • Problems at scale
  • 16. Desired properties
    • Heterogeneous environments
      • Increasing number of nodes
      • Homogeneity is difficult to maintain
    • Flexible query interface
      • BI tools usually use JDBC or ODBC
      • UDF mechanism
      • Both SQL and non-SQL interfaces are desirable
  • 20. Background: parallel databases
    • Standard relational tables and SQL
      • Indexing, compression, caching, I/O sharing
    • Tables partitioned over the nodes
      • Transparent to the user
      • Tailored query optimizer
    • Meet performance requirements
      • Highly skilled DBAs are needed
  • 22. Background: parallel databases
    • Flexible query interfaces
      • UDF support varies across implementations
    • Fault tolerance
      • Does not score so well
      • Assumption: failures are rare
      • Assumption: clusters have dozens of nodes
      • Engineering decisions
  • 26. Background: Map/Reduce
  • 27. Background: Map/Reduce
    • Satisfies fault tolerance
    • Works on heterogeneous environments
    • Drawback: performance
      • No previous modeling
      • No performance-enhancing techniques
    • Interfaces
      • Write M/R jobs in multiple languages
      • SQL not supported directly ( Hive )
  • 32. HadoopDB
  • 33. Ideas
    • Main goal: achieve the properties described before
    • Connect multiple single-node database systems
      • Hadoop is responsible for task coordination and the network layer
      • Queries are parallelized across the nodes
    • Fault tolerant and works on heterogeneous nodes
    • Parallel database performance
      • Query processing happens in the database engine
  • 37. Architecture background
    • Hadoop Distributed File System (HDFS)
      • Block-structured file system managed by a central node
      • Files are broken into blocks and distributed
    • Processing layer (Map/Reduce framework)
      • Master/slave architecture
      • Job and Task trackers
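The HDFS behavior described above can be sketched in a few lines of Python. The block size and round-robin placement here are simplifications: real HDFS also replicates each block and takes rack locality into account.

```python
def split_into_blocks(data, block_size):
    """Break a file's bytes into fixed-size blocks, as HDFS does
    before distributing them across data nodes."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def assign_round_robin(blocks, nodes):
    """Toy placement policy: spread the blocks over the data nodes."""
    return {i: nodes[i % len(nodes)] for i in range(len(blocks))}

blocks = split_into_blocks(b"0123456789", 4)   # 3 blocks: 4 + 4 + 2 bytes
placement = assign_round_robin(blocks, ["node-a", "node-b"])
```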
  • 40. Architecture
  • 41. Database connector module
    • Interface between the database and the task tracker
    • Responsibilities
      • Connect to the database
      • Execute the SQL query
      • Return the results as key-value pairs
    • Achieved goal
      • Data sources behave like data blocks in HDFS
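A minimal sketch of the connector's contract, using sqlite3 as a stand-in for the node-local PostgreSQL instance (the real connector is a Hadoop InputFormat implemented in Java):

```python
import sqlite3

def connector_records(conn, sql, key_col=0):
    """Run a SQL query and yield (key, value) pairs, the form in
    which HadoopDB's connector hands rows to a Map task."""
    for row in conn.execute(sql):
        yield row[key_col], row   # key plus the full tuple as the value

# Toy node-local database standing in for PostgreSQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Rankings (pageURL TEXT, pageRank INT)")
conn.executemany("INSERT INTO Rankings VALUES (?, ?)",
                 [("a.html", 3), ("b.html", 15), ("c.html", 42)])

pairs = list(connector_records(
    conn, "SELECT pageURL, pageRank FROM Rankings WHERE pageRank > 10"))
```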
  • 45. Catalog module
    • Metadata about the databases
      • Database location, driver class, credentials
      • Datasets in the cluster, replication and partitioning
    • The catalog is stored as an XML file in HDFS
    • Plan: deploy it as a separate service
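Reading such a catalog can be sketched with the standard library. The element and attribute names below are assumptions for illustration, not the actual HadoopDB catalog schema:

```python
import xml.etree.ElementTree as ET

# Hypothetical catalog layout -- the real HadoopDB schema may differ.
CATALOG_XML = """
<catalog>
  <node host="10.0.0.1" port="5432" driver="org.postgresql.Driver">
    <partition table="Rankings" chunk="0"/>
  </node>
  <node host="10.0.0.2" port="5432" driver="org.postgresql.Driver">
    <partition table="Rankings" chunk="1"/>
  </node>
</catalog>
"""

def chunks_for(table, xml_text):
    """Return the (host, chunk) pairs holding partitions of a table."""
    root = ET.fromstring(xml_text)
    out = []
    for node in root.findall("node"):
        for part in node.findall("partition"):
            if part.get("table") == table:
                out.append((node.get("host"), int(part.get("chunk"))))
    return out

locations = chunks_for("Rankings", CATALOG_XML)
```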
  • 48. Data loader module
    • Responsibilities:
      • Globally repartitioning the data
      • Breaking single-node data into chunks
      • Bulk-loading the data into the single-node chunks
    • Two main components:
      • Global Hasher
        • A Map/Reduce job that reads from HDFS and repartitions
      • Local Hasher
        • Copies from HDFS to the local file system
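The Global Hasher's core operation can be sketched as stable hash partitioning. The real component is a full Map/Reduce job; this shows only the partitioning rule it would apply:

```python
import hashlib

def partition(key, n_nodes):
    """Deterministic hash partitioning of a key to a node chunk."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % n_nodes

def repartition(rows, key_of, n_nodes):
    """Group rows into per-node chunks by hashing their partition key."""
    chunks = {i: [] for i in range(n_nodes)}
    for row in rows:
        chunks[partition(key_of(row), n_nodes)].append(row)
    return chunks

rows = [("a.html", 1), ("b.html", 2), ("a.html", 3)]
chunks = repartition(rows, lambda r: r[0], 4)
```

Because the hash is deterministic, rows sharing a key always land in the same chunk, which is what later makes node-local joins on that key possible.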
  • 51. SMS Planner module
    • SQL interface for the analyst, based on Hive
    • Steps
      • AST construction
      • The semantic analyzer connects to the catalog
      • DAG of relational operators
      • Optimizer restructuring
      • Convert the plan into M/R jobs
      • The M/R DAG is serialized into an XML plan
  • 58. SMS Planner extensions
    • Update the metastore with table references
    • Two phases before execution
      • Retrieve data fields to determine partitioning keys
      • Traverse the DAG bottom-up with a rule-based SQL generator
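One of the planner's push-down rules can be illustrated with a toy decision function (an illustration of the idea, not the actual Hive/HadoopDB code): when a query groups on the partitioning key, the whole aggregation can run inside the node-local databases; otherwise each node computes partial aggregates and a Reduce phase merges them.

```python
def plan_aggregation(group_key, partition_key, table="UserVisits"):
    """Toy SMS-planner rule: push the full GROUP BY into SQL when it
    matches the partitioning key; otherwise add a merging Reduce phase."""
    node_sql = (f"SELECT {group_key}, SUM(adRevenue) "
                f"FROM {table} GROUP BY {group_key}")
    if group_key == partition_key:
        return {"map_sql": node_sql, "reduce": None}   # map-only job
    return {"map_sql": node_sql,                       # partial aggregates
            "reduce": f"merge partial sums by {group_key}"}

plan = plan_aggregation("sourceIP", "sourceIP")
```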
  • 61. Benchmarking
  • 62. Environment
    • Amazon EC2 “large” instances
    • Each instance:
      • 7.5 GB of memory
      • 2 virtual cores
      • 850 GB of storage
      • 64-bit Linux Fedora 8
  • 67. Benchmarked systems
    • Hadoop
      • 256 MB data blocks
      • 1024 MB heap size
      • 200 MB sort buffer
    • HadoopDB
      • Configuration similar to Hadoop's
      • PostgreSQL 8.2.5
      • Data not compressed
  • 72. Benchmarked systems
    • Vertica
      • New parallel database (column store)
      • Used a cloud edition
      • All data is compressed
    • DBMS-X
      • Commercial parallel row store
      • Run on EC2 (no cloud edition available)
  • 76. Used data
    • HTTP log files, HTML pages, rankings
    • Sizes (per node):
      • 155 million user visits (~20 GB)
      • 18 million rankings (~1 GB)
      • Stored as plain text in HDFS
  • 80. Loading data
  • 81. Grep Task
  • 82. Selection Task
    • Query executed
      • SELECT pageURL, pageRank FROM Rankings WHERE pageRank > 10
  • 83. Aggregation Task
    • Smaller query
    SELECT SUBSTR(sourceIP, 1, 7), SUM(adRevenue) FROM UserVisits GROUP BY SUBSTR(sourceIP, 1, 7);
    • Larger query
    SELECT sourceIP, SUM(adRevenue) FROM UserVisits GROUP BY sourceIP
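Executed Hadoop-style, the smaller aggregation boils down to a map that emits the IP prefix with the revenue and a reduce that sums per key; a minimal sketch over toy rows:

```python
from collections import defaultdict

# Toy UserVisits rows: (sourceIP, adRevenue).
visits = [("10.1.2.3", 1.5), ("10.1.2.4", 2.0), ("77.0.0.1", 0.5)]

def map_phase(rows):
    """Emit (SUBSTR(sourceIP, 1, 7), adRevenue) pairs."""
    for ip, revenue in rows:
        yield ip[:7], revenue        # first seven characters of the IP

def reduce_phase(pairs):
    """SUM(adRevenue) grouped by the emitted key."""
    totals = defaultdict(float)
    for key, revenue in pairs:
        totals[key] += revenue
    return dict(totals)

result = reduce_phase(map_phase(visits))
```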
  • 84. Join Task
    • Query
    SELECT sourceIP, COUNT(pageRank), SUM(pageRank), SUM(adRevenue) FROM Rankings AS R, UserVisits AS UV WHERE R.pageURL = UV.destURL AND UV.visitDate BETWEEN '2000-01-15' AND '2000-01-22' GROUP BY UV.sourceIP;
    • Not exactly the same query in every system
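The join query amounts to a hash join on pageURL = destURL followed by a per-sourceIP aggregation; a sketch over toy rows (the date filter is omitted for brevity):

```python
from collections import defaultdict

# Toy tables: Rankings(pageURL, pageRank), UserVisits(sourceIP, destURL, adRevenue).
rankings = [("a.html", 10), ("b.html", 3)]
visits = [("10.0.0.1", "a.html", 1.0), ("10.0.0.2", "a.html", 2.5)]

def join_and_aggregate(rankings, visits):
    """Hash-join on pageURL = destURL, then aggregate per sourceIP:
    [COUNT(pageRank), SUM(pageRank), SUM(adRevenue)]."""
    rank_by_url = dict(rankings)                # build side
    per_ip = defaultdict(lambda: [0, 0, 0.0])
    for ip, url, revenue in visits:             # probe side
        if url in rank_by_url:
            agg = per_ip[ip]
            agg[0] += 1
            agg[1] += rank_by_url[url]
            agg[2] += revenue
    return dict(per_ip)

result = join_and_aggregate(rankings, visits)
```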
  • 85. UDF Aggregation Task
  • 86. Summary
    • HadoopDB approaches parallel databases in the absence of failures
      • PostgreSQL is not a column store
      • The DBMS-X numbers are 15% overly optimistic
      • No compression of the PostgreSQL data
    • Outperforms Hadoop
  • 89. Fault tolerance and heterogeneous environments
  • 90. Benchmarks
  • 91. Discussion
    • Vertica is faster
    • Reduce the number of nodes needed to achieve the same order of magnitude
    • Fault tolerance is important
  • 94. Conclusions
  • 95. Conclusion
    • Approaches parallel database performance with fault tolerance
    • PostgreSQL is not a column store
    • Hadoop and Hive are relatively new open source projects
    • HadoopDB is flexible and extensible
  • 99. References
  • 100. References
  • 105. That's all!