
Hadoop Hive Talk At IIT-Delhi

Talk at the CS department at IIT Delhi, 04/02/09.


  1. Hadoop and Hive
     Large Scale Data Processing using Commodity HW/SW
     Joydeep Sen Sarma
  2. Outline
     - Introduction
     - Hadoop
     - Hive
     - Hadoop/Hive Usage @ Facebook
     - Wishlists/Projects
     - Questions
  3. Data and Computing Trends
     - Explosion of data
       - Web logs, ad-server logs, sensor networks, seismic data, DNA sequences (?)
       - User-generated content / Web 2.0
       - Data as BI => data as product (Search, Ads, Digg, Quantcast, ...)
     - Declining revenue/GB
       - Milk @ $3/gallon => $15M/GB
       - Ads @ 20c / 10^6 impressions => $1/GB
       - Google Analytics, Facebook Lexicon == free!
     - Hardware trends
       - Commodity rocks: $4K 1U box = 8 cores + 16GB mem + 4x1TB
       - CPU: SMP -> NUMA; Storage: $ shared-nothing << $ shared; Networking: Ethernet
     - Software trends
       - Open source, SaaS
       - LAMP, compute on demand (EC2)
  4. Hadoop
     - Parallel computing platform
       - Distributed filesystem (HDFS)
       - Parallel processing model (Map/Reduce)
       - Express computation in any language
       - Job execution for Map/Reduce jobs (scheduling + localization + retries/speculation)
     - Open source
       - Most popular Apache project!
       - Highly extensible Java stack (at the expense of efficiency)
       - Develop/test on EC2!
     - Ride the commodity curve:
       - Cheap (but reliable) shared-nothing storage
       - Data-local computing (no need for high-speed networks)
       - Highly scalable (at the expense of efficiency)
  5. Looks like this ..
     [Cluster diagram: nodes with local disks; 1 Gigabit and 4-8 Gigabit network links. Node = DataNode + Map-Reduce]
  6. HDFS
     - Separation of metadata from data
       - Metadata == inodes, attributes, block locations, block replication
     - File = Σ data blocks (typically 128MB)
       - Architected for large files and streaming reads
     - Highly reliable
       - Each data block typically replicated 3x to different datanodes
       - Clients compute and verify block checksums (end-to-end)
     - Single namenode
       - All metadata stored in memory; passive standby
     - Client talks to both namenode and datanodes
       - Bulk data flows from datanode to client => linear scalability
       - Custom client library in Java/C/Thrift
       - Not POSIX, not NFS
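A quick illustrative sketch (plain Python, not HDFS code) of the block model on this slide: a file is the sum of fixed-size blocks, and each block is replicated to distinct datanodes. The round-robin placement policy here is an assumption for illustration only, not HDFS's actual rack-aware placement.

```python
# Illustrative sketch (not Hadoop code): a file is split into fixed-size
# blocks, and each block is replicated to distinct datanodes.
BLOCK_SIZE = 128 * 1024 * 1024  # 128MB, as on the slide
REPLICATION = 3                 # typical replication factor

def block_count(file_size):
    # A file is the sum of its data blocks; the last block may be partial.
    return (file_size + BLOCK_SIZE - 1) // BLOCK_SIZE

def place_blocks(file_size, datanodes):
    # Assign each block to REPLICATION distinct datanodes (round-robin
    # here for simplicity; real HDFS placement is rack-aware).
    placements = []
    for b in range(block_count(file_size)):
        replicas = [datanodes[(b + r) % len(datanodes)]
                    for r in range(REPLICATION)]
        placements.append(replicas)
    return placements

# A 300MB file occupies 3 blocks (128 + 128 + 44MB), 9 block replicas total.
plan = place_blocks(300 * 1024 * 1024, ["dn1", "dn2", "dn3", "dn4"])
```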
  7. In pictures ..
     [Diagram: NameNode (32GB RAM, disks) with a Secondary NameNode; the DFS client asks the NameNode for block locations (getLocations -> locations), then transfers data directly with the DataNodes]
  8. Map and Reduce
     - Map function:
       - Applied to input data
       - Emits reduction key and value
     - Reduce function:
       - Applied to data grouped by reduction key
       - Often 'reduces' data (for example, sum(values))
     - Hadoop groups data by sorting
     - User can choose to apply reductions multiple times
       - Combiner
     - Partitioning, sorting, grouping are different concepts
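The model above can be sketched in a few lines of plain Python (not the Hadoop API): map emits (key, value) pairs, the framework groups them by sorting on the key, and reduce is applied once per group. The word-count map and reduce functions are illustrative choices.

```python
# Minimal sketch of the Map/Reduce model: map emits (key, value) pairs,
# the framework groups by sorting on key, reduce is applied per group.
from itertools import groupby
from operator import itemgetter

def map_fn(record):
    # Emit (word, 1) for every word in the input line.
    for word in record.split():
        yield (word, 1)

def reduce_fn(key, values):
    # 'Reduces' the grouped values, here by summing counts.
    return (key, sum(values))

def run_mapreduce(records):
    pairs = [kv for rec in records for kv in map_fn(rec)]
    pairs.sort(key=itemgetter(0))          # Hadoop groups data by sorting
    return [reduce_fn(k, (v for _, v in group))
            for k, group in groupby(pairs, key=itemgetter(0))]

result = run_mapreduce(["a b a", "b a"])
# result == [("a", 3), ("b", 2)]
```

A combiner is just `reduce_fn` applied to each mapper's local output before the shuffle, which works here because summing is associative.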
  9. Programming with Map/Reduce
     - Find the most imported package in the Hive source:
       $ find . -name '*.java' -exec egrep '^import' '{}' \; | awk '{print $2}' | sort | uniq -c | sort -nr +0 -1 | head -1
         208 org.apache.commons.logging.LogFactory;
     - In Map-Reduce:
       - 1a. Map using: egrep '^import' | awk '{print $2}'
       - 1b. Reduce on first column (package name)
       - 1c. Reduce function: uniq -c
       - 2a. Map using: awk '{printf "%05d %s\n", 100000-$1, $2}'
       - 2b. Reduce on first column (inverse counts), 1 reducer
       - 2c. Reduce function: identity
     - Scales to terabytes
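The two-job pipeline above can be simulated in plain Python: job 1 counts imports per package, and job 2 re-keys by an inverted, zero-padded count so that the sort-based shuffle delivers the most-imported package first. The three sample input lines are made up for illustration.

```python
# Sketch of the two-stage map/reduce pipeline. Job 1 counts imports per
# package; job 2 re-keys on an inverted count so sorting yields the top
# package first (with one reducer and an identity reduce function).
from itertools import groupby

lines = [
    "import org.apache.commons.logging.LogFactory;",
    "import java.util.List;",
    "import org.apache.commons.logging.LogFactory;",
]

# Job 1: map extracts the package name; the sort groups equal keys,
# and the reduce function plays the role of `uniq -c`.
job1 = sorted(line.split()[1] for line in lines if line.startswith("import"))
counts = [(pkg, sum(1 for _ in grp)) for pkg, grp in groupby(job1)]

# Job 2: map emits a zero-padded inverted count as the key, so the
# single reducer sees the largest count first; reduce is the identity.
job2 = sorted(("%05d" % (100000 - n), pkg) for pkg, n in counts)
most_imported = job2[0][1]
# most_imported == "org.apache.commons.logging.LogFactory;"
```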
  10. Map/Reduce Data Flow
  11. Why Hive?
     - Large installed base of SQL users
       - i.e. map-reduce is for ultra-geeks
       - much, much easier to write a SQL query
     - Analytic SQL queries translate really well to map-reduce
     - Files are an insufficient data management abstraction
       - Tables, schemas, partitions, indices
       - Metadata allows optimization, discovery, browsing
     - Love the programmability of Hadoop
     - Hate that RDBMSs are closed
       - Why not work on data in any format?
       - Complex data types are the norm
  12. Rubbing it in ..
     hive> select key, count(1) from kv1 where key > 100 group by key;

     vs.

     $ cat > /tmp/reducer.sh
     uniq -c | awk '{print $2" "$1}'
     $ cat > /tmp/map.sh
     awk -F '\001' '{if($1 > 100) print $1}'
     $ bin/hadoop jar contrib/hadoop-0.19.2-dev-streaming.jar -input /user/hive/warehouse/kv1 -mapper map.sh -file /tmp/reducer.sh -file /tmp/map.sh -reducer reducer.sh -output /tmp/largekey -numReduceTasks 1
     $ bin/hadoop dfs -cat /tmp/largekey/part*
  13. Hive: Components
     [Diagram: Hive CLI and a management web UI accepting DDL, queries and browsing; a HiveQL parser, planner and execution engine on top of HDFS and Map Reduce; a MetaStore with a Thrift API; a SerDe layer supporting Thrift, Jute, JSON, ..]
  14. Data Model
     [Diagram: the MetaStore maps tables to HDFS locations and holds the schema library, partitioning columns and bucketing info (#buckets = 32). Logical partitioning: /hive/clicks, /hive/clicks/ds=2008-03-25; hash partitioning within a partition: /hive/clicks/ds=2008-03-25/0, ...]
  15. Hive Query Language
     - Basic SQL
       - FROM clause subquery
       - ANSI JOIN (equi-join only)
       - Multi-table insert
       - Multi group-by
       - Sampling
       - Object traversal
     - Extensibility
       - Pluggable map-reduce scripts using TRANSFORM
  16. Running Custom Map/Reduce Scripts
     FROM (
       FROM pv_users
       SELECT TRANSFORM(pv_users.userid, pv_users.date) USING 'map_script'
       AS (dt, uid)
       CLUSTER BY (dt)) map
     INSERT INTO TABLE pv_users_reduced
       SELECT TRANSFORM(map.dt, map.uid) USING 'reduce_script' AS (date, count);
  17. Hive QL - Join
     - SQL:
       INSERT INTO TABLE pv_users
       SELECT pv.pageid, u.age
       FROM page_view pv JOIN user u ON (pv.userid = u.userid);

     page_view:                        user:
     pageid  userid  time              userid  age  gender
     1       111     9:08:01           111     25   female
     2       111     9:08:13           222     32   male
     1       222     9:08:14

     pv_users (result):
     pageid  age
     1       25
     2       25
     1       32
  18. Hive QL - Join in Map Reduce
     Map: rows of both tables are emitted keyed on userid, tagged with a table id:
       page_view -> 111:<1, 1>, 111:<1, 2>, 222:<1, 1>    (tag 1 = page_view, value = pageid)
       user      -> 111:<2, 25>, 222:<2, 32>              (tag 2 = user, value = age)
     Shuffle/Sort: rows with the same key land in one reduce group:
       key 111: <1, 1>, <1, 2>, <2, 25>
       key 222: <1, 1>, <2, 32>
     Reduce: the cross product of the two tags per key yields pv_users rows:
       (pageid, age) = (1, 25), (2, 25) and (1, 32)
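The reduce-side equi-join above can be sketched in plain Python (not Hive internals): both tables are mapped to (join key, (table tag, value)) pairs, and after grouping by key the reducer cross-products the two sides. The data is the sample from the slide.

```python
# Sketch of a reduce-side equi-join: tag each table's rows, group by the
# join key (the shuffle), then cross-product the two sides per key.
from collections import defaultdict

page_view = [(1, 111), (2, 111), (1, 222)]      # (pageid, userid)
user = [(111, 25), (222, 32)]                   # (userid, age)

# Map phase: tag 1 = page_view, tag 2 = user, keyed on userid.
shuffled = defaultdict(list)
for pageid, userid in page_view:
    shuffled[userid].append((1, pageid))
for userid, age in user:
    shuffled[userid].append((2, age))

# Reduce phase: per key, join every page_view value with every user value.
pv_users = []
for key, values in sorted(shuffled.items()):
    pages = [v for tag, v in values if tag == 1]
    ages = [v for tag, v in values if tag == 2]
    pv_users.extend((p, a) for p in pages for a in ages)
# pv_users == [(1, 25), (2, 25), (1, 32)]
```

Buffering one side per key is why slide 21 mentions making the Cartesian product more memory-efficient.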
  19. Joins
     - Outer joins:
       INSERT INTO TABLE pv_users
       SELECT pv.*, u.gender, u.age
       FROM page_view pv FULL OUTER JOIN user u ON (pv.userid = u.id)
       WHERE pv.date = '2008-03-03';
  20. Hive Optimizations - Merge Sequential Map Reduce Jobs
     - SQL:
       FROM (a JOIN b ON a.key = b.key) JOIN c ON a.key = c.key SELECT ...
     [Diagram: a first map-reduce joins A (key, av) with B (key, bv) into AB (key, av, bv); a second joins AB with C (key, cv) into ABC (key, av, bv, cv). Since both joins share the same key, the two jobs can be merged into one.]
  21. Join To Map Reduce
     - Only equality joins with conjunctions supported
     - Future
       - Pruning of values sent from map to reduce on the basis of projections
       - Make Cartesian product more memory-efficient
       - Map-side joins
         - Hash joins if one of the tables is very small
         - Exploit pre-sorted data by doing a map-side merge join
     - Estimate number of reducers
       - Hard to measure the effect of filters
       - Run the map side on a small part of the input to estimate #reducers
  22. Hive QL - Group By
     SELECT pageid, age, count(1)
     FROM pv_users
     GROUP BY pageid, age;

     pv_users:                result:
     pageid  age              pageid  age  count
     1       25               1       25   1
     2       25               2       25   2
     1       32               1       32   1
     2       25
  23. Hive QL - Group By in Map Reduce
     Map: each mapper emits (<pageid, age>, 1) per row:
       mapper 1 ((1,25), (2,25)) -> <1,25>:1, <2,25>:1
       mapper 2 ((1,32), (2,25)) -> <1,32>:1, <2,25>:1
     Shuffle/Sort: pairs with the same <pageid, age> key meet at one reducer:
       reducer 1: <1,25>:1, <1,32>:1
       reducer 2: <2,25>:1, <2,25>:1
     Reduce: sum the values per key:
       (1, 25, 1), (1, 32, 1) and (2, 25, 2)
  24. Hive QL - Group By with Distinct
     SELECT pageid, COUNT(DISTINCT userid)
     FROM page_view GROUP BY pageid;

     page_view:                        result:
     pageid  userid  time              pageid  count_distinct_userid
     1       111     9:08:01           1       2
     2       111     9:08:13           2       1
     1       222     9:08:14
     2       111     9:08:20
  25. Hive QL - Group By with Distinct in Map Reduce
     Map: emit composite keys <pageid, userid>:
       mapper 1 -> <1,111>, <2,111>
       mapper 2 -> <1,222>, <2,111>
     Shuffle/Sort on the full <pageid, userid> key: duplicates become adjacent,
     so each reducer counts distinct userids per pageid in one pass:
       reducer 1: <1,111>, <1,222>            -> pageid 1, count 2
       reducer 2: <2,111>, <2,111> (duplicate) -> pageid 2, count 1
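The trick above, sketched in plain Python: emitting the composite key <pageid, userid> makes the shuffle's sort place equal userids next to each other, so the reducer only needs to skip repeats of the previous value to count distincts.

```python
# Sketch of distinct counting via the shuffle: sort on the composite
# key so duplicates become adjacent, then count unique userids per
# pageid in a single streaming pass over each group.
from itertools import groupby
from operator import itemgetter

page_view = [(1, 111), (2, 111), (1, 222), (2, 111)]  # (pageid, userid)

# Map + shuffle: sort on the full composite <pageid, userid> key.
keys = sorted(page_view)

# Reduce: within a pageid group, equal userids are now adjacent,
# so a distinct count only needs to skip repeats of the previous value.
result = []
for pageid, group in groupby(keys, key=itemgetter(0)):
    distinct, prev = 0, None
    for _, userid in group:
        if userid != prev:
            distinct += 1
            prev = userid
    result.append((pageid, distinct))
# result == [(1, 2), (2, 1)]
```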
  26. Group By: Future Optimizations
     - Map-side partial aggregations
       - Hash-based aggregates (done)
       - Serialized key/values in hash tables
       - Exploit pre-sorted data for distinct counts
     - Partial aggregations in the combiner
     - Be smarter about how to avoid multiple stages (done)
     - Exploit table/column statistics for deciding strategy
  27. Dealing with Structured Data
     - Type system
       - Primitive types
       - Recursively built up using composition / maps / lists
     - ObjectInspector interface for user-defined types
       - To recursively list the schema
       - To recursively access fields within a row object
     - Generic (de)serialization interface (SerDe)
     - Serialization families implement the interface
       - Thrift DDL-based SerDe
       - Delimited-text-based SerDe
       - You can write your own SerDe (XML, JSON, ...)
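A conceptual sketch of the SerDe idea (in Python, not Hive's actual Java interface): a serializer and deserializer paired behind one generic interface, here for ctrl-A-delimited text with a fixed schema. The class name and column names are made up for illustration.

```python
# Conceptual sketch of a SerDe: one object that can both turn raw
# storage bytes into a row object (deserialize) and a row object back
# into bytes (serialize), driven by a schema. Not Hive's Java API.
class DelimitedSerDe:
    def __init__(self, columns, delimiter="\001"):
        self.columns = columns          # schema: ordered column names
        self.delimiter = delimiter      # ctrl-A, Hive's default field separator

    def deserialize(self, raw):
        # Raw line -> row object (here just a dict keyed by column name).
        return dict(zip(self.columns, raw.split(self.delimiter)))

    def serialize(self, row):
        # Row object -> raw line, fields emitted in schema order.
        return self.delimiter.join(str(row[c]) for c in self.columns)

serde = DelimitedSerDe(["pageid", "userid"])
row = serde.deserialize("1\001111")
# row == {"pageid": "1", "userid": "111"}
```

An XML or JSON SerDe would keep the same two-method shape and swap only the parsing logic, which is why the interface is generic.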
  28. MetaStore
     - Stores table/partition properties:
       - Table schema and SerDe library
       - Table location on HDFS
       - Logical partitioning keys and types
       - Partition-level metadata
       - Other information
     - Thrift API
       - Current clients: PHP (web interface), Python (interface to Hive), Java (query engine and CLI)
     - Metadata stored in any SQL backend
     - Future
       - Statistics
       - Schema evolution
  29. Future Work
     - Cost-based optimization
     - Multiple interfaces (JDBC, ...) / integration with BI
     - SQL compliance (order by, nested queries, ...)
     - Indexing
     - Data compression
       - Columnar storage schemes
       - Exploit lazy/functional Hive field-retrieval interfaces
     - Better data locality
       - Co-locate hash partitions on the same rack
       - Exploit high intra-rack bandwidth for merge joins
     - Advanced operators
       - Cubes / frequent item sets
  30. Hive Status
     - Available as a Hadoop sub-project
       - http://svn.apache.org/repos/asf/hadoop/hive/trunk
     - [email_address]
     - IRC: #hive
     - VLDB demo submission
     - Hivers @ Facebook:
       - Ashish Thusoo
       - Zheng Shao
       - Prasad Chakka
       - Namit Jain
       - Raghu Murthy
       - Suresh Anthony
  31. Hive/Hadoop Usage @ Facebook
     - Summarization
       - E.g. daily/weekly aggregations of impression/click counts
       - Complex measures of user engagement
     - Ad hoc analysis
       - E.g. how many group admins, broken down by state/country
     - Data mining (assembling training data)
       - E.g. user engagement as a function of user attributes
     - Spam detection
       - Anomalous patterns in UGC
       - Application API usage patterns
     - Ad optimization
     - Too many to count ..
  32. Data Warehousing at Facebook Today
     [Diagram: web servers -> Scribe servers -> filers -> Hive on a Hadoop cluster, alongside Oracle RAC and federated MySQL]
  33. Hadoop Usage @ Facebook
     - Data statistics:
       - Total data: ~2.5PB
       - Net data added/day: ~15TB
         - 6TB of uncompressed source logs
         - 4TB of uncompressed dimension data reloaded daily
       - Compression factor ~5x (gzip; more with bzip)
     - Usage statistics:
       - 3200 jobs/day with 800K (map-reduce) tasks/day
       - 55TB of compressed data scanned daily
       - 15TB of compressed output data written to HDFS
       - 80MM compute minutes/day
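Some back-of-envelope arithmetic from the numbers above, combined with the $4K box spec from slide 3 (4x1TB disks) and the 3x replication from slide 6. These derived figures are purely illustrative and are not stated in the talk.

```python
# Illustrative arithmetic only: derived figures, not from the slides.
TOTAL_DATA_TB = 2500        # ~2.5PB of warehouse data (this slide)
REPLICATION = 3             # typical HDFS replication (slide 6)
DISK_PER_NODE_TB = 4        # 4 x 1TB per commodity 1U box (slide 3)
COMPRESSION = 5             # ~5x compression factor (this slide)

# Raw disk consumed and a lower bound on cluster size for storage alone.
raw_storage_tb = TOTAL_DATA_TB * REPLICATION        # 7500TB on disk
min_nodes = raw_storage_tb / DISK_PER_NODE_TB       # ~1875 nodes minimum

# The 55TB of compressed data scanned daily is ~275TB of logical data.
scanned_uncompressed_tb = 55 * COMPRESSION
```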
  34. In Pictures
  35. Hadoop Challenges @ Facebook
     - QoS/Isolation:
       - Big jobs can hog the cluster
       - JobTracker memory as a limited resource
       - Limit the memory impact of runaway tasks
       - Fair Scheduler (Matei)
     - Protection
       - What if a software bug/disaster destroys the NameNode metadata?
       - HDFS snapshots (Dhruba)
     - Data archival
       - Not all data is hot and needs colocation with compute
       - Hadoop data archival layer
  36. Hadoop Challenges @ Facebook
     - Performance
       - Really hard to understand what the systemic bottlenecks are
       - Workloads are variable during the daytime
     - Small-job performance
       - Sampling encourages small test queries
       - Hadoop is awful at locality for small jobs
       - Need to reduce task startup time (JVM reuse)
       - A large number of mappers, each with small output, produces terrible performance
       - => Global scheduling for better locality (Matei)
  37. Hadoop Wish List
     - HDFS:
       - 3-way replication is unsustainable
         - N+k erasure codes
       - Snapshots
         - Design out, but need people to work on it
       - Namenode on-disk metadata
         - In-memory model poses fundamental limits on growth
       - Application hints on block/file co-location
     - Map-Reduce
       - Performance
       - Resource-aware scheduling
       - Multi-stage map-reduce
