Hadoop Hive Talk At IIT-Delhi

Talk at the CS department in IIT 04/02/09.

  • Offline and Near-Real time data processing Not online

    1. 1. Hadoop and Hive Large Scale Data Processing using Commodity HW/SW Joydeep Sen Sarma
    2. 2. Outline <ul><li>Introduction </li></ul><ul><li>Hadoop </li></ul><ul><li>Hive </li></ul><ul><li>Hadoop/Hive Usage @Facebook </li></ul><ul><li>Wishlists/Projects </li></ul><ul><li>Questions </li></ul>
    3. 3. Data and Computing Trends <ul><li>Explosion of Data </li></ul><ul><ul><li>Web Logs, Ad-Server logs, Sensor Networks, Seismic Data, DNA sequences (?) </li></ul></ul><ul><ul><li>User generated content/Web 2.0 </li></ul></ul><ul><ul><li>Data as BI => Data as product (Search, Ads, Digg, Quantcast, …) </li></ul></ul><ul><li>Declining Revenue/GB </li></ul><ul><ul><li>Milk @ $3/gallon => $15M / GB </li></ul></ul><ul><ul><li>Ads @ 20c / 10^6 impressions => $1/GB </li></ul></ul><ul><ul><li>Google Analytics, Facebook Lexicon == Free! </li></ul></ul><ul><li>Hardware Trends </li></ul><ul><ul><li>Commodity Rocks: $4K 1U box = 8 cores + 16GB mem + 4x1TB </li></ul></ul><ul><ul><li>CPU: SMP => NUMA, Storage: $ Shared-Nothing << $ Shared, Networking: Ethernet </li></ul></ul><ul><li>Software Trends </li></ul><ul><ul><li>Open Source, SaaS </li></ul></ul><ul><ul><li>LAMP, Compute on Demand (EC2) </li></ul></ul>
    4. 4. Hadoop <ul><li>Parallel Computing platform </li></ul><ul><ul><li>Distributed FileSystem (HDFS) </li></ul></ul><ul><ul><li>Parallel Processing model (Map/Reduce) </li></ul></ul><ul><ul><li>Express Computation in any language </li></ul></ul><ul><ul><li>Job execution for Map/Reduce jobs (scheduling+localization+retries/speculation) </li></ul></ul><ul><li>Open-Source </li></ul><ul><ul><li>Most popular Apache project! </li></ul></ul><ul><ul><li>Highly Extensible Java Stack (@ expense of Efficiency) </li></ul></ul><ul><ul><li>Develop/Test on EC2! </li></ul></ul><ul><li>Ride the commodity curve: </li></ul><ul><ul><li>Cheap (but reliable) shared nothing storage </li></ul></ul><ul><ul><li>Data Local computing (don’t need high speed networks) </li></ul></ul><ul><ul><li>Highly Scalable (@expense of Efficiency) </li></ul></ul>
    5. 5. Looks like this .. (diagram: six nodes, each with its own local disks, connected by network links labeled 1 Gigabit and 4-8 Gigabit; Node = DataNode + Map-Reduce)
    6. 6. HDFS <ul><li>Separation of Metadata from Data </li></ul><ul><ul><li>Metadata == Inodes, attributes, block locations, block replication </li></ul></ul><ul><li>File = Σ data blocks (typically 128MB) </li></ul><ul><ul><li>Architected for large files and streaming reads </li></ul></ul><ul><li>Highly Reliable </li></ul><ul><ul><li>Each data block typically replicated 3X to different datanodes </li></ul></ul><ul><ul><li>Clients compute and verify block checksums (end-to-end) </li></ul></ul><ul><li>Single namenode </li></ul><ul><ul><li>All metadata stored in-memory. Passive standby </li></ul></ul><ul><li>Client talks to both namenode and datanodes </li></ul><ul><ul><li>Bulk data from datanode to client => linear scalability </li></ul></ul><ul><ul><li>Custom Client library in Java/C/Thrift </li></ul></ul><ul><ul><li>Not POSIX, not NFS </li></ul></ul>
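The "File = Σ data blocks" and "replicated 3X" bullets lend themselves to a quick back-of-the-envelope calculation. The sketch below is illustrative only: the function names are made up for this example, and the defaults simply reuse the "typical" 128MB block size and 3X replication quoted on the slide.

```python
BLOCK_SIZE = 128 * 1024 * 1024  # 128MB, the typical block size quoted on the slide
REPLICATION = 3                 # typical replication factor quoted on the slide


def block_count(file_size: int, block_size: int = BLOCK_SIZE) -> int:
    """Number of blocks needed for file_size bytes (the last block may be partial)."""
    return (file_size + block_size - 1) // block_size


def raw_bytes(file_size: int, replication: int = REPLICATION) -> int:
    """Raw disk consumed cluster-wide when every byte is stored `replication` times."""
    return file_size * replication


one_gib = 1024 ** 3
print(block_count(one_gib), raw_bytes(one_gib))
```

Because the namenode keeps metadata for every block in memory, a few large files are cheaper to track than many small ones holding the same bytes.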
    7. 7. In pictures .. (diagram: NameNode with Disks and 32GB RAM; Secondary NameNode with Disks and 32GB RAM; six DataNodes; a DFS Client that sends getLocations to the NameNode and receives locations back)
    8. 8. Map and Reduce <ul><li>Map Function: </li></ul><ul><ul><li>Apply to input data </li></ul></ul><ul><ul><li>Emits reduction key and value </li></ul></ul><ul><li>Reduce Function: </li></ul><ul><ul><li>Apply to data grouped by reduction key </li></ul></ul><ul><ul><li>Often ‘reduces’ data (for example – sum(values)) </li></ul></ul><ul><li>Hadoop groups data by sorting </li></ul><ul><li>User can choose to apply reductions multiple times </li></ul><ul><ul><li>Combiner </li></ul></ul><ul><li>Partitioning, Sorting, Grouping different concepts </li></ul>
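A minimal single-process sketch of the model on this slide: a map function emits (reduction key, value) pairs, grouping is done by sorting, and a reduce function is applied once per key. The names `run_map_reduce`, `word_map` and `sum_reduce` are invented for illustration; this is not Hadoop's API.

```python
from itertools import groupby
from operator import itemgetter


def run_map_reduce(records, map_fn, reduce_fn):
    # Map: each input record may emit any number of (key, value) pairs.
    pairs = [kv for record in records for kv in map_fn(record)]
    # Group by sorting on the reduction key, as the slide describes.
    pairs.sort(key=itemgetter(0))
    # Reduce: call reduce_fn once per key with all of that key's values.
    return [reduce_fn(key, [value for _, value in group])
            for key, group in groupby(pairs, key=itemgetter(0))]


def word_map(line):
    for word in line.split():
        yield word, 1


def sum_reduce(key, values):
    return key, sum(values)


print(run_map_reduce(["a b a", "b a"], word_map, sum_reduce))
```

Since `sum` is associative, the same reduce function could also serve as a combiner, which is the "apply reductions multiple times" point above.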
    9. 9. Programming with Map/Reduce <ul><li>Find the most imported package in Hive source: </li></ul><ul><li>$ find . -name '*.java' -exec egrep '^import' '{}' \; | awk '{print $2}' | sort | uniq -c | sort -nr -k1,1 | head -1 </li></ul><ul><ul><li>208 org.apache.commons.logging.LogFactory; </li></ul></ul><ul><li>In Map-Reduce: </li></ul><ul><ul><li>1a. Map using: egrep '^import' | awk '{print $2}' </li></ul></ul><ul><ul><li>1b. Reduce on first column (package name) </li></ul></ul><ul><ul><li>1c. Reduce Function: uniq -c </li></ul></ul><ul><ul><li>2a. Map using: awk '{printf "%05d %s\n",100000-$1,$2}' </li></ul></ul><ul><ul><li>2b. Reduce using first column (inverse counts), 1 reducer </li></ul></ul><ul><ul><li>2c. Reduce Function: Identity </li></ul></ul><ul><li>Scales to Terabytes </li></ul>
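The two stages listed above (1a-1c, then 2a-2c) can be traced in a few lines of Python. This is a sketch, not the speaker's code: `reduce_by_key` stands in for the shuffle/sort plus reducer, the regex approximates `egrep '^import' | awk '{print $2}'`, and the constant 100000 is taken from the slide's awk expression.

```python
import re
from itertools import groupby

IMPORT_RE = re.compile(r"^import\s+(\S+)")


def stage1_map(line):
    # 1a: keep import lines and emit the second whitespace-separated field.
    match = IMPORT_RE.match(line)
    if match:
        yield match.group(1), 1


def reduce_by_key(pairs, reduce_fn):
    # Sort, then group adjacent equal keys: a stand-in for shuffle and sort.
    ordered = sorted(pairs, key=lambda kv: kv[0])
    return [reduce_fn(key, [value for _, value in group])
            for key, group in groupby(ordered, key=lambda kv: kv[0])]


def most_imported(lines):
    # Stage 1 (1b, 1c): count occurrences per imported name, like `uniq -c`.
    counts = reduce_by_key(
        [kv for line in lines for kv in stage1_map(line)],
        lambda name, ones: (name, sum(ones)),
    )
    if not counts:
        return None
    # Stage 2 (2a, 2b): key on the inverted count so an ascending sort
    # puts the largest count first; the reduce itself is the identity (2c).
    inverted = [(100000 - count, name) for name, count in counts]
    ordered = reduce_by_key(inverted, lambda key, names: (key, names))
    top_key, names = ordered[0]
    return names[0], 100000 - top_key


sample = ["import a.B;", "import a.C;", "import a.B;", "class X {}"]
print(most_imported(sample))
```

The inversion trick exists because the framework sorts keys in ascending order; with a single reducer the first row it sees is the global maximum.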
    10. 10. Map/Reduce Dataflow
    11. 11. Why HIVE? <ul><li>Large installed base of SQL users  </li></ul><ul><ul><li>i.e. map-reduce is for ultra-geeks </li></ul></ul><ul><ul><li>much, much easier to write a SQL query </li></ul></ul><ul><li>Analytics SQL queries translate really well to map-reduce </li></ul><ul><li>Files are an insufficient data management abstraction </li></ul><ul><ul><li>Tables, Schemas, Partitions, Indices </li></ul></ul><ul><ul><li>Metadata allows optimization, discovery, browsing </li></ul></ul><ul><li>Love the programmability of Hadoop </li></ul><ul><li>Hate that RDBMS are closed </li></ul><ul><ul><li>Why not work on data in any format? </li></ul></ul><ul><ul><li>Complex data types are the norm </li></ul></ul>
    12. 12. Rubbing it in .. <ul><li>hive> select key, count(1) from kv1 where key > 100 group by key; </li></ul><ul><li>vs. </li></ul><ul><li>$ cat > /tmp/reducer.sh </li></ul><ul><li>uniq -c | awk '{print $2" "$1}' </li></ul><ul><li>$ cat > /tmp/map.sh </li></ul><ul><li>awk -F '\001' '{if($1 > 100) print $1}' </li></ul><ul><li>$ bin/hadoop jar contrib/hadoop-0.19.2-dev-streaming.jar -input /user/hive/warehouse/kv1 -mapper map.sh -file /tmp/reducer.sh -file /tmp/map.sh -reducer reducer.sh -output /tmp/largekey -numReduceTasks 1 </li></ul><ul><li>$ bin/hadoop dfs -cat /tmp/largekey/part* </li></ul>
    13. 13. HIVE: Components HDFS Hive CLI DDL Queries Browsing Map Reduce MetaStore Thrift API SerDe Thrift Jute JSON.. Execution Hive QL Parser Planner Mgmt. Web UI
    14. 14. Data Model (diagram): table clicks. HDFS layout: table at /hive/clicks; logical partition at /hive/clicks/ds=2008-03-25; hash partition at /hive/clicks/ds=2008-03-25/0 … MetaStore entries for Tables: Schema, Library, #Buckets=32, Bucketing Info, Partitioning Cols
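To make the path layout on this slide concrete, here is a small sketch that builds a partition directory and picks a hash partition for a row. Everything beyond the slide's example paths and `#Buckets=32` is an assumption: the function names are hypothetical, and `value % num_buckets` is a placeholder hash, not necessarily the function Hive uses.

```python
def partition_path(table_root, partition_cols):
    """Directory for one logical partition, e.g. ds=2008-03-25 under the table root."""
    suffix = "/".join(f"{name}={value}" for name, value in partition_cols)
    return f"{table_root}/{suffix}"


def bucket_for(value: int, num_buckets: int = 32) -> int:
    """Placeholder hash partitioner (assumption): integer value modulo bucket count."""
    return value % num_buckets


def bucket_path(table_root, partition_cols, bucket: int) -> str:
    """Path of one hash partition inside a logical partition, as drawn on the slide."""
    return f"{partition_path(table_root, partition_cols)}/{bucket}"


print(bucket_path("/hive/clicks", [("ds", "2008-03-25")], bucket_for(32)))
```

A query filtering on `ds` only needs to read the matching directory, and a sampling query can read a subset of the buckets.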
    15. 15. Hive Query Language <ul><li>Basic SQL </li></ul><ul><ul><li>From clause subquery </li></ul></ul><ul><ul><li>ANSI JOIN (equi-join only) </li></ul></ul><ul><ul><li>Multi-table Insert </li></ul></ul><ul><ul><li>Multi group-by </li></ul></ul><ul><ul><li>Sampling </li></ul></ul><ul><ul><li>Objects traversal </li></ul></ul><ul><li>Extensibility </li></ul><ul><ul><li>Pluggable Map-reduce scripts using TRANSFORM </li></ul></ul>
    16. 16. Running Custom Map/Reduce Scripts <ul><ul><ul><li>FROM ( </li></ul></ul></ul><ul><ul><ul><ul><li>FROM pv_users </li></ul></ul></ul></ul><ul><ul><ul><ul><li>SELECT TRANSFORM(pv_users.userid, pv_users.date) USING 'map_script' </li></ul></ul></ul></ul><ul><ul><ul><ul><li>AS(dt, uid) </li></ul></ul></ul></ul><ul><ul><ul><ul><li>CLUSTER BY(dt)) map </li></ul></ul></ul></ul><ul><ul><ul><li>INSERT INTO TABLE pv_users_reduced </li></ul></ul></ul><ul><ul><ul><ul><li>SELECT TRANSFORM(map.dt, map.uid) USING 'reduce_script' AS (date, count); </li></ul></ul></ul></ul>
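For readers wondering what a `'map_script'` might contain: TRANSFORM streams the selected columns to the script as tab-separated lines on standard input and reads tab-separated lines back. The sketch below is a hypothetical script body for the first TRANSFORM in the query above; it swaps `(userid, date)` into `(dt, uid)` and is not the script used in the talk.

```python
import sys


def map_line(line: str) -> str:
    # Input columns arrive tab-separated in SELECT order: userid, date.
    userid, date = line.rstrip("\n").split("\t")
    # Emit them in the order named by AS(dt, uid).
    return f"{date}\t{userid}"


def main(stdin=sys.stdin, stdout=sys.stdout):
    # One output row per input row; a real script would call main() at the bottom.
    for line in stdin:
        stdout.write(map_line(line) + "\n")
```

`CLUSTER BY(dt)` then plays the role of the shuffle, so the reduce script sees all rows for a given `dt` together.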
    17. 17. Hive QL – Join <ul><li>SQL: </li></ul><ul><ul><li>INSERT INTO TABLE pv_users </li></ul></ul><ul><ul><li>SELECT pv.pageid, u.age </li></ul></ul><ul><ul><li>FROM page_view pv JOIN user u ON (pv.userid = u.userid); </li></ul></ul>page_view (pageid, userid, time): (1, 111, 9:08:01), (2, 111, 9:08:13), (1, 222, 9:08:14) X user (userid, age, gender): (111, 25, female), (222, 32, male) = pv_users (pageid, age): (1, 25), (2, 25), (1, 32)
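    18. 18. Hive QL – Join in Map Reduce. Input page_view (pageid, userid, time): (1, 111, 9:08:01), (2, 111, 9:08:13), (1, 222, 9:08:14). Input user (userid, age, gender): (111, 25, female), (222, 32, male). Map output from page_view (key, value): (111, <1, 1>), (111, <1, 2>), (222, <1, 1>). Map output from user (key, value): (111, <2, 25>), (222, <2, 32>). After Shuffle/Sort, first reducer input: (111, <1, 1>), (111, <1, 2>), (111, <2, 25>); second reducer input: (222, <1, 1>), (222, <2, 32>). Reduce output pv_users (pageid, age): (1, 25), (2, 25) from the first reducer; (1, 32) from the second.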
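The join plan on this slide can be replayed in plain Python using the same sample rows. In this sketch each map output value carries a table tag (1 for page_view, 2 for user, matching the `<tag, value>` pairs on the slide), and the reducer pairs up the two sides for each userid. The function names are illustrative, and a dictionary stands in for the shuffle.

```python
from collections import defaultdict


def join_map(page_view, user):
    # Tag each value with its source table: 1 = page_view, 2 = user.
    for pageid, userid, _time in page_view:
        yield userid, (1, pageid)
    for userid, age, _gender in user:
        yield userid, (2, age)


def join_reduce(tagged_values):
    pageids = [value for tag, value in tagged_values if tag == 1]
    ages = [value for tag, value in tagged_values if tag == 2]
    # Inner join: cross product of the two sides that share this key.
    return [(pageid, age) for pageid in pageids for age in ages]


def reduce_side_join(page_view, user):
    groups = defaultdict(list)  # stand-in for shuffle: group values by join key
    for key, value in join_map(page_view, user):
        groups[key].append(value)
    rows = []
    for key in sorted(groups):
        rows.extend(join_reduce(groups[key]))
    return rows


page_view = [(1, 111, "9:08:01"), (2, 111, "9:08:13"), (1, 222, "9:08:14")]
user = [(111, 25, "female"), (222, 32, "male")]
print(reduce_side_join(page_view, user))
```

The reducer buffers every value for a key before emitting, which is why the later "Join To Map Reduce" slide lists memory efficiency of the per-key product as future work.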
    19. 19. Joins <ul><li>Outer Joins </li></ul><ul><li>INSERT INTO TABLE pv_users </li></ul><ul><li>SELECT pv.*, u.gender, u.age </li></ul><ul><li>FROM page_view pv FULL OUTER JOIN user u ON (pv.userid = u.userid) </li></ul><ul><li>WHERE pv.date = '2008-03-03'; </li></ul>
    20. 20. Hive Optimizations – Merge Sequential Map Reduce Jobs <ul><li>SQL: </li></ul><ul><ul><li>FROM (a join b on a.key = b.key) join c on a.key = c.key SELECT … </li></ul></ul>Plan: A and B go through Map Reduce to produce AB; AB and C go through Map Reduce to produce ABC. A (key, av): (1, 111). B (key, bv): (1, 222). AB (key, av, bv): (1, 111, 222). C (key, cv): (1, 333). ABC (key, av, bv, cv): (1, 111, 222, 333).
    21. 21. Join To Map Reduce <ul><li>Only Equality Joins with conjunctions supported </li></ul><ul><li>Future </li></ul><ul><ul><li>Pruning of values sent from map to reduce on the basis of projections </li></ul></ul><ul><ul><li>Make Cartesian product more memory efficient </li></ul></ul><ul><ul><li>Map side joins </li></ul></ul><ul><ul><ul><li>Hash Joins if one of the tables is very small </li></ul></ul></ul><ul><ul><ul><li>Exploit pre-sorted data by doing map-side merge join </li></ul></ul></ul><ul><li>Estimate number of reducers </li></ul><ul><ul><li>Hard to measure effect of filters </li></ul></ul><ul><ul><li>Run map side for small part of input to estimate #reducers </li></ul></ul>
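The "Hash Joins if one of the tables is very small" item describes a standard technique that is easy to sketch: load the small table into an in-memory hash table in every mapper and stream the large table past it, so no shuffle or reduce phase is needed. The code below illustrates the idea on the deck's page_view/user example; the function name is made up and this is not Hive's implementation.

```python
def map_side_hash_join(big_rows, small_rows):
    # Build phase: hash the small table (user) on the join key.
    ages_by_user = {}
    for userid, age, _gender in small_rows:
        ages_by_user.setdefault(userid, []).append(age)
    # Probe phase: stream the large table (page_view) and look up each row.
    for pageid, userid, _time in big_rows:
        for age in ages_by_user.get(userid, ()):
            yield pageid, age


page_view = [(1, 111, "9:08:01"), (2, 111, "9:08:13"), (1, 222, "9:08:14")]
user = [(111, 25, "female"), (222, 32, "male")]
print(list(map_side_hash_join(page_view, user)))
```

The trade-off is memory: every mapper holds a full copy of the small table, so the approach only pays off when that table comfortably fits in RAM.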
    22. 22. Hive QL – Group By <ul><ul><ul><ul><li>SELECT pageid, age, count(1) </li></ul></ul></ul></ul><ul><ul><ul><ul><li>FROM pv_users </li></ul></ul></ul></ul><ul><ul><ul><ul><li>GROUP BY pageid, age; </li></ul></ul></ul></ul>pv_users (pageid, age): (1, 25), (2, 25), (1, 32), (2, 25). Result (pageid, age, count): (1, 25, 1), (2, 25, 2), (1, 32, 1).
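    23. 23. Hive QL – Group By in Map Reduce. pv_users is split across two mappers. First mapper input (pageid, age): (1, 25), (2, 25); Map output (key, value): (<1,25>, 1), (<2,25>, 1). Second mapper input (pageid, age): (1, 32), (2, 25); Map output (key, value): (<1,32>, 1), (<2,25>, 1). After Shuffle/Sort, first reducer input: (<1,25>, 1), (<1,32>, 1); second reducer input: (<2,25>, 1), (<2,25>, 1). Reduce output (pageid, age, count): (1, 25, 1), (1, 32, 1) from the first reducer; (2, 25, 2) from the second.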
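The same group-by can be traced in Python with two input splits and two reducers. The partitioner here (`pageid % num_reducers`) is an assumption chosen only so that all rows for a key land on one reducer; the slide does not say how keys are assigned.

```python
from collections import defaultdict


def group_by_count(splits, num_reducers=2):
    # Map + shuffle: emit ((pageid, age), 1) and route each key to one reducer.
    reducer_inputs = [defaultdict(list) for _ in range(num_reducers)]
    for split in splits:
        for pageid, age in split:
            reducer_inputs[pageid % num_reducers][(pageid, age)].append(1)
    # Reduce: sum the ones for every key, in sorted key order per reducer.
    outputs = []
    for grouped in reducer_inputs:
        outputs.append([(pageid, age, sum(ones))
                        for (pageid, age), ones in sorted(grouped.items())])
    return outputs


splits = [[(1, 25), (2, 25)], [(1, 32), (2, 25)]]
print(group_by_count(splits))
```

Each inner list is one reducer's output file; concatenating them gives the full query result.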
    24. 24. Hive QL – Group By with Distinct <ul><ul><li>SELECT pageid, COUNT(DISTINCT userid) </li></ul></ul><ul><ul><li>FROM page_view GROUP BY pageid </li></ul></ul>page_view (pageid, userid, time): (1, 111, 9:08:01), (2, 111, 9:08:13), (1, 222, 9:08:14), (2, 111, 9:08:20). Result (pageid, count_distinct_userid): (1, 2), (2, 1).
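    25. 25. Hive QL – Group By with Distinct in Map Reduce. page_view is split across two mappers. First mapper input (pageid, userid, time): (1, 111, 9:08:01), (2, 111, 9:08:13). Second mapper input: (1, 222, 9:08:14), (2, 111, 9:08:20). After Shuffle and Sort, first reducer keys: <1,111>, <2,111>, <2,111>; second reducer keys: <1,222>. First Reduce output (pageid, count): (1, 1), (2, 1) from the first reducer; (1, 1) from the second. Second Reduce output (pageid, count): (1, 2), (2, 1).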
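A sketch of the two reduce stages shown here, using the slide's sample rows. Stage 1 keys on (pageid, userid) so duplicates meet in one reducer and collapse; stage 2 sums the partial counts per pageid. Partitioning by `userid % num_reducers` and using a Python set for de-duplication are simplifying assumptions of this example.

```python
from collections import defaultdict


def count_distinct(page_view, num_reducers=2):
    # Stage 1 map + shuffle: key is (pageid, userid); identical keys go to
    # the same reducer, where a set collapses the duplicates.
    partitions = [set() for _ in range(num_reducers)]
    for pageid, userid, _time in page_view:
        partitions[userid % num_reducers].add((pageid, userid))
    # Stage 1 reduce: each reducer emits a partial distinct count per pageid.
    partials = []
    for keys in partitions:
        counts = defaultdict(int)
        for pageid, _userid in keys:
            counts[pageid] += 1
        partials.extend(counts.items())
    # Stage 2 reduce: sum the partial counts for each pageid.
    totals = defaultdict(int)
    for pageid, count in partials:
        totals[pageid] += count
    return dict(totals)


page_view = [(1, 111, "9:08:01"), (2, 111, "9:08:13"),
             (1, 222, "9:08:14"), (2, 111, "9:08:20")]
print(count_distinct(page_view))
```

Summing partial counts is safe only because each (pageid, userid) pair is seen by exactly one stage 1 reducer.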
    26. 26. Group by Future optimizations <ul><li>Map side partial aggregations </li></ul><ul><ul><li>Hash Based aggregates (Done) </li></ul></ul><ul><ul><li>Serialized key/values in hash tables </li></ul></ul><ul><ul><li>Exploit pre-sorted data for distinct counts </li></ul></ul><ul><li>Partial aggregations in Combiner </li></ul><ul><li>Be smarter about how to avoid multiple stage (Done) </li></ul><ul><li>Exploit table/column statistics for deciding strategy </li></ul>
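"Map side partial aggregations" with hash-based aggregates can be illustrated as follows: each mapper keeps a hash table of running counts and emits one row per distinct key instead of one row per input record, and the reducer adds up the partial counts. This is a sketch of the general idea with invented function names, not Hive's operator code.

```python
from collections import Counter


def map_with_partial_agg(split):
    # Hash-based aggregation inside the mapper: one counter per distinct key.
    partial = Counter()
    for pageid, age in split:
        partial[(pageid, age)] += 1
    # Emit (key, partial_count) pairs; far fewer rows when keys repeat.
    return list(partial.items())


def reduce_sum(mapper_outputs):
    # The reducer sums partial counts instead of counting raw rows.
    totals = Counter()
    for output in mapper_outputs:
        for key, count in output:
            totals[key] += count
    return dict(totals)


splits = [[(1, 25), (2, 25), (2, 25)], [(1, 32), (2, 25)]]
print(reduce_sum(map_with_partial_agg(split) for split in splits))
```

This works for aggregates such as count and sum that can be merged from partial results; the hash table must also fit in mapper memory, which is one reason table and column statistics matter for choosing a strategy.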
    27. 27. Dealing with Structured Data <ul><li>Type system </li></ul><ul><ul><li>Primitive types </li></ul></ul><ul><ul><li>Recursively build up using Composition/Maps/Lists </li></ul></ul><ul><li>ObjectInspector interface for user-defined types </li></ul><ul><ul><li>To recursively list schema </li></ul></ul><ul><ul><li>To recursively access fields within a row object </li></ul></ul><ul><li>Generic (De)Serialization Interface (SerDe) </li></ul><ul><li>Serialization families implement interface </li></ul><ul><ul><li>Thrift DDL based SerDe </li></ul></ul><ul><ul><li>Delimited text based SerDe </li></ul></ul><ul><ul><li>You can write your own SerDe (XML, JSON …) </li></ul></ul>
    28. 28. MetaStore <ul><li>Stores Table/Partition properties: </li></ul><ul><ul><li>Table schema and SerDe library </li></ul></ul><ul><ul><li>Table Location on HDFS </li></ul></ul><ul><ul><li>Logical Partitioning keys and types </li></ul></ul><ul><ul><li>Partition level metadata </li></ul></ul><ul><ul><li>Other information </li></ul></ul><ul><li>Thrift API </li></ul><ul><ul><li>Current clients in Php (Web Interface), Python interface to Hive, Java (Query Engine and CLI) </li></ul></ul><ul><li>Metadata stored in any SQL backend </li></ul><ul><li>Future </li></ul><ul><ul><li>Statistics </li></ul></ul><ul><ul><li>Schema Evolution </li></ul></ul>
    29. 29. Future Work <ul><li>Cost-based optimization </li></ul><ul><li>Multiple interfaces (JDBC…)/Integration with BI </li></ul><ul><li>SQL Compliance (order by, nested queries…) </li></ul><ul><li>Indexing </li></ul><ul><li>Data Compression </li></ul><ul><ul><li>Columnar storage schemes </li></ul></ul><ul><ul><li>Exploit lazy/functional Hive field retrieval interfaces </li></ul></ul><ul><li>Better data locality </li></ul><ul><ul><li>Co-locate hash partitions on same rack </li></ul></ul><ul><ul><li>Exploit high intra-rack bandwidth for merge joins </li></ul></ul><ul><li>Advanced operators </li></ul><ul><ul><li>Cubes/Frequent Item Sets </li></ul></ul>
    30. 30. Hive Status <ul><li>Available Hadoop Sub-project </li></ul><ul><ul><li>http://svn.apache.org/repos/asf/hadoop/hive/trunk </li></ul></ul><ul><li>[email_address] </li></ul><ul><li>IRC: #hive </li></ul><ul><li>VLDB demo submission </li></ul><ul><li>Hivers@Facebook: </li></ul><ul><ul><li>Ashish Thusoo </li></ul></ul><ul><ul><li>Zheng Shao </li></ul></ul><ul><ul><li>Prasad Chakka </li></ul></ul><ul><ul><li>Namit Jain </li></ul></ul><ul><ul><li>Raghu Murthy </li></ul></ul><ul><ul><li>Suresh Anthony </li></ul></ul>
    31. 31. Hive/Hadoop Usage @ Facebook <ul><li>Summarization </li></ul><ul><ul><li>Eg: Daily/Weekly aggregations of impression/click counts </li></ul></ul><ul><ul><li>Complex measures of user engagement </li></ul></ul><ul><li>Ad hoc Analysis </li></ul><ul><ul><li>Eg: how many group admins broken down by state/country </li></ul></ul><ul><li>Data Mining (Assembling training data) </li></ul><ul><ul><li>Eg: User Engagement as a function of user attributes </li></ul></ul><ul><li>Spam Detection </li></ul><ul><ul><li>Anomalous patterns in UGC </li></ul></ul><ul><ul><li>Application api usage patterns </li></ul></ul><ul><li>Ad Optimization </li></ul><ul><li>Too many to count .. </li></ul>
    32. 32. Data Warehousing at Facebook Today (diagram components: Web Servers, Scribe Servers, Filers, Hive on Hadoop Cluster, Oracle RAC, Federated MySQL)
    33. 33. Hadoop Usage @ Facebook <ul><li>Data statistics: </li></ul><ul><ul><li>Total Data: ~2.5PB </li></ul></ul><ul><ul><li>Net Data added/day: ~15TB </li></ul></ul><ul><ul><ul><li>6TB of uncompressed source logs </li></ul></ul></ul><ul><ul><ul><li>4TB of uncompressed dimension data reloaded daily </li></ul></ul></ul><ul><ul><li>Compression Factor ~5x (gzip, more with bzip) </li></ul></ul><ul><li>Usage statistics: </li></ul><ul><ul><li>3200 jobs/day with 800K tasks(map-reduce tasks)/day </li></ul></ul><ul><ul><li>55TB of compressed data scanned daily </li></ul></ul><ul><ul><li>15TB of compressed output data written to hdfs </li></ul></ul><ul><ul><li>80 MM compute minutes/day </li></ul></ul>
    34. 34. In Pictures
    35. 35. Hadoop Challenges @ Facebook <ul><li>QOS/Isolation: </li></ul><ul><ul><li>Big jobs can hog the cluster </li></ul></ul><ul><ul><li>JobTracker memory as limited resource </li></ul></ul><ul><ul><li>Limit memory impact of runaway tasks </li></ul></ul><ul><ul><li>Fair Scheduler (Matei) </li></ul></ul><ul><li>Protection </li></ul><ul><ul><li>What if software bug/disaster destroys NameNode metadata? </li></ul></ul><ul><ul><li>HDFS SnapShots (Dhruba) </li></ul></ul><ul><li>Data Archival </li></ul><ul><ul><li>Not all data is hot and needs colocation with Compute </li></ul></ul><ul><ul><li>Hadoop Data Archival Layer </li></ul></ul>
    36. 36. Hadoop Challenges @ Facebook <ul><li>Performance </li></ul><ul><ul><li>Really hard to understand what systemic bottlenecks are </li></ul></ul><ul><ul><li>Workloads are variable during daytime </li></ul></ul><ul><li>Small Job Performance </li></ul><ul><ul><li>Sampling encourages small test queries </li></ul></ul><ul><ul><li>Hadoop is awful at locality for small jobs </li></ul></ul><ul><ul><li>Need to reduce task startup time (JVM reuse) </li></ul></ul><ul><ul><li>A large number of mappers, each with small output, produces terrible performance </li></ul></ul><ul><ul><li>Global Scheduling for better locality (Matei) </li></ul></ul>
    37. 37. Hadoop Wish List <ul><li>HDFS: </li></ul><ul><ul><li>3-way replication is unsustainable </li></ul></ul><ul><ul><ul><li>N+k erasure codes </li></ul></ul></ul><ul><ul><li>Snapshots </li></ul></ul><ul><ul><ul><li>Design out – but need people to work on it </li></ul></ul></ul><ul><ul><li>Namenode on-disk metadata </li></ul></ul><ul><ul><ul><li>In-memory model poses fundamental limits on growth </li></ul></ul></ul><ul><ul><li>Application hints on block/file co-location </li></ul></ul><ul><li>Map-Reduce </li></ul><ul><ul><li>Performance </li></ul></ul><ul><ul><li>Resource aware scheduling </li></ul></ul><ul><ul><li>Multi-Stage Map-Reduce </li></ul></ul>
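The storage argument behind "3-way replication is unsustainable" versus "N+k erasure codes" comes down to simple arithmetic: replication stores every byte `copies` times, while an N+k code stores N data fragments plus k parity fragments. The parameters in the example call (N=10, k=4) are an arbitrary illustration, not figures from the talk.

```python
def replication_overhead(copies: int = 3) -> float:
    """Raw bytes stored per logical byte under plain replication."""
    return float(copies)


def erasure_overhead(n: int, k: int) -> float:
    """Raw bytes stored per logical byte under an N+k erasure code."""
    return (n + k) / n


print(replication_overhead(3), erasure_overhead(10, 4))
```

With these example parameters the code tolerates the loss of any 4 fragments while using less than half the raw space of 3X replication, at the cost of more expensive reconstruction and weaker data locality.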