
Hadoop Hive Talk At IIT-Delhi

Talk at the CS department at IIT Delhi, 04/02/09.

  • Slide note: offline and near-real-time data processing, not online
  • Transcript

    • 1. Hadoop and Hive: Large-Scale Data Processing using Commodity HW/SW (Joydeep Sen Sarma)
    • 2. Outline
      • Introduction
      • Hadoop
      • Hive
      • Hadoop/Hive Usage @Facebook
      • Wishlists/Projects
      • Questions
    • 3. Data and Computing Trends
      • Explosion of Data
        • Web Logs, Ad-Server logs, Sensor Networks, Seismic Data, DNA sequences (?)
        • User generated content/Web 2.0
        • Data as BI => Data as product (Search, Ads, Digg, Quantcast, …)
      • Declining Revenue/GB
        • Milk @ $3/gallon => $15M / GB
        • Ads @ 20c / 10^6 impressions => $1/GB
        • Google Analytics, Facebook Lexicon == Free!
      • Hardware Trends
        • Commodity Rocks: $4K 1U box = 8 cores + 16GB mem + 4x1TB
        • CPU: SMP → NUMA; Storage: $ Shared-Nothing << $ Shared; Networking: Ethernet
      • Software Trends
        • Open Source, SaaS
        • LAMP, Compute on Demand (EC2)
    • 4. Hadoop
      • Parallel Computing platform
        • Distributed FileSystem (HDFS)
        • Parallel Processing model (Map/Reduce)
        • Express Computation in any language
        • Job execution for Map/Reduce jobs (scheduling+localization+retries/speculation)
      • Open-Source
        • Most popular Apache project!
        • Highly Extensible Java Stack (@ expense of Efficiency)
        • Develop/Test on EC2!
      • Ride the commodity curve:
        • Cheap (but reliable) shared nothing storage
        • Data Local computing (don’t need high speed networks)
        • Highly Scalable (@expense of Efficiency)
    • 5. Looks like this: [Cluster diagram: rows of nodes, each with local disks; Node = DataNode + Map-Reduce; 1 Gigabit links within a rack, 4-8 Gigabit between racks]
    • 6. HDFS
      • Separation of Metadata from Data
        • Metadata == Inodes, attributes, block locations, block replication
      • File = Σ data blocks (typically 128MB)
        • Architected for large files and streaming reads
      • Highly Reliable
        • Each data block typically replicated 3X to different datanodes
        • Clients compute and verify block checksums (end-to-end)
      • Single namenode
        • All metadata stored in memory; passive standby
      • Client talks to both namenode and datanodes
        • Bulk data flows from datanode to client → linear scalability
        • Custom Client library in Java/C/Thrift
        • Not POSIX, not NFS
    • 7. In pictures: [Architecture diagram: a DFS Client sends getLocations to the NameNode (32GB RAM; Secondary NameNode as standby) and receives block locations, then reads bulk data directly from the DataNodes]
    • 8. Map and Reduce
      • Map Function:
        • Apply to input data
        • Emits reduction key and value
      • Reduce Function:
        • Apply to data grouped by reduction key
        • Often ‘reduces’ data (for example – sum(values))
      • Hadoop groups data by sorting
      • User can choose to apply reductions multiple times
        • Combiner
      • Partitioning, Sorting, Grouping different concepts
    • 9. Programming with Map/Reduce
      • Find the most imported package in Hive source:
      • $ find . -name '*.java' -exec egrep '^import' '{}' \; | awk '{print $2}' | sort | uniq -c | sort -nr +0 -1 | head -1
        • 208 org.apache.commons.logging.LogFactory;
      • In Map-Reduce:
        • 1a. Map using: egrep '^import'| awk '{print $2}'
        • 1b. Reduce on first column (package name)
        • 1c. Reduce Function: uniq -c
        • 2a. Map using: awk '{printf "%05d %s\n", 100000-$1, $2}'
        • 2b. Reduce using first column (inverse counts), 1 reducer
        • 2c. Reduce Function: Identity
      • Scales to Terabytes
    • 10. Map/Reduce Dataflow [diagram]
    • 11. Why HIVE?
      • Large installed base of SQL users
        • i.e., map-reduce is for ultra-geeks
        • much easier to write a SQL query
      • Analytics SQL queries translate really well to map-reduce
      • Files are an insufficient data-management abstraction
        • Tables, Schemas, Partitions, Indices
        • Metadata allows optimization, discovery, browsing
      • Love the programmability of Hadoop
      • Hate that RDBMS are closed
        • Why not work on data in any format?
        • Complex data types are the norm
    • 12. Rubbing it in ..
      • hive> select key, count(1) from kv1 where key > 100 group by key;
      • vs.
      • $ cat > /tmp/reducer.sh
      • uniq -c | awk '{print $2" "$1}'
      • $ cat > /tmp/map.sh
      • awk -F '\001' '{if($1 > 100) print $1}'
      • $ bin/hadoop jar contrib/hadoop-0.19.2-dev-streaming.jar -input /user/hive/warehouse/kv1 -mapper map.sh -file /tmp/reducer.sh -file /tmp/map.sh -reducer reducer.sh -output /tmp/largekey -numReduceTasks 1
      • $ bin/hadoop dfs -cat /tmp/largekey/part*
    • 13. HIVE: Components [Architecture diagram: Hive CLI (DDL, Queries, Browsing) and a Mgmt. Web UI on top; HiveQL Parser, Planner, and Execution engine compile queries to Map Reduce; the MetaStore is reached over a Thrift API; a SerDe layer (Thrift, Jute, JSON, ...) handles row formats; data lives in HDFS]
    • 14. Data Model [Diagram: the MetaStore's schema library maps tables to HDFS paths. Logical partitioning by partitioning cols (e.g. ds): /hive/clicks → /hive/clicks/ds=2008-03-25. Hash partitioning (bucketing, #Buckets=32) inside each partition: /hive/clicks/ds=2008-03-25/0 ... DDL sketched below]
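      A minimal DDL sketch that would produce the layout above (the clicks columns are illustrative, not from the talk):
        hive> CREATE TABLE clicks (userid INT, url STRING)
            > PARTITIONED BY (ds STRING)
            > CLUSTERED BY (userid) INTO 32 BUCKETS;
      PARTITIONED BY yields the logical ds=2008-03-25 directories; CLUSTERED BY ... INTO 32 BUCKETS hash-partitions each partition into the numbered bucket files.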
    • 15. Hive Query Language
      • Basic SQL
        • From clause subquery
        • ANSI JOIN (equi-join only)
        • Multi-table Insert (sketched below)
        • Multi group-by
        • Sampling
        • Object traversal
      • Extensibility
        • Pluggable Map-reduce scripts using TRANSFORM
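      A sketch of multi-table insert (one scan, several outputs) and bucket sampling; the pv_by_page and pv_by_user target tables are illustrative:
        hive> FROM page_view pv
            > INSERT OVERWRITE TABLE pv_by_page
            >   SELECT pv.pageid, COUNT(1) GROUP BY pv.pageid
            > INSERT OVERWRITE TABLE pv_by_user
            >   SELECT pv.userid, COUNT(1) GROUP BY pv.userid;
        hive> SELECT * FROM clicks TABLESAMPLE (BUCKET 1 OUT OF 32 ON userid);
      The multi-insert runs both group-bys off a single scan of page_view; TABLESAMPLE reads one bucket of the bucketed clicks table instead of all 32.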
    • 16. Running Custom Map/Reduce Scripts
          • FROM (
            • FROM pv_users
            • SELECT TRANSFORM(pv_users.userid, pv_users.date) USING 'map_script'
            • AS (dt, uid)
            • CLUSTER BY (dt)) map
          • INSERT INTO TABLE pv_users_reduced
            • SELECT TRANSFORM(map.dt, map.uid) USING 'reduce_script' AS (date, count);
    • 17. Hive QL – Join
      • SQL:
        • INSERT INTO TABLE pv_users
        • SELECT pv.pageid, u.age
        • FROM page_view pv JOIN user u ON (pv.userid = u.userid);
      page_view:
        pageid  userid  time
        1       111     9:08:01
        2       111     9:08:13
        1       222     9:08:14
      user:
        userid  age  gender
        111     25   female
        222     32   male
      pv_users (= page_view JOIN user):
        pageid  age
        1       25
        2       25
        1       32
    • 18. Hive QL – Join in Map Reduce
      Map: each table is scanned and rows are emitted keyed by userid, with a table tag (<1, ...> carries a page_view pageid, <2, ...> a user age):
        from page_view:  111 → <1, 1>, 111 → <1, 2>, 222 → <1, 1>
        from user:       111 → <2, 25>, 222 → <2, 32>
      Shuffle/Sort: groups by key, so the reducer for key 111 sees <1, 1>, <1, 2>, <2, 25> and the reducer for key 222 sees <1, 1>, <2, 32>.
      Reduce: crosses the tagged values per key to emit pv_users rows: (1, 25), (2, 25) for key 111 and (1, 32) for key 222.
    • 19. Joins
      • Outer Joins
      • INSERT INTO TABLE pv_users
      • SELECT pv.*, u.gender, u.age
      • FROM page_view pv FULL OUTER JOIN user u ON (pv.userid = u.id)
      • WHERE pv.date = 2008-03-03;
    • 20. Hive Optimizations – Merge Sequential Map Reduce Jobs
      • SQL:
        • FROM (a join b on a.key = b.key) join c on a.key = c.key SELECT …
      [Diagram: naively, A JOIN B is one map-reduce job producing AB, and AB JOIN C is a second job producing ABC; because both joins are on a.key, Hive merges them into a single map-reduce job. Example rows: A(key=1, av=111) + B(key=1, bv=222) + C(key=1, cv=333) → ABC(key=1, av=111, bv=222, cv=333)]
    • 21. Join To Map Reduce
      • Only Equality Joins with conjunctions supported
      • Future
        • Pruning of values sent from map to reduce on the basis of projections
        • Make Cartesian product more memory efficient
        • Map side joins (sketched after this list)
          • Hash Joins if one of the tables is very small
          • Exploit pre-sorted data by doing map-side merge join
      • Estimate number of reducers
        • Hard to measure effect of filters
        • Run map side for small part of input to estimate #reducers
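      Map-side hash joins later landed in Hive as a query hint; a sketch using the slide-17 tables (the MAPJOIN hint is from later Hive releases, not from this talk):
        hive> SELECT /*+ MAPJOIN(u) */ pv.pageid, u.age
            > FROM page_view pv JOIN user u ON (pv.userid = u.userid);
      Each mapper loads the small user table into an in-memory hash table, so the join completes map-side with no shuffle or reduce phase.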
    • 22. Hive QL – Group By
            • SELECT pageid, age, count(1)
            • FROM pv_users
            • GROUP BY pageid, age;
      pv_users:
        pageid  age
        1       25
        2       25
        1       32
        2       25
      result:
        pageid  age  count
        1       25   1
        2       25   2
        1       32   1
    • 23. Hive QL – Group By in Map Reduce
      Map: each mapper emits <pageid, age> as the key with a partial count of 1:
        mapper 1 (rows (1,25), (2,25)):  <1,25> → 1, <2,25> → 1
        mapper 2 (rows (1,32), (2,25)):  <1,32> → 1, <2,25> → 1
      Shuffle/Sort: reducer 1 receives <1,25> → 1 and <1,32> → 1; reducer 2 receives <2,25> → 1 twice.
      Reduce: sums the counts per key, emitting (1, 25, 1), (1, 32, 1) and (2, 25, 2).
    • 24. Hive QL – Group By with Distinct
        • SELECT pageid, COUNT(DISTINCT userid)
        • FROM page_view GROUP BY pageid
      page_view:
        pageid  userid  time
        1       111     9:08:01
        2       111     9:08:13
        1       222     9:08:14
        2       111     9:08:20
      result:
        pageid  count_distinct_userid
        1       2
        2       1
    • 25. Hive QL – Group By with Distinct in Map Reduce
      Map: emits the composite key <pageid, userid> so duplicate userids land together.
      Shuffle/Sort: reducer 1 receives <1,111>, <2,111>, <2,111>; reducer 2 receives <1,222>.
      First Reduce: de-duplicates userids per pageid, emitting partial counts (1, 1) and (2, 1) from one reducer and (1, 1) from the other.
      Second Map Reduce: sums the partials per pageid, giving the final (1, 2) and (2, 1).
    • 26. Group by Future optimizations
      • Map side partial aggregations
        • Hash Based aggregates (Done; sketched below)
        • Serialized key/values in hash tables
        • Exploit pre-sorted data for distinct counts
      • Partial aggregations in Combiner
      • Be smarter about how to avoid multiple stages (Done)
      • Exploit table/column statistics for deciding strategy
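      The map-side hash aggregation above is controlled by a session setting; a sketch (hive.map.aggr is the real knob; the query reuses slide 22's table):
        hive> SET hive.map.aggr=true;
        hive> SELECT pageid, age, COUNT(1) FROM pv_users GROUP BY pageid, age;
      With the flag on, each mapper keeps an in-memory hash table of partial counts and emits one record per group instead of one per input row.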
    • 27. Dealing with Structured Data
      • Type system
        • Primitive types
        • Recursively build up using Composition/Maps/Lists
      • ObjectInspector interface for user-defined types
        • To recursively list schema
        • To recursively access fields within a row object
      • Generic (De)Serialization Interface (SerDe)
      • Serialization families implement interface
        • Thrift DDL based SerDe
        • Delimited text based SerDe (sketched below)
        • You can write your own SerDe (XML, JSON …)
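      A sketch combining complex types with the delimited text SerDe (the session_log table is illustrative; \001-\003 are Hive's customary default delimiters):
        hive> CREATE TABLE session_log (
            >   userid INT,
            >   pages ARRAY<STRING>,
            >   props MAP<STRING, STRING>)
            > ROW FORMAT DELIMITED
            >   FIELDS TERMINATED BY '\001'
            >   COLLECTION ITEMS TERMINATED BY '\002'
            >   MAP KEYS TERMINATED BY '\003';
      The ObjectInspector interface is what lets the engine traverse pages and props generically, without the query layer knowing the on-disk format.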
    • 28. MetaStore
      • Stores Table/Partition properties (CLI browsing sketched below):
        • Table schema and SerDe library
        • Table Location on HDFS
        • Logical Partitioning keys and types
        • Partition level metadata
        • Other information
      • Thrift API
        • Current clients in PHP (Web Interface), Python (Hive interface), and Java (Query Engine and CLI)
      • Metadata stored in any SQL backend
      • Future
        • Statistics
        • Schema Evolution
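      A sketch of the browsing the MetaStore enables at the CLI (using the illustrative clicks table from slide 14):
        hive> SHOW TABLES;
        hive> SHOW PARTITIONS clicks;
        hive> DESCRIBE EXTENDED clicks;
      All three are answered from the SQL backend alone, without scanning any data on HDFS.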
    • 29. Future Work
      • Cost-based optimization
      • Multiple interfaces (JDBC…)/Integration with BI
      • SQL Compliance (order by, nested queries…)
      • Indexing
      • Data Compression
        • Columnar storage schemes
        • Exploit lazy/functional Hive field retrieval interfaces
      • Better data locality
        • Co-locate hash partitions on same rack
        • Exploit high intra-rack bandwidth for merge joins
      • Advanced operators
        • Cubes/Frequent Item Sets
    • 30. Hive Status
      • Available Hadoop Sub-project
        • http://svn.apache.org/repos/asf/hadoop/hive/trunk
      • [email_address]
      • IRC: #hive
      • VLDB demo submission
      • Hivers@Facebook:
        • Ashish Thusoo
        • Zheng Shao
        • Prasad Chakka
        • Namit Jain
        • Raghu Murthy
        • Suresh Anthony
    • 31. Hive/Hadoop Usage @ Facebook
      • Summarization
        • Eg: Daily/Weekly aggregations of impression/click counts
        • Complex measures of user engagement
      • Ad hoc Analysis
        • Eg: how many group admins broken down by state/country (sketched below)
      • Data Mining (Assembling training data)
        • Eg: User Engagement as a function of user attributes
      • Spam Detection
        • Anomalous patterns in UGC
        • Application api usage patterns
      • Ad Optimization
      • Too many to count ..
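      A sketch of such an ad hoc query (the group_admins table and its columns are hypothetical):
        hive> SELECT country, state, COUNT(DISTINCT userid)
            > FROM group_admins
            > GROUP BY country, state;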
    • 32. Data Warehousing at Facebook Today [Data-flow diagram: Web Servers → Scribe Servers → Filers → Hive on Hadoop Cluster, alongside Oracle RAC and Federated MySQL]
    • 33. Hadoop Usage @ Facebook
      • Data statistics:
        • Total Data: ~2.5PB
        • Net Data added/day: ~15TB
          • 6TB of uncompressed source logs
          • 4TB of uncompressed dimension data reloaded daily
        • Compression Factor ~5x (gzip, more with bzip)
      • Usage statistics:
        • 3200 jobs/day with 800K map-reduce tasks/day
        • 55TB of compressed data scanned daily
        • 15TB of compressed output data written to hdfs
        • 80 MM compute minutes/day
    • 34. In Pictures [charts not reproduced]
    • 35. Hadoop Challenges @ Facebook
      • QOS/Isolation:
        • Big jobs can hog the cluster
        • JobTracker memory as limited resource
        • Limit memory impact of runaway tasks
        • Fair Scheduler (Matei)
      • Protection
        • What if software bug/disaster destroys NameNode metadata?
        • HDFS SnapShots (Dhruba)
      • Data Archival
        • Not all data is hot and needs colocation with Compute
        • Hadoop Data Archival Layer
    • 36. Hadoop Challenges @ Facebook
      • Performance
        • Really hard to understand what systemic bottlenecks are
        • Workloads are variable during daytime
      • Small Job Performance
        • Sampling encourages small test queries
        • Hadoop awful at locality for small jobs
        • Need to reduce task startup time (JVM reuse)
        • Large number of mappers each with small output produces terrible performance
        • → Global Scheduling for better locality (Matei)
    • 37. Hadoop Wish List
      • HDFS:
        • 3-way replication is unsustainable
          • N+k erasure codes
        • Snapshots
          • Design out – but need people to work on it
        • Namenode on-disk metadata
          • In-memory model poses fundamental limits on growth
        • Application hints on block/file co-location
      • Map-Reduce
        • Performance
        • Resource aware scheduling
        • Multi-Stage Map-Reduce