Hive at LinkedIn

Hive efforts at LinkedIn and experiences of Hive users.
Presented by Mohammad Islam, Mark Wagner, and Karthik Ramasamy

Published in: Technology

  • Hive: ad hoc queries and reporting, business analytics. Pig: ETL pipelines, production workflows. MapReduce: highly specialized applications. Azkaban: LinkedIn workflows.
  • Which process? Data operations can detect root cause. Email, HTTP address.
  • Context of the problem
  • Transcript

    • 1. ©2013 LinkedIn Corporation. All Rights Reserved. Hive at LinkedIn
    • 2. Agenda
      • LinkedIn Data and its Ecosystem
      • Performance Improvements
        • Avro
      • User experiences
    • 3. LinkedIn Data Sources
      • Event Data
        • Page Views
        • Clicks
        • Search queries
      • Database Data
        • Profile (Users & Companies)
        • Connections
      • External Data
        • Salesforce, DoubleClick
    • 4. Where's the Data at LinkedIn? (Data Ecosystem at LinkedIn, 24 June 2013; diagram labels)
      • Member Data (Profiles)
      • Espresso and RDBMS
      • External Partner Data
      • Member Activity (Page views, button clicks)
      • Kafka Topics
      • Front-end Serving Systems
      • Member-facing systems
      • Lots of cool stuff not in this picture!
    • 5. Data Ecosystem at LinkedIn (diagram: Member Facing Systems)
    • 6. Data Ecosystem at LinkedIn (diagram: Member Facing Systems)
    • 7. Data Ecosystem at LinkedIn (diagram: Member Facing Systems)
    • 8. Data Ecosystem at LinkedIn (diagram: Member Facing Systems)
    • 9. Data in Hadoop
      • Almost all LinkedIn data is stored in Hadoop
      • Tools used
        • Hive/HCatalog
        • Pig
        • Java MapReduce
        • Azkaban
    • 10. Hive Usage
      • Use-cases
        • Ad-hoc query
        • Reporting
        • Building Platforms
          • Segmentation Engine
          • Experimentations Engine
      • Users
        • Data Scientist
        • Business Analytics
        • Security team
        • Product team
    • 11. Hive Challenges
      • Performance
        • Faster query execution
          • Efficient MR* execution plan
        • Effective resource usage
        • Ensure cluster stability
    • 12. LinkedIn Hive Initiatives
      • Make HCatalog work and deploy [Ongoing]
      • Hive Performance Improvement (Avro data reading) [Ongoing]
      • Stabilize Hive Server 2 at LI [About to Start]
      • Expand the scope of HCatalog metadata [Planning]
    • 13. HCatalog Initiatives
      • Expand scope of meta-data
        • Who creates this data?
        • What are the inputs?
          • Helpful to create data lineage
        • Who is the maintainer of data?
    • 14. Courtesy: iclipart.com
    • 15. What is the Problem?
      • Reading an Avro record takes a long time
        • 52 μs/record
      • Found the hotspot using VisualVM
    • 16. Improvement #1
      • Reduce the number of Schema.equals() calls
      • Schema equality checks are required primarily for evolved schemas
      • Solution includes caching to avoid unnecessary expensive calls
      • Results
        • Trunk read overhead: 52 μs/record
        • Read overhead after this patch: 32 μs/record
    • 17. Improvement #2
      • Reduce extra data transformations
      • Solution is to provide custom object inspectors
      • Results
        • Current read overhead: 52 μs/record
        • Read overhead after this patch: 30 μs/record
    • 18. Final Results (bar chart, vertical axis 0 to 60)
      • Trunk: 55
      • Improvement #1: 32
      • Improvement #2: 30
      • Combined: 11
    • 19. Courtesy: iclipart.com
    • 20. Hive User Base at LinkedIn
      • Out of all our Hadoop users:
        • 56% Never Used Hive
        • 44% Use Hive
        • 27% Primarily use Hive
      • 32% of Hive jobs were from ad-hoc queries
    • 21. Who uses Hive and who doesn’t (bar chart, axis 0% to 100%: Hive adoption among Hadoop users by job title)
      • Data Scientists
      • Engineers
      • Product Managers
      • Customer Support Specialists
      • Analysts
    • 22. Top concerns about Hive
      • Not friendly for long/complex workflows
      • Performance, especially for ad-hoc queries
      • Steep learning curve for tuning
      • Data/UDFs unavailability
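
The "Improvement #1" slide says only that the fix caches results to avoid repeated Schema.equals() calls; it does not show the patch. As a rough illustration of that idea, and not the actual Hive or Avro code, the Python sketch below remembers the result of comparing a pair of schema objects so that the expensive comparison runs once per pair rather than once per record. The `Schema` and `SchemaEqualityCache` classes, the `deep_equals` method, and the `PageView` fields are all hypothetical names made up for this example.

```python
class Schema:
    """Hypothetical stand-in for a record schema (not the real Avro class)."""

    deep_comparisons = 0  # how many expensive structural comparisons ran

    def __init__(self, name, fields):
        self.name = name
        self.fields = tuple(fields)  # (field name, field type) pairs

    def deep_equals(self, other):
        # Stands in for an expensive field-by-field schema comparison.
        Schema.deep_comparisons += 1
        return self.name == other.name and self.fields == other.fields


class SchemaEqualityCache:
    """Remembers the result of comparing a given pair of schema objects."""

    def __init__(self):
        self._results = {}
        self._pinned = []  # keeps compared schemas alive so id() stays unique

    def equal(self, writer, reader):
        if writer is reader:  # same object: nothing to compare
            return True
        key = (id(writer), id(reader))
        if key not in self._results:
            self._results[key] = writer.deep_equals(reader)
            self._pinned.append((writer, reader))
        return self._results[key]


writer = Schema("PageView", [("memberId", "long"), ("url", "string")])
reader = Schema("PageView", [("memberId", "long"), ("url", "string")])
cache = SchemaEqualityCache()

# 1000 records that all carry the same writer schema object.
records = [(writer, {"memberId": i, "url": "/home"}) for i in range(1000)]
unchanged = sum(cache.equal(w, reader) for w, _ in records)

print(unchanged, Schema.deep_comparisons)
```

In this sketch all 1000 records pass the equality check while the deep comparison runs only once. Keying the cache on object identity is one possible design choice; it assumes a reader reuses the same schema objects across the records of a file.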