©2013 LinkedIn Corporation. All Rights Reserved.
Hive at LinkedIn
Hive at LinkedIn
Hive efforts at LinkedIn and the experiences of Hive users.
Presented by Mohammad Islam, Mark Wagner, Karthik Ramasamy

Published in: Technology
  • Hive: ad-hoc queries, reporting, and business analytics. Pig: ETL pipelines and production workflows. MapReduce: highly specialized applications. Azkaban: LinkedIn workflows.
  • Which process; data operations can detect the root cause; email, HTTP address
  • Context of the problem
  • Hive at LinkedIn

    1. Hive at LinkedIn
    2. Agenda
       • LinkedIn Data and its Ecosystem
       • Performance Improvements – Avro
       • User experiences
    3. LinkedIn Data Sources
       • Event Data
         – Page Views
         – Clicks
         – Search queries
       • Database Data
         – Profile (Users & Companies)
         – Connections
       • External Data
         – Salesforce, DoubleClick
    4. Data Ecosystem at LinkedIn: Where's the Data at LinkedIn? (diagram, © 2013 LinkedIn, 24 June 2013)
       • Member Data (Profiles)
       • Espresso and RDBMS
       • External Partner Data
       • Member Activity (Page views, button clicks)
       • Kafka Topics
       • Front-end Serving Systems
       • Member-facing systems
       • Lots of cool stuff not in this picture!
    5. Data Ecosystem at LinkedIn (diagram): Member Facing Systems
    6. Data Ecosystem at LinkedIn (diagram): Member Facing Systems
    7. Data Ecosystem at LinkedIn (diagram): Member Facing Systems
    8. Data Ecosystem at LinkedIn (diagram): Member Facing Systems
    9. Data in Hadoop
       • Almost all LinkedIn data is stored in Hadoop
       • Tools used
         – Hive/HCatalog
         – Pig
         – Java MapReduce
         – Azkaban
    10. Hive Usage
        • Use-cases
          – Ad-hoc query
          – Reporting
          – Building Platforms
            • Segmentation Engine
            • Experimentation Engine
        • Users
          – Data Scientist
          – Business Analytics
          – Security team
          – Product team
    11. Hive Challenges
        • Performance
          – Faster query execution
            • Efficient MR* execution plan
          – Effective resource usage
          – Ensure cluster stability
    12. LinkedIn Hive Initiatives
        • Make HCatalog work and deploy [Ongoing]
        • Hive Performance Improvement (Avro data reading) [Ongoing]
        • Stabilize Hive Server 2 at LI [About to Start]
        • Expand the scope of HCatalog metadata [Planning]
    13. HCatalog Initiatives
        • Expand scope of metadata
          – Who creates this data?
          – What are the inputs?
            • Helpful to create data lineage
          – Who is the maintainer of data?
    14. (Image slide) Courtesy: iclipart.com
    15. What is the Problem?
        • Reading an Avro record takes a long time
          – 52 μs/record
        • Found the hotspot using VisualVM
    16. Improvement #1
        • Reduce the number of Schema.equals() calls
        • Schema equality checks are required primarily for evolved schemas
        • Solution includes caching to avoid unnecessary expensive calls
        • Results
          – Trunk read overhead: 52 μs/record
          – Read overhead after this patch: 32 μs/record
    17. Improvement #2
        • Reduce extra data transformations
        • Solution is to provide custom object inspectors
        • Results
          – Current read overhead: 52 μs/record
          – Read overhead after this patch: 30 μs/record
    18. Final Results (bar chart, axis 0 to 60)
        • Trunk: 55
        • Improvement #1: 32
        • Improvement #2: 30
        • Combined: 11
    19. (Image slide) Courtesy: iclipart.com
    20. Hive User Base at LinkedIn
        • Out of all our Hadoop users:
          – 56% Never Used Hive
          – 44% Use Hive
          – 27% Primarily use Hive
        • 32% of Hive jobs were from ad-hoc queries
    21. Who uses Hive and who doesn’t
        • Hive adoption among Hadoop users by job title (bar chart, 0% to 100%)
          – Data Scientists
          – Engineers
          – Product Managers
          – Customer Support Specialists
          – Analysts
    22. Top concerns about Hive
        • Not friendly for long/complex workflows
        • Performance, especially for ad-hoc queries
        • Steep learning curve for tuning
        • Data/UDFs unavailability
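
Improvement #1 (slide 16) says the fix reduces Schema.equals() calls by caching. The slides do not show the patch itself, so the following is only a minimal Python sketch of the general idea, not the actual Hive or Avro code. The names `Schema`, `SchemaComparer`, and `same_schema` are hypothetical, and a plain dataclass stands in for a real Avro schema. It assumes the same pair of schema objects is seen for every record in a file, so the deep comparison needs to run only once per pair.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Schema:
    """Hypothetical stand-in for a record schema: a name plus (field, type) pairs."""
    name: str
    fields: tuple


class SchemaComparer:
    """Caches the outcome of deep schema comparisons (illustrative only).

    The cache is keyed on object identity, so comparing the same two schema
    objects repeatedly costs one deep comparison plus cheap dictionary lookups.
    """

    def __init__(self):
        # Maps (id(writer), id(reader)) -> (writer, reader, result).
        # Keeping the objects in the value keeps them alive, so their ids
        # cannot be reused by other objects while the entry exists.
        self._cache = {}
        self.deep_comparisons = 0

    def same_schema(self, writer, reader):
        key = (id(writer), id(reader))
        entry = self._cache.get(key)
        if entry is None:
            # The expensive part: field-by-field structural comparison.
            self.deep_comparisons += 1
            entry = (writer, reader, writer == reader)
            self._cache[key] = entry
        return entry[2]


if __name__ == "__main__":
    writer = Schema("PageView", (("member_id", "long"), ("url", "string")))
    reader = Schema("PageView", (("member_id", "long"), ("url", "string")))
    comparer = SchemaComparer()
    for _ in range(1000):  # one check per record read
        comparer.same_schema(writer, reader)
    print("deep comparisons performed:", comparer.deep_comparisons)
```

Keying on identity rather than on schema contents is a deliberate choice in this sketch: hashing a large schema can cost about as much as comparing it, whereas an identity lookup is cheap for every record after the first.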
