Bigger Faster Easier: LinkedIn Hadoop Summit 2015

We discuss LinkedIn's big data ecosystem and its evolution through the years. We introduce three open source projects: Gobblin for ingestion, Cubert for computation, and Pinot for fast OLAP serving. We also showcase WhereHows, our in-house data discovery and lineage portal.

Published in: Data & Analytics

  1. Powering the Data Driven Network. Kapil Surlaker (Director of Engineering) and Shirshanka Das. Hadoop Summit 2015
  2. [no text on slide]
  3. How does PYMK work?
  4. Houston, we have a problem
  5. Step 1: Central transport pipeline
  6. Still have a problem
  7. Hadoop ingest pipeline complexity
  8. Step 2: Central ingestion framework
  9. Requirements: source diversity; batch and streaming; data quality
  10. Gobblin Architecture
  11. Source → Work Unit (×3) → Task (×3): Extract → Convert → Quality → Write → Data Publish
  12. Taming source diversity: REST, SFTP, JDBC. Diagram labels: Protocol, Config, Source, Extractor, checkpoint
  13. Solving for real-time: inefficiencies in batch; YARN-based; Apache Helix; continuous; auto-scaling. Diagram labels: YARN, Helix, Executor 1, Executor 2, Executor 3, HDFS, Stream Source
  14. Data Quality: per record, per task, or per job; composable quality checkers (schema compatibility, audit check, sensitive fields, unique key); policy driven. Diagram labels: Record, Writer, Job, Task, Quality Checker, Fail, Quarantine, Policy Checker
  15. Current Activity: open source @ github.com/linkedin/gobblin; in production @ LinkedIn; tens of TB per day; hundreds of datasets; ~20 different sources; Gobblin on YARN
  16. Transformation: No one size fits all
  17. Cubert: Converting hours to minutes. http://github.com/linkedin/cubert. Physical language; block organization; specialized operators
  18. Got Diversity?
  19. Where is the billings data? How did it get here? What data is used to create inferred skills data? Who owns that flow? When will the latest profile data show up?
  20. [no text on slide]
  21. Where is my data? How did it get here? … WhereHows
  22. WhereHows architecture
  23. [no text on slide]
  24. [no text on slide]
  25. [no text on slide]
  26. Lineage
  27. WhereHows: Roadmap. Streaming ecosystem integration (Kafka, Samza); recommendations for datasets, metrics; exploring open source
  28. Real-time. Interactive.
  29. Slice and dice metrics
  30. Precompute! Raw rows (Device, Geo, View): (Android, US, 1); (Android, IN, 1); (iOS, US, 1). Precomputed (Dimension, View): Android 2; iOS 1; US 2; IN 1; Android,US 1; iOS,US 1; Android,IN 1
  31. More dimensions! Raw rows (Device, Geo, Carrier, View): (Android, US, ATT, 1); (Android, IN, Reliance, 1); (iOS, US, Verizon, 1). Precomputed (Dimension, View): Android 2; iOS 1; US 2; IN 1; ATT 1; Reliance 1; Verizon 1; Android,US 1; ... ...
  32. Challenges: horizontally scalable; low latency; data freshness; fault tolerance; OLAP features
  33. Introducing Pinot
  34. Key features: SQL-like interface; columnar storage and indexing; real-time data load
  35. (S)QL: Filters and Aggs. SELECT count(*) FROM companyFollowHistoricalEvents WHERE entityId = 121011 AND 'day' >= 15949 AND 'day' <= 15963 AND paid = 'y' AND action = 'stop'
  36. (S)QL: Group By. SELECT count(*) FROM companyFollowHistoricalEvents WHERE entityId = 121011 AND 'day' >= 15949 AND 'day' <= 15963 AND paid = 'y' GROUP BY action
  37. (S)QL: ORDER BY and LIMIT. SELECT * FROM companyFollowHistoricalEvents WHERE entityId = 121011 AND entityId = 1000 AND action = 'start' ORDER BY creationTime DESC LIMIT 1
  38. Columnar Storage
  39. Forward Index
  40. Pinot Architecture. Diagram labels: Queries, Broker, Helix, Real time, Historical, Kafka, Samza, Hadoop, Raw Data
  41. Fast but needs a ton of RAM
  42. To pre-compute or not?
  43. Data aware pre-computation
  44. Pinot@LinkedIn: site-facing apps; reporting dashboards; monitoring
  45. Breaking the cycle: form hypothesis; query; repeat; OR …
  46. Hmm... what's up with Portuguese- and Spanish-speaking countries?
  47. Brazil?
  48. [no text on slide]
  49. Holidays in Brazil 2015
  50. Pinot Roadmap: Pinot is Open Source! github.com/linkedin/pinot
  51. Thanks! Kapil Surlaker (@kapilsurlaker) and Shirshanka Das (@shirshanka). github.com/linkedin/: gobblin, cubert, pinot
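
The "Precompute!" and "More dimensions!" slides show raw view rows being rolled up into one count per dimension value and per combination of values. A minimal Python sketch of that idea follows, using the three rows from the two-dimension slide. The function name `precompute` and the key layout are illustrative assumptions, not anything taken from Pinot or from the talk.

```python
from collections import Counter
from itertools import combinations


def precompute(rows, dimensions, metric):
    """Sum `metric` for every non-empty combination of dimension columns.

    Keys are tuples of (dimension, value) pairs so that equal values in
    different columns cannot collide. This is a hypothetical helper for
    illustration only.
    """
    cube = Counter()
    for row in rows:
        for size in range(1, len(dimensions) + 1):
            for dims in combinations(dimensions, size):
                key = tuple((d, row[d]) for d in dims)
                cube[key] += row[metric]
    return cube


# The three raw rows from the two-dimension slide.
rows = [
    {"device": "Android", "geo": "US", "view": 1},
    {"device": "Android", "geo": "IN", "view": 1},
    {"device": "iOS", "geo": "US", "view": 1},
]

cube = precompute(rows, ["device", "geo"], "view")
for key, views in sorted(cube.items()):
    print(",".join(value for _, value in key), views)
```

With d dimensions, each raw row contributes to 2^d - 1 groupings, so the number of precomputed entries can grow quickly as dimensions are added. That growth is the trade-off the later slides "Fast but needs a ton of RAM" and "To pre-compute or not?" point at.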

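The Gobblin slides describe each task as a chain of extract, convert, quality check, and write stages, with a policy deciding whether a failing record fails the task or is quarantined. The sketch below illustrates that flow in plain Python under stated assumptions: `run_task`, its parameters, and the sample records are hypothetical and do not reflect Gobblin's actual Java interfaces.

```python
def run_task(records, converters, checkers, on_fail="quarantine"):
    """Process one work unit: convert each extracted record, check it,
    then either write it or apply the failure policy.

    Hypothetical illustration only; not the Gobblin API.
    """
    written, quarantined = [], []
    for record in records:                      # extract
        for convert in converters:              # convert
            record = convert(record)
        if all(check(record) for check in checkers):   # quality
            written.append(record)              # write
        elif on_fail == "quarantine":
            quarantined.append(record)          # set aside for inspection
        else:
            raise ValueError(f"record failed quality check: {record!r}")
    return written, quarantined


# Made-up sample data: one clean record and one missing its unique key.
records = [
    {"id": 1, "name": " alice "},
    {"id": None, "name": "bob"},
]
strip_name = lambda r: {**r, "name": r["name"].strip()}
has_unique_key = lambda r: r["id"] is not None

written, quarantined = run_task(records, [strip_name], [has_unique_key])
print(written)
print(quarantined)
```

Passing the checkers as a list mirrors the "composable quality checkers" bullet: a schema check or a sensitive-field check could be added as another callable without changing the task loop.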