
Presto Summit 2018 - 02 - LinkedIn


LinkedIn’s Presto Adventure (Mark Wagner, LinkedIn)
Presto Summit 2018

Published in: Data & Analytics


  1. LinkedIn’s Presto Adventures
     Mark Wagner, Engineer, LinkedIn
  2. Analytics at LinkedIn
     • Reporting – understanding business performance and activity
     • Product – understanding users and making data-informed decisions
     • Customer service – investigating incorrect behavior on the site
     • Data products – building products for our users, powered by data
     • Research – economics research powered by the data we have
     • Systems engineering – analyzing internal system performance
  3. LinkedIn Analytics circa 2014
     Engineering:
     • Hadoop-centric
     • Pig is the dominant tool, followed by Hive
     • Bad addiction to Avro
     Everyone else:
     • “Traditional” DWH
     • Data siphoned out of HDFS
     • Faster, more expensive, harder to scale
  4. First steps with Presto
     • Separate cluster for Presto
     • Replicate data in
     • Avro?
     • How can we manage all this?
     (Diagram: main HDFS cluster replicating into the Presto cluster)
  5. Apache Gobblin
     • Monitor tables for new data
     • Convert any new partitions or snapshots to ORC and replicate to the other cluster
     • Sort the data while we’re at it
     • Easy to scale, easy to onboard
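The replication step above boils down to diffing what exists on the main cluster against what has already been copied, then converting only the new partitions. A minimal sketch of that diff, with hypothetical types standing in for Gobblin's real dataset abstractions:

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

// Toy model of the replication step: diff the partitions on the main
// cluster against those already on the Presto cluster, and plan a
// convert-to-ORC job for each new one. (Hypothetical names; Gobblin's
// actual dataset/work-unit API is different.)
public class ReplicationPlanner {
    public static List<String> newPartitions(Set<String> source, Set<String> replicated) {
        return source.stream()
                .filter(p -> !replicated.contains(p))
                .sorted() // deterministic job ordering
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Set<String> source = new LinkedHashSet<>(List.of(
                "ds=2018-06-01", "ds=2018-06-02", "ds=2018-06-03"));
        Set<String> replicated = new LinkedHashSet<>(List.of(
                "ds=2018-06-01", "ds=2018-06-02"));
        // Only the partition missing on the Presto cluster gets converted.
        System.out.println(newPartitions(source, replicated)); // [ds=2018-06-03]
    }
}
```

Monitoring tables then reduces to re-running this diff on a schedule and handing each new partition to a convert-and-copy job.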
  6. Scaling up
     • Users love Presto!
     • Adoption took off
     • How are people using it?
     • Is the experience consistent?
     • What problems are people having?
     (Chart: daily users, artist’s impression)
  7. Kafka logging
     • Had operational metrics, needed tracking
     • Publish everything to Kafka
     • Bring back to Presto for analysis
     • Used to require modification, now it’s just a plugin
     • No excuse not to do this

     public interface EventListener {
         default void queryCreated(…) { }
         default void queryCompleted(…) { }
         default void splitCompleted(…) { }
     }
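A sketch of what such a plugin looks like: an EventListener implementation that turns each completed query into a log record bound for a Kafka topic. The interface below is a simplified stand-in for Presto's SPI (the real methods take event objects, not scalars), and an in-memory list stands in for the Kafka producer:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the logging plugin: an EventListener that serializes each
// completed query to a JSON record. The interface is a simplified
// stand-in for Presto's SPI, and the in-memory sink stands in for a
// Kafka producer.
public class QueryLogger {
    interface EventListener {
        default void queryCompleted(String queryId, String user, long wallMillis) { }
    }

    static class KafkaEventListener implements EventListener {
        final List<String> sink = new ArrayList<>(); // stand-in for a Kafka topic

        @Override
        public void queryCompleted(String queryId, String user, long wallMillis) {
            // In production this would be a producer.send(...) call.
            sink.add(String.format("{\"id\":\"%s\",\"user\":\"%s\",\"ms\":%d}",
                    queryId, user, wallMillis));
        }
    }

    public static void main(String[] args) {
        KafkaEventListener listener = new KafkaEventListener();
        listener.queryCompleted("20180601_0001", "mwagner", 4200);
        System.out.println(listener.sink.get(0));
    }
}
```

Because every method on the SPI interface has a default no-op body, a listener only overrides the events it cares about.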
  8. Meta analysis
     • Resource-intensive tables – which tables are worth optimizing?
     • HDFS locality – are we processing near the storage?
     • Bug triaging – how many people might be impacted?
     • User support – “Oops! I didn’t save my query”
     • User growth – who’s using Presto? Who’s new?
     • Tools used – are people using the tools we provide? Are they having trouble?
     • Expensive queries – where are the resources going?
     • Lineage analysis – who is using this data and how?
  9. That cluster?
     • Got us off the ground
     • Demonstrated value
     • Need to get out of this mess!
  10. Life in a silo
      • Very hard to get new data in
      • Equally hard to get data out
      • Can’t interoperate with our other tools
      • Don’t want to do a hard cutover
  11. Other data sources
      • JDBC
      • Pinot
      • Kafka
      • Venice
      • Espresso
      • Elasticsearch
      • OpenTSDB
      • Should we burden a user with this?
  12. Federation: make many data sources look like one
      • Hide the physical location of data from users
      • Allow infrastructure providers to make changes behind the scenes
      • Scans and writes are routed to connectors dynamically
  13. Federation of connectors
      • Like a multiplexer for connectors
      • Any connector can be federated
      • Flexible routing mechanism
      (Diagram: read/write operations enter the federation plugin, whose pluggable routing logic dispatches to the Hive, Kafka, and JDBC connectors)
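The "multiplexer" idea can be sketched as a router that maps each table reference to an underlying connector. The prefix-based rules below are purely illustrative; the deck only says the routing logic is pluggable:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Minimal sketch of the federation multiplexer: route each table
// reference to a backing connector by name prefix. The rules here are
// hypothetical examples, not LinkedIn's actual routing logic.
public class FederationRouter {
    private final Map<String, String> routes = new LinkedHashMap<>();

    public void addRoute(String tablePrefix, String connector) {
        routes.put(tablePrefix, connector);
    }

    public String route(String table) {
        // First matching prefix wins; rules are checked in insertion order.
        for (Map.Entry<String, String> e : routes.entrySet()) {
            if (table.startsWith(e.getKey())) {
                return e.getValue();
            }
        }
        return "hive"; // default backing store
    }

    public static void main(String[] args) {
        FederationRouter router = new FederationRouter();
        router.addRoute("tracking.", "kafka");
        router.addRoute("metrics.", "jdbc");
        System.out.println(router.route("tracking.page_views")); // kafka
        System.out.println(router.route("prod.members"));        // hive
    }
}
```

Because the routing sits behind one plugin, infrastructure teams can move a table between connectors without users changing a single query.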
  14. Dali views: encapsulating business logic
      • All the goodness of views, but for Hadoop!
      • An abstraction layer for data
      • Manage data the same way you manage libraries and services
  15. Dali tooling
      • Manage your data like a service
      • Allow versioned views for compatibility
      • Authoring, testing, deployment, and life-cycle tools
      (Diagram: authoring tools commit to git, validation and versioning publish to Artifactory, and deployment pushes to the metastores)
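"Versioned views for compatibility" works like pinning a library dependency: a consumer pins a major version and resolves to the newest compatible deployment. A toy illustration of that resolution, with hypothetical names rather than Dali's actual API:

```java
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

// Toy illustration of versioned views: resolve a pinned major version
// to the newest deployed view with that major, the way a dependency
// manager resolves "2.+". (Hypothetical names, not Dali's API.)
public class ViewResolver {
    public static Optional<String> resolve(List<String> deployed, int major) {
        return deployed.stream()
                .filter(v -> v.startsWith(major + "."))
                .max(Comparator.comparing(ViewResolver::minorPatch));
    }

    // Order "major.minor.patch" strings by minor, then patch.
    private static long minorPatch(String version) {
        String[] parts = version.split("\\.");
        return Long.parseLong(parts[1]) * 100_000 + Long.parseLong(parts[2]);
    }

    public static void main(String[] args) {
        List<String> deployed = List.of("1.4.2", "2.0.0", "2.1.3", "3.0.0");
        // A reader pinned to major version 2 keeps working after 3.0.0 ships.
        System.out.println(resolve(deployed, 2).orElse("none")); // 2.1.3
    }
}
```

The payoff is the same as with libraries: view owners can ship a breaking 3.0.0 while pinned readers keep resolving to the 2.x line.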
  16. HiveQL to Presto SQL conversion
      • Presto can’t run HiveQL
      • Translate through Apache Calcite
      • Evaluate the result as a Presto view
      • Coverage for most Hive features
      (Diagram: the Hive analyzer feeds the Calcite SQL generator; if the result is itself a Presto view it goes back through view analysis, otherwise it is emitted as Presto SQL)
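The real pipeline analyzes HiveQL with the Hive analyzer and regenerates Presto SQL through Apache Calcite's relational representation. This toy translator shows only the flavor of the dialect gap, using one genuinely textual difference (identifier quoting):

```java
// Toy illustration of the dialect gap, NOT the Calcite-based pipeline:
// Hive quotes identifiers with backticks, while Presto uses ANSI
// double quotes. Real translation goes through a parsed representation,
// since most differences are not textual.
public class DialectSketch {
    public static String hiveToPresto(String sql) {
        return sql.replace("`", "\"");
    }

    public static void main(String[] args) {
        System.out.println(hiveToPresto("SELECT `first_name` FROM members"));
        // SELECT "first_name" FROM members
    }
}
```

Going through Calcite instead of string rewrites is what gives the "coverage for most Hive features" bullet its teeth: semantic differences (UDF names, lateral views, type coercions) need a real intermediate representation.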
  17. UDFs
      • Every platform has its own API
      • Bridge across them with a common abstraction
      • Zero development cost per UDF per platform
      • A bit of runtime overhead
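The common-abstraction idea can be sketched as follows: the business logic lives behind one shared interface, and thin per-platform adapters wrap it in whatever callable shape each engine expects. Everything below is illustrative; the adapter shown is a plain `Function`, where real adapters would implement the Hive, Presto, or Spark UDF interfaces:

```java
import java.util.function.Function;

// Sketch of "write once, run on every engine": one shared UDF
// implementation plus thin per-platform adapters. (Hypothetical
// interface; the real engines each define their own UDF API.)
public class PortableUdf {
    // The single shared implementation of the custom logic.
    interface Udf1<I, O> {
        O eval(I input);
    }

    static final Udf1<String, String> NORMALIZE_COUNTRY =
            code -> code == null ? null : code.trim().toUpperCase();

    // A per-platform adapter: same logic, new wrapper. This is the
    // "zero development cost per UDF per platform" bullet; the extra
    // indirection is the "bit of runtime overhead".
    static Function<String, String> asJavaFunction(Udf1<String, String> udf) {
        return udf::eval;
    }

    public static void main(String[] args) {
        System.out.println(asJavaFunction(NORMALIZE_COUNTRY).apply(" us ")); // US
    }
}
```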
  18. Dali views: encapsulating business logic
      • Translate SQL dialects
      • Build cross-platform UDFs for custom logic
      • Keep all optimization and execution in Presto
  19. 4 takeaways
      • Gobblin – a big hammer for optimizing data
      • Kafka logger – measure everything, analyze later
      • Federation – making many data sources look like one
      • Dali views – encapsulating business logic
  20. We’re hiring! Questions? Contact me on LinkedIn or email.