How LinkedIn Uses Scalding for Data Driven Product Development

Slides from the Cascading meetup May 29, 2014 http://www.meetup.com/cascading/events/177491292/

Published in: Data & Analytics

  1. Using Scalding for Data-Driven Product Development
     Sasha Ovsankin, LinkedIn
  2. http://linkedin.com/in/sashao
     • Studied Mathematical Physics at Moscow University
     • Software Engineering background
     • Work at LinkedIn on Email Experience
     • Publish open source at https://github.com/SashaOv
     • Publish music at SoundCloud
  3. /home
     Scalding is a must-have tool in your Hadoop development arsenal.
     – Hadoop ecosystem at LinkedIn
     – Hadoop development tools
     – Scalding: why and how
     – What we do with Scalding, code examples
  4. /linkedin/hadoop/overview
     (Diagram labels) Online Apps; Databases; NoSQL Data Stores; Hadoop HDFS; Hadoop Flows; Tracking/logging; Analytics; Data Products; Messaging; Message delivery
  5. /linkedin/hadoop/practices
     • All online data end up in HDFS
       – Mostly encoded in Avro
     • Production Process
       – CI/Automatic Build
         • More info forthcoming
       – Production Review
       – Operations and Monitoring
         • More info at http://lnkd.in/gridops2013
     • Result: Thousands of jobs running in production
     • More info at http://lnkd.in/big-data-ecosystem
  6. /linkedin/hadoop/dev-tools
     • PIG
     • Java MR
     • Scalding
     • Plus many others, not covered today
  7. /hadoop/dev-tools/PIG
     • Relatively mature tool
       – First official release 2008
     • Easy to learn
     • Availability of experienced people
     • Extendable via UDF
  8. /hadoop/dev-tools/Java
     • Java MR
       – Maximum flexibility with Hadoop API
       – Verbose
     • Cascading
       – Retain (some) Java flexibility
       – Less verbose
  9. /hadoop/dev-tools/Scalding
     http://github.com/twitter/scalding
     • Scala-based DSL
     • Built on Cascading, a stable and mature framework
     • Uses an API similar to Scala collections:

         import com.twitter.scalding._

         class WordCountJob(args : Args) extends Job(args) {
           TextLine( args("input") )
             // split each input line into words on runs of whitespace
             .flatMap('line -> 'word) { line : String => line.split("""\s+""") }
             // count the occurrences of each word
             .groupBy('word) { _.size }
             .write( Tsv( args("output") ) )
         }

     • Succinct and powerful
     • High level of abstraction
  10. …/tools/comparison

                                           PIG                 Java/Scala
      Debugging: stack traces              No*                 Yes
      Code reuse                           Macros, jobs        Classes, packages, modules, frameworks…
      Custom data structures/algorithms    UDF                 Native
      Packaging                            Fat jars            Thin jars
      Avro support                         Partial             Native
      Unit testing                         PigUnit (in Java)   Standard unit testing frameworks: JUnit/TestNG/MRUnit, Scalding tests

                     PIG       Java MR    Scalding
      LOC count      Small*    Large      Small
  11. …/tools/buyers-guide

      If you need…                                                                   Then use…
      Quick-and-dirty simple scripts, existing UDFs                                  PIG, Hive
      Complex flows, full access to Avro, debugging, unit testing, productization    Scalding
      Full flexibility of Hadoop API but not too complex processing                  Java MR
  12. /linkedin/email-experience
      • Goal
        – Improve messaging users’ experience
      • Plan
        – Track
        – Experiment
        – Optimize
        – Personalize
      • Implementation
        – Generate messages offline
        – Apply sophisticated relevance algorithms
        – Shorten the release cycle to facilitate fast iteration
  13. /linkedin/email-experience/overview
      (Diagram labels) Content sources (PIG); HDFS; Content sources (Scalding); Content sources (Crunch); Targeting, Relevance (Scalding, Java); Email/Message production (Java MR); Framework (Java); Online Delivery System
  14. …/email-experience/why-scalding
      • Scala + Map Reduce = match made in heaven

          scala> (1 to 1000) map { pow(_,2) } reduce { _ + _ }
          res20: Int = 333833500

      • Stack traces (yeah!)
      • Native Avro support
      • Integrates well with CI/build system
  15. …/email-experience/code
  16. …/email-experience/code/2
  17. /linkedin/…/scalding/status
      • Started >1 year ago
      • Thousands of production LOC written in Scalding by our team
        – Pretty happy with readability and maintainability
      • ~10 flows are currently in production, and counting
      • Currently ~12 people are coding in Scalding
      • Created Scalding user group
      • Growing interest
      • Learning:
        – Scala[Scalding] < Scala[ _ ]
  18. /linkedin/…/scalding/users
      • Data science
      • Enterprise services
      • Email experience
      • Content
  19. /linkedin/…/scalding/what-to-improve
      • Better Scala language IDE tools
      • One-click development (-> demo)
      • Monitoring and troubleshooting
        – Counters (implemented in 0.9)
        – Better troubleshooting of the ser/de process
      • Better tools for tuning of jobs
        – Setting the number of mappers and reducers
      • Best practices
  20. /home
      Scalding is a must-have tool in your Hadoop development arsenal.
      – Hadoop ecosystem at LinkedIn
      – Hadoop development tools
      – Scalding: why and how
      – What we do with Scalding, code examples
  21. /linkedin/join-us
      • Work on unique and interesting problems
      • Be part of a great engineering community
      • Use the latest tools and technologies
      • Help connect the world’s professionals to help them become more productive and successful
      • We are looking for amazing people interested in Data Science and Software Engineering
      Questions?
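
The Scalding slide says the DSL "uses API similar to Scala collections". As a rough illustration of that similarity, here is a minimal sketch of the same word count written against plain in-memory Scala collections, with no Hadoop or Scalding involved. The object name `WordCountSketch`, the method `wordCount`, and the sample input are hypothetical and not taken from the slides.

```scala
// Hypothetical illustration: word count on in-memory Scala collections,
// mirroring the flatMap / groupBy / size shape of the Scalding job.
object WordCountSketch {
  def wordCount(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(_.split("""\s+"""))   // split each line on runs of whitespace
      .filter(_.nonEmpty)            // drop empty tokens from blank or indented lines
      .groupBy(identity)             // group identical words together
      .map { case (word, occurrences) => (word, occurrences.size) }

  def main(args: Array[String]): Unit = {
    // Made-up sample input, for demonstration only.
    val counts = wordCount(Seq("to be or not to be", "be quick"))
    counts.toSeq.sortBy(_._1).foreach { case (w, n) => println(s"$w\t$n") }
  }
}
```

The pipeline has the same three stages as the Scalding job (split, group, count). The difference is that Scalding plans those stages as Hadoop work over named fields such as `'line` and `'word`, while this sketch runs in a single JVM.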

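The "Scala + Map Reduce" one-liner on the why-scalding slide can be unpacked into a small standalone program. This is a sketch under one assumption: it squares with integer multiplication so that the result stays an `Int`, as in the slide's REPL output. The slide's `pow` is not defined in the transcript, and `scala.math.pow` would return a `Double`. The names `SumOfSquaresSketch` and `sumOfSquares` are hypothetical.

```scala
// Hypothetical illustration of the map-then-reduce shape from the slide.
object SumOfSquaresSketch {
  // Map step: square each element. Reduce step: combine the squares with +.
  // Requires n >= 1, because reduce throws on an empty collection.
  def sumOfSquares(n: Int): Int =
    (1 to n).map(x => x * x).reduce(_ + _)

  def main(args: Array[String]): Unit =
    println(sumOfSquares(1000)) // 333833500, matching the slide's REPL result
}
```

The closed form n(n+1)(2n+1)/6 gives 1000 · 1001 · 2001 / 6 = 333833500, which agrees with the value shown on the slide.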