How LinkedIn Uses Scalding for Data Driven Product Development

Slides from the Cascading meetup May 29, 2014 http://www.meetup.com/cascading/events/177491292/

Published in: Data & Analytics

Transcript

  • 1. Using Scalding for Data-Driven Product Development Sasha Ovsankin LinkedIn
  • 2. http://linkedin.com/in/sashao • Studied Mathematical Physics at Moscow University • Software Engineering background • Work at LinkedIn on Email Experience • Publish open source at https://github.com/SashaOv • Publish music at SoundCloud
  • 3. /home Scalding is a must-have tool in your arsenal of Hadoop development. – Hadoop ecosystem at LinkedIn – Hadoop development tools – Scalding: why and how – What we do with Scalding, code examples.
  • 4. /linkedin/hadoop/overview Online Apps Databases NoSQL Data Stores Hadoop HDFS Hadoop Flows Tracking/logging Analytics Data Products Messaging Message delivery
  • 5. /linkedin/hadoop/practices • All online data end up in HDFS – Mostly encoded in Avro • Production Process – CI/Automatic Build • More info forthcoming – Production Review – Operations and Monitoring • More info at http://lnkd.in/gridops2013 • Result: Thousands of jobs running in production • More info at http://lnkd.in/big-data-ecosystem
  • 6. /linkedin/hadoop/dev-tools • PIG • Java MR • Scalding • +many others, will not talk about them today
  • 7. /hadoop/dev-tools/PIG • Relatively mature tool – first official release 2008 • Easy to learn • Availability of experienced people • Extendable via UDF
  • 8. /hadoop/dev-tools/Java • Java MR – Maximum flexibility with Hadoop API – Verbose • Cascading – Retain (some) Java flexibility – Less verbose
  • 9. /hadoop/dev-tools/Scalding http://github.com/twitter/scalding • Scala-based DSL • Built on Cascading, a stable and mature framework • Uses API similar to Scala collections:

        import com.twitter.scalding._

        class WordCountJob(args : Args) extends Job(args) {
          TextLine( args("input") )
            // split each line on runs of whitespace, emitting one 'word per token
            .flatMap('line -> 'word) { line : String => line.split("""\s+""") }
            // count the occurrences of each word
            .groupBy('word) { _.size }
            .write( Tsv( args("output") ) )
        }

    • Succinct and powerful • High level of abstraction
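
To illustrate the point on slide 9 that the Scalding API resembles Scala collections, here is a minimal sketch of the same word count written against an in-memory Scala collection. It uses only the Scala standard library and is not Scalding code; the object name `LocalWordCount` and the sample input are assumptions made for illustration.

```scala
object LocalWordCount {
  // Same shape as the Scalding job: split lines into words, group, count.
  def wordCount(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(_.split("""\s+"""))  // one element per whitespace-separated token
      .filter(_.nonEmpty)           // drop the empty token produced by leading whitespace
      .groupBy(identity)            // group identical words together
      .map { case (word, occurrences) => word -> occurrences.size }

  def main(args: Array[String]): Unit = {
    val sample = Seq("hello scalding", "hello hadoop") // hypothetical input
    wordCount(sample).toSeq.sortBy(_._1).foreach {
      case (word, count) => println(s"$word\t$count")
    }
  }
}
```

The `flatMap` and `groupBy` steps map one-to-one onto the calls in the Scalding job; the main difference is that Scalding runs them as Hadoop jobs over named fields rather than over a local `Seq`.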
  • 10. …/tools/comparison
      Criterion | PIG | Java/Scala
      Debugging: stack traces | No* | Yes
      Code reuse | Macros, jobs | Classes, packages, modules, frameworks…
      Custom data structures/algorithms | UDF | Native
      Packaging | Fat jars | Thin jars
      Avro support | Partial | Native
      Unit testing | PigUnit (in Java) | Standard unit testing frameworks: JUnit/TestNG/MRUnit, Scalding tests

      Criterion | PIG | Java MR | Scalding
      LOC count | Small* | Large | Small
  • 11. …/tools/buyers-guide
      If you need… | Then use…
      Quick-and-dirty simple scripts, existing UDFs | PIG, Hive
      Complex flows, full access to Avro, debugging, unit testing, productization | Scalding
      Full flexibility of Hadoop API but not too complex processing | Java MR
  • 12. /linkedin/email-experience • Goal – Improve messaging users’ experience • Plan – Track – Experiment – Optimize – Personalize • Implementation – Generate messages offline – Apply sophisticated relevance algorithms – Shorten the release cycle to facilitate fast iteration
  • 13. /linkedin/email-experience/overview Content sources (PIG) HDFS Content sources (Scalding) Content sources (Crunch) Targeting, Relevance (Scalding, Java ) Email/Message production (Java MR) Framework (Java) Online Delivery System
  • 14. …/email-experience/why-scalding • Scala + Map Reduce = match made in heaven scala> (1 to 1000) map { pow(_,2) } reduce { _ + _ } res20: Int = 333833500 • Stack traces (yeah!) • Native Avro support • Integrates well with CI/build system
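
The REPL line on slide 14 sums the squares of 1 to 1000 with `map` and `reduce`. A self-contained sketch of the same idea follows. It squares with integer multiplication rather than `pow`, whose definition is not shown on the slide, so the result stays an `Int`. The names `SumOfSquares` and `sumOfSquares` are assumptions.

```scala
object SumOfSquares {
  // "Map" step: square each element. "Reduce" step: add the results pairwise.
  // Requires n >= 1, because reduce throws on an empty collection.
  def sumOfSquares(n: Int): Int =
    (1 to n).map(x => x * x).reduce(_ + _)

  def main(args: Array[String]): Unit =
    println(sumOfSquares(1000)) // prints 333833500, the value shown on the slide
}
```

The same two-step shape, a per-record transformation followed by an associative combination, is what a MapReduce job distributes across mappers and reducers.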
  • 15. …/email-experience/code
  • 16. …/email-experience/code/2
  • 17. /linkedin/…/scalding/status • Started >1 year ago • Thousands of production LOC written in Scalding by our team – Pretty happy with readability and maintainability • ~10 flows are currently in production, and counting • Currently ~12 people are coding in Scalding • Created Scalding user group • Growing interest • Learning: – Scala[Scalding] < Scala[ _ ]
  • 18. /linkedin/…/scalding/users • Data science • Enterprise services • Email experience • Content
  • 19. /linkedin/…/scalding/what-to-improve • Better Scala language IDE tools • One-click development (-> demo) • Monitoring and troubleshooting – Counters – implemented in 0.9 – Better troubleshooting of the ser/de process • Better tools for tuning of jobs – setting #of mappers and reducers • Best practices
  • 20. /home Scalding is a must-have tool in your arsenal of Hadoop development. – Hadoop ecosystem at LinkedIn – Hadoop development tools – Scalding: why and how – What we do with Scalding, code examples.
  • 21. /linkedin/join-us • Work on unique and interesting problems • Be part of great engineering community • Use latest tools and technologies • Help connect the world’s professionals to help them become more productive and successful • We are looking for amazing people interested in Data Science and Software Engineering Questions?
