This document discusses using Scalding, a Scala-based domain-specific language for writing MapReduce jobs on Hadoop, for data-driven product development. The document describes how LinkedIn uses Scalding for tasks such as processing web and application data at large scale. It highlights benefits such as succinct code and a high level of abstraction, and reports that dozens of Scalding flows run in LinkedIn's production environment alongside thousands of Hadoop jobs.
13. /scalding
http://github.com/twitter/scalding
• Scala-based DSL for Map/Reduce jobs
• Built on Cascading, a stable and mature Hadoop framework
• Uses API similar to Scala collections:
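import com.twitter.scalding._

// Counts word occurrences in the input text and writes (word, count) pairs as TSV.
class WordCountJob(args: Args) extends Job(args) {
  TextLine(args("input"))
    // Split each line on whitespace, emitting one 'word field per token.
    .flatMap('line -> 'word) { line: String => line.split("""\s+""") }
    .groupBy('word) { _.size }
    .write(Tsv(args("output")))
}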
• Succinct and powerful
• High level of abstraction
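To make the comparison with Scala collections concrete, the same word count can be sketched against an in-memory Seq using only the Scala standard library. This is an illustration, not part of the original slides: the names WordCountLocal and wordCount are hypothetical, and the extra filter for empty tokens is an assumption added so that leading whitespace does not produce a blank word.

```scala
// In-memory analogue of the Scalding word count, using only the standard library.
// WordCountLocal and wordCount are hypothetical names chosen for this sketch.
object WordCountLocal {
  // Split each line on whitespace and count how often each word occurs.
  def wordCount(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(_.split("""\s+"""))  // corresponds to .flatMap('line -> 'word)
      .filter(_.nonEmpty)           // assumption: drop empty tokens from leading whitespace
      .groupBy(identity)            // corresponds to .groupBy('word)
      .map { case (word, occurrences) => word -> occurrences.size } // corresponds to { _.size }

  def main(args: Array[String]): Unit = {
    val counts = wordCount(Seq("hello scalding", "hello hadoop"))
    // Print tab-separated pairs, sorted by word, similar in shape to Tsv output.
    counts.toSeq.sortBy(_._1).foreach { case (word, n) => println(s"$word\t$n") }
  }
}
```

The pipeline has the same shape in both versions: flatMap to words, group by word, take the group size. The difference is that Scalding plans those steps as MapReduce stages over data in HDFS rather than evaluating them in memory.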
18. /linkedin/hadoop/practices
• All online data end up in HDFS
– Avro encoding is standard
• Production Process
– CI/Automatic Build
• More info forthcoming
– Production Review
– Operations and Monitoring
• More info at http://lnkd.in/gridops2013
• Result: Thousands of jobs running in production
• More info at http://lnkd.in/big-data-ecosystem
20. /linkedin/scalding/status
• Started >1 year ago
• Thousands of production LOC written in Scalding by our team
– Pretty happy with readability, maintainability and tooling support
• Dozens of flows are currently in production, and counting
• Created Scalding user group
• Growing interest
• Learning:
– Scala[Scalding] < Scala[ _ ]
22. /linkedin/join-us
• Work on unique and interesting problems
• Be part of a great engineering community
• Use the latest tools and technologies
• Help connect the world’s professionals and make them more productive and successful
• We are looking for amazing people interested in Software Engineering and Data Science
– http://linkedin.com/careers
Questions?