Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.
@Scaldinghttps://github.com/twitter/scalding          Oscar Boykin            Twitter            @posco
#hadoopsummitI encourage live tweeting (mention @posco/@scalding)
• What is Scalding?• Why Scala for Map/Reduce?• How is it used at Twitter?• What’s next for Scalding?
Yep, we’re counting               words:Scalding jobssubclass Job
Yep, we’re counting                words:Logic is in the constructor
Yep, we’re counting              words:Functions can be called ordefined inline
Scalding Model• Source objects read and write data (from  HDFS, DBs, MemCache, etc...)• Pipes represent the flows of the da...
Yep, we’re counting               words:  Read and  Write data   throughSource objects
Yep, we’re counting               words:Data is modeled  as streams ofnamed Tuples (of     objects)
Why Scala• The scala language has a lot of built-in  features that make domain-specific  languages easy to implement.• Map/...
Word Co-occurrence
Word Co-occurrence We can usestandard scala  containers
Word Co-occurrenceWe can do real  logic in themapper withoutexternal UDFs.
Word Co-occurrence  Generalized“plus” handleslists/sets/maps   and can be  customized  (implement  Monoid[T])
GroupBuilder: enabling parallel reductions           • groupBy takes a             function that mutates a             Gro...
scald.rb• driver script that compiles the job and runs  it locally or transfers and runs remotely.• we plan to add EMR sup...
Most functions in the APIhave very close analogs in scala.collection.Iterable.
Cascading• is the java library that handles most of the map/  reduce planning for scalding.• has years of production use.•...
mapReduceMap• We abstract Cascading’s map-side  aggregation ability with a function called  mapReduceMap.• If only mapRedu...
Most Reductions are mapReduceMap
Optimized Joins• mapside join is called joinWithTiny.  Implements left or inner join with a very  small pipe.• blockJoin: ...
Scalding @Twitter• Revenue quality team (ads targeting, market  insight, click-prediction, traffic-quality) uses  scalding ...
Example: finding         similarity• A simple recommendation algorithm is  cosine similarity.• Represent user-tweet interac...
Cosine SimilarityMatrices are strongly  typed.
Cosine Similarity   Col,Rowtypes (Int,Int)     can be   anything comparable.  Strings areuseful for text    indices.
Cosine Similarity    Value(Double) can be anythingwith a Ring[T] (plus/times)
Cosine Similarity  Operator overloadinggives intuitive    code.
Matrix in foreground,        map/reduce behind    With this  syntax, we can  focus on logic,not how to maplinear algebra t...
Example uses:• Do random-walks on the following graph.  Matrix power iteration until convergence:  (m * m * m * m).• Dimen...
What is next?• Improve the logical flow planning (reorder  commuting filters/projections before maps,  etc...).• Improve Mat...
One more thing:• Type-safety geeks can relax: we just pushed  a type-safe API to scalding 0.6.0 analogous  to Scoobi/Scrun...
That’s it.• follow and mention: @scalding @posco• pull reqs: http://github.com/twitter/scalding
Scalding: Twitter's New DSL for Hadoop
Scalding: Twitter's New DSL for Hadoop
Upcoming SlideShare
Loading in …5
×

Scalding: Twitter's New DSL for Hadoop

6,602 views

Published on

Published in: Technology, Education

Scalding: Twitter's New DSL for Hadoop

  1. 1. @Scaldinghttps://github.com/twitter/scalding Oscar Boykin Twitter @posco
  2. 2. #hadoopsummitI encourage live tweeting (mention @posco/@scalding)
  3. 3. • What is Scalding?• Why Scala for Map/Reduce?• How is it used at Twitter?• What’s next for Scalding?
  4. 4. Yep, we’re counting words:Scalding jobssubclass Job
  5. 5. Yep, we’re counting words:Logic is in the constructor
  6. 6. Yep, we’re counting words:Functions can be called ordefined inline
  7. 7. Scalding Model• Source objects read and write data (from HDFS, DBs, MemCache, etc...)• Pipes represent the flows of the data in the job. You can think of Pipe as a distributed list.
  8. 8. Yep, we’re counting words: Read and Write data throughSource objects
  9. 9. Yep, we’re counting words:Data is modeled as streams ofnamed Tuples (of objects)
  10. 10. Why Scala• The scala language has a lot of built-in features that make domain-specific languages easy to implement.• Map/Reduce is already within the functional paradigm.• Scala’s collection API covers almost all usual use cases.
  11. 11. Word Co-occurrence
  12. 12. Word Co-occurrence We can usestandard scala containers
  13. 13. Word Co-occurrenceWe can do real logic in themapper withoutexternal UDFs.
  14. 14. Word Co-occurrence Generalized“plus” handleslists/sets/maps and can be customized (implement Monoid[T])
  15. 15. GroupBuilder: enabling parallel reductions • groupBy takes a function that mutates a GroupBuilder. • GroupBuilder adds fields which are reductions of (potentially different) inputs. • On the left, we add 7 fields.
  16. 16. scald.rb• driver script that compiles the job and runs it locally or transfers and runs remotely.• we plan to add EMR support.
  17. 17. Most functions in the APIhave very close analogs in scala.collection.Iterable.
  18. 18. Cascading• is the java library that handles most of the map/ reduce planning for scalding.• has years of production use.• is used, tested, and optimized by many teams (Concurrent Inc., DSLs in Scala, Clojure, Python @Twitter. Ruby at Etsy).• has a (very fast) local mode that runs without Hadoop.• flow planner designed to be portable (cascading on Spark? Storm?)
  19. 19. mapReduceMap• We abstract Cascading’s map-side aggregation ability with a function called mapReduceMap.• If only mapReduceMaps are called, map-side aggregation works. If a foldLeft is called (which cannot be done map-side), scalding falls back to pushing everything to the reducers.
  20. 20. Most Reductions are mapReduceMap
  21. 21. Optimized Joins• mapside join is called joinWithTiny. Implements left or inner join with a very small pipe.• blockJoin: deals with data skew by replicating the data (useful for walking the Twitter follower graph, where everyone follows Gaga/Bieber/Obama).• coming: combine the above to dynamically set replication on a per key basis: only Gaga is replicated, and just the right amount.
  22. 22. Scalding @Twitter• Revenue quality team (ads targeting, market insight, click-prediction, traffic-quality) uses scalding for all our work.• Scala engineers throughout the company use it (i.e. storage, platform).• More than 60 in-production scalding jobs, more than 200 ad-hoc jobs.• Not our only tool: Pig, PyCascading, Cascalog, Hive are also used.
  23. 23. Example: finding similarity• A simple recommendation algorithm is cosine similarity.• Represent user-tweet interaction as a vector, then find the users whose vectors point in directions near the user in question.• We’ve developed a Matrix library on top of scalding to make this easy.
  24. 24. Cosine SimilarityMatrices are strongly typed.
  25. 25. Cosine Similarity Col,Rowtypes (Int,Int) can be anything comparable. Strings areuseful for text indices.
  26. 26. Cosine Similarity Value(Double) can be anythingwith a Ring[T] (plus/times)
  27. 27. Cosine Similarity Operator overloadinggives intuitive code.
  28. 28. Matrix in foreground, map/reduce behind With this syntax, we can focus on logic,not how to maplinear algebra to Hadoop
  29. 29. Example uses:• Do random-walks on the following graph. Matrix power iteration until convergence: (m * m * m * m).• Dimensionality reduction of follower graph (Matrix product by a lower dimensional projection matrix).• Triangle counting: (M*M*M).trace / 3
  30. 30. What is next?• Improve the logical flow planning (reorder commuting filters/projections before maps, etc...).• Improve Matrix flow planning to narrow the gap to hand optimized code.
  31. 31. One more thing:• Type-safety geeks can relax: we just pushed a type-safe API to scalding 0.6.0 analogous to Scoobi/Scrunch/Spark, so relax.
  32. 32. That’s it.• follow and mention: @scalding @posco• pull reqs: http://github.com/twitter/scalding

×