Scalding: Twitter's Scala DSL for Hadoop/Cascading

Talk given at the 2012 Hadoop Summit in San Jose, CA.

Scalding is a Scala DSL for Cascading which brings natural functional programming to Hadoop. It is open-source, developed by Twitter and others.

Follow: twitter.com/scalding



Presentation Transcript

  • @Scalding https://github.com/twitter/scalding Oscar Boykin, Twitter, @posco
  • #hadoopsummit I encourage live tweeting (mention @posco/@scalding)
  • • What is Scalding? • Why Scala for Map/Reduce? • How is it used at Twitter? • What’s next for Scalding?
  • Yep, we’re counting words: Scalding jobs subclass Job
  • Yep, we’re counting words: Logic is in the constructor
  • Yep, we’re counting words: Functions can be called or defined inline
  • Scalding Model • Source objects read and write data (from HDFS, DBs, MemCache, etc.) • Pipes represent the flows of the data in the job. You can think of a Pipe as a distributed list.
  • Yep, we’re counting words: Read and write data through Source objects
  • Yep, we’re counting words: Data is modeled as streams of named Tuples (of objects)
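
The slides' code was not captured in this transcript. As a rough stand-in, the sketch below counts words with plain Scala collections rather than the Scalding API, treating a `Seq` as a local substitute for the "distributed list" that a Pipe represents. The names `WordCountSketch`, `tokenize`, and `wordCount` are assumptions made for illustration.

```scala
object WordCountSketch {
  // Split a line into lowercase words on runs of whitespace, dropping empty tokens.
  def tokenize(line: String): Seq[String] =
    line.toLowerCase.split("\\s+").toSeq.filter(_.nonEmpty)

  // flatMap lines to words, group equal words together, then take each group's size.
  def wordCount(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(tokenize)
      .groupBy(identity)
      .map { case (word, occurrences) => word -> occurrences.size }
}
```

The shape is the same one the slides describe: a source of lines, a function defined inline or called by name, and a grouping step that reduces each group to a count.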
  • Why Scala • The Scala language has a lot of built-in features that make domain-specific languages easy to implement. • Map/Reduce is already within the functional paradigm. • Scala’s collection API covers almost all usual use cases.
  • Word Co-occurrence
  • Word Co-occurrence: We can use standard Scala containers
  • Word Co-occurrence: We can do real logic in the mapper without external UDFs.
  • Word Co-occurrence: Generalized “plus” handles lists/sets/maps and can be customized (implement Monoid[T])
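
To illustrate the generalized "plus", here is a minimal sketch of a monoid type class in plain Scala. The `Monoid` trait below is a simplified stand-in written for this example, not the definition Scalding ships, and `sum` and `cooccurrence` are hypothetical helper names.

```scala
object MonoidSketch {
  // Simplified stand-in: an associative plus with an identity element.
  trait Monoid[T] {
    def zero: T
    def plus(a: T, b: T): T
  }

  object Monoid {
    implicit val intMonoid: Monoid[Int] = new Monoid[Int] {
      def zero: Int = 0
      def plus(a: Int, b: Int): Int = a + b
    }

    // Sets combine by union.
    implicit def setMonoid[A]: Monoid[Set[A]] = new Monoid[Set[A]] {
      def zero: Set[A] = Set.empty[A]
      def plus(a: Set[A], b: Set[A]): Set[A] = a ++ b
    }

    // Maps combine key by key, using the value type's own Monoid.
    implicit def mapMonoid[K, V](implicit vm: Monoid[V]): Monoid[Map[K, V]] =
      new Monoid[Map[K, V]] {
        def zero: Map[K, V] = Map.empty[K, V]
        def plus(a: Map[K, V], b: Map[K, V]): Map[K, V] =
          b.foldLeft(a) { case (acc, (k, v)) =>
            acc.updated(k, vm.plus(acc.getOrElse(k, vm.zero), v))
          }
      }
  }

  // Sum any collection whose element type has a Monoid.
  def sum[T](items: Iterable[T])(implicit m: Monoid[T]): T =
    items.foldLeft(m.zero)(m.plus)

  // For each word, count how often every other word shares a line with it.
  def cooccurrence(lines: Seq[Seq[String]]): Map[String, Map[String, Int]] =
    sum(lines.map { words =>
      val distinct = words.distinct
      distinct.map(w => w -> distinct.filter(_ != w).map(_ -> 1).toMap).toMap
    })
}
```

Because the map instance is built from the value type's instance, nested maps of counts add up without any co-occurrence-specific merge code.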
  • GroupBuilder: enabling parallel reductions • groupBy takes a function that mutates a GroupBuilder. • GroupBuilder adds fields which are reductions of (potentially different) inputs. • On the left, we add 7 fields.
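
The idea of adding several reductions to one grouping can be sketched locally with plain Scala collections. This is not the GroupBuilder API; `Sale`, `Stats`, and `statsByRegion` are hypothetical names, and the sketch adds three result fields in a single pass over each group.

```scala
object GroupReductionSketch {
  // Hypothetical input record, for illustration only.
  final case class Sale(region: String, amount: Double)

  // The result "fields" added for each group.
  final case class Stats(count: Long, total: Double, max: Double)

  // Group by region, then compute count, total, and max together in one fold.
  def statsByRegion(sales: Seq[Sale]): Map[String, Stats] =
    sales.groupBy(_.region).map { case (region, group) =>
      val stats = group.foldLeft(Stats(0L, 0.0, Double.NegativeInfinity)) { (acc, s) =>
        Stats(acc.count + 1, acc.total + s.amount, math.max(acc.max, s.amount))
      }
      region -> stats
    }
}
```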
  • scald.rb • A driver script that compiles the job and runs it locally or transfers and runs remotely. • We plan to add EMR support.
  • Most functions in the API have very close analogs in scala.collection.Iterable.
  • Cascading • is the Java library that handles most of the map/reduce planning for Scalding. • has years of production use. • is used, tested, and optimized by many teams (Concurrent Inc.; DSLs in Scala, Clojure, and Python at Twitter; Ruby at Etsy). • has a (very fast) local mode that runs without Hadoop. • has a flow planner designed to be portable (Cascading on Spark? Storm?)
  • mapReduceMap • We abstract Cascading’s map-side aggregation ability with a function called mapReduceMap. • If only mapReduceMaps are called, map-side aggregation works. If a foldLeft is called (which cannot be done map-side), Scalding falls back to pushing everything to the reducers.
  • Most Reductions are mapReduceMap
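
A local sketch of the map, reduce, map pattern follows, with nested sequences standing in for the data held by each mapper. The signature is an assumption made for this example and is not claimed to match Scalding's own `mapReduceMap`. It relies on the reduce function being associative, which is what lets partial results be combined before the shuffle.

```scala
object MapReduceMapSketch {
  // mapFn turns each input into an intermediate value, reduceFn combines
  // intermediates (it must be associative), finishFn produces the final result.
  def mapReduceMap[T, X, U](partitions: Seq[Seq[T]])(mapFn: T => X)(
      reduceFn: (X, X) => X)(finishFn: X => U): Option[U] = {
    // "Map side": each partition maps its values and pre-reduces them locally.
    val partials: Seq[X] = partitions.flatMap(p => p.map(mapFn).reduceOption(reduceFn))
    // "Reduce side": combine the per-partition partial results, then finish.
    partials.reduceOption(reduceFn).map(finishFn)
  }

  // Mean as map (value to (sum, count)), reduce (pairwise add), map (divide).
  def mean(partitions: Seq[Seq[Double]]): Option[Double] =
    mapReduceMap(partitions)(x => (x, 1L))((a, b) => (a._1 + b._1, a._2 + b._2)) {
      case (sum, n) => sum / n
    }
}
```

A general foldLeft gives no such guarantee: its accumulator depends on seeing the values in sequence, so there is nothing to merge across partitions.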
  • Optimized Joins • The map-side join is called joinWithTiny. It implements a left or inner join with a very small pipe. • blockJoin: deals with data skew by replicating the data (useful for walking the Twitter follower graph, where everyone follows Gaga/Bieber/Obama). • Coming: combine the above to dynamically set replication on a per-key basis: only Gaga is replicated, and just the right amount.
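
The principle behind a map-side join can be sketched with plain collections: the small side is held in memory as a lookup table and every row of the large side probes it, so no shuffle is needed. This is an illustration and not Scalding's implementation; the function names are hypothetical, and the sketch assumes each key appears at most once on the tiny side.

```scala
object TinyJoinSketch {
  // Inner join: keep only the big-side rows whose key exists on the tiny side.
  def innerJoinWithTiny[K, A, B](big: Seq[(K, A)], tiny: Map[K, B]): Seq[(K, A, B)] =
    big.flatMap { case (k, a) => tiny.get(k).map(b => (k, a, b)) }

  // Left join: keep every big-side row, with None where the tiny side has no match.
  def leftJoinWithTiny[K, A, B](big: Seq[(K, A)], tiny: Map[K, B]): Seq[(K, A, Option[B])] =
    big.map { case (k, a) => (k, a, tiny.get(k)) }
}
```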
  • Scalding @Twitter • The revenue quality team (ads targeting, market insight, click-prediction, traffic-quality) uses Scalding for all our work. • Scala engineers throughout the company use it (e.g. storage, platform). • More than 60 in-production Scalding jobs, more than 200 ad-hoc jobs. • Not our only tool: Pig, PyCascading, Cascalog, and Hive are also used.
  • Example: finding similarity • A simple recommendation algorithm is cosine similarity. • Represent user-tweet interaction as a vector, then find the users whose vectors point in directions near the user in question. • We’ve developed a Matrix library on top of Scalding to make this easy.
  • Cosine Similarity: Matrices are strongly typed.
  • Cosine Similarity: Col, Row types (Int, Int) can be anything comparable. Strings are useful for text indices.
  • Cosine Similarity: Value (Double) can be anything with a Ring[T] (plus/times)
  • Cosine Similarity: Operator overloading gives intuitive code.
  • Matrix in foreground, map/reduce behind: With this syntax, we can focus on logic, not how to map linear algebra to Hadoop
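
The cosine similarity slides' code is not in the transcript. As a hedged illustration of the computation, the sketch below stores a sparse matrix as a plain `Map` from (row, column) to `Double`, normalizes each row to unit length, and multiplies the result by its own transpose. It does not use Scalding's Matrix library, and all names in it are assumptions.

```scala
object CosineSketch {
  // Sparse matrix: only non-zero cells are stored.
  type Matrix[R, C] = Map[(R, C), Double]

  // Scale every row to unit L2 norm (rows with zero norm are dropped).
  def rowL2Normalize[R, C](m: Matrix[R, C]): Matrix[R, C] = {
    val norms: Map[R, Double] = m.groupBy { case ((r, _), _) => r }
      .map { case (r, entries) => r -> math.sqrt(entries.values.map(v => v * v).sum) }
    m.collect { case ((r, c), v) if norms(r) > 0.0 => (r, c) -> v / norms(r) }
  }

  def transpose[R, C](m: Matrix[R, C]): Matrix[C, R] =
    m.map { case ((r, c), v) => (c, r) -> v }

  // Sparse product: join on the shared index, multiply, then sum per output cell.
  def times[R, K, C](a: Matrix[R, K], b: Matrix[K, C]): Matrix[R, C] = {
    val bByRow: Map[K, Seq[(C, Double)]] = b.toSeq.groupBy { case ((k, _), _) => k }
      .map { case (k, es) => k -> es.map { case ((_, c), v) => (c, v) } }
    val products = for {
      ((r, k), av) <- a.toSeq
      (c, bv) <- bByRow.getOrElse(k, Seq.empty[(C, Double)])
    } yield ((r, c), av * bv)
    products.groupBy(_._1).map { case (cell, ps) => cell -> ps.map(_._2).sum }
  }

  // Cosine similarity between rows: normalized matrix times its transpose.
  def cosineSimilarity[R, C](m: Matrix[R, C]): Matrix[R, R] = {
    val n = rowL2Normalize(m)
    times(n, transpose(n))
  }
}
```

With users as rows and tweets as columns, `cosineSimilarity` yields a user-by-user matrix; pairs of users with no tweet in common do not appear in the result.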
  • Example uses: • Do random walks on the following graph. Matrix power iteration until convergence: (m * m * m * m). • Dimensionality reduction of the follower graph (matrix product by a lower-dimensional projection matrix). • Triangle counting: (M*M*M).trace / 3
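
The triangle-counting formula can be checked on a small dense matrix. The sketch below is a plain Scala illustration, not the Scalding Matrix library, and assumes a non-empty square 0/1 adjacency matrix of a directed graph with no self-loops. Entry (i, i) of M*M*M counts the length-3 walks from i back to i, so each directed 3-cycle is counted once per starting vertex, which is where the division by 3 comes from.

```scala
object TriangleSketch {
  type Matrix = Vector[Vector[Long]]

  // Dense matrix product; assumes compatible, non-empty dimensions.
  def times(a: Matrix, b: Matrix): Matrix =
    a.map { row =>
      b.head.indices.toVector.map { j =>
        row.indices.map(k => row(k) * b(k)(j)).sum
      }
    }

  // Sum of the diagonal entries.
  def trace(m: Matrix): Long = m.indices.map(i => m(i)(i)).sum

  // Each directed 3-cycle a -> b -> c -> a shows up once for each of its 3 vertices.
  def directedTriangles(m: Matrix): Long = trace(times(times(m, m), m)) / 3
}
```

For a symmetric matrix (an undirected graph), every triangle is traversed in both orientations, so this function returns twice the number of undirected triangles.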
  • What is next? • Improve the logical flow planning (reorder commuting filters/projections before maps, etc.). • Improve Matrix flow planning to narrow the gap to hand-optimized code.
  • One more thing: • Type-safety geeks can relax: we just pushed a type-safe API to Scalding 0.6.0, analogous to Scoobi/Scrunch/Spark.
  • That’s it. • Follow and mention: @scalding @posco • Pull reqs: http://github.com/twitter/scalding