• Save
Scalding: Twitter's New DSL for Hadoop
Upcoming SlideShare
Loading in...5

Scalding: Twitter's New DSL for Hadoop






Total Views
Views on SlideShare
Embed Views



2 Embeds 75

http://eventifier.co 57
http://eventifier.com 18



Upload Details

Uploaded via as Adobe PDF

Usage Rights

© All Rights Reserved

Report content

Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

  • Full Name Full Name Comment goes here.
    Are you sure you want to
    Your message goes here
Post Comment
Edit your comment

    Scalding: Twitter's New DSL for Hadoop Scalding: Twitter's New DSL for Hadoop Presentation Transcript

    • @Scaldinghttps://github.com/twitter/scalding Oscar Boykin Twitter @posco
    • #hadoopsummitI encourage live tweeting (mention @posco/@scalding)
    • • What is Scalding?• Why Scala for Map/Reduce?• How is it used at Twitter?• What’s next for Scalding?
    • Yep, we’re counting words:Scalding jobssubclass Job
    • Yep, we’re counting words:Logic is in the constructor
    • Yep, we’re counting words:Functions can be called ordefined inline
    • Scalding Model• Source objects read and write data (from HDFS, DBs, MemCache, etc...)• Pipes represent the flows of the data in the job. You can think of Pipe as a distributed list.
    • Yep, we’re counting words: Read and Write data throughSource objects
    • Yep, we’re counting words:Data is modeled as streams ofnamed Tuples (of objects)
    • Why Scala• The scala language has a lot of built-in features that make domain-specific languages easy to implement.• Map/Reduce is already within the functional paradigm.• Scala’s collection API covers almost all usual use cases.
    • Word Co-occurrence
    • Word Co-occurrence We can usestandard scala containers
    • Word Co-occurrenceWe can do real logic in themapper withoutexternal UDFs.
    • Word Co-occurrence Generalized“plus” handleslists/sets/maps and can be customized (implement Monoid[T])
    • GroupBuilder: enabling parallel reductions • groupBy takes a function that mutates a GroupBuilder. • GroupBuilder adds fields which are reductions of (potentially different) inputs. • On the left, we add 7 fields.
    • scald.rb• driver script that compiles the job and runs it locally or transfers and runs remotely.• we plan to add EMR support.
    • Most functions in the APIhave very close analogs in scala.collection.Iterable.
    • Cascading• is the java library that handles most of the map/ reduce planning for scalding.• has years of production use.• is used, tested, and optimized by many teams (Concurrent Inc., DSLs in Scala, Clojure, Python @Twitter. Ruby at Etsy).• has a (very fast) local mode that runs without Hadoop.• flow planner designed to be portable (cascading on Spark? Storm?)
    • mapReduceMap• We abstract Cascading’s map-side aggregation ability with a function called mapReduceMap.• If only mapReduceMaps are called, map-side aggregation works. If a foldLeft is called (which cannot be done map-side), scalding falls back to pushing everything to the reducers.
    • Most Reductions are mapReduceMap
    • Optimized Joins• mapside join is called joinWithTiny. Implements left or inner join with a very small pipe.• blockJoin: deals with data skew by replicating the data (useful for walking the Twitter follower graph, where everyone follows Gaga/Bieber/Obama).• coming: combine the above to dynamically set replication on a per key basis: only Gaga is replicated, and just the right amount.
    • Scalding @Twitter• Revenue quality team (ads targeting, market insight, click-prediction, traffic-quality) uses scalding for all our work.• Scala engineers throughout the company use it (i.e. storage, platform).• More than 60 in-production scalding jobs, more than 200 ad-hoc jobs.• Not our only tool: Pig, PyCascading, Cascalog, Hive are also used.
    • Example: finding similarity• A simple recommendation algorithm is cosine similarity.• Represent user-tweet interaction as a vector, then find the users whose vectors point in directions near the user in question.• We’ve developed a Matrix library on top of scalding to make this easy.
    • Cosine SimilarityMatrices are strongly typed.
    • Cosine Similarity Col,Rowtypes (Int,Int) can be anything comparable. Strings areuseful for text indices.
    • Cosine Similarity Value(Double) can be anythingwith a Ring[T] (plus/times)
    • Cosine Similarity Operator overloadinggives intuitive code.
    • Matrix in foreground, map/reduce behind With this syntax, we can focus on logic,not how to maplinear algebra to Hadoop
    • Example uses:• Do random-walks on the following graph. Matrix power iteration until convergence: (m * m * m * m).• Dimensionality reduction of follower graph (Matrix product by a lower dimensional projection matrix).• Triangle counting: (M*M*M).trace / 3
    • What is next?• Improve the logical flow planning (reorder commuting filters/projections before maps, etc...).• Improve Matrix flow planning to narrow the gap to hand optimized code.
    • One more thing:• Type-safety geeks can relax: we just pushed a type-safe API to scalding 0.6.0 analogous to Scoobi/Scrunch/Spark, so relax.
    • That’s it.• follow and mention: @scalding @posco• pull reqs: http://github.com/twitter/scalding