SlideShare a Scribd company logo
@Scalding
https://github.com/twitter/scalding


          Oscar Boykin
            Twitter
            @posco
#hadoopsummit
I encourage live tweeting (mention @posco/@scalding)
• What is Scalding?
• Why Scala for Map/Reduce?
• How is it used at Twitter?
• What’s next for Scalding?
Yep, we’re counting
               words:
Scalding jobs
subclass Job
Yep, we’re counting
                words:

Logic is in the
 constructor
Yep, we’re counting
              words:


Functions can
 be called or
defined inline
Scalding Model

• Source objects read and write data (from
  HDFS, DBs, MemCache, etc...)
• Pipes represent the flows of the data in the
  job. You can think of Pipe as a distributed
  list.
Yep, we’re counting
               words:


  Read and
  Write data
   through
Source objects
Yep, we’re counting
               words:


Data is modeled
  as streams of
named Tuples (of
     objects)
Why Scala
• The scala language has a lot of built-in
  features that make domain-specific
  languages easy to implement.
• Map/Reduce is already within the functional
  paradigm.
• Scala’s collection API covers almost all usual
  use cases.
Word Co-occurrence
Word Co-occurrence


 We can use
standard scala
  containers
Word Co-occurrence


We can do real
  logic in the
mapper without
external UDFs.
Word Co-occurrence

  Generalized
“plus” handles
lists/sets/maps
   and can be
  customized
  (implement
  Monoid[T])
GroupBuilder: enabling
 parallel reductions
           • groupBy takes a
             function that mutates a
             GroupBuilder.
           • GroupBuilder adds
             fields which are
             reductions of
             (potentially different)
             inputs.
           • On the left, we add 7
             fields.
scald.rb
• driver script that compiles the job and runs
  it locally or transfers and runs remotely.
• we plan to add EMR support.
Most functions in the API
have very close analogs in
 scala.collection.Iterable.
Cascading
• is the java library that handles most of the map/
  reduce planning for scalding.
• has years of production use.
• is used, tested, and optimized by many teams
  (Concurrent Inc., DSLs in Scala, Clojure, Python
  @Twitter. Ruby at Etsy).
• has a (very fast) local mode that runs without
  Hadoop.
• flow planner designed to be portable (cascading
  on Spark? Storm?)
mapReduceMap
• We abstract Cascading’s map-side
  aggregation ability with a function called
  mapReduceMap.
• If only mapReduceMaps are called, map-side
  aggregation works. If a foldLeft is called
  (which cannot be done map-side), scalding
  falls back to pushing everything to the
  reducers.
Most Reductions are
 mapReduceMap
Optimized Joins
• mapside join is called joinWithTiny.
  Implements left or inner join with a very
  small pipe.
• blockJoin: deals with data skew by
  replicating the data (useful for walking the
  Twitter follower graph, where everyone
  follows Gaga/Bieber/Obama).
• coming: combine the above to dynamically
  set replication on a per key basis: only Gaga
  is replicated, and just the right amount.
Scalding @Twitter
• Revenue quality team (ads targeting, market
  insight, click-prediction, traffic-quality) uses
  scalding for all our work.
• Scala engineers throughout the company
  use it (i.e. storage, platform).
• More than 60 in-production scalding jobs,
  more than 200 ad-hoc jobs.
• Not our only tool: Pig, PyCascading,
  Cascalog, Hive are also used.
Example: finding
         similarity
• A simple recommendation algorithm is
  cosine similarity.
• Represent user-tweet interaction as a
  vector, then find the users whose vectors
  point in directions near the user in
  question.
• We’ve developed a Matrix library on top of
  scalding to make this easy.
Cosine Similarity

Matrices are
 strongly
  typed.
Cosine Similarity
   Col,Row
types (Int,Int)
     can be
   anything
 comparable.
  Strings are
useful for text
    indices.
Cosine Similarity

    Value
(Double) can
 be anything
with a Ring[T]
 (plus/times)
Cosine Similarity

  Operator
 overloading
gives intuitive
    code.
Matrix in foreground,
        map/reduce behind

    With this
  syntax, we can
  focus on logic,
not how to map
linear algebra to
     Hadoop
Example uses:
• Do random-walks on the following graph.
  Matrix power iteration until convergence:
  (m * m * m * m).
• Dimensionality reduction of follower graph
  (Matrix product by a lower dimensional
  projection matrix).
• Triangle counting: (M*M*M).trace / 3
What is next?

• Improve the logical flow planning (reorder
  commuting filters/projections before maps,
  etc...).
• Improve Matrix flow planning to narrow
  the gap to hand optimized code.
One more thing:


• Type-safety geeks can relax: we just pushed
  a type-safe API to scalding 0.6.0 analogous
  to Scoobi/Scrunch/Spark, so relax.
That’s it.


• follow and mention: @scalding @posco
• pull reqs: http://github.com/twitter/scalding

More Related Content

What's hot

Scalding
ScaldingScalding
Spark schema for free with David Szakallas
Spark schema for free with David SzakallasSpark schema for free with David Szakallas
Spark schema for free with David Szakallas
Databricks
 
Mapreduce in Search
Mapreduce in SearchMapreduce in Search
Mapreduce in Search
Amund Tveit
 
Scala introduction
Scala introductionScala introduction
Scala introduction
vito jeng
 
Meet scala
Meet scalaMeet scala
Meet scala
Wojciech Pituła
 
Scala+data
Scala+dataScala+data
Scala+data
Samir Bessalah
 
Spark Schema For Free with David Szakallas
 Spark Schema For Free with David Szakallas Spark Schema For Free with David Szakallas
Spark Schema For Free with David Szakallas
Databricks
 
An Introduction to Higher Order Functions in Spark SQL with Herman van Hovell
An Introduction to Higher Order Functions in Spark SQL with Herman van HovellAn Introduction to Higher Order Functions in Spark SQL with Herman van Hovell
An Introduction to Higher Order Functions in Spark SQL with Herman van Hovell
Databricks
 
Writing MapReduce Programs using Java | Big Data Hadoop Spark Tutorial | Clou...
Writing MapReduce Programs using Java | Big Data Hadoop Spark Tutorial | Clou...Writing MapReduce Programs using Java | Big Data Hadoop Spark Tutorial | Clou...
Writing MapReduce Programs using Java | Big Data Hadoop Spark Tutorial | Clou...
CloudxLab
 
Kotlin @ Coupang Backend 2017
Kotlin @ Coupang Backend 2017Kotlin @ Coupang Backend 2017
Kotlin @ Coupang Backend 2017
Sunghyouk Bae
 
Kotlin @ Coupang Backed - JetBrains Day seoul 2018
Kotlin @ Coupang Backed - JetBrains Day seoul 2018Kotlin @ Coupang Backed - JetBrains Day seoul 2018
Kotlin @ Coupang Backed - JetBrains Day seoul 2018
Sunghyouk Bae
 
Algebird : Abstract Algebra for big data analytics. Devoxx 2014
Algebird : Abstract Algebra for big data analytics. Devoxx 2014Algebird : Abstract Algebra for big data analytics. Devoxx 2014
Algebird : Abstract Algebra for big data analytics. Devoxx 2014
Samir Bessalah
 
Intro To Cascading
Intro To CascadingIntro To Cascading
Intro To Cascading
Nate Murray
 
Data profiling with Apache Calcite
Data profiling with Apache CalciteData profiling with Apache Calcite
Data profiling with Apache Calcite
Julian Hyde
 
Joy of scala
Joy of scalaJoy of scala
Joy of scala
Maxim Novak
 
Requery overview
Requery overviewRequery overview
Requery overview
Sunghyouk Bae
 
Big Data Analytics with Scala at SCALA.IO 2013
Big Data Analytics with Scala at SCALA.IO 2013Big Data Analytics with Scala at SCALA.IO 2013
Big Data Analytics with Scala at SCALA.IO 2013
Samir Bessalah
 
ACM DBPL Keynote: The Graph Traversal Machine and Language
ACM DBPL Keynote: The Graph Traversal Machine and LanguageACM DBPL Keynote: The Graph Traversal Machine and Language
ACM DBPL Keynote: The Graph Traversal Machine and Language
Marko Rodriguez
 
Engineering Fast Indexes for Big-Data Applications: Spark Summit East talk by...
Engineering Fast Indexes for Big-Data Applications: Spark Summit East talk by...Engineering Fast Indexes for Big-Data Applications: Spark Summit East talk by...
Engineering Fast Indexes for Big-Data Applications: Spark Summit East talk by...
Spark Summit
 
Pragmatic Real-World Scala (short version)
Pragmatic Real-World Scala (short version)Pragmatic Real-World Scala (short version)
Pragmatic Real-World Scala (short version)
Jonas Bonér
 

What's hot (20)

Scalding
ScaldingScalding
Scalding
 
Spark schema for free with David Szakallas
Spark schema for free with David SzakallasSpark schema for free with David Szakallas
Spark schema for free with David Szakallas
 
Mapreduce in Search
Mapreduce in SearchMapreduce in Search
Mapreduce in Search
 
Scala introduction
Scala introductionScala introduction
Scala introduction
 
Meet scala
Meet scalaMeet scala
Meet scala
 
Scala+data
Scala+dataScala+data
Scala+data
 
Spark Schema For Free with David Szakallas
 Spark Schema For Free with David Szakallas Spark Schema For Free with David Szakallas
Spark Schema For Free with David Szakallas
 
An Introduction to Higher Order Functions in Spark SQL with Herman van Hovell
An Introduction to Higher Order Functions in Spark SQL with Herman van HovellAn Introduction to Higher Order Functions in Spark SQL with Herman van Hovell
An Introduction to Higher Order Functions in Spark SQL with Herman van Hovell
 
Writing MapReduce Programs using Java | Big Data Hadoop Spark Tutorial | Clou...
Writing MapReduce Programs using Java | Big Data Hadoop Spark Tutorial | Clou...Writing MapReduce Programs using Java | Big Data Hadoop Spark Tutorial | Clou...
Writing MapReduce Programs using Java | Big Data Hadoop Spark Tutorial | Clou...
 
Kotlin @ Coupang Backend 2017
Kotlin @ Coupang Backend 2017Kotlin @ Coupang Backend 2017
Kotlin @ Coupang Backend 2017
 
Kotlin @ Coupang Backed - JetBrains Day seoul 2018
Kotlin @ Coupang Backed - JetBrains Day seoul 2018Kotlin @ Coupang Backed - JetBrains Day seoul 2018
Kotlin @ Coupang Backed - JetBrains Day seoul 2018
 
Algebird : Abstract Algebra for big data analytics. Devoxx 2014
Algebird : Abstract Algebra for big data analytics. Devoxx 2014Algebird : Abstract Algebra for big data analytics. Devoxx 2014
Algebird : Abstract Algebra for big data analytics. Devoxx 2014
 
Intro To Cascading
Intro To CascadingIntro To Cascading
Intro To Cascading
 
Data profiling with Apache Calcite
Data profiling with Apache CalciteData profiling with Apache Calcite
Data profiling with Apache Calcite
 
Joy of scala
Joy of scalaJoy of scala
Joy of scala
 
Requery overview
Requery overviewRequery overview
Requery overview
 
Big Data Analytics with Scala at SCALA.IO 2013
Big Data Analytics with Scala at SCALA.IO 2013Big Data Analytics with Scala at SCALA.IO 2013
Big Data Analytics with Scala at SCALA.IO 2013
 
ACM DBPL Keynote: The Graph Traversal Machine and Language
ACM DBPL Keynote: The Graph Traversal Machine and LanguageACM DBPL Keynote: The Graph Traversal Machine and Language
ACM DBPL Keynote: The Graph Traversal Machine and Language
 
Engineering Fast Indexes for Big-Data Applications: Spark Summit East talk by...
Engineering Fast Indexes for Big-Data Applications: Spark Summit East talk by...Engineering Fast Indexes for Big-Data Applications: Spark Summit East talk by...
Engineering Fast Indexes for Big-Data Applications: Spark Summit East talk by...
 
Pragmatic Real-World Scala (short version)
Pragmatic Real-World Scala (short version)Pragmatic Real-World Scala (short version)
Pragmatic Real-World Scala (short version)
 

Viewers also liked

Scala Talk at FOSDEM 2009
Scala Talk at FOSDEM 2009Scala Talk at FOSDEM 2009
Scala Talk at FOSDEM 2009
Martin Odersky
 
Scala test
Scala testScala test
Metaprogramming in Scala 2.10, Eugene Burmako,
Metaprogramming  in Scala 2.10, Eugene Burmako, Metaprogramming  in Scala 2.10, Eugene Burmako,
Metaprogramming in Scala 2.10, Eugene Burmako,
Vasil Remeniuk
 
Scala in practice
Scala in practiceScala in practice
Scala in practice
Tomer Gabel
 
Scala profiling
Scala profilingScala profiling
Scala profiling
Filippo Pacifici
 
Procedure Typing for Scala
Procedure Typing for ScalaProcedure Typing for Scala
Procedure Typing for Scala
akuklev
 
Spark at Twitter - Seattle Spark Meetup, April 2014
Spark at Twitter - Seattle Spark Meetup, April 2014Spark at Twitter - Seattle Spark Meetup, April 2014
Spark at Twitter - Seattle Spark Meetup, April 2014
Sriram Krishnan
 
BigData, Hadoop과 Node.js, R2
BigData, Hadoop과 Node.js, R2BigData, Hadoop과 Node.js, R2
BigData, Hadoop과 Node.js, R2
고포릿 default
 
Scalding - Hadoop Word Count in LESS than 70 lines of code
Scalding - Hadoop Word Count in LESS than 70 lines of codeScalding - Hadoop Word Count in LESS than 70 lines of code
Scalding - Hadoop Word Count in LESS than 70 lines of code
Konrad Malawski
 
RESTful Java
RESTful JavaRESTful Java
RESTful Java
JavaCommunity.Org
 
Elixirと他言語の比較的紹介 ver.2
Elixirと他言語の比較的紹介ver.2Elixirと他言語の比較的紹介ver.2
Elixirと他言語の比較的紹介 ver.2
Tsunenori Oohara
 
Go言語によるwebアプリの作り方
Go言語によるwebアプリの作り方Go言語によるwebアプリの作り方
Go言語によるwebアプリの作り方
Yasutaka Kawamoto
 
PHP7はなぜ速いのか
PHP7はなぜ速いのかPHP7はなぜ速いのか
PHP7はなぜ速いのか
Yoshio Hanawa
 
java 8 람다식 소개와 의미 고찰
java 8 람다식 소개와 의미 고찰java 8 람다식 소개와 의미 고찰
java 8 람다식 소개와 의미 고찰
Sungchul Park
 
地獄のElixir(目黒スタートアップ勉強会)
地獄のElixir(目黒スタートアップ勉強会)地獄のElixir(目黒スタートアップ勉強会)
地獄のElixir(目黒スタートアップ勉強会)
Tsunenori Oohara
 
PHP7で変わること ——言語仕様とエンジンの改善ポイント
PHP7で変わること ——言語仕様とエンジンの改善ポイントPHP7で変わること ——言語仕様とエンジンの改善ポイント
PHP7で変わること ——言語仕様とエンジンの改善ポイント
Yoshio Hanawa
 
Apache Spark超入門 (Hadoop / Spark Conference Japan 2016 講演資料)
Apache Spark超入門 (Hadoop / Spark Conference Japan 2016 講演資料)Apache Spark超入門 (Hadoop / Spark Conference Japan 2016 講演資料)
Apache Spark超入門 (Hadoop / Spark Conference Japan 2016 講演資料)
NTT DATA OSS Professional Services
 
Apache Sparkに手を出してヤケドしないための基本 ~「Apache Spark入門より」~ (デブサミ 2016 講演資料)
Apache Sparkに手を出してヤケドしないための基本 ~「Apache Spark入門より」~ (デブサミ 2016 講演資料)Apache Sparkに手を出してヤケドしないための基本 ~「Apache Spark入門より」~ (デブサミ 2016 講演資料)
Apache Sparkに手を出してヤケドしないための基本 ~「Apache Spark入門より」~ (デブサミ 2016 講演資料)
NTT DATA OSS Professional Services
 
GoによるWebアプリ開発のキホン
GoによるWebアプリ開発のキホンGoによるWebアプリ開発のキホン
GoによるWebアプリ開発のキホン
Akihiko Horiuchi
 

Viewers also liked (19)

Scala Talk at FOSDEM 2009
Scala Talk at FOSDEM 2009Scala Talk at FOSDEM 2009
Scala Talk at FOSDEM 2009
 
Scala test
Scala testScala test
Scala test
 
Metaprogramming in Scala 2.10, Eugene Burmako,
Metaprogramming  in Scala 2.10, Eugene Burmako, Metaprogramming  in Scala 2.10, Eugene Burmako,
Metaprogramming in Scala 2.10, Eugene Burmako,
 
Scala in practice
Scala in practiceScala in practice
Scala in practice
 
Scala profiling
Scala profilingScala profiling
Scala profiling
 
Procedure Typing for Scala
Procedure Typing for ScalaProcedure Typing for Scala
Procedure Typing for Scala
 
Spark at Twitter - Seattle Spark Meetup, April 2014
Spark at Twitter - Seattle Spark Meetup, April 2014Spark at Twitter - Seattle Spark Meetup, April 2014
Spark at Twitter - Seattle Spark Meetup, April 2014
 
BigData, Hadoop과 Node.js, R2
BigData, Hadoop과 Node.js, R2BigData, Hadoop과 Node.js, R2
BigData, Hadoop과 Node.js, R2
 
Scalding - Hadoop Word Count in LESS than 70 lines of code
Scalding - Hadoop Word Count in LESS than 70 lines of codeScalding - Hadoop Word Count in LESS than 70 lines of code
Scalding - Hadoop Word Count in LESS than 70 lines of code
 
RESTful Java
RESTful JavaRESTful Java
RESTful Java
 
Elixirと他言語の比較的紹介 ver.2
Elixirと他言語の比較的紹介ver.2Elixirと他言語の比較的紹介ver.2
Elixirと他言語の比較的紹介 ver.2
 
Go言語によるwebアプリの作り方
Go言語によるwebアプリの作り方Go言語によるwebアプリの作り方
Go言語によるwebアプリの作り方
 
PHP7はなぜ速いのか
PHP7はなぜ速いのかPHP7はなぜ速いのか
PHP7はなぜ速いのか
 
java 8 람다식 소개와 의미 고찰
java 8 람다식 소개와 의미 고찰java 8 람다식 소개와 의미 고찰
java 8 람다식 소개와 의미 고찰
 
地獄のElixir(目黒スタートアップ勉強会)
地獄のElixir(目黒スタートアップ勉強会)地獄のElixir(目黒スタートアップ勉強会)
地獄のElixir(目黒スタートアップ勉強会)
 
PHP7で変わること ——言語仕様とエンジンの改善ポイント
PHP7で変わること ——言語仕様とエンジンの改善ポイントPHP7で変わること ——言語仕様とエンジンの改善ポイント
PHP7で変わること ——言語仕様とエンジンの改善ポイント
 
Apache Spark超入門 (Hadoop / Spark Conference Japan 2016 講演資料)
Apache Spark超入門 (Hadoop / Spark Conference Japan 2016 講演資料)Apache Spark超入門 (Hadoop / Spark Conference Japan 2016 講演資料)
Apache Spark超入門 (Hadoop / Spark Conference Japan 2016 講演資料)
 
Apache Sparkに手を出してヤケドしないための基本 ~「Apache Spark入門より」~ (デブサミ 2016 講演資料)
Apache Sparkに手を出してヤケドしないための基本 ~「Apache Spark入門より」~ (デブサミ 2016 講演資料)Apache Sparkに手を出してヤケドしないための基本 ~「Apache Spark入門より」~ (デブサミ 2016 講演資料)
Apache Sparkに手を出してヤケドしないための基本 ~「Apache Spark入門より」~ (デブサミ 2016 講演資料)
 
GoによるWebアプリ開発のキホン
GoによるWebアプリ開発のキホンGoによるWebアプリ開発のキホン
GoによるWebアプリ開発のキホン
 

Similar to Scalding: Twitter's Scala DSL for Hadoop/Cascading

Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data PlatformsCassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
DataStax Academy
 
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
Jose Quesada (hiring)
 
Big data week presentation
Big data week presentationBig data week presentation
Big data week presentation
Joseph Adler
 
Spark - The Ultimate Scala Collections by Martin Odersky
Spark - The Ultimate Scala Collections by Martin OderskySpark - The Ultimate Scala Collections by Martin Odersky
Spark - The Ultimate Scala Collections by Martin Odersky
Spark Summit
 
Why hadoop map reduce needs scala, an introduction to scoobi and scalding
Why hadoop map reduce needs scala, an introduction to scoobi and scaldingWhy hadoop map reduce needs scala, an introduction to scoobi and scalding
Why hadoop map reduce needs scala, an introduction to scoobi and scalding
Xebia Nederland BV
 
Intro to Apache Spark by Marco Vasquez
Intro to Apache Spark by Marco VasquezIntro to Apache Spark by Marco Vasquez
Intro to Apache Spark by Marco Vasquez
MapR Technologies
 
Apache Spark for Everyone - Women Who Code Workshop
Apache Spark for Everyone - Women Who Code WorkshopApache Spark for Everyone - Women Who Code Workshop
Apache Spark for Everyone - Women Who Code Workshop
Amanda Casari
 
ASPgems - kappa architecture
ASPgems - kappa architectureASPgems - kappa architecture
ASPgems - kappa architecture
Juantomás García Molina
 
BDM25 - Spark runtime internal
BDM25 - Spark runtime internalBDM25 - Spark runtime internal
BDM25 - Spark runtime internal
David Lauzon
 
Apache Spark Tutorial
Apache Spark TutorialApache Spark Tutorial
Apache Spark Tutorial
Ahmet Bulut
 
Cascading on starfish
Cascading on starfishCascading on starfish
Cascading on starfish
Fei Dong
 
Learn about SPARK tool and it's componemts
Learn about SPARK tool and it's componemtsLearn about SPARK tool and it's componemts
Learn about SPARK tool and it's componemts
siddharth30121
 
Finding the needles in the haystack. An Overview of Analyzing Big Data with H...
Finding the needles in the haystack. An Overview of Analyzing Big Data with H...Finding the needles in the haystack. An Overview of Analyzing Big Data with H...
Finding the needles in the haystack. An Overview of Analyzing Big Data with H...
Chris Baglieri
 
How Concur uses Big Data to get you to Tableau Conference On Time
How Concur uses Big Data to get you to Tableau Conference On TimeHow Concur uses Big Data to get you to Tableau Conference On Time
How Concur uses Big Data to get you to Tableau Conference On Time
Denny Lee
 
Big Data Analytics (ML, DL, AI) hands-on
Big Data Analytics (ML, DL, AI) hands-onBig Data Analytics (ML, DL, AI) hands-on
Big Data Analytics (ML, DL, AI) hands-on
Dony Riyanto
 
Front Range PHP NoSQL Databases
Front Range PHP NoSQL DatabasesFront Range PHP NoSQL Databases
Front Range PHP NoSQL Databases
Jon Meredith
 
Scalding intro 20141125
Scalding intro 20141125Scalding intro 20141125
Scalding intro 20141125
Erik Schmiegelow
 
The Why and How of Scala at Twitter
The Why and How of Scala at TwitterThe Why and How of Scala at Twitter
The Why and How of Scala at Twitter
Alex Payne
 
Getting started with Hadoop, Hive, and Elastic MapReduce
Getting started with Hadoop, Hive, and Elastic MapReduceGetting started with Hadoop, Hive, and Elastic MapReduce
Getting started with Hadoop, Hive, and Elastic MapReduce
obdit
 
Scala final ppt vinay
Scala final ppt vinayScala final ppt vinay
Scala final ppt vinay
Viplav Jain
 

Similar to Scalding: Twitter's Scala DSL for Hadoop/Cascading (20)

Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data PlatformsCassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
 
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
 
Big data week presentation
Big data week presentationBig data week presentation
Big data week presentation
 
Spark - The Ultimate Scala Collections by Martin Odersky
Spark - The Ultimate Scala Collections by Martin OderskySpark - The Ultimate Scala Collections by Martin Odersky
Spark - The Ultimate Scala Collections by Martin Odersky
 
Why hadoop map reduce needs scala, an introduction to scoobi and scalding
Why hadoop map reduce needs scala, an introduction to scoobi and scaldingWhy hadoop map reduce needs scala, an introduction to scoobi and scalding
Why hadoop map reduce needs scala, an introduction to scoobi and scalding
 
Intro to Apache Spark by Marco Vasquez
Intro to Apache Spark by Marco VasquezIntro to Apache Spark by Marco Vasquez
Intro to Apache Spark by Marco Vasquez
 
Apache Spark for Everyone - Women Who Code Workshop
Apache Spark for Everyone - Women Who Code WorkshopApache Spark for Everyone - Women Who Code Workshop
Apache Spark for Everyone - Women Who Code Workshop
 
ASPgems - kappa architecture
ASPgems - kappa architectureASPgems - kappa architecture
ASPgems - kappa architecture
 
BDM25 - Spark runtime internal
BDM25 - Spark runtime internalBDM25 - Spark runtime internal
BDM25 - Spark runtime internal
 
Apache Spark Tutorial
Apache Spark TutorialApache Spark Tutorial
Apache Spark Tutorial
 
Cascading on starfish
Cascading on starfishCascading on starfish
Cascading on starfish
 
Learn about SPARK tool and it's componemts
Learn about SPARK tool and it's componemtsLearn about SPARK tool and it's componemts
Learn about SPARK tool and it's componemts
 
Finding the needles in the haystack. An Overview of Analyzing Big Data with H...
Finding the needles in the haystack. An Overview of Analyzing Big Data with H...Finding the needles in the haystack. An Overview of Analyzing Big Data with H...
Finding the needles in the haystack. An Overview of Analyzing Big Data with H...
 
How Concur uses Big Data to get you to Tableau Conference On Time
How Concur uses Big Data to get you to Tableau Conference On TimeHow Concur uses Big Data to get you to Tableau Conference On Time
How Concur uses Big Data to get you to Tableau Conference On Time
 
Big Data Analytics (ML, DL, AI) hands-on
Big Data Analytics (ML, DL, AI) hands-onBig Data Analytics (ML, DL, AI) hands-on
Big Data Analytics (ML, DL, AI) hands-on
 
Front Range PHP NoSQL Databases
Front Range PHP NoSQL DatabasesFront Range PHP NoSQL Databases
Front Range PHP NoSQL Databases
 
Scalding intro 20141125
Scalding intro 20141125Scalding intro 20141125
Scalding intro 20141125
 
The Why and How of Scala at Twitter
The Why and How of Scala at TwitterThe Why and How of Scala at Twitter
The Why and How of Scala at Twitter
 
Getting started with Hadoop, Hive, and Elastic MapReduce
Getting started with Hadoop, Hive, and Elastic MapReduceGetting started with Hadoop, Hive, and Elastic MapReduce
Getting started with Hadoop, Hive, and Elastic MapReduce
 
Scala final ppt vinay
Scala final ppt vinayScala final ppt vinay
Scala final ppt vinay
 

Recently uploaded

Taking AI to the Next Level in Manufacturing.pdf
Taking AI to the Next Level in Manufacturing.pdfTaking AI to the Next Level in Manufacturing.pdf
Taking AI to the Next Level in Manufacturing.pdf
ssuserfac0301
 
HCL Notes and Domino License Cost Reduction in the World of DLAU
HCL Notes and Domino License Cost Reduction in the World of DLAUHCL Notes and Domino License Cost Reduction in the World of DLAU
HCL Notes and Domino License Cost Reduction in the World of DLAU
panagenda
 
Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...
Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...
Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...
saastr
 
Salesforce Integration for Bonterra Impact Management (fka Social Solutions A...
Salesforce Integration for Bonterra Impact Management (fka Social Solutions A...Salesforce Integration for Bonterra Impact Management (fka Social Solutions A...
Salesforce Integration for Bonterra Impact Management (fka Social Solutions A...
Jeffrey Haguewood
 
Operating System Used by Users in day-to-day life.pptx
Operating System Used by Users in day-to-day life.pptxOperating System Used by Users in day-to-day life.pptx
Operating System Used by Users in day-to-day life.pptx
Pravash Chandra Das
 
Generating privacy-protected synthetic data using Secludy and Milvus
Generating privacy-protected synthetic data using Secludy and MilvusGenerating privacy-protected synthetic data using Secludy and Milvus
Generating privacy-protected synthetic data using Secludy and Milvus
Zilliz
 
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdfUnlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Malak Abu Hammad
 
System Design Case Study: Building a Scalable E-Commerce Platform - Hiike
System Design Case Study: Building a Scalable E-Commerce Platform - HiikeSystem Design Case Study: Building a Scalable E-Commerce Platform - Hiike
System Design Case Study: Building a Scalable E-Commerce Platform - Hiike
Hiike
 
Fueling AI with Great Data with Airbyte Webinar
Fueling AI with Great Data with Airbyte WebinarFueling AI with Great Data with Airbyte Webinar
Fueling AI with Great Data with Airbyte Webinar
Zilliz
 
Presentation of the OECD Artificial Intelligence Review of Germany
Presentation of the OECD Artificial Intelligence Review of GermanyPresentation of the OECD Artificial Intelligence Review of Germany
Presentation of the OECD Artificial Intelligence Review of Germany
innovationoecd
 
Finale of the Year: Apply for Next One!
Finale of the Year: Apply for Next One!Finale of the Year: Apply for Next One!
Finale of the Year: Apply for Next One!
GDSC PJATK
 
TrustArc Webinar - 2024 Global Privacy Survey
TrustArc Webinar - 2024 Global Privacy SurveyTrustArc Webinar - 2024 Global Privacy Survey
TrustArc Webinar - 2024 Global Privacy Survey
TrustArc
 
WeTestAthens: Postman's AI & Automation Techniques
WeTestAthens: Postman's AI & Automation TechniquesWeTestAthens: Postman's AI & Automation Techniques
WeTestAthens: Postman's AI & Automation Techniques
Postman
 
Serial Arm Control in Real Time Presentation
Serial Arm Control in Real Time PresentationSerial Arm Control in Real Time Presentation
Serial Arm Control in Real Time Presentation
tolgahangng
 
Skybuffer SAM4U tool for SAP license adoption
Skybuffer SAM4U tool for SAP license adoptionSkybuffer SAM4U tool for SAP license adoption
Skybuffer SAM4U tool for SAP license adoption
Tatiana Kojar
 
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdf
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdfHow to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdf
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdf
Chart Kalyan
 
Digital Marketing Trends in 2024 | Guide for Staying Ahead
Digital Marketing Trends in 2024 | Guide for Staying AheadDigital Marketing Trends in 2024 | Guide for Staying Ahead
Digital Marketing Trends in 2024 | Guide for Staying Ahead
Wask
 
A Comprehensive Guide to DeFi Development Services in 2024
A Comprehensive Guide to DeFi Development Services in 2024A Comprehensive Guide to DeFi Development Services in 2024
A Comprehensive Guide to DeFi Development Services in 2024
Intelisync
 
GraphRAG for Life Science to increase LLM accuracy
GraphRAG for Life Science to increase LLM accuracyGraphRAG for Life Science to increase LLM accuracy
GraphRAG for Life Science to increase LLM accuracy
Tomaz Bratanic
 
leewayhertz.com-AI in predictive maintenance Use cases technologies benefits ...
leewayhertz.com-AI in predictive maintenance Use cases technologies benefits ...leewayhertz.com-AI in predictive maintenance Use cases technologies benefits ...
leewayhertz.com-AI in predictive maintenance Use cases technologies benefits ...
alexjohnson7307
 

Recently uploaded (20)

Taking AI to the Next Level in Manufacturing.pdf
Taking AI to the Next Level in Manufacturing.pdfTaking AI to the Next Level in Manufacturing.pdf
Taking AI to the Next Level in Manufacturing.pdf
 
HCL Notes and Domino License Cost Reduction in the World of DLAU
HCL Notes and Domino License Cost Reduction in the World of DLAUHCL Notes and Domino License Cost Reduction in the World of DLAU
HCL Notes and Domino License Cost Reduction in the World of DLAU
 
Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...
Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...
Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...
 
Salesforce Integration for Bonterra Impact Management (fka Social Solutions A...
Salesforce Integration for Bonterra Impact Management (fka Social Solutions A...Salesforce Integration for Bonterra Impact Management (fka Social Solutions A...
Salesforce Integration for Bonterra Impact Management (fka Social Solutions A...
 
Operating System Used by Users in day-to-day life.pptx
Operating System Used by Users in day-to-day life.pptxOperating System Used by Users in day-to-day life.pptx
Operating System Used by Users in day-to-day life.pptx
 
Generating privacy-protected synthetic data using Secludy and Milvus
Generating privacy-protected synthetic data using Secludy and MilvusGenerating privacy-protected synthetic data using Secludy and Milvus
Generating privacy-protected synthetic data using Secludy and Milvus
 
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdfUnlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
 
System Design Case Study: Building a Scalable E-Commerce Platform - Hiike
System Design Case Study: Building a Scalable E-Commerce Platform - HiikeSystem Design Case Study: Building a Scalable E-Commerce Platform - Hiike
System Design Case Study: Building a Scalable E-Commerce Platform - Hiike
 
Fueling AI with Great Data with Airbyte Webinar
Fueling AI with Great Data with Airbyte WebinarFueling AI with Great Data with Airbyte Webinar
Fueling AI with Great Data with Airbyte Webinar
 
Presentation of the OECD Artificial Intelligence Review of Germany
Presentation of the OECD Artificial Intelligence Review of GermanyPresentation of the OECD Artificial Intelligence Review of Germany
Presentation of the OECD Artificial Intelligence Review of Germany
 
Finale of the Year: Apply for Next One!
Finale of the Year: Apply for Next One!Finale of the Year: Apply for Next One!
Finale of the Year: Apply for Next One!
 
TrustArc Webinar - 2024 Global Privacy Survey
TrustArc Webinar - 2024 Global Privacy SurveyTrustArc Webinar - 2024 Global Privacy Survey
TrustArc Webinar - 2024 Global Privacy Survey
 
WeTestAthens: Postman's AI & Automation Techniques
WeTestAthens: Postman's AI & Automation TechniquesWeTestAthens: Postman's AI & Automation Techniques
WeTestAthens: Postman's AI & Automation Techniques
 
Serial Arm Control in Real Time Presentation
Serial Arm Control in Real Time PresentationSerial Arm Control in Real Time Presentation
Serial Arm Control in Real Time Presentation
 
Skybuffer SAM4U tool for SAP license adoption
Skybuffer SAM4U tool for SAP license adoptionSkybuffer SAM4U tool for SAP license adoption
Skybuffer SAM4U tool for SAP license adoption
 
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdf
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdfHow to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdf
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdf
 
Digital Marketing Trends in 2024 | Guide for Staying Ahead
Digital Marketing Trends in 2024 | Guide for Staying AheadDigital Marketing Trends in 2024 | Guide for Staying Ahead
Digital Marketing Trends in 2024 | Guide for Staying Ahead
 
A Comprehensive Guide to DeFi Development Services in 2024
A Comprehensive Guide to DeFi Development Services in 2024A Comprehensive Guide to DeFi Development Services in 2024
A Comprehensive Guide to DeFi Development Services in 2024
 
GraphRAG for Life Science to increase LLM accuracy
GraphRAG for Life Science to increase LLM accuracyGraphRAG for Life Science to increase LLM accuracy
GraphRAG for Life Science to increase LLM accuracy
 
leewayhertz.com-AI in predictive maintenance Use cases technologies benefits ...
leewayhertz.com-AI in predictive maintenance Use cases technologies benefits ...leewayhertz.com-AI in predictive maintenance Use cases technologies benefits ...
leewayhertz.com-AI in predictive maintenance Use cases technologies benefits ...
 

Scalding: Twitter's Scala DSL for Hadoop/Cascading

  • 1. @Scalding https://github.com/twitter/scalding Oscar Boykin Twitter @posco
  • 2. #hadoopsummit I encourage live tweeting (mention @posco/@scalding)
  • 3. • What is Scalding? • Why Scala for Map/Reduce? • How is it used at Twitter? • What’s next for Scalding?
  • 4. Yep, we’re counting words: Scalding jobs subclass Job
  • 5. Yep, we’re counting words: Logic is in the constructor
  • 6. Yep, we’re counting words: Functions can be called or defined inline
  • 7. Scalding Model • Source objects read and write data (from HDFS, DBs, MemCache, etc...) • Pipes represent the flows of the data in the job. You can think of Pipe as a distributed list.
  • 8. Yep, we’re counting words: Read and Write data through Source objects
  • 9. Yep, we’re counting words: Data is modeled as streams of named Tuples (of objects)
  • 10. Why Scala • The scala language has a lot of built-in features that make domain-specific languages easy to implement. • Map/Reduce is already within the functional paradigm. • Scala’s collection API covers almost all usual use cases.
  • 12. Word Co-occurrence We can use standard scala containers
  • 13. Word Co-occurrence We can do real logic in the mapper without external UDFs.
  • 14. Word Co-occurrence Generalized “plus” handles lists/sets/maps and can be customized (implement Monoid[T])
  • 15. GroupBuilder: enabling parallel reductions • groupBy takes a function that mutates a GroupBuilder. • GroupBuilder adds fields which are reductions of (potentially different) inputs. • On the left, we add 7 fields.
  • 16. scald.rb • driver script that compiles the job and runs it locally or transfers and runs remotely. • we plan to add EMR support.
  • 17.
  • 18. Most functions in the API have very close analogs in scala.collection.Iterable.
  • 19. Cascading • is the java library that handles most of the map/ reduce planning for scalding. • has years of production use. • is used, tested, and optimized by many teams (Concurrent Inc., DSLs in Scala, Clojure, Python @Twitter. Ruby at Etsy). • has a (very fast) local mode that runs without Hadoop. • flow planner designed to be portable (cascading on Spark? Storm?)
  • 20. mapReduceMap • We abstract Cascading’s map-side aggregation ability with a function called mapReduceMap. • If only mapReduceMaps are called, map-side aggregation works. If a foldLeft is called (which cannot be done map-side), scalding falls back to pushing everything to the reducers.
  • 21. Most Reductions are mapReduceMap
  • 22.
  • 23. Optimized Joins • mapside join is called joinWithTiny. Implements left or inner join with a very small pipe. • blockJoin: deals with data skew by replicating the data (useful for walking the Twitter follower graph, where everyone follows Gaga/Bieber/Obama). • coming: combine the above to dynamically set replication on a per key basis: only Gaga is replicated, and just the right amount.
  • 24. Scalding @Twitter • Revenue quality team (ads targeting, market insight, click-prediction, traffic-quality) uses scalding for all our work. • Scala engineers throughout the company use it (i.e. storage, platform). • More than 60 in-production scalding jobs, more than 200 ad-hoc jobs. • Not our only tool: Pig, PyCascading, Cascalog, Hive are also used.
  • 25. Example: finding similarity • A simple recommendation algorithm is cosine similarity. • Represent user-tweet interaction as a vector, then find the users whose vectors point in directions near the user in question. • We’ve developed a Matrix library on top of scalding to make this easy.
  • 27. Cosine Similarity Col,Row types (Int,Int) can be anything comparable. Strings are useful for text indices.
  • 28. Cosine Similarity Value (Double) can be anything with a Ring[T] (plus/times)
  • 29. Cosine Similarity Operator overloading gives intuitive code.
  • 30. Matrix in foreground, map/reduce behind With this syntax, we can focus on logic, not how to map linear algebra to Hadoop
  • 31. Example uses: • Do random-walks on the following graph. Matrix power iteration until convergence: (m * m * m * m). • Dimensionality reduction of follower graph (Matrix product by a lower dimensional projection matrix). • Triangle counting: (M*M*M).trace / 3
  • 32. What is next? • Improve the logical flow planning (reorder commuting filters/projections before maps, etc...). • Improve Matrix flow planning to narrow the gap to hand optimized code.
  • 33. One more thing: • Type-safety geeks can relax: we just pushed a type-safe API to scalding 0.6.0 analogous to Scoobi/Scrunch/Spark, so relax.
  • 34. That’s it. • follow and mention: @scalding @posco • pull reqs: http://github.com/twitter/scalding

Editor's Notes

  1. \n
  2. \n
  3. \n
  4. \n
  5. \n
  6. \n
  7. \n
  8. \n
  9. \n
  10. \n
  11. \n
  12. \n
  13. \n
  14. \n
  15. \n
  16. \n
  17. \n
  18. \n
  19. \n
  20. \n
  21. \n
  22. \n
  23. \n
  24. \n
  25. \n
  26. \n
  27. \n
  28. \n
  29. \n
  30. \n
  31. \n
  32. \n
  33. \n
  34. \n