2. What is Scalding?
• Scalding is a Scala-based API for MapReduce applications
• Scalding is built on top of Cascading
• Cascading is a flow-oriented processing framework that acts as an abstraction layer over MapReduce
3. What is Cascading?
• Cascading introduces the concepts of source taps (input), sink taps (output) and pipes to connect them, essentially abstracting away the key/value scheme of MapReduce
• Within a pipe, users define the transformation of data by applying operations such as GroupBy, Every and others.
5. In comes Scalding
• Scalding was created by Twitter, basically as a DSL for
Cascading.
• The goal is to offer functions to operate on the data flow
as opposed to constructing objects with embedded
operations
• Scalding applications feel and behave like scripts, ideally
replacing Pig.
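To illustrate the script-like feel, here is a minimal word count sketch in the Fields API, modelled on the example in the Scalding README. The names Tokenizer and WordCountJob and the --input / --output arguments are assumptions made for this illustration.

```scala
import com.twitter.scalding._

// Pure helper, kept outside the job so it can be checked without Hadoop.
object Tokenizer {
  def tokenize(line: String): Array[String] =
    line.toLowerCase.split("""\s+""").filter(_.nonEmpty)
}

// The whole job is a single chained expression, much like a script.
class WordCountJob(args: Args) extends Job(args) {
  TextLine(args("input"))                 // source: each line lands in the 'line field
    .flatMap('line -> 'word) { line: String => Tokenizer.tokenize(line) }
    .groupBy('word) { _.size }            // adds a 'size field holding the count
    .write(Tsv(args("output")))           // sink: tab-separated output
}
```

Functions such as flatMap and groupBy are called directly on the data flow; no Cascading operation objects are constructed by hand.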
6. Scalding APIs
• Scalding offers three different APIs:
• Fields API – a simple, abstracted, symbol-based, function-oriented API; the first choice for most use cases
• Type-safe API – a lower-level, typed API with closer access to Cascading. This API is used for more complex inputs, such as Avro
• Matrix API – applies matrix and vector operations to pipes, but only for the types Int, Long and String (due to comparator ops)
• The Fields and Type-safe APIs can be converted into one another, and both are designed to offer the same kinds of functions: (Fields) Pipe instances convert to TypedPipe and vice versa.
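For a feel of the Type-safe API, here is a word count sketch in typed form, again modelled on the Scalding README. The names Words and TypedWordCountJob are assumptions; conversion between Pipe and TypedPipe is not shown because the required imports differ between releases.

```scala
import com.twitter.scalding._

// Pure helper, kept outside the job so it can be checked without Hadoop.
object Words {
  def split(line: String): Array[String] =
    line.toLowerCase.split("""\s+""").filter(_.nonEmpty)
}

class TypedWordCountJob(args: Args) extends Job(args) {
  TypedPipe.from(TextLine(args("input")))   // a TypedPipe[String], one element per line
    .flatMap { line => Words.split(line) }
    .groupBy { word => word }               // use each word as the key
    .size                                   // number of elements per key, as Long
    .write(TypedTsv[(String, Long)](args("output")))
}
```

Instead of naming fields with symbols, the element type travels with the pipe, so a mismatch between the data and the sink type is caught at compile time.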
7. Functions
• Scalding has map-like functions, such as:
• map
• flatMap
• filter and filterNot
• collect
• Grouping / Joining functions:
• groupBy
• groupAll
• join (left, right, outer, etc.)
• Reduce functions:
• reduce (DUH!)
• foldLeft
• average, sum
Documentation: https://github.com/twitter/scalding/wiki/Fields-based-API-Reference
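A rough sketch of how several of these functions combine in one Fields API job. The input layout (a TSV with user and amount columns) and the names Money and SpendPerUserJob are assumptions; the typed form sum[Long] is assumed to be available, and exact signatures vary between Scalding releases.

```scala
import com.twitter.scalding._

// Pure helper, kept outside the job so it can be checked without Hadoop.
object Money {
  def toCents(amount: Double): Long = math.round(amount * 100)
}

class SpendPerUserJob(args: Args) extends Job(args) {
  Tsv(args("input"), ('user, 'amount))
    .read
    // map-like: drop non-positive amounts, then derive a new 'cents field
    .filter('amount) { amount: Double => amount > 0 }
    .map('amount -> 'cents) { amount: Double => Money.toCents(amount) }
    // grouping plus reduce-style aggregations inside the group
    .groupBy('user) { group =>
      group
        .sum[Long]('cents -> 'totalCents)
        .average('amount -> 'avgAmount)
    }
    .write(Tsv(args("output")))
}
```

Note that map adds the 'cents field while keeping 'amount, which is why both are still available inside groupBy.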
10. Example – Configuring and running
Configuration goes through Hadoop, and jobs are launched with the Job / ToolRunner scheme.
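The original code for this slide is not part of the text, so the following is only a sketch of what a ToolRunner-based launcher might look like. The object name JobRunner and the configuration value are assumptions; com.twitter.scalding.Tool and Hadoop's ToolRunner are the existing classes.

```scala
import com.twitter.scalding.Tool
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.util.ToolRunner

object JobRunner {
  // Hadoop-level settings go into a plain Hadoop Configuration.
  def baseConf(): Configuration = {
    val conf = new Configuration()
    conf.set("mapreduce.job.reduces", "4") // illustrative value only
    conf
  }

  def main(args: Array[String]): Unit = {
    // Scalding's Tool takes the job class name and the mode flag
    // (--local or --hdfs) from args, then builds and runs the Job.
    val exitCode = ToolRunner.run(baseConf(), new Tool, args)
    sys.exit(exitCode)
  }
}
```

With such a launcher the arguments would be the fully qualified job class, the mode flag and the job's own arguments, for example: com.example.WordCountJob --hdfs --input in.txt --output counts.tsv (class and file names hypothetical).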
11. Flow Listener
• You can monitor execution progress with Cascading listeners.
1. Define Scalding Stat objects (case classes for Hadoop counters)
2. Increment them within your operations by calling incBy(Int)
3. Implement the FlowListener interface (here in a hypothetical class MyFlowListener) and add it to your job's listeners:
override def listeners = super.listeners ++ List(new MyFlowListener)
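A sketch of the listener step only, assuming Cascading 2.x, where FlowListener has four callbacks. MyFlowListener and MonitoredJob are hypothetical names; the Stat counters are left out because their constructor differs between Scalding releases.

```scala
import cascading.flow.{Flow, FlowListener}
import com.twitter.scalding._

// Hypothetical listener that only logs flow lifecycle events.
class MyFlowListener extends FlowListener {
  override def onStarting(flow: Flow[_]): Unit = println(s"starting: ${flow.getName}")
  override def onStopping(flow: Flow[_]): Unit = println(s"stopping: ${flow.getName}")
  override def onCompleted(flow: Flow[_]): Unit = println(s"completed: ${flow.getName}")

  // Returning false signals that this listener did not handle the
  // throwable, so Cascading treats the failure as usual.
  override def onThrowable(flow: Flow[_], throwable: Throwable): Boolean = {
    println(s"failed: ${flow.getName}: ${throwable.getMessage}")
    false
  }
}

class MonitoredJob(args: Args) extends Job(args) {
  // Register the listener on top of whatever the base class provides.
  override def listeners: List[FlowListener] =
    super.listeners ++ List(new MyFlowListener)

  TextLine(args("input")).write(Tsv(args("output")))
}
```

A real listener would typically read the flow's statistics in onCompleted rather than print to stdout.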
14. Resources
Scalding home and docs on Github:
https://github.com/twitter/scalding
Excellent intro and advanced topics:
http://www.slideshare.net/ktoso/scalding-the-notsobasics-scaladays-2014