Scalding 
YARN Webinar Series 
September 18, 2014 
Page 1 © Hortonworks Inc. 2014 
Ajay Singh, Director - Hortonworks 
Jonathan Coveney, Senior Software Engineer - Twitter
Agenda 
Introduction: Ajay Singh, Hortonworks 
Modern Data Architecture and how Cascading and Scalding fit in 
Scalding: Jonathan Coveney, Twitter 
Why Scalding? 
Core Concepts and Limitations 
Scalding at Twitter 
Resources 
Speakers 
Ajay Singh is Hortonworks' Director of Technical
Channels and leads strategic alliances with partners
from a technology standpoint, including driving alignment
on roadmaps, product certifications, and demos. Ajay is
dedicated to building, scaling, and delivering exceptional
go-to-market solutions with partners.
Jonathan Coveney currently works at Twitter, where he
has spent a lot of time maintaining and updating Scalding;
in the past, he worked extensively on Apache Pig. He
is deeply interested in functional programming, as well as
developing usable APIs for data processing at scale.
A Modern Data Architecture 
[Diagram: a modern data architecture.
Sources: existing sources (CRM, ERP, clickstream, logs) and emerging sources (sensor, sentiment, geo, unstructured).
Data system: existing repositories (RDBMS, EDW, MPP) alongside Enterprise Hadoop (governance and integration, security, operations, data access, data management).
Applications: business analytics, custom applications, packaged applications.
Supporting tools: dev and data tools (build and test), operational tools (manage and monitor).]
HDP 2.1: Enterprise Hadoop 
[Diagram: HDP 2.1, the Hortonworks Data Platform.
Governance & Integration: data workflow, lifecycle and governance (Falcon, Sqoop, Flume, NFS, WebHDFS).
Data Access: batch (MapReduce), script (Pig), SQL (Hive/Tez, HCatalog), NoSQL (HBase, Accumulo), stream (Storm), search (Solr), others (in-memory analytics, ISV engines, Cascading), all running on YARN, the data operating system.
Data Management: HDFS (Hadoop Distributed File System) across nodes 1 to N.
Security: authentication, authorization, accounting, data protection. Storage: HDFS; resources: YARN; access: Hive, …; pipeline: Falcon; cluster: Knox.
Operations: provision, manage and monitor (Ambari, ZooKeeper); scheduling (Oozie).
Deployment choice: Linux, Windows, on-premise, cloud.]
Cascading SDK 
HDP Integrates and delivers Cascading SDK 
• Collection of tools, documentation, libraries, 
tutorials and example projects 
• Key Benefits 
• Simplified Development 
• Multi Language Support 
• Reuse existing skills and tools 
• Native YARN Integration 
Hortonworks delivers Enterprise support 
• Backed by Concurrent 
Hortonworks and Concurrent Advance Enterprise Data Application 
Development on Hadoop 
HDP Integration of Cascading SDK 
• Write once and deploy on your fabric of 
choice 
• Integration with data processing layer allows 
Cascading to take advantage of advances in 
interactive applications 
• Sep 17th - Cascading 3.0 WIP Now Supports 
Apache Tez 
– http://www.cascading.org/2014/09/17/cascading-3-0-wip-now-supports-apache-tez/
[Diagram: the presentation and application layer (Cascading for Java, Scalding for Scala, Lingual for SQL, Pattern for ML) runs on batch data processing with MapReduce (current) and on interactive data processing with Tez (work in progress), both on top of YARN for efficient cluster resource management and shared services.]
Enable both existing and new applications to provide value to the organization
Cascading.org Scalding Resources 
Scalding Resources on Cascading.org 
• Videos and Tutorials 
• Mailing List 
• Newsletter 
Cascading 3.0 WIP With Tez Support 
• https://github.com/cwensel/cascading/tree/wip-3.0/cascading-hadoop2-tez 
Scalding Training Debuts This Fall 
• In-person, 1-day class with labs 
• Email: info@cascading.io 
Jonathan Coveney 
Twitter 
@jco
Why Scalding? 
Writing raw MapReduce is difficult!
● Scalding is 
o Less verbose 
o Less error prone (type checking!) 
o Easier to evolve 
o Performant enough 
But what about Hive and Pig? 
● Really good for certain things 
o Excellent for quick, ad-hoc work 
o Easy to understand 
o Can leverage existing knowledge (i.e., SQL)
● Not always the best for maintainability 
o Composition isn’t great 
o Testing is difficult 
o Type safety is lacking 
So… Cascading? 
● Still pretty verbose! 
● But you can use normal Java tools
o Maven 
o JUnit 
o IDEs 
● Handles the low level details for you 
● A good target for higher level languages 
Scalding 
● Concise, expressive syntax 
● Testable 
● Abstractable 
● Composable 
Because it’s in a full-featured, functional language! 
But Scala is scary! 
● Scalding doesn’t force you to use more complicated 
features 
● Can just write less-verbose Java if desired 
● Functional programming is an important paradigm in general,
and especially for big data
Learning new things is good for your brain :) 
Example Scalding job 
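import com.twitter.scalding._

// Word count: split each line into words, then count occurrences per word.
class Webinar(args: Args) extends Job(args) {
  import TDsl._
  TextLine(args("input"))
    .flatMap { _.split("\\s+") }
    .map { w => (w, 1L) }
    .group
    .sum
    .write(TypedTsv[(String, Long)](args("output")))
}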
“Hadoop is a system for counting words” -Oscar Boykin, @posco 
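To see what the job computes without a cluster, here is a minimal sketch of the same pipeline on plain Scala collections. It uses none of Scalding's API; the object and method names are made up for illustration, and the filter on empty strings is an added assumption to handle lines with leading whitespace.

```scala
object LocalWordCount {
  // In-memory analogue of the job: split lines into words, pair each word
  // with 1L, group by word, and sum the counts per word.
  def wordCount(lines: Seq[String]): Map[String, Long] =
    lines
      .flatMap(_.split("\\s+").toSeq)
      .filter(_.nonEmpty) // assumption: drop empty tokens from leading whitespace
      .map(w => (w, 1L))
      .groupBy(_._1)
      .map { case (word, pairs) => (word, pairs.map(_._2).sum) }
}
```

The shape matches the job step for step: flatMap, map, group, sum.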
Core concepts 
● Source 
o How to read or write data 
● TypedPipe[T] 
o A distributed list of T 
o Kind of like a Seq[T] in Scala’s collections library 
● Grouped[K, T] 
o A grouping on K 
o Represents transition to reduce phase 
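The Seq[T] analogy can be made concrete with a small sketch on local collections. This only models the concepts and is not Scalding code: the Seq stands in for a TypedPipe, the Map of sequences for a Grouped, and the sample data is invented.

```scala
object CoreConceptsSketch {
  // Stand-in for a TypedPipe[(String, Int)]: an ordinary in-memory sequence.
  val purchases: Seq[(String, Int)] = Seq(("alice", 3), ("bob", 5), ("alice", 4))

  // Stand-in for Grouped[String, Int]: all values collected under their key.
  val grouped: Map[String, Seq[Int]] =
    purchases.groupBy(_._1).map { case (k, kvs) => (k, kvs.map(_._2)) }

  // A per-key reduction, the local counterpart of work done in the reduce phase.
  val totals: Map[String, Int] = grouped.map { case (k, vs) => (k, vs.sum) }
}
```

On a cluster the grouping step is where data is shuffled by key, which is why it marks the move to the reduce phase.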
Word Co-Occurrence 
TextLine(args("input"))
  .flatMap { line =>
    val words = line.split("\\s+")
    // Emit one single-entry map per ordered pair of distinct words on the line.
    for (w1 <- words; w2 <- words if w1 != w2) yield (w1, Map(w2 -> 1L))
  }
  .group[String, Map[String, Long]]
  .sum // maps are merged key by key, adding the counts
  .flatMap { case (word, wordMap) =>
    wordMap.map { case (otherWord, count) => (word, otherWord, count) }
  }
  .write(TypedTsv[(String, String, Long)](args("output")))
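A rough local equivalent, assuming plain Scala collections instead of Scalding, shows what the summed maps look like on a tiny input. The merge helper is hypothetical and written out by hand here; in the job above that merging is what .sum does.

```scala
object LocalCoOccurrence {
  // Merge two count maps by adding the counts of shared keys.
  def merge(a: Map[String, Long], b: Map[String, Long]): Map[String, Long] =
    b.foldLeft(a) { case (acc, (k, v)) => acc.updated(k, acc.getOrElse(k, 0L) + v) }

  // For every word, count how often each other word appears on the same line.
  def coOccurrences(lines: Seq[String]): Set[(String, String, Long)] = {
    val pairs = for {
      line <- lines
      words = line.split("\\s+").toSeq
      w1 <- words
      w2 <- words if w1 != w2
    } yield (w1, Map(w2 -> 1L))

    pairs.groupBy(_._1).toSeq.flatMap { case (word, kvs) =>
      kvs.map(_._2)
        .foldLeft(Map.empty[String, Long])(merge)
        .map { case (other, n) => (word, other, n) }
    }.toSet
  }
}
```

For the lines "a b" and "a b c", the pair (a, b) occurs on both lines, so its count is 2, while (a, c) occurs once.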
Important concepts 
Scalding leverages a lot of Scala idioms, as well as 
concepts from functional programming 
● map 
o a 1 to 1 mapping for every piece of data 
● flatMap 
o a 1 to 0 or more mapping for every piece of data 
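The difference is easiest to see on an ordinary Scala Seq, which has the same two methods. A small sketch with made-up sample lines:

```scala
object MapVsFlatMap {
  val lines: Seq[String] = Seq("hello world", "", "scalding")

  // map: exactly one output per input, so three lines give three lengths.
  val lengths: Seq[Int] = lines.map(_.length)

  // flatMap: zero or more outputs per input; the empty line contributes nothing.
  val words: Seq[String] =
    lines.flatMap(_.split("\\s+").toSeq.filter(_.nonEmpty))
}
```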
Important concepts (continued) 
● Typeclasses 
o The separation of computation from data types 
o Think Java’s Comparator (but way more powerful) 
o These are what power .sum 
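As a minimal sketch of the idea, the typeclass below describes how to combine two values, with instances for Long and for maps. The trait and its instances are hypothetical and hand-rolled for illustration; they only mimic the kind of abstraction a library such as Algebird provides and are not its actual definitions.

```scala
object TypeclassSketch {
  // Hypothetical minimal typeclass: how to combine two values of type T.
  trait Semigroup[T] { def plus(a: T, b: T): T }

  object Semigroup {
    implicit val longSemigroup: Semigroup[Long] = new Semigroup[Long] {
      def plus(a: Long, b: Long): Long = a + b
    }

    // Maps combine key by key, as long as their values can be combined.
    implicit def mapSemigroup[K, V](implicit sv: Semigroup[V]): Semigroup[Map[K, V]] =
      new Semigroup[Map[K, V]] {
        def plus(a: Map[K, V], b: Map[K, V]): Map[K, V] =
          b.foldLeft(a) { case (acc, (k, v)) =>
            acc.updated(k, acc.get(k).map(old => sv.plus(old, v)).getOrElse(v))
          }
      }
  }

  // One generic sum; the compiler picks the instance from the element type.
  def sum[T](xs: Seq[T])(implicit sg: Semigroup[T]): Option[T] =
    xs.reduceOption((a, b) => sg.plus(a, b))
}
```

The computation (sum) is written once and stays separate from the data types it works on, which is why the same call can add numbers in word count and merge maps in word co-occurrence.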
Limitations 
Scalding’s limitations are MapReduce’s limitations 
● Bad at iterative jobs 
● Lots of checkpointing, serialization, sorting 
However... 
● Cascading on Tez could help! 
o in progress as part of Cascading 3.0 
● So could Cascading on Spark! 
The cutting edge 
● REPL support 
● Executor[T] 
o Decoupling TypedPipes from specifics of the execution 
engine 
o Makes iterative algorithms much easier to express
● Macros 
o Allowing easier use of case classes 
o Closure analysis? 
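To illustrate why a value that describes a computation helps with iteration, here is a toy sketch. Exec is a hypothetical type invented for this example and is not Scalding's API: it only shows how steps that run later can be chained, so a loop becomes ordinary recursion over descriptions.

```scala
object ExecutionSketch {
  // Hypothetical toy type: a described computation that only runs
  // when `run` is called.
  final case class Exec[T](run: () => T) {
    def map[U](f: T => U): Exec[U] = Exec(() => f(run()))
    def flatMap[U](f: T => Exec[U]): Exec[U] = Exec(() => f(run()).run())
  }

  // An iterative algorithm as a chain of steps: apply `step` until `done` holds.
  def iterate[T](start: Exec[T])(step: T => Exec[T])(done: T => Boolean): Exec[T] =
    start.flatMap { t =>
      if (done(t)) Exec(() => t) else iterate(step(t))(step)(done)
    }
}
```

Nothing executes while the chain is being built; calling run() on the final value drives every step, which is the sense in which the description is decoupled from execution.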
Scalding at Twitter 
● Thousands of users 
o Engineers AND data scientists 
● Many thousands of jobs every day 
o ETL 
o Recommendations 
o Email 
o Time series analysis 
When you use Twitter, you’re using features powered by 
Scalding! 
Useful practices 
● A standardized “Job” subclass with company-specific
information
o Keeps the common case as simple as possible
o In particular, it should configure serialization for users
● Separate data from functions on data
o At Twitter, this means Thrift for data, and various Scala
functions operating on that data
o Decouples the specification of some data from the derived
data people want based on it
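The second practice can be sketched as follows. The case class is an assumption standing in for a Thrift-generated record, and the Tweet fields and hashtags function are invented for illustration.

```scala
object DataVsFunctions {
  // Stand-in for a Thrift-generated record: pure data, no behaviour.
  final case class Tweet(userId: Long, text: String)

  // Derived data lives in plain functions, kept apart from the record definition.
  object TweetFunctions {
    def hashtags(t: Tweet): Seq[String] =
      t.text.split("\\s+").toSeq.filter(w => w.startsWith("#") && w.length > 1)
  }
}
```

New derived values can then be added as functions without touching the data specification.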
Q&A 
Contribute! 
● Scalding 
● Algebird 
o Math-inspired aggregators (.sum uses it)
● Bijection 
o Conversion and serialization made fun 
● Summingbird 
o Abstraction for batch and online map/reduce (see resources for more) 
More resources 
Scalding/Algebird 
• Oscar Boykin: Algebra for Scalable Analytics 
• Avi Bryant: Add ALL the Things 
• Oscar Boykin, Argyris Zymnis: Scalding: Powerful & Concise MapReduce
You might also be interested in… 
• Summingbird! Streaming real-time and batch analytics, unified and made 
beautiful 
• Oscar Boykin: Introduction to Summingbird 
• Oscar Boykin, Sam Ritchie, Ian O’Connell, Jimmy Lin: 
Summingbird, A Framework for Integrating Batch and Online MapReduce 
Computations 
Next Webinar – Oct 2 - Spark 
Writing applications to Hadoop and YARN using Spark 
• October 2nd at 9am Pacific Time 
• Register 
Find all webinars 
• Hortonworks.com/webinars 
Find past recorded webinars 
• Hortonworks.com/webinars/#library 
Thank you! 

YARN webinar series: Using Scalding to write applications to Hadoop and YARN
