Boost Performance with Scala – Learn From Those Who’ve Done It!

© Hortonworks Inc. 2011 – 2014. All Rights Reserved
We do Hadoop.

Your speakers…
Dhruv Kumar
Partner Solutions Engineer
Hortonworks
Cyrille Chépélov
R&D Director
Transparency Rights Management

Hadoop for the Enterprise:
Implement a Modern Data Architecture with HDP
Customer Momentum
• 437+ customers (as of March 31, 2015)
Hortonworks Data Platform
• Completely open multi-tenant platform for any app & any data.
• A centralized architecture of consistent enterprise services for
resource management, security, operations, and governance.
Partner for Customer Success
• Open source community leadership focused on enterprise needs
• Unrivaled world class support
• Founded in 2011
• Original 24 architects, developers,
operators of Hadoop from Yahoo!
• 600+ Employees
• 1,000+ Ecosystem Partners

Traditional systems under pressure
Challenges
• Constrains data to app
• Can’t manage new data
• Costly to Scale
[Figure: data growth from 2.8 zettabytes (2012) to 40 zettabytes (2020). New data sources (clickstream, geolocation, web data, Internet of Things, docs and emails, server logs) add to traditional sources (ERP, CRM, SCM). Business value separates industry leaders from laggards.]

Hadoop emerged as foundation of new data architecture
Apache Hadoop is an open source data platform for
managing large volumes of high-velocity, high-variety data
• Built by Yahoo! to be the heartbeat of its ad & search business
• Donated to Apache Software Foundation in 2005 with rapid adoption by
large web properties & early adopter enterprises
• Incredibly disruptive to current platform economics
Traditional Hadoop Advantages
• Manages new data paradigm
• Handles data at scale
• Cost effective
• Open source
Traditional Hadoop Had Limitations
• Batch-only architecture
• Single purpose clusters, specific data sets
• Difficult to integrate with existing investments
• Not enterprise-grade
[Figure: the traditional Hadoop stack, with an application on batch processing (MapReduce) over storage (HDFS).]

Modern Data Architecture emerges to unify data & processing
Modern Data Architecture
• Enable applications to have access to
all your enterprise data through an
efficient centralized platform
• Supported with a centralized approach to
governance, security and operations
• Versatile to handle any applications
and datasets no matter the size or type
[Figure: Modern Data Architecture. Sources: existing systems (ERP, CRM, SCM), clickstream, web & social, geolocation, sensor & machine, server logs, unstructured data. Platform: HDFS (Hadoop Distributed File System) with YARN as the data operating system, running batch, interactive, real-time and partner ISV workloads alongside the EDW and MPP systems. Analytics: data marts, business analytics, visualization & dashboards, applications.]

Hortonworks & Concurrent
Hortonworks and Concurrent Advance Enterprise
Data Application Development on Hadoop
HDP Integrates and delivers Cascading SDK
• Collection of tools, documentation, libraries,
tutorials and example projects
• Simplifies SQL integration and enables Scala
development for Hadoop
Hortonworks provides level 1 & 2 support for
Cascading SDK
Cascading is the proven application development
platform for building data applications on Hadoop

Hortonworks & Concurrent: Partnership Benefits
• SDK empowers developers to quickly build rich data-centric
enterprise applications on Hadoop
• Leverage existing Java or Scala based skill sets to develop
complex applications
• Combines the robustness and simplicity of Cascading with
the reliability and stability of HDP
• Apps built on Cascading such as Scalding can easily take
advantage of YARN and Tez

Cascading SDK: Overview
• The most widely used application
development framework for building Big
Data applications
• Enables improved Developer Productivity
for enterprises using HDP

HDP Integration of Cascading SDK
• SDKs that enable the rapid
development of batch and
interactive data-driven applications
• Integration with data processing
layer allows Cascading to take
advantage of advances in
interactive applications
[Figure: HDP stack. Presentation & application layer: enable both existing and new applications to provide value to the organization. Interactive data processing (Tez) and batch data processing (MapReduce) each support Java (Cascading), Scala (Scalding), SQL (Lingual) and ML (Pattern), on top of efficient cluster resource management & shared services (YARN).]
PRESENTATION & APPLICATION

Your Trusted Third Party in the Digital Age™
Scalding on Tez

Copyright © 2015 Transparency Rights Management. All rights reserved
HOW DID WE CHOOSE SCALDING ?

Why Scalding?
Transparency Rights Management:
• A Trusted Third Party
– Data escrow, controlled execution
– Independent re-computation
– Privacy & Personal Data compliance assessment
• Big Data Services for Entertainment
– Metadata enrichment
– IP use certification
– Dataset analysis as a service

Why Scalding?
« Big Data Services for Entertainment »: a use case
[Diagram: Digital Service Provider → Report → Copyright Owners / Collective Management Organizations, with four added services: Data Improvement, Automatic Data Feed (« in your format »), Independent Report, Conformance Report.]

Why Scalding?
Dataset analysis (from YouTube monthly reports)
• September 2013: SQL Server overheats
• October 2013: using Lingual (12 SQL steps + bash scripts)
• September 2014: Cascading + Java
• September 28th: tried out Scalding
• November 2014: delivered first results on Scalding
• April 2015: first success on Scalding+Tez

Anatomy of a Scalding app
• Your app (in Scala): you
• Scalding: @TwitterOSS
• Cascading: Concurrent, Inc.
• Hadoop + Tez platform libraries: Apache

SCALDING ON TEZ,
THE MINI-HOWTO

Scalding on Tez, the mini-HOWTO
• Step 0: Prerequisites:
– A YARN cluster
– Cascading 3.0
– TEZ runtime lib in HDFS (0.6.2-SNAPSHOT)
– A version of Scalding with fabric selection (0.13.1 + PR1220)

Scalding on Tez, the mini-HOWTO
• Step 1: build.sbt
https://github.com/cchepelov/wcplus/blob/master/build.sbt

• Step 1: build.sbt (redux)
1. Regain control over which libraries are included
2. Exclude some « long transitive » dependencies that pull in junk
3. Put in the desired fabric, in a configurable way:
sbt -DCASCADING_FABRIC=hadoop clean assembly
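One way to picture the configurable fabric in item 3 is a small helper that turns the `CASCADING_FABRIC` property into a Cascading artifact name. This is a sketch under assumptions, not the content of the wcplus build: the object name `FabricSelection`, the default fabric and the validation step are hypothetical; the artifact names follow the `cascading-<fabric>` pattern used by the Cascading project.

```scala
// Hypothetical helper, not taken from the wcplus build.
object FabricSelection {
  // Fabric suffixes of published Cascading artifacts (cascading-local, cascading-hadoop, ...).
  val knownFabrics: Set[String] = Set("local", "hadoop", "hadoop2-mr1", "hadoop2-tez")

  /** Reads the fabric from a property map; the default value is an assumption. */
  def fabricFrom(props: collection.Map[String, String], default: String = "hadoop2-tez"): String = {
    val fabric = props.getOrElse("CASCADING_FABRIC", default)
    require(knownFabrics.contains(fabric), s"Unknown Cascading fabric: $fabric")
    fabric
  }

  /** Artifact name for a fabric, e.g. "hadoop" gives "cascading-hadoop". */
  def artifactFor(fabric: String): String = s"cascading-$fabric"
}

// With the object placed under project/, build.sbt could use it roughly as
// (cascadingVersion is a hypothetical setting):
//   val fabric = FabricSelection.fabricFrom(sys.props)
//   libraryDependencies += "cascading" % FabricSelection.artifactFor(fabric) % cascadingVersion
```

Keeping the fabric in a single property means the same source tree can produce one fat jar per fabric, which matters because the fabric jars cannot be loaded together.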

• Step 1bis: assembly.sbt
We’re using fatjars to simplify deployment.
Because of jar hell, we « need » a complicated assembly.sbt
https://github.com/cchepelov/wcplus/blob/master/assembly.sbt

• Step 2: a few job flags
https://github.com/cchepelov/wcplus/blob/master/src/main/scala/com/transparencyrights/demo/wcplus/CommonJob.scala

• Step 2: a few job flags (continued)
• tez.task.resource.memory.mb
– As large as you can afford to give, per CPU per node
– The more memory, the less Tez needs to spill intermediates to disk
• tez.container.max.java.heap.fraction
– Defaults (1024 MiB * 0.8) assume the JVM’s native memory requirements don’t exceed about 205 MiB
– Scalding + the Scala runtime + Cascading on top of Tez seem to require more. YARN kills offenders swiftly!
– The 460 MiB figure we’re using comes from (1024+512)*(1-0.7)
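The arithmetic behind these two flags can be checked with a few lines of plain Scala. This is only a sketch: the object and method names are hypothetical, and the property map shows the values implied by the figures above rather than the actual contents of CommonJob.scala.

```scala
object TezMemorySketch {
  /** Splits a container of `containerMiB` into (JVM heap, remainder left for
    * native memory), given the heap fraction. */
  def split(containerMiB: Int, heapFraction: Double): (Double, Double) = {
    val heap = containerMiB * heapFraction
    (heap, containerMiB - heap)
  }

  // Values implied by the slide, written as Hadoop-style string properties.
  val settings: Map[String, String] = Map(
    "tez.task.resource.memory.mb" -> (1024 + 512).toString,
    "tez.container.max.java.heap.fraction" -> "0.7"
  )
}

// Default container: about 819 MiB of heap, about 205 MiB for everything else.
// TezMemorySketch.split(1024, 0.8)
// Container used here: about 1075 MiB of heap, about 461 MiB for everything else.
// TezMemorySketch.split(1536, 0.7)
```

Lowering the heap fraction while raising the container size grows both the heap and the native headroom, which is the point of changing the two flags together.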

THAT’S IT.
(ALMOST)

IN PRACTICE…

« A VERSION OF SCALDING WITH FABRIC
SELECTION »
WAIT, WHAT?

In practice
« A version of Scalding with fabric selection »: wait, what?
Scalding’s traditional --local and --hdfs flags:
– Use either LocalFlowConnector or HadoopFlowConnector
– Types are hard-coded
Cascading 2.5 introduced a new fabric concept. You can run either with cascading-hadoop or with cascading-hadoop2-mr1. But:
– Incompatible jars (can’t load both)
– Main types visible to Scalding are different

In practice
« A version of Scalding with fabric selection »: wait, what?
PR1220:
• No longer hardcodes « either Local or Hadoop 1.X »
• Enables supplying any flow connector implementation, as long as the jar’s around
• --hdfs to be deprecated as an alias to --hadoop1
• Still built against Cascading 2.6
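The idea of supplying any flow connector « as long as the jar’s around » can be illustrated with a lookup by class name at run time. This is a sketch of the general technique, not the code in PR1220: the object name and the fabric keys are hypothetical (only hadoop1 echoes a flag named above), while the class names are the flow connectors shipped in the respective Cascading fabric jars.

```scala
import scala.util.Try

object FabricConnectors {
  // Left: illustrative fabric keys. Right: flow connector classes from the Cascading fabric jars.
  val connectorClassNames: Map[String, String] = Map(
    "local"       -> "cascading.flow.local.LocalFlowConnector",
    "hadoop1"     -> "cascading.flow.hadoop.HadoopFlowConnector",
    "hadoop2-mr1" -> "cascading.flow.hadoop2.Hadoop2MR1FlowConnector",
    "hadoop2-tez" -> "cascading.flow.tez.Hadoop2TezFlowConnector"
  )

  /** Resolves the connector class by name, so this code compiles without any
    * fabric jar. Fails if the fabric is unknown or its jar is not on the classpath. */
  def loadConnectorClass(fabric: String): Try[Class[_]] =
    Try(connectorClassNames(fabric)).flatMap(name => Try(Class.forName(name)))
}
```

Because nothing refers to a connector type at compile time, only the jar of the selected fabric needs to be present, which sidesteps the « can’t load both » problem.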

« STILL BUILT ON CASCADING 2.6 »
WHY?

In practice
« Still built on Cascading 2.6 »
Cascading 3.0 has carefully updated some argument types to prepare for the future.
In Java, this is source- and binary-compatible.
[Diagram: the same consumer runs against both the library and LibraryV2.]
Scala enforces generic type safety, and the Cascading 3.0 upgrades are not legal with scalac.
But they still are with the JVM…
In practice
… Going to native Cascading 3.0?
Scalding will require some adjustment to become compatible with the Java-level source upgrades.
Can this happen without breaking Scalding application source code?

GUAVA

GUAVAGUAVA

In practice…
Guava
• Guava is a nice library… of little use in Scala (?)
• In a Scalding/Cascading/Tez JVM, multiple versions of Guava are required. Each layer depends on its own version: about every single version from 11.0 to 16.0.2.
• There have been breaking changes (method renames & removals) in Guava 13
• These happen on really mundane objects

In practice…
Guava
• Discussions and actions are in progress to remove the pain
• In the meantime, we use a patched version, « frankenguava », which provides both older and newer interfaces to keep all consumers happy across the stack.

CASCADING’S TEZ*REGISTRY

In practice…
Cascading’s Tez*Registry
• Cascading 3.0 uses a set of mapping registries to convert Cascading patterns into the back-end API. The Tez registries are new, and distinct from the MR registries.
• The Tez registries are hardened against Concurrent’s extensive test library, which is built on years of MR experience. Tez has its own trouble spots. Beware of hash joins.

In practice…
Cascading’s Tez*Registry
• It works mostly fine now, but getting the Scalding test library on board will go a long way.
Last-minute update: .filterWithValue / .mapWithValue currently crash the Cascading planner (as of 3.0.1); their implementation uses a HashJoin.

AN EXAMPLE


A small test: « wc plus »
Input: 70 books (1.1M lines, 10M words, 56M bytes)
Processing: compute frequencies, ignoring things that are more frequent than 80% of the max word frequency
Outputs:
• Word, relative frequency, deviation from median relative freq
• Two Words, relative frequency
• Ten Words, relative frequency
• All Expressions (1-W to 10-W), relative frequency, …
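To make the computation concrete, here is a minimal sketch of the same idea on plain Scala collections. It is not the Scalding job from the wcplus repository: the object and method names are hypothetical, and it assumes that relative frequency means a count divided by the total number of items before filtering, and that the 80% cutoff is applied to raw counts.

```scala
object WcPlusSketch {
  /** All runs of `n` consecutive words, joined by single spaces. */
  def ngrams(words: Seq[String], n: Int): List[String] =
    if (n <= 0 || words.size < n) Nil
    else words.sliding(n).map(_.mkString(" ")).toList

  /** Relative frequency of each item, dropping items whose count exceeds
    * `cutoff` times the highest count. */
  def relativeFrequencies(items: Seq[String], cutoff: Double = 0.8): Map[String, Double] = {
    val counts = items.groupBy(identity).map { case (item, same) => item -> same.size }
    if (counts.isEmpty) Map.empty
    else {
      val maxCount = counts.values.max
      val total = items.size.toDouble
      counts.collect { case (item, c) if c <= cutoff * maxCount => item -> c / total }
    }
  }
}

// Usage: frequencies of two-word expressions in a tokenized text.
// WcPlusSketch.relativeFrequencies(WcPlusSketch.ngrams(words, 2))
```

On a cluster each of these steps becomes a grouping or a join, which is why this small job is a reasonable stress test for the planner.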

Input: 70 books (1.1M lines, 10M words, 56M bytes)
Processing: compute frequencies, ignoring things that are more frequent than 80% of the max word frequency
Outputs:
• Word, relative frequency
• Two Words, relative frequency
• Ten Words, relative frequency
• All Expressions (1-W to 10-W), relative frequency, …
Note: no .filterWithValue / .mapWithValue for now
[Diagram: four « count » steps. Image credit: Roulex45 / Wikipedia]

https://github.com/cchepelov/wcplus

TIPS & TRICKS

Tips & Tricks
0000-step-node-sub-graph.dot
Run your job with
-Dcascading.planner.plan.path=/tmp/path/to/plan.lst
The planner will output a lot of useful files. One of them is
…/$(Job)/4-final-flow-steps/0000-step-node-sub-graph.dot
Run that file through graphviz:
dot -O -Tpdf 0000-step-node-sub-graph.dot
or, if the PDF is illegible, Firefox’s great at zooming into SVG files:
dot -O -Tsvg 0000-step-node-sub-graph.dot

Tips & Tricks
0000-step-node-sub-graph.dot
This is how TEZ names our stuff !

Tips & Tricks
Major differences between how a Cascading job gets mapped to MR and to TEZ:
MR
– One flow, many (MANY) independent steps
– One or more operators per step
– Step-to-step communications involve disk (HDFS)
– Each step is independent as far as MR is concerned
– Step scheduling managed from outside the cluster, by Cascading
TEZ
– One flow, one DAG. A DAG includes several nodes.
– One or more operators per node
– Node-to-node communications managed by TEZ: memory, direct network or disk as necessary
– YARN sees one « Application » per flow
– Node scheduling managed by the TEZ DAG AppMaster

Tips & Tricks
yarn-swimlanes.sh
• A tool included in the Tez source distribution, in tez-tools/swimlanes (bash + python)
• Requires YARN ATS to work
– « yarn logs -applicationId application_1345431315_1511 » must work
• Reports, in a Gantt chart, the per-container occupation

Tips & Tricks
yarn-swimlanes.sh (2)
application_1435150225179_0474.svg

Tips & Tricks
yarn-swimlanes.sh (3)
[Figure: swimlane chart, with time on one axis and containers on the other]

Tips & Tricks
Consider using .forceToDisk to ensure work is
balanced within the DAG
890 seconds
160 seconds


Tips & Tricks
Consider using .forceToDisk to ensure work is balanced within the DAG
• .forceToDisk really means « don’t merge those two TEZ nodes », which implies « manage appropriate data transmission between these two nodes »
• TextFile & other FixedPathSource friends don’t seem to automatically spread out work as well as they used to (huh?)
• YMMV, WIP.

PERFORMANCE

Performance
MR vs TEZ

Performance
MR vs TEZ; to scale

Performance
MR vs TEZ; TO SCALE!!!
MR run time:
14:22 (wall)
12:49 (cluster time)
5:43:26 (total CPU)
TEZ run time:
4:03 (wall)
2:50 (cluster time)
1:25:35 (total CPU)
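The ratios behind these figures are easy to work out with a small helper. The object and method names below are hypothetical; the inputs are the run times listed above.

```scala
object RunTimes {
  /** Parses "m:ss" or "h:mm:ss" into a number of seconds. */
  def seconds(clock: String): Int =
    clock.split(":").map(_.trim.toInt).foldLeft(0)((acc, part) => acc * 60 + part)

  /** How many times shorter the `after` run is than the `before` run. */
  def speedup(before: String, after: String): Double =
    seconds(before).toDouble / seconds(after)
}

// RunTimes.speedup("14:22", "4:03")      // wall time, MR vs TEZ
// RunTimes.speedup("12:49", "2:50")      // cluster time
// RunTimes.speedup("5:43:26", "1:25:35") // total CPU
```

On these numbers TEZ comes out roughly 3.5 times faster in wall time, 4.5 times faster in cluster time, and uses about a quarter of the total CPU.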

CONCLUSION

Conclusion
Apache Tez enables very significant performance gains compared to traditional MapReduce applications, on the same cluster and alongside the legacy.
The new Tez back-end built by Concurrent enables these exciting performance gains for existing Cascading and Scalding applications.
Taking advantage of these performance gains should become as easy as upgrading and …

Next Steps…
Download the Hortonworks Sandbox
Learn Hadoop
Build Your Analytic App
Try Hadoop 2
More about Concurrent & Hortonworks
http://hortonworks.com/partner/concurrent
More about Transparency Rights Management
http://www.transparencyrights.com/
Contact us: events@hortonworks.com

Q&A
