Scalding on tez (final)

Your Trusted Third Party in the Digital Age™
Scalding on Tez
Twitter HQ, July 14th, 2015

Copyright © 2015 Transparency Rights Management. All rights reserved
2
Agenda
• Who’s this guy?
• How did we come to use Scalding?
• Scalding on Tez: the Mini-HOWTO
• In practice
• Tips and Tricks
• All aboard: how?
• Performance

3
WHO’S THIS GUY?

4
Who’s this guy?
• I’m 39
• My oldest computer is 33
8-bit Basic(s), Z80 assembly, Turbo Pascal, C++, Python, Java, ISO CNC, C#, Scala
Still afraid of Shapeless
Images: Amos Evans / « Rama » / Marcin Wichary // Wikipedia

5
HOW DID WE COME TO SCALDING?

6
Why Scalding?
Transparency Rights Management:
• A Trusted Third Party
– Data escrow, controlled execution
– Independent re-computation
– Privacy & Personal Data compliance assessment
• Big Data Services for Entertainment
– Metadata enrichment
– IP use certification
– Dataset analysis as a service

7
Why Scalding?
« Big Data Services for Entertainment » - a Use Case
Diagram labels: Digital Service Provider; Report; Copyright Owners / Collective Management Organizations

8
Why Scalding?
« Big Data Services for Entertainment » - a Use Case
Diagram labels: Digital Service Provider; Report; Copyright Owners / Collective Management Organizations; Data Improvement; Automatic Data Feed (« in your format »); Independent Report; Conformance Report

9
Why Scalding?
Dataset analysis (from YouTube monthly reports)
• September 2013: SQL Server overheats
• October 2013: using Lingual
– 12 SQL steps + bash scripts
• September 2014: Cascading + Java
• September 28th: tried out Scalding
• November 2014: delivered first results on Scalding
• April 2015: first success on Scalding+Tez

10
Our system…
Architecture diagram labels: Jenkins; git; Artifactory; Jenkins Slave; 4-way Non-Reg; Ansible; Mesos; Chronos; Marathon; YARN 2.6.0; HDFS 2.6.0; YARN RM; APP; scalding; cascading; APP (WS); Akka; Spray; Debian (on each machine shown)

11
Our system…
7 machines, and still a lot of things to discover

12
SCALDING ON TEZ,
THE MINI-HOWTO

13
Scalding on Tez, the mini-howto
• Step 0: Prerequisites:
– A YARN cluster (2.6.0)
– Cascading 3.0
– TEZ runtime lib in HDFS: 0.6.2-SNAPSHOT
– A version of scalding with fabric selection: 0.13.1 + PR1220

14
Scalding on Tez, the mini-HOWTO
• Step 1: build.sbt
https://github.com/cchepelov/wcplus/blob/master/build.sbt

15
• Step 1: build.sbt (redux)
1. Regain control on what libraries are included
2. Exclude some « long transitive » dependencies that pull in junk
3. Put in the desired fabric, in a configurable way
sbt -DCASCADING_FABRIC=hadoop clean assembly
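The three goals above can be sketched as a build.sbt fragment along the following lines. This is an illustration only, not the contents of the wcplus build: the default fabric, the version numbers and the guava exclusion are assumptions, and the stock scalding-core artifact named here does not contain the fabric-selection patch (PR1220).

```scala
// build.sbt (sketch; versions, default fabric and exclusion are illustrative assumptions)

// 3. Pick the fabric from a JVM system property, e.g. -DCASCADING_FABRIC=hadoop
val cascadingFabric  = sys.props.getOrElse("CASCADING_FABRIC", "hadoop2-tez")
val cascadingVersion = "3.0.1"

// Cascading artifacts are published on Conjars
resolvers += "conjars" at "http://conjars.org/repo"

libraryDependencies ++= Seq(
  // 1. State the core libraries explicitly instead of relying on transitive pulls
  "cascading" % "cascading-core" % cascadingVersion,
  // "hadoop" selects cascading-hadoop, "hadoop2-tez" selects cascading-hadoop2-tez
  "cascading" % s"cascading-$cascadingFabric" % cascadingVersion,
  // 2. Exclude transitive dependencies you would rather control yourself
  ("com.twitter" %% "scalding-core" % "0.13.1")
    .exclude("com.google.guava", "guava")
)
```

Reading the property with a default keeps a plain `sbt assembly` working, while the command line can still switch fabrics per build.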

16
• Step 1bis: assembly.sbt
We’re using fatjars to simplify deployment.
Because of jar hell, we « need » a complicated assembly.sbt
https://github.com/cchepelov/wcplus/blob/master/assembly.sbt

17
• Step 2: a few job flags
https://github.com/cchepelov/wcplus/blob/master/src/main/scala/com/transparencyrights/demo/wcplus/CommonJob.scala

18
• Step 2: a few job flags (continued)
• tez.task.resource.memory.mb
– As large as you can afford to give, per CPU per node
– The more memory, the less Tez needs to spill intermediates to disk
• tez.container.max.java.heap.fraction
– Defaults (1024 MiB * 0.8) assume the JVM’s native memory requirements don’t exceed about 205 MiB (1024 * 0.2)
– Scalding + the Scala runtime + Cascading on top of Tez seem to require more. YARN kills offenders swiftly!
– The 460 MiB figure we’re using, (1024+512)*(1-0.7), may be a bit wasteful
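The arithmetic behind these two flags can be checked with a few lines of plain Scala. The sketch below only splits a container size into JVM heap and native headroom and builds the two properties named on the slide; the object and method names are made up for illustration, and wiring the map into a job's configuration is not shown.

```scala
object TezMemorySketch {
  /** Splits a container of `containerMiB` into (JVM heap, native headroom), both in MiB. */
  def split(containerMiB: Int, heapFraction: Double): (Double, Double) = {
    val heap = containerMiB * heapFraction
    (heap, containerMiB - heap)
  }

  /** The two Tez properties discussed above, as string key/value pairs. */
  def flags(containerMiB: Int, heapFraction: Double): Map[String, String] = Map(
    "tez.task.resource.memory.mb"          -> containerMiB.toString,
    "tez.container.max.java.heap.fraction" -> heapFraction.toString
  )
}

object TezMemoryDemo extends App {
  // Defaults: 1024 MiB container, 0.8 heap fraction
  println(TezMemorySketch.split(1024, 0.8))
  // Setting used in the talk: (1024 + 512) MiB container, 0.7 heap fraction
  println(TezMemorySketch.split(1024 + 512, 0.7))
}
```

With the defaults, roughly 205 MiB is left outside the heap; with a 1536 MiB container and a 0.7 fraction, roughly 460 MiB is.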

19
THAT’S IT.
(ALMOST)

20
IN PRACTICE…

21
« A VERSION OF SCALDING WITH
FABRIC SELECTION »
WAIT, WHAT?

22
In practice
« A version of scalding with fabric selection » Wait, What?
Scalding’s traditional --local and --hdfs flags:
– Use either LocalFlowConnector or HadoopFlowConnector
– Types are hard-coded
Cascading 2.5 introduced a new fabric concept. You can run either with cascading-hadoop or with cascading-hadoop2-mr1. But:
– Incompatible jars (can’t load both)
– Main types visible to Scalding are different

23
In practice
« A version of scalding with fabric selection » Wait, What?
PR1220:
✓ No longer hardcodes « either Local or Hadoop 1.X »
✓ Enables supplying any flow connector implementation, as long as the jar’s around.
✓ --hdfs to be deprecated as an alias to --hadoop1
✓ Still built against Cascading 2.6
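The idea of "any flow connector, as long as the jar’s around" can be illustrated by looking connectors up by class name instead of referencing their types at compile time. The sketch below is not the code of PR1220: the fabric keys and the object are hypothetical, and only the Cascading connector class names are taken from the Cascading fabric jars.

```scala
import scala.util.Try

object FabricSketch {
  // Flow connector class shipped by each Cascading fabric jar (keys are hypothetical)
  val connectors: Map[String, String] = Map(
    "local"       -> "cascading.flow.local.LocalFlowConnector",
    "hadoop"      -> "cascading.flow.hadoop.HadoopFlowConnector",
    "hadoop2-mr1" -> "cascading.flow.hadoop2.Hadoop2MR1FlowConnector",
    "hadoop2-tez" -> "cascading.flow.tez.Hadoop2TezFlowConnector"
  )

  /** Class name for a fabric, if the fabric name is known. */
  def connectorClassName(fabric: String): Option[String] = connectors.get(fabric)

  /** True when the class can be loaded, i.e. its jar is on the classpath. */
  def isOnClasspath(className: String): Boolean =
    Try(Class.forName(className)).isSuccess

  /** Fabrics whose connector class is loadable in this JVM, sorted by name. */
  def availableFabrics: Seq[String] =
    connectors.collect { case (name, cls) if isOnClasspath(cls) => name }.toSeq.sorted
}
```

Because nothing here refers to a connector type directly, the same compiled code runs whichever single fabric jar was packaged into the fatjar.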

24
« STILL BUILT ON CASCADING 2.6 »
WHY?

25
In practice
« Still built on Cascading 2.6 »
Cascading 3.0 has carefully updated some argument types to prepare for the future.
This is source- and binary-compatible in Java:
Diagram labels: library; consumer; Library V2; same consumer
Scala enforces generic type safety, and the Cascading 3.0 upgrades are not legal with scalac.
But they still are with the JVM…

26
In practice
… Going to native Cascading 3.0?
Scalding will require some adjustment to become compatible with the Java-level source upgrades.
Can this happen without breaking scalding application source code?

27
GUAVA

28
GUAVA GUAVA

29
In practice…
Guava.
• Guava is a nice library… of little use in Scala (?)
• In a Scalding/Cascading/Tez JVM, multiple versions of guava are required. Each layer depends on its own version: about every single version from 11.0 to 16.0.2
• There have been breaking changes (method renames & removals) in guava 13
• These happen on really mundane objects (Closeable, Stopwatch), but they’re major troublemakers

30
In practice…
Guava Hell: a temporary solution
• Asking Apache to quickly upgrade to guava 18, or Google to re-introduce deprecated interfaces… probably not immediate
• Solution: Frankenguava.
Diagram label: Guava 18.0 JAR

31
In practice…
Diagram labels: Guava 18.0 JAR; Stopwatch & Closeables

32
In practice…
Diagram labels: Guava 18.0 JAR; Stopwatch & Closeables; Stopwatch & Closeables including deprecated overloads

33
In practice…
Frankenguava: howto
• Step 1: Post-prepare the Tez runtime
– Build tez from source
– Unpack runtime jar from tez-dist
– Remove guava
– Put frankenguava
– Repack
– Deploy on HDFS
• Step 2: Enforce the use of the appropriate guava

34
CASCADING’S TEZ*REGISTRY

35
In practice…
Cascading’s Tez*Registry
• Cascading 3.0 uses a set of mapping registries to convert cascading patterns into the back-end API. The Tez registries are new, and distinct from the MR registries
• The Tez registries are hardened against Concurrent’s extensive test library, which is built on years of MR experience. Tez has its own trouble spots. Beware of hash joins.
• It works fine now, but getting the scalding test library onboard will go a long way.

36
In practice…
Cascading’s Tez*Registry
• It works mostly fine now, but getting the scalding test library onboard will go a long way.
Last-minute update: .filterWithValue / .mapWithValue currently crash the Cascading planner (as of 3.0.1); their implementation uses a HashJoin.

37
AN EXAMPLE

38
A small test:

39
A small test: « wc plus »
Input: 70 books, 1.1M lines, 10M words, 56M bytes
Compute frequencies, ignoring things that are more frequent than 80% of the max word frequency
Outputs:
– Word, relative frequency, deviation from median relative freq
– Two Words, relative frequency
– Ten Words, relative frequency
– All Expressions (1-W to 10-W), relative frequency
…
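To make the shape of the « wc plus » computation concrete, here is a small in-memory analogue using plain Scala collections. It is not the Scalding job from the wcplus repository: the tokenisation rule, the names and the way the 80% cut-off is applied are assumptions made for illustration, and the deviation from the median is left out.

```scala
object WcPlusSketch {
  /** All n-word expressions (n = 1..maxN) found in one line, space-joined. */
  def expressions(line: String, maxN: Int): Seq[String] = {
    val words = line.toLowerCase.split("\\W+").filter(_.nonEmpty).toVector
    (1 to maxN).flatMap { n =>
      // sliding can yield a shorter final window, so keep full-size windows only
      words.sliding(n).filter(_.size == n).map(_.mkString(" "))
    }
  }

  /** Relative frequency of each item, dropping items whose count exceeds
    * `cutoff` times the highest count (0.8 mirrors the "80% of the max" rule). */
  def relativeFrequencies(items: Seq[String], cutoff: Double = 0.8): Map[String, Double] = {
    val counts = items.groupBy(identity).map { case (k, v) => k -> v.size }
    if (counts.isEmpty) Map.empty
    else {
      val maxCount = counts.values.max
      val total    = items.size.toDouble
      counts
        .filter { case (_, c) => c <= cutoff * maxCount }
        .map { case (k, c) => k -> c / total }
    }
  }
}
```

For example, `WcPlusSketch.expressions("The cat sat", 2)` yields the three single words followed by "the cat" and "cat sat"; the real job does the same kind of expansion up to ten words over the whole corpus.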

40
Input: 70 books, 1.1M lines, 10M words, 56M bytes
Compute frequencies, ignoring things that are more frequent than 80% of the max word frequency
Outputs:
– Word, relative frequency
– Two Words, relative frequency
– Ten Words, relative frequency
– All Expressions (1-W to 10-W), relative frequency
…
Diagram labels: count (×4)
No .filterWithValue / .mapWithValue for now
Image: Roulex45 / Wikipedia

41

42
TIPS & TRICKS

43
Tips & Tricks
0000-step-node-sub-graph.dot
Run your job with
-Dcascading.planner.plan.path=/tmp/path/to/plan.lst
The planner will output a lot of useful files. One of them is
…/$(Job)/4-final-flow-steps/0000-step-node-sub-graph.dot
Run that file through graphviz:
dot -O -Tpdf 0000-step-node-sub-graph.dot
or, if the PDF is illegible, Firefox is great at zooming into SVG files:
dot -O -Tsvg 0000-step-node-sub-graph.dot

44
Tips & Tricks
0000-step-node-sub-graph.dot
This is how TEZ names our stuff!

45
Tips & Tricks
Major differences between how a cascading job gets mapped to MR and to TEZ:
MR
– One flow, many (MANY) independent steps
– One or more operators per step
– Step-to-step communications involve disk (HDFS)
– Each step is independent as far as MR is concerned
– Step scheduling managed from outside the cluster, by Cascading
TEZ
– One flow, one DAG. A DAG includes several nodes.
– One or more operators per node
– Node-to-node communications managed by TEZ: memory, direct network or disk as necessary
– YARN sees one « Application » per flow
– Node scheduling managed by the TEZ DAG AppMaster

46
Tips & Tricks
yarn-swimlanes.sh
• A tool included in the tez source distribution, in tez-tools/swimlanes (bash + python)
• Requires YARN ATS to work
« yarn logs -applicationId application_1345431315_1511 » must work
• Reports, in a Gantt chart, the per-container occupation

47
Tips & Tricks
yarn-swimlanes.sh (2)
application_1435150225179_0474.svg

48
Tips & Tricks
yarn-swimlanes.sh (3)
Chart axes: time; containers

49
Tips & Tricks
Consider using .forceToDisk to ensure work is balanced within the DAG
Chart labels: 890 seconds; 160 seconds

51
Tips & Tricks
• Consider using .forceToDisk to ensure work is balanced within the DAG
• .forceToDisk really means « don’t merge those two TEZ nodes », which implies « manage appropriate data transmission between these two nodes »
• TextFile & other FixedPathSource friends don’t seem to automatically spread out work as well as they used to (huh?)
• YMMV, WIP.
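As an illustration of where such a boundary might go, the sketch below places .forceToDisk between reading the input and an expensive expansion, using Scalding’s typed API. The job name, arguments and the two-word expansion are assumptions for the example; it needs scalding-core on the classpath, and whether this placement actually helps depends on the job and on the plan the planner produces.

```scala
import com.twitter.scalding._

// Hypothetical job: counts two-word expressions in a text input
class ExpressionCount(args: Args) extends Job(args) {
  TypedPipe.from(TextLine(args("input")))
    // Boundary: keep the cheap read and the costly expansion in separate nodes
    .forceToDisk
    .flatMap { line =>
      val words = line.split("\\s+").filter(_.nonEmpty).toList
      // keep full two-word windows only
      words.sliding(2).filter(_.size == 2).map(_.mkString(" "))
    }
    .groupBy(identity)
    .size
    .write(TypedTsv[(String, Long)](args("output")))
}
```

Comparing the planner’s step-node graph and the swimlane chart with and without the call is the practical way to tell whether the boundary pays off.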

52
ALL ABOARD: HOW?

53
All aboard: how?
Smoothing out the UX for us app developers
• A build of scalding against Cascading 3.0.x
➢ Fabric-switching logic
➢ Get the test library to pass also on Tez
➢ Some applications might still uncover new mapping issues → increased community test case experience
➢ ???
• Getting the « guava mess » fixed
➢ Ideally all of Apache goes to recent guavas
➢ Enforced shading of Guava across the whole stack?
➢ Failing that, automated runtime patcher?
➢ (my « build stuff » partner makes me write: OSGI/Java9)
➢ ???
• Except for that, Tez is really easy for a YARN shop. Drop it in, and it runs!

54
PERFORMANCE

55
Performance
MR vs TEZ

56
Performance
MR vs TEZ; to scale

57
Performance
MR vs TEZ; TO SCALE!!!
MR run time:
14:22 (wall)
12:49 (cluster time)
5:43:26 (total CPU)
TEZ run time:
4:03 (wall)
2:50 (cluster time)
1:25:35 (total CPU)
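The ratios implied by these figures can be worked out with a short helper. This is just arithmetic on the numbers shown above; the object and method names are made up for the example.

```scala
object SpeedupSketch {
  /** Parses "m:ss" or "h:mm:ss" into a number of seconds. */
  def seconds(t: String): Int =
    t.split(":").map(_.trim.toInt).foldLeft(0)((acc, part) => acc * 60 + part)

  /** How many times longer the MR figure is than the TEZ figure. */
  def ratio(mr: String, tez: String): Double =
    seconds(mr).toDouble / seconds(tez)
}

object SpeedupDemo extends App {
  println(f"wall:    ${SpeedupSketch.ratio("14:22", "4:03")}%.2f x")
  println(f"cluster: ${SpeedupSketch.ratio("12:49", "2:50")}%.2f x")
  println(f"CPU:     ${SpeedupSketch.ratio("5:43:26", "1:25:35")}%.2f x")
}
```

On this run, TEZ comes out roughly 3.5 times faster in wall-clock time, 4.5 times in cluster time and 4 times in total CPU.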

58
Performance
Output of tez-tool « yarn-swimlanes.sh »
• 1 « swimlane » per active container
• 1 colour per DAG Vertex (the black dots are actually the Vertex ID)
• Container occupation is pretty good while there is work to do
• (not demonstrated here) containers die when they are idle.
This is good!

59
CONCLUSION

60
As a conclusion…
A lot of effort so far… …but worth it!
Images: Nicholas Babaian // Flickr. Marathon du Médoc 2008

61
THANKS!
For building that tech
For helping out
For your attention today
