This document provides an overview of running Scalding on Tez. It begins by introducing the presenter and how Scalding was adopted. The document then covers:
1. Setting up Scalding to run on Tez, including specifying the fabric in build.sbt and setting job configuration flags.
2. An example job ("wc plus") that computes word frequencies from text.
3. Tips such as visualizing the Tez DAG using dot files and load balancing using forceToDisk.
4. Outstanding issues: upgrading Scalding for Cascading 3.0 and resolving Guava dependency conflicts across the stack.
Overall, Tez is described as easy for YARN shops to use.
Meet Jane. Jane loves music.
And Jane’s favourite music video platform has all the music Jane loves.
So Jane listens to music from the Platform.
After October 2013: moved on to other things; the topic was left in storage for a while.
September 2014: new model, same concept; built on plain Cascading to simplify some of the hairiest SQL logic (Optiq lacks, or lacked, analytic functions, so what was pretty much a single SQL statement in the SQL Server days had to be exploded into 12 stages).
Met guys from Lausanne at the end of September. Was already curious about Scala / Scalding then, decided to spend two days to give it a spin.
Never looked back!
Myriad’s still a wishlist item for now, as it doesn’t seem to play nice with YARN in HA mode.
We REALLY don’t want to misrepresent our maturity level.
TEZ 0.6.2-SNAPSHOT is required, as
Warning: the Tez 0.7 runtime is not API-compatible with 0.6 (although the source-level API is quite close). Cascading might change its Tez dependency from time to time…
The typical Hadoop+Tez stack pulls in a Jetty, a Tomcat, a Jersey, multiple Guavas, and the kitchen sink.
We believe our workload requires 270-ish MiB of native memory. When we have time, we’ll either power down for extra sticks of RAM, or attempt to shave 20 MiB of heap per TezChild.
(reportedly)
Prune & Graft
Why these two steps? The « same » code gets executed under wildly different CLASSPATHs: Cascading driver, TezChild, etc.
« Hash joins » means hash joins proper, but also .filter/mapWithValue, joinWithTiny, etc.
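To make the point concrete, here is a minimal sketch against Scalding's typed API. The job name, argument names and record layout are all hypothetical and not taken from the talk. It shows an explicit hashJoin next to a mapWithValue call that contains no visible join, yet which Scalding also plans as a hash join under the hood.

```scala
import com.twitter.scalding._

// Hypothetical job: paths, names and record layouts are illustrative only.
class HashJoinSketch(args: Args) extends Job(args) {

  // Large side: (trackId, playCount)
  val plays: TypedPipe[(String, Long)] =
    TypedPipe.from(TypedTsv[(String, Long)](args("plays")))

  // Small side: (trackId, title), assumed small enough to be held in memory by each task
  val titles: TypedPipe[(String, String)] =
    TypedPipe.from(TypedTsv[(String, String)](args("titles")))

  // 1. An explicit hash join: the small side is replicated to every task.
  val titled: TypedPipe[(String, (Long, String))] =
    plays.hashJoin(titles.group)

  // 2. A single computed value (the total play count)...
  val total: ValuePipe[Long] = plays.values.sum

  // ...attached to every record. No join is written here, but one is planned.
  val shares: TypedPipe[(String, Double)] =
    plays.mapWithValue(total) { (kv, totalOpt) =>
      val (trackId, count) = kv
      (trackId, totalOpt.fold(0.0)(t => count.toDouble / t))
    }

  titled
    .map { case (trackId, (count, title)) => (trackId, title, count) }
    .write(TypedTsv[(String, String, Long)](args("titled")))

  shares.write(TypedTsv[(String, Double)](args("shares")))
}
```

Both outputs end up with hash-join nodes in the plan, so any advice that applies to hash joins applies to the second pipeline as well.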
Who wants to see another « Word Count » ?
I’m not going to go through this: fairly standard code, except where I’ve been naïve. You get the idea.
« All of Apache goes to recent Guavas… » or drops the library altogether. At the very least, everyone not using the most recent version effing shades it.
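For reference, shading Guava in an sbt build can look roughly like the fragment below. This is a sketch that assumes the sbt-assembly plugin is already enabled for the project; the target package prefix is an arbitrary, hypothetical choice, and nothing here is taken from the talk's actual build.

```scala
// build.sbt fragment (assumes sbt-assembly is declared in project/plugins.sbt)
assembly / assemblyShadeRules := Seq(
  // Relocate the bundled Guava classes so they cannot clash with whatever
  // Guava version the cluster puts on the CLASSPATH.
  // "shaded.guava" is a hypothetical prefix; any unused package name works.
  ShadeRule.rename("com.google.common.**" -> "shaded.guava.@1").inAll
)
```

Older sbt versions spell the key as `assemblyShadeRules in assembly`; the rule itself is the same.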