5. Scalding
• Scala wrapper for Cascading
• Just like working with in-memory collections (map/filter/sort…); see the sketch below
• Built-in parsers for TSV/CSV, date annotations, etc.
• Helper algorithms, e.g.:
  • approximations (Algebird library)
  • matrix API
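A minimal sketch of the "collections-like" feel using the Fields API; the job name and the 'user/'count columns are illustrative, not taken from the slides:

    import com.twitter.scalding._

    // Reads a TSV, then filters and maps it much like an in-memory collection.
    class MadeupFieldsJob(args: Args) extends Job(args) {
      Tsv(args("input"), ('user, 'count))
        .filter('count) { c: Int => c > 0 }                       // like a collection filter
        .map('user -> 'userUpper) { u: String => u.toUpperCase }  // like a collection map
        .write(Tsv(args("output")))
    }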
8. Building and Deploying
• Get sbt
• sbt assembly produces a jar file in target/scala-2.10 (see the plugin sketch below)
• sbt s3-upload produces the jar and uploads it to S3
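A minimal project/plugins.sbt sketch for getting the assembly task; the version number is a placeholder, and the plugin that provides s3-upload is not named in the slides:

    // project/plugins.sbt
    // sbt-assembly provides the `assembly` task that builds the fat jar.
    addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.11.2")

    // The `s3-upload` task comes from a separate S3 plugin (not shown here).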
9. Running on EMR
• hadoop fs -get s3://dev-adform-test/madeup-job.jar job.jar
• hadoop jar job.jar
    com.twitter.scalding.Tool                  (entry class)
    com.adform.dspr.MadeupJob                  (Scalding job class)
    --hdfs                                     (run in HDFS mode)
    --logs s3://dev-adform-test/logs           (parameter)
    --meta s3://dev-adform-test/metadata       (parameter)
    --output s3://dev-adform-test/output       (parameter)
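How those named parameters reach the job: a sketch of what a class like com.adform.dspr.MadeupJob could look like; only the class name and parameter names come from the command above, the body is a placeholder:

    import com.twitter.scalding._

    // Sketch only: reads the named command-line parameters via Args.
    class MadeupJob(args: Args) extends Job(args) {
      val logs   = args("logs")    // s3://dev-adform-test/logs
      val meta   = args("meta")    // s3://dev-adform-test/metadata
      val output = args("output")  // s3://dev-adform-test/output

      // Placeholder pipeline: copy the logs to the output location.
      Tsv(logs).read.write(Tsv(output))
    }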
For more complicated workflows you would have to use applications like Oozie or Pentaho, or write a
custom runner app; check out
https://gitz.adform.com/dco/dco-amazon-runner
10. Development
• Two APIs:
  • Fields – everything is a string
  • Typed – working with classes, e.g. Request/Transaction
11. Development
• Fields:
  • No need to parse columns
  • Redundancy
  • No IDE support like auto-completion
• Typed:
  • All benefits of types, esp. compile-time checking
  • More manual work with parsing
  • Sometimes the API can be confusing (TypedPipe/Grouped/CoGrouped…); see the sketch below
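For contrast, a minimal Typed API sketch; the Request case class, its fields, and the aggregation are assumptions for illustration, and API details vary a bit across Scalding versions:

    import com.twitter.scalding._

    // Hypothetical record type; the real Request/Transaction classes are not shown here.
    case class Request(userId: String, price: Double)

    class TypedMadeupJob(args: Args) extends Job(args) {
      TypedPipe.from(TypedTsv[(String, Double)](args("input")))
        .map { case (userId, price) => Request(userId, price) } // manual parsing into a case class
        .filter(_.price > 0.0)                                  // field access checked at compile time
        .groupBy(_.userId)                                      // the TypedPipe becomes a Grouped
        .mapValues(_.price)
        .sum                                                    // summing uses an Algebird Semigroup
        .toTypedPipe
        .write(TypedTsv[(String, Double)](args("output")))
    }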
12. Downsides
• A lot of configuring and googling of random issues
• Scarce documentation; you have to read the source code/Stack Overflow
• IntelliJ is slow
• Boilerplate code for parsing data
13. Some tips
• In local mode you specify files as input/output; in HDFS mode, folders
• You can use the Hadoop API to read files from HDFS directly, but only on the submitting
node, not in the pipeline
• As a workaround for the previous problem, you can use the distributed cache
mechanism, but that only works on Hadoop 1 AFAIK
• Default memory limit per mapper/reducer is ~200 MB; it can be raised by overriding
Job.config and adding "mapred.child.java.opts" -> "-Xmx<NUMBER>m" (see the sketch below)
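A sketch of that override; the exact signature of Job.config differs between Scalding versions, and 2048m is just an example value in place of <NUMBER>:

    import com.twitter.scalding._

    class BigMemoryJob(args: Args) extends Job(args) {
      // Raise the per-mapper/reducer JVM heap (example value; tune as needed).
      override def config: Map[AnyRef, AnyRef] =
        super.config + ("mapred.child.java.opts" -> "-Xmx2048m")

      // Placeholder pipeline so the job has something to run.
      Tsv(args("input")).read.write(Tsv(args("output")))
    }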