5. Scalding
• Scala wrapper for Cascading
• Just like working with in-memory collections (map/filter/sort…); see the sketch below
• Built-in parsers for TSV/CSV, date annotations, etc.
• Helper algorithms, e.g.:
  • approximations (Algebird library)
  • matrix API
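A minimal sketch of the "collections-like" feel using the Fields API; the job name and the 'user/'count columns are illustrative, not taken from the slides:

    import com.twitter.scalding._

    // Reads a TSV, then filters and maps it much like an in-memory collection.
    class MadeupFieldsJob(args: Args) extends Job(args) {
      Tsv(args("input"), ('user, 'count))
        .filter('count) { c: Int => c > 0 }                       // like a collection filter
        .map('user -> 'userUpper) { u: String => u.toUpperCase }  // like a collection map
        .write(Tsv(args("output")))
    }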
8. Building and Deploying
• Get sbt
• sbt assembly produces a jar file in target/scala-2.10 (see the plugin sketch below)
• sbt s3-upload produces the jar and uploads it to S3
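A minimal project/plugins.sbt sketch for getting the assembly task; the version number is a placeholder, and the plugin that provides s3-upload is not named in the slides:

    // project/plugins.sbt
    // sbt-assembly provides the `assembly` task that builds the fat jar.
    addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.11.2")

    // The `s3-upload` task comes from a separate S3 plugin (not shown here).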
9. Running on EMR
• hadoop fs -get s3://dev-adform-test/madeup-job.jar job.jar
• hadoop jar job.jar
    com.twitter.scalding.Tool                  (entry class)
    com.adform.dspr.MadeupJob                  (Scalding job class)
    --hdfs                                     (run in HDFS mode)
    --logs s3://dev-adform-test/logs           (parameter)
    --meta s3://dev-adform-test/metadata       (parameter)
    --output s3://dev-adform-test/output       (parameter)
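How those named parameters reach the job: a sketch of what a class like com.adform.dspr.MadeupJob could look like; only the class name and parameter names come from the command above, the body is a placeholder:

    import com.twitter.scalding._

    // Sketch only: reads the named command-line parameters via Args.
    class MadeupJob(args: Args) extends Job(args) {
      val logs   = args("logs")    // s3://dev-adform-test/logs
      val meta   = args("meta")    // s3://dev-adform-test/metadata
      val output = args("output")  // s3://dev-adform-test/output

      // Placeholder pipeline: copy the logs to the output location.
      Tsv(logs).read.write(Tsv(output))
    }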
For more complicated workflows you would have to use applications like Oozie or Pentaho, or write a
custom runner app; check out
https://gitz.adform.com/dco/dco-amazon-runner
10. Development
• Two APIs:
  • Fields – everything is a string
  • Typed – working with classes, e.g. Request/Transaction
11. Development
• Fields:
  • No need to parse columns
  • Redundancy
  • No IDE support like auto-completion
• Typed:
  • All benefits of types, esp. compile-time checking
  • More manual work with parsing
  • Sometimes the API can be confusing (TypedPipe/Grouped/CoGrouped…); see the sketch below
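For contrast, a minimal Typed API sketch; the Request case class, its fields, and the aggregation are assumptions for illustration, and API details vary a bit across Scalding versions:

    import com.twitter.scalding._

    // Hypothetical record type; the real Request/Transaction classes are not shown here.
    case class Request(userId: String, price: Double)

    class TypedMadeupJob(args: Args) extends Job(args) {
      TypedPipe.from(TypedTsv[(String, Double)](args("input")))
        .map { case (userId, price) => Request(userId, price) } // manual parsing into a case class
        .filter(_.price > 0.0)                                  // field access checked at compile time
        .groupBy(_.userId)                                      // the TypedPipe becomes a Grouped
        .mapValues(_.price)
        .sum                                                    // summing uses an Algebird Semigroup
        .toTypedPipe
        .write(TypedTsv[(String, Double)](args("output")))
    }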
12. Downsides
• A lot of configuring and googling of random issues
• Scarce documentation; you have to read the source code/Stack Overflow
• IntelliJ is slow
• Boilerplate code for parsing data
13. Some tips
• In local mode you specify files as input/output; in HDFS mode, folders
• You can use the Hadoop API to read files from HDFS directly, but only on the submitting
node, not in the pipeline
• As a workaround for the previous problem, you can use the distributed cache
mechanism, but that only works on Hadoop 1 AFAIK
• Default memory limit per mapper/reducer is ~200 MB; it can be raised by overriding
Job.config and adding "mapred.child.java.opts" -> "-Xmx<NUMBER>m" (see the sketch below)
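A sketch of that override; the exact signature of Job.config differs between Scalding versions, and 2048m is just an example value in place of <NUMBER>:

    import com.twitter.scalding._

    class BigMemoryJob(args: Args) extends Job(args) {
      // Raise the per-mapper/reducer JVM heap (example value; tune as needed).
      override def config: Map[AnyRef, AnyRef] =
        super.config + ("mapred.child.java.opts" -> "-Xmx2048m")

      // Placeholder pipeline so the job has something to run.
      Tsv(args("input")).read.write(Tsv(args("output")))
    }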