SlideShare a Scribd company logo
1 of 14
SCALDING
Introduction and usage
What is Scalding?
• Scalding is a Scala based API for Map Reduce
applications
• Scalding is built on top of Cascading
• Cascading is a flow oriented processing framework which
acts as an abstraction layer for MapReduce
What is Cascading?
• Cascading introduces the
concept of source taps
(input) and sink taps
(output) and pipes to
connect them, essentially
abstracting the key/value
scheme in MR
• Within a pipe, users define
the transformation of data
by applying operations
such as GroupBy, Every
and others.
WordCount!
• WordCount in Cascading:
In comes Scalding
• Scalding was created by Twitter, basically as a DSL for
Cascading.
• The goal is to offer functions to operate on the data flow
as opposed to constructing objects with embedded
operations
• Scalding applications feel and behave like scripts, ideally
replacing Pig.
Scalding APIs
• Scalding offers three different APIs:
• Field API – a simple, abstracted symbol based function oriented
API, first choice for most use cases
• Type safe API – a more low level, typed API with closer access to
Cascading. This API is used for more complex inputs, such as Avro
• Matrix API – allows to apply matrix and vector operations to pipes,
however of type Int, Long and String (due to comparator ops)
• Both Field and Type APIs can convert to one another, the
APIs are designed to offer the same type of functions, i.e.
(Field) Pipe instances convert to TypePipe and vice versa.
Functions
• Scalding has Map – like functions, such as:
• map
• flatMap
• filter and filterNot
• collect
• Grouping / Joining functions:
• groupBy
• groupAll
• Join (left,right, outer etc)
• Reduce functions:
• reduce (DUH!)
• foldLeft
• average, sum
Documentation: https://github.com/twitter/scalding/wiki/Fields-
based-API-Reference
Example – Field API
Simple map and filter with the Field API
Example – Typed API
Simple mapping with Avro and TypedAPI
Example – Configuring and running
Configuration uses hadoop
And the Job / Toolrunner scheme:
Flow Listener
• You can monitor the execution progress with cascading
listeners.
1. Define Scalding Stat objects (Case classes for Hadoop
counters)
2. Increment within your operations by calling incBy(Int)
3. Implement FlowListener interface and increment your
Jobs listeners:
override def listeners = super.listeners ++ List(new
FlowListener)
Example: Flow Listener
Example: Flow Listener
Accessing stats values:
Resources
Scalding home and docs on Github:
https://github.com/twitter/scalding
Excellent intro and advanced topics:
http://www.slideshare.net/ktoso/scalding-the-notsobasics-scaladays-
2014

More Related Content

What's hot

Streaming sql w kafka and flink
Streaming sql w  kafka and flinkStreaming sql w  kafka and flink
Streaming sql w kafka and flinkKenny Gorman
 
GraphQL API on a Serverless Environment
GraphQL API on a Serverless EnvironmentGraphQL API on a Serverless Environment
GraphQL API on a Serverless EnvironmentItai Yaffe
 
Hkube
HkubeHkube
Hkubehkube
 
Writing an Interactive Interface for SQL on Flink
Writing an Interactive Interface for SQL on FlinkWriting an Interactive Interface for SQL on Flink
Writing an Interactive Interface for SQL on FlinkEventador
 
Uber Business Metrics Generation and Management Through Apache Flink
Uber Business Metrics Generation and Management Through Apache FlinkUber Business Metrics Generation and Management Through Apache Flink
Uber Business Metrics Generation and Management Through Apache FlinkWenrui Meng
 
Generating Pipeline Alignment Sheets Using FME
Generating Pipeline Alignment Sheets Using FMEGenerating Pipeline Alignment Sheets Using FME
Generating Pipeline Alignment Sheets Using FMESafe Software
 
Scaling graphite to handle a zerg rush
Scaling graphite to handle a zerg rushScaling graphite to handle a zerg rush
Scaling graphite to handle a zerg rushDaniel Ben-Zvi
 
Beam summit 2019 - Unifying Batch and Stream Data Processing with Apache Calc...
Beam summit 2019 - Unifying Batch and Stream Data Processing with Apache Calc...Beam summit 2019 - Unifying Batch and Stream Data Processing with Apache Calc...
Beam summit 2019 - Unifying Batch and Stream Data Processing with Apache Calc...Khai Tran
 
EUROCONTROL LARA - Presentation
EUROCONTROL LARA - PresentationEUROCONTROL LARA - Presentation
EUROCONTROL LARA - PresentationSalvatoreBI
 
StockPredictionML Presentation
StockPredictionML PresentationStockPredictionML Presentation
StockPredictionML PresentationDustinJasmin
 
An Introduction to the Heatmap / Histogram Plugin
An Introduction to the Heatmap / Histogram PluginAn Introduction to the Heatmap / Histogram Plugin
An Introduction to the Heatmap / Histogram PluginMitsuhiro Tanda
 
AWS Finland September Meetup - Using Amazon Neptune to build Fashion Knowledg...
AWS Finland September Meetup - Using Amazon Neptune to build Fashion Knowledg...AWS Finland September Meetup - Using Amazon Neptune to build Fashion Knowledg...
AWS Finland September Meetup - Using Amazon Neptune to build Fashion Knowledg...Uri Savelchev
 
Grafana optimization for Prometheus
Grafana optimization for PrometheusGrafana optimization for Prometheus
Grafana optimization for PrometheusMitsuhiro Tanda
 
The journey of Moving from AWS ELK to GCP Data Pipeline
The journey of Moving from AWS ELK to GCP Data PipelineThe journey of Moving from AWS ELK to GCP Data Pipeline
The journey of Moving from AWS ELK to GCP Data PipelineRandy Huang
 
Reservoir drainage workflow new
Reservoir drainage workflow newReservoir drainage workflow new
Reservoir drainage workflow newAndrew Zolnai
 
Stream Computing & Analytics at Uber
Stream Computing & Analytics at UberStream Computing & Analytics at Uber
Stream Computing & Analytics at UberSudhir Tonse
 
Apache Airflow Architecture
Apache Airflow ArchitectureApache Airflow Architecture
Apache Airflow ArchitectureGerard Toonstra
 

What's hot (20)

Streaming sql w kafka and flink
Streaming sql w  kafka and flinkStreaming sql w  kafka and flink
Streaming sql w kafka and flink
 
GraphQL API on a Serverless Environment
GraphQL API on a Serverless EnvironmentGraphQL API on a Serverless Environment
GraphQL API on a Serverless Environment
 
Hkube
HkubeHkube
Hkube
 
Writing an Interactive Interface for SQL on Flink
Writing an Interactive Interface for SQL on FlinkWriting an Interactive Interface for SQL on Flink
Writing an Interactive Interface for SQL on Flink
 
Uber Business Metrics Generation and Management Through Apache Flink
Uber Business Metrics Generation and Management Through Apache FlinkUber Business Metrics Generation and Management Through Apache Flink
Uber Business Metrics Generation and Management Through Apache Flink
 
Generating Pipeline Alignment Sheets Using FME
Generating Pipeline Alignment Sheets Using FMEGenerating Pipeline Alignment Sheets Using FME
Generating Pipeline Alignment Sheets Using FME
 
Scaling graphite to handle a zerg rush
Scaling graphite to handle a zerg rushScaling graphite to handle a zerg rush
Scaling graphite to handle a zerg rush
 
Stream Patterns
Stream PatternsStream Patterns
Stream Patterns
 
Beam summit 2019 - Unifying Batch and Stream Data Processing with Apache Calc...
Beam summit 2019 - Unifying Batch and Stream Data Processing with Apache Calc...Beam summit 2019 - Unifying Batch and Stream Data Processing with Apache Calc...
Beam summit 2019 - Unifying Batch and Stream Data Processing with Apache Calc...
 
EUROCONTROL LARA - Presentation
EUROCONTROL LARA - PresentationEUROCONTROL LARA - Presentation
EUROCONTROL LARA - Presentation
 
StockPredictionML Presentation
StockPredictionML PresentationStockPredictionML Presentation
StockPredictionML Presentation
 
An Introduction to the Heatmap / Histogram Plugin
An Introduction to the Heatmap / Histogram PluginAn Introduction to the Heatmap / Histogram Plugin
An Introduction to the Heatmap / Histogram Plugin
 
AWS Finland September Meetup - Using Amazon Neptune to build Fashion Knowledg...
AWS Finland September Meetup - Using Amazon Neptune to build Fashion Knowledg...AWS Finland September Meetup - Using Amazon Neptune to build Fashion Knowledg...
AWS Finland September Meetup - Using Amazon Neptune to build Fashion Knowledg...
 
Grafana optimization for Prometheus
Grafana optimization for PrometheusGrafana optimization for Prometheus
Grafana optimization for Prometheus
 
The journey of Moving from AWS ELK to GCP Data Pipeline
The journey of Moving from AWS ELK to GCP Data PipelineThe journey of Moving from AWS ELK to GCP Data Pipeline
The journey of Moving from AWS ELK to GCP Data Pipeline
 
AI at Scale
AI at ScaleAI at Scale
AI at Scale
 
Reservoir drainage workflow new
Reservoir drainage workflow newReservoir drainage workflow new
Reservoir drainage workflow new
 
Stream Computing & Analytics at Uber
Stream Computing & Analytics at UberStream Computing & Analytics at Uber
Stream Computing & Analytics at Uber
 
Apache Airflow Architecture
Apache Airflow ArchitectureApache Airflow Architecture
Apache Airflow Architecture
 
Fieldtrip GB
Fieldtrip GBFieldtrip GB
Fieldtrip GB
 

Viewers also liked

MapReduce with Scalding @ 24th Hadoop London Meetup
MapReduce with Scalding @ 24th Hadoop London MeetupMapReduce with Scalding @ 24th Hadoop London Meetup
MapReduce with Scalding @ 24th Hadoop London MeetupLandoop Ltd
 
Cascading at the Lyon Hadoop User Group
Cascading at the Lyon Hadoop User GroupCascading at the Lyon Hadoop User Group
Cascading at the Lyon Hadoop User Groupacogoluegnes
 
스칼라
스칼라스칼라
스칼라Daegeun Kim
 
Programming Cascading
Programming CascadingProgramming Cascading
Programming CascadingTaewook Eom
 
Scalding - Big Data Programming with Scala
Scalding - Big Data Programming with ScalaScalding - Big Data Programming with Scala
Scalding - Big Data Programming with ScalaTaewook Eom
 
Scalding - the not-so-basics @ ScalaDays 2014
Scalding - the not-so-basics @ ScalaDays 2014Scalding - the not-so-basics @ ScalaDays 2014
Scalding - the not-so-basics @ ScalaDays 2014Konrad Malawski
 
Scalding - Hadoop Word Count in LESS than 70 lines of code
Scalding - Hadoop Word Count in LESS than 70 lines of codeScalding - Hadoop Word Count in LESS than 70 lines of code
Scalding - Hadoop Word Count in LESS than 70 lines of codeKonrad Malawski
 
빅데이터 구축 사례
빅데이터 구축 사례빅데이터 구축 사례
빅데이터 구축 사례Taehyeon Oh
 
AWS를 활용하여 Daily Report 만들기 : 로그 수집부터 자동화된 분석까지
AWS를 활용하여 Daily Report 만들기 : 로그 수집부터 자동화된 분석까지AWS를 활용하여 Daily Report 만들기 : 로그 수집부터 자동화된 분석까지
AWS를 활용하여 Daily Report 만들기 : 로그 수집부터 자동화된 분석까지Changje Jeong
 
KGC 2014 가볍고 유연하게 데이터 분석하기 : 쿠키런 사례 중심 , 데브시스터즈
KGC 2014 가볍고 유연하게 데이터 분석하기 : 쿠키런 사례 중심 , 데브시스터즈KGC 2014 가볍고 유연하게 데이터 분석하기 : 쿠키런 사례 중심 , 데브시스터즈
KGC 2014 가볍고 유연하게 데이터 분석하기 : 쿠키런 사례 중심 , 데브시스터즈Minwoo Kim
 

Viewers also liked (11)

MapReduce with Scalding @ 24th Hadoop London Meetup
MapReduce with Scalding @ 24th Hadoop London MeetupMapReduce with Scalding @ 24th Hadoop London Meetup
MapReduce with Scalding @ 24th Hadoop London Meetup
 
Cascading at the Lyon Hadoop User Group
Cascading at the Lyon Hadoop User GroupCascading at the Lyon Hadoop User Group
Cascading at the Lyon Hadoop User Group
 
스칼라
스칼라스칼라
스칼라
 
Scalding
ScaldingScalding
Scalding
 
Programming Cascading
Programming CascadingProgramming Cascading
Programming Cascading
 
Scalding - Big Data Programming with Scala
Scalding - Big Data Programming with ScalaScalding - Big Data Programming with Scala
Scalding - Big Data Programming with Scala
 
Scalding - the not-so-basics @ ScalaDays 2014
Scalding - the not-so-basics @ ScalaDays 2014Scalding - the not-so-basics @ ScalaDays 2014
Scalding - the not-so-basics @ ScalaDays 2014
 
Scalding - Hadoop Word Count in LESS than 70 lines of code
Scalding - Hadoop Word Count in LESS than 70 lines of codeScalding - Hadoop Word Count in LESS than 70 lines of code
Scalding - Hadoop Word Count in LESS than 70 lines of code
 
빅데이터 구축 사례
빅데이터 구축 사례빅데이터 구축 사례
빅데이터 구축 사례
 
AWS를 활용하여 Daily Report 만들기 : 로그 수집부터 자동화된 분석까지
AWS를 활용하여 Daily Report 만들기 : 로그 수집부터 자동화된 분석까지AWS를 활용하여 Daily Report 만들기 : 로그 수집부터 자동화된 분석까지
AWS를 활용하여 Daily Report 만들기 : 로그 수집부터 자동화된 분석까지
 
KGC 2014 가볍고 유연하게 데이터 분석하기 : 쿠키런 사례 중심 , 데브시스터즈
KGC 2014 가볍고 유연하게 데이터 분석하기 : 쿠키런 사례 중심 , 데브시스터즈KGC 2014 가볍고 유연하게 데이터 분석하기 : 쿠키런 사례 중심 , 데브시스터즈
KGC 2014 가볍고 유연하게 데이터 분석하기 : 쿠키런 사례 중심 , 데브시스터즈
 

Similar to Scalding: An Introduction to Scala API for MapReduce

Scalding: Twitter's Scala DSL for Hadoop/Cascading
Scalding: Twitter's Scala DSL for Hadoop/CascadingScalding: Twitter's Scala DSL for Hadoop/Cascading
Scalding: Twitter's Scala DSL for Hadoop/Cascadingjohnynek
 
Data Summer Conf 2018, “Building unified Batch and Stream processing pipeline...
Data Summer Conf 2018, “Building unified Batch and Stream processing pipeline...Data Summer Conf 2018, “Building unified Batch and Stream processing pipeline...
Data Summer Conf 2018, “Building unified Batch and Stream processing pipeline...Provectus
 
Hadoop Spark - Reuniao SouJava 12/04/2014
Hadoop Spark - Reuniao SouJava 12/04/2014Hadoop Spark - Reuniao SouJava 12/04/2014
Hadoop Spark - Reuniao SouJava 12/04/2014soujavajug
 
BDM25 - Spark runtime internal
BDM25 - Spark runtime internalBDM25 - Spark runtime internal
BDM25 - Spark runtime internalDavid Lauzon
 
Spark - The Ultimate Scala Collections by Martin Odersky
Spark - The Ultimate Scala Collections by Martin OderskySpark - The Ultimate Scala Collections by Martin Odersky
Spark - The Ultimate Scala Collections by Martin OderskySpark Summit
 
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...Jose Quesada (hiring)
 
Apache Spark - A High Level overview
Apache Spark - A High Level overviewApache Spark - A High Level overview
Apache Spark - A High Level overviewKaran Alang
 
Cascading on starfish
Cascading on starfishCascading on starfish
Cascading on starfishFei Dong
 
Unified, Efficient, and Portable Data Processing with Apache Beam
Unified, Efficient, and Portable Data Processing with Apache BeamUnified, Efficient, and Portable Data Processing with Apache Beam
Unified, Efficient, and Portable Data Processing with Apache BeamDataWorks Summit/Hadoop Summit
 
Introducing Kafka's Streams API
Introducing Kafka's Streams APIIntroducing Kafka's Streams API
Introducing Kafka's Streams APIconfluent
 
Spark from the Surface
Spark from the SurfaceSpark from the Surface
Spark from the SurfaceJosi Aranda
 
apidays LIVE Hong Kong 2021 - Multi-Protocol APIs at Scale in Adidas by Jesus...
apidays LIVE Hong Kong 2021 - Multi-Protocol APIs at Scale in Adidas by Jesus...apidays LIVE Hong Kong 2021 - Multi-Protocol APIs at Scale in Adidas by Jesus...
apidays LIVE Hong Kong 2021 - Multi-Protocol APIs at Scale in Adidas by Jesus...apidays
 
Seattle Spark Meetup Mobius CSharp API
Seattle Spark Meetup Mobius CSharp APISeattle Spark Meetup Mobius CSharp API
Seattle Spark Meetup Mobius CSharp APIshareddatamsft
 
Apache Spark Tutorial
Apache Spark TutorialApache Spark Tutorial
Apache Spark TutorialAhmet Bulut
 
Apache Spark - Intro to Large-scale recommendations with Apache Spark and Python
Apache Spark - Intro to Large-scale recommendations with Apache Spark and PythonApache Spark - Intro to Large-scale recommendations with Apache Spark and Python
Apache Spark - Intro to Large-scale recommendations with Apache Spark and PythonChristian Perone
 
Building Complex Data Workflows with Cascading on Hadoop
Building Complex Data Workflows with Cascading on HadoopBuilding Complex Data Workflows with Cascading on Hadoop
Building Complex Data Workflows with Cascading on HadoopGagan Agrawal
 
Apache spark y cómo lo usamos en nuestros proyectos
Apache spark y cómo lo usamos en nuestros proyectosApache spark y cómo lo usamos en nuestros proyectos
Apache spark y cómo lo usamos en nuestros proyectosOpenSistemas
 
Apache spark - Architecture , Overview & libraries
Apache spark - Architecture , Overview & librariesApache spark - Architecture , Overview & libraries
Apache spark - Architecture , Overview & librariesWalaa Hamdy Assy
 

Similar to Scalding: An Introduction to Scala API for MapReduce (20)

Scalding: Twitter's Scala DSL for Hadoop/Cascading
Scalding: Twitter's Scala DSL for Hadoop/CascadingScalding: Twitter's Scala DSL for Hadoop/Cascading
Scalding: Twitter's Scala DSL for Hadoop/Cascading
 
Data Summer Conf 2018, “Building unified Batch and Stream processing pipeline...
Data Summer Conf 2018, “Building unified Batch and Stream processing pipeline...Data Summer Conf 2018, “Building unified Batch and Stream processing pipeline...
Data Summer Conf 2018, “Building unified Batch and Stream processing pipeline...
 
Hadoop Spark - Reuniao SouJava 12/04/2014
Hadoop Spark - Reuniao SouJava 12/04/2014Hadoop Spark - Reuniao SouJava 12/04/2014
Hadoop Spark - Reuniao SouJava 12/04/2014
 
BDM25 - Spark runtime internal
BDM25 - Spark runtime internalBDM25 - Spark runtime internal
BDM25 - Spark runtime internal
 
Spark - The Ultimate Scala Collections by Martin Odersky
Spark - The Ultimate Scala Collections by Martin OderskySpark - The Ultimate Scala Collections by Martin Odersky
Spark - The Ultimate Scala Collections by Martin Odersky
 
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
 
Apache Spark - A High Level overview
Apache Spark - A High Level overviewApache Spark - A High Level overview
Apache Spark - A High Level overview
 
Apache pig
Apache pigApache pig
Apache pig
 
Cascading on starfish
Cascading on starfishCascading on starfish
Cascading on starfish
 
Unified, Efficient, and Portable Data Processing with Apache Beam
Unified, Efficient, and Portable Data Processing with Apache BeamUnified, Efficient, and Portable Data Processing with Apache Beam
Unified, Efficient, and Portable Data Processing with Apache Beam
 
Introducing Kafka's Streams API
Introducing Kafka's Streams APIIntroducing Kafka's Streams API
Introducing Kafka's Streams API
 
Spark from the Surface
Spark from the SurfaceSpark from the Surface
Spark from the Surface
 
apidays LIVE Hong Kong 2021 - Multi-Protocol APIs at Scale in Adidas by Jesus...
apidays LIVE Hong Kong 2021 - Multi-Protocol APIs at Scale in Adidas by Jesus...apidays LIVE Hong Kong 2021 - Multi-Protocol APIs at Scale in Adidas by Jesus...
apidays LIVE Hong Kong 2021 - Multi-Protocol APIs at Scale in Adidas by Jesus...
 
Seattle Spark Meetup Mobius CSharp API
Seattle Spark Meetup Mobius CSharp APISeattle Spark Meetup Mobius CSharp API
Seattle Spark Meetup Mobius CSharp API
 
Apache Spark Tutorial
Apache Spark TutorialApache Spark Tutorial
Apache Spark Tutorial
 
Graphql usage
Graphql usageGraphql usage
Graphql usage
 
Apache Spark - Intro to Large-scale recommendations with Apache Spark and Python
Apache Spark - Intro to Large-scale recommendations with Apache Spark and PythonApache Spark - Intro to Large-scale recommendations with Apache Spark and Python
Apache Spark - Intro to Large-scale recommendations with Apache Spark and Python
 
Building Complex Data Workflows with Cascading on Hadoop
Building Complex Data Workflows with Cascading on HadoopBuilding Complex Data Workflows with Cascading on Hadoop
Building Complex Data Workflows with Cascading on Hadoop
 
Apache spark y cómo lo usamos en nuestros proyectos
Apache spark y cómo lo usamos en nuestros proyectosApache spark y cómo lo usamos en nuestros proyectos
Apache spark y cómo lo usamos en nuestros proyectos
 
Apache spark - Architecture , Overview & libraries
Apache spark - Architecture , Overview & librariesApache spark - Architecture , Overview & libraries
Apache spark - Architecture , Overview & libraries
 

Recently uploaded

04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsTop 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsRoshan Dwivedi
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfEnterprise Knowledge
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Paola De la Torre
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure servicePooja Nehwal
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 

Recently uploaded (20)

04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsTop 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 

Scalding: An Introduction to Scala API for MapReduce

  • 2. What is Scalding? • Scalding is a Scala based API for Map Reduce applications • Scalding is built on top of Cascading • Cascading is a flow oriented processing framework which acts as an abstraction layer for MapReduce
  • 3. What is Cascading? • Cascading introduces the concept of source taps (input) and sink taps (output) and pipes to connect them, essentially abstracting the key/value scheme in MR • Within a pipe, users define the transformation of data by applying operations such as GroupBy, Every and others.
  • 5. In comes Scalding • Scalding was created by Twitter, basically as a DSL for Cascading. • The goal is to offer functions to operate on the data flow as opposed to constructing objects with embedded operations • Scalding applications feel and behave like scripts, ideally replacing Pig.
  • 6. Scalding APIs • Scalding offers three different APIs: • Field API – a simple, abstracted symbol based function oriented API, first choice for most use cases • Type safe API – a more low level, typed API with closer access to Cascading. This API is used for more complex inputs, such as Avro • Matrix API – allows to apply matrix and vector operations to pipes, however of type Int, Long and String (due to comparator ops) • Both Field and Type APIs can convert to one another, the APIs are designed to offer the same type of functions, i.e. (Field) Pipe instances convert to TypePipe and vice versa.
  • 7. Functions • Scalding has Map – like functions, such as: • map • flatMap • filter and filterNot • collect • Grouping / Joining functions: • groupBy • groupAll • Join (left,right, outer etc) • Reduce functions: • reduce (DUH!) • foldLeft • average, sum Documentation: https://github.com/twitter/scalding/wiki/Fields- based-API-Reference
  • 8. Example – Field API Simple map and filter with the Field API
  • 9. Example – Typed API Simple mapping with Avro and TypedAPI
  • 10. Example – Configuring and running Configuration uses hadoop And the Job / Toolrunner scheme:
  • 11. Flow Listener • You can monitor the execution progress with cascading listeners. 1. Define Scalding Stat objects (Case classes for Hadoop counters) 2. Increment within your operations by calling incBy(Int) 3. Implement FlowListener interface and increment your Jobs listeners: override def listeners = super.listeners ++ List(new FlowListener)
  • 14. Resources Scalding home and docs on Github: https://github.com/twitter/scalding Excellent intro and advanced topics: http://www.slideshare.net/ktoso/scalding-the-notsobasics-scaladays- 2014