Apache Crunch
● What is it?
● How does it work?
● Why use it?
● Hadoop MapReduce pipelines
● Scrunch
● Joins

www.semtech-solutions.co.nz
info@semtech-solutions.co.nz

Apache Crunch – Pipeline
● Crunch is based on Google's FlumeJava
● Provides a Java-based API for MapReduce (M/R) pipelines
● It uses an MST (multiple serialized types) data model
● Good for processing complex data types
● Better for “non-tuple” data types, e.g.
– Images
– Audio
– Seismic data

Apache Crunch – Pipeline
● What is a MapReduce pipeline?
– Map
– Shuffle
– Reduce
– Combine
● Stages are arranged in sequence and/or in parallel
● Chains can potentially be very long
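The stages listed above can be sketched in plain Python. This is only an illustration of the concept, not Crunch or Hadoop code: the function names (`map_phase`, `combine_phase`, `shuffle_phase`, `reduce_phase`, `word_count`, `top_words`) and the sample input are assumptions made up for this example. Note that the combine step, when used, runs on each mapper's output before the shuffle, as a local pre-aggregation.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map: turn one input record into (key, value) pairs."""
    return [(word.lower(), 1) for word in line.split()]


def combine_phase(pairs):
    """Combine: pre-aggregate one mapper's output before it is shuffled."""
    partial = defaultdict(int)
    for key, value in pairs:
        partial[key] += value
    return list(partial.items())


def shuffle_phase(pairs):
    """Shuffle: group every value by key across all mappers."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """First stage: map and combine per line, then shuffle and reduce."""
    mapped = [combine_phase(map_phase(line)) for line in lines]
    grouped = shuffle_phase(chain.from_iterable(mapped))
    return dict(reduce_phase(k, v) for k, v in grouped.items())


def top_words(counts, n):
    """Second stage chained onto the first: keep the n most frequent words."""
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:n]


# Hypothetical sample input, one string per "mapper".
lines = ["crunch runs map reduce", "map reduce pipelines chain map steps"]
counts = word_count(lines)
top = top_words(counts, 2)
```

Chaining `top_words` onto `word_count` is a small instance of stages arranged in sequence; in a real pipeline each stage would run as a distributed job rather than as a local function call.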

Apache Crunch – Scala
● Scrunch is a Scala wrapper for Apache Crunch
● Reduces the amount of code
● Supports functional and OO styles
● Uses type inference for Map / Reduce
● Incorporates the Java Materialize functionality
● Includes a REPL (read-eval-print loop)

Apache Crunch – Joins
● Joins available in Crunch
– Inner and outer joins, like SQL joins
– Likewise left, right, and full joins
– The map-side join is an in-memory join
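The join types above can be illustrated with a small in-memory sketch in plain Python. It is not the Crunch join API: the `join` function, its `how` parameter, the choice to pair values as `(left_value, right_value)` with `None` for a missing side, and the sample tables are all assumptions for illustration. Indexing one table in a dictionary and streaming the other past it is the same basic idea as a map-side join, which works when one side is small enough to hold in memory.

```python
def join(left, right, how="inner"):
    """Join two lists of (key, value) pairs; how is inner, left, right or full."""
    if how not in ("inner", "left", "right", "full"):
        raise ValueError("unknown join type: " + how)

    # Build an in-memory index of the right-hand table.
    index = {}
    for key, value in right:
        index.setdefault(key, []).append(value)

    rows = []
    matched = set()
    # Stream the left-hand table past the index.
    for key, lvalue in left:
        if key in index:
            matched.add(key)
            rows.extend((key, (lvalue, rvalue)) for rvalue in index[key])
        elif how in ("left", "full"):
            rows.append((key, (lvalue, None)))

    # Right and full joins also keep right-hand rows that found no partner.
    if how in ("right", "full"):
        for key, rvalue in right:
            if key not in matched:
                rows.append((key, (None, rvalue)))
    return rows


# Hypothetical sample tables keyed by user id.
users = [(1, "ann"), (2, "bob"), (3, "cy")]
orders = [(1, "book"), (1, "pen"), (4, "ink")]

inner = join(users, orders, "inner")
left_join = join(users, orders, "left")
right_join = join(users, orders, "right")
full = join(users, orders, "full")
```

Here the inner join keeps only user 1, the left join adds users 2 and 3 with no order, the right join adds order 4 with no user, and the full join keeps all of them.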

Apache Crunch – Performance
● A lightweight API that runs efficiently
● Crunch is a thin veneer on top of MapReduce
● Two implementations available
– Hadoop Writables
– Avro
● The Avro implementation is much faster

Apache Crunch – API
● Data Model
– Pipeline
– MRPipeline
– MemPipeline
– PCollection
– PTable
– PGroupedTable
– Source
– Target
– Emitter
– PType
● Operators
– DoFn
– CombineFn
– FilterFn
– Joins
– Cartesian
– Sort
– Secondary Sort
– PObject
– BloomFilters
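To give a feel for how a few of these names fit together, here is a toy analogue in plain Python. It is not the Crunch API: the classes below only borrow the names `Emitter`, `DoFn` and `FilterFn`, while `Collection`, `parallel_do`, `materialize`, `SplitWords` and `LongWords` are hypothetical stand-ins invented for this sketch. The idea it tries to show is that a DoFn processes one input at a time and hands zero or more outputs to an emitter, while a FilterFn is a special case that passes through only the inputs it accepts.

```python
class Emitter:
    """Collects the values a function emits (zero, one, or many per input)."""

    def __init__(self):
        self.values = []

    def emit(self, value):
        self.values.append(value)


class DoFn:
    """Base class: subclasses override process(item, emitter)."""

    def process(self, item, emitter):
        raise NotImplementedError


class FilterFn(DoFn):
    """A DoFn that re-emits only the items that accept() approves."""

    def accept(self, item):
        raise NotImplementedError

    def process(self, item, emitter):
        if self.accept(item):
            emitter.emit(item)


class Collection:
    """Hypothetical in-memory stand-in for a distributed collection."""

    def __init__(self, items):
        self.items = list(items)

    def parallel_do(self, fn):
        # Apply the function to every element and gather what it emits.
        emitter = Emitter()
        for item in self.items:
            fn.process(item, emitter)
        return Collection(emitter.values)

    def materialize(self):
        # Hand the contents back to the caller as an ordinary list.
        return list(self.items)


class SplitWords(DoFn):
    """Emits one output per word, so one input can yield many outputs."""

    def process(self, line, emitter):
        for word in line.split():
            emitter.emit(word)


class LongWords(FilterFn):
    """Keeps only words longer than four characters."""

    def accept(self, word):
        return len(word) > 4


lines = Collection(["apache crunch pipelines", "map and reduce"])
words = lines.parallel_do(SplitWords())
long_words = words.parallel_do(LongWords()).materialize()
```

Everything here runs eagerly and in memory, which keeps the sketch short; it says nothing about how a real pipeline plans or distributes the work.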

Contact Us
● Feel free to contact us at
– www.semtech-solutions.co.nz
– info@semtech-solutions.co.nz
● We offer IT project consultancy
● We are happy to hear about your problems
● You pay only for the hours you need to solve your problems

An introduction to Apache Crunch
