An introduction to Apache Crunch

A short introduction to Apache Crunch: what it is, and how it simplifies
the creation of Hadoop pipelines.

Published in: Technology

  1. Apache Crunch
     ● What is it?
     ● How does it work?
     ● Why use it?
     ● Hadoop MapReduce pipelines
     ● Scrunch
     ● Joins
     www.semtech-solutions.co.nz info@semtech-solutions.co.nz
  2. Apache Crunch – Pipeline
     ● Crunch is based on Google's FlumeJava
     ● Provides a Java-based API for M/R pipelines
     ● It uses an MST (multiple serializable type) data model
     ● Good for processing complex data types
     ● Better for “non-tuple” data types, e.g.
       – Images
       – Audio
       – Seismic data
  3. Apache Crunch – Pipeline
     ● What is a MapReduce pipeline?
       – Map
       – Shuffle
       – Reduce
       – Combine
     ● Stages are arranged in sequence and/or in parallel
     ● Potentially very long chains
  4. Apache Crunch – Scala
     ● Scrunch is a Scala wrapper for Apache Crunch
     ● Reduced code
     ● Functional and OO styles
     ● Uses type inference for Map / Reduce
     ● Incorporates Java Materialize functionality
     ● Includes a REPL (read-eval-print loop)
  5. Apache Crunch – Joins
     ● Joins available in Crunch
       – Inner and outer joins, as in SQL
       – Likewise left, right and full joins
       – Map-side join, which is an in-memory join
  6. Apache Crunch – Performance
     ● A lightweight API that runs efficiently
     ● Crunch is a thin veneer on top of MapReduce
     ● Two implementations available
       – Hadoop Writables
       – Avro
     ● Avro implementation much faster
  7. Apache Crunch – API
     ● Data Model
       – Pipeline
       – MRPipeline
       – MemPipeline
       – PCollection
       – PTable
       – PGroupedTable
       – Source
       – Target
       – Emitter
       – PType
     ● Operators
       – DoFn
       – CombineFn
       – FilterFn
       – Joins
       – Cartesian
       – Sort
       – Secondary Sort
       – PObject
       – BloomFilters
  8. Contact Us
     ● Feel free to contact us at
       – www.semtech-solutions.co.nz
       – info@semtech-solutions.co.nz
     ● We offer IT project consultancy
     ● We are happy to hear about your problems
     ● You pay only for the hours you need to solve your problems
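
The Map, Shuffle and Reduce stages listed on slide 3 can be illustrated with a minimal plain-Java sketch. It uses no Hadoop or Crunch classes. The class and method names (MapReduceSketch, map, shuffle, reduce, wordCount) are hypothetical and chosen only for this example, and the Combine stage (a local pre-aggregation before the shuffle) is left out for brevity.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Plain-Java illustration of one Map / Shuffle / Reduce stage (hypothetical names, not the Crunch API). */
final class MapReduceSketch {

    /** Map: turn each input line into (word, 1) pairs. */
    static List<Map.Entry<String, Integer>> map(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            for (String word : line.trim().split("\\s+")) {
                if (!word.isEmpty()) {
                    pairs.add(Map.entry(word, 1));
                }
            }
        }
        return pairs;
    }

    /** Shuffle: group the values of all pairs that share a key. */
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> groups = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            groups.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return groups;
    }

    /** Reduce: collapse each group to a single value, here a sum. */
    static Map<String, Integer> reduce(Map<String, List<Integer>> groups) {
        Map<String, Integer> totals = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : groups.entrySet()) {
            int sum = 0;
            for (int value : group.getValue()) {
                sum += value;
            }
            totals.put(group.getKey(), sum);
        }
        return totals;
    }

    /** Chain the three stages in sequence, as a one-stage pipeline. */
    static Map<String, Integer> wordCount(List<String> lines) {
        return reduce(shuffle(map(lines)));
    }

    public static void main(String[] args) {
        // prints {crunch=1, hadoop=2, on=1, pipelines=1, runs=2}
        System.out.println(wordCount(List.of("crunch runs on hadoop", "hadoop runs pipelines")));
    }
}
```

A longer pipeline would feed the output of one such stage into the map step of the next, which is the kind of chaining the slide describes.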

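The join types on slide 5 can be sketched in the same way. The example below assumes the smaller data set fits in a hash map held in memory while the larger one is streamed past it, which is the general idea behind a map-side join. It is not Crunch's implementation, and the names JoinSketch, innerJoin and leftOuterJoin are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Plain-Java illustration of in-memory joins (hypothetical names, not the Crunch API). */
final class JoinSketch {

    /** Inner join: emit a row only when the key exists on both sides. */
    static List<String> innerJoin(List<Map.Entry<Integer, String>> left, Map<Integer, String> right) {
        List<String> rows = new ArrayList<>();
        for (Map.Entry<Integer, String> l : left) {
            String r = right.get(l.getKey()); // in-memory lookup on the small side
            if (r != null) {
                rows.add(l.getKey() + "," + l.getValue() + "," + r);
            }
        }
        return rows;
    }

    /** Left outer join: keep every left row; a missing right value appears as "null". */
    static List<String> leftOuterJoin(List<Map.Entry<Integer, String>> left, Map<Integer, String> right) {
        List<String> rows = new ArrayList<>();
        for (Map.Entry<Integer, String> l : left) {
            rows.add(l.getKey() + "," + l.getValue() + "," + right.get(l.getKey()));
        }
        return rows;
    }

    public static void main(String[] args) {
        List<Map.Entry<Integer, String>> people =
                List.of(Map.entry(1, "alice"), Map.entry(2, "bob"), Map.entry(3, "carol"));
        Map<Integer, String> countries = Map.of(1, "NZ", 3, "AU");

        System.out.println(innerJoin(people, countries));     // prints [1,alice,NZ, 3,carol,AU]
        System.out.println(leftOuterJoin(people, countries)); // prints [1,alice,NZ, 2,bob,null, 3,carol,AU]
    }
}
```

A right outer join is the same loop with the two sides swapped, and a full outer join additionally emits the right-hand keys that were never matched.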