An introduction to Apache Crunch
 

A short introduction to Apache Crunch. What is it, and how does it simplify and aid the
creation of Hadoop pipelines?

    An introduction to Apache Crunch Presentation Transcript

    • Apache Crunch
      ● What is it?
      ● How does it work?
      ● Why use it?
      ● Hadoop MapReduce pipelines
      ● Scrunch
      ● Joins
      www.semtech-solutions.co.nz info@semtech-solutions.co.nz
    • Apache Crunch – Pipeline
      ● Crunch is based on Google's FlumeJava
      ● Provides a Java-based API for MapReduce (M/R) pipelines
      ● It uses an MST (multiple serializable type) data model
      ● Good for processing complex data types
      ● Better for “non-tuple” data types, e.g.
        – Images
        – Audio
        – Seismic data
    • Apache Crunch – Pipeline
      ● What is a MapReduce pipeline?
        – Map
        – Shuffle
        – Reduce
        – Combine
      ● Stages are arranged in sequence and/or in parallel
      ● Chains can potentially be very long
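
The Map, Shuffle and Reduce stages listed above can be sketched in plain Java. This is only an illustration of the data flow for a word count, not the Crunch or Hadoop API: the class `MapReduceSketch` and its methods are hypothetical names, everything runs in memory in a single process, and the optional Combine stage (a map-side pre-aggregation) is left out for brevity.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java illustration of the Map, Shuffle and Reduce stages.
// NOT the Crunch or Hadoop API; all names here are hypothetical.
class MapReduceSketch {

    // Map stage: turn one input line into (word, 1) pairs.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                pairs.add(Map.entry(word.toLowerCase(), 1));
            }
        }
        return pairs;
    }

    // Shuffle stage: group all values that share a key, sorted by key.
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // Reduce stage: collapse each key's values into a single sum.
    static Map<String, Integer> reduce(Map<String, List<Integer>> grouped) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> entry : grouped.entrySet()) {
            int sum = 0;
            for (int value : entry.getValue()) {
                sum += value;
            }
            counts.put(entry.getKey(), sum);
        }
        return counts;
    }

    // Chain the three stages in sequence, as a single-job pipeline would.
    static Map<String, Integer> wordCount(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            pairs.addAll(map(line));
        }
        return reduce(shuffle(pairs));
    }

    public static void main(String[] args) {
        // prints {crunch=1, hadoop=2, jobs=1, on=1, runs=2}
        System.out.println(wordCount(List.of("crunch runs on hadoop", "hadoop runs jobs")));
    }
}
```

A longer pipeline would feed the output of one such reduce into the map stage of the next job, which is the kind of chaining the slide refers to.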
    • Apache Crunch – Scala
      ● Scrunch is a Scala wrapper for Apache Crunch
      ● Less code to write
      ● Functional and OO styles
      ● Uses type inference for Map / Reduce
      ● Incorporates Java Materialize functionality
      ● Includes a REPL (read-eval-print loop)
    • Apache Crunch – Joins
      ● Joins available in Crunch
        – Inner and outer joins, as in SQL
        – Left, right and full outer joins likewise
        – The map-side join is an in-memory join
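
To make the join semantics above concrete, here is a minimal plain-Java sketch of an inner join and a left outer join over two small keyed lists. It does not use the Crunch join API; `JoinSketch`, `index` and `join` are hypothetical names, and rows are assumed to be two-element `String[]` arrays of (key, value). Holding the smaller side in a hash map, as `index` does, is the general idea behind an in-memory map-side join.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java illustration of join semantics; NOT the Crunch API.
// All class and method names are hypothetical.
class JoinSketch {

    // Index the (smaller) right side in memory by key.
    static Map<String, List<String>> index(List<String[]> right) {
        Map<String, List<String>> byKey = new HashMap<>();
        for (String[] row : right) {
            byKey.computeIfAbsent(row[0], k -> new ArrayList<>()).add(row[1]);
        }
        return byKey;
    }

    // Inner join (keepUnmatchedLeft = false): emit only keys present on both sides.
    // Left outer join (keepUnmatchedLeft = true): also emit left rows without a
    // match, using "null" in place of the missing right value.
    static List<String> join(List<String[]> left, List<String[]> right, boolean keepUnmatchedLeft) {
        Map<String, List<String>> byKey = index(right);
        List<String> out = new ArrayList<>();
        for (String[] row : left) {
            List<String> matches = byKey.get(row[0]);
            if (matches != null) {
                for (String match : matches) {
                    out.add(row[0] + ":" + row[1] + "," + match);
                }
            } else if (keepUnmatchedLeft) {
                out.add(row[0] + ":" + row[1] + ",null");
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String[]> users = List.of(new String[]{"1", "ann"}, new String[]{"2", "bob"});
        List<String[]> orders = List.of(new String[]{"1", "book"}, new String[]{"1", "pen"});
        System.out.println(join(users, orders, false)); // [1:ann,book, 1:ann,pen]
        System.out.println(join(users, orders, true));  // [1:ann,book, 1:ann,pen, 2:bob,null]
    }
}
```

A right outer join is the same operation with the two inputs swapped, and a full outer join additionally emits the right-side rows whose keys never appeared on the left.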
    • Apache Crunch – Performance
      ● A lightweight API that runs efficiently
      ● Crunch is a thin veneer on top of MapReduce
      ● Two implementations available
        – Hadoop Writables
        – Avro
      ● The Avro implementation is much faster
    • Apache Crunch – API
      ● Data Model
        – Pipeline
        – MRPipeline
        – MemPipeline
        – PCollection
        – PTable
        – PGroupedTable
        – Source
        – Target
        – Emitter
        – PType
      ● Operators
        – DoFn
        – CombineFn
        – FilterFn
        – Joins
        – Cartesian
        – Sort
        – Secondary Sort
        – PObject
        – BloomFilters
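
As a rough illustration of how a `DoFn` and an `Emitter` fit together, the sketch below defines simplified stand-ins in plain Java. These are not the real Crunch classes: the nested `DoFn` and `Emitter` types only mimic the general shape of the library's function-plus-emitter pattern, the `parallelDo` helper here works on an in-memory `List` rather than a `PCollection`, and `DoFnSketch` and `splitWords` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified stand-ins that mimic the shape of the DoFn/Emitter pattern.
// NOT the real Crunch classes; everything here is a hypothetical sketch.
class DoFnSketch {

    // Receives the zero or more output values produced for each input.
    interface Emitter<T> {
        void emit(T value);
    }

    // A function applied to every element; it may emit any number of outputs.
    static abstract class DoFn<S, T> {
        abstract void process(S input, Emitter<T> emitter);
    }

    // In-memory stand-in for applying a DoFn to every element of a collection.
    static <S, T> List<T> parallelDo(List<S> input, DoFn<S, T> fn) {
        List<T> out = new ArrayList<>();
        Emitter<T> emitter = out::add;
        for (S item : input) {
            fn.process(item, emitter);
        }
        return out;
    }

    // Example DoFn: split each line into words, emitting one value per word.
    static List<String> splitWords(List<String> lines) {
        return parallelDo(lines, new DoFn<String, String>() {
            @Override
            void process(String line, Emitter<String> emitter) {
                for (String word : line.split("\\s+")) {
                    if (!word.isEmpty()) {
                        emitter.emit(word);
                    }
                }
            }
        });
    }

    public static void main(String[] args) {
        // prints [apache, crunch, hadoop]
        System.out.println(splitWords(List.of("apache crunch", "hadoop")));
    }
}
```

Because the function emits values instead of returning one, a single input can produce zero, one or many outputs, which is what lets the same pattern cover both mapping and filtering.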
    • Contact Us
      ● Feel free to contact us at
        – www.semtech-solutions.co.nz
        – info@semtech-solutions.co.nz
      ● We offer IT project consultancy
      ● We are happy to hear about your problems
      ● You pay only for the hours you need to solve your problems