Indic threads pune12-apache-crunch


The 7th Annual IndicThreads Pune Conference was held on 14-15 December 2012. http://pune12.indicthreads.com/

Published in: Technology
Transcript of "Indic threads pune12-apache-crunch"

  1. Apache Crunch, Rahul Sharma, Apache
  2. Agenda
     • Issues with MapReduce pipelines
     • Solving with Apache Crunch
     • Data Model & Operations
     • System Workflow
     • Examples
     • Questions & Answers
  3. Issues with MapReduce Pipelines
     • Unit testing a pipeline? You must be joking!
     • Can someone tell me where the business logic is?
     • Chain performance?
     • Learn Latin (Pig) first!
  4. 4. Apache Crunch Is a Java library Contains Collections which can excute Parallel operations Lazy evaluation of Collections at runtime Operations merged at runtime to have efficient chains. Available @ http://incubator.apache.org/crunch/ Based on Google FlumeJava paper 4
  5. 5. Apache Crunch Supports Hadoop version 1 and 2-alpha Supports HBase, jdbc etc Works with Writables, Avro, Thrift and proto-buffers Scala varient also exists Integration with R and Clojure in process Archetype exists for creating sample maven project 5
  6. 6. Apache Crunch : Data Model  Pipeline  MRPipeline  MemPipeline  PCollection<T>  PTable<K,V>  PGroupTable<K,V>  Source<T>  Target<T>  Emitter<T> 6  PType<K,V>
  7. 7. Apache Crunch : Operations  DoFn<S,T>  CombineFn<S,T>  FilterFn<T>  Joins  Cartesian  Sort  SecondarySort  PObject<T>  BloomFilters 7
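To make the relationship between a DoFn-style function, an emitter and a filter concrete, here is a plain-Java sketch. The `Toy*` types below are hypothetical stand-ins written for this example and are not the Crunch classes. They mirror only the general shape: a function receives one input plus an emitter and may emit zero or more outputs, and a filter is the special case that emits its input when a predicate accepts it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy driver (hypothetical): applies a function to every element and collects what it emits.
final class ToyOps {
    static <S, T> List<T> parallelDo(List<S> input, ToyDoFn<S, T> fn) {
        List<T> out = new ArrayList<>();
        for (S item : input) {
            fn.process(item, out::add);   // the emitter appends to the output list
        }
        return out;
    }

    public static void main(String[] args) {
        ToyDoFn<String, String> split = (line, emitter) -> {
            for (String word : line.split(" ")) {
                emitter.emit(word);       // one input line, many output words
            }
        };
        ToyFilterFn<String> longWords = word -> word.length() > 2;

        List<String> words = parallelDo(Arrays.asList("to be or", "not to be"), split);
        System.out.println(parallelDo(words, longWords));
    }
}

// Receives the outputs of a function (stand-in for an emitter).
interface ToyEmitter<T> {
    void emit(T value);
}

// One input in, zero or more outputs via the emitter (stand-in for a DoFn).
interface ToyDoFn<S, T> {
    void process(S input, ToyEmitter<T> emitter);
}

// A filter is a DoFn that emits the input unchanged when accept() is true.
interface ToyFilterFn<T> extends ToyDoFn<T, T> {
    boolean accept(T input);

    @Override
    default void process(T input, ToyEmitter<T> emitter) {
        if (accept(input)) {
            emitter.emit(input);
        }
    }
}
```

Because the function writes to an emitter instead of returning a value, the same interface covers one-to-one, one-to-many and filtering operations.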
  8. Apache Crunch: System Workflow
     • Construct a pipeline
     • Pipeline.done()
     • [Diagram: Map, GBK and Reduce stages leading to Output]
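The Map, GBK and Reduce stages in the workflow diagram can be walked through with an in-memory word count in plain Java. This is a hedged illustration rather than Crunch code: `ToyWorkflow` and its method names are made up for the example, GBK is assumed to mean group-by-key, and nothing here runs on Hadoop.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy, in-memory model of the Map -> GBK -> Reduce stages (hypothetical names).
final class ToyWorkflow {

    // Map stage: split each line into words and emit one (word, 1) pair per word.
    static List<Map.Entry<String, Long>> mapStage(List<String> lines) {
        List<Map.Entry<String, Long>> pairs = new ArrayList<>();
        for (String line : lines) {
            for (String word : line.trim().split("\\s+")) {
                if (!word.isEmpty()) {
                    pairs.add(new SimpleEntry<>(word, 1L));
                }
            }
        }
        return pairs;
    }

    // GBK stage: gather all values that share a key, ordered by key.
    static Map<String, List<Long>> groupByKey(List<Map.Entry<String, Long>> pairs) {
        Map<String, List<Long>> grouped = new TreeMap<>();
        for (Map.Entry<String, Long> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // Reduce stage: collapse each key's values into a single sum.
    static Map<String, Long> reduceStage(Map<String, List<Long>> grouped) {
        Map<String, Long> counts = new TreeMap<>();
        for (Map.Entry<String, List<Long>> entry : grouped.entrySet()) {
            long sum = 0;
            for (long value : entry.getValue()) {
                sum += value;
            }
            counts.put(entry.getKey(), sum);
        }
        return counts;
    }

    // Chains the three stages, in the order shown in the diagram.
    static Map<String, Long> run(List<String> lines) {
        return reduceStage(groupByKey(mapStage(lines)));
    }

    public static void main(String[] args) {
        System.out.println(run(Arrays.asList("a b a", "b a"))); // prints {a=3, b=2}
    }
}
```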
  9. Apache Crunch: Examples
     • WordCount example
     • Avro example
     • Sorting example
     • SecondarySort
     • Join example
     • BloomFilters
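The slides only name the Bloom filter example, so the following is a generic plain-Java sketch of the data structure itself, not the example from the talk and not a Crunch class. `ToyBloomFilter`, its sizing parameters and its hashing scheme are assumptions chosen for brevity. A Bloom filter can answer "definitely absent" or "possibly present", which is why such filters are commonly used to discard records cheaply before a more expensive step.

```java
import java.util.BitSet;

// Minimal Bloom filter sketch (hypothetical class, simplistic hashing).
final class ToyBloomFilter {
    private final BitSet bits;
    private final int size;     // number of bits in the filter
    private final int hashes;   // number of bit positions set per key

    ToyBloomFilter(int size, int hashes) {
        this.bits = new BitSet(size);
        this.size = size;
        this.hashes = hashes;
    }

    // Derives the i-th bit position from two hash values (double hashing).
    private int index(String key, int i) {
        int h1 = key.hashCode();
        int h2 = (h1 >>> 16) | 1;                 // forced odd so positions vary with i
        return Math.floorMod(h1 + i * h2, size);  // always in [0, size)
    }

    void add(String key) {
        for (int i = 0; i < hashes; i++) {
            bits.set(index(key, i));
        }
    }

    // false means the key was never added; true means it may have been added.
    boolean mightContain(String key) {
        for (int i = 0; i < hashes; i++) {
            if (!bits.get(index(key, i))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        ToyBloomFilter filter = new ToyBloomFilter(1024, 3);
        filter.add("user-1");
        System.out.println(filter.mightContain("user-1")); // prints true
    }
}
```

False positives are possible and false negatives are not, so a key that was added is always reported as possibly present.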
  10. Contact
     • Write to me: rsharma@apache.org
     • Example src: http://github.com/rahul0208
     • Blog: devlearnings.wordpress.com