Indic threads pune12-apache-crunch

The 7th Annual IndicThreads Pune Conference was held on 14-15 December 2012. http://pune12.indicthreads.com/

Transcript

  • 1. Apache Crunch: Rahul Sharma, Apache
  • 2. Agenda: Issues with MapReduce pipelines; Solving with Apache Crunch; Data Model & Operations; System Workflow; Examples; Questions & Answers
  • 3. Issues with MapReduce Pipelines: Unit testing a pipeline? You must be joking! Can someone tell me where the business logic is? Chain performance? Learn Latin (Pig) first!
  • 4. Apache Crunch Is a Java library Contains Collections which can excute Parallel operations Lazy evaluation of Collections at runtime Operations merged at runtime to have efficient chains. Available @ http://incubator.apache.org/crunch/ Based on Google FlumeJava paper 4
  • 5. Apache Crunch Supports Hadoop version 1 and 2-alpha Supports HBase, jdbc etc Works with Writables, Avro, Thrift and proto-buffers Scala varient also exists Integration with R and Clojure in process Archetype exists for creating sample maven project 5
  • 6. Apache Crunch: Data Model: Pipeline; MRPipeline; MemPipeline; PCollection<T>; PTable<K,V>; PGroupedTable<K,V>; Source<T>; Target<T>; Emitter<T>; PType<T>
  • 7. Apache Crunch: Operations: DoFn<S,T>; CombineFn<S,T>; FilterFn<T>; Joins; Cartesian; Sort; SecondarySort; PObject<T>; BloomFilters
  • 8. Apache Crunch: System Workflow: Construct a pipeline, then call Pipeline.done(). [Diagram: Map stages feed GBK (group-by-key) stages, followed by Reduce stages and Output]
  • 9. Apache Crunch: Examples: WordCount example; Avro example; Sorting example; SecondarySort; Join example; BloomFilters
  • 10. Write to me: rsharma@apache.org; Example src: http://github.com/rahul0208; Blog: devlearnings.wordpress.com
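
Slide 4 says that Crunch evaluates collections lazily and merges operations at runtime into efficient chains. The plain-Java sketch below illustrates only that general idea, under loud assumptions: `LazyCollection`, `of`, `map`, and `materialize` are hypothetical names invented here and are not part of the Crunch API, and the real library plans MapReduce jobs rather than looping over an in-memory list.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

// Hypothetical illustration only: this class is NOT part of the Crunch API.
final class LazyCollection<S, T> {
    private final List<S> source;        // untouched input data
    private final Function<S, T> fused;  // every recorded step, composed into one function

    private LazyCollection(List<S> source, Function<S, T> fused) {
        this.source = source;
        this.fused = fused;
    }

    // Wraps a list without reading it.
    static <S> LazyCollection<S, S> of(List<S> source) {
        return new LazyCollection<S, S>(source, Function.<S>identity());
    }

    // Records a step without running it; the step is fused onto the existing chain.
    <R> LazyCollection<S, R> map(Function<? super T, ? extends R> step) {
        Function<S, R> composed = item -> step.apply(fused.apply(item));
        return new LazyCollection<S, R>(source, composed);
    }

    // Runs the fused chain in a single pass over the source.
    List<T> materialize() {
        List<T> out = new ArrayList<>(source.size());
        for (S item : source) {
            out.add(fused.apply(item));
        }
        return out;
    }

    public static void main(String[] args) {
        LazyCollection<String, Integer> lengths =
            LazyCollection.of(Arrays.asList("apache", "crunch"))
                .map(String::trim)
                .map(String::length);
        System.out.println(lengths.materialize()); // prints [6, 6]
    }
}
```

Nothing is computed while the chain is being built. Both `map` steps run together, once per element, only when `materialize()` is called, which loosely mirrors how work is deferred until `Pipeline.done()` on slide 8.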