Cascading

4,494 views

Published on

High level overview of Cascading.

Published in: Technology, Education
0 Comments
13 Likes
Statistics
Notes
  • Be the first to comment

No Downloads
Views
Total views
4,494
On SlideShare
0
From Embeds
0
Number of Embeds
13
Actions
Shares
0
Downloads
136
Comments
0
Likes
13
Embeds 0
No embeds

No notes for slide
















  • Cascading

    1. 1. Cascading Nathan Marz BackType
    2. 2. What is Cascading? Cascading is a Java library that makes development of complex Hadoop MapReduce workflows easy
    3. 3. Why Hadoop? • Process large amounts of data in a scalable, fault-tolerant way
    4. 4. Why Cascading? Tool How you feel Hadoop MapReduce Cascading
    5. 5. Tuples Cascading represents all data as “Tuples” (“the man sat” , 25) (“hello dolly” , 42) (“say hello” ,1 ) (“the woman sat”, 10)
    6. 6. Tuples Tuples are named, ordered fields [“sentence”, “value”] (“the man sat” , 25) (“hello dolly” , 42) (“say hello” ,1 ) (“the woman sat”, 10)
    7. 7. Flow A flow is a sequence of manipulations on pipes of tuple streams • Flow compiles to one or more MapReduce jobs • Inputs and outputs called “Taps”. • Each Tap produces or receives a pipe of tuples with the same format • Multiple inputs, multiple outputs
    8. 8. Example [“sentence”, “value”] [“word”, “sum”] Get the sum of the values for each word
    9. 9. Example [“sentence”, “value”] Split(“sentence”) -> “word” [“word”, “value”] GroupBy(“word”) [“word”, list<[“value”]>] Sum(“value”) -> “sum” [“word”, “sum”]
    10. 10. Example Split(“sentence”) -> “word” [“sentence”, “value”] [“word”, “value”] (“the” , 25) (“the man sat” , 25) (“man” , 25) (“hello dolly” , 42) (“sat” , 25) (“say hello” ,1 ) (“hello” , 42) (“the woman sat”, 10) (“dolly” , 42) (“say” ,1 ) (“hello” , 1 ) (“the” , 10) (“woman” , 10) (“sat” , 10)
    11. 11. Example GroupBy(“word”) [“word”, “value”] [“word”, list<[“value”]>] (“the” , 25) (“man” , 25) (“the” , [25, 10]) (“sat” , 25) (“man” , [25] ) (“hello” , 42) (“sat” , [25, 10]) (“dolly” , 42) (“hello” , [42, 1] ) (“say” ,1 ) (“dolly” , [42] ) (“hello” , 1 ) (“say” , [1] ) (“the” , 10) (“woman” , [10] ) (“woman” , 10) (“sat” , 10)
    12. 12. Example Sum(“value”) -> “sum” [“word”, list<[“value”]>] [“word”, “sum”] (“the” , [25, 10]) (“the” , 35) (“man” , [25] ) (“man” , 25) (“sat” , [25, 10]) (“sat” , 35) (“hello” , [42, 1] ) (“hello” , 43) (“dolly” , [42] ) (“dolly” , 42) (“say” , [1] ) (“say” ,1 ) (“woman” , [10] ) (“woman” , 10)
    13. 13. More functionality • Inner and outer joins natively supported • Seamlessly branch and merge pipes of tuples • Integrate diverse data sources
    14. 14. Why not Pig? • Pig is a custom language for writing MapReduce workflows • Because it’s a custom language, intermixing “plain logic” in between flows is painful • Not nearly as flexible as Cascading for custom needs
    15. 15. Learn more • Tutorial: http://blog.rapleaf.com/dev/?p=33 • Website: http://www.cascading.org
    16. 16. Questions?

    ×