Cascading
 Nathan Marz
  BackType
High level overview of Cascading.

Published in: Technology, Education
  • Transcript of "Cascading"

    1. Cascading. Nathan Marz, BackType
    2. What is Cascading? Cascading is a Java library that makes development of complex Hadoop MapReduce workflows easy
    3. Why Hadoop? • Process large amounts of data in a scalable, fault-tolerant way
    4. Why Cascading? Table with columns “Tool” and “How you feel”; rows: Hadoop MapReduce, Cascading
    5. Tuples. Cascading represents all data as “Tuples”:
       (“the man sat”, 25) (“hello dolly”, 42) (“say hello”, 1) (“the woman sat”, 10)
    6. Tuples. Tuples are named, ordered fields. Fields: [“sentence”, “value”]
       (“the man sat”, 25) (“hello dolly”, 42) (“say hello”, 1) (“the woman sat”, 10)
    7. Flow. A flow is a sequence of manipulations on pipes of tuple streams • Flow compiles to one or more MapReduce jobs • Inputs and outputs called “Taps” • Each Tap produces or receives a pipe of tuples with the same format • Multiple inputs, multiple outputs
    8. Example. Input [“sentence”, “value”], output [“word”, “sum”]: get the sum of the values for each word
    9. Example. Start with [“sentence”, “value”]; Split(“sentence”) -> “word” gives [“word”, “value”]; GroupBy(“word”) gives [“word”, list<[“value”]>]; Sum(“value”) -> “sum” gives [“word”, “sum”]
    10. Example. Split(“sentence”) -> “word”
        Input [“sentence”, “value”]: (“the man sat”, 25) (“hello dolly”, 42) (“say hello”, 1) (“the woman sat”, 10)
        Output [“word”, “value”]: (“the”, 25) (“man”, 25) (“sat”, 25) (“hello”, 42) (“dolly”, 42) (“say”, 1) (“hello”, 1) (“the”, 10) (“woman”, 10) (“sat”, 10)
    11. Example. GroupBy(“word”)
        Input [“word”, “value”]: (“the”, 25) (“man”, 25) (“sat”, 25) (“hello”, 42) (“dolly”, 42) (“say”, 1) (“hello”, 1) (“the”, 10) (“woman”, 10) (“sat”, 10)
        Output [“word”, list<[“value”]>]: (“the”, [25, 10]) (“man”, [25]) (“sat”, [25, 10]) (“hello”, [42, 1]) (“dolly”, [42]) (“say”, [1]) (“woman”, [10])
    12. Example. Sum(“value”) -> “sum”
        Input [“word”, list<[“value”]>]: (“the”, [25, 10]) (“man”, [25]) (“sat”, [25, 10]) (“hello”, [42, 1]) (“dolly”, [42]) (“say”, [1]) (“woman”, [10])
        Output [“word”, “sum”]: (“the”, 35) (“man”, 25) (“sat”, 35) (“hello”, 43) (“dolly”, 42) (“say”, 1) (“woman”, 10)
    13. More functionality • Inner and outer joins natively supported • Seamlessly branch and merge pipes of tuples • Integrate diverse data sources
    14. Why not Pig? • Pig is a custom language for writing MapReduce workflows • Because it’s a custom language, intermixing “plain logic” in between flows is painful • Not nearly as flexible as Cascading for custom needs
    15. Learn more • Tutorial: http://blog.rapleaf.com/dev/?p=33 • Website: http://www.cascading.org
    16. Questions?
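
The Split, GroupBy, and Sum steps from the Example slides can be traced with a small in-memory sketch. This is plain Java over standard collections, not the Cascading API: the class name `WordSumSketch`, the `Tuple` helper, and the method names are hypothetical and exist only to mirror the slide labels. It assumes sentences are split on whitespace, which the slides imply but do not state.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: mimics the slide example in memory, without Hadoop or Cascading.
public class WordSumSketch {

    /** A two-field tuple: a text field ("sentence" or "word") and a numeric "value". */
    public static final class Tuple {
        public final String text;
        public final int value;

        public Tuple(String text, int value) {
            this.text = text;
            this.value = value;
        }
    }

    /** The four sample input tuples shown on the slides. */
    public static List<Tuple> sampleInput() {
        return Arrays.asList(
                new Tuple("the man sat", 25),
                new Tuple("hello dolly", 42),
                new Tuple("say hello", 1),
                new Tuple("the woman sat", 10));
    }

    /** Split("sentence") -> "word": one (word, value) tuple per word of each sentence. */
    public static List<Tuple> split(List<Tuple> sentences) {
        List<Tuple> words = new ArrayList<>();
        for (Tuple t : sentences) {
            for (String word : t.text.trim().split("\\s+")) {
                if (!word.isEmpty()) {
                    words.add(new Tuple(word, t.value));
                }
            }
        }
        return words;
    }

    /** GroupBy("word"): collect the values for each word, keeping first-seen word order. */
    public static Map<String, List<Integer>> groupBy(List<Tuple> words) {
        Map<String, List<Integer>> groups = new LinkedHashMap<>();
        for (Tuple t : words) {
            groups.computeIfAbsent(t.text, k -> new ArrayList<>()).add(t.value);
        }
        return groups;
    }

    /** Sum("value") -> "sum": reduce each word's list of values to a single total. */
    public static Map<String, Integer> sum(Map<String, List<Integer>> groups) {
        Map<String, Integer> sums = new LinkedHashMap<>();
        for (Map.Entry<String, List<Integer>> e : groups.entrySet()) {
            int total = 0;
            for (int v : e.getValue()) {
                total += v;
            }
            sums.put(e.getKey(), total);
        }
        return sums;
    }

    public static void main(String[] args) {
        Map<String, Integer> sums = sum(groupBy(split(sampleInput())));
        // Print one (word, sum) tuple per line.
        sums.forEach((word, total) -> System.out.println("(" + word + ", " + total + ")"));
    }
}
```

Each method takes the previous step's output, much as each operation on the slides consumes one tuple stream and produces another. In a real flow the grouping would be distributed across MapReduce jobs rather than held in a single map.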