PyCascading for Intuitive Flow Processing with Hadoop (gabor szabo)

Slides for a talk given by Gabor Szabo at PyData Silicon Valley 2013

Published in: Technology

Transcript of "PyCascading for Intuitive Flow Processing with Hadoop (gabor szabo)"

  1. PyCascading for Intuitive Flow Processing with Hadoop. Gabor Szabo, Senior Data Scientist, Twitter, Inc.
  2. Outline
     • Basic concepts in the Hadoop ecosystem, with an example
       • Hadoop
       • Cascading
       • PyCascading
     • Essential PyCascading operations
     • PyCascading by example: discovering main interests among friends
     • Miscellaneous remarks, caveats
  3. Hadoop: Architecture
     • The Hadoop file system (HDFS)
       • Large, distributed file system
       • Thousands of nodes, PBs of data
       • The storage layer for Apache Hive, HBase, ...
     • MapReduce
       • Idea: ship the code to the data, not the other way around
       • Do aggregations locally
       • Iterate on the results
       • Map phase: process the input records, emit a key & a value
       • Reduce phase: collect records with the same key from Map, emit a new (aggregate) record
     • Fault tolerance
       • Both storage and compute are fault tolerant (redundancy, replication, restart)
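The Map and Reduce phases described on this slide can be sketched in a few lines of plain Python. This is only an in-memory illustration of the idea, not Hadoop code: the function names `map_phase`, `shuffle`, and `reduce_phase` are made up for this sketch, and a real cluster would run the phases on many nodes.

```python
from collections import defaultdict


def map_phase(record):
    """Map: process one input record and emit (key, value) pairs."""
    for word in record.split():
        yield word.lower(), 1


def shuffle(pairs):
    """Collect the values that share a key (done by the framework between the phases)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: turn all values collected for one key into one aggregate record."""
    return key, sum(values)


def word_count(records):
    pairs = (pair for record in records for pair in map_phase(record))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


counts = word_count(["the quick brown fox", "the lazy dog"])
```

Here `counts` maps every distinct word to the number of times it occurs across both input records.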
  4. Hadoop: In practice
     • Language
       • Java
     • Need to think in MapReduce
       • Hard to translate the problem to MR
       • Hard to maintain and make changes in the topology
     • Best used for
       • Archiving (HDFS)
       • Batch processing (MR)
  5. Cascading. The Cascading way: flow processing
     • Cascading is built on top of Hadoop
     • Introduces semi-structured flow processing of tuples with typed fields
     • Analogy: data is flowing in pipes
       • Input comes from source taps
       • Output goes to sink taps
       • Data is reshaped in the pipes by different operations
     • Builds a DAG from the job, and optimizes the topology to minimize the number of MapReduce phases
     • The pipes analogy is more intuitive to use than raw MapReduce
  6-12. Flow processing in (Py)Cascading
     • [Diagram, source: cascading.org. Labels added step by step: Source tap, Filter, “Map”, Join, Group & aggregate, Sink tap]
  13. PyCascading: Design
     • Built on top of Cascading
     • Uses the Jython 2.5 interpreter
     • Everything in Python
       • Building the pipelines
       • User-defined functions that operate on data
     • Completely hides Java if the user wants it to
       • However, due to the strong ties with Java, it’s worth knowing the Cascading classes
  14-17. Example: as always, WordCount. Writing MapReduce jobs by hand is hard
     • WordCount: split the input file into words, and count how many times each word occurs
     • [Code listing not captured in the transcript. Callouts on the listing: M, R, Support code]
  18-21. Cascading WordCount
     • Still in Java, but algorithm design is easier
     • Need to write separate classes for each user-defined operation
     • [Code listing not captured in the transcript. Callouts on the listing: M, G, Support code]
  22-25. word_count.py: PyCascading minimizes programmer effort
     • [Code listing not captured in the transcript. Callouts on the listing: Map, G, Support code]
  26. PyCascading workflow: The basics of writing a Cascading flow in Python
     • There is one main script that must contain a main() function
     • We build the pipeline in main()
       • Pipes are joined together with the pipe operator |
       • Pipe ends may be assigned to variables and reused (split)
     • All the user-defined operations are Python functions
       • Globally or locally-scoped
     • Then submit the pipeline to be run to PyCascading
     • The main Python script will be executed on each of the workers when they spin up to import global declarations
       • This is the reason we have to have main(), so that it won’t be executed again
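To illustrate how a pipe operator `|` can chain operations and how a pipe end can be reused, here is a minimal sketch in plain Python. The `Stage` class and the helper functions are hypothetical stand-ins written for this note; they are not the PyCascading API and run in memory rather than on Hadoop.

```python
class Stage(object):
    """Hypothetical pipeline stage wrapping a function from records to records."""

    def __init__(self, func):
        self.func = func

    def __or__(self, other):
        # stage_a | stage_b builds a new stage that feeds a's output into b.
        return Stage(lambda records: other.func(self.func(records)))

    def run(self, records):
        return list(self.func(records))


def split_words(lines):
    for line in lines:
        for word in line.split():
            yield word


def keep_long(words):
    return (word for word in words if len(word) > 3)


def keep_short(words):
    return (word for word in words if len(word) <= 3)


def main():
    # The pipeline is built inside main(), so importing this module
    # (for its global function definitions) does not build or run it again.
    words = Stage(split_words)            # a pipe end assigned to a variable
    long_words = words | Stage(keep_long)   # first branch of the split
    short_words = words | Stage(keep_short)  # second branch reuses the same end
    data = ["ship the code to the data"]
    return long_words.run(data), short_words.run(data)


if __name__ == "__main__":
    print(main())
```

Keeping only definitions at module level mirrors the reason given on the slide for requiring `main()`: a worker that imports the script picks up the functions without re-running the pipeline construction.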
  27. PyCascading by example: Walk through the operations using an example
     • Data
       • A friendship network in long format
       • List of interests per user, ordered by decreasing importance
     • Question
       • For every user, find which main interest among the friends occurs the most
     • Workflow
       • Take the most important interest per user, and join it to the friendship table
       • For each user, count how many times each interest appears, and select the one with the maximum count
     • [Example tables: Friendship network, User interests]
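The workflow on this slide can be traced on a toy data set in plain Python before looking at the PyCascading version. The data and the function name `main_interest_among_friends` are invented for this sketch; it only mirrors the three steps listed above and says nothing about how the talk's script is written.

```python
from collections import Counter, defaultdict

# Toy inputs, made up for illustration.
# Friendship network in long format: one (user, friend) pair per row.
friendships = [("alice", "bob"), ("alice", "carol"), ("alice", "dave"),
               ("bob", "alice")]
# Interests per user, ordered by decreasing importance.
interests = {
    "alice": ["chess", "tennis"],
    "bob": ["tennis", "chess"],
    "carol": ["tennis"],
    "dave": ["hiking", "tennis"],
}


def main_interest_among_friends(friendships, interests):
    # Step 1: take the most important (first) interest per user.
    first_interest = {user: ranked[0] for user, ranked in interests.items() if ranked}

    # Step 2: join it to the friendship table on the friend column and
    # count how many times each interest appears for every user.
    counts = defaultdict(Counter)
    for user, friend in friendships:
        if friend in first_interest:  # inner join: skip friends without interests
            counts[user][first_interest[friend]] += 1

    # Step 3: for each user, select the interest with the maximum count.
    return {user: counter.most_common(1)[0] for user, counter in counts.items()}


result = main_interest_among_friends(friendships, interests)
```

For this input, `result` maps each user to an `(interest, count)` pair describing the most frequent main interest among that user's friends.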
  28. The full source
     • [Code listing not captured in the transcript]
  29-30. Defining the inputs
     • [Code listing not captured in the transcript]
     • Callout: Need to use Java types since this is a Cascading call
  31-35. Shaping the fields: “mapping”
     • [Code listing not captured in the transcript]
     • Callout: Replace the interest field with the result yielded by take_first, and call it interest
     • Callout: Decorators annotate user-defined functions
     • Callout: tuple is a Cascading record type
     • Callout: We can return any number of new records with yield
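Since the listing for this slide is not in the transcript, the following plain-Python sketch only imitates the ideas in the callouts: a decorator that annotates a user-defined function, a generator that may yield any number of records, and a step that replaces a field with what the function yields. The names `udf_like` and `replace_field` are hypothetical, the comma-separated input format is an assumption, and only `take_first` is a name taken from the slide.

```python
def udf_like(produces):
    """Hypothetical decorator: tags a function with the field names it yields."""
    def annotate(func):
        func.produces = list(produces)
        return func
    return annotate


@udf_like(produces=["interest"])
def take_first(record):
    # Assumption: interests arrive as one comma-separated string,
    # ordered by decreasing importance.
    ranked = record["interest"].split(",")
    if ranked[0].strip():
        yield [ranked[0].strip()]  # users without interests yield nothing


def replace_field(records, func, field):
    """Replace `field` in each record with the values yielded by `func`.

    A record for which `func` yields nothing is dropped; one for which it
    yields several times fans out into several records.
    """
    for record in records:
        for values in func(record):
            updated = dict(record)
            updated.pop(field)
            updated.update(zip(func.produces, values))
            yield updated


users = [{"user": "alice", "interest": "chess, tennis"},
         {"user": "bob", "interest": ""}]
first_interests = list(replace_field(users, take_first, "interest"))
```

Because `take_first` is a generator, it can also act as a filter: the record for `bob` disappears from the output since nothing is yielded for it.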
  36-37. Checkpointing
     • [Code listing not captured in the transcript]
     • Callout: Take the data EITHER from the cache (ID: “users_first_interests”), OR generate it if it’s not cached yet
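The "either from the cache, or generate it" behavior can be sketched generically as follows. This is an assumption-laden illustration using local pickle files; the function `cached` and its parameters are hypothetical and do not reflect how PyCascading stores checkpoints. Only the cache ID comes from the slide.

```python
import os
import pickle
import tempfile


def cached(cache_id, generate, cache_dir):
    """Return the data stored under cache_id, or generate and store it."""
    path = os.path.join(cache_dir, cache_id + ".pkl")
    if os.path.exists(path):
        with open(path, "rb") as handle:
            return pickle.load(handle)
    data = generate()
    with open(path, "wb") as handle:
        pickle.dump(data, handle)
    return data


calls = []


def compute_first_interests():
    calls.append(1)  # record each time the expensive step really runs
    return {"alice": "chess", "bob": "tennis"}


cache_dir = tempfile.mkdtemp()
first = cached("users_first_interests", compute_first_interests, cache_dir)
second = cached("users_first_interests", compute_first_interests, cache_dir)
```

The second call finds the file written by the first and skips the computation, which is the point of a checkpoint in a long multi-stage flow.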
  38-42. Grouping & aggregating
     • [Code listing not captured in the transcript]
     • Callout: Group by user, and call the two result fields user and friend
     • Callout: Define a UDF that takes the grouping fields, a tuple iterator, and optional arguments
     • Callout: Use the .get getter with the field name
     • Callout: Yield any number of results
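The shape of such an aggregating function (grouping fields, an iterator over the group's tuples, optional arguments, any number of yielded results) can be imitated in plain Python. The helper `group_by` and the aggregator `first_friends` below are hypothetical, records are ordinary dicts, and the aggregation itself (keeping at most `limit` friends per user) is invented for illustration, since the original listing is not available.

```python
from itertools import groupby
from operator import itemgetter


def group_by(records, field, aggregator, *args):
    """Group records by `field` and pass each group to `aggregator`."""
    ordered = sorted(records, key=itemgetter(field))
    for key, rows in groupby(ordered, key=itemgetter(field)):
        for result in aggregator({field: key}, rows, *args):
            yield result


def first_friends(group, tuples, limit):
    """Yield at most `limit` (user, friend) records for the user of this group."""
    for index, row in enumerate(tuples):
        if index >= limit:
            break
        # Fields are read with .get and the field name.
        yield {"user": group.get("user"), "friend": row.get("friend")}


rows = [{"user": "bob", "friend": "alice"},
        {"user": "alice", "friend": "bob"},
        {"user": "alice", "friend": "carol"},
        {"user": "alice", "friend": "dave"}]
limited = list(group_by(rows, "user", first_friends, 2))
```

The extra positional argument `2` is forwarded to the aggregator, playing the role of the "optional arguments" mentioned in the callout.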
  43-53. Joins & field algebra
     • [Code listing not captured in the transcript]
     • Callout: Join on the friend field from the 1st stream, and on the user field from the 2nd
     • Callout: No field name overlap is allowed
     • Callout: Keep certain fields & rename
     • Callout: Use built-in aggregators where possible
     • Callout: Save this stream!
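To show why renaming is needed before a join when field names may not overlap, here is a small plain-Python sketch. `rename` and `inner_join` are hypothetical helpers, the overlap check merely imitates the rule stated in the callout, and the field name `friend_id` is invented for the example.

```python
def rename(records, mapping):
    """Rename fields according to `mapping`, leaving other fields untouched."""
    for record in records:
        yield {mapping.get(name, name): value for name, value in record.items()}


def inner_join(left, right, left_field, right_field):
    """Join two record streams; refuse records whose field names overlap."""
    index = {}
    for row in right:
        index.setdefault(row[right_field], []).append(row)
    for row in left:
        for match in index.get(row[left_field], []):
            overlap = set(row) & set(match)
            if overlap:
                raise ValueError("field names overlap: %s" % sorted(overlap))
            joined = dict(row)
            joined.update(match)
            yield joined


friends = [{"user": "alice", "friend": "bob"}]
first_interests = [{"user": "bob", "interest": "tennis"}]

# Both streams have a "user" field, so the second one is renamed first.
renamed = rename(first_interests, {"user": "friend_id"})
joined = list(inner_join(friends, renamed, "friend", "friend_id"))

# Keep only the fields needed downstream.
kept = [{"user": row["user"], "interest": row["interest"]} for row in joined]
```

Joining `friends` with `first_interests` directly would raise the `ValueError`, because `user` appears in both streams.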
  54-56. Split & aggregate more
     • [Code listing not captured in the transcript]
     • Callout: Split the stream to group by user, and find the interest that appears most by count
     • Callout: Once the data flow is built, submit and run it!
  57. Running the script: Local or remote runs
     • Cascading flows can run locally or on HDFS
     • Local run for testing
       • local_run.sh recommendation.py
     • Remote run in production
       • remote_deploy.sh -s server.com recommendation.py
     • The example had 5 MR stages
       • Although the problem was simple, doing it by hand would have been inconvenient
     • [Diagram labels: Friendship network, User interests, friends_interests_counts, recommendations]
  58. Some remarks
     • Benefits
       • Can use any Java class
       • Can be mixed with Java code
       • Can use Python libraries
     • Caveats
       • Only pure Python code can be used, no compiled C (numpy, scipy)
         • But with streaming it’s possible to execute a CPython interpreter
       • Some idiosyncrasies because of Jython’s representation of basic types
         • Strings are OK, but Python integers are represented as java.math.BigInteger, so explicit conversion is needed before yielding (joins!)
  59. Contact
     • http://github.org/twitter/pycascading
     • http://pycascading.org
     • @gaborjszabo
     • gabor@twitter.com
     • Javadoc: http://cascading.org
     • Other Cascading-based wrappers
       • Scalding (Scala), Cascalog (Clojure), Cascading-JRuby (Ruby)
  60. Implementation details: Challenges due to an interpreted language
     • We need to make code available on all workers
       • Java bytecode is easy, same .jar everywhere
       • Although Jython represents Python functions as classes, they cannot be serialized
       • We need to start an interpreter on every worker
       • The Python source of the UDFs is retrieved and shipped in the .jar
       • Different Hadoop distributions explode the .jar differently, need to use the Hadoop distributed cache