Cascalog at Strange Loop

Presentation of Cascalog at Strange Loop on October 15th, 2010.

http://github.com/nathanmarz/cascalog

Published in: Technology

  1. Cascalog: Data processing on Hadoop without the hassle. Nathan Marz, BackType, @nathanmarz
  2. What is Cascalog? Layers of abstraction: Cascalog (variables and logic), Cascading (tuples, data workflows), MapReduce (key/value pairs, aggregation)
  3. Cascalog’s components: Cascading (the job execution engine) + Datalog (basis of the API design) + Clojure (the host programming language)
  4. Clojure: • General purpose programming language • Dialect of Lisp that compiles to Java bytecode
  5. Clojure: • “Programmable programming language”: easy to build Domain Specific Languages (DSLs) in Clojure
  6. Clojure examples (Clojure code → result): (+ 1 2 3) → 6; (> 20 18) → true; (defn incr [x] (+ 1 x)) (incr 3) → 4
  7. Cascalog basics: The “age” dataset
  8. Cascalog basics
  9. Cascalog basics: Define and execute a query
  10. Cascalog basics: Where to emit results; Define and execute a query
  11. Cascalog basics: Where to emit results; Output variables; Define and execute a query
  12. Cascalog basics: Where to emit results; “Predicates”: constrain the output variables; Output variables; Define and execute a query
  13. Predicates
  14. Predicates: Input fields
  15. Predicates: Input fields; Output fields
  16. Predicates: Fields can be constants or variables
  17. Predicates: Fields can be constants or variables; Variables are prefixed with ? or !
  18. Predicates
  19. Predicates: • Functions • Filters • Aggregators • Generators: finite sources of tuples
  20. Example #1: Generator, Filter
  21. Example #2: Generator, Function
  22. Example #3: Generator, Aggregator, Filter
  23. Join example
  24. Join example: Triggers a join
  25. Join example
  26. Join example: Joins are an implementation detail
  27. Demo time!
  28. Why another query language for Hadoop? Existing tools cause too much Accidental Complexity
  29. Accidental complexity: Complexity caused by the tool used to solve a problem rather than the problem itself
  30. Accidental complexity: • Distinct query languages cause accidental complexity • Example: SQL injection
  31. Query language: • We want: • Ability to abstract • Ability to compose
  32. Abstraction: Clojure function that returns a subquery
  33. Abstraction: Defining and using custom operation
  34. Composability: Dynamic query with parameterized operation
  35. Composability: “Predicate macro”
  36. Composability: Using a predicate macro (“expands to”)
  37. Contrast to Pig: “Average” is 300 lines of code in Pig
  38. Optimized aggregators in Cascalog: Implementation of count and sum
  39. Why another query language for Hadoop? Existing tools cause too much Accidental Complexity
  40. Composability: Value normalization example #1
  41. Composability: Value normalization example #2
  42. Composability: Value normalization algorithm. For each id: select value with the biggest timestamp
  43. Composability: Implementing value normalization
  44. Composability: Using value normalization
  45. Try Cascalog yourself! Project Page: http://www.github.com/nathanmarz/cascalog Introductory Tutorial: http://nathanmarz.com/blog/introducing-cascalog/ 5 minutes to install Clojure, Hadoop, and Cascalog locally! See project README
  46. BackType is hiring: Think Cascalog’s cool? Come build amazing software at BackType. http://www.backtype.com/jobs
  47. Questions? Follow me on Twitter at @nathanmarz. nathan.marz@gmail.com
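
The transcript does not carry the code shown on the slides, so the following is only a rough plain-Clojure analogue of the predicate kinds named on slide 19 (generators, filters, functions, aggregators) and of the join on slides 23 to 26. It is not Cascalog code and does not use the Cascalog API: there are no logic variables and nothing runs on Hadoop. The dataset contents and the names `under-30`, `with-next-age`, `under-30-count`, and `age-and-gender` are assumptions made for illustration.

```clojure
;; Plain-Clojure analogue of the predicate kinds; NOT Cascalog code.
;; All data and names below are hypothetical.

(def age
  "Generator: a finite source of [person age] tuples."
  [["alice" 28] ["bob" 33] ["chris" 40] ["david" 25]])

(def gender
  "A second generator of [person gender] tuples, used for the join."
  [["alice" "f"] ["bob" "m"] ["chris" "m"] ["emily" "f"]])

;; Filter: keep only the tuples whose age is below 30.
(def under-30
  (filter (fn [[_ a]] (< a 30)) age))

;; Function: derive a new field from the existing fields.
(def with-next-age
  (map (fn [[p a]] [p a (inc a)]) age))

;; Aggregator applied after a filter: count the people under 30.
(def under-30-count
  (count under-30))

;; Join: combine tuples from both generators that share a person.
(def age-and-gender
  (for [[p a] age
        [q g] gender
        :when (= p q)]
    [p a g]))

(prn (vec under-30))
(prn under-30-count)
(prn (vec age-and-gender))
```

In this sketch the join is written out by hand as a nested loop with an equality test. The point made on slide 26 is that in Cascalog a join is an implementation detail rather than something the query author spells out.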

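
The value normalization rule stated on slide 42 ("for each id: select value with the biggest timestamp") can be sketched in plain Clojure over in-memory data. This is not the Cascalog implementation shown in the talk, which the transcript does not include. The tuple layout `[id value timestamp]`, the function name `normalize-values`, and the sample data are assumptions.

```clojure
;; Plain-Clojure sketch of the slide-42 rule; NOT the talk's Cascalog code.
;; Assumed tuple layout: [id value timestamp].

(defn normalize-values
  "For each id, keep the value that carries the largest timestamp.
  Returns a map of id -> value. If two tuples for one id share the
  largest timestamp, which one wins is not specified here."
  [tuples]
  (into {}
        (for [[id group] (group-by first tuples)]
          (let [[_ value _] (apply max-key #(nth % 2) group)]
            [id value]))))

;; Hypothetical observations: several values per id, out of order.
(def observations
  [[1 "a" 100]
   [1 "b" 250]
   [2 "x" 50]
   [1 "c" 200]
   [2 "y" 75]])

(prn (normalize-values observations))
```

Grouping first and then taking the maximum within each group mirrors the "for each id" wording of the slide. A distributed version would perform the same per-key selection as an aggregation.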