Cascalog
Data processing on Hadoop without the hassle


                                    Nathan Marz
                  ...
What is Cascalog?

[Diagram: layers ordered by level of abstraction]
Cascalog    Variables and logic
Cascading   Tuples, data wo...
Cascalog’s components

Cascading   (the job execution engine)
    +
 Datalog    (basis of the API design)
    +
 Clojure  ...
Clojure

• General purpose programming language
• Dialect of Lisp that compiles to Java bytecode
Clojure
• “Programmable programming language”:
  Easy to build Domain Specific Languages
  (DSL) in Clojure
Clojure examples
   Clojure code           Result
    (+ 1 2 3)               6
   (> 20 18)               true

(defn inc...
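
The expressions on this slide can be tried at any Clojure REPL. The sketch below restates them, with the definition of incr taken from the slide transcript and the results noted as comments.

```clojure
;; Prefix notation: the function comes first, then its arguments.
(+ 1 2 3)   ; => 6
(> 20 18)   ; => true

;; Define a one-argument function, then call it.
(defn incr [x]
  (+ 1 x))

(incr 3)    ; => 4
```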
Cascalog basics




 The “age” dataset
Cascalog basics

Define and execute a query
Where to emit results
Output variables
“Predicates”: constrain the ...
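
The query shown on these slides did not survive extraction. As a minimal sketch of the same anatomy, assuming the Cascalog 1.x-era API (cascalog.api with ?<- and stdout) is on the classpath, a query over a small in-memory dataset might look like this. The rows in age are hypothetical stand-ins for the “age” dataset.

```clojure
;; Assumption: Cascalog (1.x-era API) is available on the classpath.
(use 'cascalog.api)

;; Hypothetical stand-in for the "age" dataset: [person age] tuples.
(def age [["alice" 28] ["bob" 33] ["david" 25] ["emily" 25]])

(?<- (stdout)          ; where to emit results
     [?person]         ; output variables
     (age ?person 25)) ; predicate: constrains the output variables
```

With this data the query should emit the people whose age is 25. The constant 25 in the predicate acts as a filter on that field.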
Predicates

Input fields   Output fields
Fields can be constants or variables
Variables are prefixed with ? or !
Predicates
• Functions
• Filters
• Aggregators
• Generators: finite sources of tuples
Example #1



    Generator   Filter
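
The slide's code is missing from this extraction. A sketch of a generator combined with a filter, assuming the Cascalog 1.x-era API and a hypothetical in-memory age dataset:

```clojure
;; Assumption: Cascalog (1.x-era API) is available on the classpath.
(use 'cascalog.api)

;; Hypothetical [person age] tuples.
(def age [["alice" 28] ["bob" 33] ["david" 25]])

(?<- (stdout) [?person ?age]
     (age ?person ?age)   ; generator: binds ?person and ?age
     (< ?age 30))         ; filter: keeps tuples where ?age < 30
```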
Example #2



Generator        Function
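
The slide's code is missing from this extraction. A sketch of a generator combined with a function, assuming the Cascalog 1.x-era API. The dataset and the variable name ?double-age are hypothetical; the :> keyword separates a function's input fields from its output fields.

```clojure
;; Assumption: Cascalog (1.x-era API) is available on the classpath.
(use 'cascalog.api)

;; Hypothetical [person age] tuples.
(def age [["alice" 28] ["bob" 33]])

(?<- (stdout) [?person ?double-age]
     (age ?person ?age)            ; generator
     (* 2 ?age :> ?double-age))    ; function: inputs :> outputs
```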
Example #3



Generator   Aggregator   Filter
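
The slide's code is missing from this extraction. One query that fits the labels, sketched under the assumption of the Cascalog 1.x-era API (cascalog.ops provides count), finds people who follow more than two others. The follows data is hypothetical.

```clojure
;; Assumption: Cascalog (1.x-era API) is available on the classpath.
(use 'cascalog.api)
(require '[cascalog.ops :as c])

;; Hypothetical [follower followee] tuples.
(def follows [["alice" "bob"] ["alice" "david"] ["alice" "emily"]
              ["bob" "david"]])

(?<- (stdout) [?person]
     (follows ?person _)   ; generator; _ ignores the second field
     (c/count ?count)      ; aggregator: tuples per ?person
     (> ?count 2))         ; filter applied to the aggregate
```

Because ?person is the only non-aggregated output variable, the count is grouped by ?person.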
Join example

Triggers a join
Joins are an implementation detail
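
The join query itself was lost in extraction. As a sketch, assuming the Cascalog 1.x-era API and two hypothetical in-memory datasets, reusing the same variable in two generators is what triggers the join:

```clojure
;; Assumption: Cascalog (1.x-era API) is available on the classpath.
(use 'cascalog.api)

;; Hypothetical datasets that share a person field.
(def age    [["alice" 28] ["bob" 33] ["david" 25]])
(def gender [["alice" "f"] ["bob" "m"] ["emily" "f"]])

(?<- (stdout) [?person ?age ?gender]
     (age ?person ?age)         ; ?person appears here ...
     (gender ?person ?gender))  ; ... and here, so the two are joined
```

Only people present in both datasets should appear in the output. No join keyword is written anywhere in the query.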
Demo time!
Why another query
 language for Hadoop?

Existing tools cause too much

Accidental Complexity
Accidental complexity

  Complexity caused by the tool used
  to solve a problem rather than the
  problem itself
Accidental complexity


• Distinct query languages cause accidental
  complexity
• Example: SQL injection
Query language

• We want:
 • Ability to abstract
 • Ability to compose
Abstraction




Clojure function that returns a subquery
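
The function on this slide is missing from the extraction. A sketch of the idea, assuming the Cascalog 1.x-era API, where <- defines a query without running it. The name younger-than and the dataset are hypothetical.

```clojure
;; Assumption: Cascalog (1.x-era API) is available on the classpath.
(use 'cascalog.api)

;; Hypothetical [person age] tuples.
(def age [["alice" 28] ["bob" 33] ["david" 25]])

;; An ordinary Clojure function that returns a subquery.
(defn younger-than [max-age]
  (<- [?person]
      (age ?person ?age)
      (< ?age max-age)))

;; The returned subquery is used as a generator in another query.
(let [young (younger-than 30)]
  (?<- (stdout) [?person]
       (young ?person)))
```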
Abstraction




Defining and using custom operation
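
The operation on this slide is missing from the extraction. A sketch using defmapop from the Cascalog 1.x-era API (an assumption about the version; later releases changed how custom operations are defined). The name age-group and the dataset are hypothetical.

```clojure
;; Assumption: Cascalog (1.x-era API) is available on the classpath.
(use 'cascalog.api)

;; Hypothetical [person age] tuples.
(def age [["alice" 28] ["bob" 33]])

;; Custom map operation: one output value per input tuple.
(defmapop age-group [n]
  (if (< n 30) "under-30" "30-plus"))

(?<- (stdout) [?person ?group]
     (age ?person ?age)
     (age-group ?age :> ?group))
```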
Composability




Dynamic query with parameterized operation
Composability




 “Predicate macro”
Composability

       expands to




Using a predicate macro
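
The predicate macro on these slides is missing from the extraction. A sketch of what one could look like, assuming the Cascalog 1.x-era API, where <- with :> in its variable vector defines a predicate macro and div performs floating-point division. The name average and the dataset are hypothetical.

```clojure
;; Assumption: Cascalog (1.x-era API) is available on the classpath.
(use 'cascalog.api)
(require '[cascalog.ops :as c])

;; Hypothetical [person age] tuples.
(def age [["alice" 20] ["bob" 30] ["david" 40]])

;; Predicate macro: one predicate that expands into three.
(def average
  (<- [!v :> !avg]
      (c/count !c)
      (c/sum !v :> !s)
      (div !s !c :> !avg)))

(?<- (stdout) [?avg-age]
     (age _ ?age)
     (average ?age :> ?avg-age))
```

At the call site, (average ?age :> ?avg-age) expands to the count, sum, and div predicates, so the aggregation is composed from existing aggregators rather than written from scratch.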
Contrast to Pig




“Average” is 300 lines of code in Pig
Optimized aggregators
     in Cascalog




Implementation of count and sum
Why another query
 language for Hadoop?

Existing tools cause too much

Accidental Complexity
Composability




Value normalization example #1
Composability




Value normalization example #2
Composability


For each id:
 select value with the biggest timestamp




   Value normalization algorithm
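
The algorithm stated above can be illustrated in plain Clojure, independent of Cascalog. This is only a sketch of the selection rule, not the Cascalog implementation from the talk; the record shape (:id, :value, :timestamp) and the name latest-values are hypothetical.

```clojure
;; For each id, select the value with the biggest timestamp.
(defn latest-values [records]
  (into {}
        (for [[id recs] (group-by :id records)]
          [id (:value (apply max-key :timestamp recs))])))

(latest-values
 [{:id 1 :value "a" :timestamp 10}
  {:id 1 :value "b" :timestamp 20}
  {:id 2 :value "c" :timestamp 5}])
;; => {1 "b", 2 "c"}
```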
Composability




Implementing value normalization
Composability




Using value normalization
Try Cascalog yourself!
Project Page
http://www.github.com/nathanmarz/cascalog

Introductory Tutorial
http://nathanmarz.com...
BackType is hiring

          Think Cascalog’s cool?
 Come build amazing software at BackType.



http://www.backtype.com/...
Questions?


Follow me on Twitter at @nathanmarz
      nathan.marz@gmail.com
Cascalog at Strange Loop

Presentation of Cascalog at Strange Loop on October 15th, 2010.

http://github.com/nathanmarz/cascalog

Published in: Technology

Cascalog at Strange Loop

  1. Cascalog: Data processing on Hadoop without the hassle. Nathan Marz, BackType, @nathanmarz
  2. What is Cascalog? Abstraction: Cascalog: Variables and logic. Cascading: Tuples, data workflows. Key/value pairs, MapReduce aggregation.
  3. Cascalog’s components: Cascading (the job execution engine) + Datalog (basis of the API design) + Clojure (the host programming language)
  4. Clojure • General purpose programming language • Dialect of Lisp that compiles to Java bytecode
  5. Clojure • “Programmable programming language”: Easy to build Domain Specific Languages (DSL) in Clojure
  6. Clojure examples. Clojure code and result: (+ 1 2 3) gives 6; (> 20 18) gives true; (defn incr [x] (+ 1 x)) (incr 3) gives 4
  7. Cascalog basics: The “age” dataset
  8. Cascalog basics
  9. Cascalog basics: Define and execute a query
  10. Cascalog basics: Where to emit results. Define and execute a query
  11. Cascalog basics: Where to emit results. Output variables. Define and execute a query
  12. Cascalog basics: Where to emit results. “Predicates”: constrain the output variables. Output variables. Define and execute a query
  13. Predicates
  14. Predicates: Input fields
  15. Predicates: Input fields. Output fields
  16. Predicates: Fields can be constants or variables
  17. Predicates: Fields can be constants or variables. Variables are prefixed with ? or !
  18. Predicates
  19. Predicates • Functions • Filters • Aggregators • Generators: finite sources of tuples
  20. Example #1: Generator, Filter
  21. Example #2: Generator, Function
  22. Example #3: Generator, Aggregator, Filter
  23. Join example
  24. Join example: Triggers a join
  25. Join example
  26. Join example: Joins are an implementation detail
  27. Demo time!
  28. Why another query language for Hadoop? Existing tools cause too much Accidental Complexity
  29. Accidental complexity: Complexity caused by the tool used to solve a problem rather than the problem itself
  30. Accidental complexity • Distinct query languages cause accidental complexity • Example: SQL injection
  31. Query language • We want: • Ability to abstract • Ability to compose
  32. Abstraction: Clojure function that returns a subquery
  33. Abstraction: Defining and using custom operation
  34. Composability: Dynamic query with parameterized operation
  35. Composability: “Predicate macro”
  36. Composability: Using a predicate macro (expands to)
  37. Contrast to Pig: “Average” is 300 lines of code in Pig
  38. Optimized aggregators in Cascalog: Implementation of count and sum
  39. Why another query language for Hadoop? Existing tools cause too much Accidental Complexity
  40. Composability: Value normalization example #1
  41. Composability: Value normalization example #2
  42. Composability: Value normalization algorithm. For each id: select value with the biggest timestamp
  43. Composability: Implementing value normalization
  44. Composability: Using value normalization
  45. Try Cascalog yourself! Project Page: http://www.github.com/nathanmarz/cascalog Introductory Tutorial: http://nathanmarz.com/blog/introducing-cascalog/ 5 minutes to install Clojure, Hadoop, and Cascalog locally! See project README
  46. BackType is hiring. Think Cascalog’s cool? Come build amazing software at BackType. http://www.backtype.com/jobs
  47. Questions? Follow me on Twitter at @nathanmarz. nathan.marz@gmail.com