Cascalog: an Interactive Query
Language for Hadoop
Nathan Marz
BackType
What is Cascalog?


Cascading (the job execution engine)

    +

 Datalog (basis of the API design)

    +

 Clojure (the ...
Why another query language for Hadoop?




 Existing tools cause too much

 Accidental Complexity
Accidental complexity




 Complexity caused by the tool used
 to solve a problem rather than the
 problem itself
Accidental complexity in existing tools




Pig               The query language is different
                  than the p...
When query tool is separate from
programming language


 Friction when embedding custom operations

 Interlacing queries...
Clojure

 General purpose programming language

 Dialect of Lisp that compiles to Java bytecode

 “Programmable program...
Clojure examples



    Clojure code             Result

      (+ 1 2 3)                6


     (> 20 18)                ...
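
The "programmable programming language" point can be illustrated with a macro. The sketch below is not from the talk: `unless` is a hypothetical name defined here purely for illustration, and `incr` mirrors the helper shown on the examples slide. It uses only core Clojure.

```clojure
;; Hypothetical macro, defined here for illustration only.
;; It adds a new control-flow form to the language: run the body
;; only when the test is falsey.
(defmacro unless
  [test & body]
  `(if ~test nil (do ~@body)))

;; Same helper as on the slide.
(defn incr [x] (+ 1 x))

;; (> 18 20) is false, so the body runs.
(unless (> 18 20)
  (incr 3))
```

Because `unless` expands into ordinary Clojure code at compile time, a library can define its own query-like forms in the same way without a separate parser or language runtime.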
Cascalog




 Domain Specific Language in Clojure for
 processing data using Hadoop
Cascalog




       Full power of a general purpose
 programming language available at all times
Cascalog




       Full power of a general purpose
 programming language available at all times


         Cascalog is a ...
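
A minimal sketch of what running such a query at the REPL might look like, assuming the Cascalog library is on the classpath and Hadoop runs in local mode. The `age` dataset below is hypothetical (made-up names and ages held in an in-memory vector of tuples); it is not the data used in the talk's demo.

```clojure
;; Assumes Cascalog is available as a dependency (see the project README).
(use '[cascalog.api :only [?<- stdout]])

;; Hypothetical in-memory dataset of [person age] tuples.
(def age
  [["alice" 28]
   ["bob" 33]
   ["chris" 25]])

;; Print every person whose age is exactly 25.
(?<- (stdout) [?p] (age ?p 25))

;; Print every person younger than 30, together with their age.
(?<- (stdout) [?p ?a]
     (age ?p ?a)
     (< ?a 30))
```

Here `?<-` both defines and executes the query, `(stdout)` is the output sink, and the vector lists the output variables. Each following form is a predicate that either binds variables from a source or filters on them.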
Demo time!
Some of Cascalog’s features

   Inner and outer joins
   Aggregators
   Functions
   Subqueries
   Sorting
   Read f...
When query tool is separate from
programming language


 Friction when embedding custom operations

 Interlacing queries...
Cascalog, on the other hand...



 Custom operations defined just like any other
  function

 Interlacing queries with r...
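
The three points above can be sketched together, under the same assumptions as any local Cascalog session (library on the classpath, local Hadoop mode). The dataset and the names `incr` and `younger-than` are hypothetical, chosen for illustration rather than taken from the talk.

```clojure
;; Assumes Cascalog is available as a dependency (see the project README).
(use '[cascalog.api :only [<- ?- stdout]])

;; Hypothetical in-memory dataset of [person age] tuples.
(def age
  [["alice" 28]
   ["bob" 33]
   ["chris" 25]])

;; A custom operation is an ordinary Clojure function.
(defn incr [x] (+ 1 x))

;; Generating a query dynamically: an ordinary function that takes a
;; parameter and returns a query built with `<-`.
(defn younger-than [max-age]
  (<- [?p ?next-age]
      (age ?p ?a)
      (< ?a max-age)            ; filter using the function argument
      (incr ?a :> ?next-age)))  ; apply the custom operation

;; Interlacing with application logic: build the query, then run it.
(?- (stdout) (younger-than 30))
```

Since `younger-than` returns a query as a plain value, it can be stored, passed to other functions, or used as a subquery like any other Clojure data.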
Try Cascalog yourself!


Project Page
http://www.github.com/nathanmarz/cascalog

Introductory Tutorial
http://nathanmarz.c...
Questions?

Twitter: @nathanmarz
Email: nathan.marz@gmail.com
More benefits of being a Clojure DSL


 Excellent module system

 Interactive REPL

 Make use of any Clojure function in ...

Cascalog at Hadoop Summit


My presentation about Cascalog at Hadoop Summit 2010.

Published in: Technology

  • - UDFs, custom duct tape for registering and finding dependencies, separate files
    - separate files, testing?, error handling
    - things that you didn’t think were possible become idiomatic. compose queries, parameterize, pass queries and operations around










  • Cascalog at Hadoop Summit

    1. Cascalog: an Interactive Query Language for Hadoop. Nathan Marz, BackType
    2. What is Cascalog? Cascading (the job execution engine) + Datalog (basis of the API design) + Clojure (the host programming language)
    3. Why another query language for Hadoop? Existing tools cause too much accidental complexity
    4. Accidental complexity: complexity caused by the tool used to solve a problem rather than the problem itself
    5. Accidental complexity in existing tools: Pig, Hive: the query language is different than the programming language
    6. When query tool is separate from programming language: friction when embedding custom operations; interlacing queries with regular application logic is unnatural; generating queries dynamically is difficult
    7. Clojure: general purpose programming language; dialect of Lisp that compiles to Java bytecode; “programmable programming language”: easy to build Domain Specific Languages (DSL) in Clojure
    8. Clojure examples (code, then result): (+ 1 2 3) gives 6; (> 20 18) gives true; (defn incr [x] (+ 1 x)) followed by (incr 3) gives 4
    9. Cascalog: Domain Specific Language in Clojure for processing data using Hadoop
    10. Cascalog: full power of a general purpose programming language available at all times
    11. Cascalog: full power of a general purpose programming language available at all times. Cascalog is a Clojure library. Example query: (?<- (stdout) [?p ?a] (age ?p 25))
    12. Demo time!
    13. Some of Cascalog’s features: inner and outer joins; aggregators; functions; subqueries; sorting; read from and write to arbitrary data sources (HDFS, HBase, MySQL, etc.)
    14. When query tool is separate from programming language: friction when embedding custom operations; interlacing queries with regular application logic is unnatural; generating queries dynamically is difficult
    15. Cascalog, on the other hand...: custom operations defined just like any other function; interlacing queries with regular application logic is trivial; generating queries dynamically is easy and idiomatic
    16. Try Cascalog yourself! Project page: http://www.github.com/nathanmarz/cascalog. Introductory tutorial: http://nathanmarz.com/blog/introducing-cascalog/. 5 minutes to install Clojure, Hadoop, and Cascalog locally! See project README
    17. Questions? Twitter: @nathanmarz. Email: nathan.marz@gmail.com
    18. More benefits of being a Clojure DSL: excellent module system; interactive REPL; make use of any Clojure function in queries