Cascalog 
An example of a complex workflow
Marc Limotte
Metamarkets Group
October 27, 2010
What is Cascalog?
• An internal DSL (domain-specific language) for map/reduce
• Implemented in Clojure (a functional language that runs on the Java VM)
• Several layers of abstraction up: based on Cascading (an API for building Hadoop map/reduce jobs)
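For flavor, the canonical first example from the Cascalog docs: find all people under 30, given an age generator (this dataset is not part of the case study).

(?<- (stdout) [?person]
     (age ?person ?age)
     (< ?age 30))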
The Three Bears: Choosing a Solution
• Start with a problem or business requirement
• Interested here in the class of problems that require programmatic query construction:
o Java program on top of Hadoop API (too much control)
o External DSLs (too little control)
o An internal DSL (just right)
Business Requirement
Original data is an N-dimensional cube.
Business Requirement
Generate all possible aggregations.
Business Requirement
There are 2^N - 1 possible ways to roll up the data; with our 6 dimensions, that is 63 rollups.
Tree showing ways to aggregate
Alternative Problem Formulation
There is another way to solve this problem.
• For each input record:
– The map task outputs a key for each possible aggregation
– Use map-side aggregation (a combiner)
• Simpler
• In our tests, much slower
• Likely memory contention from aggregating on a large number of keys (see the sketch below)
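A sketch of the key explosion in plain Clojure (all-masked-keys is a hypothetical helper, not from the talk): each record yields one masked key per possible rollup, and a combiner pre-aggregates map-side.

;For a record's dimension values, produce every masked variant.
;With N dims this is 2^N keys per record, which is what stressed memory.
(defn all-masked-keys [dim-vals]
  (if (empty? dim-vals)
    [[]]
    (for [tail (all-masked-keys (rest dim-vals))
          v    [(first dim-vals) "*"]]
      (cons v tail))))

(map #(apply str (interpose "," %)) (all-masked-keys ["a" "b" "c"]))
;=> ("a,b,c" "*,b,c" "a,*,c" "*,*,c" "a,b,*" "*,b,*" "a,*,*" "*,*,*")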
Solution 1: External Query Language
CREATE TABLE emp_salary_dept_jt...
INSERT INTO emp_salary_dept_jt (...)
SELECT department, job_title, avg(salary), sum(cnt)
FROM emp_salary
GROUP BY department, job_title
CREATE TABLE emp_salary_jt...
INSERT INTO emp_salary_jt (...)
SELECT job_title, group_avg(avg_salary, cnt)
FROM emp_salary_dept_jt
GROUP BY job_title
And 61 additional queries...
or extend the language to make it Turing Complete
or use string manipulation to construct the queries...
Not terrible, but...
You write code in one language to manipulate query strings in another.
You have to work with two different languages (a sketch follows this list):
• Different naming conventions
• Different semantics for escaping special characters, etc.
• Your IDE will probably only help you with the outer language (syntax highlighting, syntax verification, formatting, etc.)
• Limits composability (UDFs are composable, but the control flow is not)
• Complicates abstraction
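To make the two-language point concrete, here is a hypothetical string-built version of the second query above; the host language sees only an opaque string, so none of its tooling reaches the SQL inside.

;Hypothetical query-string builder; table and column names are opaque
;strings to the compiler and the IDE.
(defn rollup-sql [src-table dest-table group-cols]
  (let [cols (apply str (interpose ", " group-cols))]
    (str "INSERT INTO " dest-table " (...) "
         "SELECT " cols ", group_avg(avg_salary, cnt) "
         "FROM " src-table " GROUP BY " cols)))

(rollup-sql "emp_salary_dept_jt" "emp_salary_jt" ["job_title"])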
Solution 2: Java Map / Reduce
• Control logic to launch each of 63 jobs
• Map
o Parameterizable for the data source (a previous aggregation) and which fields to collapse; these could be passed in the JobConf
• Reduce
o Compute the avg using the previous group avg and count
Solution 3: An Internal DSL
To my knowledge, Cascalog is the only option for an “internal” DSL.
A high-level code walkthrough follows:
• Helper functions
• Custom function (UDF)
• Core Process
• Unit Test
• Execution
Helper Functions
(def DIMS ["?dept" "?country" "?city" "?jobtitle" "?manager" "?function"])

;All sub-lists generated by removing one member.
(defn sublists
  ([s] (sublists [] s))
  ([left right]
   (let [left2  (conj (vec left) (first right))
         right2 (rest right)]
     (cons (concat left right2)
           (if (seq right2) (sublists left2 right2) [])))))

;Create a key with only the requested dims; other dims are replaced with *.
;(str-join joins the key pieces with commas.)
(defn key-for-dims
  ([dims key-str]
   (str-join "," (key-for-dims dims (.split key-str ",") 0)))
  ([dims key-split idx]
   (if (>= idx (count key-split))
     []
     (cons (if (some #{(nth DIMS idx)} dims) (nth key-split idx) "*")
           (key-for-dims dims key-split (+ idx 1))))))
• Pure functions
• Immutable data types
• Recursion
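A quick REPL check of the two helpers (values are illustrative):

(sublists ["?dept" "?country" "?city"])
;=> (("?country" "?city") ("?dept" "?city") ("?dept" "?country"))

(key-for-dims ["?dept"] "a,b,c,d,e,f")
;=> "a,*,*,*,*,*"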
Cascalog's Analog to the UDF
• Same syntax
• No interface to implement
• Same source file (unless you want to share it)
;Computes an average from a set of other averages
(def grpavg
  (<- [!avg !cnt :> !newavg !newcnt]
      (* !avg !cnt :> !s)
      (sum !s :> !total)
      (sum !cnt :> !newcnt)
      (div !total !newcnt :> !newavg)))
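grpavg is a count-weighted average: newcnt = Σcnt and newavg = Σ(avg·cnt) / newcnt. The same arithmetic in plain Clojure, with made-up numbers:

;Two groups: averages 200 and 100, with counts 1 and 3.
(let [avgs   [200 100]
      cnts   [1 3]
      total  (reduce + (map * avgs cnts)) ;200*1 + 100*3 = 500
      newcnt (reduce + cnts)]             ;4
  [(/ total newcnt) newcnt])
;=> [125 4]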
Core Process
;Creates a query, where parent is the source, and the dims list is used
;to construct the output key. The other metrics are rolled up.
(defn make-qry [dims parent]
  (<- [?key ?avg ?cnt]
      (parent ?pkey ?pavg ?pcnt)
      (key-for-dims dims ?pkey :> ?key)
      (grpavg ?pavg ?pcnt :> ?avg ?cnt)))

;Given an initial src and a full list of dimensions, return a map of each
;subset of dims to a query that implements a rollup along that set of dims.
;(basic-qry, a helper not shown here, builds the query over the original source.)
(defn generate-query-tree [src dims]
  (let [query        (if (= dims DIMS) (basic-qry src) (make-qry dims src))
        gqt-with-src (partial generate-query-tree query)]
    (if (empty? dims)
      {dims query}
      (assoc (apply merge (map gqt-with-src (sublists dims))) dims query))))
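To see the shape of the result, here is a sketch for a three-dimension DIMS of [A B C]; the real six-dimension run yields 2^6 = 64 entries, the base query plus 63 rollups.

;{(A B C) <basic query over src>
; (B C)   <rollup of (A B C), collapsing A>
; (A C)   <rollup of (A B C), collapsing B>
; (A B)   <rollup of (A B C), collapsing C>
; (C)     <rollup of a two-dim parent>
; (B)     <rollup of a two-dim parent>
; (A)     <rollup of a two-dim parent>
; ()      <grand total>}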
Unit Testing Cascalog
(deftest test-make-qry
  (with-tmp-sources [src [["a,b,c,d,e1,f1" 200 50000]
                          ["a,b,c,d,e1,f2" 100 20000]
                          ["a,b,c,d,e3,f3" 300 40000]]]
    ;choose (a helper, not shown) picks the dims at the given indexes.
    (test?- [["a,b,c,d,e1,*" 300 40000]
             ["a,b,c,d,e3,*" 300 40000]]
            (make-qry (choose DIMS [0 1 2 3 4]) src))
    (test?- [["a,*,*,*,*,*" 600 40000]]
            (make-qry (choose DIMS [0]) src))))
• Sample input: the three tuples bound to src (shown in green in the original slides)
• Expected results: the tuple vectors passed to test?- (shown in orange)
Executing the Queries
For completeness, here is the code that executes all the queries.
;get-data, hfs-hadoop-seqfile and hfs-textline-replace are tap helpers (not shown).
(defn -main [input-dir output-dir]
  (let [source  (get-data (hfs-hadoop-seqfile input-dir))
        queries (vals (generate-query-tree source DIMS))]
    (?- (hfs-textline-replace output-dir) queries)))
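A hypothetical invocation; since this runs on Cascading, the two arguments can be HDFS, local file system, or S3 paths:

(-main "hdfs:///data/emp_salary/2010-10-27"
       "hdfs:///data/emp_salary/rollups")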
References
• Cascalog Project (Nathan Marz)
http://github.com/nathanmarz/cascalog
• Cascading Project (Chris Wensel)
http://www.cascading.org/
• Google group
http://groups.google.com/group/cascalog-user
• IM: Come chat in the #cascading room on freenode
• Book: Practical Clojure by VanderHart and Sierra


Editor's Notes

  • #2 Case study, not an intro. Showing some code, but not describing the syntax (references at the end).
  • #3 Clojure is fully interoperable with Java. In this way we get the benefit of many simultaneous streams of work: optimizations (join optimizations, for example), new features, and adapters and interfaces to other technologies (e.g. sink taps for HBase or MySQL).
  • #4 How to express the solution. Too much control: a full Java program (a lot of overhead to deal with before being able to express your business logic). Too little control: an external query language; either evolve it into a Turing-complete language, or use string manipulation to construct queries. This is a case study of a problem we had: the problem, the three types of solutions and what they might look like, with the most detail, of course, for the Cascalog solution.
  • #5 The requirement is to pre-aggregate a daily data set. Typical star-schema-type data with sample dimensions; the highlighted columns are "dimensions" or "attributes" of the data.
  • #6 A possible aggregation.
  • #7 With 3 dimensions, 8 rollups, including all dimensions and none. If we were doing this with SQL, this would be 7 different GROUP BY clauses. Our data has 6 dimensions, so 63 aggregations.
  • #8 It's always possible to compute any agg from the original data set, but we can do better. Using A, B and C as the dimensions, each node in the tree is a possible rollup. It is more efficient to make a child node from its parent rather than from the original data; aggregations with lower cardinality are preferred (fewer records to aggregate).
  • #10 To look at something we're all familiar with, consider how you would do this for a relational database. The first query is based on the original data; the second is on a previous agg, where group_avg is some UDF that can do an average given a series of previous averages and a count from a previous grouping. DRY principle: determine table names for INSERT and FROM; determine column lists and when to use the UDF avg vs. the standard avg; determine which previous aggregate table to use as a source.
  • #12 The alternative is a custom map/reduce program. Not going into much detail here; it is too low level, and you have to do everything yourself: limited composability (only what you build yourself), choose and name the output path, select the best previous aggregation as the input path, determine which fields need to be aggregated for this run, and set up the JobConf.
  • #13 In this way we get the benefit of many simultaneous streams of work: optimizations (join optimizations, for example), new features, and adapters and interfaces to other technologies (e.g. sink taps for HBase or MySQL).
  • #14 Typical, idiomatic Clojure code: functional, recursive, immutable. An internal DSL extends the language, but you still have the full power of the host language. None of the functions care about the contents or length of the DIMS vector. [A,B,C] yields [A,B] [A,C] [B,C]. For key a,b,c, selecting dimensions [B,C] yields "*,b,c".
  • #18 Notice <- instead of ?<-, which means just create the query; don't execute it for now. make-qry constructs a single job, to do one aggregation. generate-query-tree is also responsible for choosing the best previous aggregation to use for the current rollup.
  • #19 All of the functions we've seen are testable with unit tests or integration tests. Two tests: sample input data and expected results.
  • #20 This becomes the main function. input-dir and output-dir are command-line arguments to the program. They are paths in HDFS, for example; they could be local file system or S3 paths instead. Or, since this runs on top of Cascading, you could output to a database tap or HBase, for example.
  • #21 Lastly, I had never used Clojure or any functional language before starting with Cascalog.