Cascalog at May Bay Area Hadoop User Group

Presentation about Cascalog, a Clojure-based query language for Hadoop.

Published in: Technology

  1. 1. Cascalog. Nathan Marz, BackType. Powerful and easy-to-use data analysis tool for Hadoop
  2. 2. About Me: Tech Lead at BackType; have been working on many-terabyte-scale systems for two years; ETL workflows; data warehouses
  3. 3. Presentation Overview: 1) High-level introduction to Cascalog 2) Demo 3) Cascalog at BackType
  4. 4. What is Cascalog? Query language for Hadoop; queries are written as regular Clojure code; alternative to Pig and Hive
  5. 5. What is Clojure? Functional language that compiles to Java bytecode; Lisp-based; first-class integration with Java
  6. 6. Features: inner and outer joins; aggregators; functions; subqueries; sorting; arbitrary inputs and outputs
  7. 7. What sets Cascalog apart?
  8. 8. What sets Cascalog apart? Fully integrated in a general purpose programming language
  9. 9. What sets Cascalog apart? Full power of Clojure available at all times
  10. 10. What sets Cascalog apart? Full power of Clojure available at all times
  11. 11. What sets Cascalog apart? Custom operations: no UDF interface, just Clojure functions
  12. 12. What sets Cascalog apart? Dynamic queries: write functions that return queries; manipulate queries as first-class entities in the language
  13. 13. What sets Cascalog apart? Use Cascalog side by side with other code: appends and distributed copies; consolidation; application logic
  14. 14. Easy Experimentation: ships with a test dataset that can be queried locally (the “playground”); 5 minutes to set up Hadoop, Clojure, and Cascalog locally (see README)
  15. 15. Demo time!
  16. 16. Cascalog at BackType. BackType collects data about conversations around the web: tweets, blog comments, social news, people
  17. 17. Cascalog at BackType
  18. 18. Cascalog at BackType. Cascalog is used to:
  19. 19. Cascalog at BackType. Cascalog is used to: identify influencers
  20. 20. Cascalog at BackType. Cascalog is used to: identify influencers; determine number of people exposed to URLs on Twitter
  21. 21. Cascalog at BackType. Cascalog is used to: identify influencers; determine number of people exposed to URLs on Twitter; identify “interesting tweets”
  22. 22. Cascalog at BackType. Cascalog is used to: identify influencers; determine number of people exposed to URLs on Twitter; identify “interesting tweets”; study social engagement of domains over time
  23. 23. Cascalog at BackType. Cascalog is used to: identify influencers; determine number of people exposed to URLs on Twitter; identify “interesting tweets”; study social engagement of domains over time; etc.
  24. 24. Cascalog at BackType. Input and output: Cascalog reads from MySQL databases and HDFS; Cascalog writes to Cassandra and HDFS
  25. 25. Cascalog at BackType. Rapid development: local playground dataset for development; develop queries in the REPL
  26. 26. Cascalog Roadmap: optimized joins (replicated joins, Bloom joins); negations; recursion
  27. 27. Questions? Project page: http://www.github.com/nathanmarz/cascalog Tutorial: http://nathanmarz.com/blog/introducing-cascalog Follow me on Twitter: @nathanmarz
  28. 28. Clojure and Cascalog. Provided by Clojure: module system; dynamic queries; custom operations; interactive REPL
  29. 29. Cascading and Cascalog. Provided by Cascading: tuple abstraction and tuple manipulation; workflow to MapReduce translation; read and write from anywhere with Taps
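
Slides 11, 12, and 14 describe custom operations as plain Clojure functions, functions that return queries, and the local playground dataset. The following is a minimal sketch of how these might look at the REPL, not code taken from the talk. It assumes the early Cascalog API as described in the project's introductory tutorial: the cascalog.playground namespace with its bootstrap helper and sample age and sentence datasets, plus defmapcatop, <-, ?<-, stdout, and c/count. Names may differ in later Cascalog releases. The names split-words and younger-than are hypothetical and chosen only for illustration.

```clojure
;; Assumes Cascalog (early API, as in the introductory tutorial) is on the classpath.
(use 'cascalog.playground)
(bootstrap) ; pulls in cascalog.api and aliases cascalog.ops as c

;; Custom operation: the body is ordinary Clojure, with no separate UDF interface.
;; split-words is a hypothetical name.
(defmapcatop split-words [sentence]
  (seq (.split sentence "\\s+")))

;; Word count over the playground's sample `sentence` dataset.
(?<- (stdout) [?word ?count]
     (sentence ?s)
     (split-words ?s :> ?word)
     (c/count ?count))

;; Dynamic query: a function that builds and returns a query.
;; younger-than is a hypothetical name; max-age is a regular Clojure argument.
(defn younger-than [max-age]
  (<- [?person ?age]
      (age ?person ?age)
      (< ?age max-age)))

;; The returned query is a value: bind it and use it as a source in another query.
(let [young (younger-than 30)]
  (?<- (stdout) [?person ?age]
       (young ?person ?age)))
```

Here <- only defines a query, while ?<- defines and runs one, which is why the function returns the former and the REPL calls use the latter.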