Yahoo! Hadoop User Group - May Meetup - Extraordinarily rapid and robust data analysis with Cascalog, Nathan Marz, BackType

Published in: Technology

  1. Cascalog
     Powerful and easy-to-use data analysis tool for Hadoop
      • Nathan Marz, BackType
  2. About Me
      • Tech Lead at BackType
      • Have been working on many-terabyte scale systems for two years
          • ETL workflows
          • Data warehouses
  3. Presentation Overview
      • High-level introduction to Cascalog
      • Demo
      • Cascalog at BackType
  4. What is Cascalog?
      • Query language for Hadoop
      • Queries are written as regular Clojure code
      • Alternative to Pig and Hive
  5. What is Clojure?
      • Functional language that compiles to Java bytecode
      • Lisp-based
      • First-class integration with Java
  6. Features
      • Inner and outer joins
      • Aggregators
      • Functions
      • Subqueries
      • Sorting
      • Arbitrary inputs and outputs
  7. What sets Cascalog apart?
  8. What sets Cascalog apart?
     Fully integrated in a general-purpose programming language
  9. What sets Cascalog apart?
     Full power of Clojure available at all times
  10. What sets Cascalog apart?
      Full power of Clojure available at all times
  11. What sets Cascalog apart?
      • Custom operations
          • No UDF interface
          • Just Clojure functions
  12. What sets Cascalog apart?
      • Dynamic queries
          • Write functions that return queries
          • Manipulate queries as first-class entities in the language
  13. What sets Cascalog apart?
      • Use Cascalog side by side with other code
          • Appends and distributed copies
          • Consolidation
          • Application logic
  14. Easy Experimentation
      • Ships with a test dataset that can be queried locally (the “playground”)
      • 5 minutes to set up Hadoop, Clojure, and Cascalog locally (see README)
  15. Demo time!
  16. Cascalog at BackType
      • BackType collects data about conversations around the web
          • Tweets
          • Blog comments
          • Social news
          • People
  17. Cascalog at BackType
      • Cascalog is used to:
          • Identify influencers
          • Determine number of people exposed to URLs on Twitter
          • Identify “interesting tweets”
          • Study social engagement of domains over time
          • Etc.
  18. Cascalog at BackType
      • Input and output
          • Cascalog reads from MySQL databases
          • Cascalog writes to Cassandra
  19. Cascalog at BackType
      • Rapid development
          • Local playground dataset for development
          • Develop queries in the REPL
  20. Cascalog Roadmap
      • Optimized joins:
          • Replicated joins
          • Bloom joins
      • Negations
      • Recursion
  21. Questions?
      • Project page: http://www.github.com/nathanmarz/cascalog
      • Tutorial: http://nathanmarz.com/blog/introducing-cascalog
      • Follow me on Twitter: @nathanmarz
  22. Clojure and Cascalog
      • Provided by Clojure:
          • Module system
          • Dynamic queries
          • Custom operations
          • Interactive REPL
  23. Cascading and Cascalog
      • Provided by Cascading:
          • Tuple abstraction and tuple manipulation
          • Workflow to MapReduce translation
          • Read and write from anywhere with Taps
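
The roadmap slide (slide 20) names replicated joins and Bloom joins without describing them. The sketch below illustrates the general idea behind each technique in plain Python. It is not Cascalog or Cascading code and does not reflect how either project implements these joins. The class and function names (`BloomFilter`, `replicated_join`, `bloom_join`) and the sample rows are hypothetical, chosen only for illustration.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter: may report false positives, never false negatives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [False] * num_bits

    def _positions(self, key):
        # Derive num_hashes bit positions by salting a SHA-256 hash of the key.
        for salt in range(self.num_hashes):
            digest = hashlib.sha256(f"{salt}:{key}".encode("utf-8")).hexdigest()
            yield int(digest, 16) % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = True

    def might_contain(self, key):
        return all(self.bits[pos] for pos in self._positions(key))


def replicated_join(large, small):
    """Inner join that holds the small side in memory as a hash map.

    On a cluster, the map would be copied ("replicated") to every mapper,
    so the large side could be joined without being shuffled.
    """
    lookup = {}
    for key, value in small:
        lookup.setdefault(key, []).append(value)
    return [(key, lv, sv) for key, lv in large for sv in lookup.get(key, [])]


def bloom_join(large, small):
    """Inner join that pre-filters the large side with a Bloom filter.

    Rows whose key is definitely absent from the small side are dropped
    before the exact join. Any false positives that slip through are
    removed by the exact join itself (here, the in-memory join above
    stands in for whatever exact join a real system would run).
    """
    bloom = BloomFilter()
    for key, _ in small:
        bloom.add(key)
    survivors = [(key, lv) for key, lv in large if bloom.might_contain(key)]
    return replicated_join(survivors, small)


# Hypothetical sample data: (key, value) pairs.
tweets = [("alice", "tweet1"), ("bob", "tweet2"),
          ("alice", "tweet3"), ("carol", "tweet4")]
ages = [("alice", 28), ("bob", 33)]

joined = replicated_join(tweets, ages)
filtered_join = bloom_join(tweets, ages)
```

Both functions return the same rows. The difference lies in how much of the large input would need to move across the network in a distributed setting: a replicated join ships the small side everywhere, while a Bloom join ships only a compact filter and discards most non-matching rows early.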
