ETL in Clojure


The talk compares Cascalog, a fully featured data processing and querying library on top of Hadoop, with Sparkling, a Clojure API for Apache Spark. It looks at how the two compare in performance and code complexity for Big Data processing, and why you shouldn't be writing MapReduce jobs in the plain Hadoop API.

Published in: Software

  1. ETL in Clojure. Dmitriy Morozov / JEEConf 2015
  2. Dmitriy Morozov: Software engineer at Zoomdata. Functional programming junky. Occasional cyclist. @argc
  3. Plan of attack: ETL at Zoomdata, Cascalog, Spark, Demo, Conclusion
  4. Zoomdata is a modern BI application focused on allowing everyday business users to visually interact with and explore their data and discover insight out of that data.
  5. What we do at Zoomdata
  6. What we do at Zoomdata
  7. We did ETL in Hive/Impala
  8. Using SQL for ETL: lessons learned. Hive is slow, and so is Hive on Tez. SQL is horrible for doing anything complicated. Code is hard to maintain, reuse and test.
  9. Why Clojure? Functional! Runs on the JVM. Interactive development. Zero delta between prototype code and production code.
  10. Cascalog: Datalog DSL in Clojure. Built on top of Hadoop and Cascading. Queries compile to Hadoop MapReduce jobs. Supports local execution for prototyping. Great testing story.
  11. Datalog: a logic programming language that is syntactically a subset of Prolog. It is often used as a query language for deductive databases. Query statements can be stated in any order.
  12. Datalog
  13. Word Count using Hadoop API
  14. Word count in Cascalog
  15. Cascalog Query Structure
  16. Cascalog / Generators
  17. Cascalog / Operations
  18. Cascalog / Operations
  19. Cascalog / Joins
  20. Cascalog / Operations
  21. Cascalog / Aggregators
  22. Cascalog / Aggregators
  23. Cascalog / Troubleshooting
  24. Cascalog / Testing
  25. Cascalog / Troubleshooting
  26. Flow Visualisation / DOT
  27. Flow Visualisation / Driven
  28. DEMO
  29. Cascalog Downsides: Hadoop < Spark **
  30. Cascalog Downsides: no support for streaming data
  31. Cascalog Downsides
  32. What are the alternatives? Java API for Spark, Flambo, Sparkling
  33. Customer X
  34. Customer X wants to do Data Science!
  35. Drug Persistence: determining whether a patient is persistent or not based on whether she refilled the prescription in time.
  36. Drug Persistence
  37. Drug Persistence
  38. Drug Persistence
  39. Drug Persistence
  40. Example: Drug Persistence
  41. Things to check out: How Yieldbot does Data science in Clojure, Cascalog for the Impatient, Streaming MapReduce in Clojure, Sparkling, Flambo
  42. Thank you!
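
Slide 35 defines drug persistence as whether a patient refilled the prescription in time, but the code from slides 36 to 40 did not survive in this transcript. The following is only a rough illustration of that definition, written in Python rather than the Clojure used in the talk. The record layout (fill date plus days of supply), the function name `is_persistent` and the 30-day grace period are all assumptions made for the sketch, not details taken from the slides.

```python
from datetime import date, timedelta


def is_persistent(fills, grace_days=30):
    """Return True if every refill arrived in time.

    `fills` is a list of (fill_date, days_supply) tuples. This layout is
    hypothetical and not taken from the talk. A refill counts as "in time"
    when it happens no later than the day the previous supply runs out
    plus `grace_days` (also an assumed parameter).
    """
    ordered = sorted(fills)  # chronological order by fill date
    # Compare each fill with the one that follows it.
    for (prev_date, prev_supply), (next_date, _) in zip(ordered, ordered[1:]):
        deadline = prev_date + timedelta(days=prev_supply + grace_days)
        if next_date > deadline:
            return False  # gap too long: patient is not persistent
    return True


# Illustrative data only.
on_time = [(date(2015, 1, 1), 30), (date(2015, 2, 5), 30), (date(2015, 3, 10), 30)]
late = [(date(2015, 1, 1), 30), (date(2015, 4, 15), 30)]

print(is_persistent(on_time))
print(is_persistent(late))
```

In a Cascalog or Spark job the same check would typically run per patient after grouping prescription records by patient ID, so the per-patient logic stays a plain function that can be tested without a cluster.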