Michael Pershyn "Clojure 4 Big Data"

BigData & Data Engineering

Published in: Engineering

  1. Clojure 4 Big Data. Michael Pershyn, 2018-11-03
  2. About me and why Clojure 4 Big Data
     ● Making software since 2005, working with Big Data since 2012
     ● Work for ADITION Technologies AG
       – Leading European ad-serving provider
       – Part of the European tech stack VirtualMinds
       – >2.5 billion events per day processed in real time
       – Extra ~12 billion data points in (batch) ETL daily
       – 250 TB of data in a Hadoop data lake
       – Several own data centers
       – Low latency requirements
       – Written mostly in Clojure
  4. Agenda
     ● Why Clojure in 3 Minutes
     ● Apache Storm
     ● Apache Trident
     ● Incanter
     ● Cascalog
  5. Why Clojure?
  6. ● Makes you think differently, approach problems differently, and solve them faster
     ● Immutability, functions and map-reduce
     ● Powerful, interactive, small, concise
     ● Makes it hard to fall back to imperative style
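
The "immutability, functions and map-reduce" point on slide 6 can be illustrated with a small sketch in plain Clojure. The sample data and the names `events`, `word-counts` and `more-events` are made up for illustration and do not come from the talk.

```clojure
(require '[clojure.string :as str])

;; Hypothetical sample data: three ad events as plain text lines.
(def events ["click banner" "view banner" "click video"])

(defn word-counts
  "Maps each line to its words, then reduces the words into a frequency map."
  [lines]
  (->> lines
       (mapcat #(str/split % #"\s+"))                   ; map step: line -> words
       (reduce (fn [acc w] (update acc w (fnil inc 0))) ; reduce step: words -> counts
               {})))

(def counts (word-counts events))

;; conj returns a new vector; the original `events` value is left untouched.
(def more-events (conj events "view video"))
```

Because `events` is never modified, it can be shared freely between functions and threads, and only `more-events` sees the fourth entry.
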
  8. ● Distributed realtime computation system
     ● Apache Top-Level Project since September 2014
     ● Free and open source
  9. Core Concepts of Storm
     ● Spouts
     ● Bolts
     ● Topology
     ● Stream
     ● Cluster (Nimbus & Workers)
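
As a rough mental model of the concepts on slide 9, the sketch below mimics spouts, bolts, streams and a topology with ordinary Clojure functions. It deliberately does not use the Storm API: all names (`sentence-spout`, `split-bolt`, `count-bolt`, `run-topology`) are hypothetical, everything runs in one process, and the cluster side (Nimbus and workers) is not modeled.

```clojure
(require '[clojure.string :as str])

;; A "spout" is a source of tuples; here it emits a fixed, finite stream.
(defn sentence-spout []
  [{:sentence "storm is fast"}
   {:sentence "clojure is concise"}])

;; A "bolt" consumes one tuple and emits zero or more new tuples.
(defn split-bolt [{:keys [sentence]}]
  (for [w (str/split sentence #" ")]
    {:word w}))

;; A stateful bolt, modeled as a reducing function over the incoming stream.
(defn count-bolt [counts {:keys [word]}]
  (update counts word (fnil inc 0)))

;; The "topology" wires the spout and the bolts into one dataflow.
(defn run-topology []
  (->> (sentence-spout)        ; stream of sentence tuples
       (mapcat split-bolt)     ; stream of word tuples
       (reduce count-bolt {})));; final word counts
```

In a real Storm topology the stream is unbounded and each spout and bolt runs as parallel tasks across workers; the sketch only shows how the pieces connect.
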
  10. Storm and Clojure
  13. Storm Pros and Cons
      ● No “exactly once” guarantee
      ● Fast, simple
      ● Multitenancy and debugging
      ● Integrations
  14. Trident
      ● The “Cascading” of Storm
      ● High level abstraction processing library on top of Storm
      ● Rich API with joins, aggregations, grouping, etc.
      ● Provides stateful, exactly-once processing primitives
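
One way to picture the "stateful, exactly-once" point on slide 14: Trident processes tuples in batches that carry a transaction id, and a transactional state can store the last applied id next to the value so that a replayed batch is not counted twice. The sketch below shows that idea in plain Clojure, not the Trident or Marceline API. The names `apply-batch`, `initial-state` and `batches` are hypothetical, and it assumes a replayed batch has the same id and the same contents as the original.

```clojure
(defn apply-batch
  "Adds the batch size to the running count, unless the batch's transaction id
  equals the one already stored with the state (meaning it is a replay)."
  [state {batch-txid :txid, events :events}]
  (if (= (:txid state) batch-txid)
    state                                   ; replayed batch: skip the update
    {:txid  batch-txid
     :count (+ (:count state) (count events))}))

(def initial-state {:txid nil :count 0})

;; Hypothetical input: batch 2 is delivered twice, as after a simulated failure.
(def batches
  [{:txid 1 :events [:a :b]}
   {:txid 2 :events [:c]}
   {:txid 2 :events [:c]}
   {:txid 3 :events [:d :e]}])

(def final-state (reduce apply-batch initial-state batches))
```

Five distinct events are counted even though six were delivered, which is the effect exactly-once state updates aim for.
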
  15. Marceline
      Marceline provides a DSL that allows you to define all of the primitives that Trident has to offer from Clojure
  18. Trident compiles to Storm
  19. Incanter
  21. Incanter and openhub.net
  22. Cascalog
  23. ● Cascading - a Java API
        – defining complex data flows
        – integrating those flows with back-end systems
        – query planner for mapping and executing logical flows onto a computing platform
      ● Cascalog
        – Clojure DSL for Cascading
  24. Cascading Concepts
      ● Decouple application logic from integration
      ● Flow, source, sink, taps, schemes
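
The "decouple application logic from integration" idea on slide 24 can be sketched in plain Clojure: the logic is a pure function over records, while source and sink "taps" are passed in separately. This is not the Cascading or Cascalog API; `clicks-per-campaign`, `vector-source`, `atom-sink` and `run-flow` are hypothetical names, and schemes are not modeled.

```clojure
;; Application logic: a pure function from records to a result.
(defn clicks-per-campaign [records]
  (->> records
       (filter #(= :click (:type %)))
       (map :campaign)
       frequencies))

;; Integration: "taps" modeled as plain functions around in-memory data.
(defn vector-source [data] (fn [] data))               ; source tap: reads records
(defn atom-sink [a] (fn [result] (reset! a result)))   ; sink tap: writes the result

;; A "flow" connects a source, the logic, and a sink.
(defn run-flow [source logic sink]
  (sink (logic (source))))

;; Usage with made-up records; swapping the taps leaves the logic unchanged.
(def out (atom nil))
(run-flow (vector-source [{:type :click :campaign "a"}
                          {:type :view  :campaign "a"}
                          {:type :click :campaign "b"}
                          {:type :click :campaign "a"}])
          clicks-per-campaign
          (atom-sink out))
```

Since the logic never touches files or clusters directly, it can be unit tested against in-memory taps like these.
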
  25. Cascading Pros and Cons
      Pros
      ● Hive: SQL (non-standard); low learning curve; UDF
      ● Pig: Pig Latin; low learning curve; UDF
      ● Cascading: Java API; unit testable; flow control (if, try-catch); good reusability
      Cons
      ● Hive: testability; reusability; flow control; spread logic; UDF programming
      ● Pig: testability; reusability; spread logic; UDF programming
      ● Cascading: programming
  27. https://hortonworks.com/blog/cascading-hadoop-big-data-whatever/
  28. Trident and Cascalog
      ● Trident for Storm is like Cascading for Hadoop
  29. "Simplicity is about living life with more enjoyment and less pain" - John Maeda, https://www.ted.com/speakers/john_maeda
  30. There are also other Clojure tools
      ● Flambo – Clojure DSL for Apache Spark
      ● Riemann (http://riemann.io/) – monitors distributed systems
      ● ...
  31. Thanks! Questions?