Clojure 4 Big Data
Michael Pershyn
2018-11-03
About me and why Clojure 4 Big Data
● Making software since 2005, working with Big Data since 2012
● Work for ADITION Technologies AG
– Leading European ad-serving provider
– Part of the European tech stack VirtualMinds
– >2.5 billion events per day processed in real time
– Extra ~12 billion data points in (batch) ETL daily
– 250 TB of data in a Hadoop data lake
– Several own data centers
– Low latency requirements
– Written mostly in Clojure
Agenda
● Why Clojure in 3 Minutes
● Apache Storm
● Apache Trident
● Incanter
● Cascalog
Why Clojure?
● Makes you think differently, approach problems differently, and solve them faster
● Immutability, functions and map-reduce
● Powerful, interactive, small, concise
● Makes it hard to fall back to imperative style
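A minimal sketch of the "immutability, functions and map-reduce" point, in plain Clojure with no extra libraries. The event maps and the name clicks-per-campaign are made up for illustration.

```clojure
;; Hypothetical sample data: each event is an immutable map.
(def events
  [{:campaign "a" :type :click}
   {:campaign "b" :type :view}
   {:campaign "a" :type :view}
   {:campaign "a" :type :click}])

;; filter, map and reduce compose into one small pipeline.
(defn clicks-per-campaign [events]
  (->> events
       (filter #(= :click (:type %)))   ; keep clicks only
       (map :campaign)                  ; project the grouping key
       (reduce (fn [acc c] (update acc c (fnil inc 0))) {}))) ; count per key

(clicks-per-campaign events)
;; => {"a" 2}

;; conj returns a new vector; the original value is untouched.
(def more-events (conj events {:campaign "b" :type :click}))
(count events)      ;; => 4
(count more-events) ;; => 5
```

Because the data never changes in place, the same function can be run on either collection without defensive copying.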
Apache Storm
● Distributed realtime computation system
● Apache Top-Level Project since September 2014
● Free and open source
Core Concepts of Storm
● Spouts
● Bolts
● Topology
● Stream
● Cluster (Nimbus & Workers)
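The concepts above can be pictured with a plain-Clojure model of the dataflow. This is an analogy only, not Storm's API: a real spout emits an unbounded stream, and bolts run in parallel on worker nodes. All names here are illustrative.

```clojure
(require '[clojure.string :as str])

;; "Spout": the source of the stream. Finite here; unbounded in Storm.
(defn sentence-spout []
  ["the cow jumped" "the moon"])

;; "Bolt" 1: turns one sentence tuple into several word tuples.
(defn split-bolt [sentence]
  (str/split sentence #"\s+"))

;; "Bolt" 2: folds each word into a running count.
(defn count-bolt [counts word]
  (update counts word (fnil inc 0)))

;; "Topology": the wiring of the spout and bolts into one graph.
(defn run-topology []
  (->> (sentence-spout)
       (mapcat split-bolt)
       (reduce count-bolt {})))

(run-topology)
;; => {"the" 2, "cow" 1, "jumped" 1, "moon" 1}
```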
Storm and Clojure
Storm Pros and Cons
● No “exactly once” guarantee
● Fast, simple
● Multitenancy and debugging
● Integrations
Trident
● The “Cascading” of Storm
● High level abstraction processing library on top of Storm
● Rich API with joins, aggregations, grouping, etc.
● Provides stateful, exactly-once processing primitives
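One way to picture exactly-once state updates: keep the id of the last applied batch next to the state, and skip a batch that was already applied. The sketch below is plain Clojure under that assumption, not Trident or Marceline code; the :txid and :words keys are invented for the example, and batches are assumed to arrive in order.

```clojure
;; State holds the counts plus the id of the last batch applied to them.
(def empty-state {:txid nil :counts {}})

(defn apply-batch [state {:keys [txid words]}]
  (if (= txid (:txid state))
    state                                   ; replayed batch: already counted
    {:txid   txid
     :counts (merge-with + (:counts state) (frequencies words))}))

;; Batch 2 is delivered twice, as happens after a failure and replay.
(def batches
  [{:txid 1 :words ["a" "b"]}
   {:txid 2 :words ["a"]}
   {:txid 2 :words ["a"]}])

(:counts (reduce apply-batch empty-state batches))
;; => {"a" 2, "b" 1}
```

Without the txid check, the replayed batch would push the count for "a" to 3.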
Marceline
Marceline provides a DSL that allows you to define all of
the primitives that Trident has to offer from Clojure
Trident compiles to Storm
Incanter
Incanter and openhub.net
Cascalog
● Cascading: a Java API for
– defining complex data flows
– integrating those flows with back-end systems
– a query planner that maps and executes logical flows on a computing platform
● Cascalog – Clojure DSL for Cascading
Cascading Concepts
● Decouple application logic from integration
● Flow, source, sink, taps, schemes
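The decoupling idea can be sketched in plain Clojure: the logic is a pure function, and the source and sink are passed in. This is a model of the concept, not the Cascading or Cascalog API; run-flow, adults and the sample records are hypothetical names.

```clojure
;; Application logic: knows nothing about where data comes from or goes.
(defn adults [people]
  (filter #(>= (:age %) 18) people))

;; A "flow" connects a source, the logic and a sink.
(defn run-flow [source logic sink]
  (sink (logic (source))))

;; In-memory stand-ins for taps, which makes the logic easy to unit test.
(def out (atom []))

(run-flow (fn [] [{:name "ann" :age 31} {:name "bob" :age 12}])
          adults
          (fn [rows] (reset! out (vec rows))))

@out
;; => [{:name "ann", :age 31}]
```

Swapping the in-memory source and sink for file or cluster-backed ones would leave adults unchanged.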
Cascading Pros and Cons
Hive
– Pros: SQL (non-standard), low learning curve, UDF
– Cons: testability, reusability, flow control, spread logic, UDF programming
Pig
– Pros: Pig Latin, low learning curve, UDF
– Cons: testability, reusability, spread logic, UDF programming
Cascading
– Pros: Java API, unit testable, flow control (if, try-catch), good reusability
– Cons: programming
https://hortonworks.com/blog/cascading-hadoop-big-data-whatever/
Trident and Cascalog
● Trident for Storm is like Cascading for Hadoop
Simplicity is about living life with more enjoyment and less pain
- John Maeda
https://www.ted.com/speakers/john_maeda
There are also other Clojure tools
● Flambo – Clojure DSL for Apache Spark
● Riemann (http://riemann.io/) – monitors distributed systems
● ...
Thanks!
Questions?
