Michael Pershyn "Clojure 4 Big Data"

Clojure 4 Big Data
Michael Pershyn
2018-11-03

2
About me and why Clojure 4 Big Data
● Making software since 2005, working with Big Data since 2012
● Work for ADITION Technologies AG
– Leading European ad-serving provider
– Part of the European tech stack VirtualMinds
– >2.5 billion events per day processed in real time
– An additional ~12 billion data points in (batch) ETL daily
– 250 TB of data in a Hadoop data lake
– Several own data centers
– Low latency requirements
– Written mostly in Clojure

4
Agenda
● Why Clojure in 3 Minutes
● Apache Storm
● Apache Trident
● Incanter
● Cascalog

6
Why Clojure
● Makes you think differently, approach problems differently, and solve them faster
● Immutability, functions and map-reduce
● Powerful, interactive, small, concise
● Makes it hard to fall back to imperative style
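
The "immutability, functions and map-reduce" bullet can be illustrated with a small sketch. It is written in Python rather than Clojure purely for illustration; the event data and all function names are made up and do not come from the talk.

```python
from functools import reduce
from types import MappingProxyType

# Hypothetical ad events; a tuple of tuples cannot be modified in place.
EVENTS = (
    ("campaign-a", "click"),
    ("campaign-b", "view"),
    ("campaign-a", "view"),
    ("campaign-a", "click"),
)


def to_pair(event):
    """Map step: turn one event into a (key, 1) pair."""
    campaign, kind = event
    return ((campaign, kind), 1)


def merge(counts, pair):
    """Reduce step: return a NEW dict instead of mutating the old one."""
    key, n = pair
    return {**counts, key: counts.get(key, 0) + n}


def count_events(events):
    """Pure function: the same input always yields the same read-only result."""
    return MappingProxyType(reduce(merge, map(to_pair, events), {}))


COUNTS = count_events(EVENTS)
```

Because no step mutates shared data, the map and reduce steps could in principle be split across machines without coordination, which is the property the Big Data tools in the rest of the talk rely on.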

8
Apache Storm
● Distributed realtime computation system
● Apache Top-Level Project since September 2014
● Free and open source

9
Core Concepts of Storm
● Spouts
● Bolts
● Topology
● Stream
● Cluster (Nimbus & Workers)
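
How these concepts fit together can be sketched as a toy, single-process model. This is not the Storm API: the generator functions below are hypothetical stand-ins for a spout (a source of tuples), bolts (processing steps), streams (the values flowing between them), and a topology (the wiring).

```python
from collections import Counter


def sentence_spout():
    """Spout: a source of tuples (here a fixed, finite list of sentences)."""
    for sentence in ("storm is fast", "clojure is concise", "storm is simple"):
        yield sentence


def split_bolt(stream):
    """Bolt: consumes a stream of sentences and emits one tuple per word."""
    for sentence in stream:
        for word in sentence.split():
            yield word


def count_bolt(stream):
    """Bolt: keeps a running count per word and emits (word, count) updates."""
    counts = Counter()
    for word in stream:
        counts[word] += 1
        yield word, counts[word]


def run_topology():
    """Topology: the wiring spout -> split_bolt -> count_bolt."""
    latest = {}
    for word, count in count_bolt(split_bolt(sentence_spout())):
        latest[word] = count
    return latest


WORD_COUNTS = run_topology()
```

A real Storm cluster runs many parallel instances of each spout and bolt across worker processes and the stream is unbounded; the sketch only shows the shape of the data flow.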

13
Storm Pros and Cons
● No “exactly once” guarantee
● Fast, simple
● Multitenancy and debugging
● Integrations

14
Trident
● The “Cascading” of Storm
● High level abstraction processing library on top of Storm
● Rich API with joins, aggregations, grouping, etc.
● Provides stateful, exactly-once processing primitives
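
One common way to get exactly-once state updates on top of a system that may replay data is to store the id of the last applied batch next to the state and to skip a batch whose id has already been applied. The sketch below is a simplified, hypothetical model of that idea, not Trident's API; it assumes batches arrive in order and that a replayed batch has the same id and contents as the original.

```python
class TransactionalCounter:
    """Counter whose updates are idempotent per batch (transaction) id."""

    def __init__(self):
        self.value = 0
        self.last_txid = None  # id of the last batch applied to `value`

    def apply_batch(self, txid, items):
        """Apply a batch once; return False if it was a replay and was skipped."""
        if txid == self.last_txid:
            return False  # already counted, so the replay changes nothing
        self.value += len(items)
        self.last_txid = txid
        return True


counter = TransactionalCounter()
applied = [
    counter.apply_batch(1, ["a", "b"]),
    counter.apply_batch(2, ["c"]),
    counter.apply_batch(2, ["c"]),  # simulated replay of batch 2 after a failure
    counter.apply_batch(3, ["d", "e"]),
]
```

Without the `last_txid` check, the replay of batch 2 would be counted twice, which is the situation the "No exactly once guarantee" point about plain Storm refers to.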

15
Marceline
Marceline provides a DSL that allows you to define all of
the primitives that Trident has to offer from Clojure

23
● Cascading - a Java API
– defining complex data flows
– integrating those flows with back-end systems
– query planner for mapping and executing logical flows onto a computing platform
● Cascalog – Clojure DSL for Cascading

24
Cascading Concepts
● Decouple application logic from integration
● Flow, source, sink, taps, schemes
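
The decoupling idea can be sketched with a toy model. This is not the Cascading API: `TextTap` and `run_flow` are hypothetical names, with a tap standing for "where the data lives", its delimiter standing in for a scheme, and the flow connecting a source tap to a sink tap around logic that knows nothing about storage.

```python
import io


def count_by_first_field(rows):
    """Application logic: knows nothing about files, HDFS, or databases."""
    totals = {}
    for row in rows:
        key = row[0]
        totals[key] = totals.get(key, 0) + 1
    return sorted(totals.items())


class TextTap:
    """Hypothetical tap: a text stream plus a scheme (here a field delimiter)."""

    def __init__(self, stream, delimiter="\t"):
        self.stream = stream
        self.delimiter = delimiter

    def read(self):
        """Yield each non-empty line as a tuple of fields."""
        for line in self.stream.read().splitlines():
            if line:
                yield tuple(line.split(self.delimiter))

    def write(self, rows):
        """Write each row as one delimited line."""
        for row in rows:
            self.stream.write(self.delimiter.join(str(f) for f in row) + "\n")


def run_flow(source, sink, logic):
    """Flow: read from the source tap, apply the logic, write to the sink tap."""
    sink.write(logic(source.read()))


source = TextTap(io.StringIO("hive\tq1\npig\tq2\nhive\tq3\n"))
sink = TextTap(io.StringIO())
run_flow(source, sink, count_by_first_field)
RESULT = sink.stream.getvalue()
```

Swapping the in-memory streams for files or another backend would only touch the taps, which is also what makes the logic easy to unit test.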

25
Cascading Pros and Cons
Pros
● Hive: SQL (non-standard); low learning curve; UDF
● Pig: Pig Latin; low learning curve; UDF
● Cascading: Java API; unit testable; flow control (if, try-catch); good reusability
Cons
● Hive: testability; reusability; flow control; spread logic; UDF programming
● Pig: testability; reusability; spread logic; UDF programming
● Cascading: programming

27
https://hortonworks.com/blog/cascading-hadoop-big-data-whatever/

28
Trident and Cascalog
● Trident for Storm is like Cascading for Hadoop

29
Simplicity is about living life with more enjoyment and less pain
- John Maeda
https://www.ted.com/speakers/john_maeda

30
There are also other Clojure tools
● Flambo – Clojure DSL for Apache Spark
● Riemann (http://riemann.io/) – monitors distributed systems
● ...
