2. Different Computation Engines
● Apache Spark
○ Large Scale Data Processing
○ Streaming Support
○ Micro Batch Support
○ Written in Scala
○ Supports Python and Java too
4. Different Computation Engines
● Apache Flink
○ Stream first and then batch
○ Streaming Data Flow Engine
○ Written in Java and Scala
5. Why Clojure?
Clojure is...
● A dialect of Lisp
● Java interop
● Emphasizes functional programming
● Runs on the Java Virtual Machine
● Designed for Concurrency
8. Clojure Collections
Lists
● Singly linked lists
● First item in calling position
● Heterogeneous elements
'(1 2 3 4 "foo" :bar)
Vectors
● Simply evaluate each item in order
● Fast lookups
● Heterogeneous elements
[1 2 3 4 "foo" :bar]
9. Clojure Collections
Maps
● Maps store keys and values
{:name "abhishek"}
Sets
● Store zero or more unique items
#{:a :b 1 2 3}
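A minimal sketch of the four collection literals from these slides, as they might be tried at a REPL. The var names (a-list, a-vector, a-map, a-set) are illustrative only.

```clojure
;; Illustrative names; the literals mirror the slides.
(def a-list   '(1 2 3 4 "foo" :bar))   ; quoted, so it is data rather than a call
(def a-vector [1 2 3 4 "foo" :bar])
(def a-map    {:name "abhishek"})
(def a-set    #{:a :b 1 2 3})

(first a-list)        ; head of the linked list
(nth a-vector 4)      ; indexed lookup into the vector
(:name a-map)         ; a keyword acts as a lookup function on a map
(contains? a-set :a)  ; membership test
(conj a-set :a)       ; adding an existing item leaves the set unchanged
```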
10. Clojure Functions
● First Class
● Higher-Order
● Pure Functions
def or defn?
● Both bind a value to a symbol (a name)
● def evaluates its expression only once, when the definition is evaluated
● defn defines a function whose body is evaluated every time it is called
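One way to see the difference is to give both forms the same side effect. This is a small sketch; counter, once, and every-time are made-up names.

```clojure
(def counter (atom 0))

;; def: (swap! counter inc) runs once, right here, and `once` is bound to the result.
(def once (swap! counter inc))

;; defn: the body runs again on every call.
(defn every-time []
  (swap! counter inc))

once          ; still the value computed when the def was evaluated
(every-time)  ; bumps the counter
(every-time)  ; bumps it again
```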
12. What Onyx brings?
● Clojure Philosophy
● How to rethink data
● How to write programs for distributed systems
13. What Onyx is?
● Masterless
● Cloud Scale
● Fault Tolerant
● High Performance Distributed Computation System
● Batch and Stream hybrid processing model
● Written in Clojure
14. Onyx Program
● Read data from the source
● Transform the data into various sources
● Write the data into the target
29. Sneak Peek into ZooKeeper
● Apache ZooKeeper is an open-source tool from Apache.
● Originally developed at Yahoo.
● ZooKeeper is written in Java and is platform independent.
● The ZooKeeper service can run in 2 modes
○ Standalone
○ Quorum
30. How to interact with ZooKeeper?
● ZooKeeper CLI
○ create /avengers "infinitywar"
○ get /avengers
○ get /avengers [watch] 1
○ set /avengers endgame
○ delete /avengers
○ ls /
○ stat /avengers
31. How to interact with ZooKeeper?
● Exhibitor
○ git@github.com:soabase/exhibitor.git
Guarantee: exactly once
Throughput is high compared to other engines
Truthy: literally any value except false or nil
First class: created on demand and stored in a data structure
Higher order: takes one or more functions as arguments, or returns a function as its result
Pure: always returns the same result for the same input, with no side effects
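The notes on truthiness and on first-class, higher-order, and pure functions can be illustrated with a few lines of plain Clojure. All names here (square, twice, adder, ops) are invented for the sketch.

```clojure
;; Pure: the result depends only on the argument and nothing else is touched.
(defn square [x] (* x x))

;; Higher-order: takes a function as an argument...
(defn twice [f x] (f (f x)))

;; ...or returns a function as its result.
(defn adder [n] (fn [x] (+ x n)))

;; First class: functions created on demand and stored in a data structure.
(def ops {:square square, :add-ten (adder 10)})

(twice square 3)
((:add-ten ops) 5)
(if 0 :truthy :falsey)    ; 0 is truthy in Clojure
(if nil :truthy :falsey)  ; only nil and false are falsey
```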
Created by Michael Drogalis. It's just a library.
The workflow model for an Onyx program is simple: the computation is described as a DAG (directed acyclic graph).
A workflow is the structural specification of an Onyx program. Its purpose is to articulate the paths that data flows through the cluster at runtime. It is specified via a directed, acyclic graph.
The workflow representation is a Clojure vector of vectors. Each inner vector contains exactly two elements, which are keywords. The keywords represent nodes in the graph, and the vector represents a directed edge from the first node to the second. The leaf nodes are connected to plugins for the source and the target. The middle nodes are Clojure functions. Here, increment maps to a Clojure function that takes each value coming from in, applies the function, and passes the result on to out.
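A sketch of what such a workflow could look like for the in, increment, and out tasks mentioned in these notes. The tasks helper is hypothetical and not part of Onyx; it only shows that the workflow is ordinary data.

```clojure
;; Each inner vector is one directed edge: [from-task to-task].
(def workflow
  [[:in :increment]
   [:increment :out]])

;; Hypothetical helper (not an Onyx function): the distinct tasks named by the edges.
(defn tasks [workflow]
  (distinct (apply concat workflow)))

(tasks workflow)
```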
Other frameworks define their own data types, such as RDDs in Spark, Flink's primitives, and splits in Hadoop. In Onyx, a segment is the unit of data, and it's represented by a Clojure map. Segments represent the data flowing through the cluster. Segments are the only shape of data that Onyx allows you to emit between functions. You take maps, keep transforming them, and finally emit maps.
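Because a segment is just a map, a task function can be an ordinary Clojure function from map to map. A minimal sketch, assuming segments carry a number under an illustrative :n key:

```clojure
;; Takes one segment (a map) and returns a new segment (also a map).
;; The :n key is an assumption made for this example.
(defn my-inc [segment]
  (update segment :n inc))

(my-inc {:n 41})
```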
A task is the smallest unit of work in Onyx. It represents an activity of either input, processing, or output.
All inputs, outputs, and functions in a workflow must be described via a catalog. A catalog is a vector of maps, strikingly similar to Datomic's schema. Configuration and docstrings are described in the catalog. You define the catalog with an entry for each task, such as input, increment, and output. In and out are connected to plugins, and increment is a Clojure function. Here we are using core.async. The core.async plugin is mostly for development.
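A sketch of a catalog for the in, increment, and out tasks, using the core.async plugin. The keys follow the Onyx documentation as I recall it, so check them against the Onyx version you run. The my.app namespace and the batch sizes are assumptions.

```clojure
(def catalog
  [;; Input task, backed by the core.async plugin.
   {:onyx/name :in
    :onyx/plugin :onyx.plugin.core-async/input
    :onyx/type :input
    :onyx/medium :core.async
    :onyx/batch-size 10
    :onyx/max-peers 1
    :onyx/doc "Reads segments from a core.async channel"}

   ;; Function task; :my.app/my-inc is a hypothetical fully qualified function name.
   {:onyx/name :increment
    :onyx/fn :my.app/my-inc
    :onyx/type :function
    :onyx/batch-size 10
    :onyx/doc "Increments a value in each segment"}

   ;; Output task, also backed by the core.async plugin.
   {:onyx/name :out
    :onyx/plugin :onyx.plugin.core-async/output
    :onyx/type :output
    :onyx/medium :core.async
    :onyx/batch-size 10
    :onyx/max-peers 1
    :onyx/doc "Writes segments to a core.async channel"}])
```

Each :onyx/name matches a keyword used in the workflow, which is how the graph and the task descriptions are tied together.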
Flow conditions specify, on a segment-by-segment basis, which direction data should flow, as determined by predicate functions. This is helpful for conditionally processing a segment based on its content.
Let's say we want to ignore incoming data that is not even. A flow condition makes sure the data satisfies that condition before it goes on to processing. Everything is a Clojure function; there is no special API. Predicates return a boolean.
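A sketch of the "only even values" case. The four-argument predicate signature and the :flow/* keys follow the Onyx documentation as I recall it. The :n key, the my.app namespace, and the task names are assumptions.

```clojure
;; Predicate: Onyx passes the event map, the old segment, the new segment,
;; and all new segments. Only the new segment is inspected here.
(defn even-segment?
  [event old-segment new-segment all-new-segments]
  (even? (:n new-segment)))

;; Route a segment from :in to :increment only when the predicate returns true.
(def flow-conditions
  [{:flow/from :in
    :flow/to [:increment]
    :flow/predicate :my.app/even-segment?
    :flow/doc "Only pass segments whose :n is even on to :increment"}])
```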
Onyx is not just for batch processing; it also does stream processing, using something called windows. Data streams in from multiple sources. You define windows and attach them to a task. A window defines how the incoming data is collected into groups. You can have a fixed window, which covers one time span after another; a sliding window, where the time spans overlap; or a global window, which covers all events together.
Fixed :- Used for aggregations such as how many registrations happened within a certain time period.
Sliding :- How many registrations happened in the last 10 minutes; you can ask the same question again 5 minutes later and Onyx does the computation.
Global :- How many times a word occurred.
Session:- Session windows are windows that dynamically resize their upper and lower bounds in reaction to incoming data. Sessions capture a time span of activity for a specific key, such as a user ID. If no activity occurs within a timeout gap, the session closes. If an event occurs within the bounds of a session, the window size is fused with the new event, and the session is extended by its timeout gap either in the forward or backward direction.
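The four window types could be declared roughly like this. The :window/* keys and the count aggregation follow the Onyx documentation as I recall it. The window ids, the task name, and the :event-time and :user-id keys are invented for illustration.

```clojure
(def windows
  [;; Fixed: registrations per one-hour bucket.
   {:window/id :registrations-per-hour
    :window/task :increment
    :window/type :fixed
    :window/aggregation :onyx.windowing.aggregation/count
    :window/window-key :event-time
    :window/range [1 :hour]}

   ;; Sliding: a 10-minute range that advances every 5 minutes.
   {:window/id :registrations-last-10-min
    :window/task :increment
    :window/type :sliding
    :window/aggregation :onyx.windowing.aggregation/count
    :window/window-key :event-time
    :window/range [10 :minutes]
    :window/slide [5 :minutes]}

   ;; Global: one window spanning every segment seen so far.
   {:window/id :total-count
    :window/task :increment
    :window/type :global
    :window/aggregation :onyx.windowing.aggregation/count}

   ;; Session: per-key windows that close after a gap with no activity.
   {:window/id :user-sessions
    :window/task :increment
    :window/type :session
    :window/aggregation :onyx.windowing.aggregation/count
    :window/window-key :event-time
    :window/session-key :user-id
    :window/timeout-gap [5 :minutes]}])
```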
To do something with the data in a window we use triggers. You define a trigger for a given window.
Timer :- This trigger sleeps for a duration of :trigger/period. When it is done sleeping, the :trigger/sync function is invoked with its usual arguments. The trigger goes back to sleep and repeats itself.
Segment :- The trigger wakes up in reaction to a new segment being processed. It only fires once every :trigger/threshold segments.
Punctuation :- The trigger wakes up in reaction to a new segment being processed. It only fires if :trigger/pred evaluates to true.
Watermark :- The trigger only fires if the value of :window/window-key in the segment exceeds the upper bound in the extent of an active window.
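A sketch of how the four kinds of trigger described in these notes might be declared. The :trigger/period, :trigger/threshold, :trigger/pred, and :trigger/sync keys come from the notes. The namespaced :trigger/on values are written the way I recall them from later Onyx releases (earlier releases used bare keywords such as :timer), so treat them as assumptions. The window id and the my.app functions are hypothetical, and version-specific keys such as trigger ids and refinements are left out.

```clojure
(def triggers
  [;; Timer: fire every 5 seconds.
   {:trigger/window-id :registrations-per-hour
    :trigger/on :onyx.triggers/timer
    :trigger/period [5 :seconds]
    :trigger/sync :my.app/write-state!}

   ;; Segment: fire once every 5 segments.
   {:trigger/window-id :registrations-per-hour
    :trigger/on :onyx.triggers/segment
    :trigger/threshold [5 :elements]
    :trigger/sync :my.app/write-state!}

   ;; Punctuation: fire when the predicate returns true.
   {:trigger/window-id :registrations-per-hour
    :trigger/on :onyx.triggers/punctuation
    :trigger/pred :my.app/end-of-batch?
    :trigger/sync :my.app/write-state!}

   ;; Watermark: fire when a segment's window key passes the window's upper bound.
   {:trigger/window-id :registrations-per-hour
    :trigger/on :onyx.triggers/watermark
    :trigger/sync :my.app/write-state!}])
```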
Apache ZooKeeper is used as both storage and communication layer. ZooKeeper takes care of things like CAS, consensus, and atomic counting. ZooKeeper watches are at the heart of how Onyx virtual peers detect machine failure.
A Peer is a node in the cluster responsible for processing data. A peer generally refers to a physical machine, as it's typical to run only one peer per machine.
A Virtual Peer refers to a single concurrent worker running on a single physical machine. Each virtual peer spawns a small number of threads since it uses asynchronous messaging. All virtual peers are equal, whether they are on the same physical machine or not. Virtual peers communicate segments directly to one another, and coordinate strictly via the log in
In a masterless design, there is no single entity that assigns tasks to peers. Instead, peers need to contend for tasks to execute as jobs are submitted to Onyx.
Peers depend on ZooKeeper. All peers are equal. Each peer works on one job at a time, and on one task of that job at a time. When we talk about peers there are two terms: messaging and coordination.
Messaging :- This is about the data flowing between peers. A message means a peer receives some data and passes it on to another peer. Any peer can talk to any other peer.
Coordination :- If the peers do not coordinate with each other directly, how does the cluster form? This is where ZooKeeper comes into the picture. Everything about the peers revolves around ZooKeeper.
All peers write to an immutable, append-only log in ZooKeeper.
What they write to the log is vectors of maps.
Each map has one key for the function to be executed and another key for its arguments.
All functions written to the log in ZooKeeper are pure and idempotent.
Peers depend only on the ZooKeeper log.
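The log idea can be sketched with plain data and a reducer: each entry names a function and its arguments, and every peer replays the same entries to arrive at the same state. The entry names (:add-peer, :remove-peer) and the shape of the state are invented for illustration; Onyx's real log entries use their own function names and a richer replica.

```clojure
;; Hypothetical log: a vector of maps, each with a function name and its arguments.
(def log-entries
  [{:fn :add-peer    :args {:id :p1}}
   {:fn :add-peer    :args {:id :p2}}
   {:fn :remove-peer :args {:id :p1}}])

;; Dispatch on the :fn key of the entry.
(defmulti apply-entry (fn [replica entry] (:fn entry)))

;; Pure and idempotent: adding a peer that is already in the set changes nothing.
(defmethod apply-entry :add-peer [replica {:keys [args]}]
  (update replica :peers conj (:id args)))

(defmethod apply-entry :remove-peer [replica {:keys [args]}]
  (update replica :peers disj (:id args)))

;; Every peer replays the same log from the same starting state.
(defn replay [entries]
  (reduce apply-entry {:peers #{}} entries))

(replay log-entries)
```

Because the entries are applied by pure functions, two peers that read the same log independently end up with identical state without talking to each other.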
High and predictable performance is the main advantage of Aeron. It is most useful in applications that require low latency, high throughput (e.g. sending large files), or both (Akka remoting uses Aeron).
Aeron uses unidirectional connections. If you need to send requests and receive responses, you should use two connections.
A Publisher and a Media Driver are used to send a message; a Subscriber and a Media Driver are used to receive one. The client talks to the Media Driver via shared memory.
It's the default messaging component in Onyx. It also provides multiplexing and short-circuiting.
Suppose that at some point on the monotonic clock, peer 5 wants to join the cluster. To join, peer 5 needs to go through the three-phase cluster join strategy. As you can see, every peer keeps a watch on another peer in ZooKeeper.
They form a ring by maintaining a data structure: each peer knows which peer it is pointing to. Peers 1-4 are in the ring and peer 5 wants to join it.
When a new peer comes up, it reads the whole ZooKeeper log to bring itself up to date, then announces that it is available and ready to join the ring. Peer 5 initiates phase 1 of the join protocol, and peer 1 prepares to let peer 5 join the ring by placing a watch on it.
Now peer 5 knows which peer is watching it, and it takes over the watch that peer 1 held earlier: peer 5 says, "OK, I'll keep a watch on peer 4."
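The effect of the join on the ring can be sketched as a map from each peer to the peer it watches. This collapses the three phases into a single step and is not Onyx's implementation. The starting ring (peer 1 watching peer 4, and so on) is an assumption based on the notes.

```clojure
;; Ring as a map: watcher -> watched peer. Starting layout is assumed.
(def ring {1 4, 4 3, 3 2, 2 1})

;; The existing watcher starts watching the joiner, and the joiner takes over
;; the watch that the watcher held before.
(defn join-ring [ring watcher joiner]
  (let [previously-watched (get ring watcher)]
    (assoc ring
           watcher joiner
           joiner previously-watched)))

(join-ring ring 1 5)
```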
The Greedy job scheduler allocates all peers to each job in the order that it was submitted.
The Balanced job scheduler allocates peers in a rotating fashion to jobs that were submitted.
The Percentage job scheduler allows jobs to be submitted with a percentage value. The percentage value indicates what percentage of the cluster will be allocated to this job.
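As a sketch, the job scheduler is chosen in the peer configuration and a job declares its share when the percentage scheduler is in use. The key names below are written as I recall them from the Onyx documentation and should be verified against your version. The job map is abbreviated and the 30 is an arbitrary example value.

```clojure
;; Cluster-wide choice of job scheduler, made in the peer configuration.
;; Other values would be :onyx.job-scheduler/greedy or :onyx.job-scheduler/balanced.
(def peer-config
  {:onyx.peer/job-scheduler :onyx.job-scheduler/percentage})

;; Abbreviated job map (catalog and other keys omitted): this job asks for
;; 30 percent of the cluster's peers.
(def job
  {:workflow [[:in :increment] [:increment :out]]
   :task-scheduler :onyx.task-scheduler/balanced
   :percentage 30})
```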
It’s often the case that a set of machines in your cluster are privileged in some way. Perhaps they are running special hardware, or they live in a specific data center, or they have a license to use a proprietary database. Sometimes, you’ll have Onyx jobs that require tasks to run on a predetermined set of machines. Tags are a feature that let peers denote "capabilities". Tasks may declare which tags peers must have in order to be selected to execute them.
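A sketch of how tags might be declared and matched. The :onyx.peer/tags and :onyx/required-tags keys are given as I recall them from the Onyx documentation. The :datomic tag, the task name, and the eligible? helper are illustrative; the helper is not part of Onyx.

```clojure
(require '[clojure.set :as set])

;; A peer advertises its capabilities in its peer configuration.
(def peer-config {:onyx.peer/tags [:datomic]})

;; A catalog entry declares which tags a peer must have to run the task.
(def catalog-entry
  {:onyx/name :read-datoms
   :onyx/type :input
   :onyx/required-tags [:datomic]})

;; Hypothetical helper: a peer is eligible when it has every required tag.
(defn eligible? [peer-config catalog-entry]
  (set/subset? (set (:onyx/required-tags catalog-entry))
               (set (:onyx.peer/tags peer-config))))

(eligible? peer-config catalog-entry)
```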