This document discusses Dimensions Computation (DC), a technique for handling massive real-time data streams. DC extracts meaningful statistics and identifies trends over time by isolating the effects of different variables. It works by splitting data into Tuples containing Dimensions and Measures, where Dimensions are the variables that can impact the Measures; DC then produces Aggregations of the Measures over combinations of Dimensions. It is implemented on Apache Apex, a platform for building distributed, fault-tolerant applications on Hadoop for real-time streaming data. DC is available through DataTorrent, and resources for learning more are provided.
2. What’s The Problem?
Requirements
● Handle massive amounts of data flowing into the system at all times.
● Extract meaningful statistics (aggregations) from the data in real time.
● Isolate the effects of different variables in real time.
● Identify trends over time, and observe changes in real time.
Who Cares?
● AdTech
● Telecom
● Industrial companies
● Appliance companies
● And many more
4. How Does DC Work?
Data Assumptions
● Our data is split into discrete pieces called Tuples.
● Each Tuple contains a set of Dimensions and a set of Measures.
● Measures are the pieces of information we want to collect statistics about.
● Dimensions are the variables which can impact our Measures.
● Each Tuple contains the same Dimensions and Measures.
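The Tuple structure described above can be sketched as a plain Java class. The specific dimensions and measures here (publisher, advertiser, location, clicks, cost) are hypothetical examples for an AdTech-style use case, not taken from the source:

```java
// A minimal sketch of a Tuple: a fixed set of Dimensions plus a fixed
// set of Measures. Every tuple in a stream carries the same fields.
public class AdEvent {
    // Dimensions: variables that can impact the measures.
    public final String publisher;
    public final String advertiser;
    public final String location;

    // Measures: the values we want to collect statistics about.
    public final long clicks;
    public final double cost;

    public AdEvent(String publisher, String advertiser, String location,
                   long clicks, double cost) {
        this.publisher = publisher;
        this.advertiser = advertiser;
        this.location = location;
        this.clicks = clicks;
        this.cost = cost;
    }
}
```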
5. How Does DC Work?
Processing Assumptions
● Dimensions Computation produces Aggregations of our Measures using Aggregators.
● Aggregators are Commutative and Associative operations that are performed on Measures.
● An Aggregation is represented by a Dimension Combination.
● Dimension Combinations are unique subsets of Dimensions.
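A SUM aggregator is the canonical example of a commutative and associative Aggregator. The sketch below (class and method names are hypothetical) shows why those two properties matter: partial aggregations can be merged in any order and still give the same result:

```java
// A minimal sketch of an Aggregator: a commutative, associative
// operation folded over measures.
public class SumAggregator {
    private long total = 0;

    // Fold one tuple's measure into the running aggregation.
    public void aggregate(long measure) {
        total += measure;
    }

    // Because SUM is commutative and associative, two partial
    // aggregations can be merged in any order with the same result.
    public void merge(SumAggregator other) {
        total += other.total;
    }

    public long getTotal() {
        return total;
    }
}
```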
6. How Does DC Work? (Example)
1. Take a Tuple.
2. Extract the Dimension Combinations.
3. Each Dimension Combination has an Aggregation.
4. Add the Tuple's Measures to the Aggregation for each extracted Dimension Combination.
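The steps above can be sketched in plain Java: for each incoming tuple, build a key for every configured Dimension Combination and add the tuple's measure to that combination's aggregation. This is a conceptual sketch with hypothetical names, not the actual DC implementation:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Sketch of Dimensions Computation over a single SUM measure.
public class DimensionsComputation {
    // One running aggregation (a sum) per dimension-combination key.
    private final Map<String, Long> aggregations = new HashMap<>();

    // The unique subsets of dimensions to aggregate over, e.g.
    // [], [publisher], [publisher, advertiser].
    private final List<List<String>> combinations;

    public DimensionsComputation(List<List<String>> combinations) {
        this.combinations = combinations;
    }

    public void process(Map<String, String> dimensions, long measure) {
        for (List<String> combo : combinations) {
            // Build a stable key from the dimension values in this combination.
            TreeMap<String, String> key = new TreeMap<>();
            for (String d : combo) {
                key.put(d, dimensions.get(d));
            }
            // Add the tuple's measure to this combination's aggregation.
            aggregations.merge(key.toString(), measure, Long::sum);
        }
    }

    public Map<String, Long> getAggregations() {
        return aggregations;
    }
}
```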
9. What About Other Aggregators?
Time Bucketing
● Aggregations every minute, hour, and day
Non-Commutative and Non-Associative Aggregators
● Average
● Standard Deviation
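Average is not associative on its own (an average of averages is not, in general, the overall average), but it can be derived from two aggregations that are commutative and associative: SUM and COUNT. The sketch below illustrates this standard trick; it is an assumption about the general approach, not the library's exact implementation:

```java
// AVERAGE recovered from commutative/associative pieces: keep a
// running (sum, count) pair instead of the average itself.
public class AverageAggregator {
    private double sum = 0;
    private long count = 0;

    public void aggregate(double measure) {
        sum += measure;
        count++;
    }

    // Partial (sum, count) pairs merge in any order.
    public void merge(AverageAggregator other) {
        sum += other.sum;
        count += other.count;
    }

    public double getAverage() {
        return count == 0 ? 0 : sum / count;
    }
}
```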
11. What is Apache Apex?
● Distributed software platform for Big Data
● Runs on Hadoop
● Real-time streaming data
● Fault-tolerant
12. Anatomy Of An Apache Apex App
● Tuple: A discrete unit of information sent from one operator to another.
● Operator: Java code that performs an operation on tuples. The code runs in a Hadoop container on a Hadoop cluster.
● DAG: Operators can be connected to form an application. Tuple transfer between operators is one-way, so the application forms a Directed Acyclic Graph (DAG).
● Window Id: An id associated with Tuples and Operators, used for fault tolerance.
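The operator and DAG concepts can be modeled in a few lines of plain Java. This is only a conceptual sketch of one-way tuple flow through a chain of operators; real Apache Apex operators use the Apex operator API and run in Hadoop containers:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Toy model of a linear DAG: operators transform tuples and are wired
// one-way, so tuples flow strictly downstream.
public class MiniDag<T> {
    private final List<Function<T, T>> operators = new ArrayList<>();

    // Connect another operator downstream.
    public MiniDag<T> addOperator(Function<T, T> op) {
        operators.add(op);
        return this;
    }

    // Push one tuple through the chain of operators.
    public T process(T tuple) {
        for (Function<T, T> op : operators) {
            tuple = op.apply(tuple);
        }
        return tuple;
    }
}
```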
14. Scaling An Apache Apex App
● Partition: A copy of an Operator that processes a subset of the data intended for the Operator.
● Unifier: An Operator that combines the Tuples produced by upstream operators.
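Partitioning works for Dimensions Computation precisely because the aggregators are commutative and associative: each partition aggregates its own slice of the data, and a unifier combines the partial results. A minimal sketch with SUM (method names are hypothetical):

```java
import java.util.List;

// Each partition sums its slice; the unifier merges the partial sums.
public class PartitionedSum {
    // One partition: aggregate a subset of the tuples.
    public static long partitionSum(List<Long> slice) {
        long total = 0;
        for (long v : slice) {
            total += v;
        }
        return total;
    }

    // Unifier: combine the partial aggregations produced upstream.
    // Order doesn't matter because SUM is commutative and associative.
    public static long unify(List<Long> partials) {
        long total = 0;
        for (long p : partials) {
            total += p;
        }
        return total;
    }
}
```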
16. Apache Apex For Dimensions Computation
● Short-term aggregations are done in-memory by the DC Operator.
● The results of the in-memory aggregations are unified.
● Long-lived aggregations are managed by the Store Operator, which spools data to disk (HDFS).
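The two-tier design above can be sketched as an in-memory map that is periodically flushed into a long-lived store. In this sketch the store is just another map standing in for the Store Operator; the real operator spools data to HDFS:

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of the two tiers: short-term in-memory aggregation plus a
// long-lived store that absorbs periodic flushes.
public class TwoTierAggregation {
    private final Map<String, Long> inMemory = new HashMap<>();
    private final Map<String, Long> store = new HashMap<>();

    // Short-term: aggregate in memory (the DC Operator's role).
    public void aggregate(String key, long measure) {
        inMemory.merge(key, measure, Long::sum);
    }

    // Periodically merge in-memory results into the long-lived store
    // (the Store Operator's role).
    public void flush() {
        for (Map.Entry<String, Long> e : inMemory.entrySet()) {
            store.merge(e.getKey(), e.getValue(), Long::sum);
        }
        inMemory.clear();
    }

    public long storedTotal(String key) {
        return store.getOrDefault(key, 0L);
    }
}
```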