This document discusses Dimensions Computation (DC), a technique for handling massive real-time data streams. DC extracts meaningful statistics and identifies trends over time by isolating the effects of different variables. It works by splitting data into Tuples containing Dimensions and Measures, where Dimensions are the variables that can impact the Measures; DC then produces Aggregations of the Measures over combinations of Dimensions. It is implemented on Apache Apex, a platform for building distributed, fault-tolerant applications on Hadoop for real-time streaming data. DC is available through DataTorrent, and resources for learning more are provided.
2. What’s The Problem?
Requirements
● Handle massive amounts of data flowing into the system at all times.
● Extract meaningful statistics (aggregations) from the data in real time.
● Isolate the effects of different variables in real time.
● Identify trends over time, and observe changes in real time.
Who Cares?
● AdTech
● Telecom
● Industrial companies
● Appliance companies
● And many more
4. How Does DC Work?
Data Assumptions
● Our data is split into discrete pieces called Tuples.
● Each Tuple contains a set of Dimensions and a set of Measures.
● Measures are the pieces of information we want to collect statistics about.
● Dimensions are the variables which can impact our Measures.
● Each Tuple contains the same Dimensions and Measures.
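The Tuple structure described above can be sketched as a plain Java class. The specific dimensions and measures here (publisher, advertiser, location, clicks, cost) are hypothetical examples for an AdTech-style use case, not taken from the source:

```java
// A minimal sketch of a Tuple: a fixed set of Dimensions plus a fixed
// set of Measures. Every tuple in a stream carries the same fields.
public class AdEvent {
    // Dimensions: variables that can impact the measures.
    public final String publisher;
    public final String advertiser;
    public final String location;

    // Measures: the values we want to collect statistics about.
    public final long clicks;
    public final double cost;

    public AdEvent(String publisher, String advertiser, String location,
                   long clicks, double cost) {
        this.publisher = publisher;
        this.advertiser = advertiser;
        this.location = location;
        this.clicks = clicks;
        this.cost = cost;
    }
}
```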
5. How Does DC Work?
Processing Assumptions
● Dimensions Computation produces Aggregations of our Measures using Aggregators.
● Aggregators are Commutative and Associative operations that are performed on Measures.
● An Aggregation is represented by a Dimension Combination.
● Dimension Combinations are unique subsets of Dimensions.
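A SUM aggregator is the canonical example of a commutative and associative Aggregator. The sketch below (class and method names are hypothetical) shows why those two properties matter: partial aggregations can be merged in any order and still give the same result:

```java
// A minimal sketch of an Aggregator: a commutative, associative
// operation folded over measures.
public class SumAggregator {
    private long total = 0;

    // Fold one tuple's measure into the running aggregation.
    public void aggregate(long measure) {
        total += measure;
    }

    // Because SUM is commutative and associative, two partial
    // aggregations can be merged in any order with the same result.
    public void merge(SumAggregator other) {
        total += other.total;
    }

    public long getTotal() {
        return total;
    }
}
```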
6. How Does DC Work? (Example)
1. Take a Tuple.
2. Extract the Dimension Combinations.
3. Each Dimension Combination has an Aggregation.
4. Add the Tuple's Measures to the Aggregation for each extracted Dimension Combination.
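The steps above can be sketched in plain Java: for each incoming tuple, build a key for every configured Dimension Combination and add the tuple's measure to that combination's aggregation. This is a conceptual sketch with hypothetical names, not the actual DC implementation:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Sketch of Dimensions Computation over a single SUM measure.
public class DimensionsComputation {
    // One running aggregation (a sum) per dimension-combination key.
    private final Map<String, Long> aggregations = new HashMap<>();

    // The unique subsets of dimensions to aggregate over, e.g.
    // [], [publisher], [publisher, advertiser].
    private final List<List<String>> combinations;

    public DimensionsComputation(List<List<String>> combinations) {
        this.combinations = combinations;
    }

    public void process(Map<String, String> dimensions, long measure) {
        for (List<String> combo : combinations) {
            // Build a stable key from the dimension values in this combination.
            TreeMap<String, String> key = new TreeMap<>();
            for (String d : combo) {
                key.put(d, dimensions.get(d));
            }
            // Add the tuple's measure to this combination's aggregation.
            aggregations.merge(key.toString(), measure, Long::sum);
        }
    }

    public Map<String, Long> getAggregations() {
        return aggregations;
    }
}
```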
9. What About Other Aggregators?
Time Bucketing
● Aggregations every minute, hour, and day
Non-Commutative and Non-Associative Aggregators
● Average
● Standard Deviation
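Average is not associative on its own (an average of averages is not, in general, the overall average), but it can be derived from two aggregations that are commutative and associative: SUM and COUNT. The sketch below illustrates this standard trick; it is an assumption about the general approach, not the library's exact implementation:

```java
// AVERAGE recovered from commutative/associative pieces: keep a
// running (sum, count) pair instead of the average itself.
public class AverageAggregator {
    private double sum = 0;
    private long count = 0;

    public void aggregate(double measure) {
        sum += measure;
        count++;
    }

    // Partial (sum, count) pairs merge in any order.
    public void merge(AverageAggregator other) {
        sum += other.sum;
        count += other.count;
    }

    public double getAverage() {
        return count == 0 ? 0 : sum / count;
    }
}
```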
11. What is Apache Apex?
● Distributed software platform for Big Data
● Runs on Hadoop
● Real-time streaming data
● Fault-tolerant
12. Anatomy Of An Apache Apex App
● Tuple: A discrete unit of information sent from one operator to another.
● Operator: Java code that performs an operation on tuples. The code runs in a Hadoop container on a Hadoop cluster.
● DAG: Operators can be connected to form an application. Tuple transfer between operators is one-way, so the application forms a Directed Acyclic Graph (DAG).
● Window Id: An id associated with Tuples and Operators, used for fault tolerance.
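The operator and DAG concepts can be modeled in a few lines of plain Java. This is only a conceptual sketch of one-way tuple flow through a chain of operators; real Apache Apex operators use the Apex operator API and run in Hadoop containers:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Toy model of a linear DAG: operators transform tuples and are wired
// one-way, so tuples flow strictly downstream.
public class MiniDag<T> {
    private final List<Function<T, T>> operators = new ArrayList<>();

    // Connect another operator downstream.
    public MiniDag<T> addOperator(Function<T, T> op) {
        operators.add(op);
        return this;
    }

    // Push one tuple through the chain of operators.
    public T process(T tuple) {
        for (Function<T, T> op : operators) {
            tuple = op.apply(tuple);
        }
        return tuple;
    }
}
```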
14. Scaling An Apache Apex App
● Partition: A copy of an Operator that processes a subset of the data intended for the Operator.
● Unifier: An Operator that combines the Tuples produced by upstream operators.
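Partitioning works for Dimensions Computation precisely because the aggregators are commutative and associative: each partition aggregates its own slice of the data, and a unifier combines the partial results. A minimal sketch with SUM (method names are hypothetical):

```java
import java.util.List;

// Each partition sums its slice; the unifier merges the partial sums.
public class PartitionedSum {
    // One partition: aggregate a subset of the tuples.
    public static long partitionSum(List<Long> slice) {
        long total = 0;
        for (long v : slice) {
            total += v;
        }
        return total;
    }

    // Unifier: combine the partial aggregations produced upstream.
    // Order doesn't matter because SUM is commutative and associative.
    public static long unify(List<Long> partials) {
        long total = 0;
        for (long p : partials) {
            total += p;
        }
        return total;
    }
}
```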
16. Apache Apex For Dimensions Computation
● Short-term aggregations are done in-memory by the DC Operator.
● The results of the in-memory aggregations are unified.
● Long-lived aggregations are managed by the Store Operator, which spools data to disk (HDFS).
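The two-tier design above can be sketched as an in-memory map that is periodically flushed into a long-lived store. In this sketch the store is just another map standing in for the Store Operator; the real operator spools data to HDFS:

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of the two tiers: short-term in-memory aggregation plus a
// long-lived store that absorbs periodic flushes.
public class TwoTierAggregation {
    private final Map<String, Long> inMemory = new HashMap<>();
    private final Map<String, Long> store = new HashMap<>();

    // Short-term: aggregate in memory (the DC Operator's role).
    public void aggregate(String key, long measure) {
        inMemory.merge(key, measure, Long::sum);
    }

    // Periodically merge in-memory results into the long-lived store
    // (the Store Operator's role).
    public void flush() {
        for (Map.Entry<String, Long> e : inMemory.entrySet()) {
            store.merge(e.getKey(), e.getValue(), Long::sum);
        }
        inMemory.clear();
    }

    public long storedTotal(String key) {
        return store.getOrDefault(key, 0L);
    }
}
```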