Let's Aggregate
XXXXTREME JOE BEDWELL
EDITION
@BillSlacum
Accumulo Meetup, Sep 23, 2014
Have you heard of...
… TSAR, Summingbird? (Twitter)
… Mesa? (Google)
… Commutative Replicated Data Types?
Each describes a system that pre-computes
aggregations over large datasets using
associative and/or commutative functions.
What's it all about, Alfie?
Imagine we have a dataset that describes
flights between two cities. We'd at some
point run a SQL query similar to SELECT
DISTINCT destination FROM flights WHERE
origin='BWI' to see where everyone is
fleeing from Baltimore.
We can pre-compute, in parallel, this answer
for systems that have too much data to
compute it every time a user queries the DB.
What do we need to pull this off?
We need data structures that can be
combined together. Numbers are a trivial
example of this, as we can combine two
numbers using a function (such as plus and
multiply). There are more advanced data
structures such as matrices,
HyperLogLogPlus, StreamSummary (used for
top-k) and Bloom filters that also have this
property!
val partial: T = op(a, b)
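As a minimal sketch (not from the deck; the trait and instance names are illustrative), a combinable structure is just a type plus a binary op that merges two values into one:

// Illustrative sketch: a combinable structure is a type T plus a merge op
trait Combine[T] {
  def op(a: T, b: T): T
}

// Longs combine by addition...
val longSum = new Combine[Long] {
  def op(a: Long, b: Long): Long = a + b
}

// ...and sets combine by union (handy for building distinct-value sets)
def setUnion[A] = new Combine[Set[A]] {
  def op(a: Set[A], b: Set[A]): Set[A] = a ++ b
}

val partialCount = longSum.op(3L, 4L)                         // 7
val partialSet = setUnion[String].op(Set("BWI"), Set("DIA"))  // Set(BWI, DIA)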
What do we need to pull this off?
We need operations that can be performed in
parallel. Twitter espouses associative
operations, but for our case operations that
are both associative and commutative have
the nicer property that we get correct
results no matter what order we receive the
data in. Common associative operations
(summation, set building) are also
commutative.
op(op(a, b), c) == op(a, op(b, c))
op(a, b) == op(b, a)
Wait a minute, isn't that...
You caught me! It's a commutative monoid!
From Wolfram:
Monoid: A monoid is a set S that is closed
under an associative binary operation and
has an identity element I in S such that for all
a in S, Ia = aI = a
Commutative Monoid: A monoid that is
commutative i.e., a monoid M such that for
every two elements a and b in M, ab=ba.
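A rough sketch of those definitions in code (illustrative only, not from the deck): the identity element is what lets a fold start from "nothing".

// Illustrative sketch: an identity element plus an op that must be
// associative and commutative
trait CommutativeMonoid[T] {
  def identity: T
  def op(a: T, b: T): T
}

object LongSum extends CommutativeMonoid[Long] {
  val identity = 0L
  def op(a: Long, b: Long): Long = a + b
}

// op(identity, a) == op(a, identity) == a, so folding from identity is safe
val total = Seq(1L, 2L, 3L).foldLeft(LongSum.identity)(LongSum.op)  // 6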
Put it to work
The example we're about to see uses
MapReduce and Accumulo. The same can be
accomplished using any processing
framework that supports map and reduce
operations, such as Spark or Storm's Trident
interface.
We need two functions...
Map
– Takes an input datum and turns it into some combinable structure
– Like parsing strings to numbers, or creating single-element sets for combining
Reduce
– Combines the mergeable data structures using our associative and commutative function
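A minimal sketch of that pair for the flight-counting case (the field names and string types are assumptions of the sketch, not from the deck):

// map: one input record becomes a (key, combinable value) pair; the key is
// the (origin, destination) pairing and the base value is 1
def mapRecord(origin: String, destination: String): ((String, String), Long) =
  ((origin, destination), 1L)

// reduce: merge any two partial values with our associative, commutative op
def reduceValues(a: Long, b: Long): Long = a + b

// two BWI->DIA records reduce to a partial count of 2
val partial = reduceValues(mapRecord("BWI", "DIA")._2, mapRecord("BWI", "DIA")._2)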
Yup, that's all!
● Map will be called on the input data once in a Mapper instance.
● Reduce will be called in a Combiner, Reducer and an Accumulo Iterator!
● The Accumulo Iterator is configured to run on major compactions, minor compactions, and scans (a configuration sketch follows below)
● That's five places the same piece of code gets run -- talk about modularity!
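Attaching the combining iterator to all three scopes might look roughly like the sketch below. It assumes an existing Connector and a table named "flights_agg" (both illustrative), and uses Accumulo's built-in SummingCombiner as a stand-in for our own iterator class.

import java.util.Collections
import org.apache.accumulo.core.client.{Connector, IteratorSetting}
import org.apache.accumulo.core.iterators.{Combiner, LongCombiner}
import org.apache.accumulo.core.iterators.user.SummingCombiner

def attachSummingIterator(connector: Connector): Unit = {
  val setting = new IteratorSetting(10, "agg", classOf[SummingCombiner])
  // only combine values in the "count" column family
  Combiner.setColumns(setting,
    Collections.singletonList(new IteratorSetting.Column("count")))
  LongCombiner.setEncodingType(setting, LongCombiner.Type.STRING)
  // attachIterator defaults to all three scopes: majc, minc and scan
  connector.tableOperations().attachIterator("flights_agg", setting)
}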
What does our Accumulo Iterator
look like?
● We can re-use Accumulo's Combiner type here:
override def reduce(key: Key, values: java.util.Iterator[Value]): Value = {
  // deserialize and combine all intermediate
  // values. This logic should be identical to
  // what is in the mr.Combiner and Reducer
}
● Our function has to be commutative because major compactions will often pick smaller files to combine, which means we only see discrete subsets of data in an iterator invocation
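Fleshed out, the reduce above might look like the sketch below. It assumes the counts are string-encoded longs, which is a choice of this sketch rather than something the deck specifies.

import java.nio.charset.StandardCharsets.UTF_8
import org.apache.accumulo.core.data.{Key, Value}
import org.apache.accumulo.core.iterators.Combiner

// Sketch of a summing iterator; the same merge logic would live in the
// MapReduce Combiner and Reducer so all five call sites agree
class SummingIterator extends Combiner {
  override def reduce(key: Key, values: java.util.Iterator[Value]): Value = {
    var sum = 0L
    while (values.hasNext) {
      // deserialize each intermediate value and fold it into the running sum
      sum += new String(values.next().get(), UTF_8).toLong
    }
    new Value(sum.toString.getBytes(UTF_8))
  }
}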
Counting in practice (pt 1)
We've seen how to aggregate values together. What's
the best way to structure our data and query it?
Twitter's TSAR is a good starting point. It allows users to
declare what they want to aggregate:
Aggregate(
  onKeys(("origin", "destination"))
  producing(Count))
This describes generating an edge between two cities
and calculating a weight for it.
Counting in practice (pt 2)
With that declaration, we can infer that the user wants their
operation to be summing over each instance of a given
pairing, so we can say the base value is 1 (sounds a bit like
word count, huh?). We need a key under which each
base value and partial computation can be reduced.
For this simple pairing we can have a schema like:
<field_1>\0<value_1>\0...<field_n>\0<value_n> count: "" [] <serialized long>
I recently traveled from Baltimore to Denver. Here's what that
trip would look like:
origin\0bwi\0destination\0dia count: "" [] \x01
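Writing that base value as a mutation might look like this sketch (the table, the writer, and a string-encoded "1" instead of the binary \x01 above are all assumptions made for readability):

import org.apache.accumulo.core.data.Mutation

// row: field/value pairs joined with null bytes, i.e. origin\0bwi\0destination\0dia
val row = Seq("origin", "bwi", "destination", "dia").mkString("\u0000")

val mutation = new Mutation(row)
// the column family "count" names the aggregation; the qualifier stays empty
mutation.put("count", "", "1")

// writer.addMutation(mutation)  // assuming an existing BatchWriter named writer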
Counting in practice (pt 3)
● Iterator combines all values that are mapped to the same key
● We encoded the aggregation function into the column family of the key
– We can arbitrarily add new aggregate functions by updating a mapping of column family to function (sketched below) and then updating the iterator deployment
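One way to picture that mapping (a sketch; the function names and encodings are illustrative):

// each column family names an aggregation; the iterator looks up the merge
// function for the family it is currently combining
type Merge = (Array[Byte], Array[Byte]) => Array[Byte]

val sumLongs: Merge = (a, b) =>
  (new String(a, "UTF-8").toLong + new String(b, "UTF-8").toLong)
    .toString.getBytes("UTF-8")

val mergeByFamily: Map[String, Merge] = Map(
  "count" -> sumLongs
  // adding "topk", "distinct", etc. means adding an entry here and
  // redeploying the iterator
)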
Something more than counting
● Everybody counts, but what about something like top-k?
● The key schema isn't flexible enough to show a relationship between two fields
● We want to know the top-k relationship between origin and destination cities
● That column qualifier was looking awfully blank. It'd be a shame if someone were to put data in it...
How you like me now?
● Aggregate(
    onKeys(("origin"))
    producing(TopK("destination")))
● <field1>\0<value1>\0...<fieldN>\0<valueN> <op>: <relation> [] <serialized data structure>
● Let's use my Baltimore->Denver trip as an example:
origin\0BWI topk: destination [] {"DIA": 1}
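A naive sketch of a combinable top-k structure (a plain destination-to-count map; a real deployment might prefer StreamSummary, and everything here is illustrative):

// merge two partial top-k maps: sum counts per destination, keep the k largest
def mergeTopK(k: Int)(a: Map[String, Long], b: Map[String, Long]): Map[String, Long] = {
  val summed = (a.keySet ++ b.keySet).map { dest =>
    dest -> (a.getOrElse(dest, 0L) + b.getOrElse(dest, 0L))
  }.toMap
  summed.toSeq.sortBy(-_._2).take(k).toMap
}

// merging two partials for flights out of BWI
val merged = mergeTopK(2)(Map("DIA" -> 1L), Map("DIA" -> 3L, "LAX" -> 2L))
// Map(DIA -> 4, LAX -> 2)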
But how do I query it?
● This schema is really geared towards point queries
● Users would know exactly which dimensions they were querying across to get an answer
– BUENO “What are the top-k destinations for Bill when he leaves BWI?”
– NO BUENO “What are all the dimensions and aggregations I have for Bill?”
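A point-query sketch (the Connector and the table name "flights_agg" are assumptions; the row follows the key schema above):

import org.apache.accumulo.core.client.Connector
import org.apache.accumulo.core.data.Range
import org.apache.accumulo.core.security.Authorizations

def topKDestinationsFromBWI(connector: Connector): Unit = {
  val scanner = connector.createScanner("flights_agg", new Authorizations())
  // exact row + family + qualifier: "top-k destinations for flights out of BWI"
  scanner.setRange(Range.exact("origin\u0000BWI", "topk", "destination"))
  val it = scanner.iterator()
  while (it.hasNext) {
    println(it.next().getValue)  // the serialized top-k structure
  }
  scanner.close()
}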
Some thoughts to think about
● Prepare functions
– Preparing the input to do things like time bucketing and normalization (Jared Winick's Trendulo); a tiny sketch follows below
● Age off
– Combining down to a single value means that value represents all historical data. Maybe we don't care about that and would like to age off data after a day/week/month/year. Mesa's batch IDs could be of use here.
● Security labels
– Notice how I deftly avoided this topic. We should be able to bucket aggregations based on visibility, but we need a way to express the best way to handle this. Maybe just preserve the input data's security labeling and attach it to the output of our map function?
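As a tiny example of a prepare step (purely illustrative), bucketing a record's timestamp down to the day before it reaches the map function lets us aggregate per day instead of over all time:

import java.time.Instant
import java.time.temporal.ChronoUnit

// "prepare": normalize a raw epoch-millis timestamp into a day bucket that
// becomes part of the aggregation key (e.g. a "day" field with value "2014-09-23")
def dayBucket(epochMillis: Long): String =
  Instant.ofEpochMilli(epochMillis).truncatedTo(ChronoUnit.DAYS).toString.take(10)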
FIN
(hope this wasn't too hard to read)
Comments, suggestions or inflammatory messages should
be sent to @BillSlacum or wslacum@gmail.com
