Building Scalable
Aggregation Systems
Accumulo Summit
April 28th, 2015
Gadalia O’Bryan and Bill Slacum
gadaliaobryan@koverse.com
billslacum@koverse.com
Outline
• The Value of Aggregations
• Abstractions
• Systems
• Details
• Demo
• References/Additional Information
Aggregation provides a means of
turning billions of pieces of raw data
into condensed, human-consumable
information.
Aggregation of Aggregations
Time Series
Set Size/Cardinality
Top-K
Quantiles
Density/Heatmap
[Figures: a “16.3k Unique Users” cardinality tile, a histogram (G1), and a heatmap (G2)]
Abstractions
[Figure: 1 + 2 + 3 + 4 = 10, summed left to right. Concept from (P1)]
[Figure: the same sum split into two halves: (1 + 2) + (3 + 4) = 3 + 7 = 10]
We can parallelize integer addition
Associative + Commutative
Operations
• Associative: 1 + (2 + 3) = (1 + 2) + 3
• Commutative: 1 + 2 = 2 + 1
• Allows us to parallelize our reduce (for
instance locally in combiners)
• Applies to many operations, not just
integer addition.
• Spoiler: Key to incremental aggregations
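As a minimal sketch of the parallel reduce described above (not code from the talk), the hypothetical helper `chunkedReduce` reduces each chunk on its own, as a combiner would, and then reduces the partial results. It assumes a non-empty input and an associative `plus`.

```scala
object ParallelSum {
  // Reduce each chunk independently (as a local combiner would),
  // then reduce the partial results. Assumes xs is non-empty.
  def chunkedReduce[A](xs: Seq[A], chunkSize: Int)(plus: (A, A) => A): A =
    xs.grouped(chunkSize).map(_.reduce(plus)).reduce(plus)

  def main(args: Array[String]): Unit = {
    val data = Seq(1, 2, 3, 4)
    val sequential = data.reduce(_ + _)          // ((1 + 2) + 3) + 4
    val chunked = chunkedReduce(data, 2)(_ + _)  // (1 + 2) + (3 + 4)
    println(s"sequential=$sequential chunked=$chunked")
  }
}
```

Because integer addition is associative, both groupings give the same total. The same helper works for any other associative operation, such as set union.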
[Figure: {a, b} + {b, c} + {a, c} + {a}, computed in two halves: ({a, b} + {b, c}) + ({a, c} + {a}) = {a, b, c} + {a, c} = {a, b, c}]
We can also parallelize the “addition” of other types, like Sets, as
Set Union is associative
Monoid Interface
• Abstract Algebra provides a formal foundation for
what we can casually observe.
• Don’t be thrown off by the name, just think of it as
another trait/interface.
• Monoids provide a critical abstraction to treat
aggregations of different types in the same way
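A minimal sketch of what such an interface could look like. The trait and the names `longSum`, `setUnion`, and `reduceAll` are assumptions made for illustration; libraries such as Algebird define a richer version.

```scala
object MonoidSketch {
  // Hypothetical minimal interface: an identity element and an associative plus.
  trait Monoid[A] {
    def zero: A
    def plus(x: A, y: A): A
  }

  // Long addition: zero is 0.
  val longSum: Monoid[Long] = new Monoid[Long] {
    def zero: Long = 0L
    def plus(x: Long, y: Long): Long = x + y
  }

  // Set union: zero is the empty set.
  def setUnion[T]: Monoid[Set[T]] = new Monoid[Set[T]] {
    def zero: Set[T] = Set.empty[T]
    def plus(x: Set[T], y: Set[T]): Set[T] = x ++ y
  }

  // One reduction routine serves every aggregation type.
  def reduceAll[A](values: Iterable[A])(m: Monoid[A]): A =
    values.foldLeft(m.zero)(m.plus)
}
```

With this shape, `reduceAll` does not need to know whether it is summing Longs or merging Sets, which is the point of treating different aggregations in the same way.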
Many Monoid Implementations
Already Exist
• https://github.com/twitter/algebird/
• Long, String, Set, Seq, Map, etc…
• HyperLogLog – Cardinality Estimates
• QTree – Quantile Estimates
• SpaceSaver/HeavyHitters – Approx Top-K
• Also easy to add your own with libraries
like stream-lib [C3]
Serialization
• One additional trait we need our
“aggregatable” types to have is that we
can serialize/deserialize them.
[Figure: the parallel sum (1 + 2) + (3 + 4) = 3 + 7 = 10, annotated with the calls involved: 1) zero(), 2) plus(), 3) plus(), 4) serialize(), 5) zero(), 6) deserialize(), 7) plus(), 8) deserialize(), 9) plus()]
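The zero/plus/serialize/deserialize sequence can be sketched as follows. The `Aggregator` trait and the helpers `partial` and `merge` are hypothetical names, and the 8-byte `ByteBuffer` encoding for Longs is just one possible wire format.

```scala
import java.nio.ByteBuffer

object SerializableAggregation {
  // Hypothetical interface: a monoid that can also be written to and read from bytes.
  trait Aggregator[A] {
    def zero: A
    def plus(x: A, y: A): A
    def serialize(a: A): Array[Byte]
    def deserialize(bytes: Array[Byte]): A
  }

  val longSum: Aggregator[Long] = new Aggregator[Long] {
    def zero: Long = 0L
    def plus(x: Long, y: Long): Long = x + y
    def serialize(a: Long): Array[Byte] = ByteBuffer.allocate(8).putLong(a).array()
    def deserialize(bytes: Array[Byte]): Long = ByteBuffer.wrap(bytes).getLong()
  }

  // A worker folds its inputs locally and ships the partial result as bytes.
  def partial[A](agg: Aggregator[A], inputs: Seq[A]): Array[Byte] =
    agg.serialize(inputs.foldLeft(agg.zero)(agg.plus))

  // A downstream stage deserializes the partials and folds them the same way.
  def merge[A](agg: Aggregator[A], partials: Seq[Array[Byte]]): A =
    partials.map(agg.deserialize).foldLeft(agg.zero)(agg.plus)
}
```

Calling `partial` on `Seq(1L, 2L)` and on `Seq(3L, 4L)`, then `merge` on the two byte arrays, walks through the same steps as the figure.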
These abstractions enable a
small library of reusable code to
aggregate data in many parts of
your system.
Systems
SQL on Hadoop
• Impala, Hive, SparkSQL
[Figure: four scales]
Query Latency: milliseconds, seconds, minutes
# of Users: large, many, few
Freshness: seconds, minutes, hours
Data Size: billions, millions, thousands
Online Incremental Systems
• Twitter’s Summingbird [PA1, C4], Google’s Mesa [PA2],
Koverse’s Aggregation Framework
[Figure: four scales, with markers S, M, and K for the three systems above]
Query Latency: milliseconds, seconds, minutes
# of Users: large, many, few
Freshness: seconds, minutes, hours
Data Size: billions, millions, thousands
Online Incremental Systems:
Common Components
• Aggregations are computed/reduced
incrementally via associative operations
• Results are mostly pre-computed, so
queries are inexpensive
• Aggregations, keyed by dimensions, are
stored in a low-latency, scalable key-value
store
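Summingbird
[Figure: a Summingbird program runs as both a Storm topology and a Hadoop job. Data reaches the Storm topology through queues and the Hadoop job through HDFS. The topology writes to an online KV store and the job to a batch KV store; the client reads both through a client library. Reduce is applied at four points.]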
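Mesa
[Figure: data arrives in batches and is stored in Colossus as singletons (61, 62, …, 91, 92), cumulatives (61-70, 61-80, 61-90), and a base (0-60). A compaction worker and the query server each apply Reduce; the client talks to the query server.]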
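Koverse
[Figure: data is stored in Apache Accumulo as Records and Aggregates. A Hadoop job applies Reduce when producing Aggregates; inside Accumulo, a minor/major compaction iterator and a scan iterator each apply Reduce. The client goes through the Koverse Server.]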
Details
Ingest (1/2)
• We bulk import RFiles rather than writing via a
BatchWriter
• The failure case is simpler, as we can retry the
whole batch if an aggregation job
or a bulk import fails
• BatchWriters can be used, but code needs
to be written to handle Mutations that are
uncommitted, and there’s no rollback for
successful commits
Ingest (2/2)
• As a consequence of importing (usually
small) RFiles, we will be compacting more
• In testing (20 nodes, 200+ jobs/day), we
have not had to tweak compaction
thresholds or strategies
• This can possibly be attributed to the relatively
small amount of data held at any
given time due to reduction
Accumulo Iterator
• Combiner Iterator:
A SortedKeyValueIterator that combines the
Values for different versions (timestamp) of a
Key within a row into a single Value. Combiner
will replace one or more versions of a Key and
their Values with the most recent Key and a
Value which is the result of the reduce method.
Our Combiner
• We can re-use Accumulo's Combiner type here:
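import scala.collection.JavaConverters._

// Accumulo's Combiner.reduce receives a java.util.Iterator and returns a single Value.
override def reduce(key: Key, values: java.util.Iterator[Value]): Value = {
  // Deserialize each partial aggregate and fold them together.
  val sum = agg.reduceAll(values.asScala.map(v => agg.deserialize(v)))
  // agg.serialize is assumed here; the slide shows only reduceAll and deserialize.
  agg.serialize(sum)
}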
• Our function has to be commutative because major
compactions will often pick smaller files to combine,
which means we only see discrete subsets of data in
an iterator invocation.
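A small sketch of why order-independence matters here. The `compact` helper is hypothetical: it merges two arbitrary entries of a list, the way a compaction might combine a subset of files, and appends the merged result. It is a toy model, not Accumulo's compaction logic.

```scala
object CompactionOrder {
  // Merge the values at positions i and j into one value and append it,
  // leaving the remaining values in their original order.
  def compact[A](files: Vector[A], i: Int, j: Int)(plus: (A, A) => A): Vector[A] = {
    val merged = plus(files(i), files(j))
    val untouched = files.zipWithIndex.collect { case (v, k) if k != i && k != j => v }
    untouched :+ merged
  }

  def main(args: Array[String]): Unit = {
    // Addition is commutative: merging the first and last "files" first changes nothing.
    val sums = compact(Vector(1L, 2L, 3L, 4L), 0, 3)(_ + _)
    println(sums.reduce(_ + _)) // 10, same as 1 + 2 + 3 + 4

    // String concatenation is associative but not commutative: the result changes.
    val strings = compact(Vector("a", "b", "c"), 0, 2)(_ + _)
    println(strings.reduce(_ + _)) // "bac" rather than "abc"
  }
}
```

With a commutative operation, the final value does not depend on which subset a compaction happens to pick.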
Accumulo Table Structure
row | colf | colq | visibility | timestamp | value
field1Name\x1Ffield1Value\x1Ffield2Name\x1Ffield2Value... | Aggregation Type | relation | visibility | timestamp | Serialized aggregation results
Examples (row, colf:colq, [visibility], value):
origin\x1FBWI count: [U] 6074
origin\x1FBWI topk:destination [U] {“DIA”: 1}
origin\x1FBWI\x1Fdate\x1F20150427 count: [U] 104
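A minimal sketch of how keys in this layout could be assembled. `AggregateKey`, `row`, and `key` are hypothetical names, and the visibility is held as a plain string for illustration.

```scala
object AggregateKey {
  // ASCII unit separator (0x1F), the delimiter used between field names and values in the row.
  val Sep = "\u001F"

  // Join (fieldName, fieldValue) pairs in the order given.
  def row(dimensions: Seq[(String, String)]): String =
    dimensions.flatMap { case (name, value) => Seq(name, value) }.mkString(Sep)

  final case class Key(row: String, colf: String, colq: String, visibility: String)

  // colf holds the aggregation type, colq the relation (empty for a plain count).
  def key(dimensions: Seq[(String, String)],
          aggregation: String,
          relation: String = "",
          visibility: String = "U"): Key =
    Key(row(dimensions), aggregation, relation, visibility)
}
```

For example, `AggregateKey.key(Seq("origin" -> "BWI"), "topk", "destination")` corresponds to the second example row. Because the row is built from an ordered list, the same fields in a different order produce a different key.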
Visibilities (1/2)
• Easy to store, a bit tougher to query
• Data can be stored at separate visibilities
• Combiner logic has no concept of visibility,
it only loops over a given
PartialKey.ROW_COLFAM_COLQUAL
• We know how to combine values (Longs,
CountMinSketches), but how do we
combine visibilities?
Visibilities (2/2)
• Say we have some data on Facebook photo
albums:
– facebook\x1falbum_size count: [public] 800
– facebook\x1falbum_size count: [private] 100
• Combined value would be 900
• But, what should we return for the visibility of
public + private? We need more context to
properly interpret this value.
• Alternatively, we can just drop it
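The problem can be illustrated with a small sketch. `Entry` and `combineIgnoringVisibility` are hypothetical names; the function mimics the grouping described above (row, column family, column qualifier) on in-memory data and is not Accumulo iterator code.

```scala
object VisibilityCombine {
  final case class Entry(row: String, colf: String, colq: String, visibility: String, value: Long)

  // Group by row + column family + column qualifier only.
  // Visibility is not part of the grouping key, so it is lost in the result.
  def combineIgnoringVisibility(entries: Seq[Entry]): Map[(String, String, String), Long] =
    entries
      .groupBy(e => (e.row, e.colf, e.colq))
      .map { case (k, es) => k -> es.map(_.value).sum }
}
```

Feeding in the two album_size entries yields a single value of 900 with no visibility attached, which is the "drop it" outcome.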
Queries
• This schema is geared towards point
queries.
• Order of fields matters.
• GOOD: “What are the top-k destinations
from BWI?”
• NOT GOOD: “What are all the dimensions
and aggregations I have for BWI?”
Demo
References
Presentations
P1. Algebra for Analytics - https://speakerdeck.com/johnynek/algebra-for-analytics
Code
C1. Algebird - https://github.com/twitter/algebird
C2. Simmer - https://github.com/avibryant/simmer
C3. stream-lib - https://github.com/addthis/stream-lib
C4. Summingbird - https://github.com/twitter/summingbird
Papers
PA1. Summingbird: A Framework for Integrating Batch and Online MapReduce Computations - http://www.vldb.org/pvldb/vol7/p1441-boykin.pdf
PA2. Mesa: Geo-Replicated, Near Real-Time, Scalable Data Warehousing - http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/42851.pdf
PA3. Monoidify! Monoids as a Design Principle for Efficient MapReduce Algorithms - http://arxiv.org/abs/1304.7544
Video
V1. Intro To Summingbird - https://engineering.twitter.com/university/videos/introduction-to-summingbird
Graphics
G1. Histogram Graphic - http://www.statmethods.net/graphs/density.html
G2. Heatmap Graphic - https://www.mapbox.com/blog/twitter-map-every-tweet/
G3. The Matrix Background - http://wall.alphacoders.com/by_sub_category.php?id=198802
Backup Slides
Monoid Examples
Aggregation Flow
Tables: kv_records, kv_aggregates
Aggregate KVs shown in the figure:
RowId: hour:2014_08_24_09|client:Web, CF: Count, CQ: (empty), Value: 3
RowId: client:Android, CF: Count, CQ: (empty), Value: 1
RowId: client:Android, CF: Count, CQ: (empty), Value: 5
RowId: client:iPhone, CF: Count, CQ: (empty), Value: 6
New Records from Import Jobs:
client: iPhone, timestamp: 1408935773, ...
client: Android, timestamp: 1408935871, ...
client: Web, timestamp: 1408935792, ...
Periodic, Incremental MapReduce Jobs
(like the current Stats Job) read Records
and emit Aggregate KVs based on the
Aggregate configuration for the Collection
Aggregate(
onKey(
“client”,
“hour”, “client”)
produce(
Count)
prepare(
(“timestamp”, “hour”, BinByHour())
)
Aggregate Configuration is a type-safe
Scala object. Code is sent to the server
as a String, where it is compiled (not
executed). The serialized object is
passed to the MR job to generate KVs
from Records. It contains the dimensions
(onKeys), aggregation operation
(produce), and optional projections
(prepare), which can be built-in functions
or custom Scala closures. We envision
a UI building these objects in the future.
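A rough sketch of how a map phase might turn one Record into count KVs under such a configuration. Everything here is an assumption made for illustration: `binByHour` is a stand-in for `BinByHour()` that buckets epoch seconds in UTC, and `prepare` and `emit` are hypothetical helpers, not the Koverse API.

```scala
import java.time.{Instant, ZoneOffset}
import java.time.format.DateTimeFormatter

object MapPhaseSketch {
  private val hourFormat =
    DateTimeFormatter.ofPattern("yyyy_MM_dd_HH").withZone(ZoneOffset.UTC)

  // Stand-in for BinByHour(): epoch seconds -> hour bucket (UTC is an assumption).
  def binByHour(epochSeconds: Long): String =
    hourFormat.format(Instant.ofEpochSecond(epochSeconds))

  // prepare: derive the "hour" field from "timestamp" when it is present.
  def prepare(record: Map[String, String]): Map[String, String] =
    record.get("timestamp").fold(record)(ts => record + ("hour" -> binByHour(ts.toLong)))

  // onKey: one row id per configured dimension list; each emits a count of 1.
  // Dimension lists with a field missing from the record are skipped.
  def emit(record: Map[String, String], onKeys: Seq[Seq[String]]): Seq[(String, Long)] = {
    val prepared = prepare(record)
    onKeys.filter(_.forall(prepared.contains)).map { dims =>
      dims.map(d => s"$d:${prepared(d)}").mkString("|") -> 1L
    }
  }
}
```

With `onKeys = Seq(Seq("client"), Seq("hour", "client"))`, each record produces one KV keyed by client alone and one keyed by hour and client, which downstream reduction then sums.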
[Figure: Map, Combine, Reduce, RFiles, MinC, MajC, Scan Iterator, User Query. Aggregation Reduction appears five times, alongside Combine, Reduce, MinC, MajC, and the Scan Iterator.]
Map: Emit KVs. Key = dimension + operation, Value = Serialized Monoid Aggregator
RFiles:
RowId: client:iPhone, CF: Count, CQ: (empty), Value: 3
RowId: client:Android, CF: Count, CQ: (empty), Value: 5
RowId: hour:2014_08_24_09|client:Android, CF: Count, CQ: (empty), Value: 2
User Query: { key: “client:iPhone”, produce: Count }
Result: { key: “client:iPhone”, produce: Count, value: 9 }
Aggregation Reduction is the same common code in 5 places. For
Aggregates with the same Key, the Values are reduced based on the
operation (Sum, Set, Cardinality Est., etc). The Values are always
serialized objects that implement the MonoidAggregator interface.
Adding a new aggregation operation will impact a single class only -
no new Iterators or MR code.
RowId: hour:2014_08_24_09|client:Web, CF: Count, CQ: (empty), Value: 8
Aggregate Queries are simple point
queries for a single KV. If the user wants
something like an enumeration of “client”
values, they will use a Set or Top-K
operation and the single value will contain
the answer with no range scans required.
The API may support batching multiple
keys per request to efficiently support
queries to build timeseries (e.g., counts
for each hour in the day)

 
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdfFIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdf
 
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
 
By Design, not by Accident - Agile Venture Bolzano 2024
By Design, not by Accident - Agile Venture Bolzano 2024By Design, not by Accident - Agile Venture Bolzano 2024
By Design, not by Accident - Agile Venture Bolzano 2024
 
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdfSmart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
 
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdf
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdfSAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdf
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdf
 
Video Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the FutureVideo Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the Future
 
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdfObservability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
 
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
 
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
 
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdfFIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
 
UiPath Community Day Dubai: AI at Work..
UiPath Community Day Dubai: AI at Work..UiPath Community Day Dubai: AI at Work..
UiPath Community Day Dubai: AI at Work..
 
RESUME BUILDER APPLICATION Project for students
RESUME BUILDER APPLICATION Project for studentsRESUME BUILDER APPLICATION Project for students
RESUME BUILDER APPLICATION Project for students
 
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdfFIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
 
Assure Contact Center Experiences for Your Customers With ThousandEyes
Assure Contact Center Experiences for Your Customers With ThousandEyesAssure Contact Center Experiences for Your Customers With ThousandEyes
Assure Contact Center Experiences for Your Customers With ThousandEyes
 
The Future of Platform Engineering
The Future of Platform EngineeringThe Future of Platform Engineering
The Future of Platform Engineering
 
Introduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - CybersecurityIntroduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - Cybersecurity
 
The Metaverse and AI: how can decision-makers harness the Metaverse for their...
The Metaverse and AI: how can decision-makers harness the Metaverse for their...The Metaverse and AI: how can decision-makers harness the Metaverse for their...
The Metaverse and AI: how can decision-makers harness the Metaverse for their...
 

Accumulo Summit 2015: Building Aggregation Systems on Accumulo [Leveraging Accumulo]

  • 1. Building Scalable Aggregation Systems Accumulo Summit April 28th, 2015 Gadalia O’Bryan and Bill Slacum gadaliaobryan@koverse.com billslacum@koverse.com
  • 2. Outline • The Value of Aggregations • Abstractions • Systems • Details • Demo • References/Additional Information
  • 3.
  • 4. Aggregation provides a means of turning billions of pieces of raw data into condensed, human-consumable information.
  • 5. Aggregation of Aggregations Time Series Set Size/Cardinality Top-K Quantiles Density/Heatmap 16.3k Unique Users G1 G2
  • 8. 1 2 3 4 3 + + = 7 = 10 = + We can parallelize integer addition
  • 9. Associative + Commutative Operations • Associative: 1 + (2 + 3) = (1 + 2) + 3 • Commutative: 1 + 2 = 2 + 1 • Allows us to parallelize our reduce (for instance locally in combiners) • Applies to many operations, not just integer addition. • Spoiler: Key to incremental aggregations
  • 10. {a, b} {b, c} {a, c} {a} {a, b, c} + + = {a, c} = {a, b, c} = + We can also parallelize the “addition” of other types, like Sets, as Set Union is associative
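As a rough illustration of slides 8 to 10, the sketch below reduces chunks of the input independently (as a combiner would) and then reduces the partial results. The object and function names are hypothetical and not from the talk; the inputs are the integers and sets shown on the slides.

```scala
object ParallelReduceSketch {
  // Reduce each chunk on its own, then reduce the partial results.
  // Grouping the work this way is only safe because `plus` is associative.
  def chunkedReduce[A](items: Seq[A], chunkSize: Int)(plus: (A, A) => A): A =
    items.grouped(chunkSize).map(_.reduce(plus)).reduce(plus)

  val ints: Seq[Int] = Seq(1, 2, 3, 4)
  val sets: Seq[Set[String]] = Seq(Set("a", "b"), Set("b", "c"), Set("a", "c"), Set("a"))

  // (1 + 2) + (3 + 4)
  val intSum: Int = chunkedReduce(ints, 2)(_ + _)

  // ({a, b} union {b, c}) union ({a, c} union {a})
  val setUnion: Set[String] = chunkedReduce(sets, 2)(_ union _)
}
```

The chunked result matches a plain left-to-right reduction for both types, which is the property the slides rely on.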
  • 11. Monoid Interface • Abstract Algebra provides a formal foundation for what we can casually observe. • Don’t be thrown off by the name, just think of it as another trait/interface. • Monoids provide a critical abstraction to treat aggregations of different types in the same way
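A minimal sketch of what such a trait could look like, assuming only the two operations named later in the deck (zero and plus). This is a hypothetical, cut-down interface for illustration; Algebird's actual Monoid type has more members.

```scala
object MonoidSketch {
  // Hypothetical minimal interface: an identity element and an associative combine.
  trait Monoid[T] {
    def zero: T
    def plus(a: T, b: T): T
  }

  implicit val longSum: Monoid[Long] = new Monoid[Long] {
    def zero: Long = 0L
    def plus(a: Long, b: Long): Long = a + b
  }

  implicit def setUnion[A]: Monoid[Set[A]] = new Monoid[Set[A]] {
    def zero: Set[A] = Set.empty[A]
    def plus(a: Set[A], b: Set[A]): Set[A] = a union b
  }

  // One aggregation routine that works for every type with a Monoid instance.
  def sumAll[T](items: Iterable[T])(implicit m: Monoid[T]): T =
    items.foldLeft(m.zero)(m.plus)
}
```

With `import MonoidSketch._` in scope, `sumAll(Seq(1L, 2L, 3L, 4L))` and `sumAll(Seq(Set("a"), Set("b")))` go through the same code path, which is the point of treating different aggregations uniformly.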
  • 12. Many Monoid Implementations Already Exist • https://github.com/twitter/algebird/ • Long, String, Set, Seq, Map, etc… • HyperLogLog – Cardinality Estimates • QTree – Quantile Estimates • SpaceSaver/HeavyHitters – Approx Top-K • Also easy to add your own with libraries like stream-lib [C3]
  • 13. Serialization • One additional trait we need our “aggregatable” types to have is that we can serialize/deserialize them. 1 2 3 4 3 + + = 7 = 10 = + 1) zero() 2) plus() 3) plus() 4) serialize() 5) zero() 6) deserialize() 7) plus() 8) deserialize() 9) plus()
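The numbered calls on this slide can be sketched roughly as follows, assuming a hypothetical Long-sum aggregator with an 8-byte encoding (the names and the byte format are illustrative, not the talk's code). Two workers each build a partial sum and serialize it; a later stage deserializes the partials and merges them.

```scala
import java.nio.ByteBuffer

object SerializedPartialsSketch {
  // Hypothetical aggregator: Long addition plus a fixed-width binary encoding.
  object LongSum {
    val zero: Long = 0L
    def plus(a: Long, b: Long): Long = a + b
    def serialize(v: Long): Array[Byte] = ByteBuffer.allocate(8).putLong(v).array()
    def deserialize(bytes: Array[Byte]): Long = ByteBuffer.wrap(bytes).getLong
  }

  // Worker side: start from zero, plus each input, then serialize the partial result.
  def partial(values: Seq[Long]): Array[Byte] =
    LongSum.serialize(values.foldLeft(LongSum.zero)(LongSum.plus))

  // Merge side: start from zero, then deserialize and plus each partial.
  def merge(partials: Seq[Array[Byte]]): Long =
    partials.foldLeft(LongSum.zero)((acc, bytes) => LongSum.plus(acc, LongSum.deserialize(bytes)))

  // 1 + 2 and 3 + 4 are computed separately, shipped as bytes, and merged.
  val total: Long = merge(Seq(partial(Seq(1L, 2L)), partial(Seq(3L, 4L))))
}
```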
  • 14. These abstractions enable a small library of reusable code to aggregate data in many parts of your system.
  • 16. SQL on Hadoop • Impala, Hive, SparkSQL (Chart axes: Query Latency: milliseconds, seconds, minutes; # of Users: large, many, few; Freshness: seconds, minutes, hours; Data Size: billions, millions, thousands)
  • 17. Online Incremental Systems • Twitter’s Summingbird [PA1, C4], Google’s Mesa [PA2], Koverse’s Aggregation Framework (Chart axes: Query Latency: milliseconds, seconds, minutes; # of Users: large, many, few; Freshness: seconds, minutes, hours; Data Size: billions, millions, thousands; markers: S, M, K)
  • 18. Online Incremental Systems: Common Components • Aggregations are computed/reduced incrementally via associative operations • Results are mostly pre-computed, so queries are inexpensive • Aggregations, keyed by dimensions, are stored in a low-latency, scalable key-value store
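A toy sketch of these three components, using an in-memory map as a stand-in for the key-value store. The class name, key format, and counts are hypothetical; real systems would use a distributed store such as Accumulo.

```scala
object IncrementalStoreSketch {
  // Hypothetical in-memory stand-in for a low-latency key-value store.
  final class AggregateStore {
    private val table = scala.collection.mutable.Map.empty[String, Long]

    // Incremental reduce: fold a new batch of partial counts into what is already stored.
    def merge(batch: Map[String, Long]): Unit =
      batch.foreach { case (k, v) => table(k) = table.getOrElse(k, 0L) + v }

    // Query: a single lookup, because the aggregate is already computed.
    def get(key: String): Option[Long] = table.get(key)
  }

  def demo(): Option[Long] = {
    val store = new AggregateStore
    store.merge(Map("origin=BWI" -> 3L, "origin=DIA" -> 1L)) // first batch
    store.merge(Map("origin=BWI" -> 2L))                     // later batch
    store.get("origin=BWI")
  }
}
```

No raw records are rescanned when the second batch arrives; only the stored aggregate and the new partial are combined.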
  • 19. Summingbird Program Summingbird Data HDFS Queues Storm Topology Hadoop Job Online KV store Batch KV store Client Library Client Reduce Reduce Reduce Reduce
  • 21. Koverse Data Apache Accumulo Koverse Server Hadoop Job Reduce Reduce Client Records Aggregates Min/Maj Compaction Iterator Reduce Scan Iterator Reduce
  • 23. Ingest (1/2) • We bulk import RFiles rather than writing via a BatchWriter • The failure case is simpler, as we can retry the whole batch if an aggregation job or a bulk import fails • BatchWriters can be used, but code needs to be written to handle Mutations that are uncommitted, and there is no rollback for successful commits
  • 24. Ingest (2/2) • As a consequence of importing (usually small) RFiles, we will be compacting more • In testing (20 nodes, 200+ jobs/day), we have not had to tweak compaction thresholds nor strategies • Can possibly be attributed to relatively small amounts of data being held at any given time due to reduction
  • 25. Accumulo Iterator • Combiner Iterator: A SortedKeyValueIterator that combines the Values for different versions (timestamp) of a Key within a row into a single Value. Combiner will replace one or more versions of a Key and their Values with the most recent Key and a Value which is the result of the reduce method.
  • 26. Our Combiner • We can re-use Accumulo's Combiner type here: override def reduce(key: Key, values: Iterator[Value]): Value = { val sum = agg.reduceAll(values.map(v => agg.deserialize(v))); agg.serialize(sum) } • Our function has to be commutative because major compactions will often pick smaller files to combine, which means we only see discrete subsets of data in an iterator invocation.
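The commutativity requirement can be seen in a small simulation that does not use Accumulo at all. The sketch below (hypothetical names, made-up values) combines an arbitrary subset of versions first, as a compaction over a couple of small files might, and then folds that partial result in with the rest. Addition gives the same answer either way; string concatenation, which is associative but not commutative, does not.

```scala
object CompactionOrderSketch {
  // Reduce whichever subset of versions a single iterator invocation happens to see.
  def combine[A](values: Iterator[A])(plus: (A, A) => A): A = values.reduce(plus)

  // Combine the 2nd and 4th versions first, then merge that partial with the others.
  def subsetFirst[A](v: Seq[A])(plus: (A, A) => A): A = {
    val partial = combine(Iterator(v(1), v(3)))(plus)
    combine(Iterator(v(0), partial, v(2)))(plus)
  }

  val longs: Seq[Long] = Seq(5L, 1L, 7L, 2L)
  val strings: Seq[String] = Seq("a", "b", "c", "d")

  // Commutative and associative: the order of combination does not matter.
  val sumAllAtOnce: Long = combine(longs.iterator)(_ + _)
  val sumSubsetFirst: Long = subsetFirst(longs)(_ + _)

  // Not commutative: combining a non-adjacent subset first changes the result.
  val concatAllAtOnce: String = combine(strings.iterator)(_ + _)
  val concatSubsetFirst: String = subsetFirst(strings)(_ + _)
}
```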
  • 27. Accumulo Table Structure
    row | colf | colq | visibility | timestamp | value
    field1Name\x1Ffield1Value\x1Ffield2Name\x1Ffield2Value... | Aggregation Type | relation | visibility | timestamp | Serialized aggregation results
    Example: origin\x1FBWI | count | (empty) | [U] | | 6074
    Example: origin\x1FBWI | topk | destination | [U] | | {“DIA”: 1}
    Example: origin\x1FBWI\x1Fdate\x1F20150427 | count | (empty) | [U] | | 104
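A small sketch of how row IDs in this layout might be built and parsed, assuming the separator on the slide is the unit-separator character U+001F. The helper names are hypothetical.

```scala
object RowKeySketch {
  // Assumption: the field separator is the ASCII unit separator (0x1F).
  val Sep: String = "\u001F"

  // Join name/value pairs in the given order: name SEP value SEP name SEP value ...
  def rowId(dimensions: Seq[(String, String)]): String =
    dimensions.map { case (name, value) => name + Sep + value }.mkString(Sep)

  // Split a row ID back into name/value pairs; a trailing unpaired token is ignored.
  def parse(row: String): List[(String, String)] =
    row.split(Sep).toList.grouped(2).collect { case List(n, v) => (n, v) }.toList
}
```

For example, `RowKeySketch.rowId(Seq("origin" -> "BWI", "date" -> "20150427"))` produces the row ID of the third example entry, and `parse` recovers the two pairs from it.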
  • 28. Visibilities (1/2) • Easy to store, a bit tougher to query • Data can be stored at separate visibilities • Combiner logic has no concept of visibility; it only loops over a given PartialKey.ROW_COLFAM_COLQUAL • We know how to combine values (Longs, CountMinSketches), but how do we combine visibilities?
  • 29. Visibilities (2/2) • Say we have some data on Facebook photo albums: – facebook\x1Falbum_size count: [public] 800 – facebook\x1Falbum_size count: [private] 100 • Combined value would be 900 • But what should we return for the visibility of public + private? We need more context to properly interpret this value. • Alternatively, we can just drop it
  • 30. Queries • This schema is geared towards point queries. • Order of fields matters. • GOOD: “What are the top-k destinations from BWI?” • NOT GOOD: “What are all the dimensions and aggregations I have for BWI?”
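To make the dilemma concrete, here is a sketch using the two counts from the slide. It contrasts combining across visibilities (the label is lost) with one possible alternative, which is an assumption of this sketch and not something the talk prescribes: keep the entries separate and sum only the labels a reader holds. Labels are treated as plain strings rather than full visibility expressions.

```scala
object VisibilitySketch {
  // (visibility label, count) entries for one row / column family / column qualifier.
  val stored: Seq[(String, Long)] = Seq("public" -> 800L, "private" -> 100L)

  // Combining without regard to visibility: the value is 900, but no single label describes it.
  val combinedIgnoringVisibility: Long = stored.map(_._2).sum

  // Hypothetical alternative: sum only the entries whose label the reader is authorized for.
  def visibleTotal(authorizations: Set[String]): Long =
    stored.collect { case (label, count) if authorizations(label) => count }.sum
}
```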
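The difference between the two query shapes can be sketched with a sorted in-memory map standing in for the table. The separator is assumed to be U+001F, and the `carrier` row and its value are made up for illustration; only the `origin` rows echo the slide examples.

```scala
import scala.collection.immutable.TreeMap

object QuerySketch {
  val Sep: String = "\u001F" // assumed field separator

  // Sorted by row ID, like a key-value table. The carrier row is hypothetical.
  val table: TreeMap[String, Long] = TreeMap(
    s"carrier${Sep}AA${Sep}origin${Sep}BWI" -> 12L,
    s"origin${Sep}BWI" -> 6074L,
    s"origin${Sep}BWI${Sep}date${Sep}20150427" -> 104L
  )

  // GOOD: the full row ID is known, so this is a single lookup.
  def point(row: String): Option[Long] = table.get(row)

  // NOT GOOD: "everything mentioning BWI" has no common key prefix,
  // so every row ID must be inspected.
  def mentioning(value: String): List[String] =
    table.keys.filter(_.split(Sep).contains(value)).toList
}
```

Because field order is baked into the row ID, a value that appears in a non-leading position can only be found by a scan.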
  • 31. Demo
  • 32. References Presentations P1. Algebra for Analytics - https://speakerdeck.com/johnynek/algebra-for-analytics Code C1. Algebird - https://github.com/twitter/algebird C2. Simmer - https://github.com/avibryant/simmer C3. stream-lib - https://github.com/addthis/stream-lib C4. Summingbird - https://github.com/twitter/summingbird Papers PA1. Summingbird: A Framework for Integrating Batch and Online MapReduce Computations - http://www.vldb.org/pvldb/vol7/p1441-boykin.pdf PA2. Mesa: Geo-Replicated, Near Real-Time, Scalable Data Warehousing - http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/42851.pdf PA3. Monoidify! Monoids as a Design Principle for Efficient MapReduce Algorithms - http://arxiv.org/abs/1304.7544 Video V1. Intro To Summingbird - https://engineering.twitter.com/university/videos/introduction-to-summingbird Graphics G1. Histogram Graphic - http://www.statmethods.net/graphs/density.html G2. Heatmap Graphic - https://www.mapbox.com/blog/twitter-map-every-tweet/ G3. The Matrix Background - http://wall.alphacoders.com/by_sub_category.php?id=198802
  • 36. Aggregation Flow
    New Records from Import Jobs (kv_records): client: iPhone timestamp: 1408935773 ... client: Android timestamp: 1408935871 ... client: Web timestamp: 1408935792 ...
    Periodic, Incremental MapReduce Jobs (like the current Stats Job) read Records and emit Aggregate KVs based on the Aggregate configuration for the Collection: Aggregate( onKey( “client”, “hour”, “client”) produce( Count) prepare( (“timestamp”, “hour”, BinByHour()) )
    Aggregate Configuration is a type-safe Scala object. Code is sent to the server as a String, where it is compiled (not executed). The serialized object is passed to the MR job to generate KVs from Records. It contains the dimensions (onKeys), the aggregation operation (produce), and optional projections (prepare), which can be built-in functions or custom Scala closures. We envision a UI building these objects in the future.
    Stages shown in the diagram: Map, Combine (emit KVs; Key = dimension + operation, Value = Serialized Monoid Aggregator), Reduce, RFiles, MinC, MajC, User Query via Scan Iterator. The label “Aggregation Reduction” appears five times across these stages.
    Example entries (kv_aggregates): RowId: hour:2014_08_24_09|client:Web CF: Count CQ: Value: 3; RowId: client:Android CF: Count CQ: Value: 1; RowId: client:Android CF: Count CQ: Value: 5; RowId: client:iPhone CF: Count CQ: Value: 6; RowId: client:iPhone CF: Count CQ: Value: 3; RowId: client:Android CF: Count CQ: Value: 5; RowId: hour:2014_08_24_09|client:Android CF: Count CQ: Value: 2; RowId: hour:2014_08_24_09|client:Web CF: Count CQ: Value: 8
    Example query: { key: “client:iPhone”, produce: Count } returns { key: “client:iPhone”, produce: Count, value: 9 }
    Aggregation Reduction is the same common code in 5 places. For Aggregates with the same Key, the Values are reduced based on the operation (Sum, Set, Cardinality Est., etc). The Values are always serialized objects that implement the MonoidAggregator interface. Adding a new aggregation operation will impact a single class only: no new Iterators or MR code.
    Aggregate Queries are simple point queries for a single KV. If the user wants something like an enumeration of “client” values, they will use a Set or Top-K operation, and the single value will contain the answer with no range scans required. The API may support batching multiple keys per request to efficiently support queries that build time series (e.g., counts for each hour in the day).
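The map and reduce steps of this flow can be approximated in a few lines of plain Scala. This is a sketch under stated assumptions, not the Koverse API: `Record`, `binByHour`, and the key format are hypothetical stand-ins, the hour bin is an integer hour index since the epoch (not the formatted hour string on the slide), and the fourth record is made up so that one key receives more than one value.

```scala
object AggregationFlowSketch {
  final case class Record(client: String, timestamp: Long)

  // Stand-in for the BinByHour() projection: whole hours since the Unix epoch.
  def binByHour(timestampSeconds: Long): Long = timestampSeconds / 3600L

  // Map step: emit one (key, partial count) pair per configured dimension set.
  def emit(r: Record): Seq[(String, Long)] = Seq(
    s"client:${r.client}" -> 1L,
    s"hour:${binByHour(r.timestamp)}|client:${r.client}" -> 1L
  )

  // Combine/Reduce step: values that share a key are summed (the Count operation).
  def aggregate(records: Seq[Record]): Map[String, Long] =
    records.flatMap(emit).groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2).sum }

  val records: Seq[Record] = Seq(
    Record("iPhone", 1408935773L),
    Record("Android", 1408935871L),
    Record("Web", 1408935792L),
    Record("iPhone", 1408935800L) // made-up extra record
  )

  val aggregates: Map[String, Long] = aggregate(records)
}
```

A point query is then a single lookup such as `AggregationFlowSketch.aggregates.get("client:iPhone")`.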