Accumulo Summit 2015: Using Fluo to incrementally process data in Accumulo [API]

Mike Walch
Using Fluo to incrementally
process data in Accumulo

Problem: Maintain counts of inbound links
Example graph: links among fluo.io, github.com, apache.org, and nytimes.com

Example data:
Website        # Inbound Links
fluo.io        0
github.com     3
apache.org     2
nytimes.com    0

Solution 1 - Maintain counts using batch processing
Diagram components: Internet, Web Crawler, Web Cache, MapReduce (two jobs)

Link count change log
Website        # Inbound
fluo.io        +1
github.com     -1
apache.org     +1
github.com     -1
nytimes.com    +1
apache.org     +1

Last Hour Aggregates
Website        # Inbound
fluo.io        +1
github.com     -23
apache.org     +65
nytimes.com    +105

Historical
Website        # Inbound
fluo.io        53
github.com     1,385,192
apache.org     2,528,190
nytimes.com    53,395,000

Latest Counts
Website        # Inbound
fluo.io        54
github.com     1,385,169
apache.org     2,528,255
nytimes.com    53,395,105
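The batch approach above can be sketched in plain Java. This is only an illustration, not the MapReduce jobs from the talk: the class and method names are hypothetical, and splitting the work into an "aggregate" step and a "merge" step is an assumption based on the two MapReduce boxes in the diagram. The numbers are the ones shown on the slide.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical names throughout; a stand-in for the two batch jobs.
class BatchLinkCounts {

    /** Sums a change log of (site, +1/-1) entries into one net delta per site. */
    static Map<String, Long> aggregate(List<Map.Entry<String, Integer>> changeLog) {
        Map<String, Long> deltas = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> change : changeLog) {
            deltas.merge(change.getKey(), change.getValue().longValue(), Long::sum);
        }
        return deltas;
    }

    /** Adds the net deltas to the historical counts to produce the latest counts. */
    static Map<String, Long> merge(Map<String, Long> historical, Map<String, Long> deltas) {
        Map<String, Long> latest = new LinkedHashMap<>(historical);
        deltas.forEach((site, delta) -> latest.merge(site, delta, Long::sum));
        return latest;
    }

    /** "Historical" table from the slide. */
    static Map<String, Long> historical() {
        Map<String, Long> m = new LinkedHashMap<>();
        m.put("fluo.io", 53L);
        m.put("github.com", 1_385_192L);
        m.put("apache.org", 2_528_190L);
        m.put("nytimes.com", 53_395_000L);
        return m;
    }

    /** "Last Hour Aggregates" table from the slide. */
    static Map<String, Long> lastHour() {
        Map<String, Long> m = new LinkedHashMap<>();
        m.put("fluo.io", 1L);
        m.put("github.com", -23L);
        m.put("apache.org", 65L);
        m.put("nytimes.com", 105L);
        return m;
    }

    public static void main(String[] args) {
        // The six change-log entries shown on the slide (an excerpt of the hour's log).
        List<Map.Entry<String, Integer>> log = List.of(
            Map.entry("fluo.io", 1), Map.entry("github.com", -1),
            Map.entry("apache.org", 1), Map.entry("github.com", -1),
            Map.entry("nytimes.com", 1), Map.entry("apache.org", 1));
        System.out.println(aggregate(log));
        // {fluo.io=1, github.com=-2, apache.org=2, nytimes.com=1}
        System.out.println(merge(historical(), lastHour()));
        // {fluo.io=54, github.com=1385169, apache.org=2528255, nytimes.com=53395105}
    }
}
```

The merged result matches the "Latest Counts" table. The cost of this design is latency: counts are only as fresh as the last completed batch run.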

Solution 2 - Maintain counts using Fluo
Fluo Table
Website        # Inbound
fluo.io        53
github.com     1,385,192
apache.org     2,528,190
nytimes.com    53,395,000

Diagram components: Internet, Web Crawler, Web Cache, and +1/-1 updates applied to the Fluo table

Solution 3 - Use both: update popular sites using batch processing & update long tail using Fluo
Chart: website distribution by # inbound links (nytimes.com, github.com, fluo.io)
- Popular sites: update every hour using MapReduce
- Long tail: update in real time using Fluo

Fluo 101 - Basics
- Provides cross-row transactions and snapshot isolation, which make concurrent updates safe
- Allows for incremental processing of data
- Based on Google’s Percolator paper
- Started as a side project by Keith Turner in 2013
- Originally called Accismus
- Tested using synthetic workloads
- Almost ready for production environments

Fluo 101 - Accumulo vs Fluo
- Fluo is a transactional API built on top of Accumulo
- Fluo stores its data in Accumulo
- Fluo uses Accumulo conditional mutations for transactions
- Fluo has a table structure (row, column, value) similar to Accumulo's, except that Fluo has no timestamp
- Each Fluo application runs its own processes
  - Oracle allocates timestamps for transactions
  - Workers run user code (called observers) that performs transactions

Fluo 101 - Architecture
Diagram components:
- Infrastructure: Accumulo, HDFS, Zookeeper, YARN
- Tables: Table1, Table2
- Client cluster: two Fluo clients for App 1, one Fluo client for App 2
- Fluo Application 1 and Fluo Application 2: each has its own Fluo Oracle and Fluo Workers; workers host observers (Observer1 and Observer2 in one application, ObserverA in the other)

Fluo 101 - Client API
Used by developers to ingest data or interact with Fluo from
external applications (REST services, crawlers, etc)
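  public void addDocument(FluoClient fluoClient, String docId, String content) {
    // Wrap raw transactions so rows, columns, and values can be handled as strings
    TypeLayer typeLayer = new TypeLayer(new StringEncoder());
    // try-with-resources closes the transaction even when the commit is skipped
    try (TypedTransaction tx1 = typeLayer.wrap(fluoClient.newTransaction())) {
      // Only write the document if no content is stored for this row yet
      if (tx1.get().row(docId).col(CONTENT_COL).toString() == null) {
        tx1.mutate().row(docId).col(CONTENT_COL).set(content);
        tx1.commit();
      }
    }
  }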

Fluo 101 - Observers
- Developers write observers, which Fluo workers run when an observed column is modified.
- Best practice: Do work/transactions in observers over client code
  public class DocumentObserver extends TypedObserver {

    // Called by a Fluo worker when the observed column of a row is modified
    @Override
    public void process(TypedTransactionBase tx, Bytes row, Column column) {
      // do work here
    }

    // Tells Fluo which column triggers this observer
    @Override
    public ObservedColumn getObservedColumn() {
      return new ObservedColumn(CONTENT_COL, NotificationType.STRONG);
    }
  }

Example Fluo Application
- Problem: Maintain word & document counts in real time as documents are added to and deleted from Fluo
- Fluo client performs two actions:
  1. Add document to table
  2. Mark document for deletion
- These actions trigger two observers:
  - Add Observer - increase word and document counts
  - Delete Observer - decrease counts and clean up

Add first document to table
Fluo Table
Row            Column   Value
d : doc1       doc      my first hello world

Diagram components: Fluo Client (client cluster), Add Observer, Delete Observer

An observer increments word counts
Fluo Table
Row            Column   Value
d : doc1       doc      my first hello world
w : first      cnt      1
w : hello      cnt      1
w : my         cnt      1
w : world      cnt      1
total : docs   cnt      1

Diagram components: Fluo Client (client cluster), Add Observer, Delete Observer

A second document is added
Fluo Table
Row            Column   Value
d : doc1       doc      my first hello world
d : doc2       doc      second hello world
w : first      cnt      1
w : hello      cnt      2
w : my         cnt      1
w : second     cnt      1
w : world      cnt      2
total : doc    cnt      2

Diagram components: Fluo Client (client cluster), Add Observer, Delete Observer

First document is marked for deletion
Fluo Table
Row            Column   Value
d : doc1       doc      my first hello world
d : doc1       delete
d : doc2       doc      second hello world
w : first      cnt      1
w : hello      cnt      2
w : my         cnt      1
w : second     cnt      1
w : world      cnt      2
total : doc    cnt      2

Diagram components: Fluo Client (client cluster), Add Observer, Delete Observer

Observer decrements counts and deletes document
Fluo Table
Row            Column   Value
d : doc1       doc
d : doc1       delete
d : doc2       doc      second hello world
w : first      cnt      1
w : hello      cnt      1
w : my         cnt      1
w : second     cnt      1
w : world      cnt      1
total : doc    cnt      1

Diagram components: Fluo Client (client cluster), Add Observer, Delete Observer
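The walkthrough above can be replayed with a small in-memory sketch. It does not use the Fluo API: two maps stand in for the Fluo table, the observers run inline instead of being triggered by notifications, and all class and method names are hypothetical. It also assumes each word occurrence is counted, which matches the slides because no document repeats a word.

```java
import java.util.Map;
import java.util.TreeMap;

// In-memory stand-in for the example application; not the Fluo API.
class WordCountSketch {
    final Map<String, String> docs = new TreeMap<>();     // rows "d:<id>", column doc
    final Map<String, Integer> counts = new TreeMap<>();  // rows "w:<word>" and "total:docs", column cnt

    /** Client action 1 plus the add observer: store the document, increase counts. */
    void addDocument(String docId, String content) {
        if (docs.putIfAbsent("d:" + docId, content) == null) {
            adjust(content, +1);
        }
    }

    /** Client action 2 plus the delete observer: decrease counts, remove the document. */
    void deleteDocument(String docId) {
        String content = docs.remove("d:" + docId);
        if (content != null) {
            adjust(content, -1);
        }
    }

    /** Applies delta to every word count of the document and to the document total. */
    private void adjust(String content, int delta) {
        for (String word : content.trim().split("\\s+")) {
            bump("w:" + word, delta);
        }
        bump("total:docs", delta);
    }

    private void bump(String row, int delta) {
        int updated = counts.merge(row, delta, Integer::sum);
        if (updated == 0) {
            counts.remove(row); // choice made for this sketch: drop rows that reach zero
        }
    }

    public static void main(String[] args) {
        WordCountSketch table = new WordCountSketch();
        table.addDocument("doc1", "my first hello world");
        table.addDocument("doc2", "second hello world");
        System.out.println(table.counts);
        // {total:docs=2, w:first=1, w:hello=2, w:my=1, w:second=1, w:world=2}
        table.deleteDocument("doc1");
        System.out.println(table.counts);
        // {total:docs=1, w:hello=1, w:second=1, w:world=1}
    }
}
```

In a real Fluo application each of these steps would run inside a transaction, so concurrent adds and deletes touching the same word row could collide and be retried.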

Things to watch out for...
- Collisions occur when two transactions update the same data at the same time
  - Only one transaction succeeds; the others must be retried
  - A few collisions are fine, but too many slow the computation
  - Avoid collisions by not updating the same row/column in every transaction
- Write skew occurs when two transactions read overlapping data sets and make disjoint updates without seeing each other's update
  - The result differs from what serialized transactions would produce
  - Prevent write skew by making both transactions update the same row/column. If they run concurrently, a collision occurs and only one transaction succeeds.
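A collision and its retry can be illustrated with a single-threaded simulation. This is a sketch under simplifying assumptions, not how Fluo is implemented: the class is hypothetical, a counter stands in for oracle-issued timestamps, and only the latest version of each cell is kept, so it does not model full snapshot reads.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical toy store with an optimistic commit check; not the Fluo API.
class CollisionSketch {
    private final Map<String, Long> values = new HashMap<>();
    private final Map<String, Long> lastCommitTs = new HashMap<>();
    private long clock = 0; // stand-in for timestamps allocated by an oracle

    /** Starts a transaction and returns its start timestamp. */
    long begin() {
        return ++clock;
    }

    /** Reads the latest committed value of a cell (0 if never written). */
    long read(String cell) {
        return values.getOrDefault(cell, 0L);
    }

    /** Commits a write unless the cell was committed after this transaction started. */
    boolean commit(long startTs, String cell, long newValue) {
        if (lastCommitTs.getOrDefault(cell, 0L) > startTs) {
            return false; // collision: another transaction updated the cell first
        }
        values.put(cell, newValue);
        lastCommitTs.put(cell, ++clock);
        return true;
    }

    /** Read-modify-write that starts over with a fresh transaction after a collision. */
    void increment(String cell) {
        while (true) {
            long startTs = begin();
            if (commit(startTs, cell, read(cell) + 1)) {
                return;
            }
        }
    }

    public static void main(String[] args) {
        CollisionSketch store = new CollisionSketch();
        long tx1 = store.begin();
        long tx2 = store.begin();
        long seenByTx1 = store.read("count"); // both transactions read 0
        long seenByTx2 = store.read("count");
        boolean firstCommitted = store.commit(tx1, "count", seenByTx1 + 1);
        boolean secondCommitted = store.commit(tx2, "count", seenByTx2 + 1);
        System.out.println(firstCommitted + " " + secondCommitted); // true false
        store.increment("count"); // the losing update is retried from scratch
        System.out.println(store.read("count")); // 2
    }
}
```

Without the commit check, the second transaction would overwrite the first and the counter would end at 1. The same check is what makes the write-skew remedy work: if both transactions write the same row/column, at most one of two concurrent transactions can commit.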

How does Fluo fit in?
Chart axes: large join throughput (lower to higher) vs. processing latency (slower to faster)
- Batch processing: MapReduce, Spark
- Incremental processing: Fluo, Percolator
- Stream processing: Storm

Don’t use Fluo if...
1. You want to do ad-hoc analysis on your data
(use batch processing instead)
2. Your incoming data is being joined with a small data set
(use stream processing instead)

Use Fluo if...
1. You want to maintain a large-scale computation using a series of small transactional updates
2. Periodic batch processing jobs are taking too long to join new data with existing data

Fluo Application Lifecycle
1. Use batch processing to seed the computation with historical data
2. Use Fluo to process incoming data and maintain the computation in real time
3. While Fluo is processing, it can be queried and users can be notified

Major Progress
Timeline (year markers: 2010, 2013, 2014, 2015):
- Google releases Percolator paper (2010)
- Keith Turner starts work on a Percolator implementation for Accumulo as a side project, originally called Accismus (2013)
- Fluo can process transactions
- 1.0.0-alpha released
- Oracle and worker can be run in YARN
- Changed project name to Fluo
- 1.0.0-beta releasing soon
- Solidified Fluo Client/Observer API
- Automated running Fluo cluster on Amazon EC2
- Multi-application support
- Improved how observer notifications are found
- Created Stress Test

Fluo Stress Test
- Motivation: Needed a test that stresses Fluo and is easy to verify for correctness
- The stress test computes the number of unique integers by building a bitwise trie
- New integers are added at leaf nodes
- Observers watch all nodes, create parents, and percolate totals up to the root node
- The test passes if the count at the root is the same as the number of leaf nodes
- Multiple transactions can operate on the same nodes, causing collisions

Example trie:
  xxxx = 5
  11xx = 3    10xx = 0    01xx = 1    00xx = 1
  Leaf nodes: 1110, 1100, 0101, 0001, 1110
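The counting idea behind the stress test can be sketched in a few lines of Java. This is not the fluo-stress code: the class is hypothetical, the percolation runs inline rather than through observers and transactions, and the trie resolves two bits per level to mirror the four-way fan-out in the slide's example. The input values are illustrative; 1111 is a made-up leaf added so that 11xx reaches 3.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical in-memory version of the unique-integer trie; not the fluo-stress code.
class TrieCountSketch {
    static final int BITS = 4; // width of each integer
    static final int STEP = 2; // bits resolved per trie level

    // Node name -> number of unique integers beneath it, e.g. "11xx" -> 3.
    final Map<String, Integer> counts = new HashMap<>();

    /** Names the node that fixes only the first knownBits bits of value. */
    static String node(int value, int knownBits) {
        StringBuilder name = new StringBuilder();
        for (int i = BITS - 1; i >= 0; i--) {
            boolean known = (BITS - 1 - i) < knownBits;
            name.append(known ? (char) ('0' + ((value >> i) & 1)) : 'x');
        }
        return name.toString();
    }

    /** Adds a leaf; a new leaf percolates +1 through every ancestor up to the root. */
    void add(int value) {
        if (counts.putIfAbsent(node(value, BITS), 1) != null) {
            return; // duplicate integer: nothing to percolate
        }
        for (int known = BITS - STEP; known >= 0; known -= STEP) {
            counts.merge(node(value, known), 1, Integer::sum);
        }
    }

    int count(String nodeName) {
        return counts.getOrDefault(nodeName, 0);
    }

    /** Leaves are the fully specified nodes, i.e. names without an 'x'. */
    long leafCount() {
        return counts.keySet().stream().filter(n -> !n.contains("x")).count();
    }

    public static void main(String[] args) {
        TrieCountSketch trie = new TrieCountSketch();
        // 0b1110 appears twice to show that duplicates are not counted again.
        for (int v : new int[] {0b1110, 0b1100, 0b1111, 0b0101, 0b0001, 0b1110}) {
            trie.add(v);
        }
        System.out.println(trie.count("11xx") + " " + trie.count("10xx") + " "
            + trie.count("01xx") + " " + trie.count("00xx")); // 3 0 1 1
        System.out.println(trie.count("xxxx") + " == " + trie.leafCount()); // 5 == 5
    }
}
```

The final line is the correctness check described above: the count at the root must equal the number of leaf nodes.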

Easy to run Fluo
1. On a machine with Maven and Git, clone the fluo-dev and fluo repos
2. Follow some basic configuration steps
3. Run the following commands:
  fluo-dev download   # Downloads Accumulo, Hadoop, Zookeeper tarballs
  fluo-dev setup      # Sets up Accumulo, Hadoop, etc. locally
  fluo-dev deploy     # Builds the Fluo distribution and deploys it locally
  fluo new myapp      # Creates configuration for the 'myapp' Fluo application
  fluo init myapp     # Initializes 'myapp' in Zookeeper
  fluo start myapp    # Starts the oracle and worker processes of 'myapp' in YARN
  fluo scan myapp     # Prints a snapshot of the data in the Fluo table of 'myapp'
It’s just as easy to run a Fluo cluster on Amazon EC2

Fluo Ecosystem
Repository          Description
fluo                Main project repo
fluo-quickstart     Simple Fluo example
fluo-stress         Stresses Fluo on cluster
fluo-io.github.io   Fluo project website
phrasecount         In-depth Fluo example
fluo-deploy         Run Fluo on EC2 cluster
fluo-dev            Helps developers run Fluo locally

Future Direction
- Primary focus: Ship a production-ready 1.0 release with a stable API
- Other possible work:
  - Fluo-32: Real-world example application
    - Possibly using CommonCrawl data
  - Fluo-58: Support writing observers in Python
  - Fluo-290: Support running Fluo on Mesos
  - Fluo-478: Automatically scale Fluo workers up and down based on workload

Get involved!
1. Experiment with Fluo
  - API has stabilized
  - Tools and development process make it easy
  - Not recommended for production yet (wait for 1.0)
2. Contribute to Fluo
  - ~85 open issues on GitHub
  - Review-then-commit process
