Talk Abstract
Fluo provides a framework to incrementally process large datasets stored in Accumulo. Using Fluo, developers can write applications that maintain a large-scale computation using a series of small transactional updates. When compared to batch processing frameworks, Fluo enables lower-latency, continuous analysis of data by sacrificing throughput. This talk will provide an overview of the Fluo project by touching on its design, use cases, and API. The talk will show how developers can write Fluo applications to solve problems in a new way. It will highlight the benefits of using Fluo as well as cover the trade-offs and potential problems developers may face when writing Fluo applications. The talk will end with a discussion of the current status and future direction of the Fluo project.
Speaker
Michael Walch
Software Engineer, Peterson Technologies
Mike is a software engineer and committer on the Fluo project. He has a background in distributed systems and data science. He holds a Master's in Computer Science from Johns Hopkins University and a B.S. in Electrical & Computer Engineering from Carnegie Mellon University.
4. Solution 2 - Maintain counts using Fluo
Fluo Table
Website        # Inbound
fluo.io        53
github.com     1,385,192
apache.org     2,528,190
nytimes.com    53,395,000

[Diagram: Internet, Web Crawler, Web Cache, and the Fluo Table; +1 / -1 updates are applied to the inbound counts in the Fluo Table.]
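The +1 / -1 idea on this slide can be sketched without Fluo at all. The class below is a minimal in-memory illustration, not Fluo API: the class and method names are hypothetical, and it assumes the web cache holds the links seen the last time a page was crawled, so that only the difference needs to be applied to the counts.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical in-memory stand-in for the Fluo table on the slide; not Fluo API.
class InboundLinkCounter {
  // page URL -> sites it linked to at the last crawl (assumed role of the web cache)
  private final Map<String, Set<String>> cachedLinks = new HashMap<>();
  // website -> number of inbound links
  private final Map<String, Long> inbound = new HashMap<>();

  // Called whenever the crawler fetches a page: apply only the difference.
  void pageCrawled(String pageUrl, Set<String> linkedSites) {
    Set<String> previous = cachedLinks.getOrDefault(pageUrl, new HashSet<>());
    for (String site : linkedSites) {
      if (!previous.contains(site)) {
        inbound.merge(site, 1L, Long::sum); // link appeared: +1
      }
    }
    for (String site : previous) {
      if (!linkedSites.contains(site)) {
        inbound.merge(site, -1L, Long::sum); // link disappeared: -1
      }
    }
    cachedLinks.put(pageUrl, new HashSet<>(linkedSites));
  }

  long inboundCount(String site) {
    return inbound.getOrDefault(site, 0L);
  }

  public static void main(String[] args) {
    InboundLinkCounter counter = new InboundLinkCounter();
    counter.pageCrawled("example.com/a", Set.of("fluo.io", "github.com"));
    counter.pageCrawled("example.com/b", Set.of("fluo.io"));
    counter.pageCrawled("example.com/a", Set.of("github.com")); // link to fluo.io removed
    System.out.println(counter.inboundCount("fluo.io"));    // prints 1
    System.out.println(counter.inboundCount("github.com")); // prints 1
  }
}
```

Each crawl touches only the counts of sites whose links changed, which is the property that lets the computation be kept up to date with small updates instead of a full recount.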
5. Solution 3 - Use both: update popular sites using batch processing & update long tail using Fluo
[Figure: Website distribution by # inbound links (nytimes.com, github.com, fluo.io). Popular sites: update every hour using MapReduce. Long tail: update in real-time using Fluo.]
6. Fluo 101 - Basics
- Provides cross-row transactions and snapshot isolation, which
make it safe to do concurrent updates
- Allows for incremental processing of data
- Based on Google’s Percolator paper
- Started as a side project by Keith Turner in 2013
- Originally called Accismus
- Tested using synthetic workloads
- Almost ready for production environments
7. Fluo 101 - Accumulo vs Fluo
- Fluo is a transactional API built on top of Accumulo
- Fluo stores its data in Accumulo
- Fluo uses Accumulo conditional mutations for transactions
- Fluo has a table structure (row, column, value) similar to Accumulo
except Fluo has no timestamp
- Each Fluo application runs its own processes
- Oracle allocates timestamps for transactions
- Workers run user code (called observers) that perform transactions
9. Fluo 101 - Client API
Used by developers to ingest data or interact with Fluo from
external applications (REST services, crawlers, etc.)
public void addDocument(FluoClient fluoClient, String docId, String content) {
  TypeLayer typeLayer = new TypeLayer(new StringEncoder());
  // try-with-resources closes the transaction when the block exits
  try (TypedTransaction tx1 = typeLayer.wrap(fluoClient.newTransaction())) {
    // only write the document if it is not already in the table
    if (tx1.get().row(docId).col(CONTENT_COL).toString() == null) {
      tx1.mutate().row(docId).col(CONTENT_COL).set(content);
      tx1.commit();
    }
  }
}
10. Fluo 101 - Observers
- Developers can write observers that are triggered when a column is
modified and run by Fluo workers.
- Best practice: Do work/transactions in observers over client code
public class DocumentObserver extends TypedObserver {
  @Override
  public void process(TypedTransactionBase tx, Bytes row, Column column) {
    // do work here
  }

  @Override
  public ObservedColumn getObservedColumn() {
    return new ObservedColumn(CONTENT_COL, NotificationType.STRONG);
  }
}
11. Example Fluo Application
- Problem: Maintain word & document counts as documents
are added and deleted from Fluo in real time
- Fluo client performs two actions:
1. Add document to table
2. Mark document for deletion
- Which triggers two observers:
- Add Observer - increase word and document counts
- Delete Observer - decrease counts and clean up
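The bookkeeping that the two observers perform can be sketched in plain Java. This is an in-memory illustration only: the class and method names are hypothetical, no Fluo API is used, and the client action and the observer's work are collapsed into one method call. Removing a word entry when its count reaches zero is an assumption about what "clean up" means.

```java
import java.util.Map;
import java.util.TreeMap;

// In-memory sketch of the example application; names are hypothetical, not Fluo API.
class WordCountSketch {
  private final Map<String, String> docs = new TreeMap<>();     // stands in for "d : <id>" rows
  private final Map<String, Long> wordCounts = new TreeMap<>(); // stands in for "w : <word>" rows
  private long totalDocs = 0;                                   // stands in for the total row

  // Client action 1 (add document) plus the work of the add observer.
  void addDocument(String docId, String content) {
    if (docs.putIfAbsent(docId, content) != null) {
      return; // document already present, nothing to count
    }
    for (String word : content.split("\\s+")) {
      wordCounts.merge(word, 1L, Long::sum);
    }
    totalDocs++;
  }

  // Client action 2 (mark for deletion) plus the work of the delete observer.
  void deleteDocument(String docId) {
    String content = docs.remove(docId);
    if (content == null) {
      return; // unknown document
    }
    for (String word : content.split("\\s+")) {
      // Assumed clean-up rule: drop the entry once its count reaches zero.
      wordCounts.computeIfPresent(word, (w, n) -> n > 1 ? Long.valueOf(n - 1) : null);
    }
    totalDocs--;
  }

  long count(String word) {
    return wordCounts.getOrDefault(word, 0L);
  }

  long totalDocs() {
    return totalDocs;
  }

  public static void main(String[] args) {
    WordCountSketch table = new WordCountSketch();
    table.addDocument("doc1", "my first hello world");
    table.addDocument("doc2", "second hello world");
    System.out.println(table.wordCounts); // {first=1, hello=2, my=1, second=1, world=2}
    table.deleteDocument("doc1");
    System.out.println(table.wordCounts); // {hello=1, second=1, world=1}
    System.out.println(table.totalDocs()); // 1
  }
}
```

In a real Fluo application the counting would run inside observer transactions triggered by column changes rather than inline with the client call.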
12. Add first document to table
Fluo Table
Row         Column   Value
d : doc1    doc      my first hello world

[Diagram: Fluo Client (Client Cluster), Add Observer, Delete Observer]
13. An observer increments word counts
Fluo Table
Row            Column   Value
d : doc1       doc      my first hello world
w : first      cnt      1
w : hello      cnt      1
w : my         cnt      1
w : world      cnt      1
total : docs   cnt      1

[Diagram: Fluo Client (Client Cluster), Add Observer, Delete Observer]
14. A second document is added
Fluo Table
Row           Column   Value
d : doc1      doc      my first hello world
d : doc2      doc      second hello world
w : first     cnt      1
w : hello     cnt      2
w : my        cnt      1
w : second    cnt      1
w : world     cnt      2
total : doc   cnt      2

[Diagram: Fluo Client (Client Cluster), Add Observer, Delete Observer]
15. First document is marked for deletion
Fluo Table
Row           Column   Value
d : doc1      doc      my first hello world
d : doc1      delete
d : doc2      doc      second hello world
w : first     cnt      1
w : hello     cnt      2
w : my        cnt      1
w : second    cnt      1
w : world     cnt      2
total : doc   cnt      2

[Diagram: Fluo Client (Client Cluster), Add Observer, Delete Observer]
16. Observer decrements counts and deletes document
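Fluo Table
Row           Column   Value
d : doc1      doc      my first hello world
d : doc1      delete
d : doc2      doc      second hello world
w : first     cnt      1
w : hello     cnt      1
w : my        cnt      1
w : second    cnt      1
w : world     cnt      1
total : doc   cnt      1

[The slide's visual marking of removed entries did not survive as text; per the slide title, the observer deletes the doc1 rows after decrementing the counts.]
[Diagram: Fluo Client (Client Cluster), Add Observer, Delete Observer]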
17. Things to watch out for...
- Collisions occur when two transactions update the same data at the
same time
- Only one transaction will succeed. Others need to be retried.
- Some OK but too many can slow computation
- Avoid collisions by not updating same row/column on every transaction
- Write Skew occurs when two transactions read an overlapping data set
and make disjoint updates without seeing the other update
- Result is different than if transactions were serialized
- Prevent write skew by making both transactions update same row/column. If
concurrent, a collision will occur and only one transaction will succeed.
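The collision rule above (one transaction commits, the others retry) can be illustrated with a small simulation. This sketch does not use Fluo: a single versioned cell stands in for one row/column, a compare-and-set stands in for the commit check, and all names are hypothetical.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicReference;

// Simulation of collide-and-retry on one cell; not Fluo API.
class CollisionRetrySketch {
  // Immutable (version, value) pair standing in for one row/column.
  static final class Versioned {
    final long version;
    final long value;

    Versioned(long version, long value) {
      this.version = version;
      this.value = value;
    }
  }

  private final AtomicReference<Versioned> cell = new AtomicReference<>(new Versioned(0, 0));
  private final AtomicInteger collisions = new AtomicInteger();

  // One "transaction": read a snapshot, compute, commit only if nobody committed since.
  boolean tryIncrement() {
    Versioned snapshot = cell.get();
    Versioned updated = new Versioned(snapshot.version + 1, snapshot.value + 1);
    return cell.compareAndSet(snapshot, updated); // false means a collision
  }

  // Retry until the commit succeeds, counting each collision.
  void incrementWithRetry() {
    while (!tryIncrement()) {
      collisions.incrementAndGet();
    }
  }

  long value() {
    return cell.get().value;
  }

  // Runs several workers that all update the same cell and returns the final value.
  static long runConcurrently(int threads, int perThread) {
    CollisionRetrySketch sketch = new CollisionRetrySketch();
    Thread[] workers = new Thread[threads];
    for (int i = 0; i < threads; i++) {
      workers[i] = new Thread(() -> {
        for (int j = 0; j < perThread; j++) {
          sketch.incrementWithRetry();
        }
      });
      workers[i].start();
    }
    for (Thread worker : workers) {
      try {
        worker.join();
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        throw new IllegalStateException(e);
      }
    }
    System.out.println("collisions retried: " + sketch.collisions.get()); // varies per run
    return sketch.value();
  }

  public static void main(String[] args) {
    System.out.println(runConcurrently(4, 1000)); // final value is 4000
  }
}
```

No update is lost because every failed commit is retried, but every worker hits the same cell, so the number of retries grows with concurrency. That is the slowdown the slide warns about when every transaction updates the same row/column.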
18. How does Fluo fit in?
[Figure: processing models plotted by large join throughput (lower to higher) against processing latency (slower to faster): batch processing (MapReduce, Spark), incremental processing (Fluo, Percolator), and stream processing (Storm).]
19. Don’t use Fluo if...
1. You want to do ad-hoc analysis on your data
(use batch processing instead)
2. Your incoming data is being joined with a small data set
(use stream processing instead)
20. Use Fluo if...
1. You want to maintain a large-scale computation
using a series of small transactional updates
2. Periodic batch processing jobs are taking too long to
join new data with existing data
21. Fluo Application Lifecycle
1. Use batch processing to seed computation with historical data
2. Use Fluo to process incoming data and maintain computation in
real-time
3. While processing, Fluo can be queried and notifications can be
sent to users
22. Major Progress
2010: Google releases Percolator paper
2013: Keith Turner starts work on Percolator implementation for Accumulo as a side project (originally called Accismus)
2014 - 2015:
- Fluo can process transactions
- 1.0.0-alpha released
- Oracle and worker can be run in YARN
- Changed project name to Fluo
- 1.0.0-beta releasing soon
- Solidified Fluo Client/Observer API
- Automated running Fluo cluster on Amazon EC2
- Multi-application support
- Improved how observer notifications are found
- Created Stress Test
23. Fluo Stress Test
- Motivation: Needed test that stresses Fluo
and is easy to verify for correctness
- The stress test computes the number of
unique integers by building a bitwise trie
- New integers are added at leaf nodes
- Observers watch all nodes, create parents,
and percolate total up to root node
- Test runs successfully if count at root is
the same as the number of leaf nodes
- Multiple transactions can operate on same
nodes causing collisions
[Figure: bitwise trie. Root: xxxx = 5. Children: 11xx = 3, 10xx = 0, 01xx = 1, 00xx = 1. Leaf labels as extracted: 1110, 1100, 0101, 0001, 1110.]
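The counting scheme of the stress test can be sketched in a few lines. This is a sequential, in-memory illustration with hypothetical names, not the fluo-stress code: it assumes 4-bit integers and 2 bits per trie level, as in the slide's figure, and it percolates counts up inline instead of through observers running separate transactions.

```java
import java.util.HashMap;
import java.util.Map;

// Sequential sketch of counting unique integers with a bitwise trie; not the fluo-stress code.
class TrieCountSketch {
  static final int BITS = 4; // width of each integer (assumed, as in the 4-bit figure)
  static final int STEP = 2; // bits consumed per trie level (assumed)

  // node label (for example "11xx") -> number of unique leaves beneath it
  private final Map<String, Integer> counts = new HashMap<>();

  // Builds a node label: the first prefixBits bits of value, padded with 'x'.
  static String label(int value, int prefixBits) {
    int masked = value & ((1 << BITS) - 1);
    // Setting bit BITS and dropping the first character gives a zero-padded binary string.
    String bin = Integer.toBinaryString((1 << BITS) | masked).substring(1);
    return bin.substring(0, prefixBits) + "x".repeat(BITS - prefixBits);
  }

  // Adds an integer at a leaf and percolates +1 up to the root if the leaf is new.
  void add(int value) {
    if (counts.putIfAbsent(label(value, BITS), 1) != null) {
      return; // duplicate integer: the leaf already exists, so no count changes
    }
    for (int prefixBits = BITS - STEP; prefixBits >= 0; prefixBits -= STEP) {
      counts.merge(label(value, prefixBits), 1, Integer::sum);
    }
  }

  int count(String node) {
    return counts.getOrDefault(node, 0);
  }

  public static void main(String[] args) {
    TrieCountSketch trie = new TrieCountSketch();
    // Illustrative inputs chosen to reproduce the counts in the figure; 14 is added twice.
    int[] inputs = {0b1110, 0b1100, 0b1111, 0b0101, 0b0001, 0b1110};
    for (int value : inputs) {
      trie.add(value);
    }
    System.out.println(trie.count("xxxx")); // 5
    System.out.println(trie.count("11xx")); // 3
    System.out.println(trie.count("10xx")); // 0
  }
}
```

The verification step is the same as on the slide: the count at the root must equal the number of distinct leaves. In the real test many transactions update shared parent nodes concurrently, which is what produces the collisions.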
24. Easy to run Fluo
1. On machine with Maven+Git, clone the fluo-dev and fluo repos
2. Follow some basic configuration steps
3. Run the following commands
  fluo-dev download   # Downloads Accumulo, Hadoop, Zookeeper tarballs
  fluo-dev setup      # Sets up locally Accumulo, Hadoop, etc
  fluo-dev deploy     # Build Fluo distribution and deploy locally
  fluo new myapp      # Create configuration for ‘myapp’ Fluo application
  fluo init myapp     # Initialize ‘myapp’ in Zookeeper
  fluo start myapp    # Start the oracle and worker processes of ‘myapp’ in YARN
  fluo scan myapp     # Print snapshot of data in Fluo table of ‘myapp’

It’s just as easy to run a Fluo cluster on Amazon EC2
25. Fluo Ecosystem
fluo                Main project repo
fluo-quickstart     Simple Fluo example
fluo-stress         Stresses Fluo on cluster
fluo-io.github.io   Fluo project website
phrasecount         In-depth Fluo example
fluo-deploy         Run Fluo on EC2 cluster
fluo-dev            Helps developers run Fluo locally
26. Future Direction
- Primary focus: Ship a production-ready 1.0 release with a stable API
- Other possible work:
- Fluo-32: Real world example application
- Possibly using CommonCrawl data
- Fluo-58: Support writing observers in Python
- Fluo-290: Support running Fluo on Mesos
- Fluo-478: Automatically scale up & down Fluo workers based on
workload
27. Get involved!
1. Experiment with Fluo
- API has stabilized
- Tools and development process make it easy
- Not recommended for production yet (wait for 1.0)
2. Contribute to Fluo
- ~85 open issues on GitHub
- Review-then-commit process