Mike Walch
Using Fluo to incrementally process data in Accumulo
Problem: Maintain counts of inbound links
Example Graph (nodes): fluo.io, github.com, apache.org, nytimes.com

Example Data
Website         # Inbound Links
fluo.io         0
github.com      3
apache.org      2
nytimes.com     0
Solution 1 - Maintain counts using batch processing
Link count change log
Website         # Inbound
fluo.io         +1
github.com      -1
apache.org      +1
github.com      -1
nytimes.com     +1
apache.org      +1

Last Hour Aggregates
Website         # Inbound
fluo.io         +1
github.com      -23
apache.org      +65
nytimes.com     +105

Historical
Website         # Inbound
fluo.io         53
github.com      1,385,192
apache.org      2,528,190
nytimes.com     53,395,000

Latest Counts
Website         # Inbound
fluo.io         54
github.com      1,385,169
apache.org      2,528,255
nytimes.com     53,395,105

Diagram components: Internet, Web Crawler, Web Cache, MapReduce (two jobs)
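The batch approach above can be sketched in plain Java: one pass sums the change log into per-site aggregates, and a second pass adds those aggregates to the historical counts. This is only an illustration of the arithmetic on the slide. The class and method names are hypothetical, and in-memory maps stand in for the MapReduce jobs.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration; in-memory maps stand in for the two MapReduce jobs.
class BatchLinkCounts {
    /** Sums a log of (website, delta) entries into one net change per website. */
    static Map<String, Long> aggregate(String[] sites, long[] deltas) {
        Map<String, Long> totals = new LinkedHashMap<>();
        for (int i = 0; i < sites.length; i++) {
            totals.merge(sites[i], deltas[i], Long::sum);
        }
        return totals;
    }

    /** Adds the aggregated changes to the historical counts to get the latest counts. */
    static Map<String, Long> merge(Map<String, Long> historical, Map<String, Long> changes) {
        Map<String, Long> latest = new LinkedHashMap<>(historical);
        changes.forEach((site, delta) -> latest.merge(site, delta, Long::sum));
        return latest;
    }

    public static void main(String[] args) {
        // Values copied from the "Historical" and "Last Hour Aggregates" tables.
        Map<String, Long> historical = new LinkedHashMap<>();
        historical.put("fluo.io", 53L);
        historical.put("github.com", 1_385_192L);
        historical.put("apache.org", 2_528_190L);
        historical.put("nytimes.com", 53_395_000L);

        Map<String, Long> lastHour = new LinkedHashMap<>();
        lastHour.put("fluo.io", 1L);
        lastHour.put("github.com", -23L);
        lastHour.put("apache.org", 65L);
        lastHour.put("nytimes.com", 105L);

        System.out.println(merge(historical, lastHour));
    }
}
```

With the slide's numbers, the merged result matches the "Latest Counts" table. The cost of this design is that counts are only as fresh as the last batch run.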
Solution 2 - Maintain counts using Fluo
Fluo Table
Website         # Inbound
fluo.io         53
github.com      1,385,192
apache.org      2,528,190
nytimes.com     53,395,000

Incoming updates applied to the table: +1, -1

Diagram components: Internet, Web Crawler, Web Cache
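By contrast, the incremental approach applies each +1 or -1 to the stored count as soon as it arrives, so no separate aggregation pass is needed. The sketch below uses a plain in-memory map as a stand-in for the Fluo table. The class name is hypothetical, and it leaves out the transactions that Fluo would provide for concurrent updates.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration; a HashMap stands in for the Fluo table.
class IncrementalLinkCounts {
    private final Map<String, Long> counts = new HashMap<>();

    /** Seeds the table, e.g. with counts produced earlier by a batch job. */
    IncrementalLinkCounts(Map<String, Long> seed) {
        counts.putAll(seed);
    }

    /** Applies one +1/-1 change as soon as the crawler reports it. */
    void apply(String site, long delta) {
        counts.merge(site, delta, Long::sum);
    }

    /** Returns the current count, or 0 for a site that has never been seen. */
    long get(String site) {
        return counts.getOrDefault(site, 0L);
    }

    public static void main(String[] args) {
        Map<String, Long> seed = new HashMap<>();
        seed.put("fluo.io", 53L);
        seed.put("github.com", 1_385_192L);
        IncrementalLinkCounts table = new IncrementalLinkCounts(seed);

        table.apply("fluo.io", +1);     // a new link to fluo.io was found
        table.apply("github.com", -1);  // a link to github.com disappeared

        System.out.println(table.get("fluo.io") + " " + table.get("github.com"));
    }
}
```

Every read sees the effect of all changes applied so far, which is what removes the hourly delay of the batch design.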
Solution 3 - Use both: update popular sites using batch processing & update long tail using Fluo
Chart: # Inbound Links (y-axis) vs. Website Distribution (x-axis)
- Sites shown along the distribution: nytimes.com, github.com, fluo.io
- Head of the distribution: Update every hour using MapReduce
- Long tail: Update in real-time using Fluo
Fluo 101 - Basics
- Provides cross-row transactions and snapshot isolation, which make concurrent updates safe
- Allows for incremental processing of data
- Based on Google’s Percolator paper
- Started as a side project by Keith Turner in 2013
- Originally called Accismus
- Tested using synthetic workloads
- Almost ready for production environments
Fluo 101 - Accumulo vs Fluo
- Fluo is a transactional API built on top of Accumulo
- Fluo stores its data in Accumulo
- Fluo uses Accumulo conditional mutations for transactions
- Fluo has a table structure (row, column, value) similar to Accumulo, except Fluo has no timestamp
- Each Fluo application runs its own processes
  - Oracle allocates timestamps for transactions
  - Workers run user code (called observers) that performs transactions
Fluo 101 - Architecture
Diagram (components by layer):
- Client Cluster: Fluo Client for App 1 (two instances), Fluo Client for App 2
- Fluo Application 1: Fluo Oracle, two Fluo Workers (each running Observer1 and Observer2)
- Fluo Application 2: Fluo Oracle, one Fluo Worker (running ObserverA)
- Accumulo: Table1, Table2
- Underlying services: HDFS, Zookeeper, YARN
Fluo 101 - Client API
Used by developers to ingest data or interact with Fluo from
external applications (REST services, crawlers, etc)
public void addDocument(FluoClient fluoClient, String docId, String content) {
  TypeLayer typeLayer = new TypeLayer(new StringEncoder());
  // try-with-resources closes the transaction when the block exits
  try (TypedTransaction tx1 = typeLayer.wrap(fluoClient.newTransaction())) {
    // Only write the document if it has not been stored yet
    if (tx1.get().row(docId).col(CONTENT_COL).toString() == null) {
      tx1.mutate().row(docId).col(CONTENT_COL).set(content);
      tx1.commit();
    }
  }
}
Fluo 101 - Observers
- Developers write observers, which Fluo workers run whenever an observed column is modified.
- Best practice: Do work/transactions in observers rather than in client code
public class DocumentObserver extends TypedObserver {
  @Override
  public void process(TypedTransactionBase tx, Bytes row, Column column) {
    // do work here
  }

  // Declares which column triggers this observer
  @Override
  public ObservedColumn getObservedColumn() {
    return new ObservedColumn(CONTENT_COL, NotificationType.STRONG);
  }
}
Example Fluo Application
- Problem: Maintain word & document counts as documents are added and deleted from Fluo in real time
- Fluo client performs two actions:
  1. Add document to table
  2. Mark document for deletion
- These actions trigger two observers:
  - Add Observer - increase word and document counts
  - Delete Observer - decrease counts and clean up
Add first document to table
Fluo Table
Row            Column    Value
d : doc1       doc       my first hello world

Diagram components: Fluo Client (Client Cluster), Add Observer, Delete Observer
An observer increments word counts
Fluo Table
Row            Column    Value
d : doc1       doc       my first hello world
w : first      cnt       1
w : hello      cnt       1
w : my         cnt       1
w : world      cnt       1
total : docs   cnt       1

Diagram components: Fluo Client (Client Cluster), Add Observer, Delete Observer
A second document is added
Fluo Table
Row            Column    Value
d : doc1       doc       my first hello world
d : doc2       doc       second hello world
w : first      cnt       1
w : hello      cnt       2
w : my         cnt       1
w : second     cnt       1
w : world      cnt       2
total : doc    cnt       2

Diagram components: Fluo Client (Client Cluster), Add Observer, Delete Observer
First document is marked for deletion
Fluo Table
Row            Column    Value
d : doc1       doc       my first hello world
d : doc1       delete
d : doc2       doc       second hello world
w : first      cnt       1
w : hello      cnt       2
w : my         cnt       1
w : second     cnt       1
w : world      cnt       2
total : doc    cnt       2

Diagram components: Fluo Client (Client Cluster), Add Observer, Delete Observer
Observer decrements counts and deletes document
Fluo Table
Row            Column    Value
d : doc1       doc       my first hello world
d : doc1       delete
d : doc2       doc       second hello world
w : first      cnt       1
w : hello      cnt       1
w : my         cnt       1
w : second     cnt       1
w : world      cnt       1
total : doc    cnt       1

Diagram components: Fluo Client (Client Cluster), Add Observer, Delete Observer
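The walkthrough above can be reproduced with a small, self-contained simulation. This is a sketch only and does not use the Fluo API: a sorted map stands in for the Fluo table, the two "observers" are ordinary methods called synchronously (real observers are run by Fluo workers), and all class and method names are hypothetical. It also assumes that a counter reaching zero is removed as part of the clean up.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical simulation of the example application; not the Fluo API.
class WordCountSketch {
    // Stand-in for the Fluo table: "row|column" -> value, sorted for printing.
    final Map<String, String> table = new TreeMap<>();

    /** Client action 1: add a document, then run the add "observer". */
    void addDocument(String docId, String content) {
        String key = "d:" + docId + "|doc";
        if (table.containsKey(key)) {
            return; // document already stored, nothing to do
        }
        table.put(key, content);
        addObserver(content);
    }

    /** Client action 2: mark a document for deletion, then run the delete "observer". */
    void markForDeletion(String docId) {
        if (!table.containsKey("d:" + docId + "|doc")) {
            return; // unknown document
        }
        table.put("d:" + docId + "|delete", "");
        deleteObserver(docId);
    }

    /** Increases word counts and the document total. */
    private void addObserver(String content) {
        for (String word : content.split("\\s+")) {
            adjust("w:" + word + "|cnt", 1);
        }
        adjust("total:docs|cnt", 1);
    }

    /** Decreases counts, then removes the document and its delete marker. */
    private void deleteObserver(String docId) {
        String content = table.remove("d:" + docId + "|doc");
        table.remove("d:" + docId + "|delete");
        for (String word : content.split("\\s+")) {
            adjust("w:" + word + "|cnt", -1);
        }
        adjust("total:docs|cnt", -1);
    }

    /** Adds delta to a counter cell; a counter that reaches zero is removed. */
    private void adjust(String key, int delta) {
        int updated = count(key) + delta;
        if (updated == 0) {
            table.remove(key);
        } else {
            table.put(key, Integer.toString(updated));
        }
    }

    /** Reads a counter cell, treating a missing cell as 0. */
    int count(String key) {
        return Integer.parseInt(table.getOrDefault(key, "0"));
    }

    public static void main(String[] args) {
        WordCountSketch app = new WordCountSketch();
        app.addDocument("doc1", "my first hello world");
        app.addDocument("doc2", "second hello world");
        app.markForDeletion("doc1");
        app.table.forEach((k, v) -> System.out.println(k + " = " + v));
    }
}
```

After the three calls in `main`, only doc2 and the counts for its words remain, each with a count of 1.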
Things to watch out for...
- Collisions occur when two transactions update the same data at the same time
  - Only one transaction will succeed. Others need to be retried.
  - Some are OK, but too many can slow computation
  - Avoid collisions by not updating the same row/column on every transaction
- Write skew occurs when two transactions read an overlapping data set and make disjoint updates without seeing each other's update
  - Result is different than if the transactions were serialized
  - Prevent write skew by making both transactions update the same row/column. If concurrent, a collision will occur and only one transaction will succeed.
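Both pitfalls can be shown with a toy snapshot-isolation store. The sketch below is not how Fluo is implemented. It is a single-threaded model with hypothetical names (`MiniSnapshotStore`, `Tx`, and the cells "total", "alice", "bob", "guard") in which each transaction reads from a snapshot, and commit fails if any cell it writes was committed by another transaction after it started.

```java
import java.util.HashMap;
import java.util.Map;

// Toy single-threaded model of snapshot isolation; hypothetical, not Fluo's implementation.
class MiniSnapshotStore {
    private final Map<String, Integer> values = new HashMap<>();
    private final Map<String, Long> lastCommit = new HashMap<>(); // cell -> time of last committed write
    private long clock = 0;

    /** Reads from a snapshot taken at start and buffers writes until commit. */
    class Tx {
        private final long startTs = clock;
        private final Map<String, Integer> snapshot = new HashMap<>(values);
        private final Map<String, Integer> writes = new HashMap<>();

        int get(String cell) { return snapshot.getOrDefault(cell, 0); }

        void set(String cell, int value) { writes.put(cell, value); }

        /** Returns false (a collision) if a written cell changed after this transaction started. */
        boolean commit() {
            for (String cell : writes.keySet()) {
                if (lastCommit.getOrDefault(cell, 0L) > startTs) {
                    return false;
                }
            }
            long commitTs = ++clock;
            for (Map.Entry<String, Integer> w : writes.entrySet()) {
                values.put(w.getKey(), w.getValue());
                lastCommit.put(w.getKey(), commitTs);
            }
            return true;
        }
    }

    Tx begin() { return new Tx(); }

    int read(String cell) { return values.getOrDefault(cell, 0); }

    /** Two concurrent increments of one cell: the loser collides and is retried. */
    static int collisionDemo() {
        MiniSnapshotStore s = new MiniSnapshotStore();
        Tx a = s.begin();
        Tx b = s.begin();
        a.set("total", a.get("total") + 1);
        b.set("total", b.get("total") + 1);
        a.commit();                       // succeeds
        if (!b.commit()) {                // collides with a's write
            Tx retry = s.begin();         // the retry sees a's committed value
            retry.set("total", retry.get("total") + 1);
            retry.commit();
        }
        return s.read("total");
    }

    /** Each transaction sees two people on call and takes a different one off call. */
    static int writeSkewDemo(boolean touchSharedCell) {
        MiniSnapshotStore s = new MiniSnapshotStore();
        Tx init = s.begin();
        init.set("alice", 1);
        init.set("bob", 1);
        init.commit();

        Tx a = s.begin();
        Tx b = s.begin();
        if (a.get("alice") + a.get("bob") > 1) a.set("alice", 0);
        if (b.get("alice") + b.get("bob") > 1) b.set("bob", 0);
        if (touchSharedCell) {            // the fix: both also write the same cell
            a.set("guard", 1);
            b.set("guard", 1);
        }
        a.commit();
        b.commit();                       // fails only when the shared cell is written
        return s.read("alice") + s.read("bob"); // number of people still on call
    }

    public static void main(String[] args) {
        System.out.println("total after collision and retry: " + collisionDemo());
        System.out.println("on call with write skew: " + writeSkewDemo(false));
        System.out.println("on call with shared cell: " + writeSkewDemo(true));
    }
}
```

Without the shared cell, both transactions commit because their write sets are disjoint, and nobody is left on call. With the shared cell, the second commit collides and one person stays on call, which matches the prevention rule in the last bullet.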
How does Fluo fit in?
Chart: Large Join Throughput (y-axis, Lower to Higher) vs. Processing Latency (x-axis, Slower to Faster)
- Batch Processing: MapReduce, Spark
- Incremental Processing: Fluo, Percolator
- Stream Processing: Storm
Don’t use Fluo if...
1. You want to do ad-hoc analysis on your data (use batch processing instead)
2. Your incoming data is being joined with a small data set (use stream processing instead)
Use Fluo if...
1. You want to maintain a large-scale computation using a series of small transaction updates
2. Periodic batch processing jobs are taking too long to join new data with existing data
Fluo Application Lifecycle
1. Use batch processing to seed computation with historical data
2. Use Fluo to process incoming data and maintain computation in real-time
3. While processing, Fluo can be queried and notifications can be sent to users
Major Progress
2010
- Google releases Percolator paper
2013
- Keith Turner starts work on Percolator implementation for Accumulo as a side project (originally called Accismus)
2014 - 2015
- Fluo can process transactions
- 1.0.0-alpha released
- Oracle and worker can be run in YARN
- Changed project name to Fluo
- 1.0.0-beta releasing soon
- Solidified Fluo Client/Observer API
- Automated running Fluo cluster on Amazon EC2
- Multi-application support
- Improved how observer notifications are found
- Created Stress Test
Fluo Stress Test
- Motivation: Needed a test that stresses Fluo and is easy to verify for correctness
- The stress test computes the number of unique integers by building a bitwise trie
- New integers are added at leaf nodes
- Observers watch all nodes, create parents, and percolate totals up to the root node
- Test runs successfully if the count at the root is the same as the number of leaf nodes
- Multiple transactions can operate on the same nodes, causing collisions
Diagram (example trie): root xxxx = 5; second level 11xx = 3, 10xx = 0, 01xx = 1, 00xx = 1; leaf labels shown: 1110, 1100, 0101, 0001, 1110
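The counting idea behind the stress test can be sketched without Fluo. In the hypothetical class below, each new 4-bit integer becomes a leaf, and its count is pushed up to a two-bit prefix node and then to the root, using the same node labels as the diagram. The real test does this with observers and transactions, and the sequential loop here is only a stand-in. The input values in `main` are assumptions chosen to reproduce the diagram's totals.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical, sequential stand-in for the stress test's trie; not the fluo-stress code.
class TrieCountSketch {
    // Node label -> number of unique leaves at or beneath it, e.g. "1110", "11xx", "xxxx".
    final Map<String, Integer> counts = new HashMap<>();

    /** Adds a 4-bit integer as a leaf and percolates the count upward; duplicates are ignored. */
    void add(int value) {
        String leaf = String.format("%4s", Integer.toBinaryString(value & 0xF)).replace(' ', '0');
        if (counts.putIfAbsent(leaf, 1) != null) {
            return; // this integer was already counted
        }
        counts.merge(leaf.substring(0, 2) + "xx", 1, Integer::sum); // parent node
        counts.merge("xxxx", 1, Integer::sum);                      // root node
    }

    /** Returns the count stored at a node, or 0 if the node was never created. */
    int count(String node) {
        return counts.getOrDefault(node, 0);
    }

    /** Counts leaf nodes, i.e. labels with no wildcard 'x'. */
    long leafCount() {
        return counts.keySet().stream().filter(label -> label.indexOf('x') < 0).count();
    }

    public static void main(String[] args) {
        TrieCountSketch trie = new TrieCountSketch();
        // Assumed inputs: five unique values plus one duplicate.
        int[] inputs = {0b1110, 0b1100, 0b1101, 0b0101, 0b0001, 0b1110};
        for (int v : inputs) {
            trie.add(v);
        }
        // The check from the slide: root count must equal the number of leaves.
        System.out.println("root = " + trie.count("xxxx") + ", leaves = " + trie.leafCount());
    }
}
```

Because duplicates never reach the parent nodes, the root count equals the number of distinct integers, which is what makes the test easy to verify.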
Easy to run Fluo
1. On a machine with Maven+Git, clone the fluo-dev and fluo repos
2. Follow some basic configuration steps
3. Run the following commands
fluo-dev download   # Downloads Accumulo, Hadoop, Zookeeper tarballs
fluo-dev setup      # Sets up Accumulo, Hadoop, etc. locally
fluo-dev deploy     # Builds Fluo distribution and deploys it locally
fluo new myapp      # Creates configuration for ‘myapp’ Fluo application
fluo init myapp     # Initializes ‘myapp’ in Zookeeper
fluo start myapp    # Starts the oracle and worker processes of ‘myapp’ in YARN
fluo scan myapp     # Prints snapshot of data in Fluo table of ‘myapp’
It’s just as easy to run a Fluo cluster on Amazon EC2
Fluo Ecosystem
fluo                 Main Project Repo
fluo-quickstart      Simple Fluo example
fluo-stress          Stresses Fluo on cluster
fluo-io.github.io    Fluo project website
phrasecount          In-depth Fluo example
fluo-deploy          Run Fluo on EC2 cluster
fluo-dev             Helps developers run Fluo locally
Future Direction
- Primary focus: Ship a production-ready 1.0 release with a stable API
- Other possible work:
  - Fluo-32: Real world example application
    - Possibly using CommonCrawl data
  - Fluo-58: Support writing observers in Python
  - Fluo-290: Support running Fluo on Mesos
  - Fluo-478: Automatically scale up & down Fluo workers based on workload
Get involved!
1. Experiment with Fluo
   - API has stabilized
   - Tools and development process make it easy
   - Not recommended for production yet (wait for 1.0)
2. Contribute to Fluo
   - ~85 open issues on GitHub
   - Review-then-commit process

Accumulo Summit 2015: Using Fluo to incrementally process data in Accumulo [API]

Quantum Computing: Current Landscape and the Future Role of APIsQuantum Computing: Current Landscape and the Future Role of APIs
Quantum Computing: Current Landscape and the Future Role of APIs
 
RESUME BUILDER APPLICATION Project for students
RESUME BUILDER APPLICATION Project for studentsRESUME BUILDER APPLICATION Project for students
RESUME BUILDER APPLICATION Project for students
 

Accumulo Summit 2015: Using Fluo to incrementally process data in Accumulo [API]

  • 1. Mike Walch Using Fluo to incrementally process data in Accumulo
  • 2. Problem: Maintain counts of inbound links
       Example graph nodes: fluo.io, github.com, apache.org, nytimes.com
       Example data (Website: # Inbound Links):
         fluo.io: 0
         github.com: 3
         apache.org: 2
         nytimes.com: 0
  • 3. Solution 1 - Maintain counts using batch processing
       Diagram labels: Internet, Web Crawler, Web Cache, MapReduce, MapReduce
       Link count change log (Website: # Inbound): fluo.io +1; github.com -1; apache.org +1; github.com -1; nytimes.com +1; apache.org +1
       Last Hour Aggregates (Website: # Inbound): fluo.io +1; github.com -23; apache.org +65; nytimes.com +105
       Historical (Website: # Inbound): fluo.io 53; github.com 1,385,192; apache.org 2,528,190; nytimes.com 53,395,000
       Latest Counts (Website: # Inbound): fluo.io 54; github.com 1,385,169; apache.org 2,528,255; nytimes.com 53,395,105
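The batch approach on slide 3 amounts to two aggregation steps: sum the change log into per-site deltas, then add those deltas to the historical counts. A minimal plain-Java sketch of that arithmetic follows. It does not use MapReduce; the class and method names (`LinkCountBatchMerge`, `aggregate`, `applyDeltas`) are made up for illustration, and the figures in `main` are the ones shown on the slide.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of slide 3's two aggregation steps (not MapReduce code).
final class LinkCountBatchMerge {

    /** Sums individual +1 / -1 change-log entries into one delta per website. */
    static Map<String, Long> aggregate(List<Map.Entry<String, Long>> changeLog) {
        Map<String, Long> deltas = new LinkedHashMap<>();
        for (Map.Entry<String, Long> change : changeLog) {
            deltas.merge(change.getKey(), change.getValue(), Long::sum);
        }
        return deltas;
    }

    /** Adds per-site deltas to the historical counts and returns the latest counts. */
    static Map<String, Long> applyDeltas(Map<String, Long> historical, Map<String, Long> deltas) {
        Map<String, Long> latest = new LinkedHashMap<>(historical);
        deltas.forEach((site, delta) -> latest.merge(site, delta, Long::sum));
        return latest;
    }

    public static void main(String[] args) {
        // Historical counts and last-hour aggregates as shown on the slide.
        Map<String, Long> historical = new LinkedHashMap<>();
        historical.put("fluo.io", 53L);
        historical.put("github.com", 1_385_192L);
        historical.put("apache.org", 2_528_190L);
        historical.put("nytimes.com", 53_395_000L);

        Map<String, Long> lastHour = new LinkedHashMap<>();
        lastHour.put("fluo.io", 1L);
        lastHour.put("github.com", -23L);
        lastHour.put("apache.org", 65L);
        lastHour.put("nytimes.com", 105L);

        applyDeltas(historical, lastHour)
            .forEach((site, count) -> System.out.println(site + " " + count));
    }
}
```

Every site is rewritten on each run even when only a few counts changed, which is the cost the Fluo-based solutions on the following slides aim to avoid.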
  • 4. Solution 2 - Maintain counts using Fluo
       Diagram labels: Internet, Web Crawler, Web Cache, +1, -1, Fluo Table
       Fluo Table (Website: # Inbound): fluo.io 53; github.com 1,385,192; apache.org 2,528,190; nytimes.com 53,395,000
  • 5. Solution 3 - Use both: update popular sites using batch processing & update long tail using Fluo
       Chart: Website Distribution (axis: # Inbound Links; sites shown: nytimes.com, github.com, fluo.io)
       Chart annotations: "Update every hour using MapReduce"; "Update in real-time using Fluo"
  • 6. Fluo 101 - Basics
       - Provides cross-row transactions and snapshot isolation, which make it safe to do concurrent updates
       - Allows for incremental processing of data
       - Based on Google’s Percolator paper
       - Started as a side project by Keith Turner in 2013
       - Originally called Accismus
       - Tested using synthetic workloads
       - Almost ready for production environments
  • 7. Fluo 101 - Accumulo vs Fluo
       - Fluo is a transactional API built on top of Accumulo
       - Fluo stores its data in Accumulo
       - Fluo uses Accumulo conditional mutations for transactions
       - Fluo has a table structure (row, column, value) similar to Accumulo, except Fluo has no timestamp
       - Each Fluo application runs its own processes
         - Oracle allocates timestamps for transactions
         - Workers run user code (called observers) that performs transactions
  • 8. Fluo 101 - Architecture
       Diagram labels: Client; Cluster; Accumulo; HDFS; Zookeeper; YARN; Table1; Table2; Fluo Client for App 1 (x2); Fluo Client for App 2; Fluo Application 1; Fluo Application 2; Fluo Worker with Observer1 and Observer2 (x2); Fluo Worker with ObserverA; Fluo Oracle (x2)
  • 9. Fluo 101 - Client API
       Used by developers to ingest data or interact with Fluo from external applications (REST services, crawlers, etc.)
         public void addDocument(FluoClient fluoClient, String docId, String content) {
           TypeLayer typeLayer = new TypeLayer(new StringEncoder());
           // try-with-resources closes the transaction when the block exits
           try (TypedTransaction tx1 = typeLayer.wrap(fluoClient.newTransaction())) {
             // Only write and commit if no content is stored for this document yet
             if (tx1.get().row(docId).col(CONTENT_COL).toString() == null) {
               tx1.mutate().row(docId).col(CONTENT_COL).set(content);
               tx1.commit();
             }
           }
         }
  • 10. Fluo 101 - Observers
        - Developers can write observers, which are run by Fluo workers and triggered when a column is modified.
        - Best practice: Do work/transactions in observers rather than in client code
          public class DocumentObserver extends TypedObserver {
            // Called by a Fluo worker when the observed column of a row changes
            @Override
            public void process(TypedTransactionBase tx, Bytes row, Column column) {
              // do work here
            }
            // Declares which column triggers this observer
            @Override
            public ObservedColumn getObservedColumn() {
              return new ObservedColumn(CONTENT_COL, NotificationType.STRONG);
            }
          }
  • 11. Example Fluo Application
        - Problem: Maintain word & document counts as documents are added and deleted from Fluo in real time
        - Fluo client performs two actions:
          1. Add document to table
          2. Mark document for deletion
        - Which triggers two observers:
          - Add Observer - increase word and document counts
          - Delete Observer - decrease counts and clean up
  • 12. Add first document to table
        Diagram labels: Fluo Client, Client, Cluster, Add Observer, Delete Observer
        Fluo Table (Row | Column | Value):
          d : doc1 | doc | my first hello world
  • 13. An observer increments word counts
        Diagram labels: Fluo Client, Client, Cluster, Add Observer, Delete Observer
        Fluo Table (Row | Column | Value):
          d : doc1     | doc | my first hello world
          w : first    | cnt | 1
          w : hello    | cnt | 1
          w : my       | cnt | 1
          w : world    | cnt | 1
          total : docs | cnt | 1
  • 14. A second document is added
        Diagram labels: Fluo Client, Client, Cluster, Add Observer, Delete Observer
        Fluo Table (Row | Column | Value):
          d : doc1    | doc | my first hello world
          d : doc2    | doc | second hello world
          w : first   | cnt | 1
          w : hello   | cnt | 2
          w : my      | cnt | 1
          w : second  | cnt | 1
          w : world   | cnt | 2
          total : doc | cnt | 2
  • 15. First document is marked for deletion
        Diagram labels: Fluo Client, Client, Cluster, Add Observer, Delete Observer
        Fluo Table (Row | Column | Value):
          d : doc1    | doc    | my first hello world
          d : doc1    | delete |
          d : doc2    | doc    | second hello world
          w : first   | cnt    | 1
          w : hello   | cnt    | 2
          w : my      | cnt    | 1
          w : second  | cnt    | 1
          w : world   | cnt    | 2
          total : doc | cnt    | 2
  • 16. Observer decrements counts and deletes document
        Diagram labels: Fluo Client, Client, Cluster, Add Observer, Delete Observer
        Fluo Table as extracted from the slide (Row | Column | Value):
          d : doc1    | doc    | my first hello world
          d : doc1    | delete |
          d : doc2    | doc    | second hello world
          w : first   | cnt    | 1
          w : hello   | cnt    | 1
          w : my      | cnt    | 1
          w : second  | cnt    | 1
          w : world   | cnt    | 1
          total : doc | cnt    | 1
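The walkthrough on slides 12 to 16 can be mimicked without Fluo to check the bookkeeping. The sketch below is an in-memory stand-in, not Fluo code: the class `WordCountSketch` and its methods are hypothetical, the "observers" are ordinary method calls rather than triggered transactions, and it assumes each word occurrence is counted and that rows whose count reaches zero are removed as part of the clean-up.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory stand-in for the example application's table.
final class WordCountSketch {
    private final Map<String, String> docs = new HashMap<>();   // d:<docId> -> content
    private final Map<String, Long> counts = new TreeMap<>();   // w:<word> / total:docs -> cnt

    /** Client action 1: add a document, then run the "add observer" logic. */
    void addDocument(String docId, String content) {
        if (docs.putIfAbsent(docId, content) == null) {
            adjust(content, +1);
        }
    }

    /** Client action 2: delete a document, then run the "delete observer" logic. */
    void deleteDocument(String docId) {
        String content = docs.remove(docId);
        if (content != null) {
            adjust(content, -1);
        }
    }

    /** Applies +1 or -1 to every word in the document and to the document total. */
    private void adjust(String content, long delta) {
        for (String word : content.trim().split("\\s+")) {
            bump("w:" + word, delta);
        }
        bump("total:docs", delta);
    }

    /** Updates one counter; returning null from the merge function removes a zero row. */
    private void bump(String row, long delta) {
        counts.merge(row, delta, (old, d) -> old + d == 0 ? null : old + d);
    }

    Map<String, Long> counts() {
        return Collections.unmodifiableMap(counts);
    }

    public static void main(String[] args) {
        WordCountSketch table = new WordCountSketch();
        table.addDocument("doc1", "my first hello world");
        table.addDocument("doc2", "second hello world");
        System.out.println(table.counts());
        table.deleteDocument("doc1");
        System.out.println(table.counts());
    }
}
```

In a real Fluo application each of these updates would run inside a transaction, so concurrent adds and deletes touching the same word rows would either commit cleanly or collide and be retried.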
  • 17. Things to watch out for...
        - Collisions occur when two transactions update the same data at the same time
          - Only one transaction will succeed. Others need to be retried.
          - Some are OK, but too many can slow computation
          - Avoid collisions by not updating the same row/column on every transaction
        - Write skew occurs when two transactions read an overlapping data set and make disjoint updates without seeing the other update
          - Result is different than if transactions were serialized
          - Prevent write skew by making both transactions update the same row/column. If concurrent, a collision will occur and only one transaction will succeed.
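The collision behaviour described on slide 17 (one writer wins, the others retry) is the usual optimistic-concurrency pattern. The sketch below illustrates it on a single in-memory cell using `AtomicReference.compareAndSet` as a stand-in for a conditional write. It is an analogy only: `CollisionRetrySketch` and `Versioned` are hypothetical names, and nothing here is Fluo or Accumulo API.

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical single-cell analogy for collisions and retries (not Fluo code).
final class CollisionRetrySketch {

    /** Immutable (value, version) pair; a commit must present the snapshot it read. */
    static final class Versioned {
        final long value;
        final long version;

        Versioned(long value, long version) {
            this.value = value;
            this.version = version;
        }
    }

    private final AtomicReference<Versioned> cell =
        new AtomicReference<>(new Versioned(0, 0));

    Versioned read() {
        return cell.get();
    }

    /** Conditional write: succeeds only if no one committed since the snapshot was read. */
    boolean commit(Versioned snapshot, long newValue) {
        return cell.compareAndSet(snapshot, new Versioned(newValue, snapshot.version + 1));
    }

    /** Repeats read-modify-write until the commit succeeds; returns the attempts used. */
    int incrementWithRetry() {
        int attempts = 0;
        while (true) {
            attempts++;
            Versioned snapshot = read();
            if (commit(snapshot, snapshot.value + 1)) {
                return attempts;
            }
        }
    }

    public static void main(String[] args) {
        CollisionRetrySketch counter = new CollisionRetrySketch();
        // Two "transactions" read the same snapshot, then both try to commit.
        Versioned seenByA = counter.read();
        Versioned seenByB = counter.read();
        System.out.println("A committed: " + counter.commit(seenByA, seenByA.value + 1));
        System.out.println("B committed: " + counter.commit(seenByB, seenByB.value + 1));
        // B collided with A, so it retries against a fresh snapshot.
        counter.incrementWithRetry();
        System.out.println("final value: " + counter.read().value);
    }
}
```

Forcing two transactions to write the same cell, as the slide suggests for preventing write skew, works for the same reason: the second commit fails its condition and has to re-read.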
  • 18. How does Fluo fit in?
        Chart axes: Large Join Throughput (Lower to Higher) and Processing Latency (Slower to Faster)
        - Batch Processing: MapReduce, Spark
        - Incremental Processing: Fluo, Percolator
        - Stream Processing: Storm
  • 19. Don’t use Fluo if...
        1. You want to do ad-hoc analysis on your data (use batch processing instead)
        2. Your incoming data is being joined with a small data set (use stream processing instead)
  • 20. Use Fluo if...
        1. You want to maintain a large-scale computation using a series of small transactional updates
        2. Periodic batch processing jobs are taking too long to join new data with existing data
  • 21. Fluo Application Lifecycle
        1. Use batch processing to seed computation with historical data
        2. Use Fluo to process incoming data and maintain computation in real-time
        3. While processing, Fluo can be queried and notifications can be sent to the user
  • 22. Major Progress
        Timeline years shown: 2010, 2013, 2014, 2015. Milestones in slide order:
        - Google releases Percolator paper
        - Keith Turner starts work on Percolator implementation for Accumulo as a side project (originally called Accismus)
        - Fluo can process transactions
        - 1.0.0-alpha released
        - Oracle and worker can be run in YARN
        - Changed project name to Fluo
        - 1.0.0-beta releasing soon
        - Solidified Fluo Client/Observer API
        - Automated running Fluo cluster on Amazon EC2
        - Multi-application support
        - Improved how observer notifications are found
        - Created Stress Test
  • 23. Fluo Stress Test
        - Motivation: Needed a test that stresses Fluo and is easy to verify for correctness
        - The stress test computes the number of unique integers by building a bitwise trie
        - New integers are added at leaf nodes
        - Observers watch all nodes, create parents, and percolate totals up to the root node
        - Test runs successfully if the count at the root is the same as the number of leaf nodes
        - Multiple transactions can operate on the same nodes, causing collisions
        Example trie from the slide: xxxx = 5; 11xx = 3; 10xx = 0; 01xx = 1; 00xx = 1; leaf labels as extracted: 1110, 1100, 0101, 00011110
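The correctness check on slide 23 (root count equals the number of unique leaves) can be reproduced in a few lines. The sketch below computes the per-node counts in one batch pass rather than incrementally through observers, so it shows only the data structure and the check, not how the stress test itself is implemented. `TrieCountSketch`, the 4-bit width, the 2-bits-per-level split, and the sample input are assumptions chosen to match the node labels on the slide.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical batch version of the stress test's trie counts (not the fluo-stress code).
final class TrieCountSketch {
    static final int BITS = 4;            // integer width, assumed from the 4-bit labels
    static final int BITS_PER_LEVEL = 2;  // each trie level fixes two more bits

    /** Returns the number of unique leaves under every interior node, keyed by label. */
    static Map<String, Integer> buildCounts(int[] values) {
        Set<Integer> leaves = new TreeSet<>();
        for (int v : values) {
            leaves.add(v & ((1 << BITS) - 1)); // duplicates collapse into one leaf
        }
        Map<String, Integer> counts = new TreeMap<>();
        for (int leaf : leaves) {
            // Walk from the root ("xxxx") down to the leaf's parent ("11xx", ...).
            for (int kept = 0; kept < BITS; kept += BITS_PER_LEVEL) {
                counts.merge(label(leaf, kept), 1, Integer::sum);
            }
        }
        return counts;
    }

    /** Renders the first `kept` bits of the value and masks the rest with 'x'. */
    static String label(int value, int kept) {
        StringBuilder sb = new StringBuilder();
        for (int i = BITS - 1; i >= 0; i--) {
            boolean visible = (BITS - 1 - i) < kept;
            sb.append(visible ? (((value >> i) & 1) == 1 ? '1' : '0') : 'x');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // 1110 appears twice, so six inputs yield five unique leaves.
        int[] input = {0b1110, 0b1100, 0b1101, 0b0101, 0b0001, 0b1110};
        System.out.println(buildCounts(input));
    }
}
```

Nodes with no leaves beneath them (such as 10xx for this input) simply have no entry, which corresponds to a count of 0 on the slide.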
  • 24. Easy to run Fluo
        1. On a machine with Maven+Git, clone the fluo-dev and fluo repos
        2. Follow some basic configuration steps
        3. Run the following commands:
             fluo-dev download   # Downloads Accumulo, Hadoop, Zookeeper tarballs
             fluo-dev setup      # Sets up Accumulo, Hadoop, etc. locally
             fluo-dev deploy     # Builds Fluo distribution and deploys it locally
             fluo new myapp      # Creates configuration for the 'myapp' Fluo application
             fluo init myapp     # Initializes 'myapp' in Zookeeper
             fluo start myapp    # Starts the oracle and worker processes of 'myapp' in YARN
             fluo scan myapp     # Prints a snapshot of data in the Fluo table of 'myapp'
        It’s just as easy to run a Fluo cluster on Amazon EC2
  • 25. Fluo Ecosystem
        - fluo: Main project repo
        - fluo-quickstart: Simple Fluo example
        - fluo-stress: Stresses Fluo on cluster
        - fluo-io.github.io: Fluo project website
        - phrasecount: In-depth Fluo example
        - fluo-deploy: Run Fluo on EC2 cluster
        - fluo-dev: Helps developers run Fluo locally
  • 26. Future Direction
        - Primary focus: Ship a production-ready 1.0 release with a stable API
        - Other possible work:
          - Fluo-32: Real world example application, possibly using CommonCrawl data
          - Fluo-58: Support writing observers in Python
          - Fluo-290: Support running Fluo on Mesos
          - Fluo-478: Automatically scale up & down Fluo workers based on workload
  • 27. Get involved!
        1. Experiment with Fluo
           - API has stabilized
           - Tools and development process make it easy
           - Not recommended for production yet (wait for 1.0)
        2. Contribute to Fluo
           - ~85 open issues on GitHub
           - Review-then-commit process