Mike Walch
Using Fluo to incrementally
process data in Accumulo
Problem: Maintain counts of inbound links
Example Graph (nodes): fluo.io, github.com, apache.org, nytimes.com
Example Data
Website       # Inbound Links
fluo.io       0
github.com    3
apache.org    2
nytimes.com   0
Solution 1 - Maintain counts using batch processing
Link count change log
Website       # Inbound
fluo.io       +1
github.com    -1
apache.org    +1
github.com    -1
nytimes.com   +1
apache.org    +1

Last Hour Aggregates
Website       # Inbound
fluo.io       +1
github.com    -23
apache.org    +65
nytimes.com   +105

Historical
Website       # Inbound
fluo.io       53
github.com    1,385,192
apache.org    2,528,190
nytimes.com   53,395,000

Latest Counts
Website       # Inbound
fluo.io       54
github.com    1,385,169
apache.org    2,528,255
nytimes.com   53,395,105

Diagram components: Internet, Web Crawler, Web Cache, MapReduce (two jobs)
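The arithmetic behind the batch approach can be sketched in plain Java. This is a minimal illustration that assumes the first MapReduce job sums the change log into one delta per site and the second adds those deltas to the historical counts; the class and method names are hypothetical, and the real pipeline would run as MapReduce jobs rather than in memory.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory stand-in for the two batch passes; not Fluo or Hadoop code.
final class LinkCountBatch {

    /** Pass 1: sum change-log entries (site, +1 or -1) into one delta per site. */
    static Map<String, Long> aggregate(List<Map.Entry<String, Long>> changeLog) {
        Map<String, Long> deltas = new LinkedHashMap<>();
        for (Map.Entry<String, Long> change : changeLog) {
            deltas.merge(change.getKey(), change.getValue(), Long::sum);
        }
        return deltas;
    }

    /** Pass 2: add the aggregated deltas to the historical counts. */
    static Map<String, Long> applyDeltas(Map<String, Long> historical,
                                         Map<String, Long> deltas) {
        Map<String, Long> latest = new LinkedHashMap<>(historical);
        deltas.forEach((site, delta) -> latest.merge(site, delta, Long::sum));
        return latest;
    }

    public static void main(String[] args) {
        // Historical counts and last-hour aggregates copied from the slide.
        Map<String, Long> historical = new LinkedHashMap<>();
        historical.put("fluo.io", 53L);
        historical.put("github.com", 1_385_192L);
        historical.put("apache.org", 2_528_190L);
        historical.put("nytimes.com", 53_395_000L);

        Map<String, Long> lastHour = new LinkedHashMap<>();
        lastHour.put("fluo.io", 1L);
        lastHour.put("github.com", -23L);
        lastHour.put("apache.org", 65L);
        lastHour.put("nytimes.com", 105L);

        System.out.println(applyDeltas(historical, lastHour));
    }
}
```

Every count is at most one batch interval stale under this scheme, which is the latency the Fluo-based solution aims to remove.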
Solution 2 - Maintain counts using Fluo
Fluo Table
Website       # Inbound
fluo.io       53
github.com    1,385,192
apache.org    2,528,190
nytimes.com   53,395,000

Updates shown: +1, -1
Diagram components: Internet, Web Crawler, Web Cache
Solution 3 - Use both: update popular sites using batch
processing & update long tail using Fluo
Figure: Website Distribution. Axis: # Inbound Links. Sites shown: nytimes.com, github.com, fluo.io. Regions: "Update every hour using MapReduce" and "Update in real-time using Fluo".
Fluo 101 - Basics
- Provides cross-row transactions and snapshot isolation, which
make concurrent updates safe
- Allows for incremental processing of data
- Based on Google’s Percolator paper
- Started as a side project by Keith Turner in 2013
- Originally called Accismus
- Tested using synthetic workloads
- Almost ready for production environments
Fluo 101 - Accumulo vs Fluo
- Fluo is a transactional API built on top of Accumulo
- Fluo stores its data in Accumulo
- Fluo uses Accumulo conditional mutations for transactions
- Fluo has a table structure (row, column, value) similar to Accumulo
except Fluo has no timestamp
- Each Fluo application runs its own processes
  - The oracle allocates timestamps for transactions
  - Workers run user code (called observers) that performs transactions
Fluo 101 - Architecture
Diagram: Fluo architecture
- Infrastructure: Accumulo (Table1, Table2), HDFS, Zookeeper, YARN
- Client Cluster: two Fluo Clients for App 1, one Fluo Client for App 2
- Fluo Application 1 and Fluo Application 2: two Fluo Oracles and three Fluo Workers are shown; two workers run Observer1 and Observer2, one runs ObserverA
Fluo 101 - Client API
Used by developers to ingest data or interact with Fluo from
external applications (REST services, crawlers, etc)
public void addDocument(FluoClient fluoClient, String docId, String content) {
  TypeLayer typeLayer = new TypeLayer(new StringEncoder());
  // try-with-resources closes the transaction even if it is never committed
  try (TypedTransaction tx1 = typeLayer.wrap(fluoClient.newTransaction())) {
    // Only write the document if no content is stored for this row yet
    if (tx1.get().row(docId).col(CONTENT_COL).toString() == null) {
      tx1.mutate().row(docId).col(CONTENT_COL).set(content);
      tx1.commit();
    }
  }
}
Fluo 101 - Observers
- Developers can write observers that are triggered when a column is
modified and run by Fluo workers.
- Best practice: Do work/transactions in observers over client code
public class DocumentObserver extends TypedObserver {

  @Override
  public void process(TypedTransactionBase tx, Bytes row, Column column) {
    // do work here
  }

  @Override
  public ObservedColumn getObservedColumn() {
    return new ObservedColumn(CONTENT_COL, NotificationType.STRONG);
  }
}
Example Fluo Application
- Problem: Maintain word & document counts as documents
are added and deleted from Fluo in real time
- Fluo client performs two actions:
  1. Add document to table
  2. Mark document for deletion
- These actions trigger two observers:
  - Add Observer: increases word and document counts
  - Delete Observer: decreases counts and cleans up
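The bookkeeping the two observers perform can be simulated in plain Java. This sketch does not use the Fluo API: the class `WordCountSim`, its methods, and the in-memory maps are hypothetical stand-ins for the Fluo table, and the observer work runs inline instead of being triggered by a notification. Row names follow the `d:`, `w:`, and `total:` prefixes used on the slides.

```java
import java.util.Map;
import java.util.TreeMap;

// Plain-Java simulation of the example application; not the Fluo API.
final class WordCountSim {
    private final Map<String, String> docs = new TreeMap<>();   // "d:<id>" -> content
    private final Map<String, Long> counts = new TreeMap<>();   // "w:<word>" or "total:docs" -> count

    /** Client action 1 plus the Add Observer: store the document, then raise counts. */
    void addDocument(String docId, String content) {
        if (docs.putIfAbsent("d:" + docId, content) != null) {
            return; // document already present, nothing to count
        }
        adjust(content, +1);
    }

    /** Client action 2 plus the Delete Observer: drop the document, then lower counts. */
    void deleteDocument(String docId) {
        String content = docs.remove("d:" + docId);
        if (content != null) {
            adjust(content, -1);
        }
    }

    /** Returns the stored count, or null if the row does not exist. */
    Long count(String row) {
        return counts.get(row);
    }

    private void adjust(String content, long delta) {
        for (String word : content.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                bump("w:" + word, delta);
            }
        }
        bump("total:docs", delta);
    }

    // Add delta to a counter; a row that reaches zero is removed (the clean-up step).
    private void bump(String row, long delta) {
        counts.merge(row, delta, (a, b) -> a + b == 0 ? null : Long.valueOf(a + b));
    }

    public static void main(String[] args) {
        WordCountSim sim = new WordCountSim();
        sim.addDocument("doc1", "my first hello world");
        sim.addDocument("doc2", "second hello world");
        sim.deleteDocument("doc1");
        System.out.println(sim.counts);
    }
}
```

In a real Fluo application each of these steps would be its own transaction, so the counts would converge shortly after the client commit rather than in the same call.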
Add first document to table
Fluo Table
Row        Column   Value
d : doc1   doc      my first hello world

Diagram components: Fluo Client (Client Cluster), Add Observer, Delete Observer
An observer increments word counts
Fluo Table
Row            Column   Value
d : doc1       doc      my first hello world
w : first      cnt      1
w : hello      cnt      1
w : my         cnt      1
w : world      cnt      1
total : docs   cnt      1
A second document is added
Fluo Table
Row           Column   Value
d : doc1      doc      my first hello world
d : doc2      doc      second hello world
w : first     cnt      1
w : hello     cnt      2
w : my        cnt      1
w : second    cnt      1
w : world     cnt      2
total : doc   cnt      2
First document is marked for deletion
Fluo Table
Row           Column   Value
d : doc1      doc      my first hello world
d : doc1      delete
d : doc2      doc      second hello world
w : first     cnt      1
w : hello     cnt      2
w : my        cnt      1
w : second    cnt      1
w : world     cnt      2
total : doc   cnt      2
Observer decrements counts and deletes document
Fluo Table
Row           Column   Value
d : doc1      doc      my first hello world
d : doc1      delete
d : doc2      doc      second hello world
w : first     cnt      1
w : hello     cnt      1
w : my        cnt      1
w : second    cnt      1
w : world     cnt      1
total : doc   cnt      1
Things to watch out for...
- Collisions occur when two transactions update the same data at the
same time
  - Only one transaction will succeed; the others must be retried
  - Some collisions are OK, but too many can slow computation
  - Avoid collisions by not updating the same row/column in every transaction
- Write skew occurs when two transactions read an overlapping data set
and make disjoint updates without seeing each other's update
  - The result differs from what a serialized execution would produce
  - Prevent write skew by making both transactions update the same row/column.
    If they run concurrently, a collision will occur and only one transaction will succeed.
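The collision-and-retry behaviour can be illustrated with a plain-Java analogy. This is not how Fluo is implemented (Fluo uses Accumulo conditional mutations); the class `CollisionDemo` and its methods are hypothetical, and a single versioned cell stands in for one row/column.

```java
import java.util.concurrent.atomic.AtomicReference;

// Plain-Java analogy for optimistic transactions on one cell; not Fluo's implementation.
final class CollisionDemo {

    /** A cell value tagged with the version that wrote it. */
    static final class Versioned {
        final long version;
        final long value;

        Versioned(long version, long value) {
            this.version = version;
            this.value = value;
        }
    }

    private final AtomicReference<Versioned> cell =
        new AtomicReference<>(new Versioned(0, 0));

    /** Take a snapshot of the cell. */
    Versioned read() {
        return cell.get();
    }

    /** Commit succeeds only if nobody else committed since the snapshot was read. */
    boolean tryCommit(Versioned snapshot, long newValue) {
        return cell.compareAndSet(snapshot, new Versioned(snapshot.version + 1, newValue));
    }

    /** Re-read and re-apply the increment after each collision; returns the attempt count. */
    int incrementWithRetry() {
        int attempts = 0;
        while (true) {
            attempts++;
            Versioned snapshot = read();
            if (tryCommit(snapshot, snapshot.value + 1)) {
                return attempts;
            }
        }
    }

    public static void main(String[] args) {
        CollisionDemo demo = new CollisionDemo();
        Versioned tx1 = demo.read();
        Versioned tx2 = demo.read();   // both transactions see the same snapshot
        System.out.println("tx1 committed: " + demo.tryCommit(tx1, tx1.value + 1));
        System.out.println("tx2 committed: " + demo.tryCommit(tx2, tx2.value + 1));
        demo.incrementWithRetry();     // tx2 retried against fresh data
        System.out.println("final value: " + demo.read().value);
    }
}
```

The same mechanism is what makes the write-skew remedy work: once both transactions write a shared cell, their concurrent commits conflict, and one of them is forced to retry against the other's result.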
How does Fluo fit in?
Figure: processing approaches plotted by large join throughput (lower to higher) against processing latency (slower to faster)
- Batch Processing: MapReduce, Spark
- Incremental Processing: Fluo, Percolator
- Stream Processing: Storm
Don’t use Fluo if...
1. You want to do ad-hoc analysis on your data
(use batch processing instead)
2. Your incoming data is being joined with a small data set
(use stream processing instead)
Use Fluo if...
1. You want to maintain a large-scale computation
using a series of small transactional updates
2. Periodic batch processing jobs are taking too long to
join new data with existing data
Fluo Application Lifecycle
1. Use batch processing to seed computation with historical data
2. Use Fluo to process incoming data and maintain computation in
real-time
3. While processing, Fluo can be queried and notifications can be
sent to users
Major Progress
Timeline: 2010, 2013, 2014, 2015
- Google releases Percolator paper
- Keith Turner starts work on Percolator implementation for Accumulo as a side project (originally called Accismus)
- Fluo can process transactions
- 1.0.0-alpha released
- Oracle and worker can be run in YARN
- Changed project name to Fluo
- 1.0.0-beta releasing soon
- Solidified Fluo Client/Observer API
- Automated running Fluo cluster on Amazon EC2
- Multi-application support
- Improved how observer notifications are found
- Created Stress Test
Fluo Stress Test
- Motivation: Needed a test that stresses Fluo
and is easy to verify for correctness
- The stress test computes the number of
unique integers by building a bitwise trie
- New integers are added at leaf nodes
- Observers watch all nodes, create parents,
and percolate totals up to the root node
- The test passes if the count at the root is
the same as the number of leaf nodes
- Multiple transactions can operate on the same
nodes, causing collisions
Figure: example bitwise trie. Root: xxxx = 5. Second level: 11xx = 3, 10xx = 0, 01xx = 1, 00xx = 1. Leaf labels shown: 1110, 1100, 0101, 0001, 1110.
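The invariant the stress test checks can be sketched in plain Java. This is a simplified, single-threaded simulation and not the fluo-stress code: the class `TrieCountSim` is hypothetical, counts are pushed to every ancestor immediately instead of being percolated by observers, and the 4-bit width with two bits per level is an assumption taken from the figure.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Simplified simulation of the unique-integer trie; not the fluo-stress implementation.
final class TrieCountSim {
    private static final int BITS = 4;  // width of each integer, as in the figure
    private static final int STEP = 2;  // bits resolved per trie level (assumed from the figure)

    private final Set<String> leaves = new HashSet<>();
    private final Map<String, Integer> counts = new HashMap<>();

    /** Insert an integer at its leaf; a new leaf adds 1 to every ancestor up to the root. */
    void add(int n) {
        if (n < 0 || n >= (1 << BITS)) {
            throw new IllegalArgumentException("out of range: " + n);
        }
        String bits = toBits(n);
        if (!leaves.add(bits)) {
            return; // duplicate integer, counts stay unchanged
        }
        for (int keep = BITS - STEP; keep >= 0; keep -= STEP) {
            counts.merge(mask(bits, keep), 1, Integer::sum);
        }
    }

    /** Count stored at an inner node such as "11xx" or the root "xxxx". */
    int count(String node) {
        return counts.getOrDefault(node, 0);
    }

    int leafCount() {
        return leaves.size();
    }

    private static String toBits(int n) {
        String s = Integer.toBinaryString(n);
        return "0".repeat(BITS - s.length()) + s;
    }

    // Keep the first `keep` bits and replace the rest with 'x'.
    private static String mask(String bits, int keep) {
        return bits.substring(0, keep) + "x".repeat(BITS - keep);
    }

    public static void main(String[] args) {
        TrieCountSim trie = new TrieCountSim();
        for (int n : new int[] {0b1110, 0b1100, 0b0101, 0b0001, 0b1110}) {
            trie.add(n);
        }
        // The test passes when the root count equals the number of unique leaves.
        System.out.println(trie.count("xxxx") == trie.leafCount());
    }
}
```

In the actual test the increments travel upward through many small transactions, so nodes near the root are touched by many of them, which is where the collisions mentioned above come from.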
Easy to run Fluo
1. On a machine with Maven and Git, clone the fluo-dev and fluo repos
2. Follow some basic configuration steps
3. Run the following commands:
fluo-dev download   # Downloads Accumulo, Hadoop, Zookeeper tarballs
fluo-dev setup      # Sets up Accumulo, Hadoop, etc. locally
fluo-dev deploy     # Builds the Fluo distribution and deploys it locally
fluo new myapp      # Creates configuration for the 'myapp' Fluo application
fluo init myapp     # Initializes 'myapp' in Zookeeper
fluo start myapp    # Starts the oracle and worker processes of 'myapp' in YARN
fluo scan myapp     # Prints a snapshot of the data in the Fluo table of 'myapp'
It's just as easy to run a Fluo cluster on Amazon EC2
Fluo Ecosystem
- fluo: Main project repo
- fluo-quickstart: Simple Fluo example
- fluo-stress: Stresses Fluo on cluster
- fluo-io.github.io: Fluo project website
- phrasecount: In-depth Fluo example
- fluo-deploy: Run Fluo on EC2 cluster
- fluo-dev: Helps developers run Fluo locally
Future Direction
- Primary focus: Ship a production-ready 1.0 release with a stable API
- Other possible work:
  - Fluo-32: Real world example application
    - Possibly using CommonCrawl data
  - Fluo-58: Support writing observers in Python
  - Fluo-290: Support running Fluo on Mesos
  - Fluo-478: Automatically scale up & down Fluo workers based on
    workload
Get involved!
1. Experiment with Fluo
   - API has stabilized
   - Tools and development process make it easy
   - Not recommended for production yet (wait for 1.0)
2. Contribute to Fluo
   - ~85 open issues on GitHub
   - Review-then-commit process

Accumulo Summit 2015: Using Fluo to incrementally process data in Accumulo [API]

  • 1. Mike Walch Using Fluo to incrementally process data in Accumulo
  • 2. Problem: Maintain counts of inbound links
      Example graph (diagram): fluo.io, github.com, apache.org, nytimes.com
      Example data:
        Website        # Inbound Links
        fluo.io        0
        github.com     3
        apache.org     2
        nytimes.com    0
  • 3. Solution 1 - Maintain counts using batch processing
      Diagram labels: Internet, Web Crawler, Web Cache, MapReduce, MapReduce
      Link count change log:
        Website        # Inbound
        fluo.io        +1
        github.com     -1
        apache.org     +1
        github.com     -1
        nytimes.com    +1
        apache.org     +1
      Last hour aggregates:
        Website        # Inbound
        fluo.io        +1
        github.com     -23
        apache.org     +65
        nytimes.com    +105
      Historical counts:
        Website        # Inbound
        fluo.io        53
        github.com     1,385,192
        apache.org     2,528,190
        nytimes.com    53,395,000
      Latest counts:
        Website        # Inbound
        fluo.io        54
        github.com     1,385,169
        apache.org     2,528,255
        nytimes.com    53,395,105
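The batch approach on this slide can be sketched in plain Java. This is only an in-memory stand-in for the two MapReduce steps, not the speaker's code: the class `LinkCountBatch`, its nested `Change` type, and both method names are hypothetical. The `main` method reuses the aggregate and historical figures from the slide.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory stand-in for the two MapReduce steps on the slide.
class LinkCountBatch {

    // One entry of the link count change log: a site and a +1/-1 delta.
    static final class Change {
        final String site;
        final long delta;

        Change(String site, long delta) {
            this.site = site;
            this.delta = delta;
        }
    }

    // Step 1: collapse the change log into one aggregate delta per site.
    static Map<String, Long> aggregate(List<Change> changeLog) {
        Map<String, Long> aggregates = new LinkedHashMap<>();
        for (Change c : changeLog) {
            aggregates.merge(c.site, c.delta, Long::sum);
        }
        return aggregates;
    }

    // Step 2: join the aggregates with the historical counts.
    static Map<String, Long> applyAggregates(Map<String, Long> historical,
                                             Map<String, Long> aggregates) {
        Map<String, Long> latest = new LinkedHashMap<>(historical);
        aggregates.forEach((site, delta) -> latest.merge(site, delta, Long::sum));
        return latest;
    }

    public static void main(String[] args) {
        // The six change-log lines shown on the slide.
        List<Change> excerpt = Arrays.asList(
            new Change("fluo.io", 1), new Change("github.com", -1),
            new Change("apache.org", 1), new Change("github.com", -1),
            new Change("nytimes.com", 1), new Change("apache.org", 1));
        System.out.println("Aggregated excerpt: " + aggregate(excerpt));

        // Historical counts and last-hour aggregates as printed on the slide.
        Map<String, Long> historical = new LinkedHashMap<>();
        historical.put("fluo.io", 53L);
        historical.put("github.com", 1_385_192L);
        historical.put("apache.org", 2_528_190L);
        historical.put("nytimes.com", 53_395_000L);

        Map<String, Long> lastHour = new LinkedHashMap<>();
        lastHour.put("fluo.io", 1L);
        lastHour.put("github.com", -23L);
        lastHour.put("apache.org", 65L);
        lastHour.put("nytimes.com", 105L);

        System.out.println("Latest counts: " + applyAggregates(historical, lastHour));
    }
}
```

The second step touches every historical row even when only a few sites changed, which is the cost the Fluo-based solutions on the following slides try to avoid.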
  • 4. Solution 2 - Maintain counts using Fluo
      Diagram labels: Internet, Web Crawler, Web Cache, +1, -1, Fluo Table
      Fluo Table:
        Website        # Inbound
        fluo.io        53
        github.com     1,385,192
        apache.org     2,528,190
        nytimes.com    53,395,000
  • 5. Solution 3 - Use both: update popular sites using batch processing & update long tail using Fluo
      Chart: website distribution by # inbound links (sites shown: nytimes.com, github.com, fluo.io)
      Chart annotations: "Update every hour using MapReduce", "Update in real-time using Fluo"
  • 6. Fluo 101 - Basics
      - Provides cross-row transactions and snapshot isolation, which make it safe to do concurrent updates
      - Allows for incremental processing of data
      - Based on Google’s Percolator paper
      - Started as a side project by Keith Turner in 2013
      - Originally called Accismus
      - Tested using synthetic workloads
      - Almost ready for production environments
  • 7. Fluo 101 - Accumulo vs Fluo
      - Fluo is a transactional API built on top of Accumulo
      - Fluo stores its data in Accumulo
      - Fluo uses Accumulo conditional mutations for transactions
      - Fluo has a table structure (row, column, value) similar to Accumulo, except Fluo has no timestamp
      - Each Fluo application runs its own processes
        - Oracle allocates timestamps for transactions
        - Workers run user code (called observers) that perform transactions
  • 8. Fluo 101 - Architecture
      Diagram labels: Accumulo; HDFS; Zookeeper; YARN; Client Cluster; Fluo Client for App 1; Fluo Client for App 1; Fluo Client for App 2; Fluo Application 2; Fluo Application 1; Fluo Worker (Observer1, Observer2); Fluo Oracle; Fluo Worker (ObserverA); Fluo Oracle; Fluo Worker (Observer1, Observer2); Table1; Table2
  • 9. Fluo 101 - Client API
      Used by developers to ingest data or interact with Fluo from external applications (REST services, crawlers, etc.)

        public void addDocument(FluoClient fluoClient, String docId, String content) {
          TypeLayer typeLayer = new TypeLayer(new StringEncoder());
          // try-with-resources closes the transaction when the block exits
          try (TypedTransaction tx1 = typeLayer.wrap(fluoClient.newTransaction())) {
            // Only write the document if its content column is not already set
            if (tx1.get().row(docId).col(CONTENT_COL).toString() == null) {
              tx1.mutate().row(docId).col(CONTENT_COL).set(content);
              tx1.commit();
            }
          }
        }
  • 10. Fluo 101 - Observers
      - Developers can write observers, which Fluo workers run when an observed column is modified.
      - Best practice: Do work/transactions in observers rather than in client code

        public class DocumentObserver extends TypedObserver {

          @Override
          public void process(TypedTransactionBase tx, Bytes row, Column column) {
            // do work here
          }

          // Declares which column triggers this observer
          @Override
          public ObservedColumn getObservedColumn() {
            return new ObservedColumn(CONTENT_COL, NotificationType.STRONG);
          }
        }
  • 11. Example Fluo Application
      - Problem: Maintain word & document counts in real time as documents are added to and deleted from Fluo
      - Fluo client performs two actions:
        1. Add document to table
        2. Mark document for deletion
      - These actions trigger two observers:
        - Add Observer - increase word and document counts
        - Delete Observer - decrease counts and clean up
  • 12. Add first document to table
      Diagram labels: Fluo Client (Client Cluster), Add Observer, Delete Observer
      Fluo Table:
        Row            Column    Value
        d : doc1       doc       my first hello world
  • 13. An observer increments word counts
      Diagram labels: Fluo Client (Client Cluster), Add Observer, Delete Observer
      Fluo Table:
        Row            Column    Value
        d : doc1       doc       my first hello world
        w : first      cnt       1
        w : hello      cnt       1
        w : my         cnt       1
        w : world      cnt       1
        total : docs   cnt       1
  • 14. A second document is added
      Diagram labels: Fluo Client (Client Cluster), Add Observer, Delete Observer
      Fluo Table:
        Row            Column    Value
        d : doc1       doc       my first hello world
        d : doc2       doc       second hello world
        w : first      cnt       1
        w : hello      cnt       2
        w : my         cnt       1
        w : second     cnt       1
        w : world      cnt       2
        total : doc    cnt       2
  • 15. First document is marked for deletion
      Diagram labels: Fluo Client (Client Cluster), Add Observer, Delete Observer
      Fluo Table:
        Row            Column    Value
        d : doc1       doc       my first hello world
        d : doc1       delete
        d : doc2       doc       second hello world
        w : first      cnt       1
        w : hello      cnt       2
        w : my         cnt       1
        w : second     cnt       1
        w : world      cnt       2
        total : doc    cnt       2
  • 16. Observer decrements counts and deletes document Fluo Table Row d : doc1 d : doc1 d : doc2 w : first w : hello w : my w : second w : world total : doc Column doc delete doc cnt cnt cnt cnt cnt cnt Value my first hello world second hello world 1 1 1 1 1 1 Fluo Client Client Cluster Add Observer Delete Observer
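The walkthrough in slides 12 to 16 can be replayed with a small plain-Java simulation. It is a sketch only and does not use the Fluo API: the class `WordCountSim` and its methods are hypothetical, a sorted map stands in for the Fluo table, and the observers run synchronously inside the client calls, whereas Fluo workers would run them asynchronously. It also assumes that "clean up" means removing the document rows and any count that drops to zero.

```java
import java.util.Map;
import java.util.TreeMap;

// Plain-Java simulation of the add/delete walkthrough; it does not use the Fluo API.
class WordCountSim {
    // Stand-in for the Fluo table: "row/column" key -> value.
    final Map<String, String> table = new TreeMap<>();

    // Client action 1: add the document, then run the (simulated) add observer.
    void addDocument(String docId, String content) {
        table.put("d:" + docId + "/doc", content);
        adjustCounts(content, +1);
    }

    // Client action 2: mark the document for deletion, then run the
    // (simulated) delete observer, which decrements counts and cleans up.
    void deleteDocument(String docId) {
        String content = table.get("d:" + docId + "/doc");
        if (content == null) {
            return; // nothing to delete
        }
        table.put("d:" + docId + "/delete", "");
        adjustCounts(content, -1);
        table.remove("d:" + docId + "/doc");
        table.remove("d:" + docId + "/delete");
    }

    // Applies delta to every word occurrence and to the document total.
    private void adjustCounts(String content, int delta) {
        for (String word : content.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                bump("w:" + word + "/cnt", delta);
            }
        }
        bump("total:docs/cnt", delta);
    }

    // Assumption: a count that reaches zero is removed from the table.
    private void bump(String key, int delta) {
        long updated = count(key) + delta;
        if (updated == 0) {
            table.remove(key);
        } else {
            table.put(key, Long.toString(updated));
        }
    }

    // Returns the stored count, or 0 when the cell does not exist.
    long count(String key) {
        String value = table.get(key);
        return value == null ? 0 : Long.parseLong(value);
    }

    public static void main(String[] args) {
        WordCountSim sim = new WordCountSim();
        sim.addDocument("doc1", "my first hello world");
        sim.addDocument("doc2", "second hello world");
        System.out.println("After two adds: " + sim.table);
        sim.deleteDocument("doc1");
        System.out.println("After deleting doc1: " + sim.table);
    }
}
```

In a real Fluo application each observer invocation would be its own transaction, so concurrent updates to shared cells such as the document total could collide and need a retry.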
  • 17. Things to watch out for...
      - Collisions occur when two transactions update the same data at the same time
        - Only one transaction will succeed. The others need to be retried.
        - Some collisions are OK, but too many can slow computation
        - Avoid collisions by not updating the same row/column on every transaction
      - Write skew occurs when two transactions read an overlapping data set and make disjoint updates without seeing the other's update
        - The result differs from what it would be if the transactions were serialized
        - Prevent write skew by making both transactions update the same row/column. If they run concurrently, a collision will occur and only one transaction will succeed.
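Both hazards can be reproduced with a toy, single-threaded model of snapshot isolation. This is a sketch under stated assumptions, not Fluo's implementation or API: `SnapshotStore`, `Tx`, and the cell names `x`, `y`, and `guard` are all hypothetical, each transaction snapshots the whole store when it begins, and a commit fails if any cell it wrote was committed by another transaction after it started.

```java
import java.util.HashMap;
import java.util.Map;

// Toy single-threaded model of snapshot isolation; not the Fluo API.
class SnapshotStore {
    private final Map<String, Long> values = new HashMap<>();
    private final Map<String, Long> commitTs = new HashMap<>(); // last commit time per cell
    private long clock = 0; // stand-in for the timestamp oracle

    Tx begin() {
        return new Tx(++clock);
    }

    class Tx {
        private final long startTs;
        private final Map<String, Long> snapshot = new HashMap<>(values);
        private final Map<String, Long> writes = new HashMap<>();

        Tx(long startTs) {
            this.startTs = startTs;
        }

        // Reads see this transaction's own writes, then its start-time snapshot.
        long get(String cell) {
            return writes.getOrDefault(cell, snapshot.getOrDefault(cell, 0L));
        }

        void set(String cell, long value) {
            writes.put(cell, value);
        }

        // Fails (collision) if a written cell was committed after this transaction began.
        boolean commit() {
            for (String cell : writes.keySet()) {
                if (commitTs.getOrDefault(cell, 0L) > startTs) {
                    return false;
                }
            }
            long ts = ++clock;
            for (Map.Entry<String, Long> w : writes.entrySet()) {
                values.put(w.getKey(), w.getValue());
                commitTs.put(w.getKey(), ts);
            }
            return true;
        }
    }

    // Two transactions each see x + y == 2 and zero a different cell.
    // Returns x + y afterwards: 0 shows write skew, 1 shows it was prevented.
    static long writeSkew(boolean touchSharedCell) {
        SnapshotStore store = new SnapshotStore();
        Tx setup = store.begin();
        setup.set("x", 1);
        setup.set("y", 1);
        setup.commit();

        Tx t1 = store.begin();
        Tx t2 = store.begin();
        if (t1.get("x") + t1.get("y") == 2) t1.set("x", 0);
        if (t2.get("x") + t2.get("y") == 2) t2.set("y", 0);
        if (touchSharedCell) {
            // Both also write the same cell, which forces a collision.
            t1.set("guard", 1);
            t2.set("guard", 1);
        }
        t1.commit();
        t2.commit(); // fails when touchSharedCell is true
        Tx check = store.begin();
        return check.get("x") + check.get("y");
    }

    public static void main(String[] args) {
        System.out.println("x + y without shared cell: " + writeSkew(false));
        System.out.println("x + y with shared cell: " + writeSkew(true));
    }
}
```

In this model the losing transaction is simply dropped. An application would retry it, at which point it would read the updated values and skip its write.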
  • 18. How does Fluo fit in?
      Chart axes: Large Join Throughput (lower to higher), Processing Latency (slower to faster)
      - Batch Processing: MapReduce, Spark
      - Incremental Processing: Fluo, Percolator
      - Stream Processing: Storm
  • 19. Don’t use Fluo if... 1. You want to do ad-hoc analysis on your data (use batch processing instead) 2. Your incoming data is being joined with a small data set (use stream processing instead)
  • 20. Use Fluo if... 1. You want to maintain a large-scale computation using a series of small transactional updates 2. Your periodic batch processing jobs are taking too long to join new data with existing data
  • 21. Fluo Application Lifecycle 1. Use batch processing to seed computation with historical data 2. Use Fluo to process incoming data and maintain computation in real-time 3. While processing, Fluo can be queried and notifications can be made to user
  • 22. Major Progress
      Timeline (years shown: 2010, 2013, 2014, 2015), milestones in slide order:
      - Google releases Percolator paper
      - Keith Turner starts work on Percolator implementation for Accumulo as a side project (originally called Accismus)
      - Fluo can process transactions
      - 1.0.0-alpha released
      - Oracle and worker can be run in YARN
      - Changed project name to Fluo
      - 1.0.0-beta releasing soon
      - Solidified Fluo Client/Observer API
      - Automated running Fluo cluster on Amazon EC2
      - Multi-application support
      - Improved how observer notifications are found
      - Created Stress Test
  • 23. Fluo Stress Test
      - Motivation: Needed a test that stresses Fluo and is easy to verify for correctness
      - The stress test computes the number of unique integers by building a bitwise trie
      - New integers are added at leaf nodes
      - Observers watch all nodes, create parents, and percolate the total up to the root node
      - The test runs successfully if the count at the root is the same as the number of leaf nodes
      - Multiple transactions can operate on the same nodes, causing collisions
      Example trie (diagram): xxxx = 5; 11xx = 3; 10xx = 0; 01xx = 1; 00xx = 1; leaf labels as extracted: 1110, 1100, 0101, 00011110
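The counts the trie should converge to can be computed directly, which is essentially what the correctness check compares against. The sketch below is a batch calculation, not the observer-based stress test itself: `TrieCountSketch` is a hypothetical name, the 4-bit width and 2-bit step are inferred from node labels such as `11xx`, and the input values in `main` are made up so that the node totals match those on the slide.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Batch computation of per-node unique-leaf counts for a small bitwise trie.
class TrieCountSketch {
    static final int BITS = 4; // width of each integer in the example
    static final int STEP = 2; // bits consumed per trie level (assumed from "11xx")

    // Returns node label -> number of unique leaves beneath that node.
    static Map<String, Integer> nodeCounts(int[] inputs) {
        Set<Integer> unique = new HashSet<>();
        Map<String, Integer> counts = new TreeMap<>();
        for (int value : inputs) {
            if (!unique.add(value)) {
                continue; // duplicate leaf: it must not change any total
            }
            // Credit the root ("xxxx") and each interior ancestor ("11xx").
            for (int kept = 0; kept < BITS; kept += STEP) {
                counts.merge(label(value, kept), 1, Integer::sum);
            }
        }
        return counts;
    }

    // Keeps the top `kept` bits of value and masks the rest with 'x'.
    static String label(int value, int kept) {
        StringBuilder sb = new StringBuilder();
        for (int i = BITS - 1; i >= 0; i--) {
            boolean known = (BITS - 1 - i) < kept;
            sb.append(known ? (char) ('0' + ((value >> i) & 1)) : 'x');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Hypothetical input: five unique values plus one duplicate (1110).
        int[] inputs = {0b1110, 0b1100, 0b1101, 0b0101, 0b0001, 0b1110};
        System.out.println(nodeCounts(inputs));
    }
}
```

Nodes with no leaves beneath them, such as `10xx` here, are simply absent from the result rather than stored with a zero count.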
  • 24. Easy to run Fluo
      1. On a machine with Maven+Git, clone the fluo-dev and fluo repos
      2. Follow some basic configuration steps
      3. Run the following commands

        fluo-dev download   # Downloads Accumulo, Hadoop, Zookeeper tarballs
        fluo-dev setup      # Sets up locally Accumulo, Hadoop, etc
        fluo-dev deploy     # Build Fluo distribution and deploy locally
        fluo new myapp      # Create configuration for ‘myapp’ Fluo application
        fluo init myapp     # Initialize ‘myapp’ in Zookeeper
        fluo start myapp    # Start the oracle and worker processes of ‘myapp’ in YARN
        fluo scan myapp     # Print snapshot of data in Fluo table of ‘myapp’

      It’s just as easy to run a Fluo cluster on Amazon EC2
  • 25. Fluo Ecosystem
      - fluo: Main project repo
      - fluo-quickstart: Simple Fluo example
      - fluo-stress: Stresses Fluo on cluster
      - fluo-io.github.io: Fluo project website
      - phrasecount: In-depth Fluo example
      - fluo-deploy: Run Fluo on EC2 cluster
      - fluo-dev: Helps developers run Fluo locally
  • 26. Future Direction
      - Primary focus: Ship a production-ready 1.0 release with a stable API
      - Other possible work:
        - Fluo-32: Real world example application
          - Possibly using CommonCrawl data
        - Fluo-58: Support writing observers in Python
        - Fluo-290: Support running Fluo on Mesos
        - Fluo-478: Automatically scale up & down Fluo workers based on workload
  • 27. Get involved!
      1. Experiment with Fluo
        - API has stabilized
        - Tools and development process make it easy
        - Not recommended for production yet (wait for 1.0)
      2. Contribute to Fluo
        - ~85 open issues on GitHub
        - Review-then-commit process