SlideShare a Scribd company logo
1 of 14
Adaptive Data Cleansing with
StreamSets and Cassandra
Pat Patterson
Community Champion
@metadaddy
pat@superpat.com
About Pat
Pat Patterson
Community Champion @ StreamSets
Formerly Developer Evangelist at
Salesforce
Contact
pat@streamsets.com
@metadaddy
Feeding Cassandra with StreamSets Data Collector
User defined aggregate functions in Cassandra
Push statistics back into the data pipeline
Resources
Q & A
Agenda
Devices sending sensor readings to
RabbitMQ via MQTT
Sensor id, time, temperature, orientation
Convert orientation from integer to string
0 -> “horizontal”, 1 -> “vertical”
Filter outlier values
Write ‘clean’ readings to Cassandra
Use Case: IoT Sensor Data
StreamSets Data Collector
http://api
Open source continuous big data
ingest infrastructure
◦ Off- or near-cluster
◦ Operating on data in-motion
◦ One-time processing, scales linearly
◦ Direct control over data integrity
Ingest from RabbitMQ to Cassandra
Could easily filter on some static boundary
But that will only catch ‘obvious’ problems
Can we find outlier values in a more dynamic,
flexible way?
Let’s define an outlier as any temperature value
more than 4 standard deviations from the past
hour’s mean temperature
Cleaning Up The Data
Standard aggregate functions: min, max, avg, sum, count
SELECT avg(temperature), count(temperature)
FROM readings
WHERE sensor_id = 1 AND time > '2016-08-17 18:11:00+0000'
Define custom aggregates as online algorithms in terms of two Java/JavaScript
functions: state function and final function
State function takes old state and value, returns new state
Tuple stateFn(Tuple state, double x)
Final function takes final state, returns aggregate value
double finalFn(Tuple state)
Cassandra User Defined Aggregates
def online_variance(data):
n = 0
mean = 0.0
M2 = 0.0
for x in data:
n += 1
delta = x – mean
mean += delta/n
M2 += delta*(x - mean)
if n < 2:
return float('nan')
else:
return M2 / (n - 1)
Online Standard Deviation - Welford’s Method
cqlsh:mykeyspace> CREATE OR REPLACE FUNCTION sdState ( state tuple<int,double,double>, val
double ) CALLED ON NULL INPUT RETURNS tuple<int,double,double> LANGUAGE java AS 'int n =
state.getInt(0); double mean = state.getDouble(1); double m2 = state.getDouble(2); n++; double delta
= val - mean; mean += delta / n; m2 += delta * (val - mean); state.setInt(0, n); state.setDouble(1,
mean); state.setDouble(2, m2); return state;';
cqlsh:mykeyspace> CREATE OR REPLACE FUNCTION sdFinal ( state tuple<int,double,double> )
CALLED ON NULL INPUT RETURNS double LANGUAGE java AS 'int n = state.getInt(0); double m2
= state.getDouble(2); if (n < 1) { return null; } return Math.sqrt(m2 / (n - 1));';
cqlsh:mykeyspace> CREATE AGGREGATE IF NOT EXISTS stdev ( double ) ... SFUNC sdState
STYPE tuple<int,double,double> FINALFUNC sdFinal INITCOND (0,0,0);
Define Standard Deviation as a UDA
Java app periodically queries Cassandra for mean, standard
deviation for the past hour’s data, writes to resource files
PreparedStatement statement = session.prepare(
"SELECT AVG(temperature), STDEV(temperature) " +
"FROM sensor_readings " +
"WHERE sensor_id = ? AND TIME > ?");
BoundStatement boundStatement = new BoundStatement(statement);
...
long startMillis = System.currentTimeMillis() - timeRangeMillis;
ResultSet results = session.execute(
boundStatement.bind(sensorId, new Date(startMillis)));
double avg = row.getDouble("system.avg(temperature)"),
sd = row.getDouble("mykeyspace.stdev(temperature)");
Feeding Statistics Back Into the Pipeline
Putting It All Together
Ingesting MQTT Traffic into Riak TS via RabbitMQ and StreamSets
http://bit.ly/ingest-mqtt
Ingesting Sensor Data on the Raspberry Pi with StreamSets Data Collector
http://bit.ly/ingest-sensors
Standard Deviations on Cassandra – Rolling Your Own Aggregate Function
http://bit.ly/cassandra-uda
Dynamic Outlier Detection with StreamSets and Cassandra
http://bit.ly/dynamic-outliers
Resources
Questions?
Pat Patterson
Community Champion
@metadaddy
pat@superpat.com

More Related Content

What's hot

codecentric AG: CQRS and Event Sourcing Applications with Cassandra
codecentric AG: CQRS and Event Sourcing Applications with Cassandracodecentric AG: CQRS and Event Sourcing Applications with Cassandra
codecentric AG: CQRS and Event Sourcing Applications with CassandraDataStax Academy
 
Cassandra Day Chicago 2015: The Evolution of Apache Cassandra at Signal
Cassandra Day Chicago 2015: The Evolution of Apache Cassandra at SignalCassandra Day Chicago 2015: The Evolution of Apache Cassandra at Signal
Cassandra Day Chicago 2015: The Evolution of Apache Cassandra at SignalDataStax Academy
 
Microservice performance-b
Microservice performance-bMicroservice performance-b
Microservice performance-bdbgannon
 
Migrating a 130TB Cluster from Elasticsearch 2 to 5 in 20 Hours Without Downtime
Migrating a 130TB Cluster from Elasticsearch 2 to 5 in 20 Hours Without DowntimeMigrating a 130TB Cluster from Elasticsearch 2 to 5 in 20 Hours Without Downtime
Migrating a 130TB Cluster from Elasticsearch 2 to 5 in 20 Hours Without DowntimeFred de Villamil
 
Puppet Camp Melbourne 2014: Node Collaboration with PuppetDB
Puppet Camp Melbourne 2014: Node Collaboration with PuppetDBPuppet Camp Melbourne 2014: Node Collaboration with PuppetDB
Puppet Camp Melbourne 2014: Node Collaboration with PuppetDBPuppet
 
Advanced search and Top-k queries in Cassandra - Cassandra Summit Europe 2014
Advanced search and Top-k queries in Cassandra - Cassandra Summit Europe 2014Advanced search and Top-k queries in Cassandra - Cassandra Summit Europe 2014
Advanced search and Top-k queries in Cassandra - Cassandra Summit Europe 2014dhiguero
 
C*ollege Credit: CEP Distribtued Processing on Cassandra with Storm
C*ollege Credit: CEP Distribtued Processing on Cassandra with StormC*ollege Credit: CEP Distribtued Processing on Cassandra with Storm
C*ollege Credit: CEP Distribtued Processing on Cassandra with StormDataStax
 
DevOps Days Tel Aviv 2013: Ignite Talk: Monitoring Patterns with Riemann - It...
DevOps Days Tel Aviv 2013: Ignite Talk: Monitoring Patterns with Riemann - It...DevOps Days Tel Aviv 2013: Ignite Talk: Monitoring Patterns with Riemann - It...
DevOps Days Tel Aviv 2013: Ignite Talk: Monitoring Patterns with Riemann - It...DevOpsDays Tel Aviv
 
Polyglot Persistence in the Real World: Cassandra + S3 + MapReduce
Polyglot Persistence in the Real World: Cassandra + S3 + MapReducePolyglot Persistence in the Real World: Cassandra + S3 + MapReduce
Polyglot Persistence in the Real World: Cassandra + S3 + MapReducethumbtacktech
 
Internship_Presentation
Internship_PresentationInternship_Presentation
Internship_PresentationSourabh Gujar
 
Unidirectional Data Flow with Reactor
Unidirectional Data Flow with ReactorUnidirectional Data Flow with Reactor
Unidirectional Data Flow with ReactorJason Larsen
 
Counters At Scale - A Cautionary Tale
Counters At Scale - A Cautionary TaleCounters At Scale - A Cautionary Tale
Counters At Scale - A Cautionary TaleEric Lubow
 
S3 cassandra or outer space? dumping time series data using spark
S3 cassandra or outer space? dumping time series data using sparkS3 cassandra or outer space? dumping time series data using spark
S3 cassandra or outer space? dumping time series data using sparkDemi Ben-Ari
 
Dynamically Evolving Systems: Cluster Analysis Using Time
Dynamically Evolving Systems: Cluster Analysis Using TimeDynamically Evolving Systems: Cluster Analysis Using Time
Dynamically Evolving Systems: Cluster Analysis Using TimeMagnify Analytic Solutions
 

What's hot (19)

codecentric AG: CQRS and Event Sourcing Applications with Cassandra
codecentric AG: CQRS and Event Sourcing Applications with Cassandracodecentric AG: CQRS and Event Sourcing Applications with Cassandra
codecentric AG: CQRS and Event Sourcing Applications with Cassandra
 
Cassandra Day Chicago 2015: The Evolution of Apache Cassandra at Signal
Cassandra Day Chicago 2015: The Evolution of Apache Cassandra at SignalCassandra Day Chicago 2015: The Evolution of Apache Cassandra at Signal
Cassandra Day Chicago 2015: The Evolution of Apache Cassandra at Signal
 
Mario on spark
Mario on sparkMario on spark
Mario on spark
 
Microservice performance-b
Microservice performance-bMicroservice performance-b
Microservice performance-b
 
RxSubject And Operators
RxSubject And OperatorsRxSubject And Operators
RxSubject And Operators
 
Migrating a 130TB Cluster from Elasticsearch 2 to 5 in 20 Hours Without Downtime
Migrating a 130TB Cluster from Elasticsearch 2 to 5 in 20 Hours Without DowntimeMigrating a 130TB Cluster from Elasticsearch 2 to 5 in 20 Hours Without Downtime
Migrating a 130TB Cluster from Elasticsearch 2 to 5 in 20 Hours Without Downtime
 
Puppet Camp Melbourne 2014: Node Collaboration with PuppetDB
Puppet Camp Melbourne 2014: Node Collaboration with PuppetDBPuppet Camp Melbourne 2014: Node Collaboration with PuppetDB
Puppet Camp Melbourne 2014: Node Collaboration with PuppetDB
 
Advanced search and Top-k queries in Cassandra - Cassandra Summit Europe 2014
Advanced search and Top-k queries in Cassandra - Cassandra Summit Europe 2014Advanced search and Top-k queries in Cassandra - Cassandra Summit Europe 2014
Advanced search and Top-k queries in Cassandra - Cassandra Summit Europe 2014
 
C*ollege Credit: CEP Distribtued Processing on Cassandra with Storm
C*ollege Credit: CEP Distribtued Processing on Cassandra with StormC*ollege Credit: CEP Distribtued Processing on Cassandra with Storm
C*ollege Credit: CEP Distribtued Processing on Cassandra with Storm
 
DevOps Days Tel Aviv 2013: Ignite Talk: Monitoring Patterns with Riemann - It...
DevOps Days Tel Aviv 2013: Ignite Talk: Monitoring Patterns with Riemann - It...DevOps Days Tel Aviv 2013: Ignite Talk: Monitoring Patterns with Riemann - It...
DevOps Days Tel Aviv 2013: Ignite Talk: Monitoring Patterns with Riemann - It...
 
Polyglot Persistence in the Real World: Cassandra + S3 + MapReduce
Polyglot Persistence in the Real World: Cassandra + S3 + MapReducePolyglot Persistence in the Real World: Cassandra + S3 + MapReduce
Polyglot Persistence in the Real World: Cassandra + S3 + MapReduce
 
So you think you can stream.pptx
So you think you can stream.pptxSo you think you can stream.pptx
So you think you can stream.pptx
 
Internship_Presentation
Internship_PresentationInternship_Presentation
Internship_Presentation
 
Unidirectional Data Flow with Reactor
Unidirectional Data Flow with ReactorUnidirectional Data Flow with Reactor
Unidirectional Data Flow with Reactor
 
Counters At Scale - A Cautionary Tale
Counters At Scale - A Cautionary TaleCounters At Scale - A Cautionary Tale
Counters At Scale - A Cautionary Tale
 
Spark streaming: Best Practices
Spark streaming: Best PracticesSpark streaming: Best Practices
Spark streaming: Best Practices
 
slides_nyc_2015
slides_nyc_2015slides_nyc_2015
slides_nyc_2015
 
S3 cassandra or outer space? dumping time series data using spark
S3 cassandra or outer space? dumping time series data using sparkS3 cassandra or outer space? dumping time series data using spark
S3 cassandra or outer space? dumping time series data using spark
 
Dynamically Evolving Systems: Cluster Analysis Using Time
Dynamically Evolving Systems: Cluster Analysis Using TimeDynamically Evolving Systems: Cluster Analysis Using Time
Dynamically Evolving Systems: Cluster Analysis Using Time
 

Viewers also liked

Building Data Pipelines with Spark and StreamSets
Building Data Pipelines with Spark and StreamSetsBuilding Data Pipelines with Spark and StreamSets
Building Data Pipelines with Spark and StreamSetsPat Patterson
 
Building Continuously Curated Ingestion Pipelines
Building Continuously Curated Ingestion PipelinesBuilding Continuously Curated Ingestion Pipelines
Building Continuously Curated Ingestion PipelinesArvind Prabhakar
 
Open Source Big Data Ingestion - Without the Heartburn!
Open Source Big Data Ingestion - Without the Heartburn!Open Source Big Data Ingestion - Without the Heartburn!
Open Source Big Data Ingestion - Without the Heartburn!Pat Patterson
 
Logging infrastructure for Microservices using StreamSets Data Collector
Logging infrastructure for Microservices using StreamSets Data CollectorLogging infrastructure for Microservices using StreamSets Data Collector
Logging infrastructure for Microservices using StreamSets Data CollectorCask Data
 
Ingest and Stream Processing - What will you choose?
Ingest and Stream Processing - What will you choose?Ingest and Stream Processing - What will you choose?
Ingest and Stream Processing - What will you choose?Pat Patterson
 
Ingest and Stream Processing - What will you choose?
Ingest and Stream Processing - What will you choose?Ingest and Stream Processing - What will you choose?
Ingest and Stream Processing - What will you choose?Pat Patterson
 
Building Custom Big Data Integrations
Building Custom Big Data IntegrationsBuilding Custom Big Data Integrations
Building Custom Big Data IntegrationsPat Patterson
 
Building Scalable Data Pipelines - 2016 DataPalooza Seattle
Building Scalable Data Pipelines - 2016 DataPalooza SeattleBuilding Scalable Data Pipelines - 2016 DataPalooza Seattle
Building Scalable Data Pipelines - 2016 DataPalooza SeattleEvan Chan
 
Provisioning IDaaS - Using SCIM to Enable Cloud Identity
Provisioning IDaaS - Using SCIM to Enable Cloud IdentityProvisioning IDaaS - Using SCIM to Enable Cloud Identity
Provisioning IDaaS - Using SCIM to Enable Cloud IdentityPat Patterson
 
OData: Universal Data Solvent or Clunky Enterprise Goo? (GlueCon 2015)
OData: Universal Data Solvent or Clunky Enterprise Goo? (GlueCon 2015)OData: Universal Data Solvent or Clunky Enterprise Goo? (GlueCon 2015)
OData: Universal Data Solvent or Clunky Enterprise Goo? (GlueCon 2015)Pat Patterson
 
Data pipelines from zero to solid
Data pipelines from zero to solidData pipelines from zero to solid
Data pipelines from zero to solidLars Albertsson
 
Kafka & Hadoop - for NYC Kafka Meetup
Kafka & Hadoop - for NYC Kafka MeetupKafka & Hadoop - for NYC Kafka Meetup
Kafka & Hadoop - for NYC Kafka MeetupGwen (Chen) Shapira
 
All Aboard the Boxcar! Going Beyond the Basics of REST
All Aboard the Boxcar! Going Beyond the Basics of RESTAll Aboard the Boxcar! Going Beyond the Basics of REST
All Aboard the Boxcar! Going Beyond the Basics of RESTPat Patterson
 
Data Aggregation At Scale Using Apache Flume
Data Aggregation At Scale Using Apache FlumeData Aggregation At Scale Using Apache Flume
Data Aggregation At Scale Using Apache FlumeArvind Prabhakar
 
Building Scalable Big Data Pipelines
Building Scalable Big Data PipelinesBuilding Scalable Big Data Pipelines
Building Scalable Big Data PipelinesChristian Gügi
 
A Beginner's Guide to Building Data Pipelines with Luigi
A Beginner's Guide to Building Data Pipelines with LuigiA Beginner's Guide to Building Data Pipelines with Luigi
A Beginner's Guide to Building Data Pipelines with LuigiGrowth Intelligence
 
Apache Flume - DataDayTexas
Apache Flume - DataDayTexasApache Flume - DataDayTexas
Apache Flume - DataDayTexasArvind Prabhakar
 
Building a Data Pipeline from Scratch - Joe Crobak
Building a Data Pipeline from Scratch - Joe CrobakBuilding a Data Pipeline from Scratch - Joe Crobak
Building a Data Pipeline from Scratch - Joe CrobakHakka Labs
 

Viewers also liked (18)

Building Data Pipelines with Spark and StreamSets
Building Data Pipelines with Spark and StreamSetsBuilding Data Pipelines with Spark and StreamSets
Building Data Pipelines with Spark and StreamSets
 
Building Continuously Curated Ingestion Pipelines
Building Continuously Curated Ingestion PipelinesBuilding Continuously Curated Ingestion Pipelines
Building Continuously Curated Ingestion Pipelines
 
Open Source Big Data Ingestion - Without the Heartburn!
Open Source Big Data Ingestion - Without the Heartburn!Open Source Big Data Ingestion - Without the Heartburn!
Open Source Big Data Ingestion - Without the Heartburn!
 
Logging infrastructure for Microservices using StreamSets Data Collector
Logging infrastructure for Microservices using StreamSets Data CollectorLogging infrastructure for Microservices using StreamSets Data Collector
Logging infrastructure for Microservices using StreamSets Data Collector
 
Ingest and Stream Processing - What will you choose?
Ingest and Stream Processing - What will you choose?Ingest and Stream Processing - What will you choose?
Ingest and Stream Processing - What will you choose?
 
Ingest and Stream Processing - What will you choose?
Ingest and Stream Processing - What will you choose?Ingest and Stream Processing - What will you choose?
Ingest and Stream Processing - What will you choose?
 
Building Custom Big Data Integrations
Building Custom Big Data IntegrationsBuilding Custom Big Data Integrations
Building Custom Big Data Integrations
 
Building Scalable Data Pipelines - 2016 DataPalooza Seattle
Building Scalable Data Pipelines - 2016 DataPalooza SeattleBuilding Scalable Data Pipelines - 2016 DataPalooza Seattle
Building Scalable Data Pipelines - 2016 DataPalooza Seattle
 
Provisioning IDaaS - Using SCIM to Enable Cloud Identity
Provisioning IDaaS - Using SCIM to Enable Cloud IdentityProvisioning IDaaS - Using SCIM to Enable Cloud Identity
Provisioning IDaaS - Using SCIM to Enable Cloud Identity
 
OData: Universal Data Solvent or Clunky Enterprise Goo? (GlueCon 2015)
OData: Universal Data Solvent or Clunky Enterprise Goo? (GlueCon 2015)OData: Universal Data Solvent or Clunky Enterprise Goo? (GlueCon 2015)
OData: Universal Data Solvent or Clunky Enterprise Goo? (GlueCon 2015)
 
Data pipelines from zero to solid
Data pipelines from zero to solidData pipelines from zero to solid
Data pipelines from zero to solid
 
Kafka & Hadoop - for NYC Kafka Meetup
Kafka & Hadoop - for NYC Kafka MeetupKafka & Hadoop - for NYC Kafka Meetup
Kafka & Hadoop - for NYC Kafka Meetup
 
All Aboard the Boxcar! Going Beyond the Basics of REST
All Aboard the Boxcar! Going Beyond the Basics of RESTAll Aboard the Boxcar! Going Beyond the Basics of REST
All Aboard the Boxcar! Going Beyond the Basics of REST
 
Data Aggregation At Scale Using Apache Flume
Data Aggregation At Scale Using Apache FlumeData Aggregation At Scale Using Apache Flume
Data Aggregation At Scale Using Apache Flume
 
Building Scalable Big Data Pipelines
Building Scalable Big Data PipelinesBuilding Scalable Big Data Pipelines
Building Scalable Big Data Pipelines
 
A Beginner's Guide to Building Data Pipelines with Luigi
A Beginner's Guide to Building Data Pipelines with LuigiA Beginner's Guide to Building Data Pipelines with Luigi
A Beginner's Guide to Building Data Pipelines with Luigi
 
Apache Flume - DataDayTexas
Apache Flume - DataDayTexasApache Flume - DataDayTexas
Apache Flume - DataDayTexas
 
Building a Data Pipeline from Scratch - Joe Crobak
Building a Data Pipeline from Scratch - Joe CrobakBuilding a Data Pipeline from Scratch - Joe Crobak
Building a Data Pipeline from Scratch - Joe Crobak
 

Similar to Adaptive Data Cleansing with StreamSets and Cassandra

Tues 115pm cassandra + s3 + hadoop = quick auditing and analytics_yazovskiy
Tues 115pm cassandra + s3 + hadoop = quick auditing and analytics_yazovskiyTues 115pm cassandra + s3 + hadoop = quick auditing and analytics_yazovskiy
Tues 115pm cassandra + s3 + hadoop = quick auditing and analytics_yazovskiyAnton Yazovskiy
 
Samantha Wang [InfluxData] | Best Practices on How to Transform Your Data Usi...
Samantha Wang [InfluxData] | Best Practices on How to Transform Your Data Usi...Samantha Wang [InfluxData] | Best Practices on How to Transform Your Data Usi...
Samantha Wang [InfluxData] | Best Practices on How to Transform Your Data Usi...InfluxData
 
A Practical Approach to Building a Streaming Processing Pipeline for an Onlin...
A Practical Approach to Building a Streaming Processing Pipeline for an Onlin...A Practical Approach to Building a Streaming Processing Pipeline for an Onlin...
A Practical Approach to Building a Streaming Processing Pipeline for an Onlin...Databricks
 
Evan Schultz - Angular Summit - 2016
Evan Schultz - Angular Summit - 2016Evan Schultz - Angular Summit - 2016
Evan Schultz - Angular Summit - 2016Evan Schultz
 
COCOA: Communication-Efficient Coordinate Ascent
COCOA: Communication-Efficient Coordinate AscentCOCOA: Communication-Efficient Coordinate Ascent
COCOA: Communication-Efficient Coordinate Ascentjeykottalam
 
Codepot - Pig i Hive: szybkie wprowadzenie / Pig and Hive crash course
Codepot - Pig i Hive: szybkie wprowadzenie / Pig and Hive crash courseCodepot - Pig i Hive: szybkie wprowadzenie / Pig and Hive crash course
Codepot - Pig i Hive: szybkie wprowadzenie / Pig and Hive crash courseSages
 
Mcrl2 by kashif.namal@gmail.com, adnanskyousafzai@gmail.com
Mcrl2 by kashif.namal@gmail.com, adnanskyousafzai@gmail.comMcrl2 by kashif.namal@gmail.com, adnanskyousafzai@gmail.com
Mcrl2 by kashif.namal@gmail.com, adnanskyousafzai@gmail.comkashif kashif
 
Average- An android project
Average- An android projectAverage- An android project
Average- An android projectIpsit Dash
 
From Startup to Mature Company: PostgreSQL Tips and techniques
From Startup to Mature Company:  PostgreSQL Tips and techniquesFrom Startup to Mature Company:  PostgreSQL Tips and techniques
From Startup to Mature Company: PostgreSQL Tips and techniquesJohn Ashmead
 
Ft10 de smet
Ft10 de smetFt10 de smet
Ft10 de smetnkaluva
 
WattGo: Analyses temps-réél de series temporelles avec Spark et Solr (Français)
WattGo: Analyses temps-réél de series temporelles avec Spark et Solr (Français)WattGo: Analyses temps-réél de series temporelles avec Spark et Solr (Français)
WattGo: Analyses temps-réél de series temporelles avec Spark et Solr (Français)DataStax Academy
 
Hadoop Integration in Cassandra
Hadoop Integration in CassandraHadoop Integration in Cassandra
Hadoop Integration in CassandraJairam Chandar
 
Airbnb Search Architecture: Presented by Maxim Charkov, Airbnb
Airbnb Search Architecture: Presented by Maxim Charkov, AirbnbAirbnb Search Architecture: Presented by Maxim Charkov, Airbnb
Airbnb Search Architecture: Presented by Maxim Charkov, AirbnbLucidworks
 
Accelerating analytics on the Sensor and IoT Data.
Accelerating analytics on the Sensor and IoT Data. Accelerating analytics on the Sensor and IoT Data.
Accelerating analytics on the Sensor and IoT Data. Keshav Murthy
 
Avoiding the Pit of Despair - Event Sourcing with Akka and Cassandra
Avoiding the Pit of Despair - Event Sourcing with Akka and CassandraAvoiding the Pit of Despair - Event Sourcing with Akka and Cassandra
Avoiding the Pit of Despair - Event Sourcing with Akka and CassandraLuke Tillman
 

Similar to Adaptive Data Cleansing with StreamSets and Cassandra (20)

Tues 115pm cassandra + s3 + hadoop = quick auditing and analytics_yazovskiy
Tues 115pm cassandra + s3 + hadoop = quick auditing and analytics_yazovskiyTues 115pm cassandra + s3 + hadoop = quick auditing and analytics_yazovskiy
Tues 115pm cassandra + s3 + hadoop = quick auditing and analytics_yazovskiy
 
Analytics with Spark
Analytics with SparkAnalytics with Spark
Analytics with Spark
 
Samantha Wang [InfluxData] | Best Practices on How to Transform Your Data Usi...
Samantha Wang [InfluxData] | Best Practices on How to Transform Your Data Usi...Samantha Wang [InfluxData] | Best Practices on How to Transform Your Data Usi...
Samantha Wang [InfluxData] | Best Practices on How to Transform Your Data Usi...
 
A Practical Approach to Building a Streaming Processing Pipeline for an Onlin...
A Practical Approach to Building a Streaming Processing Pipeline for an Onlin...A Practical Approach to Building a Streaming Processing Pipeline for an Onlin...
A Practical Approach to Building a Streaming Processing Pipeline for an Onlin...
 
Evan Schultz - Angular Summit - 2016
Evan Schultz - Angular Summit - 2016Evan Schultz - Angular Summit - 2016
Evan Schultz - Angular Summit - 2016
 
COCOA: Communication-Efficient Coordinate Ascent
COCOA: Communication-Efficient Coordinate AscentCOCOA: Communication-Efficient Coordinate Ascent
COCOA: Communication-Efficient Coordinate Ascent
 
Jdbc
JdbcJdbc
Jdbc
 
Codepot - Pig i Hive: szybkie wprowadzenie / Pig and Hive crash course
Codepot - Pig i Hive: szybkie wprowadzenie / Pig and Hive crash courseCodepot - Pig i Hive: szybkie wprowadzenie / Pig and Hive crash course
Codepot - Pig i Hive: szybkie wprowadzenie / Pig and Hive crash course
 
Android wearpp
Android wearppAndroid wearpp
Android wearpp
 
Mcrl2 by kashif.namal@gmail.com, adnanskyousafzai@gmail.com
Mcrl2 by kashif.namal@gmail.com, adnanskyousafzai@gmail.comMcrl2 by kashif.namal@gmail.com, adnanskyousafzai@gmail.com
Mcrl2 by kashif.namal@gmail.com, adnanskyousafzai@gmail.com
 
Average- An android project
Average- An android projectAverage- An android project
Average- An android project
 
Introduction to Redux
Introduction to ReduxIntroduction to Redux
Introduction to Redux
 
MCRL2
MCRL2MCRL2
MCRL2
 
From Startup to Mature Company: PostgreSQL Tips and techniques
From Startup to Mature Company:  PostgreSQL Tips and techniquesFrom Startup to Mature Company:  PostgreSQL Tips and techniques
From Startup to Mature Company: PostgreSQL Tips and techniques
 
Ft10 de smet
Ft10 de smetFt10 de smet
Ft10 de smet
 
WattGo: Analyses temps-réél de series temporelles avec Spark et Solr (Français)
WattGo: Analyses temps-réél de series temporelles avec Spark et Solr (Français)WattGo: Analyses temps-réél de series temporelles avec Spark et Solr (Français)
WattGo: Analyses temps-réél de series temporelles avec Spark et Solr (Français)
 
Hadoop Integration in Cassandra
Hadoop Integration in CassandraHadoop Integration in Cassandra
Hadoop Integration in Cassandra
 
Airbnb Search Architecture: Presented by Maxim Charkov, Airbnb
Airbnb Search Architecture: Presented by Maxim Charkov, AirbnbAirbnb Search Architecture: Presented by Maxim Charkov, Airbnb
Airbnb Search Architecture: Presented by Maxim Charkov, Airbnb
 
Accelerating analytics on the Sensor and IoT Data.
Accelerating analytics on the Sensor and IoT Data. Accelerating analytics on the Sensor and IoT Data.
Accelerating analytics on the Sensor and IoT Data.
 
Avoiding the Pit of Despair - Event Sourcing with Akka and Cassandra
Avoiding the Pit of Despair - Event Sourcing with Akka and CassandraAvoiding the Pit of Despair - Event Sourcing with Akka and Cassandra
Avoiding the Pit of Despair - Event Sourcing with Akka and Cassandra
 

More from Pat Patterson

DevOps from the Provider Perspective
DevOps from the Provider PerspectiveDevOps from the Provider Perspective
DevOps from the Provider PerspectivePat Patterson
 
How Imprivata Combines External Data Sources for Business Insights
How Imprivata Combines External Data Sources for Business InsightsHow Imprivata Combines External Data Sources for Business Insights
How Imprivata Combines External Data Sources for Business InsightsPat Patterson
 
Data Integration with Apache Kafka: What, Why, How
Data Integration with Apache Kafka: What, Why, HowData Integration with Apache Kafka: What, Why, How
Data Integration with Apache Kafka: What, Why, HowPat Patterson
 
Project Ouroboros: Using StreamSets Data Collector to Help Manage the StreamS...
Project Ouroboros: Using StreamSets Data Collector to Help Manage the StreamS...Project Ouroboros: Using StreamSets Data Collector to Help Manage the StreamS...
Project Ouroboros: Using StreamSets Data Collector to Help Manage the StreamS...Pat Patterson
 
Dealing with Drift: Building an Enterprise Data Lake
Dealing with Drift: Building an Enterprise Data LakeDealing with Drift: Building an Enterprise Data Lake
Dealing with Drift: Building an Enterprise Data LakePat Patterson
 
Integrating with Einstein Analytics
Integrating with Einstein AnalyticsIntegrating with Einstein Analytics
Integrating with Einstein AnalyticsPat Patterson
 
Efficient Schemas in Motion with Kafka and Schema Registry
Efficient Schemas in Motion with Kafka and Schema RegistryEfficient Schemas in Motion with Kafka and Schema Registry
Efficient Schemas in Motion with Kafka and Schema RegistryPat Patterson
 
Dealing With Drift - Building an Enterprise Data Lake
Dealing With Drift - Building an Enterprise Data LakeDealing With Drift - Building an Enterprise Data Lake
Dealing With Drift - Building an Enterprise Data LakePat Patterson
 
Enterprise IoT: Data in Context
Enterprise IoT: Data in ContextEnterprise IoT: Data in Context
Enterprise IoT: Data in ContextPat Patterson
 
OData: A Standard API for Data Access
OData: A Standard API for Data AccessOData: A Standard API for Data Access
OData: A Standard API for Data AccessPat Patterson
 
API-Driven Relationships: Building The Trans-Internet Express of the Future
API-Driven Relationships: Building The Trans-Internet Express of the FutureAPI-Driven Relationships: Building The Trans-Internet Express of the Future
API-Driven Relationships: Building The Trans-Internet Express of the FuturePat Patterson
 
Using Salesforce to Manage Your Developer Community
Using Salesforce to Manage Your Developer CommunityUsing Salesforce to Manage Your Developer Community
Using Salesforce to Manage Your Developer CommunityPat Patterson
 
Identity in the Cloud
Identity in the CloudIdentity in the Cloud
Identity in the CloudPat Patterson
 
OpenID Connect: An Overview
OpenID Connect: An OverviewOpenID Connect: An Overview
OpenID Connect: An OverviewPat Patterson
 
How I Learned to Stop Worrying and Love Open Source Identity (Paris Edition)
How I Learned to Stop Worrying and Love Open Source Identity (Paris Edition)How I Learned to Stop Worrying and Love Open Source Identity (Paris Edition)
How I Learned to Stop Worrying and Love Open Source Identity (Paris Edition)Pat Patterson
 
Salesforce Integration with Twilio
Salesforce Integration with TwilioSalesforce Integration with Twilio
Salesforce Integration with TwilioPat Patterson
 
How I Learned to Stop Worrying and Love Open Source Identity
How I Learned to Stop Worrying and Love Open Source IdentityHow I Learned to Stop Worrying and Love Open Source Identity
How I Learned to Stop Worrying and Love Open Source IdentityPat Patterson
 
Mobile Developer Week
Mobile Developer WeekMobile Developer Week
Mobile Developer WeekPat Patterson
 
Taking Identity from the Enterprise to the Cloud
Taking Identity from the Enterprise to the CloudTaking Identity from the Enterprise to the Cloud
Taking Identity from the Enterprise to the CloudPat Patterson
 

More from Pat Patterson (20)

DevOps from the Provider Perspective
DevOps from the Provider PerspectiveDevOps from the Provider Perspective
DevOps from the Provider Perspective
 
How Imprivata Combines External Data Sources for Business Insights
How Imprivata Combines External Data Sources for Business InsightsHow Imprivata Combines External Data Sources for Business Insights
How Imprivata Combines External Data Sources for Business Insights
 
Data Integration with Apache Kafka: What, Why, How
Data Integration with Apache Kafka: What, Why, HowData Integration with Apache Kafka: What, Why, How
Data Integration with Apache Kafka: What, Why, How
 
Project Ouroboros: Using StreamSets Data Collector to Help Manage the StreamS...
Project Ouroboros: Using StreamSets Data Collector to Help Manage the StreamS...Project Ouroboros: Using StreamSets Data Collector to Help Manage the StreamS...
Project Ouroboros: Using StreamSets Data Collector to Help Manage the StreamS...
 
Dealing with Drift: Building an Enterprise Data Lake
Dealing with Drift: Building an Enterprise Data LakeDealing with Drift: Building an Enterprise Data Lake
Dealing with Drift: Building an Enterprise Data Lake
 
Integrating with Einstein Analytics
Integrating with Einstein AnalyticsIntegrating with Einstein Analytics
Integrating with Einstein Analytics
 
Efficient Schemas in Motion with Kafka and Schema Registry
Efficient Schemas in Motion with Kafka and Schema RegistryEfficient Schemas in Motion with Kafka and Schema Registry
Efficient Schemas in Motion with Kafka and Schema Registry
 
Dealing With Drift - Building an Enterprise Data Lake
Dealing With Drift - Building an Enterprise Data LakeDealing With Drift - Building an Enterprise Data Lake
Dealing With Drift - Building an Enterprise Data Lake
 
Enterprise IoT: Data in Context
Enterprise IoT: Data in ContextEnterprise IoT: Data in Context
Enterprise IoT: Data in Context
 
OData: A Standard API for Data Access
OData: A Standard API for Data AccessOData: A Standard API for Data Access
OData: A Standard API for Data Access
 
API-Driven Relationships: Building The Trans-Internet Express of the Future
API-Driven Relationships: Building The Trans-Internet Express of the FutureAPI-Driven Relationships: Building The Trans-Internet Express of the Future
API-Driven Relationships: Building The Trans-Internet Express of the Future
 
Using Salesforce to Manage Your Developer Community
Using Salesforce to Manage Your Developer CommunityUsing Salesforce to Manage Your Developer Community
Using Salesforce to Manage Your Developer Community
 
Identity in the Cloud
Identity in the CloudIdentity in the Cloud
Identity in the Cloud
 
OpenID Connect: An Overview
OpenID Connect: An OverviewOpenID Connect: An Overview
OpenID Connect: An Overview
 
How I Learned to Stop Worrying and Love Open Source Identity (Paris Edition)
How I Learned to Stop Worrying and Love Open Source Identity (Paris Edition)How I Learned to Stop Worrying and Love Open Source Identity (Paris Edition)
How I Learned to Stop Worrying and Love Open Source Identity (Paris Edition)
 
Salesforce Integration with Twilio
Salesforce Integration with TwilioSalesforce Integration with Twilio
Salesforce Integration with Twilio
 
SAML Smackdown
SAML SmackdownSAML Smackdown
SAML Smackdown
 
How I Learned to Stop Worrying and Love Open Source Identity
How I Learned to Stop Worrying and Love Open Source IdentityHow I Learned to Stop Worrying and Love Open Source Identity
How I Learned to Stop Worrying and Love Open Source Identity
 
Mobile Developer Week
Mobile Developer WeekMobile Developer Week
Mobile Developer Week
 
Taking Identity from the Enterprise to the Cloud
Taking Identity from the Enterprise to the CloudTaking Identity from the Enterprise to the Cloud
Taking Identity from the Enterprise to the Cloud
 

Recently uploaded

Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...OnePlan Solutions
 
Automate your Kamailio Test Calls - Kamailio World 2024
Automate your Kamailio Test Calls - Kamailio World 2024Automate your Kamailio Test Calls - Kamailio World 2024
Automate your Kamailio Test Calls - Kamailio World 2024Andreas Granig
 
MYjobs Presentation Django-based project
MYjobs Presentation Django-based projectMYjobs Presentation Django-based project
MYjobs Presentation Django-based projectAnoyGreter
 
Best Web Development Agency- Idiosys USA.pdf
Best Web Development Agency- Idiosys USA.pdfBest Web Development Agency- Idiosys USA.pdf
Best Web Development Agency- Idiosys USA.pdfIdiosysTechnologies1
 
What is Advanced Excel and what are some best practices for designing and cre...
What is Advanced Excel and what are some best practices for designing and cre...What is Advanced Excel and what are some best practices for designing and cre...
What is Advanced Excel and what are some best practices for designing and cre...Technogeeks
 
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...Angel Borroy López
 
Unveiling the Future: Sylius 2.0 New Features
Unveiling the Future: Sylius 2.0 New FeaturesUnveiling the Future: Sylius 2.0 New Features
Unveiling the Future: Sylius 2.0 New FeaturesŁukasz Chruściel
 
Unveiling Design Patterns: A Visual Guide with UML Diagrams
Unveiling Design Patterns: A Visual Guide with UML DiagramsUnveiling Design Patterns: A Visual Guide with UML Diagrams
Unveiling Design Patterns: A Visual Guide with UML DiagramsAhmed Mohamed
 
React Server Component in Next.js by Hanief Utama
React Server Component in Next.js by Hanief UtamaReact Server Component in Next.js by Hanief Utama
React Server Component in Next.js by Hanief UtamaHanief Utama
 
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...stazi3110
 
Ahmed Motair CV April 2024 (Senior SW Developer)
Ahmed Motair CV April 2024 (Senior SW Developer)Ahmed Motair CV April 2024 (Senior SW Developer)
Ahmed Motair CV April 2024 (Senior SW Developer)Ahmed Mater
 
办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样
办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样
办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样umasea
 
What are the key points to focus on before starting to learn ETL Development....
What are the key points to focus on before starting to learn ETL Development....What are the key points to focus on before starting to learn ETL Development....
What are the key points to focus on before starting to learn ETL Development....kzayra69
 
Recruitment Management Software Benefits (Infographic)
Recruitment Management Software Benefits (Infographic)Recruitment Management Software Benefits (Infographic)
Recruitment Management Software Benefits (Infographic)Hr365.us smith
 
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptxKnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptxTier1 app
 
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024StefanoLambiase
 
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...Cizo Technology Services
 
Buds n Tech IT Solutions: Top-Notch Web Services in Noida
Buds n Tech IT Solutions: Top-Notch Web Services in NoidaBuds n Tech IT Solutions: Top-Notch Web Services in Noida
Buds n Tech IT Solutions: Top-Notch Web Services in Noidabntitsolutionsrishis
 
Implementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with AzureImplementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with AzureDinusha Kumarasiri
 

Recently uploaded (20)

Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...
 
Automate your Kamailio Test Calls - Kamailio World 2024
Automate your Kamailio Test Calls - Kamailio World 2024Automate your Kamailio Test Calls - Kamailio World 2024
Automate your Kamailio Test Calls - Kamailio World 2024
 
MYjobs Presentation Django-based project
MYjobs Presentation Django-based projectMYjobs Presentation Django-based project
MYjobs Presentation Django-based project
 
Best Web Development Agency- Idiosys USA.pdf
Best Web Development Agency- Idiosys USA.pdfBest Web Development Agency- Idiosys USA.pdf
Best Web Development Agency- Idiosys USA.pdf
 
What is Advanced Excel and what are some best practices for designing and cre...
What is Advanced Excel and what are some best practices for designing and cre...What is Advanced Excel and what are some best practices for designing and cre...
What is Advanced Excel and what are some best practices for designing and cre...
 
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
 
Unveiling the Future: Sylius 2.0 New Features
Unveiling the Future: Sylius 2.0 New FeaturesUnveiling the Future: Sylius 2.0 New Features
Unveiling the Future: Sylius 2.0 New Features
 
Unveiling Design Patterns: A Visual Guide with UML Diagrams
Unveiling Design Patterns: A Visual Guide with UML DiagramsUnveiling Design Patterns: A Visual Guide with UML Diagrams
Unveiling Design Patterns: A Visual Guide with UML Diagrams
 
React Server Component in Next.js by Hanief Utama
React Server Component in Next.js by Hanief UtamaReact Server Component in Next.js by Hanief Utama
React Server Component in Next.js by Hanief Utama
 
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
 
Ahmed Motair CV April 2024 (Senior SW Developer)
Ahmed Motair CV April 2024 (Senior SW Developer)Ahmed Motair CV April 2024 (Senior SW Developer)
Ahmed Motair CV April 2024 (Senior SW Developer)
 
Advantages of Odoo ERP 17 for Your Business
Advantages of Odoo ERP 17 for Your BusinessAdvantages of Odoo ERP 17 for Your Business
Advantages of Odoo ERP 17 for Your Business
 
办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样
办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样
办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样
 
What are the key points to focus on before starting to learn ETL Development....
What are the key points to focus on before starting to learn ETL Development....What are the key points to focus on before starting to learn ETL Development....
What are the key points to focus on before starting to learn ETL Development....
 
Recruitment Management Software Benefits (Infographic)
Recruitment Management Software Benefits (Infographic)Recruitment Management Software Benefits (Infographic)
Recruitment Management Software Benefits (Infographic)
 
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptxKnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
 
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
 
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
 
Buds n Tech IT Solutions: Top-Notch Web Services in Noida
Buds n Tech IT Solutions: Top-Notch Web Services in NoidaBuds n Tech IT Solutions: Top-Notch Web Services in Noida
Buds n Tech IT Solutions: Top-Notch Web Services in Noida
 
Implementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with AzureImplementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with Azure
 

Adaptive Data Cleansing with StreamSets and Cassandra

  • 1. Adaptive Data Cleansing with StreamSets and Cassandra Pat Patterson Community Champion @metadaddy pat@superpat.com
  • 2. About Pat Pat Patterson Community Champion @ StreamSets Formerly Developer Evangelist at Salesforce Contact pat@streamsets.com @metadaddy
  • 3. Feeding Cassandra with StreamSets Data Collector User defined aggregate functions in Cassandra Push statistics back into the data pipeline Resources Q & A Agenda
  • 4. Devices sending sensor readings to RabbitMQ via MQTT Sensor id, time, temperature, orientation Convert orientation from integer to string 0 -> “horizontal”, 1 -> “vertical” Filter outlier values Write ‘clean’ readings to Cassandra Use Case: IoT Sensor Data
  • 5. StreamSets Data Collector http://api Open source continuous big data ingest infrastructure ◦ Off- or near-cluster ◦ Operating on data in-motion ◦ One-time processing, scales linearly ◦ Direct control over data integrity
  • 6. Ingest from RabbitMQ to Cassandra
  • 7. Could easily filter on some static boundary But that will only catch ‘obvious’ problems Can we find outlier values in a more dynamic, flexible way? Let’s define an outlier as any temperature value more than 4 standard deviations from the past hour’s mean temperature Cleaning Up The Data
  • 8. Standard aggregate functions: min, max, avg, sum, count SELECT avg(temperature), count(temperature) FROM readings WHERE sensor_id = 1 AND time > '2016-08-17 18:11:00+0000' Define custom aggregates as online algorithms in terms of two Java/JavaScript functions: state function and final function State function takes old state and value, returns new state Tuple stateFn(Tuple state, double x) Final function takes final state, returns aggregate value double finalFn(Tuple state) Cassandra User Defined Aggregates
  • 9. def online_variance(data): n = 0 mean = 0.0 M2 = 0.0 for x in data: n += 1 delta = x – mean mean += delta/n M2 += delta*(x - mean) if n < 2: return float('nan') else: return M2 / (n - 1) Online Standard Deviation - Welford’s Method
  • 10. cqlsh:mykeyspace> CREATE OR REPLACE FUNCTION sdState ( state tuple<int,double,double>, val double ) CALLED ON NULL INPUT RETURNS tuple<int,double,double> LANGUAGE java AS 'int n = state.getInt(0); double mean = state.getDouble(1); double m2 = state.getDouble(2); n++; double delta = val - mean; mean += delta / n; m2 += delta * (val - mean); state.setInt(0, n); state.setDouble(1, mean); state.setDouble(2, m2); return state;'; cqlsh:mykeyspace> CREATE OR REPLACE FUNCTION sdFinal ( state tuple<int,double,double> ) CALLED ON NULL INPUT RETURNS double LANGUAGE java AS 'int n = state.getInt(0); double m2 = state.getDouble(2); if (n < 1) { return null; } return Math.sqrt(m2 / (n - 1));'; cqlsh:mykeyspace> CREATE AGGREGATE IF NOT EXISTS stdev ( double ) ... SFUNC sdState STYPE tuple<int,double,double> FINALFUNC sdFinal INITCOND (0,0,0); Define Standard Deviation as a UDA
  • 11. Java app periodically queries Cassandra for mean, standard deviation for the past hour’s data, writes to resource files PreparedStatement statement = session.prepare( "SELECT AVG(temperature), STDEV(temperature) " + "FROM sensor_readings " + "WHERE sensor_id = ? AND TIME > ?"); BoundStatement boundStatement = new BoundStatement(statement); ... long startMillis = System.currentTimeMillis() - timeRangeMillis; ResultSet results = session.execute( boundStatement.bind(sensorId, new Date(startMillis))); double avg = row.getDouble("system.avg(temperature)"), sd = row.getDouble("mykeyspace.stdev(temperature)"); Feeding Statistics Back Into the Pipeline
  • 12. Putting It All Together
  • 13. Ingesting MQTT Traffic into Riak TS via RabbitMQ and StreamSets http://bit.ly/ingest-mqtt Ingesting Sensor Data on the Raspberry Pi with StreamSets Data Collector http://bit.ly/ingest-sensors Standard Deviations on Cassandra – Rolling Your Own Aggregate Function http://bit.ly/cassandra-uda Dynamic Outlier Detection with StreamSets and Cassandra http://bit.ly/dynamic-outliers Resources