SlideShare a Scribd company logo
1 of 14
Adaptive Data Cleansing with
StreamSets and Cassandra
Pat Patterson
Community Champion
@metadaddy
pat@superpat.com
About Pat
Pat Patterson
Community Champion @ StreamSets
Formerly Developer Evangelist at
Salesforce
Contact
pat@streamsets.com
@metadaddy
Feeding Cassandra with StreamSets Data Collector
User defined aggregate functions in Cassandra
Push statistics back into the data pipeline
Resources
Q & A
Agenda
Devices sending sensor readings to
RabbitMQ via MQTT
Sensor id, time, temperature, orientation
Convert orientation from integer to string
0 -> “horizontal”, 1 -> “vertical”
Filter outlier values
Write ‘clean’ readings to Cassandra
Use Case: IoT Sensor Data
StreamSets Data Collector
http://api
Open source continuous big data
ingest infrastructure
◦ Off- or near-cluster
◦ Operating on data in-motion
◦ One-time processing, scales linearly
◦ Direct control over data integrity
Ingest from RabbitMQ to Cassandra
Could easily filter on some static boundary
But that will only catch ‘obvious’ problems
Can we find outlier values in a more dynamic,
flexible way?
Let’s define an outlier as any temperature value
more than 4 standard deviations from the past
hour’s mean temperature
Cleaning Up The Data
Standard aggregate functions: min, max, avg, sum, count
SELECT avg(temperature), count(temperature)
FROM readings
WHERE sensor_id = 1 AND time > '2016-08-17 18:11:00+0000'
Define custom aggregates as online algorithms in terms of two Java/JavaScript
functions: state function and final function
State function takes old state and value, returns new state
Tuple stateFn(Tuple state, double x)
Final function takes final state, returns aggregate value
double finalFn(Tuple state)
Cassandra User Defined Aggregates
def online_variance(data):
n = 0
mean = 0.0
M2 = 0.0
for x in data:
n += 1
delta = x – mean
mean += delta/n
M2 += delta*(x - mean)
if n < 2:
return float('nan')
else:
return M2 / (n - 1)
Online Standard Deviation - Welford’s Method
cqlsh:mykeyspace> CREATE OR REPLACE FUNCTION sdState ( state tuple<int,double,double>, val
double ) CALLED ON NULL INPUT RETURNS tuple<int,double,double> LANGUAGE java AS 'int n =
state.getInt(0); double mean = state.getDouble(1); double m2 = state.getDouble(2); n++; double delta
= val - mean; mean += delta / n; m2 += delta * (val - mean); state.setInt(0, n); state.setDouble(1,
mean); state.setDouble(2, m2); return state;';
cqlsh:mykeyspace> CREATE OR REPLACE FUNCTION sdFinal ( state tuple<int,double,double> )
CALLED ON NULL INPUT RETURNS double LANGUAGE java AS 'int n = state.getInt(0); double m2
= state.getDouble(2); if (n < 1) { return null; } return Math.sqrt(m2 / (n - 1));';
cqlsh:mykeyspace> CREATE AGGREGATE IF NOT EXISTS stdev ( double ) ... SFUNC sdState
STYPE tuple<int,double,double> FINALFUNC sdFinal INITCOND (0,0,0);
Define Standard Deviation as a UDA
Java app periodically queries Cassandra for mean, standard
deviation for the past hour’s data, writes to resource files
PreparedStatement statement = session.prepare(
"SELECT AVG(temperature), STDEV(temperature) " +
"FROM sensor_readings " +
"WHERE sensor_id = ? AND TIME > ?");
BoundStatement boundStatement = new BoundStatement(statement);
...
long startMillis = System.currentTimeMillis() - timeRangeMillis;
ResultSet results = session.execute(
boundStatement.bind(sensorId, new Date(startMillis)));
double avg = row.getDouble("system.avg(temperature)"),
sd = row.getDouble("mykeyspace.stdev(temperature)");
Feeding Statistics Back Into the Pipeline
Putting It All Together
Ingesting MQTT Traffic into Riak TS via RabbitMQ and StreamSets
http://bit.ly/ingest-mqtt
Ingesting Sensor Data on the Raspberry Pi with StreamSets Data Collector
http://bit.ly/ingest-sensors
Standard Deviations on Cassandra – Rolling Your Own Aggregate Function
http://bit.ly/cassandra-uda
Dynamic Outlier Detection with StreamSets and Cassandra
http://bit.ly/dynamic-outliers
Resources
Questions?
Pat Patterson
Community Champion
@metadaddy
pat@superpat.com

More Related Content

What's hot

Puppet Camp Melbourne 2014: Node Collaboration with PuppetDB
Puppet Camp Melbourne 2014: Node Collaboration with PuppetDBPuppet Camp Melbourne 2014: Node Collaboration with PuppetDB
Puppet Camp Melbourne 2014: Node Collaboration with PuppetDB
Puppet
 
Internship_Presentation
Internship_PresentationInternship_Presentation
Internship_Presentation
Sourabh Gujar
 

What's hot (19)

codecentric AG: CQRS and Event Sourcing Applications with Cassandra
codecentric AG: CQRS and Event Sourcing Applications with Cassandracodecentric AG: CQRS and Event Sourcing Applications with Cassandra
codecentric AG: CQRS and Event Sourcing Applications with Cassandra
 
Cassandra Day Chicago 2015: The Evolution of Apache Cassandra at Signal
Cassandra Day Chicago 2015: The Evolution of Apache Cassandra at SignalCassandra Day Chicago 2015: The Evolution of Apache Cassandra at Signal
Cassandra Day Chicago 2015: The Evolution of Apache Cassandra at Signal
 
Mario on spark
Mario on sparkMario on spark
Mario on spark
 
Microservice performance-b
Microservice performance-bMicroservice performance-b
Microservice performance-b
 
RxSubject And Operators
RxSubject And OperatorsRxSubject And Operators
RxSubject And Operators
 
Migrating a 130TB Cluster from Elasticsearch 2 to 5 in 20 Hours Without Downtime
Migrating a 130TB Cluster from Elasticsearch 2 to 5 in 20 Hours Without DowntimeMigrating a 130TB Cluster from Elasticsearch 2 to 5 in 20 Hours Without Downtime
Migrating a 130TB Cluster from Elasticsearch 2 to 5 in 20 Hours Without Downtime
 
Puppet Camp Melbourne 2014: Node Collaboration with PuppetDB
Puppet Camp Melbourne 2014: Node Collaboration with PuppetDBPuppet Camp Melbourne 2014: Node Collaboration with PuppetDB
Puppet Camp Melbourne 2014: Node Collaboration with PuppetDB
 
Advanced search and Top-k queries in Cassandra - Cassandra Summit Europe 2014
Advanced search and Top-k queries in Cassandra - Cassandra Summit Europe 2014Advanced search and Top-k queries in Cassandra - Cassandra Summit Europe 2014
Advanced search and Top-k queries in Cassandra - Cassandra Summit Europe 2014
 
C*ollege Credit: CEP Distribtued Processing on Cassandra with Storm
C*ollege Credit: CEP Distribtued Processing on Cassandra with StormC*ollege Credit: CEP Distribtued Processing on Cassandra with Storm
C*ollege Credit: CEP Distribtued Processing on Cassandra with Storm
 
DevOps Days Tel Aviv 2013: Ignite Talk: Monitoring Patterns with Riemann - It...
DevOps Days Tel Aviv 2013: Ignite Talk: Monitoring Patterns with Riemann - It...DevOps Days Tel Aviv 2013: Ignite Talk: Monitoring Patterns with Riemann - It...
DevOps Days Tel Aviv 2013: Ignite Talk: Monitoring Patterns with Riemann - It...
 
Polyglot Persistence in the Real World: Cassandra + S3 + MapReduce
Polyglot Persistence in the Real World: Cassandra + S3 + MapReducePolyglot Persistence in the Real World: Cassandra + S3 + MapReduce
Polyglot Persistence in the Real World: Cassandra + S3 + MapReduce
 
So you think you can stream.pptx
So you think you can stream.pptxSo you think you can stream.pptx
So you think you can stream.pptx
 
Internship_Presentation
Internship_PresentationInternship_Presentation
Internship_Presentation
 
Unidirectional Data Flow with Reactor
Unidirectional Data Flow with ReactorUnidirectional Data Flow with Reactor
Unidirectional Data Flow with Reactor
 
Spark streaming: Best Practices
Spark streaming: Best PracticesSpark streaming: Best Practices
Spark streaming: Best Practices
 
Counters At Scale - A Cautionary Tale
Counters At Scale - A Cautionary TaleCounters At Scale - A Cautionary Tale
Counters At Scale - A Cautionary Tale
 
slides_nyc_2015
slides_nyc_2015slides_nyc_2015
slides_nyc_2015
 
S3 cassandra or outer space? dumping time series data using spark
S3 cassandra or outer space? dumping time series data using sparkS3 cassandra or outer space? dumping time series data using spark
S3 cassandra or outer space? dumping time series data using spark
 
Dynamically Evolving Systems: Cluster Analysis Using Time
Dynamically Evolving Systems: Cluster Analysis Using TimeDynamically Evolving Systems: Cluster Analysis Using Time
Dynamically Evolving Systems: Cluster Analysis Using Time
 

Viewers also liked

Introduction to column oriented databases
Introduction to column oriented databasesIntroduction to column oriented databases
Introduction to column oriented databases
ArangoDB Database
 

Viewers also liked (20)

Case Study: Elasticsearch Ingest Using StreamSets at Cisco Intercloud
Case Study: Elasticsearch Ingest Using StreamSets at Cisco IntercloudCase Study: Elasticsearch Ingest Using StreamSets at Cisco Intercloud
Case Study: Elasticsearch Ingest Using StreamSets at Cisco Intercloud
 
Logging infrastructure for Microservices using StreamSets Data Collector
Logging infrastructure for Microservices using StreamSets Data CollectorLogging infrastructure for Microservices using StreamSets Data Collector
Logging infrastructure for Microservices using StreamSets Data Collector
 
Building Continuously Curated Ingestion Pipelines
Building Continuously Curated Ingestion PipelinesBuilding Continuously Curated Ingestion Pipelines
Building Continuously Curated Ingestion Pipelines
 
Turning Data into Business Value with a Modern Data Platform
Turning Data into Business Value with a Modern Data PlatformTurning Data into Business Value with a Modern Data Platform
Turning Data into Business Value with a Modern Data Platform
 
Bad Data is Polluting Big Data
Bad Data is Polluting Big DataBad Data is Polluting Big Data
Bad Data is Polluting Big Data
 
Data cleaning and screening
Data cleaning and screeningData cleaning and screening
Data cleaning and screening
 
Building Data Pipelines with Spark and StreamSets
Building Data Pipelines with Spark and StreamSetsBuilding Data Pipelines with Spark and StreamSets
Building Data Pipelines with Spark and StreamSets
 
Case Study: Elasticsearch Ingest Using StreamSets @ Cisco Intercloud
Case Study: Elasticsearch Ingest Using StreamSets @ Cisco IntercloudCase Study: Elasticsearch Ingest Using StreamSets @ Cisco Intercloud
Case Study: Elasticsearch Ingest Using StreamSets @ Cisco Intercloud
 
DataMeet 4: Data cleaning & census data
DataMeet 4: Data cleaning & census dataDataMeet 4: Data cleaning & census data
DataMeet 4: Data cleaning & census data
 
Open Source Big Data Ingestion - Without the Heartburn!
Open Source Big Data Ingestion - Without the Heartburn!Open Source Big Data Ingestion - Without the Heartburn!
Open Source Big Data Ingestion - Without the Heartburn!
 
Theory & Practice of Data Cleaning: Introduction to OpenRefine
Theory & Practice of Data Cleaning: Introduction to OpenRefineTheory & Practice of Data Cleaning: Introduction to OpenRefine
Theory & Practice of Data Cleaning: Introduction to OpenRefine
 
Presentation on Data Cleansing
Presentation on Data CleansingPresentation on Data Cleansing
Presentation on Data Cleansing
 
Introduction to column oriented databases
Introduction to column oriented databasesIntroduction to column oriented databases
Introduction to column oriented databases
 
The Cost of Bad (And Clean) Data
The Cost of Bad (And Clean) DataThe Cost of Bad (And Clean) Data
The Cost of Bad (And Clean) Data
 
Data Cleaning Process
Data Cleaning ProcessData Cleaning Process
Data Cleaning Process
 
Ten canoes
Ten canoesTen canoes
Ten canoes
 
Data cleansing
Data cleansingData cleansing
Data cleansing
 
Streamsets and spark
Streamsets and sparkStreamsets and spark
Streamsets and spark
 
Hbase: Introduction to column oriented databases
Hbase: Introduction to column oriented databasesHbase: Introduction to column oriented databases
Hbase: Introduction to column oriented databases
 
Data Cleaning Techniques
Data Cleaning TechniquesData Cleaning Techniques
Data Cleaning Techniques
 

Similar to Adaptive Data Cleansing with StreamSets and Cassandra (Pat Patterson, StreamSets) | C* Summit 2016

Evan Schultz - Angular Summit - 2016
Evan Schultz - Angular Summit - 2016Evan Schultz - Angular Summit - 2016
Evan Schultz - Angular Summit - 2016
Evan Schultz
 
Average- An android project
Average- An android projectAverage- An android project
Average- An android project
Ipsit Dash
 
From Startup to Mature Company: PostgreSQL Tips and techniques
From Startup to Mature Company:  PostgreSQL Tips and techniquesFrom Startup to Mature Company:  PostgreSQL Tips and techniques
From Startup to Mature Company: PostgreSQL Tips and techniques
John Ashmead
 
Ft10 de smet
Ft10 de smetFt10 de smet
Ft10 de smet
nkaluva
 

Similar to Adaptive Data Cleansing with StreamSets and Cassandra (Pat Patterson, StreamSets) | C* Summit 2016 (20)

Tues 115pm cassandra + s3 + hadoop = quick auditing and analytics_yazovskiy
Tues 115pm cassandra + s3 + hadoop = quick auditing and analytics_yazovskiyTues 115pm cassandra + s3 + hadoop = quick auditing and analytics_yazovskiy
Tues 115pm cassandra + s3 + hadoop = quick auditing and analytics_yazovskiy
 
Analytics with Spark
Analytics with SparkAnalytics with Spark
Analytics with Spark
 
Samantha Wang [InfluxData] | Best Practices on How to Transform Your Data Usi...
Samantha Wang [InfluxData] | Best Practices on How to Transform Your Data Usi...Samantha Wang [InfluxData] | Best Practices on How to Transform Your Data Usi...
Samantha Wang [InfluxData] | Best Practices on How to Transform Your Data Usi...
 
A Practical Approach to Building a Streaming Processing Pipeline for an Onlin...
A Practical Approach to Building a Streaming Processing Pipeline for an Onlin...A Practical Approach to Building a Streaming Processing Pipeline for an Onlin...
A Practical Approach to Building a Streaming Processing Pipeline for an Onlin...
 
Evan Schultz - Angular Summit - 2016
Evan Schultz - Angular Summit - 2016Evan Schultz - Angular Summit - 2016
Evan Schultz - Angular Summit - 2016
 
COCOA: Communication-Efficient Coordinate Ascent
COCOA: Communication-Efficient Coordinate AscentCOCOA: Communication-Efficient Coordinate Ascent
COCOA: Communication-Efficient Coordinate Ascent
 
Jdbc
JdbcJdbc
Jdbc
 
Codepot - Pig i Hive: szybkie wprowadzenie / Pig and Hive crash course
Codepot - Pig i Hive: szybkie wprowadzenie / Pig and Hive crash courseCodepot - Pig i Hive: szybkie wprowadzenie / Pig and Hive crash course
Codepot - Pig i Hive: szybkie wprowadzenie / Pig and Hive crash course
 
Android wearpp
Android wearppAndroid wearpp
Android wearpp
 
Mcrl2 by kashif.namal@gmail.com, adnanskyousafzai@gmail.com
Mcrl2 by kashif.namal@gmail.com, adnanskyousafzai@gmail.comMcrl2 by kashif.namal@gmail.com, adnanskyousafzai@gmail.com
Mcrl2 by kashif.namal@gmail.com, adnanskyousafzai@gmail.com
 
Average- An android project
Average- An android projectAverage- An android project
Average- An android project
 
Introduction to Redux
Introduction to ReduxIntroduction to Redux
Introduction to Redux
 
MCRL2
MCRL2MCRL2
MCRL2
 
From Startup to Mature Company: PostgreSQL Tips and techniques
From Startup to Mature Company:  PostgreSQL Tips and techniquesFrom Startup to Mature Company:  PostgreSQL Tips and techniques
From Startup to Mature Company: PostgreSQL Tips and techniques
 
Ft10 de smet
Ft10 de smetFt10 de smet
Ft10 de smet
 
WattGo: Analyses temps-réél de series temporelles avec Spark et Solr (Français)
WattGo: Analyses temps-réél de series temporelles avec Spark et Solr (Français)WattGo: Analyses temps-réél de series temporelles avec Spark et Solr (Français)
WattGo: Analyses temps-réél de series temporelles avec Spark et Solr (Français)
 
Hadoop Integration in Cassandra
Hadoop Integration in CassandraHadoop Integration in Cassandra
Hadoop Integration in Cassandra
 
Airbnb Search Architecture: Presented by Maxim Charkov, Airbnb
Airbnb Search Architecture: Presented by Maxim Charkov, AirbnbAirbnb Search Architecture: Presented by Maxim Charkov, Airbnb
Airbnb Search Architecture: Presented by Maxim Charkov, Airbnb
 
Accelerating analytics on the Sensor and IoT Data.
Accelerating analytics on the Sensor and IoT Data. Accelerating analytics on the Sensor and IoT Data.
Accelerating analytics on the Sensor and IoT Data.
 
Avoiding the Pit of Despair - Event Sourcing with Akka and Cassandra
Avoiding the Pit of Despair - Event Sourcing with Akka and CassandraAvoiding the Pit of Despair - Event Sourcing with Akka and Cassandra
Avoiding the Pit of Despair - Event Sourcing with Akka and Cassandra
 

More from DataStax

More from DataStax (20)

Is Your Enterprise Ready to Shine This Holiday Season?
Is Your Enterprise Ready to Shine This Holiday Season?Is Your Enterprise Ready to Shine This Holiday Season?
Is Your Enterprise Ready to Shine This Holiday Season?
 
Designing Fault-Tolerant Applications with DataStax Enterprise and Apache Cas...
Designing Fault-Tolerant Applications with DataStax Enterprise and Apache Cas...Designing Fault-Tolerant Applications with DataStax Enterprise and Apache Cas...
Designing Fault-Tolerant Applications with DataStax Enterprise and Apache Cas...
 
Running DataStax Enterprise in VMware Cloud and Hybrid Environments
Running DataStax Enterprise in VMware Cloud and Hybrid EnvironmentsRunning DataStax Enterprise in VMware Cloud and Hybrid Environments
Running DataStax Enterprise in VMware Cloud and Hybrid Environments
 
Best Practices for Getting to Production with DataStax Enterprise Graph
Best Practices for Getting to Production with DataStax Enterprise GraphBest Practices for Getting to Production with DataStax Enterprise Graph
Best Practices for Getting to Production with DataStax Enterprise Graph
 
Webinar | Data Management for Hybrid and Multi-Cloud: A Four-Step Journey
Webinar | Data Management for Hybrid and Multi-Cloud: A Four-Step JourneyWebinar | Data Management for Hybrid and Multi-Cloud: A Four-Step Journey
Webinar | Data Management for Hybrid and Multi-Cloud: A Four-Step Journey
 
Webinar | How to Understand Apache Cassandra™ Performance Through Read/Writ...
Webinar  |  How to Understand Apache Cassandra™ Performance Through Read/Writ...Webinar  |  How to Understand Apache Cassandra™ Performance Through Read/Writ...
Webinar | How to Understand Apache Cassandra™ Performance Through Read/Writ...
 
Webinar | Better Together: Apache Cassandra and Apache Kafka
Webinar  |  Better Together: Apache Cassandra and Apache KafkaWebinar  |  Better Together: Apache Cassandra and Apache Kafka
Webinar | Better Together: Apache Cassandra and Apache Kafka
 
Top 10 Best Practices for Apache Cassandra and DataStax Enterprise
Top 10 Best Practices for Apache Cassandra and DataStax EnterpriseTop 10 Best Practices for Apache Cassandra and DataStax Enterprise
Top 10 Best Practices for Apache Cassandra and DataStax Enterprise
 
Introduction to Apache Cassandra™ + What’s New in 4.0
Introduction to Apache Cassandra™ + What’s New in 4.0Introduction to Apache Cassandra™ + What’s New in 4.0
Introduction to Apache Cassandra™ + What’s New in 4.0
 
Webinar: How Active Everywhere Database Architecture Accelerates Hybrid Cloud...
Webinar: How Active Everywhere Database Architecture Accelerates Hybrid Cloud...Webinar: How Active Everywhere Database Architecture Accelerates Hybrid Cloud...
Webinar: How Active Everywhere Database Architecture Accelerates Hybrid Cloud...
 
Webinar | Aligning GDPR Requirements with Today's Hybrid Cloud Realities
Webinar  |  Aligning GDPR Requirements with Today's Hybrid Cloud RealitiesWebinar  |  Aligning GDPR Requirements with Today's Hybrid Cloud Realities
Webinar | Aligning GDPR Requirements with Today's Hybrid Cloud Realities
 
Designing a Distributed Cloud Database for Dummies
Designing a Distributed Cloud Database for DummiesDesigning a Distributed Cloud Database for Dummies
Designing a Distributed Cloud Database for Dummies
 
How to Power Innovation with Geo-Distributed Data Management in Hybrid Cloud
How to Power Innovation with Geo-Distributed Data Management in Hybrid CloudHow to Power Innovation with Geo-Distributed Data Management in Hybrid Cloud
How to Power Innovation with Geo-Distributed Data Management in Hybrid Cloud
 
How to Evaluate Cloud Databases for eCommerce
How to Evaluate Cloud Databases for eCommerceHow to Evaluate Cloud Databases for eCommerce
How to Evaluate Cloud Databases for eCommerce
 
Webinar: DataStax Enterprise 6: 10 Ways to Multiply the Power of Apache Cassa...
Webinar: DataStax Enterprise 6: 10 Ways to Multiply the Power of Apache Cassa...Webinar: DataStax Enterprise 6: 10 Ways to Multiply the Power of Apache Cassa...
Webinar: DataStax Enterprise 6: 10 Ways to Multiply the Power of Apache Cassa...
 
Webinar: DataStax and Microsoft Azure: Empowering the Right-Now Enterprise wi...
Webinar: DataStax and Microsoft Azure: Empowering the Right-Now Enterprise wi...Webinar: DataStax and Microsoft Azure: Empowering the Right-Now Enterprise wi...
Webinar: DataStax and Microsoft Azure: Empowering the Right-Now Enterprise wi...
 
Webinar - Real-Time Customer Experience for the Right-Now Enterprise featurin...
Webinar - Real-Time Customer Experience for the Right-Now Enterprise featurin...Webinar - Real-Time Customer Experience for the Right-Now Enterprise featurin...
Webinar - Real-Time Customer Experience for the Right-Now Enterprise featurin...
 
Datastax - The Architect's guide to customer experience (CX)
Datastax - The Architect's guide to customer experience (CX)Datastax - The Architect's guide to customer experience (CX)
Datastax - The Architect's guide to customer experience (CX)
 
An Operational Data Layer is Critical for Transformative Banking Applications
An Operational Data Layer is Critical for Transformative Banking ApplicationsAn Operational Data Layer is Critical for Transformative Banking Applications
An Operational Data Layer is Critical for Transformative Banking Applications
 
Becoming a Customer-Centric Enterprise Via Real-Time Data and Design Thinking
Becoming a Customer-Centric Enterprise Via Real-Time Data and Design ThinkingBecoming a Customer-Centric Enterprise Via Real-Time Data and Design Thinking
Becoming a Customer-Centric Enterprise Via Real-Time Data and Design Thinking
 

Recently uploaded

JustNaik Solution Deck (stage bus sector)
JustNaik Solution Deck (stage bus sector)JustNaik Solution Deck (stage bus sector)
JustNaik Solution Deck (stage bus sector)
Max Lee
 

Recently uploaded (20)

What need to be mastered as AI-Powered Java Developers
What need to be mastered as AI-Powered Java DevelopersWhat need to be mastered as AI-Powered Java Developers
What need to be mastered as AI-Powered Java Developers
 
AI/ML Infra Meetup | Reducing Prefill for LLM Serving in RAG
AI/ML Infra Meetup | Reducing Prefill for LLM Serving in RAGAI/ML Infra Meetup | Reducing Prefill for LLM Serving in RAG
AI/ML Infra Meetup | Reducing Prefill for LLM Serving in RAG
 
Agnieszka Andrzejewska - BIM School Course in Kraków
Agnieszka Andrzejewska - BIM School Course in KrakówAgnieszka Andrzejewska - BIM School Course in Kraków
Agnieszka Andrzejewska - BIM School Course in Kraków
 
Crafting the Perfect Measurement Sheet with PLM Integration
Crafting the Perfect Measurement Sheet with PLM IntegrationCrafting the Perfect Measurement Sheet with PLM Integration
Crafting the Perfect Measurement Sheet with PLM Integration
 
Microsoft 365 Copilot; An AI tool changing the world of work _PDF.pdf
Microsoft 365 Copilot; An AI tool changing the world of work _PDF.pdfMicrosoft 365 Copilot; An AI tool changing the world of work _PDF.pdf
Microsoft 365 Copilot; An AI tool changing the world of work _PDF.pdf
 
Entropy, Software Quality, and Innovation (presented at Princeton Plasma Phys...
Entropy, Software Quality, and Innovation (presented at Princeton Plasma Phys...Entropy, Software Quality, and Innovation (presented at Princeton Plasma Phys...
Entropy, Software Quality, and Innovation (presented at Princeton Plasma Phys...
 
Facemoji Keyboard released its 2023 State of Emoji report, outlining the most...
Facemoji Keyboard released its 2023 State of Emoji report, outlining the most...Facemoji Keyboard released its 2023 State of Emoji report, outlining the most...
Facemoji Keyboard released its 2023 State of Emoji report, outlining the most...
 
AI Hackathon.pptx
AI                        Hackathon.pptxAI                        Hackathon.pptx
AI Hackathon.pptx
 
How to pick right visual testing tool.pdf
How to pick right visual testing tool.pdfHow to pick right visual testing tool.pdf
How to pick right visual testing tool.pdf
 
OpenChain @ LF Japan Executive Briefing - May 2024
OpenChain @ LF Japan Executive Briefing - May 2024OpenChain @ LF Japan Executive Briefing - May 2024
OpenChain @ LF Japan Executive Briefing - May 2024
 
10 Essential Software Testing Tools You Need to Know About.pdf
10 Essential Software Testing Tools You Need to Know About.pdf10 Essential Software Testing Tools You Need to Know About.pdf
10 Essential Software Testing Tools You Need to Know About.pdf
 
INGKA DIGITAL: Linked Metadata by Design
INGKA DIGITAL: Linked Metadata by DesignINGKA DIGITAL: Linked Metadata by Design
INGKA DIGITAL: Linked Metadata by Design
 
5 Reasons Driving Warehouse Management Systems Demand
5 Reasons Driving Warehouse Management Systems Demand5 Reasons Driving Warehouse Management Systems Demand
5 Reasons Driving Warehouse Management Systems Demand
 
How to install and activate eGrabber JobGrabber
How to install and activate eGrabber JobGrabberHow to install and activate eGrabber JobGrabber
How to install and activate eGrabber JobGrabber
 
StrimziCon 2024 - Transition to Apache Kafka on Kubernetes with Strimzi.pdf
StrimziCon 2024 - Transition to Apache Kafka on Kubernetes with Strimzi.pdfStrimziCon 2024 - Transition to Apache Kafka on Kubernetes with Strimzi.pdf
StrimziCon 2024 - Transition to Apache Kafka on Kubernetes with Strimzi.pdf
 
Top Mobile App Development Companies 2024
Top Mobile App Development Companies 2024Top Mobile App Development Companies 2024
Top Mobile App Development Companies 2024
 
Implementing KPIs and Right Metrics for Agile Delivery Teams.pdf
Implementing KPIs and Right Metrics for Agile Delivery Teams.pdfImplementing KPIs and Right Metrics for Agile Delivery Teams.pdf
Implementing KPIs and Right Metrics for Agile Delivery Teams.pdf
 
how-to-download-files-safely-from-the-internet.pdf
how-to-download-files-safely-from-the-internet.pdfhow-to-download-files-safely-from-the-internet.pdf
how-to-download-files-safely-from-the-internet.pdf
 
JustNaik Solution Deck (stage bus sector)
JustNaik Solution Deck (stage bus sector)JustNaik Solution Deck (stage bus sector)
JustNaik Solution Deck (stage bus sector)
 
APVP,apvp apvp High quality supplier safe spot transport, 98% purity
APVP,apvp apvp High quality supplier safe spot transport, 98% purityAPVP,apvp apvp High quality supplier safe spot transport, 98% purity
APVP,apvp apvp High quality supplier safe spot transport, 98% purity
 

Adaptive Data Cleansing with StreamSets and Cassandra (Pat Patterson, StreamSets) | C* Summit 2016

  • 1. Adaptive Data Cleansing with StreamSets and Cassandra Pat Patterson Community Champion @metadaddy pat@superpat.com
  • 2. About Pat Pat Patterson Community Champion @ StreamSets Formerly Developer Evangelist at Salesforce Contact pat@streamsets.com @metadaddy
  • 3. Feeding Cassandra with StreamSets Data Collector User defined aggregate functions in Cassandra Push statistics back into the data pipeline Resources Q & A Agenda
  • 4. Devices sending sensor readings to RabbitMQ via MQTT Sensor id, time, temperature, orientation Convert orientation from integer to string 0 -> “horizontal”, 1 -> “vertical” Filter outlier values Write ‘clean’ readings to Cassandra Use Case: IoT Sensor Data
  • 5. StreamSets Data Collector http://api Open source continuous big data ingest infrastructure ◦ Off- or near-cluster ◦ Operating on data in-motion ◦ One-time processing, scales linearly ◦ Direct control over data integrity
  • 6. Ingest from RabbitMQ to Cassandra
  • 7. Could easily filter on some static boundary But that will only catch ‘obvious’ problems Can we find outlier values in a more dynamic, flexible way? Let’s define an outlier as any temperature value more than 4 standard deviations from the past hour’s mean temperature Cleaning Up The Data
  • 8. Standard aggregate functions: min, max, avg, sum, count SELECT avg(temperature), count(temperature) FROM readings WHERE sensor_id = 1 AND time > '2016-08-17 18:11:00+0000' Define custom aggregates as online algorithms in terms of two Java/JavaScript functions: state function and final function State function takes old state and value, returns new state Tuple stateFn(Tuple state, double x) Final function takes final state, returns aggregate value double finalFn(Tuple state) Cassandra User Defined Aggregates
  • 9. def online_variance(data): n = 0 mean = 0.0 M2 = 0.0 for x in data: n += 1 delta = x – mean mean += delta/n M2 += delta*(x - mean) if n < 2: return float('nan') else: return M2 / (n - 1) Online Standard Deviation - Welford’s Method
  • 10. cqlsh:mykeyspace> CREATE OR REPLACE FUNCTION sdState ( state tuple<int,double,double>, val double ) CALLED ON NULL INPUT RETURNS tuple<int,double,double> LANGUAGE java AS 'int n = state.getInt(0); double mean = state.getDouble(1); double m2 = state.getDouble(2); n++; double delta = val - mean; mean += delta / n; m2 += delta * (val - mean); state.setInt(0, n); state.setDouble(1, mean); state.setDouble(2, m2); return state;'; cqlsh:mykeyspace> CREATE OR REPLACE FUNCTION sdFinal ( state tuple<int,double,double> ) CALLED ON NULL INPUT RETURNS double LANGUAGE java AS 'int n = state.getInt(0); double m2 = state.getDouble(2); if (n < 1) { return null; } return Math.sqrt(m2 / (n - 1));'; cqlsh:mykeyspace> CREATE AGGREGATE IF NOT EXISTS stdev ( double ) ... SFUNC sdState STYPE tuple<int,double,double> FINALFUNC sdFinal INITCOND (0,0,0); Define Standard Deviation as a UDA
  • 11. Java app periodically queries Cassandra for mean, standard deviation for the past hour’s data, writes to resource files PreparedStatement statement = session.prepare( "SELECT AVG(temperature), STDEV(temperature) " + "FROM sensor_readings " + "WHERE sensor_id = ? AND TIME > ?"); BoundStatement boundStatement = new BoundStatement(statement); ... long startMillis = System.currentTimeMillis() - timeRangeMillis; ResultSet results = session.execute( boundStatement.bind(sensorId, new Date(startMillis))); double avg = row.getDouble("system.avg(temperature)"), sd = row.getDouble("mykeyspace.stdev(temperature)"); Feeding Statistics Back Into the Pipeline
  • 12. Putting It All Together
  • 13. Ingesting MQTT Traffic into Riak TS via RabbitMQ and StreamSets http://bit.ly/ingest-mqtt Ingesting Sensor Data on the Raspberry Pi with StreamSets Data Collector http://bit.ly/ingest-sensors Standard Deviations on Cassandra – Rolling Your Own Aggregate Function http://bit.ly/cassandra-uda Dynamic Outlier Detection with StreamSets and Cassandra http://bit.ly/dynamic-outliers Resources