SlideShare a Scribd company logo
1 of 32
Download to read offline
Ivo Everts, GoDataDriven (Amsterdam, NL)
Predictive Maintenance
at the Dutch Railways
Air leakage detection in train breaking pipes
#DSSAIS16
#DSSAIS16 !2
Tünde Alkemade
Ivo Everts
Jelte Hoekstra
Giovanni Lanzani
Janelle Zoutkamp
Stefan Hendriks
Wouter Hordijk
Inge Kalsbeek
Wan-Jui Lee
Inka Locht
Kee Man
Nick Oosterhof
Margot Peters
Cyriana Roelofs
Mattijs Suurland
#DSSAIS16
Published paper in June 2017
'Contextual Air Leakage Detection in Train Braking Pipes'
@ 'Advances in Artificial Intelligence: From Theory to Practice'
!2
Tünde Alkemade
Ivo Everts
Jelte Hoekstra
Giovanni Lanzani
Janelle Zoutkamp
Stefan Hendriks
Wouter Hordijk
Inge Kalsbeek
Wan-Jui Lee
Inka Locht
Kee Man
Nick Oosterhof
Margot Peters
Cyriana Roelofs
Mattijs Suurland
#DSSAIS16
Implemented the paper using PySpark; teamwork for data delivery, proper productionizing, way of working
Published paper in June 2017
'Contextual Air Leakage Detection in Train Braking Pipes'
@ 'Advances in Artificial Intelligence: From Theory to Practice'
!2
Tünde Alkemade
Ivo Everts
Jelte Hoekstra
Giovanni Lanzani
Janelle Zoutkamp
Stefan Hendriks
Wouter Hordijk
Inge Kalsbeek
Wan-Jui Lee
Inka Locht
Kee Man
Nick Oosterhof
Margot Peters
Cyriana Roelofs
Mattijs Suurland
#DSSAIS16
Consulting data scientist @ GoDataDriven
MSc in AI; PhD in Computer Vision
Did lot of scientific programming
Now mostly Python, Hadoop, Spark, Keras
Implemented the paper using PySpark; teamwork for data delivery, proper productionizing, way of working
Published paper in June 2017
'Contextual Air Leakage Detection in Train Braking Pipes'
@ 'Advances in Artificial Intelligence: From Theory to Practice'
!2
Tünde Alkemade
Ivo Everts
Jelte Hoekstra
Giovanni Lanzani
Janelle Zoutkamp
Stefan Hendriks
Wouter Hordijk
Inge Kalsbeek
Wan-Jui Lee
Inka Locht
Kee Man
Nick Oosterhof
Margot Peters
Cyriana Roelofs
Mattijs Suurland
#DSSAIS16
Big Data @ NS (Dutch Railways)
!3
SLT
DDZ
VIRM
FLIRT
SNG
ICNG
Pantomonitoring
Gotcha/Quovadis
Data
conversie
RTM
operations
MBN
Maximo
Werkorder
OB/SB
Big Data
RTM
analytics
MonteurWerkvoorbereider
4M
Helpdesk
7503
4M
7503
TRAXX
#DSSAIS16
Outline
• Data ingestion
• Air leakage detection
• Other use cases
!4
#DSSAIS16
Data ingestion
Collecting data for Real Time Monitoring
!5
SLT
DDZ
VIRM
FLIRT
SNG
ICNG
Pantomonitoring
Gotcha/Quovadis
Data
conversie
RTM
operations
MBN
Maximo
Werkorder
OB/SB
Big Data
RTM
analytics
MonteurWerkvoorbereider
4M
Helpdesk
7503
4M
7503
TRAXX
#DSSAIS16
Data ingestion
Real time data Historic data
!6
~20GB per day; 22B records 22B records
#DSSAIS16
Data ingestion
Real time data Historic data
!6
~20GB per day; 22B records 22B records
#DSSAIS16
Data ingestion
Real time data Historic data
!6
~20GB per day; 22B records 22B records
#DSSAIS16
Data ingestion
Real time data Historic data
!6
# fetch args
table_name = sys.argv[1]
origin = sys.argv[2]
# read xml string from stdin
xml_string = ''.join(sys.stdin.readlines())
# parse 'm and print to stdout
data = xml2table(xml_string, table_name, origin, start_dt, end_dt)
print(data.to_csv(encoding='utf-8', header=False, index=False,
~20GB per day; 22B records 22B records
#DSSAIS16
Data ingestion
Real time data Historic data
!6
# fetch args
table_name = sys.argv[1]
origin = sys.argv[2]
# read xml string from stdin
xml_string = ''.join(sys.stdin.readlines())
# parse 'm and print to stdout
data = xml2table(xml_string, table_name, origin, start_dt, end_dt)
print(data.to_csv(encoding='utf-8', header=False, index=False,
~20GB per day; 22B records 22B records
#DSSAIS16
Data ingestion
Real time data Historic data
!6
# fetch args
table_name = sys.argv[1]
origin = sys.argv[2]
# read xml string from stdin
xml_string = ''.join(sys.stdin.readlines())
# parse 'm and print to stdout
data = xml2table(xml_string, table_name, origin, start_dt, end_dt)
print(data.to_csv(encoding='utf-8', header=False, index=False,
~20GB per day; 22B records 22B records
def get_history_df(hdfs_binary_files, output_table_name):
rdd = (hdfs_binary_files
.map(unzip_xml)
.flatMap(lambda x: xml2rows(x, output_table_name)))
return None if rdd.isEmpty() else rdd.toDF()
...
...
# read binary files and parse 'm
hdfs_binary_files = (sc.binaryFiles('{}*'.format(file_group))
.persist(StorageLevel.MEMORY_AND_DISK_SER)
.repartition(num_executors))
sdf = get_history_df(hdfs_binary_files, args.output_table_name)
#DSSAIS16
Data ingestion
Real time data Historic data
!6
# fetch args
table_name = sys.argv[1]
origin = sys.argv[2]
# read xml string from stdin
xml_string = ''.join(sys.stdin.readlines())
# parse 'm and print to stdout
data = xml2table(xml_string, table_name, origin, start_dt, end_dt)
print(data.to_csv(encoding='utf-8', header=False, index=False,
>> Better suited for large datasets
>> Easier to (re)partition and append
>> Preference for a unified approach
~20GB per day; 22B records 22B records
def get_history_df(hdfs_binary_files, output_table_name):
rdd = (hdfs_binary_files
.map(unzip_xml)
.flatMap(lambda x: xml2rows(x, output_table_name)))
return None if rdd.isEmpty() else rdd.toDF()
...
...
# read binary files and parse 'm
hdfs_binary_files = (sc.binaryFiles('{}*'.format(file_group))
.persist(StorageLevel.MEMORY_AND_DISK_SER)
.repartition(num_executors))
sdf = get_history_df(hdfs_binary_files, args.output_table_name)
#DSSAIS16
Air leakage detection
One of the cases using RTM data
!7
SLT
DDZ
VIRM
FLIRT
SNG
ICNG
Pantomonitoring
Gotcha/Quovadis
Data
conversie
RTM
operations
MBN
Maximo
Werkorder
OB/SB
Big Data
RTM
analytics
MonteurWerkvoorbereider
4M
Helpdesk
7503
4M
7503
TRAXX
#DSSAIS16
Air leakage detection
!8
Data
compressor logs
Detection
classification
Diagnosis
clustering
Prognosis
regression
Operation
humans
#DSSAIS16
Air leakage detection
!9
Data
compressor logs
Detection
classification
Diagnosis
clustering
Prognosis
regression
Operation
humans
We extract median compressor run- and idle- times per hour
#DSSAIS16
Air leakage detection
!10
Data
compressor logs
Detection
classification
Diagnosis
clustering
Prognosis
regression
Operation
humans
#DSSAIS16
Air leakage detection Data
compressor logs
Detection
classification
Diagnosis
clustering
Prognosis
regression
Operation
humans
One model per train =>
compressor durations collected with a groupBy()
and modelled in a UDF using scikit-learn
to obtain the per-class probabilities:
!11
#DSSAIS16
Air leakage detection Data
compressor logs
Detection
classification
Diagnosis
clustering
Prognosis
regression
Operation
humans
One model per train =>
compressor durations collected with a groupBy()
and modelled in a UDF using scikit-learn
to obtain the per-class probabilities:
Anomalies
!11
#DSSAIS16
Air leakage detection Data
compressor logs
Detection
classification
Diagnosis
clustering
Prognosis
regression
Operation
humans
# group data per train and collect data LOL
train_data = (df.groupBy('rolling_stock_number')
.agg(F.collect_list('median_duration').alias('median_duration'),
F.collect_list('label').alias('label')))
# define UDF and fit model
logistic_regression = F.udf(lambda X, y: sklearn_wrapper(X, y), ArrayType(FloatType()))
model_data = (train_data.withColumn('logistic_model_parameters',
logistic_regression(train_data.median_duration,
train_data.label)))
One model per train =>
compressor durations collected with a groupBy()
and modelled in a UDF using scikit-learn
to obtain the per-class probabilities:
Anomalies
!11
#DSSAIS16
Air leakage detection Data
compressor logs
Detection
classification
Diagnosis
clustering
Prognosis
regression
Operation
humans
!12
#DSSAIS16
Air leakage detection Data
compressor logs
Detection
classification
Diagnosis
clustering
Prognosis
regression
Operation
humans
Air leakage is defined as a cluster of anomalies, robust against outliers.
For this we use a 'dynamic' variation of dbscan clustering, in a UDF,
where the minimal cluster size is dependent on the logistic function:
!13
#DSSAIS16
Air leakage detection Data
compressor logs
Detection
classification
Diagnosis
clustering
Prognosis
regression
Operation
humans
Air leakage is defined as a cluster of anomalies, robust against outliers.
For this we use a 'dynamic' variation of dbscan clustering, in a UDF,
where the minimal cluster size is dependent on the logistic function:
!13
#DSSAIS16
Air leakage detection Data
compressor logs
Detection
classification
Diagnosis
clustering
Prognosis
regression
Operation
humans
!14
#DSSAIS16
Air leakage detection Data
compressor logs
Detection
classification
Diagnosis
clustering
Prognosis
regression
Operation
humans
!15
#DSSAIS16
Air leakage detection Data
compressor logs
Detection
classification
Diagnosis
clustering
Prognosis
regression
Operation
humans
# import relevant classes from sparkml
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression
...
# prepare and run linear regression
assembler = VectorAssembler(inputCols=x_cols, outputCol='features')
input = assembler.transform(train_data).select(['label', 'features'])
output = LinearRegression().setSolver('l-bfgs').fit(input)
In the current implementation
we train one regressor for all trains,
so a pure Spark implementation is possible
!16
#DSSAIS16
Air leakage detection Data
compressor logs
Detection
classification
Diagnosis
clustering
Prognosis
regression
Operation
humans
Oh no, humans in the loop!
>> Prognosis is difficult to explain
and does not perform well enough
>> Diagnosis does perform pretty good
(~85% true positives; ~15% solved on time)
>> Dedicated staff for bridging between
algorithm and organisation
>> We are observing recent air leakages
that would have been prevented,
so we're getting there
SLT
DDZ
VIRM
FLIRT
SNG
ICNG
Pantomonitoring
Gotcha/Quovadis
Data
conversie
RTM
operations
MBN
Maximo
Werkorder
OB/SB
Big Data
RTM
analytics
MonteurWerkvoorbereider
4M
Helpdesk
7503
4M
7503
TRAXX
!17
#DSSAIS16
Air leakage detection Data
compressor logs
Detection
classification
Diagnosis
clustering
Prognosis
regression
Operation
humans
Oh no, humans in the loop!
>> Prognosis is difficult to explain
and does not perform well enough
>> Diagnosis does perform pretty good
(~85% true positives; ~15% solved on time)
>> Dedicated staff for bridging between
algorithm and organisation
>> We are observing recent air leakages
that would have been prevented,
so we're getting there
SLT
DDZ
VIRM
FLIRT
SNG
ICNG
Pantomonitoring
Gotcha/Quovadis
Data
conversie
RTM
operations
MBN
Maximo
Werkorder
OB/SB
Big Data
RTM
analytics
MonteurWerkvoorbereider
4M
Helpdesk
7503
4M
7503
TRAXX
!17
#DSSAIS16
More use cases
>> Hot wheels: failing axle bearings
>> Crowd control: passengers distribution
>> Slippery tracks: traction detection
>> Water usage: when you got to go you got to go
>> ...and many more to come
!18
Ivo Everts | ivoeverts@godatadriven.com | @ivoeverts
Predictive Maintenance
at the Dutch Railways
Air leakage detection in train breaking pipes
#DSSAIS16

More Related Content

What's hot

Introducing Cloudera DataFlow (CDF) 2.13.19
Introducing Cloudera DataFlow (CDF) 2.13.19Introducing Cloudera DataFlow (CDF) 2.13.19
Introducing Cloudera DataFlow (CDF) 2.13.19Cloudera, Inc.
 
Introduction to Data Stream Processing
Introduction to Data Stream ProcessingIntroduction to Data Stream Processing
Introduction to Data Stream ProcessingSafe Software
 
Introduction to Big Data
Introduction to Big DataIntroduction to Big Data
Introduction to Big DataJoey Li
 
Analytics in a Day Virtual Workshop
Analytics in a Day Virtual WorkshopAnalytics in a Day Virtual Workshop
Analytics in a Day Virtual WorkshopCCG
 
Databricks Delta Lake and Its Benefits
Databricks Delta Lake and Its BenefitsDatabricks Delta Lake and Its Benefits
Databricks Delta Lake and Its BenefitsDatabricks
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDatabricks
 
ETL VS ELT.pdf
ETL VS ELT.pdfETL VS ELT.pdf
ETL VS ELT.pdfBOSupport
 
Building the Modern Data Hub
Building the Modern Data HubBuilding the Modern Data Hub
Building the Modern Data HubDatavail
 
Data Visualization Techniques
Data Visualization TechniquesData Visualization Techniques
Data Visualization TechniquesAllAnalytics
 
Big data architectures and the data lake
Big data architectures and the data lakeBig data architectures and the data lake
Big data architectures and the data lakeJames Serra
 
2022 Trends in Enterprise Analytics
2022 Trends in Enterprise Analytics2022 Trends in Enterprise Analytics
2022 Trends in Enterprise AnalyticsDATAVERSITY
 
Introduction to Stream Processing
Introduction to Stream ProcessingIntroduction to Stream Processing
Introduction to Stream ProcessingGuido Schmutz
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache SparkDatio Big Data
 
Gelly-Stream: Single-Pass Graph Streaming Analytics with Apache Flink
Gelly-Stream: Single-Pass Graph Streaming Analytics with Apache FlinkGelly-Stream: Single-Pass Graph Streaming Analytics with Apache Flink
Gelly-Stream: Single-Pass Graph Streaming Analytics with Apache FlinkVasia Kalavri
 
Big Data Architectural Patterns and Best Practices on AWS
Big Data Architectural Patterns and Best Practices on AWSBig Data Architectural Patterns and Best Practices on AWS
Big Data Architectural Patterns and Best Practices on AWSAmazon Web Services
 
The Data Science Process
The Data Science ProcessThe Data Science Process
The Data Science ProcessVishal Patel
 

What's hot (20)

Introducing Cloudera DataFlow (CDF) 2.13.19
Introducing Cloudera DataFlow (CDF) 2.13.19Introducing Cloudera DataFlow (CDF) 2.13.19
Introducing Cloudera DataFlow (CDF) 2.13.19
 
Introduction to Data Stream Processing
Introduction to Data Stream ProcessingIntroduction to Data Stream Processing
Introduction to Data Stream Processing
 
Introduction to Big Data
Introduction to Big DataIntroduction to Big Data
Introduction to Big Data
 
Analytics in a Day Virtual Workshop
Analytics in a Day Virtual WorkshopAnalytics in a Day Virtual Workshop
Analytics in a Day Virtual Workshop
 
Databricks Delta Lake and Its Benefits
Databricks Delta Lake and Its BenefitsDatabricks Delta Lake and Its Benefits
Databricks Delta Lake and Its Benefits
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
 
ETL VS ELT.pdf
ETL VS ELT.pdfETL VS ELT.pdf
ETL VS ELT.pdf
 
Building the Modern Data Hub
Building the Modern Data HubBuilding the Modern Data Hub
Building the Modern Data Hub
 
Data Visualization Techniques
Data Visualization TechniquesData Visualization Techniques
Data Visualization Techniques
 
Rapid miner
Rapid minerRapid miner
Rapid miner
 
FLiP Into Trino
FLiP Into TrinoFLiP Into Trino
FLiP Into Trino
 
Big data architectures and the data lake
Big data architectures and the data lakeBig data architectures and the data lake
Big data architectures and the data lake
 
2022 Trends in Enterprise Analytics
2022 Trends in Enterprise Analytics2022 Trends in Enterprise Analytics
2022 Trends in Enterprise Analytics
 
Introduction to Stream Processing
Introduction to Stream ProcessingIntroduction to Stream Processing
Introduction to Stream Processing
 
Spark graphx
Spark graphxSpark graphx
Spark graphx
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 
Gelly-Stream: Single-Pass Graph Streaming Analytics with Apache Flink
Gelly-Stream: Single-Pass Graph Streaming Analytics with Apache FlinkGelly-Stream: Single-Pass Graph Streaming Analytics with Apache Flink
Gelly-Stream: Single-Pass Graph Streaming Analytics with Apache Flink
 
Big Data Architectural Patterns and Best Practices on AWS
Big Data Architectural Patterns and Best Practices on AWSBig Data Architectural Patterns and Best Practices on AWS
Big Data Architectural Patterns and Best Practices on AWS
 
Data Cleaning Techniques
Data Cleaning TechniquesData Cleaning Techniques
Data Cleaning Techniques
 
The Data Science Process
The Data Science ProcessThe Data Science Process
The Data Science Process
 

Similar to Predictive Maintenance at the Dutch Railways with Ivo Everts

Mobility insights at Swisscom - Understanding collective mobility in Switzerland
Mobility insights at Swisscom - Understanding collective mobility in SwitzerlandMobility insights at Swisscom - Understanding collective mobility in Switzerland
Mobility insights at Swisscom - Understanding collective mobility in SwitzerlandFrançois Garillot
 
Spark Summit EU talk by Francois Garillot and Mohamed Kafsi
Spark Summit EU talk by Francois Garillot and Mohamed KafsiSpark Summit EU talk by Francois Garillot and Mohamed Kafsi
Spark Summit EU talk by Francois Garillot and Mohamed KafsiSpark Summit
 
A survey paper on sequence pattern mining with incremental
A survey paper on sequence pattern mining with incrementalA survey paper on sequence pattern mining with incremental
A survey paper on sequence pattern mining with incrementalAlexander Decker
 
A survey paper on sequence pattern mining with incremental
A survey paper on sequence pattern mining with incrementalA survey paper on sequence pattern mining with incremental
A survey paper on sequence pattern mining with incrementalAlexander Decker
 
Leveraging the Web of Data: Managing, Analysing and Making Use of Linked Open...
Leveraging the Web of Data: Managing, Analysing and Making Use of Linked Open...Leveraging the Web of Data: Managing, Analysing and Making Use of Linked Open...
Leveraging the Web of Data: Managing, Analysing and Making Use of Linked Open...Thomas Gottron
 
Open Analytics Environment
Open Analytics EnvironmentOpen Analytics Environment
Open Analytics EnvironmentIan Foster
 
Data Mining & Analytics for U.S. Airlines On-Time Performance
Data Mining & Analytics for U.S. Airlines On-Time Performance Data Mining & Analytics for U.S. Airlines On-Time Performance
Data Mining & Analytics for U.S. Airlines On-Time Performance Mingxuan Li
 
Real-Time Spark: From Interactive Queries to Streaming
Real-Time Spark: From Interactive Queries to StreamingReal-Time Spark: From Interactive Queries to Streaming
Real-Time Spark: From Interactive Queries to StreamingDatabricks
 
Distributed Computing for Everyone
Distributed Computing for EveryoneDistributed Computing for Everyone
Distributed Computing for EveryoneGiovanna Roda
 
Session 1.5 supporting virtual integration of linked data with just-in-time...
Session 1.5   supporting virtual integration of linked data with just-in-time...Session 1.5   supporting virtual integration of linked data with just-in-time...
Session 1.5 supporting virtual integration of linked data with just-in-time...semanticsconference
 
ElasticSearch.pptx
ElasticSearch.pptxElasticSearch.pptx
ElasticSearch.pptxTrnHiu748002
 
maxbox starter72 multilanguage coding
maxbox starter72 multilanguage codingmaxbox starter72 multilanguage coding
maxbox starter72 multilanguage codingMax Kleiner
 
PigSPARQL: A SPARQL Query Processing Baseline for Big Data
PigSPARQL: A SPARQL Query Processing Baseline for Big DataPigSPARQL: A SPARQL Query Processing Baseline for Big Data
PigSPARQL: A SPARQL Query Processing Baseline for Big DataAlexander Schätzle
 
A Deep Dive into Structured Streaming in Apache Spark
A Deep Dive into Structured Streaming in Apache Spark A Deep Dive into Structured Streaming in Apache Spark
A Deep Dive into Structured Streaming in Apache Spark Anyscale
 
ONLINE STUDENT MANAGEMENT SYSTEM
ONLINE STUDENT MANAGEMENT SYSTEMONLINE STUDENT MANAGEMENT SYSTEM
ONLINE STUDENT MANAGEMENT SYSTEMRohit malav
 
Pumps, Compressors and Turbine Fault Frequency Analysis
Pumps, Compressors and Turbine Fault Frequency AnalysisPumps, Compressors and Turbine Fault Frequency Analysis
Pumps, Compressors and Turbine Fault Frequency AnalysisUniversity of Illinois,Chicago
 
Pumps, Compressors and Turbine Fault Frequency Analysis
Pumps, Compressors and Turbine Fault Frequency AnalysisPumps, Compressors and Turbine Fault Frequency Analysis
Pumps, Compressors and Turbine Fault Frequency AnalysisUniversity of Illinois,Chicago
 

Similar to Predictive Maintenance at the Dutch Railways with Ivo Everts (20)

Big Data Analytics Lab File
Big Data Analytics Lab FileBig Data Analytics Lab File
Big Data Analytics Lab File
 
Odp
OdpOdp
Odp
 
Mobility insights at Swisscom - Understanding collective mobility in Switzerland
Mobility insights at Swisscom - Understanding collective mobility in SwitzerlandMobility insights at Swisscom - Understanding collective mobility in Switzerland
Mobility insights at Swisscom - Understanding collective mobility in Switzerland
 
Spark Summit EU talk by Francois Garillot and Mohamed Kafsi
Spark Summit EU talk by Francois Garillot and Mohamed KafsiSpark Summit EU talk by Francois Garillot and Mohamed Kafsi
Spark Summit EU talk by Francois Garillot and Mohamed Kafsi
 
A survey paper on sequence pattern mining with incremental
A survey paper on sequence pattern mining with incrementalA survey paper on sequence pattern mining with incremental
A survey paper on sequence pattern mining with incremental
 
A survey paper on sequence pattern mining with incremental
A survey paper on sequence pattern mining with incrementalA survey paper on sequence pattern mining with incremental
A survey paper on sequence pattern mining with incremental
 
Leveraging the Web of Data: Managing, Analysing and Making Use of Linked Open...
Leveraging the Web of Data: Managing, Analysing and Making Use of Linked Open...Leveraging the Web of Data: Managing, Analysing and Making Use of Linked Open...
Leveraging the Web of Data: Managing, Analysing and Making Use of Linked Open...
 
Open Analytics Environment
Open Analytics EnvironmentOpen Analytics Environment
Open Analytics Environment
 
Data Mining & Analytics for U.S. Airlines On-Time Performance
Data Mining & Analytics for U.S. Airlines On-Time Performance Data Mining & Analytics for U.S. Airlines On-Time Performance
Data Mining & Analytics for U.S. Airlines On-Time Performance
 
Shilpa ppt
Shilpa pptShilpa ppt
Shilpa ppt
 
Real-Time Spark: From Interactive Queries to Streaming
Real-Time Spark: From Interactive Queries to StreamingReal-Time Spark: From Interactive Queries to Streaming
Real-Time Spark: From Interactive Queries to Streaming
 
Distributed Computing for Everyone
Distributed Computing for EveryoneDistributed Computing for Everyone
Distributed Computing for Everyone
 
Session 1.5 supporting virtual integration of linked data with just-in-time...
Session 1.5   supporting virtual integration of linked data with just-in-time...Session 1.5   supporting virtual integration of linked data with just-in-time...
Session 1.5 supporting virtual integration of linked data with just-in-time...
 
ElasticSearch.pptx
ElasticSearch.pptxElasticSearch.pptx
ElasticSearch.pptx
 
maxbox starter72 multilanguage coding
maxbox starter72 multilanguage codingmaxbox starter72 multilanguage coding
maxbox starter72 multilanguage coding
 
PigSPARQL: A SPARQL Query Processing Baseline for Big Data
PigSPARQL: A SPARQL Query Processing Baseline for Big DataPigSPARQL: A SPARQL Query Processing Baseline for Big Data
PigSPARQL: A SPARQL Query Processing Baseline for Big Data
 
A Deep Dive into Structured Streaming in Apache Spark
A Deep Dive into Structured Streaming in Apache Spark A Deep Dive into Structured Streaming in Apache Spark
A Deep Dive into Structured Streaming in Apache Spark
 
ONLINE STUDENT MANAGEMENT SYSTEM
ONLINE STUDENT MANAGEMENT SYSTEMONLINE STUDENT MANAGEMENT SYSTEM
ONLINE STUDENT MANAGEMENT SYSTEM
 
Pumps, Compressors and Turbine Fault Frequency Analysis
Pumps, Compressors and Turbine Fault Frequency AnalysisPumps, Compressors and Turbine Fault Frequency Analysis
Pumps, Compressors and Turbine Fault Frequency Analysis
 
Pumps, Compressors and Turbine Fault Frequency Analysis
Pumps, Compressors and Turbine Fault Frequency AnalysisPumps, Compressors and Turbine Fault Frequency Analysis
Pumps, Compressors and Turbine Fault Frequency Analysis
 

More from Databricks

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDatabricks
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Databricks
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Databricks
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Databricks
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Databricks
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of HadoopDatabricks
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceDatabricks
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringDatabricks
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixDatabricks
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationDatabricks
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchDatabricks
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesDatabricks
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesDatabricks
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsDatabricks
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkDatabricks
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkDatabricks
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesDatabricks
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkDatabricks
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeDatabricks
 
Machine Learning CI/CD for Email Attack Detection
Machine Learning CI/CD for Email Attack DetectionMachine Learning CI/CD for Email Attack Detection
Machine Learning CI/CD for Email Attack DetectionDatabricks
 

More from Databricks (20)

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
 
Machine Learning CI/CD for Email Attack Detection
Machine Learning CI/CD for Email Attack DetectionMachine Learning CI/CD for Email Attack Detection
Machine Learning CI/CD for Email Attack Detection
 

Recently uploaded

Statistics For Management by Richard I. Levin 8ed.pdf
Statistics For Management by Richard I. Levin 8ed.pdfStatistics For Management by Richard I. Levin 8ed.pdf
Statistics For Management by Richard I. Levin 8ed.pdfnikeshsingh56
 
IBEF report on the Insurance market in India
IBEF report on the Insurance market in IndiaIBEF report on the Insurance market in India
IBEF report on the Insurance market in IndiaManalVerma4
 
Decoding Movie Sentiments: Analyzing Reviews with Data Analysis model
Decoding Movie Sentiments: Analyzing Reviews with Data Analysis modelDecoding Movie Sentiments: Analyzing Reviews with Data Analysis model
Decoding Movie Sentiments: Analyzing Reviews with Data Analysis modelBoston Institute of Analytics
 
DATA ANALYSIS using various data sets like shoping data set etc
DATA ANALYSIS using various data sets like shoping data set etcDATA ANALYSIS using various data sets like shoping data set etc
DATA ANALYSIS using various data sets like shoping data set etclalithasri22
 
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...Boston Institute of Analytics
 
Bank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis ProjectBank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis ProjectBoston Institute of Analytics
 
Decision Making Under Uncertainty - Is It Better Off Joining a Partnership or...
Decision Making Under Uncertainty - Is It Better Off Joining a Partnership or...Decision Making Under Uncertainty - Is It Better Off Joining a Partnership or...
Decision Making Under Uncertainty - Is It Better Off Joining a Partnership or...ThinkInnovation
 
Presentation of project of business person who are success
Presentation of project of business person who are successPresentation of project of business person who are success
Presentation of project of business person who are successPratikSingh115843
 
Digital Indonesia Report 2024 by We Are Social .pdf
Digital Indonesia Report 2024 by We Are Social .pdfDigital Indonesia Report 2024 by We Are Social .pdf
Digital Indonesia Report 2024 by We Are Social .pdfNicoChristianSunaryo
 
why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...
why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...
why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...Jack Cole
 
Predictive Analysis - Using Insight-informed Data to Plan Inventory in Next 6...
Predictive Analysis - Using Insight-informed Data to Plan Inventory in Next 6...Predictive Analysis - Using Insight-informed Data to Plan Inventory in Next 6...
Predictive Analysis - Using Insight-informed Data to Plan Inventory in Next 6...ThinkInnovation
 
Role of Consumer Insights in business transformation
Role of Consumer Insights in business transformationRole of Consumer Insights in business transformation
Role of Consumer Insights in business transformationAnnie Melnic
 
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...Dr Arash Najmaei ( Phd., MBA, BSc)
 

Recently uploaded (16)

Statistics For Management by Richard I. Levin 8ed.pdf
Statistics For Management by Richard I. Levin 8ed.pdfStatistics For Management by Richard I. Levin 8ed.pdf
Statistics For Management by Richard I. Levin 8ed.pdf
 
IBEF report on the Insurance market in India
IBEF report on the Insurance market in IndiaIBEF report on the Insurance market in India
IBEF report on the Insurance market in India
 
Decoding Movie Sentiments: Analyzing Reviews with Data Analysis model
Decoding Movie Sentiments: Analyzing Reviews with Data Analysis modelDecoding Movie Sentiments: Analyzing Reviews with Data Analysis model
Decoding Movie Sentiments: Analyzing Reviews with Data Analysis model
 
Insurance Churn Prediction Data Analysis Project
Insurance Churn Prediction Data Analysis ProjectInsurance Churn Prediction Data Analysis Project
Insurance Churn Prediction Data Analysis Project
 
DATA ANALYSIS using various data sets like shoping data set etc
DATA ANALYSIS using various data sets like shoping data set etcDATA ANALYSIS using various data sets like shoping data set etc
DATA ANALYSIS using various data sets like shoping data set etc
 
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
 
Bank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis ProjectBank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis Project
 
Decision Making Under Uncertainty - Is It Better Off Joining a Partnership or...
Decision Making Under Uncertainty - Is It Better Off Joining a Partnership or...Decision Making Under Uncertainty - Is It Better Off Joining a Partnership or...
Decision Making Under Uncertainty - Is It Better Off Joining a Partnership or...
 
Presentation of project of business person who are success
Presentation of project of business person who are successPresentation of project of business person who are success
Presentation of project of business person who are success
 
Data Analysis Project: Stroke Prediction
Data Analysis Project: Stroke PredictionData Analysis Project: Stroke Prediction
Data Analysis Project: Stroke Prediction
 
Digital Indonesia Report 2024 by We Are Social .pdf
Digital Indonesia Report 2024 by We Are Social .pdfDigital Indonesia Report 2024 by We Are Social .pdf
Digital Indonesia Report 2024 by We Are Social .pdf
 
2023 Survey Shows Dip in High School E-Cigarette Use
2023 Survey Shows Dip in High School E-Cigarette Use2023 Survey Shows Dip in High School E-Cigarette Use
2023 Survey Shows Dip in High School E-Cigarette Use
 
why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...
why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...
why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...
 
Predictive Analysis - Using Insight-informed Data to Plan Inventory in Next 6...
Predictive Analysis - Using Insight-informed Data to Plan Inventory in Next 6...Predictive Analysis - Using Insight-informed Data to Plan Inventory in Next 6...
Predictive Analysis - Using Insight-informed Data to Plan Inventory in Next 6...
 
Role of Consumer Insights in business transformation
Role of Consumer Insights in business transformationRole of Consumer Insights in business transformation
Role of Consumer Insights in business transformation
 
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
 

Predictive Maintenance at the Dutch Railways with Ivo Everts

  • 1. Ivo Everts, GoDataDriven (Amsterdam, NL) Predictive Maintenance at the Dutch Railways Air leakage detection in train breaking pipes #DSSAIS16
  • 2. #DSSAIS16 !2 Tünde Alkemade Ivo Everts Jelte Hoekstra Giovanni Lanzani Janelle Zoutkamp Stefan Hendriks Wouter Hordijk Inge Kalsbeek Wan-Jui Lee Inka Locht Kee Man Nick Oosterhof Margot Peters Cyriana Roelofs Mattijs Suurland
  • 3. #DSSAIS16 Published paper in June 2017 'Contextual Air Leakage Detection in Train Braking Pipes' @ 'Advances in Artificial Intelligence: From Theory to Practice' !2 Tünde Alkemade Ivo Everts Jelte Hoekstra Giovanni Lanzani Janelle Zoutkamp Stefan Hendriks Wouter Hordijk Inge Kalsbeek Wan-Jui Lee Inka Locht Kee Man Nick Oosterhof Margot Peters Cyriana Roelofs Mattijs Suurland
  • 4. #DSSAIS16 Implemented the paper using PySpark; teamwork for data delivery, proper productionizing, way of working Published paper in June 2017 'Contextual Air Leakage Detection in Train Braking Pipes' @ 'Advances in Artificial Intelligence: From Theory to Practice' !2 Tünde Alkemade Ivo Everts Jelte Hoekstra Giovanni Lanzani Janelle Zoutkamp Stefan Hendriks Wouter Hordijk Inge Kalsbeek Wan-Jui Lee Inka Locht Kee Man Nick Oosterhof Margot Peters Cyriana Roelofs Mattijs Suurland
  • 5. #DSSAIS16 Consulting data scientist @ GoDataDriven MSc in AI; PhD in Computer Vision Did lot of scientific programming Now mostly Python, Hadoop, Spark, Keras Implemented the paper using PySpark; teamwork for data delivery, proper productionizing, way of working Published paper in June 2017 'Contextual Air Leakage Detection in Train Braking Pipes' @ 'Advances in Artificial Intelligence: From Theory to Practice' !2 Tünde Alkemade Ivo Everts Jelte Hoekstra Giovanni Lanzani Janelle Zoutkamp Stefan Hendriks Wouter Hordijk Inge Kalsbeek Wan-Jui Lee Inka Locht Kee Man Nick Oosterhof Margot Peters Cyriana Roelofs Mattijs Suurland
  • 6. #DSSAIS16 Big Data @ NS (Dutch Railways) !3 SLT DDZ VIRM FLIRT SNG ICNG Pantomonitoring Gotcha/Quovadis Data conversie RTM operations MBN Maximo Werkorder OB/SB Big Data RTM analytics MonteurWerkvoorbereider 4M Helpdesk 7503 4M 7503 TRAXX
  • 7. #DSSAIS16 Outline • Data ingestion • Air leakage detection • Other use cases !4
  • 8. #DSSAIS16 Data ingestion Collecting data for Real Time Monitoring !5 SLT DDZ VIRM FLIRT SNG ICNG Pantomonitoring Gotcha/Quovadis Data conversie RTM operations MBN Maximo Werkorder OB/SB Big Data RTM analytics MonteurWerkvoorbereider 4M Helpdesk 7503 4M 7503 TRAXX
  • 9. #DSSAIS16 Data ingestion Real time data Historic data !6 ~20GB per day; 22B records 22B records
  • 10. #DSSAIS16 Data ingestion Real time data Historic data !6 ~20GB per day; 22B records 22B records
  • 11. #DSSAIS16 Data ingestion Real time data Historic data !6 ~20GB per day; 22B records 22B records
  • 12. #DSSAIS16 Data ingestion Real time data Historic data !6 # fetch args table_name = sys.argv[1] origin = sys.argv[2] # read xml string from stdin xml_string = ''.join(sys.stdin.readlines()) # parse 'm and print to stdout data = xml2table(xml_string, table_name, origin, start_dt, end_dt) print(data.to_csv(encoding='utf-8', header=False, index=False, ~20GB per day; 22B records 22B records
  • 13. #DSSAIS16 Data ingestion Real time data Historic data !6 # fetch args table_name = sys.argv[1] origin = sys.argv[2] # read xml string from stdin xml_string = ''.join(sys.stdin.readlines()) # parse 'm and print to stdout data = xml2table(xml_string, table_name, origin, start_dt, end_dt) print(data.to_csv(encoding='utf-8', header=False, index=False, ~20GB per day; 22B records 22B records
  • 14. #DSSAIS16 Data ingestion Real time data Historic data !6 # fetch args table_name = sys.argv[1] origin = sys.argv[2] # read xml string from stdin xml_string = ''.join(sys.stdin.readlines()) # parse 'm and print to stdout data = xml2table(xml_string, table_name, origin, start_dt, end_dt) print(data.to_csv(encoding='utf-8', header=False, index=False, ~20GB per day; 22B records 22B records def get_history_df(hdfs_binary_files, output_table_name): rdd = (hdfs_binary_files .map(unzip_xml) .flatMap(lambda x: xml2rows(x, output_table_name))) return None if rdd.isEmpty() else rdd.toDF() ... ... # read binary files and parse 'm hdfs_binary_files = (sc.binaryFiles('{}*'.format(file_group)) .persist(StorageLevel.MEMORY_AND_DISK_SER) .repartition(num_executors)) sdf = get_history_df(hdfs_binary_files, args.output_table_name)
  • 15. #DSSAIS16 Data ingestion Real time data Historic data !6 # fetch args table_name = sys.argv[1] origin = sys.argv[2] # read xml string from stdin xml_string = ''.join(sys.stdin.readlines()) # parse 'm and print to stdout data = xml2table(xml_string, table_name, origin, start_dt, end_dt) print(data.to_csv(encoding='utf-8', header=False, index=False, >> Better suited for large datasets >> Easier to (re)partition and append >> Preference for a unified approach ~20GB per day; 22B records 22B records def get_history_df(hdfs_binary_files, output_table_name): rdd = (hdfs_binary_files .map(unzip_xml) .flatMap(lambda x: xml2rows(x, output_table_name))) return None if rdd.isEmpty() else rdd.toDF() ... ... # read binary files and parse 'm hdfs_binary_files = (sc.binaryFiles('{}*'.format(file_group)) .persist(StorageLevel.MEMORY_AND_DISK_SER) .repartition(num_executors)) sdf = get_history_df(hdfs_binary_files, args.output_table_name)
  • 16. #DSSAIS16 Air leakage detection One of the cases using RTM data !7 SLT DDZ VIRM FLIRT SNG ICNG Pantomonitoring Gotcha/Quovadis Data conversie RTM operations MBN Maximo Werkorder OB/SB Big Data RTM analytics MonteurWerkvoorbereider 4M Helpdesk 7503 4M 7503 TRAXX
  • 17. #DSSAIS16 Air leakage detection !8 Data compressor logs Detection classification Diagnosis clustering Prognosis regression Operation humans
  • 18. #DSSAIS16 Air leakage detection !9 Data compressor logs Detection classification Diagnosis clustering Prognosis regression Operation humans We extract median compressor run- and idle- times per hour
  • 19. #DSSAIS16 Air leakage detection !10 Data compressor logs Detection classification Diagnosis clustering Prognosis regression Operation humans
  • 20. #DSSAIS16 Air leakage detection Data compressor logs Detection classification Diagnosis clustering Prognosis regression Operation humans One model per train => compressor durations collected with a groupBy() and modelled in a UDF using scikit-learn to obtain the per-class probabilities: !11
  • 21. #DSSAIS16 Air leakage detection Data compressor logs Detection classification Diagnosis clustering Prognosis regression Operation humans One model per train => compressor durations collected with a groupBy() and modelled in a UDF using scikit-learn to obtain the per-class probabilities: Anomalies !11
  • 22. #DSSAIS16 Air leakage detection Data compressor logs Detection classification Diagnosis clustering Prognosis regression Operation humans # group data per train and collect data LOL train_data = (df.groupBy('rolling_stock_number') .agg(F.collect_list('median_duration').alias('median_duration'), F.collect_list('label').alias('label'))) # define UDF and fit model logistic_regression = F.udf(lambda X, y: sklearn_wrapper(X, y), ArrayType(FloatType())) model_data = (train_data.withColumn('logistic_model_parameters', logistic_regression(train_data.median_duration, train_data.label))) One model per train => compressor durations collected with a groupBy() and modelled in a UDF using scikit-learn to obtain the per-class probabilities: Anomalies !11
  • 23. #DSSAIS16 Air leakage detection Data compressor logs Detection classification Diagnosis clustering Prognosis regression Operation humans !12
  • 24. #DSSAIS16 Air leakage detection Data compressor logs Detection classification Diagnosis clustering Prognosis regression Operation humans Air leakage is defined as a cluster of anomalies, robust against outliers. For this we use a 'dynamic' variation of dbscan clustering, in a UDF, where the minimal cluster size is dependent on the logistic function: !13
  • 25. #DSSAIS16 Air leakage detection Data compressor logs Detection classification Diagnosis clustering Prognosis regression Operation humans Air leakage is defined as a cluster of anomalies, robust against outliers. For this we use a 'dynamic' variation of dbscan clustering, in a UDF, where the minimal cluster size is dependent on the logistic function: !13
  • 26. #DSSAIS16 Air leakage detection Data compressor logs Detection classification Diagnosis clustering Prognosis regression Operation humans !14
  • 27. #DSSAIS16 Air leakage detection Data compressor logs Detection classification Diagnosis clustering Prognosis regression Operation humans !15
  • 28. #DSSAIS16 Air leakage detection Data compressor logs Detection classification Diagnosis clustering Prognosis regression Operation humans # import relevant classes from sparkml from pyspark.ml.feature import VectorAssembler from pyspark.ml.regression import LinearRegression ... # prepare and run linear regression assembler = VectorAssembler(inputCols=x_cols, outputCol='features') input = assembler.transform(train_data).select(['label', 'features']) output = LinearRegression().setSolver('l-bfgs').fit(input) In the current implementation we train one regressor for all trains, so a pure Spark implementation is possible !16
  • 29. #DSSAIS16 Air leakage detection Data compressor logs Detection classification Diagnosis clustering Prognosis regression Operation humans Oh no, humans in the loop! >> Prognosis is difficult to explain and does not perform well enough >> Diagnosis does perform pretty good (~85% true positives; ~15% solved on time) >> Dedicated staff for bridging between algorithm and organisation >> We are observing recent air leakages that would have been prevented, so we're getting there SLT DDZ VIRM FLIRT SNG ICNG Pantomonitoring Gotcha/Quovadis Data conversie RTM operations MBN Maximo Werkorder OB/SB Big Data RTM analytics MonteurWerkvoorbereider 4M Helpdesk 7503 4M 7503 TRAXX !17
  • 30. #DSSAIS16 Air leakage detection Data compressor logs Detection classification Diagnosis clustering Prognosis regression Operation humans Oh no, humans in the loop! >> Prognosis is difficult to explain and does not perform well enough >> Diagnosis does perform pretty good (~85% true positives; ~15% solved on time) >> Dedicated staff for bridging between algorithm and organisation >> We are observing recent air leakages that would have been prevented, so we're getting there SLT DDZ VIRM FLIRT SNG ICNG Pantomonitoring Gotcha/Quovadis Data conversie RTM operations MBN Maximo Werkorder OB/SB Big Data RTM analytics MonteurWerkvoorbereider 4M Helpdesk 7503 4M 7503 TRAXX !17
  • 31. #DSSAIS16 More use cases >> Hot wheels: failing axle bearings >> Crowd control: passengers distribution >> Slippery tracks: traction detection >> Water usage: when you got to go you got to go >> ...and many more to come !18
  • 32. Ivo Everts | ivoeverts@godatadriven.com | @ivoeverts Predictive Maintenance at the Dutch Railways Air leakage detection in train breaking pipes #DSSAIS16