SlideShare a Scribd company logo
1 of 20
Flink in Genomics
Integrating Flink and Kafka in the standard genomic pipeline
F. VERSACI L. PIREDDU G. ZANETTI
– Flink Forward 2017 –
13 September 2017
Francesco Versaci (CRS4) Flink in Genomics 13 September 2017 1 / 19
CRS4
Centro di Ricerca, Sviluppo e Studi Superiori in Sardegna
CRS4
Research center in Sardinia, Italy
Focus on big data, biosciences, HPC, visual computing, energy and environment
Francesco Versaci (CRS4) Flink in Genomics 13 September 2017 2 / 19
Outline
1 Introduction
2 Implementation
3 Experimental evaluation
Francesco Versaci (CRS4) Flink in Genomics 13 September 2017 3 / 19
Next-Generation Sequencing
Cost
Genome sequencing is now much cheaper than in the past
About 1000 euros per whole human genome
103
104
105
106
107
108
Cost[USD]
2001 2004 2007 2010 2013 2016
Year
(Data from https://www.genome.gov/sequencingcosts/)
Francesco Versaci (CRS4) Flink in Genomics 13 September 2017 4 / 19
Next-Generation Sequencing
Applications
High-throughput DNA sequencing has many applications, including
Research into understanding human genetic diseases
Medicine, e.g., oncology, clinical pathology, ...
Human phylogeny
Personalized diagnostic applications
Huge amount of data
A single sequencer can produce 1 TB/day of data
Which need to be converted, filtered, aggregated, reconstructed, analyzed, ...
A single genomic center can have hundreds of sequencers
Francesco Versaci (CRS4) Flink in Genomics 13 September 2017 5 / 19
Shotgun genome sequencing
The DNA is a sequence of four bases: Adenine, Cytosine,
Guanine and Thymine (A, C, G and T)
The genome is broken up into short fragments (reads)
The fragments are attached to a support (tile)
Fluorescent molecules are iteratively attached to the bases
At each cycle, the machine acquires (optically) a single base
from all the fragments
(CC-BY-3.0 image from https://goo.gl/9dqN8g)
Francesco Versaci (CRS4) Flink in Genomics 13 September 2017 6 / 19
Data processing in NGS
First steps and standard pipeline
Pre-processing Reads need to be reconstructed from the raw, cycle-based, sequencer’s
output (BCL files)
Alignment The short reads are aligned to a reference genome
When using Illumina sequencers, the standard pipeline starts with two programs:
bcl2fastq2 Proprietary, open-source tool by Illumina to convert raw BCL data to FASTQ
format
BWA-MEM Free (GPLv3) aligner to reconstruct the full genomic sequence based on the
short reads generated by the sequencer
Problems
Parallel tools, but shared-memory (single node)
Communication via intermediate files:
Different steps must be run sequentially
New data requires re-computing the workflow from scratch
Francesco Versaci (CRS4) Flink in Genomics 13 September 2017 7 / 19
Our approach
Module 1: Data pre-processing Module 2: Reads alignment
Raw BCL
Transposition and
filtering
FASTQ Kafka broker Data retrieval
RAPI BWA/MEM
alignment
CRAM
Data pre-processing
Flink program, written from scratch (see
FlinkForward 2016 and BigData 20161)
Modified to sends its output to Kafka
It creates many (thousands) topics
Reads alignment
Flink program, powered by RAPI and
BWA/MEM
Reads from Kafka
Writes to Kafka or HDFS
1
Francesco Versaci, Luca Pireddu, and Gianluigi Zanetti (2016). “Scalable genomics: From raw data to
aligned reads on Apache YARN”. In: IEEE BigData 2016, pp. 1232–1241
Francesco Versaci (CRS4) Flink in Genomics 13 September 2017 8 / 19
Outline
1 Introduction
2 Implementation
3 Experimental evaluation
Francesco Versaci (CRS4) Flink in Genomics 13 September 2017 9 / 19
Data pre-processing
Bases and quality scores are decoded
A data transposition is performed to obtain the reads
For each tile several new Kafka data topics are created
The list of the new topics is sent to a special control
topic
The reads are written into the data topics
Finally, an EOS marker is appended to each data topic
File 1
File 2
File 3
File 4
File 5
A
A
G
C
T
TCGG
CAGA
CAGA
TGAT
CGTC
Cycle 1
Cycle 2
Cycle 3
Cycle 4
Cycle 5
Location5
Location4
Location3
Location2
Location1
A A G C T
C A A G G
G G G A T
G A A T C
T C C T C
Read 1
Read 2
etc...
Sending finite streams
Each data topic will contain a finite amount of data
Kafka streams are potentially infinite
Hence we need to send a EOS when a data topic is completed
Francesco Versaci (CRS4) Flink in Genomics 13 September 2017 10 / 19
Reads alignment
The control topic is polled every 3 seconds, to
receive notice of newly opened data topics
The data topics are grouped into Flink jobs and
aligned in parallel
For each data topic, the reads are chunked
together and aligned
The aligned reads are sent to the output sink (e.g.,
written as CRAM files).
The number of data topics is
not known in advance
When a data topic is
completed the corresponding
DataSource needs to be
terminated
To improve the aligner
performance the chunks
should be the same size
Francesco Versaci (CRS4) Flink in Genomics 13 September 2017 11 / 19
Handling finite streams
Our experience
Finite streams do not work as (we) intended with count windows (last window is not
evaluated)
There is an easy work-around which uses event time windows
val env = StreamExecutionEnvironment . getExecutionEnvironment
env . set StreamTimeCharacteristic ( Ti me C har ac ter ist ic . EventTime )
val consumer = new FlinkKafkaConsumer010 [ MyType ] (
topicname , new MyDeserializer , p r o p e r t i e s )
. assignTimestampsAndWatermarks ( new MyTStamper )
val ds = env . addSource ( consumer )
val out = ds . timeWindowAll ( Time . milliseconds (1024))
. apply ( new MyAllWindowFunction )
/ / do something with the output
/ / . . .
Francesco Versaci (CRS4) Flink in Genomics 13 September 2017 12 / 19
Handling finite streams
Deserialization
c l a s s MyDeserializer extends DeserializationSchema [ MyType ] {
override def getProducedType =
TypeInformation . of ( classOf [ MyType ] )
override def isEndOfStream ( el : MyType ) : Boolean = {
return el . isEOS
}
override def d e s e r i a l i z e ( data : Array [ Byte ] ) : MyType = {
/ / d e s e r i a l i z a t i o n code here
}
}
Francesco Versaci (CRS4) Flink in Genomics 13 September 2017 13 / 19
Handling finite streams
Timestamping
c l a s s MyTStamper extends AssignerWithPeriodicWatermarks [ MyType ] {
var cur = 0 l
override def extractTimestamp ( el : MyType , prev : Long ) : Long = {
val r = cur
cur += 1
return r
}
override def getCurrentWatermark : Watermark = {
new Watermark ( cur − 1)
}
}
Francesco Versaci (CRS4) Flink in Genomics 13 September 2017 14 / 19
Outline
1 Introduction
2 Implementation
3 Experimental evaluation
Francesco Versaci (CRS4) Flink in Genomics 13 September 2017 15 / 19
Experimental setup
Experiments run on the Amazon Elastic Compute
Cloud (EC2)
Up to 12 instances of i3.8xlarge machines
CPUs 32 virtual cores (Xeon E5-2686 v4, 45 MB
cache)
RAM 244 GiB
Disks 4 x 1.9 TB NVMe SSD
Network 10 Gb Ethernet
OS Ubuntu 16.04.2 LTS (Linux kernel
4.4.0-1013-aws)
Input dataset
Sequencer Illumina HiSeq 3000
Size of data files 47.8 GB
(gzip-compressed)
Number of files 47,040
Number of samples 12
Number of reads 3.38 · 108
Number of DNA bases 3.38 · 1010
Francesco Versaci (CRS4) Flink in Genomics 13 September 2017 16 / 19
Results
Strong scalability
Nodes Time (minutes)
Baseline 137
1 152
2 77
4 39.6
8 20.4
12 14.3
1
2
4
6
8
10
12
Speed-up
1 2 4 8 12
Number of nodes
Perfect scalability
Absolute scalability
Relative scalability
Same input, variable number of nodes
On single node: 11% slower than the baseline
On more nodes: almost linear scalability
Francesco Versaci (CRS4) Flink in Genomics 13 September 2017 17 / 19
Load on Kafka
We also tested the Kafka
broker under stress, using
4 nodes
Node-1 Running the
Kafka broker and Flink
Job-Manager
Node-[2-4] Running the
preprocessor at full
speed (32 concurrent
jobs per node)
Data flow of 450 MB/s towards the broker for about 3
minutes
2688 topics created
340 million messages sent
The CPU load was about 900%
Assuming linearity, the network bandwidth is saturated at
about 25 cores (10 Gb/s = 1250 MB/s)
Limits to scalability
An aligning node at maximum power (32 cores at 100%)
requires 10 MB/s of data
A single Kafka broker can sustain up to 125 nodes
Francesco Versaci (CRS4) Flink in Genomics 13 September 2017 18 / 19
Conclusion and future work
We have presented a fully scalable streaming pipeline to process raw Illumina NGS
data up to the stage of aligning reads
To facilitate the adoption of scalable stream processing in genomics...
...we aim at integrating our pipeline with the upcoming Genome Analysis Toolkit 4
(GATK4, a Spark-based genomic analysis tool)
Francesco Versaci (CRS4) Flink in Genomics 13 September 2017 19 / 19
Conclusion and future work
We have presented a fully scalable streaming pipeline to process raw Illumina NGS
data up to the stage of aligning reads
To facilitate the adoption of scalable stream processing in genomics...
...we aim at integrating our pipeline with the upcoming Genome Analysis Toolkit 4
(GATK4, a Spark-based genomic analysis tool)
Thanks for your attention!
Francesco Versaci (CRS4) Flink in Genomics 13 September 2017 19 / 19

More Related Content

What's hot

Flink Forward Berlin 2017: Dr. Radu Tudoran - Huawei Cloud Stream Service in ...
Flink Forward Berlin 2017: Dr. Radu Tudoran - Huawei Cloud Stream Service in ...Flink Forward Berlin 2017: Dr. Radu Tudoran - Huawei Cloud Stream Service in ...
Flink Forward Berlin 2017: Dr. Radu Tudoran - Huawei Cloud Stream Service in ...Flink Forward
 
Flink Forward Berlin 2018: Shriya Arora - "Taming large-state to join dataset...
Flink Forward Berlin 2018: Shriya Arora - "Taming large-state to join dataset...Flink Forward Berlin 2018: Shriya Arora - "Taming large-state to join dataset...
Flink Forward Berlin 2018: Shriya Arora - "Taming large-state to join dataset...Flink Forward
 
Flink Forward Berlin 2017: Aljoscha Krettek - Talk Python to me: Stream Proce...
Flink Forward Berlin 2017: Aljoscha Krettek - Talk Python to me: Stream Proce...Flink Forward Berlin 2017: Aljoscha Krettek - Talk Python to me: Stream Proce...
Flink Forward Berlin 2017: Aljoscha Krettek - Talk Python to me: Stream Proce...Flink Forward
 
Flink Forward SF 2017: Srikanth Satya & Tom Kaitchuck - Pravega: Storage Rei...
Flink Forward SF 2017: Srikanth Satya & Tom Kaitchuck -  Pravega: Storage Rei...Flink Forward SF 2017: Srikanth Satya & Tom Kaitchuck -  Pravega: Storage Rei...
Flink Forward SF 2017: Srikanth Satya & Tom Kaitchuck - Pravega: Storage Rei...Flink Forward
 
Fabian Hueske_Till Rohrmann - Declarative stream processing with StreamSQL an...
Fabian Hueske_Till Rohrmann - Declarative stream processing with StreamSQL an...Fabian Hueske_Till Rohrmann - Declarative stream processing with StreamSQL an...
Fabian Hueske_Till Rohrmann - Declarative stream processing with StreamSQL an...Flink Forward
 
Stateful Distributed Stream Processing
Stateful Distributed Stream ProcessingStateful Distributed Stream Processing
Stateful Distributed Stream ProcessingGyula Fóra
 
Flink Forward SF 2017: Stephan Ewen - Convergence of real-time analytics and ...
Flink Forward SF 2017: Stephan Ewen - Convergence of real-time analytics and ...Flink Forward SF 2017: Stephan Ewen - Convergence of real-time analytics and ...
Flink Forward SF 2017: Stephan Ewen - Convergence of real-time analytics and ...Flink Forward
 
Continuous Processing with Apache Flink - Strata London 2016
Continuous Processing with Apache Flink - Strata London 2016Continuous Processing with Apache Flink - Strata London 2016
Continuous Processing with Apache Flink - Strata London 2016Stephan Ewen
 
Flink Forward Berlin 2017: Piotr Wawrzyniak - Extending Apache Flink stream p...
Flink Forward Berlin 2017: Piotr Wawrzyniak - Extending Apache Flink stream p...Flink Forward Berlin 2017: Piotr Wawrzyniak - Extending Apache Flink stream p...
Flink Forward Berlin 2017: Piotr Wawrzyniak - Extending Apache Flink stream p...Flink Forward
 
Apache Flink Berlin Meetup May 2016
Apache Flink Berlin Meetup May 2016Apache Flink Berlin Meetup May 2016
Apache Flink Berlin Meetup May 2016Stephan Ewen
 
Flink Forward Berlin 2017: Andreas Kunft - Efficiently executing R Dataframes...
Flink Forward Berlin 2017: Andreas Kunft - Efficiently executing R Dataframes...Flink Forward Berlin 2017: Andreas Kunft - Efficiently executing R Dataframes...
Flink Forward Berlin 2017: Andreas Kunft - Efficiently executing R Dataframes...Flink Forward
 
Flink Forward Berlin 2017: Till Rohrmann - From Apache Flink 1.3 to 1.4
Flink Forward Berlin 2017: Till Rohrmann - From Apache Flink 1.3 to 1.4Flink Forward Berlin 2017: Till Rohrmann - From Apache Flink 1.3 to 1.4
Flink Forward Berlin 2017: Till Rohrmann - From Apache Flink 1.3 to 1.4Flink Forward
 
Flink Forward SF 2017: Stefan Richter - Improvements for large state and reco...
Flink Forward SF 2017: Stefan Richter - Improvements for large state and reco...Flink Forward SF 2017: Stefan Richter - Improvements for large state and reco...
Flink Forward SF 2017: Stefan Richter - Improvements for large state and reco...Flink Forward
 
Till Rohrmann – Fault Tolerance and Job Recovery in Apache Flink
Till Rohrmann – Fault Tolerance and Job Recovery in Apache FlinkTill Rohrmann – Fault Tolerance and Job Recovery in Apache Flink
Till Rohrmann – Fault Tolerance and Job Recovery in Apache FlinkFlink Forward
 
Stephan Ewen - Stream Processing as a Foundational Paradigm and Apache Flink'...
Stephan Ewen - Stream Processing as a Foundational Paradigm and Apache Flink'...Stephan Ewen - Stream Processing as a Foundational Paradigm and Apache Flink'...
Stephan Ewen - Stream Processing as a Foundational Paradigm and Apache Flink'...Ververica
 
Flink Forward Berlin 2017: Ruben Casado Tejedor - Flink-Kudu connector: an op...
Flink Forward Berlin 2017: Ruben Casado Tejedor - Flink-Kudu connector: an op...Flink Forward Berlin 2017: Ruben Casado Tejedor - Flink-Kudu connector: an op...
Flink Forward Berlin 2017: Ruben Casado Tejedor - Flink-Kudu connector: an op...Flink Forward
 
Flink Forward San Francisco 2019: Moving from Lambda and Kappa Architectures ...
Flink Forward San Francisco 2019: Moving from Lambda and Kappa Architectures ...Flink Forward San Francisco 2019: Moving from Lambda and Kappa Architectures ...
Flink Forward San Francisco 2019: Moving from Lambda and Kappa Architectures ...Flink Forward
 
Modern Stream Processing With Apache Flink @ GOTO Berlin 2017
Modern Stream Processing With Apache Flink @ GOTO Berlin 2017Modern Stream Processing With Apache Flink @ GOTO Berlin 2017
Modern Stream Processing With Apache Flink @ GOTO Berlin 2017Till Rohrmann
 
Keynote: Stephan Ewen - Stream Processing as a Foundational Paradigm and Apac...
Keynote: Stephan Ewen - Stream Processing as a Foundational Paradigm and Apac...Keynote: Stephan Ewen - Stream Processing as a Foundational Paradigm and Apac...
Keynote: Stephan Ewen - Stream Processing as a Foundational Paradigm and Apac...Ververica
 
Flink forward SF 2017: Elizabeth K. Joseph and Ravi Yadav - Flink meet DC/OS ...
Flink forward SF 2017: Elizabeth K. Joseph and Ravi Yadav - Flink meet DC/OS ...Flink forward SF 2017: Elizabeth K. Joseph and Ravi Yadav - Flink meet DC/OS ...
Flink forward SF 2017: Elizabeth K. Joseph and Ravi Yadav - Flink meet DC/OS ...Flink Forward
 

What's hot (20)

Flink Forward Berlin 2017: Dr. Radu Tudoran - Huawei Cloud Stream Service in ...
Flink Forward Berlin 2017: Dr. Radu Tudoran - Huawei Cloud Stream Service in ...Flink Forward Berlin 2017: Dr. Radu Tudoran - Huawei Cloud Stream Service in ...
Flink Forward Berlin 2017: Dr. Radu Tudoran - Huawei Cloud Stream Service in ...
 
Flink Forward Berlin 2018: Shriya Arora - "Taming large-state to join dataset...
Flink Forward Berlin 2018: Shriya Arora - "Taming large-state to join dataset...Flink Forward Berlin 2018: Shriya Arora - "Taming large-state to join dataset...
Flink Forward Berlin 2018: Shriya Arora - "Taming large-state to join dataset...
 
Flink Forward Berlin 2017: Aljoscha Krettek - Talk Python to me: Stream Proce...
Flink Forward Berlin 2017: Aljoscha Krettek - Talk Python to me: Stream Proce...Flink Forward Berlin 2017: Aljoscha Krettek - Talk Python to me: Stream Proce...
Flink Forward Berlin 2017: Aljoscha Krettek - Talk Python to me: Stream Proce...
 
Flink Forward SF 2017: Srikanth Satya & Tom Kaitchuck - Pravega: Storage Rei...
Flink Forward SF 2017: Srikanth Satya & Tom Kaitchuck -  Pravega: Storage Rei...Flink Forward SF 2017: Srikanth Satya & Tom Kaitchuck -  Pravega: Storage Rei...
Flink Forward SF 2017: Srikanth Satya & Tom Kaitchuck - Pravega: Storage Rei...
 
Fabian Hueske_Till Rohrmann - Declarative stream processing with StreamSQL an...
Fabian Hueske_Till Rohrmann - Declarative stream processing with StreamSQL an...Fabian Hueske_Till Rohrmann - Declarative stream processing with StreamSQL an...
Fabian Hueske_Till Rohrmann - Declarative stream processing with StreamSQL an...
 
Stateful Distributed Stream Processing
Stateful Distributed Stream ProcessingStateful Distributed Stream Processing
Stateful Distributed Stream Processing
 
Flink Forward SF 2017: Stephan Ewen - Convergence of real-time analytics and ...
Flink Forward SF 2017: Stephan Ewen - Convergence of real-time analytics and ...Flink Forward SF 2017: Stephan Ewen - Convergence of real-time analytics and ...
Flink Forward SF 2017: Stephan Ewen - Convergence of real-time analytics and ...
 
Continuous Processing with Apache Flink - Strata London 2016
Continuous Processing with Apache Flink - Strata London 2016Continuous Processing with Apache Flink - Strata London 2016
Continuous Processing with Apache Flink - Strata London 2016
 
Flink Forward Berlin 2017: Piotr Wawrzyniak - Extending Apache Flink stream p...
Flink Forward Berlin 2017: Piotr Wawrzyniak - Extending Apache Flink stream p...Flink Forward Berlin 2017: Piotr Wawrzyniak - Extending Apache Flink stream p...
Flink Forward Berlin 2017: Piotr Wawrzyniak - Extending Apache Flink stream p...
 
Apache Flink Berlin Meetup May 2016
Apache Flink Berlin Meetup May 2016Apache Flink Berlin Meetup May 2016
Apache Flink Berlin Meetup May 2016
 
Flink Forward Berlin 2017: Andreas Kunft - Efficiently executing R Dataframes...
Flink Forward Berlin 2017: Andreas Kunft - Efficiently executing R Dataframes...Flink Forward Berlin 2017: Andreas Kunft - Efficiently executing R Dataframes...
Flink Forward Berlin 2017: Andreas Kunft - Efficiently executing R Dataframes...
 
Flink Forward Berlin 2017: Till Rohrmann - From Apache Flink 1.3 to 1.4
Flink Forward Berlin 2017: Till Rohrmann - From Apache Flink 1.3 to 1.4Flink Forward Berlin 2017: Till Rohrmann - From Apache Flink 1.3 to 1.4
Flink Forward Berlin 2017: Till Rohrmann - From Apache Flink 1.3 to 1.4
 
Flink Forward SF 2017: Stefan Richter - Improvements for large state and reco...
Flink Forward SF 2017: Stefan Richter - Improvements for large state and reco...Flink Forward SF 2017: Stefan Richter - Improvements for large state and reco...
Flink Forward SF 2017: Stefan Richter - Improvements for large state and reco...
 
Till Rohrmann – Fault Tolerance and Job Recovery in Apache Flink
Till Rohrmann – Fault Tolerance and Job Recovery in Apache FlinkTill Rohrmann – Fault Tolerance and Job Recovery in Apache Flink
Till Rohrmann – Fault Tolerance and Job Recovery in Apache Flink
 
Stephan Ewen - Stream Processing as a Foundational Paradigm and Apache Flink'...
Stephan Ewen - Stream Processing as a Foundational Paradigm and Apache Flink'...Stephan Ewen - Stream Processing as a Foundational Paradigm and Apache Flink'...
Stephan Ewen - Stream Processing as a Foundational Paradigm and Apache Flink'...
 
Flink Forward Berlin 2017: Ruben Casado Tejedor - Flink-Kudu connector: an op...
Flink Forward Berlin 2017: Ruben Casado Tejedor - Flink-Kudu connector: an op...Flink Forward Berlin 2017: Ruben Casado Tejedor - Flink-Kudu connector: an op...
Flink Forward Berlin 2017: Ruben Casado Tejedor - Flink-Kudu connector: an op...
 
Flink Forward San Francisco 2019: Moving from Lambda and Kappa Architectures ...
Flink Forward San Francisco 2019: Moving from Lambda and Kappa Architectures ...Flink Forward San Francisco 2019: Moving from Lambda and Kappa Architectures ...
Flink Forward San Francisco 2019: Moving from Lambda and Kappa Architectures ...
 
Modern Stream Processing With Apache Flink @ GOTO Berlin 2017
Modern Stream Processing With Apache Flink @ GOTO Berlin 2017Modern Stream Processing With Apache Flink @ GOTO Berlin 2017
Modern Stream Processing With Apache Flink @ GOTO Berlin 2017
 
Keynote: Stephan Ewen - Stream Processing as a Foundational Paradigm and Apac...
Keynote: Stephan Ewen - Stream Processing as a Foundational Paradigm and Apac...Keynote: Stephan Ewen - Stream Processing as a Foundational Paradigm and Apac...
Keynote: Stephan Ewen - Stream Processing as a Foundational Paradigm and Apac...
 
Flink forward SF 2017: Elizabeth K. Joseph and Ravi Yadav - Flink meet DC/OS ...
Flink forward SF 2017: Elizabeth K. Joseph and Ravi Yadav - Flink meet DC/OS ...Flink forward SF 2017: Elizabeth K. Joseph and Ravi Yadav - Flink meet DC/OS ...
Flink forward SF 2017: Elizabeth K. Joseph and Ravi Yadav - Flink meet DC/OS ...
 

Similar to Flink Forward Berlin 2017: Francesco Versaci - Integrating Flink and Kafka in the standard genomics pipeline

Francesco Versaci - Flink in genomics - efficient and scalable processing of ...
Francesco Versaci - Flink in genomics - efficient and scalable processing of ...Francesco Versaci - Flink in genomics - efficient and scalable processing of ...
Francesco Versaci - Flink in genomics - efficient and scalable processing of ...Flink Forward
 
Dissertation defense
Dissertation defenseDissertation defense
Dissertation defensemarek_pomocka
 
FAIR Projector Builder
FAIR Projector BuilderFAIR Projector Builder
FAIR Projector BuilderMark Wilkinson
 
Flink Apachecon Presentation
Flink Apachecon PresentationFlink Apachecon Presentation
Flink Apachecon PresentationGyula Fóra
 
Plank
PlankPlank
PlankFNian
 
FEC & File Multicast
FEC & File MulticastFEC & File Multicast
FEC & File MulticastYoss Cohen
 
CS589 paper presentation - What is in unison? A formal specification and refe...
CS589 paper presentation - What is in unison? A formal specification and refe...CS589 paper presentation - What is in unison? A formal specification and refe...
CS589 paper presentation - What is in unison? A formal specification and refe...Sergii Shmarkatiuk
 
Diminuendo! Tactics in Support of FaaS Migrations Slides
Diminuendo! Tactics in Support of FaaS Migrations SlidesDiminuendo! Tactics in Support of FaaS Migrations Slides
Diminuendo! Tactics in Support of FaaS Migrations SlidesSebastian Werner
 
Stateful stream processing made easy with Apache Flink. - A.Mancini F.Tosi - ...
Stateful stream processing made easy with Apache Flink. - A.Mancini F.Tosi - ...Stateful stream processing made easy with Apache Flink. - A.Mancini F.Tosi - ...
Stateful stream processing made easy with Apache Flink. - A.Mancini F.Tosi - ...Codemotion
 
Stateful stream processing made easy with Apache Flink
Stateful stream processing made easy with Apache FlinkStateful stream processing made easy with Apache Flink
Stateful stream processing made easy with Apache FlinkK-TEQ Srls
 
Transport layer (computer networks)
Transport layer (computer networks)Transport layer (computer networks)
Transport layer (computer networks)Fatbardh Hysa
 
AIMeetup #4: Neural-machine-translation
AIMeetup #4: Neural-machine-translationAIMeetup #4: Neural-machine-translation
AIMeetup #4: Neural-machine-translation2040.io
 
Learning spark ch10 - Spark Streaming
Learning spark ch10 - Spark StreamingLearning spark ch10 - Spark Streaming
Learning spark ch10 - Spark Streamingphanleson
 
ADMS'13 High-Performance Holistic XML Twig Filtering Using GPUs
ADMS'13  High-Performance Holistic XML Twig Filtering Using GPUsADMS'13  High-Performance Holistic XML Twig Filtering Using GPUs
ADMS'13 High-Performance Holistic XML Twig Filtering Using GPUsty1er
 

Similar to Flink Forward Berlin 2017: Francesco Versaci - Integrating Flink and Kafka in the standard genomics pipeline (20)

Francesco Versaci - Flink in genomics - efficient and scalable processing of ...
Francesco Versaci - Flink in genomics - efficient and scalable processing of ...Francesco Versaci - Flink in genomics - efficient and scalable processing of ...
Francesco Versaci - Flink in genomics - efficient and scalable processing of ...
 
Dissertation defense
Dissertation defenseDissertation defense
Dissertation defense
 
FAIR Projector Builder
FAIR Projector BuilderFAIR Projector Builder
FAIR Projector Builder
 
Flink Apachecon Presentation
Flink Apachecon PresentationFlink Apachecon Presentation
Flink Apachecon Presentation
 
cyclades eswc2016
cyclades eswc2016cyclades eswc2016
cyclades eswc2016
 
Plank
PlankPlank
Plank
 
Tape Access Optimization With TReqS
Tape Access Optimization With TReqSTape Access Optimization With TReqS
Tape Access Optimization With TReqS
 
project_531
project_531project_531
project_531
 
FEC & File Multicast
FEC & File MulticastFEC & File Multicast
FEC & File Multicast
 
An Optics Life
An Optics LifeAn Optics Life
An Optics Life
 
CS589 paper presentation - What is in unison? A formal specification and refe...
CS589 paper presentation - What is in unison? A formal specification and refe...CS589 paper presentation - What is in unison? A formal specification and refe...
CS589 paper presentation - What is in unison? A formal specification and refe...
 
TransPAC3/ACE Measurement & PerfSONAR Update
TransPAC3/ACE Measurement & PerfSONAR UpdateTransPAC3/ACE Measurement & PerfSONAR Update
TransPAC3/ACE Measurement & PerfSONAR Update
 
Diminuendo! Tactics in Support of FaaS Migrations Slides
Diminuendo! Tactics in Support of FaaS Migrations SlidesDiminuendo! Tactics in Support of FaaS Migrations Slides
Diminuendo! Tactics in Support of FaaS Migrations Slides
 
Stateful stream processing made easy with Apache Flink. - A.Mancini F.Tosi - ...
Stateful stream processing made easy with Apache Flink. - A.Mancini F.Tosi - ...Stateful stream processing made easy with Apache Flink. - A.Mancini F.Tosi - ...
Stateful stream processing made easy with Apache Flink. - A.Mancini F.Tosi - ...
 
Stateful stream processing made easy with Apache Flink
Stateful stream processing made easy with Apache FlinkStateful stream processing made easy with Apache Flink
Stateful stream processing made easy with Apache Flink
 
Transport layer (computer networks)
Transport layer (computer networks)Transport layer (computer networks)
Transport layer (computer networks)
 
Chapter 3 v6.01
Chapter 3 v6.01Chapter 3 v6.01
Chapter 3 v6.01
 
AIMeetup #4: Neural-machine-translation
AIMeetup #4: Neural-machine-translationAIMeetup #4: Neural-machine-translation
AIMeetup #4: Neural-machine-translation
 
Learning spark ch10 - Spark Streaming
Learning spark ch10 - Spark StreamingLearning spark ch10 - Spark Streaming
Learning spark ch10 - Spark Streaming
 
ADMS'13 High-Performance Holistic XML Twig Filtering Using GPUs
ADMS'13  High-Performance Holistic XML Twig Filtering Using GPUsADMS'13  High-Performance Holistic XML Twig Filtering Using GPUs
ADMS'13 High-Performance Holistic XML Twig Filtering Using GPUs
 

More from Flink Forward

Building a fully managed stream processing platform on Flink at scale for Lin...
Building a fully managed stream processing platform on Flink at scale for Lin...Building a fully managed stream processing platform on Flink at scale for Lin...
Building a fully managed stream processing platform on Flink at scale for Lin...Flink Forward
 
Evening out the uneven: dealing with skew in Flink
Evening out the uneven: dealing with skew in FlinkEvening out the uneven: dealing with skew in Flink
Evening out the uneven: dealing with skew in FlinkFlink Forward
 
“Alexa, be quiet!”: End-to-end near-real time model building and evaluation i...
“Alexa, be quiet!”: End-to-end near-real time model building and evaluation i...“Alexa, be quiet!”: End-to-end near-real time model building and evaluation i...
“Alexa, be quiet!”: End-to-end near-real time model building and evaluation i...Flink Forward
 
Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...
Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...
Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...Flink Forward
 
Introducing the Apache Flink Kubernetes Operator
Introducing the Apache Flink Kubernetes OperatorIntroducing the Apache Flink Kubernetes Operator
Introducing the Apache Flink Kubernetes OperatorFlink Forward
 
Autoscaling Flink with Reactive Mode
Autoscaling Flink with Reactive ModeAutoscaling Flink with Reactive Mode
Autoscaling Flink with Reactive ModeFlink Forward
 
Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...
Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...
Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...Flink Forward
 
One sink to rule them all: Introducing the new Async Sink
One sink to rule them all: Introducing the new Async SinkOne sink to rule them all: Introducing the new Async Sink
One sink to rule them all: Introducing the new Async SinkFlink Forward
 
Tuning Apache Kafka Connectors for Flink.pptx
Tuning Apache Kafka Connectors for Flink.pptxTuning Apache Kafka Connectors for Flink.pptx
Tuning Apache Kafka Connectors for Flink.pptxFlink Forward
 
Flink powered stream processing platform at Pinterest
Flink powered stream processing platform at PinterestFlink powered stream processing platform at Pinterest
Flink powered stream processing platform at PinterestFlink Forward
 
Apache Flink in the Cloud-Native Era
Apache Flink in the Cloud-Native EraApache Flink in the Cloud-Native Era
Apache Flink in the Cloud-Native EraFlink Forward
 
Where is my bottleneck? Performance troubleshooting in Flink
Where is my bottleneck? Performance troubleshooting in FlinkWhere is my bottleneck? Performance troubleshooting in Flink
Where is my bottleneck? Performance troubleshooting in FlinkFlink Forward
 
Using the New Apache Flink Kubernetes Operator in a Production Deployment
Using the New Apache Flink Kubernetes Operator in a Production DeploymentUsing the New Apache Flink Kubernetes Operator in a Production Deployment
Using the New Apache Flink Kubernetes Operator in a Production DeploymentFlink Forward
 
The Current State of Table API in 2022
The Current State of Table API in 2022The Current State of Table API in 2022
The Current State of Table API in 2022Flink Forward
 
Flink SQL on Pulsar made easy
Flink SQL on Pulsar made easyFlink SQL on Pulsar made easy
Flink SQL on Pulsar made easyFlink Forward
 
Dynamic Rule-based Real-time Market Data Alerts
Dynamic Rule-based Real-time Market Data AlertsDynamic Rule-based Real-time Market Data Alerts
Dynamic Rule-based Real-time Market Data AlertsFlink Forward
 
Exactly-Once Financial Data Processing at Scale with Flink and Pinot
Exactly-Once Financial Data Processing at Scale with Flink and PinotExactly-Once Financial Data Processing at Scale with Flink and Pinot
Exactly-Once Financial Data Processing at Scale with Flink and PinotFlink Forward
 
Processing Semantically-Ordered Streams in Financial Services
Processing Semantically-Ordered Streams in Financial ServicesProcessing Semantically-Ordered Streams in Financial Services
Processing Semantically-Ordered Streams in Financial ServicesFlink Forward
 
Tame the small files problem and optimize data layout for streaming ingestion...
Tame the small files problem and optimize data layout for streaming ingestion...Tame the small files problem and optimize data layout for streaming ingestion...
Tame the small files problem and optimize data layout for streaming ingestion...Flink Forward
 
Batch Processing at Scale with Flink & Iceberg
Batch Processing at Scale with Flink & IcebergBatch Processing at Scale with Flink & Iceberg
Batch Processing at Scale with Flink & IcebergFlink Forward
 

More from Flink Forward (20)

Building a fully managed stream processing platform on Flink at scale for Lin...
Building a fully managed stream processing platform on Flink at scale for Lin...Building a fully managed stream processing platform on Flink at scale for Lin...
Building a fully managed stream processing platform on Flink at scale for Lin...
 
Evening out the uneven: dealing with skew in Flink
Evening out the uneven: dealing with skew in FlinkEvening out the uneven: dealing with skew in Flink
Evening out the uneven: dealing with skew in Flink
 
“Alexa, be quiet!”: End-to-end near-real time model building and evaluation i...
“Alexa, be quiet!”: End-to-end near-real time model building and evaluation i...“Alexa, be quiet!”: End-to-end near-real time model building and evaluation i...
“Alexa, be quiet!”: End-to-end near-real time model building and evaluation i...
 
Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...
Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...
Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...
 
Introducing the Apache Flink Kubernetes Operator
Introducing the Apache Flink Kubernetes OperatorIntroducing the Apache Flink Kubernetes Operator
Introducing the Apache Flink Kubernetes Operator
 
Autoscaling Flink with Reactive Mode
Autoscaling Flink with Reactive ModeAutoscaling Flink with Reactive Mode
Autoscaling Flink with Reactive Mode
 
Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...
Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...
Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...
 
One sink to rule them all: Introducing the new Async Sink
One sink to rule them all: Introducing the new Async SinkOne sink to rule them all: Introducing the new Async Sink
One sink to rule them all: Introducing the new Async Sink
 
Tuning Apache Kafka Connectors for Flink.pptx
Tuning Apache Kafka Connectors for Flink.pptxTuning Apache Kafka Connectors for Flink.pptx
Tuning Apache Kafka Connectors for Flink.pptx
 
Flink powered stream processing platform at Pinterest
Flink powered stream processing platform at PinterestFlink powered stream processing platform at Pinterest
Flink powered stream processing platform at Pinterest
 
Apache Flink in the Cloud-Native Era
Apache Flink in the Cloud-Native EraApache Flink in the Cloud-Native Era
Apache Flink in the Cloud-Native Era
 
Where is my bottleneck? Performance troubleshooting in Flink
Where is my bottleneck? Performance troubleshooting in FlinkWhere is my bottleneck? Performance troubleshooting in Flink
Where is my bottleneck? Performance troubleshooting in Flink
 
Using the New Apache Flink Kubernetes Operator in a Production Deployment
Using the New Apache Flink Kubernetes Operator in a Production DeploymentUsing the New Apache Flink Kubernetes Operator in a Production Deployment
Using the New Apache Flink Kubernetes Operator in a Production Deployment
 
The Current State of Table API in 2022
The Current State of Table API in 2022The Current State of Table API in 2022
The Current State of Table API in 2022
 
Flink SQL on Pulsar made easy
Flink SQL on Pulsar made easyFlink SQL on Pulsar made easy
Flink SQL on Pulsar made easy
 
Dynamic Rule-based Real-time Market Data Alerts
Dynamic Rule-based Real-time Market Data AlertsDynamic Rule-based Real-time Market Data Alerts
Dynamic Rule-based Real-time Market Data Alerts
 
Exactly-Once Financial Data Processing at Scale with Flink and Pinot
Exactly-Once Financial Data Processing at Scale with Flink and PinotExactly-Once Financial Data Processing at Scale with Flink and Pinot
Exactly-Once Financial Data Processing at Scale with Flink and Pinot
 
Processing Semantically-Ordered Streams in Financial Services
Processing Semantically-Ordered Streams in Financial ServicesProcessing Semantically-Ordered Streams in Financial Services
Processing Semantically-Ordered Streams in Financial Services
 
Tame the small files problem and optimize data layout for streaming ingestion...
Tame the small files problem and optimize data layout for streaming ingestion...Tame the small files problem and optimize data layout for streaming ingestion...
Tame the small files problem and optimize data layout for streaming ingestion...
 
Batch Processing at Scale with Flink & Iceberg
Batch Processing at Scale with Flink & IcebergBatch Processing at Scale with Flink & Iceberg
Batch Processing at Scale with Flink & Iceberg
 

Recently uploaded

High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...soniya singh
 
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptdokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptSonatrach
 
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...Suhani Kapoor
 
Ukraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICSUkraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICSAishani27
 
Call Girls In Noida City Center Metro 24/7✡️9711147426✡️ Escorts Service
Call Girls In Noida City Center Metro 24/7✡️9711147426✡️ Escorts ServiceCall Girls In Noida City Center Metro 24/7✡️9711147426✡️ Escorts Service
Call Girls In Noida City Center Metro 24/7✡️9711147426✡️ Escorts Servicejennyeacort
 
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...Florian Roscheck
 
100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptxAnupama Kate
 
PKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptxPKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptxPramod Kumar Srivastava
 
Unveiling Insights: The Role of a Data Analyst
Unveiling Insights: The Role of a Data AnalystUnveiling Insights: The Role of a Data Analyst
Unveiling Insights: The Role of a Data AnalystSamantha Rae Coolbeth
 
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...Sapana Sha
 
Data Science Jobs and Salaries Analysis.pptx
Data Science Jobs and Salaries Analysis.pptxData Science Jobs and Salaries Analysis.pptx
Data Science Jobs and Salaries Analysis.pptxFurkanTasci3
 
Full night 🥵 Call Girls Delhi New Friends Colony {9711199171} Sanya Reddy ✌️o...
Full night 🥵 Call Girls Delhi New Friends Colony {9711199171} Sanya Reddy ✌️o...Full night 🥵 Call Girls Delhi New Friends Colony {9711199171} Sanya Reddy ✌️o...
Full night 🥵 Call Girls Delhi New Friends Colony {9711199171} Sanya Reddy ✌️o...shivangimorya083
 
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改atducpo
 
{Pooja: 9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
{Pooja:  9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...{Pooja:  9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
{Pooja: 9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...Pooja Nehwal
 
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfMarket Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfRachmat Ramadhan H
 
Call Girls In Mahipalpur O9654467111 Escorts Service
Call Girls In Mahipalpur O9654467111  Escorts ServiceCall Girls In Mahipalpur O9654467111  Escorts Service
Call Girls In Mahipalpur O9654467111 Escorts ServiceSapana Sha
 
VIP High Class Call Girls Bikaner Anushka 8250192130 Independent Escort Servi...
VIP High Class Call Girls Bikaner Anushka 8250192130 Independent Escort Servi...VIP High Class Call Girls Bikaner Anushka 8250192130 Independent Escort Servi...
VIP High Class Call Girls Bikaner Anushka 8250192130 Independent Escort Servi...Suhani Kapoor
 

Recently uploaded (20)

High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
 
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptdokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
 
Russian Call Girls Dwarka Sector 15 💓 Delhi 9999965857 @Sabina Modi VVIP MODE...
Russian Call Girls Dwarka Sector 15 💓 Delhi 9999965857 @Sabina Modi VVIP MODE...Russian Call Girls Dwarka Sector 15 💓 Delhi 9999965857 @Sabina Modi VVIP MODE...
Russian Call Girls Dwarka Sector 15 💓 Delhi 9999965857 @Sabina Modi VVIP MODE...
 
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
 
Ukraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICSUkraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICS
 
Call Girls In Noida City Center Metro 24/7✡️9711147426✡️ Escorts Service
Call Girls In Noida City Center Metro 24/7✡️9711147426✡️ Escorts ServiceCall Girls In Noida City Center Metro 24/7✡️9711147426✡️ Escorts Service
Call Girls In Noida City Center Metro 24/7✡️9711147426✡️ Escorts Service
 
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
 
100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx
 
PKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptxPKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptx
 
E-Commerce Order PredictionShraddha Kamble.pptx
E-Commerce Order PredictionShraddha Kamble.pptxE-Commerce Order PredictionShraddha Kamble.pptx
E-Commerce Order PredictionShraddha Kamble.pptx
 
Unveiling Insights: The Role of a Data Analyst
Unveiling Insights: The Role of a Data AnalystUnveiling Insights: The Role of a Data Analyst
Unveiling Insights: The Role of a Data Analyst
 
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
 
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
 
Data Science Jobs and Salaries Analysis.pptx
Data Science Jobs and Salaries Analysis.pptxData Science Jobs and Salaries Analysis.pptx
Data Science Jobs and Salaries Analysis.pptx
 
Full night 🥵 Call Girls Delhi New Friends Colony {9711199171} Sanya Reddy ✌️o...
Full night 🥵 Call Girls Delhi New Friends Colony {9711199171} Sanya Reddy ✌️o...Full night 🥵 Call Girls Delhi New Friends Colony {9711199171} Sanya Reddy ✌️o...
Full night 🥵 Call Girls Delhi New Friends Colony {9711199171} Sanya Reddy ✌️o...
 
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
 
{Pooja: 9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
{Pooja:  9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...{Pooja:  9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
{Pooja: 9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
 
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfMarket Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
 
Call Girls In Mahipalpur O9654467111 Escorts Service
Call Girls In Mahipalpur O9654467111  Escorts ServiceCall Girls In Mahipalpur O9654467111  Escorts Service
Call Girls In Mahipalpur O9654467111 Escorts Service
 
VIP High Class Call Girls Bikaner Anushka 8250192130 Independent Escort Servi...
VIP High Class Call Girls Bikaner Anushka 8250192130 Independent Escort Servi...VIP High Class Call Girls Bikaner Anushka 8250192130 Independent Escort Servi...
VIP High Class Call Girls Bikaner Anushka 8250192130 Independent Escort Servi...
 

Flink Forward Berlin 2017: Francesco Versaci - Integrating Flink and Kafka in the standard genomics pipeline

  • 1. Flink in Genomics Integrating Flink and Kafka in the standard genomic pipeline F. VERSACI L. PIREDDU G. ZANETTI – Flink Forward 2017 – 13 September 2017 Francesco Versaci (CRS4) Flink in Genomics 13 September 2017 1 / 19
  • 2. CRS4 Centro di Ricerca, Sviluppo e Studi Superiori in Sardegna CRS4 Research center in Sardinia, Italy Focus on big data, biosciences, HPC, visual computing, energy and environment Francesco Versaci (CRS4) Flink in Genomics 13 September 2017 2 / 19
  • 3. Outline 1 Introduction 2 Implementation 3 Experimental evaluation Francesco Versaci (CRS4) Flink in Genomics 13 September 2017 3 / 19
  • 4. Next-Generation Sequencing Cost Genome sequencing is now much cheaper than in the past About 1000 euros per whole human genome 103 104 105 106 107 108 Cost[USD] 2001 2004 2007 2010 2013 2016 Year (Data from https://www.genome.gov/sequencingcosts/) Francesco Versaci (CRS4) Flink in Genomics 13 September 2017 4 / 19
  • 5. Next-Generation Sequencing Applications High-throughput DNA sequencing has many applications, including Research into understanding human genetic diseases Medicine, e.g., oncology, clinical pathology, ... Human phylogeny Personalized diagnostic applications Huge amount of data A single sequencer can produce 1 TB/day of data Which need to be converted, filtered, aggregated, reconstructed, analyzed, ... A single genomic center can have hundreds of sequencers Francesco Versaci (CRS4) Flink in Genomics 13 September 2017 5 / 19
  • 6. Shotgun genome sequencing The DNA is a sequence of four bases: Adenine, Cytosine, Guanine and Thymine (A, C, G and T) The genome is broken up into short fragments (reads) The fragments are attached to a support (tile) Fluorescent molecules are iteratively attached to the bases At each cycle, the machine acquires (optically) a single base from all the fragments (CC-BY-3.0 image from https://goo.gl/9dqN8g) Francesco Versaci (CRS4) Flink in Genomics 13 September 2017 6 / 19
  • 7. Data processing in NGS First steps and standard pipeline Pre-processing Reads need to be reconstructed from the raw, cycle-based, sequencer’s output (BCL files) Alignment The short reads are aligned to a reference genome When using Illumina sequencers, the standard pipeline starts with two programs: bcl2fastq2 Proprietary, open-source tool by Illumina to convert raw BCL data to FASTQ format BWA-MEM Free (GPLv3) aligner to reconstruct the full genomic sequence based on the short reads generated by the sequencer Problems Parallel tools, but shared-memory (single node) Communication via intermediate files: Different steps must be run sequentially New data requires re-computing the workflow from scratch Francesco Versaci (CRS4) Flink in Genomics 13 September 2017 7 / 19
  • 8. Our approach Module 1: Data pre-processing Module 2: Reads alignment Raw BCL Transposition and filtering FASTQ Kafka broker Data retrieval RAPI BWA/MEM alignment CRAM Data pre-processing Flink program, written from scratch (see FlinkForward 2016 and BigData 20161) Modified to sends its output to Kafka It creates many (thousands) topics Reads alignment Flink program, powered by RAPI and BWA/MEM Reads from Kafka Writes to Kafka or HDFS 1 Francesco Versaci, Luca Pireddu, and Gianluigi Zanetti (2016). “Scalable genomics: From raw data to aligned reads on Apache YARN”. In: IEEE BigData 2016, pp. 1232–1241 Francesco Versaci (CRS4) Flink in Genomics 13 September 2017 8 / 19
  • 9. Outline 1 Introduction 2 Implementation 3 Experimental evaluation Francesco Versaci (CRS4) Flink in Genomics 13 September 2017 9 / 19
  • 10. Data pre-processing Bases and quality scores are decoded A data transposition is performed to obtain the reads For each tile several new Kafka data topics are created The list of the new topics is sent to a special control topic The reads are written into the data topics Finally, an EOS marker is appended to each data topic File 1 File 2 File 3 File 4 File 5 A A G C T TCGG CAGA CAGA TGAT CGTC Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5 Location5 Location4 Location3 Location2 Location1 A A G C T C A A G G G G G A T G A A T C T C C T C Read 1 Read 2 etc... Sending finite streams Each data topic will contain a finite amount of data Kafka streams are potentially infinite Hence we need to send a EOS when a data topic is completed Francesco Versaci (CRS4) Flink in Genomics 13 September 2017 10 / 19
  • 11. Reads alignment The control topic is polled every 3 seconds, to receive notice of newly opened data topics The data topics are grouped into Flink jobs and aligned in parallel For each data topic, the reads are chunked together and aligned The aligned reads are sent to the output sink (e.g., written as CRAM files). The number of data topics is not known in advance When a data topic is completed the corresponding DataSource needs to be terminated To improve the aligner performance the chunks should be the same size Francesco Versaci (CRS4) Flink in Genomics 13 September 2017 11 / 19
  • 12. Handling finite streams Our experience Finite streams do not work as (we) intended with count windows (last window is not evaluated) There is an easy work-around which uses event time windows val env = StreamExecutionEnvironment . getExecutionEnvironment env . set StreamTimeCharacteristic ( Ti me C har ac ter ist ic . EventTime ) val consumer = new FlinkKafkaConsumer010 [ MyType ] ( topicname , new MyDeserializer , p r o p e r t i e s ) . assignTimestampsAndWatermarks ( new MyTStamper ) val ds = env . addSource ( consumer ) val out = ds . timeWindowAll ( Time . milliseconds (1024)) . apply ( new MyAllWindowFunction ) / / do something with the output / / . . . Francesco Versaci (CRS4) Flink in Genomics 13 September 2017 12 / 19
  • 13. Handling finite streams Deserialization c l a s s MyDeserializer extends DeserializationSchema [ MyType ] { override def getProducedType = TypeInformation . of ( classOf [ MyType ] ) override def isEndOfStream ( el : MyType ) : Boolean = { return el . isEOS } override def d e s e r i a l i z e ( data : Array [ Byte ] ) : MyType = { / / d e s e r i a l i z a t i o n code here } } Francesco Versaci (CRS4) Flink in Genomics 13 September 2017 13 / 19
  • 14. Handling finite streams Timestamping c l a s s MyTStamper extends AssignerWithPeriodicWatermarks [ MyType ] { var cur = 0 l override def extractTimestamp ( el : MyType , prev : Long ) : Long = { val r = cur cur += 1 return r } override def getCurrentWatermark : Watermark = { new Watermark ( cur − 1) } } Francesco Versaci (CRS4) Flink in Genomics 13 September 2017 14 / 19
  • 15. Outline 1 Introduction 2 Implementation 3 Experimental evaluation Francesco Versaci (CRS4) Flink in Genomics 13 September 2017 15 / 19
  • 16. Experimental setup Experiments run on the Amazon Elastic Compute Cloud (EC2) Up to 12 instances of i3.8xlarge machines CPUs 32 virtual cores (Xeon E5-2686 v4, 45 MB cache) RAM 244 GiB Disks 4 x 1.9 TB NVMe SSD Network 10 Gb Ethernet OS Ubuntu 16.04.2 LTS (Linux kernel 4.4.0-1013-aws) Input dataset Sequencer Illumina HiSeq 3000 Size of data files 47.8 GB (gzip-compressed) Number of files 47,040 Number of samples 12 Number of reads 3.38 · 108 Number of DNA bases 3.38 · 1010 Francesco Versaci (CRS4) Flink in Genomics 13 September 2017 16 / 19
  • 17. Results Strong scalability Nodes Time (minutes) Baseline 137 1 152 2 77 4 39.6 8 20.4 12 14.3 1 2 4 6 8 10 12 Speed-up 1 2 4 8 12 Number of nodes Perfect scalability Absolute scalability Relative scalability Same input, variable number of nodes On single node: 11% slower than the baseline On more nodes: almost linear scalability Francesco Versaci (CRS4) Flink in Genomics 13 September 2017 17 / 19
  • 18. Load on Kafka We also tested the Kafka broker under stress, using 4 nodes Node-1 Running the Kafka broker and Flink Job-Manager Node-[2-4] Running the preprocessor at full speed (32 concurrent jobs per node) Data flow of 450 MB/s towards the broker for about 3 minutes 2688 topics created 340 million messages sent The CPU load was about 900% Assuming linearity, the network bandwidth is saturated at about 25 cores (10 Gb/s = 1250 MB/s) Limits to scalability An aligning node at maximum power (32 cores at 100%) requires 10 MB/s of data A single Kafka broker can sustain up to 125 nodes Francesco Versaci (CRS4) Flink in Genomics 13 September 2017 18 / 19
  • 19. Conclusion and future work We have presented a fully scalable streaming pipeline to process raw Illumina NGS data up to the stage of aligning reads To facilitate the adoption of scalable stream processing in genomics... ...we aim at integrating our pipeline with the upcoming Genome Analysis Toolkit 4 (GATK4, a Spark-based genomic analysis tool) Francesco Versaci (CRS4) Flink in Genomics 13 September 2017 19 / 19
  • 20. Conclusion and future work We have presented a fully scalable streaming pipeline to process raw Illumina NGS data up to the stage of aligning reads To facilitate the adoption of scalable stream processing in genomics... ...we aim at integrating our pipeline with the upcoming Genome Analysis Toolkit 4 (GATK4, a Spark-based genomic analysis tool) Thanks for your attention! Francesco Versaci (CRS4) Flink in Genomics 13 September 2017 19 / 19