High-throughput DNA sequencing is a key data acquisition technology that enables dozens of important applications, from oncology to personalized diagnostics. We extended work presented last year, porting additional portions of the standard genomics data processing pipeline to Flink. Our Flink-based processor consists of two distinct specialized modules (reader and writer) that are loosely coupled via Kafka streams, allowing for easy composability and integration into existing Hadoop workflows. To extend our work we had to manage the dynamic creation and detection of data streams: the set of output files is not known in advance by the writer, which learns it at run time. Particular care had to be taken to handle the finite nature of the genomic streams: since we reuse existing Hadoop output formats, we had to propagate end-of-stream markers correctly through Flink and Kafka so that the final output files are properly finalized.
Flink Forward Berlin 2017: Francesco Versaci - Integrating Flink and Kafka in the standard genomics pipeline
1. Flink in Genomics
Integrating Flink and Kafka in the standard genomic pipeline
F. VERSACI L. PIREDDU G. ZANETTI
– Flink Forward 2017 –
13 September 2017
Francesco Versaci (CRS4) Flink in Genomics 13 September 2017 1 / 19
2. CRS4
Centro di Ricerca, Sviluppo e Studi Superiori in Sardegna
CRS4
Research center in Sardinia, Italy
Focus on big data, biosciences, HPC, visual computing, energy and environment
4. Next-Generation Sequencing
Cost
Genome sequencing is now much cheaper than in the past
About 1000 euros per whole human genome
[Figure: sequencing cost per genome over time, log scale from 10^3 to 10^8 USD, years 2001-2016. Data from https://www.genome.gov/sequencingcosts/]
5. Next-Generation Sequencing
Applications
High-throughput DNA sequencing has many applications, including
Research into understanding human genetic diseases
Medicine, e.g., oncology, clinical pathology, ...
Human phylogeny
Personalized diagnostic applications
Huge amount of data
A single sequencer can produce 1 TB/day of data
Which need to be converted, filtered, aggregated, reconstructed, analyzed, ...
A single genomic center can have hundreds of sequencers
6. Shotgun genome sequencing
DNA is a sequence of four bases: Adenine, Cytosine, Guanine and Thymine (A, C, G and T)
The genome is broken up into short fragments (reads)
The fragments are attached to a support (tile)
Fluorescent molecules are iteratively attached to the bases
At each cycle, the machine acquires (optically) a single base
from all the fragments
(CC-BY-3.0 image from https://goo.gl/9dqN8g)
7. Data processing in NGS
First steps and standard pipeline
Pre-processing Reads need to be reconstructed from the raw, cycle-based sequencer output (BCL files)
Alignment The short reads are aligned to a reference genome
When using Illumina sequencers, the standard pipeline starts with two programs:
bcl2fastq2 Proprietary (source-available) tool by Illumina that converts raw BCL data to the FASTQ format
BWA-MEM Free (GPLv3) aligner to reconstruct the full genomic sequence based on the
short reads generated by the sequencer
Problems
Parallel tools, but shared-memory (single node)
Communication via intermediate files:
Different steps must be run sequentially
New data requires re-computing the workflow from scratch
8. Our approach
Module 1 (data pre-processing): raw BCL → transposition and filtering → FASTQ → Kafka broker
Module 2 (reads alignment): Kafka broker → data retrieval → RAPI BWA/MEM alignment → CRAM
Data pre-processing
Flink program, written from scratch (see Flink Forward 2016 and BigData 2016¹)
Modified to send its output to Kafka
It creates many (thousands of) topics
Reads alignment
Flink program, powered by RAPI and
BWA/MEM
Reads from Kafka
Writes to Kafka or HDFS
¹ Francesco Versaci, Luca Pireddu, and Gianluigi Zanetti (2016). “Scalable genomics: From raw data to aligned reads on Apache YARN”. In: IEEE BigData 2016, pp. 1232–1241.
10. Data pre-processing
Bases and quality scores are decoded
A data transposition is performed to obtain the reads
For each tile several new Kafka data topics are created
The list of the new topics is sent to a special control
topic
The reads are written into the data topics
Finally, an EOS marker is appended to each data topic
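The transposition step above can be illustrated with a minimal, self-contained Scala sketch (the object and method names are hypothetical, not the actual pipeline code): each input string holds the base called at every location during one cycle, and transposing yields one read per location.

```scala
// Minimal sketch of the BCL transposition step (hypothetical names).
// Input: one string per cycle, one base per location (cycle-major).
// Output: one string per location, i.e. one read (read-major).
object Transpose {
  def cyclesToReads(cycles: Seq[String]): Seq[String] = {
    require(cycles.nonEmpty && cycles.forall(_.length == cycles.head.length))
    // character i of every cycle, concatenated in cycle order, is read i
    (0 until cycles.head.length).map(i => cycles.map(_(i)).mkString)
  }
}
```

For example, `Transpose.cyclesToReads(Seq("ACG", "TTA"))` yields the three reads `"AT"`, `"CT"`, `"GA"`.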
[Figure: transposition example with five input files (Cycle 1 to Cycle 5, one base per location each); transposing by location yields the reads, e.g. Read 1 = AAGCT, Read 2 = CAAGG, ...]
Sending finite streams
Each data topic will contain a finite amount of data
Kafka streams are potentially infinite
Hence we need to send an EOS marker when a data topic is completed
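The protocol above (announce the topic, write the reads, then mark completion) can be simulated in plain Scala, with in-memory queues standing in for Kafka topics; this is a sketch of the protocol only, and the names, record type, and EOS encoding are hypothetical:

```scala
import scala.collection.mutable

// Simulation of the dynamic-topic protocol: in-memory queues stand in
// for Kafka topics (no real Kafka producer involved; names hypothetical).
object TopicProtocol {
  val EOS = "<EOS>" // end-of-stream marker record

  def produce(tiles: Map[String, Seq[String]])
      : (mutable.Queue[String], Map[String, mutable.Queue[String]]) = {
    val control = mutable.Queue[String]() // the special control topic
    val data = tiles.map { case (tile, reads) =>
      val topic = mutable.Queue[String]()
      control.enqueue(tile)               // announce the new data topic
      reads.foreach(r => topic.enqueue(r)) // write the reads
      topic.enqueue(EOS)                  // finally, mark the topic complete
      tile -> topic
    }
    (control, data)
  }
}
```

A consumer that drains a data topic can then stop as soon as it dequeues the EOS record, which is how the finite stream gets terminated downstream.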
11. Reads alignment
The control topic is polled every 3 seconds to receive notice of newly opened data topics
The data topics are grouped into Flink jobs and
aligned in parallel
For each data topic, the reads are chunked
together and aligned
The aligned reads are sent to the output sink (e.g.,
written as CRAM files).
The number of data topics is not known in advance
When a data topic is completed, the corresponding DataSource needs to be terminated
To improve the aligner performance, the chunks should be the same size
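The equally sized chunking can be sketched as follows (a minimal sketch; the helper name and chunk size are hypothetical, and the last chunk may be smaller so that no read is lost):

```scala
// Group reads into fixed-size chunks for the aligner.
// All chunks have `size` elements except possibly the last one.
object Chunker {
  def chunk[A](reads: Seq[A], size: Int): Seq[Seq[A]] = {
    require(size > 0)
    reads.grouped(size).toSeq
  }
}
```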
12. Handling finite streams
Our experience
Finite streams do not work as (we) intended with count windows (the last, incomplete window is not evaluated)
There is an easy work-around which uses event time windows
val env = StreamExecutionEnvironment.getExecutionEnvironment
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)
val consumer = new FlinkKafkaConsumer010[MyType](
    topicname, new MyDeserializer, properties)
  .assignTimestampsAndWatermarks(new MyTStamper)
val ds = env.addSource(consumer)
val out = ds.timeWindowAll(Time.milliseconds(1024))
  .apply(new MyAllWindowFunction)
// do something with the output
// ...
13. Handling finite streams
Deserialization
class MyDeserializer extends DeserializationSchema[MyType] {
  override def getProducedType =
    TypeInformation.of(classOf[MyType])
  override def isEndOfStream(el: MyType): Boolean = {
    return el.isEOS
  }
  override def deserialize(data: Array[Byte]): MyType = {
    // deserialization code here
    ???
  }
}
14. Handling finite streams
Timestamping
class MyTStamper extends AssignerWithPeriodicWatermarks[MyType] {
  var cur = 0L
  override def extractTimestamp(el: MyType, prev: Long): Long = {
    val r = cur
    cur += 1
    return r
  }
  override def getCurrentWatermark: Watermark = {
    new Watermark(cur - 1)
  }
}
16. Experimental setup
Experiments run on the Amazon Elastic Compute
Cloud (EC2)
Up to 12 instances of i3.8xlarge machines:
CPUs: 32 virtual cores (Xeon E5-2686 v4, 45 MB cache)
RAM: 244 GiB
Disks: 4 x 1.9 TB NVMe SSD
Network: 10 Gb Ethernet
OS: Ubuntu 16.04.2 LTS (Linux kernel 4.4.0-1013-aws)
Input dataset
Sequencer: Illumina HiSeq 3000
Size of data files: 47.8 GB (gzip-compressed)
Number of files: 47,040
Number of samples: 12
Number of reads: 3.38 · 10^8
Number of DNA bases: 3.38 · 10^10
17. Results
Strong scalability
Nodes Time (minutes)
Baseline 137
1 152
2 77
4 39.6
8 20.4
12 14.3
[Figure: speed-up vs. number of nodes (1 to 12), showing perfect, absolute, and relative scalability curves]
Same input, variable number of nodes
On single node: 11% slower than the baseline
On more nodes: almost linear scalability
18. Load on Kafka
We also tested the Kafka
broker under stress, using
4 nodes
Node-1 Running the
Kafka broker and Flink
Job-Manager
Node-[2-4] Running the
preprocessor at full
speed (32 concurrent
jobs per node)
Data flow of 450 MB/s towards the broker for about 3
minutes
2688 topics created
340 million messages sent
The CPU load was about 900%
Assuming linearity, the network bandwidth is saturated at
about 25 cores (10 Gb/s = 1250 MB/s)
Limits to scalability
An aligning node at maximum power (32 cores at 100%)
requires 10 MB/s of data
A single Kafka broker can sustain up to 125 nodes
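These capacity figures follow from simple arithmetic, assuming linear scaling as above; a small Scala check (the object and value names are hypothetical, the numbers come from the measurements on this slide):

```scala
// Back-of-the-envelope Kafka broker capacity, assuming linear scaling.
object Capacity {
  val networkMBs   = 1250.0 // 10 Gb/s Ethernet = 1250 MB/s
  val measuredMBs  = 450.0  // observed data flow into the broker
  val measuredLoad = 9.0    // ~900% CPU load, i.e. about 9 cores
  val nodeDemand   = 10.0   // MB/s consumed by one aligning node

  // cores needed for the broker to saturate the network bandwidth
  def saturationCores: Double = measuredLoad * networkMBs / measuredMBs
  // aligning nodes that one network-saturated broker can feed
  def maxNodes: Int = (networkMBs / nodeDemand).toInt
}
```

This reproduces the slide's estimates: about 25 cores to saturate the network, and up to 125 aligning nodes per broker.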
19. Conclusion and future work
We have presented a fully scalable streaming pipeline to process raw Illumina NGS
data up to the stage of aligning reads
To facilitate the adoption of scalable stream processing in genomics, we aim to integrate our pipeline with the upcoming Genome Analysis Toolkit 4 (GATK4, a Spark-based genomic analysis tool)
Thanks for your attention!