SeqPig scripting language for large bioinformatics datasets

presenting alternatives for processing large bioinformatics datasets


  1. SeqPig
     A simple and scalable scripting language for large sequencing data sets in Hadoop
     Arian Pasquali, June 6, 2014
  2. /me
     Arian Pasquali
     Master's student in Data Mining
     Data engineer at Semasio
     background
     - engineering
     - cloud computing
     - data mining on big data
     - social networks
  3. case study
     SeqPig: simple and scalable scripting for large sequencing data sets in Hadoop.
     Schumacher A, Pireddu L, Niemenmaa M, Kallio A, Korpelainen E, Zanetti G, Heljanko K.
     Bioinformatics. 2014 Jan 1;30(1):119-20. doi: 10.1093/bioinformatics/btt601. Epub 2013 Oct 22.
     http://www.ncbi.nlm.nih.gov/pubmed/24149054
  4. but first, some background
     ● Real-world bioinformatics datasets are huge
     ● Gigabytes to petabytes of data are hard to handle on a single computer
     ● To handle big data sets we have to master parallel programming models
  5. Parallel programming models
     some high-performance programming models
     - Serial (doesn’t scale)
     - MPI (expensive)
     - MapReduce on Hadoop (cheap and scalable)
  6. hadoop
     Hadoop is an open source framework that enables you to run MapReduce programs.
     It is designed to process huge volumes of data, on the order of terabytes or petabytes, which fits many bioinformatics scenarios well.
     http://hadoop.apache.org/
  7. how mapreduce works on hadoop
     Hadoop provides a framework for MapReduce, a fault-tolerant parallel programming model
     - easier to write programs than with other paradigms
     - easier means cheaper
     - runs on clusters of commodity hardware
     - scales horizontally
     - need more power? just add more nodes
  8. an application: BLAST algorithm
     MapReduce tasks
     - load data
     - map sequences
     - partition
     - reduce (merge)
     - output results
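The task list on the slide above (load, map, partition, reduce, output) can be illustrated with a toy Python sketch. It is not BLAST and does not use Hadoop: it only counts bases in a few in-memory sequences, and the function names (map_phase, partition, reduce_phase, run_job) are hypothetical, chosen for illustration.

```python
from collections import defaultdict


def map_phase(sequence):
    """Map: emit a (base, 1) pair for every base in one input sequence."""
    return [(base, 1) for base in sequence.upper()]


def partition(pairs, num_reducers):
    """Partition: route each key to one reducer and group its values."""
    buckets = [defaultdict(list) for _ in range(num_reducers)]
    for key, value in pairs:
        # Deterministic stand-in for a hash partitioner (an assumption
        # made for this sketch, not Hadoop's actual hash function).
        index = sum(key.encode()) % num_reducers
        buckets[index][key].append(value)
    return buckets


def reduce_phase(bucket):
    """Reduce: merge all values that share a key into a single count."""
    return {key: sum(values) for key, values in bucket.items()}


def run_job(sequences, num_reducers=2):
    # load data -> map -> partition -> reduce -> output results
    pairs = [pair for seq in sequences for pair in map_phase(seq)]
    result = {}
    for bucket in partition(pairs, num_reducers):
        result.update(reduce_phase(bucket))
    return result


counts = run_job(["ACGT", "aacg", "TTGA"])
print(sorted(counts.items()))
```

On a real cluster each phase runs on different machines and the framework handles data movement and failures; here everything runs in one process, so only the data flow is shown.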
  9. MapReduce is easier, but not trivial
  10. Apache Pig tries to solve that
      Apache Pig solves that.
      Under the hood it applies the MapReduce paradigm.
      It hides the pitfalls of writing MapReduce code.
  11. Pig version of the same code
  12. Apache Pig in Bioinformatics
      It is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs.
      It can be easier
  13. SeqPig
      Scalable scripting language based on Apache Pig for large-scale sequence analysis
  14. SeqPig
      ● a scripting language,
      ● a library,
      ● and a collection of tools to manipulate, analyze and query sequencing datasets in a scalable and simple manner
      http://seqpig.sourceforge.net/
  15. SeqPig and data format support
      Currently it supports
      BAM, SAM, FastQ and Qseq input and output
      FASTA input
  16. possible use cases
      ● converting data formats
      ● filtering regions of a chromosome
      ● computing base frequencies
      ● alignments
      ● collecting read-mapping-quality statistics
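To make the first use case concrete, here is a minimal single-machine sketch of a format conversion in plain Python. It does not use SeqPig or Hadoop; it assumes unwrapped four-line FASTQ records, and the function name fastq_to_fasta is hypothetical.

```python
import io


def fastq_to_fasta(fastq_handle, fasta_handle):
    """Convert four-line FASTQ records to FASTA, dropping quality strings."""
    written = 0
    while True:
        header = fastq_handle.readline().rstrip("\n")
        if not header:
            break  # end of input
        sequence = fastq_handle.readline().rstrip("\n")
        fastq_handle.readline()  # '+' separator line, ignored
        fastq_handle.readline()  # quality string, ignored
        if not header.startswith("@"):
            raise ValueError(f"unexpected FASTQ header: {header!r}")
        fasta_handle.write(f">{header[1:]}\n{sequence}\n")
        written += 1
    return written


fastq = io.StringIO("@read1\nACGT\n+\nIIII\n@read2\nGGCA\n+\nIIII\n")
fasta = io.StringIO()
n = fastq_to_fasta(fastq, fasta)
print(fasta.getvalue(), end="")
```

A loop like this reads one file sequentially, which is exactly what stops scaling for very large inputs; the point of a Hadoop-based tool is to spread the same record-by-record work across many nodes.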
  17. code example
      -- filter definitions used below
      run scripts/filter_defs.pig
      A = load 'input.bam' using BamLoader('yes');
      -- keep only mapped, non-duplicate reads
      B = FILTER A BY not ReadUnmapped(flags) and not IsDuplicate(flags);
      -- split reads into per-base records (ReadSplit returns a bag, FLATTEN unnests it)
      C = FOREACH B GENERATE ReadSplit(name,start,read,cigar,basequal,flags,mapqual,refindex,refname,attributes#'MD');
      D = FOREACH C GENERATE FLATTEN($0);
      base_stats_data = FOREACH D GENERATE refbase, basepos, UPPER(readbase) AS readbase;
      -- count each (reference base, position, read base) combination
      base_stats_grouped = GROUP base_stats_data BY (refbase, basepos, readbase);
      base_stats_grouped_count = FOREACH base_stats_grouped GENERATE group.$0 AS refbase, group.$1 AS basepos, group.$2 as readbase, COUNT($1) AS bcount;
      -- regroup by (reference base, position) and order read bases by count
      base_stats_grouped = GROUP base_stats_grouped_count by (refbase, basepos);
      base_stats = FOREACH base_stats_grouped {
          TMP1 = FOREACH base_stats_grouped_count GENERATE readbase, bcount;
          TMP2 = ORDER TMP1 BY bcount desc;
          GENERATE group.$0, group.$1, TMP2;
      }
      STORE base_stats into 'outputfile_readstats.txt';
  18. results
      A 0 {(A,19),(G,2)}
      A 1 {(A,10)}
      A 2 {(A,18)}
      A 3 {(A,16)}
      A 4 {(A,14)}
      A 5 {(A,15)}
      A 6 {(A,16),(G,2)}
      ...
      A 98 {(A,7)}
      A 99 {(A,14)}
      C 0 {(C,6)}
      C 1 {(C,11)}
      C 2 {(C,9)}
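The grouping logic behind the results above can be mirrored on toy data with a short Python sketch. This is an in-memory analogue of the two GROUP steps and the nested ORDER in the Pig script, not SeqPig itself; the function name base_stats and the sample tuples are invented for illustration.

```python
from collections import Counter, defaultdict


def base_stats(observations):
    """observations: iterable of (refbase, basepos, readbase) tuples."""
    # First GROUP + COUNT: occurrences of each (refbase, basepos, readbase).
    counts = Counter((ref, pos, read.upper()) for ref, pos, read in observations)
    # Second GROUP: collect (readbase, bcount) pairs per (refbase, basepos).
    grouped = defaultdict(list)
    for (ref, pos, read), bcount in counts.items():
        grouped[(ref, pos)].append((read, bcount))
    # ORDER BY bcount desc inside each group.
    return {
        key: sorted(pairs, key=lambda pair: pair[1], reverse=True)
        for key, pairs in grouped.items()
    }


# Made-up observations, not taken from the slide's dataset.
toy = [("A", 0, "A"), ("A", 0, "a"), ("A", 0, "G"), ("A", 1, "A"), ("C", 0, "C")]
stats = base_stats(toy)
for (ref, pos), pairs in sorted(stats.items()):
    print(ref, pos, pairs)
```

Each output row has the same shape as a line in the results: a reference base, a position, and the observed read bases with their counts, most frequent first.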
  19. results plotted
  20. scalability test
      ● 61 GB dataset
      ● running some FastQC stats
      * speed in minutes
  21. related work
      Biodoop: Bioinformatics on Hadoop
      http://dl.acm.org/citation.cfm?id=1679817
      BioPig: A Hadoop-based Analytic Toolkit for Large-Scale Sequence Data, Oxford Journals
      http://bioinformatics.oxfordjournals.org/content/early/2013/09/10/bioinformatics.btt528
  22. some cloud computing solutions
      Amazon AWS, general purpose
      http://aws.amazon.com/
      Mortar Data, focused on data science
      http://www.mortardata.com/
      CloudGene, focused on bioinformatics users
      http://cloudgene.uibk.ac.at/
  23. cloudgene, mapreduce for bioinformatics
  24. conclusions
      Bioinformatics has produced innovative algorithms and solutions that are sometimes adopted in other fields of computer science. Neural networks in artificial intelligence and machine learning are one example. Now, scalable approaches from data mining are helping bioinformatics move forward faster and more cheaply.
  25. thank you
      hi@arianpasquali.com
