SeqPig: a scripting language for large bioinformatics datasets

Presenting alternatives for processing large bioinformatics datasets.

  1. SeqPig: a simple and scalable scripting language for large sequencing data sets in Hadoop. Arian Pasquali, June 6, 2014
  2. /me Arian Pasquali. Master's student in Data Mining. Data engineer at Semasio. Background: engineering, cloud computing, data mining on big data, social networks
  3. Case study: SeqPig: simple and scalable scripting for large sequencing data sets in Hadoop. Schumacher A, Pireddu L, Niemenmaa M, Kallio A, Korpelainen E, Zanetti G, Heljanko K. Bioinformatics. 2014 Jan 1;30(1):119-20. doi: 10.1093/bioinformatics/btt601. Epub 2013 Oct 22. http://www.ncbi.nlm.nih.gov/pubmed/24149054
  4. but first, some background ● Real-world bioinformatics datasets are huge ● Gigabytes to petabytes of data are hard to handle on a single computer ● To handle big data sets we have to master parallel programming models
  5. Parallel programming models. Some high-performance programming models: - Serial (doesn't scale) - MPI (expensive) - MapReduce - Hadoop (cheap and scalable)
  6. hadoop Hadoop is an open source implementation of MapReduce. It is designed to process huge volumes of data, on the order of terabytes or petabytes, which suits many bioinformatics scenarios. http://hadoop.apache.org/
  7. how mapreduce works on hadoop Hadoop provides a framework for MapReduce, a fault-tolerant parallel programming model - easier to write programs than in other paradigms - easier means cheaper - runs on clusters with commodity hardware - scales horizontally - need more power? just add more nodes
  8. an application: BLAST algorithm MapReduce tasks - load data - map sequences - partition - reduce (merge) - output results
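
The task list on the BLAST slide follows the usual MapReduce shape. A minimal single-process Python sketch of those phases could look like the following. It counts 3-mers instead of running BLAST, the input reads are made up, and `map_sequence`, `partition`, `reduce_counts` and `run_job` are hypothetical helper names for illustration, not Hadoop APIs.

```python
from collections import defaultdict


def map_sequence(seq, k=3):
    """Map step: emit a (k-mer, 1) pair for every k-mer in one sequence."""
    return [(seq[i:i + k], 1) for i in range(len(seq) - k + 1)]


def partition(pairs):
    """Partition (shuffle) step: group the emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(groups):
    """Reduce step: merge the values of each key into a single count."""
    return {key: sum(values) for key, values in groups.items()}


def run_job(sequences, k=3):
    """Run load -> map -> partition -> reduce on an in-memory list."""
    pairs = []
    for seq in sequences:  # on a cluster, each mapper would receive a split
        pairs.extend(map_sequence(seq, k))
    return reduce_counts(partition(pairs))


if __name__ == "__main__":
    reads = ["ACGTAC", "CGTACG"]  # toy input, invented for this sketch
    for kmer, count in sorted(run_job(reads).items()):
        print(kmer, count)
```

On a real cluster the map and reduce functions run on different nodes and the framework performs the partition step; here everything runs in one process so the data flow stays visible.
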
  9. MapReduce is easier, but not trivial
  10. Apache Pig tries to solve that. Under the hood it applies the MapReduce paradigm. It hides the pitfalls of writing MapReduce code
  11. Pig version of the same code
  12. Apache Pig in Bioinformatics. Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs. It can be easier to use than hand-written MapReduce code
  13. SeqPig: a scalable scripting language based on Apache Pig for large-scale sequence analysis
  14. SeqPig ● a scripting language, ● a library, ● and a collection of tools to manipulate, analyze and query sequencing datasets in a scalable and simple manner http://seqpig.sourceforge.net/
  15. SeqPig and data format support. Currently it supports: BAM, SAM, FastQ and Qseq (input and output); FASTA (input)
  16. possible use cases ● converting data formats ● filtering regions of a chromosome ● computing base frequencies ● alignments ● collecting read-mapping-quality statistics
  17. code example
          -- import the filter definitions used below
          run scripts/filter_defs.pig
          A = load 'input.bam' using BamLoader('yes');
          -- keep only mapped, non-duplicate reads
          B = FILTER A BY not ReadUnmapped(flags) and not IsDuplicate(flags);
          -- split each read into per-base records
          C = FOREACH B GENERATE ReadSplit(name,start,read,cigar,basequal,flags,mapqual,refindex,refname,attributes#'MD');
          D = FOREACH C GENERATE FLATTEN($0);
          base_stats_data = FOREACH D GENERATE refbase, basepos, UPPER(readbase) AS readbase;
          -- count occurrences of each (refbase, basepos, readbase) combination
          base_stats_grouped = GROUP base_stats_data BY (refbase, basepos, readbase);
          base_stats_grouped_count = FOREACH base_stats_grouped GENERATE group.$0 AS refbase, group.$1 AS basepos, group.$2 as readbase, COUNT($1) AS bcount;
          -- for each (refbase, basepos), list the read bases ordered by count
          base_stats_grouped = GROUP base_stats_grouped_count by (refbase, basepos);
          base_stats = FOREACH base_stats_grouped {
              TMP1 = FOREACH base_stats_grouped_count GENERATE readbase, bcount;
              TMP2 = ORDER TMP1 BY bcount desc;
              GENERATE group.$0, group.$1, TMP2;
          }
          STORE base_stats into 'outputfile_readstats.txt';
  18. results
          A 0 {(A,19),(G,2)}
          A 1 {(A,10)}
          A 2 {(A,18)}
          A 3 {(A,16)}
          A 4 {(A,14)}
          A 5 {(A,15)}
          A 6 {(A,16),(G,2)}
          ...
          A 98 {(A,7)}
          A 99 {(A,14)}
          C 0 {(C,6)}
          C 1 {(C,11)}
          C 2 {(C,9)}
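
The two GROUP steps of the Pig script, and the shape of the rows in the results, can be mirrored in plain Python to make the logic easier to follow. This is only a sketch: it assumes the reads have already been split into `(refbase, basepos, readbase)` tuples, the function `base_stats` and the sample records are hypothetical, and nothing here calls SeqPig or Hadoop.

```python
from collections import Counter, defaultdict


def base_stats(records):
    """Summarise read bases per (refbase, basepos) pair.

    records: iterable of (refbase, basepos, readbase) tuples.
    Returns {(refbase, basepos): [(readbase, count), ...]} where each list
    is ordered by count, highest first.
    """
    # First GROUP + COUNT: occurrences of each (refbase, basepos, readbase).
    counts = Counter((ref, pos, read.upper()) for ref, pos, read in records)

    # Second GROUP: collect the per-read-base counts under (refbase, basepos).
    grouped = defaultdict(list)
    for (ref, pos, read), n in counts.items():
        grouped[(ref, pos)].append((read, n))

    # Equivalent of ORDER ... BY bcount desc inside each group.
    return {
        key: sorted(values, key=lambda rc: rc[1], reverse=True)
        for key, values in grouped.items()
    }


if __name__ == "__main__":
    # Invented sample records, not taken from the slides.
    sample = [("A", 0, "A"), ("A", 0, "a"), ("A", 0, "G"), ("C", 1, "C")]
    for (ref, pos), bases in sorted(base_stats(sample).items()):
        print(ref, pos, bases)
```

The Pig version expresses the same two-level grouping declaratively and lets Hadoop distribute it, whereas this sketch holds everything in memory on one machine.
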
  19. results plotted
  20. scalability test ● 61Gb dataset ● running some FastQC stats * speed in minutes
  21. related work Biodoop: Bioinformatics on Hadoop http://dl.acm.org/citation.cfm?id=1679817 BioPig: A Hadoop-based Analytic Toolkit for Large-Scale Sequence Data, Oxford Journals http://bioinformatics.oxfordjournals.org/content/early/2013/09/10/bioinformatics.btt528
  22. some cloud computing solutions Amazon AWS, general purpose http://aws.amazon.com/ Mortar Data, focused on data science http://www.mortardata.com/ CloudGene, focused on bioinformatics users http://cloudgene.uibk.ac.at/
  23. cloudgene, mapreduce for bioinformatics
  24. conclusions Bioinformatics has produced innovative algorithms and solutions that are sometimes adopted in other fields of computer science. Neural networks in artificial intelligence and machine learning are an example. Now, scalable approaches from data mining are helping bioinformatics move forward, faster and cheaper.
  25. thank you hi@arianpasquali.com
