
This bioinformatics lesson is brought to you by the letter 'W'

Some tips on Unix and bioinformatics. 'W' is for 'Workflows', 'What?', and 'Why?'

This was a talk given at UC Davis on 15th June 2015 as part of a Bioinformatics Core teaching workshop.

Author: Keith Bradnam, Genome Center, UC Davis. This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

Published in: Education

  1. Today's bioinformatics lesson is brought to you by the letter 'W', by Keith Bradnam. Image from flickr.com/91619273@N00/
  2. W is for Workflows
  3. A typical bioinformatics workflow. Illumina data (FASTQ format). Remove adapter contamination.
  4. A typical bioinformatics workflow. Illumina data (FASTQ format). Remove adapter contamination: scythe, cutadapt, trimgalore, skewer, Btrim, Trimmomatic.
  5. A typical bioinformatics workflow. Illumina data (FASTQ format). Remove adapter contamination: scythe, cutadapt, trimgalore, skewer, Btrim, Trimmomatic. Lots of tools you could use!
  6. Trim reads for low quality bases: sickle, Qtrim, FastQC, FastX, PRINSEQ, Trimmomatic.
  7. Map reads to genome/transcriptome: BWA, Bowtie, TopHat, SHRiMP, BFAST, MAQ. There are a lot of read mappers out there! [Figure: timeline of read mappers, 2001 to 2015, from ebi.ac.uk/~nf/hts_mappers/]
  8. Map reads to genome/transcriptome: BWA, Bowtie, TopHat, SHRiMP, BFAST, MAQ. [Figure: the same read mapper timeline from ebi.ac.uk/~nf/hts_mappers/, overlaid with the first page of the paper "ARYANA: Aligning Reads by Yet Another Approach"]
  9. Filter for uniquely mapped reads: SAMtools, Picard, GATK, Unix.
  10. Filter for high quality alignments: SAMtools, Picard, GATK, Unix.
  11. Data suitable for final analysis
  12. Some questions you should ask yourself…
  13. W is for 'Why?'
  14. Why is each of these steps needed?
  15. Why should I use tool 'X' at this step?
  16. W is for 'What?'
  17. What is the effect of running each step?
  18. What is a good result?
  19. The effect of applying many 'bioinformatics axes'. Illumina data (FASTQ format): 2 FASTQ files. Files are ~6.5 GB. 52.5 million reads total.
  20. Remove adapters & trim: 50.1 million reads.
  21. Align to transcriptome with Bowtie: 35.8 million reads map.
  22. Filter for uniquely mapped reads: 31.4 million reads align uniquely.
  23. Filter for high quality alignments: 22.7 million reads have alignment scores of zero.
  24. Data suitable for final analysis. Reduced data from 52.5 to 22.7 million reads.
  25. It can be helpful to know how the different steps in a workflow reduce your data.
  26. One final tip…
  27. ls -ltr
  28. Run this command after every step of a workflow.
  29. Lets you see whether output files were actually created.
  30. Lets you see whether output files contain any data.
  31. Most recently modified files will be at the bottom of your terminal window.
  32. The end
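
The read counts on slides 19 to 24 can be turned into a small retention table, which is one way to see how much data each step removes. The following is a minimal sketch in Python: the counts come from the slides, but the step labels are paraphrased and the function name `retention_table` is hypothetical, not something from the talk.

```python
# Read counts in millions, copied from slides 19 to 24.
# Step labels are paraphrased for illustration.
steps = [
    ("Raw Illumina reads", 52.5),
    ("Adapters removed and trimmed", 50.1),
    ("Mapped to transcriptome with Bowtie", 35.8),
    ("Uniquely mapped", 31.4),
    ("High quality alignments", 22.7),
]


def retention_table(steps):
    """Return (name, reads, % of previous step, % of starting reads) per step."""
    rows = []
    start = steps[0][1]
    previous = start
    for name, reads in steps:
        pct_previous = round(100 * reads / previous, 1)
        pct_start = round(100 * reads / start, 1)
        rows.append((name, reads, pct_previous, pct_start))
        previous = reads
    return rows


table = retention_table(steps)
for name, reads, pct_previous, pct_start in table:
    print(f"{name:<38} {reads:>5.1f}M "
          f"{pct_previous:>6.1f}% of previous {pct_start:>6.1f}% of start")
```

With these numbers, a little over 43% of the original reads survive to the final analysis, and the largest single drop is at the mapping step.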

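The `ls -ltr` check from slides 27 to 31 can also be scripted when a workflow runs unattended. The sketch below is an assumption-laden illustration, not part of the talk: the helper names `list_by_mtime` and `empty_files` and the file names `trimmed.fastq` and `mapped.sam` are hypothetical, and the listing only loosely mirrors `ls -ltr` (regular files only, oldest first).

```python
import os
import tempfile


def list_by_mtime(directory):
    """Return (name, size_in_bytes) for regular files, oldest first."""
    entries = [entry for entry in os.scandir(directory) if entry.is_file()]
    entries.sort(key=lambda entry: entry.stat().st_mtime)
    return [(entry.name, entry.stat().st_size) for entry in entries]


def empty_files(directory):
    """Return names of files that exist but contain no data."""
    return [name for name, size in list_by_mtime(directory) if size == 0]


# Demonstration in a throwaway directory with hypothetical output files.
with tempfile.TemporaryDirectory() as workdir:
    trimmed = os.path.join(workdir, "trimmed.fastq")
    mapped = os.path.join(workdir, "mapped.sam")
    with open(trimmed, "w") as handle:
        handle.write("@read1\nACGT\n+\nIIII\n")
    open(mapped, "w").close()  # simulate a step that wrote an empty file
    # Set modification times explicitly so the ordering is deterministic.
    os.utime(trimmed, (1_000, 1_000))
    os.utime(mapped, (2_000, 2_000))

    listing = list_by_mtime(workdir)
    flagged = empty_files(workdir)

for name, size in listing:
    print(f"{size:>8} bytes  {name}")
print("Empty output files:", flagged)
```

As with the terminal command, the most recently modified file appears last, and a zero-byte output is a quick sign that a step failed silently.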