Your SlideShare is downloading. ×
0
How to be a bioinformatician
How to be a bioinformatician
How to be a bioinformatician
How to be a bioinformatician
How to be a bioinformatician
How to be a bioinformatician
How to be a bioinformatician
How to be a bioinformatician
How to be a bioinformatician
How to be a bioinformatician
How to be a bioinformatician
How to be a bioinformatician
How to be a bioinformatician
How to be a bioinformatician
How to be a bioinformatician
How to be a bioinformatician
How to be a bioinformatician
How to be a bioinformatician
How to be a bioinformatician
How to be a bioinformatician
How to be a bioinformatician
How to be a bioinformatician
How to be a bioinformatician
How to be a bioinformatician
How to be a bioinformatician
How to be a bioinformatician
How to be a bioinformatician
How to be a bioinformatician
How to be a bioinformatician
How to be a bioinformatician
How to be a bioinformatician
How to be a bioinformatician
How to be a bioinformatician
How to be a bioinformatician
How to be a bioinformatician
How to be a bioinformatician
How to be a bioinformatician
How to be a bioinformatician
Upcoming SlideShare
Loading in...5
×

Thanks for flagging this SlideShare!

Oops! An error has occurred.

×
Saving this for later? Get the SlideShare app to save on your phone or tablet. Read anywhere, anytime – even offline.
Text the download link to your phone
Standard text messaging rates apply

How to be a bioinformatician

1,396

Published on

Geared towards bioinformatics students and taking a somewhat humoristic point of view, this presentation explains what bioinformaticians are and what they do.

Geared towards bioinformatics students and taking a somewhat humoristic point of view, this presentation explains what bioinformaticians are and what they do.

Published in: Science, Technology
0 Comments
3 Likes
Statistics
Notes
  • Be the first to comment

No Downloads
Views
Total Views
1,396
On Slideshare
0
From Embeds
0
Number of Embeds
1
Actions
Shares
0
Downloads
76
Comments
0
Likes
3
Embeds 0
No embeds

Report content
Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

Cancel
No notes for slide
  • Version 5
  • Funny rant about bioinformatics, not to be taken literally:http://madhadron.com/posts/2012-03-26-a-farewell-to-bioinformatics.html
  • Transcript

    • 1. 1 How to be a bioinformatician Christian Frech, PhD St. Anna Children’s Cancer Research Institute, Vienna, Austria Talk at University of Applied Sciences, Hagenberg, Austria April 23rd, 2014
    • 2. What is a bioinformatician? 2 Informatician Statistician Biologist Data scientist Modified from http://blog.fejes.ca/?p=2418
    • 3. Bioinformatician vs. computational biologist  Asks biological questions  Analyzes & interprets biological data  Runs existing programs  Ad hoc scripting  Perl, R, Python 3  IT savvy  Builds & maintains biological databases & Web sites  Designs & implements clever algorithms  C/C++, Java, Python Bioinformatician Computational biologist Grasp of computational subjectsmore less Grasp of biological subjectsless more or vice versa
    • 4. Why do we need bioinformaticians?  Amount of generated biological data requires sophisticated computing for data management and analysis  Programmers lack biological knowledge  Biologists don‟t program  The two don‟t understand each other 4 http://www.youtube.com/watch?v=Hz1fyhVOjr4 Latest Illumina sequencer shipped last week (HiSeq v4 reagent kit) outputs 1 terabase (TB) of data in 6 days1! Biologists talks to statistician 1 http://www.illumina.com/products/hiseq-sbs-kit-v4.ilmn
    • 5. What are bioinformaticians doing? 5
    • 6. 6 What are bioinformaticians doing? Word cloud from manuscript titles published in Bioinformatics from Jan 2013 to April 2014
    • 7. Challenges as bioinformatician  Biology is complex, not black and white  As many exceptions as rules (e.g.: define “gene”)  No single optimal solution to a problem  Results interpretable in many ways (story telling, cherry picking)  Understanding the biological question  Field is moving incredibly fast  Lack of standards, immature/abandoned software  Standard of today obsolete tomorrow  Much time spent on collecting/cleaning-up data, troubleshooting errors  Stay flexible, don‟t overinvest in single platform/technology  Hundreds of software tools and databases out there  Easy to get lost  Important to understand their strengths and weaknesses 8
    • 8. Which tools should I use? 9 179 tools Heard of: 65% Used: 30%
    • 9. 10 http://omictools.com/
    • 10. Things to have in your bioinformatics toolbox  Linux command line  Scripting language with associated Bio* library (BioPerl, BioPython, R/Bioconductor, …)  Basic statistical tests, regression, p-values, maximum likelihood, multiple testing correction  Sequence alignment (FASTA & BLAST)  Biological databases  Regular expressions  Sequencing technologies  Web technologies (HTML, XML, …) 11  Advanced R skills  Parallel/distributed computing  DBMS, SQL  (Semi-)compiled language (C/C++, Java)  Dimensionality reduction (e.g. PCA)  Cluster analysis  Support Vector Machines  Hidden Markov models  Web framework (e.g. Django)  Version control system (e.g. Git)  Advanced text editor (Emacs, vim)  IDE (e.g. Eclipse, NetBeans) Must haves Highly recommended
    • 11. Requirement Recommended Language Speed matters, low-level programming Rich-client enterprise application development Text file processing (regex) Statistical analysis, fancy plots Rapid prototyping, readable & maintainable scripts Workflow automation What programming language should I learn? 12Be a jack of all trades, master of ONE!
    • 12. Perl on decline, R and Python gaining popularity 13 http://computationalproteomic.blogspot.co.uk/2013/10/which-are-best-programming- languages.html http://openwetware.org/wiki/Image:Most_Popular_Bioinformatics_Programming_Languages.png Perl most popular bioinformatics programming language in 2008 R and Python take the lead in 2014
    • 13. Top 10 most common and/or annoying mistakes in bioinformatics 14 Inspired by “What Are The Most Common Stupid Mistakes In Bioinformatics?” (https://www.biostars.org/p/7126/)
    • 14. Top-10 most common/annoying mistakes in bioinformatics # 10 Using genome coordinates with wrong genome version (for example, using gene coordinates from human genome version hg18 but reference sequence from version hg19) 15
    • 15. Top-10 most common/annoying mistakes in bioinformatics # 9 Forgetting to process the second strand of DNA sequence 16
    • 16. Top-10 most common/annoying mistakes in bioinformatics # 8 Processing second strand of DNA sequence, but taking reverse instead of reverse complement sequence 17
    • 17. Top-10 most common/annoying mistakes in bioinformatics # 7 Not accounting for different human chromosomes names between UCSC and Ensembl Example: UCSC: “chr1” Ensembl: “1” 18
    • 18. Top-10 most common/annoying mistakes in bioinformatics # 6 Assuming the alphabetical order of chromosome names is “chr1”, “chr2”, “chr3”, … when in fact it is “chr1”, “chr10”, “chr11”, … 19
    • 19. Top-10 most common/annoying mistakes in bioinformatics # 5 Assuming „tab‟ field separator when in fact it is „blank‟ (or vice versa) (look almost identical in text editor) 20
    • 20. Top-10 most common/annoying mistakes in bioinformatics # 4 Assuming DNA sequence consists of only four letters (A, T, C, G) while in fact there is a fifth 21 „N‟ for missing base („X‟ for missing amino acid)
    • 21. Top-10 most common/annoying mistakes in bioinformatics # 3 Forgetting to use dos2unix on a Windows text file before processing it under Linux plus spending 1 hour to debug the problem plus being tricked by this multiple times Text file line breaks differ between platforms: Linux (LF); Windows (CR+LF); classic Mac (CR). 22
    • 22. Top-10 most common/annoying mistakes in bioinformatics # 2 When importing data into MS Excel, letting it auto-convert HUGO gene names into dates and forgetting about it (e.g., tumor suppressor gene “DEC1” will be converted to “1-DEC” on import) ~30 genes in total 23
    • 23. #1 Off-by-one error There are only two common problems in bioinformatics: (1) lack of standards, (2) ID conversion, and (3) off-by-one errors 24 http://en.wikipedia.org/wiki/Off-by-one_error Top-10 most common/annoying mistakes in bioinformatics
    • 24. Ten personal recommendations for your future work as bioinformatician 25
    • 25. #1 - Learn Linux!  Most bioinformatics tools not available on Windows  Linux file systems better for many and/or very large files  Command line interface (CLI) has advantages over graphical user interface (GUI)  Recorded command history (reproducibility)  Key stroke to re-run analysis, instead of repeating 100 mouse clicks  Linux CLI (Shell) much more powerful than Windows CLI 26
    • 26. # 2 - Embrace the “Unix tools philosophy”  Small programs (“tools”) instead of monolithic applications  Designed for simple, specific tasks that are performed well (awk, cat, grep, wc, etc.)  Many and well documented parameters  Combined with Unix pipes (read from STDIN, write to STDOUT)  cut -f 3 myfile.txt | sort | uniq  Advantages  Great flexibility, easy re-use of existing tools  Intermediate output can be stored and inspected for troubleshooting  Complex tasks can be performed quickly with shell „one-liners‟  This paradigm fits bioinformatics well, where often many heterogeneous data files need to be processed in many different ways 27http://www.linuxdevcenter.com/lpt/a/302
    • 27. Example NGS use case demonstrating the power of the Unix tools philosophy  Explanation  „samtools mpileup‟ piles up short reads from the input BAM file for each position in the reference genome  „bcftools view‟ calls the variants  „vcfutils vcf2fq‟ computes the consensus sequence  The resulting FASTA sequence is redirected to the output file cns.fq  By knowing available tools and their parameters, bioinformatics „wizards‟ can get complex stuff done in almost no time 28 samtools mpileup -uf ref.fa aln.bam | bcftools view -cg - | vcfutils.pl vcf2fq > cns.fq http://samtools.sourceforge.net/mpileup.shtml
    • 28. #3 - Don’t reinvent the wheel  Coding is fun, but look around before you hack into your keyboard  Don‟t write the 29th FASTA file parser if proven solutions are available  BioPerl  BioPython  Bioconductor 29
    • 29. #4 - If you happen to invent a wheel, …  Document source and parameters well  Use version control system (git, svn)  Deposit code in public repository  sourceforge.net  github.com  Write test cases 30
    • 30. # 5 - Automate pipelines with GNU/Make  Developed in 1970s to build executables from source files  Incredibly useful for data-driven workflows as well  Automatic error checking  Parallelization (utilize multiple cores)  Incremental builds (re-start your pipeline from point of failure)  Bug-free  Get started at http://www.bioinformaticszen.com/post/decomplected-workflows-makefiles/ 31
    • 31. # 6 - Value your time  Architecture vs. accomplishment  “Perfect is the enemy of the good” -- Voltaire  OO design and normalized databases are nice, but can be an overkill if requirements change from analysis to analysis  Automate what can be automated  Reproducibility  Easy to repeat analysis with slightly changed parameters  BUT: Don‟t spend two days automating a one-time analysis that can be done manually in 10 minutes 32
    • 32. # 7 – Make use of free online resources to learn about specialized topics  www.coursera.org  Bioinformatics Algorithms (https://www.coursera.org/course/bioinformatics)  Computing for Data Analysis (https://www.coursera.org/course/compdata)  R Programming (https://www.coursera.org/course/rprog)  https://www.edx.org/  Data Analysis for Genomics (https://www.edx.org/course/harvardx/harvardx- ph525x-data-analysis-genomics-1401#.U1TUbXV52R8)  Introduction to Biology (https://www.edx.org/course/mitx/mitx-7-00x- introduction-biology-secret-1768#.U1TVL3V52R8)  http://rosalind.info/problems/locations/ 33
    • 33. # 8 - Become an expert  Identify an area of interest and get really good at it  Work at places where you can learn from the best  Spend time abroad  Great experience  Labs/companies will not only hire you for what you know, but who you know 34
    • 34. # 9 - Decide early on if you want to stay in academia or go into industry 35 Academia Industry • PhD highly recommended • Take your time to find compatible supervisor + Freedom to pursue own ideas + Very flexible working hours + Work independently - Steep & competitive career ladder (postdoc >> PI/prof) - Lower pay - Publish or perish • PhD beneficial (to get in), but not necessarily required for daily work (e.g. build/maintain databases) + More frequent (positive) feedback + Higher pay + Job security - More (external) deadlines - Higher pressure to get things done See also “Ten Simple Rules for Choosing between Industry and Academia” (David B. Searls, 2009)
    • 35. # 10 - Stay informed & get connected  Follow literature and blogs  http://en.wikipedia.org/wiki/List_of_bioinformatics_journals  http://www.homolog.us/blogs/blog/2012/07/27/how-to-stay- current-in-bioinformaticsgenomics/  Subscribe via RSS feeds  http://feedly.com or others  Platform independent (e.g. read on your phone)  Bioinformatics Q&A forums  http://www.biostars.org (highly recommended)  http://seqanswers.com/ (focus on NGS)  http://www.reddit.com/r/bioinformatics/ (student-oriented)  Other  http://bioinformatics.org – fosters collaboration in bioinformatics  http://www.researchgate.net – “Facebook” for researchers  German bioinformatics group on XING (https://www.xing.com/net/pri485482x/bin) 36
    • 36. Conclusion  As bioinformatician, you will be at the forefront of one of the greatest scientific enterprises of our time  Biologists overwhelmed with massive data sets  YOU will get to see exciting results first  Requires integration of knowledge from many domains  IT, biology, medicine, statistics, math, …  Knowing your informatics toolbox AND understanding the biological question is what makes you very valuable 37
    • 37. Thank you! Christian Frech frech.christian@gmail.com 38
    • 38. Further Reading  “So you want to be a computational biologist?” http://www.nature.com/nbt/journal/v31/n11/full/nbt.2740.html  “What It Takes to Be a Bioinformatician” http://nav4bioinfo.wordpress.com/2013/03/19/what-it-takes-to-be-a-bioinformatician/  “The alternative „what it takes to be a bioinformatician‟” https://biomickwatson.wordpress.com/2013/03/18/the-alternative-what-it-takes-to-be-a-bioinformatician/  “So You Want To Be a Computational Biologist, Or A Bioinformatician?” http://www.checkmatescientist.net/2013/11/so-you-want-to-be-computational.html  “Being a bioinformatician is hard” http://www.bioinformaticszen.com/post/being-a-bioinformatician-is-hard/  “How not to be a bioinformatician” http://www.scfbm.org/content/7/1/3  “Ten Simple Rules for Reproducible Computational Research” http://www.ploscollections.org/article/info%3Adoi%2F10.1371%2Fjournal.pcbi.1003285  “Ten Simple Rules for Getting Ahead as a Computational Biologist in Academia” http://www.ploscollections.org/article/info%3Adoi%2F10.1371%2Fjournal.pcbi.1002001;jsessionid=6D5D844E0E2 E21C9E565378C7F714D76  “A Quick Guide for Developing Effective Bioinformatics Programming Skills” http://www.ploscompbiol.org/article/info%3Adoi%2F10.1371%2Fjournal.pcbi.1000589  “What Is Really the Salary of a Bioinformatician/Computational Biologist?” http://www.homolog.us/blogs/blog/2014/04/02/what-is-really-the-salary-of-a-bioinformaticiancomputational- biologist/ 39

    ×