How to be a bioinformatician



Geared towards bioinformatics students and taking a somewhat humorous point of view, this presentation explains what bioinformaticians are and what they do.

Published in: Science, Technology
  • Funny rant about bioinformatics, not to be taken literally.

    1. How to be a bioinformatician. Christian Frech, PhD, St. Anna Children’s Cancer Research Institute, Vienna, Austria. Talk at University of Applied Sciences, Hagenberg, Austria, April 23rd, 2014.
    2. What is a bioinformatician? [Figure labels: Informatician, Statistician, Biologist, Data scientist; source attribution truncated]
    3. Bioinformatician vs. computational biologist
        • First column: asks biological questions; analyzes & interprets biological data; runs existing programs; ad hoc scripting; Perl, R, Python
        • Second column: IT savvy; builds & maintains biological databases & Web sites; designs & implements clever algorithms; C/C++, Java, Python
        • Column labels: Bioinformatician, Computational biologist
        • Grasp of computational subjects: more / less
        • Grasp of biological subjects: less / more
        • ... or vice versa
    4. Why do we need bioinformaticians?
        • The amount of generated biological data requires sophisticated computing for data management and analysis
        • Programmers lack biological knowledge
        • Biologists don't program
        • The two don't understand each other
        • Latest Illumina sequencer shipped last week (HiSeq v4 reagent kit) outputs 1 terabase (Tb) of data in 6 days!
        • [Figure: "Biologist talks to statistician"]
    5. What are bioinformaticians doing?
    6. What are bioinformaticians doing? [Figure: word cloud from manuscript titles published in Bioinformatics from Jan 2013 to April 2014]
    7. Challenges as a bioinformatician
        • Biology is complex, not black and white
        • As many exceptions as rules (e.g., define "gene")
        • No single optimal solution to a problem
        • Results interpretable in many ways (storytelling, cherry-picking)
        • Understanding the biological question
        • Field is moving incredibly fast
        • Lack of standards, immature/abandoned software
        • Standard of today obsolete tomorrow
        • Much time spent on collecting/cleaning up data and troubleshooting errors
        • Stay flexible, don't overinvest in a single platform/technology
        • Hundreds of software tools and databases out there
        • Easy to get lost
        • Important to understand their strengths and weaknesses
    8. Which tools should I use? [Figure: 179 tools. Heard of: 65%. Used: 30%.]
    9. [Figure only; no text on this slide]
    10. Things to have in your bioinformatics toolbox
        • Must haves
            • Linux command line
            • Scripting language with associated Bio* library (BioPerl, BioPython, R/Bioconductor, ...)
            • Basic statistical tests, regression, p-values, maximum likelihood, multiple testing correction
            • Sequence alignment (FASTA & BLAST)
            • Biological databases
            • Regular expressions
            • Sequencing technologies
            • Web technologies (HTML, XML, ...)
        • Highly recommended
            • Advanced R skills
            • Parallel/distributed computing
            • DBMS, SQL
            • (Semi-)compiled language (C/C++, Java)
            • Dimensionality reduction (e.g., PCA)
            • Cluster analysis
            • Support Vector Machines
            • Hidden Markov models
            • Web framework (e.g., Django)
            • Version control system (e.g., Git)
            • Advanced text editor (Emacs, vim)
            • IDE (e.g., Eclipse, NetBeans)
    11. What programming language should I learn?
        • Table columns: Requirement, Recommended Language (the language entries are missing from this transcript)
        • Requirements listed:
            • Speed matters, low-level programming
            • Rich-client enterprise application development
            • Text file processing (regex)
            • Statistical analysis, fancy plots
            • Rapid prototyping, readable & maintainable scripts
            • Workflow automation
        • Be a jack of all trades, master of ONE!
    12. Perl on the decline, R and Python gaining popularity (languages.html)
        • Perl was the most popular bioinformatics programming language in 2008
        • R and Python take the lead in 2014
    13. Top 10 most common and/or annoying mistakes in bioinformatics. Inspired by “What Are The Most Common Stupid Mistakes In Bioinformatics?”
    14. Top-10 most common/annoying mistakes in bioinformatics, #10: Using genome coordinates with the wrong genome version (for example, using gene coordinates from human genome version hg18 but the reference sequence from version hg19)
    15. Mistake #9: Forgetting to process the second strand of a DNA sequence
    16. Mistake #8: Processing the second strand of a DNA sequence, but taking the reverse instead of the reverse complement sequence
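To make the difference concrete, here is a minimal Python sketch. The function names are illustrative and not taken from the slides; treating 'N' as its own complement is an assumption that matches common practice.

```python
# Complement table for DNA. 'N' (unknown base) is assumed to map to itself.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_only(seq: str) -> str:
    """The mistake: reverse the string without complementing the bases."""
    return seq[::-1]


def reverse_complement(seq: str) -> str:
    """Complement every base, then reverse the result."""
    return seq.translate(_COMPLEMENT)[::-1]


seq = "ATGCCN"
print(reverse_only(seq))        # wrong sequence for the opposite strand
print(reverse_complement(seq))  # sequence of the opposite strand, read 5'->3'
```

In practice, a tested library routine (for example from one of the Bio* libraries mentioned earlier) is preferable to a hand-rolled helper.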
    17. Mistake #7: Not accounting for different human chromosome names between UCSC and Ensembl. Example: UCSC uses "chr1", Ensembl uses "1"
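A small conversion helper can guard against this mismatch. The sketch below is an illustration only: the function names are hypothetical, and it covers just the primary chromosomes plus the mitochondrial genome ("chrM" in UCSC, "MT" in Ensembl). Unplaced scaffolds and alternate contigs need a proper lookup table.

```python
def ucsc_to_ensembl(name: str) -> str:
    """Convert a UCSC-style name ('chr1', 'chrX', 'chrM') to Ensembl style."""
    if name == "chrM":
        return "MT"
    return name[3:] if name.startswith("chr") else name


def ensembl_to_ucsc(name: str) -> str:
    """Convert an Ensembl-style name ('1', 'X', 'MT') to UCSC style."""
    if name == "MT":
        return "chrM"
    return name if name.startswith("chr") else "chr" + name


for chrom in ["chr1", "chrX", "chrM"]:
    print(chrom, "->", ucsc_to_ensembl(chrom))
```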
    18. Mistake #6: Assuming the alphabetical order of chromosome names is "chr1", "chr2", "chr3", ... when in fact it is "chr1", "chr10", "chr11", ...
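The effect is easy to reproduce. The following Python sketch contrasts plain string sorting with a sort key that compares numeric chromosomes as integers; `chrom_sort_key` is an illustrative name, and placing non-numeric chromosomes (X, Y, M) after the numeric ones is an assumption about the desired order.

```python
def chrom_sort_key(name: str):
    """Sort numeric chromosomes by number, then the rest alphabetically."""
    core = name[3:] if name.startswith("chr") else name
    if core.isdigit():
        return (0, int(core), "")
    return (1, 0, core)


names = ["chr10", "chr2", "chrX", "chr1", "chr11", "chrM", "chrY"]

print(sorted(names))                      # alphabetical: chr10 sorts before chr2
print(sorted(names, key=chrom_sort_key))  # karyotypic: chr2 sorts before chr10
```

On the Linux command line, GNU `sort -V` (version sort) achieves a similar ordering.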
    19. Mistake #5: Assuming a tab field separator when in fact it is a blank, or vice versa (the two look almost identical in a text editor)
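One defensive habit is to state the separator explicitly and check the number of fields per line. The helper below is a hypothetical sketch (the name `split_fields` and the `expected` parameter are not from the slides).

```python
def split_fields(line: str, sep: str = "\t", expected=None):
    """Split a line on an explicit separator and verify the field count."""
    fields = line.rstrip("\r\n").split(sep)
    if expected is not None and len(fields) != expected:
        raise ValueError(
            f"expected {expected} fields separated by {sep!r}, got {len(fields)}"
        )
    return fields


tab_line = "chr1\t100\t200\tBRCA1 variant\n"

print(split_fields(tab_line, expected=4))  # the embedded blank stays inside field 4
print(tab_line.split())                    # whitespace split breaks field 4 in two
```

Failing loudly on an unexpected field count surfaces the wrong-separator problem on the first line instead of deep inside an analysis.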
    20. Mistake #4: Assuming a DNA sequence consists of only four letters (A, T, C, G) while in fact there is a fifth: "N" for a missing base ("X" for a missing amino acid)
    21. Mistake #3: Forgetting to use dos2unix on a Windows text file before processing it under Linux, plus spending 1 hour to debug the problem, plus being tricked by this multiple times. Text file line breaks differ between platforms: Linux (LF); Windows (CR+LF); classic Mac (CR).
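The following Python sketch shows how such files can be detected and normalized programmatically. It is an illustration with made-up function names, not a replacement for dos2unix, and it assumes the whole file fits in memory.

```python
def detect_line_ending(data: bytes) -> str:
    """Classify the line-ending convention found in raw file content."""
    if b"\r\n" in data:
        return "CRLF"  # Windows
    if b"\r" in data:
        return "CR"    # classic Mac
    if b"\n" in data:
        return "LF"    # Linux/Unix
    return "none"


def to_unix(data: bytes) -> bytes:
    """Convert CR+LF and lone CR line breaks to LF."""
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")


windows_text = b"geneA\t1\r\ngeneB\t2\r\n"
print(detect_line_ending(windows_text))

# Typical symptom: the last field silently carries a trailing '\r'.
last_field = windows_text.split(b"\n")[0].split(b"\t")[-1]
print(last_field == b"1")
```

The trailing carriage return is invisible in most editors, which is why comparisons and ID lookups on the last column fail without an obvious reason.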
    22. Mistake #2: When importing data into MS Excel, letting it auto-convert HUGO gene names into dates and forgetting about it (e.g., tumor suppressor gene "DEC1" will be converted to "1-DEC" on import). About 30 genes are affected in total.
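A gene list can be screened for risky symbols before it ever reaches a spreadsheet. The pattern below is a rough heuristic and an assumption on my part (month name or abbreviation followed by one or two digits); which strings Excel actually converts depends on its version and locale settings.

```python
import re

# Heuristic: month name/abbreviation followed by 1-2 digits, e.g. DEC1, SEPT2.
_DATE_LIKE = re.compile(
    r"^(JAN|FEB|MAR|MARCH|APR|MAY|JUN|JUL|AUG|SEP|SEPT|OCT|NOV|DEC)\d{1,2}$",
    re.IGNORECASE,
)


def looks_like_date(symbol: str) -> bool:
    """Return True if a gene symbol may be auto-converted to a date."""
    return bool(_DATE_LIKE.match(symbol))


genes = ["TP53", "DEC1", "SEPT2", "BRCA1", "MARCH1"]
at_risk = [g for g in genes if looks_like_date(g)]
print(at_risk)
```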
    23. Mistake #1: Off-by-one error. There are only two common problems in bioinformatics: (1) lack of standards, (2) ID conversion, and (3) off-by-one errors.
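A frequent source of off-by-one errors is mixing coordinate conventions. The slides do not name specific formats; as an example chosen here, BED files use 0-based, half-open intervals, while GFF and VCF use 1-based, closed (inclusive) coordinates. A minimal Python sketch of the conversion, with illustrative function names:

```python
def closed_to_half_open(start: int, end: int):
    """1-based closed [start, end] -> 0-based half-open [start, end)."""
    return start - 1, end


def half_open_to_closed(start: int, end: int):
    """0-based half-open [start, end) -> 1-based closed [start, end]."""
    return start + 1, end


def length_closed(start: int, end: int) -> int:
    """Number of bases in a 1-based closed interval."""
    return end - start + 1


def length_half_open(start: int, end: int) -> int:
    """Number of bases in a 0-based half-open interval."""
    return end - start


# The first 100 bases of a chromosome in both conventions.
gff_start, gff_end = 1, 100
bed_start, bed_end = closed_to_half_open(gff_start, gff_end)
print(bed_start, bed_end)
print(length_closed(gff_start, gff_end), length_half_open(bed_start, bed_end))
```

Only the start coordinate shifts between the two conventions; the end coordinate stays the same, which is exactly the detail that is easy to get wrong.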
    24. Ten personal recommendations for your future work as a bioinformatician
    25. #1: Learn Linux!
        • Most bioinformatics tools are not available on Windows
        • Linux file systems are better for many and/or very large files
        • The command line interface (CLI) has advantages over a graphical user interface (GUI)
            • Recorded command history (reproducibility)
            • One keystroke to re-run an analysis, instead of repeating 100 mouse clicks
        • The Linux CLI (shell) is much more powerful than the Windows CLI
    26. #2: Embrace the "Unix tools philosophy"
        • Small programs ("tools") instead of monolithic applications
            • Designed for simple, specific tasks that are performed well (awk, cat, grep, wc, etc.)
            • Many and well-documented parameters
        • Combined with Unix pipes (read from STDIN, write to STDOUT)
            • cut -f 3 myfile.txt | sort | uniq
        • Advantages
            • Great flexibility, easy re-use of existing tools
            • Intermediate output can be stored and inspected for troubleshooting
            • Complex tasks can be performed quickly with shell "one-liners"
        • This paradigm fits bioinformatics well, where many heterogeneous data files often need to be processed in many different ways
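For comparison, here is roughly what the one-liner `cut -f 3 myfile.txt | sort | uniq` does, written out in Python. This is a sketch under simplifying assumptions: lines with fewer than three tab-separated fields are skipped (`cut` treats them differently), and Python's string ordering may differ from the locale-dependent order of `sort`. The function name is illustrative.

```python
def unique_column_values(lines, column: int = 3, sep: str = "\t"):
    """Return the sorted, de-duplicated values of one column (1-based)."""
    values = set()
    for line in lines:
        fields = line.rstrip("\r\n").split(sep)
        if len(fields) >= column:
            values.add(fields[column - 1])
    return sorted(values)


# In real use, 'lines' would be an open file handle, e.g. open("myfile.txt").
example_lines = [
    "s1\tchr1\tmissense\n",
    "s2\tchr2\tnonsense\n",
    "s3\tchr1\tmissense\n",
]
print(unique_column_values(example_lines))
```

The shell version is a single line; the point of the Unix tools philosophy is that the pipe replaces all of the bookkeeping above.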
    27. Example NGS use case demonstrating the power of the Unix tools philosophy
        samtools mpileup -uf ref.fa aln.bam | bcftools view -cg - | vcfutils.pl vcf2fq > cns.fq
        • Explanation
            • "samtools mpileup" piles up short reads from the input BAM file for each position in the reference genome
            • "bcftools view" calls the variants
            • "vcfutils.pl vcf2fq" computes the consensus sequence
            • The resulting FASTQ sequence is redirected to the output file cns.fq
        • By knowing the available tools and their parameters, bioinformatics "wizards" can get complex stuff done in almost no time
    28. #3: Don’t reinvent the wheel
        • Coding is fun, but look around before you hack into your keyboard
        • Don't write the 29th FASTA file parser if proven solutions are available
            • BioPerl
            • BioPython
            • Bioconductor
    29. #4: If you happen to invent a wheel, ...
        • Document source and parameters well
        • Use a version control system (git, svn)
        • Deposit code in a public repository
        • Write test cases
    30. #5: Automate pipelines with GNU Make
        • Make was developed in the 1970s to build executables from source files
        • Incredibly useful for data-driven workflows as well
            • Automatic error checking
            • Parallelization (utilize multiple cores)
            • Incremental builds (restart your pipeline from the point of failure)
        • Mature and practically bug-free
        • Get started at [link missing]
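The idea behind incremental builds can be sketched in a few lines of Python: a target is rebuilt only if it is missing or older than one of its inputs. This is a simplified illustration of the timestamp rule that Make applies, not Make's actual implementation, and `needs_rebuild` is a made-up name.

```python
import os


def needs_rebuild(target: str, sources) -> bool:
    """Return True if target is missing or older than any of its sources."""
    if not os.path.exists(target):
        return True
    target_mtime = os.path.getmtime(target)
    return any(os.path.getmtime(src) > target_mtime for src in sources)
```

A pipeline step would call `needs_rebuild("aligned.bam", ["reads.fq", "ref.fa"])` (hypothetical file names) and skip the step when it returns False. Make adds dependency resolution across many such rules, error checking, and parallel execution on top of this basic test.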
    31. #6: Value your time
        • Architecture vs. accomplishment
            • “Perfect is the enemy of the good” (Voltaire)
            • OO design and normalized databases are nice, but can be overkill if requirements change from analysis to analysis
        • Automate what can be automated
            • Reproducibility
            • Easy to repeat an analysis with slightly changed parameters
        • BUT: Don't spend two days automating a one-time analysis that can be done manually in 10 minutes
    32. #7: Make use of free online resources to learn about specialized topics
        • Bioinformatics Algorithms
        • Computing for Data Analysis
        • R Programming
        • Data Analysis for Genomics (ph525x-data-analysis-genomics-1401#.U1TUbXV52R8)
        • Introduction to Biology (introduction-biology-secret-1768#.U1TVL3V52R8)
    33. #8: Become an expert
        • Identify an area of interest and get really good at it
        • Work at places where you can learn from the best
        • Spend time abroad
            • Great experience
            • Labs/companies will hire you not only for what you know, but also for who you know
    34. #9: Decide early on if you want to stay in academia or go into industry
        • Academia
            • PhD highly recommended
            • Take your time to find a compatible supervisor
            + Freedom to pursue own ideas
            + Very flexible working hours
            + Work independently
            - Steep & competitive career ladder (postdoc >> PI/prof)
            - Lower pay
            - Publish or perish
        • Industry
            • PhD beneficial (to get in), but not necessarily required for daily work (e.g., building/maintaining databases)
            + More frequent (positive) feedback
            + Higher pay
            + Job security
            - More (external) deadlines
            - Higher pressure to get things done
        • See also “Ten Simple Rules for Choosing between Industry and Academia” (David B. Searls, 2009)
    35. #10: Stay informed & get connected
        • Follow literature and blogs (current-in-bioinformaticsgenomics/)
        • Subscribe via RSS feeds
            • Platform independent (e.g., read on your phone)
        • Bioinformatics Q&A forums
            • [link missing] (highly recommended)
            • [link missing] (focus on NGS)
            • [link missing] (student-oriented)
        • Other
            • [link missing]: fosters collaboration in bioinformatics
            • [link missing]: "Facebook" for researchers
            • German bioinformatics group on XING
    36. Conclusion
        • As a bioinformatician, you will be at the forefront of one of the greatest scientific enterprises of our time
            • Biologists are overwhelmed with massive data sets
            • YOU will get to see exciting results first
        • Requires integration of knowledge from many domains
            • IT, biology, medicine, statistics, math, ...
        • Knowing your informatics toolbox AND understanding the biological question is what makes you very valuable
    37. Thank you! Christian Frech
    38. Further Reading
        • “So you want to be a computational biologist?”
        • “What It Takes to Be a Bioinformatician”
        • “The alternative 'what it takes to be a bioinformatician'”
        • “So You Want To Be a Computational Biologist, Or A Bioinformatician?”
        • “Being a bioinformatician is hard”
        • “How not to be a bioinformatician”
        • “Ten Simple Rules for Reproducible Computational Research”
        • “Ten Simple Rules for Getting Ahead as a Computational Biologist in Academia”
        • “A Quick Guide for Developing Effective Bioinformatics Programming Skills”
        • “What Is Really the Salary of a Bioinformatician/Computational Biologist?”