How to be a bioinformatician
Christian Frech, PhD
St. Anna Children’s Cancer Research Institute, Vienna, Austria
Geared towards bioinformatics students and taking a somewhat humoristic point of view, this presentation explains what bioinformaticians are and what they do.

Published in: Science, Technology
  • Funny rant about bioinformatics, not to be taken literally: http://madhadron.com/posts/2012-03-26-a-farewell-to-bioinformatics.html
  • Transcript of "How to be a bioinformatician"

    1. How to be a bioinformatician. Christian Frech, PhD, St. Anna Children’s Cancer Research Institute, Vienna, Austria. Talk at University of Applied Sciences, Hagenberg, Austria, April 23rd, 2014.
    2. What is a bioinformatician? Informatician, statistician, biologist, data scientist. Modified from http://blog.fejes.ca/?p=2418
    3. Bioinformatician vs. computational biologist
         • First column: asks biological questions; analyzes & interprets biological data; runs existing programs; ad hoc scripting; Perl, R, Python
         • Second column: IT savvy; builds & maintains biological databases & Web sites; designs & implements clever algorithms; C/C++, Java, Python
         • Labels: Bioinformatician, Computational biologist
         • Grasp of computational subjects: more, less
         • Grasp of biological subjects: less, more
         • Or vice versa
    4. Why do we need bioinformaticians?
         • Amount of generated biological data requires sophisticated computing for data management and analysis
         • Programmers lack biological knowledge
         • Biologists don’t program
         • The two don’t understand each other
         • Video: “Biologist talks to statistician”, http://www.youtube.com/watch?v=Hz1fyhVOjr4
         • Latest Illumina sequencer shipped last week (HiSeq v4 reagent kit) outputs 1 terabase (Tb) of data in 6 days [1]
         • [1] http://www.illumina.com/products/hiseq-sbs-kit-v4.ilmn
    5. What are bioinformaticians doing?
    6. What are bioinformaticians doing? Word cloud from manuscript titles published in Bioinformatics from Jan 2013 to April 2014
    7. Challenges as bioinformatician
         • Biology is complex, not black and white
         • As many exceptions as rules (e.g., define “gene”)
         • No single optimal solution to a problem
         • Results interpretable in many ways (story telling, cherry picking)
         • Understanding the biological question
         • Field is moving incredibly fast
         • Lack of standards, immature/abandoned software
         • Standard of today obsolete tomorrow
         • Much time spent on collecting/cleaning up data, troubleshooting errors
         • Stay flexible, don’t overinvest in single platform/technology
         • Hundreds of software tools and databases out there
         • Easy to get lost
         • Important to understand their strengths and weaknesses
    8. Which tools should I use? 179 tools. Heard of: 65%. Used: 30%.
    9. http://omictools.com/
    10. Things to have in your bioinformatics toolbox
         • Must haves
             • Linux command line
             • Scripting language with associated Bio* library (BioPerl, BioPython, R/Bioconductor, …)
             • Basic statistical tests, regression, p-values, maximum likelihood, multiple testing correction
             • Sequence alignment (FASTA & BLAST)
             • Biological databases
             • Regular expressions
             • Sequencing technologies
             • Web technologies (HTML, XML, …)
         • Highly recommended
             • Advanced R skills
             • Parallel/distributed computing
             • DBMS, SQL
             • (Semi-)compiled language (C/C++, Java)
             • Dimensionality reduction (e.g. PCA)
             • Cluster analysis
             • Support Vector Machines
             • Hidden Markov models
             • Web framework (e.g. Django)
             • Version control system (e.g. Git)
             • Advanced text editor (Emacs, vim)
             • IDE (e.g. Eclipse, NetBeans)
    11. What programming language should I learn? Be a jack of all trades, master of ONE!
         • Columns: Requirement, Recommended Language
         • Speed matters, low-level programming
         • Rich-client enterprise application development
         • Text file processing (regex)
         • Statistical analysis, fancy plots
         • Rapid prototyping, readable & maintainable scripts
         • Workflow automation
    12. Perl on decline, R and Python gaining popularity
         • Perl most popular bioinformatics programming language in 2008
         • R and Python take the lead in 2014
         • http://computationalproteomic.blogspot.co.uk/2013/10/which-are-best-programming-languages.html
         • http://openwetware.org/wiki/Image:Most_Popular_Bioinformatics_Programming_Languages.png
    13. Top 10 most common and/or annoying mistakes in bioinformatics. Inspired by “What Are The Most Common Stupid Mistakes In Bioinformatics?” (https://www.biostars.org/p/7126/)
    14. Mistake #10: Using genome coordinates with the wrong genome version (for example, using gene coordinates from human genome version hg18 but reference sequence from version hg19)
    15. Mistake #9: Forgetting to process the second strand of a DNA sequence
    16. Mistake #8: Processing the second strand of a DNA sequence, but taking the reverse instead of the reverse complement sequence
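The difference between reversing and reverse-complementing can be sketched in a few lines of Python. This is a minimal illustration that assumes plain A/C/G/T/N sequences; the function names are made up for this example and are not from the talk. Established Bio* libraries offer tested equivalents.

```python
# Complement table for unambiguous bases plus N (an unknown base stays N).
# Assumption: no other IUPAC ambiguity codes occur in the input.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def reverse_only(seq: str) -> str:
    """The mistake: reversing the string without complementing it."""
    return seq[::-1]


forward = "ATGCCN"
print(reverse_complement(forward))  # NGGCAT
print(reverse_only(forward))        # NCCGTA
```

Applying `reverse_complement` twice returns the original sequence, which makes a cheap sanity check in test cases.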
    17. Mistake #7: Not accounting for different human chromosome names between UCSC and Ensembl. Example: UCSC “chr1”, Ensembl “1”
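One way to guard against mixed naming schemes is to normalize names before comparing them. The sketch below is a hypothetical helper pair, not part of any library, and it only handles the primary chromosomes (1-22, X, Y and the mitochondrial genome, which UCSC calls “chrM” and Ensembl calls “MT”); unplaced scaffolds and alternate contigs follow other rules and are not covered.

```python
def to_ucsc(name: str) -> str:
    """Convert an Ensembl-style name ("1", "X", "MT") to UCSC style ("chr1", "chrX", "chrM")."""
    if name.startswith("chr"):
        return name  # already UCSC style
    return "chrM" if name == "MT" else "chr" + name


def to_ensembl(name: str) -> str:
    """Convert a UCSC-style name ("chr1", "chrM") to Ensembl style ("1", "MT")."""
    if not name.startswith("chr"):
        return name  # already Ensembl style
    return "MT" if name == "chrM" else name[3:]


print(to_ucsc("1"), to_ucsc("MT"))           # chr1 chrM
print(to_ensembl("chrX"), to_ensembl("chrM"))  # X MT
```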
    18. Mistake #6: Assuming the alphabetical order of chromosome names is “chr1”, “chr2”, “chr3”, … when in fact it is “chr1”, “chr10”, “chr11”, …
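The ordering problem is easy to reproduce, and a custom sort key avoids it. The key function below is an illustrative sketch; placing X, Y and M after the numbered chromosomes in that order is an assumption about the desired convention, not a standard.

```python
def chrom_sort_key(name: str):
    """Sort key: numbered chromosomes in numeric order, then X, Y, M, then anything else."""
    core = name[3:] if name.startswith("chr") else name
    if core.isdigit():
        return (0, int(core), "")
    special = {"X": 1, "Y": 2, "M": 3, "MT": 3}
    return (1, special.get(core, 4), core)


names = ["chr10", "chr2", "chrX", "chr1", "chrM", "chr11", "chrY"]

print(sorted(names))
# ['chr1', 'chr10', 'chr11', 'chr2', 'chrM', 'chrX', 'chrY']

print(sorted(names, key=chrom_sort_key))
# ['chr1', 'chr2', 'chr10', 'chr11', 'chrX', 'chrY', 'chrM']
```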
    19. Mistake #5: Assuming ‘tab’ field separator when in fact it is ‘blank’ (or vice versa); the two look almost identical in a text editor
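A short Python sketch shows how the two separators give different field counts on lines that look the same on screen. The example records and the `split_fields` helper are hypothetical; checking the field count on read is one simple way to fail early instead of silently mis-parsing.

```python
line_tab = "chr1\t100\t200\tBRCA1 exon 2"    # tab-separated, last field contains blanks
line_blank = "chr1 100 200 BRCA1 exon 2"     # blank-separated look-alike

print(len(line_tab.split("\t")))    # 4
print(len(line_tab.split()))        # 6, splitting on any whitespace breaks the last field
print(len(line_blank.split("\t")))  # 1, no tab found, whole line is one field


def split_fields(line: str, expected: int):
    """Split a tab-separated line and raise if the field count is unexpected."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != expected:
        raise ValueError(f"expected {expected} tab-separated fields, got {len(fields)}")
    return fields
```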
    20. Mistake #4: Assuming a DNA sequence consists of only four letters (A, T, C, G) while in fact there is a fifth: ‘N’ for missing base (‘X’ for missing amino acid)
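Before running code that assumes four letters, it can help to count what else is in the sequence. The helper below is an illustrative sketch with a made-up name; besides N, real data may also contain other IUPAC ambiguity codes such as R (A or G).

```python
from collections import Counter


def non_acgt(seq: str) -> Counter:
    """Count characters other than A, C, G, T (case-insensitive)."""
    return Counter(base for base in seq.upper() if base not in "ACGT")


print(non_acgt("ACGTNNacgtR"))  # Counter({'N': 2, 'R': 1})
print(non_acgt("ACGT"))         # Counter()
```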
    21. Mistake #3: Forgetting to use dos2unix on a Windows text file before processing it under Linux, plus spending 1 hour to debug the problem, plus being tricked by this multiple times. Text file line breaks differ between platforms: Linux (LF); Windows (CR+LF); classic Mac (CR).
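The three line-break conventions can be detected and normalized on raw bytes. This is a minimal sketch with hypothetical function names; it mimics what dos2unix does for the common cases but is not a replacement for it.

```python
def detect_line_endings(data: bytes) -> str:
    """Return "LF", "CRLF", "CR", "mixed", or "none" for the given file content."""
    crlf = data.count(b"\r\n")
    lf = data.count(b"\n") - crlf  # bare LF (Linux)
    cr = data.count(b"\r") - crlf  # bare CR (classic Mac)
    kinds = [name for name, n in (("CRLF", crlf), ("LF", lf), ("CR", cr)) if n]
    if not kinds:
        return "none"
    return kinds[0] if len(kinds) == 1 else "mixed"


def to_unix(data: bytes) -> bytes:
    """Convert CR+LF and bare CR line breaks to LF."""
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")


windows_text = b"gene\tvalue\r\nTP53\t1\r\n"
print(detect_line_endings(windows_text))           # CRLF
print(detect_line_endings(to_unix(windows_text)))  # LF
```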
    22. Mistake #2: When importing data into MS Excel, letting it auto-convert HUGO gene names into dates and forgetting about it (e.g., tumor suppressor gene “DEC1” will be converted to “1-DEC” on import); affects ~30 genes in total
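A gene list that has passed through a spreadsheet can be screened for date-like entries. The sketch below is a hypothetical check that only recognizes the “1-DEC” form shown in the example above; other date renderings are not covered, so a clean result is not proof that nothing was converted.

```python
import re

_MONTHS = "JAN|FEB|MAR|APR|MAY|JUN|JUL|AUG|SEP|OCT|NOV|DEC"
# Matches values such as "1-DEC" or "2-Sep": day number, hyphen, month abbreviation.
_DATE_LIKE = re.compile(rf"^\d{{1,2}}-(?:{_MONTHS})$", re.IGNORECASE)


def looks_date_mangled(value: str) -> bool:
    """Return True if a gene-name field looks like a spreadsheet date."""
    return bool(_DATE_LIKE.match(value.strip()))


column = ["TP53", "1-DEC", "DEC1", "2-Sep"]
print([name for name in column if looks_date_mangled(name)])  # ['1-DEC', '2-Sep']
```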
    23. Mistake #1: Off-by-one error. There are only two common problems in bioinformatics: (1) lack of standards, (2) ID conversion, and (3) off-by-one errors. http://en.wikipedia.org/wiki/Off-by-one_error
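A frequent source of off-by-one errors is mixing coordinate conventions: BED files use 0-based, half-open intervals, while formats such as GFF and VCF use 1-based, inclusive positions. The conversion helpers below are an illustrative sketch with made-up names.

```python
from typing import Tuple


def one_based_to_bed(start: int, end: int) -> Tuple[int, int]:
    """Convert a 1-based inclusive interval to a 0-based half-open (BED) interval."""
    return start - 1, end


def bed_to_one_based(start: int, end: int) -> Tuple[int, int]:
    """Convert a 0-based half-open (BED) interval to a 1-based inclusive interval."""
    return start + 1, end


def length_bed(start: int, end: int) -> int:
    return end - start


def length_one_based(start: int, end: int) -> int:
    return end - start + 1


# The first 100 bases of a chromosome in both conventions.
print(one_based_to_bed(1, 100))                      # (0, 100)
print(length_bed(0, 100), length_one_based(1, 100))  # 100 100
```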
    24. Ten personal recommendations for your future work as bioinformatician
    25. #1 - Learn Linux!
         • Most bioinformatics tools not available on Windows
         • Linux file systems better for many and/or very large files
         • Command line interface (CLI) has advantages over graphical user interface (GUI)
             • Recorded command history (reproducibility)
             • Key stroke to re-run analysis, instead of repeating 100 mouse clicks
         • Linux CLI (Shell) much more powerful than Windows CLI
    26. #2 - Embrace the “Unix tools philosophy”
         • Small programs (“tools”) instead of monolithic applications
         • Designed for simple, specific tasks that are performed well (awk, cat, grep, wc, etc.)
         • Many and well documented parameters
         • Combined with Unix pipes (read from STDIN, write to STDOUT)
             • cut -f 3 myfile.txt | sort | uniq
         • Advantages
             • Great flexibility, easy re-use of existing tools
             • Intermediate output can be stored and inspected for troubleshooting
             • Complex tasks can be performed quickly with shell ‘one-liners’
         • This paradigm fits bioinformatics well, where often many heterogeneous data files need to be processed in many different ways
         • http://www.linuxdevcenter.com/lpt/a/302
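To make clear what the one-liner `cut -f 3 myfile.txt | sort | uniq` computes, here is a rough Python equivalent. It is a sketch under two assumptions: every line has at least three tab-separated fields, and plain code-point ordering is acceptable (the shell `sort` orders by locale). The input data is a hypothetical stand-in for myfile.txt.

```python
import io


def unique_third_column(lines):
    """Return the sorted distinct values of the third tab-separated field."""
    values = set()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        values.add(fields[2])  # assumes at least three fields per line
    return sorted(values)


# Hypothetical stand-in for myfile.txt.
example = io.StringIO("g1\tchr1\texon\ng2\tchr1\tintron\ng3\tchr2\texon\n")
print(unique_third_column(example))  # ['exon', 'intron']
```

The shell version needs no function definition at all, which is the point of the slide: three small tools and two pipes replace a dozen lines of script.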
    27. Example NGS use case demonstrating the power of the Unix tools philosophy
         • Command: samtools mpileup -uf ref.fa aln.bam | bcftools view -cg - | vcfutils.pl vcf2fq > cns.fq
         • Source: http://samtools.sourceforge.net/mpileup.shtml
         • Explanation
             • ‘samtools mpileup’ piles up short reads from the input BAM file for each position in the reference genome
             • ‘bcftools view’ calls the variants
             • ‘vcfutils.pl vcf2fq’ computes the consensus sequence
             • The resulting FASTQ sequence is redirected to the output file cns.fq
         • By knowing available tools and their parameters, bioinformatics ‘wizards’ can get complex stuff done in almost no time
    28. #3 - Don’t reinvent the wheel
         • Coding is fun, but look around before you hack into your keyboard
         • Don’t write the 29th FASTA file parser if proven solutions are available
             • BioPerl
             • BioPython
             • Bioconductor
    29. #4 - If you happen to invent a wheel, …
         • Document source and parameters well
         • Use version control system (git, svn)
         • Deposit code in public repository
             • sourceforge.net
             • github.com
         • Write test cases
    30. #5 - Automate pipelines with GNU/Make
         • Developed in 1970s to build executables from source files
         • Incredibly useful for data-driven workflows as well
             • Automatic error checking
             • Parallelization (utilize multiple cores)
             • Incremental builds (re-start your pipeline from point of failure)
             • Bug-free
         • Get started at http://www.bioinformaticszen.com/post/decomplected-workflows-makefiles/
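The idea behind incremental builds can be sketched in Python: a target is rebuilt only if it is missing or older than one of its prerequisites. This illustrates the rule Make applies using file modification times; it is not a substitute for Make, and the function name is made up for this example.

```python
import os


def needs_rebuild(target: str, prerequisites: list) -> bool:
    """Return True if target is missing or older than any prerequisite file."""
    if not os.path.exists(target):
        return True
    target_mtime = os.path.getmtime(target)
    return any(os.path.getmtime(path) > target_mtime for path in prerequisites)
```

In a pipeline, a step such as variant calling would be skipped when `needs_rebuild("calls.vcf", ["aln.bam", "ref.fa"])` is false (file names hypothetical), which is how a failed run can resume from the point of failure.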
    31. #6 - Value your time
         • Architecture vs. accomplishment
             • “Perfect is the enemy of the good” -- Voltaire
             • OO design and normalized databases are nice, but can be overkill if requirements change from analysis to analysis
         • Automate what can be automated
             • Reproducibility
             • Easy to repeat analysis with slightly changed parameters
             • BUT: Don’t spend two days automating a one-time analysis that can be done manually in 10 minutes
    32. #7 - Make use of free online resources to learn about specialized topics
         • www.coursera.org
             • Bioinformatics Algorithms (https://www.coursera.org/course/bioinformatics)
             • Computing for Data Analysis (https://www.coursera.org/course/compdata)
             • R Programming (https://www.coursera.org/course/rprog)
         • https://www.edx.org/
             • Data Analysis for Genomics (https://www.edx.org/course/harvardx/harvardx-ph525x-data-analysis-genomics-1401#.U1TUbXV52R8)
             • Introduction to Biology (https://www.edx.org/course/mitx/mitx-7-00x-introduction-biology-secret-1768#.U1TVL3V52R8)
         • http://rosalind.info/problems/locations/
    33. #8 - Become an expert
         • Identify an area of interest and get really good at it
         • Work at places where you can learn from the best
         • Spend time abroad
             • Great experience
             • Labs/companies will hire you not only for what you know, but also for who you know
    34. #9 - Decide early on if you want to stay in academia or go into industry
         • Academia
             • PhD highly recommended
             • Take your time to find compatible supervisor
             + Freedom to pursue own ideas
             + Very flexible working hours
             + Work independently
             - Steep & competitive career ladder (postdoc >> PI/prof)
             - Lower pay
             - Publish or perish
         • Industry
             • PhD beneficial (to get in), but not necessarily required for daily work (e.g. build/maintain databases)
             + More frequent (positive) feedback
             + Higher pay
             + Job security
             - More (external) deadlines
             - Higher pressure to get things done
         • See also “Ten Simple Rules for Choosing between Industry and Academia” (David B. Searls, 2009)
    35. #10 - Stay informed & get connected
         • Follow literature and blogs
             • http://en.wikipedia.org/wiki/List_of_bioinformatics_journals
             • http://www.homolog.us/blogs/blog/2012/07/27/how-to-stay-current-in-bioinformaticsgenomics/
         • Subscribe via RSS feeds
             • http://feedly.com or others
             • Platform independent (e.g. read on your phone)
         • Bioinformatics Q&A forums
             • http://www.biostars.org (highly recommended)
             • http://seqanswers.com/ (focus on NGS)
             • http://www.reddit.com/r/bioinformatics/ (student-oriented)
         • Other
             • http://bioinformatics.org: fosters collaboration in bioinformatics
             • http://www.researchgate.net: “Facebook” for researchers
             • German bioinformatics group on XING (https://www.xing.com/net/pri485482x/bin)
    36. Conclusion
         • As bioinformatician, you will be at the forefront of one of the greatest scientific enterprises of our time
             • Biologists overwhelmed with massive data sets
             • YOU will get to see exciting results first
         • Requires integration of knowledge from many domains
             • IT, biology, medicine, statistics, math, …
         • Knowing your informatics toolbox AND understanding the biological question is what makes you very valuable
    37. Thank you! Christian Frech, frech.christian@gmail.com
    38. Further Reading
         • “So you want to be a computational biologist?” http://www.nature.com/nbt/journal/v31/n11/full/nbt.2740.html
         • “What It Takes to Be a Bioinformatician” http://nav4bioinfo.wordpress.com/2013/03/19/what-it-takes-to-be-a-bioinformatician/
         • “The alternative ‘what it takes to be a bioinformatician’” https://biomickwatson.wordpress.com/2013/03/18/the-alternative-what-it-takes-to-be-a-bioinformatician/
         • “So You Want To Be a Computational Biologist, Or A Bioinformatician?” http://www.checkmatescientist.net/2013/11/so-you-want-to-be-computational.html
         • “Being a bioinformatician is hard” http://www.bioinformaticszen.com/post/being-a-bioinformatician-is-hard/
         • “How not to be a bioinformatician” http://www.scfbm.org/content/7/1/3
         • “Ten Simple Rules for Reproducible Computational Research” http://www.ploscollections.org/article/info%3Adoi%2F10.1371%2Fjournal.pcbi.1003285
         • “Ten Simple Rules for Getting Ahead as a Computational Biologist in Academia” http://www.ploscollections.org/article/info%3Adoi%2F10.1371%2Fjournal.pcbi.1002001;jsessionid=6D5D844E0E2E21C9E565378C7F714D76
         • “A Quick Guide for Developing Effective Bioinformatics Programming Skills” http://www.ploscompbiol.org/article/info%3Adoi%2F10.1371%2Fjournal.pcbi.1000589
         • “What Is Really the Salary of a Bioinformatician/Computational Biologist?” http://www.homolog.us/blogs/blog/2014/04/02/what-is-really-the-salary-of-a-bioinformaticiancomputational-biologist/