How to be a bioinformatician

Christian Frech, PhD
St. Anna Children’s Cancer Research Institute, Vienna, Austria
Talk at University of Applied Sciences, Hagenberg, Austria
April 23rd, 2014

What is a bioinformatician?
[Figure: diagram with the labels Informatician, Statistician, Biologist, and Data scientist. Modified from http://blog.fejes.ca/?p=2418]

Bioinformatician vs. computational biologist
 Asks biological questions
 Analyzes & interprets biological data
 Runs existing programs
 Ad hoc scripting
 Perl, R, Python

 IT savvy
 Builds & maintains biological databases & Web sites
 Designs & implements clever algorithms
 C/C++, Java, Python

Bioinformatician | Computational biologist (or vice versa)
Grasp of computational subjects: more | less
Grasp of biological subjects: less | more

Why do we need bioinformaticians?
 Amount of generated biological data requires sophisticated
computing for data management and analysis
 Programmers lack biological knowledge
 Biologists don't program
 The two don't understand each other
Video: "Biologist talks to statistician" (http://www.youtube.com/watch?v=Hz1fyhVOjr4)
Latest Illumina sequencer shipped last week (HiSeq v4 reagent kit) outputs 1 terabase (Tb) of data in 6 days!¹
¹ http://www.illumina.com/products/hiseq-sbs-kit-v4.ilmn

What are bioinformaticians doing?
Word cloud from manuscript titles published in Bioinformatics from Jan 2013 to April 2014

Challenges as a bioinformatician
 Biology is complex, not black and white
 As many exceptions as rules (e.g.: define “gene”)
 No single optimal solution to a problem
 Results interpretable in many ways (story telling, cherry picking)
 Understanding the biological question
 Field is moving incredibly fast
 Lack of standards, immature/abandoned software
 Standard of today obsolete tomorrow
 Much time spent on collecting/cleaning-up data, troubleshooting errors
 Stay flexible, don't overinvest in single platform/technology
 Hundreds of software tools and databases out there
 Easy to get lost
 Important to understand their strengths and weaknesses

Which tools should I use?
179 tools
Heard of: 65%
Used: 30%

Things to have in your bioinformatics toolbox
Must haves
 Linux command line
 Scripting language with associated Bio* library (BioPerl, BioPython, R/Bioconductor, …)
 Basic statistical tests, regression, p-values, maximum likelihood, multiple testing correction
 Sequence alignment (FASTA & BLAST)
 Biological databases
 Regular expressions
 Sequencing technologies
 Web technologies (HTML, XML, …)
Highly recommended
 Advanced R skills
 Parallel/distributed computing
 DBMS, SQL
 (Semi-)compiled language (C/C++, Java)
 Dimensionality reduction (e.g. PCA)
 Cluster analysis
 Support Vector Machines
 Hidden Markov models
 Web framework (e.g. Django)
 Version control system (e.g. Git)
 Advanced text editor (Emacs, vim)
 IDE (e.g. Eclipse, NetBeans)

What programming language should I learn?
[Table: requirement vs. recommended language]
 Speed matters, low-level programming
 Rich-client enterprise application development
 Text file processing (regex)
 Statistical analysis, fancy plots
 Rapid prototyping, readable & maintainable scripts
 Workflow automation
Be a jack of all trades, master of ONE!

Perl on decline, R and Python gaining popularity
Perl most popular bioinformatics programming language in 2008
R and Python take the lead in 2014
http://computationalproteomic.blogspot.co.uk/2013/10/which-are-best-programming-languages.html
http://openwetware.org/wiki/Image:Most_Popular_Bioinformatics_Programming_Languages.png

Top 10 most common and/or annoying mistakes in bioinformatics
Inspired by “What Are The Most Common Stupid Mistakes In Bioinformatics?” (https://www.biostars.org/p/7126/)

# 10
Using genome coordinates with wrong
genome version
(for example, using gene coordinates from human genome
version hg18 but reference sequence from version hg19)

# 9
Forgetting to process the second strand of
DNA sequence

# 8
Processing second strand of DNA sequence,
but taking reverse instead of reverse
complement sequence
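
The difference between the two operations can be sketched in a few lines of Python. This is a minimal illustration rather than a replacement for a tested library routine; the function name and the example sequence are made up for this sketch, and only A, C, G, T and N are handled.

```python
# Complement table for the four bases plus 'N' (unknown base), both cases.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    """Return the opposite strand, read in its own 5' to 3' direction."""
    # First complement every base, then reverse the whole string.
    return seq.translate(_COMPLEMENT)[::-1]


seq = "ATGCCN"
print(seq[::-1])                # NCCGTA  (plain reverse: the mistake)
print(reverse_complement(seq))  # NGGCAT  (reverse complement)
```

In real work the equivalent function from a Bio* library is usually the safer choice.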

# 7
Not accounting for different human
chromosome names between
UCSC and Ensembl
Example:
UCSC: “chr1”
Ensembl: “1”
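
A small conversion helper avoids silent mismatches when joining files from both sources. The sketch below is an assumption-laden illustration: the function names are invented here, and it only covers the primary chromosomes plus the mitochondrial genome (UCSC "chrM", Ensembl "MT"). Unplaced scaffolds and alternate contigs follow other naming rules and are not handled.

```python
def ucsc_to_ensembl(name: str) -> str:
    """Convert 'chr1' -> '1', 'chrX' -> 'X', 'chrM' -> 'MT'."""
    if name == "chrM":
        return "MT"
    # Strip the 'chr' prefix if present; leave other names untouched.
    return name[3:] if name.startswith("chr") else name


def ensembl_to_ucsc(name: str) -> str:
    """Convert '1' -> 'chr1', 'X' -> 'chrX', 'MT' -> 'chrM'."""
    if name == "MT":
        return "chrM"
    return name if name.startswith("chr") else "chr" + name


print(ucsc_to_ensembl("chr1"))  # 1
print(ensembl_to_ucsc("MT"))    # chrM
```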

# 6
Assuming the alphabetical order of
chromosome names is
“chr1”, “chr2”, “chr3”, …
when in fact it is
“chr1”, “chr10”, “chr11”, …
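
One way to get the intended order is a custom sort key. The sketch below is an illustration with a made-up key function: numbered chromosomes sort numerically and everything else (X, Y, M, ...) follows in alphabetical order.

```python
def chrom_key(name: str):
    """Sort key: numbered chromosomes numerically, then the rest by name."""
    core = name[3:] if name.startswith("chr") else name
    if core.isdigit():
        return (0, int(core), "")
    return (1, 0, core)


names = ["chr10", "chr2", "chrX", "chr1", "chr11"]
print(sorted(names))                 # ['chr1', 'chr10', 'chr11', 'chr2', 'chrX']
print(sorted(names, key=chrom_key))  # ['chr1', 'chr2', 'chr10', 'chr11', 'chrX']
```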

# 5
Assuming 'tab' field separator
when in fact it is 'blank'
(or vice versa)
(look almost identical in text editor)
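
A cheap safeguard is to check the number of fields after splitting and to fail loudly when it is not what the file format promises. The helper below is a sketch with a hypothetical name and a made-up three-column example.

```python
def split_fields(line: str, expected: int, sep: str = "\t"):
    """Split a line on sep and insist on the expected number of fields."""
    fields = line.rstrip("\n").split(sep)
    if len(fields) != expected:
        raise ValueError(
            f"expected {expected} fields separated by {sep!r}, "
            f"got {len(fields)}: {line!r}"
        )
    return fields


print(split_fields("geneA\t100\t200\n", expected=3))  # ['geneA', '100', '200']
print(repr("geneA 100 200"))  # repr() makes the separator visible
```

Calling `split_fields("geneA 100 200", expected=3)` raises a `ValueError` instead of silently returning a single field.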

# 4
Assuming DNA sequence consists of only
four letters (A, T, C, G) while in fact
there is a fifth:
'N' for an unknown base
('X' for an unknown amino acid)
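
Code that counts or indexes bases should therefore decide explicitly what to do with the extra letter. The sketch below uses invented function names and accepts only A, C, G, T and N; the IUPAC nucleotide code defines further ambiguity letters (R, Y, ...), which this example deliberately rejects rather than guesses at.

```python
from collections import Counter


def base_composition(seq: str) -> dict:
    """Count A, C, G, T and N; refuse any other character."""
    counts = Counter(seq.upper())
    unexpected = set(counts) - set("ACGTN")
    if unexpected:
        raise ValueError(f"unexpected characters: {sorted(unexpected)}")
    return {base: counts.get(base, 0) for base in "ACGTN"}


def gc_content(seq: str) -> float:
    """GC fraction over called bases only, so that N does not dilute it."""
    comp = base_composition(seq)
    called = comp["A"] + comp["C"] + comp["G"] + comp["T"]
    return (comp["G"] + comp["C"]) / called if called else 0.0


print(base_composition("GCNNAT"))  # {'A': 1, 'C': 1, 'G': 1, 'T': 1, 'N': 2}
print(gc_content("GCNNAT"))        # 0.5
```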

# 3
Forgetting to use dos2unix on a Windows text file
before processing it under Linux
plus spending 1 hour to debug the problem
plus being tricked by this multiple times
Text file line breaks differ between platforms:
Linux (LF); Windows (CR+LF); classic Mac (CR).
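
What dos2unix does for Windows files can be sketched in Python as follows. The function name and the sample data are illustrative; the sketch also covers the classic Mac case.

```python
def normalize_newlines(data: bytes) -> bytes:
    """Convert Windows (CR+LF) and classic Mac (CR) line breaks to LF."""
    # Order matters: handle CR+LF first so it does not become two LFs.
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")


windows_text = b"geneA\t100\r\ngeneB\t200\r\n"

# The typical symptom: the last field of each line carries a hidden CR.
last_field = windows_text.split(b"\n")[0].split(b"\t")[-1]
print(last_field)                        # b'100\r'
print(normalize_newlines(windows_text))  # b'geneA\t100\ngeneB\t200\n'
```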

# 2
When importing data into MS Excel, letting it
auto-convert HUGO gene names into dates
and forgetting about it
(e.g., tumor suppressor gene “DEC1” will be converted to “1-DEC” on import)
~30 genes in total
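
A quick sanity check on a gene column can catch the damage before it spreads. The sketch below assumes the converted values look like the "1-DEC" example above (day, hyphen, three-letter month); other Excel locales or versions may display dates differently, so the pattern and function names are illustrative only.

```python
import re

_MONTHS = "JAN|FEB|MAR|APR|MAY|JUN|JUL|AUG|SEP|OCT|NOV|DEC"
# Matches values such as '1-DEC' or '2-Sep' (assumed display format).
_DATE_LIKE = re.compile(rf"^\d{{1,2}}-(?:{_MONTHS})$", re.IGNORECASE)


def looks_like_excel_date(value: str) -> bool:
    """True if a supposed gene name looks like a date produced by Excel."""
    return bool(_DATE_LIKE.match(value.strip()))


def flag_suspicious(gene_names):
    """Return the entries that should be checked by hand."""
    return [g for g in gene_names if looks_like_excel_date(g)]


print(flag_suspicious(["TP53", "1-DEC", "DEC1", "2-Sep"]))  # ['1-DEC', '2-Sep']
```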

#1
Off-by-one error
There are only two common problems in bioinformatics:
(1) lack of standards, (2) ID conversion, and
(3) off-by-one errors
http://en.wikipedia.org/wiki/Off-by-one_error
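
A frequent source of off-by-one errors is mixing coordinate conventions: BED files use 0-based, half-open intervals, whereas GFF and VCF positions are 1-based and inclusive. The slide does not name these formats, so take the sketch below as one illustrative case; the function names are invented.

```python
def one_based_closed_to_zero_based_half_open(start: int, end: int):
    """E.g. GFF-style (1, 3) -> BED-style (0, 3)."""
    return start - 1, end


def zero_based_half_open_to_one_based_closed(start: int, end: int):
    """E.g. BED-style (0, 3) -> GFF-style (1, 3)."""
    return start + 1, end


seq = "ACGTACGT"
# "Bases 1 to 3" in 1-based, inclusive coordinates...
start, end = one_based_closed_to_zero_based_half_open(1, 3)
# ...map directly onto a Python slice, which is 0-based and half-open.
print(seq[start:end])  # ACG
print(end - start)     # 3 (interval length in half-open coordinates)
```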

Ten personal recommendations for
your future work as a bioinformatician

#1 - Learn Linux!
 Most bioinformatics tools not available
on Windows
 Linux file systems better for many and/or very large files
 Command line interface (CLI) has advantages over
graphical user interface (GUI)
 Recorded command history (reproducibility)
 Key stroke to re-run analysis, instead of repeating 100 mouse
clicks
 Linux CLI (Shell) much more powerful than Windows CLI

# 2 - Embrace the “Unix tools philosophy”
 Small programs (“tools”) instead of monolithic applications
 Designed for simple, specific tasks that are performed well
(awk, cat, grep, wc, etc.)
 Many and well documented parameters
 Combined with Unix pipes (read from STDIN, write to STDOUT)
 cut -f 3 myfile.txt | sort | uniq
 Advantages
 Great flexibility, easy re-use of existing tools
 Intermediate output can be stored and inspected for troubleshooting
 Complex tasks can be performed quickly with shell 'one-liners'
 This paradigm fits bioinformatics well, where often many
heterogeneous data files need to be processed in many
different ways
http://www.linuxdevcenter.com/lpt/a/302
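
For comparison, here is roughly what the one-liner `cut -f 3 myfile.txt | sort | uniq` does, written out in Python. The function name and sample data are made up, the sketch assumes every line has at least three tab-separated fields, and the sort order may differ from a locale-aware `sort`.

```python
import io


def unique_column_values(lines, column: int = 3, sep: str = "\t"):
    """Collect the distinct values of one column, sorted."""
    values = set()
    for line in lines:
        fields = line.rstrip("\n").split(sep)
        values.add(fields[column - 1])  # column is 1-based, like cut -f
    return sorted(values)


example = io.StringIO("s1\tx\tliver\ns2\ty\tbrain\ns3\tz\tliver\n")
print(unique_column_values(example))  # ['brain', 'liver']
```

The shell version needs one line and no new code, which is the point of the philosophy.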

Example NGS use case demonstrating the power
of the Unix tools philosophy
samtools mpileup -uf ref.fa aln.bam |
bcftools view -cg - |
vcfutils.pl vcf2fq > cns.fq
http://samtools.sourceforge.net/mpileup.shtml
 Explanation
 'samtools mpileup' piles up short reads from the input BAM file for each position in the reference genome
 'bcftools view' calls the variants
 'vcfutils.pl vcf2fq' computes the consensus sequence
 The resulting FASTQ sequence is redirected to the output file cns.fq
 By knowing available tools and their parameters, bioinformatics 'wizards' can get complex stuff done in almost no time

#3 - Don’t reinvent the wheel
 Coding is fun, but look
around before you hack
into your keyboard
 Don't write the 29th FASTA
file parser if proven solutions
are available
 BioPerl
 BioPython
 Bioconductor

#4 - If you happen to invent a wheel, …
 Document source and parameters well
 Use version control system (git, svn)
 Deposit code in public repository
 sourceforge.net
 github.com
 Write test cases

# 5 - Automate pipelines with GNU Make
 Make was developed in the 1970s to build executables from
source files
 Incredibly useful for data-driven workflows as well
 Automatic error checking
 Parallelization (utilize multiple cores)
 Incremental builds (re-start your pipeline from point of failure)
 Mature and thoroughly tested
 Get started at
http://www.bioinformaticszen.com/post/decomplected-workflows-makefiles/
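
The idea behind incremental builds is simple: a target is rebuilt only if it is missing or older than one of its inputs. The Python sketch below illustrates that timestamp rule only; it is not how Make is implemented, and the function and file names are hypothetical.

```python
import os


def needs_rebuild(target: str, prerequisites) -> bool:
    """Timestamp rule: rebuild if the target is missing or out of date."""
    if not os.path.exists(target):
        return True
    target_mtime = os.path.getmtime(target)
    # Any input modified after the target makes the target stale.
    # A missing prerequisite raises an error here, which is intentional.
    return any(os.path.getmtime(p) > target_mtime for p in prerequisites)


# Hypothetical pipeline step: counts.txt is derived from reads.txt.
if os.path.exists("reads.txt") and needs_rebuild("counts.txt", ["reads.txt"]):
    print("counts.txt is out of date, re-run this step")
```

Because finished steps are skipped, a pipeline that failed halfway resumes at the failed step.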

# 6 - Value your time
 Architecture vs. accomplishment
 “Perfect is the enemy of the good” -- Voltaire
 OO design and normalized databases are nice, but can be
overkill if requirements change from analysis to analysis
 Automate what can be automated
 Reproducibility
 Easy to repeat analysis with slightly changed parameters
 BUT: Don't spend two days automating a one-time
analysis that can be done manually in 10 minutes

# 7 – Make use of free online resources to learn
about specialized topics
 www.coursera.org
 Bioinformatics Algorithms
(https://www.coursera.org/course/bioinformatics)
 Computing for Data Analysis
(https://www.coursera.org/course/compdata)
 R Programming
(https://www.coursera.org/course/rprog)
 https://www.edx.org/
 Data Analysis for Genomics (https://www.edx.org/course/harvardx/harvardx-ph525x-data-analysis-genomics-1401#.U1TUbXV52R8)
 Introduction to Biology (https://www.edx.org/course/mitx/mitx-7-00x-introduction-biology-secret-1768#.U1TVL3V52R8)
 http://rosalind.info/problems/locations/

# 8 - Become an expert
 Identify an area of interest
and get really good at it
 Work at places where you
can learn from the best
 Spend time abroad
 Great experience
 Labs/companies will hire you not only for what you
know, but also for who you know

# 9 - Decide early on if you want to stay in
academia or go into industry
Academia
• PhD highly recommended
• Take your time to find compatible supervisor
+ Freedom to pursue own ideas
+ Very flexible working hours
+ Work independently
- Steep & competitive career ladder (postdoc >> PI/prof)
- Lower pay
- Publish or perish
Industry
• PhD beneficial (to get in), but not necessarily required for daily work (e.g. build/maintain databases)
+ More frequent (positive) feedback
+ Higher pay
+ Job security
- More (external) deadlines
- Higher pressure to get things done
See also “Ten Simple Rules for Choosing between Industry and Academia” (David B. Searls, 2009)

# 10 - Stay informed & get connected
 Follow literature and blogs
 http://en.wikipedia.org/wiki/List_of_bioinformatics_journals
 http://www.homolog.us/blogs/blog/2012/07/27/how-to-stay-current-in-bioinformaticsgenomics/
 Subscribe via RSS feeds
 http://feedly.com or others
 Platform independent (e.g. read on your phone)
 Bioinformatics Q&A forums
 http://www.biostars.org (highly recommended)
 http://seqanswers.com/ (focus on NGS)
 http://www.reddit.com/r/bioinformatics/ (student-oriented)
 Other
 http://bioinformatics.org – fosters collaboration in bioinformatics
 http://www.researchgate.net – “Facebook” for researchers
 German bioinformatics group on XING (https://www.xing.com/net/pri485482x/bin)

Conclusion
 As a bioinformatician, you will be at the
forefront of one of the greatest scientific
enterprises of our time
 Biologists overwhelmed with massive
data sets
 YOU will get to see exciting results first
 Requires integration of knowledge from many domains
 IT, biology, medicine, statistics, math, …
 Knowing your informatics toolbox AND understanding the biological
question is what makes you very valuable

Thank you!
Christian Frech
frech.christian@gmail.com

Further Reading
 “So you want to be a computational biologist?”
http://www.nature.com/nbt/journal/v31/n11/full/nbt.2740.html
 “What It Takes to Be a Bioinformatician”
http://nav4bioinfo.wordpress.com/2013/03/19/what-it-takes-to-be-a-bioinformatician/
 “The alternative 'what it takes to be a bioinformatician'”
https://biomickwatson.wordpress.com/2013/03/18/the-alternative-what-it-takes-to-be-a-bioinformatician/
 “So You Want To Be a Computational Biologist, Or A Bioinformatician?”
http://www.checkmatescientist.net/2013/11/so-you-want-to-be-computational.html
 “Being a bioinformatician is hard”
http://www.bioinformaticszen.com/post/being-a-bioinformatician-is-hard/
 “How not to be a bioinformatician”
http://www.scfbm.org/content/7/1/3
 “Ten Simple Rules for Reproducible Computational Research”
http://www.ploscollections.org/article/info%3Adoi%2F10.1371%2Fjournal.pcbi.1003285
 “Ten Simple Rules for Getting Ahead as a Computational Biologist in Academia”
http://www.ploscollections.org/article/info%3Adoi%2F10.1371%2Fjournal.pcbi.1002001;jsessionid=6D5D844E0E2E21C9E565378C7F714D76
 “A Quick Guide for Developing Effective Bioinformatics Programming Skills”
http://www.ploscompbiol.org/article/info%3Adoi%2F10.1371%2Fjournal.pcbi.1000589
 “What Is Really the Salary of a Bioinformatician/Computational Biologist?”
http://www.homolog.us/blogs/blog/2014/04/02/what-is-really-the-salary-of-a-bioinformaticiancomputational-biologist/
