There’s No Avoiding It:
Programming Skills You’ll Need

Yannick Pouliot, PhD
10/14/2011
Three Things I Want To Impress
• Why software programming is essential for
bioresearch
▫ … as essential as knowing how to use a pipette
• Why you should partially dump Excel and use a
relational database
• Why the Cloud is your friend
The Good News
• Free software!
• Free algorithms!
• Pre-coded algorithms (i.e., packages)!
• Very cheap computing power!
The Bad News
• Dunno how to use
• “Not talented”
• “Not enough time”
• (can’t be bothered)
▫ e.g., reading the paper describing the software tool
one is relying on
More Good News
• Not that hard
• Lots and lots of good resources
• Read a book, dammit
• Find a buddy
• Use Cloud instances (preconfigured machines)
▫ Can even be free!
The Quest For Situation-Appropriate Storage & Computation
Or, when Excel fails you
Some Questions…
1. Do you use MS Excel?
2. How much time do you spend using it?
3. Are you good at it? Be honest…
4. Have you ever read a book or tutorial on Excel?
5. So how are you going to improve your ability?
Are You an Excelaholic?
• Do you have an unhealthy dependence on Excel?
▫ Do you use Excel to store data?
▫ Do you feel like you’re making Excel jump through
hoops to perform your calculations?
▫ Do you have a vague feeling of shame as a result?
The Worst Case (More Frequent Than You’d Wish)
• Postdoc uses Excel to keep track of complex
experiment involving two external groups
• Eventually realizes that data stored in Excel were
corrupted (“paste failure”)
▫ Result: it took her six months to recover

• She now uses FileMaker (relational database)
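A minimal sketch of why a relational database resists this kind of silent corruption, using Python's built-in sqlite3 module rather than FileMaker. The schema, table names, and sample values are hypothetical and only illustrate the idea: the database checks every row against declared rules, so a stray paste is rejected instead of being stored.

```python
import sqlite3

def build_lab_db():
    """Create an in-memory database with two linked tables (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    # SQLite enforces foreign keys only when this is switched on per connection.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript("""
        CREATE TABLE sample (
            sample_id    TEXT PRIMARY KEY,
            source_group TEXT NOT NULL
        );
        CREATE TABLE measurement (
            measurement_id INTEGER PRIMARY KEY,
            sample_id      TEXT NOT NULL REFERENCES sample(sample_id),
            assay          TEXT NOT NULL,
            value          REAL NOT NULL CHECK (value >= 0),
            UNIQUE (sample_id, assay)
        );
    """)
    return conn

def try_insert(conn, sample_id, assay, value):
    """Return True if the row was accepted, False if a constraint rejected it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO measurement (sample_id, assay, value) VALUES (?, ?, ?)",
                (sample_id, assay, value),
            )
        return True
    except sqlite3.IntegrityError:
        return False

conn = build_lab_db()
with conn:
    conn.execute("INSERT INTO sample VALUES ('S1', 'external group A')")

results = [
    try_insert(conn, "S1", "qPCR", 12.5),   # valid row
    try_insert(conn, "S1", "qPCR", 13.0),   # duplicate sample/assay pair
    try_insert(conn, "S9", "qPCR", 4.0),    # sample that was never registered
    try_insert(conn, "S1", "ELISA", -1.0),  # impossible negative value
]
print(results)
```

Only the first row is stored. In a spreadsheet, all four would have been accepted without complaint.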
The Next Level Up: Relational Databases
Take Your Pick
A Real Example From Yours Truly
But You Also Need Programming…
Why Programming?
• Address small problems that can nail you
• Address bigger problems by standing on the
shoulders of giants
• Flexibility: If you’re doing “real” science, off-the-shelf software will fail you every time
▫ 80% rule…
Don’t Try This With Excel
• Millions of reads
compared against mouse
transcriptome
• Determining number of
distinct species and
frequency of members in
each
• Summarize using plots
for each codon
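The counting step above can be sketched in a few lines of Python. This is only an illustration under assumptions: the input format (tab-separated read ID and assigned group) and the names `count_groups` and `frequencies` are hypothetical, and "species" is taken to mean whatever label each read was assigned. The point is that the file is streamed line by line, so millions of reads never need to fit in a worksheet.

```python
from collections import Counter
import io

def count_groups(lines):
    """Count reads per group from tab-separated 'read_id<TAB>group' lines."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split("\t")
        if len(fields) < 2:
            continue  # skip malformed lines rather than guessing
        counts[fields[1]] += 1
    return counts

def frequencies(counts):
    """Convert raw counts into the fraction of all reads in each group."""
    total = sum(counts.values())
    return {group: n / total for group, n in counts.items()}

# Tiny in-memory stand-in for a real file; with real data, pass open(path) instead.
example = io.StringIO("read1\tgeneA\nread2\tgeneB\nread3\tgeneA\nread4\tgeneA\n")
counts = count_groups(example)
freqs = frequencies(counts)

print(len(counts), "distinct groups")
for group, n in counts.most_common():
    print(group, n, freqs[group])
```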
Remember SQL?
The Quest For Power
Heard at lab meeting:

“I would have shown you this graph
but Excel crashed while computing
a big file”
→ You can’t do this (censored) on your
laptop anymore
Welcome To The Cloud
Why Own When You Can Rent?
An Example: PathSeq
• Compare millions of short-read sequences
against all genomic + transcriptomic sequences
for all microbes (!)

Amazon Cloud
“Management Console”
Why The Cloud Matters For Biologists
• You can purchase as much computing power as you
need
▫ You don’t have to run/manage what you don’t use

• You’re purchasing computing power, not machines
▫ Never outdated

• Can easily migrate from one machine type to
another (minutes)
• Can add storage in seconds
• Accessible from anywhere
• Easy to share (e.g., large datasets) with others

WEKA: the software
• Machine learning/data mining software written
in Java (distributed under the GNU General Public
License)
• Used for research, education, and applications
• Complements “Data Mining” by Witten & Frank
• Main features:
▫ Comprehensive set of data pre-processing
tools, learning algorithms and evaluation methods
▫ Graphical user interfaces (incl. data visualization)
▫ Environment for comparing learning algorithms
2/22/2014
University of Waikato


Explorer: building “classifiers”
• Classifiers in WEKA are models for predicting
nominal or numeric quantities
• Implemented learning schemes include:
▫ Decision trees and lists, instance-based
classifiers, support vector machines, multi-layer
perceptrons, logistic regression, Bayes’ nets, …

• “Meta”-classifiers include:
▫ Bagging, boosting, stacking, error-correcting
output codes, locally weighted learning, …
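To make "instance-based classifier" concrete, here is a minimal k-nearest-neighbour sketch in plain Python. It is not WEKA's implementation (WEKA is written in Java and is far more complete). The function name `knn_predict` and the toy training data are hypothetical. It shows the core idea only: a nominal label is predicted by majority vote among the closest known examples.

```python
from collections import Counter
import math

def knn_predict(training, query, k=3):
    """Predict a nominal label by majority vote among the k nearest training points."""
    # Sort training examples by Euclidean distance to the query point.
    by_distance = sorted(training, key=lambda item: math.dist(item[0], query))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Toy data: (feature vector, label). Values are made up for illustration.
training = [
    ((1.0, 1.0), "low"),
    ((1.2, 0.8), "low"),
    ((0.9, 1.1), "low"),
    ((5.0, 5.2), "high"),
    ((5.1, 4.9), "high"),
    ((4.8, 5.0), "high"),
]

prediction_near_low = knn_predict(training, (1.1, 1.0))
prediction_near_high = knn_predict(training, (5.0, 5.0))
print(prediction_near_low, prediction_near_high)
```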
