21. A language all scientists should know
How R helped me look at billions of genotypes and how it can
help you too
Mitchell Bekritsky
WSBS Graduate Student
22. What is R?
• Language for statistical
analysis, data manipulation
and graphics
• Open source
• Flexible language
• Powerful built-in functions
• Strong user community
• Publication quality graphs
• Free!
Graphic from http://blenditbayes.blogspot.com/2013/06/visualising-crime-hotspots-in-england_25.html
24. Who uses R?
Source: http://www.revolutionanalytics.com/what-is-open-source-r/companies-using-r.php
25. What is R used for?
• Movie recommendations
• Clinical drug development
• Credit risk analysis
• News graphics
• Tailoring online advertising
• Modeling oil spills
• Predicting economic activity
• Predicting election outcomes
Graphic from http://www.nytimes.com/interactive/2009/06/25/arts/0625-jackson-graphic.html
27. How R helped me see my data
• First time looking at microsatellite genotypes
• How many microsatellites differ from reference genome?
• By how much?
Problems:
– Lots of data (4.7 million genotypes)
– Complex information
– Too big for Excel
– No good graphics in Excel either
28. One of my first graphs in R
Lessons learned about my data
• Lots of microsatellites differ
from reference by a little bit
• Thousands differ by ± 20 bp
• 8.27% of all microsatellites
differ from reference (~400k)
Lessons learned about my graph
• This is a terrible graph
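A first look of this kind takes only a few lines of base R. The sketch below is not the original analysis: it runs on simulated data, and the vector name `diff_bp`, the assumed 8% non-reference rate and the shape of the simulated distribution are all illustrative assumptions.

```r
# Simulated stand-in for real data: one value per microsatellite genotype,
# giving its length difference from the reference in bp (0 = matches reference).
set.seed(1)
n <- 100000
differs <- runif(n) < 0.08                      # assumed ~8% non-reference
diff_bp <- ifelse(differs,
                  sample(c(-1, 1), n, replace = TRUE) * (1 + rpois(n, 2)),
                  0)

# How many microsatellites differ from the reference?
frac_diff <- mean(diff_bp != 0)
cat(sprintf("%.2f%% differ from reference\n", 100 * frac_diff))

# By how much? Tabulate and plot only the non-reference genotypes.
diff_counts <- table(diff_bp[diff_bp != 0])
barplot(diff_counts, log = "y",
        xlab = "Difference from reference (bp)",
        ylab = "Number of genotypes (log scale)")
```

Even a rough plot like this answers both questions on the previous slide: how many genotypes differ, and by how much.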
29. A bad R graph is better than no R graph
Bad graphs helped me
• Understand my data better
• Improve my analyses
• Improve how I communicate
my data
• R has incredible flexibility for
graphing—if you can dream it,
you can probably build it
My best R graphs make one point clearly without clutter
32. How R saved my thesis
• Processing lots of sequencing
data in hundreds of people
• Too many people and
processes to monitor all steps
of pipeline by eye while data
was being processed
Sanity check
• After data processing, did the data
look bi-allelic?
No!!
34. Troubleshooting using R
• People don’t actually have massive deletions and amplifications
• My pipeline was deleting files because of a bug, which would
remove large chunks of chromosomes
• Thanks to R, I found people where this had happened, tracked
down the bug, and didn’t report massive CNVs in autistic children
Side note
• If it looks too good to be true, it probably is
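One way to find the affected people is to look for long stretches with no coverage in each sample. This is a minimal sketch on simulated data, not the pipeline check actually used; the matrix `coverage`, the sample names, the bin layout and the threshold of 10 bins are hypothetical.

```r
# Simulated stand-in: read coverage in 200 consecutive bins along one
# chromosome for 5 hypothetical samples (rows = bins, columns = samples).
set.seed(2)
coverage <- matrix(rpois(200 * 5, lambda = 30), nrow = 200,
                   dimnames = list(NULL, paste0("sample", 1:5)))
coverage[81:140, "sample3"] <- 0   # mimic a chunk lost to a pipeline bug

# Length of the longest run of consecutive zero-coverage bins in one sample.
longest_zero_run <- function(x) {
  runs <- rle(x == 0)
  if (!any(runs$values)) return(0L)
  max(runs$lengths[runs$values])
}

zero_runs <- apply(coverage, 2, longest_zero_run)

# Flag samples with a suspiciously long gap (threshold is an arbitrary choice).
suspect <- names(zero_runs)[zero_runs >= 10]
print(zero_runs)
print(suspect)
```

Samples flagged this way are candidates for a closer look at the pipeline logs before any apparent deletion is treated as biology.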
35. R helped me build a better genotyper
• Some non-reference alleles
aren’t covered well
• Leads to incorrect genotype
calls
Problem
• How do I develop a smarter
genotyper and know that it
works?
[Figure: chr19:54772760 A repeat, reference length 8. Axes: 8 bp allele coverage and 10 bp allele coverage (each 0 to 100). Points are labeled by genotype: 10|-1, 10|10, 8|-1, 8|10, 8|8.]
37. Modeling genotypes in R
• Built a model for biased
genotypes in R
• Model helped me build a more
accurate genotyper
• When applied to real data,
clear improvements
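The idea of modeling biased genotypes can be illustrated with a small simulation. This is a sketch under stated assumptions, not the model from the talk: it assumes a binomial read-count model, a fixed depth of 30, a hypothetical bias in which only 25% of reads at a true 8|10 site come from the 10 bp allele, and a 2% per-read error rate. The function `call_genotype` is made up for this example.

```r
set.seed(3)
n_sites <- 2000
depth   <- 30
bias    <- 0.25   # assumed fraction of 10 bp reads at a true 8|10 site (0.5 if unbiased)
err     <- 0.02   # assumed per-read error rate

# Simulate the number of 10 bp reads at true heterozygous (8|10) sites.
reads_10 <- rbinom(n_sites, size = depth, prob = bias)

# Call each genotype by maximum binomial likelihood, given the expected
# fraction of 10 bp reads under a heterozygous genotype (het_frac).
call_genotype <- function(k, n, het_frac) {
  loglik <- cbind("8|8"   = dbinom(k, n, err,      log = TRUE),
                  "8|10"  = dbinom(k, n, het_frac, log = TRUE),
                  "10|10" = dbinom(k, n, 1 - err,  log = TRUE))
  colnames(loglik)[max.col(loglik, ties.method = "first")]
}

naive_calls <- call_genotype(reads_10, depth, het_frac = 0.5)   # ignores bias
aware_calls <- call_genotype(reads_10, depth, het_frac = bias)  # models bias

acc_naive <- mean(naive_calls == "8|10")
acc_aware <- mean(aware_calls == "8|10")
cat(sprintf("naive: %.3f  bias-aware: %.3f\n", acc_naive, acc_aware))
```

Because the true genotype is known in a simulation, the two callers can be compared directly, which is one way to check that a smarter genotyper actually works before applying it to real data.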
38. R finds de novo mutations for me
• >300 million genotypes
• How do I find de novo mutations in all that data?
R to the rescue!
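A simple version of this search is a vectorized Mendelian-consistency check across family trios. The sketch below is illustrative only: the data frame `trios`, its column names and the four example loci are hypothetical, and a real analysis would also need to account for coverage and genotyping error.

```r
# Hypothetical trio genotypes: allele lengths (bp) at each locus.
trios <- data.frame(
  locus  = c("locA", "locB", "locC", "locD"),
  child1 = c(8, 8, 10, 12),  child2 = c(10, 8, 14, 12),
  mom1   = c(8, 8, 10, 12),  mom2   = c(8, 8, 12, 12),
  dad1   = c(10, 8, 10, 10), dad2   = c(10, 10, 12, 12),
  stringsAsFactors = FALSE
)

# Is allele 'a' present in the mother's (or father's) genotype at each locus?
in_mom <- function(a) a == trios$mom1 | a == trios$mom2
in_dad <- function(a) a == trios$dad1 | a == trios$dad2

# A child genotype is Mendelian-consistent if one allele can come from the
# mother and the other from the father, in either order.
consistent <- (in_mom(trios$child1) & in_dad(trios$child2)) |
              (in_mom(trios$child2) & in_dad(trios$child1))

# Anything inconsistent is a de novo candidate.
de_novo_candidates <- trios[!consistent, ]
print(de_novo_candidates$locus)
```

Since every comparison is vectorized, the same few lines apply unchanged to a table with millions of rows, memory permitting.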
39. What R has done for me
Data mining
• Finding de novo mutations
• Quality control for my data
Data manipulation
• Converting raw read counts to genotypes
Data simulation and modeling
• Finding ways to improve my genotyper
Data visualization
40. R has extensive support for biologists
Bioconductor is an incredible resource for biological analyses in R
• Microarrays
• Differential expression (DESeq, edgeR, cummeRbund)
• Gene models
• Flow cytometry (flowCore, flowStats, flowViz)
• Interacting with Ensembl, COSMIC, Gramene, etc. (biomaRt)
41. Installing R
• R can be downloaded from r-project.org
• R runs on PCs, Macs and
Linux computers
• The R project website has an
R manual to get you started
42. Working in R
Native R interface can be hard to
work with
• Lots of windows
• Difficult to keep things
organized
43. RStudio interface
• All your variables, help pages,
script windows and consoles
in one place
• Highlights R code for easier
programming
• Tabbed windows for multiple
scripts
• History saves all previous
commands, plot history saves
all previous plots
• Find it at rstudio.com
44. Learning R
Many online tutorials
• R has its own introduction
• Statistics Using R with Biological Examples
Take interesting data, use it to explore R
• Plot, graph, use statistical tests
Ask someone who knows R
• Getting started is pretty easy
• Learn what you need when you need it
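As one possible starting point, the steps above (take data, plot it, run a test) can be tried on a dataset that ships with R. The choice of `iris` and of a t-test here is just an example, not a recommendation from the talk.

```r
# iris ships with R: flower measurements for three iris species.
data(iris)
print(summary(iris$Sepal.Length))

# Plot: distribution of sepal length by species.
boxplot(Sepal.Length ~ Species, data = iris,
        ylab = "Sepal length (cm)")

# Statistical test: do setosa and versicolor differ in mean sepal length?
two_species <- droplevels(subset(iris, Species %in% c("setosa", "versicolor")))
tt <- t.test(Sepal.Length ~ Species, data = two_species)
print(tt)
```

Typing `?boxplot` or `?t.test` at the console opens the help page for each function, which is a quick way to learn what you need when you need it.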
47. The Bioscience Enterprise Club is dedicated to helping CSHL’s science research
professionals and alumni cultivate and leverage their cross-disciplinary skill sets and
expertise to transition into diverse careers.
48. Current Exchange is CSHL’s very own student-run magazine. We feature articles about
science aimed at a general audience. Check out our inaugural issue at
issuu.com/currentexchange
Send your articles to raboukha@cshl.edu by November 5, 2013