A brief introduction to the Bayesian analysis program PyRate for paleobiology colleagues. Given at a lab meeting, so the format is casual and a good chunk of prior knowledge is assumed.
1. PyRate for fun and profit research
Brianna McHorse Spring 2019
2. What does PyRate do?
"PyRate is a program to estimate speciation, extinction, and preservation rates from fossil occurrence data using a Bayesian framework."
https://github.com/dsilvestro/PyRate
With a basic PyRate analysis, we can ask questions like:
- How do speciation and extinction rates of the Canidae vary through time?
- When are there changes in speciation/extinction rates in the Crocodylia?
[Diagram: Occurrence Data → PyRate → speciation, extinction, and preservation rates through time]
6. Requirements to run PyRate
- R
- Python 2 (I usually use 2.7)
- PyRate
- Download the PyRate repository from https://github.com/dsilvestro/PyRate (click "Clone or download" and download as a .zip, then unzip)
- Occurrence data
- Check out MioMap for mammals (https://ucmp.berkeley.edu/miomap/)
- Paleobiodb or fossilworks for most things (http://paleobiodb.org, http://fossilworks.org/)
- Optional: powerful computer and/or cluster access
- This makes life easier when you're doing lots of replicates, which we'll talk about later
7. Optional follow-along step: download data
- fossilworks.org → Download → Collection, occurrence, or specimen data
- Fill in "Taxon or taxa to include" with a group of your choice (I suggest a well-populated family like Canidae, Felidae, Equidae, etc.)
- Collection fields tab: tick boxes for maximum age (Ma) and minimum age (Ma)
- Click "Create data set" (at the bottom)
- Clean and modify your data as necessary :)
- Decide on a file structure (see next slide for what I use)
- ./Data/datafile.csv refers to datafile.csv in the Data folder, which is inside the PyRate folder
- ../Data/datafile.csv does the same thing, but if your Data folder is next to the PyRate folder
- One dot refers to the same folder you're in; two dots go up to the parent folder
- Examples will proceed as if your Data folder is OUTSIDE of (next to) your PyRate folder
8. Suggested folder structure
- project_name
  - R
    - PyRate-setup.R
  - data
    - Crocodylidae.csv
    - Canis_pbdb_data.csv
  - PyRate-master
    - all the folders/files that come with your download
  - manuscript
  - etc.
In an ideal world, we would set this up as an R project.
But we'll try not to add too many new things at a time right now.
9. What does PyRate do?
"PyRate is a program to estimate speciation, extinction, and preservation rates from fossil occurrence data using a Bayesian framework."
With a basic PyRate analysis, we can ask questions like:
- How do speciation and extinction rates of the Felidae vary through time?
- When are there changes in speciation/extinction rates in the Crocodylia?
[Diagram: Occurrence Data → PyRate → speciation, extinction, and preservation rates through time]
10. A test run: BDS model
Birth-death with rate shifts
(birth = speciation aka origination, death = extinction, shifts = those rates can change)
This is your basic "what are rates doing through time and when do they change" analysis.
11. 1. Process your data in R: PyRate-setup.R
- Data cleaning (PBDB data is great, but it always has errors)
You might need to rename columns!
fossilworks.org and paleobiodb.org aren't always consistent with column names.
PyRate expects:
Species
min_age
max_age
This should work with "min_age (Ma)", and even max_ma, because they begin with the same word. But ma_max or min_ma (which are in our fossilworks datasets) would need to be renamed and will give you a cryptic error.
It's all part of data cleaning!
BDS Model
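The deck does this renaming in R, but if you prefer Python for the cleaning step, a minimal pandas sketch looks like the following. The species names and the fossilworks-style column names here are illustrative stand-ins, not part of the real dataset:

```python
import pandas as pd

# Toy stand-in for a fossilworks export (real files have many more columns)
df = pd.DataFrame({
    "species_name": ["Canis lepophagus", "Canis dirus"],
    "min_ma": [1.8, 0.011],
    "max_ma": [4.9, 0.25],
})

# PyRate keys on column names beginning with Species, min_age, max_age,
# so rename the fossilworks-style columns to what it expects.
df = df.rename(columns={"species_name": "Species",
                        "min_ma": "min_age",
                        "max_ma": "max_age"})

# Then write the cleaned file back out for the R/PyRate steps, e.g.:
# df.to_csv("../data/Canis_pbdb_data_clean.csv", index=False)
```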
14. 1. Process your data in R: PyRate-setup.R (continued)
- Define extant taxa
- extant = c("Canis rufus", "Canis lupus", "Canis aureus", "Canis latrans", "Canis mesomelas", "Canis anthus", "Pseudalopex gymnocercus", "Canis adustus", "Canis familiaris")
OR:
- extant = c("Crocodylus acutus", "Crocodylus intermedius", "Crocodylus johnsoni", "Crocodylus mindorensis", "Crocodylus moreletii", "Crocodylus niloticus", "Crocodylus novaeguineae", "Crocodylus palustris", "Crocodylus porosus", "Crocodylus rhombifer", "Crocodylus siamensis", "Crocodylus suchus", "Osteolaemus tetraspis", "Mecistops cataphractus", "Mecistops leptorhynchus")
- Source the utilities file: source("../PyRate-master/pyrate_utilities.r")
- Parse your data: extract.ages.pbdb(file="../data/[data-file].csv", extant_species=extant)
Remember our file structure?
We're in the R folder, and our R Project thinks that's home.
".." tells the program that it needs to go up one folder first, before looking for the PyRate-master or the data folders.
BDS Model
17. 2. Open up the command prompt
If you're on Windows, enclose file paths with " not with '
1. Check working directory
> dir [Windows]
> pwd [Mac terminal]
2. Set working directory to the folder project_name
> cd C:/Users/Bri/Desktop/awesome_project [Windows & Mac]
3. Check out data info
> python "./PyRate-master/PyRate.py" "./data/[data_file]_PyRate.py" -data_info
BDS Model
Both PyRate-master and data are folders in our current folder, or working directory: project_name. So, we use a single . to access them.
Translation: use Python and the stuff in the PyRate folder, on this data file, to give me the data info.
20. 3. Run your BDS analysis
Now we run the analysis, specifying a few parameters with flags (those are the things that come after a dash).
> python "./PyRate-master/PyRate.py" "./data/[data_file]_PyRate.py" -A 4 -mG -n 200000 -s 5000
BDS Model
Same as before: tell Python where PyRate is and which data file to use. These are flags!
-A 4 use algorithm 4 (RJMCMC)
-mG allow heterogeneity in preservation rate across lineages
-n 200000 do 200k iterations
-s 5000 record values every 5k iterations
https://github.com/dsilvestro/PyRate/blob/master/tutorials/pyrate_tutorial_1.md#defining-the-preservation-model
22. A wild folder appeared!
- project_name
  - R
    - PyRate-setup.R
  - data
    - pyrate_mcmc_logs
    - Crocodylidae.csv
    - Canis_pbdb_data.csv
  - PyRate-master
    - all the folders/files that come with your download
  - manuscript
  - etc.
4. Look at your results
BDS Model
This folder has results files in it:
[data_file]_1_Grj_ex_rates.log
[data_file]_1_Grj_mcmc.log
[data_file]_1_Grj_sp_rates.log
[data_file]_1_Grj_sum.log
24. Summarize model probabilities: how many rate shifts happened?
> python "./PyRate-master/PyRate.py" -mProb "./data/pyrate_mcmc_logs/[data_file]_1_Grj_mcmc.log" -b 10
4. Look at your results
BDS Model
Translation: give me the rate shift probabilities from the MCMC log in our results folder, with a burn-in of 10 samples (aka: drop the first 10 because the parameters were wandering around).
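If you want to eyeball an MCMC log yourself (say, to pick a burn-in before trusting a summary), a minimal pandas sketch, assuming the tab-separated layout PyRate's .log files generally use. The rows below are made-up toy values, not real output:

```python
import io
import pandas as pd

# Toy stand-in for a PyRate *_mcmc.log (real logs are tab-separated, many columns)
log_text = (
    "it\tposterior\tlikelihood\n"
    "0\t-510.2\t-498.7\n"
    "5000\t-402.1\t-395.3\n"
    "10000\t-398.8\t-391.0\n"
)
log = pd.read_csv(io.StringIO(log_text), sep="\t")

burnin = 1               # drop the first n sampled rows (here: 1 of 3 toy rows)
post = log.iloc[burnin:] # samples kept for summarizing
print(post["posterior"].mean())
```

Remember that burn-in here is counted in sampled rows, so with -s 5000 each dropped row is 5,000 iterations.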
26. Plot your results: what does it look like??
> python "./PyRate-master/PyRate.py" -plotRJ "./data/pyrate_mcmc_logs/" -b 10
> Rscript "./data/pyrate_mcmc_logs/RTT_plots.r"
4. Look at your results
BDS Model
29. What else can we do with PyRate?
1. Trait-correlated diversification models
2. Multivariate birth-death models
With these further analyses, we can ask questions like:
- Do larger-bodied canids go extinct more often?
- Do any/all of global temperature, the genus-level diversity of mammals, and
global proportion of swampland relative to other habitats correlate with
speciation or extinction in crocodylids?
30. Further analysis: Covar model
A trait covariation model lets speciation and extinction vary, per lineage, as a
function of an estimated correlation with a continuous trait.
Does larger body mass correlate with higher extinction rates in canids?
Covar Model
31. 1. Provide a trait data file
We want a tab-separated text file of just two columns: Species and Trait.
This is usually easiest to make in R. We'll put it in the data folder.
Covar Model
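The deck builds this file in R, but here is an equivalent Python sketch. Like the rnorm() example in the notes, the trait values are fake draws from a normal distribution, and the species names and output path are placeholders; swap in real measurements for an actual analysis:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
species = ["Canis lepophagus", "Canis dirus", "Canis latrans"]  # placeholder taxa

trait = pd.DataFrame({
    "Species": species,
    # Fake body-mass values from a normal distribution (like R's rnorm) --
    # NOT real data, for demonstration only
    "Trait": rng.normal(loc=20, scale=5, size=len(species)),
})

# PyRate wants a tab-separated file with just Species and Trait columns
out_path = "trait_data.txt"  # in the project layout: "./data/trait_data.txt"
trait.to_csv(out_path, sep="\t", index=False)
```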
32. 2. Run your Covar analysis
Again, we run the analysis, specifying a few parameters with flags.
> python "./PyRate-master/PyRate.py" "./data/[data_file]_PyRate.py" -trait_file "./data/trait_data.txt" -mCov 2 -logT 2
Covar Model
Flags we're using:
-trait_file says where the trait data file can be found
-mCov 2 mCov specifies which algorithm to use; 2 tests correlation with extinction rates only
-logT 2 specifies to transform the trait data with log10 (0 would specify no transform, and 1 is log base e)
See more at:
https://github.com/dsilvestro/PyRate/blob/master/tutorials/pyrate_tutorial_4.md#trait-correlated-diversification
All the mCov options:
-mCov 1 correlated speciation
-mCov 2 correlated extinction
-mCov 3 correlated speciation and extinction
-mCov 4 correlated preservation
-mCov 5 correlated speciation, extinction, and preservation
34.
35. Covar fix
Instead of using extract.ages.pbdb(), we need to use extract.ages() on a data
frame that already has the Trait data in it.
You can take Canis_pbdb_data.txt (in our data folder) and add a Trait column to it in
the same way we just did for the trait_data.txt file.
Then, call extract.ages() on it and it should work.
36. What else can we do with PyRate?
1. Trait-correlated diversification models
2. Multivariate birth-death models
[Flowchart: Occurrences feed the BDS Model, which estimates preservation rates, origination/extinction times, and origination/extinction rates. The MBD Model adds other clades' origination/extinction times and environmental variables, asking: do rates correlate with other clade diversity or environmental factors like global temperature? The Covar Model adds a continuous trait, asking: do rates correlate with a continuous trait (on a lineage-specific basis)?]
38. What else can we do with PyRate?
1. Trait-correlated diversification models
2. Multivariate birth-death models
3. MULTIPLE REPLICATES (discuss)
Editor's Notes
https://github.com/dsilvestro/PyRate
See tutorials at: https://github.com/dsilvestro/PyRate/tree/master/tutorials
They're fairly regularly updated, but sometimes have slightly out-of-date syntax.
Diversification rates are an entire field, because fundamentally, it's interesting to know why clades are shaped the way they are. Did lots of things go extinct really fast? Did speciation drop off a cliff?
It's Bayesian because we start with data and say: given these data, what's the likelihood of these parameters? (As opposed to "standard" or frequentist statistics, where you start with the parameters and figure out the probability of getting your data.) Then we make some small tweaks to the proposed parameters and test again, which is the Markov chain Monte Carlo bit.
We do all of this using occurrence data.
An occurrence is literally just that: someone found a fossil of a thing, decided what it was, and put it in a database. We'll work with the Paleobiology Database today, a very common source. The PBDB only works from published occurrences, so it's smaller but more curated than some others.
You can access the data from fossilworks.org or paleobiodb.org. They are run by different people but should technically have the same content.
Install R: https://www.r-project.org/
I recommend RStudio as a GUI for working with R: https://www.rstudio.com/. Although many of the basic tasks for setting up a PyRate analysis can be done by opening R at the command line, it's not as comfortable as using RStudio for most people.
How to check if you have Python 2 installed (also has download instructions): https://edu.google.com/openonline/course-builder/docs/1.10/set-up-course-builder/check-for-python.html
If you already have Python 3 but not Python 2, here's helpful pointers for Windows and for Mac. You may need to do some searching about setting up virtual environments.
For scientific purposes, look into the Anaconda distribution, which comes pre-packaged with a bunch of scientific computing packages like numpy, scipy, and pandas. https://www.anaconda.com/distribution/
This will also make sure that you have a couple of PyRate's required packages installed (numpy and scipy).
Data cleaning: it's a thing.
Carefully check through your data. This might include doing things like listing unique(dataframe$Species) to see a list of all the taxa. Check for typos, old names that need to be updated, etc.
You might need to do things like filter out cetaceans from an artiodactyl analysis.
You might need to clean or assign trait data for a later analysis.
Have fun, be thorough, and save it in a script so you can see what you did and repeat it later!
Here's one option for file structure, and it's usually how I set up my analyses.
The directory project_name will be our working directory for the rest of these analyses. That means we'll always refer to the other folders relative to it.
On why R projects in RStudio are awesome: https://www.tidyverse.org/articles/2017/12/workflow-vs-script/
R projects also enable using the here package, which is AMAZING for working with file paths across computers (i.e., in any code that you ever want to share, or if you work on multiple devices): https://github.com/jennybc/here_here
As a reminder, these are the questions we're starting with. You can basically shove your occurrence data into PyRate and get these results out. How exciting!
BDS is the basic/first model of a PyRate analysis, and it's what was diagrammed on the first slide.
The processing step can be done at the command line, which is how the official tutorial shows it... or in an actual R script, which I recommend.
We won't bother with data cleaning for today's stuff, but here is an example from my ungulates PyRate project - this is just typo fixes, dropping some unwanted genera, and then the very start of updating taxonomy.
One step that can trip you up: column naming.
This is the cryptic error, and it's very not-helpful if you haven't run into this problem before. It means you need to rename your age columns to start with min_ and max_!
OK, define your vector of extant taxa. Feel free to copy and paste. One for dogs, one for crocs.
Now run the command to process what you need.
Since in an ideal world we would be using an R project, I'm assuming that we are working in the R subfolder - hence the two dots ../ used to go up one level, to the project_name folder, before going back into the subfolders for PyRate-master or data.
You can list the files in your current working directory using ls (Mac, Linux/Unix) or dir (Windows)
Note: -data_info is a REALLY good place to make sure FastPyRateC is loaded successfully. Otherwise, everything will be really slow. (It will show up as "Module FastPyRateC was loaded successfully." if it worked.)
If FastPyRateC didn't load successfully, follow the instructions here: https://github.com/dsilvestro/PyRate/blob/master/pyrate_lib/fastPyRateC/README.md
A translation of what we're telling the command prompt.
There are lots of different options for a basic BDS analysis, and we set them with flags. Happily, the official tutorial is pretty clear about ~best practices, so you can get more info there.
In a real analysis, you'd probably want closer to 10-50 million iterations, ish.
If it's working, your analysis will print stuff out. These are the parameters as they update with each round of Markov chain Monte Carlo; you can mostly ignore it, but it's fun to watch the progress go by.
If you get errors here about required libraries not being installed, you need the numpy and scipy packages.
If you have an Anaconda Prompt from installing Anaconda, or you're on a Linux/Unix system (or maybe Mac?), just try pip install numpy and pip install scipy at the command line.
On Windows, try python -m pip install numpy (and then the same for scipy).
If those don't work, get thee to a web search for "How to install [numpy/scipy] on [your operating system]".
You don't need to directly interact with these files unless you're getting exact numbers and posterior means to report in a paper. All the interactions will happen from more stuff at the command line.
Burn-in refers to dropping the first little bit of sampling while your parameters are jumping around like mad - so, waiting until your chain has converged.
The amount to drop is going to vary by dataset. You can look at the traces of your *_mcmc.log file in a program like tracer to determine how much to cut off.
Note also that the first 10 is really 10 x the sampling rate, which is 5000. So we're dropping 50,000 of 200k.
So the 1-rate model is most probable: there is one speciation rate and one extinction rate. (We won't put much stock in this, because it's a small dataset, we didn't use many iterations, and we also didn't clean the data at all, so please infer absolutely no biological relevance from these results.)
It'll print some more stuff. There may be an error about not finding the file, which is annoying but easily fixable (see next slide).
...basically, it generates an R file that will make your figures, and the R file generation works even if running it does not. So, we can manually run the R file if we got the error by using this Rscript command.
Speciation up top, extinction below. Rates through time on the left, frequency of rate shift on the right.
Because it's Bayesian, the frequency means: out of all the times we sampled using MCMC and got a result, what proportion showed a rate shift in this 1-million-year bin? (You can change the bin size as one of the options when you call the plotting command.) If the frequency goes above the dotted line, it's a significant shift. (Here we don't have that.)
This isn't real trait data! rnorm() draws from a normal distribution with a mean and standard deviation that you give it.
But you can do it with real data, this is just for an example :)
Here are your various model options.
:(
It looks like the option to include your trait file as a separate file is broken.
Here's the fix I have used in the past. Currently left as an exercise for the reader, sorry.
A reminder of the other kinds of analysis we can do.
A flowchart of how these analyses are related to BDS.
Multiple replicates let you integrate uncertainty in the date ranges of your fossils.
It's accomplished by adding replicates = n to the extract.ages() or extract.ages.pbdb() function we used to prepare our dataset earlier, in order to get n replicates.
You'll then want to run a BDS analysis on every replicate. This is where we start to get into needing to script things, because you don't want to manually enter 100 datasets of BDS analysis into the terminal.
This is also where cluster access comes in handy (or patience and a decently-powered desktop).
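A sketch of what that scripting might look like in Python. The paths, flag settings, and replicate count are placeholders carried over from the slides, and the use of -j to pick a replicate follows the PyRate tutorials; adapt all of it to your own setup (or translate it into a cluster job script):

```python
import subprocess  # used if you actually launch the runs

# Hypothetical loop over 10 replicates of a prepared PyRate dataset.
# Per the PyRate tutorials, -j selects which replicate to analyze.
commands = []
for j in range(1, 11):
    cmd = ["python", "./PyRate-master/PyRate.py",
           "./data/[data_file]_PyRate.py",
           "-j", str(j),
           "-A", "4", "-mG", "-n", "200000", "-s", "5000"]
    commands.append(cmd)
    # subprocess.run(cmd, check=True)  # uncomment to actually launch each run

print(len(commands))  # number of replicate runs queued
```

On a cluster you would typically submit each command as its own job rather than running them sequentially.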
I am happy to provide advice and examples on these steps, but it's outside the scope of this lab meeting!