gget is a free, open-source command-line tool and Python package that enables efficient querying of genomic databases.
gget consists of a collection of separate but interoperable modules, each designed to facilitate one type of database querying in a single line of code.
A majority of researchers currently access genomic reference databases through manual web access to annotate and functionally characterize putative marker genes.
Manual web access is time-consuming and potentially error-prone, as it requires manually copying and pasting data such as gene IDs.
ref(species, which='all', release=None, ftp=False, save=False, list_species=False)
Fetch FTPs for reference genomes and annotations by species from Ensembl.
Args:
species Defines the species for which the reference should be fetched in the format "<genus>_<species>", e.g. species = "homo_sapiens".
which Defines which results to return.
Default: 'all' -> Returns all available results.
Possible entries are one or a combination (as a list of strings) of the following:
'gtf' - Returns the annotation (GTF).
'cdna' - Returns the transcriptome (cDNA).
'dna' - Returns the genome (DNA).
'cds' - Returns the coding sequences corresponding to Ensembl genes. (Does not contain UTR or intronic sequence.)
'cdrna' - Returns transcript sequences corresponding to non-coding RNA genes (ncRNA).
'pep' - Returns the protein translations of Ensembl genes.
release Defines the Ensembl release number from which the files are fetched, e.g. release = 104.
Default: None -> latest Ensembl release is used
ftp Return only the requested FTP links in a list (default: False).
save Save the results in the local directory (default: False).
list_species If True and `species=None`, returns a list of all available species from the Ensembl database for large genomes
(not including plants/bacteria) (default: False).
(Can be combined with `release` to get the available species from a specific Ensembl release.)
Returns a dictionary containing the requested URLs with their respective Ensembl version and release date and time. (If ftp=True, returns a list containing only the URLs.)
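A minimal sketch of how the `which` argument could be checked before calling ref. The names ALLOWED_WHICH, normalize_which and fetch_reference_links are illustrative and not part of gget; the allowed entries are copied from the list above. fetch_reference_links assumes that gget is installed and that Ensembl is reachable, so it is defined here but not called.

```python
# Illustrative helpers (not part of gget).
# Allowed entries for `which`, copied from the argument description above.
ALLOWED_WHICH = {"gtf", "cdna", "dna", "cds", "cdrna", "pep"}


def normalize_which(which="all"):
    """Return 'all' unchanged, otherwise a validated list of entries."""
    if which == "all":
        return which
    entries = [which] if isinstance(which, str) else list(which)
    unknown = [entry for entry in entries if entry not in ALLOWED_WHICH]
    if unknown:
        raise ValueError(f"Unsupported 'which' entries: {unknown}")
    return entries


def fetch_reference_links(species, which="all", release=None):
    """Fetch only the FTP links (assumes gget is installed and network access)."""
    import gget  # imported lazily so normalize_which works without gget

    return gget.ref(species, which=normalize_which(which), release=release, ftp=True)
```

For example, fetch_reference_links("homo_sapiens", which=["gtf", "dna"]) would request only the annotation and genome links.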
search(searchwords, species, id_type='gene', seqtype=None, andor='or', limit=None, wrap_text=False, json=False, save=False)
Function to query Ensembl for genes based on species and free form search terms. Automatically fetches results from latest Ensembl release, unless user specifies database (see 'species' argument).
Args:
searchwords Free form search words (not case-sensitive) as a string or list of strings (e.g. searchwords = ["GABA", "gamma-aminobutyric"]).
species Species can be passed in the format "genus_species", e.g. "homo_sapiens". To pass a specific database, enter the name of the core database, e.g. 'mus_musculus_dba2j_core_105_1'. All available species databases can be found here: http://ftp.ensembl.org/pub/release-106/mysql/
id_type "gene" (default) or "transcript" Defines whether genes or transcripts matching the searchwords are returned.
andor "or" (default) or "and"
"or": Returns all genes that INCLUDE AT LEAST ONE of the searchwords in their name/description.
"and": Returns only genes that INCLUDE ALL of the searchwords in their name/description.
limit (int) Limit the number of search results returned (default: None).
wrap_text If True, displays data frame with wrapped text for easy reading. Default: False.
json If True, returns results in json format instead of data frame. Default: False.
save If True, the data frame is saved as a csv in the current directory (default: False).
Returns a data frame with the query results. Deprecated arguments: 'seqtype' (renamed to id_type)
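The difference between andor="or" and andor="and" can be illustrated on a few made-up descriptions. This sketch only mirrors the documented semantics (case-insensitive, at least one word versus all words); it does not query Ensembl and is not how search is implemented. The function name matches and the example descriptions are hypothetical.

```python
def matches(description, searchwords, andor="or"):
    """Case-insensitive test of search words against a gene name/description."""
    if isinstance(searchwords, str):
        searchwords = [searchwords]
    text = description.lower()
    hits = [word.lower() in text for word in searchwords]
    if andor == "or":
        return any(hits)  # at least one search word is present
    if andor == "and":
        return all(hits)  # every search word is present
    raise ValueError("andor must be 'or' or 'and'")


# Made-up descriptions, for illustration only.
descriptions = [
    "gamma-aminobutyric acid (GABA) type A receptor subunit",
    "GABA type B receptor subunit",
    "glutamate receptor",
]
words = ["GABA", "gamma-aminobutyric"]

or_hits = [d for d in descriptions if matches(d, words, "or")]
and_hits = [d for d in descriptions if matches(d, words, "and")]
```

Here or_hits keeps the first two descriptions, while and_hits keeps only the first, which contains both search words.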
archs4(gene, ensembl=False, which='correlation', gene_count=100, species='human', json=False, save=False)
Find the most correlated genes or the tissue expression atlas of a gene of interest using data from the human and mouse RNA-seq database ARCHS4 (https://maayanlab.cloud/archs4/).
Args:
gene Short name (Entrez gene symbol) of gene of interest (str), e.g. 'STAT4'. Set 'ensembl=True' to input an Ensembl gene ID, e.g. ENSG00000138378.
ensembl Define as 'True' if 'gene' is an Ensembl gene ID. (Default: False)
which 'correlation' (default) or 'tissue'.
'correlation' returns a gene correlation table that contains the 100 most correlated genes to the gene of interest. The Pearson correlation is calculated over all samples and tissues in ARCHS4.
'tissue' returns a tissue expression atlas calculated from human or mouse samples (as defined by 'species') in ARCHS4.
gene_count Number of correlated genes to return (default: 100). (Only for gene correlation.)
species 'human' (default) or 'mouse'. (Only for tissue expression atlas.)
json If True, returns results in json format instead of data frame. Default: False.
save True/False whether to save the results in the local directory.
Returns a data frame with the requested results.
The Pearson correlation is calculated over all samples and tissues. The gene list can be uploaded to Enrichr for further investigation.
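To make the ranking concrete, the sketch below computes Pearson correlations on a toy expression table and returns the most correlated genes. It only illustrates the statistic: archs4 retrieves results from ARCHS4 rather than computing them locally, and the gene names, values and function names here are made up.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def most_correlated(expression, gene, gene_count=100):
    """Rank all other genes by correlation with `gene`, highest first."""
    target = expression[gene]
    scores = [
        (other, pearson(target, values))
        for other, values in expression.items()
        if other != gene
    ]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:gene_count]


# Made-up expression values across four samples (not ARCHS4 data).
toy = {
    "GENE_A": [1.0, 2.0, 3.0, 4.0],
    "GENE_B": [2.0, 4.0, 6.0, 8.0],  # rises together with GENE_A
    "GENE_C": [4.0, 3.0, 2.0, 1.0],  # falls as GENE_A rises
    "GENE_D": [1.0, 3.0, 2.0, 4.0],
}
top = most_correlated(toy, "GENE_A", gene_count=2)
```

With these values, GENE_B ranks first and GENE_D second, while the anti-correlated GENE_C is excluded by gene_count=2.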
enrichr(genes, database, ensembl=False, plot=False, figsize=(10, 10), ax=None, json=False, save=False)
Perform an enrichment analysis on a list of genes using Enrichr (https://maayanlab.cloud/Enrichr/).
Args:
genes List of Entrez gene symbols to perform enrichment analysis on, passed as a list of strings, e.g. ['PHF14', 'RBM3', 'MSL1', 'PHF21A']. Set 'ensembl = True' to input a list of Ensembl gene IDs, e.g. ['ENSG00000106443', 'ENSG00000102317', 'ENSG00000188895'].
database Database to use as reference for the enrichment analysis. Supported shortcuts (and their default database):
'pathway' (KEGG_2021_Human)
'transcription' (ChEA_2016)
'ontology' (GO_Biological_Process_2021)
'diseases_drugs' (GWAS_Catalog_2019)
'celltypes' (PanglaoDB_Augmented_2021)
'kinase_interactions' (KEA_2015)
or any database listed under Gene-set Library at: https://maayanlab.cloud/Enrichr/#libraries
ensembl Define as 'True' if 'genes' is a list of Ensembl gene IDs. (Default: False)
plot True/False whether to provide a graphical overview of the first 15 results. (Default: False)
figsize (width, height) of plot in inches. (Default: (10,10))
ax Pass a matplotlib axes object for further customization of the plot. (Default: None)
json If True, returns results in json format instead of data frame. (Default: False)
save True/False whether to save the results in the local directory. (Default: False)
Returns a data frame with the Enrichr results.
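The shortcut-to-library mapping listed under `database` can be written down as a small lookup table. This is a sketch for illustration; DATABASE_SHORTCUTS and resolve_database are hypothetical names, not gget functions, and the library names are copied from the list above.

```python
# Shortcut -> default Enrichr library, as listed in the argument description.
DATABASE_SHORTCUTS = {
    "pathway": "KEGG_2021_Human",
    "transcription": "ChEA_2016",
    "ontology": "GO_Biological_Process_2021",
    "diseases_drugs": "GWAS_Catalog_2019",
    "celltypes": "PanglaoDB_Augmented_2021",
    "kinase_interactions": "KEA_2015",
}


def resolve_database(database):
    """Return the library behind a shortcut; pass other names through unchanged."""
    return DATABASE_SHORTCUTS.get(database, database)
```

A name that is not a shortcut is returned as is, which matches the option of passing any library listed on the Enrichr website directly.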
info(ens_ids, wrap_text=False, pdb=False, ensembl_only=False, json=False, verbose=True, save=False, expand=False)
Fetch gene and transcript metadata using Ensembl IDs.
Args:
ens_ids One or more Ensembl IDs to look up (string or list of strings). Also supports WormBase and FlyBase IDs.
wrap_text If True, displays data frame with wrapped text for easy reading. Default: False.
pdb If True, also returns PDB IDs (might increase run time). Default: False.
ensembl_only If True, only returns results from Ensembl (excludes PDB, UniProt, and NCBI results). Default: False.
json If True, returns results in json/dictionary format instead of data frame. Default: False.
verbose True/False whether to print progress information. Default True.
save True/False whether to save csv with query results in current working directory. Default: False.
Returns a data frame containing the requested information.
seq(ens_ids, translate=False, isoforms=False, save=False, transcribe=None, seqtype=None)
Fetch nucleotide or amino acid sequence (FASTA) of a gene (and all its isoforms) or transcript by Ensembl, WormBase or FlyBase ID.
Args:
ens_ids One or more Ensembl IDs (passed as string or list of strings). Also supports WormBase and FlyBase IDs.
translate True/False
(default: False -> returns nucleotide sequences).
Defines whether nucleotide or amino acid sequences are returned.
Nucleotide sequences are fetched from the Ensembl REST API server.
Amino acid sequences are fetched from the UniProt REST API server.
isoforms If True, returns the sequences of all known transcripts (default: False). (Only for gene IDs.)
save If True, saves output FASTA to current directory (default: False).
Returns a list (or FASTA file if 'save=True') containing the requested sequences.
muscle(fasta, super5=False, out=None)
Align multiple nucleotide or amino acid sequences against each other (using the Muscle v5 algorithm).
Args:
fasta Path to fasta file containing the sequences to be aligned.
super5 True/False (default: False).
If True, align input using Super5 algorithm instead of PPP algorithm to decrease time and memory.
Use for large inputs (a few hundred sequences).
out Path to save an 'aligned FASTA' (.afa) file with the results, e.g. 'path/to/directory/results.afa'.
Default: 'None' -> Results will be printed in Clustal format.
Returns alignment results in an "aligned FASTA" (.afa) file.
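An aligned FASTA file can be read back with a few lines of plain Python. The sketch below assumes the usual FASTA layout (a '>' header line followed by one or more sequence lines, with '-' marking gaps); the function names and the example alignment are made up for illustration.

```python
def read_aligned_fasta(lines):
    """Parse FASTA-formatted lines into {header: sequence}, joining wrapped lines."""
    records = {}
    header = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            records[header] = ""
        elif header is not None:
            records[header] += line
    return records


def alignment_is_rectangular(records):
    """In a valid alignment every gapped sequence has the same length."""
    return len({len(seq) for seq in records.values()}) <= 1


# Made-up alignment for illustration; '-' marks a gap.
example = """>seq1
ATG-CGT
AA
>seq2
ATGGCGT
A-
""".splitlines()
records = read_aligned_fasta(example)
```

For a file written with the `out` argument, the same function can be applied to an open file handle, e.g. `with open("results.afa") as fh: records = read_aligned_fasta(fh)`.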
blast(sequence, program='default', database='default', limit=50, expect=10.0, low_comp_filt=False, megablast=True, verbose=True, wrap_text=False, json=False, save=False)
BLAST a nucleotide or amino acid sequence against any BLAST DB.
Args:
sequence Sequence (str) or path to FASTA file.
(If more than one sequence in FASTA file, only the first will be submitted to BLAST.)
program 'blastn', 'blastp', 'blastx', 'tblastn', or 'tblastx'.
Default: 'blastn' for nucleotide sequences; 'blastp' for amino acid sequences.
database 'nt', 'nr', 'refseq_rna', 'refseq_protein', 'swissprot', 'pdbaa', or 'pdbnt'.
Default: 'nt' for nucleotide sequences; 'nr' for amino acid sequences.
More info on BLAST databases: https://ncbi.github.io/blast-cloud/blastdb/available-blastdbs.html
limit Limits number of hits to return. Default 50.
expect float or None. An expect value cutoff. Default 10.0.
low_comp_filt True/False whether to apply low complexity filter. Default False.
megablast True/False whether to use the MegaBLAST algorithm (blastn only). Default True.
verbose True/False whether to print progress information. Default True.
wrap_text If True, displays data frame with wrapped text for easy reading. Default: False.
json If True, returns results in json/dictionary format instead of data frame. Default: False.
save If True, the data frame is saved as a csv in the current directory (default: False).
Returns a data frame with the BLAST results.
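The 'default' values of `program` and `database` depend on whether the input is a nucleotide or an amino acid sequence. The sketch below shows one plausible way to make that choice; the letter-set heuristic and the function names are assumptions for illustration and may differ from what blast does internally.

```python
# Assumption: a sequence made up only of these letters is treated as nucleotide.
NUCLEOTIDES = set("ACGTUN")


def guess_sequence_type(sequence):
    """Return 'nucleotide' if only nucleotide letters occur, else 'amino acid'."""
    letters = {char for char in sequence.upper() if char.isalpha()}
    if not letters:
        raise ValueError("Sequence contains no letters")
    return "nucleotide" if letters <= NUCLEOTIDES else "amino acid"


def default_blast_settings(sequence):
    """Pick the documented defaults for `program` and `database`."""
    if guess_sequence_type(sequence) == "nucleotide":
        return {"program": "blastn", "database": "nt"}
    return {"program": "blastp", "database": "nr"}
```

A short peptide that happens to contain only the letters A, C, G, T or N would be misclassified by this heuristic, which is one reason to set `program` and `database` explicitly when in doubt.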
blat(sequence, seqtype='default', assembly='human', json=False, save=False)
BLAT a nucleotide or amino acid sequence against any BLAT UCSC assembly.
Args:
sequence Sequence (str) or path to fasta file containing one sequence.
seqtype 'DNA', 'protein', 'translated%20RNA', or 'translated%20DNA'. Default: 'DNA' for nucleotide sequences; 'protein' for amino acid sequences.
assembly 'human' (hg38) (default), 'mouse' (mm39), 'zebrafinch' (taeGut2), or any of the species assemblies available at https://genome.ucsc.edu/cgi-bin/hgBlat
(use short assembly name as listed after the "/").
json If True, returns results in json format instead of data frame. Default: False.
save If True, the data frame is saved as a csv in the current directory (default: False).
Returns a data frame with the BLAT results.
alphafold(sequence, out='2022_12_30-1803_gget_alphafold_prediction', multimer_for_monomer=False, relax=False, multimer_recycles=3, plot=True, show_sidechains=True)
Predicts the structure of a protein using a slightly simplified version of AlphaFold v2.3.0 (https://doi.org/10.1038/s41586-021-03819-2) published in the AlphaFold Colab notebook (https://colab.research.google.com/github/deepmind/alphafold/blob/main/notebooks/AlphaFold.ipynb).
Args:
sequence Amino acid sequence (str), a list of sequences, or path to a FASTA file.
out Path to folder to save prediction results in (str). Default: "./[date_time]_gget_alphafold_prediction"
multimer_for_monomer Use multimer model for a monomer (default: False).
multimer_recycles The multimer model will continue recycling until the predictions stop changing, up to the limit set here (default: 3).
For higher accuracy, at the potential cost of longer inference times, set this to 20.
relax True/False whether to AMBER relax the best model (default: False).
plot True/False whether to provide a graphical overview of the prediction (default: True).
show_sidechains True/False whether to show side chains in the plot (default: True).
Saves the predicted aligned error (json) and the prediction (PDB) in the defined 'out' folder.
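The default `out` folder follows the pattern "[date_time]_gget_alphafold_prediction". A name of that shape can be built as sketched below; the timestamp format is inferred from the example value in the signature ('2022_12_30-1803_...') and is therefore an assumption, and default_out_folder is a hypothetical helper, not a gget function.

```python
from datetime import datetime


def default_out_folder(now=None):
    """Build a folder name of the form '[date_time]_gget_alphafold_prediction'."""
    now = now or datetime.now()
    # Assumed format: year_month_day-hourminute, as in the signature example.
    return now.strftime("%Y_%m_%d-%H%M") + "_gget_alphafold_prediction"


name = default_out_folder(datetime(2022, 12, 30, 18, 3))
```

Passing an explicit `out` path avoids depending on the timestamp when results need to be located by a later step.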