13 August 2011
Supervisor's statement for Neha Gupta
In her summer 2011 co-op, Neha worked in Knome’s Research group, scripting for live genome analysis
projects and scouting bioinformatics resources for potential use in future development. I supervised her,
and she also worked closely with another NU co-op and with two bioinformaticians in the Operations
group. On some projects, she also worked with members of the Development group.
In just a few weeks with Knome, Neha worked on many different projects. She wrote several key scripts
for Knome, including one that helps streamline our literature search when finishing a project analysis. She
also researched, and wrote code to help parse/prepare files for, bioinformatics resources such as PLINK.
We at Knome enjoyed having her work with us, and wish her very well!
Sincerely,
Nathaniel Pearson
Director of Research, Knome, Inc.
Personal Statement:
During the summer of 2011, I, Neha Gupta, had the privilege of working in and experiencing the business world at Knome, Inc., located in Cambridge, Massachusetts. My interests and the research skills I acquired during my bioinformatics coursework at Northeastern University fit very well with what the company does.
Knome was founded in 2007 and was the first organization to deliver human whole-genome analysis and interpretation services; it has sequenced and analyzed more human genomes than any other organization to date. Its whole-genome and exome analysis services are based on automating the process of finding shortlists of genetic variants implicated in a disease, particularly cancer, studying their drug response, and summarizing their functional importance. Knome provides detailed genome and exome interpretation to pharmaceutical and clinical researchers, aiding drug discovery and development pipelines.
I had the opportunity to work as a Bioinformatics Research Intern at Knome for three months. During this period there were many interesting ongoing projects. From the start of my internship I worked on multiple projects and also contributed novel ideas to assist the research, such as generating heat maps linking variant allele frequencies with drug response, and developing a GUI for the Circos genome data visualization tool for cancer. The challenges arising in each project kept me encouraged and motivated, which in turn enhanced my skills. The brainstorming required to find a solution when a problem reached a dead end, especially during team discussions, was challenging and a positive experience.
Project 1: Interpreting the effects of variants - Condel scores:
Background: Single nucleotide polymorphisms (SNPs) are the most common type of genetic variation in humans. SNPs make up about 90% of all human genetic variation and occur every 100 to 300 bases along the 3-billion-base human genome, but establishing the association of these variants with a particular disease is laborious and time consuming. Condel (CONsensus DELeteriousness score of single nucleotide variants) computes a weighted average of the scores of five known computational tools aimed at classifying non-synonymous SNVs as likely deleterious or likely neutral. These tools are SIFT (Sorting Intolerant From Tolerant), PolyPhen-2, MAPP (Multivariate Analysis of Protein Polymorphism), Pfam LogRE-value, and Mutation Assessor [1].
Running Condel:
There are three ways of running the Condel program. The Ensembl Variation database version 62 contains precomputed SIFT and PolyPhen scores for every possible mutation in the human proteome. Condel was run using the Ensembl Variation Perl API: the Variant Effect Predictor takes in a large list of variants in VCF or pileup format and predicts Condel scores based on the SIFT and PolyPhen scores precomputed in the Ensembl database [7]. The input file's genomic coordinates must match the reference genome used for that Ensembl database version (v62 corresponds to hg19). It is run through a command-line interface, and the command-line options can also be defined in a configuration file. An optional registry file points the queries at the right databases (the MySQL Ensembl Homo sapiens core and variation databases are essential).
An example registry file is:
use Bio::EnsEMBL::DBSQL::DBAdaptor;
use Bio::EnsEMBL::Variation::DBSQL::DBAdaptor;
use Bio::EnsEMBL::Registry;

# Connection to the Ensembl core database
Bio::EnsEMBL::DBSQL::DBAdaptor->new(
    '-species' => 'Homo sapiens',
    '-group'   => 'core',
    '-port'    => 3306,
    '-host'    => 'kensembl01',
    '-user'    => 'ngupta',
    '-pass'    => 'abcd',
    '-dbname'  => 'homo_sapiens_core_62_37g'
);

# Connection to the Ensembl variation database
Bio::EnsEMBL::Variation::DBSQL::DBAdaptor->new(
    '-species' => 'Homo sapiens',
    '-group'   => 'variation',
    '-port'    => 3306,
    '-host'    => 'kensembl01',
    '-user'    => 'ngupta',
    '-pass'    => 'abcd',
    '-dbname'  => 'homo_sapiens_variation_62_37g'
);
All adaptors are stored using the Bio::EnsEMBL::Registry module, which is then used to fetch any adaptor required. Bio::EnsEMBL::DBSQL::DBAdaptor connects to the database 'homo_sapiens_core_62_37g' through the DBConnection class and creates a new object with the species, group, and other flags; that object is later retrieved through the Bio::EnsEMBL::Registry module. Likewise, Bio::EnsEMBL::Variation::DBSQL::DBAdaptor in the second half of the registry file provides a connection to the Ensembl variation database 'homo_sapiens_variation_62_37g' by creating a new object.
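To illustrate how a script then consumes such a registry file, here is a minimal sketch using the Ensembl Perl API; the registry path is hypothetical, and the adaptor types shown are just examples of what a Condel/VEP pipeline pulls from the core and variation databases:

use strict;
use warnings;
use Bio::EnsEMBL::Registry;

# Load all DBAdaptors declared in the registry file (hypothetical path)
Bio::EnsEMBL::Registry->load_all('/path/to/registry.pm');

# Fetch adaptors by species/group/type from the loaded registry
my $gene_adaptor = Bio::EnsEMBL::Registry->get_adaptor(
    'Homo sapiens', 'core', 'Gene'
);
my $vf_adaptor = Bio::EnsEMBL::Registry->get_adaptor(
    'Homo sapiens', 'variation', 'VariationFeature'
);

printf "Got %s and %s\n", ref($gene_adaptor), ref($vf_adaptor);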
An output of Condel scores was produced and analysed. It assesses the effects of missense variants, nonsense-mediated-decay variants, and read-through variants. A missense mutation changes an amino acid during translation, thereby altering the protein synthesized. Nonsense-mediated decay (NMD) is an mRNA surveillance mechanism that detects nonsense mutations and prevents the expression of erroneous proteins; following transcription, precursor mRNA undergoes pre-mRNA processing, and aberrant splicing (for example, due to the large intron sizes in eukaryotes) can introduce a nonsense codon (UAA, UAG, UGA) within the open reading frame. A frameshift mutation is caused by insertions or deletions of nucleotides: because codons are read in triplets, an indel shifts the reading frame (the grouping of the codons), so an altered protein is synthesized after translation. A Condel score near 1 indicates a deleterious consequence of a variant allele, whereas a score near 0 indicates a neutral variant. This part of the project was organized as a team project. The key question was how accurate the Condel scores produced from SIFT and PolyPhen alone are, and how they compare with Condel scores that take all five tools into account. To answer this question, I researched the MAPP program independently.
MAPP: MAPP reads an alignment of protein sequences and an alignment tree to calculate the predicted impact of each potential SNP at each position [6]. These predictions are based entirely on a set of scales of physicochemical properties, each assigning a numeric value to every amino acid. This open-source program takes two input files: one containing the alignment in FASTA format, and a text file containing the alignment tree in parenthesis representation, with branch lengths [6]. The alignment file should contain properly aligned sequences with gaps indicated by hyphens, and the tree file must be formatted in the way the MAPP program accepts. The program runs on the Java platform and benefits from a large number of orthologs. The alignment and tree files were obtained by installing ClustalW locally; other options such as Semphy and ClustalX can also be used to produce the multiple sequence alignment [6].
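As a sketch of this step, the two commands below first align a FASTA file of orthologous sequences with ClustalW (which can also write out its guide tree) and then hand both files to MAPP. File names are hypothetical, and the MAPP flags are my reading of its README, so they should be verified against the local installation:

clustalw -INFILE=orthologs.fasta -ALIGN -OUTPUT=FASTA -OUTFILE=orthologs_aln.fasta -NEWTREE=orthologs.dnd
java -jar MAPP.jar -f orthologs_aln.fasta -t orthologs.dnd -o mapp_scores.txt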
I ran ClustalW locally to perform a multiple sequence alignment of a set of six sample human sequences, generating an alignment FASTA file and a tree file as output. The physicochemical properties used by MAPP are hydropathy, polarity, charge, volume, free energy in alpha-helix conformation, and free energy in beta-strand conformation [2]. MAPP produces an output table (openable in Excel) with the following column headers:
1. Position - position of the column in the alignment
2. Column score - MAPP median score for this alignment column
3. Column P-value - P-value interpretation of the column's MAPP score
4. Alignment - amino acids observed in that column, in alignment order
5. Gap weight - weighted fraction of gaps in the column
6. Over gap weight threshold - whether the column exceeds the gap weight threshold (MAPP does not calculate scores if the gap weight is greater than 50%)
7. P-values corresponding to the significance of each physicochemical property
8. Columns of MAPP scores for each possible amino acid variant
9. P-value interpretations of the MAPP scores, predicting the impact of each amino acid variant
10. Good amino acids at this position
11. Bad amino acids - amino acids deleterious at this position
[Figure: example of MAPP output.]
The UCSC Genome Browser [5] provides tracks with information such as chromosome, allele coordinates, and base-pair position for tRNA genes, miRNA genes, regulatory genes, and expressed genes, among others. These files come in various formats, such as Browser Extensible Data (BED), BigBed, and Wiggle. I researched and downloaded the BED tracks in order to integrate a BED parser into one of the company's tools.
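As a minimal sketch of such a parser (my own illustration, not the company tool), the script below reads the first six standard BED columns; track and browser header lines are skipped, and the track file name is hypothetical:

use strict;
use warnings;

# Parse a UCSC BED file into an array of feature hashes.
sub parse_bed {
    my ($path) = @_;
    open my $fh, '<', $path or die "Cannot open $path: $!";
    my @features;
    while (my $line = <$fh>) {
        chomp $line;
        next if $line =~ /^(?:track|browser|#)/;   # skip header lines
        next if $line =~ /^\s*$/;                  # skip blank lines
        my ($chrom, $start, $end, $name, $score, $strand) = split /\t/, $line;
        push @features, {
            chrom  => $chrom,
            start  => $start,    # 0-based start, per the BED spec
            end    => $end,      # end coordinate (exclusive)
            name   => $name,
            score  => $score,
            strand => $strand,
        };
    }
    close $fh;
    return \@features;
}

my $features = parse_bed('tRNA_genes.bed');
printf "%d features parsed\n", scalar @$features;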
Future Work: The decision to divide the research on Condel and the other tools (Pfam LogRE values, MAPP, Mutation Assessor) between us was a good one, since the work was delegated and no time was wasted on the same task being assigned to two people. Future work on this project would be to integrate SIFT and PolyPhen with MAPP, Pfam LogRE, and Mutation Assessor, which I believe is a more accurate way of predicting Condel scores because it draws on all five computational tools. MAPP, Pfam LogRE, and Mutation Assessor are more in-depth and accurate in predicting the effects of all types of variants in a human proteome [4,6], so a weighted score of these five tools should be more accurate than a Condel score based only on SIFT and PolyPhen [1]. However, the key frustration in MAPP is automating the generation of the alignment and tree files on a whole-genome basis. A single MAPP run produces scores for only one alignment file and one alignment tree file, so for the entire genome of an individual the process has to be automated to generate MAPP alignment and tree files for all the orthologous proteins. It is also memory-intensive and difficult to align all the orthologous proteins in one go to produce a single alignment file [6]. How Pfam LogRE works on a genome-wide basis would be an interesting concept to learn about.
References:
1. Condel – Consensus deleteriousness score of missense SNVs
http://bg.upf.edu/group/projects/condel.php
2. Physicochemical constraint violation by missense substitutions mediates impairment of protein
function and disease severity.
http://mendel.stanford.edu/SidowLab/pdfs/2005StoneSidow.pdf
3. Polyphen
http://www.ncbi.nlm.nih.gov/pubmed/21412949
http://www.ncbi.nlm.nih.gov/pubmed/20642364
4. Pfam - http://cancerres.aacrjournals.org/content/67/2/465.long
5. Encode - http://genome.ucsc.edu/ENCODE/
6. MAPP
http://mendel.stanford.edu/sidowlab/downloads/MAPP/index.html
http://www.plosgenetics.org/article/info%3Adoi%2F10.1371%2Fjournal.pgen.1000968
7. Ensembl- http://useast.ensembl.org/info/docs/variation/index.html
Project 2: Pedigree Analysis and Estimation of Inbreeding:
This project was directed as an independent project; the aim was to obtain kinship estimates for the company's subject data.
PLINK for GWAS: PLINK is an open-source whole-genome association analysis program written in C/C++, useful for performing a range of large-scale analyses in a computationally efficient manner. Its main focus is the analysis of genotype (SNP) and phenotype data. It is integrated with gPLINK (a PLINK GUI) and Haploview to support subsequent visualization and annotation of the results [1]. PLINK v1.07 is freely available for download for any operating system from its website.
Running PLINK: PLINK is run through a command-line interface. Conveniently, the options of the previous run are stored in a log file, so the --rerun option is useful for repeating or extending a previous analysis using that log file as a base.
Input files: PLINK takes two input files, which can be specified together as a filename prefix with the --file option or, alternatively, given separately as PED and MAP files. For instance:
./plink --ped data.ped --map data.map
(1) PED: the PED file stores pedigree information for each individual in a specific format accepted by PLINK. For larger files (100K+ SNPs), a binary version of the PED file helps increase the program's efficiency. The PED file is white-space delimited; the first six columns, which hold the pedigree information for an individual, are mandatory: Family ID, Individual ID, Paternal ID, Maternal ID, Sex (1 = male; 2 = female; other = unknown), and Phenotype [1]. The phenotype can be either a quantitative trait or an affection status column; by default, affection status is coded -9 = missing, 0 = missing, 1 = unaffected, 2 = affected [1]. Genotypes begin at column 7 and should be white-space delimited, specified by the characters 1, 2, 3, 4 or A, G, C, T, with 0 for missing genotypes. All SNP markers, whether haploid or not, must be biallelic (missing genotype: 0 0) [1].
A basic PED file can omit the family ID, parent IDs, sex, and phenotype columns by specifying the corresponding --no-fid, --no-parents, --no-sex, and --no-pheno options on the command line. To save space and time, a binary PED file (*.bed) can be created; it stores the pedigree/phenotype information in a separate file (*.fam) and creates an extended map file (*.bim) [1].
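For instance, converting a text PED/MAP pair into the binary format looks like this (a sketch; the data prefix is hypothetical):

./plink --file data --make-bed --out data

This writes data.bed, data.bim, and data.fam, which later runs can load with --bfile data.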
(2) MAP files:
MAP files hold the genetic mapping information for the SNP markers in the PED file. Each line of a MAP file describes a single marker and contains four columns: chromosome (1-22, X, Y, or 0 if unplaced), rs# or SNP identifier, genetic distance in morgans, and base-pair position [1]. Alternatively, a MAP file that excludes the genetic distance column can be used. SNP identifiers can contain any characters except whitespace and *. To exclude a SNP from the analysis, set its base-pair position column to a negative value. The MAP file must contain exactly as many markers as the PED file. PLINK's internal numeric codes for the other chromosome types are X = X chromosome (23), Y = Y chromosome (24), XY = pseudo-autosomal region of X (25), and MT = mitochondrial (26) [1].
Inbreeding Analysis: Among the various analyses PLINK performs, a useful one is the calculation of inbreeding coefficients (F) based on the observed versus expected number of homozygous genotypes. Inbreeding is a good measure of kinship within a group of individuals genotyped at a large number of SNPs. Six individuals belonging to six different families were considered for estimating inbreeding coefficients. The PED and MAP files were generated as follows:

PED file:
1 1 0 0 1 1 A A G T
2 1 0 0 1 1 A C T G
3 1 0 0 1 1 C C G G
4 1 0 0 1 2 A C T T
5 1 0 0 1 2 C C G T
6 1 0 0 1 2 C C T T

MAP file:
1 snp1 0 1
1 snp2 0 2
To estimate inbreeding among these subjects, PLINK is run on these input PED and MAP files as ./plink --ped data.ped --map data.map --het, which creates an output file plink.het, exemplified as follows:
FID IID O(HOM) E(HOM) N(NM) F
1 1 1 1.069 2 -0.07463
2 1 0 1.069 2 -1.149
3 1 2 1.069 2 1
where FID = family ID, IID = individual ID, O(HOM) = observed number of homozygotes, E(HOM) = expected number of homozygotes, N(NM) = number of non-missing genotypes, and F = inbreeding coefficient estimate.
To analyze a larger number of SNP markers, I downloaded the HapMap genotype and pedigree data for the Yoruba trio Y117 from the HapMap website, considering only a thousand SNP markers per individual. I generated a single genotype file for each individual and wrote Perl to automate producing the .ped and .map PLINK files.
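A minimal sketch of that conversion is below. It assumes a simplified per-individual genotype file with one marker per line (rsID, chromosome, position, and two alleles, tab-separated); the real HapMap files need more bookkeeping, and all file names and pedigree fields here are hypothetical:

use strict;
use warnings;

my ($geno_file, $out_prefix) = @ARGV;   # e.g. NA18501.txt yri_subject
open my $in,  '<', $geno_file        or die $!;
open my $ped, '>', "$out_prefix.ped" or die $!;
open my $map, '>', "$out_prefix.map" or die $!;

my (@genotypes, @markers);
while (<$in>) {
    chomp;
    my ($rsid, $chrom, $pos, $a1, $a2) = split /\t/;
    push @markers,   [$chrom, $rsid, 0, $pos];   # genetic distance left as 0
    push @genotypes, "$a1 $a2";
}
close $in;

# One MAP line per marker: chromosome, rsID, genetic distance, bp position
print $map join(' ', @$_), "\n" for @markers;

# One PED line: FID IID PAT MAT SEX PHENO, then the genotype columns
print $ped join(' ', 1, 1, 0, 0, 1, 1, @genotypes), "\n";

close $ped;
close $map;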
Pairwise IBD estimation: identity-by-descent analysis is useful for detecting pairs of individuals who look too similar to each other, that is, more similar than expected by chance in a random sample. Genome-wide IBD can be calculated provided a large number of SNPs is available (ideally 100K or more) [1]. It is run by specifying the --genome option along with the input files. The output for kinship estimation is:
FID1 IID1 FID2 IID2 RT EZ Z0 Z1 Z2 PI_HAT PHE DST PPC RATIO
1 14 1 11 PO 0.5 0.00 1.00 0.00 0.5 1 0.5 NA NA
1 14 1 1 OT 0 0.00 0.9851 0.014 0.5075 0 0.75 0.7603 NA
1 13 6 1 UN NA 1.00 0 0 0 1 0.5 0.078 0.0
where FID1 = family ID of the first individual, IID1 = individual ID of the first individual, FID2 = family ID of the second individual, IID2 = individual ID of the second individual, RT = relationship type, EZ = expected IBD sharing given the PED file, Z0 = P(IBD=0) (probability of sharing zero alleles), Z1 = P(IBD=1) (one allele shared), Z2 = P(IBD=2), PI_HAT = proportion IBD, PHE = phenotype code for the pair, DST = IBS distance, PPC = IBS binomial test, and RATIO = ratio of HETHET to IBS0 SNPs.
KING tool for GWAS: The KING tool works in a similar fashion to PLINK. The advantage of KING over PLINK is that it can flag pedigree errors by estimating kinship coefficients for all pairwise relationships [3]. Moreover, its kinship estimates for closely related individuals are highly accurate. In comparison to PLINK, the tool is very fast and robust to population structure, and it can take complicated pedigrees into account.
Installing KING: KING version 1.2 can be freely downloaded from its website for Linux, Windows, Mac, and Ubuntu. KING is run through a command-line interface and scales to GWAS population samples as large as >10,000 individuals [3].
Input files: the input files include a data file (-d), a pedigree file (-p), and a map file (-m) in MERLIN format, or, for increased efficiency, a binary-format file (-b). A command-line run looks like ./king -b ex.bgeno.gz --kinship --related. The PED and MAP file formats of PLINK and KING are the same; KING, however, additionally reads a data file with one row per data item in the pedigree file, indicating each item's type:
<contents of data.dat>
A some_disease
T some_trait
M some_marker
M marker1
….
</contents of data.dat>
Running KING: the pedigree information file and the genotype files were downloaded for the trio NA19238, NA19239, and NA19240 from the HapMap website. I then generated three genotype files, one per individual, by aggregating the genotypes from all the downloaded per-chromosome files, and created a map file for each individual from the pedigree file. To analyze over a million SNP sites in these individuals, I wrote a Perl script that converts their genotypes into an input pedigree file with a million markers. Similarly, I used Perl to convert the trio's map file into the MAP format KING accepts, and automated producing a DAT file for any number of markers given as an argument.
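That last step is small enough to show in full; a sketch under the assumption that markers are simply named marker1..markerN, following the data.dat layout shown above:

use strict;
use warnings;

# Usage: perl make_dat.pl <number_of_markers> > ex.dat
my ($n_markers) = @ARGV;
die "Usage: $0 <number_of_markers>\n" unless defined $n_markers;

# One 'M' (marker) row per data item in the pedigree file
print "M marker$_\n" for 1 .. $n_markers;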
The KING tool was run on these three input files to produce a king.kin file holding all the kinship scores of the trio. The king.bgeno file is the binary form of the .ped, .map, and .dat files, and can be produced by running:
Prompt> ./king -d ex.dat -p ex.ped -m ex.map --binary
KING is then run on the binary file to get the final kinship estimates for the three individuals from the HapMap data: Prompt> ./king -b ex.bgeno --kinship [3]. The output king.kin file is as follows:
FID ID1 ID2 N_SNP Z0 Phi HetHet IBS0 Kinship Error
117 1 2 494819 0.000 0.2500 0.278 0.0004 0.2486 0
117 1 3 498224 0.000 0.2500 0.268 0.0002 0.2446 0
117 2 3 496116 1.000 0.0000 0.206 0.1197 -0.0306 0
This is the within-family relationship case, where FID = family ID for the pair, ID1 = individual ID of the first individual of the pair, ID2 = individual ID of the second individual of the pair, N_SNP = the number of SNPs with no missing genotypes in either individual, Z0 = Pr(IBD=0), Phi = kinship coefficient specified by the pedigree, HetHet = proportion of SNPs at which both individuals are heterozygous, IBS0 = proportion of SNPs with zero identity-by-state (an estimate of the difference between individuals), Kinship = estimated kinship coefficient from the SNP data, and Error = difference between the estimated and specified kinship coefficients (1 for error, 0.5 for warning).
A negative kinship coefficient indicates an unrelated pair. Estimated kinship coefficients in the ranges >0.354, [0.177, 0.354], [0.0884, 0.177], and [0.0442, 0.0884] correspond to duplicate/MZ-twin, 1st-degree, 2nd-degree, and 3rd-degree relationships, respectively. The tool's main disadvantage is relationship inference for more distant relationships.
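Those thresholds translate directly into a small classifier; a sketch using the cutoffs quoted above:

use strict;
use warnings;

# Map an estimated kinship coefficient to a relationship class,
# using the cutoffs from the KING documentation quoted above.
sub classify_kinship {
    my ($phi) = @_;
    return 'duplicate/MZ twin' if $phi >  0.354;
    return '1st-degree'        if $phi >= 0.177;
    return '2nd-degree'        if $phi >= 0.0884;
    return '3rd-degree'        if $phi >= 0.0442;
    return 'unrelated';
}

print classify_kinship(0.2486), "\n";   # 1st-degree (the parent-offspring pair above)
print classify_kinship(-0.0306), "\n";  # unrelated (the parent-parent pair above)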
Frustration and Future Work: the main frustration during this work was optimizing the code for around two million variant sites. Future work would be to plot the kinship scores from both PLINK and KING and analyze the correlation between the two tools, to extend the genome-wide association studies to 2,000 to 3,000 individuals, and to use gPLINK and Haploview for more detailed analyses.
References:
1. PLINK - http://pngu.mgh.harvard.edu/~purcell/plink/dosage.shtml#format
2. Visualizing sample relatedness in a GWAS using PLINK and R
http://gettinggeneticsdone.blogspot.com/2009/10/visualizing-sample-relatedness-in-gwas.html
3. King- http://people.virginia.edu/~wc9c/KING/manual.html
Project 3: Developing Circos and JavaFX:
The first half of the project was organized and directed as a team project.
Background: Circos is a software package for visualizing genomic data and information. It presents data in a circular layout, which makes it ideal for exploring relationships between objects or positions. It is easily automated since it is controlled by plain-text configuration files. Written in Perl, it produces bitmap (PNG) and vector (SVG) images, plus HTML image maps, from plain-text configuration and input files [2]. Circos is free software; the circos-0.54 and circos-tools-0.15 packages, configuration files, and data files were downloaded from its website. The configuration files support drawing ideograms, highlights, tick marks, links and ribbons, scatter plots, line plots, histograms, heat maps, image maps, etc. [3].
Usually the image-generation process is driven by a central configuration file, which in turn imports other configuration files such as global color and font settings. The steps involved are (1) parsing data files in Circos format, (2) deciding how the data are to be shown, (3) constructing a config file, either from scratch or from the tutorials on the website, and (4) running Circos to create the PNG and SVG files [3].
The configuration files have a block format. For instance:
<plot>
file = data.txt
r0   = 0.5r
r1   = 0.6r
</plot>
Running Circos: the primary objective of this project was to develop a GUI for Circos. Circos needs the Perl core modules and, in addition, Config::General, GD, GD::Polyline, List::MoreUtils, Math::Bezier, Math::Round, Math::VecStat, Params::Validate, Readonly, Regexp::Common, and Set::IntSpan [1]. After installing and testing for the correct modules, Circos can be run from the command line: the circos Perl script takes the central configuration file as input, and the output directory and output image name can also be specified [3].
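A typical invocation looks like the following (a sketch; the paths are hypothetical, and the flag names follow the Circos tutorials):

perl bin/circos -conf etc/circos.conf -outputdir out -outputfile circos_image.png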
Image maps: HTML image maps associate areas of the image with web links. Any element in the image, such as an ideogram, tick, band, ribbon, or highlight, can have an associated URL, so the image map produces clickable regions within the image. After an SVG or PNG image and its HTML are produced, parameters are set in the configuration file to produce an image map. To associate the image with its image map, the contents of the image map .html file are placed in the page and the USEMAP attribute is used to link the image to its map [3].
Circos image generation: the configuration files were also modified to produce links in the Circos image, as shown below:
[Figure: Circos image with links.]
Utilizing JavaFX 1.3 in the NetBeans 6.9 IDE: this phase of the project was independent research on how best a graphical user interface could be added to Circos for effective visualization. The JavaFX 1.3 SDK, JavaFX 1.3 Production Suite, and NetBeans IDE 6.9 for JavaFX 1.3 can be downloaded from Oracle's website for Windows, Mac, Linux, and OpenSolaris [4]. (The latest release, JavaFX 2.0, does not support Mac or Linux.) JavaFX 1.3 is used for developing and building GUI applications, with enhanced support for UI controls, CSS, etc., as well as support for TV applications and a TV emulator. JavaFX comes with the NetBeans IDE, which includes the JavaFX Composer [4]. NetBeans also provides improved stability and performance, a rich set of UI components, and enhancements to the JavaFX editor.
[Figure: NetBeans IDE showing the Chaptersix JavaFX project.]
As can be seen in the figure, the name of the main project (package) is specified by clicking File and browsing through the options. For instance, Chaptersix is the main project, which when clicked expands into different files. These files are snippets of JavaFX Script that add graphics to the applications being developed. The palette on the right side of the NetBeans IDE contains Swing, Stage, Scene, text boxes, and other components and actions that can be applied to the images.
Swing application: Swing components are very powerful and can be driven from JavaFX. For instance, a Swing application developed using JavaFX is shown below:
[Figure: Swing application built with JavaFX.]
Image application: images can be loaded by specifying the URL in the proper format in the code snippet. The following figure shows the Circos image application produced by running the JavaFX code:
[Figure: Circos image displayed by the JavaFX application.]
Applying effects to the Circos image:
Various effects, such as hue, brightness, and blur adjustments, can be applied to images for clearer visualization. The JavaFX Production Suite, downloaded along with the JavaFX SDK, has tools and plugins for Adobe Photoshop and Adobe Illustrator, so images can be edited in Photoshop and converted to the .FXZ format for easy use in animations and videos within applications. The Circos image produced after applying effects is below:
[Figure: Circos image after applying effects.]
Given below is the snippet used here to apply the effects to the Circos genome data visualization tool for cancer:
/*
 * Chapter7.fx
 *
 * v1.0 - Neha Gupta
 *
 * 7/5/2011 - created
 *
 * Applying effects
 */
package chaptersix;

import javafx.stage.Stage;
import javafx.scene.Scene;
import javafx.fxd.FXDNode;
import javafx.scene.Group;
import javafx.scene.effect.*;
import javafx.scene.image.ImageView;
import javafx.scene.image.Image;
import javafx.scene.paint.Color;

// Load the Circos image exported from Photoshop as an .fxz archive
var imagePath : String = "file:///depts/research/cancerCircos.fxz";
var circosImage : FXDNode = FXDNode {
    url: imagePath;
};

/**
 * @author ngupta
 */
Stage {
    title: "Effects"
    onClose: function () { }
    scene: Scene {
        width: 1000
        height: 1000
        content: [
            SetImages(circosImage)
        ]
    }
}

// Position the named "circos" node and attach the effect to it
function SetImages(image : FXDNode) : FXDNode {
    var circos : ImageView;
    circos = (image.getNode("circos") as ImageView);
    circos.translateX = -50;
    circos.translateY = -50;
    circos.effect = ApplyEffects();
    return image;
}

// Build the effect to apply: a 45-degree motion blur
function ApplyEffects() : Effect {
    var effectToApply : Effect;
    effectToApply = MotionBlur {
        angle: 45
        radius: 10
    }
    return effectToApply;
}
Using events:
The mouse event handlers supported by JavaFX include onMouseMoved, onMouseEntered, onMouseClicked, onMouseExited, onMousePressed, onMouseDragged, and onMouseReleased. The images below show onMouseClicked, onMouseReleased, and onMouseDragged events produced by running the JavaFX snippet files:
[Figure: onMouseClicked, onMouseReleased, and onMouseDragged examples.]
Mouse-over events have also been applied to the Circos image, as illustrated below [7]:
[Figure: Circos image with mouse-over events.]
Frustrations and Future Work: future work would be to use the latest release, JavaFX 2.0. A key feature of that release is the scene graph: a tree data structure that maintains an internal model of the graphical objects (links, ideograms, ticks, etc.) in the application. It knows which objects or elements of the image are to be used as clickable regions, because the application can be built from Scene, Group, Line, Arc, Circle, etc., and each element of the graph reacts to the full array of mouse events. Developing animations and videos within Circos would also be future work. Overall, the project was well directed; its single frustration was repetitive work being assigned to two people.
References:
1. Circos Modules - http://mkweb.bcgsc.ca/dev/circos/tutorials/lessons/configuration/perl_and_modules/
2. Circos Readme - http://circos.ca/software/readme/
3. Circos Tutorials - http://circos.ca/tutorials/lessons/
4. JavaFX – http://javafx.com
5. JavaFX Production Suite - http://download.oracle.com/javafx/1.3/reference/releasenotes/production-suite-release-notes-1-3.html#2
6. Inkscape - http://silveiraneto.net/2008/11/21/inkscape-and-javafx-working-together/
7. JavaFX – A beginner’s guide
Project 4: Automating PubMed Searches:
This project was well directed and organized as an independent project. Its main objective was to extract all the abstracts for a large number of genes and phenotypes/disorders from PubMed in a few seconds. I researched tools that can automate PubMed searches, and first tried using PubCrawler.
(1) PubCrawler - automated retrieval of PubMed and GenBank reports:
PubCrawler is free software; the download from its website contains the PubCrawler Perl script and its configuration file. After installing it locally, it can be run from the command line. An example invocation is:
pubcrawler.pl [-add_path /cwd/lib] [-c <config_file>] [-check] [-d <directory>] [-db <database>] [-fullmax <max-docs in full>] [-getmax <max-docs to retrieve>] [-i] [-out <output-file>] [-relentrezdate <relative-entrez-date>] [-s <search_term>] [-viewdays <view-days>]
The run above lists the command-line flags used to automate retrieval of PubMed reports. The -db flag names the database file that PubCrawler uses. The -fullmax flag specifies the maximum number of full-length reports to show per search, and -getmax the maximum number of documents to retrieve per search. The -i flag includes the configuration file in the HTML output, and the name of the HTML output file is given with -out. The -relentrezdate option sets the maximum age of a document to be retrieved (1 year, 2 years, 5 years, 'no limit', etc.). The -s option supplies the PubMed search term [2]. The -viewdays flag sets the number of days each document is shown, and -check verifies that everything is set up correctly. All of these command-line parameters, as well as the optional flags, can also be set in the PubCrawler configuration file; if the command-line options and the configuration file settings conflict, the command-line setting takes priority.
Search terms: search terms are defined exactly as in the Entrez search system. For instance, a Boolean search for abstracts about a gene and a phenotype would be pubcrawler.pl -s 'pubmed KRAS [ALL] AND Cancer [ALL]', which, entered at the command line, hides all other queries in the configuration file [2]. To automate multiple queries through this program, I wrote a Perl script that generates a new configuration file containing such search queries for any number of genes and any number of phenotypes.
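A sketch of that generator is below. It simply emits one Boolean gene-AND-phenotype query per pair; the exact directive syntax surrounding each query in a real PubCrawler configuration file should be copied from the sample config shipped with the program, so here the queries are written one per line as search terms only, and the input file names are hypothetical:

use strict;
use warnings;

# Usage: perl make_queries.pl genes.txt phenotypes.txt > queries.txt
my ($gene_file, $pheno_file) = @ARGV;

sub read_list {
    my ($path) = @_;
    open my $fh, '<', $path or die "Cannot open $path: $!";
    chomp(my @items = <$fh>);
    close $fh;
    return grep { length } @items;
}

my @genes      = read_list($gene_file);
my @phenotypes = read_list($pheno_file);

# One PubMed Boolean query per gene/phenotype pair
for my $gene (@genes) {
    for my $pheno (@phenotypes) {
        print "pubmed $gene [ALL] AND $pheno [ALL]\n";
    }
}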
Optional variables: some of the optional variables include (1) work_dir, the working directory for PubCrawler; (2) mail, an email address to which results are sent after each PubCrawler run; (3) format, the report format for PubMed and GenBank ('DocSum', 'Brief', 'Abstract', 'FASTA', etc.); (4) time_out, the time in seconds to wait for internet responses, defaulting to 180; (5) proxy, a proxy server for the internet connection; and (6) indent, the number of pixels by which PubCrawler comments are indented to align with the Entrez output.
Drawback: the main disadvantage of this approach is the time it consumes. A single query takes around 5 minutes to bring back all the research papers with all fields (author, primary accession number, publication date, etc.) when no limit is placed on the Entrez release date. I also tried running around 120 queries corresponding to 120 genes and their implicated phenotypes/disorders; the complete run took around 10 hours, which is quite inefficient.
(2) Using NCBI EUtils:
EUtils are very powerful and efficient for grabbing research articles from the PubMed database. Different URLs perform different functions. The following URL, for example, fetches the abstract for a single PubMed ID:
http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=pubmed&cmd=Retrieve&dopt=Abstract&list_uids=18461658&itool=pubmed_docsum
Similarly, the URL to display PMIDs in XML retrieval mode is:
http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&id=11748933,11700088&retmode=xml
Creating URLs for larger gene and phenotype lists:
I modified the EUtils URLs for retrieving abstracts and PubMed IDs to handle a larger list of genes and phenotypes. The new URL for getting the PubMed IDs from a list of genes is:
http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&retstart=0&retmax=10000000&usehistory=y&retmode=xml&term=Amelogenesisimperfecta and (gene1 or gene2 or gene3 or gene4 or …….)
The new URL for getting the abstracts is:
http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&id=1,2,3&retmode=html&rettype=abstract
Perl scripts:
With the URL templates for PMIDs and abstracts in hand, I wrote a Perl script that, given a list of genes and a list of phenotypes/disorders, builds the URL to fetch from the PubMed database the PubMed IDs of all research papers associated with that list. The content returned when the URL is opened is in XML format. I then extended the script to fetch the URL's content automatically, without having to click the link manually each time, and to parse the XML into a Perl hash-of-hash-of-array structure, from which the list of PubMed IDs is selectively printed. With the list of PubMed IDs, the abstracts URL is constructed, its content is again fetched automatically using Perl modules, and the returned XML is parsed to print the abstracts to standard output. However, there is a URL limit on the number of search terms that can be specified, around 200 Boolean terms for genes and implicated diseases; the maximum number of abstracts the script currently retrieves is 800.
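A condensed sketch of that pipeline is below, using LWP::Simple to fetch the URLs and XML::Simple to parse the responses; the module choices are mine for illustration, and the gene/phenotype terms are hypothetical:

use strict;
use warnings;
use LWP::Simple qw(get);
use XML::Simple qw(XMLin);

my $base = 'http://eutils.ncbi.nlm.nih.gov/entrez/eutils';

# Step 1: esearch returns the PMIDs matching a Boolean query as XML.
# Hypothetical search terms; spaces must be escaped for the URL.
my $term = 'Amelogenesis imperfecta AND (KRAS OR TP53)';
(my $query = $term) =~ s/ /+/g;

my $search_xml = get("$base/esearch.fcgi?db=pubmed&retmax=10000&retmode=xml&term=$query")
    or die "esearch request failed\n";
my $parsed = XMLin($search_xml, ForceArray => ['Id']);
my @pmids  = @{ $parsed->{IdList}{Id} || [] };
die "No PMIDs found\n" unless @pmids;

# Step 2: efetch returns the abstracts for a comma-separated PMID list
my $ids = join ',', @pmids;
my $abstracts = get("$base/efetch.fcgi?db=pubmed&id=$ids&retmode=xml&rettype=abstract")
    or die "efetch request failed\n";

print $abstracts;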
Frustrations and Future Work: this project was very challenging, and the key frustration, which would also be the future work, is removing the URL limit so that a query can take an arbitrarily long gene and phenotype list (for example, by batching the terms across multiple requests) while still grabbing the page contents automatically. Adding parameters such as restricting article retrieval by author name, Entrez release date, or accession number would also be future work.
References:
1. Integrating automated literature searches and text mining in biomarker discovery-
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2956392/
2. PubCrawler and EUtils - http://www.my-whiteboard.com/how-to-automate-pubmed-search-using-perl-php-or-java/
Project 5: 1000 Genomes and BioGPS:
This project was organized as a team project. First, the genotype data for the 1000 Genomes individuals was downloaded for hg18 in VCF format. The main idea behind this approach is to take the union of the 1000 Genomes data with the company's 50 individual genomes and analyze that union using PLINK association studies. To proceed, three site sets were generated: sites unique to the 1000 Genomes data that are not called in the 50-individual data, sites unique to the 50-individual data that are not called in the 1000 Genomes data, and sites called in both. The figure below represents this idea:
[Figure: Venn diagram of sites unique to the 50 individual genomes, sites unique to the 1000 Genomes individuals, and sites common to both.]
The list of approximately two million sites unique to the 1000 Genomes data was used in this work. The key question was how to get genotypes for these two million sites; to move forward, I wrote a Perl script to query the company's internal database for the genotypes at these sites for each of the fifty subject genome IDs in the 50-genomes data.
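A sketch of that query loop is below, using the standard Perl DBI. The DSN, credentials, table, and column names are all hypothetical placeholders for the company's internal schema, and sites are batched rather than queried one at a time:

use strict;
use warnings;
use DBI;

# Hypothetical connection details for the internal genotype store
my $dbh = DBI->connect('DBI:mysql:database=genomes;host=dbhost',
                       'user', 'password', { RaiseError => 1 });

# Prepare once; execute per (subject, site batch) to avoid per-site round trips
my $batch_size = 1000;
my $sth = $dbh->prepare(
    'SELECT chrom, pos, allele1, allele2
       FROM genotypes
      WHERE subject_id = ? AND chrom = ? AND pos IN (' .
      join(',', ('?') x $batch_size) . ')'
);

# @subjects and %sites_by_chrom would be loaded from the unique-sites list
my @subjects = ('KG001');                               # hypothetical subject IDs
my %sites_by_chrom = ( '1' => [ 1 .. $batch_size ] );   # hypothetical positions

for my $subject (@subjects) {
    for my $chrom (keys %sites_by_chrom) {
        my @batch = @{ $sites_by_chrom{$chrom} }[0 .. $batch_size - 1];
        $sth->execute($subject, $chrom, @batch);
        while (my ($c, $pos, $a1, $a2) = $sth->fetchrow_array) {
            print join("\t", $subject, $c, $pos, "$a1$a2"), "\n";
        }
    }
}
$dbh->disconnect;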
Simultaneously, I did data mining using the BioGPS gene portal database, which has expression charts for genes in humans, mice, rats, etc. I researched the correlation parameter settings, which display the correlation between the expression charts of two genes, and downloaded the expression data for around 44,000 probe sets from the website in order to build a vector matrix of expression data against the company's gene expression data.
Frustrations and Future Work: the initial Perl script was slow because it queried the database once for each of the two million sites; optimizing the code was essential to accomplish this. My future work would be to convert the union data into the PED and MAP input formats supported by PLINK and go forward with association studies.
More Related Content

What's hot

RNASeq DE methods review Applied Bioinformatics Journal Club
RNASeq DE methods review Applied Bioinformatics Journal ClubRNASeq DE methods review Applied Bioinformatics Journal Club
RNASeq DE methods review Applied Bioinformatics Journal ClubJennifer Shelton
 
Introduction to Proteogenomics
Introduction to Proteogenomics Introduction to Proteogenomics
Introduction to Proteogenomics
Yasset Perez-Riverol
 
Transcript detection in RNAseq
Transcript detection in RNAseqTranscript detection in RNAseq
Transcript detection in RNAseq
Denis C. Bauer
 
Protein databases
Protein databasesProtein databases
Protein databasessarumalay
 
Apollo Introduction for the Chestnut Research Community
Apollo Introduction for the Chestnut Research CommunityApollo Introduction for the Chestnut Research Community
Apollo Introduction for the Chestnut Research Community
Monica Munoz-Torres
 
BITs: Genome browsers and interpretation of gene lists.
BITs: Genome browsers and interpretation of gene lists.BITs: Genome browsers and interpretation of gene lists.
BITs: Genome browsers and interpretation of gene lists.
BITS
 
Liu_Jiangyuan_1201662_FR
Liu_Jiangyuan_1201662_FRLiu_Jiangyuan_1201662_FR
Liu_Jiangyuan_1201662_FR姜圆 刘
 
RNA-seq differential expression analysis
RNA-seq differential expression analysisRNA-seq differential expression analysis
RNA-seq differential expression analysis
mikaelhuss
 
OpenMS: Quantitative proteomics at large scale
OpenMS: Quantitative proteomics at large scaleOpenMS: Quantitative proteomics at large scale
OpenMS: Quantitative proteomics at large scale
Yasset Perez-Riverol
 
RNA-seq: Mapping and quality control - part 3
RNA-seq: Mapping and quality control - part 3RNA-seq: Mapping and quality control - part 3
RNA-seq: Mapping and quality control - part 3
BITS
 
De novo transcriptome assembly of solid sequencing data in cucumis melo
De novo transcriptome assembly of solid sequencing data in cucumis meloDe novo transcriptome assembly of solid sequencing data in cucumis melo
De novo transcriptome assembly of solid sequencing data in cucumis melo
bioejjournal
 
Using biological network approaches for dynamic extension of micronutrient re...
Using biological network approaches for dynamic extension of micronutrient re...Using biological network approaches for dynamic extension of micronutrient re...
Using biological network approaches for dynamic extension of micronutrient re...
Chris Evelo
 
Talk ABRF 2015 (Gunnar Rätsch)
Talk ABRF 2015 (Gunnar Rätsch)Talk ABRF 2015 (Gunnar Rätsch)
Talk ABRF 2015 (Gunnar Rätsch)
Gunnar Rätsch
 
916215 bioinformatics-over-view
916215 bioinformatics-over-view916215 bioinformatics-over-view
916215 bioinformatics-over-view
kudipudi
 
Prediction of protein function from sequence derived protein features
Prediction of protein function from sequence derived protein featuresPrediction of protein function from sequence derived protein features
Prediction of protein function from sequence derived protein features
Lars Juhl Jensen
 
BioInformatics Tools -Genomics , Proteomics and metablomics
BioInformatics Tools -Genomics , Proteomics and metablomicsBioInformatics Tools -Genomics , Proteomics and metablomics
BioInformatics Tools -Genomics , Proteomics and metablomics
AyeshaYousaf20
 
A Tovchigrechko - MGTAXA: a toolkit and webserver for predicting taxonomy of ...
A Tovchigrechko - MGTAXA: a toolkit and webserver for predicting taxonomy of ...A Tovchigrechko - MGTAXA: a toolkit and webserver for predicting taxonomy of ...
A Tovchigrechko - MGTAXA: a toolkit and webserver for predicting taxonomy of ...
Jan Aerts
 
Catalyzing Plant Science Research with RNA-seq
Catalyzing Plant Science Research with RNA-seqCatalyzing Plant Science Research with RNA-seq
Catalyzing Plant Science Research with RNA-seq
Manjappa Ganiger
 
Rna seq
Rna seq Rna seq
Rna seq
Amitha Dasari
 
Gutell 112.j.phys.chem.b.2010.114.13497
Gutell 112.j.phys.chem.b.2010.114.13497Gutell 112.j.phys.chem.b.2010.114.13497
Gutell 112.j.phys.chem.b.2010.114.13497
Robin Gutell
 

What's hot (20)

RNASeq DE methods review Applied Bioinformatics Journal Club
RNASeq DE methods review Applied Bioinformatics Journal ClubRNASeq DE methods review Applied Bioinformatics Journal Club
RNASeq DE methods review Applied Bioinformatics Journal Club
 
Introduction to Proteogenomics
Introduction to Proteogenomics Introduction to Proteogenomics
Introduction to Proteogenomics
 
Transcript detection in RNAseq
Transcript detection in RNAseqTranscript detection in RNAseq
Transcript detection in RNAseq
 
Protein databases
Protein databasesProtein databases
Protein databases
 
Apollo Introduction for the Chestnut Research Community
Apollo Introduction for the Chestnut Research CommunityApollo Introduction for the Chestnut Research Community
Apollo Introduction for the Chestnut Research Community
 
BITs: Genome browsers and interpretation of gene lists.
BITs: Genome browsers and interpretation of gene lists.BITs: Genome browsers and interpretation of gene lists.
BITs: Genome browsers and interpretation of gene lists.
 
Liu_Jiangyuan_1201662_FR
Liu_Jiangyuan_1201662_FRLiu_Jiangyuan_1201662_FR
Liu_Jiangyuan_1201662_FR
 
RNA-seq differential expression analysis
RNA-seq differential expression analysisRNA-seq differential expression analysis
RNA-seq differential expression analysis
 
OpenMS: Quantitative proteomics at large scale
OpenMS: Quantitative proteomics at large scaleOpenMS: Quantitative proteomics at large scale
OpenMS: Quantitative proteomics at large scale
 
RNA-seq: Mapping and quality control - part 3
RNA-seq: Mapping and quality control - part 3RNA-seq: Mapping and quality control - part 3
RNA-seq: Mapping and quality control - part 3
 
De novo transcriptome assembly of solid sequencing data in cucumis melo
De novo transcriptome assembly of solid sequencing data in cucumis meloDe novo transcriptome assembly of solid sequencing data in cucumis melo
De novo transcriptome assembly of solid sequencing data in cucumis melo
 
Using biological network approaches for dynamic extension of micronutrient re...
Using biological network approaches for dynamic extension of micronutrient re...Using biological network approaches for dynamic extension of micronutrient re...
Using biological network approaches for dynamic extension of micronutrient re...
 
Talk ABRF 2015 (Gunnar Rätsch)
Talk ABRF 2015 (Gunnar Rätsch)Talk ABRF 2015 (Gunnar Rätsch)
Talk ABRF 2015 (Gunnar Rätsch)
 
916215 bioinformatics-over-view
916215 bioinformatics-over-view916215 bioinformatics-over-view
916215 bioinformatics-over-view
 
Prediction of protein function from sequence derived protein features
Prediction of protein function from sequence derived protein featuresPrediction of protein function from sequence derived protein features
Prediction of protein function from sequence derived protein features
 
BioInformatics Tools -Genomics , Proteomics and metablomics
BioInformatics Tools -Genomics , Proteomics and metablomicsBioInformatics Tools -Genomics , Proteomics and metablomics
BioInformatics Tools -Genomics , Proteomics and metablomics
 
A Tovchigrechko - MGTAXA: a toolkit and webserver for predicting taxonomy of ...
A Tovchigrechko - MGTAXA: a toolkit and webserver for predicting taxonomy of ...A Tovchigrechko - MGTAXA: a toolkit and webserver for predicting taxonomy of ...
A Tovchigrechko - MGTAXA: a toolkit and webserver for predicting taxonomy of ...
 
Catalyzing Plant Science Research with RNA-seq
Catalyzing Plant Science Research with RNA-seqCatalyzing Plant Science Research with RNA-seq
Catalyzing Plant Science Research with RNA-seq
 
Rna seq
Rna seq Rna seq
Rna seq
 
Gutell 112.j.phys.chem.b.2010.114.13497
Gutell 112.j.phys.chem.b.2010.114.13497Gutell 112.j.phys.chem.b.2010.114.13497
Gutell 112.j.phys.chem.b.2010.114.13497
 

Viewers also liked

Uniforme moderno
Uniforme modernoUniforme moderno
Uniforme moderno
Estefy Ordoñez
 
7 signs your computer has virus
7 signs your computer has virus7 signs your computer has virus
7 signs your computer has virus
techexpert2345
 
Continuacion 111
Continuacion 111Continuacion 111
Continuacion 111
maria gutierrez mamani
 
curriculum vitae
curriculum vitaecurriculum vitae
curriculum vitaeSunil Singh
 
DJ VISHAL PROFILE 2015-2016
DJ  VISHAL PROFILE 2015-2016DJ  VISHAL PROFILE 2015-2016
DJ VISHAL PROFILE 2015-2016vishal waval
 
Tài liệu môn trí tuệ nhân tạo đh bách khoa tp hồ chí minh tài liệu, ebook
Tài liệu môn trí tuệ nhân tạo đh bách khoa tp hồ chí minh   tài liệu, ebookTài liệu môn trí tuệ nhân tạo đh bách khoa tp hồ chí minh   tài liệu, ebook
Tài liệu môn trí tuệ nhân tạo đh bách khoa tp hồ chí minh tài liệu, ebook
Tráng Hà Viết
 
Vanessa
VanessaVanessa
Oop unit 09 lập trình tổng quát
Oop unit 09 lập trình tổng quátOop unit 09 lập trình tổng quát
Oop unit 09 lập trình tổng quát
Tráng Hà Viết
 
Tips to make your computer happy
Tips to make your computer happyTips to make your computer happy
Tips to make your computer happy
techexpert2345
 

Viewers also liked (11)

blogger
bloggerblogger
blogger
 
Uniforme moderno
Uniforme modernoUniforme moderno
Uniforme moderno
 
7 signs your computer has virus
7 signs your computer has virus7 signs your computer has virus
7 signs your computer has virus
 
Continuacion 111
Continuacion 111Continuacion 111
Continuacion 111
 
curriculum vitae
curriculum vitaecurriculum vitae
curriculum vitae
 
DJ VISHAL PROFILE 2015-2016
DJ  VISHAL PROFILE 2015-2016DJ  VISHAL PROFILE 2015-2016
DJ VISHAL PROFILE 2015-2016
 
Tài liệu môn trí tuệ nhân tạo đh bách khoa tp hồ chí minh tài liệu, ebook
Tài liệu môn trí tuệ nhân tạo đh bách khoa tp hồ chí minh   tài liệu, ebookTài liệu môn trí tuệ nhân tạo đh bách khoa tp hồ chí minh   tài liệu, ebook
Tài liệu môn trí tuệ nhân tạo đh bách khoa tp hồ chí minh tài liệu, ebook
 
MapleBrochure
MapleBrochureMapleBrochure
MapleBrochure
 
Vanessa
VanessaVanessa
Vanessa
 
Oop unit 09 lập trình tổng quát
Oop unit 09 lập trình tổng quátOop unit 09 lập trình tổng quát
Oop unit 09 lập trình tổng quát
 
Tips to make your computer happy
Tips to make your computer happyTips to make your computer happy
Tips to make your computer happy
 

Similar to Internship Report

Analyzing Genomic Data with PyEnsembl and Varcode
Analyzing Genomic Data with PyEnsembl and VarcodeAnalyzing Genomic Data with PyEnsembl and Varcode
Analyzing Genomic Data with PyEnsembl and Varcode
Alex Rubinsteyn
 
T-bioinfo overview
T-bioinfo overviewT-bioinfo overview
T-bioinfo overview
Jaclyn Williams
 
T-BioInfo Methods and Approaches
T-BioInfo Methods and ApproachesT-BioInfo Methods and Approaches
T-BioInfo Methods and Approaches
Elia Brodsky
 
Summer internship at University of Tokyo
Summer internship at University of TokyoSummer internship at University of Tokyo
Summer internship at University of Tokyo
Vaibhav Kulshrestha
 
Introduction to Apollo for i5k
Introduction to Apollo for i5kIntroduction to Apollo for i5k
Introduction to Apollo for i5k
Monica Munoz-Torres
 
Closing the Gap in Time: From Raw Data to Real Science
Closing the Gap in Time: From Raw Data to Real ScienceClosing the Gap in Time: From Raw Data to Real Science
Closing the Gap in Time: From Raw Data to Real Science
Justin Johnson
 
Bioinformatics.pptx
Bioinformatics.pptxBioinformatics.pptx
Bioinformatics.pptx
Piyush Mishra
 
Bioinformatic, and tools by kk sahu
Bioinformatic, and tools by kk sahuBioinformatic, and tools by kk sahu
Bioinformatic, and tools by kk sahu
KAUSHAL SAHU
 
CAFA poster presented at CSHL Genome Informatics 2013
CAFA poster presented at CSHL Genome Informatics 2013CAFA poster presented at CSHL Genome Informatics 2013
CAFA poster presented at CSHL Genome Informatics 2013Iddo
 
Web Apollo Tutorial for the i5K copepod research community.
Web Apollo Tutorial for the i5K copepod research community.Web Apollo Tutorial for the i5K copepod research community.
Web Apollo Tutorial for the i5K copepod research community.
Monica Munoz-Torres
 
Knowing Your NGS Downstream: Functional Predictions
Knowing Your NGS Downstream: Functional PredictionsKnowing Your NGS Downstream: Functional Predictions
Knowing Your NGS Downstream: Functional Predictions
Golden Helix Inc
 
RNA Sequencing Research
RNA Sequencing ResearchRNA Sequencing Research
RNA Sequencing ResearchTanmay Ghai
 
Protein database
Protein databaseProtein database
Protein database
Khalid Hakeem
 
Proteomics: lecture (1) introduction to proteomics
Proteomics: lecture (1) introduction to proteomicsProteomics: lecture (1) introduction to proteomics
Proteomics: lecture (1) introduction to proteomics
Claudine83
 
Impact_of_gene_length_on_DEG
Impact_of_gene_length_on_DEGImpact_of_gene_length_on_DEG
Impact_of_gene_length_on_DEGLong Pei
 
Understanding Genome
Understanding Genome Understanding Genome
Understanding Genome
Rajendra K Labala
 
Web Apollo Workshop University of Exeter
Web Apollo Workshop University of ExeterWeb Apollo Workshop University of Exeter
Web Apollo Workshop University of Exeter
Monica Munoz-Torres
 
Dgaston dec-06-2012
Dgaston dec-06-2012Dgaston dec-06-2012
Dgaston dec-06-2012
Dan Gaston
 
rheumatoid arthritis
rheumatoid arthritisrheumatoid arthritis
rheumatoid arthritis
Ankit Bhardwaj
 

Similar to Internship Report (20)

Analyzing Genomic Data with PyEnsembl and Varcode
Analyzing Genomic Data with PyEnsembl and VarcodeAnalyzing Genomic Data with PyEnsembl and Varcode
Analyzing Genomic Data with PyEnsembl and Varcode
 
T-bioinfo overview
T-bioinfo overviewT-bioinfo overview
T-bioinfo overview
 
T-BioInfo Methods and Approaches
T-BioInfo Methods and ApproachesT-BioInfo Methods and Approaches
T-BioInfo Methods and Approaches
 
Summer internship at University of Tokyo
Summer internship at University of TokyoSummer internship at University of Tokyo
Summer internship at University of Tokyo
 
Introduction to Apollo for i5k
Introduction to Apollo for i5kIntroduction to Apollo for i5k
Introduction to Apollo for i5k
 
Closing the Gap in Time: From Raw Data to Real Science
Closing the Gap in Time: From Raw Data to Real ScienceClosing the Gap in Time: From Raw Data to Real Science
Closing the Gap in Time: From Raw Data to Real Science
 
Bioinformatics.pptx
Bioinformatics.pptxBioinformatics.pptx
Bioinformatics.pptx
 
Bioinformatic, and tools by kk sahu
Bioinformatic, and tools by kk sahuBioinformatic, and tools by kk sahu
Bioinformatic, and tools by kk sahu
 
CAFA poster presented at CSHL Genome Informatics 2013
CAFA poster presented at CSHL Genome Informatics 2013CAFA poster presented at CSHL Genome Informatics 2013
CAFA poster presented at CSHL Genome Informatics 2013
 
Web Apollo Tutorial for the i5K copepod research community.
Web Apollo Tutorial for the i5K copepod research community.Web Apollo Tutorial for the i5K copepod research community.
Web Apollo Tutorial for the i5K copepod research community.
 
Knowing Your NGS Downstream: Functional Predictions
Knowing Your NGS Downstream: Functional PredictionsKnowing Your NGS Downstream: Functional Predictions
Knowing Your NGS Downstream: Functional Predictions
 
RNA Sequencing Research
RNA Sequencing ResearchRNA Sequencing Research
RNA Sequencing Research
 
Protein database
Protein databaseProtein database
Protein database
 
Proteomics: lecture (1) introduction to proteomics
Proteomics: lecture (1) introduction to proteomicsProteomics: lecture (1) introduction to proteomics
Proteomics: lecture (1) introduction to proteomics
 
Impact_of_gene_length_on_DEG
Impact_of_gene_length_on_DEGImpact_of_gene_length_on_DEG
Impact_of_gene_length_on_DEG
 
Understanding Genome
Understanding Genome Understanding Genome
Understanding Genome
 
Web Apollo Workshop University of Exeter
Web Apollo Workshop University of ExeterWeb Apollo Workshop University of Exeter
Web Apollo Workshop University of Exeter
 
Dgaston dec-06-2012
Dgaston dec-06-2012Dgaston dec-06-2012
Dgaston dec-06-2012
 
rheumatoid arthritis
rheumatoid arthritisrheumatoid arthritis
rheumatoid arthritis
 
LOKITHESWARI VIPPALA
LOKITHESWARI VIPPALALOKITHESWARI VIPPALA
LOKITHESWARI VIPPALA
 

More from Neha Gupta

applications-and-current-challenges-of-supercomputing-across-multiple-domains...
applications-and-current-challenges-of-supercomputing-across-multiple-domains...applications-and-current-challenges-of-supercomputing-across-multiple-domains...
applications-and-current-challenges-of-supercomputing-across-multiple-domains...Neha Gupta
 
5315Syll2015Fall
5315Syll2015Fall5315Syll2015Fall
5315Syll2015FallNeha Gupta
 
syllabus_OS_Fall_2015
syllabus_OS_Fall_2015syllabus_OS_Fall_2015
syllabus_OS_Fall_2015Neha Gupta
 
Stem Cell Presentation
Stem Cell PresentationStem Cell Presentation
Stem Cell PresentationNeha Gupta
 
biomedical_machine_learning_poster_48''x36''
biomedical_machine_learning_poster_48''x36''biomedical_machine_learning_poster_48''x36''
biomedical_machine_learning_poster_48''x36''Neha Gupta
 
NehaGupta-SupervisorStatement
NehaGupta-SupervisorStatementNehaGupta-SupervisorStatement
NehaGupta-SupervisorStatementNeha Gupta
 
human_mutation_article
human_mutation_articlehuman_mutation_article
human_mutation_articleNeha Gupta
 

More from Neha Gupta (9)

applications-and-current-challenges-of-supercomputing-across-multiple-domains...
applications-and-current-challenges-of-supercomputing-across-multiple-domains...applications-and-current-challenges-of-supercomputing-across-multiple-domains...
applications-and-current-challenges-of-supercomputing-across-multiple-domains...
 
5315Syll2015Fall
5315Syll2015Fall5315Syll2015Fall
5315Syll2015Fall
 
syllabus_OS_Fall_2015
syllabus_OS_Fall_2015syllabus_OS_Fall_2015
syllabus_OS_Fall_2015
 
Stem Cell Presentation
Stem Cell PresentationStem Cell Presentation
Stem Cell Presentation
 
Presentation1
Presentation1Presentation1
Presentation1
 
neha_ppt
neha_pptneha_ppt
neha_ppt
 
biomedical_machine_learning_poster_48''x36''
biomedical_machine_learning_poster_48''x36''biomedical_machine_learning_poster_48''x36''
biomedical_machine_learning_poster_48''x36''
 
NehaGupta-SupervisorStatement
NehaGupta-SupervisorStatementNehaGupta-SupervisorStatement
NehaGupta-SupervisorStatement
 
human_mutation_article
human_mutation_articlehuman_mutation_article
human_mutation_article
 

Internship Report

  • 1.
  • 2. 13 August 2011 Supervisor’sstatementforNehaGupta In her summer 2011 co-op, Neha worked in Knome’s Research group, scripting for live genome analysis projects and scouting bioinformatics resources for potential use in future development. I supervised her, and she also worked closely with another NU co-op and with two bioinformaticians in the Operations group. On some projects, she also worked with members of the Development group. In just a few weeks with Knome, Neha worked on many different projects. She wrote several key scripts for Knome, including one that helps streamline our literature search when finishing a project analysis. She also researched, and wrote code to help parse/prepare files for, bioinformatics resources such as PLINK. We at Knome enjoyed having her work with us, and wish her very well! Sincerely, Nathaniel Pearson Director of Research, Knome, Inc.
precomputed SIFT and PolyPhen scores for every possible mutation in the human proteome. Condel was run using the Ensembl Variation database Perl API. The Variant Effect Predictor takes in a large list of variants in VCF or pileup format and predicts Condel scores based on the SIFT and PolyPhen scores precomputed in the Ensembl database7. The genomic coordinates in the input file must match the reference genome used by that Ensembl database version (v62 corresponds to hg19). The predictor is run through a command-line interface, and the command-line options can also be defined in a configuration file. A second, optional registry file points the queries to the right databases (the MySQL Ensembl Homo sapiens core and variation databases are essential). An example of a registry file is:
use Bio::EnsEMBL::DBSQL::DBAdaptor;
use Bio::EnsEMBL::Variation::DBSQL::DBAdaptor;
use Bio::EnsEMBL::Registry;

Bio::EnsEMBL::DBSQL::DBAdaptor->new(
    '-species' => 'Homo sapiens',
    '-group'   => 'core',
    '-port'    => 3306,
    '-host'    => 'kensembl01',
    '-user'    => 'ngupta',
    '-pass'    => 'abcd',
    '-dbname'  => 'homo_sapiens_core_62_37g'
);

Bio::EnsEMBL::Variation::DBSQL::DBAdaptor->new(
    '-species' => 'Homo sapiens',
    '-group'   => 'variation',
    '-port'    => 3306,
    '-host'    => 'kensembl01',
    '-user'    => 'ngupta',
    '-pass'    => 'abcd',
    '-dbname'  => 'homo_sapiens_variation_62_37g'
);

All adaptors are stored using the Bio::EnsEMBL::Registry module, which is used to retrieve the adaptors required. Bio::EnsEMBL::DBSQL::DBAdaptor connects to the 'homo_sapiens_core_62_37g' database through the DBConnection class and creates a new object with the species, group and other flags; that object is retrieved with the help of the Bio::EnsEMBL::Registry module. The Bio::EnsEMBL::Variation::DBSQL::DBAdaptor call in the second half of the registry file likewise provides a connection to the Ensembl variation database 'homo_sapiens_variation_62_37g'.

An output of Condel scores was produced and analysed. Condel assesses the effects of missense variants, nonsense-mediated decay (NMD) variants and read-through variants. A missense mutation changes an amino acid during translation, thereby altering the protein synthesized. Nonsense-mediated decay is an mRNA surveillance mechanism that detects nonsense mutations and prevents the expression of erroneous proteins: following transcription, precursor mRNA undergoes pre-mRNA processing, and aberrant splicing (favoured by the large introns of eukaryotes) can introduce a nonsense codon (UAA, UAG, UGA) within the open reading frame. A frameshift (read-through) mutation is caused by insertions or deletions of nucleotides; because gene expression reads codons in triplets, an indel shifts the reading frame (the grouping of the codons), so an altered protein is synthesized after translation. A Condel score near 1 indicates a deleterious allele variant, whereas a score near 0 indicates a benign one.

This part of the project was organized as a team project. The key question was how accurate the Condel scores produced from SIFT and PolyPhen alone are, and how they compare with Condel scores that take all five tools into account. To answer this question, I researched the program MAPP independently.
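Returning to the registry file above: a minimal sketch of how such a file is loaded and used with the Ensembl Perl API. The rsID and file name are purely illustrative; the calls follow the Ensembl Perl API documentation as I understand it for this release.

use strict;
use warnings;
use Bio::EnsEMBL::Registry;

# Load the database connections defined in the registry file above.
Bio::EnsEMBL::Registry->load_all('ensembl.registry');

# Fetch the variation adaptor for human and look up a variant by rsID.
my $va = Bio::EnsEMBL::Registry->get_adaptor('Homo sapiens', 'variation', 'variation');
my $variation = $va->fetch_by_name('rs699');   # illustrative rsID

print $variation->name, "\n";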
MAPP:
MAPP reads an alignment of protein sequences and an alignment tree to calculate the predicted impact of each potential SNP at each position6. These predictions are based entirely on a set of scales of physicochemical properties, for which each amino acid has a numeric value. This open-source program takes two input files: one containing the alignment in FASTA format, and a text file containing the alignment tree in parenthesis notation with branch lengths6. The alignment file must contain properly aligned sequences with gaps indicated by hyphens, and the tree file must be formatted in a way that MAPP accepts. The program is built on a Java platform and requires a large number of orthologs. The alignment and tree files were obtained by installing ClustalW locally; other options such as Semphy and ClustalX can also be used to produce the multiple sequence alignment6. I ran ClustalW locally to perform a multiple sequence alignment of six sample human sequences, generating an alignment FASTA file and a tree file as output.

The physicochemical properties used by MAPP are hydropathy, polarity, charge, volume, free energy in alpha-helix conformation and free energy in beta-strand conformation2. MAPP produces an output table in Excel format with the following columns:
1. Position - position of the column in the alignment
2. Column score - MAPP median score for this alignment column
3. Column p-value - p-value interpretation of the column's MAPP score
4. Alignment - amino acids observed in that column, in alignment order
5. Gap weight - weighted fraction of gaps in the column
6. Over gap weight threshold - whether the column exceeds the gap weight threshold, since MAPP does not calculate scores when the gap weight is greater than 50%
7. P-values corresponding to the significance of each physicochemical property
8. Columns of MAPP scores for each possible amino acid variant
9. P-value interpretations of the MAPP scores, predicting the impact of each amino acid variant
10. Good amino acids - amino acids tolerated at this position
11. Bad amino acids - amino acids deleterious at this position

An example of how the output is produced in MAPP is as follows (figure: screenshot of an example MAPP output table).
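For reference, a sketch of the two-step run used here. File names are illustrative; the ClustalW options are standard, and the MAPP flags follow its documentation as I recall them, so they are worth double-checking against the MAPP README.

# 1. Align the orthologous sequences; ClustalW also writes a guide tree
#    (orthologs.dnd) alongside the alignment by default.
clustalw2 -INFILE=orthologs.fa -ALIGN -OUTPUT=FASTA -OUTFILE=orthologs.aln.fa

# 2. Run MAPP on the alignment and tree.
java -jar MAPP.jar -f orthologs.aln.fa -t orthologs.dnd -o mapp_scores.xls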
The UCSC Genome Browser5 provides tracks with information such as chromosome, allelic coordinates and base-pair position for tRNA genes, miRNA genes, regulatory genes and expression genes, among others. These files come in various formats, such as Browser Extensible Data (BED), BigBed and Wiggle. I researched and downloaded the BED tracks in order to integrate a BED parser into one of the company's tools.

Future Work: The decision to divide the work independently, with research on Condel on one side and on the other tools (Pfam logRE values, MAPP, Mutation Assessor) on the other, was a good one: the work was delegated, so no time was wasted on the same task being assigned to two people. The future work in this project is to integrate SIFT and PolyPhen with MAPP, Pfam logRE and Mutation Assessor, which I believe is a more accurate way of predicting Condel scores because it draws on all five computational tools. MAPP, Pfam logRE and Mutation Assessor are more in-depth and accurate in predicting the effects of all types of variants in a human proteome4,6, so a weighted score of these five tools should be more accurate than a Condel score based only on SIFT and PolyPhen1. However, the key frustration in MAPP is automating the generation of the alignment and tree files on a whole-genome basis. A single MAPP run produces scores for only one alignment file and one alignment tree, so for the entire genome of an individual the process must be automated to generate alignment and tree files for all the orthologous proteins. It is memory-intensive and difficult to align all the orthologous proteins in one go to produce a single alignment file6. How Pfam logRE works on a genome-wide basis would be an interesting concept to learn about.

References:
1. Condel - consensus deleteriousness score of missense SNVs: http://bg.upf.edu/group/projects/condel.php
2. Physicochemical constraint violation by missense substitutions mediates impairment of protein function and disease severity: http://mendel.stanford.edu/SidowLab/pdfs/2005StoneSidow.pdf
3. PolyPhen: http://www.ncbi.nlm.nih.gov/pubmed/21412949 and http://www.ncbi.nlm.nih.gov/pubmed/20642364
4. Pfam: http://cancerres.aacrjournals.org/content/67/2/465.long
5. ENCODE: http://genome.ucsc.edu/ENCODE/
6. MAPP: http://mendel.stanford.edu/sidowlab/downloads/MAPP/index.html and http://www.plosgenetics.org/article/info%3Adoi%2F10.1371%2Fjournal.pgen.1000968
7. Ensembl: http://useast.ensembl.org/info/docs/variation/index.html

Project 2: Pedigree Analysis and Estimation of Inbreeding:
This project was directed as an independent project, and its aim was to obtain kinship estimates for the company's subject data.

PLINK for GWAS: PLINK is an open-source whole-genome association analysis program written in C/C++, useful for performing a range of large-scale analyses in a computationally efficient manner. Its main focus is the analysis of genotype (SNP) and phenotype data. It is integrated with gPLINK
(a GUI for PLINK) and with Haploview, supporting subsequent visualization and annotation of the results1. PLINK v1.07 is freely available for download for any operating system from its website.

Running PLINK: PLINK is run through a command-line interface. A nice feature of PLINK is that each run is stored in a log file, so the --rerun option is quite useful for repeating a previous analysis using that log file as a base.

Input files: PLINK takes as input two files, specified together with the command-line option --file or, more explicitly, as separate PED and MAP files. For instance:

./plink --ped data.ped --map data.map

(1) PED: A PED file stores the pedigree information for each individual in a specific format accepted by PLINK. For larger files (100K+ SNPs), the binary form of the PED file helps increase the program's efficiency. The PED file is white-space delimited; the first six columns, which hold the pedigree information for an individual, are mandatory: Family ID, Individual ID, Paternal ID, Maternal ID, Sex (1 = male; 2 = female; other = unknown) and Phenotype1. The phenotype can be either a quantitative trait or an affection status column; by default, affection status is coded -9 = missing, 0 = missing, 1 = unaffected, 2 = affected1. Genotypes begin at column 7 and must be white-space delimited, specified by the characters 1, 2, 3, 4 or A, C, G, T, with 0 for a missing allele. All SNP markers, whether haploid or not, must be biallelic (missing genotype: 0 0)1. Omitting the family ID, parent IDs, sex or phenotype from a basic PED file is indicated with the corresponding --no options on the command line. To save space and time, a binary PED file (*.bed) can be created; it stores the pedigree/phenotype information in a separate file (*.fam) and creates an extended map file (*.bim)1.

(2) MAP files: MAP files store the genetic mapping information for the SNP markers in the PED file. Each line of a MAP file describes a single marker and contains four columns: chromosome (1-22, X, Y, 0 if unplaced), rs# or SNP identifier, genetic distance in morgans, and base-pair position1. Alternatively, a MAP file without the genetic distance column can be used. SNP identifiers can contain any characters except white space and *. To exclude a SNP from the analysis, set its base-pair position to a negative value. The MAP file must contain exactly as many markers as the PED file. PLINK's internal numeric codes for the other chromosome types are X = X chromosome (23), Y = Y chromosome (24), XY = pseudo-autosomal region of X (25) and MT = mitochondrial (26)1.

Inbreeding analysis: Among the various analyses performed by PLINK, a useful one is the calculation of inbreeding coefficients (F) from the observed versus expected numbers of homozygous genotypes, estimated as F = (O(HOM) - E(HOM)) / (N(NM) - E(HOM)). Inbreeding is a good measure of kinship within a group of individuals typed at a large number of SNPs. Six individuals belonging to six different families were considered to obtain inbreeding coefficients. The PED and MAP files were generated as follows.

PED file:
1 1 0 0 1 1 A A G T
2 1 0 0 1 1 A C T G
3 1 0 0 1 1 C C G G
4 1 0 0 1 2 A C T T
5 1 0 0 1 2 C C G T
6 1 0 0 1 2 C C T T

MAP file:
1 snp1 0 1
1 snp2 0 2

To estimate inbreeding among these subjects, PLINK is run on the input PED and MAP files as

./plink --ped data.ped --map data.map --het

which creates an output file, plink.het, exemplified as follows:

FID IID O(HOM) E(HOM) N(NM) F
1   1   1      1.069  2     -0.07463
2   1   0      1.069  2     -1.149
3   1   2      1.069  2      1

where FID = family ID, IID = individual ID, O(HOM) = observed number of homozygotes, E(HOM) = expected number of homozygotes, N(NM) = number of non-missing genotypes and F = inbreeding coefficient estimate.

To analyze a large number of SNP markers, I downloaded the HapMap genotype and pedigree data for the Yoruba trio Y117 from the HapMap website, considering only a thousand SNP markers per individual. I generated a single genotype file for each individual in order to automate the production of the PLINK .ped and .map files with Perl (a minimal sketch of this conversion appears at the end of this section).

Pairwise IBD estimation: Identity-by-descent analysis is useful for detecting pairs of individuals who look too similar to each other, that is, more similar than expected by chance in a random sample. Genome-wide IBD can be calculated as long as a large number of SNPs is available (ideally 100K or more)1. It is run by adding the --genome option to the input files. The output for kinship estimation is:

FID1 IID1 FID2 IID2 RT EZ  Z0   Z1     Z2    PI_HAT PHE DST  PPC    RATIO
1    14   1    11   PO 0.5 0.00 1.00   0.00  0.5    1   0.5  NA     NA
1    14   1    1    OT 0   0.00 0.9851 0.014 0.5075 0   0.75 0.7603 NA
1    13   6    1    UN NA  1.00 0      0     0      1   0.5  0.078  0.0

where FID1 = family ID of the first individual, IID1 = individual ID of the first individual, FID2 = family ID of the second individual, IID2 = individual ID of the second individual, RT = relationship type, EZ = expected IBD sharing given the PED file, Z0 = P(IBD=0) (probability of sharing 0 alleles), Z1 = P(IBD=1) (one allele shared), Z2 = P(IBD=2), PI_HAT = proportion of IBD, PHE = phenotype code for the pair, DST = IBS distance, PPC = IBS binomial test, and RATIO = ratio of HETHET to IBS0 SNPs.

KING tool for GWAS: The KING tool works in a similar fashion to PLINK. Its advantage over PLINK is that it can flag pedigree errors by estimating kinship coefficients for all pairwise relationships3; moreover, its kinship estimates for closely related individuals are highly accurate. Compared with PLINK, the tool is very fast and robust to population structure, and it can also take complicated pedigrees into account.
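As referenced above, a minimal sketch of the genotype-to-PED/MAP conversion. The input layout (one "rsID chromosome position allele1 allele2" line per marker) and the file names are hypothetical stand-ins for the HapMap per-individual genotype files.

use strict;
use warnings;

# Usage: perl geno2plink.pl <family_id> <individual_id> <genotype_file>
my ($fid, $iid, $geno_file) = @ARGV;

open my $in,  '<', $geno_file or die "Cannot read $geno_file: $!";
open my $ped, '>', "$iid.ped" or die "Cannot write PED: $!";
open my $map, '>', "$iid.map" or die "Cannot write MAP: $!";

# First six mandatory PED columns: FID IID PAT MAT SEX PHENO
# (parents unknown = 0, sex unknown = 0, phenotype missing = -9).
my @ped_row = ($fid, $iid, 0, 0, 0, -9);

while (my $line = <$in>) {
    chomp $line;
    my ($rsid, $chr, $pos, $a1, $a2) = split /\s+/, $line;
    print {$map} join(' ', $chr, $rsid, 0, $pos), "\n";  # genetic distance set to 0
    push @ped_row, $a1, $a2;                             # genotypes from column 7 onward
}

print {$ped} join(' ', @ped_row), "\n";
close $_ for $in, $ped, $map;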
Installing KING: KING version 1.2 can be downloaded freely from its website for Linux, Windows, Mac and Ubuntu. KING is run through a command-line interface and performs accurate GWAS analyses for population samples as large as 10,000+ individuals3.

Input files: The input files comprise a data file (-d), a pedigree file (-p) and a map file in MERLIN format or, for increased efficiency, a single binary-format file (-b), run for example as ./king -b ex.bgeno.gz --kinship --related. The PED and MAP file formats of PLINK and KING are the same. KING, however, reads an additional data file with one row per data item in the pedigree file, indicating each item's type:

<contents of data.dat>
A some_disease
T some_trait
M some_marker
M1 Marker 1
....
</contents of data.dat>

Running KING: The pedigree information file and the genotype files were downloaded for the trio NA19238, NA19239 and NA19240 from the HapMap website. I then generated three genotype files, one per individual, by aggregating the genotypes from all the downloaded per-chromosome files, and created a map file for each individual from the pedigree file. To analyze over a million SNP sites in these individuals, I wrote a Perl script that converts their genotypes into an input pedigree file with a million markers, and similarly used Perl to convert the trio's map file into the MAP format accepted by KING. I also automated the production of a DAT file with Perl for any number of markers given as an argument, as sketched below.
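A minimal sketch of that DAT generator, assuming the same "one row per data item" convention as the example above; the disease row and marker naming are illustrative.

use strict;
use warnings;

# Usage: perl make_dat.pl <number_of_markers> > ex.dat
my ($n_markers) = @ARGV;
die "Usage: $0 <number_of_markers>\n" unless defined $n_markers;

# Affection status first, then one marker row per SNP in the PED file.
print "A some_disease\n";
print "M snp$_\n" for 1 .. $n_markers;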
The king.bgeno file is the binary form of the .ped, .map and .dat files and can be produced by running:

./king -d ex.dat -p ex.ped -m ex.map --binary

KING is then run on the binary file to get the final kinship estimates for the three individuals from the HapMap data3:

./king -b ex.bgeno --kinship

The output king.kin file is as follows:

FID ID1 ID2 N_SNP  Z0    Phi    HetHet IBS0   Kinship Error
117 1   2   494819 0.000 0.2500 0.278  0.0004 0.2486  0
117 1   3   498224 0.000 0.2500 0.268  0.0002 0.2446  0
117 2   3   496116 1.000 0.0000 0.206  0.1197 -0.0306 0

This is the within-family relationship case, where FID = family ID for the pair, ID1 = individual ID of the first individual in the pair, ID2 = individual ID of the second, N_SNP = number of SNPs with no missing genotypes in either individual, Z0 = Pr(IBD=0), Phi = kinship coefficient specified by the pedigree, HetHet = proportion of SNPs with double heterozygotes, IBS0 = proportion of SNPs with zero identity by state (an estimate of the difference between individuals), Kinship = kinship coefficient estimated from the SNP data, and Error = difference between the estimated and specified kinship coefficients (1 for error, 0.5 for warning). A negative kinship coefficient indicates an unrelated pair. Estimated kinship coefficients of > 0.354, [0.177, 0.354], [0.0884, 0.177] and [0.0442, 0.0884] correspond to duplicate/MZ-twin, 1st-degree, 2nd-degree and 3rd-degree relationships, respectively. The main disadvantage of this tool is its relationship inference for more distant relationships.

Frustration and Future Work: The main frustration in this work was optimizing the code for around two million variant sites. Future work would be to plot the kinship scores from both PLINK and KING and analyze the correlation between the two tools, to scale up to genome-wide association studies of 2,000 or 3,000 individuals, and to use gPLINK and Haploview for more detailed analyses.

References:
1. PLINK: http://pngu.mgh.harvard.edu/~purcell/plink/dosage.shtml#format
2. Visualizing sample relatedness in a GWAS using PLINK and R: http://gettinggeneticsdone.blogspot.com/2009/10/visualizing-sample-relatedness-in-gwas.html
3. KING: http://people.virginia.edu/~wc9c/KING/manual.html

Project 3: Developing Circos and JavaFX:
The first half of this project was organized and directed as a team project.

Background: Circos is a software package for visualizing genomic data and information. It lays data out in a circular layout, which makes it ideal for exploring relationships between objects or positions. It is very easily automated, since it is controlled by plain-text configuration files. It is written in Perl and produces bitmap (PNG) and vector (SVG) images, with accompanying HTML image maps, from plain-text configuration and input files2. Circos is free software, and the circos-0.54 package, the Circos tools (0.15), configuration files and data files were downloaded from its website. The configuration files support drawing ideograms, highlights, tick marks, links and ribbons, scatter plots, line plots, histograms, heat maps, image maps, and so on3. Usually the image-generation process is driven by a central configuration file, which in turn imports other configuration files such as global color and font settings. The steps involved are (1) parsing data files into Circos format, (2) deciding how the data are to be shown, (3) constructing a configuration file, either from scratch or from the tutorials on the website, and (4) running Circos to create the PNG and SVG files3. The configuration files have a block format, for instance:

<plot>
file = data.txt
r0   = 0.5r
r1   = 0.6r
</plot>
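To show how such blocks hang together, here is a minimal central configuration file in the same spirit; the file names and radii are illustrative, and the include paths follow the layout of the Circos distribution.

karyotype = data/karyotype/karyotype.human.txt

<ideogram>
<spacing>
default = 0.005r
</spacing>
radius    = 0.90r
thickness = 20p
fill      = yes
</ideogram>

<plots>
<plot>
type = histogram
file = data.txt
r0   = 0.5r
r1   = 0.6r
</plot>
</plots>

<image>
dir  = out
file = genome.png
</image>

<<include etc/colors_fonts_patterns.conf>>
<<include etc/housekeeping.conf>>

The image is then generated with a command such as ./bin/circos -conf circos.conf -outputdir out -outputfile genome.png.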
Running Circos: The primary objective of this project was to develop a GUI for Circos. Circos needs the Perl core modules and, in addition, Config::General, GD, GD::Polyline, List::MoreUtils, Math::Bezier, Math::Round, Math::VecStat, Params::Validate, Readonly, Regexp::Common and Set::IntSpan1. After installing and testing for the correct modules, Circos can be run from the command line, where the circos Perl script takes the central configuration file as input, and the path to the output directory and the output image name can be specified3.

Image maps: HTML image maps associate areas of an image with web links. Any element in the image, such as an ideogram, tick, band, ribbon or highlight, can have an associated URL, so the image map is useful for producing clickable regions within the image. After an SVG or PNG image is produced together with its HTML, parameters are set in the configuration file to produce an image map. To associate the image with its image map, the contents of the image map .html file are placed in the page and the USEMAP attribute is used to tie the image to its map3.

Circos image generation: The configuration files were also modified to produce links in the Circos image, as shown below (figure: Circos image with link tracks).
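For the image-map association described above, the HTML glue looks roughly like the following; the map name and coordinates are illustrative, since Circos writes the <map> block for you.

<img src="circos.png" usemap="#circosmap" alt="Circos plot">
<map name="circosmap">
  <!-- one <area> per clickable element, as emitted by Circos -->
  <area shape="poly" coords="120,35,140,60,110,80" href="http://example.org/chr1" alt="chr1">
</map>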
Utilizing JavaFX 1.3 in the NetBeans 6.9 IDE: This phase of the project was independent research into how best a graphical user interface could be added to Circos for effective visualization. The JavaFX 1.3 SDK, JavaFX 1.3 Production Suite and NetBeans IDE 6.9 for JavaFX 1.3 can be downloaded from Oracle's website for Windows, Mac, Linux and OpenSolaris4. (The latest release, JavaFX 2.0, does not support Mac or Linux.) JavaFX 1.3 is used for developing and building GUI applications, with enhanced support for UI controls, CSS and the like, as well as support for TV applications and a TV emulator. JavaFX ships with the NetBeans IDE, which includes the JavaFX Composer4; NetBeans also provides improved stability and performance, a rich set of UI components and enhancements to the JavaFX editor.

(Figure: the NetBeans IDE with the project open.) As can be seen from the figure, the name of the main project (package) can be specified by clicking File and browsing through the options. For instance, Chaptersix is the main project, which, when clicked, expands into different files. These files are JavaFX Script snippets that add graphics to the applications being developed. The palette on the right side of the NetBeans IDE contains Swing, Stage, Scene, text boxes and other kinds of components and actions that can be applied to the images.

Swing application: Swing components are very powerful and can be developed with JavaFX. (Figure: the Swing application developed using JavaFX.)

Image application: Images can be loaded by specifying a properly formatted URL in the code snippet. (Figure: the Circos image application produced by running the JavaFX code.)

Applying effects to the Circos image: Various kinds of effects, such as hue, brightness and blur, can be applied to the images for clearer visualization. The JavaFX Production Suite, downloaded along with the JavaFX SDK, has tools/plugins for Adobe Photoshop and Adobe Illustrator, so images can be edited in Photoshop and converted to the .FXZ format for easy use in animations and videos within the applications. (Figure: the Circos image produced after applying effects.)
Given below is the snippet used here to apply effects to the Circos genome data visualization tool for cancer:

/*
 * Chapter7.fx
 *
 * v1.0 - Neha Gupta
 * 7/5/2011 - created
 *
 * Applying effects
 */
package chaptersix;

import javafx.stage.Stage;
import javafx.scene.Scene;
import javafx.fxd.FXDNode;
import javafx.scene.effect.*;
import javafx.scene.image.ImageView;

// Path to the Circos image exported from Photoshop as .fxz
var imagePath : String = "file:///depts/research/cancerCircos.fxz";
var circosImage : FXDNode = FXDNode { url : imagePath };

/**
 * @author ngupta
 */
Stage {
    title: "Effects"
    onClose: function () { }
    scene: Scene {
        width: 1000
        height: 1000
        content: [ SetImages(circosImage) ]
    }
}

// Reposition the "circos" node inside the loaded .fxz graphic.
function SetImages(image : FXDNode) : FXDNode {
    var circos : ImageView;
    circos = (image.getNode("circos") as ImageView);
    circos.translateX = -50;
    circos.translateY = -50;
    return image;
}

// Build the effect applied to the image (a motion blur).
function ApplyEffects() : Effect {
    var effectToApply : Effect;
    effectToApply = MotionBlur {
        angle: 45
        radius: 10
    };
    return effectToApply;
}

Using events: The mouse events supported by JavaFX include onMouseMoved, onMouseEntered, onMouseClicked, onMouseExited, onMousePressed, onMouseDragged and onMouseReleased. (Figures: the onMouseClicked, onMouseReleased and onMouseDragged events produced after running the JavaFX snippet files.)
Mouse-over events have also been applied to the Circos image7 (figure: Circos image with mouse-over events).
Frustrations and Future Work: The future work would be to use the latest release, JavaFX 2.0. A key feature of that release is the scene graph, a tree data structure that maintains an internal model of the graphical objects (links, ideograms, ticks, and so on) in the application. It knows which objects or elements of the image should act as clickable regions, because the application is built from Scene, Group, Line, Arc, Circle and similar nodes, and each element of the graph reacts to the full array of mouse events. Developing animations and videos in Circos would also be future work. Overall, the project was well directed; the single frustration was the repetitive work assigned to two people.
References:
1. Circos modules: http://mkweb.bcgsc.ca/dev/circos/tutorials/lessons/configuration/perl_and_modules/
2. Circos readme: http://circos.ca/software/readme/
3. Circos tutorials: http://circos.ca/tutorials/lessons/
4. JavaFX: http://javafx.com
5. JavaFX Production Suite: http://download.oracle.com/javafx/1.3/reference/releasenotes/production-suite-release-notes-1-3.html#2
6. Inkscape: http://silveiraneto.net/2008/11/21/inkscape-and-javafx-working-together/
7. JavaFX - A Beginner's Guide

Project 4: Automating PubMed Searches:
This project was well directed and organized as an independent project. Its main objective was to extract, within a few seconds, all the abstracts for a large number of genes and phenotypes/disorders from PubMed. I researched tools that can automate PubMed searches, and first tried PubCrawler.

(1) PubCrawler - automated retrieval of PubMed and GenBank reports: PubCrawler is free software and can be downloaded from its website, which provides the PubCrawler Perl script and a configuration file. After local installation it is run from the command line. An example run is:

pubcrawler.pl [-add_path /cwd/lib] [-c <config_file>] [-check] [-d <directory>] [-db <database>] [-fullmax <max-docs in full>] [-getmax <max-docs to retrieve>] [-i] [-out <output-file>] [-relentrezdate <relative-entrez-date>] [-s <search_term>] [-viewdays <view-days>]

The above run includes the main command-line flags for automating retrieval of PubMed reports. The -db flag names the database file that PubCrawler uses. The -fullmax flag specifies the maximum number of full-length reports to show per search, and -getmax the maximum number of documents to retrieve per search. The -i flag includes the configuration file in the HTML output, and -out names the HTML output file. The -relentrezdate flag sets the maximum age of a document to be retrieved (1 year, 2 years, 5 years, 'no limit', and so on), -s supplies the PubMed search term2, -viewdays sets the number of days each document remains visible, and -check verifies that everything is set up correctly. All of these command-line parameters, as well as the optional flags, can also be set in the PubCrawler configuration file; if a command-line option and a configuration-file setting conflict, the command-line setting takes priority.

Search terms: Search terms are defined exactly as in the Entrez search system. For instance, a boolean search for the abstracts linking a gene and a phenotype would be pubcrawler.pl -s 'pubmed KRAS [ALL] AND Cancer [ALL]' entered at the command line, which overrides all other queries in the configuration file2. To automate multiple queries through this program, I wrote a Perl script that builds a new configuration file containing such search queries for any number of genes and any number of phenotypes (sketched below).
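A minimal sketch of that generator. The per-line query syntax is an assumption modeled on the -s flag above, so the exact format should be checked against the PubCrawler configuration-file documentation; the input files are hypothetical one-item-per-line lists.

use strict;
use warnings;

# Usage: perl make_queries.pl genes.txt phenotypes.txt >> pubcrawler.config
my ($gene_file, $pheno_file) = @ARGV;

my @genes  = read_list($gene_file);
my @phenos = read_list($pheno_file);

# One boolean PubMed query per gene/phenotype pair, in the same form
# accepted by the -s flag (assumed valid as a configuration line too).
for my $gene (@genes) {
    for my $pheno (@phenos) {
        print "pubmed $gene [ALL] AND $pheno [ALL]\n";
    }
}

sub read_list {
    my ($file) = @_;
    open my $fh, '<', $file or die "Cannot open $file: $!";
    chomp(my @items = <$fh>);
    return grep { length } @items;   # skip blank lines
}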
Optional variables: Some of the optional variables include (1) work_dir, the working directory for PubCrawler; (2) mail, an e-mail address to which results are sent after each PubCrawler run; (3) format, the format for reports from PubMed and GenBank ('DocSum', 'Brief', 'Abstract', 'FASTA', and so on); (4) time_out, the time in seconds to wait for internet responses, defaulting to 180; (5) proxy, a proxy server for the internet connection; and (6) indent, which indents PubCrawler comments by the given number of pixels to align them with Entrez output.

Drawback: The main disadvantage of this approach is time consumption. A single query takes around 5 minutes to return all the research papers with all fields (author, primary accession number, publication date, etc.) when no limit is placed on the Entrez release date. I also tried running around 120 queries, corresponding to 120 genes and their implicated phenotypes/disorders; the complete run took around 10 hours, which is quite inefficient.

(2) Using NCBI EUtils: The EUtils are very powerful and efficient at grabbing research articles from the PubMed database. Different URLs perform different functions. The following link exemplifies the URL that retrieves the abstract for a single PubMed ID:

http://www.ncbi.nlm.nih.gov/entrez/queryd.fcgi?db=pubmed&cmd=Retrieve&dopt=Abstract&list_uids=18461658&itool=pubmed_docsum

Similarly, the URL to display PMIDs in XML retrieval mode is:

http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&id=11748933,11700088&retmode=xml

Creating URLs for larger gene and phenotype lists: I modified the EUtils URLs to fetch the abstracts, and the PubMed IDs, for larger lists of genes and phenotypes. The new URL for getting the PubMed IDs from a list of genes is:

http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&retstart=0&retmax=10000000&usehistory=y&retmode=xml&term=Amelogenesisimperfecta and (gene1 or gene2 or gene3 or gene4 or .......)

and the new URL for getting abstracts is:

http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&id=1,2,3&retmode=html&rettype=abstract

Perl scripts: With these URL templates in hand, I wrote a Perl script that, given a list of genes and a list of phenotypes/disorders, builds the URL that fetches from PubMed the IDs of all research papers associated with the lists. The content behind the URL, when clicked manually, is in XML format, so I extended the script to grab the URL's content automatically and parse the XML into a Perl hash-of-hashes-of-arrays, from which the list of PubMed IDs is printed (a sketch follows below).
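A minimal sketch of that step, assuming LWP::Simple for the HTTP fetch and a simple regex over ESearch's <Id> elements instead of a full XML parser; the gene and phenotype terms are illustrative.

use strict;
use warnings;
use LWP::Simple qw(get);
use URI::Escape qw(uri_escape);

# Illustrative inputs; in the real script these come from the gene/phenotype lists.
my @genes     = qw(KRAS TP53 BRAF);
my $phenotype = 'Cancer';

# Build the boolean ESearch term: phenotype AND (gene1 OR gene2 OR ...).
my $term = sprintf '%s AND (%s)', $phenotype, join(' OR ', @genes);

my $url = 'http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi'
        . '?db=pubmed&retmax=10000000&retmode=xml&term=' . uri_escape($term);

my $xml = get($url) or die "ESearch request failed\n";

# Pull the PMIDs out of the <Id>...</Id> elements.
my @pmids = $xml =~ m{<Id>(\d+)</Id>}g;
print "$_\n" for @pmids;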
With the list of PubMed IDs, the abstract URL was constructed, its content was again grabbed automatically using Perl modules and their functions, and the returned XML was parsed to print the list of abstracts to standard output. However, there is a URL limit on the number of search terms: around 200 boolean terms for genes and implicated diseases. The maximum number of abstracts the script currently retrieves is 800.

Frustrations and Future Work: This project was very challenging, and the key frustration, which is also the future work, is to remove this URL limit so that the query can take an arbitrarily long gene and phenotype list and still grab the contents of the page automatically. Changing the parameters, for instance restricting article retrieval by author name, Entrez release date or accession number, would also be future work.

References:
1. Integrating automated literature searches and text mining in biomarker discovery: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2956392/
2. PubCrawler and EUtils: http://www.my-whiteboard.com/how-to-automate-pubmed-search-using-perl-php-or-java/

Project 5: 1000 Genomes and BioGPS:
This project was organized as a team project. First, the genotype data for the 1,000 individuals was downloaded for hg18 in VCF format. The main idea behind this approach is to take the union of the 1000 Genomes data with the company's 50 individual genomes and analyze the union data with PLINK association studies. To proceed, three site sets were generated: the sites unique to the 1000 Genomes data that are not called in the 50-individual data, the sites unique to the 50-individual data that are not called in the 1000 Genomes data, and the sites called in both (figure: Venn diagram of the sites unique to the 1000 Genomes data, the sites unique to the 50 individual genomes, and the sites common to both).

The list of approximately two million sites unique to the 1000 Genomes data is used in this work. The key question here was how to get the genotypes at these two million sites. To go forward, I wrote a Perl script that queries the company's internal database to grab the genotypes at these two million sites for each of the fifty subject genome IDs (a sketch of the idea follows below).
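Since the internal database schema is proprietary, the following is only a shape-of-the-solution sketch with invented table and column names; it fetches sites in batches rather than one query per site, which relates to the slowdown described below.

use strict;
use warnings;
use DBI;

# Hypothetical connection and schema; the real table and columns differ.
my $dbh = DBI->connect('dbi:mysql:database=genomes;host=localhost',
                       'user', 'password', { RaiseError => 1 });

# Fetch genotypes for a window of sites at once instead of one query per site.
my $sth = $dbh->prepare(
    'SELECT chrom, pos, genotype FROM subject_genotypes
     WHERE subject_id = ? AND chrom = ? AND pos BETWEEN ? AND ?'
);

for my $subject_id (1 .. 50) {                           # the fifty subject genome IDs
    $sth->execute($subject_id, 'chr1', 1, 50_000_000);   # illustrative window
    while (my ($chrom, $pos, $gt) = $sth->fetchrow_array) {
        print join("\t", $subject_id, $chrom, $pos, $gt), "\n";
    }
}
$dbh->disconnect;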
Simultaneously, I did data mining with the BioGPS gene portal, whose database holds expression charts for genes in humans, mice, rats and other species. I researched the correlation-parameter settings that display the correlation between the expression charts of two genes, and downloaded the expression data for around 44,000 probe sets from the website in order to build a matrix of expression vectors against the company's gene expression data.

Frustrations and Future Work: The initial Perl script was slow on these sites because it queried the database once for each of the two million sites, so optimizing the code was important. My future work would be to convert the union data to the PED and MAP input formats supported by PLINK and go forward with the association studies.