13 August 2011
Supervisor's statement for Neha Gupta
In her summer 2011 co-op, Neha worked in Knome’s Research group, scripting for live genome analysis
projects and scouting bioinformatics resources for potential use in future development. I supervised her,
and she also worked closely with another NU co-op and with two bioinformaticians in the Operations
group. On some projects, she also worked with members of the Development group.
In just a few weeks with Knome, Neha worked on many different projects. She wrote several key scripts
for Knome, including one that helps streamline our literature search when finishing a project analysis. She
also researched, and wrote code to help parse/prepare files for, bioinformatics resources such as PLINK.
We at Knome enjoyed having her work with us, and wish her very well!
Sincerely,
Nathaniel Pearson
Director of Research, Knome, Inc.
Personal Statement:
During the summer of 2011, I, Neha Gupta, had the privilege of working in and experiencing the business world at Knome, Inc., located in Cambridge, Massachusetts. My interests and the research skills I acquired during my bioinformatics coursework at Northeastern University fit very well with what the company does.
Knome was founded in 2007 and was the first organization to deliver human whole-genome analysis and interpretation services; it has sequenced and analyzed more human genomes than any other organization to date. Its whole-genome and exome analysis services are based on automating the process of finding shortlists of genetic variants implicated in a disease, particularly cancer, studying their drug response, and summarizing their functional importance. Knome provides detailed genome and exome interpretation to pharmaceutical and clinical researchers, aiding drug discovery and development pipelines.
I had the opportunity to work as a Bioinformatics Research Intern at Knome for three months. During this period there were many interesting ongoing projects. From the start of my internship I worked on multiple projects and also contributed novel ideas to assist the research, such as generating heat maps linking variant allele frequencies with drug response, and developing a GUI for the Circos genome data visualization tool for cancer. The challenges arising in each project kept me encouraged and motivated, which in turn enhanced my skills. The brainstorming required to find a solution when a problem reached a dead end, especially during team discussions, was challenging and a positive experience.
Project 1: Interpreting the effects of variants - Condel scores:
Background: Single nucleotide polymorphisms (SNPs) are the most common type of genetic variation in humans. SNPs make up about 90% of all human genetic variation and occur every 100 to 300 bases along the 3-billion-base human genome, but establishing the association of these variants with a particular disease is laborious and time consuming. Condel (CONsensus DELeteriousness score of single nucleotide variants) computes a weighted average of the scores of five known computational tools aimed at classifying non-synonymous SNVs as likely deleterious or likely neutral. These tools are SIFT (Sorting Intolerant From Tolerant), PolyPhen-2, MAPP (Multivariate Analysis of Protein Polymorphism), Pfam LogRE-value, and Mutation Assessor [1].
Running Condel:
There are three ways of running the Condel program. The Ensembl Variation database version 62 contains precomputed SIFT and PolyPhen scores for every possible mutation in the human proteome. Condel was run using the Ensembl Variation Perl API: the Variant Effect Predictor takes in a large list of variants in VCF or pileup format and predicts Condel scores based on the SIFT and PolyPhen scores precomputed in the Ensembl database [7]. The input file's genomic coordinates must match the reference genome used for that Ensembl database version (v62 corresponds to hg19). It is run through a command-line interface, and the command-line options can also be defined in a configuration file. An optional registry file points the queries at the right databases (the MySQL Ensembl Homo sapiens core and variation databases are essential).
An example registry file is:
use Bio::EnsEMBL::DBSQL::DBAdaptor;
use Bio::EnsEMBL::Variation::DBSQL::DBAdaptor;
use Bio::EnsEMBL::Registry;

# Connection to the Ensembl core database
Bio::EnsEMBL::DBSQL::DBAdaptor->new(
    '-species' => 'Homo sapiens',
    '-group'   => 'core',
    '-port'    => 3306,
    '-host'    => 'kensembl01',
    '-user'    => 'ngupta',
    '-pass'    => 'abcd',
    '-dbname'  => 'homo_sapiens_core_62_37g'
);

# Connection to the Ensembl variation database
Bio::EnsEMBL::Variation::DBSQL::DBAdaptor->new(
    '-species' => 'Homo sapiens',
    '-group'   => 'variation',
    '-port'    => 3306,
    '-host'    => 'kensembl01',
    '-user'    => 'ngupta',
    '-pass'    => 'abcd',
    '-dbname'  => 'homo_sapiens_variation_62_37g'
);
All adaptors are stored using the Bio::EnsEMBL::Registry module, which is then used to fetch any adaptor required. Bio::EnsEMBL::DBSQL::DBAdaptor connects to the database 'homo_sapiens_core_62_37g' through the DBConnection class and creates a new object with the species, group, and other flags; that object is later retrieved through the Bio::EnsEMBL::Registry module. Likewise, Bio::EnsEMBL::Variation::DBSQL::DBAdaptor in the second half of the registry file provides a connection to the Ensembl variation database 'homo_sapiens_variation_62_37g' by creating a new object.
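To illustrate how a script then consumes such a registry file, here is a minimal sketch using the Ensembl Perl API; the registry path is hypothetical, and the adaptor types shown are just examples of what a Condel/VEP pipeline pulls from the core and variation databases:

use strict;
use warnings;
use Bio::EnsEMBL::Registry;

# Load all DBAdaptors declared in the registry file (hypothetical path)
Bio::EnsEMBL::Registry->load_all('/path/to/registry.pm');

# Fetch adaptors by species/group/type from the loaded registry
my $gene_adaptor = Bio::EnsEMBL::Registry->get_adaptor(
    'Homo sapiens', 'core', 'Gene'
);
my $vf_adaptor = Bio::EnsEMBL::Registry->get_adaptor(
    'Homo sapiens', 'variation', 'VariationFeature'
);

printf "Got %s and %s\n", ref($gene_adaptor), ref($vf_adaptor);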
An output of Condel scores was produced and analysed. It assesses the effects of missense variants, nonsense-mediated-decay variants, and read-through variants. A missense mutation changes an amino acid during translation, thereby altering the protein synthesized. Nonsense-mediated decay (NMD) is an mRNA surveillance mechanism that detects nonsense mutations and prevents the expression of erroneous proteins; following transcription, precursor mRNA undergoes pre-mRNA processing, and aberrant splicing (for example, due to the large intron sizes in eukaryotes) can introduce a nonsense codon (UAA, UAG, UGA) within the open reading frame. A frameshift mutation is caused by insertions or deletions of nucleotides: because codons are read in triplets, an indel shifts the reading frame (the grouping of the codons), so an altered protein is synthesized after translation. A Condel score near 1 indicates a deleterious consequence of a variant allele, whereas a score near 0 indicates a neutral variant. This part of the project was organized as a team project. The key question was how accurate the Condel scores produced from SIFT and PolyPhen alone are, and how they compare with Condel scores that take all five tools into account. To answer this question, I researched the MAPP program independently.
MAPP: MAPP reads an alignment of protein sequences and an alignment tree to calculate the predicted impact of each potential SNP at each position [6]. These predictions are based entirely on a set of scales of physicochemical properties, each assigning a numeric value to every amino acid. This open-source program takes two input files: one containing the alignment in FASTA format, and a text file containing the alignment tree in parenthesis representation, with branch lengths [6]. The alignment file should contain properly aligned sequences with gaps indicated by hyphens, and the tree file must be formatted in the way the MAPP program accepts. The program runs on the Java platform and benefits from a large number of orthologs. The alignment and tree files were obtained by installing ClustalW locally; other options such as Semphy and ClustalX can also be used to produce the multiple sequence alignment [6].
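As a sketch of this step, the two commands below first align a FASTA file of orthologous sequences with ClustalW (which can also write out its guide tree) and then hand both files to MAPP. File names are hypothetical, and the MAPP flags are my reading of its README, so they should be verified against the local installation:

clustalw -INFILE=orthologs.fasta -ALIGN -OUTPUT=FASTA -OUTFILE=orthologs_aln.fasta -NEWTREE=orthologs.dnd
java -jar MAPP.jar -f orthologs_aln.fasta -t orthologs.dnd -o mapp_scores.txt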
I ran ClustalW locally to perform a multiple sequence alignment of a set of six sample human sequences, generating an alignment FASTA file and a tree file as output. The physicochemical properties used by MAPP are hydropathy, polarity, charge, volume, free energy in alpha-helix conformation, and free energy in beta-strand conformation [2]. MAPP produces an output table (openable in Excel) with the following column headers:
1. Position - position of the column in the alignment
2. Column score - MAPP median score for this alignment column
3. Column P-value - P-value interpretation of the column's MAPP score
4. Alignment - amino acids observed in that column, in alignment order
5. Gap weight - weighted fraction of gaps in the column
6. Over gap weight threshold - whether the column exceeds the gap weight threshold (MAPP does not calculate scores if the gap weight is greater than 50%)
7. P-values corresponding to the significance of each physicochemical property
8. Columns of MAPP scores for each possible amino acid variant
9. P-value interpretations of the MAPP scores, predicting the impact of each amino acid variant
10. Good amino acids at this position
11. Bad amino acids - amino acids deleterious at this position
[Figure: example of MAPP output.]
The UCSC Genome Browser [5] provides tracks with information such as chromosome, allele coordinates, and base-pair position for tRNA genes, miRNA genes, regulatory genes, and expressed genes, among others. These files come in various formats, such as Browser Extensible Data (BED), BigBed, and Wiggle. I researched and downloaded the BED tracks in order to integrate a BED parser into one of the company's tools.
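As a minimal sketch of such a parser (my own illustration, not the company tool), the script below reads the first six standard BED columns; track and browser header lines are skipped, and the track file name is hypothetical:

use strict;
use warnings;

# Parse a UCSC BED file into an array of feature hashes.
sub parse_bed {
    my ($path) = @_;
    open my $fh, '<', $path or die "Cannot open $path: $!";
    my @features;
    while (my $line = <$fh>) {
        chomp $line;
        next if $line =~ /^(?:track|browser|#)/;   # skip header lines
        next if $line =~ /^\s*$/;                  # skip blank lines
        my ($chrom, $start, $end, $name, $score, $strand) = split /\t/, $line;
        push @features, {
            chrom  => $chrom,
            start  => $start,    # 0-based start, per the BED spec
            end    => $end,      # end coordinate (exclusive)
            name   => $name,
            score  => $score,
            strand => $strand,
        };
    }
    close $fh;
    return \@features;
}

my $features = parse_bed('tRNA_genes.bed');
printf "%d features parsed\n", scalar @$features;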
Future Work: The decision to divide the research on Condel and the other tools (Pfam LogRE values, MAPP, Mutation Assessor) between us was a good one, since the work was delegated and no time was wasted on the same task being assigned to two people. Future work on this project would be to integrate SIFT and PolyPhen with MAPP, Pfam LogRE, and Mutation Assessor, which I believe is a more accurate way of predicting Condel scores because it draws on all five computational tools. MAPP, Pfam LogRE, and Mutation Assessor are more in-depth and accurate in predicting the effects of all types of variants in a human proteome [4,6], so a weighted score of these five tools should be more accurate than a Condel score based only on SIFT and PolyPhen [1]. However, the key frustration in MAPP is automating the generation of the alignment and tree files on a whole-genome basis. A single MAPP run produces scores for only one alignment file and one alignment tree file, so for the entire genome of an individual the process has to be automated to generate MAPP alignment and tree files for all the orthologous proteins. It is also memory-intensive and difficult to align all the orthologous proteins in one go to produce a single alignment file [6]. How Pfam LogRE works on a genome-wide basis would be an interesting concept to learn about.
References:
1. Condel – Consensus deleteriousness score of missense SNVs
http://bg.upf.edu/group/projects/condel.php
2. Physicochemical constraint violation by missense substitutions mediates impairment of protein
function and disease severity.
http://mendel.stanford.edu/SidowLab/pdfs/2005StoneSidow.pdf
3. Polyphen
http://www.ncbi.nlm.nih.gov/pubmed/21412949
http://www.ncbi.nlm.nih.gov/pubmed/20642364
4. Pfam - http://cancerres.aacrjournals.org/content/67/2/465.long
5. Encode - http://genome.ucsc.edu/ENCODE/
6. MAPP
http://mendel.stanford.edu/sidowlab/downloads/MAPP/index.html
http://www.plosgenetics.org/article/info%3Adoi%2F10.1371%2Fjournal.pgen.1000968
7. Ensembl- http://useast.ensembl.org/info/docs/variation/index.html
Project 2: Pedigree Analysis and Estimation of Inbreeding:
This project was directed as an independent project; the aim was to obtain kinship estimates for the company's subject data.
PLINK for GWAS: PLINK is an open-source whole-genome association analysis program written in C/C++, useful for performing a range of large-scale analyses in a computationally efficient manner. Its main focus is the analysis of genotype (SNP) and phenotype data. It is integrated with gPLINK (a PLINK GUI) and Haploview to support subsequent visualization and annotation of the results [1]. PLINK v1.07 is freely available for download for any operating system from its website.
Running PLINK: PLINK is run through a command-line interface. Conveniently, the options of the previous run are stored in a log file, so the --rerun option is useful for repeating or extending a previous analysis using that log file as a base.
Input files: PLINK takes two input files, which can be specified together as a filename prefix with the --file option or, alternatively, given separately as PED and MAP files. For instance:
./plink --ped data.ped --map data.map
(1) PED: the PED file stores pedigree information for each individual in a specific format accepted by PLINK. For larger files (100K+ SNPs), a binary version of the PED file helps increase the program's efficiency. The PED file is white-space delimited; the first six columns, which hold the pedigree information for an individual, are mandatory: Family ID, Individual ID, Paternal ID, Maternal ID, Sex (1 = male; 2 = female; other = unknown), and Phenotype [1]. The phenotype can be either a quantitative trait or an affection status column; by default, affection status is coded -9 = missing, 0 = missing, 1 = unaffected, 2 = affected [1]. Genotypes begin at column 7 and should be white-space delimited, specified by the characters 1, 2, 3, 4 or A, G, C, T, with 0 for missing genotypes. All SNP markers, whether haploid or not, must be biallelic (missing genotype: 0 0) [1].
A basic PED file can omit the family ID, parent IDs, sex, and phenotype columns by specifying the corresponding --no-fid, --no-parents, --no-sex, and --no-pheno options on the command line. To save space and time, a binary PED file (*.bed) can be created; it stores the pedigree/phenotype information in a separate file (*.fam) and creates an extended map file (*.bim) [1].
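For instance, converting a text PED/MAP pair into the binary format looks like this (a sketch; the data prefix is hypothetical):

./plink --file data --make-bed --out data

This writes data.bed, data.bim, and data.fam, which later runs can load with --bfile data.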
(2) MAP files:
MAP files hold the genetic mapping information for the SNP markers in the PED file. Each line of a MAP file describes a single marker and contains four columns: chromosome (1-22, X, Y, or 0 if unplaced), rs# or SNP identifier, genetic distance in morgans, and base-pair position [1]. Alternatively, a MAP file that excludes the genetic distance column can be used. SNP identifiers can contain any characters except whitespace and *. To exclude a SNP from the analysis, set its base-pair position column to a negative value. The MAP file must contain exactly as many markers as the PED file. PLINK's internal numeric codes for the other chromosome types are X = X chromosome (23), Y = Y chromosome (24), XY = pseudo-autosomal region of X (25), and MT = mitochondrial (26) [1].
Inbreeding Analysis: Among the various analyses PLINK performs, a useful one is the calculation of inbreeding coefficients (F) based on the observed versus expected number of homozygous genotypes. Inbreeding is a good measure of kinship within a group of individuals genotyped at a large number of SNPs. Six individuals belonging to six different families were considered for estimating inbreeding coefficients. The PED and MAP files were generated as follows:

PED file:
1 1 0 0 1 1 A A G T
2 1 0 0 1 1 A C T G
3 1 0 0 1 1 C C G G
4 1 0 0 1 2 A C T T
5 1 0 0 1 2 C C G T
6 1 0 0 1 2 C C T T

MAP file:
1 snp1 0 1
1 snp2 0 2
To estimate inbreeding among these subjects, PLINK is run on these input PED and MAP files as ./plink --ped data.ped --map data.map --het, which creates an output file plink.het, exemplified as follows:
FID IID O(HOM) E(HOM) N(NM) F
1 1 1 1.069 2 -0.07463
2 1 0 1.069 2 -1.149
3 1 2 1.069 2 1
where FID = family ID, IID = individual ID, O(HOM) = observed number of homozygotes, E(HOM) = expected number of homozygotes, N(NM) = number of non-missing genotypes, and F = inbreeding coefficient estimate.
To analyze a larger number of SNP markers, I downloaded the HapMap genotype and pedigree data for the Yoruba trio Y117 from the HapMap website, considering only a thousand SNP markers per individual. I generated a single genotype file for each individual and wrote Perl to automate producing the .ped and .map PLINK files.
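A minimal sketch of that conversion is below. It assumes a simplified per-individual genotype file with one marker per line (rsID, chromosome, position, and two alleles, tab-separated); the real HapMap files need more bookkeeping, and all file names and pedigree fields here are hypothetical:

use strict;
use warnings;

my ($geno_file, $out_prefix) = @ARGV;   # e.g. NA18501.txt yri_subject
open my $in,  '<', $geno_file        or die $!;
open my $ped, '>', "$out_prefix.ped" or die $!;
open my $map, '>', "$out_prefix.map" or die $!;

my (@genotypes, @markers);
while (<$in>) {
    chomp;
    my ($rsid, $chrom, $pos, $a1, $a2) = split /\t/;
    push @markers,   [$chrom, $rsid, 0, $pos];   # genetic distance left as 0
    push @genotypes, "$a1 $a2";
}
close $in;

# One MAP line per marker: chromosome, rsID, genetic distance, bp position
print $map join(' ', @$_), "\n" for @markers;

# One PED line: FID IID PAT MAT SEX PHENO, then the genotype columns
print $ped join(' ', 1, 1, 0, 0, 1, 1, @genotypes), "\n";

close $ped;
close $map;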
Pairwise IBD estimation: identity-by-descent analysis is useful for detecting pairs of individuals who look too similar to each other, that is, more similar than expected by chance in a random sample. Genome-wide IBD can be calculated provided a large number of SNPs is available (ideally 100K or more) [1]. It is run by specifying the --genome option along with the input files. The output for kinship estimation is:
FID1 IID1 FID2 IID2 RT EZ Z0 Z1 Z2 PI_HAT PHE DST PPC RATIO
1 14 1 11 PO 0.5 0.00 1.00 0.00 0.5 1 0.5 NA NA
1 14 1 1 OT 0 0.00 0.9851 0.014 0.5075 0 0.75 0.7603 NA
1 13 6 1 UN NA 1.00 0 0 0 1 0.5 0.078 0.0
where FID1 = family ID of the first individual, IID1 = individual ID of the first individual, FID2 = family ID of the second individual, IID2 = individual ID of the second individual, RT = relationship type, EZ = expected IBD sharing given the PED file, Z0 = P(IBD=0) (probability of sharing zero alleles), Z1 = P(IBD=1) (one allele shared), Z2 = P(IBD=2), PI_HAT = proportion IBD, PHE = phenotype code for the pair, DST = IBS distance, PPC = IBS binomial test, and RATIO = ratio of HETHET to IBS0 SNPs.
KING tool for GWAS: The KING tool works in a similar fashion to PLINK. The advantage of KING over PLINK is that it can flag pedigree errors by estimating kinship coefficients for all pairwise relationships [3]. Moreover, its kinship estimates for closely related individuals are highly accurate. In comparison to PLINK, the tool is very fast and robust to population structure, and it can take complicated pedigrees into account.
Installing KING: KING version 1.2 can be freely downloaded from its website for Linux, Windows, Mac, and Ubuntu. KING is run through a command-line interface and scales to GWAS population samples as large as >10,000 individuals [3].
Input files: the input files include a data file (-d), a pedigree file (-p), and a map file (-m) in MERLIN format, or, for increased efficiency, a binary-format file (-b). A command-line run looks like ./king -b ex.bgeno.gz --kinship --related. The PED and MAP file formats of PLINK and KING are the same; KING, however, additionally reads a data file with one row per data item in the pedigree file, indicating each item's type:
<contents of data.dat>
A some_disease
T some_trait
M some_marker
M marker1
….
</contents of data.dat>
Running KING: the pedigree information file and the genotype files were downloaded for the trio NA19238, NA19239, and NA19240 from the HapMap website. I then generated three genotype files, one per individual, by aggregating the genotypes from all the downloaded per-chromosome files, and created a map file for each individual from the pedigree file. To analyze over a million SNP sites in these individuals, I wrote a Perl script that converts their genotypes into an input pedigree file with a million markers. Similarly, I used Perl to convert the trio's map file into the MAP format KING accepts, and automated producing a DAT file for any number of markers given as an argument.
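That last step is small enough to show in full; a sketch under the assumption that markers are simply named marker1..markerN, following the data.dat layout shown above:

use strict;
use warnings;

# Usage: perl make_dat.pl <number_of_markers> > ex.dat
my ($n_markers) = @ARGV;
die "Usage: $0 <number_of_markers>\n" unless defined $n_markers;

# One 'M' (marker) row per data item in the pedigree file
print "M marker$_\n" for 1 .. $n_markers;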
The KING tool was run on these three input files to produce a king.kin file holding all the kinship scores of the trio. The king.bgeno file is the binary form of the .ped, .map, and .dat files, and can be produced by running:
Prompt> ./king -d ex.dat -p ex.ped -m ex.map --binary
KING is then run on the binary file to get the final kinship estimates for the three individuals from the HapMap data: Prompt> ./king -b ex.bgeno --kinship [3]. The output king.kin file is as follows:
FID ID1 ID2 N_SNP Z0 Phi HetHet IBS0 Kinship Error
117 1 2 494819 0.000 0.2500 0.278 0.0004 0.2486 0
117 1 3 498224 0.000 0.2500 0.268 0.0002 0.2446 0
117 2 3 496116 1.000 0.0000 0.206 0.1197 -0.0306 0
This is the within-family relationship case, where FID = family ID for the pair, ID1 = individual ID of the first individual of the pair, ID2 = individual ID of the second individual of the pair, N_SNP = the number of SNPs with no missing genotypes in either individual, Z0 = Pr(IBD=0), Phi = kinship coefficient specified by the pedigree, HetHet = proportion of SNPs at which both individuals are heterozygous, IBS0 = proportion of SNPs with zero identity-by-state (an estimate of the difference between individuals), Kinship = estimated kinship coefficient from the SNP data, and Error = difference between the estimated and specified kinship coefficients (1 for error, 0.5 for warning).
A negative kinship coefficient indicates an unrelated pair. Estimated kinship coefficients in the ranges >0.354, [0.177, 0.354], [0.0884, 0.177], and [0.0442, 0.0884] correspond to duplicate/MZ-twin, 1st-degree, 2nd-degree, and 3rd-degree relationships, respectively. The tool's main disadvantage is relationship inference for more distant relationships.
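Those thresholds translate directly into a small classifier; a sketch using the cutoffs quoted above:

use strict;
use warnings;

# Map an estimated kinship coefficient to a relationship class,
# using the cutoffs from the KING documentation quoted above.
sub classify_kinship {
    my ($phi) = @_;
    return 'duplicate/MZ twin' if $phi >  0.354;
    return '1st-degree'        if $phi >= 0.177;
    return '2nd-degree'        if $phi >= 0.0884;
    return '3rd-degree'        if $phi >= 0.0442;
    return 'unrelated';
}

print classify_kinship(0.2486), "\n";   # 1st-degree (the parent-offspring pair above)
print classify_kinship(-0.0306), "\n";  # unrelated (the parent-parent pair above)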
Frustration and Future Work: the main frustration during this work was optimizing the code for around two million variant sites. Future work would be to plot the kinship scores from both PLINK and KING and analyze the correlation between the two tools, to extend the genome-wide association studies to 2,000 to 3,000 individuals, and to use gPLINK and Haploview for more detailed analyses.
References:
1. PLINK - http://pngu.mgh.harvard.edu/~purcell/plink/dosage.shtml#format
2. Visualizing sample relatedness in a GWAS using PLINK and R
http://gettinggeneticsdone.blogspot.com/2009/10/visualizing-sample-relatedness-in-gwas.html
3. King- http://people.virginia.edu/~wc9c/KING/manual.html
Project 3: Developing Circos and JavaFX:
The first half of the project was organized and directed as a team project.
Background: Circos is a software package for visualizing genomic data and information. It presents data in a circular layout, which makes it ideal for exploring relationships between objects or positions. It is easily automated since it is controlled by plain-text configuration files. Written in Perl, it produces bitmap (PNG) and vector (SVG) images, plus HTML image maps, from plain-text configuration and input files [2]. Circos is free software; the circos-0.54 and circos-tools-0.15 packages, configuration files, and data files were downloaded from its website. The configuration files support drawing ideograms, highlights, tick marks, links and ribbons, scatter plots, line plots, histograms, heat maps, image maps, etc. [3].
Usually the image-generation process is driven by a central configuration file, which in turn imports other configuration files such as global color and font settings. The steps involved are (1) parsing data files in Circos format, (2) deciding how the data are to be shown, (3) constructing a config file, either from scratch or from the tutorials on the website, and (4) running Circos to create the PNG and SVG files [3].
The configuration files have a block format. For instance:
<plot>
file = data.txt
r0   = 0.5r
r1   = 0.6r
</plot>
Running Circos: the primary objective of this project was to develop a GUI for Circos. Circos needs the Perl core modules and, in addition, Config::General, GD, GD::Polyline, List::MoreUtils, Math::Bezier, Math::Round, Math::VecStat, Params::Validate, Readonly, Regexp::Common, and Set::IntSpan [1]. After installing and testing for the correct modules, Circos can be run from the command line: the circos Perl script takes the central configuration file as input, and the output directory and output image name can also be specified [3].
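A typical invocation looks like the following (a sketch; the paths are hypothetical, and the flag names follow the Circos tutorials):

perl bin/circos -conf etc/circos.conf -outputdir out -outputfile circos_image.png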
Image maps: HTML image maps associate areas of the image with web links. Any element in the image, such as an ideogram, tick, band, ribbon, or highlight, can have an associated URL, so the image map produces clickable regions within the image. After an SVG or PNG image and its HTML are produced, parameters are set in the configuration file to produce an image map. To associate the image with its image map, the contents of the image map .html file are placed in the page and the USEMAP attribute is used to link the image to its map [3].
Circos image generation: the configuration files were also modified to produce links in the Circos image, as shown below:
[Figure: Circos image with links.]
Utilizing JavaFX 1.3 in the NetBeans 6.9 IDE: this phase of the project was independent research on how best a graphical user interface could be added to Circos for effective visualization. The JavaFX 1.3 SDK, JavaFX 1.3 Production Suite, and NetBeans IDE 6.9 for JavaFX 1.3 can be downloaded from Oracle's website for Windows, Mac, Linux, and OpenSolaris [4]. (The latest release, JavaFX 2.0, does not support Mac or Linux.) JavaFX 1.3 is used for developing and building GUI applications, with enhanced support for UI controls, CSS, etc., as well as support for TV applications and a TV emulator. JavaFX comes with the NetBeans IDE, which includes the JavaFX Composer [4]. NetBeans also provides improved stability and performance, a rich set of UI components, and enhancements to the JavaFX editor.
[Figure: NetBeans IDE showing the Chaptersix JavaFX project.]
As can be seen in the figure, the name of the main project (package) is specified by clicking File and browsing through the options. For instance, Chaptersix is the main project, which when clicked expands into different files. These files are snippets of JavaFX Script that add graphics to the applications being developed. The palette on the right side of the NetBeans IDE contains Swing, Stage, Scene, text boxes, and other components and actions that can be applied to the images.
Swing application: Swing components are very powerful and can be driven from JavaFX. For instance, a Swing application developed using JavaFX is shown below:
[Figure: Swing application built with JavaFX.]
Image application: images can be loaded by specifying the URL in the proper format in the code snippet. The following figure shows the Circos image application produced by running the JavaFX code:
[Figure: Circos image displayed by the JavaFX application.]
Applying effects to the Circos image:
Various effects, such as hue, brightness, and blur adjustments, can be applied to images for clearer visualization. The JavaFX Production Suite, downloaded along with the JavaFX SDK, has tools and plugins for Adobe Photoshop and Adobe Illustrator, so images can be edited in Photoshop and converted to the .FXZ format for easy use in animations and videos within applications. The Circos image produced after applying effects is below:
[Figure: Circos image after applying effects.]
Given below is the snippet used here to apply the effects to the Circos genome data visualization tool for cancer:
/*
 * Chapter7.fx
 *
 * v1.0 - Neha Gupta
 *
 * 7/5/2011 - created
 *
 * Applying effects
 */
package chaptersix;

import javafx.stage.Stage;
import javafx.scene.Scene;
import javafx.fxd.FXDNode;
import javafx.scene.Group;
import javafx.scene.effect.*;
import javafx.scene.image.ImageView;
import javafx.scene.image.Image;
import javafx.scene.paint.Color;

// Load the Circos image exported from Photoshop as an .fxz archive
var imagePath : String = "file:///depts/research/cancerCircos.fxz";
var circosImage : FXDNode = FXDNode {
    url: imagePath;
};

/**
 * @author ngupta
 */
Stage {
    title: "Effects"
    onClose: function () { }
    scene: Scene {
        width: 1000
        height: 1000
        content: [
            SetImages(circosImage)
        ]
    }
}

// Position the named "circos" node and attach the effect to it
function SetImages(image : FXDNode) : FXDNode {
    var circos : ImageView;
    circos = (image.getNode("circos") as ImageView);
    circos.translateX = -50;
    circos.translateY = -50;
    circos.effect = ApplyEffects();
    return image;
}

// Build the effect to apply: a 45-degree motion blur
function ApplyEffects() : Effect {
    var effectToApply : Effect;
    effectToApply = MotionBlur {
        angle: 45
        radius: 10
    }
    return effectToApply;
}
Using events:
The mouse event handlers supported by JavaFX include onMouseMoved, onMouseEntered, onMouseClicked, onMouseExited, onMousePressed, onMouseDragged, and onMouseReleased. The images below show onMouseClicked, onMouseReleased, and onMouseDragged events produced by running the JavaFX snippet files:
[Figure: onMouseClicked, onMouseReleased, and onMouseDragged examples.]
Mouse-over events have also been applied to the Circos image, as illustrated below [7]:
[Figure: Circos image with mouse-over events.]
Frustrations and Future Work: future work would be to use the latest release, JavaFX 2.0. A key feature of that release is the scene graph: a tree data structure that maintains an internal model of the graphical objects (links, ideograms, ticks, etc.) in the application. It knows which objects or elements of the image are to be used as clickable regions, because the application can be built from Scene, Group, Line, Arc, Circle, etc., and each element of the graph reacts to the full array of mouse events. Developing animations and videos within Circos would also be future work. Overall, the project was well directed; its single frustration was repetitive work being assigned to two people.
References:
1. Circos Modules - http://mkweb.bcgsc.ca/dev/circos/tutorials/lessons/configuration/perl_and_modules/
2. Circos Readme - http://circos.ca/software/readme/
3. Circos Tutorials - http://circos.ca/tutorials/lessons/
4. JavaFX – http://javafx.com
5. JavaFX Production Suite - http://download.oracle.com/javafx/1.3/reference/releasenotes/production-suite-release-notes-1-3.html#2
6. Inkscape - http://silveiraneto.net/2008/11/21/inkscape-and-javafx-working-together/
7. JavaFX – A beginner’s guide
Project 4: Automating PubMed Searches:
This project was well directed and organized as an independent project. Its main objective was to extract all the abstracts for a large number of genes and phenotypes/disorders from PubMed in a few seconds. I researched tools that can automate PubMed searches, and first tried using PubCrawler.
(1) PubCrawler - automated retrieval of PubMed and GenBank reports:
PubCrawler is free software; the download from its website contains the PubCrawler Perl script and its configuration file. After installing it locally, it can be run from the command line. An example invocation is:
pubcrawler.pl [-add_path /cwd/lib] [-c <config_file>] [-check] [-d <directory>] [-db <database>] [-fullmax <max-docs in full>] [-getmax <max-docs to retrieve>] [-i] [-out <output-file>] [-relentrezdate <relative-entrez-date>] [-s <search_term>] [-viewdays <view-days>]
The run above lists the command-line flags used to automate retrieval of PubMed reports. The -db flag names the database file that PubCrawler uses. The -fullmax flag specifies the maximum number of full-length reports to show per search, and -getmax the maximum number of documents to retrieve per search. The -i flag includes the configuration file in the HTML output, and the name of the HTML output file is given with -out. The -relentrezdate option sets the maximum age of a document to be retrieved (1 year, 2 years, 5 years, 'no limit', etc.). The -s option supplies the PubMed search term [2]. The -viewdays flag sets the number of days each document is shown, and -check verifies that everything is set up correctly. All of these command-line parameters, as well as the optional flags, can also be set in the PubCrawler configuration file; if the command-line options and the configuration file settings conflict, the command-line setting takes priority.
Search terms: search terms are defined exactly as in the Entrez search system. For instance, a Boolean search for abstracts about a gene and a phenotype would be pubcrawler.pl -s 'pubmed KRAS [ALL] AND Cancer [ALL]', which, entered at the command line, hides all other queries in the configuration file [2]. To automate multiple queries through this program, I wrote a Perl script that generates a new configuration file containing such search queries for any number of genes and any number of phenotypes.
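A sketch of that generator is below. It simply emits one Boolean gene-AND-phenotype query per pair; the exact directive syntax surrounding each query in a real PubCrawler configuration file should be copied from the sample config shipped with the program, so here the queries are written one per line as search terms only, and the input file names are hypothetical:

use strict;
use warnings;

# Usage: perl make_queries.pl genes.txt phenotypes.txt > queries.txt
my ($gene_file, $pheno_file) = @ARGV;

sub read_list {
    my ($path) = @_;
    open my $fh, '<', $path or die "Cannot open $path: $!";
    chomp(my @items = <$fh>);
    close $fh;
    return grep { length } @items;
}

my @genes      = read_list($gene_file);
my @phenotypes = read_list($pheno_file);

# One PubMed Boolean query per gene/phenotype pair
for my $gene (@genes) {
    for my $pheno (@phenotypes) {
        print "pubmed $gene [ALL] AND $pheno [ALL]\n";
    }
}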
Optional variables: some of the optional variables include (1) work_dir, the working directory for PubCrawler; (2) mail, an email address to which results are sent after each PubCrawler run; (3) format, the report format for PubMed and GenBank ('DocSum', 'Brief', 'Abstract', 'FASTA', etc.); (4) time_out, the time in seconds to wait for internet responses, defaulting to 180; (5) proxy, a proxy server for the internet connection; and (6) indent, the number of pixels by which PubCrawler comments are indented to align with the Entrez output.
Drawback: the main disadvantage of this approach is the time it consumes. A single query takes around 5 minutes to bring back all the research papers with all fields (author, primary accession number, publication date, etc.) when no limit is placed on the Entrez release date. I also tried running around 120 queries corresponding to 120 genes and their implicated phenotypes/disorders; the complete run took around 10 hours, which is quite inefficient.
(2) Using NCBI EUtils:
EUtils are very powerful and efficient for grabbing research articles from the PubMed database. Different URLs perform different functions. The following URL, for example, fetches the abstract for a single PubMed ID:
http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=pubmed&cmd=Retrieve&dopt=Abstract&list_uids=18461658&itool=pubmed_docsum
Similarly, the URL to display PMIDs in XML retrieval mode is:
http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&id=11748933,11700088&retmode=xml
Creating URLs for larger gene and phenotype lists:
I modified the EUtils URLs for retrieving abstracts and PubMed IDs to handle a larger list of genes and phenotypes. The new URL for getting the PubMed IDs from a list of genes is:
http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&retstart=0&retmax=10000000&usehistory=y&retmode=xml&term=Amelogenesisimperfecta and (gene1 or gene2 or gene3 or gene4 or …….)
The new URL for getting the abstracts is:
http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&id=1,2,3&retmode=html&rettype=abstract
Perl scripts:
With the URL templates for PMIDs and abstracts in hand, I wrote a Perl script that, given a list of genes and a list of phenotypes/disorders, builds the URL to fetch from the PubMed database the PubMed IDs of all research papers associated with that list. The content returned when the URL is opened is in XML format. I then extended the script to fetch the URL's content automatically, without having to click the link manually each time, and to parse the XML into a Perl hash-of-hash-of-array structure, from which the list of PubMed IDs is selectively printed. With the list of PubMed IDs, the abstracts URL is constructed, its content is again fetched automatically using Perl modules, and the returned XML is parsed to print the abstracts to standard output. However, there is a URL limit on the number of search terms that can be specified, around 200 Boolean terms for genes and implicated diseases; the maximum number of abstracts the script currently retrieves is 800.
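A condensed sketch of that pipeline is below, using LWP::Simple to fetch the URLs and XML::Simple to parse the responses; the module choices are mine for illustration, and the gene/phenotype terms are hypothetical:

use strict;
use warnings;
use LWP::Simple qw(get);
use XML::Simple qw(XMLin);

my $base = 'http://eutils.ncbi.nlm.nih.gov/entrez/eutils';

# Step 1: esearch returns the PMIDs matching a Boolean query as XML.
# Hypothetical search terms; spaces must be escaped for the URL.
my $term = 'Amelogenesis imperfecta AND (KRAS OR TP53)';
(my $query = $term) =~ s/ /+/g;

my $search_xml = get("$base/esearch.fcgi?db=pubmed&retmax=10000&retmode=xml&term=$query")
    or die "esearch request failed\n";
my $parsed = XMLin($search_xml, ForceArray => ['Id']);
my @pmids  = @{ $parsed->{IdList}{Id} || [] };
die "No PMIDs found\n" unless @pmids;

# Step 2: efetch returns the abstracts for a comma-separated PMID list
my $ids = join ',', @pmids;
my $abstracts = get("$base/efetch.fcgi?db=pubmed&id=$ids&retmode=xml&rettype=abstract")
    or die "efetch request failed\n";

print $abstracts;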
Frustrations and Future Work: this project was very challenging, and the key frustration, which would also be the future work, is removing the URL limit so that a query can take an arbitrarily long gene and phenotype list (for example, by batching the terms across multiple requests) while still grabbing the page contents automatically. Adding parameters such as restricting article retrieval by author name, Entrez release date, or accession number would also be future work.
References:
1. Integrating automated literature searches and text mining in biomarker discovery-
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2956392/
2. PubCrawler and EUtils - http://www.my-whiteboard.com/how-to-automate-pubmed-search-using-perl-php-or-java/
Project 5: 1000 Genomes and BioGPS:
This project was organized as a team project. First, the genotype data for the 1000 Genomes individuals was downloaded for hg18 in VCF format. The main idea behind this approach is to take the union of the 1000 Genomes data with the company's 50 individual genomes and analyze that union using PLINK association studies. To proceed, three site sets were generated: sites unique to the 1000 Genomes data that are not called in the 50-individual data, sites unique to the 50-individual data that are not called in the 1000 Genomes data, and sites called in both. The figure below represents this idea:
[Figure: Venn diagram of sites unique to the 50 individual genomes, sites unique to the 1000 Genomes individuals, and sites common to both.]
The list of approximately two million sites unique to the 1000 Genomes data was used in this work. The key question was how to get genotypes for these two million sites; to move forward, I wrote a Perl script to query the company's internal database for the genotypes at these sites for each of the fifty subject genome IDs in the 50-genomes data.
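A sketch of that query loop is below, using the standard Perl DBI. The DSN, credentials, table, and column names are all hypothetical placeholders for the company's internal schema, and sites are batched rather than queried one at a time:

use strict;
use warnings;
use DBI;

# Hypothetical connection details for the internal genotype store
my $dbh = DBI->connect('DBI:mysql:database=genomes;host=dbhost',
                       'user', 'password', { RaiseError => 1 });

# Prepare once; execute per (subject, site batch) to avoid per-site round trips
my $batch_size = 1000;
my $sth = $dbh->prepare(
    'SELECT chrom, pos, allele1, allele2
       FROM genotypes
      WHERE subject_id = ? AND chrom = ? AND pos IN (' .
      join(',', ('?') x $batch_size) . ')'
);

# @subjects and %sites_by_chrom would be loaded from the unique-sites list
my @subjects = ('KG001');                               # hypothetical subject IDs
my %sites_by_chrom = ( '1' => [ 1 .. $batch_size ] );   # hypothetical positions

for my $subject (@subjects) {
    for my $chrom (keys %sites_by_chrom) {
        my @batch = @{ $sites_by_chrom{$chrom} }[0 .. $batch_size - 1];
        $sth->execute($subject, $chrom, @batch);
        while (my ($c, $pos, $a1, $a2) = $sth->fetchrow_array) {
            print join("\t", $subject, $c, $pos, "$a1$a2"), "\n";
        }
    }
}
$dbh->disconnect;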
Simultaneously, I did data mining using the BioGPS gene portal database, which has expression charts for genes in humans, mice, rats, etc. I researched the correlation parameter settings, which display the correlation between the expression charts of two genes, and downloaded the expression data for around 44,000 probe sets from the website in order to build a vector matrix of expression data against the company's gene expression data.
Frustrations and Future Work: the initial Perl script was slow because it queried the database once for each of the two million sites; optimizing the code was essential to accomplish this. My future work would be to convert the union data into the PED and MAP input formats supported by PLINK and go forward with association studies.
More Related Content

What's hot

RNASeq DE methods review Applied Bioinformatics Journal Club
RNASeq DE methods review Applied Bioinformatics Journal ClubRNASeq DE methods review Applied Bioinformatics Journal Club
RNASeq DE methods review Applied Bioinformatics Journal ClubJennifer Shelton
 
Introduction to Proteogenomics
Introduction to Proteogenomics Introduction to Proteogenomics
Introduction to Proteogenomics
Yasset Perez-Riverol
 
Transcript detection in RNAseq
Transcript detection in RNAseqTranscript detection in RNAseq
Transcript detection in RNAseq
Denis C. Bauer
 
Protein databases
Protein databasesProtein databases
Protein databasessarumalay
 
Apollo Introduction for the Chestnut Research Community
Apollo Introduction for the Chestnut Research CommunityApollo Introduction for the Chestnut Research Community
Apollo Introduction for the Chestnut Research Community
Monica Munoz-Torres
 
BITs: Genome browsers and interpretation of gene lists.
BITs: Genome browsers and interpretation of gene lists.BITs: Genome browsers and interpretation of gene lists.
BITs: Genome browsers and interpretation of gene lists.
BITS
 
Liu_Jiangyuan_1201662_FR
Liu_Jiangyuan_1201662_FRLiu_Jiangyuan_1201662_FR
Liu_Jiangyuan_1201662_FR姜圆 刘
 
RNA-seq differential expression analysis
RNA-seq differential expression analysisRNA-seq differential expression analysis
RNA-seq differential expression analysis
mikaelhuss
 
OpenMS: Quantitative proteomics at large scale
OpenMS: Quantitative proteomics at large scaleOpenMS: Quantitative proteomics at large scale
OpenMS: Quantitative proteomics at large scale
Yasset Perez-Riverol
 
RNA-seq: Mapping and quality control - part 3
RNA-seq: Mapping and quality control - part 3RNA-seq: Mapping and quality control - part 3
RNA-seq: Mapping and quality control - part 3
BITS
 
De novo transcriptome assembly of solid sequencing data in cucumis melo
De novo transcriptome assembly of solid sequencing data in cucumis meloDe novo transcriptome assembly of solid sequencing data in cucumis melo
De novo transcriptome assembly of solid sequencing data in cucumis melo
bioejjournal
 
Using biological network approaches for dynamic extension of micronutrient re...
Using biological network approaches for dynamic extension of micronutrient re...Using biological network approaches for dynamic extension of micronutrient re...
Using biological network approaches for dynamic extension of micronutrient re...
Chris Evelo
 
Talk ABRF 2015 (Gunnar Rätsch)
Talk ABRF 2015 (Gunnar Rätsch)Talk ABRF 2015 (Gunnar Rätsch)
Talk ABRF 2015 (Gunnar Rätsch)
Gunnar Rätsch
 
916215 bioinformatics-over-view
916215 bioinformatics-over-view916215 bioinformatics-over-view
916215 bioinformatics-over-view
kudipudi
 
Prediction of protein function from sequence derived protein features
Prediction of protein function from sequence derived protein featuresPrediction of protein function from sequence derived protein features
Prediction of protein function from sequence derived protein features
Lars Juhl Jensen
 
BioInformatics Tools -Genomics , Proteomics and metablomics
BioInformatics Tools -Genomics , Proteomics and metablomicsBioInformatics Tools -Genomics , Proteomics and metablomics
BioInformatics Tools -Genomics , Proteomics and metablomics
AyeshaYousaf20
 
A Tovchigrechko - MGTAXA: a toolkit and webserver for predicting taxonomy of ...
A Tovchigrechko - MGTAXA: a toolkit and webserver for predicting taxonomy of ...A Tovchigrechko - MGTAXA: a toolkit and webserver for predicting taxonomy of ...
A Tovchigrechko - MGTAXA: a toolkit and webserver for predicting taxonomy of ...
Jan Aerts
 
Catalyzing Plant Science Research with RNA-seq
Catalyzing Plant Science Research with RNA-seqCatalyzing Plant Science Research with RNA-seq
Catalyzing Plant Science Research with RNA-seq
Manjappa Ganiger
 
Rna seq
Rna seq Rna seq
Rna seq
Amitha Dasari
 
Gutell 112.j.phys.chem.b.2010.114.13497
Gutell 112.j.phys.chem.b.2010.114.13497Gutell 112.j.phys.chem.b.2010.114.13497
Gutell 112.j.phys.chem.b.2010.114.13497
Robin Gutell
 

What's hot (20)

RNASeq DE methods review Applied Bioinformatics Journal Club
RNASeq DE methods review Applied Bioinformatics Journal ClubRNASeq DE methods review Applied Bioinformatics Journal Club
RNASeq DE methods review Applied Bioinformatics Journal Club
 
Introduction to Proteogenomics
Introduction to Proteogenomics Introduction to Proteogenomics
Introduction to Proteogenomics
 
Transcript detection in RNAseq
Transcript detection in RNAseqTranscript detection in RNAseq
Transcript detection in RNAseq
 
Protein databases
Protein databasesProtein databases
Protein databases
 
Apollo Introduction for the Chestnut Research Community
Apollo Introduction for the Chestnut Research CommunityApollo Introduction for the Chestnut Research Community
Apollo Introduction for the Chestnut Research Community
 
BITs: Genome browsers and interpretation of gene lists.
BITs: Genome browsers and interpretation of gene lists.BITs: Genome browsers and interpretation of gene lists.
BITs: Genome browsers and interpretation of gene lists.
 
Liu_Jiangyuan_1201662_FR
Liu_Jiangyuan_1201662_FRLiu_Jiangyuan_1201662_FR
Liu_Jiangyuan_1201662_FR
 
RNA-seq differential expression analysis
RNA-seq differential expression analysisRNA-seq differential expression analysis
RNA-seq differential expression analysis
 
OpenMS: Quantitative proteomics at large scale
OpenMS: Quantitative proteomics at large scaleOpenMS: Quantitative proteomics at large scale
OpenMS: Quantitative proteomics at large scale
 
RNA-seq: Mapping and quality control - part 3
RNA-seq: Mapping and quality control - part 3RNA-seq: Mapping and quality control - part 3
RNA-seq: Mapping and quality control - part 3
 
De novo transcriptome assembly of solid sequencing data in cucumis melo
De novo transcriptome assembly of solid sequencing data in cucumis meloDe novo transcriptome assembly of solid sequencing data in cucumis melo
De novo transcriptome assembly of solid sequencing data in cucumis melo
 
Using biological network approaches for dynamic extension of micronutrient re...
Using biological network approaches for dynamic extension of micronutrient re...Using biological network approaches for dynamic extension of micronutrient re...
Using biological network approaches for dynamic extension of micronutrient re...
 
Talk ABRF 2015 (Gunnar Rätsch)
Talk ABRF 2015 (Gunnar Rätsch)Talk ABRF 2015 (Gunnar Rätsch)
Talk ABRF 2015 (Gunnar Rätsch)
 
916215 bioinformatics-over-view
916215 bioinformatics-over-view916215 bioinformatics-over-view
916215 bioinformatics-over-view
 
Prediction of protein function from sequence derived protein features
Prediction of protein function from sequence derived protein featuresPrediction of protein function from sequence derived protein features
Prediction of protein function from sequence derived protein features
 
BioInformatics Tools -Genomics , Proteomics and metablomics
BioInformatics Tools -Genomics , Proteomics and metablomicsBioInformatics Tools -Genomics , Proteomics and metablomics
BioInformatics Tools -Genomics , Proteomics and metablomics
 
A Tovchigrechko - MGTAXA: a toolkit and webserver for predicting taxonomy of ...
A Tovchigrechko - MGTAXA: a toolkit and webserver for predicting taxonomy of ...A Tovchigrechko - MGTAXA: a toolkit and webserver for predicting taxonomy of ...
A Tovchigrechko - MGTAXA: a toolkit and webserver for predicting taxonomy of ...
 
Catalyzing Plant Science Research with RNA-seq
Catalyzing Plant Science Research with RNA-seqCatalyzing Plant Science Research with RNA-seq
Catalyzing Plant Science Research with RNA-seq
 
Rna seq
Rna seq Rna seq
Rna seq
 
Gutell 112.j.phys.chem.b.2010.114.13497
Gutell 112.j.phys.chem.b.2010.114.13497Gutell 112.j.phys.chem.b.2010.114.13497
Gutell 112.j.phys.chem.b.2010.114.13497
 

Viewers also liked

Uniforme moderno
Uniforme modernoUniforme moderno
Uniforme moderno
Estefy Ordoñez
 
7 signs your computer has virus
7 signs your computer has virus7 signs your computer has virus
7 signs your computer has virus
techexpert2345
 
Continuacion 111
Continuacion 111Continuacion 111
Continuacion 111
maria gutierrez mamani
 
curriculum vitae
curriculum vitaecurriculum vitae
curriculum vitaeSunil Singh
 
DJ VISHAL PROFILE 2015-2016
DJ  VISHAL PROFILE 2015-2016DJ  VISHAL PROFILE 2015-2016
DJ VISHAL PROFILE 2015-2016vishal waval
 
Tài liệu môn trí tuệ nhân tạo đh bách khoa tp hồ chí minh tài liệu, ebook
Tài liệu môn trí tuệ nhân tạo đh bách khoa tp hồ chí minh   tài liệu, ebookTài liệu môn trí tuệ nhân tạo đh bách khoa tp hồ chí minh   tài liệu, ebook
Tài liệu môn trí tuệ nhân tạo đh bách khoa tp hồ chí minh tài liệu, ebook
Tráng Hà Viết
 
Vanessa
VanessaVanessa
Oop unit 09 lập trình tổng quát
Oop unit 09 lập trình tổng quátOop unit 09 lập trình tổng quát
Oop unit 09 lập trình tổng quát
Tráng Hà Viết
 
Tips to make your computer happy
Tips to make your computer happyTips to make your computer happy
Tips to make your computer happy
techexpert2345
 

Viewers also liked (11)

blogger
bloggerblogger
blogger
 
Uniforme moderno
Uniforme modernoUniforme moderno
Uniforme moderno
 
7 signs your computer has virus
7 signs your computer has virus7 signs your computer has virus
7 signs your computer has virus
 
Continuacion 111
Continuacion 111Continuacion 111
Continuacion 111
 
curriculum vitae
curriculum vitaecurriculum vitae
curriculum vitae
 
DJ VISHAL PROFILE 2015-2016
DJ  VISHAL PROFILE 2015-2016DJ  VISHAL PROFILE 2015-2016
DJ VISHAL PROFILE 2015-2016
 
Tài liệu môn trí tuệ nhân tạo đh bách khoa tp hồ chí minh tài liệu, ebook
Tài liệu môn trí tuệ nhân tạo đh bách khoa tp hồ chí minh   tài liệu, ebookTài liệu môn trí tuệ nhân tạo đh bách khoa tp hồ chí minh   tài liệu, ebook
Tài liệu môn trí tuệ nhân tạo đh bách khoa tp hồ chí minh tài liệu, ebook
 
MapleBrochure
MapleBrochureMapleBrochure
MapleBrochure
 
Vanessa
VanessaVanessa
Vanessa
 
Oop unit 09 lập trình tổng quát
Oop unit 09 lập trình tổng quátOop unit 09 lập trình tổng quát
Oop unit 09 lập trình tổng quát
 
Tips to make your computer happy
Tips to make your computer happyTips to make your computer happy
Tips to make your computer happy
 

Similar to Internship Report

Analyzing Genomic Data with PyEnsembl and Varcode
Analyzing Genomic Data with PyEnsembl and VarcodeAnalyzing Genomic Data with PyEnsembl and Varcode
Analyzing Genomic Data with PyEnsembl and Varcode
Alex Rubinsteyn
 
T-bioinfo overview
T-bioinfo overviewT-bioinfo overview
T-bioinfo overview
Jaclyn Williams
 
T-BioInfo Methods and Approaches
T-BioInfo Methods and ApproachesT-BioInfo Methods and Approaches
T-BioInfo Methods and Approaches
Elia Brodsky
 
Summer internship at University of Tokyo
Summer internship at University of TokyoSummer internship at University of Tokyo
Summer internship at University of Tokyo
Vaibhav Kulshrestha
 
Introduction to Apollo for i5k
Introduction to Apollo for i5kIntroduction to Apollo for i5k
Introduction to Apollo for i5k
Monica Munoz-Torres
 
Closing the Gap in Time: From Raw Data to Real Science
Closing the Gap in Time: From Raw Data to Real ScienceClosing the Gap in Time: From Raw Data to Real Science
Closing the Gap in Time: From Raw Data to Real Science
Justin Johnson
 
Bioinformatics.pptx
Bioinformatics.pptxBioinformatics.pptx
Bioinformatics.pptx
Piyush Mishra
 
Bioinformatic, and tools by kk sahu
Bioinformatic, and tools by kk sahuBioinformatic, and tools by kk sahu
Bioinformatic, and tools by kk sahu
KAUSHAL SAHU
 
CAFA poster presented at CSHL Genome Informatics 2013
CAFA poster presented at CSHL Genome Informatics 2013CAFA poster presented at CSHL Genome Informatics 2013
CAFA poster presented at CSHL Genome Informatics 2013Iddo
 
Web Apollo Tutorial for the i5K copepod research community.
Web Apollo Tutorial for the i5K copepod research community.Web Apollo Tutorial for the i5K copepod research community.
Web Apollo Tutorial for the i5K copepod research community.
Monica Munoz-Torres
 
Knowing Your NGS Downstream: Functional Predictions
Knowing Your NGS Downstream: Functional PredictionsKnowing Your NGS Downstream: Functional Predictions
Knowing Your NGS Downstream: Functional Predictions
Golden Helix Inc
 
RNA Sequencing Research
RNA Sequencing ResearchRNA Sequencing Research
RNA Sequencing ResearchTanmay Ghai
 
Protein database
Protein databaseProtein database
Protein database
Khalid Hakeem
 
Proteomics: lecture (1) introduction to proteomics
Proteomics: lecture (1) introduction to proteomicsProteomics: lecture (1) introduction to proteomics
Proteomics: lecture (1) introduction to proteomics
Claudine83
 
Impact_of_gene_length_on_DEG
Impact_of_gene_length_on_DEGImpact_of_gene_length_on_DEG
Impact_of_gene_length_on_DEGLong Pei
 
Understanding Genome
Understanding Genome Understanding Genome
Understanding Genome
Rajendra K Labala
 
Web Apollo Workshop University of Exeter
Web Apollo Workshop University of ExeterWeb Apollo Workshop University of Exeter
Web Apollo Workshop University of Exeter
Monica Munoz-Torres
 
Dgaston dec-06-2012
Dgaston dec-06-2012Dgaston dec-06-2012
Dgaston dec-06-2012
Dan Gaston
 
rheumatoid arthritis
rheumatoid arthritisrheumatoid arthritis
rheumatoid arthritis
Ankit Bhardwaj
 

Similar to Internship Report (20)

Analyzing Genomic Data with PyEnsembl and Varcode
Analyzing Genomic Data with PyEnsembl and VarcodeAnalyzing Genomic Data with PyEnsembl and Varcode
Analyzing Genomic Data with PyEnsembl and Varcode
 
T-bioinfo overview
T-bioinfo overviewT-bioinfo overview
T-bioinfo overview
 
T-BioInfo Methods and Approaches
T-BioInfo Methods and ApproachesT-BioInfo Methods and Approaches
T-BioInfo Methods and Approaches
 
Summer internship at University of Tokyo
Summer internship at University of TokyoSummer internship at University of Tokyo
Summer internship at University of Tokyo
 
Introduction to Apollo for i5k
Introduction to Apollo for i5kIntroduction to Apollo for i5k
Introduction to Apollo for i5k
 
Closing the Gap in Time: From Raw Data to Real Science
Closing the Gap in Time: From Raw Data to Real ScienceClosing the Gap in Time: From Raw Data to Real Science
Closing the Gap in Time: From Raw Data to Real Science
 
Bioinformatics.pptx
Bioinformatics.pptxBioinformatics.pptx
Bioinformatics.pptx
 
Bioinformatic, and tools by kk sahu
Bioinformatic, and tools by kk sahuBioinformatic, and tools by kk sahu
Bioinformatic, and tools by kk sahu
 
CAFA poster presented at CSHL Genome Informatics 2013
CAFA poster presented at CSHL Genome Informatics 2013CAFA poster presented at CSHL Genome Informatics 2013
CAFA poster presented at CSHL Genome Informatics 2013
 
Web Apollo Tutorial for the i5K copepod research community.
Web Apollo Tutorial for the i5K copepod research community.Web Apollo Tutorial for the i5K copepod research community.
Web Apollo Tutorial for the i5K copepod research community.
 
Knowing Your NGS Downstream: Functional Predictions
Knowing Your NGS Downstream: Functional PredictionsKnowing Your NGS Downstream: Functional Predictions
Knowing Your NGS Downstream: Functional Predictions
 
RNA Sequencing Research
RNA Sequencing ResearchRNA Sequencing Research
RNA Sequencing Research
 
Protein database
Protein databaseProtein database
Protein database
 
Proteomics: lecture (1) introduction to proteomics
Proteomics: lecture (1) introduction to proteomicsProteomics: lecture (1) introduction to proteomics
Proteomics: lecture (1) introduction to proteomics
 
Impact_of_gene_length_on_DEG
Impact_of_gene_length_on_DEGImpact_of_gene_length_on_DEG
Impact_of_gene_length_on_DEG
 
Understanding Genome
Understanding Genome Understanding Genome
Understanding Genome
 
Web Apollo Workshop University of Exeter
Web Apollo Workshop University of ExeterWeb Apollo Workshop University of Exeter
Web Apollo Workshop University of Exeter
 
Dgaston dec-06-2012
Dgaston dec-06-2012Dgaston dec-06-2012
Dgaston dec-06-2012
 
rheumatoid arthritis
rheumatoid arthritisrheumatoid arthritis
rheumatoid arthritis
 
LOKITHESWARI VIPPALA
LOKITHESWARI VIPPALALOKITHESWARI VIPPALA
LOKITHESWARI VIPPALA
 

More from Neha Gupta

applications-and-current-challenges-of-supercomputing-across-multiple-domains...
applications-and-current-challenges-of-supercomputing-across-multiple-domains...applications-and-current-challenges-of-supercomputing-across-multiple-domains...
applications-and-current-challenges-of-supercomputing-across-multiple-domains...Neha Gupta
 
5315Syll2015Fall
5315Syll2015Fall5315Syll2015Fall
5315Syll2015FallNeha Gupta
 
syllabus_OS_Fall_2015
syllabus_OS_Fall_2015syllabus_OS_Fall_2015
syllabus_OS_Fall_2015Neha Gupta
 
Stem Cell Presentation
Stem Cell PresentationStem Cell Presentation
Stem Cell PresentationNeha Gupta
 
biomedical_machine_learning_poster_48''x36''
biomedical_machine_learning_poster_48''x36''biomedical_machine_learning_poster_48''x36''
biomedical_machine_learning_poster_48''x36''Neha Gupta
 
NehaGupta-SupervisorStatement
NehaGupta-SupervisorStatementNehaGupta-SupervisorStatement
NehaGupta-SupervisorStatementNeha Gupta
 
human_mutation_article
human_mutation_articlehuman_mutation_article
human_mutation_articleNeha Gupta
 

More from Neha Gupta (9)

applications-and-current-challenges-of-supercomputing-across-multiple-domains...
applications-and-current-challenges-of-supercomputing-across-multiple-domains...applications-and-current-challenges-of-supercomputing-across-multiple-domains...
applications-and-current-challenges-of-supercomputing-across-multiple-domains...
 
5315Syll2015Fall
5315Syll2015Fall5315Syll2015Fall
5315Syll2015Fall
 
syllabus_OS_Fall_2015
syllabus_OS_Fall_2015syllabus_OS_Fall_2015
syllabus_OS_Fall_2015
 
Stem Cell Presentation
Stem Cell PresentationStem Cell Presentation
Stem Cell Presentation
 
Presentation1
Presentation1Presentation1
Presentation1
 
neha_ppt
neha_pptneha_ppt
neha_ppt
 
biomedical_machine_learning_poster_48''x36''
biomedical_machine_learning_poster_48''x36''biomedical_machine_learning_poster_48''x36''
biomedical_machine_learning_poster_48''x36''
 
NehaGupta-SupervisorStatement
NehaGupta-SupervisorStatementNehaGupta-SupervisorStatement
NehaGupta-SupervisorStatement
 
human_mutation_article
human_mutation_articlehuman_mutation_article
human_mutation_article
 

Internship Report

  • 1.
  • 2. 13 August 2011 Supervisor’sstatementforNehaGupta In her summer 2011 co-op, Neha worked in Knome’s Research group, scripting for live genome analysis projects and scouting bioinformatics resources for potential use in future development. I supervised her, and she also worked closely with another NU co-op and with two bioinformaticians in the Operations group. On some projects, she also worked with members of the Development group. In just a few weeks with Knome, Neha worked on many different projects. She wrote several key scripts for Knome, including one that helps streamline our literature search when finishing a project analysis. She also researched, and wrote code to help parse/prepare files for, bioinformatics resources such as PLINK. We at Knome enjoyed having her work with us, and wish her very well! Sincerely, Nathaniel Pearson Director of Research, Knome, Inc.
precomputed SIFT and PolyPhen scores for every possible mutation in the human proteome. Condel was run using the Ensembl Variation database Perl API. The Variant Effect Predictor takes in a large list of variants in VCF or pileup format and predicts Condel scores based on the SIFT and PolyPhen scores precomputed in the Ensembl database7. The genomic coordinates in the input file must match the reference genome used by that Ensembl database version (v62 corresponds to hg19). The predictor is run through a command-line interface, and the command-line options can also be defined in a configuration file. A second, optional registry file points the queries to the right databases (the MySQL Ensembl Homo sapiens core and variation databases are essential). An example of a registry file is:
use Bio::EnsEMBL::DBSQL::DBAdaptor;
use Bio::EnsEMBL::Variation::DBSQL::DBAdaptor;
use Bio::EnsEMBL::Registry;

Bio::EnsEMBL::DBSQL::DBAdaptor->new(
    '-species' => 'Homo sapiens',
    '-group'   => 'core',
    '-port'    => 3306,
    '-host'    => 'kensembl01',
    '-user'    => 'ngupta',
    '-pass'    => 'abcd',
    '-dbname'  => 'homo_sapiens_core_62_37g'
);

Bio::EnsEMBL::Variation::DBSQL::DBAdaptor->new(
    '-species' => 'Homo sapiens',
    '-group'   => 'variation',
    '-port'    => 3306,
    '-host'    => 'kensembl01',
    '-user'    => 'ngupta',
    '-pass'    => 'abcd',
    '-dbname'  => 'homo_sapiens_variation_62_37g'
);

All adaptors are stored using the Bio::EnsEMBL::Registry module, which is used to retrieve the adaptors required. Bio::EnsEMBL::DBSQL::DBAdaptor connects to the 'homo_sapiens_core_62_37g' database through the DBConnection class and creates a new object with the species, group and other flags; that object is retrieved with the help of the Bio::EnsEMBL::Registry module. The Bio::EnsEMBL::Variation::DBSQL::DBAdaptor call in the second half of the registry file likewise provides a connection to the Ensembl variation database 'homo_sapiens_variation_62_37g'.

An output of Condel scores was produced and analysed. Condel assesses the effects of missense variants, nonsense-mediated decay (NMD) variants and read-through variants. A missense mutation changes an amino acid during translation, thereby altering the protein synthesized. Nonsense-mediated decay is an mRNA surveillance mechanism that detects nonsense mutations and prevents the expression of erroneous proteins: following transcription, precursor mRNA undergoes pre-mRNA processing, and aberrant splicing (favoured by the large introns of eukaryotes) can introduce a nonsense codon (UAA, UAG, UGA) within the open reading frame. A frameshift (read-through) mutation is caused by insertions or deletions of nucleotides; because gene expression reads codons in triplets, an indel shifts the reading frame (the grouping of the codons), so an altered protein is synthesized after translation. A Condel score near 1 indicates a deleterious allele variant, whereas a score near 0 indicates a benign one.

This part of the project was organized as a team project. The key question was how accurate the Condel scores produced from SIFT and PolyPhen alone are, and how they compare with Condel scores that take all five tools into account. To answer this question, I researched the program MAPP independently.
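Returning to the registry file above: a minimal sketch of how such a file is loaded and used with the Ensembl Perl API. The rsID and file name are purely illustrative; the calls follow the Ensembl Perl API documentation as I understand it for this release.

use strict;
use warnings;
use Bio::EnsEMBL::Registry;

# Load the database connections defined in the registry file above.
Bio::EnsEMBL::Registry->load_all('ensembl.registry');

# Fetch the variation adaptor for human and look up a variant by rsID.
my $va = Bio::EnsEMBL::Registry->get_adaptor('Homo sapiens', 'variation', 'variation');
my $variation = $va->fetch_by_name('rs699');   # illustrative rsID

print $variation->name, "\n";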
MAPP:
MAPP reads an alignment of protein sequences and an alignment tree to calculate the predicted impact of each potential SNP at each position6. These predictions are based entirely on a set of scales of physicochemical properties, for which each amino acid has a numeric value. This open-source program takes two input files: one containing the alignment in FASTA format, and a text file containing the alignment tree in parenthesis notation with branch lengths6. The alignment file must contain properly aligned sequences with gaps indicated by hyphens, and the tree file must be formatted in a way that MAPP accepts. The program is built on a Java platform and requires a large number of orthologs. The alignment and tree files were obtained by installing ClustalW locally; other options such as Semphy and ClustalX can also be used to produce the multiple sequence alignment6. I ran ClustalW locally to perform a multiple sequence alignment of six sample human sequences, generating an alignment FASTA file and a tree file as output.

The physicochemical properties used by MAPP are hydropathy, polarity, charge, volume, free energy in alpha-helix conformation and free energy in beta-strand conformation2. MAPP produces an output table in Excel format with the following columns:
1. Position - position of the column in the alignment
2. Column score - MAPP median score for this alignment column
3. Column p-value - p-value interpretation of the column's MAPP score
4. Alignment - amino acids observed in that column, in alignment order
5. Gap weight - weighted fraction of gaps in the column
6. Over gap weight threshold - whether the column exceeds the gap weight threshold, since MAPP does not calculate scores when the gap weight is greater than 50%
7. P-values corresponding to the significance of each physicochemical property
8. Columns of MAPP scores for each possible amino acid variant
9. P-value interpretations of the MAPP scores, predicting the impact of each amino acid variant
10. Good amino acids - amino acids tolerated at this position
11. Bad amino acids - amino acids deleterious at this position

An example of how the output is produced in MAPP is as follows (figure: screenshot of an example MAPP output table).
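For reference, a sketch of the two-step run used here. File names are illustrative; the ClustalW options are standard, and the MAPP flags follow its documentation as I recall them, so they are worth double-checking against the MAPP README.

# 1. Align the orthologous sequences; ClustalW also writes a guide tree
#    (orthologs.dnd) alongside the alignment by default.
clustalw2 -INFILE=orthologs.fa -ALIGN -OUTPUT=FASTA -OUTFILE=orthologs.aln.fa

# 2. Run MAPP on the alignment and tree.
java -jar MAPP.jar -f orthologs.aln.fa -t orthologs.dnd -o mapp_scores.xls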
The UCSC Genome Browser5 provides tracks with information such as chromosome, allelic coordinates and base-pair position for tRNA genes, miRNA genes, regulatory genes and expression genes, among others. These files come in various formats, such as Browser Extensible Data (BED), BigBed and Wiggle. I researched and downloaded the BED tracks in order to integrate a BED parser into one of the company's tools.

Future Work: The decision to divide the work independently, with research on Condel on one side and on the other tools (Pfam logRE values, MAPP, Mutation Assessor) on the other, was a good one: the work was delegated, so no time was wasted on the same task being assigned to two people. The future work in this project is to integrate SIFT and PolyPhen with MAPP, Pfam logRE and Mutation Assessor, which I believe is a more accurate way of predicting Condel scores because it draws on all five computational tools. MAPP, Pfam logRE and Mutation Assessor are more in-depth and accurate in predicting the effects of all types of variants in a human proteome4,6, so a weighted score of these five tools should be more accurate than a Condel score based only on SIFT and PolyPhen1. However, the key frustration in MAPP is automating the generation of the alignment and tree files on a whole-genome basis. A single MAPP run produces scores for only one alignment file and one alignment tree, so for the entire genome of an individual the process must be automated to generate alignment and tree files for all the orthologous proteins. It is memory-intensive and difficult to align all the orthologous proteins in one go to produce a single alignment file6. How Pfam logRE works on a genome-wide basis would be an interesting concept to learn about.

References:
1. Condel - consensus deleteriousness score of missense SNVs: http://bg.upf.edu/group/projects/condel.php
2. Physicochemical constraint violation by missense substitutions mediates impairment of protein function and disease severity: http://mendel.stanford.edu/SidowLab/pdfs/2005StoneSidow.pdf
3. PolyPhen: http://www.ncbi.nlm.nih.gov/pubmed/21412949 and http://www.ncbi.nlm.nih.gov/pubmed/20642364
4. Pfam: http://cancerres.aacrjournals.org/content/67/2/465.long
5. ENCODE: http://genome.ucsc.edu/ENCODE/
6. MAPP: http://mendel.stanford.edu/sidowlab/downloads/MAPP/index.html and http://www.plosgenetics.org/article/info%3Adoi%2F10.1371%2Fjournal.pgen.1000968
7. Ensembl: http://useast.ensembl.org/info/docs/variation/index.html

Project 2: Pedigree Analysis and Estimation of Inbreeding:
This project was directed as an independent project, and its aim was to obtain kinship estimates for the company's subject data.

PLINK for GWAS: PLINK is an open-source whole-genome association analysis program written in C/C++, useful for performing a range of large-scale analyses in a computationally efficient manner. Its main focus is the analysis of genotype (SNP) and phenotype data. It is integrated with gPLINK
(a GUI for PLINK) and with Haploview, supporting subsequent visualization and annotation of the results1. PLINK v1.07 is freely available for download for any operating system from its website.

Running PLINK: PLINK is run through a command-line interface. A nice feature of PLINK is that each run is stored in a log file, so the --rerun option is quite useful for repeating a previous analysis using that log file as a base.

Input files: PLINK takes as input two files, specified together with the command-line option --file or, more explicitly, as separate PED and MAP files. For instance:

./plink --ped data.ped --map data.map

(1) PED: A PED file stores the pedigree information for each individual in a specific format accepted by PLINK. For larger files (100K+ SNPs), the binary form of the PED file helps increase the program's efficiency. The PED file is white-space delimited; the first six columns, which hold the pedigree information for an individual, are mandatory: Family ID, Individual ID, Paternal ID, Maternal ID, Sex (1 = male; 2 = female; other = unknown) and Phenotype1. The phenotype can be either a quantitative trait or an affection status column; by default, affection status is coded -9 = missing, 0 = missing, 1 = unaffected, 2 = affected1. Genotypes begin at column 7 and must be white-space delimited, specified by the characters 1, 2, 3, 4 or A, C, G, T, with 0 for a missing allele. All SNP markers, whether haploid or not, must be biallelic (missing genotype: 0 0)1. Omitting the family ID, parent IDs, sex or phenotype from a basic PED file is indicated with the corresponding --no options on the command line. To save space and time, a binary PED file (*.bed) can be created; it stores the pedigree/phenotype information in a separate file (*.fam) and creates an extended map file (*.bim)1.

(2) MAP files: MAP files store the genetic mapping information for the SNP markers in the PED file. Each line of a MAP file describes a single marker and contains four columns: chromosome (1-22, X, Y, 0 if unplaced), rs# or SNP identifier, genetic distance in morgans, and base-pair position1. Alternatively, a MAP file without the genetic distance column can be used. SNP identifiers can contain any characters except white space and *. To exclude a SNP from the analysis, set its base-pair position to a negative value. The MAP file must contain exactly as many markers as the PED file. PLINK's internal numeric codes for the other chromosome types are X = X chromosome (23), Y = Y chromosome (24), XY = pseudo-autosomal region of X (25) and MT = mitochondrial (26)1.

Inbreeding analysis: Among the various analyses performed by PLINK, a useful one is the calculation of inbreeding coefficients (F) from the observed versus expected numbers of homozygous genotypes, estimated as F = (O(HOM) - E(HOM)) / (N(NM) - E(HOM)). Inbreeding is a good measure of kinship within a group of individuals typed at a large number of SNPs. Six individuals belonging to six different families were considered to obtain inbreeding coefficients. The PED and MAP files were generated as follows.

PED file:
1 1 0 0 1 1 A A G T
2 1 0 0 1 1 A C T G
3 1 0 0 1 1 C C G G
4 1 0 0 1 2 A C T T
5 1 0 0 1 2 C C G T
6 1 0 0 1 2 C C T T

MAP file:
1 snp1 0 1
1 snp2 0 2

To estimate inbreeding among these subjects, PLINK is run on the input PED and MAP files as

./plink --ped data.ped --map data.map --het

which creates an output file, plink.het, exemplified as follows:

FID IID O(HOM) E(HOM) N(NM) F
1   1   1      1.069  2     -0.07463
2   1   0      1.069  2     -1.149
3   1   2      1.069  2      1

where FID = family ID, IID = individual ID, O(HOM) = observed number of homozygotes, E(HOM) = expected number of homozygotes, N(NM) = number of non-missing genotypes and F = inbreeding coefficient estimate.

To analyze a large number of SNP markers, I downloaded the HapMap genotype and pedigree data for the Yoruba trio Y117 from the HapMap website, considering only a thousand SNP markers per individual. I generated a single genotype file for each individual in order to automate the production of the PLINK .ped and .map files with Perl (a minimal sketch of this conversion appears at the end of this section).

Pairwise IBD estimation: Identity-by-descent analysis is useful for detecting pairs of individuals who look too similar to each other, that is, more similar than expected by chance in a random sample. Genome-wide IBD can be calculated as long as a large number of SNPs is available (ideally 100K or more)1. It is run by adding the --genome option to the input files. The output for kinship estimation is:

FID1 IID1 FID2 IID2 RT EZ  Z0   Z1     Z2    PI_HAT PHE DST  PPC    RATIO
1    14   1    11   PO 0.5 0.00 1.00   0.00  0.5    1   0.5  NA     NA
1    14   1    1    OT 0   0.00 0.9851 0.014 0.5075 0   0.75 0.7603 NA
1    13   6    1    UN NA  1.00 0      0     0      1   0.5  0.078  0.0

where FID1 = family ID of the first individual, IID1 = individual ID of the first individual, FID2 = family ID of the second individual, IID2 = individual ID of the second individual, RT = relationship type, EZ = expected IBD sharing given the PED file, Z0 = P(IBD=0) (probability of sharing 0 alleles), Z1 = P(IBD=1) (one allele shared), Z2 = P(IBD=2), PI_HAT = proportion of IBD, PHE = phenotype code for the pair, DST = IBS distance, PPC = IBS binomial test, and RATIO = ratio of HETHET to IBS0 SNPs.

KING tool for GWAS: The KING tool works in a similar fashion to PLINK. Its advantage over PLINK is that it can flag pedigree errors by estimating kinship coefficients for all pairwise relationships3; moreover, its kinship estimates for closely related individuals are highly accurate. Compared with PLINK, the tool is very fast and robust to population structure, and it can also take complicated pedigrees into account.
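As referenced above, a minimal sketch of the genotype-to-PED/MAP conversion. The input layout (one "rsID chromosome position allele1 allele2" line per marker) and the file names are hypothetical stand-ins for the HapMap per-individual genotype files.

use strict;
use warnings;

# Usage: perl geno2plink.pl <family_id> <individual_id> <genotype_file>
my ($fid, $iid, $geno_file) = @ARGV;

open my $in,  '<', $geno_file or die "Cannot read $geno_file: $!";
open my $ped, '>', "$iid.ped" or die "Cannot write PED: $!";
open my $map, '>', "$iid.map" or die "Cannot write MAP: $!";

# First six mandatory PED columns: FID IID PAT MAT SEX PHENO
# (parents unknown = 0, sex unknown = 0, phenotype missing = -9).
my @ped_row = ($fid, $iid, 0, 0, 0, -9);

while (my $line = <$in>) {
    chomp $line;
    my ($rsid, $chr, $pos, $a1, $a2) = split /\s+/, $line;
    print {$map} join(' ', $chr, $rsid, 0, $pos), "\n";  # genetic distance set to 0
    push @ped_row, $a1, $a2;                             # genotypes from column 7 onward
}

print {$ped} join(' ', @ped_row), "\n";
close $_ for $in, $ped, $map;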
Installing KING: KING version 1.2 can be downloaded freely from its website for Linux, Windows, Mac and Ubuntu. KING is run through a command-line interface and performs accurate GWAS analyses for population samples as large as 10,000+ individuals3.

Input files: The input files comprise a data file (-d), a pedigree file (-p) and a map file in MERLIN format or, for increased efficiency, a single binary-format file (-b), run for example as ./king -b ex.bgeno.gz --kinship --related. The PED and MAP file formats of PLINK and KING are the same. KING, however, reads an additional data file with one row per data item in the pedigree file, indicating each item's type:

<contents of data.dat>
A some_disease
T some_trait
M some_marker
M1 Marker 1
....
</contents of data.dat>

Running KING: The pedigree information file and the genotype files were downloaded for the trio NA19238, NA19239 and NA19240 from the HapMap website. I then generated three genotype files, one per individual, by aggregating the genotypes from all the downloaded per-chromosome files, and created a map file for each individual from the pedigree file. To analyze over a million SNP sites in these individuals, I wrote a Perl script that converts their genotypes into an input pedigree file with a million markers, and similarly used Perl to convert the trio's map file into the MAP format accepted by KING. I also automated the production of a DAT file with Perl for any number of markers given as an argument, as sketched below.
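A minimal sketch of that DAT generator, assuming the same "one row per data item" convention as the example above; the disease row and marker naming are illustrative.

use strict;
use warnings;

# Usage: perl make_dat.pl <number_of_markers> > ex.dat
my ($n_markers) = @ARGV;
die "Usage: $0 <number_of_markers>\n" unless defined $n_markers;

# Affection status first, then one marker row per SNP in the PED file.
print "A some_disease\n";
print "M snp$_\n" for 1 .. $n_markers;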
The king.bgeno file is the binary form of the .ped, .map and .dat files and can be produced by running:

./king -d ex.dat -p ex.ped -m ex.map --binary

KING is then run on the binary file to get the final kinship estimates for the three individuals from the HapMap data3:

./king -b ex.bgeno --kinship

The output king.kin file is as follows:

FID ID1 ID2 N_SNP  Z0    Phi    HetHet IBS0   Kinship Error
117 1   2   494819 0.000 0.2500 0.278  0.0004 0.2486  0
117 1   3   498224 0.000 0.2500 0.268  0.0002 0.2446  0
117 2   3   496116 1.000 0.0000 0.206  0.1197 -0.0306 0

This is the within-family relationship case, where FID = family ID for the pair, ID1 = individual ID of the first individual in the pair, ID2 = individual ID of the second, N_SNP = number of SNPs with no missing genotypes in either individual, Z0 = Pr(IBD=0), Phi = kinship coefficient specified by the pedigree, HetHet = proportion of SNPs with double heterozygotes, IBS0 = proportion of SNPs with zero identity by state (an estimate of the difference between individuals), Kinship = kinship coefficient estimated from the SNP data, and Error = difference between the estimated and specified kinship coefficients (1 for error, 0.5 for warning). A negative kinship coefficient indicates an unrelated pair. Estimated kinship coefficients of > 0.354, [0.177, 0.354], [0.0884, 0.177] and [0.0442, 0.0884] correspond to duplicate/MZ-twin, 1st-degree, 2nd-degree and 3rd-degree relationships, respectively. The main disadvantage of this tool is its relationship inference for more distant relationships.

Frustration and Future Work: The main frustration in this work was optimizing the code for around two million variant sites. Future work would be to plot the kinship scores from both PLINK and KING and analyze the correlation between the two tools, to scale up to genome-wide association studies of 2,000 or 3,000 individuals, and to use gPLINK and Haploview for more detailed analyses.

References:
1. PLINK: http://pngu.mgh.harvard.edu/~purcell/plink/dosage.shtml#format
2. Visualizing sample relatedness in a GWAS using PLINK and R: http://gettinggeneticsdone.blogspot.com/2009/10/visualizing-sample-relatedness-in-gwas.html
3. KING: http://people.virginia.edu/~wc9c/KING/manual.html

Project 3: Developing Circos and JavaFX:
The first half of this project was organized and directed as a team project.

Background: Circos is a software package for visualizing genomic data and information. It lays data out in a circular layout, which makes it ideal for exploring relationships between objects or positions. It is very easily automated, since it is controlled by plain-text configuration files. It is written in Perl and produces bitmap (PNG) and vector (SVG) images, with accompanying HTML image maps, from plain-text configuration and input files2. Circos is free software, and the circos-0.54 package, the Circos tools (0.15), configuration files and data files were downloaded from its website. The configuration files support drawing ideograms, highlights, tick marks, links and ribbons, scatter plots, line plots, histograms, heat maps, image maps, and so on3. Usually the image-generation process is driven by a central configuration file, which in turn imports other configuration files such as global color and font settings. The steps involved are (1) parsing data files into Circos format, (2) deciding how the data are to be shown, (3) constructing a configuration file, either from scratch or from the tutorials on the website, and (4) running Circos to create the PNG and SVG files3. The configuration files have a block format, for instance:

<plot>
file = data.txt
r0   = 0.5r
r1   = 0.6r
</plot>
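To show how such blocks hang together, here is a minimal central configuration file in the same spirit; the file names and radii are illustrative, and the include paths follow the layout of the Circos distribution.

karyotype = data/karyotype/karyotype.human.txt

<ideogram>
<spacing>
default = 0.005r
</spacing>
radius    = 0.90r
thickness = 20p
fill      = yes
</ideogram>

<plots>
<plot>
type = histogram
file = data.txt
r0   = 0.5r
r1   = 0.6r
</plot>
</plots>

<image>
dir  = out
file = genome.png
</image>

<<include etc/colors_fonts_patterns.conf>>
<<include etc/housekeeping.conf>>

The image is then generated with a command such as ./bin/circos -conf circos.conf -outputdir out -outputfile genome.png.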
Running Circos: The primary objective of this project was to develop a GUI for Circos. Circos needs the Perl core modules and, in addition, Config::General, GD, GD::Polyline, List::MoreUtils, Math::Bezier, Math::Round, Math::VecStat, Params::Validate, Readonly, Regexp::Common and Set::IntSpan1. After installing and testing for the correct modules, Circos can be run from the command line, where the circos Perl script takes the central configuration file as input, and the path to the output directory and the output image name can be specified3.

Image maps: HTML image maps associate areas of an image with web links. Any element in the image, such as an ideogram, tick, band, ribbon or highlight, can have an associated URL, so the image map is useful for producing clickable regions within the image. After an SVG or PNG image is produced together with its HTML, parameters are set in the configuration file to produce an image map. To associate the image with its image map, the contents of the image map .html file are placed in the page and the USEMAP attribute is used to tie the image to its map3.

Circos image generation: The configuration files were also modified to produce links in the Circos image, as shown below (figure: Circos image with link tracks).
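For the image-map association described above, the HTML glue looks roughly like the following; the map name and coordinates are illustrative, since Circos writes the <map> block for you.

<img src="circos.png" usemap="#circosmap" alt="Circos plot">
<map name="circosmap">
  <!-- one <area> per clickable element, as emitted by Circos -->
  <area shape="poly" coords="120,35,140,60,110,80" href="http://example.org/chr1" alt="chr1">
</map>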
Utilizing JavaFX 1.3 in the NetBeans 6.9 IDE: This phase of the project was independent research into how best a graphical user interface could be added to Circos for effective visualization. The JavaFX 1.3 SDK, JavaFX 1.3 Production Suite and NetBeans IDE 6.9 for JavaFX 1.3 can be downloaded from Oracle's website for Windows, Mac, Linux and OpenSolaris4. (The latest release, JavaFX 2.0, does not support Mac or Linux.) JavaFX 1.3 is used for developing and building GUI applications, with enhanced support for UI controls, CSS and the like, as well as support for TV applications and a TV emulator. JavaFX ships with the NetBeans IDE, which includes the JavaFX Composer4; NetBeans also provides improved stability and performance, a rich set of UI components and enhancements to the JavaFX editor.

(Figure: the NetBeans IDE with the project open.) As can be seen from the figure, the name of the main project (package) can be specified by clicking File and browsing through the options. For instance, Chaptersix is the main project, which, when clicked, expands into different files. These files are JavaFX Script snippets that add graphics to the applications being developed. The palette on the right side of the NetBeans IDE contains Swing, Stage, Scene, text boxes and other kinds of components and actions that can be applied to the images.

Swing application: Swing components are very powerful and can be developed with JavaFX. (Figure: the Swing application developed using JavaFX.)

Image application: Images can be loaded by specifying a properly formatted URL in the code snippet. (Figure: the Circos image application produced by running the JavaFX code.)

Applying effects to the Circos image: Various kinds of effects, such as hue, brightness and blur, can be applied to the images for clearer visualization. The JavaFX Production Suite, downloaded along with the JavaFX SDK, has tools/plugins for Adobe Photoshop and Adobe Illustrator, so images can be edited in Photoshop and converted to the .FXZ format for easy use in animations and videos within the applications. (Figure: the Circos image produced after applying effects.)
Given below is the snippet used here to apply effects to the Circos genome data visualization tool for cancer:

/*
 * Chapter7.fx
 *
 * v1.0 - Neha Gupta
 * 7/5/2011 - created
 *
 * Applying effects
 */
package chaptersix;

import javafx.stage.Stage;
import javafx.scene.Scene;
import javafx.fxd.FXDNode;
import javafx.scene.effect.*;
import javafx.scene.image.ImageView;

// Path to the Circos image exported from Photoshop as .fxz
var imagePath : String = "file:///depts/research/cancerCircos.fxz";
var circosImage : FXDNode = FXDNode { url : imagePath };

/**
 * @author ngupta
 */
Stage {
    title: "Effects"
    onClose: function () { }
    scene: Scene {
        width: 1000
        height: 1000
        content: [ SetImages(circosImage) ]
    }
}

// Reposition the "circos" node inside the loaded .fxz graphic.
function SetImages(image : FXDNode) : FXDNode {
    var circos : ImageView;
    circos = (image.getNode("circos") as ImageView);
    circos.translateX = -50;
    circos.translateY = -50;
    return image;
}

// Build the effect applied to the image (a motion blur).
function ApplyEffects() : Effect {
    var effectToApply : Effect;
    effectToApply = MotionBlur {
        angle: 45
        radius: 10
    };
    return effectToApply;
}

Using events: The mouse events supported by JavaFX include onMouseMoved, onMouseEntered, onMouseClicked, onMouseExited, onMousePressed, onMouseDragged and onMouseReleased. (Figures: the onMouseClicked, onMouseReleased and onMouseDragged events produced after running the JavaFX snippet files.)
Mouse-over events have also been applied to the Circos image7 (figure: Circos image with mouse-over events).
Frustrations and Future Work: The future work would be to use the latest release, JavaFX 2.0. A key feature of that release is the scene graph, a tree data structure that maintains an internal model of the graphical objects (links, ideograms, ticks, and so on) in the application. It knows which objects or elements of the image should act as clickable regions, because the application is built from Scene, Group, Line, Arc, Circle and similar nodes, and each element of the graph reacts to the full array of mouse events. Developing animations and videos in Circos would also be future work. Overall, the project was well directed; the single frustration was the repetitive work assigned to two people.
References:
1. Circos modules: http://mkweb.bcgsc.ca/dev/circos/tutorials/lessons/configuration/perl_and_modules/
2. Circos readme: http://circos.ca/software/readme/
3. Circos tutorials: http://circos.ca/tutorials/lessons/
4. JavaFX: http://javafx.com
5. JavaFX Production Suite: http://download.oracle.com/javafx/1.3/reference/releasenotes/production-suite-release-notes-1-3.html#2
6. Inkscape: http://silveiraneto.net/2008/11/21/inkscape-and-javafx-working-together/
7. JavaFX - A Beginner's Guide

Project 4: Automating PubMed Searches:
This project was well directed and organized as an independent project. Its main objective was to extract, within a few seconds, all the abstracts for a large number of genes and phenotypes/disorders from PubMed. I researched tools that can automate PubMed searches, and first tried PubCrawler.

(1) PubCrawler - automated retrieval of PubMed and GenBank reports: PubCrawler is free software and can be downloaded from its website, which provides the PubCrawler Perl script and a configuration file. After local installation it is run from the command line. An example run is:

pubcrawler.pl [-add_path /cwd/lib] [-c <config_file>] [-check] [-d <directory>] [-db <database>] [-fullmax <max-docs in full>] [-getmax <max-docs to retrieve>] [-i] [-out <output-file>] [-relentrezdate <relative-entrez-date>] [-s <search_term>] [-viewdays <view-days>]

The above run includes the main command-line flags for automating retrieval of PubMed reports. The -db flag names the database file that PubCrawler uses. The -fullmax flag specifies the maximum number of full-length reports to show per search, and -getmax the maximum number of documents to retrieve per search. The -i flag includes the configuration file in the HTML output, and -out names the HTML output file. The -relentrezdate flag sets the maximum age of a document to be retrieved (1 year, 2 years, 5 years, 'no limit', and so on), -s supplies the PubMed search term2, -viewdays sets the number of days each document remains visible, and -check verifies that everything is set up correctly. All of these command-line parameters, as well as the optional flags, can also be set in the PubCrawler configuration file; if a command-line option and a configuration-file setting conflict, the command-line setting takes priority.

Search terms: Search terms are defined exactly as in the Entrez search system. For instance, a boolean search for the abstracts linking a gene and a phenotype would be pubcrawler.pl -s 'pubmed KRAS [ALL] AND Cancer [ALL]' entered at the command line, which overrides all other queries in the configuration file2. To automate multiple queries through this program, I wrote a Perl script that builds a new configuration file containing such search queries for any number of genes and any number of phenotypes (sketched below).
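A minimal sketch of that generator. The per-line query syntax is an assumption modeled on the -s flag above, so the exact format should be checked against the PubCrawler configuration-file documentation; the input files are hypothetical one-item-per-line lists.

use strict;
use warnings;

# Usage: perl make_queries.pl genes.txt phenotypes.txt >> pubcrawler.config
my ($gene_file, $pheno_file) = @ARGV;

my @genes  = read_list($gene_file);
my @phenos = read_list($pheno_file);

# One boolean PubMed query per gene/phenotype pair, in the same form
# accepted by the -s flag (assumed valid as a configuration line too).
for my $gene (@genes) {
    for my $pheno (@phenos) {
        print "pubmed $gene [ALL] AND $pheno [ALL]\n";
    }
}

sub read_list {
    my ($file) = @_;
    open my $fh, '<', $file or die "Cannot open $file: $!";
    chomp(my @items = <$fh>);
    return grep { length } @items;   # skip blank lines
}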
Optional variables: Some of the optional variables include (1) work_dir, the working directory for PubCrawler; (2) mail, an e-mail address to which results are sent after each PubCrawler run; (3) format, the format for reports from PubMed and GenBank ('DocSum', 'Brief', 'Abstract', 'FASTA', and so on); (4) time_out, the time in seconds to wait for internet responses, defaulting to 180; (5) proxy, a proxy server for the internet connection; and (6) indent, which indents PubCrawler comments by the given number of pixels to align them with Entrez output.

Drawback: The main disadvantage of this approach is time consumption. A single query takes around 5 minutes to return all the research papers with all fields (author, primary accession number, publication date, etc.) when no limit is placed on the Entrez release date. I also tried running around 120 queries, corresponding to 120 genes and their implicated phenotypes/disorders; the complete run took around 10 hours, which is quite inefficient.

(2) Using NCBI EUtils: The EUtils are very powerful and efficient at grabbing research articles from the PubMed database. Different URLs perform different functions. The following link exemplifies the URL that retrieves the abstract for a single PubMed ID:

http://www.ncbi.nlm.nih.gov/entrez/queryd.fcgi?db=pubmed&cmd=Retrieve&dopt=Abstract&list_uids=18461658&itool=pubmed_docsum

Similarly, the URL to display PMIDs in XML retrieval mode is:

http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&id=11748933,11700088&retmode=xml

Creating URLs for larger gene and phenotype lists: I modified the EUtils URLs to fetch the abstracts, and the PubMed IDs, for larger lists of genes and phenotypes. The new URL for getting the PubMed IDs from a list of genes is:

http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&retstart=0&retmax=10000000&usehistory=y&retmode=xml&term=Amelogenesisimperfecta and (gene1 or gene2 or gene3 or gene4 or .......)

and the new URL for getting abstracts is:

http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&id=1,2,3&retmode=html&rettype=abstract

Perl scripts: With these URL templates in hand, I wrote a Perl script that, given a list of genes and a list of phenotypes/disorders, builds the URL that fetches from PubMed the IDs of all research papers associated with the lists. The content behind the URL, when clicked manually, is in XML format, so I extended the script to grab the URL's content automatically and parse the XML into a Perl hash-of-hashes-of-arrays, from which the list of PubMed IDs is printed (a sketch follows below).
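A minimal sketch of that step, assuming LWP::Simple for the HTTP fetch and a simple regex over ESearch's <Id> elements instead of a full XML parser; the gene and phenotype terms are illustrative.

use strict;
use warnings;
use LWP::Simple qw(get);
use URI::Escape qw(uri_escape);

# Illustrative inputs; in the real script these come from the gene/phenotype lists.
my @genes     = qw(KRAS TP53 BRAF);
my $phenotype = 'Cancer';

# Build the boolean ESearch term: phenotype AND (gene1 OR gene2 OR ...).
my $term = sprintf '%s AND (%s)', $phenotype, join(' OR ', @genes);

my $url = 'http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi'
        . '?db=pubmed&retmax=10000000&retmode=xml&term=' . uri_escape($term);

my $xml = get($url) or die "ESearch request failed\n";

# Pull the PMIDs out of the <Id>...</Id> elements.
my @pmids = $xml =~ m{<Id>(\d+)</Id>}g;
print "$_\n" for @pmids;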
With the list of PubMed IDs, the abstract URL was constructed, its content was again grabbed automatically using Perl modules and their functions, and the returned XML was parsed to print the list of abstracts to standard output. However, there is a URL limit on the number of search terms: around 200 boolean terms for genes and implicated diseases. The maximum number of abstracts the script currently retrieves is 800.

Frustrations and Future Work: This project was very challenging, and the key frustration, which is also the future work, is to remove this URL limit so that the query can take an arbitrarily long gene and phenotype list and still grab the contents of the page automatically. Changing the parameters, for instance restricting article retrieval by author name, Entrez release date or accession number, would also be future work.

References:
1. Integrating automated literature searches and text mining in biomarker discovery: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2956392/
2. PubCrawler and EUtils: http://www.my-whiteboard.com/how-to-automate-pubmed-search-using-perl-php-or-java/

Project 5: 1000 Genomes and BioGPS:
This project was organized as a team project. First, the genotype data for the 1,000 individuals was downloaded for hg18 in VCF format. The main idea behind this approach is to take the union of the 1000 Genomes data with the company's 50 individual genomes and analyze the union data with PLINK association studies. To proceed, three site sets were generated: the sites unique to the 1000 Genomes data that are not called in the 50-individual data, the sites unique to the 50-individual data that are not called in the 1000 Genomes data, and the sites called in both (figure: Venn diagram of the sites unique to the 1000 Genomes data, the sites unique to the 50 individual genomes, and the sites common to both).

The list of approximately two million sites unique to the 1000 Genomes data is used in this work. The key question here was how to get the genotypes at these two million sites. To go forward, I wrote a Perl script that queries the company's internal database to grab the genotypes at these two million sites for each of the fifty subject genome IDs (a sketch of the idea follows below).
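Since the internal database schema is proprietary, the following is only a shape-of-the-solution sketch with invented table and column names; it fetches sites in batches rather than one query per site, which relates to the slowdown described below.

use strict;
use warnings;
use DBI;

# Hypothetical connection and schema; the real table and columns differ.
my $dbh = DBI->connect('dbi:mysql:database=genomes;host=localhost',
                       'user', 'password', { RaiseError => 1 });

# Fetch genotypes for a window of sites at once instead of one query per site.
my $sth = $dbh->prepare(
    'SELECT chrom, pos, genotype FROM subject_genotypes
     WHERE subject_id = ? AND chrom = ? AND pos BETWEEN ? AND ?'
);

for my $subject_id (1 .. 50) {                           # the fifty subject genome IDs
    $sth->execute($subject_id, 'chr1', 1, 50_000_000);   # illustrative window
    while (my ($chrom, $pos, $gt) = $sth->fetchrow_array) {
        print join("\t", $subject_id, $chrom, $pos, $gt), "\n";
    }
}
$dbh->disconnect;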
Simultaneously, I did data mining with the BioGPS gene portal, whose database holds expression charts for genes in humans, mice, rats and other species. I researched the correlation-parameter settings that display the correlation between the expression charts of two genes, and downloaded the expression data for around 44,000 probe sets from the website in order to build a matrix of expression vectors against the company's gene expression data.

Frustrations and Future Work: The initial Perl script was slow on these sites because it queried the database once for each of the two million sites, so optimizing the code was important. My future work would be to convert the union data to the PED and MAP input formats supported by PLINK and go forward with the association studies.