Eisenhart 1
ABSTRACT
The UCSC Genome Browser is widely used across the scientific community. In 2013 UCSC released the
fifth major reference genome, hg38. However, the scientific community has been reluctant to adopt hg38 over its
predecessor, hg19. To make hg38 more appealing, key tracks from hg19 were converted to hg38 using the
liftOver program. Additionally, the tracks were modified to use a biologically relevant color scheme. A comparison of
analysis pipeline results across hg19 and hg38 quantifies the benefits of hg38 over hg19. Hg38 can be
made more appealing to the scientific community by importing existing data and by building wider acceptance that the
data on hg38 is of higher quality.
INTRODUCTION
The completion of the Human Genome Project marked the first creation of a
human reference genome. The reference was completed in February of 2003 at UCSC; it was
given the name hg16 by UCSC and NCBI34 by NCBI (6). The first reference genome had
3,091,959,510 bases with a total gap length of 226,873,222. There were 490 scaffolds, with a
scaffold N50 of 29,105,798, and 1,756 contigs with a contig N50 of 28,857,747 (6). This
represented years of scientific work and millions of dollars. Today the most used
reference genome, hg19, covers 3,137,144,693 bases with a gap length of 239,850,738
bases (7). These totals are nearly identical to those of the first reference genome, hg16, but the
auxiliary statistics show the benefit of newer technologies and methods. Hg19/GRCh37 has
258 scaffolds, with a scaffold N50 of 46,395,641, and 459 contigs with a
contig N50 of 38,508,932 (7). Compared to hg16, the number of scaffolds and contigs has
drastically decreased, while the N50 values have increased. Clearly hg19 is a much more refined
assembly than hg16.
Looking at the newest reference genome, hg38, the gap length and N50 statistics have a clear edge over
hg19. The total sequence length is 3,209,286,105, with a total gap length of 159,970,007.
Hg38 has 735 scaffolds with an N50 of 67,794,873; additionally there are 1,385 contigs with an
N50 of 56,413,054 (8).
Reference     Total Sequence Length   Total Gap Length   # of Scaffolds   Scaffold N50   # of Contigs   Contig N50
hg16 (2003)   3,091,959,510           226,873,222        490              29,105,798     1,756          28,857,747
hg19 (2009)   3,137,144,693           239,850,738        258              46,395,641     459            38,508,932
hg38 (2013)   3,209,286,105           159,970,007        735              67,794,873     1,385          56,413,054
Table 1: A comparison of the first, fourth and fifth reference genomes (6,7,8). Note the
increasing scaffold and contig N50 values, and the difference in total gap length between
hg19 and hg38 (11).
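The N50 values in Table 1 can be read as follows: when the sequences are sorted from longest to
shortest, N50 is the length of the sequence at which the running total first reaches half of the
combined length. The sketch below is only an illustration of that definition on a toy list of contig
lengths; the function name and the toy values are assumptions and are not taken from the assembly reports.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        # The first sequence that pushes the running total to half of the
        # combined length defines the N50.
        if running >= half_total:
            return length
    raise ValueError("lengths must not be empty")


# Toy contig lengths (hypothetical), not real assembly data.
toy_contigs = [80, 70, 50, 40, 30, 20, 10]
print(n50(toy_contigs))
```

A higher N50 with a similar total length means the same sequence is held in fewer, longer pieces,
which is the pattern seen when moving from hg16 to hg19.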
The GRCh38/hg38 Genome Browser build has several appealing features, including the
improved assembly statistics. Hg38 contains 455 different completed sequences; of these
sequences, 261 are different haplotypes. This is a remarkable upgrade from hg19, which has a
total of 93 sequences. Hg38 also has greater coverage of sequences near centromeric regions
(11). Previously, large megabase-sized gaps were used to represent centromeric regions. In
hg38 these were replaced by sequences made from centromeric models created by Karen Miga
et al. (11). However, despite these clear advantages over hg19, there is a reluctance across the
scientific community to accept the Genome Browser hg38 build over its hg19 predecessor.
This reluctance is likely due to multiple factors. Far more Genome Browser tracks are
supported for hg19 than for hg38. Hg19 was the major reference genome
during a critical time in biotechnology. As next generation sequencing techniques became
widespread, the cost of sequencing drastically decreased. In response, the amount of
data produced drastically increased. A large portion of this data made its way onto the hg19
Genome Browser tracks. This in turn has a compounding effect: researchers
create new data aligned to hg19, which provides more incentive to continue using hg19.
Additionally, analysis with hg19 still produces viable results, so researchers may see
little reason to adopt hg38 despite its higher quality.
METHODS
Lifting Tracks
LiftOver is a C program that takes in a .bed format file and a .chain file that provides
coordinate conversion data (4). The program converts the chromosome coordinates in the
input .bed file between builds of the same organism. Using this tool allows data from
previous builds to be incorporated in the current build. Files that have had their coordinates
converted using this program are termed ‘lifted’. This is useful for data which was costly or
difficult to prepare for the original build. LiftOver is relatively fast and easy to run. By using it to
convert hg19 tracks to hg38 tracks, new tracks were brought to hg38, increasing its appeal.
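The idea of coordinate conversion can be illustrated with a much simplified sketch. It assumes the
conversion data has already been reduced to ungapped aligned blocks on the forward strand, and it maps
an interval only when it falls entirely inside one block. Real .chain files and the liftOver program
also handle gaps, strand, chromosome changes and partial matches, none of which is modeled here; the
`Block` type and `lift_interval` function are hypothetical names used only for illustration.

```python
from typing import NamedTuple


class Block(NamedTuple):
    """One ungapped aligned block (hypothetical, simplified from a chain)."""
    chrom: str
    old_start: int
    old_end: int
    new_start: int


def lift_interval(chrom, start, end, blocks):
    """Return the lifted (chrom, start, end), or None if the interval is unmapped."""
    for block in blocks:
        inside = block.old_start <= start and end <= block.old_end
        if block.chrom == chrom and inside:
            # Within an ungapped block every base shifts by the same offset.
            offset = block.new_start - block.old_start
            return (chrom, start + offset, end + offset)
    return None


# Toy block: old coordinates 1000-2000 on chr21 start at 1500 in the new build.
toy_blocks = [Block("chr21", 1000, 2000, 1500)]
print(lift_interval("chr21", 1100, 1200, toy_blocks))
print(lift_interval("chr21", 1900, 2100, toy_blocks))
```

In this sketch an interval that crosses a block boundary is reported as unmapped, which mirrors the
way records that cannot be converted are set aside rather than guessed at.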
The tracks chosen for lifting originated in the ENCODE project (2). The ENCODE project
seeks to identify all regulatory information in the human genome. Accordingly, the lifted tracks all
focus on regulatory aspects; four tracks were chosen in total for initial lifting. One track
shows transcription levels assayed by sequencing of transcribed RNA (RNA-seq) from multiple cell
types; it is referred to as the ENCODE transcription track (2). The remaining three tracks
were generated from ChIP-seq data and show where modifications of histone proteins occur on the
genome (2). The first track, ENCODE H3K4Me1, shows where modifications of histone proteins
near regulatory regions have been identified. This histone mark is the
monomethylation of lysine 4 of the H3 histone protein, and is commonly associated with
enhancers and DNA regions downstream of transcription starts. The second track, ENCODE
H3K4Me3, displays enrichment levels of H3K4Me3 across the genome. H3K4Me3 is the trimethylation
of lysine 4 of the H3 histone protein; it is commonly associated with active promoters. The final
track, ENCODE H3K27Ac, shows enrichment levels of H3K27Ac across the genome. H3K27Ac
is the acetylation of lysine 27 of the H3 histone protein; its exact function is still unknown, but it
is thought to enhance transcription (2).
Each of these tracks was built by comparing data from several underlying cell lines. Across the
four tracks there were seven common cell lines: GM12878, H1hESC, HSMM, HUVEC, K562, NHEK and
NHLF. The ENCODE transcription track had two additional cell lines, HeLaS3 and HepG2 (2).
Figure 1. Hg19 target tracks. Note the multiple colors for each track; each color corresponds to
values from a different cell line. The transcription track is darker because it has two extra cell
lines, which in turn means two extra colors. (9)
Tshell Script
The data for each cell line was generated from an underlying file, so the liftOver process began by
identifying and lifting these underlying files. A total of 30 files were lifted. These
files are stored as .bigWig files and therefore needed to be converted to .bed format before
they could be lifted. The files were converted from .bigWig format to .bed format using the
kent/src utility bigWigToBedGraph. These .bed files were lifted using the liftOver program; the
unmapped records were sent to a null folder and were not considered. Next, the .bed file was
sorted using the unix sort command before being passed into the kent/src utility
bedRemoveOverlap, which removes overlapping records from a sorted .bed file. The smaller of
the overlapping records is removed. The output .bed file was then passed into the kent/src utility
bedGraphPack, which combines adjacent records representing the same value. Finally, the
completed .bed file was converted back into a .bigWig file using the kent/src utility
bedGraphToBigWig. This program needs a chrom.sizes file, which lists the sizes of all
chromosomes in the current build. These programs were embedded in a Tshell script that took
in a .bigWig file, an output filename, a chrom.sizes file, and a .chain file (3,4,5).
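The order of the steps above can be sketched as a list of command lines. This is an illustration in
Python rather than the Tshell script itself, and it only builds the commands without running them.
The argument order for each utility follows its usage message as best recalled and should be checked
against the installed tools; the intermediate file names, the `prefix` parameter, and the use of
/dev/null for unmapped records are assumptions.

```python
def build_lift_commands(big_wig, out_big_wig, chrom_sizes, chain, prefix="tmp"):
    """Return the command lines for lifting one bigWig file, in order.

    Nothing is executed here; each list could be passed to subprocess.run.
    """
    bed = f"{prefix}.bedGraph"
    lifted = f"{prefix}.lifted.bedGraph"
    sorted_bed = f"{prefix}.sorted.bedGraph"
    no_overlap = f"{prefix}.noOverlap.bedGraph"
    packed = f"{prefix}.packed.bedGraph"
    return [
        # 1. bigWig -> bed (bedGraph) so the coordinates can be lifted.
        ["bigWigToBedGraph", big_wig, bed],
        # 2. Lift coordinates; unmapped records are discarded (assumed /dev/null).
        ["liftOver", bed, chain, lifted, "/dev/null"],
        # 3. Sort by chromosome, then by numeric start position.
        ["sort", "-k1,1", "-k2,2n", "-o", sorted_bed, lifted],
        # 4. Drop the smaller of any overlapping records.
        ["bedRemoveOverlap", sorted_bed, no_overlap],
        # 5. Merge adjacent records that carry the same value.
        ["bedGraphPack", no_overlap, packed],
        # 6. bed (bedGraph) -> bigWig, using the target build's chrom.sizes.
        ["bedGraphToBigWig", packed, chrom_sizes, out_big_wig],
    ]


# Illustrative file names only.
for command in build_lift_commands(
    "in.bigWig", "out.bigWig", "hg38.chrom.sizes", "hg19ToHg38.over.chain"
):
    print(" ".join(command))
```

Keeping the commands as data makes the order of the six steps easy to inspect before anything is run.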
C Wrapper
The genome browser data structure is large and complex. The associated files that
needed to be lifted were listed in a .ra file. The .ra files are hierarchical files that are used
internally to help structure tracks. Due to the number of files and their location in the .ra file a
C program was written to process all the files at once. The C program took in a .ra file and
identified the lift target filename, then called the Tshell script described above. The program
was able to successfully lift all 30 files needed to build the four ENCODE tracks.
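The wrapper's first task, finding the file names inside a .ra file, can be sketched as follows. The
sketch is in Python rather than C, and it assumes a .ra file is a series of stanzas separated by blank
lines, each line holding a key followed by a value. The paper does not state which field holds the
file name, so the `bigDataUrl` key, the function names, and the sample stanzas are all assumptions
made for illustration.

```python
def parse_ra(text):
    """Split .ra text into stanzas, each a dict of key -> value."""
    stanzas, current = [], {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            # A blank line closes the current stanza.
            if current:
                stanzas.append(current)
                current = {}
            continue
        if line.startswith("#"):
            continue  # skip comment lines
        key, _, value = line.partition(" ")
        current[key] = value.strip()
    if current:
        stanzas.append(current)
    return stanzas


def lift_targets(text, key="bigDataUrl"):
    """Return the value of `key` (assumed file-name field) from every stanza that has it."""
    return [stanza[key] for stanza in parse_ra(text) if key in stanza]


# Hypothetical .ra content, not taken from the browser's real track files.
sample_ra = """track exampleComposite
type bigWig

    track exampleCellA
    bigDataUrl exampleCellA.bigWig

    track exampleCellB
    bigDataUrl exampleCellB.bigWig
"""
print(lift_targets(sample_ra))
```

Each name returned this way would then be handed to the lifting script, one call per file.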
Linking New Tracks to the Data
To complete the lifting, a new .ra file was created. The .ra file provides track information
such as display color and visual ‘lifted’ tags. The lifted .bigWig files were added to the hg38
SQL database, allowing the browser to access them. Lastly, the .html files were copied from hg19
and adapted to hg38 (5).
New Color Scheme
The four ENCODE tracks were created by comparing data from various cell lines. In
hg19 each cell line is assigned a color in no specific order and the track is generated by
overlaying the various colors. This color scheme was modified in hg38 so that the colors
reflect biological similarity.
The cell lines used for the ENCODE tracks are a subset of a larger collection of 79 cell
lines, each of which has a corresponding .bigWig file. The new color scheme started by
creating a binary tree of the 79 different cell types based on similarity. The
program takes in a list of bigWig files and systematically calculates the difference between
every pair of bigWig files, joining the two that are most similar. The program iterates in this
fashion, generating a binary tree; as nodes are joined they are assigned a color based on when they
joined the tree.
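The joining and coloring procedure can be sketched on toy data. This is a simple greedy joining under
stated assumptions, not necessarily the algorithm used by the actual program: each cell line is
represented by a short list of signal values instead of a bigWig file, the difference between two
nodes is the mean absolute difference of their values, a joined node is the average of its two
children, and colors run from red to purple in join order. All names below are hypothetical.

```python
import colorsys
from itertools import combinations


def mean_abs_diff(a, b):
    """Assumed difference measure between two equal-length signal vectors."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def join_order(signals):
    """Repeatedly join the two most similar nodes; return leaves in join order."""
    clusters = {name: (list(vec), [name]) for name, vec in signals.items()}
    order = []
    while len(clusters) > 1:
        a, b = min(
            combinations(sorted(clusters), 2),
            key=lambda pair: mean_abs_diff(clusters[pair[0]][0], clusters[pair[1]][0]),
        )
        vec_a, leaves_a = clusters.pop(a)
        vec_b, leaves_b = clusters.pop(b)
        for leaf in leaves_a + leaves_b:
            if leaf not in order:
                order.append(leaf)
        merged = [(x + y) / 2 for x, y in zip(vec_a, vec_b)]
        clusters[f"({a},{b})"] = (merged, leaves_a + leaves_b)
    # Covers the single-input case, where no join ever happens.
    for _, leaves in clusters.values():
        order.extend(leaf for leaf in leaves if leaf not in order)
    return order


def assign_colors(order):
    """Map each leaf to an RGB color, red for the first joined to purple for the last."""
    n = len(order)
    colors = {}
    for rank, name in enumerate(order):
        hue = 0.75 * rank / (n - 1) if n > 1 else 0.0
        r, g, b = colorsys.hsv_to_rgb(hue, 1.0, 1.0)
        colors[name] = (round(r * 255), round(g * 255), round(b * 255))
    return colors


# Toy signal vectors (hypothetical), one per cell line.
toy_signals = {"A": [1, 1, 1], "B": [1, 1, 2], "C": [5, 5, 5], "D": [9, 9, 9]}
toy_order = join_order(toy_signals)
print(toy_order)
print(assign_colors(toy_order))
```

Because the color depends only on join order, cell lines with similar signal end up with neighboring
hues, which is what gives the color scheme its biological meaning.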
Analysis of Lifted Tracks
To evaluate the quality of the lifted tracks, a peak calling pipeline was created
and run on both hg19 and hg38, generating two data sets. The data sets were then compared
to determine the differences in downstream analysis caused by using the respective builds.
Cell line GM12878 fastq files containing ChIP-seq data were aligned to both hg19 and
hg38 using BWA embedded in a shell script, eap_run_bwa_se (5). The shell script
eap_run_bwa_se uses the bwa commands aln and samse to align the fastq file. Next, the script
calls samtools to sort the generated .bam files. The resulting .bam files were put through a
peak calling pipeline centered around macs2, eap_run_macs2_chip_se. This script calls
the macs2 peak calling function to generate a tmp_peaks.xls output. This output is converted to
bigBed format before being converted into bigWig format; both file formats are produced for a
single input fastq file. The bigWig files were analyzed using the kent/src bigWigInfo
program (3,5).
All programs and scripts proposed in this paper are available through the Kent Source
for UCSC Genome Browser. (5)
DISCUSSION
The liftOver program kept over 95% of the bases in each cell line. The coverage
between tracks is very high, and the tracks on hg38 have the advantage of a better reference
genome. By converting these key tracks from hg19 to hg38, the
versatility and inclusiveness of hg38 has been increased, which makes hg38 more attractive for
future research endeavors. The colors in the new tracks were successfully updated to correspond
to the cell line colors in the radial dendrogram, so the color scheme is now biologically
motivated. Starting with red and ending with
purple, the color denotes when the node was added to the radial dendrogram, which in turn
corresponds to its order in the neighbor joining algorithm. The lifted tracks are being reviewed
by the UCSC Genome Browser quality assurance team before they are released to the public.
Figure 3: The ENCODE tracks on hg19 (top) and hg38 (bottom) viewing chromosome 21. Note
the different coloring scheme. (9, 10)
The lifted ENCODE regulation tracks can currently be viewed at
http://hgwdevceisenhart.cse.ucsc.edu/cgibin/hgGateway.
Comparison of Analysis on hg19 and hg38
Analysis of the bigWig files created by macs2 was completed using the kent/src utility
bigWigInfo. To determine the more effective pipeline, three fields were compared: the total
number of bases covered, the mean score per base, and the standard deviation. Since hg38
has more bases than hg19, it was expected that hg38 would have more bases covered.
Additionally, the improved scaffold and contig N50, coupled with the reduced total gap length,
was expected to give hg38 an advantage over hg19 in all areas.
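Pulling the three compared fields out of a bigWigInfo report can be sketched as below. The sketch
assumes the report is plain text with one `name: value` pair per line and that the relevant fields
are labeled `basesCovered`, `mean` and `std`; these labels should be checked against the output of
the installed utility. The sample text reuses the values reported for the first hg19 file and is not
a captured program output.

```python
def parse_big_wig_info(text):
    """Extract bases covered, mean and standard deviation from a report."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return {
        # Thousands separators are removed before converting to an integer.
        "basesCovered": int(fields["basesCovered"].replace(",", "")),
        "mean": float(fields["mean"]),
        "std": float(fields["std"]),
    }


# Illustrative report text (assumed layout) for one hg19 bigWig file.
sample_report = """basesCovered: 3,136,242,125
mean: 0.787020
std: 4.219821
"""
print(parse_big_wig_info(sample_report))
```

Collecting the fields into a dictionary per file makes it straightforward to line up the hg19 and
hg38 results side by side.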
File (.bigWig)                 Bases Covered   Mean       Std. Dev.
ctcfsc15914c20StdRawDataRep1   3,136,242,125   0.787020   4.219821
ctcfsc15914c20StdRawDataRep2   3,136,238,996   0.787021   3.884653
maxRawDataRep1                 3,136,222,978   0.420838   1.206457
maxRawDataRep2                 3,136,219,441   0.471495   1.273891
Table 2: Analysis pipeline bigWig info generated for hg19
The bigWig files generated for hg19 covered 99.97% of the genome; however, the
reference genome is roughly 7.6% gaps. Therefore some of the information present in the
bigWig files does not accurately model the underlying biology.
File (.bigWig)                 Bases Covered   Mean       Std. Dev.
ctcfsc15914c20StdRawDataRep1   3,098,365,734   0.802534   4.222092
ctcfsc15914c20StdRawDataRep2   3,098,357,635   0.802533   3.831295
maxRawDataRep1                 3,098,326,654   0.430072   1.099436
maxRawDataRep2                 3,098,348,535   0.481063   1.171373
Table 3: Analysis pipeline bigWig info generated for hg38
Since hg38 is larger than hg19, it was expected that more bases would be
covered in the hg38 bigWig files; instead, fewer bases were covered. The mean is higher in all
hg38 files, which suggests that the additional bases covered in hg19 are of lower quality. That
these bases are not covered in hg38 is likely due to the increased accuracy of the hg38 build
over hg19. Hg38 is 4.985% gaps, and the total number of bases covered by the bigWig file
accounts for 96.54% of the genome.
Reference Build   Percentage of genome that is a gap   Percentage of genome covered by bigWig   Mean
hg19              7.6455%                              99.971%                                  0.787020
hg38              4.9846%                              96.543%                                  0.802534
Table 4: Comparison of genome gap percent, percent of genome covered by the bigWig file,
and the corresponding mean score. Note that the hg19 bigWig file has over 99% coverage,
despite the fact that the hg19 build is 7.6455% gaps.
It is possible that the gaps are responsible for the drop in bases covered and the
increase in mean. Since hg19 is 7.6455% gaps, the highest reasonable value for the
percentage of the genome covered by a bigWig file is 92.3545%. However, the hg19 bigWig file
covers 99.971%, which suggests that 7.6165% of the calls are noise. By comparison,
hg38 is 4.9846% gaps, so the highest reasonable value for the percentage
of the genome covered by a bigWig file is 95.0154%. The hg38 bigWig file covers 96.543% of
the genome, which suggests that 1.5276% of the calls are noise. Therefore the hg38 bigWig file
experienced only about 20% of the noise that the hg19 bigWig file experienced.
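The arithmetic behind these noise figures can be written out as a short worked example. It uses the
gap and coverage percentages reported in Table 4 directly; the function name is hypothetical, and
the calculation is only a restatement of the reasoning above, which treats every covered base beyond
the non-gap fraction as noise.

```python
def noise_estimate(gap_pct, covered_pct):
    """Coverage in excess of the non-gap fraction of the genome, in percent."""
    max_reasonable = 100.0 - gap_pct
    return covered_pct - max_reasonable


hg19_noise = noise_estimate(gap_pct=7.6455, covered_pct=99.971)
hg38_noise = noise_estimate(gap_pct=4.9846, covered_pct=96.543)

print(round(hg19_noise, 4))               # estimated noise for hg19, in percent
print(round(hg38_noise, 4))               # estimated noise for hg38, in percent
print(round(hg38_noise / hg19_noise, 2))  # hg38 noise as a fraction of hg19 noise
```

The ratio of the two estimates is what supports the statement that hg38 shows roughly one fifth of
the noise seen in hg19.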
Reference Build   Percentage of genome that is a gap   Percentage of genome covered by bigWig   Mean       Noise
hg19              7.6455%                              99.971%                                  0.787020   7.6165%
hg38              4.9846%                              96.543%                                  0.802534   1.5276%
Table 5: Statistics that provide insight into the increased mean in hg38 over hg19. Note that
the potential noise in hg19 is five times larger than in hg38.
CONCLUSION
Hg38 is the fifth major human reference genome. Compared to hg19 it has notably better
scaffold and contig N50 values as well as a much smaller total gap length (11). Despite these
advantages, the scientific community still prefers hg19 for analysis. This preference is likely due
to the long time hg19 has been in use, and the still viable results one can get when using hg19. To make
hg38 more appealing, four key ENCODE regulatory tracks were lifted from hg19 to hg38 using
the liftOver program. A shell script was created to lift bigWig files by converting the files to bed
format, lifting them, and then converting them back into bigWig. Due to the number of files and their
location in the system, a C wrapper was made to call the shell script on the correct files. The tracks'
color schemes were updated to correspond with the radial dendrogram's color scheme.
Analysis using a BWA and macs2 pipeline showed the improved accuracy of analysis pipelines
on hg38 compared to hg19. The hypothesis was presented that hg19 experiences close to five
times the noise of hg38, which would correspond to hg38 producing more biologically relevant
and accurate results.
References
1 “The human genome browser at UCSC” Kent WJ, Sugnet CW, Furey TS, Roskin KM, Pringle
TH, Zahler AM, Haussler D. Genome Res. 2002 Jun;12(6):996-1006.
2 “ENCODE data in the UCSC Genome Browser: year 5 update” Rosenbloom KR, Sloan CA,
Malladi VS, Dreszer TR, Learned K, Kirkup VM, Wong MC, Maddren M, Fang R, Heitner SG,
Lee BT, Barber GP, Harte RA, Diekhans M, Long JC, Wilder SP, Zweig AS, Karolchik D, Kuhn
RM, Haussler D, Kent WJ. Nucleic Acids Res. 2013 Jan;41(Database issue):D56-63.
3 “BigWig and BigBed: enabling browsing of large distributed data sets” Kent WJ, Zweig AS,
Barber G, Hinrichs AS, Karolchik D. Bioinformatics. 2010 Sep 1;26(17):2204-7.
4 “The UCSC genome browser database: update 2007” Kuhn RM, Karolchik D, Zweig AS,
Trumbower H, Thomas DJ, Thakkapallayil A, Sugnet CW, Stanke M, Smith KE, Siepel A, et
al. Nucleic Acids Res. 2007:D668-673.
5 “Kent Source for UCSC Genome Browser” Jim Kent, UCSC Genome Browser group, Tue,
17 Mar 2015, http://genomesource.cse.ucsc.edu/gitweb/?p=kent.git;a=summary
6 “UCSC representation of NCBI34 (hg16) with the DR51 alternate MHC haplotype as an
unlocalized scaffold on chromosome 6” International Human Genome Sequencing Consortium,
2004,02,04 http://www.ncbi.nlm.nih.gov/assembly/111478/
7 “Genome Reference Consortium Human Build 37 (GRCh37)” Genome Reference
Consortium 2009, 02, 27, http://www.ncbi.nlm.nih.gov/assembly/2758/
8 “Genome Reference Consortium Human Build 38” Genome Reference Consortium 2013, 12,
17, http://www.ncbi.nlm.nih.gov/assembly/883148
9 “hg38 Genome Browser Screen Shot”
http://hgwdevceisenhart.cse.ucsc.edu/trash/hgt/hgt_hgwdev_ceisenhart_10f14_8ad150.pdf
10 “hg19 Genome Browser Screen Shot”
http://hgwdevceisenhart.cse.ucsc.edu/trash/hgt/hgt_hgwdev_ceisenhart_13e5d_8ad590.pdf
11 “Centromere reference models for human chromosomes X and Y satellite arrays”
Miga,K.H., Newton,Y., Jain,M., Altemose,N., Willard,H.F. and Kent,W.J. (2014). Genome Res.,
24, 697–707.
12 “D3: Data-Driven Documents” Michael Bostock, Vadim Ogievetsky, Jeffrey Heer, IEEE
Trans. Visualization & Comp. Graphics (Proc. InfoVis), 2011
Acknowledgments
The author would like to thank the following persons and associations for their
assistance; Jim Kent, Kate Rosenbloom, Karen Miga, Ann Zweig, UCSC Genome Browser
Staff, Josh Stuart, Ed Green, David Haussler