Good morning! Thank you for inviting me. I’ve been coming to this meeting many times but have never made any contribution. So today is payback time.
The first problem I identified is something called differential analysis for ChIP-seq data. Basically, you have two groups of animals: one group is the treatment and the other is the control. You take samples from these animals and send them for sequencing to measure the chromatin modifications. Then you want to compare the two groups to find the differences in chromatin modifications. This sounds like a straightforward question, but the solution is not. Some people may say, well, this is easy: why not do peak calling for each group separately, then compare the two peak lists using a Venn diagram, right? Well, this is surely going to be problematic. A treatment-specific peak may not be truly different: you may happen to set the cutoff between the two peak heights, so the peak is called in one group but not in the other. This leads to false positives. On the other hand, a common peak may actually be different, but you set the cutoff below both heights, so it is called in both groups. This leads to false negatives. So we definitely need to treat this problem more carefully.
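To make the cutoff problem concrete, here is a minimal sketch in Python. The peak heights and the cutoff are numbers I made up for illustration; they do not come from any dataset.

```python
# Toy illustration of why comparing thresholded peak lists can mislead.
# All heights and the cutoff are invented for illustration only.

CUTOFF = 10.0  # hypothetical peak-calling threshold

# (treatment height, control height) at three hypothetical sites
sites = {
    "site_A": (10.5, 9.5),   # nearly equal heights that straddle the cutoff
    "site_B": (40.0, 12.0),  # large change, but both heights exceed the cutoff
    "site_C": (3.0, 2.5),    # no peak in either group
}


def venn_call(treatment, control, cutoff=CUTOFF):
    """Classify a site the way a Venn diagram of two peak lists would."""
    in_treatment = treatment >= cutoff
    in_control = control >= cutoff
    if in_treatment and in_control:
        return "common"
    if in_treatment:
        return "treatment-specific"
    if in_control:
        return "control-specific"
    return "no peak"


calls = {name: venn_call(t, c) for name, (t, c) in sites.items()}
fold_changes = {name: t / c for name, (t, c) in sites.items()}

for name in sites:
    print(f"{name}: {calls[name]:<20} fold change = {fold_changes[name]:.2f}")
```

In this toy case, site_A is reported as treatment-specific although the two heights barely differ, and site_B is reported as common although the heights differ more than threefold.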
In addition to this problem, we also found, from real data, that some of the chromatin modifications are really subtle. This is an example from the ENCODE project. ChIP-seq was performed on the histone mark H3K4me3 in two cell lines, K562 and embryonic stem cells. Clearly, there is a peak at the TSS of the gene ASUN in both cell lines. But there also appear to be two increased sites, one on each side of the peak, which are really subtle. And the site downstream of the TSS seems to overlap with this variant exon. Could this chromatin modification site cause the change in expression of these two isoforms? To answer this kind of question, you really feel like cutting the whole genome into many small slices and determining the chromatin modification at each slice.
Unfortunately, when I started to work on this kind of problem, there were very few choices I could make. Back in 2009, it seemed there was only one program, called ChIPDiff, which specifically targeted differential analysis for ChIP-seq. Based on our experience, ChIPDiff tends to generate very few targets. When I used it on our brain data, it often gave me nothing. It was not until 2011 and 2012 that two new tools appeared, called DIME and DBChIP, which base their differential analysis on peak lists given by another peak-calling program. But this kind of approach has caveats. Using the example I just showed you, you may identify a peak in K562 like this, and another peak in stem cells like this. How can you compare these two peaks? It’s very likely you’ll miss these two differential sites. Finally, people have also tried to use DESeq and edgeR on ChIP-seq data. These two programs are my favorites because they treat statistics seriously. But they were originally designed for RNA-seq. To use them on ChIP-seq, you have to add a lot of pre- and post-processing steps. So they are not convenient to use.
Out of these frustrations, I decided to develop my own program, called diffReps. It is a program package written in Perl. The workflow of diffReps is illustrated here. It goes from background modeling and normalization all the way down to multiple testing correction. It is typically triggered by one command line like this, and it does all these things. It uses a sliding window strategy so you won’t miss a thing. By the way, diffReps is developed as an open source project and is hosted on Google Code.
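To give a feel for what a sliding window means here, this is a minimal sketch in Python that counts reads in overlapping windows along one chromosome. The function name, the window size and the step size are assumptions I chose for illustration; this is not the diffReps code, which is written in Perl.

```python
from bisect import bisect_left


def sliding_window_counts(read_positions, chrom_length, window=1000, step=100):
    """Count reads in overlapping windows [start, start + window).

    window and step are illustrative values, not diffReps settings.
    """
    positions = sorted(read_positions)
    counts = []
    last_start = max(chrom_length - window, 0)
    for start in range(0, last_start + 1, step):
        end = start + window
        # Number of read positions p with start <= p < end.
        n = bisect_left(positions, end) - bisect_left(positions, start)
        counts.append((start, end, n))
    return counts


# Hypothetical read start positions on a 2 kb toy chromosome.
reads = [50, 120, 130, 980, 1500]
windows = sliding_window_counts(reads, chrom_length=2000, window=1000, step=500)
for start, end, n in windows:
    print(f"[{start}, {end}): {n} reads")
```

Because neighboring windows overlap, a site that falls on the boundary of one window is still fully covered by another, which is the point of the strategy.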
Across my career, I have heard some people say things like “it doesn’t matter what kind of distribution you use, they are all about the same”. I do not agree with that. One of the most common mistakes people make on sequencing data is that they normalize the read counts and then assume these values are normally distributed. Here I used a ChIP-seq dataset from our brain samples. I then calculated the difference between the means of two groups under different conditions. The dot-dashed line shows you the empirical density, while the red line shows you the Gaussian fit. As you can see, the two distributions are totally different. The empirical data show a sharp peak with a long right-hand tail, while the Gaussian is much broader. In differential analysis, it’s all about the tail behavior of the distribution. At a p-value of 10 to the minus 5, this is where the Gaussian cutoff is, and this is where the empirical cutoff is. Look at how big the difference is between them.
So choosing the right statistical test is extremely important for ChIP-seq differential analysis. In diffReps, we implemented four different tests: negative binomial, t-test, chi-square test and G-test. If you have biological replicates, then the negative binomial test is really what you should use. It models the over-dispersion among the biological replicates and controls false positives. The t-test really should not be used; I only added it for comparison purposes. If your data do not contain biological replicates, then the chi-square test or G-test can be an excellent choice. The G-test can basically be considered a modification of the chi-square test and is recommended by some statisticians as a replacement for it. On the top right, this group of people from Oregon State has done a very nice comparison between negative binomial and t-test. The conclusion is that the t-test is no good: it is neither sensitive nor specific on sequencing data. But they somehow published this study in a not-so-prominent journal, so probably most people did not notice this paper. If you are interested in differential analysis, I would suggest you read it. On the bottom right, I also did a comparison between negative binomial and t-test on our own ChIP-seq data. The difference is striking. Negative binomial predicts 20-fold more sites than the t-test. What’s even worse is that less than half of the t-test sites overlap with the negative binomial sites. So this really raises a red flag for those who are using the t-test on ChIP-seq or RNA-seq data.
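For the no-replicate case, here is a minimal sketch in Python of the chi-square and G statistics for a single window. It uses the textbook goodness-of-fit setup, where the expected counts are split in proportion to library size; that setup, the read counts and the library sizes are my assumptions for illustration, not a description of how diffReps arranges its tests internally.

```python
import math


def chi2_sf_df1(x):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2.0))


def expected_counts(observed, library_sizes):
    """Split the window's total reads in proportion to sequencing depth."""
    total = sum(observed)
    depth = sum(library_sizes)
    return [total * n / depth for n in library_sizes]


def chi_square_stat(observed, expected):
    """Pearson statistic: sum of (observed - expected)^2 / expected."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


def g_stat(observed, expected):
    """G statistic: 2 * sum of observed * ln(observed / expected)."""
    # A category with zero observed reads contributes nothing to the sum.
    return 2.0 * sum(o * math.log(o / e) for o, e in zip(observed, expected) if o > 0)


# Hypothetical read counts in one window: treatment vs. control.
observed = [60, 30]
library_sizes = [10_000_000, 10_000_000]  # hypothetical sequencing depths

expected = expected_counts(observed, library_sizes)
x2 = chi_square_stat(observed, expected)
g = g_stat(observed, expected)

print(f"chi-square = {x2:.3f}, p = {chi2_sf_df1(x2):.2e}")
print(f"G          = {g:.3f}, p = {chi2_sf_df1(g):.2e}")
```

With two categories, both statistics are compared against a chi-square distribution with one degree of freedom, and on counts like these they give very similar answers.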
Besides differential tests, diffReps also includes two additional tools. The first tool is called find hotspots. A hotspot is basically a region where the differential sites or peaks occur significantly more often than expected by random chance. In this cartoon, these guys are very close to each other and they form a hotspot, while this guy is spared. A greedy search algorithm is designed to identify those hotspots. It basically goes from the start to the end and eats a differential site whenever doing so improves the score. When a hotspot is found, it is evaluated by a local Poisson model. The second tool is called region analysis. It is a script that accepts any input file as long as the first three columns contain genomic coordinates. It will assign each region to genes or heterochromatic regions.
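The greedy idea can be sketched in a few lines of Python. I want to be clear that this is only an illustration: the score (minus log10 of a Poisson tail probability), the padding, the background rate and the minimum number of sites are all assumptions I picked for this sketch, not the scoring that diffReps actually uses.

```python
import math


def poisson_sf(k, lam):
    """P(X >= k) for X ~ Poisson(lam), summed directly from the upper tail."""
    term = math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))
    total = 0.0
    i = k
    while term > 1e-17 * total:
        total += term
        i += 1
        term *= lam / i
    return total


def cluster_score(first, last, n_sites, rate, pad):
    """Hypothetical score: -log10 of the Poisson tail for n_sites in the span."""
    lam = rate * (last - first + pad)
    return -math.log10(max(poisson_sf(n_sites, lam), 1e-300))


def find_hotspots(positions, rate, pad=1000, min_sites=3):
    """Walk left to right, absorbing the next site only if the score improves."""
    positions = sorted(positions)
    hotspots = []
    i = 0
    while i < len(positions):
        start = end = i
        best = cluster_score(positions[start], positions[end], 1, rate, pad)
        while end + 1 < len(positions):
            n = end - start + 2
            candidate = cluster_score(positions[start], positions[end + 1], n, rate, pad)
            if candidate <= best:
                break  # the next site does not improve the score; stop eating
            best = candidate
            end += 1
        n_sites = end - start + 1
        if n_sites >= min_sites:
            hotspots.append((positions[start], positions[end], n_sites, 10 ** -best))
        i = end + 1
    return hotspots


# Hypothetical differential-site positions (bp) and background rate (sites per bp).
site_positions = [1000, 3000, 6000, 5_000_000]
hotspots = find_hotspots(site_positions, rate=1e-6)
for first, last, n_sites, p in hotspots:
    print(f"hotspot {first}-{last}: {n_sites} sites, Poisson p = {p:.1e}")
```

In this toy input the first three sites are absorbed into one hotspot, and the distant fourth site is left alone because adding it would make the score worse.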
So we’ve talked a lot about methodology. Now, let’s put diffReps to the test. This test dataset is from the ENCODE project. ChIP-seq was performed on H3K4me3 in two cell lines: K562 and embryonic stem cells. There are two replicates in each group, and the number of aligned reads ranges from 7 to 16 million. We also created a mock dataset using the DNA input samples, mixing the replicates between the two cell lines. The reason for the mixing is that the DNA input actually contains information about chromatin structure, so we want to remove those biases. By using this mock dataset, we can estimate the empirical false positive rate.
These two figures show you that diffReps predicts many more differential sites than the other approaches at different p-value cutoffs. Although diffReps also produces some differential sites on the mock data, the number decreases rapidly as the p-value cutoff becomes stricter. And the empirical false discovery rate is below 0.5% for diffReps. It should also be noted that the G-test is very sensitive and produces many more sites than the negative binomial test. This is not surprising, because the G-test ignores the variation within a group, so it tends to have a higher false positive rate. But the nice thing about the G-test is that its site list nearly includes all of the negative binomial sites. So if false positives are not your major concern, the G-test can be an excellent choice.
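Here is a minimal sketch in Python of how an empirical false discovery rate can be estimated from a mock comparison. I am assuming the common definition, the number of sites called on the mock data divided by the number called on the real data at the same cutoff; the p-values below are invented for illustration.

```python
def empirical_fdr(real_pvalues, mock_pvalues, cutoff):
    """Mock-site count divided by real-site count at the same p-value cutoff.

    Returns None when no real site passes the cutoff.
    """
    n_real = sum(p <= cutoff for p in real_pvalues)
    n_mock = sum(p <= cutoff for p in mock_pvalues)
    if n_real == 0:
        return None
    return min(1.0, n_mock / n_real)


# Hypothetical per-site p-values from the real and the mock comparison.
real_pvalues = [1e-8, 1e-6, 1e-5, 1e-5, 1e-4, 5e-4, 1e-3, 1e-2]
mock_pvalues = [2e-4, 8e-4, 5e-3]

for cutoff in (1e-3, 1e-4):
    fdr = empirical_fdr(real_pvalues, mock_pvalues, cutoff)
    print(f"cutoff {cutoff:g}: empirical FDR = {fdr:.3f}")
```

The estimate drops as the cutoff gets stricter, which is the behavior I described for the mock data.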
Now, at the default p-value cutoff, diffReps produces a differential site list that basically includes the DESeq and ChIPDiff sites. There are lots of diffReps-specific sites that do not overlap with those from the other methods. A natural question is whether those sites are actually biological and not just noise from the data. So we separated the differential sites into specific and overlapped categories, and further classified them by location into promoter and gene body. Then we correlated those sites with RNA-seq data.
The RNA-seq data from the two cell lines were processed using the TopHat-Cufflinks pipeline. This pipeline measures not only gene expression changes, but also more complicated things like alternative promoter usage and alternative splicing. We correlated these different categories of events with the differential sites using Fisher’s exact test. When we look at the overlapped category, the sites correlate very well with gene expression changes. They also show some correlation with alternative promoters, but not with alternative splicing. When we look at the diffReps-specific category, the sites also show different kinds of correlation with transcriptional changes. So this is very positive: it means a lot of the diffReps-specific sites are likely to be biological. What is interesting here is that those diffReps-specific sites also correlate with alternative splicing. This seems to suggest that a lot of subtle chromatin modifications are missed by other methods, but diffReps can pick them up. So diffReps is a very sensitive method that catches both major and minor changes.
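For those who have not used it, this is a minimal sketch in Python of a one-sided Fisher’s exact test on a 2x2 table, asking whether genes with a differential site are enriched for transcriptional changes. The function name and the gene counts are hypothetical and only for illustration; they are not the numbers from our analysis.

```python
from math import comb


def fisher_exact_greater(a, b, c, d):
    """One-sided Fisher's exact test for enrichment in a 2x2 table.

                          changed   unchanged
        with site            a          b
        without site         c          d

    Returns P(X >= a) under the hypergeometric null with fixed margins.
    """
    row1 = a + b
    col1 = a + c
    n = a + b + c + d
    upper = min(row1, col1)
    numerator = sum(comb(row1, x) * comb(n - row1, col1 - x) for x in range(a, upper + 1))
    return numerator / comb(n, col1)


# Hypothetical counts: 100 genes carry a differential site, 30 of them change
# expression; 900 genes carry no site, 100 of them change expression.
p_value = fisher_exact_greater(30, 70, 100, 800)
print(f"one-sided Fisher p-value = {p_value:.2e}")
```

A small p-value here says that the overlap between the two gene sets is larger than you would expect if chromatin changes and transcriptional changes were unrelated.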
To give you some more intuitive, real examples, we created these two figures. In the upper figure, this MICU1 gene has two alternative promoters. The second one is many kb downstream of the first one. The longer TSS has increased expression in the K562 cell line. diffReps found two increased sites at the longer TSS. This is consistent with this histone mark’s role as an activation mark at the TSS. In the lower figure, this FANCI gene has two isoforms. The second isoform contains a variant exon which has increased expression in the K562 cell line. diffReps found an increased site which overlaps with this variant exon. This seems to suggest a positive role for H3K4me3 in this exon’s inclusion.
As you can see, diffReps can be a very useful tool for ChIP-seq analysis. We have used it on literally every ChIP-seq dataset we have. It was used to study morphine-regulated H3K9me2 in mouse brain, a study that was published last year in the Journal of Neuroscience. It was also used in our big cocaine project to study cocaine-regulated chromatin modifications for 7 different histone marks.
The paper about diffReps is now in production at PLOS ONE and should come online in no time. Recently, I received this email from one of diffReps’ users. This user from the UK said, and I quote, “great to see…”. Well, I am really flattered. Sometimes I do feel that it is users like this who keep me motivated to improve my programs and make them even better.
I thought I could be innovative in this section too. These are two heatmaps that show you each person’s role in the two software packages. diffReps is kind of a one-man project. I pretty much did everything, and Ningyi helped a lot with testing and results generation. For ngsplot, I developed most of the code. Ningyi also made some contributions. Leo has been helping with testing, documentation and maintaining the Google Code page. He also imported it into Galaxy. Eric Nestler is all about the money.
Existing programs for differential
• ChIPDiff (2008): HMM-based approach. NOT sensitive enough for brain data.
• Peak-based: DIME(2011),
• Read counts +
Not convenient to use.
diffReps: a ChIP-seq differential analysis package
• Written in PERL, easy
to use command line
tool; Do everything in
• Sliding window
Merge and re-
diffReps.pl -tr A.bed B.bed -co C.bed D.bed -gn mm9 -re report.txt
Statistical tests for differential analysis
• Negative binomial test:
models biological replicates,
• T-test: NOT recommended
• X2 test: SUM((emp – exp)^2 / exp)
=> X2 distr (p-val).
• G-test: 2 * SUM(emp * ln(emp / exp))
=> X2 distr (p-val). A
modification to X2 test,
diffReps on H3K4me3: cocaine vs. saline
Two additional tools
1. Find hotspots - hotspots are regions where the differential
sites or peaks occur significantly more often than random
Greedy search algorithm
2. Region analysis - any file with the first 3 columns to be:
chromosome, start, end. Annotate gene and heterochromatic
Easy to use: region_analysis.pl -i input.txt
Test data: ENCODE H3K4me3 between
K562 and ESC
Target: H3K4me3 Mock: DNA Input
Identify differential chromatin
Estimate empirical false
diffReps is used in many works
Big cocaine project:
diffReps: current status & community
Great to see diffreps has found a nice home in plos one. It is
literally the program which has saved my sanity, my phD and
probably the paper i'm writing!
- Michael Reschen, Oxford Univ., UK
Role Li Shen Ningyi Shao Xiaochuan Liu Eric Nestler
Test & result