Update version of the SMBE/SESBE Lecture on ENCODE & junk DNA (Graur, December 2013)

How to Assemble a Human Genome? Mix generous
amounts of Junk DNA and Indifferent DNA, add a Dollop
of Garbage DNA and a Sprinkling of Functional DNA
(Lazarus DNA optional)

Dan Graur

University of Houston

Dan Graur

(until 5 September 2012)

Dan Graur

(from 6 September 2012 to the present time)

In September 2012, 30 papers based on thousands
of data sets were simultaneously published in
high-profile journals to describe the major findings from
the ENCODE project.

The main finding of the main paper was
picked up by news outlets all over the world.

And what was the main finding of the main
paper?

80% of the human genome is functional

442 authors

+

594 collaborators

On the thirtieth day
of the month of
September, in the
Year of our Lord
2012, it was
announced that “junk
DNA” is “dead.”

An epic media spin

Compiled by T. Ryan Gregory, Genomicron

Three problems: (1) If the human genome is indeed devoid
of junk DNA, as implied by the ENCODE project, then a
long, undirected evolutionary process cannot explain the
human genome.

If, on the other hand, organisms are
designed, then all DNA, or as much
as possible, is expected to exhibit
function. If ENCODE is right,
then Evolution is wrong.

Three problems: (2) If ENCODE is right, then humans are
the Goldilocks of the living world.

Organism                              C-value (pg)   Junk   Complexity
Tetraodon fluviatilis (pufferfish)    0.35           No     Primitive
Hyla nana (frog)                      1.89           No     Primitive
Homo sapiens (human)                  3.5            No     Pinnacle of Creation
Extatosoma tiaratum (insect)          8.0            Yes    Primitive
Allium cepa (onion)                   16.75          Yes    Primitive
Protopterus aethiopicus (lungfish)    132.83         Yes    Primitive
Paris japonica (canopy plant)         152.20         Yes    Primitive

Three problems: (3) If ENCODE-2012 is right, then
ENCODE-2011 is wrong.

80% of the human genome is functional

Evolutionary constraint indicates that
the fraction of the human genome that
is functional is ~5%.

Nature. 2011. 478:476-482

We wrote a critical piece on
ENCODE, and got a very negative
review from Trends in Genetics.

angry?

insane?

“Graur is mad, and not entirely
without cause.”

“It would be good for Trends in
Genetics to publish a reasoned
and dispassionate critical essay
on this topic, preferably by
someone of Graur’s stature, but
not him.”

192 cm, 6’2”, 115 kg, 254 lb

How did ENCODE reach the conclusion that 80%
of the human genome is functional, when the
evidence for selection constraint is ~5%?

•  Equating hype with science.

•  Wrong experimental systems.

•  Inappropriate statistical analyses.

•  A peculiar definition of function.

•  A peculiar definition of junk.

•  A lack of evolutionary perspective.

•  A lack of objectivity about the study organism.

•  Ignorance of everything that came before
ENCODE.

Wrong experimental systems.

A huge chunk of ENCODE data is derived from
HeLa cells and other cancer cells.

Does the HeLa karyotype look human to you?

Wrong experimental systems.

Landry et al. 2013

•  Equating hype with science.

6 SEPTEMBER 2012

The birth of 80%

“These data enabled us
to assign biochemical
functions for 80% of
the genome…”

Implication that 80% may be 99%

“The vast majority (80.4%) of the human genome
participates in at least one biochemical RNA- and/or
chromatin-associated event in at least one cell type.
Much of the genome lies close to a regulatory event:
95% of the genome lies within 8 kilobases (kb) of a
DNA–protein interaction..., and 99% is within 1.7 kb
of at least one of the biochemical events measured by
ENCODE.”

“junk DNA” is dead!

99% is not enough, 100% is better

ENCODE researcher Ewan Birney tells Ed
Yong that the 80 percent figure will
increase, possibly reaching 100 percent.
“We don’t really have any large chunks of
redundant DNA,” Birney says. “This
metaphor of junk isn't that useful.”

The PR machine at work:

“Virtually all of the DNA passed down from
generation to generation has been kept for a
reason.” An intelligent God, perhaps?

99% disappears and 80% becomes 40%

I [went] back to ENCODE biologist John
Stamatoyannopoulos, who was quoted in the first
wave of news. He said he thought the skeptics
hadn’t fully understood the papers… He did
admit that the press conference misled people by
claiming that 80% of our genome was essential
and useful. He puts that number at 40%.
Otherwise he stands by all the ENCODE claims.

Faye Flam

(The origin of 9%: 5% + 4% = 9%)

20% kills “junk DNA”

•  20% of the genome is functional.

•  Ergo, 80% must be junk.

•  Yet, “junk DNA” should be “totally expunged” from the lexicon.

•  In which universe does Ewan Birney’s logic work?

•  In a universe in which 20% >> 80%!

At the end of 2012, 20%
becomes the favorite
number in Nature

Information flow within the genome

[Diagram: the genome (DNA) divides into transcribed and nontranscribed sequences; transcribed sequences (RNA) divide into translated (protein) and nontranslated sequences.]

[Diagram: each class of the genome (nontranscribed, transcribed but nontranslated, and translated) contains both functional DNA and junk DNA.]

[Diagram: the genome divides into functional DNA and junk (nonfunctional) DNA.]

Junk has nothing to do with non-protein-coding.

Junk is about function… actually lack of function.

[Diagram: the genome divided into functional DNA and junk DNA, with both classes labeled “ad hoc.”]

[Diagram: functional DNA that becomes junk = pseudogene.]

[Diagram: junk DNA that becomes functional = Lazarus DNA.]

Lazarus DNA
Emmaus DNA
Zombie DNA

If the acquired function (Lazarus DNA) lowers the
ﬁtness of the carriers, it is called zombie DNA.

If the acquired function (Lazarus DNA) is advantageous, it
is called Emmaus DNA.

[Diagram: functional DNA and junk DNA each contain transcribed and nontranscribed sequences; the transcriptome spans both.]

Not all the transcriptome is functional.

[Diagram: functional DNA and junk DNA each contain untranscribed, transcribed but untranslated, and translated sequences; the proteome spans both.]

Not all the proteome is functional.

THE ORIGIN OF A SPECIES (smart & elegant)

Seiko Astron: Like a smartphone, the Astron is
GPS-enabled, allowing it to determine accurate time
from atomic clocks and automatically update to any
time zone in the world. Unlike a smartphone,
however, it looks nice with a suit, won’t break if
you drop it and uses solar power, so it never needs
to be charged. Darwin would be proud. $2,300

Hemispheres Magazine.
April 2013.

Actually, Darwin would not be proud.

Evolution does not produce “smart and elegant”

This is an intelligently designed
Dining Table

This is an evolutionary functional
Dining Table

The story of the human genome:

1998 to the present

The gene number game: Genesweep©

(started in Cold Spring Harbor, 1998)

Run by an Eton high-school boy called Ewan Birney.
His guess was in the very high range!

Bets: 281

Median: 61,302

Lowest: 25,947

Highest: 212,278

Pot: 1,200 US Dollars

15 February 2001

1st draft

21 October 2004

From 30,000 protein-coding
genes to less than 25,000.

Lee Rowen (Institute for Systems Biology) won half of the pot with a
guess of 25,947 genes. She was at the bottom of the pool. Olivier
Jaillon (26,500) & Paul Dear (27,462) shared the remaining 600
dollars.


Genebuild last updated/patched    May 2012            April 2013
Total length                      3,287,209,763 bp    3,320,602,130 bp
Protein-coding genes              21,065              20,774
Pseudogenes                       15,930              14,445
RNA-specifying genes              12,955              22,493

The “end” of the Human Genome Project in 2004
($3.8 Billion) was a big disappointment for
scientists unversed in evolutionary biology

The human genome turned out to be:

•  small in size

•  sparsely populated with genes

•  densely populated with dead genomic parasites

•  unoriginal

Small

Information content

3.5 billion letters in a four-letter alphabet = 7 billion
bits = 0.81 GB (gigabytes)

<<

1 DVD = 8.5 GB
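The arithmetic behind the 0.81 GB figure can be checked with a minimal sketch. It assumes, as the slide does, that each of the four letters carries log2(4) = 2 bits and that "GB" means 2^30 bytes; the constant and function names are illustrative only.

```python
import math

GENOME_LENGTH_BP = 3.5e9   # genome size quoted on the slide
ALPHABET_SIZE = 4          # A, C, G, T
DVD_CAPACITY_GB = 8.5      # DVD capacity quoted on the slide


def genome_information_gb(length_bp: float, alphabet_size: int = ALPHABET_SIZE) -> float:
    """Upper bound on the information content, in gigabytes of 2**30 bytes."""
    bits_per_letter = math.log2(alphabet_size)  # 2 bits for a four-letter alphabet
    total_bits = length_bp * bits_per_letter    # 7 billion bits for 3.5 billion letters
    return total_bits / 8 / 2**30               # bits -> bytes -> gigabytes


genome_gb = genome_information_gb(GENOME_LENGTH_BP)
print(f"Genome: {genome_gb:.2f} GB, i.e. {genome_gb / DVD_CAPACITY_GB:.1%} of one DVD")
```

This is an upper bound: it treats every position as independent and equally informative, which repetitive DNA is not.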

Sparsely populated with genes.

Organism                               Gene density (genes per Mb)
Escherichia coli (bacterium)           911
Saccharomyces cerevisiae (yeast)       483
Arabidopsis thaliana (mustard weed)    221
Drosophila melanogaster (fly)          197
Homo sapiens                           12

Densely populated with dead transposable elements

45-67%

plus at most 0.1% for RNA-specifying genes (non-coding RNA)
plus at most 0.1% for DNA switches.

Unoriginal

0.15% nonsynonymous differences

1.22% synonymous differences

Cost of sequencing your human genome

= ~$25,000. Percent genome recovery

= 90%. Error rate = 1-3%.

I will provide you with your genome
sequence with less error for half the
price (and you can haggle).

Data: http://www.plosone.org/article/info:doi/10.1371/journal.pone.0030087

Comparing the human genome to other genomes
has given rise to three complexity paradoxes.*
Genomic paradox = A lack of correspondence
between a measure of genome size and the
presumed amount of genetic information
“needed” by the organism (its complexity).
*The paradoxes only exist under
the assumption that humans are
the most complex organisms and
the pinnacle of creation.

Defining complexity is difficult

The complexity of a system may
be defined by the minimum
number of independent
characters required to describe it,
where independence is defined as
the ability of the character to
assume any possible character
state independently of any other
character in the system.

Thus the wall on the right is more complex (it has a crack)
than the wall on the left.

However, even if we cannot quantify organismal
complexity very well, in many cases, it is possible
to state unequivocally that A is more complex
than B.

Without doubt, [image] is more complex than [image].

K-value paradox: Complexity
does not correlate with
chromosome number.

Homo sapiens                 46
Lysandra atlantica           250
Ophioglossum reticulatum     ~1260

C-value paradox: Complexity
does not correlate with
genome size.

Homo sapiens     3.5 × 10^9 bp
Allium cepa      1.5 × 10^10 bp
Amoeba dubia     6.7 × 10^11 bp

G-value paradox:
Complexity does not
correlate with protein-coding gene number.

[Figure labels: ~21,000; ~21,000; ~57,000; >94,000]

Total Number of Protein-Coding Genes

Drosophila melanogaster (fruitfly)     13,917
Pan troglodytes (chimpanzee)           18,746
Canis familiaris (dog)                 19,856
Bos taurus (cow)                       19,994
Caenorhabditis elegans (nematode)      20,517
Homo sapiens (human)                   20,774
Arabidopsis thaliana (mustard weed)    27,416
Physcomitrella patens (moss)           35,938
Oryza sativa (rice)                    40,577
Populus trichocarpa (poplar)           41,377
Manihot esculenta (cassava)            47,164
Malus domestica (apple)                57,386
Triticum aestivum (bread wheat)        >94,000

Mommy, mommy, a fern has
27 times as many chromosomes
as I do; an amoeba has 200
times more DNA than I do;
and wheat has 5 times more
genes than me.
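The three fold-differences in this complaint follow from the numbers on the K-, C-, and G-value slides. A small sketch of the arithmetic, with dictionary keys and the helper name chosen here purely for illustration:

```python
# Values copied from the K-value, C-value, and G-value slides.
human = {"chromosomes": 46, "genome_bp": 3.5e9, "protein_coding_genes": 20_774}
record_holders = {
    "chromosomes": ("Ophioglossum reticulatum (fern)", 1260),
    "genome_bp": ("Amoeba dubia", 6.7e11),
    "protein_coding_genes": ("Triticum aestivum (bread wheat)", 94_000),  # a lower bound
}


def fold_difference(measure: str) -> float:
    """How many times larger the record holder's value is than the human value."""
    _, value = record_holders[measure]
    return value / human[measure]


for measure, (organism, _) in record_holders.items():
    print(f"{organism}: {fold_difference(measure):.1f}x the human {measure}")
```

The slide rounds the results (roughly 27-fold, 190-fold, and 4.5-fold) to 27, 200, and 5.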


Ohno S. 1972. So much ‘junk’ DNA in
our genome. Brookhaven Symp. Biol.
23:366-370.

Conclusion: The human
genome is mostly “junk.”

?

What is and what isn’t “junk DNA”

“There are known knowns; there are
things we know that we know.

There are known unknowns; that is to
say, there are things that we now know we
don’t know.

But there are also unknown unknowns
—there are things we do not know we don’t
know.”

Donald Rumsfeld

February 12, 2002

“Junk DNA” misrepresented as a “known unknown”

What is and what isn’t “junk DNA”

Junk DNA is a known known; it is a thing
that we know what it does—it takes space.
Junk DNA is any piece of DNA that has no
function and does not affect ﬁtness. NOT
everything that is not translated or not
transcribed is Junk DNA.

Junk DNA is NOT a known unknown.

Dark DNA is a known unknown.

Dan Graur

June 22, 2013

Junk DNA is a consequence of population genetics considerations!

In organisms with
LARGE effective
population sizes,
natural selection is
relatively strong.

In organisms with
SMALL effective
population sizes,
natural selection is
relatively weak.

The majority of new
mutations are mildly
deleterious. In humans and
elephants, selection is not
sufficiently strong to
eliminate many such
deleterious mutations.

Humans and elephants are
expected to accumulate
numerous deleterious
mutations in their
genome.
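One way to make the dependence on effective population size concrete is Kimura's diffusion approximation for the fixation probability of a new mutation. The slides do not give this formula; the sketch below brings it in as an assumption, together with a diploid population whose census size equals its effective size Ne, additive selection, and an illustrative selection coefficient s = -0.00001.

```python
import math


def fixation_probability(s: float, ne: float) -> float:
    """Kimura's approximation: (1 - exp(-2s)) / (1 - exp(-4*Ne*s)).

    Assumes a single new copy in a diploid population whose census size equals Ne.
    """
    if s == 0:
        return 1.0 / (2.0 * ne)          # neutral expectation
    exponent = -4.0 * ne * s
    if exponent > 700:                   # exp() would overflow; the probability is ~0
        return 0.0
    return math.expm1(-2.0 * s) / math.expm1(exponent)


def relative_to_neutral(s: float, ne: float) -> float:
    """Fixation probability divided by that of a neutral mutation."""
    return fixation_probability(s, ne) / fixation_probability(0.0, ne)


S_MILDLY_DELETERIOUS = -1e-5             # hypothetical value, for illustration only
for ne in (1e4, 1e6):
    ratio = relative_to_neutral(S_MILDLY_DELETERIOUS, ne)
    print(f"Ne = {ne:.0e}: fixes at {ratio:.3g} times the neutral rate")
```

With Ne = 10,000 the mildly deleterious mutation fixes almost as often as a neutral one, whereas with Ne = 1,000,000 it essentially never does.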

Genomic anthropocentrism?

Human exceptionalism?

W. Ford Doolittle (2013)

“What would we expect for the number of functional
elements (as ENCODE deﬁnes them) in genomes much
larger than our own?

If the number [of functional elements] were to stay more-or-less constant, it would seem sensible to consider the rest
of the DNA of larger genomes to be junk.

If on the other hand the number of functional elements
were to rise signiﬁcantly with genome size, then
organisms with genomes larger than ours should be
more complex phenotypically than we are.”

•  A peculiar definition of function.

•  A peculiar definition of junk.

•  A lack of evolutionary perspective.

In biology, there are two main concepts
of function:

•  A historical concept of function, also
referred to as the “selected effect
function” or “proper function.”

•  A non-historical concept of function,
also referred to as the “causal
function.”

What is the function of the heart?

The proper function
is to pump blood.

The causal functions of the heart are to
add 300 grams to the body weight, to
produce sounds, to be encased in the
pericardium, to partially fill the
mediastinum, to provide an inaccurate
logo for Valentine’s Day cards, etc.

•  Evolutionary biologists use the proper
or selected effect function.

•  ENCODE used the causal function.

“Operationally, we define a functional element as a
discrete genome segment that encodes a defined
product (for example, protein or non-coding RNA)
or displays a reproducible biochemical signature
(for example, protein binding, or a specific
chromatin structure).”

•  An example of a function that fits the
ENCODE definition: shoes binding
chewing gum.

“Operationally, we define a
functional element as an
entity that displays a
reproducible signature (for
example, chewing gum
binding).”

“By the logic employed by ENCODE, following a
collision between a car and a pedestrian, a car’s
bonnet would be ascribed the 'function' of
projecting a pedestrian many meters and the
pedestrian would have the 'function' of deforming
the car’s bonnet.” Laurence Hurst 2013. BMC Biol. 11:58

ENCODE uses a known logical fallacy called
affirming the consequent.

If a functional sequence is transcribed,
then all transcribed sequences are functional.
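Why this inference is invalid can be shown by brute force over a truth table. The sketch below treats "functional" and "transcribed" as two Boolean properties of a sequence (a simplification made here for illustration) and searches for cases where the premise "functional implies transcribed" holds and the sequence is transcribed, yet it is not functional.

```python
from itertools import product


def implies(a: bool, b: bool) -> bool:
    """Material implication: 'if a then b'."""
    return (not a) or b


# Look for assignments in which the premise and the observation are both true
# but the conclusion "the sequence is functional" is false.
counterexamples = [
    (functional, transcribed)
    for functional, transcribed in product([False, True], repeat=2)
    if implies(functional, transcribed) and transcribed and not functional
]

print(counterexamples)  # each tuple is (functional, transcribed)
```

A single counterexample, a transcribed but nonfunctional sequence, is enough to show that transcription alone does not establish function.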

Moreover, ENCODE uses the logical fallacy
inconsistently.

The ENCODE Project:

74.7% of the genome is transcribed,

56.1% is associated with modiﬁed histones,
15.2% is found in open-chromatin areas,
8.5% binds transcription factors,

4.6% consists of methylated CpGs.

The fraction of the genome that is functional
(the Boolean union) is 80.4%.

Our additions to ENCODE

74.7% of the genome is transcribed,

56.1% is associated with modiﬁed histones,
15.2% is found in open-chromatin areas,

8.5% binds transcription factors,

4.6% consists of methylated CpGs.

84.8% binds histone

100% of the genome is replicated.

The fraction of the genome that is functional is
100%.
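Because the classes overlap, a Boolean union is not the sum of its parts: it is at least as large as the largest class and at most the sum, capped at the whole genome. The sketch below computes only these bounds from the percentages on the slides; the function name is illustrative, and the real overlaps would require the underlying coordinates.

```python
def union_bounds(fractions: dict) -> tuple:
    """Return (lower, upper) bounds on the union of overlapping genome fractions."""
    values = list(fractions.values())
    lower = max(values)             # the union contains its largest member
    upper = min(1.0, sum(values))   # and cannot exceed the sum or the whole genome
    return lower, upper


encode = {
    "transcribed": 0.747,
    "modified histones": 0.561,
    "open chromatin": 0.152,
    "transcription factor binding": 0.085,
    "methylated CpGs": 0.046,
}
lower, upper = union_bounds(encode)
print(f"ENCODE classes: union between {lower:.1%} and {upper:.1%} (reported: 80.4%)")

# Adding the two classes proposed on the slide forces the union to 100%.
with_additions = {**encode, "histone binding": 0.848, "replicated": 1.0}
print(union_bounds(with_additions))
```

Any class that covers the entire genome, such as replication, makes the union 100% regardless of what the other classes contain.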

Interesting Question:

Why do people have problems with DNA that
has no function?

Inability to deal with randomness

“… nothing is so alien to the human mind
as the idea of randomness.”

John Cohen. 1960. Chance, Skill, and Luck: The Psychology of Guessing
and Gambling. Baltimore, MD: Penguin Books.

Apophenia /æpɵˈfiːniə/: The experience of seeing
meaningful patterns or connections in random or
meaningless data. A type of mild or incipient
schizophrenia. In statistics, apophenia is known as
Type I error (false positives).

Klaus Conrad. 1958. Die beginnende Schizophrenie. Versuch einer
Gestaltanalyse des Wahns [Incipient Schizophrenia: An Attempt to Analyze
delusion]. Stuttgart: Georg Thieme Verlag.

People like mysteries:
such as hidden messages
in the Bible.

If you search long enough and hard enough for patterns in random texts,
you will ﬁnd patterns. Especially if you do not employ negative
controls. This pattern, for instance, predicts on the vertical from the
bottom up (in Hebrew) that MITROMNI(TAURA)NSIA, where NSIA
is “president.” The 5 letters in between MITROMNI and NSIA are
random. It also helps that Hebrew has no vowels.

The Bible Code employs no negative controls. Someone else did and
they found similar “prophecies” in Moby Dick by Herman Melville.

ENCODE has no negative controls.

Mike White provided them and showed in a paper
published in PNAS that random DNA sequences
cause reproducible regulatory effects on the
reporter gene.

Random genetic sequences have as
much or as little function as the
human genome sequences analyzed
by ENCODE.

“Some years ago I noticed that there are two kinds of
rubbish in the world and that most languages have
different words to distinguish them. There is the rubbish
we keep, which is junk, and the rubbish we throw away,
which is garbage. The excess DNA in our genomes is
junk, and it is there because it is harmless, as well as
being useless, and because the molecular
processes generating extra DNA outpace
those getting rid of it.”

Sydney Brenner. 1998. Refuge of spandrels. Current Biology 8:R669.

“Were the extra DNA to become disadvantageous, it
would become subject to selection, just as junk that
takes up too much space, or is beginning to smell, is
instantly converted to garbage by one’s wife, that
excellent Darwinian instrument.”

Sydney Brenner. 1998. Refuge of spandrels. Current Biology 8:R669.

Graur’s garage: Functional but full of junk

A garage according to ENCODE

A garage in which junk became garbage

Junk can Sometimes be Repurposed

Junk DNA can Sometimes be Repurposed

Norihiro Okada & Jürgen
Brosius, specialists in the
repurposing of junk DNA.

Functional DNA ✔

Junk DNA ✔

Garbage DNA ✔

Lazarus DNA ✔

Indifferent DNA

Dark DNA

Sequence-indifferent DNA or
indifferent DNA refers to DNA sites that
are functional, but show no evidence of
selection against point mutations.
Deletions of these sites, however, are
deleterious and are subject to purifying
selection.

Examples of indifferent DNA
are spacers and ﬂanking
elements whose presence is
required but the sequence is not
important. One such case is the
third position of four-fold
redundant codons, which needs
to be present to avoid a
downstream frameshift.

Dark DNA refers to the fraction of the genome for
which no good evidence exists as to its evolutionary
impact on ﬁtness.

Dark DNA is a known unknown.

The term “dark” is borrowed from the ﬁeld of
astrophysics.

An astrophysicist (Dr. Or Graur) whose
research deals with dark energy.
Unfortunately, he has no interest in dark
DNA.

Interesting Question:

How can one tell if a certain genomic sequence is
functional or not?

Can we make the car on the left less fit for driving?

Can we make the car on the right less fit for driving?

Functional DNA
(almost all mutations are deleterious)

[Diagram: mutations striking functional DNA and the resulting evolutionary change.]

Nonfunctional DNA
(all mutations are neutral)

[Diagram: mutations striking nonfunctional DNA and the resulting evolutionary change.]

How do we know if a particular genomic
sequence is functional?

Since most mutations in functional regions are
deleterious and likely to impair the function,
these mutations will tend to be eliminated by
natural selection. Thus, functional regions of
the genome should evolve more slowly, and
therefore be more conserved among species,
than nonfunctional regions.

Another indicator for the existence of a
genomic function is that losing it has some
consequence for the organism.

Evolution has tested the functionality of every
region of the human genome through
mutation over millions of years of evolution.

Is it even possible that ENCODE is right?

No! The main reason is that in humans there is a huge
difference between population size and effective population
size.

Long-term Ne = 10,000

Under such conditions, selection is inefficient and most
genetic variation is deleterious. Genomic “perfection” is
unachievable.

Fact 1: It has been known for more than a century that the vast
majority of non-neutral mutations are deleterious (Thomas Morgan
1903).

Fact 2: Mutation rate is evolvable.

These facts led Alfred Sturtevant
to raise the question “Why does the
mutation rate not become reduced
to zero?” (Sturtevant 1937).

Motoo Kimura: Mutation rate cannot reach zero,
because of the COST OF FIDELITY.

In other words, the mutation rate in a lineage is a
compromise between the benefits of complete fidelity in
the replication of the genetic material and the cost of
achieving complete fidelity.

The mutation rate modulation hypothesis

How did ENCODE reach such ridiculous
numbers?

1.  It used methodologies encouraging biased
errors in favor of inflating estimates of
functionality.

2.  It consistently and excessively favored
sensitivity over specificity.

3.  It paid attention to statistical significance,
rather than magnitude of the effect.

Example:

Transcription factor binding sites (TFBSs):

So far, almost all known TFBSs range in length from 6
to 14 nucleotides.

The TFBS entries in ENCODE range in size from 457
to 824 nucleotides.

Thus, the estimates of the fraction of the human
genome devoted to transcription factor binding are
extraordinarily inflated (sometimes by about two
orders of magnitude).
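The size of the inflation follows directly from the two length ranges quoted above. A short sketch of the ratio, with variable names chosen here for illustration:

```python
import math

KNOWN_TFBS_RANGE = (6, 14)       # nucleotides, known binding sites
ENCODE_TFBS_RANGE = (457, 824)   # nucleotides, ENCODE entries

# Smallest inflation: shortest ENCODE entry versus longest known site.
smallest_inflation = ENCODE_TFBS_RANGE[0] / KNOWN_TFBS_RANGE[1]
# Largest inflation: longest ENCODE entry versus shortest known site.
largest_inflation = ENCODE_TFBS_RANGE[1] / KNOWN_TFBS_RANGE[0]

print(f"Inflation between {smallest_inflation:.0f}x and {largest_inflation:.0f}x")
print(f"Orders of magnitude: {math.log10(smallest_inflation):.1f} "
      f"to {math.log10(largest_inflation):.1f}")
```

The upper end of the range exceeds a factor of 100, which is where "about two orders of magnitude" comes from.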

ENCODE prefers false positives over false
negatives, thus inflating the proportion of
positives.

Example:

ENCODE used a probability-based alignment tool, and mapped RNA
transcripts onto DNA when the statistical confidence exceeded
90%.

This means that 10% of the correspondences between RNA and
genome are erroneous.

The total number of RNA transcripts in ENCODE is approximately
109 million. The mean transcript length is 564 nucleotides.

Thus, a total of 6 billion nucleotides, or two times the human
genome size, are potentially misplaced (false positives).
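The "6 billion nucleotides" figure can be reproduced from the numbers on the slide. The sketch follows the slide's own reasoning that a 90% confidence threshold allows up to 10% of mappings to be wrong; the genome length is the rounded Ensembl total quoted earlier in the lecture, and the variable names are illustrative.

```python
N_TRANSCRIPTS = 109e6        # approximate number of RNA transcripts in ENCODE
MEAN_LENGTH_NT = 564         # mean transcript length
MAX_ERROR_RATE = 0.10        # implied by the 90% confidence threshold
GENOME_LENGTH_BP = 3.32e9    # Ensembl total length quoted earlier, rounded

total_nucleotides = N_TRANSCRIPTS * MEAN_LENGTH_NT
potentially_misplaced = total_nucleotides * MAX_ERROR_RATE
genome_equivalents = potentially_misplaced / GENOME_LENGTH_BP

print(f"Mapped nucleotides:    {total_nucleotides:.3g}")
print(f"Potentially misplaced: {potentially_misplaced:.3g}")
print(f"Genome equivalents:    {genome_equivalents:.2f}")
```

The result is about 6.1 billion potentially misplaced nucleotides, a little under two genome lengths.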

“Derived allele frequency spectrum for primate-speciﬁc elements, with
variations outside ENCODE elements in black and variations covered
by ENCODE elements in red. The increase in low-frequency alleles
compared to background is indicative of negative selection
occurring in the set of variants annotated by the ENCODE data.”

p = 10^−37

Magnitude of effect = 0.042%
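A tiny effect can yield an arbitrarily small p-value once the sample is large enough. The sketch below is purely illustrative: it assumes the effect is an absolute difference of 0.042 percentage points between two proportions, takes a hypothetical baseline of 50%, uses a two-proportion z-test, and tries hypothetical sample sizes. None of these choices come from the ENCODE analysis.

```python
import math


def two_sided_p_value(diff: float, baseline: float, n_per_group: float) -> float:
    """Two-proportion z-test p-value for a fixed absolute difference `diff`."""
    pooled = baseline + diff / 2.0                        # mean of the two proportions
    std_error = math.sqrt(pooled * (1.0 - pooled) * 2.0 / n_per_group)
    z = diff / std_error
    return math.erfc(abs(z) / math.sqrt(2.0))             # two-sided normal tail


EFFECT = 0.00042       # 0.042 percentage points, assumed to be an absolute difference
BASELINE = 0.5         # hypothetical baseline proportion
p_values = {n: two_sided_p_value(EFFECT, BASELINE, n) for n in (1e4, 1e6, 1e8, 1e9)}

for n, p in p_values.items():
    print(f"n per group = {n:.0e}: p = {p:.3g}")
```

The effect size never changes across the rows; only the sample size does, which is why statistical significance by itself says little about biological importance.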

Let’s examine the rationale and the
methodology for dealing with the
derived allele frequency spectrum
in primate-speciﬁc elements

The Why

•  If all alleles are neutral, a certain
frequency distribution is expected.

•  If some alleles are under negative
selection, an excess of rare derived alleles
is expected.

•  This excess is expected to be detectable for
only very short periods of evolutionary
time.
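The "certain frequency distribution" expected under neutrality can be sketched with the standard neutral model, which the slides do not name and which is assumed here: constant population size and infinitely many sites. Under that model, the expected number of sites at which the derived allele appears i times in a sample of n chromosomes is proportional to 1/i.

```python
def neutral_sfs(n_chromosomes: int) -> list:
    """Expected proportion of segregating sites with derived-allele count 1..n-1.

    Standard neutral expectation: the class with count i has weight 1/i.
    """
    weights = [1.0 / i for i in range(1, n_chromosomes)]
    total = sum(weights)
    return [w / total for w in weights]


spectrum = neutral_sfs(10)   # hypothetical sample of 10 chromosomes
for count, proportion in enumerate(spectrum, start=1):
    print(f"derived allele seen {count} time(s): {proportion:.3f}")
```

Negative selection shifts weight toward the low-count classes beyond this baseline, which is the excess of rare derived alleles described in the second bullet.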

The Why

•  To deal with very short periods of
evolutionary time, ENCODE decided to use
primate-specific sequences.

[Figure: phylogenetic tree of three human samples, chimpanzee, gorilla, macaque, rat, and mouse, highlighting the primate-specific sequences.]

What is missing from the derived allele
frequency spectrum of primate-speciﬁc
elements in ENCODE?

Genes!

3,296,458 SNPs that are in annotated
coding regions are not found in the
ENCODE sample.

Missing populations and their effect on
estimates of derived alleles and
ancestral alleles.

Three human populations were
available at the time ENCODE was
submitted; ENCODE used only one.

[Figures: derived allele frequencies (%) in primate-specific sequences for the Caucasian, Asian, and Yoruba samples relative to an outgroup (OUT), distinguishing ancestral from derived alleles. Frequencies shown: 40, 60, 60 in one panel; 100, 20, 0 in the Yoruba panels.]

The ENCODE data includes 2,136
alleles with frequencies of exactly
0. In a miraculous feat of science,
ENCODE was able to determine
the frequencies of nonexistent
alleles.

ENCODE uses multifurcated trees

Frequency of derived allele = 40%

ENCODE uses multifurcated trees

Frequency of derived allele < 40%

ENCODE uses only a single species from the
primates.

There are no derived alleles

Unwarranted extrapolations:

Badly trained technicians
tend to “kill” junk DNA
whenever they find a new
function in non-coding DNA.

Even supposing that all the 55,000 putative lincRNAs in
this paper are functional and important, then

55,000 × 2000 bp = 110 Mb

(less than 4% of the human genome).

Showing that 4% of the genome is functional is “cool,”
but doesn’t bear on the question of “junk DNA,” which
has to do with the majority of the genome.
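The lincRNA bound is a one-line calculation. The sketch below repeats it, using the rounded Ensembl genome length quoted earlier in the lecture as the denominator; the variable names are illustrative.

```python
N_LINCRNAS = 55_000            # putative lincRNAs in the paper under discussion
LENGTH_PER_LINCRNA_BP = 2_000  # length assumed on the slide
GENOME_LENGTH_BP = 3.32e9      # Ensembl total length quoted earlier, rounded

total_lincrna_bp = N_LINCRNAS * LENGTH_PER_LINCRNA_BP
genome_fraction = total_lincrna_bp / GENOME_LENGTH_BP

print(f"{total_lincrna_bp / 1e6:.0f} Mb, or {genome_fraction:.1%} of the genome")
```

Even under the most generous assumption, more than 96% of the genome is left unaccounted for by these transcripts.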

Conclusion: Badly trained
technicians who do not understand
(1) population genetics, (2) the
concept of effective population size,
(3) random genetic drift, and (4)
the limitations of selection should
be forbidden to even mention
“junk DNA” let alone write papers
on the subject.

6 SEPTEMBER 2012

442 researchers + 288 million dollars.

What have we learned from ENCODE?

“Data is not information, information is
not knowledge, knowledge is not
wisdom, wisdom is not truth,”

—Robert Royar (1994) paraphrasing Frank
Zappa’s (1979) anadiplosis

“The onion test is a simple reality check for
anyone who thinks they have come up with a
universal function for 80% of the genome, or 100%
of the genome. Whatever the proposed function, ask
yourself this question: Can you explain why onions
need about five times more DNA than humans?”

T. Ryan Gregory

1.5 × 10^10 bp
Allium cepa

3.5 × 10^9 bp
Homo sapiens

“All science is either physics or
stamp collecting.”

Ernest Rutherford

“ENCODE is stamp collecting.”

Roderic Guigó

“I can think of better uses for
288 million dollars.”

Dan Graur

Acknowledgments: The Good Guys

Coauthors: Ricardo Azevedo, Becky Zufall, Nicholas Price, and
Yichen Zheng (UH), and Eran Elhaik (Johns Hopkins).

Reviewers: Giddy Landan (Heinrich Heine Universität, Germany),
Michael Lynch (Indiana University, USA), Naruya Saitou
(National Institute of Genetics, Japan), David Penny (Massey
University, New Zealand), W. Ford Doolittle (Dalhousie University,
Canada), plus 2 reviewers who think I don’t know who they are.

Editor: Bill Martin (Genome Biology and Evolution)
