SlideShare a Scribd company logo
1 of 5
Download to read offline
www.BioTechniques.com
747
Vol. 47 | No. 3 | 2009
Short Technical Reports
Introduction
The increasing use of digital technology
presents a challenge for existing storage
capabilities. The need for a reliable and
long-term solution for information storage
is further heightened by the prediction
that the current magnetic and optical
storage will become unrecoverable within
a century or less (1). DNA is a compact,
long-term, and proven medium for infor-
mation storage. Indeed, over the last few
decades, a good case has been made for
crucial information storage in DNA (2).
Desirable properties of DNA include
its capacity for long-term information
storage and recovery, which are mostly
independent of technological changes, the
ability to conceal data in a miniaturized
fashion and its ability to be transferred,
when required, via self propagation (1–6).
Variousapproachesforinformationcoding
in DNA have been reported, including the
Huffman code, the comma code, and the
alternatingcode(4),astraightcodingbased
on 3 bases per letter (1,2,6), or sequential
conversion of text to keyboard scan codes
followed by conversion to hexadecimal
code and then conversion to binary code
with a designed nucleotide encryption key
(5). Each approach offers advantages and
inherent difficulties, and differs in the
degree of economical use of nucleotides.
Wesoughttodevelopanalternateapproach
for information archiving in DNA. We
used the principles of the Huffman code
(4,7) to define DNA codons for the entire
keyboard, for unambiguous information
coding. The approach described in this
manuscript is based on the construction of
plasmid library for information archiving
with specially designed primers embedded
inthemessagesegmentwithanexon/intron
structure for rapid, reliable, and efficient
information retrieval.
Materials and methods
The DNA coding was based on modifi-
cation of the Huffman code (2,4,7,8). We
also adopted the nomenclature suggested
by Cox (2) for definition of the DNA
segment representing a single character
as ‘codon’. DNA (844 bp; Figure 1A) was
synthesized and inserted as a SacI/KpnI
fragment in pBluescript-based plasmid
(Mr.GeneGmbH,Regensburg,Germany).
Sequence confirmation of supplied
plasmid was provided by the manufac-
turer using plasmid universal primer. For
information retrieval, plasmid (300 ng/7
μL) was mixed with sequencing primer
(5 pmol/0.7μL; Sigma, Oakville, Ontario)
(Figure 1B) and subjected to sequencing
(service was performed by The Centre
for Applied Genomics, The Hospital for
Sick Children, Toronto, ON, Canada).
The chromatogram was created using the
FinchTV 1.4 application (Geospiza Inc.,
Seattle, WA, USA). Sequences of designed
and sequenced DNA were aligned using
bl2seq(NCBI,Bethesda,MD,USA).PCR
amplification was performed in iQ5 cycler
(Bio-Rad Laboratories, Mississauga, ON,
Canada). A reaction mixture contained
2 units Taq polymerase with 1× reaction
buffer (New England BioLabs, Pickering,
ON, Canada), 0.2 mM each dNTP
(Fermentas, Burlington, ON, Canada), 0.3
mM each primer, 200 ng plasmid DNA,
and UltraPure distilled water (Invitrogen,
Burlington, ON, Canada) to a volume of
20 μL. PCR conditions were 94°C for 3
min; 94°C for 30 s, 55°C for 30 s, and 72°C
for 60 s for 30 cycles; then 72°C for 7 min
finalextension;andholdat4°C.Tenmicro-
liters of PCR reaction was mixed with 2
μL 6× loading buffer (Fermentas). DNA
fragment size was determined by loading
in parallel 5 μL 100-bp DNA ladder
(Fermentas) and resolved on 1% agarose gel
(Bioshop, Burlington, ON, Canada). Gel
was visualized with UV transillumination,
and image was captured with Biospectrum
AC Imaging System (UVP, Upland, CA,
USA).
Results and discussion
Rationale for the improved
Huffman Method
One of the prerequisites for a good DNA
coding method is the economical use of
nucleotides per character. In the improved
Huffman coding described in this paper,
the bases-to-character ratio was ~3.5. This
numberismoreeconomicalcomparedwith
previously described methods that enable
entire keyboard coding [e.g., the comma
or alternating codes (6 bases/character)
(4) or sequential encryption (~5.3 bases/
character) (5)]. It should be noted that
other coding methods yielded lower base-
to-character ratios, but these approaches
were limited to a low number of characters,
usually sufficient only for text encoding of
the English alphabet (1,2,6). Information
storage of DNA in living organisms has a
disadvantage of losing the information as
a result of breakage by mutation, deletion,
and insertion of DNA (9). Yachie et al. (5)
described an alignment-based approach for
prolonged and reliable information storage
inDNAinlivingorganisms.Thisapproach
providesefficientrecoveryofstoreddataeven
for damaged DNA (9). While the size of
foreignDNAinsertedintoalivingorganism
may be limited (2), recent studies have
An improved Huffman coding
method for archiving text, images,
and music characters in DNA
Menachem Ailenberg and Ori D. Rotstein
Departments of Surgery, University of Toronto, and St. Michael’s Hospital,
Li Ka Shing Knowledge Institute, Keenan Research Centre, Toronto,
Ontario, Canada
BioTechniques 47:747-754 (September 2009) doi 10.2144/000113218
Keywords: DNA archiving; information storage in DNA
Supplementary material for this article is available at www.BioTechniques.com/article/113218.
An improved Huffman coding method for information storage in DNA
is described. The method entails the utilization of modified unambigu-
ous base assignment that enables efficient coding of characters. A plasmid-
based library with efficient and reliable information retrieval and assembly
with uniquely designed primers is described. We illustrate our approach by
synthesis of DNA that encodes text, images and music, which could easily
be retrieved by DNA sequencing using the specific primers. The method is
simple and lends itself to automated information retrieval.
www.BioTechniques.com
748
Vol. 47 | No. 3 | 2009
Short Technical Reports
demonstrated successful insertion of large
piecesofforeignDNAintolivingorganisms
(reviewedinReference9).Nevertheless,the
unique codon assignment of the Huffman
code as described by Smith et al. (4)—and
extended in this study—readily identifies
any frame shift that results from mutation
inDNAorerrorsinsequencing.Also,infor-
mationstorageinnakedorplasmidDNAis
not subjected to mutations from the added
stageofinsertionoftheencodingDNAinto
living organism. Therefore, we describe in
thiscommunicationanimprovedHuffman
coding with unique primer design using
plasmid DNA libraries.
General description of the
improved Huffman Method
Our method involves the creation of a
plasmidlibrarywithupto10,000bpworth
of information inserted into each plasmid.
A different index plasmid that contains
general information about the library such
astitle,authors,plasmidnumber,andprimers
assignments(seebelow),isalsoconstructed.
General information about the library is
initiallyretrievedbyDNAsequencingfrom
the index plasmid using plasmid-specific
universalprimersandthenfromtheplasmid
librarybyuniquelydesignedprimers(Figure
1B).Bancroftetal.(1)describedatechnique
for information storage in DNA that
utilized two classes of DNA: information-
containingDNAandpolyprimerkey(PPK)
DNA. According to this method, the PPK
containedtheentiresequencesoftheprimers.
Incontrast,ourindexplasmidcontainsonly
theinformationforthestructureoftheinfor-
mation library, and the sequencing primers
areembeddedintheinformationlibrary(for
furtherdescriptionofinformationretrieval,
seeSupplementaryMaterials).TheHuffman
codeforDNAencryptionsuggestedbySmith
etal.enablesencodingforonly26characters
(4). We sought to improve this approach in
ordertoenablecodingoftheentirestandard
computer keyboard. Since the retrieving
primersinourmethodcontainGCbasesin
the5′and3′ends,andthecodonssuggested
bySmithetal.areGC-rich,wefirstreplaced
CswithAsandGswithTs,andmoved,when
possible, the GC-containing codons down
the frequency table (see below).
Rulesforcodingoftext,
music,andimageswiththe
improvedHuffmancoding
Wedefinedrulesfortext,music,andimage
coding. For text coding (coding preceded
with “tx*” GTG TTCCT TACCA), we
created three columns headed by low–base
number DNA codons (G,TT,TA), and
placed the remaining codons in increasing
base number under each header codons
(Figure 2; Supplementary Materials). For
musical notes coding (coding preceded
with “mu*” TAAC TACT TACCA),
we utilized the one-column–modified
Huffman coding (Figure 3; Supple-
mentary Materials). For image coding
(coding preceded with “im*” GCG TAAC
TACCA), we utilized the one-column–
modified Huffman coding (Figure 4;
Supplementary Materials).
According to our design, the sequence
encoding the text, music, and image is part
of the information sequence. Sequencing
primers are embedded in the information
sequence in 500-bp intervals (Figure 5A).
This structure is not unlike genomic exon/
intron structure of genes. In our case, the
exons are usually 500 bp, and the introns
(sequencing primers) are 20–30 bp in size.
This unique repetitive structure allows for
easy pattern identification, even for those
who are not familiar with the rules of our
specific coding. This pattern also allows
for an algorithm to decipher the reading
frame of the message.
The index plasmid
The index plasmid is designed to contain
an insert with general identification infor-
mation (Figure 5B). The index plasmid is
different from the library information
plasmids. Therefore, for storage purposes,
the index plasmid can be mixed with the
library plasmids, and the index infor-
mation can be retrieved with generic
plasmid-specific primers. For more infor-
mation on the index plasmid, see the
Supplementary Materials. In addition,
the index plasmid contains sequences of
clusters of 14 (7 + 7) bases representing
therandomsequencesofthefirstprimerin
eachplasmid.Thisisessentialfortheinitial
sequencing of the plasmid and retrieving
the remaining sequencing primers, if they
are not available. Thus, in our particular
example, the index plasmid also contains
at its 3′ end the sequence CGGTGTAC-
GACACT. For the purpose of simplicity,
we describe only the theoretical aspects of
Figure 1. DNA sequence of insert and sequencing primers. (A) Sequence of 844-bp DNA that stores
information for the text, music, and image of the nursery rhyme Mary Had a Little Lamb described
in the text (Figures 2–4 and 6; Supplementary Figures 1 and 2). The text’s sequence is shown in
orange type, music in black type, and image in red type. Page break is shown in bold underlined
type. Characters like punctuation, spaces, and “shift” keys are shown in lower case. Note that the
sequence of the music is interrupted with sequencing primer 2 (intron). According to the assembly
rules of the information sequence (exon), the sequence of the primer is recognized as an intron,
excised, and the sequence of the flanking exons are spliced and read as continuous information.
The 844-bp DNA segment was synthesized and inserted to a pBluescript-based plasmid as a SacI/
KpnI fragment. (B) Sequencing sense and anti-sense primers utilized for information retrieval. Blue
bold type depicts the 5′ end of sense primers and corresponding 3′ sequence of anti-sense primers.
Green bold type depicts 3′ the end of sense primers and corresponding 5′ sequence of anti-sense
primers. Lowercase black type depicts seven bases of random sequence that flank the plasmid
and primer numbers, designed to provide uniqueness to each primer and reduce mis-priming in
the sequencing reaction. Pink bold type depicts the plasmid number (in this particular example,
the number 1 in all cases). Red uppercase type depicts primer number (in this particular case, 1
and 2 for sense primers, and 0 and 2 for anti-sense primers). For this particular study, we assumed
a 1-digit plasmid number. When more than 10 plasmids are present, a space should be inserted
to distinguish between the plasmid and primer number. (C) DNA codons assigned to plasmid and
primer number codes.
A
B
www.BioTechniques.com
750
Vol. 47 | No. 3 | 2009
Short Technical Reports
the index plasmidwithoutactualsynthesis
of the insert.
Illustration of the improved
Huffman coding
To illustrate our method, a 844-bp DNA
fragment (Figure 1A) was synthesized. This
DNA fragment contained encoded infor-
mation for text, musical notes, and image
for the nursery rhyme “Mary Had a Little
Lamb,” by Sarah Josepha Hale (10), and
was constructed according to the principles
outlinedinFigures1–5.TheDNAsequence
for the text and musical notes of the rhyme
are shown in Figures 2 and 3, respectively.
We also show here a simple image of a
lamb (Figure 6). The image is drawn in a
field of 10 × 10 units. The head, body and
ear of the lamb are defined by ellipses, the
eye by a circle, the legs by lines, and the tail
by a rectangle. It should be noted that the
concise definition of the geometrical shapes
in DNA codes described here allows for
economical coding of the lamb image by
only 238 bp of DNA. For retrieving infor-
mation from the 844-bp DNA fragment
in our particular example, sense primers 1
and 2 were flanking a 500-bp segment, and
anti-sense primers 2 and 0 were flanking a
344-bpsegment(Figure1).Toillustratethe
utilityofthespecificityoftheprimerdesign,
a PCR reaction employing the 3 primer sets
wasperformed.AsdemonstratedinFigure7,
the 3 amplification products corresponding
to the expected amplicons sizes were noted.
Unique sequencing primers
for information retrieval
The sequencing primers (Figures 1 and
5) were flanked by 5′ CGC and 3′ GCC
(sense) or 5′-GGC and 3′-GCG (anti-sense)
for easy identification and also for creating
Figure 2. Modification of Huffman coding for un-
ambiguous DNA coding for text. (A) DNA codons
according to frequencies (4) were modified: Cs
were replaced with As and Gs were replaced
with Ts, and CG-rich codons were moved down
the frequency table. Three low–base number
DNA codons (G, TT, and TA) were placed as
group headers, and the rest of the codons were
placed in each group in decreasing frequency
order, allowing for 69 unambiguous codons.
Characters of the computer keyboard were
assigned a unique DNA codon and placed in
the table in decreasing frequency order as de-
scribed in the “Results and discussion” section,
yielding an average of 4.9 bases per character.
It should be noted that in effect, as defined by
the principle of Huffman coding (7), the base/
character ratio is lower due to the fact that more
frequent characters are assigned fewer bases.
In this communication, text encoding averaged
~3.5 bases per codon. (B) Encoding text of the
nursery rhyme Mary Had a Little Lamb to DNA
utilizing the principles in panel A. Upper row
shows the lyrics and lower row shows the cor-
responding DNA codons.
A
B
• ProScanTM
II Series stages with
Intelligent Scanning Technology
provide the highest precision
in its class
FLUORESCENCE ILLUMINATION
MOTORIZED STAGES
• Lumen 200/200Pro series 200Watt
fluorescence illumination systems
MICROSCOPE AUTOMATION
WHERE VISION MEETS PRECISION
Prior Scientific, Inc.
80 Reservoir Park Drive
Rockland, MA. 02370
800-877-2234
www.prior.com
•Compatible with Prior Scientific’s
line of NanoScanZ Piezo Z Stages
• OEM and custom stage systems
are available
• Built-in high speed light attenuator and
six position high speed filter wheel
• Pre-centered bulb with 2,000
hour life
• Stabilized DC power supply
eliminates variations in light
intensity
www.BioTechniques.com
752
Vol. 47 | No. 3 | 2009
Short Technical Reports
a triple-GC clamp at the 3′ end for tight
hybridization. In the middle of the primer,
we reserved space for coding the plasmid
number and the primer number. Primer
number 1 indicates the sequence of the 5′
segment of the information insert, and
primer number 0 indicates the sequence of
the 3′ end of the information insert. Other
primers are identified in ascending order,
to a maximum of 20 in a 10,000-bp infor-
mationinsert.Wespecifiedheretheplasmid
andprimernumberinsingledigits.However,
whenmorethan10plasmidsareincludedin
the library, a space character (GAT) should
be inserted to distinguish between the
plasmid and primer numbers. Importantly,
the plasmid and primer number encoded in
the sequencing primer were designed to be
flanked by a random seven-base sequence (a
totalof14basesperprimer)toprovideprimer
specificitywhenusedforsequencingbothin
the sense and anti-sense orientation, or for
PCR amplification.
Information retrieval by sequencing
In general, sequencing reactions do not
providesequencinginformationimmediately
Figure 3. DNA codes for music storage in DNA, and DNA codons for the tune of the nursery rhyme
Mary Had a Little Lamb. (A) Note values and pitches, as well as meters and repeats, are assigned
to DNA codons (note values are demonstrated in Supplementary Figure 1). (B) Each musical note
is accompanied by its value. Upper row shows the lyrics. Middle row shows the notes and values,
and the lower row shows the corresponding notes and values encrypted in DNA. The code for music
(“mu*”) at the beginning of the message distinguishes the music part from the text and image in
the information insert, and the 4/4 musical meter code that follows the music code. Also, brackets
and “× 2” indicate that the tune repeats 2 times with different lyrics. Note pitches and values in
DNA sequence are depicted in upper case, and “mu*,” as well as meter and repeat sequences, are
depicted in lower case.
Figure 4. DNA codons for image storage. (A) Numeric and shape codon assignment. Note the distinc-
tion between “a” (angle of shape relative to horizontal line) and “an” (angle of triangle between two
sides). (B) Parameter for drawing shapes.
Figure 5. Strategy for DNA archiving. (A) Construction the information library: DNA information sequence (≤10,000 bp) is synthesized according to DNA
codons specified in Figures 2–4 and “Results and discussion” section. Primer sequences of ~20–30 bp are embedded in the information sequence
every 500 bp, creating islets of intron sequences. Also depicted in this figure is the structure of the primer and the locations of the plasmid and primer
numbers, the locations of the flanking GC sequences of the primers, and the flanking scrambled sequences. For more details, see “Results and discus-
sion” section and Figure 1. (B) Structure of index plasmid. General index information is flanked by plasmid-specific sequence utilized to retrieve index
information. This plasmid is different from the library plasmid in order to retrieve the index information without conflict. Information retrieval from the
library plasmid is achieved by using specially designed specific primers (panel A and Figure 1B). Note: In this paper, we designed only an 844-bp
information insert for the purpose of proof-of-principle, and for simplicity, the index plasmid theoretical consideration is described, but with no actual
synthesis of DNA insert in the plasmid.
A B
A B
A B
www.BioTechniques.com
754
Vol. 47 | No. 3 | 2009
Short Technical Reports
adjacenttothesequencingprimer.Moreover,
thesequencecouldoccasionallybe<500bp,
orabasecallcanbeinconclusive.Therefore,
sequencing in both orientations may be
required. With the unique design of our
primers, sequencing could be achieved with
ahighdegreeofaccuracy.Agoodsequencing
practice for retrieving information from an
informationlibrarymightbetoinitiallyuse
thesenseprimersonly,andanti-senseprimer
2 to retrieve the information of the 5′ end.
Thereafter, the other anti-sense primers can
beutilizedasrequired.Forsequencinginour
particular example, we used sense primers
1 and 2, and anti-sense primers 2 and 0.
Sequencingwithprimer1yieldeda1276-bp
product. Since our insert was 844 bp, it also
yielded a plasmid sequence downstream
from the insert. However, the sequence was
missing 41 bp just downstream of primer 1.
This sequence was retrieved with anti-sense
primer 2. Together, these two sequencing
reactions retrieved the original information
with 100% accuracy. We recognize that a
sequencing reaction cannot always retrieve
1276 bp as was the case with primer 1, and
therefore acknowledge that additional
sequencing may be required. As mentioned
above, we further retrieved the insert infor-
mationwithadditionalsequencingwithsense
primer2andanti-senseprimer0,againwith
100% accuracy, compared with the initial
designedinsert.Wethendecodedtheinfor-
mation obtained by DNA sequencing with
theguidelinesprovidedinFigures2–4,and
wereabletoreconstructthetext,music,and
image encrypted in the DNA. An example
of part of a sequencing chromatogram
achieved with sense primer 1 is shown in
Supplementary Figure 2. This particular
sequencedinformationistheDNAsequence
coding the rectangle that constructs the
image of the lamb’s tail (bases 647–674 on
the chromatogram) with 100% accuracy,
compared with the original sequence.
In addition to the inherent advanta-
geous attributes of information storage
in DNA, our improved Huffman coding
method for use of unambiguous DNA
codingforarchivingofferseconomical,easy
pattern recognition and message retrieval
through specially designed primers. As
DNA synthesis and sequencing become
faster and cheaper (Genome Analyzer
Sequencing System, Illumina, San Diego,
CA, USA; 454 Sequencing, Roche,
Branford, CT, USA) information storage
in DNA becomes even more attractive.
Acknowledgments
This study was supported by the Canadian
Institutes of Health Research (CIHR;
grant no. 37779).
The authors declare no competing
interests.
References
	 1.	Bancroft, C., T. Bowler, B. Bloom, and K.T.
Clelland. 2001. Long-term storage of infor-
mation in DNA. Science 293:1763-1765.
	 2.	Cox, J.P.L. 2001. Long-term data storage in
DNA. Trends Biotechnol. 19:247-250.
	 3.	Wong, P.C., K.-K. Wong, and H. Foote. 2003.
OrganicdatamemoryusingtheDNAapproach.
Commun. ACM 46:95-98.
	 4.	Smith, G.C., C.C. Fiddes, J.P. Hawkins, and
J.P.L. Cox. 2003. Some possible codes for
encrypting data in DNA. Biotechnol. Lett.
25:1125-1130.
	 5.	Yachie,N.,K.Sekiyama,J.Sugahara,Y.Ohashi,
andM.Tomita.2007.Alignment-basedapproach
for durable data storage into living organisms.
Biotechnol. Prog. 23:501-505.
	 6.	Clelland,C.T.,V.Risca,andC.Bancroft.1999.
Hiding messages in DNA microdots. Nature
399:533-534.
	 7.	Huffman, D.A. 1953. A method for the
construction of minimum-redundancy codes.
Proc. IRE. 40:1098-1101.
	 8.	Doig, A.J. 1997. Improving the efficiency of
the genetic code by varying the codon length-
theperfectgeneticcode.J.Theor.Biol.188:355-
360.
	 9.	Yachie, N., Y. Ohashi, and M. Tomita. 2008.
Stabilizing synthetic data in the DNA of living
organisms. Syst. Synth. Biol. 2:19-25.
	10.	Hughes, C. W. 1980. American hymns old and
new:notesonthehymnsandbiographiesofthe
authors and composers, Columbia University
Press, NY.
	11.	Lewand, R.E. 2000. Cryptographical Mathe-
matics. The Mathematical Association of
America Publishing, Washington, DC.
Received 28 March 2009; accepted 9 July 2009.
Address correspondence to Menachem Ailenberg,
St. Michael’s Hospital, 30 Bond Street, 16CC-
044, Toronto, Ontario, Canada M5B 1W8. email:
m.ailenberg@utoronto.ca
Figure 6. Image of lamb encrypted in DNA. (A)
DNA codons were defined as described in Fig-
ure 4. Upper row identifies the organ of the
lamb, middle row defines the shapes and their
parameters as defined in Figure 4, and lower
row shows image encrypted in DNA codons.
Also shown at the top of this panel is the DNA
code that precedes the image (“im*”). (B) Im-
age was created by using shapes (ellipse, circle,
line, rectangular) as described in the “Results
and discussion” section and Figure 4. A field
of 10 × 10 units was created, and shapes were
inserted. An example of the tail sequencing is
shown in Supplementary Figure 2.
Figure 7. PCR amplification of DNA encrypting
text, image, and music, using specific primers.
PCR was performed using as template a plas-
mid containing the 844-bp synthesized DNA
that encrypts text, music, and image. Numbers
on the left indicate DNA fragment size, in bp.
Lane 1, 100-bp DNA ladder; Lane 2, amplifica-
tion using sense primer 1 and anti-sense primer
2; Lane 3, amplification using sense primer 2
and anti-sense primer 0; Lane 4, amplification
using sense primer 1 and anti-sense primer
0. Note that the amplified bands in lanes 2,
3, and 4 correspond to the expected sizes of
553 bp, 314 bp, and 844 bp, respectively. For
conditions of PCR amplification, consult the
“Materials and Methods” section. For primer
assignment, see Figure 1.
A
B

More Related Content

What's hot

What's hot (20)

Finding Allelic Frequencies Using MapReduce/Hadoop
Finding Allelic Frequencies Using MapReduce/HadoopFinding Allelic Frequencies Using MapReduce/Hadoop
Finding Allelic Frequencies Using MapReduce/Hadoop
 
Identifying the Coding and Non Coding Regions of DNA Using Spectral Analysis
Identifying the Coding and Non Coding Regions of DNA Using Spectral AnalysisIdentifying the Coding and Non Coding Regions of DNA Using Spectral Analysis
Identifying the Coding and Non Coding Regions of DNA Using Spectral Analysis
 
GIAB for AMP GeT-RM Forum
GIAB for AMP GeT-RM ForumGIAB for AMP GeT-RM Forum
GIAB for AMP GeT-RM Forum
 
RNA-seq quality control and pre-processing
RNA-seq quality control and pre-processingRNA-seq quality control and pre-processing
RNA-seq quality control and pre-processing
 
Jason Chin MHC diploid assembly
Jason Chin MHC diploid assemblyJason Chin MHC diploid assembly
Jason Chin MHC diploid assembly
 
Genome annotation 2013
Genome annotation 2013Genome annotation 2013
Genome annotation 2013
 
RNA-seq differential expression analysis
RNA-seq differential expression analysisRNA-seq differential expression analysis
RNA-seq differential expression analysis
 
De novo transcriptome assembly of solid sequencing data in cucumis melo
De novo transcriptome assembly of solid sequencing data in cucumis meloDe novo transcriptome assembly of solid sequencing data in cucumis melo
De novo transcriptome assembly of solid sequencing data in cucumis melo
 
ChipSeq Data Analysis
ChipSeq Data AnalysisChipSeq Data Analysis
ChipSeq Data Analysis
 
GIAB ASHG 2019 Small Variant poster
GIAB ASHG 2019 Small Variant posterGIAB ASHG 2019 Small Variant poster
GIAB ASHG 2019 Small Variant poster
 
An introduction to RNA-seq data analysis
An introduction to RNA-seq data analysisAn introduction to RNA-seq data analysis
An introduction to RNA-seq data analysis
 
GRC GIAB Workshop ASHG 2019 Small Variant Benchmark
GRC GIAB Workshop ASHG 2019 Small Variant BenchmarkGRC GIAB Workshop ASHG 2019 Small Variant Benchmark
GRC GIAB Workshop ASHG 2019 Small Variant Benchmark
 
2014 11 03_bioinformatics_case_studies
2014 11 03_bioinformatics_case_studies2014 11 03_bioinformatics_case_studies
2014 11 03_bioinformatics_case_studies
 
GIAB ASHG 2019 Structural Variant poster
GIAB ASHG 2019 Structural Variant posterGIAB ASHG 2019 Structural Variant poster
GIAB ASHG 2019 Structural Variant poster
 
GIAB Technical Germline Benchmark roadmap discussion
GIAB Technical Germline Benchmark roadmap discussionGIAB Technical Germline Benchmark roadmap discussion
GIAB Technical Germline Benchmark roadmap discussion
 
presentation
presentationpresentation
presentation
 
IRJET- DNA Cryptography
IRJET-  	  DNA CryptographyIRJET-  	  DNA Cryptography
IRJET- DNA Cryptography
 
RNASeq - Analysis Pipeline for Differential Expression
RNASeq - Analysis Pipeline for Differential ExpressionRNASeq - Analysis Pipeline for Differential Expression
RNASeq - Analysis Pipeline for Differential Expression
 
GIAB Benchmarks for SVs and Repeats for stanford genetics sv 200511
GIAB Benchmarks for SVs and Repeats for stanford genetics sv 200511GIAB Benchmarks for SVs and Repeats for stanford genetics sv 200511
GIAB Benchmarks for SVs and Repeats for stanford genetics sv 200511
 
Predicting phenotype from genotype with machine learning
Predicting phenotype from genotype with machine learningPredicting phenotype from genotype with machine learning
Predicting phenotype from genotype with machine learning
 

Similar to 1 2 10.1.1.468.7609

Bioinformatics.Practical Notebook
Bioinformatics.Practical NotebookBioinformatics.Practical Notebook
Bioinformatics.Practical Notebook
Naima Tahsin
 
_Nitant_Choksi_CAP6545_Final_Project.pdf
_Nitant_Choksi_CAP6545_Final_Project.pdf_Nitant_Choksi_CAP6545_Final_Project.pdf
_Nitant_Choksi_CAP6545_Final_Project.pdf
NitantChoksi1
 
Overview of methods for variant calling from next-generation sequence data
Overview of methods for variant calling from next-generation sequence dataOverview of methods for variant calling from next-generation sequence data
Overview of methods for variant calling from next-generation sequence data
Thomas Keane
 
Improving DNA Barcode-based Fish Identification System on Imbalanced Data usi...
Improving DNA Barcode-based Fish Identification System on Imbalanced Data usi...Improving DNA Barcode-based Fish Identification System on Imbalanced Data usi...
Improving DNA Barcode-based Fish Identification System on Imbalanced Data usi...
TELKOMNIKA JOURNAL
 
A Novel Framework for Short Tandem Repeats (STRs) Using Parallel String Matching
A Novel Framework for Short Tandem Repeats (STRs) Using Parallel String MatchingA Novel Framework for Short Tandem Repeats (STRs) Using Parallel String Matching
A Novel Framework for Short Tandem Repeats (STRs) Using Parallel String Matching
IJERA Editor
 

Similar to 1 2 10.1.1.468.7609 (20)

De Novo
De NovoDe Novo
De Novo
 
Dna data compression algorithms based on redundancy
Dna data compression algorithms based on redundancyDna data compression algorithms based on redundancy
Dna data compression algorithms based on redundancy
 
SBVRLDNACOMP:AN EFFECTIVE DNA SEQUENCE COMPRESSION ALGORITHM
 SBVRLDNACOMP:AN EFFECTIVE DNA SEQUENCE COMPRESSION ALGORITHM SBVRLDNACOMP:AN EFFECTIVE DNA SEQUENCE COMPRESSION ALGORITHM
SBVRLDNACOMP:AN EFFECTIVE DNA SEQUENCE COMPRESSION ALGORITHM
 
50320130403003 2
50320130403003 250320130403003 2
50320130403003 2
 
Loss less DNA Solidity Using Huffman and Arithmetic Coding
Loss less DNA Solidity Using Huffman and Arithmetic CodingLoss less DNA Solidity Using Huffman and Arithmetic Coding
Loss less DNA Solidity Using Huffman and Arithmetic Coding
 
Bioinformatics.Practical Notebook
Bioinformatics.Practical NotebookBioinformatics.Practical Notebook
Bioinformatics.Practical Notebook
 
DATA ENCRYPTION USING BIO MOLECULAR INFORMATION
DATA ENCRYPTION USING BIO MOLECULAR INFORMATIONDATA ENCRYPTION USING BIO MOLECULAR INFORMATION
DATA ENCRYPTION USING BIO MOLECULAR INFORMATION
 
_Nitant_Choksi_CAP6545_Final_Project.pdf
_Nitant_Choksi_CAP6545_Final_Project.pdf_Nitant_Choksi_CAP6545_Final_Project.pdf
_Nitant_Choksi_CAP6545_Final_Project.pdf
 
[2017-05-29] DNASmartTagger
[2017-05-29] DNASmartTagger [2017-05-29] DNASmartTagger
[2017-05-29] DNASmartTagger
 
Dgaston dec-06-2012
Dgaston dec-06-2012Dgaston dec-06-2012
Dgaston dec-06-2012
 
Secure data transmission using dna encryption
Secure data transmission using dna encryptionSecure data transmission using dna encryption
Secure data transmission using dna encryption
 
Iplant pag
Iplant pagIplant pag
Iplant pag
 
Rnaseq forgenefinding
Rnaseq forgenefindingRnaseq forgenefinding
Rnaseq forgenefinding
 
International Journal of Engineering Research and Development
International Journal of Engineering Research and DevelopmentInternational Journal of Engineering Research and Development
International Journal of Engineering Research and Development
 
Crypt Sequence DNA
Crypt Sequence DNACrypt Sequence DNA
Crypt Sequence DNA
 
Overview of methods for variant calling from next-generation sequence data
Overview of methods for variant calling from next-generation sequence dataOverview of methods for variant calling from next-generation sequence data
Overview of methods for variant calling from next-generation sequence data
 
Improving DNA Barcode-based Fish Identification System on Imbalanced Data usi...
Improving DNA Barcode-based Fish Identification System on Imbalanced Data usi...Improving DNA Barcode-based Fish Identification System on Imbalanced Data usi...
Improving DNA Barcode-based Fish Identification System on Imbalanced Data usi...
 
A Novel Framework for Short Tandem Repeats (STRs) Using Parallel String Matching
A Novel Framework for Short Tandem Repeats (STRs) Using Parallel String MatchingA Novel Framework for Short Tandem Repeats (STRs) Using Parallel String Matching
A Novel Framework for Short Tandem Repeats (STRs) Using Parallel String Matching
 
sb400161v
sb400161vsb400161v
sb400161v
 
Bio305 genome analysis and annotation 2012
Bio305 genome analysis and annotation 2012Bio305 genome analysis and annotation 2012
Bio305 genome analysis and annotation 2012
 

Recently uploaded

Corporate Presentation Probe May 2024.pdf
Corporate Presentation Probe May 2024.pdfCorporate Presentation Probe May 2024.pdf
Corporate Presentation Probe May 2024.pdf
Probe Gold
 
Russian Call Girls Rohini Sector 3 💓 Delhi 9999965857 @Sabina Modi VVIP MODEL...
Russian Call Girls Rohini Sector 3 💓 Delhi 9999965857 @Sabina Modi VVIP MODEL...Russian Call Girls Rohini Sector 3 💓 Delhi 9999965857 @Sabina Modi VVIP MODEL...
Russian Call Girls Rohini Sector 3 💓 Delhi 9999965857 @Sabina Modi VVIP MODEL...
Call Girls In Delhi Whatsup 9873940964 Enjoy Unlimited Pleasure
 
Ambala Escorts Service ☎️ 6378878445 ( Sakshi Sinha ) High Profile Call Girls...
Ambala Escorts Service ☎️ 6378878445 ( Sakshi Sinha ) High Profile Call Girls...Ambala Escorts Service ☎️ 6378878445 ( Sakshi Sinha ) High Profile Call Girls...
Ambala Escorts Service ☎️ 6378878445 ( Sakshi Sinha ) High Profile Call Girls...
mriyagarg453
 
B2 Interpret the brief.docxccccccccccccccc
B2 Interpret the brief.docxcccccccccccccccB2 Interpret the brief.docxccccccccccccccc
B2 Interpret the brief.docxccccccccccccccc
MollyBrown86
 
Rohini Sector 17 Call Girls Delhi 9999965857 @Sabina Saikh No Advance
Rohini Sector 17 Call Girls Delhi 9999965857 @Sabina Saikh No AdvanceRohini Sector 17 Call Girls Delhi 9999965857 @Sabina Saikh No Advance
Rohini Sector 17 Call Girls Delhi 9999965857 @Sabina Saikh No Advance
Call Girls In Delhi Whatsup 9873940964 Enjoy Unlimited Pleasure
 
Rohini Sector 15 Call Girls Delhi 9999965857 @Sabina Saikh No Advance
Rohini Sector 15 Call Girls Delhi 9999965857 @Sabina Saikh No AdvanceRohini Sector 15 Call Girls Delhi 9999965857 @Sabina Saikh No Advance
Rohini Sector 15 Call Girls Delhi 9999965857 @Sabina Saikh No Advance
Call Girls In Delhi Whatsup 9873940964 Enjoy Unlimited Pleasure
 
slideshare Call girls Noida Escorts 9999965857 henakhan
slideshare Call girls Noida Escorts 9999965857 henakhanslideshare Call girls Noida Escorts 9999965857 henakhan
slideshare Call girls Noida Escorts 9999965857 henakhan
hanshkumar9870
 

Recently uploaded (20)

Corporate Presentation Probe May 2024.pdf
Corporate Presentation Probe May 2024.pdfCorporate Presentation Probe May 2024.pdf
Corporate Presentation Probe May 2024.pdf
 
@9999965857 🫦 Sexy Desi Call Girls Karol Bagh 💓 High Profile Escorts Delhi 🫶
@9999965857 🫦 Sexy Desi Call Girls Karol Bagh 💓 High Profile Escorts Delhi 🫶@9999965857 🫦 Sexy Desi Call Girls Karol Bagh 💓 High Profile Escorts Delhi 🫶
@9999965857 🫦 Sexy Desi Call Girls Karol Bagh 💓 High Profile Escorts Delhi 🫶
 
Q3 FY24 Earnings Conference Call Presentation
Q3 FY24 Earnings Conference Call PresentationQ3 FY24 Earnings Conference Call Presentation
Q3 FY24 Earnings Conference Call Presentation
 
Vip Call Girls Vasant Kunj ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
Vip Call Girls Vasant Kunj ➡️ Delhi ➡️ 9999965857 No Advance 24HRS LiveVip Call Girls Vasant Kunj ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
Vip Call Girls Vasant Kunj ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
 
(‿ˠ‿) Independent Call Girls Laxmi Nagar 👉 9999965857 👈 Delhi : 9999 Cash Pa...
(‿ˠ‿) Independent Call Girls Laxmi Nagar 👉 9999965857 👈 Delhi  : 9999 Cash Pa...(‿ˠ‿) Independent Call Girls Laxmi Nagar 👉 9999965857 👈 Delhi  : 9999 Cash Pa...
(‿ˠ‿) Independent Call Girls Laxmi Nagar 👉 9999965857 👈 Delhi : 9999 Cash Pa...
 
@9999965857 🫦 Sexy Desi Call Girls Janakpuri 💓 High Profile Escorts Delhi 🫶
@9999965857 🫦 Sexy Desi Call Girls Janakpuri 💓 High Profile Escorts Delhi 🫶@9999965857 🫦 Sexy Desi Call Girls Janakpuri 💓 High Profile Escorts Delhi 🫶
@9999965857 🫦 Sexy Desi Call Girls Janakpuri 💓 High Profile Escorts Delhi 🫶
 
Vip Call Girls South Ex ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
Vip Call Girls South Ex ➡️ Delhi ➡️ 9999965857 No Advance 24HRS LiveVip Call Girls South Ex ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
Vip Call Girls South Ex ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
 
Russian Call Girls Rohini Sector 3 💓 Delhi 9999965857 @Sabina Modi VVIP MODEL...
Russian Call Girls Rohini Sector 3 💓 Delhi 9999965857 @Sabina Modi VVIP MODEL...Russian Call Girls Rohini Sector 3 💓 Delhi 9999965857 @Sabina Modi VVIP MODEL...
Russian Call Girls Rohini Sector 3 💓 Delhi 9999965857 @Sabina Modi VVIP MODEL...
 
Call Girls 🫤 East Of Kailash ➡️ 9999965857 ➡️ Delhi 🫦 Russian Escorts FULL ...
Call Girls 🫤 East Of Kailash ➡️ 9999965857  ➡️ Delhi 🫦  Russian Escorts FULL ...Call Girls 🫤 East Of Kailash ➡️ 9999965857  ➡️ Delhi 🫦  Russian Escorts FULL ...
Call Girls 🫤 East Of Kailash ➡️ 9999965857 ➡️ Delhi 🫦 Russian Escorts FULL ...
 
VVIP Pune Call Girls Sopan Baug WhatSapp Number 8005736733 With Elite Staff A...
VVIP Pune Call Girls Sopan Baug WhatSapp Number 8005736733 With Elite Staff A...VVIP Pune Call Girls Sopan Baug WhatSapp Number 8005736733 With Elite Staff A...
VVIP Pune Call Girls Sopan Baug WhatSapp Number 8005736733 With Elite Staff A...
 
Ambala Escorts Service ☎️ 6378878445 ( Sakshi Sinha ) High Profile Call Girls...
Ambala Escorts Service ☎️ 6378878445 ( Sakshi Sinha ) High Profile Call Girls...Ambala Escorts Service ☎️ 6378878445 ( Sakshi Sinha ) High Profile Call Girls...
Ambala Escorts Service ☎️ 6378878445 ( Sakshi Sinha ) High Profile Call Girls...
 
B2 Interpret the brief.docxccccccccccccccc
B2 Interpret the brief.docxcccccccccccccccB2 Interpret the brief.docxccccccccccccccc
B2 Interpret the brief.docxccccccccccccccc
 
Diligence Checklist for Early Stage Startups
Diligence Checklist for Early Stage StartupsDiligence Checklist for Early Stage Startups
Diligence Checklist for Early Stage Startups
 
✂️ 👅 Independent Goregaon Escorts With Room Vashi Call Girls 💃 9004004663
✂️ 👅 Independent Goregaon Escorts With Room Vashi Call Girls 💃 9004004663✂️ 👅 Independent Goregaon Escorts With Room Vashi Call Girls 💃 9004004663
✂️ 👅 Independent Goregaon Escorts With Room Vashi Call Girls 💃 9004004663
 
BDSM⚡Call Girls in Hari Nagar Delhi >༒8448380779 Escort Service
BDSM⚡Call Girls in Hari Nagar Delhi >༒8448380779 Escort ServiceBDSM⚡Call Girls in Hari Nagar Delhi >༒8448380779 Escort Service
BDSM⚡Call Girls in Hari Nagar Delhi >༒8448380779 Escort Service
 
Rohini Sector 17 Call Girls Delhi 9999965857 @Sabina Saikh No Advance
Rohini Sector 17 Call Girls Delhi 9999965857 @Sabina Saikh No AdvanceRohini Sector 17 Call Girls Delhi 9999965857 @Sabina Saikh No Advance
Rohini Sector 17 Call Girls Delhi 9999965857 @Sabina Saikh No Advance
 
Rohini Sector 15 Call Girls Delhi 9999965857 @Sabina Saikh No Advance
Rohini Sector 15 Call Girls Delhi 9999965857 @Sabina Saikh No AdvanceRohini Sector 15 Call Girls Delhi 9999965857 @Sabina Saikh No Advance
Rohini Sector 15 Call Girls Delhi 9999965857 @Sabina Saikh No Advance
 
slideshare Call girls Noida Escorts 9999965857 henakhan
slideshare Call girls Noida Escorts 9999965857 henakhanslideshare Call girls Noida Escorts 9999965857 henakhan
slideshare Call girls Noida Escorts 9999965857 henakhan
 
(👉゚9999965857 ゚)👉 Russian Call Girls Aerocity 👉 Delhi 👈 : 9999 Cash Payment F...
(👉゚9999965857 ゚)👉 Russian Call Girls Aerocity 👉 Delhi 👈 : 9999 Cash Payment F...(👉゚9999965857 ゚)👉 Russian Call Girls Aerocity 👉 Delhi 👈 : 9999 Cash Payment F...
(👉゚9999965857 ゚)👉 Russian Call Girls Aerocity 👉 Delhi 👈 : 9999 Cash Payment F...
 
(👉゚9999965857 ゚)👉 VIP Call Girls Friends Colony 👉 Delhi 👈 : 9999 Cash Payment...
(👉゚9999965857 ゚)👉 VIP Call Girls Friends Colony 👉 Delhi 👈 : 9999 Cash Payment...(👉゚9999965857 ゚)👉 VIP Call Girls Friends Colony 👉 Delhi 👈 : 9999 Cash Payment...
(👉゚9999965857 ゚)👉 VIP Call Girls Friends Colony 👉 Delhi 👈 : 9999 Cash Payment...
 

1 2 10.1.1.468.7609

  • 1. www.BioTechniques.com 747 Vol. 47 | No. 3 | 2009 Short Technical Reports Introduction The increasing use of digital technology presents a challenge for existing storage capabilities. The need for a reliable and long-term solution for information storage is further heightened by the prediction that the current magnetic and optical storage will become unrecoverable within a century or less (1). DNA is a compact, long-term, and proven medium for infor- mation storage. Indeed, over the last few decades, a good case has been made for crucial information storage in DNA (2). Desirable properties of DNA include its capacity for long-term information storage and recovery, which are mostly independent of technological changes, the ability to conceal data in a miniaturized fashion and its ability to be transferred, when required, via self propagation (1–6). Variousapproachesforinformationcoding in DNA have been reported, including the Huffman code, the comma code, and the alternatingcode(4),astraightcodingbased on 3 bases per letter (1,2,6), or sequential conversion of text to keyboard scan codes followed by conversion to hexadecimal code and then conversion to binary code with a designed nucleotide encryption key (5). Each approach offers advantages and inherent difficulties, and differs in the degree of economical use of nucleotides. Wesoughttodevelopanalternateapproach for information archiving in DNA. We used the principles of the Huffman code (4,7) to define DNA codons for the entire keyboard, for unambiguous information coding. The approach described in this manuscript is based on the construction of plasmid library for information archiving with specially designed primers embedded inthemessagesegmentwithanexon/intron structure for rapid, reliable, and efficient information retrieval. Materials and methods The DNA coding was based on modifi- cation of the Huffman code (2,4,7,8). We also adopted the nomenclature suggested by Cox (2) for definition of the DNA segment representing a single character as ‘codon’. DNA (844 bp; Figure 1A) was synthesized and inserted as a SacI/KpnI fragment in pBluescript-based plasmid (Mr.GeneGmbH,Regensburg,Germany). Sequence confirmation of supplied plasmid was provided by the manufac- turer using plasmid universal primer. For information retrieval, plasmid (300 ng/7 μL) was mixed with sequencing primer (5 pmol/0.7μL; Sigma, Oakville, Ontario) (Figure 1B) and subjected to sequencing (service was performed by The Centre for Applied Genomics, The Hospital for Sick Children, Toronto, ON, Canada). The chromatogram was created using the FinchTV 1.4 application (Geospiza Inc., Seattle, WA, USA). Sequences of designed and sequenced DNA were aligned using bl2seq(NCBI,Bethesda,MD,USA).PCR amplification was performed in iQ5 cycler (Bio-Rad Laboratories, Mississauga, ON, Canada). A reaction mixture contained 2 units Taq polymerase with 1× reaction buffer (New England BioLabs, Pickering, ON, Canada), 0.2 mM each dNTP (Fermentas, Burlington, ON, Canada), 0.3 mM each primer, 200 ng plasmid DNA, and UltraPure distilled water (Invitrogen, Burlington, ON, Canada) to a volume of 20 μL. PCR conditions were 94°C for 3 min; 94°C for 30 s, 55°C for 30 s, and 72°C for 60 s for 30 cycles; then 72°C for 7 min finalextension;andholdat4°C.Tenmicro- liters of PCR reaction was mixed with 2 μL 6× loading buffer (Fermentas). DNA fragment size was determined by loading in parallel 5 μL 100-bp DNA ladder (Fermentas) and resolved on 1% agarose gel (Bioshop, Burlington, ON, Canada). Gel was visualized with UV transillumination, and image was captured with Biospectrum AC Imaging System (UVP, Upland, CA, USA). Results and discussion Rationale for the improved Huffman Method One of the prerequisites for a good DNA coding method is the economical use of nucleotides per character. In the improved Huffman coding described in this paper, the bases-to-character ratio was ~3.5. This numberismoreeconomicalcomparedwith previously described methods that enable entire keyboard coding [e.g., the comma or alternating codes (6 bases/character) (4) or sequential encryption (~5.3 bases/ character) (5)]. It should be noted that other coding methods yielded lower base- to-character ratios, but these approaches were limited to a low number of characters, usually sufficient only for text encoding of the English alphabet (1,2,6). Information storage of DNA in living organisms has a disadvantage of losing the information as a result of breakage by mutation, deletion, and insertion of DNA (9). Yachie et al. (5) described an alignment-based approach for prolonged and reliable information storage inDNAinlivingorganisms.Thisapproach providesefficientrecoveryofstoreddataeven for damaged DNA (9). While the size of foreignDNAinsertedintoalivingorganism may be limited (2), recent studies have An improved Huffman coding method for archiving text, images, and music characters in DNA Menachem Ailenberg and Ori D. Rotstein Departments of Surgery, University of Toronto, and St. Michael’s Hospital, Li Ka Shing Knowledge Institute, Keenan Research Centre, Toronto, Ontario, Canada BioTechniques 47:747-754 (September 2009) doi 10.2144/000113218 Keywords: DNA archiving; information storage in DNA Supplementary material for this article is available at www.BioTechniques.com/article/113218. An improved Huffman coding method for information storage in DNA is described. The method entails the utilization of modified unambigu- ous base assignment that enables efficient coding of characters. A plasmid- based library with efficient and reliable information retrieval and assembly with uniquely designed primers is described. We illustrate our approach by synthesis of DNA that encodes text, images and music, which could easily be retrieved by DNA sequencing using the specific primers. The method is simple and lends itself to automated information retrieval.
  • 2. www.BioTechniques.com 748 Vol. 47 | No. 3 | 2009 Short Technical Reports demonstrated successful insertion of large piecesofforeignDNAintolivingorganisms (reviewedinReference9).Nevertheless,the unique codon assignment of the Huffman code as described by Smith et al. (4)—and extended in this study—readily identifies any frame shift that results from mutation inDNAorerrorsinsequencing.Also,infor- mationstorageinnakedorplasmidDNAis not subjected to mutations from the added stageofinsertionoftheencodingDNAinto living organism. Therefore, we describe in thiscommunicationanimprovedHuffman coding with unique primer design using plasmid DNA libraries. General description of the improved Huffman Method Our method involves the creation of a plasmidlibrarywithupto10,000bpworth of information inserted into each plasmid. A different index plasmid that contains general information about the library such astitle,authors,plasmidnumber,andprimers assignments(seebelow),isalsoconstructed. General information about the library is initiallyretrievedbyDNAsequencingfrom the index plasmid using plasmid-specific universalprimersandthenfromtheplasmid librarybyuniquelydesignedprimers(Figure 1B).Bancroftetal.(1)describedatechnique for information storage in DNA that utilized two classes of DNA: information- containingDNAandpolyprimerkey(PPK) DNA. According to this method, the PPK containedtheentiresequencesoftheprimers. Incontrast,ourindexplasmidcontainsonly theinformationforthestructureoftheinfor- mation library, and the sequencing primers areembeddedintheinformationlibrary(for furtherdescriptionofinformationretrieval, seeSupplementaryMaterials).TheHuffman codeforDNAencryptionsuggestedbySmith etal.enablesencodingforonly26characters (4). We sought to improve this approach in ordertoenablecodingoftheentirestandard computer keyboard. Since the retrieving primersinourmethodcontainGCbasesin the5′and3′ends,andthecodonssuggested bySmithetal.areGC-rich,wefirstreplaced CswithAsandGswithTs,andmoved,when possible, the GC-containing codons down the frequency table (see below). Rulesforcodingoftext, music,andimageswiththe improvedHuffmancoding Wedefinedrulesfortext,music,andimage coding. For text coding (coding preceded with “tx*” GTG TTCCT TACCA), we created three columns headed by low–base number DNA codons (G,TT,TA), and placed the remaining codons in increasing base number under each header codons (Figure 2; Supplementary Materials). For musical notes coding (coding preceded with “mu*” TAAC TACT TACCA), we utilized the one-column–modified Huffman coding (Figure 3; Supple- mentary Materials). For image coding (coding preceded with “im*” GCG TAAC TACCA), we utilized the one-column– modified Huffman coding (Figure 4; Supplementary Materials). According to our design, the sequence encoding the text, music, and image is part of the information sequence. Sequencing primers are embedded in the information sequence in 500-bp intervals (Figure 5A). This structure is not unlike genomic exon/ intron structure of genes. In our case, the exons are usually 500 bp, and the introns (sequencing primers) are 20–30 bp in size. This unique repetitive structure allows for easy pattern identification, even for those who are not familiar with the rules of our specific coding. This pattern also allows for an algorithm to decipher the reading frame of the message. The index plasmid The index plasmid is designed to contain an insert with general identification infor- mation (Figure 5B). The index plasmid is different from the library information plasmids. Therefore, for storage purposes, the index plasmid can be mixed with the library plasmids, and the index infor- mation can be retrieved with generic plasmid-specific primers. For more infor- mation on the index plasmid, see the Supplementary Materials. In addition, the index plasmid contains sequences of clusters of 14 (7 + 7) bases representing therandomsequencesofthefirstprimerin eachplasmid.Thisisessentialfortheinitial sequencing of the plasmid and retrieving the remaining sequencing primers, if they are not available. Thus, in our particular example, the index plasmid also contains at its 3′ end the sequence CGGTGTAC- GACACT. For the purpose of simplicity, we describe only the theoretical aspects of Figure 1. DNA sequence of insert and sequencing primers. (A) Sequence of 844-bp DNA that stores information for the text, music, and image of the nursery rhyme Mary Had a Little Lamb described in the text (Figures 2–4 and 6; Supplementary Figures 1 and 2). The text’s sequence is shown in orange type, music in black type, and image in red type. Page break is shown in bold underlined type. Characters like punctuation, spaces, and “shift” keys are shown in lower case. Note that the sequence of the music is interrupted with sequencing primer 2 (intron). According to the assembly rules of the information sequence (exon), the sequence of the primer is recognized as an intron, excised, and the sequence of the flanking exons are spliced and read as continuous information. The 844-bp DNA segment was synthesized and inserted to a pBluescript-based plasmid as a SacI/ KpnI fragment. (B) Sequencing sense and anti-sense primers utilized for information retrieval. Blue bold type depicts the 5′ end of sense primers and corresponding 3′ sequence of anti-sense primers. Green bold type depicts 3′ the end of sense primers and corresponding 5′ sequence of anti-sense primers. Lowercase black type depicts seven bases of random sequence that flank the plasmid and primer numbers, designed to provide uniqueness to each primer and reduce mis-priming in the sequencing reaction. Pink bold type depicts the plasmid number (in this particular example, the number 1 in all cases). Red uppercase type depicts primer number (in this particular case, 1 and 2 for sense primers, and 0 and 2 for anti-sense primers). For this particular study, we assumed a 1-digit plasmid number. When more than 10 plasmids are present, a space should be inserted to distinguish between the plasmid and primer number. (C) DNA codons assigned to plasmid and primer number codes. A B
  • 3. www.BioTechniques.com 750 Vol. 47 | No. 3 | 2009 Short Technical Reports the index plasmidwithoutactualsynthesis of the insert. Illustration of the improved Huffman coding To illustrate our method, a 844-bp DNA fragment (Figure 1A) was synthesized. This DNA fragment contained encoded infor- mation for text, musical notes, and image for the nursery rhyme “Mary Had a Little Lamb,” by Sarah Josepha Hale (10), and was constructed according to the principles outlinedinFigures1–5.TheDNAsequence for the text and musical notes of the rhyme are shown in Figures 2 and 3, respectively. We also show here a simple image of a lamb (Figure 6). The image is drawn in a field of 10 × 10 units. The head, body and ear of the lamb are defined by ellipses, the eye by a circle, the legs by lines, and the tail by a rectangle. It should be noted that the concise definition of the geometrical shapes in DNA codes described here allows for economical coding of the lamb image by only 238 bp of DNA. For retrieving infor- mation from the 844-bp DNA fragment in our particular example, sense primers 1 and 2 were flanking a 500-bp segment, and anti-sense primers 2 and 0 were flanking a 344-bpsegment(Figure1).Toillustratethe utilityofthespecificityoftheprimerdesign, a PCR reaction employing the 3 primer sets wasperformed.AsdemonstratedinFigure7, the 3 amplification products corresponding to the expected amplicons sizes were noted. Unique sequencing primers for information retrieval The sequencing primers (Figures 1 and 5) were flanked by 5′ CGC and 3′ GCC (sense) or 5′-GGC and 3′-GCG (anti-sense) for easy identification and also for creating Figure 2. Modification of Huffman coding for un- ambiguous DNA coding for text. (A) DNA codons according to frequencies (4) were modified: Cs were replaced with As and Gs were replaced with Ts, and CG-rich codons were moved down the frequency table. Three low–base number DNA codons (G, TT, and TA) were placed as group headers, and the rest of the codons were placed in each group in decreasing frequency order, allowing for 69 unambiguous codons. Characters of the computer keyboard were assigned a unique DNA codon and placed in the table in decreasing frequency order as de- scribed in the “Results and discussion” section, yielding an average of 4.9 bases per character. It should be noted that in effect, as defined by the principle of Huffman coding (7), the base/ character ratio is lower due to the fact that more frequent characters are assigned fewer bases. In this communication, text encoding averaged ~3.5 bases per codon. (B) Encoding text of the nursery rhyme Mary Had a Little Lamb to DNA utilizing the principles in panel A. Upper row shows the lyrics and lower row shows the cor- responding DNA codons. A B • ProScanTM II Series stages with Intelligent Scanning Technology provide the highest precision in its class FLUORESCENCE ILLUMINATION MOTORIZED STAGES • Lumen 200/200Pro series 200Watt fluorescence illumination systems MICROSCOPE AUTOMATION WHERE VISION MEETS PRECISION Prior Scientific, Inc. 80 Reservoir Park Drive Rockland, MA. 02370 800-877-2234 www.prior.com •Compatible with Prior Scientific’s line of NanoScanZ Piezo Z Stages • OEM and custom stage systems are available • Built-in high speed light attenuator and six position high speed filter wheel • Pre-centered bulb with 2,000 hour life • Stabilized DC power supply eliminates variations in light intensity
  • 4. www.BioTechniques.com 752 Vol. 47 | No. 3 | 2009 Short Technical Reports a triple-GC clamp at the 3′ end for tight hybridization. In the middle of the primer, we reserved space for coding the plasmid number and the primer number. Primer number 1 indicates the sequence of the 5′ segment of the information insert, and primer number 0 indicates the sequence of the 3′ end of the information insert. Other primers are identified in ascending order, to a maximum of 20 in a 10,000-bp infor- mationinsert.Wespecifiedheretheplasmid andprimernumberinsingledigits.However, whenmorethan10plasmidsareincludedin the library, a space character (GAT) should be inserted to distinguish between the plasmid and primer numbers. Importantly, the plasmid and primer number encoded in the sequencing primer were designed to be flanked by a random seven-base sequence (a totalof14basesperprimer)toprovideprimer specificitywhenusedforsequencingbothin the sense and anti-sense orientation, or for PCR amplification. Information retrieval by sequencing In general, sequencing reactions do not providesequencinginformationimmediately Figure 3. DNA codes for music storage in DNA, and DNA codons for the tune of the nursery rhyme Mary Had a Little Lamb. (A) Note values and pitches, as well as meters and repeats, are assigned to DNA codons (note values are demonstrated in Supplementary Figure 1). (B) Each musical note is accompanied by its value. Upper row shows the lyrics. Middle row shows the notes and values, and the lower row shows the corresponding notes and values encrypted in DNA. The code for music (“mu*”) at the beginning of the message distinguishes the music part from the text and image in the information insert, and the 4/4 musical meter code that follows the music code. Also, brackets and “× 2” indicate that the tune repeats 2 times with different lyrics. Note pitches and values in DNA sequence are depicted in upper case, and “mu*,” as well as meter and repeat sequences, are depicted in lower case. Figure 4. DNA codons for image storage. (A) Numeric and shape codon assignment. Note the distinc- tion between “a” (angle of shape relative to horizontal line) and “an” (angle of triangle between two sides). (B) Parameter for drawing shapes. Figure 5. Strategy for DNA archiving. (A) Construction the information library: DNA information sequence (≤10,000 bp) is synthesized according to DNA codons specified in Figures 2–4 and “Results and discussion” section. Primer sequences of ~20–30 bp are embedded in the information sequence every 500 bp, creating islets of intron sequences. Also depicted in this figure is the structure of the primer and the locations of the plasmid and primer numbers, the locations of the flanking GC sequences of the primers, and the flanking scrambled sequences. For more details, see “Results and discus- sion” section and Figure 1. (B) Structure of index plasmid. General index information is flanked by plasmid-specific sequence utilized to retrieve index information. This plasmid is different from the library plasmid in order to retrieve the index information without conflict. Information retrieval from the library plasmid is achieved by using specially designed specific primers (panel A and Figure 1B). Note: In this paper, we designed only an 844-bp information insert for the purpose of proof-of-principle, and for simplicity, the index plasmid theoretical consideration is described, but with no actual synthesis of DNA insert in the plasmid. A B A B A B
  • 5. www.BioTechniques.com 754 Vol. 47 | No. 3 | 2009 Short Technical Reports adjacenttothesequencingprimer.Moreover, thesequencecouldoccasionallybe<500bp, orabasecallcanbeinconclusive.Therefore, sequencing in both orientations may be required. With the unique design of our primers, sequencing could be achieved with ahighdegreeofaccuracy.Agoodsequencing practice for retrieving information from an informationlibrarymightbetoinitiallyuse thesenseprimersonly,andanti-senseprimer 2 to retrieve the information of the 5′ end. Thereafter, the other anti-sense primers can beutilizedasrequired.Forsequencinginour particular example, we used sense primers 1 and 2, and anti-sense primers 2 and 0. Sequencingwithprimer1yieldeda1276-bp product. Since our insert was 844 bp, it also yielded a plasmid sequence downstream from the insert. However, the sequence was missing 41 bp just downstream of primer 1. This sequence was retrieved with anti-sense primer 2. Together, these two sequencing reactions retrieved the original information with 100% accuracy. We recognize that a sequencing reaction cannot always retrieve 1276 bp as was the case with primer 1, and therefore acknowledge that additional sequencing may be required. As mentioned above, we further retrieved the insert infor- mationwithadditionalsequencingwithsense primer2andanti-senseprimer0,againwith 100% accuracy, compared with the initial designedinsert.Wethendecodedtheinfor- mation obtained by DNA sequencing with theguidelinesprovidedinFigures2–4,and wereabletoreconstructthetext,music,and image encrypted in the DNA. An example of part of a sequencing chromatogram achieved with sense primer 1 is shown in Supplementary Figure 2. This particular sequencedinformationistheDNAsequence coding the rectangle that constructs the image of the lamb’s tail (bases 647–674 on the chromatogram) with 100% accuracy, compared with the original sequence. In addition to the inherent advanta- geous attributes of information storage in DNA, our improved Huffman coding method for use of unambiguous DNA codingforarchivingofferseconomical,easy pattern recognition and message retrieval through specially designed primers. As DNA synthesis and sequencing become faster and cheaper (Genome Analyzer Sequencing System, Illumina, San Diego, CA, USA; 454 Sequencing, Roche, Branford, CT, USA) information storage in DNA becomes even more attractive. Acknowledgments This study was supported by the Canadian Institutes of Health Research (CIHR; grant no. 37779). The authors declare no competing interests. References 1. Bancroft, C., T. Bowler, B. Bloom, and K.T. Clelland. 2001. Long-term storage of infor- mation in DNA. Science 293:1763-1765. 2. Cox, J.P.L. 2001. Long-term data storage in DNA. Trends Biotechnol. 19:247-250. 3. Wong, P.C., K.-K. Wong, and H. Foote. 2003. OrganicdatamemoryusingtheDNAapproach. Commun. ACM 46:95-98. 4. Smith, G.C., C.C. Fiddes, J.P. Hawkins, and J.P.L. Cox. 2003. Some possible codes for encrypting data in DNA. Biotechnol. Lett. 25:1125-1130. 5. Yachie,N.,K.Sekiyama,J.Sugahara,Y.Ohashi, andM.Tomita.2007.Alignment-basedapproach for durable data storage into living organisms. Biotechnol. Prog. 23:501-505. 6. Clelland,C.T.,V.Risca,andC.Bancroft.1999. Hiding messages in DNA microdots. Nature 399:533-534. 7. Huffman, D.A. 1953. A method for the construction of minimum-redundancy codes. Proc. IRE. 40:1098-1101. 8. Doig, A.J. 1997. Improving the efficiency of the genetic code by varying the codon length- theperfectgeneticcode.J.Theor.Biol.188:355- 360. 9. Yachie, N., Y. Ohashi, and M. Tomita. 2008. Stabilizing synthetic data in the DNA of living organisms. Syst. Synth. Biol. 2:19-25. 10. Hughes, C. W. 1980. American hymns old and new:notesonthehymnsandbiographiesofthe authors and composers, Columbia University Press, NY. 11. Lewand, R.E. 2000. Cryptographical Mathe- matics. The Mathematical Association of America Publishing, Washington, DC. Received 28 March 2009; accepted 9 July 2009. Address correspondence to Menachem Ailenberg, St. Michael’s Hospital, 30 Bond Street, 16CC- 044, Toronto, Ontario, Canada M5B 1W8. email: m.ailenberg@utoronto.ca Figure 6. Image of lamb encrypted in DNA. (A) DNA codons were defined as described in Fig- ure 4. Upper row identifies the organ of the lamb, middle row defines the shapes and their parameters as defined in Figure 4, and lower row shows image encrypted in DNA codons. Also shown at the top of this panel is the DNA code that precedes the image (“im*”). (B) Im- age was created by using shapes (ellipse, circle, line, rectangular) as described in the “Results and discussion” section and Figure 4. A field of 10 × 10 units was created, and shapes were inserted. An example of the tail sequencing is shown in Supplementary Figure 2. Figure 7. PCR amplification of DNA encrypting text, image, and music, using specific primers. PCR was performed using as template a plas- mid containing the 844-bp synthesized DNA that encrypts text, music, and image. Numbers on the left indicate DNA fragment size, in bp. Lane 1, 100-bp DNA ladder; Lane 2, amplifica- tion using sense primer 1 and anti-sense primer 2; Lane 3, amplification using sense primer 2 and anti-sense primer 0; Lane 4, amplification using sense primer 1 and anti-sense primer 0. Note that the amplified bands in lanes 2, 3, and 4 correspond to the expected sizes of 553 bp, 314 bp, and 844 bp, respectively. For conditions of PCR amplification, consult the “Materials and Methods” section. For primer assignment, see Figure 1. A B