A presentation on Digital Content Creation by Rupesh Kumar A, Assistant Professor, Department of Studies and Research in Library and Information Science, Tumkur University, Tumakuru, Karnataka, India.
2. Digitization
• Digitization refers to the process of translating a piece of
information such as a book, journal articles, sound recordings,
pictures, audio tapes or video recordings, etc. into bits.
• Bits are the fundamental units of information in a computer
system.
• Converting information into these binary digits (bits) is called
digitization.
• The first step in digitization is scanning.
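As a minimal sketch of what "converting information into bits" means, the Python snippet below turns a short piece of text into binary digits and back again. The helper names `to_bits` and `from_bits` and the choice of UTF-8 encoding are assumptions made for illustration; they do not come from the slides.

```python
def to_bits(text: str) -> str:
    """Encode text as UTF-8 and show each byte as eight binary digits."""
    return " ".join(f"{byte:08b}" for byte in text.encode("utf-8"))


def from_bits(bits: str) -> str:
    """Reverse to_bits: parse each group of eight bits back into a byte."""
    return bytes(int(group, 2) for group in bits.split()).decode("utf-8")


encoded = to_bits("Hi")
print(encoded)             # 01001000 01101001
print(from_bits(encoded))  # Hi
```

Scanned pages, sound and video are reduced to bits in the same spirit, although their encodings are far more elaborate than this text example.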
3. • When an object is scanned, it is converted into a digital image.
• A digital image is composed of a set of pixels (picture elements),
arranged according to a pre-defined ratio of columns and rows.
• An image file can be managed as a regular computer file and can be
retrieved, printed and modified using appropriate software.
• Images containing text can be converted into text files using a
process called Optical Character Recognition (OCR).
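The idea of an image as a grid of pixels can be sketched with a nested Python list. The sample values, the 0 to 255 greyscale range, and the helper names `dimensions` and `invert` are assumptions chosen for illustration; real image files also carry headers and usually compress the pixel data.

```python
# A tiny greyscale image with 3 rows and 4 columns (hypothetical sample values).
# 0 is black and 255 is white.
image = [
    [255, 255,   0, 255],
    [255,   0,   0, 255],
    [255, 255,   0, 255],
]


def dimensions(img):
    """Return (rows, columns) of the pixel grid."""
    return len(img), len(img[0])


def invert(img):
    """Return a modified copy in which every pixel is flipped (a negative)."""
    return [[255 - value for value in row] for row in img]


rows, cols = dimensions(image)
print(rows, cols, rows * cols)   # 3 4 12
print(invert(image)[0])          # [0, 0, 255, 0]
```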
4. OCR
• Optical Character Recognition, or OCR, is a technology that
enables a user to convert different types of documents, such as
scanned paper documents, PDF files or images captured by a
digital camera, into editable and searchable data.
• It is the mechanical or electronic conversion of images of typed,
handwritten or printed text into machine-encoded text, whether
from a scanned document, a photo of a document, or a scene photo.
6. Pre-processing
• Pre-processing involves certain tasks to improve character recognition
and its accuracy.
• Pre-processing includes
• De-skewing: making the characters perfectly horizontal or vertical if they
are slanted
• Despeckling: removing positive and negative spots, smoothing edges
• Binarization: converting images to black and white
• Line removal: clearing non-character lines and boxes
• Line and word detection
• Script recognition: recognizing the script of the text
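Two of the pre-processing tasks listed above, binarization and despeckling, can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the method of any particular OCR product: the fixed threshold of 128, the function names, and the sample pixel values are all hypothetical, and the despeckle step only removes isolated dark specks (it does not fill white holes or smooth edges).

```python
def binarize(gray, threshold=128):
    """Map each greyscale pixel to 1 (ink) if darker than threshold, else 0."""
    return [[1 if value < threshold else 0 for value in row] for row in gray]


def despeckle(binary):
    """Remove isolated ink pixels: a 1 with no ink among its 8 neighbours."""
    rows, cols = len(binary), len(binary[0])

    def has_ink_neighbour(r, c):
        for dr in (-1, 0, 1):
            for dc in (-1, 0, 1):
                if (dr, dc) == (0, 0):
                    continue
                nr, nc = r + dr, c + dc
                if 0 <= nr < rows and 0 <= nc < cols and binary[nr][nc] == 1:
                    return True
        return False

    return [
        [1 if binary[r][c] == 1 and has_ink_neighbour(r, c) else 0
         for c in range(cols)]
        for r in range(rows)
    ]


gray = [
    [250, 240,  30, 245, 250],
    [248,  20,  25, 250, 252],
    [251, 249,  35, 247,  10],   # the 10 at the far right is a stray speck
]

for row in despeckle(binarize(gray)):
    print(row)
# [0, 0, 1, 0, 0]
# [0, 1, 1, 0, 0]
# [0, 0, 1, 0, 0]
```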
7. Character Recognition
• Character recognition may involve:
• Matrix matching: comparing an image to a stored glyph on a
pixel-by-pixel basis.
• It is also known as “pattern matching” or “image correlation”.
• Feature extraction: decomposing (dividing) glyphs into
features like lines, closed loops, line direction and line
intersections.
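Matrix matching, as described above, can be sketched as a pixel-by-pixel comparison against stored glyphs. The 3x3 templates, the scoring rule (count of agreeing pixels), and the names `TEMPLATES`, `match_score` and `recognize` are assumptions for illustration; practical systems use much larger glyph bitmaps and normalise size and position first.

```python
# Stored glyphs (hypothetical 3x3 bitmaps): "1" is ink, "0" is background.
TEMPLATES = {
    "I": ["010",
          "010",
          "010"],
    "L": ["100",
          "100",
          "111"],
    "T": ["111",
          "010",
          "010"],
}


def match_score(glyph, template):
    """Count the pixel positions where glyph and template agree."""
    return sum(
        g == t
        for glyph_row, template_row in zip(glyph, template)
        for g, t in zip(glyph_row, template_row)
    )


def recognize(glyph):
    """Return the template character with the highest pixel agreement."""
    return max(TEMPLATES, key=lambda char: match_score(glyph, TEMPLATES[char]))


noisy_t = ["111",
           "010",
           "011"]   # a "T" with one extra ink pixel

print(recognize(noisy_t))   # T
```

The noisy glyph still agrees with the stored "T" in 8 of 9 pixels, which is why it is recognised despite the stray pixel. Feature extraction takes a different route and compares properties such as loops and intersections instead of raw pixels.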
10. Post-processing
• The output stream may be a plain text stream or file of
characters.
• More sophisticated OCR systems can preserve the original
layout of the page.
12. Electronic Document
• Any electronic media content which is intended to be used in either
electronic form or as printed output.
• E-documents do not include computer programs or system files.
• E-documents come in a variety of file formats.
• Today, most e-docs in different file formats will have at least one file
viewer (e.g. Adobe Reader for PDF files).
• File format incompatibility poses a challenge for e-docs.
• Development of non-proprietary, standardized file formats is a solution
to tackle incompatibility (e.g. HTML, OpenDocument).
13. File Formats (in digitization)
• Several file formats are used for documents to be included in
digital libraries.
• The most common format is PDF.
• Other formats include:
– TIFF: Tagged Image File Format
– JPG (JPEG): Joint Photographic Experts Group
– PNG: Portable Network Graphics
– GIF: Graphics Interchange Format
– PS or EPS: PostScript or Encapsulated PostScript
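Each of the formats listed above begins with a characteristic byte signature, so a file's format can often be identified without trusting its extension. The following sketch checks those leading bytes; the function name `sniff_format` and the table layout are assumptions for illustration, and the table is deliberately incomplete (for example, it does not cover BigTIFF or EPS files that start with a binary preview header).

```python
# (signature bytes, format name) pairs checked against the start of a file.
SIGNATURES = [
    (b"%PDF-", "PDF"),
    (b"\x89PNG\r\n\x1a\n", "PNG"),
    (b"GIF87a", "GIF"),
    (b"GIF89a", "GIF"),
    (b"\xff\xd8\xff", "JPEG"),
    (b"II*\x00", "TIFF"),          # little-endian byte order
    (b"MM\x00*", "TIFF"),          # big-endian byte order
    (b"%!PS", "PostScript/EPS"),   # EPS files also begin with %!PS
]


def sniff_format(data: bytes) -> str:
    """Guess a file format from its first bytes; return 'unknown' if no match."""
    for signature, name in SIGNATURES:
        if data.startswith(signature):
            return name
    return "unknown"


print(sniff_format(b"%PDF-1.7\n"))             # PDF
print(sniff_format(b"GIF89a\x01\x00\x01\x00")) # GIF
print(sniff_format(b"plain text"))             # unknown
```

In practice the first few bytes would be read from disk, for example with `open(path, "rb").read(8)`, where `path` is a placeholder for the file being checked.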
14. Portable Document Format
• A file format used to present documents in a manner
independent of software, hardware, and operating systems.
• A PDF file encapsulates a complete description of a fixed-layout
flat document, including the text, fonts, graphics, and other
information needed to display it.
• A PDF file will look the same way on a variety of computers
irrespective of operating systems.
15. History
• PDF was developed by Adobe Systems in the early 1990s.
• In its early years, before the World Wide Web and HTML became
widespread, PDF was popular mainly in desktop publishing (DTP).
• PDF was a proprietary format controlled by Adobe until 2008.
• On July 1, 2008, it was released as an open standard and
published by ISO as ISO 32000-1:2008.
16. Technical Aspects of PDF
• PDF uses the following technologies:
– A subset of the PostScript page description programming language,
for generating the layout and graphics.
– A font-embedding/replacement system to allow fonts to travel
with the documents
– A structured storage system to bundle these elements and any
associated content into a single file, with data compression where
appropriate.
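The technologies listed above can be seen in a hand-built file. The sketch below assembles a one-page PDF from numbered objects, a PostScript-like content stream (`BT ... Tj ET`), and a cross-reference table that records where each object starts. It is an illustration only: the function name `build_minimal_pdf`, the page size, and the text position are assumptions, the file refers to the standard Helvetica font by name instead of embedding a font program, no compression is applied, and the output has not been validated against ISO 32000.

```python
def build_minimal_pdf(text: str) -> bytes:
    """Assemble a one-page PDF showing `text` (Latin-1 characters only)."""
    # Escape the characters that are special inside a PDF literal string.
    escaped = text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")
    # Content stream: begin text, pick font and size, move, show text, end text.
    stream = f"BT /F1 24 Tf 72 720 Td ({escaped}) Tj ET".encode("latin-1")

    objects = [
        b"<< /Type /Catalog /Pages 2 0 R >>",
        b"<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
        b"<< /Type /Page /Parent 2 0 R /MediaBox [0 0 612 792] "
        b"/Contents 4 0 R /Resources << /Font << /F1 5 0 R >> >> >>",
        b"<< /Length %d >>\nstream\n" % len(stream) + stream + b"\nendstream",
        b"<< /Type /Font /Subtype /Type1 /BaseFont /Helvetica >>",
    ]

    out = bytearray(b"%PDF-1.4\n")
    offsets = []
    for number, body in enumerate(objects, start=1):
        offsets.append(len(out))          # byte position of "N 0 obj"
        out += b"%d 0 obj\n" % number + body + b"\nendobj\n"

    # Cross-reference table: one fixed-width 20-byte entry per object.
    xref_start = len(out)
    out += b"xref\n0 %d\n" % (len(objects) + 1)
    out += b"0000000000 65535 f \n"
    for offset in offsets:
        out += b"%010d 00000 n \n" % offset

    out += b"trailer\n<< /Size %d /Root 1 0 R >>\n" % (len(objects) + 1)
    out += b"startxref\n%d\n%%%%EOF\n" % xref_start
    return bytes(out)


pdf_bytes = build_minimal_pdf("Hello, digital library")
print(pdf_bytes[:8])   # b'%PDF-1.4'
```

Writing the returned bytes to a file opened in binary mode should give a document that common PDF viewers can open, although that depends on the viewer's tolerance and is not guaranteed by this sketch.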
17. Special Features
• PDF files may contain interactive elements such as
annotations, form fields, video and Flash animation. Such
files are called “Rich Media PDF”.
• A PDF file may be encrypted for security, or digitally signed
for authentication.
• PDF documents can contain display settings, including the
page display layout and zoom level.
18. Born digital and legacy documents
• Born digital documents are resources or items created and
managed in digital form.
• They may be: digital photographs, digital documents,
harvested Web content, digital manuscripts, electronic
records, static data sets, digital art, digital media publications.
• Born digital documents can be easily processed for inclusion
in the digital library as they are natively in digital format.
19. Legacy documents
• Legacy documents are resources or items which are originally in ‘non-digital’
form and have to be converted into ‘digital’ form for inclusion in a digital
library.
• Photographs, documents, manuscripts, print records, art, media publications
are examples of legacy documents.
• The process of converting legacy documents into digital form to make them
compatible for digital libraries is known as ‘digitization’.
• Legacy documents pose a greater challenge for digital libraries as their
conversion to digital form is very tedious.
20. Scholarly Communication
• Scholarly communication is the process by which academics,
scholars and researchers share and publish their research findings
so that they are available to the wider academic community and
beyond.
• Scholarly communication is “the system through which research
and other scholarly writings are created, evaluated for quality,
disseminated to the scholarly community, and preserved for
future use.”
21. Scholarly Literature
• Writings in scholarly journals, books and e-journals
• Reviews, preprints and working papers
• Writings in encyclopaedias and dictionaries, annotated
content and data
• Blogs, discussion forums, professional and scholarly hubs, and
conference papers
• Sound and video recordings
22. Terminology in Scholarly Communication
• Manuscript: a scholarly document which has not yet been
submitted for publication.
• Preprint: a scholarly document accepted for publication in a
journal or book; material accepted to be used in a presentation at
a conference.
• Article: a scholarly document which has been published.
• Paper: a scholarly document or material which has been
presented at a conference.
• E-Script: an electronic manuscript.
23. Electronic Publishing
• E-publishing includes the digital publication of e-books, digital
magazines, and the development of digital libraries and
catalogues.
• The electronic publishing process follows some aspects of the
traditional paper-based publishing process but differs from
traditional publishing in two ways:
– 1) it does not include using an offset printing press to print the final
product and
– 2) it avoids the distribution of a physical product (e.g., paper books,
paper magazines, or paper newspapers).
24. • Because the content is electronic, it may be distributed over
the Internet and through electronic bookstores.
• Users can read the material on a range of electronic and
digital devices, including desktop computers, laptops, tablet
computers, smartphones or e-reader tablets.
25. E-Journal
• Electronic journals, also known as ejournals, e-journals, and electronic
serials, are scholarly journals or intellectual magazines that can be
accessed via electronic transmission.
• An e-journal closely resembles a print journal in structure, but will be in
electronic format.
• Often a journal article will be available for download in two formats - as a
PDF and in HTML format.
• E-journals allow new types of content to be included in journals, for
example video material, or the data sets on which research has been
based.
26. E-book
• An electronic book (or e-book) is a book publication made
available in digital form, consisting of text, images, or both,
readable on the flat-panel display of computers or other
electronic devices.
• An e-book may be an e-only book or an electronic version of a
printed book.