Scanned publications in digital libraries: new Open Source DjVu tools
Janusz S. Bień
Formal Linguistics Department, University of Warsaw
The Library 2.012 Worldwide Virtual Conference
October 3 - 5, 2012
http://bc.klf.uw.edu.pl/298/
Introduction
General information
Grant “Digitalization tools for philological research” 2009-2012
The tools were developed within the Ministry of Science and Higher
Education’s grant (no. N N519 384036) directed by the present author.
Some links
The project site: https://bitbucket.org/jsbien/ndt
Our digital library: http://bc.klf.uw.edu.pl/
Mailing lists
the announcement list:
http://lists.mimuw.edu.pl/listinfo/nmpt-ann
the discussion and support list:
http://lists.mimuw.edu.pl/listinfo/nmpt-l
Grant results
A DjVu search engine (client-server architecture)
Poliqarp for DjVu: the Poliqarp server extension by Jakub Wilk
marasca: the WWW client by Jakub Wilk,
cf. http://poliqarp.wbl.klf.uw.edu.pl/en/
djview4poliqarp: the remote client for Debian/Ubuntu and
MS Windows by Michał Rudolf,
cf. https://bitbucket.org/mrudolf/djview-poliqarp/downloads
DjVu utilities
pdf2djvu, didjvu, ocrodjvu, djvusmooth by Jakub Wilk
some experimental tools
by Tomasz Olejniczak, Michał Rudolf and Piotr Sikora
An example: searching a geographical gazetteer
Why DjVu?
DjVu and DjVuLibre
Yann Le Cun, Léon Bottou, Patrick Haffner, and Paul G. Howard
1996
What is DjVu? More than just a format for scans…
an image compression technique, a document format, and a
software platform for delivering documents images over the
Internet
OCR, searching and indexing
DjVu pages can contain a “hidden text” chunk which
includes the recognized text as well as the coordinates of
each word on the page in a compressed form.
Quoted from:
http://leon.bottou.org/papers/lecun-2001
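To make the role of word coordinates more concrete, here is a minimal sketch that models a hidden text layer as a list of words with bounding boxes and returns the rectangles a viewer could highlight for a query. The Word type, the find_boxes function and the sample coordinates are hypothetical illustrations; they do not reproduce the actual encoding of the DjVu text chunk.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Word:
    """One recognized word with its bounding box in page pixels (illustrative)."""
    text: str
    xmin: int
    ymin: int
    xmax: int
    ymax: int


def find_boxes(words: List[Word], query: str) -> List[Tuple[int, int, int, int]]:
    """Return (x, y, width, height) for every word equal to the query, ignoring case."""
    wanted = query.casefold()
    return [
        (w.xmin, w.ymin, w.xmax - w.xmin, w.ymax - w.ymin)
        for w in words
        if w.text.casefold() == wanted
    ]


# Made-up sample page: three words with invented coordinates.
page = [
    Word("DjVu", 100, 900, 220, 940),
    Word("tools", 240, 900, 330, 940),
    Word("djvu", 100, 800, 215, 838),
]
boxes = find_boxes(page, "djvu")
print(boxes)
```

Because every word carries its own rectangle, a search hit can be shown directly on the scanned image rather than only in a text listing.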
Some design principles (for remote access)
Action               Real-world equivalent   Acceptable delay
Zooming/Panning      Moving the eyes         Immediate
Next/Previous Page   Turning a page          < 1 second
Random Page access   Finding a page          < 3 seconds
Quoted from:
http://leon.bottou.org/papers/lecun-2001
Unbundled DjVu documents
Every page can be stored and served as a separate file!
DjVu metadata (also eXtensible Metadata Platform)
DjVu outlines
DjVu annotations
DjVu external hyperlinks
DjVu internal hyperlinks
Referencing DjVu documents (the first page)
Referencing DjVu documents (a specific page)
Referencing DjVu documents (a view)
Referencing DjVu documents (highlights)
URLs for DjVu documents
The first page:
http://triggs.djvu.org/century-dictionary.com/04/index04.djvu
A specific page:
http://triggs.djvu.org/century-dictionary.com/04/index04.djvu
?djvuopts=&page=p2719.djvu
A view (zoom and position):
http://triggs.djvu.org/century-dictionary.com/04/index04.djvu
?djvuopts=&page=p2719.djvu
&zoom=556&showposition=0.49,0.22
Highlights:
http://triggs.djvu.org/century-dictionary.com/04/index04.djvu
?djvuopts=&page=p2719.djvu
&zoom=556&showposition=0.49,0.22
&highlight=1100,3735,217,46
&highlight=1284,3538,166,35
&highlight=1640,3538,166,35
&highlight=1901,3288,168,35
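The way such a URL is assembled can be sketched in a few lines of Python. The function name djvu_url and its parameters are hypothetical; only the option names (djvuopts, page, zoom, showposition, highlight) come from the URLs shown here. The sketch assumes the values need no percent-encoding.

```python
from typing import Iterable, Optional, Tuple


def djvu_url(base: str,
             page: Optional[str] = None,
             zoom: Optional[int] = None,
             showposition: Optional[Tuple[float, float]] = None,
             highlights: Iterable[Tuple[int, int, int, int]] = ()) -> str:
    """Append viewer options to a DjVu document URL (illustrative helper)."""
    opts = []
    if page is not None:
        opts.append(f"page={page}")
    if zoom is not None:
        opts.append(f"zoom={zoom}")
    if showposition is not None:
        x, y = showposition
        opts.append(f"showposition={x},{y}")
    for x, y, w, h in highlights:
        opts.append(f"highlight={x},{y},{w},{h}")
    if not opts:
        return base  # plain URL: the document opens on its first page
    # The empty djvuopts marker comes first, followed by the viewer options.
    return base + "?djvuopts=&" + "&".join(opts)


base = "http://triggs.djvu.org/century-dictionary.com/04/index04.djvu"
url = djvu_url(
    base,
    page="p2719.djvu",
    zoom=556,
    showposition=(0.49, 0.22),
    highlights=[(1100, 3735, 217, 46), (1284, 3538, 166, 35),
                (1640, 3538, 166, 35), (1901, 3288, 168, 35)],
)
print(url)
```

Joined on one line, the result is the same string as the URL built up step by step in the slides.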
Creating URLs with djview
Layers of DjVu documents
Graphic layers:
Stencil
Background
usually in lower resolution
Foreground
encoded using shape dictionaries
Hidden text layer
encoded in Unicode
The legal status of the DjVu technology
a patent (or more?) granted
some patents pending (?)
crucial code (DjVuLibre, djview)
available under the
GNU General Public License
and maintained by the inventors of DjVu
http://djvu.sourceforge.net/
new software
available under the
GNU General Public License
GNU GPL
GNU General Public License
4 freedoms
(http://www.gnu.org/philosophy/free-sw.html):
The freedom to run the program, for any purpose.
The freedom to study how the program works,
and adapt it to your needs.
The freedom to redistribute copies
so you can help your neighbor.
The freedom to improve the program,
and release your improvements to the public,
so that the whole community benefits.
Sample input files
Tagged Image File Format
ABBYY FineReader 11
ABBYY FineReader 11: Optical Character Recognition
ABBYY FineReader 11 OCR output: text under image
Formats and sizes
50M Parkosz4demoFR11pdf.pdf
11M Parkosz4demoFR11pdfMRC.pdf
2.2M Parkosz4demoFR11pdfaMRC.pdf
779K Parkosz4demoFR11djvu.djvu
MRC: Mixed Raster Content
A time-consuming compression method with a high compression ratio
PDF/A: a variant of Portable Document Format for archiving
ISO 19005-1:2005. Document management - Electronic
document file format for long-term preservation …
A FineReader 11 PDF output: outline and metadata
The FineReader 11 DjVu output: outline and metadata
The FineReader 11 DjVu output: foreground and hidden text
Scanned publications in digital libraries: new Open Source DjVu tools
Jakub Wilk’s utilities
pdf2djvu
pdf2djvu
http://jwilk.net/software/pdf2djvu
Developed since 2007,
current version 0.7.14 (released on 2012-09-18)
Platforms: included in the following Unix distributions
Debian, Ubuntu, openSUSE, FreeBSD,
Digitlab (http://dl.psnc.pl/2012/09/23/digitlab/)
Win32 (with a GUI: http://www.trustfm.net/
GeneralTools/SoftwarePdfToDjvuGUI.php)
Language versions:
English, German, Polish, Russian, Ukrainian
Demonstration: Open Virtual Appliance
http://fleksem.klf.uw.edu.pl/ndt/
ubuntu4poliqarp/NDT_ubuntu4poliqarp1.ova
pdf2djvu users
Debian/Ubuntu popularity contest
Installed/votes: ∼ 45 000/1000. Debian only actual use:
pdf2djvu simple use example
pdf2djvu -d 600 Parkosz4demoFR11pdf.pdf
-o Parkosz4demoFR11pdf_p2d600.djvu
Parkosz4demoFR11pdf.pdf:
- page #1 -> #1
- page #2 -> #2
- page #3 -> #3
- page #4 -> #4
- page #5 -> #5
- page #6 -> #6
- page #7 -> #7
- Warning: metadata[CreationDate] is not a valid date
- Warning: metadata[ModDate] is not a valid date
0,091 bits/pixel; 42,890:1, 97,67% saved,
51802361 bytes in, 1207797 bytes out
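The summary figures in the last two lines can be checked with simple arithmetic. The sketch below assumes that the ratio is bytes in divided by bytes out and that the saving is the complementary percentage; the function name compression_stats is hypothetical. The tool itself printed the numbers with decimal commas, presumably because of the locale in use.

```python
from typing import Tuple


def compression_stats(bytes_in: int, bytes_out: int) -> Tuple[float, float]:
    """Return (compression ratio, percentage of bytes saved)."""
    ratio = bytes_in / bytes_out
    saved = 100.0 * (1.0 - bytes_out / bytes_in)
    return ratio, saved


# Byte counts copied from the pdf2djvu run above.
ratio, saved = compression_stats(51802361, 1207797)
print(f"{ratio:.3f}:1, {saved:.2f}% saved")
```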
pdf2djvu output size
50M Parkosz4demoFR11pdf.pdf
11M Parkosz4demoFR11pdfMRC.pdf
2.2M Parkosz4demoFR11pdfaMRC.pdf
1.4M Parkosz4demoFR11djvuMRC_p2d600.djvu
1.3M Parkosz4demoFR11djvuaMRC_p2d600.djvu
1.2M Parkosz4demoFR11pdf_p2d600.djvu
779K Parkosz4demoFR11djvu.djvu
pdf2djvu output: foreground and hidden text
pdf2djvu output: outline and metadata
pdf2djvu advanced usage
Usage:
pdf2djvu [-o <output-djvu-file>] [options] <pdf-file>
pdf2djvu -i <index-djvu-file> [options] <pdf-file>
Options:
-i, --indirect=FILE               --no-metadata
-o, --output=FILE                 --verbatim-metadata
--pageid-prefix=NAME              --no-outline
--pageid-template=TEMPLATE        --hyperlinks=border-avis
--page-title-template=TEMPLATE    --hyperlinks=#RRGGBB
-d, --dpi=RESOLUTION              --no-hyperlinks
--guess-dpi                       --no-text
--media-box                       --words
--page-size=WxH                   --lines
--bg-slices=N,...,N               --crop-text
--bg-slices=N+...+N               --no-nfkc
--bg-subsample=N                  --filter-text=COMMAND-LINE
--fg-colors=default               -p, --pages=...
--fg-colors=web                   -v, --verbose
--fg-colors=black                 -j, --jobs=N
--fg-colors=N                     -q, --quiet
--monochrome                      -h, --help
--loss-level=N                    --version
--lossy
--anti-alias
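When pdf2djvu is called from a script, it may help to assemble the argument list programmatically. The following is a minimal sketch: the helper pdf2djvu_command is hypothetical, it uses only options from the list above, and the page range "1-3" is an assumed example value. It only builds and prints the command and does not run the converter.

```python
import shlex
from typing import List, Optional


def pdf2djvu_command(pdf: str,
                     output: str,
                     dpi: int = 600,
                     pages: Optional[str] = None,
                     jobs: Optional[int] = None,
                     metadata: bool = True) -> List[str]:
    """Build an argument list for pdf2djvu (illustrative helper)."""
    cmd = ["pdf2djvu", f"--dpi={dpi}", f"--output={output}"]
    if pages is not None:
        cmd.append(f"--pages={pages}")   # assumed page-range value, e.g. "1-3"
    if jobs is not None:
        cmd.append(f"--jobs={jobs}")     # number of parallel jobs
    if not metadata:
        cmd.append("--no-metadata")      # skip copying PDF metadata
    cmd.append(pdf)
    return cmd


cmd = pdf2djvu_command(
    "Parkosz4demoFR11pdf.pdf",
    "Parkosz4demoFR11pdf_p2d600.djvu",
    pages="1-3",
    jobs=2,
)
print(shlex.join(cmd))
```

To actually run the conversion, the list could be passed to subprocess.run(cmd, check=True) on a system where pdf2djvu is installed.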
didjvu
didjvu uses the Gamera framework to separate foreground/background
layers, which can then be encoded into a DjVu file.
http://jwilk.net/software/didjvu
http://gamera.informatik.hsnr.de/
(http://minidjvu.sourceforge.net/)
Developed since 2009, current version 0.2.6 (released on 2012-05-15)
Platforms: Linux distributions Debian, Ubuntu
Debian+Ubuntu popcon installed/votes: ∼ 200/20
Language versions: English
Demonstration: Open Virtual Appliance
http://fleksem.klf.uw.edu.pl/ndt/ubuntu4poliqarp/NDT_ubuntu4poliqarp1.ova
didjvu simple use example
didjvu bundle -d 600 -o Parkosz_di600.djvu *.tif
Parkosz_0003.tif:
- reading image
- converting to DjVu
- 0.029 bits/pixel; 275.856:1, 99.64% saved,
15078290 bytes in, 54660 bytes out
Parkosz_0004.tif:
- reading image
- converting to DjVu
- 0.054 bits/pixel; 147.387:1, 99.32% saved,
14263346 bytes in, 96775 bytes out
...
bundling
didjvu output size
50M Parkosz4demoFR11pdf.pdf
11M Parkosz4demoFR11pdfMRC.pdf
2.2M Parkosz4demoFR11pdfaMRC.pdf
1.4M Parkosz4demoFR11djvuMRC_p2d600.djvu
1.3M Parkosz4demoFR11djvuaMRC_p2d600.djvu
1.2M Parkosz4demoFR11pdf_p2d600.djvu
779K Parkosz4demoFR11djvu.djvu
668K Parkosz_di600.djvu
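To compare the variants at a glance, the listing can be converted into sizes relative to the original PDF. This sketch assumes the sizes are rounded values with binary units (K = 1024 bytes, M = 1024 × 1024 bytes), as printed by typical directory listings; the helper parse_size is hypothetical.

```python
UNITS = {"K": 1024, "M": 1024 ** 2, "G": 1024 ** 3}


def parse_size(text: str) -> float:
    """Convert a human-readable size such as '2.2M' into bytes."""
    text = text.strip()
    if text[-1] in UNITS:
        return float(text[:-1]) * UNITS[text[-1]]
    return float(text)


listing = """\
50M Parkosz4demoFR11pdf.pdf
11M Parkosz4demoFR11pdfMRC.pdf
2.2M Parkosz4demoFR11pdfaMRC.pdf
1.4M Parkosz4demoFR11djvuMRC_p2d600.djvu
1.3M Parkosz4demoFR11djvuaMRC_p2d600.djvu
1.2M Parkosz4demoFR11pdf_p2d600.djvu
779K Parkosz4demoFR11djvu.djvu
668K Parkosz_di600.djvu
"""

# Map each file name to its approximate size in bytes.
sizes = {}
for line in listing.splitlines():
    size, name = line.split()
    sizes[name] = parse_size(size)

reference = sizes["Parkosz4demoFR11pdf.pdf"]
for name, size in sorted(sizes.items(), key=lambda item: item[1]):
    print(f"{name}: {100 * size / reference:.1f}% of the original PDF")
```

Since the listed sizes are rounded, the percentages are only approximate.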
didjvu output: foreground and background
didjvu output: foreground shape structures
didjvu advanced usage
didjvu --help
usage: didjvu [-h] [--version] {separate,encode,bundle} ...
positional arguments:
  {separate,encode,bundle}
    separate            generate masks for images
    encode              convert images to single-page DjVu documents
    bundle              convert images to bundled multi-page DjVu document
optional arguments:
  -h, --help            show this help message and exit
  --version             show version information and exit
more help:
  didjvu separate --help
  didjvu encode --help
  didjvu bundle --help
didjvu bundle --help
usage: didjvu bundle [-h] [-o FILE] [--pageid-template TEMPLATE]
                     [--loss-level N] [--lossless] [--clean] [--lossy]
                     [--masks MASK [MASK ...]] [--mask MASK] [--fg-slices N]
                     [--fg-crcb {normal,half,full,none}] [--fg-subsample N]
                     [--bg-slices N+...+N] [--bg-crcb {normal,half,full,none}]
                     [--bg-subsample N] [-d N] [-p N]
                     [-m {bernsen,tsai,white_rohrer,gatos,abutaleb,otsu,djvu,sauvola,niblack}]
                     [-v] [-q]
                     <input-image> [<input-image> ...]
positional arguments:
  <input-image>
optional arguments:
  -h, --help            show this help message and exit
  -o FILE, --output FILE
                        output filename
  --pageid-template TEMPLATE
                        naming scheme for page identifiers
  --loss-level N        aggressiveness of lossy compression
  --lossless            lossless compression
  --clean               lossy compression: remove flyspecks
  --lossy               lossy compression: substitute patterns with small
                        variations
  --masks MASK [MASK ...]
                        use pre-generated masks
  --mask MASK           use a pre-generated mask
  ...
ocrodjvu
ocrodjvu is a wrapper for OCR systems that allows you to perform
OCR on DjVu files.
http://jwilk.net/software/ocrodjvu
http://en.wikipedia.org/wiki/OCRopus
http://en.wikipedia.org/wiki/Tesseract_(software)
http://en.wikipedia.org/wiki/CuneiForm_(software)
http://en.wikipedia.org/wiki/Ocrad
http://en.wikipedia.org/wiki/GOCR
Developed since 2008,
current version 0.7.12 (released on 2012-08-15)
Platforms: Linux distributions Debian, Ubuntu, openSUSE
Debian+Ubuntu popcon installed/votes: ∼ 2 000/100
Language versions: English
Demonstration: Open Virtual Appliance
http://fleksem.klf.uw.edu.pl/ndt/sid4ocr,
[…] squeeze4ocropus
ocrodjvu simple use example
ocrodjvu -D -e tesseract -l pol
Parkosz_di600.djvu -o Parkosz_di600t.djvu
Processing '../Parkosz4demo/di/Parkosz_di600.djvu':
- Page #1
- Page #2
- Page #3
- Page #4
- Page #5
- Page #6
- Page #7
Intermediate files were left
in the '/tmp/ocrodjvu.ueIzif' directory.
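The didjvu and ocrodjvu examples can be chained into one small workflow: TIFF scans are bundled into a DjVu file, which is then given a hidden text layer. The sketch below is one possible arrangement, not the author's script. The helpers pipeline_commands and run_pipeline are hypothetical, and only options that appear in the two examples are used. By default it performs a dry run and merely prints the commands.

```python
import subprocess
from typing import List, Sequence


def pipeline_commands(images: Sequence[str],
                      bundled: str,
                      ocred: str,
                      dpi: int = 600,
                      engine: str = "tesseract",
                      language: str = "pol") -> List[List[str]]:
    """Build the didjvu and ocrodjvu argument lists for a scan-to-OCR run."""
    didjvu = ["didjvu", "bundle", "-d", str(dpi), "-o", bundled, *images]
    ocrodjvu = ["ocrodjvu", "-e", engine, "-l", language, bundled, "-o", ocred]
    return [didjvu, ocrodjvu]


def run_pipeline(commands: List[List[str]], dry_run: bool = True) -> None:
    """Print the commands, or execute them in order when dry_run is False."""
    for cmd in commands:
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=True)  # stop at the first failing step


images = ["Parkosz_0003.tif", "Parkosz_0004.tif"]
commands = pipeline_commands(images, "Parkosz_di600.djvu", "Parkosz_di600t.djvu")
run_pipeline(commands)
```

Calling run_pipeline(commands, dry_run=False) would execute both steps, provided didjvu, ocrodjvu and the chosen OCR engine are installed.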
Tesseract 3.02 & ocrodjvu output: hidden text
ocrodjvu advanced usage
ocrodjvu --help
usage: ocrodjvu [options] FILE
positional arguments:
  FILE                  DjVu file to process
optional arguments:
  -h, --help            show this help message and exit
  -v, --version         show version information and exit
  -e ENGINE, --engine ENGINE
                        OCR engine to use
  --list-engines        print list of available OCR engines
  --ocr-only            don't save pages without OCR
  --clear-text          remove existing hidden text
  -l LANGUAGE, --language LANGUAGE
                        set recognition language
  --list-languages      print list of available languages
  --render {foreground,all,mask}
                        image layers to render
  ...
advanced options:
  -D, --debug           don't delete intermediate files
  -X KEY=VALUE          set an engine-specific property
  --on-error {abort,resume}
                        error handling strategy
  --html5               use HTML5 parse
Scanned publications in digital libraries: new Open Source DjVu tools
Jakub Wilk’s utilities
ocrodjvu
Open Source vs commercial OCR
http://lib.psnc.pl/publication/428
djvusmooth
djvusmooth is a graphical editor for DjVu documents
http://jwilk.net/software/djvusmooth
Developed since 2008,
current version 0.2.13 (released on 2012-10-02)
Platforms: Linux distributions Debian, Ubuntu, openSUSE
Debian+Ubuntu popcon installed/votes: ∼ 1 500/80
Language versions: English, Russian, Spanish
Demonstration: Open Virtual Appliance
http://fleksem.klf.uw.edu.pl/ndt/ubuntu4poliqarp/
djvusmooth: editing hidden text
djvusmooth: using an external text editor
Closing remarks
Thank you
for your attention!
Any questions?

Home assignment II on Spectroscopy 2024 Answers.pdf
 
Ethnobotany and Ethnopharmacology ......
Ethnobotany and Ethnopharmacology ......Ethnobotany and Ethnopharmacology ......
Ethnobotany and Ethnopharmacology ......
 
Introduction to Quality Improvement Essentials
Introduction to Quality Improvement EssentialsIntroduction to Quality Improvement Essentials
Introduction to Quality Improvement Essentials
 
Phrasal Verbs.XXXXXXXXXXXXXXXXXXXXXXXXXX
Phrasal Verbs.XXXXXXXXXXXXXXXXXXXXXXXXXXPhrasal Verbs.XXXXXXXXXXXXXXXXXXXXXXXXXX
Phrasal Verbs.XXXXXXXXXXXXXXXXXXXXXXXXXX
 
GIÁO ÁN DẠY THÊM (KẾ HOẠCH BÀI BUỔI 2) - TIẾNG ANH 8 GLOBAL SUCCESS (2 CỘT) N...
GIÁO ÁN DẠY THÊM (KẾ HOẠCH BÀI BUỔI 2) - TIẾNG ANH 8 GLOBAL SUCCESS (2 CỘT) N...GIÁO ÁN DẠY THÊM (KẾ HOẠCH BÀI BUỔI 2) - TIẾNG ANH 8 GLOBAL SUCCESS (2 CỘT) N...
GIÁO ÁN DẠY THÊM (KẾ HOẠCH BÀI BUỔI 2) - TIẾNG ANH 8 GLOBAL SUCCESS (2 CỘT) N...
 
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
 
The French Revolution Class 9 Study Material pdf free download
The French Revolution Class 9 Study Material pdf free downloadThe French Revolution Class 9 Study Material pdf free download
The French Revolution Class 9 Study Material pdf free download
 
TESDA TM1 REVIEWER FOR NATIONAL ASSESSMENT WRITTEN AND ORAL QUESTIONS WITH A...
TESDA TM1 REVIEWER  FOR NATIONAL ASSESSMENT WRITTEN AND ORAL QUESTIONS WITH A...TESDA TM1 REVIEWER  FOR NATIONAL ASSESSMENT WRITTEN AND ORAL QUESTIONS WITH A...
TESDA TM1 REVIEWER FOR NATIONAL ASSESSMENT WRITTEN AND ORAL QUESTIONS WITH A...
 
How to Create Map Views in the Odoo 17 ERP
How to Create Map Views in the Odoo 17 ERPHow to Create Map Views in the Odoo 17 ERP
How to Create Map Views in the Odoo 17 ERP
 
CLASS 11 CBSE B.St Project AIDS TO TRADE - INSURANCE
CLASS 11 CBSE B.St Project AIDS TO TRADE - INSURANCECLASS 11 CBSE B.St Project AIDS TO TRADE - INSURANCE
CLASS 11 CBSE B.St Project AIDS TO TRADE - INSURANCE
 
Synthetic Fiber Construction in lab .pptx
Synthetic Fiber Construction in lab .pptxSynthetic Fiber Construction in lab .pptx
Synthetic Fiber Construction in lab .pptx
 
Instructions for Submissions thorugh G- Classroom.pptx
Instructions for Submissions thorugh G- Classroom.pptxInstructions for Submissions thorugh G- Classroom.pptx
Instructions for Submissions thorugh G- Classroom.pptx
 
How to Split Bills in the Odoo 17 POS Module
How to Split Bills in the Odoo 17 POS ModuleHow to Split Bills in the Odoo 17 POS Module
How to Split Bills in the Odoo 17 POS Module
 
Model Attribute Check Company Auto Property
Model Attribute  Check Company Auto PropertyModel Attribute  Check Company Auto Property
Model Attribute Check Company Auto Property
 
Unit 8 - Information and Communication Technology (Paper I).pdf
Unit 8 - Information and Communication Technology (Paper I).pdfUnit 8 - Information and Communication Technology (Paper I).pdf
Unit 8 - Information and Communication Technology (Paper I).pdf
 
Students, digital devices and success - Andreas Schleicher - 27 May 2024..pptx
Students, digital devices and success - Andreas Schleicher - 27 May 2024..pptxStudents, digital devices and success - Andreas Schleicher - 27 May 2024..pptx
Students, digital devices and success - Andreas Schleicher - 27 May 2024..pptx
 
Language Across the Curriculm LAC B.Ed.
Language Across the  Curriculm LAC B.Ed.Language Across the  Curriculm LAC B.Ed.
Language Across the Curriculm LAC B.Ed.
 
Unit 2- Research Aptitude (UGC NET Paper I).pdf
Unit 2- Research Aptitude (UGC NET Paper I).pdfUnit 2- Research Aptitude (UGC NET Paper I).pdf
Unit 2- Research Aptitude (UGC NET Paper I).pdf
 
The approach at University of Liverpool.pptx
The approach at University of Liverpool.pptxThe approach at University of Liverpool.pptx
The approach at University of Liverpool.pptx
 
special B.ed 2nd year old paper_20240531.pdf
special B.ed 2nd year old paper_20240531.pdfspecial B.ed 2nd year old paper_20240531.pdf
special B.ed 2nd year old paper_20240531.pdf
 

Scanned publications in digital libraries: new Open Source DjVu tools.

  • 1. Scanned publications in digital libraries: new Open Source DjVu tools. Janusz S. Bień, Formal Linguistics Department, University of Warsaw. The Library 2.012 Worldwide Virtual Conference, October 3 - 5, 2012. http://bc.klf.uw.edu.pl/298/
  • 2. Introduction: General information. Grant “Digitalization tools for philological research”, 2009-2012: the tools were developed within the Ministry of Science and Higher Education’s grant (no. N N519 384036) directed by the present author. Some links: the project site: https://bitbucket.org/jsbien/ndt ; our digital library: http://bc.klf.uw.edu.pl/ Mailing lists: the announcement list: http://lists.mimuw.edu.pl/listinfo/nmpt-ann ; the discussion and support list: http://lists.mimuw.edu.pl/listinfo/nmpt-l
  • 3. Introduction: Grant results. A DjVu search engine (client-server architecture): Poliqarp for DjVu, the Poliqarp server extension by Jakub Wilk; marasca, the WWW client by Jakub Wilk, cf. http://poliqarp.wbl.klf.uw.edu.pl/en/ ; djview4poliqarp, the remote client for Debian/Ubuntu and MS Windows by Michał Rudolf, cf. https://bitbucket.org/mrudolf/djview-poliqarp/downloads DjVu utilities: pdf2djvu, didjvu, ocrodjvu, djvusmooth by Jakub Wilk; some experimental tools by Tomasz Olejniczak, Michał Rudolf and Piotr Sikora.
  • 4. Introduction: An example: searching a geographical gazetteer
  • 5. Why DjVu? DjVu and DjVuLibre. Yann Le Cun, Léon Bottou, Patrick Haffner, and Paul G. Howard, 1996. What is DjVu? More than just a format for scans: an image compression technique, a document format, and a software platform for delivering documents images over the Internet. OCR, searching and indexing: DjVu pages can contain a “hidden text” chunk which includes the recognized text as well as the coordinates of each word on the page in a compressed form. Quoted from: http://leon.bottou.org/papers/lecun-2001
  • 6. Why DjVu? DjVu and DjVuLibre. Some design principles (for remote access):
      Action              | Real-world equivalent | Acceptable delay
      Zooming/Panning     | Moving the eyes       | Immediate
      Next/Previous Page  | Turning a page        | < 1 second
      Random Page access  | Finding a page        | < 3 seconds
      Quoted from: http://leon.bottou.org/papers/lecun-2001
      Unbundled DjVu documents: every page can be stored and served as a separate file!
  • 7. Why DjVu? DjVu metadata (also eXtensible Metadata Platform)
  • 8. Why DjVu? DjVu outlines
  • 9. Why DjVu? DjVu annotations
  • 10. Why DjVu? DjVu annotations
  • 11. Why DjVu? DjVu external hyperlinks
  • 12. Why DjVu? DjVu internal hyperlinks
  • 13. Why DjVu? DjVu internal hyperlinks
  • 14. Why DjVu? Referencing DjVu documents (the first page)
  • 15. Why DjVu? Referencing DjVu documents (a specific page)
  • 16. Why DjVu? Referencing DjVu documents (a view)
  • 17. Why DjVu? Referencing DjVu documents (highlightings)
  • 18. Why DjVu? URLs for DjVu documents: http://triggs.djvu.org/century-dictionary.com/04/index04.djvu
  • 19. Why DjVu? URLs for DjVu documents: http://triggs.djvu.org/century-dictionary.com/04/index04.djvu?djvuopts=&page=p2719.djvu
  • 20. Why DjVu? URLs for DjVu documents: http://triggs.djvu.org/century-dictionary.com/04/index04.djvu?djvuopts=&page=p2719.djvu&zoom=556&showposition=0.49,0.22
  • 21. Why DjVu? URLs for DjVu documents: http://triggs.djvu.org/century-dictionary.com/04/index04.djvu?djvuopts=&page=p2719.djvu&zoom=556&showposition=0.49,0.22&highlight=1100,3735,217,46&highlight=1284,3538,166,35&highlight=1640,3538,166,35&highlight=1901,3288,168,35
  • 22. Why DjVu? Creating URLs with djview
  • 23. Why DjVu? Creating URLs with djview
  • 24. Why DjVu? Creating URLs with djview
  • 25. Why DjVu? Layers of DjVu documents. Graphic layers: Stencil; Background, usually in lower resolution; Foreground, encoded using shape dictionaries. Hidden text layer, encoded in Unicode.
  • 26. Why DjVu? The legal status of the DjVu technology: a patent (or more?) granted; some patents pending (?); crucial code (DjVuLibre, djview) available under the GNU General Public License and maintained by the inventors of DjVu, http://djvu.sourceforge.net/ ; new software available under the GNU General Public License.
  • 27. Why DjVu? GNU GPL. GNU General Public License: 4 freedoms (http://www.gnu.org/philosophy/free-sw.html):
      The freedom to run the program, for any purpose.
      The freedom to study how the program works, and adapt it to your needs.
      The freedom to redistribute copies so you can help your neighbor.
      The freedom to improve the program, and release your improvements to the public, so that the whole community benefits.
  • 28. Sample input files: Tagged Image File Format. ABBYY FineReader 11
  • 29. Sample input files: ABBYY FineReader 11, Optical Character Recognition
  • 30. Sample input files: ABBYY FineReader 11 OCR output, text under image. Formats and sizes:
      50M   Parkosz4demoFR11pdf.pdf
      11M   Parkosz4demoFR11pdfMRC.pdf
      2.2M  Parkosz4demoFR11pdfaMRC.pdf
      779K  Parkosz4demoFR11djvu.djvu
      MRC (Multiple Raster Content): a time-consuming compression method with a high compression ratio.
      PDF/A, a variant of Portable Document Format for archiving: ISO 19005-1:2005. Document management - Electronic document file format for long-term preservation …
  • 31. Sample input files: a FineReader 11 PDF output, outline and metadata
  • 32. Sample input files: the FineReader 11 DjVu output, outline and metadata
  • 33. Sample input files: the FineReader 11 DjVu output, foreground and hidden text
  • 34. Jakub Wilk’s utilities: pdf2djvu. http://jwilk.net/software/pdf2djvu Developed since 2007, current version 0.7.14 (released on 2012-09-18). Platforms: included in the following Unix distributions: Debian, Ubuntu, openSUSE, FreeBSD, Digitlab (http://dl.psnc.pl/2012/09/23/digitlab/); Win32 (with a GUI: http://www.trustfm.net/GeneralTools/SoftwarePdfToDjvuGUI.php). Language versions: English, German, Polish, Russian, Ukrainian. Demonstration: Open Virtual Appliance http://fleksem.klf.uw.edu.pl/ndt/ubuntu4poliqarp/NDT_ubuntu4poliqarp1.ova
  • 35. Jakub Wilk’s utilities: pdf2djvu users. Debian/Ubuntu popularity contest: installed/votes: ∼ 45 000/1000. Debian only actual use:
  • 36. Jakub Wilk’s utilities: pdf2djvu simple use example:
      pdf2djvu -d 600 Parkosz4demoFR11pdf.pdf -o Parkosz4demoFR11pdf_p2d600.djvu
      Parkosz4demoFR11pdf.pdf:
      - page #1 -> #1
      - page #2 -> #2
      - page #3 -> #3
      - page #4 -> #4
      - page #5 -> #5
      - page #6 -> #6
      - page #7 -> #7
      - Warning: metadata[CreationDate] is not a valid date
      - Warning: metadata[ModDate] is not a valid date
      0,091 bits/pixel; 42,890:1, 97,67% saved, 51802361 bytes in, 1207797 bytes out
  • 37. Jakub Wilk’s utilities: pdf2djvu output size:
      50M   Parkosz4demoFR11pdf.pdf
      11M   Parkosz4demoFR11pdfMRC.pdf
      2.2M  Parkosz4demoFR11pdfaMRC.pdf
      1.4M  Parkosz4demoFR11djvuMRC_p2d600.djvu
      1.3M  Parkosz4demoFR11djvuaMRC_p2d600.djvu
      1.2M  Parkosz4demoFR11pdf_p2d600.djvu
      779K  Parkosz4demoFR11djvu.djvu
  • 38. Jakub Wilk’s utilities: pdf2djvu output, foreground and hidden text
  • 39. Jakub Wilk’s utilities: pdf2djvu output, outline and metadata
  • 40. Jakub Wilk’s utilities: pdf2djvu advanced usage:
      Usage: pdf2djvu [-o <output-djvu-file>] [options] <pdf-file>
             pdf2djvu -i <index-djvu-file> [options] <pdf-file>
      Options:
        -i, --indirect=FILE   -o, --output=FILE   --pageid-prefix=NAME   --pageid-template=TEMPLATE   --page-title-template=TEMPLATE
        -d, --dpi=RESOLUTION   --guess-dpi   --media-box   --page-size=WxH
        --bg-slices=N,...,N   --bg-slices=N+...+N   --bg-subsample=N
        --fg-colors=default   --fg-colors=web   --fg-colors=black   --fg-colors=N
        --monochrome   --loss-level=N   --lossy   --anti-alias
        --no-metadata   --verbatim-metadata   --no-outline
        --hyperlinks=border-avis   --hyperlinks=#RRGGBB   --no-hyperlinks
        --no-text   --words   --lines   --crop-text   --no-nfkc   --filter-text=COMMAND-LINE
        -p, --pages=...   -v, --verbose   -j, --jobs=N   -q, --quiet   -h, --help   --version
  • 41. Jakub Wilk’s utilities: didjvu. didjvu uses the Gamera framework to separate foreground/background layers, which can then be encoded into a DjVu file. http://jwilk.net/software/didjvu http://gamera.informatik.hsnr.de/ (http://minidjvu.sourceforge.net/) Developed since 2009, current version 0.2.6 (released on 2012-05-15). Platforms: Linux distributions Debian, Ubuntu. Debian+Ubuntu popcon installed/votes: ∼ 200/20. Language versions: English. Demonstration: Open Virtual Appliance http://fleksem.klf.uw.edu.pl/ndt/ubuntu4poliqarp/NDT_ubuntu4poliqarp1.ova
  • 42. Jakub Wilk’s utilities: didjvu simple use example:
      didjvu bundle -d 600 -o Parkosz_di600.djvu *.tif
      Parkosz_0003.tif:
      - reading image
      - converting to DjVu
      - 0.029 bits/pixel; 275.856:1, 99.64% saved, 15078290 bytes in, 54660 bytes out
      Parkosz_0004.tif:
      - reading image
      - converting to DjVu
      - 0.054 bits/pixel; 147.387:1, 99.32% saved, 14263346 bytes in, 96775 bytes out
      ...
      bundling
  • 43. Jakub Wilk’s utilities: didjvu output size:
      50M   Parkosz4demoFR11pdf.pdf
      11M   Parkosz4demoFR11pdfMRC.pdf
      2.2M  Parkosz4demoFR11pdfaMRC.pdf
      1.4M  Parkosz4demoFR11djvuMRC_p2d600.djvu
      1.3M  Parkosz4demoFR11djvuaMRC_p2d600.djvu
      1.2M  Parkosz4demoFR11pdf_p2d600.djvu
      779K  Parkosz4demoFR11djvu.djvu
      668K  Parkosz_di600.djvu
  • 44. Jakub Wilk’s utilities: didjvu output, foreground and background
  • 45. Jakub Wilk’s utilities: didjvu output, foreground shape structures
  • 46. Jakub Wilk’s utilities: didjvu advanced usage:
      didjvu --help
      usage: didjvu [-h] [--version] {separate,encode,bundle} ...
      positional arguments:
        {separate,encode,bundle}
          separate   generate masks for images
          encode     convert images to single-page DjVu documents
          bundle     convert images to bundled multi-page DjVu document
      optional arguments:
        -h, --help   show this help message and exit
        --version    show version information and exit
      more help:
        didjvu separate --help
        didjvu encode --help
        didjvu bundle --help
  • 47. Jakub Wilk’s utilities: didjvu advanced usage:
      didjvu bundle --help
      usage: didjvu bundle [-h] [-o FILE] [--pageid-template TEMPLATE] [--loss-level N] [--lossless] [--clean] [--lossy]
                           [--masks MASK [MASK ...]] [--mask MASK] [--fg-slices N] [--fg-crcb {normal,half,full,none}]
                           [--fg-subsample N] [--bg-slices N+...+N] [--bg-crcb {normal,half,full,none}] [--bg-subsample N]
                           [-d N] [-p N] [-m {bernsen,tsai,white_rohrer,gatos,abutaleb,otsu,djvu,sauvola,niblack}] [-v] [-q]
                           <input-image> [<input-image> ...]
      positional arguments:
        <input-image>
      optional arguments:
        -h, --help                   show this help message and exit
        -o FILE, --output FILE       output filename
        --pageid-template TEMPLATE   naming scheme for page identifiers
        --loss-level N               aggressiveness of lossy compression
        --lossless                   lossless compression
        --clean                      lossy compression: remove flyspecks
        --lossy                      lossy compression: substitute patterns with small variations
        --masks MASK [MASK ...]      use pre-generated masks
        --mask MASK                  use a pre-generated mask
        ...
  • 48. Jakub Wilk’s utilities: ocrodjvu. ocrodjvu is a wrapper for OCR systems that allows you to perform OCR on DjVu files. http://jwilk.net/software/ocrodjvu http://en.wikipedia.org/wiki/OCRopus http://en.wikipedia.org/wiki/Tesseract_(software) http://en.wikipedia.org/wiki/CuneiForm_(software) http://en.wikipedia.org/wiki/Ocrad http://en.wikipedia.org/wiki/GOCR Developed since 2008, current version 0.7.12 (released on 2012-08-15). Platforms: Linux distributions Debian, Ubuntu, openSUSE. Debian+Ubuntu popcon installed/votes: ∼ 2 000/100. Language versions: English. Demonstration: Open Virtual Appliance http://fleksem.klf.uw.edu.pl/ndt/sid4ocr, […] squeeze4ocropus
  • 49. Jakub Wilk’s utilities: ocrodjvu simple use example:
      ocrodjvu -D -e tesseract -l pol Parkosz_di600.djvu -o Parkosz_di600t.djvu
      Processing '../Parkosz4demo/di/Parkosz_di600.djvu':
      - Page #1
      - Page #2
      - Page #3
      - Page #4
      - Page #5
      - Page #6
      - Page #7
      Intermediate files were left in the '/tmp/ocrodjvu.ueIzif' directory.
  • 50. Jakub Wilk’s utilities: ocrodjvu. Tesseract 3.02 & ocrodjvu output, hidden text
  • 51. Jakub Wilk’s utilities: ocrodjvu advanced usage:
      ocrodjvu --help
      usage: ocrodjvu [options] FILE
      positional arguments:
        FILE   DjVu file to process
      optional arguments:
        -h, --help                        show this help message and exit
        -v, --version                     show version information and exit
        -e ENGINE, --engine ENGINE        OCR engine to use
        --list-engines                    print list of available OCR engines
        --ocr-only                        don't save pages without OCR
        --clear-text                      remove existing hidden text
        -l LANGUAGE, --language LANGUAGE  set recognition language
        --list-languages                  print list of available languages
        --render {foreground,all,mask}    image layers to render
        ...
      advanced options:
        -D, --debug                  don't delete intermediate files
        -X KEY=VALUE                 set an engine-specific property
        --on-error {abort,resume}    error handling strategy
        --html5                      use HTML5 parser
  • 52. Jakub Wilk’s utilities: ocrodjvu. Open Source vs commercial OCR: http://lib.psnc.pl/publication/428
  • 53. Jakub Wilk’s utilities: djvusmooth. djvusmooth is a graphical editor for DjVu documents. http://jwilk.net/software/djvusmooth Developed since 2008, current version 0.2.13 (released on 2012-10-02). Platforms: Linux distributions Debian, Ubuntu, openSUSE. Debian+Ubuntu popcon installed/votes: ∼ 1 500/80. Language versions: English, Russian, Spanish. Demonstration: Open Virtual Appliance http://fleksem.klf.uw.edu.pl/ndt/ubuntu4poliqarp/
  • 54. Jakub Wilk’s utilities: djvusmooth, editing hidden text
  • 55. Jakub Wilk’s utilities: djvusmooth, using external text editor
  • 56. Closing remarks: Thank you for your attention! Any questions?
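
The URL scheme shown on slides 18-21 (a document URL followed by `?djvuopts=` and then `page`, `zoom`, `showposition` and repeated `highlight` arguments) can be assembled programmatically. The following is only a minimal sketch: the function name `djvu_view_url` and its parameters are hypothetical, and reading each `highlight` value as four integers (x, y, width, height) is an assumption that the slides do not state.

```python
from urllib.parse import urlencode


def djvu_view_url(base, page=None, zoom=None, showposition=None, highlights=()):
    """Build a DjVu viewer URL in the style of slides 18-21.

    The argument names mirror the query keys seen on the slides; the
    meaning of the four highlight numbers is assumed, not confirmed.
    """
    params = []
    if page is not None:
        params.append(("page", page))
    if zoom is not None:
        params.append(("zoom", str(zoom)))
    if showposition is not None:
        x, y = showposition
        params.append(("showposition", f"{x},{y}"))
    for rect in highlights:
        # One highlight argument per rectangle, values joined by commas.
        params.append(("highlight", ",".join(str(v) for v in rect)))
    if not params:
        return base  # slide 18: the bare document URL
    # Keep commas unescaped so the result matches the slides verbatim.
    return f"{base}?djvuopts=&{urlencode(params, safe=',')}"


BASE = "http://triggs.djvu.org/century-dictionary.com/04/index04.djvu"
print(djvu_view_url(BASE, page="p2719.djvu", zoom=556,
                    showposition=(0.49, 0.22),
                    highlights=[(1100, 3735, 217, 46)]))
```

Building the query from a list of pairs rather than a dictionary keeps the argument order and allows `highlight` to repeat, as it does on slide 21.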
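
The compression figures printed by pdf2djvu and didjvu on slides 36 and 42 can be checked from the reported byte counts alone. The sketch below assumes the ratio is simply bytes in divided by bytes out and the saving is its complement; the helper name `compression_stats` is hypothetical, and the bits/pixel figure is left out because the pixel count is not given on the slides.

```python
def compression_stats(bytes_in, bytes_out):
    """Return (compression ratio, percent saved) for a conversion."""
    ratio = bytes_in / bytes_out
    saved_percent = 100.0 * (1.0 - bytes_out / bytes_in)
    return ratio, saved_percent


# Byte counts copied from the tool output quoted on slides 36 and 42.
for label, b_in, b_out in [
    ("pdf2djvu, whole document", 51802361, 1207797),
    ("didjvu, Parkosz_0003.tif", 15078290, 54660),
    ("didjvu, Parkosz_0004.tif", 14263346, 96775),
]:
    ratio, saved = compression_stats(b_in, b_out)
    print(f"{label}: {ratio:.3f}:1, {saved:.2f}% saved")
```

Under these assumptions the computed values agree with the ratios and savings printed by the tools.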
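
The two command lines from slides 42 and 49 (bundling TIFF scans with didjvu, then adding a hidden text layer with ocrodjvu) can be chained in a small script. This is a sketch under assumptions: the helper names are hypothetical, only flags that appear on the slides are used (the `-D` debug flag is left out), and the tools are assumed to be installed and on the PATH.

```python
import glob
import subprocess


def didjvu_bundle_cmd(images, output, dpi=600):
    """Argument list for: didjvu bundle -d DPI -o OUTPUT IMAGES..."""
    return ["didjvu", "bundle", "-d", str(dpi), "-o", output, *images]


def ocrodjvu_cmd(djvu_in, djvu_out, engine="tesseract", language="pol"):
    """Argument list for: ocrodjvu -e ENGINE -l LANG IN -o OUT"""
    return ["ocrodjvu", "-e", engine, "-l", language, djvu_in, "-o", djvu_out]


def scans_to_searchable_djvu(pattern, bundled, with_text, run=subprocess.run):
    """Run both steps in order; `run` can be swapped out for a dry run."""
    images = sorted(glob.glob(pattern))
    if not images:
        raise FileNotFoundError(f"no input images match {pattern!r}")
    for cmd in (didjvu_bundle_cmd(images, bundled),
                ocrodjvu_cmd(bundled, with_text)):
        run(cmd, check=True)  # stop at the first failing step


if __name__ == "__main__":
    scans_to_searchable_djvu("*.tif", "Parkosz_di600.djvu", "Parkosz_di600t.djvu")
```

Passing argument lists instead of a shell string avoids quoting problems with file names, and expanding the glob in Python replaces the shell's `*.tif` expansion.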