Convert PDF to EPUB with SPiZone

Converting PDF to EPUB can be challenging without the right tools. After extensive R&D, SPi has developed a new approach for extracting text from searchable PDF inputs.

Published in: Technology

  1. SPiZONE Presentation. We inSPire success.
  2. Challenges in Text Extraction from PDF
     • PDF is not a markup format, so extracting text from a PDF file is not easy.
     • Text extraction has to account for fonts, encodings and, in some cases, font subsets.
     • Common problems when extracting text from PDF with conventional methods:
       - Special characters are not extracted properly.
       - Formatting, including case changes, is lost.
       - Paragraphs are merged or split where they should not be.
       - Content is extracted in the wrong order.
       - Text in columns is mixed up.
  3. Introduction
     • After extensive R&D, SPi has developed a new approach for extracting text from searchable PDF inputs.
     • The SPiZONE tool was developed to provide a generic workflow that covers OCR on raster PDFs and scanned images as well as text extraction from searchable PDFs.
     • The output of SPiZONE Verify is a short-tagged text file, which can be converted further into any output format such as XML or ePub.
  4. Product Highlights
     • Text extraction is possible for all languages.
     • Text accuracy is more than 99.95%.
     • Table extraction, including column spanning and row spanning, based on user input.
     • Image extraction.
     • Text within a zone can be marked as 'Ignore Text' so that it is not produced in the output.
  5. PDF to Text using SPiZONE - Quick Workflow
     • SZI Generator: SZI Generator (server process)
     • SPiZONE Edit: styling and zoning
     • Extraction: PDF to HTML (server process)
     • SPiZONE Verify: content QA
  6. SZI Generation
     • Server-based process
     • Input: PDF
     • Output: LowRes TIFF and SZI
     • SZI: Styling and Zoning Information
  7. SPiZONE Edit
     • Styling and Zoning application
     • Input: TIFF and SZI
     • Output: SZI
     • The user identifies the text to be extracted by drawing zones. While drawing zones, the user assigns a style name, a sequence number and other properties to each element.
     • These style names are used during post-extraction processing and during XML/ePub conversion.
     • The zone information is saved in the SZI file.
  8. SPiZONE Edit -- DEMO
  9. Text Extraction from PDF
     • Server-based process
     • Input: PDF and SZI
     • Output: HTML, SZD
     • SZD: SPiZONE Document, used for logging.
     • Font details, uncertain spaces, soft hyphens and similar items are flagged in the extracted file; SPiZONE Verify uses these flags.
  10. SPiZONE Verify
     • OCR/Text Extraction QA application
     • Input: Extracted content in HTML format, SZI and LowRes TIFF
     • Output: Short-tagged files
     • With this application, the user performs a regulated content check on the extracted HTML files.
     • Font Normalization is used to make sure all characters are extracted correctly. The user can correct any discrepancies.
     • Verify will not allow the user to create a short-tagged file until all fonts are normalized and all uncertain spaces and soft hyphens are checked.
     • To see how SPiZONE Verify works, open the video on the next slide.
  11. SPiZONE Verify -- DEMO
  12. Processing SPiZONE Output
     • The workflow from PDF to short-tagged text file is generic for all projects.
     • Short-tagged text files can be converted further into XML, ePub or any other format, as the project requires.
     • SPiZONE Structure is a customizable application used for conversion into any format, including (but not limited to) XML and ePub.
     • Structure applications can be built in a short period of time for any XML conversion project.
     • The SPiZONE ePub application accepts short-tagged files as input and creates ePub2/3.
  13. SPiZONE Edit Samples
  14. SPiZONE Edit Samples
  15. SPiZONE Edit Samples
  16. SPiZONE Verify Samples
  17. SPiZONE Verify Samples
  18. SPiZONE Verify Samples
  19. SPiZONE Verify Samples
  20. SPiZONE Verify Samples
  21. ePUB Output Samples
  22. ePUB Output Samples
  23. ePUB Output Samples
  24. Know more about PDF to ePUB conversion: http://www.spi-global.com/content-solutions/our-services/publishingsolutions/conversion/convert-pdf-epub
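
Slides 6, 7, 9 and 10 list the inputs and outputs of each workflow stage. As a rough illustration, the Python sketch below records those stages as plain data and checks that every stage only consumes artifacts that exist by the time it runs. The `STAGES` table and the `missing_inputs` helper are hypothetical names introduced here, not part of SPiZONE, and the sketch assumes that the "TIFF" consumed by SPiZONE Edit is the LowRes TIFF produced during SZI generation.

```python
# Stage name, inputs and outputs, transcribed from slides 6, 7, 9 and 10.
# Assumption: the "TIFF" read by SPiZONE Edit is the LowRes TIFF from slide 6.
STAGES = [
    ("SZI Generation", {"PDF"}, {"LowRes TIFF", "SZI"}),
    ("SPiZONE Edit", {"LowRes TIFF", "SZI"}, {"SZI"}),
    ("Text Extraction", {"PDF", "SZI"}, {"HTML", "SZD"}),
    ("SPiZONE Verify", {"HTML", "SZI", "LowRes TIFF"}, {"Short-tagged text"}),
]


def missing_inputs(stages, initial):
    """Map each stage name to the inputs that no earlier stage provides."""
    available = set(initial)
    problems = {}
    for name, inputs, outputs in stages:
        lacking = inputs - available
        if lacking:
            problems[name] = lacking
        # Outputs are recorded even for a stage with missing inputs,
        # so later stages are judged only on what they themselves lack.
        available |= outputs
    return problems


# Starting from a PDF alone, the full chain has no gaps.
full_chain = missing_inputs(STAGES, {"PDF"})

# Skipping SZI Generation leaves Edit and Verify without their inputs.
without_generation = missing_inputs(STAGES[1:], {"PDF"})
```

Keeping the stages as data rather than as code makes the dependency between the server processes and the two desktop applications easy to inspect.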

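Slide 7 describes zones that carry a style name and a sequence number, and slide 4 mentions an 'Ignore Text' option. The Python sketch below suggests how such zone records could address the ordering and column problems listed on slide 2. The `Zone` class, its field names and the sample values are hypothetical; the actual SZI file format is not described in this deck.

```python
from dataclasses import dataclass


# Hypothetical zone record; the real SZI schema is not given in the slides.
@dataclass
class Zone:
    sequence: int          # reading-order number assigned while zoning
    style: str             # style name used later for XML/ePub conversion
    text: str              # text extracted from inside the zone
    ignore: bool = False   # True for text marked as 'Ignore Text'


def ordered_text(zones):
    """Return (style, text) pairs in sequence order, skipping ignored zones."""
    kept = [z for z in zones if not z.ignore]
    return [(z.style, z.text) for z in sorted(kept, key=lambda z: z.sequence)]


# Zones listed in an arbitrary order, as a two-column page might yield them.
zones = [
    Zone(3, "para", "Second column text."),
    Zone(1, "heading", "Chapter 1"),
    Zone(4, "footer", "Page 12", ignore=True),
    Zone(2, "para", "First column text."),
]
result = ordered_text(zones)
```

Because the order comes from the sequence numbers the user assigned rather than from the position of text in the PDF content stream, columns are not interleaved and ignored text never reaches the output.
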
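Slides 9 and 10 state that extraction flags font details, uncertain spaces and soft hyphens, and that SPiZONE Verify refuses to create a short-tagged file until all of them have been checked. A minimal Python sketch of that kind of gate follows. The `Flag` class, the `kind` labels and the function names are assumptions made for illustration and do not reflect the actual HTML or SZD formats.

```python
from dataclasses import dataclass


# Hypothetical flag record; kinds mirror the items named on slide 9.
@dataclass
class Flag:
    kind: str              # "font", "uncertain-space" or "soft-hyphen"
    location: str          # free-form pointer to where the flag was raised
    resolved: bool = False # set to True once the user has reviewed the item


def pending(flags):
    """Return the flags that still need user review."""
    return [f for f in flags if not f.resolved]


def can_create_short_tagged(flags):
    """Allow output only when every flagged item has been reviewed."""
    return not pending(flags)


flags = [
    Flag("font", "page 1"),
    Flag("soft-hyphen", "page 3", resolved=True),
    Flag("uncertain-space", "page 3"),
]

blocked_before_review = not can_create_short_tagged(flags)
open_kinds = [f.kind for f in pending(flags)]

# Simulate the user reviewing every remaining item.
for f in flags:
    f.resolved = True
allowed_after_review = can_create_short_tagged(flags)
```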