Extracting text from PDF
How far does the rabbit hole go?
Kaz Yoshikawa

kaz@digitallynx.com
May 2016
How to extract text from PDF on iOS?
🤔
Some say "extracting text from PDF is really hard."
That's just an exaggeration, isn't it?
References
• Text extraction from PDFs in Asian languages (アジア言語圏のPDFのテキスト抽出)
  http://ponpoko1968.hatenablog.com/entry/20100810/1281438828
  http://ponpoko1968.hatenablog.com/entry/20100915/1284559500
• How to build a PDF viewer (series), HMDT (PDFビューワの作り方)
  https://news.mynavi.jp/itsearch/article/devsoft/1212
• PDF Thousand and One Nights, Antenna House (PDF千夜一夜)
  http://www.antenna.co.jp/pdf/reference/Blog-Index.htm
• PDFKitten
  https://github.com/KurtCode/PDFKitten
What is hard? Really?
Why so difficult?
• iOS does not provide any API for extracting text directly
  (OS X has PDFKit, which is still limited)
• Core Graphics provides only a very basic API
• You need to write a parser yourself, and that is hard. Really!
• The extracted text data is not Unicode
• It needs a glyph ID to Unicode mapping
Understanding PDF Structure
Document - Page
[Diagram: Document → Outline, Pages, Metadata; Pages → Page, Page, Page]
Page - Font
[Diagram: Page → MediaBox, Resources, Contents; Resources → … Font …; Font → Tc1, Tc2, …; each font entry → subtype, …]
case: Type 1
Subtype         Type1
Name            Referenced from the Font subdictionary
BaseFont        PostScript font name
FirstChar       First character code defined in the font’s Widths array
LastChar        Last character code defined in the font’s Widths array
Widths          An array of (LastChar − FirstChar + 1) widths
FontDescriptor  A font descriptor describing the font’s metrics other than its glyph widths
Encoding        Font’s character encoding
ToUnicode       CMap file that maps character codes to Unicode values
PDF Reference: p412
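The relationship between FirstChar, LastChar and Widths can be illustrated with a small lookup. This is only a sketch in Python rather than the Core Graphics API; the function name, the sample widths and the `missing_width` fallback are assumptions made for illustration.

```python
def glyph_width(code, first_char, widths, missing_width=0):
    """Look up the width of a character code in a simple font's Widths array.

    Widths holds one entry per code from FirstChar to LastChar, so the
    entry for a code lives at index (code - FirstChar).
    """
    index = code - first_char
    if 0 <= index < len(widths):
        return widths[index]
    # Codes outside FirstChar..LastChar fall back to a default (assumed here).
    return missing_width


# Hypothetical font covering codes 32..35 (space, '!', '"', '#').
first_char, last_char = 32, 35
widths = [278, 278, 355, 556]
assert len(widths) == last_char - first_char + 1

print(glyph_width(ord("#"), first_char, widths))
print(glyph_width(ord("A"), first_char, widths))
```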
case: TrueType
Subtype         TrueType
Name            Referenced from the Font subdictionary
BaseFont        PostScript font name
FirstChar       First character code defined in the font’s Widths array
LastChar        Last character code defined in the font’s Widths array
Widths          An array of (LastChar − FirstChar + 1) widths
FontDescriptor  A font descriptor describing the font’s metrics other than its glyph widths
Encoding        Font’s character encoding
ToUnicode       CMap file that maps character codes to Unicode values
PDF Reference: p412
Same as Type 1, with some differences
case: Type 3
Subtype              Type3
Name                 Referenced from the Font subdictionary
FontBBox             A rectangle expressed in the glyph coordinate system
FontMatrix           An array of six numbers specifying the font matrix, mapping glyph space to text space
CharProcs            ??
FirstChar, LastChar  Same as for Type 1
Widths               Same as for Type 1 (sort of)
FontDescriptor       A font descriptor describing the font’s default metrics other than its glyph widths
Resources            A list of the named resources, such as fonts and images
ToUnicode            CMap file that maps character codes to Unicode values
PDF Reference: p420
case: Type 0 (composite fonts)
Entries of the CIDFont dictionary that a Type 0 font refers to:
Subtype         CIDFontType0 or CIDFontType2
Name            Referenced from the Font subdictionary
BaseFont        The PostScript name of the CIDFont
CIDSystemInfo   A dictionary containing entries that define the character collection of the CIDFont
FontDescriptor  A font descriptor describing the CIDFont’s default metrics other than its glyph widths
DW              The default width for glyphs in the CIDFont. Default value: 1000
DW2             An array of two numbers specifying the default metrics for vertical writing
W2              A description of the metrics for vertical writing for the glyphs in the CIDFont
CIDToGIDMap     Type 2 CIDFonts only (omitted here)
PDF Reference: p436
😏
OK, PDF structure is pretty complex.
Are there any tools?
Tools
PDF-Voyeur
Open Source
https://github.com/below/PDF-Voyeur
[Screenshot: PDF-Voyeur showing a Page with its Font, Contents (text, etc.), Bounding Box, Rotation, and Annotation entries]
Understanding how PDFs are rendered
The Page object knows enough to draw the page
[Diagram: Page → MediaBox (array), Resources (dictionary), Contents (stream); Resources → Font → Tc2]
Drawing operators
Operators
BT /F1 24 Tf 175 720 Td (Hello World!)Tj ET
  BT                  Begin a text object
  /F1 24 Tf           Specify the font (and its size)
  175 720 Td          Specify the location
  (Hello World!) Tj   Draw the text
  ET                  End a text object
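To make the operand-then-operator order concrete, here is a minimal sketch of a scanner for such a content stream. It is written in Python for illustration, not with Core Graphics; `scan` and `shown_strings` are hypothetical names, and the tokenizer deliberately ignores nested parentheses, arrays, dictionaries and inline images.

```python
import re

# Simplified token grammar for a content stream (an assumption of this sketch).
TOKEN = re.compile(
    r"\((?P<lit>(?:\\.|[^\\()])*)\)"    # literal string without nested parentheses
    r"|<(?P<hex>[0-9A-Fa-f\s]*)>"       # hex string
    r"|(?P<name>/[^\s/()<>\[\]]+)"      # name such as /F1
    r"|(?P<num>[-+]?\d*\.?\d+)"         # integer or real number
    r"|(?P<op>[A-Za-z*']+)"             # operator such as BT, Tf, Td, Tj
)


def scan(content):
    """Yield (operator, operands) pairs; operands precede their operator."""
    operands = []
    for match in TOKEN.finditer(content):
        kind = match.lastgroup
        if kind == "op":
            yield match.group("op"), operands
            operands = []
        elif kind == "num":
            operands.append(float(match.group("num")))
        else:
            operands.append((kind, match.group(kind)))


def shown_strings(content):
    """Collect the raw string operand of every Tj operator."""
    return [operands[-1][1] for op, operands in scan(content)
            if op == "Tj" and operands]


stream = "BT /F1 24 Tf 175 720 Td (Hello World!)Tj ET"
for op, operands in scan(stream):
    print(op, operands)
print(shown_strings(stream))
```

The strings collected this way are still raw character codes; turning them into Unicode depends on the font, which is what the decoding cases in this deck are about.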
Rendering Japanese
/C2_0 1 Tf
0 Tc 175 720 Td
<30533093306B3061306F> Tj
Tf, Td, Tj (PDF Reference: p398, 406, 407)
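A hex string like the one above is just a byte sequence. The following sketch (Python, with a hypothetical helper name) splits it into 2-byte character codes. Reading those codes directly as Unicode is an assumption that happens to hold for this sample; in general the codes must be mapped through the font first.

```python
def hex_string_to_codes(hex_string, bytes_per_code=2):
    """Split a PDF hex string such as <3053...> into integer character codes."""
    data = bytes.fromhex(hex_string.strip("<>"))
    return [int.from_bytes(data[i:i + bytes_per_code], "big")
            for i in range(0, len(data), bytes_per_code)]


codes = hex_string_to_codes("<30533093306B3061306F>")
print([hex(code) for code in codes])

# Assumption: these particular codes are UCS-2 values. That is not true for
# every font, so treat this line as a demonstration only.
print("".join(chr(code) for code in codes))  # こんにちは
```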
Decoding Text
case 1: Has a 'ToUnicode' entry
Font entry
Subtype         Type1
Name            Referenced from the Font subdictionary
BaseFont        PostScript font name
FirstChar       First character code defined in the font’s Widths array
LastChar        Last character code defined in the font’s Widths array
Widths          An array of (LastChar − FirstChar + 1) widths
FontDescriptor  A font descriptor describing the font’s metrics other than its glyph widths
Encoding        Font’s character encoding
ToUnicode       CMap file that maps character codes to Unicode values
Parsing CMap
CMap Specification
Adobe CMap and CIDFont Files Specification, Version 1.0, 11 June 1993, Adobe Developer Support (102 pages)
https://www.adobe.com/content/dam/Adobe/en/devnet/font/pdfs/5014.CIDFont_Spec.pdf
CMap example
%!PS-Adobe-3.0 Resource-CMap
%%Version: 1
/CIDInit /ProcSet findresource begin
12 dict begin
begincmap
/CIDSystemInfo 3 dict dup begin
/Registry (Adobe) def
/Ordering (Japan1) def
/Supplement 0 def
end def
/CMapName /83pv-RKSJ-H def
/CMapVersion 1 def
/CMapType 0 def
/UIDOffset 0 def
/XUID [1 10 25324] def
/WMode 0 def
4 begincodespacerange
<00> <80>
<8140> <9ffc>
<a0> <df>
<e040> <fbfc>
endcodespacerange
1 beginnotdefrange
<00> <1f> 1
endnotdefrange
100 begincidrange
<9780> <97fc> 3914
<9840> <9872> 4039
<989f> <98fc> 4090
<9940> <997e> 4184
<9980> <99fc> 4247
<< 90 ranges missing >>
<ed83> <ed83> 7934
<ed84> <ed84> 992
<ed85> <ed85> 7935
<ed86> <ed86> 994
<ed87> <ed87> 7936
endcidrange
17 begincidrange
<ed88> <ed8d> 996
<ed8e> <ed8e> 7937
<< 13 ranges missing >>
<ee9a> <ee9a> 768
<ee9b> <ee9c> 7631
endcidrange
endcmap
CMapName currentdict /CMap defineresource pop
end
end
%%EndResource
%%EOF
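Callouts on the example:
• /CIDSystemInfo (Registry, Ordering, Supplement) ← Adobe Japan 1-0
• /WMode ← Horizontal/Vertical (0 is horizontal, 1 is vertical)
• begincidrange … endcidrange ← CID Range (both blocks)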
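The codespace ranges in the example decide how many bytes each character code occupies: codes starting with 0x00 to 0x80 or 0xa0 to 0xdf are one byte, the others two. A minimal Python sketch of that splitting step follows; the function names are hypothetical, and the fallback for unmatched bytes is a simplification chosen for this sketch.

```python
# The four ranges from the begincodespacerange block of the example CMap.
CODESPACE = [
    (bytes.fromhex("00"), bytes.fromhex("80")),
    (bytes.fromhex("8140"), bytes.fromhex("9ffc")),
    (bytes.fromhex("a0"), bytes.fromhex("df")),
    (bytes.fromhex("e040"), bytes.fromhex("fbfc")),
]


def in_range(code, low, high):
    """Every byte of the code must lie between the matching bytes of low and high."""
    return len(code) == len(low) and all(
        lo <= byte <= hi for byte, lo, hi in zip(code, low, high))


def split_codes(data, codespace=CODESPACE):
    """Split a byte string into character codes of one or two bytes."""
    codes, i = [], 0
    while i < len(data):
        for low, high in codespace:
            candidate = data[i:i + len(low)]
            if in_range(candidate, low, high):
                codes.append(candidate)
                i += len(low)
                break
        else:
            # No range matched: consume one byte (a simplification).
            codes.append(data[i:i + 1])
            i += 1
    return codes


print(split_codes(bytes.fromhex("41 9780 b1")))
```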
begin-end-cidrange
100 begincidrange
<9780> <97fc> 3914
<9840> <9872> 4039
<989f> <98fc> 4090
<9940> <997e> 4184
<9980> <99fc> 4247
…
<ed83> <ed83> 7934
<ed84> <ed84> 992
<ed85> <ed85> 7935
<ed86> <ed86> 994
<ed87> <ed87> 7936
endcidrange
• The code range 0x9780 ∼ 0x97fc
  is mapped to 3914 ∼ 4038
  (0x97fc − 0x9780 = 124, and 3914 + 124 = 4038)
• Unicode code point: UCS-2 (16-bit)
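The range arithmetic can be written down directly. This Python sketch uses the first three entries of the block above; `code_to_cid` is a hypothetical name, and treating each code as a plain integer is a simplification.

```python
# (first code, last code, first CID) taken from the begincidrange block above.
CID_RANGES = [
    (0x9780, 0x97FC, 3914),
    (0x9840, 0x9872, 4039),
    (0x989F, 0x98FC, 4090),
]


def code_to_cid(code, ranges=CID_RANGES):
    """Map a character code to a CID by its offset within the matching range."""
    for low, high, first_cid in ranges:
        if low <= code <= high:
            return first_cid + (code - low)
    return None  # the code is not covered by these ranges


print(code_to_cid(0x9780))  # 3914
print(code_to_cid(0x97FC))  # 4038
print(code_to_cid(0x9840))  # 4039
```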
Some others
• beginbfchar - endbfchar
• beginbfrange - endbfrange
• begincidchar - endcidchar
• begincidrange - endcidrange
• begincodespacerange - endcodespacerange
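All of these sections share the same begin…/end… shape, so a first-pass parser can be quite small. The sketch below is in Python and is not a full PostScript parser: it assumes one entry per line, and the function names and the sample CMap text are made up for illustration.

```python
import re

# Matches e.g. "beginbfchar ... endbfchar"; \1 ties the end keyword to the begin keyword.
SECTION = re.compile(
    r"begin(bfchar|bfrange|cidchar|cidrange|codespacerange)(.*?)end\1", re.S)


def parse_sections(cmap_text):
    """Return {section name: rows}, where each row is a list of raw tokens."""
    sections = {}
    for name, body in SECTION.findall(cmap_text):
        rows = [line.split() for line in body.strip().splitlines() if line.strip()]
        sections.setdefault(name, []).extend(rows)
    return sections


def bfchar_map(sections):
    """Build {code: text} from bfchar rows, reading targets as UTF-16BE."""
    return {int(src.strip("<>"), 16): bytes.fromhex(dst.strip("<>")).decode("utf-16-be")
            for src, dst in sections.get("bfchar", [])}


# Hypothetical ToUnicode fragment, not taken from a real font.
sample = """1 begincodespacerange
<0000> <FFFF>
endcodespacerange
2 beginbfchar
<0001> <3053>
<0002> <3093>
endbfchar
"""

sections = parse_sections(sample)
print(sections["codespacerange"])
print(bfchar_map(sections))
```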
case 2: Encoding is Identity-H or Identity-V, with no 'ToUnicode' entry
Using an external CMap
• Check CIDSystemInfo
• Registry, Ordering, Supplement (e.g. Adobe Japan 1-6)
• Adobe Type Tools
  https://github.com/adobe-type-tools/cmap-resources
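The three CIDSystemInfo values combine into the name of the character collection, which tells you which external CMap to look for. A small Python sketch; the helper name is hypothetical, and the "-UCS2" resource name reflects my understanding of how the cmap-resources repository names its CID-to-Unicode files, so check it against the repository.

```python
def collection_name(registry, ordering, supplement=None):
    """Join CIDSystemInfo values into a name such as Adobe-Japan1-6."""
    base = f"{registry}-{ordering}"
    return base if supplement is None else f"{base}-{supplement}"


# Values as they would be read from a font's CIDSystemInfo dictionary.
info = {"Registry": "Adobe", "Ordering": "Japan1", "Supplement": 6}

print(collection_name(info["Registry"], info["Ordering"], info["Supplement"]))

# Assumed naming of the CID-to-Unicode CMap for the same collection.
print(collection_name(info["Registry"], info["Ordering"]) + "-UCS2")
```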
Adobe Japan 1-6
%!PS-Adobe-3.0 Resource-CMap
/CIDInit /ProcSet findresource begin
12 dict begin
begincmap
/CIDSystemInfo 3 dict dup begin
/Registry (Adobe) def
/Ordering (Japan1) def
/Supplement 6 def
end def
/CMapName /Adobe-Japan1-6 def
/CMapVersion 1.005 def
/CMapType 1 def
/XUID [1 10 25614] def
/WMode 0 def
/CIDCount 23058 def
1 begincodespacerange
<0000> <5AFF>
endcodespacerange
91 begincidrange
<0000> <00ff> 0
<0100> <01ff> 256
<0200> <02ff> 512
<0300> <03ff> 768
<0400> <04ff> 1024
<0500> <05ff> 1280
<0600> <06ff> 1536
<0700> <07ff> 1792
<0800> <08ff> 2048
<0900> <09ff> 2304
…
<5300> <53ff> 21248
<5400> <54ff> 21504
<5500> <55ff> 21760
<5600> <56ff> 22016
<5700> <57ff> 22272
<5800> <58ff> 22528
<5900> <59ff> 22784
<5a00> <5a11> 23040
endcidrange
endcmap
CMapName currentdict /CMap defineresource pop
end
end
https://github.com/adobe-type-tools/cmap-resources/blob/master/cmapresources_japan1-6/CMap/Adobe-Japan1-6
Be careful: the character codes may not be Unicode.
case 3: No 'ToUnicode' entry, Encoding is "WinAnsiEncoding" etc.
Use the following string encodings
WinAnsiEncoding    NSWindowsCP1252StringEncoding
MacRomanEncoding   …
MacExpertEncoding  …
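The same idea outside Foundation, as a Python sketch: pick a codec from the font's Encoding name and decode the raw bytes. The codec names `cp1252` and `mac_roman` are Python's; treating them as equivalent to the PDF encodings is an approximation, and MacExpertEncoding is left out because Python has no matching codec.

```python
# Assumed mapping from PDF encoding names to Python codecs (approximate).
PDF_TO_PYTHON_CODEC = {
    "WinAnsiEncoding": "cp1252",
    "MacRomanEncoding": "mac_roman",
}


def decode_simple(raw, pdf_encoding):
    """Decode single-byte text using the codec that matches the font's Encoding."""
    codec = PDF_TO_PYTHON_CODEC.get(pdf_encoding)
    if codec is None:
        raise ValueError(f"no codec mapping for {pdf_encoding}")
    # Bytes the codec does not define become U+FFFD instead of raising.
    return raw.decode(codec, errors="replace")


print(decode_simple(b"caf\xe9", "WinAnsiEncoding"))   # café
print(decode_simple(b"caf\x8e", "MacRomanEncoding"))  # café
```

Note that the same letter has a different byte value in each encoding, which is why the Encoding entry has to be checked before decoding.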
Enough Talk…

Let's code
Find the 1st page
[Diagram: Document → Outline, Pages, Metadata; Pages → Page, Page, Page]
CGPDFOperatorTable
[Code slide; an arrow marks the callback]
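The slide's code is not part of this transcript. As a rough illustration of the operator-table idea (register a callback per operator name, then let a scanner invoke it while walking the content stream), here is a Python analogue. The class and function names are hypothetical and this is not the Core Graphics API; on iOS the corresponding calls are CGPDFOperatorTableCreate, CGPDFOperatorTableSetCallback, CGPDFScannerCreate and CGPDFScannerScan.

```python
class OperatorTable:
    """Maps operator names to callbacks, loosely modelled on CGPDFOperatorTable."""

    def __init__(self):
        self._callbacks = {}

    def set_callback(self, name, callback):
        self._callbacks[name] = callback

    def scan(self, operations, info):
        """Call the registered callback for each operator; skip the others."""
        for name, operands in operations:
            callback = self._callbacks.get(name)
            if callback is not None:
                callback(operands, info)


def on_show_text(operands, info):
    # 'info' plays the role of the user-supplied context pointer.
    info.append(operands[0])


table = OperatorTable()
table.set_callback("Tj", on_show_text)

# Already-tokenized operations for: BT /F1 24 Tf 175 720 Td (Hello World!)Tj ET
operations = [
    ("BT", []),
    ("Tf", ["/F1", 24]),
    ("Td", [175, 720]),
    ("Tj", ["Hello World!"]),
    ("ET", []),
]

collected = []
table.scan(operations, collected)
print(collected)  # ['Hello World!']
```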
Some Tips
CGPDFDictionaryApplyFunction
• CGPDFDictionaryApplyFunction()
• C-style callback
• Not possible in Swift 1.x (probably)
• Possible in Swift 2
• Enumerates each entry in a CGPDFDictionary
Utility function
DEMO
Wrap up
• Understanding PDF Structure
• Too many encodings, so test data is hard to find
• Too complex, and the documentation is not always clear
• Yes, parsing PDF is hard, really…
Thank You
Kaz Yoshikawa
kaz@digitallynx.com

Extracting text from PDF (iOS)

  • 1. Extracting text from PDF How far does the rabbit hole go? Kaz Yoshikawa
 kaz@digitallynx.com May 2016
  • 2. How to extract text from PDF on iOS?
  • 3. 🤔 I know some say 
 "extracting text from PDF is really hard"
 Just exaggerated, isn't it?
  • 5. References • アジア言語圏のPDFのテキスト抽出
 http://ponpoko1968.hatenablog.com/entry/20100810/1281438828
 http://ponpoko1968.hatenablog.com/entry/20100915/1284559500 • PDFビューワの作り方 (連載)- HMDT
 https://news.mynavi.jp/itsearch/article/devsoft/1212 • PDF千夜一夜 — アンテナハウス
 http://www.antenna.co.jp/pdf/reference/Blog-Index.htm
  • 7. What is hard? Really?
  • 8. Why so difficult? • iOS does not provide any API to extract text directly
 (OS X has PDFKit – still limited) • Core Graphics provides only very basic API • Needs to write parser — hard! really! • Extracted text data is not unicode • Glyph ID to Unicode mapping
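  • 10. Document - Page
   [Diagram: a Document contains Outline, Pages and Metadata; Pages contains the individual Page objects]
  • 11. Page - Font
   [Diagram: a Page has MediaBox, Resources and Contents; Resources holds … Font …; Font holds entries Tc1, Tc2, …; each entry has a subtype …]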
  • 12. case: Type 1 (PDF Reference: p412)
   Subtype: Type1
   Name: Referenced from the Font subdictionary
   BaseFont: PostScript font name
   FirstChar: First character code defined in the font’s Widths array
   LastChar: Last character code defined in the font’s Widths array
   Widths: An array of (LastChar − FirstChar + 1) widths
   FontDescriptor: A font descriptor describing the font’s metrics other than its glyph widths
   Encoding: Font’s character encoding
   ToUnicode: CMap file that maps character codes to Unicode values
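The FirstChar, LastChar and Widths entries fit together as a simple offset lookup. A minimal Python sketch of that relationship follows; `glyph_width` and `example_font` are hypothetical names with made-up values, not part of any PDF library, and the fallback width stands in for whatever the font descriptor would supply.

```python
def glyph_width(font, code, missing_width=0):
    """Look up the width of a character code in a simple font dictionary.

    Widths[0] belongs to FirstChar, Widths[1] to FirstChar + 1, and so on.
    Codes outside FirstChar..LastChar fall back to missing_width.
    """
    first, last = font["FirstChar"], font["LastChar"]
    if first <= code <= last:
        return font["Widths"][code - first]
    return missing_width


# Made-up font entry covering the codes 65..67 ("A".."C").
example_font = {"FirstChar": 65, "LastChar": 67, "Widths": [722, 667, 667]}
```

The array length is LastChar − FirstChar + 1, which is why the index is simply `code - first`.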
  • 13. case: TrueType (PDF Reference: p412)
   Same as Type1 with some differences
   Subtype: TrueType
   Name: Referenced from the Font subdictionary
   BaseFont: PostScript font name
   FirstChar: First character code defined in the font’s Widths array
   LastChar: Last character code defined in the font’s Widths array
   Widths: An array of (LastChar − FirstChar + 1) widths
   FontDescriptor: A font descriptor describing the font’s metrics other than its glyph widths
   Encoding: Font’s character encoding
   ToUnicode: CMap file that maps character codes to Unicode values
  • 14. case: Type 3 (PDF Reference: p420)
   Subtype: Type3
   Name: Referenced from the Font subdictionary
   FontBBox: A rectangle expressed in the glyph coordinate system
   FontMatrix: An array of six numbers specifying the font matrix, mapping glyph space to text space
   CharProcs: ??
   FirstChar, LastChar: ditto
   Widths: ditto, sort of
   FontDescriptor: A font descriptor describing the font’s default metrics other than its glyph widths
   Resources: A list of the named resources, such as fonts and images
   ToUnicode: CMap file that maps character codes to Unicode values
  • 15. Case: Type 0, Composite Fonts (PDF Reference: p436)
   Subtype: CIDFontType0 or CIDFontType2
   Name: Referenced from the Font subdictionary
   BaseFont: The PostScript name of the CIDFont
   CIDSystemInfo: A dictionary containing entries that define the character collection of the CIDFont
   FontDescriptor: A font descriptor describing the CIDFont’s default metrics other than its glyph widths
   DW: The default width for glyphs in the CIDFont. Default value: 1000
   DW2: An array of two numbers specifying the default metrics for vertical writing
   W2: A description of the metrics for vertical writing for the glyphs in the CIDFont
   CIDToGIDMap: Type 2 CIDFonts only (omitted)
  • 16. 😏 OK, PDF structure is pretty complex. Are there any tools?
  • 17. Tools
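  • 21. The Page object knows enough to draw the page
   [Diagram: Page has MediaBox, Resources and Contents; Resources holds Font, which holds Tc2; nodes are labeled dictionary, array or stream; the stream carries the drawing operators]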
  • 22. Operators
   BT                   begin a text object
   /F1 24 Tf            specify font
   175 720 Td           specify location
   (Hello World!) Tj    draw text
   ET                   end a text object
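To make the operator sequence concrete, here is a rough Python sketch that pulls the strings shown by `Tj` out of a content stream. It is an illustration only: `shown_strings` is a hypothetical helper, the regular expression ignores escaped or nested parentheses, hex strings and `TJ` arrays, and a real parser has to tokenize the stream properly.

```python
import re

# The content stream from the slide, as raw bytes.
CONTENT = b"BT /F1 24 Tf 175 720 Td (Hello World!)Tj ET"


def shown_strings(content):
    """Return the raw bytes of every simple literal string followed by Tj.

    Only handles "(...)" strings without backslashes or nested parentheses.
    """
    return re.findall(rb"\(([^()\\]*)\)\s*Tj", content)
```

The bytes returned here are character codes in the encoding of the current font (`/F1`), which is why the font dictionary matters for turning them into text.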
  • 23. Rendering Japanese /C2_0 1 Tf 0 Tc 175 720 Td 
 <30533093306B3061306F> Tj
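The hex string above is a sequence of 2-byte character codes. A small Python sketch of splitting it follows; `hex_string_to_codes` is a hypothetical helper, and the 2-byte code length is an assumption that in a real file comes from the font's CMap. In this particular slide the codes happen to be UTF-16 code units, so converting them directly gives readable text; in general they must first go through the font's ToUnicode or CMap data.

```python
def hex_string_to_codes(hex_string, bytes_per_code=2):
    """Split a PDF hex string such as "<3053...>" into integer character codes."""
    data = bytes.fromhex(hex_string.strip("<>"))
    return [
        int.from_bytes(data[i:i + bytes_per_code], "big")
        for i in range(0, len(data), bytes_per_code)
    ]


codes = hex_string_to_codes("<30533093306B3061306F>")

# Only valid here because these codes coincide with Unicode code points.
text = "".join(chr(code) for code in codes)
```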
  • 24. Tf, Td, Tj PDF Reference: p398,406,407
  • 27. Font entry
   Subtype: Type1
   Name: Referenced from the Font subdictionary
   BaseFont: PostScript font name
   FirstChar: First character code defined in the font’s Widths array
   LastChar: Last character code defined in the font’s Widths array
   Widths: An array of (LastChar − FirstChar + 1) widths
   FontDescriptor: A font descriptor describing the font’s metrics other than its glyph widths
   Encoding: Font’s character encoding
   ToUnicode: CMap file that maps character codes to Unicode values
  • 29. CMap Specification
   Adobe CMap and CIDFont Files Specification, Version 1.0, 11 June 1993, Adobe Developer Support (102 pages)
   https://www.adobe.com/content/dam/Adobe/en/devnet/font/pdfs/5014.CIDFont_Spec.pdf
  • 30. CMap example
   %!PS-Adobe-3.0 Resource-CMap
   %%Version: 1
   /CIDInit /ProcSet findresource begin
   12 dict begin
   begincmap
   /CIDSystemInfo 3 dict dup begin
     /Registry (Adobe) def
     /Ordering (Japan1) def
     /Supplement 0 def
   end def                          ← Adobe Japan 1-0
   /CMapName /83pv-RKSJ-H def
   /CMapVersion 1 def
   /CMapType 0 def
   /UIDOffset 0 def
   /XUID [1 10 25324] def
   /WMode 0 def                     ← Horizontal/Vertical
   4 begincodespacerange
     <00> <80>
     <8140> <9ffc>
     <a0> <df>
     <e040> <fbfc>
   endcodespacerange
   1 beginnotdefrange
     <00> <1f> 1
   endnotdefrange
   100 begincidrange                ← CID Range
     <9780> <97fc> 3914
     <9840> <9872> 4039
     <989f> <98fc> 4090
     <9940> <997e> 4184
     <9980> <99fc> 4247
     << 90 ranges missing >>
     <ed83> <ed83> 7934
     <ed84> <ed84> 992
     <ed85> <ed85> 7935
     <ed86> <ed86> 994
     <ed87> <ed87> 7936
   endcidrange
   17 begincidrange                 ← CID Range
     <ed88> <ed8d> 996
     <ed8e> <ed8e> 7937
     << 13 ranges missing >>
     <ee9a> <ee9a> 768
     <ee9b> <ee9c> 7631
   endcidrange
   endcmap
   CMapName currentdict /CMap defineresource pop
   end
   end
   %%EndResource
   %%EOF
  • 31. begin-end-cidrange
   100 begincidrange
     <9780> <97fc> 3914
     <9840> <9872> 4039
     <989f> <98fc> 4090
     <9940> <997e> 4184
     <9980> <99fc> 4247
     …
     <ed83> <ed83> 7934
     <ed84> <ed84> 992
     <ed85> <ed85> 7935
     <ed86> <ed86> 994
     <ed87> <ed87> 7936
   endcidrange
   • Codes in the range 0x9780 to 0x97fc are mapped to 3914 to 4038
   • Unicode code point: UCS2
   • 16-bit
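The range rule can be written as a few lines of Python. This is a sketch under simple assumptions: `CID_RANGES` holds only the first three ranges from the slide, `code_to_cid` is a hypothetical helper, and a linear scan is used for clarity rather than speed.

```python
# (first code, last code, CID assigned to the first code), copied from the slide.
CID_RANGES = [
    (0x9780, 0x97FC, 3914),
    (0x9840, 0x9872, 4039),
    (0x989F, 0x98FC, 4090),
]


def code_to_cid(code, ranges=CID_RANGES):
    """Map a character code to a CID using begincidrange-style entries.

    Within a range, consecutive codes map to consecutive CIDs.
    Returns None when no range covers the code.
    """
    for low, high, first_cid in ranges:
        if low <= code <= high:
            return first_cid + (code - low)
    return None
```

For the first range, 0x97fc is 124 positions past 0x9780, so it lands on 3914 + 124 = 4038, matching the slide.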
  • 32. Some others • beginbfchar - endbfchar • beginbfrange - endbfrange • begincidchar - endcidchar • begincidrange - endcidrange • begincodespacerange - endcodespacerange
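Of these sections, `beginbfchar` and `beginbfrange` are the ones that appear in ToUnicode CMaps. The following Python sketch reads both from a CMap given as text. `SAMPLE` is made-up data and `parse_to_unicode` is a hypothetical helper; it assumes each destination is written as UTF-16BE hex, handles only a single 16-bit start value for ranges, and skips the array form of `bfrange`.

```python
import re

# Made-up ToUnicode fragment for illustration.
SAMPLE = """
2 beginbfchar
<0003> <0020>
<0024> <0041>
endbfchar
1 beginbfrange
<0044> <0046> <0061>
endbfrange
"""

HEX = r"<([0-9A-Fa-f]+)>"


def parse_to_unicode(cmap_text):
    """Build a {character code: Unicode string} table from bfchar/bfrange sections."""
    mapping = {}
    # bfchar: one source code, one destination string per entry.
    for body in re.findall(r"beginbfchar(.*?)endbfchar", cmap_text, re.S):
        for src, dst in re.findall(HEX + r"\s*" + HEX, body):
            mapping[int(src, 16)] = bytes.fromhex(dst).decode("utf-16-be")
    # bfrange: consecutive codes map to consecutive code points.
    for body in re.findall(r"beginbfrange(.*?)endbfrange", cmap_text, re.S):
        for low, high, dst in re.findall(HEX + r"\s*" + HEX + r"\s*" + HEX, body):
            start = int(dst, 16)
            for offset in range(int(high, 16) - int(low, 16) + 1):
                mapping[int(low, 16) + offset] = chr(start + offset)
    return mapping


table = parse_to_unicode(SAMPLE)
```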
  • 33. case 2
 Encoding: Identity-H or Identity-V,
 No 'ToUnicode' entry

  • 34. Using an external CMap
   • Check CIDSystemInfo
   • Registry, Ordering, Supplement (e.g. Adobe Japan 1-6)
   • Adobe Type Tools
     https://github.com/adobe-type-tools/cmap-resources
  • 35. Adobe Japan 1-6
   %!PS-Adobe-3.0 Resource-CMap
   /CIDInit /ProcSet findresource begin
   12 dict begin
   begincmap
   /CIDSystemInfo 3 dict dup begin
     /Registry (Adobe) def
     /Ordering (Japan1) def
     /Supplement 6 def
   end def
   /CMapName /Adobe-Japan1-6 def
   /CMapVersion 1.005 def
   /CMapType 1 def
   /XUID [1 10 25614] def
   /WMode 0 def
   /CIDCount 23058 def
   1 begincodespacerange
     <0000> <5AFF>
   endcodespacerange
   91 begincidrange
     <0000> <00ff> 0
     <0100> <01ff> 256
     <0200> <02ff> 512
     <0300> <03ff> 768
     <0400> <04ff> 1024
     <0500> <05ff> 1280
     <0600> <06ff> 1536
     <0700> <07ff> 1792
     <0800> <08ff> 2048
     <0900> <09ff> 2304
     …
     <5300> <53ff> 21248
     <5400> <54ff> 21504
     <5500> <55ff> 21760
     <5600> <56ff> 22016
     <5700> <57ff> 22272
     <5800> <58ff> 22528
     <5900> <59ff> 22784
     <5a00> <5a11> 23040
   endcidrange
   endcmap
   CMapName currentdict /CMap defineresource pop
   end
   end
   https://github.com/adobe-type-tools/cmap-resources/blob/master/cmapresources_japan1-6/CMap/Adobe-Japan1-6
   Be careful: the character code may not be Unicode.
  • 36. case 3
 No 'ToUnicode' entry,
 Encoding: "WinAnsiEncoding" etc.
  • 37. Use the following encodings
   WinAnsiEncoding: NSWindowsCP1252StringEncoding
   MacRomanEncoding: …
   MacExpertEncoding: …
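The same idea, sketched in Python rather than with NSString encodings: pick a codec by the font's Encoding name and decode the raw string bytes. The table `PYTHON_CODECS` and the helper `decode_simple_font_string` are hypothetical, the codec pairing is an approximation (the PDF encodings are not guaranteed to match these codecs in every position), and a font's Differences array is ignored here.

```python
# Approximate pairing of PDF base encodings with Python codec names (assumption).
PYTHON_CODECS = {
    "WinAnsiEncoding": "cp1252",
    "MacRomanEncoding": "mac_roman",
}


def decode_simple_font_string(raw, encoding_name):
    """Decode the bytes shown by Tj for a simple font with a named encoding.

    Bytes the codec cannot map are replaced instead of raising an error.
    """
    codec = PYTHON_CODECS[encoding_name]
    return raw.decode(codec, errors="replace")
```

MacExpertEncoding is left out because it has no matching standard Python codec; it would need a hand-built table.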
  • 39. Find the 1st page
   [Diagram: a Document contains Outline, Pages and Metadata; Pages contains the individual Page objects]
  • 42. CGPDFDictionaryApplyFunction
   • CGPDFDictionaryApplyFunction() enumerates each entry in a CGPDFDictionary
   • It takes a C-style callback
   • Not possible in Swift 1.x (probably)
   • Possible in Swift 2
  • 44. DEMO
  • 45. Wrap up
   • Understanding PDF structure
   • Too many encodings: hard to find test data
   • Too complex: documentation is not always clear
   • Yeah, parsing PDF is hard, really…