Embedded Files: Risks, Challenges
and Options
Tim Allison, Ph.D.
Data Scientist/Relevance Engineer
Artificial Intelligence, Analytics and Innovative
Development Organization (1740)
ITSD
The research was carried out at the NASA (National Aeronautics
and Space Administration) Jet Propulsion Laboratory, California
Institute of Technology under a contract with the Defense
Advanced Research Projects Agency (DARPA) SafeDocs
program. © 2022 California Institute of Technology. Government
sponsorship acknowledged.
Reference herein to any specific commercial product, process,
or service by trade name, trademark, manufacturer, or
otherwise, does not constitute or imply its endorsement by the
United States Government or the Jet Propulsion Laboratory,
California Institute of Technology.
jpl.nasa.gov
About me
• Data scientist (files and search) NASA’s Jet
Propulsion Laboratory, California Institute of
Technology
• Chair/V.P. Apache Tika
• Committer Apache PDFBox, POI, Lucene/Solr,
OpenNLP
• Member Apache Software Foundation
Overview
• Intended audience – technical, with larger
implications
• This is a work in progress, please help!
9/22/22 3
Takeaways
• Think: every file may have embedded files
• Develop budgets, risk assessments and workflows
accordingly
I don’t have all the answers!
Why should you care?
https://twitter.com/WeirdMedieval/status/1532319439684874240
Why should you care?
• Sensitive data, accidental data disclosure, not just in
the attachments but in the metadata about the
attachments hosted in the parent document
• Accessibility – how do we make these searchable,
discoverable and available to users
• Every other digital preservation concern…literally
every other digital preservation concern
Diligence Spectrum
• Where are you? Where do you need to be?
• What do the digipres vendors support? What do
they need to support?
• Cost/benefit, complexity and budgets
“Hidden data” not covered in this talk
• Encrypted files (without password)
• Hidden sheets/columns/data in spreadsheets
• Track changes/edits and incremental updates
(including deletes!)
• Text out of viewing area in PDFs
• Text in notes components of PPT(x)s
“Hidden Data” not covered in this talk
• “ActualText” (in PDF)
• Alternative content
• Font too small/too big or same color as background
• Corrupt/missing Unicode mappings/fonts (in PDF)
• Steganography
• Files embedded in cavities whether within the
“parsable” parts or outside of the file-specific
“parsable” parts.
Intentionally malicious files – not covered in this talk
• Malware – Denial of Service, remote code
execution, etc.
• Crafted parser differentials
Challenging files also not covered in this talk
• Polyglot and schizophrenic files – files that may be
parsed as more than one file type (e.g. a PDF that is
also a zip file)
• Quines – zip or gz or other package format that
when unpackaged is byte for byte exactly the same
file as the original
Refs: Ange Albertini
https://blog.trailofbits.com/2019/11/01/two-new-tools-that-tame-the-treachery-of-files/
Categories of Embedded Files (1 of 2)
• Attachments – something a human or process
added to the file as supplementary information that
is intended to stand on its own/be easily exported
• Images – intended to be rendered as part of the file
• Thumbnail images
• Macros/code – executable code that is intended to
help the functionality (macros in MSOffice and/or
javascript in PDF/HTML, etc.)
Categories of Embedded Files (2 of 2)
• Metadata files – XMP, anything else?
• Standalone files that help with the rendering/user
experience of a file, e.g. font files, International
Color Consortium (ICC) profiles, subtitle streams
• Files that normally don’t exist outside of files – EMF,
WMF, MSGraph
Why grep is not sufficient
“Hello World” as stored in a PDF
“Brief Overview of the Portable Document Format (PDF) and Some Challenges for Text Extraction”
https://irsg.bcs.org/informer/wp-content/uploads/OverviewOfTextExtractionFromPDFs.pdf
Compressed object as stored in
the file
Uncompressed
! -> H
“ -> e
# -> l
$ -> o
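A minimal Python sketch (not from the slides) of why a byte-level search misses the text. It assumes a hypothetical two-line content stream and compresses it with `zlib`, the same DEFLATE scheme behind PDF's FlateDecode filter. The `glyph_map` table simply mirrors the mapping listed above and is illustrative only.

```python
import zlib

# Hypothetical uncompressed PDF content stream that draws "Hello World" twice.
content_stream = (
    b"BT /F1 12 Tf 72 720 Td (Hello World) Tj ET\n"
    b"BT /F1 12 Tf 72 700 Td (Hello World) Tj ET\n"
)

# PDFs commonly store streams Flate-compressed (zlib/DEFLATE).
stored_bytes = zlib.compress(content_stream)


def naive_grep(haystack: bytes, needle: bytes) -> bool:
    """Byte-level search, roughly what grep does on the raw file."""
    return needle in haystack


# Searching the bytes as stored fails; searching the inflated bytes succeeds.
found_raw = naive_grep(stored_bytes, b"Hello World")
found_inflated = naive_grep(zlib.decompress(stored_bytes), b"Hello World")

# Even after inflating, a font may use custom character codes, so the
# visible text only appears once the codes are mapped back to characters.
glyph_map = {"!": "H", '"': "e", "#": "l", "$": "o"}
encoded = '!"##$'
decoded = "".join(glyph_map[ch] for ch in encoded)
```

A real PDF needs a parser that handles object streams, filters and font encodings; this sketch only illustrates the two layers that defeat grep.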
Notes on Some Specific Formats
Hodge podge
• Email can have alternate content (text/html)
• Zips can have free text comments!
• Apple resource forks can have all sorts of things:
See Tyler Thorsted’s #iPres2022 talk if you haven’t!
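As one concrete case from the list above, zip comments can be read with Python's standard `zipfile` module. This is a small sketch on a synthetic in-memory archive; the file name and comment text are made up for illustration.

```python
import io
import zipfile


def read_zip_comments(data: bytes) -> dict:
    """Return the archive-level comment and any per-entry comments."""
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        return {
            "archive": zf.comment.decode("utf-8", errors="replace"),
            "entries": {
                info.filename: info.comment.decode("utf-8", errors="replace")
                for info in zf.infolist()
                if info.comment
            },
        }


# Build a tiny synthetic zip with an archive-level comment.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.comment = b"internal draft - do not release"
    zf.writestr("report.txt", "public text")

demo = read_zip_comments(buf.getvalue())
```

The comment is not part of any entry, so a workflow that only unpacks the entries would never see it.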
HTML
Even HTML can have embedded files!
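The slide does not say which mechanism it has in mind. One common way for HTML to carry a file inline is a base64 `data:` URI, which is what this hedged sketch assumes. The regular expression is deliberately simple and the sample markup is invented.

```python
import base64
import re

# Matches base64 data: URIs such as data:text/plain;base64,AAAA
DATA_URI = re.compile(r"data:([\w.+-]+/[\w.+-]+);base64,([A-Za-z0-9+/=]+)")


def extract_data_uris(html: str) -> list:
    """Return (mime_type, decoded_bytes) for each base64 data: URI found."""
    return [
        (m.group(1), base64.b64decode(m.group(2)))
        for m in DATA_URI.finditer(html)
    ]


# Synthetic page with one inline "attachment".
payload = base64.b64encode(b"secret notes").decode("ascii")
html = f'<a href="data:text/plain;base64,{payload}">notes</a>'
found = extract_data_uris(html)
```

It ignores non-base64 data URIs and other embedding routes, so treat it as a starting point rather than a complete extractor.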
MSOffice: OLE2 (doc/ppt/xls)
• Directory + file based format like zip
• Embedded files need to be parsed out of streams;
they may not be stored as separate files even within
the zip-like structure
• Old .doc files may contain a “save history” – full file
paths for where the file has been saved
MSOffice: OOXML (.docx/pptx/xlsx)
• Zip files
• Embedded files may be stored as standalone within the
zip structure
• Extra “bonus” embedded files not part of the
docx/pptx/xlsx may be stored
• Files may be wrapped in an OLE2 stream
• Full file paths for the source locations for embedded files
may be stored in xml within the zip file
• Excel may store the full “last saved” path
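Because OOXML files are zips, a first look at standalone embedded files needs only Python's `zipfile` module. The sketch below assumes the usual convention that embedded objects sit in an `embeddings/` folder (for example `word/embeddings/`). The sample archive is synthetic and its entry names are illustrative.

```python
import io
import zipfile


def list_ooxml_embeddings(data: bytes) -> list:
    """List zip entries stored under an 'embeddings/' folder."""
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        return sorted(n for n in zf.namelist() if "/embeddings/" in n)


# Synthetic stand-in for a .docx with two embedded objects.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("[Content_Types].xml", "<Types/>")
    zf.writestr("word/document.xml", "<document/>")
    zf.writestr("word/embeddings/oleObject1.bin", b"\xd0\xcf\x11\xe0")
    zf.writestr("word/embeddings/Microsoft_Excel_Worksheet.xlsx", b"PK")

embedded = list_ooxml_embeddings(buf.getvalue())
```

This will miss the “bonus” files stored elsewhere in the zip and does not unwrap files inside OLE2 streams, so comparing the full entry list against the package relationships is the safer check.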
PDF
• Incremental Updates
• Attachments
PDF Incremental Updates
https://developers.foxit.com/developer-hub/document/incremental-updates/
Simply truncate to earlier %%EOFs to get earlier file(s)!
See also tool: pdfresurrect
Fun file (starting around p 51):
https://www.usitc.gov/publications/337/pub1859.pdf
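The truncation idea can be sketched in a few lines of Python. This is not how pdfresurrect works internally. It is a naive illustration that assumes every `%%EOF` marker ends a revision, which can be wrong if the marker happens to appear inside stream data. The function name and sample bytes are hypothetical.

```python
def split_pdf_revisions(data: bytes) -> list:
    """Return one byte string per %%EOF marker, each truncated just after it."""
    marker = b"%%EOF"
    revisions = []
    pos = data.find(marker)
    while pos != -1:
        end = pos + len(marker)
        # Keep the end-of-line characters that follow the marker.
        while end < len(data) and data[end:end + 1] in (b"\r", b"\n"):
            end += 1
        revisions.append(data[:end])
        pos = data.find(marker, end)
    return revisions


# Synthetic stand-in for a PDF with one incremental update.
sample = b"%PDF-1.4\nrev1\n%%EOF\nrev2\n%%EOF\n"
revisions = split_pdf_revisions(sample)
incremental_updates = len(revisions) - 1
```

Each element of `revisions` could be written to its own file and opened as an earlier version of the document.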
PDFResurrect – Incremental Updates, 1 million sample
from Common Crawl CC-MAIN-2021-31
Updates Percentage
0 77%
1 21.75%*
2 1.02%
3 0.30%
4 0.14%
5 0.07%
6 0.04%
7 0.03%
8 0.02%
9 0.01%
* Many files created by MSWord with a single incremental update are not significantly different!
Max in 1 million sample: 3,441 incremental updates
PDF Attachments
• Overheard: “We only have to worry about attachments in Portfolio PDFs”
No!!! Nearly all* PDFs may contain attachments!
Nearly all*
• Files that conform to PDF/A-1 are not allowed to
contain attachments.
For all practical purposes, I don’t understand why an ingest
pipeline wouldn’t look for attachments whether or not the PDF
alleges that it is PDF/A-1 or whether or not the PDF actually
passes a conformance check for PDF/A-1.
More simply: assume everything has attachments until proven
otherwise.
All PDFs may contain attachments!
• The dataset in the following slides consists of 8 million
PDFs from one month of Common Crawl (CC-MAIN-2021-31)
• ~50k had at least one attachment
• Only 670 files were “Portfolio PDFs”
Apache Tika – Attached Files, Embedded Depth = 1
Mime Count
text/plain; charset=ISO-8859-1 49,288
application/pdf 12,090
text/plain; charset=windows-1252 5,045
audio/mpeg 4,840
application/x-shockwave-flash 4,740
application/xml 4,727
text/html; charset=UTF-8 3,390
image/png 2,099
image/gif 1,656
image/svg+xml 1,476
Apache Tika – Attached Files, Embedded Depth > 0
Mime Count
text/plain; charset=ISO-8859-1 49,397
image/wmf 14,419
application/pdf 12,564
application/vnd.ms-equation 12,387
image/png 7,126
text/plain; charset=windows-1252 5,127
application/xml 4,959
audio/mpeg 4,886
application/x-shockwave-flash 4,753
text/html; charset=UTF-8 3,391
Apache Tika – Maximum Number of Embedded Files
Embedded File Counts PDF Count
1 42,054
2 1,416
3 884
4 563
6 423
5 303
8 193
7 176
9 171
16 145
One file has 3,852 embedded files!
Only 670 PDFs are “Portfolio PDFs”
Apache Tika – Embedded File Depths
Embedded Depth Count
0 7,931,327
1 98,268
2 37,547
3 3,137
4 177
PDFs may include full file paths for various reasons
TIFF
ExifTool on ~5300 TIFFs
• ~2400 have a binary data field (~800 of these are
thumbnails)
• They may contain OCR’d text
• They may contain full file paths
Source: https://corpora.tika.apache.org/base/share/tiffs-out.txt.gz
Files that mostly only exist as
embedded files
WMF and EMF – Top 10 containers
• Windows Metafile (WMF)
• Enhanced Windows Metafile (EMF)
Habitat within Apache Tika’s 1 million file regression test sample
Container file format EMF/WMF Counts
ppt 57,869
doc 36,464
docx 8,246
image/emf 6,798
xls 5,007
pptx 2,970
rtf 2,163
xlsx 1,913
xls (macro) 988
pptx (variant) 381
WMF – Windows Metafile
“WMF specifies structures for defining
a graphical image. A WMF metafile
contains drawing commands, property
definitions, and graphics objects in a
series of WMF records.”
WMF 17.0 Specification
Like PDF, WMF may contain extractable text!
EMF – Enhanced Windows Metafile
• “Enhanced metafile format (EMF) is a file format that is
used to store portable representations of graphical
images. EMF metafiles contain sequential records that
are parsed and processed to render the stored image on
any output device.”
EMF 17.0 Specification
EMF, EMF+, EMFSpool
EMF 17.0 Specification
Like PDF, these file types
may contain extractable
text AND attached files!
AND “EMF metafiles define a
mechanism for the encapsulation
of arbitrary vendor-defined data.
The EMR_COMMENT record
(section 2.3.3.1) can contain
arbitrary private data that is
unknown to EMF.”
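A rough Python sketch of how the private data in EMR_COMMENT records might be pulled out. It assumes the record layout from the EMF specification: each record starts with a 4-byte little-endian type and a 4-byte size, and a comment record follows those with a 4-byte data size and then the private data. The sample bytes are synthetic and do not form a valid image.

```python
import struct

EMR_HEADER = 0x01   # first record in a metafile
EMR_EOF = 0x0E      # last record in a metafile
EMR_COMMENT = 0x46  # record carrying arbitrary private data


def iter_emf_comments(data: bytes):
    """Yield the private data bytes of each EMR_COMMENT record."""
    offset = 0
    while offset + 8 <= len(data):
        rec_type, rec_size = struct.unpack_from("<II", data, offset)
        if rec_size < 8 or offset + rec_size > len(data):
            break  # malformed or truncated record: stop rather than guess
        if rec_type == EMR_COMMENT and rec_size >= 12:
            (data_size,) = struct.unpack_from("<I", data, offset + 8)
            start = offset + 12
            yield data[start:min(start + data_size, offset + rec_size)]
        if rec_type == EMR_EOF:
            break
        offset += rec_size


# Synthetic record sequence: header, one comment, end-of-file.
blob = (
    struct.pack("<II", EMR_HEADER, 88) + b"\x00" * 80
    + struct.pack("<III", EMR_COMMENT, 16, 4) + b"PDF!"
    + struct.pack("<II", EMR_EOF, 20) + b"\x00" * 12
)
comments = list(iter_emf_comments(blob))
```

Each yielded payload would then need its own file-type detection, since the data is opaque to EMF itself.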
EMF attachments in Apache Tika’s 1 million file
regression sample
Attachments in EMFs Counts
image/wmf 86,810
application/pdf 662
XMP – eXtensible Metadata Platform
• Habitat: PDF, JPEGs, Photoshop, PNG, TIFF, video,
and more
• Embedded files
• SVG
• HTML
• Thumbnails (e.g. JPEGs)
XMP – some potential issues
• History – which software package modified the file and
when, and the nature of the modification
• User information – creator, modified by, file path
links to embedded files or external
resources/references
• OriginalDocumentId, documentId, InstanceId
• Other metadata: title, keywords, subject
• Embedded binary images!
• Custom metadata!
XFA – XML Forms Architecture
• Habitat: PDF
• XML representation of forms, questions and
answers
• NOTE: If the text extractor/parser is only processing
the PDF content, it will miss content from these
XMLs
Why file-format specific tools (alone) are not sufficient
• Maximum embedded depth in the 8 million PDF
corpus from Common Crawl is 4
• Maximum embedded depth in the ~1 million Tika
regression corpus is 7
• For full workflow, file-format specific tools must be
able to call each other arbitrarily
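The last point can be illustrated with a toy recursive walker in Python. Here zip and gzip handlers from the standard library stand in for real format-specific tools, and the function names and the `limit` guard are assumptions made for this sketch. Each handler only returns child payloads. The generic walker then re-dispatches on every child, whatever its type.

```python
import gzip
import io
import zipfile


def children_of(data: bytes) -> list:
    """Return embedded payloads using a format-specific handler, if any."""
    if data[:2] == b"\x1f\x8b":          # gzip magic bytes
        return [gzip.decompress(data)]
    if data[:4] == b"PK\x03\x04":        # zip local file header
        with zipfile.ZipFile(io.BytesIO(data)) as zf:
            return [zf.read(name) for name in zf.namelist()]
    return []                            # unknown type: treat as a leaf


def max_embedded_depth(data: bytes, depth: int = 0, limit: int = 20) -> int:
    """Walk embedded files recursively and report the deepest level seen."""
    if depth >= limit:                   # guard against quines and zip bombs
        return depth
    child_depths = [
        max_embedded_depth(child, depth + 1, limit)
        for child in children_of(data)
    ]
    return max([depth] + child_depths)


def make_zip(name: str, payload: bytes) -> bytes:
    """Helper that builds a one-entry zip in memory."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr(name, payload)
    return buf.getvalue()


# text inside a zip, inside a gzip, inside a zip
outer = make_zip("inner.zip.gz", gzip.compress(make_zip("note.txt", b"hello")))
depth = max_embedded_depth(outer)
```

A zip-only or gzip-only tool would stop after one level here. Only the shared dispatch step reaches the innermost text.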
Conclusion
• Treat every file as if it has attachments until proven
otherwise
• Develop budgets, risk assessments and workflows
accordingly
Extras
• The Portfolio PDF and the PPT with attachment are
in the iPres2022 bakeoff corpus:
https://drive.google.com/drive/folders/1ACktqBv_YooW9DLInBM5ad_I0yHJRoLU
• Step by step commandlines with example data for
this talk:
https://cwiki.apache.org/confluence/display/TIKA/Open+Preservation+Foundation+Talk+--+21+September+2022
Some tools
• Apache Tika
• ExifTool
• pdfresurrect
• Poppler: pdfinfo, pdfimages, pdfdetach
• Didier Stevens (forensics): oledump.py, pdfid.py, pdftool.py
• Philippe Lagadec (forensics): oletools
• Great blog from @bitsgalore on opensource PDF tools:
https://www.bitsgalore.org/2021/09/06/pdf-processing-and-analysis-with-open-source-tools
Commandlines
• JSON embedded file output
• java -jar tika-app-2.y.z.jar -J -t "digitally_signed_3D_Portfolio(1).pdf"
• JSON embedded file output batch
• java -jar tika-app-2.y.z.jar -J -t -i <input_dir> -o <output_dir>
• Dump first level attachments
• java -jar tika-app-2.y.z.jar -z "digitally_signed_3D_Portfolio(1).pdf"
NOTE: -z only extracts first level. If you want full recursive, please
open an issue: https://issues.apache.org/jira/projects/TIKA
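Once the JSON output exists, a short Python script can tally what was embedded. The sketch below assumes the output is a JSON list with one metadata object per file (container first) and that each object carries `Content-Type` and `X-TIKA:embedded_depth` keys. Check these key names against the output of your Tika version. The sample records are invented.

```python
import json
from collections import Counter


def summarize_embedded(tika_json: str):
    """Count embedded files by content type and report the maximum depth."""
    by_type = Counter()
    max_depth = 0
    for rec in json.loads(tika_json):
        # Assumed key name; values may be serialized as strings.
        depth = int(rec.get("X-TIKA:embedded_depth", 0))
        max_depth = max(max_depth, depth)
        if depth > 0:
            ctype = rec.get("Content-Type", "unknown")
            if isinstance(ctype, list):   # multi-valued metadata field
                ctype = ctype[0]
            by_type[ctype] += 1
    return by_type, max_depth


# Invented sample shaped like the assumed output.
sample = json.dumps([
    {"Content-Type": "application/pdf", "X-TIKA:embedded_depth": "0"},
    {"Content-Type": "image/png", "X-TIKA:embedded_depth": "1"},
    {"Content-Type": "application/pdf", "X-TIKA:embedded_depth": "1"},
    {"Content-Type": "image/png", "X-TIKA:embedded_depth": "2"},
])
by_type, max_depth = summarize_embedded(sample)
```

Run over a batch output directory, the same tally gives per-collection tables like the ones shown earlier in the talk.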
XMP
• Extract the literal XMP from file.pdf into
xmp.xmp
• exiftool -a -o xmp.xmp file.pdf
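For a quick look without ExifTool, XMP packets can often be found by scanning the raw bytes for the `<?xpacket begin ... ?>` and `<?xpacket end ... ?>` wrappers. This Python sketch assumes the packet is stored uncompressed and unencrypted, which is common but not guaranteed. The function name and sample bytes are illustrative.

```python
import re

# An XMP packet is wrapped in <?xpacket begin=...?> ... <?xpacket end=...?>
XPACKET = re.compile(
    rb"<\?xpacket begin=.*?\?>.*?<\?xpacket end=.*?\?>", re.DOTALL
)


def find_xmp_packets(data: bytes) -> list:
    """Return every raw XMP packet found in the given bytes."""
    return XPACKET.findall(data)


# Synthetic file bytes with one XMP packet in the middle.
packet = (
    b'<?xpacket begin="\xef\xbb\xbf" id="W5M0MpCehiHzreSzNTczkc9d"?>'
    b'<x:xmpmeta xmlns:x="adobe:ns:meta/"></x:xmpmeta>'
    b'<?xpacket end="w"?>'
)
packets = find_xmp_packets(b"leading bytes" + packet + b"trailing bytes")
```

A file can contain more than one packet, for example one per embedded image, so the function returns a list rather than a single value.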
Commandlines
• Dump contents of OLE2 to local directory
• java -cp tika-app-2.Y.Z.jar org.apache.poi.poifs.dev.POIFSDump testWORD_1img.doc
• List contents of OLE2 files
• java -cp ~/Intellij/tika-main/tika-app/target/tika-app-2.Y.Z.jar org.apache.poi.poifs.dev.POIFSViewer 261779.ppt
See also Didier Stevens’ oledump.py

More Related Content

What's hot

Empowerment Technology by: Maria Elisa Pal and Rodel Reyes
Empowerment Technology  by: Maria Elisa Pal and Rodel ReyesEmpowerment Technology  by: Maria Elisa Pal and Rodel Reyes
Empowerment Technology by: Maria Elisa Pal and Rodel Reyesandregoron
 
Empowerment Technologies - Module 1
Empowerment Technologies - Module 1Empowerment Technologies - Module 1
Empowerment Technologies - Module 1Jesus Rances
 
Empowerment Technology Lesson 5
Empowerment Technology Lesson 5Empowerment Technology Lesson 5
Empowerment Technology Lesson 5alicelagajino
 
Online Search and Research Skills - Empowerment Technologies
Online Search and Research Skills - Empowerment TechnologiesOnline Search and Research Skills - Empowerment Technologies
Online Search and Research Skills - Empowerment TechnologiesMark Jhon Oxillo
 
Planning-and-Conceptualizing-an-ICT-for-Social-Change (1).pdf
Planning-and-Conceptualizing-an-ICT-for-Social-Change (1).pdfPlanning-and-Conceptualizing-an-ICT-for-Social-Change (1).pdf
Planning-and-Conceptualizing-an-ICT-for-Social-Change (1).pdfEidene Joy Manuel
 
Empowerment Technologies - Principles of Visual Message and Design using Inf...
Empowerment  Technologies - Principles of Visual Message and Design using Inf...Empowerment  Technologies - Principles of Visual Message and Design using Inf...
Empowerment Technologies - Principles of Visual Message and Design using Inf...Lany Lyn Magdaraog
 
Nature and purposes of of online platforms and applications
Nature and purposes of of online platforms and applicationsNature and purposes of of online platforms and applications
Nature and purposes of of online platforms and applicationswylljie
 
Contextualized Online Research
Contextualized Online ResearchContextualized Online Research
Contextualized Online ResearchIrvin John Salegon
 
EMP TECH Q2 LESSON 1.pptx
EMP TECH Q2 LESSON 1.pptxEMP TECH Q2 LESSON 1.pptx
EMP TECH Q2 LESSON 1.pptxsherelynbalada1
 
Empowerment Technologies - Module 2
Empowerment Technologies - Module 2Empowerment Technologies - Module 2
Empowerment Technologies - Module 2Jesus Rances
 
Lesson 4- Developing ict content for specific purposes
Lesson 4- Developing ict content for specific purposesLesson 4- Developing ict content for specific purposes
Lesson 4- Developing ict content for specific purposesJuvywen
 
Collaborative ICT Development - Empowerment Technologies
Collaborative ICT Development - Empowerment TechnologiesCollaborative ICT Development - Empowerment Technologies
Collaborative ICT Development - Empowerment TechnologiesMark Jhon Oxillo
 
Empowerment Technology Lesson 3
Empowerment Technology Lesson 3Empowerment Technology Lesson 3
Empowerment Technology Lesson 3alicelagajino
 
Contextualized Online Search and Research Skills
Contextualized Online Search and Research SkillsContextualized Online Search and Research Skills
Contextualized Online Search and Research SkillsAngelito Quiambao
 
Empowerment technologies
Empowerment technologiesEmpowerment technologies
Empowerment technologiesRufa Laguit
 
Empowerment technologies
Empowerment technologiesEmpowerment technologies
Empowerment technologiesDeped
 

What's hot (20)

Empowerment Technology by: Maria Elisa Pal and Rodel Reyes
Empowerment Technology  by: Maria Elisa Pal and Rodel ReyesEmpowerment Technology  by: Maria Elisa Pal and Rodel Reyes
Empowerment Technology by: Maria Elisa Pal and Rodel Reyes
 
Empowerment Technologies - Module 1
Empowerment Technologies - Module 1Empowerment Technologies - Module 1
Empowerment Technologies - Module 1
 
Empowerment Technology Lesson 5
Empowerment Technology Lesson 5Empowerment Technology Lesson 5
Empowerment Technology Lesson 5
 
Online Search and Research Skills - Empowerment Technologies
Online Search and Research Skills - Empowerment TechnologiesOnline Search and Research Skills - Empowerment Technologies
Online Search and Research Skills - Empowerment Technologies
 
Planning-and-Conceptualizing-an-ICT-for-Social-Change (1).pdf
Planning-and-Conceptualizing-an-ICT-for-Social-Change (1).pdfPlanning-and-Conceptualizing-an-ICT-for-Social-Change (1).pdf
Planning-and-Conceptualizing-an-ICT-for-Social-Change (1).pdf
 
Empowerment Technologies - Principles of Visual Message and Design using Inf...
Empowerment  Technologies - Principles of Visual Message and Design using Inf...Empowerment  Technologies - Principles of Visual Message and Design using Inf...
Empowerment Technologies - Principles of Visual Message and Design using Inf...
 
Nature and purposes of of online platforms and applications
Nature and purposes of of online platforms and applicationsNature and purposes of of online platforms and applications
Nature and purposes of of online platforms and applications
 
Contextualized Online Research
Contextualized Online ResearchContextualized Online Research
Contextualized Online Research
 
EMP TECH Q2 LESSON 1.pptx
EMP TECH Q2 LESSON 1.pptxEMP TECH Q2 LESSON 1.pptx
EMP TECH Q2 LESSON 1.pptx
 
Empowerment Technologies - Module 2
Empowerment Technologies - Module 2Empowerment Technologies - Module 2
Empowerment Technologies - Module 2
 
Lesson 4- Developing ict content for specific purposes
Lesson 4- Developing ict content for specific purposesLesson 4- Developing ict content for specific purposes
Lesson 4- Developing ict content for specific purposes
 
Collaborative ICT Development - Empowerment Technologies
Collaborative ICT Development - Empowerment TechnologiesCollaborative ICT Development - Empowerment Technologies
Collaborative ICT Development - Empowerment Technologies
 
Empowerment Technology Lesson 3
Empowerment Technology Lesson 3Empowerment Technology Lesson 3
Empowerment Technology Lesson 3
 
Features of the web
Features of the webFeatures of the web
Features of the web
 
Contextualized Online Search and Research Skills
Contextualized Online Search and Research SkillsContextualized Online Search and Research Skills
Contextualized Online Search and Research Skills
 
Module 4 EMPOWERMENT TECHNOLOGY
Module 4 EMPOWERMENT TECHNOLOGYModule 4 EMPOWERMENT TECHNOLOGY
Module 4 EMPOWERMENT TECHNOLOGY
 
EMPOWERMENT TECHNOLOGIES - LESSON 5
EMPOWERMENT TECHNOLOGIES - LESSON 5EMPOWERMENT TECHNOLOGIES - LESSON 5
EMPOWERMENT TECHNOLOGIES - LESSON 5
 
Empowerment technologies
Empowerment technologiesEmpowerment technologies
Empowerment technologies
 
Empowerment technologies
Empowerment technologiesEmpowerment technologies
Empowerment technologies
 
Excel
ExcelExcel
Excel
 

Similar to Embedded Files in Documents: Risks, Challenges and Options

"Building a File Observatory: Making Sense of PDFs in the Wild"
"Building a File Observatory: Making Sense of PDFs in the Wild""Building a File Observatory: Making Sense of PDFs in the Wild"
"Building a File Observatory: Making Sense of PDFs in the Wild"Tim Allison
 
Evaluating Text Extraction at Scale: A case study from Apache Tika
Evaluating Text Extraction at Scale: A case study from Apache TikaEvaluating Text Extraction at Scale: A case study from Apache Tika
Evaluating Text Extraction at Scale: A case study from Apache TikaTim Allison
 
CESSI Digital Library Case Study Eng
CESSI Digital Library Case Study EngCESSI Digital Library Case Study Eng
CESSI Digital Library Case Study Engatolomei
 
GBIF BIFA mentoring, Day 5a Data management, July 2016
GBIF BIFA mentoring, Day 5a Data management, July 2016GBIF BIFA mentoring, Day 5a Data management, July 2016
GBIF BIFA mentoring, Day 5a Data management, July 2016Dag Endresen
 
e-Services to Keep Your Digital Files Current
e-Services to Keep Your Digital Files Currente-Services to Keep Your Digital Files Current
e-Services to Keep Your Digital Files Currentpbajcsy
 
Research Data (and Software) Management at Imperial: (Everything you need to ...
Research Data (and Software) Management at Imperial: (Everything you need to ...Research Data (and Software) Management at Imperial: (Everything you need to ...
Research Data (and Software) Management at Imperial: (Everything you need to ...Sarah Anna Stewart
 
Research Data Management Fundamentals for MSU Engineering Students
Research Data Management Fundamentals for MSU Engineering StudentsResearch Data Management Fundamentals for MSU Engineering Students
Research Data Management Fundamentals for MSU Engineering StudentsAaron Collie
 
Research Data Management: An Introductory Webinar from OpenAIRE and EUDAT
Research Data Management: An Introductory Webinar from OpenAIRE and EUDATResearch Data Management: An Introductory Webinar from OpenAIRE and EUDAT
Research Data Management: An Introductory Webinar from OpenAIRE and EUDATTony Ross-Hellauer
 
Research Data Management: An Introductory Webinar from OpenAIRE and EUDAT
Research Data Management: An Introductory Webinar from OpenAIRE and EUDATResearch Data Management: An Introductory Webinar from OpenAIRE and EUDAT
Research Data Management: An Introductory Webinar from OpenAIRE and EUDATOpenAIRE
 
Research Data Management Introduction: EUDAT/Open AIRE Webinar| www.eudat.eu |
Research Data Management Introduction: EUDAT/Open AIRE Webinar| www.eudat.eu | Research Data Management Introduction: EUDAT/Open AIRE Webinar| www.eudat.eu |
Research Data Management Introduction: EUDAT/Open AIRE Webinar| www.eudat.eu | EUDAT
 
FAIRDOM data management support for ERACoBioTech Proposals
FAIRDOM data management support for ERACoBioTech ProposalsFAIRDOM data management support for ERACoBioTech Proposals
FAIRDOM data management support for ERACoBioTech ProposalsFAIRDOM
 
GUODA: A Unified Platform for Large-Scale Computational Research on Open-Acce...
GUODA: A Unified Platform for Large-Scale Computational Research on Open-Acce...GUODA: A Unified Platform for Large-Scale Computational Research on Open-Acce...
GUODA: A Unified Platform for Large-Scale Computational Research on Open-Acce...Matthew J Collins
 
The state of global research data initiatives: observations from a life on th...
The state of global research data initiatives: observations from a life on th...The state of global research data initiatives: observations from a life on th...
The state of global research data initiatives: observations from a life on th...Projeto RCAAP
 
Expanded ten reasons to deploy data express final
Expanded ten reasons to deploy data express   finalExpanded ten reasons to deploy data express   final
Expanded ten reasons to deploy data express finalDataExpress
 
Expanded ten reasons to deploy data express final
Expanded ten reasons to deploy data express   finalExpanded ten reasons to deploy data express   final
Expanded ten reasons to deploy data express finalDataExpress
 
File Formats for Preservation
File Formats for PreservationFile Formats for Preservation
File Formats for PreservationStephen Gray
 
Keep Calm and Curate
Keep Calm and CurateKeep Calm and Curate
Keep Calm and CurateGarethKnight
 
MyersTessella_Dec2013
MyersTessella_Dec2013MyersTessella_Dec2013
MyersTessella_Dec2013Mark Myers
 
Data Management Planning for researchers
Data Management Planning for researchersData Management Planning for researchers
Data Management Planning for researchersSarah Jones
 

Similar to Embedded Files in Documents: Risks, Challenges and Options (20)

"Building a File Observatory: Making Sense of PDFs in the Wild"
"Building a File Observatory: Making Sense of PDFs in the Wild""Building a File Observatory: Making Sense of PDFs in the Wild"
"Building a File Observatory: Making Sense of PDFs in the Wild"
 
Evaluating Text Extraction at Scale: A case study from Apache Tika
Evaluating Text Extraction at Scale: A case study from Apache TikaEvaluating Text Extraction at Scale: A case study from Apache Tika
Evaluating Text Extraction at Scale: A case study from Apache Tika
 
CESSI Digital Library Case Study Eng
CESSI Digital Library Case Study EngCESSI Digital Library Case Study Eng
CESSI Digital Library Case Study Eng
 
GBIF BIFA mentoring, Day 5a Data management, July 2016
GBIF BIFA mentoring, Day 5a Data management, July 2016GBIF BIFA mentoring, Day 5a Data management, July 2016
GBIF BIFA mentoring, Day 5a Data management, July 2016
 
e-Services to Keep Your Digital Files Current
e-Services to Keep Your Digital Files Currente-Services to Keep Your Digital Files Current
e-Services to Keep Your Digital Files Current
 
Research Data (and Software) Management at Imperial: (Everything you need to ...
Research Data (and Software) Management at Imperial: (Everything you need to ...Research Data (and Software) Management at Imperial: (Everything you need to ...
Research Data (and Software) Management at Imperial: (Everything you need to ...
 
Research Data Management Fundamentals for MSU Engineering Students
Research Data Management Fundamentals for MSU Engineering StudentsResearch Data Management Fundamentals for MSU Engineering Students
Research Data Management Fundamentals for MSU Engineering Students
 
Research Data Management: An Introductory Webinar from OpenAIRE and EUDAT
Research Data Management: An Introductory Webinar from OpenAIRE and EUDATResearch Data Management: An Introductory Webinar from OpenAIRE and EUDAT
Research Data Management: An Introductory Webinar from OpenAIRE and EUDAT
 



Embedded Files in Documents: Risks, Challenges and Options

  • 1. Embedded Files: Risks, Challenges and Options Tim Allison, Ph.D. Data Scientist/Relevance Engineer Artificial Intelligence, Analytics and Innovative Development Organization (1740) ITSD The research was carried out at the NASA (National Aeronautics and Space Administration) Jet Propulsion Laboratory, California Institute of Technology under a contract with the Defense Advanced Research Projects Agency (DARPA) SafeDocs program. © 2022 California Institute of Technology. Government sponsorship acknowledged. Reference herein to any specific commercial product, process, or service by trade name, trademark, manufacturer, or otherwise, does not constitute or imply its endorsement by the United States Government or the Jet Propulsion Laboratory, California Institute of Technology.
  • 2. jpl.nasa.gov About me
      • Data scientist (files and search), NASA’s Jet Propulsion Laboratory, California Institute of Technology
      • Chair/V.P., Apache Tika
      • Committer, Apache PDFBox, POI, Lucene/Solr, OpenNLP
      • Member, Apache Software Foundation
  • 3. Overview
      • Intended audience: technical, with larger implications
      • This is a work in progress, please help!
      9/22/22
  • 4. Takeaways
      • Think: every file may have embedded files
      • Develop budgets, risk assessments and workflows accordingly
      I don’t have all the answers!
  • 5. Why should you care?
      https://twitter.com/WeirdMedieval/status/1532319439684874240
  • 6. Why should you care?
      • Sensitive data and accidental data disclosure, not just in the attachments but in the metadata about the attachments hosted in the parent document
      • Accessibility: how do we make these searchable, discoverable and available to users?
      • Every other digital preservation concern… literally every other digital preservation concern
  • 7. Diligence Spectrum
      • Where are you? Where do you need to be?
      • What do the digipres vendors support? What do they need to support?
      • Cost/benefit, complexity and budgets
  • 8. “Hidden data” not covered in this talk
      • Encrypted files (without password)
      • Hidden sheets/columns/data in spreadsheets
      • Track changes/edits and incremental updates (including deletes!)
      • Text out of viewing area in PDFs
      • Text in notes components of PPT(x)s
  • 9. “Hidden data” not covered in this talk
      • “ActualText” (in PDF)
      • Alternative content
      • Font too small/too big or same color as background
      • Corrupt/missing Unicode mappings/fonts (in PDF)
      • Steganography
      • Files embedded in cavities, whether within the “parsable” parts or outside of the file-specific “parsable” parts
  • 10. Intentionally malicious files – not covered in this talk
      • Malware: denial of service, remote code execution, etc.
      • Crafted parser differentials
  • 11. Challenging files also not covered in this talk
      • Polyglot and schizophrenic files: files that may be parsed as more than one file type (e.g. a PDF that is also a zip file)
      • Quines: zip or gz or other package format that, when unpackaged, is byte for byte exactly the same file as the original
      Refs: Ange Albertini; https://blog.trailofbits.com/2019/11/01/two-new-tools-that-tame-the-treachery-of-files/
  • 12. Categories of Embedded Files (1 of 2)
      • Attachments: something a human or process added to the file as supplementary information that is intended to stand on its own/be easily exported
      • Images: intended to be rendered as part of the file
      • Thumbnail images
      • Macros/code: executable code that is intended to help the functionality (macros in MSOffice and/or JavaScript in PDF/HTML, etc.)
  • 13. Categories of Embedded Files (2 of 2)
      • Metadata files: XMP, anything else?
      • Standalone files that help with the rendering/user experience of a file, e.g. font files, International Color Consortium (ICC) profiles, subtitle streams
      • Files that normally don’t exist outside of files: EMF, WMF, MSGraph
  • 14. Why grep is not sufficient
      “Hello World” as stored in a PDF
      Figure: the compressed object as stored in the file, and the same object uncompressed, where glyph codes map to characters: ! -> H, " -> e, # -> l, $ -> o
      Source: “Brief Overview of the Portable Document Format (PDF) and Some Challenges for Text Extraction” https://irsg.bcs.org/informer/wp-content/uploads/OverviewOfTextExtractionFromPDFs.pdf
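The two obstacles on the "Why grep is not sufficient" slide can be sketched in a few lines of Python. This is a toy illustration, not a PDF parser: the content stream is a simplified, hypothetical stand-in, the compression step assumes the stream is Flate (zlib) encoded, and the glyph mapping is only the four codes shown on the slide.

```python
import zlib

# Glyph-code mapping as shown on the slide: the font's own encoding, not ASCII.
GLYPH_TO_CHAR = {"!": "H", '"': "e", "#": "l", "$": "o"}


def decode_glyphs(codes: str) -> str:
    """Translate glyph codes back to readable characters."""
    return "".join(GLYPH_TO_CHAR[c] for c in codes)


# Hypothetical, simplified content stream that shows the glyph codes for "Hello".
content_stream = b'BT /F1 12 Tf 72 720 Td (!"##$) Tj ET'

# Assumption: the stream is stored Flate-compressed, as is common in PDFs.
compressed = zlib.compress(content_stream)

# A byte-level search (roughly what grep does) fails on the stored bytes...
in_compressed = b"Hello" in compressed
# ...and still fails after decompression, because the text is in glyph codes.
in_uncompressed = b"Hello" in zlib.decompress(compressed)
# Only after applying the font's mapping does the text appear.
decoded = decode_glyphs('!"##$')

print(in_compressed, in_uncompressed, decoded)
```

A real extractor also has to locate the stream, pick the right filter, and read the font's encoding or ToUnicode map, which is why a format-aware parser is needed.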
  • 15. Notes on Some Specific Formats
  • 16. Hodge podge
      • Email can have alternate content (text/html)
      • Zips can have free text comments!
      • Apple resource forks can have all sorts of things. See Tyler Thorsted’s #iPres2022 talk if you haven’t!
  • 17. HTML
      Even HTML can have embedded files!
  • 18. MSOffice: OLE2 (doc/ppt/xls)
      • Directory + file based format like zip
      • Embedded files need to be parsed out of streams; they may not be stored as separate files even within the zip-like structure
      • Old .doc files may contain a “save history”: full file paths for where the file has been saved
  • 19. MSOffice: OOXML (.docx/pptx/xlsx)
      • Zip files
      • Embedded files may be stored as standalone files within the zip structure
      • Extra “bonus” embedded files not part of the docx/pptx/xlsx may be stored
      • Files may be wrapped in an OLE2 stream
      • Full file paths for the source locations of embedded files may be stored in XML within the zip file
      • Excel may store the full “last saved” path
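Because OOXML files are zip files, a first look at what they carry needs nothing more than a zip reader. The sketch below lists parts stored under the conventional `embeddings/` and `media/` folders. Those folder names are a common producer convention rather than a guarantee (the package relationships are the authoritative source), and `list_embedded_parts` is a hypothetical helper name used only for illustration.

```python
import io
import zipfile

# Assumption: producers conventionally place embedded objects and images in
# folders such as word/embeddings/, xl/embeddings/, ppt/media/.
EMBED_DIR_MARKERS = ("/embeddings/", "/media/")


def list_embedded_parts(ooxml_bytes: bytes) -> list[str]:
    """Return the zip entries that look like embedded files or media."""
    with zipfile.ZipFile(io.BytesIO(ooxml_bytes)) as zf:
        return sorted(
            name
            for name in zf.namelist()
            if any(marker in name for marker in EMBED_DIR_MARKERS)
        )


def build_sample_docx() -> bytes:
    """Build a tiny docx-like zip in memory so the example is self-contained."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("[Content_Types].xml", "<Types/>")
        zf.writestr("word/document.xml", "<document/>")
        zf.writestr("word/embeddings/oleObject1.bin", b"\xd0\xcf\x11\xe0")
        zf.writestr("word/media/image1.png", b"\x89PNG")
    return buf.getvalue()


parts = list_embedded_parts(build_sample_docx())
print(parts)
```

An entry such as `oleObject1.bin` is itself an OLE2 container, so the embedded file still has to be parsed out of its streams; listing the zip only finds the wrapper.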
  • 20. PDF
      • Incremental Updates
      • Attachments
  • 21. PDF Incremental Updates
      https://developers.foxit.com/developer-hub/document/incremental-updates/
  • 22. Simply truncate to earlier %%EOFs to get earlier file(s)!
      See also tool: pdfresurrect
      Fun file (starting around p 51): https://www.usitc.gov/publications/337/pub1859.pdf
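The truncation trick can be sketched as follows. This is a naive illustration that cuts the byte stream after each `%%EOF` marker; it assumes every marker closes a revision, which is not always true (a linearized PDF has an extra `%%EOF` near the start, and the marker can in principle occur inside stream data). `split_revisions` is a hypothetical helper name, and the input is a fake byte string rather than a valid PDF.

```python
EOF_MARKER = b"%%EOF"


def split_revisions(pdf_bytes: bytes) -> list[bytes]:
    """Return one candidate file per %%EOF marker, each truncated at that marker."""
    revisions = []
    start = 0
    while True:
        idx = pdf_bytes.find(EOF_MARKER, start)
        if idx == -1:
            break
        end = idx + len(EOF_MARKER)
        # Everything from the start of the file up to this marker is a
        # candidate for an earlier version of the document.
        revisions.append(pdf_bytes[:end])
        start = end
    return revisions


# Fake, minimal stand-in for a PDF with one incremental update.
sample = b"%PDF-1.4\noriginal body\n%%EOF\nupdated body\n%%EOF\n"
revisions = split_revisions(sample)
print(len(revisions))
```

For real files, a purpose-built tool such as pdfresurrect is the safer route; the sketch only shows why earlier content can survive a "save".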
  • 23. PDFResurrect – Incremental Updates, 1 million sample from Common Crawl CC-MAIN-2021-31
      Updates   Percentage
      0         77%
      1         21.75%*
      2         1.02%
      3         0.30%
      4         0.14%
      5         0.07%
      6         0.04%
      7         0.03%
      8         0.02%
      9         0.01%
      * Many files created by MSWord with a single incremental update are not significantly different!
      Max in 1 million sample: 3,441 incremental updates
  • 24. PDF Attachments
      • Overheard: “We only have to worry about attachments in Portfolio PDFs”
      No!!! Nearly all* PDFs may contain attachments!
  • 25. Nearly all*
      • Files that conform to PDF/A-1 are not allowed to contain attachments.
      For all practical purposes, I don’t understand why an ingest pipeline wouldn’t look for attachments whether or not the PDF alleges that it is PDF/A-1 or whether or not the PDF actually passes a conformance check for PDF/A-1.
      More simply: assume everything has attachments until proven otherwise.
  • 26. All PDFs may contain attachments!
      • The dataset in the following slides consists of 8 million PDFs from one month of Common Crawl, CC-MAIN-2021-31
      • ~50k had at least one attachment
      • Only 670 files were “Portfolio PDFs”
  • 27. Apache Tika – Attached Files, Embedded Depth = 1
      Mime                                Count
      text/plain; charset=ISO-8859-1      49,288
      application/pdf                     12,090
      text/plain; charset=windows-1252    5,045
      audio/mpeg                          4,840
      application/x-shockwave-flash       4,740
      application/xml                     4,727
      text/html; charset=UTF-8            3,390
      image/png                           2,099
      image/gif                           1,656
      image/svg+xml                       1,476
  • 28. Apache Tika – Attached Files, Embedded Depth > 0
      Mime                                Count
      text/plain; charset=ISO-8859-1      49,397
      image/wmf                           14,419
      application/pdf                     12,564
      application/vnd.ms-equation         12,387
      image/png                           7,126
      text/plain; charset=windows-1252    5,127
      application/xml                     4,959
      audio/mpeg                          4,886
      application/x-shockwave-flash       4,753
      text/html; charset=UTF-8            3,391
  • 29. Apache Tika – Maximum Number of Embedded Files
      Embedded File Counts   PDF Count
      1                      42,054
      2                      1,416
      3                      884
      4                      563
      6                      423
      5                      303
      8                      193
      7                      176
      9                      171
      16                     145
      One file has 3,852 embedded files!
      Only 670 PDFs are “Portfolio PDFs”
  • 30. Apache Tika – Embedded File Depths
      Embedded Depth   Count
      0                7,931,327
      1                98,268
      2                37,547
      3                3,137
      4                177
  • 31. PDFs may include full file paths for various reasons
  • 32. TIFF
      ExifTool on ~5300 TIFFs
      • ~2400 have a binary data field (~800 of these are thumbnails)
      • They may contain OCR’d text
      • They may contain full file paths
      Source: https://corpora.tika.apache.org/base/share/tiffs-out.txt.gz
  • 33. Files that mostly only exist as embedded files
  • 34. WMF and EMF – Top 10 containers
      • Windows Metafile (WMF)
      • Enhanced Windows Metafile (EMF)
      Habitat within Apache Tika’s 1 million file regression test sample
      Container file format   EMF/WMF Counts
      ppt                     57,869
      doc                     36,464
      docx                    8,246
      image/emf               6,798
      xls                     5,007
      pptx                    2,970
      rtf                     2,163
      xlsx                    1,913
      xls (macro)             988
      pptx (variant)          381
  • 35. WMF – Windows Metafile
      “WMF specifies structures for defining a graphical image. A WMF metafile contains drawing commands, property definitions, and graphics objects in a series of WMF records.” (WMF 17.0 Specification)
      Like PDF, WMF may contain extractable text!
  • 36. EMF – Enhanced Windows Metafile
      • “Enhanced metafile format (EMF) is a file format that is used to store portable representations of graphical images. EMF metafiles contain sequential records that are parsed and processed to render the stored image on any output device.” (EMF 17.0 Specification)
  • 37. EMF, EMF+, EMFSpool
      Like PDF, these file types may contain extractable text AND attached files!
      AND “EMF metafiles define a mechanism for the encapsulation of arbitrary vendor-defined data. The EMR_COMMENT record (section 2.3.3.1) can contain arbitrary private data that is unknown to EMF.” (EMF 17.0 Specification)
  • 38. EMF attachments in Apache Tika’s 1 million file regression sample
      Attachments in EMFs   Counts
      image/wmf             86,810
      application/pdf       662
  • 39. XMP – eXtensible Metadata Platform
      • Habitat: PDF, JPEGs, Photoshop, PNG, TIFF, video, and more
      • Embedded files
          • SVG
          • HTML
          • Thumbnails (e.g. JPEGs)
  • 40. XMP – some potential issues
      • History: which software package modified the file, when, and the nature of the modification
      • User information: creator, modified by, file path links to embedded files or external resources/references
      • OriginalDocumentId, documentId, InstanceId
      • Other metadata: title, keywords, subject
      • Embedded binary images!
      • Custom metadata!
  • 41. XFA – XML Forms Architecture
      • Habitat: PDF
      • XML representation of forms, questions and answers
      • NOTE: If the text extractor/parser is only processing the PDF content, it will miss content from these XMLs
  • 42. Why file-format specific tools (alone) are not sufficient
      • Maximum embedded depth in the 8 million PDF corpus from Common Crawl is 4
      • Maximum embedded depth in the ~1 million file Tika regression corpus is 7
      • For a full workflow, file-format specific tools must be able to call each other arbitrarily
  • 43. Conclusion
      • Treat every file as if it has attachments until proven otherwise
      • Develop budgets, risk assessments and workflows accordingly
  • 44. Extras
  • 45. Example data
      • The Portfolio PDF and the PPT with attachment are in the iPres2022 bakeoff corpus: https://drive.google.com/drive/folders/1ACktqBv_YooW9DLInBM5ad_I0yHJRoLU
      • Step by step commandlines with example data for this talk: https://cwiki.apache.org/confluence/display/TIKA/Open+Preservation+Foundation+Talk+--+21+September+2022
  • 46. Some tools
      • Apache Tika
      • ExifTool
      • pdfresurrect
      • Poppler: pdfinfo, pdfimages, pdfdetach
      • Didier Stevens (forensics): oledump.py, pdfid.py, pdftool.py
      • Philippe Lagadec (forensics): oletools
      • Great blog from @bitsgalore on open source PDF tools: https://www.bitsgalore.org/2021/09/06/pdf-processing-and-analysis-with-open-source-tools
  • 47. Commandlines
      • JSON embedded file output
          java -jar tika-app-2.y.z.jar -J -t "digitally_signed_3D_Portfolio(1).pdf"
      • JSON embedded file output, batch
          java -jar tika-app-2.y.z.jar -J -t -i <input_dir> -o <output_dir>
      • Dump first level attachments
          java -jar tika-app-2.y.z.jar -z "digitally_signed_3D_Portfolio(1).pdf"
      NOTE: -z only extracts the first level. If you want full recursive extraction, please open an issue: https://issues.apache.org/jira/projects/TIKA
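The JSON written by the `-J` option can be summarized with a short script. The sketch below assumes the output is a JSON array with one metadata object per file (the container first, then each embedded file) and that the depth is reported under the key `X-TIKA:embedded_depth`. Both are assumptions about Tika 2.x output that should be checked against the version in use, and the sample records are made up for illustration.

```python
import json
from collections import Counter

# Assumption: Tika 2.x reports the embedded depth under this metadata key.
DEPTH_KEY = "X-TIKA:embedded_depth"


def depth_histogram(tika_json: str) -> Counter:
    """Count metadata records per embedded depth in tika-app -J output."""
    records = json.loads(tika_json)
    # Treat a record without the key as the container itself (depth 0).
    return Counter(int(record.get(DEPTH_KEY, 0)) for record in records)


# Made-up sample: a container with two attachments, one of which has a child.
sample_output = json.dumps(
    [
        {"Content-Type": "application/pdf"},
        {"Content-Type": "text/plain", DEPTH_KEY: "1"},
        {"Content-Type": "application/pdf", DEPTH_KEY: "1"},
        {"Content-Type": "image/png", DEPTH_KEY: "2"},
    ]
)

histogram = depth_histogram(sample_output)
print(dict(histogram))
```

Run over a batch output directory, the same counting gives a per-corpus table like the "Embedded File Depths" slide.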
  • 48. XMP
      • Extract the literal XMP from file.pdf into xmp.xmp
          exiftool -a -o xmp.xmp file.pdf
  • 49. Commandlines
      • Dump contents of OLE2 to local directory
          java -cp tika-app-2.Y.Z.jar org.apache.poi.poifs.dev.POIFSDump testWORD_1img.doc
      • List contents of OLE2 files
          java -cp ~/Intellij/tika-main/tika-app/target/tika-app-2.Y.Z.jar org.apache.poi.poifs.dev.POIFSViewer 261779.ppt
      See also Didier Stevens’ oledump.py