SlideShare a Scribd company logo
Building a File Observatory:
Making Sense of PDFs in the
Wild
January 19, 2022
Open Preservation Foundation
Reference herein to any specific commercial product, process,
or service by trade name, trademark, manufacturer, or
otherwise, does not constitute or imply its endorsement by the
United States Government or the Jet Propulsion Laboratory,
California Institute of Technology.
The research was carried out at the NASA (National
Aeronautics and Space Administration) Jet Propulsion
Laboratory, California Institute of Technology under a
contract with the Defense Advanced Research Projects
Agency (DARPA) SafeDocs program. © 2022 California
Institute of Technology. Government sponsorship
acknowledged.
jpl.nasa.gov
The Team
Chris Mattmann
PI; Chief Technology
and Innovation Officer
Wayne Burke
Cognizant Engineer
Phil Southam
Trouble (Fun?)
Maker
Tim Allison
Files and Search
Ryan Stonebraker
Data Scientist
Alaskan
Michael Fedell
Data Scientist
Anastasia Menshikova
Data Scientist
Michael Milano
UX/UI Researcher
Dustin Graf
Project Manager
2
© 2022 California Institute of Technology. Government sponsorship acknowledged.
jpl.nasa.gov
Debts of Gratitude
Sergey Bratus
Peter Wyatt and Duff Johnson, PDF Association
Kudu Dynamics, Trail of Bits, Galois, BAE and SRI
Common Crawl
3
© 2022 California Institute of Technology. Government sponsorship acknowledged.
jpl.nasa.gov
Goals
Goals
• Demonstrate state of the possible
• Transfer techniques if not code
• Start a discussion for next steps
Caveats
• Numbers/stats are on different sized samples and
everything is preliminary...we cannot draw conclusions
• Some of the examples include synthetic data, labeled:
4
© 2022 California Institute of Technology. Government sponsorship acknowledged.
Synthetic
data!
jpl.nasa.gov
Motivation
• Inducing grammars
• Devtesting parsers during development
• Testing/profiling/tracing existing parsers
• Literal files
• Seeds for fuzzing
• Quickly identifying root causes, patterns
5
© 2022 California Institute of Technology. Government sponsorship acknowledged.
jpl.nasa.gov
Outline
1. Getting the PDFs -- URL descriptives and fetching
2. Observatory architecture
3. Feature Extraction
4. Basic Descriptives
5. Comparing Text Extraction Tools
6. Discovery with built-in Elasticsearch features
7. Next Steps
6
© 2022 California Institute of Technology. Government sponsorship acknowledged.
jpl.nasa.gov
Fetching
Common Crawl
CC-MAIN-2021-31
jpl.nasa.gov
Common Crawl
• Monthly open source crawls of large portions of
the web: for July/August 2021 (CC-MAIN-2021-
31), 3.2 billion pages (360 TB).
• Available via Amazon Web Services Public
Datasets
• Searchable indexes available
https://commoncrawl.org
8
© 2022 California Institute of Technology. Government sponsorship acknowledged.
jpl.nasa.gov
Common Crawl -- known limitations
• Files truncated at 1MB
• Coverage/representativeness
• Web contains web documents – likely missing subtypes
designed for internal/proprietary/sensitive data (medical, 3D
engineering, legal…)
• Common Crawl is ~convenience sample of the web
Ref: http://spw20.langsec.org/papers/corpus_LangSec2020.pdf
9
© 2022 California Institute of Technology. Government sponsorship acknowledged.
jpl.nasa.gov
Overview of URLS
Random sample of ~3 million URLs with a ‘200’
response in CC-MAIN-2021-31
jpl.nasa.gov
Detected File Types
Detected Mime Percentage
text/html 84.16%
application/xhtml+xml 13.20%
text/plain 1.85%
application/pdf 0.26%
image/jpeg 0.14%
application/rss+xml 0.09%
application/atom+xml 0.07%
application/xml 0.05%
image/png 0.03%
text/calendar 0.02%
See also: https://commoncrawl.github.io/cc-crawl-statistics/plots/mimetypes
11
© 2022 California Institute of Technology. Government sponsorship acknowledged.
jpl.nasa.gov
Top Level Domains (all file types)
TLD Percentage
com 45.1%
org 5.5%
ru 4.8%
de 4.3%
net 3.5%
uk 2.5%
jp 1.8%
fr 1.7%
it 1.7%
nl 1.6%
12
© 2022 California Institute of Technology. Government sponsorship acknowledged.
jpl.nasa.gov
Countries (all file types) -- GeoIP with free version of
MaxMind Country Percentage
US 42.1%
DE 8.0%
UNKNOWN 5.3%
RU 4.7%
FR 4.5%
JP 4.3%
CA 3.9%
NL 2.7%
GB 2.6%
ES 1.4%
https://www.maxmind.com/en/home
13
© 2022 California Institute of Technology. Government sponsorship acknowledged.
jpl.nasa.gov
PDFs around the world – all 8.3 million in CC-MAIN-2021-31
14
© 2022 California Institute of Technology. Government sponsorship acknowledged.
See next
slide on geo
precision!
jpl.nasa.gov
Note on GeoIP precision!
Cheney Reservoir does
NOT host 1.6 million PDFs!
This is where MaxMind puts
IP addresses that it can only
identify as “somewhere in
the U.S.”
https://en.m.wikipedia.org/wiki/Ch
eney_Reservoir
https://www.denverpost.com/2016
/08/10/lawsuit-kansas-home-600-
million-ip-addresses/
15
© 2022 California Institute of Technology. Government sponsorship acknowledged.
jpl.nasa.gov
PDFs with Borders?
16
© 2022 California Institute of Technology. Government sponsorship acknowledged.
jpl.nasa.gov
The Math…See Extras section at the end of this deck
• Takeaway, yes Germany does have a high
percentage of PDFs, but Switzerland and India
have more as a percentage of URLs.
• Russia and China have fewer PDFs as a
percentage of URLs.
And, of course, remember the caveats about generalizability of web data and Common Crawl!
17
© 2022 California Institute of Technology. Government sponsorship acknowledged.
jpl.nasa.gov
Bytes, bytes and more bytes
jpl.nasa.gov
Some stats for CC-MAIN-2021-31
• 8.3 million PDFs
• 6.4 million retrieved from Common Crawl (1.6 TB)
• 1.9 million refetched (9.8 TB)
• ~100k refetch failures
• 7.9 million unique hashes (10TB stored)
19
© 2022 California Institute of Technology. Government sponsorship acknowledged.
jpl.nasa.gov
PDF Sizes
Size Counts
<1kb 7,275
<10kb 109,092
<100kb 1,671,726
<1mb 4,604,023
<10mb 1,692,893
<100mb 210,735
<=1gb 3,686
>1gb 2
20
© 2022 California Institute of Technology. Government sponsorship acknowledged.
jpl.nasa.gov
Observatory at various scales
jpl.nasa.gov
Various Scales
• Full Cloud -- AWS components (Athena, Kinesis, AWS
Batch…), S3 for storage -> AWS Hosted Elasticsearch
• Desktop -- Postgresql, local files -> local Elasticsearch
• Hybrid -- hosted/local Postgresql, EC2 instance for
processing, S3 for storage -> AWS Hosted Elasticsearch
22
© 2022 California Institute of Technology. Government sponsorship acknowledged.
Common
Crawl
GovDocs
Provided
Files
Monitored for new files,
which are then collected
and added to the corpus
Files are uploaded to an S3
bucket
DATA
SOURCES
PREPROCESSING
STORAGE
FEATURE
EXTRACTION /
ANNOTATION
FULL
FEATURE
DATABASE API
Amazon S3
File Bucket
Amazon
Athena
Amazon
ElasticSearch
Service
Data
Collection
Sparkler WARC files
containing the necessary
files
Performer
Parsers
USER
INTERFACE
Website
SEARCH
INDEX
USERS
… and their
computers
Other Tools
23
© 2022 California Institute of Technology. Government sponsorship acknowledged.
jpl.nasa.gov
File Observatory in a Box
Public github repository
Standard Metadata Output Per Tool
https://github.com/tballison/file-observatory
24
© 2022 California Institute of Technology. Government sponsorship acknowledged.
jpl.nasa.gov
Standard table for each tool (pdfinfo example)
25
© 2022 California Institute of Technology. Government sponsorship acknowledged.
jpl.nasa.gov
Feature Extraction
jpl.nasa.gov
From simple regexes, to more interesting items
PDFInfo
pi_producer: iText 2.1.7 by 1T3XT
pi_creation_date: 2021-07-29T05:33:30.000Z
27
© 2022 California Institute of Technology. Government sponsorship acknowledged.
jpl.nasa.gov
Structural bits...QPDF’s JSON format
q_keys=[/ProcSet, /Info, /Kids,...
q_parent_and_keys=[/Kids->ARRAY, /ProcSet->ARRAY,..
q_type_keys=[/Pages->/Count, /Pages->/ProcSet,..
q_key_values=[/Producer->wkhtmltopdf,...
q_filters=/ASCII85Decode, /FlateDecode->/CCITTFaxDecode
q_max_filter_count=2
28
© 2022 California Institute of Technology. Government sponsorship acknowledged.
jpl.nasa.gov
Basic descriptives
jpl.nasa.gov
PDFs by “created date”, where it exists
30
© 2022 California Institute of Technology. Government sponsorship acknowledged.
jpl.nasa.gov
PDF versions by year
31
© 2022 California Institute of Technology. Government sponsorship acknowledged.
jpl.nasa.gov
Automatic Language Detection on text from Apache Tika
32
© 2022 California Institute of Technology. Government sponsorship acknowledged.
jpl.nasa.gov
Comparing Text Extraction Tools
jpl.nasa.gov
Measuring Text
• Estimating quality without ground truth
• Unmapped Unicode characters
• Out-of-vocabulary (OOV %)
• Comparing output from two or more tools --
overlap, dice coefficient
34
© 2022 California Institute of Technology. Government sponsorship acknowledged.
See also: Popat, Ashok. “A panlingual anomalous text detector.”
DocEng ‘09: Proceedings of the 9th ACM symposium on Document
engineering. 2009.
jpl.nasa.gov
Unmapped Unicode characters
File: 0000bd3e1cd25eaaacc56e30bfdf800a60018495a12e2d7d423f73f8da0d7416
35
© 2022 California Institute of Technology. Government sponsorship acknowledged.
jpl.nasa.gov
File: 0000bd3e1cd25eaaacc56e30bfdf800a60018495a12e2d7d423f73f8da0d7416
Tika Mutool pdftotext
Detected
Language
Romanized
Urdu
Romanized
Urdu
Romanized
Bengali
OOV % 99.9% 99.9% 99.9%
36
© 2022 California Institute of Technology. Government sponsorship acknowledged.
jpl.nasa.gov
100% unmapped Unicode characters
File: 0000bd3e1cd25eaaacc56e30bfdf800a60018495a12e2d7d423f73f8da0d7416
pdftotext
mutool
37
© 2022 California Institute of Technology. Government sponsorship acknowledged.
Tika vs Mutool extract sort by overlap ascending
38
© 2022 California Institute of Technology. Government sponsorship acknowledged.
jpl.nasa.gov
File: 3e652c7dfdd96db91ff85e67931b9ea35ff74bc5610e7f5458509f69f9bad471
Tika Mutool extract pdftotext
Detected Language German Romanized Urdu German
Alphabetic Tokens 9068 43536 8952
Common Tokens 6539 10 6511
Out-of-Vocabulary (OOV) 28% 99.9% 27%
Tool A Tool B Overlap
Tika Mutool extract 9%
Tika pdftotext 95%
Mutool extract pdftotext 9%
Profile
Compare
39
© 2022 California Institute of Technology. Government sponsorship acknowledged.
File: 3e652c7dfdd96db91ff85e67931b9ea35ff74bc5610e7f5458509f69f9bad471
mutool Tika
40
© 2022 California Institute of Technology. Government sponsorship acknowledged.
jpl.nasa.gov
Discovery with built-in Elasticsearch
features
jpl.nasa.gov
FORCEDENTRY
https://citizenlab.ca/2021/09/forcedentry-nso-group-imessage-zero-click-exploit-captured-in-the-wild/
Citizen Lab’s post includes this image
Three filters?! One repeat?
Found in the wild: /ASCII85Decode->/FlateDecode->/CCITTFaxDecode
42
© 2022 California Institute of Technology. Government sponsorship acknowledged.
jpl.nasa.gov
Elasticsearch terms aggregation (facets/group by)
country:DE country:JP
43
© 2022 California Institute of Technology. Government sponsorship acknowledged.
jpl.nasa.gov
country:JP
44
© 2022 California Institute of Technology. Government sponsorship acknowledged.
country:DE
Elasticsearch significant terms aggregation
jpl.nasa.gov
Using Elasticsearch’s Prefix Completion
45
© 2022 California Institute of Technology. Government sponsorship acknowledged.
jpl.nasa.gov
Using Elasticsearch’s Autosuggest
46
© 2022 California Institute of Technology. Government sponsorship acknowledged.
Synthetic
data!
jpl.nasa.gov
Spelling Variants
Scripting autosuggest
47
© 2022 California Institute of Technology. Government sponsorship acknowledged.
Synthetic
data!
jpl.nasa.gov
Syntax, erm, flexibility: /Root->/Dests with direct object
48
© 2022 California Institute of Technology. Government sponsorship acknowledged.
jpl.nasa.gov
Syntax, erm, flexibility: /Root->/Dests with direct object
49
© 2022 California Institute of Technology. Government sponsorship acknowledged.
v0.12.0 released 2014-02-06
jpl.nasa.gov
Why we need the Arlington DOM!
Syntax, erm, flexibility: /Root->/Dests with direct object
wkhtmltopdf
dvips + GPL
Ghostscript
50
© 2022 California Institute of Technology. Government sponsorship acknowledged.
jpl.nasa.gov
User Interface: Scout
jpl.nasa.gov
1) Initial UI was created focusing on the
needs of Observatory earlier in its
development
2) Use of a main search bar was
intended for easy search entry
1) As the Observatory started to mature
and more features were added, it
became less user friendly.
2) After a heuristic evaluation it became
clear that we needed rethink the
information hierarchy
52
© 2022 California Institute of Technology. Government sponsorship acknowledged.
1) Design around modularity and
scalability
2) Monospace font to make for easy
comparison of text strings
1) Search and filter with a smaller screen
presence
2) More data visualizations to better
understand data
3) Bigger data viz with greater search tools
53
© 2022 California Institute of Technology. Government sponsorship acknowledged.
jpl.nasa.gov
Analytics
Document Classification Task
● Classifying the creator tool based
on keys
○ 2021 dataset: q keys, key-value
pairs, parent-key pairs
● Top 10 most occurring creator tools
● Splitting the dataset 80:20
● Label = creator tool, features = keys
● Using CountVectorizer to vectorize
the data
● Testing out different classifiers:
○ MultinomialNB
○ GaussianNB
○ Bernoulli NB
○ Logistic Regression
○ SVM
○ Decision Trees
55
© 2022 California Institute of Technology. Government sponsorship acknowledged.
Model performance
Feature importance example (LR)
56
© 2022 California Institute of Technology. Government sponsorship acknowledged.
Example
Crash Analytics Task
Performing analytics on the crash
data from Eval 3
57
© 2022 California Institute of Technology. Government sponsorship acknowledged.
jpl.nasa.gov
Next Steps
jpl.nasa.gov
Next Steps
Refactor tool wrappers/integration
Identify final tools/features
Releasing “observatory in a box”
59
© 2022 California Institute of Technology. Government sponsorship acknowledged.
jpl.nasa.gov
References
Tika Eval: https://cwiki.apache.org/confluence/display/TIKA/TikaEval
Evaluating Content Extraction:
https://www.pdfa.org/presentation/evaluating-text-extraction-at-scale/
Contact info:
timothy.b.allison@jpl.nasa.gov, tallison@apache.org @_tallison
60
© 2022 California Institute of Technology. Government sponsorship acknowledged.
jpl.nasa.gov
jpl.nasa.gov
Extras
jpl.nasa.gov
PDFs by Country -- which countries stand out?
Country Total URLs
Switzerland 17,335
India 12,334
Russia 150,865
Germany 258,775
China 42,928
NULL 170,749
Canada 124,448
Colombia 909
Italy 46,187
Luxembourg 1814
63
© 2022 California Institute of Technology. Government sponsorship acknowledged.
jpl.nasa.gov
PDFs by Country
Country Total URLs PDFs Observed
Switzerland 17,335 159
India 12,334 100
Russia 150,865 167
Germany 258,775 934
China 42,928 17
NULL 170,749 272
Canada 124,448 191
Colombia 909 13
Italy 46,187 191
Luxembourg 1814 18
64
© 2022 California Institute of Technology. Government sponsorship acknowledged.
jpl.nasa.gov
PDFs by Country
Country Total URLs PDFs Observed PDFs Expected
Switzerland 17,335 159 45
India 12,334 100 32
Russia 150,865 167 392
Germany 258,775 934 673
China 42,928 17 112
NULL 170,749 272 444
Canada 124,448 191 324
Colombia 909 13 2
Italy 46,187 191 120
Luxembourg 1814 18 5
Expected =
0.26% * Total URLs
65
© 2022 California Institute of Technology. Government sponsorship acknowledged.
jpl.nasa.gov
PDFs by Country
Country Total URLs PDFs Observed PDFs Expected chi value Higher/Lower
Switzerland 17,335 159 45 288 Higher
India 12,334 100 32 144 Higher
Russia 150,865 167 392 129 Lower
Germany 258,775 934 673 101 Higher
China 42,928 17 112 80 Lower
NULL 170,749 272 444 67 Lower
Canada 124,448 191 324 54 Lower
Colombia 909 13 2 48 Higher
Italy 46,187 191 120 42 Higher
Luxembourg 1814 18 5 37 Higher
66
© 2022 California Institute of Technology. Government sponsorship acknowledged.
jpl.nasa.gov
PDFs by Top Level Domain (TLD)
TLD Total URLs PDFs Observed PDFs Expected chi value Higher/Lower
gov 10,722 239 28 1599 Higher
org 176,671 923 459 468 Higher
com 1,452,337 2517 3776 420 Lower
ru 154,378 125 401 190 Lower
de 138,341 587 360 144 Higher
ch 19,239 129 50 125 Higher
in 14,586 106 38 122 Higher
us 6,486 56 17 91 Higher
jp 59,521 271 155 87 Higher
edu 35,249 170 92 67 Higher
67
© 2022 California Institute of Technology. Government sponsorship acknowledged.
jpl.nasa.gov
Generalizations per language -- sum of common tokens
Language Tika mutool pdftotext
deu 469,218 427,576 463,817
eng 4,591,875 4,499,390 4,533,080
fra 292,285 286,659 288,634
ita 124,081 193,515 190,452
jpn 268,508 264,195 254,348
kor 128,923 128,954 128,656
nld 156,353 144,415 155,143
por 232,705 232,823 232,965
slv 103,489 102,626 95,068
spa 377,255 371,049 378,145
68
© 2022 California Institute of Technology. Government sponsorship acknowledged.
jpl.nasa.gov
Select Right-to-Left Languages, Sum of Common Tokens
Tika mutool pdftotext
ara 551 118 1112
fas 3222 11 1449
heb 4969 287 5076
69
© 2022 California Institute of Technology. Government sponsorship acknowledged.
Example only. NOT SUFFICIENT DATA TO GENERALIZE!
jpl.nasa.gov
Side note: Bugtrackers -- November 2020 Crawl
• 35 issue trackers
• 32 tools (3 tools have 2 issue trackers -- legacy
and current)
• 1.2 million files (311 GB)
• PDF-centric subset ~33k files
Data
https://corpora.tika.apache.org/base/docs/bug_trackers/
https://corpora.tika.apache.org/base/packaged/pdfs/pdfs_202011/
And Peter Wyatt’s posts:
https://www.pdfa.org/a-new-stressful-pdf-corpus/
https://www.pdfa.org/stressful-pdf-corpus-grows/
70
© 2022 California Institute of Technology. Government sponsorship acknowledged.
jpl.nasa.gov
Stored vs. Refetched
Total: 10 TB (distinct files)
Stored in CC: 1.6 TB (not distinct)
Refetched: 9.8 TB (not distinct)
71
© 2022 California Institute of Technology. Government sponsorship acknowledged.
jpl.nasa.gov
Process
• Identify PDF files via Common Crawl index files (240 GB
compressed)
• Fetch the non-truncated files from Common Crawl
• Refetch the truncated files from URLs in Common Crawl
• Store files by their hashes in AWS S3, e.g. CC-MAIN-2021-
31/00/00/00000836a41510292765a86e57a80f1e182360ca5c6f
f0841482b31c8f25ab84
72
© 2022 California Institute of Technology. Government sponsorship acknowledged.
jpl.nasa.gov
Duplicates
• 8.3 million PDFs
• 7.9 distinct hashes
• 370k exact duplicates
• Few handfuls of less than optimal url design, e.g. 744 copies of
the same file
73
© 2022 California Institute of Technology. Government sponsorship acknowledged.
jpl.nasa.gov
Search and Analytics at Scale:
Observatory UI
jpl.nasa.gov
Search UI
● Interactive way to search
through the file-observatory,
visualize content, and download
files of interest
● Built with a large-volume of data
in mind
● Currently live and working
75
© 2022 California Institute of Technology. Government sponsorship acknowledged.
jpl.nasa.gov
Search UI
Search/Results Page
● Dynamic filtering based on 18
different fields and search text
● Allows for downloading of
individual or multiple PDFs
● Provides table of similar tokens
within a 2 character distance
● Provides table of suggested
completions and the number of
documents in which they occur
● Dynamic filter field breakdown
visualization
76
© 2022 California Institute of Technology. Government sponsorship acknowledged.
jpl.nasa.gov
Search UI
● Provides in-depth view of all
fields stored in elasticsearch for
a document by clicking on name
File View
77
© 2022 California Institute of Technology. Government sponsorship acknowledged.
jpl.nasa.gov
Search UI
● Provides a real-time view of all
collections in the file-observatory
● Pulls the latest ElasticSearch file-
observatory mapping schema
with all available fields
Collections Page
78
© 2022 California Institute of Technology. Government sponsorship acknowledged.

More Related Content

Similar to "Building a File Observatory: Making Sense of PDFs in the Wild"

What's Next with Government Big Data
What's Next with Government Big Data What's Next with Government Big Data
What's Next with Government Big Data
GovLoop
 
STI 2022 - Generating large-scale network analyses of scientific landscapes i...
STI 2022 - Generating large-scale network analyses of scientific landscapes i...STI 2022 - Generating large-scale network analyses of scientific landscapes i...
STI 2022 - Generating large-scale network analyses of scientific landscapes i...
Michele Pasin
 
Grid Projects In The US July 2008
Grid Projects In The US July 2008Grid Projects In The US July 2008
Grid Projects In The US July 2008
Ian Foster
 
Bigdatacooltools
BigdatacooltoolsBigdatacooltools
Bigdatacooltools
suresh sood
 
Conrad PDES Spring 2008
Conrad PDES Spring 2008Conrad PDES Spring 2008
Conrad PDES Spring 2008
Mark Conrad
 
Innovative Research Alliances
Innovative Research AlliancesInnovative Research Alliances
Innovative Research Alliances
Larry Smarr
 
Haystack Live tallison_202010_v2
Haystack Live tallison_202010_v2Haystack Live tallison_202010_v2
Haystack Live tallison_202010_v2
Tim Allison
 
Scott Edmunds: Preparing a data paper for GigaByte
Scott Edmunds: Preparing a data paper for GigaByteScott Edmunds: Preparing a data paper for GigaByte
Scott Edmunds: Preparing a data paper for GigaByte
GigaScience, BGI Hong Kong
 
Evaluating Text Extraction at Scale: A case study from Apache Tika
Evaluating Text Extraction at Scale: A case study from Apache TikaEvaluating Text Extraction at Scale: A case study from Apache Tika
Evaluating Text Extraction at Scale: A case study from Apache Tika
Tim Allison
 
How to expand the Galaxy from genes to Earth in six simple steps (and live sm...
How to expand the Galaxy from genes to Earth in six simple steps (and live sm...How to expand the Galaxy from genes to Earth in six simple steps (and live sm...
How to expand the Galaxy from genes to Earth in six simple steps (and live sm...
Raffaele Montella
 
So Long Computer Overlords
So Long Computer OverlordsSo Long Computer Overlords
So Long Computer Overlords
Ian Foster
 
Guest Lecture: Introduction to Big Data at Indian Institute of Technology
Guest Lecture: Introduction to Big Data at Indian Institute of TechnologyGuest Lecture: Introduction to Big Data at Indian Institute of Technology
Guest Lecture: Introduction to Big Data at Indian Institute of Technology
Nishant Gandhi
 
Red Hat, Green Energy Corp &amp; Magpie - Open Source Smart Grid Plataform - ...
Red Hat, Green Energy Corp &amp; Magpie - Open Source Smart Grid Plataform - ...Red Hat, Green Energy Corp &amp; Magpie - Open Source Smart Grid Plataform - ...
Red Hat, Green Energy Corp &amp; Magpie - Open Source Smart Grid Plataform - ...
impodgirl
 
1105 Media - 2014 Core Market Capabilities Presentation
1105 Media - 2014 Core Market Capabilities Presentation1105 Media - 2014 Core Market Capabilities Presentation
1105 Media - 2014 Core Market Capabilities Presentation
Christina Langer
 
Inspector Gadget 2023 - CalCPA.pdf
Inspector Gadget 2023 - CalCPA.pdfInspector Gadget 2023 - CalCPA.pdf
Inspector Gadget 2023 - CalCPA.pdf
David Cieslak, CPA.CITP, CGMA, GSEC
 
Bigdata-Intro.pptx
Bigdata-Intro.pptxBigdata-Intro.pptx
Bigdata-Intro.pptx
smitasatpathy2
 
Webware Webinar
Webware WebinarWebware Webinar
Webware Webinar
Mike Qaissaunee
 
Hadoop World 2011: The Hadoop Award for Government Excellence - Bob Gourley -...
Hadoop World 2011: The Hadoop Award for Government Excellence - Bob Gourley -...Hadoop World 2011: The Hadoop Award for Government Excellence - Bob Gourley -...
Hadoop World 2011: The Hadoop Award for Government Excellence - Bob Gourley -...
Cloudera, Inc.
 
Security Challenges and the Pacific Research Platform
Security Challenges and the Pacific Research PlatformSecurity Challenges and the Pacific Research Platform
Security Challenges and the Pacific Research Platform
Larry Smarr
 
Rpi talk foster september 2011
Rpi talk foster september 2011Rpi talk foster september 2011
Rpi talk foster september 2011
Ian Foster
 

Similar to "Building a File Observatory: Making Sense of PDFs in the Wild" (20)

What's Next with Government Big Data
What's Next with Government Big Data What's Next with Government Big Data
What's Next with Government Big Data
 
STI 2022 - Generating large-scale network analyses of scientific landscapes i...
STI 2022 - Generating large-scale network analyses of scientific landscapes i...STI 2022 - Generating large-scale network analyses of scientific landscapes i...
STI 2022 - Generating large-scale network analyses of scientific landscapes i...
 
Grid Projects In The US July 2008
Grid Projects In The US July 2008Grid Projects In The US July 2008
Grid Projects In The US July 2008
 
Bigdatacooltools
BigdatacooltoolsBigdatacooltools
Bigdatacooltools
 
Conrad PDES Spring 2008
Conrad PDES Spring 2008Conrad PDES Spring 2008
Conrad PDES Spring 2008
 
Innovative Research Alliances
Innovative Research AlliancesInnovative Research Alliances
Innovative Research Alliances
 
Haystack Live tallison_202010_v2
Haystack Live tallison_202010_v2Haystack Live tallison_202010_v2
Haystack Live tallison_202010_v2
 
Scott Edmunds: Preparing a data paper for GigaByte
Scott Edmunds: Preparing a data paper for GigaByteScott Edmunds: Preparing a data paper for GigaByte
Scott Edmunds: Preparing a data paper for GigaByte
 
Evaluating Text Extraction at Scale: A case study from Apache Tika
Evaluating Text Extraction at Scale: A case study from Apache TikaEvaluating Text Extraction at Scale: A case study from Apache Tika
Evaluating Text Extraction at Scale: A case study from Apache Tika
 
How to expand the Galaxy from genes to Earth in six simple steps (and live sm...
How to expand the Galaxy from genes to Earth in six simple steps (and live sm...How to expand the Galaxy from genes to Earth in six simple steps (and live sm...
How to expand the Galaxy from genes to Earth in six simple steps (and live sm...
 
So Long Computer Overlords
So Long Computer OverlordsSo Long Computer Overlords
So Long Computer Overlords
 
Guest Lecture: Introduction to Big Data at Indian Institute of Technology
Guest Lecture: Introduction to Big Data at Indian Institute of TechnologyGuest Lecture: Introduction to Big Data at Indian Institute of Technology
Guest Lecture: Introduction to Big Data at Indian Institute of Technology
 
Red Hat, Green Energy Corp &amp; Magpie - Open Source Smart Grid Plataform - ...
Red Hat, Green Energy Corp &amp; Magpie - Open Source Smart Grid Plataform - ...Red Hat, Green Energy Corp &amp; Magpie - Open Source Smart Grid Plataform - ...
Red Hat, Green Energy Corp &amp; Magpie - Open Source Smart Grid Plataform - ...
 
1105 Media - 2014 Core Market Capabilities Presentation
1105 Media - 2014 Core Market Capabilities Presentation1105 Media - 2014 Core Market Capabilities Presentation
1105 Media - 2014 Core Market Capabilities Presentation
 
Inspector Gadget 2023 - CalCPA.pdf
Inspector Gadget 2023 - CalCPA.pdfInspector Gadget 2023 - CalCPA.pdf
Inspector Gadget 2023 - CalCPA.pdf
 
Bigdata-Intro.pptx
Bigdata-Intro.pptxBigdata-Intro.pptx
Bigdata-Intro.pptx
 
Webware Webinar
Webware WebinarWebware Webinar
Webware Webinar
 
Hadoop World 2011: The Hadoop Award for Government Excellence - Bob Gourley -...
Hadoop World 2011: The Hadoop Award for Government Excellence - Bob Gourley -...Hadoop World 2011: The Hadoop Award for Government Excellence - Bob Gourley -...
Hadoop World 2011: The Hadoop Award for Government Excellence - Bob Gourley -...
 
Security Challenges and the Pacific Research Platform
Security Challenges and the Pacific Research PlatformSecurity Challenges and the Pacific Research Platform
Security Challenges and the Pacific Research Platform
 
Rpi talk foster september 2011
Rpi talk foster september 2011Rpi talk foster september 2011
Rpi talk foster september 2011
 

Recently uploaded

dbms calicut university B. sc Cs 4th sem.pdf
dbms  calicut university B. sc Cs 4th sem.pdfdbms  calicut university B. sc Cs 4th sem.pdf
dbms calicut university B. sc Cs 4th sem.pdf
Shinana2
 
Choosing The Best AWS Service For Your Website + API.pptx
Choosing The Best AWS Service For Your Website + API.pptxChoosing The Best AWS Service For Your Website + API.pptx
Choosing The Best AWS Service For Your Website + API.pptx
Brandon Minnick, MBA
 
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdf
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdfHow to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdf
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdf
Chart Kalyan
 
Letter and Document Automation for Bonterra Impact Management (fka Social Sol...
Letter and Document Automation for Bonterra Impact Management (fka Social Sol...Letter and Document Automation for Bonterra Impact Management (fka Social Sol...
Letter and Document Automation for Bonterra Impact Management (fka Social Sol...
Jeffrey Haguewood
 
TrustArc Webinar - 2024 Global Privacy Survey
TrustArc Webinar - 2024 Global Privacy SurveyTrustArc Webinar - 2024 Global Privacy Survey
TrustArc Webinar - 2024 Global Privacy Survey
TrustArc
 
Your One-Stop Shop for Python Success: Top 10 US Python Development Providers
Your One-Stop Shop for Python Success: Top 10 US Python Development ProvidersYour One-Stop Shop for Python Success: Top 10 US Python Development Providers
Your One-Stop Shop for Python Success: Top 10 US Python Development Providers
akankshawande
 
Fueling AI with Great Data with Airbyte Webinar
Fueling AI with Great Data with Airbyte WebinarFueling AI with Great Data with Airbyte Webinar
Fueling AI with Great Data with Airbyte Webinar
Zilliz
 
Operating System Used by Users in day-to-day life.pptx
Operating System Used by Users in day-to-day life.pptxOperating System Used by Users in day-to-day life.pptx
Operating System Used by Users in day-to-day life.pptx
Pravash Chandra Das
 
Artificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopmentArtificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopment
Octavian Nadolu
 
Nordic Marketo Engage User Group_June 13_ 2024.pptx
Nordic Marketo Engage User Group_June 13_ 2024.pptxNordic Marketo Engage User Group_June 13_ 2024.pptx
Nordic Marketo Engage User Group_June 13_ 2024.pptx
MichaelKnudsen27
 
Monitoring and Managing Anomaly Detection on OpenShift.pdf
Monitoring and Managing Anomaly Detection on OpenShift.pdfMonitoring and Managing Anomaly Detection on OpenShift.pdf
Monitoring and Managing Anomaly Detection on OpenShift.pdf
Tosin Akinosho
 
Deep Dive: Getting Funded with Jason Jason Lemkin Founder & CEO @ SaaStr
Deep Dive: Getting Funded with Jason Jason Lemkin Founder & CEO @ SaaStrDeep Dive: Getting Funded with Jason Jason Lemkin Founder & CEO @ SaaStr
Deep Dive: Getting Funded with Jason Jason Lemkin Founder & CEO @ SaaStr
saastr
 
Ocean lotus Threat actors project by John Sitima 2024 (1).pptx
Ocean lotus Threat actors project by John Sitima 2024 (1).pptxOcean lotus Threat actors project by John Sitima 2024 (1).pptx
Ocean lotus Threat actors project by John Sitima 2024 (1).pptx
SitimaJohn
 
Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...
Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...
Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...
saastr
 
GenAI Pilot Implementation in the organizations
GenAI Pilot Implementation in the organizationsGenAI Pilot Implementation in the organizations
GenAI Pilot Implementation in the organizations
kumardaparthi1024
 
Serial Arm Control in Real Time Presentation
Serial Arm Control in Real Time PresentationSerial Arm Control in Real Time Presentation
Serial Arm Control in Real Time Presentation
tolgahangng
 
Generating privacy-protected synthetic data using Secludy and Milvus
Generating privacy-protected synthetic data using Secludy and MilvusGenerating privacy-protected synthetic data using Secludy and Milvus
Generating privacy-protected synthetic data using Secludy and Milvus
Zilliz
 
Finale of the Year: Apply for Next One!
Finale of the Year: Apply for Next One!Finale of the Year: Apply for Next One!
Finale of the Year: Apply for Next One!
GDSC PJATK
 
June Patch Tuesday
June Patch TuesdayJune Patch Tuesday
June Patch Tuesday
Ivanti
 
Programming Foundation Models with DSPy - Meetup Slides
Programming Foundation Models with DSPy - Meetup SlidesProgramming Foundation Models with DSPy - Meetup Slides
Programming Foundation Models with DSPy - Meetup Slides
Zilliz
 

Recently uploaded (20)

dbms calicut university B. sc Cs 4th sem.pdf
dbms  calicut university B. sc Cs 4th sem.pdfdbms  calicut university B. sc Cs 4th sem.pdf
dbms calicut university B. sc Cs 4th sem.pdf
 
Choosing The Best AWS Service For Your Website + API.pptx
Choosing The Best AWS Service For Your Website + API.pptxChoosing The Best AWS Service For Your Website + API.pptx
Choosing The Best AWS Service For Your Website + API.pptx
 
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdf
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdfHow to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdf
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdf
 
Letter and Document Automation for Bonterra Impact Management (fka Social Sol...
Letter and Document Automation for Bonterra Impact Management (fka Social Sol...Letter and Document Automation for Bonterra Impact Management (fka Social Sol...
Letter and Document Automation for Bonterra Impact Management (fka Social Sol...
 
TrustArc Webinar - 2024 Global Privacy Survey
TrustArc Webinar - 2024 Global Privacy SurveyTrustArc Webinar - 2024 Global Privacy Survey
TrustArc Webinar - 2024 Global Privacy Survey
 
Your One-Stop Shop for Python Success: Top 10 US Python Development Providers
Your One-Stop Shop for Python Success: Top 10 US Python Development ProvidersYour One-Stop Shop for Python Success: Top 10 US Python Development Providers
Your One-Stop Shop for Python Success: Top 10 US Python Development Providers
 
Fueling AI with Great Data with Airbyte Webinar
Fueling AI with Great Data with Airbyte WebinarFueling AI with Great Data with Airbyte Webinar
Fueling AI with Great Data with Airbyte Webinar
 
Operating System Used by Users in day-to-day life.pptx
Operating System Used by Users in day-to-day life.pptxOperating System Used by Users in day-to-day life.pptx
Operating System Used by Users in day-to-day life.pptx
 
Artificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopmentArtificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopment
 
Nordic Marketo Engage User Group_June 13_ 2024.pptx
Nordic Marketo Engage User Group_June 13_ 2024.pptxNordic Marketo Engage User Group_June 13_ 2024.pptx
Nordic Marketo Engage User Group_June 13_ 2024.pptx
 
Monitoring and Managing Anomaly Detection on OpenShift.pdf
Monitoring and Managing Anomaly Detection on OpenShift.pdfMonitoring and Managing Anomaly Detection on OpenShift.pdf
Monitoring and Managing Anomaly Detection on OpenShift.pdf
 
Deep Dive: Getting Funded with Jason Jason Lemkin Founder & CEO @ SaaStr
Deep Dive: Getting Funded with Jason Jason Lemkin Founder & CEO @ SaaStrDeep Dive: Getting Funded with Jason Jason Lemkin Founder & CEO @ SaaStr
Deep Dive: Getting Funded with Jason Jason Lemkin Founder & CEO @ SaaStr
 
Ocean lotus Threat actors project by John Sitima 2024 (1).pptx
Ocean lotus Threat actors project by John Sitima 2024 (1).pptxOcean lotus Threat actors project by John Sitima 2024 (1).pptx
Ocean lotus Threat actors project by John Sitima 2024 (1).pptx
 
Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...
Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...
Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...
 
GenAI Pilot Implementation in the organizations
GenAI Pilot Implementation in the organizationsGenAI Pilot Implementation in the organizations
GenAI Pilot Implementation in the organizations
 
Serial Arm Control in Real Time Presentation
Serial Arm Control in Real Time PresentationSerial Arm Control in Real Time Presentation
Serial Arm Control in Real Time Presentation
 
Generating privacy-protected synthetic data using Secludy and Milvus
Generating privacy-protected synthetic data using Secludy and MilvusGenerating privacy-protected synthetic data using Secludy and Milvus
Generating privacy-protected synthetic data using Secludy and Milvus
 
Finale of the Year: Apply for Next One!
Finale of the Year: Apply for Next One!Finale of the Year: Apply for Next One!
Finale of the Year: Apply for Next One!
 
June Patch Tuesday
June Patch TuesdayJune Patch Tuesday
June Patch Tuesday
 
Programming Foundation Models with DSPy - Meetup Slides
Programming Foundation Models with DSPy - Meetup SlidesProgramming Foundation Models with DSPy - Meetup Slides
Programming Foundation Models with DSPy - Meetup Slides
 

"Building a File Observatory: Making Sense of PDFs in the Wild"

  • 1. Building a File Observatory: Making Sense of PDFs in the Wild January 19, 2022 Open Preservation Foundation Reference herein to any specific commercial product, process, or service by trade name, trademark, manufacturer, or otherwise, does not constitute or imply its endorsement by the United States Government or the Jet Propulsion Laboratory, California Institute of Technology. The research was carried out at the NASA (National Aeronautics and Space Administration) Jet Propulsion Laboratory, California Institute of Technology under a contract with the Defense Advanced Research Projects Agency (DARPA) SafeDocs program. © 2022 California Institute of Technology. Government sponsorship acknowledged.
  • 2. jpl.nasa.gov The Team Chris Mattmann PI; Chief Technology and Innovation Officer Wayne Burke Cognizant Engineer Phil Southam Trouble (Fun?) Maker Tim Allison Files and Search Ryan Stonebraker Data Scientist Alaskan Michael Fedell Data Scientist Anastasia Menshikova Data Scientist Michael Milano UX/UI Researcher Dustin Graf Project Manager 2 © 2022 California Institute of Technology. Government sponsorship acknowledged.
  • 3. jpl.nasa.gov Debts of Gratitude Sergey Bratus Peter Wyatt and Duff Johnson, PDF Association Kudu Dynamics, Trail of Bits, Galois, BAE and SRI Common Crawl 3 © 2022 California Institute of Technology. Government sponsorship acknowledged.
  • 4. jpl.nasa.gov Goals Goals • Demonstrate state of the possible • Transfer techniques if not code • Start a discussion for next steps Caveats • Numbers/stats are on different sized samples and everything is preliminary...we cannot draw conclusions • Some of the examples include synthetic data, labeled: 4 © 2022 California Institute of Technology. Government sponsorship acknowledged. Synthetic data!
  • 5. jpl.nasa.gov Motivation • Inducing grammars • Devtesting parsers during development • Testing/profiling/tracing existing parsers • Literal files • Seeds for fuzzing • Quickly identifying root causes, patterns 5 © 2022 California Institute of Technology. Government sponsorship acknowledged.
  • 6. jpl.nasa.gov Outline 1. Getting the PDFs -- URL descriptives and fetching 2. Observatory architecture 3. Feature Extraction 4. Basic Descriptives 5. Comparing Text Extraction Tools 6. Discovery with built-in Elasticsearch features 7. Next Steps 6 © 2022 California Institute of Technology. Government sponsorship acknowledged.
  • 8. jpl.nasa.gov Common Crawl • Monthly open source crawls of large portions of the web: for July/August 2021 (CC-MAIN-2021- 31), 3.2 billion pages (360 TB). • Available via Amazon Web Services Public Datasets • Searchable indexes available https://commoncrawl.org 8 © 2022 California Institute of Technology. Government sponsorship acknowledged.
  • 9. jpl.nasa.gov Common Crawl -- known limitations • Files truncated at 1MB • Coverage/representativeness • Web contains web documents – likely missing subtypes designed for internal/proprietary/sensitive data (medical, 3D engineering, legal…) • Common Crawl is ~convenience sample of the web Ref: http://spw20.langsec.org/papers/corpus_LangSec2020.pdf 9 © 2022 California Institute of Technology. Government sponsorship acknowledged.
  • 10. jpl.nasa.gov Overview of URLS Random sample of ~3 million URLs with a ‘200’ response in CC-MAIN-2021-31
  • 11. jpl.nasa.gov Detected File Types Detected Mime Percentage text/html 84.16% application/xhtml+xml 13.20% text/plain 1.85% application/pdf 0.26% image/jpeg 0.14% application/rss+xml 0.09% application/atom+xml 0.07% application/xml 0.05% image/png 0.03% text/calendar 0.02% See also: https://commoncrawl.github.io/cc-crawl-statistics/plots/mimetypes 11 © 2022 California Institute of Technology. Government sponsorship acknowledged.
  • 12. jpl.nasa.gov Top Level Domains (all file types) TLD Percentage com 45.1% org 5.5% ru 4.8% de 4.3% net 3.5% uk 2.5% jp 1.8% fr 1.7% it 1.7% nl 1.6% 12 © 2022 California Institute of Technology. Government sponsorship acknowledged.
  • 13. jpl.nasa.gov Countries (all file types) -- GeoIP with free version of MaxMind Country Percentage US 42.1% DE 8.0% UNKNOWN 5.3% RU 4.7% FR 4.5% JP 4.3% CA 3.9% NL 2.7% GB 2.6% ES 1.4% https://www.maxmind.com/en/home 13 © 2022 California Institute of Technology. Government sponsorship acknowledged.
  • 14. jpl.nasa.gov PDFs around the world – all 8.3 million in CC-MAIN-2021-31 14 © 2022 California Institute of Technology. Government sponsorship acknowledged. See next slide on geo precision!
  • 15. jpl.nasa.gov Note on GeoIP precision! Cheney Reservoir does NOT host 1.6 million PDFs! This is where MaxMind puts IP addresses that it can only identify as “somewhere in the U.S.” https://en.m.wikipedia.org/wiki/Ch eney_Reservoir https://www.denverpost.com/2016 /08/10/lawsuit-kansas-home-600- million-ip-addresses/ 15 © 2022 California Institute of Technology. Government sponsorship acknowledged.
  • 16. jpl.nasa.gov PDFs with Borders? 16 © 2022 California Institute of Technology. Government sponsorship acknowledged.
  • 17. jpl.nasa.gov The Math…See Extras section at the end of this deck • Takeaway, yes Germany does have a high percentage of PDFs, but Switzerland and India have more as a percentage of URLs. • Russia and China have fewer PDFs as a percentage of URLs. And, of course, remember the caveats about generalizability of web data and Common Crawl! 17 © 2022 California Institute of Technology. Government sponsorship acknowledged.
  • 19. jpl.nasa.gov Some stats for CC-MAIN-2021-31 • 8.3 million PDFs • 6.4 million retrieved from Common Crawl (1.6 TB) • 1.9 million refetched (9.8 TB) • ~100k refetch failures • 7.9 million unique hashes (10TB stored) 19 © 2022 California Institute of Technology. Government sponsorship acknowledged.
  • 20. jpl.nasa.gov PDF Sizes Size Counts <1kb 7,275 <10kb 109,092 <100kb 1,671,726 <1mb 4,604,023 <10mb 1,692,893 <100mb 210,735 <=1gb 3,686 >1gb 2 20 © 2022 California Institute of Technology. Government sponsorship acknowledged.
  • 22. jpl.nasa.gov Various Scales • Full Cloud -- AWS components (Athena, Kinesis, AWS Batch…), S3 for storage -> AWS Hosted Elasticsearch • Desktop -- Postgresql, local files -> local Elasticsearch • Hybrid -- hosted/local Postgresql, EC2 instance for processing, S3 for storage -> AWS Hosted Elasticsearch 22 © 2022 California Institute of Technology. Government sponsorship acknowledged.
  • 23. Common Crawl GovDocs Provided Files Monitored for new files, which are then collected and added to the corpus Files are uploaded to an S3 bucket DATA SOURCES PREPROCESSING STORAGE FEATURE EXTRACTION / ANNOTATION FULL FEATURE DATABASE API Amazon S3 File Bucket Amazon Athena Amazon ElasticSearch Service Data Collection Sparkler WARC files containing the necessary files Performer Parsers USER INTERFACE Website SEARCH INDEX USERS … and their computers Other Tools 23 © 2022 California Institute of Technology. Government sponsorship acknowledged.
  • 24. jpl.nasa.gov File Observatory in a Box Public github repository Standard Metadata Output Per Tool https://github.com/tballison/file-observatory 24 © 2022 California Institute of Technology. Government sponsorship acknowledged.
  • 25. jpl.nasa.gov Standard table for each tool (pdfinfo example) 25 © 2022 California Institute of Technology. Government sponsorship acknowledged.
  • 27. jpl.nasa.gov From simple regexes, to more interesting items PDFInfo pi_producer: iText 2.1.7 by 1T3XT pi_creation_date: 2021-07-29T05:33:30.000Z 27 © 2022 California Institute of Technology. Government sponsorship acknowledged.
  • 28. jpl.nasa.gov Structural bits...QPDF’s JSON format q_keys=[/ProcSet, /Info, /Kids,... q_parent_and_keys=[/Kids->ARRAY, /ProcSet->ARRAY,.. q_type_keys=[/Pages->/Count, /Pages->/ProcSet,.. q_key_values=[/Producer->wkhtmltopdf,... q_filters=/ASCII85Decode, /FlateDecode->/CCITTFaxDecode q_max_filter_count=2 28 © 2022 California Institute of Technology. Government sponsorship acknowledged.
  • 30. jpl.nasa.gov PDFs by “created date”, where it exists 30 © 2022 California Institute of Technology. Government sponsorship acknowledged.
  • 31. jpl.nasa.gov PDF versions by year 31 © 2022 California Institute of Technology. Government sponsorship acknowledged.
  • 32. jpl.nasa.gov Automatic Language Detection on text from Apache Tika 32 © 2022 California Institute of Technology. Government sponsorship acknowledged.
  • 34. jpl.nasa.gov Measuring Text • Estimating quality without ground truth • Unmapped Unicode characters • Out-of-vocabulary (OOV %) • Comparing output from two or more tools -- overlap, dice coefficient 34 © 2022 California Institute of Technology. Government sponsorship acknowledged. See also: Popat, Ashok. “A panlingual anomalous text detector.” DocEng ‘09: Proceedings of the 9th ACM symposium on Document engineering. 2009.
  • 35. jpl.nasa.gov Unmapped Unicode characters File: 0000bd3e1cd25eaaacc56e30bfdf800a60018495a12e2d7d423f73f8da0d7416 35 © 2022 California Institute of Technology. Government sponsorship acknowledged.
  • 36. jpl.nasa.gov File: 0000bd3e1cd25eaaacc56e30bfdf800a60018495a12e2d7d423f73f8da0d7416 Tika Mutool pdftotext Detected Language Romanized Urdu Romanized Urdu Romanized Bengali OOV % 99.9% 99.9% 99.9% 36 © 2022 California Institute of Technology. Government sponsorship acknowledged.
  • 37. jpl.nasa.gov 100% unmapped Unicode characters File: 0000bd3e1cd25eaaacc56e30bfdf800a60018495a12e2d7d423f73f8da0d7416 pdftotext mutool 37 © 2022 California Institute of Technology. Government sponsorship acknowledged.
  • 38. Tika vs Mutool extract sort by overlap ascending 38 © 2022 California Institute of Technology. Government sponsorship acknowledged.
  • 39. jpl.nasa.gov File: 3e652c7dfdd96db91ff85e67931b9ea35ff74bc5610e7f5458509f69f9bad471 Tika Mutool extract pdftotext Detected Language German Romanized Urdu German Alphabetic Tokens 9068 43536 8952 Common Tokens 6539 10 6511 Out-of-Vocabulary (OOV) 28% 99.9% 27% Tool A Tool B Overlap Tika Mutool extract 9% Tika pdftotext 95% Mutool extract pdftotext 9% Profile Compare 39 © 2022 California Institute of Technology. Government sponsorship acknowledged.
  • 40. File: 3e652c7dfdd96db91ff85e67931b9ea35ff74bc5610e7f5458509f69f9bad471 mutool Tika 40 © 2022 California Institute of Technology. Government sponsorship acknowledged.
  • 41. jpl.nasa.gov Discovery with built-in Elasticsearch features
  • 42. jpl.nasa.gov FORCEDENTRY https://citizenlab.ca/2021/09/forcedentry-nso-group-imessage-zero-click-exploit-captured-in-the-wild/ Citizen Lab’s post includes this image Three filters?! One repeat? Found in the wild: /ASCII85Decode->/FlateDecode->/CCITTFaxDecode 42 © 2022 California Institute of Technology. Government sponsorship acknowledged.
  • 43. jpl.nasa.gov Elasticsearch terms aggregation (facets/group by) country:DE country:JP 43 © 2022 California Institute of Technology. Government sponsorship acknowledged.
  • 44. jpl.nasa.gov country:JP 44 © 2022 California Institute of Technology. Government sponsorship acknowledged. country:DE Elasticsearch significant terms aggregation
  • 45. jpl.nasa.gov Using Elasticsearch’s Prefix Completion 45 © 2022 California Institute of Technology. Government sponsorship acknowledged.
  • 46. jpl.nasa.gov Using Elasticsearch’s Autosuggest 46 © 2022 California Institute of Technology. Government sponsorship acknowledged. Synthetic data!
  • 47. jpl.nasa.gov Spelling Variants Scripting autosuggest 47 © 2022 California Institute of Technology. Government sponsorship acknowledged. Synthetic data!
  • 48. jpl.nasa.gov Syntax, erm, flexibility: /Root->/Dests with direct object 48 © 2022 California Institute of Technology. Government sponsorship acknowledged.
  • 49. jpl.nasa.gov Syntax, erm, flexibility: /Root->/Dests with direct object 49 © 2022 California Institute of Technology. Government sponsorship acknowledged. v0.12.0 released 2014-02-06
  • 50. jpl.nasa.gov Why we need the Arlington DOM! Syntax, erm, flexibility: /Root->/Dests with direct object wkhtmltopdf dvips + GPL Ghostscript 50 © 2022 California Institute of Technology. Government sponsorship acknowledged.
  • 52. jpl.nasa.gov 1) Initial UI was created focusing on the needs of Observatory earlier in its development 2) Use of a main search bar was intended for easy search entry 1) As the Observatory started to mature and more features were added, it became less user friendly. 2) After a heuristic evaluation it became clear that we needed rethink the information hierarchy 52 © 2022 California Institute of Technology. Government sponsorship acknowledged.
  • 53. 1) Design around modularity and scalability 2) Monospace font to make for easy comparison of text strings 1) Search and filter with a smaller screen presence 2) More data visualizations to better understand data 3) Bigger data viz with greater search tools 53 © 2022 California Institute of Technology. Government sponsorship acknowledged.
  • 55. Document Classification Task ● Classifying the creator tool based on keys ○ 2021 dataset: q keys, key-value pairs, parent-key pairs ● Top 10 most occurring creator tools ● Splitting the dataset 80:20 ● Label = creator tool, features = keys ● Using CountVectorizer to vectorize the data ● Testing out different classifiers: ○ MultinomialNB ○ GaussianNB ○ Bernoulli NB ○ Logistic Regression ○ SVM ○ Decision Trees 55 © 2022 California Institute of Technology. Government sponsorship acknowledged.
  • 56. Model performance Feature importance example (LR) 56 © 2022 California Institute of Technology. Government sponsorship acknowledged.
  • 57. Example Crash Analytics Task Performing analytics on the crash data from Eval 3 57 © 2022 California Institute of Technology. Government sponsorship acknowledged.
  • 59. jpl.nasa.gov Next Steps Refactor tool wrappers/integration Identify final tools/features Releasing “observatory in a box” 59 © 2022 California Institute of Technology. Government sponsorship acknowledged.
  • 60. jpl.nasa.gov References Tika Eval: https://cwiki.apache.org/confluence/display/TIKA/TikaEval Evaluating Content Extraction: https://www.pdfa.org/presentation/evaluating-text-extraction-at-scale/ Contact info: timothy.b.allison@jpl.nasa.gov, tallison@apache.org @_tallison 60 © 2022 California Institute of Technology. Government sponsorship acknowledged.
  • 63. jpl.nasa.gov PDFs by Country -- which countries stand out? Country Total URLs Switzerland 17,335 India 12,334 Russia 150,865 Germany 258,775 China 42,928 NULL 170,749 Canada 124,448 Colombia 909 Italy 46,187 Luxembourg 1814 63 © 2022 California Institute of Technology. Government sponsorship acknowledged.
  • 64. jpl.nasa.gov PDFs by Country Country Total URLs PDFs Observed Switzerland 17,335 159 India 12,334 100 Russia 150,865 167 Germany 258,775 934 China 42,928 17 NULL 170,749 272 Canada 124,448 191 Colombia 909 13 Italy 46,187 191 Luxembourg 1814 18 64 © 2022 California Institute of Technology. Government sponsorship acknowledged.
  • 65. jpl.nasa.gov PDFs by Country Country Total URLs PDFs Observed PDFs Expected Switzerland 17,335 159 45 India 12,334 100 32 Russia 150,865 167 392 Germany 258,775 934 673 China 42,928 17 112 NULL 170,749 272 444 Canada 124,448 191 324 Colombia 909 13 2 Italy 46,187 191 120 Luxembourg 1814 18 5 Expected = 0.26% * Total URLs 65 © 2022 California Institute of Technology. Government sponsorship acknowledged.
  • 66. jpl.nasa.gov PDFs by Country Country Total URLs PDFs Observed PDFs Expected chi value Higher/Lower Switzerland 17,335 159 45 288 Higher India 12,334 100 32 144 Higher Russia 150,865 167 392 129 Lower Germany 258,775 934 673 101 Higher China 42,928 17 112 80 Lower NULL 170,749 272 444 67 Lower Canada 124,448 191 324 54 Lower Colombia 909 13 2 48 Higher Italy 46,187 191 120 42 Higher Luxembourg 1814 18 5 37 Higher 66 © 2022 California Institute of Technology. Government sponsorship acknowledged.
  • 67. jpl.nasa.gov PDFs by Top Level Domain (TLD) TLD Total URLs PDFs Observed PDFs Expected chi value Higher/Lower gov 10,722 239 28 1599 Higher org 176,671 923 459 468 Higher com 1,452,337 2517 3776 420 Lower ru 154,378 125 401 190 Lower de 138,341 587 360 144 Higher ch 19,239 129 50 125 Higher in 14,586 106 38 122 Higher us 6,486 56 17 91 Higher jp 59,521 271 155 87 Higher edu 35,249 170 92 67 Higher 67 © 2022 California Institute of Technology. Government sponsorship acknowledged.
  • 68. jpl.nasa.gov Generalizations per language -- sum of common tokens Language Tika mutool pdftotext deu 469,218 427,576 463,817 eng 4,591,875 4,499,390 4,533,080 fra 292,285 286,659 288,634 ita 124,081 193,515 190,452 jpn 268,508 264,195 254,348 kor 128,923 128,954 128,656 nld 156,353 144,415 155,143 por 232,705 232,823 232,965 slv 103,489 102,626 95,068 spa 377,255 371,049 378,145 68 © 2022 California Institute of Technology. Government sponsorship acknowledged.
  • 69. jpl.nasa.gov Select Right-to-Left Languages, Sum of Common Tokens Tika mutool pdftotext ara 551 118 1112 fas 3222 11 1449 heb 4969 287 5076 69 © 2022 California Institute of Technology. Government sponsorship acknowledged. Example only. NOT SUFFICIENT DATA TO GENERALIZE!
  • 70. jpl.nasa.gov Side note: Bugtrackers -- November 2020 Crawl • 35 issue trackers • 32 tools (3 tools have 2 issue trackers -- legacy and current) • 1.2 million files (311 GB) • PDF-centric subset ~33k files Data https://corpora.tika.apache.org/base/docs/bug_trackers/ https://corpora.tika.apache.org/base/packaged/pdfs/pdfs_202011/ And Peter Wyatt’s posts: https://www.pdfa.org/a-new-stressful-pdf-corpus/ https://www.pdfa.org/stressful-pdf-corpus-grows/ 70 © 2022 California Institute of Technology. Government sponsorship acknowledged.
  • 71. jpl.nasa.gov Stored vs. Refetched Total: 10 TB (distinct files) Stored in CC: 1.6 TB (not distinct) Refetched: 9.8 TB (not distinct) 71 © 2022 California Institute of Technology. Government sponsorship acknowledged.
  • 72. jpl.nasa.gov Process • Identify PDF files via Common Crawl index files (240 GB compressed) • Fetch the non-truncated files from Common Crawl • Refetch the truncated files from URLs in Common Crawl • Store files by their hashes in AWS S3, e.g. CC-MAIN-2021- 31/00/00/00000836a41510292765a86e57a80f1e182360ca5c6f f0841482b31c8f25ab84 72 © 2022 California Institute of Technology. Government sponsorship acknowledged.
  • 73. jpl.nasa.gov Duplicates • 8.3 million PDFs • 7.9 distinct hashes • 370k exact duplicates • Few handfuls of less than optimal url design, e.g. 744 copies of the same file 73 © 2022 California Institute of Technology. Government sponsorship acknowledged.
  • 74. jpl.nasa.gov Search and Analytics at Scale: Observatory UI
  • 75. jpl.nasa.gov Search UI ● Interactive way to search through the file-observatory, visualize content, and download files of interest ● Built with a large-volume of data in mind ● Currently live and working 75 © 2022 California Institute of Technology. Government sponsorship acknowledged.
  • 76. jpl.nasa.gov Search UI Search/Results Page ● Dynamic filtering based on 18 different fields and search text ● Allows for downloading of individual or multiple PDFs ● Provides table of similar tokens within a 2 character distance ● Provides table of suggested completions and the number of documents in which they occur ● Dynamic filter field breakdown visualization 76 © 2022 California Institute of Technology. Government sponsorship acknowledged.
  • 77. jpl.nasa.gov Search UI ● Provides in-depth view of all fields stored in elasticsearch for a document by clicking on name File View 77 © 2022 California Institute of Technology. Government sponsorship acknowledged.
  • 78. jpl.nasa.gov Search UI ● Provides a real-time view of all collections in the file-observatory ● Pulls the latest ElasticSearch file- observatory mapping schema with all available fields Collections Page 78 © 2022 California Institute of Technology. Government sponsorship acknowledged.