Building a Public Metadata Commons to Preserve Digital Data

A Public Metadata Commons:
What is it?
Why do we need it?
How do we get it?

Kurt Bollacker
Open Data Bay Area
2012 Nov 27

Wednesday, April 3, 2013

A long time ago, there was no “open” data.
All of the media we used to create was physical.


Then most (all?) of the media became digital.


The Internet let us ship data around
for (almost) free.


And we learned how to connect it all together.

So naturally, we started to build a
Global Digital Data Commons!


At first it was a “free for all” of
academics and enthusiasts.

Almost all data on the Web was considered to be “open”.

And then folks figured out how to
make money from our contributions,

so they started to “lock down” part of the Internet that
previously would have been part of the commons.


Why is this bad?
For the data archivist, centrally controlled data
depend on very few (a single?) points of failure.

• Technical Failure

• Legal Barriers

• Incompetence


A (Potential) Digital Dark Age

"Those who cannot remember the past are
condemned to repeat it" --- George Santayana

How Do We Avoid This
Lockdown Of Central Control,
(And Hopefully A Digital Dark Age)?

We Need A Practical Perspective On the Problem.

Example Surviving Archives


Data tends to survive if
over the long term, it is:

• Visible

• Mobile

• Well Loved

These happen to also be the
properties of data in a public commons.


Historical Examples:

• Bible / Torah / Koran

• U.S. Constitution

• DNA?

Present Day Examples:

• Wikipedia

• OpenStreetMap

• Freebase

• MusicBrainz

Why?
• There are many copies. (mobile)

• Their use is mostly unrestricted. (visible)

• Everyone can access and contribute. (well loved)


But what about data that is still trapped by:

• Technical Barriers?

• Legal Restrictions?

• Limited Resources?


We build a metadata commons to hold
the “cultural context” of our trapped data.


How does a metadata commons work?

[Diagram: Trapped Datasets -> Extraction Processes -> Metadata (many separate metadata outputs)]

Even if the original contribution is lost or otherwise
made unavailable, we still have its cultural context.


The cultural context in a metadata commons
might contain:

• Indices and Tags (to find and organize)

• Comments (to analyze and interpret)

• Technical metadata (e.g. provenance, format info)

• Transforms and Interpretations (to make something useful)
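One way to picture the four kinds of context listed above is as a single record per trapped dataset. This is only a minimal sketch: the `MetadataRecord` class, its field names, and the sample values are hypothetical and do not come from any existing metadata standard or from Common Crawl.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class MetadataRecord:
    """Hypothetical cultural-context record for one trapped dataset."""
    source_id: str  # where the original lives (or lived)
    tags: List[str] = field(default_factory=list)  # indices and tags
    comments: List[str] = field(default_factory=list)  # analysis and interpretation
    technical: Dict[str, str] = field(default_factory=dict)  # provenance, format info
    transforms: List[str] = field(default_factory=list)  # pointers to derived works

    def matches(self, tag: str) -> bool:
        """Case-insensitive tag lookup, so the record stays findable."""
        return tag.lower() in (t.lower() for t in self.tags)


# Made-up sample values, for illustration only.
record = MetadataRecord(
    source_id="example.org/old-census-scan",
    tags=["census", "Scanned"],
    comments=["Page 3 appears to be missing."],
    technical={"format": "TIFF", "provenance": "hypothetical library scan"},
)
```

The record carries no copy of the original data, so it can be shared even when the dataset itself cannot.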


Where is the trapped data that we care about?
A lot of it is in The World Wide Web!

But the Web is:

• Very large (10TB - 100TB for accessible / deduped)

• Very noisy (useless pages, partial duplicates)

• Very diverse (in content, purpose, and target audience)

How do we build a Metadata Commons
from the Web?

A Practical Place To Start:

Common Crawl
(and cheap cloud computing resources)
make the Web far cheaper and easier to
access and manipulate.

• Can be downloaded wholesale

• Can be processed and analyzed in situ.

• Parts can be publicly referenced


This foundation helps us scale up to
“Web size”, but:

• What is the useful “metadata of the Web”?

• How do we extract that metadata?


Useful Web Extracts

• Are interesting to many people (to me!)

• Can be used to answer relevant questions

• Can be used to build useful products and services

Almost everyone will have an itch to scratch.


Specific Examples Of Useful Web Extracts
(From the Common Crawl code contest)

• WikiEntities

• Congressional sentiment

• Reach of Facebook on the Web


(A Few) General Shapes Of Web Metadata Extracts

• Link graphs

• N-gram counts

• File Indices by domain or keyword

• Mashups with interesting datasets:

  • Wikipedia

  • Freebase

  • Location databases (e.g. OpenStreetMap)
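The first two shapes listed above, link graphs and n-gram counts, can be sketched with the Python standard library alone. This is a rough illustration, assuming each page is available as a URL plus its HTML or plain text; the helper names `LinkCollector`, `link_edges`, and `ngram_counts` are invented here, and the whitespace tokenizer is a deliberate simplification.

```python
from collections import Counter
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects the href targets of <a> tags."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def link_edges(page_url, html):
    """Return (source_host, target_host) edges for one page."""
    parser = LinkCollector()
    parser.feed(html)
    source = urlparse(page_url).netloc
    edges = []
    for href in parser.hrefs:
        # Resolve relative links against the page URL before taking the host.
        target = urlparse(urljoin(page_url, href)).netloc
        if target:  # skips links without a host, such as mailto:
            edges.append((source, target))
    return edges


def ngram_counts(text, n=2):
    """Count word n-grams in plain text (naive whitespace tokenization)."""
    words = text.lower().split()
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))
```

Summing the edge lists over many pages gives a host-level link graph, and adding the counters together gives corpus-wide n-gram counts.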

We should all create an extract!

How do I create an extract?

An Easy Recipe:

• Ingredients:

  • A Web crawl snapshot

  • A little bit of programming skill

  • Access to cloud computing resources (e.g. EMR)

• Directions:

  • http://commoncrawl.org/mapreduce-for-the-masses/
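To get a feel for the recipe before touching a cluster, the map and reduce steps can be tried locally. The sketch below is not the code from the tutorial linked above: it assumes crawl records arrive as simple (URL, content) pairs, simulates the shuffle phase in memory, and uses invented function names and made-up sample records. It builds a small "pages per domain" index.

```python
from collections import defaultdict
from urllib.parse import urlparse


def map_page(url, content):
    """Map step: emit (domain, 1) for each crawled page."""
    domain = urlparse(url).netloc.lower()
    if domain:
        yield domain, 1


def reduce_counts(domain, values):
    """Reduce step: sum the counts emitted for one domain."""
    return domain, sum(values)


def run_local(records):
    """Tiny in-memory stand-in for the shuffle phase of MapReduce."""
    grouped = defaultdict(list)
    for url, content in records:
        for key, value in map_page(url, content):
            grouped[key].append(value)
    return dict(reduce_counts(k, v) for k, v in grouped.items())


# Made-up sample records, not real crawl data.
sample = [
    ("http://a.example/1", "<html>one</html>"),
    ("http://a.example/2", "<html>two</html>"),
    ("http://b.example/", "<html>three</html>"),
]
page_counts = run_local(sample)
```

On a real snapshot the same two functions would run as a distributed job (for example on EMR), with the framework doing the grouping that `run_local` fakes here.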


What Happens Once
I’ve Made This Awesome Extract?

• Share the extracted data

• Share the code you created / modified

  • https://github.com/commoncrawl/commoncrawl-examples/

• Broadcast it to the world!


And The World Is Saved!

Thank you.


Some Useful Links

• https://github.com/commoncrawl

• http://commoncrawl.org/mapreduce-for-the-masses/

• https://github.com/commoncrawl/commoncrawl-examples/

• https://aws.amazon.com/amis/common-crawl-quick-start

• https://commoncrawl.atlassian.net/wiki/display/CRWL/About+the+Data+Set

