A Public Metadata Commons:
                                   What is it?
                                Why do we need it?
                                How do we get it?




                                    Kurt Bollacker
                                  Open Data Bay Area
                                    2012 Nov 27


Wednesday, April 3, 2013                                1
A long time ago, there was no “open” data.
                           All of the media we used to create was physical.




Then most (all?) of the media became digital.




The Internet let us ship data around
                                for (almost) free.




And we learned how to connect it all together.




                            So naturally, we started to build a
                           Global Digital Data Commons!

At first it was a “free for all” of
                             academics and enthusiasts.




Almost all data on the Web was considered to be “open”.

And then folks figured out how to
                           make money from our contributions,




             so they started to “lock down” part of the Internet that
                previously would have been part of the commons.


Why is this bad?
                           For the data archivist, centrally controlled data
                           can be lost through far fewer (a single?) points of failure:

                                      •   Technical Failure

                                      •   Legal Barriers

                                      •   Incompetence




A (Potential) Digital Dark Age




                         "Those who cannot remember the past are
                       condemned to repeat it" --- George Santayana

How Do We Avoid This
                           Lockdown Of Central Control,
                             (And Hopefully A Digital Dark Age)?




We Need A Practical Perspective On the Problem.

Example Surviving Archives




Data tends to survive if,
                              over the long term, it is:


                                      •   Visible

                                      •   Mobile

                                      •   Well Loved




                              These happen to also be the
                           properties of data in a public commons.


Historical Examples:

                           •   Bible / Torah / Koran

                           •   U.S. Constitution

                           •   DNA?

Present Day Examples:

                           •   Wikipedia

                           •   OpenStreetMap

                           •   Freebase

                           •   MusicBrainz

Why?

                           •   There are many copies. (mobile)

                           •   Their use is mostly unrestricted. (visible)

                           •   Everyone can access and contribute. (well loved)

But what about data that is still trapped by:



                           •   Technical Barriers?

                           •   Legal Restrictions?

                           •   Limited Resources?




We build a metadata commons to hold
       the “cultural context” of our trapped data.




How does a metadata commons work?



[Diagram: Trapped Datasets -> Extraction Processes -> multiple sets of Metadata]

                   Even if the original contribution is lost or otherwise
                    made unavailable, we still have its cultural context.

The cultural context in a metadata commons
                          might contain:

          •       Indices and Tags (to find and organize)

          •       Comments (to analyze and interpret)

          •       Technical metadata (e.g. provenance, format info)

          •       Transforms and Interpretations (to make something useful)
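
As a rough illustration of the list above, here is a minimal sketch of what one entry in a metadata commons might look like. The `MetadataRecord` class, its field names, and the `extract_metadata` helper are hypothetical and not taken from the talk; the example URI and content are made up. The point is only that the record stays useful even if the original bytes later become unavailable.

```python
import hashlib
import json
from dataclasses import dataclass, field, asdict


@dataclass
class MetadataRecord:
    """Hypothetical 'cultural context' record for one piece of trapped data."""
    source_uri: str          # where the original lived (provenance)
    sha256: str              # fingerprint of the original bytes
    size_bytes: int          # technical metadata
    media_type: str          # format info
    tags: list = field(default_factory=list)      # to find and organize
    comments: list = field(default_factory=list)  # to analyze and interpret


def extract_metadata(source_uri, content, media_type, tags=()):
    """Build a record from raw bytes without storing the bytes themselves."""
    return MetadataRecord(
        source_uri=source_uri,
        sha256=hashlib.sha256(content).hexdigest(),
        size_bytes=len(content),
        media_type=media_type,
        tags=sorted(set(tags)),  # de-duplicate and order the tags
    )


# Made-up example input.
record = extract_metadata(
    "http://example.org/report.html",
    b"<html>...</html>",
    "text/html",
    tags=["report", "example"],
)
record.comments.append("Example comment added by a contributor.")
print(json.dumps(asdict(record), indent=2))
```

Transforms and interpretations are left out of this sketch; they could be stored as further fields or as separate records that point back to `source_uri`.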




Where is the trapped data that we care about?
         A lot of it is in The World Wide Web!

                                          But the Web is:

                   •       Very large (10TB - 100TB for accessible / deduped)

                   •       Very noisy (useless pages, partial duplicates)

                   •       Very diverse (in content, purpose, and target audience)



How do we build a Metadata Commons from the Web?

A Practical Place To Start:

                                      Common Crawl
                            (and cheap cloud computing resources)
                           make the Web far cheaper and easier to
                                   access and manipulate.

                           •   Can be downloaded wholesale

                           •   Can be processed and analyzed in situ.

                           •   Parts can be publicly referenced




This foundation helps us scale up to
                                “Web size”, but:


                           •   What is the useful “metadata of the Web”?

                           •   How do we extract that metadata?




Useful Web Extracts:


                       •   Are interesting to many people (to me!).

                       •   Can be used to answer relevant questions.

                       •   Can be used to build useful products and services.




           Almost everyone will have an itch to scratch.


Specific Examples Of Useful Web Extracts
                               (From the Common Crawl code contest)



                           •    WikiEntities

                           •    Congressional sentiment

                           •    Reach of Facebook on the Web




(A Few) General Shapes Of Web Metadata Extracts

                           •   Link graphs

                           •   N-gram counts

                           •   File Indices by domain or keyword

                           •   Mashups with interesting datasets

                               •   Wikipedia

                               •   Freebase

                               •   Location databases (e.g. OpenStreetMap)
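
Two of the shapes listed above, link graphs and N-gram counts, can be sketched with the Python standard library alone. This is a small illustration on a single page, not code from Common Crawl; the helper names `LinkCollector`, `domain_edges`, and `ngram_counts` and the sample page are assumptions made for the example.

```python
from collections import Counter
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def domain_edges(page_url, html):
    """Return a Counter of (source domain, target domain) link-graph edges."""
    collector = LinkCollector()
    collector.feed(html)
    source = urlparse(page_url).netloc
    edges = Counter()
    for href in collector.hrefs:
        # Resolve relative links against the page URL before taking the domain.
        target = urlparse(urljoin(page_url, href)).netloc
        if target and target != source:  # keep only cross-domain links
            edges[(source, target)] += 1
    return edges


def ngram_counts(text, n=2):
    """Count word n-grams using naive lowercase, whitespace tokenization."""
    words = text.lower().split()
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))


# Made-up sample page.
page = '<a href="http://b.example/x">b</a> <a href="/local">home</a>'
print(domain_edges("http://a.example/index.html", page))
print(ngram_counts("open data is open data", n=2).most_common(1))
```

At Web scale the same per-page functions would be run in parallel over a crawl and their counters summed, which is where the cloud resources mentioned earlier come in.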


We should all create an extract!

How do I create an extract?

                                       An easy Recipe:


                       •   Ingredients:

                           •   A Web crawl snapshot

                           •   A little bit of programming skill

                           •   Access to cloud computing resources (e.g. EMR)

                       •   Directions:

                           •   http://commoncrawl.org/mapreduce-for-the-masses/
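
To give a feel for the "little bit of programming skill" involved, here is a minimal local sketch of the map and reduce steps for one possible extract: a count of crawled pages per domain. It is not the code from the tutorial linked above and it does not use Hadoop or EMR; `map_record`, `reduce_counts`, `run_job`, and the sample records are all hypothetical, and a real job would read records from a crawl snapshot instead of a Python list.

```python
from collections import defaultdict
from urllib.parse import urlparse


def map_record(url, content):
    """Map step: emit (domain, 1) for each crawled page."""
    domain = urlparse(url).netloc.lower()
    if domain:
        yield domain, 1


def reduce_counts(key, values):
    """Reduce step: sum the counts emitted for one domain."""
    return key, sum(values)


def run_job(records):
    """Tiny in-memory stand-in for the shuffle phase of a MapReduce run."""
    grouped = defaultdict(list)
    for url, content in records:
        for key, value in map_record(url, content):
            grouped[key].append(value)
    return dict(reduce_counts(k, v) for k, v in sorted(grouped.items()))


# Made-up (url, page content) pairs standing in for a crawl snapshot.
sample = [
    ("http://example.org/a", "<html>a</html>"),
    ("http://example.org/b", "<html>b</html>"),
    ("http://example.com/", "<html>c</html>"),
]
print(run_job(sample))
```

Keeping the map and reduce functions free of any framework code makes them easy to test locally before paying for a run over the full crawl.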



What Happens Once
                           I’ve Made This Awesome Extract?

                             •   Share the extracted data

                             •   Share the code you created / modified

                                 •   https://github.com/commoncrawl/commoncrawl-examples/

                             •   Broadcast it to the world!




And The World Is Saved!




                                Thank you.




Some Useful Links


   •       https://github.com/commoncrawl

   •       http://commoncrawl.org/mapreduce-for-the-masses/

   •       https://github.com/commoncrawl/commoncrawl-examples/

   •       https://aws.amazon.com/amis/common-crawl-quick-start

   •       https://commoncrawl.atlassian.net/wiki/display/CRWL/About+the+Data+Set





The Future of Software Development - Devin AI Innovative Approach.pdfThe Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdfSeasiaInfotech2
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 

Building a Public Metadata Commons to Preserve Digital Data

  • 1. A Public Metadata Commons: What is it? Why do we need it? How do we get it? Kurt Bollacker Open Data Bay Area 2012 Nov 27 Wednesday, April 3, 2013 1
  • 2. A long time ago, there was no “open” data. All of the media we used to create was physical.
  • 3. Then most (all?) of the media became digital.
  • 4. The Internet let us ship data around for (almost) free.
  • 5. And we learned how to connect it all together. So naturally, we started to build a Global Digital Data Commons!
  • 6. At first it was a “free for all” of academics and enthusiasts. Almost all data on the Web was considered to be “open”.
  • 7. And then folks figured out how to make money from our contributions, so they started to “lock down” part of the Internet that previously would have been part of the commons.
  • 8. Why is this bad? For the data archivist, centrally controlled data have far fewer (single?) points of failure. • Technical Failure • Legal Barriers • Incompetence
  • 9. A (Potential) Digital Dark Age "Those who cannot remember the past are condemned to repeat it" --- George Santayana
  • 10. How Do We Avoid This Lockdown Of Central Control, (And Hopefully A Digital Dark Age)? We Need A Practical Perspective On the Problem.
  • 12. Data tends to survive if, over the long term, it is: • Visible • Mobile • Well Loved These also happen to be the properties of data in a public commons.
  • 13. Historical Examples: • Bible / Torah / Koran • U.S. Constitution • DNA? Present Day Examples: • Wikipedia • Open Street Maps • Freebase • MusicBrainz Why? • There are many copies. (mobile) • Their use is mostly unrestricted. (visible) • Everyone can access and contribute. (well loved)
  • 14. But what about data that is still trapped by: • Technical Barriers? • Legal Restrictions? • Limited Resources?
  • 15. We build a metadata commons to hold the “cultural context” of our trapped data.
  • 16. How does a metadata commons work? [Diagram labels: Trapped Datasets; Metadata Extraction Processes; Metadata] Even if the original contribution is lost or otherwise made unavailable, we still have its cultural context.
  • 17. The cultural context in a metadata commons might contain: • Indices and Tags (to find and organize) • Comments (to analyze and interpret) • Technical metadata (e.g. provenance, format info) • Transforms and Interpretations (to make something useful)
  • 18. Where is the trapped data that we care about? A lot of it is in The World Wide Web! But the Web is: • Very large (10TB - 100TB for accessible / deduped) • Very noisy (useless pages, partial duplicates) • Very diverse (in content, purpose, and target audience) How do we build a Metadata Commons from the Web?
  • 19. A Practical Place To Start: Common Crawl (and cheap cloud computing resources) make the Web far cheaper and easier to access and manipulate. • Can be downloaded wholesale • Can be processed and analyzed in situ. • Parts can be publicly referenced
  • 20. This foundation helps us scale up to “Web size”, but: • What is the useful “metadata of the Web”? • How do we extract that metadata?
  • 21. Useful Web Extracts Are • Interesting to many people (to me!) • Can be used to answer relevant questions. • Can be used to build useful products and services. Almost everyone will have an itch to scratch.
  • 22. Specific Examples Of Useful Web Extracts (From the Common Crawl code contest) • WikiEntities • Congressional sentiment • Reach of Facebook on the Web
  • 23. (A Few) General Shapes Of Web Metadata Extracts • Link graphs • N-gram counts • File Indices by domain or keyword • Mashups with interesting datasets: Wikipedia, Freebase, Location databases (e.g. Open Street Maps) We should all create an extract!
  • 24. How do I create an extract? An Easy Recipe: • Ingredients: • A Web crawl snapshot • A little bit of programming skill • Access to cloud computing resources (e.g. EMR) • Directions: • http://commoncrawl.org/mapreduce-for-the-masses/
  • 25. What Happens Once I’ve Made This Awesome Extract? • Share the extracted data • Share the code you created / modified • https://github.com/commoncrawl/commoncrawl-examples/ • Broadcast it to the world!
  • 26. And The World Is Saved! Thank you.
  • 27. Some Useful Links • https://github.com/commoncrawl • http://commoncrawl.org/mapreduce-for-the-masses/ • https://github.com/commoncrawl/commoncrawl-examples/ • https://aws.amazon.com/amis/common-crawl-quick-start • https://commoncrawl.atlassian.net/wiki/display/CRWL/About+the+Data+Set
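
Slides 23 and 24 name "link graphs" as one general shape of Web metadata extract and suggest a MapReduce-style recipe for producing one. The following is a minimal sketch of that idea, not the code from the talk or from the Common Crawl examples repository. It assumes the crawl snapshot has already been read into (url, html) pairs held in memory; the function names `map_page`, `reduce_edges`, and `domain_link_graph` are hypothetical, and a real job would run the same map and reduce steps over crawl files on a cluster such as EMR.

```python
from collections import Counter
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class _LinkParser(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def map_page(url, html):
    """Map step (hypothetical name): emit (source_domain, target_domain)
    pairs for each cross-domain link found on one page."""
    parser = _LinkParser()
    parser.feed(html)
    source = urlparse(url).netloc
    for href in parser.hrefs:
        # Resolve relative links against the page URL before taking the domain.
        target = urlparse(urljoin(url, href)).netloc
        if target and target != source:
            yield (source, target)


def reduce_edges(pairs):
    """Reduce step (hypothetical name): count occurrences of each edge."""
    return Counter(pairs)


def domain_link_graph(pages):
    """Run the map step over every (url, html) pair, then reduce."""
    edges = (pair for url, html in pages for pair in map_page(url, html))
    return reduce_edges(edges)


if __name__ == "__main__":
    # Toy stand-in for a crawl snapshot; these pages are made up.
    sample_pages = [
        ("http://a.example/index.html",
         '<a href="http://b.example/x">b</a> <a href="/local">home</a>'),
        ("http://b.example/",
         '<a href="http://a.example/">a</a>'),
    ]
    for (src, dst), count in sorted(domain_link_graph(sample_pages).items()):
        print(src, "->", dst, count)
```

The resulting edge counts are small compared with the pages they came from, which is the point of slide 16: the extract can be shared and copied even when the underlying pages cannot.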