Tufts Spatial Data Rescue: Crawling at-risk Government Data
Kyle Monahan
Statistics and Research Technology Specialist
Tufts University
FOSS4G | Boston, MA | 8/17/2017
Background
• What is a data rescue?
• Methods and techniques to identify, store and preserve datasets
• Predominantly data associated with government entities
• *.gov, *.mil, *.edu, *.org, etc.
• Especially critical during election transitions
12/23/2017 FOSS4G Conference | Boston, MA 2
Background - History
Example: 2008 End of Term Harvest
• National Archives and Records Administration announced they
would be unable to rescue data as they did in 2004.
• International Internet Preservation Consortium (IIPC) responded
by organizing a crawl:
• California Digital Library
• Internet Archive
• Government Printing Office
• Library of Congress
• University of North Texas
• Goal: “comprehensive harvest” (EOTerm Archive, 2016)
Example: 2008 End of Term Harvest
• Consisted of three main crawls:
• Pre-election
• Post-election
• Post-inauguration
• Produced over 16 TB of data
• And 160,211,356 URIs (Phillips, 2016)
End of Term, 2008
Methods of Tufts Crawl
• Access to up-to-date Federal data is critical for our Data Lab
• GIS and statistics classes rely on federal data (e.g. US Census,
TRI, HUD)
Methods of Tufts Crawl
• Inquired about key data for faculty
and staff at Tufts
• Also reached out to the Open
Geoportal community
• Created a list of critical data
sources that enable research and
learning at Tufts and beyond
Google Docs
OGP Outreach
Methods of Tufts Crawl
• Used an FTP program called FileZilla
• Can re-initiate connections after
failure
• Ran on multiple computers
overnight, set to mirror different
FTP sites.
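The talk used FileZilla for mirroring. As a rough scripted counterpart, the sketch below shows the same idea of re-initiating a connection after a failure, using Python's standard `ftplib`. It is an illustration, not the method used at Tufts: the helper names `with_retries` and `download_file` and their defaults are assumptions made for this example.

```python
import ftplib
import os
import time


def with_retries(action, attempts=3, delay=5.0, errors=ftplib.all_errors):
    """Call action() until it succeeds; re-raise the error after the last attempt."""
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except errors:
            if attempt == attempts:
                raise
            time.sleep(delay)  # wait before opening a fresh connection


def download_file(host, remote_path, local_path, user="anonymous", password=""):
    """Fetch one file over FTP (hypothetical helper; restarts from byte 0 on retry)."""
    os.makedirs(os.path.dirname(local_path) or ".", exist_ok=True)
    with ftplib.FTP(host, timeout=60) as ftp:
        ftp.login(user, password)
        with open(local_path, "wb") as out:
            ftp.retrbinary("RETR " + remote_path, out.write)
    return local_path
```

A call might look like `with_retries(lambda: download_file("ftp.example.gov", "/pub/data.zip", "mirror/data.zip"))`, where the host and paths are placeholders. Unlike a full FTP client, this sketch does not resume partial files or walk directories.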
Methods of Tufts Crawl
• Also completed collection of data from speed-limited locations by traditional mail
• Placed 128 GB flash drive in an envelope
• Caught in a storm, but still much faster than dial-up speed
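To put the dial-up comparison in numbers, here is a back-of-the-envelope sketch. The 56 kbit/s dial-up rate and the three-day mail transit are assumptions for illustration; neither figure comes from the talk.

```python
def transfer_days(size_bytes, bits_per_second):
    """Days needed to move size_bytes over a link of the given speed."""
    seconds = size_bytes * 8 / bits_per_second
    return seconds / 86_400


def effective_mbps(size_bytes, days):
    """Average throughput, in Mbit/s, of moving size_bytes in the given days."""
    return size_bytes * 8 / (days * 86_400) / 1e6


FLASH_DRIVE_BYTES = 128 * 10**9  # 128 GB drive from the slide (decimal gigabytes)
DIALUP_BPS = 56_000              # assumed nominal dial-up rate
MAIL_DAYS = 3                    # assumed transit time for the envelope

dialup_days = transfer_days(FLASH_DRIVE_BYTES, DIALUP_BPS)
mail_mbps = effective_mbps(FLASH_DRIVE_BYTES, MAIL_DAYS)
print(f"Dial-up: about {dialup_days:.0f} days; mail: about {mail_mbps:.2f} Mbit/s")
```

Under these assumptions a full drive would take roughly 212 days over dial-up, while the mailed envelope averages close to 4 Mbit/s.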
Summary of Results
[Bar chart: "Data Rescue, Estimated Harvest". Y-axis: Data Rescued, TB. X-axis: Year of Harvest.]
• 2008: 16 TB
• 2012: 31 TB
• 2017: 40 TB (from Tufts alone; likely much higher for all data rescues!)
Source: Phillips, 2016
Results of Tufts Crawl
Word Cloud
(highest frequency terms)
Development of Tufts Crawler
• High volume of data, much of it zipped
• Some further compressed inside zip files
• Needed a lightweight tool to assess what data was captured
• Solution: Python script
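The script itself is not reproduced in this transcript. As a minimal sketch of the general idea, assuming the goal is to tally file types inside archives that may contain further zip files, something like the following could work. The function name `inventory_zip` and the depth limit are assumptions for this example.

```python
import io
import zipfile
from collections import Counter
from pathlib import PurePosixPath


def inventory_zip(source, counts=None, max_depth=5):
    """Tally member file extensions in a zip, descending into nested zips."""
    counts = Counter() if counts is None else counts
    with zipfile.ZipFile(source) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue
            ext = PurePosixPath(info.filename).suffix.lower() or "(none)"
            counts[ext] += 1
            if ext == ".zip" and max_depth > 0:
                # Read the nested archive into memory instead of extracting to disk.
                nested = io.BytesIO(archive.read(info))
                if zipfile.is_zipfile(nested):
                    nested.seek(0)
                    inventory_zip(nested, counts, max_depth - 1)
    return counts
```

Reading nested archives into memory keeps the sketch short, but for multi-gigabyte members an on-disk temporary file would be the safer choice.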
Development of Tufts Crawler
• Packaged the Python script in a GUI using Tkinter
• Object-oriented layer on Tcl/Tk
• Allows users unfamiliar with Python to use the tool
• Provides a simple interface and clear results
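The actual GUI is not shown in this transcript. The sketch below illustrates how a small Tkinter front end might wrap such a script: a button opens a folder picker and a text box shows a count of files by extension. All names here (`count_extensions`, `format_report`, `build_app`) and the window layout are assumptions, not the Tufts Crawler's design.

```python
import os
from collections import Counter


def count_extensions(folder):
    """Count files under folder by lower-cased extension."""
    counts = Counter()
    for _root, _dirs, files in os.walk(folder):
        for name in files:
            counts[os.path.splitext(name)[1].lower() or "(none)"] += 1
    return counts


def format_report(counts):
    """Render the counts as one 'extension: count' line per file type."""
    lines = [f"{ext}: {n}" for ext, n in sorted(counts.items())]
    return "\n".join(lines) if lines else "No files found."


def build_app():
    """Build the window; tkinter is imported here so the helpers work headless."""
    import tkinter as tk
    from tkinter import filedialog

    root = tk.Tk()
    root.title("Crawl inventory (sketch)")
    output = tk.Text(root, width=40, height=15)

    def choose_folder():
        folder = filedialog.askdirectory()
        if folder:
            output.delete("1.0", tk.END)
            output.insert(tk.END, format_report(count_extensions(folder)))

    tk.Button(root, text="Choose folder...", command=choose_folder).pack(pady=5)
    output.pack(padx=5, pady=5)
    return root
```

Calling `build_app().mainloop()` opens the window. Keeping the counting and formatting in plain functions means they can be checked without a display.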
Results of Tufts Crawler
• Unzips files
• Records type of file
• Organizes XML data
Using the Tufts Crawler – Take a Look
Summary & Future Work
• Tufts identified federal data perceived to be “at risk”
• Harvested over 40 TB of data, mostly compressed
• Developed Tufts Crawler to unpack the data and categorize file types, sizes and other metadata
• Future work: package as an .exe, add a progress bar that estimates completion
Acknowledgements
• Tufts Geospatial Team: Carolyn
Talmadge, Chris Barnett, Szuhui Wu,
Annie Swafford, Kristen Lee, Adrian
Sharpe, Patrick Florance.
• Graduate students: Sam Boiler.
• Others: Faculty and members of OGP
who assisted in data selection,
DataRescue Boston, the #DataRefuge
Slack channel, all in Bromfield House.
Thank you!
Kyle M. Monahan
Statistics & Research Technology Specialist
Tufts University
kyle.monahan@tufts.edu
For more information: kylemonahan.info, datalab.tufts.edu
Questions?
5 minutes
Extra Slides – Python Code
Extra Slides – Python Code
Extra Slides – Details about tkinter GUI

Tufts Spatial Data Rescue: Crawling at-risk Government Data

  • 1. Tufts Spatial Data Rescue: Crawling at-risk Government Data Kyle Monahan Statistics and Research Technology Specialist Tufts University FOSS4G | Boston, MA | 8/17/2017
  • 2. Background •What is a data rescue? • Methods and techniques to identify, store and preserve datasets • Predominantly data associated with government entities • *.gov, *.mil, *.edu, *.org, etc. • Especially critical during election transitions 12/23/2017 FOSS4G Conference | Boston, MA 2
  • 4. Example: 2008 End of Term Harvest 12/23/2017 FOSS4G Conference | Boston, MA 4 • National Archives and Records Administration announced they would be unable to rescue data as they did in 2004. • International Internet Preservation Consortium (IIPC) responded by organizing a crawl: • California Digital Library • Internet Archive • Government Printing Office • Library of Congress • University of North Texas • Goal: “comprehensive harvest” (EOTerm Archive, 2016)
  • 5. Example: 2008 End of Term Harvest 12/23/2017 FOSS4G Conference | Boston, MA 5 • Consisted of three main crawls: • Pre-election • Post-election • Post-inauguration • Produced over 16 TB of data • And 160,211,356 URIs (Phillips, 2016) End of Term, 2008
  • 6. Methods of Tufts Crawl 12/23/2017 FOSS4G Conference | Boston, MA 6 • Access to up-to-date Federal data is critical for our Data Lab • GIS and statistics classes rely on federal data (e.g. US Census, TRI, HUD)
  • 7. Methods of Tufts Crawl 12/23/2017 FOSS4G Conference | Boston, MA 7 • Inquired about key data for faculty and staff at Tufts • Also reached out to the Open Geoportal community • Created a list of critical data sources that enable research and learning at Tufts and beyond Google Docs OGP Outreach
  • 8. Methods of Tufts Crawl 12/23/2017 FOSS4G Conference | Boston, MA 8 • Used an FTP program called Filezilla • Can re-initiate connections after failure • Ran on multiple computers overnight, set to mirror different FTP sites.
  • 9. Methods of Tufts Crawl 12/23/2017 FOSS4G Conference | Boston, MA 9 •Also completed collection of data from speed-limited locations by traditional mail •Placed 128 GB flash drive in an envelope • Caught in a storm, but still much faster than dial-up speed
  • 10. Summary of Results 12/23/2017 FOSS4G Conference | Boston, MA 10 16 31 40 0 10 20 30 40 50 2008 2012 2017 DataRecused,TB Year of Harvest Data Rescue, Estimated Harvest From Tufts alone – likely much higher for all data rescues!Source: Phillips, 2016
  • 11. Results of Tufts Crawl 12/23/2017 FOSS4G Conference | Boston, MA 11 Word Cloud (highest frequency terms)
  • 12. Development of Tufts Crawler
      • High volume of data, much of it zipped
          • Some further compressed inside zip files
      • Needed a lightweight tool to assess what data was captured
          • Solution: Python script
  • 13. Development of Tufts Crawler
      • Packaged the Python script in a GUI using Tkinter
          • Object-oriented layer on Tcl/Tk
      • Allows users unfamiliar with Python to use the tool
      • Provides a simple interface and clear results
  • 14. Results of Tufts Crawler
      • Unzips files
      • Records type of file
      • Organizes XML data
  • 15. Using the Tufts Crawler – Take a Look
  • 16. Summary & Future Work
      • Tufts identified federal data perceived as “at-risk”
      • Harvested over 40 TB of data, mostly compressed
      • Developed Tufts Crawler to unpack the data and categorize file types, sizes and other metadata
      • Future work: package into an .exe, add an estimated progress bar
  • 17. Acknowledgements
      • Tufts Geospatial Team: Carolyn Talmadge, Chris Barnett, Szuhui Wu, Annie Swafford, Kristen Lee, Adrian Sharpe, Patrick Florance.
      • Graduate students: Sam Boiler.
      • Others: Faculty and members of OGP who assisted in data selection, DataRescue Boston, the #DataRefuge Slack channel, all in Bromfield House.
  • 18. Thank you!
      • Kyle M. Monahan, Statistics & Research Technology Specialist, Tufts University
      • kyle.monahan@tufts.edu
      • For more information: kylemonahan.info, datalab.tufts.edu
  • 19. Questions? (5 minutes)
  • 20. Extra Slides – Python Code
  • 21. Extra Slides – Python Code
  • 22. Extra Slides – Details about tkinter GUI
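
Slides 12 to 14 describe a lightweight Python script that unzips the harvest and records file types, including zip files nested inside other zip files. The script itself is not reproduced in this transcript, so the following is only a minimal sketch of that idea using the standard-library `zipfile` module. The function name `inventory_zip`, the `max_depth` safeguard, and the `"(none)"` label for files without an extension are hypothetical and not taken from the Tufts Crawler.

```python
import io
import zipfile
from collections import Counter
from pathlib import PurePosixPath


def inventory_zip(source, max_depth=5):
    """Count file extensions inside a zip archive, descending into nested zips.

    `source` is a path or a binary file-like object. `max_depth` is a
    hypothetical safeguard limiting how many nested levels are opened.
    """
    counts = Counter()
    with zipfile.ZipFile(source) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue
            suffix = PurePosixPath(info.filename).suffix.lower() or "(none)"
            if suffix == ".zip" and max_depth > 0:
                # Read the inner archive into memory and recurse into it.
                inner = io.BytesIO(archive.read(info))
                try:
                    counts.update(inventory_zip(inner, max_depth - 1))
                    continue  # A readable inner zip is replaced by its contents.
                except zipfile.BadZipFile:
                    pass  # Not a readable zip: count it as an ordinary file.
            counts[suffix] += 1
    return counts
```

Calling `inventory_zip("harvest.zip")` (a placeholder file name) returns a `Counter` keyed by extension. Because each nested archive is read fully into memory, a harvest at the multi-terabyte scale described here would likely need a streaming or extract-to-disk variant instead.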

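Slide 8 notes that FileZilla was chosen because it can re-initiate connections after a failure. FileZilla is a GUI client, so the sketch below is not what the Tufts team ran; it only illustrates the same reconnect-and-resume idea with Python's standard `ftplib`. The helper names `with_retries` and `fetch_file`, the retry count, the timeout, and the use of anonymous login are all assumptions made for illustration.

```python
import ftplib
import os
import time


def with_retries(action, attempts=3, delay=0.0, errors=ftplib.all_errors):
    """Call `action()` until it succeeds or `attempts` tries have failed."""
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except errors:
            if attempt == attempts:
                raise  # Out of attempts: surface the last error.
            time.sleep(delay)


def fetch_file(host, remote_path, local_path):
    """Download one file over anonymous FTP, resuming a partial local copy."""
    offset = os.path.getsize(local_path) if os.path.exists(local_path) else 0
    with ftplib.FTP(host, timeout=60) as ftp, open(local_path, "ab") as out:
        ftp.login()  # Anonymous login (an assumption; many servers differ).
        # `rest` asks the server to start sending at the given byte offset.
        ftp.retrbinary(f"RETR {remote_path}", out.write, rest=offset or None)
```

A call such as `with_retries(lambda: fetch_file("ftp.example.gov", "/pub/data.zip", "data.zip"), attempts=5, delay=30)` would retry a dropped transfer and pick up where the local file left off; the host and paths are placeholders, and resuming only works on servers that support the FTP `REST` command.
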
Editor's Notes

  1. Sources: "End of Term Presidential Harvest 2008," University of North Texas Digital Library, retrieved August 14, 2017. Phillips, Mark Edward. End of Term Web Archives: 2008, 2012, 2016 ..., presentation, April 5, 2016 (digital.library.unt.edu/ark:/67531/metadc848587/: accessed August 14, 2017), University of North Texas Libraries, Digital Library, digital.library.unt.edu; crediting UNT Libraries Digital Projects Unit.
  2. Source: EDGI. “Homepage: Environmental Data and Governance Initiative.” Accessed on 8-15-2017. KMM. https://envirodatagov.org/
  3. Image source: http://netpreserve.org/wp-content/uploads/2017/04/IIPC-logo.png Source: End Of Term Archive. “Project Background: End of Term Archive.” 2008. http://eotarchive.cdlib.org/background.html Accessed on 8-15-2017.
  4. URI is uniform resource identifier; be sure to say it!
  5. Data is critical to our teaching and research computing lab, called the Data Lab. We focus on data reference, analysis and visualization, for which federal data provides an integral base. Many of our GIS and statistics courses rely on access to federal data, such as the US Census, the EPA’s Toxics Release Inventory, and data from the US Department of Housing and Urban Development (HUD), among others.
  6. We took a similar approach to the previously mentioned crawls. We inquired about the data that was critical for faculty and staff at Tufts.