2. Scraping SERPs for Archival Seeds: It Matters When You Start
Alexander C. Nwala, Michele C. Weigle, and Michael L. Nelson
Old Dominion University
Web Science & Digital Libraries Research Group
@acnwala • @WebSciDL
Joint Conference on Digital Libraries (JCDL)
June 5, 2018, Fort Worth, TX
This work was made possible in part by IMLS LG-71-15-0077-15
Thank you SIGIR for the Travel Grant
@acnwala
Scraping SERPs for Archival Seeds: It Matters When You Start
JCDL 2018 • June 5, 2018
3. Outline
1. Introduction and Motivation
2. Research questions
3. Methodology
a. Dataset generation, representation, and processing
b. Primitive measures extraction
4. Results
5. Conclusions
4. In March 2014, there was a serious outbreak of Ebola in West Africa
The outbreak severely affected Guinea, Liberia, and Sierra Leone, with about 11,000 deaths (source: https://www.cdc.gov/vhf/ebola/outbreaks/2014-west-africa/).
http://wayback.archive-it.org/4887/20141028153039/http://blogs.msf.org/en/staff/blogs/msf-ebola-blog
7. Archived web collections begin with seeds
● A seed list is an initial collection of exemplar web pages for a topic
○ seeds + linked pages form a collection when crawled
● Archived web collections consist of groups of web pages that share a common topic, e.g., “Ebola virus” and “2018 Winter Olympics.”
● Human-generated seeds are high-quality, but expensive to generate
8. Archived web collections offer a way of preserving the historic record of important events
● Mandela’s legacy: http://xhosaculture.co.za/
● 2016 Dakota Access Pipeline: https://www.wsj.com/
● 2018 Winter Olympics: http://www.nj.com/
9. Seeds may be generated by multiple users
● The Internet Archive and Archive-It (a service of the Internet Archive) have on multiple occasions requested that users submit seeds via Google Docs for:
11. Tweet requests for other collections
1. SOPA blackout (Jan 2012)
2. Hurricane Sandy (Aug 2012)
3. 2012 Occupy movement (May 2012)
4. Aaron Swartz (Jan 2013)
5. Supreme Court hearings DOMA (Mar 2013)
6. Boston Marathon Bombing (Apr 2013)
7. Nelson Mandela (Dec 2013)
8. 2014 Ferguson, MO (Aug 2014)
9. Ebola virus (Oct 2014)
10. 2016 U.S. presidential election (Nov 2016)
11. #DAPL (Dec 2016)
12. 2018 Winter Olympics (Feb 2018)
12. Collection building often begins with a simple Google search
● Seeds can be discovered by issuing queries (e.g., “hurricane harvey”) to Google and extracting URIs from the SERP (Search Engine Result Page)
● URIs for older news stories may be difficult to discover via Google after one month (research result)
13. Before extracting seed URIs of news stories from SERPs, we investigated re-finding URIs on Google.
14. All (renamed General) SERP
● Wikipedia page present
● Older documents
News vertical SERP
● Wikipedia page absent
● Newer documents (e.g., 2 hours old)
15. Depending on query/topic, new pages displace older pages in SERPs: query = "healthcare bill"
SERP on May 25, 2017: initial stages of the AHCA bill and the struggles to pass the bill
SERP on January 5, 2018 (7 months later): later stages of the AHCA bill and the failure of the bill, which happened in September 2017
Headlines shown across the two SERPs:
● Healthcare saga shaping GOP approach to tax bill (thehill.com)
● US Senate’s McConnell sees tough path for passing healthcare bill (cnbc.com)
● Will the Republican Health Care Bill Really Lower Premiums? (time.com)
● House Republicans used lessons from failed health care bill to pass tax reform, Ryan says (pbs.org)
● GOP tax bill also manages to needlessly screw up the healthcare system (latimes.com)
● How GOP tax bill’s Obamacare changes will affect health care and consumers (chicagotribune.com)
17. Primary research questions
● RQ1: What is the rate at which new URIs replace old URIs on the SERP over time?
● RQ2: What is the probability of finding the same URI with the same query over time?
19. Methodology: Dataset generation, representation, and processing
● For a 7-month period (May 25, 2017 to January 12, 2018) we issued 7 queries every day and extracted URIs from the first 5 SERPs (General & News Vertical):
1. “healthcare bill”
2. “manchester bombing”
3. “london terrorism”
4. “trump russia”
5. “travel ban”
6. “hurricane harvey”
7. “hurricane irma”
20. ● Dataset generated by extracting URIs from SERPs for seven queries
● Dataset was semi-automatically generated with the http://www.localmemory.org/ collection generator Chrome extension
● 151,602 URIs (33,432 unique)
21. Tracking URIs: single query perspective
1: Query issued
2: URIs extracted (scheme and query parameters removed to track URIs)
3: URI info stored in JSON files
4: Date URI was found, page, etc.
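The URI-tracking step on this slide can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function names `uri_key` and `make_record`, the record fields, and the example URL are hypothetical, and lowercasing the host and dropping the fragment go beyond what the slide states.

```python
import json
from urllib.parse import urlsplit


def uri_key(uri: str) -> str:
    """Return a tracking key for a URI: host + path, without scheme or query.

    Assumptions (not stated on the slide): the host is lowercased and
    any fragment is dropped along with the query string.
    """
    parts = urlsplit(uri)
    return parts.netloc.lower() + parts.path


def make_record(uri: str, query: str, serp: str, page: int, date_found: str) -> dict:
    """Build a JSON-serializable record for one observation (hypothetical fields)."""
    return {
        "key": uri_key(uri),
        "uri": uri,
        "query": query,
        "serp": serp,
        "page": page,
        "date_found": date_found,
    }


# Hypothetical example URL and observation, for illustration only.
record = make_record(
    "https://www.example.com/news/story.html?utm_source=feed",
    query="hurricane harvey",
    serp="News Vertical",
    page=2,
    date_found="2017-09-01",
)
serialized = json.dumps(record, indent=2)
```

Keying on host and path means that `http` and `https` variants, or copies carrying different tracking parameters, count as the same URI when it is re-observed on later days.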
22. Retrievability of URIs was assessed by extracting four measures
● URI replacement rate, new URI rate, and page-level new URI rate
● Probability of finding a story
● Distribution of stories over time across pages
● Overlap rate and recall (see paper for details)
23. Example: URI replacement rate and new URI rate
● SERP at t0: URIs {a,b,c}
● SERP at t1: URIs {a,b,x,y}
○ URI replacement rate at t1 is (at t1, c was replaced):
● SERP at t0: URIs {a,b,c}
● SERP at t1: URIs {a,b,c,d,e}
○ The new URI rate from t0 to t1 is (at t1 we saw new URIs d and e):
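The two rates in this example can be sketched with set operations. The formulas themselves are not reproduced in this transcript, so the definitions below are assumptions chosen to be consistent with the slide's examples: the replacement rate is taken as the fraction of the earlier SERP's URIs that are missing from the later SERP, and the new URI rate as the fraction of the later SERP's URIs that were absent from the earlier one. The function names are hypothetical.

```python
def uri_replacement_rate(prev_uris, curr_uris) -> float:
    """Assumed definition: share of the earlier URIs that no longer appear."""
    prev, curr = set(prev_uris), set(curr_uris)
    if not prev:
        return 0.0
    return len(prev - curr) / len(prev)


def new_uri_rate(prev_uris, curr_uris) -> float:
    """Assumed definition: share of the later URIs that were not seen before."""
    prev, curr = set(prev_uris), set(curr_uris)
    if not curr:
        return 0.0
    return len(curr - prev) / len(curr)


# The slide's two examples.
replacement = uri_replacement_rate({"a", "b", "c"}, {"a", "b", "x", "y"})  # c was replaced: 1 of 3
new_rate = new_uri_rate({"a", "b", "c"}, {"a", "b", "c", "d", "e"})        # d and e are new: 2 of 5
```

Under these assumed definitions a SERP can have a high new URI rate with no replacement at all (the second example), which is why the two measures are reported separately.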
24. Probability of finding a URI over time and distribution of URIs over time across pages
Example with 3 URIs:
               Day 1  Day 2  Day 3  Day 4  Day 5
URI-1          4      2      0      0      0
URI-2          1      2      1      0      0
URI-3          1      1      1      1      0
Probabilities  3/3    3/3    2/3    1/3    0/3
Each cell is the SERP page on which the URI was found (e.g., URI-1 found on page 4 on Day 1); 0 means the URI was NOT found on pages 1-5 (e.g., URIs 1-3 on Day 5).
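The probabilities in this example can be recomputed from the page numbers. The sketch below assumes, as the slide's annotations suggest, that a value of 0 means the URI was not found on pages 1-5 that day; the variable and function names are hypothetical.

```python
from collections import Counter
from fractions import Fraction

# Page on which each URI was found on Days 1-5 (0 = not found on pages 1-5).
pages_by_uri = {
    "URI-1": [4, 2, 0, 0, 0],
    "URI-2": [1, 2, 1, 0, 0],
    "URI-3": [1, 1, 1, 1, 0],
}


def daily_probabilities(pages):
    """Fraction of tracked URIs that were found on each day."""
    total = len(pages)
    return [
        Fraction(sum(1 for page in day if page > 0), total)
        for day in zip(*pages.values())
    ]


def page_distribution(pages, day_index):
    """Count how many found URIs sit on each SERP page for one day."""
    return Counter(row[day_index] for row in pages.values() if row[day_index] > 0)


probabilities = daily_probabilities(pages_by_uri)
day1_pages = page_distribution(pages_by_uri, 0)
```

`probabilities` holds the slide's bottom row in reduced form (3/3 becomes 1, 0/3 becomes 0), and `day1_pages` shows two URIs on page 1 and one on page 4 for Day 1.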
26. General SERP collections had lower new URI rates, thus produced URIs with a longer lifespan than News vertical SERP collections
[Figure: Hurricane Harvey; panels: General SERP, News Vertical SERP]
27. Probability of finding the same URI with the same query on News vertical SERP after 1 month ≈ 0
[Figure: Hurricane Harvey]
● URIs of some news stories may not be easily discoverable if query is issued after 1 month:
○ It matters when users search for URIs
28. The “life span” of URIs is dependent not just on SERPs, but also topics
The URIs in “hurricane harvey” had a “longer life” than “trump russia” due to a lower rate of new URIs
[Figure panels: hurricane harvey, trump russia; each shows General URIs and News Vertical URIs]
29. Results: the URI replacement and new URI rates are strongly dependent on the topic
E.g., Hurricane Harvey’s lower daily avg. URI replacement rate (0.21) and avg. new URI rate (0.21) < Trump Russia (highest daily - monthly avg. URI replacement and avg. new URI rates)
[Tables: Average URI replacement rate; Average new URI rate (column markers: min-, max+)]
30. The probability of finding the same URI of a news story with the same query decreased with time for both SERPs
● The probability of finding the URI for a news story when the same query is issued one day after it was first observed:
○ 0.34 - 0.44 (General) vs 0.28 - 0.40 (News Vertical)
● After one week:
○ Weekly: 0.01 - 0.11 (General) vs 0.03 - 0.14 (News Vertical)
● After one month:
○ Monthly: 0.01 - 0.08 (General) vs 0 (News Vertical)
31. Generalization of the probability of finding an arbitrary URI as a function of time (days)
● We fitted a curve over the union of occurrence of the URIs in our dataset with an exponential model.
● The probability of finding an arbitrary URI of a news story s on a SERP sp ∈ {General, News Vertical}, after k days is predicted as follows:
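As an illustration of fitting an exponential model to such probabilities, the sketch below uses a log-linear least-squares fit. The authors' fitted equation and coefficients are not reproduced in this transcript, so the functional form p(k) = a·exp(−b·k), the fitting procedure, and the function names are all assumptions; the data points are synthetic, generated from a known curve purely to exercise the code, and are not results from the study.

```python
import math


def fit_exponential(days, probs):
    """Fit p(k) = a * exp(-b * k) by least squares on ln(p).

    Requires every probability to be > 0, since ln(0) is undefined;
    observed zeros would need to be dropped or handled another way.
    """
    xs = list(days)
    ys = [math.log(p) for p in probs]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope  # (a, b)


def predict(a, b, k):
    """Predicted probability of finding a URI k days after first observing it."""
    return a * math.exp(-b * k)


# Synthetic, illustrative data only: points drawn from a known curve.
true_a, true_b = 0.4, 0.15
days = list(range(1, 31))
probs = [true_a * math.exp(-true_b * k) for k in days]

a, b = fit_exponential(days, probs)
```

Because the synthetic points lie exactly on the assumed curve, the fit recovers `true_a` and `true_b`; real SERP observations are noisy, so fitted values would only approximate the underlying decay.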
32. URIs show multiple page movement patterns
● Each box represents a URI, numbers in boxes represent the page the URI was found:
● https://en.wikipedia.org/wiki/Manchester_Arena_bombing (page 1)
● Color codes:
○ Page 1
○ Page 2
○ Page 3
○ Page 4
○ Page 5
○ White (outside pages 1-5)
[Figure: URIs over time, May 25, 2017 to July 15, 2017]
33. Persistent vs rank climbing/falling page movement patterns
https://en.wikipedia.org/wiki/Manchester_Arena_bombing: Some URIs persist over time within the same page
https://www.rollingstone.com/music/news/manchester-bombing-what-we-know-about-arena-terror-attack-w483752: Rapid/steady rank climbing and falling:
● Rapid climb: Some URIs go from page 5 to 1 (skipping pages 4 - 2),
● Rapid fall: or go from 1 to 5.
● Steady fall: or go from 3 - 2 - 1
[Figure: URIs over time, May 25, 2017 to July 15, 2017]
34. Scraping SERPs for seeds?
Start early, don’t stop!
35. Conclusions
● Collection building offers a way of preserving the historic record of important events and begins with seeds.
● Search engines provide an opportunity to build collections or extract seeds, but tend to provide the most recent documents.
● Our findings about the difficulty in “refinding” news stories suggest that collection building efforts that utilize SERPs should start early and persist.
@acnwala @webscidl
Thank you!
Access our research dataset of 151,602 (33,432 unique) links extracted from
the Google SERPs for over seven months:
https://github.com/anwala/SERPRefind