This document discusses a study on how well Arabic websites are archived. It finds that while the number of Arabic internet users has grown rapidly, little research has been done on archiving Arabic language content. The study examines the percentage of websites archived from various Arabic-speaking countries in 2009 and 2013, finding the percentage increased but still lags the worldwide average. The goal is to better understand how completely events and content from the Arabic world are preserved online.
This document discusses profiling web archives to summarize their holdings. It presents different profiling strategies like complete URI profiling, top-level domain only profiling, and a middle ground approach. It evaluates profiles generated for two archives using different sample query sets. The evaluation relates characteristics like CDX file size and numbers of URI-Rs and URI-Ms. It analyzes the cost and precision of the profiles for routing archive search requests. The results show gains in routing precision of up to 22% can be achieved with relative costs under 5%.
@WebSciDL PhD Student Project Reviews, August 5 & 6, 2015 - Michael Nelson
Herbert Van de Sompel (LANL) visited the Web Science & Digital Libraries Group @ ODU on August 5-7, 2015. The seven PhD students who were in town at that time reviewed their current status for him.
This document discusses different ways to serialize archive profile data, which provides statistics about an archive's holdings of web resources. It proposes using a hybrid CDXJSON format that combines aspects of the CDX and JSON formats. CDXJSON allows for partial key lookups, binary searching, and error resilience. It is more suitable than JSON for processing archive profile data due to being text-based and not requiring the full file to be loaded. The document provides examples of archive profile data organized in different structures and serialized in JSON and the proposed CDXJSON format. Future work includes updating profiler code to output CDXJSON and formalizing the CDXJSON specification.
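As a rough illustration of the serialization idea (the SURT keys and counts below are made up, and the real CDXJSON specification may differ), each record is a sort-friendly key followed by a JSON value, so a sorted file supports binary search and partial key lookups without loading or parsing the whole file:

```python
import bisect
import json

# Hypothetical CDXJSON-style profile: each line is "<SURT key> <JSON value>",
# kept sorted so lookups can binary-search the lines directly.
profile_lines = sorted([
    'com,example)/ {"urir": 120, "urim": 340}',
    'com,example)/blog {"urir": 45, "urim": 90}',
    'org,archive)/ {"urir": 7, "urim": 12}',
])

def lookup(lines, key_prefix):
    """Return (key, parsed JSON) pairs for all lines whose key starts with key_prefix."""
    i = bisect.bisect_left(lines, key_prefix)
    results = []
    while i < len(lines) and lines[i].startswith(key_prefix):
        key, _, value = lines[i].partition(" ")
        results.append((key, json.loads(value)))
        i += 1
    return results

# Partial-key lookup: every record under the example.com domain.
matches = lookup(profile_lines, "com,example)")
```

Because each line parses independently, a corrupt record only affects that one lookup, which is the error-resilience property the format aims for.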
This document discusses strategies for profiling web archives to determine the likelihood that a URI is present in an archive. It examines generating profiles using different policies like profiling by top-level domain or path depth. Profiles were created for three sample archives and evaluated based on precision for routing requests versus relative computational cost. The results show profiles can gain up to 22% improved routing precision with less than 5% increased relative cost. The profiling strategies and open source code provide a way to predict archive holdings and help with tasks like Memento query routing.
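The policy idea can be sketched as key-generation functions over URIs (the policy names and example URIs here are illustrative, not the paper's exact definitions): a coarser policy yields smaller profiles, a finer one yields better routing precision.

```python
from urllib.parse import urlparse

def tld_key(uri):
    """Top-level-domain-only policy: keep just the TLD."""
    host = urlparse(uri).hostname or ""
    return host.rsplit(".", 1)[-1]

def domain_path_key(uri, depth=1):
    """Middle-ground policy: host plus the first `depth` path segments."""
    parsed = urlparse(uri)
    segments = [s for s in parsed.path.split("/") if s][:depth]
    return "/".join([parsed.hostname or ""] + segments)

k1 = tld_key("http://example.com/a/b/c.html")
k2 = domain_path_key("http://example.com/a/b/c.html")
```

A profile then maps each key to counts of the archive's holdings under that key, and a router checks the incoming URI's key against each archive's profile before forwarding the request.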
Using Web Archives to Enrich the Live Web Experience Through Storytelling - Yasmin AlNoamany, PhD
This document discusses using web archives and storytelling techniques to generate summaries of archived web collections. It proposes identifying representative web pages from collections and arranging them into stories to provide overviews of the archived content. Four basic story types are described based on whether the pages or timestamps are fixed or sliding. The methodology involves establishing baselines, analyzing collection topics, filtering off-topic pages, and selecting pages to visualize as stories. Generated stories could be displayed on platforms like Storify or interactive timelines to enrich access to archived web collections.
Impact Analysis of OCR Quality on Research Tasks in Digital Archives - Myriam Traub
Humanities scholars increasingly rely on digital archives for their research instead of time-consuming visits to physical archives. This shift in research method has the hidden cost of working with digitally processed historical documents: how much trust can a scholar place in noisy representations of source texts? In a series of interviews with historians about their use of digital archives, we found that scholars are aware that optical character recognition (OCR) errors may bias their results. They were, however, unable to quantify this bias or to indicate what information they would need to estimate it, even though such an estimate would be important to assess whether the results are publishable. Based on the interviews and a literature study, we provide a classification of scholarly research tasks that accounts for their susceptibility to specific OCR-induced biases and the data required for uncertainty estimations. We conducted a use case study on a national newspaper archive with example research tasks. From this we learned what data is typically available in digital archives and how it could be used to reduce and/or assess the uncertainty in result sets. We conclude that the current state of knowledge on the users’ side as well as on the tool makers’ and data providers’ side is insufficient and needs to be improved.
This document discusses methods for detecting off-topic pages in web archives. It begins by providing examples of ways pages can go off-topic over time, such as due to database errors, financial problems, hacking, or domain expiration. It then examines the behavior of "timemaps" that track archived versions of pages over time. The document outlines several methods for detecting off-topic pages, including analyzing textual content using cosine similarity or Jaccard similarity, examining page semantics using a search engine kernel function, and looking at structural changes like word counts. It evaluates these methods on manually labeled archive collections and finds that combining three methods provides the best results. Finally, it describes a publicly available tool for detecting off-topic pages in archives and applies
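Two of the textual measures mentioned above can be sketched in a few lines (the token lists are invented examples, not data from the evaluated collections): a capture whose similarity to the collection's topic drops sharply is flagged as off-topic.

```python
import math
from collections import Counter

def cosine_similarity(a_tokens, b_tokens):
    """Cosine similarity between two term-frequency vectors."""
    a, b = Counter(a_tokens), Counter(b_tokens)
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def jaccard_similarity(a_tokens, b_tokens):
    """Jaccard similarity between two token sets."""
    a, b = set(a_tokens), set(b_tokens)
    return len(a & b) / len(a | b) if a | b else 0.0

# An on-topic first capture versus a later capture of an expired domain.
first = "human rights protest coverage".split()
later = "domain for sale buy now".split()
cos = cosine_similarity(first, later)
jac = jaccard_similarity(first, later)
# Low similarity to the first capture suggests the page has drifted off-topic.
```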
The document discusses research into the characteristics of popular human-generated stories on social media platforms. It finds that on average, popular stories have 51 elements including 23 web elements, are edited over a period of 3 hours, and are most often composed of content from Twitter, Instagram, YouTube and Facebook. The research also shows a linear relationship between the time a story is edited and the number of elements included.
Quantifying Orphaned Annotations in Hypothes.is - maturban
Web annotation has been receiving increased attention recently with the organization of the Open Annotation Collaboration and new tools for open annotation, such as Hypothes.is. In this paper, we investigate the prevalence of orphaned annotations, where a live Web page no longer contains the text that had previously been annotated in the Hypothes.is annotation system (containing 20,953 highlighted text annotations).
Web Archiving Activities of ODU’s Web Science and Digital Library Research G... - Michael Nelson
Michael L. Nelson
@phonedude_mln
Michele C. Weigle
@weiglemc
National Symposium on Web Archiving Interoperability
2017-02-21
Many projects joint with LANL
Funding from NSF, IMLS, NEH, and AMF
1) The document analyzes over 740,000 tweets from the second 2016 presidential debate between Donald Trump and Hillary Clinton to understand how quotes were recorded, interpreted, and shared on Twitter.
2) The tweets were filtered into collections based on memorable quotes from the debate and analyzed to find variations in how the quotes were reported, interpretive biases in the tweets, and sentiment toward the quotes.
3) The results showed that Twitter users often reported quotes with some variation or commentary, that word trees can illustrate how quotes changed over time, and that sarcastic tweets may have skewed sentiment analysis of the quotes.
Workshop: Archive Unleashed 3.0
Team: Good News/ Bad News
Category: Local News
Location/Date: Internet Archive, San Francisco, CA, February 23 – 24, 2017
Team members:
Sawood Alam, Old Dominion University
Lulwah Alkwai, Old Dominion University
Mark Beasley, Rhizome
Brenda Berkelaar, University of Texas at Austin
Frances Corry, University of Southern California
Ilya Kreymer, Rhizome
Nathalie Casemajor, INRS
Lauren Ko, University of North Texas
Presentation for PIDapalooza 2016. PIDs need to be used to achieve their intended persistence. Our research (reported at WWW2016, see http://arxiv.org/1602.09102) found that a disturbing percentage of references to papers that have DOIs actually use the landing page HTTP URI instead of the DOI HTTP URI. The problem is likely related to tools used for collecting references such as bookmarks and reference managers. These select the landing page URI instead of the DOI URI because the former is what's available in the address bar. It can safely be assumed that the same problem exists for other types of PIDs. The net result is that the true potential of PIDs is not realized. In order to ameliorate this problem we propose a Signposting pattern for PIDs (http://signposting.org/identifier/). It consists of adding a Link header to HTTP HEAD/GET responses for all resources identified by a DOI, including the landing page and content resources such as "the PDF" and "the dataset". The Link header contains a link, which points with the "identifier" relation type to the DOI HTTP URI. When such a link is available, tools can automatically discover and use the DOI URI instead of the other URIs (landing page, PDF, dataset) associated with the DOI-identified object.
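A minimal sketch of how a tool might consume this pattern, assuming a Link header like the made-up one below (production header parsing should follow RFC 8288 more carefully than this naive comma split):

```python
import re

def find_identifier(link_header):
    """Return the URI of the first link with rel="identifier", or None.

    Naively splits on commas, so URIs containing commas would break it;
    this is only a sketch of the discovery step.
    """
    for part in link_header.split(","):
        m = re.search(r'<([^>]+)>\s*;\s*rel="([^"]*)"', part)
        if m and "identifier" in m.group(2).split():
            return m.group(1)
    return None

# A reference manager fetching a landing page could prefer this URI
# over the address-bar URI when recording the reference.
link_header = '<https://doi.org/10.1234/example> ; rel="identifier"'
doi_uri = find_identifier(link_header)
```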
To be useful, Linked Open Data requires shared identities and the reuse of their identifiers (URIs). This presentation argues that exact identity matching is both theoretically and practically impossible, and proposes some practical considerations for how to create an actual web of data.
Presented as invited seminar at UC Berkeley, February 24th, 2017
Finding Pages on the Unarchived Web (DL 2014) - TimelessFuture
This document summarizes a study that aimed to recover parts of the unarchived web using link evidence found in the Dutch Web Archive. The researchers were able to reconstruct representations for over 10 million unarchived pages and found that the representations were rich enough to identify pages in a known-item search setting, with 59.7% of pages found in the top 10 results on average. While the representations were skewed with most pages only having sparse descriptions, pages with more incoming links had richer representations that led to higher search accuracy. The researchers believe these techniques could help expand web archive coverage and provide additional context about the archived and unarchived web.
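The core idea of building representations from link evidence can be sketched as follows (the link tuples and URIs are invented, not data from the Dutch Web Archive study): anchor text from archived pages is aggregated per unarchived target, and a known-item query is matched against those aggregated descriptions.

```python
from collections import defaultdict

# Hypothetical link evidence: (archived source, unarchived target, anchor text).
link_evidence = [
    ("http://archived-a.example/", "http://unarchived.example/page", "city council report"),
    ("http://archived-b.example/", "http://unarchived.example/page", "2010 council report"),
    ("http://archived-c.example/", "http://unarchived.example/other", "photo gallery"),
]

# Aggregate anchor text into a representation per unarchived URI.
representations = defaultdict(list)
for source, target, anchor in link_evidence:
    representations[target].append(anchor)

def known_item_search(query):
    """Rank unarchived URIs by how many query words their anchors contain."""
    words = set(query.lower().split())
    scores = {
        uri: sum(1 for a in anchors for w in words if w in a.lower())
        for uri, anchors in representations.items()
    }
    return sorted(scores, key=scores.get, reverse=True)

best = known_item_search("council report")[0]
```

Pages with more incoming links accumulate richer representations, which matches the study's finding that such pages are easier to retrieve.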
An Unsteady Course: Challenges to Growth in Africa's Air Transport Industry - Dr Lendy Spires
This document provides data on air transport for various African countries. It includes statistics such as the number of adjusted seats on domestic and international flights in 2007, the annual growth rate of seats between 2001-2007, the number of airports and airlines, and the number of domestic and international routes. The data aims to give insights into the challenges and opportunities for growth in Africa's air transport industry.
The Digital Bible Platform provides Bible content through various digital platforms and social networks. It utilizes cloud technologies, APIs, and standards compliant formats to distribute content internationally on mobile apps and online. The goal is to make the Bible widely accessible through modern digital tools.
The document is a table that ranks 164 countries by the economic cost of violence as a percentage of GDP. It provides data on the total economic cost of violence in millions of 2016 PPP dollars, the per capita cost in 2016 PPP dollars, and the cost as a percentage of GDP for each country. The top of the table shows that Syria, Iraq, and Afghanistan have the highest costs of violence as a percentage of GDP, at over 50% each. The bottom of the table shows that China, Norway, Ireland and other developed countries have costs below 5% of GDP.
This document summarizes key metrics about Turkey's digital landscape in 2013 based on various sources. Some of the key findings include:
- Internet penetration in Turkey reached 48.7% in 2013 with over 20 million broadband subscribers.
- Turks spend on average 73 hours per month online, close to the amount of time spent watching TV.
- The top 3 most visited websites in Turkey are Google, Facebook, and YouTube. Time spent on Facebook averages over 12 hours per month.
- Digital ad spending in Turkey grew 30% in the first half of 2013 to over 541 million TL, with search and display advertising seeing the largest increases.
- Mobile internet use also grew substantially, with over 12.5 million active mobile
SDGs in OIC Countries: Data, Finance and Implementation - SDGsPlus
The document summarizes key data and developments regarding progress towards achieving the UN Sustainable Development Goals (SDGs) in Organization of Islamic Cooperation (OIC) countries. It notes that 42 OIC countries have submitted Voluntary National Reviews of their SDG implementation between 2016-2019. It also provides data on OIC countries' scores on the SDG Index, Human Capital Index, levels of financial inclusion, and economic growth rates, finding mixed progress across countries but with most having work still to do to achieve the 2030 targets.
Research project on investing in consumer brands and companies - Saar Gur
This document challenges conventional wisdom about preferring technology investments over consumer product investments. It finds that consumer companies making branded products have more attractive business models than retail aggregators. Specifically:
- Apparel and accessories is highlighted as an attractive vertical due to high brand values and financial profiles.
- Consumer electronics is seen as challenging due to low barriers and difficulty reaching end users.
- Retail aggregators have lower margins and returns than branded consumer product companies across many categories.
- Several consumer product categories have market sizes that are multiples larger than popular technology verticals.
So while technology companies may receive higher valuations, consumer product companies can build large, profitable businesses and achieve high market shares in their industries.
David Dean - e-friction refresh, Tunis AIS, 04 Jun 2015 (v3) - AFRINIC
This document discusses e-friction, which refers to factors that inhibit the development of the internet economy. It notes that while Africa has 300 million people online, representing 10% of the global internet population, 20% of the world's offline population is in Africa, with over 800 million non-users on the continent. The document then analyzes e-friction from a global perspective, noting increasing internet usage trends worldwide and the growing economic importance of the internet. It identifies various types of e-friction including issues related to infrastructure, industry, information, and individuals. The document concludes by presenting e-friction index scores for 65 countries.
Kepios's Simon Kemp explores some of the key digital trends and themes that will shape marketing success in 2023 and beyond. Topics in this presentation include:
▫️ The outlook for digital growth in 2023.
▫️ What you really need to know about marketing's "most-hyped", especially #NFTs and the #Metaverse.
▫️ Some surprising insights into the "demise" of #Facebook and the rise of #TikTok.
▫️ Simon's take on one of the hottest social platforms for 2023.
▫️ The longer-term solution to concerns around cookies, third-party data, and consumer privacy.
▫️ Why #messengers will become more important in the year ahead.
▫️ The devices and services that will revolutionise digital in the years to come.
The document discusses trends in internet, social media, online streaming, downloads, games and smartphone usage across several Middle Eastern and North African countries based on a 2013 study by Ipsos. Some key findings include:
- Internet penetration increased in all studied countries, with rates ranging from 55-92% across countries like Saudi Arabia, UAE, Egypt, Lebanon and others.
- Social network usage grew significantly, with penetration rates among internet users ranging from 81-99% across countries.
- Online streaming and downloads also increased substantially year-over-year in most countries studied.
- Smartphone ownership rose markedly, with penetration rates among total populations ranging from 36-79% in countries like Saudi Arabia, UAE
Big CInema Data: Analysing global cinema showtimesDeb Verhoeven
Looking at cinema exhibition and distribution at an international scale requires data beyond broad aggregates, it requires data that is specific to individual films and cinema venues in order to appreciate the intricate temporal and geographic aspects of flow and patterns. The Kinomatics Project has tracked the global flow of individual film screenings (down to date and time) for over 54,000 films for 30,000 venues throughout 48 countries internationally.
This presentation will highlight the importance of global scale analysis and data through three case studies. The first will track the spatial and temporal relationships of The Hobbit: an unexpected journey, highlighting the complexities of international cinema enterprises and the subtleties of contemporary releasing strategies. The second explores the relationship between remittance flows and the movement of film around the globe with a focus on Bollywood films. The thrid test dyadic relationships between countries. This presentation will introduce some methods for analysing and visualising data used in the three case studies.
Slides of Enjoy IT Team from Korea IT Volunteers (KIV) Presentation with REGOS Team from Indonesia ICT Volunteer (RTIK Jakarta Raya).
Dari slides singkat ini, kita dapat mengetahui kalau mau maju, kita harus memanfaatkan internet untuk BELAJAR bukan sekedar "Rekreas'"
Middle East and North Africa, the fastest growing region in the world.
IT spending in the MENA Region is forecast to grow 12% in 2010, faster than any other region. Only India, taken outside of its neighbours, is set to grow more rapidly.
The document contains tables with data on the SEO scores, index page search results, accessibility scores, and social media followers of various Korean and international universities, hospitals, companies and government websites. It compares metrics like the number of backlinks and page ranks of these sites on search engines like Google, Bing, Baidu and Naver.
Former Trade Minister of Indonesia H.E. Mr Gita Wirjawan delivered his Keynote Address on the second day of the 6th Asia Think Tank Summit organised by the Economic Research Institute for ASEAN and East Asia (ERIA) and the Think Tank and Civil Societies Program (TTCSP) of the University of Pennsylvania in Bali, Indonesia on 22 November 2018.
Trendeo Industrial investment in Africa may 2018Trendeo
This document summarizes data from the Industries & Strategies database tracking industrial investments in Africa from January 2016 to April 2018. It finds that during this period there were 569 projects announced worth $391 billion that were expected to create 184,241 jobs. The top four recipient countries - Egypt, South Africa, Morocco, and Nigeria - accounted for 33% of jobs, 39% of projects, and 56% of investment. Majority of investments came from within Africa, followed by France, China, and the United States.
This document summarizes digital media usage across Asia based on research conducted by Michael Netzley. It begins by noting the diversity within Asia and issues with viewing it through a Western lens. It then provides statistics on internet penetration rates in various Asian countries, showing China and South Korea as leaders. National social networks, search engines, and communication tools are also described as varying by country. Survey results from Singapore are presented showing differences in online behaviors by age. Reasons for going online and issues like internet blocking are also briefly discussed.
Honeypots Unveiled: Proactive Defense Tactics for Cyber Security, Phoenix Sum...APNIC
Adli Wahid, Senior Internet Security Specialist at APNIC, delivered a presentation titled 'Honeypots Unveiled: Proactive Defense Tactics for Cyber Security' at the Phoenix Summit held in Dhaka, Bangladesh from 23 to 24 May 2024.
Discover the benefits of outsourcing SEO to Indiadavidjhones387
"Discover the benefits of outsourcing SEO to India! From cost-effective services and expert professionals to round-the-clock work advantages, learn how your business can achieve digital success with Indian SEO solutions.
HijackLoader Evolution: Interactive Process HollowingDonato Onofri
CrowdStrike researchers have identified a HijackLoader (aka IDAT Loader) sample that employs sophisticated evasion techniques to enhance the complexity of the threat. HijackLoader, an increasingly popular tool among adversaries for deploying additional payloads and tooling, continues to evolve as its developers experiment and enhance its capabilities.
In their analysis of a recent HijackLoader sample, CrowdStrike researchers discovered new techniques designed to increase the defense evasion capabilities of the loader. The malware developer used a standard process hollowing technique coupled with an additional trigger that was activated by the parent process writing to a pipe. This new approach, called "Interactive Process Hollowing", has the potential to make defense evasion stealthier.
Securing BGP: Operational Strategies and Best Practices for Network Defenders...APNIC
Md. Zobair Khan,
Network Analyst and Technical Trainer at APNIC, presented 'Securing BGP: Operational Strategies and Best Practices for Network Defenders' at the Phoenix Summit held in Dhaka, Bangladesh from 23 to 24 May 2024.
Integrating Physical and Cybersecurity to Lower Risks in Healthcare!Alec Kassir cozmozone
The contemporary hospital setting is witnessing a growing convergence between physical security and cybersecurity. Because of advancements in technology and the rise in cyberattacks, healthcare facilities face unique challenges.
Integrating Physical and Cybersecurity to Lower Risks in Healthcare!
JCDL2015: How Well are Arabic Websites Archived?
1. How Well Are Arabic Websites Archived?
Lulwah M. Alkwai, Michael L. Nelson, and Michele C. Weigle
Old Dominion University
Department of Computer Science
Norfolk, Virginia 23529 USA
JCDL 2015
Knoxville, TN
June 21-25, 2015
11. Top ten languages on the Internet
World Language Map
Source: Quick Maps of the World - http://www.allcountries.org/maps/world_language_maps.html
Source: Internet World Stats - http://www.internetworldstats.com/stats7.htm
15. Why are we doing this?
Ø The number of Arabic-speaking Internet users has grown rapidly
Ø There has been previous work on the coverage of web archives
Ø Little has been done in terms of Arabic-language content
16. How Much of the Web Is Archived?
Ø Sampled URIs from four different sources (DMOZ, Delicious, Bitly, search engine indexes)
Ø The archival percentages ranged from 16% to 79%
2013, a follow-on study:
Ø Archival percentages had increased to a range of 33% to 95%
Ø These studies were not focused on content from specific countries or content in specific languages
17. A fair history of the Web? Examining country balance in the Internet Archive
Ø Examined country balance in the Internet Archive:

Country    Domain    Archived
US         .com      92%
Taiwan     .com.tw   73%
China      .com.cn   58%
Singapore  .com.sg   73%

Ø This work focused on TLD rather than content language or location
18. Characterization of National Web Domains
Ø Used 10 national web domains
§ 120 million pages
§ 24 countries
§ They studied page sizes, degrees, link-based scores, etc.
§ They found that depth and response codes were similar
Ø In this work, additional methods are required to determine if a site belongs to a particular country
19. Characterizing a National Community Web
Ø Used a Portuguese dataset:
§ the (.pt) ccTLD
§ (.com, .net, .org, .tv) sites in the Portuguese language with at least one incoming link from the (.pt) ccTLD
Ø They identify, collect, and characterize the Portuguese Web
20. How do we classify Arabic websites?
Each website falls into one of four categories: GeoIP only, ccTLD only, both, or neither.
² Educational: uoh.edu.sa — ccTLD: Arabic (.sa); GeoIP: Arabic country (Saudi Arabia) → both
² E-Marketing: haraj.com.sa — ccTLD: Arabic (.sa); GeoIP: not an Arabic country (Ireland) → ccTLD only
² News: al-watan.com — ccTLD: not Arabic (.com); GeoIP: Arabic country (Qatar) → GeoIP only
² News: alarabiya.net — ccTLD: not Arabic (.net); GeoIP: not an Arabic country (US) → neither
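The four-way classification above can be sketched in a few lines — a minimal illustration, not the paper's actual code, assuming a hypothetical precomputed GeoIP lookup table (standing in for a MaxMind-style database) and an illustrative subset of Arabic ccTLDs:

```python
# Illustrative (not exhaustive) sets; real code would use full lists.
ARABIC_CCTLDS = {"sa", "eg", "jo", "ae", "kw", "qa", "lb"}
ARABIC_COUNTRIES = {"Saudi Arabia", "Egypt", "Jordan", "Qatar", "Kuwait"}

# Hypothetical GeoIP results for the example hosts on the slide.
GEOIP = {
    "uoh.edu.sa": "Saudi Arabia",
    "haraj.com.sa": "Ireland",
    "al-watan.com": "Qatar",
    "alarabiya.net": "United States",
}

def classify(host):
    """Return one of: 'both', 'ccTLD only', 'GeoIP only', 'neither'."""
    has_cctld = host.rsplit(".", 1)[-1] in ARABIC_CCTLDS
    has_geoip = GEOIP.get(host) in ARABIC_COUNTRIES
    if has_cctld and has_geoip:
        return "both"
    if has_cctld:
        return "ccTLD only"
    if has_geoip:
        return "GeoIP only"
    return "neither"
```

For instance, `classify("haraj.com.sa")` yields "ccTLD only", matching the slide's example.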
24. Selecting seed URIs

Name     Registered    Year  URI                    URI count
DMOZ     US            1999  dmoz.org/world/arabic  4,086
Raddadi  Saudi Arabia  2000  raddadi.com            3,271
Star28   Lebanon       2004  star28.com             8,386
Total                                               15,743

• 15,092 unique seed URIs
• 11,014 URIs that existed on the live web
25. Determining a webpage's language
• HTTP header Content-Language
• HTML title tag language
• Trigram method
• Language detection API client
26. >
curl
–I
www.alquds.com
HTTP/1.1
200
OK
Server:
nginx/1.6.2
Date:
Wed,
03
Jun
2015
19:11:31
GMT
Content-‐Type:
text/html;
charset=utf-‐8
Connection:
keep-‐alive
X-‐Powered-‐By:
PHP/5.3.3
X-‐Drupal-‐Cache:
HIT
Etag:
"1433361507-‐0"
Content-‐Language:
ar
…
HTTP header Content-Language
example#1
26
27. >
curl
–I
www.alquds.com
HTTP/1.1
200
OK
Server:
nginx/1.6.2
Date:
Wed,
03
Jun
2015
19:11:31
GMT
Content-‐Type:
text/html;
charset=utf-‐8
Connection:
keep-‐alive
X-‐Powered-‐By:
PHP/5.3.3
X-‐Drupal-‐Cache:
HIT
Etag:
"1433361507-‐0"
Content-‐Language:
ar
…
HTTP header Content-Language
example#1
27
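Checking the Content-Language header can be automated; below is a minimal sketch that parses a raw header block (abbreviated from the curl output above) rather than making a live request:

```python
def content_language(raw_headers):
    """Return the Content-Language value from an HTTP header block, or None."""
    for line in raw_headers.splitlines():
        if line.lower().startswith("content-language:"):
            return line.split(":", 1)[1].strip()
    return None

# Abbreviated copy of the slide's curl -I output.
headers = """HTTP/1.1 200 OK
Server: nginx/1.6.2
Content-Type: text/html; charset=utf-8
Content-Language: ar"""

print(content_language(headers))  # -> ar
```

Note that many servers omit this header entirely, which is why the other three methods are needed.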
37. HTML title tag language, example #2
https://code.google.com/p/guess-language/

> curl -s www.cnn.com | grep -io "<title>[^<]*" | tail -c+8 > cnn_title.txt
> python
>>> myfile = open("cnn_title.txt", "r")
>>> data = myfile.read()
>>> from guess_language import guess_language
>>> guess_language(data)
'en'
39. Trigram method
§ Built in C++ and wrapped as a Python module
§ Identification is performed through basic trigram lookups paired with Unicode character set recognition
§ Accuracy is high even for short sample texts
https://github.com/decultured/Python-Language-Detector
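A toy illustration of the trigram idea — not the C++ Python-Language-Detector itself — using character-trigram profiles built from tiny made-up reference samples and scoring the input by trigram overlap:

```python
from collections import Counter

def trigrams(text):
    """Count overlapping character trigrams in a lowercased string."""
    text = text.lower()
    return Counter(text[i:i + 3] for i in range(len(text) - 2))

# Tiny illustrative reference profiles; real systems train on large corpora.
PROFILES = {
    "en": trigrams("the quick brown fox jumps over the lazy dog"),
    "ar": trigrams("اللغة العربية لغة مهمة وجميلة ومنتشرة"),
}

def identify(text):
    """Return the profile language with the largest trigram overlap."""
    t = trigrams(text)
    scores = {lang: sum((t & profile).values()) for lang, profile in PROFILES.items()}
    return max(scores, key=scores.get)

print(identify("the lazy fox"))  # -> en
```

Because Arabic and English use disjoint scripts, even this crude overlap score separates them cleanly; distinguishing languages that share a script needs much larger profiles.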
40. Trigram method, example #1
https://github.com/decultured/Python-Language-Detector

> curl www.raddadi.com > raddadi.txt
> python
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(open("raddadi.txt"))
>>> for script in soup(["script", "style"]):
...     script.extract()
>>> text = soup.get_text()
>>> lines = (line.strip() for line in text.splitlines())
>>> chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
>>> text = '\n'.join(chunk for chunk in chunks if chunk)
>>> import sys
>>> sys.path.append('languageDetector')
>>> import languageIdentifier
>>> languageIdentifier.load("languageDetector/trigrams/")
>>> print languageIdentifier.identify(text, 300, 300)
ar
43. Trigram method, example #2
https://github.com/decultured/Python-Language-Detector

> curl www.cnn.com > cnn.txt
> python
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(open("cnn.txt"))
>>> for script in soup(["script", "style"]):
...     script.extract()
>>> text = soup.get_text()
>>> lines = (line.strip() for line in text.splitlines())
>>> chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
>>> text = '\n'.join(chunk for chunk in chunks if chunk)
>>> import sys
>>> sys.path.append('languageDetector')
>>> import languageIdentifier
>>> languageIdentifier.load("languageDetector/trigrams/")
>>> print languageIdentifier.identify(text, 300, 300)
en
46. Language detection API client
https://detectlanguage.com
• Returns detected language codes and scores
• You have to set up your personal API key (http://detectlanguage.com)
• Example output:
{"data":{"detections":
[{"language":"ar","isReliable":true,"confidence":9.54}]}}
• "language" is the detected language code
• "confidence" reflects how much text you pass and how well it is identified
• "isReliable": false means that the confidence is low
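The JSON response can be filtered down to just the reliable detections; a small sketch using the example output above (parsing only, no API call):

```python
import json

# The example response from the slide.
response = '''{"data":{"detections":
[{"language":"ar","isReliable":true,"confidence":9.54}]}}'''

def reliable_languages(raw):
    """Return language codes the service marked as reliable."""
    detections = json.loads(raw)["data"]["detections"]
    return [d["language"] for d in detections if d["isReliable"]]

print(reliable_languages(response))  # -> ['ar']
```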
48. Language detection API client, example #1
https://detectlanguage.com

> curl www.raddadi.com > raddadi.txt
> python
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(open("raddadi.txt"))
>>> for script in soup(["script", "style"]):
...     script.extract()
>>> text = soup.get_text()
>>> lines = (line.strip() for line in text.splitlines())
>>> chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
>>> text = '\n'.join(chunk for chunk in chunks if chunk)
>>> import detectlanguage
>>> detectlanguage.configuration.api_key = "YOUR API KEY"
>>> detectlanguage.detect(text)
{"data":{"detections":
[{"language":"ar","isReliable":true,"confidence":8.32},
{"language":"tk","isReliable":false,"confidence":0.01}]}}
51. Language detection API client, example #2
https://detectlanguage.com

> curl www.cnn.com > cnn.txt
> python
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(open("cnn.txt"))
>>> for script in soup(["script", "style"]):
...     script.extract()
>>> text = soup.get_text()
>>> lines = (line.strip() for line in text.splitlines())
>>> chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
>>> text = '\n'.join(chunk for chunk in chunks if chunk)
>>> import detectlanguage
>>> detectlanguage.configuration.api_key = "YOUR API KEY"
>>> detectlanguage.detect(text)
{"data":{"detections":
[{"language":"en","isReliable":true,"confidence":6.14}]}}
64. 17,536 unique domains

Rank  Domain             URIs  GeoIP   Category
1     alarab.net         284   US      News
2     aljarida.com       248   US      News
3     arabic.cnn.com     245   US      News
4     alarabiya.net      231   US      News
5     ar.wikipedia.org   230   US      Encyclopedia
6     aljazeera.net      213   US      News
7     moheet.com         142   US      News
8     facebook.com       133   US      Social
9     al-sharq.com       132   US      Middle East Portal
10    lakii.com          123   US      General Portal
17    kuwaitclub.com.kw  71    Kuwait  Sport

• The first Arabic GeoIP location is at rank 17
• 6 out of the 10 top unique domains are news websites
• Popular Western pages are among the top unique domains
68. TLD distribution

TLD     Percent
com     57.97%
net     15.07%
org     6.40%
gov.sa  1.94%
info    1.68%
edu.sa  1.27%
ws      1.16%
org.sa  0.97%
com.sa  0.80%
gov.eg  0.80%
Other   11.94%

Almost 58% are .com; only a small percentage have an Arabic TLD
71. Small percentage of Arabic ccTLDs

TLD  Country               Percent
.sa  Saudi Arabia          5.33%
.eg  Egypt                 2.00%
.jo  Jordan                2.00%
.ae  United Arab Emirates  1.06%
.kw  Kuwait                0.82%
73. Path depth of seed URIs

Path Depth  Example              Percent
0           example.com          17.30%
1           example.com/a        40.42%
2           example.com/a/b      24.45%
3           example.com/a/b/c    10.81%
4+          example.com/a/b/c/d  7.02%

More than 57% are of depth 0 or 1
75. 53.77% of Arabic URIs are archived
• January–March 2015
• ODU CS Memento Aggregator
• Median = 16 mementos
76. Most of the top archived URI-Rs are news websites

URI-R              Mementos  Category
gulfup.com         10,987    File Sharing
masrawy.com        9,144     Egyptian portal
arabic.cnn.com     9,022     News
aljazeera.net      8,906     News
maktoob.yahoo.com  8,478     Search Engine
shorooknews.com    7,548     News
arabnews.com       6,274     News
bbc.co.uk/arabic   6,268     News
ahram.org.eg       5,347     News
google.com.sa      4,968     Search Engine
80. Two methods to determine the presence in each archive
1. Percent of URI-Rs present in each archive
   e.g., http://aljazeera.net
2. Percent of URI-Ms present in each archive
   e.g., http://wayback.archive-it.org/all/20070727215420/http://www.aljazeera.net/
   e.g., http://web.archive.org/web/20150618104846/http://aljazeera.net/
81. Presence in each archive, example

        Internet Archive  Archive.today  Webcitation  Total
URI-R1  2                 0              0            2
URI-R2  2                 0              0            2
URI-R3  1                 1              0            2
URI-R4  1                 1              0            2
URI-R5  0                 1              1            2
Total   6                 3              1            10

1- Percent of URI-Rs present in each archive:
Archive           Total      Percentage
Internet Archive  4/5 = 0.8  80%
Archive.today     3/5 = 0.6  60%
Webcitation       1/5 = 0.2  20%
Total                        160%

2- Percent of URI-Ms present in each archive:
Archive           Total       Percentage
Internet Archive  6/10 = 0.6  60%
Archive.today     3/10 = 0.3  30%
Webcitation       1/10 = 0.1  10%
Total                         100%
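The two measures in the example can be computed directly from the memento-count table; a sketch using the same numbers:

```python
# Rows are URI-Rs, columns are archives, values are memento counts
# (copied from the example table above).
counts = {
    "URI-R1": {"Internet Archive": 2, "Archive.today": 0, "Webcitation": 0},
    "URI-R2": {"Internet Archive": 2, "Archive.today": 0, "Webcitation": 0},
    "URI-R3": {"Internet Archive": 1, "Archive.today": 1, "Webcitation": 0},
    "URI-R4": {"Internet Archive": 1, "Archive.today": 1, "Webcitation": 0},
    "URI-R5": {"Internet Archive": 0, "Archive.today": 1, "Webcitation": 1},
}
archives = ["Internet Archive", "Archive.today", "Webcitation"]

# 1- percent of URI-Rs with at least one memento in each archive
# (can sum to more than 100% across archives)
urir_pct = {a: 100 * sum(1 for row in counts.values() if row[a] > 0) / len(counts)
            for a in archives}

# 2- percent of all URI-Ms held by each archive (sums to 100%)
total_mementos = sum(sum(row.values()) for row in counts.values())
urim_pct = {a: 100 * sum(row[a] for row in counts.values()) / total_mementos
            for a in archives}

print(urir_pct)  # Internet Archive: 80.0, Archive.today: 60.0, Webcitation: 20.0
print(urim_pct)  # Internet Archive: 60.0, Archive.today: 30.0, Webcitation: 10.0
```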
84. Presence in each archive

1- Percent of URI-Rs present in each archive:
Archive                    Percent
Internet Archive           97.04%
Archive.today              6.58%
Webcitation                6.00%
Archive-It                 5.49%
British Library Archive    1.06%
UK Parliament Web Archive  0.88%
Icelandic Web Archive      0.87%
UK National Archives       0.62%
PRONI                      0.21%
Stanford                   0.11%
Total                      118.86%

2- Percent of URI-Ms present in each archive:
Archive                    Percent
Internet Archive           72.87%
Archive-It                 21.26%
Archive.today              2.14%
Webcitation                2.08%
Icelandic Web Archive      1.17%
British Library Archive    0.29%
UK Parliament Web Archive  0.10%
PRONI                      0.05%
UK National Archives       0.04%
Stanford                   <0.01%
Total                      100%
87. Average archiving period (days)
Average archiving period = (LM − FM) / number of mementos
(LM = datetime of the last memento, FM = datetime of the first memento)
• 16,732 URIs have only one memento
• Median = 48 days
• Values less than 1 indicate that the URI is archived multiple times per day
• The larger the period, the more irregularly the URI was captured by the archives
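The formula above can be sketched in a few lines; the memento datetimes below are made up for illustration:

```python
from datetime import datetime

def avg_archiving_period(memento_datetimes):
    """(LM - FM) / number of mementos, in days; None for a single memento."""
    if len(memento_datetimes) < 2:
        return None  # the period is undefined with only one capture
    fm, lm = min(memento_datetimes), max(memento_datetimes)
    return (lm - fm).days / len(memento_datetimes)

mementos = [datetime(2013, 1, 1), datetime(2014, 1, 1), datetime(2015, 1, 1)]
print(avg_archiving_period(mementos))  # 730 days over 3 mementos -> ~243.3
```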
89. Creation date for archived Arabic URIs
We used CarbonDate to estimate creation dates
Source: http://ws-dl.blogspot.com/2014/11/2014-11-14-carbon-dating-web-version-20.html
92. Top GeoIP locations

Country           Percent
United States     57.97%
Arabic countries  10.53%
Germany           9.75%
Netherlands       5.29%
France            4.37%
Canada            3.31%
United Kingdom    3.07%
Other             5.71%

Breakdown of the Arabic countries:
Country               Percent
Saudi Arabia          4.75%
Egypt                 1.97%
Jordan                1.42%
Kuwait                0.71%
United Arab Emirates  0.67%
96. Status of Arabic seed URIs

Seed Data Set
(Live, Indexed, Archived)  Percent
(1, 1, 1)                  43.34%
(1, 1, 0)                  25.59%
(1, 0, 1)                  15.27%
(1, 0, 0)                  15.76%

• (1, 1, 1) is the good case: discovered and saved
• (1, 0, 0) is the bad case: undiscovered and not saved
• 31% were not indexed by Google
100. Difference between creation date and first memento
• 18% have creation dates over 1 year before the first memento was archived
• 19.48% of the URIs have an estimated creation date that is the same as the first memento date
101. DMOZ URIs are more likely to be found and archived

Seed Data Set
Source   Arabic  Archived  Indexed
DMOZ     34.43%  95.52%    82.13%
Raddadi  19.88%  45.44%    65.83%
Star28   45.69%  41.54%    65.23%
104. Full Data Set

Category  Total   Archived
Arabic    33.18%  33.56%
Neither   66.82%  65.22%

Category  Total   Archived
AR ccTLD  14.84%  28.09%
AR GeoIP  10.53%  13.11%
AR both   7.81%   59.50%
Neither   66.82%  65.22%

URIs hosted in Western countries are more likely to be archived
106. Seed Data Set

Category  Total   Indexed
Arabic    15.01%  78.29%
Neither   84.99%  65.22%

Category  Total   Indexed
AR ccTLD  6.61%   76.09%
AR GeoIP  2.37%   73.54%
AR both   6.03%   85.24%
Neither   84.99%  67.09%

URIs that had some Arabic location had a higher indexing rate
108. The spread of mementos was not affected by location or ccTLD
Ø Kolmogorov-Smirnov test

Category  Mean
AR GeoIP  0.5010
AR ccTLD  0.5013
Both      0.5016
Neither   0.5005

Category              D-Value  P-Value
AR ccTLD vs. neither  0.017    <0.002
AR GeoIP vs. neither  0.014    <0.002
109. Just because a webpage is older, it does not mean that it is archived more, because of low historical archiving rates
110. We look at the last three years
112. In the last three years, the older the resource is, the more mementos it has
113. Top-level URIs are more likely to be archived and indexed

            Full Data Set     Seed Data Set
Path Depth  Total   Archived  Total   Indexed
0           17.30%  86.29%    86.05%  74.60%
1           40.42%  53.49%    9.77%   38.91%
2           24.45%  45.57%    3.72%   17.85%
3+          17.83%  34.24%    0.50%   57.50%
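Path depth as used in the table can be computed by counting non-empty path segments after the hostname; a minimal sketch:

```python
from urllib.parse import urlparse

def path_depth(uri):
    """Number of non-empty path segments, e.g. example.com/a/b -> 2."""
    path = urlparse(uri).path
    return len([seg for seg in path.split("/") if seg])

print(path_depth("http://example.com"))        # -> 0
print(path_depth("http://example.com/a/b/c"))  # -> 3
```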
116. Summary of collection methods
• Collected URIs from three Arabic directories (7,976):
Ø DMOZ
Ø Raddadi.com
Ø Star28.com
• Crawled the seed dataset (1,299,671)
• Checked which URIs are unique (663,443)
• Checked which URIs are live (482,905)
• Checked for the Arabic language (300,646)
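The winnowing steps above can be sketched as a small pipeline, with hypothetical stand-in predicates for the liveness and language checks (the real study used HTTP requests and the language detectors described earlier):

```python
def winnow(uris, is_live, is_arabic):
    """Deduplicate, then filter to live URIs, then to Arabic-language URIs."""
    unique = list(dict.fromkeys(uris))          # dedupe while keeping order
    live = [u for u in unique if is_live(u)]
    arabic = [u for u in live if is_arabic(u)]
    return unique, live, arabic

# Toy data standing in for the crawled seed dataset.
unique, live, arabic = winnow(
    ["http://a.com", "http://a.com", "http://b.com", "http://c.com"],
    is_live=lambda u: u != "http://c.com",
    is_arabic=lambda u: u == "http://a.com")
print(len(unique), len(live), len(arabic))  # -> 3 2 1
```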
117. Findings
§ Our Arabic-language dataset was not largely located in Arabic countries
Ø Only 14.84% had an Arabic ccTLD
Ø Only 10.53% had a GeoIP in an Arabic country
Ø Popular Western domains (e.g., cnn.com, wikipedia.org) appeared in the top 10
§ Arabic webpages are not particularly well archived or indexed
Ø 46% were not archived
Ø 31% were not indexed by Google
§ An Arabic webpage is more likely to be...
Ø indexed if it is present in a directory
Ø archived if it is present in DMOZ
Ø archived if it has neither an Arabic GeoIP nor an Arabic ccTLD
For right now, if you want your Arabic-language webpage to be archived, host it outside of an Arabic country and get it listed in DMOZ
120. GeoIP Location
• We obtained the IP addresses of the hostnames using nslookup (which uses DNS to convert the hostname to its IP address)
• We used the MaxMind GeoLite2 database to determine location from the IP address (which tests at 99.8% accuracy at the country level)
http://dev.maxmind.com/geoip/geoip2/geolite2/
http://dev.maxmind.com/faq/how-accurate-are-the-geoip-databases/