JCDL2015: How Well are Arabic Websites Archived?

  1. How Well Are Arabic Websites Archived? Lulwah M. Alkwai, Michael L. Nelson, and Michele C. Weigle. Old Dominion University, Department of Computer Science, Norfolk, Virginia 23529 USA. JCDL 2015, Knoxville, TN, June 21-25, 2015.
  2. Archived events on English sites vs. Arabic sites
      • Search: Baltimore (one week old): http://www.foxnews.com/us/2015/05/26/2-shot-dead-in-bloody-memorial-day-weekend-in-baltimore-capping-off-deadliest/
      • Search: Yemen Houthis (one week old): http://www.yemenakhbar.com/yemen-news/178683.html
  8. English sports websites are more archived than Arabic ones: www.espn.go.com vs. www.kooora.com
  9. English e-marketing websites are more archived than Arabic ones: www.amazon.com vs. www.haraj.com.sa
  10. English encyclopedia websites are more archived than Arabic ones: en.wikipedia.org vs. ar.wikipedia.org
  11. Top ten languages in the Internet
      • World Language Map. Source: Quick Maps of the World immigration - http://www.allcountries.org/maps/world_language_maps.html
      • Source: Internet World Stats - http://www.internetworldstats.com/stats7.htm
  12. Arabic speaking Internet users (Source: http://www.internetworldstats.com/stats19.htm)
      | # | Country | 2009 Population | 2009 Internet Users | 2009 Penetration | 2013 Population | 2013 Internet Users | 2013 Penetration |
      | 1 | Algeria | 34,178,188 | 4,100,000 | 12.00% | 38,813,722 | 6,404,264 | 16.50% |
      | 2 | Bahrain | 728,709 | 402,900 | 55.30% | 1,314,089 | 1,182,680 | 90.00% |
      | 3 | Comoros | 752,438 | 23,000 | 3.10% | 766,865 | 49,846 | 6.50% |
      | 4 | Djibouti | 724,622 | 19,200 | 2.60% | 810,179 | 76,967 | 9.50% |
      | 5 | Egypt | 78,866,635 | 16,636,000 | 21.10% | 86,895,099 | 43,065,211 | 49.60% |
      | 6 | Iraq | 28,945,569 | 300,000 | 1.00% | 32,585,692 | 2,997,884 | 9.20% |
      | 7 | Jordan | 6,269,285 | 1,595,200 | 25.40% | 6,528,061 | 2,885,403 | 44.20% |
      | 8 | Kuwait | 2,692,526 | 1,000,000 | 37.10% | 2,742,711 | 2,069,650 | 75.50% |
      | 9 | Lebanon | 4,017,095 | 945,000 | 23.50% | 4,136,895 | 2,916,511 | 70.50% |
      | 10 | Libya | 6,324,357 | 323,000 | 5.10% | 6,244,174 | 1,030,289 | 16.50% |
      | 11 | Mauritania | 3,129,486 | 60,000 | 1.90% | 3,516,806 | 218,042 | 6.20% |
      | 12 | Morocco | 31,285,174 | 10,442,500 | 33.40% | 32,987,206 | 18,472,835 | 56.00% |
      | 13 | Oman | 3,418,085 | 557,000 | 16.30% | 3,219,775 | 2,139,540 | 66.40% |
      | 14 | Qatar | 833,285 | 436,000 | 52.30% | 2,123,160 | 1,811,055 | 85.30% |
      | 15 | Saudi Arabia | 28,686,633 | 7,761,800 | 27.10% | 27,345,986 | 16,544,322 | 60.50% |
      | 16 | Somalia | 9,832,017 | 102,000 | 1.00% | 10,428,043 | 156,420 | 1.50% |
      | 17 | South Sudan | - | - | - | 11,562,695 | 100 | 0.00% |
      | 18 | Sudan | 41,087,825 | 4,200,000 | 10.20% | 35,482,233 | 8,054,467 | 22.70% |
      | 19 | Syria | 21,762,978 | 3,565,000 | 16.40% | 22,597,531 | 5,920,553 | 26.20% |
      | 20 | Tunisia | 10,486,339 | 3,500,000 | 33.40% | 10,937,521 | 4,790,634 | 43.80% |
      | 21 | UAE | 4,798,491 | 3,558,000 | 74.10% | 9,206,000 | 8,101,280 | 88.00% |
      | 22 | Palestine | 2,461,267 | 355,500 | 14.40% | 2,731,052 | 1,512,273 | 55.40% |
      | 23 | Yemen | 22,858,238 | 370,000 | 1.60% | 26,052,966 | 5,210,593 | 20.00% |
      | | Arabic Total | 344,139,242 | 60,252,100 | 17.50% | 379,028,461 | 135,610,819 | 35.8% |
      | | World Total | 6,767,805,208 | 1,802,330,457 | 26.6% | 7,181,858,619 | 2,802,478,934 | 39.0% |
      • 2009: Arabic Total = 17.5%, World Total = 26.6%
      • 2013: Arabic Total = 35.8%, World Total = 39.0%
  15. Why are we doing this?
      • The number of Arabic speaking Internet users has grown rapidly
      • There has been previous work on the coverage of web archives
      • Little has been done in terms of Arabic language content
  16. How Much of the Web Is Archived?
      • Sample of URIs from four different sources (DMOZ, Delicious, Bitly, search engine indexes)
      • The archival percentages ranged from 16% to 79%
      2013, a follow-on study:
      • Archival percentages had increased, ranging from 33% to 95%
      • These studies were not focused on content from specific countries or content in specific languages
  17. A fair history of the Web? Examining country balance in the Internet Archive
      • Examined country balance in the Internet Archive:
      | Country | Domain | Archived |
      | US | .com | 92% |
      | Taiwan | .com.tw | 73% |
      | China | .com.cn | 58% |
      | Singapore | .com.sg | 73% |
      • This work focused on TLD rather than content language or location
  18. Characterization of National Web Domains
      • Used 10 national web domains
        - 120 million pages
        - 24 countries
        - They studied page sizes, degrees, link based scores, etc.
        - They found that depth and response codes were similar
      • In this work, additional methods are required to determine if a site belongs to a particular country
  19. Characterizing a National Community Web
      • Used a Portuguese dataset:
        - (.pt) ccTLD
        - (.com, .net, .org, .tv) sites in the Portuguese language that have at least one incoming link from the (.pt) ccTLD
      • They identify, collect, and characterize the Portuguese Web
  20. How do we classify Arabic websites?
      | Class | Example | Category | ccTLD | GeoIP |
      | GeoIP only | al-watan.com | News | Not Arabic (.com) | Arabic country (Qatar) |
      | ccTLD only | haraj.com.sa | E-Marketing | Arabic (.sa) | Not an Arabic country (Ireland) |
      | Both | uoh.edu.sa | Educational | Arabic (.sa) | Arabic country (SA) |
      | Neither | alarabiya.net | News | Not Arabic (.net) | Not an Arabic country (US) |
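The four classes above can be sketched as a small function. This is a minimal illustration rather than the authors' code: `classify_site`, the deliberately partial `ARABIC_CCTLDS` and `ARABIC_COUNTRY_CODES` sets, and the assumption that a GeoIP lookup has already produced an ISO country code are all hypothetical.

```python
# Partial, illustrative sets; a real study would list every Arabic country.
ARABIC_CCTLDS = {"sa", "eg", "jo", "ae", "kw", "qa"}
ARABIC_COUNTRY_CODES = {"SA", "EG", "JO", "AE", "KW", "QA"}


def classify_site(hostname, geoip_country):
    """Return 'Both', 'ccTLD only', 'GeoIP only', or 'Neither'."""
    # The last dot-separated label of the hostname is the TLD.
    tld = hostname.rstrip(".").rsplit(".", 1)[-1].lower()
    has_cctld = tld in ARABIC_CCTLDS
    has_geoip = geoip_country.upper() in ARABIC_COUNTRY_CODES
    if has_cctld and has_geoip:
        return "Both"
    if has_cctld:
        return "ccTLD only"
    if has_geoip:
        return "GeoIP only"
    return "Neither"


if __name__ == "__main__":
    # Hostnames and GeoIP countries taken from the slide examples.
    for host, country in [("al-watan.com", "QA"), ("haraj.com.sa", "IE"),
                          ("uoh.edu.sa", "SA"), ("alarabiya.net", "US")]:
        print(host, "->", classify_site(host, country))
```

A site in the "Neither" class can still be Arabic, which is why the language tests on the following slides are needed.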
  24. Selecting seed URIs
      | Name | Registered | Year | URI | Count |
      | DMOZ | US | 1999 | Dmoz.org/world/arabic | 4,086 |
      | Raddadi | Saudi Arabia | 2000 | Raddadi.com | 3,271 |
      | Star28 | Lebanon | 2004 | Star28.com | 8,386 |
      | Total | | | | 15,743 |
      • 15,092 unique seed URIs
      • 11,014 URIs that existed in the live web
  25. Determining a webpage language
      • HTTP header Content-Language
      • HTML title tag language
      • Trigram method
      • Language detection API client
  26. HTTP header Content-Language example #1
      > curl -I www.alquds.com
      HTTP/1.1 200 OK
      Server: nginx/1.6.2
      Date: Wed, 03 Jun 2015 19:11:31 GMT
      Content-Type: text/html; charset=utf-8
      Connection: keep-alive
      X-Powered-By: PHP/5.3.3
      X-Drupal-Cache: HIT
      Etag: "1433361507-0"
      Content-Language: ar
      …
  28. HTTP header Content-Language example #2
      No Content-Language header appears in the output shown, although the page itself declares lang="ar".
      > curl -I www.raddadi.com
      HTTP/1.1 200 OK
      Server: nginx/1.8.0
      Date: Sat, 06 Jun 2015 22:47:09 GMT
      Content-Type: text/html
      Connection: keep-alive
      …
      > curl www.raddadi.com
      <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
      <html dir="rtl" xmlns="http://www.w3.org/1999/xhtml" xml:lang="ar" lang="ar" >
      <head>
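The two Content-Language examples suggest a simple fallback order: trust the HTTP header when it is present, otherwise look at the `lang` attribute of the `<html>` element. The sketch below works on already-fetched headers and markup; `declared_language` is a hypothetical helper and the regular expression is a rough approximation, not a full HTML parser.

```python
import re

# Matches lang="xx" (or xml:lang="xx") inside the opening <html ...> tag.
HTML_LANG = re.compile(r'<html\b[^>]*?\blang\s*=\s*["\']?([A-Za-z-]+)',
                       re.IGNORECASE)


def declared_language(headers, html):
    """Return the declared language code, or None if nothing is declared."""
    for name, value in headers.items():
        # Header names are case-insensitive; the value may list several tags.
        if name.lower() == "content-language" and value.strip():
            return value.split(",")[0].strip().lower()
    match = HTML_LANG.search(html)
    return match.group(1).lower() if match else None


if __name__ == "__main__":
    print(declared_language({"Content-Language": "ar"}, ""))
    print(declared_language(
        {"Content-Type": "text/html"},
        '<html dir="rtl" xmlns="http://www.w3.org/1999/xhtml" '
        'xml:lang="ar" lang="ar" >'))
```

A declared language is only a claim by the server or author, so the content-based tests on the later slides remain useful as a cross-check.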
  31. HTML title tag language (https://code.google.com/p/guess-language/)
      > curl www.star28.com
      …
      <META name="Copyright" content="© 2011 www.star28.com">
      <META name="DISTRIBUTION" content="GLOBAL">
      <META name="REVISIT-AFTER" content="1 DAYS">
      <TITLE> دليل العرب الشامل </TITLE>
      <META name="description" content=" دليل للمواقع العربية و أفضل المواقع العالمية, يحدث باستمرار ">
      <META name="keywords" content=" دليل مواقع, تجارة, العاب, جافا سكربت, رياضة, منتديات, علوم, كومبيوتر, اسلام, اخبار, صحف, تلفزيون, سياحة, تعليم, زواج, توظيف ">
      …
      The title reads "The Comprehensive Arab Directory". Then we use the guess-language Python library to determine the language.
  33. HTML title tag language example #1 (https://code.google.com/p/guess-language/)
      > curl -s www.gulfup.com | grep -io "<title>[^<]*" | tail -c+8 > gulfup_title.txt
      > python
      >>> myfile = open("gulfup_title.txt", "r")
      >>> data = myfile.read()
      >>> from guess_language import guess_language
      >>> guess_language(data)
      'ar'
  36. HTML title tag language example #2 (https://code.google.com/p/guess-language/)
      > curl -s www.cnn.com | grep -io "<title>[^<]*" | tail -c+8 > cnn_title.txt
      > python
      >>> myfile = open("cnn_title.txt", "r")
      >>> data = myfile.read()
      >>> from guess_language import guess_language
      >>> guess_language(data)
      'en'
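The `grep` and `tail -c+8` pipeline in the two title examples keeps everything after the seven-character `<title>` tag. A rough Python equivalent is sketched below; `extract_title` is a hypothetical helper, and the regular expression is an assumption that only suits simple, well-formed title elements.

```python
import re

TITLE = re.compile(r"<title[^>]*>(.*?)</title>", re.IGNORECASE | re.DOTALL)


def extract_title(html):
    """Return the text of the first <title> element, or None if absent."""
    match = TITLE.search(html)
    if match is None:
        return None
    # Collapse runs of whitespace so the detector sees clean text.
    return " ".join(match.group(1).split())


if __name__ == "__main__":
    print(extract_title("<TITLE> دليل العرب الشامل </TITLE>"))
```

The returned string can then be passed to a language detector in place of the text read from the temporary file.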
  39. Trigram method (https://github.com/decultured/Python-Language-Detector)
      • Built in C++ and wrapped as a Python module
      • Identification is performed through basic trigram lookups paired with Unicode character set recognition
      • Accuracy is high even for short sample texts
  40. Trigram method example #1 (https://github.com/decultured/Python-Language-Detector)
      > curl www.raddadi.com > raddadi.txt
      > python
      >>> from bs4 import BeautifulSoup
      >>> soup = BeautifulSoup(open("raddadi.txt"))
      >>> # Drop script and style elements so only visible text remains
      >>> for script in soup(["script", "style"]):
      ...     script.extract()
      ...
      >>> text = soup.get_text()
      >>> # Strip each line, split on double spaces, and drop empty chunks
      >>> lines = (line.strip() for line in text.splitlines())
      >>> chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
      >>> text = '\n'.join(chunk for chunk in chunks if chunk)
      >>> import sys
      >>> sys.path.append('languageDetector')
      >>> import languageIdentifier
      >>> languageIdentifier.load("languageDetector/trigrams/")
      >>> print languageIdentifier.identify(text, 300, 300)
      ar
  43. Trigram method example #2 (https://github.com/decultured/Python-Language-Detector)
      > curl www.cnn.com > cnn.txt
      > python
      >>> from bs4 import BeautifulSoup
      >>> soup = BeautifulSoup(open("cnn.txt"))
      >>> # Drop script and style elements so only visible text remains
      >>> for script in soup(["script", "style"]):
      ...     script.extract()
      ...
      >>> text = soup.get_text()
      >>> # Strip each line, split on double spaces, and drop empty chunks
      >>> lines = (line.strip() for line in text.splitlines())
      >>> chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
      >>> text = '\n'.join(chunk for chunk in chunks if chunk)
      >>> import sys
      >>> sys.path.append('languageDetector')
      >>> import languageIdentifier
      >>> languageIdentifier.load("languageDetector/trigrams/")
      >>> print languageIdentifier.identify(text, 300, 300)
      en
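To give a feel for the two ingredients named on the trigram slide, the sketch below counts character trigrams and measures how much of a text falls in the basic Arabic Unicode block (U+0600 to U+06FF). It is a toy illustration, not the Python-Language-Detector implementation; `char_trigrams` and `arabic_script_ratio` are hypothetical names.

```python
from collections import Counter


def char_trigrams(text):
    """Count overlapping three-character sequences in text."""
    return Counter(text[i:i + 3] for i in range(len(text) - 2))


def arabic_script_ratio(text):
    """Fraction of alphabetic characters in the basic Arabic block."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    arabic = sum(1 for ch in letters if "\u0600" <= ch <= "\u06FF")
    return arabic / len(letters)


if __name__ == "__main__":
    print(char_trigrams("arab").most_common())
    print(arabic_script_ratio("دليل العرب الشامل"))
```

Script recognition alone cannot separate Arabic from other languages written in the same script, such as Persian or Urdu, which is where comparing trigram profiles against per-language models comes in.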
  46. Language detection API client (https://detectlanguage.com)
      • Returns detected language codes and scores
      • You have to set up your personal API key (http://detectlanguage.com)
      • Example of output:
        {"data":{"detections":[{"language":"ar","isReliable":true,"confidence":9.54}]}}
        - language: the language code
        - isReliable: false means that the confidence is low
        - confidence: depends on how much text you pass and how well it is identified
  48. Language detection API client example #1 (https://detectlanguage.com)
      > curl www.raddadi.com > raddadi.txt
      > python
      >>> from bs4 import BeautifulSoup
      >>> soup = BeautifulSoup(open("raddadi.txt"))
      >>> # Drop script and style elements so only visible text remains
      >>> for script in soup(["script", "style"]):
      ...     script.extract()
      ...
      >>> text = soup.get_text()
      >>> # Strip each line, split on double spaces, and drop empty chunks
      >>> lines = (line.strip() for line in text.splitlines())
      >>> chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
      >>> text = '\n'.join(chunk for chunk in chunks if chunk)
      >>> import detectlanguage
      >>> detectlanguage.configuration.api_key = "YOUR API KEY"
      >>> detectlanguage.detect(text)
      {"data":{"detections":[{"language":"ar","isReliable":true,"confidence":8.32},{"language":"tk","isReliable":false,"confidence":0.01}]}}
  51. Language detection API client example #2 (https://detectlanguage.com)
      > curl www.cnn.com > cnn.txt
      > python
      >>> from bs4 import BeautifulSoup
      >>> soup = BeautifulSoup(open("cnn.txt"))
      >>> # Drop script and style elements so only visible text remains
      >>> for script in soup(["script", "style"]):
      ...     script.extract()
      ...
      >>> text = soup.get_text()
      >>> # Strip each line, split on double spaces, and drop empty chunks
      >>> lines = (line.strip() for line in text.splitlines())
      >>> chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
      >>> text = '\n'.join(chunk for chunk in chunks if chunk)
      >>> import detectlanguage
      >>> detectlanguage.configuration.api_key = "YOUR API KEY"
      >>> detectlanguage.detect(text)
      {"data":{"detections":[{"language":"en","isReliable":true,"confidence":6.14}]}}
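A response shaped like the JSON shown on the API slides can be reduced to a single language code by keeping only reliable detections and taking the most confident one. This is a sketch under the assumption that the raw JSON text is available as printed above; `best_reliable_language` is a hypothetical helper and not part of the detectlanguage client.

```python
import json


def best_reliable_language(response_text):
    """Return the most confident reliable language code, or None."""
    detections = json.loads(response_text)["data"]["detections"]
    reliable = [d for d in detections if d.get("isReliable")]
    if not reliable:
        return None
    return max(reliable, key=lambda d: d["confidence"])["language"]


if __name__ == "__main__":
    # Sample copied from the raddadi.com example output.
    sample = ('{"data":{"detections":['
              '{"language":"ar","isReliable":true,"confidence":8.32},'
              '{"language":"tk","isReliable":false,"confidence":0.01}]}}')
    print(best_reliable_language(sample))
```

Returning `None` when no detection is reliable leaves the decision to the other language tests instead of forcing a guess.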
  54. Language test intersection testing for Arabic language
      [Figure: Venn diagram of the four language tests. Labels shown: ~41%, ~38%, ~36%, ~39%; 872 (~8%). Total Arabic = 7,976]
  60. Crawling Arabic seed URIs
      • Unique: 663,443
      • Total Arabic URIs Dataset = (7,976 + 292,670) = 300,646
  64. 17,536 unique domains
      | Rank | Domain | URIs | GeoIP | Category |
      | 1 | Alarab.net | 284 | US | News |
      | 2 | Aljarida.com | 248 | US | News |
      | 3 | Arabic.cnn.com | 245 | US | News |
      | 4 | Alarabiya.net | 231 | US | News |
      | 5 | Ar.wikipedia.org | 230 | US | Encyclopedia |
      | 6 | Aljazeera.net | 213 | US | News |
      | 7 | Moheet.com | 142 | US | News |
      | 8 | Facebook.com | 133 | US | Social |
      | 9 | Al-sharq.com | 132 | US | Middle East Portal |
      | 10 | Lakii.com | 123 | US | General Portal |
      | 17 | Kuwaitclub.com.kw | 71 | Kuwait | Sport |
      • First Arabic GeoIP location is at rank 17
      • 6 out of 10 top unique domains are news websites
      • Popular western pages are in the top unique domains
  68. Almost 58% are .com
      | TLD | Percent |
      | com | 57.97% |
      | net | 15.07% |
      | org | 6.40% |
      | gov.sa | 1.94% |
      | info | 1.68% |
      | edu.sa | 1.27% |
      | ws | 1.16% |
      | org.sa | 0.97% |
      | com.sa | 0.80% |
      | gov.eg | 0.80% |
      | Other | 11.94% |
      • Small percentage of Arabic TLD
  71. Small percentage of Arabic TLD
      | TLD | Country | Percent |
      | .sa | Saudi Arabia | 5.33% |
      | .eg | Egypt | 2.00% |
      | .jo | Jordan | 2.00% |
      | .ae | United Arab Emirates | 1.06% |
      | .kw | Kuwait | 0.82% |
  73. More than 57% are of depth 0 and 1
      | Path Depth | Example | Percent |
      | 0 | Example.com | 17.30% |
      | 1 | Example.com/a | 40.42% |
      | 2 | Example.com/a/b | 24.45% |
      | 3 | Example.com/a/b/c | 10.81% |
      | 4+ | Example.com/a/b/c/d | 7.02% |
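Path depth as used in the table above can be computed by counting the non-empty segments of the URI path. The sketch below is one plausible reading of that definition; `path_depth` is a hypothetical helper, and adding a default scheme to bare hostnames is an assumption made so that `urlsplit` parses them correctly.

```python
from urllib.parse import urlsplit


def path_depth(uri):
    """Number of non-empty path segments: Example.com/a/b has depth 2."""
    if "://" not in uri:
        # urlsplit treats a bare hostname as a path, so add a scheme first.
        uri = "http://" + uri
    segments = [s for s in urlsplit(uri).path.split("/") if s]
    return len(segments)


if __name__ == "__main__":
    for uri in ["Example.com", "Example.com/a", "Example.com/a/b",
                "Example.com/a/b/c", "Example.com/a/b/c/d"]:
        print(uri, path_depth(uri))
```

Query strings and fragments are ignored here because `urlsplit` separates them from the path.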
  75. 53.77% of Arabic URIs are archived
      • January-March 2015
      • ODU CS Memento Aggregator
      • Median = 16
  76. Most of the top archived URI-Rs are news websites
      | URI-R | Mementos | Category |
      | gulfup.com | 10,987 | File Sharing |
      | masrawy.com | 9,144 | Egyptian portal |
      | arabic.cnn.com | 9,022 | News |
      | aljazeera.net | 8,906 | News |
      | maktoob.yahoo.com | 8,478 | Search Engine |
      | shorooknews.com | 7,548 | News |
      | arabnews.com | 6,274 | News |
      | bbc.co.uk/arabic | 6,268 | News |
      | ahram.org.eg | 5,347 | News |
      | google.com.sa | 4,968 | Search Engine |
  78. Archiving has accelerated since 2011
      [Figure: chart of archiving over time; label shown: March 2015]
  80. Two methods to determine the presence in each archive
      1. Percent of URI-Rs present in each archive, e.g. http://aljazeera.net
      2. Percent of URI-Ms present in each archive, e.g. http://wayback.archive-it.org/all/20070727215420/http://www.aljazeera.net/ and http://web.archive.org/web/20150618104846/http://aljazeera.net/
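The two URI-M examples above follow the Wayback-style layout of an archive prefix, a 14-digit timestamp, and the original URI-R. A minimal sketch for splitting such a URI-M is shown below; `split_uri_m` and its pattern are hypothetical, and archives with other URI schemes (for example Archive.today or WebCitation) would need their own handling.

```python
import re
from datetime import datetime

# Archive prefix, then /YYYYMMDDhhmmss/, then the original resource URI.
URI_M_PATTERN = re.compile(
    r"^(?P<prefix>https?://.+?)/(?P<timestamp>\d{14})/(?P<uri_r>.+)$")


def split_uri_m(uri_m):
    """Return (capture datetime, URI-R) or None if the layout differs."""
    match = URI_M_PATTERN.match(uri_m)
    if match is None:
        return None
    captured = datetime.strptime(match.group("timestamp"), "%Y%m%d%H%M%S")
    return captured, match.group("uri_r")


if __name__ == "__main__":
    print(split_uri_m(
        "http://web.archive.org/web/20150618104846/http://aljazeera.net/"))
```

Grouping the resulting URI-Rs and counting mementos per archive gives the inputs for both presence measures.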
  81. 81. Internet Archive Archive.today Webcitation Total URI-R1 2 0 0 2 URI-R2 2 0 0 2 URI-R3 1 1 0 2 URI-R4 1 1 0 2 URI-R5 0 1 1 2 Total 6 3 1 10 Presence in each archive example 81
  82. 82. 1- Percent of URI-Rs present in each archive Archive Total Percentage Internet Archive 4/5=0.8 80% Archive.today 3/5=0.6 60% Webcitation 1/5=0.2 20% Total 160% Internet Archive Archive.today Webcitation Total URI-R1 2 0 0 2 URI-R2 2 0 0 2 URI-R3 1 1 0 2 URI-R4 1 1 0 2 URI-R5 0 1 1 2 Total 6 3 1 10 82 Presence in each archive example
  83. 83. Presence in each archive example
      Mementos per URI-R:
      URI-R | Internet Archive | Archive.today | Webcitation | Total
      URI-R1 | 2 | 0 | 0 | 2
      URI-R2 | 2 | 0 | 0 | 2
      URI-R3 | 1 | 1 | 0 | 2
      URI-R4 | 1 | 1 | 0 | 2
      URI-R5 | 0 | 1 | 1 | 2
      Total | 6 | 3 | 1 | 10
      1- Percent of URI-Rs present in each archive:
      Archive | Total | Percentage
      Internet Archive | 4/5=0.8 | 80%
      Archive.today | 3/5=0.6 | 60%
      Webcitation | 1/5=0.2 | 20%
      Total | | 160%
      2- Percent of URI-Ms present in each archive:
      Archive | Total | Percentage
      Internet Archive | 6/10=0.6 | 60%
      Archive.today | 3/10=0.3 | 30%
      Webcitation | 1/10=0.1 | 10%
      Total | | 100%
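The two presence measures can be checked against the five-URI-R example with a short Python sketch. The counts are the ones on the slide; the dictionary layout and the function names `percent_of_uri_rs` and `percent_of_uri_ms` are illustrative assumptions, not the authors' code.

```python
# Memento counts per URI-R and archive, copied from the slide's example.
COUNTS = {
    "URI-R1": {"Internet Archive": 2, "Archive.today": 0, "Webcitation": 0},
    "URI-R2": {"Internet Archive": 2, "Archive.today": 0, "Webcitation": 0},
    "URI-R3": {"Internet Archive": 1, "Archive.today": 1, "Webcitation": 0},
    "URI-R4": {"Internet Archive": 1, "Archive.today": 1, "Webcitation": 0},
    "URI-R5": {"Internet Archive": 0, "Archive.today": 1, "Webcitation": 1},
}
ARCHIVES = ["Internet Archive", "Archive.today", "Webcitation"]


def percent_of_uri_rs(counts):
    """Method 1: share of URI-Rs with at least one memento in each archive."""
    total_uri_rs = len(counts)
    return {a: sum(1 for row in counts.values() if row[a] > 0) / total_uri_rs
            for a in ARCHIVES}


def percent_of_uri_ms(counts):
    """Method 2: share of all mementos (URI-Ms) held by each archive."""
    total_uri_ms = sum(sum(row.values()) for row in counts.values())
    return {a: sum(row[a] for row in counts.values()) / total_uri_ms
            for a in ARCHIVES}


for name, shares in [("URI-Rs", percent_of_uri_rs(COUNTS)),
                     ("URI-Ms", percent_of_uri_ms(COUNTS))]:
    print(name, {a: f"{s:.0%}" for a, s in shares.items()})
```

Method 1 can sum to more than 100% because one URI-R may be held by several archives, while method 2 partitions the mementos and always sums to 100%.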
  84. 84. Presence in each archive
      1- Percent of URI-Rs present in each archive:
      Archive | Percent
      Internet Archive | 97.04%
      Archive.today | 6.58%
      Webcitation | 6.00%
      Archive-It | 5.49%
      British Library Archive | 1.06%
      UK Parliament Web Archive | 0.88%
      Icelandic Web Archive | 0.87%
      UK National Archives | 0.62%
      Proni | 0.21%
      Stanford | 0.11%
      Total | 118.86%
      2- Percent of URI-Ms present in each archive:
      Archive | Percent
      Internet Archive | 72.87%
      Archive-It | 21.26%
      Archive.today | 2.14%
      Webcitation | 2.08%
      Icelandic Web Archive | 1.17%
      British Library Archive | 0.29%
      UK Parliament Web Archive | 0.10%
      Proni | 0.05%
      UK National Archives | 0.04%
      Stanford | <0.01%
      Total | 100%
  87. 87. Average archiving period (days) Average archiving period = (LM-FM) / number of mementos 16,732 URIs have only one memento Median=48 days 87
  88. 88. Values less than 1 indicate that the URI is archived multiple times per day The larger the period, the more irregularly the URI was captured by the archives Median=48 days Average archiving period = (LM-FM) / number of mementos 16,732 URIs have only one memento 88 Average archiving period (days)
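The formula on this slide can be sketched in Python as follows. It assumes LM and FM stand for the datetimes of the last and first mementos, which the slide does not spell out; the function name and the sample timestamps are hypothetical.

```python
from datetime import datetime


def average_archiving_period(memento_datetimes):
    """Return (LM - FM) / number of mementos, in days.

    Assumption: LM and FM are the last and first memento datetimes.
    """
    if not memento_datetimes:
        raise ValueError("at least one memento is required")
    first = min(memento_datetimes)
    last = max(memento_datetimes)
    span_days = (last - first).total_seconds() / 86400
    return span_days / len(memento_datetimes)


# Hypothetical URI captured four times over 30 days.
captures = [datetime(2015, 1, 1), datetime(2015, 1, 11),
            datetime(2015, 1, 21), datetime(2015, 1, 31)]
print(average_archiving_period(captures))  # 30 days / 4 mementos = 7.5
```

With this definition a URI that has a single memento gets a period of 0, which is worth keeping in mind given that 16,732 URIs fall into that case.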
  89. 89. Creation date for archived Arabic URIs Source: http://ws-dl.blogspot.com/2014/11/2014-11-14-carbon-dating-web-version-20.html We used CarbonDate for creation date estimate 89
  90. 90. Source: http://ws-dl.blogspot.com/2014/11/2014-11-14-carbon-dating-web-version-20.html We used CarbonDate for creation date estimate 18 years 90 Creation date for archived Arabic URIs
  91. 91. Source: http://ws-dl.blogspot.com/2014/11/2014-11-14-carbon-dating-web-version-20.html 2013 is the most frequent year We used CarbonDate for creation date estimate 18 years 91 Creation date for archived Arabic URIs
  92. 92. Top GeoIP locations
      Country | Percent
      United States | 57.97%
      Arabic Countries | 10.53%
      Germany | 9.75%
      Netherlands | 5.29%
      France | 4.37%
      Canada | 3.31%
      United Kingdom | 3.07%
      Other | 5.71%
  94. 94. Top GeoIP locations
      Country | Percent
      United States | 57.97%
      Arabic Countries | 10.53%
      Germany | 9.75%
      Netherlands | 5.29%
      France | 4.37%
      Canada | 3.31%
      United Kingdom | 3.07%
      Other | 5.71%
      Top Arabic countries:
      Country | Percent
      Saudi Arabia | 4.75%
      Egypt | 1.97%
      Jordan | 1.42%
      Kuwait | 0.71%
      United Arab Emirates | 0.67%
  96. 96. Seed Data Set (Live, Indexed, Archived) Percent (1, 1, 1) 43.34% (1, 1, 0) 25.59% (1, 0, 1) 15.27% (1, 0, 0) 15.76% Status of Arabic seed URIs 96
  97. 97. Seed Data Set (Live, Indexed, Archived) Percent (1, 1, 1) 43.34% (1, 1, 0) 25.59% (1, 0, 1) 15.27% (1, 0, 0) 15.76% (Good) discovered and saved 97 Status of Arabic seed URIs
  98. 98. Status of Arabic seed URIs (seed data set)
      (Live, Indexed, Archived) | Percent
      (1, 1, 1) | 43.34% | (Good) discovered and saved
      (1, 1, 0) | 25.59%
      (1, 0, 1) | 15.27%
      (1, 0, 0) | 15.76% | (Bad) undiscovered and not saved
  99. 99. Seed Data Set (Live, Indexed, Archived) Percent (1, 1, 1) 43.34% (1, 1, 0) 25.59% (1, 0, 1) 15.27% (1, 0, 0) 15.76% 31% were not indexed by Google 99 Status of Arabic seed URIs
  100. 100. Difference between creation date and first memento
      • 18% have creation dates over 1 year before the first memento was archived
      • 19.48% of the URIs have an estimated creation date that is the same as the first memento date
  101. 101. DMOZ URIs are more likely to be found and archived
      Seed data set | Arabic | Archived | Indexed
      DMOZ | 34.43% | 95.52% | 82.13%
      Raddadi | 19.88% | 45.44% | 65.83%
      Star28 | 45.69% | 41.54% | 65.23%
  104. 104. URIs hosted in Western countries are more likely to be archived (full data set)
      Category | Total | Archived
      Arabic | 33.18% | 33.56%
      AR ccTLD | 14.84% | 28.09%
      AR GeoIP | 10.53% | 13.11%
      AR both | 7.81% | 59.50%
      Neither | 66.82% | 65.22%
  106. 106. Seed Data Set Total Indexed Category Total Indexed Arabic 15.01% 78.29% AR ccTLD 6.61% 76.09% AR GeoIP 2.37% 73.54% AR both 6.03% 85.24% Neither 84.99% 65.22% Neither 84.99% 67.09% URIs that had some Arabic location had a higher indexing rate 106
  108. 108. The spread of mementos was not affected by location or ccTLD (Kolmogorov-Smirnov test)
      Category | Mean
      Ar GeoIP | 0.5010
      Ar ccTLD | 0.5013
      Both | 0.5016
      Neither | 0.5005
      Comparison | D-value | P-value
      Ar ccTLD vs. neither | 0.017 | <0.002
      Ar GeoIP vs. neither | 0.014 | <0.002
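For readers unfamiliar with the D-value, the two-sample Kolmogorov-Smirnov statistic is the largest gap between the empirical distribution functions of two samples. The sketch below computes it in plain Python. The slide does not say how the spread values were computed or which tool ran the test, so the function name `ks_statistic` and the sample values are purely illustrative assumptions.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Two-sample KS D: max |F_a(x) - F_b(x)| over all observed x."""
    a, b = sorted(sample_a), sorted(sample_b)
    d = 0.0
    for x in a + b:
        # Empirical CDFs: fraction of each sample that is <= x.
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        d = max(d, abs(cdf_a - cdf_b))
    return d


# Hypothetical spread values for two categories (not data from the study).
arabic_cctld = [0.48, 0.50, 0.51, 0.52, 0.49]
neither = [0.47, 0.50, 0.50, 0.53, 0.51]
print(ks_statistic(arabic_cctld, neither))
```

A D-value near 0, such as the 0.017 and 0.014 reported on the slide, means the two distributions are nearly indistinguishable in shape.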
  109. 109. Just because a webpage is older does not mean that it is archived more, because of low historical archiving rates
  110. 110. Just because a webpage is older does not mean that it is archived more; we look at the last three years
  112. 112. In the last three years, the older the resource is, the more mementos it has
  113. 113. Top level URIs are more likely to be archived and indexed
      Path depth | Full data set: Total | Full data set: Archived | Seed data set: Total | Seed data set: Indexed
      0 | 17.30% | 86.29% | 86.05% | 74.60%
      1 | 40.42% | 53.49% | 9.77% | 38.91%
      2 | 24.45% | 45.57% | 3.72% | 17.85%
      3+ | 17.83% | 34.24% | 0.50% | 57.50%
  116. 116. Summary of collection methods
      • Collected URIs from three Arabic directories (7,976): DMOZ, Raddadi.com, Star28.com
      • Crawl seed dataset (1,299,671)
      • Check if they are unique (663,443)
      • Check if they are live (482,905)
      • Check for Arabic language (300,646)
  117. 117. Findings
      • Our Arabic language dataset was not largely located in Arabic countries
          - Only 14.84% had an Arabic ccTLD
          - Only 10.53% had a GeoIP in an Arabic country
          - Popular Western domains (e.g., cnn.com, wikipedia.org) appeared in the top 10
      • Arabic webpages are not particularly well archived or indexed
          - 46% were not archived
          - 31% were not indexed by Google
      • An Arabic webpage is more likely to be...
          - indexed if it is present in a directory
          - archived if it is present in DMOZ
          - archived if it has neither Arabic GeoIP nor Arabic ccTLD
      For now, if you want your Arabic-language webpage to be archived, host it outside of an Arabic country and get it listed in DMOZ.
  118. 118. 118
  119. 119. Backup Slides 119
  120. 120. GeoIP Location
      • We obtained the IP addresses of the hostnames using nslookup, which uses DNS to convert a hostname to its IP address.
      • We used the MaxMind GeoLite2 database to determine location from the IP address; it tests at 99.8% accuracy at the country level.
      http://dev.maxmind.com/geoip/geoip2/geolite2/
      http://dev.maxmind.com/faq/how-accurate-are-the-geoip-databases/
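The lookup described on this slide, combined with the ccTLD check used elsewhere in the talk, might look like the Python sketch below. Several things here are assumptions rather than the authors' method: the set of Arabic country codes (the 22 Arab League members), the use of `socket.gethostbyname` in place of the nslookup command, the `geoip2` Python package as the reader for a GeoLite2 Country database, and all function names and example inputs.

```python
import socket

# Assumption: "Arabic country" means an Arab League member. For these
# countries the ccTLD and the ISO 3166-1 alpha-2 code coincide.
ARABIC_COUNTRY_CODES = {
    "ae", "bh", "dj", "dz", "eg", "iq", "jo", "km", "kw", "lb", "ly",
    "ma", "mr", "om", "ps", "qa", "sa", "sd", "so", "sy", "tn", "ye",
}


def has_arabic_cctld(hostname: str) -> bool:
    """True if the last DNS label is one of the assumed Arabic ccTLDs."""
    return hostname.rstrip(".").rsplit(".", 1)[-1].lower() in ARABIC_COUNTRY_CODES


def classify(hostname: str, country_iso) -> str:
    """Assign the mutually exclusive categories used in the slide tables."""
    cctld = has_arabic_cctld(hostname)
    geoip = country_iso is not None and country_iso.lower() in ARABIC_COUNTRY_CODES
    if cctld and geoip:
        return "AR both"
    if cctld:
        return "AR ccTLD"
    if geoip:
        return "AR GeoIP"
    return "Neither"


def lookup_country(hostname: str, mmdb_path: str):
    """Resolve a hostname and return its GeoIP country code (or None).

    Needs network access, the third-party geoip2 package, and a
    GeoLite2 Country .mmdb file at mmdb_path (a hypothetical path).
    """
    import geoip2.database  # imported lazily so classify() works without it

    ip_address = socket.gethostbyname(hostname)
    with geoip2.database.Reader(mmdb_path) as reader:
        return reader.country(ip_address).country.iso_code


# Hypothetical (hostname, GeoIP country) pairs, not measurements.
print(classify("example.com.sa", "US"))
print(classify("example.net", "QA"))
```

Keeping `classify` separate from `lookup_country` means the categorisation can be checked offline, while the DNS and database lookups stay in one place.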
