Mapping French Open Data actors on the web with Common Crawl


  1. Mapping French Open Data actors on the web with Common Crawl
     guillaume.lebourgeois@data-publica.com, @glebourg
  2. Mining the Web at Data Publica
     Different needs, different techniques
     ● Scraping
     ● Focused crawling
     ● Prospective crawling
  3. Mining the Web at Data Publica
     Scraping
     ● Identified resources
     ● Configured extractors
     ● Structured content
     ● Not scalable
  4. Mining the Web at Data Publica
     Focused crawling
     ● Identified entities
     ● Fuzzy extraction
     ● Structured content using text mining
     ● Scalable
     ● Useful to get meta information on known entities
  5. Mining the Web at Data Publica
     Prospective crawling
     ● No starting point
     ● Fuzzy extraction
     ● Structured content using text mining
     ● Very hard to scale
     ● Heavy resources needed: CPU, RAM, HDD
     Using a third party makes your life easier!
  6. From a crawl to a map
     Goal: build a map of the French Open Data actors on the web
     ● As a graph
     ● Showing websites
  7. From a crawl to a map
     Using Common Crawl
     ● Large web crawl archives, fully accessible
     ● Good coverage of the French web
     ● Easy access via AWS / MapReduce jobs
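
To give a feel for the kind of MapReduce job mentioned on this slide, here is a minimal in-memory sketch that counts crawled pages per website. It is not the job used for the talk: the function names are made up for illustration, the input is assumed to be a plain list of page URLs, and stripping a leading "www." is a simplifying assumption about what counts as one website.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def map_page(url):
    """Map step: emit (website, 1) for one crawled page URL."""
    host = urlsplit(url).hostname or ""
    # Assumption: "www.example.fr" and "example.fr" are the same website.
    if host.startswith("www."):
        host = host[4:]
    return host, 1


def reduce_counts(pairs):
    """Reduce step: sum the emitted counts per website."""
    totals = defaultdict(int)
    for host, count in pairs:
        if host:  # skip URLs without a usable hostname
            totals[host] += count
    return dict(totals)


def pages_per_website(urls):
    """Run the map and reduce steps over an iterable of URLs."""
    return reduce_counts(map_page(url) for url in urls)
```

On a real cluster the map and reduce steps would run on separate workers over the crawl archives; the per-website totals are the denominators of the scores on the next slides.
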
  8. From a crawl to a map
     Working on the French web
     ● The .fr TLD alone is not a reliable signal for detection
     ● Detecting page language
     ● Giving websites a "frenchness" score
       ○ Sw = number of French pages / total number of pages
       ○ Cutoff chosen manually by testing on French websites
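
The "frenchness" score can be sketched in a few lines of Python. This sketch assumes each page already carries a detected language code (the slides do not say which language detector was used), and the cutoff of 0.5 is only a placeholder, since the real value was tuned by hand.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def frenchness_scores(pages):
    """Return Sw = French pages / total pages for each website.

    pages: iterable of (url, language_code) pairs; the language is
    assumed to have been detected beforehand.
    """
    total = defaultdict(int)
    french = defaultdict(int)
    for url, lang in pages:
        site = urlsplit(url).hostname
        if not site:
            continue
        total[site] += 1
        if lang == "fr":
            french[site] += 1
    return {site: french[site] / total[site] for site in total}


def french_websites(pages, cutoff=0.5):
    """Keep websites whose score reaches the cutoff (0.5 is a placeholder)."""
    scores = frenchness_scores(pages)
    return {site for site, score in scores.items() if score >= cutoff}
```

Note that a French-language website on a .com domain is kept, while an English-language website on a .fr domain is dropped, which is the point of scoring by language rather than by TLD.
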
  9. From a crawl to a map
     Working on Open Data websites
     ● Building an Open Data "vocabulary"
     ● Detecting whether a page talks about Open Data
     ● Giving websites an "opendataness" score
       ○ Sw = number of Open Data pages / total number of pages
       ○ Cutoff chosen manually by testing on Open Data websites
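
The "opendataness" score follows the same pattern. In the sketch below, the vocabulary and the minimum number of hits per page are invented for illustration; the actual vocabulary and page-level rule used by Data Publica are not given in the slides.

```python
# Hypothetical vocabulary: the real one is not listed in the slides.
OPEN_DATA_VOCABULARY = (
    "open data",
    "opendata",
    "données ouvertes",
    "données publiques",
)

# Hypothetical rule: a page "talks about Open Data" after this many hits.
MIN_HITS = 2


def is_open_data_page(text, vocabulary=OPEN_DATA_VOCABULARY, min_hits=MIN_HITS):
    """Count vocabulary occurrences in the page text (case-insensitive)."""
    lowered = text.lower()
    hits = sum(lowered.count(term) for term in vocabulary)
    return hits >= min_hits


def opendataness(page_texts):
    """Return Sw = Open Data pages / total pages for one website."""
    if not page_texts:
        return 0.0
    flagged = sum(1 for text in page_texts if is_open_data_page(text))
    return flagged / len(page_texts)
```

A plain keyword count like this is the simplest possible detector; it is only meant to show how a page-level decision rolls up into a website-level ratio.
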
  10. From a crawl to a map
      Building the graph
      ● Inside our subset
        ○ Inlinks
        ○ Outlinks
      ● Generating two files
        ○ nodes.csv (list of websites with an id)
        ○ edges.csv (directed links between websites)
      (Diagram: node A with its inlinks and outlink)
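
A minimal sketch of how the two files could be produced, assuming the links have already been extracted as (source website, target website) pairs. The column names (Id, Label, Source, Target, Type) are chosen to match what Gephi's spreadsheet import usually expects; that choice, the numeric ids, and the dropping of self-links are assumptions, not details from the slides.

```python
import csv


def build_graph(links, subset):
    """Keep only links whose two ends are both in the retained subset.

    links: iterable of (source_site, target_site) pairs.
    subset: set of websites that passed the scoring steps.
    Returns (ids, edges): a site -> id mapping and a sorted list of
    unique directed (source_id, target_id) pairs.
    """
    ids = {site: i for i, site in enumerate(sorted(subset))}
    edges = sorted(
        {(ids[s], ids[t]) for s, t in links if s in ids and t in ids and s != t}
    )
    return ids, edges


def write_graph(ids, edges, nodes_path="nodes.csv", edges_path="edges.csv"):
    """Write nodes.csv and edges.csv for the graph."""
    with open(nodes_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["Id", "Label"])
        for site, node_id in sorted(ids.items(), key=lambda item: item[1]):
            writer.writerow([node_id, site])
    with open(edges_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["Source", "Target", "Type"])
        for source, target in edges:
            writer.writerow([source, target, "Directed"])
```

Links pointing outside the subset are discarded, so every edge in edges.csv connects two websites listed in nodes.csv.
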
  11. From a crawl to a map
      Building the graph
      ● Links tell a lot about websites
        ○ Authorities
        ○ Hubs
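
The slide does not say how authorities and hubs were identified. One common way to make the two notions concrete is the HITS algorithm, sketched here as a plain power iteration over the directed edge list; treat it as an illustration of the idea, not as the method used in the talk.

```python
import math


def hits(edges, iterations=50):
    """Compute hub and authority scores with the HITS power iteration.

    edges: list of directed (source, target) pairs.
    Returns (hub, authority) dictionaries with L2-normalized scores.
    """
    nodes = sorted({node for edge in edges for node in edge})
    hub = {node: 1.0 for node in nodes}
    authority = {node: 1.0 for node in nodes}
    for _ in range(iterations):
        # A good authority is linked to by good hubs.
        authority = {node: 0.0 for node in nodes}
        for source, target in edges:
            authority[target] += hub[source]
        norm = math.sqrt(sum(v * v for v in authority.values())) or 1.0
        authority = {node: v / norm for node, v in authority.items()}
        # A good hub links to good authorities.
        hub = {node: 0.0 for node in nodes}
        for source, target in edges:
            hub[source] += authority[target]
        norm = math.sqrt(sum(v * v for v in hub.values())) or 1.0
        hub = {node: v / norm for node, v in hub.items()}
    return hub, authority
```

In a small example where a blog and an NGO both link to a portal, the portal comes out as the top authority and the blog, which links to both other sites, as the top hub.
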
  12. From a crawl to a map
      Visualizing the graph using Gephi
      ● Load the graph
      ● Spatialize the graph
        ○ links between websites create "attraction", so linked websites appear near each other
        ○ the more inlinks, the bigger the node (= authority)
        ○ categorizing websites for better understanding (one color per category)
          ■ Companies, Non-profits/blogs, Government agencies
        ○ communities can now appear!
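
In Gephi the node size is set through the interface, but the rule "the more inlinks, the bigger the node" is easy to spell out. The sketch below scales node sizes linearly with in-degree; the size range of 5 to 50 is an arbitrary assumption.

```python
from collections import Counter


def node_sizes(nodes, edges, min_size=5.0, max_size=50.0):
    """Map each node to a display size that grows with its in-degree.

    nodes: iterable of node ids; edges: list of directed (source, target) pairs.
    The node with the most inlinks gets max_size, a node with none gets min_size.
    """
    nodes = list(nodes)
    indegree = Counter(target for _, target in edges)
    top = max((indegree[node] for node in nodes), default=0)
    span = max_size - min_size
    return {
        node: min_size + (span * indegree[node] / top if top else 0.0)
        for node in nodes
    }
```

A linear scale is the simplest choice; with a few very popular websites a logarithmic scale may read better.
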
  13. From a crawl to a map
  14. From a crawl to a map
      Visualizing the graph on the web
      ● Sigma.js
      ● Uses Gephi files
      ● Gives better interactivity
  15. Analysis
      ● The final graph is a good way to understand interactions between actors
        ○ Open Data is clearly driven by a non-profit movement
        ○ Companies are beginning to work on the subject
        ○ The French state has only had sporadic initiatives so far
      ● This graph is to be generated again in the near future, to see changes in this ecosystem
  16. Results
      ● Large-scale crawling made easy
        ○ Easy to focus on mining the results instead of finding/storing the data
      ● Nice workflow from raw data to an understandable visualisation
      ● The final graph is a good way to understand interactions between actors
  17. Feedback
      ● Common Crawl
        ○ Common Crawl does not yet have an exhaustive crawl of the French web
        ○ Data is not as fresh as it could be
        ○ It is missing an index to access at least domains, and maybe pages, in O(1)
      ● Methodology
        ○ Opendataness scoring can leave out websites that are relevant but not focused enough on Open Data
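
To illustrate what the requested index would provide, here is a sketch of a domain index built as a hash map, which gives average O(1) lookup of where a domain's pages are stored. The record layout (URL, archive file name, byte offset) is hypothetical and does not describe Common Crawl's actual storage format.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def build_domain_index(records):
    """Build a domain -> list of (archive_file, byte_offset) mapping.

    records: iterable of (url, archive_file, byte_offset) tuples
    (a hypothetical layout, used only for illustration).
    """
    index = defaultdict(list)
    for url, archive_file, offset in records:
        host = urlsplit(url).hostname
        if host:
            index[host].append((archive_file, offset))
    return dict(index)
```

Without such an index, finding the pages of one domain means scanning the whole crawl; with it, a single dictionary lookup returns the locations to read.
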
  18. Resources
      ● http://webatlas.fr/tempshare/OpenDataActeursTypes.pdf
        ○ Poster by Franck Ghitalla
      ● http://french-opendata.data-publica.com/index.html
        ○ Dynamic visualisation of the results, by Data Publica
      ● http://fr.slideshare.net/willounet/a-sneak-peek-into-the-web-presentation
        ○ A sneak peek into the web, by GL
      ● http://french-opendata.data-publica.com/
        ○ Project host page
  19. Mapping French Open Data actors on the web with Common Crawl
      guillaume.lebourgeois@data-publica.com, @glebourg
