Mapping French Open Data actors on the web with Common Crawl

  1. 1. Mapping French Open Data actors on the web with Common Crawl. guillaume.lebourgeois@data-publica.com, @glebourg
  2. 2. Mining the Web at Data Publica: different needs, different techniques ● Scraping ● Focused crawling ● Prospective crawling
  3. 3. Mining the Web at Data Publica: Scraping ● Identified resources ● Configured extractors ● Structured content ● Not scalable
  4. 4. Mining the Web at Data Publica: Focused crawling ● Identified entities ● Fuzzy extraction ● Content structured using text mining ● Scalable ● Useful for gathering meta-information on known entities
  5. 5. Mining the Web at Data Publica: Prospective crawling ● No starting point ● Fuzzy extraction ● Content structured using text mining ● Very hard to scale ● Heavy resources needed: CPU, RAM, HDD ● Using a third party makes life easier!
  6. 6. From a crawl to a map: the goal is to build a map of French open data actors on the web ● As a graph ● Showing websites
  7. 7. From a crawl to a map: Using Common Crawl ● Large web crawl archives, fully accessible ● Good coverage of the French web ● Easy access via AWS / MapReduce jobs
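The slides do not show the MapReduce jobs themselves. As a rough illustration of the map/reduce pattern applied to crawl records, here is a minimal in-memory sketch that counts pages per host. The function names and sample URLs are hypothetical, and a real job would run on a cluster over the crawl archives rather than on a Python list.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def map_page(url):
    """Map step: emit a (host, 1) pair for one crawled page URL."""
    host = urlsplit(url).hostname or ""
    return host, 1


def reduce_counts(pairs):
    """Reduce step: sum the counts emitted for each host."""
    totals = defaultdict(int)
    for host, count in pairs:
        totals[host] += count
    return dict(totals)


# Hypothetical sample URLs standing in for crawl records.
urls = [
    "http://www.data-publica.com/",
    "http://www.data-publica.com/content/about",
    "http://example.org/index.html",
]
pages_per_host = reduce_counts(map_page(u) for u in urls)
```

The per-host totals are the denominators needed by any per-website score computed as a share of pages.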
  8. 8. From a crawl to a map: Working on the French web ● The .fr TLD alone is not a relevant detection criterion ● Detecting page language ● Giving websites a "frenchness" score ○ Sw = number of French pages / total number of pages ○ Cutoff chosen manually by testing on French websites
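The "frenchness" score above can be sketched as follows. The slides do not say which language detector was used, so the stopword-ratio test here is only a stand-in, and the stopword list, the `min_ratio` threshold and the `CUTOFF` value are all assumptions chosen for illustration.

```python
# Tiny, hypothetical stopword list; a real detector would be far more robust.
FRENCH_STOPWORDS = {"le", "la", "les", "des", "et", "est", "une", "dans", "pour", "que"}


def looks_french(text, min_ratio=0.2):
    """Stand-in language test: share of words that are French stopwords."""
    words = text.lower().split()
    if not words:
        return False
    hits = sum(1 for word in words if word in FRENCH_STOPWORDS)
    return hits / len(words) >= min_ratio


def frenchness(pages):
    """Sw = number of French pages / total number of pages."""
    if not pages:
        return 0.0
    return sum(looks_french(page) for page in pages) / len(pages)


CUTOFF = 0.5  # hypothetical; the slides only say it was tuned by hand

site_pages = [
    "les données publiques sont dans le portail",
    "la carte des acteurs est une vue pour les citoyens",
    "open data portal for public agencies",
    "le site et les jeux de données",
]
score = frenchness(site_pages)
is_french_site = score >= CUTOFF
```

Scoring the whole site rather than single pages is what lets a mostly French website with a few English pages still pass the cutoff.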
  9. 9. From a crawl to a map: Working on Open Data websites ● Building an Open Data "vocabulary" ● Detecting whether a page talks about Open Data ● Giving websites an "opendataness" score ○ Sw = number of Open Data pages / total number of pages ○ Cutoff chosen manually by testing on Open Data websites
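A vocabulary-based page test and the resulting "opendataness" score might look like the sketch below. The vocabulary terms and the `min_hits` threshold are hypothetical, since the slides do not list the actual vocabulary or say how matches were counted.

```python
# Hypothetical vocabulary; the real one is not given in the slides.
OPEN_DATA_VOCABULARY = ("open data", "opendata", "données ouvertes", "données publiques")


def mentions_open_data(text, min_hits=2):
    """Page-level test: enough vocabulary occurrences to count as an Open Data page."""
    lowered = text.lower()
    hits = sum(lowered.count(term) for term in OPEN_DATA_VOCABULARY)
    return hits >= min_hits


def opendataness(pages):
    """Sw = number of Open Data pages / total number of pages."""
    if not pages:
        return 0.0
    return sum(mentions_open_data(page) for page in pages) / len(pages)


site_pages = [
    "Open Data: la plateforme opendata de la ville",
    "Recettes de cuisine et open data",
    "Les données publiques et les données ouvertes",
    "Contact et mentions légales",
]
score = opendataness(site_pages)
```

Requiring more than one hit per page filters out passing mentions, at the cost of missing pages that are relevant but only mention the topic once.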
  10. 10. From a crawl to a map: Building the graph ● Inside our subset ○ Inlinks ○ Outlinks ● Generating two files ○ nodes.csv (list of websites, each with an id) ○ edges.csv (directed links between websites) [Diagram: node A with its inlinks and outlinks]
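Generating the two files can be sketched as below, keeping only links whose two ends belong to the selected subset. The `Id`/`Label` and `Source`/`Target`/`Type` headers follow the column names Gephi's spreadsheet import usually recognizes; that choice, like the function name and the sample hosts, is an assumption and not something the slides specify.

```python
import csv
import io


def build_graph_files(links, subset):
    """Return (nodes_csv, edges_csv) text for links inside the subset.

    links: iterable of (source_host, target_host) pairs
    subset: set of hosts that passed the earlier filters
    """
    ids = {host: i for i, host in enumerate(sorted(subset))}
    # Keep each directed link once, and only if both ends are in the subset.
    edges = sorted(
        {(ids[s], ids[t]) for s, t in links if s in ids and t in ids and s != t}
    )

    nodes_out = io.StringIO()
    writer = csv.writer(nodes_out, lineterminator="\n")
    writer.writerow(["Id", "Label"])
    writer.writerows((i, host) for host, i in ids.items())

    edges_out = io.StringIO()
    writer = csv.writer(edges_out, lineterminator="\n")
    writer.writerow(["Source", "Target", "Type"])
    writer.writerows((s, t, "Directed") for s, t in edges)
    return nodes_out.getvalue(), edges_out.getvalue()


# Hypothetical hosts and links.
links = [("a.fr", "b.fr"), ("b.fr", "a.fr"), ("a.fr", "c.com"), ("a.fr", "b.fr")]
nodes_csv, edges_csv = build_graph_files(links, {"a.fr", "b.fr"})
```

Writing to strings keeps the sketch self-contained; in practice the two values would be saved as nodes.csv and edges.csv.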
  11. 11. From a crawl to a map: Building the graph ● Links tell a lot about websites ○ Authorities ○ Hubs
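Hubs and authorities are commonly computed with the HITS algorithm. The slides do not say which method was used, so the following is only a minimal sketch of HITS on a directed edge list, with hypothetical node names.

```python
import math


def hits(edges, iterations=50):
    """Return (hub, authority) score dicts for a list of directed (source, target) edges."""
    nodes = sorted({node for edge in edges for node in edge})
    hub = {node: 1.0 for node in nodes}
    auth = {node: 1.0 for node in nodes}
    for _ in range(iterations):
        # Authority score: sum of the hub scores of the nodes linking in.
        auth = {node: 0.0 for node in nodes}
        for source, target in edges:
            auth[target] += hub[source]
        norm = math.sqrt(sum(v * v for v in auth.values())) or 1.0
        auth = {node: v / norm for node, v in auth.items()}
        # Hub score: sum of the authority scores of the nodes linked to.
        hub = {node: 0.0 for node in nodes}
        for source, target in edges:
            hub[source] += auth[target]
        norm = math.sqrt(sum(v * v for v in hub.values())) or 1.0
        hub = {node: v / norm for node, v in hub.items()}
    return hub, auth


# Hypothetical graph: "portal" is linked from two sites, "blog" links to two sites.
edges = [("blog", "portal"), ("news", "portal"), ("blog", "agency")]
hub, auth = hits(edges)
```

In this toy graph the strongest authority is the node with the most inlinks from good hubs, and the strongest hub is the node pointing at the most authorities.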
  12. 12. From a crawl to a map: Visualizing the graph using Gephi ● Load the graph ● Spatialize the graph ○ Links between websites create "attraction", so linked websites appear near each other ○ The more inlinks, the bigger the node (= authority) ○ Websites are categorized for better understanding (one color per category) ■ Companies, non-profits/blogs, government agencies ○ Communities can now appear!
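Gephi applies the sizing and coloring rules through its interface. Purely to make the rule concrete, the sketch below scales node size linearly with in-degree and assigns one color per category. The size range, the color codes and the category keys are hypothetical.

```python
from collections import Counter

# Hypothetical palette: one color per category named in the slides.
CATEGORY_COLORS = {
    "company": "#1f77b4",
    "nonprofit": "#2ca02c",
    "government": "#d62728",
}


def node_styles(edges, categories, min_size=5.0, max_size=50.0):
    """Size each node by its number of inlinks and color it by category."""
    indegree = Counter(target for _, target in edges)
    top = max(indegree.values(), default=0)
    styles = {}
    for node, category in categories.items():
        share = indegree[node] / top if top else 0.0
        styles[node] = {
            "size": min_size + share * (max_size - min_size),
            "color": CATEGORY_COLORS.get(category, "#7f7f7f"),
        }
    return styles


edges = [("a", "b"), ("c", "b"), ("a", "c")]
categories = {"a": "company", "b": "nonprofit", "c": "government"}
styles = node_styles(edges, categories)
```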
  13. 13. From a crawl to a map
  14. 14. From a crawl to a map: Visualizing the graph on the web ● Sigma.js ● Uses Gephi files ● Gives better interactivity
  15. 15. Analysis ● The final graph is a good way to understand interactions between actors ○ Open Data was definitely initiated by a non-profit movement ○ Companies are beginning to work on the subject ○ The French state has had only sporadic initiatives so far ● This graph will be generated again in the near future, to see changes in this ecosystem
  16. 16. Results ● Large-scale crawling made easy ○ Easy to focus on mining the results instead of finding/storing the data ● Nice workflow from raw data to an understandable visualisation ● The final graph is a good way to understand interactions between actors
  17. 17. Feedback ● Common Crawl ○ Common Crawl doesn't have an exhaustive crawl of the French web for now ○ Data is not as fresh as it could be ○ It is missing an index to access at least domains, and maybe pages, in O(1) ● Methodology ○ Opendataness scoring can set aside websites that are relevant but not focused enough on open data
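The missing index mentioned above could, in its simplest form, be a hash map from domain to the positions of that domain's records, which gives average O(1) lookup per domain. The sketch below assumes hypothetical `(offset, url)` records; it is not how Common Crawl stores or exposes its data.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def build_domain_index(records):
    """Map each host to the list of record offsets where its pages are stored.

    records: iterable of (offset, url) pairs; the offsets are hypothetical
    positions inside an archive file.
    """
    index = defaultdict(list)
    for offset, url in records:
        host = urlsplit(url).hostname
        if host:
            index[host].append(offset)
    return dict(index)


records = [
    (0, "http://data.gouv.fr/"),
    (1200, "http://example.org/a"),
    (3400, "http://data.gouv.fr/datasets"),
]
index = build_domain_index(records)
offsets = index.get("data.gouv.fr", [])
```

With such an index, a job interested in a few hundred websites could read only their records instead of scanning the whole crawl.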
  18. 18. Resources ● http://webatlas.fr/tempshare/OpenDataActeursTypes.pdf ○ Poster by Franck Ghitalla ● http://french-opendata.data-publica.com/index.html ○ Dynamic visualisation of the results, by Data Publica ● http://fr.slideshare.net/willounet/a-sneak-peek-into-the-web-presentation ○ A sneak peek into the web, by GL ● http://french-opendata.data-publica.com/ ○ Project host page
  19. 19. Mapping French Open Data actors on the web with Common Crawl. guillaume.lebourgeois@data-publica.com, @glebourg
