Mapping French Open Data actors on the web with Common Crawl

  1. 1. Mapping French Open Data actors on the web with Common Crawl guillaume.lebourgeois@data-publica.com @glebourg
  2. 2. Mining the Web at Data Publica Different needs, different techniques ● Scraping ● Focused crawling ● Prospective crawling
  3. 3. Mining the Web at Data Publica Scraping ● Identified resources ● Configured extractors ● Structured content ● Not scalable
  4. 4. Mining the Web at Data Publica Focused crawling ● Identified entities ● Fuzzy extraction ● Structured content using text-mining ● Scalable ● Useful to get meta information on known entities
  5. 5. Mining the Web at Data Publica Prospective crawling ● No starting point ● Fuzzy extraction ● Structured content using text-mining ● Very hard to scale ● Heavy resources needed: CPU, RAM, HDD ● Relying on a third-party crawl makes life easier!
  6. 6. From a crawl to a map Goal: build a map of the French Open Data actors on the web ● As a graph ● Showing websites
  7. 7. From a crawl to a map Using Common Crawl ● Large web crawl archives, fully accessible ● Good coverage of the French web ● Easy access via AWS / MapReduce jobs
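The MapReduce style of job mentioned on this slide can be illustrated in plain Python. This is only a sketch under assumptions: `map_record`, `reduce_counts`, and `pages_per_host` are hypothetical names, the sample URLs are made up, and a real job would read Common Crawl archive files on AWS rather than an in-memory list.

```python
from collections import defaultdict
from urllib.parse import urlparse


def map_record(url):
    """Map step: emit (host, 1) for one crawled page URL."""
    host = urlparse(url).netloc.lower()
    if host:
        yield host, 1


def reduce_counts(pairs):
    """Reduce step: sum the counts emitted for each host."""
    totals = defaultdict(int)
    for host, count in pairs:
        totals[host] += count
    return dict(totals)


def pages_per_host(urls):
    """Run the map step over every URL, then reduce the emitted pairs."""
    pairs = (pair for url in urls for pair in map_record(url))
    return reduce_counts(pairs)


# Hypothetical sample input, standing in for URLs read from crawl archives.
sample_urls = [
    "http://www.data-publica.com/",
    "http://www.data-publica.com/about",
    "http://example.org/open-data",
]
page_counts = pages_per_host(sample_urls)
```

Counting pages per host is the kind of per-website total that the scoring slides rely on as a denominator.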
  8. 8. From a crawl to a map Working on the French web ● The .fr TLD alone is not a relevant detection criterion ● Detecting page language ● Giving websites a "frenchness" score ○ Sw = number of French pages / total number of pages ○ Cutoff chosen manually by testing on French websites
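A minimal sketch of the "frenchness" score Sw, assuming each crawled page has already been reduced to a `(website, language_code)` pair by some language detector. The function names and the `cutoff=0.5` default are illustrative assumptions; the slides only say that the cutoff was chosen manually.

```python
from collections import defaultdict


def frenchness_scores(pages):
    """pages: iterable of (website, language_code) pairs, one per crawled page.

    Returns {website: Sw} with Sw = French pages / total pages.
    """
    french = defaultdict(int)
    total = defaultdict(int)
    for website, language in pages:
        total[website] += 1
        if language == "fr":
            french[website] += 1
    return {site: french[site] / total[site] for site in total}


def french_websites(pages, cutoff=0.5):
    """Keep websites whose score reaches the cutoff (0.5 is an assumed value)."""
    scores = frenchness_scores(pages)
    return {site for site, score in scores.items() if score >= cutoff}


# Hypothetical sample: a.example is mostly French, b.example mostly English.
sample_pages = [
    ("a.example", "fr"), ("a.example", "fr"), ("a.example", "en"),
    ("b.example", "en"), ("b.example", "fr"),
    ("b.example", "en"), ("b.example", "en"),
]
kept = french_websites(sample_pages)
```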
  9. 9. From a crawl to a map Working on Open Data websites ● Building an Open Data "vocabulary" ● Detecting whether a page talks about Open Data ● Giving websites an "opendataness" score ○ Sw = number of Open Data pages / total number of pages ○ Cutoff chosen manually by testing on Open Data websites
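The "opendataness" score can be sketched the same way, with a simple vocabulary match standing in for the text-mining step. The vocabulary below is hypothetical (the slides do not list the terms actually used), and `is_open_data_page`, `opendataness`, and `min_hits` are illustrative names.

```python
import re

# Hypothetical vocabulary; the real term list is not given in the slides.
OPEN_DATA_VOCABULARY = ("open data", "opendata", "données ouvertes")


def is_open_data_page(text, vocabulary=OPEN_DATA_VOCABULARY, min_hits=1):
    """Return True if the page text mentions vocabulary terms at least min_hits times."""
    # Lower-case and collapse whitespace so "Open   Data" still matches "open data".
    normalized = re.sub(r"\s+", " ", text.lower())
    hits = sum(normalized.count(term) for term in vocabulary)
    return hits >= min_hits


def opendataness(page_texts):
    """page_texts: texts of the pages of ONE website. Returns Sw in [0, 1]."""
    if not page_texts:
        return 0.0
    matching = sum(1 for text in page_texts if is_open_data_page(text))
    return matching / len(page_texts)


# Hypothetical website with four pages, two of which mention Open Data.
sample_site = [
    "Portail Open   Data de la ville",
    "Contact",
    "Nos données ouvertes",
    "Mentions légales",
]
site_score = opendataness(sample_site)
```

A plain substring match is the simplest possible detector; it shows why a site that is relevant but rarely uses the vocabulary would receive a low score.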
  10. 10. From a crawl to a map Building graph ● Inside our subset ○ Inlinks ○ Outlinks ● Generating two files ○ nodes.csv (list of websites with an id) ○ edges.csv (directed links between websites) [Diagram: node A with its inlinks and outlinks]
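Generating the two files can be sketched as follows. This is an assumption-laden illustration, not the original job: `build_graph` and `write_graph` are hypothetical names, links are reduced to host-to-host edges inside the retained subset, and the `Id`/`Label` and `Source`/`Target`/`Type` headers are assumed because Gephi's spreadsheet import commonly uses them; the slides do not show the actual file layout.

```python
import csv
from urllib.parse import urlparse


def build_graph(links, subset):
    """links: iterable of (source_url, target_url); subset: set of hosts to keep.

    Returns ({host: id}, sorted list of (source_id, target_id) directed edges).
    """
    node_ids = {host: i for i, host in enumerate(sorted(subset))}
    edges = set()
    for source_url, target_url in links:
        source = urlparse(source_url).netloc.lower()
        target = urlparse(target_url).netloc.lower()
        # Keep only links between two different websites of the subset.
        if source in node_ids and target in node_ids and source != target:
            edges.add((node_ids[source], node_ids[target]))
    return node_ids, sorted(edges)


def write_graph(node_ids, edges, nodes_path="nodes.csv", edges_path="edges.csv"):
    """Write the node list and the directed edge list as two CSV files."""
    with open(nodes_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["Id", "Label"])
        for host, node_id in sorted(node_ids.items(), key=lambda item: item[1]):
            writer.writerow([node_id, host])
    with open(edges_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["Source", "Target", "Type"])
        for source_id, target_id in edges:
            writer.writerow([source_id, target_id, "Directed"])


# Hypothetical sample: c.example is outside the subset, so its link is dropped.
sample_links = [
    ("http://a.example/x", "http://b.example/"),
    ("http://a.example/y", "http://b.example/z"),
    ("http://b.example/", "http://c.example/"),
    ("http://a.example/", "http://a.example/p"),
]
node_ids, edges = build_graph(sample_links, {"a.example", "b.example"})
```

Calling `write_graph(node_ids, edges)` would then produce the two files in the current directory.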
  11. 11. From a crawl to a map Building graph ● Links tell a lot about websites ○ Authorities ○ Hubs
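One common way to quantify authorities and hubs is the HITS algorithm. The slides do not say which method was used (the Gephi slide only equates authority with the number of inlinks), so the following is a swapped-in illustration with hypothetical function names.

```python
def normalize(scores):
    """Scale scores to unit Euclidean length (no-op if all scores are zero)."""
    norm = sum(value * value for value in scores.values()) ** 0.5
    if norm == 0:
        return dict(scores)
    return {node: value / norm for node, value in scores.items()}


def hits(edges, iterations=50):
    """edges: iterable of (source, target) directed links.

    Returns (hubs, authorities), two {node: score} dicts.
    """
    edges = list(edges)
    nodes = {node for edge in edges for node in edge}
    hubs = {node: 1.0 for node in nodes}
    auths = {node: 1.0 for node in nodes}
    for _ in range(iterations):
        # Authority score: sum of the hub scores of the nodes linking in.
        new_auths = {node: 0.0 for node in nodes}
        for source, target in edges:
            new_auths[target] += hubs[source]
        # Hub score: sum of the authority scores of the nodes linked to.
        new_hubs = {node: 0.0 for node in nodes}
        for source, target in edges:
            new_hubs[source] += new_auths[target]
        auths = normalize(new_auths)
        hubs = normalize(new_hubs)
    return hubs, auths


# Hypothetical sample: "a" receives the most links, "portal" links out the most.
sample_edges = [("portal", "a"), ("portal", "b"), ("other", "a")]
hub_scores, authority_scores = hits(sample_edges)
```

Sorting websites by `authority_scores` or `hub_scores` gives a ranking that can be compared with the plain inlink count.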
  12. 12. From a crawl to a map Visualizing graph using Gephi ● Load graph ● Spatialize graph ○ links between websites create "attraction", so that linked websites appear near each other ○ the more inlinks a node has, the bigger it is drawn (= authority) ○ websites are categorized for better understanding (one color per category) ■ Companies, Non-profits/blogs, Government agencies ○ communities can now appear!
  13. 13. From a crawl to a map
  14. 14. From a crawl to a map Visualizing graph on the web ● Sigma.js ● Uses Gephi files ● Gives better interactivity
  15. 15. Analyze ● The final graph is a good way to understand interactions between actors ○ Open Data is definitely initiated by a non-profit movement ○ Companies are beginning to work on the subject ○ The French state has had only sporadic initiatives so far ● This graph is to be generated again in the near future, to see changes in this ecosystem
  16. 16. Results ● Large scale crawl made easy ○ Easy to focus on mining the results instead of finding/storing the data ● Nice workflow from raw data to an understandable visualisation ● The final graph is a good way to understand interactions between actors
  17. 17. Feedback ● Common Crawl ○ Common Crawl does not yet have an exhaustive crawl of the French web ○ Data is not as fresh as it could be ○ It is missing an index to access at least domains, and maybe pages, in O(1) ● Methodology ○ Opendataness scoring can set aside websites that are relevant but not focused enough on Open Data
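The kind of index wished for here can be illustrated with a hash map from domain to the locations of its pages, which gives average-case O(1) lookup per domain. Everything in this sketch is hypothetical: `build_domain_index`, the `(archive_file, offset)` location format, and the sample records are assumptions, not a description of any index Common Crawl provides.

```python
from collections import defaultdict
from urllib.parse import urlparse


def build_domain_index(records):
    """records: iterable of (url, location) pairs.

    `location` is whatever identifies where the page is stored; here a
    hypothetical (archive_file, offset) tuple. Returns {domain: [locations]}.
    """
    index = defaultdict(list)
    for url, location in records:
        index[urlparse(url).netloc.lower()].append(location)
    return dict(index)


# Hypothetical sample records.
sample_records = [
    ("http://a.example/", ("file-1", 0)),
    ("http://a.example/p", ("file-1", 512)),
    ("http://b.example/", ("file-2", 0)),
]
domain_index = build_domain_index(sample_records)
```

With such an index, `domain_index.get("a.example")` returns the page locations for that domain without scanning the whole crawl.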
  18. 18. Resources ● http://webatlas.fr/tempshare/OpenDataActeursTypes.pdf ○ poster by Franck Ghitalla ● http://french-opendata.data-publica.com/index.html ○ dynamic visualisation of the results, by Data Publica ● http://fr.slideshare.net/willounet/a-sneak-peek-into-the-web-presentation ○ A sneak peek into the web, by GL ● http://french-opendata.data-publica.com/ ○ Project host page
  19. 19. Mapping French Open Data actors on the web with Common Crawl guillaume.lebourgeois@data-publica.com @glebourg
