Euroscipy SemNews 2011

901 views

Published on

Scikits.learn (http://scikit-learn.sourceforge.net/) is a scikit for machine learning which has gained lots of popularity in recent months. In particular, it can be used for text and large scale database mining.
On another side, CubicWeb (http://www.cubicweb.org/) is a python-based framework for semantic web applications that has been used in different application fields (library, museum, conference, intranet applications).
The aim of this talk is to present how these tools can be used together for semantic data mining of rss feeds (clustering, prediction), and for building a news aggregator similar to google news.

Full description : http://www.euroscipy.org/talk/4291

Published in: Technology
0 Comments
0 Likes
Statistics
Notes
  • Be the first to comment

  • Be the first to like this

No Downloads
Views
Total views
901
On SlideShare
0
From Embeds
0
Number of Embeds
1
Actions
Shares
0
Downloads
9
Comments
0
Likes
0
Embeds 0
No embeds

No notes for slide

Euroscipy SemNews 2011

  1. 1. A semantic news aggregator in Python using Dbpedia, Cubicweb and Scikits-learn Vincent Michel - Logilab 27 août 2011
  2. 2. Context With a bunch of RSS news... Google Borrows Apple Strategy : Google’s deal underscores the allure of a business model pioneered... Japan disaster plant cold shutdown could face delay : TOKYO (Reuters) - Tokyo Electric Power Co said on Wednesday... Libya shows signs of slipping from Muammar Gaddafi’s grasp : Supply lines to capital in peril as coastal cities fall... ... how can we analyze them in Python ? → Clustering (grouping) RSS (e.g. Google News). → Extracting/synthetizing information. → Providing semantic useful/original visualisation and analytics tools.
  3. 3. Overview 1 Introduction 2 Extracting information from RSS news 3 Analyzing RSS news
  4. 4. Semantic ? Sussex St. Reading Andrews NDL Audio- Lists Resource subjects t4gm MySpace scrobbler Lists Moseley (DBTune) (DBTune) RAMEAU Folk NTU SH lobid GTAA Plymouth Resource Lists Organi- Reading Lists sations Music The Open ECS Magna- Brainz Music DB tune Library LCSH South- (Data Brainz LIBRIS ampton Tropes lobid Ulm Incubator) (zitgist) Man- EPrints Resources chester Surge Reading biz. Music RISKS Radio Lists The Open ECS data. John Brainz Discogs Library PSH Gem. UB South- gov.uk Peel (DBTune) FanHubz (Data In- (Talis) Norm- Mann- ampton (DB cubator) Jamendo datei heim RESEX Tune) Popula- Poké- DEPLOY Last.fm tion (En- pédia Artists Last.FM Linked RDF AKTing) research EUTC (DBTune) (rdfize) LCCN VIAF Book Wiki data.gov Produc- Pisa Eurécom P20 Mashup semantic NHS .uk tions classical web.org (EnAKTing) Pokedex (DB Mortality Tune) PBAC ECS (En- AKTing) BBC MARC (RKB Budapest Program Codes Explorer) Energy education OpenEI BBC List Semantic Lotico Revyu OAI (En- CO2 data.gov mes Music Crunch SW AKTing) (En- .uk Chronic- Linked Dog NSZL Base AKTing) ling Event- MDB RDF Food IRIT America Media Catalog ohloh BBC DBLP ACM IBM Good- BibBase Ord- Wildlife (RKB Openly Recht- win nance Finder Explorer) Local spraak. Family DBLP legislation Survey Tele- New VIVO UF .gov.uk nl graphis York flickr (L3S) New- VIVO castle Times URI wrappr Open Indiana RAE2001 UK Post- Burner Calais DBLP codes statistics (FU VIVO CiteSeer Roma data.gov LOIUS Taxon iServe Berlin) IEEE .uk Cornell Concept Geo World data ESD Fact- OS dcs Names book dotAC stan- reference Project Linked Data NASA (FUB) Freebase dards data.gov Guten- .uk for Intervals (Data GESIS Course- transport DBpedia berg STW ePrints CORDIS Incu- ware data.gov bator) (FUB) Fishes ERA UN/ .uk of Texas Geo LOCODE Uberblic Euro- Species The stat dbpedia TCM SIDER Pub KISTI (FUB) lite Gene STITCH Chem JISC London Geo KEGG DIT LAAS Gazette TWC LOGD Linked Daily OBO Drug Eurostat Data UMBEL lingvoj Med (es) Disea- YAGO Medi some Care ChEBI KEGG NSF Linked KEGG KEGG Linked Drug Cpd GovTrack rdfabout Glycan Sensor Data CT Bank Pathway US SEC Open Reactome (Kno.e.sis) riese Uni Cyc Lexvo Path- way PDB Media Semantic totl.net Pfam HGNC XBRL WordNet KEGG KEGG Geographic (VUA) Linked Taxo- CAS Reaction rdfabout Twarql UniProt Enzyme EUNIS Open nomy US Census Publications Numbers PRO- ProDom SITE Chem2 UniRef Bio2RDF User-generated content Climbing WordNet SGD Homolo Linked (W3C) Affy- Gene GeoData Cornetto metrix Government PubMed Gene UniParc Ontology GeneID Cross-domain Airports Product DB UniSTS MGI Gen Life sciences Bank OMIM InterPro As of September 2010 “Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch. http ://lod-cloud.net/”
  5. 5. Tools Fetching, storing and querying the data → CubicWeb (*) Semantic CMS, high-level database management with metadata. Multiple sources : RSS, micro-blogging, SQL database, . . . Based on PostgreSQL : deals with (very) large database. Data-mining and machine learning → Scikits-learn (*) Easy-to-use and general-purpose machine learning in Python. Unsupervised learning, supervised learning, model selection . . . Semantic information database → Dbpedia (*) ∼ 8.106 articles with abstracts, images, . . . from Wikipedia. ∼ 0, 6.106 categories, 273 types (e.g. person, place, . . . ) ∼ 100.106 links between articles, categories and types. (*) Open Source/Creative Commons ! ! !
  6. 6. Storing RSS information with Cubicweb Each object (or entity) is defined in a schema and may be displayed using different views. Storing RSS Define (in schema.py ) how RSS should be stored in the database. class R S S A r t i c l e ( E n t i t y T y p e ) : t i t l e = S t r i n g ( ) # T i t l e o f t h e feed u r i = S t r i n g ( unique=True ) # U r i o f t h e r s s feed c o n t e n t = S t r i n g ( ) # Content o f t h e feed Fetching RSS (based on feedparser and BeautifulSoup) Simply construct a source, by giving an URL and a parser : u r l = u ’ h t t p : / / feeds . b b c i . co . uk / news / w o r l d / r s s . xml ’ − s = s e s s i o n . c r e a t e _ e n t i t y ( ’ CWSource ’ , name=u ’BBCNews World ’ , u r l = u r l , t y p e =u ’ d a t af e e d ’ , p a r s e r =u ’ rss −p a r s e r ’ , c o n f i g =u ’ s y n c h r o n i z a t i o n − i n t e r v a l =240min ’ ) s . p u l l _ d a t a ( session ) 7 english/american journals (The New York Times, . . . )
  7. 7. Storing Dbpedia information with Cubicweb Storing Dbpedia page Define (in schema.py ) how Dbpedia pages should be stored in the database. class DbpediaPage ( E n t i t y T y p e ) : u r i = S t r i n g ( unique=True , indexed=True ) # U r i o f t h e r e s s o u r c e l a b e l = S t r i n g ( indexed=True ) # h t t p : / / www. w3 . org / 2 0 0 0 / 0 1 / r d f −schema pageid = S t r i n g ( ) # h t t p : / / dbpedia . org / o n t o l o g y / wikiPageID a b s t r a c t = S t r i n g ( ) # h t t p : / / dbpedia . org / o n t o l o g y / a b s t r a c t homepage = S t r i n g ( ) # h t t p : / / xmlns . com / f o a f / 0 . 1 / homepage t h u m b n a i l = S t r i n g ( ) # h t t p : / / dbpedia . org / o n t o l o g y / t h u m b n a i l d e p i c t i o n = S t r i n g ( ) # h t t p : / / xmlns . com / f o a f / 0 . 1 / d e p i c t i o n w i k i p a g e = S t r i n g ( ) # h t t p : / / xmlns . com / f o a f / 0 . 1 / page l a t i t u d e = S t r i n g ( ) # h t t p : / / www. w3 . org / 2 0 0 3 / 0 1 / geo / wgs84_pos l o n g i t u d e = S t r i n g ( ) # h t t p : / / www. w3 . org / 2 0 0 3 / 0 1 / geo / wgs84_pos Storing all dbpedia information (∼ 9.106 pages, ∼ 100.106 links, 20Go) in Cubicweb takes less than 24 hours. → See the full schema
  8. 8. Analyzing RSS news What can we do with the RSS news stored in database ? 1 extract relevant features of the data. 2 construct a usable (i.e. matrix) representation of the data. 3 cluster (group) RSS together. 4 deeper semantic analyze and visualization of the information. Example sentence : “Google is to buy mobile phone manufacturer Motorola Mobility, allowing it to mount a serious challenge to Apple Inc.”
  9. 9. Overview 1 Introduction 2 Extracting information from RSS news 3 Analyzing RSS news
  10. 10. Extracting information : Classical approaches (Scikits-learn) Char-N-gram Extracts features of N characters from a text. from s c i k i t s . l e a r n . f e a t u r e _ e x t r a c t i o n . t e x t import CharNGramAnalyzer a n a l y z e r = CharNGramAnalyzer ( min_n =3 , max_n=6) f e a t u r e s = a n a l y z e r . analyze ( sentence ) 450 features : ’goo’, ’oog’, ’ogl’, . . . , ’mob’, ’obi’, ’bil’, . . . Word-N-gram Extracts features of N words from a text. from s c i k i t s . l e a r n . f e a t u r e _ e x t r a c t i o n . t e x t import WordNGramAnalyzer a n a l y z e r = WordNGramAnalyzer ( min_n =3 , max_n=6) f e a t u r e s = a n a l y z e r . analyze ( sentence ) 58 features : ’google is to’, ’is to buy’, . . . , ’serious challenge to’, . . . Many irrelevant features (tokens). Features do not carry lots of contextual information (i.e. understandable by humans).
  11. 11. Feature extraction : Dbpedia - Context Main Hypothesis : Only things that exist in Dbpedia (i.e Wikipedia) have some interest in news analysis → Named Entities Recognition (NER) Dbpedia feature extraction (Cubicweb/Dbpedia) from cubes . semnews . views . n e r t o o l s import D b p e d i a E n t i t i e s A n a l y z e r a n a l y z e r = D b p e d i a E n t i t i e s A n a l y z e r ( session , l a n g = ’ en ’ ) tokens = a n a l y z e r . analyze ( sentence ) → 3 features : ’Apple Inc’, ’Google’, ’Motorola’ ENAMEX TYPE=ORGANIZATION Google/ENAMEX is to buy mobile phone manufacturer ENAMEX TYPE=ORGANIZATION Motorola/ENAMEX Mobility, allowing it to mount a serious challenge to ENAMEX TYPE=ORGANIZATION Apple Inc/ENAMEX → Try it ! “DBpedia Spotlight : Shedding Light on the Web of Documents”, Pablo N. Mendes et al., I-Semantics 2011 “Learning Named Entity Recognition from Wikipedia”, Joel Nothman 2008 “Large-Scale Named Entity Disambiguation Based on Wikipedia Data”, Silviu Cucerzan 2007
  12. 12. Feature extraction : Dbpedia - Properties Efficient and robust feature extraction Keep the meaning of a text → interpretable features. ’. . . said former soldier Larry, . . . ’ ’. . . said student Larry, . . . ’ ’. . . said Larry Page, . . . ’ Robust features based on redirections. e.g. ’Obama’, ’Barak Obamba’, ’Pres. Obama’ redirect to ’Barack Obama’ Simple RQL (Relation Query Language) queries Fast, based on indexed SQL tables and regular expressions : rset = rql(’Any E WHERE E is DbpediaPage, E label %(token)s’, {’token’: token}) e.g. 19 entities extracted in 4s in 765 words, among ∼ 8.106 dbpedia entries. Different labels but same URI → cross-language feature extraction : e.g. Grenada/Grenade → http ://dbpedia.org/resource/Grenada
  13. 13. Overview 1 Introduction 2 Extracting information from RSS news 3 Analyzing RSS news
  14. 14. Feature extraction : Matrix representation and storage     Obama’s approval . . . 0 0 ... 1 0  Protest as Spain . . .     0  0 ... 0 1    ...  →  ...       Libya rebels fight . . .   0 1 ... 0 0  Obama in Spain . . . 0 0 ... 1 1 E.g. The Obama-Merkel-Sarkozy space . . . Results stored in a relation (appears in rss) in Cubicweb : rql(’Any X WHERE X appears_in_rss Y’)
  15. 15. Exploiting information : Clustering news Creating clusters (groups) of news, using the matrix representation of the data. Meanshift algorithm (Scikits-learn) Based on locating the maxima of a density function. Automatically tunes the number of clusters. from s c i k i t s . l e a r n . c l u s t e r import MeanShift , est im ate _ba nd wid th bandwidth = es ti mat e_b an dwi dth ( X , q u a n t i l e =0.005) c l u s t e r i n g = MeanShift ( bandwidth=bandwidth ) c l u s t e r i n g . f i t (X) labels = clustering . labels_ “Mean shift, mode seeking, and clustering.”, Yizong Cheng. IEEE Transactions on Pattern Analysis and Machine Intelligence 1995 → Try it ! Other possible alternatives : Ward’s algorithm, K-means, . . .
  16. 16. Exploiting information : Queries and Views Each query returns a result set (rset). A view is called on a result set → define the representation rules. Defining a view (short version) class M y E n t i t i e s V i e w ( View ) : __regid__ = ’ example−view ’ def c a l l ( s e l f ) : r s e t = s e l f . cw_rset for e n t i t y in r s e t . e n t i t i e s ( ) : do_whatever_you_want_ . . . write_some_html_ . . . . . . you just have to plug a rset from a query : r s e t = r q l ( ’ Any X WHERE . . . ’ ) s e l f . wview ( ’ example−view ’ , r s e t ) or within an url : http://myapplication/?rql=Any X WHERE ....vid=example-view
  17. 17. Pluging Scikits.learn in a view Combine machine learning tools and database query system, in pure Python. Defining the view class RSSClustering ( View ) : __regid__ = ’ rss −c l u s t e r i n g ’ def c a l l ( s e l f ) : # Get t h e r e s u l t s s e t and c r e a t e t h e data r s e t = s e l f . cw_rset o c c u r r e n c e _ m a t r i x = Oc c ur re n ce Ma t ri x ( ) occurrence_matrix . construct_matrix ( r s e t = r s e t ) # Scale t h e data X = s c a l e ( o c c u r r e n c e _ m a t r i x , a x i s =0 , w i t h _ s t d =True , copy=True ) # Compute bandwith f o r c l u s t e r i n g bandwidth = e sti mat e_ ban dwi dt h (X) # Clustering c l u s t e r = MeanShift ( bandwidth=bandwidth ) c l u s t e r . f i t (X) labels = cluster . labels_ # Perform some HTML r e n d e r i n g ... → Try it !
  18. 18. A new approach for querying information from RSS All musical artists in the news rql(’DISTINCT Any E, R WHERE E appears_in_rss R, E has_type T, T label musical artist’) All living office holder persons in the news rql(’DISTINCT Any E WHERE E appears_in_rss R, E has_type T, T label office holder, E has_subject C, C label Living people’) All news that talk about Barack Obama and any scientist rql(’DISTINCT Any R WHERE E1 label Barack Obama, E1 appears_in_rss R, E2 appears_in_rss R, E2 has_type T, T label scientist’) All news that talk about a drug rql(’Any X, R WHERE X appears_in_rss R, X has_type T, T label drug’) Try it ! . . . with an xml view . . . or a thumbnail view
  19. 19. Vizualisation : Mapping information View class EntityMapView ( View ) : __regid__ = ’map ’ def c a l l ( s e l f ) : r s e t = s e l f . cw_rset s e l f . init_map ( ) for e n t i t y in r s e t . e n t i t i e s ( ) : s e l f . add_marker ( e n t i t y . l a t i t u d e , e n t i t y . l o n g i t u d e , entity . dc_title ) s e l f . center_and_zoom ( 0 , 0 , 1 . 5 ) s e l f . finish_map ( ) Based on mapstraction (javascript) : http ://mapstraction.com/ Automatically locate information from RSS news. Try it !
  20. 20. Conclusion Cubicweb and Scikits-learn : Efficient and easy-to-use tools for data storing, querying and mining. Easy to plug together, only Python tools, all Open Source. Using Dbpedia allows to extract very few highly relevant features Decrease the dimensionality of the data. Link features to millions of pages of information. A new semantic way for querying information Simple information queries using RQL expressions. Use Dbpedia types and categories to refine the selection.
  21. 21. Future Improvements Named Entities Recognition → use disambiguations links / more refined Regular expressions. → add new databases (MusicBrainz, diseasome, . . . ). → closely follow Wikipedia with Dbpedia live update : Motorola Mobility (11 :46, 15 August 2011) ’On the 15 August, Google announced that it agreed to acquire the company.’ RSS news analyzing → explore new algorithms : bi-clustering, . . . → add new data sources : Twitters, Blogs, . . . → ’from scikits.learn predict __future__ . . . ’ → use Matrix Completion to predict new edges in the correlation graph.
  22. 22. Thanks you for your attention ! Questions ?

×