Lightning talk: Searching in more than 140 years newspaper articles - Nicolas Provenzano

See conference video - http://www.lucidimagination.com/devzone/events/conferences/ApacheLuceneEurocon2011

Bassilichi Group worked on the implementation of the oldest Italian newspaper historical archive, that of "La Stampa di Torino", covering 1867 to 2006. Lucene technologies powered this success story, highlighting the content of over 5,000,000 articles captured from 2,000,000 pages, printed in unstructured layouts and recognized with a semantic analysis approach. An example of the implementation may be found at http://devlastampa.bdadoc.it/.


  1. Searching in more than 140 years newspaper articles
     How Bassilichi Group worked to implement the oldest Italian newspaper historical archive of "La Stampa di Torino" from 1867 to 2006
     Nicola Provenzano, Bassilichi Group, Italy
  2. Agenda
     o About Bassilichi Group
     o The Italian newspaper historical archive of "La Stampa di Torino" from 1867 to 2006
     o Our Search Challenges
     o Enhancing the findability
  3. BASSILICHI S.p.A.
     Turnover: € 256M
     An Italian Business Process Outsourcing (BPO) company, it serves as a strategic partner for banks, businesses and the public sector, with an offering that covers three areas: Monetics, Security and Back Office
     Employees: 1009 (at 31/12/2010)
  4. The Italian newspaper La Stampa from Turin
     o Founded on February 9, 1867 under the name “Gazzetta Piemontese”
     o La Stampa is one of the best-known Italian newspapers, published in Turin and distributed in Italy and other European countries
     o With daily sales of about 400,000 copies (2010) and 9,000,000 site page views a month, La Stampa is the third best-selling general-interest newspaper in the country
  5. The project: digitize the entire historical archive and publish the content on the web
     2007: The project starts
     o Digitization
     o Layout analysis
     o OCR
     o Data entry
     2010: The project goes online
  6. Project workgroup
     Committee for the Digital Library Information Journalism, members:
     o San Paolo Company
     o CRT Foundation
     o La Stampa publishing company
     o Regione Piemonte
     Service providers:
     o STI S.p.a, Bassilichi S.p.a, Microshop S.r.l, Bassnet S.r.l
     Hosting and infrastructure provider:
     o CSI Piemonte
  7. Project numbers
     o Nearly 150 years of history
     o 1,761,000 newspaper pages with various page layouts
     o More than 5 million newspaper articles
     o 4.5 million images of photographs and negatives
     o Nearly 100 TByte of images (from 300 to 96 dpi), XML and TXT documents
  8. Web project requirements
     o Search in the articles: full-text search and search by headboard, date and page number
     o Possibility to read the article in a text-only interface or with article highlighting over the image of the source newspaper page
     o Use of open-source technologies
  9. Web project input data
     o XML with:
       o Headboard, issue date, page number
       o Title and article body
     o METS and ALTO XML files with article, line and word positions on the page
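
As an illustration of how word positions from an ALTO file could drive highlighting over the page image, here is a minimal sketch. It assumes the standard ALTO `String` element with `CONTENT`, `HPOS`, `VPOS`, `WIDTH` and `HEIGHT` attributes; the sample document is hand-written for this example and does not come from the La Stampa archive, and `word_boxes` is a hypothetical helper, not the project's code.

```python
import xml.etree.ElementTree as ET

# Tiny hand-written ALTO-style sample (an assumption, not archive data).
ALTO_SAMPLE = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout><Page ID="P1"><PrintSpace>
    <TextBlock ID="TB1">
      <TextLine ID="TL1" HPOS="100" VPOS="200" WIDTH="300" HEIGHT="20">
        <String CONTENT="La" HPOS="100" VPOS="200" WIDTH="30" HEIGHT="20"/>
        <SP WIDTH="10"/>
        <String CONTENT="Stampa" HPOS="140" VPOS="200" WIDTH="90" HEIGHT="20"/>
      </TextLine>
    </TextBlock>
  </PrintSpace></Page></Layout>
</alto>"""


def local_name(tag):
    """Drop the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def word_boxes(alto_xml, term):
    """Return (hpos, vpos, width, height) for every word equal to term.

    Matching is case-insensitive and exact; a real system would more
    likely reuse the search engine's own analysis of the query.
    """
    root = ET.fromstring(alto_xml)
    boxes = []
    for elem in root.iter():
        if local_name(elem.tag) != "String":
            continue
        if elem.get("CONTENT", "").lower() == term.lower():
            boxes.append(
                tuple(float(elem.get(k)) for k in ("HPOS", "VPOS", "WIDTH", "HEIGHT"))
            )
    return boxes


print(word_boxes(ALTO_SAMPLE, "stampa"))  # [(140.0, 200.0, 90.0, 20.0)]
```

The returned rectangles are in the coordinate system of the scanned page, so a viewer would still need to scale them to the resolution of the image it displays.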
  10. January 17, 2007
     “Solr has graduated from the Apache Incubator, and is now a sub-project of Lucene”
  11. Main Solr implementation tricks
     o The Lucene document ID is the domain primary key
     o The full text of long articles is indexed but not stored, to reduce index size
     o An abstract of each article's text is stored, to reduce search result listing time
     o Custom XmlUpdateRequestHandler to index the OCR text of long articles
     o Robust message queuing system to handle system indexing commands
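
The first three points of slide 11 can be pictured with a small sketch that builds a Solr XML `<add>` message for one article. The field names (`id`, `title`, `abstract`, `body`) and the 200-character abstract length are assumptions made for illustration; whether a field is stored or only indexed is decided in the Solr schema (the `indexed` and `stored` field attributes), not in the update message itself.

```python
import xml.etree.ElementTree as ET

ABSTRACT_CHARS = 200  # hypothetical cut-off for the stored abstract


def build_add_message(article):
    """Build a Solr <add> XML message for a single article dict."""
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    fields = {
        # Domain primary key reused as the document's unique key.
        "id": article["id"],
        "title": article["title"],
        # Short excerpt; the schema would mark this field as stored.
        "abstract": article["body"][:ABSTRACT_CHARS],
        # Full OCR text; the schema would mark this indexed, not stored.
        "body": article["body"],
    }
    for name, value in fields.items():
        field = ET.SubElement(doc, "field", name=name)
        field.text = value
    return ET.tostring(add, encoding="unicode")


sample = {"id": "1867-02-09_p1_a3", "title": "Torino", "body": "x" * 1000}
print(build_add_message(sample)[:80])
```

With this split, a result list can be rendered from the stored abstract alone, while the full text is only needed when a reader opens an article.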
  12. Web project main technologies
  13. Web project challenges
     The search engine works well, but how can high performance be ensured under potentially very high traffic?
     TO DO:
     o Investigate load balancing possibilities and fault tolerance strategies
     o Find a way to decouple the index creation phase from the index release in production
     o Use a read-only, optimized Lucene index in production
  14. Solr collection distribution
     [Architecture diagram. Labels: Load Balancer, HTTPD (x3), Load Balancer, JBoss EAP Cluster, Slave Index (x4), Index Management, Index Replication, Updates, Index Administration]
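
The slide relies on dedicated load balancers in front of the read-only slave indexes. Purely to illustrate the idea of spreading queries over several slaves and tolerating a failed one, here is a client-side sketch; `SlavePool` and the `send` callback are hypothetical names, and this is not how the project's load balancers were implemented.

```python
import itertools


class SlavePool:
    """Round-robin over read-only search nodes, skipping nodes that fail."""

    def __init__(self, urls):
        self.urls = list(urls)
        self._cycle = itertools.cycle(self.urls)

    def query(self, send, params):
        """Try each node at most once per query.

        send(url, params) performs the actual HTTP request and is expected
        to raise OSError (connection refused, timeout, ...) on failure.
        """
        last_error = None
        for _ in range(len(self.urls)):
            url = next(self._cycle)
            try:
                return send(url, params)
            except OSError as exc:
                last_error = exc
        raise RuntimeError("no search node available") from last_error


def fake_send(url, params):
    """Stand-in for an HTTP call: the first slave is down in this demo."""
    if url == "http://slave1/solr":
        raise ConnectionError("slave1 is down")
    return url


pool = SlavePool(["http://slave1/solr", "http://slave2/solr", "http://slave3/solr"])
print(pool.query(fake_send, {"q": "fiat"}))
```

Because the slaves only serve queries, any of them can answer a given request, which is what makes this kind of rotation safe.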
  15. Online web project numbers
     On the day the project was presented, the site handled very high traffic without any problems.
     o The historical archive of “La Stampa di Torino” is one of the biggest freely available digital newspaper archives, alongside those of the Times and the New York Times
     o 509,791 page views on November 1, 2010, with 21,352 user sessions
     o Nearly 15,000,000 page views in the last year
  16. Current development version challenges
     Browsing the archive by date, article title and text gives a good search experience, but how can findability be enhanced?
     o Boosting articles with Named Entity Recognition, with the help of Celi s.r.l
     o Enhancing user search capabilities with query autocomplete suggestions and advanced search over named entities: authors, persons, locations, organizations
     o Faceting content on all the new article attributes
     o Enabling content tagging to collect useful user navigation suggestions
  17. Current development version details
     o jQuery UI enriched our user interface
     o Date range filters drive the new timeline search widget
     o Multi-select faceting for user search refinement
     o MoreLikeThis with named entities for user search suggestions
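
The date range filter and multi-select faceting from slide 17 map onto standard Solr request parameters. The sketch below uses a range filter query on a date field plus Solr's `{!tag=...}` and `{!ex=...}` local parameters, so that the counts of a facet ignore the filter on that same facet. The field names (`issue_date`, `person`, `location`, `organization`) are assumptions, not the project's actual schema.

```python
from urllib.parse import urlencode

FACET_FIELDS = ("person", "location", "organization")  # hypothetical names


def build_search_params(text, year_from, year_to, selected):
    """Return Solr request parameters as a list of (name, value) pairs.

    selected maps a facet field to the values the user has ticked.
    Values are assumed not to contain double quotes.
    """
    params = [
        ("q", text),
        ("facet", "true"),
        # Timeline widget: restrict results to a range of issue dates.
        ("fq", f"issue_date:[{year_from}-01-01T00:00:00Z TO {year_to}-12-31T23:59:59Z]"),
    ]
    for field in FACET_FIELDS:
        values = selected.get(field, [])
        if values:
            clause = " OR ".join(f'"{v}"' for v in values)
            # Tag the filter so it can be excluded when counting this facet.
            params.append(("fq", f"{{!tag={field}}}{field}:({clause})"))
            params.append(("facet.field", f"{{!ex={field}}}{field}"))
        else:
            params.append(("facet.field", field))
    return params


params = build_search_params("fiat", 1900, 1909, {"location": ["Torino"]})
print(urlencode(params))
```

Excluding the tagged filter keeps the other locations visible with their counts after "Torino" is selected, which is what lets the user tick several values of the same facet.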
  18. Q&A
     nicola.provenzano@bdadoc.it
     Bassilichi Group - Firenze - Italy
