Searching in more than 140 years of newspaper articles


How Bassilichi Group worked to implement the oldest Italian
newspaper historical archive of "La Stampa di Torino" from
1867 to 2006
Nicola Provenzano, Bassilichi Group, Italy
Agenda

o      About Bassilichi Group

o      The Italian newspaper historical archive of
       "La Stampa di Torino" from 1867 to 2006

o      Our Search Challenges

o      Enhancing the findability
BASSILICHI S.p.A.
An Italian Business Process Outsourcing (BPO) company that
serves as a strategic partner for banks, businesses and the
public sector, with an offering that covers three areas:
Monetics, Security and Back Office

o Turnover: € 256M

o Employees: 1009 (at 31/12/2010)
The Italian newspaper La Stampa from Turin

o Founded on February 9, 1867 under the name “Gazzetta
   Piemontese”

o La Stampa is one of the best-known Italian
   newspapers, published in Turin and distributed in Italy and
   other European nations

o With daily sales of about 400,000 copies (2010) and
   9,000,000 site page views a month, La Stampa is the third
   best-selling general news daily in the country
The project: digitize the entire historical
 archive and publish the content on the web

o 2007: the project starts

o Digitization

o Layout analysis

o OCR

o Data entry

o 2010: the project goes online
Project workgroup
Committee for the Digital Library Information Journalism,
    members
    o    San Paolo Company
    o    CRT Foundation
    o    La Stampa publishing company
    o    Regione Piemonte

Service Providers

o STI S.p.A., Bassilichi S.p.A., Microshop S.r.l., Bassnet S.r.l.

Hosting and infrastructure provider

o   CSI Piemonte
Project numbers
o nearly 150 years of history

o 1,761,000 newspaper pages with varied page layouts

o more than 5 million newspaper articles

o 4.5 million images of photographs and negatives

o Nearly 100 TB of images (from 300 to 96 dpi), XML and TXT
   documents
Web project requirements

o Search in the articles: full-text search and search by
    headboard (masthead), date and page number

o Ability to read an article either in a text-only interface or
    highlighted over the image of the source newspaper
    page

o Use open-source technologies
Web project input data

o XML with:
   o   Headboard, issue date,
       page number

   o   Title and article body




o METS and ALTO XML files with
   article, line and word
   positions on the page
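
To illustrate how word positions can drive article highlighting over the page image, here is a minimal Python sketch, not the project's actual code. It reads the String elements of an ALTO file, which carry CONTENT, HPOS, VPOS, WIDTH and HEIGHT attributes, and returns the boxes of the words that match a search term. The sample fragment and the function names are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical ALTO fragment: one text line containing two words.
SAMPLE_ALTO = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout><Page><PrintSpace>
    <TextBlock ID="TB1"><TextLine>
      <String CONTENT="La" HPOS="100" VPOS="200" WIDTH="40" HEIGHT="30"/>
      <SP WIDTH="10"/>
      <String CONTENT="Stampa" HPOS="150" VPOS="200" WIDTH="120" HEIGHT="30"/>
    </TextLine></TextBlock>
  </PrintSpace></Page></Layout>
</alto>"""


def local_name(tag):
    """Strip the XML namespace so the code works across ALTO versions."""
    return tag.rsplit("}", 1)[-1]


def word_boxes(alto_xml):
    """Return (word, x, y, width, height) for every ALTO String element."""
    root = ET.fromstring(alto_xml)
    boxes = []
    for element in root.iter():
        if local_name(element.tag) == "String":
            boxes.append((
                element.get("CONTENT"),
                float(element.get("HPOS")),
                float(element.get("VPOS")),
                float(element.get("WIDTH")),
                float(element.get("HEIGHT")),
            ))
    return boxes


def boxes_matching(alto_xml, term):
    """Return the boxes of words equal to the search term, ignoring case."""
    term = term.lower()
    return [box for box in word_boxes(alto_xml) if box[0].lower() == term]


if __name__ == "__main__":
    # Rectangles a viewer could draw over the scanned page.
    print(boxes_matching(SAMPLE_ALTO, "stampa"))
```

A viewer would scale these coordinates to the resolution of the displayed image before drawing the highlight rectangles.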
January 17, 2007

“Solr has graduated from the Apache Incubator, and
           is now a sub-project of Lucene”
Main Solr implementation tricks

o The Lucene document ID is the domain primary key

o The full text of long articles is indexed but not stored, to
    reduce index size

o An abstract of each article's text is stored, to reduce search
    result listing time

o A custom XmlUpdateRequestHandler indexes the OCR text of
    long articles

o A robust message queuing system handles the indexing
    commands
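
The first three tricks can be sketched as follows. This is a hedged Python illustration, not the project's implementation: it builds a Solr XML update message whose id field is the domain primary key and which carries both a short abstract and the full body. The field names, the abstract length and the sample key are assumptions. Whether a field is stored or only indexed is declared in the Solr schema, not in this message.

```python
import xml.etree.ElementTree as ET

ABSTRACT_CHARS = 300  # hypothetical abstract length, not taken from the slides


def build_add_message(article):
    """Build a Solr <add><doc> update message for one article.

    Assumed schema (declared in the Solr schema, not here):
      id       - domain primary key, used as the unique key
      abstract - stored, so result lists render without loading the full text
      body     - indexed but not stored, to keep the index small
    """
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    fields = {
        "id": article["id"],
        "title": article["title"],
        "abstract": article["body"][:ABSTRACT_CHARS],
        "body": article["body"],
    }
    for name, value in fields.items():
        field = ET.SubElement(doc, "field", name=name)
        field.text = value
    return ET.tostring(add, encoding="unicode")


if __name__ == "__main__":
    # The key format below is purely illustrative.
    print(build_add_message({
        "id": "1867-02-09_p1_a3",
        "title": "Gazzetta Piemontese",
        "body": "OCR text of the article ...",
    }))
```

In a setup like the one described, such messages would be placed on the message queue and posted to the update handler by a consumer, so that indexing load stays decoupled from the web application.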
Web project main technologies
Web project challenges
     The search engine works well, but how can we ensure high
    performance under potentially very high traffic?

TO DO:

o Investigate load balancing and fault tolerance
    strategies

o Find a way to decouple index creation from index
    release to production

o Use a read-only, optimized Lucene index in production
Solr collection distribution
[Diagram. Labels, top to bottom: Load Balancer; HTTPD (x3); Load Balancer; JBoss EAP Cluster; Slave Index (x4); Index Replication; Index, with Management, Updates and Administration]
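
The idea of separating index updates from read-only search traffic can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class, its method names and the node labels are hypothetical, and in the architecture above this role is played by the load balancers rather than by application code. Queries rotate across the slave indexes and skip any slave marked as down, while updates always go to the single index that is then replicated.

```python
import itertools


class IndexRouter:
    """Route updates to one master index and queries to healthy slaves."""

    def __init__(self, master, slaves):
        self.master = master
        self.slaves = list(slaves)
        self.down = set()
        self._cycle = itertools.cycle(self.slaves)

    def mark_down(self, node):
        """Record that a slave failed its health check."""
        self.down.add(node)

    def mark_up(self, node):
        """Record that a slave is healthy again."""
        self.down.discard(node)

    def for_update(self):
        """Updates never touch the read-only slaves."""
        return self.master

    def for_query(self):
        """Round-robin over the slaves, skipping the ones marked down."""
        for _ in range(len(self.slaves)):
            node = next(self._cycle)
            if node not in self.down:
                return node
        raise RuntimeError("no healthy slave index available")


if __name__ == "__main__":
    router = IndexRouter("master", ["slave1", "slave2", "slave3"])
    router.mark_down("slave2")
    print([router.for_query() for _ in range(4)])
    print(router.for_update())
```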
On line web project numbers

On the day the project was presented, the site handled very
                   high traffic without any problem

o The historical archive of “La Stampa di Torino” is one of the
    largest freely available digital newspaper archives, comparable
    to those of The Times and The New York Times

o 509,791 page views and 21,352 user sessions on
    1 November 2010

o Nearly 15,000,000 page views in the last year
Current development version challenges
   Browsing the archive by date, article title and text gives a good
         search experience, but how can findability be enhanced?

o Boosting articles with Named Entity Recognition, with the help of
    Celi s.r.l.

o Enhancing user search capabilities with query autocomplete
    suggestions and advanced search over named
    entities: authors, persons, locations, organizations

o Faceting content on all the new article attributes

o Enabling content tagging to collect useful user navigation
    suggestions
Current development version details
o   jQuery UI enriches the user interface

o   Date range filters drive the new timeline
    search widget

o   Multi-select faceting for user search refinement

o   More Like This with named entities for user
    search suggestions
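
As a rough illustration of the date range filter and the multi-select faceting, the following Python sketch assembles Solr query parameters. The field names (issue_date, person, location) are assumptions, not the project's schema. The {!tag=...} and {!ex=...} local parameters are Solr's standard mechanism for multi-select faceting: the facet counts for a field ignore the filter tagged on that same field, so the user still sees the alternatives to the current selection.

```python
from urllib.parse import urlencode


def build_search_params(text, year_from, year_to, selected_persons=()):
    """Return Solr request parameters as a list of (name, value) pairs."""
    params = [
        ("q", text),
        # Date range filter, as a timeline widget might produce it.
        ("fq", f"issue_date:[{year_from}-01-01T00:00:00Z"
               f" TO {year_to}-12-31T23:59:59Z]"),
        ("facet", "true"),
        # Person counts are computed without the tagged person filter.
        ("facet.field", "{!ex=person}person"),
        ("facet.field", "location"),
    ]
    if selected_persons:
        # Names are quoted as phrases; escaping of embedded quotes is omitted.
        clause = " OR ".join(f'"{name}"' for name in selected_persons)
        params.append(("fq", f"{{!tag=person}}person:({clause})"))
    return params


if __name__ == "__main__":
    params = build_search_params("garibaldi", 1867, 1870,
                                 ["Giuseppe Garibaldi"])
    # Query string to append to the select handler URL.
    print(urlencode(params))
```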
Q&A
 nicola.provenzano@bdadoc.it

Bassilichi Group - Firenze - Italy
