May 31st, 2013 First SICSA MMI Information Retrieval Workshop
Looking beyond plain text for
document representation in
the enterprise
Arjen P. de Vries
arjen@acm.org
Centrum Wiskunde & Informatica
Delft University of Technology
Spinque B.V.
Outline
• Motivation
• Mixed structured and unstructured sources
• Search by strategy
• Equip
• Open ends
Enterprise Information Needs
Hang Li et al. A new approach to intranet search based on information extraction. CIKM’05
Strategic and business development needs
• What funding schemes are the primary source of income?
  - E.g., can we move to Europe when Dutch funding dries up?
• Who has active relations with partner X?
  - “Valorisation”; new national funding requirements
• What industry sectors do we depend upon?
  - E.g., how many projects in smart cities? Green energy? Cloud computing? Etc.
• How are strategic decisions implemented?
  - E.g., has objective “move from Telecom toward ICT” been achieved, and how does it develop over time?
A week in the life
Date: Wed, 15 May 2013 15:14:49 +0200
From: Theme Coordinator “INFORMATION”
To: Group Leaders Information Theme
Subject: List of company relations for internal CWI
distribution
Dear Information Theme Group Leaders,
The theme coordinators have been asked whether they "could make a
list of the company contacts, indicating the nature of each
contact".
Could you send me the names of Dutch companies you are currently
working with or have worked with in the recent past by the end
of Friday 17th May?
The Theme Coordinator
Date: Fri, 24 May 2013 11:33:04 +0200
From: Theme Coordinator Life Sciences
To: Group Leaders Life Sciences Team
Subject: Life Sciences: contacts with NL companies?
Dear all,
The CWI themes are currently collecting all contacts we have
with Dutch industry and companies (but also hospitals and TNO
etc.) in order to get an overview. I am doing this for
the theme "Life Sciences".
Can you please send me a list of your contacts with a short
description?
Life Sciences Theme Coordinator
From: Project Leader Project X
Date: Sun, 26 May 2013 17:34:15 +0200
To: Project X
Subject: [Project X: 33] @WP leaders
X-BeenThere: Project X @ Y.org
Dear WP leaders,
I received the following request from Programme Management:
> May I ask you to send me a list of the EU research and
international research under way at the partners that is
related to Project X (international embedding).
This is my most urgent item. Could you send me, as soon as
possible, a list covering the following points:
- list of ongoing EU projects involving people from your WP;
please indicate who the partners are, the funding source,
whether it is a STREP (or NoE or ...), and whether your WP
provides a participant or the coordinator;
- list of submitted EU project proposals, with the same extras
- list of any other international collaborations that are
not covered by a formal project
Please send me the lists as soon as possible, and no later than
Tuesday 18:00. Thanks for your help. The Project Leader
Surely, academia is not like…
The High Cost of Not Finding Info
• If you employ 1000 knowledge workers:
  - 50% of content unindexed → $2.5 million/year
  - 6.25% of effort is spent reproducing information that already exists → $5 million/year
• Knowledge workers spend 15-25% of their time on non-productive information-related activities
Feldman and Sherman. IDC Technical Report #29127, 2003
Butler Group Report: Enterprise Search and Retrieval. Oct-2006
“many organisations are frittering away up to 10% of their staff
costs on wasted effort because employees simply can’t find
the right information to do their jobs.”
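The arithmetic behind the second figure can be checked with a short sketch. The annual cost per knowledge worker is an assumption here: it is not stated on the slide, and $80,000 is simply the value implied by 1000 workers, 6.25% of effort, and $5 million/year.

```python
# Back-of-the-envelope check of the "reproducing existing information" cost.
KNOWLEDGE_WORKERS = 1000          # from the slide
REWORK_SHARE = 0.0625             # 6.25% of effort, from the slide
ANNUAL_COST_PER_WORKER = 80_000   # ASSUMPTION: implied by the slide's totals, not stated on it


def rework_cost(workers, share, cost_per_worker):
    """Yearly cost of effort spent recreating information that already exists."""
    return workers * share * cost_per_worker


annual_rework_cost = rework_cost(KNOWLEDGE_WORKERS, REWORK_SHARE, ANNUAL_COST_PER_WORKER)
print(f"${annual_rework_cost:,.0f} per year")
```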
So… “the real world”
• “Real” companies (as opposed to academic institutions) attempt to address these information needs a priori, by setting up a Customer Relationship Management (CRM) system
Shan L. Pan and Jae-Nam Lee, "Using e-CRM for a unified view of
the customer", Communications of the ACM 46(4) (2003): 95-99
However…
• So-called “professionals” are well known to focus on their own expertise
• They do not have (or take) the time to maintain adequate descriptions of their network, skills, projects, etc., nor for most other types of “management overhead”
We only need to organize ourselves!!
Funding Proposals
• Submitted proposals (are supposed to) pass by the faculty’s (TUD) “contract managers” or the institute’s (CWI) “project bureau”
  - E.g., checks for liability, IPR and valid budget
• Proposal and (partial) metadata are added to a content management system (CMS)
  - The CMS used at my faculty at TUD is DECOS; a few other faculties plan to use Microsoft SharePoint; CWI deploys BSCW
Step 1
• Index all the proposals submitted with your favourite IR system
Incompleteness
• The DECOS metadata entered is usually incomplete from the start
  - For many projects, for example, only the coordinator is entered as partner
  - Also, a proposal’s metadata does not reflect subsequent change; e.g., as in PuppyIR:
    - People hired after funding was secured
    - Partner changed when a key person moved jobs
    - Teams evolved
    - Priorities shifted
    - New tasks introduced and tasks (re-)assigned
    - …
Incompleteness
• In general:
  - A project’s proposal or even the contract seldom represents the project’s exact future
Inaccuracy
• Key information necessary for strategy & business development scenarios is missing
• Adding it is error-prone
  - Infer domain (big data, green energy, cloud computing, …) from keywords or content
  - Extract names automatically
  - Copy amounts manually; inconsistencies in tables in proposal text are not uncommon
Incomplete & inaccurate data
• Ambiguity
  - When describing domain, e.g., cloud computing vs. clouds in environmental models
  - Names of people and companies involved
    - Typos & OCR mistakes
    - Entity resolution
  - Amounts of funding per partner, own contribution
    - Funding request may not equal funding received
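Resolving name variants caused by typos and OCR mistakes can be sketched with approximate string matching. This is a minimal illustration using Python's standard `difflib`, not the approach used in the system described here; the function names and the 0.85 threshold are assumptions chosen for the example.

```python
from difflib import SequenceMatcher


def normalise(name):
    """Lowercase, replace punctuation by spaces, and collapse whitespace."""
    kept = "".join(ch if ch.isalnum() or ch.isspace() else " " for ch in name.lower())
    return " ".join(kept.split())


def similarity(a, b):
    """Similarity in [0, 1] between two normalised names."""
    return SequenceMatcher(None, normalise(a), normalise(b)).ratio()


def resolve(mention, known_entities, threshold=0.85):
    """Return the best-matching known entity, or None if nothing is close enough.

    The threshold is a hypothetical value; it would need tuning on real data.
    """
    best = max(known_entities, key=lambda e: similarity(mention, e), default=None)
    if best is not None and similarity(mention, best) >= threshold:
        return best
    return None


known = ["Spinque B.V.", "Centrum Wiskunde & Informatica"]
print(resolve("Centrum Wiskunde & lnformatica", known))  # OCR-style error: "l" for "I"
```

A match above the threshold is only a candidate; given the ambiguity noted above, it still calls for a human check before records are merged.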
The real world to the rescue (1)
• Not much work gets done without payments…
ERP
• All large organisations deploy Enterprise Resource Planning (ERP) systems
  - Typical modules include accounting, human resources, manufacturing, and logistics
  - ERP integrates the modules, data storing/retrieving processes, and management and analysis functionalities
  - Baan, Oracle, PeopleSoft, SAP, …
More complete and more accurate data from ERP
• Financial details of each project as executed
• Project leader
• People who are reimbursed from the project
• Exact duration of project activities
• ...
Step 2
• Index all the ERP data with your favourite DB + IR system
• Link the ERP project identifiers to the CMS proposal identifiers
  - Surprisingly, an n:m relationship…
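Because the relationship between ERP projects and CMS proposals is n:m, the link cannot be stored as a single foreign key on either side; a separate link table is one common way to model it. The sketch below uses SQLite from Python's standard library, and every table name, column name, and identifier in it is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE erp_project  (erp_id TEXT PRIMARY KEY, title TEXT);
CREATE TABLE cms_proposal (cms_id TEXT PRIMARY KEY, title TEXT);
-- Link table: one row per (ERP project, CMS proposal) pair.
CREATE TABLE project_proposal (
    erp_id TEXT REFERENCES erp_project(erp_id),
    cms_id TEXT REFERENCES cms_proposal(cms_id),
    PRIMARY KEY (erp_id, cms_id)
);
""")

# Made-up data: one proposal executed as two ERP projects,
# and one ERP project backed by a proposal and its resubmission.
conn.executemany("INSERT INTO erp_project VALUES (?, ?)",
                 [("E1", "Project A, phase 1"), ("E2", "Project A, phase 2"), ("E3", "Project B")])
conn.executemany("INSERT INTO cms_proposal VALUES (?, ?)",
                 [("C1", "Project A proposal"), ("C2", "Project B proposal"),
                  ("C3", "Project B resubmission")])
conn.executemany("INSERT INTO project_proposal VALUES (?, ?)",
                 [("E1", "C1"), ("E2", "C1"), ("E3", "C2"), ("E3", "C3")])


def proposals_for(erp_id):
    """CMS proposals linked to one ERP project."""
    rows = conn.execute(
        "SELECT cms_id FROM project_proposal WHERE erp_id = ? ORDER BY cms_id", (erp_id,))
    return [r[0] for r in rows]


def projects_for(cms_id):
    """ERP projects linked to one CMS proposal."""
    rows = conn.execute(
        "SELECT erp_id FROM project_proposal WHERE cms_id = ? ORDER BY erp_id", (cms_id,))
    return [r[0] for r in rows]


print(proposals_for("E3"), projects_for("C1"))
```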
The real world to the rescue (2)
Institutional Repository
• Publication metadata helps validate (and may even extend) the existing management info required:
  - Authors
  - Author affiliations
  - Projects and funding schemes (from acknowledgements)?
• Again incomplete data though…
  - My faculty especially is notoriously bad at maintaining its part of the institutional repository
Step 3
• Crawl the Institutional Repository using the Open Archives Initiative (OAI) harvesting protocol
• Index all the publications data with your favourite DB + IR system
• Relate projects to publications by author name, similar title, etc.
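The harvesting and linking steps can be sketched as follows. The example parses a hand-written OAI-PMH `ListRecords` response in the `oai_dc` (Dublin Core) format and links publications to projects by normalised author name. The sample record, the project membership data, and all function names are assumptions made for illustration; a real harvest would fetch such responses from the repository's OAI-PMH endpoint, and the matching would need to be fuzzier than an exact name comparison.

```python
import xml.etree.ElementTree as ET

# XML namespaces used by OAI-PMH responses carrying Dublin Core metadata.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "oai_dc": "http://www.openarchives.org/OAI/2.0/oai_dc/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Hand-written sample response (hypothetical record, not real repository data).
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Example publication on enterprise search</dc:title>
          <dc:creator>Doe, J.</dc:creator>
          <dc:creator>Roe, R.</dc:creator>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""


def parse_records(xml_text):
    """Extract identifier, title and creators from a ListRecords response."""
    root = ET.fromstring(xml_text)
    records = []
    for rec in root.findall(".//oai:record", NS):
        records.append({
            "id": rec.findtext("oai:header/oai:identifier", default="", namespaces=NS),
            "title": rec.findtext(".//dc:title", default="", namespaces=NS),
            "creators": [c.text for c in rec.findall(".//dc:creator", NS)],
        })
    return records


def normalise(name):
    """Crude name normalisation: lowercase, drop dots and commas."""
    return " ".join(name.lower().replace(".", " ").replace(",", " ").split())


def relate(project_members, records):
    """Map each project to the ids of publications sharing at least one author."""
    links = {}
    for project, members in project_members.items():
        wanted = {normalise(m) for m in members}
        links[project] = [r["id"] for r in records
                          if wanted & {normalise(c) for c in r["creators"]}]
    return links


records = parse_records(SAMPLE_RESPONSE)
links = relate({"Project P1": ["Doe, J."], "Project P2": ["Smith, A."]}, records)
print(links)
```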
Result: Unified Access
• Proposals
  - from an XML dump of the CMS
• Actual project administration
  - from CSVs extracted from ERP
• Publications
  - crawled using OAI, from the IRP
Schema
Heterogeneous content!
• BAAN-project (ERP)
• Decos-project (CMS)
• Decos-document (CMS attachments)
• Publication (Institutional Repository)
• Publication-document (Institutional Repository PDFs)
• Person (address lists, ERP + CMS mentions)
• Company (CMS + ERP + document mentions)
• Subsidy (CMS)
• Department (address lists, CMS)
• Web addresses (extracted from documents)
• Topic (assigned to publications)
• Research programme (dependent on funding scheme)
Schema V2
How to search that graph???!
• Rank (un-/semi-)structured data to deal with incompleteness & inaccuracies
• Structured data representation for attributes including project revenue, people’s names, starting dates, etc.
• Use cases varying from “expert search” to “data cleaning” and “visual analytics”
Search by Strategy
• First, visually construct search strategies by connecting “building blocks”
• Next, generate the search engine specified by that search strategy
Strategies: DB+IR query plans
• Database
  - Spinque: RDBMS (MonetDB)
• Data flow: building blocks with named inputs and one output, e.g. BB1(in1, in2, in3, u1, u2) and BB2(in1)
  - Spinque: strategy
• Query: strategy made operational
  - Spinque: PRA
Strategy → Relational DB:
CREATE VIEW a AS
SELECT ..
CREATE VIEW b AS
SELECT ..
CREATE VIEW c AS
SELECT ..
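The idea of a strategy as a data flow of building blocks can be illustrated with plain functions, each taking scored inputs and producing one scored output. This is a toy sketch and not Spinque's implementation: the block names, the term-overlap scoring, and the sample data are all assumptions, and a real strategy would be compiled into database views rather than executed in Python.

```python
def rank_by_text(docs, query):
    """Block: score each document by the fraction of query terms it contains."""
    terms = query.lower().split()
    scores = {}
    for doc_id, text in docs.items():
        words = set(text.lower().split())
        hits = sum(1 for t in terms if t in words)
        if hits:
            scores[doc_id] = hits / len(terms)
    return scores


def traverse(scores, edges):
    """Block: push scores along (source, target) edges, combining evidence
    for the same target as 1 - prod(1 - p)."""
    out = {}
    for src, dst in edges:
        if src in scores:
            out[dst] = 1 - (1 - out.get(dst, 0.0)) * (1 - scores[src])
    return out


def top_k(scores, k):
    """Block: keep the k best-scoring items (ties broken by name)."""
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:k]


def expert_strategy(docs, authored_by, query, k=3):
    """Strategy: rank documents by text, move to their authors, keep the top k."""
    return top_k(traverse(rank_by_text(docs, query), authored_by), k)


# Hypothetical documents and authorship edges.
docs = {
    "d1": "probabilistic relational algebra",
    "d2": "enterprise search strategy",
    "d3": "search by strategy",
}
authored_by = [("d1", "alice"), ("d2", "bob"), ("d3", "bob"), ("d3", "carol")]
print(expert_strategy(docs, authored_by, "enterprise search"))
```

Swapping one block for another, for instance a different text ranker, changes the resulting search engine without touching the rest of the flow.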
Probabilistic Relational Algebra
Strategy → Relational DB
• SQL with explicit probabilities:
CREATE VIEW x AS
SELECT a1, a3,
1-prod(1-prob) AS prob
FROM y
GROUP BY a1, a3;
• PRA: probabilistic relational algebra (Fuhr and Roelleke, TOIS 1997):
x = Project DISTINCT [$1,$3](y);
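The SQL view above can be tried out on a small scale. The sketch below runs the same query shape in SQLite from Python; since SQLite has no built-in product aggregate, it registers a user-defined aggregate (named `noisy_or` here, an assumption) that computes 1 - prod(1 - prob) per group. The table contents are made up.

```python
import sqlite3


class NoisyOr:
    """User-defined aggregate: 1 - product of (1 - p) over the group."""

    def __init__(self):
        self.complement = 1.0

    def step(self, prob):
        self.complement *= 1.0 - prob

    def finalize(self):
        return 1.0 - self.complement


conn = sqlite3.connect(":memory:")
conn.create_aggregate("noisy_or", 1, NoisyOr)

# y(a1, a2, a3, prob): hypothetical probabilistic relation.
conn.execute("CREATE TABLE y (a1 TEXT, a2 TEXT, a3 TEXT, prob REAL)")
conn.executemany("INSERT INTO y VALUES (?, ?, ?, ?)", [
    ("p1", "d1", "alice", 0.5),
    ("p1", "d2", "alice", 0.5),
    ("p1", "d3", "bob", 0.25),
])

# Projection on (a1, a3) with duplicate elimination: tuples that collapse
# into one are combined as independent events.
conn.execute("""
CREATE VIEW x AS
SELECT a1, a3, noisy_or(prob) AS prob
FROM y
GROUP BY a1, a3
""")

rows = conn.execute("SELECT a1, a3, prob FROM x ORDER BY a3").fetchall()
print(rows)
```

Two tuples with probability 0.5 that project onto the same (a1, a3) pair yield 1 - 0.5 × 0.5 = 0.75, which is the effect of the distinct projection under an independence assumption.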
Rank by Text
Expert Finding
Search User Interface
Search results
Result List Interactions
• Zoom in on item using “+”:
  - Open item in left pane
  - Shows results of item as query, using a result-type specific search strategy
    - Goal to provide contextually most related nodes from underlying graph
• Marking any item red/yellow/green for later usage
Browse by facet
Strategic and business development needs
• What are our industry relations?
• Who of these partners collaborate with more than one group?
• What funding schemes support these collaborations?
Note: relations between partners and departments, edge strength represents revenue
Multi party relations
Grouping of external relations: Foreign Univ., NL Univ., Funding agency, Public NL, Public foreign, Private sector
Note: External relations with at least two departments; node size w.r.t. number of relations
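The selection behind this view, external relations that involve at least two departments, amounts to a grouping with a count filter. A minimal sketch follows; the (department, partner, revenue) rows are invented, and the function name is hypothetical.

```python
from collections import defaultdict

# Hypothetical (department, partner, revenue) rows from the integrated data.
relations = [
    ("Dept A", "Partner X", 100.0),
    ("Dept B", "Partner X", 50.0),
    ("Dept A", "Partner Y", 75.0),
    ("Dept C", "Partner Z", 20.0),
    ("Dept B", "Partner Z", 30.0),
    ("Dept B", "Partner Z", 10.0),
]


def multi_department_partners(rows, min_departments=2):
    """Partners related to at least `min_departments` distinct departments,
    with their department count and total revenue."""
    departments = defaultdict(set)
    revenue = defaultdict(float)
    for dept, partner, amount in rows:
        departments[partner].add(dept)
        revenue[partner] += amount
    return {p: (len(d), revenue[p])
            for p, d in departments.items() if len(d) >= min_departments}


print(multi_department_partners(relations))
```

The department count could drive node size and the revenue total could drive edge strength in a visualisation.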
Initial Findings
• The integrated search helps improve recall, reducing the effort involved and leading to higher quality analyses
• Many things that could be done even more automatically (albeit not perfectly) seem less important than expected
  - We use very simple rules to extract URIs and companies; no information extraction yet
  - Information professionals will always look into results in detail
Open issues
• Integrate visualization
  - Idea: select result list and facet
• Too many facets
  - Idea: group facets
• Result explanations
  - Idea: describe path through graph
• Entity support ++
Open issues
• What strategy is good? Why?
  - Idea: test using past usage data
• What are the right user roles?
  - Who should do the searches?
  - Who should write strategies? ~ who writes the SQL queries in traditional DB?
• Human in the loop for retrieval, but not yet for indexing…
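The idea of explaining a result by describing a path through the graph can be sketched with a breadth-first search over labelled edges. This is an illustration only: the node names, edge labels, and the choice to treat edges as undirected are assumptions, and a shortest path is just one possible notion of a good explanation.

```python
from collections import deque


def explain_path(edges, start, goal):
    """Return a readable shortest path from start to goal, or None.

    `edges` holds (source, label, target) triples; they are followed in
    both directions (an assumption made for this sketch).
    """
    graph = {}
    for src, label, dst in edges:
        graph.setdefault(src, []).append((label, dst))
        graph.setdefault(dst, []).append((label, src))
    queue = deque([(start, [start])])
    seen = {start}
    while queue:
        node, path = queue.popleft()
        if node == goal:
            return " ".join(path)
        for label, nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + [f"--{label}--", nxt]))
    return None


# Hypothetical fragment of the integrated graph.
edges = [
    ("Person: A", "member of", "Project: P"),
    ("Project: P", "funded by", "Subsidy: S"),
    ("Person: A", "author of", "Publication: Q"),
]
print(explain_path(edges, "Person: A", "Subsidy: S"))
```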
Questions?
