Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

Looking beyond plain text for document representation in the enterprise


Published on

In many real life scenarios, searching for information is not the user's end goal. In this presentation I look into the specific example of corporate strategy and business development in a university setting.

In today's academic institutions, strategic questions are those that relate to dependency on funding instruments, the public private partnerships that exist (and those that should be extended!), and the match between topic areas addressed by the research staff and those claimed important by policy makers. The professional search tasks encountered to answer questions in this domain are usually addressed by business intelligence (BI) tools, and not by search engines. However, professionals are known to be busy people inspired by their own research interests, and not particularly fond of keeping the
customer relationship management (CRM) or knowledge management systems up to date for the organisation's strategic interest. This then results in incomplete and inaccurate data.

Instead of requiring research staff (or their administrative support) to provide this management information, I will illustrate by example how the desired information usually exists already in the documents inherent to the academic work process. Information retrieval could thus play an important role in the computer systems that support the business analytics involved, and could significantly improve the coverage of entities of interest - i.e., to reduce the effort involved in achieving good recall in business analytics. The ranking functionality over the enterprise's (textual) content should however not be an isolated component. Our example setting integrates the information derived from research proposals, research publications and the financial systems, providing an excellent motivation for a more unified approach to structured and unstructured data.

Published in: Technology, Business
  • Be the first to comment

Looking beyond plain text for document representation in the enterprise

  1. 1. May 31st, 2013 First SICSA MMI Information Retrieval WorkshopLooking beyond plain text fordocument representation inthe enterpriseArjen P. de Vriesarjen@acm.orgCentrum Wiskunde & InformaticaDelft University of TechnologySpinque B.V.
  2. 2. Outline Motivation Mixed structured and unstructuredsources Search by strategy Equip Open ends
  3. 3. Enterprise Information NeedsHang Li et al. A new approach to intranet search based on information extraction. CIKM’05
  4. 4. Strategic and businessdevelopment needs What funding schemes are the primary sourceof income? E.g., can we move to Europe when Dutch fundingdries up? Who has active relations with partner X? “Valorisation”; new national funding requirements What industry sectors do we depend upon? E.g., how many projects in smart cities? Greenenergy? Cloud computing? Etc. How are strategic decisions implemented? E.g., has objective “move from Telecom toward ICT”been achieved, and how does it develop over time?
  5. 5. A week in the life
  6. 6. Date: Wed, 15 May 2013 15:14:49 +0200From: Theme Coordinator “INFORMATION”To: Group Leaders Information ThemeSubject: List of company relations for internal CWIdistributionDear Information Theme Group Leaders,The theme coordinators have been asked whether they: "eenlijstje kan maken met de bedrijfscontacten en daarbij aan tegeven van welke aard de contacten zijn".Could you send me the names of Dutch companies you are currentlyworking with or have worked with in the recent past by the endof Friday 17th May.The Theme Coordinator
  7. 7. Date: Fri, 24 May 2013 11:33:04 +0200From: Theme Coordinator Life SciencesTo: Group Leaders Life Sciences TeamSubject: Life Sciences: contacts with NL companies?Dear all,The CWI themes are currently collecting all contacts we havewith Dutch industry and companies (but also hospitals and TNOetc.) in order to get an overview. I am doing this forthe theme "Life Sciences".Can you please send me a list of your contacts with shortdescription?Life Sciences Theme Coordinator
  8. 8. From: Project Leader Project XDate: Sun, 26 May 2013 17:34:15 +0200To: Project XSubject: [Project X: 33] @WP-leidersX-BeenThere: Project X @ Y.orgBeste WP-leiders,Ik kreeg van Het Programma Management het volgende verzoek:> Mag ik je vragen me een lijstje te sturen van welk EUonderzoek en welk internationaal onderzoek er loopt bij departners gerelateerd aan Project X (internationale inbedding).Dit is mijn meest urgente punt. Kunnen jullie zsm aan mij stureneen lijstje met de volgende punten:- lijst van lopende EU projecten waarbij mensen uit jouw WPbetrokken zijn; geef aub aan wi de partners zijn,financieringsbron, of het een STREP (of NoE of ...) is, en ofjouw WP een participant of coordinator levert;- lijst van aangevraagde EU projecten, met zelfde extras- lijst van eventuele andere internationale samenwerkingen dieniet door een formeel project zijn afgedektStuur me de lijstjes aub zsm maar niet later dan dinsdag18u. Bedankt voor jullie hulp. De Projectleider
  9. 9. Surely, academia is not like…
  10. 10. The High Cost of Not Finding Info If you employ 1000 knowledge workers: 50% of content unindexed  $2.5million/year 6.25% of effort is spent reproducinginformation that already exists $5 million/year Knowledge workers spend 15-25% oftheir time on non-productiveinformation-related activitiesFeldman and Sherman.IDC Technical Report #29127, 2003Butler Group Report: Enterprise Search and Retrieval. Oct-2006“many organisations are frittering away up to 10% of their staffcosts on wasted effort because employees simply can’t findthe right information to do their jobs.”
  11. 11. So… “the real world” “Real” companies (as opposed toacademic institutions) attempt to addressthese information needs a priori, bysetting up a Customer RelationshipManagement system (CRM)Shan L. Pan and Jae-Nam Lee, "Using e-CRM for a unified view ofthe customer", Communications of the ACM 46(4) (2003): 95-99
  12. 12. However… So-called “Professionals” are well knownto focus on their own expertise They do not have (or take) the time tomaintain adequate descriptions of theirnetwork, skills, projects etc. – neither formost other types of “managementoverhead”
  13. 13. We only need to organize ourselves!!
  14. 14. Funding Proposals Proposals submitted (are supposed to)pass by the faculty’s (TUD) “contractmanagers” or the institute’s (CWI)“project bureau” E.g., checks for liability, IPR and valid budget Proposal and (partial) metadata are added toa content management system (CMS) The CMS used at my faculty at TUD is DECOS; afew other faculties plan to use MicrosoftSharepoint; CWI deploys BSCW
  15. 15. Step 1 Index all the proposals submitted withyour favourite IR system
  16. 16. Incompleteness The DECOS metadata entered is usuallyincomplete from the start For many projects for example, only the coordinatoris entered as partner Also, a proposal’s metadata does not reflectsubsequent change; e.g., as in PuppyIR: People hired after funding secured Partner change when key person moved job Teams evolved Priorities shifted New tasks introduced and tasks (re-)assigned …
  17. 17. Incompleteness In general: A project’s proposal or even the contractseldomly represents the project’s exact future
  18. 18. Inaccuracy Key information necessary for strategy &business development scenarios missing Adding those is error-prone Infer domain (big data, green energy, cloudcomputing, …) from keywords or content Extract names automatically Copy amounts manually; inconsistencies intables in proposal text are not uncommon
  19. 19. Incomplete & inaccurate Data Ambiguity When describing domain, e.g., cloudcomputing vs. clouds in environmental models Names of people and companies involved Typos & OCR mistakes Entity resolution Amounts of funding per partner, owncontribution Funding request may not equal fundingreceived
  20. 20. The real world to rescue (1) Not much work gets done withoutpayments…
  21. 21. ERP All large organisations deploy EnterpriseResource Planning (ERP) systems Typical modules include accounting, humanresources, manufacturing, and logistics ERP integrates the modules, datastoring/retrieving processes, andmanagement and analysis functionalities Baan, Oracle, PeopleSoft, SAP, …
  22. 22. More complete and moreaccurate data from ERP Financial details of each project as executed Project leader People who are reimbursed from the project Exact duration of project activities ...
  23. 23. Step 2 Index all the ERP data with your favouriteIR system Link the ERP project identifiers to the CMSproposal identifiers Surprisingly, an n:m relationship…DB +
  24. 24. The real world to rescue (2)
  25. 25. Institutional Repository Publication metadata helps validateexisting (and may even extend) themanagement info required: Authors Author affiliations Projects and funding schemes (fromacknowledgements)? Again incomplete data though… Especially my faculty notoriously bad atmaintaining their part of the institutionalrepository
  26. 26. Step 3 Crawl the Institutional Repository usingthe Open Archives Initiative (OAI)harvesting protocol Index all the publications data with yourfavourite DB + IR system Relate projects to publications by authorname, similar title, etc.
  27. 27. Result: Unified Access Proposals from an XML dump of the CMS Actual project administration from CSVs extracted from ERP Publications crawled using OAI, from the IRP
  28. 28. Schema
  29. 29. Heterogeneous content! BAAN-project (ERP) Decos-project (CMS) Decos-document (CMS attachments) Publication (Institutional Repository) Publication-document (Institutional Repository PDFs) Person (adress lists, ERP + CMS mentions) Company (CMS + ERP + document mentions) Subsidy (CMS) Department (address lists, CMS) Web addresses (extracted from documents) Topic (assigned to publications) Research programme (dependent on funding scheme)
  30. 30. Schema V2
  31. 31. How to search that graph???! Rank (un-/semi-)structured data to dealwith incompleteness & inaccuracies Structured data representation forattributes including project revenu,people’s names, starting dates, etc. Use cases varying from “expert search” to“data cleaning” and “visual analytics”
  32. 32. Search by Strategy First, visually construct search strategiesby connecting “building blocks”
  33. 33. Search by Strategy First, visually construct search strategiesby connecting “building blocks” Next, generate the search engine specifiedby that search strategy
  34. 34. Strategies: DB+IR query plans DatabaseSpinque: RDBMS (MonetDB)BB1(in1,in2,in3, u1,u2)in1 in2 in3outBB2(in1)in1out• Data flowSpinque: strategy• Query: strategy made operationalSpinque: PRACREATE VIEW a ASSELECT ..CREATE VIEW b ASSELECT ..CREATE VIEW c ASSELECT ..StrategyRelational DB
  35. 35. Probabilistic Relational AlgebraStrategyRelational DB• SQLexplicit probabilitiesCREATE VIEW x ASSELECT a1, a3,1-prod(1-prob) AS probFROM yGROUP BY a1, a3;• PRA: probabilisticrelational algebra(Fuhr and Roelleke,TOIS 2001)x = Project DISTINCT[$1,$3](y);
  36. 36. Rank by Text
  37. 37. Expert Finding
  38. 38. Search User Interface
  39. 39. Search results
  40. 40. Result List Interactions Zoom in on item using “+”: Open item in left pane Shows results of item as query, using aresult-type specific search strategy Goal to provide contextually most related nodesfrom underlying graph Marking any item red/yellow/green forlater usage
  41. 41. Browse by facet
  42. 42. Strategic and businessdevelopment needs What are our industry relations? Who of these partners collaborate withmore than one group? What funding schemes support thesecollaborations?
  43. 43. Note: relations between partners and departments, edge strength represents revenue
  44. 44. Note: relations between partners and departments, edge strength represents revenue
  45. 45. Multi party relationsGrouping of external relationsForeignUniv.NL Univ.FundingagencyPublic NLPublicforeignPrivatesectorMulti party relationsGrouping of external relationsForeignUniv.NL Univ.FundingagencyPublic NLPublicforeignPrivatesectorNote: External relations with at least two departments; node size w.r.t. number of relations
  46. 46. Initial Findings The integrated search helps improverecall, reducing the effort involved andleading to higher quality analyses Many things that could be done evenmore automatically (albeit not perfectly)seem less important than expected We use very simple rules to extract URIs andcompanies; no information extraction yet Information professional will always look intoresults in detail
  47. 47. Open issues Integrate visualization Idea: select result list and facet Too many facets Idea: group facets Result explanations Idea: describe path through graph Entity support ++
  48. 48. Open issues What strategy is good? Why? Idea: test using past usage data What are the right user roles? Who should do the searches? Who should write strategies?~ who writes the SQL queries in traditional DB? Human in the loop for retrieval, but notyet for indexing…
  49. 49. Questions?