Engineering Web Search Applications

This tutorial, offered at the 10th International Conference on Web Engineering, presents the peculiarities of advanced Web search applications, describes some tools and techniques that can be exploited, and offers a methodological approach to development. The approach proposed in this tutorial is based on the paradigm of Model Driven Development (MDD), where models are the core artifacts of the application life-cycle and model transformations progressively refine models to achieve an executable version of the system. To cope with the process-intensive nature of the main interactions (e.g., content analysis and query management), we describe the use of Process Models (e.g., BPMN models). Indeed, search-based applications are both process- and content-intensive, given the trend toward exploratory search and the search-as-a-process vision.

Usage Rights

CC Attribution License

  • i.e. it might not be clear to the system whether the user is “recall-oriented” or “precision-oriented”
  • In information retrieval, users express their information needs as queries submitted to the system. While in data management systems like databases users are often required to express queries in a formal, structured language (e.g., SQL or XQuery, which have exact matching predicates and unambiguous semantics), in information retrieval the semantics of the query corresponds to the semantics associated with its content, which is interpreted in order to retrieve the relevant results. Hence, it is not possible to provide a taxonomy for information retrieval queries based, for instance, on the expressive power of the underlying query language. Nonetheless, we can provide a functional classification of queries as follows.
  • Complex search is characterized by: multiple searches, possibly over multiple sessions and spanning multiple sources of information; a combination of exploration and more directed information-finding activities; the need for note-taking; and the variation of the search goal during the search process.
  • From a high-level perspective, “search” is enabled by mechanisms that allow the extraction of contents from data repositories (e.g., text files, audio files, video files, databases). Contents are then processed in order to build an index of the managed information, optimized for efficiently answering users’ queries. Before being indexed, contents are analyzed and enriched with annotations that build the contents’ representation. Along with the index, search relies on ranking models, i.e., mathematical methods that associate a score with the relevance of a content item w.r.t. a query. Once contents are indexed, multiple user interfaces (e.g., Web applications) give users the means to interact with the search engine by executing queries and displaying the retrieved results.
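The index-then-rank idea in the note above can be illustrated with a minimal sketch. It assumes a whitespace tokenizer and a plain TF-IDF score; the function names (`tokenize`, `build_index`, `search`) and the three sample documents are hypothetical and are not taken from the tutorial or from any particular search engine.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on whitespace; real analyzers do far more."""
    return text.lower().split()


def build_index(docs):
    """Build an inverted index: term -> {doc_id: term frequency}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, tf in Counter(tokenize(text)).items():
            index[term][doc_id] = tf
    return index


def search(index, n_docs, query):
    """Score documents by the summed TF-IDF of the query terms, best first."""
    scores = defaultdict(float)
    for term in tokenize(query):
        postings = index.get(term, {})
        if not postings:
            continue  # term not in the collection
        idf = math.log(n_docs / len(postings))
        for doc_id, tf in postings.items():
            scores[doc_id] += tf * idf
    # Sort by descending score, then by doc id to keep ties deterministic.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical sample collection.
docs = {
    "d1": "web search engine indexing",
    "d2": "video search and audio search",
    "d3": "database query language",
}
index = build_index(docs)
results = search(index, len(docs), "video search")
```

Here `results` lists `d2` before `d1`, because `d2` contains the rarer term "video" and mentions "search" twice; `d3` shares no query term and is not returned at all.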
  • We define: (i) an Indexing process (represented as a dashed line), which addresses the indexation of contents coming from the application data sources (thus involving data retrieval from external sources, transformation or aggregation of the retrieved data and, finally, their indexation); (ii) a Query and Result Presentation (QRP) process (represented as a solid line), addressing the operations related to query execution, orchestration and result-set composition; (iii) a User Interaction process (represented as a dotted line), i.e., the way users interact with the application’s functionalities.
  • One aspect of the proposed development framework is the definition of a methodology for the design and implementation of the application to be produced. A development approach based on a formal methodology and appropriate high-level modeling languages smoothly incorporates change management into the mainstream production life-cycle, and greatly reduces the risk of breaking the software engineering process when changes occur. The proposed methodology follows the MDD approach by relying on incremental, iterative design steps that foster separation of concerns among the actors involved in the SBA design. The Conceptual Design macro activity represents the core of the development life-cycle, since it involves the main design activities. In the terminology of MDD, the BPMN Process Model can be seen as a Computation Independent Model (CIM), which specifies SBA requirements for the CAI and QRP processes; as we will see, the UI process is instead addressed as an interaction pattern composition activity. The WebML application model is a Platform Independent Model (PIM), which exploits SOA and Web hypertext interfaces as a technical space. Finally, the application code is a Platform Specific Model (PSM) for the Java 2 technical space. Initially, requirements are conceptualized in a Domain Model, which formalizes the essential data objects managed by the application, and a Process Model, which pinpoints the workflow of the CAI, QRP and UI processes. The link between the domain and process models is established by the type of objects that flow between activities. The designed solutions do not take into account domain-specific information such as the schema of the adopted search technologies or the format of the annotations produced by the analysis components. Nonetheless, the focus on a specific class of applications allows one to include, in the business model, high-level concepts relative to the applications’ domain.
For SBAs, for instance, these are the concepts of query, user, index and so on. The use of a high-level model combined with coarse-grained domain concepts allows one to address the designed application in perspective, possibly by creating designs that can be applied to classes of applications (e.g., audiovisual search engines) rather than to individual solutions. Abstract-level notation, though, cannot be translated into running code, due to the lack of the platform-specific details (e.g., the technologies adopted by actual search engines, analysis components, deployment platform) needed to enact code generation. The Domain Model and Process Model are then subject to a first (CIM to PIM) transformation, which produces the Application Model and process metadata objects. Therefore, coarse-grained design is followed by refinements that take into account more domain-specific information, like the structure and format of the contents, the annotations and the indexes. To do so, a finer-grained model is adopted, in order to enable the definition of domain- and application-specific details that can lead to automatic code generation. The proposed approach is generic enough to adopt alternative modeling languages, both for process and application design. This slide discusses how to derive an application model from a high-level process model. The proposed framework employs the BPMN modeling language for process specification and the WebML modeling language for the design of hypertexts and Web service orchestrations.
  • Let’s now take a bird’s-eye view of some reference example designs for all three identified SBA processes. The CAI process can be defined as the work to be performed by the actors of an SBA to achieve the indexation of a content item. The goal of the domain model is to formalize the content- and index-related data and metadata managed by search applications. Such models build on five basic domain concepts:
    + Content Item: an individual information unit that is relevant in a search-based Web application for indexing purposes.
    + Annotation: the textual information associated with a content item for indexing and searching purposes. Such information can be of different kinds: manual annotations, provided by the content provider or by the user, and automatically generated annotations, produced by the search application during the Indexing process.
    + Usage Group: Content Items are published by one or more Content Providers, which are responsible for their publication. A Usage Group is an access profile specified by a content provider to define the set of operations allowed on a given content item for a set of users.
    + Index: the notion of Index, well known in many disciplines of computer science, denotes a data structure designed to optimize speed and performance in finding relevant content items for a search query.
  • User interaction design, instead, requires a small paradigm shift in the proposed methodology, since we manage it not as a process but as an assembly of standard interaction schemas expressed as patterns. The reason for this shift is that user interaction cannot be expressed as a linear process, given that users act driven by tasks that cannot always be serialized. Traditional information retrieval is inherently based on users searching for information, the so-called “information need”. Recent studies extended the importance of this cognitive process, embedding it into a broader category named information seeking. Such an extension is motivated by the fact that information needs and retrieval stem from social, cultural, biological and anthropological contexts that broaden the ways information is gathered. A commonly accepted taxonomy of information seeking identifies four modes. (This taxonomy considers two orthogonal classification dimensions: Directed and Undirected refer to whether an individual explicitly seeks information by specifying her need by means of a query, or is more or less randomly exposing herself to information; Active and Passive, instead, refer to whether the individual does anything actively to acquire information, or is passively available to absorb information but does not seek it out. With the advent of the so-called Web 2.0, the four information seeking interactions listed in the previous section have been enhanced by the availability of new features. The transformation of end-users from passive recipients of content and communication into active contributors gave them new flavors, providing additional means for all four interaction modes.)
We identified more than 30 patterns, which we organized into 3 categories: + Query and result presentation patterns, containing general-purpose patterns that enable the execution of queries addressed to the search application and the presentation of their results; + Information Interaction patterns, for the specification of the four information seeking modalities presented in the previous section; + Permission Management patterns, containing general-purpose patterns that enable usage permission management.
  • Thanks to the implemented extensions, we inject more information into the higher-level model, thus leading to: + finer-grained application models + fewer errors + more efficiency. Transformations were implemented in ATL, a language for model transformations. Here’s a graphical example of a model transformation between BPMN* activities and the WebML model, and here’s a hint of how transformations are coded.
  • Indri/Lemur: language modeling, BM25, Okapi, cosine similarity, InQuery; Lucene: TF-IDF weighted by term occurrences, fielded search; Terrier: Okapi BM25, language modeling and TF-IDF, Divergence from Randomness; your own re-ranking code using open search.
  • Not enough comparative benchmarks out there. Hard to do; we really need standards. Optimize each platform, per hardware and data set. Lots of platforms, with different APIs, options and numerical settings. Need good, diverse data sets, small and large. Lucene was the only solution that produced an index smaller than the input data size. It shaves off an additional 5 megabytes if one runs it in optimize mode, but at the cost of adding another ten seconds to indexing. sphinx and zettair index the fastest. Interestingly, I ran zettair in big-and-fast mode (which sucks up 300+ megabytes of RAM) but it ran slower by 3 seconds (maybe because of the nature of tweets). Xapian ran 5x slower than sqlite (which stores the raw input data in addition to the index) and produced the largest index file sizes. The default index_text method in Xapian stores positional information, which blew the index size up to 529 megabytes. One must use index_text_without_positions to make the size more reasonable. I checked my Xapian code against the examples and documentation to see if I was doing something wrong, but I couldn’t find any discrepancies. I also included a column about development issues I encountered. zettair was by far the easiest to use (simple command line) but required transforming the input data into a new format. I had some text issues with sqlite (which also needs to be recompiled with FTS3 enabled) and sphinx, given their strict input constraints. sphinx also requires a conf file, for which full examples took some searching to find. Lucene, zettair and Xapian were the most forgiving when it came to accepting text inputs (zero errors).
  • With a larger data set (3x larger than the Twitter one), we see zettair’s indexing performance improve (which makes sense, as it is designed more for larger corpora); zettair’s search speed should probably be a bit faster, because its search command line utility prints some unnecessary stats. For multi-searching in sphinx, I developed a Java client (with the hope of making it competitive with Lucene – the one to beat) which connects to the sphinx searchd server via a socket (that’s their API model in the examples). sphinx returned searches the fastest – ~3x faster than Lucene. Its indexing time was also on par with zettair. Lucene obtained the highest relevance and smallest index size. The index time could probably be improved by fiddling with its merge parameters, but I wanted to avoid numerical adjustments in this evaluation. Xapian has very similar search performance to Lucene but with significant indexing costs (both time and space > 3x). sqlite has the worst relevance because it doesn’t sort by relevance, nor does it seem to provide an ORDER BY function to do so.
  • <!-- When a message arrives on the portType operation "process", instantiate a variable named "Request" --> <!-- Typically the request will contain a single Record. Multiple Records are produced, for example, by annotators that examine zip|rar|tgz archives. The extension activity will be executed if the 'workflow-attribute' attribute on the record contains the value "split". Conditions are expressed as XPath expressions, and the attributes and annotations used must be explicitly made available to the BPEL workflow through configuration (of org.eclipse.smila.blackboard). -->
  • RAP: Rich Ajax Platform. G-Eclipse: extensible framework including a GRID model for seamless integration of GRID/Cloud resources. It supports different Grid/Cloud interfaces, including AWS.
  • Example: the token “saw”. Stemming → it might return just “s”. Lemmatization → attempts to return “see” or “saw”, depending on whether the token is used as a verb or a noun.
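The difference described in the last note can be sketched as follows. This is a toy illustration only: the suffix list and the lemma table are hypothetical, hand-written stand-ins for a real stemmer and a real lemmatizer, and this particular stemmer happens to leave “saw” untouched, whereas a more aggressive one could truncate it further, as the note suggests.

```python
# Hypothetical suffix rules and lemma table, for illustration only.
SUFFIXES = ("ing", "ed", "s")
LEMMAS = {
    ("saw", "VERB"): "see",
    ("saw", "NOUN"): "saw",
}


def crude_stem(token):
    """Strip the first matching suffix; knows nothing about part of speech."""
    for suffix in SUFFIXES:
        # Keep at least three characters so very short words survive.
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token


def lemmatize(token, pos):
    """Look up the dictionary form for a (token, part-of-speech) pair."""
    return LEMMAS.get((token, pos), token)
```

The key point is the signature: `crude_stem("saw")` has no way to tell the verb from the noun, so it gives one answer for both, while `lemmatize("saw", "VERB")` yields "see" and `lemmatize("saw", "NOUN")` yields "saw".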

Engineering Web Search Applications: Presentation Transcript

  • Engineering Web Search Applications Alessandro Bozzon Marco Brambilla Vienna July 5, 2010
    • Alessandro Bozzon
    • Post-doc @Politecnico di Milano
    • http://home.dei.polimi.it/bozzon
    • Marco Brambilla
    • Assistant Professor @Politecnico di Milano
    • http://home.dei.polimi.it/mbrambil
    About the speakers © 2010 Alessandro Bozzon, Marco Brambilla
    • Research background and interests
      • Web engineering and model-driven development
        • WebML and WebRatio
        • Complex enterprise application design
      • BPM, SOA and integration with Web application development
      • Search engine and complex search application development
        • Search Computing: multidomain search
        • Pharos: multimedia search framework
  • About the tutorial
    • Information Retrieval is a >40y old discipline tackled from a myriad of viewpoints
    • This tutorial is:
      • Breadth-oriented
      • Development process driven …
      • … using real-world case studies as examples
    • The tutorial is necessarily shallow
      • But we provide references and links
  • Agenda
  • AGENDA
    • Introduction
      • What are Web search applications?
    • Requirements
      • What are their requirements?
    • Design
      • How to design them?
    • Implementation
      • How to implement them?
    • Validation
      • How to measure their success?
  • Introduction
  • Search prevails
    • Search is an integral part of people’s online life
    • Web search has become a standard (and often preferred) source of information finding
      • “... 92% of Internet users say the Internet is a good place to go for getting everyday information ...” - 2004 Pew Internet Survey
    • Web search engines are now the second most frequently used online computer application, after email
    • Search is fully integrated into operating systems and is viewed as an essential part of most information systems
  • Some numbers …
    • Web
      • Estimated size: ~ 60 billion pages – 22/06/2010
        • http://www.worldwidewebsize.com/
      • > 9.3 billion queries … just in the U.S. … in May 2010
        • http://blog.nielsen.com/nielsenwire/online_mobile/top-u-s-search-sites-for-may-2010/
        • … and growing
    • Twitter
      • # of new tweets per day: 55 million
      • # of search queries per day: 600 million
    • Facebook
      • 400 Million Global Users (and growing)
      • The average Facebook User Spends 55 Minutes Per Day
  • … more numbers …
    • IDC Digital Universe report estimates:
      • digital data grew by 62% between 2008 and 2009
        • ~ 800,000 petabytes (PB)
      • >1.2 million PB in 2010
      • expected to reach 35 ZB (zettabytes) by 2020
    [Ramakrishnan and Tomkins 2007]
  • Information Retrieval
    • Information retrieval (IR) deals with the representation, storage, organization of, and access to information items.
      • “Old” discipline
    • As an academic field of study:
      • Information retrieval (IR) is devoted to finding relevant documents, not simple matches to patterns.
      • Information retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers).
        • [Manning et al., 2007]
  • Information Retrieval Applications
    • Search (‘ad hoc’ retrieval)
      • Static document collection
      • Dynamic queries
    • Filtering
      • Queries are static
      • Document collection constantly changing
        • Example: corporate mails routed by predefined queries to different parts of the organization
    [Figure: ad hoc search (static document collection, ad-hoc query, ranked result) vs. filtering (incoming documents, predetermined queries or user profiles, document routing system)]
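The filtering setting on this slide (static queries, changing documents) can be sketched as a small router. The standing queries below are hypothetical, and matching is reduced to "all keywords occur in the document", which is an assumption made for brevity rather than the way any specific routing system works.

```python
# Hypothetical standing queries: destination -> keywords that must all appear.
STANDING_QUERIES = {
    "billing": {"invoice"},
    "support": {"error", "login"},
}


def route(document, standing_queries=STANDING_QUERIES):
    """Return the sorted destinations whose keywords all occur in the document."""
    terms = set(document.lower().split())
    return sorted(
        name
        for name, keywords in standing_queries.items()
        if keywords <= terms  # subset test: every keyword is present
    )
```

Each incoming mail is matched against the fixed set of queries: `route("Login error after password reset")` returns `["support"]`, while a mail that matches no standing query returns an empty list and would need a default destination.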
  • The nature of information retrieval
    • … retrieving all objects which might be useful or relevant to the user information need
      • Usually unstructured queries (no formal semantics)
        • The IR system ‘interprets’ the contents of the information items
        • Examples: keyword-based queries, context queries, proximity, phrases, natural language queries…
        • Also structural queries and, in recent systems, structured query languages are supported (but with a different semantics)
      • Errors in the results are tolerated
      • Core concept: relevance
        • Relevance Ranking (according to the user need)
        • It is not clear what “degree of relevance” the user is happy with
        • The user starts from the top of the ranked list and explore down satisfied
  • Information Retrieval is NOT Data Retrieval
    • Data Retrieval (RDBMS, XML DB)
      • … retrieving all objects which satisfy clearly defined conditions expressed through a query language.
      • Data has a well defined structure and semantics
      • Formal query languages
        • Regular expressions, relational algebra expressions, etc.
      • Results are EXACT matches → errors are not tolerated
      • No ranking w.r.t. the user information need
        • Binary retrieval: does not allow the user to control the magnitude of the output
        • For a given query, the system may return:
          • Under-dimensioned output
          • Over-dimensioned output
  • The Information Retrieval Process
    • [Figure: generic search-oriented application, split into BACKEND and FRONTEND; components: Content Management, Query Analysis, Query Interaction, Search, Result Composition, Result Manipulation; data flows q, q’, r, r’]
  • Search Engine vs. Search Application
    • Search Engine
      • data management system which uses information retrieval algorithms to retrieve information items from one or more sources upon the submission of a query
    • Web Search Application
      • data management system where search engines are a piece of a more complex puzzle, that includes:
        • data source integration (e.g. databases, legacy systems, the Web)
        • content analysis technologies orchestration
        • user interfaces
        • Web-mediated social interactions, etc.
  • Characterization of the user information need
    • It is not a simple problem:
      • “Blurred” goals
      • Sensory Gap
        • Gap between the object in the world and the information in a (computational) description
      • Semantic Gap
        • Lack of coincidence between the (computational) description of the information and its interpretation
  • Evaluating an IR System
    • Precision: fraction of retrieved docs that are relevant
        • P(relevant|retrieved)
          • “degree of soundness” of the system
          • not considering the total number of documents
    • Recall: fraction of relevant docs that are retrieved
        • P(retrieved|relevant)
          • “degree of completeness” of the system
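The two measures above can be computed directly from the set of retrieved documents and the set of relevant ones. The following is a minimal sketch; the function name and the document identifiers used in the usage note are illustrative assumptions.

```python
from typing import Set, Tuple


def precision_recall(retrieved: Set[str], relevant: Set[str]) -> Tuple[float, float]:
    """Return (precision, recall) for one query, given document identifiers."""
    hits = len(retrieved & relevant)  # relevant documents that were actually retrieved
    precision = hits / len(retrieved) if retrieved else 0.0  # P(relevant | retrieved)
    recall = hits / len(relevant) if relevant else 0.0       # P(retrieved | relevant)
    return precision, recall
```

For example, retrieving {d1, d2, d3, d4} when {d1, d3, d5} are relevant gives two hits, hence a precision of 2/4 and a recall of 2/3.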
  • Enterprise search
    • Public Web search engines are the ones known to the general public
    • But there is also a huge need (and market share!) for professional search over enterprise repositories
    • Enterprise search is covered by
      • Packaged suites
        • Microsoft FAST
        • Autonomy IDOL
        • IBM OmniFind
        • Exalead
      • Frameworks
        • Apache UIMA (ex IBM)
        • Smila
        • Solr
  • Case Studies
    • Textual Search
      • YaGoBi
    • Multi-media Search
      • The PHAROS Project
    • Multi-domain Search
      • The Search Computing project
    • Example of Web Search Application
      • Chansonnier
  • YaGoBi
    • THE Web Search
      • 92% of market share in the U.S.
    • Searching on
      • Web pages, Blogs, News, Books, Scientific Publications, Emails
      • Images and Videos (but only through textual descriptions)
      • Tweets
  • The PHAROS Project
    • FP6 IP, 3 years, 12 partners, ~15 M€ budget
    • Mission: develop an SOA-compliant, open, and distributed technology platform for the development of information access solutions for audiovisual content
    • www.pharos-audiovisual-search.eu
  • The Search Computing Project
    • European Research Council (ERC), 2008 Call for “IDEAS Advanced Grants”, 5 years (started in 2009)
    • Mission: provide the abstractions, foundations, methods, and tools required to answer multi-domain queries by interacting with a constellation of cooperating search services, using ranking and joining of results as the dominant factors for service composition
    • www.search-computing.org
  • Chansonnier
    • BSc thesis project
      • Mission: graduate
    • Open source video analysis application based on open frameworks (SMILA / Solr)
      • Crawling of Web video
      • Download of song lyrics
      • Analysis on lyrics text
        • Language, emotion
      • Keyframe extraction for video snippets
    • http://github.com/giorgiosironi/Chansonnier
  • Requirements
  • Key Requirements and Design Dimensions for Web Search
    • Data Source
    • User Behavior
    • Query Format
    • User Interface
    • Security
    • Data Analysis
    • Performance
    • Data Format
    • Social Interactions
    • Search Engine
  • Data Sources
    • Web
    • Databases
    • File systems
    • Intranet / Extranets
    • Legacy systems
    • Users
    • Sensors (in wide sense) and streams
  • Data Type
    • Unstructured data
      • Textual Documents
      • Blog Posts
    • (Semi) Structured data
      • Software Code
      • Models
      • XML Files
    • Media
      • Pictures
      • Video
      • Music
  • Data Analysis
    • An activity performed to provide a representation of a content item suited to the application
    • Textual Analysis
      • Deals with basic language units (morphemes, roots, stems, words, phrases, sentences, etc.)
    • Media Analysis
      • Deals with media contents
        • Transcoding
        • Classification
        • Feature Extraction
  • Search Engine _1
    • Textual
      • Textual contents represented as collection of unstructured text terms
    • Fielded
      • Textual contents structured in fields (e.g., metadata)
    • Semi-structured
      • Textual contents organized in complex (possibly heterogeneous) structure (e.g., XML, HTML)
  • Search Engine _2
    • Content-based
      • Media contents described by low-level features
    • Geographic and other special dimensions
      • Content featuring geo-spatial features
      • Streaming content searched by temporal features (e.g., recency)
  • Query Format
    • Representation of the user information need
      • Natural Language
        • For instance through vocal interfaces
      • Keyword
        • Set of text items, plus Boolean (AND/OR/NOT), proximity (lexical nearness) and/or wildcard conditions
      • Fielded Keyword
        • Text items defined on one or more fields
        • Queries to semi-structured search-engines and Faceted queries
      • Content-based
        • Query by example (text, image, video, audio, etc.)
      • Geographic and other special dimensions
        • Geographic coordinates plus spatial operator terms (near, north of, within X kilometers from, etc.)
        • Timestamps plus temporal operator terms (recent, near, interval, etc.)
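As an illustration of the keyword query format with Boolean conditions, the sketch below evaluates AND/OR/NOT over a toy inverted index. The function names, the `must`/`should`/`must_not` parameters, and the whitespace tokenization are assumptions made for the example; real engines add proximity and wildcard handling, which are omitted here.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set


def build_index(docs: Dict[str, str]) -> Dict[str, Set[str]]:
    """Map each lowercased term to the set of documents containing it."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return dict(index)


def boolean_query(index: Dict[str, Set[str]], all_docs: Iterable[str],
                  must: Iterable[str] = (), should: Iterable[str] = (),
                  must_not: Iterable[str] = ()) -> Set[str]:
    """Evaluate AND (must), OR (should) and NOT (must_not) keyword conditions."""
    result = set(all_docs)
    for term in must:                      # AND: every term has to occur
        result &= index.get(term, set())
    should = list(should)
    if should:                             # OR: at least one term has to occur
        any_match: Set[str] = set()
        for term in should:
            any_match |= index.get(term, set())
        result &= any_match
    for term in must_not:                  # NOT: none of the terms may occur
        result -= index.get(term, set())
    return result
```

Note that the result is an unranked set, i.e. binary retrieval; a search engine would additionally score and order the matching documents.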
  • YaGoBi
    • Data Sources
      • Web : crawling of Web resources
      • Users : comments, preferences, relationships
    • Data Types
      • Unstructured data : Web pages
      • Documents : PDF, PPT, DOC, etc.
    • Data Analysis
      • Textual : for content, document, and user generated comments
      • Media : some basic image analysis for color, faces, size
    • Search Engine
      • Fielded: filetype, page title, site, page content
      • Content-based: image similarity in Google
    • Query Format:
      • Fielded keyword
      • Geographic
  • PHAROS
    • Data Sources
      • Web : crawling of audio/video files
      • File System : NAS and content provider media archives
      • Users : comments, preferences, relationships
    • Data Types
      • Structured data : content provider description metadata
      • Media : hi-quality video and audio files
      • Semi-structured data : MPEG-7 description of processed media files and user annotations
    • Data Analysis
      • Textual : for content metadata and user generated comments
      • Media : for audio and video
        • Audio/Video Mood classification, Image concept classification, Music Genre, Danceability classification, face recognition and identification, speech to text
  • PHAROS
    • Search Engine
      • Semi-structured : XML search engine for MPEG-7 content description
        • Plus geographic annotations and geo-based ranking
      • 3 content-based engines :
        • one CB for music,
        • one for images (shots of the video)
        • one for face similarity
    • Query Format
      • Fielded-keyword : XQuery for XML search engine
      • Query by example : for image, music and faces
      • MPQF: high level query language
        • AND/OR/AND THEN for fielded keyword and by-example queries
  • Query Federation in PHAROS
    • [Figure: query analysis and federation across engines. Keywords (“amsterdam”): Text search, inverted index. XPath (here[contains(“amsterdam”)] and opic[contains(“building”)]): XML search, semantic index. Long/Lat (52.37N 4.89 E): Geo search, R-tree index. JPG: Image search, similarity index]
  • User Behavior
    • Search is evolving
      • Content Vs. Intent
        • People don’t want to search
        • People want to get tasks done and get answers
      • Moving towards identifying a user’s task
      • Enabling means for task completion
    • Search as a Process
    • Search applications must
      • Support the user in the search process
      • (Try to) infer the user’s intent to help them accomplish their task
    • [Figure: search as a process, from Start to End, for the need “I am craving for a good Wiener Schnitzel and a Sachertorte in Vienna”: Search, Menu, Reviews, Map. Source: Ricardo Baeza-Yates, Next Generation Search, 2nd SeCo Workshop, Milan, 24/06/2010]
  • Information Seeking [Bates, 2002]
    • Bates, Marcia J. 2002. Toward an integrated model for information seeking and searching. In: The Fourth International Conference on Information Needs, Seeking and Use in Different Contexts.
  • Information Foraging
    • Information foraging applies the ideas from optimal foraging theory to understand how human users search for information.
    • Assumption: humans use "built-in" foraging mechanisms that evolved to help our animal ancestors find food.
    • Some References
      • Fu, Wai-Tat; Pirolli, Peter (2007), "SNIF-ACT: a cognitive model of user navigation on the world wide web", Human-Computer Interaction: 335–412
      • Jason Withrow, "Do your links stink?," American Society for Information Science Bulletin, June 1, 2002
      • Pirolli, Peter (2009), "An elementary social information foraging model", Proceedings of the 27th international conference on Human factors in computing systems: 605–614
  • Moving between patches
    • Patches of information = websites
    • Problem: should I continue foraging in the current patch or look for another patch?
    • → Compare the expected gain from continuing in the current patch vs. moving to another
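The stay-or-move decision between information patches can be sketched as a comparison of expected gain rates. This is a deliberate simplification in the spirit of optimal foraging theory, not a model taken from the tutorial: the function name, the rate-based rule, and the notion of a `switch_cost` (time spent reaching the other website) are all assumptions.

```python
def should_switch(current_gain: float, current_time: float,
                  other_gain: float, other_time: float,
                  switch_cost: float) -> bool:
    """Decide whether moving to another patch beats staying in the current one.

    Gains are in arbitrary 'information value' units, times in seconds.
    """
    stay_rate = current_gain / current_time
    # Moving pays the switching time before any information is gained.
    move_rate = other_gain / (other_time + switch_cost)
    return move_rate > stay_rate
```

With this rule a forager leaves a depleted site quickly when alternatives are cheap to reach, and stays longer when switching is expensive.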
  • Information seeking funnel [D. Rose, 2008]
    • Wandering: the user does not have an information seeking-goal in mind.
    • Exploring: the user has a general goal but not a plan for how to achieve it.
    • Seeking: the user has started to identify information needs that must be satisfied but the needs are open-ended.
    • Asking: the user has a very specific information need that corresponds to a closed-class question
  • Berrypicking vs. Orienteering vs. Teleporting ...
    • Berrypicking: information needs change during interactions
        • M.J. Bates. The design of browsing and berrypicking techniques for the online search interface. Online Review, 13(5):407–431, 1989.
    • Orienteering [Teevan et al., CHI 2004]: the searcher issues a quick, imprecise query to get to approximately the right region of the information space, then follows known paths in small steps that move closer to the goal. Easy! (a “perfect” query is not needed)
    • Teleporting: Expert searchers issue longer queries to jump directly to the target. Requires more effort and experience.
  • … vs. exploratory search
    • Exploratory Search: user’s intent is primarily to learn more on a topic of interest, by exploring various directions and sources
      • “… exploratory search blends querying and browsing strategies” and is different “from retrieval that is best served by analytical strategies…”
            • Marchionini, G. Exploratory search: from finding to understanding. Communications ACM 49(4): 41-46 (2006)
    • Some references
      • Definition and analysis of the problem
        • White, R. W., and Drucker, S. M. Investigating behavioral variability in web search. 16th WWW Conf. (Banff, Canada, 2007)
      • Complex Search and Exploratory Search
        • Aula, A., and Russell, D.M. Complex and Exploratory Web Search. ISSS: Information Seeking Support Systems Workshop (Chapel Hill, June 2008)
  • Multi-domain Exploratory Search
    • “… search for upcoming concerts close to an attractive location (like a beach, lake, mountain, natural park, and so on), considering also availability of good , close-by hotels ”
    • Current approach the user can adopt:
      • Independently explore search services
      • Manually combine findings
  • Multi-domain Exploratory Search
    • “… expand the search to get information about available restaurants near the candidate concert locations, news associated to the event and possible options to combine further events scheduled in the same days and located in a close-by place with respect to the first one…”
  • Existing Approaches _1
    • Topic based search : instance of exploratory search centered on the goal of collecting information on a subject matter of interest from multiple sources
          • Kosmix: topic discovery engine, keyword search, a topic page summarizes the most relevant information on the subject
          • Hakia: résumé (summary) pages for topics associated with the user’s queries, natural language processing techniques
  • Existing Approaches _2
    • Structured Object Search : process queries and present results that address entities or real world objects described in Web pages
          • Google Squared: keyword search, results collected in a table (called a square) featuring all the attributes relevant to the result items as columns headers
          • Google Fusion Tables: upload data tables (e.g., spreadsheet files) and join (or “fuse”) the data in some column with other tables
  • The note-taking limit
    • Beyond a certain number of found options, the user needs to start writing them down.
    • [Aula and Russell, 2008]
  • Liquid Queries
    • “ A new paradigm allowing users to formulate and get responses to multi-domain queries through an exploratory information seeking approach, based upon structured information sources exposed as software services…”
    • Composite answers obtained by aggregating search results from various domains
    • Highlight the contribution of each search service
    • Join of results based on the structural information afforded by the search service interfaces
    • Refine the user query
    • Re-shape the result list
    • Alessandro Bozzon, Marco Brambilla, Piero Fraternali, Stefano Ceri. Liquid Query: multi-domain exploratory search on the Web. WWW 2010, Raleigh, USA
  • Liquid Queries Definition _1
    • Template-based approach
    • It consists of subsetting and parametrizing the resource graph...
    • [Figure: the resource graph (Concert, Artist, Exhibition, Restaurant, Hotel, Movie, Metro Station, Theatre, Photo, Landmark, News) and the selected template (Photo, Concert, Metro Station, Restaurant, News, Exhibition, Artist, Hotel); template = inputs, outputs + GR (global ranking)]
  • Liquid Queries Definition _2
    • And then characterizing the user interaction
    • Plus:
      • Parametrization of global ranking
      • Data visualization options
      • … and so on
    • [Figure: the template (Photo, Concert, Metro Station, Restaurant, News, Exhibition, Artist, Hotel) with the Expand interaction]
  • Result Exploration Support
    • If the current set of combinations is not satisfactory, the user may ask for more values for a service (more one) or for all services (more all)
      • More concerts, more hotels, or more combinations
    • Add new information about further domains for selected combinations (expand)
      • Find close-by restaurants or co-located events
    • Aggregate information to ease analysis and readability (clustering, grouping)
      • Group events by venue
    • Reduce the number of shown items through filtering
      • Total walked distance for the night
    • Re-order (ranking or sorting)
      • Calculate derived values from existing ones
      • Total walked distance for the night
    • Alternative data visualization
      • Map, parallel coordinates, …
    • DEMO: http://demo.search-computing.org
  • User Intent
    • Understand the user information need
      • User intent taxonomy (Broder, 2002)
        • Informational – want to learn about something (~40% / 65%)
        • Navigational – want to go to a given page (~25% / 15%)
        • Transactional – want to do something (web-mediated) (~35% / 20%)
        • Grey Areas
          • Find a good hub
          • Exploratory search
    • [from SIGIR 2008 Tutorial, Baeza-Yates and Jones]
    • Example queries: History nyonya food, Singapore Airlines, Jakarta Weather, Nikon Finepix, Car Rental Kuala Lumpur
  • Contextual Content Delivery
    • Context Vs. Personalization
    • Trigger the right search depending on the context
      • Task
      • Location
      • User Engagement
    • Not interested in your personal profile
      • Your favorite restaurant?
        • It depends on where you are!
    • [from Ricardo Baeza-Yates, Next Generation Search, 2nd Search Computing Workshop, Milan, 24/06/2010]
    • Demo: http://sandbox.yahoo.com/Motif
  • Relevance: the Top-k problem
    • Relevance of the results with respect to the request is the main expectation for search engine users
    • Top-k relevant items: quickly retrieve the k highest-ranking tuples in the presence of monotone ranking functions defined on the attributes of the underlying relations
    • Some References
      • R. Fagin. Combining fuzzy information from multiple systems. J. Comput. Syst. Sci., 58(1):83–99, 1999.
      • I. F. Ilyas, R. Shah, W. G. Aref, J. S. Vitter, and A. K. Elmagarmid. Rank-aware query optimization. In SIGMOD Conference, pages 203–214, 2004
      • D. Martinenghi and M. Tagliasacchi: Proximity Rank Join, to appear in PVLDB
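To make the top-k problem concrete, here is a sketch of a threshold-style algorithm, a well-known technique for monotone ranking functions. It is swapped in for illustration and is not claimed to be the exact method of any reference listed above. Assumptions: every source maps object identifiers to non-negative scores, a missing score counts as 0.0, and the aggregation function (default `sum`) is monotone.

```python
import heapq
from typing import Callable, Dict, List, Sequence, Tuple


def threshold_top_k(sources: Sequence[Dict[str, float]], k: int,
                    score: Callable[[Sequence[float]], float] = sum
                    ) -> List[Tuple[str, float]]:
    """Return the k objects with the highest aggregated score."""
    # Sorted access: each source is scanned in descending order of its own score.
    sorted_lists = [sorted(s.items(), key=lambda kv: kv[1], reverse=True)
                    for s in sources]
    seen: Dict[str, float] = {}
    depth = max((len(lst) for lst in sorted_lists), default=0)
    for pos in range(depth):
        last_seen = []
        for lst in sorted_lists:
            if pos < len(lst):
                obj, value = lst[pos]
                last_seen.append(value)
                if obj not in seen:
                    # Random access: look the object up in every source.
                    seen[obj] = score([s.get(obj, 0.0) for s in sources])
            else:
                last_seen.append(0.0)  # exhausted list: nothing better can follow
        # No unseen object can score above the aggregate of the last seen values.
        threshold = score(last_seen)
        top = heapq.nlargest(k, seen.items(), key=lambda kv: kv[1])
        if len(top) >= k and top[-1][1] >= threshold:
            return top  # early stop without scanning the remaining entries
    return heapq.nlargest(k, seen.items(), key=lambda kv: kv[1])
```

The benefit over scoring everything is the early stop: on long lists the scan usually ends after a small prefix of each source.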
  • Result Diversification
    • Relevance is not the only success factor for a result set
    • User satisfaction is increased if the first items cover a good spectrum of options
      • If user intent is ambiguous, diversification tries to cover the most likely intents
      • If several top-k items are very similar, they can be clustered together
    • Thus: an optimization problem
    • Objective: find the set of k elements that contains the most relevant and diverse items
    • Maximal Marginal Relevance [Carbonell and Goldstein 1998]
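A greedy selection in the style of Maximal Marginal Relevance can be sketched as follows: at each step the item maximizing `lam * relevance - (1 - lam) * (maximum similarity to the items already chosen)` is added. The function name, the default `lam`, and the pluggable `similarity` callable are assumptions of this sketch.

```python
from typing import Callable, Dict, List


def mmr_select(relevance: Dict[str, float],
               similarity: Callable[[str, str], float],
               k: int, lam: float = 0.7) -> List[str]:
    """Greedily pick k items trading off relevance against redundancy."""
    selected: List[str] = []
    candidates = set(relevance)

    def mmr_score(item: str) -> float:
        # Redundancy is the similarity to the closest already-selected item.
        redundancy = max((similarity(item, s) for s in selected), default=0.0)
        return lam * relevance[item] - (1 - lam) * redundancy

    while candidates and len(selected) < k:
        # sorted() only makes tie-breaking deterministic.
        best = max(sorted(candidates), key=mmr_score)
        selected.append(best)
        candidates.remove(best)
    return selected
```

With `lam = 1.0` the selection degenerates to plain relevance ranking; lower values push near-duplicates further down the list.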
    • [Figure: relevance vs. diversity]
  • User Interface
    • More complete information in a single search
    • [Figure: shortcuts, deep links, enhanced results]
  • User Interface
    • Optimization of the result set layout (and of page space)
  • Performance
    • Users don’t want to waste time waiting for a search result
        • User satisfaction
    • Performance is the leading factor in the evaluation of Web search applications
      • Queries per second (QPS)
      • Time to Index
    • Scalability
      • Content
      • Queries
    • Distribution
      • Service-oriented computing
      • Content Delivery Networks
        • But intellectual property may be a concern
      • More in section (ARCHITECTURE)
  • Other Requirements
    • Social Interaction
      • Content evaluation
      • User relationships and actions as additional content description
    • Security & Privacy
      • Access policies
        • Collection Vs. Item level
      • Anonymity
        • Who I am = What I like + What I do + Where I am ?
        • A search process tells a lot about who is doing it
    • Alessandro Bozzon, Tereza Iofciu, Wolfgang Nejdl, Antonio V. Taddeo, Sascha Tönnies, Role Based Access Control for the interaction with Search Engines, (COOPER) 2007, Crete, Greece.
  • Design
  • Designing Web Search Applications
    • Reference architecture
    • Reference execution processes
    • Set of design dimensions
    • Development methodology
    • Tools supporting the methodology
  • Search Applications from 1000 feet
  • Bird’s-eye view of Search Applications
  • Search Application Processes
  • An example of Indexing Process
  • Pharos: the architecture
  • Search Computing: the architecture
    • [Figure legend: main query flow, <Uses> relation]
    • High-level query: “Where can I attend a DB scientific conference close to a beautiful beach reachable with cheap flights?”
      • Sub-query 1: “Where can I attend a DB scientific conference?”
      • Sub-query 2: “place close to a beautiful beach?”
      • Sub-query 3: “place reachable with cheap flight?”
    • Low-level queries
      • Low-level query 1: ConfSearch(“DB”,placeX,dateY)
      • Low-level query 2: TourSearch(“Beach”,PlaceX)
      • Low-level query 3: Flight(“cost<200”,PlaceX,DateY)
    • Query plan: service invocations and operator execution
    • Presented results: ESWC-Crete-Olympic, CAISE-Hammamet-Alitalia, TOOLS-Malaga-EasyJet
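The decomposition into three low-level queries joined on the shared place and date variables can be sketched in a few lines. Everything concrete here is hypothetical: the record layout, the scores, the dates, the costs, and the choice of the product of scores as global ranking are assumptions for illustration; only the conference, place, and airline names echo the results shown on the slide.

```python
from typing import Dict, List, Tuple

# Hypothetical results of ConfSearch("DB", placeX, dateY).
CONFERENCES = [
    {"name": "ESWC", "place": "Crete", "date": "day1", "score": 0.9},
    {"name": "CAISE", "place": "Hammamet", "date": "day2", "score": 0.8},
    {"name": "TOOLS", "place": "Malaga", "date": "day3", "score": 0.7},
]
# Hypothetical results of TourSearch("Beach", PlaceX).
BEACHES = [
    {"place": "Crete", "score": 0.9},
    {"place": "Hammamet", "score": 0.6},
    {"place": "Malaga", "score": 0.8},
]
# Hypothetical flight offers; the "cost<200" condition is applied in the join below.
FLIGHTS = [
    {"place": "Crete", "date": "day1", "airline": "Olympic", "cost": 150.0},
    {"place": "Hammamet", "date": "day2", "airline": "Alitalia", "cost": 180.0},
    {"place": "Malaga", "date": "day3", "airline": "EasyJet", "cost": 90.0},
    {"place": "Malaga", "date": "day3", "airline": "OtherAir", "cost": 400.0},
]


def answer_query(conferences: List[Dict], beaches: List[Dict],
                 flights: List[Dict], max_cost: float = 200.0
                 ) -> List[Tuple[str, str, str, float]]:
    """Join the three result lists on place (and date) and rank the combinations."""
    beach_score = {b["place"]: b["score"] for b in beaches}
    combos = []
    for conf in conferences:
        if conf["place"] not in beach_score:
            continue  # join on PlaceX between conferences and beaches
        for flight in flights:
            same_trip = (flight["place"] == conf["place"]
                         and flight["date"] == conf["date"])
            if same_trip and flight["cost"] < max_cost:
                rank = conf["score"] * beach_score[conf["place"]]
                combos.append((conf["name"], conf["place"], flight["airline"], rank))
    return sorted(combos, key=lambda c: c[3], reverse=True)
```

A real query plan would invoke the services incrementally, chunk by chunk, rather than joining complete lists in memory.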
  • Design Dimensions
    • Retrieval Policy (affected process: Indexing)
      • Values: Push, Pull
    • Data Homogeneity (affected process: Indexing)
      • Values: Homogeneity, Heterogeneity
    • Data Analysis (affected process: Indexing)
      • Values: Mono Annotation, Multi Annotation; Mono Modal, Multi Modal
    • Search Technology (affected processes: Indexing, Query and Result Presentation)
      • Values: Search Engine(s) Type: Homogeneity, Heterogeneity
    • Query Format (affected processes: Query and Result Presentation, User Interface)
      • Values: Query Type: Mono Modal, Multi Modal; Mono Domain, Multi Domain
    • User Interaction (affected process: User Interface)
      • Values: Direct, Indirect; Active, Passive
  • Designing Web Search Applications - A MDD approach
    • Alessandro Bozzon, Marco Brambilla, Piero Fraternali. Conceptual Modeling of Multimedia Search Applications using Rich Process Models. ICWE 2009, June 24-26, 2009, San Sebastian, Spain
    • Clear separation of concerns among the involved actors
    • Central role of models as key development artifacts
    • Automatic code generation, etc.
  • Development Methodology
    • Process Models
      • E.g.: BPMN
    • Domain data and process metadata
      • E.g.: ER/UML
    • Model To Model Transformation
      • E.g.: Java / XSLT / ATL
    • Application Models
      • DSL, e.g. WebML
    • Model To Code Transformation
    • Running Application
  • An example domain model Content Analysis / ER
    • Content : the objects that relate to the Content Items indexed by a search application
    • Annotation : structure of the annotations associated with searchable Content Items during the indexing process
    • Usage : usage groups of the application (RBAC model)
    • Index : abstraction for the actual physical implementation of search engine indexes
  • An example process model Content Analysis / BPMN - WebML
    • Coarse indexing process model
      • Content Registration
      • Content Analysis
      • Content Indexation
    • Fine-grained process model
      • Analysis of audiovisual content through face recognition and identification technologies
    • Application model
      • Face Recognition and Segmentation activity
    • Running CPA process
      • Console trace of the working annotation technology
      • Process advancement control UI
    • [Figure labels: Refinement, M2M Transformation, M2T Transformation]
  • An Example of Complex Process
    • Analysis of audiovisual content
    • Incremental analysis of audio-visual content with textual annotations
  • Modeling User Interface
    • The information seeking interaction modes (Searching, Browsing, Monitoring, Being aware, Social interactions)
    • Distilled 30+ information seeking user interaction patterns
      • Query execution and result presentation
      • Keyword (Faceted, Similarity, Geo) search specification and refinement...
      • Browsing, content organization, content-based awareness, etc.
      • Relationship setting, recommendation, etc.
    • UI designed as assembly of standard interaction patterns
      • expressed in WebML
    • Alessandro Bozzon, Model-driven development of Search Based Web Applications, Ph.D. Thesis, Politecnico di Milano, April 2009.
  • Pattern Example: Faceted Search
  • Pharos: Modeling User Interface
    • http://www.youtube.com/watch?v=ZpxyNi6Ht50
    • [Figure labels: KEYWORD REFINEMENT, FACETED REFINEMENT, CONTENT-BASED REFINEMENT, RESULT PRESENTATION]
  • An Example of M2M Transformation: BPMN* → WebML
  • MDD in Search Computing
    • 4 artifact models
      • Search Service, Query, Query Parameters, Result
    • A query plan model
      • For the runtime query transformation
  • Search Computing Model Example Search Service Model
    • ServiceMart
      • abstraction (e.g., Hotel) of one or more Web service implementations (e.g., Bookings and Expedia)
      • possibly ranked and chunked into pages
    • Attribute
      • Atomic or Composite
    • AccessPattern
      • specifies RankingType and AttributeDirection (I/O)
    • ConnectionPattern
      • is defined as an input-output relationship between pairs of service marts (for joining them)
        • e.g., the output city of Concert is used as input for Hotel.
    • ServiceInterface
      • physical interface of the service
      • Exact or Search (ranked)
      • details about chunk size, cost
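The concepts of the Search Service Model can be mirrored by a handful of data classes. This is a hypothetical rendering for illustration only, not the actual Search Computing metamodel: the field names, defaults, the flat attribute lists (composite attributes are omitted), and the `is_valid` helper are assumptions.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List


class Direction(Enum):
    INPUT = "I"
    OUTPUT = "O"


@dataclass
class AccessPattern:
    """Attribute directions plus whether results come back ranked."""
    name: str
    directions: Dict[str, Direction]
    ranked: bool = False


@dataclass
class ServiceInterface:
    """Physical interface of a service, implementing one access pattern."""
    name: str
    access_pattern: AccessPattern
    chunk_size: int = 10
    cost: float = 1.0


@dataclass
class ServiceMart:
    """Abstraction (e.g., Hotel) over one or more Web service implementations."""
    name: str
    attributes: List[str]
    interfaces: List[ServiceInterface] = field(default_factory=list)


@dataclass
class ConnectionPattern:
    """Input-output relationship used to join two service marts."""
    source: ServiceMart
    source_attribute: str
    target: ServiceMart
    target_attribute: str

    def is_valid(self) -> bool:
        # Both attributes must be declared on their respective marts.
        return (self.source_attribute in self.source.attributes
                and self.target_attribute in self.target.attributes)


# The example from the slide: the output city of Concert feeds the input of Hotel.
concert = ServiceMart("Concert", ["artist", "city", "date"])
hotel = ServiceMart("Hotel", ["city", "stars", "price"])
concert_to_hotel = ConnectionPattern(concert, "city", hotel, "city")
```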
  • Search Computing Query Meta-model
    • LogicalQuery
      • is a conjunctive query over services
      • can be defined at an abstract level ( AccessPatternLevelQuery ) or at physical level ( InterfaceLevelQuery ).
    • QueryClause
      • a LogicalQuery is composed of a set of QueryClauses
      • a QueryClause can refer to the service mart level or to the Service Interface level.
      • Several types
        • InvocationClauses
        • PredicateClauses
        • JoinClauses
        • RankingClauses
  • Search Computing Model Transformations
    • Vertical transformations for Queries and ServiceMarts
    • QueryToPlan transformation
    • Query Execution transformation (at runtime)
    • Result transformation (at runtime)
    • Prototype: http://dbgroup.como.polimi.it/brambilla/SeCoMDA
  • Search Computing DSLs (& Transformations): Panta Rhei
    • describes both the execution flow and the data flow between nodes of a query plan.
    • Several types of nodes exist
        • service invocators, sorting, join, and chunk operators, clocks (defining the frequency of invocations), caches, and others.
    • The query result model is constructed stepwise, following the execution flow
    Reference: D. Braga, S. Ceri, F. Corcoglioniti, M. Grossniklaus, and S. Vadacca: Panta Rhei: An Execution Model for Queries over Web Information Sources, http://www.search-computing.it/sites/cms.web.seco/files/pantarhei2010.pdf
  • Implementation
  • From the models to implementation
    • Once the design phase is completed
      • IMPLEMENTATION TIME
    • Never implement a search engine/app from scratch!!
    • Start from your requirements and design and:
      • Identify possible existing solutions (REUSE)
      • Select the one that best fits your needs (SHOPPING)
      • Implement what you need (DEPLOY vs. CONFIGURE)
        • We will see: open source (products) vs. open search (services)
    • A full-fledged model-driven approach can be devised:
      • Model-to-code transformations that generate:
        • The code for the pieces of Web search applications that you need
        • The configuration for the tools of choice
  • Search Framework Vs. Search Engine
    • Search Engines
      • “provide an interface to a group of items that enables users to specify criteria about an item of interest and have the engine find the matching items. The criteria are referred to as a search query” (Wikipedia)
    • Search Frameworks
      • Software components that target a set (possibly exhaustive) of the architectural layers of a search application
        • E.g., crawling + analysis + indexing/querying
  • Open Source Search Vs Open Search
    • Open Source → build your own engine
    • Open Search → exploit commercial engines
    Source: WWW2010 tutorial “Open Source Tools”, Drake & Jones, Yahoo!
  • Open Source Search: High-level comparison
    Product   License    Lang.      Docs              Ranking         Users        Parallel   Scale   Support
    Lucene    Apache     Java/C++   Several           Flexible        Amazon       Yes        TB      5/5
    Zettair   BSD-like   C          HTML, TREC, TXT   Flexible        Research     No         TB      1/5
    Indri     BSD-like   C++        Many              Very flexible   Research     Yes        TB      1.5/5
    Sphinx    GPL        C++        Many              Flexible        Craigslist   Yes        YB      4/5
    Xapian    GPL        C++        Many              Flexible        GMane        Yes        TB      3/5
    RDBMS     BSD, GPL   C          -                 Limited         -            Maybe      GB      4/5
    Source: extended version of the WWW2010 tutorial “Open Source Tools”, Drake & Jones, Yahoo!
  • Open Source Search Benchmark _1
    • [Middleton+Baeza-Yates 07]: A comparison of open source search engines
    • http://zooie.wordpress.com/2009/07/06/a-comparison-of-open-source-search-engines-and-indexing-twitter/
      • Vik Singh (Yahoo), weekend project: index 1M tweets
      • Source code available at http://github.com/zooie/opensearch
  • Open Source Search Benchmark _2
    • Relevancy tested on TREC 9 – Filtering Track collection
    • Judgment data for 63 query-like tasks
  • Lucene
    • High-performance, scalable information retrieval (IR) library
      • in Java
      • There are also PyLucene and CLucene
      • Apache License
      • Strong industrial support, with proven scalability
        • Amazon, Netflix, Wikipedia
    • Core API for full-text indexing and searching
    • Plus plug-in modules
        • Text analysis: text analyzer, tokenizer, token-filter, stemmer, N-gram filters, shingle filters
        • spell-checkers, result highlight, “more like this”
        • Fuzzy queries, regex queries
        • Geo ranking
  • Lucene Indexing Example
  • Additional Indexing Features
    • Documents can be
      • Updated and Deleted
      • Boosted → doc.setBoost(1.5F);
    • Fields can be
      • Indexed - to search in
      • Stored - to show the original content (e.g., abstract)
      • Encoded as term vectors - to enable “more like this”
      • Multivalued (e.g., authors field)
      • Boosted → subjectField.setBoost(1.2F);
      • There are built-in field types for numbers, dates, and times, to better support sorting and range search
  • Lucene Querying Example (Simple Term Query; Query Parser)
  • Additional Querying Features
    • Boolean
    • Prefix
    • Phrase
    • Wildcard
    • Fuzzy
    • Scoring function
      • Fielded
      • TF-IDF, weighted by term occurrences
      • Term and document boost
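The scoring ingredients listed above (TF-IDF plus term and document boosts) can be illustrated with a textbook formulation. This is a simplified sketch and not Lucene's actual similarity formula, which applies further normalizations; the function name, the boost dictionaries, and the toy documents are assumptions.

```python
import math
from collections import Counter


def tf_idf_scores(docs, query_terms, doc_boosts=None, term_boosts=None):
    """Score tokenized documents against query terms with boosted TF-IDF."""
    doc_boosts = doc_boosts or {}
    term_boosts = term_boosts or {}
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in docs.values() for term in set(tokens))
    scores = {}
    for doc_id, tokens in docs.items():
        tf = Counter(tokens)
        score = 0.0
        for term in query_terms:
            if tf[term] == 0:
                continue
            # A term present in every document gets idf = 0 and adds nothing.
            idf = math.log(n_docs / df[term])
            score += tf[term] * idf * term_boosts.get(term, 1.0)
        scores[doc_id] = score * doc_boosts.get(doc_id, 1.0)
    return scores


docs = {
    "d1": ["web", "search", "engine"],
    "d2": ["search", "search", "application"],
    "d3": ["model", "driven", "web"],
}
plain = tf_idf_scores(docs, ["search", "web"])
boosted = tf_idf_scores(docs, ["search", "web"], doc_boosts={"d2": 1.5})
print(plain)
print(boosted)
```

Without boosts, d1 (one match per query term) and d2 (two matches of one term) tie; boosting d2 breaks the tie in its favor.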
  • More Features
    • Thread and multi-JVM safety
      • Any number of read-only IndexReaders may be open at once on a single index
      • Only a single writer may be open on an index at once
      • IndexReaders may be open even while an IndexWriter is making changes to the index
      • Any number of threads can share a single instance of IndexReader or IndexWriter → not only thread-safe, but it also scales
    • Lucene implements the ACID transactional model
      • only one transaction (writer) may be open at once
  • Why Open Search?
    • Search as a software service
      • No need for in-house engine development
    • Search as a commodity
      • Internals are unknown, the features are taken off the shelf
    • JavaScript
      • Access to search features through client-side programming (no server needed at all)
    • But …
      • you can search only for Web resources
  • Open Search APIs
    • Google Ajax Search API
      • http://code.google.com/apis/ajaxsearch/
    • Google Custom Search API
      • http://code.google.com/intl/en/apis/customsearch/
    • Microsoft Bing API
      • http://www.bing.com/toolbox/developers/
    • Yahoo Boss (Build your Own Search Service)
      • http://developer.yahoo.com/search/boss/
  • Google Ajax Search API
    • JavaScript widget
    • REST API
      • No limitations on the number of queries
      • 8 results per query
      • No change in the result order
    • Query Web, Local, Video, Images, Blog, Book, News
    • Very limited customization of result presentation
    (Code snippets on the slide are taken from the Google Ajax Search API documentation)
  • Google Custom Search API
    • Custom search engine for a Web site, blog, or a collection of Web sites
      • Max 5000 sites
      • On-demand 24 hour Web Indexing
      • iFrame or Custom Search Element results for developers; XML for enterprise
      • Few result personalization options
  • Microsoft Bing API
    • REST APIs
    • Query Ad, Image, News, Phonebook, Video, Web
    • Unlimited traffic
    • Results can be modified, but with some restrictions
      • You cannot re-rank or merge non-Bing sources
  • Yahoo! Boss (+ Search Monkey)
    • Unlimited queries
    • Blend, re-order, discard
    • Full Presentation control
    • Usage:
      • http://boss.yahooapis.com/ysearch/{vert}/v1/{q}?appid={appid}&start=0&count=10&lang=en&format=xml&view=keyterms
    • Verticals
      • Web, News, Images, Spelling
    • In-query syntax
      • inurl, url, intitle, site, AND/OR, “-”, “+”
    • Notable web view fields
      • Delicious bookmarks
      • SearchMonkey ( microformats )
      • Larger abstracts
      • Extracted Entities (keyterms)
    Source: WWW 2010 tutorial “Open Search Tools”, Drake & Jones (figure labels: SearchMonkey, keyterms, Bookmarks)
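The usage template above can be filled in programmatically. The sketch below only builds the request URL and performs no network call; the endpoint is the one quoted on the slide and may no longer be available. The helper name `boss_url` and the placeholder application id are assumptions.

```python
from urllib.parse import quote, urlencode

# Base endpoint as quoted on the slide.
BOSS_BASE = "http://boss.yahooapis.com/ysearch"


def boss_url(vertical, query, appid, start=0, count=10, lang="en",
             fmt="xml", view="keyterms"):
    """Fill the {vert}, {q} and {appid} placeholders of the usage template."""
    params = urlencode({
        "appid": appid,
        "start": start,
        "count": count,
        "lang": lang,
        "format": fmt,
        "view": view,
    })
    # The query is part of the path, so it is percent-encoded as a path segment.
    return f"{BOSS_BASE}/{vertical}/v1/{quote(query, safe='')}?{params}"


url = boss_url("web", "search computing", "MY_APP_ID")
print(url)
```

In-query operators such as `site` or `intitle` would simply be part of the `query` string and get percent-encoded with it.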
  • Search Frameworks – State of the industry
  • Open Source Search Frameworks
  • SMILA
    • SeMantic Information Logistics Architecture
      • http://www.eclipse.org/smila/
    • Open Source Search Framework
      • based on SOA principles and standards (e.g. BPEL, SCA)
      • dedicated to the access and integration of (unstructured) information
      • Standard interfaces for the integration of the main components of a Search application
      • Set of out-of-the-box components included
        • Crawlers (Web, FS) and agents (e.g. RSS feeds)
        • Lucene/Solr indexer
      • interfaces for management, operation and monitoring of the framework and its components
      • Written in Java
      • Based on OSGi (Eclipse Equinox)
      • Cloud-ready
  • Data Model
    • Record
      • Representation of an information item
      • Composed of a set of
        • Attributes: textual metadata (e.g., mime type)
        • Attachments: binary data (e.g., picture)
        • Annotations: can be associated with records, attributes, or attachments
    • Attributes and attachments are usually produced during the discovery of data
    • Annotations are usually produced during the indexing process
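The record structure above can be sketched as a small data class. This is not the SMILA API: the class layout, the `annotate` method, and the example values are assumptions that only mirror the three parts named on the slide.

```python
from dataclasses import dataclass, field


@dataclass
class Record:
    """Representation of an information item."""
    id: str
    attributes: dict = field(default_factory=dict)    # textual metadata
    attachments: dict = field(default_factory=dict)   # name -> binary data
    annotations: dict = field(default_factory=dict)   # target -> {key: value}

    def annotate(self, target, key, value):
        """Annotate the record itself ("record"), an attribute or an attachment."""
        known = target == "record" or target in self.attributes or target in self.attachments
        if not known:
            raise KeyError(f"unknown annotation target: {target}")
        self.annotations.setdefault(target, {})[key] = value


# Attributes and attachments are typically filled during data discovery ...
record = Record(
    id="http://example.org/items/1",
    attributes={"MimeType": "image/png", "Title": "A picture"},
    attachments={"Content": b"\x89PNG"},
)
# ... while annotations are typically added later, during processing and indexing.
record.annotate("Title", "language", "en")
record.annotate("record", "indexed", True)
print(record.annotations)
```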
  • Chansonnier Data Model
    • Record : a song
      • id: the download URI
      • Attributes
        • Link, PageTitle, Description, Keywords, Title, Artists
        • Lyrics
        • Language, language confidence
        • Emotion, emotion confidence
      • Attachments
        • Original videos
        • Extracted keyframes
  • SMILA Architecture
    • 3 Macro components
      • Each one can run on a dedicated OSGi instance
        • Distribution, replication
        • Each one aggregates a set of OSGi bundles
    • Set of data storages
      • Metadata
      • Binary data
      • Ontologies
      • Delta Indexing
    Figure labels: CONNECTIVITY, SEARCH, PROCESSING
  • Processing Pipelines
    • Orchestration performed through BPEL Engine (Apache ODE)
    Figure labels: Process Invocation; Condition on a record attribute; Condition on an annotation value; Activity Invocation
  • Chansonnier Activities
    • Lyrics Wiki
      • To decorate a song with its lyrics by querying the LyricWiki service (http://lyrics.wikia.com/).
    • Google Translate
      • To identify the language of a song’s lyrics (with a given confidence)
    • Synesketch
      • To analyze the song’s lyrics in order to infer their dominant emotion
    • FFMPEG
      • To extract the keyframes from the song video
  • Distribution
    Source: EclipseCON 2010: http://www.eclipsecon.org/2010/sessions/?page=sessions&id=1388
  • Content Analysis
    Figure labels: Media Item, Text Item, Transcoding, Media Artifact Generation, Media Analysis, Text Analysis, Media Annotation, Text Annotation
  • Text Processing
    • Not all words are equally significant for representing the semantics of a document
      • usually, nouns (or groups of nouns) are the most representative of a document’s content
    • Vocabulary: language used to describe documents and queries
    • It is worthwhile to preprocess the text of the documents in the collection to determine which terms to use as index terms
      • Index terms: the subset of words selected to represent a document’s content
  • Index Terms and Precision/Recall
    • Trade off
      • Exhaustiveness
        • Cover the whole document content → assign a large number of terms to a document
      • Specificity
        • Generic terms: low discriminative power, their frequency is high in all the documents (e.g., “and”, “or”, “of”, etc.)
        • Specific terms: higher discriminative power, variable document frequency → their frequency denotes how representative they are of a document
    • Recall
      • High-frequency in the overall collection
      • Index expansion via associative techniques (thesauri, clustering)
    • Precision
      • High frequency just in some documents
  • Text Analysis Process
    • Document Parsing
    • Lexical analysis: manage digits, hyphens, punctuation marks, letter cases
    • Elimination of stopwords (e.g., “and”, “or”, “of”, etc.)
    • Thesaurus
    • Phrases (noun groups)
    • Stemming (reduction of a word to its grammatical root)
    • Selection and weighting of index terms (nouns, adjectives, etc.)
    Figure labels: Document Parsing, Lexical Analysis, Stopwords Removal, Phrases, Stemming, Indexing, Weighting, Structure, Full text, Index Terms
  • Document Parsing
    • What format: PDF/Word/Excel/HTML?
    • What language?
    • What character set?
    • Problems:
      • The documents being indexed can be written in many different languages
      • Sometimes a document or its components contain multiple languages/formats (e.g., a French email with a Portuguese PDF attachment)
      • What is a unit document? (An email? With attachments? An email with a zip containing documents?)
  • Lexical Analysis
    • Process that transforms an input character stream (the original document’s text) into a flow of words (tokens)
    • GOAL: identification of words in the text
    • Example
      • Input: “Friends, Romans and Countrymen”
      • Output: Tokens
        • Friends
        • Romans
        • Countrymen
      • Each such token is now a candidate for an index entry, after further processing
        • But what are valid tokens to emit?
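A minimal sketch of lexical analysis, assuming that a token is simply a maximal run of word characters. Note that this bare tokenizer keeps “and”; the slide's output omits it, presumably because a later step treats it as a stopword. The function name is hypothetical.

```python
import re


def tokenize(text):
    """Split text into word tokens, dropping punctuation and whitespace."""
    return re.findall(r"\w+", text)


tokens = tokenize("Friends, Romans and Countrymen")
print(tokens)  # ['Friends', 'Romans', 'and', 'Countrymen']

# The same rule already makes debatable choices on harder inputs:
print(tokenize("Hewlett-Packard"))    # ['Hewlett', 'Packard']
print(tokenize("Finland's capital"))  # ['Finland', 's', 'capital']
```

The last two calls show why the question of which tokens are valid does not have a single answer: the rule splits on the hyphen and on the apostrophe whether or not that is what users expect.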
  • Tokenization
    • Trivial case: recognition of blanks as word separator
    • Other cases might need to be addressed:
        • Phrases
          • Finland’s capital -> Finland? Finlands?, Finland’s?
          • Hewlett-Packard -> Hewlett and Packard as two tokens?
          • San Francisco: one token or two? How do you decide it is one token?
        • Language issues (normalization)
          • Accents: résumé vs. resume
          • L'ensemble -> one token or two?
            • L ? L’ ? Le ?
          • How are your users likely to write their queries for these words? Use the locale?
        • Punctuation (e.g., U.S.A. vs. USA)
        • Numbers (100.45 vs. 100,45 vs. 1.0045 E+2)
        • Dates (e.g., March 1st 2009 vs. 03/01/09 vs. 1/03/2009)
        • Case folding
    • Tokenization depends on the language being addressed
      • E.g., in Chinese, spaces do not separate words
        • (tokenization based on vocabulary)
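Some of the normalization issues above (accents, acronym punctuation, case folding) can be handled with the standard library alone. The sketch below is one possible policy, not a recommendation: whether “résumé” and “resume” should match depends on how users write their queries. The function names and the acronym pattern are assumptions.

```python
import re
import unicodedata

# One or more "letter + period" groups, e.g. "u.s.a." (but not "100.45").
ACRONYM = re.compile(r"(?:\w\.){2,}")


def strip_accents(text):
    """Remove combining marks after canonical decomposition."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def normalize_token(token):
    """Fold case, strip accents, and drop the periods of dotted acronyms."""
    token = strip_accents(token).casefold()
    if ACRONYM.fullmatch(token):
        token = token.replace(".", "")
    return token


for raw in ("Résumé", "resume", "U.S.A.", "USA", "100.45"):
    print(raw, "->", normalize_token(raw))
```

The same normalization has to be applied to documents at indexing time and to queries at search time, otherwise the two sides will not meet.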
  • Stopword Removal
      • Removal of high-frequency words, which carry less information
      • Strategies
        • Statistical analysis on the indexed collection
        • Functional terms (articles, conjunctions, auxiliary verbs)
        • A-priori knowledge, based on the IR system domain
          • Creation of a “stop-list” with all the terms to remove
          • An English stop list has about 200-300 terms (e.g., “been”, “a”, “about”, “otherwise”, “the”, etc.)
            • http://www.dcs.gla.ac.uk/idom/ir_resources/linguistic_utils/stop_words
    • Reduces the number of tokens by 30%-50% (smaller dictionary)
    • It can decrease recall (e.g., “to be or not to be”, “let it be”)
    • Most Web search engines do not remove stopwords [ManningIR]
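Stop-list filtering, and the recall problem it can cause, can be shown in a few lines. The stop list below is a tiny hand-picked sample for illustration, not the 200-300 term list mentioned above; the function name is hypothetical.

```python
# Tiny illustrative stop list (a real English list has a few hundred terms).
STOP_WORDS = frozenset({
    "a", "about", "and", "be", "been", "it", "not", "of", "or",
    "otherwise", "the", "to",
})


def remove_stopwords(tokens, stop_words=STOP_WORDS):
    """Drop tokens whose lowercase form appears in the stop list."""
    return [t for t in tokens if t.lower() not in stop_words]


print(remove_stopwords(["Friends", "Romans", "and", "Countrymen"]))
# ['Friends', 'Romans', 'Countrymen']

# Queries made only of stopwords disappear entirely, which hurts recall.
print(remove_stopwords("to be or not to be".split()))  # []
print(remove_stopwords("let it be".split()))           # ['let']
```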
  • Phrases (noun groups)
      • Phrases capture the meaning behind the bag of words and result in multi-term phrases
      • Uses of phrases:
        • Added to the query: a query “New” “York” should be modified to search for “New York” → > 10% gain in precision and recall
        • Replace terms in index: empirically considered not as good as query rewriting
  • Phrases (noun groups) - Strategies
      • Simple Phrases
        • Many systems identify phrases as any pairs of terms not separated by:
          • stop term
          • punctuation mark
          • special character
        • Phrases occurring fewer than 25 times are removed (decrease in memory requirements)
      • NLP
        • Part Of Speech and Word Sense tagging
          • statistical or rule-based methods to identify the part of speech (noun, verb, adjective) of each token
        • Syntactic parsing
          • Identify the key syntactic components of a sentence, usually by tagging according to POS and then applying a grammar (FSA and NFSA)
      • Thesauri
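The “simple phrases” strategy above can be sketched as follows: adjacent term pairs are kept as candidate phrases unless a stop term, punctuation mark, or special character separates them, and rare phrases are pruned. The stop list, the tokenization rule, and the function names are assumptions; the default threshold of 25 is the one quoted on the slide.

```python
import re
from collections import Counter

STOP_WORDS = frozenset({"a", "an", "and", "in", "is", "of", "the", "to"})


def is_term(token, stop_words):
    """A term is a word token that is not a stop term."""
    return re.fullmatch(r"\w+", token) is not None and token not in stop_words


def candidate_phrases(text, stop_words=STOP_WORDS):
    """Yield adjacent term pairs not separated by a stop term or punctuation."""
    # Tokens are either words or single punctuation/special characters,
    # so a separator between two words breaks their adjacency.
    tokens = re.findall(r"\w+|[^\w\s]", text.lower())
    for left, right in zip(tokens, tokens[1:]):
        if is_term(left, stop_words) and is_term(right, stop_words):
            yield (left, right)


def frequent_phrases(texts, min_count=25):
    """Keep only the phrases that occur at least min_count times."""
    counts = Counter(p for text in texts for p in candidate_phrases(text))
    return {phrase: n for phrase, n in counts.items() if n >= min_count}


sample = "New York is big. New York, the city of New York"
print(list(candidate_phrases(sample)))
print(frequent_phrases([sample], min_count=3))
```

On a realistic collection the pruning step is what keeps the phrase dictionary small enough to hold in memory.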
  • Thesauri
    • A thesaurus is a classification scheme composed of words and phrases whose organization aims at facilitating the expression of ideas in written text
      • E.g.: synonyms and homonyms
        • Example entry from Roget’s thesaurus: cowardly (adjective)
          • Ignobly lacking in courage: cowardly turncoats.
          • Syns: chicken (slang) chicken-hearted, craven, dastardly, faint-hearted, gutless, lily-livered
    • A thesaurus can be
      • Thematic: specific to the IR system’s domain of application (most frequent case)
        • E.g.: Thesaurus of Engineering and Scientific Terms
      • Generic
    • A thesaurus can be used to
      • Help users formulate queries
      • Let the system modify queries
      • Select index terms
  • Thesauri
    • Many kinds of thesauri have been developed for IR systems
      • Hierarchical: synonyms (RT → related terms, UF → used for), generalization (BT → broader term), specialization (NT → narrower term)
        • ISO and ANSI standards, almost always thematic
        • Manually built and updated by domain experts
      • Clustered: clusters (or synsets) of words
        • Untyped semantic relationships among clusters
          • Each cluster is a set of words having a strong semantic relationship (usually UF)
        • E.g., WordNet
        • Clustered thesauri can be automatically generated if no distinction is made among semantic relationships
      • Associative: graph of words, where nodes represent words and edges represent semantic similarity among words
        • Edges can be oriented or not, according to the symmetry of the similarity relationship
        • Edges can be weighted (fuzzy pseudo-thesauri)
        • Can be automatically generated from a collection of documents using co-occurrence relationships
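An associative thesaurus built from co-occurrence can be sketched as a weighted, undirected graph. The similarity used here (Jaccard overlap of the sets of documents containing each term) is one possible choice made for the example, not the method prescribed by the slides; the function name, the threshold, and the toy collection are assumptions.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_thesaurus(docs, min_weight=0.5):
    """Build a weighted, undirected term graph from document co-occurrence."""
    df = Counter()        # number of documents containing each term
    pair_df = Counter()   # number of documents containing both terms
    for tokens in docs:
        terms = sorted(set(tokens))
        df.update(terms)
        pair_df.update(combinations(terms, 2))
    graph = {}
    for (a, b), both in pair_df.items():
        # Jaccard similarity between the document sets of the two terms.
        weight = both / (df[a] + df[b] - both)
        if weight >= min_weight:
            graph.setdefault(a, {})[b] = weight
            graph.setdefault(b, {})[a] = weight
    return graph


docs = [
    ["hotel", "room", "booking"],
    ["hotel", "room"],
    ["concert", "ticket"],
    ["concert", "ticket", "booking"],
]
thesaurus = cooccurrence_thesaurus(docs)
print(thesaurus)
```

Because the similarity is symmetric, every edge is stored in both directions; an asymmetric measure would give the oriented variant mentioned above.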
  • Stemming and Lemmatization
    • Goals
      • Reduce terms to their “roots” before indexing
      • Reduce inflectional/variant forms to base form
        • language dependent
        • E.g.,
          • am, are, is -> be
          • car, cars, car's, cars' -> car
          • the boy's cars are different colors -> the boy car be different color
    • Stemming: heuristic process that chops off the ends of words in the hope of achieving the goal correctly most of the time
      • Stemming collapses derivationally related words
    • Lemmatization: NLP tool. It uses dictionaries and morphological analysis of words in order to return the base or dictionary form of a word
      • Lemmatization collapses the different inflectional forms of a lemma
      • Not widely used because it harms performance
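The difference between the two approaches can be seen with two toy functions. Neither is a real algorithm: `toy_stem` is a naive suffix stripper (not Porter's stemmer) and `toy_lemmatize` is a dictionary lookup over a hand-written table; both names and the rule lists are assumptions made for the example.

```python
# Suffixes are tried in order; the first match wins.
SUFFIXES = ("'s", "s'", "ing", "ed", "es", "s")

# Hand-written lemma dictionary (a real lemmatizer uses morphological analysis).
LEMMAS = {"am": "be", "are": "be", "is": "be", "cars": "car", "colors": "color"}


def toy_stem(word):
    """Chop off the first matching suffix, keeping at least three letters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def toy_lemmatize(word):
    """Return the dictionary form when known, the word itself otherwise."""
    word = word.lower()
    return LEMMAS.get(word, word)


print([toy_stem(w) for w in ("car", "cars", "car's", "cars'")])
print(toy_stem("are"), toy_lemmatize("are"))
print(toy_stem("news"))
```

The heuristic handles regular plurals and possessives, cannot map “are” to “be” (the dictionary can), and wrongly conflates “news” with “new”, which is the kind of error that costs precision.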
  • Stemming
    • Many different algorithms :
      • Porter’s algorithm
        • Commonest algorithm for stemming English
          • Porter, Martin F. 1980. An algorithm for suffix stripping. Program 14:130–137.
          • http://www.tartarus.org/~martin/PorterStemmer/
      • One-pass Lovins stemmer
        • Lovins, Julie Beth. 1968. Development of a stemming algorithm. Translation and
      • Lancaster
        • http://www.comp.lancs.ac.uk/computing/research/stemming/
        • Paice, Chris D. 1990. Another stemmer. SIGIR Forum 24:56–61
        • http://snowball.tartarus.org/demo.php
    • Stemming increases recall while harming precision
  • Stemming Example
  • Tools for text analysis _1
    • Lucene and Solr contain many text analyzers that work on several languages
      • http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters
        • CharFilters, Tokenizer, Token Analyzers
    • Apache Tika
      • http://tika.apache.org/
      • toolkit for detecting and extracting metadata and structured text content from various documents using existing parser libraries
    • GATE (General Architecture for Text Engineering)
      • http://gate.ac.uk/
      • ANNIE (A Nearly-New Information Extraction System)
        • tokenizer, gazetteer, sentence splitter, part of speech tagger, named entities transducer, coreference tagger
        • Support for English, Spanish, Chinese, Arabic, French, German, Hindi, Italian, Cebuano, Romanian, Russian
    • MALLET (Machine Learning for Language Toolkit)
      • http://mallet.cs.umass.edu/index.php
      • Java-based package for statistical natural language processing, document classification, clustering, topic modeling, information extraction, and other machine learning applications to text
  • Tools for text analysis _2
    • OpenNLP
      • http://opennlp.sourceforge.net/projects.html
      • open source projects related to natural language processing
    • Cognitive Computation Group – University of Illinois
      • http://l2r.cs.uiuc.edu/~cogcomp/software.php
        • Chunker, Part of Speech tagger, String similarity, Semantic Role Labeler, Named Entity Extractor, etc.
    • Supersense Tagger
      • http://medialab.di.unipi.it/wiki/SuperSense_Tagger
      • tool for assigning to each noun, verb, adjective and adverb of a sentence one of the 45 standard WordNet supersenses
    • Wordnet Domains
      • http://wndomains.fbk.eu/hierarchy.html
    • Synesketch
      • http://www.synesketch.krcadinac.com/
      • Open source textual emotion recognition
  • Multimedia Content Analysis
    • Computers are not able to capture the underlying meaning of multimedia content. Annotation is needed.
    • Manual annotation
      • Expensive
        • It can take up to 10x the duration of the video
        • Problems in scaling to millions of content items
      • Incomplete or inaccurate
        • People might not be able to holistically capture all the meanings associated with a multimedia object
      • Difficult
        • Some content is tedious to describe in words
          • E.g., a melody without lyrics
    • Automatic annotation
      • Reasonably good quality
        • Some technologies have a ~90% precision
      • “Low” cost
  • Audio Segmentation
    • GOAL: split an audio track according to contained information
      • Music
      • Speech
      • Noise
    • Additional usage
      • Identification and removal of ads
  • Video Segmentation
    • Keyframe segmentation:
      • segment a video track according to its keyframes
        • fixed-length temporal segments
    • Shot detection:
      • automated detection of transitions between shots
        • a shot is a series of consecutive pictures taken contiguously by a single camera and representing a continuous action in time and space.
    CREDITS: Thorsten Hermes@SSMT2006
  • Speech Analysis
    • Speaker Identification: identify people participating in a discussion
    • Additional usage:
      • Vocal command execution
    • Speech To Text: automatically recognize spoken words belonging to an open dictionary
  • Classification of Music Genre
    • GOAL: automatically classify the genre and mood of a song
      • Rock, pop, jazz, blues, etc.
      • Happy, aggressive, sad, melancholic, etc.
    • Additional usage:
      • Automatic selection of songs for playlist composition
    • Tutorial from PHAROS Summer School
      • http://www.pharos-audiovisual-search.eu/res/files/SummerSchool/Programme_Summer_School_file.zip
  • Images: Low-level features
    • GOAL: extract implicit characteristics of a picture
      • luminosity
      • orientations
      • textures
      • Color distribution
  • Face Identification and Recognition
    • GOAL: recognize and identify faces in an image
    • Usage examples:
      • People counting
      • Security applications
    CREDITS: Thorsten Hermes@SSMT2006
  • Image Concept Detection
    • GOAL: recognize context/concepts of an image
      • E.g., playground, seaside, road, ...
    • Extraction of low level features from raw data
      • color histograms, color correlograms, color moments, co-occurrence texture matrices, edge direction histograms, etc.
    • Features can be used to build discrete classifiers, which may associate semantic concepts to images or regions thereof
      • The MediaMill semantic search engine defines 491 semantic concepts
        • http://www.science.uva.nl/research/mediamill/demo
    • Concepts can be detected also from text (e.g., from manual or automatic metadata) using NLP techniques
  • Image Object Identification
    • GOAL: identify objects appearing in a picture
      • Basketballs, cars, planes, players, etc.
  • Tools for media analysis _1
    • OpenCV
      • http://opencv.willowgarage.com/wiki/
      • Framework for image analysis
    • Octave
      • http://www.gnu.org/software/octave/
      • high-level language, primarily intended for numerical computations; mostly compatible with Matlab
    • Marsyas (Music Analysis, Retrieval and Synthesis for Audio Signals)
      • http://marsyas.sness.net/
      • Framework for music analysis and retrieval
  • Tools for media analysis _2
    • TINA (TINA Is No Acronym)
      • http://www.tina-vision.net/
      • is an open source environment developed to accelerate the process of image analysis research.
    • Sphinx
      • http://cmusphinx.sourceforge.net/sphinx4/
      • speech recognition system written entirely in Java
    • WEKA
      • http://www.cs.waikato.ac.nz/ml/weka/
      • A collection of machine learning algorithms for data mining
  • Validation
  • Disclaimer
    • This section is inspired by the WWW2010 tutorial by Dasdan, Tsioutsiouliklis, Velipasaoglu
      • Web Search Engine Metrics for Measuring User Satisfaction
      • http://analytics.ncsu.edu/reports/wsmt.pdf
  • Measures for IR Systems
    • Measurable properties
      • How fast does it process (index) documents?
        • Number of documents/hour
        • Average document size
      • How fast does it search?
        • Latency as a function of index size
        • Expressiveness of query language
        • Speed on complex queries
    • The key measure: user happiness
      • What is this?
        • Speed of response/size of index are factors
          • But blindingly fast, useless answers won’t make a user happy
      • How do we quantify user happiness?
  • Measuring User Happiness
    • Who is the user we are trying to make happy?
      • Depends on the setting
        • Web engine: users find what they want and return to the engine
          • Can measure rate of return users
        • eCommerce site: users find what they want and make a purchase
          • Is it the end-user, or the eCommerce site, whose happiness we measure?
          • Measure time to purchase, or fraction of searchers who become buyers?
        • Enterprise (company/govt/academic): Care about “user productivity”
          • How much time do my users save when looking for information?
          • Many other criteria having to do with breadth of access, secure access …
  • Evaluation measures
    • Relevance
      • Of search results
    • Coverage
      • Presence of content of interest in a catalog
    • Diversity
      • Of result set
    • Discovery and Latency
      • How many new resources (in the collection) are in the catalog?
      • How long did it take to get the new resources into the catalog?
        • Time to first click
    • Freshness
  • Relevance as a measure of user happiness
    • How do you measure relevance?
    • In order to assess the performance of an IR system, you need a test collection composed of:
      • A benchmark document collection
      • A benchmark suite of queries
      • A binary assessment of either Relevant or Irrelevant for each query-doc pair (gold standard, or ground truth)
    • Test collection must be of a reasonable size
      • Need to average performance since results are very variable over different documents and information needs
  • Evaluating Relevance
    • Set based evaluation
    • Rank based evaluation with explicit judgment
      • Absolute judgment
      • Preference judgment
    • Rank based evaluation with implicit judgment
      • Direct and indirect evaluation by clicks
    • Model based evaluation
      • Browsing models
      • User satisfaction
    Slide annotation: NOT COVERED HERE
  • Information Need Translation
    • Relevance is assessed relative to the information need, not to the query
    • E.g., Information need:
      • I'm looking for information on whether drinking red wine is more effective at reducing your risk of heart attacks than white wine.
      • Query: wine red white heart attack effective
    • A document is relevant if it addresses the stated information need, not just because it contains all the word in the query
  • Set-based evaluation
    • The two most frequent and basic measures for IR effectiveness are precision and recall
      • Precision: fraction of retrieved docs that are relevant
        • P(relevant|retrieved)
          • Provides a measure of the “degree of soundness” of the system
          • This does not consider the total number of documents
      • Recall: fraction of relevant docs that are retrieved
        • P(retrieved|relevant)
          • Provides a measure of the “degree of completeness” of the system
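The two definitions above can be sketched for a single query as follows. This is a minimal illustration, assuming documents are identified by string ids; the function name and the sample ids are hypothetical.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    retrieved: doc ids returned by the system
    relevant:  doc ids judged relevant in the gold standard
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant docs that were retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical example: 4 docs retrieved, 3 docs relevant, 2 in common.
p, r = precision_recall(
    retrieved=["d1", "d2", "d3", "d4"],
    relevant=["d1", "d3", "d5"],
)
print(p, r)
```

Returning 0.0 for an empty retrieved or relevant set is a convention chosen here to avoid a division by zero, not something the slides prescribe.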
  • Precision / Recall
    • Can get high recall (but low precision) by retrieving all docs for all queries!
      • Recall is a non-decreasing function of the number of docs retrieved
      • Precision usually decreases as more docs are retrieved (in a good system)
    • Precision can be computed at different levels of recall
      • Perhaps most appropriate for web search: all people want are good matches on the first one or two results pages
    • Precision-oriented users
      • Web surfers
    • Recall-oriented users
      • Professional searchers, paralegals, intelligence analysts
  • F-Measure
    • Combined measure that assesses the tradeoff between precision and recall (weighted harmonic mean):
      • Values of β<1 emphasize precision
      • Values of β>1 emphasize recall
    • People usually use the balanced F1 measure
      • i.e., with β = 1 or α = ½
    • The harmonic mean is a conservative average
          • [CJ van Rijsbergen, Information Retrieval ]
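A minimal sketch of the combined measure, assuming the usual weighted harmonic mean form found in IR textbooks such as [ManningIR]: F = 1 / (α/P + (1 − α)/R) = (β² + 1)·P·R / (β²·P + R), with β² = (1 − α)/α. The function name and the sample values are illustrative.

```python
def f_measure(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall.

    beta < 1 emphasizes precision, beta > 1 emphasizes recall,
    beta = 1 gives the balanced F1 measure.
    """
    b2 = beta * beta
    denominator = b2 * precision + recall
    if denominator == 0.0:
        return 0.0  # convention used here when the measure is undefined
    return (b2 + 1.0) * precision * recall / denominator


# Illustrative values: precision lower than recall.
P, R = 0.5, 2 / 3
f_half = f_measure(P, R, beta=0.5)  # pulled towards precision
f_one = f_measure(P, R)             # balanced F1
f_two = f_measure(P, R, beta=2.0)   # pulled towards recall
print(f_half, f_one, f_two)
```

Because the harmonic mean is dominated by the smaller of the two values, a system cannot reach a high F score by maximizing only one of precision or recall.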
  • Difficulties in using precision/recall
    • Average over large corpus/query…
      • Need human relevance assessments
        • People aren’t reliable assessors
      • Assessments have to be binary
        • Nuanced assessments?
      • Heavily skewed by corpus/authorship
        • Results may not translate from one domain to another
    • The relevance of one document is treated as independent of the relevance of other documents
      • This is also an assumption in most retrieval systems
  • Rank-based evaluation
    • In ranked retrieval systems, P and R are values relative to a rank position
    • Evaluation is performed by computing precision as a function of recall
    • The function is computed at each rank position at which a relevant document has been retrieved
    • The resulting values are interpolated, yielding a precision/recall plot
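The procedure above can be sketched as follows. The interpolation rule is an assumption: the slides do not say which one is used, so this sketch takes the common textbook choice where the interpolated precision at a recall level is the highest precision observed at that or any higher recall level. Function names and the sample ranking are hypothetical.

```python
def pr_points(ranking, num_relevant):
    """(recall, precision) at each rank where a relevant doc is retrieved.

    ranking: list of 0/1 relevance flags in rank order
    num_relevant: total number of relevant docs for the query
    """
    points, hits = [], 0
    for rank, is_relevant in enumerate(ranking, start=1):
        if is_relevant:
            hits += 1
            points.append((hits / num_relevant, hits / rank))
    return points


def interpolate(points):
    """Replace each precision by the max precision at equal or higher recall."""
    interpolated, best = [], 0.0
    for recall, precision in reversed(points):
        best = max(best, precision)
        interpolated.append((recall, best))
    return interpolated[::-1]


# Hypothetical ranking: relevant docs at ranks 2, 3 and 5; 3 relevant in total.
raw = pr_points([0, 1, 1, 0, 1], num_relevant=3)
curve = interpolate(raw)
for recall, precision in curve:
    print(f"recall={recall:.2f} precision={precision:.2f}")
```

Plotting `curve` gives the non-increasing precision/recall plot; the raw points in `raw` would show the characteristic saw-tooth shape instead.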
  • Measures for rank-based evaluation
    • Mean average precision (MAP)
      • Measure of quality at all recall levels
    • Precision@K
      • Not all queries will have more than K relevant results
      • Even a perfect system may have a score less than 1.0 for some queries
    • R-Precision [Allan 2005]
      • Use a variable result set cut-off for each query based on number of its relevant results
    • Mean Reciprocal Rank (MRR) [Voorhees 1999]
      • Reciprocal of the rank of the first relevant result averaged over a population of queries
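The rank-based measures listed above can be sketched with binary relevance flags. This is a minimal illustration under stated assumptions: each ranking is a list of 0/1 flags in rank order, average precision is divided by the total number of relevant docs for the query, and the two sample queries are made up.

```python
def precision_at_k(ranking, k):
    """Fraction of the top-k results that are relevant."""
    return sum(ranking[:k]) / k


def average_precision(ranking, num_relevant):
    """Mean of the precision values at the ranks of the relevant docs."""
    hits, total = 0, 0.0
    for rank, is_relevant in enumerate(ranking, start=1):
        if is_relevant:
            hits += 1
            total += hits / rank
    return total / num_relevant if num_relevant else 0.0


def r_precision(ranking, num_relevant):
    """Precision at a cut-off equal to the query's number of relevant docs."""
    return precision_at_k(ranking, num_relevant) if num_relevant else 0.0


def reciprocal_rank(ranking):
    """1 / rank of the first relevant result, or 0 if there is none."""
    for rank, is_relevant in enumerate(ranking, start=1):
        if is_relevant:
            return 1.0 / rank
    return 0.0


# Hypothetical judged rankings: (relevance flags, total relevant docs).
queries = [([1, 0, 1, 0, 0], 2), ([0, 1, 0, 0, 0], 3)]
mean_ap = sum(average_precision(r, n) for r, n in queries) / len(queries)
mrr = sum(reciprocal_rank(r) for r, _ in queries) / len(queries)
p_at_5 = precision_at_k(queries[0][0], 5)
r_prec = r_precision(*queries[0])
print(mean_ap, mrr, p_at_5, r_prec)
```

The first query has only 2 relevant docs, so its Precision@5 cannot exceed 0.4 even for a perfect system, which is the limitation that R-Precision addresses.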
  • Discounted Cumulative Gain (DCG)
    • [Järvelin and Kekäläinen 2002]
    • Gain adjustable for importance of different relevance grades for user satisfaction
    • Discounting desirable for web ranking
      • Most users don’t browse deep
      • Search engines truncate the list of results returned.
    • DCG yields unbounded scores
      • For each query, divide the DCG by the best attainable DCG for that query
      • Result: Normalized Discounted Cumulative Gain (nDCG)
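A minimal sketch of DCG and nDCG. The discount is an assumption: the slides do not give one, so this uses the widespread log2(rank + 1) form rather than the exact formulation of [Järvelin and Kekäläinen 2002]. It also assumes the gain list covers every judged doc for the query, so that sorting it gives the best attainable ordering. The sample gains are made up.

```python
import math


def dcg(gains, k=None):
    """Discounted cumulative gain of graded gains listed in rank order."""
    if k is not None:
        gains = gains[:k]
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, start=1))


def ndcg(gains, k=None):
    """DCG divided by the best attainable DCG for the same judged docs."""
    ideal = dcg(sorted(gains, reverse=True), k)
    return dcg(gains, k) / ideal if ideal > 0 else 0.0


# Hypothetical graded judgments in system rank order (3 = best, 0 = useless).
system_gains = [1, 3, 0, 3]
score = ndcg(system_gains)
perfect = ndcg(sorted(system_gains, reverse=True))
print(round(score, 3), perfect)
```

Normalizing per query keeps scores in [0, 1], which makes them comparable and averageable across queries with different numbers of relevant results.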
    • Example:
    • Very Useful: 3
    • Somewhat useful: 1
    • Not Useful: 0
  • Preference Judgment
    • Kendall tau coefficient
      • Based on counts of preferences
      • Range in [-1, 1]
      • Robust for incomplete judgments
    • Binary Preference (bpref)
      • Buckley and Voorhees (2004)
      • Designed for incomplete judgments
      • Generalized to graded judgment
        • De Beer and Moens (2006)
    • Notation (Kendall tau)
      • A: number of preferences in agreement
      • D: number of preferences in disagreement
    • Notation (bpref)
      • N_r: number of non-relevant docs ranked above relevant doc r, among the first R non-relevant docs
      • R: number of relevant results for the query
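Both preference-based measures can be sketched from those counts. The formulas are assumptions, since the slides list only the symbols: tau = (A − D) / (A + D), and bpref = (1/R) · Σ_r (1 − N_r / R) as commonly stated for [Buckley and Voorhees (2004)]. Doc ids, function names and the sample judgments are hypothetical.

```python
def kendall_tau(ranking, preferences):
    """(A - D) / (A + D) over judged pairs (preferred_doc, other_doc)."""
    position = {doc: i for i, doc in enumerate(ranking)}
    agree = disagree = 0
    for better, worse in preferences:
        if better not in position or worse not in position:
            continue  # incomplete judgments: skip pairs not fully ranked
        if position[better] < position[worse]:
            agree += 1
        else:
            disagree += 1
    total = agree + disagree
    return (agree - disagree) / total if total else 0.0


def bpref(ranking, relevant, non_relevant):
    """Binary preference; unjudged docs in the ranking are ignored."""
    R = len(relevant)
    if R == 0:
        return 0.0
    nonrel_seen, score = 0, 0.0
    for doc in ranking:
        if doc in non_relevant:
            nonrel_seen += 1
        elif doc in relevant:
            # N_r counts only the first R judged non-relevant docs.
            score += 1.0 - min(nonrel_seen, R) / R
    return score / R


# Hypothetical ranking; "u" has no judgment at all.
ranking = ["a", "x", "b", "u", "y", "c"]
tau = kendall_tau(ranking, [("a", "x"), ("b", "x"), ("b", "y"), ("a", "c")])
bp = bpref(ranking, relevant={"a", "b", "c"}, non_relevant={"x", "y"})
print(tau, bp)
```

Neither function penalizes the unjudged doc "u", which is why both measures are described as suited to incomplete judgments.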
  • Presentation Metrics
    • How to present information?
      • Which information
      • Where they should be displayed
      • Which presentation elements should be used?
        • Font, colors, design elements, interaction design
      • Generalization
    • How to measure success?
      • User studies
        • On-line, on-home, usability, eye tracking, focus group, surveys
      • Log analysis
      • Editorial
        • Comparative, Perceived vs. actual
  • Not all results are likely to be reviewed
    • (Source: iprospect.com WhitePaper_2006_SearchEngineUserBehavior.pdf)
  • Clicks and views depend on rank
    • [Joachims et al, 2005]
  • Eye Tracking Studies
  • Heat Maps
    • Golden Triangle
      • The first result is always considered more trusted and more relevant by default
      • Users spend less time reading the lower part of the page
      • [Marti A. Hearst, Search User Interfaces , Cambridge University Press, 2009]
  • Thank you for your attention!
    • Questions?
    • Alessandro Bozzon, Dipartimento di Elettronica e Informazione, Politecnico di Milano, Milano, Italy, [email_address], http://home.dei.polimi.it/bozzon
    • Marco Brambilla, Dipartimento di Elettronica e Informazione, Politecnico di Milano, Milano, Italy, [email_address], http://home.dei.polimi.it/mbrambil
    • http://www.search-computing.org/book
    © 2010 Alessandro Bozzon, Marco Brambilla July 5, 2010 REFERENCES //
  • References – Books
    • Modern Information Retrieval
      • Ricardo Baeza-Yates, Berthier Ribeiro-Neto , Addison Wesley Longman Publishing Co. Inc., 2010
    • [ManningIR] Introduction to Information Retrieval
      • Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze, Cambridge University Press, 2008
    • Information Retrieval: Algorithms and Heuristics .
      • D.A. Grossman, O. Frieder. Springer, 2004
    • Managing Gigabytes.
      • I.H. Witten, A. Moffat, T.C. Bell. Morgan Kaufmann, 1999
    • Mining the Web: Analysis of Hypertext and Semi Structured Data .
      • S. Chakrabarti. Morgan Kaufmann, 2002
    • Search User Interfaces
      • Marti A. Hearst. Cambridge University Press, 2009
    • Search Computing – Challenges and directions
      • Stefano Ceri, Marco Brambilla (eds.) . Springer LNCS, vol. 5950, 2010
  • References - Tutorial
    • Web Search Engine Metrics: Direct Metrics to Measure User Satisfaction
      • Ali Dasdan, Kostas Tsioutsiouliklis, Emre Velipasaoglu (Yahoo!)
      • www2010
    • Recent Progress on Inferring Web Searcher Intent
      • Eugene Agichtein (Emory University)
      • www2010
    • Applications of Open Search Tools
      • Rosie Jones, Ted Drake (Yahoo!)
      • www2010
    • [BAEZASeco2010] New Frontiers for Search
      • Ricardo Baeza-Yates
      • www2010
    • Web Mining for Search
      • Ricardo Baeza-Yates and Rosie Jones (Yahoo!)
      • SIGIR 2008
  • References - Papers
    • [Ramakrishnan and Tomkins 2007] Raghu Ramakrishnan, Andrew Tomkins: Toward a PeopleWeb
      • IEEE Computer 40(8): 63-72 (2007)
    • [Broder2002] A. Broder. A taxonomy of web search
      • SIGIR Forum, 36(2):3–10, 2002.
    • [BATES2002] Bates, Marcia J. Toward an integrated model for information seeking and searching
      • In: The Fourth International Conference on Information Needs, Seeking and Use in Different Contexts, 2002
    • [FU2007] Fu, Wai-Tat; Pirolli, Peter, SNIF-ACT: a cognitive model of user navigation on the world wide web
      • Human-Computer Interaction: 335–412 , 2007
    • [Withrow2002] Jason Withrow, Do your links stink?
      • American Society for Information Science Bulletin, June 1, 2002
    • [Pirolli2009] Pirolli, Peter An elementary social information foraging model
      • Proceedings of the 27th international conference on Human factors in computing systems: 605–614, 2009
    • [D. Rose, 2008]
    • [BATES1989] M.J. Bates. The design of browsing and berrypicking techniques for the online search interface
      • Online Review, 13(5):407–431,1989.
    • [Teevan et al., CHI 2004] Teevan, J., Alvarado, C., Ackerman, M. and Karger, D. The perfect Search Engine is not Enough: A Study of Orienteering Behavior in Directed Search
      • Proceedings of ACM CHI 2004, pp. 415-422.
    • [MARCHIONINI2006] Marchionini, G. Exploratory search: from finding to understanding .
      • Communications ACM 49(4): 41-46 (2006)
    • [WHITE2007] White, R. W., and Drucker, S. M. Investigating behavioral variability in web search
      • 16th WWW Conf. (Banff, Canada, 2007)
    • [AULA2008] Aula, A., and Russell, D.M. Complex and Exploratory Web Search
      • ISSS: Information Seeking Support Systems Workshop (Chapel Hill, June 2008)
  • References - Papers
    • [BozzonEtAL2010] Alessandro Bozzon, Marco Brambilla, Piero Fraternali, Stefano Ceri. Liquid Query: multi-domain exploratory search on the Web
      • WWW 2010, Raleigh, USA
    • [FAGIN1999] R. Fagin. Combining fuzzy information from multiple systems
      • J. Comput. Syst. Sci., 58(1):83–99, 1999.
    • [ILYAS1999] I. F. Ilyas, R. Shah, W. G. Aref, J. S. Vitter, and A. K. Elmagarmid. Rank-aware query optimization
      • In SIGMOD Conference, pages 203–214, 2004.
    • [MARTINENGHI2010] D. Martinenghi and M. Tagliasacchi: Proximity Rank Join
      • to appear in PVLDB
    • [Carbonell and Goldstein 1998] J. Goldstein and J. Carbonell (1998), Summarization: Using MMR for Diversity- based Reranking
      • SIGIR’98
    • [BozzonEtAl2007] Alessandro Bozzon, et al. Role Based Access Control for the interaction with Search Engines
      • International Workshop on Collaborative Open Environments for Project-Centered Learning (COOPER) 2007, Crete, Greece.
    • [BozzonEtAl2009] Alessandro Bozzon, Marco Brambilla, Piero Fraternali Conceptual Modeling of Multimedia Search Applications using Rich Process Models
      • ICWE 2009, June 24-26, 2009, San Sebastian, Spain
    • [BozzonThesis2009] Alessandro Bozzon, Model-driven development of Search Based Web Applications
      • Ph.D Thesis, Politecnico di Milano, April 2009.
    • [BragaEtAl2010] D. Braga, S. Ceri, F. Corcoglioniti, M. Grossniklaus, and S. Vadacca: Panta Rhei: An Execution Model for Queries over Web Information Sources
      • http://www.search-computing.it/sites/cms.web.seco/files/pantarhei2010.pdf
  • References - Papers
    • [Allan 2005] J. Allan (2005), HARD track overview in TREC 2005: High accuracy retrieval from documents.
    • [Voorhees 1999] E.M. Voorhees (1999), TREC-8 question answering track report
    • [Järvelin and Kekäläinen 2002] K. Järvelin and J. Kekäläinen, Cumulated gain-based evaluation of IR techniques
      • ACM Trans. IS, 20(4): 422-446, 2002
    • [Buckley and Voorhees (2004)] C. Buckley and E.M. Voorhees, Retrieval evaluation with incomplete information
      • SIGIR’04.
    • [De Beer and Moens (2006)] De Beer, Jan; Moens, Marie-Francine. Rpref: a generalization of Bpref towards graded relevance judgments
      • SIGIR 2006, Seattle, USA, 6-11 August 2006, pages 637-638, ACM
  • References - Links
    • Search Computing Course Lecture Notes
      • http://www.search-computing.it/course
    • Fabio Aiolli, Università di Padova, http://www.math.unipd.it/~aiolli/corsi/0809/IR/IR.html
    • http://www.ir.disco.unimib.it/