Strategic scenarios in digital content and digital business


This lesson was given in May 2009 at MIP, Politecnico di Milano. The audience included members of the Acer academy program.

Rights on reused content are maintained by respective owners.

See further information on my activity at:

  • There have been many definitions for IR in the last decades… we just report
  • User-centric interfaces. Cloud services should be accessed with simple and pervasive methods; Cloud computing adopts the concept of Utility computing, in which users obtain and employ computing platforms in computing Clouds as easily as they access a traditional public utility. In detail, Cloud services enjoy the following features: the cloud interfaces do not force users to change their working habits and environments; the cloud client software that must be installed locally is lightweight; and cloud interfaces are location independent and can be accessed through well-established channels such as Web services frameworks and Internet browsers. Autonomous system. The computing Cloud is an autonomous system, managed transparently to its users: hardware, software and data inside clouds can be automatically reconfigured, orchestrated and consolidated to present a single platform image, finally rendered to users. Scalability and flexibility. Scalability and flexibility are the most important features driving the emergence of Cloud computing. Cloud services and computing platforms offered by computing Clouds can be scaled across various concerns, such as geographical location, hardware performance and software configuration, and the computing platform should be flexible enough to adapt to the requirements of a potentially large number of users.
  • Software or an application is hosted as a service and provided to customers across the Internet. This mode eliminates the need to install and run the application on the customer's local computers; SaaS therefore alleviates the customer's burden of software maintenance, and reduces the expense of software purchases through on-demand pricing. An early example of SaaS is the Application Service Provider (ASP), which provides subscriptions to software that is hosted or delivered over the Internet. Microsoft's "Software + Services" is another example: a combination of local software and Internet services interacting with one another. Google's Chrome browser suggests an interesting SaaS scenario: a new desktop could be offered, through which applications can be delivered (either locally or remotely) in addition to the traditional Web browsing experience.
  • The Google App Engine is an interesting example of the IaaS. It enables users to build Web applications with Google's APIs and SDKs on the same scalable systems that power Google's own applications.
  • ITaaS is a highly disruptive concept for enterprise users, who have less to gain and more to lose by outsourcing IT. Cloud service providers trying to serve this space must implement enterprise-class capabilities at multiple levels, both in the network and at the end points. Key business and technical challenges include cost, security, performance, business resiliency, interoperability, and data migration. Cloud computing is still in early development, and market researchers, financial analysts, and business leaders all want to assess its potential markets and business impact. According to IDC, a market research firm that recently surveyed IT executives, CIOs, and other business leaders, IT spending on cloud services will reach US$42 billion by 2012. However, as with any disruptive technology and transitional business model, there is no definitive assessment of cloud computing's market opportunity; we believe its long-term business impact could be even larger.

    1. 1. Strategic Scenarios in Digital contents Marco Brambilla et al. Politecnico di Milano, DEI and MIP Acer Academy May 2009
    2. 2. Agenda overview  Information overload  Evolution of contents  Web 2.0  Web 3.0  Tools and technologies for managing information overload
    3. 3. 1. Information overload
    4. 4. Introduction and motivation  161 exabytes of information were created or replicated worldwide in 2006  IDC estimates 6X growth by 2010, to 988 exabytes (about a zettabyte) / year  That's more than in the previous 5,000 years. – DATA from: Dr. Michael L. Brodie - Chief Scientist, Verizon
    5. 5. Where does content come from  The largest source of data?  USERS  YouTube Videos  1.7 billion served / month  1 million streams / day = 75 billion e-mails  Facebook had [in 2007] …  1.8 billion photos  31 million active users  100,000 new users / day  1,800 applications  MySpace, 185+ million registered users (Apr 2007), has…  Images: – 1+ billion - Millions uploaded / day - 150,000 requests / sec  Songs: – 25 million - 250,000 concurrent streams  Videos: – 60 TB - 60,000 uploaded / day - 15,000 concurrent streams
    6. 6. Quality of data  (User Generated) Content is:  25% original; 75% replicated  25% from the workplace; 75% not  95% unstructured and growing  While enterprise data is 10-15% structured and decreasing  Main challenges:  How to make multimedia content available to search engines and search based applications?  Exploiting multimedia content requires: – Acquiring it – (Re) Formatting it – Indexing it – Querying it – Transmitting it – Browsing it
    7. 7. Information overload effects on (our) way of working For knowledge workers • Time is limited • Processes overlap • Knowledge is (often) artefact- dependent • Tools allow multiplicity of uses • Need for several tools • Relations with people take time • Contexts mix and merge
    8. 8. Example: email (!!) 8
    9. 9. Working with information  Types of information  Usefulness – Active: ephemeral and working (“hot”) – Dormant: inactive, potentially useful (“cold”) – Not useful – Un-accessed  Ownership: mine or not-mine  Activities  Acquisition of items to form a collection  Organisation of items  Maintenance of the collection (e.g. archiving items into long- term storage)  Retrieval of items for reuse  Information (and choice) overload.. On YOUTUBE
    10. 10. Acquisition  Different between tools  Manual (files), uncontrolled (e-mails)  Push vs. pull  Reasons for deciding how to store information  Portability  Number of access points  Preservation of information in its current state  Currency of information  Context  Reminding  Ease of integration into existing structures  Communication and information sharing  Ease of maintenance
    11. 11. Organisation  Categorisations are complex  Folders vs. keywords  Trees vs. webs  Change over time  Categorisations are local  If two groups of people construct thesauri in a particular subject area, the overlap of index terms will only be 60%  Two indexers using the same thesaurus on the same document use common index terms in only 30% of cases  The output from two experienced database searchers has only 40% overlap  Experts' judgements of relevance concur in only 60% of cases
    12. 12. Maintenance  Hardly any  Occasional cleaning  Extensive maintenance is related to major life changes (e.g. new job)
    13. 13. Retrieval  Personal archives instead of corporate systems  Need to start searching  Not invented here: reinventing is more fun than reusing  Asking is more difficult than sharing  Social search: asking others  Estimations of quality and relevance are best made by experts themselves  It's the fastest and most efficient way  Colleagues can give you feedback and help to sharpen your questions  Consulting others is fun  While searching systems  Preference for location-based search  Critical reminding function of file placement  Lack of retrieval of archived files
    14. 14. 2. Evolution of contents
    15. 15. Evolution of contents and technologies  I. from static to dynamic  II. from fixed to mobile  III. from big to small  IV. from local to global  V. from vertical to horizontal  VI. from sometimes-on to always-on  VII. from wired to wireless  VIII. from divergence to convergence 15
    16. 16. Content proliferation and classification  Proliferation of  blogs  online video  podcasting  other social media tools  the definition of what constitutes "web"/"non-web" content has become increasingly blurred 16
    17. 17. Pervasive and convergent digital content 17
    18. 18. Convergence of connectivity 18
    19. 19. 3. Web 2.0
    20. 20. Social- vs. Group-ware  The basic model of 90's-era collaboration (Lotus Notes): all about the group.  Information was managed in group-based repositories, then passed around for review, or published to intranet portals via customized apps. These are information-era workflows where people are first and foremost occupiers of roles, not individuals, and the materials being created are more closely aligned with groups than individuals.  Web 2.0 social tools: MySpace, Facebook, LinkedIn  Social networks -- explicit ones, or implicit ones in social media -- are really organized around individuals and their networked self-expression. I am writing this blog post, and publishing it, personally. It is not the product of some workgroup. It is not an anonymous chunk of text on a corporate portal. My Facebook profile pulls traffic from my network of contacts, sources I find interesting, and the chance presence updates of my friends. 21
    21. 21. Doug Engelbart, 1968 "The grand challenge is to boost the collective IQ of organizations and of society. "
    22. 22. Tim O’Reilly, 2006, on Web 2.0 “The central principle behind the success of the giants born in the Web 1.0 era who have survived to lead the Web 2.0 era appears to be this, that they have embraced the power of the web to harness collective intelligence”
    23. 23. Web 2.0 is about The Social Web “Web 2.0 Is Much More About A Change In People and Society Than Technology” -Dion Hinchcliffe, tech blogger  1 billion people connect to the Internet  100 million web sites  over a third of adults in US have contributed content to the public Internet. - 18% of adults over 65
    24. 24. Tim Berners-Lee “The Web isn’t about what you can do with computers. It’s people and, yes, they are connected by computers. But computer science, as the study of what happens in a computer, doesn’t tell you about what happens on the Web.” NY Times, Nov 2, 2006
    25. 25. But what is "collective intelligence" in the social web sense?  intelligent collection?  collaborative bookmarking, searching  "database of intentions"  clicking, rating, tagging, buying  what we all know but hadn't got around to saying in public before  blogs, wikis, discussion lists "database of intentions" – Tim O'Reilly
    26. 26. the wisdom of clouds?
    27. 27. “Collective Knowledge” Systems  The capacity to provide useful information  based on human contributions  which gets better as more people participate.  typically  mix of structured, machine-readable data and unstructured data from human input
    28. 28. Collective Knowledge is Real  FAQ-o-Sphere - self service Q&A forums  Citizen Journalism – “We the Media”  Product reviews for gadgets and hotels  Collaborative filtering for books and music  Amateur Academia
    29. 29. The timeline
    30. 30. Web 2.0 The phrase "Web 2.0" can refer to one or more of the following:  The transition of web sites from isolated information silos to sources of content and functionality, thus becoming computing platforms serving web applications to end-users  A social phenomenon embracing an approach to generating and distributing Web content itself, characterized by open communication, decentralization of authority, freedom to share and re-use, and "the market as a conversation”  Enhanced organization and categorization of content, emphasizing deep linking  A rise in the economic value of the Web, possibly surpassing the impact of the dot-com boom of the late 1990s
    31. 31. Two main kinds  PEOPLE FOCUS: The first kind of socializing is typified by "people focus" websites such as Bebo, Facebook, Myspace and Xiaonei.  HOBBY FOCUS: The second kind of socializing is typified by "hobby focus" websites such as Flickr, Kodak Gallery and Photobucket
    32. 32. Web 2.0 (see Wesch from YouTube [LOCAL]) Since social web applications are built to encourage communication between people, they typically emphasize some combination of the following social attributes:  Identity: who are you?  Reputation: what do people think you stand for?  Presence: where are you?  Relationships: who are you connected with? who do you trust?  Groups: how do you organize your connections?  Conversations: what do you discuss with others?  Sharing: what content do you make available for others to interact with?  Examples of social applications include Twitter, Facebook, Stumpedia, and Jaiku.
    33. 33. Keyword: sharing!  Sharing...  Useful vs. Not useful (!?) 
    34. 34. Sharing for the enterprise? (1) A teenager model? (2) Always useful?
    35. 35. Community 36
    36. 36. Human Resource Management 2.0  Social networks for the job market – To find and be found – To manage your online reputation – To research and reference check – To hire a superstar – To use your network to do your job better – To use your network to get a better job
    37. 37. Blog  a user-generated website where entries are made in journal style and displayed in a reverse chronological order. The term "blog" is derived from "Web log." "Blog" can also be used as a verb, meaning to maintain or add content to a blog.
    38. 38. Wiki  a website that allows the visitors themselves to easily add, remove, and otherwise edit and change available content, typically without the need for registration. This ease of interaction and operation makes a wiki an effective tool for mass collaborative authoring.
    39. 39. Best known wiki
    40. 40. Wiki vs. Blog A blog, or web log, shares writing and multimedia content in the form of "posts" (starting point entries) and "comments" (responses to the posts). While commenting, and even posting, are open to the members of the blog or the general public, no one is able to change a comment or post made by another. The usual format is post-comment-comment-comment, and so on. For this reason, blogs are often the vehicle of choice to express individual opinions. A wiki has a far more open structure and allows others to change what one person has written. This openness may trump individual opinion with group consensus.
    41. 41. Special purpose blogs: photos, music, ... 42
    42. 42. (Social) Tagging  Term – a word or phrase that is recognizable by people and computers  Document – a thing to be tagged, identifiable by a URI or a similar naming service  Tagger – someone or thing doing the tagging, such as the user of an application  Tagged – the assertion by Tagger that Document should be tagged with Term
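The four-part tagging model above (Term, Document, Tagger, and the Tagged assertion) can be sketched as a small data structure. This is an illustrative sketch; the class and function names are invented, not part of any standard.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TagAssertion:
    term: str       # a word or phrase recognizable by people and computers
    document: str   # URI (or similar name) of the thing being tagged
    tagger: str     # identity of the user or agent doing the tagging

def tag(tagger: str, document: str, term: str) -> TagAssertion:
    """The 'Tagged' relation: tagger asserts that document matches term."""
    return TagAssertion(term=term, document=document, tagger=tagger)

a = tag("alice", "http://example.org/photo/42", "sunset")
print(a.term)  # sunset
```

Making the assertion a first-class object (rather than a bare document property) matters later, when tag data from systems with different models must be combined.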
    43. 43. Podcast  A podcast is a media file that is distributed by subscription (paid or unpaid) over the Internet using syndication feeds, for playback on mobile devices and personal computers.
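Since podcast feeds are ordinary RSS documents, the subscribed media files can be located by reading each item's enclosure element. A minimal sketch with Python's standard library; the feed content below is invented for illustration.

```python
import xml.etree.ElementTree as ET

# Invented sample feed: a podcast episode is the media file referenced
# by the <enclosure> element of an RSS <item>.
RSS = """<rss version="2.0"><channel><title>Demo cast</title>
<item><title>Episode 1</title>
<enclosure url="http://example.org/ep1.mp3" length="1234" type="audio/mpeg"/>
</item></channel></rss>"""

def enclosures(rss_text):
    """Return (item title, media URL, MIME type) for each enclosure."""
    root = ET.fromstring(rss_text)
    out = []
    for item in root.iter("item"):
        enc = item.find("enclosure")
        if enc is not None:
            out.append((item.findtext("title"), enc.get("url"), enc.get("type")))
    return out

print(enclosures(RSS))  # [('Episode 1', 'http://example.org/ep1.mp3', 'audio/mpeg')]
```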
    44. 44. Examples of Podcasts available  iTunes Store  NPR  ArtsEdge  Ed. Podcast Network  SFMoMA
    45. 45. Blog with Podcasts & Wikis  Several functions on the same platform
    46. 46. Gathering specific communities – TappedIn
    47. 47. Collecting feedbacks – SurveyMonkey
    48. 48. Tools. Example: collaboration and sharing  Webex  Meeting center  Training center  Acquired by CISCO in 2007  Integrated phone conferencing, VoIP, support for PowerPoint, Flash, audio, and video;  Meeting recording and playback, One-click meeting access, scheduling, and IM applications, full compatibility, secure communications  See webex-beefs-collaboration/ 49
    49. 49. Trends and size  Facebook growth: 700% from 2008 to 2009  Twitter growth: 3,700%  And unique visitors..
    50. 50. One big social application? Facebook connect!  evolution of Facebook Platform enabling you to integrate Facebook into your own site. You can add social context to your site:  Identity. Seamlessly connect the user's Facebook account with your site  Friends. Bring a user's Facebook friends into your site.  Social Distribution. Publish information back into Facebook.  Privacy. Bring dynamic privacy to your site. How scalable, reliable, open-minded? 51
    51. 51. Wouldn’t this be better? But.. 52
    52. 52. The Mash-up approach  User-defined combination of services available on the web  Graphical design  Immediate execution
    53. 53. E.g.: airlines mash-up Tracing of referral, searches, and so on […]
    54. 54. SOA vs. Web 2.0 [diagram: the lifecycle phases Planning, Design, Implementation and Monitoring compared for SOA and Web 2.0]
    55. 55. Comparison ... Web 2.0 vs. SOA:  SaaS = SaaS  Web-based interoperability (REST) vs. standards-based interoperability (SOAP, WSDL, UDDI)  Application as a platform = Application as a platform  Pushes for unexpected reuse vs. Allows reuse  RIA vs. No UI  Participatory architecture vs. Centralized governance
    56. 56. … and complementarity Source: Babak Hosseinzadeh, IBM
    57. 57. Short-term challenge: Mash-up on SOA [diagram: a mash-up layer built on top of SOA services]
    58. 58. Mid-term: Web as a platform  The past: frameworks and APIs layered on an operating system, running on hardware  The future: RSS, REST and SOAP services layered on the Web, running on the Internet [diagram]
    59. 59. Example: eBay  Services for  shopping  trading  Publishes services  REST interface  SOAP interface  Numbers:  4 billion requests/month (5.5 mln/h)  25% of the offer only via Web Service  25,000 registered developers  1,900 known applications
    60. 60. Example: Amazon  Services for  e-commerce  on-line payment  computing (EC2)  storage (s3)  human computing (MTurk)  Queues (SQS)  Success stories  Ex 1, Jungle Disk: online back-up service  Ex 2, ABACA:99%-protection antispam
    61. 61. (NOT) Artificial intelligence: Mechanical Turk ! 62
    62. 62. 4. Web 3.0
    63. 63. SOA provides great plumbing!
    64. 64. Web 2.0 provides great plumbing! E. Della Valle @ CEFRIEL - Politecnico di Milano
    65. 65. Is plumbing enough?
    66. 66. How to manage complexity?  A few services in a small company  Hundreds of services and processes in a big organization [diagram: from a few services in one company, to several services, to several enterprises, composed via mashup and complex BPM]
    67. 67. The problem is in the semantics! "The problem is not in the plumbing, it is in the semantics" Verizon Chief Scientist - M. L. Brodie "Semantic heterogeneity remains the main roadblock to application integration, one that Web Services alone will not solve. Until someone finds a way to make applications understand each other, the effects of Web Services will remain limited. When you pass a user's data in a given format using a Web Service as the interface, the receiving program still has to know what format the data is in. You still have to agree on the structure of each business object. So far nobody has found a workable solution…" Oracle Chairman and CEO - Larry Ellison
    68. 68. Web 3.0  Combining SOA + Social Web + Semantic Web  I.e., Services + Folksonomies + Ontologies (or + Taxonomies) 69
    69. 69. Tim Berners-Lee, 2001 “The Semantic Web is not a separate Web but an extension of the current one, in which information is given well- defined meaning, better enabling computers and people to work in cooperation.” Scientific American, May 2001
    70. 70. Beyond Web 2.0 ... The Web as a world-scale platform: given a BPM, find the best set of services? find the best data source? manage heterogeneous data/services? ... at RUNTIME! [diagram: mediators integrating legacy systems, communications, services, a buyer and a 3rd-party shipment provider around a business process]
    71. 71. SOA + Web 2.0 = ? [diagram: the SOA triangle: a service provider publishes a service description (WSDL) to discovery agencies (UDDI); a service requester discovers the description and interacts with the provider via SOAP; WSBPEL composes services]
    72. 72. SOA Advantages [chart: relative costs of adoption, deployment, maintenance and changes for different EAI approaches: custom integration, proprietary EAI solutions, Web Services based EAI solutions, SOA based EAI solutions; source: ZapThink]
    73. 73. From vertical applications...  Different IT solutions in each department Department 1 Department 2 Department N […]
    74. 74. … to service extraction …  Rationalization of IT solutions  Factorization and publication of common services Department 1 Department 2 Department N […]
    75. 75. … and process composition.  For using internal subprocesses, but also processes of customers or providers. Client Department 1 Department 2 Shared services Outsourced services Provider
    76. 76. “Ontology is overrated.”  “[tags] are a radical break with previous categorization strategies”  hierarchical, centrally controlled, taxonomic categorization has serious limitations  e.g., Dewey Decimal System  free-form, massively distributed tagging is resilient against several of these limitations
    77. 77. But...  ontologies aren‟t taxonomies  they are for sharing, not finding  they enable cross-application aggregation and value-added services
    78. 78. Ontology of Folksonomy  What would it look like to formalize an ontology for tag data?  Functional Purpose: applications that use tag data from multiple systems  tag search across multiple sites  collaboratively filtered search – “find things using tags my buddies say match those tags”  combine tags with structured query – “find all hotels in Spain tagged with “romantic”
    79. 79. Example: formal match, semantic mismatch  System A says a tag is a property of a document.  System B says a tag is an assertion by an individual with an identity.  Does it mean anything to combine the tag data from these two systems?  “Precision without accuracy”  “Statistical fantasy”
    80. 80. Engineering the tag ontology  Working with tag community, identify core and non core agreements  Use the process of ontology engineering to surface issues that need clarification  Couple a proposed ontology with reference implementations or hosted APIs
    81. 81. Issues raised by ontological engineering  is term identity invariant over case, whitespace, punctuation?  are documents one-to-one with URI identities? (are alias URLs possible?)  can tagging be asserted without human taggers?  negation of tag assertions?  tag polarity – “voting” for an assertion  tag spaces – is the scope of tagging data a user community, application, namespace, or database?
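As an illustration of the first issue above, one possible normalization policy makes term identity invariant over case, surrounding whitespace and punctuation. This is an assumed, illustrative policy, not a standard; a real tag ontology would have to make this choice explicit.

```python
import re

def normalize_tag(term: str) -> str:
    """Canonical tag form: lowercase, no punctuation, hyphen-joined words.

    Illustrative policy only: it deliberately treats 'Web 2.0' and
    'web-20' as the same term identity.
    """
    term = term.strip().lower()
    term = re.sub(r"[^\w\s-]", "", term)   # drop punctuation
    term = re.sub(r"\s+", "-", term)       # collapse whitespace to hyphens
    return term

print(normalize_tag("  Web 2.0 "))  # web-20
```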
    82. 82. Pivot Browsing – surfing unstructured content along structured lines  Structured data provides dimensions of a hypercube  location  author  type  date  quality rating  Travel researchers browse along any dimension.  The key structured data is the destination hierarchy  Contributors place their content into the destination hierarchy, and the other dimensions are automatic.
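The pivot-browsing idea above can be sketched as filtering content items along any structured dimension of the hypercube. The field names and sample data are invented for illustration.

```python
def pivot(items, **criteria):
    """Keep items whose fields match every given dimension value."""
    return [it for it in items
            if all(it.get(k) == v for k, v in criteria.items())]

# Invented sample: travel content with structured dimensions.
items = [
    {"title": "Alhambra photos", "location": "Spain", "type": "image"},
    {"title": "Madrid hotel review", "location": "Spain", "type": "text"},
    {"title": "Oslo video diary", "location": "Norway", "type": "video"},
]

# Browse along two dimensions at once: location and content type.
print([it["title"] for it in pivot(items, location="Spain", type="text")])
# ['Madrid hotel review']
```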
    83. 83. 5. Tools and technologies for managing information overload
    84. 84. Tools Information: the double-edged sword  You want good information, not all information  Information Retrieval / search – Multimedia IR  RSS / Bloglines / Google Reader  Social bookmarking
    85. 85. 5.1. Multimedia Information Retrieval
    86. 86. Data in digital libraries  TEXT: e-book, Word documents, Web pages, PDF, Blog, etc.  Audio:  Speech (broadcasting, podcasting, recording, etc.)  Music (CD, MP3, etc.)  Pictures: Personal photos, schemes, diagrams, etc.  Video: sequence of images and audio (music and/or speech) Challenge: How to make multimedia content available to search engines and search based applications?
    87. 87. Some user challenges…  Precision & contextual relevancy  aware of rights, user and information contexts  personalization and recommendation  Search must support multiple interaction patterns  active searching, monitoring, browsing and "being aware“  Trust and spam  Ubiquity of access
    88. 88. MIR Application Areas  Architecture, real estate, and interior design (e.g., searching for ideas)  Broadcast media selection (e.g., radio and TV channel)  Cultural services (history museums, art galleries, etc.)  Digital libraries (e.g., musical dictionary, bio-medical imaging catalogues, film, video and radio archives)  E-Commerce (e.g., personalized advertising, on-line catalogues)  Education (e.g., repositories of multimedia courses)  Home Entertainment (e.g., personal multimedia collections)  Investigation services (e.g., human characteristics recognition, forensics)  Journalism (e.g., searching speeches of a certain politician using his name, his voice or his face)  Multimedia directory services (e.g., yellow pages, tourist information, GIS)  Multimedia editing (e.g., personalized news service, media authoring)  Remote sensing (e.g., cartography, ecology)  Shopping (e.g., searching for clothes)  Social (e.g., dating services)  Surveillance (e.g., traffic control)
    89. 89. MIR: Query Examples  Play a few notes on a keyboard and retrieve a list of musical pieces similar to the required tune, or images matching the notes in a certain way, e.g., in terms of emotions  Draw a few lines on a screen and find a set of images containing similar graphics, logos, ideograms,...  Define objects, including color patches or textures and retrieve examples among which you select the interesting objects to compose your design  On a given set of multimedia objects, describe movements and relations between objects and so search for animations fulfilling the described temporal and spatial relations  Describe actions and get a list of scenarios containing such actions  Using an excerpt of Pavarotti’s voice, obtaining a list of Pavarotti’s records, video clips where Pavarotti is singing and photographic material portraying Pavarotti
    90. 90. State of the art of MSE (multimedia search engines)  Image search  Video search  Music search  Enterprise MIR search
    91. 91. Metadata? 92  “Data about other data”  They describe in a structured fashion properties of the data – E.g.: owner, creation and modification date, description, etc.  Some metadata are implicitly available  E.g.: file size, file name, etc.  Others need to be manually provided or automatically extracted
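The "implicitly available" metadata mentioned above (file name, size, modification date) can be read directly from the file system. A minimal sketch; a real system would walk an entire media archive rather than a single throwaway file.

```python
import os
import tempfile
import time

def implicit_metadata(path):
    """Collect metadata that needs no manual input or content analysis."""
    st = os.stat(path)
    return {
        "name": os.path.basename(path),
        "size_bytes": st.st_size,
        "modified": time.strftime("%Y-%m-%d", time.localtime(st.st_mtime)),
    }

# Demo on a temporary 128-byte file standing in for a media item.
with tempfile.NamedTemporaryFile(suffix=".mp3", delete=False) as f:
    f.write(b"\x00" * 128)
    demo_path = f.name
meta = implicit_metadata(demo_path)
print(meta["size_bytes"])  # 128
os.remove(demo_path)
```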
    92. 92. The MIR reference architecture
    93. 93. Content Process Content Content Content Acquisition Transformation Indexing
    94. 94. Content acquisition  In MIR, content is acquired from many sources and in in multiple ways:  By crawling  By user’s contribution  By syndicated contribution from content aggregators  Via broadcast capture (e.g., from air/cable/satellite broadcast, IPTV, Internet TV multicast, ..)
    95. 95. Content acquisition  In text or Web search engines, content is a closed or open collection of documents  Textual Web content is acquired by crawlers, which exploit link navigation  In MIR, content is acquired from many sources, in a range of quality and value:  Web cams, security apps  (Video/Audio) Telephony and teleconferencing  Industrial/Academic/Medical  User Generated Content  Public Access and Government Access  Rushes, raw footage  News  Advertising  TV programming  Feature films [chart: content value vs. production cost, ranging from web cam/security and user generated content up to enterprise, broadcast TV and motion pictures]
    96. 96. Acquisition: (video) metadata sources & formats  Content elements may be accompanied by textual descriptions, which range in quantity and quality from no description (e.g., web cam content) to multilingual high-value data (closed captions and production metadata of motion pictures)  Metadata may reside:  Embedded within content (e.g., closed captions)  In surrounding Web pages or links (e.g., HTML content, link anchors, etc.)  In domain-specific databases (e.g., IMDB for feature films)  In ontologies [diagram: an asset package multiplexing media streams with embedded metadata, plus external metadata]
    97. 97. Acquisition: (video) representative metadata standards  MPEG-7, MPEG-21: ISO/IEC, Motion Picture Expert Group  UPnP: Universal Plug and Play Forum  MXF, MDD: SMPTE, Society of Motion Picture and Television Engineers  AAF: AMWA, Advanced Media Workflow Association  TV-Anytime: ETSI, European Telecommunications Standards Institute  Timed Text: W3C, 3GPP  RSS: Harvard  Podcast: Apple  Media RSS: Yahoo
    98. 98. Transformation dimensions: digital video formats  A digital video is a sequence of frames  The Frame Aspect Ratio (FAR) defines the shape of each image (width divided by height), with 4:3 and 16:9 being the currently adopted values  The Pixel Aspect Ratio (PAR) describes how the width of pixels in a digital image compares to their height (rectangular pixel formats exist for analog TV compatibility)  Frame rate: number of frames per second (24 and 25 are common, but lower and higher values are also used)
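The relationship between the two aspect ratios above can be made concrete: the displayed shape of a frame is its storage width over height, corrected by the pixel aspect ratio. A small sketch:

```python
def display_aspect_ratio(width_px, height_px, par=1.0):
    """FAR = (storage width * PAR) / storage height."""
    return (width_px * par) / height_px

# Square-pixel HD frame: 1920x1080 displays as 16:9.
print(round(display_aspect_ratio(1920, 1080), 4))          # 1.7778
# Anamorphic widescreen PAL DV: 720x576 stored with rectangular
# pixels (PAR 64/45) also displays as 16:9.
print(round(display_aspect_ratio(720, 576, 64 / 45), 4))   # 1.7778
```

The second case shows why PAR exists: the same 720x576 storage raster can yield either a 4:3 or a 16:9 picture depending on the pixel shape.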
    99. 99. Transformation dimensions: compression  Web media must be compressed, with lossy (but perceptually acceptable) transformations  In video, compression works in two ways  Intra-frame: an image is divided into blocks, whose content is "averaged"  Inter-frame: a frame is represented differentially with respect to the preceding one, by encoding only the blocks that "have moved" and their motion vectors  Example (MPEG compression)
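The inter-frame idea above can be shown with a toy sketch: each frame is a list of blocks, and only the blocks that differ from the previous frame are stored. (Real codecs also store motion vectors and residuals; this sketch omits them.)

```python
def encode_delta(prev_blocks, cur_blocks):
    """Return [(index, new_block)] for blocks that changed since prev."""
    return [(i, b) for i, (a, b) in enumerate(zip(prev_blocks, cur_blocks))
            if a != b]

def decode_delta(prev_blocks, delta):
    """Rebuild the current frame from the previous one plus the delta."""
    out = list(prev_blocks)
    for i, b in delta:
        out[i] = b
    return out

prev = [0, 0, 0, 0]
cur = [0, 7, 0, 9]
d = encode_delta(prev, cur)
print(d)  # [(1, 7), (3, 9)] -- only 2 of 4 blocks need encoding
```

The payoff is exactly the one the slide describes: when consecutive frames are similar, the delta is far smaller than the full frame.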
    100. 100. Content transformation: popular compression standards  M-JPEG, JPEG2000 (up to 60 Mbit/s): consumer electronics, video editing systems  DVCAM (25 Mbit/s): consumer  MPEG-1 (1.5 Mbit/s): CD-ROM multimedia  MPEG-2 (4-20 Mbit/s): broadcast TV, DVD  MPEG-4 (300 Kbit/s - 12 Mbit/s): mobile video, podcast, IPTV  H.264, H.261, H.263 (64 Kbit/s - 1 Mbit/s): video teleconferencing, telephony  Each standard has profiles that balance latency, complexity, error resilience and bandwidth for a specific target application (e.g., file-based vs. transport-based fruition)
    101. 101. Content indexing  In textual search engines, content needs little (lexical) analysis before indexing  Index elements (words) are part of the content  In MIR, content cannot be indexed directly  Indexable metadata must be created from the input data – Low-level features: concisely describe physical or perceptual properties of a media element (e.g., feature vectors) – High-level features: domain concepts characterizing the content (e.g., extracted objects and their properties, content categorizations, etc.)  In continuous media, extracted features must be related to the media segment that they characterize, both in space and time  Feature extraction may require a change of medium, e.g., speech-to-text transcription
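For the textual case above, where index elements are the words themselves, an inverted file is simply a map from each term to the documents containing it. A minimal sketch (for media, the keys would be extracted features rather than words):

```python
from collections import defaultdict

def build_inverted_index(docs):
    """docs: {doc_id: text}; returns {word: sorted list of doc_ids}."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return {w: sorted(ids) for w, ids in index.items()}

idx = build_inverted_index({1: "sunset over sea", 2: "sea birds"})
print(idx["sea"])  # [1, 2]
```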
    102. 102. Motivations for metadata generation  Computers are not able to catch the underlying meaning of multimedia content  A computer is not able to understand that this picture represents a sunset  Pixels and audio samples do not convey semantics, just binary data  Metadata are used to produce representations that are manageable by computers  E.g.: text or numbers
    103. 103. How to create multimedia annotations?  Manually  Expensive – It can take up to 10x the duration of the video – Problems in scaling to millions of content items  Incomplete or inaccurate – People might not be able to holistically catch all the meanings associated with a multimedia object  Difficult – Some content is tedious to describe with words – E.g., a melody without lyrics  Automatically  Good quality – Some technologies have ~90% precision  “Low” cost
    104. 104. Indexing: the core pipeline  [Diagram: multimedia content (e.g., MPEG-2 video) → content processing (video and audio processing; segmentation; image, audio and video analysis) → metadata (e.g., MPEG-7) → indexing → indexes (e.g., inverted files)]
    105. 105. Image/Text segmentation  GOAL: identify the type of contents included in an image  Text + pictures  Image sections
    106. 106. Audio Segmentation  GOAL: split an audio track according to contained information  Music  Speech  Noise …  Additional usage  Identification and removal of ads
    107. 107. Video Segmentation  Keyframe segmentation:  segment a video track according to its keyframes – fixed-length temporal segments  Shot detection:  automated detection of transitions between shots – a shot is a series of interrelated consecutive pictures taken contiguously by a single camera and representing a continuous action in time and space.
    108. 108. Speaker identification  GOAL: identify people participating in a discussion ERIC DAVID JOHN  Additional usage:  Vocal command execution
    109. 109. Word spotting  GOAL: recognize spoken words belonging to a closed dictionary Call Open Bomb  Additional usage:  Spot blacklist words in spontaneous speech – E.g.: terrorist, attack,…  dialing (e.g., "Call home”)  call routing (e.g., "I would like to make a collect call”)  Domotic appliance control
    110. 110. Speech to text  GOAL: automatically recognize spoken words belonging to an open dictionary  Example: quote_detection.avi CREDITS: Thorsten Hermes@SSMT2006
    111. 111. Identification of audio events  GOAL: automatically identify audio events of interest  E.g.: shouts, gunshots, etc.  Additional usage:  Security applications  Example: sound_events.avi CREDITS: Thorsten Hermes@SSMT2006
    112. 112. Classification of music genre, mood, etc.  GOAL: automatically classify the genre and mood of a song  Rock, pop, Jazz, Blues, etc.  Happy, aggressive, sad, melancholic, Rock Dance!  Additional usage:  Automatic selection of songs for playlist composition
    113. 113. Images: low-level features  GOAL: extract implicit characteristics of a picture  luminosity  orientations  textures  Color distribution
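A minimal sketch of one such low-level feature, assuming a hypothetical 4-bin grey-level quantization (real descriptors cover colour distribution, orientations and textures):

```python
# Sketch of a low-level image feature: a grey-level histogram used as a
# feature vector. Pixels are 0..255 intensities; 4 bins is an arbitrary
# choice for illustration.

def grey_histogram(pixels, bins=4):
    hist = [0] * bins
    for p in pixels:                            # quantize each pixel into a bin
        hist[min(p * bins // 256, bins - 1)] += 1
    total = len(pixels)
    return [h / total for h in hist]            # normalize so images of any size compare

print(grey_histogram([0, 10, 200, 255]))        # half dark, half bright pixels
```

The normalized vector can then be compared across images (e.g., with a cosine or Euclidean distance) for similarity search.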
    114. 114. Images: Optical character recognition (OCR)  OCR is a technique for translating images of typed or handwritten text into symbols  Solved problem for typewritten text (99% accuracy)  Commercial solutions for handwritten text (e.g., MS Tablet PC)
    115. 115. Image: face identification and recognition  GOAL: recognize and identify faces in an image  Usage examples:  People counting  Security applications  Example: face_detection.avi CREDITS: Thorsten Hermes@SSMT2006
    116. 116. Image: concept detection  Image analysis extracts low-level features from raw data (e.g., color histograms, color correlograms, color moments, co-occurrence texture matrices, edge direction histograms, etc.)  Features can be used to build discrete classifiers, which may associate semantic concepts to images or regions thereof  The MediaMill semantic search engine defines 491 semantic concepts  Concepts can also be detected from text (e.g., from manual or automatic metadata) using NLP techniques (the FAST text search engine recognizes entities like geographical locations, professions, names of persons, domain-specific technical concepts, etc.)
    117. 117. Image: object identification  GOAL: identify objects appearing in a picture  Basket ball, cars, planes, players, etc.  Also by example (unaware of position, scaling, etc) – objectByExample.mp4 CREDITS:
    118. 118. Video OCR  Video OCR has specific problems, due to low resolution, small text size, and interference with the background  Detection is normally done on the most representative image of an entire shot, rather than frame by frame  Approach: a filter for enhancing resolution + pattern matching for character identification  Example: Virage ConTEXTract text extraction and recognition technology (recognizes text in real time)
    119. 119. Multimodal annotation fusion  Media segmentation and concept extraction are probabilistic processes  The result is characterized by a confidence value  Significance can be enhanced by comparing the output of distinct techniques applied to the same or similar problems  Examples:  Media segmentation: shot detection + speaker’s turn identification  Person recognition: voice identification + face detection  Concept detection: image based classification (e.g., “outdoor” & “water” + object extraction: “bird”, “boat”)
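One simple fusion scheme (among many; the detector names and weights below are invented for illustration, and real systems may use trained, e.g. Bayesian, fusion) is a weighted average of per-detector confidences:

```python
# Sketch of multimodal annotation fusion: combine the confidence scores of
# independent detectors for the same concept into one fused score.

def fuse_confidences(scores, weights=None):
    """Weighted mean of per-detector confidence scores in [0, 1]."""
    weights = weights or [1.0] * len(scores)
    return sum(s * w for s, w in zip(scores, weights)) / sum(weights)

# Hypothetical "person present" decision: the face detector says 0.9, the
# voice identifier says 0.5, and we trust the face detector twice as much.
fused = fuse_confidences([0.9, 0.5], weights=[2.0, 1.0])
assert abs(fused - (0.9 * 2 + 0.5) / 3) < 1e-9
```

Agreement between modalities raises the fused confidence; disagreement pulls it toward the middle, which is exactly the significance boost the slide describes.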
    120. 120. Overview of the query process
    121. 121. Content querying  In textual search applications, queries are keywords or expressions thereof  In MIR, search can take place  By keyword  By (mono-media) example (e.g., query by image, query by humming, query by song similarity)  By (multi-media) example (e.g., query by video similarity)  Query by example entails real time content processing  MIR query processing naturally requires the interaction of multiple search engines (e.g., a text search engine for textual metadata and a content-based search engine for feature vectors)
    122. 122. Querying: modalities  In MIR applications, search keywords match the manual or automatic metadata  A complementary approach is to provide an example of the desired content and look for similar media elements  Similarity is a medium-dependent, domain-dependent, and subjective criterion  It can be computed on low-level features (e.g., image color histograms, music bpm) or on high-level concepts/categorizations (e.g., melancholic images, party music)  It can be multimodal (e.g., video similarity)  Querying may also consider context information (e.g., the user's geographical position or the access device)
    123. 123. Example query modalities and search types  [Diagram: query analysis federates multiple engines – XML search over a semantic index (e.g., where[contains(“amsterdam”)] and topic[contains(“building”)]), text search over an inverted index (“amsterdam”), image and music similarity search over similarity indexes (an example image or song), and geo search over an R-tree index (52.37N 4.89E)]
    124. 124. Faceted query  When a media collection is large and its content unknown to the user, exposing part of the metadata can help  This can be done by showing a compact representation of the categories of content (facets)  A query can be restricted by selecting only the relevant facets
    125. 125. Querying: by keyword  The keyword may match the manual metadata and/or the automatic metadata  The match can be multimodal: in the audio, in a visual concept
    126. 126. Querying: by similarity – query interface
    127. 127. Content browsing  In textual search engines, results are ranked linearly, browsed by navigating links, and read at a glance  In MIR and similarity-based search applications, browsing results must consider multiple dimensions  Relevance: where the result appears in the sequence of retrieved media elements  Space: where the search has matched inside a spatially organized media element (e.g., an image)  Time: when a match occurs in a linear media element
    128. 128. Browsing: timeline-based video access
    129. 129. References  MPEG-7: MPEG-7 Overview  Prof. Ray Larson & Prof. Marc Davis, UC Berkeley SIMS 202/f03  RSS  Media RSS  MPEG  Shot detection
    130. 130. References  MediaMill  Similarity search  Slides from the course “Archivi Multimediali e Data Mining”, Politecnico di Torino, Prof. Silvia Chiusano  Slides and videos of the lectures given by Prof. Thorsten Hermes at the SSMS 2006 summer school  PHAROS: http://www.pharos-audiovisual-
    131. 131. 5.2 RSS and readers
    132. 132. Acquisition: RSS and Media RSS  RSS (Really Simple Syndication) describes a family of web feed formats used to publish frequently updated web resources (e.g., news)  An RSS feed includes full or summarized text, plus metadata such as publishing dates and authorship  RSS formats are specified using XML  RSS 2.0 is now “frozen”  Media RSS was proposed by Yahoo as an RSS module that supplements the <enclosure> element capabilities of RSS 2.0 to allow for more robust media syndication.
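Since an RSS 2.0 feed is plain XML, it can be read with any XML parser; a minimal sketch using Python's standard library, with an invented feed for illustration:

```python
# Parsing a minimal RSS 2.0 feed (content invented for illustration) with the
# standard-library XML parser.
import xml.etree.ElementTree as ET

rss = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example News</title>
    <item>
      <title>First post</title>
      <pubDate>Mon, 04 May 2009 10:00:00 GMT</pubDate>
    </item>
  </channel>
</rss>"""

root = ET.fromstring(rss)
# Collect (title, publication date) for every <item> in the feed.
items = [(i.findtext("title"), i.findtext("pubDate")) for i in root.iter("item")]
assert items[0][0] == "First post"
```

A feed reader repeats this over time, keeping only items whose pubDate is newer than the last poll; Media RSS adds richer media elements to the same structure.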
    133. 133. Acquisition: Example of RSS 2.0
    134. 134. Acquisition: Browser rendition of RSS
    135. 135. Acquisition: an example of Media RSS
    136. 136. Indexing: Media segmentation in MPEG-7
    137. 137. Bloglines: web content aggregator
    138. 138. Google reader
    139. 139. Social bookmarking  Online shared catalogs of annotated bookmarks  Dedicated sites have emerged to manage the complexity of the bookmark-sharing task
    140. 140. 5.3 Personalization
    141. 141. Why Personalization?  Personalization is an attempt to find the most relevant documents using information about the user's goals, knowledge, preferences, navigation history, etc.
    142. 142. Same Query, Different Intent  “Cancer”  Different meanings  “Information about the astronomical/astrological sign of cancer”  “information about cancer treatments”  Different intents  “is there any new tests for cancer?”  “information about cancer treatments”
    143. 143. Personalization Algorithms  [Diagram: standard IR – the client sends the user's query to the server, which returns documents]  Related to relevance feedback  Query expansion  Result re-ranking
    144. 144. User Profile  A user's profile is a collection of information about the user of the system.  This information is used to guide the user to more relevant information
    145. 145. Core vs. Extended User Profile  Core profile  contains information related to the user search goals and interests  Extended profile  contains information related to the user as a person in order to understand or model the use that a person will make with the information retrieved
    146. 146. Who Maintains the Profile?  Profile is provided and maintained by the user/administrator  Sometimes the only choice  The system constructs and updates the profile (automatic personalization)  Collaborative - user and system  User creates, system maintains  User can influence and edit  Does it help or not?
    147. 147. Adaptive Search  Goals:  Present documents (pages) that are most suitable for the individual user  Methods:  Employ user profiles representing short-term and/or long- term interests  Rank and present search results taking both user query and user profile into account
    148. 148. Personalized Search: Benefits  Resolving ambiguity  The profile provides a context for the query in order to reduce ambiguity.  Example: an interest profile helps distinguish what a user asking about “Berkeley” (“Pirates”, “Jaguar”) really wants  Revealing hidden treasures  The profile allows the most relevant documents, which could be hidden beyond the top results page, to be brought to the surface  Example: the owner of an iPhone searches for Google Android. Pages referring to both would be most interesting
    149. 149. Where to Apply Profiles ?  The user profile can be applied in several ways:  To modify the query itself (pre-processing)  To change the usual way of retrieval  To process results of a query (post-processing)  To present document snippets  Special case: adaptation for meta-search
    150. 150. Pre-Process: Query Expansion  User profile is applied to add terms to the query  Popular terms could be added to introduce context  Similar terms could be added to resolve indexer-user mismatch  Related terms could be added to resolve ambiguity  Works with any IR model or search engine
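A minimal sketch of this pre-processing step, assuming a hypothetical term-to-weight profile (real systems would also weight the added terms in the expanded query):

```python
# Sketch of profile-based query expansion: append the heaviest profile terms
# that are not already in the query. Profile contents are invented.

def expand_query(query_terms, profile, k=2):
    """Add the k highest-weighted profile terms not already in the query."""
    candidates = sorted((t for t in profile if t not in query_terms),
                        key=lambda t: profile[t], reverse=True)
    return list(query_terms) + candidates[:k]

profile = {"university": 0.9, "admissions": 0.6, "pirates": 0.1}
expanded = expand_query(["berkeley"], profile)
assert expanded == ["berkeley", "university", "admissions"]
```

Because expansion only rewrites the query string, it works with any IR model or search engine, as the slide notes.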
    151. 151. Pre-Process: Relevance Feedback  In this case the profile is used to “move” the query  Imagine that:  the documents,  the query  the user profile are represented by the same set of weighted index terms
    152. 152. Post-Processing  The user profile is used to organize the results of the retrieval process  Present to the user the most interesting documents  Filter out irrelevant documents  The extended profile can be used effectively  In this case the use of the profile adds an extra step to processing  Similar to the classic information filtering problem  A typical approach for adaptive Web IR
    153. 153. Post-Filter: Annotations  The result could be relevant to the user in several aspects. Fusing this relevance with query relevance is error prone and leads to a loss of data  Results are ranked by the query relevance, but annotated with visual cues reflecting other kinds of relevance  User interests - Syskill and Webert, group interests - KnowledgeSea
    154. 154. Post-Filter: Re-Ranking  Re-ranking is a typical approach for post-filtering  Each document is rated according to its relevance (similarity) to the user or group profile  This rating is fused with the relevance rating returned by the search engine  The results are ranked by fused rating  User model: WIFS, group model: I-Spy
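The rating-fusion step can be sketched as a linear combination of the engine's relevance score and a profile-similarity score (alpha and the example scores below are invented):

```python
# Sketch of personalized re-ranking: fuse the search engine's relevance score
# with a per-document profile-similarity score, then sort by the fused value.

def rerank(results, profile_score, alpha=0.5):
    """results: list of (doc, engine_score); profile_score: doc -> similarity."""
    fused = [(doc, alpha * s + (1 - alpha) * profile_score.get(doc, 0.0))
             for doc, s in results]
    return sorted(fused, key=lambda x: x[1], reverse=True)

results = [("d1", 0.9), ("d2", 0.8)]          # engine preferred d1...
profile_score = {"d2": 1.0}                    # ...but d2 matches the profile
ranked = rerank(results, profile_score)
assert [d for d, _ in ranked] == ["d2", "d1"]
```

Alpha tunes how aggressively the profile overrides the engine: alpha = 1 reproduces the original ranking, alpha = 0 ranks purely by profile similarity.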
    155. 155. Privacy related problems  Web Information Retrieval faces a challenge: the data required to perform evaluations, namely query logs and click-through data, is not readily available due to valid privacy concerns.  Researchers can:  Limit themselves to small (and sometimes biased) samples of users, restricting somewhat the conclusions that can be drawn.  Limit the usage of private data to local computation, exploiting personal data only to post-process search results.  Look for publicly available data that can be used to approximate query logs and click-through data (such as user bookmarks).
    156. 156. Tag Data and Personalized Information Retrieval  Recently it has been shown that the information contained in social bookmarking (tagging) systems may be useful for improving Web search.  Using data from a social bookmarking site, it is possible to demonstrate how one can rate the quality of personalized retrieval results.  A user's “bookmark history” can be used to improve search results via personalization.  Analogously to studies involving implicit feedback mechanisms in IR, which have found that profiles based on the content of clicked URLs outperform those based on past queries alone, profiles based on the content of bookmarked URLs are generally superior to those based on tags alone.
    157. 157. Tag Data and Personalized Information Retrieval  Social bookmarking systems such as Bibsonomy are a recent and popular phenomenon.  Users label interesting web pages (or research articles) with primarily short and unstructured annotations in natural language, called tags.  These sites offer an alternative model for discovering information online.  Rather than following the traditional model of submitting queries to a Web search engine, users can browse tags as though they were directories, looking for popular pages that have been tagged by a number of different users. Since tags are chosen by users from an unrestricted vocabulary, these systems can be seen to provide consensus categorizations of interesting websites.
    158. 158. Tag Data and Personalized Information Retrieval  How can social bookmarking data be used to improve Web search?  Can tag data be used to approximate actual user queries to a search engine?  How can personalized IR systems be evaluated using the information contained in social bookmarks (tag data)?  Is there enough information in (i.e., a strong enough correlation between) the tags/bookmarks in a user's history to build a profile of the user that will be useful for personalizing search engine results?
    159. 159. Models for generating a profile of the user  We record the (time-ordered) stream of web pages that have been bookmarked by a particular user  The first simple profile involves counting the occurrences of terms in the tags of any of the known bookmarks.  An obvious problem is that users often have multiple interests and their many bookmarks cover a range of topics. Thus some bookmarks may be completely unrelated to the nth bookmark (and thus to the tags being used as the current query).
    160. 160.  The second source of information in the bookmarks is the content of the bookmarked pages themselves.  One would expect, given the much larger vocabulary of Web pages compared to tag data, that content may prove more useful than tags. Indeed, content-based profiles are more useful than query-based ones.  A user spends more time deliberating over which pages to bookmark than deciding which search results to click on.  Since a user will only bookmark sites that they find particularly useful or interesting, these documents should contain a lot of useful information about the user, and the content of bookmarked documents is particularly useful for personalization.
    161. 161.  The previous profile is somewhat ad hoc in its decision about which documents to include and which not to include.  In theory, we would like to include all documents that the user has bookmarked, but weight them according to their expected usefulness for resolving ambiguity in the current query.  Our first attempt to estimate the distance between two bookmarks is to count the number of common terms in their respective sets of tags
    162. 162. How do we use these profiles?  To incorporate the user profile into personalized information retrieval, queries are expanded with terms from the profile, weighted appropriately.  The number of expansion terms added to the query is limited, so as to limit the amount of noise and the total length of the expanded query.  In particular, the K most frequent terms from the profile are added, and the weights are normalized to account for the missing terms.
    163. 163. 5.4 Recommendation systems
    164. 164. Introduction to Recommender Systems  Systems for recommending items (e.g. books, movies, CDs, web pages, newsgroup messages) to users based on examples of their preferences.  Objectives:  To propose objects fitting the user's needs/wishes  To sell services (site visits) or goods  Many search engines and on-line stores provide recommendations (e.g. Amazon, CDNow).  Recommenders have been shown to substantially increase clicks (and sales).
    165. 165. Book Recommender  [Diagram: machine learning builds a user profile from rated books such as Red Mars, Foundation, Jurassic Park, Lost World, 2001, 2010, Neuromancer, The Difference Engine]
    166. 166. Personalization  Recommenders are instances of personalization software.  Personalization concerns adapting to the individual needs, interests, and preferences of each user.  Includes:  Recommending  Filtering  Predicting (e.g. form or calendar appt. completion)  From a business perspective, it is viewed as part of Customer Relationship Management (CRM).
    167. 167. Machine Learning and Personalization  Machine Learning can allow learning a user model or profile of a particular user based on:  Sample interaction  Rated examples  Similar user profiles  This model or profile can then be used to:  Recommend items  Filter information  Predict behavior
    168. 168. Types of recommendation systems  1. Search-based recommendations  2. Category-based recommendations  3. Collaborative filtering  4. Clustering  5. Association rules  6. Information filtering  7. Classifiers
    169. 169. 1. Search-based recommendations  The visitor simply types a search query  « data mining customer »  The system retrieves all the items that correspond to that query  e.g. 6 books  The system recommends some of these books based on a general, non-personalized ranking (sales rank, popularity, etc.)
    170. 170. Search-based recommendations  Pros:  Simple to implement  Cons:  Not very powerful  Which criteria to use to rank recommendations?  Is it really « recommendations »?  The user only gets what he asked for
    171. 171. 2. Category-based recommendations  Each item belongs to one category or more.  Explicit / implicit choice:  The customer selects a category of interest (refine search, opt-in for category-based recommendations, etc.). – « Subjects > Computers & Internet > Databases > Data Storage & Management > Data Mining »  The system selects categories of interest on behalf of the customer, based on the current item viewed, past purchases, etc.  Certain items (bestsellers, new items) are then recommended
    172. 172. Category-based recommendations  Pros:  Still simple to implement  Cons:  Again: not very powerful; which criteria to use to order recommendations? Is it really « recommendations »?  Effectiveness highly depends upon the kind of categories implemented – Too specific: not efficient – Not specific enough: no relevant recommendations
    173. 173. 3. Collaborative filtering  Collaborative filtering techniques « compare » customers, based on their previous purchases, to make recommendations to « similar » customers  It is also called « social » filtering  It follows these steps: 1. Find customers who are similar (« nearest neighbors ») in terms of tastes, preferences, past behaviors 2. Aggregate weighted preferences of these neighbors 3. Make recommendations based on these aggregated, weighted preferences (most preferred, unbought items)
    174. 174. Collaborative filtering  Example: the system needs to make recommendations to customer C  [Table: customer × book purchase matrix for customers A–E over books 1–6]  Customer B is very close to C (he has bought all the books C has bought). Book 5 is highly recommended  Customer D is somewhat close. Book 6 is recommended to a lower extent  Customers A and E are not similar at all. Weight = 0
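The three steps can be sketched on purchase sets; the data below is invented for illustration (it is not the slide's exact table), with overlap size as a crude similarity weight:

```python
# Sketch of user-based collaborative filtering on purchase sets: weight each
# customer by overlap with the active customer, then score unbought books by
# accumulated neighbour weight. Purchase data is hypothetical.

purchases = {
    "B": {"Book1", "Book2", "Book5"},
    "C": {"Book1", "Book2"},
    "D": {"Book2", "Book6"},
    "E": {"Book3", "Book4"},
}

def recommend(active, data):
    mine = data[active]
    scores = {}
    for user, books in data.items():
        if user == active:
            continue
        weight = len(mine & books)            # step 1: similarity to the active user
        for book in books - mine:             # steps 2-3: aggregate unbought items
            scores[book] = scores.get(book, 0) + weight
    return sorted(scores, key=scores.get, reverse=True)

recs = recommend("C", purchases)
assert recs[0] == "Book5"   # the closest neighbour (B, weight 2) drives the top pick
```

Real systems replace the raw overlap with a correlation over ratings, but the shape of the computation is the same.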
    175. 175. Collaborative filtering  Pros:  Extremely powerful and efficient  Very relevant recommendations  (1) The bigger the database and (2) the more past behaviors recorded, the better the recommendations  Cons:  Difficult to implement, resource- and time-consuming  What about a new item that has never been purchased? It cannot be recommended  What about a new customer who has never bought anything? He cannot be compared to other customers, so no items can be recommended
    176. 176. 4. Clustering  Another way to make recommendations based on past purchases of other customers is to cluster customers into categories  Each cluster will be assigned « typical » preferences, based on preferences of customers who belong to the cluster  Customers within each cluster will receive recommendations computed at the cluster level
    177. 177. Clustering  [Table: the same customer × book purchase matrix]  Customers B, C and D are « clustered » together. Customers A and E are clustered into another, separate group  « Typical » preferences for the CLUSTER are:  Book 2, very high  Book 3, high  Books 5 and 6, may be recommended  Books 1 and 4, not recommended at all
    178. 178. Clustering  [Table: the purchase matrix extended with a new customer F]  How does it work?  Any customer classified as a member of the CLUSTER will receive recommendations based on the preferences of the group:  Book 2 will be highly recommended to Customer F  Book 6 will also be recommended to some extent
    179. 179. Clustering  Problem: customers may belong to more than one cluster; clusters may overlap  Predictions are then averaged across the clusters, weighted by participation  [Tables: the purchase matrix shown twice, once per overlapping cluster]
    180. 180. Clustering  Pros:  Clustering techniques work on aggregated data: faster  It can also be applied as a « first step » for shrinking the selection of relevant neighbors in a collaborative filtering algorithm  Cons:  Recommendations (per cluster) are less relevant than collaborative filtering (per individual)
    181. 181. 5. Association rules  Clustering works at a group (cluster) level  Collaborative filtering works at the customer level  Association rules work at the item level
    182. 182. Association rules  Past purchases are transformed into relationships of common purchases  [Tables: the customer × book purchase matrix, and the derived « customers who bought X also bought Y » co-occurrence matrix]
    183. 183. Association rules  These association rules are then used to make recommendations  If a visitor shows some interest in Book 5, he will be recommended Book 3 as well  Recommendations are constrained to some minimum level of confidence  What if recommendations can be made using more than one piece of information?  Recommendations are aggregated  [Table: the « also bought » co-occurrence matrix]
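Deriving the « also bought » co-occurrence counts can be sketched as follows, with invented baskets (real association-rule mining, e.g. Apriori, adds support and confidence thresholds on top of these counts):

```python
# Sketch of building "customers who bought X also bought Y" counts from a
# purchase table. Basket data is hypothetical.
from itertools import combinations
from collections import Counter

baskets = [
    {"Book1", "Book2"},
    {"Book2", "Book3", "Book5"},
    {"Book3", "Book5"},
]

also_bought = Counter()
for basket in baskets:
    for x, y in combinations(sorted(basket), 2):
        also_bought[(x, y)] += 1        # count the pair in both directions,
        also_bought[(y, x)] += 1        # so lookups work either way

# Two customers bought Book 3 and Book 5 together:
assert also_bought[("Book5", "Book3")] == 2
```

Because the counts live at the item level, they can be precomputed offline and looked up in constant time at recommendation time, which is why the slide calls this approach fast to execute.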
    184. 184. Association rules  Pros:  Fast to implement  Fast to execute  Not much storage space required  Not « individual » specific  Very successful in broad applications for large populations, such as shelf layout in retail stores  Cons:  Not suitable if knowledge of preferences changes rapidly  It is tempting not to apply restrictive confidence rules, which may lead to literally stupid recommendations
    185. 185. 6. Information filtering  Association rules compare items based on past purchases  Information filtering compares items based on their content  It is also called « content-based filtering » or « content-based recommendations »  It can exploit syntactical information about objects (features)  But also semantic knowledge of objects (concepts/ontologies)
    186. 186. Information filtering  What is the « content » of an item?  It can be explicit « attributes » or « characteristics » of the item. For example for a film:  Action / adventure  Feature Bruce Willis  Year 1995  It can also be « textual content » (title, description, table of content, etc.)  Several techniques exist to compute the distance between two textual documents
    187. 187. Information filtering  How does it work?  A textual document is scanned and parsed  Word occurrences are counted (words may be stemmed)  Several words or « tokens » are not taken into account: rarely used words or « stop words »  Each document is transformed into a normalized TF-IDF vector of size N (Term Frequency / Inverse Document Frequency)  The distance between any pair of vectors is computed
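The pipeline above (tokenize, count, TF-IDF weight, normalize, compare) can be sketched as follows, using shortened versions of the example titles and omitting stemming and stop-word removal:

```python
# Sketch of TF-IDF vectorization and cosine comparison over three tiny
# "documents" (title fragments invented for illustration).
import math
from collections import Counter

docs = ["building data mining applications",
        "data mining your website",
        "introduction to marketing"]

tokenized = [d.split() for d in docs]
N = len(docs)
# Document frequency: in how many documents does each term appear?
df = Counter(t for doc in tokenized for t in set(doc))

def tfidf(doc):
    tf = Counter(doc)
    vec = {t: tf[t] * math.log(N / df[t]) for t in tf}   # weight by rarity
    norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
    return {t: w / norm for t, w in vec.items()}          # unit-length vector

def cosine(a, b):
    return sum(a[t] * b.get(t, 0.0) for t in a)

vecs = [tfidf(d) for d in tokenized]
# The two data-mining titles are closer to each other than to the third:
assert cosine(vecs[0], vecs[1]) > cosine(vecs[0], vecs[2])
```

Since the vectors are normalized, the cosine reduces to a dot product, and disjoint-vocabulary documents (titles 1 and 3 here) score exactly zero.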
    188. 188. Information filtering  An (unrealistic) example: how to compute recommendations among 8 books based only on their titles?  Books selected:  Building data mining applications for CRM  Accelerating Customer Relationships: Using CRM and Relationship Technologies  Mastering Data Mining: The Art and Science of Customer Relationship Management  Data Mining Your Website  Introduction to marketing  Consumer behavior  Marketing research, a handbook  Customer knowledge management
    189. 189. COUNT  [Table: term × title occurrence counts for the 8 book titles (rows: terms such as “data”, “mining”, “customer”, “crm”, “marketing”; columns: the 8 titles)]
    190. 190. TFIDF Normed Vectors  [Table: normalized TF-IDF vectors for the 8 titles (rows: terms, columns: titles); e.g., for “Building data mining applications for CRM” the weights are 0.502 for “application”, “building” and “for”, 0.344 for “crm”, and 0.251 for “data” and “mining”]
    191. 191. Information filtering  A customer is interested in the following book: « Building data mining applications for CRM »  The system computes distances between this book and the 7 others  The « closest » books are recommended:  #1: Data Mining Your Website  #2: Accelerating Customer Relationships: Using CRM and Relationship Technologies  #3: Mastering Data Mining: The Art and Science of Customer Relationship Management  Not recommended: Introduction to marketing  Not recommended: Consumer behavior  Not recommended: Marketing research, a handbook  Not recommended: Customer knowledge management
    192. 192. Information filtering  Pros:  No need for past purchase history  Not extremely difficult to implement  Cons:  « Static » recommendations  Not efficient if the content is not very informative – e.g., information filtering is more suited to recommending technical books than novels or movies
    193. 193. 7. Classifiers  Classifiers are general computational models  They may take as inputs:  Vectors of item features (action / adventure, Bruce Willis)  Preferences of customers (likes action / adventure)  Relations among items  They may give as outputs:  A classification  A rank  A preference estimate  The classifier can be a neural network, Bayesian network, rule induction model, etc.  The classifier is trained using a training set
    194. 194. Classifiers  Pros:  Versatile  Can be combined with other methods to improve accuracy of recommendations  Cons:  Need a relevant training set
    195. 195. Collaborative Filtering  Maintain a database of many users’ ratings of a variety of items.  For a given user, find other similar users whose ratings strongly correlate with the current user.  Recommend items rated highly by these similar users, but not rated by the current user.  Almost all existing commercial recommenders use this approach (e.g. Amazon).
    196. 196. Collaborative Filtering  [Diagram: a database of many users' ratings for items A–Z; the active user's ratings are correlated against the database to find matching users, from whom recommendations are extracted]
    197. 197. Collaborative Filtering Method  Weight all users with respect to similarity with the active user.  Select a subset of the users (neighbors) to use as predictors.  Normalize ratings and compute a prediction from a weighted combination of the selected neighbors’ ratings.  Present items with highest predicted ratings as recommendations.
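The normalize-and-combine step is commonly a weighted sum of mean-centred neighbour ratings added to the active user's mean (a Resnick-style formula); a sketch with invented numbers:

```python
# Sketch of the prediction step in collaborative filtering: the active user's
# predicted rating for an item is their mean rating plus the similarity-
# weighted, mean-centred ratings of the neighbours who rated that item.

def predict(active_mean, neighbours, item):
    """neighbours: list of (similarity, mean_rating, {item: rating})."""
    num = den = 0.0
    for sim, mean, ratings in neighbours:
        if item in ratings:
            num += sim * (ratings[item] - mean)   # how far above/below their norm
            den += abs(sim)
    return active_mean if den == 0 else active_mean + num / den

neighbours = [
    (0.9, 3.0, {"Book5": 5.0}),   # very similar user rated Book 5 well above average
    (0.2, 4.0, {"Book5": 4.0}),   # weakly similar user was neutral on it
]
p = predict(3.5, neighbours, "Book5")
assert p > 3.5   # the strong neighbour pulls the prediction upward
```

Mean-centring is the normalization the slide mentions: it cancels out the fact that some users rate everything high and others rate everything low.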
    198. 198. Significance Weighting  Important not to trust correlations based on very few co-rated items.  Include significance weights, based on number of co-rated items.  If no items are rated by both users, correlation is not meaningful
    199. 199. Neighbor Selection  For a given active user a, select correlated users to serve as a source of predictions.  The standard approach is to use the n most similar users u, based on similarity weights w_{a,u}  An alternate approach is to include all users whose similarity weight is above a given threshold.