This document provides an overview of strategic scenarios in digital contents. It discusses the evolution from static to dynamic contents, from fixed to mobile, and from local to global. It also covers the rise of Web 2.0, including the growth of user-generated content, tagging, blogs, wikis, podcasts and other social media tools. Finally, it discusses some tools that enable collaboration and information sharing, such as WebEx, and the trend toward mashups that combine multiple web services.
Strategic scenarios in digital content and digital business
1. Strategic Scenarios in
Digital contents
Marco Brambilla et al.
Politecnico di Milano, DEI and MIP
Acer Academy
May 2009
http://home.dei.polimi.it/mbrambil/
2. Agenda overview
Information overload
Evolution of contents
Web 2.0
Web 3.0
Tools and technologies for managing information overload
4. Introduction and motivation
161 exabytes of information was created or replicated
worldwide in 2006
IDC estimates 6X growth by 2010 to 988 exabytes (about a
zettabyte) / year
That's more than in the previous 5,000 years.
– DATA from: Dr. Michael L. Brodie - Chief Scientist Verizon
5. Where does content come from
The largest source of data? USERS
YouTube Videos
1.7 billion served / month
1 million streams / day = 75 billion e-mails
Facebook had [in 2007] …
1.8 billion photos
31 million active users
100,000 new users / day
1,800 applications
MySpace, 185+ million registered users (Apr 2007), has…
Images:
– 1+ billion - Millions uploaded / day- 150,000 requests / sec
Songs:
– 25 million - 250,000 concurrent streams
Videos:
– 60 TB - 60,000 uploaded / day - 15,000 concurrent streams
6. Quality of data
(User Generated) Content is:
25% original; 75% replicated
25% from the workplace; 75% not
95% unstructured and growing
While enterprise data is 10-15% structured and decreasing
Main challenges:
How to make multimedia content available to search engines and
search based applications?
Exploiting multimedia content requires:
– Acquiring it
– (Re) Formatting it
– Indexing it
– Querying it
– Transmitting it
– Browsing it
7. Information overload effects on (our)
way of working
For knowledge workers
• Time is limited
• Processes overlap
• Knowledge is (often) artefact-
dependent
• Tools allow multiplicity of uses
• Need for several tools
• Relations with people take time
• Contexts mix and merge
9. Working with information
Types of information
Usefulness
– Active: ephemeral and working (“hot”)
– Dormant: inactive, potentially useful (“cold”)
– Not useful
– Un-accessed
Ownership: mine or not-mine
Activities
Acquisition of items to form a collection
Organisation of items
Maintenance of the collection (e.g. archiving items into long-
term storage)
Retrieval of items for reuse
Information (and choice) overload... on YouTube
10. Acquisition
Different between tools
Manual (files), uncontrolled (e-mails)
Push vs. pull
Reasons for deciding how to store information
Portability
Number of access points
Preservation of information in its current state
Currency of information
Context
Reminding
Ease of integration into existing structures
Communication and information sharing
Ease of maintenance
11. Organisation
Categorisations are complex
Folders vs. keywords
Trees vs. webs
Change over time
Categorisations are local
If two groups of people construct thesauri in a particular subject
area, the overlap of index terms will only be 60%
Two indexers using the same thesaurus on the same document
use common index terms in only 30% of cases
The output from two experienced database searchers has only
40% overlap
Experts' judgements of relevance concur in only 60% of cases
12. Maintenance
Hardly any
Occasional cleaning
Extensive maintenance is related to major life changes (e.g. new
job)
13. Retrieval
Personal archives instead of corporate systems
Need to start searching
Not invented here: reinventing is more fun than reusing
Asking is more difficult than sharing
Social search: asking others
Estimations of quality and relevance are best made by experts
themselves
It's the fastest and most efficient way
Colleagues can give you feedback and help to sharpen your
questions
Consulting others is fun
While searching systems
Preference for location-based search
Critical reminding function of file placement
Lack of retrieval of archived files
15. Evolution of contents and technologies
I. from static to dynamic
II. from fixed to mobile
III. from big to small
IV. from local to global
V. from vertical to horizontal
VI. from sometimes-on to always-on
VII. from wired to wireless
VIII. from divergence to convergence
16. Content proliferation and classification
Proliferation of
blogs
online video
podcasting,
other social media tools
the definition of what constitutes 'web'/'non-web' content has
become increasingly blurred
21. Social- vs. Group- ware
The basic model of 90's era collaboration (Lotus Notes):
all about the group.
Information was managed in group-based repositories, then passed around for
review, or published to intranet portals via customized apps. Information era
workflows where people are first and foremost occupiers of roles, not individuals,
and the materials being created are more closely aligned with groups than
individuals.
Web 2.0 social tools: MySpace, Facebook, LinkedIn
Social networks -- explicit ones or implicit ones in social media –
are really organized around individuals and their networked self-expression. I am
writing this blog post, and publishing it, personally. It is not the product of some
workgroup. It is not an anonymous chunk of text on a corporate portal. My
Facebook profile pulls traffic from my network of contacts, sources I find
interesting, and the chance presence updates of my friends.
See: http://www.stoweboyd.com/message/2007/01/in_the_time_of_.html
22. Doug Engelbart, 1968
"The grand challenge is
to boost the collective
IQ of organizations
and of society. "
23. Tim O’Reilly, 2006, on Web 2.0
“The central principle
behind the success of the
giants born in the Web 1.0
era who have survived to
lead the Web 2.0 era
appears to be this, that
they have embraced the
power of the web to
harness collective
intelligence”
24. Web 2.0 is about The Social Web
“Web 2.0 Is Much More
About A Change In
People and Society Than
Technology”
-Dion Hinchcliffe,
tech blogger
1 billion people connect to the
Internet
100 million web sites
over a third of adults in US
have contributed content to the
public Internet. - 18% of adults
over 65
25. Tim Berners-Lee
“The Web isn’t about what
you can do with computers.
It’s people and, yes, they are
connected by computers.
But computer science, as
the study of what happens in
a computer, doesn’t tell you
about what happens on the
Web.”
NY Times, Nov 2, 2006
26. But what is “collective intelligence”
in the social web sense?
intelligent collection?
collaborative bookmarking, searching
“database of intentions”
clicking, rating, tagging, buying
what we all know but hadn't got around to saying in public
before
blogs, wikis, discussion lists
“database of intentions” – Tim O’Reilly
28. “Collective Knowledge” Systems
The capacity to provide useful information
based on human contributions
which gets better as more people participate.
typically
mix of structured, machine-readable data and unstructured data
from human input
29. Collective Knowledge is Real
FAQ-o-Sphere - self service Q&A forums
Citizen Journalism – “We the Media”
Product reviews for gadgets and hotels
Collaborative filtering for books and music
Amateur Academia
31. Web 2.0
The phrase "Web 2.0" can refer to one or more of the following:
The transition of web sites from isolated information silos to sources of
content and functionality, thus becoming computing platforms serving
web applications to end-users
A social phenomenon embracing an approach to generating and
distributing Web content itself, characterized by open communication,
decentralization of authority, freedom to share and re-use, and "the
market as a conversation”
Enhanced organization and categorization of content, emphasizing deep
linking
A rise in the economic value of the Web, possibly surpassing the impact
of the dot-com boom of the late 1990s
32. Two main kinds
PEOPLE FOCUS: The first kind of socializing is typified by
"people focus" websites such as Bebo, Facebook, Myspace,
and Xiaonei.
HOBBY FOCUS: The second kind of socializing is typified by
"hobby focus" websites such as Flickr, Kodak Gallery,
and Photobucket.
33. Web 2.0 (see Wesch from YouTube
[LOCAL])
Since social web applications are built to encourage communication
between people, they typically emphasize some combination of the
following social attributes:
Identity: who are you?
Reputation: what do people think you stand for?
Presence: where are you?
Relationships: who are you connected with? who do you trust?
Groups: how do you organize your connections?
Conversations: what do you discuss with others?
Sharing: what content do you make available for others to interact with?
Examples of social applications include Twitter, Facebook, Stumpedia,
and Jaiku.
37. Human Resource Management 2.0
Social networks for the job market
– To find and be found
– To manage your online
reputation
– To research and
reference check
– To hire a superstar
– To use your network to do your job better
– To use your network to get a better job
http://www.linkedin.com/
38. Blog
a user-generated website where entries are
made in journal style and displayed in a
reverse chronological order. The term
"blog" is derived from "Web log." "Blog"
can also be used as a verb, meaning to
maintain or add content to a blog.
39. Wiki
a website that allows the visitors themselves
to easily add, remove, and otherwise edit
and change available content, typically
without the need for registration. This ease
of interaction and operation makes a wiki an
effective tool for mass collaborative
authoring.
41. Wiki vs. Blog
A blog, or web log, shares writing and multimedia content in the form of
“posts” (starting point entries) and “comments” (responses to the posts).
While commenting, and even posting, are open to the members of the
blog or the general public, no one is able to change a comment or post
made by another. The usual format is post-comment-comment-comment,
and so on. For this reason, blogs are often the vehicle of choice to
express individual opinions.
A wiki has a far more open structure and allows others to change what
one person has written. This openness may trump individual opinion
with group consensus.
43. (Social) Tagging
Term – a word or phrase that is recognizable by
people and computers
Document – a thing to be tagged, identifiable by a
URI or a similar naming service
Tagger – someone or thing doing the tagging, such
as the user of an application
Tagged – the assertion by Tagger that Document
should be tagged with Term
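The four-part model above can be sketched as a small data structure. This is only an illustration under assumptions: the class name `Tagging`, the helper `documents_for`, and the sample taggers and URIs are invented for the example and do not come from any particular tagging system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tagging:
    """One 'Tagged' assertion: a tagger says a document carries a term."""
    tagger: str    # who (or what) made the assertion
    document: str  # URI of the thing being tagged
    term: str      # the word or phrase used as the tag


def documents_for(taggings, term):
    """Return the set of document URIs that anyone tagged with `term`."""
    return {t.document for t in taggings if t.term == term}


# Hypothetical sample data: two people tag the same document with one term.
taggings = [
    Tagging("alice", "http://example.org/a", "web2.0"),
    Tagging("bob", "http://example.org/a", "web2.0"),
    Tagging("bob", "http://example.org/b", "wiki"),
]
```

Keeping the tagger inside each record means the same term on the same document by two people stays two separate assertions, which is what later allows counting or filtering by who said it.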
44. Podcast
A podcast is a media file that is
distributed by subscription (paid or
unpaid) over the Internet using
syndication feeds, for playback on mobile
devices and personal computers.
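As a rough sketch of how a client might read such a syndication feed, the snippet below parses an RSS 2.0 document and lists the media files attached to its items through the `enclosure` element. The feed content, the titles, and the function name `list_enclosures` are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical feed: one episode with a media file, one item without.
FEED = """<rss version="2.0"><channel>
  <title>Example podcast</title>
  <item>
    <title>Episode 1</title>
    <enclosure url="http://example.org/ep1.mp3" length="1234" type="audio/mpeg"/>
  </item>
  <item>
    <title>Show notes only</title>
  </item>
</channel></rss>"""


def list_enclosures(feed_xml):
    """Return (title, media URL, MIME type) for each item with an enclosure."""
    root = ET.fromstring(feed_xml)
    episodes = []
    for item in root.iter("item"):
        enclosure = item.find("enclosure")
        if enclosure is None:
            continue  # plain news item, no media attached
        episodes.append(
            (item.findtext("title"), enclosure.get("url"), enclosure.get("type"))
        )
    return episodes
```

A real podcast client would also fetch the feed over HTTP on a schedule and remember which enclosures it has already downloaded; both steps are left out here.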
45. Examples of Podcasts available
iTunes Store
NPR
ArtsEdge
Ed. Podcast
Network
SFMoMA
49. Tools. Example: collaboration and
sharing
Webex
Meeting center
Training center
Acquired by CISCO in 2007
Integrated phone conferencing, VoIP, support for PowerPoint,
Flash, audio, and video;
Meeting recording and playback, One-click meeting access,
scheduling, and IM applications, full compatibility, secure
communications
See http://www.sramanamitra.com/2007/03/15/cisco-acquires-webex-beefs-collaboration/
50. Trends and size
Facebook growth: 700% from 2008 to 2009
Twitter growth: 3,700%
And unique visitors..
51. One big social application? Facebook
connect!
evolution of Facebook Platform enabling
you to integrate Facebook into your own
site.
You can add social context to your site:
Identity. Seamlessly connect the user's
Facebook account with your site
Friends. Bring a user's Facebook friends
into your site.
Social Distribution. Publish information
back into Facebook.
Privacy. Bring dynamic privacy to your
site.
How scalable, reliable, open-minded?
55. SOA vs. Web 2.0
SOA Web 2.0
Planning
Design
Implementation
Monitoring
56. Comparison ...
Web 2.0 | SOA
SaaS = SaaS
Web-based interoperability (REST) | Standard based interoperability (SOAP, WSDL, UDDI)
Application as a platform = Application as a platform
Pushes for unexpected reuse | Allows reuse
RIA | No UI
Participatory architecture | Centralized governance
59. Mid-term: Web as a platform
[Figure: two stacks. The past: applications on a framework, on an operating system, on hardware. The future: applications on frameworks and APIs exposed through RSS, REST, and SOAP, on the Web, on the Internet.]
60. Example: eBay
Services for
shopping
trading
Publishes services
REST interface
SOAP interface
Numbers [1]:
4 billion requests/month
(5.5 mln/h)
25% of the offer only via
Web Service
25000 registered developers
1900 known applications
[1] http://blogs.zdnet.com/ITFacts/?p=10326
61. Example: Amazon
Services for
e-commerce
on-line payment
computing (EC2)
storage (s3)
human computing (MTurk)
Queues (SQS)
Success stories
Ex 1, Jungle Disk: online back-up
service
Ex 2, ABACA: 99%-protection
antispam
67. How to manage complexity?
A few services in a small company versus hundreds of services and processes
in a big organization
[Figure: service graphs growing from a few services in one company to several services across several enterprises, combined through mashups and complex BPM.]
68. The problem is in the semantics!
“The problem is not in the plumbing,
it is in the semantics ”
Verizon Chief Scientist - M. L. Brodie
“Semantic heterogeneity remains the main obstacle to
application integration, an obstacle that Web Services alone will not
solve. Until someone finds a way for applications to understand each
other, the effects of Web Services will remain limited. When you pass
a user's data in a certain format using a Web Service as the interface,
the receiving program still has to know what format the data is in.
You still need to agree on the structure of each business object. So far
nobody has found a workable solution…”
Oracle Chairman and CEO - Larry Ellison
69. Web 3.0
Combining SOA + Social Web + Semantic Web
I.e., Services + Folksonomies + Ontologies (or + Taxonomies)
70. Tim Berners-Lee, 2001
“The Semantic Web is not a
separate Web but an extension
of the current one, in which
information is given well-
defined meaning, better
enabling computers and people
to work in cooperation.”
Scientific American, May 2001
71. Beyond Web 2.0 ...
Web as a world scale platform
Given a BPM:
Find the best set of services?
Find the best datasource?
Manage not heterogeneous data/services?
AT RUNTIME!
[Figure: a business process integrating legacy systems, services, a buyer, and 3rd party shipment through mediators.]
72. SOA + Web 2.0 = ?
[Figure: Web services architecture. A service provider publishes a service description (WSDL) to discovery agencies (UDDI); a service requester discovers it and interacts with the provider over SOAP; WSBPEL sits on top.]
source: http://www.w3.org/TR/2002/WD-ws-arch-20021114/
73. SOA Advantages
Costs of different EAI approaches
Relative costs
Custom Integration
Proprietary EAI solutions
Web Services based EAI solutions
SOA based EAI solutions
Adoption Deployment Maintenance Changes
[source ZapThink http://www.zapthink.com/]
75. … to service extraction …
Rationalization of IT solutions
Factorization and publication of common services
Department 1 Department 2 Department N
[…]
76. … and process composition.
For using internal subprocesses, but also processes of customers or providers.
Client
Department 1
Department 2
Shared services
Outsourced services
Provider
77. “Ontology is overrated.”
“[tags] are a radical break with previous categorization
strategies”
hierarchical, centrally controlled, taxonomic
categorization has serious limitations
e.g., Dewey Decimal System
free-form, massively distributed tagging is resilient
against several of these limitations
http://shirky.com/writings/ontology_overrated.html
78. But...
ontologies aren't taxonomies
they are for sharing, not finding
they enable cross-application aggregation and value-added
services
79. Ontology of Folksonomy
What would it look like to formalize an ontology for
tag data?
Functional Purpose: applications that use tag data from
multiple systems
tag search across multiple sites
collaboratively filtered search
– “find things using tags my buddies say match those tags”
combine tags with structured query
– “find all hotels in Spain tagged with ‘romantic’”
http://tomgruber.org/writing/ontology-of-folksonomy.htm
80. Example: formal match,
semantic mismatch
System A says a tag is a property of a document.
System B says a tag is an assertion by an individual with an
identity.
Does it mean anything to combine the tag data from these two
systems?
“Precision without accuracy”
“Statistical fantasy”
81. Engineering the tag ontology
Working with tag community, identify core and non core
agreements
Use the process of ontology engineering to surface issues that
need clarification
Couple a proposed ontology with reference implementations or
hosted APIs
82. Issues raised by ontological engineering
is term identity invariant over case, whitespace, punctuation?
are documents one-to-one with URI identities?
(are alias URLs possible?)
can tagging be asserted without human taggers?
negation of tag assertions?
tag polarity – “voting” for an assertion
tag spaces – is the scope of tagging data a user community,
application, namespace, or database?
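The first question in the list above (is term identity invariant over case, whitespace, punctuation?) becomes concrete once a normalization rule is written down. The sketch below shows one possible answer; the function names and the specific policy are assumptions, not something the tag community has agreed on.

```python
import string


def normalize_term(term):
    """Fold case, strip ASCII punctuation, and collapse whitespace."""
    lowered = term.lower()
    no_punct = lowered.translate(str.maketrans("", "", string.punctuation))
    return " ".join(no_punct.split())


def same_term(a, b):
    """Two terms are 'the same tag' if they normalize to the same string."""
    return normalize_term(a) == normalize_term(b)
```

Under this policy "Romantic" and "romantic " are the same term, but "San-Francisco" and "san francisco" are not, because deleting the hyphen yields "sanfrancisco". Replacing punctuation with a space instead would flip that result, which is exactly the kind of agreement an ontology has to make explicit.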
83. Pivot Browsing – surfing unstructured
content along structured lines
Structured data provides dimensions of a hypercube
location
author
type
date
quality rating
Travel researchers browse along any dimension.
The key structured data is the destination hierarchy
Contributors place their content into the destination hierarchy, and
the other dimensions are automatic.
84. 5. Tools and technologies
for managing information overload
85. Tools
Information:
The double edged sword
You want good
information, not all
information
Information Retrieval
/search
– Multimedia IR
RSS/Bloglines/Google
Reader
Social bookmarking
87. Data in digital libraries
TEXT: e-book, Word documents, Web pages, PDF, Blog,
etc.
Audio:
Speech (broadcasting, podcasting, recording, etc.)
Music (CD, MP3, etc.)
Pictures: Personal photos, schemes, diagrams, etc.
Video: sequence of images and audio (music and/or
speech)
Challenge: How to make multimedia content available
to search engines and search based applications?
88. Some user challenges…
Precision & contextual relevancy
aware of rights, user and information contexts
personalization and recommendation
Search must support multiple interaction patterns
active searching, monitoring, browsing and "being aware"
Trust and spam
Ubiquity of access
89. MIR Application Areas
Architecture, real estate, and interior design (e.g., searching for ideas)
Broadcast media selection (e.g., radio and TV channel)
Cultural services (history museums, art galleries, etc.)
Digital libraries (e.g., musical dictionary, bio-medical imaging catalogues, film, video and radio archives)
E-Commerce (e.g., personalized advertising, on-line catalogues)
Education (e.g., repositories of multimedia courses)
Home Entertainment (e.g., personal multimedia collections)
Investigation services (e.g., human characteristics recognition, forensics)
Journalism (e.g., searching speeches of a certain politician using his name, his voice or his face)
Multimedia directory services (e.g., yellow pages, Tourist information, GIS)
Multimedia editing (e.g., personalized news service, media authoring)
Remote sensing (e.g., cartography, ecology)
Shopping (e.g., searching for clothes)
Social (e.g., dating services)
Surveillance (e.g., traffic control)
90. MIR: Query Examples
Play a few notes on a keyboard and retrieve a list of
musical pieces similar to the required tune, or images
matching the notes in a certain way, e.g., in terms of
emotions
Draw a few lines on a screen and find a set of images
containing similar graphics, logos, ideograms,...
Define objects, including color patches or textures and
retrieve examples among which you select the interesting
objects to compose your design
On a given set of multimedia objects, describe
movements and relations between objects and so
search for animations fulfilling the described temporal and
spatial relations
Describe actions and get a list of scenarios containing
such actions
Using an excerpt of Pavarotti’s voice, obtaining a list of
Pavarotti’s records, video clips where Pavarotti is singing
and photographic material portraying Pavarotti
91. State-of-the art of MSE
Image search
www.tiltomo.com
www.tineye.com
www.pixsta.com
www.picsearch.com
Video search
www.blinx.com
www.clipta.com
www.yovisto.com
Music search
www.midomi.com
www.audiobaba.com
http://www.bmat.com
Enterprise MIR search
www.autonomy.com
www.pictron.com
www.exalead.com
www.fastsearch.com
92. Metadata?
“Data about other data”
They describe in a structured fashion properties
of the data
– E.g.: owner, creation and modification date,
description, etc.
Some metadata are implicitly available
E.g.: file size, file name, etc.
Others need to be manually provided or
automatically extracted
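As a small illustration of implicitly available metadata, the sketch below reads what the file system already records about a file. The function name `implicit_metadata`, the dictionary keys, and the temporary sample file are assumptions for the example.

```python
import os
import tempfile
from datetime import datetime, timezone


def implicit_metadata(path):
    """Collect metadata the file system already knows about a file."""
    info = os.stat(path)
    return {
        "file_name": os.path.basename(path),
        "file_size": info.st_size,  # in bytes
        "modified": datetime.fromtimestamp(info.st_mtime, tz=timezone.utc),
    }


# Usage: write a small file, read its implicit metadata back, clean up.
with tempfile.NamedTemporaryFile(suffix=".txt", delete=False) as handle:
    handle.write(b"hello")
    sample_path = handle.name
sample = implicit_metadata(sample_path)
os.remove(sample_path)
```

Nothing here says what the file is about; a description, an owner in the editorial sense, or a list of depicted objects would still have to be provided manually or extracted automatically.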
94. Content Process
Content Acquisition, Content Transformation, Content Indexing
95. Content acquisition
In MIR, content is acquired from many sources and
in multiple ways:
By crawling
By user’s contribution
By syndicated contribution from content aggregators
Via broadcast capture (e.g., from air/cable/satellite
broadcast, IPTV, Internet TV multicast, ..)
96. Content acquisition
In text or Web search engines, content is a closed or open
collection of documents
Textual Web content is acquired by crawlers, which exploit link
navigation
In MIR, content is acquired from many sources, in a range of
quality and value:
Web cams, security apps
(Video/Audio) Telephony and teleconferencing
Industrial/Academic/Medical
User Generated Content
Public Access and Government Access
Rushes, Raw Footage
News
Advertising
TV Programming
Feature Films
[Figure: content sources plotted by value against production cost, with the labels WEB CAM/SECURITY, USER GENERATED, ENTERPRISE, BROADCAST TV, and MOTION PICTURES.]
97. Acquisition: (video) metadata sources & formats
A content element may be accompanied by textual
descriptions, which range in quantity and quality, from no
description (e.g., web cam content) to multilingual high
value data (closed captions and production metadata of
motion pictures)
Metadata may reside:
Embedded within content (e.g., closed captions)
In surrounding Web pages or links (e.g., HTML content, link
anchors, etc)
In domain-specific databases (e.g., IMDB for feature films)
In ontologies:
http://www.daml.org/ontologies/keyword.html
[Figure: an asset package in which media streams carry multiplexed metadata, accompanied by external metadata.]
98. Acquisition: (video) representative metadata
standards
Standard | Body
MPEG-7, MPEG-21 | ISO/IEC Int. Electrotechnical Comm., Motion Picture Expert Group
UPnP | Universal Plug and Play forum
MXF, MDD | SMPTE Society of Motion Picture and Television Engineers
AAF | AMWA Advanced Media Workflow Association
TV Anytime | ETSI European Telecommunication Standards Institute
Timed Text | W3C, 3GPP
RSS | Harvard
Podcast | Apple
Media RSS | Yahoo
99. Transformation dimensions: Digital video formats
A digital video is a sequence of frames
The Frame Aspect Ratio (FAR) defines the shape of each
image (width divided by height), with 4:3 and 16:9 being the
currently adopted values
Pixel aspect ratio (PAR) describes how the width of pixels in a
digital image compares to their height (rectangular pixel
formats exist for analog TV compatibility).
Frame rate: number of frames per second (24 and 25 are
common, but also lower and higher values are used)
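The relation between the stored pixel grid, PAR, and FAR is simple arithmetic, sketched below. The 720x576 grid with a 16:15 pixel shape is an illustrative assumption (a common simplification for standard-definition video), not a value taken from these slides.

```python
from fractions import Fraction


def frame_aspect_ratio(width_px, height_px, pixel_aspect_ratio):
    """Displayed shape = shape of the stored pixel grid times shape of one pixel."""
    return Fraction(width_px, height_px) * pixel_aspect_ratio


def frame_count(duration_seconds, frames_per_second):
    """Number of frames in a clip at a constant frame rate."""
    return duration_seconds * frames_per_second


# 720x576 stored pixels, each slightly wider than tall (assumed 16:15).
far_sd = frame_aspect_ratio(720, 576, Fraction(16, 15))
# 1920x1080 stored pixels, square (1:1).
far_hd = frame_aspect_ratio(1920, 1080, Fraction(1, 1))
```

With these inputs the first call gives 4:3 and the second 16:9, and one minute of video at 25 frames per second holds `frame_count(60, 25)` = 1500 frames.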
100. Transformation dimensions: compression
Web media must be compressed, with lossy (but perceptually
acceptable) transformations
In video, compression works in two ways
Intra-Frame: an image is divided in blocks, whose content is
“averaged”
Inter-frame: a frame is represented differentially with respect to
the preceding one, by encoding only the blocks that “have moved”
and their motion vectors
Example (MPEG compression)
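The inter-frame idea can be sketched in a few lines. This is not MPEG: it does no motion estimation and no lossy intra-frame coding, and it only stores the blocks that differ from the previous frame. The frame representation (lists of rows of grey values), the block size, and the function names are assumptions for illustration.

```python
def changed_blocks(previous, current, block=2):
    """Return {(row, col): block_pixels} for blocks that differ between frames."""
    delta = {}
    for r in range(0, len(current), block):
        for c in range(0, len(current[0]), block):
            old = [row[c:c + block] for row in previous[r:r + block]]
            new = [row[c:c + block] for row in current[r:r + block]]
            if new != old:
                delta[(r, c)] = new  # only changed blocks are kept
    return delta


def apply_delta(previous, delta):
    """Rebuild the current frame from the previous frame plus the delta."""
    frame = [row[:] for row in previous]
    for (r, c), pixels in delta.items():
        for dr, row in enumerate(pixels):
            frame[r + dr][c:c + len(row)] = row
    return frame


frame1 = [[0, 0, 0, 0],
          [0, 0, 0, 0],
          [0, 0, 9, 9],
          [0, 0, 9, 9]]
frame2 = [[0, 0, 0, 0],
          [0, 0, 0, 0],
          [0, 0, 9, 9],
          [0, 0, 9, 5]]
delta = changed_blocks(frame1, frame2)
```

Here only one of the four 2x2 blocks changes, so the second frame is described by a quarter of its pixels. A real codec goes further by searching for where a block moved and storing a motion vector instead of the pixels.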
101. Content Transformation: popular compression
standards
Standard | Typical bitrates | Applications
M-JPEG, JPEG2000 | Up to 60 Mbit/sec | Consumer electronics, video editing systems
DVCAM | 25M | Consumer
MPEG-1 | 1.5M | CD-ROM Multimedia
MPEG-2 | 4-20M | Broadcast TV, DVD
MPEG-4, H.264 | 300K-12M | Mobile video, Podcast, IPTV
H.261, H.263 | 64k-1M | Video teleconferencing, telephony
Each standard has profiles, that balance latency, complexity, error resilience
and bandwidth, specifically for a target application (e.g., file-based vs
transport-based fruition)
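To get a feel for what these bitrates mean in storage terms, the sketch below converts a bitrate and a duration into a size. It assumes a constant bitrate, reads "M" as 10^6 bits per second, and ignores audio and container overhead; the function name is invented for the example.

```python
def stream_size_bytes(bitrate_bits_per_second, duration_seconds):
    """Size of a constant-bitrate stream: bits per second times seconds, over 8."""
    return bitrate_bits_per_second * duration_seconds / 8


one_hour = 60 * 60  # seconds

# The two ends of the MPEG-2 range listed above, for one hour of video.
mpeg2_low = stream_size_bytes(4_000_000, one_hour)
mpeg2_high = stream_size_bytes(20_000_000, one_hour)
```

Under these assumptions one hour of MPEG-2 video takes between 1.8 and 9 billion bytes, which is why the choice of profile matters as much as the choice of standard.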
102. Content indexing
In textual search engines, content needs little (lexical) analysis
before indexing
Index elements (words) are part of the content
In MIR, content cannot be indexed directly
Indexable metadata must be created from the input data
– Low level features: concisely describe physical or perceptual properties
of a media element (e.g., feature vectors)
– High level features: domain concepts characterizing the content (e.g.,
extracted objects and their properties, content categorizations, etc)
In continuous media, extracted features must be related to the
media segment that they characterize, both in space and time
Feature extraction may require a change of medium, e.g.,
speech to text transcription
103. Motivations for metadata generation
Computers are not able to catch the
underlying meaning of multimedia
content
A computer is not able to understand that
this picture represents a sunset
Pixels and audio samples do not convey
semantics, just binary
Metadata are used to produce
representations that are manageable
by computers
E.g.: text or numbers
104. How to create multimedia annotations?
Manually
Expensive
– It can take up to 10x the duration of the video
– Problems in scaling to millions of contents
Incomplete or inaccurate
– People might not be able to holistically catch all the
meanings associated with a multimedia object
Difficult
– Some contents are tedious to describe with words
- E.g., a melody without lyrics
Automatically
Good quality
– Some technologies have a ~90% precision
“Low” cost
107. Audio Segmentation
GOAL: split an audio track according to
contained information
Music
Speech
Noise
…
Additional usage
Identification and removal of ads
108. Video Segmentation
Keyframe segmentation:
segment a video track
according to its keyframes
– fixed-length temporal segments
Shot detection:
automated detection of
transitions between shots
– a shot is a series of interrelated
consecutive pictures taken
contiguously by a single camera
and representing a continuous
action in time and space.
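One common way to detect hard cuts between shots is to compare the grey-level histograms of consecutive frames and flag a boundary when they differ by more than a threshold. The sketch below assumes that approach; the slides do not say which technique is meant, and the bin count, the threshold, and the tiny four-pixel frames are illustrative choices.

```python
def histogram(frame, bins=4, max_value=256):
    """Normalized grey-level histogram of a frame given as a flat pixel list."""
    counts = [0] * bins
    for pixel in frame:
        counts[pixel * bins // max_value] += 1
    return [count / len(frame) for count in counts]


def shot_boundaries(frames, threshold=0.5):
    """Indexes i where frame i starts a new shot (histogram jump above threshold)."""
    boundaries = []
    for i in range(1, len(frames)):
        before, after = histogram(frames[i - 1]), histogram(frames[i])
        # Half the L1 distance lies between 0 (identical) and 1 (disjoint).
        distance = sum(abs(a - b) for a, b in zip(before, after)) / 2
        if distance > threshold:
            boundaries.append(i)
    return boundaries


dark = [10, 20, 30, 40]
dark_shifted = [12, 22, 28, 45]   # same shot, slight change
bright = [200, 210, 220, 230]     # a different scene
cuts = shot_boundaries([dark, dark_shifted, bright, bright])
```

Small changes inside a shot leave the histogram almost untouched, while a cut to a different scene moves most of the mass. Gradual transitions such as fades need a different treatment and are not handled here.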
109. Speaker identification
GOAL: identify people participating in a
discussion
ERIC
DAVID
JOHN
Additional usage:
Vocal command execution
110. Word spotting
GOAL: recognize spoken words belonging to a
closed dictionary
Call
Open
Bomb
Additional usage:
Spot blacklist words in spontaneous speech
– E.g.: terrorist, attack,…
dialing (e.g., "Call home”)
call routing (e.g., "I would like to make a collect
call”)
Domotic appliance control
111. Speech to text
GOAL: automatically recognize spoken words
belonging to an open dictionary
Example: quote_detection.avi
CREDITS: Thorsten Hermes@SSMT2006
112. Identification of audio events
GOAL: automatically identify audio events of
interest
E.g.: shouts, gunshots, etc.
Additional usage:
Security applications
Example: sound_events.avi
CREDITS: Thorsten Hermes@SSMT2006
113. Classification of music genre, mood, etc.
GOAL: automatically classify the genre and
mood of a song
Rock, pop, Jazz, Blues, etc.
Happy, aggressive, sad, melancholic,
Rock
Dance!
Additional usage:
Automatic selection of songs for playlist
composition
114. Images: low-level features
GOAL: extract implicit characteristics of a
picture
luminosity
orientations
textures
Color distribution
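Two of these characteristics, luminosity and color distribution, can be computed directly from pixel values. The sketch below is a minimal illustration: the function names, the coarse two-bins-per-channel histogram, and the sample pixels are assumptions, and the luminosity uses the standard Rec. 601 luma weights.

```python
def mean_luminosity(pixels):
    """Average brightness of RGB pixels using Rec. 601 luma weights."""
    total = sum(0.299 * r + 0.587 * g + 0.114 * b for r, g, b in pixels)
    return total / len(pixels)


def color_histogram(pixels, bins_per_channel=2):
    """Count pixels per coarse RGB bucket; the counts form a feature vector.

    bins_per_channel must divide 256 evenly (2, 4, 8, ...).
    """
    step = 256 // bins_per_channel
    counts = [0] * bins_per_channel ** 3
    for r, g, b in pixels:
        index = ((r // step) * bins_per_channel ** 2
                 + (g // step) * bins_per_channel
                 + (b // step))
        counts[index] += 1
    return counts


# Hypothetical four-pixel "image": mostly warm colors, one dark pixel.
sunset = [(250, 120, 30), (240, 100, 20), (30, 20, 60), (255, 200, 90)]
features = color_histogram(sunset)
```

The resulting list of counts is a feature vector: it says nothing about what the picture shows, but two images with similar vectors tend to look alike in color.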
115. Images: Optical character recognition (OCR)
OCR is a technique for
translating images of typed or
handwritten text into symbols
Solved problem for typewritten
text (99% accuracy)
Commercial solutions for
handwritten text (e.g, MS
Tablet PC)
116. Image: face identification and recognition
GOAL: recognize and identify
faces in an image
Usage examples:
People counting
Security applications
Example: face_detection.avi
CREDITS: Thorsten
Hermes@SSMT2006
117. Image: concept detection
Image analysis extracts low level features from raw data
(e.g., color histograms, color correlograms, color
moments, co-occurrence texture matrices, edge
direction histograms, etc..)
Features can be used to build discrete classifiers, which
may associate semantic concepts to images or regions
thereof
The MediaMill semantic search engine defines 491
semantic concepts
http://www.science.uva.nl/research/mediamill/demo
Concepts can be detected also from text (e.g., from
manual or automatic metadata) using NLP techniques
(FAST text search engine recognizes entities like
geographical locations, professions, names of persons,
domain-specific technical concepts, etc)
118. Image: object identification
GOAL: identify objects appearing in a picture
Basket ball, cars, planes, players, etc.
Also by example (unaware of position, scaling, etc)
– objectByExample.mp4
CREDITS: http://www.youtube.com/user/GuoshenYu
119. Video OCR
Video OCR has specific problems, due
to low resolution, small text size, and
interference with background
Detection is normally done on the most
representative image of an entire
shot, rather than frame by frame
Approach: filter for enhancing
resolution + pattern matching for
character identification
Example: Virage ConTEXTract text
extraction and recognition technology
(recognizes text in real time)
120. Multimodal annotation fusion
Media segmentation and concept extraction are
probabilistic processes
The result is characterized by a confidence value
Significance can be enhanced by comparing the
output of distinct techniques applied to the same
or similar problems
Examples:
Media segmentation: shot detection + speaker’s turn
identification
Person recognition: voice identification + face
detection
Concept detection: image based classification (e.g.,
“outdoor” & “water”) + object extraction (e.g., “bird”,
“boat”)
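How the confidence values of distinct techniques are combined is not specified above. One simple option, shown here purely as an assumption, is a noisy-OR rule: treat the detectors as independent and compute the probability that at least one of them is right. The function name and the sample confidences are invented for the example.

```python
def fuse_confidences(confidences):
    """Noisy-OR: chance that at least one independent detector is right."""
    miss = 1.0
    for confidence in confidences:
        miss *= 1.0 - confidence  # probability that every detector so far is wrong
    return 1.0 - miss


# Hypothetical person recognition: voice and face analysis agree on a name.
voice_confidence = 0.7
face_confidence = 0.6
fused = fuse_confidences([voice_confidence, face_confidence])
```

With 0.7 and 0.6 the fused confidence is about 0.88, higher than either input. The independence assumption is strong: detectors that fail on the same footage for the same reason would make this estimate too optimistic.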
122. Content querying
In textual search applications, queries are keywords or
expressions thereof
In MIR, search can take place
By keyword
By (mono-media) example (e.g., query by image, query
by humming, query by song similarity)
By (multi-media) example (e.g., query by video
similarity)
Query by example entails real time content processing
MIR query processing naturally requires the interaction
of multiple search engines (e.g., a text search engine
for textual metadata and a content-based search
engine for feature vectors)
123. Querying: modalities
In MIR applications, search keywords match the manual
or automatic metadata
A complementary approach is to provide an example of
the desired content and look for similar media elements
Similarity is a medium-dependent, domain-dependent,
and subjective criterion
Can be computed on low level features (e.g., image
color histograms, music bpm) or on high level
concepts/categorization (e.g., melancholic images,
party music)
Can be multimodal (e.g., video similarity)
Querying may also consider context information (e.g.,
the user’s geographical position or the access device)
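Query by example over low level features can be sketched as a nearest-neighbour ranking. The snippet below uses cosine similarity between feature vectors, which is one of several possible measures; the media names, the three-dimensional vectors, and the function names are assumptions for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two feature vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def query_by_example(example, collection, top_k=2):
    """Rank media ids by similarity of their feature vector to the example's."""
    scored = [
        (cosine_similarity(example, vector), media_id)
        for media_id, vector in collection.items()
    ]
    scored.sort(reverse=True)
    return [media_id for _, media_id in scored[:top_k]]


# Hypothetical index of precomputed feature vectors.
collection = {
    "sunset.jpg": [0.8, 0.2, 0.0],
    "forest.jpg": [0.1, 0.9, 0.1],
    "beach.jpg": [0.6, 0.2, 0.3],
}
ranked = query_by_example([0.9, 0.1, 0.0], collection)
```

The example's features have to be extracted at query time, which is why query by example entails real time content processing, while the collection's vectors can be computed once at indexing time.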
124. Example query modalities and search types
[Figure: example query inputs (where[contains(“amsterdam”)] and 52.37N 4.89E, topic[contains(“building”)], the keyword “amsterdam”, an image, a song) pass through query analysis and federation to music search, text search, image similarity search, XML search and geo search, which rely on inverted, similarity, semantic and R-tree indexes.]
125. Faceted query
When a media collection is
large and its content
unknown to the user,
exposing part of the
metadata can help
This can be done by
showing a compact
representation of the
categories of content
(facets)
A query can be restricted
by selecting only the
relevant facets
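A small sketch of faceted querying over item metadata. The collection, field names and helper functions are hypothetical; the point is only that facet counts summarize the content and that selecting facet values restricts the result set.

```python
from collections import Counter

# Hypothetical media collection with two metadata fields used as facets.
collection = [
    {"title": "Canal at dusk", "type": "image", "place": "Amsterdam"},
    {"title": "Harbour tour", "type": "video", "place": "Amsterdam"},
    {"title": "Street organ", "type": "audio", "place": "Amsterdam"},
    {"title": "Duomo facade", "type": "image", "place": "Milano"},
]


def facet_counts(items, facet):
    """Compact representation of one facet: value -> number of items."""
    return Counter(item[facet] for item in items)


def restrict(items, **selected):
    """Keep only the items that match every selected facet value."""
    return [
        item for item in items
        if all(item.get(facet) == value for facet, value in selected.items())
    ]


type_facet = facet_counts(collection, "type")
hits = restrict(collection, type="image", place="Amsterdam")
print(type_facet, [h["title"] for h in hits])
```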
126. Querying: by keyword
The keyword may match the manual metadata and/or the
automatic metadata
The match can be multimodal: in the audio, in a visual
concept
128. Content browsing
In textual search engines,
results are ranked linearly,
browsed by navigating
links, and read at a glance
In MIR and similarity-
based search applications,
browsing results must
consider multiple
dimensions
Relevance: where the
result appears in the
sequence of retrieved
media elements
Space: where the search
has matched inside a
spatially organized media
element (e.g., an image)
Time: when a match
occurs in a linear media
element
130. References
MPEG-7:
MPEG-7 Overview
http://www.chiariglione.org/mpeg/standards/mpeg-7/mpeg-7.htm
Prof. Ray Larson & Prof. Marc Davis, UC Berkeley SIMS
http://www.sims.berkeley.edu/academics/courses/is202/f03/
RSS: http://www.rssboard.org/rss-specification
MEDIA RSS: http://search.yahoo.com/mrss
MPEG: http://en.wikipedia.org/wiki/MPEG
Shot detection:
http://en.wikipedia.org/wiki/Shot_boundary_detection
131. References
MediaMill:
http://www.science.uva.nl/research/mediamill
Similarity search
www.midimi.com
www.tiltomo.com
http://tineye.com/
Slides from the course “Archivi Multimediali e Data
Mining”, Politecnico di Torino, Prof. Silvia Chiusano
Slides and videos of the lectures given by Prof. Thorsten
Hermes at the SSMS 2006 summer school
PHAROS: http://www.pharos-audiovisual-search.eu/
133. Acquisition: RSS and Media RSS
RSS (Really Simple Syndication) describes a family of web feed
formats used to publish frequently updated web resources (e.g.,
news)
An RSS feed includes full or summarized text, plus metadata
such as publishing dates and authorship
RSS formats are specified using XML
RSS 2.0 now “frozen”
Media RSS proposed by Yahoo as an RSS module that
supplements the <enclosure> element capabilities of RSS 2.0 to
allow for more robust media syndication.
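A minimal sketch of acquiring items from an RSS 2.0 feed that also carries Media RSS elements. The feed content and URLs are invented for illustration; the namespace URI is the one commonly declared for Media RSS, and only the Python standard library is used.

```python
import xml.etree.ElementTree as ET

# Hypothetical feed: one item with an RSS <enclosure> and a Media RSS element.
FEED = """<rss version="2.0" xmlns:media="http://search.yahoo.com/mrss/">
  <channel>
    <title>Example feed</title>
    <item>
      <title>Sample clip</title>
      <pubDate>Mon, 04 May 2009 10:00:00 GMT</pubDate>
      <enclosure url="http://example.com/clip.mp4" length="1024" type="video/mp4"/>
      <media:content url="http://example.com/clip.mp4" medium="video" duration="42"/>
    </item>
  </channel>
</rss>"""

NS = {"media": "http://search.yahoo.com/mrss/"}

root = ET.fromstring(FEED)
items = []
for item in root.iter("item"):
    content = item.find("media:content", NS)
    items.append({
        "title": item.findtext("title"),
        "published": item.findtext("pubDate"),
        # Media RSS attributes supplement what <enclosure> can express.
        "media_url": content.get("url") if content is not None else None,
        "duration": content.get("duration") if content is not None else None,
    })

print(items)
```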
140. Social bookmarking
Online shared catalogs of annotated bookmarks
Even ad hoc sites are needed to manage the
complexity of the bookmark-sharing task
142. Why Personalization?
Personalization is an attempt to find the most relevant
documents using information about user's goals,
knowledge, preferences, navigation history, etc.
143. Same Query, Different Intent
“Cancer”
Different meanings
“Information about the astronomical/astrological sign of cancer”
“information about cancer treatments”
Different intents
“are there any new tests for cancer?”
“information about cancer treatments”
145. User Profile
A user's profile is a collection of information about
the user of the system.
This information is used to get the user to more
relevant information
146. Core vs. Extended User Profile
Core profile
contains information related to the user search goals and
interests
Extended profile
contains information related to the user as a person in order to
understand or model the use that the person will make of the
information retrieved
147. Who Maintains the Profile?
Profile is provided and maintained by the
user/administrator
Sometimes the only choice
The system constructs and updates the profile (automatic
personalization)
Collaborative - user and system
User creates, system maintains
User can influence and edit
Does it help or not?
148. Adaptive Search
Goals:
Present documents (pages) that are most suitable for the
individual user
Methods:
Employ user profiles representing short-term and/or long-
term interests
Rank and present search results taking both user query and
user profile into account
149. Personalized Search: Benefits
Resolving ambiguity
The profile provides a context to the query in order to reduce
ambiguity.
Example: The profile of interests makes it possible to work out
what a user who asked about “Berkeley” (“Pirates”, “Jaguar”)
really wants
Revealing hidden treasures
The profile makes it possible to surface the most relevant
documents, which could be hidden beyond the top results page
Example: Owner of iPhone searches for Google Android.
Pages referring to both would be most interesting
150. Where to Apply Profiles ?
The user profile can be applied in several ways:
To modify the query itself (pre-processing)
To change the usual way of retrieval
To process results of a query (post-processing)
To present document snippets
Special case: adaptation for meta-search
151. Pre-Process: Query Expansion
User profile is applied to add terms to the query
Popular terms could be added to introduce context
Similar terms could be added to resolve indexer-user
mismatch
Related terms could be added to resolve ambiguity
Works with any IR model or search engine
152. Pre-Process: Relevance Feedback
In this case the profile is used to “move” the query
Imagine that:
the documents,
the query
the user profile
are represented by the same set of weighted index terms
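One way to picture “moving” the query is a linear combination of the query and profile vectors, in the spirit of Rocchio-style relevance feedback. This is a sketch under that assumption; the weights `alpha` and `beta` and the example terms are hypothetical.

```python
def move_query(query, profile, alpha=1.0, beta=0.5):
    """Shift a weighted-term query toward the user profile.

    Both arguments map index terms to weights; alpha and beta control
    how much of the original query and of the profile is kept.
    """
    terms = set(query) | set(profile)
    return {
        term: alpha * query.get(term, 0.0) + beta * profile.get(term, 0.0)
        for term in terms
    }


query = {"jaguar": 1.0}
profile = {"car": 0.8, "engine": 0.6, "jaguar": 0.2}
moved = move_query(query, profile)
print(sorted(moved.items(), key=lambda kv: kv[1], reverse=True))
```

The moved query still ranks the original term highest, but now carries profile terms that push retrieval toward the user's interests.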
153. Post-Processing
The user profile is used to organize the results of the
retrieval process
Present to the user the most interesting documents
Filter out irrelevant documents
Extended profile can be used effectively
In this case the use of the profile adds an extra step to
processing
Similar to classic information filtering problem
Typical way for adaptive Web IR
154. Post-Filter: Annotations
The result could be relevant to the user in several aspects.
Fusing this relevance with query relevance is error prone
and leads to a loss of data
Results are ranked by the query relevance, but annotated
with visual cues reflecting other kinds of relevance
User interests - Syskill and Webert, group interests -
KnowledgeSea
155. Post-Filter: Re-Ranking
Re-ranking is a typical approach for post-filtering
Each document is rated according to its relevance
(similarity) to the user or group profile
This rating is fused with the relevance rating returned by
the search engine
The results are ranked by fused rating
User model: WIFS, group model: I-Spy
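The re-ranking steps above can be sketched as a weighted sum of the two ratings. The linear fusion, the `weight` parameter and the toy profile scorer are assumptions for illustration; real systems may fuse scores differently.

```python
def rerank(results, profile_score, weight=0.5):
    """Fuse search-engine relevance with profile similarity and re-sort.

    results: list of (document, engine_relevance) pairs.
    profile_score: function returning the document's similarity to the profile.
    """
    fused = [
        (doc, (1 - weight) * engine + weight * profile_score(doc))
        for doc, engine in results
    ]
    return sorted(fused, key=lambda pair: pair[1], reverse=True)


# Hypothetical results for the query "Google Android", from an iPhone owner.
results = [
    ("android-review", 0.9),
    ("iphone-vs-android", 0.8),
    ("android-sdk", 0.7),
]


def interest_in_iphone(doc):
    return 1.0 if "iphone" in doc else 0.0


reranked = rerank(results, interest_in_iphone)
print(reranked[0][0])
```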
156. Privacy related problems
Web information retrieval faces a challenge: the data
required to perform evaluations, namely query logs and click-
through data, is not readily available due to valid privacy
concerns.
Researchers can:
Limit to small (and sometimes biased) samples of users,
restricting somewhat the conclusions that can be drawn.
Limit the usage of private data to local computation, exploiting
personal data only when post-processing search results.
Look for publicly available data that can be used to approximate
query logs and click-through data (such as user bookmarks).
157. Tag Data and Personalized Information
Retrieval
Recently it has been shown that the information contained in
social bookmarking (tagging) systems may be useful for
improving Web search.
Using data from the social bookmarking site del.icio.us, it is
possible to demonstrate how one can rate the quality of
personalized retrieval results.
A user's “bookmark history” can be used to improve search
results via personalization.
Analogously to studies involving implicit feedback mechanisms
in IR, which have found that profiles based on the content of
clicked URLs outperform those based on past queries alone,
profiles based on the content of bookmarked URLs are
generally superior to those based on tags alone.
158. Tag Data and Personalized Information
Retrieval
Social bookmarking systems such as del.icio.us and
Bibsonomy are a recent and popular phenomenon.
Users label interesting web pages (or research articles) with
primarily short and unstructured annotations in natural
language called tags.
These sites offer an alternative model for discovering
information online.
Rather than following the traditional model of submitting
queries to a Web search engine, users can browse tags as though
they were directories looking for popular pages that have been
tagged by a number of different users. Since tags are chosen by
users from an unrestricted vocabulary, these systems can be
seen to provide consensus categorizations of interesting
websites.
159. Tag Data and Personalized Information
Retrieval
How can social bookmarking data be used to improve Web
search?
Can tag data be used to approximate actual user queries to a
search engine?
How can personalized IR systems be evaluated using information
contained in social bookmarks (tag data)?
Is there enough information in (i.e. a strong enough
correlation between) the tags/bookmarks in a user's history in
order to build a profile of the user that will be useful for
personalizing search engine results?
160. Models for generating a profile of the
user
We record the (time ordered) stream of webpages that have been
bookmarked by a particular user
The first simple profile involves counting the occurrences of
terms in the tags of any of the known bookmarks.
An obvious problem is that users often have multiple interests
and their many bookmarks cover a range of topics. Thus some
bookmarks may be completely unrelated to the nth bookmark
(and thus the tags being used as the current query).
161. The second source of information in the bookmarks is the
content of the bookmarked pages themselves.
One would expect given the much larger vocabulary of Web
pages compared to tag data, that content may prove more
useful than tags. Indeed content-based profiles are more
useful than query-based ones.
A user spends more time deliberating which pages to
bookmark than deciding which search results to click on.
Since a user will only bookmark sites that they find
particularly useful or interesting, these documents should
contain a lot of useful information about the user and the
content of bookmarked documents is particularly useful for
personalization.
162. The previous profile is somewhat ad hoc in deciding which
documents to include and which not to include.
In theory, we would like to include all documents that the user
has bookmarked, but weight them according to their expected
usefulness for resolving ambiguity in the current query.
Our first attempt to estimate the distance between two
bookmarks is to count the number of common terms in their
respective sets of tags
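A sketch of a tag-based profile in which every bookmark contributes, weighted by the number of tags it shares with the current bookmark. The bookmark history and function names are hypothetical, and using the raw overlap count as the weight is an assumption.

```python
from collections import Counter


def common_tags(tags_a, tags_b):
    """Closeness of two bookmarks: number of tags they have in common."""
    return len(set(tags_a) & set(tags_b))


def weighted_profile(history, current_tags):
    """Term weights accumulated over all earlier bookmarks.

    Each bookmark's tags are added with a weight equal to its overlap
    with the current tags, so unrelated bookmarks contribute nothing.
    """
    profile = Counter()
    for tags in history:
        weight = common_tags(tags, current_tags)
        if weight == 0:
            continue
        for tag in tags:
            profile[tag] += weight
    return profile


# Hypothetical time-ordered tag sets of one user's earlier bookmarks.
history = [
    ["python", "tutorial"],
    ["python", "web", "django"],
    ["recipes", "pasta"],
]
profile = weighted_profile(history, current_tags=["python", "web"])
print(profile.most_common())
```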
163. How do we use these profiles?
In order to incorporate the user profile for personalized
information retrieval, queries are expanded with terms from
the profile, weighting them appropriately.
The number of expansion terms to be added to the query is
limited so as to limit the amount of noise and total length of
the expanded query.
In particular, the K most frequent terms from the profile are
added, and the weights are normalized to account for the
missing terms.
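A minimal sketch of this expansion step. It reads “normalized” as rescaling the weights of the K retained terms so that together they carry a fixed share of the query weight; that reading, the `profile_mass` parameter and the example data are assumptions.

```python
from collections import Counter


def expand_query(query_terms, profile, k=2, profile_mass=0.5):
    """Add the k most frequent profile terms to the query.

    Original terms keep weight 1.0; the added terms share `profile_mass`
    in proportion to their frequency in the profile.
    """
    candidates = [
        (term, count) for term, count in profile.most_common()
        if term not in query_terms
    ][:k]
    total = sum(count for _, count in candidates)
    expanded = {term: 1.0 for term in query_terms}
    if total == 0:
        return expanded
    for term, count in candidates:
        expanded[term] = profile_mass * count / total
    return expanded


profile = Counter({"python": 5, "django": 3, "web": 2, "pasta": 1})
expanded = expand_query(["python"], profile, k=2)
print(expanded)
```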
165. Introduction to Recommender Systems
Systems for recommending items (e.g. books, movies,
CDs, web pages, newsgroup messages) to users based on
examples of their preferences.
Objective:
To propose objects fitting the user needs/wishes
To sell services (site visits) or goods
Many search engines and on-line stores provide
recommendations (e.g. Amazon, CDNow).
Recommenders have been shown to substantially increase
clicks (and sales).
166. Book Recommender
[Figure: rated example books (Red Mars, Foundation, Jurassic Park, Lost World, 2001, 2010, Neuromancer, Difference Engine) are fed to a machine learning component that builds a user profile.]
167. Personalization
Recommenders are instances of personalization software.
Personalization concerns adapting to the individual
needs, interests, and preferences of each user.
Includes:
Recommending
Filtering
Predicting (e.g. form or calendar appt. completion)
From a business perspective, it is viewed as part of
Customer Relationship Management (CRM).
168. Machine Learning and Personalization
Machine Learning can allow learning a user
model or profile of a particular user based on:
Sample interaction
Rated examples
Similar user profiles
This model or profile can then be used to:
Recommend items
Filter information
Predict behavior
169. Types of recommendation systems
1. Search-based recommendations
2. Category-based recommendations
3. Collaborative filtering
4. Clustering
5. Association rules
6. Information filtering
7. Classifiers
170. 1. Search-based recommendations
The visitor only types a search query
« data mining customer »
The system retrieves all the items that
correspond to that query
e.g. 6 books
The system recommends some of these books
based on general, non-personalized ranking
(sales rank, popularity, etc.)
171. Search-based recommendations
Pros:
Simple to implement
Cons:
Not very powerful
Which criteria to use to rank recommendations?
Is it really « recommendations »?
The user only gets what he asked for
172. 2. Category-based recommendations
Each item belongs to one category or more.
Explicit / implicit choice:
The customer selects a category of interest
(refine search, opt in for category-based
recommendations, etc.).
– « Subjects > Computers & Internet > Databases > Data
Storage & Management > Data Mining »
The system selects categories of interest on
behalf of the customer, based on the current item
viewed, past purchases, etc.
Certain items (bestsellers, new items) are
eventually recommended
173. Category-based recommendations
Pros:
Still simple to implement
Cons:
Again: not very powerful, which criteria to use to
order recommendations? is it really
« recommendations »?
Capacity highly depends upon the kind of
categories implemented
– Too specific: not efficient
– Not specific enough: no relevant recommendations
174. 3. Collaborative filtering
Collaborative filtering techniques « compare »
customers, based on their previous purchases,
to make recommendations to « similar »
customers
It’s also called « social » filtering
Follow these steps:
1. Find customers who are similar (« nearest
neighbors ») in terms of tastes, preferences, past
behaviors
2. Aggregate weighted preferences of these
neighbors
3. Make recommendations based on these
aggregated, weighted preferences (most
preferred, unbought items)
175. Collaborative filtering
Example: the system needs to make
recommendations to customer C
Book 1 Book 2 Book 3 Book 4 Book 5 Book 6
Customer A X X
Customer B X X X
Customer C X X
Customer D X X
Customer E X X
Customer B is very close to C (he has bought all
the books C has bought). Book 5 is highly
recommended
Customer D is somewhat close. Book 6 is
recommended to a lower extent
Customers A and E are not similar at all.
Weight=0
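The example can be sketched in a few lines. The purchase sets below are an assumption, chosen to be consistent with the statements on the slide (B contains everything C bought plus Book 5, D overlaps C only partially and adds Book 6, A and E share nothing with C); Jaccard overlap as the similarity weight is also an assumption.

```python
# Hypothetical purchase sets consistent with the slide's description.
purchases = {
    "A": {"Book 1", "Book 4"},
    "B": {"Book 2", "Book 3", "Book 5"},
    "C": {"Book 2", "Book 3"},
    "D": {"Book 2", "Book 6"},
    "E": {"Book 1", "Book 5"},
}


def jaccard(a, b):
    """Similarity of two customers: shared purchases over all purchases."""
    return len(a & b) / len(a | b)


def recommend(target, purchases):
    """Score unbought books by the summed similarity of their buyers."""
    owned = purchases[target]
    scores = {}
    for other, books in purchases.items():
        if other == target:
            continue
        weight = jaccard(owned, books)
        for book in books - owned:
            scores[book] = scores.get(book, 0.0) + weight
    ranked = [(book, s) for book, s in scores.items() if s > 0]
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)


recs = recommend("C", purchases)
print(recs)
```

With these assumed purchases, Book 5 scores highest (through Customer B) and Book 6 scores lower (through Customer D), while customers A and E contribute weight 0.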
176. Collaborative filtering
Pros:
Extremely powerful and efficient
Very relevant recommendations
(1) The bigger the database, (2) the more the past
behaviors, the better the recommendations
Cons:
Difficult to implement, resource and time-consuming
What about a new item that has never been
purchased?
Cannot be recommended
What about a new customer who has never bought
anything? Cannot be compared to other customers,
so no items can be recommended
177. 4. Clustering
Another way to make recommendations based
on past purchases of other customers is to
cluster customers into categories
Each cluster will be assigned « typical »
preferences, based on preferences of customers
who belong to the cluster
Customers within each cluster will receive
recommendations computed at the cluster level
178. Clustering
Book 1 Book 2 Book 3 Book 4 Book 5 Book 6
Customer A X X
Customer B X X X
Customer C X X
Customer D X X
Customer E X X
Customers B, C and D are « clustered »
together. Customers A and E are clustered into
another separate group
« Typical » preferences for CLUSTER are:
Book 2, very high
Book 3, high
Books 5 and 6, may be recommended
Books 1 and 4, not recommended at all
179. Clustering
Book 1 Book 2 Book 3 Book 4 Book 5 Book 6
Customer A X X
Customer B X X X
Customer C X X
Customer D X X
Customer E X X
Customer F X X
How does it work?
Any customer classified as a
member of CLUSTER will receive
recommendations based on preferences of the
group:
Book 2 will be highly recommended to Customer F
Book 6 will also be recommended to some extent
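A sketch of cluster-level recommendation. The cluster membership follows the slides (customers B, C and D), but the individual purchase sets, including Customer F's, are assumptions chosen to agree with the preferences the slides report.

```python
from collections import Counter

# Hypothetical purchases of the customers clustered together.
cluster = {
    "B": {"Book 2", "Book 3", "Book 5"},
    "C": {"Book 2", "Book 3"},
    "D": {"Book 2", "Book 6"},
}


def cluster_preferences(members):
    """Typical preferences: how many cluster members bought each book."""
    prefs = Counter()
    for books in members.values():
        prefs.update(books)
    return prefs


prefs = cluster_preferences(cluster)

# Hypothetical new customer assigned to the cluster.
customer_f = {"Book 3", "Book 5"}
recs_f = [(book, n) for book, n in prefs.most_common() if book not in customer_f]
print(recs_f)
```

Because the preferences are computed once per cluster rather than per customer, the lookup for a new member is cheap, which is the speed advantage clustering trades against relevance.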
180. Clustering
Problem: customers may belong to more than
one cluster; clusters may overlap
Predictions are then averaged across the
clusters, weighted by participation
Book 1 Book 2 Book 3 Book 4 Book 5 Book 6
Customer A X X
Customer B X X X
Customer C X X
Customer D X X
Customer E X X
Customer F X X
181. Clustering
Pros:
Clustering techniques work on aggregated data:
faster
It can also be applied as a « first step » for
shrinking the selection of relevant neighbors in a
collaborative filtering algorithm
Cons:
Recommendations (per cluster) are less relevant
than collaborative filtering (per individual)
182. 5. Association rules
Clustering works at a group (cluster) level
Collaborative filtering works at the customer
level
Association rules work at the item level
183. Association rules
Past purchases are transformed into
relationships of common purchases
Book 1 Book 2 Book 3 Book 4 Book 5 Book 6
Customer A X X
Customer B X X X
Customer C X X
Customer D X X
Customer E X X
Customer F X X
Customers who bought… (rows) also bought… (columns)
        Book 1  Book 2  Book 3  Book 4  Book 5  Book 6
Book 1    -       -       -       1       1       -
Book 2    -       -       2       -       1       1
Book 3    -       2       -       -       2       -
Book 4    1       -       -       -       -       -
Book 5    1       1       2       -       -       -
Book 6    -       1       -       -       -       -
184. Association rules
These association rules are then used to make
recommendations
If a visitor has some interest in Book 5, he will
be recommended to buy Book 3 as well
Recommendations are constrained to some
minimum levels of confidence
What if recommendations can be made using
more than one piece of information?
Recommendations are aggregated
Customers who bought… (rows) also bought… (columns)
        Book 1  Book 2  Book 3  Book 4  Book 5  Book 6
Book 1    -       -       -       1       1       -
Book 2    -       -       2       -       1       1
Book 3    -       2       -       -       2       -
Book 4    1       -       -       -       -       -
Book 5    1       1       2       -       -       -
Book 6    -       1       -       -       -       -
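A minimal sketch of item-level association rules built from co-purchase counts. The baskets are an assumption chosen to be consistent with the slide's co-purchase counts, and the confidence threshold of 0.5 is hypothetical.

```python
from collections import Counter
from itertools import permutations

# Hypothetical baskets, one per customer.
baskets = [
    {"Book 1", "Book 4"},
    {"Book 2", "Book 3", "Book 5"},
    {"Book 2", "Book 3"},
    {"Book 2", "Book 6"},
    {"Book 1", "Book 5"},
    {"Book 3", "Book 5"},
]


def co_purchases(baskets):
    """Count, for each ordered pair (a, b), the customers who bought both."""
    counts = Counter()
    for basket in baskets:
        for a, b in permutations(sorted(basket), 2):
            counts[(a, b)] += 1
    return counts


def recommend(book, baskets, min_confidence=0.5):
    """Rules 'bought book -> also bought other' above a confidence floor."""
    counts = co_purchases(baskets)
    support = sum(1 for basket in baskets if book in basket)
    if support == 0:
        return []
    rules = [
        (other, n / support)
        for (first, other), n in counts.items()
        if first == book and n / support >= min_confidence
    ]
    return sorted(rules, key=lambda rule: rule[1], reverse=True)


counts = co_purchases(baskets)
rules_for_5 = recommend("Book 5", baskets)
print(rules_for_5)
```

Lowering `min_confidence` would also surface Book 1 and Book 2 for a visitor interested in Book 5, each backed by a single customer.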
185. Association rules
Pros:
Fast to implement
Fast to execute
Not much storage space required
Not « individual » specific
Very successful in broad applications for large
populations, such as shelf layout in retail stores
Cons:
Not suitable if preferences change
rapidly
It is tempting not to apply restrictive
confidence rules
May lead to literally stupid recommendations
186. 6. Information filtering
Association rules compare items based on past
purchases
Information filtering compares items based on
their content
Also called « content-based filtering » or
« content-based recommendations »
Can exploit syntactical information on objects
(features)
But also semantic knowledge of objects
(concepts/ontologies)
187. Information filtering
What is the « content » of an item?
It can be explicit « attributes » or
« characteristics » of the item. For example for a
film:
Action / adventure
Feature Bruce Willis
Year 1995
It can also be « textual content » (title,
description, table of contents, etc.)
Several techniques exist to compute the distance
between two textual documents
188. Information filtering
How does it work?
A textual document is scanned and parsed
Word occurrences are counted (may be stemmed)
Several words or « tokens » are not taken into
account: rarely used words and « stop words »
Each document is transformed into a normalized
TF-IDF vector of size N (Term Frequency / Inverse
Document Frequency).
The distance between any pair of vectors is
computed
189. Information filtering
An (unrealistic) example: how to compute
recommendations between 8 books based only on their
title?
Books selected:
Building data mining applications for CRM
Accelerating Customer Relationships: Using CRM and
Relationship Technologies
Mastering Data Mining: The Art and Science of Customer
Relationship Management
Data Mining Your Website
Introduction to marketing
Consumer behavior
marketing research, a handbook
Customer knowledge management
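190. COUNT
Word occurrences per title. Columns:
B1 = Building data mining applications for CRM
B2 = Accelerating Customer Relationships: Using CRM and Relationship Technologies
B3 = Mastering Data Mining: The Art and Science of Customer Relationship Management
B4 = Data Mining Your Website
B5 = Introduction to marketing
B6 = Consumer behavior
B7 = marketing research, a handbook
B8 = Customer knowledge management
term          B1  B2  B3  B4  B5  B6  B7  B8
a              -   -   -   -   -   -   1   -
accelerating   -   1   -   -   -   -   -   -
and            -   1   1   -   -   -   -   -
application    1   -   -   -   -   -   -   -
art            -   -   1   -   -   -   -   -
behavior       -   -   -   -   -   1   -   -
building       1   -   -   -   -   -   -   -
consumer       -   -   -   -   -   1   -   -
crm            1   1   -   -   -   -   -   -
customer       -   1   1   -   -   -   -   1
data           1   -   1   1   -   -   -   -
for            1   -   -   -   -   -   -   -
handbook       -   -   -   -   -   -   1   -
introduction   -   -   -   -   1   -   -   -
knowledge      -   -   -   -   -   -   -   1
management     -   -   1   -   -   -   -   1
marketing      -   -   -   -   1   -   1   -
mastering      -   -   1   -   -   -   -   -
mining         1   -   1   1   -   -   -   -
of             -   -   1   -   -   -   -   -
relationship   -   2   1   -   -   -   -   -
research       -   -   -   -   -   -   1   -
science        -   -   1   -   -   -   -   -
technology     -   1   -   -   -   -   -   -
the            -   -   1   -   -   -   -   -
to             -   -   -   -   1   -   -   -
using          -   1   -   -   -   -   -   -
website        -   -   -   1   -   -   -   -
your           -   -   -   1   -   -   -   -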
191. TFIDF Normed Vectors
Columns:
B1 = Building data mining applications for CRM
B2 = Accelerating Customer Relationships: Using CRM and Relationship Technologies
B3 = Mastering Data Mining: The Art and Science of Customer Relationship Management
B4 = Data Mining Your Website
B5 = Introduction to marketing
B6 = Consumer behavior
B7 = marketing research, a handbook
B8 = Customer knowledge management
term          B1     B2     B3     B4     B5     B6     B7     B8
a             0.000  0.000  0.000  0.000  0.000  0.000  0.537  0.000
accelerating  0.000  0.432  0.000  0.000  0.000  0.000  0.000  0.000
and           0.000  0.296  0.256  0.000  0.000  0.000  0.000  0.000
application   0.502  0.000  0.000  0.000  0.000  0.000  0.000  0.000
art           0.000  0.000  0.374  0.000  0.000  0.000  0.000  0.000
behavior      0.000  0.000  0.000  0.000  0.000  0.707  0.000  0.000
building      0.502  0.000  0.000  0.000  0.000  0.000  0.000  0.000
consumer      0.000  0.000  0.000  0.000  0.000  0.707  0.000  0.000
crm           0.344  0.296  0.000  0.000  0.000  0.000  0.000  0.000
customer      0.000  0.216  0.187  0.000  0.000  0.000  0.000  0.381
data          0.251  0.000  0.187  0.316  0.000  0.000  0.000  0.000
for           0.502  0.000  0.000  0.000  0.000  0.000  0.000  0.000
handbook      0.000  0.000  0.000  0.000  0.000  0.000  0.537  0.000
introduction  0.000  0.000  0.000  0.000  0.636  0.000  0.000  0.000
knowledge     0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.763
management    0.000  0.000  0.256  0.000  0.000  0.000  0.000  0.522
marketing     0.000  0.000  0.000  0.000  0.436  0.000  0.368  0.000
mastering     0.000  0.000  0.374  0.000  0.000  0.000  0.000  0.000
mining        0.251  0.000  0.187  0.316  0.000  0.000  0.000  0.000
of            0.000  0.000  0.374  0.000  0.000  0.000  0.000  0.000
relationship  0.000  0.468  0.256  0.000  0.000  0.000  0.000  0.000
research      0.000  0.000  0.000  0.000  0.000  0.000  0.537  0.000
science       0.000  0.000  0.374  0.000  0.000  0.000  0.000  0.000
technology    0.000  0.432  0.000  0.000  0.000  0.000  0.000  0.000
the           0.000  0.000  0.374  0.000  0.000  0.000  0.000  0.000
to            0.000  0.000  0.000  0.000  0.636  0.000  0.000  0.000
using         0.000  0.432  0.000  0.000  0.000  0.000  0.000  0.000
website       0.000  0.000  0.000  0.632  0.000  0.000  0.000  0.000
your          0.000  0.000  0.000  0.632  0.000  0.000  0.000  0.000
192. Information filtering
A customer is interested in the following book:
« Building data mining applications for CRM »
The system computes distances between this book and the
7 others
The « closest » books are recommended:
#1: Data Mining Your Website
#2: Accelerating Customer Relationships: Using CRM
and Relationship Technologies
#3: Mastering Data Mining: The Art and Science of
Customer Relationship Management
Not recommended: Introduction to marketing
Not recommended: Consumer behavior
Not recommended: marketing research, a handbook
Not recommended: Customer knowledge
management
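The title-based example can be reproduced with a short sketch. The slides do not state the weighting formula; the one used here, log2(1 + tf) · ln((N + 1) / df) followed by length normalization, is an assumption that appears to match the slide's values to three decimals. Titles are stemmed by hand, and cosine similarity stands in for the slide's “distance” (higher means closer).

```python
import math
from collections import Counter

# Titles lower-cased and stemmed by hand to match the slide's terms.
titles = [
    "building data mining application for crm",
    "accelerating customer relationship using crm and relationship technology",
    "mastering data mining the art and science of customer relationship management",
    "data mining your website",
    "introduction to marketing",
    "consumer behavior",
    "marketing research a handbook",
    "customer knowledge management",
]


def tfidf_vectors(docs):
    """Unit-length TF-IDF vectors, one dict of term -> weight per document."""
    n = len(docs)
    counts = [Counter(doc.split()) for doc in docs]
    df = Counter(term for c in counts for term in c)
    vectors = []
    for c in counts:
        # Assumed weighting: dampened term frequency times smoothed IDF.
        raw = {t: math.log2(1 + tf) * math.log((n + 1) / df[t]) for t, tf in c.items()}
        norm = math.sqrt(sum(w * w for w in raw.values()))
        vectors.append({t: w / norm for t, w in raw.items()})
    return vectors


def cosine(u, v):
    """Cosine similarity of two unit-length sparse vectors."""
    return sum(w * v.get(t, 0.0) for t, w in u.items())


vectors = tfidf_vectors(titles)
sims = [(i, cosine(vectors[0], vectors[i])) for i in range(1, len(titles))]
ranked = [i for i, s in sorted(sims, key=lambda p: p[1], reverse=True) if s > 0]
print([titles[i] for i in ranked])
```

Only three titles share any term with the first book, so the remaining four get similarity 0 and are not recommended.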
193. Information filtering
Pros:
No need for past purchase history
Not extremely difficult to implement
Cons:
« Static » recommendations
Not efficient if content is not very informative
e.g. information filtering is more suited to
recommend technical books than novels or movies
194. 7. Classifiers
Classifiers are general computational models
They may take as inputs:
Vector of item features (action / adventure, Bruce
Willis)
Preferences of customers (like action / adventure)
Relations among items
They may give as outputs:
Classification
Rank
Preference estimate
The model can be a neural network, Bayesian network, rule
induction model, etc.
The classifier is trained using a training set
195. Classifiers
Pros:
Versatile
Can be combined with other methods to improve
accuracy of recommendations
Cons:
Need a relevant training set
196. Collaborative Filtering
Maintain a database of many users’ ratings of a variety of
items.
For a given user, find other similar users whose ratings
strongly correlate with the current user.
Recommend items rated highly by these similar users, but
not rated by the current user.
Almost all existing commercial recommenders use this
approach (e.g. Amazon).
197. Collaborative Filtering
[Figure: collaborative filtering pipeline. The active user's ratings (A 9, B 3, Z 5, C unrated) are matched by correlation against the user database; the matching user's ratings (A 10, B 4, C 8, Z 1) supply item C, which is extracted as a recommendation.]
198. Collaborative Filtering Method
Weight all users with respect to similarity with
the active user.
Select a subset of the users (neighbors) to use
as predictors.
Normalize ratings and compute a prediction from
a weighted combination of the selected
neighbors’ ratings.
Present items with highest predicted ratings as
recommendations.
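The method above can be sketched as follows. Pearson correlation over co-rated items as the similarity weight, mean-centering as the normalization, and keeping only positively correlated neighbors are assumptions; the rating data and user names are hypothetical.

```python
import math


def pearson(a, b):
    """Pearson correlation of two users over the items both have rated."""
    common = set(a) & set(b)
    if len(common) < 2:
        return 0.0
    mean_a = sum(a[i] for i in common) / len(common)
    mean_b = sum(b[i] for i in common) / len(common)
    cov = sum((a[i] - mean_a) * (b[i] - mean_b) for i in common)
    var_a = sum((a[i] - mean_a) ** 2 for i in common)
    var_b = sum((b[i] - mean_b) ** 2 for i in common)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)


def predict(active, others, item, n_neighbors=2):
    """Predict the active user's rating of `item` from weighted neighbors."""
    weighted = [(pearson(active, r), r) for r in others.values() if item in r]
    neighbors = sorted(
        (p for p in weighted if p[0] > 0), key=lambda p: p[0], reverse=True
    )[:n_neighbors]
    mean_active = sum(active.values()) / len(active)
    total = sum(w for w, _ in neighbors)
    if total == 0:
        return mean_active
    # Normalize each neighbor's rating by subtracting that user's mean.
    offset = sum(w * (r[item] - sum(r.values()) / len(r)) for w, r in neighbors)
    return mean_active + offset / total


# Hypothetical ratings on a 1-10 scale.
active = {"A": 9, "B": 3, "Z": 5}
others = {
    "u1": {"A": 10, "B": 4, "C": 8, "Z": 1},
    "u2": {"A": 2, "B": 9, "C": 3, "Z": 6},
    "u3": {"A": 8, "B": 2, "C": 9, "Z": 6},
}
prediction = predict(active, others, "C")
print(round(prediction, 2))
```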
199. Significance Weighting
Important not to trust correlations based on very
few co-rated items.
Include significance weights, based on number of
co-rated items.
If no items are rated by both users, correlation is not meaningful
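One common way to apply a significance weight is to scale the correlation down linearly when few items are co-rated. The sketch below assumes that scheme and a cutoff of 50 co-rated items; neither is specified in the slides.

```python
def significance_weight(correlation, n_co_rated, cutoff=50):
    """Devalue correlations that rest on few co-rated items.

    At or above `cutoff` co-rated items the correlation is kept as is;
    below it, the correlation is scaled by n_co_rated / cutoff.
    """
    if n_co_rated == 0:
        return 0.0  # no shared items: the correlation is not meaningful
    return correlation * min(n_co_rated, cutoff) / cutoff


print(significance_weight(0.9, 5), significance_weight(0.9, 80))
```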
200. Neighbor Selection
For a given active user, a, select correlated users
to serve as source of predictions.
Standard approach is to use the n most similar
users, u, based on similarity weights, w_{a,u}
Alternate approach is to include all users whose
similarity weight is above a given threshold.
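Both selection strategies fit in a few lines. The similarity weights and user names below are hypothetical, as are the values of `n` and `threshold`.

```python
def top_n_neighbors(weights, n):
    """The n users with the highest similarity weight to the active user."""
    return sorted(weights, key=weights.get, reverse=True)[:n]


def threshold_neighbors(weights, threshold):
    """All users whose similarity weight reaches the threshold."""
    selected = [user for user, w in weights.items() if w >= threshold]
    return sorted(selected, key=weights.get, reverse=True)


# Hypothetical similarity weights w_{a,u} for one active user a.
weights = {"u1": 0.79, "u2": -0.99, "u3": 0.93, "u4": 0.40}
print(top_n_neighbors(weights, 2), threshold_neighbors(weights, 0.3))
```

A fixed n keeps the cost of prediction predictable, while a threshold lets the neighborhood grow or shrink with how many genuinely similar users exist.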
Editor's Notes
There have been many definitions for IR in the last decades… we just report
User-centric interfaces
Cloud services should be accessed with simple and pervasive methods. In fact, Cloud computing adopts the concept of utility computing: users obtain and employ computing platforms in computing Clouds as easily as they access a traditional public utility. In detail, Cloud services have the following features:
The cloud interfaces do not force users to change their working habits and environments.
The cloud client software that must be installed locally is lightweight.
Cloud interfaces are location independent and can be accessed through well-established interfaces such as the Web services framework and Internet browsers.
Autonomous system
The computing Cloud is an autonomous system and is managed transparently to users. Hardware, software and data inside clouds can be automatically reconfigured, orchestrated and consolidated to present a single platform image, finally rendered to users.
Scalability and flexibility
Scalability and flexibility are the most important features driving the emergence of Cloud computing. Cloud services and computing platforms offered by computing Clouds can be scaled across various concerns, such as geographical locations, hardware performance and software configurations. The computing platform should be flexible enough to adapt to the various requirements of a potentially large number of users.
Software or an application is hosted as a service and provided to customers across the Internet. This mode eliminates the need to install and run the application on the customer’s local computers. SaaS therefore alleviates the customer’s burden of software maintenance, and reduces the expense of software purchases by on-demand pricing.
An early example of SaaS is the Application Service Provider (ASP). The ASP approach provides subscriptions to software that is hosted or delivered over the Internet. Microsoft’s “Software + Service” shows another example: a combination of local software and Internet services interacting with one another. Google’s Chrome browser gives an interesting SaaS scenario: a new desktop could be offered, through which applications can be delivered (either locally or remotely) in addition to the traditional Web browsing experience.
The Google App Engine is an interesting example of PaaS. The Google App Engine enables users to build Web applications with Google’s APIs and SDKs across the same scalable systems that power the Google applications.
ITaaS is a highly disruptive concept for enterprise users, who have less to gain and more to lose by outsourcing IT.
Cloud service providers trying to serve this space must implement enterprise-class capabilities at multiple levels, both in the network and at the end points.
Key business and technical challenges include cost, security, performance, business resiliency, interoperability, and data migration.
Cloud computing is still in early development. Market researchers, financial analysts, and business leaders all want to assess its potential markets and business impact. According to IDC, a market research firm that recently surveyed IT executives, CIOs, and other business leaders, IT spending on cloud services will reach US$42 billion by 2012. However, as with any disruptive technology and transitional business model, there is no definitive assessment of cloud computing’s market opportunity. We believe its long-term business impact could be even larger.