Classification,  Tagging & Search James Melzer August 14, 2007
Where does this fit in my project? Search Sitemap Portals Content Integration & Aggregation  (Yahoo, Lexis/Nexis) Discover Metadata Navigation Filters Classify/ Organize Content Assets Create
Paradigms of Information Organization Classification is the process of organizing a domain of items into a systematic scheme. Items (instances) are classified into categories (classes) by a person or system. All parties share a common scheme for consistency and clarity, which makes it ideal for high-value information. Tagging is the process of assigning a term or phrase to an item. Every person uses their own non-systematic tag scheme, which they generally make up as they go along. Tagging is most effective with personal collections of information. Search is the multi-stage process of reducing a group of unstructured documents into structured data and then matching that data to a human’s query. Search is most effective in huge collections of disorganized information.
Classification
Attributes of Classical Classification Taxonomy Strict concept hierarchy Mutual exclusivity Comprehensiveness Inheritance Thesaurus Has all the attributes of a taxonomy, plus: Synonyms and alternate forms Related terms
Classification’s Strengths Descriptive of a domain Exhaustive within a domain Provides colocation of similar items Unlimited domain scale
Classification’s Weaknesses Expensive to create or maintain Rigid in perspective and application Slow to grow or change Hard to use* (* may require trained experts, leading to scalability issues)
Fancy Classification: Facets A taxonomy usually classifies everything in its domain along a single axis.  A  polyhierarchy  is made up of multiple mutually-exclusive taxonomies. A class from each taxonomy is applied to every item.  Each taxonomy in a polyhierarchy is called a  facet  (like the many sides of a diamond). Automobile example...
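A minimal sketch of the facet idea, assuming a made-up automobile scheme (the facet names, class values, and car identifiers below are illustrative and not taken from the slides). Each facet is its own small taxonomy, every item gets exactly one class per facet, and filtering works by matching on any combination of facets.

```python
# Hypothetical facets for an automobile example; each facet is its own
# small, mutually exclusive taxonomy (names and values are illustrative).
FACETS = {
    "body_style": {"sedan", "coupe", "pickup"},
    "fuel": {"gasoline", "diesel", "hybrid", "electric"},
    "drivetrain": {"fwd", "rwd", "awd"},
}


def classify(item_facets):
    """Check that an item carries one valid class from every facet."""
    missing = set(FACETS) - set(item_facets)
    if missing:
        raise ValueError(f"missing facets: {sorted(missing)}")
    for facet, value in item_facets.items():
        if value not in FACETS.get(facet, ()):
            raise ValueError(f"{value!r} is not a class in facet {facet!r}")
    return dict(item_facets)


def filter_items(items, **wanted):
    """Return the names of items whose facets match every requested value."""
    return [
        name
        for name, facets in items.items()
        if all(facets.get(f) == v for f, v in wanted.items())
    ]


catalog = {
    "car_a": classify({"body_style": "sedan", "fuel": "hybrid", "drivetrain": "fwd"}),
    "car_b": classify({"body_style": "pickup", "fuel": "diesel", "drivetrain": "awd"}),
    "car_c": classify({"body_style": "sedan", "fuel": "electric", "drivetrain": "awd"}),
}

print(filter_items(catalog, body_style="sedan"))
print(filter_items(catalog, body_style="sedan", drivetrain="awd"))
```

Because the facets are independent, a user can start from any side of the "diamond" (fuel first, body style first) and still reach the same item.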
Standard Facets Subject  Asset  Use  Relation
Facet: Subject Extrinsic properties Examples: Subjects discussed  Geographic coverage Companies mentioned Questions Answered:  What is it about? Where does it fit in similar discourse?
Facet: Asset Intrinsic properties Examples: Author Title Language Social Security Number Bar Code/UPC Questions Answered:  What makes it unique? What authority does it have?
Facet: Use Permissions and audience Examples: Intended for children ages 9-12 Intended for professionals with advanced degrees Restricted to senior management and auditors Restricted to subscribers Non-subscribers can access only the abstract Questions Answered:  Who can use it? Who should use it?
Facet: Relation Connections to other objects Examples: People who bought this book also bought these other books If you buy this grill, you may also want to buy these tongs and apron If you are researching electric cars, you may also want to look into hybrid cars This top will go with that skirt really well Questions Answered:  What commonly goes with this? What other objects would help me use this object more effectively?
Classification Applied Where is classification used?   Anywhere people need excellent colocation of similar items, and can afford to ensure it with professional cataloging Research Business Government Libraries
Classification Guidelines Specificity rule Apply the most specific terms when tagging assets. Specific terms can always be generalized, but generic terms cannot be specialized. Repeatable rule All attributes should be repeatable. Use as many terms as necessary to describe What the asset is about and Why it is important. Storage is cheap. Re-creating content is expensive. Appropriateness rule Not all attributes apply to all assets. Only supply values for attributes that make sense. Usability rule Anticipate how the asset will be searched for in the future, and how to make it easy to find it. Remember that search engines can only operate on explicit information.
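The specificity rule can be illustrated with a short sketch, assuming a tiny hypothetical taxonomy stored as child-to-parent links (the terms are invented for illustration). An asset cataloged with a specific term can be found under every broader term by walking up the hierarchy, but an asset cataloged only with a generic term cannot be recovered under a narrower one.

```python
# Hypothetical taxonomy fragment: child term -> parent (broader) term.
PARENT = {
    "hybrid cars": "cars",
    "electric cars": "cars",
    "cars": "vehicles",
    "trucks": "vehicles",
}


def ancestors(term):
    """Walk up the hierarchy, returning broader terms from nearest to root."""
    chain = []
    while term in PARENT:
        term = PARENT[term]
        chain.append(term)
    return chain


def matches(asset_terms, query_term):
    """True if any asset term equals the query term or is narrower than it."""
    return any(
        term == query_term or query_term in ancestors(term)
        for term in asset_terms
    )


print(ancestors("hybrid cars"))
print(matches(["hybrid cars"], "vehicles"))   # specific term generalizes
print(matches(["vehicles"], "hybrid cars"))   # generic term cannot specialize
```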
Tagging
Tagging’s Attributes No controlled vocabulary (It’s NOT classification) Informal Personal, although sometimes social Messy
Tagging Conceptual Model Can’t get enough? More: http://tagschema.com/blogs/tagschema/2005/06/slicing-and-dicing-data-20-part-2.html
Social Tagging Many tagging systems (particularly Yahoo! properties  del.icio.us  and  flickr.com ) allow people to see other people’s tags and items.  The way tags, items and people are shared influences people’s behavior Self-conscious tagging Intentional group tags Copying another’s tags Tag spamming
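One common way to model social tagging, offered here as an assumption rather than as the model shown on the slide, is a flat list of (person, item, tag) records. The names and tags below are invented. Aggregating the records by item shows how shared tags emerge without any controlled vocabulary.

```python
from collections import Counter

# Hypothetical tagging records: (person, item, tag).
taggings = [
    ("ann", "photo1", "sunset"),
    ("ann", "photo1", "beach"),
    ("bob", "photo1", "sunset"),
    ("bob", "photo2", "todo"),
    ("cat", "photo1", "Sunset"),
]


def tags_for_item(records, item):
    """Count how often each tag (case-folded) was applied to an item."""
    return Counter(tag.lower() for _, i, tag in records if i == item)


def items_for_person(records, person):
    """List the items one person has tagged, for re-finding them later."""
    return sorted({i for p, i, _ in records if p == person})


print(tags_for_item(taggings, "photo1"))
print(items_for_person(taggings, "bob"))
```

Note that case-folding is the only normalization applied; "sunset" and "sunsets" would still count as different tags, which is the weak colocation that uncontrolled vocabularies tend to produce.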
Types of Tags * Description Categorization Opinion Action Relation Insider reference Spam * According to Rashmi Sinha, Uzanto
Tagging’s Strengths Easy and cheap Personalized Rapid adaptation Infinite scalability Readily multilingual
Tagging’s Weaknesses Not necessarily exhaustive within domain No inheritance Weak colocation of similar items Variable quality
Tagging Applied Where is tagging used? Wherever amateurs have stuff to organize and not a lot of time on their hands. Serendipitous searching or browsing Re-finding items (refindability) Personal or insider collections Shared resources
Search
Search Basics A search engine mediates between a user’s query and metadata surrogates for documents Documents are reduced to metadata The user’s need is translated into a query Query terms are used to find matching metadata terms Lots and lots of room for error...
Search Process Crawl content for metadata Index document terms into an inverted file; an inverted file is very fast to search Search the index to identify the result set; search the index, not the documents Rank the results for display; ranking is the hardest part
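The index and search steps can be sketched as follows. This is a toy in-memory inverted file, assuming whitespace-and-punctuation tokenization and AND semantics for multi-term queries; the sample documents are invented, and crawling and ranking are left out.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents containing every query term (AND search)."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Hypothetical documents, standing in for crawled content.
docs = {
    1: "Hybrid cars use electric motors.",
    2: "Electric cars need charging stations.",
    3: "Jaguar the cat lives in the jungle.",
}
index = build_index(docs)

print(search(index, "electric cars"))
print(search(index, "jaguar"))
```

The query only ever touches the term-to-documents map, never the document text, which is why lookup stays fast as the collection grows.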
Search Algorithm 1 Term-based Ranking (tf/idf) tf = term frequency: documents that use the query terms most are presumed to be most relevant idf = inverse document frequency: rarer terms are better indicators of relevance Assumption: 1) relevance can be measured with document terms
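A minimal sketch of term-based ranking, assuming one common tf/idf variant (tf as the term's share of the document's words, idf as log(N / df)); production engines use many refinements of this. The sample documents are invented.

```python
import math
from collections import Counter


def tf_idf_rank(docs, query):
    """Rank document ids by the summed tf * idf of the query terms."""
    tokenized = {doc_id: text.lower().split() for doc_id, text in docs.items()}
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term occur?
    df = Counter()
    for terms in tokenized.values():
        df.update(set(terms))

    scores = {}
    for doc_id, terms in tokenized.items():
        if not terms:
            continue
        counts = Counter(terms)
        score = 0.0
        for term in query.lower().split():
            if df[term]:
                tf = counts[term] / len(terms)         # term frequency
                idf = math.log(n_docs / df[term])      # inverse document frequency
                score += tf * idf
        if score > 0:
            scores[doc_id] = score
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical mini-collection.
docs = {
    "a": "jaguar jaguar speed car",
    "b": "jaguar cat jungle",
    "c": "car engine speed car",
}

print(tf_idf_rank(docs, "jaguar car"))
```

With this idf formula, a term that appears in every document scores zero, which matches the slide's point that rare terms are the better indicators of relevance.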
Search Algorithm 2 PageRank (Google) Relevant set is still identified by term matching A  revolution  in ranking:  based on linking between documents Assumptions:  1) important sites link to other important sites  2) if many people link to a site, it is important
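The link-based idea can be sketched with a simplified power iteration. This is a textbook-style approximation, not Google's production algorithm; the damping factor of 0.85, the iteration count, and the sample link graph are all assumptions made for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Approximate link-based importance scores by power iteration.

    links maps each page to the list of pages it links to.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page gets a small baseline share regardless of links.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its rank on, split evenly among its links.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank to all pages.
                share = damping * rank[page] / n
                for p in pages:
                    new_rank[p] += share
        rank = new_rank
    return rank


# Hypothetical site: three pages link to "home"; nothing links to "blog".
links = {
    "home": ["products", "about"],
    "products": ["home"],
    "about": ["home"],
    "blog": ["home", "products"],
}
ranks = pagerank(links)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

Both assumptions from the slide are visible here: "home" scores highest because many pages link to it, and "products" outranks "about" because it receives an extra link.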
Improving Search Best Bets Relevance Feedback Search + Classification
Best Bets A best bet is a manually selected search result, tied to specific query terms or phrases User-driven phrases: select the most-used phrases from search traffic; go for easy wins, because returns diminish sharply Business-driven phrases: select phrases important to the business, such as product names or office locations
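A best-bets layer can be sketched as a lookup table consulted before the organic results are shown. The phrases, URLs, and the `fake_engine` stand-in below are all hypothetical; a real deployment would call an actual search engine in its place.

```python
# Hypothetical best bets: normalized query phrase -> hand-picked result.
BEST_BETS = {
    "office locations": "/about/offices",
    "expense report": "/finance/expenses",
}


def normalize(query):
    """Lowercase the query and collapse runs of whitespace."""
    return " ".join(query.lower().split())


def search_with_best_bets(query, organic_search):
    """Put the best bet (if any) first, followed by the organic results."""
    results = list(organic_search(query))
    bet = BEST_BETS.get(normalize(query))
    if bet is not None:
        results = [bet] + [r for r in results if r != bet]
    return results


def fake_engine(query):
    """Stand-in for a real search engine (illustrative only)."""
    return ["/news/new-office-opens", "/about/offices"]


print(search_with_best_bets("  Office   Locations ", fake_engine))
print(search_with_best_bets("parking", fake_engine))
```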
Relevance Feedback “More like this”; use one document’s metadata as a query to find others Cluster results, so users can filter by a cluster (e.g. Jaguar the car vs. jaguar the cat) Structured search guesses (e.g. is it a zip code? a product name?)
Combining Search and Classification Lead-in synonyms: enter “fridge”, get “refrigerator” instead; best if the collection is well-cataloged Term-expansion synonyms: enter “refrigerator”, get “fridge” too; best if the collection is not well-cataloged Spell check on query phrases Classifying documents with additional metadata (even tagging)
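The two synonym strategies can be contrasted in a short sketch, assuming a small hand-built thesaurus (the table contents and function names are illustrative). Lead-in replaces a variant with the preferred term; term expansion turns each term into the whole group of equivalents so that uncataloged text still matches.

```python
# Hypothetical thesaurus: variant term -> preferred (cataloged) term.
LEAD_IN = {"fridge": "refrigerator", "icebox": "refrigerator"}


def lead_in(query_terms):
    """Replace each variant with its preferred term."""
    return [LEAD_IN.get(term, term) for term in query_terms]


def expansion_table(lead_in_table):
    """Build term -> full synonym group from the lead-in table."""
    groups = {}
    for variant, preferred in lead_in_table.items():
        groups.setdefault(preferred, {preferred}).add(variant)
    table = {}
    for group in groups.values():
        for term in group:
            table[term] = group
    return table


def expand(query_terms, table):
    """Turn each term into a sorted list of equivalents to OR together."""
    return [sorted(table.get(term, {term})) for term in query_terms]


table = expansion_table(LEAD_IN)
print(lead_in(["fridge", "magnets"]))
print(expand(["refrigerator", "magnets"], table))
```

Lead-in keeps the result set tight when catalogers have consistently applied the preferred term, while expansion trades precision for recall when documents use whatever wording their authors chose.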
Wrap Up Classification Tagging Search
Questions? James Melzer Senior Information Architect SRA International, Inc. [email_address] jamesmelzer.com ( jamesmelzer.com/class/slides.pdf ) del.icio.us/jamesmelzer
