Data comes in two forms: structured data and unstructured data.
Structured data has a well-defined form, which makes it easy to store and query (e.g. user ratings, content articles viewed, and items purchased …).
Unstructured data is typically in the form of raw text (e.g. reviews, discussion forum posts, blog entries, and chat sessions …)
Most applications transform unstructured data into structured data
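As a rough illustration of turning unstructured text into structured data, the sketch below converts a raw review into a table of term counts (a simple bag-of-words representation). The function name term_counts and the sample review are assumptions made for this example, not taken from the source.

```python
import re
from collections import Counter

def term_counts(text):
    """Turn raw text into a structured term -> count mapping (bag of words)."""
    # Lowercase the text and keep runs of letters/apostrophes as tokens.
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)

# Hypothetical review text, used only for illustration.
review = "Great phone. Great battery, poor camera."
counts = term_counts(review)

for term, count in counts.items():
    print(term, count)
```

Once text is in this form, it can be stored and queried like any other structured attribute of an item.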
Source: Alag, S. Collective Intelligence in Action. Manning Press (2009)
CI Data Model
Most applications generally consist of users and items. An item is any entity of interest in your application. If your application is a social-networking application, or you’re looking to connect one user with another, then a user is also a type of item.
Source: Alag, S. Collective Intelligence in Action. Manning Press (2009)
[Figure: Users, Metadata, Items]
Non-personalized recommendations are identical for each user. The recommendations are either manually selected (e.g. editor choices) or based on the popularity of items (e.g. average ratings, sales data).
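A minimal sketch of a popularity-based, non-personalized recommender follows. It ranks items by average rating, so every user would see the same list. The rating triples, the function name top_items_by_average, and the min_ratings threshold are hypothetical choices for this example.

```python
from collections import defaultdict

# Hypothetical (user, item, rating) triples; illustrative data only.
ratings = [
    ("ann", "book_a", 5), ("bob", "book_a", 4),
    ("ann", "book_b", 2), ("cat", "book_b", 3),
    ("bob", "book_c", 5),
]

def top_items_by_average(ratings, n=2, min_ratings=1):
    """Return the n items with the highest average rating."""
    totals = defaultdict(lambda: [0.0, 0])  # item -> [sum of ratings, count]
    for _user, item, score in ratings:
        totals[item][0] += score
        totals[item][1] += 1
    averages = {
        item: total / count
        for item, (total, count) in totals.items()
        if count >= min_ratings
    }
    # Sort by descending average; break ties alphabetically by item name.
    return sorted(averages, key=lambda item: (-averages[item], item))[:n]

recommendations = top_items_by_average(ratings)
print(recommendations)
```

Raising min_ratings is one simple way to keep an item with a single high rating from dominating the list.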
The users are categorized based on the attributes of their demographic profiles in order to find users with similar features. The engine then suggests or recommends (explicitly or implicitly) items that are preferred by these similar users.
Collaborative Filtering: Cosine Similarity (an Example)
Step 1: For each user's row of scores, find the square root of the sum of the squared scores.
Step 2: Divide each score in the row by that square root, which normalizes the row to unit length.
Step 3: Calculate the cosine similarity between two users by summing the products of their normalized scores (from Step 2).
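The three steps can be sketched in a few lines of Python. The score table below is made up for illustration, and treating an unrated item as a score of 0 is an assumption of this sketch rather than something the slide specifies.

```python
import math

# Hypothetical user -> item scores; 0 stands for "not rated" (an assumption).
scores = {
    "user_a": [5, 3, 0, 1],
    "user_b": [4, 0, 0, 1],
    "user_c": [0, 1, 5, 4],
}

def normalize(row):
    """Steps 1 and 2: divide each score by the row's root sum of squares."""
    norm = math.sqrt(sum(x * x for x in row))  # Step 1
    if norm == 0:
        return [0.0 for _ in row]  # a user with no scores has no direction
    return [x / norm for x in row]  # Step 2

def cosine_similarity(row_u, row_v):
    """Step 3: sum the products of the two normalized rows."""
    normalized_u, normalized_v = normalize(row_u), normalize(row_v)
    return sum(a * b for a, b in zip(normalized_u, normalized_v))

sim_ab = cosine_similarity(scores["user_a"], scores["user_b"])
sim_ac = cosine_similarity(scores["user_a"], scores["user_c"])
print(round(sim_ab, 3), round(sim_ac, 3))
```

With these made-up scores, user_a comes out closer to user_b than to user_c, because their high scores fall on the same items.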
Collaborative Filtering: User-Based Predictions and Recommendations
Cold Start: What do you do with users who have no or few ratings?
Sparsity: What do you do if there is little overlap in user ratings across users in the data set?
Scale: What if there are millions of users? Does this scale well as the number of comparisons increases?
Real-time: How do you do these calculations in real time?
Collaborative Filtering: Item-Based Example
www.cs.umd.edu/~samir/498/Amazon-Recommendations.pdf
Amazon.com has more than 29 million customers and several million catalog items. Other major retailers have comparably large data sources. While all this data offers opportunity, it’s also a curse, breaking the backs of algorithms designed for data sets three orders of magnitude smaller. Almost all existing algorithms were evaluated over small data sets.
Collaborative Filtering: Group Lens Rating Data Sets for Testing
MovieLens, Wikilens (Beers), Book-Crossing, Jester Joke, HP EachMovie
Collaborative Filtering: Other Applications
Anything that can be represented in matrix form, where each entry n is a number representing a nominal (e.g. 0, 1 for present, absent), ordinal, interval, or ratio value.
Text mining (also known as text data mining or knowledge discovery in textual databases) is the semi-automated process of extracting patterns (useful information and knowledge) from large amounts of unstructured data sources.
Information extraction. Identification of key phrases and relationships within text by looking for predefined sequences in text via pattern matching.
Topic tracking. Based on a user profile and documents that a user views, text mining can predict other documents of interest to the user.
Summarization. Summarizing a document to save time on the part of the reader.
Categorization. Identifying the main themes of a document and then placing the document into a predefined set of categories based on those themes.
Clustering. Grouping similar documents without having a predefined set of categories.
Concept linking. Connects related documents by identifying their shared concepts and, by doing so, helps users find information that they perhaps would not have found using traditional search methods.
Question answering. Finding the best answer to a given question through knowledge-driven pattern matching.
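To make one of the areas above concrete, the sketch below illustrates categorization with a deliberately simple keyword-overlap heuristic: a document goes into the predefined category whose keyword set it shares the most terms with. The categories, keyword sets, and the function name categorize are hypothetical. Real categorizers typically learn such associations from labeled documents instead of using hand-written lists.

```python
import re

# Hypothetical predefined categories and their keyword sets.
CATEGORY_KEYWORDS = {
    "sports": {"game", "team", "score", "season"},
    "finance": {"stock", "market", "bank", "earnings"},
}

def categorize(document):
    """Return the category with the largest keyword overlap, or None."""
    tokens = set(re.findall(r"[a-z]+", document.lower()))
    overlaps = {
        category: len(tokens & keywords)
        for category, keywords in CATEGORY_KEYWORDS.items()
    }
    best = max(overlaps, key=overlaps.get)
    # Refuse to categorize when no keyword matched at all.
    return best if overlaps[best] > 0 else None

print(categorize("The team won the game late in the season."))
```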
CI for Unstructured Contents: Blog Results for K-Means Clustering
CI from Content: Simple Example “We Feel Fine”
Scours the Internet every ten minutes, harvesting human feelings from a large number of blogs (generally identifying and saving between 15,000 and 20,000 feelings per day).
Scans blog posts for sentences containing the phrases "I feel" and "I am feeling", extracts each such sentence, and checks whether it includes one of about 5,000 pre-identified "feelings". If a valid feeling is found, the sentence is said to represent one person who feels that way.
The URL format of many blog posts can be used to extract the username of the post's author, which in turn is used to retrieve the age, gender, country, state, and city of the blog's owner.
Given the country, state, and city, we can then retrieve the local weather conditions for that city at the time the post was written. We extract and save as much of this information as we can, along with the post.
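The sentence-scanning step described above can be approximated with a short sketch. This is not the We Feel Fine implementation: the small KNOWN_FEELINGS set stands in for the list of about 5,000 feelings, the sentence splitting is naive, and all names are assumptions made for illustration.

```python
import re

# Hypothetical stand-in for the ~5,000 pre-identified feelings.
KNOWN_FEELINGS = {"happy", "sad", "fine", "lonely", "excited"}

# Matches the trigger phrases "I feel" and "I am feeling".
FEEL_PHRASE = re.compile(r"\bI (?:feel|am feeling)\b", re.IGNORECASE)

def extract_feelings(post):
    """Return (feeling, sentence) pairs found in a blog post."""
    results = []
    # Naive sentence split: break after ., ! or ? followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", post):
        if not FEEL_PHRASE.search(sentence):
            continue
        words = re.findall(r"[a-z]+", sentence.lower())
        # Keep the first word of the sentence that is a known feeling.
        feeling = next((w for w in words if w in KNOWN_FEELINGS), None)
        if feeling is not None:
            results.append((feeling, sentence.strip()))
    return results

post = "Long day at work. I feel lonely tonight. I am feeling okay about tomorrow!"
found = extract_feelings(post)
print(found)
```

The third sentence contains a trigger phrase but no word from the feelings set, so it is skipped, which mirrors the "valid feeling" check described above.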
CI from Content: Simple Example “We Feel Fine” Visualizations
Madness, Murmurs, Montage, Mounds, Metrics, Mobs
CI from Content: 9/11 Pager Data
2001-09-11 08:52:46 Skytel  B ALPHA Netdesk@nbc.com||Reports of a plane crash near World Trade Center - no more details at this point. WNBC's LIVE pix - Network working on coverage.
The systematic use of personal data systems in the investigation or monitoring of the actions or communications of one or more persons.
The terms personal surveillance and mass surveillance are commonly used, but seldom defined.
Personal surveillance is the surveillance of an identified person. In general, a specific reason exists for the investigation or monitoring.
Mass surveillance is the surveillance of groups of people, usually large groups. In general, the reason for investigation or monitoring is to identify individuals who belong to some particular class of interest to the surveillance organization.
Re-identification deals with the linkage of datasets without explicit identifiers such as name and address.
Examples of Re-identification
A large portion of the US population can be re-identified using a combination of 5-digit ZIP code, gender, and date of birth.
AOL case: user No. 4417749 (2006 release of 20 million search queries from over 650,000 users).
CMU study of predicting SSNs: it is possible to guess many, if not all, of the nine digits in an individual's Social Security number using publicly available information (about location and birth date).
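The ZIP code, gender, and date-of-birth example above can be illustrated with a small sketch that measures how many records in a dataset are unique on a chosen set of attributes. A record that is unique on those attributes could be linked to another dataset that holds the same attributes next to a name. The records and the function name unique_fraction are fabricated for illustration.

```python
from collections import Counter

# Fabricated records with no explicit identifiers (no name or address).
records = [
    {"zip": "20742", "gender": "F", "dob": "1980-03-14"},
    {"zip": "20742", "gender": "F", "dob": "1980-03-14"},
    {"zip": "20742", "gender": "M", "dob": "1975-11-02"},
    {"zip": "15213", "gender": "F", "dob": "1990-07-30"},
]

def unique_fraction(records, keys=("zip", "gender", "dob")):
    """Fraction of records whose combination of `keys` appears exactly once."""
    combos = Counter(tuple(r[k] for k in keys) for r in records)
    unique = sum(1 for r in records if combos[tuple(r[k] for k in keys)] == 1)
    return unique / len(records)

fraction = unique_fraction(records)
print(fraction)
```

Calling unique_fraction with fewer keys (for example, keys=("gender",)) shows how much of the identifying power comes from combining attributes.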