Data Mining and Open APIs: Presentation Transcript

  • Data Mining and Open APIs Toby Segaran
  • About Me Software Developer at Genstruct Work directly with scientists Design algorithms to aid in drug testing “Programming Collective Intelligence” Published by O’Reilly Due out in August Consult with open-source projects and other companies http://kiwitobes.com
  • Presentation Goals Look at some Open APIs Get some data Visualize algorithms for data-mining Work through some Python code Variety of techniques and sources Advocacy (why you should care)
  • Open data APIs Zillow Yahoo Answers eBay Amazon Facebook Technorati del.icio.us Twitter HotOrNot Google News Upcoming programmableweb.com/apis for more…
  • Open API uses Mashups Integration Automation Command-line tools Most importantly, creating datasets!
  • What is data mining? From a large dataset find the: Implicit Unknown Useful Data could be: Tabular, e.g. Price lists Free text Pictures
  • Why it’s important now More devices produce more data People share more data The internet is vast Products are more customized Advertising is targeted Human cognition is limited
  • Traditional Applications Computational Biology Financial Markets Retail Markets Fraud Detection Surveillance Supply Chain Optimization National Security
  • Traditional = Inaccessible Real applications are esoteric Tutorial examples are trivial Generally lacking in “interest value”
  • Fun, Accessible Applications Home price modeling Where are the hottest people? Which bloggers are similar? Important attributes on eBay Predicting fashion trends Movie popularity
  • Zillow
  • The Zillow API Allows querying by address Returns information about the property Bedrooms Bathrooms Zip Code Price Estimate Last Sale Price Requires registration key http://www.zillow.com/howto/api/PropertyDetailsAPIOverview.htm
  • The Zillow API REST Request: http://www.zillow.com/webservice/GetDeepSearchResults.htm?zws-id=key&address=address&citystatezip=citystatezip
  • The Zillow API
    <SearchResults:searchresults xmlns:SearchResults="http://www.zillow.com/vstatic/3/static/xsd/SearchResults.xsd">
      …
      <response>
        <results>
          <result>
            <zpid>48749425</zpid>
            <links> … </links>
            <address>
              <street>2114 Bigelow Ave N</street>
              <zipcode>98109</zipcode>
              <city>SEATTLE</city>
              <state>WA</state>
              <latitude>47.637934</latitude>
              <longitude>-122.347936</longitude>
            </address>
            <yearBuilt>1924</yearBuilt>
            <lotSizeSqFt>4680</lotSizeSqFt>
            <finishedSqFt>3290</finishedSqFt>
            <bathrooms>2.75</bathrooms>
            <bedrooms>4</bedrooms>
            <lastSoldDate>06/18/2002</lastSoldDate>
            <lastSoldPrice currency="USD">770000</lastSoldPrice>
            <valuation>
              <amount currency="USD">1091061</amount>
            </valuation>
          </result>
        </results>
      </response>
  • Zillow from Python

    import urllib2
    import xml.dom.minidom

    def getaddressdata(address,city):
      escad=address.replace(' ','+')
      # Construct the URL (zwskey is the registered Zillow Web Services key)
      url='http://www.zillow.com/webservice/GetDeepSearchResults.htm?'
      url+='zws-id=%s&address=%s&citystatezip=%s' % (zwskey,escad,city)
      # Parse resulting XML
      doc=xml.dom.minidom.parseString(urllib2.urlopen(url).read())
      code=doc.getElementsByTagName('code')[0].firstChild.data
      # Code 0 means success, otherwise there was an error
      if code!='0': return None
      # Extract the info about this property
      try:
        zipcode=doc.getElementsByTagName('zipcode')[0].firstChild.data
        use=doc.getElementsByTagName('useCode')[0].firstChild.data
        year=doc.getElementsByTagName('yearBuilt')[0].firstChild.data
        bath=doc.getElementsByTagName('bathrooms')[0].firstChild.data
        bed=doc.getElementsByTagName('bedrooms')[0].firstChild.data
        rooms=doc.getElementsByTagName('totalRooms')[0].firstChild.data
        price=doc.getElementsByTagName('amount')[0].firstChild.data
      except:
        return None
      return (zipcode,use,int(year),float(bath),int(bed),int(rooms),price)
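A hypothetical call, with a placeholder key (zwskey must hold the key obtained by registering with Zillow; the address is the one from the sample response above):

    zwskey='your-zws-id-here'  # placeholder: the key from Zillow API registration
    print getaddressdata('2114 Bigelow Ave N','Seattle, WA')
    # Prints a tuple of (zipcode, useCode, yearBuilt, bathrooms, bedrooms, totalRooms, amount),
    # or None if the call failed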
  • A home price dataset
    House  Zip    Bathrooms  Bedrooms  Built  Type     Price
    A      02138  1.5        2         1847   Single   505296
    B      02139  3.5        9         1916   Triplex  776378
    C      02140  3.5        4         1894   Duplex   595027
    D      02139  2.5        4         1854   Duplex   552213
    E      02138  3.5        5         1909   Duplex   947528
    F      02138  3.5        4         1930   Single   2107871
    etc..
  • What can we learn? A made-up house's price How important is Zip Code? What are the important attributes? Can we do better than averages?
  • Introducing Regression Trees
    A   B       Value
    10  Circle  20
    11  Square  22
    22  Square  8
    18  Circle  6
  • Minimizing deviation
    Standard deviation is the "spread" of results
    Try all possible divisions
    Choose the division that decreases deviation the most

    A   B       Value
    10  Circle  20
    11  Square  22
    22  Square  8
    18  Circle  6

    Initially: Average = 14, Standard Deviation = 8.2
  • Minimizing deviation (split on B)
    B = Circle: Average = 13, Standard Deviation = 9.9
    B = Square: Average = 15, Standard Deviation = 9.9
  • Minimizing deviation (split on A at 18)
    A > 18:  Average = 8,  Standard Deviation = 0
    A <= 18: Average = 16, Standard Deviation = 8.7
  • Minimizing deviation (split on A at 11)
    A > 11:  Average = 7,  Standard Deviation = 1.4
    A <= 11: Average = 21, Standard Deviation = 1.4
  • Python Code

    def variance(rows):
      if len(rows)==0: return 0
      data=[float(row[len(row)-1]) for row in rows]
      mean=sum(data)/len(data)
      variance=sum([(d-mean)**2 for d in data])/len(data)
      return variance

    def divideset(rows,column,value):
      # Make a function that tells us if a row is in
      # the first group (true) or the second group (false)
      split_function=None
      if isinstance(value,int) or isinstance(value,float):
        split_function=lambda row:row[column]>=value
      else:
        split_function=lambda row:row[column]==value
      # Divide the rows into two sets and return them
      set1=[row for row in rows if split_function(row)]
      set2=[row for row in rows if not split_function(row)]
      return (set1,set2)
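Applied to the small A/B/Value table above, a split on column A at 18 (note that divideset uses >= for numeric values) behaves roughly like this; the rows-as-lists representation is an assumption:

    rows=[[10,'Circle',20],
          [11,'Square',22],
          [22,'Square',8],
          [18,'Circle',6]]
    set1,set2=divideset(rows,0,18)
    # set1: rows with A >= 18 -> [[22,'Square',8],[18,'Circle',6]]
    # set2: the rest          -> [[10,'Circle',20],[11,'Square',22]]
    print variance(set1),variance(set2)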
  • CART Algorithm
    A   B       Value
    10  Circle  20
    11  Square  22
    22  Square  8
    18  Circle  6
  • CART Algorithm
    After the first split (A > 11):

    A   B       Value        A   B       Value
    22  Square  8            10  Circle  20
    18  Circle  6            11  Square  22
  • CART Algorithm
  • Python Code

    def buildtree(rows,scoref=variance):
      if len(rows)==0: return decisionnode()
      current_score=scoref(rows)
      # Set up some variables to track the best criteria
      best_gain=0.0
      best_criteria=None
      best_sets=None
      column_count=len(rows[0])-1
      for col in range(0,column_count):
        # Generate the list of different values in this column
        column_values={}
        for row in rows:
          column_values[row[col]]=1
        # Now try dividing the rows up for each value in this column
        for value in column_values.keys():
          (set1,set2)=divideset(rows,col,value)
          # Information gain
          p=float(len(set1))/len(rows)
          gain=current_score-p*scoref(set1)-(1-p)*scoref(set2)
          if gain>best_gain and len(set1)>0 and len(set2)>0:
            best_gain=gain
            best_criteria=(col,value)
            best_sets=(set1,set2)
      # Create the sub branches
      if best_gain>0:
        trueBranch=buildtree(best_sets[0])
        falseBranch=buildtree(best_sets[1])
        return decisionnode(col=best_criteria[0],value=best_criteria[1],
                            tb=trueBranch,fb=falseBranch)
      else:
        return decisionnode(results=uniquecounts(rows))
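buildtree refers to a decisionnode class and a uniquecounts helper that don't appear on these slides; a minimal sketch of what they could look like (the field names are an assumption):

    class decisionnode:
      def __init__(self,col=-1,value=None,results=None,tb=None,fb=None):
        self.col=col          # column index of the criterion being tested
        self.value=value      # value the column must match (or exceed) for the true branch
        self.results=results  # dict of outcomes for a leaf node; None for interior nodes
        self.tb=tb            # true branch
        self.fb=fb            # false branch

    def uniquecounts(rows):
      # Count how often each result (last column) appears in a set of rows
      results={}
      for row in rows:
        r=row[len(row)-1]
        results.setdefault(r,0)
        results[r]+=1
      return results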
  • Zillow Results
    Nodes of the resulting regression tree (diagram not reproduced):
    Bathrooms > 3, Zip: 02139?, After 1903?, Zip: 02140?, Bedrooms > 4?, Duplex?, Triplex?
  • Just for Fun… Hot or Not
  • Supervised and Unsupervised Regression trees are supervised “answers” are in the dataset Tree models predict answers Some methods are unsupervised There are no answers Methods just characterize the data Show interesting patterns
  • Next challenge - Bloggers Millions of blogs online Usually focus on a subject area Can they be characterized automatically? … using only the words in the posts?
  • The Technorati Top 100
  • A single blog
  • Getting the content Use Mark Pilgrim’s Universal Feed Reader Retrieve the post titles and text Split up the words Count occurrence of each word
  • Python Code

    import feedparser
    import re

    # Returns title and dictionary of word counts for an RSS feed
    def getwordcounts(url):
      # Parse the feed
      d=feedparser.parse(url)
      wc={}
      # Loop over all the entries
      for e in d.entries:
        if 'summary' in e: summary=e.summary
        else: summary=e.description
        # Extract a list of words
        words=getwords(e.title+' '+summary)
        for word in words:
          wc.setdefault(word,0)
          wc[word]+=1
      return d.feed.title,wc

    def getwords(html):
      # Remove all the HTML tags
      txt=re.compile(r'<[^>]+>').sub('',html)
      # Split words by all non-alpha characters
      words=re.compile(r'[^A-Z^a-z]+').split(txt)
      # Convert to lowercase
      return [word.lower() for word in words if word!='']
  • Building a Word Matrix Build a matrix of word counts Blogs are rows, words are columns Eliminate words that are: Too common Too rare
  • Python Code

    apcount={}
    wordcounts={}
    # Read the feed URLs up front so len(feedlist) can be used below
    feedlist=[line for line in file('feedlist.txt')]
    for feedurl in feedlist:
      title,wc=getwordcounts(feedurl)
      wordcounts[title]=wc
      for word,count in wc.items():
        apcount.setdefault(word,0)
        if count>1:
          apcount[word]+=1

    wordlist=[]
    for w,bc in apcount.items():
      frac=float(bc)/len(feedlist)
      if frac>0.1 and frac<0.5: wordlist.append(w)

    # Write a tab-separated matrix: one row per blog, one column per word
    out=file('blogdata.txt','w')
    out.write('Blog')
    for word in wordlist: out.write('\t%s' % word)
    out.write('\n')
    for blog,wc in wordcounts.items():
      out.write(blog)
      for word in wordlist:
        if word in wc: out.write('\t%d' % wc[word])
        else: out.write('\t0')
      out.write('\n')
  • The Word Matrix
                       “china”  “kids”  “music”  “yahoo”
    Gothamist          0        3       3        0
    GigaOM             6        0       1        2
    Quick Online Tips  0        2       2        12
  • Determining distance
                       “china”  “kids”  “music”  “yahoo”
    Gothamist          0        3       3        0
    GigaOM             6        0       1        2
    Quick Online Tips  0        2       2        12

    Euclidean, "as the crow flies" (GigaOM vs. Quick Online Tips):
    sqrt((6-0)^2 + (0-2)^2 + (1-2)^2 + (2-12)^2) = 12 (approx)
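The same calculation as a small Python helper; it isn't shown on the slides, but a function like it is needed as the distance measure in the later getdistances code:

    from math import sqrt

    def euclidean(v1,v2):
      # Straight-line distance between two equal-length numeric vectors
      return sqrt(sum([(v1[i]-v2[i])**2 for i in range(len(v1))]))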
  • Other Distance Metrics Manhattan Tanimoto Pearson Correlation Chebychev Spearman
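The clustering code that follows defaults to distance=pearson, which is not defined on the slides. A common choice is 1 minus the Pearson correlation, so strongly correlated vectors get a small distance; treat the exact formula below as an assumption:

    from math import sqrt

    def pearson(v1,v2):
      n=float(len(v1))
      # Simple sums and sums of squares
      sum1,sum2=sum(v1),sum(v2)
      sum1Sq=sum([v**2 for v in v1])
      sum2Sq=sum([v**2 for v in v2])
      # Sum of the products
      pSum=sum([v1[i]*v2[i] for i in range(len(v1))])
      # Pearson correlation coefficient
      num=pSum-(sum1*sum2/n)
      den=sqrt((sum1Sq-sum1**2/n)*(sum2Sq-sum2**2/n))
      if den==0: return 1.0
      # Smaller return value means "more similar"
      return 1.0-num/den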
  • Hierarchical Clustering Find the two closest items Combine them into a single item Repeat…
  • Hierarchical Algorithm
  • Dendrogram
  • Python Code

    class bicluster:
      def __init__(self,vec,left=None,right=None,distance=0.0,id=None):
        self.left=left
        self.right=right
        self.vec=vec
        self.id=id
        self.distance=distance
  • Python Code

    def hcluster(rows,distance=pearson):
      distances={}
      currentclustid=-1
      # Clusters are initially just the rows
      clust=[bicluster(rows[i],id=i) for i in range(len(rows))]

      while len(clust)>1:
        lowestpair=(0,1)
        closest=distance(clust[0].vec,clust[1].vec)
        # Loop through every pair looking for the smallest distance
        for i in range(len(clust)):
          for j in range(i+1,len(clust)):
            # distances is the cache of distance calculations
            if (clust[i].id,clust[j].id) not in distances:
              distances[(clust[i].id,clust[j].id)]=distance(clust[i].vec,clust[j].vec)
            d=distances[(clust[i].id,clust[j].id)]
            if d<closest:
              closest=d
              lowestpair=(i,j)
        # Calculate the average of the two clusters
        mergevec=[(clust[lowestpair[0]].vec[i]+clust[lowestpair[1]].vec[i])/2.0
                  for i in range(len(clust[0].vec))]
        # Create the new cluster
        newcluster=bicluster(mergevec,left=clust[lowestpair[0]],
                             right=clust[lowestpair[1]],
                             distance=closest,id=currentclustid)
        # Cluster ids that weren't in the original set are negative
        currentclustid-=1
        del clust[lowestpair[1]]
        del clust[lowestpair[0]]
        clust.append(newcluster)

      return clust[0]
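hcluster returns a single root bicluster; a plain-text way to walk the tree it builds (a sketch for inspection only, not the dendrogram-drawing code behind the figures that follow):

    def printclust(clust,labels=None,n=0):
      # Indent to make a hierarchy layout
      print ' '*n,
      if clust.id<0:
        # Negative id means this is a merged branch
        print '-'
      else:
        # Positive id means this is an original row
        if labels==None: print clust.id
        else: print labels[clust.id]
      # Recurse into the two merged branches
      if clust.left!=None: printclust(clust.left,labels=labels,n=n+1)
      if clust.right!=None: printclust(clust.right,labels=labels,n=n+1)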
  • Hierarchical Blog Clusters
  • Rotating the Matrix
    Words in a blog -> blogs containing each word
           Gothamist  GigaOM  Quick Onl
    china  0          6       0
    kids   3          0       2
    music  3          1       2
    Yahoo  0          2       12
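Transposing the word matrix so that each word becomes a row only takes a few lines; a sketch:

    def rotatematrix(data):
      newdata=[]
      for i in range(len(data[0])):
        # Column i of the old matrix becomes row i of the new one
        newrow=[data[j][i] for j in range(len(data))]
        newdata.append(newrow)
      return newdata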
  • Hierarchical Word Clusters
  • K-Means Clustering Divides data into distinct clusters User determines how many Algorithm Start with arbitrary centroids Assign points to centroids Move the centroids Repeat
  • K-Means Algorithm
  • Python Code

    import random

    def kcluster(rows,distance=pearson,k=4):
      # Determine the minimum and maximum values for each point
      ranges=[(min([row[i] for row in rows]),max([row[i] for row in rows]))
              for i in range(len(rows[0]))]
      # Create k randomly placed centroids
      clusters=[[random.random()*(ranges[i][1]-ranges[i][0])+ranges[i][0]
                 for i in range(len(rows[0]))] for j in range(k)]

      lastmatches=None
      for t in range(100):
        print 'Iteration %d' % t
        bestmatches=[[] for i in range(k)]
        # Find which centroid is the closest for each row
        for j in range(len(rows)):
          row=rows[j]
          bestmatch=0
          for i in range(k):
            d=distance(clusters[i],row)
            if d<distance(clusters[bestmatch],row): bestmatch=i
          bestmatches[bestmatch].append(j)
        # If the results are the same as last time, this is complete
        if bestmatches==lastmatches: break
        lastmatches=bestmatches
        # Move the centroids to the average of their members
        for i in range(k):
          avgs=[0.0]*len(rows[0])
          if len(bestmatches[i])>0:
            for rowid in bestmatches[i]:
              for m in range(len(rows[rowid])):
                avgs[m]+=rows[rowid][m]
            for j in range(len(avgs)):
              avgs[j]/=len(bestmatches[i])
            clusters[i]=avgs

      return bestmatches
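The results on the next slide index into rownames and pass numeric rows into kcluster, so the blogdata.txt file written earlier has to be read back in. A loader along these lines is assumed (it is not shown on the slides):

    def readfile(filename):
      lines=[line for line in file(filename)]
      # First line holds the column titles (the words)
      colnames=lines[0].strip().split('\t')[1:]
      rownames=[]
      data=[]
      for line in lines[1:]:
        p=line.strip().split('\t')
        # First column of each row is the blog name; the rest are counts
        rownames.append(p[0])
        data.append([float(x) for x in p[1:]])
      return rownames,colnames,data

    rownames,colnames,data=readfile('blogdata.txt')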
  • K-Means Results
    >> [rownames[r] for r in k[0]]
    ['The Viral Garden', 'Copyblogger', 'Creating Passionate Users', 'Oilman',
     'ProBlogger Blog Tips', "Seth's Blog"]
    >> [rownames[r] for r in k[1]]
    ['Wonkette', 'Gawker', 'Gothamist', 'Huffington Post']
  • 2D Visualizations Instead of Clusters, a 2D Map Goals Preserve distances as much as possible Draw in two dimensions Dimension Reduction Principal Components Analysis Multidimensional Scaling
  • Multidimensional Scaling
  • Python Code

    import random
    from math import sqrt

    def scaledown(data,distance=pearson,rate=0.01):
      n=len(data)
      # The real distances between every pair of items
      realdist=[[distance(data[i],data[j]) for j in range(n)]
                for i in range(0,n)]
      outersum=0.0
      # Randomly initialize the starting points of the locations in 2D
      loc=[[random.random(),random.random()] for i in range(n)]
      fakedist=[[0.0 for j in range(n)] for i in range(n)]

      lasterror=None
      for m in range(0,1000):
        # Find projected distances
        for i in range(n):
          for j in range(n):
            fakedist[i][j]=sqrt(sum([pow(loc[i][x]-loc[j][x],2)
                                     for x in range(len(loc[i]))]))
        # Move points
        grad=[[0.0,0.0] for i in range(n)]
        totalerror=0
        for k in range(n):
          for j in range(n):
            if j==k: continue
            # The error is percent difference between the distances
            errorterm=(fakedist[j][k]-realdist[j][k])/realdist[j][k]
            # Each point needs to be moved away from or towards the other
            # point in proportion to how much error it has
            grad[k][0]+=((loc[k][0]-loc[j][0])/fakedist[j][k])*errorterm
            grad[k][1]+=((loc[k][1]-loc[j][1])/fakedist[j][k])*errorterm
            # Keep track of the total error
            totalerror+=abs(errorterm)
        print totalerror
        # If the answer got worse by moving the points, we are done
        if lasterror and lasterror<totalerror: break
        lasterror=totalerror
        # Move each of the points by the learning rate times the gradient
        for k in range(n):
          loc[k][0]-=rate*grad[k][0]
          loc[k][1]-=rate*grad[k][1]

      return loc
  • Numerical Predictions Back to “supervised” learning We have a set of numerical attributes Specs for a laptop Age and rating for wine Ratios for a stock Want to predict another attribute Formula/model is unknown e.g. price
  • Regression Trees? Regression trees find hard boundaries Can’t deal with complex formulae
  • Statistical regression Requires specification of a model Usually linear Doesn’t handle context
  • Alternative - Interpolation Find “similar” items Guess price based on similar items Need to determine: What is similar? How should we aggregate prices?
  • Price Data from EBay
  • The eBay API XML API Send XML over HTTPS Receive results in XML http://developer.ebay.com/quickstartguide.
  • Some Python Code

    import httplib

    # devKey, appKey, certKey, serverUrl come from the eBay developer registration
    def getHeaders(apicall,siteID="0",compatabilityLevel="433"):
      headers = {"X-EBAY-API-COMPATIBILITY-LEVEL": compatabilityLevel,
                 "X-EBAY-API-DEV-NAME": devKey,
                 "X-EBAY-API-APP-NAME": appKey,
                 "X-EBAY-API-CERT-NAME": certKey,
                 "X-EBAY-API-CALL-NAME": apicall,
                 "X-EBAY-API-SITEID": siteID,
                 "Content-Type": "text/xml"}
      return headers

    def sendRequest(apicall,xmlparameters):
      connection = httplib.HTTPSConnection(serverUrl)
      connection.request("POST", '/ws/api.dll', xmlparameters, getHeaders(apicall))
      response = connection.getresponse()
      if response.status != 200:
        print "Error sending request:" + response.reason
        data = None
      else:
        data = response.read()
        connection.close()
      return data
  • Some Python Code

    from xml.dom.minidom import parseString

    def getItem(itemID):
      xml = ("<?xml version='1.0' encoding='utf-8'?>"
             "<GetItemRequest xmlns=\"urn:ebay:apis:eBLBaseComponents\">"
             "<RequesterCredentials><eBayAuthToken>" + userToken + "</eBayAuthToken></RequesterCredentials>"
             "<ItemID>" + str(itemID) + "</ItemID>"
             "<DetailLevel>ItemReturnAttributes</DetailLevel>"
             "</GetItemRequest>")
      data=sendRequest('GetItem',xml)
      result={}
      response=parseString(data)
      result['title']=getSingleValue(response,'Title')
      sellingStatusNode = response.getElementsByTagName('SellingStatus')[0]
      result['price']=getSingleValue(sellingStatusNode,'CurrentPrice')
      result['bids']=getSingleValue(sellingStatusNode,'BidCount')
      seller = response.getElementsByTagName('Seller')
      result['feedback'] = getSingleValue(seller[0],'FeedbackScore')
      attributeSet=response.getElementsByTagName('Attribute')
      attributes={}
      for att in attributeSet:
        attID=att.attributes.getNamedItem('attributeID').nodeValue
        attValue=getSingleValue(att,'ValueLiteral')
        attributes[attID]=attValue
      result['attributes']=attributes
      return result
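getItem and sendRequest also rely on a getSingleValue helper (and on devKey, appKey, certKey, serverUrl, and userToken being set from an eBay developer registration). The helper isn't shown on the slides; presumably it does something like this sketch:

    def getSingleValue(node,tag):
      # Return the text of the first child element with the given tag
      nl=node.getElementsByTagName(tag)
      if len(nl)>0:
        tagNode=nl[0]
        if tagNode.hasChildNodes():
          return tagNode.firstChild.nodeValue
      return '-1'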
  • Building an item table
               RAM   CPU   HDD  Screen  DVD  Price
    D600       512   1400  40   14      1    $350
    Lenovo     160   300   5    13      0    $80
    T22        256   900   20   14      1    $200
    Pavillion  1024  1600  120  17      1    $800
    etc..
  • Distance between items
         RAM  CPU   HDD  Screen  DVD  Price
    New  512  1400  40   14      1    ???
    T22  256  900   20   14      1    $200

    Euclidean, just like in clustering:
    sqrt((512-256)^2 + (1400-900)^2 + (40-20)^2 + (14-14)^2 + (1-1)^2)
  • Idea 1 – use the closest item With the item whose price I want to guess: Calculate the distance for every item in my dataset Guess that the price is the same as the closest This is called kNN with k=1
  • Problems with “outliers” The closest item may be anomalous Why? Exceptional deal that won’t occur again Something missing from the dataset Data errors
  • Using an average
           RAM   CPU   HDD  Screen  DVD  Price
    New    512   1400  40   14      1    ???
    No. 1  512   1400  30   13      1    $360
    No. 2  512   1400  60   14      1    $400
    No. 3  1024  1600  120  15      0    $325

    k=3, estimate = $361
  • Using a weighted average
           RAM   CPU   HDD  Screen  DVD  Price  Weight
    New    512   1400  40   14      1    ???
    No. 1  512   1400  30   13      1    $360   3
    No. 2  512   1400  60   14      1    $400   2
    No. 3  1024  1600  120  15      0    $325   1

    Estimate = $367
  • Python code

    def getdistances(data,vec1):
      distancelist=[]
      for i in range(len(data)):
        vec2=data[i]['input']
        distancelist.append((euclidean(vec1,vec2),i))
      distancelist.sort()
      return distancelist

    def weightedknn(data,vec1,k=5,weightf=gaussian):
      # Get distances
      dlist=getdistances(data,vec1)
      avg=0.0
      totalweight=0.0
      # Get weighted average
      for i in range(k):
        dist=dlist[i][0]
        idx=dlist[i][1]
        weight=weightf(dist)
        avg+=weight*data[idx]['result']
        totalweight+=weight
      avg=avg/totalweight
      return avg
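weightedknn defaults to weightf=gaussian, which is not defined on the slides; below is a typical Gaussian weighting function (the sigma value is an assumption) and, for comparison, the plain unweighted estimate described earlier:

    import math

    def gaussian(dist,sigma=10.0):
      # Weight falls off smoothly with distance and never quite reaches zero
      return math.e**(-dist**2/(2*sigma**2))

    def knnestimate(data,vec1,k=5):
      # Plain average of the k closest results
      dlist=getdistances(data,vec1)
      avg=0.0
      for i in range(k):
        idx=dlist[i][1]
        avg+=data[idx]['result']
      return avg/k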
  • Too few – k too low
  • Too many – k too high
  • Determining the best k Divide the dataset up Training set Test set Guess the prices for the test set using the training set See how good the guesses are for different values of k Known as “cross-validation”
  • Determining the best k
    Test set:
    Attribute  Price
    10         20

    Training set:
    Attribute  Price
    11         30
    8          10
    6          0

    For k = 1, guess = 30, error = 10
    For k = 2, guess = 20, error = 0
    For k = 3, guess = 13, error = 7
    Repeat with different test sets, average the error
  • Python code

    from random import random

    def dividedata(data,test=0.05):
      trainset=[]
      testset=[]
      for row in data:
        if random()<test:
          testset.append(row)
        else:
          trainset.append(row)
      return trainset,testset

    def testalgorithm(algf,trainset,testset):
      error=0.0
      for row in testset:
        guess=algf(trainset,row['input'])
        error+=(row['result']-guess)**2
      return error/len(testset)

    def crossvalidate(algf,data,trials=100,test=0.05):
      error=0.0
      for i in range(trials):
        trainset,testset=dividedata(data,test)
        error+=testalgorithm(algf,trainset,testset)
      return error/trials
  • Problems with scale
  • Scaling the data
  • Scaling to zero
  • Determining the best scale Try different weights Use the “cross-validation” method Different ways of choosing a scale: Range-scaling Intuitive guessing Optimization
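One way to apply a per-column scale before running kNN, sketched below; the weights themselves come from range-scaling, guessing, or an optimizer as listed above, and crossvalidate can score each candidate scale:

    def rescale(data,scale):
      # Multiply every input column by its weight; a weight of 0 effectively drops the column
      scaleddata=[]
      for row in data:
        scaled=[scale[i]*row['input'][i] for i in range(len(scale))]
        scaleddata.append({'input':scaled,'result':row['result']})
      return scaleddata

    # e.g. crossvalidate(knnestimate,rescale(data,[1.0,0.01,2.0,5.0,10.0]))  (weights are illustrative)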
  • Methods covered Regression trees Hierarchical clustering k-means clustering Multidimensional scaling Weighted k-nearest neighbors
  • New projects Openads An open-source ad server Users can share impression/click data Matrix of what hits based on Page Text Ad Ad placement Search query Can we improve targeting?
  • New Projects Finance Analysts already drowning in info Stories sometimes broken on blogs Message boards show sentiment Extremely low signal-to-noise ratio
  • New Projects Entertainment How much buzz is a movie generating? What psychographic profiles like this type of movie? Of interest to studios and media investors