Data Mining and Open APIs


  1. 1. Data Mining and Open APIs Toby Segaran
  2. 2. About Me Software Developer at Genstruct Work directly with scientists Design algorithms to aid in drug testing “Programming Collective Intelligence” Published by O’Reilly Due out in August Consult with open-source projects and other companies http://kiwitobes.com
  3. 3. Presentation Goals Look at some Open APIs Get some data Visualize algorithms for data-mining Work through some Python code Variety of techniques and sources Advocacy (why you should care)
  4. 4. Open data APIs Zillow Yahoo Answers eBay Amazon Facebook Technorati del.icio.us Twitter HotOrNot Google News Upcoming programmableweb.com/apis for more…
  5. 5. Open API uses Mashups Integration Automation Command-line tools Most importantly, creating datasets!
  6. 6. What is data mining? From a large dataset find the: Implicit Unknown Useful Data could be: Tabular, e.g. Price lists Free text Pictures
  7. 7. Why it’s important now More devices produce more data People share more data The internet is vast Products are more customized Advertising is targeted Human cognition is limited
  8. 8. Traditional Applications Computational Biology Financial Markets Retail Markets Fraud Detection Surveillance Supply Chain Optimization National Security
  9. 9. Traditional = Inaccessible Real applications are esoteric Tutorial examples are trivial Generally lacking in “interest value”
  10. 10. Fun, Accessible Applications Home price modeling Where are the hottest people? Which bloggers are similar? Important attributes on eBay Predicting fashion trends Movie popularity
  11. 11. Zillow
  12. 12. The Zillow API Allows querying by address Returns information about the property Bedrooms Bathrooms Zip Code Price Estimate Last Sale Price Requires registration key http://www.zillow.com/howto/api/PropertyDetailsAPIOverview.htm
  13. 13. The Zillow API REST Request http://www.zillow.com/webservice/GetDeepSearchResults.htm?zws-id=key&address=address&citystatezip=citystatezip
  14. 14. The Zillow API <SearchResults:searchresults xmlns:SearchResults="http://www.zillow.com/vstatic/3/static/xsd/SearchResults.xsd"> … <response> <results> <result> <zpid>48749425</zpid> <links> … </links> <address> <street>2114 Bigelow Ave N</street> <zipcode>98109</zipcode> <city>SEATTLE</city> <state>WA</state> <latitude>47.637934</latitude> <longitude>-122.347936</longitude> </address> <yearBuilt>1924</yearBuilt> <lotSizeSqFt>4680</lotSizeSqFt> <finishedSqFt>3290</finishedSqFt> <bathrooms>2.75</bathrooms> <bedrooms>4</bedrooms> <lastSoldDate>06/18/2002</lastSoldDate> <lastSoldPrice currency="USD">770000</lastSoldPrice> <valuation> <amount currency="USD">1091061</amount> </valuation> </result> </results> </response>
  15. 15. The Zillow API (the same XML response as the previous slide, with the address and property-detail fields called out)
  16. 16. Zillow from Python def getaddressdata(address,city): escad=address.replace(' ','+') # Construct the URL url='http://www.zillow.com/webservice/GetDeepSearchResults.htm?' url+='zws-id=%s&address=%s&citystatezip=%s' % (zwskey,escad,city) # Parse resulting XML doc=xml.dom.minidom.parseString(urllib2.urlopen(url).read()) code=doc.getElementsByTagName('code')[0].firstChild.data # Code 0 means success, otherwise there was an error if code!='0': return None # Extract the info about this property try: zipcode=doc.getElementsByTagName('zipcode')[0].firstChild.data use=doc.getElementsByTagName('useCode')[0].firstChild.data year=doc.getElementsByTagName('yearBuilt')[0].firstChild.data bath=doc.getElementsByTagName('bathrooms')[0].firstChild.data bed=doc.getElementsByTagName('bedrooms')[0].firstChild.data rooms=doc.getElementsByTagName('totalRooms')[0].firstChild.data price=doc.getElementsByTagName('amount')[0].firstChild.data except: return None return (zipcode,use,int(year),float(bath),int(bed),int(rooms),price)
  17. 17. Zillow from Python (same function as the previous slide, with the URL-construction lines highlighted)
  18. 18. Zillow from Python (same function, with the XML parsing and status-code check highlighted)
  19. 19. Zillow from Python (same function, with the extraction of the property fields highlighted)
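The getaddressdata function on the slides above assumes a couple of module-level pieces that never appear in the deck: the imports and the Zillow key. A minimal setup might look like this (the key value is a placeholder for whatever you receive on registration, and the sample address and zip come from the XML a few slides back):

import urllib2
import xml.dom.minidom

zwskey = 'your-zws-id-here'  # placeholder: the key Zillow issues when you register

# citystatezip can simply be a zip code
print getaddressdata('2114 Bigelow Ave N', '98109')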
  20. 20. A home price dataset House Zip Bathrooms Bedrooms Built Type Price A 02138 1.5 2 1847 Single 505296 B 02139 3.5 9 1916 Triplex 776378 C 02140 3.5 4 1894 Duplex 595027 D 02139 2.5 4 1854 Duplex 552213 E 02138 3.5 5 1909 Duplex 947528 F 02138 3.5 4 1930 Single 2107871 etc..
  21. 21. What can we learn? A made-up house's price How important is Zip Code? What are the important attributes? Can we do better than averages?
  22. 22. Introducing Regression Trees A B Value 10 Circle 20 11 Square 22 22 Square 8 18 Circle 6
  23. 23. Introducing Regression Trees A B Value 10 Circle 20 11 Square 22 22 Square 8 18 Circle 6
  24. 24. Minimizing deviation Standard deviation is the “spread” of results Try all possible divisions Choose the division that decreases deviation the most A B Value Initially 10 Circle 20 Average = 14 Standard Deviation = 8.2 11 Square 22 22 Square 8 18 Circle 6
  25. 25. Minimizing deviation Standard deviation is the “spread” of results Try all possible divisions Choose the division that decreases deviation the most A B Value B = Circle 10 Circle 20 Average = 13 Standard Deviation = 9.9 11 Square 22 22 Square 8 B = Square 18 Circle 6 Average = 15 Standard Deviation = 9.9
  26. 26. Minimizing deviation Standard deviation is the “spread” of results Try all possible divisions Choose the division that decreases deviation the most A B Value A > 18 10 Circle 20 Average = 8 Standard Deviation = 0 11 Square 22 22 Square 8 A <= 18 18 Circle 6 Average = 16 Standard Deviation = 8.7
  27. 27. Minimizing deviation Standard deviation is the “spread” of results Try all possible divisions Choose the division that decreases deviation the most A B Value A > 11 10 Circle 20 Average = 7 Standard Deviation = 1.4 11 Square 22 22 Square 8 A <= 11 18 Circle 6 Average = 21 Standard Deviation = 1.4
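The figures on the last few slides are easy to sanity-check with a short script. This is only a verification sketch: it uses the four example rows, the same strict “>” divisions shown on the slides, and the sample standard deviation, which is what the slide numbers appear to quote.

from math import sqrt

rows = [(10, 'Circle', 20),
        (11, 'Square', 22),
        (22, 'Square', 8),
        (18, 'Circle', 6)]

def stdev(values):
    # Sample standard deviation of a group's values
    if len(values) < 2: return 0.0
    mean = sum(values) / float(len(values))
    return sqrt(sum((v - mean) ** 2 for v in values) / (len(values) - 1))

def report(label, group):
    vals = [r[-1] for r in group]
    print '%-9s avg=%.0f stdev=%.1f' % (label, sum(vals) / float(len(vals)), stdev(vals))

report('all rows', rows)                                   # avg=14 stdev=8.2
report('B=Circle', [r for r in rows if r[1] == 'Circle'])  # avg=13 stdev=9.9
report('B=Square', [r for r in rows if r[1] == 'Square'])  # avg=15 stdev=9.9
report('A>18',     [r for r in rows if r[0] > 18])         # avg=8  stdev=0.0
report('A<=18',    [r for r in rows if r[0] <= 18])        # avg=16 stdev=8.7
report('A>11',     [r for r in rows if r[0] > 11])         # avg=7  stdev=1.4
report('A<=11',    [r for r in rows if r[0] <= 11])        # avg=21 stdev=1.4

The A > 11 division (values 20 and 22 on one side, 8 and 6 on the other) reduces the spread the most, which is why it is the split the tree makes first for this toy table.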
  28. 28. Python Code def variance(rows): if len(rows)==0: return 0 data=[float(row[len(row)-1]) for row in rows] mean=sum(data)/len(data) variance=sum([(d-mean)**2 for d in data])/len(data) return variance def divideset(rows,column,value): # Make a function that tells us if a row is in # the first group (true) or the second group (false) split_function=None if isinstance(value,int) or isinstance(value,float): split_function=lambda row:row[column]>=value else: split_function=lambda row:row[column]==value # Divide the rows into two sets and return them set1=[row for row in rows if split_function(row)] set2=[row for row in rows if not split_function(row)] return (set1,set2)
  29. 29. Python Code (same code as the previous slide, with the variance function highlighted)
  30. 30. Python Code (same code, with the choice of split_function in divideset highlighted)
  31. 31. Python Code (same code, with the division of the rows into set1 and set2 highlighted)
  32. 32. CART Algorithm A B Value 10 Circle 20 11 Square 22 22 Square 8 18 Circle 6
  33. 33. CART Algorithm A B Value 10 Circle 20 11 Square 22 22 Square 8 18 Circle 6
  34. 34. CART Algorithm 10 Circle 20 22 Square 8 11 Square 22 18 Circle 6
  35. 35. CART Algorithm
  36. 36. Python Code def buildtree(rows,scoref=variance): if len(rows)==0: return decisionnode() current_score=scoref(rows) # Set up some variables to track the best criteria best_gain=0.0 best_criteria=None best_sets=None column_count=len(rows[0])-1 for col in range(0,column_count): # Generate the list of different values in # this column column_values={} for row in rows: column_values[row[col]]=1 # Now try dividing the rows up for each value # in this column for value in column_values.keys(): (set1,set2)=divideset(rows,col,value) # Information gain p=float(len(set1))/len(rows) gain=current_score-p*scoref(set1)-(1-p)*scoref(set2) if gain>best_gain and len(set1)>0 and len(set2)>0: best_gain=gain best_criteria=(col,value) best_sets=(set1,set2) # Create the sub branches if best_gain>0: trueBranch=buildtree(best_sets[0]) falseBranch=buildtree(best_sets[1]) return decisionnode(col=best_criteria[0],value=best_criteria[1],tb=trueBranch,fb=falseBranch) else: return decisionnode(results=uniquecounts(rows))
  37. 37. Python Code (same buildtree code as the previous slide, with the best-criteria bookkeeping highlighted)
  38. 38. Python Code (same code, with the loop that tries every division and computes the information gain highlighted)
  39. 39. Python Code (same code, with the recursive creation of the true/false branches highlighted)
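buildtree returns decisionnode objects and calls a uniquecounts helper, neither of which is shown in the deck. A minimal version consistent with how they are used above (an assumption, not necessarily the author's exact code) could be:

class decisionnode:
    def __init__(self, col=-1, value=None, results=None, tb=None, fb=None):
        self.col = col          # column index this node tests
        self.value = value      # value the column is compared against
        self.results = results  # leaf nodes only: dict of outcome -> count
        self.tb = tb            # branch to follow when the test is true
        self.fb = fb            # branch to follow when the test is false

def uniquecounts(rows):
    # Count how often each outcome (the last column) occurs in these rows
    results = {}
    for row in rows:
        r = row[len(row) - 1]
        results.setdefault(r, 0)
        results[r] += 1
    return results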
  40. 40. Zillow Results Bathrooms > 3 Zip: 02139? After 1903? Zip: 02140? Bedrooms > 4? Duplex? Triplex?
  41. 41. Just for Fun… Hot or Not
  42. 42. Just for Fun… Hot or Not
  43. 43. Supervised and Unsupervised Regression trees are supervised “answers” are in the dataset Tree models predict answers Some methods are unsupervised There are no answers Methods just characterize the data Show interesting patterns
  44. 44. Next challenge - Bloggers Millions of blogs online Usually focus on a subject area Can they be characterized automatically? … using only the words in the posts?
  45. 45. The Technorati Top 100
  46. 46. A single blog
  47. 47. Getting the content Use Mark Pilgrim’s Universal Feed Parser Retrieve the post titles and text Split up the words Count occurrence of each word
  48. 48. Python Code import feedparser import re # Returns title and dictionary of word counts for an RSS feed def getwordcounts(url): # Parse the feed d=feedparser.parse(url) wc={} # Loop over all the entries for e in d.entries: if 'summary' in e: summary=e.summary else: summary=e.description # Extract a list of words words=getwords(e.title+' '+summary) for word in words: wc.setdefault(word,0) wc[word]+=1 return d.feed.title,wc def getwords(html): # Remove all the HTML tags txt=re.compile(r'<[^>]+>').sub('',html) # Split words by all non-alpha characters words=re.compile(r'[^A-Z^a-z]+').split(txt) # Convert to lowercase return [word.lower() for word in words if word!='']
  49. 49. Python Code (same code as the previous slide, with the loop over feed entries and the word counting highlighted)
  50. 50. Python Code (same code, with the getwords HTML-stripping and word-splitting function highlighted)
  51. 51. Building a Word Matrix Build a matrix of word counts Blogs are rows, words are columns Eliminate words that are: Too common Too rare
  52. 52. Python Code apcount={} wordcounts={} feedlist=[line for line in file('feedlist.txt')] for feedurl in feedlist: title,wc=getwordcounts(feedurl) wordcounts[title]=wc for word,count in wc.items(): apcount.setdefault(word,0) if count>1: apcount[word]+=1 wordlist=[] for w,bc in apcount.items(): frac=float(bc)/len(feedlist) if frac>0.1 and frac<0.5: wordlist.append(w) out=file('blogdata.txt','w') out.write('Blog') for word in wordlist: out.write('\t%s' % word) out.write('\n') for blog,wc in wordcounts.items(): out.write(blog) for word in wordlist: if word in wc: out.write('\t%d' % wc[word]) else: out.write('\t0') out.write('\n')
  53. 53. Python Code (same code as the previous slide, with the per-feed word counting highlighted)
  54. 54. Python Code (same code, with the filtering of too-common and too-rare words highlighted)
  55. 55. Python Code (same code, with the writing of the blogdata.txt matrix highlighted)
  56. 56. The Word Matrix “china” “kids” “music” “yahoo” Gothamist 0 3 3 0 GigaOM 6 0 1 2 Quick Online Tips 0 2 2 12
  57. 57. Determining distance “china” “kids” “music” “yahoo” Gothamist 0 3 3 0 GigaOM 6 0 1 2 Quick Online Tips 0 2 2 12 Euclidean “as the crow flies” √((6 − 0)² + (0 − 2)² + (1 − 2)² + (2 − 12)²) = 12 (approx)
  58. 58. Other Distance Metrics Manhattan Tanamoto Pearson Correlation Chebychev Spearman
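The clustering code on the following slides takes a distance function as a parameter and defaults to pearson, which never appears in the deck. Sketches of the two measures used here, plain Euclidean distance and 1 minus the Pearson correlation (so that smaller always means closer), might look like this; these are assumptions about the helpers rather than the author's own code:

from math import sqrt

def euclidean(v1, v2):
    return sqrt(sum((v1[i] - v2[i]) ** 2 for i in range(len(v1))))

def pearson(v1, v2):
    # 1 - r: 0 means perfectly correlated, larger means less similar
    n = float(len(v1))
    sum1, sum2 = sum(v1), sum(v2)
    sum1sq = sum(v ** 2 for v in v1)
    sum2sq = sum(v ** 2 for v in v2)
    psum = sum(v1[i] * v2[i] for i in range(len(v1)))
    num = psum - sum1 * sum2 / n
    den = sqrt((sum1sq - sum1 ** 2 / n) * (sum2sq - sum2 ** 2 / n))
    if den == 0: return 0
    return 1.0 - num / den

# The previous slide's example: GigaOM vs. Quick Online Tips
print euclidean([6, 0, 1, 2], [0, 2, 2, 12])  # about 11.9, the "12 (approx)"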
  59. 59. Hierarchical Clustering Find the two closest items Combine them into a single item Repeat…
  60. 60. Hierarchical Algorithm
  61. 61. Hierarchical Algorithm
  62. 62. Hierarchical Algorithm
  63. 63. Hierarchical Algorithm
  64. 64. Hierarchical Algorithm
  65. 65. Dendrogram
  66. 66. Python Code class bicluster: def __init__(self,vec,left=None,right=None,distance=0.0,id=None): self.left=left self.right=right self.vec=vec self.id=id self.distance=distance
  67. 67. Python Code def hcluster(rows,distance=pearson): distances={} currentclustid=-1 # Clusters are initially just the rows clust=[bicluster(rows[i],id=i) for i in range(len(rows))] while len(clust)>1: lowestpair=(0,1) closest=distance(clust[0].vec,clust[1].vec) # loop through every pair looking for the smallest distance for i in range(len(clust)): for j in range(i+1,len(clust)): # distances is the cache of distance calculations if (clust[i].id,clust[j].id) not in distances: distances[(clust[i].id,clust[j].id)]=distance(clust[i].vec,clust[j].vec) d=distances[(clust[i].id,clust[j].id)] if d<closest: closest=d lowestpair=(i,j) # calculate the average of the two clusters mergevec=[ (clust[lowestpair[0]].vec[i]+clust[lowestpair[1]].vec[i])/2.0 for i in range(len(clust[0].vec))] # create the new cluster newcluster=bicluster(mergevec,left=clust[lowestpair[0]], right=clust[lowestpair[1]], distance=closest,id=currentclustid) # cluster ids that weren’t in the original set are negative currentclustid-=1 del clust[lowestpair[1]] del clust[lowestpair[0]] clust.append(newcluster) return clust[0]
  68. 68. Python Code (same hcluster code as the previous slide, with the initial one-cluster-per-row setup highlighted)
  69. 69. Python Code (same code, with the search for the closest pair of clusters highlighted)
  70. 70. Python Code (same code, with the merging of the closest pair into a new cluster highlighted)
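To run hcluster on the blog data written out earlier, the tab-separated blogdata.txt file has to be read back into blog names and count vectors. The deck doesn't show that step; a sketch matching the file format produced above would be:

def readfile(filename):
    lines = [line for line in file(filename)]
    colnames = lines[0].strip().split('\t')[1:]   # the words
    rownames = []
    data = []
    for line in lines[1:]:
        p = line.strip().split('\t')
        rownames.append(p[0])                     # the blog title
        data.append([float(x) for x in p[1:]])    # word counts for that blog
    return rownames, colnames, data

blognames, words, data = readfile('blogdata.txt')
clust = hcluster(data)  # uses the pearson distance sketched earlier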
  71. 71. Hierarchical Blog Clusters
  72. 72. Hierarchical Blog Clusters
  73. 73. Hierarchical Blog Clusters
  74. 74. Rotating the Matrix Words in a blog -> blogs containing each word Gothamist GigaOM Quick Online Tips china 0 6 0 kids 3 0 2 music 3 1 2 yahoo 0 2 12
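Rotating (transposing) the matrix so that words become the rows only takes a couple of lines; a small sketch:

def rotatematrix(data):
    # data[i][j] is blog i, word j; the result has one row per word
    return [[data[i][j] for i in range(len(data))]
            for j in range(len(data[0]))]

wordclust = hcluster(rotatematrix(data))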
  75. 75. Hierarchical Word Clusters
  76. 76. K-Means Clustering Divides data into distinct clusters User determines how many Algorithm Start with arbitrary centroids Assign points to centroids Move the centroids Repeat
  77. 77. K-Means Algorithm
  78. 78. K-Means Algorithm
  79. 79. K-Means Algorithm
  80. 80. K-Means Algorithm
  81. 81. K-Means Algorithm
  82. 82. Python Code import random def kcluster(rows,distance=pearson,k=4): # Determine the minimum and maximum values for each point ranges=[(min([row[i] for row in rows]),max([row[i] for row in rows])) for i in range(len(rows[0]))] # Create k randomly placed centroids clusters=[[random.random()*(ranges[i][1]-ranges[i][0])+ranges[i][0] for i in range(len(rows[0]))] for j in range(k)] lastmatches=None for t in range(100): print 'Iteration %d' % t bestmatches=[[] for i in range(k)] # Find which centroid is the closest for each row for j in range(len(rows)): row=rows[j] bestmatch=0 for i in range(k): d=distance(clusters[i],row) if d<distance(clusters[bestmatch],row): bestmatch=i bestmatches[bestmatch].append(j) # If the results are the same as last time, this is complete if bestmatches==lastmatches: break lastmatches=bestmatches # Move the centroids to the average of their members for i in range(k): avgs=[0.0]*len(rows[0]) if len(bestmatches[i])>0: for rowid in bestmatches[i]: for m in range(len(rows[rowid])): avgs[m]+=rows[rowid][m] for j in range(len(avgs)): avgs[j]/=len(bestmatches[i]) clusters[i]=avgs return bestmatches
  83. 83. Python Code (same kcluster code as the previous slide, with the attribute ranges and random centroid placement highlighted)
  84. 84. Python Code (same code, with the assignment of each row to its closest centroid highlighted)
  85. 85. Python Code (same code, with the check for convergence against the previous assignments highlighted)
  86. 86. Python Code (same code, with the move of each centroid to the average of its members highlighted)
  87. 87. K-Means Results >> [rownames[r] for r in k[0]] ['The Viral Garden', 'Copyblogger', 'Creating Passionate Users', 'Oilman', 'ProBlogger Blog Tips', "Seth's Blog"] >> [rownames[r] for r in k[1]] ['Wonkette', 'Gawker', 'Gothamist', 'Huffington Post']
  88. 88. 2D Visualizations Instead of Clusters, a 2D Map Goals Preserve distances as much as possible Draw in two dimensions Dimension Reduction Principal Components Analysis Multidimensional Scaling
  89. 89. Multidimensional Scaling
  90. 90. Multidimensional Scaling
  91. 91. Multidimensional Scaling
  92. 92. def scaledown(data,distance=pearson,rate=0.01): n=len(data) # The real distances between every pair of items realdist=[[distance(data[i],data[j]) for j in range(n)] for i in range(0,n)] outersum=0.0 # Randomly initialize the starting points of the locations in 2D loc=[[random.random(),random.random()] for i in range(n)] fakedist=[[0.0 for j in range(n)] for i in range(n)] lasterror=None for m in range(0,1000): # Find projected distances for i in range(n): for j in range(n): fakedist[i][j]=sqrt(sum([pow(loc[i][x]-loc[j][x],2) for x in range(len(loc[i]))])) # Move points grad=[[0.0,0.0] for i in range(n)] totalerror=0 for k in range(n): for j in range(n): if j==k: continue # The error is percent difference between the distances errorterm=(fakedist[j][k]-realdist[j][k])/realdist[j][k] # Each point needs to be moved away from or towards the other # point in proportion to how much error it has grad[k][0]+=((loc[k][0]-loc[j][0])/fakedist[j][k])*errorterm grad[k][1]+=((loc[k][1]-loc[j][1])/fakedist[j][k])*errorterm # Keep track of the total error totalerror+=abs(errorterm) print totalerror # If the answer got worse by moving the points, we are done if lasterror and lasterror<totalerror: break lasterror=totalerror # Move each of the points by the learning rate times the gradient for k in range(n): loc[k][0]-=rate*grad[k][0] loc[k][1]-=rate*grad[k][1] return loc
  93. 93. (same scaledown code as the previous slide, with the real pairwise distances highlighted)
  94. 94. (same code, with the random 2D starting positions highlighted)
  95. 95. (same code, with the projected 2D distances highlighted)
  96. 96. (same code, with the per-point error terms and gradient highlighted)
  97. 97. (same code, with the stopping test when the total error stops improving highlighted)
  98. 98. (same code, with the gradient-descent update of each point's position highlighted)
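scaledown returns one [x, y] pair per blog; printing the coordinates next to the blog names (or handing them to any drawing library) produces maps like the ones on the next slides. A minimal usage sketch, assuming blognames and data from the readfile sketch earlier:

coords = scaledown(data)
for name, (x, y) in zip(blognames, coords):
    print '%-30s %7.3f %7.3f' % (name, x, y)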
  99. 99. Numerical Predictions Back to “supervised” learning We have a set of numerical attributes Specs for a laptop Age and rating for wine Ratios for a stock Want to predict another attribute Formula/model is unknown e.g. price
  100. 100. Regression Trees? Regression trees find hard boundaries Can’t deal with complex formulae
  101. 101. Statistical regression Requires specification of a model Usually linear Doesn’t handle context
  102. 102. Alternative - Interpolation Find “similar” items Guess price based on similar items Need to determine: What is similar? How should we aggregate prices?
  103. 103. Price Data from eBay
  104. 104. The eBay API XML API Send XML over HTTPS Receive results in XML http://developer.ebay.com/quickstartguide.
  105. 105. Some Python Code def getHeaders(apicall,siteID="0",compatabilityLevel = "433"): headers = {"X-EBAY-API-COMPATIBILITY-LEVEL": compatabilityLevel, "X-EBAY-API-DEV-NAME": devKey, "X-EBAY-API-APP-NAME": appKey, "X-EBAY-API-CERT-NAME": certKey, "X-EBAY-API-CALL-NAME": apicall, "X-EBAY-API-SITEID": siteID, "Content-Type": "text/xml"} return headers def sendRequest(apicall,xmlparameters): connection = httplib.HTTPSConnection(serverUrl) connection.request("POST", '/ws/api.dll', xmlparameters, getHeaders(apicall)) response = connection.getresponse() if response.status != 200: print "Error sending request:" + response.reason else: data = response.read() connection.close() return data
  106. 106. Some Python Code def getItem(itemID): xml = "<?xml version='1.0' encoding='utf-8'?>"+ "<GetItemRequest xmlns='urn:ebay:apis:eBLBaseComponents'>"+ "<RequesterCredentials><eBayAuthToken>" + userToken + "</eBayAuthToken></RequesterCredentials>" + "<ItemID>" + str(itemID) + "</ItemID>"+ "<DetailLevel>ItemReturnAttributes</DetailLevel>"+ "</GetItemRequest>" data=sendRequest('GetItem',xml) result={} response=parseString(data) result['title']=getSingleValue(response,'Title') sellingStatusNode = response.getElementsByTagName('SellingStatus')[0]; result['price']=getSingleValue(sellingStatusNode,'CurrentPrice') result['bids']=getSingleValue(sellingStatusNode,'BidCount') seller = response.getElementsByTagName('Seller') result['feedback'] = getSingleValue(seller[0],'FeedbackScore') attributeSet=response.getElementsByTagName('Attribute'); attributes={} for att in attributeSet: attID=att.attributes.getNamedItem('attributeID').nodeValue attValue=getSingleValue(att,'ValueLiteral') attributes[attID]=attValue result['attributes']=attributes return result
  107. 107. Building an item table RAM CPU HDD Screen DVD Price D600 512 1400 40 14 1 $350 Lenovo 160 300 5 13 0 $80 T22 256 900 20 14 1 $200 Pavillion 1024 1600 120 17 1 $800 etc..
  108. 108. Distance between items RAM CPU HDD Screen DVD Price New 512 1400 40 14 1 ??? T22 256 900 20 14 1 $200 Euclidean, just like in clustering √((512 − 256)² + (1400 − 900)² + (40 − 20)² + (14 − 14)² + (1 − 1)²)
  109. 109. Idea 1 – use the closest item With the item whose price I want to guess: Calculate the distance for every item in my dataset Guess that the price is the same as the closest This is called kNN with k=1
  110. 110. Problems with “outliers” The closest item may be anomalous Why? Exceptional deal that won’t occur again Something missing from the dataset Data errors
  111. 111. Using an average RAM CPU HDD Screen DVD Price New 512 1400 40 14 1 ??? No. 1 512 1400 30 13 1 $360 No. 2 512 1400 60 14 1 $400 No. 3 1024 1600 120 15 0 $325 k=3, estimate = $361
  112. 112. Using a weighted average RAM CPU HDD Screen DVD Price Weight New 512 1400 40 14 1 ??? No. 1 512 1400 30 13 1 $360 3 No. 2 512 1400 60 14 1 $400 2 No. 3 1024 1600 120 15 0 $325 1 Estimate = $367
  113. 113. Python code def getdistances(data,vec1): distancelist=[] for i in range(len(data)): vec2=data[i]['input'] distancelist.append((euclidean(vec1,vec2),i)) distancelist.sort() return distancelist def weightedknn(data,vec1,k=5,weightf=gaussian): # Get distances dlist=getdistances(data,vec1) avg=0.0 totalweight=0.0 # Get weighted average for i in range(k): dist=dlist[i][0] idx=dlist[i][1] weight=weightf(dist) avg+=weight*data[idx]['result'] totalweight+=weight avg=avg/totalweight return avg
  114. 114. Python code (same code as the previous slide, with the weightedknn function highlighted)
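weightedknn above defaults to a gaussian weighting function that isn't shown in the deck. A plausible sketch (an assumption; sigma controls how quickly the weight falls off with distance):

from math import exp

def gaussian(dist, sigma=10.0):
    # Close items get a weight near 1; distant items fall off smoothly but never reach 0
    return exp(-dist ** 2 / (2 * sigma ** 2))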
  115. 115. Too few – k too low
  116. 116. Too many – k too high
  117. 117. Determining the best k Divide the dataset up Training set Test set Guess the prices for the test set using the training set See how good the guesses are for different values of k Known as “cross-validation”
  118. 118. Determining the best k Test set: Attribute = 10, Price = 20 Training set: (Attribute 11, Price 30), (Attribute 8, Price 10), (Attribute 6, Price 0) For k = 1, guess = 30, error = 10 For k = 2, guess = 20, error = 0 For k = 3, guess = 13, error = 7 Repeat with different test sets, average the error
  119. 119. Python code def dividedata(data,test=0.05): trainset=[] testset=[] for row in data: if random()<test: testset.append(row) else: trainset.append(row) return trainset,testset def testalgorithm(algf,trainset,testset): error=0.0 for row in testset: guess=algf(trainset,row['input']) error+=(row['result']-guess)**2 return error/len(testset) def crossvalidate(algf,data,trials=100,test=0.05): error=0.0 for i in range(trials): trainset,testset=dividedata(data,test) error+=testalgorithm(algf,trainset,testset) return error/trials
  120. 120. Python code (the same dividedata / testalgorithm / crossvalidate code, repeated)
  121. 121. Python code (the same code, repeated)
  122. 122. Python code (the same code, repeated)
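Putting the pieces together: the sketch below defines a plain k-nearest-neighbors estimator (hypothetical; the deck only shows the weighted version) and cross-validates it for several values of k. It assumes data is a list of {'input': [...], 'result': price} dictionaries, which is the format getdistances and testalgorithm expect.

def knnestimate(data, vec1, k=3):
    # Average the prices of the k closest items
    dlist = getdistances(data, vec1)
    avg = 0.0
    for i in range(k):
        avg += data[dlist[i][1]]['result']
    return avg / k

for k in (1, 3, 5, 7):
    algf = lambda d, v, k=k: knnestimate(d, v, k)
    print 'k=%d error=%f' % (k, crossvalidate(algf, data))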
  123. 123. Problems with scale
  124. 124. Scaling the data
  125. 125. Scaling to zero
  126. 126. Determining the best scale Try different weights Use the “cross-validation” method Different ways of choosing a scale: Range-scaling Intuitive guessing Optimization
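One way to try different scalings is to multiply each attribute by a weight before running kNN, then cross-validate the result; the weights below are made up purely for illustration:

def rescale(data, scale):
    # Multiply every numeric attribute by its weight
    scaleddata = []
    for row in data:
        scaled = [scale[i] * row['input'][i] for i in range(len(scale))]
        scaleddata.append({'input': scaled, 'result': row['result']})
    return scaleddata

# e.g. upweight RAM, downweight hard-drive size (illustrative values only)
sdata = rescale(data, [2.0, 1.0, 0.1, 1.0, 1.0])
print crossvalidate(weightedknn, sdata)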
  127. 127. Methods covered Regression trees Hierarchical clustering k-means clustering Multidimensional scaling Weighted k-nearest neighbors
  128. 128. New projects Openads An open-source ad server Users can share impression/click data Matrix of what hits based on Page Text Ad Ad placement Search query Can we improve targeting?
  129. 129. New Projects Finance Analysts already drowning in info Stories sometimes broken on blogs Message boards show sentiment Extremely low signal-to-noise ratio
  130. 130. New Projects Entertainment How much buzz is a movie generating? What psychographic profiles like this type of movie? Of interest to studios and media investors
