Data Mining and Open APIs


     Toby Segaran
About Me
 Software Developer at Genstruct
   Work directly with scientists
   Design algorithms to aid in drug testing
 “Programming Collective Intelligence”
   Published by O’Reilly
   Due out in August
 Consult with open-source projects and other companies
 http://kiwitobes.com
Presentation Goals

 Look at some Open APIs
 Get some data
 Visualize algorithms for data-mining
 Work through some Python code
 Variety of techniques and sources
 Advocacy (why you should care)
Open data APIs

 Zillow               Yahoo Answers
 eBay                 Amazon
 Facebook             Technorati
 del.icio.us          Twitter
 HotOrNot             Google News
 Upcoming
 programmableweb.com/apis for more…
Open API uses

 Mashups
 Integration
 Automation
 Command-line tools
 Most importantly, creating datasets!
What is data mining?

 From a large dataset find the:
   Implicit
   Unknown
   Useful
 Data could be:
   Tabular, e.g. Price lists
   Free text
   Pictures
Why it’s important now

  More devices produce more data
  People share more data
  The internet is vast
  Products are more customized
  Advertising is targeted
  Human cognition is limited
Traditional Applications

  Computational Biology
  Financial Markets
  Retail Markets
  Fraud Detection
  Surveillance
  Supply Chain Optimization
  National Security
Traditional = Inaccessible

  Real applications are esoteric
  Tutorial examples are trivial
  Generally lacking in “interest value”
Fun, Accessible Applications

  Home price modeling
  Where are the hottest people?
  Which bloggers are similar?
  Important attributes on eBay
  Predicting fashion trends
  Movie popularity
Zillow
The Zillow API

 Allows querying by address
 Returns information about the property
     Bedrooms
     Bathrooms
     Zip Code
     Price Estimate
     Last Sale Price
 Requires registration key
 http://www.zillow.com/howto/api/PropertyDetailsAPIOverview.htm
The Zillow API

REST Request

http://www.zillow.com/webservice/GetDeepSearchResults.htm?
zws-id=key&address=address&citystatezip=citystatezip
The Zillow API

<SearchResults:searchresults xmlns:SearchResults="http://www.zillow.com/vstatic/3/static/xsd/SearchResults.xsd">
 …
 <response>
  <results>
   <result>
    <zpid>48749425</zpid>
    <links> … </links>
    <address>
     <street>2114 Bigelow Ave N</street>
     <zipcode>98109</zipcode>
     <city>SEATTLE</city>
     <state>WA</state>
     <latitude>47.637934</latitude>
     <longitude>-122.347936</longitude>
    </address>
    <yearBuilt>1924</yearBuilt>
    <lotSizeSqFt>4680</lotSizeSqFt>
    <finishedSqFt>3290</finishedSqFt>
    <bathrooms>2.75</bathrooms>
    <bedrooms>4</bedrooms>
    <lastSoldDate>06/18/2002</lastSoldDate>
    <lastSoldPrice currency="USD">770000</lastSoldPrice>
    <valuation>
     <amount currency="USD">1091061</amount>
    </valuation>
   </result>
  </results>
 </response>
Zillow from Python

def getaddressdata(address,city):
  escad=address.replace(' ','+')

  # Construct the URL
  url='http://www.zillow.com/webservice/GetDeepSearchResults.htm?'
  url+='zws-id=%s&address=%s&citystatezip=%s' % (zwskey,escad,city)

  # Parse resulting XML
  doc=xml.dom.minidom.parseString(urllib2.urlopen(url).read())
  code=doc.getElementsByTagName('code')[0].firstChild.data

  # Code 0 means success, otherwise there was an error
  if code!='0': return None

  # Extract the info about this property
  try:
    zipcode=doc.getElementsByTagName('zipcode')[0].firstChild.data
    use=doc.getElementsByTagName('useCode')[0].firstChild.data
    year=doc.getElementsByTagName('yearBuilt')[0].firstChild.data
    bath=doc.getElementsByTagName('bathrooms')[0].firstChild.data
    bed=doc.getElementsByTagName('bedrooms')[0].firstChild.data
    rooms=doc.getElementsByTagName('totalRooms')[0].firstChild.data
    price=doc.getElementsByTagName('amount')[0].firstChild.data
  except:
    return None

  return (zipcode,use,int(year),float(bath),int(bed),int(rooms),price)
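A minimal usage sketch of the function above for building the dataset; the imports, the zwskey registration key value, and the example addresses are assumptions, not part of the slides.

import xml.dom.minidom, urllib2

zwskey='your-zillow-key-here'        # assumed: obtained by registering with Zillow

# Hypothetical addresses; any (street, 'city, state') pairs would do
addresslist=[('2114 Bigelow Ave N','Seattle, WA'),
             ('1060 W Addison St','Chicago, IL')]

rows=[]
for street,city in addresslist:
  data=getaddressdata(street,city)
  if data: rows.append(data)        # each row: (zip,use,year,baths,beds,rooms,price)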
A home price dataset

House   Zip     Bathrooms   Bedrooms   Built   Type      Price

A       02138   1.5         2          1847    Single     505296
B       02139   3.5         9          1916    Triplex    776378
C       02140   3.5         4          1894    Duplex     595027
D       02139   2.5         4          1854    Duplex     552213
E       02138   3.5         5          1909    Duplex     947528
F       02138   3.5         4          1930    Single    2107871
etc..
What can we learn?

 A made-up house’s price
 How important is Zip Code?
 What are the important attributes?

 Can we do better than averages?
Introducing Regression Trees
A     B        Value
10    Circle   20
11    Square   22
22    Square   8
18    Circle   6
Minimizing deviation

       Standard deviation is the “spread” of results
       Try all possible divisions
       Choose the division that decreases deviation the most

Initially:
       Average = 14, Standard Deviation = 8.2

Split on B:
       B = Circle:  Average = 13, Standard Deviation = 9.9
       B = Square:  Average = 15, Standard Deviation = 9.9

Split on A at 18:
       A > 18:   Average = 8,  Standard Deviation = 0
       A <= 18:  Average = 16, Standard Deviation = 8.7

Split on A at 11:
       A > 11:   Average = 7,  Standard Deviation = 1.4
       A <= 11:  Average = 21, Standard Deviation = 1.4
Python Code

def variance(rows):
  if len(rows)==0: return 0
  data=[float(row[len(row)-1]) for row in rows]
  mean=sum(data)/len(data)
  variance=sum([(d-mean)**2 for d in data])/len(data)
  return variance

def divideset(rows,column,value):
  # Make a function that tells us if a row is in
  # the first group (true) or the second group (false)
  split_function=None
  if isinstance(value,int) or isinstance(value,float):
    split_function=lambda row:row[column]>=value
  else:
    split_function=lambda row:row[column]==value

  # Divide the rows into two sets and return them
  set1=[row for row in rows if split_function(row)]
  set2=[row for row in rows if not split_function(row)]
  return (set1,set2)
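A small usage sketch (not on the slides) applying the two functions above to the toy A/B/Value table, to show how a candidate split is scored.

rows=[[10,'Circle',20],
      [11,'Square',22],
      [22,'Square',8],
      [18,'Circle',6]]

set1,set2=divideset(rows,0,18)            # split on column A at 18
p=float(len(set1))/len(rows)
print 'variance before:',variance(rows)
print 'weighted variance after:',p*variance(set1)+(1-p)*variance(set2)
print 'gain:',variance(rows)-p*variance(set1)-(1-p)*variance(set2)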
CART Algorithm

A     B        Value
10    Circle   20
11    Square   22
22    Square   8
18    Circle   6

First split (the division with the lowest deviation):

   10   Circle   20        22   Square   8
   11   Square   22        18   Circle   6
Python Code

def buildtree(rows,scoref=variance):
  if len(rows)==0: return decisionnode()
  current_score=scoref(rows)

  # Set up some variables to track the best criteria
  best_gain=0.0
  best_criteria=None
  best_sets=None

  column_count=len(rows[0])-1
  for col in range(0,column_count):
    # Generate the list of different values in this column
    column_values={}
    for row in rows:
      column_values[row[col]]=1

    # Now try dividing the rows up for each value in this column
    for value in column_values.keys():
      (set1,set2)=divideset(rows,col,value)

      # Information gain
      p=float(len(set1))/len(rows)
      gain=current_score-p*scoref(set1)-(1-p)*scoref(set2)
      if gain>best_gain and len(set1)>0 and len(set2)>0:
        best_gain=gain
        best_criteria=(col,value)
        best_sets=(set1,set2)

  # Create the sub branches
  if best_gain>0:
    trueBranch=buildtree(best_sets[0])
    falseBranch=buildtree(best_sets[1])
    return decisionnode(col=best_criteria[0],value=best_criteria[1],
                        tb=trueBranch,fb=falseBranch)
  else:
    return decisionnode(results=uniquecounts(rows))
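buildtree relies on a decisionnode class and a uniquecounts helper that the slides never show; the definitions below are a minimal sketch consistent with how they are used here, and the exact fields are assumptions.

class decisionnode:
  def __init__(self,col=-1,value=None,results=None,tb=None,fb=None):
    self.col=col          # column index tested at this node
    self.value=value      # value the column is compared against
    self.results=results  # leaf nodes only: counts of outcomes
    self.tb=tb            # branch followed when the test is true
    self.fb=fb            # branch followed when the test is false

def uniquecounts(rows):
  # Count how often each outcome (last column) appears in these rows
  results={}
  for row in rows:
    r=row[len(row)-1]
    results.setdefault(r,0)
    results[r]+=1
  return results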
Zillow Results

Resulting tree: Bathrooms > 3 at the root, with later splits including
Zip: 02139?, After 1903?, Zip: 02140?, Bedrooms > 4?, Duplex?, Triplex?
Just for Fun… Hot or Not
Supervised and Unsupervised

 Regression trees are supervised
   “answers” are in the dataset
   Tree models predict answers
 Some methods are unsupervised
   There are no answers
   Methods just characterize the data
   Show interesting patterns
Next challenge - Bloggers

  Millions of blogs online
  Usually focus on a subject area
  Can they be characterized automatically?
  … using only the words in the posts?
The Technorati Top 100
A single blog
Getting the content

  Use Mark Pilgrim’s Universal Feed Parser
  Retrieve the post titles and text
  Split up the words
  Count occurrence of each word
Python Code

import feedparser
import re

# Returns title and dictionary of word counts for an RSS feed
def getwordcounts(url):
  # Parse the feed
  d=feedparser.parse(url)
  wc={}

  # Loop over all the entries
  for e in d.entries:
    if 'summary' in e: summary=e.summary
    else: summary=e.description

    # Extract a list of words
    words=getwords(e.title+' '+summary)
    for word in words:
      wc.setdefault(word,0)
      wc[word]+=1
  return d.feed.title,wc

def getwords(html):
  # Remove all the HTML tags
  txt=re.compile(r'<[^>]+>').sub('',html)

  # Split words by all non-alpha characters
  words=re.compile(r'[^A-Z^a-z]+').split(txt)

  # Convert to lowercase
  return [word.lower() for word in words if word!='']
Building a Word Matrix

  Build a matrix of word counts
  Blogs are rows, words are columns
  Eliminate words that are:
    Too common
    Too rare
Python Code

apcount={}
wordcounts={}
feedlist=[line for line in file('feedlist.txt')]
for feedurl in feedlist:
  title,wc=getwordcounts(feedurl)
  wordcounts[title]=wc
  for word,count in wc.items():
    apcount.setdefault(word,0)
    if count>1:
      apcount[word]+=1

wordlist=[]
for w,bc in apcount.items():
  frac=float(bc)/len(feedlist)
  if frac>0.1 and frac<0.5: wordlist.append(w)

out=file('blogdata.txt','w')
out.write('Blog')
for word in wordlist: out.write('\t%s' % word)
out.write('\n')
for blog,wc in wordcounts.items():
  out.write(blog)
  for word in wordlist:
    if word in wc: out.write('\t%d' % wc[word])
    else: out.write('\t0')
  out.write('\n')
The Word Matrix

                    “china”   “kids”   “music”   “yahoo”

Gothamist           0         3        3         0
GigaOM              6         0        1         2
Quick Online Tips   0         2        2         12
Determining distance

                    “china”   “kids”   “music”   “yahoo”

Gothamist           0         3        3         0
GigaOM              6         0        1         2
Quick Online Tips   0         2        2         12

Euclidean “as the crow flies” distance between GigaOM and Quick Online Tips:

  sqrt((6 − 0)² + (0 − 2)² + (1 − 2)² + (2 − 12)²) = 12 (approx)
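A short sketch (not on the slides) of the Euclidean metric above written as a Python function, in the same style as the rest of the deck's code.

from math import sqrt

def euclidean(v1,v2):
  # Straight-line distance between two word-count vectors
  return sqrt(sum([(v1[i]-v2[i])**2 for i in range(len(v1))]))

print euclidean([6,0,1,2],[0,2,2,12])   # ~11.9, the "12 (approx)" above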
Other Distance Metrics

 Manhattan
 Tanimoto
 Pearson Correlation
 Chebychev
 Spearman
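The later code defaults to distance=pearson, but the function itself never appears in the deck. This is a minimal sketch of a Pearson-correlation-based distance (1.0 minus the correlation), one common formulation; treat the exact definition as an assumption.

from math import sqrt

def pearson(v1,v2):
  # Pearson correlation of the two vectors, returned as a distance:
  # 0.0 for perfectly correlated vectors, larger for less similar ones
  n=len(v1)
  sum1,sum2=sum(v1),sum(v2)
  sum1Sq=sum([v*v for v in v1])
  sum2Sq=sum([v*v for v in v2])
  pSum=sum([v1[i]*v2[i] for i in range(n)])
  num=pSum-(sum1*sum2/float(n))
  den=sqrt((sum1Sq-sum1**2/float(n))*(sum2Sq-sum2**2/float(n)))
  if den==0: return 0
  return 1.0-num/den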
Hierarchical Clustering

  Find the two closest items
  Combine them into a single item
  Repeat…
Hierarchical Algorithm
Dendrogram
Python Code

class bicluster:
  def __init__(self,vec,left=None,right=None,distance=0.0,id=None):
    self.left=left
    self.right=right
    self.vec=vec
    self.id=id
    self.distance=distance
Python Code

def hcluster(rows,distance=pearson):
  distances={}
  currentclustid=-1

  # Clusters are initially just the rows
  clust=[bicluster(rows[i],id=i) for i in range(len(rows))]

  while len(clust)>1:
    lowestpair=(0,1)
    closest=distance(clust[0].vec,clust[1].vec)

    # loop through every pair looking for the smallest distance
    for i in range(len(clust)):
      for j in range(i+1,len(clust)):
        # distances is the cache of distance calculations
        if (clust[i].id,clust[j].id) not in distances:
          distances[(clust[i].id,clust[j].id)]=distance(clust[i].vec,clust[j].vec)
        d=distances[(clust[i].id,clust[j].id)]

        if d<closest:
          closest=d
          lowestpair=(i,j)

    # calculate the average of the two clusters
    mergevec=[
      (clust[lowestpair[0]].vec[i]+clust[lowestpair[1]].vec[i])/2.0
      for i in range(len(clust[0].vec))]

    # create the new cluster
    newcluster=bicluster(mergevec,left=clust[lowestpair[0]],
                         right=clust[lowestpair[1]],
                         distance=closest,id=currentclustid)

    # cluster ids that weren’t in the original set are negative
    currentclustid-=1
    del clust[lowestpair[1]]
    del clust[lowestpair[0]]
    clust.append(newcluster)

  return clust[0]
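A sketch (not in the slides) of loading the word matrix written out earlier and running the clustering; the readfile helper is an assumed convenience, not part of the deck.

def readfile(filename):
  lines=[line for line in file(filename)]
  colnames=lines[0].strip().split('\t')[1:]     # words
  rownames=[]
  data=[]
  for line in lines[1:]:
    p=line.strip().split('\t')
    rownames.append(p[0])                       # blog name
    data.append([float(x) for x in p[1:]])      # word counts
  return rownames,colnames,data

rownames,colnames,data=readfile('blogdata.txt')
tree=hcluster(data)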
Hierarchical Blog Clusters
Rotating the Matrix

  Words in a blog -> blogs containing each word

          Gothamist   GigaOM   Quick Online Tips
china     0           6        0
kids      3           0        2
music     3           1        2
yahoo     0           2        12
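A small sketch of the transpose step described above (the rotatematrix name is an assumption):

def rotatematrix(data):
  # Turn blogs-x-words rows into words-x-blogs rows
  newdata=[]
  for i in range(len(data[0])):
    newrow=[data[j][i] for j in range(len(data))]
    newdata.append(newrow)
  return newdata

wordmatrix=rotatematrix(data)
wordclust=hcluster(wordmatrix)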
Hierarchical Word Clusters
K-Means Clustering

 Divides data into distinct clusters
 User determines how many
 Algorithm
   Start with arbitrary centroids
   Assign points to centroids
   Move the centroids
   Repeat
K-Means Algorithm
Python Code

import random

def kcluster(rows,distance=pearson,k=4):
  # Determine the minimum and maximum values for each point
  ranges=[(min([row[i] for row in rows]),max([row[i] for row in rows]))
          for i in range(len(rows[0]))]

  # Create k randomly placed centroids
  clusters=[[random.random()*(ranges[i][1]-ranges[i][0])+ranges[i][0]
             for i in range(len(rows[0]))] for j in range(k)]

  lastmatches=None
  for t in range(100):
    print 'Iteration %d' % t
    bestmatches=[[] for i in range(k)]

    # Find which centroid is the closest for each row
    for j in range(len(rows)):
      row=rows[j]
      bestmatch=0
      for i in range(k):
        d=distance(clusters[i],row)
        if d<distance(clusters[bestmatch],row): bestmatch=i
      bestmatches[bestmatch].append(j)

    # If the results are the same as last time, this is complete
    if bestmatches==lastmatches: break
    lastmatches=bestmatches

    # Move the centroids to the average of their members
    for i in range(k):
      avgs=[0.0]*len(rows[0])
      if len(bestmatches[i])>0:
        for rowid in bestmatches[i]:
          for m in range(len(rows[rowid])):
            avgs[m]+=rows[rowid][m]
        for j in range(len(avgs)):
          avgs[j]/=len(bestmatches[i])
        clusters[i]=avgs

  return bestmatches
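The results on the next slide come from a call along these lines (a sketch; the data and rownames variables are assumed to match the readfile sketch earlier):

k=kcluster(data,k=4)                  # k clusters, each a list of row indices
print [rownames[r] for r in k[0]]     # blog names in the first cluster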
K-Means Results

>> [rownames[r] for r in k[0]]
['The Viral Garden', 'Copyblogger', 'Creating Passionate Users',
 'Oilman', 'ProBlogger Blog Tips', "Seth's Blog"]

>> [rownames[r] for r in k[1]]
['Wonkette', 'Gawker', 'Gothamist', 'Huffington Post']
2D Visualizations

  Instead of Clusters, a 2D Map
  Goals
    Preserve distances as much as possible
    Draw in two dimensions
  Dimension Reduction
    Principal Components Analysis
    Multidimensional Scaling
Multidimensional Scaling
from math import sqrt

def scaledown(data,distance=pearson,rate=0.01):
  n=len(data)

  # The real distances between every pair of items
  realdist=[[distance(data[i],data[j]) for j in range(n)]
            for i in range(0,n)]

  # Randomly initialize the starting points of the locations in 2D
  loc=[[random.random(),random.random()] for i in range(n)]
  fakedist=[[0.0 for j in range(n)] for i in range(n)]

  lasterror=None
  for m in range(0,1000):
    # Find projected distances
    for i in range(n):
      for j in range(n):
        fakedist[i][j]=sqrt(sum([pow(loc[i][x]-loc[j][x],2)
                                 for x in range(len(loc[i]))]))

    # Move points
    grad=[[0.0,0.0] for i in range(n)]
    totalerror=0
    for k in range(n):
      for j in range(n):
        if j==k: continue
        # The error is percent difference between the distances
        errorterm=(fakedist[j][k]-realdist[j][k])/realdist[j][k]

        # Each point needs to be moved away from or towards the other
        # point in proportion to how much error it has
        grad[k][0]+=((loc[k][0]-loc[j][0])/fakedist[j][k])*errorterm
        grad[k][1]+=((loc[k][1]-loc[j][1])/fakedist[j][k])*errorterm

        # Keep track of the total error
        totalerror+=abs(errorterm)
    print totalerror

    # If the answer got worse by moving the points, we are done
    if lasterror and lasterror<totalerror: break
    lasterror=totalerror

    # Move each of the points by the learning rate times the gradient
    for k in range(n):
      loc[k][0]-=rate*grad[k][0]
      loc[k][1]-=rate*grad[k][1]

  return loc
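A quick usage sketch (not on the slides): run the scaling on the blog word matrix and print the 2D coordinate assigned to each blog.

coords=scaledown(data)
for name,(x,y) in zip(rownames,coords):
  print '%-30s %6.3f %6.3f' % (name,x,y)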
Numerical Predictions

 Back to “supervised” learning
 We have a set of numerical attributes
   Specs for a laptop
   Age ...
Regression Trees?

 Regression trees find hard boundaries
 Can’t deal with complex formulae
Statistical regression

  Requires specification of a model
  Usually linear
  Doesn’t handle context
Alternative - Interpolation

  Find “similar” items
  Guess price based on similar items
  Need to determine:
    What is ...
Price Data from eBay
The eBay API

 XML API
 Send XML over HTTPS
 Receive results in XML

 http://developer.ebay.com/quickstartguide.
Some Python Code

def getHeaders(apicall,siteID="0",compatabilityLevel="433"):
  headers = {"X-EBAY-A...
Some Python Code

def getItem(itemID):
  xml = "<?xml version='1.0' encoding='utf-8'?>"+
        "<GetItemReque...
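The two slides above are cut off in this copy of the deck. As a rough sketch of the pattern they describe (an XML request sent over HTTPS), something like the following would work; the endpoint path, the assumption that getHeaders returns the full dict of X-EBAY-API-* headers, and the function name are based on eBay's developer documentation rather than the slides.

import httplib

def sendrequest(apicall,xmlbody,serverurl='api.ebay.com'):
  # POST the XML payload over HTTPS with the headers built by getHeaders
  connection=httplib.HTTPSConnection(serverurl)
  connection.request('POST','/ws/api.dll',xmlbody,getHeaders(apicall))
  response=connection.getresponse()
  data=response.read()
  connection.close()
  return data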
Building an item table
         RAM     CPU    HDD   Screen   DVD   Price


D600     512     1400   40    14       1     $...
Distance between items
            RAM      CPU        HDD        Screen     DVD        Price


New         512      1400 ...
Idea 1 – use the closest item

  With the item whose price I want to
  guess:
    Calculate the distance for every item in...
Problems with “outliers”

  The closest item may be anomalous
  Why?
    Exceptional deal that won’t occur again
    Somet...
Using an average
        RAM    CPU    HDD   Screen   DVD      Price

New     512    1400   40    14       1        ???

N...
Using a weighted average
        RAM    CPU    HDD   Screen   DVD   Price   Weight

New     512    1400   40    14       1...
Python code

def getdistances(data,vec1):
  distancelist=[]
  for i in range(len(data)):
    vec2=data[i]['input']
    di...
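The code above is cut off here. A sketch of what the complete idea looks like, assuming getdistances returns a sorted list of (distance, index) pairs and that weights fall off with distance; the weightedknn name, the 'result' key, and the inverse-distance weighting are assumptions.

def getdistances(data,vec1):
  # Distance from the new item to every item in the dataset, sorted ascending
  distancelist=[]
  for i in range(len(data)):
    vec2=data[i]['input']
    distancelist.append((euclidean(vec1,vec2),i))
  distancelist.sort()
  return distancelist

def weightedknn(data,vec1,k=5):
  # Weighted average of the k nearest prices, closer items counting more
  dlist=getdistances(data,vec1)
  avg=0.0
  totalweight=0.0
  for i in range(k):
    dist,idx=dlist[i]
    weight=1.0/(dist+0.1)              # assumed inverse-distance weighting
    avg+=weight*data[idx]['result']
    totalweight+=weight
  return avg/totalweight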
Too few – k too low
Too many – k too high
Determining the best k

  Divide the dataset up
    Training set
    Test set
  Guess the prices for the test set using
  the training set
Python code

def dividedata(data,test=0.05):
  trainset=[]
  testset=[]
  for row in data:
    if random()<t...
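The dividedata slides are cut off here. A sketch of the complete train/test split plus the cross-validation loop referred to on the following slides; everything past the visible lines (the else branch, the error measure, the guessing function's signature) is an assumption.

from random import random

def dividedata(data,test=0.05):
  # Randomly put ~5% of the rows into the test set, the rest into training
  trainset=[]
  testset=[]
  for row in data:
    if random()<test:
      testset.append(row)
    else:
      trainset.append(row)
  return trainset,testset

def testalgorithm(algf,trainset,testset):
  # Average squared error of guessing each test price from the training set
  error=0.0
  for row in testset:
    guess=algf(trainset,row['input'])
    error+=(row['result']-guess)**2
  return error/len(testset)

def crossvalidate(algf,data,trials=100,test=0.05):
  # Repeat the random split many times and average the error
  error=0.0
  for i in range(trials):
    trainset,testset=dividedata(data,test)
    error+=testalgorithm(algf,trainset,testset)
  return error/trials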
Problems with scale
Scaling the data
Scaling to zero
Determining the best scale

  Try different weights
  Use the “cross-validation” method
  Different ways of choosing a sca...
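A sketch of trying different weights with cross-validation, as described above; the rescale name and the 'input'/'result' row structure are assumptions carried over from the earlier sketches.

def rescale(data,scale):
  # Multiply each numeric column by its weight before measuring distance
  scaleddata=[]
  for row in data:
    scaled=[scale[i]*row['input'][i] for i in range(len(scale))]
    scaleddata.append({'input':scaled,'result':row['result']})
  return scaleddata

# e.g. judge one candidate set of weights:
# print crossvalidate(weightedknn,rescale(data,[1,1,0,0.5,2]))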
Methods covered

 Regression trees
 Hierarchical clustering
 k-means clustering
 Multidimensional scaling
 Weighted k-nearest neighbors
New projects

 Openads
   An open-source ad server
   Users can share impression/click data
   Matrix of what hits based o...
New Projects

 Finance
   Analysts already drowning in info
   Stories sometimes broken on blogs
   Message boards show se...
New Projects

 Entertainment
   How much buzz is a movie generating?
   What psychographic profiles like this type
   of m...
  1. 1. Data Mining and Open APIs Toby Segaran
  2. 2. About Me Software Developer at Genstruct Work directly with scientists Design algorithms to aid in drug testing “Programming Collective Intelligence” Published by O’Reilly Due out in August Consult with open-source projects and other companies http://kiwitobes.com
  3. 3. Presentation Goals Look at some Open APIs Get some data Visualize algorithms for data-mining Work through some Python code Variety of techniques and sources Advocacy (why you should care)
  4. 4. Open data APIs Zillow Yahoo Answers eBay Amazon Facebook Technorati del.icio.us Twitter HotOrNot Google News Upcoming programmableweb.com/apis for more…
  5. 5. Open API uses Mashups Integration Automation Command-line tools Most importantly, creating datasets!
  6. 6. What is data mining? From a large dataset find the: Implicit Unknown Useful Data could be: Tabular, e.g. Price lists Free text Pictures
  7. 7. Why it’s important now More devices produce more data People share more data The internet is vast Products are more customized Advertising is targeted Human cognition is limited
  8. 8. Traditional Applications Computational Biology Financial Markets Retail Markets Fraud Detection Surveillance Supply Chain Optimization National Security
  9. 9. Traditional = Inaccessible Real applications are esoteric Tutorial examples are trivial Generally lacking in “interest value”
  10. 10. Fun, Accessible Applications Home price modeling Where are the hottest people? Which bloggers are similar? Important attributes on eBay Predicting fashion trends Movie popularity
  11. 11. Zillow
  12. 12. The Zillow API Allows querying by address Returns information about the property Bedrooms Bathrooms Zip Code Price Estimate Last Sale Price Requires registration key http://www.zillow.com/howto/api/PropertyDetailsAPIOverview.htm
  13. 13. The Zillow API REST Request http://www.zillow.com/webservice/GetDeepSearchResults.htm? zws-id=key&address=address&citystatezip=citystateszip
  14. 14. The Zillow API <SearchResults:searchresults xmlns:SearchResults=quot;http://www. zillow.com/vstatic/3/static/xsd/SearchResults.xsdquot;> … <response> <results> <result> <zpid>48749425</zpid> <links> … </links> <address> <street>2114 Bigelow Ave N</street> <zipcode>98109</zipcode> <city>SEATTLE</city> <state>WA</state> <latitude>47.637934</latitude> <longitude>-122.347936</longitude> </address> <yearBuilt>1924</yearBuilt> <lotSizeSqFt>4680</lotSizeSqFt> <finishedSqFt>3290</finishedSqFt> <bathrooms>2.75</bathrooms> <bedrooms>4</bedrooms> <lastSoldDate>06/18/2002</lastSoldDate> <lastSoldPrice currency=quot;USDquot;>770000</lastSoldPrice> <valuation> <amount currency=quot;USDquot;>1091061</amount> </result> </results> </response>
  15. 15. The Zillow API <SearchResults:searchresults xmlns:SearchResults=quot;http://www. zillow.com/vstatic/3/static/xsd/SearchResults.xsdquot;> … <zipcode>98109</zipcode> <response> <results> <city>SEATTLE</city> <result> <state>WA</state> <zpid>48749425</zpid> <links> <latitude>47.637934</latitude> … <longitude>-122.347936</longitude> </links> <address> </address>Bigelow Ave N</street> <street>2114 <yearBuilt>1924</yearBuilt> <zipcode>98109</zipcode> <city>SEATTLE</city> <lotSizeSqFt>4680</lotSizeSqFt> <state>WA</state> <finishedSqFt>3290</finishedSqFt> <latitude>47.637934</latitude> <longitude>-122.347936</longitude> </address> <bathrooms>2.75</bathrooms> <yearBuilt>1924</yearBuilt> <lotSizeSqFt>4680</lotSizeSqFt> <bedrooms>4</bedrooms> <finishedSqFt>3290</finishedSqFt> <lastSoldDate>06/18/2002</lastSoldDate> <bathrooms>2.75</bathrooms> <bedrooms>4</bedrooms> <lastSoldPrice currency=quot;USDquot;>770000</lastSoldPrice> <lastSoldDate>06/18/2002</lastSoldDate> <valuation> currency=quot;USDquot;>770000</lastSoldPrice> <lastSoldPrice <valuation> <amountcurrency=quot;USDquot;>1091061</amount> currency=quot;USDquot;>1091061</amount> <amount </result> </results> </response>
  16. 16. Zillow from Python def getaddressdata(address,city): escad=address.replace(' ','+') # Construct the URL url='http://www.zillow.com/webservice/GetDeepSearchResults.htm?' url+='zws-id=%s&address=%s&citystatezip=%s' % (zwskey,escad,city) # Parse resulting XML doc=xml.dom.minidom.parseString(urllib2.urlopen(url).read()) code=doc.getElementsByTagName('code')[0].firstChild.data # Code 0 means success, otherwise there was an error if code!='0': return None # Extract the info about this property try: zipcode=doc.getElementsByTagName('zipcode')[0].firstChild.data use=doc.getElementsByTagName('useCode')[0].firstChild.data year=doc.getElementsByTagName('yearBuilt')[0].firstChild.data bath=doc.getElementsByTagName('bathrooms')[0].firstChild.data bed=doc.getElementsByTagName('bedrooms')[0].firstChild.data rooms=doc.getElementsByTagName('totalRooms')[0].firstChild.data price=doc.getElementsByTagName('amount')[0].firstChild.data except: return None return (zipcode,use,int(year),float(bath),int(bed),int(rooms),price)
  17. 17. Zillow from Python def getaddressdata(address,city): escad=address.replace(' ','+') # Construct the URL # Construct the URL url='http://www.zillow.com/webservice/GetDeepSearchResults.htm?' url='http://www.zillow.com/webservice/GetDeepSearchResults.htm?' url+='zws-id=%s&address=%s&citystatezip=%s' % (zwskey,escad,city) url+='zws-id=%s&address=%s&citystatezip=%s' % (zwskey,escad,city) # Parse resulting XML doc=xml.dom.minidom.parseString(urllib2.urlopen(url).read()) code=doc.getElementsByTagName('code')[0].firstChild.data # Code 0 means success, otherwise there was an error if code!='0': return None # Extract the info about this property try: zipcode=doc.getElementsByTagName('zipcode')[0].firstChild.data use=doc.getElementsByTagName('useCode')[0].firstChild.data year=doc.getElementsByTagName('yearBuilt')[0].firstChild.data bath=doc.getElementsByTagName('bathrooms')[0].firstChild.data bed=doc.getElementsByTagName('bedrooms')[0].firstChild.data rooms=doc.getElementsByTagName('totalRooms')[0].firstChild.data price=doc.getElementsByTagName('amount')[0].firstChild.data except: return None return (zipcode,use,int(year),float(bath),int(bed),int(rooms),price)
  18. 18. Zillow from Python def getaddressdata(address,city): escad=address.replace(' ','+') # Construct the URL url='http://www.zillow.com/webservice/GetDeepSearchResults.htm?' url+='zws-id=%s&address=%s&citystatezip=%s' % (zwskey,escad,city) # Parse resulting XML # Parse resulting XML doc=xml.dom.minidom.parseString(urllib2.urlopen(url).read()) doc=xml.dom.minidom.parseString(urllib2.urlopen(url).read()) code=doc.getElementsByTagName('code')[0].firstChild.data code=doc.getElementsByTagName('code')[0].firstChild.data # Code 0 means success, otherwise there was an error if code!='0': return None # Extract the info about this property try: zipcode=doc.getElementsByTagName('zipcode')[0].firstChild.data use=doc.getElementsByTagName('useCode')[0].firstChild.data year=doc.getElementsByTagName('yearBuilt')[0].firstChild.data bath=doc.getElementsByTagName('bathrooms')[0].firstChild.data bed=doc.getElementsByTagName('bedrooms')[0].firstChild.data rooms=doc.getElementsByTagName('totalRooms')[0].firstChild.data price=doc.getElementsByTagName('amount')[0].firstChild.data except: return None return (zipcode,use,int(year),float(bath),int(bed),int(rooms),price)
  19. 19. Zillow from Python def getaddressdata(address,city): escad=address.replace(' ','+') # Construct the URL url='http://www.zillow.com/webservice/GetDeepSearchResults.htm?' url+='zws-id=%s&address=%s&citystatezip=%s' % (zwskey,escad,city) # Parse resulting XML doc=xml.dom.minidom.parseString(urllib2.urlopen(url).read()) code=doc.getElementsByTagName('code')[0].firstChild.data # Code 0 means success, otherwise there was an error if code!='0': return None zipcode=doc.getElementsByTagName('zipcode')[0].firstChild.data # Extract the info about this property try: use=doc.getElementsByTagName('useCode')[0].firstChild.data zipcode=doc.getElementsByTagName('zipcode')[0].firstChild.data year=doc.getElementsByTagName('yearBuilt')[0].firstChild.data use=doc.getElementsByTagName('useCode')[0].firstChild.data bath=doc.getElementsByTagName('bathrooms')[0].firstChild.data year=doc.getElementsByTagName('yearBuilt')[0].firstChild.data bath=doc.getElementsByTagName('bathrooms')[0].firstChild.data bed=doc.getElementsByTagName('bedrooms')[0].firstChild.data bed=doc.getElementsByTagName('bedrooms')[0].firstChild.data rooms=doc.getElementsByTagName('totalRooms')[0].firstChild.data rooms=doc.getElementsByTagName('totalRooms')[0].firstChild.data price=doc.getElementsByTagName('amount')[0].firstChild.data price=doc.getElementsByTagName('amount')[0].firstChild.data except: return None return (zipcode,use,int(year),float(bath),int(bed),int(rooms),price)
  20. 20. A home price dataset House Zip Bathrooms Bedrooms Built Type Price A 02138 1.5 2 1847 Single 505296 B 02139 3.5 9 1916 Triplex 776378 C 02140 3.5 4 1894 Duplex 595027 D 02139 2.5 4 1854 Duplex 552213 E 02138 3.5 5 1909 Duplex 947528 F 02138 3.5 4 1930 Single 2107871 etc..
  21. 21. What can we learn? A made-up houses price How important is Zip Code? What are the important attributes? Can we do better than averages?
  22. 22. Introducing Regression Trees A B Value 10 Circle 20 11 Square 22 22 Square 8 18 Circle 6
  23. 23. Introducing Regression Trees A B Value 10 Circle 20 11 Square 22 22 Square 8 18 Circle 6
  24. 24. Minimizing deviation Standard deviation is the “spread” of results Try all possible divisions Choose the division that decreases deviation the most Initially A B Value Average = 14 10 Circle 20 Standard Deviation = 8.2 11 Square 22 22 Square 8 18 Circle 6
  25. 25. Minimizing deviation Standard deviation is the “spread” of results Try all possible divisions Choose the division that decreases deviation the most B = Circle A B Value Average = 13 10 Circle 20 Standard Deviation = 9.9 11 Square 22 22 Square 8 B = Square 18 Circle 6 Average = 15 Standard Deviation = 9.9
  26. 26. Minimizing deviation Standard deviation is the “spread” of results Try all possible divisions Choose the division that decreases deviation the most A > 18 A B Value Average = 8 10 Circle 20 Standard Deviation = 0 11 Square 22 22 Square 8 A <= 20 18 Circle 6 Average = 16 Standard Deviation = 8.7
  27. 27. Minimizing deviation Standard deviation is the “spread” of results Try all possible divisions Choose the division that decreases deviation the most A > 11 A B Value Average = 7 10 Circle 20 Standard Deviation = 1.4 11 Square 22 22 Square 8 A <= 11 18 Circle 6 Average = 21 Standard Deviation = 1.4
  28. 28. Python Code def variance(rows): if len(rows)==0: return 0 data=[float(row[len(row)-1]) for row in rows] mean=sum(data)/len(data) variance=sum([(d-mean)**2 for d in data])/len(data) return variance def divideset(rows,column,value): # Make a function that tells us if a row is in # the first group (true) or the second group (false) split_function=None if isinstance(value,int) or isinstance(value,float): split_function=lambda row:row[column]>=value else: split_function=lambda row:row[column]==value # Divide the rows into two sets and return them set1=[row for row in rows if split_function(row)] set2=[row for row in rows if not split_function(row)] return (set1,set2)
  29. 29. Python Code def variance(rows): def variance(rows): if len(rows)==0: return 0 if len(rows)==0: return for row in rows] data=[float(row[len(row)-1]) 0 data=[float(row[len(row)-1]) for row in rows] mean=sum(data)/len(data) mean=sum(data)/len(data)d in data])/len(data) variance=sum([(d-mean)**2 for return variance variance=sum([(d-mean)**2 for d in data])/len(data) return variance def divideset(rows,column,value): # Make a function that tells us if a row is in # the first group (true) or the second group (false) split_function=None if isinstance(value,int) or isinstance(value,float): split_function=lambda row:row[column]>=value else: split_function=lambda row:row[column]==value # Divide the rows into two sets and return them set1=[row for row in rows if split_function(row)] set2=[row for row in rows if not split_function(row)] return (set1,set2)
  30. 30. Python Code def variance(rows): if len(rows)==0: return 0 data=[float(row[len(row)-1]) for row in rows] mean=sum(data)/len(data) variance=sum([(d-mean)**2 for d in data])/len(data) return variance # def divideset(rows,column,value): us if a row is in Make a function that tells # the Make a function (true) or the asecond in # first group that tells us if row is group (false) # the first group (true) or the second group (false) split_function=None split_function=None if isinstance(value,int) or isinstance(value,float): if isinstance(value,int) or isinstance(value,float): split_function=lambda row:row[column]>=value split_function=lambda row:row[column]>=value else: else: split_function=lambda row:row[column]==value split_function=lambda row:row[column]==value # Divide the rows into two sets and return them set1=[row for row in rows if split_function(row)] set2=[row for row in rows if not split_function(row)] return (set1,set2)
  31. 31. Python Code def variance(rows): if len(rows)==0: return 0 data=[float(row[len(row)-1]) for row in rows] mean=sum(data)/len(data) variance=sum([(d-mean)**2 for d in data])/len(data) return variance def divideset(rows,column,value): # Make a function that tells us if a row is in # the first group (true) or the second group (false) split_function=None if isinstance(value,int) or isinstance(value,float): split_function=lambda row:row[column]>=value else: split_function=lambda row:row[column]==value # Divide the rows into two sets and returnreturn them # Divide the rows into two sets and them set1=[row for row in rows if split_function(row)] set1=[row for row in rows if not split_function(row)] in rows if split_function(row)] set2=[row for row set2=[row(set1,set2) in rows if not split_function(row)] for row return return (set1,set2)
  32. 32. CART Algoritm A B Value 10 Circle 20 11 Square 22 22 Square 8 18 Circle 6
  33. 33. CART Algoritm A B Value 10 Circle 20 11 Square 22 22 Square 8 18 Circle 6
  34. 34. CART Algoritm 22 Square 8 10 Circle 20 18 Circle 6 11 Square 22
  35. 35. CART Algoritm
  36. 36. Python Code def buildtree(rows,scoref=variance): if len(rows)==0: return decisionnode() current_score=scoref(rows) # Set up some variables to track the best criteria best_gain=0.0 best_criteria=None best_sets=None column_count=len(rows[0])-1 for col in range(0,column_count): # Generate the list of different values in # this column column_values={} for row in rows: column_values[row[col]]=1 # Now try dividing the rows up for each value # in this column for value in column_values.keys(): (set1,set2)=divideset(rows,col,value) # Information gain p=float(len(set1))/len(rows) gain=current_score-p*scoref(set1)-(1-p)*scoref(set2) if gain>best_gain and len(set1)>0 and len(set2)>0: best_gain=gain best_criteria=(col,value) best_sets=(set1,set2) # Create the sub branches if best_gain>0: trueBranch=buildtree(best_sets[0]) falseBranch=buildtree(best_sets[1]) return decisionnode(col=best_criteria[0],value=best_criteria[1],tb=trueBranch,fb=falseBranch) else: return decisionnode(results=uniquecounts(rows))
  37. 37. Python Code def buildtree(rows,scoref=variance): def buildtree(rows,scoref=variance): if len(rows)==0: return decisionnode() if len(rows)==0: return decisionnode() current_score=scoref(rows) current_score=scoref(rows) criteria # Set up some variables to track the best #best_gain=0.0some variables to track the best criteria Set up best_criteria=None best_gain=0.0 best_sets=None column_count=len(rows[0])-1 best_criteria=None for col in range(0,column_count): best_sets=None of different values in # Generate the list # this column column_count=len(rows[0])-1 column_values={} for row in rows: column_values[row[col]]=1 # Now try dividing the rows up for each value # in this column for value in column_values.keys(): (set1,set2)=divideset(rows,col,value) # Information gain p=float(len(set1))/len(rows) gain=current_score-p*scoref(set1)-(1-p)*scoref(set2) if gain>best_gain and len(set1)>0 and len(set2)>0: best_gain=gain best_criteria=(col,value) best_sets=(set1,set2) # Create the sub branches if best_gain>0: trueBranch=buildtree(best_sets[0]) falseBranch=buildtree(best_sets[1]) return decisionnode(col=best_criteria[0],value=best_criteria[1],tb=trueBranch,fb=falseBranch) else: return decisionnode(results=uniquecounts(rows))
  38. 38. Python Code def buildtree(rows,scoref=variance): if len(rows)==0: return decisionnode() current_score=scoref(rows) # Set up some variables to track the best criteria best_gain=0.0 best_criteria=None best_sets=None column_count=len(rows[0])-1 for col in range(0,column_count): # Generate the list of different values in # this column column_values={} for row in rows: column_values[row[col]]=1 for try dividing the rows up for each value # Now value in column_values.keys(): # in this column (set1,set2)=divideset(rows,col,value) for value in column_values.keys(): # Information gain (set1,set2)=divideset(rows,col,value) # Information gain p=float(len(set1))/len(rows) p=float(len(set1))/len(rows) gain=current_score-p*scoref(set1)-(1-p)*scoref(set2) gain=current_score-p*scoref(set1)-(1-p)*scoref(set2) if gain>best_gain and len(set1)>0 and len(set2)>0: if gain>best_gain and len(set1)>0 and len(set2)>0: best_gain=gain best_criteria=(col,value) best_gain=gain best_sets=(set1,set2) best_criteria=(col,value) # Create the sub branches if best_gain>0: best_sets=(set1,set2) trueBranch=buildtree(best_sets[0]) falseBranch=buildtree(best_sets[1]) return decisionnode(col=best_criteria[0],value=best_criteria[1],tb=trueBranch,fb=falseBranch) else: return decisionnode(results=uniquecounts(rows))
  39. 39. Python Code def buildtree(rows,scoref=variance): if len(rows)==0: return decisionnode() current_score=scoref(rows) # Set up some variables to track the best criteria best_gain=0.0 best_criteria=None best_sets=None column_count=len(rows[0])-1 for col in range(0,column_count): # Generate the list of different values in # this column column_values={} for row in rows: column_values[row[col]]=1 # Now try dividing the rows up for each value # in this column for value in column_values.keys(): (set1,set2)=divideset(rows,col,value) # Information gain p=float(len(set1))/len(rows) gain=current_score-p*scoref(set1)-(1-p)*scoref(set2) if best_gain>0: and len(set1)>0 and len(set2)>0: if gain>best_gain best_gain=gain trueBranch=buildtree(best_sets[0]) best_criteria=(col,value) best_sets=(set1,set2) falseBranch=buildtree(best_sets[1]) # Create the sub branches if best_gain>0: return decisionnode(col=best_criteria[0],value=best_criteria[1], trueBranch=buildtree(best_sets[0]) tb=trueBranch,fb=falseBranch) falseBranch=buildtree(best_sets[1]) return decisionnode(col=best_criteria[0],value=best_criteria[1],tb=trueBranch,fb=falseBranch) else: else: return decisionnode(results=uniquecounts(rows)) return decisionnode(results=uniquecounts(rows))
  40. 40. Zillow Results Bathrooms > 3 Zip: 02139? After 1903? Zip: 02140? Bedrooms > 4? Duplex? Triplex?
  41. 41. Just for Fun… Hot or Not
  42. 42. Just for Fun… Hot or Not
  43. 43. Supervised and Unsupervised Regression trees are supervised “answers” are in the dataset Tree models predict answers Some methods are unsupervised There are no answers Methods just characterize the data Show interesting patterns
  44. 44. Next challenge - Bloggers Millions of blogs online Usually focus on a subject area Can they be characterized automatically? … using only the words in the posts?
  45. 45. The Technorati Top 100
  46. 46. A single blog
  47. 47. Getting the content Use Mark Pilgrim’s Universal Feed Reader Retrieve the post titles and text Split up the words Count occurrence of each word
  48. 48. Python Code import feedparser import re # Returns title and dictionary of word counts for an RSS feed def getwordcounts(url): # Parse the feed d=feedparser.parse(url) wc={} # Loop over all the entries for e in d.entries: if 'summary' in e: summary=e.summary else: summary=e.description # Extract a list of words words=getwords(e.title+' '+summary) for word in words: wc.setdefault(word,0) wc[word]+=1 return d.feed.title,wc def getwords(html): # Remove all the HTML tags txt=re.compile(r'<[^>]+>').sub('',html) # Split words by all non-alpha characters words=re.compile(r'[^A-Z^a-z]+').split(txt) # Convert to lowercase return [word.lower() for word in words if word!='']
  49. 49. Python Code import feedparser import re # Returns title and dictionary of word counts for an RSS feed def getwordcounts(url): # Parse the feed d=feedparser.parse(url) wc={} for e in d.entries: # Loop over all the entries if 'summary' in e: summary=e.summary for e in d.entries: else: summary=e.description if 'summary' in e: summary=e.summary else: summary=e.description words # Extract a list of # Extract a list of words words=getwords(e.title+' '+summary) words=getwords(e.title+' '+summary) for word in words: for word in words: wc.setdefault(word,0) wc.setdefault(word,0) wc[word]+=1 wc[word]+=1 return d.feed.title,wc def getwords(html): # Remove all the HTML tags txt=re.compile(r'<[^>]+>').sub('',html) # Split words by all non-alpha characters words=re.compile(r'[^A-Z^a-z]+').split(txt) # Convert to lowercase return [word.lower() for word in words if word!='']
  50. 50. Python Code import feedparser import re # Returns title and dictionary of word counts for an RSS feed def getwordcounts(url): # Parse the feed d=feedparser.parse(url) wc={} # Loop over all the entries for e in d.entries: if 'summary' in e: summary=e.summary else: summary=e.description # Extract a list of words words=getwords(e.title+' '+summary) for word in words: wc.setdefault(word,0) wc[word]+=1 return d.feed.title,wc def getwords(html): # Remove def getwords(html): all the HTML tags # Remove all the HTML tags txt=re.compile(r'<[^>]+>').sub('',html) txt=re.compile(r'<[^>]+>').sub('',html) # Split words bywords by all non-alpha characters # Split all non-alpha characters words=re.compile(r'[^A-Z^a-z]+').split(txt) words=re.compile(r'[^A-Z^a-z]+').split(txt) # Convert # Convert to lowercase to lowercase return [word.lower() for word in words if word!=''] return [word.lower() for word in words if word!='']
  51. 51. Building a Word Matrix Build a matrix of word counts Blogs are rows, words are columns Eliminate words that are: Too common Too rare
  52. 52. Python Code apcount={} wordcounts={} for feedurl in file('feedlist.txt'): title,wc=getwordcounts(feedurl) wordcounts[title]=wc for word,count in wc.items(): apcount.setdefault(word,0) if count>1: apcount[word]+=1 wordlist=[] for w,bc in apcount.items(): frac=float(bc)/len(feedlist) if frac>0.1 and frac<0.5: wordlist.append(w) out=file('blogdata.txt','w') out.write('Blog') for word in wordlist: out.write('t%s' % word) out.write('n') for blog,wc in wordcounts.items(): out.write(blog) for word in wordlist: if word in wc: out.write('t%d' % wc[word]) else: out.write('t0') out.write('n')
  53. 53. Python Code apcount={} wordcounts={} for feedurlinin file('feedlist.txt'): for feedurl file('feedlist.txt'): title,wc=getwordcounts(feedurl) title,wc=getwordcounts(feedurl) wordcounts[title]=wc wordcounts[title]=wc for word,count in wc.items(): forapcount.setdefault(word,0) word,count in wc.items(): apcount.setdefault(word,0) if count>1: if apcount[word]+=1 count>1: apcount[word]+=1 wordlist=[] for w,bc in apcount.items(): frac=float(bc)/len(feedlist) if frac>0.1 and frac<0.5: wordlist.append(w) out=file('blogdata.txt','w') out.write('Blog') for word in wordlist: out.write('t%s' % word) out.write('n') for blog,wc in wordcounts.items(): out.write(blog) for word in wordlist: if word in wc: out.write('t%d' % wc[word]) else: out.write('t0') out.write('n')
  54. 54. Python Code apcount={} wordcounts={} for feedurl in file('feedlist.txt'): title,wc=getwordcounts(feedurl) wordcounts[title]=wc for word,count in wc.items(): apcount.setdefault(word,0) if count>1: apcount[word]+=1 wordlist=[] wordlist=[] for w,bc in apcount.items(): for w,bc in apcount.items(): frac=float(bc)/len(feedlist) frac=float(bc)/len(feedlist) if frac>0.1 and frac<0.5: wordlist.append(w) if frac>0.1 and frac<0.5: wordlist.append(w) out=file('blogdata.txt','w') out.write('Blog') for word in wordlist: out.write('t%s' % word) out.write('n') for blog,wc in wordcounts.items(): out.write(blog) for word in wordlist: if word in wc: out.write('t%d' % wc[word]) else: out.write('t0') out.write('n')
  55. 55. Python Code apcount={} wordcounts={} for feedurl in file('feedlist.txt'): title,wc=getwordcounts(feedurl) wordcounts[title]=wc for word,count in wc.items(): apcount.setdefault(word,0) if count>1: apcount[word]+=1 wordlist=[] for w,bc in apcount.items(): frac=float(bc)/len(feedlist) if frac>0.1 and frac<0.5: wordlist.append(w) out=file('blogdata.txt','w') out.write('Blog') out=file('blogdata.txt','w') for word in wordlist: out.write('t%s' % word) out.write('Blog') for word in wordlist: out.write('t%s' % word) out.write('n') out.write('n') for blog,wcinin wordcounts.items(): for blog,wc wordcounts.items(): out.write(blog) out.write(blog) for wordin wordlist: for word in wordlist: if word in wc: out.write('t%d' % wc[word]) if word in wc: out.write('t%d' % wc[word]) else: out.write('t0') else: out.write('t0') out.write('n') out.write('n')
  56. 56. The Word Matrix “china” “kids” “music” “yahoo” Gothamist 0 3 3 0 GigaOM 6 0 1 2 Quick Online Tips 0 2 2 12
  57. 57. Determining distance “china” “kids” “music” “yahoo” Gothamist 0 3 3 0 GigaOM 6 0 1 2 Quick Online Tips 0 2 2 12 Euclidean “as the crow flies” (6 − 0) 2 + (0 − 2) 2 + (1 − 2) 2 + (2 − 12) 2 = 12 (approx)
  58. 58. Other Distance Metrics Manhattan Tanamoto Pearson Correlation Chebychev Spearman
  59. 59. Hierarchical Clustering Find the two closest item Combine them into a single item Repeat…
  60. 60. Hierarchical Algorithm
  61. 61. Hierarchical Algorithm
  62. 62. Hierarchical Algorithm
  63. 63. Hierarchical Algorithm
  64. 64. Hierarchical Algorithm
  65. 65. Dendrogram
  66. 66. Python Code class bicluster: def __init__(self,vec,left=None,right=None,distance=0.0,id=None): self.left=left self.right=right self.vec=vec self.id=id self.distance=distance
  67. 67. Python Code def hcluster(rows,distance=pearson): distances={} currentclustid=-1 # Clusters are initially just the rows clust=[bicluster(rows[i],id=i) for i in range(len(rows))] while len(clust)>1: lowestpair=(0,1) closest=distance(clust[0].vec,clust[1].vec) # loop through every pair looking for the smallest distance for i in range(len(clust)): for j in range(i+1,len(clust)): # distances is the cache of distance calculations if (clust[i].id,clust[j].id) not in distances: distances[(clust[i].id,clust[j].id)]=distance(clust[i].vec,clust[j].vec) d=distances[(clust[i].id,clust[j].id)] if d<closest: closest=d lowestpair=(i,j) # calculate the average of the two clusters mergevec=[ (clust[lowestpair[0]].vec[i]+clust[lowestpair[1]].vec[i])/2.0 for i in range(len(clust[0].vec))] # create the new cluster newcluster=bicluster(mergevec,left=clust[lowestpair[0]], right=clust[lowestpair[1]], distance=closest,id=currentclustid) # cluster ids that weren’t in the original set are negative currentclustid-=1 del clust[lowestpair[1]] del clust[lowestpair[0]] clust.append(newcluster) return clust[0]
  68. 68. Python Code def hcluster(rows,distance=pearson): distances={} distances={} currentclustid=-1 currentclustid=-1 # Clusters are initially just the rows clust=[bicluster(rows[i],id=i) for i in range(len(rows))] # Clusters are initially just the rows while len(clust)>1: lowestpair=(0,1) clust=[bicluster(rows[i],id=i) for i in range(len(rows))] closest=distance(clust[0].vec,clust[1].vec) # loop through every pair looking for the smallest distance for i in range(len(clust)): for j in range(i+1,len(clust)): # distances is the cache of distance calculations if (clust[i].id,clust[j].id) not in distances: distances[(clust[i].id,clust[j].id)]=distance(clust[i].vec,clust[j].vec) d=distances[(clust[i].id,clust[j].id)] if d<closest: closest=d lowestpair=(i,j) # calculate the average of the two clusters mergevec=[ (clust[lowestpair[0]].vec[i]+clust[lowestpair[1]].vec[i])/2.0 for i in range(len(clust[0].vec))] # create the new cluster newcluster=bicluster(mergevec,left=clust[lowestpair[0]], right=clust[lowestpair[1]], distance=closest,id=currentclustid) # cluster ids that weren’t in the original set are negative currentclustid-=1 del clust[lowestpair[1]] del clust[lowestpair[0]] clust.append(newcluster) return clust[0]
  69. 69. Python Code def hcluster(rows,distance=pearson): distances={} while len(clust)>1: currentclustid=-1 # Clusters are initially just the rows lowestpair=(0,1) clust=[bicluster(rows[i],id=i) for i in range(len(rows))] closest=distance(clust[0].vec,clust[1].vec) while len(clust)>1: lowestpair=(0,1) # loop closest=distance(clust[0].vec,clust[1].vec) for the smallest distance through every pair looking for i inloopin range(len(clust)): # range(len(clust)): for the smallest distance through every pair looking for i for j for j range(i+1,len(clust)): in in range(i+1,len(clust)): # distances is the cache of distance calculations # distances is the cache of distances: if (clust[i].id,clust[j].id) not in distance calculations if (clust[i].id,clust[j].id) not in distances: distances[(clust[i].id,clust[j].id)]=distance(clust[i].vec,clust[j].vec) d=distances[(clust[i].id,clust[j].id)] distances[(clust[i].id,clust[j].id)]= if d<closest: closest=d distance(clust[i].vec,clust[j].vec) lowestpair=(i,j) d=distances[(clust[i].id,clust[j].id)] # calculate the average of the two clusters mergevec=[ if (clust[lowestpair[0]].vec[i]+clust[lowestpair[1]].vec[i])/2.0 d<closest: for i in range(len(clust[0].vec))] closest=d # create the new cluster lowestpair=(i,j) newcluster=bicluster(mergevec,left=clust[lowestpair[0]], right=clust[lowestpair[1]], distance=closest,id=currentclustid) # cluster ids that weren’t in the original set are negative currentclustid-=1 del clust[lowestpair[1]] del clust[lowestpair[0]] clust.append(newcluster) return clust[0]
  70. 70. Python Code def hcluster(rows,distance=pearson): distances={} currentclustid=-1 # Clusters are initially just the rows clust=[bicluster(rows[i],id=i) for i in range(len(rows))] while len(clust)>1: lowestpair=(0,1) closest=distance(clust[0].vec,clust[1].vec) # loop through every pair looking for the smallest distance for i in range(len(clust)): for j in range(i+1,len(clust)): # distances is the cache of distance calculations if (clust[i].id,clust[j].id) not in distances: # calculate distances[(clust[i].id,clust[j].id)]=distance(clust[i].vec,clust[j].vec) the average of the two clusters d=distances[(clust[i].id,clust[j].id)] mergevec=[ if d<closest: closest=d (clust[lowestpair[0]].vec[i]+clust[lowestpair[1]].vec[i])/2.0 lowestpair=(i,j) #in range(len(clust[0].vec)) calculate the average of the two clusters for i mergevec=[ ] (clust[lowestpair[0]].vec[i]+clust[lowestpair[1]].vec[i])/2.0 # create for i in range(len(clust[0].vec))] #the new new cluster create the cluster newcluster=bicluster(mergevec,left=clust[lowestpair[0]], newcluster=bicluster(mergevec,left=clust[lowestpair[0]], right=clust[lowestpair[1]], right=clust[lowestpair[1]], distance=closest,id=currentclustid) # cluster ids that weren’t in the original set are negative distance=closest,id=currentclustid) currentclustid-=1 del clust[lowestpair[1]] del clust[lowestpair[1]] del clust[lowestpair[0]] del clust[lowestpair[0]] clust.append(newcluster) clust.append(newcluster) return clust[0]
  71. 71. Hierarchical Blog Clusters
  72. 72. Hierarchical Blog Clusters
  73. 73. Hierarchical Blog Clusters
  74. 74. Rotating the Matrix Words in a blog -> blogs containing each word Gothamist GigaOM Quick Onl china 0 6 0 kids 3 0 2 music 3 1 2 Yahoo 0 2 12
  75. 75. Hierarchical Word Clusters
  76. 76. K-Means Clustering Divides data into distinct clusters User determines how many Algorithm Start with arbitrary centroids Assign points to centroids Move the centroids Repeat
  77. 77. K-Means Algorithm
  78. 78. K-Means Algorithm
  79. 79. K-Means Algorithm
  80. 80. K-Means Algorithm
  81. 81. K-Means Algorithm
  82. 82. Python Code import random def kcluster(rows,distance=pearson,k=4): # Determine the minimum and maximum values for each point ranges=[(min([row[i] for row in rows]),max([row[i] for row in rows])) for i in range(len(rows[0]))] # Create k randomly placed centroids clusters=[[random.random()*(ranges[i][1]-ranges[i][0])+ranges[i][0] for i in range(len(rows[0]))] for j in range(k)] lastmatches=None for t in range(100): print 'Iteration %d' % t bestmatches=[[] for i in range(k)] # Find which centroid is the closest for each row for j in range(len(rows)): row=rows[j] bestmatch=0 for i in range(k): d=distance(clusters[i],row) if d<distance(clusters[bestmatch],row): bestmatch=i bestmatches[bestmatch].append(j) # If the results are the same as last time, this is complete if bestmatches==lastmatches: break lastmatches=bestmatches # Move the centroids to the average of their members for i in range(k): avgs=[0.0]*len(rows[0]) if len(bestmatches[i])>0: for rowid in bestmatches[i]: for m in range(len(rows[rowid])): avgs[m]+=rows[rowid][m] for j in range(len(avgs)): avgs[j]/=len(bestmatches[i]) clusters[i]=avgs return bestmatches
  83. 83. Python Code import random def kcluster(rows,distance=pearson,k=4): # Determine the minimum and maximum values for each point # Determine the minimum and maximum values for each point ranges=[(min([row[i] for row in rows]),max([row[i] for row in rows])) ranges=[(min([row[i] for row in rows]), for i in range(len(rows[0]))] # Create k randomly placed centroids max([row[i] for row in rows])) clusters=[[random.random()*(ranges[i][1]-ranges[i][0])+ranges[i][0] for i in range(len(rows[0]))] for j in range(k)] for i in range(len(rows[0]))] lastmatches=None for t in range(100): # Create k randomly placed centroids print 'Iteration %d' % t bestmatches=[[] for i in range(k)] clusters=[[random.random()* # Find which centroid is the closest for each row for j(ranges[i][1]-ranges[i][0])+ranges[i][0] in range(len(rows)): row=rows[j] for i in range(len(rows[0]))] bestmatch=0 for i in range(k): for j in range(k)] d=distance(clusters[i],row) if d<distance(clusters[bestmatch],row): bestmatch=i bestmatches[bestmatch].append(j) # If the results are the same as last time, this is complete if bestmatches==lastmatches: break lastmatches=bestmatches # Move the centroids to the average of their members for i in range(k): avgs=[0.0]*len(rows[0]) if len(bestmatches[i])>0: for rowid in bestmatches[i]: for m in range(len(rows[rowid])): avgs[m]+=rows[rowid][m] for j in range(len(avgs)): avgs[j]/=len(bestmatches[i]) clusters[i]=avgs return bestmatches
  84. 84. Python Code import random def kcluster(rows,distance=pearson,k=4): # Determine the minimum and maximum values for each point ranges=[(min([row[i] for row in rows]),max([row[i] for row in rows])) for i in range(len(rows[0]))] # Create k randomly placed centroids for t in range(100): clusters=[[random.random()*(ranges[i][1]-ranges[i][0])+ranges[i][0] for i in range(len(rows[0]))] for j in range(k)] bestmatches=[[] for i in range(k)] lastmatches=None for t in range(100): # Find which centroid is the closest for each row print 'Iteration %d' % t bestmatches=[[] for i in range(k)] for j in range(len(rows)): # Find which centroid is the closest for each row row=rows[j] for j in range(len(rows)): row=rows[j] bestmatch=0 bestmatch=0 for for iin range(k): i in range(k): d=distance(clusters[i],row) d=distance(clusters[i],row) if d<distance(clusters[bestmatch],row): bestmatch=i bestmatches[bestmatch].append(j) if d<distance(clusters[bestmatch],row): bestmatch=i # If the results are the same as last time, this is complete if bestmatches==lastmatches: break bestmatches[bestmatch].append(j) lastmatches=bestmatches # Move the centroids to the average of their members for i in range(k): avgs=[0.0]*len(rows[0]) if len(bestmatches[i])>0: for rowid in bestmatches[i]: for m in range(len(rows[rowid])): avgs[m]+=rows[rowid][m] for j in range(len(avgs)): avgs[j]/=len(bestmatches[i]) clusters[i]=avgs return bestmatches
  85. 85. Python Code import random def kcluster(rows,distance=pearson,k=4): # Determine the minimum and maximum values for each point ranges=[(min([row[i] for row in rows]),max([row[i] for row in rows])) for i in range(len(rows[0]))] # Create k randomly placed centroids clusters=[[random.random()*(ranges[i][1]-ranges[i][0])+ranges[i][0] for i in range(len(rows[0]))] for j in range(k)] lastmatches=None for t in range(100): print 'Iteration %d' % t bestmatches=[[] for i in range(k)] # Find which centroid is the closest for each row for j in range(len(rows)): row=rows[j] bestmatch=0 for i in range(k): d=distance(clusters[i],row) if d<distance(clusters[bestmatch],row): bestmatch=i # If the results are the same as last time, this is complete bestmatches[bestmatch].append(j) # If the results are the same as last time, this is complete if bestmatches==lastmatches: break if bestmatches==lastmatches: break lastmatches=bestmatches lastmatches=bestmatches # Move the centroids to the average of their members for i in range(k): avgs=[0.0]*len(rows[0]) if len(bestmatches[i])>0: for rowid in bestmatches[i]: for m in range(len(rows[rowid])): avgs[m]+=rows[rowid][m] for j in range(len(avgs)): avgs[j]/=len(bestmatches[i]) clusters[i]=avgs return bestmatches
  86. 86. Python Code import random def kcluster(rows,distance=pearson,k=4): # Determine the minimum and maximum values for each point ranges=[(min([row[i] for row in rows]),max([row[i] for row in rows])) for i in range(len(rows[0]))] # Create k randomly placed centroids clusters=[[random.random()*(ranges[i][1]-ranges[i][0])+ranges[i][0] for i in range(len(rows[0]))] for j in range(k)] lastmatches=None for t in range(100): print 'Iteration %d' % t bestmatches=[[] for i in range(k)] # Find which centroid is the closest for each row for j in range(len(rows)): row=rows[j] bestmatch=0 for i in range(k): # Move the centroids to the average of their members d=distance(clusters[i],row) if d<distance(clusters[bestmatch],row): bestmatch=i for i in range(k): bestmatches[bestmatch].append(j) # If the results are the same as last time, this is complete avgs=[0.0]*len(rows[0]) if bestmatches==lastmatches: break lastmatches=bestmatches if len(bestmatches[i])>0: # Move the centroids toin average of their members for rowid the bestmatches[i]: for i in range(k): avgs=[0.0]*len(rows[0])range(len(rows[rowid])): for m in if len(bestmatches[i])>0: avgs[m]+=rows[rowid][m] for rowid in bestmatches[i]: for m in range(len(rows[rowid])): for j in range(len(avgs)): avgs[m]+=rows[rowid][m] for j in range(len(avgs)): avgs[j]/=len(bestmatches[i]) avgs[j]/=len(bestmatches[i]) clusters[i]=avgs clusters[i]=avgs return bestmatches
  87. 87. K-Means Results >> [rownames[r] for r in k[0]] ['The Viral Garden', 'Copyblogger', 'Creating Passionate Users', 'Oilman', 'ProBlogger Blog Tips', quot;Seth's Blogquot;] >> [rownames[r] for r in k[1]] ['Wonkette', 'Gawker', 'Gothamist', 'Huffington Post']
  88. 88. 2D Visualizations Instead of Clusters, a 2D Map Goals Preserve distances as much as possible Draw in two dimensions Dimension Reduction Principal Components Analysis Multidimensional Scaling
  89. 89. Multidimensional Scaling
  90. 90. Multidimensional Scaling
  91. 91. Multidimensional Scaling
  92. 92. def scaledown(data,distance=pearson,rate=0.01): n=len(data) # The real distances between every pair of items realdist=[[distance(data[i],data[j]) for j in range(n)] for i in range(0,n)] outersum=0.0 # Randomly initialize the starting points of the locations in 2D loc=[[random.random(),random.random()] for i in range(n)] fakedist=[[0.0 for j in range(n)] for i in range(n)] lasterror=None for m in range(0,1000): # Find projected distances for i in range(n): for j in range(n): fakedist[i][j]=sqrt(sum([pow(loc[i][x]-loc[j][x],2) for x in range(len(loc[i]))])) # Move points grad=[[0.0,0.0] for i in range(n)] totalerror=0 for k in range(n): for j in range(n): if j==k: continue # The error is percent difference between the distances errorterm=(fakedist[j][k]-realdist[j][k])/realdist[j][k] # Each point needs to be moved away from or towards the other # point in proportion to how much error it has grad[k][0]+=((loc[k][0]-loc[j][0])/fakedist[j][k])*errorterm grad[k][1]+=((loc[k][1]-loc[j][1])/fakedist[j][k])*errorterm # Keep track of the total error totalerror+=abs(errorterm) print totalerror # If the answer got worse by moving the points, we are done if lasterror and lasterror<totalerror: break lasterror=totalerror # Move each of the points by the learning rate times the gradient for k in range(n): loc[k][0]-=rate*grad[k][0] loc[k][1]-=rate*grad[k][1] return loc
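A minimal sketch of how scaledown might be driven, assuming the dataset and the pearson distance from the earlier clustering code are already loaded; labels is a hypothetical list of row names, and the imports are the ones scaledown itself needs:

import random
from math import sqrt

# data: list of numeric rows, labels: matching list of names (both assumed here)
loc=scaledown(data)
for i in range(len(data)):
  print labels[i],loc[i][0],loc[i][1]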
Numerical Predictions

 Back to “supervised” learning
 We have a set of numerical attributes
   Specs for a laptop
   Age and rating for wine
   Ratios for a stock
 Want to predict another attribute
   Formula/model is unknown
   e.g. price
Regression Trees?

 Regression trees find hard boundaries
 Can’t deal with complex formulae
Statistical regression

 Requires specification of a model
   Usually linear
 Doesn’t handle context
Alternative - Interpolation

 Find “similar” items
 Guess price based on similar items
 Need to determine:
   What is similar?
   How should we aggregate prices?
Price Data from eBay
The eBay API

 XML API
 Send XML over HTTPS
 Receive results in XML
 http://developer.ebay.com/quickstartguide.
Some Python Code

import httplib

def getHeaders(apicall,siteID="0",compatabilityLevel="433"):
  headers = {"X-EBAY-API-COMPATIBILITY-LEVEL": compatabilityLevel,
             "X-EBAY-API-DEV-NAME": devKey,
             "X-EBAY-API-APP-NAME": appKey,
             "X-EBAY-API-CERT-NAME": certKey,
             "X-EBAY-API-CALL-NAME": apicall,
             "X-EBAY-API-SITEID": siteID,
             "Content-Type": "text/xml"}
  return headers

def sendRequest(apicall,xmlparameters):
  connection = httplib.HTTPSConnection(serverUrl)
  connection.request("POST", '/ws/api.dll', xmlparameters, getHeaders(apicall))
  response = connection.getresponse()
  if response.status != 200:
    print "Error sending request:" + response.reason
    data = None
  else:
    data = response.read()
  connection.close()
  return data
Some Python Code

from xml.dom.minidom import parseString

def getItem(itemID):
  xml = "<?xml version='1.0' encoding='utf-8'?>"+\
        "<GetItemRequest xmlns=\"urn:ebay:apis:eBLBaseComponents\">"+\
        "<RequesterCredentials><eBayAuthToken>" + userToken + "</eBayAuthToken></RequesterCredentials>"+\
        "<ItemID>" + str(itemID) + "</ItemID>"+\
        "<DetailLevel>ItemReturnAttributes</DetailLevel>"+\
        "</GetItemRequest>"
  data=sendRequest('GetItem',xml)
  result={}
  response=parseString(data)
  result['title']=getSingleValue(response,'Title')
  sellingStatusNode = response.getElementsByTagName('SellingStatus')[0]
  result['price']=getSingleValue(sellingStatusNode,'CurrentPrice')
  result['bids']=getSingleValue(sellingStatusNode,'BidCount')
  seller = response.getElementsByTagName('Seller')
  result['feedback'] = getSingleValue(seller[0],'FeedbackScore')
  attributeSet=response.getElementsByTagName('Attribute')
  attributes={}
  for att in attributeSet:
    attID=att.attributes.getNamedItem('attributeID').nodeValue
    attValue=getSingleValue(att,'ValueLiteral')
    attributes[attID]=attValue
  result['attributes']=attributes
  return result
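getItem relies on a getSingleValue helper that never appears on the slides, along with developer credentials (devKey, appKey, certKey, userToken, serverUrl) supplied by the eBay developer program. A plausible minimal version of the helper, assuming the xml.dom.minidom nodes used above:

def getSingleValue(node,tag):
  # Return the text content of the first child element named tag,
  # or '-1' if the element is missing or empty (a guess at the
  # original helper's behaviour)
  nl=node.getElementsByTagName(tag)
  if len(nl)>0:
    tagNode=nl[0]
    if tagNode.hasChildNodes():
      return tagNode.firstChild.nodeValue
  return '-1'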
Building an item table

            RAM   CPU    HDD   Screen  DVD   Price
 D600       512   1400   40    14      1     $350
 Lenovo     160   300    5     13      0     $80
 T22        256   900    20    14      1     $200
 Pavillion  1024  1600   120   17      1     $800
 etc..
Distance between items

        RAM   CPU    HDD   Screen  DVD   Price
 New    512   1400   40    14      1     ???
 T22    256   900    20    14      1     $200

 Euclidean, just like in clustering

 √((512 − 256)² + (1400 − 900)² + (40 − 20)² + (14 − 14)² + (1 − 1)²)
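The distance here is ordinary Euclidean distance over the attribute columns. A small sketch of the euclidean helper that the later kNN code assumes (the slides never define it, so the name and shape are an assumption):

from math import sqrt

def euclidean(v1,v2):
  # Sum of squared differences over every attribute, then the square root
  d=0.0
  for i in range(len(v1)):
    d+=(v1[i]-v2[i])**2
  return sqrt(d)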
Idea 1 – use the closest item

 With the item whose price I want to guess:
   Calculate the distance for every item in my dataset
   Guess that the price is the same as the closest
 This is called kNN with k=1
Problems with “outliers”

 The closest item may be anomalous
 Why?
   Exceptional deal that won’t occur again
   Something missing from the dataset
   Data errors
Using an average

        RAM   CPU    HDD   Screen  DVD   Price
 New    512   1400   40    14      1     ???
 No. 1  512   1400   30    13      1     $360
 No. 2  512   1400   60    14      1     $400
 No. 3  1024  1600   120   15      0     $325

 k=3, estimate = $361
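A plain average-of-neighbours estimator along these lines might look like the sketch below; knnestimate is a hypothetical name, it reuses the getdistances helper shown two slides later, and it assumes rows are dicts with an 'input' vector and a 'result' price as in the later code:

def knnestimate(data,vec1,k=3):
  # Sort the dataset by distance to vec1, then average the prices
  # of the k closest items
  dlist=getdistances(data,vec1)
  avg=0.0
  for i in range(k):
    idx=dlist[i][1]
    avg+=data[idx]['result']
  return avg/k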
Using a weighted average

        RAM   CPU    HDD   Screen  DVD   Price  Weight
 New    512   1400   40    14      1     ???
 No. 1  512   1400   30    13      1     $360   3
 No. 2  512   1400   60    14      1     $400   2
 No. 3  1024  1600   120   15      0     $325   1

 Estimate = $367
Python code

def getdistances(data,vec1):
  distancelist=[]
  for i in range(len(data)):
    vec2=data[i]['input']
    distancelist.append((euclidean(vec1,vec2),i))
  distancelist.sort()
  return distancelist

def weightedknn(data,vec1,k=5,weightf=gaussian):
  # Get distances
  dlist=getdistances(data,vec1)
  avg=0.0
  totalweight=0.0

  # Get weighted average
  for i in range(k):
    dist=dlist[i][0]
    idx=dlist[i][1]
    weight=weightf(dist)
    avg+=weight*data[idx]['result']
    totalweight+=weight
  avg=avg/totalweight
  return avg
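weightedknn defaults to a gaussian weight function that is not shown on the slides. A reasonable stand-in gives weight 1.0 at distance zero and decays smoothly as the distance grows; sigma here is an arbitrary width parameter, not a value from the talk:

from math import exp

def gaussian(dist,sigma=10.0):
  # Weight falls off smoothly with distance but never reaches zero,
  # so every neighbour still contributes a little
  return exp(-dist**2/(2*sigma**2))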
Too few – k too low

Too many – k too high
Determining the best k

 Divide the dataset up
   Training set
   Test set
 Guess the prices for the test set using the training set
 See how good the guesses are for different values of k
 Known as “cross-validation”
Determining the best k

 Test set:
   Attribute  Price
   10         20

 Training set:
   Attribute  Price
   11         30
   8          10
   6          0

 For k = 1, guess = 30, error = 10
 For k = 2, guess = 20, error = 0
 For k = 3, guess = 13, error = 7

 Repeat with different test sets, average the error
Python code

from random import random

def dividedata(data,test=0.05):
  trainset=[]
  testset=[]
  for row in data:
    if random()<test:
      testset.append(row)
    else:
      trainset.append(row)
  return trainset,testset

def testalgorithm(algf,trainset,testset):
  error=0.0
  for row in testset:
    guess=algf(trainset,row['input'])
    error+=(row['result']-guess)**2
  return error/len(testset)

def crossvalidate(algf,data,trials=100,test=0.05):
  error=0.0
  for i in range(trials):
    trainset,testset=dividedata(data,test)
    error+=testalgorithm(algf,trainset,testset)
  return error/trials
Problems with scale

Scaling the data

Scaling to zero
Determining the best scale

 Try different weights
 Use the “cross-validation” method
 Different ways of choosing a scale:
   Range-scaling
   Intuitive guessing
   Optimization
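One way to try a candidate scaling is to multiply every attribute by a per-column factor before cross-validating. This is a sketch under the same data format as the kNN code; rescale and the example factors are my own naming, not from the slides:

def rescale(data,scale):
  # Multiply each attribute by its factor; a factor of 0 drops the
  # attribute from the distance calculation entirely
  scaleddata=[]
  for row in data:
    scaled=[scale[i]*row['input'][i] for i in range(len(scale))]
    scaleddata.append({'input':scaled,'result':row['result']})
  return scaleddata

# Compare candidate scalings with the cross-validation code above
# (the factors here are arbitrary placeholders)
print crossvalidate(weightedknn,rescale(data,[1.0,0.01,0.5,2.0,10.0]))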
Methods covered

 Regression trees
 Hierarchical clustering
 k-means clustering
 Multidimensional scaling
 Weighted k-nearest neighbors
New projects

 Openads
   An open-source ad server
   Users can share impression/click data
 Matrix of what hits based on
   Page text
   Ad
   Ad placement
   Search query
 Can we improve targeting?
New Projects

 Finance
   Analysts already drowning in info
   Stories sometimes broken on blogs
   Message boards show sentiment
   Extremely low signal-to-noise ratio
New Projects

 Entertainment
   How much buzz is a movie generating?
   Which psychographic profiles like this type of movie?
   Of interest to studios and media investors
