DemoDay1.ppt
    DemoDay1.ppt Presentation Transcript

    • Demo Day #1: Carlton Northern, CS795/895, 2/4/09
    • Chapter 2: Question 2
      • Tag similarity. Using the del.icio.us API, create a dataset of tags and items. Use this to calculate similarity between tags and see if you can find any that are almost identical. Find some items that could have been tagged “programming” but were not.
    • Code
      • import random
      • import recommendations
      • from pydelicious import get_tags
      • from deliciousrec import *
      • # Build the tag dictionary and list the tags that were found
      • deltags=initializeTagDict()
      • for item in deltags:
      •     print item.encode('utf-8')
      • # Fill in which items carry each tag
      • fillTagDict(deltags)
      • for item in deltags:
      •     print item.encode('utf-8'), deltags[item]
      • # Show the closest matches for five randomly chosen tags
      • for j in range(5):
      •     tag=random.choice(deltags.keys())
      •     print "Top 5 Recommendations of Similar tags to '" + tag + "'"
      •     print recommendations.topMatches(deltags,tag)
      • print "Top 5 Recommendations of Similar tags to 'programming'."
      • print recommendations.topMatches(deltags,'programming')
      • Top 5 Recommendations of Similar tags to 'Superbowl'
      • [(1.0, u'superbowl'), (0.184676525182837, u'steelers'), (0.11406678655312195, u'timeline'), (0.11406678655312195, u'nytimes'), (0.11406678655312195, u'NYTimes')]
      • Top 5 Recommendations of Similar tags to 'DTW'
      • [(0.14635737421958822, u'workshopresource'), (0.11167205430813504, u'socialwhois'), (0.092945407701130411, u'friendfeed'), (0.08627094606290428, u'whois'), (0.075960928574145103, u'peoplesearch')]
      • Top 5 Recommendations of Similar tags to 'icon'
      • [(0.42655020573161051, u'icons'), (0.27976526305782667, u'webdesign'), (0.26598011028287921, u'icnons'), (0.26449205376543711, u'iconos'), (0.19809350035187084, u'graphics')]
      • Top 5 Recommendations of Similar tags to 'Javascript'
      • [(1.0, u'javascript'), (0.20119183173535465, u'ajax'), (0.14328977139680074, u'js'), (0.13167774373493463, u'webdev'), (0.1267376966007103, u'web')]
      • Top 5 Recommendations of Similar tags to 'casestudy'
      • [(0.16426628285871578, u'workshopresource'), (0.15481354307873288, u'twitter'), (0.15481354307873288, u'Twitter'), (0.14328977139680074, u'socialmedia'), (0.12573750709445097, u'socialwhois')]
      • Top 5 Recommendations of Similar tags to 'programming'.
      • [(0.30691624365482234, u'development'), (0.24390862944162436, u'tips'), (0.18090101522842639, u'tutorials'), (0.12705419376095389, u'webdev'), (0.12705419376095389, u'php')]
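The output above lists case variants such as 'Twitter'/'twitter' and 'NYTimes'/'nytimes' as separate tags, which crowds the top matches with near-duplicates. A minimal sketch of one way to fold them together before computing similarity follows. It assumes the tag dictionary maps each tag to a dict of item URL to a 0/1 flag; that layout is a guess, since the slides do not show fillTagDict, and `merge_case_variants` is a hypothetical helper, not part of deliciousrec.

```python
def merge_case_variants(tag_dict):
    """Fold tags that differ only by case into one lowercase entry.

    Assumes tag_dict maps tag -> {item_url: flag}; this layout is an
    assumption, not something the slides confirm.
    """
    merged = {}
    for tag, items in tag_dict.items():
        bucket = merged.setdefault(tag.lower(), {})
        for url, flag in items.items():
            # Keep the highest flag seen for the item across all variants.
            bucket[url] = max(flag, bucket.get(url, 0))
    return merged


# Made-up sample data, only to show the call.
sample = {
    'Twitter': {'http://a.example': 1, 'http://b.example': 0},
    'twitter': {'http://a.example': 0, 'http://b.example': 1},
    'js': {'http://a.example': 1},
}
merged = merge_case_variants(sample)
```

Passing the merged dictionary to the similarity step would leave room in the top five for tags that are related rather than merely respelled.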
    • Delicious “development”
      • http://delicious.com/search?p=development&u=&chk=&context=all&tag=development&fr=del_icio_us&lc=0
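The question also asks for items that could have been tagged "programming" but were not. One possible approach, sketched here under the assumption that the tag dictionary maps each tag to a dict of item URL to a 0/1 flag, is to take the tags that score closest to 'programming' (such as 'development') and list the items that carry those tags but lack 'programming'. The function name and the sample data are hypothetical.

```python
def untagged_candidates(tag_dict, target, similar_tags):
    """Return {item_url: [similar tags it carries]} for items lacking target.

    Assumes tag_dict maps tag -> {item_url: 0 or 1}; this layout is an
    assumption made for the sketch.
    """
    target_items = tag_dict.get(target, {})
    candidates = {}
    for tag in similar_tags:
        for url, flag in tag_dict.get(tag, {}).items():
            # Keep items that carry the similar tag but not the target tag.
            if flag and not target_items.get(url):
                candidates.setdefault(url, []).append(tag)
    return candidates


# Made-up sample data, only to show the call.
sample = {
    'programming': {'http://a.example': 1, 'http://b.example': 0, 'http://c.example': 0},
    'development': {'http://a.example': 1, 'http://b.example': 1, 'http://c.example': 0},
    'tips': {'http://a.example': 0, 'http://b.example': 1, 'http://c.example': 1},
}
found = untagged_candidates(sample, 'programming', ['development', 'tips'])
```

Items matched by several similar tags are arguably the stronger candidates, so the list of matching tags is kept per item.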
    • Chapter 3: Question 2
      • Modify the blog parsing code to cluster individual entries instead of entire blogs. Do entries from the same blog cluster together? What about entries from the same date?
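Clustering entries instead of blogs means each entry needs its own row of word counts. The sketch below assumes the entries have already been fetched as (blog name, entry title, entry text) tuples, for example with a feed parser that is not shown here, and writes a tab-separated layout with a header row of words followed by one row per entry. The function names, the 'Entry' header cell and the "blog: title" row labels are hypothetical choices.

```python
import re
from collections import Counter


def get_words(html):
    """Strip HTML tags, then return the lowercase alphabetic words."""
    text = re.sub(r'<[^>]+>', ' ', html)
    return [w.lower() for w in re.split(r'[^A-Za-z]+', text) if w]


def entry_matrix(entries):
    """Build (row_labels, words, rows) with one row of counts per entry."""
    counts = [(blog + ': ' + title, Counter(get_words(title + ' ' + text)))
              for blog, title, text in entries]
    words = sorted(set(w for _, c in counts for w in c))
    rows = [[c[w] for w in words] for _, c in counts]
    return [label for label, _ in counts], words, rows


def to_tsv(labels, words, rows):
    """Serialise as a header row of words, then one row per entry."""
    lines = ['Entry\t' + '\t'.join(words)]
    for label, row in zip(labels, rows):
        lines.append(label + '\t' + '\t'.join(str(n) for n in row))
    return '\n'.join(lines) + '\n'


# Made-up entries, only to show the call.
entries = [
    ('BlogA', 'Python tips', '<p>Python is fun</p>'),
    ('BlogB', 'Game news', '<p>New game out</p>'),
]
labels, words, rows = entry_matrix(entries)
tsv = to_tsv(labels, words, rows)
```

Keeping the blog name in each row label makes it possible to read off from a dendrogram whether entries from the same blog end up next to each other.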
    • 20 Blogs Parsed
      • http://feedproxy.google.com/Mobilecrunch
      • http://feeds2.feedburner.com/AndroidCommunity?format=xml
      • http://feeds.feedburner.com/iphonehacks
      • http://blogs.abcnews.com/theblotter/index.rdf
      • http://rss.cnn.com/rss/cnn_latest.rss
      • http://hypem.com/feed/time/today/1/feed.xml
      • http://f-measure.blogspot.com/feeds/posts/default
      • http://www.kotaku.com/index.xml
      • http://www.gamingblog.org/rss.xml
      • http://www.perezhilton.com/index.xml
      • http://www.tmz.com/rss.xml
      • http://feeds.feedburner.com/Mashable
      • http://www.readwriteweb.com/rss.xml
      • http://googleblog.blogspot.com/rss.xml
      • http://feeds.feedburner.com/GoogleOperatingSystem
      • http://feeds.feedburner.com/codinghorror/
      • http://scobleizer.wordpress.com/feed/
      • http://carlton-northern.blogspot.com/feeds/posts/default
      • http://www.lifehacker.com/index.xml
      • http://feeds.feedburner.com/TechCrunch
    • Generate Cluster Code
      • import clusters
      • blognames,words,data=clusters.readfile('entrydata.txt')
      • clust=clusters.hcluster(data)
      • clusters.printclust(clust,labels=blognames)
      • clusters.drawdendrogram(clust,blognames,jpeg='entryclust.jpg')
      • print "Generated entryclust.jpg"
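Hierarchical clustering needs a distance between rows of word counts. The sketch below assumes the clusters module uses one minus the Pearson correlation by default, as in Programming Collective Intelligence, which the chapter numbers here appear to follow; the slides themselves do not show it. Returning 1.0 when the correlation is undefined is a choice made for this sketch.

```python
from math import sqrt


def pearson_distance(v1, v2):
    """Return 1 - Pearson correlation, so similar rows get a small distance."""
    n = float(len(v1))
    sum1, sum2 = sum(v1), sum(v2)
    sum1_sq = sum(x * x for x in v1)
    sum2_sq = sum(y * y for y in v2)
    p_sum = sum(x * y for x, y in zip(v1, v2))

    num = p_sum - sum1 * sum2 / n
    den_sq = (sum1_sq - sum1 ** 2 / n) * (sum2_sq - sum2 ** 2 / n)
    if den_sq <= 0:
        # A constant row has no defined correlation; treat it as dissimilar.
        return 1.0
    return 1.0 - num / sqrt(den_sq)


same_shape = pearson_distance([1, 2, 3], [2, 4, 6])
opposite = pearson_distance([1, 2, 3], [3, 2, 1])
flat_row = pearson_distance([1, 1, 1], [1, 2, 3])
```

Because the measure compares the shape of two rows rather than their magnitude, a long entry and a short entry that use the same words in the same proportions come out close together.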
    • Cluster JPG
      • http://www.cs.odu.edu/~cnorther/files/entryclust.jpg
    • Chapter 4: Question 1
      • Word separation. The separatewords method currently considers any nonalphanumeric character to be a separator, meaning it will not properly index entries like “C++,” “$20,” “Ph.D.,” or “617-555-1212.” What is a better way to separate words? Does using whitespace as a separator work? Write a better word separation function.
    • Crawler Code
      • import searchengine
      • print "Running TestCrawler"
      • pagelist=['http://carlton-northern.blogspot.com/']
      • crawler=searchengine.crawler('myblog&+.db')
      • crawler.createindextables()
      • crawler.crawl(pagelist)
      • crawler.calculatepagerank()
      • print "Done"
    • Separatewords Function
      • def separatewords(self, text):
      •     # Split on anything except letters, digits and _ & + @ *, so "c++" stays whole
      •     splitter = re.compile('[^a-zA-Z0-9_&+@*]')
      •     return [s.lower() for s in splitter.split(text) if s != '']
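To see how the character class above handles the examples from the question, the sketch below runs it next to a whitespace-based variant. The whitespace variant and its list of trimmed punctuation are assumptions added for comparison, not code from the slides.

```python
import re

# The character class from the slide: everything outside it is a separator.
SLIDE_SPLITTER = re.compile(r'[^a-zA-Z0-9_&+@*]')
# Punctuation trimmed from token edges in the whitespace variant (an assumption).
EDGE_PUNCTUATION = '.,;:!?"\'()[]{}<>'


def slide_separatewords(text):
    """Standalone copy of the slide's splitter."""
    return [s.lower() for s in SLIDE_SPLITTER.split(text) if s != '']


def whitespace_separatewords(text):
    """Split on whitespace, then trim punctuation that only wraps a token."""
    words = []
    for token in text.split():
        token = token.strip(EDGE_PUNCTUATION)
        if token:
            words.append(token.lower())
    return words


sample = 'C++, $20, Ph.D., or 617-555-1212.'
by_slide = slide_separatewords(sample)
# ['c++', '20', 'ph', 'd', 'or', '617', '555', '1212']
by_whitespace = whitespace_separatewords(sample)
# ['c++', '$20', 'ph.d', 'or', '617-555-1212']
```

The slide's version keeps "c++" intact but still breaks up "$20", "Ph.D." and the phone number. The whitespace variant keeps all four, at the cost of leaving internal punctuation such as hyphens and slashes inside tokens.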
    • Test Code
      • import searchengine
      • if __name__ == "__main__":
      •     crawler=searchengine.crawler('myblog&+.db')
      •     e=searchengine.searcher('myblog&+.db')
      •     # Dump the word index to check how words were separated
      •     for row in crawler.con.execute('select rowid, word from wordlist'):
      •         print row
      •     print e.query('half&amp ')
      •     print e.query('c++ ')