Published in Technology
Transcript

  • 1. Twitter: The Tweet, the Whole Tweet, and Nothing but the Tweet #2 charsyam@naver.com
  • 2. Data Mining
  • 3. Discover New Knowledge
  • 4. Discover New Knowledge from Existing Information
  • 5. What do #TeaParty and #JustinBieber have in common?
  • 6. Tools: PyMongo, MongoDB
    apt-get install python-dev
    pip install pymongo
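  • 7. Get Tweets
    # Connection is the older PyMongo client class (later replaced by MongoClient).
    from pymongo.connection import Connection
    import tweepy
    connection = Connection("localhost")
    db = connection.foo
    api = tweepy.API()
    # Fetch up to 100 search results for the hashtag and store each one.
    tweets = api.search("#JustinBieber", rpp=100)
    for tweet in tweets:
        db.foo.save(tweet.__getstate__())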
  • 8. Insert into MongoDB
    from pymongo.connection import Connection
    import tweepy
    connection = Connection("localhost")
    db = connection.foo
    api = tweepy.API()
    # Walk result pages 1 through 15, 100 tweets per page.
    for num in range(1, 16):
        tweets = api.search("#JustinBieber", rpp=100, page=num)
        for tweet in tweets:
            db.foo.save(tweet.__getstate__())
  • 9. Count Frequency in mongo: MAP
    map = function () {
        // Emit one {count: 1} record per space-separated word in the tweet text.
        var words = this.text.split(" ");
        for (var i in words) {
            emit({ key: words[i] }, { count: 1 });
        }
    };
  • 10. Count Frequency in mongo: REDUCE
    reduce = function (key, values) {
        var count = 0;
        values.forEach(function (v) { count += v.count; });
        return { count: count };
    };
  • 11. Count Frequency in mongo: EXECUTE
    res = db.foo.mapReduce(map, reduce, { out: "mystring" });
  • 12. Count Frequency in mongo: RESULT
    { "_id" : { "key" : "#1000ADay" }, "value" : { "count" : 1 } }
    { "_id" : { "key" : "#1000aday" }, "value" : { "count" : 1 } }
    { "_id" : { "key" : "#500ADay" }, "value" : { "count" : 1 } }
    { "_id" : { "key" : "#500aday" }, "value" : { "count" : 1 } }
    { "_id" : { "key" : "#AutoFollow" }, "value" : { "count" : 1 } }
    { "_id" : { "key" : "#Bieber" }, "value" : { "count" : 1 } }
  • 13. Get From MongoDB
    from pymongo.connection import Connection
    connection = Connection("localhost")
    db = connection.foo
    # Read back the map/reduce output collection and print every document.
    cursor = db.mystring.find()
    for d in cursor:
        print(d)
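The map and reduce functions on slides 9 to 11 amount to a word-frequency count. As a rough illustration only, the same idea can be sketched in plain Python without MongoDB. The function name `count_words` and the sample tweets are made up for this example, and unlike the slide's map function this sketch skips the empty strings that repeated spaces would produce.

```python
from collections import Counter


def count_words(texts):
    """Count how often each space-separated token occurs across all texts."""
    counts = Counter()
    for text in texts:
        # "Map" step: split on single spaces; "reduce" step: Counter sums the ones.
        counts.update(token for token in text.split(" ") if token)
    return counts


# Hypothetical sample data, not taken from the slides.
sample = ["#Bieber is on tour", "tour dates  #Bieber #AutoFollow"]
counts = count_words(sample)
print(counts.most_common(3))
```

Like the result on slide 12, the counts are case-sensitive, so `#1000ADay` and `#1000aday` would be tallied separately unless the tokens are lowercased first.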
  • 14. What Entities Co-Occur Most Often with #JustinBieber and #TeaParty Tweets?
  • 15. intersection
    import sys
    def load_keys(path):
        # Collect the first whitespace-separated token of each non-empty line.
        keys = set()
        with open(path) as f:
            for line in f:
                fields = line.split()
                if fields:
                    keys.add(fields[0])
        return keys
    if __name__ == "__main__":
        s1 = load_keys(sys.argv[1])
        s2 = load_keys(sys.argv[2])
        s3 = s1.intersection(s2)
        # Sizes of each key set, then the size of their overlap.
        print(len(s1))
        print(len(s2))
        print(len(s3))
  • 16. On Average, Do #JustinBieber or #TeaParty Tweets Have More Hashtags?
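One way to approach the question on slide 16 is to count hashtags per tweet text and take the mean. The sketch below is an assumption about how that might be done, not the presenter's code: the `#\w+` pattern and the name `average_hashtags` are illustrative, and the input is assumed to be a list of tweet text strings.

```python
import re

# Assumed hashtag pattern: "#" followed by one or more word characters.
HASHTAG = re.compile(r"#\w+")


def average_hashtags(texts):
    """Return the mean number of hashtags per tweet text (0.0 for no tweets)."""
    texts = list(texts)
    if not texts:
        return 0.0
    total = sum(len(HASHTAG.findall(text)) for text in texts)
    return total / len(texts)


# Hypothetical sample data, not taken from the slides.
print(average_hashtags(["#TeaParty rally #tcot", "no tags here"]))
```

Running it once over the #JustinBieber texts and once over the #TeaParty texts gives two averages that can be compared directly.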
  • 17. Which Get Retweeted More Often: #JustinBieber or #TeaParty?
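A rough proxy for the question on slide 17 is the share of collected tweets that look like retweets. The sketch below assumes retweets can be recognised by an "RT @user" or "via @user" marker in the text, which is a heuristic and not something the slides specify; the name `retweet_share` is also made up for this example.

```python
import re

# Heuristic (an assumption): a retweet mentions a user after "RT" or "via".
RETWEET = re.compile(r"\b(?:RT|via)\s*@\w+")


def retweet_share(texts):
    """Return the fraction of tweet texts that match the retweet heuristic."""
    texts = list(texts)
    if not texts:
        return 0.0
    hits = sum(1 for text in texts if RETWEET.search(text))
    return hits / len(texts)


# Hypothetical sample data, not taken from the slides.
sample = ["RT @fan: new single!", "hello world", "great song via @bob", "start @carol"]
print(retweet_share(sample))
```

This measures how many stored tweets are themselves retweets, not how many times each original tweet was retweeted, so it answers the question only approximately.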
  • 18. How Much Overlap Exists Between the Entities of #TeaParty and #JustinBieber Tweets?
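The intersection script on slide 15 prints the sizes of the two key sets and of their overlap. To express that overlap as a single number, one option (an assumption here, not something the slides use) is the shared count divided by the size of the union. The name `entity_overlap` and the sample entities are illustrative.

```python
def entity_overlap(entities_a, entities_b):
    """Return the shared entities and the ratio of shared to all distinct entities."""
    a, b = set(entities_a), set(entities_b)
    shared = a & b
    union = a | b
    # Guard against two empty inputs to avoid dividing by zero.
    ratio = len(shared) / len(union) if union else 0.0
    return shared, ratio


# Hypothetical sample data, not taken from the slides.
shared, ratio = entity_overlap(["#news", "#music", "#love"], ["#music", "#love", "#tcot"])
print(sorted(shared), ratio)
```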
  • 19. Thank You!