Twitter6

Published in: Technology

  1. Twitter: The Tweet, the Whole Tweet, and Nothing but the Tweet #2 (charsyam@naver.com)
  2. Data Mining
  3. Discover New Knowledge
  4. Discover New Knowledge from Existing Information
  5. What Do #TeaParty and #JustinBieber Have in Common?
  6. Tools: Pymongo, MongoDB
       apt-get install python-dev
       pip install pymongo
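  7. Get Tweets
       # pymongo 2.x API; pymongo 3 replaced Connection with MongoClient
       from pymongo.connection import Connection
       import tweepy

       connection = Connection("localhost")
       db = connection.foo

       # Fetch up to 100 search results for the hashtag
       api = tweepy.API()
       tweets = api.search("#JustinBieber", rpp=100)
       for tweet in tweets:
           # Store the tweet's attribute dictionary as a MongoDB document
           db.foo.save(tweet.__getstate__())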
  8. Insert to MongoDB
       from pymongo.connection import Connection
       import tweepy

       connection = Connection("localhost")
       db = connection.foo

       api = tweepy.API()
       # Walk result pages 1 through 15, 100 tweets per page
       for num in range(1, 16):
           tweets = api.search("#JustinBieber", rpp=100, page=num)
           for tweet in tweets:
               db.foo.save(tweet.__getstate__())
  9. Count Frequency in Mongo: MAP
       // Emit one {count: 1} record per space-separated word of the tweet text
       map = function () {
           var words = this.text.split(" ");
           for (var i in words) {
               emit({ key: words[i] }, { count: 1 });
           }
       };
  10. Count Frequency in Mongo: REDUCE
       // Sum the counts emitted for one key
       reduce = function (key, values) {
           var count = 0;
           values.forEach(function (v) { count += v.count; });
           return { count: count };
       };
  11. Count Frequency in Mongo: EXECUTE
       res = db.foo.mapReduce(map, reduce, { out: "mystring" });
  12. Count Frequency in Mongo: RESULT
       { "_id" : { "key" : "#1000ADay" }, "value" : { "count" : 1 } }
       { "_id" : { "key" : "#1000aday" }, "value" : { "count" : 1 } }
       { "_id" : { "key" : "#500ADay" }, "value" : { "count" : 1 } }
       { "_id" : { "key" : "#500aday" }, "value" : { "count" : 1 } }
       { "_id" : { "key" : "#AutoFollow" }, "value" : { "count" : 1 } }
       { "_id" : { "key" : "#Bieber" }, "value" : { "count" : 1 } }
  13. Get from MongoDB
       from pymongo.connection import Connection

       connection = Connection("localhost")
       db = connection.foo

       # Read back the map/reduce output collection
       cursor = db.mystring.find()
       for d in cursor:
           print(d)
  14. What Entities Co-Occur Most Often with #JustinBieber and #TeaParty Tweets?
  15. Intersection
       import sys

       def first_tokens(path):
           # Collect the first whitespace-separated token of every non-empty line
           with open(path) as f:
               return set(line.split()[0] for line in f if line.split())

       if __name__ == "__main__":
           s1 = first_tokens(sys.argv[1])
           s2 = first_tokens(sys.argv[2])
           s3 = s1.intersection(s2)
           # Sizes of each entity set and of their overlap
           print(len(s1))
           print(len(s2))
           print(len(s3))
  16. On Average, Do #JustinBieber or #TeaParty Tweets Have More Hashtags?
  17. Which Get Retweeted More Often: #JustinBieber or #TeaParty?
  18. How Much Overlap Exists Between the Entities of #TeaParty and #JustinBieber Tweets?
  19. Thank You!
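
The deck poses its closing questions (average hashtags per tweet, retweet frequency, entity overlap) without showing code for them. The following is a minimal Python sketch of how they might be approached. It is not the author's method. The sample tweets, the function names, the hashtag regular expression, and the "starts with RT" retweet test are all assumptions made for illustration; real input would be the tweet documents stored in MongoDB. The word_counts function is a rough in-memory analogue of the map and reduce steps, except that it splits on any whitespace rather than on single spaces.

```python
import re
from collections import Counter

# Hypothetical sample data standing in for the stored tweet texts.
BIEBER = [
    "RT @fan: I love #JustinBieber #Bieber",
    "#JustinBieber new song #1000ADay",
]
TEAPARTY = [
    "#TeaParty rally today #tcot #1000ADay",
    "RT @news: #TeaParty #tcot #gop update",
]

# Assumed hashtag pattern: '#' followed by word characters.
HASHTAG = re.compile(r"#\w+")


def word_counts(texts):
    """Count every whitespace-separated word (analogue of map + reduce)."""
    counts = Counter()
    for text in texts:
        counts.update(text.split())
    return counts


def hashtags(text):
    """Return the hashtags found in one tweet text."""
    return HASHTAG.findall(text)


def average_hashtags(texts):
    """Mean number of hashtags per tweet."""
    return sum(len(hashtags(t)) for t in texts) / float(len(texts))


def retweet_share(texts):
    """Fraction of tweets whose text starts with 'RT ' (a crude assumption)."""
    return sum(1 for t in texts if t.startswith("RT ")) / float(len(texts))


def hashtag_overlap(a, b):
    """Hashtags that occur in both collections, compared case-insensitively."""
    tags_a = set(h.lower() for t in a for h in hashtags(t))
    tags_b = set(h.lower() for t in b for h in hashtags(t))
    return tags_a & tags_b


if __name__ == "__main__":
    print(word_counts(BIEBER).most_common(3))
    print(average_hashtags(BIEBER), average_hashtags(TEAPARTY))
    print(retweet_share(BIEBER), retweet_share(TEAPARTY))
    print(sorted(hashtag_overlap(BIEBER, TEAPARTY)))
```

Lowercasing before the comparison is a deliberate choice: the map/reduce result in the deck lists #1000ADay and #1000aday as separate keys, so a case-sensitive comparison would likely understate the overlap.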