Social Data and Log Analysis Using MongoDB
Presentation Transcript

  • 1. Social Data and Log Analysis Using MongoDB 2011/03/01(Tue) #mongotokyo doryokujin
  • 2. Self-Introduction • doryokujin (Takahiro Inoue), Age: 25 • Education: Keio University • Master of Mathematics, March 2011 (maybe...) • Major: Randomized Algorithms and Probabilistic Analysis • Company: Geisha Tokyo Entertainment (GTE) • Data Mining Engineer (only me, part-time) • Communities I organize: MongoDB JP, Tokyo Web Mining
  • 3. My Job • I’m a fledgling data scientist • Development of analytical systems for social data • Development of recommendation systems for social data • My interest: big data analysis • How to collect logs scattered across many servers • How to store and access the data • How to analyze and visualize billions of records
  • 4. Agenda • My Company’s Analytic Architecture • How to Handle Access Logs • How to Handle User Trace Logs • How to Collaborate with Front Analytic Tools • My Future Analytic Architecture
  • 5. Agenda (annotated) • My Company’s Analytic Architecture • How to Handle Access Logs (Hadoop, Mongo Map Reduce) • How to Handle User Trace Logs (Hadoop, Schema Free) • How to Collaborate with Front Analytic Tools (REST Interface, JSON) • My Future Analytic Architecture (Capped Collection, Modifier Operation) • Of course, everything with MongoDB
  • 6. My Company’s Analytic Architecture
  • 7. Social Game (Mobile): Omiseyasan • Players enjoy arranging their own shop (and avatar) • They communicate with other users through shopping, part-time jobs, ... • They buy seeds of items to display in their own shop
  • 8. Data Flow (diagram of the access data flow)
  • 9. Back-end Architecture (diagram): raw logs are pretreated (trimming, validation, filtering, ...) with Dumbo (Hadoop Streaming), loaded via PyMongo into MongoDB as a central data server, and backed up to S3.
  • 10. Front-end Architecture (diagram): MongoDB feeds the analytics front ends through PyMongo and sleepy.mongoose (REST interface): a Web UI, social data analysis, and data analysis tools.
  • 11. Environment • MongoDB: 1.6.4 • PyMongo: 1.9 • Hadoop: CDH2 (soon to be updated to CDH3) • Dumbo: a simple Python module for Hadoop Streaming • Cassandra: 0.6.11 • R, Neo4j, jQuery, Munin, ... • Data size (rough estimate): access logs 15 GB/day (gzip), about 2,000M PV; user trace logs 5 GB/day (gzip)
  • 12. How to Handle Access Logs
  • 13. How to Handle Access Logs (diagram): pretreatment (trimming, validation, filtering, ...) feeds MongoDB as a data server, with backups to S3.
  • 14. Access Data Flow (diagram): pretreatment loads raw records into user_access; the 1st Map Reduce groups them into user_pageview, agent_pageview, and hourly_pageview; the 2nd Map Reduce produces daily_pageview. Caution: needs MongoDB >= 1.7.4.
  • 15. Hadoop • Using Hadoop for pretreatment of the raw records • [Map / Reduce] • Read all records • Split each record on whitespace ('\s') • Filter out unnecessary records (such as *.swf) • Check whether each record is well-formed • Insert (save) the records into MongoDB (see the sketch below) ※ write operations won’t yet fully utilize all cores
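To make the flow concrete, here is a minimal sketch of that pretreatment step as a Dumbo-style Hadoop Streaming job. The field positions, the filter rule, and the playshop/user_access names are illustrative assumptions, not the production code:

from pymongo import Connection

_coll = None

def _collection():
    # One lazy pymongo connection per streaming task process.
    global _coll
    if _coll is None:
        _coll = Connection("localhost", 27017)["playshop"]["user_access"]
    return _coll

def mapper(key, line):
    fields = line.split()               # split each record on whitespace
    if len(fields) < 10:
        return                          # drop malformed records
    path = fields[6].split(";")[0]      # request path, jsessionid stripped
    if path.endswith(".swf"):
        return                          # filter unnecessary records
    yield path, line

def reducer(key, values):
    coll = _collection()
    count = 0
    for line in values:
        # The real job saves a fully parsed document; the raw line is
        # kept here only to stay short.
        coll.save({"resource": key, "raw": line})
        count += 1
    yield key, count

if __name__ == "__main__":
    import dumbo
    dumbo.run(mapper, reducer)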
  • 16. Access Logs
110.44.178.25 - - [19/Nov/2010:04:40:40 +0900] "GET /playshop.4ce13800/battle/BattleSelectAssetPage.html;jsessionid=9587B0309581914AB7438A34B1E51125-n15.at3?collection=12&opensocial_app_id=00000&opensocial_owner_id=00000 HTTP/1.0" 200 6773 "-" "DoCoMo/2.0 ***"
110.44.178.26 - - [19/Nov/2010:04:40:40 +0900] "GET /playshop.4ce13800/shopping/battle/ShoppingBattleTopPage.html;jsessionid=D901918E3CAE46E6B928A316D1938C3A-n11.ap1?opensocial_app_id=00000&opensocial_owner_id=11111 HTTP/1.0" 200 15254 "-" "DoCoMo/2.0 ***"
110.44.178.27 - - [19/Nov/2010:04:40:40 +0900] "GET /playshop.4ce13800/battle/BattleSelectAssetDetailPage;jsessionid=202571F97B444370ECB495C2BCC6A1D5-n14.at11?asset=53&collection=9&opensocial_app_id=00000&opensocial_owner_id=22222 HTTP/1.0" 200 11616 "-" "SoftBank/***"
...(many records)
  • 17. Collection: user_access
> db.user_access.find({userId: "7777", date: "2011-02-12"}).limit(1).forEach(printjson)
{
  "_id" : "2011-02-12+05:39:31+7777+18343+Access",
  "lastUpdate" : "2011-02-19",
  "ipaddr" : "202.32.107.166",
  "requestTimeStr" : "12/Feb/2011:05:39:31 +0900",
  "date" : "2011-02-12",
  "time" : "05:39:31",
  "responseBodySize" : 18343,
  "userAgent" : "DoCoMo/2.0 SH07A3(c500;TB;W24H14)",
  "statusCode" : "200",
  "splittedPath" : "/avatar2-gree/MyPage",
  "userId" : "7777",
  "resource" : "/avatar2-gree/MyPage;jsessionid=...?battlecardfreegacha=1&feed=...&opensocial_app_id=...&opensocial_viewer_id=...&opensocial_owner_id=..."
}
  • 18. 1st Map Reduce • [Aggregation] • Group by url, date, userId • Group by url, date, userAgent • Group by url, date, time • Group by url, date, statusCode • Map Reduce operations run in parallel on all shards
  • 19. 1st Map Reduce with PyMongo
from bson.code import Code

map = Code("""
function(){
    emit({path: this.splittedPath,
          userId: this.userId,
          date: this.date}, 1);
}
""")

reduce = Code("""
function(key, values){
    var count = 0;
    values.forEach(function(v){
        // v is 1 on the first pass and a partial {count: ...} on re-reduce
        count += (v.count ? v.count : v);
    });
    return {"count": count, "lastUpdate": today};
}
""")

The emitted key field is swapped per aggregation: this.userId, this.userAgent, this.timeRange, this.statusCode.
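The reduce function above references a today variable that is not defined in the JavaScript itself. One way to supply it (an assumption about the author's script, not shown on the slide) is the scope argument of Code; reduce_js below stands for the reduce body quoted above:

from bson.code import Code

# Entries in the scope document become global variables inside the
# server-side JavaScript, which is how "today" can resolve.
reduce = Code(reduce_js, {"today": "2011-02-19"})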
  • 20. 1st Map Reduce with PyMongo (cont.)
# ( mongodb >= 1.7.4 )
result = db.user_access.map_reduce(map, reduce,
                                   merge_out="user_pageview",
                                   full_response=True,
                                   query={"date": date})
For the output collection there are four options (MongoDB >= 1.7.4): • out: overwrite the collection if it already exists • merge_output: merge the new data into the old output collection • reduce_output: a reduce operation is performed on the two values (the same key in the new result set and the old collection) and the result is written to the output collection • full_response (=False): if True, return full stats on the operation. With inline output, no collection is created and the whole map-reduce operation happens in RAM; the result set must fit within the 8MB/doc limit (16MB/doc in 1.8?).
  • 21. Map Reduce (>=1.7.4): out option in JavaScript • "collectionName" : If you pass a string indicating the name of a collection, the output will replace any existing output collection with the same name. • { merge : "collectionName" } : This option will merge new data into the old output collection. In other words, if the same key exists in both the result set and the old collection, the new key will overwrite the old one. • { reduce : "collectionName" } : If documents exist for a given key in the result set and in the old collection, then a reduce operation (using the specified reduce function) will be performed on the two values and the result will be written to the output collection. If a finalize function was provided, it will be run after the reduce as well. • { inline : 1 } : With this option, no collection will be created, and the whole map-reduce operation will happen in RAM. Also, the results of the map-reduce will be returned within the result object. Note that this option is possible only when the result set fits within the 8MB limit. http://www.mongodb.org/display/DOCS/MapReduce
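From PyMongo, the same out document can be passed by issuing the raw mapReduce command instead of the keyword arguments; a sketch, where map and reduce are the Code objects defined earlier and db is an open pymongo Database:

from bson.son import SON

cmd = SON([
    ("mapreduce", "user_access"),
    ("map", map),
    ("reduce", reduce),
    ("out", {"merge": "user_pageview"}),   # or {"reduce": ...} / {"inline": 1}
    ("query", {"date": date}),
])
result = db.command(cmd)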
  • 22. Collection: user_pageview
> db.user_pageview.find({
    "_id.userId": "7777",
    "_id.path": /.*MyPage$/,           // regular expressions
    "_id.date": {$lte: "2011-02-12"}   // <, >, <=, >=
  }).limit(1).forEach(printjson)
{
  "_id" : {
    "date" : "2011-02-12",
    "path" : "/avatar2-gree/MyPage",
    "userId" : "7777"
  },
  "value" : {
    "count" : 10,
    "lastUpdate" : "2011-02-19"
  }
}
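The same query issued from PyMongo, as a sketch: a compiled Python regular expression serves as the regex condition and $lte as the range operator:

import re

for doc in db.user_pageview.find({
        "_id.userId": "7777",
        "_id.path": re.compile(r".*MyPage$"),
        "_id.date": {"$lte": "2011-02-12"},
}).limit(1):
    print doc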
  • 23. 2nd Map Reduce with PyMongo
map = Code("""
function(){
    emit({"path": this._id.path, "date": this._id.date},
         {"pv": this.value.count, "uu": 1});
}
""")

reduce = Code("""
function(key, values){
    var pv = 0;
    var uu = 0;
    values.forEach(function(v){
        pv += v.pv;
        uu += v.uu;
    });
    return {"pv": pv, "uu": uu};
}
""")
  • 24. 2nd Map Reduce with PyMongo (same code, annotated): the object returned by reduce must carry the same keys as the value emitted by map ({"pv": NaN} results otherwise).
  • 25.
# ( mongodb >= 1.7.4 )
result = db.user_pageview.map_reduce(map, reduce,
                                     merge_out="daily_pageview",
                                     full_response=True,
                                     query={"date": date})
  • 26. Collection: daily_pageview
> db.daily_pageview.find({
    "_id.date": "2011-02-12",
    "_id.path": /.*MyPage$/
  }).limit(1).forEach(printjson)
{
  "_id" : {
    "date" : "2011-02-12",
    "path" : "/avatar2-gree/MyPage"
  },
  "value" : {
    "uu" : 53536,
    "pv" : 539467
  }
}
  • 27. Current Map Reduce is Imperfect • [Single thread per node] • Doesn’t scale map-reduce across multiple threads • [Overwrites the output collection] • Overwrites the old collection (no other options like “merge” or “reduce”)
# mapreduce code to merge output (MongoDB < 1.7.4)
result = db.user_access.map_reduce(map, reduce,
                                   full_response=True,
                                   out="temp_collection",
                                   query={"date": date})
[db.user_pageview.save(doc) for doc in db.temp_collection.find()]
  • 28. Useful References: Map Reduce • http://www.mongodb.org/display/DOCS/MapReduce • A Look At MongoDB 1.8’s MapReduce Changes • Map Reduce and Getting Under the Hood with Commands • Map/reduce runs in parallel/distributed? • Map/Reduce parallelism with Master/Slave • mapReduce locks the whole server • mapreduce vs find
  • 29. How to Handle User Trace Logs
  • 30. How to Handle User Trace Logs (diagram): pretreatment (trimming, validation, filtering, ...) feeds MongoDB as a data server, with backups to S3.
  • 31. User Trace / Charge Data Flow (diagram): pretreatment loads raw records into user_charge and user_trace, which are summed up into daily_charge and daily_trace.
  • 32. User Trace Log
  • 33. Hadoop • Using Hadoop for pretreatment of the raw records • [Map / Reduce] • Split each record on whitespace ('\s') • Filter out unnecessary records • Check records for dishonest user behavior • Unify the format so the records can be summed up (raw records are written in a free format) • Sum up records grouped by “userId” and “actionType” • Insert (save) the records into MongoDB ※ write operations won’t yet fully utilize all cores
  • 34. An Example of a User Trace Log (fields: UserId, ActionType, ActionDetail)
  • 35. An Example of a User Trace Log
-----Change-----
ActionLogger a{ChangeP} (Point,1371,1383)
ActionLogger a{ChangeP} (Point,2373,2423)
-----Get-----
ActionLogger a{GetMaterial} (syouhinnomoto,0,-1)
ActionLogger a{GetMaterial} use syouhinnomoto
ActionLogger a{GetMaterial} (omotyanomotoPRO,1,6)
-----Trade-----
ActionLogger a{Trade} buy 3 itigoke-kis from gree.jp:00000 #
-----Make-----
ActionLogger a{Make} make item kuronekono_n
ActionLogger a{MakeSelect} make item syouhinnomoto
ActionLogger a{MakeSelect} (syouhinnomoto,0,1)
-----PutOn/Off-----
ActionLogger a{PutOff} put off 1 ksuteras
ActionLogger a{PutOn} put 1 burokkus @2500
-----Clear/Clean-----
ActionLogger a{ClearLuckyStar} Clear LuckyItem_1 4 times
-----Gacha-----
ActionLogger a{Gacha} Play gacha with first free play:
ActionLogger a{Gacha} Play gacha:
The values of “actionDetail” must be unified into a consistent format.
  • 36. Collection: user_trace
> db.user_trace.find({date: "2011-02-12", actionType: "a{Make}", userId: "7777"}).forEach(printjson)
{
  "_id" : "2011-02-12+7777+a{Make}",
  "date" : "2011-02-12",
  "lastUpdate" : "2011-02-19",
  "userId" : "7777",
  "actionType" : "a{Make}",
  "actionDetail" : {            // values summed up, grouped by userId and actionType
    "make item ksutera" : 3,
    "make item makaron" : 1,
    "make item huwahuwamimiate" : 1,
    ...
  }
}
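One plausible way to build such documents during pretreatment is an upsert per parsed action with a $inc modifier; a sketch (the helper and its parsing are assumptions, and the sub-keys must not contain '.' or '$'):

def save_action(coll, date, user_id, action_type, action_detail):
    # One document per (date, userId, actionType); each occurrence of an
    # actionDetail string bumps its counter inside the embedded document.
    coll.update(
        {"_id": "%s+%s+%s" % (date, user_id, action_type)},
        {"$set": {"date": date, "userId": user_id,
                  "actionType": action_type},
         "$inc": {"actionDetail.%s" % action_detail: 1}},
        upsert=True)

# e.g. save_action(db.user_trace, "2011-02-12", "7777",
#                  "a{Make}", "make item kuronekono_n")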
  • 37. Collection: daily_trace
> db.daily_trace.find({
    date: {$gte: "2011-02-12", $lte: "2011-02-19"},
    actionType: "a{Make}"
  }).forEach(printjson)
{
  "_id" : "2011-02-12+group+a{Make}",
  "date" : "2011-02-12",
  "lastUpdate" : "2011-02-19",
  "actionType" : "a{Make}",
  "actionDetail" : {
    "make item kinnokarakuridokei" : 615,
    "make item banjo-" : 377,
    "make item itigoke-ki" : 135904,
    ...
  },
  ...
}
...
  • 38. User Charge Log
  • 39. Collection: user_charge
// Top 10 users by charge amount on 2011-02-12
> db.user_charge.find({date: "2011-02-12"}).sort({totalCharge: -1}).limit(10).forEach(printjson)
{
  "_id" : "2011-02-12+7777+Charge",
  "date" : "2011-02-12",
  "lastUpdate" : "2011-02-19",
  "totalCharge" : 10000,
  "userId" : "7777",
  "actionType" : "Charge",
  "boughtItem" : {            // values summed up, grouped by userId and actionType
    "EX" : 13,
    "+6000" : 3,
    "PRO" : 20
  }
}
{...
  • 40. Collection: daily_charge
> db.daily_charge.find({date: "2011-02-12", T: "all"}).limit(10).forEach(printjson)
{
  "_id" : "2011-02-12+group+Charge+all+all",
  "date" : "2011-02-12",
  "total" : 100000,
  "UU" : 2000,
  "group" : { "…" : 1000000, "…" : 1000000, ... },
  "boughtItemNum" : { "EX" : 8, "…" : 730, ... },
  "boughtItem" : { "EX" : 10000, "…" : 100000, ... }
}
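These fields make daily revenue arithmetic immediate: assuming UU here counts paying users (an assumption from the field names), the average revenue per paying user on this day is total / UU = 100000 / 2000 = ¥50.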
  • 41. Categorize Users
  • 42. Categorize Users (diagram): attributions from user_trace, user_registration, user_charge, user_savedata, and user_pageview feed user_category • [Categorize users] • by play term • by total amount of charge • by registration date • [Take a snapshot of each category’s stats per week]
  • 43. Collection: user_registration
> db.user_registration.find({userId: "7777"}).forEach(printjson)
{
  "_id" : "2010-06-29+7777+Registration",
  "userId" : "7777",
  "actionType" : "Registration",
  "category" : {            // user tagging
    "R1" : "True",
    "T" : "ll",
    ...
  },
  "firstCharge" : "2010-07-07",
  "lastLogin" : "2010-09-30",
  "playTerm" : 94,
  "totalCumlativeCharge" : 50000,
  "totalMonthCharge" : 10000,
  ...
}
  • 44. Collection: user_category
> var cross = new Cross()  // user-defined function
> MCResign = cross.calc("2011-02-12", "MC", 1)
// each cell is the number of users

Term (day) \ Charge (yen)   0 (z)    ~¥1k (s)  ~¥10k (m)  ¥100k~ (l)  total
~1 day (z)                  50000    10        5          0           50015
~1 week (s)                 50000    100       50         3           50153
~1 month (m)                100000   200       100        1           100301
~3 months (l)               100000   300       50         6           100356
6 months~ (ll)              0        0         0          0           0
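Cross() above is the author's own shell helper and is not shown; a rough PyMongo equivalent that buckets users by play term and cumulative charge might look like this (band edges approximate the slide, and the lastLogin cutoff merely stands in for the MC/resign criteria, which are not shown):

from collections import defaultdict

def band(value, edges, labels):
    # Return the label of the first band whose upper edge contains value.
    for edge, label in zip(edges, labels):
        if value <= edge:
            return label
    return labels[-1]

def cross(db, date):
    table = defaultdict(int)
    for u in db.user_registration.find({"lastLogin": {"$lte": date}}):
        term = band(u.get("playTerm", 0), [1, 7, 30, 90],
                    ["z", "s", "m", "l", "ll"])
        charge = band(u.get("totalCumlativeCharge", 0), [0, 1000, 10000],
                      ["z", "s", "m", "l"])
        table[(term, charge)] += 1
    return table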
  • 45. How to Collaborate With Front Analytic Tools
  • 46. Front-end Architecture (diagram, repeated from slide 10): MongoDB feeds PyMongo and sleepy.mongoose (REST interface) front ends: a Web UI, social data analysis, and data analysis tools.
  • 47. Web UI and Mongo
  • 48. Data Table: jQuery.DataTables • Want to share a daily summary • Want to see the data from many viewpoints • Want an easy implementation • [DataTables features] 1. Variable-length pagination 2. On-the-fly filtering 3. Multi-column sorting with data type detection 4. Smart handling of column widths 5. Scrolling options for the table viewport 6. ...
  • 49. Graph: jQuery.HighCharts • Want to visualize data • Mainly handle time series data • Want an easy implementation • [Highcharts features] 1. Numerous chart types 2. Simple configuration syntax 3. Multiple axes 4. Tooltip labels 5. Zooming 6. ...
  • 50. sleepy.mongoose • [REST Interface + Mongo] • Get data via HTTP GET/POST requests • sleepy.mongoose ‣ requests take the form “/db_name/collection_name/_command” ‣ made by a 10gen engineer: @kchodorow ‣ Sleepy.Mongoose: A MongoDB REST Interface
  • 51. sleepy.mongoose
// start the server
> python httpd.py
...listening for connections on http://localhost:27080
// connect to MongoDB
> curl --data server=localhost:27017 http://localhost:27080/_connect
// request example
> http://localhost:27080/playshop/daily_charge/_find?criteria={}&limit=10&batch_size=10
{"ok": 1, "results": [{"_id": "...", "date": ...}, {"_id": ...}], "id": 0}
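The same request can be made from Python with only the standard library; a sketch mirroring the curl example above:

import json
import urllib
import urllib2

base = "http://localhost:27080/playshop/daily_charge/_find"
params = urllib.urlencode({
    "criteria": '{"date": "2011-02-12"}',
    "limit": 10,
    "batch_size": 10,
})
resp = json.load(urllib2.urlopen(base + "?" + params))
for doc in resp["results"]:
    print doc["_id"]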
  • 52. JSON: Mongo <---> Ajax • The Web UI gets JSON from sleepy.mongoose (REST interface) • The jQuery libraries and MongoDB are compatible • There is no need to write HTML tags (such as <table>) by hand
  • 53. Example: Web UI
  • 54. R and Mongo
  • 55. Collection: user_registration (the document shown on slide 43): we want to know the relations between user attributions such as playTerm, totalCumlativeCharge, firstCharge, and lastLogin.
  • 56. R Code: Access MongoDB Using sleepy.mongoose
##### LOAD LIBRARY #####
library(RCurl)
library(rjson)

##### CONF #####
today.str <- format(Sys.time(), "%Y-%m-%d")
url.base <- "http://localhost:27080"
mongo.db <- "playshop"
mongo.col <- "user_registration"
mongo.base <- paste(url.base, mongo.db, mongo.col, sep="/")
mongo.sort <- ""
mongo.limit <- "limit=100000"
mongo.batch <- "batch_size=100000"
  • 57. R Code: Access MongoDB Using sleepy.mongoose
##### FUNCTION #####
find <- function(url){
  mongo <- fromJSON(getURL(url))
  docs <- mongo$results   # sleepy.mongoose returns {"ok": ..., "results": [...]}
  makeTable(docs)         # my own helper
}

# Example
# Using sleepy.mongoose https://github.com/kchodorow/sleepy.mongoose
mongo.criteria <- '_find?criteria={"totalCumlativeCharge":{"$gt":0,"$lte":1000}}'
mongo.query <- paste(mongo.criteria, mongo.sort,
                     mongo.limit, mongo.batch, sep="&")
url <- paste(mongo.base, mongo.query, sep="/")
user.charge.low <- find(url)
  • 58. The Result
# Result: 10th document
[[10]]
[[10]]$playTerm
[1] 31

[[10]]$lastUpdate
[1] "2011-02-24"

[[10]]$userId
[1] "7777"

[[10]]$totalCumlativeCharge
[1] 10000

[[10]]$lastLogin
[1] "2011-02-21"

[[10]]$date
[1] "2011-01-22"

[[10]]$`_id`
[1] "2011-02-12+18790376+Registration"
...
  • 59. Make a Data Table from the Result
# Result: translate the documents into a table
      playTerm totalWinRate totalCumlativeCharge totalCommitNum totalWinNum
 [1,]       56           42                 1000            533         224
 [2,]       57           33                 1000            127          42
 [3,]       57           35                 1000            654         229
 [4,]       18           31                 1000             49          15
 [5,]       77           35                 1000            982         345
 [6,]       77           45                 1000            339         153
 [7,]       31           44                 1000             70          31
 [8,]       76           39                 1000            229          89
 [9,]       40           21                 1000            430          92
[10,]       26           40                 1000             25          10
...
  • 60. Scatter Plot / Matrix for Each Category (User Attribution)
# Run as a batch command
$ R --vanilla --quiet < mongo2R.R
  • 61. Munin and MongoDB
  • 62. Monitoring DB Stats • Munin configuration examples - MongoDB: • https://github.com/erh/mongo-munin • https://github.com/osinka/mongo-rs-munin
  • 63. My Future Analytic Architecture
  • 64. Realtime Analysis with Flume and MongoDB (diagram): access logs and user trace logs arrive in near realtime (hourly) through Flume into per-hour capped collections; trimming and filtering feed user_access and user_trace; MapReduce and modifier ($inc) sum-ups maintain daily/hourly_access and daily/hourly_trace.
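A sketch of that hourly sum-up under stated assumptions: the capped collection and summary names follow the diagram and slide 66, timestamps are the strings shown on slide 66, and a real implementation would tail the capped collection with a tailable cursor rather than re-query it:

from pymongo import Connection

db = Connection("localhost", 27017)["playshop"]

def sum_up_hour(hour_prefix):
    # e.g. hour_prefix = "Wed Feb 23 2011 21" selects one hour of records
    # from the capped collection the Flume plugin writes into.
    for rec in db.flume_capped_21.find(
            {"timestamp": {"$regex": "^" + hour_prefix}}):
        # Modifier operation: fold each record into the hourly summary.
        db.hourly_access.update(
            {"_id": hour_prefix + "+" + rec["hostname"]},
            {"$inc": {"count": 1}},
            upsert=True)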
  • 65. Flume (diagram): servers A-F ship access logs and user trace logs, hourly or in realtime, through a Flume collector with the Mongo plugin into MongoDB.
  • 66. An Output From the Mongo-Flume Plugin
> db.flume_capped_21.find().limit(1).forEach(printjson)
{
  "_id" : ObjectId("4d658187de9bd9f24323e1b6"),
  "timestamp" : "Wed Feb 23 2011 21:52:06 GMT+0000 (UTC)",
  "nanoseconds" : NumberLong("562387389278959"),
  "hostname" : "ip-10-131-27-115.ap-southeast-1.compute.internal",
  "priority" : "INFO",
  "message" : "202.32.107.42 - - [14/Feb/2011:04:30:32 +0900] \"GET /avatar2-gree.4d537100/res/swf/avatar/18051727/5/useravatar1582476746.swf?opensocial_app_id=472&opensocial_viewer_id=36858644&opensocial_owner_id=36858644 HTTP/1.1\" 200 33640 \"-\" \"DoCoMo/2.0 SH01C(c500;TB;W24H16)\"",
  "metadata" : {}
}
Mongo Flume plugin: https://github.com/mongodb/mongo-hadoop/tree/master/flume_plugin
  • 67. Summary
  • 68. Summary • Almighty as an analytic data server • schema-free: social game data keep changing • rich queries: important for analyzing data from many points of view • powerful aggregation: map reduce • mongo shell: analysis from the mongo shell is speedy and handy • More... • Scalability: replication and sharding are very easy to set up • Node.js: enables server-side scripting with Mongo
  • 69. My Presentations
• MongoDB UI: http://www.slideshare.net/doryokujin/mongodb-uimongodb
• MongoDB, Ajax, GraphDB: http://www.slideshare.net/doryokujin/mongodbajaxgraphdb-5774546
• Hadoop and MongoDB: http://www.slideshare.net/doryokujin/hadoopmongodb
• GraphDB: http://www.slideshare.net/doryokujin/graphdbgraphdb
  • 70. I ♥ MongoDB JP • I will continue to be an organizer of MongoDB JP • continue to propose many use cases of MongoDB • e.g. social data, log data, medical data, ... • support MongoDB users • through document translation, the user group, IRC, blogs, books, Twitter, ... • and boost services and products using MongoDB
  • 71. Thank you for coming to Mongo Tokyo!
[Contact me]
twitter: doryokujin
skype: doryokujin
mail: mr.stoicman@gmail.com
blog: http://d.hatena.ne.jp/doryokujin/
MongoDB JP: https://groups.google.com/group/mongodb-jp?hl=ja