MongoDB ClickStream and Visualization

Implementing ClickStream Analytics with Spring, Java, MongoDB and Django

Transcript

  • 1. Implementing and Visualizing Click-Stream Data with MongoDB
      Jan 22, 2013 - New York MongoDB User Group
      Cameron Sim - LearnVest.com
      Monday, April 15, 13
  • 2. Agenda
        • About LearnVest
        • HL Application Architecture
        • Data Capture
        • Event Packaging
        • MongoDB Data Warehousing
        • Loading & Visualization
        • Finishing up
  • 3. LearnVest Inc. (www.learnvest.com)
      Mission Statement
        • Aiming to make financial planning as accessible as having a gym membership
      Company
        • Founded in 2008 by Alexa Von Tobel, CEO
        • 50+ people and growing rapidly
        • Based in NYC
      Key Products
        • Account Aggregation and Management (Bank, Credit, Loan, Investment, Mortgage)
        • Original and Syndicated Newsletter Content
        • Financial Planning
      Platforms
        • Web & iPhone (tiered product offering)
      Stack (Operational)
        • Wordpress, Backbone.js, Node.js
        • Java Spring 3, Redis, Memcached, MongoDB, ActiveMQ, Nginx, MySQL 5.x
      Stack (Analytics)
        • MongoDB 2.2.0 (3-node replica-set)
        • Java 6, Spring 3
        • pyMongo
        • Django 1.4
  • 4. LearnVest.com Web
  • 5. LearnVest.com iPhone
  • 6. High Level Architecture (diagram)
      Tiers: Production Platform, Delivery Services, Analytics Services, Loaders & Dashboards
      Stages: Data Collection, Event Packaging, MongoDB Warehousing, Loading & Visualization
      Connection labels: HTTPS, pyMongo, MongoDB Java Conn, MongoDB Replication, JDBC
  • 13. Philosophy For Data Collection
      Capture Everything
        • User-driven events over web and mobile
        • System-level exceptions
        • Everything else
      Temporary Data
        • Be 'ok' with approximate data
        • Operational databases are the system of record
      Aggregate events as they come in
        • Remove the overhead of basic metrics (counts, sums) on core events
        • Group by user unique id and increment counts per event, over time-dimensions (day, week-ending, month, year)
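The "aggregate events as they come in" idea can be sketched in a few lines of Python. This is only an illustration, not the LearnVest implementation: the key formats, the in-memory `Counter`, and the choice of Sunday as the week-ending day are all assumptions, since the slides do not specify them.

```python
from collections import Counter
from datetime import date, timedelta


def dimension_keys(d: date) -> dict:
    """Return one bucket key per time dimension for the given date."""
    # Assumption: a week "ends" on Sunday; the slides do not say which day.
    week_ending = d + timedelta(days=6 - d.weekday())
    return {
        "day": d.isoformat(),
        "week": week_ending.isoformat(),
        "month": d.strftime("%Y-%m"),
        "year": d.strftime("%Y"),
    }


def record_event(counts: Counter, user_id: str, event_code: str, d: date) -> None:
    """Increment the per-user counter in every time dimension."""
    for dimension, key in dimension_keys(d).items():
        counts[(user_id, event_code, dimension, key)] += 1


# Two logins by the same user on consecutive days of the same week.
counts = Counter()
record_event(counts, "u1", "user.login", date(2013, 1, 2))
record_event(counts, "u1", "user.login", date(2013, 1, 3))
```

Because the counters are updated on arrival, basic metrics such as logins per user per week can be read back directly instead of being recomputed from the raw events.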
  • 14. Data Capture: iOS
        - (void)sendAnalyticEventType:(NSString *)eventType
                               object:(NSString *)object
                                 name:(NSString *)name
                                 page:(NSString *)page
                               source:(NSString *)source
        {
            // Top-level request parameters (the declaration is missing from the slide)
            NSMutableDictionary *params = [NSMutableDictionary dictionary];
            // Optional descriptive fields, nested under "eventData"
            NSMutableDictionary *eventData = [NSMutableDictionary dictionary];
            if (eventType != nil) [params setObject:eventType forKey:@"eventType"];
            if (object != nil) [eventData setObject:object forKey:@"object"];
            if (name != nil) [eventData setObject:name forKey:@"name"];
            if (page != nil) [eventData setObject:page forKey:@"page"];
            if (source != nil) [eventData setObject:source forKey:@"source"];
            if (eventData != nil) [params setObject:eventData forKey:@"eventData"];
            [[LVNetworkEngine sharedManager] analytics_send:params];
        }
  • 15. Data Capture: Web (JavaScript)
        function internalTrackPageView() {
            var cookie = {
                // UserContextCookie holds the cookie name and is defined elsewhere
                userContext: jQuery.cookie(UserContextCookie)
            };
            var trackEvent = {
                eventType: "pageView",
                eventData: {
                    page: window.location.pathname + window.location.search
                }
            };
            // AJAX
            jQuery.ajax({
                url: "/api/track",
                type: "POST",
                dataType: "json",
                data: JSON.stringify(trackEvent),
                // Set request headers
                beforeSend: function (xhr, settings) {
                    xhr.setRequestHeader("Accept", "application/json");
                    xhr.setRequestHeader("User-Context", cookie.userContext);
                    if (settings.type === "PUT" || settings.type === "POST") {
                        xhr.setRequestHeader("Content-Type", "application/json");
                    }
                }
            });
        }
  • 16. Bus Event Packaging
      1. Spring 3 RESTful service layer: controller methods define the eventCode via the @Tracking annotation
      2. A custom interceptor class extends HandlerInterceptorAdapter and implements postHandle() (for each event) to invoke calls via Spring @Async to an EventPublisher
      3. The EventPublisher publishes to a common event bus queue with multiple subscribers, one of which packages the eventPayload Map<String, Object> object and forwards it to the Analytics REST service
  • 17. Bus Event Packaging: 1) Spring RestController Methods
      Interface
        @RequestMapping(value = "/user/login", method = RequestMethod.POST,
                        headers = "Accept=application/json")
        public Map<String, Object> userLogin(@RequestBody Map<String, Object> event,
                                             HttpServletRequest request);
      Concrete/Impl Class
        @Override
        @Tracking("user.login")
        public Map<String, Object> userLogin(@RequestBody Map<String, Object> event,
                                             HttpServletRequest request) {
            // Implementation
            return event;
        }
  • 18. Bus Event Packaging: 2) Custom interceptor class extends HandlerInterceptorAdapter
        protected void handleTracking(String trackingCode, Map<String, Object> modelMap,
                                      HttpServletRequest request) {
            Map<String, Object> responseModel = new HashMap<String, Object>();
            // remove non-serializables & copy over data from modelMap
            try {
                this.eventPublisher.publish(trackingCode, responseModel, request);
            } catch (Exception e) {
                // A tracking failure is logged and never propagated to the caller
                log.error("Error tracking event " + trackingCode + " : "
                          + ExceptionUtils.getStackTrace(e));
            }
        }
  • 19. Bus Event Packaging: 3) EventPublisher
        public void publish(String eventCode, Map<String, Object> eventData,
                            HttpServletRequest request) {
            Map<String, Object> payload = new HashMap<String, Object>();
            String eventId = UUID.randomUUID().toString();
            Map<String, String> requestMap = HttpRequestUtils.getRequestHeaders(request);
            // Normalize message (the slide read "eventType" for all three fields)
            payload.put("eventType", eventData.get("eventType"));
            payload.put("eventData", eventData.get("eventData"));
            payload.put("version", eventData.get("version"));
            payload.put("eventId", eventId);
            payload.put("eventTime", new Date());
            payload.put("request", requestMap);
            . . .
            // Send to the Analytics Service for MongoDB persistence
        }

        public void sendPost(EventPayload payload) {
            HttpEntity request = new HttpEntity(payload.getEventPayload(), headers);
            Map m = restTemplate.postForObject(endpoint, request, java.util.Map.class);
        }
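The annotation, interceptor, and publisher chain from slides 16 to 19 can be approximated in Python as a decorator plus a tiny in-process bus. This is a rough sketch under stated assumptions: `SUBSCRIBERS`, `tracking`, and `publish` are hypothetical names, the bus is synchronous (the original uses Spring `@Async` and a queue), and the `"1.0"` version default is invented for the example.

```python
import uuid
from datetime import datetime, timezone

SUBSCRIBERS = []  # hypothetical in-process stand-in for the event bus queue


def publish(event_code, event_data, request_headers):
    """Normalize an event into one payload and hand it to every subscriber."""
    payload = {
        "eventCode": event_code,
        "eventType": event_data.get("eventType"),
        "eventData": event_data.get("eventData", {}),
        "version": event_data.get("version", "1.0"),  # assumed default
        "eventId": str(uuid.uuid4()),
        "eventTime": datetime.now(timezone.utc),
        "request": dict(request_headers),
    }
    for subscriber in SUBSCRIBERS:
        subscriber(payload)
    return payload


def tracking(event_code):
    """Decorator playing the role of the tracking annotation plus interceptor."""
    def wrap(handler):
        def wrapped(event, request_headers):
            result = handler(event, request_headers)
            try:
                publish(event_code, result, request_headers)
            except Exception as exc:  # tracking must never break the request
                print("Error tracking event %s: %s" % (event_code, exc))
            return result
        return wrapped
    return wrap


@tracking("user.login")
def user_login(event, request_headers):
    return event


received = []
SUBSCRIBERS.append(received.append)
user_login({"eventType": "login", "eventData": {}}, {"call-source": "WEB"})
```

As in the Java version, the handler's return value is unaffected by tracking, and a failing subscriber only produces a log line.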
  • 20. Bus Event Packaging: The Serialized JSON (User Action)
        {
          "eventCode" : "user.login",
          "eventType" : "login",
          "version" : "1.0",
          "eventTime" : "1358603157746",
          "eventData" : { "" : "", "" : "", "" : "" },
          "request" : {
            "call-source" : "WEB",
            "user-context" : "00002b4f1150249206ac2b692e48ddb3",
            "user.agent" : "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_2) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.101 Safari/537.11",
            "cookie" : "size=4; CP.mode=B; PHPSESSID=c087908516ee2fae50cef6500101dc89; resolution=1920; JSESSIONID=56EB165266A2C4AFF946F139669D746F; csrftoken=73bdcdddf151dc56b8020855b2cb10c8",
            "content-length" : "204",
            "accept-encoding" : "gzip,deflate,sdch"
          }
        }
  • 21. Bus Event Packaging: The Serialized JSON (Generic Event)
        {
          "eventCode" : "generic.ui",
          "eventType" : "pageView",
          "version" : "1.0",
          "eventTime" : "1358603157746",
          "eventData" : {
            "page" : "/learnvest/moneycenter/inbox",
            "section" : "transactions",
            "name" : "view transactions",
            "object" : "page"
          },
          "request" : {
            "call-source" : "WEB",
            "user-context" : "00002b4f1150249206ac2b692e48ddb3",
            "user.agent" : "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_2) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.101 Safari/537.11",
            "cookie" : "size=4; CP.mode=B; PHPSESSID=c087908516ee2fae50cef6500101dc89; resolution=1920; JSESSIONID=56EB165266A2C4AFF946F139669D746F; csrftoken=73bdcdddf151dc56b8020855b2cb10c8",
            "content-length" : "204",
            "accept-encoding" : "gzip,deflate,sdch"
          }
        }
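In the serialized events, `eventTime` is a string of epoch milliseconds. A consumer might decode it as below; this is a minimal sketch, and the helper name `event_datetime` and the trimmed-down sample document are assumptions made for the example.

```python
import json
from datetime import datetime, timedelta, timezone

# Trimmed copy of the generic event; only the fields used here are kept.
raw = (
    '{"eventCode": "generic.ui", "eventType": "pageView", "version": "1.0", '
    '"eventTime": "1358603157746", '
    '"eventData": {"page": "/learnvest/moneycenter/inbox"}}'
)

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def event_datetime(event: dict) -> datetime:
    """Convert the string of epoch milliseconds into an aware UTC datetime."""
    millis = int(event["eventTime"])
    # timedelta arithmetic avoids float rounding of the millisecond part
    return EPOCH + timedelta(milliseconds=millis)


event = json.loads(raw)
when = event_datetime(event)
```

For the sample value this yields 13:45:57.746 UTC on January 19, 2013, which is consistent with the January 2013 talk date.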
  • 23. MongoDB Data Warehousing
      MongoDB Information
        • v2.2.0
        • 3-node replica-set
        • 1 Large (primary), 2x Medium (secondary) AWS Amazon-Linux machines
        • Each with a single 500GB EBS volume mounted to /opt/data
      MongoDB Config File
        dbpath = /opt/data/mongodb/data
        rest = true
        replSet = voyager
      Volumes
        • ~1M events daily on web, ~600K on mobile
        • 2-3 GB per day at start, slowed to ~1GB per day
        • Currently at 78GB (collecting since August 2012)
      Future Scaling Strategy
        • Set up a 2nd replica-set
        • Shard replica-sets to n at 60% / 250GB per EBS volume
        • Shard key probably based on a sequential mix of email_address & an additional string
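The volume figures above allow a rough back-of-the-envelope projection of when the 250GB sharding threshold would be reached. The sketch below assumes constant linear growth at the quoted ~1GB per day, which the slides do not claim; `days_until_threshold` is a hypothetical helper.

```python
def days_until_threshold(current_gb, threshold_gb, growth_gb_per_day):
    """Days of headroom left before the data set reaches the threshold."""
    remaining = threshold_gb - current_gb
    if remaining <= 0:
        return 0.0
    return remaining / growth_gb_per_day


# 78GB today, 250GB threshold per EBS volume, ~1GB of new data per day.
days = days_until_threshold(78, 250, 1.0)
```

Under these assumptions there would be about 172 days of headroom, so roughly half a year from the date of the talk.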
  • 24. MongoDB Data Warehousing: Approach
      1. Persist all events, bucketed by source:
           WEB
           MOBILE
      2. Persist all events, bucketed by source, event code and time:
           WEB/MOBILE
           user.login
           time (day, week-ending, month, year)
      3. Insert into collection e_web / e_mobile
      4. Upsert into:
           e_web_user_login_day
           e_web_user_login_week
           e_web_user_login_month
           e_web_user_login_year
      5. Predictable model for scaling and measuring business growth
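The collection naming scheme in this approach is regular enough to generate. The following sketch derives the names from a source and an event code; the function names are hypothetical, and the rule of replacing dots with underscores is inferred from names such as e_web_user_login_day rather than stated in the slides.

```python
DIMENSIONS = ("day", "week", "month", "year")


def raw_collection(source: str) -> str:
    """Collection holding every raw event for a source, e.g. e_web."""
    return "e_" + source.lower()


def rollup_collections(source: str, event_code: str) -> list:
    """Per-dimension counter collections for one source and event code."""
    base = raw_collection(source) + "_" + event_code.replace(".", "_")
    return [base + "_" + dimension for dimension in DIMENSIONS]


names = rollup_collections("WEB", "user.login")
```

Because names are derived mechanically, a new event code gets its four rollup collections without any schema change.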
  • 25. MongoDB Data Warehousing: 2. Persist all events, bucketed by source, event code and time
        // instantiate collections dynamically
        DBCollection collection_day = mongodb.getCollection(eventCode + "_day");
        DBCollection collection_week = mongodb.getCollection(eventCode + "_week");
        DBCollection collection_month = mongodb.getCollection(eventCode + "_month");
        DBCollection collection_year = mongodb.getCollection(eventCode + "_year");

        // {$inc: {count: 1}}; with upsert=true the document is created on first sight
        BasicDBObject newDocument = new BasicDBObject().append("$inc",
            new BasicDBObject().append("count", 1));

        // update day dimension (arguments: query, update, upsert=true, multi=false)
        collection_day.update(new BasicDBObject().append("user-context", userContext)
            .append("eventType", eventType)
            .append("date", sdf_day.format(d)), newDocument, true, false);
        // update week dimension
        collection_week.update(new BasicDBObject().append("user-context", userContext)
            .append("eventType", eventType)
            .append("date", sdf_day.format(w)), newDocument, true, false);
        // update month dimension
        collection_month.update(new BasicDBObject().append("user-context", userContext)
            .append("eventType", eventType)
            .append("date", sdf_month.format(d)), newDocument, true, false);
        // update year dimension
        collection_year.update(new BasicDBObject().append("user-context", userContext)
            .append("eventType", eventType)
            .append("date", sdf_year.format(d)), newDocument, true, false);
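The four near-identical upserts above differ only in collection suffix and date key, so they can be expressed as data. The sketch below builds the query and update documents without touching a database; `build_upserts` and the sample date keys are assumptions for illustration. With PyMongo 3 or later, each spec could then be applied as `db[name].update_one(query, update, upsert=True)`.

```python
def build_upserts(base_name, user_context, event_type, date_keys):
    """Return (collection name, query, update) for each time dimension."""
    update = {"$inc": {"count": 1}}  # same increment for every dimension
    specs = []
    for dimension, date_key in date_keys.items():
        query = {
            "user-context": user_context,
            "eventType": event_type,
            "date": date_key,
        }
        specs.append((base_name + "_" + dimension, query, update))
    return specs


# Hypothetical date keys; the slides only show the day format ("01/02").
specs = build_upserts(
    "e_web_user_login",
    "c4ca4238a0b923820dcc509a6f75849b",
    "login",
    {"day": "01/02", "month": "01", "year": "2013"},
)
```

Keeping the specs as plain data makes the bucketing logic testable without a running MongoDB instance.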
  • 26. MongoDB Data Warehousing: Persist all events, bucketed by source, event code and time
        > show collections
        e_mobile
        e_web
        e_web_account_addManual_day
        e_web_account_addManual_month
        e_web_account_addManual_week
        e_web_account_addManual_year
        e_web_user_login_day
        e_web_user_login_week
        e_web_user_login_month
        e_web_user_login_year
        e_mobile_generic_ui_day
        e_mobile_generic_ui_month
        e_mobile_generic_ui_week
        e_mobile_generic_ui_year
        > db.e_web_user_login_day.find()
        { "_id" : ObjectId("50e4b9871b36921910222c42"), "count" : 5, "date" : "01/02", "user-context" : "c4ca4238a0b923820dcc509a6f75849b" }
        { "_id" : ObjectId("50cd6cfcb9a80a2b4ee21422"), "count" : 7, "date" : "01/02", "user-context" : "c4ca4238a0b923820dcc509a6f75849b" }
        { "_id" : ObjectId("50cd6e51b9a80a2b4ee21427"), "count" : 2, "date" : "01/02", "user-context" : "c4ca4238a0b923820dcc509a6f75849b" }
        { "_id" : ObjectId("50e4b9871b36921910222c42"), "count" : 3, "date" : "01/03", "user-context" : "50e49a561b36921910222c33" }
  • 27. MongoDB Data Warehousing: Persist all events
        > db.e_web.findOne()
        {
          "_id" : ObjectId("50e4a1ab0364f55ed07c2662"),
          "created_datetime" : ISODate("2013-01-02T21:07:55.656Z"),
          "created_date" : ISODate("2013-01-02T00:00:00.000Z"),
          "request" : {
            "content-type" : "application/json",
            "connection" : "keep-alive",
            "accept-language" : "en-US,en;q=0.8",
            "host" : "localhost:8080",
            "call-source" : "WEB",
            "accept" : "*/*",
            "user-context" : "c4ca4238a0b923820dcc509a6f75849b",
            "origin" : "chrome-extension://fdmmgilgnpjigdojojpjoooidkmcomcm",
            "user-agent" : "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_2) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.101 Safari/537.11",
            "accept-charset" : "ISO-8859-1,utf-8;q=0.7,*;q=0.3",
            "cookie" : "size=4; CP.mode=B; PHPSESSID=c087908516ee2fae50cef6500101dc89; resolution=1920; JSESSIONID=56EB165266A2C4AFF946F139669D746F; csrftoken=73bdcdddf151dc56b8020855b2cb10c8",
            "content-length" : "255",
            "accept-encoding" : "gzip,deflate,sdch"
          },
          "eventType" : "flick",
          "eventData" : {
            "object" : "button",
            "name" : "split transaction button",
            "page" : "#inbox/79876/",
            "section" :
  • 28. MongoDB Data Warehousing: Indexing Strategy
        • Indexes on core collections (e_web and e_mobile) come in under 3GB, against 7.5GB of memory on the Large instance and 3.75GB on the Medium instances
        • Split datetime into two fields and build compound indexes on date with other fields like eventType and user unique id (user-context)
        • Heavy insertion rates, much lower read rates, so the fewer indexes the better
  • 29. MongoDB Data Warehousing: Indexing Strategy
        > db.e_web.getIndexes()
        [
          {
            "v" : 1,
            "key" : { "request.user-context" : 1, "created_date" : 1 },
            "ns" : "moneycenter.e_web",
            "name" : "request.user-context_1_created_date_1"
          },
          {
            "v" : 1,
            "key" : { "eventData.name" : 1, "created_date" : 1 },
            "ns" : "moneycenter.e_web",
            "name" : "eventData.name_1_created_date_1"
          }
        ]
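The "split datetime into two fields" strategy can be sketched as follows. The helper name `with_date_fields` is hypothetical; the field names and index keys are taken from the slides. The key lists use the (field, direction) pair format that PyMongo's `create_index` accepts, but no database call is made here.

```python
from datetime import datetime


def with_date_fields(event: dict, now: datetime) -> dict:
    """Stamp an event with the full timestamp and its midnight-truncated date."""
    doc = dict(event)
    doc["created_datetime"] = now
    # Truncated copy: equal for all events of a day, so it indexes and groups well
    doc["created_date"] = now.replace(hour=0, minute=0, second=0, microsecond=0)
    return doc


# Compound index keys matching the getIndexes() output above.
INDEX_KEYS = [
    [("request.user-context", 1), ("created_date", 1)],
    [("eventData.name", 1), ("created_date", 1)],
]

doc = with_date_fields({"eventType": "flick"}, datetime(2013, 1, 2, 21, 7, 55, 656000))
```

Queries for "all events by this user on this day" can then use an equality match on both indexed fields instead of a range scan over timestamps.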
  • 30. Loading & Visualization
      Objective
        • Show historic and intraday stats on core use cases (logins, conversions)
        • Show user funnel rates on conversion pages
        • Show general usability: how do users really use the Web and iOS platforms?
      Non-Functionals
        • Intraday doesn't need to be "real-time"; polling is good enough for now
        • Overnight batch job for historic data must scale horizontally
      General Implementation Strategy
        • Do all heavy lifting & object manipulation up front; the UI should just display a graph or table
        • Modularize the service to be able to regenerate any graphs/tables without a full load
  • 31. Loading & Visualization: Java Batch Service
      Java Mongo library to query key collections and return user counts and sum of events
        DBCursor webUserLogins = c.find(
            new BasicDBObject("date", sdf.format(new Date())));

        private HashMap<String, Object> getSumAndCount(DBCursor cursor) {
            HashMap<String, Object> m = new HashMap<String, Object>();
            int sum = 0;    // total events across all users
            int count = 0;  // number of per-user documents, i.e. distinct users
            DBObject obj;
            while (cursor.hasNext()) {
                obj = cursor.next();
                count++;
                sum = sum + (Integer) obj.get("count");
            }
            m.put("sum", sum);
            m.put("count", count);
            // mean events per user; as written on the slide, sdf is reused here
            // and must be a number formatter rather than a date formatter
            m.put("average", sdf.format(new Float(sum) / count));
            return m;
        }
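For readers following along in Python, the same sum-and-count reduction might look like the sketch below. It works on any iterable of documents, including a PyMongo cursor; the guard against an empty result and the rounding to two decimals are additions not present in the Java code.

```python
def sum_and_count(docs):
    """Total events, number of per-user documents, and mean events per user."""
    total = 0
    users = 0
    for doc in docs:
        users += 1
        total += doc["count"]
    # Guard against an empty cursor, which the Java version does not handle
    average = round(total / users, 2) if users else 0.0
    return {"sum": total, "count": users, "average": average}


# Counts taken from the sample e_web_user_login_day documents.
stats = sum_and_count([{"count": 5}, {"count": 7}, {"count": 2}, {"count": 3}])
```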
  • 32. Loading & Visualization: Java Batch Service
      Use Aggregation Framework where required on core collections (e_web) and external data
        // create aggregation objects
        DBObject project = new BasicDBObject("$project",
            new BasicDBObject("day_value", fields));
        DBObject day_value = new BasicDBObject("day_value", "$day_value");
        DBObject groupFields = new BasicDBObject("_id", day_value);
        // create the fields to group by, in this case "number"
        groupFields.put("number", new BasicDBObject("$sum", 1));
        // create the group
        DBObject group = new BasicDBObject("$group", groupFields);
        // execute
        AggregationOutput output = mycollection.aggregate(project, group);
        for (DBObject obj : output.results()) {
            . .
        }
  • 33. Loading & Visualization: Java Batch Service
      MongoDB command line example of aggregation over a time period, e.g. month
        > db.e_web.aggregate([
            { $match : { created_date : { $gt : ISODate("2012-10-25T00:00:00") } } },
            { $project : { day_value : { "day" : { $dayOfMonth : "$created_date" },
                                         "month" : { $month : "$created_date" } } } },
            { $group : { _id : { day_value : "$day_value" }, number : { $sum : 1 } } },
            // after $group the key lives under _id, so sort on _id.day_value
            { $sort : { "_id.day_value" : -1 } }
          ])
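To make the pipeline stages concrete, here is a plain-Python emulation of what match, project, group, and sort compute over a list of event documents. It is only a teaching sketch with a hypothetical function name, not a replacement for running the aggregation in MongoDB.

```python
from collections import Counter
from datetime import datetime


def events_per_day(events, since):
    """Count events per (month, day) after a cutoff, newest bucket first."""
    counts = Counter()
    for event in events:
        created = event["created_date"]
        if created > since:                            # $match
            counts[(created.month, created.day)] += 1  # $project + $group
    # $sort descending; like the pipeline, this ignores the year, so results
    # that span a year boundary are not in true chronological order
    return sorted(counts.items(), reverse=True)


events = [
    {"created_date": datetime(2012, 10, 20)},  # before the cutoff, dropped
    {"created_date": datetime(2012, 10, 26)},
    {"created_date": datetime(2012, 10, 26)},
    {"created_date": datetime(2012, 10, 27)},
]
daily = events_per_day(events, datetime(2012, 10, 25))
```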
  • 34. Loading & Visualization: Java Batch Service
      Persisting events into graph and table collections
        > db.homeGraphs.find()
        { "_id" : ObjectId("50f57b5c1d4e714b581674e2"), "accounts_natural" : 54, "accounts_total" : 54, "date" : ISODate("2011-02-06T05:00:00Z"), "linked_rate" : "12.96", "premium_rate" : "0", "str_date" : "2011,01,06", "upgrade_rate" : "0", "users_avg_linked" : "3.43", "users_linked" : 7 }
        { "_id" : ObjectId("50f57b5c1d4e714b581674e3"), "accounts_natural" : 144, "accounts_total" : 144, "date" : ISODate("2011-02-07T05:00:00Z"), "linked_rate" : "11.11", "premium_rate" : "0", "str_date" : "2011,01,07", "upgrade_rate" : "0", "users_avg_linked" : "4", "users_linked" : 16 }
        { "_id" : ObjectId("50f57b5c1d4e714b581674e4"), "accounts_natural" : 119, "accounts_total" : 119, "date" : ISODate("2011-02-08T05:00:00Z"), "linked_rate" :
  • 35. Loading & Visualization: Django and HighCharts
      Extract data (pyMongo)
        def getHomeChart(dt_from, dt_to):
            """Called by home method to get latest 30 day numbers"""
            try:
                conn = pymongo.Connection("localhost", 27017)
                db = conn["lvanalytics"]
                cursor = db.accountmetrics.find(
                    {"date": {"$gte": dt_from, "$lte": dt_to}}).sort("date")
                return buildMetricsDict(cursor)
            except Exception as e:
                logger.error(e.message)
      Return the graph object (as a list or a dict of lists) to the view that called the method
        pagedata = {}
        # the slide omits the date-range arguments that getHomeChart requires
        pagedata["accountsGraph"] = mongodb_home.getHomeChart(dt_from, dt_to)
        return render_to_response("home.html", {"pagedata": pagedata},
                                  context_instance=RequestContext(request))
  • 36. Loading & Visualization: Django and HighCharts
      Populate the series (JavaScript with Django templating)
        seriesOptions[0] = {
            id: "naturalAccounts",
            name: "Natural Accounts",
            data: [
                {% for a in pagedata.metrics.accounts_natural %}
                {% if not forloop.first %}, {% endif %}
                [Date.UTC({{a.0}}), {{a.1}}]
                {% endfor %}
            ],
            tooltip: { valueDecimals: 2 }
        };
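The template above expects each data point as a pair whose first element can be dropped into `Date.UTC(...)`. One way the loader could prepare such pairs is sketched below; the function names are hypothetical, and the format is inferred from the `str_date` values in the homeGraphs documents ("2011,01,06" for February 6, 2011), which reflect that JavaScript's `Date.UTC` counts months from 0.

```python
from datetime import datetime


def date_utc_args(d: datetime) -> str:
    """Format a date as the year,month,day argument string for Date.UTC."""
    # JavaScript months are zero-based, so February becomes 01
    return "%d,%02d,%02d" % (d.year, d.month - 1, d.day)


def series_points(docs, field):
    """Build (Date.UTC argument string, value) pairs for one metric."""
    return [(date_utc_args(doc["date"]), doc[field]) for doc in docs]


docs = [
    {"date": datetime(2011, 2, 6, 5, 0), "accounts_natural": 54},
    {"date": datetime(2011, 2, 7, 5, 0), "accounts_natural": 144},
]
points = series_points(docs, "accounts_natural")
```

Precomputing the pairs in the loader keeps the template free of date logic, in line with the "UI should just display" strategy.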
  • 37. Loading & Visualization: Django and HighCharts
      And create the charts and tables...
  • 39. Lessons Learned
        • Date time managed as two fields, Datetime and Date
        • Aggregating and upserting documents as events are received works for us
        • Real-time Map-Reduce in pyMongo: too slow, don't do this
        • Django-noRel: unstable; use Django and configure MongoDB as a datastore only
        • Memcached on Django is good enough (at the moment); use django-celery with rabbitmq to pre-cache all data after data loading
        • HighCharts is buggy; considering D3 & other libraries
        • Don't need to retrieve data directly from MongoDB to Django; perhaps provide all data via a service layer (at the expense of ever-additional features in pyMongo)
  • 40. Next Steps
        • A/B testing framework, experiments and variances
        • Unauthenticated / authenticated user tracking
        • Provide data async over service layer
        • Segmentation with graphical libraries like D3 & Cross-Filter (http://square.github.com/crossfilter/)
        • Saving query criteria, expanding out BI tools for internal users
        • MongoDB Connector, Hadoop and Hive (maybe Tableau and other tools)
        • Storm / Kafka for real-time analytics processing
        • Shard the replica-set, looking into Gizzard as the middleware
  • 41. Thanks & Questions
        • Hrishi Dixit, Chief Technology Officer, hrishi@learnvest.com
        • Kevin Connelly, Director of Engineering, kevin@learnvest.com
        • Will Larche, Lead iOS Developer, will@learnvest.com
        • Jeremy Brennan, Director of UI/UX Technology, jeremy@learnvest.com
        • Cameron Sim, Director of Analytics Tech, cameron@learnvest.com
        • <your name here>, New Awesome Developer, you@learnvest.com (HIRED!)