Implementing and Visualizing Clickstream data with MongoDB


Having recently implemented a new framework for the real-time collection, aggregation and visualization of web- and mobile-generated clickstream traffic (daily clickstream volumes of 1M+ events), this walkthrough covers the motivations, thought process and key decisions made, as well as an in-depth look at how to build out a data-collection, analytics and visualization framework using MongoDB. Technologies covered in this presentation (as well as MongoDB) are Java, Spring, Django and PyMongo.

Published in: Technology

  1. 1. Implementing and Visualizing Click-Stream Data with MongoDB. Jan 22, 2013, New York MongoDB User Group. Cameron Sim
  2. 2. Agenda •  About LearnVest •  HL Application Architecture •  Data Capture •  Event Packaging •  MongoDB Data Warehousing •  Loading & Visualization •  Finishing up
  3. 3. LearnVest Inc.
      Mission Statement: Aiming to make financial planning as accessible as having a gym membership
      Company: Founded in 2008 by Alexa Von Tobel, CEO; 50+ people and growing rapidly; based in NYC
      Key Products: Account Aggregation and Management (Bank, Credit, Loan, Investment, Mort…); Original and Syndicated Newsletter Co…; Financial Planning (tiered product offering)
      Platforms: Web, iPhone
      Stack (Analytics): MongoDB 2.2.0 (3-node replica-set), Java 6, Spring 3
      Stack (Operational): Wordpress, Backbone.js, Node.js, Java Spring 3, Redis, Memcached, …
  4. 4. Web
  5. 5. iPhone
  6. 6. High Level Architecture (diagram). Labels recoverable from the slide: Production, Analytics, Delivery, Services, Services, Loaders, Dashbo…, HTTPS, pyMongo
  7. 7. Capture Everything
      Collection
      • User-driven events over web and mobile
      • System-level exceptions
      • Everything else
      Temporary Data
      • 'ok' with approximate data
      • Operational databases are the system of record
      Aggregate events as they come in
      • Remove the overhead of basic metrics (counts, sums) on core events
      • Group by user unique id and increment counts per event, over time dimensions (day, week-ending, month, year)
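
The "aggregate events as they come in" idea can be illustrated with a small in-memory sketch. This is not the presenter's code: the function names are hypothetical, and the choice of Sunday as the week-ending day is an assumption, since the slides do not say which day is used.

```python
from collections import Counter
from datetime import date, timedelta


def time_buckets(day: date) -> dict:
    """Return one bucket label per time dimension for the given day."""
    # Assumption: a week "ends" on Sunday; the slides do not specify this.
    week_ending = day + timedelta(days=6 - day.weekday())
    return {
        "day": day.isoformat(),
        "week": week_ending.isoformat(),
        "month": day.strftime("%Y-%m"),
        "year": day.strftime("%Y"),
    }


def record_event(counts: Counter, user_id: str, event_type: str, day: date) -> None:
    """Increment the per-user, per-event counter in every time dimension."""
    for dimension, bucket in time_buckets(day).items():
        counts[(user_id, event_type, dimension, bucket)] += 1


# Two logins by the same user on consecutive days of the same week.
counts = Counter()
record_event(counts, "u1", "login", date(2013, 1, 22))
record_event(counts, "u1", "login", date(2013, 1, 23))
```

Because every event touches one counter per dimension, reading a daily, weekly, monthly or yearly total later is a single lookup rather than a scan over raw events.
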
  8. 8. Data Capture: iOS
      - (void)sendAnalyticEventType:(NSString *)eventType
                             object:(NSString *)object
                               name:(NSString *)name
                               page:(NSString *)page
                             source:(NSString *)source
      {
          // params is not declared on the slide; a mutable dictionary is assumed here
          NSMutableDictionary *params = [NSMutableDictionary dictionary];
          NSMutableDictionary *eventData = [NSMutableDictionary dictionary];
          if (eventType != nil) [params setObject:eventType forKey:@"eventType"];
          if (object != nil) [eventData setObject:object forKey:@"object"];
          if (name != nil) [eventData setObject:name forKey:@"name"];
          if (page != nil) [eventData setObject:page forKey:@"page"];
          if (source != nil) [eventData setObject:source forKey:@"source"];
          if (eventData != nil) [params setObject:eventData forKey:@"eventData"];
          [[LVNetworkEngine sharedManager] analytics_send:params];
      }
  9. 9. Data Capture: WEB (JavaScript)
      function internalTrackPageView() {
          // UserContextCookie: cookie name (its quoting was lost in the transcript)
          var cookie = { userContext: jQuery.cookie(UserContextCookie) };
          var trackEvent = {
              eventType: "pageView",
              // the slide appends something after pathname ("+ …"); that part is missing from the transcript
              eventData: { page: window.location.pathname }
          };
          // AJAX
          jQuery.ajax({
              url: "/api/track",
              type: "POST",
              dataType: "json",
              data: JSON.stringify(trackEvent),
              // Set Request Headers
              beforeSend: function (xhr, settings) {
                  xhr.setRequestHeader("Accept", "application/json");
                  xhr.setRequestHeader("User-Context", cookie.userContext);
                  if (settings.type === "PUT" || settings.type === "POST") {
                      xhr.setRequestHeader("Content-Type", "application/json");
                  }
              }
          });
      }
  10. 10. Bus Event Packaging
      • Spring 3 RESTful service layer; controller methods define the eventCode via a @Tracking annotation
      • Custom Interceptor class extends HandlerInterceptorAdapter and implements …Handle() (for each event) to invoke calls via Spring @Async to an EventPublisher
      • EventPublisher publishes to a common event bus queue with multiple subscribers, one of which packages the eventPayload Map<String, Object> object and forwards it to the Analytics REST service
  11. 11. Bus Event Packaging: Spring RestController Methods
      Interface
          @RequestMapping(value = "/user/login", method = RequestMethod.POST, headers = "Accept=application/json")
          public Map<String, Object> userLogin(@RequestBody Map<String, Object> event, HttpServletRequest request);
      Concrete/Impl Class
          @Override
          @Tracking("user.login")
          public Map<String, Object> userLogin(@RequestBody Map<String, Object> event, HttpServletRequest request) {
              // Implementation
              return event;
          }
  12. 12. Bus Event Packaging: Custom Interceptor class extends HandlerInterceptorAdapter
      protected void handleTracking(String trackingCode, Map<String, Object> modelMap, HttpServletRequest request) {
          Map<String, Object> responseModel = new HashMap<String, Object>();
          // remove non-serializables and copy over data from modelMap
          try {
              this.eventPublisher.publish(trackingCode, responseModel, request);
          } catch (Exception e) {
              log.error("Error tracking event '" + trackingCode + "' : " + ExceptionUtils.getStackTrace(e));
          }
      }
  13. 13. Bus Event Packaging: Custom Interceptor class extends HandlerInterceptorAdapter
      public void publish(String eventCode, Map<String, Object> eventData, HttpServletRequest request) {
          Map<String, Object> payload = new HashMap<String, Object>();
          String eventId = UUID.randomUUID().toString();
          Map<String, String> requestMap = HttpRequestUtils.getRequestHeaders(request);
          // Normalize message
          // (the slide reads eventData.get("eventType") for all three keys; the matching keys are used here)
          payload.put("eventType", eventData.get("eventType"));
          payload.put("eventData", eventData.get("eventData"));
          payload.put("version", eventData.get("version"));
          payload.put("eventId", eventId);
          payload.put("eventTime", new Date());
          payload.put("request", requestMap);
          ...
      }
      // Send to the Analytics Service for MongoDB persistence
      public void sendPost(EventPayload payload) {
          HttpEntity request = new HttpEntity(payload.getEventPayload(), headers);
          Map m = restTemplate.postForObject(endpoint, request, java.util.Map.class);
      }
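
To make the normalization step concrete, here is a minimal Python sketch of the same envelope-building idea. It is an illustration only, not the service's implementation: `build_payload` is a hypothetical name, and the default version of "1.0" is an assumption taken from the sample JSON on the following slides.

```python
import uuid
from datetime import datetime, timezone


def build_payload(event_code, event, request_headers):
    """Wrap a raw client event in a normalized envelope before persistence."""
    return {
        "eventCode": event_code,
        "eventType": event.get("eventType"),
        "eventData": event.get("eventData", {}),
        # Assumption: fall back to "1.0" when the client sends no version.
        "version": event.get("version", "1.0"),
        "eventId": str(uuid.uuid4()),
        "eventTime": datetime.now(timezone.utc),
        # Copy the headers so later changes to the request do not leak in.
        "request": dict(request_headers),
    }


payload = build_payload(
    "user.login",
    {"eventType": "login", "eventData": {"page": "/login"}},
    {"call-source": "WEB"},
)
```

Generating the event id and timestamp on the server side keeps them consistent across web and mobile clients, whose clocks and id schemes may differ.
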
  14. 14. Bus Event Packaging: Serialized JSON (User Action)
      {
        "eventCode" : "user.login",
        "eventType" : "login",
        "version" : "1.0",
        "eventTime" : "1358603157746",
        "eventData" : { "" : "", "" : "", "" : "" },
        "request" : {
          "call-source" : "WEB",
          "user-context" : "00002b4f1150249206ac2b692e48ddb3",
          "user.agent" : "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_2) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.101 Safari/537.11",
          "cookie" : "size=4; CP.mode=B; PHPSESSID=c087908516ee2fae50cef6500101dc89; resolution=1920; JSESSIONID=56EB165266A2C4AFF946F139669D746F; csrftoken=73bdcdddf151dc56b8020855b2cb10c8",
          "content-length" : "204",
          "accept-encoding" : "gzip,deflate,sdch"
        }
      }
  15. 15. Bus Event Packaging: Serialized JSON (Generic Event)
      {
        "eventCode" : "generic.ui",
        "eventType" : "pageView",
        "version" : "1.0",
        "eventTime" : "1358603157746",
        "eventData" : {
          "page" : "/learnvest/moneycenter/inbox",
          "section" : "transactions",
          "name" : "view transactions",
          "object" : "page"
        },
        "request" : {
          "call-source" : "WEB",
          "user-context" : "00002b4f1150249206ac2b692e48ddb3",
          "user.agent" : "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_2) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.101 Safari/537.11",
          "cookie" : "size=4; CP.mode=B; PHPSESSID=c087908516ee2fae50cef6500101dc89; resolution=1920; JSESSIONID=56EB165266A2C4AFF946F139669D746F; csrftoken=73bdcdddf151dc56b8020855b2cb10c8",
          "content-length" : "204",
          "accept-encoding" : "gzip,deflate,sdch"
        }
      }
  16. 16. MongoDB Data Warehousing
      MongoDB Information
      • v2.2.0
      • 3-node replica-set
      • … Large (primary), 2x Medium (secondary) AWS Amazon-Linux machines with single 500GB EBS volumes mounted to /opt/data
      MongoDB Config File
      • dbpath = /opt/data/mongodb/data
      • rest = true
      • replSet = voyager
      Volumes
      • … events daily on web, ~600K on mobile
      • …B per day at start, slowed to ~1GB per day
      • Currently at 78GB (collecting since August 2012)
      Future Scaling Strategy
      • …p 2nd Replica-Set
      • …d replica-sets to n at 60% / 250GB per EBS volume
      • Shard key probably based on sequential mix of email_address and additional string
  17. 17. MongoDB Data Warehousing: WEB/MOBILE
      Persist all events, bucketed by source, event code and time:
      • WEB/MOBILE
      • user.login
      • time (day, week-ending, month, year)
      Insert into collection e_web / e_mobile
      Upsert into:
      • e_web_user_login_day
      • e_web_user_login_week
      • e_web_user_login_month
      • e_web_user_login_year
      Predictable model for scaling and measuring business growth
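
The naming scheme above is regular enough to derive mechanically. The following sketch shows one way to do that; the helper names are hypothetical, and replacing the dot in the event code with an underscore is inferred from names such as e_web_user_login_day rather than stated on the slide.

```python
DIMENSIONS = ("day", "week", "month", "year")


def raw_collection(source: str) -> str:
    """Name of the collection holding every raw event for a source."""
    return "e_" + source.lower()


def counter_collections(source: str, event_code: str) -> list:
    """Names of the per-dimension counter collections for one event code."""
    # Assumption: "user.login" becomes "user_login" in the collection name.
    prefix = raw_collection(source) + "_" + event_code.replace(".", "_")
    return [prefix + "_" + dimension for dimension in DIMENSIONS]


web_login = counter_collections("WEB", "user.login")
mobile_ui = counter_collections("MOBILE", "generic.ui")
```

Keeping the names derivable from (source, event code, dimension) means a new event type needs no schema work: its collections appear on the first upsert.
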
  18. 18. MongoDB Data Warehousing
      DBObject newDocument = new BasicDBObject().append("$inc", new BasicDBObject().append("count", 1));
      // update day dimension (arguments: query, update, upsert = true, multi = false)
      collection_day.update(new BasicDBObject().append("user-context", userContext)
          .append("eventType", eventType)
          .append("date", sdf_day.format(d)), newDocument, true, false);
      // update week dimension
      collection_week.update(new BasicDBObject().append("user-context", userContext)
          .append("eventType", eventType)
          .append("date", sdf_day.format(w)), newDocument, true, false);
      // update month dimension
      collection_month.update(new BasicDBObject().append("user-context", userContext)
          .append("eventType", eventType)
          .append("date", sdf_month.format(d)), newDocument, true, false);
      // update year dimension
      collection_year.update(new BasicDBObject().append("user-context", userContext)
          .append("eventType", eventType)
          .append("date", sdf_year.format(d)), newDocument, true, false);
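
For readers on the Python side of the stack, the same upsert-and-increment pattern can be sketched as plain query and update documents. This is an illustration under assumptions: `upsert_specs` is a hypothetical helper, the "MM/dd" day label mirrors the sample documents in the slides, the month and year label formats are guesses, and Sunday is assumed as the week-ending day.

```python
from datetime import date, timedelta


def upsert_specs(user_context: str, event_type: str, day: date) -> dict:
    """Build one (query, update) pair per time dimension for a single event."""
    # Assumption: weeks end on Sunday.
    week_ending = day + timedelta(days=6 - day.weekday())
    labels = {
        "day": day.strftime("%m/%d"),
        "week": week_ending.strftime("%m/%d"),
        "month": day.strftime("%Y/%m"),  # assumed format
        "year": day.strftime("%Y"),      # assumed format
    }
    specs = {}
    for dimension, label in labels.items():
        query = {"user-context": user_context, "eventType": event_type, "date": label}
        update = {"$inc": {"count": 1}}
        specs[dimension] = (query, update)
    return specs


specs = upsert_specs("c4ca4238a0b923820dcc509a6f75849b", "login", date(2013, 1, 2))
# With PyMongo, each pair would typically be applied as:
#   collection.update_one(query, update, upsert=True)
```

With `upsert=True` the first event for a given user, event type and bucket creates the document, and every later one increments `count` in place.
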
  19. 19. MongoDB Data Warehousing
      Collections (list is cut off in the transcript):
      • …ount_addManual_week
      • e_web_account_addManual_year
      • …_user_login_day
      • …_user_login_week
      • …_user_login_month
      • …_user_login_year
      • e_mobile_generic_ui_day
      • e_mobile_generic_ui_month
      • e_mobile_g…week
      • e_mobile_generic_ui_year
      db.e_web_user_login_day.find()
      { "_id" : ObjectId("50e4b9871b36921910222c42"), "count" : 5, "date" : "01/02", "user-context" : "c4ca4238a0b923820dcc509a6f75849b" }
      { "_id" : ObjectId("50cd6cfcb9a80a2b4ee21422"), "count" : 7, "date" : "01/02", "user-context" : "c4ca4238a0b923820dcc509a6f75849b" }
      { "_id" : ObjectId("50cd6e51b9a80a2b4ee21427"), "count" : 2, "date" : "01/02", "user-context" : "c4ca4238a0b923820dcc509a6f75849b" }
      { "_id" : ObjectId("50e4b9871b36921910222c42"), "count" : 3, "date" : "01/03", "user-context" : "50e49a561b36921910222c33" }
  20. 20. MongoDB Data Warehousing
      (the beginning of this document is cut off in the transcript)
      … "accept-charset" : "ISO-8859-1,utf-8;q=0.7,*;q=0.3", "cookie" : "size=4; CP.mode=B; PHPSESSID=c087908516ee2fae50cef6500101dc89; resolution=1920; JSESSIONID=56EB165266A2C4AFF946F139669D746F; csrftoken=73bdcdddf151dc56b8020855b2cb10c8", "content-length" : "255", "accept-encoding" : "gzip,deflate,sdch" },
      "eventType" : "flick",
      "eventData" : { "object" : "…on", "name" : "split transaction button", "page" : "#inbox/79876/", "section" : "…saction_river_details" } }
  21. 21. MongoDB Data Warehousing: Indexing Strategy
      • Indexes on core collections (e_web and e_mobile) come in under 3GB on the 7.5GB Large instance and 3.75GB on Medium instances
      • … datetime in two fields and compound index on date with other fields like eventType and unique id (user-context)
      • Heavy insertion rates, much lower read …; the fewer indexes the better
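
The two-field date approach can be sketched as follows. Only `created_date` and `request.user-context` appear in the slides (in the index listing); the `created_datetime` field name and the `stamp` helper are hypothetical, and truncating to midnight UTC is an assumption about how the date-only field is derived.

```python
from datetime import datetime, timezone


def stamp(event: dict, now: datetime = None) -> dict:
    """Return a copy of the event carrying both a full timestamp and a day-level date."""
    now = now or datetime.now(timezone.utc)
    stamped = dict(event)
    # Full-precision timestamp, for ordering events within a day.
    stamped["created_datetime"] = now
    # Day-level value (midnight UTC), cheap to match exactly in a compound index.
    stamped["created_date"] = now.replace(hour=0, minute=0, second=0, microsecond=0)
    return stamped


# Compound index keys in the (field, direction) form PyMongo's create_index accepts.
USER_DATE_INDEX = [("request.user-context", 1), ("created_date", 1)]

original = {"eventType": "login"}
stamped = stamp(original, datetime(2013, 1, 22, 15, 30, tzinfo=timezone.utc))
```

Indexing the day-level field rather than the full timestamp keeps equality lookups such as "all logins for this user today" on an exact-match path, which suits a write-heavy collection with few indexes.
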
  22. 22. MongoDB Data Warehousing: Indexing Strategy
      db.e_web.getIndexes()
      [
        { "v" : 1, "key" : { "request.user-context" : 1, "created_date" : 1 }, "ns" : "moneycenter.e_web", "name" : "request.user-context_1_created_date_1" },
        { "v" : 1, "key" : { "eventData.name" : 1, "created_date" : 1 }, "ns" : "moneycenter.e_web", "name" : "eventData.name_1_created_date_1" }
      ]
  23. 23. Loading & Visualization
      Objective
      • Show historic and intraday stats on core use cases (logins, conversions)
      • Show user funnel rates on conversion pages
      • Show general usability: how do users really use the Web and iOS platforms?
      Non-Functionals
      • Intraday doesn't need to be "real-time"; polling is good enough for now
      • Overnight batch job for historic must scale horizontally
      General Implementation Strategy
      • Do all heavy lifting and object manipulation; the UI should just display a graph or table
      • Modularize the service to be able to regenerate any graphs/tables without a full load
  24. 24. Loading & Visualization: Java Batch Service
      Use the Java Mongo library to query key collections and return user counts and sum of events
      DBCursor webUserLogins = c.find(new BasicDBObject("date", sdf.format(new Date())));
      private HashMap<String, Object> getSumAndCount(DBCursor cursor) {
          HashMap<String, Object> m = new HashMap<String, Object>();
          int sum = 0;
          int count = 0;
          DBObject obj;
          while (cursor.hasNext()) {
              obj = (DBObject) cursor.next(); // the slide omits cursor.next()
              count++;
              sum = sum + (Integer) obj.get("count");
          }
          m.put("sum", sum);
          m.put("count", count);
          m.put("average", sdf.format(new Float(sum) / count));
          return m;
      }
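
The sum-and-count step is easy to check in isolation. Below is a minimal Python equivalent that works on any iterable of counter documents, such as a PyMongo cursor or a plain list; `sum_and_count` is a hypothetical name, and returning 0.0 as the average for an empty result is an assumption, since the slide does not handle that case.

```python
def sum_and_count(docs) -> dict:
    """Return the number of documents (users), their total count and the average."""
    total = 0
    users = 0
    for doc in docs:
        users += 1
        total += doc["count"]
    # Assumption: report 0.0 rather than failing when no documents match.
    average = total / users if users else 0.0
    return {"sum": total, "count": users, "average": round(average, 2)}


# Counter documents shaped like the e_web_user_login_day samples.
sample = [{"count": 5}, {"count": 7}, {"count": 2}]
stats = sum_and_count(sample)
```

Each document in a day collection represents one user, so the document count is the number of distinct users and the summed `count` field is the number of events.
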
  25. 25. Loading & Visualization: Java Batch Service
      Use the Aggregation Framework where required on core collections (e_web) and externa…
      // create aggregation objects
      DBObject project = new BasicDBObject("$project", new BasicDBObject("day_value", fields));
      DBObject day_value = new BasicDBObject("day_value", "$day_value");
      DBObject groupFields = new BasicDBObject("_id", day_value);
      // create the fields to group by, in this case "number"
      groupFields.put("number", new BasicDBObject("$sum", 1));
      // create the group
      DBObject group = new BasicDBObject("$group", groupFields);
      // execute
      AggregationOutput output = mycollection.aggregate(project, group);
      for (DBObject obj : output.results()) { … }
  26. 26. Loading & Visualization: Java Batch Service
      MongoDB command-line example of aggregation over a time period, e.g. month
      db.e_web.aggregate([
        { $match : { created_date : { $gt : ISODate("2012-10-25T00:00:00") } } },
        { $project : { day_value : { "day" : { $dayOfMonth : "$created_date" }, "month" : { $month : "$created_date" } } } },
        { $group : { _id : { day_value : "$day_value" }, number : { $sum : 1 } } },
        { $sort : { "_id.day_value" : -1 } }
      ])
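
The same pipeline can be expressed as Python dictionaries for use from PyMongo. This is a sketch, not the talk's code: `daily_count_pipeline` and `daily_counts` are hypothetical names, and the pure-Python function is included only as a way to reason about what the pipeline is expected to return.

```python
from collections import Counter
from datetime import datetime


def daily_count_pipeline(since: datetime) -> list:
    """Build the match / project / group / sort stages as plain dictionaries."""
    return [
        {"$match": {"created_date": {"$gt": since}}},
        {"$project": {"day_value": {
            "day": {"$dayOfMonth": "$created_date"},
            "month": {"$month": "$created_date"},
        }}},
        {"$group": {"_id": {"day_value": "$day_value"}, "number": {"$sum": 1}}},
        {"$sort": {"_id.day_value": -1}},
    ]


def daily_counts(events, since: datetime) -> dict:
    """Pure-Python equivalent: events per (month, day) after the cutoff."""
    return dict(Counter(
        (e["created_date"].month, e["created_date"].day)
        for e in events
        if e["created_date"] > since
    ))


since = datetime(2012, 10, 25)
pipeline = daily_count_pipeline(since)
events = [
    {"created_date": datetime(2012, 10, 24, 9)},   # before the cutoff, ignored
    {"created_date": datetime(2012, 10, 26, 9)},
    {"created_date": datetime(2012, 10, 26, 17)},
    {"created_date": datetime(2012, 11, 1, 8)},
]
expected = daily_counts(events, since)
# With PyMongo the pipeline would typically be run as: collection.aggregate(pipeline)
```

Building the stages in a function keeps the cutoff date a parameter, so the same pipeline can serve both the intraday poll and the overnight historic job.
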
  27. 27. Loading & Visualization: Java Batch Service
      Persisting events into graph and table collections
      db.homeGraphs.find()
      { "_id" : ObjectId("50f57b5c1d4e714b581674e2"), "accounts_natural" : 54, "accounts_total" : 54, "date" : ISODate("2011-02-06T05:00:00Z"), "linked_rate" : ….96, "premium_rate" : 0, "str_date" : "2011,01,06", "upgrade_rate" : 0, "users_avg_linked" : 3.43, "users_linked" : 7 }
      { "_id" : ObjectId("50f57b5c1d4e714b581674e3"), "accounts_natural" : 144, "accounts_total" : 144, "date" : ISODate("2011-02-07T05:00:00Z"), "linked_rate" : ….11, "premium_rate" : 0, "str_date" : "2011,01,07", "upgrade_rate" : 0, "users_avg_linked" : 4, "users_linked" : 16 }
      { "_id" : ObjectId("50f57b5c1d4e714b581674e4"), "accounts_natural" : 119, "accounts_total" : 119, "date" : ISODate("2011-02-08T05:00:00Z"), "linked_rate" : ….13, "premium_rate" : 0, "str_date" : "2011,01,08", "upgrade_rate" : 0, "users_avg_linked" : 4.5, "users_linked" : 18 }
  28. 28. Loading & Visualization: … day numbers
      try:
          conn = pymongo.Connection("localhost", 27017)
          db = conn["lvanalytics"]
          cursor = db.accountmetrics.find({"date": {"$gte": dt_from, "$lte": dt_to}}).sort("date")
          return buildMetricsDict(cursor)
      except Exception as e:
          logger.error(e.message)
      Return the graph object (as a list or a dict of lists) to the view that called the method
      pagedata = {}
      pagedata["accountsGraph"] = mongodb_home.getHomeChart()
      return render_to_response("home.html", {"pagedata": pagedata}, context_instance=RequestContext(request))
  29. 29. Loading & Visualization: Django and HighCharts
      Populate the series (JavaScript with Django templating)
      seriesOptions[0] = {
          id: "naturalAccounts",
          name: "Natural Accounts",
          data: [ {% for a in pagedata.metrics.accounts_natural %} {% if not forloop.first %}, {% endif %} [Date.UTC({{a.0}}),{{a.1}}] {% endfor %} ],
          tooltip: { valueDecimals: 2 }
      };
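
The template expects each data point as a pair whose first element can be dropped straight into `Date.UTC(...)`. A possible way to prepare such pairs on the Python side is sketched below. The helper names are hypothetical, and the assumption that the first element is a "year,month,day" string with a zero-based month is inferred from the `str_date` values in the homeGraphs sample (ISODate 2011-02-06 paired with "2011,01,06"), not stated in the slides.

```python
from datetime import datetime


def date_utc_args(d: datetime) -> str:
    """Format a date as 'YYYY,MM,DD' with a zero-based month, as Date.UTC expects."""
    return f"{d.year},{d.month - 1:02d},{d.day:02d}"


def series_pairs(docs, field: str) -> list:
    """Return (date string, value) pairs sorted by date for one metric field."""
    ordered = sorted(docs, key=lambda doc: doc["date"])
    return [(date_utc_args(doc["date"]), doc[field]) for doc in ordered]


docs = [
    {"date": datetime(2011, 2, 7, 5), "accounts_natural": 144},
    {"date": datetime(2011, 2, 6, 5), "accounts_natural": 54},
]
pairs = series_pairs(docs, "accounts_natural")
```

Doing the date formatting and sorting in the batch or view layer matches the stated strategy of keeping the UI limited to displaying a graph or table.
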
  30. 30. Loading & Visualization: Django and HighCharts. And create the charts and tables...
  31. 31. Loading & Visualization: Django and HighCharts. And create the charts and tables...
  32. 32. Lessons Learned • Date-time managed as two fields, Datetime and Date • Aggregating and upserting documents as events are received works for us • Real-time Map-Reduce in pyMongo: too slow, don't do this • Django-nonrel: unstable; use Django and configure MongoDB as a datastore only • Memcached on Django is good enough (at the moment); use django-celery with RabbitMQ to pre-cache all data after data loading • HighCharts is buggy; considering D3 and other libraries • Don't need to retrieve data directly from MongoDB to Django; perhaps provide all data via a service layer (at the expense of ever-additional features in pyMongo)
  33. 33. Next Steps • A/B testing framework, experiments and variances • Unauthenticated / authenticated user tracking • Provide data async over service layer • Segmentation with graphical libraries like D3 Cross-Filter • Saving query criteria, expanding out BI tools for internal users • MongoDB Connector, Hadoop and Hive (maybe Tableau and other tools) • Storm / Kafka for real-time analytics processing • Shard the Replica-Set, looking into Gizzard as the middleware
  34. 34. Hrishi Dixit, Chief Technology Officer • Kevin Connelly, Director of Engineering • Will Larche, Lead iOS Developer • Cameron Sim, Director of Analytics Tech • Jeremy Brennan, Director of UI/UX Technology • (your name here), New Awesome Developer • HIR…