Open analytics | Cameron Sim
Open analytics | Cameron Sim Presentation Transcript

  • 1. Building a scalable analytics platform for personal financial planning
    May 23, 2013 - Open Analytics
    Cameron Sim - RoundArchIsobar (www.isobar.com)
    Wednesday, May 22, 13
  • 2. Agenda
    • About LearnVest
    • Architecture
    • Data Capture
    • Packaging
    • Data Warehousing
    • Metrics
    • Finishing up
  • 3. LearnVest Inc. (www.learnvest.com)
    Company
    • Founded in 2008 by Alexa von Tobel, CEO
    • 50+ people and growing rapidly
    • Based in NYC
    Platforms
    • Web & iPhone
    Mission Statement
    • “Aiming to make financial planning as accessible as having a gym membership”
    Key Products
    • Account Aggregation and Management (Bank, Credit, Loan, Investment, Mortgage)
    • Original and Syndicated Newsletter Content
    • Financial Planning (tiered product offering)
    Stack
    • Operational: WordPress, Backbone.js, Node.js, Java Spring 3, Redis, Memcached, MongoDB, ActiveMQ, Nginx, MySQL 5.x
    • Analytics: MongoDB 2.2.0, Hadoop, Pig, Java 6, Spring 3, pyMongo, Django 1.4
  • 4. LearnVest.com: Web
  • 5. LearnVest.com: iPhone
  • 6. Conversion Funnels
    Channels: Web, iOS, Tele-sale (scheduled call)
    Stages: Account Creation, Free Assessment, Paid Product
  • 7. Component Architecture
    Diagram areas: Analytics, Production
  • 8. High Level Architecture
    Analytics
    • Services & Event Capture
    • Aggregation & Indexed Search
    • Tools & Dashboards
    Production
    • Production Services
    Flow labels: Event Capture, Update User, Run Aggregations, Reports, Stats & Data Science
  • 13. Philosophy For Data Collection
    Capture Everything
    • User-driven events over web and mobile
    • System-level exceptions
    • Everything else
    Temporary Data
    • Be ‘ok’ with approximate data
    • Operational databases are the system of record
    Aggregate events as they come in
    • Remove the overhead of basic metrics (counts, sums) on core events
    • Group by user unique id and increment counts per event, over time dimensions (day, week-ending, month, year)
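The "aggregate events as they come in" idea can be sketched in a few lines of Python. This is only an in-memory illustration, not the LearnVest implementation: the function names, the tuple key layout, and the choice of Sunday as the week-ending day are all assumptions, since the slides do not specify them.

```python
from collections import Counter
from datetime import date, timedelta


def time_buckets(day):
    """Return the four time-dimension keys for a calendar day."""
    # Assumption: a week ends on Sunday; the slides do not say which day is used.
    week_ending = day + timedelta(days=6 - day.weekday())
    return {
        "day": day.isoformat(),
        "week_ending": week_ending.isoformat(),
        "month": day.strftime("%Y-%m"),
        "year": day.strftime("%Y"),
    }


def record_event(counts, user_id, event_code, day):
    """Increment one counter per (user, event, dimension, bucket) key."""
    for dimension, bucket in time_buckets(day).items():
        counts[(user_id, event_code, dimension, bucket)] += 1


counts = Counter()
record_event(counts, "u1", "user.login", date(2013, 5, 22))
record_event(counts, "u1", "user.login", date(2013, 5, 23))
```

Because every event touches only four counters, basic counts per user are available without scanning the raw event collections.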
  • 14. Philosophy For Data Collection
    Logical Separation
    Events
    • Core use cases (forms, conversion paths)
    • UI actions (button clicks, swipes, views, forms)
    • HttpRequest-level analysis (user-agent, iOS version upgrades etc.)
    User
    • Has a status/rating (Account Creation, Linked Bank Account, Paid Products)
    • Source and conversion path (how was the user acquired)
    • Quantified actions (user completed x, y, z conversion actions, when and how?)
    • Social interactions (Facebook, Twitter)
    • Email interactions (stats & emails for support@learnvest.com)
  • 15. Data Capture
    iOS
        - (void)sendAnalyticEventType:(NSString *)eventType
                               object:(NSString *)object
                                 name:(NSString *)name
                                 page:(NSString *)page
                               source:(NSString *)source
        {
            // params is the request body; it is declared locally here as an assumption,
            // since the slide uses it without showing a declaration.
            NSMutableDictionary *params = [NSMutableDictionary dictionary];
            // eventData holds the optional detail fields of the event.
            NSMutableDictionary *eventData = [NSMutableDictionary dictionary];
            if (eventType != nil) [params setObject:eventType forKey:@"eventType"];
            if (object != nil) [eventData setObject:object forKey:@"object"];
            if (name != nil) [eventData setObject:name forKey:@"name"];
            if (page != nil) [eventData setObject:page forKey:@"page"];
            if (source != nil) [eventData setObject:source forKey:@"source"];
            [params setObject:eventData forKey:@"eventData"];
            [[LVNetworkEngine sharedManager] analytics_send:params];
        }
  • 16. Data Capture
    Web (JavaScript)
        function internalTrackPageView() {
            var cookie = {
                userContext: jQuery.cookie(UserContextCookie)
            };
            var trackEvent = {
                eventType: "pageView",
                eventData: {
                    page: window.location.pathname + window.location.search
                }
            };
            // AJAX
            jQuery.ajax({
                url: "/api/track",
                type: "POST",
                dataType: "json",
                data: JSON.stringify(trackEvent),
                // Set request headers
                beforeSend: function (xhr, settings) {
                    xhr.setRequestHeader("Accept", "application/json");
                    xhr.setRequestHeader("User-Context", cookie.userContext);
                    if (settings.type === "PUT" || settings.type === "POST") {
                        xhr.setRequestHeader("Content-Type", "application/json");
                    }
                }
            });
        }
  • 17. Bus Event Packaging
    1. Spring 3 RESTful service layer; controller methods define the eventCode via a @Tracking annotation
    2. A custom interceptor class extends HandlerInterceptorAdapter and implements postHandle() (for each event) to invoke calls via Spring @Async to an EventPublisher
    3. The EventPublisher publishes to a common event bus queue with multiple subscribers, one of which packages the eventPayload Map<String, Object> object and forwards it to the Analytics REST service
  • 18. Bus Event Packaging
    1) Spring RestController Methods
    Interface
        @RequestMapping(value = "/user/login", method = RequestMethod.POST,
                        headers = "Accept=application/json")
        public Map<String, Object> userLogin(@RequestBody Map<String, Object> event,
                                             HttpServletRequest request);
    Concrete/Impl Class
        @Override
        @Tracking("user.login")
        public Map<String, Object> userLogin(@RequestBody Map<String, Object> event,
                                             HttpServletRequest request) {
            // Implementation
            return event;
        }
  • 19. Bus Event Packaging
    2) Custom interceptor class extends HandlerInterceptorAdapter
        protected void handleTracking(String trackingCode, Map<String, Object> modelMap,
                                      HttpServletRequest request) {
            Map<String, Object> responseModel = new HashMap<String, Object>();
            // remove non-serializables & copy over data from modelMap
            try {
                this.eventPublisher.publish(trackingCode, responseModel, request);
            } catch (Exception e) {
                log.error("Error tracking event " + trackingCode + " : "
                        + ExceptionUtils.getStackTrace(e));
            }
        }
  • 20. Bus Event Packaging
    3) EventPublisher, called from the custom interceptor
        public void publish(String eventCode, Map<String, Object> eventData,
                            HttpServletRequest request) {
            Map<String, Object> payload = new HashMap<String, Object>();
            String eventId = UUID.randomUUID().toString();
            Map<String, String> requestMap = HttpRequestUtils.getRequestHeaders(request);
            // Normalize message: each payload field is read from its own key
            payload.put("eventCode", eventCode);
            payload.put("eventType", eventData.get("eventType"));
            payload.put("eventData", eventData.get("eventData"));
            payload.put("version", eventData.get("version"));
            payload.put("eventId", eventId);
            payload.put("eventTime", new Date());
            payload.put("request", requestMap);
            ...
            // Send to the Analytics Service for MongoDB persistence
        }

        public void sendPost(EventPayload payload) {
            HttpEntity request = new HttpEntity(payload.getEventPayload(), headers);
            Map m = restTemplate.postForObject(endpoint, request, java.util.Map.class);
        }
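As a compact illustration of the normalization step, here is a minimal Python sketch of wrapping a raw event in the envelope fields named on the slides. The function name, the default version, the millisecond timestamp, and the lower-casing of header names are assumptions made for the example; the production code is the Java publisher.

```python
import time
import uuid


def build_payload(event_code, event, request_headers):
    """Wrap a raw event in the envelope fields shown on the slides."""
    return {
        "eventCode": event_code,
        "eventType": event.get("eventType"),
        "eventData": event.get("eventData", {}),
        "version": event.get("version", "1.0"),  # assumed default
        "eventId": str(uuid.uuid4()),
        "eventTime": int(time.time() * 1000),  # epoch milliseconds (assumed unit)
        # Assumption: header names are lower-cased so queries rely on one spelling.
        "request": {k.lower(): v for k, v in request_headers.items()},
    }


payload = build_payload(
    "generic.ui",
    {"eventType": "pageView", "eventData": {"page": "/learnvest/moneycenter/inbox"}},
    {"Call-Source": "WEB", "User-Context": "00002b4f1150249206ac2b692e48ddb3"},
)
```

Keeping the envelope identical for every event type is what lets one subscriber persist user actions and generic UI events through the same path.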
  • 21. Bus Event Packaging
    The Serialized JSON (User Action)
        {
          "eventCode" : "user.login",
          "eventType" : "login",
          "version" : "1.0",
          "eventTime" : "1358603157746",
          "eventData" : {
            "" : "",
            "" : "",
            "" : ""
          },
          "request" : {
            "call-source" : "WEB",
            "user-context" : "00002b4f1150249206ac2b692e48ddb3",
            "user.agent" : "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_2) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.101 Safari/537.11",
            "cookie" : "size=4; CP.mode=B; PHPSESSID=c087908516ee2fae50cef6500101dc89; resolution=1920; JSESSIONID=56EB165266A2C4AFF946F139669D746F; csrftoken=73bdcdddf151dc56b8020855b2cb10c8",
            "content-length" : "204",
            "accept-encoding" : "gzip,deflate,sdch"
          }
        }
  • 22. Bus Event Packaging
    The Serialized JSON (Generic Event)
        {
          "eventCode" : "generic.ui",
          "eventType" : "pageView",
          "version" : "1.0",
          "eventTime" : "1358603157746",
          "eventData" : {
            "page" : "/learnvest/moneycenter/inbox",
            "section" : "transactions",
            "name" : "view transactions",
            "object" : "page"
          },
          "request" : {
            "call-source" : "WEB",
            "user-context" : "00002b4f1150249206ac2b692e48ddb3",
            "user.agent" : "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_2) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.101 Safari/537.11",
            "cookie" : "size=4; CP.mode=B; PHPSESSID=c087908516ee2fae50cef6500101dc89; resolution=1920; JSESSIONID=56EB165266A2C4AFF946F139669D746F; csrftoken=73bdcdddf151dc56b8020855b2cb10c8",
            "content-length" : "204",
            "accept-encoding" : "gzip,deflate,sdch"
          }
        }
  • 24. Event Data Warehousing
    MongoDB Information
    • v2.2.0
    • 3-node replica set
    • 1 Large (primary), 2x Medium (secondary) AWS Amazon Linux machines
    • Each with a single 500GB EBS volume mounted to /opt/data
    MongoDB Config File
        dbpath = /opt/data/mongodb/data
        rest = true
        replSet = voyager
    Volumes
    • ~1M events daily on web, ~600K on mobile
    • 2-3 GB per day at start, slowed to ~1GB per day
    • Currently at 78GB (collecting since August 2012)
    Future Scaling Strategy
    • Set up a 2nd replica set in a new AWS region
    • Not intending to shard; data is archived after 12 months instead
  • 25. Event Data Warehousing
    Approach
    1. Persist all events, bucketed by source:
       • WEB
       • MOBILE
    2. Persist all events, bucketed by source, event code and time:
       • WEB/MOBILE
       • user.login
       • time (day, week-ending, month, year)
    3. Insert into collection e_web / e_mobile
    4. Also insert into daily, weekly and monthly collections for the main payload and the HTTP request payload
       • e_web_05232013
       • e_web_request_05232013
    5. Predictable model for scaling and measuring business growth
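The daily collection names above follow a simple pattern, which the short Python sketch below reproduces. The helper name is hypothetical, and only the daily suffix (MMDDYYYY) is covered because the slides do not show how weekly and monthly collections are named.

```python
from datetime import date


def target_collections(source, day):
    """List the core and daily collections one event is written to."""
    base = "e_" + source.lower()      # e_web or e_mobile
    suffix = day.strftime("%m%d%Y")   # 05232013 for May 23, 2013
    return [base, base + "_" + suffix, base + "_request_" + suffix]


names = target_collections("WEB", date(2013, 5, 23))
```

Deriving names from the event's source and date keeps each dated collection small and makes old data easy to archive by dropping or exporting whole collections.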
  • 26. Event Data Warehousing
    Persist all events
        > db.e_web.findOne()
        {
          "_id" : ObjectId("50e4a1ab0364f55ed07c2662"),
          "created_datetime" : ISODate("2013-01-02T21:07:55.656Z"),
          "created_date" : ISODate("2013-01-02T00:00:00.000Z"),
          "request" : {
            "content-type" : "application/json",
            "connection" : "keep-alive",
            "accept-language" : "en-US,en;q=0.8",
            "host" : "localhost:8080",
            "call-source" : "WEB",
            "accept" : "*/*",
            "user-context" : "c4ca4238a0b923820dcc509a6f75849b",
            "origin" : "chrome-extension://fdmmgilgnpjigdojojpjoooidkmcomcm",
            "user-agent" : "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_2) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.101 Safari/537.11",
            "accept-charset" : "ISO-8859-1,utf-8;q=0.7,*;q=0.3",
            "cookie" : "size=4; CP.mode=B; PHPSESSID=c087908516ee2fae50cef6500101dc89; resolution=1920; JSESSIONID=56EB165266A2C4AFF946F139669D746F; csrftoken=73bdcdddf151dc56b8020855b2cb10c8",
            "content-length" : "255",
            "accept-encoding" : "gzip,deflate,sdch"
          },
          "eventType" : "flick",
          "eventData" : {
            "object" : "button",
            "name" : "split transaction button",
            "page" : "#inbox/79876/",
            "section" :
  • 27. Event Data Warehousing
    Access Pattern
    • No reads off the primary node; insert only
    • Indexes on core collections (e_web and e_mobile) come in under 3GB, on a 7.5GB Large instance and 3.75GB Medium instances
    • Split datetime into two fields, and put a compound index on date with other fields like eventType and user unique id (user-context)
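Splitting the timestamp into two fields can be sketched as follows. The field names created_datetime and created_date come from the sample document in the deck; the helper itself is a hypothetical illustration in Python rather than the Java service code.

```python
from datetime import datetime


def with_date_fields(event, now):
    """Return a copy of the event with both timestamp fields set."""
    doc = dict(event)
    doc["created_datetime"] = now
    # Truncate to midnight so equality matches and compound indexes work per day.
    doc["created_date"] = now.replace(hour=0, minute=0, second=0, microsecond=0)
    return doc


doc = with_date_fields({"eventType": "flick"}, datetime(2013, 1, 2, 21, 7, 55, 656000))
```

With a midnight-truncated created_date, a per-day query becomes an exact match on an indexed field instead of a range scan over full timestamps.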
  • 28. Event Data Warehousing
    Indexing Strategy
        > db.e_web.getIndexes()
        [
          {
            "v" : 1,
            "key" : { "request.user-context" : 1, "created_date" : 1 },
            "ns" : "moneycenter.e_web",
            "name" : "request.user-context_1_created_date_1"
          },
          {
            "v" : 1,
            "key" : { "eventData.name" : 1, "created_date" : 1 },
            "ns" : "moneycenter.e_web",
            "name" : "eventData.name_1_created_date_1"
          }
        ]
  • 29. User Data Warehousing
    Elastic Search (http://www.elasticsearch.org/)
    • Open-source Lucene cluster
    • Mature query language, accessed via REST API
    • Unstructured schema and feature rich
    • Strong API support
    Configuration
    • Single instance for user
    • Deployed over 3 EC2 Medium AML instances
    • Updated by a Java process checking a Redis cache for uuids
    • Accessed by multiple applications for canonical user objects
  • 30. User Data Warehousing
    Building the User Object
    For each userid in the Redis cache, retrieve information from the following sources:
    • ODS Slave (LearnVest data)
    • Jotform.com (eform submissions)
    • FullSlate.com (calendar appointments)
    • Stripe.com (payments)
    • Desk.com (emails)
    Build a canonical JSON object and save it in the Elasticsearch cluster
        Map<String, String> source = new HashMap<String, String>();
        source.put(...);
        client.execute(new Index.Builder(source).index("Users"));
  • 31. Metrics
    Objective
    • Show historic and intraday stats on core use cases (logins, conversions)
    • Show user funnel rates on conversion pages
    • Show general usability: how do users really use the Web and iOS platforms?
    Non-Functionals
    • Intraday doesn’t need to be “real-time”; polling is good enough for now
    • Overnight batch job for historic data must scale horizontally
    General Implementation Strategy
    • Do all heavy lifting & object manipulation in the service; the UI should just display a graph or table
    • Modularize the service to be able to regenerate any graphs/tables without a full load
  • 32. Metrics
    Java Batch Service
    Java Mongo library to query key collections and return user counts and sum of events
        DBCursor webUserLogins = c.find(new BasicDBObject("date", sdf.format(new Date())));

        private HashMap<String, Object> getSumAndCount(DBCursor cursor) {
            HashMap<String, Object> m = new HashMap<String, Object>();
            int sum = 0;
            int count = 0;
            DBObject obj;
            while (cursor.hasNext()) {
                obj = (DBObject) cursor.next();
                count++;
                sum = sum + (Integer) obj.get("count");
            }
            m.put("sum", sum);
            m.put("count", count);
            // Plain float average; 0 when the cursor is empty
            m.put("average", count == 0 ? 0f : (float) sum / count);
            return m;
        }
  • 33. Metrics
    Java Batch Service
    Use the Aggregation Framework where required on core collections (e_web) and external data
        // create aggregation objects
        DBObject project = new BasicDBObject("$project", new BasicDBObject("day_value", fields));
        DBObject day_value = new BasicDBObject("day_value", "$day_value");
        DBObject groupFields = new BasicDBObject("_id", day_value);
        // create the fields to group by, in this case "number"
        groupFields.put("number", new BasicDBObject("$sum", 1));
        // create the group
        DBObject group = new BasicDBObject("$group", groupFields);
        // execute
        AggregationOutput output = mycollection.aggregate(project, group);
        for (DBObject obj : output.results()) {
            ..
        }
  • 34. Metrics
    Java Batch Service
    MongoDB command-line example of aggregation over a time period, e.g. a month
        > db.e_web.aggregate([
            { $match : { created_date : { $gt : ISODate("2012-10-25T00:00:00") } } },
            { $project : { day_value : { "day" : { $dayOfMonth : "$created_date" },
                                         "month" : { $month : "$created_date" } } } },
            { $group : { _id : { day_value : "$day_value" }, number : { $sum : 1 } } },
            // after $group the day/month pair lives under _id, so sort on that path
            { $sort : { "_id.day_value" : -1 } }
          ])
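To make the pipeline stages concrete, the following pure-Python sketch performs the same match, project, group and sort steps over an in-memory list. It is an illustration only: the function name and the sample events are made up, and a real job would run the pipeline inside MongoDB.

```python
from collections import Counter
from datetime import datetime


def events_per_day(events, since):
    """Count events per (month, day) for events created after `since`."""
    counts = Counter(
        (e["created_date"].month, e["created_date"].day)   # $project + $group key
        for e in events
        if e["created_date"] > since                        # $match with $gt
    )
    # Newest (month, day) first, mirroring a descending sort on the group key.
    return sorted(counts.items(), reverse=True)


events = [
    {"created_date": datetime(2012, 10, 24)},
    {"created_date": datetime(2012, 10, 26)},
    {"created_date": datetime(2012, 10, 26)},
    {"created_date": datetime(2012, 11, 1)},
]
result = events_per_day(events, datetime(2012, 10, 25))
```

Note that grouping on day and month alone merges the same calendar day from different years, so a window longer than twelve months would also need the year in the group key.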
  • 35. Metrics
    Java Batch Service
    Persisting events into graph and table collections
        > db.homeGraphs.find()
        { "_id" : ObjectId("50f57b5c1d4e714b581674e2"), "accounts_natural" : 54,
          "accounts_total" : 54, "date" : ISODate("2011-02-06T05:00:00Z"),
          "linked_rate" : "12.96", "premium_rate" : "0", "str_date" : "2011,01,06",
          "upgrade_rate" : "0", "users_avg_linked" : "3.43", "users_linked" : 7 }
        { "_id" : ObjectId("50f57b5c1d4e714b581674e3"), "accounts_natural" : 144,
          "accounts_total" : 144, "date" : ISODate("2011-02-07T05:00:00Z"),
          "linked_rate" : "11.11", "premium_rate" : "0", "str_date" : "2011,01,07",
          "upgrade_rate" : "0", "users_avg_linked" : "4", "users_linked" : 16 }
        { "_id" : ObjectId("50f57b5c1d4e714b581674e4"), "accounts_natural" : 119,
          "accounts_total" : 119, "date" : ISODate("2011-02-08T05:00:00Z"),
          "linked_rate" :
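The linked_rate values in the two complete rows are consistent with users_linked divided by accounts_total, expressed as a percentage with two decimals. That formula is inferred from the sample output rather than stated in the slides, so the Python sketch below is a plausible reconstruction with a hypothetical function name.

```python
def linked_rate(users_linked, accounts_total):
    """Percentage of accounts with a linked bank account, as a 2-decimal string."""
    if accounts_total == 0:
        return "0"
    return "%.2f" % (100.0 * users_linked / accounts_total)


first = linked_rate(7, 54)      # first sample row
second = linked_rate(16, 144)   # second sample row
```

Storing the precomputed rate as a string matches the sample documents and leaves the chart layer with nothing to calculate.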
  • 36. Metrics
    Django and HighCharts
    Extract data (pyMongo)
        def getHomeChart(dt_from, dt_to):
            """Called by home method to get latest 30 day numbers"""
            try:
                conn = pymongo.Connection('localhost', 27017)
                db = conn['lvanalytics']
                cursor = db.accountmetrics.find(
                    {"date": {"$gte": dt_from, "$lte": dt_to}}).sort("date")
                return buildMetricsDict(cursor)
            except Exception as e:
                logger.error(e.message)
    Return the graph object (as a list or a dict of lists) to the view that called the method
        pagedata = {}
        pagedata['accountsGraph'] = mongodb_home.getHomeChart(dt_from, dt_to)
        return render_to_response('home.html', {'pagedata': pagedata},
                                  context_instance=RequestContext(request))
  • 37. Metrics
    Django and HighCharts
    Populate the series (JavaScript with Django templating)
        seriesOptions[0] = {
            id: 'naturalAccounts',
            name: "Natural Accounts",
            data: [
                {% for a in pagedata.metrics.accounts_natural %}
                {% if not forloop.first %}, {% endif %}
                [Date.UTC({{a.0}}), {{a.1}}]
                {% endfor %}
            ],
            tooltip: {
                valueDecimals: 2
            }
        };
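Each point in the series is a pair whose first element is spliced into Date.UTC(...). Date.UTC counts months from zero, which appears to be why the stored str_date for February 6, 2011 reads "2011,01,06". The Python sketch below shows how such a pair could be built on the server side; the function name is hypothetical and the zero-based month is inferred from the sample rows.

```python
from datetime import date


def highcharts_point(day, value):
    """Return (str_date, value); str_date is spliced into Date.UTC(...)."""
    # JavaScript's Date.UTC counts months from 0, so February is written as 01.
    str_date = "%d,%02d,%02d" % (day.year, day.month - 1, day.day)
    return (str_date, value)


point = highcharts_point(date(2011, 2, 6), 54)
```

Building the string on the server keeps the template loop trivial: it only prints a.0 and a.1 for each point.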
  • 38. Metrics
    Django and HighCharts
    And create the charts and tables...
  • 40. Data Science Tools
    IPython Notebook
    • Deployed on an EC2 Large AML Medium Instance
    • Configured for Python 2.7.3
    • Loaded with Matplotlib, PyLab, SciPy, NumPy, pyMongo, MySQL-python
    Insights
    • Write wrapper methods to access user data
    • Accessible to anyone through a browser
    • Very effective way to scale quickly with little overhead
    Applications
    • Decision tree analysis over website and iOS, which showed common paths
    • Session-level analysis on iOS devices
    • Multi-page form conversion retention rates
    • Quickly conduct segment analysis via a programming API
  • 41. Data Science Tools
    PIG
    • Executed using Ruby scripts
    • Pulled data from MongoDB
    • Forwarded to AWS EMR cluster for analysis
    • MR functions written in Python and occasionally Java
    Insights
    • Used for ad hoc analysis involving large datasets
    Applications
    • Daily, weekly, monthly conversion metrics on page views and forms
    • Identified trends in spending over 1M rows
    • Used lightly at LearnVest, growing in capability
  • 42. Things that didn’t work
    MongoDB Upserts
    • Quickly become read-heavy and slow down the db
    MongoDB Aggregation Framework
    • Fine for ad hoc analysis, but you might be better off establishing a repeatable framework to run MR algos
    Django-nonrel
    • Unstable; use Django and configure MongoDB as a datastore only
  • 43. Lessons Learned
    • Date time managed as two fields, Datetime and Date
    • Real-time Map-Reduce in pyMongo is too slow; don’t do this
    • Memcached on Django is good enough (at the moment); use django-celery with RabbitMQ to pre-cache all data after data loading
    • HighCharts is buggy; considering D3 & other libraries
    • Don’t need to retrieve data directly from MongoDB to Django; perhaps provide all data via a service layer (at the expense of ever-additional features in pyMongo)
    • Make better use of EMR upfront if resources are limited and data is vast
  • 44. Thanks! Questions?