Building big data mining SaaS
Presentation from RubySlava: Data-mining edition (10th Oct 2013). About lessons learned from building data-mining SaaS ready for big data

Building big data mining SaaS Presentation Transcript

  • 1. Building big data mining SaaS. Prepared for RubySlava. Jozo Kovac, CEO, 7Segments
  • 2. Agenda • 7SEGMENTS Solution Architecture • Data mining – Intro – SQL – NoSQL • Recommendation Engine – Mahout • Event Based Optimization in Marketing – How it all works together
  • 3. 7SEGMENTS: Big picture (architecture diagram, layers from top to bottom)
    – Business solutions: Intelligent marketing, Domain metadata
    – Applications: Campaign management, Retention analysis, Customers & sales analyses, Social analyses
    – API
    – Tools: Reporting and visualization, Predictions & recommendation, Planning & execution, Optimization engine
    – API
    – Integration: Data API (CRUD, Search), Common API (Google, FB, ...), Custom API (3rd party apps)
    – API
    – Data: Relational DB, NoSQL + in memory, Other sources (csv, www, ...)
    – Legend: Platform component; Active in my example
  • 4. Live Demo • 2012: – http://campaing.7segments.com • 2013: – http://analytics.7segments.com
  • 5. DATA MINING INTRO
  • 6. Quick intro to data mining: Training Data with Target {'yes', 'no'} → Algorithm (trees, NN, SVM…) → Knowledge Model → Prediction
  • 7. The goal of prediction? [Chart: Model vs. random selection. Y axis: % of target in group (0% to 45%). X axis: % of total number of clients (10% to 100%, in steps of 10%). Series: model, random. Data labels shown: 40%, 25%, 14%, 8%, 4%, 3%, 2%, 2%, 1%, 1%.]
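The chart on slide 7 compares how many targets a model finds in each 10% slice of clients with what random selection would find (10% per slice). A minimal Python 3 sketch of that calculation follows. It assumes scored customers are available as (score, is_target) pairs; the function name `target_share_by_decile` and the toy data are illustrative assumptions, not taken from the talk.

```python
def target_share_by_decile(scored, buckets=10):
    """Return the share of all targets captured in each bucket.

    `scored` is a list of (score, is_target) pairs. Customers are sorted
    by descending model score and split into equally sized buckets.
    """
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    total_targets = sum(1 for _, is_target in ranked if is_target)
    if total_targets == 0:
        raise ValueError("no targets in data")
    size = len(ranked) / buckets
    shares = []
    for b in range(buckets):
        chunk = ranked[round(b * size):round((b + 1) * size)]
        hits = sum(1 for _, is_target in chunk if is_target)
        shares.append(hits / total_targets)
    return shares


# Toy data (made up): 20 customers, 4 targets, scores fall as i grows.
toy = [(1.0 - i / 20, i in (0, 1, 3, 9)) for i in range(20)]
shares = target_share_by_decile(toy)
random_baseline = [1 / len(shares)] * len(shares)  # random selection: equal shares
```

In this toy data the first bucket holds half of all targets, five times the random baseline, which is the kind of gap the chart is meant to show.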
  • 8. Orange trains a tree More: http://orange.biolab.si
  • 9. Orange: python scripting
        import Orange
        import random

        # Hold out 5 random examples for testing; train on the rest.
        data = Orange.data.Table("voting")
        test = Orange.data.Table(random.sample(data, 5))
        train = Orange.data.Table([d for d in data if d not in test])

        # Train three classifiers on the same training data.
        tree = Orange.regression.tree.TreeLearner(train, same_majority_pruning=1, m_pruning=2)
        tree.name = "tree"
        knn = Orange.classification.knn.kNNLearner(train, k=21)
        knn.name = "k-NN"
        lr = Orange.classification.logreg.LogRegLearner(train)
        lr.name = "lr"
        classifiers = [tree, knn, lr]

        # Print each classifier's probability of the first class value.
        target = 0
        print "Probabilities for %s:" % data.domain.class_var.values[target]
        print "original class ",
        print " ".join("%-9s" % l.name for l in classifiers)
        return_type = Orange.classification.Classifier.GetProbabilities
        for d in test:
            print "%-15s" % (d.getclass()),
            print " ".join("%5.3f" % c(d, return_type)[target] for c in classifiers)
    More: http://orange.biolab.si/docs/latest/tutorial/rst/classification/
  • 10. Use of predictions: target best % of customers
  • 11. 2012: AGE OF DARKNESS SQL
  • 12. First version: 2012
    • Python + Django / Flask + PostgreSQL
    • Flask > Django, at least for our case
        – simpler = better
    • Postgres > MySQL
        – window functions: http://www.postgresql.org/docs/9.1/static/tutorial-window.html
            SELECT depname, empno, salary,
                   rank() OVER (PARTITION BY depname ORDER BY salary DESC)
            FROM empsalary;
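To make clear what the window function in the query above returns, here is a plain Python 3 sketch that reproduces `rank() OVER (PARTITION BY depname ORDER BY salary DESC)` on in-memory rows. The helper name `rank_by_department` and the sample rows are assumptions for illustration; in practice the database does this work.

```python
from itertools import groupby
from operator import itemgetter


def rank_by_department(rows):
    """Mimic rank() OVER (PARTITION BY depname ORDER BY salary DESC).

    `rows` is a list of (depname, empno, salary) tuples. Ties share a rank
    and the following rank is skipped, as SQL rank() does.
    """
    ordered = sorted(rows, key=lambda r: (r[0], -r[2]))
    result = []
    for _, group in groupby(ordered, key=itemgetter(0)):
        previous_salary, rank = None, 0
        for position, (depname, empno, salary) in enumerate(group, start=1):
            if salary != previous_salary:
                rank = position
                previous_salary = salary
            result.append((depname, empno, salary, rank))
    return result


# Made-up rows in the shape of the empsalary table.
rows = [
    ("develop", 8, 6000), ("develop", 10, 5200), ("develop", 11, 5200),
    ("develop", 9, 4500), ("sales", 1, 5000), ("sales", 3, 4800),
]
ranked = rank_by_department(rows)
```

The two employees with salary 5200 both get rank 2 and the next employee gets rank 4. Without window functions this needs a self-join or application code.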
  • 13. SQL approach • SQL – static schema (tables: Purchases, Customers, Custom Attributes)
  • 14. Data mining from static SQL schema
    • ONE SQL FOR ALL COMPANIES!
        SELECT CUSTOMER_ID,
               SUM(PURCHASES.REVENUE) TOTAL_REVENUE,
               COUNT(*) PURCHASES_COUNT,
               NOW() - MAX(PURCHASES.DATE) LAST_VISIT,
               CUSTOM_ATTRIBUTES_1.TEXT_VALUE AS CUSTOM1,
               …
        FROM CUSTOMER C
        LEFT JOIN PURCHASES USING (CUST_ID)
        LEFT JOIN CUSTOM_ATTRIBUTES (CUST_ID, ATT)
        LEFT JOIN CUSTOM_ATTRIBUTES_1-N (CUST_ID, ATT)
  • 15. Troubles • Customer attributes for purchases? – customer → N purchases → N purchase attributes – or CREATE TABLE PURCHASE with 999 attributes? • Performance issues – Queries run longer and longer • How to scale DB?
  • 16. 2013: AGE OF BIG DATA
  • 17. 7Segments Web 2013 • Based on 7Segments 2012 • General Events replace old Transactions – Custom event attributes – Event analytics + big data reporting • New campaign channels – Web content, enhanced social networks • Well documented external API – www.apiary.io
  • 18. New version: 2013 • Python + Flask + noSQL + AngularJS • noSQL >> SQL – Sharding: scaling + performance – Map-reduce: freedom! – JSON > DDL (no more CREATE TABLE) • We tried MongoDB, but are looking for a change – Map-reduce jobs run on 1 core
  • 19. JSON is king
        db.customers.save({
            _id: 'c0001',
            events: [
                {type: 'registration', properties: {prop1: val1, prop2: val2}},
                {type: 'login'},
                {type: 'logout'}
            ]
        });
        db.customers.save({
            _id: 'c0002',
            events: [
                {type: 'registration', properties: {prop2: val3, prop3: val4}},
                {type: 'login'},
                {type: 'logout'},
                {type: 'login'},
                {type: 'purchase'}
            ]
        });
  • 20. Map reduce rules
        function funnelMap() {
            var steps = ['registration', 'login', 'purchase', 'purchase'];
            var counts = [0, 0, 0, 0];
            var i = 0;
            for (var j in this.events) {
                if (this.events[j]['type'] == steps[i]) {
                    counts[i]++;
                    i++;
                }
                if (i == steps.length) break;
            }
            if (i > 0) emit('funnel', {'counts': counts});
        }

        function funnelReduce(key, values) {
            var counts = [0, 0, 0, 0];
            for (var i in values) {
                for (var j in values[i].counts) {
                    counts[j] += values[i].counts[j];
                }
            }
            return {'counts': counts};
        }

        db.customers.mapReduce(
            funnelMap,
            funnelReduce,
            {out: 'funnelResult'}
        ).find()

    RESULTS: { "_id" : "funnel", "value" : { "counts" : [ 2, 2, 1, 0 ] } }
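The funnel job above can be traced by hand with a small Python 3 re-implementation. This is only a sketch for following the logic, not how MongoDB executes it. The two customer documents mirror the ones saved on the previous slide (properties omitted), and the names `funnel_map` and `funnel_reduce` are chosen here for illustration.

```python
from functools import reduce

STEPS = ["registration", "login", "purchase", "purchase"]


def funnel_map(customer):
    """Count how far one customer's event sequence gets through STEPS."""
    counts = [0] * len(STEPS)
    step = 0
    for event in customer["events"]:
        if event["type"] == STEPS[step]:
            counts[step] += 1
            step += 1
        if step == len(STEPS):
            break
    # Like the original, emit nothing for customers who never enter the funnel.
    return counts if step > 0 else None


def funnel_reduce(left, right):
    """Add two per-step count vectors element by element."""
    return [a + b for a, b in zip(left, right)]


customers = [
    {"_id": "c0001", "events": [{"type": "registration"}, {"type": "login"},
                                {"type": "logout"}]},
    {"_id": "c0002", "events": [{"type": "registration"}, {"type": "login"},
                                {"type": "logout"}, {"type": "login"},
                                {"type": "purchase"}]},
]
mapped = [c for c in map(funnel_map, customers) if c is not None]
funnel = reduce(funnel_reduce, mapped, [0] * len(STEPS))
print(funnel)  # [2, 2, 1, 0], the same counts as on the slide
```

Customer c0002 logs in twice, but only the first login advances the funnel, because each step is matched in order.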
  • 21. Event Tracking API + Web Client: so easy to do in noSQL
    • A way to track (not only) web users
        <script src="http://api.7segments.com/js/event-tracker.js"></script>
        <script>
        EventTracker.init({
            company: "myCompany",
            project: "myProject",
            subproject: "myDepartment"
        });
        </script>
    • Fire custom events
        EventTracker.fire("purchase", {
            "product": "Ring",
            "category": "Jewelry",
            "Amount": 200
        });
    • Tracking by cookie or identified customer_id
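The talk does not show the server side of the tracker. As a rough Python 3 sketch of why schemaless documents make it simple, the function below appends an event with arbitrary properties to a customer's document. Everything here is an assumption for illustration: the name `record_event`, the `timestamp` field, and the plain dict standing in for a document collection.

```python
import time


def record_event(store, customer_id, event_type, properties=None, now=None):
    """Append one event to the customer's document, creating it if needed."""
    document = store.setdefault(customer_id, {"_id": customer_id, "events": []})
    # The timestamp field is an assumption; the slides do not show one.
    event = {"type": event_type, "timestamp": time.time() if now is None else now}
    if properties:
        event["properties"] = dict(properties)
    document["events"].append(event)
    return event


store = {}  # stand-in for a document collection keyed by customer id
record_event(store, "c0001", "registration", now=1)
record_event(store, "c0001", "purchase",
             {"product": "Ring", "category": "Jewelry", "Amount": 200}, now=2)
```

No schema change is needed when a new event type or property appears, which is the productivity point the slide makes.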
  • 22. noSQL benefits • Sharding – AddShard() • Productivity – JSON, less coding, more features • Performance – Map reduce to write smarter queries • Choose the right noSQL – Mongo MapReduce sucks
  • 23. Mongo: 1 instance = 1 MR job (we run 4 mongod on 24 core server)
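The idea behind running several instances on one machine is that each holds a slice of the data and runs its own job, and the partial results are then merged. The Python 3 sketch below shows that pattern in general terms. It is not MongoDB's sharding implementation: the hash-based routing, the function names, and the generated customers are all assumptions for illustration.

```python
import hashlib
from collections import Counter


def shard_for(customer_id, shard_count):
    """Pick a shard from a stable hash of the customer id."""
    digest = hashlib.sha256(customer_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % shard_count


def count_event_types(customers):
    """The per-shard job: count events by type."""
    counts = Counter()
    for customer in customers:
        counts.update(event["type"] for event in customer["events"])
    return counts


def sharded_count(customers, shard_count=4):
    """Partition customers, run one job per shard, then merge the partials."""
    shards = [[] for _ in range(shard_count)]
    for customer in customers:
        shards[shard_for(customer["_id"], shard_count)].append(customer)
    # Each per-shard job is independent, so these could run on separate cores.
    partials = [count_event_types(shard) for shard in shards]
    return sum(partials, Counter())


# Made-up customers: one registration each plus 0 to 2 logins.
customers = [
    {"_id": "c%04d" % i,
     "events": [{"type": "login"}] * (i % 3) + [{"type": "registration"}]}
    for i in range(1, 21)
]
total = sharded_count(customers)
```

Because the merge step only adds counters, the result is the same as counting over all customers at once, regardless of how the customers are split.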
  • 24. APACHE MAHOUT
  • 25. Apache Mahout • Java lib with data-mining algorithms – Collaborative filtering – Predictions (Random Forest, etc.) – Clustering • Distributed computing! – When dataset doesn’t fit one server • Production ready – Not just another academic try & pray lib
  • 26. Mahout example • Input data for rating prediction: user_id, item_id, rating, timestamp • Input data for suggested items: user_id, item_id • Output data – Recommend(user, howManyItems) – recommendedBecause(user, item, howMany) – SimilarItem(item, howManyItems)
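To illustrate the inputs and outputs listed above, here is a small Python 3 sketch of an item-based recommender built on co-occurrence counts from (user_id, item_id) pairs. It is not Mahout's API or algorithm; simple co-occurrence counting is swapped in for brevity, and the function names and sample pairs are assumptions.

```python
from collections import defaultdict
from itertools import combinations


def build_cooccurrence(pairs):
    """Return (items per user, item-to-item co-occurrence counts)."""
    by_user = defaultdict(set)
    for user_id, item_id in pairs:
        by_user[user_id].add(item_id)
    cooc = defaultdict(lambda: defaultdict(int))
    for items in by_user.values():
        for a, b in combinations(sorted(items), 2):
            cooc[a][b] += 1
            cooc[b][a] += 1
    return by_user, cooc


def similar_items(cooc, item_id, how_many):
    """Items most often seen together with item_id."""
    neighbours = cooc.get(item_id, {})
    ranked = sorted(neighbours.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, _ in ranked[:how_many]]


def recommend(by_user, cooc, user_id, how_many):
    """Score items the user has not seen by co-occurrence with their items."""
    seen = by_user.get(user_id, set())
    scores = defaultdict(int)
    for item in seen:
        for other, count in cooc.get(item, {}).items():
            if other not in seen:
                scores[other] += count
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, _ in ranked[:how_many]]


# Made-up (user_id, item_id) pairs.
pairs = [("u1", "ring"), ("u1", "necklace"), ("u2", "ring"),
         ("u2", "necklace"), ("u2", "watch"), ("u3", "ring"),
         ("u3", "watch"), ("u4", "ring")]
by_user, cooc = build_cooccurrence(pairs)
```

For example, `recommend(by_user, cooc, "u1", 1)` suggests the watch, because both items u1 already has were seen together with it. A production library adds proper similarity measures and distributed computation on top of this idea.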
  • 27. Mahout example
  • 28. Recommendation Components
  • 29. Mahout suggested reading • http://www.slideshare.net/Cataldo/tutoria-mahout-recommendation • http://www.scribd.com/doc/88415137/Mahout-in-Action
  • 30. EVENT BASED OPTIMIZATION IN MARKETING: How it all works together
  • 31. Visitor (then customer) track and influence: example timeline
    – Adwords PPC campaign1
    – Web visit (Cookie: ABC) or shop visit (User: Mark)
    – Web visit (Cookie: ABC, ref: PPC1)
    – Web re-visit (Cookie: ABC, ref: none)
    – Web registration (Cookie: ABC, User: Mark)
    – Product Recommendation(Visits: 4, Origin: PPC, Revenue 50€, Products: X; Y; Z)
    – Purchase (User: Mark, Amount: 50€, Visits: 3x, Origin: PPC1, …)
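The timeline relies on linking anonymous cookie events to a known user once the visitor registers. A minimal Python 3 sketch of that stitching step follows. The event shapes, the function name `stitch_profile`, and the sample timeline (loosely modeled on the slide) are assumptions for illustration.

```python
def stitch_profile(events):
    """Merge cookie-level and user-level events into one profile per user.

    An event carrying both a cookie and a user (e.g. a registration) links
    the two identities; anonymous events for that cookie are then
    attributed to the user.
    """
    cookie_to_user = {e["cookie"]: e["user"] for e in events
                      if e.get("cookie") and e.get("user")}
    profiles = {}
    for e in events:
        user = e.get("user") or cookie_to_user.get(e.get("cookie"))
        if user is None:
            continue  # still anonymous, nothing to attribute yet
        p = profiles.setdefault(user, {"visits": 0, "origin": None, "revenue": 0})
        if e["type"] == "visit":
            p["visits"] += 1
            if p["origin"] is None and e.get("ref"):
                p["origin"] = e["ref"]  # first known referrer wins
        elif e["type"] == "purchase":
            p["revenue"] += e["amount"]
    return profiles


# Made-up timeline: three visits under cookie ABC, a registration, a purchase.
timeline = [
    {"type": "visit", "cookie": "ABC", "ref": "PPC1"},
    {"type": "visit", "cookie": "ABC", "ref": None},
    {"type": "registration", "cookie": "ABC", "user": "Mark"},
    {"type": "visit", "cookie": "ABC"},
    {"type": "purchase", "user": "Mark", "amount": 50},
]
profiles = stitch_profile(timeline)
```

After stitching, the purchase by Mark can be attributed to the PPC1 campaign even though the first visits were anonymous.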
  • 32. Visitor (then customer) track and influence. General idea: Events → Campaigns → Events → Conversion
  • 33. Split testing: Events → Campaigns (offers A and B) → Events → Conversion → Results for A and B
  • 34. Split testing - naive implications: Events → Campaigns (offers A and B) → Events → Conversion → Results: B
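The slide does not spell out which naive implications it has in mind. One common check before declaring a split-test winner is whether the gap between the two conversion rates is larger than noise. The Python 3 sketch below uses a standard two-proportion z-test for that; the counts are made up and the function names are chosen here for illustration.

```python
import math


def two_proportion_z(conv_a, n_a, conv_b, n_b):
    """z statistic for the difference between two conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return (p_b - p_a) / se


def p_value_two_sided(z):
    """Two-sided p-value from the standard normal distribution."""
    return math.erfc(abs(z) / math.sqrt(2))


# Made-up counts: B converts 5.5% against 4.0% for A, 1000 visitors each.
z = two_proportion_z(conv_a=40, n_a=1000, conv_b=55, n_b=1000)
p = p_value_two_sided(z)
```

With these counts B looks better, yet the p-value stays above 0.05, so sending everyone offer B on this evidence alone would be premature.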
  • 35. Prediction of campaign response: Events → Predict response to offers A, B → Select better offer → Campaigns → Events → Conversion → Results for A and B
  • 36. Adaptive Campaigns: Events → Predict response to offers A, B → Select better offer → Campaigns → Events → Conversion → Results (success or failure). Next Best Action / Reinforcement Learning: pick new action from pool, use information from recent events, collect more events
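The loop on this slide (pick an action, observe success or failure, use recent results for the next pick) can be illustrated with an epsilon-greedy bandit, one of the simplest reinforcement-learning schemes. The talk does not say which algorithm 7Segments uses, so the Python 3 sketch below is a stand-in under that assumption; the class name, the response rates, and the simulation are made up.

```python
import random


class EpsilonGreedy:
    """Pick the action with the best observed success rate, exploring sometimes."""

    def __init__(self, actions, epsilon=0.1, seed=0):
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        self.shown = {a: 0 for a in actions}
        self.successes = {a: 0 for a in actions}

    def rate(self, action):
        """Observed success rate of an action (0.0 if never shown)."""
        shown = self.shown[action]
        return self.successes[action] / shown if shown else 0.0

    def pick(self):
        """Try each action once, then explore with probability epsilon."""
        untried = [a for a, n in self.shown.items() if n == 0]
        if untried:
            return untried[0]
        if self.rng.random() < self.epsilon:
            return self.rng.choice(list(self.shown))
        return max(self.shown, key=self.rate)

    def update(self, action, success):
        """Record the outcome (success or failure) of one shown offer."""
        self.shown[action] += 1
        self.successes[action] += 1 if success else 0


# Simulation with made-up response rates: offer B converts better than A.
true_rates = {"A": 0.05, "B": 0.15}
world = random.Random(42)
bandit = EpsilonGreedy(["A", "B"], epsilon=0.1, seed=1)
for _ in range(5000):
    offer = bandit.pick()
    bandit.update(offer, world.random() < true_rates[offer])
```

Unlike a fixed split test, the share of traffic given to each offer shifts as results come in, while the occasional random pick keeps collecting data on the other offers.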
  • 37. Thank you for your attention! www.7segments.com | info@7segments.com