Building big data mining SaaS
Prepared for RubySlava
Jozo Kovac, CEO 7Segments
Agenda
• 7SEGMENTS Solution Architecture
• Data mining
– Intro
– SQL
– NoSQL

• Recommendation Engine
– Mahout

• Event Ba...
7SEGMENTS: Big picture

Business solutions
Intelligent
marketing
Domain metadata

Applications
Campaign
management

Retent...
Live Demo
• 2012:
– http://campaing.7segments.com

• 2013:
– http://analytics.7segments.com
DATA MINING INTRO
Quick intro into data-mining
Training
Data

Target
{'yes', 'no'}

Algorithm
(trees, NN, SVM…)
Knowledge
Model
Prediction
The goal of prediction?

[Chart: model vs. random selection. Y-axis: % of target in group, 0% to 45%.]
Orange trains a tree

More: http://orange.biolab.si
Orange: python scripting
import Orange
import random
data = Orange.data.Table("voting")
test = Orange.data.Table(random.sa...
Use of predictions:
target best % of customers
2012: AGE OF DARKNESS SQL
First version: 2012
• Python + Django / Flask + PostgreSQL
• Flask > Django at least for our case
– simpler = better

• Po...
SQL approach
• SQL – static schema

Purchases
Customers
Custom Attributes
Data mining from static SQL schema
• ONE SQL FOR ALL COMPANIES!
• SELECT
CUSTOMER_ID,
SUM(PURCHASES.REVENUE) TOTAL_REVENUE,...
Troubles
• Customer attributes for purchases?
– customer  N purchases  N purchase attributes
– or CREATE TABLE PURCHASE ...
2013: AGE OF BIG DATA
7Segments Web 2013
• Based on 7Segments 2012
• General Events replace old Transactions
– Custom event attributes
– Event a...
New version: 2013
• Python + Flask + noSQL + Angular JS
• noSQL >> SQL
– Sharding: scaling + performance
– Map-reduce: fre...
JSON is king
db.customers.save({
_id: 'c0001',
events: [
{type: 'registration', properties: {prop1: 'val1', prop2: 'val2'}},
...
Map reduce rules
function funnelMap() {
var steps = ['registration', 'login', 'purchase', 'purchase'];
var counts = [0, 0,...
Event Tracking API + Web Client
so easy to do in noSQL
• A way to track users (not only on the web)
<script src="http://api.7se...
noSQL benefits
• Sharding
– AddShard()

• Productivity
– JSON, less coding, more features

• Performance
– Map reduce to w...
Mongo: 1 instance = 1 MR job
(we run 4 mongod on 24 core server)
APACHE MAHOUT
Apache Mahout
• Java lib with data-mining algorithms
– Collaborative filtering
– Predictions (Random Forest, etc.)
– Clus...
Mahout example
• Input data for rating prediction:
user_id, item_id, rating, timestamp
• Input data for suggested items:
u...
Mahout example
Recommendation Components
Mahout suggested reading
• http://www.slideshare.net/Cataldo/tutoria-mahout-recommendation
• http://www.scribd.com/doc/884...
How it all works together

EVENT BASED OPTIMIZATION
IN MARKETING
Visitor (then customer) tracking and influence
Example
Adwords
PPC campaign1

Web visit
Cookie: ABC
Or
Shop visit
User: Mark
...
Visitor (then customer) tracking and influence
General idea

Events

Campaigns

Events

Conversion
Split testing

[Diagram: Events, Campaigns (variants A and B), Events, Conversion, Results]
Split testing - naive implications

[Diagram: Events, Campaigns (variants A and B), Events, Conversion, Results]
Prediction of campaign response

[Diagram: Events, Campaigns (offers A and B), Events. Labels: Predict response to offers A, B; Select better offer.]
...
Adaptive Campaigns

[Diagram: Events, Campaigns (offers A and B), Events, Conversion. Labels: Predict response to offers A, B; Select better offer.]
...
Thank you for your attention!

www.7segments.com | info@7segments.com
Presentation from RubySlava: Data-mining edition (10th Oct 2013). About lessons learned from building data-mining SaaS ready for big data

Published in: Technology
Transcript of "Building big data mining SaaS"

  1. 1. Building big data mining SaaS Prepared for RubySlava Jozo Kovac, CEO 7Segments
  2. 2. Agenda • 7SEGMENTS Solution Architecture • Data mining – Intro – SQL – NoSQL • Recommendation Engine – Mahout • Event Based Optimization in Marketing – How it all works together
  3. 3. 7SEGMENTS: Big picture Business solutions Intelligent marketing Domain metadata Applications Campaign management Retention analysis Customers & sales analyses Social analyses API Tools Reporting and visualization Predictions & recommendation Planning & execution Optimization engine API Integration Data API CRUD, Search Common API Google, FB, ... Custom API 3rd party apps API Data Relational DB Platform component NoSQL + in memory Other sources (csv, www, ...) Active in my example
  4. 4. Live Demo • 2012: – http://campaing.7segments.com • 2013: – http://analytics.7segments.com
  5. 5. DATA MINING INTRO
  6. 6. Quick intro into data-mining Training Data Target {'yes', 'no'} Algorithm (trees, NN, SVM…) Knowledge Model Prediction
  7. 7. The goal of prediction? [Chart: model vs. random selection. Y-axis: % of target in group, 0% to 45%. X-axis: % of total number of clients, 10% to 100%. Data labels in order of appearance: 40%, 25%, 14%, 8%, 4%, 3%, 2%, 2%, 1%, 1%.]
  8. 8. Orange trains a tree More: http://orange.biolab.si
  9. 9. Orange: python scripting

        import Orange
        import random

        data = Orange.data.Table("voting")
        test = Orange.data.Table(random.sample(data, 5))
        train = Orange.data.Table([d for d in data if d not in test])

        tree = Orange.regression.tree.TreeLearner(train, same_majority_pruning=1, m_pruning=2)
        tree.name = "tree"
        knn = Orange.classification.knn.kNNLearner(train, k=21)
        knn.name = "k-NN"
        lr = Orange.classification.logreg.LogRegLearner(train)
        lr.name = "lr"

        classifiers = [tree, knn, lr]

        target = 0
        print "Probabilities for %s:" % data.domain.class_var.values[target]
        print "original class ",
        print " ".join("%-9s" % l.name for l in classifiers)

        return_type = Orange.classification.Classifier.GetProbabilities
        for d in test:
            print "%-15s" % (d.getclass()),
            print " ".join("%5.3f" % c(d, return_type)[target] for c in classifiers)

     More: http://orange.biolab.si/docs/latest/tutorial/rst/classification/
  10. 10. Use of predictions: target best % of customers
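The idea of targeting the best-scored share of customers can be sketched in a few lines of Python. This is only an illustration: the scores and responses below are made-up data, and the helper names `response_rate_in_top` and `lift` are assumptions, not part of the talk or of Orange.

```python
def response_rate_in_top(scores, responded, fraction):
    """Response rate among the top `fraction` of customers ranked by model score."""
    ranked = sorted(zip(scores, responded), key=lambda pair: pair[0], reverse=True)
    top_n = max(1, int(round(len(ranked) * fraction)))
    top = ranked[:top_n]
    return sum(flag for _, flag in top) / float(top_n)


def lift(scores, responded, fraction):
    """How many times better the top `fraction` responds than a random selection."""
    overall = sum(responded) / float(len(responded))
    return response_rate_in_top(scores, responded, fraction) / overall


# Hypothetical model scores and observed responses (1 = responded) for 10 customers.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
responded = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]

for fraction in (0.2, 0.5, 1.0):
    print(fraction, response_rate_in_top(scores, responded, fraction),
          lift(scores, responded, fraction))
```

A lift above 1 for a small fraction means the model concentrates responders at the top of the ranking, which is what makes targeting only that fraction worthwhile.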
  11. 11. 2012: AGE OF DARKNESS SQL
  12. 12. First version: 2012
      • Python + Django / Flask + PostgreSQL
      • Flask > Django, at least for our case
        – simpler = better
      • Postgres > MySQL
        – window functions: http://www.postgresql.org/docs/9.1/static/tutorial-window.html

          SELECT depname, empno, salary,
                 rank() OVER (PARTITION BY depname ORDER BY salary DESC)
          FROM empsalary;
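For readers unfamiliar with window functions, the following Python sketch shows roughly what `rank() OVER (PARTITION BY depname ORDER BY salary DESC)` computes. The sample rows and the helper name `rank_within_partition` are illustrative assumptions; the talk itself only shows the SQL.

```python
from collections import defaultdict


def rank_within_partition(rows, partition_key, order_key):
    """Add a 'rank' field per partition, highest value first; ties share a rank."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[row[partition_key]].append(row)

    ranked = []
    for part in partitions.values():
        part.sort(key=lambda r: r[order_key], reverse=True)
        rank = 0
        previous = object()  # sentinel that never equals a real value
        for position, row in enumerate(part, start=1):
            if row[order_key] != previous:
                rank = position  # after a tie, rank jumps like SQL rank()
                previous = row[order_key]
            ranked.append(dict(row, rank=rank))
    return ranked


# Hypothetical rows shaped like the empsalary table in the PostgreSQL tutorial.
empsalary = [
    {'depname': 'develop', 'empno': 8, 'salary': 6000},
    {'depname': 'develop', 'empno': 10, 'salary': 5200},
    {'depname': 'develop', 'empno': 11, 'salary': 5200},
    {'depname': 'develop', 'empno': 9, 'salary': 4500},
    {'depname': 'sales', 'empno': 1, 'salary': 5000},
    {'depname': 'sales', 'empno': 3, 'salary': 4800},
]

ranks = {row['empno']: row['rank']
         for row in rank_within_partition(empsalary, 'depname', 'salary')}
print(ranks)
```

Doing this in the database with one query, instead of in application code like the above, is the convenience the slide credits to PostgreSQL.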
  13. 13. SQL approach • SQL – static schema Purchases Customers Custom Attributes
  14. 14. Data mining from static SQL schema
      • ONE SQL FOR ALL COMPANIES!

          SELECT
            CUSTOMER_ID,
            SUM(PURCHASES.REVENUE) TOTAL_REVENUE,
            COUNT(*) PURCHASES_COUNT,
            NOW()-MAX(PURCHASES.DATE) LAST_VISIT,
            CUSTOM_ATTRIBUTES_1.TEXT_VALUE AS CUSTOM1,
            …
          FROM CUSTOMER C
          LEFT JOIN PURCHASES USING (CUST_ID)
          LEFT JOIN CUSTOM_ATTRIBUTES (CUST_ID, ATT)
          LEFT JOIN CUSTOM_ATTRIBUTES_1-N (CUST_ID, ATT)
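The query above flattens purchases and custom attributes into one row per customer, which is the shape a data-mining algorithm expects. A rough Python sketch of the same flattening follows; the field names (`cust_id`, `revenue`, `att`, `text_value`) and the function `build_feature_rows` are assumptions chosen to mirror the column names on the slide, not the actual 7Segments code.

```python
from collections import defaultdict


def build_feature_rows(purchases, custom_attributes):
    """Return {cust_id: feature dict} with aggregates plus one column per attribute."""
    features = defaultdict(lambda: {'total_revenue': 0.0, 'purchases_count': 0})
    for purchase in purchases:
        row = features[purchase['cust_id']]
        row['total_revenue'] += purchase['revenue']
        row['purchases_count'] += 1
    for attribute in custom_attributes:
        # Each custom attribute becomes its own column, like CUSTOM1 in the query.
        column = 'custom_' + attribute['att']
        features[attribute['cust_id']][column] = attribute['text_value']
    return dict(features)


# Hypothetical sample data.
purchases = [
    {'cust_id': 1, 'revenue': 30.0},
    {'cust_id': 1, 'revenue': 20.0},
    {'cust_id': 2, 'revenue': 5.0},
]
custom_attributes = [
    {'cust_id': 1, 'att': 'city', 'text_value': 'Bratislava'},
]

rows = build_feature_rows(purchases, custom_attributes)
print(rows)
```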
  15. 15. Troubles
      • Customer attributes for purchases?
        – customer → N purchases → N purchase attributes
        – or CREATE TABLE PURCHASE with 999 attributes?
      • Performance issues
        – Queries run longer and longer
      • How to scale DB?
  16. 16. 2013: AGE OF BIG DATA
  17. 17. 7Segments Web 2013 • Based on 7Segments 2012 • General Events replace old Transactions – Custom event attributes – Event analytics + big data reporting • New campaign channels – Web content, enhanced social networks • Well-documented external API – www.apiary.io
  18. 18. New version: 2013 • Python + Flask + noSQL + AngularJS • noSQL >> SQL – Sharding: scaling + performance – Map-reduce: freedom! – JSON > DDL (no more CREATE TABLE) • We tried MongoDB, but are looking for a change – Map-reduce jobs run on 1 core
  19. 19. JSON is king

          db.customers.save({
            _id: 'c0001',
            events: [
              {type: 'registration', properties: {prop1: 'val1', prop2: 'val2'}},
              {type: 'login'},
              {type: 'logout'}
            ]
          });

          db.customers.save({
            _id: 'c0002',
            events: [
              {type: 'registration', properties: {prop2: 'val3', prop3: 'val4'}},
              {type: 'login'},
              {type: 'logout'},
              {type: 'login'},
              {type: 'purchase'}
            ]
          });
  20. 20. Map reduce rules

          function funnelMap() {
            var steps = ['registration', 'login', 'purchase', 'purchase'];
            var counts = [0, 0, 0, 0];
            var i = 0;
            for (var j in this.events) {
              if (this.events[j]['type'] == steps[i]) {
                counts[i]++;
                i++;
              }
              if (i == steps.length) break;
            }
            if (i > 0) emit('funnel', {'counts': counts});
          }

          function funnelReduce(key, values) {
            var counts = [0, 0, 0, 0];
            for (var i in values) {
              for (var j in values[i].counts) {
                counts[j] += values[i].counts[j];
              }
            }
            return {'counts': counts};
          }

          db.customers.mapReduce(
            funnelMap,
            funnelReduce,
            {out: 'funnelResult'}
          ).find()

      RESULTS: { "_id" : "funnel", "value" : { "counts" : [ 2, 2, 1, 0 ] } }
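To check the funnel logic without a MongoDB instance, the same map and reduce steps can be mirrored in plain Python. This is a sketch, not the production code: the function names `funnel_map` and `funnel_reduce` are assumptions, and the two customer documents are the ones from the "JSON is king" slide with their properties omitted.

```python
STEPS = ['registration', 'login', 'purchase', 'purchase']


def funnel_map(customer):
    """Count how far one customer's event sequence gets through STEPS, in order."""
    counts = [0] * len(STEPS)
    i = 0
    for event in customer.get('events', []):
        if event['type'] == STEPS[i]:
            counts[i] += 1
            i += 1
        if i == len(STEPS):
            break
    # Mirrors `if (i > 0) emit(...)`: customers who never entered the funnel emit nothing.
    return counts if i > 0 else None


def funnel_reduce(all_counts):
    """Sum the per-customer step counts element by element."""
    totals = [0] * len(STEPS)
    for counts in all_counts:
        for j, value in enumerate(counts):
            totals[j] += value
    return totals


customers = [
    {'_id': 'c0001', 'events': [{'type': 'registration'}, {'type': 'login'},
                                {'type': 'logout'}]},
    {'_id': 'c0002', 'events': [{'type': 'registration'}, {'type': 'login'},
                                {'type': 'logout'}, {'type': 'login'},
                                {'type': 'purchase'}]},
]

mapped = [counts for counts in map(funnel_map, customers) if counts is not None]
result = funnel_reduce(mapped)
print(result)  # [2, 2, 1, 0], the same counts as the slide's RESULTS line
```

Both customers register and log in, only c0002 makes a first purchase, and nobody makes a second one, which is why the last step stays at 0.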
  21. 21. Event Tracking API + Web Client: so easy to do in noSQL
      • A way to track users (not only on the web)

          <script src="http://api.7segments.com/js/event-tracker.js"></script>
          <script>
            EventTracker.init({
              company: "myCompany",
              project: "myProject",
              subproject: "myDepartment"
            });
          </script>

      • Fire custom events

          EventTracker.fire("purchase", {
            "product": "Ring",
            "category": "Jewelry",
            "Amount": 200
          });

      • Tracking by cookie or identified customer_id
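The last bullet, tracking by cookie or by identified customer_id, implies that anonymous events have to be attached to the customer once the visitor is identified. The sketch below shows one possible way to model that in memory. Everything here is hypothetical: the `EventStore` class, its methods, and the key format are illustrative assumptions and not the 7Segments API or storage layout.

```python
from collections import defaultdict


class EventStore(object):
    """In-memory stand-in for a customers collection keyed by cookie or customer."""

    def __init__(self):
        self.events = defaultdict(list)  # profile key -> list of event dicts
        self.cookie_owner = {}           # cookie -> customer_id, once identified

    def _key(self, cookie, customer_id):
        if customer_id is not None:
            return 'customer:' + customer_id
        if cookie in self.cookie_owner:
            return 'customer:' + self.cookie_owner[cookie]
        return 'cookie:' + cookie

    def fire(self, event_type, properties=None, cookie=None, customer_id=None):
        """Append an event to the profile identified by customer_id or cookie."""
        if cookie is None and customer_id is None:
            raise ValueError('need a cookie or a customer_id')
        key = self._key(cookie, customer_id)
        self.events[key].append({'type': event_type, 'properties': properties or {}})
        return key

    def identify(self, cookie, customer_id):
        """Link a cookie to a customer and move its anonymous events over."""
        self.cookie_owner[cookie] = customer_id
        anonymous = self.events.pop('cookie:' + cookie, [])
        target = 'customer:' + customer_id
        # Anonymous events are placed first; a real system would sort by timestamp.
        self.events[target] = anonymous + self.events[target]


store = EventStore()
store.fire('web_visit', {'ref': 'PPC1'}, cookie='ABC')
store.fire('web_visit', {'ref': None}, cookie='ABC')
store.identify('ABC', 'Mark')  # e.g. on web registration
store.fire('purchase', {'amount': 50}, customer_id='Mark')
store.fire('web_visit', cookie='ABC')  # later visits by the cookie go to Mark
print([event['type'] for event in store.events['customer:Mark']])
```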
  22. 22. noSQL benefits • Sharding – AddShard() • Productivity – JSON, less coding, more features • Performance – Map reduce to write smarter queries • Choose the right noSQL – Mongo MapReduce sucks
  23. 23. Mongo: 1 instance = 1 MR job (we run 4 mongod on 24 core server)
  24. 24. APACHE MAHOUT
  25. 25. Apache Mahout • Java lib with data-mining algorithms – Collaborative filtering – Predictions (Random Forest, etc.) – Clustering • Distributed computing! – When dataset doesn’t fit one server • Production ready – Not just another academic try & pray lib
  26. 26. Mahout example • Input data for rating prediction: user_id, item_id, rating, timestamp • Input data for suggested items: user_id, item_id • Output data – Recommend(user, howManyItems) – recommendedBecause(user, item, howMany) – SimilarItem(item, howManyItems)
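To give a feel for the "suggested items" case, where the input is only `user_id, item_id` pairs, here is a tiny co-occurrence recommender in plain Python. It is not Mahout and does not use Mahout's algorithms or API; the function names and the sample data are assumptions made for illustration only.

```python
from collections import defaultdict
from itertools import permutations


def build_cooccurrence(pairs):
    """From (user, item) pairs, count how often two items share a user."""
    items_by_user = defaultdict(set)
    for user, item in pairs:
        items_by_user[user].add(item)
    cooccurrence = defaultdict(lambda: defaultdict(int))
    for items in items_by_user.values():
        for a, b in permutations(items, 2):
            cooccurrence[a][b] += 1
    return items_by_user, cooccurrence


def similar_items(cooccurrence, item, how_many):
    """Items most often seen together with `item` (ties broken by name)."""
    neighbours = cooccurrence.get(item, {})
    ranked = sorted(neighbours.items(), key=lambda kv: (-kv[1], kv[0]))
    return [other for other, _ in ranked[:how_many]]


def recommend(items_by_user, cooccurrence, user, how_many):
    """Score unseen items by their co-occurrence with the user's own items."""
    owned = items_by_user.get(user, set())
    scores = defaultdict(int)
    for item in owned:
        for other, count in cooccurrence.get(item, {}).items():
            if other not in owned:
                scores[other] += count
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, _ in ranked[:how_many]]


# Hypothetical interaction data: (user_id, item_id).
pairs = [('u1', 'ring'), ('u1', 'necklace'),
         ('u2', 'ring'), ('u2', 'necklace'), ('u2', 'watch'),
         ('u3', 'ring'), ('u3', 'watch'),
         ('u4', 'ring')]

items_by_user, cooccurrence = build_cooccurrence(pairs)
print(recommend(items_by_user, cooccurrence, 'u1', 2))
print(similar_items(cooccurrence, 'ring', 2))
```

A library such as Mahout adds what this sketch lacks: proper similarity measures, rating-based prediction, and distributed execution for datasets that do not fit on one server.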
  27. 27. Mahout example
  28. 28. Recommendation Components
  29. 29. Mahout suggested reading • http://www.slideshare.net/Cataldo/tutoria-mahout-recommendation • http://www.scribd.com/doc/88415137/Mahout-in-Action
  30. 30. How it all works together EVENT BASED OPTIMIZATION IN MARKETING
  31. 31. Visitor (then customer) tracking and influence Example Adwords PPC campaign1 Web visit Cookie: ABC Or Shop visit User: Mark Web visit Cookie: ABC ref: PPC1 Web re-visit Cookie: ABC ref: none Web registration Cookie: ABC User: Mark Product Recommendation( Visits: 4, Origin: PPC, Revenue 50€, Products: X; Y; Z ) Purchase User: Mark Amount: 50€ Visits: 3x Origin: PPC1 … timeline
  32. 32. Visitor (then customer) tracking and influence General idea Events Campaigns Events Conversion
  33. 33. Split testing [Diagram: Events, Campaigns (variants A and B), Events, Conversion, Results]
  34. 34. Split testing - naive implications [Diagram: Events, Campaigns (variants A and B), Events, Conversion, Results]
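One common way to read split-test results, shown here only as a sketch and not necessarily the method used in the talk, is to compare the conversion rates of variants A and B and check whether the difference is larger than chance would explain. The visitor and conversion numbers below are made up, and the helper names are assumptions.

```python
import math


def conversion_rate(conversions, visitors):
    """Share of visitors who converted."""
    return conversions / float(visitors)


def two_proportion_z(conv_a, n_a, conv_b, n_b):
    """z-score for the difference in conversion rates (B minus A), pooled variance."""
    p_a = conversion_rate(conv_a, n_a)
    p_b = conversion_rate(conv_b, n_b)
    pooled = (conv_a + conv_b) / float(n_a + n_b)
    standard_error = math.sqrt(pooled * (1 - pooled) * (1.0 / n_a + 1.0 / n_b))
    if standard_error == 0:
        return 0.0  # no conversions at all, or everyone converted
    return (p_b - p_a) / standard_error


# Hypothetical results: 1000 visitors per variant.
z = two_proportion_z(conv_a=40, n_a=1000, conv_b=60, n_b=1000)
print(conversion_rate(40, 1000), conversion_rate(60, 1000), round(z, 2))
```

An absolute z-score above roughly 1.96 is the usual threshold for a difference at the 5% significance level. Even then, the overall winner is only the better offer on average; it says nothing about which offer suits an individual visitor, which is the gap the next slides address with response prediction.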
  35. 35. Prediction of campaign response [Diagram: Events, Campaigns (offers A and B), Events, Conversion, Results. Labels: Predict response to offers A, B; Select better offer.]
  36. 36. Adaptive Campaigns [Diagram: Events, Campaigns (offers A and B), Events, Conversion, Results. Labels: Predict response to offers A, B; Select better offer; Success or failure; Pick new action from pool; Next Best Action; Reinforcement Learning; Use information from recent events; Collect more events.]
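The loop on this slide (pick an action from the pool, observe success or failure, use the recent information for the next pick) resembles a multi-armed bandit. The sketch below uses epsilon-greedy, one of the simplest reinforcement-learning strategies; the talk does not say which algorithm 7Segments uses, so the class `EpsilonGreedyCampaign`, its parameters, and the simulated response rates are all assumptions.

```python
import random


class EpsilonGreedyCampaign(object):
    """Mostly show the best-converting offer, sometimes explore another one."""

    def __init__(self, offers, epsilon=0.1, rng=None):
        self.offers = list(offers)
        self.epsilon = epsilon
        self.rng = rng if rng is not None else random.Random()
        self.shown = {offer: 0 for offer in self.offers}
        self.converted = {offer: 0 for offer in self.offers}

    def rate(self, offer):
        """Observed conversion rate of an offer (0.0 before it was ever shown)."""
        shown = self.shown[offer]
        return self.converted[offer] / float(shown) if shown else 0.0

    def pick(self):
        """Pick the next action from the pool."""
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.offers)   # explore
        return max(self.offers, key=self.rate)    # exploit the best so far

    def record(self, offer, success):
        """Feed back success or failure of the shown offer."""
        self.shown[offer] += 1
        if success:
            self.converted[offer] += 1


# Simulation with hypothetical true response rates per offer.
rng = random.Random(42)
campaign = EpsilonGreedyCampaign(['A', 'B'], epsilon=0.1, rng=rng)
true_rates = {'A': 0.05, 'B': 0.15}
for _ in range(2000):
    offer = campaign.pick()
    campaign.record(offer, rng.random() < true_rates[offer])
print(campaign.shown, campaign.converted)
```

Unlike a fixed split test, this keeps collecting events while already shifting most traffic toward the offer that currently looks better. A per-customer response model, as on the previous slide, would replace the single global rate with a prediction based on each visitor's events.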
  37. 37. Thank you for your attention! www.7segments.com | info@7segments.com
