Building big data mining SaaS
 


Presentation from RubySlava: Data-mining edition (10th Oct 2013), about lessons learned from building a data-mining SaaS ready for big data.



    Building big data mining SaaS: Presentation Transcript

    • Building big data mining SaaS. Prepared for RubySlava. Jozo Kovac, CEO, 7Segments
    • Agenda
      • 7SEGMENTS Solution Architecture
      • Data mining
        – Intro
        – SQL
        – NoSQL
      • Recommendation Engine
        – Mahout
      • Event Based Optimization in Marketing
        – How it all works together
    • 7SEGMENTS: Big picture [architecture diagram; layers as far as recoverable]
      • Business solutions: Intelligent marketing, Domain metadata
      • Applications: Campaign management, Retention analysis, Customers & sales analyses, Social analyses
      • API Tools: Reporting and visualization, Predictions & recommendation, Planning & execution, Optimization engine
      • API Integration: Data API (CRUD, Search), Common API (Google, FB, ...), Custom API (3rd party apps)
      • API Data: Relational DB, NoSQL + in memory, Other sources (csv, www, ...)
      • Legend: Platform component; Active in my example
    • Live Demo
      • 2012:
        – http://campaing.7segments.com
      • 2013:
        – http://analytics.7segments.com
    • DATA MINING INTRO
    • Quick intro into data mining [diagram: Training Data with Target {'yes', 'no'} → Algorithm (trees, NN, SVM…) → Knowledge (Model) → Prediction]
    • The goal of prediction? [chart: model vs. random selection; y-axis: % of targets in group; x-axis: % of total number of clients, 10% to 100%; series: model, random]
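A chart like the one on this slide can be built by ranking clients by model score and measuring how many of the targets fall into each slice of the ranking. The following is a minimal sketch under stated assumptions: the function name `target_share_by_bucket` and the sample data are hypothetical and are not taken from the presentation.

```python
def target_share_by_bucket(scored, buckets=10):
    """Sort clients by model score (best first), split them into equal
    buckets, and return the share of all targets found in each bucket."""
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    total_targets = sum(1 for _, is_target in ranked if is_target)
    if total_targets == 0:
        return [0.0] * buckets
    size = len(ranked) // buckets  # assumes len(ranked) is divisible by buckets
    shares = []
    for b in range(buckets):
        chunk = ranked[b * size:(b + 1) * size]
        hits = sum(1 for _, is_target in chunk if is_target)
        shares.append(hits / total_targets)
    return shares


# Synthetic example: 10 clients, 4 targets, scores made up for illustration.
clients = [(0.9, True), (0.8, True), (0.7, False), (0.6, True), (0.5, False),
           (0.4, False), (0.3, True), (0.2, False), (0.1, False), (0.05, False)]
shares = target_share_by_bucket(clients, buckets=5)
random_baseline = 1 / 5  # random selection spreads targets evenly over buckets
```

A useful model concentrates targets in the first buckets, so its first values sit well above `random_baseline`.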
    • Orange trains a tree More: http://orange.biolab.si
    • Orange: python scripting

        # Python 2 / Orange 2.x syntax, as on the slide
        import Orange
        import random

        # Hold out 5 random examples for testing; train on the rest
        data = Orange.data.Table("voting")
        test = Orange.data.Table(random.sample(data, 5))
        train = Orange.data.Table([d for d in data if d not in test])

        # Train three classifiers on the same training set
        tree = Orange.regression.tree.TreeLearner(train, same_majority_pruning=1, m_pruning=2)
        tree.name = "tree"
        knn = Orange.classification.knn.kNNLearner(train, k=21)
        knn.name = "k-NN"
        lr = Orange.classification.logreg.LogRegLearner(train)
        lr.name = "lr"

        classifiers = [tree, knn, lr]

        # Print each classifier's probability of the first class value
        target = 0
        print "Probabilities for %s:" % data.domain.class_var.values[target]
        print "original class ",
        print " ".join("%-9s" % l.name for l in classifiers)

        return_type = Orange.classification.Classifier.GetProbabilities
        for d in test:
            print "%-15s" % (d.getclass()),
            print " ".join("%5.3f" % c(d, return_type)[target] for c in classifiers)

      More: http://orange.biolab.si/docs/latest/tutorial/rst/classification/
    • Use of predictions: target best % of customers
    • 2012: AGE OF DARKNESS SQL
    • First version: 2012
      • Python + Django / Flask + PostgreSQL
      • Flask > Django, at least for our case
        – simpler = better
      • Postgres > MySQL
        – window functions: http://www.postgresql.org/docs/9.1/static/tutorial-window.html

          SELECT depname, empno, salary,
                 rank() OVER (PARTITION BY depname ORDER BY salary DESC)
          FROM empsalary;

    • SQL approach
      • SQL: static schema with tables Purchases, Customers, Custom Attributes
    • Data mining from static SQL schema
      • ONE SQL FOR ALL COMPANIES!

          SELECT CUST_ID,
                 SUM(PURCHASES.REVENUE) AS TOTAL_REVENUE,
                 COUNT(*) AS PURCHASES_COUNT,
                 NOW() - MAX(PURCHASES.DATE) AS LAST_VISIT,
                 CUSTOM_ATTRIBUTES_1.TEXT_VALUE AS CUSTOM1,
                 …
          FROM CUSTOMER C
          LEFT JOIN PURCHASES USING (CUST_ID)
          -- pseudocode, as on the slide: one join per custom attribute 1..N
          LEFT JOIN CUSTOM_ATTRIBUTES (CUST_ID, ATT)
          LEFT JOIN CUSTOM_ATTRIBUTES_1-N (CUST_ID, ATT)
          GROUP BY CUST_ID, CUSTOM_ATTRIBUTES_1.TEXT_VALUE, …

    • Troubles
      • Customer attributes for purchases?
        – customer → N purchases → N purchase attributes
        – or CREATE TABLE PURCHASE with 999 attributes?
      • Performance issues
        – Queries run longer and longer
      • How to scale DB?
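To illustrate what the `rank() OVER (PARTITION BY depname ORDER BY salary DESC)` window function on this slide computes, here is a minimal plain-Python sketch. It is not how PostgreSQL implements window functions; the helper name `rank_within_partition` and the sample rows are assumptions made for illustration.

```python
from itertools import groupby
from operator import itemgetter


def rank_within_partition(rows, partition_key, order_key):
    """Mimic rank() OVER (PARTITION BY partition_key ORDER BY order_key DESC).
    Rows with equal order_key share a rank; the next rank skips ahead."""
    ranked = []
    by_partition = sorted(rows, key=itemgetter(partition_key))
    for _, group in groupby(by_partition, key=itemgetter(partition_key)):
        ordered = sorted(group, key=itemgetter(order_key), reverse=True)
        previous_value, rank = None, 0
        for position, row in enumerate(ordered, start=1):
            if row[order_key] != previous_value:
                rank = position
                previous_value = row[order_key]
            ranked.append(dict(row, rank=rank))
    return ranked


# Made-up rows shaped like the empsalary table used in the query above.
empsalary = [
    {"depname": "develop", "empno": 8, "salary": 6000},
    {"depname": "develop", "empno": 10, "salary": 5200},
    {"depname": "develop", "empno": 11, "salary": 5200},
    {"depname": "develop", "empno": 9, "salary": 4500},
    {"depname": "sales", "empno": 1, "salary": 5000},
    {"depname": "sales", "empno": 3, "salary": 4800},
]
result = rank_within_partition(empsalary, "depname", "salary")
ranks = [(r["depname"], r["empno"], r["rank"]) for r in result]
```

Each row keeps its own identity and gains a rank, which is the point of a window function: unlike `GROUP BY`, it does not collapse the rows of a partition.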
    • 2013: AGE OF BIG DATA
    • 7Segments Web 2013
      • Based on 7Segments 2012
      • General Events replace old Transactions
        – Custom event attributes
        – Event analytics + big data reporting
      • New campaign channels
        – Web content, enhanced social networks
      • Well documented external API
        – www.apiary.io
    • New version: 2013
      • Python + Flask + noSQL + Angular JS
      • noSQL >> SQL
        – Sharding: scaling + performance
        – Map-reduce: freedom!
        – JSON > DDL (no more CREATE TABLE)
      • We tried MongoDB, but we are looking for a change
        – Map-reduce jobs run on 1 core
    • JSON is king

        db.customers.save({
          _id: 'c0001',
          events: [
            {type: 'registration', properties: {prop1: 'val1', prop2: 'val2'}},
            {type: 'login'},
            {type: 'logout'}
          ]
        });
        db.customers.save({
          _id: 'c0002',
          events: [
            {type: 'registration', properties: {prop2: 'val3', prop3: 'val4'}},
            {type: 'login'},
            {type: 'logout'},
            {type: 'login'},
            {type: 'purchase'}
          ]
        });

    • Map reduce rules

        // Map: for one customer, mark how far through the funnel steps
        // the customer's events reach, in order.
        function funnelMap() {
          var steps = ['registration', 'login', 'purchase', 'purchase'];
          var counts = [0, 0, 0, 0];
          var i = 0;
          for (var j in this.events) {
            if (this.events[j]['type'] == steps[i]) {
              counts[i]++;
              i++;
            }
            if (i == steps.length) break;
          }
          if (i > 0) emit('funnel', {'counts': counts});
        }

        // Reduce: sum the per-customer counts step by step.
        function funnelReduce(key, values) {
          var counts = [0, 0, 0, 0];
          for (var i in values) {
            for (var j in values[i].counts) {
              counts[j] += values[i].counts[j];
            }
          }
          return {'counts': counts};
        }

        db.customers.mapReduce(
          funnelMap, funnelReduce, {out: 'funnelResult'}
        ).find()

      RESULTS:

        { "_id" : "funnel", "value" : { "counts" : [ 2, 2, 1, 0 ] } }

    • Event Tracking API + Web Client: so easy to do in noSQL
      • A way to track (not only) web users

          <script src="http://api.7segments.com/js/event-tracker.js"></script>
          <script>
            EventTracker.init({
              company: "myCompany",
              project: "myProject",
              subproject: "myDepartment"
            });
          </script>

      • Fire custom events

          EventTracker.fire("purchase", {
            "product": "Ring",
            "category": "Jewelry",
            "Amount": 200
          });

      • Tracking by cookie or identified customer_id
    • noSQL benefits
      • Sharding
        – AddShard()
      • Productivity
        – JSON, less coding, more features
      • Performance
        – Map reduce to write smarter queries
      • Choose the right noSQL
        – Mongo MapReduce sucks
    • Mongo: 1 instance = 1 MR job (we run 4 mongod on a 24-core server)
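The funnel logic above can be checked outside MongoDB. This is a minimal Python sketch of the same map and reduce steps, run against the two sample customers from the "JSON is king" slide; the function names `funnel_map` and `funnel_reduce` are my own, and plain `map` and `functools.reduce` stand in for the database's map-reduce engine.

```python
from functools import reduce

STEPS = ["registration", "login", "purchase", "purchase"]


def funnel_map(customer):
    """Walk one customer's events in order and mark each funnel step reached."""
    counts = [0] * len(STEPS)
    step = 0
    for event in customer["events"]:
        if event["type"] == STEPS[step]:
            counts[step] += 1
            step += 1
        if step == len(STEPS):
            break
    # Like the original, emit nothing for customers who never enter the funnel.
    return counts if step > 0 else None


def funnel_reduce(left, right):
    """Add two per-customer count vectors element by element."""
    return [a + b for a, b in zip(left, right)]


customers = [
    {"_id": "c0001", "events": [{"type": "registration"}, {"type": "login"},
                                {"type": "logout"}]},
    {"_id": "c0002", "events": [{"type": "registration"}, {"type": "login"},
                                {"type": "logout"}, {"type": "login"},
                                {"type": "purchase"}]},
]
mapped = [c for c in map(funnel_map, customers) if c is not None]
funnel = reduce(funnel_reduce, mapped, [0] * len(STEPS))
```

Both customers register and log in, only `c0002` purchases once, and nobody purchases twice, which matches the `[2, 2, 1, 0]` result shown on the slide.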
    • APACHE MAHOUT
    • Apache Mahout
      • Java lib with data-mining algorithms
        – Collaborative filtering
        – Predictions (Random Forest, etc.)
        – Clustering
      • Distributed computing!
        – When dataset doesn't fit one server
      • Production ready
        – Not just another academic try & pray lib
    • Mahout example
      • Input data for rating prediction: user_id, item_id, rating, timestamp
      • Input data for suggested items: user_id, item_id
      • Output data
        – Recommend(user, howManyItems)
        – recommendedBecause(user, item, howMany)
        – SimilarItem(item, howManyItems)
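To make the "input data for suggested items" case concrete, here is a minimal plain-Python sketch of item co-occurrence recommendation over (user_id, item_id) pairs. It is not Mahout and does not use Mahout's API; it is a simplified stand-in for the collaborative filtering idea, and all names and sample data are assumptions.

```python
from collections import Counter, defaultdict


def build_cooccurrence(pairs):
    """Count, for every item, how often each other item appears with it
    in the same user's history."""
    items_by_user = defaultdict(set)
    for user_id, item_id in pairs:
        items_by_user[user_id].add(item_id)
    cooccurrence = defaultdict(Counter)
    for items in items_by_user.values():
        for a in items:
            for b in items:
                if a != b:
                    cooccurrence[a][b] += 1
    return items_by_user, cooccurrence


def recommend(user_id, how_many, items_by_user, cooccurrence):
    """Score items the user has not seen by co-occurrence with seen items."""
    seen = items_by_user.get(user_id, set())
    scores = Counter()
    for item in seen:
        for other, count in cooccurrence[item].items():
            if other not in seen:
                scores[other] += count
    # Sort by score (highest first), then by item id for a stable order.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, _ in ranked[:how_many]]


# Made-up (user_id, item_id) pairs.
pairs = [("u1", "ring"), ("u1", "necklace"),
         ("u2", "ring"), ("u2", "necklace"), ("u2", "watch"),
         ("u3", "ring"), ("u3", "watch"),
         ("u4", "ring")]
items_by_user, cooccurrence = build_cooccurrence(pairs)
suggestions = recommend("u4", 2, items_by_user, cooccurrence)
```

This in-memory version only works while the data fits on one machine; the slide's argument for Mahout is precisely the case where it does not.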
    • Mahout example
    • Recommendation Components
    • Mahout suggested reading
      • http://www.slideshare.net/Cataldo/tutoria-mahout-recommendation
      • http://www.scribd.com/doc/88415137/Mahout-in-Action
    • EVENT BASED OPTIMIZATION IN MARKETING: How it all works together
    • Visitor (then customer) track and influence: Example [timeline diagram; steps as far as recoverable]
      – Adwords PPC campaign1
      – Web visit (Cookie: ABC) or Shop visit (User: Mark)
      – Web visit (Cookie: ABC, ref: PPC1)
      – Web re-visit (Cookie: ABC, ref: none)
      – Web registration (Cookie: ABC, User: Mark)
      – Product Recommendation(Visits: 4, Origin: PPC, Revenue: 50€, Products: X; Y; Z)
      – Purchase (User: Mark, Amount: 50€, Visits: 3x, Origin: PPC1)
      – …
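The timeline relies on linking anonymous cookie events to a known user once the visitor registers. A minimal sketch of that stitching step follows, assuming events are plain dictionaries with optional `cookie`, `user` and `ref` fields; the field names and the `stitch_identities` helper are hypothetical and not taken from the 7Segments API.

```python
def stitch_identities(events):
    """Attach a user to cookie-only events once any event links that
    cookie to a user (for example a registration)."""
    cookie_to_user = {}
    for event in events:
        if event.get("cookie") and event.get("user"):
            cookie_to_user[event["cookie"]] = event["user"]
    stitched = []
    for event in events:
        user = event.get("user") or cookie_to_user.get(event.get("cookie"))
        stitched.append(dict(event, user=user))
    return stitched


# Made-up events loosely following the example on the slide.
events = [
    {"type": "web_visit", "cookie": "ABC", "ref": "PPC1"},
    {"type": "web_revisit", "cookie": "ABC", "ref": None},
    {"type": "registration", "cookie": "ABC", "user": "Mark"},
    {"type": "purchase", "user": "Mark", "amount": 50},
]
timeline = stitch_identities(events)
mark_events = [e["type"] for e in timeline if e["user"] == "Mark"]
# The first referrer seen on the stitched timeline gives the visit origin.
origin = next((e["ref"] for e in timeline if e.get("ref")), None)
```

With the identities stitched, a later purchase can be attributed back to the campaign that produced the first visit.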
    • Visitor (then customer) track and influence: General idea [diagram: Events → Campaigns → Events → Conversion]
    • Split testing [diagram: Events → Campaigns (variants A, B) → Events → Conversion; Results compared for A and B]
    • Split testing - naive implications [diagram: Events → Campaigns (variants A, B) → Events → Conversion; Results: B]
    • Prediction of campaign response [diagram: Predict response to offers A, B → Select better offer; Events → Campaigns → Events → Conversion; Results for A and B]
    • Adaptive Campaigns [diagram: Predict response to offers A, B → Select better offer; Events → Campaigns → Events → Conversion → Success or failure → Pick new action from pool (Next Best Action, Reinforcement Learning); use information from recent events; Collect more events]
    • Thank you for your attention! www.7segments.com | info@7segments.com
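The adaptive loop (pick an offer, observe success or failure, pick the next best action) can be sketched with an epsilon-greedy bandit, one of the simplest reinforcement learning strategies. The slides do not say which algorithm 7Segments uses, so epsilon-greedy, the class name `EpsilonGreedyOffers` and the sample outcomes are all assumptions.

```python
import random


class EpsilonGreedyOffers:
    """Pick the next offer: mostly the best-converting one so far,
    sometimes a random one to keep collecting evidence."""

    def __init__(self, offers, epsilon=0.1, rng=None):
        self.epsilon = epsilon
        self.rng = rng or random.Random()
        self.shown = {offer: 0 for offer in offers}
        self.converted = {offer: 0 for offer in offers}

    def rate(self, offer):
        """Observed conversion rate; 0.0 for an offer never shown."""
        shown = self.shown[offer]
        return self.converted[offer] / shown if shown else 0.0

    def choose(self):
        if self.rng.random() < self.epsilon:
            return self.rng.choice(sorted(self.shown))  # explore
        return max(sorted(self.shown), key=self.rate)   # exploit

    def record(self, offer, success):
        """Feed back the outcome (success or failure) of a shown offer."""
        self.shown[offer] += 1
        if success:
            self.converted[offer] += 1


# Made-up outcomes: offer B converts more often than offer A.
campaign = EpsilonGreedyOffers(["A", "B"], epsilon=0.0)
for offer, success in [("A", True), ("A", False), ("A", False),
                       ("B", True), ("B", True), ("B", False)]:
    campaign.record(offer, success)
next_offer = campaign.choose()
```

With `epsilon=0.0` the choice is purely greedy, which makes the example deterministic; a live campaign would use a small positive epsilon so that weaker offers still get shown and their estimates keep improving.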