Real-time big data analytics
based on a product recommendations case study
IT Business Solutions B2B Conference
October 2015
© deep.bi
We started as an ad network
The challenge was to recommend
the best product (out of millions)
to the right person in a given moment
(thousands of users within a second)
5 billion ad views delivered in 24 months
To put that scale in context:
if we served 1 ad per second, it would take
160 years
to serve 5 billion ads
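For reference, the arithmetic behind that figure (using 365-day years):
\[
\frac{5\times10^{9}\ \text{ads}}{60\cdot 60\cdot 24\cdot 365\ \text{ads/year}}\approx 158.5\ \text{years}\approx 160\ \text{years}.
\]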
So we needed a solution
SQL databases did not work
Popular NoSQL databases did not work
Standard data warehouse approaches (pre-aggregations, creating schemas) did not work
Rethinking all the problems posed by
huge data streams flowing to us every second,
we built a complete solution
based on open-source technologies
and fresh, smart ideas from our engineering team
It is called deep.bi
and now we make it available to other companies
DEEP.BI = BIG DATA FAST DATA SOLUTION
high velocity
high volume
deep.bi lets high-growth companies
solve fast data problems by providing
scalable, flexible and real-time
data collection, enrichment and analytics
deep.bi – complete data processing flow
collect -> enrich -> analyze
•  Collect: unstructured, raw data from many sources (page views, IoT events, IP, URL, cookie, transactions, call detail records, etc.)
•  Enrich: data enrichment, transformation and integration
•  Analyze: find patterns, build models, predict behavior
How to predict the best offer
based on online data – case study.
Collect website, campaign and CRM data
•  Website: Google Analytics
•  Campaigns: Agency reports
•  Apps: Dedicated monitoring tools
•  Other systems: Call center, IVR, emails
Data is stored in silos. Reporting tools provide aggregated
reports that are impossible to integrate around a single customer.
Instead of integrating the current reporting tools, we need to
gather all the individual events that our customers generate.
Collecting raw web data is not enough
2015-05-15T00:26:41.328Z,3,D,[ip_hidden],i1xszg0f-19hqrje,"Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.152 Safari/537.36","[url_hidden]",7279848891,@906,"https://www.google.pl/",vuser-history-allegro-1-hc20150509.1,"122_100003_Park@700:html_620x100_single_banner:See offer"
IP, URL, cookie, user-agent, timestamp
Enrich raw web and mobile data
50+ pieces of information from one interaction:
•  Purchase intent
•  Device
•  Time
•  Location
•  ISP
•  Online context
•  Weather*
•  Demographics
* Coming soon
We can learn quite a few things from the user's IP (see the sketch below)
Example use:
•  international travellers
•  townspeople
•  people in mountains
•  rainy day
•  Country
•  Region
•  City
•  ZIP Code
•  Population
•  Latitude & Longitude
•  Time zone
•  IDD prefix to call the city from
another country
•  Phone area code
•  Mobile Country Code (MCC)
•  Mobile Network Code (MNC)
•  Elevation
•  Weather at the moment of event
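For illustration, most of these attributes can be derived from an IP address with an off-the-shelf geo database. A minimal sketch, assuming a local MaxMind GeoLite2 City file (the deck does not say which provider deep.bi actually uses):

```python
import geoip2.database  # pip install geoip2; requires a GeoLite2-City.mmdb file

def geo_enrich(ip):
    """Return a dict of geo attributes for one IP (illustrative, not deep.bi's actual pipeline)."""
    with geoip2.database.Reader("GeoLite2-City.mmdb") as reader:
        r = reader.city(ip)
        return {
            "country":   r.country.iso_code,
            "region":    r.subdivisions.most_specific.name,
            "city":      r.city.name,
            "zip":       r.postal.code,
            "latitude":  r.location.latitude,
            "longitude": r.location.longitude,
            "timezone":  r.location.time_zone,
        }

print(geo_enrich("128.101.101.101"))
```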
The ISP tells us more than we might expect
Example use:
•  competitors' users -> acquisition
•  our users -> retention / up-selling / cross-selling
•  people from a particular company or company type
•  ISP name or Organization name
•  Organization type:
•  Commercial
•  Organization
•  Government
•  Military
•  University/College/School
•  Library
•  Content Delivery Network
•  Fixed Line ISP
•  Mobile ISP
•  Data Center/Web Hosting/Transit
•  Search Engine Spider
•  Reserved
•  Mobile brand
•  Net speed
Detailed information about the user's device (see the parsing sketch below)
Example use:
•  smartphone users
•  Apple users
•  Samsung Galaxy users
•  Google browser users
•  Device Type
•  Device Brand
•  Device Model
•  Device Operating System
•  Operating System Producer
•  Browser
•  Browser Producer
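For illustration, these device attributes are typically parsed out of the user-agent string. A minimal sketch using the open-source user-agents package (an assumption; the deck does not name the parser deep.bi uses):

```python
from user_agents import parse  # pip install ua-parser user-agents

ua_string = ("Mozilla/5.0 (Linux; Android 4.2.2; GT-S7580 Build/JDQ39) "
             "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.2357.93 Mobile Safari/537.36")
ua = parse(ua_string)

print(ua.device.brand, ua.device.model)    # device brand and model, e.g. Samsung GT-S7580
print(ua.os.family, ua.os.version_string)  # operating system, e.g. Android 4.2.2
print(ua.browser.family)                   # browser, e.g. Chrome Mobile
print(ua.is_mobile, ua.is_tablet, ua.is_pc)
```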
Besides user features, track user behavior too.
Deeper understanding of people’s behavior:
•  RFM segmentation (Recency, Frequency, Monetary; scoring sketch below)
•  Shopping cart analysis
•  Purchase sequence analysis
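A minimal RFM scoring sketch in pandas, on a made-up transactions table (customer_id, order_date and amount are assumed column names; real scoring would use quartiles over far more data):

```python
import pandas as pd

tx = pd.DataFrame({  # hypothetical purchase log
    "customer_id": [1, 1, 2, 3, 3, 3],
    "order_date": pd.to_datetime(["2015-09-01", "2015-09-20", "2015-07-05",
                                  "2015-08-11", "2015-09-28", "2015-10-01"]),
    "amount": [120.0, 35.0, 560.0, 20.0, 45.0, 80.0],
})
now = pd.Timestamp("2015-10-02")

rfm = tx.groupby("customer_id").agg(
    recency=("order_date", lambda d: (now - d.max()).days),  # days since last purchase
    frequency=("order_date", "count"),                        # number of purchases
    monetary=("amount", "sum"),                               # total spend
)

# 2 bins only because the toy sample is tiny; quartiles (4 bins) in practice.
# Lower recency is better, higher frequency/monetary are better.
rfm["R"] = pd.qcut(rfm["recency"], 2, labels=[2, 1]).astype(int)
rfm["F"] = pd.qcut(rfm["frequency"].rank(method="first"), 2, labels=[1, 2]).astype(int)
rfm["M"] = pd.qcut(rfm["monetary"], 2, labels=[1, 2]).astype(int)
print(rfm)
```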
User behavior and characteristics
help predict the next best action/offer
What product should we recommend?
How could this purchase path end?
So, how to build tailored recommendations?
Pick an algorithm that is suitable for the problem
Product [ feature_1, feature_2, …, feature_N]
User [ feature_1, feature_2, …, feature_N]
User [ product_1, product_2, …, product_N]
  Simple rules: if a user has certain features, serve
this group of products
  Manual segment creation: analysts find
segments of users and match them with
product segments
  Simple feature matching: take the user's weighted
feature vector and match it against product feature
vectors (see the sketch below)
Manually managed rules
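A minimal sketch of that feature-matching idea, with made-up feature vectors: cosine similarity between the user's weighted vector and each product's feature vector.

```python
import numpy as np

def recommend(user_vec, product_matrix, product_ids, k=3):
    """Rank products by cosine similarity to the user's weighted feature vector."""
    user_vec = np.asarray(user_vec, dtype=float)
    norms = np.linalg.norm(product_matrix, axis=1) * np.linalg.norm(user_vec) + 1e-12
    sims = product_matrix @ user_vec / norms
    top = np.argsort(-sims)[:k]
    return [(product_ids[i], round(float(sims[i]), 3)) for i in top]

# toy data: 4 products x 3 hypothetical features (smartphone, premium, outdoor)
products = np.array([[1.0, 0.2, 0.0],
                     [1.0, 0.9, 0.1],
                     [0.0, 0.1, 1.0],
                     [0.3, 0.8, 0.6]])
user = [0.9, 0.7, 0.1]  # weighted interest in the same features
print(recommend(user, products, ["p1", "p2", "p3", "p4"]))
```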
  Find segments automatically (e.g. k-means; sketch below)
  Product-feature-based recommendations
  User-feature-based recommendations
  Combined product- and user-based
recommendations (collaborative filtering, deep
learning)
Machine learning-supported recommendations
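For the automatic-segmentation step, a sketch with scikit-learn k-means on hypothetical user feature vectors (not necessarily deep.bi's exact setup):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.RandomState(0)
# hypothetical users x features matrix (e.g. recency, frequency, monetary, mobile share)
X = rng.rand(1000, 4)

kmeans = KMeans(n_clusters=8, random_state=0, n_init=10).fit(X)
segments = kmeans.labels_            # segment id per user
centroids = kmeans.cluster_centers_  # average feature profile per segment

# each segment can then be matched with a group of products
print(np.bincount(segments))
```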
Recommendations: the long-tail phenomenon
[Chart: product popularity vs. products; the most interesting recommendations sit in the long tail]
Technology behind deep.bi
  Complex data model required for query optimization
 split dimensions into several tables based on the reports to be made
 cherry-pick in advance which dimensions can be pre-aggregated, based on
cardinality
 indexing every dimension column is a must
  Impossible to add high-cardinality dimensions
 no way to analyze per user (millions of them)
 no way to even add all of user-agent, url, geo-info, ...
Problems with SQL and NoSQL databases
  Complex data loading process
 needs to pre-aggregate in memory
 non-trivial reliability issues
 hard to parallelize
  There is always latency
 pre-aggregation happens in the loading job's memory
Problems with SQL and NoSQL databases
deep.bi – real-time big data architecture
•  Sources: event sources* and customer databases produce the raw data stream
•  Real-time data ingestion: Kafka
•  Data transformation & enrichment: Node.js, Spark Streaming (produces the transformed data stream)
•  High performance, multi-purpose storage:
   •  Real-time OLAP store: Druid
   •  Operational store: Cassandra
   •  Raw data store: Hadoop, Parquet, Spark
•  Consumers: web analytics dashboard, customer analytics dashboard, deep.bi API, ETL
* e.g. mobile apps, websites, marketing campaigns, IoT (beacons, wearables)
Data Collection APIs
•  Web Data Collection API (HTML or JS): trackers in the end-user browser pass event data with a <DEEP tracker> snippet
•  Mobile Data Collection API (HTML, JS or Native SDK): trackers pass event data the same way
•  Events are sent to the Ingestion API inside the client's DEEP Data Space (DEEP: data enrichment, storage & analytics)
Events are represented with full flexibility of JSON
{
  "data": {
    "event_type": "CLICK",
    "ad_request_event": {
      "ctx": {
        "event_time": "2015-07-10T06:15:50.819Z",
        "ip_address": "XX.XX.XX.XX",
        "geo_info": {
          "country": "US", "region": "California", "city": "San Francisco",
          "timezone": "PST", "isp": "XXX",
          "population": 849774
        },
        "page": {
          "raw_url": "XXX",
          "standardized_domain": "XXX"
        },
        "page_info": {
          "page_raw_url": "XXX",
          "product_categories": [
            { "id": 20585 },
            { "id": 100126 }
          ]
        },
        "cookie": "ibx8axlw-17j287o",
        "user_agent": "Mozilla/5.0 (Linux; Android 4.2.2; GT-S7580 Build/JDQ39) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.2357.93 Mobile Safari/537.36"
      }
    }
  }
}
  Publish-subscribe service
  The nervous system of enterprise data
  decouple producers from consumers
  reliably buffers data
  send now, process later
  Scalable, distributed, replicated log system
  Pause components, restart processing
  Powered by:
  web giants like LinkedIn, Twitter, Netflix, Uber, Spotify or Pinterest
  >10M messages/second
Apache Kafka
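A minimal producer sketch with the kafka-python client; the broker address and topic name are assumptions for illustration:

```python
import json
from kafka import KafkaProducer  # pip install kafka-python

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",                       # assumed broker address
    value_serializer=lambda v: json.dumps(v).encode("utf-8"), # events serialized as JSON
)

event = {"event_type": "CLICK", "cookie": "ibx8axlw-17j287o",
         "event_time": "2015-07-10T06:15:50.819Z"}
producer.send("raw-events", value=event)  # "raw-events" is a hypothetical topic name
producer.flush()                          # send now; consumers process later at their own pace
```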
  Scalable, fault-tolerant stream processing system
  With simple programming model & rich API & integrations
  Powered by:
  Yahoo, Netflix, eBay
  NASA, Intel, Cisco
  It is our fundamental technology for streaming applications:
 sessionize events (sketch below)
 detect fraud
 attribute purchases to clicks or views
 load & read external stores like Druid, Hadoop, Cassandra
Apache Spark Streaming
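A sketch of the per-user counting that sessionization starts from, using the Spark 1.x-era DStream API this deck refers to (topic and broker names are assumed):

```python
import json
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils  # Spark 1.x / 2.x DStream API

sc = SparkContext(appName="event-stream")
ssc = StreamingContext(sc, batchDuration=5)  # 5-second micro-batches

stream = KafkaUtils.createDirectStream(
    ssc, ["raw-events"], {"metadata.broker.list": "localhost:9092"})  # assumed topic/broker

events = stream.map(lambda kv: json.loads(kv[1]))
# count events per cookie over a sliding 10-minute window: a crude first step toward sessionization
per_user = (events.map(lambda e: (e["cookie"], 1))
                  .reduceByKeyAndWindow(lambda a, b: a + b, None, 600, 5))
per_user.pprint()

ssc.start()
ssc.awaitTermination()
```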
  Open Source Streaming Data Store for Interactive Analytics at Scale
  denormalized data
  no more snowflake or star-schema!
  Build real-time dashboards, analytic applications, exploratory tools on it.
  It’s FAST!
  aggregate, drill-down, slice-n-dice in sub-seconds
  advanced column-store with compression
  sophisticated approximate algorithms
  It’s SCALABLE
  horizontally scalable - just add more machines
  replicated, highly-available
  Over 100 PB of data, millions of events/second
Druid – Real-time OLAP Store
  Ingest historical & real-time data
  data available for exploration in milliseconds
  can store years of data in very optimized storage
  Powered by
  eBay, Netflix, PayPal, Yahoo
  Cisco
  It is our core data store of all events, historical and real-time data
Druid – Real-time OLAP Store
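For illustration, a Druid native query posted to the broker over HTTP; the datasource, dimension and metric names are assumptions:

```python
import requests  # pip install requests

query = {
    "queryType": "topN",
    "dataSource": "events",            # hypothetical datasource name
    "dimension": "device_type",
    "metric": "event_count",
    "threshold": 5,
    "granularity": "all",
    "intervals": ["2015-10-01/2015-10-02"],
    "aggregations": [{"type": "longSum", "name": "event_count", "fieldName": "count"}],
}

# default broker endpoint for Druid native queries
resp = requests.post("http://localhost:8082/druid/v2/", json=query)
print(resp.json())
```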
  Apache Spark for batch-processing: fast and general engine for
large-scale data processing
  Replaces MapReduce, being up to 10x-100x faster!
  Number 1 open-source project in the big data space (contributors, commits)
  In-memory processing (if possible)
  Spark SQL for SQL processing
  Apache Parquet - an optimized storage format
  columnar – read only columns you need
  compressed – specialized compression for data type + generic compression
  2x-4x compression: 600 GB of data -> 150 GB
  Hadoop can be optimized by two orders of magnitude: from hours
to seconds!
Hadoop Optimized
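A batch-side sketch in the Spark 1.x style described above; the HDFS path and column names are assumptions:

```python
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="batch-reports")
sqlContext = SQLContext(sc)

# columnar Parquet files: only the referenced columns are read from disk
events = sqlContext.read.parquet("hdfs:///deep/events/2015/")  # hypothetical path
events.registerTempTable("events")

top_devices = sqlContext.sql("""
    SELECT device_type, COUNT(*) AS events
    FROM events
    GROUP BY device_type
    ORDER BY events DESC
""")
top_devices.show()
```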
Thank you!
Share your thoughts, challenges
or case studies with us.
Or drop us a line: hello@deep.bi
Backup slides
Let's assume we want to find users who:
  Were interested in smartphones
  Use a Samsung product
  Live in cities with a population over 1M people
  Are women
  Were traveling abroad
  Came from our display campaign
So we have a combination of k = 6 dimensions out of n = 50.
Using the combination formula, we get…
Complexity of multidimensional queries
… a number of possible combinations:
15,890,700
similar to Lotto (6 from 49).
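For reference, the binomial coefficients behind these figures:
\[
\binom{n}{k}=\frac{n!}{k!\,(n-k)!},\qquad
\binom{50}{6}=15{,}890{,}700,\qquad
\binom{49}{6}=13{,}983{,}816.
\]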
Thank you!
Share your thoughts, challenges
or case studies with us.
Or drop us a line: hello@deep.bi

Editor's Notes

  • #4 5000000000/60/60/24/30/12
  • #8 Sources: http://www.infoworld.com/article/2608040/big-data/fast-data--the-next-step-after-big-data.html
  • #18 RFM segmentation (Recency, Frequency, Monetary): assess customers' revenue potential; analyze migration between segments. Shopping cart analysis: understand which baskets customers build and which product categories most often sell together. Purchase sequence analysis: understand how customer behavior unfolds over time, which sequences precede a purchase and which precede a drop-off.
  • #19 Uplift-style models: will buy after the recommendation = target group; would buy without the recommendation = wasted spend; will not buy after the recommendation = lost customer.
  • #20 Uplift-style models: will buy after the recommendation = target group; would buy without the recommendation = wasted spend; will not buy after the recommendation = lost customer.
  • #25 Source: http://saasaddict.walkme.com/saas-2015-new-shifts-will-see/ 1.Companies Will Be Investing More in Personal Consumer Research Currently a lot of consumer research is performed in a very static manner, through surveys and analysis of raw data. What more companies will be investing in is in personalization and customization in their services. They will also focus on getting to know their customers more personally, usually through social media, through the use of Big Data (see more on that below) and through direct engagement (via email and social media). Details like purchasing motivations, lifestyle, and desires are all important. Relevant marketing strategies seek to improve customer satisfaction and motivate customers to value your brand as more than just a service. 2. Cloud Data Services Will Overtake Traditional Means of Storage According to Forrester research, Microsoft will be generating more revenue from its cloud services compared to its traditional on-premise application. Traditional services are limited by their on-premise storage space, while cloud data services are much more open. This will allow for businesses to look into contracting cloud services for meaningful growth while it is still relatively inexpensive. One challenge to watch out for is that cloud data breaches are a legitimate issue. Expect companies to invest heavily in shoring up their securities to avoid breaches. 3. More SaaS Apps Will Specialize in Specific Industries Industries like healthcare, manufacturing, and retail will be developing more apps in their specific fields. One of the challenges to this new approach is that it burdens the customer with a deeper, more complex experience to acclimate to. However, a benefit to specialized SaaS is that companies will have a built-in userbase which gives them a head start when developing features. It also benefits enterprise customers. The reason that this trend is important is because consumers are demanding more apps that are relevant to specific needs. Generalized apps avoid getting too complex in any one area which can alienate consumers by not providing solutions they desire. 4. New Alternatives to Multitenancy Will Develop Allowing multiple customers to share a single application instance is useful for managing data on cloud services. While the traditional sense allowed for multiple users to be plugged in, and had individual views, alternatives that allow for more personalized experiences are being developed. For example, Salesforce.com is offering a new 'Superpod' service for enterprises. This allows companies to have their own dedicated infrastructure inside their data centers, rather than connect to a single server-side instance. These new hybrid services gives enterprises more options leading into the future, allows for more innovation in developing delivery systems, and thus frees up the bottleneck in the cloud service market. It also gives consumers options as well. 5. A Bigger Emphasis on Big Data Analytics According to IDC reports, there is a trend leading towards a greater use of data-as-a-service (DaaS) with spending reaching $215 billion in 2015. DaaS will leverage cloud to deliver their services. They also predict that more companies will be using big data analytics as a part of their commercial and open data sets. Cloud storage offers more flexibility for enterprise access and overall capacity. 
Since the relative cost of cloud storage per unit is decreasing, more companies are becoming interested in big data analysis, which makes it a perfect opportunity to begin implementing open data set technologies.
  • #26 Source: http://saasaddict.walkme.com/saas-2015-new-shifts-will-see/ (same notes as #25)
  • #31 Sources: https://www.linkedin.com/pulse/combining-druid-spark-interactive-flexible-analytics-scale-butani
  • #32 Sources: https://speakerdeck.com/metamx/druid-plus-r https://www.linkedin.com/pulse/combining-druid-spark-interactive-flexible-analytics-scale-butani