Abstract: Telecommunications service providers (or telcos) have access to massive amounts of historical and streaming data about subscribers. However, it often takes them a long time to build, operationalize and gain value from various machine learning and analytic models. This is true even for relatively common use-cases like churn prediction, purchase propensity, next topup or purchase prediction, subscriber profiling, customer experience modeling, recommendation engines and fraud detection. In this talk, I shall describe our approach to tackling this problem, which involved building a pre-packaged set of analytic pipelines on a scalable Big Data architecture that work on several standard and well-known telco data formats and sources, and that we were able to reuse across several different telcos. This allows the telcos to deploy the analytic pipelines on their data, out of the box, and go live in a matter of weeks, as opposed to the several months it used to take if they started from scratch. In the talk, I shall describe our experiences in deploying the pre-packaged analytic pipelines with several telcos in North America, South East Asia and the Middle East. The pipelines work on a variety of historical and streaming data, including call data records having voice, SMS and data usage information, purchase and recharge behavior, location information, browsing/clickstream data, billing and payment information, smartphone device logs, etc. The pipelines run on a combination of Spark and Unscrambl BRAIN™, which includes a real-time machine learning framework, a scalable profile store based on Redis and an aggregation engine that stores efficient summaries of time-series data. I shall describe some of the machine learning models that get trained and scored as part of these pipelines. 
I shall also remark on how reusable certain models are across different telcos, and how a similar set of features can be used for models like next topup or purchase prediction, churn prediction and purchase propensity across similar telcos in different geographies.
Transcript:
Data Science Out of the Box: Case Studies in the Telecommunications Industry by Anand Ranganathan
1. Anand Ranganathan, VP of Solutions
Aug 2017
DATA SCIENCE OUT OF THE BOX: Case Studies In The Telecommunications Industry
2. Telecommunications Service Providers have huge amounts of data related to customer activity that come to them in real-time
• Calling, SMS and data usage information
• Purchase and recharge data
• Plan information
• Browsing data (DPI)
• Location information from CDRs, probes or other sources
• Device Data logs
• Call Center logs
3. But, they face challenges in getting value from this data to improve customer experience
• Difficult to integrate data about customers from multiple sources into a single view
• Difficult to integrate the insights from the models with other tools
• Difficult to build models
• Difficult to act upon the insights
• Difficult to operationalize the models
• Difficult to gain business value
10. Harnessing Data in Real-Time is key to creating a great customer experience…
… Most enterprises, though, have struggled to deploy and get value from analytics … especially, real-time analytics
11. Does it have to take years to deploy an advanced analytics solution?
Do you really need an army of Data Scientists to create new models every year?
Do you really have to stitch together 10 solutions for a ‘single customer view’ that gets updated once a day?
Why is it still so difficult to create personalized and contextual campaigns?
12. Our Vision – Easy to use Real-time analytics in a box
Allow rapid deployment of analytics and reduce time to value through reusable machine learning pipelines that cover common needs in several industries.
Our initial target domains are:
• Telecommunications
• Healthcare
• Banking
13. Firstly, what is a machine learning pipeline?
Training Pipeline: Training Data → Parsing, Cleaning, Transformations → Feature Extraction → Train Model → Model
Scoring Pipeline: Test Data → Parsing, Cleaning, Transformations → Feature Extraction → Score Model → Predictions
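The two pipelines on this slide can be illustrated with a minimal Python sketch. It is not the speaker's implementation: the record layout (`id, calls, topups, churned`), the nearest-centroid model and all function names are assumptions chosen to keep the example small. The point it shows is that the training and scoring pipelines share the same parsing and feature-extraction steps and differ only in the final stage.

```python
def parse(line):
    # Parsing and cleaning: split a CSV line, strip whitespace, cast numbers.
    # The field layout is hypothetical: id, calls, topups, optional churn flag.
    fields = [f.strip() for f in line.split(",")]
    record = {"id": fields[0], "calls": float(fields[1]), "topups": float(fields[2])}
    if len(fields) > 3 and fields[3]:
        record["churned"] = fields[3] == "1"
    return record


def extract_features(record):
    # Feature extraction: shared by both pipelines so the features agree.
    return (record["calls"], record["topups"])


def train(lines):
    # Training pipeline: parse -> features -> fit a nearest-centroid model.
    sums, counts = {}, {}
    for line in lines:
        rec = parse(line)
        x = extract_features(rec)
        label = rec["churned"]
        s = sums.setdefault(label, [0.0] * len(x))
        for i, v in enumerate(x):
            s[i] += v
        counts[label] = counts.get(label, 0) + 1
    # The model is one centroid (mean feature vector) per label.
    return {label: tuple(v / counts[label] for v in s) for label, s in sums.items()}


def score(model, lines):
    # Scoring pipeline: parse -> features -> label of the closest centroid.
    predictions = {}
    for line in lines:
        rec = parse(line)
        x = extract_features(rec)
        predictions[rec["id"]] = min(
            model,
            key=lambda label: sum((a - b) ** 2 for a, b in zip(x, model[label])),
        )
    return predictions


training_data = ["a, 40, 3, 0", "b, 35, 4, 0", "c, 2, 0, 1", "d, 5, 1, 1"]
test_data = ["e, 38, 2", "f, 1, 0"]

model = train(training_data)
predictions = score(model, test_data)
print(predictions)
```

Keeping `parse` and `extract_features` as shared functions is the design choice that matters here: if the scoring pipeline computed features differently from the training pipeline, the model would be scored on inputs it was never trained on.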
14. We have 40+ readily deployable ML Pipelines covering common telco marketing requirements
Machine Learning: Real-Time and Offline Predictive Models
Wallet, Purchase & Journey Models
• Predict subscriber’s next top-up amount
• Predict when subscriber might top-up
• Predict if subscriber will buy or renew package
• Predict Package expiry
• Predict if package will expire with high balance
• Prepaid to Postpaid Conversion Propensity
• Churn Propensity
• Next Best Action Model
• Customer Lifetime Value Prediction
Spatio-Temporal Models
• Predict home location, work location, weekend travel locations
• Predict where subscriber will be at given hour & day, e.g. on Fridays at 7 PM
• Determine frequently visited locations (malls, churches, office buildings etc.)
• Mobility Profiling, e.g. frequent traveler, stay-at-home, regular commuter
• Home / Work Location Based Segmentation, e.g. Stay-at-home housewife, Traveling Salesman etc.
Anomaly Detection
• Detect anomalies in calling pattern within the network / Cell Site / Location / Subscriber
• Anomaly Detection in SMS/data usage at Network / Cell Site / Location or Subscriber level
• Anomaly Detection in dropped calls / dropped data sessions at Network / Cell Site / Location or Subscriber level
Device Models
• Detect Call Drops & Poor Call Quality from device logs
• Detect Poorly performing device battery
• Detect Anomalous Apps based on GPS, wake-lock etc.
• Determine interests based on App Usage
Communication
• Determine relative preference of SMS, Voice or Data
• Predict best time of day, day of week or location to reach subscriber with offers
• Determine preferred channels of communication
Customer Experience
• Customer Satisfaction Model, based on dropped calls, failed data sessions, poor call quality and device issues
• Predict if customer will call contact center
• Predict why customer may call contact center
Clickstream and Interests
• URL Categorization into rich topic hierarchy
• Long term and short term Interest derivation based on browsing data of communication
• Interest prediction based on location & device type
Social Network
• Determine influencers and social hubs
• Discover close contacts
• Identify common interest communities within the subscriber base
15. … used to create dynamic profiles of customers, locations and business or retail outlets
Customers
Historical:
• Typical home / work locations
• Recharge patterns
• Calling network
Real Time:
• Websites visited in last hour
• Number of dropped calls in past day
• Recharge prediction in next 6 hours
Locations
Historical:
• Typical population at location
• Spend patterns at location
• Typical Mobility profiles
Real Time:
• Anomalous network loads
• Number of queries for weather
• Current population
Business or retail outlets
Historical:
• Historical Population trends
• Browsing behaviors
• Communication patterns
Real Time:
• Number of customers near business now
• Number of calls to business in last 1 hour
• Number of visits to competing business
16. Key principle behind data science out of the box
Build ML pipeline once & Operationalize repeatedly
Building Pipelines – The ART
• Done once on some static representative datasets
• Explore different possible transformations of the data
• Explore different kinds of features
• Explore different models
• Finalize on a certain pipeline for a given problem
Operationalizing The Pipelines – The ENGINEERING
• Repeated for every new deployment
• Create the transformations & features on historical data
• Train initial version of the model & generate initial scores
• Create the transformations on streaming data and update features
• Update scores and models “frequently” based on streaming data
17. Machine Learning is not a one-off process taking place in a static world
All model-building & scoring activities happen at a certain point in time
Timeline (TIME axis, split at NOW):
• Before NOW, historical data that has been collected so far: build initial versions of the model, score them and create initial profiles based on this data
• After NOW, streaming data that will come in the future: update scores in the profiles and refresh models based on this data
18. Typical Enterprise Architecture
Separate processing pathways for real-time analytics and long-term historical analytics
Telco Data Sources:
• CDRs
• DPI
• Location
• SMSC
• Billing
ETL
Real-time Streaming Data
Historical Data
19. Problems with basic pipeline in streaming settings
Training Pipeline: Training Data → Parsing, Cleaning, Transformations → Feature Extraction → Train Model → Model
Scoring Pipeline: Test Data → Parsing, Cleaning, Transformations → Feature Extraction → Score Model → Predictions
• Doesn’t show feature creation & updates on combination of historical & streaming data
• Doesn’t show scoring based on most recent feature values
• Doesn’t show model refresh
20. Patterns for Machine Learning Pipelines
Grid axes: MODEL BUILDING (Online, Frequent/Periodic, Batch) against SCORING (Online, Frequent/Periodic)
• Update models and predictions on every event. E.g. time-series predictions and anomaly detection for fraud detection.
• Refresh models periodically and score on every event. E.g. topup prediction with models updated every week.
• Build model one-time or infrequently and score on every event. E.g. Real-time churn prediction with static model.
• Update models and predictions periodically. E.g. user interest models, hangout predictions and recommendation models.
• Build models and predictions one time or very infrequently. E.g. offline churn prediction scores.
21. Typical Enterprise Architecture with Unscrambl Brain
Separate processing pathways for real-time analytics and long-term historical analytics
Telco Data Sources:
• CDRs
• DPI
• Location
• SMSC
• Billing
ETL
Real-time Streaming Data
Historical Data
Brain:
• Stream Analytics
• Profile Store
• Aggregate Store
22. Brain is powered by 3 specialized components
• LevelDB-based time-series aggregate store: Recharges, Number of dropped calls, Number of international calls, … in the past 10 minutes, hour, day, week, month or year
• Redis-based profile store: Last known location of customers, predicted home and work locations, …
• Python-based ML pipeline framework: Call Center Call Prediction Model, Preferred Channel Prediction Model, Social Network Models
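To make the idea of a time-series aggregate store concrete, here is a minimal in-memory sketch in Python. It does not reflect Brain's actual API or its LevelDB layout; the class name, the one-minute bucket size and the method signatures are all assumptions. It shows one common way to answer "how many dropped calls in the past 10 minutes, hour or day" cheaply: keep small per-bucket summaries instead of raw events, and sum the buckets that fall inside the requested window.

```python
from collections import defaultdict

BUCKET_SECONDS = 60  # assumed bucket granularity (one summary per minute)


class AggregateStore:
    """Hypothetical in-memory stand-in for a time-series aggregate store."""

    def __init__(self):
        # (entity, metric) -> {bucket_start_timestamp: summed value}
        self._buckets = defaultdict(lambda: defaultdict(float))

    def add(self, entity, metric, timestamp, value=1.0):
        # Round the event time down to the start of its bucket and accumulate.
        bucket = int(timestamp) // BUCKET_SECONDS * BUCKET_SECONDS
        self._buckets[(entity, metric)][bucket] += value

    def total(self, entity, metric, now, window_seconds):
        # Sum every bucket that starts inside [now - window, now].
        # The answer is approximate to within one bucket at the window edge.
        start = now - window_seconds
        series = self._buckets.get((entity, metric), {})
        return sum(v for ts, v in series.items() if start <= ts <= now)


store = AggregateStore()
for event_time in (100, 3700, 6900, 7100, 7150):
    store.add("subscriber-1", "dropped_calls", event_time)

now = 7200
last_10_minutes = store.total("subscriber-1", "dropped_calls", now, 10 * 60)
last_hour = store.total("subscriber-1", "dropped_calls", now, 60 * 60)
last_day = store.total("subscriber-1", "dropped_calls", now, 24 * 60 * 60)
print(last_10_minutes, last_hour, last_day)
```

A production store would typically also keep coarser roll-ups (hourly, daily) so that long windows do not have to scan every minute bucket; that refinement is left out here.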
23. Online Learning, Online Scoring
One-Time Initialization of features from Historical Data: Historical Data → Parsing, Cleaning, Transformations → Feature Extraction → Maintain Features
Online Model Building & Scoring on Streaming Data: Streaming Data → Parsing, Cleaning, Transformations → Feature Extraction → Maintain Features → Get Features for one entity → Train & Score Model → Model, Write Predictions
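As a rough illustration of the online learning, online scoring pattern, the Python sketch below keeps one tiny model per entity and both scores and updates it on every event. The choice of model is an assumption, not something stated in the slides: a running mean and variance (Welford's algorithm) with a z-score threshold, which is a simple stand-in for the anomaly detection use case. The names `OnlineAnomalyModel` and `on_event` are hypothetical.

```python
import math


class OnlineAnomalyModel:
    """Running mean/variance via Welford's algorithm, scored as a z-score."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared deviations from the running mean

    def score(self, x):
        # Z-score of x against everything seen so far (0.0 until 2 samples).
        if self.n < 2:
            return 0.0
        std = math.sqrt(self.m2 / (self.n - 1))
        return abs(x - self.mean) / std if std > 0 else 0.0

    def update(self, x):
        # Incremental update: no past events need to be stored.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)


models = {}  # entity -> model; stand-in for per-entity state in a profile store


def on_event(entity, value, threshold=3.0):
    # Score with the model as it was before this event, then learn from it.
    model = models.setdefault(entity, OnlineAnomalyModel())
    z = model.score(value)
    model.update(value)
    return z > threshold


usage = [10, 12, 11, 9, 10, 11, 10, 12, 100]
flags = [on_event("sub-1", v) for v in usage]
print(flags)
```

Scoring before updating is deliberate: if the event were folded into the statistics first, a large outlier would partly mask itself.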
24. Periodic Learning, Online Scoring
One-Time Initialization of features from Historical Data: Historical Data → Parsing, Cleaning, Transformations → Feature Extraction → Maintain Features
Periodic Model Re-Training: Get Features for all entities → Train Model → Model
Online Update of Features and Scoring on Streaming Data: Streaming Data → Parsing, Cleaning, Transformations → Feature Extraction → Maintain Features → Get Features for one entity → Score Model → Write Predictions
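The three stages of this pattern can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the feature (mean top-up amount per subscriber), the "model" (a population median used as a threshold) and the function names are invented for the example, and a plain dict stands in for the profile store. What it shows is the split of responsibilities: features are updated on every event, the model is refit only when `retrain` is called (for example weekly), and each event is scored against whichever model is current.

```python
from statistics import median

features = {}                # entity -> {"count", "total"}; stand-in profile store
model = {"threshold": 0.0}   # hypothetical model: a single learned threshold


def update_features(entity, amount):
    # Maintain Features: incremental per-entity aggregates.
    f = features.setdefault(entity, {"count": 0, "total": 0.0})
    f["count"] += 1
    f["total"] += amount
    return f


def retrain():
    # Periodic Model Re-Training: read features for all entities, refit model.
    means = [f["total"] / f["count"] for f in features.values()]
    model["threshold"] = median(means)


def on_event(entity, amount):
    # Online step: update features, then score this one entity.
    f = update_features(entity, amount)
    return f["total"] / f["count"] > model["threshold"]


# One-time initialization of features from historical data.
for entity, amount in [("a", 10), ("a", 20), ("b", 50), ("c", 5)]:
    update_features(entity, amount)
retrain()

# Streaming events are scored against the most recently trained model.
is_high_value = {
    "b": on_event("b", 40),
    "c": on_event("c", 5),
    "a": on_event("a", 30),
}
print(model["threshold"], is_high_value)
```

Between calls to `retrain`, scores still change as features move, which is why this pattern can give fresh predictions even with a model that is a week old.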
25. Periodic Learning & Periodic Scoring
One-Time Initialization of features from Historical Data: Historical Data → Parsing, Cleaning, Transformations → Feature Extraction → Maintain Features
Periodic Model Re-Training & Re-Scoring of all Entities: Get Features for all entities → Train & Score Model → Model, Write Predictions
Online Update of Features from Streaming Data: Streaming Data → Parsing, Cleaning, Transformations → Feature Extraction → Maintain Features
26. Case Study: Telco in SE Asia
60+ million subscribers
7+ million opt-in subscribers
10+ billion CDRs per day
100+ billion URL records per day
15 Machine Learning pipelines rapidly deployed on Spark and Brain to derive a variety of profile attributes about subscribers
Able to update models and profiles as frequently as needed