Lynx Analytics © Confidential 1
Demography Estimation from Graph
A Large Telco Social Network Case Study
Gergely Svigruha – gergely.svigruha@lynxanalytics.com
Data Day Texas, Austin, 01/14/2017
Lynx Analytics History
Ideation
Academic origin with an INSEAD professor and two PhD students with mathematics and computer science backgrounds
2010
Incorporation
LynxKite™
2013
R&D development team in Hungary
2016
Opened office in New York
2016
Capital increase with strategic investor
Our team presence
Data Science
Lynx’s team of data scientists ensures that the most advanced technologies are developed for our clients.
Infrastructure & Deployment
Dedicated infrastructure engineers with extensive experience in enterprise deployment.
Banking & Telecom expertise
Experienced experts from the telecom and banking industries keep the work closely aligned with clients’ needs.
Estimate demographic data of telecommunication clients
• Demographic data (e.g. age, gender or level of education) is invaluable to telco companies for product design, direct marketing and churn prevention
• However, the data is not always reliable or available, especially for prepaid customers
Task: Use existing demographic + behavioral data to predict missing data
[Figure: customers connected by call and SMS links; some attributes are known ("Female, 40"), others are missing ("?, 30", "Male, ?", "?")]
Social Networks can be very useful
• Many companies have graph data, they just do not know it!
o Customers call each other → they form a social graph!
• Social networks can be extremely useful predictors of demography
o Friends / family members / co-workers resemble each other in some respect
[Figure: call / SMS graph of customers with partly known attributes ("Female, 40", "?, 30", "Male, ?", "?"); the customer with unknown attributes is annotated "Likely to be of the same age as other community members: 35-40"]
We developed a scalable comprehensive graph platform: LynxKite™
• Telcos can have a huge customer base: more than 100 million users in the USA or Asia
• We had to scale up to hundreds of millions of nodes and billions of connections
• Lynx Analytics developed its own scalable graph analytics platform, LynxKite, using Hadoop / Spark
Case Study
Predict age from registration / usage data
Customer age was predicted from registration data + CDR
Task: Estimate age of 100 million customers
• Registration data available for some customers, but unreliable
• Call Data Records (CDR) available
• Small but accurate survey data can be used as ground truth
[Figure: pipeline with stages labeled "Messy Registration Data", "Cleansed Registration Data", "Social Graph + Demography Data as attributes" and "Model Evaluation"]
Registration data had to be cleansed
• Registration data can be unreliable due to fake information, missing fields or default values
• Select reliable records using simple predictive models and text mining heuristics
Name         Address       ID number     Date of Birth   Label
John Doe     New York, …   342-69-7465   1983.03.11      Reliable Data
asdasd       qweqwe        123           1980.11.11      Fake Data
Jane Doe     Boston,…      167-52-9274   1962.06.21      Reliable Data
John Smith   Austin,…      926-25-6284                   Missing Data
Jane Smith   San Jose,…    382-34-7622   1970.01.01      Default Value
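The slide names the three failure modes (missing fields, fake entries, default values) but not the actual rules or models used. As a rough sketch only, rule-based checks of this kind might look like the following; the function name `classify`, the ID pattern, the repeated-chunk test for keyboard mashing and the set of default dates are all assumptions made for illustration, and the predictive models mentioned on the slide are not covered.

```python
import re

# Assumed placeholder birth dates; "1970.01.01" is the default value in the sample table.
DEFAULT_BIRTH_DATES = {"1970.01.01"}
# Assumed ID layout, copied from the sample rows (3-2-4 digits).
ID_FORMAT = re.compile(r"\d{3}-\d{2}-\d{4}")
# A short chunk repeated back to back ("asdasd", "qweqwe") suggests keyboard mashing.
REPEATED_CHUNK = re.compile(r"(.{2,4})\1+")

REQUIRED_FIELDS = ("name", "address", "id_number", "birth_date")


def classify(record):
    """Label one registration record as missing, fake, default or reliable."""
    if any(not record.get(field) for field in REQUIRED_FIELDS):
        return "missing"
    mashed_name = REPEATED_CHUNK.fullmatch(record["name"].lower()) is not None
    bad_id = ID_FORMAT.fullmatch(record["id_number"]) is None
    if mashed_name or bad_id:
        return "fake"
    if record["birth_date"] in DEFAULT_BIRTH_DATES:
        return "default"
    return "reliable"


rows = [
    {"name": "John Doe", "address": "New York", "id_number": "342-69-7465", "birth_date": "1983.03.11"},
    {"name": "asdasd", "address": "qweqwe", "id_number": "123", "birth_date": "1980.11.11"},
    {"name": "John Smith", "address": "Austin", "id_number": "926-25-6284", "birth_date": ""},
    {"name": "Jane Smith", "address": "San Jose", "id_number": "382-34-7622", "birth_date": "1970.01.01"},
]
labels = [classify(row) for row in rows]
```

Only records labeled "reliable" would then be used to stamp ages; everything else is treated as unknown and left to the graph-based methods.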
Age was stamped for 30M people from registrations with 82% accuracy
An age prediction is “correct” if it is within 5 years of the actual age, otherwise it is incorrect
Out of the 100 million customers, 30 million are pre-stamped from registration
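The accuracy definition above can be written down directly. This is a minimal sketch: the function name `accuracy_within` is hypothetical, and treating a difference of exactly 5 years as correct is an assumption, since the slide does not say whether the bound is inclusive.

```python
def accuracy_within(predicted, actual, tolerance=5):
    """Share of people whose predicted age is within `tolerance` years of the actual age.

    Both arguments map a customer id to an age; only customers present in
    both (e.g. those covered by the survey) are scored.
    """
    scored = [(predicted[person], actual[person]) for person in actual if person in predicted]
    if not scored:
        raise ValueError("no customer has both a prediction and a ground-truth age")
    hits = sum(1 for guess, truth in scored if abs(guess - truth) <= tolerance)
    return hits / len(scored)


# Toy data, not taken from the case study.
predicted_ages = {"a": 30, "b": 41, "c": 50}
survey_ages = {"a": 33, "b": 47, "c": 55}
score = accuracy_within(predicted_ages, survey_ages)  # "a" and "c" count as correct
```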
We created Social Graph from calls, SMS and location
• Every call and text message implies a social relationship
o Frequency and duration can be used as weights
• Approximate location (cell tower) is often recorded for calls, texting or browsing
o People who often appear at the same location around the same time probably know each other
[Figure: three edge types between customers: "Call" (calls at 2016/06/21, 12:00), "SMS" (texts at 2016/06/02, 07:00) and "Co-location" (Tower 946, 2016/06/15, 15:00)]
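To make the construction concrete, here is a small sketch that turns call, text and tower records into one weighted, undirected edge list. The record layouts, the function names and the specific weights (one unit per call plus one per minute, one per text, half a unit per shared tower-hour) are assumptions for illustration; the slides only say that frequency and duration can serve as weights.

```python
from collections import defaultdict


def edge_key(a, b):
    """Undirected edge: store each pair in a fixed order."""
    return (a, b) if a <= b else (b, a)


def build_social_graph(calls, texts, sightings):
    """Return {(person, person): weight} from three kinds of records.

    calls:     (caller, callee, duration_seconds)
    texts:     (sender, receiver)
    sightings: (person, tower_id, hour_bucket)
    """
    weights = defaultdict(float)
    for caller, callee, seconds in calls:
        if caller != callee:
            # Assumed weighting: one unit per call plus one per minute talked.
            weights[edge_key(caller, callee)] += 1.0 + seconds / 60.0
    for sender, receiver in texts:
        if sender != receiver:
            weights[edge_key(sender, receiver)] += 1.0
    # Group people seen at the same tower within the same hour bucket.
    seen_together = defaultdict(set)
    for person, tower, hour in sightings:
        seen_together[(tower, hour)].add(person)
    for people in seen_together.values():
        ordered = sorted(people)
        for i, a in enumerate(ordered):
            for b in ordered[i + 1:]:
                # Assumed weighting: co-location is weaker evidence than contact.
                weights[(a, b)] += 0.5
    return dict(weights)
```

Because weights accumulate, people who repeatedly share a tower end up with a stronger edge than a single chance encounter would give. At telco scale the pairwise co-location step would need a cutoff on crowd size, which this sketch leaves out.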
Age can be predicted from graph neighborhood
Simple Neighborhood Methods
Predict missing values from neighborhood median / average
[Figure: a customer whose neighbors have original ages 40, 35 and 45 gets predicted age 40]
Iterative Neighborhood Methods
Only predict where we have enough values + small deviation, then iterate
[Figure labels: "Small coverage in Iteration 1: skip"; original ages 40, 35 and 45; "Predicted age (Iteration 1): 40"; "Predicted age (Iteration 2): 40"]
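Both neighborhood variants can be sketched in a few lines. The thresholds `min_known` and `max_std`, the use of the median rather than the average, and the function names are assumptions; the slides give no concrete values.

```python
from statistics import median, pstdev


def predict_simple(graph, ages):
    """One pass: each unknown node gets the median age of its known neighbors."""
    predicted = {}
    for node, neighbors in graph.items():
        if node in ages:
            continue
        known = [ages[n] for n in neighbors if n in ages]
        if known:
            predicted[node] = median(known)
    return predicted


def predict_iterative(graph, ages, min_known=3, max_std=6.0, max_rounds=10):
    """Predict only where enough neighbors agree, then reuse predictions in later rounds."""
    current = dict(ages)
    for _ in range(max_rounds):
        new = {}
        for node, neighbors in graph.items():
            if node in current:
                continue
            known = [current[n] for n in neighbors if n in current]
            if len(known) >= min_known and pstdev(known) <= max_std:
                new[node] = median(known)
        if not new:
            break  # nothing more can be predicted under the thresholds
        current.update(new)
    return {node: age for node, age in current.items() if node not in ages}


# Toy graph: "x" has three known neighbors, "y" only two until "x" is filled in.
graph = {
    "a": ["x", "y"], "b": ["x", "y"], "c": ["x"],
    "x": ["a", "b", "c", "y"], "y": ["a", "b", "x"],
}
known_ages = {"a": 40, "b": 35, "c": 45}
simple = predict_simple(graph, known_ages)
iterative = predict_iterative(graph, known_ages)
```

In the toy graph, "y" is skipped in the first round because it has too few known neighbors, and is predicted in the second round once "x" has a value.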
Communities can be detected from Social Graph structure
Maximal Cliques
Overlapping. Cliques can be merged if they overlap enough, based on an INSEAD research paper → Infocom communities
Modular Clustering
Non-overlapping, only one community per person
Infocom Communities resemble real life the most
Infocom communities are merged maximal cliques
• Maximal cliques are too rigid (not all relationships are captured in CDR)
• We have multiple communities in real life (friends, family, classmates, co-workers)
→ Real communities overlap too
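The idea of merging maximal cliques can be illustrated with a small pure-Python sketch. It enumerates maximal cliques with the basic Bron-Kerbosch recursion and then merges groups whose overlap is large enough. The merge rule used here (shared members divided by the size of the smaller group, threshold 0.6) and the minimum clique size are assumptions; the criterion from the INSEAD paper is not given in the slides.

```python
def adjacency(edges):
    """Build a symmetric {node: set(neighbors)} map from an edge list."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj


def maximal_cliques(adj):
    """Enumerate maximal cliques (Bron-Kerbosch without pivoting)."""
    cliques = []

    def expand(clique, candidates, excluded):
        if not candidates and not excluded:
            cliques.append(frozenset(clique))
            return
        for v in list(candidates):
            expand(clique | {v}, candidates & adj[v], excluded & adj[v])
            candidates = candidates - {v}
            excluded = excluded | {v}

    expand(set(), set(adj), set())
    return cliques


def merge_overlapping(cliques, min_overlap=0.6, min_size=3):
    """Merge groups while any pair shares at least `min_overlap` of the smaller one."""
    groups = [set(c) for c in cliques if len(c) >= min_size]
    merged = True
    while merged:
        merged = False
        for i in range(len(groups)):
            for j in range(i + 1, len(groups)):
                a, b = groups[i], groups[j]
                if len(a & b) / min(len(a), len(b)) >= min_overlap:
                    groups[i] = a | b
                    del groups[j]
                    merged = True
                    break
            if merged:
                break
    return groups


# Toy graph: {1,2,3,4} is a clique except for the unrecorded edge 1-4.
edges = [(1, 2), (1, 3), (2, 3), (2, 4), (3, 4), (4, 5), (4, 6), (5, 6)]
communities = merge_overlapping(maximal_cliques(adjacency(edges)))
```

In this toy graph the two triangles that differ only by the missing edge 1-4 are merged into one community, while person 4 still belongs to a second community {4, 5, 6}, so communities may overlap.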
Age predicted from Community Average / Median
• Take communities with enough defined ages and small deviation
• If multiple are available, pick the one with the lowest standard deviation
o “Classmates” works better for age than “family”
• Predict from community average / median
• Iterate
[Figure labels: original ages 45, 40, 55, 60 and 20; "Predicted age: 42"; "Smaller Deviation → Better Predictor"]
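A single round of the community rule might be sketched like this. The thresholds, the choice of the median and the function name are assumptions, and the toy ages are illustrative rather than taken from the case study.

```python
from statistics import median, pstdev


def predict_from_communities(communities, ages, min_known=3, max_std=20.0):
    """For each person without an age, use the qualifying community with the lowest spread."""
    best = {}  # person -> (spread, predicted age)
    for members in communities:
        known = [ages[m] for m in members if m in ages]
        if len(known) < min_known:
            continue  # not enough defined ages
        spread = pstdev(known)
        if spread > max_std:
            continue  # ages disagree too much to be useful
        guess = median(known)
        for person in members:
            if person in ages:
                continue
            if person not in best or spread < best[person][0]:
                best[person] = (spread, guess)
    return {person: guess for person, (spread, guess) in best.items()}


# "u" belongs to two communities; the tighter one ("classmates") wins.
classmates = {"u", "a", "b", "c"}
family = {"u", "d", "e", "f"}
ages = {"a": 45, "b": 40, "c": 42, "d": 55, "e": 60, "f": 20}
prediction = predict_from_communities([family, classmates], ages)
```

The "Iterate" step would feed the predictions back into `ages` and repeat until no new person qualifies, in the same way as the iterative neighborhood method.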
Geo-partition of the graph improved community algorithm
• Finding maximal cliques / communities is computationally hard if the graph is too large
• Graph partitioning can help, but it is a difficult problem in itself
o Heuristic: partition using geolocation info, since most of people’s friends are physically close too
• For example, a geo-induced partition containing 20 million nodes preserved 85% of the edges associated with those nodes
• Much better than random partitioning into parts of the same size (20% of edges preserved)
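The "edges preserved" figure can be read as: of all edges that touch a node in the partition, what share has both endpoints inside it. That reading is an assumption, as is the function name; the sketch below computes it for a toy graph.

```python
def preserved_edge_fraction(edges, part):
    """Share of edges touching `part` whose two endpoints are both inside `part`."""
    touching = [(a, b) for a, b in edges if a in part or b in part]
    if not touching:
        raise ValueError("no edge touches the partition")
    inside = sum(1 for a, b in touching if a in part and b in part)
    return inside / len(touching)


# Toy chain a-b-c-d-e; neighbors in the chain live close to each other.
edges = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "e")]
geo_part = {"a", "b", "c"}        # people from one region
scattered_part = {"a", "c", "e"}  # same size, picked without regard to location
geo_score = preserved_edge_fraction(edges, geo_part)
scattered_score = preserved_edge_fraction(edges, scattered_part)
```

On this toy chain the geographic part keeps two of its three edges, while the scattered part of the same size keeps none, which mirrors the direction of the 85% versus 20% comparison.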
A combination of methods resulted in an overall 70% accuracy
Neighborhood methods << Community methods
Neighborhood methods
Best simple neighborhood method: ~62% accuracy
Best iterative neighborhood method: ~67% accuracy
Community methods
Best community method: ~72% accuracy, but not available for all customers
• Some people have too few connections, or their “friends” may not know each other
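The slides do not say how the methods were combined. One plausible scheme, shown purely as an assumption, is a fallback cascade: use the most accurate method wherever it produced a value and fall back to the next one otherwise. The function name and the toy values are hypothetical.

```python
def combine_predictions(*ranked_sources):
    """Merge prediction dicts, giving priority to the earlier (more trusted) sources."""
    combined = {}
    for source in ranked_sources:
        for person, age in source.items():
            combined.setdefault(person, age)  # keep the first value seen
    return combined


# Toy predictions, ordered from most to least accurate method.
community = {"a": 42}
iterative = {"a": 40, "b": 37}
simple = {"a": 39, "b": 36, "c": 51}
final = combine_predictions(community, iterative, simple)
```

Under such a scheme the overall accuracy would sit between the best and worst individual methods, weighted by how many customers each method ends up covering.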
Further Research
Predict demography from Social Graph variables
Integrate Graph Analytics & Deep Learning methods (GraphAI)
Social Graphs can help predict marital status, level of education or occupation
Members of communities with a high ratio of university graduates are likely to be university graduates themselves; the same applies to single people
Real estate agents tend to have a low clustering coefficient: their “friends” (business connections) do not necessarily know each other
Entrepreneurs have larger than average PageRank (often used to measure influence) or centrality
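For readers unfamiliar with the clustering coefficient mentioned above, here is a minimal sketch of the standard local definition: the share of a person's neighbor pairs that are themselves connected. The graph layout and the names are illustrative assumptions.

```python
from itertools import combinations


def local_clustering(adj, node):
    """Fraction of pairs of `node`'s neighbors that are connected to each other."""
    neighbors = adj[node]
    k = len(neighbors)
    if k < 2:
        return 0.0  # no neighbor pairs to compare
    linked = sum(1 for u, v in combinations(neighbors, 2) if v in adj[u])
    return 2.0 * linked / (k * (k - 1))


# "agent" knows four clients who do not know each other;
# "friend" sits in a tight group where everyone knows everyone.
adj = {
    "agent": {"c1", "c2", "c3", "c4"},
    "c1": {"agent"}, "c2": {"agent"}, "c3": {"agent"}, "c4": {"agent"},
    "friend": {"g", "h", "i"},
    "g": {"friend", "h", "i"}, "h": {"friend", "g", "i"}, "i": {"friend", "g", "h"},
}
agent_score = local_clustering(adj, "agent")
friend_score = local_clustering(adj, "friend")
```

A value near 0 matches the "business connections who do not know each other" pattern, while a value near 1 indicates a tightly knit circle.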
GraphAI
Integrating Graphs with Deep Learning
• Neural networks are graphs themselves
• Lynx’s R&D lab is currently working on a cutting-edge technology integrating graph analytics and deep learning methods to predict missing variables from existing ones / graph structure
[Figure: three connected people with partly known attributes: (Age: 35, Gender: Female, Edu: MSc), (Age: ?, Gender: Male, Edu: BA), (Age: ?, Gender: Female, Edu: ?)]
gergely.svigruha@lynxanalytics.com
https://www.lynxanalytics.com/
Thank you

Editor's Notes

  1. Telcos can have a huge customer base, more than 100 million users in the USA or Asia → need to scale up to hundreds of millions of nodes and billions of connections. Lynx Analytics developed its own scalable graph analytics platform called LynxKite using Hadoop / Spark 3 years ago
  2. Neighborhood includes irrelevant people → we need to find communities!