Euangelos Linardos
Data Scientist @ Pollfish Inc
2nd Athens Data Science Meetup, Athens 17 December 2015
Data at Pollfish
Twitter: @eualin  Email: euangelos@pollfish.com
I AM EUANGELOS LINARDOS
THE CONCEPT
PART I
ABOUT POLLFISH
Pollfish is a mobile survey platform that delivers online surveys globally.
Pollfish ensures your survey reaches just the right audience and provides the most cost
effective, quick and accurate survey results.
DIY SURVEY TOOL
PUBLISHERS NETWORK
MORE THAN 170M MOBILE DEVICES ALL OVER THE WORLD
UNIQUE USER EXPERIENCE
A WIN WIN WIN SITUATION
I WIN, YOU WIN
EVERYBODY WINS
REAL-TIME RESULTS
SUPERIOR QUALITY
IT DOESN’T MATTER WHAT WE SAY
CLAIM YOUR FREE COUPON
AND TRY IT NOW
NATURE OF DATA
PART II
MOBILE SURVEYS ARE
A BIG DATA BUSINESS
VOLUME
● UNIQUE USERS:
~2 M daily
~15 M monthly
~170 M total
● DATA TRAFFIC:
~1 TB daily
~26 TB monthly
~210 TB total
* volume = scale of data
THAT’S A LOT OF SELFIES
VARIETY
● survey
● location
● device
● weather
● network
● publisher
● language
● and many more
* variety = different forms of data
PERSONA (200+)
≠
VARIETY
"taxonomy" and "persona" are used
interchangeably throughout this presentation!
[TAXONOMY = FEATURE ] [PERSONA = COMB. OF FEATURES]
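The bracketed definitions above (a taxonomy is a single feature, a persona is a combination of features) can be illustrated with a minimal sketch. The persona names, feature keys and the `matching_personas` helper are hypothetical and are not taken from the Pollfish code base.

```python
from typing import Dict, List

# Hypothetical personas: each one is a combination of required feature values.
PERSONAS: Dict[str, Dict[str, str]] = {
    "young_sports_fan": {"gender": "male", "age": "24-34", "interest": "sports"},
    "married_gamer": {"marital_status": "married", "interest": "video games"},
}


def matching_personas(user: Dict[str, str]) -> List[str]:
    """Return the names of all personas whose required features the user has."""
    return [
        name
        for name, required in PERSONAS.items()
        if all(user.get(feature) == value for feature, value in required.items())
    ]


# A user is described by individual taxonomies (features).
user = {
    "gender": "male",
    "age": "24-34",
    "marital_status": "single",
    "interest": "sports",
}
print(matching_personas(user))
```

A real system would likely score personas probabilistically rather than by exact match; exact matching is used here only to keep the idea visible.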
VELOCITY
● ~11 M requests per day; on every request:
detect possible fraudulent activity
predict user action (start, finish, abort)
OF WHICH…
● ~13% account for classifications (new users)
1 update / user / taxonomy
● ~87% account for “traditional” lookups (old users)
1 lookup / user
* velocity = analysis of streaming data
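To put the velocity figures in perspective, the short calculation below converts them into an average request rate and absolute daily counts. It uses only the numbers on the slide; the even spread over 24 hours is a simplifying assumption, so real peaks will be higher.

```python
REQUESTS_PER_DAY = 11_000_000   # ~11 M requests per day (from the slide)
CLASSIFICATION_PERCENT = 13     # new users
LOOKUP_PERCENT = 87             # returning users
SECONDS_PER_DAY = 24 * 60 * 60

# Average rate, assuming traffic is spread evenly over the day.
avg_requests_per_second = REQUESTS_PER_DAY / SECONDS_PER_DAY

# Integer arithmetic keeps the daily counts exact.
classifications_per_day = REQUESTS_PER_DAY * CLASSIFICATION_PERCENT // 100
lookups_per_day = REQUESTS_PER_DAY * LOOKUP_PERCENT // 100

print(f"average load: {avg_requests_per_second:.0f} requests/s")
print(f"classifications per day: {classifications_per_day:,}")
print(f"lookups per day: {lookups_per_day:,}")
```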
VERACITY
● survey answers may be inaccurate
● device and location data may be misleading
● 3rd party data may be outdated or wrong
* veracity = uncertainty of data
Too much to store on a single computer.
We need a cluster to process it.
This is typically what is called “Big Data”.
Amazing dataset to slice and dice!
DATA PROCESSES
PART III
MAIN DATA OPERATIONS
● Reporting
● Business Analytics
● Operational Analytics
● Product Features
REPORTING
● GROUPS OF INTEREST:
publishers
researchers
● EXAMPLE QUERIES:
# of surveys completed through my app?
# of users who completed my survey?
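The two example queries can be sketched over a small in-memory list of completion events. The record layout (app, survey, user) and all identifiers are made up for illustration; in production such queries would run against the stores described later in the deck.

```python
from collections import Counter, defaultdict

# Hypothetical completion events: (app_id, survey_id, user_id).
completions = [
    ("app_a", "s1", "u1"),
    ("app_a", "s1", "u2"),
    ("app_b", "s1", "u1"),
    ("app_b", "s2", "u3"),
]

# Publisher question: how many surveys were completed through my app?
surveys_per_app = Counter(app_id for app_id, _, _ in completions)

# Researcher question: how many distinct users completed my survey?
users_by_survey = defaultdict(set)
for _, survey_id, user_id in completions:
    users_by_survey[survey_id].add(user_id)
users_per_survey = {sid: len(users) for sid, users in users_by_survey.items()}

print(dict(surveys_per_app))
print(users_per_survey)
```

Counting distinct users with a set avoids double counting a user who appears in several events for the same survey.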
BUSINESS ANALYTICS
● GROUPS OF INTEREST:
sales and operations
management, executives and investors
● EXAMPLE QUERIES:
count number of (daily, weekly etc.) active users
analyze growth, user behavior, sign-up funnels
company KPIs (Key Performance Indicators)
NPS analysis (Net Promoter Score)
* KPI: a measure used to evaluate the success of an organization.
* NPS: a measure of the loyalty of a firm’s customer relationships.
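For reference, the standard NPS calculation is the percentage of promoters (ratings of 9 or 10 on a 0 to 10 scale) minus the percentage of detractors (ratings of 0 to 6). The ratings below are invented sample data.

```python
from typing import Sequence


def net_promoter_score(ratings: Sequence[int]) -> float:
    """Return NPS in the range -100..100 for 0-10 'would you recommend' ratings."""
    if not ratings:
        raise ValueError("at least one rating is required")
    promoters = sum(1 for r in ratings if r >= 9)
    detractors = sum(1 for r in ratings if r <= 6)
    return 100.0 * (promoters - detractors) / len(ratings)


# Invented sample: 5 promoters, 2 passives, 3 detractors.
ratings = [10, 9, 9, 8, 7, 6, 3, 10, 5, 9]
print(net_promoter_score(ratings))
```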
OPERATIONAL ANALYTICS
● GROUPS OF INTEREST:
devops engineers
data engineers
● EXAMPLE QUERIES:
latency analysis: milliseconds a user waits for a survey after the app loads
capacity planning: servers, people, bandwidth etc.
root cause analysis: locate the root causes of faults
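Latency analysis usually looks at percentiles rather than the mean, because a few slow requests dominate the average. The sketch below uses a simple nearest-rank percentile; the latency samples and the `percentile` helper are illustrative only.

```python
from typing import Sequence


def percentile(values: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    ordered = sorted(values)
    # Integer ceiling of pct/100 * n, clamped to at least rank 1.
    rank = max(1, (pct * len(ordered) + 99) // 100)
    return ordered[rank - 1]


# Invented survey-load latencies in milliseconds, with a slow tail.
latencies_ms = [120, 135, 150, 160, 180, 210, 250, 400, 950, 2300]

mean_ms = sum(latencies_ms) / len(latencies_ms)
p50_ms = percentile(latencies_ms, 50)
p95_ms = percentile(latencies_ms, 95)

print(f"mean={mean_ms} ms, p50={p50_ms} ms, p95={p95_ms} ms")
```

Here the mean is far above the median, which is exactly the situation where reporting only an average hides what most users experience.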
PRODUCT FEATURES
● Data enrichment
● Publisher classification
● Fraud detection
● User personas
● A/B testing
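Of the features listed, A/B testing lends itself to a short worked example. The sketch below applies a two-proportion z-test to survey completion counts for two variants. The counts are invented, and the slides do not say which statistical test Pollfish uses, so this is one common choice rather than the author's method.

```python
import math
from typing import Tuple


def two_proportion_z_test(
    success_a: int, total_a: int, success_b: int, total_b: int
) -> Tuple[float, float]:
    """Return (z, two-sided p-value) for the difference between two proportions."""
    p_a = success_a / total_a
    p_b = success_b / total_b
    # Pooled proportion under the null hypothesis that both variants perform equally.
    pooled = (success_a + success_b) / (total_a + total_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    z = (p_b - p_a) / std_err
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Invented example: variant A completes 200 of 1000 surveys, variant B 250 of 1000.
z, p_value = two_proportion_z_test(200, 1000, 250, 1000)
print(f"z={z:.2f}, p={p_value:.4f}")
```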
SURVEY PERSONALISATION IS THE FUTURE!
SURVEY
... should fit your mood.
... should fit your activity.
... should be personal!
IF YOU LOOK LIKE THIS #1
Gender: male
Age: 24-34
Marital status: single
Location: California
Interest: sports
Salary: 150K
Show PERSONAL survey! #1
SURVEY SHOULD FOLLOW #1
Gender: male
Age: 24-34
Marital status: single
Location: California
Interest: sports
Salary: 150K
Interested in buying the latest convertible from BMW?
IF YOU LOOK LIKE THIS #2
Gender: male
Age: 34-44
Marital status: married
Location: Helsinki
Interest: video games
Salary: 90K
Show PERSONAL survey! #2
SURVEY SHOULD FOLLOW #2
Gender: male
Age: 34-44
Marital status: married
Location: Helsinki
Interest: video games
Salary: 90K
Interested in buying the latest SUV from VOLVO?
OVERCOME THE CHALLENGE
Challenge:
survey data is accurate but limited. How do you scale?
Solution:
dedicated machine learning models using quality survey data.
Pollfish Personas:
targetable groups of consumers with similar characteristics, based on device data, location data
and, most importantly, survey answers!
POLLFISH PREDICTORS
Multivariate:
persona probability score calculated based on all available attributes.
Daily Updated:
keep your models current with daily model refreshes.
With Customizable Threshold:
tune the threshold to favor precision or recall.
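The precision/recall trade-off behind a customizable threshold can be shown with a handful of persona probability scores. The scores, labels and the `precision_recall` helper are invented for illustration and do not describe Pollfish's actual models.

```python
from typing import Sequence, Tuple


def precision_recall(
    scores: Sequence[float], labels: Sequence[bool], threshold: float
) -> Tuple[float, float]:
    """Compute precision and recall when predicting 'in persona' for score >= threshold."""
    predicted = [score >= threshold for score in scores]
    tp = sum(1 for p, y in zip(predicted, labels) if p and y)
    fp = sum(1 for p, y in zip(predicted, labels) if p and not y)
    fn = sum(1 for p, y in zip(predicted, labels) if not p and y)
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    return precision, recall


# Invented persona probability scores and ground truth from survey answers.
scores = [0.95, 0.85, 0.70, 0.60, 0.40, 0.30, 0.20, 0.10]
labels = [True, True, False, True, True, False, False, False]

for threshold in (0.8, 0.35):
    precision, recall = precision_recall(scores, labels, threshold)
    print(f"threshold={threshold}: precision={precision:.2f}, recall={recall:.2f}")
```

A high threshold targets fewer users but with more certainty (precision); a low threshold reaches more of the true persona members (recall) at the cost of some false positives.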
SYSTEM ARCHITECTURE
PART IV
TO MAKE DATA-DRIVEN DECISIONS
DATA AND INFRASTRUCTURE
ARE REQUIRED (AMONG OTHER THINGS).
HIGH LEVEL ARCHITECTURE
HDFS
● more data usually beats better algorithms
● raw data is:
complicated
often dirty
evolving structure
duplication all over
● getting data to a central point is hard! #NOT
● it's simple! we just throw it into HDFS!
CASSANDRA (C*)
● a distributed, linearly scalable key-value store
● ideal for time-series data
● provides fast random access for many small pieces of data
● use it for surveys, user profiles, popularity count and almost anything
POSTGRESQL
● we still use it, a lot!
● powering features that require transaction support, integrity constraints, and more
● aggregated data for dashboard and quick analysis
CRITICAL AND CONSISTENCY IMPORTANT? → POSTGRESQL
HUGE, GROWING FAST, EVENTUAL CONSISTENCY OK? → CASSANDRA
RAW AND HISTORICAL? → HDFS
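The rule of thumb above can be written down as a small decision function. The `Dataset` fields, the order in which the rules are checked and the "undecided" fallback are assumptions made for this sketch; the slide gives only the three rules themselves.

```python
from dataclasses import dataclass


@dataclass
class Dataset:
    """Hypothetical description of a dataset's storage requirements."""
    critical: bool = False
    needs_strong_consistency: bool = False
    huge_and_fast_growing: bool = False
    raw_and_historical: bool = False


def choose_store(dataset: Dataset) -> str:
    """Apply the three slide rules in order and return the suggested store."""
    if dataset.critical and dataset.needs_strong_consistency:
        return "PostgreSQL"
    if dataset.huge_and_fast_growing and not dataset.needs_strong_consistency:
        return "Cassandra"
    if dataset.raw_and_historical:
        return "HDFS"
    return "undecided"  # the slide does not cover this case


print(choose_store(Dataset(critical=True, needs_strong_consistency=True)))
print(choose_store(Dataset(huge_and_fast_growing=True)))
print(choose_store(Dataset(raw_and_historical=True)))
```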
AZKABAN
● allows us to build pipelines of batch jobs
● handles dependency resolution, workflow management, visualisation and more
● an alternative to Luigi and Oozie
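Dependency resolution in a batch pipeline boils down to ordering jobs so that each one runs after the jobs it depends on. The sketch below shows that idea with Python's standard `graphlib` module (Python 3.9+). It is not Azkaban's API, and the job names are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical batch jobs mapped to the jobs they depend on.
dependencies = {
    "import_raw_events": set(),
    "clean_events": {"import_raw_events"},
    "build_user_features": {"clean_events"},
    "aggregate_dashboard": {"clean_events"},
    "train_persona_models": {"build_user_features"},
}

# static_order() yields every job after all of its dependencies.
run_order = list(TopologicalSorter(dependencies).static_order())
print(run_order)
```

A workflow manager adds scheduling, retries and visualisation on top of this ordering step; a cycle in the graph would raise `graphlib.CycleError`.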
SPARK
● general cluster computing platform:
distributed in-memory computational framework
SQL, Machine Learning, Stream Processing, etc.
● easy to use, powerful, high-level API:
Scala, Java, Python and R
TIPS FOR DEVELOPING DATA PRODUCTS
● Collect data, data, DATA!!!
● Large amounts of data can reveal new patterns
● Be careful of “black box” approaches
● Look at your raw data (exploratory analysis)
● Aggregate statistics can be misleading
● Visualize your data
● Include data geeks in the design process
● Find opportunity in your error data
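The tip that aggregate statistics can be misleading is easy to demonstrate with Simpson's paradox. In the invented numbers below, survey variant A has the higher completion rate on both Wi-Fi and cellular, yet variant B looks better once the segments are pooled, because the two variants were shown to very different mixes of users.

```python
# Invented (completed, shown) counts per survey variant and network type.
results = {
    "A": {"wifi": (81, 87), "cellular": (192, 263)},
    "B": {"wifi": (234, 270), "cellular": (55, 80)},
}


def segment_rate(variant: str, segment: str) -> float:
    """Completion rate of one variant within one segment."""
    completed, shown = results[variant][segment]
    return completed / shown


def overall_rate(variant: str) -> float:
    """Completion rate of one variant with all segments pooled together."""
    completed = sum(c for c, _ in results[variant].values())
    shown = sum(s for _, s in results[variant].values())
    return completed / shown


for segment in ("wifi", "cellular"):
    print(f"{segment}: A={segment_rate('A', segment):.2f}, B={segment_rate('B', segment):.2f}")
print(f"overall: A={overall_rate('A'):.2f}, B={overall_rate('B'):.2f}")
```

Looking at the raw, segmented data (as the other tips suggest) is what reveals the reversal.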
Thank you
(we’re hiring):
https://pollfish.workable.com/
