Copyright © Think Big Analytics and Neustar Inc.
Asking the Right Questions of Your Data
Mike Peterson
VP of Platforms and Data Architecture, Neustar
Jun 26, 2013
We have come a long way!!!
But where/when is the GOLD?
Unintended Consequences of Big Data
We need to ask the right Questions
Oh, and let’s remember religion, and not forget GOVERNANCE
Big Data Evolution Status
» New data platform is built (3-tier)
» Collected many PB of data
» Hadoop infrastructure in place for 2 years
» Established Data Science teams
» Machine Learning is in place
» Increased technology skills
» Focused data teams
» Active in the community
Our Partners are still a part of our process
» Expertise in Technologies
» Trusted partner
» Collaborative Teams
» Open source leader
» Invested in client success
» Price/performance
Some Unintended Consequences
» More Customer Reporting Requests
  » Because we suddenly have lots of customer data available
  » Meaning more work for the DW team!!!
» DR Site is more necessary than ever
  » More data means more critical data to protect
  » Network stress to support DR and other additional access
» Data Governance is overwhelmed with requests
  » Retention Policies need to be rethought
Questions
» Customer-Driven Questions
  » Easy to understand
» Subject Questions
  » Discover the pivot and you have a good start
» Exploratory Questions
  » Thinking of the unformed questions
  » Working from the top down
  » Narrowing the answer before you test all the data
Questions - Approaches
• Understand which manual process you want to automate: identify what is currently predicted by hand, and determine whether there is any way to get training data consisting of <input, output> pairs.
• Consider methods to augment existing data with a “pivot” column that can be used for joins. For example, geolocating an IP address could allow a join with Census data on zip+4.
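A minimal sketch of the pivot-column idea, assuming two hypothetical lookup tables (`IP_TO_ZIP4` and `CENSUS_BY_ZIP4`) whose contents are made up for illustration. In practice the first would come from a geolocation source and the second from Census files; neither the names nor the values come from the talk.

```python
# Hypothetical lookup tables with made-up values, for illustration only.
IP_TO_ZIP4 = {
    "203.0.113.7": "20166-6503",
    "198.51.100.23": "94105-1804",
}
CENSUS_BY_ZIP4 = {
    "20166-6503": {"median_income": 98000, "households": 412},
    "94105-1804": {"median_income": 121000, "households": 230},
}


def add_pivot(events):
    """Attach a zip+4 pivot column to each event (None when the IP is unknown)."""
    return [{**event, "zip4": IP_TO_ZIP4.get(event["ip"])} for event in events]


def join_census(events):
    """Left-join events to census rows on the pivot column."""
    joined = []
    for event in events:
        census = CENSUS_BY_ZIP4.get(event["zip4"], {})
        joined.append({**event, **census})
    return joined


events = [
    {"ip": "203.0.113.7", "page": "/pricing"},
    {"ip": "192.0.2.99", "page": "/home"},  # no geolocation match
]
enriched = join_census(add_pivot(events))
for row in enriched:
    print(row)
```

Events without a match keep their original columns and simply gain no census fields, which makes it easy to measure how much of the data the pivot actually covers before relying on the join.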
Questions - Approaches
• Determine whether your problem is one of prediction or one of grouping (clustering). Grouping is more of a task that leads to better understanding than one that directly solves a business problem.
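The difference can be sketched in a few lines, assuming made-up one-dimensional data (monthly support-ticket counts). Prediction needs labeled <input, output> pairs; grouping works on unlabeled inputs. The nearest-neighbor predictor and the small k-means loop below are illustrative choices, not methods named in the talk.

```python
def predict_1nn(train, x):
    """Prediction: use labeled pairs and return the output of the closest input."""
    _, nearest_output = min(train, key=lambda pair: abs(pair[0] - x))
    return nearest_output


def kmeans_1d(points, centers, iterations=10):
    """Grouping: no labels; assign points to the nearest center, then re-average."""
    groups = [[] for _ in centers]
    for _ in range(iterations):
        groups = [[] for _ in centers]
        for p in points:
            idx = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            groups[idx].append(p)
        # Keep the previous center when a group ends up empty.
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return centers, groups


# Made-up data: support tickets per month, with a label only in the first set.
labeled = [(10, "stays"), (12, "stays"), (80, "churns"), (95, "churns")]
prediction = predict_1nn(labeled, 85)

unlabeled = [10, 12, 11, 80, 95, 90]
centers, groups = kmeans_1d(unlabeled, centers=[0.0, 100.0])

print(prediction)
print(centers, groups)
```

The predictor answers a direct question about one customer, while the grouping only reports that two populations exist; deciding what those groups mean is still up to the analyst.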
Questions - Approaches
• Determine whether you are more interested in finding “interesting” relationships among data columns than in the columns themselves. This is a task I’d call “discovery” rather than prediction, but the idea is still to express one column as the output in terms of the other columns as inputs.
• Doing this for every candidate output column can surface the strongest correlations (e.g., a customer who buys beer at 5 PM is likely to also buy diapers). This is more of a fishing expedition, but it can lead to unusual insights.
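One way to sketch this “every column as output” sweep, assuming made-up basket data with 0/1 columns. The score used here is rule confidence, the share of baskets containing the input item that also contain the output item; it is a simple stand-in for fitting a model per output column, not the method used in the talk.

```python
from itertools import permutations

# Made-up baskets: 1 means the item was in that basket.
baskets = {
    "beer":    [1, 1, 1, 0, 0, 0, 0, 0],
    "diapers": [1, 1, 1, 1, 0, 0, 0, 0],
    "milk":    [1, 0, 1, 0, 1, 0, 1, 0],
}


def confidence(input_col, output_col):
    """Share of rows with input == 1 that also have output == 1 (None if no such rows)."""
    with_input = sum(input_col)
    if with_input == 0:
        return None
    both = sum(1 for i, o in zip(input_col, output_col) if i and o)
    return both / with_input


def rank_relationships(columns):
    """Treat every column as the output in turn and score it against each other column."""
    scored = []
    for input_name, output_name in permutations(columns, 2):
        score = confidence(columns[input_name], columns[output_name])
        if score is not None:
            scored.append((score, input_name, output_name))
    return sorted(scored, reverse=True)


ranked = rank_relationships(baskets)
for score, input_name, output_name in ranked:
    print(f"{input_name} -> {output_name}: {score:.2f}")
```

Confidence alone ignores how common the output item is (diapers appear in half of these baskets), so a high score is a lead to investigate rather than a finding, which fits the fishing-expedition caveat.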
Impetus Approach to Questioning Data
(Flow diagram; box labels:)
EXISTING DATA PROPERTY
BUSINESS STRATEGY
CUSTOMER PROBLEM STATEMENTS
ANALYSIS OF DATA PROPERTY
DISCUSSION WITH STAKEHOLDERS
ANALYSIS OF PROBLEM STATEMENT
DATA NEEDS STATEMENT
REFINED PROBLEM STATEMENT
DATA ANALYTICS PLAN
Who knew there was religion in Analytics?
» Statistical Analysis vs. Machine Learning
  » Stats people think “truth”
  » Machine Learning people think “near truth”
» Truth is easy to bound
  » Cost models make sense to the org
» Near Truth is hard to explain and bound
  » It is where the real exploration happens
  » But it can consume the Data Scientist
» Both can net real returns, and they need to coexist
GOVERNANCE
» Don’t forget about Governance
  » Contracts
  » PII
  » Brand
» CPO & CISO are your friends - honestly
» Protect your CUSTOMER DATA
  » It will slow you down in the beginning
  » But you want your results to be reputable
» We need to get to an automated policy framework at some point
About Impetus
» Accelerated consulting and services leader for Big Data; headquartered in San Jose since 1996; 1400+; presence in Silicon Valley, Atlanta, NYC; offices in India; expertise through Architects
» Pioneers in distributed software engineering with vertical and functional expertise; dedicated innovation labs; 200+ Big Data practitioners; 80+ dedicated to R&D
Data Science Approach
Drill
* Incoming Question
* Problem Landscape
* Underlying Constraints
* Specific Goals
Assess
* Goal Driven Hypotheses
* Data Requirement
* Resource Requirements
* Analysis Plan
Target
* Data Collection
* Quality Assessment
* Cross Validation
* Restructuring
Analyze
* Test Previous Hypotheses
* Explore New Hypotheses
* Test
* Quantify Results
Recommend
* Summary of Results
* Key Novel Insights
* Impact Analysis
* Action Items
Data Science Focus Areas
» Recommender Systems
» Sentiment Analysis
» Topic Identification
» Predictive Analytics
» Data Stream Analytics
Contact us at bigdata@impetus.com
Thank you
Questions?

Editor's Notes

  • #10 Sometimes clustering could be enough to solve a business problem
  • #11 We must understand the columns well before understanding the relationships
  • #18 Data Science results lead to better database marketing: churn analytics, upselling, cross-selling, RFM/LTV. These are some of the areas where we’ve used data science and machine learning to come up with some interesting models.