Big data Psychology
Vishal Singh
NYU-Stern
History of Data Collection
Database Marketers
Advent of Retail Scanner
WWW
Mobile/GPS, RFID
Astronomy/Census
Psychological Insights from Mundane Choices
• Psychological research is confined to small experiments, primarily on students (98% of all research is with WEIRD (Western, Educated, Industrialized, Rich, Democratic) subjects!)
• Thesis: Seemingly innocuous information, such as aggregated measures of internet search or mundane choices of grocery products, can reveal aspects of our deep-rooted ideologies, values, and personality traits
Secondary objective: Automated and fully replicable empirical workflow
Automated Analytics Workflow
Dynamic Reproducible Documents
o Data download/munging is part of the document
o Documents are dynamic: models, graphics, analysis, and write-up are updated with the flow of new data
o Documents are interactive and 100% reproducible by co-workers (and your future self)
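As a minimal sketch of what "dynamic" means here, the snippet below regenerates a one-line write-up from whatever data is currently available. The template text, the `render_report` helper, and the ratings are all hypothetical and only illustrate the idea; this is not the author's actual tooling.

```python
from statistics import mean
from string import Template

# Hypothetical write-up template; placeholders are filled from the data on every run.
REPORT = Template("Reviews analyzed: $n. Average star rating: $avg.")

def render_report(ratings):
    """Recompute the summary statistics and regenerate the write-up text."""
    return REPORT.substitute(n=len(ratings), avg=f"{mean(ratings):.2f}")

ratings = [5, 4, 1, 5]            # made-up data for illustration
first = render_report(ratings)
ratings.append(3)                 # new data arrives
second = render_report(ratings)   # same code, updated document
print(first)
print(second)
```

Because the text is derived from the data rather than typed by hand, re-running the same script after new data arrives keeps the numbers and the prose in sync.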
Nature of Modern Data
The x V's of Data
o Volume
o Velocity
o Variety (cognitive challenge lies here)
o Integrating and harmonizing data from a variety of sources and formats (numeric, text, image, video, social media)
Major progress (AWS/Google Cloud)
Era of Open Source: Machine Learning / AI Algorithms
Implication: Use of analytics/machine learning is simply good business practice (almost a necessity) rather than a differentiator.
Example: “Variety” in Data
Online Reviews: Empirical Generalizations
Joint work with Poppy Zhang (PhD student, NYU) & Karsten Hansen (UCSD)
Scope of the Data
(Clean data files & R code @ onreviews.org)
Amazon.com: Entire database, all products (1998-2014)
IMDB: All movies (1999-2015)
Vacation Rentals (all of Airbnb & HomeAway)
Glassdoor (Employee ratings of firms)
Yelp (Selected categories & geographies)
Expedia
Volume of Data
Analytics for what?
Primary focus is understanding and insights. Tools: visualization, econometric models, interpretable machine learning
Primary focus is deployment. E.g., classification of spam, banner ad targeting
Example:
What makes a review helpful?
Are there systematic Gender differences?
Approach 1: Classification exercise. Take labeled data and train a CNN/RNN on the review text with helpfulness labels. This gets over 90% accuracy.
Making Data Usable
Broad Categorization of Variables
• Review Attributes
• Star ratings
• Timing/sequence
• Helpfulness (judged by others)
• Language use
1. Review length
2. Valence (positive vs. negative)
3. Readability (words/sentence)
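The three language-use variables can be sketched in a few lines. This is only an illustration under stated assumptions: the word lists are tiny made-up stand-ins for a real sentiment lexicon, and `review_features` is a hypothetical helper, not code from the project.

```python
import re

# Tiny illustrative word lists; a real analysis would use a proper sentiment lexicon.
POSITIVE = {"great", "good", "love", "excellent"}
NEGATIVE = {"bad", "poor", "terrible", "broke"}

def review_features(text):
    """Return review length, a crude valence score, and words per sentence."""
    words = re.findall(r"[A-Za-z']+", text.lower())
    # Split on sentence-ending punctuation and drop empty pieces.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    return {
        "length": len(words),                                    # 1. review length
        "valence": pos - neg,                                    # 2. positive minus negative words
        "words_per_sentence": len(words) / max(len(sentences), 1),  # 3. readability proxy
    }

features = review_features("Great camera. The battery is bad. I love it!")
print(features)
```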
Extract Features of Customers/Products
Reviewer Attributes
• Heavy vs. occasional reviewer
• Purchase information (sometimes)
• Geography (sometimes)
• Gender (proxies)
Product Attributes
• Hedonic/experiential/durable
• Average rating
• Within category (e.g., Action vs. Comedy)
• Sales rank
• Popularity (accumulated number of reviews)
• Price
Once the data is harmonized, analytics is simplified drastically.
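To make "harmonized" concrete, here is a small sketch that maps records from two differently shaped sources onto one common schema. The raw field names, the `Review` class, and the linear rescaling of a 10-point rating onto 1-5 stars are assumptions made for illustration, not the layout of the actual data files.

```python
from dataclasses import dataclass

@dataclass
class Review:
    source: str
    product_id: str
    stars: float      # always on a 1-5 scale after harmonization
    text: str

def from_five_star(raw, source):
    # Hypothetical raw layout for a site that already uses 1-5 stars.
    return Review(source, raw["item"], float(raw["stars"]), raw["body"])

def from_ten_point(raw, source):
    # Hypothetical raw layout for a site with a 1-10 rating; map linearly onto 1-5.
    stars = 1 + (raw["rating"] - 1) * 4 / 9
    return Review(source, raw["title_id"], stars, raw["review"])

reviews = [
    from_five_star({"item": "B001", "stars": 4, "body": "Works well."}, "retail"),
    from_ten_point({"title_id": "tt01", "rating": 10, "review": "Stunning."}, "movies"),
]
# Once every record has the same shape, cross-source summaries are one-liners.
avg = sum(r.stars for r in reviews) / len(reviews)
print(avg)
```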
Example: Quantifying Image
Clarifai
Helpfulness of Review
What makes a review helpful?
Which type of review is most helpful?
1-Star, 3-Star, 5-Star
Psychology Literature
“There is a general bias, based on both innate predispositions and experience, in animals and humans, to give greater weight to negative entities (e.g., events, objects, personal traits)” Rozin & Royzman (2001)
• Negative assessments are perceived as more diagnostic, particularly when the assessment is well-reasoned and elaborated at some length
5-Star Reviews Most Helpful
Price and Helpful %: Electronic Products
IMDB
• Reviews offer quick inference after a movie's release
• On average, each movie gets 147 reviews
• On average, each review gets 6 helpful votes out of 11 total votes
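The helpful share behind these averages is a simple ratio. The sketch below uses a hypothetical `helpful_share` helper and the averages quoted above; note that a ratio of averages is only a rough stand-in for the average of per-review ratios.

```python
def helpful_share(helpful_votes, total_votes):
    """Fraction of votes marking a review as helpful; None when nobody voted."""
    if total_votes == 0:
        return None
    return helpful_votes / total_votes

share = helpful_share(6, 11)   # the averages quoted above
print(f"{share:.1%}")
```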
Sample review for Avatar
Example 2:
Transaction Data
Context
• ACNielsen's Homescan Consumer Panel
• Detailed purchase histories (2004-2015)
• Panelists use hand-held scanners to record every bar-coded item purchased
• Detailed demographic information
• Additional demographics supplemented using location information (e.g., religion, conservatism)
Store Level Data
(35K+ stores, 2006-2015, all categories)
Example 1:
Habitual Buying Behavior
(with Karsten Hansen)
Context
o Thought Experiment:
  • Suppose you recorded your shopping history for every cereal, toothpaste, detergent, etc. for the past 3, 5, or 10 years
  • What can we learn from this information?
  • Questions to ask?
  • Example: Habits vs. variety seeking
  • What would your product portfolio look like?
Positive=>Higher Concentration
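One common way to quantify how concentrated a household's purchases are is a Herfindahl index over brand shares. The slides do not say which concentration measure is used, so treat this as an assumed illustration with made-up purchase histories.

```python
from collections import Counter

def brand_concentration(purchases):
    """Herfindahl index of brand shares: 1.0 means one brand every time,
    1/k means purchases split evenly over k brands."""
    counts = Counter(purchases)
    total = sum(counts.values())
    return sum((c / total) ** 2 for c in counts.values())

habitual = ["A", "A", "A", "B"]   # mostly one brand
variety = ["A", "B", "C", "D"]    # a different brand each time
print(brand_concentration(habitual), brand_concentration(variety))
```

A higher value indicates more habitual (repetitive) buying; a lower value indicates more variety seeking.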
Conservativeness (as measured by voting & religiosity) is associated with:
• Preference for established brands
• Lower propensity to try new products
• Higher brand loyalty (repetitive buying)
Breaking Habits
Will a Fat Tax Work?
Small price differences, when reflected in shelf prices at the point of purchase, have a significant and long-term impact on food choices.
Previous Evidence
o Field Work
  • Econometric/data problems
  • Focus on sales tax
  • Industry funded
o Experimental Work
  • Lab/cafeteria/vending machines
  • Small non-representative samples
This Paper: Quasi Natural Experiment
[Figure: shelf price by milk type (whole, 2%, 1%, skim) under uniform vs. non-uniform pricing; price axis from $2.40 to $2.95. Data labels: $2.91, $2.91, $2.91, $2.90 and $2.87, $2.73, $2.71, $2.60.]
Depending on where you live and what supermarket chain you patronize, you see one of these patterns.
Milk Pricing in the US
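The two patterns can be told apart from the whole-to-2% price ratio, the same quantity used in the regression later in the deck. The sketch below is an assumption-laden illustration: the 1% tolerance is arbitrary, `price_structure` is a hypothetical helper, and the example prices assume the figure's data labels are ordered whole, 2%, 1%, skim.

```python
def price_structure(whole_price, two_pct_price, tolerance=0.01):
    """Label a store 'flat' when the whole/2% price ratio is within `tolerance` of 1."""
    ratio = whole_price / two_pct_price
    label = "flat" if abs(ratio - 1.0) <= tolerance else "non-flat"
    return ratio, label

print(price_structure(2.91, 2.91))   # same price for whole and 2% milk
print(price_structure(2.87, 2.73))   # whole milk priced above 2% milk
```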
[Map: milk pricing structure by state. Legend: Non-Flat Pricing, Primarily Non-Flat, Mixed, Primarily Flat, Flat Pricing, No Data Available. Regions labeled: Southeast FMMO, Upper Midwest FMMO, Central FMMO, Northeast FMMO, Mideast FMMO.]
Pennsylvania: Large milk producer. State regulations.
Upper Midwest FMMO: Wisconsin is the 2nd largest producer.
Uniform/non-uniform price structure is consistent across stores within a chain, even in mixed states.
DATA
• 1,800+ supermarkets
• 6 years of weekly data
• UPC-level sales, price, promotion, etc.
• Counties represent approximately 50% of the population
(a) Comparison of Demographic Profile between Flat and Non-Flat Stores

                      Flat stores        Non-Flat stores
                      Mean    Std Dev    Mean    Std Dev    p-value
Low income            18%     38%        21%     41%        0.08
High income           19%     39%        20%     40%        0.60
% Poverty             2%      1%         2%      1%         0.22
% Children            4%      1%         4%      1%         0.62
% College             39%     49%        41%     49%        0.58
% White               78%     19%        77%     19%        0.49
% Elderly             12%     4%         12%     5%         0.32
Population density    0.12    0.31       0.13    0.18       0.52
(b) (1) Regression of (Price Whole / Price 2%) milk and (2) Variance Decomposition

                                      (1)                     (2)
                                      Estimate   Std Error    % of explained variation accounted for by
Intercept                             1.0393     (0.006)
Median Income                         -0.0017    (0.002)      0.06%
% HH Kids                             -0.0003    (0.001)      0.00%
% College                             -0.0005    (0.002)      0.01%
% White                               -0.0014    (0.001)      0.09%
Population Density                    -0.0003    (0.001)      0.00%
Wage                                  0.0028     (0.002)      0.14%
All retailers within 5 miles          -0.0002    (0.001)      0.00%
Discount retailers within 10 miles    -0.0021    (0.001)      0.18%
Marketing Order Fixed Effects         Included                15.44%
Chain Fixed Effects                   Included                84.07%
R square                              0.658
Is the Pricing Structure Exogenous?
Does it Change Behavior?
Large Response to Small Price Changes
3. Automation/Deployment
Automating Scientific Reporting
Example
Workflow for the Consumer Package Industry
American Politics
Final Thoughts
• Trends
o Data proliferation & rapid advancement in scalable algorithms
o Era of open source: standardization of analytical methods & algorithms
o Provided as a service by cloud hosting providers
• Key: Intuition & critical thinking at every stage rather than a “Ctrl-C Ctrl-V” approach
My Work: Streamlining this analytical workflow with Dynamic Reproducible Documents
Data Intelligence Analytics Deployment