
Research Using Behavioral Big Data (BBD)


Keynote at the 2016 IEEE BigData Congress, Taipei Satellite Session



  1. Research Using Behavioral Big Data: Methodological, Practical, Ethical & Moral Issues. IEEE BigData Congress, Taipei Satellite Session, May 2016. Galit Shmueli 徐茉莉, Institute of Service Science
  2. What is Behavioral Big Data (BBD)?
     • A special type of Big Data
     • Behavioral: people’s actions, interactions, self-reported opinions, thoughts, feelings
     • Human and social aspects: intentions, deception, emotion, reciprocation, herding, …
     • When people are aware of data collection, they modify their behavior (legal risks, embarrassment, unwanted solicitation)
  3. BBD vs. Medical Big Data
     Medical Big Data:
     • physical measurements
     • data collection timing often set by the medical system
     • clinical trials: awareness & vested interest
     BBD:
     • people’s daily actions, interactions, self-reported feelings, opinions, thoughts (UGC)
     • data generation timing often chosen by the user
     • experiments: users often unaware; goal not always in the user’s interest
  4. BBD on Citizens and Customers
     • Governments: security, law enforcement, traffic (cameras, sensors)
     • Financial institutions: fraud, loans (IT systems, cameras)
     • Telecoms: fraud, infrastructure, marketing (IT systems, mobile)
     • Retail chains: marketing, operations, merchandising (POS systems, video, social, mobile)
     • Insurance: setting usage-based insurance premiums (telematics)
     Data collection technologies: cameras, sensors, IT systems (POS, calls, …), GPS, things, Internet, mobile, social
  5. BBD on Employees: service providers use it for quality control and employee performance. Electronic Performance Monitoring (EPM) systems track web surfing, e-mails sent and received, telephone use, video, and location (e.g., taxis).
  6. BBD on Citizens, Customers, Employees: Internet!
     • BBD is now also available to small companies & organizations
     • Online platforms have BBD (e-commerce, gaming, search, social networks, …)
     • Voluntarily entered by users: personal details, photos, comments, messages, search terms, bids in auctions, likes, payment information, connections with “friends”
     • Passive footprints: duration on the website, pages browsed, sequence, referring website, Internet browser, operating system, location, IP address
     • BBD is now available to individuals: the Quantified Self (and apps)
  7. Two important points: (1) more and more human and social activities are moving online; (2) most companies that have BBD were not created for the purpose of generating BBD.
  8. Why should data science researchers care about BBD? Technology is advancing in two directions: micro-level recording of human and social behavior, and fully automated (algorithmic) solutions. Because they are (and should be) involved in designing both!
  9. Research using BBD. Duncan Watts, Microsoft Research: 1. Social science problems are almost always more difficult than they seem. 2. The data required to address many problems of interest to social scientists remain difficult to assemble. 3. Thorough exploration of complex social problems often requires the complementary application of multiple research traditions.
  10. Academic Research Questions using BBD: research about human and social behavior can examine new phenomena, or re-examine old phenomena with better data.
  11. Research Communities: researchers with social science + technical backgrounds, in Information Systems, Marketing, and Computational Social Science.
  12. Examples of BBD Studies in Top Journals
     Consumption in Virtual Worlds (Hinz et al., Information Systems Research, 2015): “The idea that conspicuous consumption can increase social status, as a form of social capital, has been broadly accepted, yet researchers have not been able to test this effect empirically.”
     • an age-old sociology question with new BBD
     • BBD from two virtual-world websites (gaming with a social network)
     Social Influence in Social News Websites (Muchnik et al., Science, 2013): “The recent availability of population-scale data sets on rating behavior and social communication enable novel investigations of social influence...”
     • an existing question in a new context: study social influence bias in rating behavior
     • BBD from a social news aggregation website where users contribute news articles, discuss them, and rate comments
  13. Online Consumer Ratings of Physicians (Gao et al., Information Systems Research, 2014): “examine how closely the online ratings reflect patients’ opinion about physician quality at large.”
     • a new phenomenon: online ratings of service providers
     • BBD with direct measures of both the offline population’s perception of physician quality and consumer-generated online reviews
     Impact of Teachers on Student Outcomes using Education and Tax BBD (Chetty et al., American Economic Review, 2014)
     • the long-term impact of teachers on student outcomes has long been of interest in economic policy: an old question with new BBD
     • combined BBD from administrative school district records and federal income tax records
  14. Emotional Contagion in Social Networks (Kramer et al., Proceedings of the National Academy of Sciences, 2014)
     • Can emotional states be transferred to others via emotional contagion?
     • BBD from a large-scale experiment run by Facebook, manipulating users’ exposure to emotional expressions in their Facebook News Feed
     Anonymous Browsing in Online Dating Websites (Bapna et al., Management Science, 2016): “Online dating platforms offer new capabilities, such as extensive search, big data–based recommendations, and varying levels of anonymity, whose parallels do not exist in the physical world...”
     • new questions about human behavior due to new technologies
     • BBD from a large-scale experiment, partnered with a large dating website in North America, testing the effect of anonymous browsing on matching
  15. One-Way Mirrors in Online Dating: A Randomized Field Experiment. Ravi Bapna, University of Minnesota; Jui Ramaprasad, McGill University; Galit Shmueli, National Tsing Hua University; Akhmed Umyarov, University of Minnesota
  16. Online Dating: 46% of the single population in the US uses online dating to find a partner (Gelles 2011)
  17. Online Dating Website
  18. Non-anonymous Browsing (Default): after a profile visit, the visited user sees the visitor listed under “Recent visitor”
  19. Anonymous Browsing: after a profile visit, “Recent visitor: NONE”
  20. Research Question (in simple words): How does anonymous browsing affect user behavior? … and matching?
  21. Formal Research Question: What is the relative causal effect of social inhibitions on search preferences vs. social inhibitions on contact initiation in dating markets? Given known gender asymmetries, how does this effect differ for men vs. women?
  22. Randomized Field Experiment on a Large Online Dating Website: 50,000 users receive the gift of anonymous browsing
  23. Results: Users treated with anonymity become disinhibited: they view more profiles, including more same-sex and interracial mates. But they get fewer matches: they lose the ability to leave a weak signal, which is especially harmful for women!
  24. Role of anonymity and importance of the WEAK SIGNAL in online platforms
  25. BBD-based Research Questions
     In Academia, causal questions are most popular.
     • Methodological challenges: scalability of statistical models, small-sample statistical inference, self-selection
     Predictive questions are quite rare.
     • How to use results beyond the specific application? Six uses of predictive analytics for theory building [Shmueli & Koppius, 2011]
     In Industry, the purpose is to evaluate or improve products, services, operations, etc.
     • Netflix Prize: movie recommender system
     • Yahoo!, LinkedIn: personalized news content to increase user engagement/clicks [Agarwal & Chen 2016]
     • Target: pregnancy prediction
     • Amazon: pricing, etc.
     • Government: campaign targeting
  26. Getting BBD for Research
     1. Open data, publicly available data: Twitter, Kaggle, the UCI ML Repository, APIs and web scraping
     2. Partnering with a company:
     • both parties interested in the research question
     • data purchase
     • personal connections
     • partnership between a school and an organization (CMU Living Analytics Research Lab)
  27. Getting BBD for Research (continued)
     3. Crowdsourcing: Amazon Mechanical Turk (AMT), replacing student subjects
     • experiment subjects
     • survey respondents
     • cleaning and tagging data
     “easy access to a large, stable, and diverse subject pool, the low cost of doing experiments, and faster iteration between developing theory and executing experiments” [Mason and Suri, 2012]
  28. Using BBD for Research: Human Subjects. The Institutional Review Board (IRB), or “ethics committee”, is a university-level committee designated to approve, monitor, and review biomedical and behavioral research involving humans.
     • performs a benefit-risk analysis for a proposed study
     • guidelines: beneficence, justice, and respect for persons
  29. Ethics: Beyond the IRB
     • HHS proposed new IRB exemption criteria for publicly available data (or even purchased data)
     • The Council for Big Data, Ethics & Society’s letter: “these criteria for exclusion focus on the status of the dataset… not the content of the dataset nor what will be done with the dataset, which are more accurate criteria for determining the risk profile of the proposed research”
     The Facebook experiment [Kramer et al. 2014]:
     • no IRB: “[The work] was consistent with Facebook’s Data Use Policy, to which all users agree prior to creating an account on Facebook, constituting informed consent for this research.”
     • PNAS editorial Expression of Concern
     • varied response from the public, academia, press, ethicists, corporates [Adar 2015]
  30. Big Behavioral Experiments
  31. Big Behavioral Experiments: Issues (compared to the industrial environment)
     1. Fast-changing environment: multiple A/B tests run every day (overlaps); users keep evolving
     2. Multiplicity and scaling: computational advertising and content recommendation; the 3M’s [Agarwal & Chen 2016]: multi-response (clicks, shares, likes, …), multi-context (mobile, email, ...), multiple objectives (engagement, revenue, ...)
     3. Spill-over effects: treatment can affect the control group (social networks). The challenge of randomization on a social network (Fienberg, 2015): even if treatment and control members are sufficiently far apart to avoid spill-over effects, the analysis must still account for dependence among units.
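The multiplicity issue is commonly handled by correcting across the many simultaneous tests. A minimal sketch of one standard correction, the Benjamini-Hochberg false-discovery-rate procedure, with made-up p-values (a generic illustration, not a method from the talk):

```python
# Benjamini-Hochberg: control the false discovery rate across m tests.
# The p-values below are hypothetical.

def benjamini_hochberg(pvals, q=0.05):
    """Return indices of hypotheses rejected at FDR level q."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])  # ascending p-values
    k = 0  # largest rank whose p-value clears its BH threshold
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= q * rank / m:
            k = rank
    # reject the k smallest p-values
    return sorted(order[:k])

pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.36]
print(benjamini_hochberg(pvals))  # only the two smallest p-values survive
```

Note that with a naive per-test threshold of 0.05, five of these ten tests would be declared significant; the FDR correction keeps two.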
  32. Big Behavioral Experiments: Issues (continued)
     4. Knowledge of allocation and the gift effect: as in clinical trials, allocation knowledge can affect the outcome, and online users discover their allocation via online forums (blinding and placebo?). A “gift” or preferential treatment can also affect the outcome: Bapna et al. (2016) compared the effect at the end of the manipulation period and right after, to determine the gift effect.
     5. Ethical and moral issues: the ease of running a large-scale experiment quickly and at low cost brings the danger of harming many people quickly (small-scale pilot study?). AMT: fair treatment of & payment to workers.
  33. Observational BBD: Issues
     Ethical and moral issues:
     • privacy (Netflix)
     • data protection and reproducible research
     • company-vs-users conflict of interest (study conclusions lead to operational actions that trade off the company’s interest against user well-being)
     • AMT: payment to workers
     Methodological issues:
     1. Self-selection bias: users choose their treatment (does propensity score matching scale to big data?)
     2. Simpson’s paradox: the causal direction reverses when data are disaggregated (does a dataset have a paradox?)
     3. Contamination by experiments
     4. Data size & dimension: very large, rich data are needed to answer predictive questions [Junque de Fortuny et al. 2014]
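Simpson's paradox is easy to demonstrate concretely. A minimal sketch using classic kidney-stone-style counts (illustrative numbers, not from any study cited here): one treatment wins within every subgroup, yet loses after aggregation.

```python
# (successes, trials) per treatment arm within each subgroup.
# Illustrative counts: A beats B in both subgroups, B beats A overall.
groups = {
    "small": {"A": (81, 87),   "B": (234, 270)},
    "large": {"A": (192, 263), "B": (55, 80)},
}

def rate(successes, trials):
    return successes / trials

# Success rates within each subgroup: A is higher in both.
within = {g: {t: rate(*c) for t, c in arms.items()} for g, arms in groups.items()}

# Aggregate over subgroups: the ordering reverses.
totals = {t: [0, 0] for t in ("A", "B")}
for arms in groups.values():
    for t, (s, n) in arms.items():
        totals[t][0] += s
        totals[t][1] += n
overall = {t: rate(s, n) for t, (s, n) in totals.items()}

print(within)   # A > B in each subgroup
print(overall)  # yet B > A overall: the paradox
```

The reversal arises because treatment A was applied disproportionately to the harder ("large") subgroup, which is exactly why disaggregation can flip an observational conclusion.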
  34. A Tree-Based Approach for Addressing Self-Selection in Impact Studies with Big Data. Inbal Yahav, Bar-Ilan University, Israel; Galit Shmueli, National Tsing Hua University, Taiwan; Deepa Mani, Indian School of Business, India
  35. Self-Selection: The Challenge. In large impact studies of an intervention, individuals/firms choose which group to join. How can we identify and adjust for self-selection?
  36. Current Methods: Challenges with Big Data
     1. Matching leads to severe data loss
     2. They suffer from “data dredging”
     3. They do not identify the variables that drive the selection
     4. They assume a constant intervention effect
     5. Their sequential nature is computationally costly
     6. They require the user to specify the form of the selection model
  37. Our Tree-Based Approach: use a data mining algorithm in a novel way
     • flexible, non-parametric selection model
     • automated detection of unbalanced variables
     • easy to interpret, transparent, visual
     • applicable to binary, polytomous, and continuous interventions
     • useful in a Big Data context
     • identifies heterogeneous effects
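The detection idea can be sketched with a toy one-split "stump": model treatment choice from covariates, and flag the covariate whose levels show the largest gap in treatment rates. Everything below (the data, the variable names, the single-split simplification) is hypothetical; the paper's actual tree method is far richer.

```python
import random

random.seed(0)

# Synthetic data: hs_degree drives self-selection into treatment;
# urban is an irrelevant covariate. All names/values are made up.
data = []
for _ in range(2000):
    hs_degree = random.random() < 0.5
    urban = random.random() < 0.5
    p_treat = 0.7 if hs_degree else 0.3   # the self-selection mechanism
    data.append({"hs_degree": hs_degree,
                 "urban": urban,
                 "treated": random.random() < p_treat})

def treat_rate(rows):
    return sum(r["treated"] for r in rows) / len(rows)

def best_split(rows, covariates):
    # Score each covariate by the gap in treatment rate across its levels;
    # a large gap means the groups are unbalanced on that covariate.
    gaps = {}
    for c in covariates:
        yes = [r for r in rows if r[c]]
        no = [r for r in rows if not r[c]]
        gaps[c] = abs(treat_rate(yes) - treat_rate(no))
    return max(gaps, key=gaps.get), gaps

driver, gaps = best_split(data, ["hs_degree", "urban"])
print(driver, gaps)  # hs_degree shows the large imbalance
```

A real tree would split recursively and estimate intervention effects within the resulting (approximately balanced) leaves, rather than stopping after one split.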
  38. Example: Impact of training on financial gains. Experiment: a US government program randomly assigned eligible candidates to a training program, with the goal of increasing future earnings. Results (LaLonde, 1986):
     ✓ groups statistically equal in terms of demographics & pre-training earnings
     ✓ average training effect = $1,794 (p < 0.004)
  39. Tree reveals: high school matters! (tree splits on high-school degree: no/yes)
                            LaLonde’s naïve approach   HS dropout (n=348)   HS degree (n=97)
     Not trained (n=260)    $4,554                     $4,495               $4,855
     Trained (n=185)        $6,349                     $5,649               $8,047
     Training effect        $1,794 (p=0.004)           $1,154 (p=0.063)     $3,192 (p=0.015)
     Tree overall effect: $1,598 (p=0.017)
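As a quick arithmetic check on the numbers above: the tree's overall effect of $1,598 is the subgroup-size-weighted average of the two subgroup training effects.

```python
# Subgroup sizes and training effects as reported on the slide.
subgroups = {
    "HS dropout": {"n": 348, "effect": 1154},
    "HS degree":  {"n": 97,  "effect": 3192},
}

total_n = sum(g["n"] for g in subgroups.values())
overall = sum(g["n"] * g["effect"] for g in subgroups.values()) / total_n
print(round(overall))  # 1598, matching the tree's overall effect
```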
  40. Large Scale Surveys
     Online surveys: cheap, easy, fast; a large pool of available “workers”; supplement experimental/observational studies
     Data quality: duplicate responses and insincere responses require different approaches at large scale
     Paradata: data on how the survey was accessed/answered, e.g., time stamps of opening the invitation email and of accessing the survey, and the duration for answering each question [Survey of Adult Skills by the OECD]
  41. Large Scale Surveys. Methodological issue: generalization (sampling and non-sampling errors). “The central issue is whether conditional effects in the sample (the study population) may be transported to desired target populations. Success depends on compatibility of causal structures in study and target populations, and will require subject matter considerations in each concrete case.” [Keiding and Louis, 2016] • statistical generalization & scientific generalization [Kenett & Shmueli, 2014]
  42. Methodical Analysis Cycle of BBD, inspired by the lifecycle view [Kenett, 2014] and statistical-thinking building blocks [Hoerl et al. 2014]:
     1. understand the company context and the BBD
     2. set up the research question
     3. determine the experimental design
     4. obtain IRB approval (if needed)
     5. possibly: pilot experiment
     6. communicate the design to the company; assure feasibility
     7. company deploys the experiment and collects the data
     8. company shares the data with the researchers
     9. researchers analyze the data and arrive at conclusions
     10. researchers share the insights and conclusions with the company and the research community
     11. company operationalizes the insights to improve its business
     12. company deploys an impact study
  43. Summary. BBD = lots of behavioral data. Who has it? How is it analyzed? For what purpose?
     Technical challenges: data access; analysis scalability; quick-changing environment
     Methodological challenges: selection bias; generalization; “control” group contaminated by other experiments; spill-over effects; lack of a methodical lifecycle
     Legal, ethical, moral challenges: privacy violation (Netflix; networks); risks to human subjects; company vs. researcher objectives; gains of the company at the expense of individuals, communities, societies, & science
  44. Why should data science researchers care about BBD? Technology is advancing in two directions: micro-level recording of human and social behavior, and fully automated (algorithmic) solutions.
  45. Contemplation
     • threats to privacy, society, governance, human thought, and human interaction
     • generalization for a company ≠ scientific generalization
     • personalization efforts can become de-personalization
     • the “law of unintended consequences”: labeling a “student at risk” or a “potential criminal”
     • the speed of research and the excitement of new abilities leave no time for contemplation
     From Dave Eggers’ The Circle: the company, run out of a sprawling California campus, links users’ personal emails, social media, banking, and purchasing with their universal operating system, resulting in one online identity and a new age of civility and transparency.
  46. The Way Forward: convergence of the social sciences and engineering. Things eventually collect BBD (intentionally or not).
  47. Analytics, Humanity, Responsibility. Galit Shmueli 徐茉莉, Institute of Service Science