Exposing algorithms pydatadc2016

An algorithm is a set of steps that performs calculations, processes data, or automates tasks. Algorithms are everywhere we look (and even in places we don’t look), controlling what we see, what we do, and where we go. They’re great for solving our problems and helping us make better and quicker decisions, or for taking the decision-making out of our hands. Their guidance seems perfect: an objective and unbiased calculation. Except it is not. Like everything else, algorithms are created by people, and people have biases that get encoded into the algorithms they create. Algorithms also learn from data, which is likewise created by people, so they learn biases from the data too. This becomes a problem when algorithms encode these biases into their calculations and go on to perpetuate them.

In this talk you will hear why we should care about algorithmic accountability, along with a case study showing how computational journalism can be used to investigate algorithms and to advocate for transparency and accountability.

Published in: Data & Analytics

  1. 1. EXPOSING ALGORITHMS COMPUTATIONAL JOURNALISM LAB, UNIVERSITY OF MARYLAND
  2. 2. COMPUTATIONAL JOURNALISM Applying computer science to journalism ▸ Develop tools for newsrooms ▸ Data gathering ▸ Story tracking ▸ Personalized news ▸ Comment moderation ▸ Using computational methods to investigate a story ▸ Algorithmic accountability and transparency
  3. 3. http://www.wordclouds.com
  4. 4. ALGORITHM: POWER, AUTHORITY
  5. 5. GOOGLE CASE STUDY
  6. 6. GOOGLE AUTOCOMPLETE FAQ ▸ “…we exclude a narrow class of search queries related to pornography, violence, hate speech, and copyright infringement.”
  7. 7. GOOGLE AUTOCOMPLETE FAQ ▸ “…we exclude a narrow class of search queries related to pornography, violence, hate speech, and copyright infringement.” ▸ Criteria: Boundaries of censorship; Differences among search engines; Mistakes?
  8. 8. INPUT - OUTPUT STUDY Input, Output
  9. 9. Warning! This presentation contains explicit language.
  10. 10. N. Diakopoulos. Sex, Violence, and Autocomplete Algorithms. Slate. 2013.
  11. 11. What are the criteria?
  12. 12. SEARCH ENGINES ARE COMPLICATED! ▸ Are we using search terms that people in real life use? ▸ Personalization (IP, profile, history) ▸ Randomization, A/B tests ▸ …not to mention Google doesn't want people scraping their results (ack!)
  13. 13. UBER CASE STUDY
  14. 14. ▸ Discriminatory/unfair ▸ Mistake that denies a service ▸ Censorship ▸ Breaks law or social norm ▸ False prediction ▸ Violation of privacy
  15. 15. PREVIOUS WORK ▸ Surge pricing triggered by car requests outnumbering available cars (demand > supply) ▸ Goal of surge pricing: ▸ Encourage more drivers on the road ▸ Redistribute current drivers to areas of high demand
  16. 16. PREVIOUS WORK ▸ Surge pricing triggered by car requests outnumbering available cars (demand > supply) ▸ Goal of surge pricing: ▸ Encourage more drivers on the road ▸ Redistribute current drivers to areas of high demand
  17. 17. CURRENT ▸ Propose service quality may not be the same across D.C. ▸ Expected Wait Time proxy for service: combines car availability, current and historical surge pricing, other hidden factors. ▸ If true, can this be predicted by census data?
  18. 18. APPROACHES, TOOLS ▸ Data sources ▸ Uber API, `uber.py`, census.gov resources (tons, free) ▸ Spatial sampling across the District ▸ Python GIS-related libraries (`geopy`, `address`, `cenpy`) ▸ The http://data.fcc.gov/ API returns an address when given a latitude and longitude ▸ Sample grid-style, averaged to census tracts ▸ Data wrangling and statistics ▸ `pandas`, `numpy`, `statsmodels` ▸ Visualization ▸ CARTO for mapping (3 maps for free) + Adobe Illustrator ▸ `matplotlib` or `seaborn` for graphs ▸ with a touch of Adobe Illustrator
  19. 19. APPROACH - BASICALLY ALL PYTHON COLLECTION ▸ Determine our sampling locations: ▸ Spatial sampling DC -> grid (how dense?) ▸ Temporal sampling -> 3 min (why?) ▸ Uber API rate limits, ▸ #API key access ▸ Address validation ▸ https://github.com/comp-journalism/2016-03-wapo-uber/blob/master/Mapping_points_across_DC.ipynb
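
The grid-style spatial sampling on slide 19 can be illustrated with a minimal sketch. This is not the code from the linked notebook: the function name `make_grid`, the bounding-box values, and the 0.01-degree spacing are all assumptions chosen for illustration, and points that fall outside the District boundary would still need to be filtered out (for example by the address validation step the slide mentions).

```python
def make_grid(south, west, north, east, step_deg):
    """Return (lat, lon) points on a regular grid covering a bounding box.

    Hypothetical helper for illustration; the spacing answers the slide's
    "how dense?" question and is a free parameter here.
    """
    if step_deg <= 0:
        raise ValueError("step_deg must be positive")
    # The small epsilon guards against floating-point error dropping the last row/column.
    n_rows = int((north - south) / step_deg + 1e-9) + 1
    n_cols = int((east - west) / step_deg + 1e-9) + 1
    return [
        (round(south + i * step_deg, 6), round(west + j * step_deg, 6))
        for i in range(n_rows)
        for j in range(n_cols)
    ]


# Rough bounding box around Washington, D.C. (approximate, illustrative values).
DC_BOX = dict(south=38.79, west=-77.12, north=39.00, east=-76.91)
points = make_grid(step_deg=0.01, **DC_BOX)
```

A denser grid gives finer spatial resolution but multiplies the number of API requests per sampling round, which is where the rate limits on the slide come into play.
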
  20. 20. TEXT LOCATIONS PASSED TO UBER API
  21. 21. UBER DATA ▸ Expected Wait Time from Uber API for each location every 3 minutes over 4 weeks ▸ Calculated as mean expected wait time per tract (MEWT) ▸ Proportion calculated as percentage time each tract spent with a surge price multiplier > 1
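
The two per-tract measures on slide 21 can be sketched as follows, using only the standard library rather than the `pandas` workflow the talk used. The function name `summarize_by_tract`, the tract labels, and the sample values are made up for illustration, and the sketch averages samples directly by tract instead of first averaging by grid location.

```python
from collections import defaultdict
from statistics import mean


def summarize_by_tract(samples):
    """Summarize (tract_id, wait_seconds, surge_multiplier) samples per tract.

    Returns the mean expected wait time (MEWT) and the percentage of samples
    in which the surge multiplier was greater than 1.
    """
    waits = defaultdict(list)
    surging = defaultdict(list)
    for tract, wait, surge in samples:
        waits[tract].append(wait)
        surging[tract].append(surge > 1)
    return {
        tract: {
            "mewt": mean(waits[tract]),
            "pct_surging": 100 * sum(surging[tract]) / len(surging[tract]),
        }
        for tract in waits
    }


# Made-up samples for illustration only, not data from the study.
samples = [
    ("A", 240, 1.0),
    ("A", 300, 1.5),
    ("B", 600, 1.0),
    ("B", 480, 1.0),
]
summary = summarize_by_tract(samples)
```

Because samples were taken at a fixed 3-minute interval, the share of samples with a multiplier above 1 approximates the share of time a tract spent surging.
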
  22. 22. AMERICAN COMMUNITY SURVEY 2014 ▸ % People of Color (POC) ▸ % Poverty ▸ Population Density ▸ Median Household Income ▸ Z-score normalized
  23. 23. APPROACH - STILL BASICALLY ALL PYTHON DATA PROCESSING ▸ Collapse data across time (4 weeks in February 2016) ▸ Average data within census tracts ▸ Select only uberX “product_types” ▸ One “ETA” and one “Surge Price Multiplier” value per tract ▸ Census / American Community Survey data: ▸ Poverty -> Calculate % in each tract ▸ Income -> Median income per tract ▸ Race/Ethnicity -> Dichotomized % ▸ Population density (population / tract land area) ▸ Normalized to z-scores
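
The last two processing steps on slide 23 can be sketched briefly. The sketch assumes the conventional definition of population density (people per unit of land area) and uses the population standard deviation for the z-scores; the slides do not say which standard deviation was used, and both function names are hypothetical.

```python
from statistics import mean, pstdev


def population_density(population, land_area_sq_km):
    """People per square kilometre of land (conventional definition)."""
    return population / land_area_sq_km


def z_scores(values):
    """Rescale values to mean 0 and standard deviation 1.

    Uses the population standard deviation (an assumption, see lead-in).
    """
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        raise ValueError("cannot z-score a constant column")
    return [(v - mu) / sigma for v in values]
```

Putting every census variable on the same z-score scale makes regression coefficients comparable across variables measured in very different units, such as dollars and percentages.
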
  24. 24. ESTIMATED WAIT TIMES FOR UBERX Map showing average ETA for an uberX. Northwest DC has a mostly white racial demographic, whereas southeast is mostly people of color. Tract 92.03. 75% POC, Short wait times Universities, restaurants, bars…
  25. 25. APPROACH - PYTHON PYTHON PYTHON PYTHON PYTHON PYTHON PYTHON REGRESSION (GLM, STATSMODELS) Explanatory Variables: % POC***, Population Density***, Median Income, % Poverty, % POC : % Poverty**, % POC : Income
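
Written out, the regression on slide 25 has roughly the following form for census tract i. This is a sketch under assumptions: the slide lists only the explanatory variables, so the response (taken here to be the mean expected wait time), the identity link, and the additive error term are not confirmed by the source. The colon terms on the slide are read as interactions, which is how formula notation in `statsmodels` expresses a product of two variables.

```latex
\mathrm{MEWT}_i = \beta_0
  + \beta_1\,\mathrm{POC}_i
  + \beta_2\,\mathrm{Density}_i
  + \beta_3\,\mathrm{Income}_i
  + \beta_4\,\mathrm{Poverty}_i
  + \beta_5\,(\mathrm{POC}_i \times \mathrm{Poverty}_i)
  + \beta_6\,(\mathrm{POC}_i \times \mathrm{Income}_i)
  + \varepsilon_i
```

All explanatory variables are the z-score normalized tract values, so each coefficient describes the change in the response per one standard deviation of that variable.
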
  26. 26. WHAT NEXT - MORE DATA ▸ Does it reflect differences in Supply/Demand? -> Taxi FOIA ▸ Crime stats -> perception vs facts ▸ Banked / unbanked stats (~14% in DC) ▸ Smart phone ownership ▸ Would the results differ in a different month or city?
  27. 27. DESIGNING FOR TRANSPARENCY AND ACCESSIBILITY WHAT NEXT - DESIGN? ▸ What if: ▸ Taxi demand is high in census tracts underserved by Uber in DC? ▸ Difference in price? Accessibility? Marketing? ▸ Unbanked people with no bank accounts or smart phones could hail via voice? Pay with cash? ▸ Crime perception is different from real life? ▸ Could we indicate crime stats in-app? ▸ Should we? ▸ TRANSPARENCY! https://github.com/comp-journalism/2016-03-wapo-uber ▸ datalensdc.com, Houston, Georgetown, UBER, AARP…
  28. 28. ALGORITHMIC ACCOUNTABILITY IN JOURNALISM ▸ Opportunity for UBER to check our work ▸ Opportunity for audience to check ▸ Spurs us to write better, documented code, check our conclusions and assumptions ▸ Others can use code / data for other stories https://github.com/comp-journalism
  29. 29. ALGORITHMIC ACCOUNTABILITY IN JOURNALISM Free Open Source ▸ Code: GitHub ▸ IPython Notebook ▸ Documentation: README.md ▸ Data: Google Drive ▸ Save wrangled data at intervals in .csv files ▸ Programmatic solutions where possible https://github.com/comp-journalism
  30. 30. QUESTIONS? COLLABORATIONS? Jennifer A. Stark @_JAStark starkja@umd.edu https://github.com/comp-journalism