Privacy, Ethics, and Big (Smartphone) Data, at Mobisys 2014
Keynote talk I gave at the Mobile and Cloud Workshop at Mobisys 2014. I talk about my experiences and reflections on privacy, focusing on (1) Urban Analytics, (2) Google Glass, and (3) PrivacyGrade.


Presentation Transcript

  • ©2009CarnegieMellonUniversity:1 Privacy, Ethics, and Big (Smartphone) Data. Mobile Cloud Computing and Services, June 16, 2014. Jason Hong, Computer Human Interaction: Mobility Privacy Security
  • ©2014CarnegieMellonUniversity:2 Smartphones are Pervasive • 58% penetration in the US as of early 2014 • About 1.2M Android and iOS apps • Over 75 billion app downloads on each of Android and iOS
  • ©2014CarnegieMellonUniversity:3 Smartphones are Intimate • Mobile phones and millennials (Cisco 2012): • 75% use in bed before sleep • 83% sleep with their phones • 90% check first thing in the morning • A third use in bathroom (!!) • A fifth check every ten minutes
  • ©2014CarnegieMellonUniversity:4 Smartphone Data is Intimate Who we know (contact list) Who we call (call log) Who we text (sms log)
  • ©2014CarnegieMellonUniversity:5 Smartphone Data is Intimate Where we go (gps, foursquare) Photos (some geotagged) Sensors (accel, sound, light)
  • ©2014CarnegieMellonUniversity:6 The Opportunity • We are creating a worldwide sensor network with these smartphones • We can now capture and analyze human behavior at unprecedented fidelity and scale
  • ©2014CarnegieMellonUniversity:7 Understanding Individuals Min et al, Toss ‘n’ Turn: Smartphone as Sleep and Sleep Quality Detector, CHI 2014.
  • ©2014CarnegieMellonUniversity:8 Understanding Small Groups Better communication Computational Social Science
  • ©2014CarnegieMellonUniversity:9 Understanding Urban Areas • AT&T work on human mobility (map: median distance traveled per day)
  • ©2014CarnegieMellonUniversity:10 Laborshed for Morristown NJ
  • ©2014CarnegieMellonUniversity:11 Eric Fischer’s Maps of Tourists vs Locals
  • ©2014CarnegieMellonUniversity:12 The Challenge • Mobile and Cloud computing a clear part of the future • Lots of fun technical challenges here – Computer scientists are great here! – We’re awesome at optimizing things • But biggest challenges will likely relate to privacy and ethics
  • ©2014CarnegieMellonUniversity:13 Three Stories • Reflections and personal experiences • Livehoods – How much can we infer about people? – Who wins, who doesn’t with tech? • Google Glass – Why so much negative feedback? – Are there lessons we can learn? • PrivacyGrade – What are ways of protecting people? – What is the role of developers?
  • ©2014CarnegieMellonUniversity:14 Story 1: The Challenge of Getting Data About Our Cities • Today’s methods for getting city data slow, costly, limited – Ex. Travel Behavioral Inventory – US Census 2010 cost $13b – Quality of life surveys • Some approaches today: – Call Data Records – Deploy a custom app
  • ©2014CarnegieMellonUniversity:15 The Vision: Urban Analytics • How can we use smartphones + social media + machine learning to offer new and useful insights about cities in a manner that is cheap, fast, and scalable?
  • ©2014CarnegieMellonUniversity:16 Livehoods, Our First Urban Analytics Tool • The character of an urban area is defined not just by the types of places found there, but also by the people that make it part of their daily life Cranshaw et al, The Livehoods Project: Utilizing Social Media to Understand the Dynamics of a City, ICWSM 2012.
  • ©2014CarnegieMellonUniversity:17 Two Perspectives “Politically constructed” “Socially constructed” Neighborhoods have fixed borders defined by the city government. Neighborhoods are organic, cultural artifacts. Borders are blurry, imprecise, and may be different to different people.
  • ©2014CarnegieMellonUniversity:18 Two Perspectives of Cities “Socially constructed” Neighborhoods are organic, cultural artifacts. Borders are blurry, imprecise, and may be different to different people. Can we discover automated ways of identifying the “organic” boundaries of the city? Can we extract local cultural knowledge from social media? Can we build a collective cognitive map from data?
  • ©2014CarnegieMellonUniversity:19 Livehoods Data Source • Crawled 18m check-ins from foursquare – Claims 20m users – People who linked their foursquare accts to Twitter • Spectral clustering based on geographic and social proximity
  • ©2014CarnegieMellonUniversity:20 If you watch check-ins over time, you’ll notice that groups of like-minded people tend to stay in the same areas.
  • ©2014CarnegieMellonUniversity:21 We can aggregate these patterns to compute relationships between check-in venues.
  • ©2014CarnegieMellonUniversity:22 These relationships can then be used to identify natural borders in the urban landscape.
  • ©2014CarnegieMellonUniversity:23 Livehood 2 Livehood 1 We call the discovered clusters “Livehoods” reflecting their dynamic character.
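    A minimal sketch of the pipeline described on the last few slides, assuming check-ins arrive as simple (user, venue, lat, lon) tuples; the affinity function below is a deliberate simplification of the paper's combination of geographic and social proximity, not the published algorithm:

        # Rough Livehoods-style sketch (simplified): venues that share many visitors
        # and sit close together get high affinity; spectral clustering then cuts
        # the venue graph into "livehoods".
        import numpy as np
        from collections import defaultdict
        from sklearn.cluster import SpectralClustering

        def cluster_venues(checkins, n_livehoods=10, scale_km=0.5):
            venues = {}                   # venue_id -> (lat, lon)
            visitors = defaultdict(set)   # venue_id -> set of user_ids
            for user, venue, lat, lon in checkins:
                venues[venue] = (lat, lon)
                visitors[venue].add(user)

            ids = list(venues)
            n = len(ids)
            affinity = np.zeros((n, n))
            for i, a in enumerate(ids):
                for j, b in enumerate(ids):
                    if i == j:
                        continue
                    shared = len(visitors[a] & visitors[b])                 # social proximity
                    d_km = 111.0 * np.hypot(venues[a][0] - venues[b][0],
                                            venues[a][1] - venues[b][1])    # rough distance
                    affinity[i, j] = shared * np.exp(-d_km / scale_km)      # geographic decay

            labels = SpectralClustering(n_clusters=n_livehoods,
                                        affinity="precomputed").fit_predict(affinity)
            return dict(zip(ids, labels))  # venue_id -> livehood label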
  • ©2014CarnegieMellonUniversity:24 Try it out at livehoods.org
  • ©2014CarnegieMellonUniversity:25 South Side Pittsburgh: Safety (map showing LH6, LH7, LH8, LH9). LH7 vs LH8: “Whenever I was living down on 15th Street [LH7] I had to worry about drunk people following me home, but on 23rd [LH8] I need to worry about people trying to mug you... so it’s different. It’s not something I had anticipated, but there is a distinct difference between the two areas of the South Side.”
  • ©2014CarnegieMellonUniversity:26 South Side Pittsburgh: Demographic Differences (map showing LH6, LH7, LH8, LH9). LH6: “There is this interesting mix of people there I don’t see walking around the neighborhood. I think they are coming to the Giant Eagle [grocery store] from lower income neighborhoods... I always assumed they came from up the hill.”
  • ©2014CarnegieMellonUniversity:27 South Side Pittsburgh “I always assumed they came from up the hill.”
  • ©2014CarnegieMellonUniversity:28 South Side Pittsburgh
  • ©2014CarnegieMellonUniversity:29 Other Potential Urban Analytics
  • ©2014CarnegieMellonUniversity:30 Topic Modeling (LDA)
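    As an illustration only (not the specific analysis behind this slide), LDA over short venue-related texts might look like this with scikit-learn:

        # Generic LDA sketch: discover topics in short texts tied to venues or areas.
        from sklearn.feature_extraction.text import CountVectorizer
        from sklearn.decomposition import LatentDirichletAllocation

        docs = ["cold brew espresso latte wifi",
                "touchdown beer wings game night",
                "yoga vinyasa meditation studio",
                "pierogi polish kielbasa festival"]

        vectorizer = CountVectorizer(stop_words="english")
        X = vectorizer.fit_transform(docs)
        lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

        terms = vectorizer.get_feature_names_out()
        for k, topic in enumerate(lda.components_):
            top = [terms[i] for i in topic.argsort()[-4:][::-1]]
            print(f"topic {k}: {', '.join(top)}")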
  • ©2014CarnegieMellonUniversity:31 Reflections on Urban Analytics Few Privacy Issues • Publicly visible data with no expectations of privacy – No IRB issues • Remove venues labeled as “home” – We only received one request to remove a venue (wasn’t labeled as a home) • We only show data about geographic areas vs individuals • So far, so good
  • ©2014CarnegieMellonUniversity:32 Reflections on Urban Analytics But Lots of Questions About Ethics • Discussions with lots of different folks on how data like this might be used in the future – Urban planners, Yahoo, Facebook, Google Maps, Startups, and more • Lots of great questions – Not so many great answers – Share some of these with you
  • ©2014CarnegieMellonUniversity:33 How Much Can Be Inferred?
  • ©2014CarnegieMellonUniversity:34 How Much Can Be Inferred? • Very likely much more can be inferred using rich data like this – Demographics, socioeconomic, friends – Physical and mental health (depression) – How “risky” you are (bars, clinics, etc) • Unclear how far inferencing can go – Also, not much can stop advertisers, NSA, startups, etc – And, not just what you do, but what others like you are doing
  • ©2014CarnegieMellonUniversity:35 Built a logistic regression to predict sexuality based on what your friends on Facebook disclosed How Much Can Be Inferred?
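    The cited study fit a logistic regression over features of a user's Facebook friends; a generic sketch of that style of model, on synthetic data rather than the study's, might look like:

        # Toy sketch: predict an undisclosed attribute from the fraction of a
        # user's friends who publicly disclose it (synthetic data only).
        import numpy as np
        from sklearn.linear_model import LogisticRegression

        rng = np.random.default_rng(0)
        frac_friends_disclosing = rng.uniform(0, 1, size=(1000, 1))   # one feature
        y = (frac_friends_disclosing[:, 0] + rng.normal(0, 0.2, 1000) > 0.5).astype(int)

        model = LogisticRegression().fit(frac_friends_disclosing, y)
        print(model.predict_proba([[0.8]]))   # P(attribute) if 80% of friends disclose it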
  • ©2014CarnegieMellonUniversity:36 “[An analyst at Target] was able to identify about 25 products that… allowed him to assign each shopper a ‘pregnancy prediction’ score. [H]e could also estimate her due date to within a small window, so Target could send coupons timed to very specific stages of her pregnancy.” (NYTimes)
  • ©2014CarnegieMellonUniversity:37 How Much Can Be Inferred?
  • ©2014CarnegieMellonUniversity:38 A New Kind of Redlining? • “denying, or charging more for, services such as banking, insurance, access to health care, … supermarkets, or denying jobs … against a particular group of people” (Wikipedia)
  • ©2014CarnegieMellonUniversity:39 Huge Risk of New Kind of Redlining Map of Philadelphia showing redlining of lower income neighborhoods. Households and businesses in the red zones could not get mortgages or business loans. (Wikipedia)
  • ©2014CarnegieMellonUniversity:40 Huge Risk of New Kind of Redlining Johnson says his jaw dropped when he read one of the reasons American Express gave for lowering his credit limit: "Other customers who have used their card at establishments where you recently shopped have a poor repayment history with American Express."
  • ©2014CarnegieMellonUniversity:41 Potential Biases in the Data? Pittsburgh’s Hill District: was ground zero for jazz musicians in the 20th century. Urban renewal displaced 8,000 residents and 400 businesses, decimating the economic center of African-American Pittsburgh. Median Income (2009): $17,939
  • ©2014CarnegieMellonUniversity:42 Potential Biases in the Data? • Socioeconomic bias – Little data from lower socioeconomic areas • Urban bias – Twitter, Flickr, Foursquare all more active per capita in cities • Age and gender bias – Most young, male, technology-savvy • Is this a problem that will solve itself? – Or, can we address this in our models?
  • ©2014CarnegieMellonUniversity:43 Who Gains From this Data? • Today, most data only flows one way – Mainly to advertisers (and NSA) – Also banks, insurance, credit cards
  • ©2014CarnegieMellonUniversity:44 Who Gains From this Data? • Can we design systems that share the value across more people? – People co-create data and gain value – Participatory design philosophy • Can we also make people feel more invested in the cities they live in?
  • ©2014CarnegieMellonUniversity:45 Tiramisu Bus Tracker App • People can see incoming bus data • People can also share info – Got on bus – #seats available • New feature: people can discuss transit
  • ©2014CarnegieMellonUniversity:46 Story 2: Google Glass • Why has there been so much negative backlash against Google Glass? • Are there lessons we can learn here about privacy and adoption of tech?
  • ©2014CarnegieMellonUniversity:47 Origins of Ubicomp
  • ©2014CarnegieMellonUniversity:48 Popular Press Reaction
  • ©2014CarnegieMellonUniversity:49 Why Such Negative Press? • PARC knew privacy a big problem – But didn’t know what to do – So didn’t build any privacy protections • Unclear value proposition – Focused on technical aspects – What benefits to end-users? • Google Glass is replaying the past
  • ©2014CarnegieMellonUniversity:50 Why Such Negative Press? • Grudin’s law – Why does groupware fail? – “When those who benefit are not those who do the work, then the technology is likely to fail, or at least be subverted” • Privacy corollary – When those who bear the privacy risks do not benefit in proportion to the perceived risks, the tech is likely to fail
  • ©2014CarnegieMellonUniversity:51 But Expectations Can Change • Initial perceptions of mobile phone users – Rude, annoying – Casual chat, driving • Six weeks later… – Had same behaviors • People with more exposure to mobile phones were more accepting of them
  • ©2014CarnegieMellonUniversity:52 But Expectations Can Change • Famous 1890 article defining privacy as “the right to be let alone” was about photography
  • ©2014CarnegieMellonUniversity:53 • People objected to having phones in their homes because it “permitted intrusion… by solicitors, purveyors of inferior music, eavesdropping operators, and even wire-transmitted germs” But Expectations Can Change
  • ©2014CarnegieMellonUniversity:54 But Expectations Can Change One resort felt the trend so heavily that it posted a notice: “PEOPLE ARE FORBIDDEN TO USE THEIR KODAKS ON THE BEACH.” Other locations were no safer. For a time, Kodak cameras were banned from the Washington Monument. The “Hartford Courant” sounded the alarm as well, declaring that “the sedate citizen can’t indulge in any hilariousness without the risk of being caught in the act and having his photograph passed around among his Sunday School children.”
  • ©2014CarnegieMellonUniversity:55 And Framing Really Matters • Ubicomp -> Invisible Computing – Talked less about the tech – More about how it could help people – More positive press
  • ©2014CarnegieMellonUniversity:56 Privacy Hump • A lot of our concerns about tech fall under the umbrella term “privacy” – Value, fears, expectations, what others around us think • Before the hump: many legitimate concerns, many alarmist rants, open questions about the “right” way to deploy, the value proposition, and rules on proper use • After the hump: things have settled down, few fears materialized, people feel in control, people get tangible value
  • ©2014CarnegieMellonUniversity:57 But How to Get Over Hump? • Still a big gap in knowledge on best ways of mitigating privacy issues • Prime example: Facebook news feed
  • ©2014CarnegieMellonUniversity:58 “We’d put an ad for a lawn mower next to diapers. We’d put a coupon for wineglasses next to infant clothes. That way, it looked like all the products were chosen by chance.”
  • ©2014CarnegieMellonUniversity:59 • Privacy Placebos: Things that make people feel better about privacy, but don’t offer much real protection. • Other examples: Privacy policies, access logs • How ethical is it to use these approaches?
  • ©2014CarnegieMellonUniversity:60 Story 3: PrivacyGrade • [News screenshot: an app collecting every SMS, search, and phone#] • How can we help improve privacy? – Developers – End-users
  • ©2014CarnegieMellonUniversity:61 What Do Developers Know about Privacy? • Interviews with 13 app developers • Surveys with 228 app developers • What tools? Knowledge? Incentives? • Points of leverage? Balebako et al, The Privacy and Security Behaviors of Smartphone App Developers. USEC 2014.
  • ©2014CarnegieMellonUniversity:62 Summary of Findings • App developers not evil • But need to monetize – Ads and analytics heavily used – Turns out libraries big source of problems • Also don’t know what to do – “I haven’t even read [our privacy policy]. I mean, it’s just legal stuff that’s required, so I just put in there.” – Often just ask other people around them – Low awareness of guidelines
  • ©2014CarnegieMellonUniversity:63 Developers Need a Lot of Help! • Good set of best practices for security – SSL, hashing of passwords, randomization – Common attacks: SQL injection, XSS, CSRF • What are best practices for privacy??? • Developers also have many problems – App functionality, bandwidth, power, making money… privacy far down the list
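    Of the security practices named above, salted password hashing is one of the few with a well-codified recipe; a minimal standard-library sketch (real apps should prefer a vetted library such as bcrypt or argon2):

        # Minimal salted password hashing using only the Python standard library.
        import hashlib, hmac, os

        def hash_password(password: str):
            salt = os.urandom(16)                             # random per-user salt
            digest = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 200_000)
            return salt, digest

        def verify_password(password: str, salt: bytes, digest: bytes) -> bool:
            candidate = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 200_000)
            return hmac.compare_digest(candidate, digest)     # constant-time comparison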
  • ©2014CarnegieMellonUniversity:64 What are your apps really doing? Shares your location, gender, unique phone ID, phone# with advertisers Uploads your entire contact list to their server (including phone #s) Lin et al, Expectation and Purpose: Understanding User’s Mental Models of Mobile App Privacy thru Crowdsourcing. Ubicomp 2012.
  • ©2014CarnegieMellonUniversity:65 Many Smartphone Apps Have “Unusual” Permissions (examples of apps requesting Location Data, Unique Device ID, and Network Access)
  • ©2014CarnegieMellonUniversity:66 Expectations vs Reality
  • ©2014CarnegieMellonUniversity:67 Privacy as Expectations • Apply this same idea of mental models for privacy – Compare what people expect an app to do vs what an app actually does – Emphasize the biggest gaps, misconceptions that many people had App Behavior (What an app actually does) User Expectations (What people think the app does)
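    A minimal sketch of this expectation-gap idea, with hypothetical field names: compare the fraction of surveyed users who expected each (permission, purpose) pair against what the app was actually observed doing, and rank by surprise:

        # Rank an app's observed behaviors by how surprising they were to users.
        def surprise_report(expectations, behaviors):
            # expectations: {(permission, purpose): fraction of users who expected it}
            # behaviors: set of (permission, purpose) pairs the app actually exhibits
            ranked = [(1.0 - expectations.get(b, 0.0), b) for b in behaviors]
            return sorted(ranked, reverse=True)               # biggest gaps first

        expectations = {("location", "advertising"): 0.05,
                        ("location", "map display"): 0.90}
        behaviors = {("location", "advertising"), ("location", "map display")}
        for surprise, (perm, purpose) in surprise_report(expectations, behaviors):
            print(f"{perm} for {purpose}: {surprise:.0%} of users would be surprised")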
  • ©2014CarnegieMellonUniversity:68 10% of users were surprised this app wrote contents to their SD card. 25% of users were surprised this app sent their approximate location to dictionary.com for searching nearby words. 85% of users were surprised this app sent their phone’s unique ID to mobile ads providers. 0% of users were surprised this app could control their audio settings. 90% of users were surprised this app sent their precise location to mobile ads providers. 95% of users were surprised this app sent their approximate location to mobile ads providers. 95% of users were surprised this app sent their phone’s unique ID to mobile ads providers. 0% of users were surprised this app can control the camera flashlight.
  • ©2014CarnegieMellonUniversity:69 Results for Location Data (N=20 per app, Expectations Condition). Comfort level (scale -2 to 2): Maps 1.52; GasBuddy 1.47; Weather Channel 1.45; Foursquare 0.95; TuneIn Radio 0.60; Evernote 0.15; Angry Birds -0.70; Brightest Flashlight Free -1.15; Toss It -1.2
  • ©2014CarnegieMellonUniversity:70 How to Scale It Up?
  • ©2014CarnegieMellonUniversity:71 How PrivacyGrade Works • Libraries can offer us insight into the purpose of a permission request – Location used by Google Maps library vs Location used by advertising library • Create a model of people’s concerns of permission by purpose Lin et al, Modeling Users’ Mobile App Privacy Preferences: Restoring Usability in a Sea of Permission Settings. SOUPS 2014.
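    A sketch of this library-based purpose attribution, where the library-to-purpose mapping below is illustrative rather than PrivacyGrade's actual list:

        # Attribute a permission use to a purpose based on which package made the call.
        LIBRARY_PURPOSES = {
            "com.google.android.gms.maps": "map display",
            "com.example.adnetwork":       "advertising",   # hypothetical ad library
            "com.example.analytics":       "analytics",     # hypothetical analytics lib
        }

        def infer_purpose(calling_package: str) -> str:
            for prefix, purpose in LIBRARY_PURPOSES.items():
                if calling_package.startswith(prefix):
                    return purpose
            return "app functionality"   # default: the app's own code made the call

        print(infer_purpose("com.example.adnetwork.LocationTracker"))   # -> advertising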
  • ©2014CarnegieMellonUniversity:72 How PrivacyGrade Works • We crowdsourced people’s expectations of a core set of 837 apps – Ex. “How comfortable are you with Fruit Ninja using your location for ads?” – 20 people per question (as above)
  • ©2014CarnegieMellonUniversity:73 How PrivacyGrade Works
  • ©2014CarnegieMellonUniversity:74
  • ©2014CarnegieMellonUniversity:75
  • ©2014CarnegieMellonUniversity:76
  • ©2014CarnegieMellonUniversity:77 Early Discussion of PrivacyGrade • Can we use big data for privacy? • Many points of leverage in ecosystem – Policy makers / third party advocates – Developers – OS / Hardware / app store – End-users • Legal issues – Still under discussion with CMU lawyers – DMCA, liability issues
  • ©2014CarnegieMellonUniversity:78 How far should we go in inferring behaviors? How can we minimize biases in our data and models? How can we help ensure data doesn’t flow one way?
  • ©2014CarnegieMellonUniversity:79 What are better ways of going over the privacy hump? How acceptable is it to use privacy placebos?
  • ©2014CarnegieMellonUniversity:80 How can we help developers with privacy? Can we use big data approaches for privacy? What are ways of getting our work out there?
  • ©2014CarnegieMellonUniversity:81 How can we work with public policy makers to create better guidelines around privacy?
  • ©2014CarnegieMellonUniversity:82 How can we create a connected world we would all want to live in? Computer Human Interaction: Mobility Privacy Security
  • ©2014CarnegieMellonUniversity:83 Thanks! More info at cmuchimps.org or email jasonh@cs.cmu.edu Special thanks to: • Army Research Office • NSF • Alfred P. Sloan • NQ Mobile • DARPA • Google • CMU Cylab • Shah Amini • Justin Cranshaw • Afsaneh Doryab • Jialiu Lin • Song Luan • Jun-Ki Min • Dan Tasse • Jason Wiese
  • ©2014CarnegieMellonUniversity:84
  • ©2014CarnegieMellonUniversity:85 Summary • Smartphones and cloud computing offer big opportunity to understand human behavior • Also pose many large challenges, in privacy and ethics • But I’m optimistic
  • ©2014CarnegieMellonUniversity:86
  • ©2014CarnegieMellonUniversity:87 Using Location Data to Infer Friendships • 2.8m location sightings of 489 users of Locaccino friend finder in Pittsburgh • Place entropy for inferring social quality of a place – #unique people seen in a place – 0.0002 x 0.0002 lat/lon grid, ~30m x 30m Cranshaw et al, Bridging the Gap Between Physical Location and Online Social Networks, Ubicomp 2010
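    Place entropy as described here is straightforward to compute from raw sightings; a sketch assuming (user, lat, lon) tuples and the ~30m grid from the slide:

        # Entropy of each ~30m x 30m grid cell: high entropy means many different
        # people are seen there (e.g., a cafe), low entropy means few (e.g., a home).
        import math
        from collections import Counter, defaultdict

        CELL = 0.0002   # degrees, roughly 30m x 30m as on the slide

        def place_entropy(sightings):
            cell_users = defaultdict(Counter)
            for user, lat, lon in sightings:
                cell_users[(round(lat / CELL), round(lon / CELL))][user] += 1

            entropy = {}
            for cell, counts in cell_users.items():
                total = sum(counts.values())
                entropy[cell] = -sum((c / total) * math.log(c / total)
                                     for c in counts.values())
            return entropy   # cell -> entropy over the people seen there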
  • ©2014CarnegieMellonUniversity:88 Inferring Friendships • 67 different machine learning features – Location diversity (and entropy) – Intensity and Duration – Specificity (TF-IDF) – Graph structure (overlap in friends) • 92% accuracy in predicting friend/not – Location entropy improves performance over shallow features like #co-locations
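    Features like these can feed any standard classifier; a toy sketch in which the three feature columns merely stand in for the paper's 67 features:

        # Toy friend / not-friend classifier on placeholder co-location features.
        import numpy as np
        from sklearn.ensemble import RandomForestClassifier
        from sklearn.model_selection import train_test_split

        X = np.random.rand(500, 3)                        # stand-ins for real features
        y = (X[:, 0] * (1 - X[:, 1]) > 0.3).astype(int)   # synthetic labels

        X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
        clf = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
        print("held-out accuracy:", clf.score(X_te, y_te))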
  • ©2014CarnegieMellonUniversity:89 Co-location data to infer friendship • Using place entropy, accuracy of 92% • Can also infer number of friends
  • ©2014CarnegieMellonUniversity:90
  • ©2014CarnegieMellonUniversity:91 Could infer friend / not-friend based on co-location patterns and place entropy at 92%
  • ©2014CarnegieMellonUniversity:92 Brooklyn Queens Expressway
  • ©2014CarnegieMellonUniversity:93 Bezerkeley, CA
  • ©2014CarnegieMellonUniversity:94 Expectations Condition Why do you think Angry Birds uses your location data? How comfortable are you with Angry Birds using your location data?
  • ©2014CarnegieMellonUniversity:95 Purpose Condition Angry Birds uses your location data for advertising. How comfortable are you with Angry Birds using your location data?
  • ©2014CarnegieMellonUniversity:96 Few people read privacy policies • We want to install the app • Reading policies not part of main task • Complexity (the pain!!!) • Clear cost (time) for unclear benefit
  • ©2014CarnegieMellonUniversity:97 How PrivacyGrade Works (1) • We operationalize privacy as people’s expectations of data use – Ex. Most people don’t expect Fruit Ninja to use location data, but it actually does – Ex. Most people do expect Google Maps to use location data
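    Putting the pieces together, one hypothetical way to turn expectation-based comfort scores into a letter grade; the comfort values and cut-offs below are illustrative, not PrivacyGrade's published model:

        # Hypothetical grading sketch: average crowd comfort (-2..2) over everything
        # the app actually does, then bucket the average into a letter grade.
        COMFORT = {
            ("location", "map display"):   1.5,
            ("location", "advertising"):  -1.0,
            ("device id", "advertising"): -1.2,
        }

        def grade_app(observed_behaviors):
            scores = [COMFORT.get(b, 0.0) for b in observed_behaviors]
            avg = sum(scores) / len(scores) if scores else 2.0   # nothing sensitive observed
            if avg >= 1.0:
                return "A"
            if avg >= 0.0:
                return "B"
            if avg >= -1.0:
                return "C"
            return "D"

        print(grade_app({("location", "advertising"), ("device id", "advertising")}))  # -> "D"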