LET'S GET REAL
THE REAL IMPACT OF BIG DATA
Claudiu Popa
@datarisk
THE DEFINITION
• Short for big data analytics
• The trend towards larger data sets allowing correlations to be found to spot business trends, determine quality of research, prevent diseases, link legal citations, combat crime, and determine real-time roadway traffic conditions
THE BUZZWORD
• Marketing term increasingly used to evoke computational power and large-scale algorithms that can turn mountains of data into usable information and intelligent business decisions
• Modern-day version of Big Brother: diverse transactional data that can impact privacy
MENU
• PROMISE
• REALITY
• OPPORTUNITY
• CHALLENGE
• SOLUTION
The Promise of Big Data
Sexiest Job of the 21st Century?
• Predicting crime / fraud detection
• Timing investments
• Mining astro data for E.T.
• Productivity loss of stressful travel
• Call centre analytics
• Dynamic ticket pricing
• Guessing demand for better service
• E-com & customer service
• Capture recurring revenue
• Using sensor data for efficiency
• Barack Obama's campaign
• Medical research/treatment
• Reduced pushback campaigns
• Crowdfund / crowdsource
• Open pit data mining
• Mapping the sex trade
• Disease eradication
• Tracking endangered species
Big benefits
Source: Fast Company. Photos: Jeff Brown. Illustrations: Justin Mezzell
can't we agree on a definition?
2012 IBM Big Data @Work Survey (1144 professionals in 95 countries/26 industries)
More than Mining: The Reality of Big Data
More fun than mining for Bitcoins
Full graph back to the 1500s saved at http://Popa.ca/GraphInfo
Some get it...
others, not so much.
“Seeking clarity? Given the signal-to-noise ratio, big data itself appears to be telling us that working raw numbers at ever greater scale to force out some answers may not be the real achievement here. Instead, it is perhaps in the elegance with which it steers us towards the right questions to ask.”
big data is often personal
2012 IBM Big Data @Work Survey (1144 professionals in 95 countries/26 industries)
but social media is a big part of the risk
Do custodians have a responsibility?
• Google Person Finder
• Flu Trends
• Dengue Trends
• Crisis Maps
• MOOC metrics
• Sex trade tracking
big data growing big time, and fast
Source #1: www.sourcelink.com/blog/guest-author-series/2012/08/13/if-you-build-it-they-will-come
the 3 dimensions of Big Data
• volume: is there enough data? Too much?
• variety: how diverse is the data?
• velocity: how rapidly must it be processed?
the 6 new aspects of Big Data
• validity: what is the threshold for applicability?
• verifiability: will we continue to trust the input source?
• variability: what's the tolerance for variation?
• veracity: is it useful for predicting business value?
• value: does it offer valuable insight?
• visualization: can it be meaningfully represented?
Big Value
• silos: are departmental silos good or bad?
• timely exploitation: volume and velocity = complexity
• system integration: the real enablers need to 'get' privacy
“Visualization is the most likely V to save big data by making personalized information available to owners and not custodians in natural, frictionless ways.”
Full disclosure: why I care
• integrity: if you can't trust the data, immense effort and expense are lost
• confidentiality: the impact of potential breaches is vastly increased
• availability: input availability impacts value and velocity in particular
• privacy: impacted by any breach of data sets/clusters with PII
Hint: it’s not because Gartner said the space is worth up to $3.8 Trillion
The First 4 Vs: Where's the Privacy?
Source: Data Science Central
too valuable to discard?
Source: http://jeffhurtblog.com/2012/07/20/three-vs-of-big-data-as-applied-conferences/
Visualizing Privacy in Big Data
Are we there yet?
• Utah Data Center will handle yottabytes of data
• TB: 10^12 bytes; exa: 10^18; zetta: 10^21; yotta: 10^24
• Each Boeing jet engine creates 20 TB/hr
• Facebook grows by 500 TB/day
Ref: Visual.ly infographic on big data
Scale
• PB = 10^15: all printed material in existence in 1995, or ONE SECOND of data generated at CERN/LHC
• 1 exabyte: all data created on the Internet each day
• SKA will collect 'a few exabytes' of data each day, processing 10 PB/hr and producing up to 100x CERN's output each year
• Using terabyte drives, a yotta centre would be as large as the states of Delaware and Rhode Island
• Using SDXC cards, it would be as large as the Great Pyramid of Giza
• NASA already stores 32 PB of climate data
Not just about processing
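These figures are easy to sanity-check. A minimal Python sketch of the arithmetic, assuming the decimal prefixes defined above (TB = 10^12 bytes) and the per-slide figures quoted for the jet engine and Facebook:

```python
# Back-of-the-envelope checks of the scale claims above,
# using decimal prefixes: 1 TB = 10**12 bytes, 1 YB = 10**24 bytes.
TB, PB, EB, YB = 10**12, 10**15, 10**18, 10**24

# A "yotta centre" built from 1 TB drives needs a trillion of them.
print(f"{YB // TB:,} one-terabyte drives per yottabyte")

# One Boeing engine at 20 TB/hr over a 10-hour flight.
print(f"{20 * TB * 10 / PB:.1f} PB per engine per 10-hour flight")

# Days for Facebook, at 500 TB/day, to accumulate one exabyte.
print(f"{EB / (500 * TB):,.0f} days for Facebook to reach 1 EB")
```

Running it prints a trillion drives per yottabyte, 0.2 PB per flight, and 2,000 days to an exabyte, which is why storage alone is not the hard part.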
• ...except in the hands of the first indiscriminate seller and unscrupulous buyer
• unless all elements are preserved, the fear is that it will be more difficult to find gold
• and legislation may force companies to reveal what they hold and what they share
Data without analysis is useless
Ref: California’s Right to Know bill AB 1291 demands accountability from huge data hoarding firms
• Big Data in medicine often revolves around gene sequencing and biosamples, vital records, and insurance claims. Data use and reuse has created grave concerns about privacy and informed consent.
• geotagging and social media fuel the debate over big data privacy. Does informed mean consent?
• personalized searches are automatically recorded and mined in every way possible. Crowd wisdom and individual preferences create a climate of unease among web search users
Overanalyzed data creates concerns
Source: http://journalistsresource.org/studies/economics/business/what-big-data-research-roundup#
• Internet Census 2012 was an unauthorized deep dive into the security of all Internet-connected devices
• It infected vulnerable devices with a custom binary and used their processing power to expand the scale of scans
• It found over 35 million vulnerable devices and millions of others that should not be online at all
• The research data was analyzed and anonymously published online after the 2-year project was completed
• 9+ TB collected and analyzed; 52 billion ICMP pings, 180 billion service probes, 71 billion ports tested
Unauthorized Internet scale analysis
Source: http://internetcensus2012.bitbucket.org image: 420,000 Carna Botnet locations
• anticipatory systems like Google Now already have a positive impact on individual productivity
• crowdfunding has a global economic impact and an even bigger innovative footprint
• crowdsourcing assists with investigations and research, especially as small data is tapped
Observing people through shared data
• 20 million geolocated tweets during 4-day event
• grocery shopping peaks night before hurricane
• night life up after it
• Manhattan skew
• Impact area shows lowest tweet rates
• signal problem may undermine big data value
Social media & major event correlation
Source: Rutgers Twitter/Foursquare Sandy Study 2012 http://popa.ca/SandyBigData
• global data doubles every 2 years, but only 0.5% is ever analyzed
• strength comes from pooling the data, but value is in individualizing findings
• how can personal analytics be custom-fitted to benefit individuals without first impacting privacy?
Big data's promise is not in aggregation
• Hundreds of millions of devices vulnerable globally
• 95% unpatched and vulnerable
• 5% of those patched are still vulnerable to zero-days
• the study shows that 75% are at least 6 months behind
• but it also shows that the focus isn't just on one aspect: it's a massive systemic issue that was allowed to grow into a global threat
• Government issued a public recommendation to discontinue the use of Java because it is unsafe
A sobering look at the Java threat
Source: 2013 Websense http://popa.ca/JavaSecurityPie
“If you can crunch it, more data means better results. The caveats are that you get proportionately less information by volume, and quality tends to decrease over time.”
The Opportunity of Big Data
• Since data storage will reach a practical asymptotic maximum, we can distribute resources
• This will help with data quality, input filtering, metrics and statistics, layered privacy filtering, reporting filters, data siloing and segregation, neural net-style learning to maximize efficiency, etc.
Redefining organized, as in crime
• every year, hundreds of millions of records are siphoned from diverse databases globally
• SIN/SSNs, credit card data, home addresses all amount to one thing: identities
• The vast majority of that data has to date gone unexploited, likely due to analytic challenges
IBRF: ID business requirements 1st!
1. initial focus on customer-centric outcomes
2. enterprise-wide big data blueprint
3. get near-term results from existing data
4. build analytics capabilities on business priorities
5. create a business case on measurable outcomes
2012 IBM Big Data @Work Survey (1144 professionals in 95 countries/26 industries)
Who isn't already mining it?
The picture changes every minute
Responsible Visualization of Big Data
• Data discovery is ideally positioned to identify PII, create filters & create healthy correlations
• Data quality visuals are opportunities to form segments or hierarchies & create aggregate logic
• Storytelling often identifies outliers. Expertly narrate correlations and patterns, de-identify exceptions
Where to look for Privacy in Big Data
• Dashboards are used to present meaningful data and should be adequately tested for compliance
• Tools can be tailored to audiences and should be specialized to eliminate undesirable inferences
• Trends and predictions result from proper data analysis. This is where meaning becomes evident
“Hyped indiscriminately and handled inappropriately, big data analytics can be more of a liability than an opportunity to derive rich information through intelligent refinement.”
The Challenge of Big Data
• Does logging in with social media accounts constitute consent?
• Will aggregation and data masking still lead to personally identifiable information?
• How can data privacy filtering be guaranteed?
Big Data example: the Click dataset
• Objective: “To study the structure and dynamics of Web traffic networks”
• 53.5-billion-click anonymized dataset @IndianaU
• Data collected includes referrer, timestamp, URL
• Does sanitization == anonymization == privacy? (sketched below)
Source: CNetS: http://cnets.indiana.edu
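The sanitization question above is worth making concrete. A toy Python sketch, using invented records rather than anything from the actual Click dataset, shows how a log stripped of user IDs can still single people out through quasi-identifiers:

```python
# Toy illustration (invented records, not the actual Click dataset):
# even with user IDs stripped, combinations of remaining fields such as
# (referrer, timestamp) can be unique enough to act as fingerprints.
from collections import Counter

clicks = [
    ("news.example.com", "2013-04-01T09:15:02", "/article/42"),
    ("news.example.com", "2013-04-01T09:15:02", "/article/42"),
    ("blog.example.org", "2013-04-01T09:15:03", "/post/7"),
    ("shop.example.net", "2013-04-01T09:16:11", "/item/991"),
]

# Count how many records share each (referrer, timestamp) quasi-identifier.
counts = Counter((referrer, ts) for referrer, ts, _url in clicks)
unique = sum(1 for c in counts.values() if c == 1)
print(f"{unique} of {len(counts)} quasi-identifier tuples are unique")
# A tuple observed only once can potentially be linked to one person,
# so on these fields the data is not k-anonymous for any k > 1.
```

Sanitization removed the obvious identifiers, yet two of the three quasi-identifier tuples remain unique; that gap is exactly where anonymization and privacy part ways.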
Open, Distributed DIY Big Data Tools
• D3: Data-Driven Documents
• GitHub | SourceForge
• Hadoop/MapReduce (pattern sketched below)
• Amazon cloud
• Open source grids
• Mechanical Turk?
Source: Wikipedia Recent Changes Map and wikistream projects
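To ground the Hadoop/MapReduce entry, here is a minimal single-machine sketch of the programming model that Hadoop distributes across a cluster, using the canonical word count. It illustrates the map/shuffle/reduce pattern only, not Hadoop's actual API:

```python
# Minimal, single-process sketch of the MapReduce pattern that Hadoop
# distributes across a cluster: map records to key/value pairs, group
# by key (the "shuffle"), then reduce each group. Pure Python, no Hadoop.
from collections import defaultdict

def map_phase(record):
    # Emit one (word, 1) pair per word in the input record.
    for word in record.split():
        yield word.lower(), 1

def reduce_phase(key, values):
    # Combine all counts emitted for the same key.
    return key, sum(values)

records = ["big data is big", "data about data"]

# Shuffle: group mapped values by key.
groups = defaultdict(list)
for record in records:
    for key, value in map_phase(record):
        groups[key].append(value)

results = dict(reduce_phase(k, v) for k, v in groups.items())
print(results)  # {'big': 2, 'data': 3, 'is': 1, 'about': 1}
```

Because map and reduce are pure functions over independent keys, the same code shape scales from this toy loop to thousands of nodes, which is what makes the tools above "DIY" in the first place.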
Canadian Big Data: The Source
• Detailed metrics showed a preference for high-end products
• A move away from $150 items to $650 ones increased sales 40% in high-end electronics
• Notorious for overcollection, The Source actually does 'the consent bit' adequately well
Privacy enjoys safe sets
• intrinsic safety in: meteorology, environmental, physics, astronomy and other sciences
• innate risk in: connectomics, biological, Internet, behavioural and sensory data sets
“We simply cannot afford to entertain the notion that the proliferation of scattered data sources is the last bastion of privacy protection.”
• Build privacy into the input data sets
• Use simple filtering for large data sets & output
• Build algorithms to ensure de-identification is irreversible
• Try to break it! (a sketch follows below)
Technical solution of Big Data privacy
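Below is a minimal sketch of that build/filter/break loop, with hypothetical field names; the deck prescribes the goals, not this particular technique. It uses a keyed hash (HMAC) so pseudonyms cannot be linked back without the secret key; destroying the key after ingestion makes the mapping effectively irreversible:

```python
# Hypothetical illustration: pseudonymize PII fields on the way into a
# data set, so downstream analytics never see raw identifiers.
import hashlib
import hmac
import secrets

SECRET_KEY = secrets.token_bytes(32)  # destroy after ingestion for irreversibility

def pseudonymize(value: str) -> str:
    """Keyed hash: stable within one run, unlinkable without the key."""
    return hmac.new(SECRET_KEY, value.encode(), hashlib.sha256).hexdigest()[:16]

record = {"sin": "123-456-789", "postal": "M5V 1J1", "purchase": "laptop"}
PII_FIELDS = {"sin", "postal"}
safe = {k: pseudonymize(v) if k in PII_FIELDS else v for k, v in record.items()}
print(safe)  # identifiers replaced, analytic fields intact

# "Try to break it": a plain, unkeyed hash of a low-entropy field fails
# this test, because an attacker can enumerate all ~10**9 possible SINs
# offline and match them. The unknown HMAC key defeats that dictionary
# attack, and destroying the key removes even the custodian's ability
# to reverse the mapping.
```

The design choice worth noting: plain hashing of identifiers is filtering, not irreversibility; only the keyed step, followed by key destruction, satisfies the third bullet above.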
Privacy must be tack[l]ed [head-]on
“Data with the potential of being personally identifiable should be treated with the same veracity as dirty input.”
7 Steps to Building your own Healthy Information Ecosystem
1. Articulate your vision
2. Put your stop orders in place
3. Assign roles and accountabilities
4. Create processes to manage it
5. Build controls and code standards from the bottom up
6. Prioritize data ownership, integrity and classification
7. Implement layered and automated audits
Big Data Leadership
• Embrace openness, build on what works
• Adopt standards for process and technology
• Draft progressive legislation (Model CASL, PCI and even CP laws)
• Encourage awareness, promote accountability
• Applaud and showcase responsible innovation
• Put forward important notions of information life cycle, data ownership, and privacy compliance
Big Data Links
• Free course: http://popa.ca/BigDataCourse (Coursera)
• More from http://bigdatauniversity.com/courses/
• This presentation: http://linkedIn.ClaudiuPopa.com
• Visualization gallery: http://datavis.ca
• InformationisBeautiful.net + Awards
• HowBigReally.com / HowManyReally.com
• The Signal and the Noise (Nate Silver @ amazon.ca)
• Sharing data mining policies
• Demonstrating fair use using audits
• Caring about the purveyor spectrum
• When it gets easy, it may be too late
Big Data open discussion
Follow Twitter.ClaudiuPopa.com
Read Subscribe.ClaudiuPopa.com
Connect LinkedIn.ClaudiuPopa.com

Editor's Notes

  • #2 Let's Get Real: The Real Impacts of Big Data. Claudiu Popa, CIPP/US, President, Informatica Corporation. In this session, we will illustrate, using clear examples, what big data means for privacy protection, compliance and, ultimately, individuals. We'll address the concerns over data aggregation from multiple legitimate sources, along with ways to design adequate controls to protect personal information by segregating it into siloed environments, logical and physical boundaries and policy enforcement. Real-world solutions are provided for what is rapidly becoming a standard way of distilling ever-more information from a universe of sources, including cloud-based services and a diverse array of mobile platforms. What you'll take away: security concepts of data aggregation; privacy concepts of interjurisdictional data transfer; technology concepts based on the use of diverse mobile devices.
  • #3 He's a remarkable achievement. In 2350 or thereabouts, this robot will be able to... except that we've already achieved all those milestones. His positronic net was capable of processing sixty trillion operations per second (teraflops were already exceeded by the end of the '90s; we are now at hundreds of petaflops, heading for 132 exaflops by 2016, i.e. 132 quintillion mathematical operations per second, a quintillion being a 1 followed by 18 zeros) and had a storage capacity of eight hundred quadrillion bits, which is approximately one hundred petabytes (88.81784197 PiB). He was constructed circa 2336 on the planet Omicron Theta. The pebibyte is a standards-based binary multiple (prefix pebi, symbol Pi) of the byte, a unit of digital information storage. The pebibyte unit symbol is PiB.[1] 1 pebibyte = 2^50 bytes = 1,125,899,906,842,624 bytes = 1024 tebibytes. The pebibyte is closely related to the petabyte, which is defined as 10^15 bytes = 1,000,000,000,000,000 bytes (see binary prefix).
  • #4 Every day, we create 2.5 quintillion bytes of data — so much that 90% of the data in the world today has been created in the last two years alone. This data comes from everywhere: sensors used to gather climate information, posts to social media sites, digital pictures and videos, purchase transaction records, and cell phone GPS signals to name a few. This data is big data.
  • #5 Big data[1][2] is a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications. The challenges include capture, curation, storage,[3] search, sharing, transfer, analysis,[4] and visualization. The trend to larger data sets is due to the additional information derivable from analysis of a single large set of related data, as compared to separate smaller sets with the same total amount of data, allowing correlations to be found to "spot business trends, determine quality of research, prevent diseases, link legal citations, combat crime, and determine real-time roadway traffic conditions."[5][6][7]
  • #6 The premise is that by accumulating an abundance of data, we have the option of cherry picking what we want, systematizing the approach and discarding the rest in an effort to predict an outcome with a high degree of accuracy. The data revolution will inevitably become personal. The sick thing is that the question is always going to be: do we have enough? How much is enough? Can we add more? Is big data a recipe for data addiction? Because if it is, it is a definite path to abuse. But before we embark on that discussion, it makes sense to ask, what is ‘small data’ right?At what point does the data get anonymized? What does it mean? Can you still uniquely identify people based on certain parameters?