What should organizations be concerned about when using Machine Learning for predictive modeling? Divergence Academy and Divergence.AI are leading efforts to bring Algorithmic Accountability awareness to the masses.
2. Is Fake News a Well-defined Machine Learning Problem?
3. SRAVAN ANKARAJU
FOUNDER & PRESIDENT
• Technology leader focused on Strategy & Innovation, Risk and Decision Management
• 13.5 years with Microsoft in Technology Integration consulting and Developer Support, with a focus on DevOps & Agile development
• Big-picture educator: start with concepts, then play to learn advanced areas. Iterate, and iterate often.
• Implementer of gamification systems in learning and training
• Experienced in high-volume, high-performance transactional systems
• Data analytics for various Fortune 100 companies
6. 3 TRENDS SHAPING MACHINE LEARNING IN 2017
1. The Algorithm Economy is on the rise
2. Expect more interaction between machines and humans
3. Giant companies will develop ML-based AI systems
Source: http://www.datasciencecentral.com/profiles/blogs/trends-shaping-machine-learning-in-2017
9. AGENDA
• What's the catch if there is a ton of goodness in AI-based systems?
• When do you get a human involved?
• Global companies & the importance of May 25, 2018
• The Privacy Paradox & distinctive aspects of Big Data
• Data Science Ethics Framework
• Where do you go from here / what you can do today
PUTTING DATA SCIENCE INTO PERSPECTIVE
11. To study possibly racist algorithms, professors have to sue the US
http://arstechnica.com/tech-policy/2016/06/do-housing-jobs-sites-have-racist-algorithms-academics-sue-to-find-out/
15. QUALITY ASSURANCE
When does securing AI against attacks or reverse-engineering become more of an issue?

"It's an issue now. One of my biggest learnings from [chatbot] Tay was that you need to build even AI that is resilient to attacks. It was fascinating to see what happened on Twitter, but for instance we didn't face the same thing in China. Just the social conversation in China is different, and if you put it in the US corpus it's different. Then, of course, there was a concerted attack. Just like you build software today that is resilient to a DDoS (Distributed Denial of Service) attack, you need to be able to be resilient to a corpus attack, that tries to pollute the corpus so that you pick up the wrong thing in your AI learners."
- Satya Nadella, Microsoft CEO
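The corpus-attack analogy above can be sketched in code. The filter below is a hypothetical illustration, not Microsoft's actual defense: it screens candidate training messages against a blocklist and a per-user repetition limit before they are allowed into a chatbot's learning corpus.

```python
# Hypothetical sketch of hardening a learning corpus against "corpus
# poisoning". CorpusGuard and toxic_terms are invented names, not a real API.
from collections import Counter

class CorpusGuard:
    def __init__(self, toxic_terms, burst_threshold=3):
        self.toxic_terms = set(toxic_terms)
        self.burst_threshold = burst_threshold  # max repeats of one message per user
        self.seen = Counter()

    def accept(self, user, message):
        text = message.lower()
        # Rule 1: reject messages containing known toxic vocabulary.
        if any(term in text for term in self.toxic_terms):
            return False
        # Rule 2: reject bursts of near-identical messages from one user,
        # a crude signal of a concerted attack.
        self.seen[(user, text)] += 1
        if self.seen[(user, text)] > self.burst_threshold:
            return False
        return True

guard = CorpusGuard(toxic_terms={"slur"}, burst_threshold=2)
print(guard.accept("u1", "hello there"))   # True
print(guard.accept("u2", "a slur here"))   # False
```

Real deployments would combine many more signals, but the structural point stands: filtering happens before learning, just as DDoS mitigation happens before a request reaches the application.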
16. HUMAN INTERVENTION
"Whenever you have ambiguity and errors, you need to think about how you put the human in the loop and escalate to the human to make choices. That is the art form of an AI product. If you have ambiguity and error rates, you have to be able to handle exceptions. But first you have to detect that exception, and luckily enough in AI you have confidence and probability and distribution, so you have to use all of those to get the human in the loop."
GOVERNANCE
- Satya Nadella, Microsoft CEO
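The escalation pattern described above can be sketched as a simple routing function. The 0.8 confidence threshold is an illustrative assumption, not a prescribed value.

```python
# Minimal sketch of human-in-the-loop escalation: use the model's
# confidence to detect ambiguous cases and route them to a human reviewer.
def route(prediction, confidence, threshold=0.8):
    """Return ('auto', label) for confident predictions,
    ('human', label) when the model should escalate."""
    if confidence >= threshold:
        return ("auto", prediction)
    return ("human", prediction)

print(route("approve", 0.95))  # ('auto', 'approve')
print(route("deny", 0.55))     # ('human', 'deny')
```

In practice the threshold would be tuned against the cost of errors versus the cost of review, but the shape of the decision is this simple.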
18. GDPR
Stricter rules will apply to the collection and use of personal data. The regulation applies from May 25, 2018.
19. GENERAL DATA PROTECTION REGULATION
Operational Impacts
1. Mandatory Data-Breach Notification
2. Privacy Impact Assessments
3. Right to be Forgotten
4. Privacy by Design and Default
5. Mandatory Data Protection Officers
20. MANDATORY DATA-BREACH NOTIFICATION
• Companies that experience data breaches will need to notify regulators and individuals whose personal data was compromised.
• Companies will most likely want to avoid the negative publicity of these disclosures. Multinationals will gradually ramp up:
  • Comprehensive risk assessments
  • End-to-end security enhancements
  • Outsourced managed security services
General Data Protection Regulation: Operational Impact
21. PRIVACY IMPACT ASSESSMENTS
• Requires companies to conduct data protection impact assessments (DPIAs) where their data processing operations are highly invasive.
• These include marketing activities based on advanced profiling and analytics.
• Privacy operations may need to extend beyond the legal office, where they have traditionally resided, into the day-to-day processes of European businesses.
22. RIGHT TO BE FORGOTTEN
• The right to erasure could impose a significant burden on companies with personal data stored across multiple systems.
• Companies may need to:
  • Maintain comprehensive data inventories
  • Accelerate data-governance strategies
  • Potentially re-architect key systems to process right-to-erasure requests more efficiently
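As a sketch of the data-inventory idea above, an erasure request can fan out across every system that holds a subject's records. All system and subject names here are invented for illustration.

```python
# Hypothetical data inventory mapping each data subject to the systems
# holding their records, so a right-to-erasure request reaches all of them.
inventory = {
    "alice@example.com": {"crm", "billing", "analytics_warehouse"},
    "bob@example.com": {"crm"},
}

def process_erasure_request(subject, inventory, erase_fn):
    """Erase `subject` from every system listed in the inventory;
    returns the set of systems purged."""
    systems = inventory.pop(subject, set())
    for system in systems:
        erase_fn(system, subject)   # e.g. issue a delete against that system
    return systems

purged = []
process_erasure_request("alice@example.com",
                        inventory,
                        lambda sys, subj: purged.append(sys))
print(sorted(purged))  # ['analytics_warehouse', 'billing', 'crm']
```

Without such an inventory, the request cannot even be scoped, which is why the slide lists inventories before re-architecture.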
23. PRIVACY BY DESIGN AND DEFAULT
• Privacy-friendly settings and postures, such as limits on how personal information is collected, retained, and shared, will be built into new products, devices, and business processes.
• The flip side of DPIAs: privacy-by-design requirements may give rise to a need for privacy engineers to embed privacy features throughout the daily operations of their businesses.
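A minimal sketch of privacy by default, with invented field names: a new account starts in the most protective posture, and any loosening is an explicit opt-in.

```python
# Illustrative "privacy by default" settings object: every default is the
# most protective choice; sharing and visibility require explicit opt-in.
from dataclasses import dataclass

@dataclass
class PrivacySettings:
    share_with_partners: bool = False   # default: no third-party sharing
    retain_days: int = 30               # default: shortest retention period
    public_profile: bool = False        # default: not publicly visible

settings = PrivacySettings()            # defaults are the private posture
print(settings.share_with_partners)     # False
settings.public_profile = True          # user explicitly opts in
```

The design choice is that engineers never have to remember to turn privacy on; they can only deliberately turn pieces of it off.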
24. MANDATORY DATA PROTECTION OFFICERS
• Requires large companies to appoint data protection officers (DPOs) if their core activities consist of large-scale, systematic monitoring of people.
• DPOs will have to exhibit expertise in technology, business processes, and project and program management, including risk assessment and compliance monitoring.
• This talent is in short supply.
26. PRIVACY PARADOX
The price of using internet services
• People may express concerns about the impact on their privacy of 'creepy' uses of their data, but in practice they contribute their data anyway via the online systems they use. In other words, they provide the data because it is the price of using internet services.
RIGHT TO MY IDENTITY
Microsoft's 2015 Digital Trends report noted a trend called Right to My Identity: rather than simply wishing to preserve privacy through anonymity, a significant percentage of global consumers now want to control how long information they have shared stays online, and are also interested in services that help them manage their digital identity. This suggests consumers have rising expectations of how organizations will use their data, and want to be able to influence that use.
28. An organization wants to use data generated in different regulatory environments to learn about its customers or to predict their behavior. Some customers are in Germany, some are in Switzerland, and others are in the U.S. and Canada. How can a data scientist get the most out of this data without breaking the law, when each country has its own regulations on what he or she can do with the data?
29. GDPR CONTROVERSY
Data Portability Legalese
"Where the data subject has provided the personal data and the processing is based on consent or on a contract, the data subject shall have the right to transmit those personal data and any other information provided by the data subject and retained by an automated processing system, into another one, in an electronic format which is commonly used, without hindrance from the controller from whom the personal data are withdrawn."
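The portability requirement above can be illustrated with a small export sketch. The record layout is invented; the regulation demands only "an electronic format which is commonly used", for which JSON is one reasonable choice.

```python
# Hedged sketch of data portability: exporting the personal data held for
# one data subject in a commonly used electronic format (JSON).
import json

def export_personal_data(subject_id, records):
    """Serialize all records for one data subject as portable JSON."""
    payload = {
        "subject": subject_id,
        "records": [r for r in records if r["subject"] == subject_id],
        "format": "json",
    }
    return json.dumps(payload, indent=2)

records = [
    {"subject": "alice", "type": "order", "item": "book"},
    {"subject": "bob", "type": "order", "item": "lamp"},
]
print(export_personal_data("alice", records))
```

The controversial part is not the serialization but the obligation to hand the result to a competitor "without hindrance".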
30. DISTINCTIVE ASPECTS OF BIG DATA ANALYTICS
Potential implications for data protection
1. Use of algorithms
2. Opacity of the processing
3. Tendency to collect all the data
4. Repurposing of data
5. Use of new types of data
31. #1 USE OF ALGORITHMS
• Thinking with data: finding correlations as the system learns
• Acting with data: applying the model to a particular case in the application phase
Unpredictability by Design
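The two phases can be sketched with a deliberately simple learner; the churn data and the threshold rule are invented for illustration.

```python
# "Thinking with data": learn a correlation from labeled examples.
# "Acting with data": apply the learned rule to one particular case.
def think(labeled_spend):
    """Learning phase: find the spend threshold separating churners
    from stayers. labeled_spend is a list of (spend, churned) pairs."""
    churned = [s for s, c in labeled_spend if c]
    stayed = [s for s, c in labeled_spend if not c]
    return (max(churned) + min(stayed)) / 2

def act(model_threshold, spend):
    """Application phase: decide for one particular customer."""
    return "likely_churn" if spend <= model_threshold else "likely_stay"

threshold = think([(10, True), (20, True), (80, False), (90, False)])
print(act(threshold, 15))  # likely_churn
```

The unpredictability by design comes in because the rule in `think` is not written by a programmer; it is whatever the data happens to yield.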
32. #2 OPACITY OF THE PROCESSING
The ‘Black Box’ effect
• Deep learning involves feeding vast quantities of data through non-linear neural networks that classify the data based on the outputs from each successive layer.
• The complexity of processing data through such massive networks creates a 'black box' effect.
• This makes it very difficult to understand the reasons for decisions made as a result of deep learning.
33. #3 USING ALL THE DATA
n=all
In a retail context, it could mean analyzing all the purchases made by shoppers using a loyalty card, and using this to find correlations, rather than asking a sample of shoppers to take part in a survey.
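The n=all idea can be sketched by aggregating every basket rather than sampling; the transaction data below is invented for illustration.

```python
# Sketch of n=all: count co-purchases across ALL loyalty-card baskets
# instead of surveying a sample of shoppers.
from collections import Counter
from itertools import combinations

transactions = [
    {"bread", "butter"},
    {"bread", "butter", "jam"},
    {"milk", "bread"},
]

# How often is each pair of items bought together, across every basket?
pair_counts = Counter()
for basket in transactions:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

print(pair_counts.most_common(1))  # [(('bread', 'butter'), 2)]
```

The data-protection point is that this uses every customer's behavior, not just that of customers who agreed to take part in a study.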
34. #4 REPURPOSING DATA
Different than the original intent
• Geolocated Twitter data used to infer people's residence and mobility patterns, to supplement official population estimates.
• Geotagged photos on Flickr, together with the profiles of contributors, used as a reliable proxy for estimating visitor numbers at tourist sites and where the visitors have come from.
• Mobile-phone presence data used to analyze foot traffic into retail centers.
• Data about where shoppers go used to plan advertising campaigns.
• Data about patterns of movement in an airport used to set the rents for shops and restaurants.
35. #5 NEW TYPES OF DATA
Tracking without permission
• Developments in technology such as the IoT mean that the traditional scenario, in which people consciously provide their personal data, is no longer the only or main way in which personal data is collected.
• Data is increasingly gathered by observation, for example by tracking online activity, rather than being consciously provided by individuals. Researchers have investigated the possibility of using data from domestic smart meters to predict the number of people in a household and whether they include children or older people.
38. ALGORITHMIC ACCOUNTABILITY
Five Principles

RESPONSIBILITY
There needs to be a person with the authority to deal with an algorithm's adverse individual or societal effects in a timely fashion. This is not a statement about legal responsibility but, rather, a focus on avenues for redress, public dialogue, and internal authority for change.

EXPLAINABILITY
Any decisions produced by an algorithmic system should be explainable to the people affected by those decisions. These explanations must be accessible and understandable to the target audience; purely technical descriptions are not appropriate for the general public.

ACCURACY
Algorithms make mistakes, whether because of data errors in their inputs (garbage in, garbage out) or statistical uncertainty in their outputs. The principle of accuracy suggests that sources of error and uncertainty throughout an algorithm and its data sources need to be identified, logged, and benchmarked. Understanding the nature of errors produced by an algorithmic system can inform mitigation procedures.

AUDITABILITY
The principle of auditability states that algorithms should be developed to enable third parties to probe and review the behavior of an algorithm. Enabling algorithms to be monitored, checked, and criticized would lead to more conscious design and course correction in the event of failure.

FAIRNESS
As algorithms increasingly make decisions based on historical and societal data, existing biases and historically discriminatory human decisions risk being "baked in" to automated decisions. All algorithms making decisions about individuals should be evaluated for discriminatory effects. The results of the evaluation and the criteria used should be publicly released and explained.

Source: https://www.technologyreview.com/s/602933/how-to-hold-algorithms-accountable/
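One concrete fairness check, offered as a sketch rather than a mandated procedure, is to compare favorable-decision rates across groups. The four-fifths (0.8) ratio used below is a common benchmark from US employment practice, not part of these principles; the group labels and decisions are invented.

```python
# Sketch of a disparate-impact check: compare the rate of favorable
# decisions between groups and flag ratios below the four-fifths benchmark.
def selection_rates(decisions):
    """decisions: list of (group, favorable) pairs -> rate per group."""
    totals, favorable = {}, {}
    for group, fav in decisions:
        totals[group] = totals.get(group, 0) + 1
        favorable[group] = favorable.get(group, 0) + int(fav)
    return {g: favorable[g] / totals[g] for g in totals}

def disparate_impact_ratio(rates):
    """Ratio of the lowest to the highest group selection rate."""
    return min(rates.values()) / max(rates.values())

rates = selection_rates([("a", True), ("a", True), ("a", False),
                         ("b", True), ("b", False), ("b", False)])
print(disparate_impact_ratio(rates) < 0.8)  # True -> disparity worth review
```

Such a check is only a starting point; the fairness principle above also demands that the criteria and results be published and explained.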
43. GDPR PREPARATION
"No legislation rivals the potential global impact of the EU's General Data Protection Regulation (GDPR), going into effect in May 2018. The new law will usher in cascading privacy demands that will require a renewed focus on data privacy for US companies that offer goods and services to EU citizens," said Jay Cline, PwC's US Privacy Leader. "Businesses that do not comply with GDPR face a potential fine of 4% of global revenues, increasing the need to successfully navigate how to plan for and implement the necessary changes."
Source - http://www.pwc.com/us/en/press-releases/2017/pwc-gdpr-compliance-press-release.html
TOP INITIATIVES
1. Information Security
2. Privacy Policies
3. Gap Assessment
4. Data Discovery
45. FactGem is a platform that allows users to build their own visualization and analysis applications on top of Neo4j without needing to learn another programming language. FactGem makes data analysis accessible to everyone, whether they are seasoned data scientists or completely new to data science.

Through the integration of the two platforms, users can access regulated data without the risk of violating policies. This lets users gain insight into data without having to write code, request data-engineering support, or worry about repercussions for failing to attach policies to data. The result is a dramatic acceleration of innovation across teams, as the joint solution provides an end-to-end self-service mechanism for analysts to exploit the most important data within an organization.
46. BLOCKCHAIN IMPLEMENTATION
INTERNET OF EVERYTHING NEEDS LEDGER OF EVERYTHING
1. DECENTRALIZED (shared control)
2. TRUSTED (immutability / audit trail)
3. PUBLIC (tokens / exchanges)
Algorithmic Law and Blockchain Enabled Automation
47. RESOURCES
- Machine Learning: The High-interest Credit Card of Technical Debt
- Attacking discrimination with smarter machine learning
- Rules of Machine Learning [43 rules]