Big data and algorithms can have unintended negative consequences if not implemented carefully. One example showed how an algorithm intended to improve student performance by evaluating teachers led to the firing of a highly regarded teacher because of flaws in the model. Proper models require large, coherent datasets and feedback mechanisms to avoid harming individuals. "Weapons of math destruction" are algorithms that amplify inequalities through unexamined biases and a lack of transparency.
Maintaining high quality user generated content through machine learning - Nikhil Dandekar
The talk I gave on using Machine Learning to solve quality problems at Quora. This was a part of the "Be Nice, Be Respectful: Protecting online spaces with applied machine learning" workshop at Quora in September 2017
Slides for the 2016/2017 edition of the Data Mining and Text Mining Course at the Politecnico di Milano. The course is also part of the joint program with the University of Illinois at Chicago.
User behaviour analysis (UBA) is the set of methods, techniques, and mindset for collecting, combining, and analysing quantitative and qualitative user data to understand how users interact with a product, and why. Among those data points are anomalies: points that stand out and that often contain a wealth of indications and signals valuable to the product and the business in general.
In this talk I move from general UBA to more specific anomalous cases, and in particular to some cases of fraud and anti-money laundering (AML), along with some existing ML methods and discussion around them.
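The "points that stand out" idea from the UBA description above can be sketched with a simple z-score rule; the daily-login data and threshold below are hypothetical, and real UBA pipelines use far richer methods:

```python
def zscore_anomalies(values, threshold=2.0):
    """Flag points more than `threshold` standard deviations from the mean."""
    n = len(values)
    mean = sum(values) / n
    std = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    return [v for v in values if std and abs(v - mean) / std > threshold]

# daily logins for one user; the spike stands out as a possible anomaly
daily_logins = [10, 11, 9, 10, 12, 10, 95]
print(zscore_anomalies(daily_logins))  # → [95]
```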
5.6 Data, Decisions, and Documentation: How to Avoid Doom in Managing your CRM - TargetX
1) The document discusses how to effectively manage a CRM system by focusing on data, documentation, and decisions.
2) It provides examples of issues that occurred at Freed-Hardeman University due to a lack of documentation and defined processes, such as dual enrollment students being confused by application questions.
3) The key recommendations are to create documentation mapping out data fields and processes, document all settings, workflows, and changes, and make decisions with the longevity of the system, students, and end users in mind.
This document provides an introduction to big data and data science concepts. It discusses how data is now plentiful and inexpensive to store compared to the past. It outlines some of the challenges of big data, such as ingesting, organizing, and interpreting large datasets, as well as overfitting. Machine learning models discussed include neural networks, convolutional neural networks, and Word2Vec for natural language processing. The document also gives an overview of key statistical concepts in evaluating models, such as training, validation, and testing, and comparing different performance metrics.
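The train/validate/test workflow mentioned in that summary can be sketched in a few lines; the 60/20/20 ratios are just a common convention, not something prescribed by the deck:

```python
import random

def split(data, train_frac=0.6, val_frac=0.2, seed=0):
    """Shuffle and partition data into train/validation/test sets."""
    data = list(data)
    random.Random(seed).shuffle(data)  # fixed seed keeps the split reproducible
    n = len(data)
    i = int(n * train_frac)
    j = int(n * (train_frac + val_frac))
    return data[:i], data[i:j], data[j:]

train, val, test = split(range(100))
print(len(train), len(val), len(test))  # → 60 20 20
```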
Better Living Through Analytics - Louis Cialdella, Product School
What does a successful partnership between product and analytics teams look like? What can analysts do to ensure a successful partnership with other teams? Some strategies and tips from my work at ZipRecruiter.
When recommendation systems go bad - Machine Eatable - Evan Estola
Latest version of my talk on ethics in Machine Learning and Recommendation Systems, given at Microsoft's Machine Eatable series at Civic Hall on 3/31/17
The document summarizes Eddie Lin's work in data science for social good. It discusses his participation in the 2016 Data Science for Social Good Summer Fellowship at the University of Chicago, and his current work at DSaPP, which uses data and machine learning to help solve social problems. It outlines common machine learning tasks and how they are similar to concepts learned in kindergarten. It also describes typical social good project categories and emphasizes open source tools.
The information in these slides was presented on February 13, 2018 during PETE&C 2018 in Hershey, PA by Louise Maine, K12 team member for The Source for Learning, Inc.
The current trend is to focus on STEM and Coding. However, focusing on Digital Age Problem Solving in all content areas instead requires students to think critically, systematically, and logically to become digital problem solvers. Learn about Design Thinking, Data Literacy, and Computational Thinking, and find ways to use them in any classroom.
The speaker discusses the importance of evaluating big data analysis to improve projects. They recommend getting a second opinion on methodology or using another data source for verification. A case study on estimating traffic congestion is presented where ground truth sensor data was collected and compared to estimates using metrics to provide feedback. Lessons include ground truth not being easy to obtain, using the right tools like Python and pandas for evaluation, and having an iterative workflow for timely feedback.
This talk was given at the 11th meeting on April 7, 2014, by Karolina Alexiou.
Analysis of big data is useless (and a lot harder to sell) when you can't measure whether the resulting insights are correct. In order to develop sophisticated data analysis methodologies tailored to your particular use case, you need to be able to figure out what works and what doesn't. It is crucial to gather data independently of your analysis (ground truth) and compare it to your results using the correct metrics, while accounting for biases. The sheer volume of data means that you also need a strategy for slicing and dicing the data to isolate the really valuable parts, and a keen eye for visualization so that you can quickly compare methodologies and support the validity of your insights to third parties.
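Comparing estimates against independently gathered ground truth, as the talk advocates, boils down to picking an error metric. Here is a minimal sketch with made-up traffic numbers; MAE and RMSE are standard choices, though the talk does not prescribe specific metrics:

```python
import math

def mae(truth, est):
    """Mean absolute error between ground truth and estimates."""
    return sum(abs(t - e) for t, e in zip(truth, est)) / len(truth)

def rmse(truth, est):
    """Root mean squared error; penalizes large misses more than MAE."""
    return math.sqrt(sum((t - e) ** 2 for t, e in zip(truth, est)) / len(truth))

truth = [100, 150, 120, 90]   # hypothetical vehicle counts from ground sensors
est   = [110, 140, 130, 85]   # hypothetical estimates from the analysis
print(mae(truth, est))        # → 8.75
```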
This document provides an overview of machine learning concepts and example algorithms. It discusses how machine learning systems can learn from experience without explicit programming. It then covers classification and regression problems and provides examples of random forests and Gaussian processes algorithms. The document also discusses feature learning with examples of autoencoders and PCA. Finally, it discusses practical considerations for applying machine learning, including the importance of data quality, data pipelines, managing error risk, and institutionalizing machine learning applications.
Machine-Learning-Overview: a statistical approach - Ajit Ghodke
This document provides an overview of machine learning concepts including what machine learning is, common machine learning tasks like fraud detection and recommendation engines, and different machine learning techniques like supervised and unsupervised learning. It discusses neural networks and deep learning, and explains the machine learning process from data acquisition to model deployment. It also covers important concepts for evaluating machine learning models like overfitting, accuracy, recall, precision, F1 score, confusion matrices, and regression metrics like mean absolute error, mean squared error and root mean squared error.
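The evaluation metrics listed in that summary are simple to compute from confusion-matrix counts; a minimal sketch (the counts are made up for illustration):

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# e.g. a fraud detector with 8 true positives, 2 false positives, 4 false negatives
p, r, f1 = precision_recall_f1(tp=8, fp=2, fn=4)
print(round(p, 3), round(r, 3), round(f1, 3))  # → 0.8 0.667 0.727
```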
My slides for my talk regarding machine learning and data science. Includes working examples with accompanying repo with reproducible code and data sets available.
1. Supervised learning is used to predict outcomes from inputs using input-output examples provided by a supervisor to train a model.
2. Unsupervised learning is used to extract knowledge from input data without supervision by looking for patterns in the data.
3. A decision tree model can be built using a student performance dataset to predict academic performance of new students based on attributes like study time and absences. The machine learning software builds the decision tree by analyzing the training data.
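A real tree learner recurses over several attributes, but the core idea in point 3 — pick the split on the training data that best separates the labels — fits in a few lines. This is a one-level sketch with a made-up study-time attribute, not the actual dataset or software the summary refers to:

```python
def fit_stump(hours, passed):
    """Find the study-time threshold that misclassifies the fewest students."""
    best_t, best_errs = None, None
    for t in sorted(set(hours)):
        errs = sum((h >= t) != y for h, y in zip(hours, passed))
        if best_errs is None or errs < best_errs:
            best_t, best_errs = t, errs
    return best_t

# hypothetical training data: weekly study hours and whether the student passed
hours  = [2, 3, 5, 8, 10, 12]
passed = [False, False, False, True, True, True]
print(fit_stump(hours, passed))  # → 8  (predict "pass" when study time >= 8)
```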
In this presentation I list and try to answer some useful questions about machine learning, and large-scale machine learning in particular.
I talk about things like what we can and cannot do with ML, do I need a cluster for large-scale ML, what are common problems with ML systems and future directions.
How AI will change the way you help students succeed - SchooLinks - Katie Fang
In this presentation, we are going to uncover
1) Why there's so much hype about AI/Machine Learning (and what these things really are)
2) Whirlwind tour of machine learning/statistics techniques and what they mean for counselors
3) Optimism for what the future brings - data as your friend rather than something to be managed.
JupyterCon 2018: Diversity Analytics & OSS Adventures - Holden Karau
Many of us believe that gender diversity in open source projects is important (for example, O’Reilly, Google, and the Python Software Foundation). (If you don’t, this isn’t going to convince you.) But what things are correlated with improved gender diversity, and what can we learn from similar historic industries?
Holden Karau and Matt Hunt explore the diversity of different projects, examine historic EEOC complaints, and detail parallels and historic solutions. To keep things interesting, Holden and Matt conclude with a comparative analysis of the state of OSS and various complaints handled by the EEOC in the ’60s, along with the solutions, suggestions, and binding settlements that were reached for similar diversity problems in other industries. This comparison is not legal advice but rather examples of what we can learn from early equal opportunity commission decisions.
Topics include:
Diversity of gender among the different levels of a given project’s leadership (committers, PMC, etc.)
The existence of codes of conduct
Language used in comments, code, and mailing lists
The rate of promotions for project participants
Online Machine Learning: introduction and examples - Felipe
In this talk I introduce the topic of Online Machine Learning, which deals with techniques for doing machine learning in an online setting, i.e. where you train your model a few examples at a time, rather than using the full dataset (off-line learning).
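The one-example-at-a-time idea described above can be illustrated with a single stochastic gradient descent update for logistic regression; the stream and learning rate below are toy values, not taken from the talk:

```python
import math

def sgd_step(w, b, x, y, lr=0.1):
    """One online logistic-regression update on a single (x, y) example."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    p = 1.0 / (1.0 + math.exp(-z))   # predicted P(y = 1)
    g = p - y                        # gradient of log loss w.r.t. z
    return [wi - lr * g * xi for wi, xi in zip(w, x)], b - lr * g

def predict_proba(w, b, x):
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))

# examples arrive one at a time; the model never sees the full dataset at once
stream = [([1.0, 0.0], 1), ([0.0, 1.0], 0), ([1.0, 1.0], 1), ([0.0, 0.0], 0)] * 50
w, b = [0.0, 0.0], 0.0
for x, y in stream:
    w, b = sgd_step(w, b, x, y)

print(predict_proba(w, b, [1.0, 0.0]) > 0.5)  # → True
```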
A-Level Presentation - 44 Moral and ethical issues.pptx - ssuser569157
The document discusses several key topics regarding moral and ethical issues in computer science:
1. It outlines various ethical and cultural issues involving computers including censorship, impacts on the workforce, and privacy concerns from technologies like monitoring, wearable devices, and social media tracking.
2. It describes how algorithms used by social media sites, banks, and job sites can influence user behavior and discusses the programmer's responsibility.
3. Artificial intelligence and machine learning are introduced, where AI allows computers to perform tasks like humans through complex algorithms and machine learning enables computers to solve difficult problems.
4. Additional topics covered include web design considerations for global audiences, environmental impacts of computer production and disposal, and the positive effects of computers.
#1 Berlin Students in AI, Machine Learning & NLP presentation - parlamind
For the first ever Meetup of Berlin Students in AI, Machine Learning & NLP, Dr. Tina Klüwer (CTO at parlamind.com) and Nuria Bertomeu Castello (CSO) gave an introductory presentation on conversational intelligence.
Identifying Personas With Agile Research - Dawn of the Data Age Lecture Series - Luciano Pesci, PhD
This document summarizes Luciano Pesci's lecture on interpreting data to create behavioral customer personas. It outlines the benefits of personas in improving customer experience and profitability. Pesci discusses using observational and survey data to develop personas through analytical methods like dimension reduction and clustering. He advocates creating personas based on customer behavior rather than demographics. The lecture recommends visualizing personas and collecting feedback in an agile process to refine understanding of customer segments.
Past, present, and future of Recommender Systems: an industry perspective - Xavier Amatriain
Keynote for the ACM Intelligent User Interface conference in 2016 in Sonoma, CA. I start with the past by talking about the Recommender Problem, and the Netflix Prize. Then I go into the Present and the Future by talking about approaches that go beyond rating prediction and ranking and by finishing with some of the most important lessons learned over the years. Throughout my talk I put special emphasis on the relation between algorithms and the User Interface.
Modeling challenges for insurance pricing include:
1) Claims costs often follow skewed, heavy-tailed distributions and are affected by discontinuities from policy limits, changing risk pools, and other factors.
2) Regulation and operational constraints require models to be simple and transparent while meeting requirements like premiums increasing with coverage.
3) Predicting future claims is difficult as models may not extrapolate to new situations and can be naive in how they learn from past data.
AI in the Real World: Challenges and Risks, and how to handle them? - Srinath Perera
This document discusses challenges, risks, and how to handle them with AI in the real world. It covers:
- AI can perform tasks like driving a car faster and cheaper than humans, but can't fully explain how.
- Deploying and managing AI models at scale is complex, as is integrating models with user experiences. Bias and lack of transparency are also risks.
- When applying AI, such as in high-risk domains like medicine, it is important to audit models, gradually introduce them with trials, monitor outcomes, and find ways to identify and address errors or unfair impacts. With care and oversight, AI can be developed to help more people than it harms.
Smartphones change the way we give attention to the world. Cognitive resources are tied up in the perpetual monitoring of potentially relevant notifications. Our brains get addicted to this kind of variable reward.
99ways presentation at SemTech conference 2009 - michele minno
This document describes a tool called 99ways that allows users to curate and organize web content in a personalized graph. It allows extracting text, images, videos or audio from web pages and adding them as nodes to the user's graph. Nodes can be described with semantic tags and linked together. Users can browse and discover new content through their own graph and those of friends. The tool aims to provide a more personalized and higher quality web experience guided by user-selected content.
How to Fix the Import Error in Odoo 17 - Celine George
An import error occurs when a program fails to import a module or library, disrupting its execution. In languages like Python, this issue arises when the specified module cannot be found or accessed, hindering the program's functionality. Resolving import errors is crucial for maintaining smooth software operation and uninterrupted development processes.
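In plain Python (leaving aside any Odoo specifics, which this sketch does not cover), the usual patterns are to probe for a module before importing it, or to fall back gracefully when an optional dependency is missing; `ujson` below is just a hypothetical optional dependency:

```python
import importlib.util

def module_available(name):
    """Return True if `name` could be imported, without actually importing it."""
    return importlib.util.find_spec(name) is not None

# graceful fallback for a hypothetical optional fast-JSON dependency
try:
    import ujson as json_impl      # third-party, may not be installed
except ImportError:
    import json as json_impl       # standard library fallback

print(module_available("json"))    # → True
```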
LAND USE LAND COVER AND NDVI OF MIRZAPUR DISTRICT, UP - RAHUL
This dissertation explores the particular circumstances of Mirzapur, a region located in the core of India. Mirzapur, with its varied terrains and abundant biodiversity, offers an optimal environment for investigating changes in vegetation cover dynamics. Our study utilizes advanced technologies such as GIS (Geographic Information Systems) and remote sensing to analyze the transformations that have taken place over the course of a decade.
The complex relationship between human activities and the environment has been the focus of extensive research and concern. As the global community grapples with swift urbanization, population expansion, and economic progress, the effects on natural ecosystems are becoming more evident. A crucial element of this impact is the alteration of vegetation cover, which plays a significant role in maintaining the ecological equilibrium of our planet.
Land serves as the foundation for all human activities and provides the necessary materials for these activities. As the most crucial natural resource, its utilization by humans results in different 'land uses,' which are determined by both human activities and the physical characteristics of the land. The utilization of land is impacted by human needs and environmental factors. In countries like India, rapid population growth and the emphasis on extensive resource exploitation can lead to significant land degradation, adversely affecting the region's land cover.
Therefore, human intervention has significantly influenced land use patterns over many centuries, evolving its structure over time and space. In the present era, these changes have accelerated due to factors such as agriculture and urbanization. Information regarding land use and cover is essential for various planning and management tasks related to the Earth's surface, providing crucial environmental data for scientific, resource management, and policy purposes, and for diverse human activities.
An accurate understanding of land use and cover is imperative for the development planning of any area. Consequently, a wide range of professionals, including earth system scientists, land and water managers, and urban planners, are interested in obtaining data on land use and cover changes, conversion trends, and other related patterns. The spatial dimensions of land use and cover support policymakers and scientists in making well-informed decisions, as alterations in these patterns indicate shifts in economic and social conditions. Monitoring such changes with the help of advanced technologies like remote sensing and Geographic Information Systems is crucial for coordinated efforts across different administrative levels.
Changes in vegetation cover refer to variations in the distribution, composition, and overall structure of plant communities across different temporal and spatial scales. These changes can occur naturally.
Main Java [All of the Base Concepts].docx - adhitya5119
This is part 1 of my Java learning journey. It covers custom methods, classes, constructors, packages, multithreading, try-catch blocks, finally blocks, and more.
ISO/IEC 27001, ISO/IEC 42001, and GDPR: Best Practices for Implementation and... - PECB
Denis is a dynamic and results-driven Chief Information Officer (CIO) with a distinguished career spanning information systems analysis and technical project management. With a proven track record of spearheading the design and delivery of cutting-edge Information Management solutions, he has consistently elevated business operations, streamlined reporting functions, and maximized process efficiency.
Certified as an ISO/IEC 27001: Information Security Management Systems (ISMS) Lead Implementer, Data Protection Officer, and Cyber Risks Analyst, Denis brings a heightened focus on data security, privacy, and cyber resilience to every endeavor.
His expertise extends across a diverse spectrum of reporting, database, and web development applications, underpinned by an exceptional grasp of data storage and virtualization technologies. His proficiency in application testing, database administration, and data cleansing ensures seamless execution of complex projects.
What sets Denis apart is his comprehensive understanding of Business and Systems Analysis technologies, honed through involvement in all phases of the Software Development Lifecycle (SDLC). From meticulous requirements gathering to precise analysis, innovative design, rigorous development, thorough testing, and successful implementation, he has consistently delivered exceptional results.
Throughout his career, he has taken on multifaceted roles, from leading technical project management teams to owning solutions that drive operational excellence. His conscientious and proactive approach is unwavering, whether he is working independently or collaboratively within a team. His ability to connect with colleagues on a personal level underscores his commitment to fostering a harmonious and productive workplace environment.
Date: May 29, 2024
Tags: Information Security, ISO/IEC 27001, ISO/IEC 42001, Artificial Intelligence, GDPR
-------------------------------------------------------------------------------
Find out more about ISO training and certification services
Training: ISO/IEC 27001 Information Security Management System - EN | PECB
ISO/IEC 42001 Artificial Intelligence Management System - EN | PECB
General Data Protection Regulation (GDPR) - Training Courses - EN | PECB
Webinars: https://pecb.com/webinars
Article: https://pecb.com/article
-------------------------------------------------------------------------------
For more information about PECB:
Website: https://pecb.com/
LinkedIn: https://www.linkedin.com/company/pecb/
Facebook: https://www.facebook.com/PECBInternational/
Slideshare: http://www.slideshare.net/PECBCERTIFICATION
A review of the growth of the Israel Genealogy Research Association Database Collection for the last 12 months. Our collection is now passed the 3 million mark and still growing. See which archives have contributed the most. See the different types of records we have, and which years have had records added. You can also see what we have for the future.
A workshop hosted by the South African Journal of Science aimed at postgraduate students and early career researchers with little or no experience in writing and publishing journal articles.
हिंदी वर्णमाला पीपीटी, hindi alphabet PPT presentation, hindi varnamala PPT, Hindi Varnamala pdf, हिंदी स्वर, हिंदी व्यंजन, sikhiye hindi varnmala, dr. mulla adam ali, hindi language and literature, hindi alphabet with drawing, hindi alphabet pdf, hindi varnamala for childrens, hindi language, hindi varnamala practice for kids, https://www.drmullaadamali.com
How to Setup Warehouse & Location in Odoo 17 InventoryCeline George
In this slide, we'll explore how to set up warehouses and locations in Odoo 17 Inventory. This will help us manage our stock effectively, track inventory levels, and streamline warehouse operations.
This presentation includes basic of PCOS their pathology and treatment and also Ayurveda correlation of PCOS and Ayurvedic line of treatment mentioned in classics.
Walmart Business+ and Spark Good for Nonprofits.pdfTechSoup
"Learn about all the ways Walmart supports nonprofit organizations.
You will hear from Liz Willett, the Head of Nonprofits, and hear about what Walmart is doing to help nonprofits, including Walmart Business and Spark Good. Walmart Business+ is a new offer for nonprofits that offers discounts and also streamlines nonprofits order and expense tracking, saving time and money.
The webinar may also give some examples on how nonprofits can best leverage Walmart Business+.
The event will cover the following::
Walmart Business + (https://business.walmart.com/plus) is a new shopping experience for nonprofits, schools, and local business customers that connects an exclusive online shopping experience to stores. Benefits include free delivery and shipping, a 'Spend Analytics” feature, special discounts, deals and tax-exempt shopping.
Special TechSoup offer for a free 180 days membership, and up to $150 in discounts on eligible orders.
Spark Good (walmart.com/sparkgood) is a charitable platform that enables nonprofits to receive donations directly from customers and associates.
Answers about how you can do more with Walmart!"
Walmart Business+ and Spark Good for Nonprofits.pdf
Big Data and algorithms
1. Big Data and algorithms
Impact on individuals and society
2. About me
● Software engineer
● I have worked at web companies
● Big Data, Social Media, Influence Marketing
3. Context
● Lesson given at high school
● Digital Citizenship (optional alternative class to Catholic Religion)
● 14 to 18 years old students
● Academic year 2017/2018
4. Potential long term effects of Big Data
● Sensors everywhere recording every kind of data
● Potential dystopia: people will tend to ‘cool down’: standardize, tone down,
and suppress their spontaneous behaviours because they are constantly
tracked and digitalized.
● See: Social Cooling
● See: ProPublica
5. Big Data
Data gathered, stored and manipulated by automatic processes.
Characterized by the 5 Vs:
● Volume: big size of data (big data) - but each data unit is usually small
● Variety: data of different types:
○ Structured: tables, databases, ..
○ Unstructured: text, images, audio, ..
● Velocity: data available in real-time
● Variability: inconsistent, incomplete, contradictory data
● Veracity: inaccurate, unreliable data, with low quality and a lot of noise
6. An example
A sensor at each metro station turnstile records each access together with
a timestamp (i.e. seconds elapsed since the 1st of January 1970, in
UTC time)
● UTC: Coordinated Universal Time
● Longitude 0° (Greenwich)
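As a concrete illustration, such a timestamp can be converted back into a human-readable UTC date with a few lines of Python; the timestamp value below is an invented example chosen to match the table on the next slide.

```python
from datetime import datetime, timezone

# A Unix timestamp: seconds elapsed since the 1st of January 1970, UTC.
# 1230789600 is an illustrative value chosen for this example.
event_timestamp = 1230789600

# Convert it back into a human-readable UTC date and time
event_time = datetime.fromtimestamp(event_timestamp, tz=timezone.utc)
print(event_time.isoformat())  # 2009-01-01T06:00:00+00:00
```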
7. Data stored somewhere
Table with all triples stored (timestamp, station, in/out)
Timestamp            Station     Direction
Tue 01-01-2009 6:00  Battistini  in
Tue 01-01-2009 6:02  Colosseo    in
Tue 01-01-2009 6:05  Magliana    out
Tue 01-01-2009 6:12  Anagnina    in
8. Data processed to be visualized
The eye catches much more information at a glance than the brain can extract from raw numbers
9. Open Big Data
● Open data: data collected by private or public organizations, freely
downloadable or accessible by anyone
● Public knowledge
● E.g.: data about municipalities, health, geographic entities
● Linked data: open data linked to one another by semantic (= meaningful and
formally structured) links
10.
11. Algorithms
● Sequence of actions towards a goal
● E.g.: an algorithm to get a robot out of a room
○ The robot doesn’t see: it only acknowledges a wall after hitting it
○ Step forward
○ If you hit a wall, step right
○ If you hit a wall while stepping right, turn to your right
○ Loop
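The blind-robot routine can be sketched as a toy simulation in Python. The grid room, the exit position, and the step budget are all invented for this sketch; a real robot would expose similar bump-and-move primitives.

```python
# A toy simulation of the blind-robot escape routine described above.

class GridRobot:
    """Robot in a rectangular room with one opening (the exit) in a wall.
    It cannot see: it only learns about a wall by bumping into it."""

    # Headings as (dx, dy): 0 = north, 1 = east, 2 = south, 3 = west
    MOVES = [(0, -1), (1, 0), (0, 1), (-1, 0)]

    def __init__(self, width, height, x, y, exit_pos):
        self.width, self.height = width, height
        self.x, self.y = x, y
        self.exit_pos = exit_pos
        self.heading = 0  # start facing north

    def _try_move(self, heading):
        dx, dy = self.MOVES[heading]
        nx, ny = self.x + dx, self.y + dy
        inside = 0 <= nx < self.width and 0 <= ny < self.height
        if (nx, ny) == self.exit_pos or inside:
            self.x, self.y = nx, ny
            return True
        return False  # bumped into a wall

    def step_forward(self):
        return self._try_move(self.heading)

    def step_right(self):
        return self._try_move((self.heading + 1) % 4)

    def turn_right(self):
        self.heading = (self.heading + 1) % 4

    def is_outside(self):
        return (self.x, self.y) == self.exit_pos


def escape(robot, max_steps=1000):
    """Step forward; on a wall hit, step right; if that hits too, turn right."""
    steps = 0
    while not robot.is_outside() and steps < max_steps:
        steps += 1
        if robot.step_forward():
            continue              # forward worked, keep going
        if not robot.step_right():
            robot.turn_right()    # blocked both ways: change heading
    return robot.is_outside()
```

For example, a robot starting at (0, 2) in a 3×3 room with an opening at (2, -1) in the north wall escapes in a handful of steps.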
12. An algorithm to control the turnstiles
● If incoming people per minute and per turnstile are above a chosen threshold
IN-max (and outgoing people are under another threshold OUT-max)
○ A turnstile switches: turnstile OUT -> turnstile IN
● And vice versa
● (Incoming people from 6:00 to 6:10) / 10 / n. turnstiles IN = a value IN-V
● (Outgoing people from 6:00 to 6:10) / 10 / n. turnstiles OUT = a value OUT-V
● If IN-V > IN-max and OUT-V <= OUT-max, then switch OUT -> IN
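The switching rule above can be sketched as a small Python function. The threshold values and the one-turnstile-at-a-time policy are illustrative assumptions, not part of the slide.

```python
# A sketch of the turnstile-switching rule. The thresholds (people per
# minute per turnstile) and the 10-minute window are illustrative values.

IN_MAX = 20.0   # max acceptable incoming rate per IN turnstile
OUT_MAX = 20.0  # max acceptable outgoing rate per OUT turnstile

def rebalance(incoming, outgoing, n_in, n_out, minutes=10):
    """Return the new (n_in, n_out) split after applying the rule once."""
    in_v = incoming / minutes / n_in     # IN-V from the slide
    out_v = outgoing / minutes / n_out   # OUT-V from the slide

    if in_v > IN_MAX and out_v <= OUT_MAX and n_out > 1:
        return n_in + 1, n_out - 1       # switch one turnstile OUT -> IN
    if out_v > OUT_MAX and in_v <= IN_MAX and n_in > 1:
        return n_in - 1, n_out + 1       # and vice versa
    return n_in, n_out
```

With 1300 people entering and 200 exiting in 10 minutes through 4 + 4 turnstiles, IN-V = 32.5 exceeds the threshold while OUT-V = 5.0 does not, so one turnstile switches: `rebalance(1300, 200, 4, 4)` returns `(5, 3)`.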
13. Programming languages
● Languages to write algorithms
● Understood by machines, so that they can execute them
● Pseudo-languages: languages to sketch algorithms, useful for people but
unreadable by machines
○ E.g. block diagrams or natural language instructions
14. Models
● Models are often created for the observed reality
● To simplify it: otherwise too many variables are involved
● And to make decisions, i.e. to execute actions on the observed reality
15. Example of model
● Class of students
● I want to improve the performance of the students
● Which data can I collect about this reality?
● Which model can I draw?
● Which actions can I put in place?
16. The starting theory
● I rely on a theory, a direction, an idea
● The idea: students aren’t good enough because teachers are not up to
their job.
● It’s just a theory, as good as any other. Other possible theories I could have picked:
○ Because students are too tired
○ Because they live in poor neighbourhoods
○ Because they spend too much time on their smartphones
17. Useful data, according to my theory
● Students’ grades at tests, reports, etc.
● Opinions about each teacher given by the school principal and the parents
of the students
● I design an algorithm that evaluates teachers based on this model:
○ Students get better or worse depending on their teachers’ quality
○ Teachers getting good reviews from principals and students’ parents are actually good
18. My algorithm
● If at the end of the year with teacher T, students get better grades than the
year before, then T was good by a factor N
● If T gets good reviews from the principal and from students’ parents, then
T was good by a factor R
○ Teacher score S = N + R
○ Among all school teachers, those who fall
in the lowest x% of the curve get fired
Gaussian curve: it fits sums of random values well
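A minimal sketch of this scoring-and-firing step in Python; all N and R values below are made-up numbers for illustration.

```python
# A sketch of the scoring step above: S = N + R, then mark the teachers
# in the lowest x% of the score distribution for firing.

def teachers_to_fire(scores, x=0.05):
    """scores: {teacher: (N, R)}. Returns the teachers in the lowest x% of S."""
    total = {t: n + r for t, (n, r) in scores.items()}
    ranked = sorted(total, key=total.get)      # lowest S first
    cutoff = max(1, round(len(ranked) * x))    # fire at least one
    return set(ranked[:cutoff])

# A teacher like Sarah, well reviewed (high R) but penalized by noisy
# grade deltas (low N), can still land in the bottom 5%:
faculty = {"Sarah": (-3, 4)}
faculty.update({f"teacher_{i}": (1, 2) for i in range(19)})
print(teachers_to_fire(faculty))  # {'Sarah'}
```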
19. Algorithm execution and resulting actions
● The algorithm runs, I get my teachers’ scores
● I find the 5% (for example) of all teachers who rank lowest on
the curve
● I fire them
● I optimized the faculty
● Am I ok with that? Did I do a good job?
20. That really happened
● Article in the Washington Post
● There’s a problem:
○ Sarah was a very good teacher, held in high esteem by the principal and students’ parents
○ She got a low score by the algorithm
○ She was fired
● How did that happen?
● S = N + R = low value + high value -> in the lowest 5%
21. What’s wrong?
● How is it possible that a good teacher got fired?
● Model too naive
● Incoherent data
● Small data
● No feedback
22. Wrong model
● Each model comes with a choice, focusses on some variables and cuts
out others
● Otherwise it wouldn’t be a model, i.e. a simplified version of the reality
● It has a bias: a prejudice, an inclination towards one side
● In our example, we consider as variables only the students’ grades in the
previous year and in the current year
● Too simple: abstract oversimplification of the target reality
23. Poorly coherent data
● Algorithm input data may not be coherent
● Grades from the previous year (e.g. the last year of elementary school) could
be higher than they should be
● In the current year, lower grades assigned by the current teacher seem to
suggest a worsening of the students’ performance, but this may not be
the case
24. Not enough data
● There are too few data points
● For a statistical model to work properly, it needs a lot of data
(~millions of points)
● I can’t take the grades of 25 students and feed just those to the
algorithm
● In any particular case of a class, there could be a thousand reasons why
those students are performing worse:
○ Problems at home
○ Personal problems
○ Change of school
25. No feedback
● There isn’t any feedback which loops back to the algorithm to steer it
● The feedback comes from the current state of the reality affected by the
action of the algorithm
[Diagram: theory → Model → Algorithm → action → Modeled reality, with feedback flowing back from the modeled reality to the Algorithm]
26. Impossible to get things back on an even
keel
● With no feedback the algorithm goes off on its own
● It can’t be updated with data extracted from the observed reality after its
start
● If we fired good teachers, we’ll never know
● If we kept bad teachers, we’ll never know either
● The algorithm doesn’t register the mistakes it makes, let alone learn from
them
27. Automatic speed controller
● Autopilot (controlled variable: speed - but it could be direction too)
● The car must constantly go 100 km/h
● It works in the same way as the other example, just on a simpler reality
[Diagram: Controller → action → Real speed, with feedback flowing back from the real speed to the Controller]
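The feedback loop can be sketched as a proportional controller acting on a toy car model. The gain and the physics constants are invented for illustration; real cruise control is more sophisticated.

```python
# A minimal sketch of the cruise-control feedback loop: a proportional
# controller nudges engine power based on the measured speed error.
# The gain and the toy physics (thrust minus drag) are invented values.

TARGET = 100.0  # desired speed, km/h
GAIN = 0.5      # proportional gain (illustrative)

def simulate(steps=400):
    speed, power = 0.0, 0.0
    for _ in range(steps):
        error = TARGET - speed               # feedback: measure the real speed
        power += GAIN * error                # controller action
        speed += 0.1 * power - 0.05 * speed  # toy physics: thrust minus drag
    return speed
```

Remove the `error` measurement and the loop would apply a fixed power forever: that is the no-feedback controller of the next slide, which never learns whether the car settled at 100 or 110 km/h.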
28. Controller without feedback
● The controller gives the engine an initial power and the car reaches e.g. 110 km/h
● From that moment on, what happens? We’ll never know
[Diagram: Controller → action → Real speed, with no feedback path]
29. Weapons of math destruction
● Weapons of Math Destruction - Cathy O’Neil
● Checklist of a weapon of math destruction (WMD) features:
○ Model and algorithm are non-transparent (black box): we don’t know what’s inside
○ Harmful for people
○ Even worse, it builds a vicious circle that makes things worse, whereas one of its
objectives was to improve objectivity and remove inequalities
○ It can scale to big numbers
30. Vicious circle
● In our example of teachers there’s no vicious circle
● The algorithm output is hardly meaningful; at least it doesn’t make things worse
● Another real example with a vicious circle:
● An algorithm assigning years in prison: it gives more years to people already
convicted in the past or with previous dealings with the justice system (used
in some US states)
○ Someone living in a rough neighborhood will more likely have higher algorithm scores ->
more severe and long punishments -> even more disadvantaged once out of prison
■ -> vicious circle
31. When algorithms on Big Data come into play
and when they don’t
● Leveraging algorithms and Big Data to make choices is easy:
○ Automatic
○ Fast
○ With no responsibility for people who use them
● Whenever you need to choose one single precious individual, you still ask
people to do that (e.g. hiring a lawyer for a prestigious firm)
● Whenever you need to choose thousands of times among thousands of
interchangeable people, algorithms do that (e.g. applicants for McDonald’s)
● Algorithms save money and time, but with the side effect of ruining the
lives of the many individuals on whom they simply fail (collateral damage)
32. Algorithms on Big Data as weapons of math
destruction
● That’s why they are called weapons of math destruction
● They embody a simplistic and biased vision of a certain reality
● They take decisions on the basis of a few variables
● They produce numbers, scores, rankings which look objective
● Everyone starting from an unfavourable position according to the algorithm’s
parameters will end up sinking even lower (e.g. more years in prison for repeat
offenders, even with minor dealings with justice, in a poor neighborhood) - vicious
circle: the offender will more likely do badly in the future
● Inequalities grow, the exact opposite of what the algorithm’s designers expect
33. Example of WMD in our brain
● Everyone is coupled with their number of followers on social media
● Whoever already has a big number of them will get more and more
● Whoever has a small number will hardly get more. Why?
● In our brain there’s a little WMD:
○ If someone has got a lot of followers, then they are an important person, so I’m going
to follow them
○ If someone has got few followers, then they are a loser, so I’m not going to follow them
○ Vicious circle
● For us, that number is now an objective, integral part of the person
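The follower vicious circle can be simulated as preferential attachment: each new follow goes to a user with probability proportional to the followers they already have. All numbers here are illustrative.

```python
# A toy simulation of the follower vicious circle as preferential attachment.
import random

def simulate_follows(followers, n_events, seed=0):
    followers = list(followers)
    rng = random.Random(seed)  # fixed seed for a reproducible run
    for _ in range(n_events):
        # the rich get richer: pick a user weighted by current follower count
        winner = rng.choices(range(len(followers)), weights=followers)[0]
        followers[winner] += 1
    return followers
```

Starting from two users with 100 and 1 followers and distributing 1000 new follows, nearly all of them go to the already popular user: the gap widens instead of closing.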