Let's understand Data Science

Data Science
By: Sachin Rastogi
Credit: All information (images, video, and text) used in this presentation is available in the public domain. All rights remain with the respective owners. My purpose is only to explain Data
Science on a non-profit basis. If you have any objection, please let me know and I will remove the respective content. My email is sachin.rastogi@yahoo.com

“To make everyone understand
Data Science with the help of real
stories.”

What is Data Science?
Data science is an interdisciplinary field that uses scientific
methods, processes, algorithms and systems to extract
knowledge and insights from data in various forms.
Source: https://en.wikipedia.org/wiki/Data_science

What is Data Science?
Data science is about using data to create impact for your
organization. Impact can be:
• In the form of insights,
• In the form of data products,
• In the form of product recommendations.

Target Corporation is the second-largest department store retailer in the United States.
• Generally, shoppers don’t buy everything at one store.
• Target sells everything from milk to stuffed animals to lawn furniture to electronics.
• One of the company’s primary goals is to convince customers that they only need
Target, but how?
• There are specific periods in a person’s life when old routines fall apart and buying
habits are suddenly in flux.
• Timing is everything.
• The key is to reach them earlier, before any other retailers know a baby is on the
way.
“Can you give us a list of such customers?”
The Target Story

Please watch this video: https://youtu.be/RC5HNTj3Dag

The OSEMN (“awesome”) Model
A very practical definition by Mason & Wiggins (2010)

01 Obtain: Collect the data.
02 Scrub: Clean the data.
03 Explore: Understand the data.
04 Model: Mathematical representation of the data.
05 iNterpret: Storytelling and drawing conclusions from the data.

01 Obtain
1. Query from a database.
2. Read from CSV/HTML/JSON.
3. Generate data, e.g. from sensors.
4. Collect from surveys.
5. Download from another location
(e.g. a web server).
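To make the Obtain step concrete, here is a minimal sketch using only Python's standard library. The sample data, the table name `races`, and the helper names are assumptions made up for illustration; an in-memory SQLite database stands in for a real one.

```python
import csv
import io
import json
import sqlite3

# Hypothetical sample data; in practice this would come from files or a server.
CSV_TEXT = "runner,time_min\nasha,245\nben,231\n"
JSON_TEXT = '[{"runner": "cara", "time_min": 252}]'


def read_csv_rows(text):
    """Parse CSV text into a list of dicts, one per row (values stay strings)."""
    return list(csv.DictReader(io.StringIO(text)))


def read_json_rows(text):
    """Parse a JSON array of objects into a list of dicts."""
    return json.loads(text)


def query_rows(conn, sql):
    """Run a SQL query and return the result rows as dicts."""
    conn.row_factory = sqlite3.Row
    return [dict(row) for row in conn.execute(sql)]


# An in-memory database stands in for a real one (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE races (runner TEXT, time_min INTEGER)")
conn.execute("INSERT INTO races VALUES ('dev', 238)")

csv_rows = read_csv_rows(CSV_TEXT)
json_rows = read_json_rows(JSON_TEXT)
db_rows = query_rows(conn, "SELECT runner, time_min FROM races")
```

Note that the CSV reader returns every value as text, which is one reason the Scrub step usually follows.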

02 Scrub
Real obtained data may have
missing values, inconsistencies,
errors, weird characters, or
uninteresting columns.
Common scrubbing operations
include:
1. Filtering lines.
2. Extracting certain columns.
3. Replacing values.
4. Extracting words.
5. Handling missing values.
6. Converting data from one format
to another.
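Several of these scrubbing operations can be sketched in a few lines of Python. The raw lines, column names, and the choice of "unknown" and `None` as placeholders for missing values are assumptions for illustration only.

```python
# Hypothetical raw data with a blank line, stray whitespace,
# a missing city, a non-numeric age, and an uninteresting column.
RAW_LINES = [
    "name,city,age,notes",
    "Asha ,Pune,34,ok",
    "ben,,n/a,??",
    "",
    "Cara,Delhi,29,ok",
]


def scrub(lines):
    """Filter blank lines, keep selected columns, and normalise values."""
    header, *body = [line for line in lines if line.strip()]  # filter lines
    columns = header.split(",")
    keep = ["name", "city", "age"]  # drop the uninteresting 'notes' column
    cleaned = []
    for line in body:
        record = dict(zip(columns, line.split(",")))
        row = {key: record[key].strip() for key in keep}  # extract columns
        row["name"] = row["name"].title()  # replace inconsistent casing
        row["city"] = row["city"] or "unknown"  # handle a missing value
        # Convert age from text to a number; anything non-numeric is missing.
        row["age"] = int(row["age"]) if row["age"].isdigit() else None
        cleaned.append(row)
    return cleaned


cleaned_rows = scrub(RAW_LINES)
```

Whether a missing value should be dropped, flagged, or filled in depends on the question being asked; the sketch simply flags it.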

03 Explore
This is where it gets interesting,
because here we really get into
our data.
1. Understand the data.
2. Identify patterns and relationships
in the data.
3. Derive statistics from the data.
4. Create interesting visualizations.
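As a rough illustration of the Explore step, the sketch below derives summary statistics, measures the relationship between two variables, and prints a crude text chart. The numbers are invented for the example and do not come from any real data set.

```python
import statistics

# Hypothetical data: weekly training distance (km) and race time (minutes).
weekly_km = [20, 35, 40, 55, 60]
race_min = [270, 250, 245, 225, 220]


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


# Derive statistics from the data.
summary = {
    "mean": statistics.mean(race_min),
    "median": statistics.median(race_min),
    "stdev": statistics.stdev(race_min),
}

# Identify a relationship: a value near -1 means more training, faster races.
r = pearson(weekly_km, race_min)

# A crude text "visualization": one bar per runner, scaled to training volume.
for km, minutes in zip(weekly_km, race_min):
    print(f"{km:>3} km | {'#' * (km // 5)} {minutes} min")
```

A strong correlation in data like this is a pattern worth investigating, not proof that training volume causes the faster times.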

04 Model
A model is a mathematical
representation of the data
(with respect to the
assumptions we're willing to
make, the problem we're
trying to solve, and the data
themselves).

04 Model
Here we’re using linear regression,
one of the simplest techniques in
data science. We’re fitting the
model (the line) to a data series (the
dots).
We know that the model will be of
the form y = ax + b
and we’re trying to find the optimal
values of a and b.
We draw the line that best fits the
existing data points on average.
Once we’ve fitted the model, we
can use it to predict outcomes (y
axis) based on inputs (x axis).
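One common way to find a and b is ordinary least squares, which picks the line that minimises the sum of squared vertical distances to the points. The slide does not say which fitting method was used, so treat this as one possible sketch; the data points are invented.

```python
def fit_line(xs, ys):
    """Return (a, b) minimising the squared error of y = a*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: covariance of x and y divided by the variance of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    a = sxy / sxx
    # Intercept: the fitted line passes through the point of means.
    b = mean_y - a * mean_x
    return a, b


def predict(a, b, x):
    """Use the fitted line to predict y for a new input x."""
    return a * x + b


# Hypothetical data series (the "dots").
xs = [1, 2, 3, 4, 5]
ys = [3.1, 4.9, 7.2, 8.8, 11.0]

a, b = fit_line(xs, ys)
next_y = predict(a, b, 6)
```

Predictions are most trustworthy for inputs close to the range the model was fitted on.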

“The purpose of computing is
insight, not numbers.”
- Richard Hamming

05 iNterpret
1. Draw conclusions from your
data.
2. Evaluate what your results
mean.
3. Visualize your findings; keep it
simple and priority driven.
4. Tell the story of the data:
communicate the results
effectively to non-technical
audiences.

What is Strava?
Strava is a social fitness networking application that is used to
track cycling, running, and swimming activities, among others,
using GPS data.
Strava in numbers
1. Activities recorded as at 31 December 2017: 1 billion
2. Runs uploaded in 2017: 136 million
3. Marathons uploaded in 2017: 627,239
4. Every 40 days, a million people join Strava.
Strava also counts commuters.

Strava Profile: Mr. R (Bike)

Nike Says Its $250 Running Shoes Will Make You
Run Much Faster. What if That’s Actually True?
Source : https://www.nytimes.com/interactive/2018/07/18/upshot/nike-vaporfly-shoe-strava.html

• Nike says the shoes are about 4 percent better than some of its best racing shoes.
• Based on profiles from more than 700 races in dozens of countries since 2014, The
NY Times compiled results from about 280,000 completed marathons and 215,000
completed half marathons.
• Using public race reports and shoe records from Strava, The Times found that runners
in Vaporflys ran 3 to 4 percent faster than similar runners wearing other shoes.
How?

Obtain/Collection of Data
• An ideal experiment to measure how much shoes matter for
race performance would involve a series of marathons on a
variety of courses, with runners randomly assigned different
running shoes.
• There is no such experiment, but something like it happens
around the world almost every weekend.
• Every week, tens of thousands of amateur runners compete in
races and upload their race data — collected on smartphones
or satellite watches — to Strava.
Source : https://www.nytimes.com/interactive/2018/07/18/upshot/nike-vaporfly-shoe-strava.html

Source: https://www.nytimes.com/interactive/2018/07/18/upshot/nike-vaporfly-shoe-strava.html
Scrub/Cleaning of Data
1. No shoe information.
2. Erroneous or incomplete data.
3. Higher speed threshold.
4. Virtual road ride.
5. Spelling mistakes.
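The sketch below shows how cleaning rules like these could look in code. It is not the method The Times used: the record fields, the shoe list, the 25 km/h cut-off, and the 0.8 similarity cut-off are all assumptions chosen for illustration.

```python
import difflib

# Hypothetical list of known shoe models (assumption).
KNOWN_SHOES = ["Nike Vaporfly", "Adidas Adizero", "Brooks Ghost"]
# Assumed cut-off: a "run" faster than this is treated as an error.
MAX_SPEED_KMH = 25.0


def normalise_shoe(name):
    """Map a possibly misspelled shoe name onto a known one, else None."""
    if not name:
        return None
    lowered = {shoe.lower(): shoe for shoe in KNOWN_SHOES}
    match = difflib.get_close_matches(
        name.strip().lower(), list(lowered), n=1, cutoff=0.8
    )
    return lowered[match[0]] if match else None


def clean(records):
    """Keep only complete, plausible records with a recognised shoe."""
    kept = []
    for rec in records:
        shoe = normalise_shoe(rec.get("shoe"))
        if shoe is None:
            continue  # no usable shoe information
        if rec.get("distance_km") is None or not rec.get("hours"):
            continue  # incomplete record
        if rec["distance_km"] / rec["hours"] > MAX_SPEED_KMH:
            continue  # implausibly fast for a runner
        kept.append({**rec, "shoe": shoe})
    return kept


# Hypothetical race records.
records = [
    {"shoe": "nike vaporfly", "distance_km": 42.2, "hours": 3.5},
    {"shoe": "Nike Vaporflyy", "distance_km": 21.1, "hours": 1.8},
    {"shoe": None, "distance_km": 42.2, "hours": 4.0},
    {"shoe": "Brooks Ghost", "distance_km": 42.2, "hours": 1.0},
    {"shoe": "Adidas Adizero", "distance_km": None, "hours": 3.9},
]
cleaned = clean(records)
```

Fuzzy matching repairs simple spelling mistakes, but a cut-off that is too loose can merge different shoe models, so the matches are worth spot-checking by hand.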

Explore/Model/Interpret
Below, we describe the four ways we measured the shoes’ effect.
1. Measuring shoe effects using statistical models.
2. Comparing groups of runners who completed the same two
races.
3. Average change among shoe switchers compared with
non-switchers.
4. All runners as they switch to a new kind of racing shoe.

Measuring shoe effects using statistical models.
Pros of this approach: Tries to control for race conditions, weather, gender,
age, pre-race training and a runner’s previous race times.
Cons of this approach: Still not a randomized controlled trial.

Comparing groups of runners who completed the same two races.
(Boston 2017 and Boston 2018)
Pros of this approach: Follows athletes of similar ability who ran in identical
conditions.
Cons of this approach: Runners could save their special shoes for when they
expect to have a fast race.
Instead of directly comparing performances in the two races, we can
compare the net change of runners who switched to Vaporflys with the net
change of similar runners who did not.
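Comparing the net change of switchers with the net change of non-switchers is often called a difference-in-differences comparison. A minimal sketch follows; the finish times are invented purely to show the arithmetic and are not results from the article.

```python
def mean(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)


def pct_change(first_min, second_min):
    """Percent change in finish time; negative means the runner got faster."""
    return 100.0 * (second_min - first_min) / first_min


def switch_effect(runners):
    """Average change of switchers minus average change of non-switchers."""
    switched = [
        pct_change(r["race1"], r["race2"]) for r in runners if r["switched"]
    ]
    stayed = [
        pct_change(r["race1"], r["race2"]) for r in runners if not r["switched"]
    ]
    return mean(switched) - mean(stayed)


# Hypothetical finish times in minutes for the same two races.
runners = [
    {"race1": 200.0, "race2": 190.0, "switched": True},
    {"race1": 250.0, "race2": 240.0, "switched": True},
    {"race1": 200.0, "race2": 198.0, "switched": False},
    {"race1": 300.0, "race2": 297.0, "switched": False},
]
effect = switch_effect(runners)
```

Subtracting the non-switchers' change removes whatever affected everyone in the second race, such as better weather, leaving an estimate of the change associated with the shoe switch itself.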

Average change among shoe switchers compared with non-switchers.
This approach uses hundreds of pairs of races in which large groups of runners ran the same two
races and in which a subset of them switched shoes.

All runners as they switch to a new kind of racing shoe.
Pros of this approach: Accounts for runners of varying skills over several
races.
Cons of this approach: Runners could save Vaporflys for when they expect to
be faster than normal.

None of these approaches is perfect, but they all point to a similar
conclusion.
Wherever we look for evidence that shoes matter in a marathon or half
marathon, we find Vaporflys at or near the top of that list.
Runners who improved their performance in Vaporflys and then switched to
other shoes got slower.

“Data will talk to you if you’re
willing to listen to it.”
- Jim Bergeson

The Strava Heatmap
for City Planners

What is a heatmap?
It is a graphical representation, drawn on a map, of the different activities recorded on Strava
with their GPS data. Activities include running, commuting, biking, swimming, etc.
To give a sense of scale, the heatmap consists of:
• 700 million activities
• 1.4 trillion latitude/longitude points
• A total distance of 16 billion km (10 billion miles)
• A total recorded activity duration of 100 thousand years
Source : https://medium.com/strava-engineering/the-global-heatmap-now-6x-hotter-23fc01d301de
Strava Heatmap
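The core idea behind a heatmap can be sketched as counting how many GPS points fall into each cell of a grid laid over the map; busier cells are drawn "hotter". This is a simplified illustration, not Strava's actual pipeline; the grid resolution and the sample coordinates are assumptions.

```python
import math
from collections import Counter


def heat_counts(points, cells_per_degree=100):
    """Count GPS points per grid cell (about 0.01 degree per cell by default)."""
    counts = Counter()
    for lat, lon in points:
        # The cell index is the coordinate scaled up and rounded down.
        cell = (
            math.floor(lat * cells_per_degree),
            math.floor(lon * cells_per_degree),
        )
        counts[cell] += 1
    return counts


# Hypothetical (latitude, longitude) samples from a few activities.
points = [
    (17.4473, 78.3555),
    (17.4478, 78.3551),
    (17.4479, 78.3559),
    (17.4621, 78.3702),
]
counts = heat_counts(points)
hottest_cell, hottest_count = counts.most_common(1)[0]
```

A real heatmap would then map each cell's count to a colour, typically after some smoothing so that a single route does not dominate.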

Source: https://medium.com/strava-engineering/the-global-heatmap-now-6x-hotter-23fc01d301de
Bike counter Correlation

Strava Heatmap
In Seattle, at one intersection, city planners discovered:
• Cyclists coming from the south would slow down before crossing,
• Cyclists coming from the north would come to a stop and then walk their bikes or
ride slowly.
• City planners realized the intersection posed a risk to cyclists.
Similarly, the DOT installed rumble strips on a highway to keep motor vehicles from running off
the road, but they’re a nightmare for cyclists.

Source: https://www.cyclingweekly.com/news/latest-news/five-best-strava-art-139034
Strava accuracy on map
The New Forest pony
The Strava proposal

Source: https://www.strava.com/heatmap#15.00/78.35556/17.44737/hot/run
Strava Heatmap - Running

Strava Heatmap - Biking

Strava Heatmap - Swimming

Strava Heatmap - Live
https://www.strava.com/heatmap#15.00/78.35152/17.44928/hot/run

Thanks!
Any questions?

Credits
Special thanks to all the people
who made and released these
awesome resources for free:
◎ Presentation template by
SlidesCarnival
◎ Photographs by Unsplash
