How to Prepare for a Career
in Data Science
Juuso Parkkinen, PhD - @ouzor
Head of Data Science, Nightingale Health - @NgaleHealth
Aalto University, November 25, 2019
Outline
1. My Career as a Data Scientist
2. Data Science Workflow
3. Data Science and Business
My Career as a Data Scientist
The Data Science Venn Diagram
http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram
My career steps
MSc in bioinformation technology
from HUT / Aalto
PhD in bioinformatics and machine
learning from Aalto
Data Scientist (consultant) at Reaktor
Data Scientist at Nightingale Health
Data Science research: probabilistic
models for biomedical problems
More on my research and other projects:
https://ouzor.github.io/projects.html
Data Science as a hobby: open tools for open data
Blogging
Open source programming
Open Knowledge community
Blogs: https://louhos.github.io/, https://ouzor.github.io/
Open data science example: Biking activity in Helsinki
How do various factors affect biking activity
in Helsinki?
Data sources:
- Automatic bike activity counters from
multiple sites
- Weather data from FMI
Bike activity modelled with Negative
Binomial distribution using R (mgcv::gam)
Done with Janne Sinkkonen and Antti
Poikola
Data, code & results:
https://github.com/apoikola/fillarilaskennat
Open Data Science at Reaktor: Apartment price modelling
Kannattaakokauppa.fi by Reaktor: http://kannattaakokauppa.fi
More about the model: https://ouzor.github.io/blog/2016/03/08/apartment-price-model.html
Data Science Workflow
Data Science in the vacuum
Typically starts with a clean data set and a clear (modelling) task.
Example: Weather data in csv, and a goal to predict humidity.
What might be different in the real world?
Data Science Workflow in the Real World
1. Identifying and defining the problem
2. Accessing data
3. Preprocessing and cleaning the data
4. Exploratory data analysis and visualisation
5. Statistical modelling or machine learning
6. End result
Note the difference between academic interests and practical relevance!
Iteration: the steps are repeated as needed, not run once in order.
Identifying and defining the problem
Learn to be critical and ask good questions!
• Why is this problem important?
• How does solving this improve our user experience?
• How does solving this improve our business?
• Is the problem really something we should solve, or is it something where we happen to have data or methods
available?
• Do we even need to solve this problem!?
Only after the problem is identified can you start thinking about data science:
- Do we have relevant data to support solving the problem?
- Can we use modelling to solve the problem (e.g. prediction or classification)?
Accessing data
Data exists in a variety of sources and formats.
A data scientist might need to access data from any of
these in a reasonable time.
Typical data sources: files, APIs, databases, web
scraping
Typical data formats:
- CSV, TSV, Excel
- JSON (XML less nowadays)
- Lots of strange structures in text files
Domain-specific formats:
- Relational data (networks)
- Spatial data
- Gene expression, genomic data
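As a rough illustration, the sketch below reads the same observations from CSV and from JSON into one common record shape, using only the Python standard library. The column names and values are made up for the example, and real projects would more likely use a data frame library.

```python
import csv
import io
import json

# Hypothetical sample data: the same two weather observations in two formats.
CSV_TEXT = """station,temperature_c,humidity_pct
Kaisaniemi,3.5,87
Kumpula,3.1,90
"""

JSON_TEXT = """[
  {"station": "Kaisaniemi", "temperature_c": 3.5, "humidity_pct": 87},
  {"station": "Kumpula", "temperature_c": 3.1, "humidity_pct": 90}
]"""


def to_record(raw):
    """Convert one raw row (all strings or mixed types) to a typed record."""
    return {
        "station": raw["station"],
        "temperature_c": float(raw["temperature_c"]),
        "humidity_pct": float(raw["humidity_pct"]),
    }


def records_from_csv(text):
    """Parse CSV text; csv.DictReader yields every value as a string."""
    return [to_record(row) for row in csv.DictReader(io.StringIO(text))]


def records_from_json(text):
    """Parse a JSON array of objects into the same record shape."""
    return [to_record(item) for item in json.loads(text)]


csv_records = records_from_csv(CSV_TEXT)
json_records = records_from_json(JSON_TEXT)

# Once loaded, the source format no longer matters for the analysis.
print(csv_records == json_records)
```

Converting every source into one typed record shape early keeps the later steps independent of where the data came from.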
Example: Weather data from WFS API
http://opendata.fmi.fi/wfs?service=WFS&version=2.0.0&request=getFeature&storedquery_id=fmi::forecast::hirlam::surface::point::multipointcoverage&place=helsinki&
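A query URL like the one above is easier to read and modify when it is assembled from named parameters. This is a minimal sketch using only the parameters visible in the URL; it builds the string and does not send a request, so it makes no claim about what the service currently returns.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

BASE_URL = "http://opendata.fmi.fi/wfs"

# Parameters copied from the example URL on the slide.
params = {
    "service": "WFS",
    "version": "2.0.0",
    "request": "getFeature",
    "storedquery_id": "fmi::forecast::hirlam::surface::point::multipointcoverage",
    "place": "helsinki",
}

# safe=":" keeps the "::" separators readable instead of percent-encoding them.
query_url = BASE_URL + "?" + urlencode(params, safe=":")
print(query_url)

# Parsing the URL back is a quick check that nothing was lost in encoding.
round_trip = {key: values[0] for key, values in parse_qs(urlsplit(query_url).query).items()}
print(round_trip == params)
```

Changing the `place` value is then a one-line edit instead of string surgery on a long URL.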
Be very careful with Excel data formatting!
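The slide's figure is not reproduced here, so the following is only an assumed example of the kind of problem meant: spreadsheet software is known to silently turn some identifiers, such as gene symbols like SEPT2, into dates. The sketch flags values in an identifier column that look like auto-converted dates. The column contents are hypothetical.

```python
import re

# Matches values such as "2-Sep" or "1-Mar", a typical look of an
# identifier that a spreadsheet has converted to a date.
DATE_LIKE = re.compile(
    r"^\d{1,2}-(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)$"
)


def find_date_like(values):
    """Return the values that look like dates rather than identifiers."""
    return [value for value in values if DATE_LIKE.match(value.strip())]


# Hypothetical identifier column exported from a spreadsheet.
gene_column = ["TP53", "2-Sep", "BRCA1", "1-Mar", "DEC1"]

suspicious = find_date_like(gene_column)
print(suspicious)
```

A check like this only detects the damage. The original identifiers still have to be recovered from the source data.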
Preprocessing and cleaning (”wrangling” / ”munging”)
“Tidy datasets are all alike, but every messy dataset is messy in its own way.” –– Hadley Wickham
Having data in a tidy format makes data analysis, visualisation and modelling easier.
Data frames in R and Python.
Read more about tidy data: https://r4ds.had.co.nz/tidy-data.html
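In tidy data each variable is a column and each observation is a row. The sketch below reshapes a small "wide" table (one column per month) into that form using plain Python. The table, the column names and the helper `to_tidy` are invented for illustration; data frame libraries offer ready-made functions for the same reshaping.

```python
# Hypothetical "wide" table: one row per counting site, one column per month.
wide_rows = [
    {"site": "A", "jan": 120, "feb": 95},
    {"site": "B", "jan": 80, "feb": 60},
]


def to_tidy(rows, id_column, variable_name, value_name):
    """Reshape wide rows so that each output row is a single observation."""
    tidy = []
    for row in rows:
        for column, value in row.items():
            if column == id_column:
                continue  # the identifier is repeated on every output row
            tidy.append({
                id_column: row[id_column],
                variable_name: column,
                value_name: value,
            })
    return tidy


tidy_rows = to_tidy(
    wide_rows, id_column="site", variable_name="month", value_name="bike_count"
)
for row in tidy_rows:
    print(row)
```

In the tidy form, adding a third month means adding rows, not columns, so grouping, plotting and modelling code does not need to change.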
Exploratory Data Analysis and Visualisation
The goal of Exploratory Data Analysis is to get
to know your data, using visual summaries and
computing descriptive statistics.
Includes identifying missing data, outliers and
other possible problems with the data.
This informs preprocessing and cleaning, and
typically needs a couple of iterations before the
data is ready for analysis.
You should also contact domain experts and
confirm if the data looks as it should.
It’s hard to define when the data is really
”clean”. You will develop an instinct for this
over time.
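A first pass at these checks can be very small. The sketch below counts missing values, computes a few descriptive statistics and flags outliers with the common 1.5 x IQR rule. The measurements are made up, and the IQR rule is just one possible choice, not a method named in the talk.

```python
import statistics

# Hypothetical relative humidity readings in percent; None marks a missing value.
humidity = [87.0, 90.0, None, 85.0, 88.0, 250.0, 86.0, None, 89.0]

# Missing data
missing_count = sum(1 for value in humidity if value is None)
observed = [value for value in humidity if value is not None]

# Descriptive statistics
median = statistics.median(observed)
q1, _, q3 = statistics.quantiles(observed, n=4)
iqr = q3 - q1

# Outliers by the 1.5 x IQR rule (an assumed, commonly used heuristic)
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [value for value in observed if value < lower or value > upper]

print("missing:", missing_count)
print("median:", median)
print("outliers:", outliers)
```

A flagged value such as a humidity of 250 percent is a good question for a domain expert: it may be a sensor fault, a unit mix-up or a data entry error.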
Statistical modelling and machine learning
Modelling is one way to reach a goal in data analysis, not a
goal in itself.
Pick a suitable method based on your goals - not the other
way around!
Start with simple methods, add complexity gradually, if
needed.
You can get pretty far with linear or logistic regression.
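To show how little is needed for a simple baseline, here is a minimal sketch of ordinary least squares with one predictor, written without any libraries. The data is invented and exactly linear for clarity, and `fit_line` is a hypothetical helper name. Real data would be noisy and would call for proper model checking.

```python
def fit_line(xs, ys):
    """Fit y = intercept + slope * x by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and of cross-deviations of x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data: temperature (deg C) and relative humidity (%).
temperature = [0.0, 5.0, 10.0, 15.0, 20.0]
humidity = [90.0, 85.0, 80.0, 75.0, 70.0]

slope, intercept = fit_line(temperature, humidity)
predicted = intercept + slope * 12.0  # prediction for 12 deg C

print("slope:", slope)
print("intercept:", intercept)
print("predicted humidity at 12 deg C:", predicted)
```

A baseline like this gives a reference point: a more complex model is only worth its cost if it clearly beats it.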
End result
The end result of a data science project can be many things, such as
- A single figure describing the association of two variables
- A comprehensive report for a client or business department
- A machine learning product ready to be deployed into production
In most projects, it is important to write some kind of report or documentation of what has been done.
Learning to communicate effectively is a very important skill for data scientists. This includes producing clear visual
summaries of the main results, and using generally understandable language.
Deploying Data Products
Data science is useful in creating insights, increasing understanding, and informing decision making.
The biggest impact, however, comes from intelligent systems that operate automatically and continuously, such as
recommendation engines. This typically means that data science products are deployed as part of larger software
systems.
Deploying your first data products can be frightening for data scientists with no programming background.
Get support from software developers or data engineers!
Data Science Tools – Some tips
Make everything reproducible and use version control!
Tidyverse is a family of R packages that cover most of the
data science workflow.
Many similar tools exist for Python!
Tidyverse: https://www.tidyverse.org/
R for Data Science: https://r4ds.had.co.nz/
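One small, concrete piece of reproducibility is controlling randomness. The sketch below assumes an analysis step that draws a random sample and shows that a fixed seed makes reruns give the same sample. The helper `sample_rows` and the seed value are hypothetical.

```python
import random


def sample_rows(rows, k, seed):
    """Draw k rows at random; the same seed always gives the same sample."""
    rng = random.Random(seed)  # local generator, leaves global state untouched
    return rng.sample(rows, k)


rows = list(range(100))  # stand-in for the rows of a data set

first_run = sample_rows(rows, k=5, seed=2019)
second_run = sample_rows(rows, k=5, seed=2019)

print(first_run == second_run)
```

Keeping the seed in the version-controlled script, not in someone's memory, is what lets others rerun the analysis and get the same result.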
How to learn the Data Science Workflow?
Data Science is an art – you only learn it by doing!
• Pick challenging courses with large and realistic projects
• Start a hobby project, for example using some open data set, and share the code and results (e.g. GitHub)
• Participate in competitions and challenges
  - Tidytuesdays: https://github.com/rfordatascience/tidytuesday
  - Kaggle: https://www.kaggle.com/
Learning a proper Data Science Workflow will help you produce reliable results in a reasonable time.
This will benefit your career regardless of whether you work in academia, industry, or somewhere else.
Data Science and Business
Agile Data Science
Any sufficiently interesting problem has more than one ”correct” answer.
You can spend anything between 2 hours and a PhD on a single problem. Try to recognize how much effort each problem
is worth.
You can often get a satisfactory solution with 20% of the effort compared to a ”perfect” solution.
Learn to fail fast. Sometimes data science solutions do not work, and it’s good to realise this as soon as possible.
Adopting agile software development practices helps!
Agile Data Science with R: https://edwinth.github.io/ADSwR/index.html
Data Science in a Team
No single person can master every possible data science
skill.
Data scientists work effectively in teams, with
complementary skill sets and backgrounds.
When looking for your first job as a data scientist, look for
places where there are senior people who can help you
learn and grow as a data scientist.
Data Science Use Cases
Data Science as part of a Product or Project
Data Science is typically only a small part of the larger
Product or Project.
It is important to know what the overall goal is, and to
adjust data science development towards that.
You need to collaborate with other people, such as
designers, software developers, marketing and sales
people, customers, etc.
Some takeaway notes
Data Science is an art – you only learn it by doing.
Find ways to continuously learn and practice your skills, e.g. with
hobby projects or competitions.
Finding a problem worth solving is hard.
There is never a single correct solution.
Curiosity and critical thinking are invaluable!
Thank you!
Juuso Parkkinen, PhD - @ouzor
Head of Data Science, Nightingale Health - @NgaleHealth
www.nightingalehealth.com
