The document discusses the tools and perspectives of a machine learning engineer, providing examples of projects involving machine learning microservices, recommendations, and industrial analytics, and covering topics like experiment tracking, algorithm libraries, model serving, and AB testing from the perspective of an ML consultant. It also examines the job market and skills needed for machine learning roles in industry.
Machine Learning Engineer Perspective on Industry Trends and Job Market
1. niklas.haas@codecentric.de | Machine Learning Engineer & Google Cloud Data Engineer
Machine Learning is more than Algorithms
A Consultant's Perspective on the Industry and the Job Market
Agenda
Introduction
Example Projects
Tooling
Job Market Perspectives
Key Takeaways
Feel free to ask questions right away!
It is then my job to keep an eye on the clock.
$ whoami
Niklas Haas
Living in Dusseldorf, NRW
Data Scientist turned ML Engineer
Graduated in 2017 in Industrial Engineering and Management from Karlsruhe Institute of Technology (KIT).
Curriculum focus on Statistics, Operations Research, and Information Technology
With codecentric AG in Solingen since 2018
codecentric has a 4+1 model: we work 4 days per week for the customer and have 1 day per week for learning and for developing ourselves and/or the company (though administrative tasks also fall into this time)
Project history
Machine Learning Microservice for the Industry 4.0 platform
Create a microservice for early detection of a serious production failure and integrate it into the existing Industry 4.0 platform.
Customer Lifecycle Recommendations
Scaling algorithms for the detection of customer churn. Migrating an on-prem solution for personalized customer communication to GCP and improving the product in close collaboration with its users.
Recommendations in Wholesale
Set up a scalable ML system for personalized product recommendations on the Google Cloud Platform (GCP) following MLOps principles. Evaluating ML models using AB testing.
Industrial Analytics in Renewable Energy
Unsupervised pattern recognition on wind turbine data. Using automated feature engineering and Bayesian clustering to build a transparent and verifiable decision support system (Decision Tree). Results are presented in an interactive dashboard.
How relevant is the Data “Scientist” for the industry?
OR
How much “deep” learning do you need?
In real-life ML systems, there are many things to consider
Image source: https://cloud.google.com/architecture/mlops-continuous-delivery-and-automation-pipelines-in-machine-learning
That might be what a Data Scientist expects to work on.
However, this alone does not add value to the business, as it is unlikely to get beyond the “Proof of Concept” stage.
As of 2020:
“Gartner research shows only 53% of projects make it from artificial intelligence (AI) prototypes to production.”
https://www.gartner.com/en/newsroom/press-releases/2020-10-19-gartner-identifies-the-top-strategic-technology-trends-for-2021#:~:text=Gartner%20research%20shows%20only%2053,a%20production%2Dgrade%20AI%20pipeline.
To cover all aspects of an ML system, in our experience, the ratio of Data Scientists to {Data, ML, DevOps} Engineers should be around 1:3, at least in the project ramp-up phase.
Our projects in the Gartner “AI” Hype Cycle
Image source: https://www.gartner.com/en/articles/the-4-trends-that-prevail-on-the-gartner-hype-cycle-for-ai-2021
Our projects represented in “State of AI” 2021 (2nd pandemic year)
Image source: https://www.mckinsey.com/business-functions/quantumblack/our-insights/global-survey-the-state-of-ai-in-2021
Machine Learning Microservice for the Industry 4.0 platform
Reference: https://www.youtube.com/watch?v=WywQm0wHLvA
Reference: https://www.kampf.de/de/digitale-produkte/theadvanced/
Recommendations in Wholesale
References:
https://www.codecentric.de/success-stories/metro-digital
https://cloud.google.com/bigquery-ml/docs/bigqueryml-mf-implicit-tutorial
https://developers.google.com/machine-learning/recommendation/collaborative/matrix
https://cloud.google.com/retail/recommendations-ai/docs/create-models
Using matrix factorization with implicit feedback (the customer did not give an explicit rating but gave implicit feedback).
Use Cases:
- Ranked Promotions
- “Others you may like” (not “Frequently bought together”: that is better calculated using basket mining, because it has a different objective: click-through rate vs. conversion rate).
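To make the idea of matrix factorization with implicit feedback more concrete, here is a minimal sketch in plain Python. It is not the project's implementation and not necessarily what the linked services do internally: it simply fits user and item factors with stochastic gradient descent on a weighted squared loss, where a purchase counts as target 1 with a higher weight and "no purchase" counts as target 0 with a lower weight. The toy purchase data, the function names and all hyperparameters are assumptions made for illustration.

```python
import random


def train_implicit_mf(interactions, n_items, n_factors=2, epochs=1000,
                      lr=0.02, reg=0.01, pos_weight=4.0, seed=0):
    """Fit user/item factors on implicit feedback with weighted SGD.

    interactions: list of sets, interactions[u] = items user u interacted with.
    Observed cells get target 1 and weight pos_weight, all others target 0
    and weight 1 (a simple confidence weighting, chosen for illustration).
    """
    rng = random.Random(seed)
    n_users = len(interactions)
    P = [[rng.uniform(-0.1, 0.1) for _ in range(n_factors)] for _ in range(n_users)]
    Q = [[rng.uniform(-0.1, 0.1) for _ in range(n_factors)] for _ in range(n_items)]
    cells = [(u, i) for u in range(n_users) for i in range(n_items)]
    for _ in range(epochs):
        rng.shuffle(cells)
        for u, i in cells:
            target = 1.0 if i in interactions[u] else 0.0
            weight = pos_weight if target else 1.0
            pred = sum(pf * qf for pf, qf in zip(P[u], Q[i]))
            err = weight * (target - pred)
            for f in range(n_factors):
                pu, qi = P[u][f], Q[i][f]
                P[u][f] += lr * (err * qi - reg * pu)
                Q[i][f] += lr * (err * pu - reg * qi)
    return P, Q


def recommend(user, interactions, P, Q, k=1):
    """Rank the items the user has not interacted with by predicted score."""
    scores = {
        i: sum(pf * qf for pf, qf in zip(P[user], Q[i]))
        for i in range(len(Q))
        if i not in interactions[user]
    }
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Hypothetical purchases: users 0-2 buy from items 0-2, users 3-5 from items 3-5.
purchases = [{0, 1}, {0, 1, 2}, {1, 2}, {3, 4}, {3, 4, 5}, {4, 5}]
P, Q = train_implicit_mf(purchases, n_items=6)
top_for_user_0 = recommend(0, purchases, P, Q)
top_for_user_5 = recommend(5, purchases, P, Q)
```

In this toy data, user 0 has bought items 0 and 1, so the "Others you may like" candidate is the remaining item from the same group. A production system would typically use a dedicated solver and far more factors, which is exactly what the managed services referenced above provide.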
Customer Lifecycle Recommendations
References:
https://dzone.com/articles/xgboost-a-deep-dive-into-boosting (image taken from here)
https://www.codecentric.de/success-stories/metro-digital
https://towardsdatascience.com/churn-prediction-3a4a36c2129a
https://github.com/dmlc/xgboost
https://github.com/slundberg/shap
Target variable: will the customer buy something in the next 3 months?
Boosting = sequential optimization of Decision Trees
SHAP for model explanation
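The phrase "sequential optimization of Decision Trees" can be illustrated with a small sketch. The code below is a deliberately simplified gradient boosting loop in plain Python: each round fits a one-split tree (a stump) to the residuals of the ensemble so far. It uses squared loss on a 0/1 target, which is an assumption made to keep the example short; XGBoost itself uses regularized trees and, for classification, a logistic loss. The features (days since last purchase, orders in the last year) and the data are hypothetical.

```python
def mse(ys, preds):
    """Mean squared error between targets and predictions."""
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)


def fit_stump(xs, residuals):
    """Find the single feature/threshold split that minimizes squared error."""
    best = None
    for f in range(len(xs[0])):
        for threshold in sorted({x[f] for x in xs}):
            left = [r for x, r in zip(xs, residuals) if x[f] <= threshold]
            right = [r for x, r in zip(xs, residuals) if x[f] > threshold]
            if not left or not right:
                continue  # not a real split
            lmean = sum(left) / len(left)
            rmean = sum(right) / len(right)
            sse = (sum((r - lmean) ** 2 for r in left)
                   + sum((r - rmean) ** 2 for r in right))
            if best is None or sse < best[0]:
                best = (sse, f, threshold, lmean, rmean)
    _, f, threshold, lmean, rmean = best
    return lambda x: lmean if x[f] <= threshold else rmean


def boost(xs, ys, n_rounds=20, learning_rate=0.5):
    """Sequentially add stumps, each one fitted to the current residuals."""
    base = sum(ys) / len(ys)
    preds = [base] * len(ys)
    stumps = []
    history = [mse(ys, preds)]  # training error after each round
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump(x) for x, p in zip(xs, preds)]
        history.append(mse(ys, preds))

    def model(x):
        return base + learning_rate * sum(s(x) for s in stumps)

    return model, history


# Hypothetical customers: (days since last purchase, orders in the last year).
xs = [(5, 12), (10, 8), (40, 3), (90, 1), (200, 0), (15, 1), (120, 6), (300, 2)]
# Target: did the customer buy something in the following 3 months?
ys = [1, 1, 1, 0, 0, 1, 0, 0]

model, history = boost(xs, ys)
```

Because every stump is fitted to what the previous stumps got wrong, the training error in `history` never increases from one round to the next. Calling `model((20, 2))` returns a score above 0.5 for a recently active customer, and `model((150, 5))` a score below 0.5.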
Industrial Analytics in Renewable Energy
Semi-supervised pattern recognition on wind turbine data
1. Infer patterns empirically from data
2. Classify/interpret data with domain knowledge and assign classes
3. Build higher-level analysis and models on these classes
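The three steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it uses a plain 1-D k-means instead of the Bayesian clustering mentioned for the project, and the power readings, the class names and the helper `kmeans_1d` are hypothetical.

```python
from collections import Counter


def kmeans_1d(values, k, n_iter=20):
    """Plain 1-D k-means; centroids start evenly spread between min and max."""
    lo, hi = min(values), max(values)
    centroids = [lo + (hi - lo) * j / (k - 1) for j in range(k)]

    def assign():
        return [min(range(k), key=lambda j: abs(v - centroids[j])) for v in values]

    for _ in range(n_iter):
        assignment = assign()
        for j in range(k):
            members = [v for v, a in zip(values, assignment) if a == j]
            if members:
                centroids[j] = sum(members) / len(members)
    return assign(), centroids


# Step 1: infer patterns empirically (hypothetical turbine power readings in kW).
power_kw = [5, 12, 8, 1480, 1510, 1495, 760, 790]
clusters, centroids = kmeans_1d(power_kw, k=3)

# Step 2: a domain expert interprets the clusters and assigns class names.
class_names = {0: "standstill", 1: "partial load", 2: "rated power"}
classes = [class_names[c] for c in clusters]

# Step 3: build higher-level analysis on the classes, e.g. time share per class.
counts = Counter(classes)
share = {name: n / len(classes) for name, n in counts.items()}
```

The point of the sketch is the division of labour: the algorithm proposes groups, the domain expert gives them meaning, and only then are further models or dashboards built on top of the named classes.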
NLP: Automating document processing - Sherloq (cc project)
There is no shortage of tooling - Linux Foundation Data & AI Landscape
Image source: https://landscape.lfai.foundation/
The amount of available open source tooling is ridiculous. It is like the Wild West.
Innovation is everywhere.
Strategy that works for us:
Be aware of your own FOMO (fear of missing out) and ignore it.
Get proficient in the tools you use.
Ask colleagues and the community what they use, to get inspired.
Regularly try out new tools.
If a tool gives you a productivity boost, adopt it.
Do NOT choose tooling simply because it is new/cool/promising (sometimes referred to as “tech porn”).
Also: Do NOT replace one tool with another unless it adds real value.
Before we start: Python vs R (caution: my opinion)
Area of application
- Python: General purpose, with performance-optimized packages for statistical computing/ML
- R: Statistical computing
Contributors
- Python: Professional Software Engineers from all over the world, also from Google et al.
- R: Academia
(Unit) testing & linting capabilities
- Python: Excellent and prominent (pytest, flake8, black, yapf)
- R: Exist, but I never saw them used in the wild
Development environments
- Python: Depending on the use: VSCode, PyCharm, Spyder (Matlab/RStudio clone), Jupyter
- R: I saw basically only RStudio in the wild
API development (for example for model serving)
- Python: Many alternatives: Flask, FastAPI, Django
- R: plumber? (I never used it)
Dashboarding & data apps
- Python: Plotly / Dash, Streamlit
- R: Shiny
Documentation and Q&A
- Python: websites, GitHub, StackOverflow
- R: websites, GitHub, CRAN (awful, looks like it is from the 90s), a little bit on StackOverflow
Verdict (my opinion!)
- Python: No doubt the relevant language for the industry. Learning opportunities are abundant. Using Python will improve your coding skills. A lot of 3rd-party software has APIs in Python.
- R: No demand for R in the industry except in specialized areas. Mainly academia. Using R will almost surely not improve your coding skills, because that is not a focus of the community.
Without an “MLOps mindset”, the situation might look like this
[Diagram labels: Data; ...; notebooks with data storytelling, which can become very long; model artifacts, processed data]
Experimentation and Data Storytelling: JupyterLab
For experimentation and data storytelling
Source: https://jupyter.org/
Source: https://jupyterlab.readthedocs.io/en/stable/getting_started/overview.html
Reference: https://github.com/jupyterlab/jupyterlab
Interactive Data Apps: Streamlit
For building data apps (comparable to R Shiny)
Source: https://streamlit.io/
Source: https://share.streamlit.io/data-science-at-swast/handover_poc/main/handover.py
Source: https://github.com/streamlit/streamlit
Experiment Tracking: MLflow
For experiment and metadata tracking. Integrates nicely with all popular ML frameworks.
Source: https://mlflow.org/
Source: https://towardsdatascience.com/managing-your-machine-learning-experiments-with-mlflow-1cd6ee21996e
SQL (+ DBT)
You will need it. Everywhere.
Some things don't need a fancy ML model; they can be done using SQL :-)
DBT is a great tool! You can write data tests and auto-generate data lineage.
Source: https://console.cloud.google.com/bigquery (BigQuery)
Source: https://www.postgresql.org/
Source: https://www.getdbt.com/
Source: https://www.datatask.io/blog/workflow-dbt-materialisations-documentation/
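As a small illustration of both points, the sketch below uses Python's built-in sqlite3 module as a stand-in for BigQuery or Postgres. The first query flags inactive customers with a plain SQL rule instead of a model. The second mimics the idea behind a data test: a query that selects the rows violating an expectation, so the test passes when it returns no rows. The table, the column names, the reference date and the 90-day rule are all assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (
        order_id INTEGER, customer_id TEXT, order_date TEXT, amount REAL
    );
    INSERT INTO orders VALUES
        (1, 'c1', '2022-01-10', 120.0),
        (2, 'c1', '2022-05-02',  80.0),
        (3, 'c2', '2021-11-20',  40.0),
        (4, 'c3', '2022-04-28',  15.5);
""")

# A rule instead of a model: customers without an order in the 90 days
# before a (hypothetical) reference date of 2022-06-01.
inactive_customers = conn.execute("""
    SELECT customer_id, MAX(order_date) AS last_order
    FROM orders
    GROUP BY customer_id
    HAVING MAX(order_date) < date('2022-06-01', '-90 days')
    ORDER BY customer_id
""").fetchall()

# Data test in the "select the failing rows" style: the test passes
# when the query returns nothing.
failing_rows = conn.execute("""
    SELECT order_id FROM orders
    WHERE order_id IS NULL OR amount <= 0
    UNION ALL
    SELECT order_id FROM orders
    GROUP BY order_id
    HAVING COUNT(*) > 1
""").fetchall()

print(inactive_customers)
print("data test passed" if not failing_rows else failing_rows)
```

With the toy rows above, only customer c2 is flagged as inactive, and the data test passes because every order has a unique, non-null id and a positive amount.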
Dashboarding and Apps for Stakeholders: Metabase
Nice tool to explore data from databases with dashboards.
Share dashboards and data stories with stakeholders.
An authentication mechanism is included!
Start it in a Docker container:
docker run -d -p 3000:3000 --name metabase metabase/metabase
Source: https://github.com/metabase/metabase
Source: https://www.metabase.com/start/oss/
Algorithms: Sklearn / xgboost / TensorFlow
Good old sklearn is the industry standard.
Xgboost often delivers the best results.
For image processing, you might use fine-tuned TensorFlow models.
Source: https://en.wikipedia.org/wiki/Scikit-learn
Source: https://scikit-learn.org/stable/
Source: https://xgboost.readthedocs.io/en/stable/
Source: https://github.com/tensorflow/tensorflow
Pipelining: Kedro / Dagster / Kubeflow
Split your ML tasks into logical and self-contained steps and combine them in pipelines.
Use Kedro / Dagster for lightweight on-machine tasks, and Kubeflow for heavyweight scaling on Kubernetes (k8s).
Source: https://github.com/kedro-org/kedro
Source: https://github.com/dagster-io/dagster
Source: https://www.kubeflow.org/
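What "self-contained steps combined in pipelines" means can be shown without any framework. The sketch below is plain Python and deliberately does not use the Kedro, Dagster or Kubeflow APIs: each step is a function with named inputs and one named output, and a tiny runner passes data between steps through a dictionary. The names `Step`, `run_pipeline` and the three toy steps are hypothetical.

```python
from typing import Callable, Dict, List, NamedTuple


class Step(NamedTuple):
    """One self-contained pipeline step: a function plus named inputs/output."""
    func: Callable
    inputs: List[str]
    output: str


def run_pipeline(steps: List[Step], catalog: Dict[str, object]) -> Dict[str, object]:
    """Run the steps in order, reading inputs from and writing outputs to a dict."""
    data = dict(catalog)
    for step in steps:
        args = [data[name] for name in step.inputs]
        data[step.output] = step.func(*args)
    return data


# Hypothetical steps of a tiny "training" pipeline.
def clean(raw):
    """Drop missing values."""
    return [x for x in raw if x is not None]


def fit_mean(rows):
    """'Train' the simplest possible model: predict the mean."""
    return sum(rows) / len(rows)


def evaluate(model, rows):
    """Mean absolute error of the constant prediction."""
    return sum(abs(x - model) for x in rows) / len(rows)


pipeline = [
    Step(clean, ["raw"], "clean_rows"),
    Step(fit_mean, ["clean_rows"], "model"),
    Step(evaluate, ["model", "clean_rows"], "mae"),
]
result = run_pipeline(pipeline, {"raw": [1.0, None, 2.0, 3.0]})
```

Because every step only depends on its declared inputs, steps can be tested, cached or replaced individually. The pipeline frameworks named above add what this sketch lacks, such as dependency resolution, persistence, scheduling and distributed execution.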
Model Serving: Flask / FastAPI / BentoML
Flask / FastAPI: to build from scratch / to add some business logic
BentoML: to simply serve business models
Microservices are nice, but splitting logic into different services introduces latency overhead.
Source: https://flask.palletsprojects.com/en/2.1.x/
Source: https://fastapi.tiangolo.com/
Source: https://github.com/bentoml/BentoML
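To show what "building from scratch" involves, here is a dependency-free sketch of a prediction endpoint. It is written as a bare WSGI callable using only the standard library, so it is not Flask or FastAPI code; Flask applications are WSGI applications too, and a framework mainly spares you the request parsing done by hand here. The `/predict` route, the `orders_last_year` feature and the scoring rule are hypothetical stand-ins for a real model.

```python
import io
import json


def predict(features):
    """Hypothetical stand-in for a trained model."""
    return {"score": min(1.0, features["orders_last_year"] / 10)}


def app(environ, start_response):
    """Minimal WSGI application exposing POST /predict with a JSON body."""
    if environ["REQUEST_METHOD"] != "POST" or environ["PATH_INFO"] != "/predict":
        start_response("404 Not Found", [("Content-Type", "application/json")])
        return [b'{"error": "not found"}']
    length = int(environ.get("CONTENT_LENGTH") or 0)
    payload = json.loads(environ["wsgi.input"].read(length))
    body = json.dumps(predict(payload)).encode("utf-8")
    start_response("200 OK", [("Content-Type", "application/json"),
                              ("Content-Length", str(len(body)))])
    return [body]


def call(wsgi_app, method, path, payload=None):
    """Call the app in-process with a fake request (handy for unit tests)."""
    raw = json.dumps(payload).encode("utf-8") if payload is not None else b""
    environ = {
        "REQUEST_METHOD": method,
        "PATH_INFO": path,
        "CONTENT_LENGTH": str(len(raw)),
        "wsgi.input": io.BytesIO(raw),
    }
    captured = {}

    def start_response(status, headers):
        captured["status"] = status

    body = b"".join(wsgi_app(environ, start_response))
    return captured["status"], json.loads(body)


status, response = call(app, "POST", "/predict", {"orders_last_year": 3})
```

To try it over HTTP locally, the standard library's `wsgiref.simple_server.make_server("", 8000, app).serve_forever()` is enough. Keeping model call and business logic in one such service avoids the extra network hop that a split into several microservices would add.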
Basics: Git, Containerization, Shell (bash, zsh), poetry
You will need these basics in every ML project (and in every software project).
Source: https://git-scm.com/
Source: https://alexec.github.io/slides/intro-to-docker.html#/
Source: https://de.wikipedia.org/wiki/Bash_(Shell)
Source: https://de.wikipedia.org/wiki/Z_shell
Source: https://github.com/python-poetry
AB testing: “by hand”
So far we have only done it “by hand”, i.e. we use SQL (BigQuery) + dashboarding.
We haven't found a very nice open source solution yet.
AB distribution image source: https://pubs.rsc.org/image/article/2018/AN/c8an01303a/c8an01303a-f3_hi-res.gif
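Once SQL has produced the counts per variant, the "by hand" evaluation often comes down to a few lines of arithmetic. The sketch below shows one common choice, a two-sided two-proportion z-test, using only the standard library. Whether this is the test used in the projects is not stated in the slides, and the conversion counts are made up for illustration.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for the difference between two conversion rates."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / std_err
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical result of a SQL aggregation: conversions and users per variant.
z, p_value = two_proportion_z_test(conv_a=100, n_a=1000, conv_b=130, n_b=1000)
print(f"z = {z:.2f}, p = {p_value:.3f}")
```

With these numbers, variant B converts at 13% versus 10% for A, which gives z of about 2.1 and a p-value of about 0.036. The statistical test is the easy part; assigning users to variants consistently and logging the outcomes is where most of the manual work goes.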
NLP processing: spaCy
A pretty good industry standard for NLP.
Source: https://spacy.io/
Comparison of Public Cloud Providers: AWS vs Google vs Microsoft
Source: https://aws.amazon.com/resources/analyst-reports/gartner-mq-cips-2021/
The “public cloud” market is dominated by 3 major providers from the US:
Amazon Web Services / AWS: https://aws.amazon.com/
Microsoft Azure: https://azure.microsoft.com/
Google Cloud Platform / GCP: https://cloud.google.com/?hl=de
Source: https://www.itprotoday.com/iaas-and-paas/aws-continues-dominance-over-azure-google-cloud-strong-growth
AI / Data offerings by public ☁ providers and other vendors
Personal recommendation: Go to a public cloud (AWS, Azure, GCP) or Databricks (cloud-agnostic) and use the open source tooling suite they deploy as a service.
In my opinion, no vendor solution has proven to be superior so far. Often, the vendors just rebrand existing OSS solutions (JupyterLab!) with tweaks.
ML is getting commoditized very fast! The entry barrier is shrinking.
Source: https://cloud.google.com/bigquery-ml/docs/reference/standard-sql/bigqueryml-syntax-create-dnn-models
BigQuery ML lets you train models on tabular data directly in SQL! No Python + Pandas + Notebooks needed.
This can enable faster model iteration.
AutoML tools are getting better very fast!
Source: https://cloud.google.com/vision/docs/features-list
Source: https://codelabs.developers.google.com/vertex_custom_training_prediction#1
Source: https://console.cloud.google.com/vertex-ai/datasets/create
AWS, Azure, and GCP offer similar services, though these services can differ in quality.
AutoML tools are getting better very fast! - Example: Vertex AI by Google Cloud
Job Roles
Source: modified from Chandra Reddy, https://medium.com/@lchandratejareddy/a-data-analyst-vs-a-data-scientist-vs-a-data-engineer-91b1f46d5995
Roles: Data Analyst, Data Scientist, Machine Learning Engineer, Data Engineer, Backend Engineer, DevOps Engineer
The jobs generate different output and use different tools.
Job Roles - My Path
[Figure: path across the roles Data Analyst, Data Scientist, Machine Learning Engineer, Data Engineer, Backend Engineer and DevOps Engineer, with an open “?”]
Job Roles - Skills
Source: Chandra Reddy, https://medium.com/@lchandratejareddy/a-data-analyst-vs-a-data-scientist-vs-a-data-engineer-91b1f46d5995
[Figure: skills per role; recoverable labels: Data Engineer, Backend Engineer / SWE, DevOps Engineer]
Job Roles - Data Analyst - Task & Tooling
Task: Create one-off or recurring analyses as foundation for business decisions
Tooling: Excel / Power Point / Power BI / Tableau / SQL / Python / R Scripts / …
Job Roles - Data Scientist - Task & Tooling
Task: Identify business problems that can be solved with data science + implement solutions
Tooling: Python / sklearn / pandas / Streamlit / Notebooks / SQL / AB testing / Power Point …
Job Roles - ML Engineer - Task & Tooling
Task: Scale data solutions and increase ML team productivity following MLOps principles
Tooling: Pipeline tools / MLflow / Kubernetes / Databases / Python / YAML / …
Job Roles - Data Engineer - Task & Tooling
Task: Design data models and implement processes that are essential to run the business
Tooling: SQL / Postgres / BigQuery / Message Queues / Serverless Functions / DBT / Kafka / …
Job Roles - Backend Engineer - Task & Tooling
Task: Engineer scalable backend systems that implement the business logic
Tooling: Java / Java Spring / Golang / Microservices / Docker / Authentication / API Management / …
Job Roles - DevOps Engineer - Task & Tooling
Task: Promote the DevOps culture of releasing software frequently and establishing feedback loops everywhere
Tooling: Continuous Integration (CI) / Continuous Delivery (CD) / Containers / Cloud / APM / …
Job Roles Google Trends - Germany
Source: https://trends.google.de/trends/explore?geo=DE&q=data%20analyst,data%20scientist,machine%20learning%20engineer,data%20engineer
Job Roles Google Trends - United States
Source: https://trends.google.de/trends/explore?geo=US&q=data%20analyst,data%20scientist,machine%20learning%20engineer,data%20engineer
LinkedIn Job Offerings
Source: https://de.linkedin.com/jobs
Notice that the numbers of results for Data Analyst and Data Scientist are similar?
This is because many companies advertise the same job as both “Data Analyst” and “Data Scientist”.
Be careful: When applying for a Data Scientist job, you might actually end up with a Data Analyst job.
However, there is also a considerable overlap between “Data Scientist” and “Machine Learning Engineer”, though not as much as between “Data Analyst” and “Data Scientist”.
The “Data Engineer” role, in contrast, is pretty well defined from years of experience in classical Business Intelligence (BI) environments, so there is not much overlap with the other roles.
Salary DE - Data Analyst
Source: https://www.kununu.com/de/gehalt/datenanalyst-982
Salary DE - Data Scientist + ML Engineer
Source: https://www.kununu.com/de/gehalt/data-scientist-973
Salary DE - Data(base) Engineer
Source: https://www.kununu.com/de/gehalt/datenbankentwickler-985
Resources for self-development
For beginners:
- Udacity (high quality content!):
  - https://www.udacity.com/course/data-scientist-nanodegree--nd025
  - https://www.udacity.com/course/machine-learning-dev-ops-engineer-nanodegree--nd0821
  - https://www.udacity.com/course/intro-to-relational-databases--ud197
- To get to know GCP:
  - https://www.cloudskillsboost.google/ with free hands-on labs
  - https://developers.google.com/machine-learning/crash-course/ml-intro (from developers)
  - https://www.coursera.org/learn/gcp-fundamentals
Advanced:
- Do certifications (very advanced, needs prior knowledge in GCP):
  - https://cloud.google.com/certification/data-engineer
  - https://cloud.google.com/certification/machine-learning-engineer
Key Takeaways
The job roles in the data space are very different; they do different things.
You should decide what mixture of Communication/Math/Programming/Business you want.
A “Data Scientist” job can turn out to be a Data Analyst job, so be careful!
MLOps is currently the sort-of industry standard for structuring ML projects.
The tooling landscape is abundant and innovating all the time; it is impossible to keep up.
Instead, develop your own mechanism to deal with this complexity.
Thanks for having me!
Special thanks to Maxx Richard Rahman for reaching out to me!
Feel free to add me on LinkedIn: https://www.linkedin.com/in/niklas-haas/
I am open to feedback about the content / slides / style of presentation / presentation performance! Right now, on LinkedIn, or via email.
I have time for more questions :-)