Data science is a multidisciplinary field that combines scientific methods, algorithms, and systems to extract knowledge and insights from structured and unstructured data. It involves analyzing, interpreting, and deriving actionable information from large and complex datasets to support decision-making and solve problems in various domains.
Key components of data science include:
Data Collection and Preparation: Data scientists gather and collect data from various sources, which may include databases, websites, sensors, social media, or other digital platforms. They clean, transform, and preprocess the data to ensure its quality and suitability for analysis.
Data Exploration and Visualization: Data scientists explore and visualize the data using statistical techniques and visualization tools. They look for patterns, trends, and relationships within the data to gain a deeper understanding of the underlying insights and potential correlations.
Machine Learning and Predictive Modeling: Data scientists apply machine learning algorithms and predictive modeling techniques to build models that can make predictions or classifications based on the available data. This involves training models on historical data and evaluating their performance on new or unseen data.
Statistical Analysis: Statistical analysis is a fundamental aspect of data science. Data scientists use statistical methods to analyze data, test hypotheses, identify significant variables, and quantify uncertainties to make informed decisions.
Data Interpretation and Communication: Data scientists interpret the results of their analysis and communicate their findings to stakeholders in a clear and meaningful way. They use data visualization techniques, storytelling, and data-driven insights to convey complex information and facilitate decision-making.
Domain Knowledge: Data scientists often work in specific domains or industries and require domain knowledge to understand the context and interpret the results effectively. This allows them to identify relevant variables, apply appropriate techniques, and generate actionable insights.
Data science has applications across various sectors, including finance, healthcare, marketing, retail, telecommunications, and more. It helps organizations gain a competitive advantage, optimize processes, identify trends, improve customer experiences, and drive data-informed decision-making.
To work in data science, proficiency in programming languages (such as Python or R), statistical knowledge, data manipulation skills, and experience with machine learning algorithms are typically required. Data scientists also need critical thinking, problem-solving abilities, and effective communication skills to effectively analyze data and communicate insights to both technical and non-technical stakeholders.
1. Chapter 01
Introduction
Dr. Steffen Herbold
herbold@cs.uni-goettingen.de
Introduction to Data Science
https://sherbold.github.io/intro-to-data-science
2. Outline
• Introduction to Big Data
• Data Science and Business Intelligence
• The Skillset of Data Scientists
• Summary
Introduction to Data Science
https://sherbold.github.io/intro-to-data-science
3. What is „Big Data“?!?
Introduction to Data Science
https://sherbold.github.io/intro-to-data-science
Is this really
about size?
4. Naive Definition
• Naive definition:
• Big data only depends on the data size
• 1 Gigabyte? 1 Terabyte? 1 Petabyte?
• Naive interpretation misses important aspects
• Time:
• Analyzing 1 Gigabyte of data per day is different from analyzing 1 Gigabyte of
data per second
• Diversity:
• Analyzing spread sheets with numeric data is different from analyzing Web
pages that contain a mixture of text and images
• Distribution:
• Analyzing data from a single source is different from analyzing data from multiple
sources
Introduction to Data Science
https://sherbold.github.io/intro-to-data-science
5. Definition of Big Data
• Following Gartner‘s IT Glossary:
• Big data is high-volume, high-velocity and/or high-variety information
assets that demand cost-effective, innovative forms of information
processing that enable enhanced insight, decision making, and
process automation.
• The three Vs
• Volume
• Velocity
• Variety
Introduction to Data Science
https://sherbold.github.io/intro-to-data-science
Some people actually use 10 Vs to define
big data!
• Variability
• Veracity
• Validity
• Vulnerability
• Volatility
• Visualization
• Value
7. The 3 Vs: Velocity
Introduction to Data Science
https://sherbold.github.io/intro-to-data-science
• Speed at which new data is created
• Speed at which data must be processed and analyzed
• Often close to real-time
8. The 3 Vs: Variety
• Diversity in data types and data sources
Introduction to Data Science
https://sherbold.github.io/intro-to-data-science
Structured
Semi-
Structured
Quasi-Structured
Unstructured
• Data with defined types and structure
• Example: comma separated values
• Textual data with parseable pattern
• Example: XML files with schema
• Textual data with erratic formats that can be
formated with effort
• Example: Clickstream data
• Data that has no inherent structure, often
with multiple formats
• Example: Web site, videos
9. Examples for data types
Introduction to Data Science
https://sherbold.github.io/intro-to-data-science
Structured Quasi-Structured
Semi-Structured Unstructured
10. Outline
• Introduction to Big Data
• Data Science and Business Intelligence
• The Skillset of Data Scientists
• Summary
Introduction to Data Science
https://sherbold.github.io/intro-to-data-science
11. Defining Data Science
• Unfortunately, there is no clear definition (yet?)
• Goal is the extraction of knowledge from data
• Combination of techniques from different disciplines
• Scientific principles guide the data analysis
Introduction to Data Science
https://sherbold.github.io/intro-to-data-science
12. What is „Data Science“?!?
Introduction to Data Science
https://sherbold.github.io/intro-to-data-science
Tools? Big Data?
Machine Learning?
13. Mathematical Aspects
Introduction to Data Science
https://sherbold.github.io/intro-to-data-science
Computational
Geometry
Optimization Stochastics
Scientific
Computing
Machine
Learning
14. Computer Science Aspects
Introduction to Data Science
https://sherbold.github.io/intro-to-data-science
Data Structures and
Algorithms
Databases Distributed Computing
Software Engineering Artificial Intelligence Machine Learning
15. Statistical Aspects
Introduction to Data Science
https://sherbold.github.io/intro-to-data-science
Linear Models Statistical Tests Inference
Time Series Analysis Machine Learning
16. Applications
Introduction to Data Science
https://sherbold.github.io/intro-to-data-science
Intelligent Systems Robotics Marketing
Medicine Autonomous Driving Social Networks
17. Data Science vs. Business Intelligence
• Business Intelligence (Gartner IT Glossary)
• […] best practices that enable access to and analysis of information to
improve and optimize decisions and performance.
Introduction to Data Science
https://sherbold.github.io/intro-to-data-science
Depth of
Insights
Time
High
Low
Past Future
Present
Business
Intelligence
Data
Science
Business
Intelligence
Data Science
Techniques Dashboards,
alerts, queries
Optimization,
predictive modelling,
forecasting
Data Types Structured, data
warehouses
Any kind, often
unstructured
Common
questions
What
happened…?
How much did…?
When did…?
What if…?
What will…?
How can we…?
18. More Data More Opportunities
Introduction to Data Science
https://sherbold.github.io/intro-to-data-science
2000’s
Content Management
1990’s
Relational Databases
& Data Warehouses
2010’s
Key-Value Storages
& Unstructured Data
VOLUME
OF
INFORMATION
LARGE
SMALL
TERABYTES PETABYTES EXABYTES
19. Outline
• Introduction to Big Data
• Data Science and Business Intelligence
• The Skillset of Data Scientists
• Summary
Introduction to Data Science
https://sherbold.github.io/intro-to-data-science
20. What are Data Scientists?
• Not computer scientists
• But should know about databases, data structures, algorithms, etc.
• Not mathematicians
• But should know about optimization, stochastics, etc.
• Not statisticians
• But should know about regression, statistical tests, etc.
• Not domain experts
• But must work together with them
Introduction to Data Science
https://sherbold.github.io/intro-to-data-science
21. Skills of Data Scientists
Data
Scientists
Quantitative
• Maths
• Algorithms
• Statistics
Technical
• Programming
• Infrastructures
Skeptical
• Create
hypotheses, but
be skeptical
about them
Collaborative
• Teamwork
• Communication
skills
Introduction to Data Science
https://sherbold.github.io/intro-to-data-science
A bit of everything
… but actually as much as
possible of everything
22. Different types of Data Scientists
• According to Microsoft Research:
• Polymath
• „Do it all“
• Data Evangelist
• Data analysis, disseminating and acting
on insights
• Data Preparer
• Querying existing data, preparing data
for analysis
• Data Shapers
• Analyzing and preparing data
• Data Analyzer
• Analyzing data
• Platform Builder
• Collect data and create
infrastructures
• Moonlighters (50%/20%)
• „Spare time“ data scientists
• Insight Actors
• Use the outcome and act on
insights.
Introduction to Data Science
https://sherbold.github.io/intro-to-data-science
Miyung Kim, Thomas Zimmermann, Robert DeLine, Andrew Begel: Data Scientists in Software
Teams: State of the Art and Challenges, IEEE Transactions on Software Engineering (Online First)
23. Outline
• Introduction to Big Data
• Data Science and Business Intelligence
• The Skillset of Data Scientists
• Summary
Introduction to Data Science
https://sherbold.github.io/intro-to-data-science
24. Summary
• Big data has a high volume, velocity, and variety
• Different data structures
• Structured, semi-structured, quasi-structured, unstructured
• Data science is a very diverse discipline
• Maths, computer science, statistics, applications
Data scientists require a diverse skillset
Introduction to Data Science
https://sherbold.github.io/intro-to-data-science