SlideShare a Scribd company logo
1 of 24
Chapter 01
Introduction
Dr. Steffen Herbold
herbold@cs.uni-goettingen.de
Introduction to Data Science
https://sherbold.github.io/intro-to-data-science
Outline
• Introduction to Big Data
• Data Science and Business Intelligence
• The Skillset of Data Scientists
• Summary
Introduction to Data Science
https://sherbold.github.io/intro-to-data-science
What is „Big Data“?!?
Introduction to Data Science
https://sherbold.github.io/intro-to-data-science
Is this really
about size?
Naive Definition
• Naive definition:
• Big data only depends on the data size
• 1 Gigabyte? 1 Terabyte? 1 Petabyte?
• Naive interpretation misses important aspects
• Time:
• Analyzing 1 Gigabyte of data per day is different from analyzing 1 Gigabyte of
data per second
• Diversity:
• Analyzing spread sheets with numeric data is different from analyzing Web
pages that contain a mixture of text and images
• Distribution:
• Analyzing data from a single source is different from analyzing data from multiple
sources
Introduction to Data Science
https://sherbold.github.io/intro-to-data-science
Definition of Big Data
• Following Gartner‘s IT Glossary:
• Big data is high-volume, high-velocity and/or high-variety information
assets that demand cost-effective, innovative forms of information
processing that enable enhanced insight, decision making, and
process automation.
• The three Vs
• Volume
• Velocity
• Variety
Introduction to Data Science
https://sherbold.github.io/intro-to-data-science
Some people actually use 10 Vs to define
big data!
• Variability
• Veracity
• Validity
• Vulnerability
• Volatility
• Visualization
• Value
The 3 Vs: Volume
• Scale of the data must be „big“
• No clear definition
• „that demand […] innovative forms of information processing“ (Gartner)
Introduction to Data Science
https://sherbold.github.io/intro-to-data-science
© Statista 2018
Data center storage worldwide
The 3 Vs: Velocity
Introduction to Data Science
https://sherbold.github.io/intro-to-data-science
• Speed at which new data is created
• Speed at which data must be processed and analyzed
• Often close to real-time
The 3 Vs: Variety
• Diversity in data types and data sources
Introduction to Data Science
https://sherbold.github.io/intro-to-data-science
Structured
Semi-
Structured
Quasi-Structured
Unstructured
• Data with defined types and structure
• Example: comma separated values
• Textual data with parseable pattern
• Example: XML files with schema
• Textual data with erratic formats that can be
formated with effort
• Example: Clickstream data
• Data that has no inherent structure, often
with multiple formats
• Example: Web site, videos
Examples for data types
Introduction to Data Science
https://sherbold.github.io/intro-to-data-science
Structured Quasi-Structured
Semi-Structured Unstructured
Outline
• Introduction to Big Data
• Data Science and Business Intelligence
• The Skillset of Data Scientists
• Summary
Introduction to Data Science
https://sherbold.github.io/intro-to-data-science
Defining Data Science
• Unfortunately, there is no clear definition (yet?)
• Goal is the extraction of knowledge from data
• Combination of techniques from different disciplines
• Scientific principles guide the data analysis
Introduction to Data Science
https://sherbold.github.io/intro-to-data-science
What is „Data Science“?!?
Introduction to Data Science
https://sherbold.github.io/intro-to-data-science
Tools? Big Data?
Machine Learning?
Mathematical Aspects
Introduction to Data Science
https://sherbold.github.io/intro-to-data-science
Computational
Geometry
Optimization Stochastics
Scientific
Computing
Machine
Learning
Computer Science Aspects
Introduction to Data Science
https://sherbold.github.io/intro-to-data-science
Data Structures and
Algorithms
Databases Distributed Computing
Software Engineering Artificial Intelligence Machine Learning
Statistical Aspects
Introduction to Data Science
https://sherbold.github.io/intro-to-data-science
Linear Models Statistical Tests Inference
Time Series Analysis Machine Learning
Applications
Introduction to Data Science
https://sherbold.github.io/intro-to-data-science
Intelligent Systems Robotics Marketing
Medicine Autonomous Driving Social Networks
Data Science vs. Business Intelligence
• Business Intelligence (Gartner IT Glossary)
• […] best practices that enable access to and analysis of information to
improve and optimize decisions and performance.
Introduction to Data Science
https://sherbold.github.io/intro-to-data-science
Depth of
Insights
Time
High
Low
Past Future
Present
Business
Intelligence
Data
Science
Business
Intelligence
Data Science
Techniques Dashboards,
alerts, queries
Optimization,
predictive modelling,
forecasting
Data Types Structured, data
warehouses
Any kind, often
unstructured
Common
questions
What
happened…?
How much did…?
When did…?
What if…?
What will…?
How can we…?
More Data  More Opportunities
Introduction to Data Science
https://sherbold.github.io/intro-to-data-science
2000’s
Content Management
1990’s
Relational Databases
& Data Warehouses
2010’s
Key-Value Storages
& Unstructured Data
VOLUME
OF
INFORMATION
LARGE
SMALL
TERABYTES PETABYTES EXABYTES
Outline
• Introduction to Big Data
• Data Science and Business Intelligence
• The Skillset of Data Scientists
• Summary
Introduction to Data Science
https://sherbold.github.io/intro-to-data-science
What are Data Scientists?
• Not computer scientists
• But should know about databases, data structures, algorithms, etc.
• Not mathematicians
• But should know about optimization, stochastics, etc.
• Not statisticians
• But should know about regression, statistical tests, etc.
• Not domain experts
• But must work together with them
Introduction to Data Science
https://sherbold.github.io/intro-to-data-science
Skills of Data Scientists
Data
Scientists
Quantitative
• Maths
• Algorithms
• Statistics
Technical
• Programming
• Infrastructures
Skeptical
• Create
hypotheses, but
be skeptical
about them
Collaborative
• Teamwork
• Communication
skills
Introduction to Data Science
https://sherbold.github.io/intro-to-data-science
A bit of everything
… but actually as much as
possible of everything
Different types of Data Scientists
• According to Microsoft Research:
• Polymath
• „Do it all“
• Data Evangelist
• Data analysis, disseminating and acting
on insights
• Data Preparer
• Querying existing data, preparing data
for analysis
• Data Shapers
• Analyzing and preparing data
• Data Analyzer
• Analyzing data
• Platform Builder
• Collect data and create
infrastructures
• Moonlighters (50%/20%)
• „Spare time“ data scientists
• Insight Actors
• Use the outcome and act on
insights.
Introduction to Data Science
https://sherbold.github.io/intro-to-data-science
Miyung Kim, Thomas Zimmermann, Robert DeLine, Andrew Begel: Data Scientists in Software
Teams: State of the Art and Challenges, IEEE Transactions on Software Engineering (Online First)
Outline
• Introduction to Big Data
• Data Science and Business Intelligence
• The Skillset of Data Scientists
• Summary
Introduction to Data Science
https://sherbold.github.io/intro-to-data-science
Summary
• Big data has a high volume, velocity, and variety
• Different data structures
• Structured, semi-structured, quasi-structured, unstructured
• Data science is a very diverse discipline
• Maths, computer science, statistics, applications
 Data scientists require a diverse skillset
Introduction to Data Science
https://sherbold.github.io/intro-to-data-science

More Related Content

Similar to 01-Introduction.pptx

JavaZone 2018 - A Practical(ish) Introduction to Data Science
JavaZone 2018 - A Practical(ish) Introduction to Data ScienceJavaZone 2018 - A Practical(ish) Introduction to Data Science
JavaZone 2018 - A Practical(ish) Introduction to Data ScienceMark West
 
NYC Open Data Meetup-- Thoughtworks chief data scientist talk
NYC Open Data Meetup-- Thoughtworks chief data scientist talkNYC Open Data Meetup-- Thoughtworks chief data scientist talk
NYC Open Data Meetup-- Thoughtworks chief data scientist talkVivian S. Zhang
 
Data Science at Scale by Sarah Guido
Data Science at Scale by Sarah GuidoData Science at Scale by Sarah Guido
Data Science at Scale by Sarah GuidoSpark Summit
 
Behind the scenes of data science
Behind the scenes of data scienceBehind the scenes of data science
Behind the scenes of data scienceLoïc Lejoly
 
Data Science Overview
Data Science OverviewData Science Overview
Data Science OverviewDavide Mauri
 
Abhishek Training PPT.pptx
Abhishek Training PPT.pptxAbhishek Training PPT.pptx
Abhishek Training PPT.pptxKashishKashish22
 
Data Science meets Software Development
Data Science meets Software DevelopmentData Science meets Software Development
Data Science meets Software DevelopmentAlexis Seigneurin
 
Continuum Analytics and Python
Continuum Analytics and PythonContinuum Analytics and Python
Continuum Analytics and PythonTravis Oliphant
 
intro to data science Clustering and visualization of data science subfields ...
intro to data science Clustering and visualization of data science subfields ...intro to data science Clustering and visualization of data science subfields ...
intro to data science Clustering and visualization of data science subfields ...jybufgofasfbkpoovh
 
Building Better Analytics Workflows (Strata-Hadoop World 2013)
Building Better Analytics Workflows (Strata-Hadoop World 2013)Building Better Analytics Workflows (Strata-Hadoop World 2013)
Building Better Analytics Workflows (Strata-Hadoop World 2013)Wes McKinney
 
Lunch & Learn Intro to Big Data
Lunch & Learn Intro to Big DataLunch & Learn Intro to Big Data
Lunch & Learn Intro to Big DataMelissa Hornbostel
 
Demystifying data engineering
Demystifying data engineeringDemystifying data engineering
Demystifying data engineeringThang Bui (Bob)
 
How to use Big Data and Data Lake concept in business using Hadoop and Spark...
 How to use Big Data and Data Lake concept in business using Hadoop and Spark... How to use Big Data and Data Lake concept in business using Hadoop and Spark...
How to use Big Data and Data Lake concept in business using Hadoop and Spark...Institute of Contemporary Sciences
 
Presentation on Big Data Analytics
Presentation on Big Data AnalyticsPresentation on Big Data Analytics
Presentation on Big Data AnalyticsS P Sajjan
 
Enabling Your Data Science Team with Modern Data Engineering
Enabling Your Data Science Team with Modern Data EngineeringEnabling Your Data Science Team with Modern Data Engineering
Enabling Your Data Science Team with Modern Data EngineeringJames Densmore
 
How to Survive as a Data Architect in a Polyglot Database World
How to Survive as a Data Architect in a Polyglot Database WorldHow to Survive as a Data Architect in a Polyglot Database World
How to Survive as a Data Architect in a Polyglot Database WorldKaren Lopez
 
Hadoop meets Agile! - An Agile Big Data Model
Hadoop meets Agile! - An Agile Big Data ModelHadoop meets Agile! - An Agile Big Data Model
Hadoop meets Agile! - An Agile Big Data ModelUwe Printz
 

Similar to 01-Introduction.pptx (20)

JavaZone 2018 - A Practical(ish) Introduction to Data Science
JavaZone 2018 - A Practical(ish) Introduction to Data ScienceJavaZone 2018 - A Practical(ish) Introduction to Data Science
JavaZone 2018 - A Practical(ish) Introduction to Data Science
 
NYC Open Data Meetup-- Thoughtworks chief data scientist talk
NYC Open Data Meetup-- Thoughtworks chief data scientist talkNYC Open Data Meetup-- Thoughtworks chief data scientist talk
NYC Open Data Meetup-- Thoughtworks chief data scientist talk
 
Data Science at Scale by Sarah Guido
Data Science at Scale by Sarah GuidoData Science at Scale by Sarah Guido
Data Science at Scale by Sarah Guido
 
Ds01 data science
Ds01   data scienceDs01   data science
Ds01 data science
 
Behind the scenes of data science
Behind the scenes of data scienceBehind the scenes of data science
Behind the scenes of data science
 
Data Science Overview
Data Science OverviewData Science Overview
Data Science Overview
 
Abhishek Training PPT.pptx
Abhishek Training PPT.pptxAbhishek Training PPT.pptx
Abhishek Training PPT.pptx
 
Data Science meets Software Development
Data Science meets Software DevelopmentData Science meets Software Development
Data Science meets Software Development
 
Continuum Analytics and Python
Continuum Analytics and PythonContinuum Analytics and Python
Continuum Analytics and Python
 
intro to data science Clustering and visualization of data science subfields ...
intro to data science Clustering and visualization of data science subfields ...intro to data science Clustering and visualization of data science subfields ...
intro to data science Clustering and visualization of data science subfields ...
 
Building Better Analytics Workflows (Strata-Hadoop World 2013)
Building Better Analytics Workflows (Strata-Hadoop World 2013)Building Better Analytics Workflows (Strata-Hadoop World 2013)
Building Better Analytics Workflows (Strata-Hadoop World 2013)
 
Big Data Analytics
Big Data AnalyticsBig Data Analytics
Big Data Analytics
 
Lunch & Learn Intro to Big Data
Lunch & Learn Intro to Big DataLunch & Learn Intro to Big Data
Lunch & Learn Intro to Big Data
 
Demystifying data engineering
Demystifying data engineeringDemystifying data engineering
Demystifying data engineering
 
How to use Big Data and Data Lake concept in business using Hadoop and Spark...
 How to use Big Data and Data Lake concept in business using Hadoop and Spark... How to use Big Data and Data Lake concept in business using Hadoop and Spark...
How to use Big Data and Data Lake concept in business using Hadoop and Spark...
 
Presentation on Big Data Analytics
Presentation on Big Data AnalyticsPresentation on Big Data Analytics
Presentation on Big Data Analytics
 
Enabling Your Data Science Team with Modern Data Engineering
Enabling Your Data Science Team with Modern Data EngineeringEnabling Your Data Science Team with Modern Data Engineering
Enabling Your Data Science Team with Modern Data Engineering
 
How to Survive as a Data Architect in a Polyglot Database World
How to Survive as a Data Architect in a Polyglot Database WorldHow to Survive as a Data Architect in a Polyglot Database World
How to Survive as a Data Architect in a Polyglot Database World
 
Hadoop meets Agile! - An Agile Big Data Model
Hadoop meets Agile! - An Agile Big Data ModelHadoop meets Agile! - An Agile Big Data Model
Hadoop meets Agile! - An Agile Big Data Model
 
Labou "Data Science and the Library at UC San Diego"
Labou "Data Science and the Library at UC San Diego"Labou "Data Science and the Library at UC San Diego"
Labou "Data Science and the Library at UC San Diego"
 

More from Shree Shree

More from Shree Shree (20)

10 rectangular options.pptx
10 rectangular options.pptx10 rectangular options.pptx
10 rectangular options.pptx
 
1732002B.pptx
1732002B.pptx1732002B.pptx
1732002B.pptx
 
9D52BDFC.pptx
9D52BDFC.pptx9D52BDFC.pptx
9D52BDFC.pptx
 
865FC50.pptx
865FC50.pptx865FC50.pptx
865FC50.pptx
 
9DE46AD6.pptx
9DE46AD6.pptx9DE46AD6.pptx
9DE46AD6.pptx
 
20DE6DE0.pptx
20DE6DE0.pptx20DE6DE0.pptx
20DE6DE0.pptx
 
2108-EN-TP-PhenomEssentialTalentAcquisitionToolkit.pdf
2108-EN-TP-PhenomEssentialTalentAcquisitionToolkit.pdf2108-EN-TP-PhenomEssentialTalentAcquisitionToolkit.pdf
2108-EN-TP-PhenomEssentialTalentAcquisitionToolkit.pdf
 
2C0A56C5.pptx
2C0A56C5.pptx2C0A56C5.pptx
2C0A56C5.pptx
 
1EF5E174.pptx
1EF5E174.pptx1EF5E174.pptx
1EF5E174.pptx
 
1BB40E18.pptx
1BB40E18.pptx1BB40E18.pptx
1BB40E18.pptx
 
8D0ECBB0.pptx
8D0ECBB0.pptx8D0ECBB0.pptx
8D0ECBB0.pptx
 
D1586BA6.pptx
D1586BA6.pptxD1586BA6.pptx
D1586BA6.pptx
 
FB09958.pptx
FB09958.pptxFB09958.pptx
FB09958.pptx
 
13072EF.pptx
13072EF.pptx13072EF.pptx
13072EF.pptx
 
E65B78F3.pptx
E65B78F3.pptxE65B78F3.pptx
E65B78F3.pptx
 
355C162.pptx
355C162.pptx355C162.pptx
355C162.pptx
 
795589F6.pptx
795589F6.pptx795589F6.pptx
795589F6.pptx
 
AD177DF3.pptx
AD177DF3.pptxAD177DF3.pptx
AD177DF3.pptx
 
961F1B3A.pptx
961F1B3A.pptx961F1B3A.pptx
961F1B3A.pptx
 
5F2D461.pptx
5F2D461.pptx5F2D461.pptx
5F2D461.pptx
 

Recently uploaded

Organic Name Reactions for the students and aspirants of Chemistry12th.pptx
Organic Name Reactions  for the students and aspirants of Chemistry12th.pptxOrganic Name Reactions  for the students and aspirants of Chemistry12th.pptx
Organic Name Reactions for the students and aspirants of Chemistry12th.pptxVS Mahajan Coaching Centre
 
Separation of Lanthanides/ Lanthanides and Actinides
Separation of Lanthanides/ Lanthanides and ActinidesSeparation of Lanthanides/ Lanthanides and Actinides
Separation of Lanthanides/ Lanthanides and ActinidesFatimaKhan178732
 
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...EduSkills OECD
 
A Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy ReformA Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy ReformChameera Dedduwage
 
Q4-W6-Restating Informational Text Grade 3
Q4-W6-Restating Informational Text Grade 3Q4-W6-Restating Informational Text Grade 3
Q4-W6-Restating Informational Text Grade 3JemimahLaneBuaron
 
Micromeritics - Fundamental and Derived Properties of Powders
Micromeritics - Fundamental and Derived Properties of PowdersMicromeritics - Fundamental and Derived Properties of Powders
Micromeritics - Fundamental and Derived Properties of PowdersChitralekhaTherkar
 
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdfBASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdfSoniaTolstoy
 
Contemporary philippine arts from the regions_PPT_Module_12 [Autosaved] (1).pptx
Contemporary philippine arts from the regions_PPT_Module_12 [Autosaved] (1).pptxContemporary philippine arts from the regions_PPT_Module_12 [Autosaved] (1).pptx
Contemporary philippine arts from the regions_PPT_Module_12 [Autosaved] (1).pptxRoyAbrique
 
Science 7 - LAND and SEA BREEZE and its Characteristics
Science 7 - LAND and SEA BREEZE and its CharacteristicsScience 7 - LAND and SEA BREEZE and its Characteristics
Science 7 - LAND and SEA BREEZE and its CharacteristicsKarinaGenton
 
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdfssuser54595a
 
Industrial Policy - 1948, 1956, 1973, 1977, 1980, 1991
Industrial Policy - 1948, 1956, 1973, 1977, 1980, 1991Industrial Policy - 1948, 1956, 1973, 1977, 1980, 1991
Industrial Policy - 1948, 1956, 1973, 1977, 1980, 1991RKavithamani
 
How to Make a Pirate ship Primary Education.pptx
How to Make a Pirate ship Primary Education.pptxHow to Make a Pirate ship Primary Education.pptx
How to Make a Pirate ship Primary Education.pptxmanuelaromero2013
 
Employee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptxEmployee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptxNirmalaLoungPoorunde1
 
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptxSOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptxiammrhaywood
 
Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)eniolaolutunde
 
_Math 4-Q4 Week 5.pptx Steps in Collecting Data
_Math 4-Q4 Week 5.pptx Steps in Collecting Data_Math 4-Q4 Week 5.pptx Steps in Collecting Data
_Math 4-Q4 Week 5.pptx Steps in Collecting DataJhengPantaleon
 
Concept of Vouching. B.Com(Hons) /B.Compdf
Concept of Vouching. B.Com(Hons) /B.CompdfConcept of Vouching. B.Com(Hons) /B.Compdf
Concept of Vouching. B.Com(Hons) /B.CompdfUmakantAnnand
 
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptxPOINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptxSayali Powar
 

Recently uploaded (20)

Staff of Color (SOC) Retention Efforts DDSD
Staff of Color (SOC) Retention Efforts DDSDStaff of Color (SOC) Retention Efforts DDSD
Staff of Color (SOC) Retention Efforts DDSD
 
Organic Name Reactions for the students and aspirants of Chemistry12th.pptx
Organic Name Reactions  for the students and aspirants of Chemistry12th.pptxOrganic Name Reactions  for the students and aspirants of Chemistry12th.pptx
Organic Name Reactions for the students and aspirants of Chemistry12th.pptx
 
Separation of Lanthanides/ Lanthanides and Actinides
Separation of Lanthanides/ Lanthanides and ActinidesSeparation of Lanthanides/ Lanthanides and Actinides
Separation of Lanthanides/ Lanthanides and Actinides
 
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
 
A Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy ReformA Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy Reform
 
Q4-W6-Restating Informational Text Grade 3
Q4-W6-Restating Informational Text Grade 3Q4-W6-Restating Informational Text Grade 3
Q4-W6-Restating Informational Text Grade 3
 
Micromeritics - Fundamental and Derived Properties of Powders
Micromeritics - Fundamental and Derived Properties of PowdersMicromeritics - Fundamental and Derived Properties of Powders
Micromeritics - Fundamental and Derived Properties of Powders
 
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdfBASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdf
 
Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝
 
Contemporary philippine arts from the regions_PPT_Module_12 [Autosaved] (1).pptx
Contemporary philippine arts from the regions_PPT_Module_12 [Autosaved] (1).pptxContemporary philippine arts from the regions_PPT_Module_12 [Autosaved] (1).pptx
Contemporary philippine arts from the regions_PPT_Module_12 [Autosaved] (1).pptx
 
Science 7 - LAND and SEA BREEZE and its Characteristics
Science 7 - LAND and SEA BREEZE and its CharacteristicsScience 7 - LAND and SEA BREEZE and its Characteristics
Science 7 - LAND and SEA BREEZE and its Characteristics
 
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf
 
Industrial Policy - 1948, 1956, 1973, 1977, 1980, 1991
Industrial Policy - 1948, 1956, 1973, 1977, 1980, 1991Industrial Policy - 1948, 1956, 1973, 1977, 1980, 1991
Industrial Policy - 1948, 1956, 1973, 1977, 1980, 1991
 
How to Make a Pirate ship Primary Education.pptx
How to Make a Pirate ship Primary Education.pptxHow to Make a Pirate ship Primary Education.pptx
How to Make a Pirate ship Primary Education.pptx
 
Employee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptxEmployee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptx
 
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptxSOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
 
Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)
 
_Math 4-Q4 Week 5.pptx Steps in Collecting Data
_Math 4-Q4 Week 5.pptx Steps in Collecting Data_Math 4-Q4 Week 5.pptx Steps in Collecting Data
_Math 4-Q4 Week 5.pptx Steps in Collecting Data
 
Concept of Vouching. B.Com(Hons) /B.Compdf
Concept of Vouching. B.Com(Hons) /B.CompdfConcept of Vouching. B.Com(Hons) /B.Compdf
Concept of Vouching. B.Com(Hons) /B.Compdf
 
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptxPOINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
 

01-Introduction.pptx

  • 1. Chapter 01 Introduction Dr. Steffen Herbold herbold@cs.uni-goettingen.de Introduction to Data Science https://sherbold.github.io/intro-to-data-science
  • 2. Outline • Introduction to Big Data • Data Science and Business Intelligence • The Skillset of Data Scientists • Summary Introduction to Data Science https://sherbold.github.io/intro-to-data-science
  • 3. What is „Big Data“?!? Introduction to Data Science https://sherbold.github.io/intro-to-data-science Is this really about size?
  • 4. Naive Definition • Naive definition: • Big data only depends on the data size • 1 Gigabyte? 1 Terabyte? 1 Petabyte? • Naive interpretation misses important aspects • Time: • Analyzing 1 Gigabyte of data per day is different from analyzing 1 Gigabyte of data per second • Diversity: • Analyzing spread sheets with numeric data is different from analyzing Web pages that contain a mixture of text and images • Distribution: • Analyzing data from a single source is different from analyzing data from multiple sources Introduction to Data Science https://sherbold.github.io/intro-to-data-science
  • 5. Definition of Big Data • Following Gartner‘s IT Glossary: • Big data is high-volume, high-velocity and/or high-variety information assets that demand cost-effective, innovative forms of information processing that enable enhanced insight, decision making, and process automation. • The three Vs • Volume • Velocity • Variety Introduction to Data Science https://sherbold.github.io/intro-to-data-science Some people actually use 10 Vs to define big data! • Variability • Veracity • Validity • Vulnerability • Volatility • Visualization • Value
  • 6. The 3 Vs: Volume • Scale of the data must be „big“ • No clear definition • „that demand […] innovative forms of information processing“ (Gartner) Introduction to Data Science https://sherbold.github.io/intro-to-data-science © Statista 2018 Data center storage worldwide
  • 7. The 3 Vs: Velocity Introduction to Data Science https://sherbold.github.io/intro-to-data-science • Speed at which new data is created • Speed at which data must be processed and analyzed • Often close to real-time
  • 8. The 3 Vs: Variety • Diversity in data types and data sources Introduction to Data Science https://sherbold.github.io/intro-to-data-science Structured Semi- Structured Quasi-Structured Unstructured • Data with defined types and structure • Example: comma separated values • Textual data with parseable pattern • Example: XML files with schema • Textual data with erratic formats that can be formated with effort • Example: Clickstream data • Data that has no inherent structure, often with multiple formats • Example: Web site, videos
  • 9. Examples for data types Introduction to Data Science https://sherbold.github.io/intro-to-data-science Structured Quasi-Structured Semi-Structured Unstructured
  • 10. Outline • Introduction to Big Data • Data Science and Business Intelligence • The Skillset of Data Scientists • Summary Introduction to Data Science https://sherbold.github.io/intro-to-data-science
  • 11. Defining Data Science • Unfortunately, there is no clear definition (yet?) • Goal is the extraction of knowledge from data • Combination of techniques from different disciplines • Scientific principles guide the data analysis Introduction to Data Science https://sherbold.github.io/intro-to-data-science
  • 12. What is „Data Science“?!? Introduction to Data Science https://sherbold.github.io/intro-to-data-science Tools? Big Data? Machine Learning?
  • 13. Mathematical Aspects Introduction to Data Science https://sherbold.github.io/intro-to-data-science Computational Geometry Optimization Stochastics Scientific Computing Machine Learning
  • 14. Computer Science Aspects Introduction to Data Science https://sherbold.github.io/intro-to-data-science Data Structures and Algorithms Databases Distributed Computing Software Engineering Artificial Intelligence Machine Learning
  • 15. Statistical Aspects Introduction to Data Science https://sherbold.github.io/intro-to-data-science Linear Models Statistical Tests Inference Time Series Analysis Machine Learning
  • 16. Applications Introduction to Data Science https://sherbold.github.io/intro-to-data-science Intelligent Systems Robotics Marketing Medicine Autonomous Driving Social Networks
  • 17. Data Science vs. Business Intelligence • Business Intelligence (Gartner IT Glossary) • […] best practices that enable access to and analysis of information to improve and optimize decisions and performance. Introduction to Data Science https://sherbold.github.io/intro-to-data-science Depth of Insights Time High Low Past Future Present Business Intelligence Data Science Business Intelligence Data Science Techniques Dashboards, alerts, queries Optimization, predictive modelling, forecasting Data Types Structured, data warehouses Any kind, often unstructured Common questions What happened…? How much did…? When did…? What if…? What will…? How can we…?
  • 18. More Data  More Opportunities Introduction to Data Science https://sherbold.github.io/intro-to-data-science 2000’s Content Management 1990’s Relational Databases & Data Warehouses 2010’s Key-Value Storages & Unstructured Data VOLUME OF INFORMATION LARGE SMALL TERABYTES PETABYTES EXABYTES
  • 19. Outline • Introduction to Big Data • Data Science and Business Intelligence • The Skillset of Data Scientists • Summary Introduction to Data Science https://sherbold.github.io/intro-to-data-science
  • 20. What are Data Scientists? • Not computer scientists • But should know about databases, data structures, algorithms, etc. • Not mathematicians • But should know about optimization, stochastics, etc. • Not statisticians • But should know about regression, statistical tests, etc. • Not domain experts • But must work together with them Introduction to Data Science https://sherbold.github.io/intro-to-data-science
  • 21. Skills of Data Scientists Data Scientists Quantitative • Maths • Algorithms • Statistics Technical • Programming • Infrastructures Skeptical • Create hypotheses, but be skeptical about them Collaborative • Teamwork • Communication skills Introduction to Data Science https://sherbold.github.io/intro-to-data-science A bit of everything … but actually as much as possible of everything
  • 22. Different types of Data Scientists • According to Microsoft Research: • Polymath • „Do it all“ • Data Evangelist • Data analysis, disseminating and acting on insights • Data Preparer • Querying existing data, preparing data for analysis • Data Shapers • Analyzing and preparing data • Data Analyzer • Analyzing data • Platform Builder • Collect data and create infrastructures • Moonlighters (50%/20%) • „Spare time“ data scientists • Insight Actors • Use the outcome and act on insights. Introduction to Data Science https://sherbold.github.io/intro-to-data-science Miyung Kim, Thomas Zimmermann, Robert DeLine, Andrew Begel: Data Scientists in Software Teams: State of the Art and Challenges, IEEE Transactions on Software Engineering (Online First)
  • 23. Outline • Introduction to Big Data • Data Science and Business Intelligence • The Skillset of Data Scientists • Summary Introduction to Data Science https://sherbold.github.io/intro-to-data-science
  • 24. Summary • Big data has a high volume, velocity, and variety • Different data structures • Structured, semi-structured, quasi-structured, unstructured • Data science is a very diverse discipline • Maths, computer science, statistics, applications  Data scientists require a diverse skillset Introduction to Data Science https://sherbold.github.io/intro-to-data-science