Why is data science important? How powerful is artificial intelligence? What exactly is data science? What is the life cycle of a typical data science project? How can one become a data scientist? What are the different career options in data science?
12. Volume
• Amount of data
Variety
• Different types (structured, semi-structured, unstructured), sources, resolutions
• e.g., text, images, videos, audio
Velocity
• Data generation and handling speed
Veracity
• Data in doubt (varying levels of noise and processing errors)
Data Science with Dr Shahid
14. Statistics
• Traditionally concerned with analyzing primary (e.g., experimental) data collected for checking specific hypotheses (ideas)
• Primary data analysis or top-down (confirmatory) analysis
• Hypothesis evaluation or testing
Data Science
• Typically concerned with analyzing secondary (e.g., observational) data collected for other reasons
• Secondary data analysis or bottom-up (exploratory) analysis
• Hypothesis generation
• Knowledge discovery
15. Data science is an interdisciplinary field
It encompasses the use of computing tools to extract knowledge from data by applying statistical methods
Multiple definitions exist because of the cross-disciplinary skills needed to create value
The "holy grail" of data science is often illustrated through Venn diagrams, e.g., Drew Conway’s
16. Data science as portrayed by Drew Conway
20. Gregory Piatetsky-Shapiro, Ph.D.
Knowledge Discovery to Data Mining to Predictive Analytics and now to Data Science
Essence is always: discovery of what is true and useful
24. Business Understanding
Goals
• Specify key variables (model targets, metrics of success)
• Relevant data sources
How?
• Define *objectives (business problems, stakeholders)
• **SMART metrics
• Find the data
Artifacts
• Iterating charter
• Data sources
• Data dictionaries
25. Objectives
How much/many: Regression
Which category: Classification
Which group: Clustering
Is it weird: Anomaly Detection
Which option: Recommendation
Specific
Measurable
Achievable
Relevant
Time-bound
27. Data Acquisition
Goals
• Clean, high-quality data
• Architecture of data pipeline (refresh & score)
How?
• Data ingestion
• Explore the data (quality, EDA)
• Set up data pipeline (batch-based, streaming or real-time, or a hybrid)
Artifacts
• Data quality report
• Solution architecture
• Checkpoint decision (re-evaluate before full feature engineering/model building)
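The data quality report listed among the artifacts could be sketched roughly as follows. This is only an illustration, not the method used in the slides: the function name `data_quality_report`, the sample rows, and the rule that `None` or an empty string counts as missing are all assumptions made for the example.

```python
def data_quality_report(rows, columns):
    """Summarize missing and distinct values per column.

    rows: list of dicts, one per record.
    Assumption: a value of None or "" counts as missing.
    """
    report = {}
    for col in columns:
        values = [row.get(col) for row in rows]
        present = [v for v in values if v is not None and v != ""]
        missing = len(values) - len(present)
        report[col] = {
            "missing": missing,
            "missing_pct": 100.0 * missing / len(values) if values else 0.0,
            "distinct": len(set(present)),
        }
    return report


# Hypothetical sample data, invented for illustration only.
sample_rows = [
    {"age": 34, "city": "Lahore"},
    {"age": None, "city": "Karachi"},
    {"age": 29, "city": ""},
    {"age": 34, "city": "Lahore"},
]
quality = data_quality_report(sample_rows, ["age", "city"])
for column, stats in quality.items():
    print(column, stats)
```

A summary like this can feed the checkpoint decision: if too many values are missing, it may be worth revisiting the data sources before investing in feature engineering.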
29. Modeling
Goals
• Optimal features
• Informative model
• Production-ready model
How?
• Feature engineering
• Model training
• Production ready?
Artifacts
• Feature sets
• Model report
• Checkpoint decision (evaluate for production)
30. Model Training
Raw data, Features
Starting data
• Training split (70-80%): model gets trained
• Validation split (10-15%): hyperparameters
• Test split (10-15%): model gets evaluated
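The three-way split can be sketched in plain Python as below. This is a minimal example under stated assumptions: the helper name `split_dataset`, the 70/15/15 fractions (one choice within the ranges on the slide), and the fixed seed are all illustrative, and real projects often use a library routine instead.

```python
import random


def split_dataset(records, train_frac=0.7, val_frac=0.15, seed=42):
    """Shuffle records and split them into train/validation/test lists.

    Whatever remains after the train and validation fractions
    becomes the test split.
    """
    if not 0 < train_frac < 1 or not 0 <= val_frac < 1:
        raise ValueError("fractions must lie between 0 and 1")
    if train_frac + val_frac >= 1:
        raise ValueError("fractions must leave room for a test split")

    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility

    n_train = int(len(shuffled) * train_frac)
    n_val = int(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test


# Hypothetical data: 100 record ids standing in for real examples.
train_set, val_set, test_set = split_dataset(range(100))
print(len(train_set), len(val_set), len(test_set))
```

Shuffling before slicing avoids a split that mirrors the order in which the data was collected; the test split is set aside so that it is only used for the final evaluation.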
32. Deployment
Goals
• Deploy models with a data pipeline to a production environment
How?
• Operationalize the model
Artifacts
• Status dashboard (system health & KPIs)
• Final modeling report
• Final solution architecture document
33. Customer Acceptance
Goals
• Finalize project deliverables: confirm that the pipeline, the model, and their deployment in a production environment satisfy the customer's objectives.
How?
• System validation
• Project hand-off
Artifacts
• Exit report of the project for the customer
38. Math/Statistics
• Linear algebra, calculus
• Probability theory, graph theory
• Distributions, summary statistics, hypothesis testing
Machine learning
• Supervised learning
• Unsupervised learning
• Validation, model comparison
Software engineering
• Algorithms and data structures
• Data visualization
• Data processing
39. Data Scientist
Data Analyst
ML Engineer
Data Engineer
Data Architect
BI Developer
43. Data Science with Dr Shahid
https://www.facebook.com/drshahid.phd
https://www.linkedin.com/in/muhammad-shahid-67876212
muhammad.shahid@ieee.org
Thank You!
Editor's Notes
Continuous Features
A measurable difference exists between the values that continuous features take on. Continuous features usually take values in a subset of the real numbers. Some example features are:
Distance, Time, Cost, Temperature
Categorical Features
With categorical features, there is a specified number of discrete, possible feature values. These values may or may not have an ordering to them. If they have a natural ordering, they are called ordinal categorical features. Otherwise, if there is no intrinsic ordering, they are called nominal categorical features.
Nominal: Car Models, Colors, TV Shows
Ordinal: High-Medium-Low; 1-10 Years Old, 11-20 Years Old, 30-40 Years Old; Happy, Neutral, Sad
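The nominal/ordinal distinction matters when features are turned into numbers. The sketch below shows one common convention, not the only one: nominal values get a one-hot vector that implies no order, while ordinal values get their rank. The helper names `one_hot` and `ordinal_code` and the example category lists are assumptions made for illustration.

```python
def one_hot(value, categories):
    """Encode a nominal value as a 0/1 vector with no implied order."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1 if value == c else 0 for c in categories]


def ordinal_code(value, ordered_levels):
    """Encode an ordinal value as its rank in the given ordering."""
    return ordered_levels.index(value)


# Hypothetical category lists, chosen only for illustration.
colors = ["red", "green", "blue"]    # nominal: no intrinsic order
levels = ["Low", "Medium", "High"]   # ordinal: order carries meaning

color_vec = one_hot("green", colors)
level_rank = ordinal_code("High", levels)
print(color_vec, level_rank)
```

Encoding colors as 0, 1, 2 would suggest that blue is "greater" than red, which is why an order-free encoding is usually preferred for nominal features.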