Data Analysis and Visualization (CS 352)
Week 1 - Introduction
Textbook
Data Analysis and Visualization Using Python by Dr Ossama, 2018
Python for Data Analysis by Wes McKinney, 2022
Ultimate Python Libraries for Data Analysis and Visualization by Abhinaba Banerjee, 2024
Python Data Science Handbook by Jake VanderPlas, 2022
Summary of course objectives:
Understand and write programs to store and manipulate data and measurements.
Implement the fundamental concepts of interactive visualization of data.
Implement common data transformations and statistical analysis.
Demonstrate current machine learning techniques for prediction and knowledge discovery.
Your intentions/expectations?
In what ways do you think this course could help your professional development?
What topics are you most interested in?
What suggestions do you have for the instructors and the course?
Introduction to Data Analysis
01 Overview of key Python libraries for data analysis
02 Libraries provide built-in functions and modules
03 Extensive range of functionalities available
Types of Python Data Analysis Libraries
Categorized into three main groups:
Scientific Computing Libraries
Data Visualization Libraries
Machine Learning Libraries
Scientific Computing Libraries
Essential for data manipulation and analysis
Major libraries:
Pandas
NumPy
SciPy
Pandas - Data Manipulation
Provides data structures and tools for analysis
Fast access to structured data
Key feature: DataFrame (2D table with rows & columns)
Supports indexing by label (.loc) and by position (.iloc)
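
To make the DataFrame and its indexing concrete, here is a minimal sketch. The column names and values are invented for illustration and are not part of the course material.

```python
import pandas as pd

# Hypothetical measurements; names and values are made up for illustration.
df = pd.DataFrame(
    {"city": ["Cairo", "Giza", "Luxor"], "temp_c": [31.5, 30.0, 38.2]},
    index=["a", "b", "c"],
)

by_label = df.loc["b", "temp_c"]   # label-based indexing: row "b", column "temp_c"
by_position = df.iloc[2, 1]        # position-based indexing: third row, second column
hot = df[df["temp_c"] > 31]        # boolean filtering keeps only the matching rows

print(by_label, by_position, list(hot["city"]))
```

The same table can be sliced by row label, by integer position, or by a condition on a column, which is what makes structured data quick to access.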
NumPy - Array-Based Computation
Uses arrays as primary input/output
Supports matrix operations
Enables fast array processing with minimal coding
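
A short sketch of what array-based computation looks like in practice. The arrays are arbitrary example values chosen for illustration.

```python
import numpy as np

a = np.array([[1.0, 2.0], [3.0, 4.0]])
b = np.array([[0.0, 1.0], [1.0, 0.0]])

elementwise = a * 2 + 1      # vectorized arithmetic applied to every element, no loop
product = a @ b              # matrix multiplication
col_means = a.mean(axis=0)   # mean of each column

print(elementwise)
print(product)
print(col_means)
```

Each operation takes arrays in and returns arrays out, so whole-array transformations fit in a single line.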
SciPy - Advanced Math Functions
Includes functions for:
Optimization
Linear algebra
Signal processing
Statistics
Results are usually plotted with Matplotlib (SciPy has no general plotting API of its own)
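
The sketch below touches three of the areas listed for SciPy (optimization, linear algebra, statistics). Signal processing is left out to keep it short. The function, the linear system, and the sample values are all invented for illustration.

```python
import numpy as np
from scipy import linalg, optimize, stats

# Optimization: minimize a simple quadratic whose minimum is at x = 3.
result = optimize.minimize_scalar(lambda x: (x - 3.0) ** 2)

# Linear algebra: solve the system A @ x = b.
A = np.array([[2.0, 1.0], [1.0, 3.0]])
b = np.array([3.0, 5.0])
x = linalg.solve(A, b)

# Statistics: one-sample t-test of whether the sample mean differs from 5.0.
sample = np.array([4.8, 5.1, 5.0, 4.9, 5.2])
t_stat, p_value = stats.ttest_1samp(sample, popmean=5.0)

print(result.x, x, t_stat, p_value)
```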
Data Visualization Libraries
Essential for communicating results effectively
Major libraries:
Matplotlib
Seaborn
Matplotlib - Customizable Graphs
Most well-known visualization library
Used for creating graphs and plots
Highly customizable and versatile
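
As an illustration of the customization Matplotlib offers, here is a minimal sketch. The data, styling choices, and the output filename `line_plot.png` are assumptions made for the example.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs without a display
import matplotlib.pyplot as plt

x = [0, 1, 2, 3, 4]
y = [v ** 2 for v in x]

fig, ax = plt.subplots(figsize=(5, 3))
# Color, marker, and line style are all set explicitly on the plotted line.
ax.plot(x, y, color="tab:blue", marker="o", linestyle="--", label="y = x^2")
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.set_title("Customized line plot")
ax.legend()
fig.savefig("line_plot.png", dpi=100)  # hypothetical output filename
```

In a notebook, the `matplotlib.use("Agg")` line can be dropped and the figure displays inline.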
Seaborn - High-Level Visualization Built on Matplotlib
Simplifies generation of:
Heat maps
Time series plots
Violin plots
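
The following sketch shows how two of these plot types can be produced with one call each. The data is synthetic and the column names (`group`, `value`, `other`) are invented for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic data: two groups with different spread, plus an unrelated column.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "group": np.repeat(["A", "B"], 50),
    "value": np.concatenate([rng.normal(0, 1, 50), rng.normal(1, 1.5, 50)]),
    "other": rng.normal(0, 1, 100),
})

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
sns.violinplot(data=df, x="group", y="value", ax=ax1)   # distribution per group
corr = df[["value", "other"]].corr()
sns.heatmap(corr, annot=True, cmap="viridis", ax=ax2)   # correlation matrix as a heat map
fig.tight_layout()
```

Both functions draw onto ordinary Matplotlib axes, so any Matplotlib customization still applies afterwards.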
Machine Learning Libraries
Used for predictive modeling and analytics
Key libraries:
Scikit-learn
Statsmodels
Scikit-learn - ML Algorithms
Contains tools for:
Regression
Classification
Clustering
Built on NumPy, SciPy, and Matplotlib
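
A minimal sketch of the fit/predict workflow, covering classification and clustering on the Iris dataset bundled with scikit-learn. The choice of models and parameters here is an assumption made for illustration, not a course recommendation.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Classification: hold out a quarter of the data to estimate accuracy.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)

# Clustering: group the same samples without using the labels.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
cluster_labels = kmeans.fit_predict(X)

print(accuracy, cluster_labels[:10])
```

Regression models follow the same pattern: create the estimator, call `fit`, then `predict` or `score`.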
Statsmodels - Statistical Analysis
Allows data exploration
Supports:
Estimating statistical models
Performing statistical tests
Best Dataset Websites
Finding the Best Datasets for Data Science & Research
General-Purpose Datasets
Kaggle - Large collection for ML & AI
Google Dataset Search - Aggregated datasets
UCI ML Repository - Academic & research-focused
Data.gov - U.S. government open data
data.world - Community-driven datasets
Big Data & Business Datasets
AWS Open Data Registry - AI & Cloud datasets
Google Cloud Public Datasets - Cloud-based analytics
Microsoft Azure Open Datasets - AI & Business Intelligence
FiveThirtyEight - Political, sports, and social data
Health & Science Datasets
WHO Data - Global health datasets
CDC Open Data - Public health and disease-related datasets
PhysioNet - Clinical and physiological health data
Finance & Economics Datasets
World Bank Open Data - Economic indicators
IMF Data - Macroeconomic & financial statistics
Quandl (now Nasdaq Data Link) - Market & financial data
Geospatial & Environmental Datasets
NASA Earth Data - Climate & satellite imagery
OpenStreetMap - Free geospatial data
USGS Earth Explorer - Remote sensing data
Choosing the Right Dataset
Select based on project needs
Check dataset quality, size, and source credibility
Use platforms like Kaggle, Google Dataset Search, and government repositories
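
One way to get a first impression of a dataset's size and quality is a few pandas checks, sketched below. The small table stands in for a downloaded dataset, and its columns and values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for a downloaded dataset.
df = pd.DataFrame({
    "id": [1, 2, 2, 4],
    "income": [52000.0, 48000.0, 48000.0, np.nan],
    "region": ["north", "south", "south", None],
})

n_rows, n_cols = df.shape                   # size
missing_share = df.isna().mean()            # fraction of missing values per column
n_duplicates = int(df.duplicated().sum())   # fully duplicated rows

print(n_rows, n_cols, n_duplicates)
print(missing_share)
```

These checks say nothing about source credibility, which still has to be judged from the provider and its documentation.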
