SQL & Data Analysis
Module 1: Basics of Databases
What is a database?
• A database is any object that is used to collect, store
& organize data
• Examples of databases:
o Excel Spreadsheets
o File cabinet in an office with organized/disorganized data
o Collection of Text Files (txt, csv, xml, json)
• Databases are typically made up of a series of tables
• Within a database, data is typically modeled using rows and columns within these tables to make processing efficient
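As a minimal sketch of the rows-and-columns model, the snippet below builds one small table and reads it back. It uses Python's built-in sqlite3 module with an in-memory SQLite database so that it runs without any setup; the course itself uses PostgreSQL. The table name `customer` and the sample rows are illustrative assumptions, not course data.

```python
import sqlite3

# Assumption: an in-memory SQLite database stands in for a real server.
conn = sqlite3.connect(":memory:")

# Each column is a named field with a declared type.
conn.execute("CREATE TABLE customer (customer_id INTEGER, name TEXT, city TEXT)")

# Each row is one record; these values are made up for illustration.
rows_in = [(1, "Asha", "Kitchener"), (2, "Ben", "Waterloo")]
conn.executemany("INSERT INTO customer VALUES (?, ?, ?)", rows_in)

cursor = conn.execute(
    "SELECT customer_id, name, city FROM customer ORDER BY customer_id"
)
column_names = [description[0] for description in cursor.description]
rows_out = cursor.fetchall()

print(column_names)
for row in rows_out:
    print(row)

conn.close()
```

Every row has the same columns, which is what lets the database process the data uniformly.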
Types of Databases
• There are many types of databases, and the choice depends on the needs of an organization. For example:
• Distributed
• Relational
• Object-Oriented
• NoSQL, etc.
• We will stick to relational databases for this course, as they are still the most popular type
Relational Database
• A relational database is a type of database made up of tables that are related to each other through fields (columns). The software that manages one is a relational database management system (RDBMS)
• One field in a table can point to another field in a
different table
• Data is placed into predefined categories in those
tables
• These relationships are maintained by schemas, which describe the architecture of how the data will be stored. A schema defines the shape of the data in each table and how each table relates to the others.
Databases & Tables
Within a DB, a field in a table can relate to another table as shown above. This is the main idea behind a relational database.
These relationships between fields across tables in a database are maintained by the schema.
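The idea of a field in one table relating to another table can be sketched as follows. This is an illustration only: it uses Python's built-in sqlite3 module rather than PostgreSQL, and the `customer` and `orders` tables and their rows are hypothetical. The shared `customer_id` field is what lets a query follow the relationship from an order to the customer who placed it.

```python
import sqlite3

# Assumption: in-memory SQLite database with made-up sample data.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (customer_id INTEGER, name TEXT);
    CREATE TABLE orders   (order_id INTEGER, customer_id INTEGER, quantity INTEGER);
    INSERT INTO customer VALUES (1, 'Asha'), (2, 'Ben');
    INSERT INTO orders   VALUES (100, 1, 3), (101, 2, 1), (102, 1, 5);
""")

# Follow orders.customer_id to the matching row in customer.
joined = conn.execute("""
    SELECT o.order_id, c.name, o.quantity
    FROM orders AS o
    JOIN customer AS c ON c.customer_id = o.customer_id
    ORDER BY o.order_id
""").fetchall()

for row in joined:
    print(row)

conn.close()
```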
Schema
• A schema is the structure that represents the logical view of the database. It defines how fields relate to one another across tables
• Think of a schema as a descriptive representation of a database, usually depicted by means of diagrams
• In the analytics world, it is primarily database admins or data engineers who design the schema and provide it to analysts or data scientists so they can use the data
• Schemas can be broadly divided into 2 categories:
• Physical Schema - Pertains to the actual storage of data on disk and includes the table names, column names, and data types
• Logical Schema - Pertains to the logical constraints that apply to the stored data. It covers tables, views, etc. and defines how they are linked together
[Figure: View 1, View 2, and View 3 shown with the Logical Schema and Physical Schema layers]
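To make the schema idea concrete, here is a small sketch that defines a table and a view and then asks the database to describe them. It runs on SQLite through Python's built-in sqlite3 module, so `PRAGMA table_info` and the `sqlite_master` catalog are SQLite-specific; PostgreSQL exposes comparable details through its `information_schema` views. The object names `customer` and `customer_names` are hypothetical.

```python
import sqlite3

# Assumption: in-memory SQLite database; names are illustrative only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (
        customer_id INTEGER PRIMARY KEY,
        name        TEXT NOT NULL,
        city        TEXT
    );
    CREATE VIEW customer_names AS
        SELECT customer_id, name FROM customer;
""")

# Column names and data types recorded for the table.
# Each result row is (cid, name, type, notnull, dflt_value, pk).
table_columns = [
    (row[1], row[2]) for row in conn.execute("PRAGMA table_info(customer)")
]

# Tables and views known to the database.
objects = conn.execute(
    "SELECT type, name FROM sqlite_master ORDER BY name"
).fetchall()

print(table_columns)
print(objects)

conn.close()
```

The schema is stored by the database itself, which is why analysts can inspect it without seeing any of the data.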
But why do we need these multi-table schemas?
• The answer to that question leads us to Normalization
• Normalization is a design technique that allows admins/designers to reduce data redundancy and eliminate the need for bloated tables.
• Normalization rules divide large tables into multiple smaller tables and link them using relationships defined in the schema
• These rules also ensure that dependencies are stored logically and that the relationships make sense
• A lack of normalization also leads to anomalies in basic operations such as Insert, Update, Delete*
*We will cover Insert, Update, Delete and other functions in session 2
Non-normalized table:
On the face of it, all the information seems to be correct. But notice how the table is bloated. Any change to a customer's info requires updating the same field in several rows, which is an expensive operation.
There are several levels of normalization, each with its own rules, known as First Normal Form (1NF), Second Normal Form (2NF), and so on. This course doesn’t cover normalization in detail, but anyone interested can go to this link.
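The cost of redundancy can be sketched with a small comparison. The example below is an assumption-laden illustration, not the table from the slide: it uses Python's built-in sqlite3 module, a hypothetical flat table `orders_flat` that repeats customer details on every order, and a normalized pair of tables `customer` and `orders`. It then counts how many rows one change of city has to touch in each design.

```python
import sqlite3

# Assumption: in-memory SQLite database with made-up sample data.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Non-normalized: customer details are repeated on every order row.
    CREATE TABLE orders_flat (
        order_id INTEGER, customer_name TEXT, customer_city TEXT, quantity INTEGER
    );
    INSERT INTO orders_flat VALUES
        (100, 'Asha', 'Kitchener', 3),
        (101, 'Ben',  'Waterloo',  1),
        (102, 'Asha', 'Kitchener', 5);

    -- Normalized: customer details are stored once and referenced by id.
    CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, name TEXT, city TEXT);
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER, quantity INTEGER);
    INSERT INTO customer VALUES (1, 'Asha', 'Kitchener'), (2, 'Ben', 'Waterloo');
    INSERT INTO orders VALUES (100, 1, 3), (101, 2, 1), (102, 1, 5);
""")

# The same logical change (Asha moves to Guelph) in both designs.
flat_updates = conn.execute(
    "UPDATE orders_flat SET customer_city = 'Guelph' WHERE customer_name = 'Asha'"
).rowcount
normalized_updates = conn.execute(
    "UPDATE customer SET city = 'Guelph' WHERE name = 'Asha'"
).rowcount

print("rows touched in flat table:", flat_updates)
print("rows touched in normalized table:", normalized_updates)

conn.close()
```

In the flat design the number of rows to update grows with the number of orders, and missing one of them leaves the data inconsistent.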
Table Identifiers
• So how do databases maintain these relationships?
• One of the more important concepts in database management is primary and foreign keys
• A primary key is a column, or in some cases a set of columns (a composite key), that uniquely identifies a row in the table. Relational databases enforce this uniqueness by allowing only one row with a given primary key value in a table. Each table can have only one primary key.
• A foreign key is a column or set of columns whose values link to the values of a primary key in another table. Foreign keys make normalization possible, because they let a table refer to data that is stored in another table.
Customer
o CustomerID (PK)
o Name
o City
o Province
o Postal Code
Order
o OrderID (PK)
o Quantity
o CustomerID (FK)
o ProductID (FK)
o ProductDesc
Product
o ProductID (PK)
o ProductDesc
o Colour
o Supplier
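The Customer, Order, and Product tables above can be sketched in SQL to show how the keys are enforced. This is a hedged illustration rather than the course's own setup: it runs on SQLite through Python's built-in sqlite3 module (where foreign key checks must be switched on with a PRAGMA), the table is named `orders` because ORDER is a reserved word in SQL, the duplicated ProductDesc column is left out of `orders`, and all sample values and the helper `insert_succeeds` are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite-specific: foreign key enforcement is off unless enabled per connection.
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    CREATE TABLE customer (
        customer_id INTEGER PRIMARY KEY,
        name TEXT, city TEXT, province TEXT, postal_code TEXT
    );
    CREATE TABLE product (
        product_id INTEGER PRIMARY KEY,
        product_desc TEXT, colour TEXT, supplier TEXT
    );
    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,
        quantity    INTEGER,
        customer_id INTEGER REFERENCES customer (customer_id),
        product_id  INTEGER REFERENCES product (product_id)
    );
    -- Made-up sample rows.
    INSERT INTO customer VALUES (1, 'Asha', 'Kitchener', 'ON', 'N2G 1A1');
    INSERT INTO product  VALUES (10, 'Desk lamp', 'Black', 'Acme');
""")


def insert_succeeds(sql, params):
    """Hypothetical helper: True if the insert is accepted, False if a key rule rejects it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(sql, params)
        return True
    except sqlite3.IntegrityError:
        return False


order_sql = (
    "INSERT INTO orders (order_id, quantity, customer_id, product_id) "
    "VALUES (?, ?, ?, ?)"
)

# A second customer with primary key 1 breaks uniqueness.
duplicate_pk = insert_succeeds(
    "INSERT INTO customer (customer_id, name) VALUES (?, ?)", (1, "Someone Else")
)
# Customer 99 does not exist, so the foreign key has nothing to point to.
missing_fk = insert_succeeds(order_sql, (501, 2, 99, 10))
# Customer 1 and product 10 both exist, so this order is accepted.
valid_order = insert_succeeds(order_sql, (502, 2, 1, 10))

print(duplicate_pk, missing_fk, valid_order)
conn.close()
```

The database, not the application, refuses the two bad rows, which is how the relationships stay consistent no matter which client writes the data.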
Database Vendors
• The SQL language and the RDBMS schema concepts are not proprietary to any vendor. SQL is the query language that all of these DB solutions support. Here are some of the most well-known ones:
o MySQL
o Microsoft SQL Server
o SQLite
o SAP Sybase
o PostgreSQL
o OracleDB
o IBM DB2
o Microsoft Access
Each vendor implements standard SQL and adds its own enhancements on top, which differentiate its offering and serve specific design purposes.
For the purposes of our course, we chose PostgreSQL for 2 specific reasons:
• It is open source, which makes it free to use and easy to customize
• It allows creation of local servers, which makes it quick to set up and easy to connect to any programming language, IDE, etc.
Database Architecture
• RDBMS networks are generally designed using a client-server architecture.
• A client-server architecture is a computing model in which the server hosts and manages all the resources required by clients (or users). Simply put, a server is a centralized computer that provides resources for all its clients.
• Clients and the server have a many-to-one relationship, which means that multiple clients (or users) can be connected to one server concurrently
• This architecture allows division of a network into its individual components leading to
more efficient system design and maintenance.
• Types of DBMS architecture:
• One tier architecture – Simplest architecture design in which the server, the database
and the client all reside on a single computer. For the purposes of our course, we will use
a one tier architecture.
• Two tier architecture - Architecture where the presentation layer runs on a client
computer whereas the data is stored on a separate server allowing separation. This
allows data security to be stricter while also leading to faster communication and query
Database Architecture Illustrated
[Figure: diagrams of one tier, two tier, and three tier architectures, with labels Client 1, Client 2, Database, and Server]
Setting up your one-tier architecture-based environment
To use PostgreSQL on a local machine, we will install PostgreSQL on our computer and then add pgAdmin4 alongside it as a graphical environment for writing and running queries
Installing PostgreSQL
• For Windows
• Resource 1
• Resource 2
• For macOS
• Postgres.app, a PostgreSQL wrapper for macOS - https://postgresapp.com
Installing pgAdmin4
• For Windows & Mac - https://www.pgadmin.org/download/
For more advanced installs, refer to Appendix A: Installing PostgreSQL in the book PostgreSQL: Up and Running from O'Reilly


Editor's Notes

  • #7, #8 Students can also refer to Section 1: Database Fundamentals in the Learn SQL Database Programming book (available through the Conestoga library).