2. OBJECTIVES
• To expose students to the basics of managing information
• To explore the various aspects of database design and modelling
• To examine the basic issues in information governance and information integration
• To gain an overview of information architecture
3. UNIT I DATABASE MODELLING, MANAGEMENT AND DEVELOPMENT
Database design and modelling - Business Rules and Relationships; Java Database Connectivity (JDBC), Database Connection Manager, Stored Procedures. Trends in Big Data systems including NoSQL – Hadoop, HDFS, MapReduce, Hive, and enhancements.
4. Introduction
• Database
– A central repository of data
– Stores data in a structured manner
• Database Design
– The process of defining the structure of a database
5. Data Model
A collection of conceptual tools for describing data, data relationships, and consistency constraints.
Represents the nature of the data, the business rules governing the data, and how the data will be organized in the database.
6. Data Model
Levels:
Conceptual – describes WHAT the system contains
Logical – describes HOW the system will be implemented, regardless of the DBMS
Physical – describes HOW the system will be implemented using a specific DBMS
7. Data Model - Element
Element – Definition
Entity – a real-world thing or an interaction between two or more real-world things
Attribute – a piece of information we need to know about an entity
Relationship – how entities depend on each other, in terms of why they depend on each other and what that relationship is
8. Data Model - Example
Customer and Product are entities
Customer: customer name, customer id
Product: product name, price
Sale – Relationship
9. Data Model - Types
Entity Relationship Models
Unified Modeling Language
11. Components of ER Diagram
Component – Symbol
Entity – Rectangle
Relationship – Diamond
Attribute of an entity – Ellipse
Key attribute of an entity – Ellipse with the attribute name underlined
12. Components of ER Diagram
Component – Symbol
Derived attribute of an entity – Dotted ellipse inside the main ellipse
Multivalued attribute of an entity – Double ellipse
13. ER Diagram - Entity
Component – Example
Entity – Employee, Manager, Department
Weak Entity – an entity that depends on another entity
14. ER Diagram - Attribute
Component – Description
Attribute (Name, Age, Address) – a property or characteristic of an entity
Key Attribute – the main characteristic of an entity, which uniquely identifies it
Composite Attribute – an attribute that has attributes of its own
15. ER Diagram - Relationship
Component – Description
One to One Relationship – one student can enroll in only one course, and a course can have only one student
One to Many Relationship – one student can opt for many courses
Many to One Relationship – a student enrolls in only one course, but a course can have many students
16. ER Diagram - Relationship
Component – Description
Many to Many Relationship – one student can enroll in more than one course, and a course can have more than one student enrolled in it
17. ER Diagram - Example
College Database: Statements
A college contains many departments
Each department can offer any number of courses
Many instructors can work in a department
An instructor can work only in one department
For each department there is a Head
An instructor can be head of only one department
Each instructor can take any number of courses
A course can be taken by only one instructor
A student can enroll for any number of courses
Each course can have any number of students
18. ER Diagram - Steps
Identify the Entities
Identify the relationships
Identify the key attributes
Identify other relevant attributes
19. ER Diagram - Example
Step 1 : Identify the Entities
Department
Course
Instructor
Student
20. ER Diagram - Example
Step 2 : Identify the relationships
Department and Course - One to Many (1:N)
Department and Instructor - One to Many (1:N)
Department and Head - One to One (1:1)
Course and student - Many to Many (M:N)
Course and instructor - Many to One (N :1)
21. ER Diagram - Example
Step 3: Identify the key attributes
Department_Name is the key attribute for the Entity
"Department".
Course_ID is the key attribute for "Course" Entity.
Student_ID is the key attribute for "Student" Entity.
Instructor_ID is the key attribute for "Instructor" Entity.
22. ER Diagram - Example
Step 4: Identify other relevant attributes
For the Department entity, another attribute is location
For the Course entity, other attributes are course_name and duration
For the Instructor entity, other attributes are first_name, last_name and phone
For the Student entity, other attributes are first_name, last_name and phone
23. ER Diagram - Example
Step 5: Draw complete ER diagram
25. NORMALIZATION
A process of organizing the data in a database to avoid data redundancy.
Forms:
First normal form (1NF)
Second normal form (2NF)
Third normal form (3NF)
Boyce-Codd normal form (BCNF)
26. NORMALIZATION
First normal form(1NF) :
an attribute (column) of a table cannot hold multiple
values. It should hold only atomic values.
emp_id emp_name emp_address emp_mobile
101 Herschel New Delhi 8912312390
102 Jon Kanpur 8812121212, 9900012222
103 Ron Chennai 7778881212
104 Lester Bangalore 9990000123, 8123450987
(The emp_mobile column holds multiple values for Jon and Lester, so this table violates 1NF.)
27. NORMALIZATION
First normal form(1NF) :
each attribute of a table must have atomic (single)
values.
emp_id emp_name emp_address emp_mobile
101 Herschel New Delhi 8912312390
102 Jon Kanpur 8812121212
102 Jon Kanpur 9900012222
103 Ron Chennai 7778881212
104 Lester Bangalore 9990000123
104 Lester Bangalore 8123450987
28. NORMALIZATION
Second normal form (2NF): A table is said to be in 2NF if both of the following conditions hold:
Table is in 1NF (First normal form)
Every non-key column depends on the whole of the table's primary key (no partial dependencies)
30. NORMALIZATION – 2NF
The original table below is in first normal form, but teacher_age depends only on teacher_id, which is only part of the primary key {teacher_id, subject}. Splitting the table so that every non-key column depends on the whole primary key puts it in 2NF.

Original table:
teacher_id subject teacher_age
111 Maths 38
111 Physics 38
222 Biology 38
333 Physics 40
333 Chemistry 40

teacher table:
teacher_id teacher_age
111 38
222 38
333 40

teacher_subject table:
teacher_id subject
111 Maths
111 Physics
222 Biology
333 Physics
333 Chemistry
31. NORMALIZATION
A table design is said to be in 3NF if both the
following conditions hold:
Table must be in 2NF
Transitive functional dependency of non-prime attribute
on any super key should be removed.
32. NORMALIZATION – 3NF
Emp_id Emp_name Emp_zip Emp_state Emp_city Emp_district
1001 John 282005 UP Agra Dayal Bagh
1002 Ajeet 222008 TN Chennai M-City
1006 Lora 282007 TN Chennai Urrapakkam
1101 Lilly 292008 UK Pauri Bhagwan
1201 Steve 222999 MP Gwalior Ratan
33. NORMALIZATION – 3NF
In the table above, emp_state, emp_city and emp_district depend on emp_zip, and emp_zip depends on emp_id, so there is a transitive dependency. Splitting the table on emp_zip removes it.
employee table:
emp_id emp_name emp_zip
1001 John 282005
1002 Ajeet 222008
1006 Lora 282007
1101 Lilly 292008
1201 Steve 222999
employee_zip table:
emp_zip emp_state emp_city emp_district
282005 UP Agra Dayal Bagh
222008 TN Chennai M-City
282007 TN Chennai Urrapakkam
292008 UK Pauri Bhagwan
222999 MP Gwalior Ratan
34. NORMALIZATION
Boyce Codd normal form:
A table is in BCNF if it is in 3NF and, for every functional dependency X -> Y, X is a super key of the table.
emp_id emp_nationality emp_dept dept_type dept_no_of_emp
1001 Austrian Production and planning D001 200
1001 Austrian stores D001 250
1002 American design and technical support D134 100
1002 American Purchasing department D134 600
35. NORMALIZATION
Functional dependencies in the table above:
emp_id -> emp_nationality
emp_dept -> {dept_type, dept_no_of_emp}
The candidate key is {emp_id, emp_dept}. Since neither emp_id nor emp_dept alone is a super key, the table is not in BCNF and must be decomposed.
emp_id emp_nationality emp_dept dept_type dept_no_of_emp
1001 Austrian Production and planning D001 200
1001 Austrian stores D001 250
1002 American design and technical support D134 100
1002 American Purchasing department D134 600
36. NORMALIZATION - BCNF
emp_nationality table:
emp_id emp_nationality
1001 Austrian
1002 American

emp_dept table:
emp_dept dept_type dept_no_of_emp
Production and planning D001 200
stores D001 250
design and technical support D134 100
Purchasing department D134 600

emp_dept_mapping table:
emp_id emp_dept
1001 Production and planning
1001 stores
1002 design and technical support
1002 Purchasing department
37. Business Rule
A brief, precise and unambiguous description of a policy, procedure, or principle within a specific organization.
Some examples of business rules:
i) A customer may generate many invoices
ii) An invoice is generated by only one customer
iii) A training session cannot be scheduled for fewer than 10 employees or for more than 30 employees
38. Business Rule - Purpose
• They help standardize the company's view of data
• They can serve as a communication tool between users and designers
• They allow the designer to understand the nature, role and scope of the data
• They allow the designer to understand business processes
• They allow the database designer to develop appropriate relationship participation rules and constraints and to create an accurate data model
39. Java Database Connectivity (JDBC)
A JDBC driver is a software component that enables a Java application to interact with a database.
There are 4 types of JDBC drivers:
JDBC-ODBC bridge driver
Native-API driver (partially java driver)
Network Protocol driver (fully java driver)
Thin driver (fully java driver)
40. JDBC Driver - JDBC-ODBC bridge
driver
uses an ODBC driver to connect to the database.
converts JDBC method calls into ODBC function calls (e.g. the bridge driver bundled with JDK 1.2).
41. JDBC Driver - Native-API driver
uses the client-side libraries of the database.
converts JDBC method calls into native calls of the
database API
42. JDBC Driver - Network Protocol
driver
uses middleware (an application server) that converts JDBC calls directly or indirectly into the vendor-specific database protocol.
43. JDBC Driver - Thin driver
The thin driver converts JDBC calls directly into
the vendor-specific database protocol. That is why
it is known as thin driver. It is fully written in Java
language
44. Steps to connect a Java Application to
Database
Register the Driver
Create a Connection
Create SQL Statement
Execute SQL Statement
Close the Connection
45. Register the Driver
forName() - to register the driver class
- to dynamically load the driver class
Syntax:
public static void forName(String className)
throws ClassNotFoundException
Example:
Class.forName("oracle.jdbc.driver.OracleDriver");
46. Create the connection object
The getConnection() method of the DriverManager class is used to establish a connection with the database.
Syntax:
1) public static Connection getConnection(String url)
throws SQLException
2) public static Connection getConnection(String url,
String name,String password) throws SQLException
Example:
Connection con=DriverManager.getConnection(
"jdbc:oracle:thin:@localhost:1521:xe","system","password");
47. Create the Statement object
The createStatement() method is used to create a Statement object.
The Statement object is used to execute queries against the database.
Syntax:
public Statement createStatement() throws SQLException
Example:
Statement stmt=con.createStatement();
48. Execute the query
The executeQuery() method is used to execute queries against the database.
It returns a ResultSet object that can be used to get all the records of a table.
Syntax:
public ResultSet executeQuery(String sql)throws SQLException
Example:
ResultSet rs=stmt.executeQuery("select * from emp");
while(rs.next()){
System.out.println(rs.getInt(1)+" "+rs.getString(2)); }
49. Close the connection object
The close() method of the Connection interface is used to close the connection.
Syntax:
public void close()throws SQLException
Example:
con.close();
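Putting the five steps together, here is a minimal end-to-end sketch. It reuses the Oracle thin-driver details shown in the earlier slides; the emp table (with a numeric first column and a string second column), the database user and the password are assumptions for illustration.

import java.sql.*;

public class JdbcDemo {
  public static void main(String[] args) {
    try {
      // Step 1: register the driver
      Class.forName("oracle.jdbc.driver.OracleDriver");

      // Step 2: create the connection (URL, user and password are assumptions)
      Connection con = DriverManager.getConnection(
          "jdbc:oracle:thin:@localhost:1521:xe", "system", "password");

      // Step 3: create the statement
      Statement stmt = con.createStatement();

      // Step 4: execute the query and walk through the result set
      ResultSet rs = stmt.executeQuery("select * from emp");
      while (rs.next()) {
        System.out.println(rs.getInt(1) + " " + rs.getString(2));
      }

      // Step 5: close the connection
      con.close();
    } catch (ClassNotFoundException | SQLException e) {
      e.printStackTrace();
    }
  }
}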
50. Stored Procedures
A stored routine is a set of SQL statements that
can be stored in the server
Steps:
Picking a Delimiter
How to Work with a Stored Procedure
Parameters
Variables
Flow Control Structures
51. Step 1: Picking a Delimiter
The delimiter is the character or string of characters that tells the MySQL client that you have finished typing an SQL statement.
Here "//" is used instead of the default ";".
52. Step 2:Work with a stored procedure
Creating a Stored Procedure
DELIMITER //
CREATE PROCEDURE `p2` ()
LANGUAGE SQL
DETERMINISTIC
SQL SECURITY DEFINER
COMMENT 'A procedure'
BEGIN
SELECT 'Hello World !';
END//
53. Step 2:Work with a stored procedure
Calling a Stored Procedure
Enter the word CALL, followed by the name of the procedure and then parentheses containing all the parameters (variables or values). Parentheses are compulsory.
CALL stored_procedure_name (param1, param2, ....)
CALL procedure1(10, 'string parameter', @parameter_var);
54. Step 3: Parameter
• CREATE PROCEDURE proc1 () : Parameter list is empty
• CREATE PROCEDURE proc1 (IN varname DATATYPE) : One input parameter. The word IN is optional because parameters are IN (input) by default.
• CREATE PROCEDURE proc1 (OUT varname DATATYPE) : One output parameter.
• CREATE PROCEDURE proc1 (INOUT varname DATATYPE) : One parameter which is both input and output.
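Stored procedures with IN and OUT parameters can also be invoked from Java using the JDBC API covered earlier. The sketch below is a hedged example, assuming a hypothetical MySQL procedure proc_sum (IN a INT, OUT total INT) in a local database named test; the driver class, URL and credentials are assumptions as well.

import java.sql.*;

public class CallProcedureDemo {
  public static void main(String[] args) throws Exception {
    // MySQL driver and connection details are assumptions for this sketch
    Class.forName("com.mysql.cj.jdbc.Driver");
    Connection con = DriverManager.getConnection(
        "jdbc:mysql://localhost:3306/test", "root", "password");

    // CallableStatement wraps the CALL syntax shown above
    CallableStatement cs = con.prepareCall("{call proc_sum(?, ?)}");
    cs.setInt(1, 10);                          // bind the IN parameter
    cs.registerOutParameter(2, Types.INTEGER); // declare the OUT parameter
    cs.execute();

    System.out.println("total = " + cs.getInt(2)); // read the OUT value
    con.close();
  }
}

registerOutParameter() tells the driver the SQL type of each OUT parameter before the call is executed, so the value can be read back after execute().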
56. Step 5: Flow Control Structures
DELIMITER //
CREATE PROCEDURE `proc_IF` (IN param1 INT)
BEGIN
  DECLARE variable1 INT;
  SET variable1 = param1 + 1;
  IF variable1 = 0 THEN
    SELECT variable1;
  END IF;
  IF param1 = 0 THEN
    SELECT 'Parameter value = 0';
  ELSE
    SELECT 'Parameter value <> 0';
  END IF;
END //
58. Overview
“90% of the world’s data was generated in the last few years.”
• Big Data is a collection of large datasets that cannot be processed using traditional computing techniques.
59. What Comes Under Big Data?
Big data involves the data produced by different
devices and applications.
Black Box Data
Social Media Data
Stock Exchange Data
Transport Data
Search Engine Data
60. Characteristics
Volume – the amount of data handled by the
application.
Velocity – the rate at which the data flows into the
system.
Variety – different types of data generated
Structured data : Relational data.
Semi Structured data : XML data
Unstructured data : Word, PDF, Text, Media Logs
61. Benefits
Using the information in social media, marketers learn about the response to their campaigns, promotions, and other advertising media.
Using information about consumer preferences, product companies and retail organizations plan their production.
Using the data on the previous medical history of patients, hospitals provide better and quicker service.
62. Big Data Technologies
Big data technologies provide more accurate analysis, which may lead to more concrete decision-making, resulting in greater operational efficiencies, cost reductions, and reduced risks for the business.
They require an infrastructure that can manage and process huge volumes of structured and unstructured data in real time and can protect data privacy and security.
63. Hadoop - Big Data Solutions
Traditional Approach
64. Hadoop - Big Data Solutions
Google’s Solution
MapReduce: divides the task into small parts and
assigns those parts to many computers connected over
the network, and collects the results to form the final
result dataset.
65. Hadoop - Big Data Solutions
Hadoop
Doug Cutting, Mike Cafarella and team took the
solution provided by Google and started an Open
Source Project called HADOOP in 2005.
Two parts:
Storage – HDFS
Processing - MapReduce
68. Hadoop -Architecture
Both Master Node and Slave Nodes contain two Hadoop
Components:
HDFS Component
MapReduce Component
Master Node (HDFS) – Name Node – To store metadata
Slave Node (HDFS) – Data Node – to store actual data
70. Components of Ecosystem
HDFS – Hadoop Distributed File System – storage
MapReduce – data processing
YARN – resource management
Hive – querying and analyzing large datasets
Pig – a scripting platform for analyzing and querying huge datasets, used by programmers
HBase – stores structured data in tables
Sqoop – imports data from external sources
ZooKeeper – coordinates a large cluster of machines
Oozie – a workflow scheduler
71. Hadoop Distributed File System
Holds very large amounts of data and provides easier access.
To store such huge data, the files are stored across multiple machines.
These files are stored in a redundant fashion to rescue the system from possible data loss in case of failure.
72. Features of HDFS
It is suitable for distributed storage and processing.
Hadoop provides a command interface to interact with HDFS.
The built-in servers of the namenode and datanodes help users easily check the status of the cluster.
Streaming access to file system data.
HDFS provides file permissions and authentication.
74. HDFS Architecture Elements
Namenode:
Manages the file system namespace.
Regulates clients' access to files.
It also executes file system operations such as renaming,
closing, and opening files and directories
Datanode:
Datanodes perform read-write operations on the file systems,
as per client request.
They also perform operations such as block creation, deletion,
and replication according to the instructions of the namenode
75. HDFS Architecture Elements
Block:
the user data is stored in the files of HDFS.
The file in a file system will be divided into one or more
segments and/or stored in individual data nodes.
Size: 64 MB to 128 MB
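To make the namenode/datanode/block interaction concrete, here is a small hedged sketch using the HDFS Java FileSystem API. The file path and contents are hypothetical, and the sketch assumes Hadoop libraries on the classpath and a core-site.xml that points fs.defaultFS at the cluster.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsDemo {
  public static void main(String[] args) throws Exception {
    // Picks up fs.defaultFS from the Hadoop configuration on the classpath
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    Path file = new Path("/user/demo/sample.txt"); // hypothetical path

    // Write a small file; HDFS splits larger files into blocks
    // and replicates them across datanodes automatically.
    try (FSDataOutputStream out = fs.create(file, true)) {
      out.writeUTF("hello hdfs");
    }

    // Read the file back; the namenode resolves the path,
    // the datanodes serve the block data.
    try (FSDataInputStream in = fs.open(file)) {
      System.out.println(in.readUTF());
    }

    fs.close();
  }
}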
78. MapReduce
Processes huge amounts of data in a parallel, reliable and efficient way in cluster environments.
Uses the divide-and-conquer technique to process large amounts of data.
It divides the input task into smaller, manageable sub-tasks and executes them in parallel.
Steps:
Map function
Shuffle function
Reduce function
79. MapReduce – Map Function
It takes input tasks and divides them into smaller sub-tasks
Sub steps:
Splitting – takes the input dataset from the source and divides it into smaller sub-datasets.
Mapping – takes those smaller sub-datasets and performs the required action or computation on each sub-dataset.
The output of this Map Function is a set of key and value
pairs as <Key, Value>
80. MapReduce – Shuffle Function
Combine Function
Sub steps:
Merging – combines all key-value pairs that have the same key.
Sorting – takes the output of the merging step and sorts all key-value pairs by key.
The Shuffle Function returns a list of <Key, List<Value>> sorted pairs to the next step.
81. MapReduce – Reduce Function
Takes the list of <Key, List<Value>> sorted pairs from the Shuffle Function and performs the reduce operation (for example, aggregation) on each list to produce the final output.
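As an illustration of the Map, Shuffle and Reduce steps, below is a sketch of the classic word-count job written against the Hadoop MapReduce Java API. The input and output HDFS paths are passed as command-line arguments; Hadoop libraries on the classpath are assumed.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map: split each input line into words and emit <word, 1>
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reduce: receive <word, [1, 1, ...]> after shuffle/sort and emit <word, count>
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

The framework performs the shuffle step between the two classes: it merges and sorts the <word, 1> pairs emitted by the mapper so that the reducer receives <word, [1, 1, ...]>.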
83. YARN
Yet Another Resource Negotiator
A resource-management layer that enables Hadoop to improve its distributed processing capabilities.
ResourceManager: communicates with the clients, tracks resources on the cluster, and tracks jobs by assigning tasks to NodeManagers.
Its entities:
Scheduler – schedules resources to the various tasks.
ApplicationMaster – negotiates resources from the Scheduler and works with the NodeManagers.
84. YARN
NodeManager:
Launches and tracks tasks on DataNodes.
Container: a portion of the NodeManager's capacity, used by the client for running a program.
86. NoSQL
NoSQL stands for "Not only SQL".
It covers all databases and data stores that are not based on relational database management system (RDBMS) principles.
relates to large data sets accessed and manipulated on a
Web scale
new classes of database products consist of column-based
data stores, key/value pair databases, and document
databases
87. NoSQL - Types
Key-Value database: a big hash table of keys and values
Document-based database: stores documents made up of
tagged elements.
Column-based database: each storage block contains data
from only one column.
Graph-based database: a network database that uses nodes to represent and store data and edges to represent the relationships between them.
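As a toy illustration of the key-value model only (not code for any particular NoSQL product), the sketch below keeps JSON-like documents in an in-memory map: the values are opaque to the store, and records are addressed solely by their key.

import java.util.HashMap;
import java.util.Map;

public class KeyValueSketch {
  public static void main(String[] args) {
    // The "database": every record is an opaque value addressed only by its key
    Map<String, String> store = new HashMap<>();

    // Values here are JSON-like documents, but the store never inspects them
    store.put("customer:101", "{\"name\":\"Herschel\",\"city\":\"New Delhi\"}");
    store.put("customer:102", "{\"name\":\"Jon\",\"city\":\"Kanpur\"}");

    // Lookups are by key only; there is no query over fields inside the value
    System.out.println(store.get("customer:102"));
  }
}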
88. HIVE
Hive is a data warehouse infrastructure tool to process
structured data in Hadoop. It resides on top of Hadoop to
summarize Big Data, and makes querying and analyzing
easy.
It was initially developed by Facebook; later the Apache Software Foundation took it up and developed it further as an open-source project under the name Apache Hive.
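As a hedged sketch of how a client can submit a HiveQL query, the example below uses the Hive JDBC driver for HiveServer2. The endpoint localhost:10000, the empty credentials and the logs table are assumptions for illustration.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryDemo {
  public static void main(String[] args) throws Exception {
    // Hive JDBC driver (assumes hive-jdbc and its dependencies on the classpath)
    Class.forName("org.apache.hive.jdbc.HiveDriver");

    // HiveServer2 endpoint and database are assumptions for this sketch
    Connection con = DriverManager.getConnection(
        "jdbc:hive2://localhost:10000/default", "", "");

    Statement stmt = con.createStatement();

    // HiveQL looks like SQL but is compiled by Hive into jobs on the cluster
    ResultSet rs = stmt.executeQuery(
        "SELECT level, COUNT(*) FROM logs GROUP BY level");
    while (rs.next()) {
      System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
    }

    con.close();
  }
}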
90. HIVE – Data Flow
The query is executed from the UI.
The driver interacts with the compiler to get the plan.
The compiler creates the plan for the job to be executed.
The compiler sends a metadata request to the metastore.
The metastore sends the metadata information back to the compiler.
The compiler communicates the proposed plan to the driver for executing the query.
The driver sends the execution plan to the execution engine.
91. HIVE – Data Flow
The Execution Engine (EE) acts as a bridge between Hive and Hadoop to process the query.
For DFS operations, the EE first contacts the NameNode and then the DataNodes to get the values stored in the tables.
It collects the actual data from the DataNodes related to the query.
It also communicates bidirectionally with the metastore in Hive to perform DDL (Data Definition Language) operations.
92. HIVE – Data Flow
Fetching and returning the results:
Once the results are fetched from the DataNodes, the execution engine sends them back to the driver, which returns them to the UI (front end).
93. Hive Vs Relational Databases:-
Relational databases follow "schema on read and schema on write"; Hive follows "schema on read" only.
Hive supports a "write once, read many" pattern.