2. OBJECTIVES
• To expose students to the basics of managing information
• To explore the various aspects of database design and modelling
• To examine the basic issues in information governance and information integration
• To gain an overview of information architecture
3. UNIT I DATABASE MODELLING, MANAGEMENT AND DEVELOPMENT
Database design and modelling - Business Rules and Relationships; Java Database Connectivity (JDBC), Database Connection Manager, Stored Procedures. Trends in Big Data systems including NoSQL – Hadoop, HDFS, MapReduce, Hive, and enhancements.
4. Introduction
• Database
– A central repository of data
– Stores data in a structured manner
• Database Design
– The process of defining the structure of a database
5. Data Model
A collection of conceptual tools for describing data, data relationships, and consistency constraints.
Represents the nature of the data, the business rules governing the data, and how the data will be organized in the database.
6. Data Model
Levels:
Conceptual – describes WHAT the system contains
Logical – describes HOW the system will be implemented, regardless of the DBMS
Physical – describes HOW the system will be implemented using a specific DBMS
7. Data Model - Element
Element – Definition
Entity – a real-world thing or an interaction between two or more real-world things
Attribute – a piece of information we need to know about an entity
Relationship – how entities depend on each other, in terms of why they depend on each other and what that relationship is
8. Data Model - Example
Customer and Product are entities
Customer: customer name, customer id
Product: product name, price
Sale – Relationship
9. Data Model - Types
Entity Relationship Models
Unified Modeling Language
11. Components of ER Diagram
Component – Symbol
Entity – Rectangle
Relationship – Diamond
Attribute of an entity – Ellipse
Key attribute of an entity – Ellipse with the attribute name underlined
12. Components of ER Diagram
Component – Symbol
Derived attribute of an entity – Dotted ellipse inside the main ellipse
Multivalued attribute of an entity – Double ellipse
13. ER Diagram - Entity
Component – Example
Entity – Employee, Manager, Department
Weak Entity – an entity that depends on another entity
14. ER Diagram - Attribute
Component – Description
Attribute (Name, Age, Address) – a property or characteristic of an entity
Key Attribute – the main characteristic of an entity, which uniquely identifies it
Composite Attribute – an attribute that has attributes of its own
15. ER Diagram - Relationship
Component – Description
One to One Relationship – one student can enroll in only one course, and a course can have only one student
One to Many Relationship – one student can opt for many courses
Many to One Relationship – a student enrolls in only one course, but a course can have many students
16. ER Diagram - Relationship
Component – Description
Many to Many Relationship – one student can enroll in more than one course, and a course can have more than one student enrolled in it
17. ER Diagram - Example
College Database: Statements
A college contains many departments
Each department can offer any number of courses
Many instructors can work in a department
An instructor can work only in one department
For each department there is a Head
An instructor can be head of only one department
Each instructor can take any number of courses
A course can be taken by only one instructor
A student can enroll for any number of courses
Each course can have any number of students
18. ER Diagram - Steps
Identify the Entities
Identify the relationships
Identify the key attributes
Identify other relevant attributes
19. ER Diagram - Example
Step 1 : Identify the Entities
Department
Course
Instructor
Student
20. ER Diagram - Example
Step 2 : Identify the relationships
Department and Course - One to Many (1:N)
Department and Instructor - One to Many (1:N)
Department and Head - One to One (1:1)
Course and student - Many to Many (M:N)
Course and instructor - Many to One (N :1)
21. ER Diagram - Example
Step 3: Identify the key attributes
Department_Name is the key attribute for the Entity
"Department".
Course_ID is the key attribute for "Course" Entity.
Student_ID is the key attribute for "Student" Entity.
Instructor_ID is the key attribute for "Instructor" Entity.
22. ER Diagram - Example
Step 4: Identify other relevant attributes
For the Department entity, another attribute is location
For the Course entity, other attributes are course_name and duration
For the Instructor entity, other attributes are first_name, last_name and phone
For the Student entity, other attributes are first_name, last_name and phone
23. ER Diagram - Example
Step 5: Draw complete ER diagram
25. NORMALIZATION
A process of organizing the data in a database to avoid data redundancy.
Forms:
First normal form (1NF)
Second normal form (2NF)
Third normal form (3NF)
Boyce-Codd normal form (BCNF)
26. NORMALIZATION
First normal form(1NF) :
an attribute (column) of a table cannot hold multiple
values. It should hold only atomic values.
emp_id emp_name emp_address emp_mobile
101 Herschel New Delhi 8912312390
102 Jon Kanpur 8812121212, 9900012222
103 Ron Chennai 7778881212
104 Lester Bangalore 9990000123, 8123450987
(The emp_mobile column holds multiple values for Jon and Lester, so this table violates 1NF.)
27. NORMALIZATION
First normal form(1NF) :
each attribute of a table must have atomic (single)
values.
emp_id emp_name emp_address emp_mobile
101 Herschel New Delhi 8912312390
102 Jon Kanpur 8812121212
102 Jon Kanpur 9900012222
103 Ron Chennai 7778881212
104 Lester Bangalore 9990000123
104 Lester Bangalore 8123450987
28. NORMALIZATION
Second normal form (2NF): A table is said to be in 2NF if both of the following conditions hold:
Table is in 1NF (First normal form)
Every non-key column depends on the whole of the table's primary key (no partial dependencies)
30. NORMALIZATION – 2NF
The original table below is in first normal form, but teacher_age depends only on teacher_id, which is only part of the primary key {teacher_id, subject}. Splitting the table so that every non-key column depends on the whole primary key puts it in 2NF.

Original table:
teacher_id subject teacher_age
111 Maths 38
111 Physics 38
222 Biology 38
333 Physics 40
333 Chemistry 40

teacher table:
teacher_id teacher_age
111 38
222 38
333 40

teacher_subject table:
teacher_id subject
111 Maths
111 Physics
222 Biology
333 Physics
333 Chemistry
31. NORMALIZATION
A table design is said to be in 3NF if both the
following conditions hold:
Table must be in 2NF
Transitive functional dependency of non-prime attribute
on any super key should be removed.
32. NORMALIZATION – 3NF
Emp_id Emp_name Emp_zip Emp_state Emp_city Emp_district
1001 John 282005 UP Agra Dayal Bagh
1002 Ajeet 222008 TN Chennai M-City
1006 Lora 282007 TN Chennai Urrapakkam
1101 Lilly 292008 UK Pauri Bhagwan
1201 Steve 222999 MP Gwalior Ratan
33. NORMALIZATION – 3NF
In the table above, emp_state, emp_city and emp_district depend on emp_zip, and emp_zip depends on emp_id, so there is a transitive dependency. Splitting the table on emp_zip removes it.
employee table:
emp_id emp_name emp_zip
1001 John 282005
1002 Ajeet 222008
1006 Lora 282007
1101 Lilly 292008
1201 Steve 222999
employee_zip table:
emp_zip emp_state emp_city emp_district
282005 UP Agra Dayal Bagh
222008 TN Chennai M-City
282007 TN Chennai Urrapakkam
292008 UK Pauri Bhagwan
222999 MP Gwalior Ratan
34. NORMALIZATION
Boyce Codd normal form:
A table is in BCNF if it is in 3NF and, for every functional dependency X -> Y, X is a super key of the table.
emp_id emp_nationality emp_dept dept_type dept_no_of_emp
1001 Austrian Production and planning D001 200
1001 Austrian stores D001 250
1002 American design and technical support D134 100
1002 American Purchasing department D134 600
35. NORMALIZATION
Functional dependencies in the table above:
emp_id -> emp_nationality
emp_dept -> {dept_type, dept_no_of_emp}
The candidate key is {emp_id, emp_dept}. Since neither emp_id nor emp_dept alone is a super key, the table is not in BCNF and must be decomposed.
emp_id emp_nationality emp_dept dept_type dept_no_of_emp
1001 Austrian Production and planning D001 200
1001 Austrian stores D001 250
1002 American design and technical support D134 100
1002 American Purchasing department D134 600
36. NORMALIZATION - BCNF
emp_nationality table:
emp_id emp_nationality
1001 Austrian
1002 American

emp_dept table:
emp_dept dept_type dept_no_of_emp
Production and planning D001 200
stores D001 250
design and technical support D134 100
Purchasing department D134 600

emp_dept_mapping table:
emp_id emp_dept
1001 Production and planning
1001 stores
1002 design and technical support
1002 Purchasing department
37. Business Rule
A brief, precise and unambiguous description of a policy, procedure, or principle within a specific organization.
Some examples of business rules:
i) A customer may generate many invoices
ii) An invoice is generated by only one customer
iii) A training session cannot be scheduled for fewer than 10 employees or for more than 30 employees
38. Business Rule - Purpose
• They help standardize the company's view of data
• They can serve as a communication tool between users and designers
• They allow the designer to understand the nature, role and scope of the data
• They allow the designer to understand business processes
• They allow the database designer to develop appropriate relationship participation rules and constraints and to create an accurate data model
39. Java Database Connectivity (JDBC)
A JDBC driver is a software component that enables a Java application to interact with a database.
There are 4 types of JDBC drivers:
JDBC-ODBC bridge driver
Native-API driver (partially java driver)
Network Protocol driver (fully java driver)
Thin driver (fully java driver)
40. JDBC Driver - JDBC-ODBC bridge
driver
uses an ODBC driver to connect to the database.
converts JDBC method calls into ODBC function calls (e.g. the bridge driver bundled with JDK 1.2).
41. JDBC Driver - Native-API driver
uses the client-side libraries of the database.
converts JDBC method calls into native calls of the
database API
42. JDBC Driver - Network Protocol
driver
uses middleware (an application server) that converts JDBC calls directly or indirectly into the vendor-specific database protocol.
43. JDBC Driver - Thin driver
The thin driver converts JDBC calls directly into
the vendor-specific database protocol. That is why
it is known as thin driver. It is fully written in Java
language
44. Steps to connect a Java Application to
Database
Register the Driver
Create a Connection
Create SQL Statement
Execute SQL Statement
Close the Connection
45. Register the Driver
forName() - to register the driver class
- to dynamically load the driver class
Syntax:
public static void forName(String className)
throws ClassNotFoundException
Example:
Class.forName("oracle.jdbc.driver.OracleDriver");
46. Create the connection object
The getConnection() method of the DriverManager class is used to establish a connection with the database.
Syntax:
1) public static Connection getConnection(String url)
throws SQLException
2) public static Connection getConnection(String url,
String name,String password) throws SQLException
Example:
Connection con=DriverManager.getConnection(
"jdbc:oracle:thin:@localhost:1521:xe","system","password");
47. Create the Statement object
The createStatement() method is used to create a Statement object.
The Statement object is used to execute queries against the database.
Syntax:
public Statement createStatement() throws SQLException
Example:
Statement stmt=con.createStatement();
48. Execute the query
The executeQuery() method is used to execute queries against the database.
It returns a ResultSet object that can be used to get all the records of a table.
Syntax:
public ResultSet executeQuery(String sql)throws SQLException
Example:
ResultSet rs=stmt.executeQuery("select * from emp");
while(rs.next()){
System.out.println(rs.getInt(1)+" "+rs.getString(2)); }
49. Close the connection object
The close() method of the Connection interface is used to close the connection.
Syntax:
public void close()throws SQLException
Example:
con.close();
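Putting the five steps together, here is a minimal end-to-end sketch. It reuses the Oracle thin-driver details shown in the earlier slides; the emp table (with a numeric first column and a string second column), the database user and the password are assumptions for illustration.

import java.sql.*;

public class JdbcDemo {
  public static void main(String[] args) {
    try {
      // Step 1: register the driver
      Class.forName("oracle.jdbc.driver.OracleDriver");

      // Step 2: create the connection (URL, user and password are assumptions)
      Connection con = DriverManager.getConnection(
          "jdbc:oracle:thin:@localhost:1521:xe", "system", "password");

      // Step 3: create the statement
      Statement stmt = con.createStatement();

      // Step 4: execute the query and walk through the result set
      ResultSet rs = stmt.executeQuery("select * from emp");
      while (rs.next()) {
        System.out.println(rs.getInt(1) + " " + rs.getString(2));
      }

      // Step 5: close the connection
      con.close();
    } catch (ClassNotFoundException | SQLException e) {
      e.printStackTrace();
    }
  }
}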
50. Stored Procedures
A stored routine is a set of SQL statements that
can be stored in the server
Steps:
Picking a Delimiter
How to Work with a Stored Procedure
Parameters
Variables
Flow Control Structures
51. Step 1: Picking a Delimiter
The delimiter is the character or string of characters that tells the MySQL client that you have finished typing an SQL statement.
Here "//" is used instead of the default ";".
52. Step 2:Work with a stored procedure
Creating a Stored Procedure
DELIMITER //
CREATE PROCEDURE `p2` ()
LANGUAGE SQL
DETERMINISTIC
SQL SECURITY DEFINER
COMMENT 'A procedure'
BEGIN
SELECT 'Hello World !';
END//
53. Step 2:Work with a stored procedure
Calling a Stored Procedure
Enter the word CALL, followed by the name of the procedure and then parentheses containing all the parameters (variables or values). Parentheses are compulsory.
CALL stored_procedure_name (param1, param2, ....)
CALL procedure1(10, 'string parameter', @parameter_var);
54. Step 3: Parameter
• CREATE PROCEDURE proc1 () : Parameter list is empty
• CREATE PROCEDURE proc1 (IN varname DATATYPE) : One input parameter. The word IN is optional because parameters are IN (input) by default.
• CREATE PROCEDURE proc1 (OUT varname DATATYPE) : One output parameter.
• CREATE PROCEDURE proc1 (INOUT varname DATATYPE) : One parameter which is both input and output.
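Stored procedures with IN and OUT parameters can also be invoked from Java using the JDBC API covered earlier. The sketch below is a hedged example, assuming a hypothetical MySQL procedure proc_sum (IN a INT, OUT total INT) in a local database named test; the driver class, URL and credentials are assumptions as well.

import java.sql.*;

public class CallProcedureDemo {
  public static void main(String[] args) throws Exception {
    // MySQL driver and connection details are assumptions for this sketch
    Class.forName("com.mysql.cj.jdbc.Driver");
    Connection con = DriverManager.getConnection(
        "jdbc:mysql://localhost:3306/test", "root", "password");

    // CallableStatement wraps the CALL syntax shown above
    CallableStatement cs = con.prepareCall("{call proc_sum(?, ?)}");
    cs.setInt(1, 10);                          // bind the IN parameter
    cs.registerOutParameter(2, Types.INTEGER); // declare the OUT parameter
    cs.execute();

    System.out.println("total = " + cs.getInt(2)); // read the OUT value
    con.close();
  }
}

registerOutParameter() tells the driver the SQL type of each OUT parameter before the call is executed, so the value can be read back after execute().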
56. Step 5: Flow Control Structures
DELIMITER //
CREATE PROCEDURE `proc_IF` (IN param1 INT)
BEGIN
  DECLARE variable1 INT;
  SET variable1 = param1 + 1;
  IF variable1 = 0 THEN
    SELECT variable1;
  END IF;
  IF param1 = 0 THEN
    SELECT 'Parameter value = 0';
  ELSE
    SELECT 'Parameter value <> 0';
  END IF;
END //
58. Overview
“90% of the world’s data was generated in the last few years.”
• Big Data is a collection of large datasets that cannot be processed using traditional computing techniques.
59. What Comes Under Big Data?
Big data involves the data produced by different
devices and applications.
Black Box Data
Social Media Data
Stock Exchange Data
Transport Data
Search Engine Data
60. Characteristics
Volume – the amount of data handled by the
application.
Velocity – the rate at which the data flows into the
system.
Variety – different types of data generated
Structured data : Relational data.
Semi Structured data : XML data
Unstructured data : Word, PDF, Text, Media Logs
61. Benefits
Using the information in social media, marketers learn about the response to their campaigns, promotions, and other advertising media.
Using information about consumer preferences, product companies and retail organizations plan their production.
Using the data on the previous medical history of patients, hospitals provide better and quicker service.
62. Big Data Technologies
Big data technologies provide more accurate analysis, which may lead to more concrete decision-making, resulting in greater operational efficiencies, cost reductions, and reduced risks for the business.
They require an infrastructure that can manage and process huge volumes of structured and unstructured data in real time and can protect data privacy and security.
63. Hadoop - Big Data Solutions
Traditional Approach
64. Hadoop - Big Data Solutions
Google’s Solution
MapReduce: divides the task into small parts and
assigns those parts to many computers connected over
the network, and collects the results to form the final
result dataset.
65. Hadoop - Big Data Solutions
Hadoop
Doug Cutting, Mike Cafarella and team took the
solution provided by Google and started an Open
Source Project called HADOOP in 2005.
Two parts:
Storage – HDFS
Processing - MapReduce
68. Hadoop -Architecture
Both Master Node and Slave Nodes contain two Hadoop
Components:
HDFS Component
MapReduce Component
Master Node (HDFS) – Name Node – To store metadata
Slave Node (HDFS) – Data Node – to store actual data
70. Components of Ecosystem
HDFS – Hadoop Distributed File System – storage
MapReduce – data processing
YARN – resource management
Hive – querying and analyzing large datasets
Pig – a scripting platform for analyzing and querying huge datasets, used by programmers
HBase – stores structured data in tables
Sqoop – imports data from external sources
ZooKeeper – coordinates a large cluster of machines
Oozie – a workflow scheduler
71. Hadoop Distributed File System
Holds very large amounts of data and provides easier access.
To store such huge data, the files are stored across multiple machines.
These files are stored in a redundant fashion to rescue the system from possible data loss in case of failure.
72. Features of HDFS
It is suitable for distributed storage and processing.
Hadoop provides a command interface to interact with HDFS.
The built-in servers of the namenode and datanodes help users easily check the status of the cluster.
Streaming access to file system data.
HDFS provides file permissions and authentication.
74. HDFS Architecture Elements
Namenode:
Manages the file system namespace.
Regulates clients' access to files.
It also executes file system operations such as renaming,
closing, and opening files and directories
Datanode:
Datanodes perform read-write operations on the file systems,
as per client request.
They also perform operations such as block creation, deletion,
and replication according to the instructions of the namenode
75. HDFS Architecture Elements
Block:
the user data is stored in the files of HDFS.
The file in a file system will be divided into one or more
segments and/or stored in individual data nodes.
Size: 64 MB to 128 MB
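To make the namenode/datanode/block interaction concrete, here is a small hedged sketch using the HDFS Java FileSystem API. The file path and contents are hypothetical, and the sketch assumes Hadoop libraries on the classpath and a core-site.xml that points fs.defaultFS at the cluster.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsDemo {
  public static void main(String[] args) throws Exception {
    // Picks up fs.defaultFS from the Hadoop configuration on the classpath
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    Path file = new Path("/user/demo/sample.txt"); // hypothetical path

    // Write a small file; HDFS splits larger files into blocks
    // and replicates them across datanodes automatically.
    try (FSDataOutputStream out = fs.create(file, true)) {
      out.writeUTF("hello hdfs");
    }

    // Read the file back; the namenode resolves the path,
    // the datanodes serve the block data.
    try (FSDataInputStream in = fs.open(file)) {
      System.out.println(in.readUTF());
    }

    fs.close();
  }
}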
78. MapReduce
Processes huge amounts of data in a parallel, reliable and efficient way in cluster environments.
Uses the divide-and-conquer technique to process large amounts of data.
It divides the input task into smaller, manageable sub-tasks and executes them in parallel.
Steps:
Map function
Shuffle function
Reduce function
79. MapReduce – Map Function
It takes input tasks and divides them into smaller sub-tasks
Sub steps:
Splitting – takes the input dataset from the source and divides it into smaller sub-datasets.
Mapping – takes those smaller sub-datasets and performs the required action or computation on each sub-dataset.
The output of this Map Function is a set of key and value
pairs as <Key, Value>
80. MapReduce – Shuffle Function
Combine Function
Sub steps:
Merging – combines all key-value pairs that have the same key.
Sorting – takes the output of the merging step and sorts all key-value pairs by key.
The Shuffle Function returns a list of <Key, List<Value>> sorted pairs to the next step.
81. MapReduce – Reduce Function
Takes the list of <Key, List<Value>> sorted pairs from the Shuffle Function and performs the reduce operation (for example, aggregation) on each list to produce the final output.
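As an illustration of the Map, Shuffle and Reduce steps, below is a sketch of the classic word-count job written against the Hadoop MapReduce Java API. The input and output HDFS paths are passed as command-line arguments; Hadoop libraries on the classpath are assumed.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map: split each input line into words and emit <word, 1>
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reduce: receive <word, [1, 1, ...]> after shuffle/sort and emit <word, count>
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

The framework performs the shuffle step between the two classes: it merges and sorts the <word, 1> pairs emitted by the mapper so that the reducer receives <word, [1, 1, ...]>.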
83. YARN
Yet Another Resource Negotiator
A resource-management layer that enables Hadoop to improve its distributed processing capabilities.
ResourceManager: communicates with the clients, tracks resources on the cluster, and tracks jobs by assigning tasks to NodeManagers.
Its entities:
Scheduler – schedules resources to the various tasks.
ApplicationMaster – negotiates resources from the Scheduler and works with the NodeManagers.
84. YARN
NodeManager:
Launches and tracks tasks on DataNodes.
Container: a portion of the NodeManager's capacity, used by the client for running a program.
86. NoSQL
NoSQL stands for "Not only SQL".
It covers all databases and data stores that are not based on relational database management system (RDBMS) principles.
relates to large data sets accessed and manipulated on a
Web scale
new classes of database products consist of column-based
data stores, key/value pair databases, and document
databases
87. NoSQL - Types
Key-Value database: a big hash table of keys and values
Document-based database: stores documents made up of
tagged elements.
Column-based database: each storage block contains data
from only one column.
Graph-based database: a network database that uses nodes to represent and store data and edges to represent the relationships between them.
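As a toy illustration of the key-value model only (not code for any particular NoSQL product), the sketch below keeps JSON-like documents in an in-memory map: the values are opaque to the store, and records are addressed solely by their key.

import java.util.HashMap;
import java.util.Map;

public class KeyValueSketch {
  public static void main(String[] args) {
    // The "database": every record is an opaque value addressed only by its key
    Map<String, String> store = new HashMap<>();

    // Values here are JSON-like documents, but the store never inspects them
    store.put("customer:101", "{\"name\":\"Herschel\",\"city\":\"New Delhi\"}");
    store.put("customer:102", "{\"name\":\"Jon\",\"city\":\"Kanpur\"}");

    // Lookups are by key only; there is no query over fields inside the value
    System.out.println(store.get("customer:102"));
  }
}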
88. HIVE
Hive is a data warehouse infrastructure tool to process
structured data in Hadoop. It resides on top of Hadoop to
summarize Big Data, and makes querying and analyzing
easy.
It was initially developed by Facebook; later the Apache Software Foundation took it up and developed it further as an open-source project under the name Apache Hive.
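As a hedged sketch of how a client can submit a HiveQL query, the example below uses the Hive JDBC driver for HiveServer2. The endpoint localhost:10000, the empty credentials and the logs table are assumptions for illustration.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryDemo {
  public static void main(String[] args) throws Exception {
    // Hive JDBC driver (assumes hive-jdbc and its dependencies on the classpath)
    Class.forName("org.apache.hive.jdbc.HiveDriver");

    // HiveServer2 endpoint and database are assumptions for this sketch
    Connection con = DriverManager.getConnection(
        "jdbc:hive2://localhost:10000/default", "", "");

    Statement stmt = con.createStatement();

    // HiveQL looks like SQL but is compiled by Hive into jobs on the cluster
    ResultSet rs = stmt.executeQuery(
        "SELECT level, COUNT(*) FROM logs GROUP BY level");
    while (rs.next()) {
      System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
    }

    con.close();
  }
}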
90. HIVE – Data Flow
The query is executed from the UI.
The driver interacts with the compiler to get the plan.
The compiler creates the plan for the job to be executed.
The compiler sends a metadata request to the metastore.
The metastore sends the metadata information back to the compiler.
The compiler communicates the proposed plan to the driver for executing the query.
The driver sends the execution plan to the execution engine.
91. HIVE – Data Flow
The Execution Engine (EE) acts as a bridge between Hive and Hadoop to process the query.
For DFS operations, the EE first contacts the NameNode and then the DataNodes to get the values stored in the tables.
It collects the actual data from the DataNodes related to the query.
It also communicates bidirectionally with the metastore in Hive to perform DDL (Data Definition Language) operations.
92. HIVE – Data Flow
Fetching and returning the results:
Once the results are fetched from the DataNodes, the execution engine sends them back to the driver, which returns them to the UI (front end).
93. Hive Vs Relational Databases:-
Relational databases follow "schema on read and schema on write"; Hive follows "schema on read" only.
Hive supports a "write once, read many" pattern.