IT6701 – Information Management
Unit I – Database Modelling,
Management and Development
By
Kaviya.P, AP/IT
Kamaraj College of Engineering & Technology
Unit I – Database Modelling,
Management and Development
Database design and modelling - Business Rules and
Relationship; Java database Connectivity (JDBC),
Database connection Manager, Stored Procedures.
Trends in Big Data systems including NoSQL -
Hadoop HDFS, MapReduce, Hive, and
enhancements.
Database Design and Modelling
• Database Design
– Process of producing and representing a database in a particular model.
– Process of defining the structure of a database.
– Data modelling is the first step in database design.
• Levels of Abstraction
– Conceptual database design
– Logical database design
– Physical database design
Database Design and Modelling
• Conceptual database design - (What is represented in the database?)
– An abstract model is created from business rules and user requirements.
– Entity-Relationship (ER) Model is used to represent the conceptual design.
• Entity – Real things in the world
• Relationships – Reflects interactions between entities
• Attributes – Properties of entities and relationships
• Logical database design - (Logical representation and Relational model)
– ER Model is converted into a relational model through logical database
design.
– The data are arranged into logical structures and mapped into DBMS
tables with accompanying constraints.
Database Design and Modelling
• Physical database design
– Actual physical implementation
of the database in a Database
Management Systems.
– It includes the description of data features, data types, indexing, etc.
– It describes how the information is represented in the database and
how data structures are implemented to represent what is modelled.
Database Design and Modelling
Database Modelling – ER Model
– A conceptual design tool which describes data as
entities, relationships, and attributes.
– Diagrammatic representation of the model.
– Entity: Real world thing. (Eg: person, student, car)
– Entity Set: Collection of entities of similar type.
(Eg: the set of all students enrolled in a course)
– Attributes: Properties that describe the entity.
Database Design and Modelling
Database Modelling – ER Model
Types of Attributes:
– Composite Attribute: Combination of
multiple attributes. (Eg: Address includes
street, city, zip_code).
– Simple Attribute: One which cannot be
decomposed into smaller units. (Eg:
Age)
– Single Valued Attributes: Can hold a
single value. (Eg: Rank)
– Multi valued Attributes: Can store
multiple values. (Eg: Mobile No.)
Database Design and Modelling
Database Modelling – ER Model
Types of Attributes:
– Stored Attributes: Attributes
whose values are stored in the
database. (Eg: DoB)
– Derived Attributes: Attributes
whose values are calculated from
one or more attributes in the
database. (Eg: Age can be
calculated from DoB)
Database Design and Modelling
Database Modelling – ER Model
Types of Attributes:
– Null values: Used when the value for certain instances of an entity does not
exist or is not available.
– Complex Attributes: Attributes formed by nesting composite and
multi-valued attributes. (Eg: a person may have more than one residence,
and each residence is composed of street, city, and zip_code)
Database Design and Modelling
Database Modelling – ER Model
Relationship:
– Whenever an attribute of one entity
type refers to another entity type, a
relationship exists.
– Degree of relationship:
• Binary: A relationship of degree
two.
• Ternary: A relationship of degree
three.
• n-ary: A relationship of degree n, in which n entities participate.
Database Design and Modelling
Database Modelling – ER Model
Constraints
• Cardinality ratio – The maximum
number of relationship instances
that an entity can participate in.
– 1:1 relationship
– 1:N relationship
– N:1 relationship
– M:N relationship
Database Design and Modelling
Database Modelling – ER Model
Constraints
• Participation constraint – It specifies whether the existence of an entity
depends on it being related to another entity.
– Total/Mandatory participation: If the existence of an entity is determined
through its participation in a relationship. (Eg: a student must enroll in a
course)
– Partial/Optional participation: If only a part of the set of entities participates
in a relationship. (Eg: not every teaching_staff member will be the HoD of a
department)
Database Design and Modelling
Database Modelling – ER Model - Keys
• Keys: Allow us to identify a particular entity.
• Super key: A super key is a set of one or more
attributes (columns), which can uniquely identify a row
in a table.
• Candidate key: A minimal set of attributes
which can uniquely identify a tuple.
(Minimal subset of a super key)
• Primary key: An attribute (or set of attributes)
chosen to uniquely identify a particular
instance in the database.
• Foreign key (Referential Integrity): An attribute
that refers to the primary key of another table; if
multiple references exist, then any update or
modification should be reflected in all other places.
Database Design and Modelling
Database Modelling – ER Model
[Figures: ER diagram notation for one-to-one, one-to-many, many-to-one,
and many-to-many cardinalities, and for participation constraints]
Database Design and Modelling
Database Modelling – ER Model
Database Design and Modelling
Database Modelling – Extended ER Model
• Specialisation: The result of taking a subset of a higher-level entity set to
form a lower-level entity set. (Eg: Person -> Customer, Employee)
• Generalisation: The result of taking the union of two or more disjoint
entity sets to produce a higher-level entity set. (Eg: Customer, Employee
-> Person)
• Aggregation: An abstraction in which relationship sets are treated as
higher-level entity sets and can participate in relationships.
Database Design and Modelling
Database Modelling – Case Study: Hospital Management System
Business Rules
• Database design is an important phase in the system development life cycle.
• The inputs to design phase will be the business rules and functions identified
in the requirement gathering phase.
• Business rules are used to describe various aspects of the business domain.
(Eg: Students need to be enrolled in a course before appearing for its
examination)
• The following are examples of business rules:
– The explanation of a concept relevant to the application. (A course is evaluated
through theory + practical examinations)
– An integrity constraint on the data of the application. (The minimum mark to pass a
course is 50%)
– A derivation rule, whereby information can be derived from other information.
(The grade of a student is assigned based on the marks obtained)
Business Rules
Identifying Business Rules
• Business rules allow the database designer to develop relationship
rules and constraints and help in the creation of a correct data
model.
• They are a good communication tool between users and designers.
• They give a proper classification of entities, attributes, relationships,
and constraints.
• A noun in a business rule is transformed into an entity in
the model, and a verb (active or passive) is interpreted as a
relationship among entities.
Java Database Connectivity (JDBC)
• The JDBC API (Application Programming Interface) provides a
way for creating database connections from Java programmes.
• It provides methods to execute SQL statements and process the
results obtained from those statements.
• Types of JDBC drivers
– Type 1 – JDBC ODBC Bridge Driver
– Type 2 – Java Native Driver
– Type 3 – Java Network Protocol Driver
– Type 4 – Pure Java Driver
Java Database Connectivity (JDBC)
Type 1 – JDBC ODBC Bridge Driver
• It provides a bridge to access the ODBC drivers installed on each client machine.
• This bridge translates the standard JDBC calls to corresponding ODBC calls and
sends them to the ODBC data source via ODBC libraries.
• This driver requires that native ODBC libraries, drivers and their required support
files be installed and configured on each client machine.
• They are the slowest of all types due to multiple levels of translation.
Java Database Connectivity (JDBC)
Type 2 – Java Native Driver
• It mainly uses the Java Native Interface (JNI) to translate calls to the local database API.
• The JDBC calls are translated into vendor-specific API calls which act as a façade for
forwarding requests between application and database.
• Type 2 drivers are usually faster than Type 1.
• Similar to Type 1 drivers, these drivers also require native libraries to be installed and
configured on each client machine.
Java Database Connectivity (JDBC)
Type 3 – Java Network Protocol Driver
• It uses an intermediate driver listener that acts as a gateway for multiple database servers.
• The Java client sends JDBC requests to the listener, which in turn connects to the database
server using another driver.
• It does not require any installation on the client side, which is why it is preferred over the
first two types of drivers.
Java Database Connectivity (JDBC)
Type 4 – Pure Java Driver
• It is the most commonly used JDBC driver in enterprise applications because it
converts JDBC API calls to direct network calls using vendor-specific implementation
details.
• Type 4 drivers offer better performance compared to the other types and also do not
require any installation or configuration on the client machine.
Java Database Connectivity (JDBC)
Accessing Database using JDBC
• Steps:
– Import JDBC Packages - import java.sql.*;
– Register the JDBC Driver - Class.forName()
Eg: Class.forName("oracle.jdbc.driver.OracleDriver");
– Creating a database connection
Eg: String url = "jdbc:oracle:thin:@localhost:1521:xe";
Connection con = DriverManager.getConnection(url, user, password);
Overloads of DriverManager.getConnection():
getConnection(String url)
getConnection(String url, Properties prop)
getConnection(String url, String user, String password)
Java Database Connectivity (JDBC)
Accessing Database using JDBC
• Steps:
– Executing queries
Eg: Statement st = con.createStatement();
int m = st.executeUpdate(sql);
Interfaces and their recommended use:
– Statement: Use for general-purpose access to your database. Useful when you are using
static SQL statements at runtime. The Statement interface cannot accept parameters.
– PreparedStatement: Use when you plan to use the same SQL statement many times. The
PreparedStatement interface accepts input parameters at runtime.
– CallableStatement: Use when you want to access database stored procedures. The
CallableStatement interface can also accept runtime input parameters.
Statement execution methods:
boolean execute(String SQL)
int executeUpdate(String SQL)
ResultSet executeQuery(String SQL)
Java Database Connectivity (JDBC)
Accessing Database using JDBC
• Steps:
– Processing the results (handling SQL exceptions)
• ResultSet objects are returned by Statement and PreparedStatement; they contain the
query output, which has to be processed.
• CallableStatement returns output values through OUT parameters; this could either be a
single value or a ResultSet.
• An SQLException has to be caught and gracefully transmitted to the calling
programme.
– Closing the database connection
• By closing the connection, the Statement and ResultSet objects will be closed
automatically.
• The close() method of the Connection interface is used to close the connection.
Eg: con.close();
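Putting the steps above together, the following is a minimal sketch of a complete JDBC program. The Oracle URL, the credentials, and the student table and its columns are placeholders used only for illustration; any JDBC-compliant driver and schema can be substituted.

import java.sql.*;

public class JdbcDemo {
    public static void main(String[] args) {
        // Placeholder URL and credentials; substitute your own.
        String url = "jdbc:oracle:thin:@localhost:1521:xe";
        // Explicit Class.forName() registration is optional from JDBC 4.0 onwards,
        // where drivers found on the classpath are loaded automatically.
        try (Connection con = DriverManager.getConnection(url, "user", "password")) {
            // Statement: static SQL with no parameters (table 'student' is hypothetical).
            try (Statement st = con.createStatement();
                 ResultSet rs = st.executeQuery("SELECT roll_no, name FROM student")) {
                while (rs.next()) {
                    System.out.println(rs.getString(1) + " " + rs.getString(2));
                }
            }
            // PreparedStatement: reusable SQL that accepts input parameters at runtime.
            String sql = "UPDATE student SET mark = ? WHERE roll_no = ?";
            try (PreparedStatement ps = con.prepareStatement(sql)) {
                ps.setInt(1, 78);
                ps.setString(2, "19IT042");
                System.out.println(ps.executeUpdate() + " row(s) updated");
            }
        } catch (SQLException e) {
            e.printStackTrace();   // catch and report the SQLException gracefully
        }
        // try-with-resources closes the ResultSet, Statements and Connection automatically.
    }
}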
Stored Procedure
• A stored procedure is a prepared SQL code that you can save and reuse over and over
again.
• A set of SQL statements, written together to form a logical unit, for performing a
specific task.
• It is a subroutine used by applications to access relational databases, and it is stored in
the database data dictionary.
• It can be compiled and executed with different parameters and results, and it can have
any combination of input, output, and input/output parameters.
• Advantages of Stored Procedure:
– Stored procedures are fast.
– Stored procedures are portable.
– Stored procedures are always available as 'source code' in the database itself.
– Stored procedures are migratory.
Stored Procedure – PL/SQL
Example
Function
CREATE [OR REPLACE] FUNCTION function_name
[(parameter_name [IN | OUT | IN OUT] type [, ...])]
RETURN return_datatype
{IS | AS}
BEGIN
< function_body >
END [function_name];
Procedure
CREATE [OR REPLACE] PROCEDURE procedure_name
[(parameter_name [IN | OUT | IN OUT] type [, ...])]
{IS | AS}
BEGIN
< procedure_body >
END procedure_name;
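Connecting this back to JDBC, the sketch below invokes a stored procedure through the CallableStatement interface described earlier. The procedure name get_grade and its IN/OUT parameters are hypothetical, as are the URL and credentials.

import java.sql.*;

public class CallProcDemo {
    public static void main(String[] args) throws SQLException {
        String url = "jdbc:oracle:thin:@localhost:1521:xe";   // placeholder
        try (Connection con = DriverManager.getConnection(url, "user", "password");
             // Hypothetical procedure: get_grade(roll_no IN VARCHAR2, grade OUT VARCHAR2)
             CallableStatement cs = con.prepareCall("{call get_grade(?, ?)}")) {
            cs.setString(1, "19IT042");                  // bind the IN parameter
            cs.registerOutParameter(2, Types.VARCHAR);   // declare the OUT parameter
            cs.execute();
            System.out.println("Grade: " + cs.getString(2));   // read the OUT value
        }
    }
}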
Trends in Big Data systems
• Big data is the term used for a collection of data sets so large and complex that it
becomes difficult to process it using on-hand database management tools or
traditional data processing applications.
• Need for Big Data
– A huge amount of data needs to be analyzed for the betterment of the
organization and to improve customer experience.
– The current systems (single server) cannot handle such a huge amount of
data.
– Hence, either the capacity of the single machine needs to be increased, or a
cluster of machines can be used to act like a single system, which works in a
distributed manner.
– Such a solution is provided through Hadoop.
Trends in Big Data systems
Characteristics of Big Data
• Big data can be characterized by specifying three V’s: Volume, Variety, Velocity
• Volume: Specifies the amount of data handled by the application. (Eg: Twitter)
• Velocity: Addresses the rate at which the data flows into the system.
• Variety: Describes the different types of data generated from unstructured text to
structured records, from images to sound and video, from sensor data to
geographic locations, etc., all specifying information needed for processing.
• A fourth V, Value, refers to the return on investment (ROI) of the data and its
processing.
Hadoop
• Hadoop is an open-source framework that allows users to store and process big
data in a distributed environment across clusters of computers using simple
programming models.
Hadoop
Hadoop Architecture
Hadoop
Hadoop 1.x Architecture
Hadoop
Hadoop 2.x Architecture
Hadoop
• Hadoop follows a master-slave architecture for the creation of a cluster.
• It consists of two parts: Storage unit & Processing unit.
• Storage is provided through a Hadoop Distributed File System (HDFS).
• Processing is done through MapReduce.
Storage - HDFS
• HDFS is spread across machines and acts as a single file system.
• The master node has information about the location of the data. The data
are stored in the slave node.
• HDFS runs daemons to handle data storage. They are NameNode,
DataNode, and Secondary NameNode.
Hadoop
• The cluster has a single NameNode running on the server and multiple
DataNodes running on the slaves.
• Every slave machine will run a DataNode daemon.
• The NameNode acts as a single point of availability for the data. If it goes down,
it would be difficult to make sense of the blocks on the DataNodes.
• Thus, the NameNode has to run on dual or triple redundant hardware,
with storage like RAID 1+0.
• For faster access, the NameNode metadata is kept in RAM. If the NameNode
crashes, this metadata will be lost.
Hadoop
• To make this metadata persistent, a secondary NameNode is used.
• The NameNode contacts the secondary NameNode every hour and pushes the
metadata onto it, creating a checkpoint.
• The NameNode can act as a single point of failure, and hence a backup is essential.
• From Hadoop 2.x onwards, a provision for a passive or standby backup is
provided.
• This standby NameNode backup will take control whenever the active
NameNode fails, thereby providing system availability.
• High data availability is achieved through data replication or duplication. The
default replication factor is 3, i.e., every file has three replicas.
Hadoop
Processing – MapReduce
• In Hadoop 1.x, the processing part was handled through MapReduce.
• The daemons running for MapReduce are JobTracker and TaskTracker.
• JobTracker: The master that manages the jobs submitted by the client and the
resources used by them in the cluster.
• The JobTracker splits the job into various tasks that can run in parallel using the
TaskTrackers.
• With Hadoop 2.x, a few changes were made in the processing structure.
• Apache Hadoop 2.0 includes YARN, which separates resource management
from the processing components.
• The YARN daemons, ResourceManager and NodeManager, help in processing the
data.
Hadoop
Characteristics of Hadoop
• Highly scalable: More machines can easily be added to the cluster as needed to
increase the capacity/power of the cluster.
• Commodity hardware-based: Desktops can be used to create a cluster. Specialized
hardware is not required. Therefore scalable and economical.
• Open source: You can look into the code and contribute back to the community.
• Reliable: If a machine crashes, the data are not lost.
Hadoop
Components of Hadoop
• Hadoop is a galaxy of tools.
• Every tool has a specific advantage or purpose.
• The collection of components is known as the Hadoop ecosystem.
• It includes tools for data storage, data manipulation, integration with other systems,
machine learning, cluster management and development, etc.
• Components are:
– Hadoop Distributed File System (HDFS) – Flume, Sqoop
– YARN & HBase
– MapReduce
– Hive & Pig
– Oozie
– Zookeeper
Hadoop
Hadoop components
Hadoop
Components of Hadoop
• Flume and Sqoop are used for data integration. They are used to get data from
external sources into Hadoop. Flume is a service to move a large amount of data in real
time. Sqoop is the integration of SQL and Hadoop.
• YARN and MapReduce are used for data processing.
• YARN stands for Yet Another Resource Negotiator, which is used for resource management.
• HBase is the data storage or Hadoop database, which provides interactive access to the
data stored in HDFS.
• Hive and Pig are used for data analysis. These are high-level languages that allow users to
construct queries, so that data processing can be performed. (Hive – Facebook, Pig – Yahoo)
• Oozie is a workflow scheduler, which is used to manage Hadoop jobs.
• Zookeeper provides operational services for a Hadoop cluster. It provides distributed
configuration services, synchronization services and a naming registry.
HDFS – Hadoop Distributed
File System
HDFS
• HDFS is the file system required by Hadoop.
• It is an atypical file system, which does not format the hard drives in the
cluster.
• Instead it sits on top of the underlying operating system and its file system and
uses it to store and manage data.
• HDFS divides a file into blocks of either 64 MB or 128 MB. Each block is then
replicated three times, or the number of times specified by the user.
• The NameNode maintains the split information and location details.
HDFS – Hadoop Distributed
File System
Features of HDFS
• It is suitable for distributed storage and processing.
• Hadoop provides a command interface to interact with HDFS.
• The built-in servers of the NameNode and DataNodes help users easily check the
status of the cluster.
• Streaming access to file system data.
• HDFS provides file permissions and authentication.
HDFS – Hadoop Distributed
File System
HDFS Architecture
HDFS – Hadoop Distributed
File System
HDFS
• The storage can sometimes grow so large that the disks are arranged in
different racks and connected through switches.
• If all replicas are stored in the same rack, and the switch accessing that rack
fails, all the replicas will be unavailable, defeating the purpose of having
redundancy.
• HDFS has a feature of rack awareness, through which the NameNode knows
which rack each DataNode is on.
HDFS – Hadoop Distributed
File System
Rack awareness in HDFS
HDFS – Hadoop Distributed
File System
HDFS
• Hadoop also has intelligent, self-healing behaviour: if one
of the DataNodes goes down, the heartbeat (or status message) from that
DataNode to the NameNode will cease.
• After a few minutes, the NameNode will consider that DataNode to be dead;
whatever tasks were running on that DataNode will get respawned, and its blocks
re-replicated, so that the replica count of 3 is restored.
HDFS – Hadoop Distributed
File System
HDFS – Preparing HDFS writes
HDFS – Hadoop Distributed File System
HDFS – Preparing HDFS writes
1. The client creates the file by calling create( ) on Distributed File System (DFS).
2. DFS makes an RPC call to the NameNode to create a new file in the file system's namespace,
with no blocks associated with it.
3. The DFS returns an FSDataOutputStream for the client to start writing data to.
FSDataOutputStream wraps a DFSOutputStream, which handles communication with the
DataNodes and NameNode.
4. The DataStreamer streams the packets to the first DataNode in the pipeline, which stores each
packet and forwards it to the second DataNode in the pipeline.
5. When the client has finished writing data, it calls close( ) on the stream.
6. This action flushes all the remaining packets to the DataNode pipeline and waits for
acknowledgments before contacting the NameNode to signal that the file is complete.
7. The NameNode already knows which blocks the file is made up of (because the DataStreamer
asks for block allocations), so it only has to wait for blocks to be minimally replicated before
returning successfully.
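This create()/write path can be driven directly from Java through the Hadoop FileSystem API. A minimal sketch, assuming a running HDFS reachable at the placeholder address below and the Hadoop client libraries on the classpath:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://localhost:9000");   // placeholder NameNode address
        FileSystem fs = FileSystem.get(conf);                // DistributedFileSystem instance
        // create() asks the NameNode to add the file to the namespace (steps 1-3)
        try (FSDataOutputStream out = fs.create(new Path("/demo/hello.txt"))) {
            out.writeUTF("hello HDFS");   // DataStreamer pipelines packets to DataNodes (step 4)
        }                                 // close() flushes and signals completion (steps 5-7)
        fs.close();
    }
}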
HDFS – Hadoop Distributed
File System
HDFS – Reading Data from HDFS
HDFS – Hadoop Distributed File System
HDFS – Reading Data from HDFS
1. The client opens the file it wishes to read by calling open( ) on the FileSystem object,
which for HDFS is an instance of Distributed File System (DFS).
2. DFS calls the NameNode, using RPCs, to determine the locations of the first few blocks in
the file.
3. The DFS returns an FSDataInputStream to the client for it to read data from.
4. FSDataInputStream in turn wraps a DFSInputStream, which manages the DataNode and
NameNode I/O.
5. The client then calls read() on the stream.
6. During reading, if the DFSInputStream encounters an error while communicating with a
DataNode, it will try the next closest one for that block.
7. If a corrupted block is found, the DFSInputStream attempts to read a replica of the block
from another DataNode; it also reports the corrupted block to the NameNode.
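The corresponding read path, again as a sketch against a placeholder cluster address, and assuming the file written in the earlier example exists:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://localhost:9000");   // placeholder NameNode address
        FileSystem fs = FileSystem.get(conf);                // DFS instance (step 1)
        // open() makes the RPC to the NameNode for block locations (step 2)
        try (FSDataInputStream in = fs.open(new Path("/demo/hello.txt"))) {
            System.out.println(in.readUTF());   // read() streams data from the DataNodes (step 5)
        }
        fs.close();
    }
}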
MapReduce
• MapReduce is a programming model for processing large data sets with a
parallel distributed algorithm on a cluster.
• In traditional systems, data are brought from the data store into the main
memory, where the application is running.
• In MapReduce, the application is transferred to the location where the data are
stored and executed in parallel.
• Thus, multiple instances of a MapReduce job exist at any given time, which
work in parallel on the data stored in HDFS.
MapReduce
MapReduce Framework
• It works on a divide-and-conquer policy.
• The job is divided into multiple tasks known as Map tasks, and then the output is
combined using a task known as the Reducer.
• A MapReduce program comprises two components: Map and Reduce.
• The Mapper part does the processing, while the Reducer aggregates the data.
• There is a third phase, called shuffle and sort, between Map and
Reduce.
• The output of the Map is given to shuffle and sort, which is then passed on to
the Reducer.
MapReduce
MapReduce Framework
• The shuffle and sort phase groups the output so that all the data belonging to the
same group are given to a single machine.
• There can be one or many instances of the Reducer running for a given job, so it
is essential that each group of similar data is given to a single machine.
MapReduce
Reading the Data into the MapReduce Program
• A Map task reads the input from the cluster as a sequence of (key, value) pairs.
• The processing is done on the value, and the output is also provided as a (key,
value) pair.
• These pairs from the Map tasks are combined into groups and then sorted based
on the key through the Shuffle and Sort phase.
• This intermediate output is given to the Reduce task, which combines the results
and provides the final output, which is written onto HDFS.
MapReduce
MapReduce Structure
MapReduce
MapReduce WorkFlow
[Figure: MapReduce workflow — Map workers read the input splits and extract
something you care about from each record; intermediate output is written
locally, then remotely read and sorted; Reduce workers aggregate, summarize,
filter, or transform, and write the output files]
MapReduce
MapReduce - Example
MapReduce
MapReduce - Example
MapReduce
MapReduce – Example
Hive
• Hive started at Facebook.
• Hive is a data warehouse infrastructure tool to process structured data in
Hadoop.
• Hive resides on top of Hadoop to summarize Big Data, and makes querying
and analyzing easy.
• Using Hive, one can create tables, create databases, read the data, and create
partitions so that the data set can be restructured for processing.
• Hive has a lot of schema flexibility: tables can be altered,
columns can be moved, or the whole data set can be reloaded.
• It also has JDBC/ODBC connectivity so that it can be used with visualization
tools like Tableau.
Hive
• Limitations of Hive:
– It is not a relational database.
– It is not designed for OnLine Transaction Processing (OLTP).
– It is not a language for real-time queries and row-level updates.
• Features of Hive:
– It stores the schema in a database and the processed data in HDFS.
– It is designed for OLAP.
– It provides SQL type language for querying called HiveQL or HQL.
– It is familiar, fast, scalable, and extensible.
Hive
• The Metastore holds the information stored when you create a table, database, or
view.
• On top of the Metastore lies a Thrift API that enables browsing and querying using
JDBC/ODBC.
• Table definitions, column definitions, and view definitions are stored in the
Metastore.
• For Hive, the default Metastore database is Derby.
Hive
Hive Architecture
Hive
Hive Architecture
• Hive shell: Used to interact with Hive (create tables, submit queries)
• Metastore: Table definitions, view definitions, database definitions
• Execution Engine: For execution
• Compiler: For optimization
• Driver: Takes the code and converts it into Hadoop-understandable terms for
execution.
Hive
Create Database Statement
– CREATE DATABASE [IF NOT EXISTS] <database_name>;
Drop Database Statement
– DROP DATABASE IF EXISTS <database_name>;
Create Table Statement
– CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name
[(col_name data_type [COMMENT col_comment], ...)]
[COMMENT table_comment]
[ROW FORMAT row_format]
[STORED AS file_format];
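Because Hive exposes a JDBC interface (as noted earlier), such HiveQL statements can also be submitted from Java. A minimal sketch, assuming HiveServer2 is listening at the placeholder address below, the hive-jdbc driver is on the classpath, and using a hypothetical employee table:

import java.sql.*;

public class HiveJdbcDemo {
    public static void main(String[] args) throws SQLException {
        // Placeholder HiveServer2 URL; adjust the host, port and database as needed.
        String url = "jdbc:hive2://localhost:10000/default";
        try (Connection con = DriverManager.getConnection(url, "", "");
             Statement st = con.createStatement()) {
            st.execute("CREATE DATABASE IF NOT EXISTS demo");
            st.execute("CREATE TABLE IF NOT EXISTS demo.employee (id INT, name STRING) "
                     + "ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE");
            try (ResultSet rs = st.executeQuery("SELECT id, name FROM demo.employee")) {
                while (rs.next()) {
                    System.out.println(rs.getInt(1) + " " + rs.getString(2));
                }
            }
        }
    }
}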
NoSQL
• An approach to data management and database design that is useful for
very large sets of distributed data.
• NoSQL was designed to access and analyze massive amounts of
unstructured data or data stored remotely on multiple virtual servers in the
cloud.
• Types of NoSQL databases
– Graph database
– Key-value database
– Column stores (also known as wide-column stores)
– Document database
NoSQL
• Graph database
– It is based on graph theory and is used for representing networks, from a
network of people in a social context to a network of cities in geographical
mapping.
– These databases are designed for data whose relations are well represented
as a graph and whose elements are interconnected, with an
undetermined number of relations between them.
– Ex: Neo4j, Giraph
NoSQL
• Key-value store
– They are the simplest databases and use a key to access a value.
– These types of databases are designed for storing data in a scheme-free
way.
– In a key-value store, all of the data within consists of an indexed key and
a value, hence the name.
– Ex: Cassandra, DyanmoDB
71
NoSQL
• Column stores
– These data stores are designed for storing data tables as sections of
columns of data, rather than as rows of data.
– Wide-column stores offer high performance and a highly scalable
architecture.
– Ex: Hbase, BigTable
NoSQL
• Document database
– These databases expand the idea of key-value stores where “documents”
contain more complex data.
– They contain data and each document is assigned a unique key, which is
used to retrieve the document.
– These are designed for storing, retrieving and managing document-
oriented information, also known as semi-structured data.
– Tree or hierarchical data structures can be directly stored in these
databases.
– Ex: MongoDB, CouchDB
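As a small illustration of key-based document storage, the sketch below uses the MongoDB Java driver; this is one possible choice, not part of the syllabus material. The database, collection and field names are hypothetical, and mongodb-driver-sync must be on the classpath.

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import org.bson.Document;

public class DocumentStoreDemo {
    public static void main(String[] args) {
        // Placeholder connection string for a local MongoDB instance.
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> students =
                    client.getDatabase("school").getCollection("students");
            // Each document is assigned a unique key (_id) used to retrieve it later.
            students.insertOne(new Document("_id", "19IT042")
                    .append("name", "Asha")
                    .append("marks", new Document("dbms", 87).append("os", 91)));
            Document doc = students.find(new Document("_id", "19IT042")).first();
            System.out.println(doc.toJson());   // the stored semi-structured document
        }
    }
}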
More Related Content

What's hot

Exploring Social Media with NodeXL
Exploring Social Media with NodeXL Exploring Social Media with NodeXL
Exploring Social Media with NodeXL
Shalin Hai-Jew
 
NE7012- SOCIAL NETWORK ANALYSIS
NE7012- SOCIAL NETWORK ANALYSISNE7012- SOCIAL NETWORK ANALYSIS
NE7012- SOCIAL NETWORK ANALYSIS
rathnaarul
 
Introduction to Social Network Analysis
Introduction to Social Network AnalysisIntroduction to Social Network Analysis
Introduction to Social Network Analysis
Premsankar Chakkingal
 
Graph Structure In The Web
Graph Structure In The WebGraph Structure In The Web
Graph Structure In The Web
dailyye
 
Social network analysis
Social network analysisSocial network analysis
Social network analysis
World Agroforestry (ICRAF)
 
Transforming Semantic Web Ideas to Information Architecture
Transforming Semantic Web Ideas to Information ArchitectureTransforming Semantic Web Ideas to Information Architecture
Transforming Semantic Web Ideas to Information Architecture
Vestforsk.no
 
Barcelona Euro Ia Final No Picture
Barcelona Euro Ia Final No PictureBarcelona Euro Ia Final No Picture
Barcelona Euro Ia Final No Pictureanskaar
 
Node XL - features and demo
Node XL - features and demoNode XL - features and demo
Node XL - features and demo
Mayank Mohan
 
Social Network Analysis (SNA) 2018
Social Network Analysis  (SNA) 2018Social Network Analysis  (SNA) 2018
Social Network Analysis (SNA) 2018
Arsalan Khan
 
Social network analysis course 2010 - 2011
Social network analysis course 2010 - 2011Social network analysis course 2010 - 2011
Social network analysis course 2010 - 2011
guillaume ereteo
 
Subscriber Churn Prediction Model using Social Network Analysis In Telecommun...
Subscriber Churn Prediction Model using Social Network Analysis In Telecommun...Subscriber Churn Prediction Model using Social Network Analysis In Telecommun...
Subscriber Churn Prediction Model using Social Network Analysis In Telecommun...
BAINIDA
 
Social Network Analysis (Part 1)
Social Network Analysis (Part 1)Social Network Analysis (Part 1)
Social Network Analysis (Part 1)
Vala Ali Rohani
 
AI @ Wholi - Bucharest.AI Meetup #5
AI @ Wholi - Bucharest.AI Meetup #5AI @ Wholi - Bucharest.AI Meetup #5
AI @ Wholi - Bucharest.AI Meetup #5
Traian Rebedea
 
Sylva workshop.gt that camp.2012
Sylva workshop.gt that camp.2012Sylva workshop.gt that camp.2012
Sylva workshop.gt that camp.2012CameliaN
 
It’s a “small world” after all
It’s a “small world” after allIt’s a “small world” after all
It’s a “small world” after allquanmengli
 
992 sms10 social_media_services
992 sms10 social_media_services992 sms10 social_media_services
992 sms10 social_media_services
siyaza
 
LAK13 Tutorial Social Network Analysis 4 Learning Analytics
LAK13 Tutorial Social Network Analysis 4 Learning AnalyticsLAK13 Tutorial Social Network Analysis 4 Learning Analytics
LAK13 Tutorial Social Network Analysis 4 Learning Analytics
goehnert
 
05 20275 computational solution...
05 20275 computational solution...05 20275 computational solution...
05 20275 computational solution...
IAESIJEECS
 
Object models and object representation
Object models and object representationObject models and object representation
Object models and object representation
Julie Allinson
 

What's hot (20)

Exploring Social Media with NodeXL
Exploring Social Media with NodeXL Exploring Social Media with NodeXL
Exploring Social Media with NodeXL
 
NE7012- SOCIAL NETWORK ANALYSIS
NE7012- SOCIAL NETWORK ANALYSISNE7012- SOCIAL NETWORK ANALYSIS
NE7012- SOCIAL NETWORK ANALYSIS
 
Introduction to Social Network Analysis
Introduction to Social Network AnalysisIntroduction to Social Network Analysis
Introduction to Social Network Analysis
 
Graph Structure In The Web
Graph Structure In The WebGraph Structure In The Web
Graph Structure In The Web
 
Social network analysis
Social network analysisSocial network analysis
Social network analysis
 
Transforming Semantic Web Ideas to Information Architecture
Transforming Semantic Web Ideas to Information ArchitectureTransforming Semantic Web Ideas to Information Architecture
Transforming Semantic Web Ideas to Information Architecture
 
Barcelona Euro Ia Final No Picture
Barcelona Euro Ia Final No PictureBarcelona Euro Ia Final No Picture
Barcelona Euro Ia Final No Picture
 
Node XL - features and demo
Node XL - features and demoNode XL - features and demo
Node XL - features and demo
 
Social Network Analysis (SNA) 2018
Social Network Analysis  (SNA) 2018Social Network Analysis  (SNA) 2018
Social Network Analysis (SNA) 2018
 
Social network analysis course 2010 - 2011
Social network analysis course 2010 - 2011Social network analysis course 2010 - 2011
Social network analysis course 2010 - 2011
 
Subscriber Churn Prediction Model using Social Network Analysis In Telecommun...
Subscriber Churn Prediction Model using Social Network Analysis In Telecommun...Subscriber Churn Prediction Model using Social Network Analysis In Telecommun...
Subscriber Churn Prediction Model using Social Network Analysis In Telecommun...
 
Social Network Analysis (Part 1)
Social Network Analysis (Part 1)Social Network Analysis (Part 1)
Social Network Analysis (Part 1)
 
AI @ Wholi - Bucharest.AI Meetup #5
AI @ Wholi - Bucharest.AI Meetup #5AI @ Wholi - Bucharest.AI Meetup #5
AI @ Wholi - Bucharest.AI Meetup #5
 
Q046049397
Q046049397Q046049397
Q046049397
 
Sylva workshop.gt that camp.2012
Sylva workshop.gt that camp.2012Sylva workshop.gt that camp.2012
Sylva workshop.gt that camp.2012
 
It’s a “small world” after all
It’s a “small world” after allIt’s a “small world” after all
It’s a “small world” after all
 
992 sms10 social_media_services
992 sms10 social_media_services992 sms10 social_media_services
992 sms10 social_media_services
 
LAK13 Tutorial Social Network Analysis 4 Learning Analytics
LAK13 Tutorial Social Network Analysis 4 Learning AnalyticsLAK13 Tutorial Social Network Analysis 4 Learning Analytics
LAK13 Tutorial Social Network Analysis 4 Learning Analytics
 
05 20275 computational solution...
05 20275 computational solution...05 20275 computational solution...
05 20275 computational solution...
 
Object models and object representation
Object models and object representationObject models and object representation
Object models and object representation
 

Similar to IT6701 Information Management - Unit I

DBMS
DBMS DBMS
Unit 2_DBMS_10.2.22.pptx
Unit 2_DBMS_10.2.22.pptxUnit 2_DBMS_10.2.22.pptx
Unit 2_DBMS_10.2.22.pptx
MaryJoseph79
 
Fundamentals of Database ppt ch02
Fundamentals of Database ppt ch02Fundamentals of Database ppt ch02
Fundamentals of Database ppt ch02Jotham Gadot
 
(Dbms) class 1 & 2 (Presentation)
(Dbms) class 1 & 2 (Presentation)(Dbms) class 1 & 2 (Presentation)
(Dbms) class 1 & 2 (Presentation)
Dr. Mazin Mohamed alkathiri
 
oracle
oracle oracle
DISE - Database Concepts
DISE - Database ConceptsDISE - Database Concepts
DISE - Database Concepts
Rasan Samarasinghe
 
Database Management System
Database Management SystemDatabase Management System
Database Management System
NILESH UCHCHASARE
 
An Introduction To Software Development - Architecture & Detailed Design
An Introduction To Software Development - Architecture & Detailed DesignAn Introduction To Software Development - Architecture & Detailed Design
An Introduction To Software Development - Architecture & Detailed Design
Blue Elephant Consulting
 
Database Management System NOTES for 2nd year
Database Management System NOTES for 2nd yearDatabase Management System NOTES for 2nd year
Database Management System NOTES for 2nd year
dhasamalika
 
02010 ppt ch02
02010 ppt ch0202010 ppt ch02
02010 ppt ch02Hpong Js
 
INTRODUCTION OF DATA BASE
INTRODUCTION OF DATA BASEINTRODUCTION OF DATA BASE
INTRODUCTION OF DATA BASE
AMUTHAG2
 
BAB 7 Pangkalan data new
BAB 7   Pangkalan data newBAB 7   Pangkalan data new
BAB 7 Pangkalan data new
Nur Salsabila Edu
 
Week 2 - Database System Development Lifecycle-old.pptx
Week 2 - Database System Development Lifecycle-old.pptxWeek 2 - Database System Development Lifecycle-old.pptx
Week 2 - Database System Development Lifecycle-old.pptx
NurulIzrin
 
Chapter – 2 Data Models.pdf
Chapter – 2 Data Models.pdfChapter – 2 Data Models.pdf
Chapter – 2 Data Models.pdf
TamiratDejene1
 
Database management system.pptx
Database management system.pptxDatabase management system.pptx
Database management system.pptx
AshmitKashyap1
 
Week 1 and 2 Getting started with DBMS.pptx
Week 1 and 2 Getting started with DBMS.pptxWeek 1 and 2 Getting started with DBMS.pptx
Week 1 and 2 Getting started with DBMS.pptx
Riannel Tecson
 
DBMS-Unit-1.pptx
DBMS-Unit-1.pptxDBMS-Unit-1.pptx
DBMS-Unit-1.pptx
Bhavya304221
 
Database management system
Database management systemDatabase management system
Database management system
sangeethachandrabose
 
Unit 1 DBMS
Unit 1 DBMSUnit 1 DBMS
Unit 1 DBMS
DhivyaSubramaniyam
 
ER modeling
ER modelingER modeling
ER modeling
Dabbal Singh Mahara
 

Similar to IT6701 Information Management - Unit I (20)

DBMS
DBMS DBMS
DBMS
 
Unit 2_DBMS_10.2.22.pptx
Unit 2_DBMS_10.2.22.pptxUnit 2_DBMS_10.2.22.pptx
Unit 2_DBMS_10.2.22.pptx
 
Fundamentals of Database ppt ch02
Fundamentals of Database ppt ch02Fundamentals of Database ppt ch02
Fundamentals of Database ppt ch02
 
(Dbms) class 1 & 2 (Presentation)
(Dbms) class 1 & 2 (Presentation)(Dbms) class 1 & 2 (Presentation)
(Dbms) class 1 & 2 (Presentation)
 
oracle
oracle oracle
oracle
 
DISE - Database Concepts
DISE - Database ConceptsDISE - Database Concepts
DISE - Database Concepts
 
Database Management System
Database Management SystemDatabase Management System
Database Management System
 
An Introduction To Software Development - Architecture & Detailed Design
An Introduction To Software Development - Architecture & Detailed DesignAn Introduction To Software Development - Architecture & Detailed Design
An Introduction To Software Development - Architecture & Detailed Design
 
Database Management System NOTES for 2nd year
Database Management System NOTES for 2nd yearDatabase Management System NOTES for 2nd year
Database Management System NOTES for 2nd year
 
02010 ppt ch02
02010 ppt ch0202010 ppt ch02
02010 ppt ch02
 
INTRODUCTION OF DATA BASE
INTRODUCTION OF DATA BASEINTRODUCTION OF DATA BASE
INTRODUCTION OF DATA BASE
 
BAB 7 Pangkalan data new
BAB 7   Pangkalan data newBAB 7   Pangkalan data new
BAB 7 Pangkalan data new
 
Week 2 - Database System Development Lifecycle-old.pptx
Week 2 - Database System Development Lifecycle-old.pptxWeek 2 - Database System Development Lifecycle-old.pptx
Week 2 - Database System Development Lifecycle-old.pptx
 
Chapter – 2 Data Models.pdf
Chapter – 2 Data Models.pdfChapter – 2 Data Models.pdf
Chapter – 2 Data Models.pdf
 
Database management system.pptx
Database management system.pptxDatabase management system.pptx
Database management system.pptx
 
Week 1 and 2 Getting started with DBMS.pptx
Week 1 and 2 Getting started with DBMS.pptxWeek 1 and 2 Getting started with DBMS.pptx
Week 1 and 2 Getting started with DBMS.pptx
 
DBMS-Unit-1.pptx
DBMS-Unit-1.pptxDBMS-Unit-1.pptx
DBMS-Unit-1.pptx
 
Database management system
Database management systemDatabase management system
Database management system
 
Unit 1 DBMS
Unit 1 DBMSUnit 1 DBMS
Unit 1 DBMS
 
ER modeling
ER modelingER modeling
ER modeling
 

More from pkaviya

IT2255 Web Essentials - Unit V Servlets and Database Connectivity
IT2255 Web Essentials - Unit V Servlets and Database ConnectivityIT2255 Web Essentials - Unit V Servlets and Database Connectivity
IT2255 Web Essentials - Unit V Servlets and Database Connectivity
pkaviya
 
IT2255 Web Essentials - Unit IV Server-Side Processing and Scripting - PHP.pdf
IT2255 Web Essentials - Unit IV Server-Side Processing and Scripting - PHP.pdfIT2255 Web Essentials - Unit IV Server-Side Processing and Scripting - PHP.pdf
IT2255 Web Essentials - Unit IV Server-Side Processing and Scripting - PHP.pdf
pkaviya
 
IT2255 Web Essentials - Unit III Client-Side Processing and Scripting
IT2255 Web Essentials - Unit III Client-Side Processing and ScriptingIT2255 Web Essentials - Unit III Client-Side Processing and Scripting
IT2255 Web Essentials - Unit III Client-Side Processing and Scripting
pkaviya
 
IT2255 Web Essentials - Unit II Web Designing
IT2255 Web Essentials - Unit II  Web DesigningIT2255 Web Essentials - Unit II  Web Designing
IT2255 Web Essentials - Unit II Web Designing
pkaviya
 
IT2255 Web Essentials - Unit I Website Basics
IT2255 Web Essentials - Unit I  Website BasicsIT2255 Web Essentials - Unit I  Website Basics
IT2255 Web Essentials - Unit I Website Basics
pkaviya
 
BT2252 - ETBT - UNIT 3 - Enzyme Immobilization.pdf
BT2252 - ETBT - UNIT 3 - Enzyme Immobilization.pdfBT2252 - ETBT - UNIT 3 - Enzyme Immobilization.pdf
BT2252 - ETBT - UNIT 3 - Enzyme Immobilization.pdf
pkaviya
 
OIT552 Cloud Computing Material
OIT552 Cloud Computing MaterialOIT552 Cloud Computing Material
OIT552 Cloud Computing Material
pkaviya
 
OIT552 Cloud Computing - Question Bank
OIT552 Cloud Computing - Question BankOIT552 Cloud Computing - Question Bank
OIT552 Cloud Computing - Question Bank
pkaviya
 
CS8791 Cloud Computing - Question Bank
CS8791 Cloud Computing - Question BankCS8791 Cloud Computing - Question Bank
CS8791 Cloud Computing - Question Bank
pkaviya
 
CS8592 Object Oriented Analysis & Design - UNIT V
CS8592 Object Oriented Analysis & Design - UNIT V CS8592 Object Oriented Analysis & Design - UNIT V
CS8592 Object Oriented Analysis & Design - UNIT V
pkaviya
 
CS8592 Object Oriented Analysis & Design - UNIT IV
CS8592 Object Oriented Analysis & Design - UNIT IV CS8592 Object Oriented Analysis & Design - UNIT IV
CS8592 Object Oriented Analysis & Design - UNIT IV
pkaviya
 
CS8592 Object Oriented Analysis & Design - UNIT III
CS8592 Object Oriented Analysis & Design - UNIT III CS8592 Object Oriented Analysis & Design - UNIT III
CS8592 Object Oriented Analysis & Design - UNIT III
pkaviya
 
CS8592 Object Oriented Analysis & Design - UNIT II
CS8592 Object Oriented Analysis & Design - UNIT IICS8592 Object Oriented Analysis & Design - UNIT II
CS8592 Object Oriented Analysis & Design - UNIT II
pkaviya
 
CS8592 Object Oriented Analysis & Design - UNIT I
CS8592 Object Oriented Analysis & Design - UNIT ICS8592 Object Oriented Analysis & Design - UNIT I
CS8592 Object Oriented Analysis & Design - UNIT I
pkaviya
 
Cs8591 Computer Networks - UNIT V
Cs8591 Computer Networks - UNIT VCs8591 Computer Networks - UNIT V
Cs8591 Computer Networks - UNIT V
pkaviya
 
CS8591 Computer Networks - Unit IV
CS8591 Computer Networks - Unit IVCS8591 Computer Networks - Unit IV
CS8591 Computer Networks - Unit IV
pkaviya
 
CS8591 Computer Networks - Unit III
CS8591 Computer Networks - Unit IIICS8591 Computer Networks - Unit III
CS8591 Computer Networks - Unit III
pkaviya
 
CS8591 Computer Networks - Unit II
CS8591 Computer Networks - Unit II CS8591 Computer Networks - Unit II
CS8591 Computer Networks - Unit II
pkaviya
 
CS8591 Computer Networks - Unit I
CS8591 Computer Networks - Unit ICS8591 Computer Networks - Unit I
CS8591 Computer Networks - Unit I
pkaviya
 
IT8602 Mobile Communication - Unit V
IT8602 Mobile Communication - Unit V IT8602 Mobile Communication - Unit V
IT8602 Mobile Communication - Unit V
pkaviya
 

More from pkaviya (20)

IT2255 Web Essentials - Unit V Servlets and Database Connectivity
IT2255 Web Essentials - Unit V Servlets and Database ConnectivityIT2255 Web Essentials - Unit V Servlets and Database Connectivity
IT2255 Web Essentials - Unit V Servlets and Database Connectivity
 
IT2255 Web Essentials - Unit IV Server-Side Processing and Scripting - PHP.pdf
IT2255 Web Essentials - Unit IV Server-Side Processing and Scripting - PHP.pdfIT2255 Web Essentials - Unit IV Server-Side Processing and Scripting - PHP.pdf
IT2255 Web Essentials - Unit IV Server-Side Processing and Scripting - PHP.pdf
 
IT2255 Web Essentials - Unit III Client-Side Processing and Scripting
IT2255 Web Essentials - Unit III Client-Side Processing and ScriptingIT2255 Web Essentials - Unit III Client-Side Processing and Scripting
IT2255 Web Essentials - Unit III Client-Side Processing and Scripting
 
IT2255 Web Essentials - Unit II Web Designing
IT2255 Web Essentials - Unit II  Web DesigningIT2255 Web Essentials - Unit II  Web Designing
IT2255 Web Essentials - Unit II Web Designing
 
IT2255 Web Essentials - Unit I Website Basics
IT2255 Web Essentials - Unit I  Website BasicsIT2255 Web Essentials - Unit I  Website Basics
IT2255 Web Essentials - Unit I Website Basics
 
BT2252 - ETBT - UNIT 3 - Enzyme Immobilization.pdf
BT2252 - ETBT - UNIT 3 - Enzyme Immobilization.pdfBT2252 - ETBT - UNIT 3 - Enzyme Immobilization.pdf
BT2252 - ETBT - UNIT 3 - Enzyme Immobilization.pdf
 
OIT552 Cloud Computing Material
OIT552 Cloud Computing MaterialOIT552 Cloud Computing Material
OIT552 Cloud Computing Material
 
OIT552 Cloud Computing - Question Bank
OIT552 Cloud Computing - Question BankOIT552 Cloud Computing - Question Bank
OIT552 Cloud Computing - Question Bank
 
CS8791 Cloud Computing - Question Bank
CS8791 Cloud Computing - Question BankCS8791 Cloud Computing - Question Bank
CS8791 Cloud Computing - Question Bank
 
CS8592 Object Oriented Analysis & Design - UNIT V
CS8592 Object Oriented Analysis & Design - UNIT V CS8592 Object Oriented Analysis & Design - UNIT V
CS8592 Object Oriented Analysis & Design - UNIT V
 
CS8592 Object Oriented Analysis & Design - UNIT IV
CS8592 Object Oriented Analysis & Design - UNIT IV CS8592 Object Oriented Analysis & Design - UNIT IV
CS8592 Object Oriented Analysis & Design - UNIT IV
 
CS8592 Object Oriented Analysis & Design - UNIT III
CS8592 Object Oriented Analysis & Design - UNIT III CS8592 Object Oriented Analysis & Design - UNIT III
CS8592 Object Oriented Analysis & Design - UNIT III
 
CS8592 Object Oriented Analysis & Design - UNIT II
CS8592 Object Oriented Analysis & Design - UNIT IICS8592 Object Oriented Analysis & Design - UNIT II
CS8592 Object Oriented Analysis & Design - UNIT II
 
CS8592 Object Oriented Analysis & Design - UNIT I
CS8592 Object Oriented Analysis & Design - UNIT ICS8592 Object Oriented Analysis & Design - UNIT I
CS8592 Object Oriented Analysis & Design - UNIT I
 
Cs8591 Computer Networks - UNIT V
Cs8591 Computer Networks - UNIT VCs8591 Computer Networks - UNIT V
Cs8591 Computer Networks - UNIT V
 
CS8591 Computer Networks - Unit IV
CS8591 Computer Networks - Unit IVCS8591 Computer Networks - Unit IV
CS8591 Computer Networks - Unit IV
 
CS8591 Computer Networks - Unit III
CS8591 Computer Networks - Unit IIICS8591 Computer Networks - Unit III
CS8591 Computer Networks - Unit III
 
CS8591 Computer Networks - Unit II
CS8591 Computer Networks - Unit II CS8591 Computer Networks - Unit II
CS8591 Computer Networks - Unit II
 
CS8591 Computer Networks - Unit I
CS8591 Computer Networks - Unit ICS8591 Computer Networks - Unit I
CS8591 Computer Networks - Unit I
 
IT8602 Mobile Communication - Unit V
IT8602 Mobile Communication - Unit V IT8602 Mobile Communication - Unit V
IT8602 Mobile Communication - Unit V
 

Recently uploaded

MARUTI SUZUKI- A Successful Joint Venture in India.pptx
MARUTI SUZUKI- A Successful Joint Venture in India.pptxMARUTI SUZUKI- A Successful Joint Venture in India.pptx
MARUTI SUZUKI- A Successful Joint Venture in India.pptx
bennyroshan06
 
The Art Pastor's Guide to Sabbath | Steve Thomason
The Art Pastor's Guide to Sabbath | Steve ThomasonThe Art Pastor's Guide to Sabbath | Steve Thomason
The Art Pastor's Guide to Sabbath | Steve Thomason
Steve Thomason
 
2024.06.01 Introducing a competency framework for languag learning materials ...
2024.06.01 Introducing a competency framework for languag learning materials ...2024.06.01 Introducing a competency framework for languag learning materials ...
2024.06.01 Introducing a competency framework for languag learning materials ...
Sandy Millin
 
The geography of Taylor Swift - some ideas
The geography of Taylor Swift - some ideasThe geography of Taylor Swift - some ideas
The geography of Taylor Swift - some ideas
GeoBlogs
 
The French Revolution Class 9 Study Material pdf free download
The French Revolution Class 9 Study Material pdf free downloadThe French Revolution Class 9 Study Material pdf free download
The French Revolution Class 9 Study Material pdf free download
Vivekanand Anglo Vedic Academy
 
Students, digital devices and success - Andreas Schleicher - 27 May 2024..pptx
Students, digital devices and success - Andreas Schleicher - 27 May 2024..pptxStudents, digital devices and success - Andreas Schleicher - 27 May 2024..pptx
Students, digital devices and success - Andreas Schleicher - 27 May 2024..pptx
EduSkills OECD
 
Chapter 3 - Islamic Banking Products and Services.pptx
Chapter 3 - Islamic Banking Products and Services.pptxChapter 3 - Islamic Banking Products and Services.pptx
Chapter 3 - Islamic Banking Products and Services.pptx
Mohd Adib Abd Muin, Senior Lecturer at Universiti Utara Malaysia
 
TESDA TM1 REVIEWER FOR NATIONAL ASSESSMENT WRITTEN AND ORAL QUESTIONS WITH A...
TESDA TM1 REVIEWER  FOR NATIONAL ASSESSMENT WRITTEN AND ORAL QUESTIONS WITH A...TESDA TM1 REVIEWER  FOR NATIONAL ASSESSMENT WRITTEN AND ORAL QUESTIONS WITH A...
TESDA TM1 REVIEWER FOR NATIONAL ASSESSMENT WRITTEN AND ORAL QUESTIONS WITH A...
EugeneSaldivar
 
ESC Beyond Borders _From EU to You_ InfoPack general.pdf
ESC Beyond Borders _From EU to You_ InfoPack general.pdfESC Beyond Borders _From EU to You_ InfoPack general.pdf
ESC Beyond Borders _From EU to You_ InfoPack general.pdf
Fundacja Rozwoju Społeczeństwa Przedsiębiorczego
 
PART A. Introduction to Costumer Service
PART A. Introduction to Costumer ServicePART A. Introduction to Costumer Service
PART A. Introduction to Costumer Service
PedroFerreira53928
 
Mule 4.6 & Java 17 Upgrade | MuleSoft Mysore Meetup #46
Mule 4.6 & Java 17 Upgrade | MuleSoft Mysore Meetup #46Mule 4.6 & Java 17 Upgrade | MuleSoft Mysore Meetup #46
Mule 4.6 & Java 17 Upgrade | MuleSoft Mysore Meetup #46
MysoreMuleSoftMeetup
 
Model Attribute Check Company Auto Property
Model Attribute  Check Company Auto PropertyModel Attribute  Check Company Auto Property
Model Attribute Check Company Auto Property
Celine George
 
Unit 2- Research Aptitude (UGC NET Paper I).pdf
Unit 2- Research Aptitude (UGC NET Paper I).pdfUnit 2- Research Aptitude (UGC NET Paper I).pdf
Unit 2- Research Aptitude (UGC NET Paper I).pdf
Thiyagu K
 
Home assignment II on Spectroscopy 2024 Answers.pdf
Home assignment II on Spectroscopy 2024 Answers.pdfHome assignment II on Spectroscopy 2024 Answers.pdf
Home assignment II on Spectroscopy 2024 Answers.pdf
Tamralipta Mahavidyalaya
 
Ethnobotany and Ethnopharmacology ......
Ethnobotany and Ethnopharmacology ......Ethnobotany and Ethnopharmacology ......
Ethnobotany and Ethnopharmacology ......
Ashokrao Mane college of Pharmacy Peth-Vadgaon
 
Basic phrases for greeting and assisting costumers
Basic phrases for greeting and assisting costumersBasic phrases for greeting and assisting costumers
Basic phrases for greeting and assisting costumers
PedroFerreira53928
 
Instructions for Submissions thorugh G- Classroom.pptx
Instructions for Submissions thorugh G- Classroom.pptxInstructions for Submissions thorugh G- Classroom.pptx
Instructions for Submissions thorugh G- Classroom.pptx
Jheel Barad
 
Synthetic Fiber Construction in lab .pptx
Synthetic Fiber Construction in lab .pptxSynthetic Fiber Construction in lab .pptx
Synthetic Fiber Construction in lab .pptx
Pavel ( NSTU)
 
The Roman Empire A Historical Colossus.pdf
The Roman Empire A Historical Colossus.pdfThe Roman Empire A Historical Colossus.pdf
The Roman Empire A Historical Colossus.pdf
kaushalkr1407
 
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
siemaillard
 

Recently uploaded (20)

MARUTI SUZUKI- A Successful Joint Venture in India.pptx
MARUTI SUZUKI- A Successful Joint Venture in India.pptxMARUTI SUZUKI- A Successful Joint Venture in India.pptx
MARUTI SUZUKI- A Successful Joint Venture in India.pptx
 
The Art Pastor's Guide to Sabbath | Steve Thomason
The Art Pastor's Guide to Sabbath | Steve ThomasonThe Art Pastor's Guide to Sabbath | Steve Thomason
The Art Pastor's Guide to Sabbath | Steve Thomason
 
2024.06.01 Introducing a competency framework for languag learning materials ...
2024.06.01 Introducing a competency framework for languag learning materials ...2024.06.01 Introducing a competency framework for languag learning materials ...
2024.06.01 Introducing a competency framework for languag learning materials ...
 
The geography of Taylor Swift - some ideas
The geography of Taylor Swift - some ideasThe geography of Taylor Swift - some ideas
The geography of Taylor Swift - some ideas
 
The French Revolution Class 9 Study Material pdf free download
The French Revolution Class 9 Study Material pdf free downloadThe French Revolution Class 9 Study Material pdf free download
The French Revolution Class 9 Study Material pdf free download
 
Students, digital devices and success - Andreas Schleicher - 27 May 2024..pptx
Students, digital devices and success - Andreas Schleicher - 27 May 2024..pptxStudents, digital devices and success - Andreas Schleicher - 27 May 2024..pptx
Students, digital devices and success - Andreas Schleicher - 27 May 2024..pptx
 
Chapter 3 - Islamic Banking Products and Services.pptx
Chapter 3 - Islamic Banking Products and Services.pptxChapter 3 - Islamic Banking Products and Services.pptx
Chapter 3 - Islamic Banking Products and Services.pptx
 
TESDA TM1 REVIEWER FOR NATIONAL ASSESSMENT WRITTEN AND ORAL QUESTIONS WITH A...
TESDA TM1 REVIEWER  FOR NATIONAL ASSESSMENT WRITTEN AND ORAL QUESTIONS WITH A...TESDA TM1 REVIEWER  FOR NATIONAL ASSESSMENT WRITTEN AND ORAL QUESTIONS WITH A...
TESDA TM1 REVIEWER FOR NATIONAL ASSESSMENT WRITTEN AND ORAL QUESTIONS WITH A...
 
ESC Beyond Borders _From EU to You_ InfoPack general.pdf
ESC Beyond Borders _From EU to You_ InfoPack general.pdfESC Beyond Borders _From EU to You_ InfoPack general.pdf
ESC Beyond Borders _From EU to You_ InfoPack general.pdf
 
PART A. Introduction to Costumer Service
PART A. Introduction to Costumer ServicePART A. Introduction to Costumer Service
PART A. Introduction to Costumer Service
 
Mule 4.6 & Java 17 Upgrade | MuleSoft Mysore Meetup #46
Mule 4.6 & Java 17 Upgrade | MuleSoft Mysore Meetup #46Mule 4.6 & Java 17 Upgrade | MuleSoft Mysore Meetup #46
Mule 4.6 & Java 17 Upgrade | MuleSoft Mysore Meetup #46
 
Model Attribute Check Company Auto Property
Model Attribute  Check Company Auto PropertyModel Attribute  Check Company Auto Property
Model Attribute Check Company Auto Property
 
Unit 2- Research Aptitude (UGC NET Paper I).pdf
Unit 2- Research Aptitude (UGC NET Paper I).pdfUnit 2- Research Aptitude (UGC NET Paper I).pdf
Unit 2- Research Aptitude (UGC NET Paper I).pdf
 
Home assignment II on Spectroscopy 2024 Answers.pdf
Home assignment II on Spectroscopy 2024 Answers.pdfHome assignment II on Spectroscopy 2024 Answers.pdf
Home assignment II on Spectroscopy 2024 Answers.pdf
 
Ethnobotany and Ethnopharmacology ......
Ethnobotany and Ethnopharmacology ......Ethnobotany and Ethnopharmacology ......
Ethnobotany and Ethnopharmacology ......
 
Basic phrases for greeting and assisting costumers
Basic phrases for greeting and assisting costumersBasic phrases for greeting and assisting costumers
Basic phrases for greeting and assisting costumers
 
Instructions for Submissions thorugh G- Classroom.pptx
Instructions for Submissions thorugh G- Classroom.pptxInstructions for Submissions thorugh G- Classroom.pptx
Instructions for Submissions thorugh G- Classroom.pptx
 
Synthetic Fiber Construction in lab .pptx
Synthetic Fiber Construction in lab .pptxSynthetic Fiber Construction in lab .pptx
Synthetic Fiber Construction in lab .pptx
 
The Roman Empire A Historical Colossus.pdf
The Roman Empire A Historical Colossus.pdfThe Roman Empire A Historical Colossus.pdf
The Roman Empire A Historical Colossus.pdf
 
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
 

IT6701 Information Management - Unit I

  • 10. Database Design and Modelling Database Modelling – ER Model Relationship: – Whenever an attribute of one entity type refers to another entity type, a relationship exists. – Degree of relationship: • Binary: A relationship of degree two. • Ternary: A relationship of degree three. • n-ary: A relationship among n participating entity types. 10
  • 11. Database Design and Modelling Database Modelling – ER Model Constraints • Cardinality ratio – The maximum number of relationship instances that an entity can participate in. – 1:1 relationship – 1:N relationship – N:1 relationship – M:N relationship 11
  • 12. Database Design and Modelling Database Modelling – ER Model Constraints • Participation constraint – It specifies whether the existence of an entity depends on it being related to another entity. – Total/Mandatory participation: The existence of an entity is determined through its participation in a relationship. (Eg: A student must enroll in a course) – Partial/Optional participation: Only a part of the set of entities participates in a relationship. (Eg: Not every teaching_staff will be the HoD of a department) 12
  • 13. Database Design and Modelling Database Modelling – ER Model - Keys • Keys: Allow us to identify a particular entity. • Super key: A set of one or more attributes (columns) which can uniquely identify a row in a table. • Candidate key: A minimal set of attributes which can uniquely identify a tuple (a minimal subset of a super key). • Primary key: The attribute which allows us to uniquely identify a particular instance in the database. • Foreign key (referential integrity): An attribute that refers to the key of another table; if multiple references exist, then an update or modification of any one must be reflected in all other places. 13
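A small SQL sketch of these keys; the table and column names here are illustrative only, not part of the slides:

    CREATE TABLE department (
        dept_id   INT PRIMARY KEY,       -- primary key: uniquely identifies a row
        dept_name VARCHAR(50)
    );

    CREATE TABLE student (
        roll_no INT PRIMARY KEY,         -- the candidate key chosen as primary key
        email   VARCHAR(50) UNIQUE,      -- another candidate key
        dept_id INT,
        FOREIGN KEY (dept_id) REFERENCES department(dept_id)  -- referential integrity
    );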
  • 14. Database Design and Modelling Database Modelling – ER Model One – to – One Cardinality 14
  • 15. One – to – Many Cardinality Many – to – One Cardinality Many – to – Many Cardinality Participation Database Design and Modelling Database Modelling – ER Model 15
  • 16. Database Design and Modelling Database Modelling – Extended ER Model • Specialisation: The result of taking a subset of a higher level entity set to form a low level entity set. (Eg: Person -> Customer, Employee) • Generalisation: The result of taking the union of two or more disjoint entity sets to produce a higher level entity set. (Eg: Customer, Employee -> Person) • Aggregation: An abstraction in which relationship sets are treated as higher level entity sets and can participate in the relationships. 16
  • 17. Database Design and Modelling Database Modelling – Case Study: Hospital Management System 17
  • 18. Business Rules • Database design is an important phase in the system development life cycle. • The inputs to the design phase are the business rules and functions identified in the requirement-gathering phase. • Business rules are used to describe various aspects of the business domain. (Eg: Students need to be enrolled in a course before appearing for its examination) • Business rules typically take the following forms: – The explanation of a concept relevant to the application. (A course is evaluated through theory + practical examinations) – An integrity constraint on the data of the application. (The minimum mark to pass a course is 50%) – A derivation rule, whereby information can be derived from other information. (The grade of a student is assigned based on the marks obtained) 18
  • 19. Business Rules Identifying Business Rules • Business rules allow the database designer to develop relationship rules and constraints and help in the creation of a correct data model. • They are a good communication tool between users and designers. • They give a proper classification of entities, attributes, relationships, and constraints. • A noun in a business rule is transformed into an entity in the model, and a verb (active or passive) is interpreted as a relationship among entities. 19
  • 20. Java Database Connectivity (JDBC) • The JDBC API (Application Programming Interface) provides a way of creating database connections from Java programs. • It provides methods to execute SQL statements and process the results obtained from those statements. • Types of JDBC drivers – Type 1 – JDBC-ODBC Bridge Driver – Type 2 – Java Native Driver – Type 3 – Java Network Protocol Driver – Type 4 – Pure Java Driver 20
  • 21. Java Database Connectivity (JDBC) Type 1 – JDBC-ODBC Bridge Driver • It provides a bridge to access the ODBC drivers installed on each client machine. • This bridge translates the standard JDBC calls to corresponding ODBC calls and sends them to the ODBC data source via the ODBC libraries. • This driver requires that native ODBC libraries, drivers, and their required support files be installed and configured on each client machine. • These drivers are the slowest of all types due to the multiple levels of translation. 21
  • 22. Java Database Connectivity (JDBC) Type 2 – Java Native Driver • It mainly uses the Java Native Interface (JNI) to translate calls to the local database API. • The JDBC calls are translated into vendor-specific API calls, which act as a façade forwarding requests between the application and the database. • Type 2 drivers are usually faster than Type 1. • Similar to Type 1 drivers, these drivers also require native libraries to be installed and configured on each client machine. 22
  • 23. Java Database Connectivity (JDBC) Type 3 – Java Network Protocol Driver • It uses an intermediate driver listener that acts as a gateway for multiple database servers. • The Java client sends the JDBC request to the listener, which in turn connects to the database server using another driver. • It does not require any installation on the client side, which is why it is preferred over the first two types of drivers. 23
  • 24. Java Database Connectivity (JDBC) Type 4 – Pure Java Driver • It is the most commonly used JDBC driver in enterprise applications because it converts JDBC API calls to direct network calls using vendor-specific implementation details. • Type 4 drivers offer better performance compared to the other types and do not require any installation or configuration on the client machine. 24
  • 25. Java Database Connectivity (JDBC) Accessing Database using JDBC • Steps: – Import JDBC packages - import java.sql.*; – Register the JDBC driver - Class.forName() Eg: Class.forName("oracle.jdbc.driver.OracleDriver"); – Creating a database connection Eg: String url = "jdbc:oracle:thin:@localhost:1521:xe"; Connection con = DriverManager.getConnection(url, user, password); DriverManager.getConnection() variants: getConnection(String url) getConnection(String url, Properties prop) getConnection(String url, String user, String password) 25
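Putting these steps together, a minimal sketch of opening a connection. The URL, user name, and password are placeholders to adapt; the explicit Class.forName() call is optional from JDBC 4.0 onwards, since drivers on the classpath are discovered automatically.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.SQLException;

    public class ConnectDemo {
        public static void main(String[] args) {
            String url = "jdbc:oracle:thin:@localhost:1521:xe";   // placeholder URL
            try {
                Class.forName("oracle.jdbc.driver.OracleDriver"); // register the driver
                Connection con = DriverManager.getConnection(url, "user", "password");
                System.out.println("Connected: " + !con.isClosed());
                con.close();                                      // release the connection
            } catch (ClassNotFoundException | SQLException e) {
                e.printStackTrace();
            }
        }
    }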
  • 26. Java Database Connectivity (JDBC) Accessing Database using JDBC • Steps: – Executing queries Eg: Statement st = con.createStatement(); int m = st.executeUpdate(sql); Interfaces and their recommended use: Statement – Use for general-purpose access to your database. Useful when you are using static SQL statements at runtime. The Statement interface cannot accept parameters. PreparedStatement – Use when you plan to use the SQL statements many times. The PreparedStatement interface accepts input parameters at runtime. CallableStatement – Use when you want to access database stored procedures. The CallableStatement interface can also accept runtime input parameters. Execution methods: boolean execute(String SQL) int executeUpdate(String SQL) ResultSet executeQuery(String SQL) 26
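As an illustration of the PreparedStatement interface, a hedged sketch; the students table and its columns are hypothetical and not part of the slides:

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.SQLException;

    public class InsertDemo {
        static void insertStudent(Connection con, int id, String name) throws SQLException {
            // The SQL is prepared once; the ? parameters are bound at runtime.
            String sql = "INSERT INTO students (id, name) VALUES (?, ?)";
            try (PreparedStatement ps = con.prepareStatement(sql)) {
                ps.setInt(1, id);
                ps.setString(2, name);
                int rows = ps.executeUpdate();   // returns the number of affected rows
                System.out.println(rows + " row(s) inserted");
            }
        }
    }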
  • 27. Java Database Connectivity (JDBC) Accessing Database using JDBC • Steps: – Processing the results (handling SQL exceptions) • ResultSet objects from the Statement and PreparedStatement classes contain the query output, which has to be processed. • The output from a CallableStatement is obtained through OUT parameters; it could be either a single value or a ResultSet. • Any SQLException has to be caught and gracefully transmitted to the calling program. – Closing the database connection • By closing the connection, the Statement and ResultSet objects are closed automatically. • The close() method of the Connection interface is used to close the connection. Eg: con.close(); 27
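A sketch of querying, iterating over the ResultSet, and cleaning up. Try-with-resources (Java 7+) closes the Connection, PreparedStatement, and ResultSet automatically, which subsumes the explicit con.close() above; the URL and the students table are placeholders:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.SQLException;

    public class ReadDemo {
        public static void main(String[] args) {
            String url = "jdbc:oracle:thin:@localhost:1521:xe";   // placeholder URL
            String sql = "SELECT id, name FROM students WHERE id > ?";
            try (Connection con = DriverManager.getConnection(url, "user", "password");
                 PreparedStatement ps = con.prepareStatement(sql)) {
                ps.setInt(1, 100);
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {   // advance the cursor row by row
                        System.out.println(rs.getInt("id") + " " + rs.getString("name"));
                    }
                }
            } catch (SQLException e) {
                System.err.println("Query failed: " + e.getMessage()); // graceful handling
            }
        }
    }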
  • 28. Stored Procedure • A stored procedure is prepared SQL code that you can save and reuse over and over again. • It is a set of SQL statements, written together to form a logical unit, for performing a specific task. • It is a subroutine used by applications to access relational databases and is stored in the database data dictionary. • It can be compiled and executed with different parameters and results, and it can have any combination of input, output, and input/output parameters. • Advantages of Stored Procedures: – Stored procedures are fast. – Stored procedures are portable. – Stored procedures are always available as 'source code' in the database itself. – Stored procedures are migratory. 28
  • 29. Stored Procedure – PL/SQL Example Function CREATE [OR REPLACE] FUNCTION function_name [(parameter_name [IN | OUT | IN OUT] type [, ...])] RETURN return_datatype {IS | AS} BEGIN < function_body > END [function_name]; Procedure CREATE [OR REPLACE] PROCEDURE procedure_name [(parameter_name [IN | OUT | IN OUT] type [, ...])] {IS | AS} BEGIN < procedure_body > END procedure_name; 29
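To make the template concrete, a hedged sketch of a hypothetical procedure written against this syntax, followed by the JDBC CallableStatement call (see slide 26) that invokes it. The procedure, table, and column names are illustrative only.

    CREATE OR REPLACE PROCEDURE get_student_name (
        p_id   IN  NUMBER,
        p_name OUT VARCHAR2
    ) AS
    BEGIN
        SELECT name INTO p_name FROM students WHERE id = p_id;
    END get_student_name;
    /

Calling it from Java:

    String call = "{call get_student_name(?, ?)}";          // JDBC escape syntax
    try (CallableStatement cs = con.prepareCall(call)) {
        cs.setInt(1, 101);                                  // bind the IN parameter
        cs.registerOutParameter(2, java.sql.Types.VARCHAR); // declare the OUT parameter
        cs.execute();
        System.out.println(cs.getString(2));                // read the OUT value
    }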
  • 30. Trends in Big Data systems • Big data is the term used for a collection of data sets so large and complex that it becomes difficult to process them using on-hand database management tools or traditional data processing applications. • Need for Big Data – A huge amount of data needs to be analyzed for the betterment of the organization and to improve the customer experience. – Current single-server systems cannot handle such a huge amount of data. – Hence, either the capacity of the single machine must be increased, or a cluster of machines can be made to act like a single system working in a distributed manner. – Such a solution is provided through Hadoop. 30
  • 31. Trends in Big Data systems Characteristics of Big Data • Big data can be characterized by three V's: Volume, Variety, Velocity. • Volume: Specifies the amount of data handled by the application. (Eg: Twitter) • Velocity: Addresses the rate at which the data flows into the system. • Variety: Describes the different types of data generated, from unstructured text to structured records, from images to sound and video, from sensor data to geographic locations, all carrying information needed for processing. • A fourth V, Value, refers to the return on investment (ROI) of the data and its processing. 31
  • 32. Hadoop • Hadoop is an open-source framework that allows one to store and process big data in a distributed environment across clusters of computers using simple programming models. 32
  • 36. Hadoop • Hadoop follows a master-slave architecture for the creation of a cluster. • It consists of two parts: Storage unit & Processing unit. • Storage is provided through a Hadoop Distributed File System (HDFS). • Processing is done through MapReduce. Storage - HDFS • HDFS is spread across machines and acts as a single file system. • The master node has information about the location of the data. The data are stored in the slave node. • HDFS runs daemons to handle data storage. They are NameNode, DataNode, and Secondary NameNode. 36
  • 37. Hadoop • The cluster has a single NameNode running on the server and multiple DataNodes running on the slaves. • Every slave machine runs a DataNode daemon. • The NameNode acts as a single point of availability for the data; if it goes down, it would be difficult to make sense of the blocks stored on the DataNodes. • Thus, the NameNode has to run on dual- or triple-redundant hardware, with storage such as RAID 1+0. • For faster access, the NameNode keeps its metadata in RAM; if the NameNode crashes, this metadata would be lost. 37
  • 38. Hadoop • To make this data persistent, a Secondary NameNode is used. • The NameNode contacts the Secondary NameNode every hour and pushes the metadata onto it, creating a checkpoint. • The NameNode can act as a single point of failure, and hence a backup is essential. • From Hadoop 2.x onwards, a provision for a passive or standby backup is provided. • This standby NameNode takes control whenever the active NameNode fails, thereby providing system availability. • High data availability is achieved through data replication or duplication. The default replication factor is 3, so every block of a file has three replicas. 38
  • 39. Hadoop Processing – MapReduce • In Hadoop 1.x, the processing part was handled through MapReduce. • The daemons running for MapReduce are the JobTracker and the TaskTracker. • JobTracker: The master that manages the jobs submitted by clients and the resources available in the cluster. • The JobTracker splits a job into various tasks that can run in parallel through the TaskTrackers. • With Hadoop 2.x, a few changes were made to the processing structure. • Apache Hadoop 2.0 includes YARN, which separates the processing layer into resource management and processing components. • The YARN daemons, ResourceManager and NodeManager, help in processing the data. 39
  • 40. Hadoop Characteristics of Hadoop • Highly scalable: More machines can easily be added to the cluster as needed to increase the capacity/power of the cluster. • Commodity hardware-based: Desktops can be used to create a cluster. Specialized hardware is not required. Therefore scalable and economical. • Open source: You can look into the code and contribute back to the community. • Reliable: If a machine crashes, the data are not lost. 40
  • 41. Hadoop Components of Hadoop • Hadoop is a galaxy of tools. • Every tool has a specific advantage or purpose. • The collection of components is known as the Hadoop ecosystem. • It includes tools for data storage, data manipulation, integration with other systems, machine learning, cluster management and development, etc. • Components are: – Hadoop Distributed File System (HDFS) – Flume, Sqoop – YARN & HBase – MapReduce – Hive & Pig – Oozie – Zookeeper 41
  • 43. Hadoop Components of Hadoop • Flume and Sqoop are used for data integration. They are used to get data from external sources into Hadoop. Flume is a service to move large amounts of data in real time. Sqoop is the integration of SQL and Hadoop. • YARN and MapReduce are used for data processing. • YARN stands for Yet Another Resource Negotiator, which handles resource management. • HBase is the data storage or Hadoop database, which provides interactive access to the data stored in HDFS. • Hive and Pig are used for data analysis. These are high-level languages that allow users to construct queries so that data processing can be performed. (Hive – Facebook, Pig – Yahoo) • Oozie is a workflow scheduler, which is used to manage Hadoop jobs. • Zookeeper provides operational services for a Hadoop cluster: distributed configuration services, synchronization services, and a naming registry. 43
  • 44. HDFS – Hadoop Distributed File System HDFS • HDFS is the file system required by Hadoop. • It is an atypical file system in that it does not format the hard drives in the cluster. • Instead, it sits on top of the underlying operating system's file system and uses it to store and manage data. • HDFS divides each file into blocks of either 64 MB or 128 MB. Each block is then replicated three times, or as many times as the user specifies. • The NameNode maintains the split information and location details. 44
  • 45. HDFS – Hadoop Distributed File System Features of HDFS • It is suitable for distributed storage and processing. • Hadoop provides a command interface to interact with HDFS. • The built-in servers of the NameNode and DataNode help users easily check the status of the cluster. • Streaming access to file system data. • HDFS provides file permissions and authentication. 45
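As a quick illustration of that command interface, a few common HDFS shell operations; the paths and file names are placeholders:

    hdfs dfs -mkdir -p /user/demo/input            # create a directory in HDFS
    hdfs dfs -put students.csv /user/demo/input    # copy a local file into HDFS
    hdfs dfs -ls /user/demo/input                  # list the directory contents
    hdfs dfs -cat /user/demo/input/students.csv    # print the file's contents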
  • 46. HDFS – Hadoop Distributed File System HDFS Architecture 46
  • 47. HDFS – Hadoop Distributed File System HDFS • The storage can sometimes grow so huge that the disks are arranged in different racks and connected through switches. • If all replicas are stored in the same rack, and the switch serving that rack fails, all the replicas become unavailable, defeating the purpose of having redundancy. • HDFS has a feature called rack awareness, through which the NameNode knows which rack each DataNode, and hence each replica, is on. 47
  • 48. HDFS – Hadoop Distributed File System Rack awareness in HDFS 48
  • 49. HDFS – Hadoop Distributed File System HDFS • Hadoop also behaves intelligently in terms of self-healing: if one of the DataNodes goes down, the heartbeat (or status message) from that DataNode to the NameNode ceases. • After a few minutes, the NameNode considers that DataNode dead; whatever tasks were running on it get respawned elsewhere, and its blocks are re-replicated so that the replica count of 3 is restored. 49
  • 50. HDFS – Hadoop Distributed File System HDFS – Preparing HDFS writes 50
  • 51. HDFS – Hadoop Distributed File System HDFS – Preparing HDFS writes 1. The client creates the file by calling create() on the Distributed File System (DFS). 2. The DFS makes an RPC call to the NameNode to create a new file in the file system's namespace, with no blocks associated with it. 3. The DFS returns an FSDataOutputStream for the client to start writing data to. FSDataOutputStream wraps a DFSOutputStream, which handles communication with the DataNodes and the NameNode. 4. The DataStreamer streams the packets to the first DataNode in the pipeline, which stores each packet and forwards it to the second DataNode in the pipeline. 5. When the client has finished writing data, it calls close() on the stream. 6. This action flushes all the remaining packets to the DataNode pipeline and waits for acknowledgments before contacting the NameNode to signal that the file is complete. 7. The NameNode already knows which blocks the file is made up of (because the DataStreamer asks for block allocations), so it only has to wait for the blocks to be minimally replicated before returning successfully. 51
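A hedged sketch of this write path from the client's point of view, using the Hadoop Java API; the NameNode address and the file path are placeholders:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsWrite {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.set("fs.defaultFS", "hdfs://namenode:9000");     // placeholder address
            FileSystem fs = FileSystem.get(conf);                 // the DFS of step 1
            // create() performs the NameNode RPC of step 2.
            try (FSDataOutputStream out = fs.create(new Path("/user/demo/hello.txt"))) {
                out.writeBytes("hello hdfs\n");  // packets flow down the DataNode pipeline
            }                                    // close() flushes and signals completion
            fs.close();
        }
    }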
  • 52. HDFS – Hadoop Distributed File System HDFS – Reading Data from HDFS 52
  • 53. HDFS – Hadoop Distributed File System HDFS – Reading Data from HDFS 1. The client opens the file it wishes to read by calling open() on the FileSystem object, which for HDFS is an instance of the Distributed File System (DFS). 2. The DFS calls the NameNode, using RPCs, to determine the locations of the first few blocks in the file. 3. The DFS returns an FSDataInputStream to the client for it to read data from. 4. The FSDataInputStream in turn wraps a DFSInputStream, which manages the DataNode and NameNode I/O. 5. The client then calls read() on the stream. 6. During reading, if the DFSInputStream encounters an error while communicating with a DataNode, it tries the next closest one for that block. 7. If a corrupted block is found, the DFSInputStream attempts to read a replica of the block from another DataNode; it also reports the corrupted block to the NameNode. 53
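The corresponding read path, again as a sketch with placeholder addresses; IOUtils.copyBytes drives the read() loop of step 5:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;

    public class HdfsRead {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.set("fs.defaultFS", "hdfs://namenode:9000");    // placeholder address
            FileSystem fs = FileSystem.get(conf);
            // open() performs the NameNode RPC of step 2 and returns an FSDataInputStream.
            try (FSDataInputStream in = fs.open(new Path("/user/demo/hello.txt"))) {
                IOUtils.copyBytes(in, System.out, 4096, false);  // stream the file to stdout
            }
            fs.close();
        }
    }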
  • 54. MapReduce • MapReduce is a programming model for processing large data sets with a parallel, distributed algorithm on a cluster. • In traditional systems, data are brought from the data store into the main memory of the machine where the application is running. • In MapReduce, the application is transferred to the locations where the data are stored and executed in parallel. • Thus, multiple instances of a MapReduce job exist at any given time, working in parallel on the data stored in HDFS. 54
  • 55. MapReduce MapReduce Framework • It works on a divide-and-conquer policy. • The job is divided into multiple tasks known as Map tasks, and then the output is combined using a task known as the Reducer. • A MapReduce program comprises two components: Map and Reduce. • The Mapper part does the processing, while the Reducer aggregates the data. • There is a third phase, called shuffle and sort, between Map and Reduce. • The output of the Map is given to shuffle and sort, which then passes it on to the Reducer. 55
  • 56. MapReduce MapReduce Framework • Shuffle and sort groups the output so that all the data belonging to the same group are given to a single machine. • There can be one or many instances of the Reducer running for a given job, so it is essential that each group of similar data is given to a single machine. 56
  • 57. MapReduce Reading the Data into the MapReduce Program • The Map task reads the input from the cluster as a sequence of (key, value) pairs. • The processing is done on the value, and the output is also provided as (key, value) pairs. • These pairs from the Map tasks are combined into groups and then sorted based on the key through the Shuffle and Sort phase. • This intermediate output is given to the Reduce task, which combines the results and provides the final output, which is written onto HDFS. 57
  • 59. MapReduce MapReduce Workflow (figure): the input data is split into Split 0, Split 1, Split 2; worker nodes run Map tasks, which extract something you care about from each record and write locally; shuffle and sort performs a remote read and groups the intermediate output; worker nodes then run Reduce tasks, which aggregate, summarize, filter, or transform and write the output files (Output File 0, Output File 1). 59
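The canonical word-count job makes this (key, value) flow concrete; a sketch against the Hadoop MapReduce Java API (the job driver and input/output paths are omitted for brevity):

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Map: reads (byte offset, line) pairs and emits (word, 1) for every word.
    class WordMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        @Override
        protected void map(LongWritable key, Text value, Context ctx)
                throws IOException, InterruptedException {
            for (String word : value.toString().split("\\s+")) {
                if (!word.isEmpty()) ctx.write(new Text(word), ONE);
            }
        }
    }

    // Reduce: shuffle and sort has grouped all the 1s for each word; sum them.
    class WordReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            ctx.write(key, new IntWritable(sum));   // final (word, count) pair to HDFS
        }
    }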
  • 63. Hive • Hive started at Facebook. • Hive is a data warehouse infrastructure tool to process structured data in Hadoop. • Hive resides on top of Hadoop to summarize big data, and it makes querying and analyzing easy. • Using Hive, one can create databases and tables, read the data, and create partitions so that the data set can be restructured for processing. • Hive has a lot of schema flexibility: tables can be altered, columns can be moved, or the whole data set can be reloaded. • It also has JDBC/ODBC connectivity, so it can be used with visualization tools like Tableau. 63
  • 64. Hive • Limitations of Hive: – It is not a relational database. – It is not designed for OnLine Transaction Processing (OLTP). – It is not a language for real-time queries and row-level updates. • Features of Hive: – It stores the schema in a database and the processed data in HDFS. – It is designed for OLAP. – It provides an SQL-type language for querying called HiveQL or HQL. – It is familiar, fast, scalable, and extensible. 64
  • 65. Hive • The Metastore holds the information recorded when you create a table, database, or view. • On top of the Metastore lies a Thrift API that enables browsing and querying using JDBC/ODBC. • Table definitions, column definitions, and view definitions are stored in the Metastore. • For Hive, the default Metastore database is Derby. 65
  • 67. Hive Hive Architecture • Hive shell: Interact through commands such as create table and submit query. • Metastore: Table definitions, view definitions, database definitions. • Execution Engine: For execution. • Compiler: For optimization. • Driver: Takes the code and converts it into terms Hadoop can understand for execution. 67
  • 68. Hive Create Database Statement – CREATE DATABASE [IF NOT EXISTS] <database name>; Drop Database Statement – DROP DATABASE IF EXISTS <database name>; Create Table Statement – CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.] table_name [(col_name data_type [COMMENT col_comment], ...)] [COMMENT table_comment] [ROW FORMAT row_format] [STORED AS file_format] 68
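A hedged usage sketch of these statements in HiveQL; the database, table, columns, and file path are illustrative only:

    CREATE DATABASE IF NOT EXISTS college;

    CREATE TABLE IF NOT EXISTS college.students (
        id   INT    COMMENT 'roll number',
        name STRING,
        mark INT
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    STORED AS TEXTFILE;

    -- load a local CSV file into the table, then query it
    LOAD DATA LOCAL INPATH '/tmp/students.csv' INTO TABLE college.students;
    SELECT name, mark FROM college.students WHERE mark >= 50;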
  • 69. NoSQL • An approach to data management and database design that is useful for very large sets of distributed data. • NoSQL was designed to access and analyze massive amounts of unstructured data or data stored remotely on multiple virtual servers in the cloud. • Types of NoSQL databases – Graph database – Key-value database – Column stores (also known as wide-column stores) – Document database 69
  • 70. NoSQL • Graph database – It is based on graph theory and is used for representing networks, from a network of people in a social context to a network of cities in geographic mapping. – These databases are designed for data whose relations are well represented as a graph: elements that are interconnected, with an undetermined number of relations between them. – Ex: Neo4j, Giraph 70
  • 71. NoSQL • Key-value store – They are the simplest databases and use a key to access a value. – These databases are designed for storing data in a schema-free way. – In a key-value store, all of the data consists of an indexed key and a value, hence the name. – Ex: Cassandra, DynamoDB 71
  • 72. NoSQL • Column stores – These data stores are designed for storing data tables as sections of columns of data, rather than as rows of data. – Wide-column stores offer high performance and a highly scalable architecture. – Ex: HBase, BigTable 72
  • 73. NoSQL • Document database – These databases expand the idea of key-value stores, where “documents” contain more complex data. – Each document is assigned a unique key, which is used to retrieve the document. – They are designed for storing, retrieving, and managing document-oriented information, also known as semi-structured data. – Tree or hierarchical data structures can be stored directly in these databases. – Ex: MongoDB, CouchDB 73
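As an illustration of the key/document idea, a hedged sketch using the MongoDB Java driver; the connection string, database, collection, and field names are placeholders:

    import org.bson.Document;
    import com.mongodb.client.MongoClient;
    import com.mongodb.client.MongoClients;
    import com.mongodb.client.MongoCollection;

    public class DocStoreDemo {
        public static void main(String[] args) {
            try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
                MongoCollection<Document> col =
                        client.getDatabase("college").getCollection("students");
                // Each document receives a unique _id key, generated if not supplied.
                col.insertOne(new Document("name", "Asha").append("mark", 72));
                Document found = col.find(new Document("name", "Asha")).first();
                System.out.println(found.toJson());   // the stored semi-structured document
            }
        }
    }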