• Process of creating a data model for an information system by
applying formal data modeling techniques.
• Process used to define and analyze data requirements needed
to support the business processes.
• Therefore, the process of data modeling involves professional
data modelers working closely with business stakeholders, as
well as potential users of the information system.
What is a Data Model?
• A data model is a collection of conceptual tools for describing data,
data relationships, data semantics and consistency constraints.
• A data model is a conceptual representation of the data structures
required for a database and is very powerful in expressing and
communicating the business requirements.
• A data model visually represents the nature of data, business rules
governing the data, and how it will be organized in the database.
• A data model provides a way to describe the design of a
database at the physical, logical and view levels.
• There are three different types of data models produced while
progressing from requirements to the actual database to be
used for the information system:
• Conceptual: describes WHAT the system contains.
• Logical: describes HOW the system will be implemented,
regardless of the DBMS.
• Physical: describes HOW the system will be implemented
using a specific DBMS.
Different Data Models
A data model consists of entities related to each other on a diagram:
Entity: A real-world thing or an interaction between two or more real-world things.
Attribute: The atomic pieces of information that we need to know about entities.
Relationship: How entities depend on each other, in terms of why the entities depend on each other (the
relationship) and what that relationship is (the cardinality of the relationship).
Given that …
• “Customer” is an entity.
• “Product” is an entity.
• For a “Customer” we need to know their “customer number”
attribute and “name” attribute.
• For a “Product” we need to know the “product name” attribute and “price” attribute.
• “Sale” is an entity that is used to record the interaction of
“Customer” and “Product”.
Here is the diagram that encapsulates these rules:
• By convention, entities are named in the singular.
• The attributes of “Customer” are “Customer No” (which is the unique
identifier or primary key of the “Customer” entity and is shown by the #
symbol) and “Customer Name”.
• “Sale” has a composite primary key made up of the primary key of
“Customer”, the primary key of “Product” and the date of the sale.
• Think of entities as tables, think of attributes as columns on the table and
think of instances as rows on that table:
• If we want to know the price of a Sale, we can find it by using the “Product
Code” on the instance of “Sale” we are interested in and looking up the
corresponding “Price” on the “Product” entity with the matching “Product
Code”.
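The table view of the Customer, Product and Sale entities can be sketched in code. This is a minimal sketch, assuming Python's built-in sqlite3 module; the column names, the sample rows and the helper sale_price are hypothetical and do not come from the original diagram.

```python
import sqlite3

# In-memory database, so the sketch leaves nothing behind on disk.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer (
    customer_no   INTEGER PRIMARY KEY,      -- unique identifier (#)
    customer_name TEXT NOT NULL
);
CREATE TABLE product (
    product_code TEXT PRIMARY KEY,
    product_name TEXT NOT NULL,
    price        REAL NOT NULL
);
CREATE TABLE sale (
    customer_no  INTEGER NOT NULL REFERENCES customer(customer_no),
    product_code TEXT    NOT NULL REFERENCES product(product_code),
    sale_date    TEXT    NOT NULL,
    -- composite primary key: customer + product + date of sale
    PRIMARY KEY (customer_no, product_code, sale_date)
);
""")

# One instance (row) per entity, with made-up values.
conn.execute("INSERT INTO customer VALUES (1, 'Alice')")
conn.execute("INSERT INTO product VALUES ('P10', 'Kettle', 25.0)")
conn.execute("INSERT INTO sale VALUES (1, 'P10', '2024-01-15')")


def sale_price(conn, customer_no, product_code, sale_date):
    """Find the price of a sale by following its product code to Product."""
    row = conn.execute(
        """SELECT p.price
           FROM sale s JOIN product p ON p.product_code = s.product_code
           WHERE s.customer_no = ? AND s.product_code = ? AND s.sale_date = ?""",
        (customer_no, product_code, sale_date),
    ).fetchone()
    return None if row is None else row[0]


price = sale_price(conn, 1, "P10", "2024-01-15")
```

The price is not stored on Sale; it is looked up on Product through the matching product code, as the last bullet describes.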
Types of Data Models
• Entity-Relationship (E-R) Models
• UML (unified modeling language)
• Entity Relationship Diagrams (ERDs) are the most widely used
• ERDs have an advantage in that they are capable of being normalized
• Represent entities as rectangles
• List attributes within the rectangle
Why and When
• The purpose of a data model is to describe the concepts relevant
to a domain, the relationships between those concepts, and
information associated with them.
• Used to model data in a standard, consistent, predictable
manner in order to manage it as a resource.
• To have a clear picture of the base data that your business
• To identify missing and redundant base data.
• To establish a baseline for communication across functional
boundaries within your organization.
• Provides a basis for defining business rules.
• Makes it cheaper, easier, and faster to upgrade your IT solutions.
• Define terms related to entity relationship modeling, including
entity, entity instance, attribute, relationship and cardinality, and
• Describe the entity modeling process.
• Discuss how to draw an entity relationship diagram.
• Describe how to recognize entities, attributes, relationships,
A database can be modeled as:
– a collection of entities,
– relationships among entities.
Database systems are often modeled using an Entity Relationship
(ER) diagram as the "blueprint" from which the actual data is
stored — the output of the design phase.
Entity Relationship Diagram (ERD)
• ER model allows us to sketch database designs
• ERD is a graphical tool for modeling data.
• ERD is widely used in database design
• ERD is a graphical representation of the logical structure of a database
• ERD is a model that identifies the concepts or entities that exist in a
system and the relationships between those entities
Purposes of ERD
An ERD serves several purposes
• The database analyst/designer gains a better understanding of
the information to be contained in the database through the
process of constructing the ERD.
• The ERD serves as a documentation tool.
• Finally, the ERD is used to communicate the logical structure of
the database to users.
Components of an ERD
An ERD typically consists of four different graphical components:
Classification of Relationship
• Optional Relationship
– An Employee may or may not be assigned to a Department
– A Patient may or may not be assigned to a Bed
• Mandatory Relationship
– Every Course must be taught by at least one Teacher
– Every Mother has at least one Child
Express the number of entities to which another entity can be
associated via a relationship set.
• Cardinality Constraints - the number of instances of one entity that can
or must be associated with each instance of another entity.
• Minimum Cardinality
– If zero, then optional
– If one or more, then mandatory
• Maximum Cardinality
– The maximum number
Cardinality Constraints (Contd.)
• For a binary relationship set the mapping cardinality must be one of
the following types:
– One to one
• A Manager heads one Department, and vice versa
– One to many (or many to one)
• An Employee works in one Department, or one Department has many
Employees
– Many to many
• A Teacher teaches many Students, and a Student is taught by many
Teachers
K.T.Mikel Raj
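The three mapping types can be told apart mechanically once a relationship is written down as pairs of instances. This is a minimal sketch with a hypothetical function name and made-up sample pairs. It only reports what the sample data shows, whereas the real cardinality is a design rule agreed with the business.

```python
from collections import defaultdict


def mapping_cardinality(pairs):
    """Classify a binary relationship given as (a, b) instance pairs."""
    b_per_a = defaultdict(set)   # each A instance -> related B instances
    a_per_b = defaultdict(set)   # each B instance -> related A instances
    for a, b in pairs:
        b_per_a[a].add(b)
        a_per_b[b].add(a)
    a_has_many = any(len(bs) > 1 for bs in b_per_a.values())
    b_has_many = any(len(as_) > 1 for as_ in a_per_b.values())
    if a_has_many and b_has_many:
        return "many-to-many"
    if a_has_many:
        return "one-to-many"
    if b_has_many:
        return "many-to-one"
    return "one-to-one"


heads = [("Ann", "Sales"), ("Bob", "IT")]                 # Manager, Department
works_in = [("Ann", "Sales"), ("Bob", "Sales")]           # Employee, Department
teaches = [("T1", "S1"), ("T1", "S2"), ("T2", "S1")]      # Teacher, Student

kinds = [mapping_cardinality(r) for r in (heads, works_in, teaches)]
```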
General Steps to create an ERD
• Identify the entity
• Identify the entity's attributes
• Identify the Primary Keys
• Identify the relation between entities
• Identify the Cardinality constraint
• Draw the ERD
• Check the ERD
Developing an ERD
The process has ten steps:
1. Identify Entities
2. Find Relationships
3. Draw Rough ERD
4. Fill in Cardinality
5. Define Primary Keys
6. Draw Key-Based ERD
7. Identify Attributes
8. Map Attributes
9. Draw fully attributed ERD
10. Check Results
A Simple Example
A company has several departments. Each department has a
supervisor and at least one employee. Employees must be assigned
to at least one, but possibly more departments. At least one
employee is assigned to a project, but an employee may be on
vacation and not assigned to any projects. The important data fields
are the names of the departments, projects, supervisors and
employees, as well as the supervisor and employee number and a
unique project number.
• One approach to this is to work through the information and highlight those
words which you think correspond to entities.
• A company has several departments. Each department has a supervisor and at
least one employee. Employees must be assigned to at least one, but possibly
more departments. At least one employee is assigned to a project, but an
employee may be on vacation and not assigned to any projects. The important
data fields are the names of the departments, projects, supervisors and
employees, as well as the supervisor and employee number and a unique project number.
• A true entity should have more than one instance
• The aim is to identify the associations, the connections between pairs of
entities.
• A simple approach to do this is using a relationship matrix (table)
that has rows and columns for each of the identified entities.
Find Relationships (Contd.)
• Go through each cell and decide whether or not there is an association.
For example, the first cell on the second row is used to indicate if there is
a relationship between the entity "Employee" and the entity
Names placed in the cells are meant to capture or describe the
relationships, so you can read them like this:
• A Department is assigned an employee
• A Department is run by a supervisor
• An employee belongs to a department
• An employee works on a project
• A supervisor runs a department
• A project uses an employee
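The relationship matrix for this example can be sketched as a small lookup table. This is a minimal sketch that assumes the entity order Department, Employee, Supervisor, Project (the original matrix is not reproduced here), with verbs taken from the six statements above. The names relationships, matrix and describe are hypothetical.

```python
entities = ["Department", "Employee", "Supervisor", "Project"]

# (row entity, column entity) -> name of the relationship in that cell
relationships = {
    ("Department", "Employee"): "is assigned",
    ("Department", "Supervisor"): "is run by",
    ("Employee", "Department"): "belongs to",
    ("Employee", "Project"): "works on",
    ("Supervisor", "Department"): "runs",
    ("Project", "Employee"): "uses",
}

# Full matrix: an empty string marks a cell with no association.
matrix = {
    row: {col: relationships.get((row, col), "") for col in entities}
    for row in entities
}


def describe(matrix):
    """Read every non-empty cell back as a 'row verb column' phrase."""
    return [
        f"{row} {verb} {col}"
        for row, cols in matrix.items()
        for col, verb in cols.items()
        if verb
    ]


sentences = describe(matrix)
```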
Draw Rough ERD
Draw a diagram and:
• Place all the entities in rectangles
• Use diamonds and lines to represent the relationships between
entities
• General Examples
Fill in Cardinality
– Each department has one supervisor.
– Each supervisor has one department.
– Each employee can belong to one or more departments
– Each department must have one or more employees
– Each project must have one or more employees
– Each employee can have 0 or more projects.
Fill in Cardinality (Contd.)
The cardinality of a relationship can only have the following
values:
–One and only one
–One or more
–Zero or more
–Zero or one
Each instance of A is related to a minimum of
zero and a maximum of one instance of B
Each instance of B is related to a minimum of
one and a maximum of one instance of A
Each instance of A is related to a minimum of
one and a maximum of many instances of B
Each instance of B is related to a minimum of
zero and a maximum of many instances of A
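The four cardinality values are just combinations of a minimum (zero or one) and a maximum (one or many). This is a minimal sketch of that idea. The class name Cardinality and its methods are hypothetical, and None is used here to stand for "many".

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Cardinality:
    minimum: int             # 0 = optional, 1 = mandatory
    maximum: Optional[int]   # 1, or None meaning "many"

    def label(self):
        """Return the name used on the slide for this min/max pair."""
        if self.maximum == 1:
            return "One and only one" if self.minimum == 1 else "Zero or one"
        return "One or more" if self.minimum >= 1 else "Zero or more"

    def allows(self, count):
        """Check whether `count` related instances satisfy the constraint."""
        if count < self.minimum:
            return False
        return self.maximum is None or count <= self.maximum


labels = [
    Cardinality(1, 1).label(),
    Cardinality(1, None).label(),
    Cardinality(0, None).label(),
    Cardinality(0, 1).label(),
]
```

For example, "each department must have one or more employees" corresponds to Cardinality(1, None), which rejects a department with no employees.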
• In this step we try to identify and name all the attributes essential to the system we are
studying without trying to match them to particular entities.
• The best way to do this is to study the forms, files and reports currently kept by the users
of the system and circle each data item on the paper copy.
• Cross out those which will not be transferred to the new system, extraneous items such as
signatures, and constant information which is the same for all instances of the form (e.g.
your company name and address). The remaining circled items should represent the
attributes you need. You should always verify these with your system users. (Sometimes
forms or reports are out of date.)
• The only attributes indicated are the names of the departments, projects, supervisors and
employees, as well as the supervisor and employee NUMBER and a unique project number.
• For each attribute we need to match it with exactly one entity. Often it
seems like an attribute should go with more than one entity (e.g. Name). In
this case you need to add a modifier to the attribute name to make it unique
(e.g. Customer Name, Employee Name, etc.) or determine which entity an
attribute "best" describes.
• If you have attributes left over without corresponding entities, you may have
missed an entity and its corresponding relationships. Identify these missed
entities and add them to the relationship matrix now.
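Once every attribute is matched to one entity, the company example could be turned into tables roughly as follows. This is a minimal sketch, assuming Python's built-in sqlite3 module. The table and column names are hypothetical, the department name is assumed to be the department's key because no other department attribute is given, and the many-to-many relationships are resolved with two linking tables.

```python
import sqlite3

schema = """
CREATE TABLE supervisor (
    supervisor_no   INTEGER PRIMARY KEY,
    supervisor_name TEXT NOT NULL
);
CREATE TABLE department (
    department_name TEXT PRIMARY KEY,
    -- one supervisor per department, one department per supervisor
    supervisor_no   INTEGER NOT NULL UNIQUE REFERENCES supervisor(supervisor_no)
);
CREATE TABLE employee (
    employee_no   INTEGER PRIMARY KEY,
    employee_name TEXT NOT NULL
);
CREATE TABLE project (
    project_no   INTEGER PRIMARY KEY,
    project_name TEXT NOT NULL
);
-- Employee belongs to one or more departments (many-to-many)
CREATE TABLE department_employee (
    department_name TEXT    NOT NULL REFERENCES department(department_name),
    employee_no     INTEGER NOT NULL REFERENCES employee(employee_no),
    PRIMARY KEY (department_name, employee_no)
);
-- Employee works on zero or more projects (many-to-many)
CREATE TABLE project_employee (
    project_no  INTEGER NOT NULL REFERENCES project(project_no),
    employee_no INTEGER NOT NULL REFERENCES employee(employee_no),
    PRIMARY KEY (project_no, employee_no)
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(schema)
tables = sorted(
    row[0]
    for row in conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
)
```

Plain keys like these capture the maximum cardinalities. The "at least one" minimums (for example, every department must have an employee) are not enforced by this schema and would need extra checks.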
Check ERD Results
• Look at your diagram from the point of view of a system owner or
user. Is everything clear?
• Check through the Cardinality pairs.
• Also, look over the list of attributes associated with each entity to
see if anything has been omitted.
– Collection of data
– Database management system
– Storing and organizing data
– Relational database
– Structured Query Language
– Java Database Connectivity
– JDBC driver
• Programs developed with Java/JDBC are platform and vendor
independent.
• “Write once, compile once, run anywhere”
• Write apps in Java to access any DB, using standard SQL
statements, while still following Java conventions.
• The JDBC driver manager and JDBC drivers provide the bridge
between the database and Java worlds.
• JDBC is heavily influenced by ODBC
• ODBC provides a C interface for database access on Windows
• ODBC has a few commands with lots of complex options. Java
prefers simple methods but lots of them.
• Type 1: Uses a bridging technology to access a database. JDBC-ODBC bridge is an example. It provides a
gateway to the ODBC.
• Type 2: Native API drivers. Driver contains Java code that calls native C/C++ methods provided by the
• Type 3: Generic network API that is then translated into database-specific access at the server level.
The JDBC driver on the client uses sockets to call a middleware application on the server that translates
the client requests into an API specific to the desired driver. Extremely flexible.
• Type 4: Uses network protocols built into the database engine to talk directly to the database over Java
sockets. Almost always comes only from database vendors.
JDBC Driver Types
JDBC driver implementations vary because of the wide variety of
operating systems and hardware platforms in which Java operates.
Sun has divided the implementation types into four categories,
Types 1, 2, 3, and 4.
Common JDBC Components
The JDBC API provides the following interfaces and classes −
DriverManager
This class manages a list of database drivers.
It matches connection requests from the Java application with the proper database
driver using a communication subprotocol.
The first driver that recognizes a certain subprotocol under JDBC will be used to
establish a database Connection.
Driver
This interface handles the communications with the database.
You will interact directly with Driver objects very rarely. Instead,
you use DriverManager objects, which manage objects of this type.
It also abstracts the details associated with working with Driver
objects.
Connection
This interface provides all the methods for contacting a database.
The Connection object represents the communication context, i.e., all
communication with the database goes through the Connection object only.
Statement
You use objects created from this interface to submit SQL statements to
the database.
ResultSet
These objects hold data retrieved from a database after you execute an SQL
query using Statement objects.
A ResultSet acts as an iterator to allow you to move through its data.
SQLException
This class handles any errors that occur in a database application.
Type 1: JDBC-ODBC Bridge Driver
In a Type 1 driver, a JDBC bridge is used to access ODBC drivers installed
on each client machine.
Using ODBC requires configuring on your system a Data Source Name
(DSN) that represents the target database.
The JDBC-ODBC Bridge that comes with JDK 1.2 is a good example of this
kind of driver.
Type 2: JDBC-Native API
In a Type 2 driver, JDBC API calls are converted into native C/C++ API calls,
which are unique to the database.
These drivers are typically provided by the database vendors and used in
the same manner as the JDBC-ODBC Bridge.
The vendor-specific driver must be installed on each client machine.
The Oracle Call Interface (OCI) driver is an example of a Type 2 driver.
Type 3: JDBC-Net pure Java
In a Type 3 driver, a three-tier approach is used to access databases.
The JDBC clients use standard network sockets to communicate with a
middleware application server.
The socket information is then translated by the middleware application
server into the call format required by the DBMS, and forwarded to the
database server.
Type 4: 100% Pure Java
In a Type 4 driver, a pure Java-based driver communicates directly with the
vendor's database through socket connection.
This is the highest performance driver available for the database and is
usually provided by the vendor itself.
This kind of driver is extremely flexible: you don't need to install special
software on the client or server. Further, these drivers can be downloaded
The following steps are required to create a new database using a JDBC application:
Import the packages:
Include the packages containing the JDBC classes needed for
database programming. Most often, using import java.sql.* will suffice.
Register the JDBC driver:
Initialize a driver so you can open a communications channel with
the database.
Open a connection:
Use the DriverManager.getConnection() method to create a
Connection object, which represents a physical connection with the
database server.
To create a new database, you need not give any database name
when preparing the database URL.
Execute a query:
Use an object of type Statement to build and submit an SQL
statement to the database.
Clean up the environment:
Explicitly close all database resources rather than relying on the JVM's
garbage collection.
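The same five-step shape appears in most database APIs. The sketch below is not JDBC: it is an analogy only, assuming Python's DB-API with the built-in sqlite3 module, so it can run without a database server or driver download. The table name and sample value are hypothetical. In SQLite the database is created on connect, so there is no separate CREATE DATABASE step.

```python
# Step 1: import the package (java.sql.* in JDBC; sqlite3 here).
import sqlite3

# Step 2: driver registration has no direct equivalent in this analogy,
# because the sqlite3 module itself plays the role of the driver.

# Step 3: open a connection (DriverManager.getConnection() in JDBC).
conn = sqlite3.connect(":memory:")
try:
    # Step 4: build and submit SQL statements (a Statement object in JDBC).
    cur = conn.cursor()
    cur.execute("CREATE TABLE student (id INTEGER PRIMARY KEY, name TEXT)")
    cur.execute("INSERT INTO student (name) VALUES (?)", ("Asha",))
    cur.execute("SELECT id, name FROM student")
    rows = cur.fetchall()   # plays the role of iterating a ResultSet
finally:
    # Step 5: clean up explicitly instead of relying on garbage collection.
    conn.close()
```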
Stored Procedure Language
Stored Procedure Overview
Stored Procedure is a function in a shared library accessible to the
can also write stored procedures using languages such as C or Java
Advantages of stored procedure : Reduced network traffic
The more SQL statements that are grouped together for execution, the
larger the savings in network traffic
Applications using stored
Writing Stored Procedures
Tasks performed by the client application
Tasks performed by the stored procedure, when invoked
The CALL statement
Explicit parameters to be defined:
IN: Passes a value to the stored procedure from the client application
OUT: Stores a value that is passed to the client application when the stored procedure
terminates
INOUT: Passes a value to the stored procedure from the client application, and returns a
value to the client application when the stored procedure terminates
Some Valid SQL Procedure Body Statements
A stored procedure stored at the location of the database can be invoked by using the SQL CALL
statement.
Nested SQL Procedures:
To call a target SQL procedure from within a caller SQL procedure, simply include a CALL
statement with the appropriate number and types of parameters in your caller.
IF <condition> THEN
EXIT WHEN <condition>
What is Big Data?
• Big data is a massive volume of both structured and unstructured data that
is so large it is difficult to process using traditional database and software
techniques.
• In most enterprise scenarios the volume of data is too big or it moves too
fast or it exceeds current processing capacity.
• Despite these problems, big data has the potential to help companies
improve operations and make faster, more intelligent decisions.
Why Big Data
Key enablers of the appearance and growth of Big Data are:
Increase of storage capacities
Increase of processing power
Availability of data
Every day we create 2.5 quintillion bytes of data; 90% of the data in the world
today has been created in the last two years alone
Big Data Everywhere!
• Lots of data is being collected
– Web data, e-commerce
– purchases at department/
– Bank/Credit Card
– Social Network
How much data?
• Google processes 20 PB a day (2008)
• Facebook has 2.5 PB of user data + 15 TB/day (4/2009)
• eBay has 6.5 PB of user data + 50 TB/day (5/2009)
Three V's of Big Data
Types of Data
• Three concepts come with big data:
Structured data,
Semi-structured data &
Unstructured data
Structured data is all data that can be stored in an SQL database, in tables with
rows and columns.
It has relational keys and can easily be mapped into pre-designed fields.
Today, this kind of data is the most processed in development and the simplest
way to manage information.
But structured data represents only 5 to 10% of all data.
Semi structured data
• Semi-structured data is information that doesn’t reside in a relational
database but that does have some organizational properties that make it
easier to analyze.
semi structured documents.
• Like structured data, semi-structured data represents only a small part of all data (5
to 10%).
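JSON is a common example of data with "some organizational properties" but no fixed table layout. This is a minimal sketch using Python's built-in json module. The two records are made up, and they deliberately do not share the same fields.

```python
import json

# Two records in one document: both are tagged with field names,
# but neither follows a fixed, pre-designed set of columns.
raw = """[
  {"name": "Asha", "email": "asha@example.com"},
  {"name": "Ravi", "phones": ["555-0100", "555-0101"], "city": "Pune"}
]"""

records = json.loads(raw)

# The field names travel with the data, so they can be discovered on the fly.
fields = sorted({key for record in records for key in record})
names = [record["name"] for record in records]
```

The tags make the data easier to analyze than free text, even though the records would not fit one relational table without empty columns.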
• Unstructured data represents around 80% of data.
• It often includes text and multimedia content.
Examples include e-mail messages, word processing documents, videos, photos, audio files,
presentations, web pages and many other kinds of business documents.
• Note that while these sorts of files may have an internal structure, they are
still considered “unstructured” because the data they contain doesn’t fit
neatly in a database.
• Unstructured data is everywhere. In fact, most individuals and organizations
conduct their lives around unstructured data.
• Here are some examples of machine-generated unstructured
Photographs and video
Social media data
Mobile data &
What to do with these data?
• Aggregation and Statistics
– Data warehouse and OLAP
• Indexing, Searching, and Querying
– Keyword based search
– Pattern matching (XML/RDF)
• Knowledge discovery
– Data Mining
– Statistical Modeling
Examples of Big Data
IT log analytics
IT solutions and IT departments generate an enormous quantity of logs and trace data.
In the absence of a Big Data solution, much of this data must go unexamined:
organizations simply don't have the manpower or resource to churn through all that
information by hand, let alone in real time.
With a Big Data solution in place, however, those logs and trace data can be put to good
use.
Within this list of Big Data application examples, IT log analytics is the most broadly
Applications for Big Data Analytics
Trading Analytics Fraud and Risk
Retail: Churn, NBO
NoSQL does not mean “Not SQL”.
It means “Not Only SQL”: not a relational database.
• Large Volume of Data
• Dynamic Schemas
• Horizontally Scalable
* Some operations can be achieved by enterprise-class RDBMS software, but at very high cost
• NoSQL is a class of non-relational database management systems, different from
traditional relational database management systems in some significant ways.
• A NoSQL database provides a mechanism for storage and retrieval of data that is
modeled by means other than the tabular relations used in relational databases.
• It is designed for distributed data stores with very large-scale data storage
needs (for example Google or Facebook, which collect terabits of data every
day for their users).
Document Oriented Databases
Document-oriented databases treat a document as a whole and avoid
splitting a document into its constituent name/value pairs.
At a collection level, this allows for putting together a diverse set of
documents into a single collection.
Document databases allow indexing of documents on the basis of not
only its primary identifier but also its properties.
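The two ideas above, a collection of differently shaped documents and indexing on properties as well as on the primary identifier, can be sketched with plain dictionaries. This is a toy in-memory illustration, not the API of any real document database. The class Collection and the sample documents are hypothetical.

```python
from collections import defaultdict


class Collection:
    """Toy document collection: whole documents plus property indexes."""

    def __init__(self):
        self.docs = {}   # primary identifier -> whole document
        # field name -> field value -> set of document ids
        self.indexes = defaultdict(lambda: defaultdict(set))

    def insert(self, doc_id, doc):
        self.docs[doc_id] = doc          # the document is kept as a whole
        for field, value in doc.items():
            if isinstance(value, (str, int, float, bool)):
                self.indexes[field][value].add(doc_id)

    def get(self, doc_id):
        return self.docs.get(doc_id)

    def find(self, field, value):
        ids = self.indexes[field].get(value, set())
        return [self.docs[i] for i in sorted(ids)]


stuff = Collection()
stuff.insert(1, {"name": "Asha", "city": "Pune"})
stuff.insert(2, {"title": "Invoice", "total": 120, "city": "Pune"})

in_pune = stuff.find("city", "Pune")
```

The two documents share only the city property, yet both live in the same collection and both are found through the property index.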
Different open-source document databases are available today but the
most prominent among the available options are MongoDB and
In fact, MongoDB has become one of the most popular NoSQL
Graph Based Databases
A graph database uses graph structures with nodes, edges, and
properties to represent and store data.
By definition, a graph database is any storage system that
provides index-free adjacency. This means that every element
contains a direct pointer to its adjacent element and no index
lookups are necessary.
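Index-free adjacency can be pictured as each node holding direct references to its neighbours. This is a minimal in-memory sketch. The class Node, the helper reachable and the sample graph are hypothetical and do not represent any particular graph database.

```python
class Node:
    def __init__(self, name, **properties):
        self.name = name
        self.properties = properties
        self.edges = []   # list of (label, Node): direct pointers, no index

    def connect(self, label, other):
        self.edges.append((label, other))

    def neighbours(self, label):
        return [node for edge_label, node in self.edges if edge_label == label]


def reachable(start, label):
    """Follow edges with the given label, depth first, from `start`."""
    seen, stack = [], [start]
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.append(node)
        stack.extend(node.neighbours(label))   # pointer hop, no lookup table
    return [node.name for node in seen]


alice = Node("alice", age=30)
bob = Node("bob")
carol = Node("carol")
alice.connect("FOLLOWS", bob)
bob.connect("FOLLOWS", carol)

path = reachable(alice, "FOLLOWS")
```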
General graph databases that can store any graph are distinct
from specialized graph databases such as triple-stores and
network databases. Indexes, where used, typically serve to locate the starting nodes; the traversal itself follows the direct pointers.
Column Based Databases
The column-oriented storage allows data to be stored effectively.
It avoids consuming space when storing nulls by simply not
storing a column when a value doesn’t exist for that column.
Each unit of data can be thought of as a set of key/value pairs,
where the unit itself is identified with the help of a primary
identifier, often referred to as the primary key.
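The point about nulls can be sketched as rows that are sets of key/value pairs under a primary key, where a missing column is simply not stored. This is a toy illustration of that sparse-row idea only. It does not model a real column store's on-disk layout, and the class ColumnFamilyStore is hypothetical.

```python
class ColumnFamilyStore:
    def __init__(self):
        self.rows = {}   # primary key -> {column name: value}

    def put(self, row_key, **columns):
        row = self.rows.setdefault(row_key, {})
        for column, value in columns.items():
            if value is not None:      # a null is never stored at all
                row[column] = value

    def get(self, row_key, column):
        return self.rows.get(row_key, {}).get(column)


store = ColumnFamilyStore()
store.put("u1", name="Asha", email="asha@example.com")
store.put("u2", name="Ravi", email=None)   # no email known for this row

stored_u2 = store.rows["u2"]
```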
Key Value Databases
The key of a key/value pair is a unique value in the set and can be
easily looked up to access the data.
Key/value stores are of varied types: some keep the data in
memory and some provide the capability to persist the data to
disk.
Benefits of NoSQL over RDBMS
NoSQL databases being schema-less do not define any strict data
Dynamic and Agile
NoSQL databases tend to grow dynamically with
changing requirements. They can handle structured,
semi-structured and unstructured data.
NoSQL scales horizontally by adding more servers and using
concepts of sharding and replication.
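Sharding means each key is routed to one of several servers. This is a minimal sketch of hash-based routing with hypothetical function names and keys. It uses simple modulo placement, which is an assumption for illustration. Real systems often use consistent hashing so that adding a server moves fewer keys.

```python
import hashlib


def shard_for(key, shard_count):
    """Map a key to a shard number in a stable, repeatable way."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


def distribute(keys, shard_count):
    """Group keys by the shard (server) that would store them."""
    shards = [[] for _ in range(shard_count)]
    for key in keys:
        shards[shard_for(key, shard_count)].append(key)
    return shards


keys = [f"user:{n}" for n in range(100)]
shards = distribute(keys, 4)
```

Scaling horizontally then amounts to raising the shard count and moving the affected keys to the new servers.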
This behavior of NoSQL fits with the cloud computing
services such as Amazon Web Services (AWS) which allows you
to handle virtual servers which can be expanded horizontally
All the NoSQL databases claim to deliver better and faster performance
as compared to traditional RDBMS implementations.
It is impossible for a web service to provide the following three guarantees at the
same time: consistency, availability and partition tolerance.
A distributed system can satisfy any two of these guarantees at the same
time, but not all three.
Consistency
All the servers in the system will have the same data, so anyone using the
system will get the same copy regardless of which server answers their
request.
Availability
The system will always respond to a request (even if it's not the latest data
or consistent across the system, or just a message saying the system isn't
working).
Partition tolerance
The system continues to operate as a whole even if individual servers fail
or can't be reached.
An open source software framework
Supports Data intensive Distributed Applications.
Derived from Google’s Map-Reduce and Google File System papers.
Written in the Java Programming Language.
Need to process huge datasets on a large number of computers.
It is expensive to build reliability into each application.
Nodes fail every day.
Failure is expected, rather than exceptional.
Need a common infrastructure:
efficient, reliable, easy to use.
Open source, Apache License.
What is Hadoop Used for ?
Recommendation Systems (Facebook, LinkedIn, eBay, Amazon)
Video and Image Analysis(NASA)
Hadoop High Level Architecture
Goals of HDFS
1. Very Large Distributed File System
- 10K nodes, 100 million files, 10 PB
2. Assumes Commodity Hardware
- Files are replicated to handle hardware failure
- Detects failures and recovers from them
3. Optimized for Batch Processing
- Data locations exposed so that computation can move to where data resides.
What is Hive
Hive is a data warehouse infrastructure tool to process
structured data in Hadoop.
Hive was initially developed by Facebook; later the Apache
Software Foundation took it up and developed it further as
open source under the name Apache Hive.
Hive is not
A relational database
A design for OnLine Transaction Processing (OLTP)
A language for real-time queries and row-level updates
Features of Hive
•It stores schema in a database and processed data into HDFS.
•It is designed for OLAP.
•It provides SQL type language for querying called HiveQL or HQL.
•It is familiar, fast, scalable, and extensible.
Architecture of Hive
• The following component diagram depicts the architecture of
Hive.
Units and its operations
User Interface
• Hive is data warehouse infrastructure software that enables
interaction between the user and HDFS.
• The user interfaces that Hive supports are Hive Web UI, Hive
command line, and Hive HD Insight (in Windows Server).
Meta Store
• Hive chooses respective database servers to store the schema
or metadata of tables, databases, columns in a table, their data
types, and HDFS mapping.
HiveQL Process Engine
• HiveQL is similar to SQL for querying on schema info on the
• It is one of the replacements of traditional approach for
Execution Engine
The conjunction part of the HiveQL Process Engine and
MapReduce is the Hive Execution Engine.
The execution engine processes the query and generates the same
results as MapReduce.
HDFS or HBASE
The Hadoop Distributed File System (HDFS) or HBase is the data storage
technique used to store data in the file system.
What is Map Reduce?
MapReduce is a processing technique and a programming model for
distributed computing based on Java.
The MapReduce algorithm contains two important tasks, namely Map
and Reduce.
Map takes a set of data and converts it into another set of data, where
individual elements are broken down into tuples (key/value pairs).
Second, the Reduce task takes the output from a map as
an input and combines those data tuples into a smaller set of
tuples.
MapReduce is a programming model Google has used
successfully in processing its “big data” sets (~20 petabytes
per day).
What is Map Reduce? Cont…
Users specify the computation in terms of a map and a reduce function,
Underlying runtime system automatically parallelizes the computation across
large-scale clusters of machines, and
Underlying system also handles machine failures, efficient communications, and
Map stage
The map or mapper’s job is to process the input data.
Generally the input data is in the form of a file or directory and is stored
in the Hadoop file system (HDFS).
The input file is passed to the mapper function line by line.
The mapper processes the data and creates several small chunks of
data.
Reduce stage
This stage is the combination of the Shuffle stage and the Reduce
stage.
The Reducer’s job is to process the data that comes from the
mapper.
After processing, it produces a new set of output, which will be
stored in the HDFS.
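The map, shuffle and reduce stages can be sketched on a single machine with the classic word-count example. This is a minimal sketch in plain Python, not Hadoop code. The function names and the two input lines are hypothetical, and a real job would read its input from HDFS and run the stages in parallel across the cluster.

```python
from collections import defaultdict


def map_line(line):
    """Map stage: turn one input line into (key, value) tuples."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle stage: group all values that share the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(key, values):
    """Reduce stage: combine the grouped values into one smaller result."""
    return key, sum(values)


def word_count(lines):
    # The input is passed to the mapper line by line.
    mapped = [pair for line in lines for pair in map_line(line)]
    grouped = shuffle(mapped)
    return dict(reduce_counts(key, values) for key, values in grouped.items())


counts = word_count(["big data is big", "data moves fast"])
```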