IT6701 Information Management Unit - V
1. IT6701 – Information Management
Unit V – Information Lifecycle Management
By
Kaviya.P, AP/IT
Kamaraj College of Engineering & Technology
2. Unit V – Information Lifecycle Management
Data retention policies; Confidential and
Sensitive data handling, lifecycle management
costs. Archive data using Hadoop; Testing and
delivering big data applications for performance
and functionality; Challenges with data
administration
3. Data Retention Policies
What is a Data Retention Policy?
• A document retention policy provides for the systematic review, retention and
destruction of documents received or created in the course of business.
• A document retention policy will identify documents that need to be maintained
and contain guidelines for how long certain documents should be kept and how
they should be destroyed.
Purpose of Data Retention Policies
• To maintain important records and documents for future use or reference.
• To dispose of records or documents that are no longer needed.
• To organize records so that they can be searched and accessed easily at a later
date.
4. Data Retention Policies
Categories of Requirements
• Legal or Legitimate requirements: The compliance or legal aspect, where a certain legal case is filed and some piece of information needs to be produced in a court of law.
• Business or Commercial requirements: To make information available from the
operation’s perspective.
• Personal or Private requirements: To make information available from the
personal perspective.
5. Data Retention Policies
Scope : Categories of Document (What documents must be protected?)
• Legal Records: These include all legal records, contracts, trademarks, powers of attorney, press releases, etc. These are the first set of documents that should be considered for retention.
• Final Records: Documents not requiring ad hoc modification or alteration. They can also be records of completed activities.
• Permanent Records: Include all the business documents that describe the organization’s details. They can also comprise contracts, financial registers, copyrights, patents and proposals.
• Accounting and Corporate Tax Records: Consists of financial statements,
investments, audits, tax returns, purchase, sales records, etc.
6. Data Retention Policies
Scope : Categories of Document (What documents must be protected?)
• Workplace Records: Information about the day-to-day activities of employees,
agreements, minutes of meetings, bylaws, etc.
• Employment, Employee, and Payroll Records: Include job postings, job
advertisements, recruitment procedures, performance reviews, etc.
• Bank Records: Information about bank transactions, deposits, cheque details, stop payments, cheque bouncing, etc.
• Historic Records: Records that are no longer required by the organization.
• Temporary Records: Documents that are not completed or finalized.
7. Data Retention Policies
Data Retention Policy
• When developing a retention policy, it is important to focus on the reason behind data
retention.
• The decision is based on creation date and may include other criteria such as last access time, type of data, the period for which the data remains valid, data value, etc.
• The policy document should include details of the data/document that needs to be retained.
• The data should be divided into various categories such as personal employee data, client
data, financial data, legal data, etc.
• This division would help in deciding the duration of retention and destruction procedures.
• When the data retention period is over, the data should be discarded.
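As a minimal sketch of the policy steps above, a retention decision can be reduced to a category-to-period lookup plus a date comparison. The categories and retention periods below are illustrative assumptions, not drawn from any specific regulation:

```python
from datetime import date

# Hypothetical retention periods per data category (illustrative values only).
RETENTION_YEARS = {
    "financial": 10,   # e.g. long-lived accounting and tax records
    "employee": 7,
    "client": 5,
    "temporary": 1,
}

def retention_action(category: str, created: date, today: date) -> str:
    """Decide whether a record should still be retained or discarded."""
    years = RETENTION_YEARS.get(category, 3)  # default period for uncategorized data
    expiry = created.replace(year=created.year + years)
    return "discard" if today >= expiry else "retain"
```

In practice the policy document, not the code, is authoritative; a script like this only automates the schedule the policy defines.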
8. Data Retention Policies
Why Have a Data Retention Policy?
The policy is also helpful to:
• Provide a system for complying with document retention laws
• Ensure that valuable documents are available when needed
• Save money, space and time
• Protect against allegations of selective document destruction, and
• Provide for the routine destruction of non-business, superfluous and outdated
documents
9. Data Retention Policies
Why Have a Data Retention Policy?
The six most important reasons why an organization should implement a document
retention policy are:
1. To comply with legal duties and requirements, either statutory or regulatory
2. To avoid liability through “spoliation”, the improper destruction or alteration of documents in a litigation situation
3. To support or oppose a position in an investigation or litigation
4. To protect from unnecessary expense and time during discovery
5. To maintain control over discovery and e-discovery, and
6. To keep documents confidential and avoid leakage to attackers or competitors
10. Data Retention Policies
Laws Related to Data Retention Policy - India
• In India there is no central Act that lays down provisions related to data retention.
• Instead, various agencies have incorporated their own policies, which they maintain and follow.
• Eg 1: The Government of India Central Vigilance Commission, vide notification No. 17/09/2006-Admn., gives the provisions related to the retention period/destruction schedule of recorded files.
• Eg 2: The Ministry of Finance - Financial Intelligence Unit has its own policy. Notification No. 9/2005 gives the “rules for Record Keeping and Reporting”.
11. Data Retention Policies
Laws Related to Data Retention Policy - India
• Rule 6. Retention of records - The records referred to in rule 3 shall be
maintained for a period of ten years from the date of cessation of the
transactions between the client and the banking company, financial institution or
intermediary, as the case may be.
• Thus, it may be noted that each organization has its own data retention policy and certain rules for the retention of such records.
• However, there is no established law that binds organizations to prepare such policies.
12. Confidential and Sensitive Data Handling
Definition of Sensitive Data
• Data collected may be personal, confidential or sensitive in nature.
• Personal data provides information about an individual, and through which an
individual can be easily and uniquely identified, either directly or indirectly.
• Confidential data is the personal data that is private and should not be disclosed
to others.
13. Confidential and Sensitive Data Handling
Types of Sensitive Data
• Personal Information
– Sensitive personally identifiable information is data that can be traced back
to an individual, thus revealing one’s identity.
– Such information includes biometric data, medical information and history, bank and credit card information, and passport or Aadhaar numbers.
– Threats include not only crimes such as identity theft, but also disclosure of personal information that the individual would prefer to keep private.
– Sensitive data should be encrypted both in transit and at rest.
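One possible safeguard alongside encryption, sketched here with Python's standard library, is to pseudonymize direct identifiers with a keyed hash so that records stay linkable for processing without exposing the underlying identity. The key value and identifier are illustrative assumptions; a real deployment would draw the key from a key-management system:

```python
import hmac
import hashlib

# Illustrative secret; in practice this would come from a key-management system.
SECRET_KEY = b"replace-with-managed-key"

def pseudonymize(identifier: str) -> str:
    """Replace a direct identifier (e.g. a passport or Aadhaar number)
    with a keyed SHA-256 hash: deterministic, so records remain linkable,
    but not reversible without the key."""
    digest = hmac.new(SECRET_KEY, identifier.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()
```

Pseudonymization is a complement to, not a substitute for, encrypting sensitive data in transit and at rest.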
14. Confidential and Sensitive Data Handling
Types of Sensitive Data
• Business Information
– Sensitive business information includes everything that poses a risk to the
company in question if discovered by a competitor or the general public.
– Such information includes trade secrets, contract details, acquisition plans,
financial data, supplier details, customer information.
– Methods of protecting corporate information from unauthorized access are
becoming integral to corporate security.
– These methods include deciding policy for security, metadata management
and document sanitization.
15. Confidential and Sensitive Data Handling
Types of Sensitive Data
• Classified Information
– It pertains to a government body and is restricted according to the level of sensitivity. (Eg: restricted, confidential, secret, and top secret)
– Information is generally classified to protect security.
– Once the risk of harm has passed or decreased, classified information may
be declassified and, possibly, made public.
16. Confidential and Sensitive Data Handling
Handling of Sensitive Data
• Sensitive data needs to be handled with utmost care with highest possible security
measures.
• Given a dataset, one or more attribute values in a tuple/record can be sensitive and hence need to be protected, while other attributes of the same tuple/record can be made available.
• Thus, the access policy needs to be defined at different granularity levels so that access to individual attribute values can be controlled.
• Eg: If a query seeks information on all patients having a certain health condition, it should not reveal the identity of the individuals. Instead, an aggregate function can be applied, such as returning only the total count of patients suffering from that condition.
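The aggregate-only idea in the example above can be sketched as follows; the patient records and field names are made up for illustration:

```python
# Illustrative patient records; names and conditions are made up.
patients = [
    {"name": "A", "condition": "diabetes"},
    {"name": "B", "condition": "diabetes"},
    {"name": "C", "condition": "asthma"},
]

def count_with_condition(records, condition):
    """Answer the query with an aggregate (a count) rather than
    returning the identifying rows themselves."""
    return sum(1 for r in records if r["condition"] == condition)
```

A caller learns how many patients match, but never which ones.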
17. Confidential and Sensitive Data Handling
Access Decision
• The database administrator decides what data should be in the database and who
should have access to it.
• These decisions are based on access policies that are defined in the
organization.
• Multiple factors are considered in making these policies, such as availability of data, acceptability of the access, authenticity of the user, etc.
18. Confidential and Sensitive Data Handling
Types of Disclosures
Sensitive data can also be characterized based on what values are being disclosed.
• Displaying exact data: This is the most serious disclosure where the user will directly
get the sensitive data on request or sometimes without request; the latter being a serious
security concern.
• Displaying Bounds: Bounds are a convenient way of presenting sensitive data, indicating that the sensitive value lies between a high and a low value. Eg: An organization can reveal the range of salaries paid to its managers, so that any person willing to join the organization can take a decision based on it.
19. Confidential and Sensitive Data Handling
Types of Disclosures
Sensitive data can also be characterized based on what values are being disclosed.
• Displaying negative results: Sometimes a query could display a negative result,
specifying that a particular value is not present. This is of particular importance if the
data is of binary type and is represented as 0 or 1. Thus disclosing a value 0 is of
significant importance. However, in certain cases displaying information like whether a
student will appear in the top 10 list would not reveal significant information.
• Displaying probable values: Sometimes it may be possible to determine the probability that a certain attribute holds a particular value.
• Sensitive data can be secured by keeping it in an encrypted format so that the information is not accidentally revealed. But this can be tedious if different attributes need different levels of confidentiality.
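A minimal sketch of two of the disclosure types above, bounds and negative results; the salary figures and record fields are hypothetical:

```python
# Hypothetical figures for illustration.
manager_salaries = [52000, 61000, 58000]

def disclose_bounds(values):
    """Reveal only that the sensitive values lie between a low and a high bound,
    never the exact figures."""
    return min(values), max(values)

def disclose_negative(records, key, value):
    """Reveal only a negative result: whether the given value is absent
    from all records."""
    return all(r.get(key) != value for r in records)
```

Each function deliberately returns less than the raw data: a range in the first case, a yes/no in the second.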
20. Confidential and Sensitive Data Handling
Handling Data
1. Create a risk-aware culture that includes an information security risk management
program. Define security and risk mitigation and handling policies at the enterprise
level.
2. Define the data types used in the organization and classify them as confidential or sensitive.
3. Clarify responsibilities and accountability for the protection of confidential/sensitive
data.
4. Limit access to confidential/sensitive data to only those absolutely essential to institutional processes.
5. Provide awareness and training to properly use the resources and follow the guidelines
and rules specified.
6. Verify compliance with your policies and procedures regularly.
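Step 4 above, limiting access by classification, might be sketched as a clearance-level comparison. The roles, classifications and numeric levels are assumptions for this sketch, not prescribed by any standard:

```python
# Hypothetical classification and role-clearance mappings (illustrative only).
CLEARANCE = {"public": 0, "internal": 1, "confidential": 2, "sensitive": 3}
ROLE_LEVEL = {"intern": 0, "employee": 1, "manager": 2, "dpo": 3}

def may_access(role: str, classification: str) -> bool:
    """Grant access only when the role's clearance level meets or exceeds
    the data's classification level; unknown roles get no access and
    unknown classifications default to the strictest level."""
    return ROLE_LEVEL.get(role, -1) >= CLEARANCE.get(classification, 3)
```

Defaulting unknown inputs to "deny" reflects the principle that access should be granted only to those who absolutely need it.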
21. Confidential and Sensitive Data Handling
Law provision in India Defining Sensitive Data and its Handling
Right to Information Act, 2005 gave a stimulus to transparency in government dealings and
concurrently provided some protection against the unwarranted disclosure of confidential
information under the law.
• A new civil provision prescribes damages for an entity that is negligent in using “reasonable security practices and procedures” while handling “sensitive personal data or information”, resulting in wrongful loss or wrongful gain to any person.
• Criminal punishment applies to a person who (a) discloses sensitive personal information; (b) does so without the consent of the person or in breach of the relevant contract; and (c) does so with the intention of causing, or knowing that the disclosure is likely to cause, wrongful loss or gain.
• The IT Rules introduced in 2011 define “sensitive personal data” for the first time in India.
22. Confidential and Sensitive Data Handling
Law provision in India Defining Sensitive Data and its Handling
The salient features of the new rules are as follows:
• Sensitive personal information: The rules relate to dealing with information generally, personal information and “sensitive personal data or information” (SPD). SPD is defined to cover the following: (a) passwords; (b) financial information such as bank account, credit card, debit card or other payment instrument details; (c) physical, physiological and mental health conditions; (d) sexual orientation; (e) medical records and history; and (f) biometric and deoxyribonucleic acid (DNA) information. It may be noted that SPD deals with information of individuals and not information of businesses.
• Privacy policy: Every business needs to have a privacy policy that must be published on its
website. Even if the business is not handling SPD, it is required to have a privacy policy. It
must describe what information is collected, what is the purpose of using the information, to
whom or how the information might be disclosed and the sound security practices followed to
safeguard the information.
23. Confidential and Sensitive Data Handling
Law provision in India Defining Sensitive Data and its Handling
The salient features of the new rules are as follows:
• Consent for collection: A business cannot collect SPD unless it obtains the prior
consent of the Information provider. The consent has to be provided by letter, fax or
email.
• Notification: The business should ensure that the information provider is aware
of the information being collected, the purpose of using the information, the
recipients of the information and the name and address of the agency collecting
the information.
• Use and Retention: The usage of personal information has to be restricted to
the purpose for which it was collected. The data retention rules have to be
followed in terms of maintaining the data for specified period as well as
destroying the data after that. The business should not maintain the SPD for
longer than it is specified.
24. Confidential and Sensitive Data Handling
Law provision in India Defining Sensitive Data and its Handling
The salient features of the new rules are as follows:
• Rights of access, correction and withdrawal: The business should permit the
information provider the right to review the information, and should ensure that
any information found to be inaccurate or deficient be corrected. The
information provider also has the right to withdraw its consent to the collection
and use of the information.
• Transnational transfer: A business can only transfer the SPD or information to
a party overseas if the overseas party ensures the same level of protection
provided for under the Indian rules.
• Security procedures: The IT Act requires reasonable security procedures to be
maintained to escape liability. The security procedure has to be audited on a
regular basis by an independent auditor, approved by the Government of India.
25. Lifecycle Management Costs
• Data Lifecycle Management is the process of handling the flow of business information throughout its lifespan, from creation and initial storage through to eventual deletion.
• Information Lifecycle Management (ILM) is the consistent management of
information from creation to final disposition.
• It comprises strategy, process and technology that, when combined, effectively manage information and drive improved control over it in the enterprise.
• It aims at automating the processes involved in organizing data into separate tiers
according to the specified policies, and automating data migration from one tier to
another tier.
• As a rule, newer data, and data that must be accessed more frequently, is stored on
faster, but more expensive storage media, while less critical data is stored on
cheaper, but slower media.
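The tiering rule described above, fresh and frequently accessed data on fast storage, everything else on cheaper media, can be sketched as a simple policy function. The thresholds and tier names are illustrative policy values, not standards:

```python
def choose_tier(days_since_access: int, accesses_per_month: int) -> str:
    """Map data to a storage tier: hot (fast, expensive) for fresh or
    frequently used data; warm and archive for progressively colder data.
    Thresholds are illustrative policy values."""
    if days_since_access <= 30 or accesses_per_month >= 10:
        return "hot"      # SSD / primary storage
    if days_since_access <= 365:
        return "warm"     # cheaper disk
    return "archive"      # tape, cloud archive or Hadoop
```

An ILM system would evaluate such a rule periodically and migrate data between tiers automatically.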
26. Lifecycle Management Costs
Benefits of Information Management Lifecycle
• Reduced Risk: Reduce unneeded and expired information, and make your information
easier to manage and discover.
• Cost Saving: eDiscovery, storage, and legal hold costs can be reduced with better
management of information.
• Improved Service: Archiving, eDiscovery, and Records Management may become less
of a distraction and drain on IT and Legal.
• Effective Governance: ILM can introduce management rigor and controls that benefit
the enterprise. ILM can bring the added bonus of improved management of information
for the entire business.
27. Lifecycle Management Costs
Five Stages of Data Lifecycle
• Data Creation
– When an employee or client creates and saves a file, that data becomes a part of the
organization’s daily operation.
– Enterprises often store this active data locally and on a network server while backing it
up on local storage appliances or cloud storage.
– This setup provides for fast recovery in case of data loss.
• Backup storage against data loss
– To increase efficiency, the enterprise can replicate the data from primary storage into less costly off-site tape vaults or to the cloud.
– In case of a major outage or disaster, the data can be restored completely.
– The backup of the data and the amount of replication depends on the type and value of
the data.
28. Lifecycle Management Costs
Five Stages of Data Lifecycle
• Archiving helps contain storage costs
– Older inactive data that is not frequently handled can be retained in case of a legal, regulatory
or audit event.
– Various data storage networks can be used to archive the data, or data can be retained using
cloud or Hadoop.
– Offsite tapes offer high security and lower storage costs for such long-term data storage demands.
– This kind of low-cost tape is particularly well suited to unstructured data such as Email.
• Ensuring secure data destruction
– The final stage of data lifecycle requires secure data destruction, which is typically governed
by a schedule that defines when and how you must destroy unwanted data.
– Once data reaches its expiration date, secure media destruction can ensure its environmentally
friendly disposal.
29. Lifecycle Management Costs
Five Stages of Data Lifecycle
• Put secure IT asset disposition to work
– The data storage lifecycle does not end until the last traces of data are destroyed, and this includes information remaining within any obsolete hardware or peripherals.
– As with media destruction, maintain the chain of custody when eliminating any old computers
and office equipment.
Efficient Information Lifecycle Management
• For handling large amounts of data, the storage needs to be scalable to accommodate it. Hence, a flexible architecture should be considered for storage.
• Analytics applications in some cases require access to archived and unstructured data. To leverage analytics and make informed decisions, data can be archived into frameworks like Hadoop.
• The storage can be optimized for maintenance and licensing costs by migrating rarely used data into frameworks like Hadoop.
30. Lifecycle Management Costs
To proficiently manage data throughout its entire lifecycle, organizations must keep three
objectives in mind:
• Data veracity(trustworthiness) is critical for both analytics and regulatory compliance.
• Both structured and unstructured data must be managed effectively.
• Data privacy and security must be protected at all times.
31. Archive Data Using Hadoop
• Hadoop offers inexpensive storage, supports any type of data (structured, semi-structured or unstructured), and allows Hadoop data to be queried using SQL commands.
• Hadoop utilizes commodity hardware and can be easily scaled up to
accommodate new data.
• Thus, the Hadoop environment can be used to archive and process the data.
• The Hadoop tool used to perform archiving is Sqoop, which can move the data to be archived from the data warehouse into Hadoop.
• You will need to consider what form you want the data to take in your Hadoop
cluster. In general, compressed Hive files are a good option.
32. Archive Data Using Hadoop
• Archiving everything has an advantage of providing a single interface across the entire
dataset for issuing queries.
• Archiving only part of the data would require queries to be executed on both the archived data and the active data, and the results of the two queries to be merged.
• An enterprise data warehouse archiving solution for Hadoop must provide three key features:
– Schema conversion: The archive must precisely duplicate the schema of the source warehouse. It is essential to confirm that data values will be archived without loss of precision. Changes to the source schema, for example, adding new columns or changing data types, should also be captured by the archive.
– Control and security: The archive must provide access to data on a “need to know” basis; it
must guarantee that sensitive data is encrypted or masked, and that access is audited.
– Querying support: Support for SQL access to the archived data is essential. Applications
would require us to make use of the archived data to generate reports or to perform
analysis.
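The schema-conversion requirement can be illustrated with a small drift check between the source warehouse and the archive. Modelling a schema as a column-to-type mapping is a simplification for this sketch; real tools would read the catalogs of both systems:

```python
def schema_drift(source: dict, archive: dict):
    """Compare a source-warehouse schema with the archive's copy and report
    columns the archive must add or retype. Schemas are modelled as
    {column_name: type_name} dicts for illustration."""
    added = {c: t for c, t in source.items() if c not in archive}
    retyped = {c: (archive[c], t) for c, t in source.items()
               if c in archive and archive[c] != t}
    return added, retyped
```

Running such a check before each archiving cycle ensures new columns and type changes in the source are captured, and that values (e.g. decimal precision) are not silently truncated.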
33. Testing and Delivering Big Data Applications for
Performance and Functionality
• Testing a big data application is more a verification of its data processing than testing the individual features of the software product.
• When it comes to big data testing, performance and functional testing are the key
components to evaluate.
• The testing of Hadoop big data application can be performed as a two-step process.
– Checking the functionality: The business logic encoded using MapReduce programs
is tested in this phase. For this, unit testing can be performed and executed in the
pseudo-distributed mode.
– Checking on the cluster: Once the business logic is validated, it can be tested on the
cluster for the performance and failover. Performance testing includes testing of job
completion and the time taken, utilization of the memory and other resources, data
throughput, etc. Failover testing includes failure of one or more daemons running in Hadoop, namely, the NameNode, DataNode, Resource Manager or Node Manager, or failure of the device through which the distributed environment is made available.
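The first phase above, unit-testing the business logic before running on a cluster, can be illustrated with a word-count mapper and reducer exercised entirely locally. This is a stand-in for testing real MapReduce code in pseudo-distributed mode; the function names and the tiny local "shuffle" are assumptions of this sketch:

```python
# A minimal word-count mapper/reducer, testable without a cluster.
def mapper(line: str):
    """Emit (word, 1) pairs for one input line."""
    return [(word.lower(), 1) for word in line.split()]

def reducer(word: str, counts):
    """Sum the counts emitted for a single word."""
    return word, sum(counts)

def run_local(lines):
    """Group-and-reduce locally so the business logic can be unit-tested
    before deploying the job to the cluster."""
    grouped = {}
    for line in lines:
        for word, one in mapper(line):
            grouped.setdefault(word, []).append(one)
    return dict(reducer(w, c) for w, c in grouped.items())
```

Once assertions like these pass locally, the same mapper/reducer logic can be validated on the cluster for performance and failover.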
34. Testing and Delivering Big Data Applications for
Performance and Functionality
Testing big data applications presents several challenges, which include the following:
• Automation: Support of automation tools for performing testing is not available. Thus,
automation in testing for big data requires someone with technical expertise. Also, automated
tools are not equipped to handle unexpected problems that arise during testing.
• Virtualization: Testing, especially unit testing, is usually performed in a virtual environment.
It is one of the fundamental phases of testing. Virtual machine latency creates timing
problems in real-time big data testing. Also, managing images in big data testing is a hassle.
• Large dataset: The amount of data is huge and can have many variations. Further, data can originate from different sources, so integrating it is a major challenge. Thus, more data needs to be verified, and this needs to be done at a faster rate.
• Testing across platforms: Hadoop is a collection of various tools, and applications can be written using any of them. Thus, there is a need for tools that will enable testing across different platforms.
• Monitoring and diagnostic solution: There are limited solutions that can monitor the entire execution environment and detect bottlenecks or failures.
35. Challenges with Data Administration
• The Data administrator is responsible for designing and maintaining data stores.
• Data administration is the method by which data is monitored, managed and
maintained by a person or an organisation.
• Data administration allows an organisation to check its data resources, along with their
processing and communications with different applications and business processes.
• Data Administrator needs to integrate data from multiple resources and provide it to
various applications.
• The data administrator deals with designing the logical and conceptual models, treating the data at an organisational level, whereas the database administrator deals with the implementation of the databases required and in use.
36. Challenges with Data Administration
Responsibility of Data Administrator
1. Data Policies, Procedures, Standards
• Data administrator should set the data creation and handling policies which include details of
which application can interact with which data, how that data can be changed and what is the effect
of the change.
• Data Procedures are documented plan of actions to be taken to perform a certain activity like
backup and recovery procedures. Data administrator’s role is to ensure that these procedures are
defined and communicated to all concerned employees.
• Data Standards are unambiguous conventions and behaviours that need to be followed so that maintenance becomes easy. They can also be used to evaluate database quality.
2. Planning
• Effective administration of data requires an understanding of the organisation’s needs and the ability to lead the development of an information architecture that will meet the diverse needs of the organisation.
• Thus a data administrator needs to plan for effective administration of data and also provide support for future needs.
37. Challenges with Data Administration
Responsibility of Data Administrator
3. Data Conflict (Ownership) Resolution
• Data stores are planned to be shared and usually involve data from several different departments of
the organisation.
• Ownership of data is a sensitive issue in every organisation.
• Data administrator should establish procedures for resolving any conflicts in ownership.
4. Managing the Data Repository
• Data Repositories contain metadata that holds descriptions of the data stored in data stores.
• They describe an organisation’s data and data processing resources.
• As the data stores are increasing in size and incorporating unstructured data, data repositories need
to be enhanced to incorporate new and unseen data.
5. Internal Marketing of DA Concepts
• For data administration to be effective, established policies and procedures must be made known
to the internal staff. These may reduce resistance to changes or ownership problems.
38. Challenges with Data Administration
Responsibility of Database Administrator
1. Designing the Database
• The administrator is responsible for defining and creating the logical data model, physical
database model and prototyping.
2. Security and Authorization
• The database administrator ensures that there is no unauthorized access to data. In general,
the data should not be accessible to everyone.
• In a database system, user may be granted permission to access only certain views and
relations.
• The administrator can enforce various authentication and authorization techniques through
which the access can be guaranteed only to specific entities.
• Authentication techniques ensure that the person is an individual who is supposed to access the data, while authorization techniques decide what data access should be granted.
39. Challenges with Data Administration
Responsibility of Database Administrator
3. Data Availability and Recovery from Failures
• The administrator makes sure that the data is available at all times.
• In case of database failure, the administrator should ensure that the data is made
available to its user in such a way that the users are unaware of the failure.
• The administrator also ensures that the data remains in a consistent state and
appropriate techniques to achieve these are implemented.
4. Database Tuning
• The database needs to evolve over time as user needs change.
• The administrator should modify the structure or design of the database to incorporate
these changes.
• The DBA is responsible for modifying the database, in particular its conceptual and logical design.
40. Challenges with Data Administration
Challenges of Data Administrator
• Creating the Data Repository
– With huge amounts of data flowing in from various sources, integrating it to create a common data repository is challenging.
– This is further complicated since the data is in an unstructured format.
– Pre-processing is an important step in preparing the data for processing and
efficient techniques need to be developed.
• Evolving Nature of Data Consideration in Analysis
– A modern administrator is required to have an understanding of vast domains, as organizations are now dealing with new types of data.
– Eg: Machine data is centrally logged and stored. To track a machine’s performance, its data needs to be understood well enough to gain insight from it, even by those who do not possess the relevant technical background.
41. Challenges with Data Administration
Challenges of Data Administrator
• Emphasize the capability to build a database quickly, tune it for maximum
performance and restore it to production quickly when problems develop.
• Enforcing data policies and standards, especially those related to security.
• As the organization’s needs change, efficient support should be provided to incorporate the changes and make provision for future scope.
• Ownership criteria for data are not restricted to the internal staff. With social media, it is tricky to define the ownership of data.
• The administrator is always expected to keep abreast with new technologies and is
usually involved in mission critical applications.
• Another challenging aspect is that data administrators are required to have a
comprehensive understanding of a wide variety of topics to understand and improve
business processes in their organization.