What is Data?
Data
A critical mass of data is now being generated, thanks to the internet,
improved computing technology and the development of data
analytics. It is often said that the world’s most valuable resource is no
longer oil, but data.
Data has no meaning unless we study it and draw inferences or
insights from it. Think of data analytics as the process of extracting
usable fuel from crude oil: just as crude oil is first extracted from the
sea and then cleaned and processed into good-quality fuel, data is first
mined from relevant sources and then cleaned and analysed for
meaningful insights.
“Data is a collection of facts, such as numbers, words, measurements, observations or even just descriptions
of things. Data may be in the form of text documents, images, audio clips, software programs, or other types
of data. If data is not put into context, it doesn't do anything to a human or computer.”
Unit          Value
Bit           1 bit
Byte          8 bits
Kilobyte      1024 bytes
Megabyte      1024 kilobytes
Gigabyte      1024 megabytes
Terabyte      1024 gigabytes
Petabyte      1024 terabytes
Exabyte       1024 petabytes
Zettabyte     1024 exabytes
Yottabyte     1024 zettabytes
Brontobyte    1024 yottabytes (informal name, not a standardized unit)
Common data measurements
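The factor-of-1024 steps in the measurements above can be sketched in a few lines of Python. This is only an illustration: the names UNITS, to_bytes and human_readable are made up for this example, and the list stops at yottabytes.

```python
# Units from the measurements above, smallest to largest.
# Each step up multiplies by 1024. Names here are illustrative only.
UNITS = ["bytes", "kilobytes", "megabytes", "gigabytes", "terabytes",
         "petabytes", "exabytes", "zettabytes", "yottabytes"]


def to_bytes(value, unit):
    """Convert a value in the given unit to bytes (1024 per step)."""
    return value * 1024 ** UNITS.index(unit)


def human_readable(num_bytes):
    """Express a byte count in the largest unit that keeps the value >= 1."""
    value = float(num_bytes)
    for unit in UNITS:
        # Stop once the value fits in this unit, or we run out of units.
        if value < 1024 or unit == UNITS[-1]:
            return f"{value:.2f} {unit}"
        value /= 1024


print(to_bytes(1, "gigabytes"))       # 1073741824
print(human_readable(5 * 1024 ** 4))  # 5.00 terabytes
```

So one gigabyte is 1024 × 1024 × 1024 bytes, a little over a billion bytes.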
Classification of data
• Data can be classified as primary data and secondary data.
Primary Data:
Primary data are the facts and figures collected directly by an
investigator or group of investigators for the problem at hand. They can
be further divided into two kinds: observational data and questionnaire data.
• Observational data are collected by observing people, activities or
processes, or by taking readings from mechanical or electronic (IoT)
devices.
• Questionnaire data are collected through in-depth interviews or
questionnaire forms, administered manually, by telephone or over the
internet.
Secondary Data:
Secondary data are collated from a source that had already stored the
information. These facts and figures might have been recorded for an
earlier project by an individual, an agency or a government. Secondary
data can be divided into two types.
• Internal data: data generated inside the organization and captured
through ERP systems, CRM systems or any other transactional data
system. Financial statements, vendor/customer lists and other reports of
interest are produced from these data.
• External data: data that exist outside the organization, such as the
survey reports of an external agency, reports in periodicals and
magazines, and reports published by the government.
Classification of data (diagram):
Data
  • Primary Data
      • Observational Data
      • Questionnaire Data
  • Secondary Data
      • Internal Data
      • External Data
Pros and Cons of Primary and Secondary data
• Advantages of primary data:
  • The researcher can decide on the variables, and on where, when and
  why to collect the data.
  • The researcher can decide how much data the problem at hand requires.
  • The researcher can collect the data personally or hire an agency.
  • Since the data is collected under the researcher's supervision, data
  cleaning might not be required.
• Disadvantages of primary data:
  • Highly expensive in terms of both money and time.
  • The researcher needs to keep tabs on the quality of the data.
• Advantages of secondary data:
  • Since the data is already available, it can be procured without
  spending any time on collection.
  • Acquiring the data is relatively inexpensive.
  • The researcher is not personally responsible for the quality of the data.
• Disadvantages of secondary data:
  • Data quality cannot be guaranteed; for example, fake or blank data
  might be present.
  • Data can be insufficient or inaccurate.
  • A lot of data cleaning might be required.
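The quality concerns listed for secondary data (blank values, repeated records) can be checked before any analysis. Below is a minimal Python sketch that assumes the data arrives as CSV text. The sample rows, the column names and the quality_report function are all invented for illustration.

```python
import csv
import io

# Hypothetical secondary data, as it might arrive from an external source.
RAW = """customer_id,name,city
101,Asha,Pune
102,,Mumbai
101,Asha,Pune
103,Ravi,
"""


def quality_report(text):
    """Count blank fields and exact duplicate rows in CSV text."""
    rows = list(csv.DictReader(io.StringIO(text)))
    # A field counts as blank if it is missing or only whitespace.
    blanks = sum(1 for row in rows for v in row.values()
                 if not (v or "").strip())
    seen, duplicates = set(), 0
    for row in rows:
        key = tuple(row.values())
        if key in seen:
            duplicates += 1
        seen.add(key)
    return {"rows": len(rows), "blank_fields": blanks,
            "duplicate_rows": duplicates}


print(quality_report(RAW))
```

A report like this only flags candidates for cleaning. Deciding whether to drop, fill or keep a flagged record still depends on the problem at hand.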
How is data stored?
Data can be stored in files, as in mainframe systems using ISAM and VSAM. Other file formats for data
storage include comma-separated values (CSV). These formats continue to find uses across a variety of machine
types.
In corporate computing or enterprise software, data is stored in databases managed by a database management
system (DBMS) or relational database management system (RDBMS), whereas data from IoT devices, social media
and similar sources is typically stored in data lakes.
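As a small illustration of file-based storage, the sketch below writes a few records as comma-separated values and reads them back using Python's standard csv module. The records and field names are made up, and an in-memory buffer stands in for a file on disk.

```python
import csv
import io

# Made-up records; values are kept as strings because CSV stores text only.
records = [
    {"invoice": "INV-1", "customer": "Acme", "amount": "250.0"},
    {"invoice": "INV-2", "customer": "Globex", "amount": "90.5"},
]

# Write: one header line, then one comma-separated line per record.
buffer = io.StringIO()  # stands in for a file on disk
writer = csv.DictWriter(buffer, fieldnames=["invoice", "customer", "amount"])
writer.writeheader()
writer.writerows(records)
csv_text = buffer.getvalue()

# Read the same text back into a list of dictionaries.
loaded = list(csv.DictReader(io.StringIO(csv_text)))

print(csv_text)
print(loaded == records)  # True
```

Because the format is plain text, the same file can be opened by a spreadsheet, a database import tool or a program on a different machine type.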
Types of Data
• Data can be divided into two types: structured data and unstructured
data.
• Structured data is organized into tables of rows and columns in a
formatted structure, typically a database, so that it can be processed
and analysed more effectively. Because the data is stored in a structured
format, each field is discrete and its information can be retrieved either
separately or along with data from other fields, in a variety of
combinations. Fields may hold, for example, numbers, words,
measurements, observations or short descriptions of things.
Transactional data in financial systems (ERP) and other business
applications is a typical example of structured data. Structured data is
stored in a database managed by a database management system (DBMS)
or relational database management system (RDBMS), or in a data
warehouse.
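The idea that each field is discrete and can be retrieved alone or in combination with other fields can be sketched with Python's built-in sqlite3 module. The table name, columns and rows below are invented for illustration, and an in-memory database stands in for a real RDBMS.

```python
import sqlite3

# In-memory relational table with made-up transaction rows.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE transactions "
    "(txn_date TEXT, name TEXT, phone TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO transactions VALUES (?, ?, ?, ?)",
    [("2024-01-05", "Asha", "555-0101", 120.0),
     ("2024-01-06", "Ravi", "555-0102", 80.0),
     ("2024-01-07", "Asha", "555-0101", 40.0)],
)

# Retrieve one field on its own.
names = [row[0] for row in conn.execute(
    "SELECT DISTINCT name FROM transactions ORDER BY name")]

# Retrieve fields in combination: total amount per name.
totals = conn.execute(
    "SELECT name, SUM(amount) FROM transactions GROUP BY name ORDER BY name"
).fetchall()

print(names)   # ['Asha', 'Ravi']
print(totals)  # [('Asha', 160.0), ('Ravi', 80.0)]
```

Both queries work only because every row follows the same pre-defined columns, which is what makes structured data easy to search and analyse.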
• Unstructured data:
Unstructured data is information in forms other than those used in
conventional data models, and it isn't a good fit for a mainstream
relational database. The emergence of the internet has resulted in a data
explosion: terabytes of data are generated every second, in formats
that aren't uniform, such as videos, audio, photos, e-mails, Word
documents, PowerPoint presentations and more. With the advent of
IoT (Internet of Things), a great deal of data is also generated by
sensors attached to machines and automobiles, alongside server log
files and social media feeds. Unstructured data is generally stored
in data lakes.
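Before free-form data such as server log files can be analysed like a table, some structure usually has to be extracted from it. The sketch below shows one possible approach using a regular expression. The log lines, the pattern and the parse function are hypothetical, and real logs vary widely in format.

```python
import re

# Hypothetical server log lines: free text with no enforced schema.
LOG = """2024-03-01 10:15:02 ERROR disk usage at 91% on host web-01
2024-03-01 10:15:09 INFO user login ok
not a well-formed line
"""

# Assumed layout: date, time, level, then a free-text message.
PATTERN = re.compile(r"^(\d{4}-\d{2}-\d{2}) (\d{2}:\d{2}:\d{2}) (\w+) (.*)$")


def parse(text):
    """Turn the lines that match the pattern into rows; skip the rest."""
    rows = []
    for line in text.splitlines():
        match = PATTERN.match(line)
        if match:
            date, time, level, message = match.groups()
            rows.append({"date": date, "time": time,
                         "level": level, "message": message})
    return rows


rows = parse(LOG)
print(len(rows))         # 2
print(rows[0]["level"])  # ERROR
```

The third line is silently skipped because it does not fit the assumed layout, which illustrates why data without a pre-defined structure is harder to search.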
Difference between Structured data and Unstructured data
• Of the two, unstructured data is the least formatted and
structured data is the most formatted. The comparison below summarises
the differences.

Characteristics
  Structured data:
  • Usually numbers and text
  • Easy to search
  • Pre-defined structure, such as tables
  • Highly organised
  • Easy to analyse
  Unstructured data:
  • Text, videos, audio, images or other formats
  • No pre-defined structure
  • Difficult to search

Stored in
  Structured data: relational databases, data warehouses
  Unstructured data: data warehouses, data lakes, NoSQL databases

Applications
  Structured data:
  • ERP systems like SAP, Oracle etc.
  • CRM systems like Siebel CRM
  • Railway/airline reservation systems
  • Spreadsheets
  Unstructured data:
  • Word and PowerPoint files
  • Email clients
  • Social media sites
  • E-commerce sites

Flexibility
  Structured data: schema dependent, very rigid
  Unstructured data: very flexible, no schema

Examples
  Structured data:
  • Transactions in ERP
  • Dates
  • Phone numbers
  • Amounts
  • Names
  Unstructured data:
  • Text
  • Email messages
  • Social media posts
  • Audio files
  • IoT sensor data
