SlideShare a Scribd company logo
1 of 80
Download to read offline
CHAPTER 01: TYPES OF DIGITAL DATA
Data
• Any data that can be processed by digital
computer and stored in the sequences of 0's and
1's (Binary language) is knowns as digital data.
• Whenever you send an email, read a social media
post, or take pictures with your digital camera,
you are working with digital data.
• In general, data can be any character, text,
numbers, voice messages, SMS, WhatsApp
messages, pictures, sound, or video.
Data
• Byte is the basic unit of information
in computer storage and processing, and is
composed of eight bits; a kilobyte is 1,000 bytes;
one megabyte is 1,000 kilobytes . (GB, TB, PB, EB,
ZB, YB)
• Digitizing is the process of converting information
into digital form and is necessary for a computer to
be able to process and store the information.
Data
• It is an invaluable asset of any enterprise (big or small).
• Data is present internal to the enterprise and also exists
outside the firewalls of the enterprise.
• Data may be in homogeneous or heterogeneous.
• Need of the hour is to
– Understand, manage, process,
– and take the data for analysis
– to draw valuable insights.
Types of digital data
1. Structured Data : data stored in the form of
rows and columns (databases, Excel)
2. Un-structured Data: No pre-defined schema
(PPTs, images, Videos, pdfs)
3. Semi-structured Data: Hybrid schema (JSON,
HTML, XML, Email, and so on),
Distribution of digital data (in %)
(by Gartner)
80
10
10
Unstructured
Semi-structured
Structured
Structured Data
• Data which is in an organized form (In rows & columns).
• Computer programs can use this data easily.
• Relationships exists between entities of data.
• Example
– Data stored in databases
– ERP
– CRM
– DW
– Data Cube
Structured Data
• The data conforms to a pre-defined schema or structure
is known as structured data.
• The data can be processed, stored, and retrieved in a
fixed format. This data can be processed easily by
programs.
• Conforms to a relational data model.
• Structured data is organized in semantic chunks/entities
with similar entities grouped together to form
relations/tables.
structured Data
• Descriptions for all entities in a group
• Have the same defined format
• Have a predefined length
• Follow the same order.
Example
Sources of Structured Data
Structured
Data
OLTP
systems
Excel
Databases
Ease with structured data
Ease with
structured
data
Security
Insert/Update/
delete
Scalability
Transaction
processing
(ACID)
Indexing/
Searching
Database (RDBMS)
• Oracle Corp. – Oracle
• IBM – DB2, IBM-Informix
• Microsoft – SQL
• EMC – Greenplum
• Teradata – Teradata
• Open source- MySQL, PostgresSQL
• Sqlite
• Sequel Pro
• Amazon Aurora
• SAP SQL Anywhere, SAP IQ (Sybase)
Semi-structured Data
• Data which does not conform to a data model but has
some structure.
• Computer programs can not use this data easily.
• Example
– emails
– XML
– HTML
– JSON, and so on.
Semi-structured data (SSD)
• It is referred to as self describing structure.
• It is a form of structured data that does not
conform with the formal structure of data models
associated with relational databases or other
forms of data tables.
• It uses metadata and tags to provide semantic
information.
Characteristics of semi-structured data
(SSD)
• Does not conform to a data model
• Cannot be stored in the form of rows and columns
as in a database.
• The tags and elements are used to describe data.
• Attributes in a group may not be the same.
• Similar entities are grouped.
• Size of the same attributes in a group may differ
• Type of same attributes in group may differ.
• Evolving Schema
• Schema and data are tightly coupled.
Example (Names & Emails)
• One way is:
Name: Raju Patil
Email : rp@test.tcs.in, rp70@gmail.com
• Another way is:
First Name: Raju
Last Name :Patil
Email : rajup70@gmail.com
Sources of SSD
• Email
• XML
• TCP/IP
• Zipped files
• Mark-up languages
• Integration of data from heterogeneous sources.
Example: Email format
To: <Name>
From: <Name>
Subject: <Text>
CC: <Name>
Body: <Text, Graphics, Images, etc.><Name>
ABC Healthcare Blood Test Report
Date <> ----
Department <> -----
Patient Name <>
Attending Doctor <>
Hemoglobin
content
<>
Patient Age <>
RBC count <>
WBC count <>
Platelet count <>
Diagnosis <notes>
Conclusion <notes>
XML & JSON
Integration of data from heterogeneous
sources
User
Mediator : Uniform access to multiple data sources
OODBMS
RDBMS
Legacy
system
Structured
file
Getting to know Unstructured data
• Over the past few days, Dr. Ben and Dr. Stanley
had been exchanging long emails about a
particular case of gastro-intestinal problem.
• Email contains procedure practiced by Dr. Stanley,
about combination of drugs that has successfully
cured gastro-intestinal disorders in patients.
• Dr. Mark has a patient in the “GoodLife”
emergency unit with quite similar case of gastro-
intestinal disorder.
Unstructured Data
• Unstructured data refers to the data that lacks any
specific form or structure.
• This makes it very difficult and time-consuming to
process and analyze unstructured data.
• Data which does not conform to any data model is USD.
• Computer programs can not use this data directly.
• About 80-90% data of an organization is in this format.
• An enormous amount of knowledge is hidden in this
data.
• Hence finding useful knowledge/insight from USD is very
crucial.
Unstructured Data
• Unstructured data is a generic label for describing data
that is not contained in a database or some other type
of data structure.
• Unstructured data can be textual or non-textual.
• Textual unstructured data is generated in media like
email messages, PowerPoint presentations, Word
documents, comments in social media, etc.
• Non-textual unstructured data is generated in media
like images, CCTV footage, audio files and video files.
• Anything in a non-database form is unstructured data.
Unstructured Data
• Two types:
1. Bitmap objects : image, video, or audio files
2. Textual objects : word, emails, ppts and so on.
Unstructured Data
• Example
– Memos, QR code (Quick Response), Blogs
– Chat rooms, Tweets, Comments, likes, tags
– PPTs, emoji's, emoticons (emotion icons)
– Images, log files, social media posts
– Videos, sensor data (raw), weather data
– Doc files, geospatial data, surveillance data
– Body of email , GPS data, sensor data, etc.
– WhatsApp messages, CCTV footage and so on.
Getting to know Unstructured data
Characteristics of Unstructured data
• This data cannot be stored in the form of rows
and columns as in a database and does not
conform to any data model.
• It is difficult to determine the meaning of the
data.
• It does not follow any rule or semantics, i.e. Not
in any particular format or sequence.
• Not easily usable by a program.
Sources of Unstructured data
• Web pages
• Audio and Videos
• Images
• Body of an email
• Word document
• PPT and reports
• Chats and text messages
• Social media data
• White papers
• Surveys
• SMS
• Free form text
• Server Log files
• Product reviews
Web page is unstructured data
Web Page
Multimedia Image
Database
Text
XML
Challenges
• Storage space: A lot of space is required to store USD.
• Scalability: As the data grows, scalability becomes an
issue and the cost of storing USD increases.
• Retrieve information: Difficult to retrieve required
information from USD
• Security: Ensuring security is difficult due to varied
sources of data. E.g. emails, web pages, etc.
• Indexing & searching: Very difficult and error-prone
as the structure of the USD is not clear.
Challenges
• Interpretation : USD is not easily interpreted by
conventional search algorithms.
• Classification : Different naming conventions
followed across the organization make it difficult to
classify data.
• Deriving meaning : Computer programs cannot
automatically derive meaning or structure from USD.
• File formats : Increasing number of file formats
makes it difficult to interpret data.
Portion of Unstructured data
USD
SD
Dealing with USD
1. Data mining
2. Text mining /Text Analytics
3. NLP
4. Noisy text analytics
5. Manual tagging with meta data
6. Part of speech tagging
7. UIMA
8. Web Scraping
Possible
Solutions
Data Mining
• It is the computing process of discovering patterns
in large data sets involving methods at the
intersection of AI, machine learning &
DL, statistics, and database systems.
• Popular algorithms:
– Association rule mining (MBA)
– Regression Analysis (Y=mX+ c)
– Collaborative filtering
Collaborative filtering
• collaborative filtering uses similarities between users and
items simultaneously to provide recommendations.
• It is a method of making automatic predictions (filtering)
about the interests of a user by collecting preferences
or taste information from many users (collaborating).
• Collaborative filtering works on a fundamental
principle: you are likely to like what someone similar to
you likes.
Collaborative filtering
• Collaborative filtering (CF) is a technique commonly used
• Collaborative filtering (CF) is a technique used
by recommender systems to build personalized
recommendations on the Web.
• Companies that employ CF model include Amazon,
Facebook, Twitter, LinkedIn, Spotify, Google News,
Netflix, iTunes.
Collaborative filtering
Text analytics or text mining
• It is the process of converting
unstructured text data into meaningful data for
analysis, to measure customer opinions, product
reviews, feedback and sentimental analysis to
support fact based decision making.
• Uses many linguistic, statistical, and machine
learning techniques such as clustering, pattern
recognition, tagging, association analysis,
predictive analytics, etc.
Text analytics or text mining
• It helps organizations to find potentially valuable
business insights in corporate documents, customer
emails, call center logs, survey comments, social
network posts, medical records and other sources of
text-based data.
• Text mining capabilities are also being incorporated
into AI chatbots/virtual agents that companies deploy
to provide automated responses to customers as part
of their marketing, sales and customer service
operations.
Natural Language Processing (NLP)
• Natural language processing (NLP) is the ability of a
computer program to understand human language as
it is spoken. NLP is a component of artificial
intelligence (AI).
• It is a field of computer science, artificial
intelligence and computational linguistics concerned
with the interactions between computers and human
(natural) languages (HCI domain).
• NLP strives to build machines that understand and
respond to text or voice data.
Natural Language Processing (NLP)
Noisy text analytics
• It is the process of extracting structured or semi-
structured information from noisy unstructured text data
such as online chat, text messages, emails, message
boards, blogs, wikis, etc.
• The noisy unstructured data comprises one or more of
the followings:
– Spelling mistakes,
– Acronyms
– Non-standard words (HBD, K, GN, GM, VGM, etc.)
– Missing punctuations,
– Missing letters and so on.
Manual tagging with metadata
• It is the process of tagging manually with adequate
metadata to provide the semantics to understand
unstructured data.
.
Road Accident
Part of Speech Tagging
• It is also called as POS or POST or grammatical
tagging.
• It is the process of reading text and tagging each
word in the sentence as belonging to a particular
part of speech such as “noun”, “verb”, “adjective”,
“pronoun”, etc.
.
Unstructured Information
Management Architecture(UIMA)
• It is an open source platform from IBM, which
integrates different kinds of analysis engines to
provide a complete solution for knowledge
discovery from USD.
• It bridge the gap between structured and USD.
Uses of UIMA
• Used to convert unstructured data such as
repair logs and service notes
into relational tables.
• These tables can then be used
by automated tools to detect maintenance or
manufacturing problems.
Uses of UIMA
• Used in medical contexts to analyze clinical notes,
such as the Clinical Text Analysis and Knowledge
Extraction System ( Apache CTAKES).
• CTAKES is an open-source Natural Language
Processing (NLP) system that extracts clinical
information from electronic health/medical
record free-text (Users are free to type whatever
they want in any form).
UIMA block diagram
Users
Acquired from
various
sources
Subjected to
semantic
analysis
Structured
information
access
Query and
presentation
Structured
information
Analysis
Delivery
USD
Transformed
into
Web Scraping
Big Data
• Big data is a term that describes large, hard-
to-manage volumes of data – both structured
and unstructured - none of traditional data
management tools can store it or process it
efficiently.
• experts now predict that 74 zettabytes of
data will be in existence by 2021.
Big Data
• Every day, we create 2.5 quintillion(1018)
bytes of data —90% of the data in the world
today has been created in the last two years
alone.
• This data comes from everywhere: sensors
used to gather climate information, posts to
social media sites, digital pictures and videos,
purchase transaction records, and cell phone
GPS signals, WhatsApp, IOT and so on.
Characteristics of Data
• Composition: Deals with structure of data, i.e.,
sources of data, the granularity(Ex. Postal
address), the types, nature of data (Static or real-
time).
• Condition: Deals with the state of data, that is,
“Can one use data as it is for analysis?” or “Does it
require cleansing for further enhancement and
enrichment?”.
Characteristics of Data
• Context: Deals with
– Where, this data has been generated?
– Why this data generated?
– How sensitive is this data?
– What are the events associated with this data?
– And so on.
Gartner
• Is a global research and advisory firm
providing insights, advice, and tools for
leaders in IT, Finance, HR, Customer Service
and Support
Big data definition- Gartner
• Big data is high-volume, high-velocity, and high-
variety information assets that demand cost
effective, innovative forms of information
processing for enhanced insight and decision
making.
• Cost effective and innovative forms of
information processing: Talks about embracing
new techniques and technologies to capture,
store, process, persevere, integrate and visualize
the big data(3vs).
Definition of Big data by Gartner
• Enhanced insight and decision making: Talks
about deriving deeper, richer, and meaningful
insights and then using these insights to make
faster and better decisions to gain business value
and thus a competitive edge.
Big data formula
DATA
Enhanced
Business
Value
Information
Actionable
Intelligence
Better
Decisions
Challenges with Big Data
• Capture
• Storage (Solution: Cloud Computing)
• Curation ( Management of data + Data retention)
• Search
• Analysis
• Transfer
• Visualization
• Privacy violations
3 Vs
3 V’s of Big data
• The data that is big in Volume, Velocity and
Variety is known as big data.
Sources of big data
• Archives: Archives of scanned documents,
customer correspondence records, patient’s
health records, student’s admission records,
students’ assessment records and so on.
• Sensor data: Car sensors, smart electric meters,
office buildings, washing m/c, other electronic
appliances and so on.
• Machine log data: Event logs, application logs,
audit logs, server logs, etc.
Sources of big data
• Public web: Wikipedia, Weather, regulatory, census, etc.
• Data storage: File systems, SQL database, NoSQL
database (Mongo DB, Cassandra) and so on.
• Media: Audio, Video, image, etc.
• Docs: CSV, word docs, PDF, PPT, XLS, etc.
• Business Apps: ERP, CRM, HR, Google Docs, etc.
• Social media: Twitter blogs, Facebook, LinkedIn,
YouTube, Instagram, etc.
• IOT
Other characteristics of big data
• Veracity and Validity: Refers to the accuracy
(quality) and correctness of the data.
• Volatility: Deals with how long the data is valid?,
and how long should it be stored?. (OTP, Aadhar
No., PW)
• Variability: Data flows can be highly inconsistent
with periodic peaks. (In total 7V’s of big data)
Why Big data
More confidence in decision making
More Data
More Accurate analysis
Greater operational efficiency, cost reduction, time
reduction, new product development, optimized
offerings, etc.
Three reasons for leveraging big data
1. Competitive Advantage.
2. Decision making
3. To create new business value out of data.
Typical data warehouse Environment
Typical Hadoop Environment
• It is different from DW environment.
• Here data sources are web logs, images, audios,
videos, social media, doc files, pdfs, etc.
Hadoop Environment
Big data & DW coexistence
Big data & DW coexistence
Digital Types

More Related Content

What's hot

Exploratory data analysis
Exploratory data analysisExploratory data analysis
Exploratory data analysisGramener
 
UX as Knowledge Work
UX as Knowledge WorkUX as Knowledge Work
UX as Knowledge WorkThomas Wendt
 
Introduction to Systematic Literature Review method
Introduction to Systematic Literature Review methodIntroduction to Systematic Literature Review method
Introduction to Systematic Literature Review methodNorsaremah Salleh
 
Lect 2 getting to know your data
Lect 2 getting to know your dataLect 2 getting to know your data
Lect 2 getting to know your datahktripathy
 
Analytics ROI Best Practices
Analytics ROI Best PracticesAnalytics ROI Best Practices
Analytics ROI Best PracticesDATAVERSITY
 
Chapter_1.pptx
Chapter_1.pptxChapter_1.pptx
Chapter_1.pptxAadiSoni3
 
Data science applications and usecases
Data science applications and usecasesData science applications and usecases
Data science applications and usecasesSreenatha Reddy K R
 
OCS352-IOT -UNIT-1.pdf
OCS352-IOT -UNIT-1.pdfOCS352-IOT -UNIT-1.pdf
OCS352-IOT -UNIT-1.pdfgopinathcreddy
 
Big Data Analytics in Government
Big Data Analytics in GovernmentBig Data Analytics in Government
Big Data Analytics in GovernmentDeepak Ramanathan
 
Systematic Literature Reviews and Systematic Mapping Studies
Systematic Literature Reviews and Systematic Mapping StudiesSystematic Literature Reviews and Systematic Mapping Studies
Systematic Literature Reviews and Systematic Mapping Studiesalessio_ferrari
 
Business Analytics
Business AnalyticsBusiness Analytics
Business AnalyticsLambert1035
 
The Social Semantic Web
The Social Semantic WebThe Social Semantic Web
The Social Semantic WebJohn Breslin
 
4. Internet of Things - Reference Model and Architecture
4. Internet of Things - Reference Model and Architecture4. Internet of Things - Reference Model and Architecture
4. Internet of Things - Reference Model and ArchitectureJitendra Tomar
 
Data Preparation Fundamentals
Data Preparation FundamentalsData Preparation Fundamentals
Data Preparation FundamentalsDATAVERSITY
 
Exploratory data analysis with Python
Exploratory data analysis with PythonExploratory data analysis with Python
Exploratory data analysis with PythonDavis David
 
How to Report Test Results
How to Report Test ResultsHow to Report Test Results
How to Report Test ResultsSr Edith Bogue
 
the history of data visualization
the history of data visualizationthe history of data visualization
the history of data visualizationNIFTIT
 
Tableau Visual Guidebook
Tableau Visual GuidebookTableau Visual Guidebook
Tableau Visual GuidebookAndy Kriebel
 

What's hot (20)

Exploratory data analysis
Exploratory data analysisExploratory data analysis
Exploratory data analysis
 
UX as Knowledge Work
UX as Knowledge WorkUX as Knowledge Work
UX as Knowledge Work
 
Introduction to Systematic Literature Review method
Introduction to Systematic Literature Review methodIntroduction to Systematic Literature Review method
Introduction to Systematic Literature Review method
 
Lect 2 getting to know your data
Lect 2 getting to know your dataLect 2 getting to know your data
Lect 2 getting to know your data
 
Analytics ROI Best Practices
Analytics ROI Best PracticesAnalytics ROI Best Practices
Analytics ROI Best Practices
 
Chapter_1.pptx
Chapter_1.pptxChapter_1.pptx
Chapter_1.pptx
 
Data science big data and analytics
Data science big data and analyticsData science big data and analytics
Data science big data and analytics
 
Data science applications and usecases
Data science applications and usecasesData science applications and usecases
Data science applications and usecases
 
OCS352-IOT -UNIT-1.pdf
OCS352-IOT -UNIT-1.pdfOCS352-IOT -UNIT-1.pdf
OCS352-IOT -UNIT-1.pdf
 
Big Data Analytics in Government
Big Data Analytics in GovernmentBig Data Analytics in Government
Big Data Analytics in Government
 
Systematic Literature Reviews and Systematic Mapping Studies
Systematic Literature Reviews and Systematic Mapping StudiesSystematic Literature Reviews and Systematic Mapping Studies
Systematic Literature Reviews and Systematic Mapping Studies
 
Business Analytics
Business AnalyticsBusiness Analytics
Business Analytics
 
IR
IRIR
IR
 
The Social Semantic Web
The Social Semantic WebThe Social Semantic Web
The Social Semantic Web
 
4. Internet of Things - Reference Model and Architecture
4. Internet of Things - Reference Model and Architecture4. Internet of Things - Reference Model and Architecture
4. Internet of Things - Reference Model and Architecture
 
Data Preparation Fundamentals
Data Preparation FundamentalsData Preparation Fundamentals
Data Preparation Fundamentals
 
Exploratory data analysis with Python
Exploratory data analysis with PythonExploratory data analysis with Python
Exploratory data analysis with Python
 
How to Report Test Results
How to Report Test ResultsHow to Report Test Results
How to Report Test Results
 
the history of data visualization
the history of data visualizationthe history of data visualization
the history of data visualization
 
Tableau Visual Guidebook
Tableau Visual GuidebookTableau Visual Guidebook
Tableau Visual Guidebook
 

Similar to Digital Types

Introduction to Data Science.pptx
Introduction to Data Science.pptxIntroduction to Data Science.pptx
Introduction to Data Science.pptxAnusuya123
 
Chapter 2 - Introduction to Data Science.pptx
Chapter 2 - Introduction to Data Science.pptxChapter 2 - Introduction to Data Science.pptx
Chapter 2 - Introduction to Data Science.pptxWollo UNiversity
 
Lec20.pptx introduction to data bases and information systems
Lec20.pptx introduction to data bases and information systemsLec20.pptx introduction to data bases and information systems
Lec20.pptx introduction to data bases and information systemssamiullahamjad06
 
System Analysis And Design
System Analysis And DesignSystem Analysis And Design
System Analysis And DesignLijo Stalin
 
Database Systems - Lecture Week 1
Database Systems - Lecture Week 1Database Systems - Lecture Week 1
Database Systems - Lecture Week 1Dios Kurniawan
 
Introduction to Big Data Analytics
Introduction to Big Data AnalyticsIntroduction to Big Data Analytics
Introduction to Big Data AnalyticsUtkarsh Sharma
 
Creating Effective Data Visualizations in Excel 2016: Some Basics
Creating Effective Data Visualizations in Excel 2016:  Some BasicsCreating Effective Data Visualizations in Excel 2016:  Some Basics
Creating Effective Data Visualizations in Excel 2016: Some BasicsShalin Hai-Jew
 
01-Database Administration and Management.pdf
01-Database Administration and Management.pdf01-Database Administration and Management.pdf
01-Database Administration and Management.pdfTOUSEEQHAIDER14
 
Data science.chapter-1,2,3
Data science.chapter-1,2,3Data science.chapter-1,2,3
Data science.chapter-1,2,3varshakumar21
 

Similar to Digital Types (20)

Introduction to Data Science.pptx
Introduction to Data Science.pptxIntroduction to Data Science.pptx
Introduction to Data Science.pptx
 
Chapter 2 - Introduction to Data Science.pptx
Chapter 2 - Introduction to Data Science.pptxChapter 2 - Introduction to Data Science.pptx
Chapter 2 - Introduction to Data Science.pptx
 
Lec20.pptx introduction to data bases and information systems
Lec20.pptx introduction to data bases and information systemsLec20.pptx introduction to data bases and information systems
Lec20.pptx introduction to data bases and information systems
 
T-SQL
T-SQLT-SQL
T-SQL
 
What is Data?
What is Data?What is Data?
What is Data?
 
ch2 DS.pptx
ch2 DS.pptxch2 DS.pptx
ch2 DS.pptx
 
System Analysis And Design
System Analysis And DesignSystem Analysis And Design
System Analysis And Design
 
Big data
Big dataBig data
Big data
 
Database Systems - Lecture Week 1
Database Systems - Lecture Week 1Database Systems - Lecture Week 1
Database Systems - Lecture Week 1
 
Dw 07032018-dr pl pradhan
Dw 07032018-dr pl pradhanDw 07032018-dr pl pradhan
Dw 07032018-dr pl pradhan
 
Ch~2.pdf
Ch~2.pdfCh~2.pdf
Ch~2.pdf
 
Introduction to Big Data Analytics
Introduction to Big Data AnalyticsIntroduction to Big Data Analytics
Introduction to Big Data Analytics
 
Creating Effective Data Visualizations in Excel 2016: Some Basics
Creating Effective Data Visualizations in Excel 2016:  Some BasicsCreating Effective Data Visualizations in Excel 2016:  Some Basics
Creating Effective Data Visualizations in Excel 2016: Some Basics
 
Ch_2.pdf
Ch_2.pdfCh_2.pdf
Ch_2.pdf
 
Foundations of business intelligence databases and information management
Foundations of business intelligence databases and information managementFoundations of business intelligence databases and information management
Foundations of business intelligence databases and information management
 
01-Database Administration and Management.pdf
01-Database Administration and Management.pdf01-Database Administration and Management.pdf
01-Database Administration and Management.pdf
 
Data science.chapter-1,2,3
Data science.chapter-1,2,3Data science.chapter-1,2,3
Data science.chapter-1,2,3
 
Dbms
DbmsDbms
Dbms
 
Chapter5
Chapter5Chapter5
Chapter5
 
Unit1 DBMS Introduction
Unit1 DBMS IntroductionUnit1 DBMS Introduction
Unit1 DBMS Introduction
 

Recently uploaded

Full night 🥵 Call Girls Delhi New Friends Colony {9711199171} Sanya Reddy ✌️o...
Full night 🥵 Call Girls Delhi New Friends Colony {9711199171} Sanya Reddy ✌️o...Full night 🥵 Call Girls Delhi New Friends Colony {9711199171} Sanya Reddy ✌️o...
Full night 🥵 Call Girls Delhi New Friends Colony {9711199171} Sanya Reddy ✌️o...shivangimorya083
 
Brighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingBrighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingNeil Barnes
 
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Jack DiGiovanna
 
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service BhilaiLow Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service BhilaiSuhani Kapoor
 
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...soniya singh
 
Industrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfIndustrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfLars Albertsson
 
Customer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxCustomer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxEmmanuel Dauda
 
100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptxAnupama Kate
 
Data Science Project: Advancements in Fetal Health Classification
Data Science Project: Advancements in Fetal Health ClassificationData Science Project: Advancements in Fetal Health Classification
Data Science Project: Advancements in Fetal Health ClassificationBoston Institute of Analytics
 
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130Suhani Kapoor
 
Call Girls In Mahipalpur O9654467111 Escorts Service
Call Girls In Mahipalpur O9654467111  Escorts ServiceCall Girls In Mahipalpur O9654467111  Escorts Service
Call Girls In Mahipalpur O9654467111 Escorts ServiceSapana Sha
 
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...dajasot375
 
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /WhatsappsBeautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsappssapnasaifi408
 
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Callshivangimorya083
 
Spark3's new memory model/management
Spark3's new memory model/managementSpark3's new memory model/management
Spark3's new memory model/managementakshesh doshi
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfLars Albertsson
 
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfMarket Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfRachmat Ramadhan H
 
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...Suhani Kapoor
 

Recently uploaded (20)

Full night 🥵 Call Girls Delhi New Friends Colony {9711199171} Sanya Reddy ✌️o...
Full night 🥵 Call Girls Delhi New Friends Colony {9711199171} Sanya Reddy ✌️o...Full night 🥵 Call Girls Delhi New Friends Colony {9711199171} Sanya Reddy ✌️o...
Full night 🥵 Call Girls Delhi New Friends Colony {9711199171} Sanya Reddy ✌️o...
 
Brighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingBrighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data Storytelling
 
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
 
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service BhilaiLow Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
 
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
 
Industrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfIndustrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdf
 
Customer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxCustomer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptx
 
100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx
 
Data Science Project: Advancements in Fetal Health Classification
Data Science Project: Advancements in Fetal Health ClassificationData Science Project: Advancements in Fetal Health Classification
Data Science Project: Advancements in Fetal Health Classification
 
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
 
Call Girls In Mahipalpur O9654467111 Escorts Service
Call Girls In Mahipalpur O9654467111  Escorts ServiceCall Girls In Mahipalpur O9654467111  Escorts Service
Call Girls In Mahipalpur O9654467111 Escorts Service
 
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
 
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /WhatsappsBeautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
 
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
Spark3's new memory model/management
Spark3's new memory model/managementSpark3's new memory model/management
Spark3's new memory model/management
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdf
 
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
 
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfMarket Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
 
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
 
E-Commerce Order PredictionShraddha Kamble.pptx
E-Commerce Order PredictionShraddha Kamble.pptxE-Commerce Order PredictionShraddha Kamble.pptx
E-Commerce Order PredictionShraddha Kamble.pptx
 

Digital Types

  • 1. CHAPTER 01: TYPES OF DIGITAL DATA
  • 2. Data • Any data that can be processed by digital computer and stored in the sequences of 0's and 1's (Binary language) is knowns as digital data. • Whenever you send an email, read a social media post, or take pictures with your digital camera, you are working with digital data. • In general, data can be any character, text, numbers, voice messages, SMS, WhatsApp messages, pictures, sound, or video.
  • 3. Data • Byte is the basic unit of information in computer storage and processing, and is composed of eight bits; a kilobyte is 1,000 bytes; one megabyte is 1,000 kilobytes . (GB, TB, PB, EB, ZB, YB) • Digitizing is the process of converting information into digital form and is necessary for a computer to be able to process and store the information.
  • 4. Data • It is an invaluable asset of any enterprise (big or small). • Data is present internal to the enterprise and also exists outside the firewalls of the enterprise. • Data may be in homogeneous or heterogeneous. • Need of the hour is to – Understand, manage, process, – and take the data for analysis – to draw valuable insights.
  • 5. Types of digital data 1. Structured Data : data stored in the form of rows and columns (databases, Excel) 2. Un-structured Data: No pre-defined schema (PPTs, images, Videos, pdfs) 3. Semi-structured Data: Hybrid schema (JSON, HTML, XML, Email, and so on),
  • 6.
  • 7. Distribution of digital data (in %) (by Gartner) 80 10 10 Unstructured Semi-structured Structured
  • 8. Structured Data • Data which is in an organized form (In rows & columns). • Computer programs can use this data easily. • Relationships exists between entities of data. • Example – Data stored in databases – ERP – CRM – DW – Data Cube
  • 9. Structured Data • The data conforms to a pre-defined schema or structure is known as structured data. • The data can be processed, stored, and retrieved in a fixed format. This data can be processed easily by programs. • Conforms to a relational data model. • Structured data is organized in semantic chunks/entities with similar entities grouped together to form relations/tables.
  • 10. structured Data • Descriptions for all entities in a group • Have the same defined format • Have a predefined length • Follow the same order.
  • 12. Sources of Structured Data Structured Data OLTP systems Excel Databases
  • 13. Ease with structured data Ease with structured data Security Insert/Update/ delete Scalability Transaction processing (ACID) Indexing/ Searching
  • 14. Database (RDBMS) • Oracle Corp. – Oracle • IBM – DB2, IBM-Informix • Microsoft – SQL • EMC – Greenplum • Teradata – Teradata • Open source- MySQL, PostgresSQL • Sqlite • Sequel Pro • Amazon Aurora • SAP SQL Anywhere, SAP IQ (Sybase)
  • 15. Semi-structured Data • Data which does not conform to a data model but has some structure. • Computer programs can not use this data easily. • Example – emails – XML – HTML – JSON, and so on.
  • 16. Semi-structured data (SSD) • It is referred to as self describing structure. • It is a form of structured data that does not conform with the formal structure of data models associated with relational databases or other forms of data tables. • It uses metadata and tags to provide semantic information.
  • 17. Characteristics of semi-structured data (SSD) • Does not conform to a data model • Cannot be stored in the form of rows and columns as in a database. • The tags and elements are used to describe data. • Attributes in a group may not be the same. • Similar entities are grouped. • Size of the same attributes in a group may differ • Type of same attributes in group may differ. • Evolving Schema • Schema and data are tightly coupled.
  • 18. Example (Names & Emails) • One way is: Name: Raju Patil Email : rp@test.tcs.in, rp70@gmail.com • Another way is: First Name: Raju Last Name :Patil Email : rajup70@gmail.com
  • 19.
  • 20. Sources of SSD • Email • XML • TCP/IP • Zipped files • Mark-up languages • Integration of data from heterogeneous sources.
  • 21. Example: Email format To: <Name> From: <Name> Subject: <Text> CC: <Name> Body: <Text, Graphics, Images, etc.><Name>
  • 22. ABC Healthcare Blood Test Report Date <> ---- Department <> ----- Patient Name <> Attending Doctor <> Hemoglobin content <> Patient Age <> RBC count <> WBC count <> Platelet count <> Diagnosis <notes> Conclusion <notes>
  • 24. Integration of data from heterogeneous sources User Mediator : Uniform access to multiple data sources OODBMS RDBMS Legacy system Structured file
  • 25.
  • 26. Getting to know Unstructured data • Over the past few days, Dr. Ben and Dr. Stanley had been exchanging long emails about a particular case of gastro-intestinal problem. • Email contains procedure practiced by Dr. Stanley, about combination of drugs that has successfully cured gastro-intestinal disorders in patients. • Dr. Mark has a patient in the “GoodLife” emergency unit with quite similar case of gastro- intestinal disorder.
  • 27. Unstructured Data • Unstructured data refers to the data that lacks any specific form or structure. • This makes it very difficult and time-consuming to process and analyze unstructured data. • Data which does not conform to any data model is USD. • Computer programs can not use this data directly. • About 80-90% data of an organization is in this format. • An enormous amount of knowledge is hidden in this data. • Hence finding useful knowledge/insight from USD is very crucial.
  • 28. Unstructured Data • Unstructured data is a generic label for describing data that is not contained in a database or some other type of data structure. • Unstructured data can be textual or non-textual. • Textual unstructured data is generated in media like email messages, PowerPoint presentations, Word documents, comments in social media, etc. • Non-textual unstructured data is generated in media like images, CCTV footage, audio files and video files. • Anything in a non-database form is unstructured data.
  • 29. Unstructured Data • Two types: 1. Bitmap objects : image, video, or audio files 2. Textual objects : word, emails, ppts and so on.
  • 30. Unstructured Data • Example – Memos, QR code (Quick Response), Blogs – Chat rooms, Tweets, Comments, likes, tags – PPTs, emoji's, emoticons (emotion icons) – Images, log files, social media posts – Videos, sensor data (raw), weather data – Doc files, geospatial data, surveillance data – Body of email , GPS data, sensor data, etc. – WhatsApp messages, CCTV footage and so on.
  • 31. Getting to know Unstructured data
  • 32. Characteristics of Unstructured data • This data cannot be stored in the form of rows and columns as in a database and does not conform to any data model. • It is difficult to determine the meaning of the data. • It does not follow any rule or semantics, i.e. Not in any particular format or sequence. • Not easily usable by a program.
  • 33. Sources of Unstructured data • Web pages • Audio and Videos • Images • Body of an email • Word document • PPT and reports • Chats and text messages • Social media data • White papers • Surveys • SMS • Free form text • Server Log files • Product reviews
  • 34. Web page is unstructured data Web Page Multimedia Image Database Text XML
  • 35. Challenges • Storage space: A lot of space is required to store USD. • Scalability: As the data grows, scalability becomes an issue and the cost of storing USD increases. • Retrieve information: Difficult to retrieve required information from USD • Security: Ensuring security is difficult due to varied sources of data. E.g. emails, web pages, etc. • Indexing & searching: Very difficult and error-prone as the structure of the USD is not clear.
  • 36. Challenges • Interpretation : USD is not easily interpreted by conventional search algorithms. • Classification : Different naming conventions followed across the organization make it difficult to classify data. • Deriving meaning : Computer programs cannot automatically derive meaning or structure from USD. • File formats : Increasing number of file formats makes it difficult to interpret data.
  • 37. Portion of Unstructured data USD SD
  • 38. Dealing with USD 1. Data mining 2. Text mining /Text Analytics 3. NLP 4. Noisy text analytics 5. Manual tagging with meta data 6. Part of speech tagging 7. UIMA 8. Web Scraping Possible Solutions
  • 39. Data Mining • It is the computing process of discovering patterns in large data sets involving methods at the intersection of AI, machine learning & DL, statistics, and database systems. • Popular algorithms: – Association rule mining (MBA) – Regression Analysis (Y=mX+ c) – Collaborative filtering
  • 40. Collaborative filtering • collaborative filtering uses similarities between users and items simultaneously to provide recommendations. • It is a method of making automatic predictions (filtering) about the interests of a user by collecting preferences or taste information from many users (collaborating). • Collaborative filtering works on a fundamental principle: you are likely to like what someone similar to you likes.
  • 41. Collaborative filtering • Collaborative filtering (CF) is a technique commonly used • Collaborative filtering (CF) is a technique used by recommender systems to build personalized recommendations on the Web. • Companies that employ CF model include Amazon, Facebook, Twitter, LinkedIn, Spotify, Google News, Netflix, iTunes.
  • 43. Text analytics or text mining • It is the process of converting unstructured text data into meaningful data for analysis, to measure customer opinions, product reviews, feedback and sentimental analysis to support fact based decision making. • Uses many linguistic, statistical, and machine learning techniques such as clustering, pattern recognition, tagging, association analysis, predictive analytics, etc.
  • 44. Text analytics or text mining • It helps organizations to find potentially valuable business insights in corporate documents, customer emails, call center logs, survey comments, social network posts, medical records and other sources of text-based data. • Text mining capabilities are also being incorporated into AI chatbots/virtual agents that companies deploy to provide automated responses to customers as part of their marketing, sales and customer service operations.
  • 45. Natural Language Processing (NLP) • Natural language processing (NLP) is the ability of a computer program to understand human language as it is spoken. NLP is a component of artificial intelligence (AI). • It is a field of computer science, artificial intelligence and computational linguistics concerned with the interactions between computers and human (natural) languages (HCI domain). • NLP strives to build machines that understand and respond to text or voice data.
  • 47. Noisy text analytics • It is the process of extracting structured or semi- structured information from noisy unstructured text data such as online chat, text messages, emails, message boards, blogs, wikis, etc. • The noisy unstructured data comprises one or more of the followings: – Spelling mistakes, – Acronyms – Non-standard words (HBD, K, GN, GM, VGM, etc.) – Missing punctuations, – Missing letters and so on.
  • 48. Manual tagging with metadata • It is the process of tagging manually with adequate metadata to provide the semantics to understand unstructured data. . Road Accident
  • 49. Part of Speech Tagging • It is also called as POS or POST or grammatical tagging. • It is the process of reading text and tagging each word in the sentence as belonging to a particular part of speech such as “noun”, “verb”, “adjective”, “pronoun”, etc. .
  • 50. Unstructured Information Management Architecture(UIMA) • It is an open source platform from IBM, which integrates different kinds of analysis engines to provide a complete solution for knowledge discovery from USD. • It bridge the gap between structured and USD.
  • 51. Uses of UIMA • Used to convert unstructured data such as repair logs and service notes into relational tables. • These tables can then be used by automated tools to detect maintenance or manufacturing problems.
  • 52. Uses of UIMA • Used in medical contexts to analyze clinical notes, such as the Clinical Text Analysis and Knowledge Extraction System ( Apache CTAKES). • CTAKES is an open-source Natural Language Processing (NLP) system that extracts clinical information from electronic health/medical record free-text (Users are free to type whatever they want in any form).
  • 53. UIMA block diagram Users Acquired from various sources Subjected to semantic analysis Structured information access Query and presentation Structured information Analysis Delivery USD Transformed into
  • 55.
  • 56. Big Data • Big data is a term that describes large, hard- to-manage volumes of data – both structured and unstructured - none of traditional data management tools can store it or process it efficiently. • experts now predict that 74 zettabytes of data will be in existence by 2021.
  • 57. Big Data • Every day, we create 2.5 quintillion(1018) bytes of data —90% of the data in the world today has been created in the last two years alone. • This data comes from everywhere: sensors used to gather climate information, posts to social media sites, digital pictures and videos, purchase transaction records, and cell phone GPS signals, WhatsApp, IOT and so on.
  • 58. Characteristics of Data • Composition: Deals with structure of data, i.e., sources of data, the granularity(Ex. Postal address), the types, nature of data (Static or real- time). • Condition: Deals with the state of data, that is, “Can one use data as it is for analysis?” or “Does it require cleansing for further enhancement and enrichment?”.
  • 59. Characteristics of Data • Context: Deals with – Where, this data has been generated? – Why this data generated? – How sensitive is this data? – What are the events associated with this data? – And so on.
  • 60. Gartner • Is a global research and advisory firm providing insights, advice, and tools for leaders in IT, Finance, HR, Customer Service and Support
  • 61. Big data definition- Gartner • Big data is high-volume, high-velocity, and high- variety information assets that demand cost effective, innovative forms of information processing for enhanced insight and decision making. • Cost effective and innovative forms of information processing: Talks about embracing new techniques and technologies to capture, store, process, persevere, integrate and visualize the big data(3vs).
  • 62. Definition of Big data by Gartner • Enhanced insight and decision making: Talks about deriving deeper, richer, and meaningful insights and then using these insights to make faster and better decisions to gain business value and thus a competitive edge.
  • 64. Challenges with Big Data • Capture • Storage (Solution: Cloud Computing) • Curation ( Management of data + Data retention) • Search • Analysis • Transfer • Visualization • Privacy violations
  • 65. 3 Vs
  • 66. 3 V’s of Big data • The data that is big in Volume, Velocity and Variety is known as big data.
  • 67. Sources of big data • Archives: Archives of scanned documents, customer correspondence records, patient’s health records, student’s admission records, students’ assessment records and so on. • Sensor data: Car sensors, smart electric meters, office buildings, washing m/c, other electronic appliances and so on. • Machine log data: Event logs, application logs, audit logs, server logs, etc.
  • 68. Sources of big data • Public web: Wikipedia, Weather, regulatory, census, etc. • Data storage: File systems, SQL database, NoSQL database (Mongo DB, Cassandra) and so on. • Media: Audio, Video, image, etc. • Docs: CSV, word docs, PDF, PPT, XLS, etc. • Business Apps: ERP, CRM, HR, Google Docs, etc. • Social media: Twitter blogs, Facebook, LinkedIn, YouTube, Instagram, etc. • IOT
  • 69. Other characteristics of big data • Veracity and Validity: Refers to the accuracy (quality) and correctness of the data. • Volatility: Deals with how long the data is valid?, and how long should it be stored?. (OTP, Aadhar No., PW) • Variability: Data flows can be highly inconsistent with periodic peaks. (In total 7V’s of big data)
  • 70. Why Big data More confidence in decision making More Data More Accurate analysis Greater operational efficiency, cost reduction, time reduction, new product development, optimized offerings, etc.
  • 71. Three reasons for leveraging big data 1. Competitive Advantage. 2. Decision making 3. To create new business value out of data.
  • 72. Typical data warehouse Environment
  • 73. Typical Hadoop Environment • It is different from DW environment. • Here data sources are web logs, images, audios, videos, social media, doc files, pdfs, etc.
  • 75.
  • 76.
  • 77.
  • 78. Big data & DW coexistence
  • 79. Big data & DW coexistence