Building the Data Warehouse by Inmon

Chapter 11: Unstructured Data and the Data Warehouse

http://it-slideshares.blogspot.com/
Contents
Overview
Integrating the Two Worlds
A Themed Match
A Two-Tiered Data Warehouse
A Self-Organizing Map (SOM)
Fitting the Two Environments Together
Summary
Overview
Unstructured   data
 ◦ Casual, informal activities such as those found
   on the personal computer and the Internet
 ◦ Ex: Emails, Spreadsheets, Text files,
   Documents, Portable Document Format
   (.PDF) files, Microsoft PowerPoint (.PPT) files
Structured   data
 ◦ Standard DBMSs, reports, indexes, databases,
   fields, records, and the like
Overview (cont’d)
The  primary differences between
 structured data and unstructured data
Integrating the Two Worlds
Text   — The Common Link

                 Plenty of problems arise:
                 • Misspelling
                 • Context
                 • Same name
                 • Nicknames
                 • Diminutives
                 • Incomplete names
                 • Word stems
Integrating the Two Worlds (cont’d)
A   Fundamental Mismatch
 ◦ The unstructured environment represents
   documents and communications.
 ◦ The structured environment represents
   transactions.
Matching   Text across the Environments
 ◦ Remove extraneous stop words
 ◦ Reduction of words back to their stem
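
The two edits above can be sketched in a few lines of Python. This is only an illustration: the stop-word list reuses the examples given in the notes for this deck, and the suffix rules are a hypothetical, deliberately crude stand-in for a real stemming algorithm.

```python
import re

# Stop words taken from the examples in this deck's notes; a real list would be much longer.
STOP_WORDS = {"a", "an", "the", "for", "to", "by", "from", "when", "which"}

# Hypothetical suffix rules, a crude stand-in for a real stemming algorithm.
SUFFIXES = ("ing", "ed", "es", "er", "s")


def stem(word):
    """Strip the first matching suffix, keeping at least three letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def normalize(text):
    """Lower-case the text, drop stop words, and reduce each remaining word to its stem."""
    words = re.findall(r"[a-z]+", text.lower())
    return [stem(w) for w in words if w not in STOP_WORDS]


print(normalize("The mover moved to the house"))  # prints ['mov', 'mov', 'house']
```

The stems produced here are not dictionary words ("mov" rather than "move"). For matching purposes that is acceptable, as long as text from both environments is reduced with the same rules.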
Integrating the Two Worlds (cont’d)
A   Probabilistic Match
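
One way to picture a probabilistic match is as a score over whatever attributes two records have in common. The sketch below is a minimal illustration under stated assumptions: the attribute names, the sample records, and the 0.75 threshold are all hypothetical, and the scoring rule (fraction of shared attributes that agree) is just one simple choice, not the method prescribed by the book.

```python
def match_score(wanted, candidate):
    """Fraction of the attributes present on both sides whose values agree."""
    shared = [key for key in wanted if key in candidate]
    if not shared:
        return 0.0
    agreeing = sum(1 for key in shared if wanted[key] == candidate[key])
    return agreeing / len(shared)


def best_match(wanted, candidates, threshold=0.75):
    """Return the highest-scoring candidate, or None if none reaches the threshold."""
    if not candidates:
        return None
    best = max(candidates, key=lambda c: match_score(wanted, c))
    return best if match_score(wanted, best) >= threshold else None


# Hypothetical data: what is known about the "Bob Smith" being sought.
wanted = {"name": "Bob Smith", "city": "Denver", "birth_year": 1961, "employer": "Acme"}

# Hypothetical structured records that share the same name.
candidates = [
    {"id": 1, "name": "Bob Smith", "city": "Denver", "birth_year": 1961, "employer": "Acme"},
    {"id": 2, "name": "Bob Smith", "city": "Austin", "birth_year": 1975},
]

print(best_match(wanted, candidates)["id"])  # prints 1
```

The name alone matches both records; it is the intersecting data around the name that separates them.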
Integrating the Two Worlds (cont’d)
Matching   All the Information
A Themed Match
Industrially   Recognized Themes
 ◦ The unstructured data is analyzed according
   to the existence of words that relate to
   industrially recognized themes.
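
A minimal sketch of this kind of analysis, assuming each theme is simply a set of words. The two vocabularies below borrow single-word examples from the notes for this deck (accounting and finance); the function name and the sample sentence are hypothetical.

```python
import re
from collections import Counter

# Theme vocabularies: single-word examples borrowed from this deck's notes.
THEMES = {
    "accounting": {"receivable", "payable", "asset", "debit", "account"},
    "finance": {"price", "margin", "discount"},
}


def theme_counts(text):
    """Count how many words of the text fall into each theme's vocabulary."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter()
    for theme, vocabulary in THEMES.items():
        counts[theme] = sum(1 for word in words if word in vocabulary)
    return counts


sample = "The receivable account shows a debit; the discount lowers the price."
print(theme_counts(sample).most_common(1))  # prints [('accounting', 3)]
```

Multi-word phrases such as "cash on hand" would need phrase matching rather than the single-word lookup used here.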
A Themed Match
Naturally   Occurring Themes
                    •   fire—296 occurrences
                    •   fireman—285 occurrences
                    •   hose—277 occurrences
                    •   firetruck—201 occurrences
                    •   alarm—199 occurrences
                    •   smoke—175 occurrences
                    •   heat—128 occurrences


                    •   fire—296 occurrences
                    •   Rock Springs, WY—2
                    •   alabaster—1
                    •   angel—2
                    •   Rio Grande river – 1
                    •   beaver dam—1
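
Rankings like the ones above come from counting words per document. The following is a rough sketch using Python's `collections.Counter`; the stop-word list, the function name, and the sample sentence are assumptions made for illustration.

```python
import re
from collections import Counter

# Assumed stop-word list; see the stop-word examples in this deck's notes.
STOP_WORDS = frozenset({"a", "an", "the", "for", "to", "by", "from", "when", "which"})


def natural_theme(text, top_n=3):
    """Rank the words of one document by number of occurrences."""
    words = [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]
    return Counter(words).most_common(top_n)


sample = "The fire alarm rang when the fireman saw fire and smoke; the fire spread."
print(natural_theme(sample, top_n=1))  # prints [('fire', 3)]
```

The highest-ranked words form the theme of the document; words that occur only once or twice, like the place names in the second list above, fall to the bottom of the ranking.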
A Themed Match
Linkage   through Themes and Themed
 Words
A Themed Match
Linkage through Abstraction and
 Metadata
 ◦ Is another way to link the two environments.
A Two-Tiered Data Warehouse
Two-Tiered    Data Warehouse
 ◦ One tier of the data warehouse is for
   unstructured data and another tier of the data
   warehouse is for structured data.
A Two-Tiered Data Warehouse
Dividing the Unstructured Data Warehouse
 ◦ Unstructured communications
 ◦ Documents and libraries
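
The notes for this deck list what each section can hold: the first n bytes, the document or communication itself (optional), context information, and keyword information. A hypothetical record layout along those lines might look like the sketch below; the class name, field names, and default of 64 bytes are illustrative assumptions, not taken from the book.

```python
from dataclasses import dataclass
from typing import Optional

# The two sections of the unstructured data warehouse named on this slide.
SECTIONS = ("communications", "documents")


@dataclass
class WarehouseEntry:
    section: str                  # "communications" or "documents"
    first_bytes: bytes            # the first n bytes of the item
    context: dict                 # context information
    keywords: list                # keyword information
    body: Optional[bytes] = None  # the item itself, stored only if wanted


def make_entry(section, raw, context, keywords, n=64, keep_body=False):
    """Build an entry, keeping the full body only when keep_body is set."""
    if section not in SECTIONS:
        raise ValueError(f"unknown section: {section}")
    return WarehouseEntry(
        section=section,
        first_bytes=raw[:n],
        context=dict(context),
        keywords=list(keywords),
        body=raw if keep_body else None,
    )


entry = make_entry(
    "documents",
    b"Fire report for the north district",
    {"source": "file share"},
    ["fire", "alarm"],
    n=11,
)
print(entry.first_bytes)  # prints b'Fire report'
```

Making the body optional keeps the warehouse small while still recording enough to recognize the item and find it again.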
A Two-Tiered Data Warehouse
Documents      in the Unstructured Data
 Warehouse
 Several factors determine whether the actual
  document is stored in the data warehouse:
   How many documents are there?
   What is the size of the documents?
   How critical is the information in the document?
   Can the document be easily reached if it is not
    stored in the warehouse?
   Can subsections of the document be captured?
A Two-Tiered Data Warehouse
Visualizing   Unstructured Data
 ◦ Unstructured visualization is the counterpart
   to structured visualization.
 ◦ Structured visualization is known as Business
   Intelligence
 ◦ The essence of structured visualization is the
   display of numbers
A Two-Tiered Data Warehouse
A   Self-Organizing Map (SOM)
 ◦ Produces a display that appears to be a
   topographical map
 ◦ Shows how different words and documents
   are clustered and displayed according to
   themes
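
To make the clustering idea concrete, here is a small pure-Python sketch of SOM training. It is simplified in several ways that are assumptions of this example: the map is one-dimensional rather than a two-dimensional topographic grid, each document is reduced to a hypothetical two-number vector (share of fire-related words, share of finance-related words), and the learning rate and neighborhood width decay linearly.

```python
import math
import random


def best_node(weights, x):
    """Index of the node whose weight vector is closest to x (the best-matching unit)."""
    return min(
        range(len(weights)),
        key=lambda i: sum((w - v) ** 2 for w, v in zip(weights[i], x)),
    )


def train_som(vectors, n_nodes=4, epochs=50, lr0=0.5, sigma0=1.0, seed=0):
    """Train a one-dimensional self-organizing map and return its node weights."""
    rng = random.Random(seed)
    dim = len(vectors[0])
    weights = [[rng.random() for _ in range(dim)] for _ in range(n_nodes)]
    for epoch in range(epochs):
        decay = 1.0 - epoch / epochs          # shrinks from 1 toward 0
        lr = lr0 * decay                      # learning rate
        sigma = max(sigma0 * decay, 0.1)      # neighborhood width
        for x in vectors:
            b = best_node(weights, x)
            for i, w in enumerate(weights):
                # Gaussian neighborhood: nodes near the winner move the most.
                h = math.exp(-((i - b) ** 2) / (2 * sigma ** 2))
                for d in range(dim):
                    w[d] += lr * h * (x[d] - w[d])
    return weights


# Hypothetical documents as [fire-word share, finance-word share].
docs = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
weights = train_som(docs)
for doc in docs:
    print(doc, "-> node", best_node(weights, doc))
```

After training, each document is assigned to its best-matching node, and documents with similar word profiles tend to land on the same or neighboring nodes. A display like the one described on this slide would arrange such nodes on a two-dimensional grid.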
A Themed Match

The   Unstructured Data Warehouse
 ◦ Is divided into two basic organizations—one part
   for documents and another part for
   communications
A Themed Match

Volumes of Data and the Unstructured Data
 Warehouse
 ◦ Volumes of data are an issue
 ◦ Mitigate the volumes of data that can collect in the
   unstructured data warehouse
Fitting the Two Environments Together
Maybe the unstructured environment contains
 data that is incompatible with data from the
 structured environment
However, there are ways that the two
 environments can be related
Fitting the Two Environments
Together
Summary
The world of information technology is really
 divided into two worlds: structured data and
 unstructured data.
The common bond between the two worlds is
 text.
The structured environment and the
 unstructured environment can be matched at:
 ◦ the identifier level
 ◦ the close identifier level using a probabilistic
   match
 ◦ the keyword to metadata or repository level

Lecture 11 Unstructured Data and the Data Warehouse


Editor's Notes

  1. Integrating the two worlds is like matching different formats of electricity, alternating current (AC) and direct current (DC): the unstructured world operates on AC and the structured world operates on DC. Problems in integrating by text:
     • Misspelling: What if two words are found in the two environments, Chernobyl and Chernobile? Should a match be made between them? Do they refer to the same thing or something different?
     • Context: The term “bill” is found in the two worlds. Should they be matched? In one case, the reference is to a bird’s beak and in the other case, the reference is to how much money a person is owed.
     • Same name: The same name, “Bob Smith,” appears in both worlds. Do they refer to the same person? Or do they refer to entirely different people who happen to have matching names?
     • Nicknames: In one world, there appears the name “Bill Inmon.” In another world there appears the name “William Inmon.” Should a match be made? Do they refer to the same person?
     • Diminutives: Is 1245 Sharps Ct the same as 1245 Sharps Court? Is NY, NY, the same as New York, New York?
     • Incomplete names: Is Mrs. Inmon the same as Lynn Inmon?
     • Word stems: Should the word “moving” be connected and matched with the word “moved”?
  2. A stop word is a word that occurs so frequently as to be meaningless to the document. Typical stop words include the following: a, an, the, for, to, by, from, when, which… The second basic edit that must be done is the reduction of words back to their stem. For example, the following words all have the same grammatical stem: moving, moved, moves, mover, removing → “move”
  3. In a probabilistic match, as much data as possible that might identify the “Bob Smith” you are looking for is gathered and used as the basis for a match against similar data found wherever other “Bob Smiths” are located. All the data that intersects is then used to determine whether a match on the name is valid.
  4. In a probabilistic match, as much data as possible that might identify the “Bob Smith” you are looking for is gathered and used as the basis for a match against similar data found wherever other “Bob Smiths” are located. All the data that intersects is then used to determine whether a match on the name is valid.
  5. The accounting theme would contain words and phrases such as the following: receivable, payable, cash on hand, asset, debit, due date, account… The finance theme would contain such information as the following: price, margin, discount, gross sale, net sale, interest rate, carrying loan, balance due There can be many industrially recognized themes for word collections. Some of the word themes might be the following: sales, marketing, finance, human resources, engineering, accounting, distribution…
  6. In an organization by “natural” themes, the unstructured data is collected on a document-by-document basis. Once the data is collected, the words and phrases are ranked by number of occurrences. Then, a theme to the document is formed by ranking the words and phrases inside the document based on the number of occurrences.
  7. Raw match of data: if a word is found anywhere in the structured environment and the word is part of the theme of a document, the unstructured document is linked to the structured record. But such a matching is not very meaningful and may actually be misleading.
  8. In Figure 11-11, data in the unstructured environment includes such people as Bill Jones, Mary Adams, Wayne Folmer, and Susan Young. All of these people exist in records of data that have a data element called “Name.” Put another way, data exists at two levels in the structured environment—the abstract level and the actual occurrence level. Figure 11-12 shows this relationship of data. In Figure 11-12, data exists at an abstract level—the metadata level. In addition, data exists at the occurrence level—where the actual occurrences of data reside.
  9. The data found in the unstructured data warehouse is in many ways similar to the data found in the structured data warehouse. Consider the following when looking at data in the unstructured environment: It exists at a low level of granularity. It has an element of time attached to the data. It is typically organized by subject area or “theme.”
  10. The data that can be stored in each section includes the following: the first n bytes of the document; the document itself (optional); the communication itself (optional); context information; keyword information.
  11. An identifier is an occurrence of data that serves to specifically identify a record. Close identifiers are identifiers where there is a good probability that a solid identification has been made.