The Data Lakehouse Symposium – February, 2022
Hosted by Bill Inmon and Databricks, February 1-4, 2022
Text in the Data Lakehouse
David Rapien
Partner in ForestRim Technology
Associate Professor – Lindner College of Business
University of Cincinnati
Let’s Look at the Major Historical Changes in Data Collection, Storage, and Usage
• 1980s - The Data Warehouse allowed us to hold a single version of the truth and make enterprise-wide decisions.
• 2010 - The Data Lake allowed us to collect all of our “data” in one place.
• 2020 - The Data Lakehouse marries the two by adding governance and metadata to data going into the Data Lake, so that it can be separately transitioned into a Data Warehouse AND consumed by decision makers and analysts.
Where is your company’s data focus?
• Data collection for “future use”?
• Business decisions?
• Analysis and research?
• We have none!
What types of data does your company collect and store?
• Transactional Data from customer interactions?
• Machine generated data?
• Emails, blogs, customer reviews, medical records, contracts?
• Images, videos, scans, audio files?
What We Will Discuss in Today’s Presentation
Presentation Talking Points
• Types of Data in the Lakehouse
• Textual Data in the Lakehouse
• What is needed to use Textual Data in the Lakehouse
• Forest Rim Knowledge Share
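Pareto’s Law Holds True
All Corporate Data in the Lakehouse: structured, textual, Analog/IoT
• Amount of Data: ~20% or less is Curated Data Lake and Data Warehouse Data; ~80% or more is everything else
• Data Used for Decision Making: ~80% or more is Curated Data Lake and Data Warehouse Data; ~20% or less is everything else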
All Corporate Data in the Lakehouse
~80-90% of business decisions are made on less than 20% of the data.
Is there something wrong here?
All Corporate Data in the Lakehouse: structured
Data Warehouse Data
• Physical models
• Tables
• Aggregated
• Scrubbed
• Additional Metadata
• Additional Data Governance
All Corporate Data in the Lakehouse: textual
• Documents
• Emails
• Contracts
• Medical Records
• Voice of the Customer
• Insurance Claims
• Call Center …
• Other???
All Corporate Data in the Lakehouse: Analog/IoT
• Status Data
• Automation Data
• Location Data
• Clickstreams
• Sensor Data
• Images
• Audio / Video files
[Figure: The relative volumes of structured, textual, and Analog/IoT data]
[Figure: The relative amount of business value to be found in the structured, textual, and Analog/IoT sectors]
How do we currently USE the different types of data (structured, textual, Analog/IoT)?
• Data Warehouse
• Timeline Analysis
• 360° View of the Customer
• Machine Learning / AI
• Manual Analysis?
• NLP?
• Failure?
• Textual ETL!
What data are you missing in your analysis?
Textual data:
• Voice mails
• Dictations
• Transcriptions
• PDFs
• Word documents
• CSVs
• Yelp Reviews
• Parquet Files
• Document scans
• The voice of the customer
• Real estate deeds/sales
• Internet
• Insurance claims
• Warranties
• Emails
• Call center
• Contracts
• Medical records
What is similar about most of this textual data?
• It is a stream of thought
• It is different document by document
• It does not have Primary Keys or Foreign Keys
• It has little format
• It is DIRTY DATA!
Why are the textual and structured worlds different?
Because people do not write or talk the same way that data is organized in the structured world.
• structured: Keys, Attributes, Indexes, Physical models
• textual: “I was looking at the nice colored sweater in the window. I wonder if I could try it on….but I don’t like the sleeve length…”
These worlds are incompatible. In order to address text you need a completely different approach.
Think about it. The issue: the modelling and design techniques that worked in the Structured world do not work in the world of Text.
Consider the types of text that we are storing and NOT USING
Textual data:
• Voice mails
• Dictations
• Transcriptions
• PDFs
• Word documents
• CSVs
• Yelp Reviews
• Parquet Files
• Document scans
• The voice of the customer
• Real estate deeds/sales
• Internet
• Insurance claims
• Warranties
• Emails
• Call center
• Contracts
• Medical records
You need to organize everything and convert each type into a standard text format
• Audio Data (Voice mails, Dictations, Transcriptions). USE: Transcription (Dragon)
• Mixed (PDFs, Document scans). USE: OCR and Converters
• Tabular Data (CSVs, Yelp Reviews, Parquet files, The voice of the customer). USE: Set “Textual” Columns
• General Documents (Word documents, Internet, Emails, Call center). USE: Converters and Formatters
• “Some Format” (Real estate deeds/sales, Insurance claims, Warranties, Contracts, Medical records). USE: Inline Contextualization
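The routing on this slide can be sketched as a lookup from source group to conversion step. This is only a minimal illustration: the file extensions, the `SOURCE_GROUPS` and `CONVERSION_STEPS` tables, and the `conversion_step` helper are hypothetical names that do not come from the presentation, and real routing would likely need more than a file extension (a contract and a general document can both arrive as a PDF).

```python
from pathlib import Path

# Hypothetical mapping from file extension to the source groups named on the slide.
SOURCE_GROUPS = {
    ".wav": "Audio Data",
    ".mp3": "Audio Data",
    ".pdf": "Mixed",
    ".tiff": "Mixed",
    ".csv": "Tabular Data",
    ".parquet": "Tabular Data",
    ".docx": "General Documents",
    ".html": "General Documents",
    ".eml": "General Documents",
}

# Conversion step per group, as listed on the slide.
# "Some Format" documents (contracts, claims, ...) cannot be recognized by
# extension alone, so no extension maps to that group in this sketch.
CONVERSION_STEPS = {
    "Audio Data": "Transcription",
    "Mixed": "OCR and Converters",
    "Tabular Data": 'Set "Textual" Columns',
    "General Documents": "Converters and Formatters",
    "Some Format": "Inline Contextualization",
}


def conversion_step(filename: str) -> str:
    """Return the conversion step for a file, based only on its extension."""
    suffix = Path(filename).suffix.lower()
    group = SOURCE_GROUPS.get(suffix)
    if group is None:
        raise ValueError(f"No source group registered for {filename!r}")
    return CONVERSION_STEPS[group]
```

For example, `conversion_step("claims.parquet")` selects the tabular path, while an unregistered extension raises an error instead of silently skipping the file.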
Convert to a Common Textual Format
(via Transcription, OCR and Converters, Set “Textual” Columns, Converters and Formatters, Inline Contextualization)
Now WHAT do we do with this data?
Once the data is in a Common Textual Format:
• Deidentify – Redact Personal Data
• Apply Context!
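A redaction pass over the common textual format might look like the following minimal sketch. The three regular expressions and the `redact` function are assumptions made for illustration; they cover only simple US-style patterns, and real deidentification needs a far broader rule set (names, addresses, record numbers, and so on).

```python
import re

# Illustrative patterns only; these are assumptions, not a complete PII rule set.
PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "PHONE": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def redact(text: str) -> str:
    """Replace each match with a bracketed label such as [EMAIL]."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text
```

Replacing matches with a label rather than deleting them keeps the sentence readable and records what kind of value was removed.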
If you are going to address text, you MUST have a handle on both text AND context.
It is not sufficient to merely address text.
Text is relatively simple. Context is 90% of the battle.
Furthermore, most of the context that is needed lies OUTSIDE of the text.
You can analyze the text until you are blue in the face and never find the relevant context of the text.
So what is the purpose of all of this?
By properly applying context, you can convert your Unstructured Textual Data into Structured Data!
This allows you to use your Textual Data for Structured Analysis!
What is Meant by “the Context” of Textual Data?
A word has different meanings in different areas
Consider the word “Trust”
• In Friendship – It is the ability to believe in the word and actions of another
• In Finance – It is a legal vehicle used to pass and allocate assets to another
• In Networking – It allows one computer to communicate and share with another
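The “Trust” example can be sketched as a lookup keyed on both the term and the business area. The `MEANINGS` table and the `meaning_in_context` helper are hypothetical; the point is only that the area is supplied from outside the text, for instance from the department that owns the document.

```python
# Hypothetical context table: (term, business area) -> meaning, taken from the slide.
MEANINGS = {
    ("trust", "friendship"): "the ability to believe in the word and actions of another",
    ("trust", "finance"): "a legal vehicle used to pass and allocate assets to another",
    ("trust", "networking"): "a relationship that allows one computer to communicate and share with another",
}


def meaning_in_context(term: str, area: str) -> str:
    """Look up a term's meaning using an area supplied from outside the text."""
    key = (term.lower(), area.lower())
    return MEANINGS.get(key, "unknown: no context registered")
```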
What is Meant by “the Context” of Textual Data?
A word has different meanings in SIMILAR areas
Consider the word “Cervical” in the medical field
It could mean: pertaining to the neck
• cervical vertebra
It could mean: pertaining to the lowest segment of the uterus
• cervical cancer
• cervical hemorrhage
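Within a single field, the neighboring word can sometimes settle the meaning. The sketch below assumes a hypothetical `CERVICAL_PHRASES` table built from the three phrases on the slide; anything not in the table is reported as ambiguous instead of being guessed.

```python
# Hypothetical phrase table: the word after "cervical" decides the body area.
CERVICAL_PHRASES = {
    "vertebra": "neck",
    "cancer": "uterus",
    "hemorrhage": "uterus",
}


def cervical_senses(text: str) -> list[str]:
    """Return the body area for each occurrence of 'cervical' in the text."""
    words = [w.strip(".,;:!?").lower() for w in text.split()]
    senses = []
    for current, following in zip(words, words[1:]):
        if current == "cervical":
            senses.append(CERVICAL_PHRASES.get(following, "ambiguous"))
    return senses
```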
What is Meant by “the Context” of Textual Data?
A word has different meanings in RELATED areas
Consider the word “Dermatome” in the medical field
• It means an area of the skin supplied by a specific nerve root
• It is also a surgical instrument used to cut the skin
What is Meant by “Adding Context” to Textual Data?
It has different meanings in different areas
1. Extraction of key elements and phrases for categorization
2. Aggregation of terms into layered categories
3. Similar to Data Governance with Data Warehouse Data
• Requires subject matter experts
• Requires understanding of what dimensions you want for analysis
• Can be Highly Political between Departments
• It is controlled by BUSINESS, not IT or Data Analysts!
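Steps 1 and 2 above, extracting key terms and aggregating them into layered categories, can be sketched with a small taxonomy. The `TAXONOMY` contents and the `categorize` function are hypothetical; in practice the table would be supplied by business subject matter experts, and the naive substring matching used here would be replaced by proper tokenization.

```python
# Hypothetical two-level taxonomy: term -> (category, parent category).
TAXONOMY = {
    "sweater": ("clothing", "product"),
    "sleeve": ("clothing", "product"),
    "refund": ("billing", "service"),
    "late delivery": ("shipping", "service"),
}


def categorize(text: str) -> list[dict]:
    """Find taxonomy terms in the text and attach both category layers."""
    lowered = text.lower()
    hits = []
    for term, (category, parent) in TAXONOMY.items():
        # Naive substring match, kept simple for illustration.
        if term in lowered:
            hits.append({"term": term, "category": category, "parent": parent})
    return hits
```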
What is the Process of Adding Context to Textual Data?
It matters what analytics you want to perform on your text
1. Data Conversion (Maybe)
2. Data Redaction (Maybe)
3. Data Extraction
• Identification of “Important” phrases or areas (Nexus)
• Running through an Engine to pair the Nexus with the text
4. Data Transformation
• Classification of the matched Nexus phrases
• Adding Metadata: Dates, Sentiment, Sentence Information, Byte Location, Batch #s, Business, Nexus, Customer, …
5. Data Loading
• Data Warehouse, Data Mart, Parquet Files
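The extraction and transformation steps can be sketched as a function that turns one document into structured rows. Everything here is an assumption made for illustration: the `NEXUS` table, the `ContextRow` fields, and the `contextualize` function are hypothetical and are not the engine described in the talk.

```python
import re
from dataclasses import dataclass

# Hypothetical Nexus: important phrases mapped to a classification.
NEXUS = {
    "chest pain": "symptom",
    "aspirin": "medication",
    "follow-up": "care plan",
}


@dataclass
class ContextRow:
    document_id: str
    phrase: str
    classification: str
    byte_offset: int
    sentence_number: int


def contextualize(document_id: str, text: str) -> list[ContextRow]:
    """Extract Nexus phrases, classify them, and attach location metadata."""
    rows = []
    for phrase, classification in NEXUS.items():
        for match in re.finditer(re.escape(phrase), text, flags=re.IGNORECASE):
            before = text[: match.start()]
            # Byte offset of the match in the UTF-8 encoding of the text.
            byte_offset = len(before.encode("utf-8"))
            # Sentence number: count sentence-ending marks before the match.
            sentence_number = len(re.findall(r"[.!?]", before)) + 1
            rows.append(
                ContextRow(document_id, phrase, classification, byte_offset, sentence_number)
            )
    return sorted(rows, key=lambda row: row.byte_offset)
```

Each `ContextRow` is a flat record, so the loading step can write the rows to a warehouse table or a Parquet file with whatever tooling is already in place.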
What Can Be Done with Contextualized Data?
We can do Structured Data Analysis
1. Document Markup
• Visually identifies parts of the document
2. Sentiment Analysis
• Gives feeling and degrees of feeling to parts of the document
3. Inline Contextualization
• Reverse Mail Merge – Pull out a set of terms that have value
4. Document Classification
• Give context to the areas of the document for correlation or basket analysis
What is Document Markup?
1. Data Visualization
• Color coded
• Draws the eyes
2. Used document by document
3. Great for “spot” review
4. Irrelevant and impractical for analyzing Big Data
What is Sentiment Analysis?
1. Assigns Feeling to words
• Color coded
• Draws the eyes
2. Tries to identify and categorize opinions stated in some text
3. Great for Comments
4. A BASIC requirement for Voice of the Customer Analytics
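Assigning feeling to words can be sketched with a small lexicon. This is a deliberately simple, lexicon-based illustration and not the method used in the talk: the `LEXICON` scores, the `NEGATIONS` set, and the `sentence_sentiment` function are all assumptions.

```python
# Hypothetical sentiment lexicon: word -> score between -1 and 1.
LEXICON = {"love": 1.0, "nice": 0.5, "like": 0.5, "slow": -0.5, "hate": -1.0}
# Negation words, with both straight and curly apostrophes.
NEGATIONS = {"not", "never", "don't", "don’t"}


def sentence_sentiment(sentence: str) -> float:
    """Sum lexicon scores, flipping the sign of a word that follows a negation."""
    score = 0.0
    negate = False
    for raw in sentence.lower().split():
        word = raw.strip(".,;:!?\"")
        if word in NEGATIONS:
            negate = True
            continue
        if word in LEXICON:
            score += -LEXICON[word] if negate else LEXICON[word]
        # A negation only affects the word immediately after it.
        negate = False
    return score
```

Applied to the sweater quote from earlier in the deck, "the nice colored sweater" scores positive and "I don't like the sleeve length" scores negative, which is the kind of per-phrase feeling Voice of the Customer analytics needs.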
What is Inline Contextualization?
1. Reverse Mail Merge
2. Pull out a set of terms that have value
• Names
• Contract Dates
• Ratings
3. Useful for Contracts
4. Needed for Redaction
5. Needed for Document Separation
• Medical Visits
• Combined repeat visits
6. Needed for retrieval of grouped data from blocks of text
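A reverse mail merge can be sketched as a template expressed as a regular expression with named groups, so the variable parts of a semi-formatted document come back as named fields. The contract wording, the field names, and the `reverse_mail_merge` function are hypothetical.

```python
import re
from typing import Optional

# Hypothetical contract template expressed as a regex with named groups.
CONTRACT_PATTERN = re.compile(
    r"This agreement is made on (?P<contract_date>\d{4}-\d{2}-\d{2}) "
    r"between (?P<party_a>[A-Z][\w&. ]*?) and (?P<party_b>[A-Z][\w&. ]*?)\."
)


def reverse_mail_merge(text: str) -> Optional[dict]:
    """Pull the merged-in values back out of a templated sentence."""
    match = CONTRACT_PATTERN.search(text)
    return match.groupdict() if match else None
```

The extracted fields can then be stored as columns, or used to locate the values that need redaction.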
What is Document Classification?
1. Give context to the areas of the document
2. Correlation Analysis
3. Basket Analysis
4. Mind Maps
5. Knowledge Graph
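Once each document carries a set of categories, basket analysis starts from counting which categories occur together. The sketch below assumes the classification has already produced a mapping from document id to category set; the `category_pairs` function and that input shape are hypothetical.

```python
from collections import Counter
from itertools import combinations


def category_pairs(documents: dict[str, set[str]]) -> Counter:
    """Count how often two categories appear in the same document."""
    pairs = Counter()
    for categories in documents.values():
        # Sort so each pair is always counted under the same key.
        for pair in combinations(sorted(categories), 2):
            pairs[pair] += 1
    return pairs
```

The resulting pair counts are the raw input for correlation measures, and the same pairs can serve as edges when drawing a mind map or knowledge graph.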
Review
There are many types of data in a Data Lakehouse: structured, textual, and Analog/IoT
[Figure labels: Data Warehouse, Curated Data Lake and Data Warehouse Data, Parquet Files, Textual ETL]
Review
• Sort your textual data documents by type
• Convert your textual data to a common format
• Deidentify data if you are going to store it
• Apply Context to your textual data!
• Using context, you can convert your Unstructured Data into Structured Data!
Review
This conversion allows for Structured Data Analysis
1. Document Markup
2. Sentiment Analysis
3. Inline Contextualization
4. Document Classification
5. Plus many others…
Questions
https://www.forestrimtech.com/
info@forestrimtech.com
References and Sources
• Bill Inmon – Slides and Conversations
• Inmon, B. (2021). Building The Data Lakehouse. Technics Publications LLC.
• https://www.snowflake.com/guides/what-iot
• https://medicalterminologyblog.com/homonyms-medical-language-2/
• Andrea and Amanda Rapien – Format and Additional Clarifying Material

Data Lakehouse Symposium | Day 2

  • 1. The Data Lakehouse Symposium – February, 2022 Hosted by - Bill Inmon and DataBricks – Feb 1- 4, 2022
  • 2. The Data Lakehouse Symposium – February, 2022 Text in the Data Lakehouse David Rapien Partner in ForestRim Technology Associate Professor – Lindner College of Business University of Cincinnati
  • 3. Let’s Look at the Major Historical Changes in Data Collection, Storage, and Usage • 1980’s - The Data Warehouse allowed us to hold a single version of the truth and make enterprise wide decisions. • 2010 - The Data Lake allowed us to collect all of our “data” in one place. • 2020- The Data Lakehouse marries the two by adding governance and metadata to data going into the Data Lake so that it can be separately transitioned into a Data Warehouse AND consumed by decision makers and analysts. Text in the Data Lakehouse
  • 4. Where is your company’s data focus? • Data collection for “future use”? • Business decisions? • Analysis and research? • We have none! Text in the Data Lakehouse
  • 5. What types of data does your company collect and store? • Transactional Data from customer interactions? • Machine generated data? • Emails, blogs, customer reviews, medical records, contracts? • Images, videos, scans, audio files? Text in the Data Lakehouse
  • 6. Presentation Talking Points Types of Data in the Lakehouse Textual Data in the Lakehouse What is needed to use Textual Data in the Lakehouse Forest Rim Knowledge Share Text in the Data Lakehouse What We will Discuss in Today’s Presentation
  • 7. Presentation Talking Points Types of Data in the Lakehouse Textual Data in the Lakehouse What is needed to use Textual Data in the Lakehouse Forest Rim Knowledge Share Text in the Data Lakehouse
  • 8. Text in the Data Lakehouse structured textual Analog/IoT Curated Data Lake and Data Warehouse Data All Corporate Data in the Lakehouse ~ 20-% ~ 80+% Pareto’s Law Holds True ~ 80+% Amount of Data: Data Used for Decision Making: ~ 20-%
  • 9. Text in the Data Lakehouse All Corporate Data in the Lakehouse structured textual Analog/IoT ~ 80-90% of business decisions are made on less than 20% of the data. Is there something wrong here? Curated Data Lake and Data Warehouse Data
  • 10. Text in the Data Lakehouse All Corporate Data in the Lakehouse structured textual Analog/IoT Data Warehouse Data • Physical models • Tables • Aggregated • Scrubbed • Additional Metadata • Additional Data Governance Curated Data Lake and Data Warehouse Data
  • 11. Text in the Data Lakehouse All Corporate Data in the Lakehouse structured textual Analog/IoT • Documents • Emails • Contracts • Medical Records • Voice of the Customer • Insurance Claims • Call Center … • Other??? Curated Data Lake and Data Warehouse Data
  • 12. Text in the Data Lakehouse All Corporate Data in the Lakehouse structured textual Analog/IoT • Status Data • Automation Data • Location Data • Clickstreams • Sensor Data • Images • Audio / Video files Curated Data Lake and Data Warehouse Data
  • 13. Text in the Data Lakehouse structured textual Analog/IoT The relative volumes of data
  • 14. Text in the Data Lakehouse structured textual Analog/IoT The relative amount of business value to be found in the different sectors Business value
  • 15. Text in the Data Lakehouse How do we currently USE different types of data structured textual Analog/IoT Data Warehouse Timeline Analysis 360° View of the Customer Curated Data Lake and Data Warehouse Data Machine Learning / AI Manual Analysis? NLP? Failure? Textual ETL!
  • 16. Presentation Talking Points Types of Data in the Lakehouse Textual Data in the Lakehouse What is needed to use Textual Data in the Lakehouse Forest Rim Knowledge Share Text in the Data Lakehouse
  • 17. Text in the Data Lakehouse What data you are missing in your analysis? textual Voice mails Dictations Transcriptions PDFs Word documents CSVs Yelp Reviews Parquet Files Document scans The voice of the customer Real estate deeds/sales Internet Insurance claims Warranties Emails Call center Contracts Medical records
  • 18. Text in the Data Lakehouse What is similar about most of this textual data? textual It is stream of thought It is different document by document It does not have Primary Keys or Foreign Keys It has little format It is DIRTY DATA!
  • 19. Text in the Data Lakehouse textual structured Why? Because people do not write or talk the same way that is found in the structured world Keys Attributes Indexes Physical models “I was looking at the nice colored sweater in the window. I wonder if I could try it on….but I don’t like the sleeve length…” These worlds are incompatible. In order to address text you need a completely different approach The Issue: The modelling and design techniques that worked in the Structured world do not work in the world of Text. Think About it:
  • 20. Presentation Talking Points Types of Data in the Lakehouse Textual Data in the Lakehouse What is needed to use Textual Data in the Lakehouse Forest Rim Knowledge Share Text in the Data Lakehouse
  • 21. Text in the Data Lakehouse We consider the types of text that we are storing and NOT USING? textual Voice mails Dictations Transcriptions PDFs Word documents CSVs Yelp Reviews Parquet Files Document scans The voice of the customer Real estate deeds/sales Internet Insurance claims Warranties Emails Call center Contracts Medical records
  • 22. Text in the Data Lakehouse You need to organize everything and convert each into a standard text format textual Voice mails Dictations Transcriptions Audio Data PDFs Document scans Mixed CSVs Yelp Reviews Parquet files The voice of the customer Tabular Data Word documents Internet Emails Call center General Documents Real estate deeds/sales Insurance claims Warranties Contracts Medical records “Some Format” Transcription (Dragon) OCR and Converters Set “Textual” Columns Converters and Formatters Inline Contextualization USE:
  • 23. Text in the Data Lakehouse Transcription (Dragon), OCR and Converters, Set “Textual” Columns, Converters and Formatters, Inline Contextualization: Convert to a Common Textual Format. Now WHAT do we do with this data?
  • 24. Text in the Data Lakehouse Transcription (Dragon), OCR and Converters, Set “Textual” Columns, Converters and Formatters, Inline Contextualization: Common Textual Format. Deidentify: Redact Personal Data. Apply Context!
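The deidentification step on the slide above could be sketched as a simple pattern-based redactor. This is only an illustration: the `PATTERNS` table and the `redact` function are hypothetical names, and the three patterns cover far less than a real deidentification pass would need (note that the person's name is left untouched).

```python
import re

# Hypothetical patterns for illustration only; real deidentification needs
# much broader coverage (names, addresses, record numbers, ...).
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "PHONE": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def redact(text: str) -> str:
    """Replace each pattern match with a bracketed label such as [EMAIL]."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text


sample = "Call Pat at 513-555-0100 or write pat@example.com about claim 42."
redacted = redact(sample)
print(redacted)
```

Pattern matching alone will miss most names and free-form identifiers, which is one reason redaction is listed as its own step before context is applied.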
  • 25. Text in the Data Lakehouse If you are going to address text you MUST have a handle on both text AND context. It is not sufficient to merely address text. Text is relatively simple. Context is 90% of the battle. Furthermore, most of the context that is needed lies OUTSIDE of the text. You can analyze the text until you are blue in the face and never find the relevant context of the text.
  • 26. Text in the Data Lakehouse So what is the purpose of all of this? By properly applying context, you can convert your Unstructured Textual Data into Structured Data! This allows you to use your Textual Data for Structured Analysis!
  • 27. Text in the Data Lakehouse What is Meant by “the Context” of Textual Data? It has different meanings in different areas. Consider the word “Trust”. In Friendship: it is the ability to believe in the word and actions of another. In Finance: it is a legal vehicle used to pass and allocate assets to another. In Networking: it allows one computer to communicate and share with another.
  • 28. Text in the Data Lakehouse What is Meant by “the Context” of Textual Data? It has different meanings in SIMILAR areas. Consider the word “Cervical” in the medical field. It could mean: pertaining to the neck • cervical vertebra. It could mean: pertaining to the lowest segment of the uterus • cervical cancer • cervical hemorrhage.
  • 29. Text in the Data Lakehouse What is Meant by “the Context” of Textual Data? It has different meanings in Related areas. Consider the word “Dermatome” in the medical field. It means an area of the skin supplied by a specific nerve root. It is also a surgical instrument used to cut the skin.
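The three context slides above can be illustrated with a small lookup in which the meaning of a term depends on a domain supplied from outside the text. This is a minimal sketch: the meanings are paraphrased from the slides, while the domain labels, the `MEANINGS` table, and the `resolve` function are hypothetical.

```python
# Meanings paraphrased from the slides; the domain labels are hypothetical keys.
MEANINGS = {
    ("trust", "friendship"): "belief in the word and actions of another",
    ("trust", "finance"): "legal vehicle used to pass and allocate assets",
    ("trust", "networking"): "lets one computer communicate and share with another",
    ("cervical", "spine"): "pertaining to the neck",
    ("cervical", "gynecology"): "pertaining to the lowest segment of the uterus",
}

UNKNOWN = "unknown: needs a subject matter expert"


def resolve(term: str, domain: str) -> str:
    """Return the meaning of a term, given context that comes from outside the text."""
    return MEANINGS.get((term.lower(), domain.lower()), UNKNOWN)


print(resolve("Trust", "Finance"))
print(resolve("Cervical", "Spine"))
```

The word alone is never enough to pick a row; the `domain` argument stands in for the external context that the slides say must be supplied by the business.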
  • 30. Text in the Data Lakehouse What is Meant by “Adding Context” to Textual Data? It has different meanings in different areas: 1. Extraction of key elements and phrases for categorization 2. Aggregation of terms into layered categories 3. Similar to Data Governance with Data Warehouse Data: • Requires subject matter experts • Requires understanding of what dimensions you want for analysis • Can be Highly Political between Departments • It is controlled by BUSINESS, not IT or Data Analysts!
  • 31. Text in the Data Lakehouse What is the Process of Adding Context to Textual Data? It matters what analytics you want to perform on your text. 1. Data Conversion (Maybe) 2. Data Redaction (Maybe) 3. Data Extraction: • Identification of “Important” phrases or areas (Nexus) • Running through an Engine to pair the Nexus with the text 4. Data Transformation: • Classification of the matched Nexus phrases • Adding Metadata: Dates, Sentiment, Sentence Information, Byte Location, Batch #s, Business, Nexus, Customer, … 5. Data Loading: • Data Warehouse, Data Mart, Parquet Files
  • 32. Text in the Data Lakehouse What Can Be Done with Contextualized Data? We can do Structured Data Analysis: 1. Document Markup • Visually identifies parts of the document 2. Sentiment Analysis • Gives feeling and degrees of feeling to parts of the document 3. Inline Contextualization • Reverse Mail Merge: pull out a set of terms that have value 4. Document Classification • Gives context to the areas of the document for correlation or basket analysis
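The extraction and transformation steps listed above might look roughly like the following. It is a sketch under stated assumptions, not the engine the presenter describes: the `NEXUS` phrases, their classifications, and the `Row` and `extract` names are all invented for illustration, and only one piece of metadata (byte location) is recorded.

```python
import re
from dataclasses import dataclass

# Hypothetical Nexus: phrase -> classification, normally curated by the business.
NEXUS = {
    "sweater": "Product",
    "try it on": "Purchase Intent",
    "sleeve length": "Product Fit",
}


@dataclass
class Row:
    doc_id: str
    phrase: str
    classification: str
    byte_offset: int  # metadata: where the phrase starts, in UTF-8 bytes


def extract(doc_id: str, text: str) -> list[Row]:
    """Pair each Nexus phrase with the text and emit one structured row per match."""
    rows = []
    for phrase, classification in NEXUS.items():
        for match in re.finditer(re.escape(phrase), text, re.IGNORECASE):
            offset = len(text[: match.start()].encode("utf-8"))
            rows.append(Row(doc_id, phrase, classification, offset))
    return sorted(rows, key=lambda row: row.byte_offset)


text = (
    "I was looking at the nice colored sweater in the window. "
    "I wonder if I could try it on, but I don't like the sleeve length."
)
rows = extract("review-001", text)
for row in rows:
    print(row)
```

Each `Row` is already structured data, so the loading step reduces to writing these rows to whatever target is in use (a warehouse table, a data mart, or Parquet files).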
  • 33. Text in the Data Lakehouse What is Document Markup? 1. Data Visualization • Color coded • Draws the eyes 2. Used document by document 3. Great for “spot” review 4. Irrelevant and impractical for analyzing Big Data
  • 34. Text in the Data Lakehouse What is Sentiment Analysis? 1. Assigns Feeling to words • Color coded • Draws the eyes 2. Tries to identify and categorize opinions stated in some text 3. Great for Comments 4. A BASIC requirement for Voice of the Customer Analytics
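One very basic way to assign feeling to words is a lexicon lookup with simple negation handling. The sketch below assumes a tiny hand-made lexicon; the words, scores, and the `sentiment` function are hypothetical and are not taken from the presentation or from any published word list.

```python
import re

# Hypothetical mini-lexicon; the scores are illustrative only.
LEXICON = {"nice": 1, "like": 1, "great": 2, "love": 2, "bad": -1, "hate": -2, "broken": -2}
NEGATORS = {"not", "don't", "never"}


def sentiment(text: str) -> int:
    """Sum word scores, flipping the sign of a word that directly follows a negator."""
    tokens = re.findall(r"[a-z']+", text.lower())
    score = 0
    for i, token in enumerate(tokens):
        value = LEXICON.get(token, 0)
        if value and i > 0 and tokens[i - 1] in NEGATORS:
            value = -value
        score += value
    return score


print(sentiment("I was looking at the nice colored sweater"))
print(sentiment("I don't like the sleeve length"))
```

The degree of feeling comes from the magnitude of the score; a production system would also need to handle sarcasm, domain vocabulary, and typographic apostrophes, none of which this sketch attempts.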
  • 35. Text in the Data Lakehouse What is Inline Contextualization? 1. Reverse Mail Merge 2. Pull out a set of terms that have value • Names • Contract Dates • Ratings 3. Useful for Contracts 4. Needed for Redaction 5. Needed for Document Separation • Medical Visits • Combined repeat visits 6. Needed for retrieval of grouped data from blocks of text
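The "reverse mail merge" idea can be sketched with a regular expression whose named groups play the role of the merge fields. The contract wording, the `TEMPLATE` pattern, and the field names below are assumptions made up for the example; real documents would need one template per layout.

```python
import re

# Hypothetical contract wording; the named groups act as the mail-merge fields.
TEMPLATE = re.compile(
    r"This agreement between (?P<party_a>.+?) and (?P<party_b>.+?) "
    r"is effective (?P<effective_date>\d{4}-\d{2}-\d{2})\."
)


def reverse_mail_merge(text: str) -> dict[str, str]:
    """Pull the field values back out of text that follows the template."""
    match = TEMPLATE.search(text)
    return match.groupdict() if match else {}


contract = "This agreement between Acme Corp and Jane Doe is effective 2022-02-01."
fields = reverse_mail_merge(contract)
print(fields)
```

The same extracted fields (names, dates) are what a redaction pass would need to locate, which is why the slide lists inline contextualization as a prerequisite for redaction.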
  • 36. Text in the Data Lakehouse What is Document Classification? 1. Give context to the areas of the document 2. Correlation Analysis 3. Basket Analysis 4. Mind Maps 5. Knowledge Graph
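Once each document carries a set of classifications, a simple form of basket analysis is to count how often two categories appear in the same document. This is a minimal sketch with made-up category names; `co_occurrence` is a hypothetical helper, and real basket analysis would usually add support and confidence measures.

```python
from collections import Counter
from itertools import combinations


def co_occurrence(docs: list[set[str]]) -> Counter:
    """Count how many documents contain each unordered pair of categories."""
    pairs = Counter()
    for categories in docs:
        for pair in combinations(sorted(categories), 2):
            pairs[pair] += 1
    return pairs


# Hypothetical classifications for three documents.
docs = [
    {"Product", "Product Fit"},
    {"Product", "Price"},
    {"Product", "Product Fit", "Price"},
]
pairs = co_occurrence(docs)
for pair, count in pairs.most_common():
    print(pair, count)
```

The pair counts are the raw material for correlation analysis, and the same pairs can serve as edges when building a mind map or knowledge graph.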
  • 37. Text in the Data Lakehouse Review: There are many types of data in a Data Lakehouse. [Diagram: structured, textual, and Analog/IoT data; Textual ETL; Data Warehouse; Parquet Files; Curated Data Lake and Data Warehouse Data]
  • 38. Text in the Data Lakehouse Review: Sort your textual data documents by type. Convert your textual data to a common format. Deidentify data if you are going to store it. Apply Context to your textual data! Using context, you can convert your Unstructured Data into Structured Data!
  • 39. Text in the Data Lakehouse Review: This conversion allows for Structured Data Analysis: 1. Document Markup 2. Sentiment Analysis 3. Inline Contextualization 4. Document Classification 5. Plus many others…
  • 40. Text in the Data Lakehouse Questions
  • 42. Text in the Data Lakehouse References and Sources • Bill Inmon: Slides and Conversations • Inmon, B. (2021). Building The Data Lakehouse. Technics Publications LLC. • https://www.snowflake.com/guides/what-iot • https://medicalterminologyblog.com/homonyms-medical-language-2/ • Andrea and Amanda Rapien: Format and Additional Clarifying Material