It's All About Data Classification and Searching
I don't know if this has been discussed elsewhere, but I felt like I had an epiphany, so there. The way I
see it, in a decade or two the most important data technologies will be data classification
and search.
Consider this: at the moment, archiving and storage tiers are all the rage. The reason is that it is simply
too expensive to buy the fastest disks, and even if you do buy them they're smaller than the
slower-spinning drives.
Imagine if speed and size were not issues. I know that's a big assumption but let's play along for a
second... (let's just say that there are plenty of revolutionary advances in the storage space coming
our way within, say, 10-20 years, that will make this concept not seem that far-fetched).
For more information, visit: http://klassify.in/
Nobody would really care any longer about storage tiers or archiving. Backups would simply consist of
extra copies of everything, to be kept forever if needed, and replicated to multiple locations (this is
already happening, it's just expensive, so it's not common). Indeed, everyone would just let all kinds
of data accumulate, and scrubbing would not be as frequent as it is now. Multiple storage islands
would also be clustered seamlessly so they present a single, coherent space, compounding the
problem further.
Within such a chaotic architecture, the only real problems are data classification and mining, i.e.
figuring out what you have and actually getting at it. Where the data lives is not such an issue: nobody
cares, as long as they can get to it in a timely fashion.
I can tell that OS designers are catching on. Microsoft, of all companies, wanted a next-gen filesystem
(WinFS) for Vista/Longhorn that would really be SQL on top of NTFS, with files stored as BLOBs. It got
delayed, so we didn't get it, but they're saying it should be out in a few years (there were issues with
scalability and speed).
Let's forget about the Microsoft-specific implementation and just think about the concept instead (I'd
use something like a decent database on raw disk and not NTFS, for instance). No more real file
structure as we know it - it's just a huge database occupying the entire drive.
Think of the advantages:
• Far more resilient to failures
• Proper rollbacks in case of problems, and easy rebuilding using redo logs if need be
• Replication via log shipping
• Amazing indexing
• Easy expandability
• The potential for great performance, if done right
• Lots of tuning options (maybe too many for some)
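To make the rollback point concrete, here is a minimal sketch that uses Python's built-in sqlite3 module as a stand-in for the "decent database". The `files` table, the `write_files` helper and the file names are all made up for illustration; this is not how any real filesystem product works, just the idea of files as BLOBs inside transactions.

```python
import sqlite3

# Hypothetical schema: one row per file, with the content stored as a BLOB.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE files (name TEXT PRIMARY KEY, content BLOB NOT NULL)")


def write_files(conn, items):
    """Store several files atomically: either all are written or none."""
    with conn:  # commits on success, rolls back if an exception is raised
        for name, content in items:
            conn.execute(
                "INSERT INTO files (name, content) VALUES (?, ?)",
                (name, content),
            )


write_files(conn, [("proposal.doc", b"first draft")])

try:
    # The second item reuses an existing name, so the whole batch is rolled back,
    # including the otherwise valid notes.txt.
    write_files(conn, [("notes.txt", b"meeting notes"), ("proposal.doc", b"overwrite")])
except sqlite3.IntegrityError:
    pass

stored = [row[0] for row in conn.execute("SELECT name FROM files ORDER BY name")]
original = conn.execute(
    "SELECT content FROM files WHERE name = ?", ("proposal.doc",)
).fetchone()[0]
print(stored, original)
```

After the failed batch, only the first file remains and its content is untouched, which is the "proper rollback" behaviour from the list above.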
With such a technology, you need a lot more metadata for each file so you can present it in different
ways and also search for it efficiently. Let's consider a simple text document - you're trying to sell some
storage, so you write a proposal for a new client. You could have metadata on:
• Author
• Filename
• Client name
• Type of document - proposal
• Project name
• Excerpt
• Salesperson's name
• Solution keywords, such as EMC DMX with McData (sorry, Brocade) switches
• Document revision (possibly automatically generated)
A lot of these fields can already be found in the properties of any MS Word document.
At the very least, the database would index the metadata when the file is created and any time the
metadata changes. Searches would be possible based on any of the fields. Then, a virtual directory
structure could be created:
• Create a virtual directory with all files pertaining to that specific client (the most common way people
would organize it)
• Show all the material for this specific project
• Show all proposals that have to do with this salesperson
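The virtual directories above are really just saved queries over indexed metadata. Here is a rough sketch, again with sqlite3 standing in for the database. The `file_metadata` table, the `virtual_directory` helper, and every client, project and person name are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical metadata table; the columns follow the fields listed earlier.
conn.execute(
    """
    CREATE TABLE file_metadata (
        filename TEXT PRIMARY KEY,
        author TEXT, client TEXT, doc_type TEXT, project TEXT, salesperson TEXT
    )
    """
)
# Index the fields people are most likely to browse by.
for column in ("client", "project", "salesperson"):
    conn.execute(f"CREATE INDEX idx_{column} ON file_metadata ({column})")

rows = [
    ("acme_proposal_v1.doc", "J. Doe", "Acme", "proposal", "SAN refresh", "Pat"),
    ("acme_design.doc", "J. Doe", "Acme", "design", "SAN refresh", "Pat"),
    ("globex_proposal.doc", "A. Smith", "Globex", "proposal", "Backup", "Pat"),
    ("initech_proposal.doc", "A. Smith", "Initech", "proposal", "Archive", "Lee"),
]
conn.executemany("INSERT INTO file_metadata VALUES (?, ?, ?, ?, ?, ?)", rows)


def virtual_directory(conn, **criteria):
    """Return the filenames whose metadata matches every given field."""
    allowed = {"author", "client", "doc_type", "project", "salesperson"}
    unknown = set(criteria) - allowed
    if unknown:
        raise ValueError(f"unknown metadata fields: {sorted(unknown)}")
    # Field names come from the allow-list above; values are bound as parameters.
    where = " AND ".join(f"{field} = ?" for field in criteria) or "1"
    query = f"SELECT filename FROM file_metadata WHERE {where} ORDER BY filename"
    return [row[0] for row in conn.execute(query, tuple(criteria.values()))]


by_client = virtual_directory(conn, client="Acme")
by_salesperson = virtual_directory(conn, doc_type="proposal", salesperson="Pat")
print(by_client)
print(by_salesperson)
```

The same file shows up in as many "directories" as it has matching metadata, which is the whole point: the location is a view, not a property of the file.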
Virtual folders exist now for Mac OS X (they can be created after a Spotlight search), Vista (saved searches)
and even GNOME 2.14, but the underlying engine is simply not as powerful as what I just described.
Normal searches are used, and metadata is not that extensive for most files anyway (MP3 files being
an exception, since metadata creation is almost forced when you rip a CD).
It should be obvious by now that to enable this kind of functionality properly you need really good
ways of classifying and indexing your data, and of actually creating all the metadata that needs to be
there, as automatically as possible. Future software will probably force you to create the metadata in some
way, of course. Existing software that does this classification is fairly poor, in my opinion. Please
correct me if I'm wrong.
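For what it's worth, the crudest possible version of automatic metadata creation looks something like the sketch below. The rules, the keyword list and the `extract_metadata` function are all hypothetical and deliberately naive; anything usable would need far better classification than substring matching.

```python
# Hypothetical rules: cue words that suggest a document type.
DOC_TYPE_RULES = {
    "proposal": ("proposal", "quote", "pricing"),
    "invoice": ("invoice", "amount due"),
}
# Hypothetical list of solution keywords worth tagging.
KNOWN_KEYWORDS = ("EMC DMX", "Brocade", "McData")


def extract_metadata(text):
    """Derive a document type, solution keywords and a short excerpt from text."""
    lowered = text.lower()
    # First rule whose cue words appear in the text wins; otherwise "unknown".
    doc_type = next(
        (
            label
            for label, cues in DOC_TYPE_RULES.items()
            if any(cue in lowered for cue in cues)
        ),
        "unknown",
    )
    keywords = [k for k in KNOWN_KEYWORDS if k.lower() in lowered]
    excerpt = " ".join(text.split()[:8])  # first eight words
    return {"doc_type": doc_type, "keywords": keywords, "excerpt": excerpt}


sample = "Proposal for Acme: two EMC DMX arrays with Brocade switches."
metadata = extract_metadata(sample)
print(metadata)
```

The output of something like this is what would be fed into the metadata index whenever a file is created or changed.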
