http://klassify.in/ - The way I see it, in a decade or two the most important data technologies will be classification and search.
It's all about data classification and searching
I don't know if this has been discussed elsewhere, but I felt like I had an epiphany, so here goes. The way I
see it, in a decade or two the most important data technologies will be classification
and search.
Consider this: At the moment, all the rage is archiving and storage tiers. The reason is that it simply is
too expensive to buy the fastest disks, and even if you do buy them they're smaller than the slower-
spinning drives.
Imagine if speed and size were not issues. I know that's a big assumption but let's play along for a
second... (let's just say that there are plenty of revolutionary advances in the storage space coming
our way within, say, 10-20 years, that will make this concept not seem that far-fetched).
Nobody would really care any longer about storage tiers or archiving. Backups would simply consist of
extra copies of everything, to be kept forever if needed, and replicated to multiple locations (this is
already happening, it's just expensive, so it's not common). Indeed, everyone would just let all kinds
of data accumulate, and scrubbing would not be quite as frequent as it is now. Multiple storage islands
would also be clustered seamlessly so they present a single, coherent space, compounding the
problem further.
Within such a chaotic architecture, the only real problems are data classification and mining, i.e.
figuring out what you have and actually getting at it. Where the data lives is not much of an issue: nobody
cares, as long as they can get to it in a timely fashion.
I can tell that OS designers are catching on. Microsoft, of all companies, wanted a next-gen filesystem
for Vista/Longhorn, that would really be SQL on top of NTFS, with files stored as BLOBs. It got delayed
so we didn't get it, but they're saying it should be out in a few years (there were issues with scalability
and speed).
Let's forget about the Microsoft-specific implementation and just think about the concept instead (I'd
use something like a decent database on raw disk and not NTFS, for instance). No more real file
structure as we know it - it's just a huge database occupying the entire drive.
Think of the advantages:
Far more resilient to failures
Proper rollbacks in case of problems, and easy rebuilding using redo logs if need be
Replication via log shipping
Amazing indexing
Easy expandability
The potential for great performance, if done right
Lots of tuning options (maybe too many for some).
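To make the rollback point a bit more concrete, here is a minimal sketch using SQLite from Python. This is only a stand-in for the idea, not how Microsoft's filesystem or any real product works, and the `files` table and the `store`/`read` helpers are names I made up for illustration.

```python
import sqlite3

# Hypothetical schema: one row per "file", with the contents stored as a BLOB.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE files (id INTEGER PRIMARY KEY, name TEXT UNIQUE, content BLOB)"
)
conn.commit()

def store(name, data):
    """Insert or replace a file's contents inside a single transaction."""
    with conn:  # commits on success, rolls back if an exception escapes
        conn.execute(
            "INSERT OR REPLACE INTO files (name, content) VALUES (?, ?)",
            (name, data),
        )

def read(name):
    """Return the stored bytes for a file, or None if it does not exist."""
    row = conn.execute(
        "SELECT content FROM files WHERE name = ?", (name,)
    ).fetchone()
    return None if row is None else bytes(row[0])

store("proposal.doc", b"version 1")

# Simulate a failure halfway through an update: the change is rolled back.
try:
    with conn:
        conn.execute(
            "UPDATE files SET content = ? WHERE name = ?",
            (b"half-written", "proposal.doc"),
        )
        raise RuntimeError("simulated crash mid-write")
except RuntimeError:
    pass

print(read("proposal.doc"))  # b'version 1'
```

The failed update never becomes visible, which is the kind of behaviour you get for free once the "filesystem" is really a transactional database.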
With such a technology, you need a lot more metadata for each file so you can present it in different
ways and also search for it efficiently. Let's consider a simple text document - you're trying to sell some
storage, so you write a proposal for a new client. You could have metadata on:
Author
Filename
Client name
Type of document - proposal
Project name
Excerpt
Salesperson's name
Solution keywords, such as EMC DMX with McData (sorry, Brocade) switches
Document revision (possibly automatically generated)
Many of these fields can already be found in the properties of any MS Word document.
The database would index the metadata at the very least, when the file is created, and any time the
metadata changes. Searches would be possible based on any of the fields. Then, a virtual directory
structure could be created:
Create a virtual directory with all files pertaining to that specific client (most common way people
would organize it)
Show all the material for this specific project
Show all proposals that have to do with this salesperson
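As a rough illustration of the virtual directories above, they can be thought of as saved queries over an indexed metadata table. The sketch below again uses SQLite from Python; the `metadata` table, its column names, the sample rows and the `virtual_directory` helper are all assumptions of mine, not an existing API.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row

# Hypothetical metadata table; the columns mirror some of the fields listed earlier.
conn.execute("""
    CREATE TABLE metadata (
        filename TEXT PRIMARY KEY,
        author TEXT, client TEXT, doc_type TEXT,
        project TEXT, salesperson TEXT
    )
""")
# Index each searchable field so lookups do not have to scan every row.
for field in ("client", "doc_type", "project", "salesperson"):
    conn.execute(f"CREATE INDEX idx_{field} ON metadata ({field})")

# Made-up sample data, purely for illustration.
rows = [
    ("acme-san-proposal.doc", "alice", "Acme", "proposal", "SAN refresh", "bob"),
    ("acme-san-design.doc", "alice", "Acme", "design", "SAN refresh", "bob"),
    ("globex-backup-proposal.doc", "carol", "Globex", "proposal", "Backup", "bob"),
]
conn.executemany("INSERT INTO metadata VALUES (?, ?, ?, ?, ?, ?)", rows)
conn.commit()

ALLOWED_FIELDS = {"author", "client", "doc_type", "project", "salesperson"}

def virtual_directory(**criteria):
    """Return the filenames whose metadata matches every field=value pair."""
    unknown = set(criteria) - ALLOWED_FIELDS
    if unknown:
        raise ValueError(f"unknown metadata fields: {sorted(unknown)}")
    # Field names are checked against ALLOWED_FIELDS; values are bound as parameters.
    where = " AND ".join(f"{field} = ?" for field in criteria) or "1 = 1"
    sql = f"SELECT filename FROM metadata WHERE {where} ORDER BY filename"
    return [r["filename"] for r in conn.execute(sql, tuple(criteria.values()))]

# "All files pertaining to that specific client"
print(virtual_directory(client="Acme"))
# "All proposals that have to do with this salesperson"
print(virtual_directory(doc_type="proposal", salesperson="bob"))
```

Nothing is moved or copied here: each "directory" is just a different view of the same files, which is why the physical location stops mattering.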
Virtual folders exist now for Mac OS X (can be created after a Spotlight search), Vista (saved searches)
and even GNOME 2.14, but the underlying engine is simply not as powerful as what I just described.
Normal searches are used, and metadata is not that extensive for most files anyway (mp3 files being
an exception since metadata creation is almost forced when you rip a CD).
It should be obvious by now that to enable this kind of functionality properly you need really good
ways of classifying and indexing your data, and of actually creating all the metadata that needs to be there,
as automatically as possible. Future software will probably force you to create the metadata in some
way, of course. Existing software that does this classification is fairly poor, in my opinion. Please
correct me if I'm wrong.
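For what it's worth, here is a deliberately naive sketch of what "as automatically as possible" could mean for a plain-text document. Everything in it is an assumption for illustration: the `extract_metadata` function, the stopword list, the list of known clients and the sample text are all made up, and real classification software would need to be far smarter than word counting.

```python
import re
from collections import Counter

# Tiny, hypothetical stopword list; a real one would be much longer.
STOPWORDS = {"the", "a", "an", "and", "for", "of", "to", "with", "is", "in", "we"}

def extract_metadata(text, known_clients, max_keywords=3):
    """Guess a few metadata fields from raw text using simple heuristics."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    counts = Counter(w for w in words if w not in STOPWORDS and len(w) > 2)
    return {
        # The most frequent non-trivial words serve as crude solution keywords.
        "keywords": [w for w, _ in counts.most_common(max_keywords)],
        # The first known client name that appears anywhere in the text.
        "client": next((c for c in known_clients if c.lower() in words), None),
        # Very crude document typing: look for the word "proposal".
        "doc_type": "proposal" if "proposal" in words else "unknown",
        # The first sentence, truncated, as the excerpt.
        "excerpt": text.strip().split(".")[0][:80],
    }

sample = (
    "Proposal for Acme. We propose a storage array with Brocade switches. "
    "The storage array replicates to a second site."
)
print(extract_metadata(sample, known_clients=["Globex", "Acme"]))
```

Even this toy version shows where the hard part lies: the database side is easy, while deciding what a document is about is not.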