The Relational File System
Last week I went down to Washington, D.C., where I was invited to speak to a semi-secret society of pharmaceutical researchers. They are trying to understand and prepare for an onslaught of new data that is taxing already overtaxed systems (note the geographic pun), with an ever-increasing requirement to mine value out of all data faster than ever before.
While there are plenty of Pharma-specific requirements, the high-level issue is the same as it is in the media business or in the wild, wacky world of Web 2.0: how to deal with massive, unknown volumes of fixed digital content.
There are really two issues. The first is how to cope with endlessly growing data in an already wounded IT world, and the second is (gasp!) how to actually derive value from the data we have at some point in the future.
The first issue has been talked about a lot – stop doing things the same way as always, reevaluate what you want to happen and when, and start managing information based on the new world of fixed content as opposed to treating everything like it just came out of a transactional system. The second issue is the really tough one.
Most data created will never be accessed after the first 30 days of its life. Why? It is not that the data can't tell us new things in the future. It's that getting at it in any reasonable time frame, in any reasonable format, and being able to manipulate it to answer questions we had never even considered in the past is at best impractical and at worst impossible.
The answers are found when you recognize the realities of both issues, and combine those realities into a common answer.
- Stop thinking about data "life" in terms of structured, unstructured, or semi-structured – that is only the state in which it was created. Those are "birth" terms.
- Changing data is "dynamic", persistent data is not. All dynamic data eventually becomes persistent. Whether you keep it or throw it away is a different discussion.
- Recognize that there is a distinct difference between dynamic and persistent data – dynamic data changes, persistent doesn't. By default that would imply that each would have a different set of criteria for things such as performance, protection, access, etc.
- Set policies that are based on the type of data and whether it is dynamic or persistent. They should not be the same in most cases.
- Build an infrastructure capable of supporting all types of dynamic and persistent data – that means you'll have multiple tiers of storage, network, and server capabilities and capacities – and plan on being able to move data up and down that infrastructure fluidly as access requirements change.
- Unify your ability to find relevant pieces of data no matter what state it is in or where it physically resides. Having one access portal to all data, no matter what type or structure, solves half of the logistical problems we face.
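To make the policy bullets a little more concrete, here is a minimal sketch of what "different policies for dynamic and persistent data" might look like in Python. Everything in it is an assumption for illustration: the tier names, the protection schemes, the data types, and especially the 30-day threshold for calling something persistent are made up, not taken from any product or standard.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Policy:
    tier: str          # where the data lives
    protection: str    # how it is protected
    max_latency: str   # what "fast enough" means for access


# Hypothetical policy table keyed by (data type, state). All values are illustrative.
POLICIES = {
    ("event_log", "dynamic"): Policy("tier-1 disk", "continuous replication", "milliseconds"),
    ("event_log", "persistent"): Policy("capacity NAS", "single write-once copy", "seconds"),
    ("image", "dynamic"): Policy("tier-1 disk", "nightly backup", "milliseconds"),
    ("image", "persistent"): Policy("capacity NAS", "single write-once copy", "seconds"),
}


def classify(days_since_last_change: int, freeze_after_days: int = 30) -> str:
    """Treat data as persistent once it has gone unchanged for the threshold (assumed)."""
    return "persistent" if days_since_last_change >= freeze_after_days else "dynamic"


def policy_for(data_type: str, days_since_last_change: int) -> Policy:
    """Look up the policy for a data type in its current state."""
    state = classify(days_since_last_change)
    return POLICIES[(data_type, state)]
```

The point is only that the lookup key has two parts. The same log file gets one policy while it is still changing and a different, cheaper one once it stops.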
So, that's easy enough to comprehend, but how does one "manage" the data itself? The key to the second problem – being able to find and mine relevant information – is in the application of intelligence to the data itself. We talk about this in terms like indexing or categorizing but what we really mean is we wrap data with meta-data, which we can then search against in the future. The richer the meta-data, the richer the query potential, and the richer the opportunity to find new value from an old asset. That's where the relational file system concept comes into play.
A relational database is superb for being able to find things. It is highly structured and as such relatively simple to organize. So why not put everything into a relational database? Because those databases were not designed to contain ridiculously large data sets. They are also complicated animals, and require intense specialist knowledge to keep them going. They are also enormously expensive. File systems, on the other hand, can house huge amounts of data without issue, can be managed by systems administrators without the need for DBA specialists, and tend to cost next to nothing. The problem with file systems is being able to find those individual nuggets of value inside of them. Crawling an entire petabyte file system to find something is no fun. Doing a query against a petabyte database is less fun, and more expensive.
What we seem to need is a hybrid of both – a fixed-content database, if you will. We need to be able to do structured queries on data sitting in a file system – that may even have been born as transactional!
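Here is a rough sketch of the idea, assuming nothing fancier than Python's standard library. The files stay where they are in the file system, and a small SQLite catalog holds metadata about them so that a structured query replaces a crawl. The table name, the columns, and the helper names `build_catalog` and `find` are all hypothetical, and a real system would capture far richer metadata than suffix, size, and modification time.

```python
import sqlite3
from pathlib import Path


def build_catalog(root: Path, db_path: str = ":memory:") -> sqlite3.Connection:
    """Walk a directory tree once and record basic metadata for every file."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS catalog ("
        "path TEXT PRIMARY KEY, suffix TEXT, size_bytes INTEGER, modified REAL)"
    )
    for p in root.rglob("*"):
        if p.is_file():
            st = p.stat()
            conn.execute(
                "INSERT OR REPLACE INTO catalog VALUES (?, ?, ?, ?)",
                (str(p), p.suffix.lower(), st.st_size, st.st_mtime),
            )
    conn.commit()
    return conn


def find(conn: sqlite3.Connection, suffix: str, min_size: int = 0) -> list:
    """Answer a structured question from the catalog without touching the files."""
    rows = conn.execute(
        "SELECT path FROM catalog WHERE suffix = ? AND size_bytes >= ? ORDER BY path",
        (suffix, min_size),
    )
    return [row[0] for row in rows]
```

Calling `find(conn, ".log", min_size=1_000_000)` would then return the paths of every log file of at least a megabyte. The data never moved into the database; only the description of it did.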
I'm not talking about archiving. I'm referring to a living, breathing repository that combines all the benefits of structure with none of the limitations.
Many database records are stuffed with "events" – things born as fixed content. Log files, web analytics, alarms, and many more things are examples of one-time events that often reside in a relational database. Why would you put a data object that is fixed at birth, such as an event, into a relational database? Because you might want to be able to run queries against those events. U.K. company Coppereye lets you put event data directly into their repository and run all the queries your little heart desires against it – but it doesn't sit in a relational database, it sits in a flat file, eliminating the cost and complexity of housing huge volumes of data within the database. When the European Union changed the retention rules for cell phone records last year, every player was forced to triple the amount of customer call data that had to be available online to customers. Call data is born fixed – it is an event. If I told you that you have to triple the size of your already large database in 90 days, would you be happy? That was the only realistic way to solve the problem for these folks until Coppereye showed up and let them stuff ALL of that event data into their system – which sits in front of anybody's NAS.
ESG Research Director John McKnight tells me 40% of large companies average 1TB per month of security log data! That's another perfect example of event data that shouldn't ever be in a database. By sticking it in a hybrid repository, you can keep it forever, and when you have an epiphany next year and want to query that data set, it is no problem! Wouldn't that solve the Bio-IT issues as well? They do this stuff referred to as High Content – where they might take a zillion images or videos of a set of tests or reactions of different compounds – generating tons of data – and then it has to sit somewhere in case a researcher wants to query against it. What about in 8 years, when you find out that some test you just ran shows similarities to another test from the past, and you can instantly call it up and in no time realize that Viagra also cures hair loss?
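I have no inside knowledge of how Coppereye builds its product, so the following is not their design. It is just a toy sketch of the general pattern, assuming Python's standard library: events are appended to a flat file and never rewritten, while a small index records only where each event lives. The class name `EventStore`, the JSON-lines format, and the index columns are all my own assumptions.

```python
import json
import os
import sqlite3


class EventStore:
    """Append-only flat file of events plus a small index of byte offsets."""

    def __init__(self, data_path: str, index_path: str = ":memory:"):
        self.data_path = data_path
        self.index = sqlite3.connect(index_path)
        self.index.execute(
            "CREATE TABLE IF NOT EXISTS idx (ts INTEGER, kind TEXT, byte_offset INTEGER)"
        )
        self.index.execute("CREATE INDEX IF NOT EXISTS idx_kind_ts ON idx (kind, ts)")

    def append(self, ts: int, kind: str, payload: dict) -> None:
        """Write the event once at the end of the file and index its position."""
        line = (json.dumps({"ts": ts, "kind": kind, "payload": payload}) + "\n").encode("utf-8")
        with open(self.data_path, "ab") as f:
            f.seek(0, os.SEEK_END)
            offset = f.tell()
            f.write(line)
        self.index.execute("INSERT INTO idx VALUES (?, ?, ?)", (ts, kind, offset))
        self.index.commit()

    def query(self, kind: str, ts_from: int, ts_to: int) -> list:
        """Use the index to find matching offsets, then read only those lines."""
        rows = self.index.execute(
            "SELECT byte_offset FROM idx WHERE kind = ? AND ts BETWEEN ? AND ? "
            "ORDER BY ts, byte_offset",
            (kind, ts_from, ts_to),
        ).fetchall()
        if not rows:
            return []
        events = []
        with open(self.data_path, "rb") as f:
            for (offset,) in rows:
                f.seek(offset)
                events.append(json.loads(f.readline()))
        return events
```

The bulky part (the events themselves) sits in a plain file that any NAS can hold, and the part that needs structure (timestamp, kind, location) is small enough to stay cheap. Tripling the retention period grows the flat file, not the database.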
In summary, we need to reconsider our definitions of data as well as the treatment of those data types. Bridging the structured and unstructured worlds of data management is not entirely new – SharePoint does the same thing by adding structure to documents. By eliminating the expense and complexity of housing gigantic data sets while maintaining the ability to find the needle in the haystack, a hybrid data management system seems destined to win.


