Data Mitosis – The Biology of Growth
Many of the problems we face in our attempt to manage a data center are a direct result of data growth. Data growth is constant, which means eventually it destroys everything in its path – sometimes quite literally. Unaddressed data growth will overflow your file system, your disk, your system, your network, your protection plans, your processes, and your life. We try to stay ahead of this never-ending growth by running around buying more of whatever is going to break next.
I think it's time to address the cause and not the symptoms. Data growth is part natural – there is new data generated all the time in our worlds – but most of it is generated via science. Data sprawl, replicas, and copies of copies. Backup copies of copies. A backup of replicas of copies of copies. You don't have a capacity problem, you have a science problem.
There is a process in biology called mitosis. Mitosis is when a cell splits, forming two identical cells. Left unchecked in the right environment, those cells will split again, creating four identical cells, and so on. Soon, the Petri dish that stores a microscopic quantity of stuff is overflowing all over the table. If a scientist acted like an IT guy, he would address this issue by pouring (migrating) the contents of the Petri dish into a bigger container before (preferably) it overflowed. Then he would do it again and again.
Originally, the science made sense – the scientists needed a bunch of exact replicas of a single cell in order to perform different tests or experiments on them. In IT the same holds true – we need a bunch of replicas of data in order to run different applications against them. The scientist uses replicas to run various experiments to see what results occur. The IT department uses replicas to run tests, populate data warehouses, create backup copies, create disaster recovery copies, send copies to other users, and so on. The difference is, the scientists know up front how many replicas they want and need – so they plan for it. They don't need to migrate to a new Petri dish. And when they are done with their experiments, they get rid of the replicas – they don't keep letting them replicate, lest we end up in a horror movie. In IT we rarely empty the Petri dish. Instead, we keep on creating new copies of copies. IT processes rarely have the pre-planning that exists in the science lab, so data sprawl happens far beyond any useful reason for having all those copies. And that, my friends, causes a huge percentage of our issues. We answer the challenge by buying the next bigger Petri dish from the sales guy.
Data Domain proved empirically that killing the replicate data in the backup process is a very good thing. There are now a thousand "de-dupe" stories to be heard – with one thing as undeniable fact – killing replicate data when it's no longer useful is good. Keeping it around for no reason isn't.
So, if killing replicas at the end of the data lifecycle is good – then killing them sooner would be better. That's the next frontier. Kill the replicas as soon as they are no longer valuable, before they have a chance to cause problems, and you eliminate problems associated with biological replication. Killing, compacting, de-duplicating, eliminating, or compressing replicate data as close to the point of conception as is feasible will yield the greatest possible downstream benefits. It's only logical.
How will that occur? Two ways: first, eventually you will have to address process and strategy requirements – i.e. actually know how many copies you need for how long and have an actual plan on how to deal with them. Second, you will leverage technology to kill the copies before they can take over, like the cockroaches of IT. Eventually the cockroaches win, and you have to move out.
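To make the first of those two ways concrete, here is a minimal sketch of what "know how many copies you need for how long" could look like if you wrote it down. Everything in it is my own illustration, not a real tool: the `Copy` record, its `purpose` and `expires` fields, and the `purge_expired` helper are all hypothetical names.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Copy:
    """One replica of a data set, created for a stated reason (hypothetical record)."""
    source: str     # the data set this copy was made from
    purpose: str    # why the copy exists: "test", "DR", "backup", ...
    expires: date   # the date after which nobody needs this copy


def purge_expired(copies, today):
    """Split copies into those still needed and those past their expiry date."""
    kept = [c for c in copies if c.expires > today]
    expired = [c for c in copies if c.expires <= today]
    return kept, expired


copies = [
    Copy("orders_db", "test", date(2008, 10, 1)),
    Copy("orders_db", "DR", date(2009, 12, 31)),
    Copy("orders_db", "backup", date(2008, 11, 1)),
]
kept, expired = purge_expired(copies, today=date(2008, 11, 21))
print(len(kept), "kept;", len(expired), "ready to be emptied out of the Petri dish")
```

The point of the sketch is only that each copy carries its reason and its end date from the moment it is created, which is the scientist's pre-planning applied to IT.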
If de-dupe in the backup target market has created well over $2B in value (and growing), imagine what value will be generated by moving the function closer to the point of creation – for all the different data types we generate. We'd be green (not having something is as green as it gets), rich (we wouldn't need to buy anything new for a while), calm (fewer things to manage equals fewer things to break), and might actually be able to take 8 minutes and think how we can add strategic value to our organization, as opposed to running around in a hazmat suit all day dumping out Petri dishes.
So when do we start applying this technology wonder (it's not really, it's still a "duh, I'm a bonehead" process issue as much as anything) up the food chain? If it's good in backup it should be great in primary. But there are different types of data created in primary: records, files, objects, blobs, etc. Data lives in primary infrastructure but goes through different stages – so where and when does it make the most sense to do this? Hey, stop asking questions, just think.
- All data is born dynamic or transactional – Word, PowerPoint, trading data, arbitrage, video, MP3s, etc. Everything is dynamic for some time. If it matters most at this stage, it tends to have the highest degree of protection – and biggest impact if we lose it – whether on my laptop doing this blog or in the middle of a massive transaction system processing credit card transactions. This is where we normally make our first replica – we probably mirror here.
- According to the Universal Data Lifecycle, which I am perpetuating because it's correct, simple, and obvious (all things I like after all), all data becomes "fixed" or "persistent" after some time. Duh. Not at the same time, but eventually. It's subjective not objective (not really, but it makes people feel better if I say that). At some point data STOPS CHANGING and simply "is". The second stage of the UDS (crafty, eh?) is what we term "Persistent Active Data" – and if you can't figure that out it means data that NO LONGER CHANGES – BUT IS STILL VERY ACTIVE. That does not mean that the access to that data is automatically less important – usually it's more important at this stage. This is where we tend to make the most primary copies of data. We replicate for DR. We make backup copies and snapshots. We replicate to test/development systems. We email copies to our suppliers and partners and our cousin Chuck. Then we back up the copies of the copies and make more copies. Don't get me wrong – we NEED to make these copies many times – as long as disparate systems/applications require them, we need to provide them. We probably don't need to keep backing all 87 copies up, but whatever, to coin the only phrase my 16-year-old daughter ever utters to me.
- The third stage of life is when data enters the "Persistent Inactive" state. I'm guessing you can figure out what that means – yes, NON-CHANGING DATA THAT NOW IS RARELY ACCESSED. This is where 90% of all commercial data sits in its lifecycle, fyi, and thus is where 90% of the capital and operational gains can be made – again, from both process and technology. Why would ANYONE back up this data? There is no need – it never changes, and you've already backed up copies of copies of it. Same with DR. At this stage, you want to be thinking about treating this data much differently than in previous stages. It needs to be on really cheap, write once, read seldom if at all, power and cooling efficient gear that preferably a monkey can manage. This is the stage where we next want to apply a massive reduction in the copies of data we have. It's still "primary" storage, but by applying de-dupe here we can probably chop 50% or more of our overall capacity off at the knees. Couple that with some common sense backup/DR policy changes and wow – you might get a free weekend or two. P.S. – there is nothing greener than nothing, if that makes sense.
- The fourth stage is the "who cares, I'm quitting if we ever actually need to go to this stage to recover" stage. It's the offsite deep archive or "doomsday" play. You have to do it, but you don't have to do it with 9756 copies of the same non-changing data, do you? 3 or 4 copies seems ok to me.
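For the first three stages above, a rough sketch of how you might sort data by stage looks like this. It is an illustration under loud assumptions: the 30-day and 90-day windows are numbers I made up (the lifecycle itself doesn't prescribe any), and `DataItem` and `lifecycle_stage` are hypothetical names. Stage four isn't in the code because going to the offsite deep archive is a decision about where data lives, not something you can read off a timestamp.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Hypothetical thresholds: pick your own, these are placeholders.
CHANGE_WINDOW = timedelta(days=30)   # modified within this window => still dynamic
ACCESS_WINDOW = timedelta(days=90)   # accessed within this window => still active


@dataclass
class DataItem:
    name: str
    last_modified: datetime
    last_accessed: datetime


def lifecycle_stage(item, now):
    """Return the lifecycle stage implied by an item's timestamps."""
    if now - item.last_modified <= CHANGE_WINDOW:
        return "dynamic"              # stage 1: still changing
    if now - item.last_accessed <= ACCESS_WINDOW:
        return "persistent active"    # stage 2: fixed, but still being read
    return "persistent inactive"      # stage 3: fixed and rarely touched


now = datetime(2008, 11, 21)
report = DataItem("q3_report.ppt", now - timedelta(days=200), now - timedelta(days=10))
print(report.name, "is", lifecycle_stage(report, now))
```

Anything that lands in "persistent inactive" is the candidate for cheap gear, no more backups, and the de-dupe lever.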
So the next step is to figure out how to slide the de-dupe lever closer to the point of creation, and the biggest value point is going to be at Stage 3. Eventually, it will go right up to the actual creation point itself but for that to happen we're gonna need data virtualization, and that's a different topic. We also have to recognize that crushing backup data (which is brilliant by the way) means de-duping files, but in primary capacity we don't just have files. We need to de-dupe blocks, and records, and objects and so on. Doing it all at backup is cool because we can take all data types and amalgamate them into files and deal with them, but we're gonna have to get smarter when we move upstream. There are only a small handful of people talking about squashing the database, for example. Talk about a big money play potential – the ROI of data squishing on the most expensive, most complex, most visible transaction systems will be huge. Backup is a pain in the arse for sure, but if de-dupe in the backup process has created a few billion dollars of value, imagine what it could do in the transactional world. Video and multi-media will also be huge because of the sheer volume it will consume. Object based stuff was born to hash, but it's still not a mainstream play outside of compliance.
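If "born to hash" sounds abstract, here is a minimal sketch of the idea at the block level. To be clear about what is assumed: fixed-size blocks and SHA-256 are my choices for the illustration, the `DedupeStore` class is hypothetical, and this is not a description of how any vendor's product actually does it.

```python
import hashlib


class DedupeStore:
    """Toy content-addressed store: identical blocks are kept only once."""

    def __init__(self, block_size=4096):
        self.block_size = block_size
        self.blocks = {}    # digest -> block bytes (each unique block stored once)
        self.objects = {}   # object name -> ordered list of block digests

    def put(self, name, data):
        """Split data into fixed-size blocks and store only the unseen ones."""
        digests = []
        for start in range(0, len(data), self.block_size):
            block = data[start:start + self.block_size]
            digest = hashlib.sha256(block).hexdigest()
            self.blocks.setdefault(digest, block)  # no-op if block already stored
            digests.append(digest)
        self.objects[name] = digests

    def get(self, name):
        """Rebuild an object from its list of block digests."""
        return b"".join(self.blocks[d] for d in self.objects[name])

    def stored_bytes(self):
        """Bytes actually held after de-duplication."""
        return sum(len(block) for block in self.blocks.values())


store = DedupeStore(block_size=4)
store.put("original", b"AAAABBBBAAAA")
store.put("copy_for_chuck", b"AAAABBBB")
print(store.stored_bytes(), "bytes stored for 20 bytes written")
```

Two objects and 20 bytes went in, but only two unique 4-byte blocks are kept. Records and database rows need a smarter notion of "the same" than fixed blocks, which is exactly why moving upstream is harder than crushing backup files.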
Ciao.



I always enjoy your blog entries. You usually tackle problems with a unique slant on the issue and this time is no different. If only more people would listen to you and your suggestions!
-----Thanks very much. I'm not smart enough to understand any of these issues without an analogy! - Steve.
Posted by: Data Technician | November 21, 2008 at 03:57 PM