Managing Large Data Sets Can Warp Your Project Timeline

Data has gravity. And just like gravity can warp the space-time continuum, data can warp your project timeline.

Woody Evans

Mar 13, 2016

https://a.storyblok.com/f/137721/1024x725/e6bfdc770d/cygx1_ill.jpg


The analogy between data in the information realm and "mass" in the physical realm seems pretty clear. But gravity has other effects, like bending light and dilating time. Just as the gravity around a very massive object (like a star) bends light, the size of your data bends the path of every activity that gets near that "mass" of data: data management activities like backup, copy, restore, and refresh, as well as data mediation activities like ETL, Data Masking, and MDM. And just as clocks seem to run slower near a large mass, your "slow moving" data will eat up more clock time compared to your nimbler competitors.

Data Gravity and the 4 key resources

Storage: A big dataset needs big datafiles.  A big datastore needs big backups. Every copy you make for someone to use needs the same. And there can be a lot of copies and backups.  If the average app has 8 to 10 copies of production data and uses a 4-week backup cycle, you could easily be looking at 8 copies live on disk plus 9 * 4 full backups (production and its 8 copies, each backed up weekly for 4 weeks), plus a bunch of incremental data.  That big dataset really has a data mass that's 44 times its size. That's a lot of Data Gravity, even if you're looking at a very average 5 TB dataset. (About 220 TB!)
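Here's a quick sketch of that arithmetic in Python (the 8 copies, the 4-week backup cycle, and the 5 TB dataset are the assumptions of this example, not measurements):

```python
# Back-of-envelope storage "data mass" for the example above:
# a 5 TB dataset, 8 live copies, and 4 weekly full backups of
# each of the 9 instances (production plus its 8 copies).
dataset_tb = 5
live_copies = 8
instances = live_copies + 1        # production + copies
weekly_fulls = 4                   # 4-week backup cycle

copy_mass = live_copies                    # 8x sitting on disk
backup_mass = instances * weekly_fulls     # 9 * 4 = 36x in full backups

multiplier = copy_mass + backup_mass       # 44x
print(multiplier, "x ->", multiplier * dataset_tb, "TB")   # 44 x -> 220 TB
```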

-

Network: For your dataset alone, you've probably got at least five major traffic patterns: (1) transactions to the dataset itself, (2) backups of the dataset, (3) provisions and refreshes of copies of your dataset, (4) replication of your dataset to another live copy, and (5) migration of a dataset from one location to another (like moving to the cloud).  If the average app changes 5% a day, gets a full backup weekly and incrementals daily, refreshes its downstream environments once a week (not even considering provisions), and replicates production to at least one site, you could easily be looking at moving a data mass 10x the size of your dataset every week – again without even considering provisions, migrations, or backups of non-production data.  (Transactions [5 TB * 5% * 7 days] + Backup [5 TB + 6 days * 5% * 5 TB] + Refresh [5 TB * 8 copies * 1/week] + Replication [5 TB * 5% * 7 days])
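The same weekly traffic tally, as a sketch under those assumptions:

```python
# Weekly network "data mass" for the hypothetical 5 TB dataset:
# 5% daily change, weekly full + daily incremental backups,
# 8 downstream copies refreshed weekly, one replication target.
dataset_tb = 5
daily_change = 0.05
copies = 8

transactions = dataset_tb * daily_change * 7            # 1.75 TB of change
backup = dataset_tb + 6 * daily_change * dataset_tb     # full + 6 incrementals = 6.5 TB
refresh = dataset_tb * copies                           # 40 TB of weekly refreshes
replication = dataset_tb * daily_change * 7             # 1.75 TB of change shipped

total = transactions + backup + refresh + replication
print(total, "TB/week =", total / dataset_tb, "x the dataset")   # 50.0 TB/week = 10.0x
```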

-

Memory: Let's say it's not unreasonable for the memory set aside to service a dataset to be 25% of its size.  For your dataset, you've probably got memory allocated to service each copy of your data (e.g., in Oracle).  There's also a bunch of memory for processes that do all those storage transactions we talked about in Storage.  So, without breaking a sweat, we could say that for the 8 copies we're using 25% * 8, or memory equivalent to a data mass 2x the size of your dataset (and we ignored all the memory for copy, provision, backup, and refresh). (That would be 2x 5 TB, or 10 TB of memory.)
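As a sketch, with the 25% peg as the stated assumption rather than a sizing rule:

```python
# Memory footprint if each of the 8 copies gets a cache sized at
# 25% of the 5 TB dataset (the assumption above, not a best practice).
dataset_tb = 5
copies = 8
memory_fraction = 0.25

total_memory_tb = copies * memory_fraction * dataset_tb
print(total_memory_tb, "TB =", total_memory_tb / dataset_tb, "x the dataset")  # 10.0 TB = 2.0x
```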

-

CPU: It takes a lot of CPU horsepower to handle (1) transactions, (2) backups, (3) provisions/refreshes, (4) replication, and (5) migration.  If we just use the numbers from our network example, and assume that you can sustain about 800 MB/second on 8 cores, that would yield about 50 TB / 800 MB/sec ≈ 65,500 seconds, or ~18 CPU-Hours at full tilt.  Using our previous math, if we estimate the CPU load of the live dataset to be around 0.64 CPU-Hours (5% change * 7 days * 5 TB ÷ 800 MB/sec), we're using CPU equivalent to ~28x the need of our production data mass.
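A sketch of that math in Python; binary units (1 TB = 1,048,576 MB) are my assumption here, and they are what reproduce the ~65,500 second figure:

```python
# CPU time to push the 50 TB weekly mass at a sustained 800 MB/s
# per 8-core host, versus the transaction change alone.
MB_PER_TB = 1024 * 1024
MB_PER_SEC = 800

weekly_mass_tb = 50
transactions_tb = 5 * 0.05 * 7     # 1.75 TB of change per week

total_seconds = weekly_mass_tb * MB_PER_TB / MB_PER_SEC        # ~65,536 s
total_hours = total_seconds / 3600                             # ~18.2 CPU-Hours
txn_hours = transactions_tb * MB_PER_TB / MB_PER_SEC / 3600    # ~0.64 CPU-Hours

# ~18 CPU-Hours versus ~0.64 CPU-Hours, i.e. roughly 28x
print(round(total_seconds), round(total_hours, 1), round(txn_hours, 2))
```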

What would happen if we could change the laws of gravity?

The data virtualization magic of Delphix lets your data look the same in two places at once without actually making two full copies. What does Delphix do?  It combines key technologies that allow block-level sharing, timeflow/shared data history, shared compressed memory cache, and virtual provisioning. Delphix can help you change the laws of data gravity. Let's take a look at how these key Delphix technologies would change your consumption of our 4 key resources:

Storage: Delphix shares all of the blocks that are the same.  So, when you make those 8 copies of production data, you start near 0 bytes because nothing has changed yet. When you make your "full" backup, it often hasn't changed much since the last "full" backup, so it's also much smaller.  And, since your copies also share the same blocks, all of their files and backups get to share this way as well.  In addition, Delphix compresses everything in a way that doesn't hurt your access speed. Further, the more often you refresh, the _smaller_ each copy gets.  To keep the same mass of data as in our example above (assuming compressed data is 35% the size of the original data), Delphix would need on the order of 0.35 * ((1 + 7*0.05) + (8*7*0.05)) = 0.35 * 4.15 = 1.45x the original size. From 44x down to 1.45x. That's at least an order of magnitude no matter how you slice it.
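The same arithmetic as a sketch, under the 35% compression assumption:

```python
# Storage with block sharing and compression, using the assumptions
# above: compression to 35% of original size, 5% daily change,
# a week of retained change history, and 8 virtual copies.
compression = 0.35
daily_change = 0.05
days = 7
copies = 8

shared_image = 1 + days * daily_change      # 1.35x: one image plus a week of change history
copy_deltas = copies * days * daily_change  # 2.80x: changed blocks unique to the copies

multiplier = compression * (shared_image + copy_deltas)
print(round(multiplier, 2), "x of the original size")   # 1.45x, versus 44x before
```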

-

Network: Delphix can't change the mass of your transactions to the primary dataset. But the sharing technology means that you get a full backup for the price of an incremental forever after the first pass.  Since provisions are virtual, the network traffic to create a provisioned copy is almost nil (e.g., if you provision off a validated sync on an Oracle database): traffic is a function of the change since the last baseline, because that's the major part of the data Delphix has to send to get your virtual dataset up and running.  Replication works on similar principles: Delphix only ships change data after the first pass.  Migration is even more powerful.  If you're moving a virtual dataset from one existing Delphix host to another, it's a few clicks more than a simple shutdown and startup.  That's powerful.  To transmit the same mass of data as in our example above (even ignoring compression), Delphix would need on the order of (Transactions [5% * 7 days] + Backup [ZERO (although 7 * 5% occurs, it's already included in transactions!)] + Refresh [8 copies * 5% * 1/week] + Replication [5% * 7 days]) = 1.1x the original size.  From 10x down to 1.1x. That's about an order of magnitude before we consider compression.
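The tally again, as a sketch under the same change-rate assumptions:

```python
# Weekly network traffic with virtual copies, as a multiple of the
# dataset size (compression ignored, as in the example above).
daily_change = 0.05
days = 7
copies = 8

transactions = daily_change * days     # 0.35x: unchanged, Delphix can't shrink these
backup = 0.0                           # the change is already captured with the transactions
refresh = copies * daily_change        # 0.40x: virtual provisions ship only changed blocks
replication = daily_change * days      # 0.35x: change-only after the first pass

total = transactions + backup + refresh + replication
print(round(total, 2), "x per week")   # 1.1x, versus 10x before
```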

-

Memory: Delphix uses a massively shared compressed cache.  (To be clear, Delphix does NOT change the memory on the host that's running your database; but it CAN change how much memory you need to run it.)  Memory makes data transactions fast (fetching from memory can easily be 10x faster than disk).  Today, we size memory for copies in much the same way as for production.  The big assumption there is that there will be a cost to fetch from disk if we don't.  But what if "going to disk" was a lot cheaper a lot more of the time?  The Delphix cache changes those economics.

Our previous example required 10 TB of total memory for 8 copies.  If we assume traffic on one copy is shaped like traffic on the other copies, then we could infer an upper boundary on the unique data in all those caches at around 5 TB * (1 + 8 * 5% for blocks unique to each copy), or about 7 TB of unique data. If, like the previous example, we peg 25% of the 7 TB as our memory need, a combined Delphix cache could service the same shape of traffic with just 25% * 7 TB = 1.75 TB.  Does that mean you can shrink the memory footprint on the actual hosts and still service the traffic in about the same time? That is exactly what several of our large and small customers do. Let's suppose that you can shrink each of the 8 copy databases' memory allocation down to 5% from the original peg of 25%.  Apples to apples, the 1.75 TB of Delphix memory plus the 5% minimum on the 8 copies shrinks the total memory needed to service the same traffic down to 3.75 TB in our example.  From 10 TB down to 3.75 TB; from 2x the size of the dataset down to 0.75x. That's less than half.
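A sketch of that memory math, with the 25% and 5% pegs as stated assumptions:

```python
# Memory with a shared compressed cache, using the same assumptions:
# 5 TB working set shared across copies, 5% unique blocks per copy,
# cache pegged at 25% of unique data, 5% left on each copy's host.
dataset_tb = 5
copies = 8
unique_per_copy = 0.05
cache_fraction = 0.25
host_fraction = 0.05

unique_data_tb = dataset_tb * (1 + copies * unique_per_copy)   # ~7 TB of distinct blocks
shared_cache_tb = cache_fraction * unique_data_tb              # ~1.75 TB Delphix cache
host_memory_tb = copies * host_fraction * dataset_tb           # 2 TB across the 8 hosts

print(round(shared_cache_tb + host_memory_tb, 2), "TB, versus 10 TB before")   # 3.75 TB
```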

-

Of course, for all you solution architects and performance engineers here's the disclaimer:

  • This is an entirely hypothetical exercise with plenty of loopholes, because traffic shapes are notoriously difficult to pin down (and depend on all sorts of variables)

  • There's no way anyone can guarantee the memory reduction.

But, what we CAN guarantee is that customers are making exactly the sorts of changes described above to achieve the kinds of results predicted in this example.

-

CPU: Compression, sharing, and virtual provisioning have a dramatic effect on the CPU cycles we need.  If we just follow the math from our previous examples, the cost of backup is already included in what Delphix does (but we'll allow 1% to be totally safe), and the cost of refresh with our validated sync is almost zero (but we'll allow 1% to be totally safe). That means that to accomplish the same work, the CPU cycles you'd need with Delphix will be around 1.45 CPU-Hours, or less than 8% as much time.
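One way to land on roughly that figure (a sketch; reading the two "1%" allowances as 1% of the original backup and refresh CPU cost is my assumption, not spelled out above):

```python
# CPU-Hours with Delphix at the same sustained 800 MB/s (binary units).
# Transactions and replication still move real change; backup and
# refresh are charged at 1% of their original CPU cost "to be safe".
MB_PER_TB = 1024 * 1024

def cpu_hours(tb):
    """CPU-Hours to move `tb` terabytes at 800 MB/s."""
    return tb * MB_PER_TB / 800 / 3600

transactions = cpu_hours(5 * 0.05 * 7)   # ~0.64
replication = cpu_hours(5 * 0.05 * 7)    # ~0.64
backup = 0.01 * cpu_hours(6.5)           # 1% of the original 6.5 TB backup load
refresh = 0.01 * cpu_hours(5 * 8)        # 1% of the original 40 TB refresh load

total = transactions + replication + backup + refresh
print(round(total, 2), "CPU-Hours, versus ~18 before")   # ~1.44, i.e. under 8%
```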

The stuff of Science Fiction

Delphix gives you power over your data in ways that are simply hard to believe, and would have been impossible just a few years ago. Delphix is data anti-gravity, giving you the power to accomplish the same work with 1/10 the storage.  Delphix is like _faster-than-light travel_, letting you accomplish the same workload over your network with about 1/10 the bits transmitted.  Delphix shrinks the horizon of your information black hole, letting you accomplish the same workload with 1/2 the memory.  And finally, Delphix is like a quantum computer, letting you solve the same problem with 1/10th the cycles.