Data Safety With Nanobox: Storage Capacity

This article is part of the Data Safety With Nanobox series.

Any time you run an application that stores data provided by other people, you want to ensure the safety of that data. This is a concern with multiple aspects. For one, you need to ensure that the data is secure: it can't be easily obtained by unauthorized parties, but is still accessible to authorized users. For another, you need to ensure you can recover the data in case of loss. You also need to ensure you have the space to store all the data you will be entrusted with for the entirety of its expected lifetime, which is generally indefinitely. And so forth.

In this series of articles, I'm going to cover how Nanobox can help you with each of these concerns, one at a time, and how to go about handling them. This article in particular will focus on data storage, and can be used with or without any of the others. The focus will be on best practices, but I will try to include some alternative approaches as well in case the best practice doesn't fit your specific use case.

First, though, for those who aren't already aware, Nanobox is a tool for doing devops tasks so you don't have to. It sets up a development environment unique to your app, and completely isolated from the rest of your system. It lets you specify what that environment should look like (which packages to install, how to configure them, etc.), how to set it up, and how to assemble your code. It does this in a way that lets any other developer simply pull down your code, fire up Nanobox on their own system, and get the exact same setup you have on yours. It also lets you deploy your code to one of several cloud hosting platforms (choose from AWS, DigitalOcean, and Linode, with more coming soon, and even switch to a different provider at any time) using the exact same environment you have in development. You can reconfigure things on their way from development to production, and you usually will, but otherwise everything is exactly the same. In addition, it helps enforce (or at least automatically implement) best practices in every aspect of your app's infrastructure. It's a complete devops tool, which lets you focus on your app, rather than the environments it will run in.

Why Worry About Storage?

Every part of your app will take up some storage space somewhere. Your code takes up some space. The packages your code relies on take up some space. Your database(s) take up some space. Uploaded and generated files take up some space. Backups take up some space. Your server's base OS, the packages installed on the host, various Nanobox components, your app's Docker images, your app's runtime and code builds, and any number of metadata files used to keep these all in sync all take up some space. In short, there are a lot of things that will be making use of storage space in your app, and it's important to ensure you have enough for all of it.

Running out of space can have all kinds of unexpected, and undesired, consequences. Generally, the entire app will stop working properly. In some cases, it might break completely enough that you have to revert to backups to get back up and running, which is an outcome you definitely want to avoid.

How Does Storage Work?

In Nanobox, storage is just files on your server, mounted into place within your app as appropriate. For a uNFS component, some of these files are then exposed to the rest of your app via NFS, and those NFS shares are mounted into place in your code components. For the other data components, the files are maintained by server software running inside the container. Rather than exposing the files directly to other components, they are manipulated by contacting the server software and telling it what data you wish to store, and how you want to be able to reference it in the future; the server software itself then decides what to actually write to the filesystem.

However they're accessed, since all of these files are ultimately stored on the host server's disk, that disk needs to be large enough to contain it all, and to support growth in the data set. That may mean your app needs to be scaled up to continue working properly, but it might also mean you have some freedom to scale down. To determine which scenario you're facing, there are some guidelines you can follow based on what kinds of data you're working with, and how rapidly you expect it to grow.
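As a starting point for working out which scenario you're in, here is a minimal Python sketch that reports how full a disk currently is. It only uses the standard library's shutil.disk_usage; the function names, the "/" path, and the 0.8 default (which mirrors the 80% guideline given later in this article) are my own assumptions for illustration, not anything Nanobox provides.

```python
import shutil


def usage_fraction(path="/"):
    """Return the fraction (0.0 to 1.0) of the disk holding `path` that is in use."""
    usage = shutil.disk_usage(path)  # named tuple of total, used, free (in bytes)
    return usage.used / usage.total


def over_target(path="/", target=0.8):
    """True if the disk holding `path` is fuller than `target` (assumed: 80%)."""
    return usage_fraction(path) > target


if __name__ == "__main__":
    fraction = usage_fraction("/")
    print(f"Disk usage: {fraction:.0%}")
    if over_target("/"):
        print("Above the target; consider scaling up.")
    else:
        print("Below the target; there may be room to scale down.")
```

Run on the host server, this reports only current usage; the guidelines that follow cover estimating where usage is headed.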

General Guidelines

  • Aim to keep usage below 80% of the available space. That is, your capacity target should include an extra 20% for free space. You can find this new target by dividing your final estimate by 0.8.

  • Backups may fill up to the same amount of space as the original data. Multiply your space requirements by the number of local backups you intend to keep plus one (for the original data). That means that if you plan to keep two backups on local storage, you need to multiply your final estimate by three.

  • Some disk space will already be consumed by the host operating system (how much varies by hosting provider: DigitalOcean and AWS are about 1 GiB; Linode is about 800 MiB), the Nanobox platform components (about 650 MiB before deploy, though this value will probably change over time), and the software packages used by all the components in your app (which will vary by app). This is all before your code size actually comes into play.

  • Each component has size requirements of its own. Your code components (webs and workers) will include the software packages installed by the engine, and anything added by extra_packages in the run.config block, plus your code itself. Data instances include everything in the image, plus anything you've added with extra_packages inside that component's data.{component} block, and then whatever data you and/or your app puts there.

  • In addition, each time you deploy, an archive containing your entire codebase, and a second containing the software for your code containers, are stored in your data warehouse component. This is done for two reasons. One, if you need to roll back to an older deploy, having everything cached someplace local to your app makes that process much faster. And two, having every deploy, original or redeploy alike, come from this archive makes the logic considerably easier to implement, and ensures that each deploy happens the exact same way regardless of whether it's an original or not. By default, the most recent 2 archives are retained in your data warehouse, though you can change this number if you like. Whatever that number is, the archives add roughly that number multiplied by your combined code and software size, on top of what your code containers already use.

  • Lastly, but perhaps most importantly, you should revisit your estimates regularly. The amount of data you actually store within your app will determine how much the space requirements will vary, and the speed at which that data enters your app will determine how often that variance could become an issue. Try to stay ahead of your app's future needs, and calculate your space needs based on how much data you expect to have by the next time you check on your estimates.
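To make the first, second, and last guidelines concrete, here is a small worked sketch in Python. The function names and the linear growth model are assumptions of mine for illustration; the 80% usage target and the "backups plus one" multiplier come straight from the list above.

```python
def projected_data_gib(current_gib, growth_gib_per_month, months_until_next_review):
    """Project data size at the next review, assuming simple linear growth."""
    return current_gib + growth_gib_per_month * months_until_next_review


def capacity_target_gib(data_gib, local_backups=2, max_usage=0.8):
    """Space needed for the data plus its local backups, with free-space headroom."""
    # One copy for the original data, plus one per local backup.
    with_backups = data_gib * (local_backups + 1)
    # Dividing by 0.8 leaves the extra 20% free.
    return with_backups / max_usage


# 8 GiB today, growing about 1 GiB a month, next review in 2 months.
projected = projected_data_gib(8, 1, 2)
target = capacity_target_gib(projected, local_backups=2)
print(f"{target:.1f} GiB")  # 37.5 GiB
```

So 10 GiB of projected data with two local backups calls for about 37.5 GiB, before counting the host system, images, packages, and code.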


So, to summarize, you need roughly:

(
  host system size [1.45 - 1.65 GiB] +
  (stored build count [default 2] *
    (code size + packages size)
  ) +
  (total number of code component instances *
    (image size [476 MiB] + packages size + code size)
  ) +
  {for each code component}(number of instances * writable files size) +
  {for each data component}(number of instances *
    (image size [457 - 837 MiB] + extra packages size +
      (data size *
        (local backup count + 1)
      )
    )
  )
) / 0.8

to fit your entire app comfortably. It looks a bit more complex than it really is. If you're looking for something a bit less ... precise, though, let me see if I can oblige.
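Before simplifying, here is the full formula above as a Python sketch, in case you'd rather plug in numbers than count parentheses. All of the class, function, and parameter names are hypothetical ones I've picked for illustration; the default values are the figures quoted in the formula, and everything is expressed in GiB.

```python
from dataclasses import dataclass


@dataclass
class CodeComponent:
    instances: int
    writable_gib: float  # writable files (logs, etc.) per instance


@dataclass
class DataComponent:
    instances: int
    image_gib: float           # roughly 457 - 837 MiB, depending on the component
    extra_packages_gib: float
    data_gib: float
    local_backups: int = 0


def full_estimate_gib(code_gib, packages_gib, code_components, data_components,
                      host_gib=1.65, stored_builds=2,
                      code_image_gib=476 / 1024, max_usage=0.8):
    """Total disk needed, following the formula term by term."""
    code_instances = sum(c.instances for c in code_components)
    total = host_gib
    # Deploy archives kept in the data warehouse.
    total += stored_builds * (code_gib + packages_gib)
    # Every code instance carries the image, the packages, and the code.
    total += code_instances * (code_image_gib + packages_gib + code_gib)
    # Writable files, per code component.
    total += sum(c.instances * c.writable_gib for c in code_components)
    # Data components: image, extra packages, and data with its local backups.
    total += sum(
        d.instances * (d.image_gib + d.extra_packages_gib
                       + d.data_gib * (d.local_backups + 1))
        for d in data_components
    )
    return total / max_usage


# Round numbers chosen to keep the arithmetic easy to follow.
estimate = full_estimate_gib(
    code_gib=0.5, packages_gib=0.5,
    code_components=[CodeComponent(instances=2, writable_gib=0.5)],
    data_components=[DataComponent(instances=1, image_gib=1.0,
                                   extra_packages_gib=0.0, data_gib=1.0,
                                   local_backups=2)],
    host_gib=2.0, code_image_gib=0.5,
)
print(f"{estimate:.1f} GiB")  # 15.0 GiB
```

In that example the terms come to 2 + 2 + 3 + 1 + 4 = 12 GiB, and dividing by 0.8 gives 15 GiB.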

A Simplified Estimate

First, assume your code and the packages your app uses to run it take up about 1 GiB, together (a bit high, but this is a simplified formula, right?). Add the container image, rounded up to 0.5 GiB, and each web and worker instance comes to 1.5 GiB of space. Assuming each code instance also contains a writable log file, and we periodically truncate this file (generally on deploy) so that it won't usually go over 500 MiB, that brings each instance up to 2 GiB.

Now, for the data instances, rounding to 1 GiB for the container itself, and assuming 1 GiB for the data, that's another 2 GiB per instance. If you like, you can make that 6 GiB for uNFS instances, to account for larger files being stored there.

Finally, we account for the data warehouse. Your code and packages are already assumed to be 1 GiB, combined, per archive. Your backups are assumed to also be 1 GiB each, unless you count your uNFS instances separately — those are 5 GiB each. So, adding the assumption that you'll have 2 deploy archives, and 2 backup files per instance, we have everything we need to estimate our space requirements:

(
  (number of code instances * 2 GiB) +
  (number of data instances * 2 GiB) +
  (number of uNFS instances * 6 GiB) +
  2 GiB +
  (number of data instances * 2 GiB) +
  (number of uNFS instances * 10 GiB)
) / 0.8

give or take a few GiB.
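If you'd like that as something you can run, here's a quick Python sketch of the simplified formula. The function and parameter names are mine, and it assumes uNFS instances are counted separately from the other data instances, as in the formula.

```python
def simplified_estimate_gib(code_instances, data_instances, unfs_instances):
    """Rough disk estimate in GiB, using the simplified assumptions above."""
    # Running containers and their data.
    live = (code_instances * 2
            + data_instances * 2
            + unfs_instances * 6)
    # Data warehouse: 2 deploy archives at 1 GiB each, plus 2 backups per instance.
    warehouse = (2
                 + data_instances * 2
                 + unfs_instances * 10)
    # Leave 20% of the disk free.
    return (live + warehouse) / 0.8


# Two code instances, one database, one uNFS component.
print(f"{simplified_estimate_gib(2, 1, 1):.1f} GiB")  # 32.5 GiB
```

For that small app, the running containers account for 12 GiB and the data warehouse for 14 GiB, which works out to 32.5 GiB once the free-space margin is included.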

The Big Picture

Most of this advice is very specific to Nanobox, but it should be pretty easy to extrapolate it to other environments, if you need to do that. The numbers are generally a bit high, and that's desired when estimating how much space you'll need available. If you want, you can tweak some of the formulas above for more exact values (for example, to take into account the compression ratios of your deploy archives and backups), but be sure to adjust your target for consumed space to something lower than 80%, because once your app gets above that amount, it will start to slow down noticeably.

So, how does this fit in with other aspects of data safety? Well, for the most part, you just need to keep your space requirements in mind when deciding your backup/recovery plan. Encryption can add a small amount of overhead, but it's generally on the order of a few MiB at most, so it won't impact your overall size needs too strongly unless you have hundreds of instances.

Be safe out there!

Posted in Nanobox, data safety, capacity

Daniel Hunsaker

Author, Father, Programmer, Nut. Dan contributes to so many projects he sometimes gets them mixed up. He'll happily help you out on the Nanobox Slack server when the staff are offline.

@sendoshin Idaho, USA