Data Safety With Nanobox: Backup and Recovery

Any time you run an application that stores data provided by other people, you want to ensure the safety of that data. This is a concern with multiple aspects. For one, you need to ensure that the data is secure, and can't be easily obtained by unauthorized parties — but is still accessible to authorized users. For another, you need to ensure you can recover the data in case of loss. You need to ensure you have the space to store all the data you will be entrusted with for the entirety of its expected lifetime, which is generally indefinitely. And so forth.

In this series of articles, I'm going to cover how Nanobox can help you with each of these concerns, one at a time, and how to go about handling them. This article in particular will focus on data recovery, and can be used with or without any of the others. The focus will be on best practices, but I will try to include some alternative approaches as well in case the best practice doesn't fit your specific use case.

First, though, for those who aren't already aware, Nanobox is a tool for doing devops tasks so you don't have to. It sets up a development environment unique to your app, and completely isolated from the rest of your system. It lets you specify what that environment should look like (which packages to install, how to configure them, etc), how to set it up, and how to assemble your code, and it lets you do it in a way that any other developer can simply pull down your code and fire up Nanobox on their own system, and get the exact same setup you have on yours. It also lets you deploy your code to one of several cloud hosting platforms — choose from AWS, Digital Ocean, Linode, and more coming soon, and even switch to a different provider at any time — using the exact same environment you have in development (well, you can reconfigure things on their way from development to production, and you usually will, but otherwise everything is exactly the same). In addition, it helps enforce (or at least automatically implement) best practices in every aspect of your app's infrastructure. It's a complete devops tool, which lets you focus on your app, rather than the environments it will run in.

Why Worry About Recovery?

To some, the reasons data recovery is such a big deal are obvious. These individuals have probably encountered a major data loss scenario, or are dealing with applications whose very nature requires data loss prevention and recovery, be it by design or by legislation. There are others who are already invested in the importance of data recovery in their projects, too. But what about the rest, those whose apps don't obviously need a recovery plan — and procedures in place to make recovery as swift as possible?

Think about using your app as a potential customer. Imagine you've been using it for a few months fairly actively. How much data have you committed to this app (and remember, your account information counts as committed data)? How much of it will you want to be able to access in the future? How much would you be okay with losing? And how long do you expect it to still be available after you commit it? If you think you'd be okay losing most — or even all — of it, ask some of your actual potential customers how they feel about these same questions. Try to avoid asking your dev team members — focus on average users in your target audience.

Assuming you and/or your customers aren't comfortable losing even some of the data submitted, you have plenty of justification for backups right there. In almost all cases, at least account data will be important enough to preserve against unexpected failures. If you're all OK with losing everything — including account data — then you might be able to justify skipping a recovery plan.

In all but the rarest cases, though, data loss is something you'll want to prevent.

How Does Recovery Work?

Once you've decided your application needs data recovery, the next step is to establish a recovery plan. This doesn't need to be very detailed or complex, but you should know how often backups should be taken, where they should be stored, and how you plan to go about restoring them in case of failures. More complex recovery plans may specify different backup frequencies for different data types, multiple backup strategies, multiple storage locations, how long to preserve a given backup file before removing it to make room for more recent ones, what preference should be given to which storage locations and restoration strategies when recovering data, who is responsible for each step, who to notify of problems at any point, and so forth.

For this article, though, I'm just going to use a simple plan:

  • Backups of all data sources are made daily, at 03:00, local time
  • Most recent backup is stored locally in the app's data warehouse component, and synced to an Amazon S3 bucket
  • In case of data loss, restore from local backup, or fall back to S3 copy if needed.

I could store the local backups in the same component I'm backing up, but if something goes wrong with that component, the backup will likely be lost as well. Similarly, I sync the backup files to S3 in case the entire app suffers an issue which compromises my data warehouse. I keep the warehouse copy, though, because it's much faster to restore data from there than it would be from S3.

This recovery plan doesn't look like much, but it is more than enough for most small-to-medium sized sites and other apps. More complex sites/apps, and those which handle more sensitive data (such as catalog, order, and payment info), will probably want a more detailed plan, possibly with more redundancy (keeping older backups around in case of data corruption making its way into the actual live data itself; storing duplicates in more locations to ensure recovery remains possible even if something goes wrong in more than one place at once; etc).

Implementing A Recovery Plan With Nanobox

So you've got a plan, and now you're ready to implement it. The biggest question users have at this point is "How?" — that is, how does one go about actually implementing a recovery plan in a Nanobox app? There are a number of things Nanobox does differently from what users have experienced elsewhere, and those differences mean it isn't necessarily obvious how to use existing tools and procedures within the Nanobox ecosystem. Well, that's the bulk of what this article is actually about!

The high-level overview is that you keep using the same tools you already use to back these things up. The trick is knowing where and how to run them. The answer to that may vary somewhat based on which tool you're using, and how it works, but in general, you should be able to run backups either within a worker component (a type of component designed for and dedicated to tasks that operate in the background, without interaction from the outside world), or within the target component itself. Setup is pretty much the same in either location, but only the worker component has access to your code, so anything that needs a custom script to work properly will have to be run from there.
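
For example, if your backup routine needs a custom script from your codebase, a worker component is the natural home for it. Here's a minimal sketch — worker.backup and backup.sh are stand-in names, not anything Nanobox provides, and you may need to adjust the script path to wherever your code ends up inside the container:

worker.backup:
  start: sleep 365d

  cron:
    - id: backup
      schedule: '0 3 * * *'
      command: bash backup.sh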

With that, let's dive down into each of the actual data components available, and see how to implement the recovery plan I outlined above.

Note: Whenever you make changes to an existing data component's configuration, as I'm doing below, you'll need to rebuild that component in your production app. You can either enable your app's Admin → Deploy → Rebuild data components option, or manually rebuild each data component from your dashboard after you deploy.

Additional Note: This article was released shortly after platform components started providing the environment variables the process below relies on. That means you may need to also rebuild your platform components (at the very least, the warehouse) and re-deploy (make sure your data components are rebuilt as well) to get this approach working.

uNFS

data.storage:  
  image: nanobox/unfs

  cron:
    - id: backup
      schedule: '0 3 * * *'
      command: |
        tar cz -C /data/var/db/unfs/ . |
        curl -k -H "X-AUTH-TOKEN: ${WAREHOUSE_DATA_HOARDER_TOKEN}" https://${WAREHOUSE_DATA_HOARDER_HOST}:7410/blobs/backup-${HOSTNAME}-$(date -u +%Y-%m-%d.%H-%M-%S).tgz --data-binary @- &&
        curl -k -s -H "X-AUTH-TOKEN: ${WAREHOUSE_DATA_HOARDER_TOKEN}" https://${WAREHOUSE_DATA_HOARDER_HOST}:7410/blobs/ |
        json_pp |
        grep ${HOSTNAME} |
        sort |
        head -n-1 |
        sed 's/.*: "\(.*\)".*/\1/' |
        while read file
        do
          curl -k -H "X-AUTH-TOKEN: ${WAREHOUSE_DATA_HOARDER_TOKEN}" https://${WAREHOUSE_DATA_HOARDER_HOST}:7410/blobs/${file} -X DELETE
        done

Here's our 03:00 backup, scheduled via cron. It tells tar to gather up the contents of /data/var/db/unfs/ (the directory holding all your network directories), compress them with gzip, and stream the archive to your data warehouse component using the RESTful API exposed there by hoarder. Then, if that was successful, it runs through a series of commands telling hoarder to get rid of any backups older than the most recent one.
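
Since the cleanup half of that command shows up again in every component below, here it is one more time, annotated stage by stage (this is the same pipeline as above with comments added, not a replacement for it):

# list every blob currently stored in the warehouse
curl -k -s -H "X-AUTH-TOKEN: ${WAREHOUSE_DATA_HOARDER_TOKEN}" https://${WAREHOUSE_DATA_HOARDER_HOST}:7410/blobs/ |
json_pp |                     # pretty-print the JSON so each value sits on its own line
grep ${HOSTNAME} |            # keep only this component's backup entries
sort |                        # the timestamped names sort oldest-to-newest
head -n-1 |                   # drop the last line, so the newest backup survives the purge
sed 's/.*: "\(.*\)".*/\1/' |  # strip the JSON punctuation, leaving bare filenames
while read file               # then delete every older backup that remains
do
  curl -k -H "X-AUTH-TOKEN: ${WAREHOUSE_DATA_HOARDER_TOKEN}" https://${WAREHOUSE_DATA_HOARDER_HOST}:7410/blobs/${file} -X DELETE
done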

If ever we need to restore from the backup, that's pretty straightforward as well:

nanobox console [remote] data.storage  

to connect to your storage component's console, and:

curl -k -H "X-AUTH-TOKEN: ${WAREHOUSE_DATA_HOARDER_TOKEN}" https://${WAREHOUSE_DATA_HOARDER_HOST}:7410/blobs/backup-${HOSTNAME}-{date}.tgz | tar xz -C /data/var/db/unfs/  

to restore.
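
If you don't recall the exact timestamp to plug in for {date}, you can ask hoarder what it's holding first, using the same listing call the cleanup step relies on:

curl -k -s -H "X-AUTH-TOKEN: ${WAREHOUSE_DATA_HOARDER_TOKEN}" https://${WAREHOUSE_DATA_HOARDER_HOST}:7410/blobs/ | json_pp | grep ${HOSTNAME}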

MySQL

There are two ways, generally, to backup a (running) MySQL server. One requires direct filesystem-level access to the database files — which we have — but is also really tricky to do properly in a cron job. Additionally, the primary tool used to do it is only available with an Enterprise MySQL license, which is well beyond the scope of this article. So. The other option it is.

data.mysql:  
  image: nanobox/mysql:5.6

  cron:
    - id: backup
      schedule: '0 3 * * *'
      command: |
        mysqldump --disable-keys --hex-blob -u ${DATA_MYSQL_USER} -p"${DATA_MYSQL_PASS}" --databases gonano |
        gzip |
        curl -k -H "X-AUTH-TOKEN: ${WAREHOUSE_DATA_HOARDER_TOKEN}" https://${WAREHOUSE_DATA_HOARDER_HOST}:7410/blobs/backup-${HOSTNAME}-$(date -u +%Y-%m-%d.%H-%M-%S).sql.gz --data-binary @- &&
        curl -k -s -H "X-AUTH-TOKEN: ${WAREHOUSE_DATA_HOARDER_TOKEN}" https://${WAREHOUSE_DATA_HOARDER_HOST}:7410/blobs/ |
        json_pp |
        grep ${HOSTNAME} |
        sort |
        head -n-1 |
        sed 's/.*: "\(.*\)".*/\1/' |
        while read file
        do
          curl -k -H "X-AUTH-TOKEN: ${WAREHOUSE_DATA_HOARDER_TOKEN}" https://${WAREHOUSE_DATA_HOARDER_HOST}:7410/blobs/${file} -X DELETE
        done

That should create an SQL dump file containing the entire gonano database (you can add others, if you have them, immediately after gonano in the first line of the command), with key checks disabled during restore already baked into the dump, and binary data exported in hexadecimal for reliable restoration. This file is passed through gzip on its way to the hoarder endpoint on your data warehouse component, as above. And then, again as above, anything older than the most recent backup is removed.
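
So, for example, if you had a second database to include (other_db here is just a placeholder for whatever yours is called), the first line of the command would become the following, with the rest of the pipeline staying exactly the same:

mysqldump --disable-keys --hex-blob -u ${DATA_MYSQL_USER} -p"${DATA_MYSQL_PASS}" --databases gonano other_db |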

Restoration is pretty simple. Console in to your database component:

nanobox console [remote] data.mysql  

and then restore like so:

curl -k -H "X-AUTH-TOKEN: ${WAREHOUSE_DATA_HOARDER_TOKEN}" https://${WAREHOUSE_DATA_HOARDER_HOST}:7410/blobs/backup-${HOSTNAME}-{date}.sql.gz | gunzip | mysql -u ${DATA_MYSQL_USER} -p"${DATA_MYSQL_PASS}" gonano  

For large databases, you may need to approach this a bit differently, but the information on what to change, here, is all over the Internet, and also well outside the scope of this article.

Postgres

Backup options for Postgres are similar to those of MySQL. Again, we'll focus on the version that generates a dump file:

data.postgres:  
  image: nanobox/postgresql:9.5

  cron:
    - id: backup
      schedule: '0 3 * * *'
      command: |
        PGPASSWORD=${DATA_POSTGRES_PASS} pg_dump -U ${DATA_POSTGRES_USER} -w -Fc -O gonano |
        gzip |
        curl -k -H "X-AUTH-TOKEN: ${WAREHOUSE_DATA_HOARDER_TOKEN}" https://${WAREHOUSE_DATA_HOARDER_HOST}:7410/blobs/backup-${HOSTNAME}-$(date -u +%Y-%m-%d.%H-%M-%S).sql.gz --data-binary @- &&
        curl -k -s -H "X-AUTH-TOKEN: ${WAREHOUSE_DATA_HOARDER_TOKEN}" https://${WAREHOUSE_DATA_HOARDER_HOST}:7410/blobs/ |
        json_pp |
        grep ${HOSTNAME} |
        sort |
        head -n-1 |
        sed 's/.*: "\(.*\)".*/\1/' |
        while read file
        do
          curl -k -H "X-AUTH-TOKEN: ${WAREHOUSE_DATA_HOARDER_TOKEN}" https://${WAREHOUSE_DATA_HOARDER_HOST}:7410/blobs/${file} -X DELETE
        done

As above, at 03:00, we'll use pg_dump to grab the contents of gonano, pipe them through gzip, and store them in the data warehouse component with the hoarder service, before cleaning out any old backups. Nothing too surprising, there.

And, of course, restoration is a simple 1-2:

nanobox console [remote] data.postgres  

followed by:

curl -k -H "X-AUTH-TOKEN: ${WAREHOUSE_DATA_HOARDER_TOKEN}" https://${WAREHOUSE_DATA_HOARDER_HOST}:7410/blobs/backup-${HOSTNAME}-{date}.sql.gz | gunzip | PGPASSWORD=${DATA_POSTGRES_PASS} pg_restore -U ${DATA_POSTGRES_USER} -w -Fc -O  

And, of course, as with MySQL, large databases may need a different approach, so look around for info on how to handle that.

MongoDB

MongoDB doesn't use credentials (by default), so its backup setup is rather straightforward:

data.mongodb:  
  image: nanobox/mongodb:3.0

  cron:
    - id: backup
      schedule: '0 3 * * *'
      command: |
        mongodump --out - |
        gzip |
        curl -k -H "X-AUTH-TOKEN: ${WAREHOUSE_DATA_HOARDER_TOKEN}" https://${WAREHOUSE_DATA_HOARDER_HOST}:7410/blobs/backup-${HOSTNAME}-$(date -u +%Y-%m-%d.%H-%M-%S).gz --data-binary @- &&
        curl -k -s -H "X-AUTH-TOKEN: ${WAREHOUSE_DATA_HOARDER_TOKEN}" https://${WAREHOUSE_DATA_HOARDER_HOST}:7410/blobs/ |
        json_pp |
        grep ${HOSTNAME} |
        sort |
        head -n-1 |
        sed 's/.*: "\(.*\)".*/\1/' |
        while read file
        do
          curl -k -H "X-AUTH-TOKEN: ${WAREHOUSE_DATA_HOARDER_TOKEN}" https://${WAREHOUSE_DATA_HOARDER_HOST}:7410/blobs/${file} -X DELETE
        done

Three in the morning will see mongodump connect and spew out the data, pipe it through gzip, and then on through hoarder to our data warehouse, before cleaning out the old backup files.

Restoration should look familiar by now:

nanobox console [remote] data.mongodb  

followed by:

curl -k -H "X-AUTH-TOKEN: ${WAREHOUSE_DATA_HOARDER_TOKEN}" https://${WAREHOUSE_DATA_HOARDER_HOST}:7410/blobs/backup-${HOSTNAME}-{date}.gz | gunzip | mongorestore -  

Redis

Redis also doesn't bother with authentication by default — generally speaking, if it isn't on the same private network as your client systems, you're losing more speed to network latency than you're gaining by using Redis over something else. (That reasoning assumes slow or congested connections, so given how fast Redis is, it isn't always accurate.) Still, Nanobox creates an environment where everything is inside the same private network, so neither latency nor authentication is a big concern.

data.redis:  
  image: nanobox/redis:3.0

  cron:
    - id: backup
      schedule: '0 3 * * *'
      command: |
        curl -k -H "X-AUTH-TOKEN: ${WAREHOUSE_DATA_HOARDER_TOKEN}" https://${WAREHOUSE_DATA_HOARDER_HOST}:7410/blobs/backup-${HOSTNAME}-$(date -u +%Y-%m-%d.%H-%M-%S).rdb --data-binary @/data/var/db/redis/dump.rdb &&
        curl -k -s -H "X-AUTH-TOKEN: ${WAREHOUSE_DATA_HOARDER_TOKEN}" https://${WAREHOUSE_DATA_HOARDER_HOST}:7410/blobs/ |
        json_pp |
        grep ${HOSTNAME} |
        sort |
        head -n-1 |
        sed 's/.*: "\(.*\)".*/\1/' |
        while read file
        do
          curl -k -H "X-AUTH-TOKEN: ${WAREHOUSE_DATA_HOARDER_TOKEN}" https://${WAREHOUSE_DATA_HOARDER_HOST}:7410/blobs/${file} -X DELETE
        done

One more task at 03:00, this one simply sending the Redis dump file to hoarder directly, so it will ultimately end up in your data warehouse. Redis was built with backups in mind, so they're considerably simpler to actually perform. (And, of course, we also clean out the old backups we don't need any longer.)
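
One optional refinement: dump.rdb is only as fresh as the last snapshot Redis took, so if you want the backup to reflect the exact moment the cron job runs, you could trigger a synchronous save first. A sketch, assuming redis-cli is available inside the component and Redis is listening on its default port:

redis-cli SAVE &&
curl -k -H "X-AUTH-TOKEN: ${WAREHOUSE_DATA_HOARDER_TOKEN}" https://${WAREHOUSE_DATA_HOARDER_HOST}:7410/blobs/backup-${HOSTNAME}-$(date -u +%Y-%m-%d.%H-%M-%S).rdb --data-binary @/data/var/db/redis/dump.rdb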

Restoration is the usual:

nanobox console [remote] data.redis  

and:

curl -k -H "X-AUTH-TOKEN: ${WAREHOUSE_DATA_HOARDER_TOKEN}" https://${WAREHOUSE_DATA_HOARDER_HOST}:7410/blobs/backup-${HOSTNAME}-{date}.rdb -o /data/var/db/redis/dump.rdb  

Memcache

You probably won't get much use out of backing up or restoring a Memcache component — it's meant to be used as a cache, not a persistent store, and you will lose data if your server ever goes offline for any reason — but just in case that appeals to you (and mostly for the sake of completeness):

data.memcache:  
  image: nanobox/memcached:1.4

  cron:
    - id: backup
      schedule: '0 3 * * *'
      command: |
        memcached-tool localhost:11211 dump |
        gzip |
        curl -k -H "X-AUTH-TOKEN: ${WAREHOUSE_DATA_HOARDER_TOKEN}" https://${WAREHOUSE_DATA_HOARDER_HOST}:7410/blobs/backup-${HOSTNAME}-$(date -u +%Y-%m-%d.%H-%M-%S).gz --data-binary @- &&
        curl -k -s -H "X-AUTH-TOKEN: ${WAREHOUSE_DATA_HOARDER_TOKEN}" https://${WAREHOUSE_DATA_HOARDER_HOST}:7410/blobs/ |
        json_pp |
        grep ${HOSTNAME} |
        sort |
        head -n-1 |
        sed 's/.*: "\(.*\)".*/\1/' |
        while read file
        do
          curl -k -H "X-AUTH-TOKEN: ${WAREHOUSE_DATA_HOARDER_TOKEN}" https://${WAREHOUSE_DATA_HOARDER_HOST}:7410/blobs/${file} -X DELETE
        done

As above, at 03:00 we gather the data, this time using memcached-tool, compress with gzip, and stream it to hoarder on your warehouse component, before deleting old backup files.

And to restore, you of course first connect:

nanobox console [remote] data.memcache  

and then you restore:

curl -k -H "X-AUTH-TOKEN: ${WAREHOUSE_DATA_HOARDER_TOKEN}" https://${WAREHOUSE_DATA_HOARDER_HOST}:7410/blobs/backup-${HOSTNAME}-{date}.gz | gunzip | nc localhost 11211  

(again, as above). If you find a use case for this, and Redis isn't a better fit for your app in this case, please let me know so I can expand my knowledge of app design principles.

External Sync

The observant reader may have noticed that I actually left out one part of the recovery plan, above — syncing a duplicate copy of each backup file to Amazon S3. The process for doing this step is the same regardless of which backup files are involved, so I saved this step for last so I could walk through it only once. If you aren't using S3 to sync your duplicates, the exact commands used here will differ, but the overall approach is still the same. In other words, other external storage options are outside the scope of this article, so consult their own documentation for how things will differ from below.

Syncing To S3

I chose S3 because it's one of the more common targets for this type of thing. It's not necessarily the simplest, though. It'll take a few steps to get this up and running properly.

First, we need to add the AWS CLI tool to every component which will be processing backups:

  # add these lines to data.{component} for each data component
  extra_packages:
    - py27-awscli

Note: You could also install the AWS CLI only on one of these components, and run the sync to S3 from that single location. This introduces its own complications, though — you have to check the warehouse for the list of files to sync, you have to retrieve each file from the warehouse to pipe it to S3, and you need to somehow ensure this will only be run after your backups have completed — so I won't cover that process, here.

Now we do some clever Linux shell scripting tricks to send our backups to S3 at the same time we're sending them to the data warehouse:

  cron:
    - id: backup
      schedule: '0 3 * * *'
      command: |
        (
          tar cz -C /data/var/db/unfs/ . |
          tee /dev/fd/4 |
          curl -k -H "X-AUTH-TOKEN: ${WAREHOUSE_DATA_HOARDER_TOKEN}" https://${WAREHOUSE_DATA_HOARDER_HOST}:7410/blobs/backup-${HOSTNAME}-$(date -u +%Y-%m-%d.%H-%M-%S).tgz --data-binary @-
        ) 4>&1 |
        aws s3 cp - s3://${AWS_S3_BACKUP_BUCKET}/backup-${HOSTNAME}-$(date -u +%Y-%m-%d.%H-%M-%S).tgz &&
        curl -k -s -H "X-AUTH-TOKEN: ${WAREHOUSE_DATA_HOARDER_TOKEN}" https://${WAREHOUSE_DATA_HOARDER_HOST}:7410/blobs/ |
        json_pp |
        grep ${HOSTNAME} |
        sort |
        head -n-1 |
        sed 's/.*: "\(.*\)".*/\1/' |
        while read file
        do
          curl -k -H "X-AUTH-TOKEN: ${WAREHOUSE_DATA_HOARDER_TOKEN}" https://${WAREHOUSE_DATA_HOARDER_HOST}:7410/blobs/${file} -X DELETE
          aws s3 rm s3://${AWS_S3_BACKUP_BUCKET}/${file}
        done
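
If that redirection reads like black magic: the 4>&1 on the subshell points its file descriptor 4 at the outer pipeline, so tee hands one copy onward to curl inside the parentheses and a second copy out to aws. Here's the trick in isolation, stripped of everything backup-related (the /tmp paths are purely illustrative):

(
  echo "some data" |
  tee /dev/fd/4 |
  gzip > /tmp/inner-copy.gz    # this consumer gets one copy, inside the subshell
) 4>&1 |
cat > /tmp/outer-copy          # and this one gets the other, outside it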

Finally, to make this work, we need to set some environment variables before we deploy. I'll use the CLI approach, below, but you can also set these values in your dashboard:

nanobox evar add AWS_ACCESS_KEY_ID={key}  
nanobox evar add AWS_SECRET_ACCESS_KEY={secret}  
nanobox evar add AWS_DEFAULT_REGION=us-west-2  
nanobox evar add AWS_S3_BACKUP_BUCKET={app-name}-backups

From the next deploy onward, your app's data will not only be backed up in your data warehouse, but also remotely in your S3 backup bucket. Hooray for redundancy!

Note: The extra observant, or those using Redis in their apps, will have noticed that, with Redis, we are syncing a file directly, as-is, in the normal backup process, rather than via pipes as we do with the others. That means our fancy shell tricks aren't necessary, there — either create a separate cron job scheduled at the same time which sends the rdb file directly, or append the AWS CLI command to your cURL command via ; or &&.

Restoring From S3

Now that you have backups on S3, you'll want to know how to restore from them, just in case. Luckily, this is pretty easy. We'll just use the AWS CLI tool in place of our cURL commands, above, like so:

nanobox console [remote] data.storage  
aws s3 cp s3://${AWS_S3_BACKUP_BUCKET}/backup-${HOSTNAME}-{date}.tgz - | tar xz -C /data/var/db/unfs/

Nothing to it. Well, except transfer times across the Internet.
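
The same substitution works for the database components. The MySQL restore from earlier, for instance, becomes:

nanobox console [remote] data.mysql
aws s3 cp s3://${AWS_S3_BACKUP_BUCKET}/backup-${HOSTNAME}-{date}.sql.gz - | gunzip | mysql -u ${DATA_MYSQL_USER} -p"${DATA_MYSQL_PASS}" gonano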

All Together, Now!

So what would this actually look like in a real boxfile.yml? Glad you asked! In the course of writing this article, I actually built such a boxfile.yml! Merge in the portions you need, or even start with this as a template, and merge in what you need from elsewhere instead. Either way, everything you need to back up all your app data is right here — assuming you want to use the recovery plan I used for this article, of course. Adjust anything that you plan to do differently accordingly!

run.config:  
  engine: none

# deploy.config:

web.main:  
  start: sleep 365d

  network_dirs:
    data.storage:
      - test

worker.background:  
  start: sleep 365d

  network_dirs:
    data.storage:
      - test

data.storage:  
  image: nanobox/unfs

  extra_packages:
    - py27-awscli

  cron:
    - id: backup
      schedule: '0 3 * * *'
      command: |
        (
          tar cz -C /data/var/db/unfs/ . |
          tee /dev/fd/4 |
          curl -k -H "X-AUTH-TOKEN: ${WAREHOUSE_DATA_HOARDER_TOKEN}" https://${WAREHOUSE_DATA_HOARDER_HOST}:7410/blobs/backup-${HOSTNAME}-$(date -u +%Y-%m-%d.%H-%M-%S).tgz --data-binary @-
        ) 4>&1 |
        aws s3 cp - s3://${AWS_S3_BACKUP_BUCKET}/backup-${HOSTNAME}-$(date -u +%Y-%m-%d.%H-%M-%S).tgz &&
        curl -k -s -H "X-AUTH-TOKEN: ${WAREHOUSE_DATA_HOARDER_TOKEN}" https://${WAREHOUSE_DATA_HOARDER_HOST}:7410/blobs/ |
        json_pp |
        grep ${HOSTNAME} |
        sort |
        head -n-1 |
        sed 's/.*: "\(.*\)".*/\1/' |
        while read file
        do
          curl -k -H "X-AUTH-TOKEN: ${WAREHOUSE_DATA_HOARDER_TOKEN}" https://${WAREHOUSE_DATA_HOARDER_HOST}:7410/blobs/${file} -X DELETE
          aws s3 rm s3://${AWS_S3_BACKUP_BUCKET}/${file}
        done

data.mysql:  
  image: nanobox/mysql:5.6

  extra_packages:
    - py27-awscli

  cron:
    - id: backup
      schedule: '0 3 * * *'
      command: |
        (
          mysqldump --disable-keys --hex-blob -u ${DATA_MYSQL_USER} -p"${DATA_MYSQL_PASS}" -h ${DATA_MYSQL_HOST} --databases gonano |
          gzip |
          tee /dev/fd/4 |
          curl -k -H "X-AUTH-TOKEN: ${WAREHOUSE_DATA_HOARDER_TOKEN}" https://${WAREHOUSE_DATA_HOARDER_HOST}:7410/blobs/backup-${HOSTNAME}-$(date -u +%Y-%m-%d.%H-%M-%S).sql.gz --data-binary @-
        ) 4>&1 |
        aws s3 cp - s3://${AWS_S3_BACKUP_BUCKET}/backup-${HOSTNAME}-$(date -u +%Y-%m-%d.%H-%M-%S).sql.gz &&
        curl -k -s -H "X-AUTH-TOKEN: ${WAREHOUSE_DATA_HOARDER_TOKEN}" https://${WAREHOUSE_DATA_HOARDER_HOST}:7410/blobs/ |
        json_pp |
        grep ${HOSTNAME} |
        sort |
        head -n-1 |
        sed 's/.*: "\(.*\)".*/\1/' |
        while read file
        do
          curl -k -H "X-AUTH-TOKEN: ${WAREHOUSE_DATA_HOARDER_TOKEN}" https://${WAREHOUSE_DATA_HOARDER_HOST}:7410/blobs/${file} -X DELETE
          aws s3 rm s3://${AWS_S3_BACKUP_BUCKET}/${file}
        done

data.postgres:  
  image: nanobox/postgresql:9.5

  extra_packages:
    - py27-awscli

  cron:
    - id: postgres-backup
      schedule: '0 3 * * *'
      command: |
        (
          PGPASSWORD=${DATA_POSTGRES_PASS} pg_dump -U ${DATA_POSTGRES_USER} -h ${DATA_POSTGRES_HOST} -Fc -O gonano |
          gzip |
          tee /dev/fd/4 |
          curl -k -H "X-AUTH-TOKEN: ${WAREHOUSE_DATA_HOARDER_TOKEN}" https://${WAREHOUSE_DATA_HOARDER_HOST}:7410/blobs/backup-${HOSTNAME}-$(date -u +%Y-%m-%d.%H-%M-%S).sql.gz --data-binary @-
        ) 4>&1 |
        aws s3 cp - s3://${AWS_S3_BACKUP_BUCKET}/backup-${HOSTNAME}-$(date -u +%Y-%m-%d.%H-%M-%S).sql.gz &&
        curl -k -s -H "X-AUTH-TOKEN: ${WAREHOUSE_DATA_HOARDER_TOKEN}" https://${WAREHOUSE_DATA_HOARDER_HOST}:7410/blobs/ |
        json_pp |
        grep ${HOSTNAME} |
        sort |
        head -n-1 |
        sed 's/.*: "\(.*\)".*/\1/' |
        while read file
        do
          curl -k -H "X-AUTH-TOKEN: ${WAREHOUSE_DATA_HOARDER_TOKEN}" https://${WAREHOUSE_DATA_HOARDER_HOST}:7410/blobs/${file} -X DELETE
          aws s3 rm s3://${AWS_S3_BACKUP_BUCKET}/${file}
        done

data.mongodb:  
  image: nanobox/mongodb:3.0

  extra_packages:
    - py27-awscli

  cron:
    - id: backup
      schedule: '0 3 * * *'
      command: |
        (
          mongodump --out - |
          gzip |
          tee /dev/fd/4 |
          curl -k -H "X-AUTH-TOKEN: ${WAREHOUSE_DATA_HOARDER_TOKEN}" https://${WAREHOUSE_DATA_HOARDER_HOST}:7410/blobs/backup-${HOSTNAME}-$(date -u +%Y-%m-%d.%H-%M-%S).gz --data-binary @-
        ) 4>&1 |
        aws s3 cp - s3://${AWS_S3_BACKUP_BUCKET}/backup-${HOSTNAME}-$(date -u +%Y-%m-%d.%H-%M-%S).gz &&
        curl -k -s -H "X-AUTH-TOKEN: ${WAREHOUSE_DATA_HOARDER_TOKEN}" https://${WAREHOUSE_DATA_HOARDER_HOST}:7410/blobs/ |
        json_pp |
        grep ${HOSTNAME} |
        sort |
        head -n-1 |
        sed 's/.*: "\(.*\)".*/\1/' |
        while read file
        do
          curl -k -H "X-AUTH-TOKEN: ${WAREHOUSE_DATA_HOARDER_TOKEN}" https://${WAREHOUSE_DATA_HOARDER_HOST}:7410/blobs/${file} -X DELETE
          aws s3 rm s3://${AWS_S3_BACKUP_BUCKET}/${file}
        done

data.redis:  
  image: nanobox/redis:3.0

  extra_packages:
    - py27-awscli

  cron:
    - id: backup
      schedule: '0 3 * * *'
      command: |
        backup_date=$(date -u +%Y-%m-%d.%H-%M-%S) &&
        curl -k -H "X-AUTH-TOKEN: ${WAREHOUSE_DATA_HOARDER_TOKEN}" https://${WAREHOUSE_DATA_HOARDER_HOST}:7410/blobs/backup-${HOSTNAME}-${date}.rdb --data-binary @/data/var/db/redis/dump.rdb &&
        aws s3 cp /data/var/db/redis/dump.rdb s3://${AWS_S3_BACKUP_BUCKET}/backup-${HOSTNAME}-${date}.rdb &&
        curl -k -s -H "X-AUTH-TOKEN: ${WAREHOUSE_DATA_HOARDER_TOKEN}" https://${WAREHOUSE_DATA_HOARDER_HOST}:7410/blobs/ |
        json_pp |
        grep ${HOSTNAME} |
        sort |
        head -n-1 |
        sed 's/.*: "\(.*\)".*/\1/' |
        while read file
        do
          curl -k -H "X-AUTH-TOKEN: ${WAREHOUSE_DATA_HOARDER_TOKEN}" https://${WAREHOUSE_DATA_HOARDER_HOST}:7410/blobs/${file} -X DELETE
          aws s3 rm s3://${AWS_S3_BACKUP_BUCKET}/${file}
        done

data.memcache:  
  image: nanobox/memcached:1.4

  extra_packages:
    - py27-awscli

  cron:
    - id: backup
      schedule: '0 3 * * *'
      command: |
        (
          memcached-tool localhost:11211 dump |
          gzip |
          tee /dev/fd/4 |
          curl -k -H "X-AUTH-TOKEN: ${WAREHOUSE_DATA_HOARDER_TOKEN}" https://${WAREHOUSE_DATA_HOARDER_HOST}:7410/blobs/backup-${HOSTNAME}-$(date -u +%Y-%m-%d.%H-%M-%S).gz --data-binary @-
        ) 4>&1 |
        aws s3 cp - s3://${AWS_S3_BACKUP_BUCKET}/backup-${HOSTNAME}-$(date -u +%Y-%m-%d.%H-%M-%S).gz &&
        curl -k -s -H "X-AUTH-TOKEN: ${WAREHOUSE_DATA_HOARDER_TOKEN}" https://${WAREHOUSE_DATA_HOARDER_HOST}:7410/blobs/ |
        json_pp |
        grep ${HOSTNAME} |
        sort |
        head -n-1 |
        sed 's/.*: "\(.*\)".*/\1/' |
        while read file
        do
          curl -k -H "X-AUTH-TOKEN: ${WAREHOUSE_DATA_HOARDER_TOKEN}" https://${WAREHOUSE_DATA_HOARDER_HOST}:7410/blobs/${file} -X DELETE
          aws s3 rm s3://${AWS_S3_BACKUP_BUCKET}/${file}
        done

The Big Picture

So now you have all the tools you need to implement data recovery. You have a recovery plan, and you know how to implement it in Nanobox — you've automated the backup process using cron, and understand the restoration process. You even have the ability to sync your backups to external storage locations, if you wish, and this is also automated as part of your existing backup tasks. This is all great, but how does it fit in with the other aspects of data safety?

When it comes to data security, it's always best to ensure the data you're backing up is stored in its still-encrypted form. Also, you'll want to ensure it can't be readily accessed by unauthorized users (which may seem obvious, but is still often overlooked). More detail on how to actually implement data security is the topic of a different article, but those are the biggest aspects to keep in mind when dealing with backups.

As far as data storage, backups can easily consume most of your available storage capacity. Given proper compression, as recommended above, your backup data shouldn't consume the same amount of storage space as the live data does, but it's still possible that your backups may need up to half of the available storage space. It's good to ensure you have enough space for not only your live data, but your backups as well. If you intend to store more than just the most recent backup, be sure to factor that into your capacity estimates as well. More specific advice on how to do these types of estimates, and how to ensure you allocate space accordingly, is the topic of a separate article.

That's it for this article. Hopefully it was helpful to your Nanobox experience!

Daniel Hunsaker

Author, Father, Programmer, Nut. Dan contributes to so many projects he sometimes gets them mixed up. He'll happily help you out on the Nanobox Slack server when the staff are offline.

@sendoshin Idaho, USA
