How to handle future upgrades

So, we just had a bit of downtime due to a minor Discourse upgrade. We need to discuss how to better handle this in the future, as it appears that updating one web head (and consequently the database) borks the other web head.

There needs to be some kind of notification procedure for Discourse users in advance of an upgrade (and the inevitable downtime). Thoughts?

Also, I can imagine that we should be updating web heads one after the other, as doing it simultaneously will likely cause conflicts.

In the Discourse settings there’s an option to add HTML above the navbar, so we could use that to warn users. Not the prettiest option though, and doesn’t cover people who use email over the site.

We could have an “upgrade db” which is switched to during the upgrade process.

This means we can upgrade a webhead using the following workflow:

  1. Disable it in the load balancer
  2. Create a duplicate of the database
  3. Move the server being upgraded to the duplicate database
  4. Upgrade the server
  5. Reenable the server in the load balancer
  6. Disable the other webhead
  7. Point to the DB that the active webhead is using (the upgraded one)
  8. Upgrade
  9. Enable in load balancer

There must be a quicker way to do this properly, but this is the best I can come up with right now.
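To make the order of the nine steps above easier to review, here is a rough Python sketch. It is only an in-memory model, not a real deployment script: `Cluster`, `WebHead`, `rolling_upgrade`, the host name `web1`, and the database names `discourse_db` and `upgrade_db` are all made up for illustration, and a real version would have to call whatever the load balancer and database actually expose.

```python
from dataclasses import dataclass, field


@dataclass
class WebHead:
    name: str
    db: str                  # database this web head currently points at
    version: str
    in_rotation: bool = True


@dataclass
class Cluster:
    """In-memory stand-in for the load balancer, databases and web heads."""
    heads: dict
    log: list = field(default_factory=list)

    def set_rotation(self, name, enabled):
        self.heads[name].in_rotation = enabled
        self.log.append(f"{'enable' if enabled else 'disable'} {name}")

    def duplicate_db(self, source, copy):
        # Hypothetical: a real copy would be a dump and restore.
        self.log.append(f"copy {source} -> {copy}")
        return copy

    def point_at(self, name, db):
        self.heads[name].db = db
        self.log.append(f"{name} uses {db}")

    def upgrade(self, name, version):
        # Refuse to upgrade a web head that is still serving traffic.
        if self.heads[name].in_rotation:
            raise RuntimeError(f"{name} is still in the load balancer")
        self.heads[name].version = version
        self.log.append(f"upgrade {name} to {version}")


def rolling_upgrade(cluster, first, second, version, upgrade_db="upgrade_db"):
    """Apply the nine steps from the list, one web head at a time."""
    live_db = cluster.heads[first].db
    cluster.set_rotation(first, False)                  # 1. disable in load balancer
    copy = cluster.duplicate_db(live_db, upgrade_db)    # 2. duplicate the database
    cluster.point_at(first, copy)                       # 3. move server to the duplicate
    cluster.upgrade(first, version)                     # 4. upgrade the server
    cluster.set_rotation(first, True)                   # 5. re-enable in load balancer
    cluster.set_rotation(second, False)                 # 6. disable the other web head
    cluster.point_at(second, cluster.heads[first].db)   # 7. point at the upgraded DB
    cluster.upgrade(second, version)                    # 8. upgrade
    cluster.set_rotation(second, True)                  # 9. enable in load balancer


if __name__ == "__main__":
    cluster = Cluster({
        "web1": WebHead("web1", "discourse_db", "old"),
        "web2": WebHead("web2", "discourse_db", "old"),
    })
    rolling_upgrade(cluster, "web1", "web2", "new")
    for line in cluster.log:
        print(line)
```

The sketch only records the order of operations. It does nothing about posts written to the original database while the duplicate is in use.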

Also, could we maybe have nginx point to an under-construction-esque page during downtime? Like, “Discourse is currently being upgraded. We’ll be back in a few. Please email tanner@mozillausa.org or yousef@mozilla.org.uk if you continue to experience downtime after a considerable amount of time.”
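For what the holding page itself might return, here is a minimal sketch. It is not nginx configuration, just a small Python WSGI app showing the behaviour I have in mind: answer every request with a 503 and a `Retry-After` header so browsers and crawlers know the outage is temporary. The name `maintenance_app`, the port, and the 600-second retry hint are assumptions, not anything we have deployed.

```python
from wsgiref.simple_server import make_server

MAINTENANCE_HTML = (
    "<html><body><h1>Discourse is currently being upgraded.</h1>"
    "<p>We'll be back in a few.</p></body></html>"
)
RETRY_AFTER_SECONDS = 600  # assumed hint; adjust to the planned window


def maintenance_app(environ, start_response):
    """Answer every request with the same 503 maintenance page."""
    body = MAINTENANCE_HTML.encode("utf-8")
    start_response("503 Service Unavailable", [
        ("Content-Type", "text/html; charset=utf-8"),
        ("Content-Length", str(len(body))),
        ("Retry-After", str(RETRY_AFTER_SECONDS)),
    ])
    return [body]


if __name__ == "__main__":
    # Local demo only; port 8080 is an arbitrary choice.
    with make_server("127.0.0.1", 8080, maintenance_app) as server:
        server.serve_forever()
```

Returning 503 rather than 200 matters mostly for search engines and monitoring, which would otherwise treat the holding page as the real site.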

Edit: I’m liking tad’s idea more, although using a temporary duplicate DB will cause any posts made during the upgrade to be lost when we point to the upgraded one, no?

Two ideas came to mind:

  1. Create a third (synced but independent backup) server that is used as the active one while updating the other two. Should be automated via a cron job. Pros: Users wouldn’t notice anything as the website wouldn’t be read-only. Cons: Would require even more work.

  2. Use a crawler to fetch all public pages & posts on Discourse and serve the result as a temporary read-only replacement. Pros: Users could still read things while the main servers are down, and only a one-way sync is required. Cons: Inability to create new posts.

Read-only mode :slight_smile:

How long does it take to upgrade a web head? Also, can you go into detail on what broke? Before we move forward and deploy any more infrastructure, we will be reviewing our procedures for maintenance of our services. This is a good example of what could go wrong without a solid plan. Some points that I would like everyone to provide input on:

  • A dev environment for Discourse to upgrade BEFORE prod - dev will be upgraded a day or two before prod and tested. Any regressions will be investigated before pushing to prod.
  • If an upgrade costs 10 minutes of downtime, then I think it is perfectly acceptable to plan for 10 minutes of downtime once a month for an upgrade. The extra work of having another DB or web server to handle posts during an upgrade is not worth it in my opinion. The most important thing is to get the upgrade completed cleanly, so it is better to concentrate fully on the upgrade than to also think about failing over to prevent downtime.
  • No more infrastructure additions until we have Discourse, Puppet, Icinga and Bastions configured and procedures documented. This might seem harsh, but we do not have a completed infra now. We need to finish what we started before we move on. For example, this WordPress Multiuser Site: do we have the time to build this AND finish building the rest of the servers/services?
  • Backups! We saw with Discourse how shaky an upgrade could be. I suggest we just take an image of each host being upgraded. It’s the easiest way, as it would capture anything an upgrade may have changed unexpectedly.

fwiw, we figured out what went wrong last time. Steps were done in the wrong order on web2, so the latest version wasn’t actually pulled from GitHub.
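One way to catch that kind of mistake next time would be to run the steps from a fixed list and check the running version at the end. A rough Python sketch, with a toy host standing in for the real one: `run_upgrade`, `make_host`, and the step names `pull` and `rebuild` are placeholders I made up, not the actual Discourse commands.

```python
def run_upgrade(steps, get_running_version, expected_version):
    """Run (name, action) pairs strictly in order, then verify the version."""
    done = []
    for name, action in steps:
        action()
        done.append(name)
    running = get_running_version()
    if running != expected_version:
        raise RuntimeError(
            f"expected {expected_version!r} but found {running!r} "
            f"after steps: {', '.join(done)}"
        )
    return done


def make_host(latest):
    """Toy host: 'pull' fetches the latest code, 'rebuild' starts whatever is checked out."""
    state = {"checked_out": "old", "running": "old"}

    def pull():
        state["checked_out"] = latest

    def rebuild():
        state["running"] = state["checked_out"]

    return state, pull, rebuild


if __name__ == "__main__":
    # Correct order: pull first, then rebuild.
    state, pull, rebuild = make_host("new")
    print(run_upgrade([("pull", pull), ("rebuild", rebuild)],
                      lambda: state["running"], "new"))

    # Wrong order: the rebuild happens before the pull, so the old code keeps running.
    state, pull, rebuild = make_host("new")
    try:
        run_upgrade([("rebuild", rebuild), ("pull", pull)],
                    lambda: state["running"], "new")
    except RuntimeError as err:
        print("caught:", err)
```

The final version check is the useful part: even if someone reorders the steps by hand, the upgrade fails loudly instead of leaving a web head on the old code.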