Thought some more on this after bug 1104428.
The essence of change control is communication and documentation (of the change).
Here’s a very abridged version of what we’re using at Lookout. I’d like to get some thoughts on this and perhaps some discussion at the next Team Meeting.
Types of changes:
As we start, I basically see two types of changes:
- Standard Change: Any Change that is planned and follows the normal process for approvals. Standard Changes go through CCT Review.
- Emergency Change: A Change that is needed immediately because of an urgent issue (e.g. a Load Balancer failed and needs to be replaced within the next 4 hours). Emergency Changes require secondary approval. Approval is usually done by “management”, but because of our structure it might make sense to get peer approval or escalate to me (via VictorOps).
Again, the essence is around communications. In the first case, the CCT would review, approve, schedule and communicate the change in advance. In the second case, a peer or senior member of the team has awareness and can assist in communications.
Level of Risk:
Changes have different risk levels associated with them.
- Low Risk: Little to no risk of any issues with the targeted change (near-zero impact on uptime)
- Medium Risk: The change has a risk of impacting production service uptime and should be scheduled as part of planned downtime maintenance.
- High Risk: The change will or almost certainly will result in production impact and must occur in a planned downtime maintenance window.
Risk can be determined by a probability/impact matrix:
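As a rough sketch of how such a matrix could be encoded: the three risk levels come from the list above, but the 3x3 cell values and the names `RISK_MATRIX` and `risk_level` are illustrative assumptions on my part, not an agreed policy.

```python
# Hypothetical probability/impact matrix. Keys are (probability, impact);
# the cell values are assumptions for illustration only.
RISK_MATRIX = {
    ("low", "low"): "Low Risk",
    ("low", "medium"): "Low Risk",
    ("low", "high"): "Medium Risk",
    ("medium", "low"): "Low Risk",
    ("medium", "medium"): "Medium Risk",
    ("medium", "high"): "High Risk",
    ("high", "low"): "Medium Risk",
    ("high", "medium"): "High Risk",
    ("high", "high"): "High Risk",
}


def risk_level(probability: str, impact: str) -> str:
    """Look up the risk level for a probability/impact pair."""
    key = (probability.lower(), impact.lower())
    if key not in RISK_MATRIX:
        raise ValueError(f"unknown probability/impact pair: {key}")
    return RISK_MATRIX[key]
```

With this table, a change that is unlikely to go wrong but would be painful if it did, `risk_level("low", "high")`, lands in Medium Risk and would be scheduled in a maintenance window.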
Change Requests:
Any Change should have the following documented steps:
- Deployment Steps: The actual steps required when deploying a change.
- Rollback Steps: The steps required to get back to the prior configuration or version.
- Validation Tests: A set of pre- and post-change tests to validate the service's functionality.
- Peer Nomination: A Peer is someone qualified to review the Change.
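To make the required fields concrete, here is a minimal sketch of a Change Request record that reports which documented sections are still missing. The class name `ChangeRequest`, its field names, and `missing_sections` are hypothetical; they are not tied to any tool we use today.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ChangeRequest:
    """Hypothetical record holding the four documented sections above."""
    summary: str
    deployment_steps: List[str] = field(default_factory=list)
    rollback_steps: List[str] = field(default_factory=list)
    validation_tests: List[str] = field(default_factory=list)
    peer_reviewer: str = ""

    def missing_sections(self) -> List[str]:
        """Return the names of required sections that are still empty."""
        required = {
            "deployment_steps": self.deployment_steps,
            "rollback_steps": self.rollback_steps,
            "validation_tests": self.validation_tests,
            "peer_reviewer": self.peer_reviewer,
        }
        return [name for name, value in required.items() if not value]
```

A request with only deployment steps filled in would report `rollback_steps`, `validation_tests`, and `peer_reviewer` as missing, which is a cheap check to run before the request ever reaches a reviewer.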
Peer Reviewer:
The Peer Reviewer should review the Change for the following:
- Is the Risk Level accurate?
- Are pre and post Validation Steps accurate?
- Do Deployment Steps capture actual steps for deployment?
- Do Rollback Steps get the service back up and running?
- Are all other necessary details provided (requested date, expected results post-change, etc.)?
The Peer Reviewer should either ACK the Change or ask for additional information.
Review Process:
Standard Changes are reviewed weekly. During review, the CCT:
- reviews the deployment and rollback steps
- approves, schedules, and communicates the change
- (OR denies the change, provides the reasons why, and sends it back to the requestor for additional information)
What do you think about this process?