“Analysis: a detailed examination of the elements or structure of something, typically as a basis for discussion or interpretation.” — Google
Recently, I had the opportunity, along with a colleague, to perform an operational readiness analysis of a client’s new application platform. The platform was in late-stage development, only a few weeks from going live. Being relatively new to the project team, our objective was to determine the amount of work remaining to make the platform operational, from a technical perspective. As well, we wanted to ensure there were no gaps in the project’s scope, as it related to technical operational readiness.
There were various approaches we could have taken to establish the amount of unfinished work and any existing gaps. We could have reviewed the state of the project in the client’s Agile project management system. We could have discussed the project’s state with the Product Owner, Project Manager, Team Lead, and Business Analyst. Neither of these methods on their own would have provided us with a complete picture.
The approach we ultimately took was what we’ve coined, an Operational Readiness Analysis, or our trendier title, the ‘State of the DevOps’. This approach is particularly effective and highly interactive. The analysis may be conducted at any stage of the Software Development Lifecycle (SDLC), to review a project’s operational state of readiness. The analysis involves five simple steps:
Grab Your Post-It Notes
We began, where all good ThoughtWorkers tend to begin, with a meeting room, whiteboard, dry-erase markers, and lots of brightly colored Post-It notes. ThoughtWorkers do love their Post-It notes. We gathered a small subset of the project team, including the Product Owner, Project Manager, and the team’s DevOps Engineer.
Step 1: Categorize
To start, we broadly categorized the operational requirements into buckets of work. In my experience, categorization typically takes one of two forms, either the use of role-based descriptors, like ‘Security’ and ‘Monitoring’, or the use of the ‘ilities’, such as ‘Scalability’ and ‘Maintainability’.
The ‘ilities’ are often referenced in regards to software architecture and the functional and nonfunctional requirements (NFRs) of a project. Although there are many ‘ilities’, project requirements frequently include Usability, Scalability, Maintainability, Reliability, Testability, Availability, and Serviceability. Although I enjoy referring to the ‘ilities’ when drafting project requirements, I find a broad audience more readily understands the high-level category descriptions.
We collectively agreed on eight categories, in addition to a catch-all ‘Other’ category, and distributed them on the whiteboard.
Role-based categories tend to include the following:
- Security and Compliance
- Monitoring and Alerting
- Continuous Integration and Delivery
- Release Management
- Environment Management
- Data Management
- High Availability and Fault Tolerance
- Backup and Restoration
- Support and Maintenance
- Documentation and Training
- Technical Debt
Step 2: Itemize
Next, each person placed as many Post-It notes as they wanted, below the categories on the whiteboard. Each Post-It represented an item someone felt needed to be addressed as part the analysis. Items might represent requirements currently in progress, or requirements still untouched in the project’s backlog. A Post-It note might contain a new requirement. Maybe there is a question about an aspect of the project’s relative importance, or there were concerns about how an aspect of the project had been implemented.
Participants placed an average of five to ten Post-It notes on the whiteboard. The end product was a collective mind-map of the project’s operational state (the following is only an example and doesn’t represent the actual results of any real client analysis).
Whereas categories tend to be generic in nature, the items are usually particular to the project, and it’s technology choices. The ‘Monitoring/Logging’ category might include individual items such as ‘Log Rotation Policy’ or ‘Splunk vs. ELK?’. The ‘Security’ category might include individual items such as ‘Need Password Rotation’ or ‘SSL Certificate Management’.
Participants often use one or two words to summarize more complex thoughts. For example, ‘Rollback’ might be stating, ‘we still don’t have a good rollback strategy for the database’. ‘Release Frequency’ may be questioning, ‘why can’t we release more frequently?’. Make sure to discuss and capture the participant’s full thoughts behind each Post-It, the during the Organize, Prioritize, and Document stages.
Don’t let participants give up too soon placing Post-It notes on the board. Often, one participant’s Post-It will spark additional thoughts other team members. Don’t be afraid to suggest a few items, if the group needs some initial motivation. Lastly, fear not, it’s never too late to go back and add missing items, uncovered in later stages of analysis.
Step 3: Organize
Quickly and non-judgmentally, review all items on the whiteboard. Group duplicates together. Ask for clarification on items where necessary. Move miscategorized items, if the author and team agree there is a more appropriate category.
Examine the ‘Other’ category. If there are items in the ‘Other’ category that are similar, consider adding a new category to capture them. Maybe the team missed identifying a key category, earlier. Conversely, there is nothing wrong with leaving a few stragglers in the ‘Other’ category; don’t overthink the exercise.
Participants should leave this stage with a general understanding of each item on the whiteboard.
Step 4: Prioritize
With all items identified, organized, and understood, discuss each item’s status. Does the item represent incomplete requirements? Does everyone agree an item’s requirements are complete? Team members may not agree on the completeness of particular items. Inconsistencies may reflect unclear requirements. More often, I find, disagreements are a result of an inconsistent or incomplete implementation of requirements across environments — Development, QA, Performance, Staging, and Production.
Why are operational requirements frequently implemented inconsistently? Often, a story is played early in the development stage, prior to all environments being available. For example, a story is played to add monitoring to Development and QA. The story is completed, tested, and closed. Afterward, the Performance environment is created. Later yet, the Staging and Production environments are created. But, there are no new stories to address monitoring the new environments.
One way to avoid this common issue with operational requirements is an effective DevOps automation strategy. In this example, an effective strategy would mean all new environments would get monitoring, automatically. The story should broadly address monitoring, and not, short-sightedly, monitoring of specific environments.
Often, only one or two project team members are responsible for ‘all the DevOps stuff’. They alone possess an accurate perspective on an item’s current status.
Gain agreement from the Product Owner, Project Manager, and key team members on the priority of incomplete requirements. Are they ‘Day One’ must-haves, or ‘Day Two’ and can be completed after the application goes live? Flag all high-priority items.
If the status of an item is unknown, such as ‘why is the QA environment always down?!’, note it as an open question and move on. If the item simply requires a quick answer to resolve, like ‘do we backup the database?’, answer it (‘the database is replicated and backed up daily’) and move on.
Step 5: Document
Snap a picture of the whiteboard and document its contents. We chose the client’s knowledge management system as a vehicle to share the whiteboard’s results. We created a table with columns for Category, Item, Description, Priority, Owner, and Notes. Also, it’s helpful to have a column for the project’s corresponding Story ID or Defect ID. Along with the whiteboard, capture key discussion points, questions, and concerns, raised by the participants.
Step 5: Document
Snap a picture of the whiteboard and document its contents. We chose Confluence to share the whiteboard results, using a table with columns for Category, Item, Description, Priority, Owner, and Notes. We also bulleted all key discussion points, questions, and concerns, raised by the participants.
Make the analysis results available to all team members. Let the team ask questions and poke holes in the results. Adjust the results if necessary.
The actions you take, given the results of the analysis, are up to you. In our case, we ensured all high-priority items were called out on the team’s Agile board. Any new items were captured in the client’s Agile project management system. Finally, we ensured open questions and concerns were addressed in a timely fashion. We continue to track each item’s status weekly, throughout the launch and post-launch periods.
We anticipate conducting a follow-up analysis, thirty days after launch. The goal will be to evaluate the effectiveness of the first analysis and identify additional operational needs as the application enters a business-as-usual (BAU) application lifecycle phase.
Postscript: the ‘ilities’
The ‘ilities’, courtesy of codesqueeze.com and en.wikipedia.org
The capability of a system, network, or process to handle a growing amount of work or its potential to be enlarged to accommodate that growth.
The practical feasibility of observing a reproducible series of such counterexamples if they do exist.
- Reliability (Resilience)
The ability of a system or component to perform its required functions under stated conditions for a specified time.
- Usability (Performance)
The degree to which software can be used by specified consumers to achieve quantified objectives with effectiveness, efficiency, and satisfaction in a quantified context of use.
- Serviceability (Supportability)
The ability of technical support personnel to install, configure, and monitor computer products, identify exceptions or faults, debug or isolate faults to root cause analysis, and provide hardware or software maintenance in pursuit of solving a problem and restoring the product into service.
The proportion of time a system is in a functioning condition. Defined in the service-level agreement (SLA).
The ease with which a product can be maintained. Isolate and correct defects or their cause, prevent unexpected breakdowns, maximize a product’s useful life, maximize efficiency, reliability, and safety, meet new requirements, make future maintenance easier, and cope with a changed environment.
All opinions in this post are my own and not necessarily the views of my current employer or their clients.