Building a successful DevOps team is a challenge. Building a successful Agile DevOps team is an enormous challenge. Many Software Developers have experience working on Agile teams, often starting in college. In contrast, most Operations Engineers and System Administrators have not participated on an Agile team. They frequently cut their teeth in highly reactive, support ticket-driven environments. Agile methodologies can often seem foreign and uncomfortable to them.
Adding to the challenge, most DevOps teams are expected to execute on a broad range of requirements. DevOps user stories can span the entire software development, delivery, and support lifecycle. Stories often include logging, monitoring, alerting, infrastructure provisioning, networking, security, continuous integration and deployment, and support.
Worst of all, many teams don’t apply the same rigor to developing high-quality user stories for DevOps-related requirements as they do for software requirements. On teams where DevOps Engineers are integrated with Software Developers, I find this is often due to a lack of understanding of DevOps requirements. On stand-alone DevOps teams, I find DevOps Engineers often believe their stories don’t need the same attention to detail as those of their software development peers.
I’ve been lucky enough to be part of multiple DevOps teams, both integrated and stand-alone. Those teams were a mix of engineers from software development, system administration, and operations backgrounds. One of the reasons these teams were effective was the presence of a Technical Analyst. The Technical Analyst role falls somewhere in between an Agile Business Analyst and a traditional Systems Analyst; probably closer to the latter on a DevOps team. The Technical Analyst ensures that user stories are Ready for Development; that is, that they meet the Definition of Ready.
There is a common perception that the Technical Analyst on an Agile team must be an experienced technologist. DevOps is a very broad and quickly evolving discipline. While a good understanding of Agile practices and software development is beneficial, I’ve found organization and investigation skills to be most advantageous to the Technical Analyst.
In this post, we will examine one particular methodology for optimizing the effectiveness of the Technical Analyst, through the use of a consistent approach to functional requirement analysis and story development.
Centralized Application Logging
As a Software Developer,
I need to view application logs,
So that I can monitor and troubleshoot running applications.
Acceptance Criterion 1
I can view the latest logs from multiple applications.
Acceptance Criterion 2
I can view application logs without having to log directly into each host on which the applications are running.
Acceptance Criterion 3
I can view the log entries in exact chronological order, based on timestamps.
I know some DevOps engineers who would gladly pick up this user story, make a handful of assumptions based on previous experiences, and manage to hack out a usable centralized application logging system. The ability to meet the end user’s expectations would depend entirely on the experience (and luck) of the engineer, and certainly not on the vague and incomplete user story and acceptance criteria, which lack any technical background details.
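To illustrate how much is left to assumption, consider Acceptance Criterion 3 alone. One possible reading is that entries from several applications are merged into a single stream ordered by timestamp. The following is a minimal Python sketch of that reading, not a prescribed design. The `merge_logs` function, the tuple layout, and the sample entries are all hypothetical. It also assumes each per-application stream is already sorted and that host clocks are synchronized, neither of which the story states.

```python
import heapq
from datetime import datetime


def merge_logs(*streams):
    """Merge per-application streams of (timestamp, app, message) tuples.

    Assumption: every input stream is already sorted by timestamp.
    Returns a lazy iterator over all entries in chronological order.
    """
    return heapq.merge(*streams, key=lambda entry: entry[0])


# Hypothetical sample data from two applications.
orders = [
    (datetime(2024, 1, 1, 12, 0, 1), "orders", "service started"),
    (datetime(2024, 1, 1, 12, 0, 4), "orders", "order created"),
]
billing = [
    (datetime(2024, 1, 1, 12, 0, 2), "billing", "service started"),
    (datetime(2024, 1, 1, 12, 0, 3), "billing", "invoice sent"),
]

merged = list(merge_logs(orders, billing))
for timestamp, app, message in merged:
    print(timestamp.isoformat(), app, message)
```

Even this small sketch raises questions the story does not answer, such as which timestamp is authoritative and how clock skew between hosts is handled.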
Consistently delivering well-defined user stories and comprehensive acceptance criteria, which provide maximum business value with minimal resource utilization, requires a systematic analytical approach.
Based on my experience, I have found that certain facets of the software development, delivery, and support lifecycles affect or are affected by DevOps-related functional requirements. Every requirement should be uniformly analyzed against those facets to improve the overall quality of user stories and acceptance criteria. In no particular order, these facets include:
- Backup and Recovery
- Security and Compliance
- Continuous Integration and Delivery
- Service Level Agreements (SLAs)
- Configuration Management
- Environment Management
- High Availability and Fault Tolerance
- Release Management
- Life Expectancy
- Architectural Approval
- Technology Preferences
- Data Management
- Training and Documentation
- Computer Networking
- Licensing and Legal
- Technical Debt
For simplicity, I prefer to organize these facets into five general categories: Technology, Infrastructure, Delivery, Support, and Organizational.
Do the requirements introduce changes, which would or should produce logs? How should those logs be handled? Do those changes impact existing logging processes and systems?
Do the requirements introduce changes, which would or should be monitored? Do those changes impact existing monitoring processes and systems?
Do the requirements introduce changes, which would or should produce alerts? Do the requirements introduce changes, which impact existing alerting processes and systems?
Do the requirements introduce changes, which require an owner? Do those changes impact existing ownership models? An example of this might be regarding who owns the support and maintenance of new infrastructure.
Do the requirements introduce changes, which would or should require internal, external, or vendor support? Do those changes impact existing support models? For example, adding new service categories to a support ticketing system, like ServiceNow.
Do the requirements introduce changes, which would or should require regular and on-going maintenance? Do those changes impact existing maintenance processes and schedules? An example of maintenance might be regarding how upgrading and patching would be completed for a new monitoring system.
Backup and Recovery
Do the requirements introduce changes, which would or should require backup and recovery? Do the requirements introduce changes, which impact existing backup and recovery processes and systems?
Security and Compliance
Do the requirements introduce changes, which have security and compliance implications? Do those changes impact existing security and compliance requirements, processes, and systems? Compliance in IT frequently includes Regulatory Compliance with HIPAA, Sarbanes-Oxley, and PCI-DSS. The Security and Compliance facet is closely associated with Licensing and Legal.
Continuous Integration and Delivery
Do the requirements introduce changes, which require continuous integration and delivery of application or infrastructure code? Do those changes impact existing continuous integration and delivery processes and systems?
Service Level Agreements (SLAs)
Do the requirements introduce changes, which are or should be covered by Service Level Agreements (SLAs)? Do the requirements introduce changes, which impact existing SLAs? Often, these requirements are system performance- or support-related, centered around Mean Time Between Failures (MTBF), Mean Time to Repair, and Mean Time to Recovery (MTTR).
Do the requirements introduce changes, which require the management of new configuration or multiple configurations? Do those changes affect existing configuration or configuration management processes and systems? Configuration management also includes feature management (aka feature switches). Will the changes be turned on or turned off for testing and release? Configuration Management is closely associated with Environment Management and Release Management.
Do the requirements introduce changes, which affect existing software environments or require new software environments? Do those changes affect existing environments? Software environments typically include Development, Test, Performance, Staging, and Production.
High Availability and Fault Tolerance
Do the requirements introduce changes, which require highly available and fault-tolerant configurations? Do those changes affect existing highly available and fault-tolerant systems?
Do the requirements introduce changes, which require releases to the Production or other client-facing environments? Do those changes impact existing release management processes and systems?
Do the requirements introduce changes, which have a defined lifespan? For example, will the changes be part of a temporary fix, a POC, an MVP, a pilot, or a permanent solution? Do those changes impact the lifecycles of existing processes and systems? For example, do the changes replace an existing system?
Do the requirements introduce changes, which require the approval of the Client, Architectural Review Board (ARB), Change Approval Board (CAB), or other governing bodies? Do the requirements introduce changes, which impact other existing or pending technology choices? Architectural Approval is closely related to Technology Preferences, below. Often an ARB or similar governing body will regulate technology choices.
Do the requirements introduce changes, which might be impacted by existing preferred vendor relationships or previously approved technologies? Are the requirements impacted by blocked technology, including countries or companies blacklisted by the Government, such as the Specially Designated Nationals List (SDN) or Consolidated Sanctions List (CSL)? Does the client approve of the use of Open Source technologies? What is the impact of those technology choices on other facets, such as Licensing and Legal, Cost, and Support? Do those changes impact other existing or pending technology choices?
Do the requirements introduce changes, which would or should impact infrastructure? Is the infrastructure new or existing, is it physical, is it virtualized, is it in the cloud? If the infrastructure is virtualized, are there coding requirements about how it is provisioned? If the infrastructure is physical, how will it impact other facets, such as Technology Preferences, Cost, Support, Monitoring, and so forth? Do the requirements introduce changes, which impact existing infrastructure?
Do the requirements introduce changes, which produce data or require the preservation of state? Do those changes impact existing data management processes and systems? Is the data ephemeral or transient? How is the data persisted? Is the data stored in a SQL database, NoSQL database, distributed key-value store, or flat file? Consider how other facets, such as Security and Compliance, Technology Preferences, and Architectural Approval, impact or are impacted by the various aspects of Data.
Training and Documentation
Do the requirements introduce changes, which require new training and documentation? This also includes architectural diagrams and other visuals. Do the requirements introduce changes, which impact training and documentation processes and practices?
Do the requirements introduce changes, which require networking additions or modifications? Do those changes impact existing networking processes and systems? I have separated out Computer Networking from Infrastructure since networking is itself so broad, involving such things as DNS, SDN, VPN, SSL, certificates, load balancing, proxies, firewalls, and so forth. Again, how do networking choices impact other facets, like Security and Compliance?
Licensing and Legal
Do the requirements introduce changes, which require licensing? Should or would the requirements require additional legal review? Are there contractual agreements that need authorization? Do the requirements introduce changes, which impact licensing or other legal agreements or contracts?
Do the requirements introduce changes, which require the purchase of equipment, contracts, outside resources, or other goods which incur a cost? How and where will the funding come from to purchase those items? Are there licensing or other legal agreements and contracts required as part of the purchase? Who approves these purchases? Are they within acceptable budgetary expectations? Do those changes impact existing costs or budgets?
Do the requirements introduce changes, which incur technical debt? What is the cost of that debt? How will it impact other requirements and timelines? Do the requirements eliminate existing technical debt? Do those changes impact existing or future technical debt caused by other processes and systems?
Depending on your organization, industry, and DevOps maturity, you may find different facets critical to your requirements.
Let’s take a second look at our logging story, applying what we’ve learned. For each facet, we will ask the following three questions.
Is there any facet of _______ that impacts the logging requirement?
Will the logging requirement impact any facet of _______?
If yes, will the impacts be captured as part of this user story, or elsewhere?
For the sake of brevity, we will only look at half of the facets.
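Because the same three questions are asked of every facet, the checklist can be generated rather than retyped. The following Python sketch shows one way a Technical Analyst might do that. The `facet_questions` helper and its parameter names are hypothetical and only illustrate the approach; the facet names passed in are a sample taken from the list above.

```python
# The three analysis questions, with placeholders for the facet and requirement.
QUESTION_TEMPLATES = (
    "Is there any facet of {facet} that impacts the {requirement} requirement?",
    "Will the {requirement} requirement impact any facet of {facet}?",
    "If yes, will the impacts be captured as part of this user story, or elsewhere?",
)


def facet_questions(requirement, facets):
    """Return a dict mapping each facet to its three analysis questions."""
    return {
        facet: [
            template.format(facet=facet, requirement=requirement)
            for template in QUESTION_TEMPLATES
        ]
        for facet in facets
    }


checklist = facet_questions(
    "logging",
    ["Backup and Recovery", "Security and Compliance", "Data Management"],
)
for facet, questions in checklist.items():
    print(facet)
    for question in questions:
        print("  -", question)
```

The output is a plain checklist that can be pasted into a story’s analysis notes, one block per facet.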
Although the requirement is about centralized logging, there is still analysis that should occur around Logging. For example, we should determine if there is a requirement that the centralized logging application should consume its own logs for easier supportability. If so, we should add an acceptance criterion to the original user story around the logging application’s logs being captured.
If we are building a centralized application logging system, we will assume there will be a centralized logging application, possibly a database, and infrastructure on which the application is running. Therefore, it is reasonable to also assume that the logging application, database, and infrastructure need to be monitored. This would require interfacing with an existing monitoring system or creating a new monitoring system. If and how the system will be monitored needs to be clarified.
Interfacing with a current monitoring system will minimally require additional acceptance criteria. Developing a new monitoring system would undoubtedly require additional user story or stories, with their own acceptance criteria.
If we are monitoring the system, then we could assume that there should be a requirement to alert someone if monitoring finds the centralized application logging system has become impaired. Similar to monitoring, this would require interfacing with an existing alerting system or creating a new alerting system. A bigger question: who do the alerts go to? These details should be clarified first.
Interfacing with an existing alerting system will minimally require additional acceptance criteria. Developing a new alerting system would require additional user story or stories, with their own acceptance criteria.
Who will own the new centralized application logging system? Ownership impacts many facets, such as Support, Maintenance, and Alerting. Ownership should be clearly defined before the requirement’s stories are played. Don’t build something that does not have an owner; it will fail as soon as it is launched.
No additional user stories or acceptance criteria should be required.
We will assume that there are no Support requirements, other than what is already covered in Monitoring, Ownership, and Training and Documentation.
Similarly, we will assume that there are no Maintenance requirements, other than what is already covered in Ownership and Training and Documentation. Requirements regarding automating system maintenance in any way should be an additional user story or stories, with their own acceptance criteria.
Backup and Recovery
Similar to both Support and Maintenance, Backup and Recovery should have a clear owner. Regardless of the owner, integrating backup into the centralized application logging system should be addressed as part of the requirement. To create a critical support system such as logging, with no means of backup or recovery of the system or the data, is unwise.
Backup and Recovery should be an additional user story or stories, with their own acceptance criteria.
Security and Compliance
Since we are logging application activity, we should assume there is the potential of personally identifiable information (PII) and other sensitive business data being captured. Due to the security and compliance risks associated with improper handling of PII, there should be clear details in the logging requirement regarding such things as system access.
At a minimum, there are usually requirements for a centralized application logging system to implement some form of user, role, application, and environment-level authorization for viewing logs. Security and Compliance should be an additional user story or stories, with their own acceptance criteria. Any requirements around the obfuscation of sensitive data in the log entries would also be an entirely separate set of business requirements.
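To show what role, application, and environment-level authorization might look like as an acceptance-testable rule, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the `Grant` structure, the `can_view_logs` function, the role names, and the sample grants. A real system would more likely delegate this to its existing identity and access management tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Grant:
    """Hypothetical rule: a role may view logs for an application in an environment."""
    role: str
    application: str  # "*" means any application
    environment: str  # "*" means any environment


def can_view_logs(user_roles, application, environment, grants):
    """Return True if any of the user's roles has a matching grant."""
    return any(
        grant.role in user_roles
        and grant.application in ("*", application)
        and grant.environment in ("*", environment)
        for grant in grants
    )


# Hypothetical policy: developers see lower environments only.
GRANTS = [
    Grant("developer", "*", "Development"),
    Grant("developer", "*", "Test"),
    Grant("support", "orders", "Production"),
]

print(can_view_logs({"developer"}, "orders", "Test", GRANTS))        # True
print(can_view_logs({"developer"}, "orders", "Production", GRANTS))  # False
```

Writing the rule down this explicitly tends to surface the open questions, such as who may see Production logs, that belong in the Security and Compliance stories.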
Continuous Integration and Delivery
We will assume that there are no Continuous Integration and Delivery requirements around logging. Any requirements to push the logs of existing Continuous Integration and Delivery systems to the new centralized application logging system would be a separate user story.
Service Level Agreements (SLAs)
A critical aspect of most DevOps user stories, which is commonly overlooked, is Service Level Agreements. Most requirements should have at least a minimal set of performance-related metrics associated with them. Defining SLAs also helps to refine the scope of a requirement. For example, what is the anticipated average volume of log entries to be aggregated and over what time period? From how many sources will logs be aggregated? What is the anticipated average size of each log entry, 2 KB or 2 MB? What is the anticipated average number of concurrent system users?
Given these base assumptions, how quickly are new log entries expected to be available in the new centralized application logging system? What is the expected overall response time of the system under average load?
Directly applicable SLAs should be documented as acceptance criteria in the original user story, and the remaining SLAs assigned to the other new user stories.
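The sizing questions above matter because the answers multiply. As a rough worked example in Python, the sketch below estimates daily ingest volume from the number of sources, the entry rate, and the average entry size. The figures (20 sources, 100 entries per minute per source) and the `daily_ingest_bytes` helper are assumptions chosen only to show the arithmetic; they do not come from any real system.

```python
KB = 1024
MB = 1024 * KB
GB = 1024 * MB


def daily_ingest_bytes(sources, entries_per_source_per_minute, avg_entry_bytes):
    """Estimate bytes of log data aggregated per day."""
    minutes_per_day = 24 * 60
    entries_per_day = sources * entries_per_source_per_minute * minutes_per_day
    return entries_per_day * avg_entry_bytes


# Assumed load: 20 sources, each writing 100 entries per minute.
small = daily_ingest_bytes(20, 100, 2 * KB)  # 2 KB average entry
large = daily_ingest_bytes(20, 100, 2 * MB)  # 2 MB average entry

print(f"2 KB entries: {small / GB:.1f} GiB per day")
print(f"2 MB entries: {large / GB:.1f} GiB per day")
```

Under these assumptions, the difference between 2 KB and 2 MB entries is a factor of 1,024 in daily storage and network load, which is why the entry size belongs in the acceptance criteria rather than in an engineer’s guess.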
We will also assume that there are no Configuration Management requirements around the new centralized application logging system. However, if the requirements prescribed standing up separate logging implementations, possibly to support multiple environments or applications, then Configuration Management would certainly be in scope.
Based on further clarification, we will assume we are only creating a single centralized application logging system instance. However, we still need to learn which application environments need to be captured by that system. Is this application only intended to support the Development and Test environments? Are the higher environments of Staging and Production precluded due to PII?
We should add an explicit acceptance criterion to the original user story around which application environments should be included and which should not.
High Availability and Fault Tolerance
The logging requirement should state whether or not this system is considered business critical. Usually, if a system is considered critical to the business, then we should assume the system requires monitoring, alerting, support, and backup. Additionally, the system should be highly available and fault tolerant.
Making the system highly available would surely have an impact on facets like Infrastructure, Monitoring, and Alerting. High Availability and Fault Tolerance could be a separate user story or stories, with their own acceptance criteria, as long as the original requirement doesn’t dictate those features.
We skipped half the facets. They would certainly expand our story list and acceptance criteria. We didn’t address how the end user would view the logs. Should we assume a web browser? Which web browsers? We didn’t determine if the system should be accessible outside the internal network. We didn’t determine how the system would be configured and deployed. We didn’t investigate if there are any vendor or licensing restrictions, or if there are any required training and documentation needs. Possibly, we will need Architectural Approval before we start to develop.
Based on our limited analysis, we uncovered some further needs:
We need additional information and clarification regarding Logging, Monitoring, Alerting, and Security and Compliance. We should also determine ownership of the resulting centralized application logging system before it is built, to ensure Alerting, Maintenance, and Support are covered.
Additional Acceptance Criteria
We need to add additional acceptance criteria to the original user story around SLAs, Security and Compliance, Logging, Monitoring, and Environment Management.
I prefer to write acceptance criteria in the Given-When-Then format. I find the ‘Given-When-Then’ format tends to drive a greater level of story refinement and detail than the ‘I’ format. For example, compare:
I expect to see the most current log entries with no more than a two-minute time delay.
Given the centralized application logging system is operating normally and under average load,
When I view log entries for running applications, from my web browser,
Then I should see the most current log entries with no more than a two-minute time delay.
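Part of the appeal of the Given-When-Then format is that it maps almost directly onto an automated check. The Python sketch below is one hypothetical way to express the two-minute criterion: every entry written at least two minutes before the moment of viewing must be visible. The `meets_delay_criterion` function, the entry IDs, and the timestamps are illustrative assumptions, not part of any real logging product.

```python
from datetime import datetime, timedelta, timezone


def meets_delay_criterion(written, visible_ids, viewed_at,
                          max_delay=timedelta(minutes=2)):
    """Then: every entry written at least max_delay before viewed_at is visible.

    written     -- dict of entry ID -> time the application wrote the entry
    visible_ids -- set of entry IDs shown to the user at viewed_at
    """
    cutoff = viewed_at - max_delay
    return all(
        entry_id in visible_ids
        for entry_id, written_at in written.items()
        if written_at <= cutoff
    )


# Given: entries written by running applications (hypothetical sample data).
viewed_at = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
written = {
    "a": viewed_at - timedelta(minutes=5),
    "b": viewed_at - timedelta(minutes=3),
    "c": viewed_at - timedelta(seconds=30),  # still inside the allowed delay
}

# When: I view log entries and the system shows a given set of entry IDs.
# Then: the criterion holds only if every sufficiently old entry is visible.
print(meets_delay_criterion(written, {"a", "b"}, viewed_at))  # True
print(meets_delay_criterion(written, {"a"}, viewed_at))       # False
```

The ‘I’ format version of the criterion gives no hint about the Given (normal operation, average load) or the When (viewing from a browser), so an equivalent check could not be written from it without guessing.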
New User Stories
Following Agile best practices, best illustrated by the INVEST mnemonic, we need to add new, quality user stories, with their own acceptance criteria, around Monitoring, Alerting, Backup and Recovery, Security and Compliance, and High Availability and Fault Tolerance.
Examining all possible facets of each DevOps requirement will improve the overall quality of user stories and acceptance criteria, maximizing business value and minimizing resource utilization.
All opinions in this post are my own and not necessarily the views of my current employer or their clients.