Building a successful DevOps team is a significant challenge. Building a successful Agile DevOps team can be an enormous challenge. Most junior Software Developers have experience working on Agile teams, often starting in college. On the contrary, most Operations Engineers, System Administrators, and Security Specialists have not participated on an Agile team. They frequently cut their teeth in highly reactive, support ticket-driven environments. Agile methodologies, ceremonies, and coding practices, often seem strange and awkward to traditional operation resources.
Adding to the challenge, most DevOps teams are expected to execute on a broad range of requirements. DevOps user stories span the software development, release, and support. DevOps team backlogs often include a diverse assortment of epics, including logging, monitoring, alerting, infrastructure provisioning, networking, security, continuous integration and deployment, and support.
Worst yet, many DevOps teams don’t apply the same rigor to developing high-quality user stories and acceptance criteria for DevOps-related requirements, as they do for software application requirements. On teams where DevOps Engineers are integrated with Software Developers, I find the lack of rigor is often due to a lack of understanding of typical DevOps requirements. On stand-alone DevOps teams, I find engineers often believe their stories don’t need the same attention to details, as their software development peers.
I’ve been lucky enough to be part of several successful DevOps teams, both integrated and stand-alone. The teams were a mix of engineers from both software development, system administration, security, support, and operations backgrounds. One of the many reasons these teams were effective, was the presence of a Technical Analyst. The Technical Analyst role falls somewhere in between an Agile Business Analyst and a traditional Systems Analyst, probably closer to the latter on a DevOps team. The Technical Analyst ensures that user stories are Ready for Development, by meeting the Definition of Ready.
There is a common misperception that the Technical Analyst on a DevOps team must be the senior technologist. DevOps is a very broad and quickly evolving discipline. While a good understanding of Agile practices, CI/CD, and modern software development is beneficial, I’ve found organizational and investigative skills to be most advantageous to the Technical Analyst.
In this post, we will explore how to optimize the effectiveness of a Technical Analyst, through the use of a consistent approach to functional requirement analysis and Agile story development. Uniformity improves the overall quality and consistency of user stories and acceptance criteria, maximizing business value and minimizing resource utilization.
Let’s start with a typical greenfield requirement, frequently assigned to DevOps teams, centralized logging.
Centralized Application Logging
As a Software Developer,
I need to view application logs,
So that I can monitor and troubleshoot running applications.
Acceptance Criterion 1
I can view the latest logs from multiple applications.
Acceptance Criterion 2
I can view application logs without having to log directly into each host on which the applications are running.
Acceptance Criterion 3
I can view the log entries in exact chronological order, based on timestamps.
Technical Background Information
Currently, Software Developers log directly into host machines, using elevated credentials, to review running application logs. This is a high-risk practice, which often results in an increase in high-severity support tickets. This is also a huge security concern. Sensitive data could potentially be accessed from the host, using the elevated credentials.
I know some DevOps Engineers that would willingly pick up this user story, make a handful of assumptions based on previous experiences, and manage to hack out a centralized application logging system. The ability to meet the end user’s expectations would clearly not be based on the vague and incomplete user story, acceptance criteria, and technical background details. Any centralized logging solution would be mostly based on the personal knowledge (and luck!) of the engineer.
Consistently delivering well-defined user stories and comprehensive acceptance criteria, which provide maximum business value with minimal resource utilization, requires a systematic analytical approach.
Based on my experience, I have found certain aspects of the software development, delivery, and support lifecycles, affect or are affected by DevOps-related functional requirements. Every requirement should be consistently analyzed against these facets to improve the overall quality of user stories and acceptance criteria.
For simplicity, I prefer to organize these aspects into five broad categories: Technology, Infrastructure, Delivery, Support, and Organizational. In no particular order, these facets, include:
- Technology Preferences
- Architectural Approval
- Technical Debt
- Configuration and Feature Management
- Life Expectancy
- Backup and Recovery
- Data Management
- Computer Networking
- High Availability and Fault Tolerance
- Environment Management
- Continuous Integration and Delivery
- Release Management
- Service Level Agreements (SLAs)
- Service Level Agreements (SLAs)
- Training and Documentation
- Licensing and Legal
- Security and Compliance
Let’s take a closer look at each aspect, and how they could impact DevOps requirements.
Do the requirements introduce changes, which would or should produce logs? How should those logs be handled? Do those changes impact existing logging processes and systems?
Do the requirements introduce changes, which would or should be monitored? Do those changes impact existing monitoring processes and systems?
Do the requirements introduce changes, which would or should produce alerts? Do the requirements introduce changes, which impact existing alerting processes and systems?
Do the requirements introduce changes, which require an owner? Do those changes impact existing ownership models? An example of this might be regarding who owns the support and maintenance of new infrastructure.
Do the requirements introduce changes, which would or should require internal, external, or vendor support? Do those changes impact existing support models? For example, adding new service categories to a support ticketing system, like ServiceNow.
Do the requirements introduce changes, which would or should require regular and on-going maintenance? Do those changes impact existing maintenance processes and schedules? An example of maintenance might be regarding how upgrading and patching would be completed for a new monitoring system.
Backup and Recovery
Do the requirements introduce changes, which would or should require backup and recovery? Do the requirements introduce changes, which impact existing backup and recovery processes and systems?
Security and Compliance
Do the requirements introduce changes, which have security and compliance implications? Do those changes impact existing security and compliance requirements, processes, and systems? Compliance in IT frequently includes Regulatory Compliance with HIPAA, Sarbanes-Oxley, and PCI-DSS. Security and Compliance is closely associated with Licensing and Legal.
Continuous Integration and Delivery
Do the requirements introduce changes, which require continuous integration and delivery of application or infrastructure code? Do those changes impact existing continuous integration and delivery processes and systems?
Service Level Agreements (SLAs)
Do the requirements introduce changes, which are or should be covered by Service Level Agreements (SLA)? Do the requirements introduce changes, which impact existing SLAs Often, these requirements are system performance- or support-related, centered around Mean Time Between Failures (MTBF), Mean Time to Repair, and Mean Time to Recovery (MTTR).
Do the requirements introduce changes, which requires the management of new configuration or multiple configurations? Do those changes affect existing configuration or configuration management processes and systems? Configuration management also includes feature management (aka feature switches). Will the changes turned on or turned off for testing and release? Configuration Management is closely associated with Environment Managment and Release Management.
Do the requirements introduce changes, which affect existing software environments or require new software environments? Do those changes affect existing environments? Software environments typically include Development, Test, Performance, Staging, and Production.
High Availability and Fault Tolerance
Do the requirements introduce changes, which require high availability and fault tolerant configurations? Do those changes affect existing highly available and fault-tolerant systems?
Do the requirements introduce changes, which require releases to the Production or other client-facing environments? Do those changes impact existing release management processes and systems?
Do the requirements introduce changes, which have a defined lifespan? For example, will the changes be part a temporary fix, a POC, an MVP, pilot, or a permanent solution? Do those changes impact the lifecycles of existing processes and systems? For example, do the changes replace an existing system?
Do the requirements introduce changes, which require the approval of the Client, Architectural Review Board (ARB), Change Approval Board (CAB), or other governing bodies? Do the requirements introduce changes, which impact other existing or pending technology choices? Architectural Approval is closely related to Technology Preferences, below. Often an ARB or similar governing body will regulate technology choices.
Do the requirements introduce changes, which might be impacted by existing preferred vendor relationships or previously approved technologies? Are the requirements impacted by blocked technology, including countries or companies blacklisted by the Government, such as the Specially Designated Nationals List (SDN) or Consolidated Sanctions List (CSL)? Does the client approve of the use of Open Source technologies? What is the impact of those technology choices on other facets, such as Licensing and Legal, Cost, and Support? Do those changes impact other existing or pending technology choices?
Do the requirements introduce changes, which would or should impact infrastructure? Is the infrastructure new or existing, is it physical, is it virtualized, is it in the cloud? If the infrastructure is virtualized, are there coding requirements about how it is provisioned? If the infrastructure is physical, how will it impact other aspects, such as Technology Preferences, Cost, Support, Monitoring, and so forth? Do the requirements introduce changes, which impact existing infrastructure?
Do the requirements introduce changes, which produce data or require the preservation of state? Do those changes impact existing data management processes and systems? Is the data ephemeral or transient? How is the data persisted? Is the data stored in a SQL database, NoSQL database, distributed key-value store, or flat-file? Consider how other aspects, such as Security and Compliance, Technology Preferences, and Architectural Approval, impact or are impacted by the various aspects of Data?
Training and Documentation
Do the requirements introduce changes, which require new training and documentation? This also includes architectural diagrams and other visuals. Do the requirements introduce changes, which impact training and documentation processes and practices?
Do the requirements introduce changes, which require networking additions or modifications? Do those changes impact existing networking processes and systems? I have separated out Computer Networking from Infrastructure since networking is itself, so broad, involving such things as DNS, SDN, VPN, SSL, certificates, load balancing, proxies, firewalls, and so forth. Again, how do networking choices impact other aspects, like Security and Compliance?
Licensing and Legal
Do the requirements introduce changes, which requires licensing? Should or would the requirements require additional legal review? Are there contractual agreements that need authorization? Do the requirements introduce changes, which impact licensing or other legal agreements or contracts?
Do the requirements introduce changes, which requires the purchase of equipment, contracts, outside resources, or other goods which incur a cost? How and where will the funding come from to purchase those items? Are there licensing or other legal agreements and contracts required as part of the purchase? Who approves these purchases? Are they within acceptable budgetary expectations? Do those changes impact existing costs or budgets?
Do the requirements introduce changes, which incur technical debt? What is the cost of that debt? How will impact other requirements and timelines? Do the requirements eliminate existing technical debt? Do those changes impact existing or future technical debt caused by other processes and systems?
Depending on your organization, industry, and DevOps maturity, you may prefer different aspects critical to your requirements.
Let’s take a second look at our centralized logging story, applying what we’ve learned. For each facet, we will ask the following three questions.
Is there any aspect of _______ that impacts the centralized logging requirement(s)?
Will the centralized logging requirement(s) impact any aspect of _______?
If yes, will the impacts be captured as part of this user story, an additional user story, or elsewhere?
For the sake of brevity, we will only look at the half the aspects.
Although the requirement is about centralized logging, there is still analysis that should occur around Logging. For example, we should determine if there is a requirement that the centralized logging application should consume its own logs for easier supportability. If so, we should add an acceptance criterion to the original user story around the logging application’s logs being captured.
If we are building a centralized application logging system, we will assume there will be a centralized logging application, possibly a database, and infrastructure on which the application is running. Therefore, it is reasonable to also assume that the logging application, database, and infrastructure need to be monitored. This would require interfacing with an existing monitoring system, or by creating a new monitoring system. If and how the system will be monitored needs to be clarified.
Interfacing with a current monitoring system will minimally require additional acceptance criteria. Developing a new monitoring system would undoubtedly require additional user story or stories, with their own acceptance criteria.
If we are monitoring the system, then we could assume that there should be a requirement to alert someone if monitoring finds the centralized application logging system becomes impaired. Similar to monitoring, this would require interfacing with an existing alerting system or creating a new alerting system. A bigger question, who do the alerts go to? These details should be clarified first.
Interfacing with an existing alerting system will minimally require additional acceptance criteria. Developing a new alerting system would require additional user story or stories, with their own acceptance criteria.
Who will own the new centralized application logging system? Ownership impacts many aspects, such as Support, Maintenance, and Alerting. Ownership should be clearly defined, before the requirements stories are played. Don’t build something that does not have an owner, it will fail as soon as it is launched.
No additional user stories or acceptance criteria should be required.
We will assume that there are no Support requirements, other than what is already covered in Monitoring, Ownership, and Training and Documentation.
Similarly, we will assume that there are no Maintenance requirements, other than what is already covered in Ownership and Training and Documentation. Requirements regarding automating system maintenance in any way should be an additional user story or stories, with their own acceptance criteria.
Backup and Recovery
Similar to both Support and Maintenance, Backup and Recovery should have a clear owner. Regardless of the owner, integrating backup into the centralized application logging system should be addressed as part of the requirement. To create a critical support system such as logging, with no means of backup or recovery of the system or the data, is unwise.
Backup and Recovery should be an additional user story or stories, with their own acceptance criteria.
Security and Compliance
Since we are logging application activity, we should assume there is the potential of personally identifiable information (PII) and other sensitive business data being captured. Due to the security and compliance risks associated with improper handling of PII, there should be clear details in the logging requirement regarding such things as system access.
At a minimum, there are usually requirements for a centralized application logging system to implement some form of user, role, application, and environment-level authorization for viewing logs. Security and Compliance should be an additional user story or stories, with their own acceptance criteria. Any requirements around the obfuscation of sensitive data in the log entries would also be an entirely separate set of business requirements.
Continuous Integration and Delivery
We will assume that there are no Continuous Integration and Delivery requirements around logging. Any requirements to push the logs of existing Continuous Integration and Delivery systems to the new centralized application logging system would be a separate use story.
Service Level Agreements (SLAs)
A critical aspect of most DevOps user stories, which is commonly overlooked, Service Level Agreements. Most requirements should at least a minimal set of performance-related metrics associated with them. Defining SLAs also help to refine the scope of a requirement. For example, what is the anticipated average volume of log entries to be aggregated and over what time period? From how many sources will logs be aggregated? What is the anticipated average size of each log entry, 2 KB or 2 MB? What is the anticipated average volume of concurrent of system users?
Given these base assumptions, how quickly are new log entries expected to be available in the new centralized application logging system? What is the expected overall response time of the system under average load?
Directly applicable SLAs should be documented as acceptance criteria in the original user story, and the remaining SLAs assigned to the other new user stories.
We will also assume that there no Configuration Management requirements around the new centralized application logging system. However, if the requirements prescribed standing up separately logging implementations, possibly to support multiple environments or applications, then Configuration Management would certainly be in scope.
Based on further clarification, we will assume we are only creating a single centralized application logging system instance. However, we still need to learn which application environments need to be captured by that system. Is this application only intended to support the Development and Test environments? Are the higher environments of Staging and Production precluded due to PII?
We should add an explicit acceptance criterion to the original user story around which application environments should be included and which should not.
High Availability and Fault Tolerance
The logging requirement should state whether or not this system is considered business critical. Usually, if a system is considered critical to the business, then we should assume the system requires monitoring, alerting, support, and backup. Additionally, the system should be highly available and fault tolerant.
Making the system highly available would surely have an impact on facets like Infrastructure, Monitoring, and Alerting. High Availability and Fault Tolerance could be a separate user story or stories, with its own acceptance criteria, as long as the original requirement doesn’t dictate those features.
We skipped half the facets. They would certainly expand our story list and acceptance criteria. We didn’t address how the end user would view the logs. Should we assume a web browser? Which web browsers? We didn’t determine if the system should be accessible outside the internal network. We didn’t determine how the system would be configured and deployed. We didn’t investigate if there are any vendor or licensing restrictions, or if there are any required training and documentation needs. Possibly, we will need Architectural Approval before we start to develop.
Based on our brief, partial analysis, we uncovered some further needs.
We need additional information and clarification regarding Logging, Monitoring, Alerting, and Security and Compliance. We should also determine ownership of the resulting centralized application logging system, before it is built, to ensure Alerting, Maintenance, and Support, are covered.
Additional Acceptance Criteria
We need to add additional acceptance criteria to the original user story around SLAs, Security and Compliance, Logging, Monitoring, and Environment Management.I prefer to write acceptance criteria in the
I prefer to write acceptance criteria in the Given-When-Then format. I find the ‘Given-When-Then’ format tends to drive a greater level of story refinement and detail than the ‘I’ format. For example, compare:
I expect to see the most current log entries with no more than a two-minute time delay.
Given the centralized application logging system is operating normally and under average expected load,
When I view log entries for running applications, from my web browser,
Then I should see the most current log entries with no more than a two-minute time delay.
New User Stories
Following Agile best practices, illustrated by the INVEST mnemonic, we need to add new, quality user stories, with their own acceptance criteria, around Monitoring, Alerting, Backup and Recovery, Security and Compliance, and High Availability and Fault Tolerance.
In this post, we have shown how uniformly analyzing the impacts of the above aspects on the requirements will improve the overall quality and consistency of user stories and acceptance criteria, maximizing business value and minimizing resource utilization.
All opinions in this post are my own and not necessarily the views of my current employer or their clients.