Implement an effective, consistent, and repeatable strategy for documenting data analytics workflows and capturing system requirements
“Data analytics applications involve more than just analyzing data, particularly on advanced analytics projects. Much of the required work takes place upfront, in collecting, integrating, and preparing data and then developing, testing, and revising analytical models to ensure that they produce accurate results. In addition to data scientists and other data analysts, analytics teams often include data engineers, who create data pipelines and help prepare data sets for analysis.” — TechTarget
Introduction
Successful consultants, project managers, and product owners use proven, systematic approaches to achieve desired outcomes, whether those are customer engagements, project results, or product and service launches. Modern data stacks and analytics workflows are increasingly complex. This technology-agnostic discovery process aims to help an organization efficiently and repeatably capture a concise record of its existing analytics workflows, its business and technical goals and constraints, and its measures of success. Where applicable, the discovery process is also used to compile and clarify requirements for new data analytics workflows.

Analytics Workflow Stages
Organizations use many patterns to delineate the stages of their analytics workflows. This process uses the following six stages of a typical analytics workflow (a simple way to record them is sketched below):
- Generate: All the ways data is generated and the systems of record where it is stored or from which it originates, also referred to as data ingress
- Collect: All the ways data is collected or ingested
- Prepare: All the ways data is transformed, including ETL, ELT, reverse ETL, and data preparation for ML
- Store: All the ways data is stored, organized, and secured for analytics purposes
- Analyze: All the ways data is analyzed
- Deliver: All the ways data is delivered and how it is consumed, also referred to as data egress or data products
The precise nomenclature is not critical to this process as long as all major functionality is considered.
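If the organization wants to keep this discovery record alongside its code, the six stages can be captured in a simple, machine-readable form. The following is only a minimal sketch in Python; the stage descriptions restate the list above, and every system name is a hypothetical placeholder.

```python
# A minimal sketch of recording the six analytics workflow stages and the
# systems currently in use at each stage. All system names are hypothetical
# placeholders; substitute the organization's actual tools and platforms.
from dataclasses import dataclass, field


@dataclass
class WorkflowStage:
    name: str
    description: str
    systems: list[str] = field(default_factory=list)


workflow_stages = [
    WorkflowStage("Generate", "Systems of record where data originates (data ingress)",
                  ["order-management system", "web clickstream", "IoT telemetry"]),
    WorkflowStage("Collect", "How data is collected or ingested",
                  ["nightly batch exports", "CDC stream"]),
    WorkflowStage("Prepare", "How data is transformed (ETL, ELT, reverse ETL, ML prep)",
                  ["SQL transformation jobs", "feature engineering pipeline"]),
    WorkflowStage("Store", "How data is stored, organized, and secured for analytics",
                  ["data lake", "data warehouse"]),
    WorkflowStage("Analyze", "How data is analyzed",
                  ["BI tool", "notebooks"]),
    WorkflowStage("Deliver", "How data is delivered and consumed (data egress / data products)",
                  ["executive dashboard", "partner data feed"]),
]

for stage in workflow_stages:
    print(f"{stage.name}: {stage.description} -> {', '.join(stage.systems)}")
```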
The Process
The discovery process starts by working backward. It first identifies existing goals and desired outcomes, then existing and anticipated future constraints. Next, it breaks down the current analytics workflows, examining the four stages of collect, prepare, store, and analyze: the steps required to get from data sources to deliverables. Finally, it captures the inputs and outputs of those workflows, along with the data producers and consumers.
Specifically, the process identifies and documents the following:
- Business and technical goals and desired outcomes
- Business and technical constraints, also referred to as limitations or restrictions
- Analytics workflows: tools, techniques, procedures, and organizational structure
- Outputs, also referred to as deliverables, required to achieve desired outcomes
- Inputs, also referred to as data sources, required to achieve desired outcomes
- Data producers and consumers
- Measures of success
- Recommended next steps
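Teams that want the resulting discovery document to stay consistent across engagements can start from a shared skeleton that mirrors this checklist. The sketch below is one possible starting point, not a prescribed format; the field names simply restate the items above.

```python
# An illustrative skeleton for a discovery record mirroring the checklist above.
# The structure and field names are suggestions only; adapt them to the
# organization's own documentation standards.
discovery_record = {
    "goals_and_outcomes": [],        # business and technical goals and desired outcomes
    "constraints": [],               # business and technical constraints (limitations/restrictions)
    "workflows": {                   # tools, techniques, procedures, and organizational structure
        "collect": [],
        "prepare": [],
        "store": [],
        "analyze": [],
    },
    "outputs": [],                   # deliverables required to achieve desired outcomes
    "inputs": [],                    # data sources required to achieve desired outcomes
    "producers_and_consumers": [],   # who produces and who consumes the data
    "measures_of_success": [],       # KPIs, SLAs, and other success measures
    "recommended_next_steps": [],    # actions agreed upon at the end of discovery
}
```

A skeleton like this can live in the engagement's repository and be filled in as discovery sessions progress.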
Outcomes
Capture the business and technical goals and desired outcomes driving the need to re-architect current analytics processes. For example:
- Re-architect analytics processes to modernize, reduce complexity, or add new capabilities
- Reduce or control costs
- Increase performance, scalability, and speed
- Migrate on-premises workloads, workflows, and processes to the cloud
- Migrate from one cloud provider or SaaS provider to another
- Move away from proprietary software products to open source software (OSS) or commercial open source software (COSS)
- Migrate away from custom-built software to commercial off-the-shelf (COTS), OSS, or COSS solutions
- Integrate DevOps, GitOps, DataOps, or MLOps practices
- Integrate on-premises, multi-cloud, and SaaS-based hybrid architectures
- Develop new analytics product or service offerings
- Standardize analytics processes
- Leverage the data for AI and ML purposes
- Provide stakeholders with a real-time dashboard of business KPIs
- Construct a data lake, data warehouse, data lakehouse, or data mesh
If migration is involved, review the 6 R’s of Cloud Migration: Rehost, Replatform, Repurchase, Refactor, Retain, or Retire.
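When migration is in scope, it can be useful to record which of the 6 R's is being considered for each workload. The sketch below is purely illustrative; the workload names and strategy assignments are hypothetical.

```python
# A hypothetical mapping of workloads to candidate cloud-migration strategies
# (the 6 R's). Workload names are placeholders; replace them with the
# organization's actual inventory.
from enum import Enum


class MigrationStrategy(Enum):
    REHOST = "Rehost"            # "lift and shift"
    REPLATFORM = "Replatform"    # "lift, tinker, and shift"
    REPURCHASE = "Repurchase"    # move to a different product, often SaaS
    REFACTOR = "Refactor"        # re-architect for cloud-native capabilities
    RETAIN = "Retain"            # keep on-premises, at least for now
    RETIRE = "Retire"            # decommission


candidate_strategies = {
    "legacy-reporting-database": MigrationStrategy.REPLATFORM,
    "nightly-etl-jobs": MigrationStrategy.REFACTOR,
    "desktop-bi-tool": MigrationStrategy.REPURCHASE,
    "deprecated-data-mart": MigrationStrategy.RETIRE,
}

for workload, strategy in candidate_strategies.items():
    print(f"{workload}: {strategy.value}")
```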
Constraints
Identify the existing and potential future business and technical constraints that impact analytics workflows. For example:
- Budgets
- Cost attribution
- Timelines
- Access to skilled resources
- Internal and external regulatory requirements, such as HIPAA, SOC 2, FedRAMP, GDPR, PCI DSS, CCPA, and FISMA
- Business Continuity and Disaster Recovery (BCDR) requirements
- Architecture Review Board (ARB), Center of Excellence (CoE), Change Advisory Board (CAB), and Release Management standards and guidelines
- Data residency and data sovereignty requirements
- Security policies
- Service Level Agreements (SLAs); see the ‘Measures of Success’ section
- Existing vendor, partner, cloud-provider, and SaaS relationships
- Existing licensing and contractual obligations
- Deprecated code dependencies and other technical debt
- Must-keep aspects of existing processes
- Build versus buy propensity
- Proprietary versus open source software propensity
- Insourcing versus outsourcing propensity
- Managed, hosted, SaaS versus self-managed software propensity
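Constraints are easier to review and track when recorded in a lightweight register. The sketch below illustrates one possible structure; the categories and example entries are assumptions, not a required taxonomy.

```python
# An illustrative constraints register. Categories and entries are examples
# only; capture the organization's actual constraints during discovery.
from dataclasses import dataclass


@dataclass
class Constraint:
    category: str       # e.g., budget, timeline, regulatory, contractual
    description: str
    impact: str         # which workflows or decisions the constraint affects


constraints = [
    Constraint("regulatory", "Workloads containing PHI must remain HIPAA compliant",
               "Limits storage and processing options for clinical datasets"),
    Constraint("contractual", "Existing three-year license for the current BI platform",
               "Influences build-versus-buy and replatforming decisions"),
    Constraint("timeline", "Current warehouse contract expires at the end of the fiscal year",
               "Sets a hard deadline for any migration"),
]
```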

Analytics Workflows
Capture the analytics workflows, using the four stages of collect, prepare, store, and analyze to organize the discussion:
- High- and low-level architecture, process flow diagrams, sequence diagrams
- Recent architectural assessments such as reviews based on the AWS Data Analytics Lens, AWS Well-Architected Framework, Microsoft Azure Well-Architected Framework, or Google Cloud Architecture Framework
- Analytics tools, including hardware and commercial, custom, and open-source software
- Security policies, processes, standards, and technologies
- Observability, logging, monitoring, alerting, and notification
- Teams, including roles, responsibilities, and skillsets
- Partners, including consultants, vendors, SaaS providers, and Managed Service Providers (MSPs)
- SDLC environments, such as Local, Sandbox, Development, Testing, Staging, Production, and Disaster Recovery (DR)
- Business Continuity Planning (BCP) policies, processes, standards, and technologies
- Primary analytics programming languages
- External system dependencies
- DataOps, MLOps, DevOps, CI/CD, SCM/VCS, and Infrastructure-as-Code (IaC) automation policies, processes, standards, and technologies
- Data governance and data lineage policies, processes, standards, and technologies
- Data quality (or data assurance) policies, processes, standards, technologies, and testing methodologies
- Data anomaly detection policies, processes, standards, and technologies
- Intellectual property (IP), the ‘secret sauce’ that differentiates the organization’s processes and provides a competitive advantage, such as ML models, proprietary algorithms, datasets, highly specialized knowledge, and patents
- Overall effectiveness of and customer satisfaction with the existing analytics platform (document the sources of customer feedback)
- Known deficiencies with current analytics processes

Outputs
Identify the deliverables required to meet the desired outcomes. For example, prepare and provide data for:
- Data analytics purposes
- Business Intelligence (BI), visualizations, and dashboards
- Machine Learning (ML) and Artificial Intelligence (AI)
- Data exports and data feeds, such as Excel or CSV-format files
- Hosted datasets for external or internal consumption
- Data APIs for external or internal consumption
- Documentation, data API guides, data dictionaries, and example code, such as notebooks
- SaaS-based product offerings
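Cataloging each deliverable with its audience, format, and refresh cadence helps trace outputs back to the desired outcomes they support. The sketch below is illustrative; the deliverable and all of its attributes are hypothetical examples.

```python
# An illustrative catalog entry for a single output (deliverable).
# The example deliverable and its attributes are hypothetical.
from dataclasses import dataclass


@dataclass
class Deliverable:
    name: str
    kind: str              # e.g., dashboard, data feed, hosted dataset, data API
    consumers: list[str]
    data_format: str
    refresh_cadence: str
    desired_outcome: str   # the goal this deliverable supports


weekly_sales_feed = Deliverable(
    name="weekly-sales-feed",
    kind="data feed",
    consumers=["finance team", "external distribution partner"],
    data_format="CSV export to SFTP",
    refresh_cadence="every Monday at 06:00 UTC",
    desired_outcome="Provide stakeholders with timely sales reporting",
)
```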

Inputs
Capture the data sources required to produce the outputs. For example:
- Batch sources such as flat files from legacy systems, third-party providers, and enterprise platforms
- Streaming sources such as message queues, change data capture (CDC), IoT device telemetry, operational metrics, real-time logs, clickstream data, connected devices, mobile, and gaming feeds
- Databases, including relational, NoSQL, key-value, document, in-memory, graph, time-series, and wide-column (OLTP data stores)
- Data warehouses (OLAP data stores)
- Data lakes
- API endpoints
- Internal, public, and licensed datasets
Use the 5 V’s of big data to dive deep into each data source: Volume, Velocity, Variety, Veracity (or Validity), and Value.

Data Producers and Consumers
Capture all producers and consumers of data:
- Data producers
- Data consumers
- Data access patterns
- Data usage patterns
- Consumer and producer requirements and constraints

Measures of Success
Identify how success is measured for the analytics workflows and by whom. For example:
- Key Performance Indicators (KPIs)
- Service Level Agreements (SLAs)
- Customer Satisfaction Score (CSAT)
- Net Promoter Score (NPS)
- SaaS growth metrics, such as churn, activation rate, monthly recurring revenue (MRR), annual recurring revenue (ARR), customer acquisition cost (CAC), customer lifetime value (CLV), and expansion revenue (source: appcues.com)
- Data quality guarantees
- How measurements are determined, calculated, and weighted
- The business and technical actions that result from missed measures of success
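Each measure of success is most useful when recorded with its target, how it is calculated, and the agreed response to a miss. The sketch below is illustrative; the metric, owner, target, and actions are hypothetical.

```python
# An illustrative record of a single measure of success, including how it is
# calculated and what happens when it is missed. All values are hypothetical.
from dataclasses import dataclass


@dataclass
class MeasureOfSuccess:
    name: str
    owner: str             # who is accountable for the measure
    target: str            # the agreed threshold or goal
    calculation: str       # how the measurement is determined and weighted
    action_if_missed: str  # business or technical response to a miss


dashboard_freshness = MeasureOfSuccess(
    name="Executive dashboard data freshness",
    owner="Analytics platform team",
    target="Data no more than 15 minutes old during business hours",
    calculation="Max event-to-dashboard latency, sampled every 5 minutes",
    action_if_missed="Page the on-call data engineer; review pipeline capacity monthly",
)
```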
Results
The immediate artifact of the data analytics discovery process is a clear and concise document that captures all feedback and inputs. In addition, the document contains all customer-supplied artifacts, such as architectural and process flow diagrams. The document should be thoroughly reviewed for accuracy and completeness by the process participants. This artifact serves as a record of current data analytics workflows and a basis for making workflow improvement recommendations or architecting new workflows.
This blog represents my own viewpoints and not those of my employer, Amazon Web Services (AWS). All product names, logos, and brands are the property of their respective owners.