Implement an effective, consistent, and repeatable strategy for documenting data analytics workflows and capturing system requirements
“Data analytics applications involve more than just analyzing data, particularly on advanced analytics projects. Much of the required work takes place upfront, in collecting, integrating, and preparing data and then developing, testing, and revising analytical models to ensure that they produce accurate results. In addition to data scientists and other data analysts, analytics teams often include data engineers, who create data pipelines and help prepare data sets for analysis.” — TechTarget
Successful consultants, project managers, and product owners take a well-defined and systematic approach to achieve desired outcomes. They use frameworks, templates, and proven patterns to accomplish their goals — consistently successful engagements, project outcomes, and product and service launches.
Given the complexity of modern data analytics workflows, the goal of this platform-agnostic discovery process is to lead an organization, department, team, or customer through an effective, efficient, consistent, and repeatable approach for capturing data analytics workflows and systems requirements. This process will result in a clear understanding of existing analytics workflows, current and desired future business and technical outcomes, and existing and anticipated future business and technical constraints.
If applicable, the discovery process serves as a foundation for architecting new data analytics workflows. Starting with a set of business and technical constraints, the approach facilitates the development of new data analytics workflows to achieve desired business and technical outcomes. Outcomes and constraints determine the requirements.
Analytics Workflow Stages
There are many patterns organizations use to delineate the stages of their analytics workflows. However, the granularity of the stages is not critical to this process as long as all major functionality is considered. This process utilizes six stages of a typical analytics workflow, from left to right: Generate, Collect, Prepare, Store, Analyze, and Deliver.
The discovery process starts by working backward and from the outside in. First, the process identifies current and desired future outcomes (right side of the diagram above). Then, it identifies existing and anticipated future constraints (left side of the diagram). Next, it identifies the inputs and the outputs for the existing and re-architected workflows. Finally, the process identifies the current and re-architected analytics workflows: collect, prepare, store, and analyze, the steps required to get from data sources to deliverables.
Specifically, the process identifies and documents the following:
- Current business and technical outcomes
- Desired future business and technical outcomes
- Existing business and technical constraints
- Anticipated future business and technical constraints
- Inputs required to achieve desired outcomes
- Outputs needed to achieve desired outcomes
- Data producers and consumers
- Existing analytics tools, procedures, and people
- Measures of success for analytics workflows
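One way to keep track of these items during discovery is a simple record with one section per bullet above. The sketch below is illustrative only: the `DiscoveryRecord` class, its field names, and the `missing_sections` helper are hypothetical and not part of any prescribed template.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class DiscoveryRecord:
    """Hypothetical container with one section per item the process documents."""

    current_outcomes: list[str] = field(default_factory=list)
    desired_outcomes: list[str] = field(default_factory=list)
    existing_constraints: list[str] = field(default_factory=list)
    anticipated_constraints: list[str] = field(default_factory=list)
    inputs: list[str] = field(default_factory=list)
    outputs: list[str] = field(default_factory=list)
    producers_and_consumers: list[str] = field(default_factory=list)
    existing_tools_procedures_people: list[str] = field(default_factory=list)
    success_measures: list[str] = field(default_factory=list)

    def missing_sections(self) -> list[str]:
        """Return the names of sections that have no entries yet."""
        return [name for name, entries in vars(self).items() if not entries]


# Example entries are taken from the lists in this post; a real record
# would hold whatever the customer actually reports.
record = DiscoveryRecord(
    desired_outcomes=["Move analytics workflows to the Cloud"],
    inputs=["API endpoints"],
)
print(record.missing_sections())
```

Listing the empty sections at the end of each discovery session is a quick way to see what still needs to be captured.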
Capture desired business and technical outcomes (goals and objectives), for example:
- Re-architect analytics workflows to modernize, reduce costs, increase speed, reduce complexity, and add capabilities
- Move analytics workflows to the Cloud (review 6 R’s strategies)
- Migrate from another Cloud or SaaS provider
- Shift away from proprietary platforms to an open-source analytics stack
- Migrate away from custom in-house analytics tools
- Add DataOps (CI/CD automation) to existing workflows
- Integrate hybrid on-prem, multi-cloud, and SaaS-based analytics architectures
- Develop new greenfield analytics workflows, products, and services
- Standardize analytics workflows
- Develop machine learning models from the data
- Provide key stakeholders with real-time business KPI dashboards
Identify the existing and potential future business and technical constraints that impact analytics workflows, for example:
- People, including lack of specific skills (need for training)
- Internal and external regulatory, data governance, and data lineage requirements
- SLAs and KPIs (measures of workflow effectiveness)
- Current vendor, partner, Cloud-provider, and SaaS relationships
- Licensing and contractual obligations
- Must-keep aspects of current analytics workflows
Capture sources of data that are required to produce the outputs, for example:
- Batch sources such as databases and data feeds
- Streaming sources such as IoT device telemetry, operational metrics, logs, clickstreams, and gaming stats
- Database engines, including relational, key-value, document, in-memory, graph, time series, wide column, and ledger
- Data warehouses
- Data feeds, such as flat files from legacy or third-party systems
- API endpoints
- Public and licensed datasets
Use the 5 V’s of data (Volume, Velocity, Variety, Veracity, and Value) to dive deep into each input. Capture existing and anticipated data access and usage patterns.
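As a rough illustration, each input can be profiled against the 5 V’s with a small record. The `InputProfile` class, the `unanswered` helper, and the sample values below are hypothetical; they show only one possible way to note which questions are still open for a given source.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class InputProfile:
    """Hypothetical 5 V's profile for a single data source."""

    name: str
    volume: str    # how much data, e.g. size per day
    velocity: str  # how fast it arrives, e.g. batch or streaming
    variety: str   # formats and structure
    veracity: str  # known quality or trust issues
    value: str     # why the business needs it
    access_pattern: str = "unknown"  # existing or anticipated usage


def unanswered(profile: InputProfile) -> list[str]:
    """Return the fields that are blank or still marked as unknown."""
    return [
        name
        for name, answer in vars(profile).items()
        if not answer.strip() or answer == "unknown"
    ]


# Sample values are made up for illustration.
orders = InputProfile(
    name="orders database",
    volume="to be confirmed with the customer",
    velocity="batch",
    variety="relational tables",
    veracity="",
    value="feeds KPI dashboards",
)
print(unanswered(orders))
```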
Identify the deliverables required to meet the desired outcomes. For example, prepare and make data available for:
- Further analysis
- Business Intelligence (BI), visualizations, and dashboards
- Machine Learning (ML) and Artificial Intelligence (AI)
- Data exports, such as Excel or CSV-format flat files
- Hosted datasets for external or internal consumption
- Access via APIs
Capture existing analytics workflows using the four stages of Collect, Prepare, Store, and Analyze as a way to organize this part of the process:
- High- and low-level architecture, process flow, patterns, and external dependencies
- Analytics tools, including hardware; commercial, custom, and OSS software; libraries; modules; and source code
- DataOps, CI/CD, DevOps, and Infrastructure-as-Code (IaC) automation
- People, including Managed Service Providers (MSPs), roles, and required skills
- Partners, including consultants, vendors, and SaaS providers
- Overall effectiveness and customer satisfaction with existing analytics workflows
- Deficiencies with current workflows
Use checklists for each stage to ensure all potential features, functions, and capabilities of typical analytics workflows are considered and captured. For example, review a checklist of all possible ways data is typically collected to ensure nothing is missed.
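A checklist review of this kind can be sketched as a set difference between the items on the checklist and the items already discussed. The checklist below reuses the input types listed earlier in this post as a stand-in for a Collect-stage checklist; the `COLLECT_CHECKLIST` name, its contents, and the `unreviewed` helper are assumptions for illustration, not a complete or official list.

```python
from __future__ import annotations

# Illustrative Collect-stage checklist, assembled from the input types
# named earlier in this post. A real checklist would be more extensive.
COLLECT_CHECKLIST = {
    "batch sources",
    "streaming sources",
    "data feeds",
    "API endpoints",
    "public and licensed datasets",
}


def unreviewed(checklist: set[str], reviewed: set[str]) -> list[str]:
    """Return checklist items that have not been discussed yet, sorted."""
    return sorted(checklist - reviewed)


# Items already covered in a hypothetical discovery session.
reviewed_so_far = {"batch sources", "API endpoints"}
print(unreviewed(COLLECT_CHECKLIST, reviewed_so_far))
```

The same pattern could be repeated with one checklist per stage (Collect, Prepare, Store, and Analyze).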
Measures of Success
Identify how success is measured for both existing and new analytics workflows, and by whom, for example:
- Key Performance Indicators (KPIs)
- Service Level Agreements (SLAs)
- Customer Satisfaction Score (CSAT)
- Net Promoter Score (NPS)
- How measurements are determined, calculated, and weighted
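To make the last bullet concrete, the sketch below shows how two such measurements might be calculated. `net_promoter_score` follows the standard NPS definition (percentage of promoters, who rate 9 or 10, minus percentage of detractors, who rate 0 through 6, on a 0 to 10 scale). `weighted_score`, along with the measure names and weights in the example, is a hypothetical way to combine several normalized measures; the actual weighting is something to capture from the customer.

```python
from __future__ import annotations


def net_promoter_score(ratings: list[int]) -> float:
    """Standard NPS: percent promoters (9-10) minus percent detractors (0-6)."""
    if not ratings:
        raise ValueError("at least one rating is required")
    promoters = sum(1 for rating in ratings if rating >= 9)
    detractors = sum(1 for rating in ratings if rating <= 6)
    return 100.0 * (promoters - detractors) / len(ratings)


def weighted_score(measures: dict[str, float], weights: dict[str, float]) -> float:
    """Hypothetical composite: weighted average of normalized measures."""
    total_weight = sum(weights[name] for name in measures)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(measures[name] * weights[name] for name in measures) / total_weight


# Made-up survey ratings and weights, for illustration only.
print(net_promoter_score([10, 10, 9, 5]))
print(weighted_score({"sla": 0.9, "csat": 0.8}, {"sla": 3.0, "csat": 1.0}))
```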
Data Producers and Consumers
Capture the data producers and data consumers:
- Data producers
- Data consumers
The output of the data analytics discovery process is a clear and concise document that captures all elements described herein. Additionally, the document contains any customer-supplied artifacts, such as architectural and process flow diagrams. The document is thoroughly reviewed for accuracy and completeness by the customer. This document serves as a record of current data analytics workflows and a basis for architecting new workflows.
This blog represents my own viewpoints and not those of my employer, Amazon Web Services (AWS). All product names, logos, and brands are the property of their respective owners.