Brief Introduction to Observability on AWS
Posted by Gary A. Stafford in AWS, Cloud, DevOps on March 9, 2023
Explore the wide variety of Application Performance Monitoring (APM) and Observability options on AWS

APM and Observability
Observability is “the extent to which the internal states of a system can be inferred from externally available data” (Gartner). The three pillars of observability data are metrics, logs, and traces. Application Performance Monitoring (APM), a term commonly associated with observability, is “software that enables the observation and analysis of application health, performance, and user experience” (Gartner).
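To make the three pillars concrete, here is a minimal Python sketch of what a single request might emit: one metric, one log record, and one trace span. The class names, field names, and the `handle_request` helper are hypothetical and for illustration only; they are not taken from any AWS or vendor SDK. The point it tries to show is that a shared trace ID lets a log line and a span be correlated after the fact.

```python
import time
import uuid
from dataclasses import dataclass, field
from typing import Tuple


@dataclass
class Metric:
    """A numeric measurement sampled over time (e.g., request latency)."""
    name: str
    value: float
    unit: str
    timestamp: float = field(default_factory=time.time)


@dataclass
class LogRecord:
    """A timestamped, human-readable event emitted by the application."""
    level: str
    message: str
    trace_id: str
    timestamp: float = field(default_factory=time.time)


@dataclass
class Span:
    """One timed operation within the trace of a single request."""
    name: str
    trace_id: str
    start: float
    end: float

    @property
    def duration_ms(self) -> float:
        return (self.end - self.start) * 1000.0


def handle_request() -> Tuple[Metric, LogRecord, Span]:
    """Simulate one request and emit all three signal types for it."""
    # Hypothetical ID format; real tracing systems define their own.
    trace_id = uuid.uuid4().hex
    start = time.time()
    # ... application work would happen here ...
    end = time.time()
    span = Span("GET /orders", trace_id, start, end)
    metric = Metric("request_latency", span.duration_ms, "Milliseconds")
    log = LogRecord("INFO", "handled GET /orders", trace_id)
    return metric, log, span
```

In a real system, each signal would be shipped to a backend by an agent or SDK rather than returned from a function; the sketch only illustrates the shape of the data.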
Additional features often associated with APM and observability products and services include the following (in alphabetical order):
- Advanced Threat Protection (ATP)
- Endpoint Detection and Response (EDR)
- Incident Detection and Response (IDR)
- Infrastructure Performance Monitoring (IPM)
- Network Device Monitoring (NDM)
- Network Performance Monitoring (NPM)
- OpenTelemetry (OTel)
- Operational (or Operations) Intelligence Platform
- Predictive Monitoring (predictive analytics / predictive modeling)
- Real User Monitoring (RUM)
- Security Information and Event Management (SIEM)
- Security Orchestration, Automation, and Response (SOAR)
- Synthetic Monitoring (directed monitoring / synthetic testing)
- Threat Visibility and Risk Scoring
- Unified Security and Observability Platform
- User Behavior Analytics (UBA)
Not all features are offered by all vendors. Most vendors tend to specialize in one or more areas. Determining which features are essential to your organization before choosing a solution is vital.
AI/ML
Given the growing volume and real-time nature of observability telemetry, many vendors have started incorporating AI and ML into their products and services to improve correlation, anomaly detection, and mitigation capabilities. When evaluating vendors, understand how these features can reduce operational burden, enhance insights, and simplify complexity.
Decision Factors
APM and observability tooling choices often come down to a “Build vs. Buy” decision for organizations. In the Cloud, this usually means integrating several individual purpose-built products and services, running self-managed open-source projects, or investing in an end-to-end APM or unified observability platform. Other decision factors include the need for solutions to support:
- Hybrid cloud environments (on-premises/Cloud)
- Multi-cloud environments (Public Cloud, SaaS, Supercloud)
- Specialized workloads (e.g., Mainframes, HPC, VMware, SAP, SAS)
- Compliant workloads (e.g., PCI DSS, PII, GDPR, FedRAMP)
- Edge Computing and IoT/IIoT
- AI, ML, and Data Analytics monitoring (AIOps, MLOps, DataOps)
- SaaS observability (SaaS providers who offer monitoring to their end-users as part of their service offering)
- Custom log formats and protocols
Finally, the 5 V’s of big data (Velocity, Volume, Value, Variety, and Veracity) also influence the choice of APM and observability tooling. The real-time nature of observability data, its sheer volume, its source and type, and its sensitivity will all guide tooling choices based on features and cost.
Organizations can choose fully-managed native AWS services, AWS Partner products and services, often SaaS, self-managed open-source observability tooling, or a combination of options. Many AWS and Partner products and services are commercial versions of popular open-source software (COSS).
AWS Options
- Amazon CloudWatch: Observe and monitor resources and applications you run on AWS in real time. CloudWatch features include Container Insights, Lambda Insights, Contributor Insights, and Application Insights.
- Amazon Managed Service for Prometheus (AMP): Prometheus-compatible service that monitors and provides alerts on containerized applications and infrastructure at scale. Prometheus is “an open-source systems monitoring and alerting toolkit originally built at SoundCloud.”
- Amazon Managed Grafana (AMG): Commonly paired with AMP, AMG is a fully managed service for Grafana, an open-source analytics platform that “enables you to query, visualize, alert on, and explore your metrics, logs, and traces wherever they are stored.”
- Amazon OpenSearch Service: Based on open-source OpenSearch, this service provides interactive log analytics and real-time application monitoring and includes OpenSearch Dashboards (comparable to Kibana). Recently announced security analytics features provide threat monitoring, detection, and alerting capabilities.
- AWS X-Ray: Trace user requests through your application, viewed using the X-Ray service map console or integrated with CloudWatch using the CloudWatch console’s X-Ray Traces Service map.
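As a small illustration of how X-Ray ties the segments of one request together, the sketch below builds and parses a trace ID in the format described in the X-Ray documentation as I understand it: a version number (`1`), the request's start time as 8 hexadecimal digits of Unix epoch seconds, and a 96-bit random identifier as 24 hexadecimal digits. The function names are hypothetical; in practice, the X-Ray SDK or ADOT generates these IDs for you.

```python
import os
import re
import time

# Format: 1-<8 hex digits of epoch seconds>-<24 hex digits of random data>
TRACE_ID_PATTERN = re.compile(r"^1-[0-9a-f]{8}-[0-9a-f]{24}$")


def new_xray_trace_id(now=None):
    """Build an X-Ray-style trace ID (hypothetical helper, not an SDK call)."""
    epoch = int(time.time() if now is None else now)
    random_part = os.urandom(12).hex()  # 12 bytes = 96 bits = 24 hex digits
    return f"1-{epoch:08x}-{random_part}"


def trace_id_epoch(trace_id):
    """Extract the request start time (epoch seconds) from a trace ID."""
    if not TRACE_ID_PATTERN.match(trace_id):
        raise ValueError(f"not an X-Ray-style trace ID: {trace_id!r}")
    return int(trace_id.split("-")[1], 16)
```

For example, `new_xray_trace_id(now=1480615200)` yields an ID beginning with `1-58406520-`, and `trace_id_epoch` recovers `1480615200` from it. Embedding the timestamp in the ID is what allows a tracing backend to reject or expire traces by age without looking up any other data.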
Data Collection, Processing, and Forwarding
- AWS Distro for OpenTelemetry (ADOT): Open-source APIs, libraries, and agents to collect distributed traces and metrics for application monitoring. ADOT is an open-source distribution of OpenTelemetry, a “high-quality, ubiquitous, and portable telemetry [solution] to enable effective observability.”
- AWS for Fluent Bit: Fluent Bit image with plugins for both CloudWatch Logs and Kinesis Data Firehose. Fluent Bit is an open-source, “super fast, lightweight, and highly scalable logging and metrics processor and forwarder.”
Security-focused Monitoring
- AWS CloudTrail: Helps enable operational and risk auditing, governance, and compliance in your AWS environment. CloudTrail records events, including actions taken in the AWS Management Console, AWS Command Line Interface (CLI), AWS SDKs, and APIs.
Partner Options
According to the 2022 Gartner® Magic Quadrant™ for APM and Observability report, leading vendors commonly used by AWS customers include the following (in alphabetical order):
- AppDynamics (Cisco)
- Datadog
- Dynatrace
- Elastic
- Honeycomb
- Instana (IBM)
- Logz.io
- New Relic
- Splunk
- Sumo Logic
Open Source Options
There are countless open-source observability projects to choose from, including the following (in alphabetical order):
- Elastic Stack: Elasticsearch, Kibana, Beats, and Logstash
- Fluentd: Data collector for unified logging layer
- Grafana: Platform for monitoring and observability
- Jaeger: End-to-end distributed tracing
- Kiali: Configure, visualize, validate, and troubleshoot Istio Service Mesh
- Loki: Like Prometheus, but for logs
- OpenSearch: Scalable, flexible, and extensible software suite for search, analytics, and observability applications
- OpenTelemetry (OTel): Collection of tools, APIs, and SDKs used to instrument, generate, collect, and export telemetry data
- Prometheus: Monitoring system and time series database
- Zabbix: Single pane of glass view of your whole IT infrastructure stack
This blog represents my viewpoints and not those of my employer, Amazon Web Services (AWS). All product names, logos, and brands are the property of their respective owners.