Posts Tagged Grafana
IoT Data Analytics at the Edge: Exploring the convergence of IoT, Data Analytics, and Edge Computing with Grafana, Mosquitto, and TimescaleDB on ARM-based devices
Posted by Gary A. Stafford in Analytics, AWS, Cloud, IoT, Python, Raspberry Pi, Software Development on April 15, 2021
This post is a revised version of an earlier post, featuring major version updates of TimescaleDB (v1.7.4-pg12 to v2.0.0-pg12), Grafana (v7.1.5 to v7.5.2), and Mosquitto (v1.6.12 to v2.0.9). All source code and SQL scripts are revised. Note that TimeScaleDB has a current limitation/bug with Docker on ARM later than v2.0.0.

The Edge
Edge computing is a fast-growing technology trend, which involves pushing compute capabilities to a network’s edge. Wikipedia describes edge computing as a distributed computing paradigm that brings computation and data storage closer to the location needed to improve response times and save bandwidth. The term edge commonly refers to a compute node at the edge of a network (edge device), sitting close to the data source and between that data source and external system such as the Cloud. In his recent post, 3 Advantages (And 1 Disadvantage) Of Edge Computing, well-known futurist Bernard Marr argues reduced bandwidth requirements, reduced latency, and enhanced security and privacy as three primary advantages of edge computing.
David Ricketts, Head of Marketing at Quiss Technology PLC, in his post, Cloud and Edge Computing — The Stats You Need to Know for 2018, estimates that the global edge computing market is expected to reach USD 6.72 billion by 2022 at a compound annual growth rate of a whopping 35.4 percent. Realizing the market potential, many major Cloud providers, edge device manufacturers, and integrators are rapidly expanding their edge compute capabilities. AWS, for example, currently offers more than a dozen services in their edge computing category.
Internet of Things
Edge computing is frequently associated with the Internet of Things (IoT). IoT devices, industrial equipment, and sensors generate data transmitted to other internal and external systems, often by way of edge nodes, such as an IoT Gateway. IoT devices typically generate time-series data. According to Wikipedia, a time series is a set of data points indexed in time order — a sequence taken at successive equally spaced points in time. IoT devices typically generate continuous high-volume streams of time-series data, often on a scale of millions of data points per second. IoT data characteristics require IoT platforms to minimally support temporal accuracy, high-volume ingestion and processing, efficient data compression and downsampling, and real-time querying capabilities.
Edge devices such as IoT Gateways, which aggregate and transmit IoT data from these devices to external systems, are generally lower-powered, with limited processors, memory, and storage. Accordingly, IoT platforms must satisfy all the requirements of IoT data while simultaneously supporting resource-constrained environments.
IoT Analytics at the Edge
Leading Cloud providers AWS, Azure, Google Cloud, IBM Cloud, Oracle Cloud, and Alibaba Cloud all offer IoT services. Many offer IoT services with edge computing capabilities. AWS offers AWS IoT Greengrass. Greengrass provides local compute, messaging, data management, sync, and machine learning (ML) inference capabilities to edge devices. Azure offers Azure IoT Edge. Azure IoT Edge provides the ability to run artificial intelligence (AI), Azure and third-party services, and custom business logic on edge devices using standard containers. Google Cloud offers Edge TPU. Edge TPU (Tensor Processing Unit) is Google’s purpose-built application-specific integrated circuit (ASIC), designed to run AI at the edge.
IoT Analytics
Many Cloud providers also offer IoT analytics as part of their suite of IoT services, although not at the edge. AWS offers AWS IoT Analytics, while Azure has Azure Time Series Insights. Google provides IoT analytics, indirectly, through downstream analytic systems and ad hoc analysis using Google BigQuery or advanced analytics and machine learning with Cloud Machine Learning Engine. These services generally all require data to be transmitted to the Cloud for analytics.

The ability to analyze real-time, streaming IoT data at the edge is critical to a rapid feedback loop. IoT edge analytics can accelerate anomaly detection and remediation, improve predictive maintenance capabilities, and expedite proactive inventory replenishment.
IoT Edge Analytics Stack
In my opinion, the ideal IoT edge analytics stack is comprised of lightweight, purpose-built, easily deployable and manageable, platform- and programming language-agnostic, open-source software components. The minimal IoT edge analytics stack should include:
- Lightweight message broker;
- Purpose-built time-series database;
- ANSI-standard SQL ad-hoc query engine;
- Data visualization tool;
- Simple deployment and management framework;
Each component should be purpose-built for IoT.
Lightweight Message Broker
We will use Eclipse Mosquitto as our message broker. According to the project’s description, Mosquitto is an open-source message broker that implements the Message Queuing Telemetry Transport (MQTT) protocol versions 5.0, 3.1.1, and 3.1. Mosquitto is lightweight and suitable for use on all devices, from low-power single-board computers (SBCs) to full-powered servers.
MQTT Client Library
We will interact with Mosquitto using Eclipse Paho. According to the project, the Eclipse Paho project provides open-source, mainly client-side implementations of MQTT and MQTT-SN in a variety of programming languages. MQTT and MQTT for Sensor Networks (MQTT-SN) are light-weight publish/subscribe messaging transports for TCP/IP and connectionless protocols, such as UDP, respectively.
We will be using Paho’s Python Client. The Paho Python Client provides a client class with support for both MQTT v3.1 and v3.1.1 on Python 2.7 or 3.x. The client also provides helper functions to make publishing messages to an MQTT server straightforward.
Time-Series Database
Time-series databases are optimal for storing IoT data. According to InfluxData, makers of a leading time-series database, InfluxDB, a time-series database (TSDB), is a database optimized for time-stamped or time-series data. Time series data are simply measurements or events that are tracked, monitored, downsampled, and aggregated over time. Jiao Xian of Alibaba Cloud has authored an insightful post on the time-series database ecosystem, What Are Time Series Databases? A few leading Cloud providers offer purpose-built time-series databases, though they are not available at the edge. AWS offers Amazon Timestream, and Alibaba Cloud offers Time Series Database.
InfluxDB is an excellent choice for a time-series database. It was my first choice, along with TimescaleDB, when developing this stack. However, InfluxDB Flux’s apparent incompatibilities with some ARM-based architecture ruled it out for inclusion in the stack for this particular post.
We will use TimescaleDB as our time-series database. TimescaleDB is the leading open-source relational database for time-series data. Described as ‘PostgreSQL for time-series,’ TimescaleDB is based on PostgreSQL, which provides full ANSI SQL, rock-solid reliability, and a massive ecosystem. TimescaleDB claims to achieve 10–100x faster queries than PostgreSQL, InfluxDB, and MongoDB, with native optimizations for time-series analytics.
TimescaleDB is designed for performing analytical queries, both through its native support for PostgreSQL’s full range of SQL functionality and additional functions native to TimescaleDB. These time-series optimized functions include Median/Percentile, Cumulative Sum, Moving Average, Increase, Rate, Delta, Time Bucket, Histogram, and Gap Filling.
Ad-hoc Data Query Engine
We have the option of using psql
, the terminal-based front-end to PostgreSQL, to execute ad-hoc queries against TimescaleDB. The psql
front-end enables you to enter queries interactively, issue them to PostgreSQL, and see the query results.

We also have the option of using pgAdmin, specifically the biarms/pgadmin4 Docker version, to execute ad-hoc queries and perform most other database tasks. pgAdmin is the most popular open-source administration and development platform for PostgreSQL. While several popular Docker versions of pgAdmin only support Linux AMD64 architectures, the biarms/pgadmin4 Docker version supports ARM-based devices.


Data Visualization
For data visualization, we will use Grafana. Grafana allows you to query, visualize, alert on, and understand metrics no matter where they are stored. With Grafana, you can create, explore, and share dashboards, fostering a data-driven culture. Grafana allows you to define thresholds visually and get notified via Slack, PagerDuty, and more. Grafana supports dozens of data sources, including MySQL, PostgreSQL, Elasticsearch, InfluxDB, TimescaleDB, Graphite, Prometheus, Google BigQuery, GraphQL, and Oracle. Grafana is extensible through a large collection of plugins.

Edge Deployment and Management Platform
Docker introduced the current industry standard for containers in 2013. Docker containers are a standardized unit of software that allows developers to isolate apps from their environment. We will use Docker to deploy the IoT edge analytics stack, referred to herein as the GTM Stack, composed of containerized versions of Grafana, TimescaleDB, Eclipse Mosquitto, and pgAdmin, to an ARMv7-based edge node. The acronym, GTM, comes from the three primary OSS projects composing the stack. The abbreviation also suggests Greenwich Mean Time, relating to the precise time-series nature of IoT data.

Running Docker Engine in swarm mode, we can use Docker to deploy the complete IoT edge analytics stack to the swarm, running on the edge node. The deploy command accepts a stack description in the form of a Docker Compose file, a YAML file used to configure the application’s services. With a single command, we can create and start all the services from the configuration file.
Source Code
All source code for this post is available on GitHub. Use the following command to git clone
a local copy of the project. Note that the updated version of the source code for this post is in the v2021–03
branch.
git clone --branch v2021-03 --single-branch --depth 1 \
https://github.com/garystafford/iot-analytics-at-the-edge.git
IoT Devices
For this post, I have deployed three Linux ARM-based IoT devices, each connected to a sensor array. Each sensor array contains multiple analog and digital sensors. The sensors record temperature, humidity, air quality (liquefied petroleum gas (LPG), carbon monoxide (CO), and smoke), light, and motion. For more information on the IoT device and sensor hardware involved, please see my previous post.Getting Started with IoT Analytics on AWS
Analyze environmental sensor data from IoT devices in near real-time with AWS IoT Analyticstowardsdatascience.com
Each ARM-based IoT device runs a small Python3-based script, sensor_data_to_mosquitto.py, shown below.
The IoT devices’ script implements the Eclipse Paho MQTT Python client library. An MQTT message containing simultaneous readings from each sensor is sent to a Mosquitto topic on the edge node at a configurable frequency.
Below are the actual sensor readings sent by the IoT device as an MQTT message to the Mosquitto topic.
IoT Edge Node
For this post, I have deployed a single Linux ARM-based edge node. The three IoT devices containing sensor arrays communicate with the edge node over Wi-Fi. IoT devices could easily use an alternative communication protocol, such as BLE, LoRaWAN, or Ethernet. For more information on BLE and LoRaWAN, please see some of my previous posts:LoRa and LoRaWAN for IoT: Getting Started with LoRa and LoRaWAN Protocols for Low Power, Wide Area Networking of IoT and BLE and GATT for IoT: Getting Started with Bluetooth Low Energy (BLE) and Generic Attribute Profile (GATT) Specification for IoT.
The edge node also runs a small Python3-based script, mosquitto_to_timescaledb.py, shown below.
Like the IoT devices, the edge node’s script implements the Eclipse Paho MQTT Python client library. The script pulls MQTT messages off a Mosquitto topic(s), serializes the message payload to JSON, and writes the payload’s data to the TimescaleDB database. The edge node’s script accepts several arguments, which allow you to configure the necessary Mosquitto and TimescaleDB connection settings.
Why not use Telegraf?
Telegraf is a plugin-driven agent that collects, processes, aggregates, and writes metrics. There is a Telegraf output plugin, the PostgreSQL and TimescaleDB Output Plugin for Telegraf, produced by TimescaleDB. The plugin can replace the need to manage and maintain the above script. However, I chose not to use it because it is not yet an official Telegraf plugin. If the plugin was included in a Telegraf release, I would certainly encourage its use.
Script Management
Both Linux-based IoT devices and edge nodes run systemd
system and service manager. To ensure the Python scripts keep running in the case of a system restart, we define a systemd
unit. Units are objects that systemd
knows how to manage. This is a standardized representation of system resources that can be managed by the suite of daemons and manipulated by the provided utilities. Each script has a systemd
unit file. Below, we see the gtm_stack_mosquitto
unit file, gtm_stack_mosquitto.service.
The gtm_stack_mosq_to_tmscl
unit file, gtm_stack_mosq_to_tmscl.service, is nearly identical.
To install the gtm_stack_mosquitto.service
systemd
unit file on each IoT device, use the following commands:
Installing the gtm_stack_mosq_to_tmscl.service
unit file on the edge node is nearly identical.
Docker Stack
The edge node runs the GTM Docker stack, stack.yml, in a swarm. As discussed earlier, the stack contains four containers: Eclipse Mosquitto, TimescaleDB, Grafana, and pgAdmin. The Mosquitto, TimescaleDB, and Grafana containers have paths within the containers bind-mounted to directories on the edge device. With bind-mounting, the container’s configuration and data will persist if the containers are removed and re-created. The containers are running on an isolated overlay network.
The GTM Docker stack is installed using the following commands on the edge node. We will assume Docker and git are pre-installed on the edge node for this post.
First, we will create several local directories on the edge device, which will be used to bind-mount to the Docker container’s directories. Below, we see the bind-mounted local directories with the eventual container’s contents stored within them.

Next, we copy the custom Mosquitto configuration file, mosquitto.conf, included in the project to the edge device’s correct location.
Lastly, we initialize the Docker swarm and deploy the stack.

docker service ls'
command, showing the running GTM Stack containersTimescaleDB Setup
With the GTM stack running, we need to create a single Timescale hypertable, sensor_data
, in the TimescaleDB demo_iot
database to hold the incoming IoT sensor data. Hypertables, according to TimescaleDB, are designed to be easy to manage and to behave like standard PostgreSQL tables. Hypertables are comprised of many interlinked “chunk” tables. Commands made to the hypertable automatically propagate changes down to all of the chunks belonging to that hypertable.
I suggest using psql
to execute the required DDL statements, which will create the hypertable and the proceeding views and database user permissions. All SQL statements are included in the project’s statements.sql file. One way to use psql
is to install it on your local workstation, then use psql
to connect to the remote edge node. I prefer to instantiate a local PostgreSQL Docker container instance running psql
. I then use the local container’s psql
client to connect to the edge node’s TimescaleDB database. For example, from my local machine, I run the following docker run
command to connect to the edge node’s TimescaleDB database on the edge node, located locally at 192.168.1.12
.
Although not as practical, you can also access psql
from within the TimescaleDB Docker container, running on the actual edge node, using the following docker exec
command.
TimescaleDB Continuous Aggregates
For this post’s demonstration, we will create four TimescaleDB materialized views, which will be queried from a Grafana Dashboard. The materialized views are TimescaleDB Continuous Aggregates. According to Timescale, aggregate queries which touch large swathes of time-series data can take a long time to compute because the system needs to scan large amounts of data on every query execution. To make these queries faster, a continuous aggregate allows materializing the computed aggregates, while also providing means to continuously, and with low overhead, keep them up-to-date as the underlying source data changes.
For example, we generate sensor data every five seconds from the three IoT devices in this post. When visualizing a 24-hour period in Grafana, using continuous aggregates with an interval of one minute, we would reduce the total volume of data queried from 51,840 rows to 4,320 rows, a reduction of over 91%. The larger the time period or the number of IoT devices being analyzed, the more significant these savings will positively impact query performance.
A time_bucket
on the time partitioning column of the hypertable is required for all continuous aggregate views. The time_bucket
function, in this case, has a bucket width (interval) of 1 minute. The interval is configurable.
To automatically refresh the four materialized views, we will create four corresponding continuous aggregate policies. In this demonstration, the continuous aggregate policies create a refresh window between one week ago and one hour ago, with a refresh interval of one hour.
Advanced Analytic Queries
The ability to perform ad-hoc queries on time-series IoT data is an essential feature of the IoT edge analytics stack. We can use psql
, pgAdmin, or even our own IDE to perform ad-hoc queries against the TimescaleDB database on the edge node. Below are examples of typical ad-hoc queries a data analyst might perform on IoT sensor data. These example queries demonstrate TimescaleDB’s advanced analytical capabilities for working with time-series data, including Moving Average, Delta, Time Bucket, and Histogram.
Data Visualization with Grafana
Using the TimescaleDB continuous aggregates we have created, we can quickly build a richly featured dashboard in Grafana. Below we see a typical IoT Dashboard you might build to monitor the post’s IoT sensor data in near real-time. An exported version, dashboard_external_export.json, is included in the GitHub project.


Limiting Grafana’s Access to IoT Data
Following the Grafana recommendation for database user permissions, we create a grafanareader
PostgresSQL user, and limit the user’s access to the sensor_data
table and the four views we created. Grafana will use this user’s credentials to perform SELECT
queries of the TimescaleDB demo_iot
database.
Using PostgreSQL in Grafana
Grafana’s documentation includes a comprehensive set of instructions for Using PostgreSQL in Grafana. To connect to the TimescaleDB database from Grafana, we use the PostgreSQL data source plugin.

The data displayed in each Panel in the Grafana Dashboard is based on a SQL query. For example, the Average Temperature Panel might use a query similar to the example below. This particular query also converts Celsius to Fahrenheit. Note the use of Grafana Macros (e.g., $__time()
, $__timeFilter()
). Macros can be used within a query to simplify syntax and allow for dynamic parts.
SELECT
$__time(bucket),
device_id AS metric,
((avg_temp * 1.9) + 32) AS avg_temp
FROM temperature_humidity_summary_minute
WHERE
$__timeFilter(bucket)
ORDER BY 1,2
Below, we see another example from the Average Humidity Panel. In this particular query, we might choose to limit the humidity data to a valid range of 0%–100%.
SELECT
$__time(bucket),
device_id AS metric,
avg_humidity
FROM temperature_humidity_summary_minute
WHERE
$__timeFilter(bucket)
AND avg_humidity >= 0.0
AND avg_humidity <= 100.0
ORDER BY 1,2
Mobile Friendly
Grafana dashboards are mobile-friendly. Below we see two views of the dashboard, using the Chrome mobile browser on an Apple iPhone.

Grafana Alerts
Grafana allows Alerts to be created based on the Rules you define in each Panel. If data values match the Rule’s conditions, which you pre-define, such as a temperature reading above a certain threshold for a set amount of time, an alert is sent to your choice of destinations. According to the Rule shown below, If the average temperature exceeds 75°F for a period of 5 minutes, an alert is sent.

As demonstrated below, when the laboratory temperature began to exceed 75°F, the alert entered a ‘Pending’ state. If the temperature exceeded 75°F for the pre-determined period of 5 minutes, the alert status changes to ‘Alerting’, and an alert is sent. When the temperature dropped back below 75°F for the pre-determined period of 5 minutes, the alert status changed from ‘Alerting’ to ‘OK’, and a subsequent notification was sent.

There are currently twenty alert notifiers available out-of-the-box with Grafana, including Slack, email, PagerDuty, webhooks, VictorOps, Opsgenie, and Microsoft Teams. We can use Grafana Alerts to notify the proper resources, in near real-time, if an issue is detected based on the data. Below, we see an actual series of high-temperature alerts sent by Grafana to the Slack channel, followed by subsequent notifications as the temperature returned to normal.

Conclusion
This post explored the development of an IoT edge analytics stack comprised of lightweight, purpose-built, easily deployable and manageable platform- and programming language-agnostic, open-source software components. These components included Docker containerized versions of Grafana, TimescaleDB, Eclipse Mosquitto, and pgAdmin, referred to as the GTM Stack. Using the GTM stack, we collected, analyzed, and visualized IoT data without first shipping the data to Cloud or other external systems.
This blog represents my own viewpoints and not of my employer, Amazon Web Services (AWS). All product names, logos, and brands are the property of their respective owners.
GTM Stack: Exploring IoT Data Analytics at the Edge with Grafana, Mosquitto, and TimescaleDB on ARM-based Architectures
Posted by Gary A. Stafford in Big Data, IoT, Python, Raspberry Pi, Software Development on October 12, 2020
In the following post, we will explore the integration of several open-source software applications to build an IoT edge analytics stack, designed to operate on ARM-based edge nodes. We will use the stack to collect, analyze, and visualize IoT data without first shipping the data to the Cloud or other external systems.

The Edge
Edge computing is a fast-growing technology trend, which involves pushing compute capabilities to the edge. Wikipedia describes edge computing as a distributed computing paradigm that brings computation and data storage closer to the location needed to improve response times and save bandwidth. The term edge commonly refers to a compute node at the edge of a network (edge device), sitting in close proximity to the source a data and between that data source and external system such as the Cloud.
In his recent post, 3 Advantages (And 1 Disadvantage) Of Edge Computing, well-known futurist Bernard Marr argues reduced bandwidth requirements, reduced latency, and enhanced security and privacy as three primary advantages of edge computing. Due to techniques like data downsampling, Marr advises one potential disadvantage of edge computing is that important data could end up being overlooked and discarded in the quest to save bandwidth and reduce latency.
David Ricketts, Head of Marketing at Quiss Technology PLC, estimates in his post, Cloud and Edge Computing — The Stats You Need to Know for 2018, the global edge computing market is expected to reach USD 6.72 billion by 2022 at a compound annual growth rate of a whopping 35.4 percent. Realizing the market potential, many major Cloud providers, edge device manufacturers, and integrators are rapidly expanding their edge compute capabilities. AWS, for example, currently offers more than a dozen services in the edge computing category.
Internet of Things
Edge computing is frequently associated with the Internet of Things (IoT). IoT devices, industrial equipment, and sensors generate data, which is transmitted to other internal and external systems, often by way of edge nodes, such as an IoT Gateway. IoT devices typically generate time-series data. According to Wikipedia, a time series is a set of data points indexed in time order — a sequence taken at successive equally spaced points in time. IoT devices typically generate continuous high-volume streams of time-series data, often on the scale of millions of data points per second. IoT data characteristics require IoT platforms to minimally support temporal accuracy, high-volume ingestion and processing, efficient data compression and downsampling, and real-time querying capabilities.
The IoT devices and the edge devices, such as IoT Gateways, which aggregate and transmit IoT data from these devices to external systems, are generally lower-powered, with limited processor, memory, and storage capabilities. Accordingly, IoT platforms must satisfy all the requirements of IoT data while simultaneously supporting resource-constrained environments.
IoT Analytics at the Edge
Leading Cloud providers AWS, Azure, Google Cloud, IBM Cloud, Oracle Cloud, and Alibaba Cloud all offer IoT services. Many offer IoT services with edge computing capabilities. AWS offers AWS IoT Greengrass. Greengrass provides local compute, messaging, data management, sync, and ML inference capabilities to edge devices. Azure offers Azure IoT Edge. Azure IoT Edge provides the ability to run AI, Azure and third-party services, and custom business logic on edge devices using standard containers. Google Cloud offers Edge TPU. Edge TPU (Tensor Processing Unit) is Google’s purpose-built application-specific integrated circuit (ASIC), designed to run AI at the edge.
IoT Analytics
Many Cloud providers also offer IoT analytics as part of their suite of IoT services, although not at the edge. AWS offers AWS IoT Analytics, while Azure has Azure Time Series Insights. Google provides IoT analytics, indirectly, through downstream analytic systems and ad hoc analysis using Google BigQuery or advanced analytics and machine learning with Cloud Machine Learning Engine. These services generally all require data to be transmitted to the Cloud for analytics.

The ability to analyze IoT data at the edge, as data is streamed in real-time, is critical to a rapid feedback loop. IoT edge analytics can accelerate anomaly detection, improve predictive maintenance capabilities, and expedite proactive inventory replenishment.
The IoT Edge Analytics Stack
In my opinion, the ideal IoT edge analytics stack is comprised of lightweight, purpose-built, easily deployable and manageable, platform- and programming language-agnostic, open-source software components. The minimal IoT edge analytics stack should include a lightweight message broker, a time-series database, an ANSI-standard ad-hoc query engine, and a data visualization tool. Each component should be purpose-built for IoT.
Lightweight Message Broker
We will use Eclipse Mosquitto as our message broker. According to the project’s description, Mosquitto is an open-source message broker that implements the Message Queuing Telemetry Transport (MQTT) protocol versions 5.0, 3.1.1, and 3.1. Mosquitto is lightweight and suitable for use on all devices from low power single board computers (SBCs) to full servers.
MQTT Client Library
We will interact with Mosquitto using Eclipse Paho. According to the project, the Eclipse Paho project provides open-source, mainly client-side implementations of MQTT and MQTT-SN in a variety of programming languages. MQTT and MQTT for Sensor Networks (MQTT-SN) are lightweight publish/subscribe messaging transports for TCP/IP and connectionless protocols, such as UDP, respectively.
We will be using Paho’s Python Client. The Paho Python Client provides a client class with support for both MQTT v3.1 and v3.1.1 on Python 2.7 or 3.x. The client also provides helper functions to make publishing messages to an MQTT server straightforward.
Time-Series Database
Time-series databases are optimal for storing IoT data. According to InfluxData, makers of a leading time-series database, InfluxDB, a time-series database (TSDB), is a database optimized for time-stamped or time-series data. Time series data are simply measurements or events that are tracked, monitored, downsampled, and aggregated over time. Jiao Xian, of Alibaba Cloud, has authored an insightful post on the time-series database ecosystem, What Are Time Series Databases? A few leading Cloud providers offer purpose-built time-series databases, though they are not available at the edge. AWS offers Amazon Timestream and Alibaba Cloud offers Time Series Database.
InfluxDB is an excellent choice for a time-series database. It was my first choice, along with TimescaleDB, when developing this stack. However, InfluxDB Flux’s apparent incompatibilities with some ARM-based architectures ruled it out for inclusion in the stack for this particular post.
We will use TimescaleDB as our time-series database. TimescaleDB is the leading open-source relational database for time-series data. Described as ‘PostgreSQL for time-series,’ TimescaleDB is based on PostgreSQL, which provides full ANSI SQL, rock-solid reliability, and a massive ecosystem. TimescaleDB claims to achieve 10–100x faster queries than PostgreSQL, InfluxDB, and MongoDB, with native optimizations for time-series analytics.
TimescaleDB claims to achieve 10–100x faster queries than PostgreSQL, InfluxDB, and MongoDB, with native optimizations for time-series analytics.
TimescaleDB is designed for performing analytical queries, both through its native support for PostgreSQL’s full range of SQL functionality, as well as additional functions native to TimescaleDB. These time-series optimized functions include Median/Percentile, Cumulative Sum, Moving Average, Increase, Rate, Delta, Time Bucket, Histogram, and Gap Filling.
Ad-hoc Data Query Engine
We have the option of using psql
, the terminal-based front-end to PostgreSQL, to execute ad-hoc queries against TimescaleDB. The psql
front-end enables you to enter queries interactively, issue them to PostgreSQL, and see the query results.

We also have the option of using pgAdmin, specifically the biarms/pgadmin4 Docker version, to execute ad-hoc queries and perform most other database tasks. pgAdmin is the most popular open-source administration and development platform for PostgreSQL. While several popular Docker versions of pgAdmin only support Linux AMD64 architectures, the biarms/pgadmin4 Docker version supports ARM-based devices.


Data Visualization
For data visualization, we will use Grafana. Grafana allows you to query, visualize, alert on, and understand metrics no matter where they are stored. With Grafana, you can create, explore, and share dashboards, fostering a data-driven culture. Grafana allows you to define thresholds visually and get notified via Slack, PagerDuty, and more. Grafana supports dozens of data sources, including MySQL, PostgreSQL, Elasticsearch, InfluxDB, TimescaleDB, Graphite, Prometheus, Google BigQuery, GraphQL, and Oracle. Grafana is extensible through a large collection of plugins.

Edge Deployment and Management Platform
Docker introduced the current industry standard for containers in 2013. Docker containers are a standardized unit of software that allows developers to isolate apps from their environment. We will use Docker to deploy the IoT edge analytics stack, referred to herein as the GTM Stack, composed of containerized versions of Eclipse Mosquitto, TimescaleDB, and Grafana, and pgAdmin, to an ARM-based edge node. The acronym, GTM, comes from the three primary OSS projects composing the stack. The acronym also suggests Greenwich Mean Time, relating to the precise time-series nature of IoT data.

Running Docker Engine in swarm mode, we can use Docker to deploy the complete IoT edge analytics stack to the swarm, running on the edge node. The deploy command accepts a stack description in the form of a Docker Compose file, a YAML file used to configure the application’s services. With a single command, we can create and start all the services from the configuration file.
Source Code
All source code for this post is available on GitHub. Use the following command to git clone
a local copy of the project.
IoT Devices
For this post, I have deployed three Linux ARM-based IoT devices, each connected to a sensor array. Each sensor array contains multiple analog and digital sensors. The sensors record temperature, humidity, air quality (liquefied petroleum gas (LPG), carbon monoxide (CO), and smoke), light, and motion. For more information on the IoT device and sensor hardware involved, please see my previous post: Getting Started with IoT Analytics on AWS.
Each ARM-based IoT device is running a small Python3-based script, sensor_data_to_mosquitto.py, shown below.
The IoT devices’ script implements the Eclipse Paho MQTT Python client library. An MQTT message containing simultaneous readings from each sensor is sent to a Mosquitto topic on the edge node, at a configurable frequency.
IoT Edge Node
For this post, I have deployed a single Linux ARM-based edge node. The three IoT devices, containing sensor arrays, communicate with the edge node over Wi-Fi. The IoT devices could easily use an alternative communication protocol, such as BLE, LoRaWAN, or Ethernet. For more information on BLE and LoRaWAN, please see some of my previous posts: LoRa and LoRaWAN for IoT: Getting Started with LoRa and LoRaWAN Protocols for Low Power, Wide Area Networking of IoT and BLE and GATT for IoT: Getting Started with Bluetooth Low Energy (BLE) and Generic Attribute Profile (GATT) Specification for IoT.
The edge node is also running a small Python3-based script, mosquitto_to_timescaledb.py, shown below.
Similar to the IoT devices, the edge node’s script implements the Eclipse Paho MQTT Python client library. The script pulls MQTT messages off a Mosquitto topic(s), serializes the message payload to JSON, and writes the payload’s data to the TimescaleDB database. The edge node’s script accepts several arguments, which allow you to configure necessary Mosquitto and TimescaleDB variables.
Why not use Telegraf?
Telegraf is a plugin-driven agent that collects, processes, aggregates, and writes metrics. There is a Telegraf output plugin, the PostgreSQL and TimescaleDB Output Plugin for Telegraf, produced by TimescaleDB. The plugin could replace the need to manage and maintain the above script. However, I chose not to use it because it is not yet an official Telegraf plugin. If the plugin was included in a Telegraf release, I would certainly encourage its use.
Script Management
Both the Linux-based IoT devices and the edge node run systemd
system and service manager. To ensure the Python scripts keep running in the case of a system restart, we define a systemd
unit. Units are the objects that systemd
knows how to manage. These are basically a standardized representation of system resources that can be managed by the suite of daemons and manipulated by the provided utilities. Each script has a systemd
unit files. Below, we see the gtm_stack_mosquitto
unit file, gtm_stack_mosquitto.service.
The gtm_stack_mosq_to_tmscl
unit file, gtm_stack_mosq_to_tmscl.service, is nearly identical.
To install the gtm_stack_mosquitto.service
systemd
unit file on each IoT device, use the following commands.
Installing the gtm_stack_mosq_to_tmscl.service
unit file on the edge node is nearly identical.
Docker Stack
The edge node runs the GTM Docker stack, stack.yml, in a swarm. As discussed earlier, the stack contains four containers: Eclipse Mosquitto, TimescaleDB, and Grafana, along with pgAdmin. The Mosquitto, TimescaleDB, and Grafana containers have paths within the containers, bind-mounted to directories on the edge device. With bind-mounting, the container’s data will persist if the containers are removed and re-created. The containers are running on their own isolated overlay network.
The GTM Docker stack is installed using the following commands on the edge node. We will assume Docker and git are pre-installed on the edge node for this post.
First, we create the proper local directories on the edge device, which will be used to bind-mount to the container’s directories. Below, we see the bind-mounted local directories with the eventual container’s contents stored within them.

Next, we copy the custom Mosquitto configuration file, mosquitto.conf, included in the project, to the correct location on the edge device. Lastly, we initialize the Docker swarm and deploy the stack.

docker container ls
command, showing the running GTM Stack containers
docker stats
command, showing the resource consumption of GTM Stack containersTimescaleDB Setup
With the GTM stack running, we need to create a single Timescale hypertable, sensor_data
, in the TimescaleDB demo_iot
database, to hold the incoming IoT sensor data. Hypertables, according to TimescaleDB, are designed to be easy to manage and to behave like standard PostgreSQL tables. Hypertables are comprised of many interlinked “chunk” tables. Commands made to the hypertable automatically propagate changes down to all of the chunks belonging to that hypertable.
I suggest using psql
to execute the required DDL statements, which will create the hypertable, as well as the proceeding views and database user permissions. All SQL statements are included in the project’s statements.sql file. One way to use psql
is to install it on your local workstation, then use psql
to connect to the remote edge node. I prefer to instantiate a local PostgreSQL Docker container instance running psql
. I then use the local container’s psql
client to connect to the edge node’s TimescaleDB database. For example, from my local machine, I run the following docker run
command to connect to the edge node’s TimescaleDB database, on the edge node, located locally at 192.168.1.12
.
Although not always practical, could also access psql
from within the TimescaleDB Docker container, running on the actual edge node, using the following docker exec
command.
TimescaleDB Continuous Aggregates
For this post’s demonstration, we need to create four TimescaleDB database views, which will be queried from an eventual Grafana Dashboard. The views are TimescaleDB Continuous Aggregates. According to Timescale, aggregate queries that touch large swathes of time-series data can take a long time to compute because the system needs to scan large amounts of data on every query execution. TimescaleDB continuous aggregates automatically calculate the results of a query in the background and materialize the results.
For example, in this post, we generate sensor data every five seconds from the three IoT devices. When visualizing a 24-hour period in Grafana, using continuous aggregates with an interval of one minute, we would reduce the total volume of data queried from ~51,840 rows to ~4,320 rows, a reduction of over 91%. The larger the time period or the number of IoT devices being analyzed, the more significant these savings will positively impact query performance.
A time_bucket
on the time partitioning column of the hypertable is required in all continuous aggregate views. The time_bucket
function, in this case, has a bucket width (interval) of 1 minute. The interval is configurable.
Limiting Grafana’s Access to IoT Data
Following the Grafana recommendation for database user permissions, we create a grafanareader
PostgresSQL user, and limit the user’s access to the sensor_data
table and the four views we created. Grafana will use this user’s credentials to perform SELECT
queries of the TimescaleDB demo_iot
database.
Grafana Dashboards
Using the TimescaleDB continuous aggregates we have created, we can quickly build a richly-featured dashboard in Grafana. Below we see a typical IoT Dashboard you might build to monitor the post’s IoT sensor data in near real-time. An exported version, dashboard_external_export.json, is included in the GitHub project.


Grafana’s documentation includes a comprehensive set of instructions for Using PostgreSQL in Grafana. To connect to the TimescaleDB database from Grafana, we use a PostgreSQL data source.

The data displayed in each Panel in the Grafana Dashboard is based on a SQL query. For example, the Average Temperature Panel might use a query similar to the example below. This particular query also converts Celcius to Fahrenheit. Note the use of Grafana Macros (e.g., $__time()
, $__timeFilter()
). Macros can be used within a query to simplify syntax and allow for dynamic parts.
SELECT
$__time(bucket),
device_id AS metric,
((avg_temp * 1.9) + 32) AS avg_temp
FROM temperature_humidity_summary_minute
WHERE
$__timeFilter(bucket)
ORDER BY 1,2
Below, we see another example from the Average Humidity Panel. In this particular query, we might choose to validate the humidity data is within an acceptable range of 0%–100%.
SELECT
$__time(bucket),
device_id AS metric,
avg_humidity
FROM temperature_humidity_summary_minute
WHERE
$__timeFilter(bucket)
AND avg_humidity >= 0.0
AND avg_humidity <= 100.0
ORDER BY 1,2
Mobile Friendly
Grafana dashboards are mobile-friendly. Below we see two views of the dashboard, using the Chrome mobile browser on an Apple iPhone.

Grafana Alerts
Grafana allows Alerts to be created based on Rules you define in each Panel. If data values match the Rule’s conditions, which you pre-define, such as a temperature reading above a certain threshold for a set amount of time, an alert is sent to your choice of destinations. According to the Rule shown below, If the average temperature exceeds 75°F for a period of 5 minutes, an alert is sent.

As demonstrated below, when the temperature in the laboratory began to exceed 75°F, the alert entered a ‘Pending’ state. If the temperature exceeded 75°F for the pre-determined period of 5 minutes, the alert status changed to ‘Alerting’, and an alert was sent. When the temperature dropped back below 75°F for the pre-determined period of 5 minutes, the alert status changed from ‘Alerting’ to ‘OK’, and a subsequent notification was sent.

There are currently 18 destinations available out-of-the-box with Grafana, including Slack, email, PagerDuty, webhooks, HipChat, and Microsoft Teams. We can use Grafana Alerts to notify the proper resources, in near real-time, if an issue is detected, based on the data. Below, we see an actual series of high-temperature alerts sent by Grafana to a Slack channel, followed by subsequent notifications as the temperature returned to normal.

Ad-hoc Queries
The ability to perform ad-hoc queries on the time-series IoT data is an essential feature of the IoT edge analytics stack. We can use psql
or pgAdmin to perform ad-hoc queries against the TimescaleDB database. Below are examples of typical ad-hoc queries we might perform on the IoT sensor data. These example queries demonstrate TimescaleDB’s advanced analytical capabilities for working with time-series data, including Moving Average, Delta, Time Bucket, and Histogram.
Conclusion
In this post, we have explored the development of an IoT edge analytics stack, comprised of lightweight, purpose-built, easily deployable and manageable, platform- and programming language-agnostic, open-source software components. These components included Docker containerized versions of Eclipse Mosquitto, TimescaleDB, Grafana, and pgAdmin, referred to as the GTM Stack. Using the GTM stack, we collected, analyzed, and visualized IoT data, without first shipping the data to Cloud or other external systems.
This blog represents my own viewpoints and not of my employer, Amazon Web Services. All product names, logos, and brands are the property of their respective owners.
Istio Observability with Go, gRPC, and Protocol Buffers-based Microservices on Google Kubernetes Engine (GKE)
Posted by Gary A. Stafford in Bash Scripting, Build Automation, Client-Side Development, Cloud, Continuous Delivery, DevOps, Enterprise Software Development, GCP, Go, JavaScript, Kubernetes, Software Development on April 17, 2019
In the last two posts, Kubernetes-based Microservice Observability with Istio Service Mesh and Azure Kubernetes Service (AKS) Observability with Istio Service Mesh, we explored the observability tools which are included with Istio Service Mesh. These tools currently include Prometheus and Grafana for metric collection, monitoring, and alerting, Jaeger for distributed tracing, and Kiali for Istio service-mesh-based microservice visualization and monitoring. Combined with cloud platform-native monitoring and logging services, such as Stackdriver on GCP, CloudWatch on AWS, Azure Monitor logs on Azure, and we have a complete observability solution for modern, distributed, Cloud-based applications.
In this post, we will examine the use of Istio’s observability tools to monitor Go-based microservices that use Protocol Buffers (aka Protobuf) over gRPC (gRPC Remote Procedure Calls) and HTTP/2 for client-server communications, as opposed to the more traditional, REST-based JSON (JavaScript Object Notation) over HTTP (Hypertext Transfer Protocol). We will see how Kubernetes, Istio, Envoy, and the observability tools work seamlessly with gRPC, just as they do with JSON over HTTP, on Google Kubernetes Engine (GKE).
Technologies
gRPC
According to the gRPC project, gRPC, a CNCF incubating project, is a modern, high-performance, open-source and universal remote procedure call (RPC) framework that can run anywhere. It enables client and server applications to communicate transparently and makes it easier to build connected systems. Google, the original developer of gRPC, has used the underlying technologies and concepts in gRPC for years. The current implementation is used in several Google cloud products and Google externally facing APIs. It is also being used by Square, Netflix, CoreOS, Docker, CockroachDB, Cisco, Juniper Networks and many other organizations.
Protocol Buffers
By default, gRPC uses Protocol Buffers. According to Google, Protocol Buffers (aka Protobuf) are a language- and platform-neutral, efficient, extensible, automated mechanism for serializing structured data for use in communications protocols, data storage, and more. Protocol Buffers are 3 to 10 times smaller and 20 to 100 times faster than XML. Once you have defined your messages, you run the protocol buffer compiler for your application’s language on your .proto
file to generate data access classes.
Protocol Buffers are 3 to 10 times smaller and 20 to 100 times faster than XML.
Protocol buffers currently support generated code in Java, Python, Objective-C, and C++, Dart, Go, Ruby, and C#. For this post, we have compiled for Go. You can read more about the binary wire format of Protobuf on Google’s Developers Portal.
Envoy Proxy
According to the Istio project, Istio uses an extended version of the Envoy proxy. Envoy is deployed as a sidecar to a relevant service in the same Kubernetes pod. Envoy, created by Lyft, is a high-performance proxy developed in C++ to mediate all inbound and outbound traffic for all services in the service mesh. Istio leverages Envoy’s many built-in features, including dynamic service discovery, load balancing, TLS termination, HTTP/2 and gRPC proxies, circuit-breakers, health checks, staged rollouts, fault injection, and rich metrics.
According to the post by Harvey Tuch of Google, Evolving a Protocol Buffer canonical API, Envoy proxy adopted Protocol Buffers, specifically proto3, as the canonical specification of for version 2 of Lyft’s gRPC-first API.
Reference Microservices Platform
In the last two posts, we explored Istio’s observability tools, using a RESTful microservices-based API platform written in Go and using JSON over HTTP for service to service communications. The API platform was comprised of eight Go-based microservices and one sample Angular 7, TypeScript-based front-end web client. The various services are dependent on MongoDB, and RabbitMQ for event queue-based communications. Below, the is JSON over HTTP-based platform architecture.
Below, the current Angular 7-based web client interface.
Converting to gRPC and Protocol Buffers
For this post, I have modified the eight Go microservices to use gRPC and Protocol Buffers, Google’s data interchange format. Specifically, the services use version 3 release (aka proto3) of Protocol Buffers. With gRPC, a gRPC client calls a gRPC server. Some of the platform’s services are gRPC servers, others are gRPC clients, while some act as both client and server, such as Service A, B, and E. The revised architecture is shown below.
gRPC Gateway
Assuming for the sake of this demonstration, that most consumers of the API would still expect to communicate using a RESTful JSON over HTTP API, I have added a gRPC Gateway reverse proxy to the platform. The gRPC Gateway is a gRPC to JSON reverse proxy, a common architectural pattern, which proxies communications between the JSON over HTTP-based clients and the gRPC-based microservices. A diagram from the grpc-gateway GitHub project site effectively demonstrates how the reverse proxy works.
Image courtesy: https://github.com/grpc-ecosystem/grpc-gateway
In the revised platform architecture diagram above, note the addition of the reverse proxy, which replaces Service A at the edge of the API. The proxy sits between the Angular-based Web UI and Service A. Also, note the communication method between services is now Protobuf over gRPC instead of JSON over HTTP. The use of Envoy Proxy (via Istio) is unchanged, as is the MongoDB Atlas-based databases and CloudAMQP RabbitMQ-based queue, which are still external to the Kubernetes cluster.
Alternatives to gRPC Gateway
As an alternative to the gRPC Gateway reverse proxy, we could convert the TypeScript-based Angular UI client to gRPC and Protocol Buffers, and continue to communicate directly with Service A as the edge service. However, this would limit other consumers of the API to rely on gRPC as opposed to JSON over HTTP, unless we also chose to expose two different endpoints, gRPC, and JSON over HTTP, another common pattern.
Demonstration
In this post’s demonstration, we will repeat the exact same installation process, outlined in the previous post, Kubernetes-based Microservice Observability with Istio Service Mesh. We will deploy the revised gRPC-based platform to GKE on GCP. You could just as easily follow Azure Kubernetes Service (AKS) Observability with Istio Service Mesh, and deploy the platform to AKS.
Source Code
All source code for this post is available on GitHub, contained in three projects. The Go-based microservices source code, all Kubernetes resources, and all deployment scripts are located in the k8s-istio-observe-backend project repository, in the new grpc
branch.
git clone \ --branch grpc --single-branch --depth 1 --no-tags \ https://github.com/garystafford/k8s-istio-observe-backend.git
The Angular-based web client source code is located in the k8s-istio-observe-frontend repository on the new grpc
branch. The source protocol buffers .proto
file and the generated code, using the protocol buffers compiler, is located in the new pb-greeting project repository. You do not need to clone either of these projects for this post’s demonstration.
All Docker images for the services, UI, and the reverse proxy are located on Docker Hub.
Code Changes
This post is not specifically about writing Go for gRPC and Protobuf. However, to better understand the observability requirements and capabilities of these technologies, compared to JSON over HTTP, it is helpful to review some of the source code.
Service A
First, compare the source code for Service A, shown below, to the original code in the previous post. The service’s code is almost completely re-written. I relied on several references for writing the code, including, Tracing gRPC with Istio, written by Neeraj Poddar of Aspen Mesh and Distributed Tracing Infrastructure with Jaeger on Kubernetes, by Masroor Hasan.
Specifically, note the following code changes to Service A:
- Import of the pb-greeting protobuf package;
- Local Greeting struct replaced with
pb.Greeting
struct; - All services are now hosted on port
50051
; - The HTTP server and all API resource handler functions are removed;
- Headers, used for distributed tracing with Jaeger, have moved from HTTP request object to metadata passed in the gRPC context object;
- Service A is coded as a gRPC server, which is called by the gRPC Gateway reverse proxy (gRPC client) via the
Greeting
function; - The primary
PingHandler
function, which returns the service’s Greeting, is replaced by the pb-greeting protobuf package’sGreeting
function; - Service A is coded as a gRPC client, calling both Service B and Service C using the
CallGrpcService
function; - CORS handling is offloaded to Istio;
- Logging methods are unchanged;
Source code for revised gRPC-based Service A (gist):
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
// author: Gary A. Stafford | |
// site: https://programmaticponderings.com | |
// license: MIT License | |
// purpose: Service A – gRPC/Protobuf | |
package main | |
import ( | |
"context" | |
"github.com/banzaicloud/logrus-runtime-formatter" | |
"github.com/google/uuid" | |
"github.com/grpc-ecosystem/go-grpc-middleware/tracing/opentracing" | |
ot "github.com/opentracing/opentracing-go" | |
log "github.com/sirupsen/logrus" | |
"google.golang.org/grpc" | |
"google.golang.org/grpc/metadata" | |
"net" | |
"os" | |
"time" | |
pb "github.com/garystafford/pb-greeting" | |
) | |
const ( | |
port = ":50051" | |
) | |
type greetingServiceServer struct { | |
} | |
var ( | |
greetings []*pb.Greeting | |
) | |
func (s *greetingServiceServer) Greeting(ctx context.Context, req *pb.GreetingRequest) (*pb.GreetingResponse, error) { | |
greetings = nil | |
tmpGreeting := pb.Greeting{ | |
Id: uuid.New().String(), | |
Service: "Service-A", | |
Message: "Hello, from Service-A!", | |
Created: time.Now().Local().String(), | |
} | |
greetings = append(greetings, &tmpGreeting) | |
CallGrpcService(ctx, "service-b:50051") | |
CallGrpcService(ctx, "service-c:50051") | |
return &pb.GreetingResponse{ | |
Greeting: greetings, | |
}, nil | |
} | |
func CallGrpcService(ctx context.Context, address string) { | |
conn, err := createGRPCConn(ctx, address) | |
if err != nil { | |
log.Fatalf("did not connect: %v", err) | |
} | |
defer conn.Close() | |
headersIn, _ := metadata.FromIncomingContext(ctx) | |
log.Infof("headersIn: %s", headersIn) | |
client := pb.NewGreetingServiceClient(conn) | |
ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second) | |
ctx = metadata.NewOutgoingContext(context.Background(), headersIn) | |
defer cancel() | |
req := pb.GreetingRequest{} | |
greeting, err := client.Greeting(ctx, &req) | |
log.Info(greeting.GetGreeting()) | |
if err != nil { | |
log.Fatalf("did not connect: %v", err) | |
} | |
for _, greeting := range greeting.GetGreeting() { | |
greetings = append(greetings, greeting) | |
} | |
} | |
func createGRPCConn(ctx context.Context, addr string) (*grpc.ClientConn, error) { | |
//https://aspenmesh.io/2018/04/tracing-grpc-with-istio/ | |
var opts []grpc.DialOption | |
opts = append(opts, grpc.WithStreamInterceptor( | |
grpc_opentracing.StreamClientInterceptor( | |
grpc_opentracing.WithTracer(ot.GlobalTracer())))) | |
opts = append(opts, grpc.WithUnaryInterceptor( | |
grpc_opentracing.UnaryClientInterceptor( | |
grpc_opentracing.WithTracer(ot.GlobalTracer())))) | |
opts = append(opts, grpc.WithInsecure()) | |
conn, err := grpc.DialContext(ctx, addr, opts…) | |
if err != nil { | |
log.Fatalf("Failed to connect to application addr: ", err) | |
return nil, err | |
} | |
return conn, nil | |
} | |
func getEnv(key, fallback string) string { | |
if value, ok := os.LookupEnv(key); ok { | |
return value | |
} | |
return fallback | |
} | |
func init() { | |
formatter := runtime.Formatter{ChildFormatter: &log.JSONFormatter{}} | |
formatter.Line = true | |
log.SetFormatter(&formatter) | |
log.SetOutput(os.Stdout) | |
level, err := log.ParseLevel(getEnv("LOG_LEVEL", "info")) | |
if err != nil { | |
log.Error(err) | |
} | |
log.SetLevel(level) | |
} | |
func main() { | |
lis, err := net.Listen("tcp", port) | |
if err != nil { | |
log.Fatalf("failed to listen: %v", err) | |
} | |
s := grpc.NewServer() | |
pb.RegisterGreetingServiceServer(s, &greetingServiceServer{}) | |
log.Fatal(s.Serve(lis)) | |
} |
Greeting Protocol Buffers
Shown below is the greeting source protocol buffers .proto
file. The greeting response struct, originally defined in the services, remains largely unchanged (gist). The UI client responses will look identical.
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
syntax = "proto3"; | |
package greeting; | |
import "google/api/annotations.proto"; | |
message Greeting { | |
string id = 1; | |
string service = 2; | |
string message = 3; | |
string created = 4; | |
} | |
message GreetingRequest { | |
} | |
message GreetingResponse { | |
repeated Greeting greeting = 1; | |
} | |
service GreetingService { | |
rpc Greeting (GreetingRequest) returns (GreetingResponse) { | |
option (google.api.http) = { | |
get: "/api/v1/greeting" | |
}; | |
} | |
} |
When compiled with protoc
, the Go-based protocol compiler plugin, the original 27 lines of source code swells to almost 270 lines of generated data access classes that are easier to use programmatically.
# Generate gRPC stub (.pb.go) protoc -I /usr/local/include -I. \ -I ${GOPATH}/src \ -I ${GOPATH}/src/github.com/grpc-ecosystem/grpc-gateway/third_party/googleapis \ --go_out=plugins=grpc:. \ greeting.proto # Generate reverse-proxy (.pb.gw.go) protoc -I /usr/local/include -I. \ -I ${GOPATH}/src \ -I ${GOPATH}/src/github.com/grpc-ecosystem/grpc-gateway/third_party/googleapis \ --grpc-gateway_out=logtostderr=true:. \ greeting.proto # Generate swagger definitions (.swagger.json) protoc -I /usr/local/include -I. \ -I ${GOPATH}/src \ -I ${GOPATH}/src/github.com/grpc-ecosystem/grpc-gateway/third_party/googleapis \ --swagger_out=logtostderr=true:. \ greeting.proto
Below is a small snippet of that compiled code, for reference. The compiled code is included in the pb-greeting project on GitHub and imported into each microservice and the reverse proxy (gist). We also compile a separate version for the reverse proxy to implement.
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
// Code generated by protoc-gen-go. DO NOT EDIT. | |
// source: greeting.proto | |
package greeting | |
import ( | |
context "context" | |
fmt "fmt" | |
proto "github.com/golang/protobuf/proto" | |
_ "google.golang.org/genproto/googleapis/api/annotations" | |
grpc "google.golang.org/grpc" | |
codes "google.golang.org/grpc/codes" | |
status "google.golang.org/grpc/status" | |
math "math" | |
) | |
// Reference imports to suppress errors if they are not otherwise used. | |
var _ = proto.Marshal | |
var _ = fmt.Errorf | |
var _ = math.Inf | |
// This is a compile-time assertion to ensure that this generated file | |
// is compatible with the proto package it is being compiled against. | |
// A compilation error at this line likely means your copy of the | |
// proto package needs to be updated. | |
const _ = proto.ProtoPackageIsVersion3 // please upgrade the proto package | |
type Greeting struct { | |
Id string `protobuf:"bytes,1,opt,name=id,proto3" json:"id,omitempty"` | |
Service string `protobuf:"bytes,2,opt,name=service,proto3" json:"service,omitempty"` | |
Message string `protobuf:"bytes,3,opt,name=message,proto3" json:"message,omitempty"` | |
Created string `protobuf:"bytes,4,opt,name=created,proto3" json:"created,omitempty"` | |
XXX_NoUnkeyedLiteral struct{} `json:"-"` | |
XXX_unrecognized []byte `json:"-"` | |
XXX_sizecache int32 `json:"-"` | |
} | |
func (m *Greeting) Reset() { *m = Greeting{} } | |
func (m *Greeting) String() string { return proto.CompactTextString(m) } | |
func (*Greeting) ProtoMessage() {} | |
func (*Greeting) Descriptor() ([]byte, []int) { | |
return fileDescriptor_6acac03ccd168a87, []int{0} | |
} | |
func (m *Greeting) XXX_Unmarshal(b []byte) error { | |
return xxx_messageInfo_Greeting.Unmarshal(m, b) | |
} | |
func (m *Greeting) XXX_Marshal(b []byte, deterministic bool) ([]byte, error) { | |
return xxx_messageInfo_Greeting.Marshal(b, m, deterministic) |
Using Swagger, we can view the greeting protocol buffers’ single RESTful API resource, exposed with an HTTP GET method. I use the Docker-based version of Swagger UI for viewing protoc
generated swagger definitions.
docker run -p 8080:8080 -d --name swagger-ui \ -e SWAGGER_JSON=/tmp/greeting.swagger.json \ -v ${GOAPTH}/src/pb-greeting:/tmp swaggerapi/swagger-ui
The Angular UI makes an HTTP GET request to the /api/v1/greeting
resource, which is transformed to gRPC and proxied to Service A, where it is handled by the Greeting
function.
gRPC Gateway Reverse Proxy
As explained earlier, the gRPC Gateway reverse proxy service is completely new. Specifically, note the following code features in the gist below:
- Import of the pb-greeting protobuf package;
- The proxy is hosted on port
80
; - Request headers, used for distributed tracing with Jaeger, are collected from the incoming HTTP request and passed to Service A in the gRPC context;
- The proxy is coded as a gRPC client, which calls Service A;
- Logging is largely unchanged;
The source code for the Reverse Proxy (gist):
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
// author: Gary A. Stafford | |
// site: https://programmaticponderings.com | |
// license: MIT License | |
// purpose: gRPC Gateway / Reverse Proxy | |
// reference: https://github.com/grpc-ecosystem/grpc-gateway | |
package main | |
import ( | |
"context" | |
"flag" | |
lrf "github.com/banzaicloud/logrus-runtime-formatter" | |
gw "github.com/garystafford/pb-greeting" | |
"github.com/grpc-ecosystem/grpc-gateway/runtime" | |
log "github.com/sirupsen/logrus" | |
"google.golang.org/grpc" | |
"google.golang.org/grpc/metadata" | |
"net/http" | |
"os" | |
) | |
func injectHeadersIntoMetadata(ctx context.Context, req *http.Request) metadata.MD { | |
//https://aspenmesh.io/2018/04/tracing-grpc-with-istio/ | |
var ( | |
otHeaders = []string{ | |
"x-request-id", | |
"x-b3-traceid", | |
"x-b3-spanid", | |
"x-b3-parentspanid", | |
"x-b3-sampled", | |
"x-b3-flags", | |
"x-ot-span-context"} | |
) | |
var pairs []string | |
for _, h := range otHeaders { | |
if v := req.Header.Get(h); len(v) > 0 { | |
pairs = append(pairs, h, v) | |
} | |
} | |
return metadata.Pairs(pairs…) | |
} | |
type annotator func(context.Context, *http.Request) metadata.MD | |
func chainGrpcAnnotators(annotators …annotator) annotator { | |
return func(c context.Context, r *http.Request) metadata.MD { | |
var mds []metadata.MD | |
for _, a := range annotators { | |
mds = append(mds, a(c, r)) | |
} | |
return metadata.Join(mds…) | |
} | |
} | |
func run() error { | |
ctx := context.Background() | |
ctx, cancel := context.WithCancel(ctx) | |
defer cancel() | |
annotators := []annotator{injectHeadersIntoMetadata} | |
mux := runtime.NewServeMux( | |
runtime.WithMetadata(chainGrpcAnnotators(annotators…)), | |
) | |
opts := []grpc.DialOption{grpc.WithInsecure()} | |
err := gw.RegisterGreetingServiceHandlerFromEndpoint(ctx, mux, "service-a:50051", opts) | |
if err != nil { | |
return err | |
} | |
return http.ListenAndServe(":80", mux) | |
} | |
func getEnv(key, fallback string) string { | |
if value, ok := os.LookupEnv(key); ok { | |
return value | |
} | |
return fallback | |
} | |
func init() { | |
formatter := lrf.Formatter{ChildFormatter: &log.JSONFormatter{}} | |
formatter.Line = true | |
log.SetFormatter(&formatter) | |
log.SetOutput(os.Stdout) | |
level, err := log.ParseLevel(getEnv("LOG_LEVEL", "info")) | |
if err != nil { | |
log.Error(err) | |
} | |
log.SetLevel(level) | |
} | |
func main() { | |
flag.Parse() | |
if err := run(); err != nil { | |
log.Fatal(err) | |
} | |
} |
Below, in the Stackdriver logs, we see an example of a set of HTTP request headers in the JSON payload, which are propagated upstream to gRPC-based Go services from the gRPC Gateway’s reverse proxy. Header propagation ensures the request produces a complete distributed trace across the complete service call chain.
Istio VirtualService and CORS
According to feedback in the project’s GitHub Issues, the gRPC Gateway does not directly support Cross-Origin Resource Sharing (CORS) policy. In my own experience, the gRPC Gateway cannot handle OPTIONS HTTP method requests, which must be issued by the Angular 7 web UI. Therefore, I have offloaded CORS responsibility to Istio, using the VirtualService resource’s CorsPolicy configuration. This makes CORS much easier to manage than coding CORS configuration into service code (gist):
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
apiVersion: networking.istio.io/v1alpha3 | |
kind: VirtualService | |
metadata: | |
name: service-rev-proxy | |
spec: | |
hosts: | |
– api.dev.example-api.com | |
gateways: | |
– demo-gateway | |
http: | |
– match: | |
– uri: | |
prefix: / | |
route: | |
– destination: | |
port: | |
number: 80 | |
host: service-rev-proxy.dev.svc.cluster.local | |
weight: 100 | |
corsPolicy: | |
allowOrigin: | |
– "*" | |
allowMethods: | |
– OPTIONS | |
– GET | |
allowCredentials: true | |
allowHeaders: | |
– "*" |
Set-up and Installation
To deploy the microservices platform to GKE, follow the detailed instructions in part one of the post, Kubernetes-based Microservice Observability with Istio Service Mesh: Part 1, or Azure Kubernetes Service (AKS) Observability with Istio Service Mesh for AKS.
- Create the external MongoDB Atlas database and CloudAMQP RabbitMQ clusters;
- Modify the Kubernetes resource files and bash scripts for your own environments;
- Create the managed GKE or AKS cluster on GCP or Azure;
- Configure and deploy Istio to the managed Kubernetes cluster, using Helm;
- Create DNS records for the platform’s exposed resources;
- Deploy the Go-based microservices, gRPC Gateway reverse proxy, Angular UI, and associated resources to Kubernetes cluster;
- Test and troubleshoot the platform deployment;
- Observe the results;
The Three Pillars
As introduced in the first post, logs, metrics, and traces are often known as the three pillars of observability. These are the external outputs of the system, which we may observe. As modern distributed systems grow ever more complex, the ability to observe those systems demands equally modern tooling that was designed with this level of complexity in mind. Traditional logging and monitoring systems often struggle with today’s hybrid and multi-cloud, polyglot language-based, event-driven, container-based and serverless, infinitely-scalable, ephemeral-compute platforms.
Tools like Istio Service Mesh attempt to solve the observability challenge by offering native integrations with several best-of-breed, open-source telemetry tools. Istio’s integrations include Jaeger for distributed tracing, Kiali for Istio service mesh-based microservice visualization and monitoring, and Prometheus and Grafana for metric collection, monitoring, and alerting. Combined with cloud platform-native monitoring and logging services, such as Stackdriver for GKE, CloudWatch for Amazon’s EKS, or Azure Monitor logs for AKS, and we have a complete observability solution for modern, distributed, Cloud-based applications.
Pillar 1: Logging
Moving from JSON over HTTP to gRPC does not require any changes to the logging configuration of the Go-based service code or Kubernetes resources.
Stackdriver with Logrus
As detailed in part two of the last post, Kubernetes-based Microservice Observability with Istio Service Mesh, our logging strategy for the eight Go-based microservices and the reverse proxy continues to be the use of Logrus, the popular structured logger for Go, and Banzai Cloud’s logrus-runtime-formatter.
If you recall, the Banzai formatter automatically tags log messages with runtime/stack information, including function name and line number; extremely helpful when troubleshooting. We are also using Logrus’ JSON formatter. Below, in the Stackdriver console, note how each log entry below has the JSON payload contained within the message with the log level, function name, lines on which the log entry originated, and the message.
Below, we see the details of a specific log entry’s JSON payload. In this case, we can see the request headers propagated from the downstream service.
Pillar 2: Metrics
Moving from JSON over HTTP to gRPC does not require any changes to the metrics configuration of the Go-based service code or Kubernetes resources.
Prometheus
Prometheus is a completely open source and community-driven systems monitoring and alerting toolkit originally built at SoundCloud, circa 2012. Interestingly, Prometheus joined the Cloud Native Computing Foundation (CNCF) in 2016 as the second hosted-project, after Kubernetes.
Grafana
Grafana describes itself as the leading open source software for time series analytics. According to Grafana Labs, Grafana allows you to query, visualize, alert on, and understand your metrics no matter where they are stored. You can easily create, explore, and share visually-rich, data-driven dashboards. Grafana allows users to visually define alert rules for your most important metrics. Grafana will continuously evaluate rules and can send notifications.
According to Istio, the Grafana add-on is a pre-configured instance of Grafana. The Grafana Docker base image has been modified to start with both a Prometheus data source and the Istio Dashboard installed. Below, we see two of the pre-configured dashboards, the Istio Mesh Dashboard and the Istio Performance Dashboard.
Pillar 3: Traces
Moving from JSON over HTTP to gRPC did require a complete re-write of the tracing logic in the service code. In fact, I spent the majority of my time ensuring the correct headers were propagated from the Istio Ingress Gateway to the gRPC Gateway reverse proxy, to Service A in the gRPC context, and upstream to all the dependent, gRPC-based services. I am sure there are a number of optimization in my current code, regarding the correct handling of traces and how this information is propagated across the service call stack.
Jaeger
According to their website, Jaeger, inspired by Dapper and OpenZipkin, is a distributed tracing system released as open source by Uber Technologies. It is used for monitoring and troubleshooting microservices-based distributed systems, including distributed context propagation, distributed transaction monitoring, root cause analysis, service dependency analysis, and performance and latency optimization. The Jaeger website contains an excellent overview of Jaeger’s architecture and general tracing-related terminology.
Below we see the Jaeger UI Traces View. In it, we see a series of traces generated by hey, a modern load generator and benchmarking tool, and a worthy replacement for Apache Bench (ab
). Unlike ab
, hey
supports HTTP/2. The use of hey
was detailed in the previous post.
A trace, as you might recall, is an execution path through the system and can be thought of as a directed acyclic graph (DAG) of spans. If you have worked with systems like Apache Spark, you are probably already familiar with DAGs.
Below we see the Jaeger UI Trace Detail View. The example trace contains 16 spans, which encompasses nine components – seven of the eight Go-based services, the reverse proxy, and the Istio Ingress Gateway. The trace and the spans each have timings. The root span in the trace is the Istio Ingress Gateway. In this demo, traces do not span the RabbitMQ message queues. This means you would not see a trace which includes the decoupled, message-based communications between Service D to Service F, via the RabbitMQ.
Within the Jaeger UI Trace Detail View, you also have the ability to drill into a single span, which contains additional metadata. Metadata includes the URL being called, HTTP method, response status, and several other headers.
Microservice Observability
Moving from JSON over HTTP to gRPC does not require any changes to the Kiali configuration of the Go-based service code or Kubernetes resources.
Kiali
According to their website, Kiali provides answers to the questions: What are the microservices in my Istio service mesh, and how are they connected? Kiali works with Istio, in OpenShift or Kubernetes, to visualize the service mesh topology, to provide visibility into features like circuit breakers, request rates and more. It offers insights about the mesh components at different levels, from abstract Applications to Services and Workloads.
The Graph View in the Kiali UI is a visual representation of the components running in the Istio service mesh. Below, filtering on the cluster’s dev
Namespace, we should observe that Kiali has mapped all components in the platform, along with rich metadata, such as their version and communication protocols.
Using Kiali, we can confirm our service-to-service IPC protocol is now gRPC instead of the previous HTTP.
Conclusion
Although converting from JSON over HTTP to protocol buffers with gRPC required major code changes to the services, it did not impact the high-level observability we have of those services using the tools provided by Istio, including Prometheus, Grafana, Jaeger, and Kiali.
All opinions expressed in this post are my own and not necessarily the views of my current or past employers or their clients.
Azure Kubernetes Service (AKS) Observability with Istio Service Mesh
Posted by Gary A. Stafford in Azure, Bash Scripting, Cloud, DevOps, Go, JavaScript, Kubernetes, Software Development on March 31, 2019
In the last two-part post, Kubernetes-based Microservice Observability with Istio Service Mesh, we deployed Istio, along with its observability tools, Prometheus, Grafana, Jaeger, and Kiali, to Google Kubernetes Engine (GKE). Following that post, I received several questions about using Istio’s observability tools with other popular managed Kubernetes platforms, primarily Azure Kubernetes Service (AKS). In most cases, including with AKS, both Istio and the observability tools are compatible.
In this short follow-up of the last post, we will replace the GKE-specific cluster setup commands, found in part one of the last post, with new commands to provision a similar AKS cluster on Azure. The new AKS cluster will run Istio 1.1.3, released 4/15/2019, alongside the latest available version of AKS (Kubernetes), 1.12.6. We will replace Google’s Stackdriver logging with Azure Monitor logs. We will retain the external MongoDB Atlas cluster and the external CloudAMQP cluster dependencies.
Previous articles about AKS include First Impressions of AKS, Azure’s New Managed Kubernetes Container Service (November 2017) and Architecting Cloud-Optimized Apps with AKS (Azure’s Managed Kubernetes), Azure Service Bus, and Cosmos DB (December 2017).
Source Code
All source code for this post is available on GitHub in two projects. The Go-based microservices source code, all Kubernetes resources, and all deployment scripts are located in the k8s-istio-observe-backend project repository.
git clone \ --branch master --single-branch \ --depth 1 --no-tags \ https://github.com/garystafford/k8s-istio-observe-backend.git
The Angular UI TypeScript-based source code is located in the k8s-istio-observe-frontend repository. You will not need to clone the Angular UI project for this post’s demonstration.
Setup
This post assumes you have a Microsoft Azure account with the necessary resource providers registered, and the Azure Command-Line Interface (CLI), az
, installed and available to your command shell. You will also need Helm and Istio 1.1.3 installed and configured, which is covered in the last post.
Start by logging into Azure from your command shell.
az login \ --username {{ your_username_here }} \ --password {{ your_password_here }}
Resource Providers
If you are new to Azure or AKS, you may need to register some additional resource providers to complete this demonstration.
az provider list --output table
If you are missing required resource providers, you will see errors similar to the one shown below. Simply activate the particular provider corresponding to the error.
Operation failed with status:'Bad Request'. Details: Required resource provider registrations Microsoft.Compute, Microsoft.Network are missing.
To register the necessary providers, use the Azure CLI or the Azure Portal UI.
az provider register --namespace Microsoft.ContainerService az provider register --namespace Microsoft.Network az provider register --namespace Microsoft.Compute
Resource Group
AKS requires an Azure Resource Group. According to Azure, a resource group is a container that holds related resources for an Azure solution. The resource group includes those resources that you want to manage as a group. I chose to create a new resource group associated with my closest geographic Azure Region, East US, using the Azure CLI.
az group create \ --resource-group aks-observability-demo \ --location eastus
Create the AKS Cluster
Before creating the GKE cluster, check for the latest versions of AKS. At the time of this post, the latest versions of AKS was 1.12.6.
az aks get-versions \ --location eastus \ --output table
Using the latest GKE version, create the GKE managed cluster. There are many configuration options available with the az aks create
command. For this post, I am creating three worker nodes using the Azure Standard_DS3_v2 VM type, which will give us a total of 12 vCPUs and 42 GB of memory. Anything smaller and all the Pods may not be schedulable. Instead of supplying an existing SSH key, I will let Azure create a new one. You should have no need to SSH into the worker nodes. I am also enabling the monitoring add-on. According to Azure, the add-on sets up Azure Monitor for containers, announced in December 2018, which monitors the performance of workloads deployed to Kubernetes environments hosted on AKS.
time az aks create \ --name aks-observability-demo \ --resource-group aks-observability-demo \ --node-count 3 \ --node-vm-size Standard_DS3_v2 \ --enable-addons monitoring \ --generate-ssh-keys \ --kubernetes-version 1.12.6
Using the time
command, we observe that the cluster took approximately 5m48s to provision; I have seen times up to almost 10 minutes. AKS provisioning is not as fast as GKE, which in my experience is at least 2x-3x faster than AKS for a similarly sized cluster.
After the cluster creation completes, retrieve your AKS cluster credentials.
az aks get-credentials \ --name aks-observability-demo \ --resource-group aks-observability-demo \ --overwrite-existing
Examine the Cluster
Use the following command to confirm the cluster is ready by examining the status of three worker nodes.
kubectl get nodes --output=wide
Observe that Azure currently uses Ubuntu 16.04.5 LTS for the worker node’s host operating system. If you recall, GKE offers both Ubuntu as well as a Container-Optimized OS from Google.
Kubernetes Dashboard
Unlike GKE, there is no custom AKS dashboard. Therefore, we will use the Kubernetes Web UI (dashboard), which is installed by default with AKS, unlike GKE. According to Azure, to make full use of the dashboard, since the AKS cluster uses RBAC, a ClusterRoleBinding must be created before you can correctly access the dashboard.
kubectl create clusterrolebinding kubernetes-dashboard \ --clusterrole=cluster-admin \ --serviceaccount=kube-system:kubernetes-dashboard
Next, we must create a proxy tunnel on local port 8001
to the dashboard running on the AKS cluster. This CLI command creates a proxy between your local system and the Kubernetes API and opens your web browser to the Kubernetes dashboard.
az aks browse \ --name aks-observability-demo \ --resource-group aks-observability-demo
Although you should use the Azure CLI, PowerShell, or SDK for all your AKS configuration tasks, using the dashboard for monitoring the cluster and the resources running on it, is invaluable.
The Kubernetes dashboard also provides access to raw container logs. Azure Monitor provides the ability to construct complex log queries, but for quick troubleshooting, you may just want to see the raw logs a specific container is outputting, from the dashboard.
Azure Portal
Logging into the Azure Portal, we can observe the AKS cluster, within the new Resource Group.
In addition to the Azure Resource Group we created, there will be a second Resource Group created automatically during the creation of the AKS cluster. This group contains all the resources that compose the AKS cluster. These resources include the three worker node VM instances, and their corresponding storage disks and NICs. The group also includes a network security group, route table, virtual network, and an availability set.
Deploy Istio
From this point on, the process to deploy Istio Service Mesh and the Go-based microservices platform follows the previous post and use the exact same scripts. After modifying the Kubernetes resource files, to deploy Istio, use the bash script, part4_install_istio.sh. I have added a few more pauses in the script to account for the apparently slower response times from AKS as opposed to GKE. It definitely takes longer to spin up the Istio resources on AKS than on GKE, which can result in errors if you do not pause between each stage of the deployment process.
Using the Kubernetes dashboard, we can view the Istio resources running in the istio-system
Namespace, as shown below. Confirm that all resource Pods are running and healthy before deploying the Go-based microservices platform.
Deploy the Platform
Deploy the Go-based microservices platform, using bash deploy script, part5a_deploy_resources.sh.
The script deploys two replicas (Pods) of each of the eight microservices, Service-A through Service-H, and the Angular UI, to the dev
and test
Namespaces, for a total of 36 Pods. Each Pod will have the Istio sidecar proxy (Envoy Proxy) injected into it, alongside the microservice or UI.
Azure Load Balancer
If we return to the Resource Group created automatically when the AKS cluster was created, we will now see two additional resources. There is now an Azure Load Balancer and Public IP Address.
Similar to the GKE cluster in the last post, when the Istio Ingress Gateway is deployed as part of the platform, it is materialized as an Azure Load Balancer. The front-end of the load balancer is the new public IP address. The back-end of the load-balancer is a pool containing the three AKS worker node VMs. The load balancer is associated with a set of rules and health probes.
DNS
I have associated the new Azure public IP address, connected with the front-end of the load balancer, with the four subdomains I am using to represent the UI and the edge service, Service-A, in both Namespaces. If Azure is your primary Cloud provider, then Azure DNS is a good choice to manage your domain’s DNS records. For this demo, you will require your own domain.
Testing the Platform
With everything deployed, test the platform is responding and generate HTTP traffic for the observability tools to record. Similar to last time, I have chosen hey, a modern load generator and benchmarking tool, and a worthy replacement for Apache Bench (ab
). Unlike ab
, hey
supports HTTP/2. Below, I am running hey
directly from Azure Cloud Shell. The tool is simulating 10 concurrent users, generating a total of 500 HTTP GET requests to Service A.
# quick setup from Azure Shell using Bash go get -u github.com/rakyll/hey cd go/src/github.com/rakyll/hey/ go build ./hey -n 500 -c 10 -h2 http://api.dev.example-api.com/api/ping
We had 100% success with all 500 calls resulting in an HTTP 200 OK success status response code. Based on the results, we can observe the platform was capable of approximately 4 requests/second, with an average response time of 2.48 seconds and a mean time of 2.80 seconds. Almost all of that time was the result of waiting for the response, as the details indicate.
Logging
In this post, we have replaced GCP’s Stackdriver logging with Azure Monitor logs. According to Microsoft, Azure Monitor maximizes the availability and performance of applications by delivering a comprehensive solution for collecting, analyzing, and acting on telemetry from Cloud and on-premises environments. In my opinion, Stackdriver is a superior solution for searching and correlating the logs of distributed applications running on Kubernetes. I find the interface and query language of Stackdriver easier and more intuitive than Azure Monitor, which although powerful, requires substantial query knowledge to obtain meaningful results. For example, here is a query to view the log entries from the services in the dev
Namespace, within the last day.
let startTimestamp = ago(1d); KubePodInventory | where TimeGenerated > startTimestamp | where ClusterName =~ "aks-observability-demo" | where Namespace == "dev" | where Name contains "service-" | distinct ContainerID | join ( ContainerLog | where TimeGenerated > startTimestamp ) on ContainerID | project LogEntrySource, LogEntry, TimeGenerated, Name | order by TimeGenerated desc | render table
Below, we see the Logs interface with the search query and log entry results.
Below, we see a detailed view of a single log entry from Service A.
Observability Tools
The previous post goes into greater detail on the features of each of the observability tools provided by Istio, including Prometheus, Grafana, Jaeger, and Kiali.
We can use the exact same kubectl port-forward
commands to connect to the tools on AKS as we did on GKE. According to Google, Kubernetes port forwarding allows using a resource name, such as a service name, to select a matching pod to port forward to since Kubernetes v1.10. We forward a local port to a port on the tool’s pod.
# Grafana kubectl port-forward -n istio-system \ $(kubectl get pod -n istio-system -l app=grafana \ -o jsonpath='{.items[0].metadata.name}') 3000:3000 & # Prometheus kubectl -n istio-system port-forward \ $(kubectl -n istio-system get pod -l app=prometheus \ -o jsonpath='{.items[0].metadata.name}') 9090:9090 & # Jaeger kubectl port-forward -n istio-system \ $(kubectl get pod -n istio-system -l app=jaeger \ -o jsonpath='{.items[0].metadata.name}') 16686:16686 & # Kiali kubectl -n istio-system port-forward \ $(kubectl -n istio-system get pod -l app=kiali \ -o jsonpath='{.items[0].metadata.name}') 20001:20001 &
Prometheus and Grafana
Prometheus is a completely open source and community-driven systems monitoring and alerting toolkit originally built at SoundCloud, circa 2012. Interestingly, Prometheus joined the Cloud Native Computing Foundation (CNCF) in 2016 as the second hosted-project, after Kubernetes.
Grafana describes itself as the leading open source software for time series analytics. According to Grafana Labs, Grafana allows you to query, visualize, alert on, and understand your metrics no matter where they are stored. You can easily create, explore, and share visually-rich, data-driven dashboards. Grafana also users to visually define alert rules for your most important metrics. Grafana will continuously evaluate rules and can send notifications.
According to Istio, the Grafana add-on is a pre-configured instance of Grafana. The Grafana Docker base image has been modified to start with both a Prometheus data source and the Istio Dashboard installed. Below, we see one of the pre-configured dashboards, the Istio Service Dashboard.
Jaeger
According to their website, Jaeger, inspired by Dapper and OpenZipkin, is a distributed tracing system released as open source by Uber Technologies. It is used for monitoring and troubleshooting microservices-based distributed systems, including distributed context propagation, distributed transaction monitoring, root cause analysis, service dependency analysis, and performance and latency optimization. The Jaeger website contains a good overview of Jaeger’s architecture and general tracing-related terminology.
Below, we see a typical, distributed trace of the services, starting ingress gateway and passing across the upstream service dependencies.
Kaili
According to their website, Kiali provides answers to the questions: What are the microservices in my Istio service mesh, and how are they connected? Kiali works with Istio, in OpenShift or Kubernetes, to visualize the service mesh topology, to provide visibility into features like circuit breakers, request rates and more. It offers insights about the mesh components at different levels, from abstract Applications to Services and Workloads.
There is a common Kubernetes Secret that controls access to the Kiali API and UI. The default login is admin
, the password is 1f2d1e2e67df
.
Below, we see a detailed view of our platform, running in the dev
namespace, on AKS.
Delete AKS Cluster
Once you are finished with this demo, use the following two commands to tear down the AKS cluster and remove the cluster context from your local configuration.
time az aks delete \ --name aks-observability-demo \ --resource-group aks-observability-demo \ --yes kubectl config delete-context aks-observability-demo
Conclusion
In this brief, follow-up post, we have explored how the current set of observability tools, part of the latest version of Istio Service Mesh, integrates with Azure Kubernetes Service (AKS).
All opinions expressed in this post are my own and not necessarily the views of my current or past employers or their clients.
Kubernetes-based Microservice Observability with Istio Service Mesh on Google Kubernetes Engine (GKE): Part 2
Posted by Gary A. Stafford in Cloud, DevOps, GCP, Go, JavaScript, Kubernetes, Software Development on March 21, 2019
In this two-part post, we are exploring the set of observability tools that are part of the latest version of Istio Service Mesh. These tools include Prometheus and Grafana for metric collection, monitoring, and alerting, Jaeger for distributed tracing, and Kiali for Istio service-mesh-based microservice visualization. Combined with cloud platform-native monitoring and logging services, such as Stackdriver for Google Kubernetes Engine (GKE) on Google Cloud Platform (GCP), we have a complete observability solution for modern, distributed applications.
Reference Platform
To demonstrate Istio’s observability tools, in part one of the post, we deployed a reference microservices platform, written in Go, to GKE on GCP. The platform is comprised of (14) components, including (8) Go-based microservices, labeled generically as Service A through Service H, (1) Angular 7, TypeScript-based front-end, (4) MongoDB databases, and (1) RabbitMQ queue for event queue-based communications.
The reference platform is designed to generate HTTP-based service-to-service, TCP-based service-to-database (MongoDB), and TCP-based service-to-queue-to-service (RabbitMQ) IPC (inter-process communication). Service A calls Service B and Service C, Service B calls Service D and Service E, Service D produces a message on a RabbitMQ queue that Service F consumes and writes to MongoDB, and so on. The goal is to observe these distributed communications using Istio’s observability tools when the system is deployed to Kubernetes.
Pillar 1: Logging
If you recall, logs, metrics, and traces are often known as the three pillars of observability. Since we are using GKE on GCP, we will look at Google’s Stackdriver Logging. According to Google, Stackdriver Logging allows you to store, search, analyze, monitor, and alert on log data and events from GCP and even AWS. Although Stackdriver logging is not an Istio observability feature, logging is an essential pillar of overall observability strategy.
Go-based Microservice Logging
An effective logging strategy starts with what you log, when you log, and how you log. As part of our logging strategy, the eight Go-based microservices are using Logrus, a popular structured logger for Go. The microservices also implement Banzai Cloud’s logrus-runtime-formatter. There is an excellent article on the formatter, Golang runtime Logrus Formatter. These two logging packages give us greater control over what we log, when we log, and how we log information about our microservices. The recommended configuration of the packages is minimal.
func init() { formatter := runtime.Formatter{ChildFormatter: &log.JSONFormatter{}} formatter.Line = true log.SetFormatter(&formatter) log.SetOutput(os.Stdout) level, err := log.ParseLevel(getEnv("LOG_LEVEL", "info")) if err != nil { log.Error(err) } log.SetLevel(level) }
Logrus provides several advantages over Go’s simple logging package, log. Log entries are not only for Fatal errors, nor should all verbose log entries be output in a Production environment. The post’s microservices are taking advantage of Logrus’ ability to log at seven levels: Trace, Debug, Info, Warning, Error, Fatal, and Panic. We have also variabilized the log level, allowing it to be easily changed in the Kubernetes Deployment resource at deploy-time.
The microservices also take advantage of Banzai Cloud’s logrus-runtime-formatter. The Banzai formatter automatically tags log messages with runtime/stack information, including function name and line number; extremely helpful when troubleshooting. We are also using Logrus’ JSON formatter. Note how each log entry below has the JSON payload contained within the message.
Client-side Angular UI Logging
Likewise, we have enhanced the logging of the Angular UI using NGX Logger. NGX Logger is a popular, simple logging module, currently for Angular 6 and 7. It allows “pretty print” to the console, as well as allowing log messages to be POSTed to a URL for server-side logging. For this demo, we will only print to the console. Similar to Logrus, NGX Logger supports multiple log levels: Trace, Debug, Info, Warning, Error, Fatal, and Off. Instead of just outputting messages, NGX Logger allows us to output properly formatted log entries to the web browser’s console.
The level of logs output is dependent on the environment, Production or not Production. Below we see a combination of log entries in the local development environment, including Debug, Info, and Error.
Again below, we see the same page in the GKE-based Production environment. Note the absence of Debug-level log entries output to the console, without changing the configuration. We would not want to expose potentially sensitive information in verbose log output to our end-users in Production.
Controlling logging levels is accomplished by adding the following ternary operator to the app.module.ts file.
LoggerModule.forRoot({ level: !environment.production ? NgxLoggerLevel.DEBUG : NgxLoggerLevel.INFO, serverLogLevel: NgxLoggerLevel.INFO })
Pillar 2: Metrics
For metrics, we will examine at Prometheus and Grafana. Both these leading tools were installed as part of the Istio deployment.
Prometheus
Prometheus is a completely open source and community-driven systems monitoring and alerting toolkit originally built at SoundCloud, circa 2012. Interestingly, Prometheus joined the Cloud Native Computing Foundation (CNCF) in 2016 as the second hosted-project, after Kubernetes.
According to Istio, Istio’s Mixer comes with a built-in Prometheus adapter that exposes an endpoint serving generated metric values. The Prometheus add-on is a Prometheus server that comes pre-configured to scrape Mixer endpoints to collect the exposed metrics. It provides a mechanism for persistent storage and querying of Istio metrics.
With the GKE cluster running, Istio installed, and the platform deployed, the easiest way to access Grafana, is using kubectl port-forward
to connect to the Prometheus server. According to Google, Kubernetes port forwarding allows using a resource name, such as a service name, to select a matching pod to port forward to since Kubernetes v1.10. We forward a local port to a port on the Prometheus pod.
You may connect using Google Cloud Shell or copy and paste the command to your local shell to connect from a local port. Below are the port forwarding commands used in this post.
# Grafana kubectl port-forward -n istio-system \ $(kubectl get pod -n istio-system -l app=grafana \ -o jsonpath='{.items[0].metadata.name}') 3000:3000 & # Prometheus kubectl -n istio-system port-forward \ $(kubectl -n istio-system get pod -l app=prometheus \ -o jsonpath='{.items[0].metadata.name}') 9090:9090 & # Jaeger kubectl port-forward -n istio-system \ $(kubectl get pod -n istio-system -l app=jaeger \ -o jsonpath='{.items[0].metadata.name}') 16686:16686 & # Kiali kubectl -n istio-system port-forward \ $(kubectl -n istio-system get pod -l app=kiali \ -o jsonpath='{.items[0].metadata.name}') 20001:20001 &
According to Prometheus, user select and aggregate time series data in real time using a functional query language called PromQL (Prometheus Query Language). The result of an expression can either be shown as a graph, viewed as tabular data in Prometheus’s expression browser, or consumed by external systems through Prometheus’ HTTP API. The expression browser includes a drop-down menu with all available metrics as a starting point for building queries. Shown below are a few PromQL examples used in this post.
up{namespace="dev",pod_name=~"service-.*"} container_memory_max_usage_bytes{namespace="dev",container_name=~"service-.*"} container_memory_max_usage_bytes{namespace="dev",container_name="service-f"} container_network_transmit_packets_total{namespace="dev",pod_name=~"service-e-.*"} istio_requests_total{destination_service_namespace="dev",connection_security_policy="mutual_tls",destination_app="service-a"} istio_response_bytes_count{destination_service_namespace="dev",connection_security_policy="mutual_tls",source_app="service-a"}
Below, in the Prometheus console, we see an example graph of the eight Go-based microservices, deployed to GKE. The graph displays the container memory usage over a five minute period. For half the time period, the services were at rest. For the second half of the period, the services were under a simulated load, using hey
. Viewing the memory profile of the services under load can help us determine the container memory minimums and limits, which impact Kubernetes’ scheduling of workloads on the GKE cluster. Metrics such as this might also uncover memory leaks or routing issues, such as the service below, which appears to be consuming 25-50% more memory than its peers.
Another example, below, we see a graph representing the total Istio requests to Service A in the dev
Namespace, while the system was under load.
Compare the graph view above with the same metrics displayed the console view. The multiple entries reflect the multiple instances of Service A in the dev
Namespace, over the five-minute period being examined. The values in the individual metric elements indicate the latest metric that was collected.
Prometheus also collects basic metrics about Istio components, Kubernetes components, and GKE cluster. Below we can view the total memory of each n1-standard-2 VM nodes in the GKE cluster.
Grafana
Grafana describes itself as the leading open source software for time series analytics. According to Grafana Labs, Grafana allows you to query, visualize, alert on, and understand your metrics no matter where they are stored. You can easily create, explore, and share visually-rich, data-driven dashboards. Grafana also users to visually define alert rules for your most important metrics. Grafana will continuously evaluate rules and can send notifications.
According to Istio, the Grafana add-on is a pre-configured instance of Grafana. The Grafana Docker base image has been modified to start with both a Prometheus data source and the Istio Dashboard installed. The base install files for Istio, and Mixer in particular, ship with a default configuration of global (used for every service) metrics. The pre-configured Istio Dashboards are built to be used in conjunction with the default Istio metrics configuration and a Prometheus back-end.
Below, we see the pre-configured Istio Workload Dashboard. This particular section of the larger dashboard has been filtered to show outbound service metrics in the dev
Namespace of our GKE cluster.
Similarly, below, we see the pre-configured Istio Service Dashboard. This particular section of the larger dashboard is filtered to show client workloads metrics for the Istio Ingress Gateway in our GKE cluster.
Lastly, we see the pre-configured Istio Mesh Dashboard. This dashboard is filtered to show a table view of metrics for components deployed to our GKE cluster.
An effective observability strategy must include more than just the ability to visualize results. An effective strategy must also include the ability to detect anomalies and notify (alert) the appropriate resources or take action directly to resolve incidents. Grafana, like Prometheus, is capable of alerting and notification. You visually define alert rules for your critical metrics. Grafana will continuously evaluate metrics against the rules and send notifications when pre-defined thresholds are breached.
Prometheus supports multiple, popular notification channels, including PagerDuty, HipChat, Email, Kafka, and Slack. Below, we see a new Prometheus notification channel, which sends alert notifications to a Slack support channel.
Prometheus is able to send detailed text-based and visual notifications.
Pillar 3: Traces
According to the Open Tracing website, distributed tracing, also called distributed request tracing, is a method used to profile and monitor applications, especially those built using a microservices architecture. Distributed tracing helps pinpoint where failures occur and what causes poor performance.
According to Istio, although Istio proxies are able to automatically send spans, applications need to propagate the appropriate HTTP headers, so that when the proxies send span information, the spans can be correlated correctly into a single trace. To accomplish this, an application needs to collect and propagate the following headers from the incoming request to any outgoing requests.
x-request-id
x-b3-traceid
x-b3-spanid
x-b3-parentspanid
x-b3-sampled
x-b3-flags
x-ot-span-context
The x-b3
headers originated as part of the Zipkin project. The B3 portion of the header is named for the original name of Zipkin, BigBrotherBird. Passing these headers across service calls is known as B3 propagation. According to Zipkin, these attributes are propagated in-process, and eventually downstream (often via HTTP headers), to ensure all activity originating from the same root are collected together.
In order to demonstrate distributed tracing with Jaeger, I have modified Service A, Service B, and Service E. These are the three services that make HTTP requests to other upstream services. I have added the following code in order to propagate the headers from one service to the next. The Istio sidecar proxy (Envoy) generates the first headers. It is critical that you only propagate the headers that are present in the downstream request and have a value, as the code below does. Propagating an empty header will break the distributed tracing.
headers := []string{ "x-request-id", "x-b3-traceid", "x-b3-spanid", "x-b3-parentspanid", "x-b3-sampled", "x-b3-flags", "x-ot-span-context", } for _, header := range headers { if r.Header.Get(header) != "" { req.Header.Add(header, r.Header.Get(header)) } }
Below, in the highlighted Stackdriver log entry’s JSON payload, we see the required headers, propagated from the root span, which contained a value, being passed from Service A to Service C in the upstream request.
Jaeger
According to their website, Jaeger, inspired by Dapper and OpenZipkin, is a distributed tracing system released as open source by Uber Technologies. It is used for monitoring and troubleshooting microservices-based distributed systems, including distributed context propagation, distributed transaction monitoring, root cause analysis, service dependency analysis, and performance and latency optimization. The Jaeger website contains a good overview of Jaeger’s architecture and general tracing-related terminology.
Below we see the Jaeger UI Traces View. The UI shows the results of a search for the Istio Ingress Gateway service over a period of about forty minutes. We see a timeline of traces across the top with a list of trace results below. As discussed on the Jaeger website, a trace is composed of spans. A span represents a logical unit of work in Jaeger that has an operation name. A trace is an execution path through the system and can be thought of as a directed acyclic graph (DAG) of spans. If you have worked with systems like Apache Spark, you are probably already familiar with DAGs.
Below we see the Jaeger UI Trace Detail View. The example trace contains 16 spans, which encompasses eight services – seven of the eight Go-based services and the Istio Ingress Gateway. The trace and the spans each have timings. The root span in the trace is the Istio Ingress Gateway. The Angular UI, loaded in the end user’s web browser, calls the mesh’s edge service, Service A, through the Istio Ingress Gateway. From there, we see the expected flow of our service-to-service IPC. Service A calls Services B and C. Service B calls Service E, which calls Service G and Service H. In this demo, traces do not span the RabbitMQ message queues. This means you would not see a trace which includes a call from Service D to Service F, via the RabbitMQ.
Within the Jaeger UI Trace Detail View, you also have the ability to drill into a single span, which contains additional metadata. Metadata includes the URL being called, HTTP method, response status, and several other headers.
The latest version of Jaeger also includes a Compare feature and two Dependencies views, Force-Directed Graph, and DAG. I find both views rather primitive compared to Kiali, and more similar to Service Graph. Lacking access to Kiali, the views are marginally useful as a dependency graph.
Kiali: Microservice Observability
According to their website, Kiali provides answers to the questions: What are the microservices in my Istio service mesh, and how are they connected? There is a common Kubernetes Secret that controls access to the Kiali API and UI. The default login is admin
, the password is 1f2d1e2e67df
.
Logging into Kiali, we see the Overview menu entry, which provides a global view of all namespaces within the Istio service mesh and the number of applications within each namespace.
The Graph View in the Kiali UI is a visual representation of the components running in the Istio service mesh. Below, filtering on the cluster’s dev
Namespace, we can observe that Kiali has mapped 8 applications (workloads), 10 services, and 24 edges (a graph term). Specifically, we see the Istio Ingres Proxy at the edge of the service mesh, the Angular UI, the eight Go-based microservices and their Envoy proxy sidecars that are taking traffic (Service F did not take any direct traffic from another service in this example), the external MongoDB Atlas cluster, and the external CloudAMQP cluster. Note how service-to-service traffic flows, with Istio, from the service to its sidecar proxy, to the other service’s sidecar proxy, and finally to the service.
Below, we see a similar view of the service mesh, but this time, there are failures between the Istio Ingress Gateway and the Service A, shown in red. We can also observe overall metrics for the HTTP traffic, such as total requests/minute, errors, and status codes.
Kiali can also display average requests times and other metrics for each edge in the graph (the communication between two components).
Kiali can also show application versions deployed, as shown below, the microservices are a combination of versions 1.3 and 1.4.
Focusing on the external MongoDB Atlas cluster, Kiali also allows us to view TCP traffic between the four services within the service mesh and the external cluster.
The Applications menu entry lists all the applications and their error rates, which can be filtered by Namespace and time interval. Here we see that the Angular UI was producing errors at the rate of 16.67%.
On both the Applications and Workloads menu entry, we can drill into a component to view additional details, including the overall health, number of Pods, Services, and Destination Services. Below, we see details for Service B in the dev
Namespace.
The Workloads detailed view also includes inbound and outbound metrics. Below, the outbound volume, duration, and size metrics, for Service A in the dev
Namespace.
Finally, Kiali presents an Istio Config menu entry. The Istio Config menu entry displays a list of all of the available Istio configuration objects that exist in the user’s environment.
Oftentimes, I find Kiali to be my first stop when troubleshooting platform issues. Once I identify the specific components or communication paths having issues, I can search the Stackdriver logs and the Prometheus metrics, through the Grafana dashboard.
Conclusion
In this two-part post, we have explored the current set of observability tools, which are part of the latest version of Istio Service Mesh. These tools included Prometheus and Grafana for metric collection, monitoring, and alerting, Jaeger for distributed tracing, and Kiali for Istio service-mesh-based microservice visualization. Combined with cloud platform-native monitoring and logging services, such as Stackdriver for Google Kubernetes Engine (GKE) on Google Cloud Platform (GCP), we have a complete observability solution for modern, distributed applications.
All opinions expressed in this post are my own and not necessarily the views of my current or past employers or their clients.
Kubernetes-based Microservice Observability with Istio Service Mesh on Google Kubernetes Engine (GKE): Part 1
Posted by Gary A. Stafford in Build Automation, Client-Side Development, Cloud, DevOps, GCP, Go, JavaScript, Kubernetes, Software Development on March 10, 2019
In this two-part post, we will explore the set of observability tools which are part of the Istio Service Mesh. These tools include Jaeger, Kiali, Prometheus, and Grafana. To assist in our exploration, we will deploy a Go-based, microservices reference platform to Google Kubernetes Engine, on the Google Cloud Platform.
What is Observability?
Similar to blockchain, serverless, AI and ML, chatbots, cybersecurity, and service meshes, Observability is a hot buzz word in the IT industry right now. According to Wikipedia, observability is a measure of how well internal states of a system can be inferred from knowledge of its external outputs. Logs, metrics, and traces are often known as the three pillars of observability. These are the external outputs of the system, which we may observe.
The O’Reilly book, Distributed Systems Observability, by Cindy Sridharan, does an excellent job of detailing ‘The Three Pillars of Observability’, in Chapter 4. I recommend reading this free online excerpt, before continuing. A second great resource for information on observability is honeycomb.io, a developer of observability tools for production systems, led by well-known industry thought-leader, Charity Majors. The honeycomb.io site includes articles, blog posts, whitepapers, and podcasts on observability.
As modern distributed systems grow ever more complex, the ability to observe those systems demands equally modern tooling that was designed with this level of complexity in mind. Traditional logging and monitoring systems often struggle with today’s hybrid and multi-cloud, polyglot language-based, event-driven, container-based and serverless, infinitely-scalable, ephemeral-compute platforms.
Tools like Istio Service Mesh attempt to solve the observability challenge by offering native integrations with several best-of-breed, open-source telemetry tools. Istio’s integrations include Jaeger for distributed tracing, Kiali for Istio service mesh-based microservice visualization, and Prometheus and Grafana for metric collection, monitoring, and alerting. Combined with cloud platform-native monitoring and logging services, such as Stackdriver for Google Kubernetes Engine (GKE) on Google Cloud Platform (GCP), we have a complete observability platform for modern, distributed applications.
A Reference Microservices Platform
To demonstrate the observability tools integrated with the latest version of Istio Service Mesh, we will deploy a reference microservices platform, written in Go, to GKE on GCP. I developed the reference platform to demonstrate concepts such as API management, Service Meshes, Observability, DevOps, and Chaos Engineering. The platform is comprised of (14) components, including (8) Go-based microservices, labeled generically as Service A – Service H, (1) Angular 7, TypeScript-based front-end, (4) MongoDB databases, and (1) RabbitMQ queue for event queue-based communications. The platform and all its source code is free and open source.
The reference platform is designed to generate HTTP-based service-to-service, TCP-based service-to-database (MongoDB), and TCP-based service-to-queue-to-service (RabbitMQ) IPC (inter-process communication). Service A calls Service B and Service C, Service B calls Service D and Service E, Service D produces a message on a RabbitMQ queue that Service F consumes and writes to MongoDB, and so on. These distributed communications can be observed using Istio’s observability tools when the system is deployed to a Kubernetes cluster running the Istio service mesh.
Service Responses
On the reference platform, each upstream service responds to requests from downstream services by returning a small informational JSON payload (termed a greeting in the source code).
The responses are aggregated across the service call chain, resulting in an array of service responses being returned to the edge service and on to the Angular-based UI, running in the end user’s web browser. The response aggregation feature is simply used to confirm that the service-to-service communications, Istio components, and the telemetry tools are working properly.
Each Go microservice contains a /ping
and /health
endpoint. The /health
endpoint can be used to configure Kubernetes Liveness and Readiness Probes. Additionally, the edge service, Service A, is configured for Cross-Origin Resource Sharing (CORS) using the access-control-allow-origin: *
response header. CORS allows the Angular UI, running in end user’s web browser, to call the Service A /ping
endpoint, which resides in a different subdomain from UI. Shown below is the Go source code for Service A.
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
// author: Gary A. Stafford | |
// site: https://programmaticponderings.com | |
// license: MIT License | |
// purpose: Service A | |
package main | |
import ( | |
"encoding/json" | |
"github.com/banzaicloud/logrus-runtime-formatter" | |
"github.com/google/uuid" | |
"github.com/gorilla/mux" | |
"github.com/prometheus/client_golang/prometheus/promhttp" | |
"github.com/rs/cors" | |
log "github.com/sirupsen/logrus" | |
"io/ioutil" | |
"net/http" | |
"os" | |
"strconv" | |
"time" | |
) | |
type Greeting struct { | |
ID string `json:"id,omitempty"` | |
ServiceName string `json:"service,omitempty"` | |
Message string `json:"message,omitempty"` | |
CreatedAt time.Time `json:"created,omitempty"` | |
} | |
var greetings []Greeting | |
func PingHandler(w http.ResponseWriter, r *http.Request) { | |
w.Header().Set("Content-Type", "application/json; charset=utf-8") | |
log.Debug(r) | |
greetings = nil | |
CallNextServiceWithTrace("http://service-b/api/ping", w, r) | |
CallNextServiceWithTrace("http://service-c/api/ping", w, r) | |
tmpGreeting := Greeting{ | |
ID: uuid.New().String(), | |
ServiceName: "Service-A", | |
Message: "Hello, from Service-A!", | |
CreatedAt: time.Now().Local(), | |
} | |
greetings = append(greetings, tmpGreeting) | |
err := json.NewEncoder(w).Encode(greetings) | |
if err != nil { | |
log.Error(err) | |
} | |
} | |
func HealthCheckHandler(w http.ResponseWriter, r *http.Request) { | |
w.Header().Set("Content-Type", "application/json; charset=utf-8") | |
_, err := w.Write([]byte("{\"alive\": true}")) | |
if err != nil { | |
log.Error(err) | |
} | |
} | |
func ResponseStatusHandler(w http.ResponseWriter, r *http.Request) { | |
params := mux.Vars(r) | |
statusCode, err := strconv.Atoi(params["code"]) | |
if err != nil { | |
log.Error(err) | |
} | |
w.WriteHeader(statusCode) | |
} | |
func CallNextServiceWithTrace(url string, w http.ResponseWriter, r *http.Request) { | |
var tmpGreetings []Greeting | |
w.Header().Set("Content-Type", "application/json; charset=utf-8") | |
req, err := http.NewRequest("GET", url, nil) | |
if err != nil { | |
log.Error(err) | |
} | |
// Headers must be passed for Jaeger Distributed Tracing | |
headers := []string{ | |
"x-request-id", | |
"x-b3-traceid", | |
"x-b3-spanid", | |
"x-b3-parentspanid", | |
"x-b3-sampled", | |
"x-b3-flags", | |
"x-ot-span-context", | |
} | |
for _, header := range headers { | |
if r.Header.Get(header) != "" { | |
req.Header.Add(header, r.Header.Get(header)) | |
} | |
} | |
log.Info(req) | |
client := &http.Client{} | |
response, err := client.Do(req) | |
if err != nil { | |
log.Error(err) | |
} | |
defer response.Body.Close() | |
body, err := ioutil.ReadAll(response.Body) | |
if err != nil { | |
log.Error(err) | |
} | |
err = json.Unmarshal(body, &tmpGreetings) | |
if err != nil { | |
log.Error(err) | |
} | |
for _, r := range tmpGreetings { | |
greetings = append(greetings, r) | |
} | |
} | |
func getEnv(key, fallback string) string { | |
if value, ok := os.LookupEnv(key); ok { | |
return value | |
} | |
return fallback | |
} | |
func init() { | |
formatter := runtime.Formatter{ChildFormatter: &log.JSONFormatter{}} | |
formatter.Line = true | |
log.SetFormatter(&formatter) | |
log.SetOutput(os.Stdout) | |
level, err := log.ParseLevel(getEnv("LOG_LEVEL", "info")) | |
if err != nil { | |
log.Error(err) | |
} | |
log.SetLevel(level) | |
} | |
func main() { | |
c := cors.New(cors.Options{ | |
AllowedOrigins: []string{"*"}, | |
AllowCredentials: true, | |
AllowedMethods: []string{"GET", "POST", "PATCH", "PUT", "DELETE", "OPTIONS"}, | |
}) | |
router := mux.NewRouter() | |
api := router.PathPrefix("/api").Subrouter() | |
api.HandleFunc("/ping", PingHandler).Methods("GET", "OPTIONS") | |
api.HandleFunc("/health", HealthCheckHandler).Methods("GET", "OPTIONS") | |
api.HandleFunc("/status/{code}", ResponseStatusHandler).Methods("GET", "OPTIONS") | |
api.Handle("/metrics", promhttp.Handler()) | |
handler := c.Handler(router) | |
log.Fatal(http.ListenAndServe(":80", handler)) | |
} |
For this demonstration, the MongoDB databases will be hosted, external to the services on GCP, on MongoDB Atlas, a MongoDB-as-a-Service, cloud-based platform. Similarly, the RabbitMQ queues will be hosted on CloudAMQP, a RabbitMQ-as-a-Service, cloud-based platform. I have used both of these SaaS providers in several previous posts. Using external services will help us understand how Istio and its observability tools collect telemetry for communications between the Kubernetes cluster and external systems.
Shown below is the Go source code for Service F, This service consumers messages from the RabbitMQ queue, placed there by Service D, and writes the messages to MongoDB.
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
// author: Gary A. Stafford | |
// site: https://programmaticponderings.com | |
// license: MIT License | |
// purpose: Service F | |
package main | |
import ( | |
"bytes" | |
"context" | |
"encoding/json" | |
"github.com/banzaicloud/logrus-runtime-formatter" | |
"github.com/google/uuid" | |
"github.com/gorilla/mux" | |
log "github.com/sirupsen/logrus" | |
"github.com/streadway/amqp" | |
"go.mongodb.org/mongo-driver/mongo" | |
"go.mongodb.org/mongo-driver/mongo/options" | |
"net/http" | |
"os" | |
"time" | |
) | |
type Greeting struct { | |
ID string `json:"id,omitempty"` | |
ServiceName string `json:"service,omitempty"` | |
Message string `json:"message,omitempty"` | |
CreatedAt time.Time `json:"created,omitempty"` | |
} | |
var greetings []Greeting | |
func PingHandler(w http.ResponseWriter, r *http.Request) { | |
w.Header().Set("Content-Type", "application/json; charset=utf-8") | |
greetings = nil | |
tmpGreeting := Greeting{ | |
ID: uuid.New().String(), | |
ServiceName: "Service-F", | |
Message: "Hola, from Service-F!", | |
CreatedAt: time.Now().Local(), | |
} | |
greetings = append(greetings, tmpGreeting) | |
CallMongoDB(tmpGreeting) | |
err := json.NewEncoder(w).Encode(greetings) | |
if err != nil { | |
log.Error(err) | |
} | |
} | |
func HealthCheckHandler(w http.ResponseWriter, r *http.Request) { | |
w.Header().Set("Content-Type", "application/json; charset=utf-8") | |
_, err := w.Write([]byte("{\"alive\": true}")) | |
if err != nil { | |
log.Error(err) | |
} | |
} | |
func CallMongoDB(greeting Greeting) { | |
log.Info(greeting) | |
ctx, _ := context.WithTimeout(context.Background(), 10*time.Second) | |
client, err := mongo.Connect(ctx, options.Client().ApplyURI(os.Getenv("MONGO_CONN"))) | |
if err != nil { | |
log.Error(err) | |
} | |
defer client.Disconnect(nil) | |
collection := client.Database("service-f").Collection("messages") | |
ctx, _ = context.WithTimeout(context.Background(), 5*time.Second) | |
_, err = collection.InsertOne(ctx, greeting) | |
if err != nil { | |
log.Error(err) | |
} | |
} | |
func GetMessages() { | |
conn, err := amqp.Dial(os.Getenv("RABBITMQ_CONN")) | |
if err != nil { | |
log.Error(err) | |
} | |
defer conn.Close() | |
ch, err := conn.Channel() | |
if err != nil { | |
log.Error(err) | |
} | |
defer ch.Close() | |
q, err := ch.QueueDeclare( | |
"service-d", | |
false, | |
false, | |
false, | |
false, | |
nil, | |
) | |
if err != nil { | |
log.Error(err) | |
} | |
msgs, err := ch.Consume( | |
q.Name, | |
"service-f", | |
true, | |
false, | |
false, | |
false, | |
nil, | |
) | |
if err != nil { | |
log.Error(err) | |
} | |
forever := make(chan bool) | |
go func() { | |
for delivery := range msgs { | |
log.Debug(delivery) | |
CallMongoDB(deserialize(delivery.Body)) | |
} | |
}() | |
<-forever | |
} | |
func deserialize(b []byte) (t Greeting) { | |
log.Debug(b) | |
var tmpGreeting Greeting | |
buf := bytes.NewBuffer(b) | |
decoder := json.NewDecoder(buf) | |
err := decoder.Decode(&tmpGreeting) | |
if err != nil { | |
log.Error(err) | |
} | |
return tmpGreeting | |
} | |
func getEnv(key, fallback string) string { | |
if value, ok := os.LookupEnv(key); ok { | |
return value | |
} | |
return fallback | |
} | |
func init() { | |
formatter := runtime.Formatter{ChildFormatter: &log.JSONFormatter{}} | |
formatter.Line = true | |
log.SetFormatter(&formatter) | |
log.SetOutput(os.Stdout) | |
level, err := log.ParseLevel(getEnv("LOG_LEVEL", "info")) | |
if err != nil { | |
log.Error(err) | |
} | |
log.SetLevel(level) | |
} | |
func main() { | |
go GetMessages() | |
router := mux.NewRouter() | |
api := router.PathPrefix("/api").Subrouter() | |
api.HandleFunc("/ping", PingHandler).Methods("GET") | |
api.HandleFunc("/health", HealthCheckHandler).Methods("GET") | |
log.Fatal(http.ListenAndServe(":80", router)) | |
} |
Source Code
All source code for this post is available on GitHub in two projects. The Go-based microservices source code, all Kubernetes resources, and all deployment scripts are located in the k8s-istio-observe-backend project repository. The Angular UI TypeScript-based source code is located in the k8s-istio-observe-frontend project repository. You should not need to clone the Angular UI project for this demonstration.
git clone --branch master --single-branch --depth 1 --no-tags \ https://github.com/garystafford/k8s-istio-observe-backend.git
Docker images referenced in the Kubernetes Deployment
resource files, for the Go services and UI, are all available on Docker Hub. The Go microservice Docker images were built using the official Golang Alpine base image on DockerHub, containing Go version 1.12.0. Using the Alpine image to compile the Go source code ensures the containers will be as small as possible and contain a minimal attack surface.
System Requirements
To follow along with the post, you will need the latest version of gcloud
CLI (min. ver. 239.0.0), part of the Google Cloud SDK, Helm, and the just releases Istio 1.1.0, installed and configured locally or on your build machine.
Set-up and Installation
To deploy the microservices platform to GKE, we will proceed in the following order.
- Create the MongoDB Atlas database cluster;
- Create the CloudAMQP RabbitMQ cluster;
- Modify the Kubernetes resources and scripts for your own environments;
- Create the GKE cluster on GCP;
- Deploy Istio 1.1.0 to the GKE cluster, using Helm;
- Create DNS records for the platform’s exposed resources;
- Deploy the Go-based microservices, Angular UI, and associated resources to GKE;
- Test and troubleshoot the platform;
- Observe the results in part two!
MongoDB Atlas Cluster
MongoDB Atlas is a fully-managed MongoDB-as-a-Service, available on AWS, Azure, and GCP. Atlas, a mature SaaS product, offers high-availability, guaranteed uptime SLAs, elastic scalability, cross-region replication, enterprise-grade security, LDAP integration, a BI Connector, and much more.
MongoDB Atlas currently offers four pricing plans, Free, Basic, Pro, and Enterprise. Plans range from the smallest, M0-sized MongoDB cluster, with shared RAM and 512 MB storage, up to the massive M400 MongoDB cluster, with 488 GB of RAM and 3 TB of storage.
For this post, I have created an M2-sized MongoDB cluster in GCP’s us-central1 (Iowa) region, with a single user database account for this demo. The account will be used to connect from four of the eight microservices, running on GKE.
Originally, I started with an M0-sized cluster, but the compute resources were insufficient to support the volume of calls from the Go-based microservices. I suggest at least an M2-sized cluster or larger.
CloudAMQP RabbitMQ Cluster
CloudAMQP provides full-managed RabbitMQ clusters on all major cloud and application platforms. RabbitMQ will support a decoupled, eventually consistent, message-based architecture for a portion of our Go-based microservices. For this post, I have created a RabbitMQ cluster in GCP’s us-central1 (Iowa) region, the same as our GKE cluster and MongoDB Atlas cluster. I chose a minimally-configured free version of RabbitMQ. CloudAMQP also offers robust, multi-node RabbitMQ clusters for Production use.
Modify Configurations
There are a few configuration settings you will need to change in the GitHub project’s Kubernetes resource files and Bash deployment scripts.
Istio ServiceEntry for MongoDB Atlas
Modify the Istio ServiceEntry
, external-mesh-mongodb-atlas.yaml file, adding you MongoDB Atlas host address. This file allows egress traffic from four of the microservices on GKE to the external MongoDB Atlas cluster.
apiVersion: networking.istio.io/v1alpha3 kind: ServiceEntry metadata: name: mongodb-atlas-external-mesh spec: hosts: - {{ your_host_goes_here }} ports: - name: mongo number: 27017 protocol: MONGO location: MESH_EXTERNAL resolution: NONE
Istio ServiceEntry for CloudAMQP RabbitMQ
Modify the Istio ServiceEntry
, external-mesh-cloudamqp.yaml file, adding you CloudAMQP host address. This file allows egress traffic from two of the microservices to the CloudAMQP cluster.
apiVersion: networking.istio.io/v1alpha3 kind: ServiceEntry metadata: name: cloudamqp-external-mesh spec: hosts: - {{ your_host_goes_here }} ports: - name: rabbitmq number: 5672 protocol: TCP location: MESH_EXTERNAL resolution: NONE
Istio Gateway and VirtualService Resources
There are numerous strategies you may use to route traffic into the GKE cluster, via Istio. I am using a single domain for the post, example-api.com
, and four subdomains. One set of subdomains is for the Angular UI, in the dev
Namespace (ui.dev.example-api.com
) and the test
Namespace (ui.test.example-api.com
). The other set of subdomains is for the edge API microservice, Service A, which the UI calls (api.dev.example-api.com
and api.test.example-api.com
). Traffic is routed to specific Kubernetes Service
resources, based on the URL.
According to Istio, the Gateway
describes a load balancer operating at the edge of the mesh, receiving incoming or outgoing HTTP/TCP connections. Modify the Istio ingress Gateway
, inserting your own domains or subdomains in the hosts
section. These are the hosts on port 80 that will be allowed into the mesh.
apiVersion: networking.istio.io/v1alpha3 kind: Gateway metadata: name: demo-gateway spec: selector: istio: ingressgateway servers: - port: number: 80 name: http protocol: HTTP hosts: - ui.dev.example-api.com - ui.test.example-api.com - api.dev.example-api.com - api.test.example-api.com
According to Istio, a VirtualService
defines a set of traffic routing rules to apply when a host is addressed. A VirtualService
is bound to a Gateway
to control the forwarding of traffic arriving at a particular host and port. Modify the project’s four Istio VirtualServices
, inserting your own domains or subdomains. Here is an example of one of the four VirtualServices
, in the istio-gateway.yaml file.
apiVersion: networking.istio.io/v1alpha3 kind: VirtualService metadata: name: angular-ui-dev spec: hosts: - ui.dev.example-api.com