Posts Tagged GCP
IoT Telemetry Collection using Google Protocol Buffers, Google Cloud Functions, Cloud Pub/Sub, and MongoDB Atlas
Posted by Gary A. Stafford in Big Data, Cloud, GCP, Python, Serverless, Software Development on May 21, 2019
Collect IoT sensor telemetry using Google Protocol Buffers’ serialized binary format over HTTPS, serverless Google Cloud Functions, Google Cloud Pub/Sub, and MongoDB Atlas on GCP, as an alternative to integrated Cloud IoT platforms and standard IoT protocols. Aggregate, analyze, and build machine learning models with the data using tools such as MongoDB Compass, Jupyter Notebooks, and Google’s AI Platform Notebooks.
Introduction
Most of the dominant Cloud providers offer integrated IoT (Internet of Things) and IIoT (Industrial IoT) services. Amazon has AWS IoT, Microsoft Azure has multiple offerings, including IoT Central, IBM's offerings include the IBM Watson IoT Platform, Alibaba Cloud has multiple IoT/IIoT solutions for different vertical markets, and Google offers the Google Cloud IoT platform. All of these solutions are marketed as industrial-grade, highly performant, scalable technology stacks. They are capable of scaling to tens of thousands of IoT devices or more, and of handling massive amounts of streaming telemetry.
In reality, not everyone needs a fully integrated IoT solution. Academic institutions, research labs, tech start-ups, and many commercial enterprises want to leverage the Cloud for IoT applications, but may not be ready for a fully-integrated IoT platform or are resistant to Cloud vendor platform lock-in.
Similarly, depending on the performance requirements and the type of application, organizations may not need or want to start out using IoT/IIoT industry-standard data and transport protocols, such as MQTT (Message Queuing Telemetry Transport) or CoAP (Constrained Application Protocol) over UDP (User Datagram Protocol). They may prefer to transmit telemetry over HTTP using TCP, or securely, using HTTPS (HTTP over TLS).
Demonstration
In this demonstration, we will collect environmental sensor data from a number of IoT device sensors and stream that telemetry over the Internet to Google Cloud. Each IoT device is installed in a different physical location. The devices contain a variety of common sensors, including humidity and temperature, motion, and light intensity.

Prototype IoT Devices used in this Demonstration
We will transmit the sensor telemetry data as JSON over HTTP to serverless Google Cloud Function HTTPS endpoints. We will then switch to using Google’s Protocol Buffers to transmit binary data over HTTP. We should observe a reduction in the message size contained in the request payload as we move from JSON to Protobuf, which should reduce system latency and cost.
Data received by Cloud Functions over HTTP will be published asynchronously to Google Cloud Pub/Sub. A second Cloud Function will respond to all published events and push the messages to MongoDB Atlas on GCP. Once in Atlas, we will aggregate, transform, analyze, and build machine learning models with the data, using tools such as MongoDB Compass, Jupyter Notebooks, and Google’s AI Platform Notebooks.
For this demonstration, the architecture for JSON over HTTP will look as follows. All sensors will transmit data to a single Cloud Function HTTPS endpoint.
For Protobuf over HTTP, the demonstration's architecture will look as follows. Each type of sensor will transmit data to a different Cloud Function HTTPS endpoint.
Although the Cloud Functions will automatically scale horizontally to accommodate additional load created by the volume of telemetry being received, there are also other options to scale the system. For example, we could create individual pipelines of functions and topic/subscriptions for each sensor type. We could also split the telemetry data across multiple MongoDB Atlas Collections, based on sensor type, instead of a single collection. In all cases, we will still benefit from the Cloud Function’s horizontal scaling capabilities.
Source Code
All source code is available on GitHub. Use the following command to clone the project.
git clone \
  --branch master --single-branch --depth 1 --no-tags \
  https://github.com/garystafford/iot-protobuf-demo.git
You will need to adjust the project’s environment variables to fit your own development and Cloud environments. All source code for this post is written in Python. It is intended for Python 3 interpreters but has been tested using Python 2 interpreters. The project’s Jupyter Notebooks can be viewed from within the project on GitHub or using the free, online Jupyter nbviewer.
Technologies
Protocol Buffers
According to Google, Protocol Buffers (aka Protobuf) are a language- and platform-neutral, efficient, extensible, automated mechanism for serializing structured data for use in communications protocols, data storage, and more. Protocol Buffers are 3 to 10 times smaller and 20 to 100 times faster than XML.
Each protocol buffer message is a small logical record of information, containing a series of strongly-typed name-value pairs. Once you have defined your messages, you run the protocol buffer compiler for your application's language on your .proto file to generate data access classes.
Google Cloud Functions
According to Google, Cloud Functions is Google's event-driven, serverless compute platform. Key features of Cloud Functions include automatic scaling, high availability, fault tolerance, no servers to provision, manage, patch, or update, paying only while your code runs, and easy connection to and extension of other cloud services. Cloud Functions natively support multiple event types, including HTTP, Cloud Pub/Sub, Cloud Storage, and Firebase. Current language support includes Python, Go, and Node.js.
Google Cloud Pub/Sub
According to Google, Cloud Pub/Sub is an enterprise message-oriented middleware for the Cloud. It is a scalable, durable event ingestion and delivery system. By providing many-to-many, asynchronous messaging that decouples senders and receivers, it allows for secure and highly available communication among independent applications. Cloud Pub/Sub delivers low-latency, durable messaging that integrates with systems hosted on the Google Cloud Platform and externally.
MongoDB Atlas
MongoDB Atlas is a fully-managed MongoDB-as-a-Service, available on AWS, Azure, and GCP. Atlas, a mature SaaS product, offers high-availability, uptime service-level agreements, elastic scalability, cross-region replication, enterprise-grade security, LDAP integration, BI Connector, and much more.
MongoDB Atlas currently offers four pricing plans, Free, Basic, Pro, and Enterprise. Plans range from the smallest, free M0-sized MongoDB cluster, with shared RAM and 512 MB storage, up to the massive M400 MongoDB cluster, with 488 GB of RAM and 3 TB of storage.
Cost Effectiveness of Cloud Functions
At true IIoT scale, Google Cloud Functions may not be the most efficient or cost-effective method of ingesting telemetry data. Based on Google's pricing model, you get two million free function invocations per month, with each additional million invocations costing USD $0.40. The total cost also includes memory usage, total compute time, and outbound data transfer. If your system is composed of tens or hundreds of IoT devices, Cloud Functions may prove cost-effective.
However, with thousands of devices or more, each transmitting data multiple times per minute, you could quickly outgrow the cost-effectiveness of Cloud Functions. In that case, you might look to Google's Google Cloud IoT platform. Alternately, you can build your own platform with Google products such as Knative, letting you choose to run your containers either fully managed with the newly-released Cloud Run, or in your Google Kubernetes Engine cluster with Cloud Run on GKE.
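To put the free tier in perspective, a quick back-of-the-envelope calculation helps. The sketch below is purely illustrative, using a hypothetical fleet size and only the per-invocation charge; it ignores the compute, memory, and egress components of the bill.

# Rough invocation-cost estimate for Cloud Functions (illustrative only;
# ignores compute time, memory, and network egress charges)
devices = 1000                    # hypothetical fleet size
per_minute = 4                    # one message every 15 seconds
invocations = devices * per_minute * 60 * 24 * 30   # per month
free_tier = 2_000_000
rate_per_million = 0.40

billable = max(invocations - free_tier, 0)
cost = billable / 1_000_000 * rate_per_million
print(f'{invocations:,} invocations/month, ~${cost:.2f} in invocation charges')
# 172,800,000 invocations/month, ~$68.32 in invocation charges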
Sensor Scripts
For each sensor type, I have developed separate Python scripts, which run on each IoT device. There are two versions of each script, one for JSON over HTTP and one for Protobuf over HTTP.
JSON over HTTPS
Below we see the script, dht_sensor_http_json.py, used to transmit humidity and temperature data via JSON over HTTP to a Google Cloud Function running on GCP. The JSON request payload contains a timestamp, IoT device ID, device type, and the temperature and humidity sensor readings. The URL for the Google Cloud Function is stored as an environment variable, local to the IoT devices, and set when the script is deployed.
import json
import logging
import os
import socket
import sys
import time

import Adafruit_DHT
import requests

URL = os.environ.get('GCF_URL')
JWT = os.environ.get('JWT')
SENSOR = Adafruit_DHT.DHT22
TYPE = 'DHT22'
PIN = 18
FREQUENCY = 15


def main():
    if not URL or not JWT:
        sys.exit("Are the Environment Variables set?")
    get_sensor_data(socket.gethostname())


def get_sensor_data(device_id):
    while True:
        humidity, temperature = Adafruit_DHT.read_retry(SENSOR, PIN)
        payload = {'device': device_id,
                   'type': TYPE,
                   'timestamp': time.time(),
                   'data': {'temperature': temperature,
                            'humidity': humidity}}
        post_data(payload)
        time.sleep(FREQUENCY)


def post_data(payload):
    payload = json.dumps(payload)
    headers = {
        'Content-Type': 'application/json; charset=utf-8',
        'Authorization': JWT
    }
    try:
        # payload is already a JSON-encoded string, so send it as the raw
        # request body (data=), not the json= argument, which would
        # JSON-encode it a second time
        requests.post(URL, data=payload, headers=headers)
    except requests.exceptions.ConnectionError:
        logging.error('Error posting data to Cloud Function!')
    except requests.exceptions.MissingSchema:
        logging.error('Error posting data to Cloud Function! Are Environment Variables set?')


if __name__ == '__main__':
    sys.exit(main())
Telemetry Frequency
Although the sensors are capable of producing data many times per minute, for this demonstration, sensor telemetry is intentionally limited to only being transmitted every 15 seconds. To reduce system complexity, potential latency, back-pressure, and cost, in my opinion, you should only produce telemetry data at the frequency your requirements dictate.
JSON Web Tokens
For security, in addition to the HTTPS endpoints exposed by the Google Cloud Functions, I have incorporated the use of a JSON Web Token (JWT). JSON Web Tokens are an open, industry-standard RFC 7519 method for representing claims securely between two parties. In this case, the JWT is used to verify the identity of the sensor scripts sending telemetry to the Cloud Functions. The JWT contains an id, password, and expiration, all signed with a secret key, which is known to each Cloud Function, in order to verify the IoT device's identity. Without the correct JWT being passed in the Authorization header, the request to the Cloud Function will fail with an HTTP status code of 401 Unauthorized. Below is an example of the JWT's payload data.
{ "sub": "IoT Protobuf Serverless Demo", "id": "iot-demo-key", "password": "t7J2gaQHCFcxMD6584XEpXyzWhZwRrNJ", "iat": 1557407124, "exp": 1564664724 }
For this demonstration, I created a temporary JWT using jwt.io. The HTTP Functions use PyJWT, a Python library that allows you to encode and decode JWTs. The PyJWT library allows the Function to decode and validate the JWT (Bearer Token) from the incoming request's Authorization header. The JWT token is stored as an environment variable. Deployment instructions are included in the GitHub project.
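For reference, a token with the payload shown above could be generated programmatically with PyJWT, along these lines. The secret key below is a placeholder, not the key used in the demonstration.

import jwt  # the PyJWT package

SECRET_KEY = 'my-demo-secret-key'  # placeholder value

payload = {
    'sub': 'IoT Protobuf Serverless Demo',
    'id': 'iot-demo-key',
    'password': 't7J2gaQHCFcxMD6584XEpXyzWhZwRrNJ',
    'iat': 1557407124,
    'exp': 1564664724
}

# HS256 is PyJWT's default signing algorithm
token = jwt.encode(payload, SECRET_KEY, algorithm='HS256')
print(token)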
JSON Payload
Below is a typical JSON request payload (pretty-printed), containing DHT sensor data. This particular message is 148 bytes in size. The message format is intentionally reader-friendly. We could certainly shorten the message’s key fields, to reduce the payload size by an additional 15-20 bytes.
{ "device": "rp829c7e0e", "type": "DHT22", "timestamp": 1557585090.476025, "data": { "temperature": 17.100000381469727, "humidity": 68.0999984741211 } }
Protocol Buffers
For the demonstration, I have built a Protocol Buffers file, sensors.proto, to support the data output by three sensor types: digital humidity and temperature (DHT), passive infrared sensor (PIR), and digital light intensity (DLI). I am using the newer proto3 version of the Protocol Buffers language. I have created a common Protobuf sensor message schema, with the variable sensor telemetry stored in the nested data object, within each message type.
It is important to use the correct Protobuf Scalar Value Type to maintain numeric precision in the language you compile for. For simplicity, I am using a double to represent the timestamp, as well as the numeric humidity and temperature readings. Alternately, you could choose Google's Protobuf Well-Known Types, such as Timestamp, to store the timestamp.
syntax = "proto3"; package sensors; // DHT22 message SensorDHT { string device = 1; string type = 2; double timestamp = 3; DataDHT data = 4; } message DataDHT { double temperature = 1; double humidity = 2; } // Onyehn_PIR message SensorPIR { string device = 1; string type = 2; double timestamp = 3; DataPIR data = 4; } message DataPIR { bool motion = 1; } // Anmbest_MD46N message SensorDLI { string device = 1; string type = 2; double timestamp = 3; DataDLI data = 4; } message DataDLI { bool light = 1; }
Since the sensor data will be captured with scripts written in Python 3, the Protocol Buffers file is compiled for Python, resulting in the file sensors_pb2.py.
protoc --python_out=. sensors.proto
Protocol Buffers over HTTPS
Below we see the alternate DHT sensor script, dht_sensor_http_pb.py, which transmits a Protocol Buffers-based binary request payload over HTTPS to a Google Cloud Function running on GCP. Note the request's Content-Type header has been changed from application/json to application/x-protobuf. In this case, instead of JSON, the same data fields are stored in an instance of the Protobuf's SensorDHT message type (sensors_pb2.SensorDHT()). Note the import sensors_pb2 statement. This statement imports the compiled Protocol Buffers file, which is stored locally to the script on the IoT device.
import logging
import os
import socket
import sys
import time

import Adafruit_DHT
import requests

import sensors_pb2

URL = os.environ.get('GCF_DHT_URL')
JWT = os.environ.get('JWT')
SENSOR = Adafruit_DHT.DHT22
TYPE = 'DHT22'
PIN = 18
FREQUENCY = 15


def main():
    if not URL or not JWT:
        sys.exit("Are the Environment Variables set?")
    get_sensor_data(socket.gethostname())


def get_sensor_data(device_id):
    while True:
        try:
            humidity, temperature = Adafruit_DHT.read_retry(SENSOR, PIN)
            sensor_dht = sensors_pb2.SensorDHT()
            sensor_dht.device = device_id
            sensor_dht.type = TYPE
            sensor_dht.timestamp = time.time()
            sensor_dht.data.temperature = temperature
            sensor_dht.data.humidity = humidity
            payload = sensor_dht.SerializeToString()
            post_data(payload)
            time.sleep(FREQUENCY)
        except TypeError:
            logging.error('Error getting sensor data!')


def post_data(payload):
    headers = {
        'Content-Type': 'application/x-protobuf',
        'Authorization': JWT
    }
    try:
        requests.post(URL, data=payload, headers=headers)
    except requests.exceptions.ConnectionError:
        logging.error('Error posting data to Cloud Function!')
    except requests.exceptions.MissingSchema:
        logging.error('Error posting data to Cloud Function! Are Environment Variables set?')


if __name__ == '__main__':
    sys.exit(main())
Protobuf Binary Payload
To understand the binary Protocol Buffers-based payload, we can write a sample SensorDHT message to a file on disk as a byte array.
message = sensorDHT.SerializeToString()

binary_file_output = open("./data_binary.txt", "wb")
file_byte_array = bytearray(message)
binary_file_output.write(file_byte_array)
Then, using the hexdump command, we can view a representation of the binary data file.
> hexdump -C data_binary.txt
00000000  0a 08 38 32 39 63 37 65  30 65 12 05 44 48 54 32  |..829c7e0e..DHT2|
00000010  32 1d 05 a0 b9 4e 22 0a  0d ec 51 b2 41 15 cd cc  |2....N"...Q.A...|
00000020  38 42                                             |8B|
00000022
The binary data file size is 48 bytes on disk, as compared to the equivalent JSON file size of 148 bytes on disk (32% the size). As a test, we could then send that binary data file as the payload of a POST to the Cloud Function, as shown below using Postman. Postman will serialize the binary data file’s contents to a binary string before transmitting.
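If you prefer a script to Postman, a minimal sketch of the same test might look like the following, assuming the Cloud Function URL and JWT are set in the environment, as in the sensor scripts.

import os

import requests

URL = os.environ.get('GCF_DHT_URL')  # Cloud Function HTTPS endpoint
JWT = os.environ.get('JWT')

with open('./data_binary.txt', 'rb') as file:
    payload = file.read()

response = requests.post(URL, data=payload, headers={
    'Content-Type': 'application/x-protobuf',
    'Authorization': JWT
})
print(response.status_code)  # expect 201 Created on success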
Similarly, we can serialize the same binary Protocol Buffers-based SensorDHT message to a binary string using the SerializeToString method.
message = sensorDHT.SerializeToString()
print(message)
The resulting binary string resembles the following.
b'\n\nrp829c7e0e\x12\x05DHT22\x19c\xee\xbcg\xf5\x8e\xccA"\x12\t\x00\x00\x00\xa0\x99\x191@\x11\x00\x00\x00`f\x06Q@'
The binary string length of the serialized message, and therefore the request payload sent by Postman and received by the Cloud Function for this particular message, is 111 bytes, as compared to the JSON payload size of 148 bytes (75% the size).
Validate Protobuf Payload
To validate that the data contained in the Protobuf payload is identical to the JSON payload, we can parse the payload from the serialized binary string using the Protobuf ParseFromString method. We then convert it to JSON using the Protobuf MessageToJson method.
message = sensorDHT.SerializeToString()

message_parsed = sensors_pb2.SensorDHT()
message_parsed.ParseFromString(message)
print(MessageToJson(message_parsed))
The resulting JSON object is identical to the JSON payload sent using JSON over HTTPS, earlier in the demonstration.
{ "device": "rp829c7e0e", "type": "DHT22", "timestamp": 1557585090.476025, "data": { "temperature": 17.100000381469727, "humidity": 68.0999984741211 } }
Google Cloud Functions
There are a series of Google Cloud Functions, specifically four HTTP Functions, which accept the sensor data over HTTP from the IoT devices. Each function exposes an HTTPS endpoint. According to Google, you use HTTP functions when you want to invoke your function via an HTTP(S) request. To allow for HTTP semantics, HTTP function signatures accept HTTP-specific arguments.
Below, I have deployed a single function that accepts JSON sensor telemetry from all sensor types, and three functions for Protobuf, one for each sensor type: DHT, PIR, and DLI.
JSON Message Processing
Below, we see the Cloud Function, main.py, which processes the incoming JSON over HTTPS payload from all sensor types. Once the request’s JWT is validated, the JSON message payload is serialized to a byte string and sent to a common Google Cloud Pub/Sub Topic. Note the JWT secret key, id, and password, and the Google Cloud Pub/Sub Topic are all stored as environment variables, local to the Cloud Functions. In my tests, the JSON-based HTTP Functions took an average of 9–18 ms to execute successfully.
import logging
import os

import jwt
from flask import make_response, jsonify
from flask_api import status
from google.cloud import pubsub_v1

TOPIC = os.environ.get('TOPIC')
SECRET_KEY = os.getenv('SECRET_KEY')
ID = os.getenv('ID')
PASSWORD = os.getenv('PASSWORD')


def incoming_message(request):
    if not validate_token(request):
        return make_response(jsonify({'success': False}),
                             status.HTTP_401_UNAUTHORIZED,
                             {'ContentType': 'application/json'})

    request_json = request.get_json()
    if not request_json:
        return make_response(jsonify({'success': False}),
                             status.HTTP_400_BAD_REQUEST,
                             {'ContentType': 'application/json'})

    send_message(request_json)

    return make_response(jsonify({'success': True}),
                         status.HTTP_201_CREATED,
                         {'ContentType': 'application/json'})


def validate_token(request):
    auth_header = request.headers.get('Authorization')
    if not auth_header:
        return False
    auth_token = auth_header.split(" ")[1]
    if not auth_token:
        return False
    try:
        payload = jwt.decode(auth_token, SECRET_KEY)
        if payload['id'] == ID and payload['password'] == PASSWORD:
            return True
    except jwt.ExpiredSignatureError:
        return False
    except jwt.InvalidTokenError:
        return False


def send_message(message):
    publisher = pubsub_v1.PublisherClient()
    publisher.publish(topic=TOPIC, data=bytes(str(message), 'utf-8'))
The Cloud Functions are deployed to GCP using the gcloud functions deploy CLI command (I use Jenkins to automate the deployments). I have wrapped each deploy command in a bash script, which also copies over a common environment variables YAML file, consumed by the Cloud Function. Each Function has a deployment script, included in the project.
# get latest env vars file
cp -f ./../env_vars_file/env.yaml .

# deploy function
gcloud functions deploy http_json_to_pubsub \
  --runtime python37 \
  --trigger-http \
  --region us-central1 \
  --memory 256 \
  --entry-point incoming_message \
  --env-vars-file env.yaml
Using a .gcloudignore file, the gcloud functions deploy CLI command deploys just three files: the Cloud Function (main.py), the required Python packages file (requirements.txt), and the environment variables file (env.yaml). Google automatically installs dependencies using the requirements.txt file.
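The project's .gcloudignore is not reproduced in this post; a minimal version that achieves that behavior might look like the following (the exact contents are an assumption on my part). Like .gitignore, the file supports negation patterns.

# .gcloudignore - exclude everything except the three deployed files
*
!main.py
!requirements.txt
!env.yaml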
Protobuf Message Processing
Below, we see the Cloud Function, main.py, which processes the incoming Protobuf over HTTPS payload from DHT sensor types. Once the sensor data Protobuf message payload is received by the HTTP Function, it is parsed, converted to JSON, and then serialized to a byte string. The byte string is then sent to a Google Cloud Pub/Sub Topic. In my tests, the Protobuf-based HTTP Functions took an average of 7–14 ms to execute successfully.
As before, note the import sensors_pb2 statement. This statement imports the compiled Protocol Buffers file, which in this case is stored locally with the Cloud Function's code. It is used to parse the serialized message back into its original Protobuf SensorDHT message type.
import logging
import os

import jwt
import sensors_pb2
from flask import make_response, jsonify
from flask_api import status
from google.cloud import pubsub_v1
from google.protobuf.json_format import MessageToJson

TOPIC = os.environ.get('TOPIC')
SECRET_KEY = os.getenv('SECRET_KEY')
ID = os.getenv('ID')
PASSWORD = os.getenv('PASSWORD')


def incoming_message(request):
    if not validate_token(request):
        return make_response(jsonify({'success': False}),
                             status.HTTP_401_UNAUTHORIZED,
                             {'ContentType': 'application/json'})

    data = request.get_data()
    if not data:
        return make_response(jsonify({'success': False}),
                             status.HTTP_400_BAD_REQUEST,
                             {'ContentType': 'application/json'})

    sensor_pb = sensors_pb2.SensorDHT()
    sensor_pb.ParseFromString(data)
    sensor_json = MessageToJson(sensor_pb)
    send_message(sensor_json)

    return make_response(jsonify({'success': True}),
                         status.HTTP_201_CREATED,
                         {'ContentType': 'application/json'})


def validate_token(request):
    auth_header = request.headers.get('Authorization')
    if not auth_header:
        return False
    auth_token = auth_header.split(" ")[1]
    if not auth_token:
        return False
    try:
        payload = jwt.decode(auth_token, SECRET_KEY)
        if payload['id'] == ID and payload['password'] == PASSWORD:
            return True
    except jwt.ExpiredSignatureError:
        return False
    except jwt.InvalidTokenError:
        return False


def send_message(message):
    publisher = pubsub_v1.PublisherClient()
    publisher.publish(topic=TOPIC, data=bytes(message, 'utf-8'))
Cloud Pub/Sub Functions
In addition to HTTP Functions, the demonstration uses a function triggered by Google Cloud Pub/Sub Triggers. According to Google, Cloud Functions can be triggered by messages published to Cloud Pub/Sub Topics in the same GCP project as the function. The function automatically subscribes to the Topic. Below, we see that the function has automatically subscribed to the iot-data-demo Cloud Pub/Sub Topic.
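Deploying a Pub/Sub-triggered function is nearly identical to deploying the HTTP Functions, substituting a topic trigger for the HTTP trigger. Below is a hedged sketch; the function name is hypothetical, while the topic and entry point match the demonstration.

gcloud functions deploy pubsub_to_mongodb \
  --runtime python37 \
  --trigger-topic iot-data-demo \
  --region us-central1 \
  --memory 256 \
  --entry-point read_message \
  --env-vars-file env.yaml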
Sending Telemetry to MongoDB Atlas
The common Cloud Function, triggered by messages published to Cloud Pub/Sub, then sends the messages to MongoDB Atlas. There is a minimal amount of cleanup required to re-format the Cloud Pub/Sub messages to BSON (binary JSON). Interestingly, according to bsonspec.org, BSON can be compared to binary interchange formats, like Protocol Buffers. BSON is more schema-less than Protocol Buffers, which can give it an advantage in flexibility but also a slight disadvantage in space efficiency (BSON has overhead for field names within the serialized data).
The function uses PyMongo to connect to MongoDB Atlas. According to their website, PyMongo is a Python distribution containing tools for working with MongoDB and is the recommended way to work with MongoDB from Python.
import base64
import json
import logging
import os

import pymongo

MONGODB_CONN = os.environ.get('MONGODB_CONN')
MONGODB_DB = os.environ.get('MONGODB_DB')
MONGODB_COL = os.environ.get('MONGODB_COL')


def read_message(event, context):
    message = base64.b64decode(event['data']).decode('utf-8')
    message = message.replace("'", '"')
    message = message.replace('True', 'true')
    message = json.loads(message)

    client = pymongo.MongoClient(MONGODB_CONN)
    db = client[MONGODB_DB]
    col = db[MONGODB_COL]
    col.insert_one(message)
The function responds to the published events and sends the messages to the MongoDB Atlas cluster, running in the same Region, us-central1, as the Cloud Functions and Pub/Sub Topic. Below, we see the current options available when provisioning an Atlas cluster.
MongoDB Atlas provides a rich, web-based UI for managing and monitoring MongoDB clusters, databases, collections, security, and performance.
Although Cloud Pub/Sub to Atlas function execution times are longer in duration than the HTTP functions, the latency is greatly reduced by locating the Cloud Pub/Sub Topic, Cloud Functions, and MongoDB Atlas cluster into the same GCP Region. Cross-region execution times were as high as 500-600 ms, while same-region execution times averaged 200-225 ms. Selecting a more performant Atlas cluster would likely result in even lower function execution times.
Aggregating Data with MongoDB Compass
MongoDB Compass is a free, convenient, desktop application for interacting with your MongoDB databases. You can view the collected sensor data, review message (document) schema, manage indexes, and build complex MongoDB aggregations.
When performing analytics or machine learning, I primarily use MongoDB Compass to preview the captured telemetry data and build aggregation pipelines. Aggregation operations process data records and return computed results. This feature saves a ton of time filtering and preparing data for further analysis, visualization, and machine learning with Jupyter Notebooks.
Aggregation pipelines can be directly exported to Java, Node, C#, and Python 3. The exported aggregation pipeline code can be placed directly into your Python applications and Jupyter Notebooks.
Below, the exported aggregation pipeline code from MongoDB Compass is used to load a resultset directly into a Pandas DataFrame. This particular aggregation returns time-series DHT sensor data from a specific IoT device over a 72-hour period.
DEVICE_1 = 'rp59adf374'

pipeline = [
    {
        '$match': {
            'type': 'DHT22',
            'device': DEVICE_1,
            'timestamp': {
                '$gt': 1557619200,
                '$lt': 1557792000
            }
        }
    }, {
        '$project': {
            '_id': 0,
            'timestamp': 1,
            'temperature': '$data.temperature',
            'humidity': '$data.humidity'
        }
    }, {
        '$sort': {
            'timestamp': 1
        }
    }
]

aggResult = iot_data.aggregate(pipeline)
df1 = pd.DataFrame(list(aggResult))
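The iot_data variable above is a handle to the Atlas collection, which the Compass export does not include. It could be obtained with PyMongo along these lines, reusing the same environment variable names as the Cloud Function; the values themselves are placeholders.

import os

import pymongo

# connection details come from environment variables, as elsewhere in the project
client = pymongo.MongoClient(os.environ.get('MONGODB_CONN'))
db = client[os.environ.get('MONGODB_DB')]
iot_data = db[os.environ.get('MONGODB_COL')]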
MongoDB Atlas Performance
In this demonstration, from Python 3-based Jupyter Notebooks, I was able to consistently query a MongoDB Atlas collection of almost 70k documents for resultsets containing three days (72 hours) of digital temperature and humidity data, roughly 10.2k documents, in an average of 825 ms. That is a round trip from my local development laptop to MongoDB Atlas running on GCP, in a different geographic region.
Query times on GCP are much faster, such as when running a Notebook in JupyterLab on Google’s AI Platform, or a PySpark job with Cloud Dataproc, against Atlas. Running the same Jupyter Notebook directly on Google’s AI Platform, the same MongoDB Atlas query took an average of 450 ms versus 825 ms (1.83x faster). This was across two different GCP Regions; same Region times should be even faster.
GCP Observability
There are several choices for observing the system’s Google Cloud Functions, Google Cloud Pub/Sub, and MongoDB Atlas. As shown above, the GCP Cloud Functions interface lets you see the individual function executions, execution times, memory usage, and active instances, over varying time intervals.
For a more detailed view of Google Cloud Functions and Google Cloud Pub/Sub, I built two custom dashboards using Stackdriver. According to Google, Stackdriver aggregates metrics, logs, and events from infrastructure, giving developers and operators a rich set of observable signals. I built a custom Stackdriver Cloud Functions dashboard (shown below) and a Cloud Pub/Sub Topics and Subscriptions dashboard.
For functions, I chose to display execution times, memory usage, the number of executions, and network egress, all in a single pane of glass, using four graphs. Below, I am using the 95th percentile average for monitoring. The 95th percentile asserts that 95% of the time, the observed values are below this amount and the remaining 5% of the time, the observed values are above that amount.
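As a quick illustration of the metric itself, computing a 95th percentile over a set of sampled execution times takes a single NumPy call; the sample values below are hypothetical.

import numpy as np

# hypothetical function execution times, in milliseconds
execution_times_ms = [9, 11, 10, 14, 12, 18, 10, 13, 11, 95]
p95 = np.percentile(execution_times_ms, 95)
print(p95)  # 95% of observed executions completed at or below this value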
Data Analysis using Jupyter Notebooks
According to jupyter.org, the Jupyter Notebook is an open-source web application that allows you to create and share documents that contain live code, equations, visualizations, and narrative text. Uses include data cleaning and transformation, numerical simulation, statistical modeling, data visualization, machine learning, and much more. The widespread use of Jupyter Notebooks has grown significantly, as Big Data, AI, and ML have all experienced explosive growth.
PyCharm
JetBrains PyCharm, my favorite Python IDE, has direct integrations with Jupyter Notebooks. In fact, PyCharm’s most recent updates to the Professional Edition greatly enhanced those integrations. PyCharm offers round-trip editing in the IDE and the Jupyter Notebook web browser interface. PyCharm allows you to run and debug individual cells within the notebook. PyCharm automatically starts the Jupyter Server and appropriate kernel for the Notebook you have opened. And, one of my favorite features, PyCharm’s variable viewer tracks the current value of a variable, automatically.
Below, we see the example Analytics Notebook, included in the demonstration's project, displayed in PyCharm 19.1.2 (Professional Edition). Working effectively with Notebooks in PyCharm really requires a full-size monitor. Working on a laptop with PyCharm's crowded Notebook UI is workable, but certainly not as effective as on a larger monitor.
Jupyter Notebook Server
Below, we see the same Analytics Notebook, shown above in PyCharm, opened in Jupyter Notebook Server’s web-based client interface, running locally on the development workstation. The web browser-based interface also offers a rich set of features for Notebook development.
From within the Notebook, we are able to query the data from MongoDB Atlas, again using PyMongo, and load the resultsets into Pandas DataFrames. As an alternative to hard-coded values and environment variables, with Notebooks, I use the python-dotenv Python package. This package allows me to place my environment variables in a common .env file and reference them from any Notebook. The package has many options for managing environment variables.
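A minimal sketch of that pattern follows; the .env contents shown in the comments are placeholder values.

# .env (placeholder values)
#   MONGODB_CONN=mongodb+srv://user:password@my-cluster.mongodb.net/test?retryWrites=true
#   MONGODB_DB=iot_demo
#   MONGODB_COL=iot_data

import os

from dotenv import load_dotenv

load_dotenv()  # reads key/value pairs from .env into os.environ
conn = os.environ.get('MONGODB_CONN')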
We can then analyze the data using a number of common frameworks, including Pandas, Matplotlib, SciPy, PySpark, and NumPy, to name but a few. Below, we see time-series data from four different sensors on the same IoT device. Viewing the data together, we can study the causal effect of one environmental variable on another, such as the impact of light on temperature or humidity.
Below, we can use histograms to visualize temperature frequencies for intervals, over time, for a given device location.
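A Matplotlib sketch of such a histogram, assuming the df1 DataFrame built from the aggregation above, might look like this; the bin count is arbitrary.

import matplotlib.pyplot as plt

# bucket the observed temperatures into 30 bins and plot their frequencies
df1['temperature'].plot.hist(bins=30, alpha=0.75)
plt.xlabel('Temperature')
plt.ylabel('Frequency')
plt.title('Temperature Frequencies: 72 Hours, Single Device')
plt.show()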
Machine Learning using Jupyter Notebooks
In addition to data analytics, we can use Jupyter Notebooks with tools such as scikit-learn to build machine learning models based on our sensor telemetry. Scikit-learn is a set of machine learning tools in Python, built on NumPy, SciPy, and matplotlib. Below, I have used JupyterLab on Google’s AI Platform and scikit-learn to build several models, based on the sensor data.
Using scikit-learn, we can build models to predict such things as which IoT device generated a specific temperature and humidity reading, or predict the temperature and humidity given the time of day, device location, and external environmental conditions, or discover anomalies in the sensor telemetry.
Scikit-learn makes it easy to construct randomized training and test datasets, to build models, using data from multiple IoT devices, as shown below.
The project includes a Jupyter Notebook that demonstrates how to build several ML models using sensor data. Examples of supervised learning algorithms used to build the classification models in this demonstration include Support Vector Machine (SVM), k-nearest neighbors (k-NN), and Random Forest Classifier.
Having data from multiple sensors, we are able to enrich the ML models by adding additional categorical (discrete) features to our training data. For example, we could look at the effect of light, motion, and time of day on temperature and humidity.
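Below is a hedged sketch of one such model, predicting the originating device from its telemetry with a Random Forest Classifier. The DataFrame df and its feature columns are illustrative, not the notebook's exact code.

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# df contains combined telemetry from multiple devices; features are illustrative
X = df[['temperature', 'humidity']]
y = df['device']

# randomized training and test datasets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

predictions = model.predict(X_test)
print(accuracy_score(y_test, predictions))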
Conclusion
Hopefully, this post has demonstrated how to efficiently collect telemetry data from IoT devices using Google Protocol Buffers over HTTPS, serverless Google Cloud Functions, Cloud Pub/Sub, and MongoDB Atlas, all on the Google Cloud Platform. Once captured, the telemetry data was easily aggregated and analyzed using common tools, such as MongoDB Compass and Jupyter Notebooks. Further, we used the data and tools to build machine learning models for prediction and anomaly detection.
All opinions expressed in this post are my own and not necessarily the views of my current or past employers or their clients.
Image: everythingpossible © 123RF.com
Istio Observability with Go, gRPC, and Protocol Buffers-based Microservices on Google Kubernetes Engine (GKE)
Posted by Gary A. Stafford in Bash Scripting, Build Automation, Client-Side Development, Cloud, Continuous Delivery, DevOps, Enterprise Software Development, GCP, Go, JavaScript, Kubernetes, Software Development on April 17, 2019
In the last two posts, Kubernetes-based Microservice Observability with Istio Service Mesh and Azure Kubernetes Service (AKS) Observability with Istio Service Mesh, we explored the observability tools which are included with Istio Service Mesh. These tools currently include Prometheus and Grafana for metric collection, monitoring, and alerting, Jaeger for distributed tracing, and Kiali for Istio service-mesh-based microservice visualization and monitoring. Combined with cloud platform-native monitoring and logging services, such as Stackdriver on GCP, CloudWatch on AWS, and Azure Monitor logs on Azure, we have a complete observability solution for modern, distributed, Cloud-based applications.
In this post, we will examine the use of Istio’s observability tools to monitor Go-based microservices that use Protocol Buffers (aka Protobuf) over gRPC (gRPC Remote Procedure Calls) and HTTP/2 for client-server communications, as opposed to the more traditional, REST-based JSON (JavaScript Object Notation) over HTTP (Hypertext Transfer Protocol). We will see how Kubernetes, Istio, Envoy, and the observability tools work seamlessly with gRPC, just as they do with JSON over HTTP, on Google Kubernetes Engine (GKE).
Technologies
gRPC
According to the gRPC project, gRPC, a CNCF incubating project, is a modern, high-performance, open-source and universal remote procedure call (RPC) framework that can run anywhere. It enables client and server applications to communicate transparently and makes it easier to build connected systems. Google, the original developer of gRPC, has used the underlying technologies and concepts in gRPC for years. The current implementation is used in several Google cloud products and Google externally facing APIs. It is also being used by Square, Netflix, CoreOS, Docker, CockroachDB, Cisco, Juniper Networks and many other organizations.
Protocol Buffers
By default, gRPC uses Protocol Buffers. According to Google, Protocol Buffers (aka Protobuf) are a language- and platform-neutral, efficient, extensible, automated mechanism for serializing structured data for use in communications protocols, data storage, and more. Protocol Buffers are 3 to 10 times smaller and 20 to 100 times faster than XML. Once you have defined your messages, you run the protocol buffer compiler for your application's language on your .proto file to generate data access classes.
Protocol Buffers currently support generated code in Java, Python, Objective-C, C++, Dart, Go, Ruby, and C#. For this post, we have compiled for Go. You can read more about the binary wire format of Protobuf on Google's Developers Portal.
Envoy Proxy
According to the Istio project, Istio uses an extended version of the Envoy proxy. Envoy is deployed as a sidecar to a relevant service in the same Kubernetes pod. Envoy, created by Lyft, is a high-performance proxy developed in C++ to mediate all inbound and outbound traffic for all services in the service mesh. Istio leverages Envoy’s many built-in features, including dynamic service discovery, load balancing, TLS termination, HTTP/2 and gRPC proxies, circuit-breakers, health checks, staged rollouts, fault injection, and rich metrics.
According to the post by Harvey Tuch of Google, Evolving a Protocol Buffer canonical API, Envoy proxy adopted Protocol Buffers, specifically proto3, as the canonical specification for version 2 of Lyft's gRPC-first API.
Reference Microservices Platform
In the last two posts, we explored Istio's observability tools, using a RESTful microservices-based API platform written in Go, using JSON over HTTP for service-to-service communications. The API platform was comprised of eight Go-based microservices and one sample Angular 7, TypeScript-based front-end web client. The various services are dependent on MongoDB and on RabbitMQ for event queue-based communications. Below is the JSON over HTTP-based platform architecture.
Below, the current Angular 7-based web client interface.
Converting to gRPC and Protocol Buffers
For this post, I have modified the eight Go microservices to use gRPC and Protocol Buffers, Google's data interchange format. Specifically, the services use the version 3 release (aka proto3) of Protocol Buffers. With gRPC, a gRPC client calls a gRPC server. Some of the platform's services are gRPC servers, others are gRPC clients, while some act as both client and server, such as Services A, B, and E. The revised architecture is shown below.
gRPC Gateway
Assuming for the sake of this demonstration, that most consumers of the API would still expect to communicate using a RESTful JSON over HTTP API, I have added a gRPC Gateway reverse proxy to the platform. The gRPC Gateway is a gRPC to JSON reverse proxy, a common architectural pattern, which proxies communications between the JSON over HTTP-based clients and the gRPC-based microservices. A diagram from the grpc-gateway GitHub project site effectively demonstrates how the reverse proxy works.
Image courtesy: https://github.com/grpc-ecosystem/grpc-gateway
In the revised platform architecture diagram above, note the addition of the reverse proxy, which replaces Service A at the edge of the API. The proxy sits between the Angular-based Web UI and Service A. Also, note the communication method between services is now Protobuf over gRPC instead of JSON over HTTP. The use of Envoy Proxy (via Istio) is unchanged, as are the MongoDB Atlas-based databases and the CloudAMQP RabbitMQ-based queue, which are still external to the Kubernetes cluster.
Alternatives to gRPC Gateway
As an alternative to the gRPC Gateway reverse proxy, we could convert the TypeScript-based Angular UI client to gRPC and Protocol Buffers, and continue to communicate directly with Service A as the edge service. However, this would limit other consumers of the API to gRPC as opposed to JSON over HTTP, unless we also chose to expose two different endpoints, gRPC and JSON over HTTP, another common pattern.
Demonstration
In this post’s demonstration, we will repeat the exact same installation process, outlined in the previous post, Kubernetes-based Microservice Observability with Istio Service Mesh. We will deploy the revised gRPC-based platform to GKE on GCP. You could just as easily follow Azure Kubernetes Service (AKS) Observability with Istio Service Mesh, and deploy the platform to AKS.
Source Code
All source code for this post is available on GitHub, contained in three projects. The Go-based microservices source code, all Kubernetes resources, and all deployment scripts are located in the k8s-istio-observe-backend project repository, in the new grpc branch.
git clone \
  --branch grpc --single-branch --depth 1 --no-tags \
  https://github.com/garystafford/k8s-istio-observe-backend.git
The Angular-based web client source code is located in the k8s-istio-observe-frontend repository on the new grpc branch. The source protocol buffers .proto file and the generated code, created using the protocol buffers compiler, are located in the new pb-greeting project repository. You do not need to clone either of these projects for this post's demonstration.
All Docker images for the services, UI, and the reverse proxy are located on Docker Hub.
Code Changes
This post is not specifically about writing Go for gRPC and Protobuf. However, to better understand the observability requirements and capabilities of these technologies, compared to JSON over HTTP, it is helpful to review some of the source code.
Service A
First, compare the source code for Service A, shown below, to the original code in the previous post. The service’s code is almost completely re-written. I relied on several references for writing the code, including, Tracing gRPC with Istio, written by Neeraj Poddar of Aspen Mesh and Distributed Tracing Infrastructure with Jaeger on Kubernetes, by Masroor Hasan.
Specifically, note the following code changes to Service A:
- Import of the pb-greeting protobuf package;
- Local Greeting struct replaced with pb.Greeting struct;
- All services are now hosted on port 50051;
- The HTTP server and all API resource handler functions are removed;
- Headers, used for distributed tracing with Jaeger, have moved from the HTTP request object to metadata passed in the gRPC context object;
- Service A is coded as a gRPC server, which is called by the gRPC Gateway reverse proxy (gRPC client) via the Greeting function;
- The primary PingHandler function, which returns the service's Greeting, is replaced by the pb-greeting protobuf package's Greeting function;
- Service A is coded as a gRPC client, calling both Service B and Service C using the CallGrpcService function;
- CORS handling is offloaded to Istio;
- Logging methods are unchanged.
Source code for revised gRPC-based Service A (gist):
// author: Gary A. Stafford
// site: https://programmaticponderings.com
// license: MIT License
// purpose: Service A - gRPC/Protobuf

package main

import (
	"context"
	"github.com/banzaicloud/logrus-runtime-formatter"
	"github.com/google/uuid"
	"github.com/grpc-ecosystem/go-grpc-middleware/tracing/opentracing"
	ot "github.com/opentracing/opentracing-go"
	log "github.com/sirupsen/logrus"
	"google.golang.org/grpc"
	"google.golang.org/grpc/metadata"
	"net"
	"os"
	"time"

	pb "github.com/garystafford/pb-greeting"
)

const (
	port = ":50051"
)

type greetingServiceServer struct {
}

var (
	greetings []*pb.Greeting
)

func (s *greetingServiceServer) Greeting(ctx context.Context, req *pb.GreetingRequest) (*pb.GreetingResponse, error) {
	greetings = nil

	tmpGreeting := pb.Greeting{
		Id:      uuid.New().String(),
		Service: "Service-A",
		Message: "Hello, from Service-A!",
		Created: time.Now().Local().String(),
	}

	greetings = append(greetings, &tmpGreeting)

	CallGrpcService(ctx, "service-b:50051")
	CallGrpcService(ctx, "service-c:50051")

	return &pb.GreetingResponse{
		Greeting: greetings,
	}, nil
}

func CallGrpcService(ctx context.Context, address string) {
	conn, err := createGRPCConn(ctx, address)
	if err != nil {
		log.Fatalf("did not connect: %v", err)
	}
	defer conn.Close()

	headersIn, _ := metadata.FromIncomingContext(ctx)
	log.Infof("headersIn: %s", headersIn)

	client := pb.NewGreetingServiceClient(conn)
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	ctx = metadata.NewOutgoingContext(context.Background(), headersIn)
	defer cancel()

	req := pb.GreetingRequest{}
	greeting, err := client.Greeting(ctx, &req)
	// check the error before dereferencing the response
	if err != nil {
		log.Fatalf("did not connect: %v", err)
	}
	log.Info(greeting.GetGreeting())

	for _, greeting := range greeting.GetGreeting() {
		greetings = append(greetings, greeting)
	}
}

func createGRPCConn(ctx context.Context, addr string) (*grpc.ClientConn, error) {
	//https://aspenmesh.io/2018/04/tracing-grpc-with-istio/
	var opts []grpc.DialOption
	opts = append(opts, grpc.WithStreamInterceptor(
		grpc_opentracing.StreamClientInterceptor(
			grpc_opentracing.WithTracer(ot.GlobalTracer()))))
	opts = append(opts, grpc.WithUnaryInterceptor(
		grpc_opentracing.UnaryClientInterceptor(
			grpc_opentracing.WithTracer(ot.GlobalTracer()))))
	opts = append(opts, grpc.WithInsecure())
	conn, err := grpc.DialContext(ctx, addr, opts...)
	if err != nil {
		log.Fatalf("failed to connect to application addr: %v", err)
		return nil, err
	}
	return conn, nil
}

func getEnv(key, fallback string) string {
	if value, ok := os.LookupEnv(key); ok {
		return value
	}
	return fallback
}

func init() {
	formatter := runtime.Formatter{ChildFormatter: &log.JSONFormatter{}}
	formatter.Line = true
	log.SetFormatter(&formatter)
	log.SetOutput(os.Stdout)
	level, err := log.ParseLevel(getEnv("LOG_LEVEL", "info"))
	if err != nil {
		log.Error(err)
	}
	log.SetLevel(level)
}

func main() {
	lis, err := net.Listen("tcp", port)
	if err != nil {
		log.Fatalf("failed to listen: %v", err)
	}
	s := grpc.NewServer()
	pb.RegisterGreetingServiceServer(s, &greetingServiceServer{})
	log.Fatal(s.Serve(lis))
}
Greeting Protocol Buffers
Shown below is the greeting source protocol buffers .proto file. The greeting response struct, originally defined in the services, remains largely unchanged (gist). The UI client responses will look identical.
syntax = "proto3"; | |
package greeting; | |
import "google/api/annotations.proto"; | |
message Greeting { | |
string id = 1; | |
string service = 2; | |
string message = 3; | |
string created = 4; | |
} | |
message GreetingRequest { | |
} | |
message GreetingResponse { | |
repeated Greeting greeting = 1; | |
} | |
service GreetingService { | |
rpc Greeting (GreetingRequest) returns (GreetingResponse) { | |
option (google.api.http) = { | |
get: "/api/v1/greeting" | |
}; | |
} | |
} |
When compiled with protoc and the Go-based protocol compiler plugin, the original 27 lines of source code swell to almost 270 lines of generated data access classes that are easier to use programmatically.
# Generate gRPC stub (.pb.go)
protoc -I /usr/local/include -I. \
  -I ${GOPATH}/src \
  -I ${GOPATH}/src/github.com/grpc-ecosystem/grpc-gateway/third_party/googleapis \
  --go_out=plugins=grpc:. \
  greeting.proto

# Generate reverse-proxy (.pb.gw.go)
protoc -I /usr/local/include -I. \
  -I ${GOPATH}/src \
  -I ${GOPATH}/src/github.com/grpc-ecosystem/grpc-gateway/third_party/googleapis \
  --grpc-gateway_out=logtostderr=true:. \
  greeting.proto

# Generate swagger definitions (.swagger.json)
protoc -I /usr/local/include -I. \
  -I ${GOPATH}/src \
  -I ${GOPATH}/src/github.com/grpc-ecosystem/grpc-gateway/third_party/googleapis \
  --swagger_out=logtostderr=true:. \
  greeting.proto
Below is a small snippet of that compiled code, for reference. The compiled code is included in the pb-greeting project on GitHub and imported into each microservice and the reverse proxy (gist). We also compile a separate version for the reverse proxy to implement.
// Code generated by protoc-gen-go. DO NOT EDIT.
// source: greeting.proto

package greeting

import (
	context "context"
	fmt "fmt"
	proto "github.com/golang/protobuf/proto"
	_ "google.golang.org/genproto/googleapis/api/annotations"
	grpc "google.golang.org/grpc"
	codes "google.golang.org/grpc/codes"
	status "google.golang.org/grpc/status"
	math "math"
)

// Reference imports to suppress errors if they are not otherwise used.
var _ = proto.Marshal
var _ = fmt.Errorf
var _ = math.Inf

// This is a compile-time assertion to ensure that this generated file
// is compatible with the proto package it is being compiled against.
// A compilation error at this line likely means your copy of the
// proto package needs to be updated.
const _ = proto.ProtoPackageIsVersion3 // please upgrade the proto package

type Greeting struct {
	Id                   string   `protobuf:"bytes,1,opt,name=id,proto3" json:"id,omitempty"`
	Service              string   `protobuf:"bytes,2,opt,name=service,proto3" json:"service,omitempty"`
	Message              string   `protobuf:"bytes,3,opt,name=message,proto3" json:"message,omitempty"`
	Created              string   `protobuf:"bytes,4,opt,name=created,proto3" json:"created,omitempty"`
	XXX_NoUnkeyedLiteral struct{} `json:"-"`
	XXX_unrecognized     []byte   `json:"-"`
	XXX_sizecache        int32    `json:"-"`
}

func (m *Greeting) Reset()         { *m = Greeting{} }
func (m *Greeting) String() string { return proto.CompactTextString(m) }
func (*Greeting) ProtoMessage()    {}
func (*Greeting) Descriptor() ([]byte, []int) {
	return fileDescriptor_6acac03ccd168a87, []int{0}
}

func (m *Greeting) XXX_Unmarshal(b []byte) error {
	return xxx_messageInfo_Greeting.Unmarshal(m, b)
}

func (m *Greeting) XXX_Marshal(b []byte, deterministic bool) ([]byte, error) {
	return xxx_messageInfo_Greeting.Marshal(b, m, deterministic)
Using Swagger, we can view the greeting protocol buffers' single RESTful API resource, exposed with an HTTP GET method. I use the Docker-based version of Swagger UI for viewing protoc-generated swagger definitions.
docker run -p 8080:8080 -d --name swagger-ui \
  -e SWAGGER_JSON=/tmp/greeting.swagger.json \
  -v ${GOPATH}/src/pb-greeting:/tmp swaggerapi/swagger-ui
The Angular UI makes an HTTP GET request to the /api/v1/greeting resource, which is transformed to gRPC and proxied to Service A, where it is handled by the Greeting function.
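Because gRPC is language-neutral, the same service could also be exercised from Python, bypassing the reverse proxy. Below is a hedged sketch, assuming Python stubs (greeting_pb2 and greeting_pb2_grpc) were generated from the same greeting.proto with grpcio-tools, and that the service is reachable at the in-cluster address shown.

import grpc

import greeting_pb2
import greeting_pb2_grpc

# plaintext channel, matching the insecure dial options used by the Go clients
channel = grpc.insecure_channel('service-a:50051')
stub = greeting_pb2_grpc.GreetingServiceStub(channel)

response = stub.Greeting(greeting_pb2.GreetingRequest())
for greeting in response.greeting:
    print(greeting.service, greeting.message)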
gRPC Gateway Reverse Proxy
As explained earlier, the gRPC Gateway reverse proxy service is completely new. Specifically, note the following code features in the gist below:
- Import of the pb-greeting protobuf package;
- The proxy is hosted on port 80;
- Request headers, used for distributed tracing with Jaeger, are collected from the incoming HTTP request and passed to Service A in the gRPC context;
- The proxy is coded as a gRPC client, which calls Service A;
- Logging is largely unchanged.
The source code for the Reverse Proxy (gist):
// author: Gary A. Stafford
// site: https://programmaticponderings.com
// license: MIT License
// purpose: gRPC Gateway / Reverse Proxy
// reference: https://github.com/grpc-ecosystem/grpc-gateway

package main

import (
	"context"
	"flag"
	lrf "github.com/banzaicloud/logrus-runtime-formatter"
	gw "github.com/garystafford/pb-greeting"
	"github.com/grpc-ecosystem/grpc-gateway/runtime"
	log "github.com/sirupsen/logrus"
	"google.golang.org/grpc"
	"google.golang.org/grpc/metadata"
	"net/http"
	"os"
)

func injectHeadersIntoMetadata(ctx context.Context, req *http.Request) metadata.MD {
	//https://aspenmesh.io/2018/04/tracing-grpc-with-istio/
	var (
		otHeaders = []string{
			"x-request-id",
			"x-b3-traceid",
			"x-b3-spanid",
			"x-b3-parentspanid",
			"x-b3-sampled",
			"x-b3-flags",
			"x-ot-span-context"}
	)
	var pairs []string
	for _, h := range otHeaders {
		if v := req.Header.Get(h); len(v) > 0 {
			pairs = append(pairs, h, v)
		}
	}
	return metadata.Pairs(pairs...)
}

type annotator func(context.Context, *http.Request) metadata.MD

func chainGrpcAnnotators(annotators ...annotator) annotator {
	return func(c context.Context, r *http.Request) metadata.MD {
		var mds []metadata.MD
		for _, a := range annotators {
			mds = append(mds, a(c, r))
		}
		return metadata.Join(mds...)
	}
}

func run() error {
	ctx := context.Background()
	ctx, cancel := context.WithCancel(ctx)
	defer cancel()

	annotators := []annotator{injectHeadersIntoMetadata}

	mux := runtime.NewServeMux(
		runtime.WithMetadata(chainGrpcAnnotators(annotators...)),
	)

	opts := []grpc.DialOption{grpc.WithInsecure()}
	err := gw.RegisterGreetingServiceHandlerFromEndpoint(ctx, mux, "service-a:50051", opts)
	if err != nil {
		return err
	}

	return http.ListenAndServe(":80", mux)
}

func getEnv(key, fallback string) string {
	if value, ok := os.LookupEnv(key); ok {
		return value
	}
	return fallback
}

func init() {
	formatter := lrf.Formatter{ChildFormatter: &log.JSONFormatter{}}
	formatter.Line = true
	log.SetFormatter(&formatter)
	log.SetOutput(os.Stdout)
	level, err := log.ParseLevel(getEnv("LOG_LEVEL", "info"))
	if err != nil {
		log.Error(err)
	}
	log.SetLevel(level)
}

func main() {
	flag.Parse()

	if err := run(); err != nil {
		log.Fatal(err)
	}
}
Below, in the Stackdriver logs, we see an example of a set of HTTP request headers in the JSON payload, which are propagated upstream to gRPC-based Go services from the gRPC Gateway’s reverse proxy. Header propagation ensures the request produces a complete distributed trace across the complete service call chain.
Istio VirtualService and CORS
According to feedback in the project’s GitHub Issues, the gRPC Gateway does not directly support Cross-Origin Resource Sharing (CORS) policy. In my own experience, the gRPC Gateway cannot handle OPTIONS HTTP method requests, which must be issued by the Angular 7 web UI. Therefore, I have offloaded CORS responsibility to Istio, using the VirtualService resource’s CorsPolicy configuration. This makes CORS much easier to manage than coding CORS configuration into service code (gist):
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: service-rev-proxy
spec:
  hosts:
  - api.dev.example-api.com
  gateways:
  - demo-gateway
  http:
  - match:
    - uri:
        prefix: /
    route:
    - destination:
        port:
          number: 80
        host: service-rev-proxy.dev.svc.cluster.local
      weight: 100
    corsPolicy:
      allowOrigin:
      - "*"
      allowMethods:
      - OPTIONS
      - GET
      allowCredentials: true
      allowHeaders:
      - "*"
Set-up and Installation
To deploy the microservices platform to GKE, follow the detailed instructions in part one of the post, Kubernetes-based Microservice Observability with Istio Service Mesh: Part 1, or Azure Kubernetes Service (AKS) Observability with Istio Service Mesh for AKS.
- Create the external MongoDB Atlas database and CloudAMQP RabbitMQ clusters;
- Modify the Kubernetes resource files and bash scripts for your own environments;
- Create the managed GKE or AKS cluster on GCP or Azure;
- Configure and deploy Istio to the managed Kubernetes cluster, using Helm;
- Create DNS records for the platform’s exposed resources;
- Deploy the Go-based microservices, gRPC Gateway reverse proxy, Angular UI, and associated resources to Kubernetes cluster;
- Test and troubleshoot the platform deployment;
- Observe the results;
The Three Pillars
As introduced in the first post, logs, metrics, and traces are often known as the three pillars of observability. These are the external outputs of the system, which we may observe. As modern distributed systems grow ever more complex, the ability to observe those systems demands equally modern tooling that was designed with this level of complexity in mind. Traditional logging and monitoring systems often struggle with today’s hybrid and multi-cloud, polyglot language-based, event-driven, container-based and serverless, infinitely-scalable, ephemeral-compute platforms.
Tools like Istio Service Mesh attempt to solve the observability challenge by offering native integrations with several best-of-breed, open-source telemetry tools. Istio’s integrations include Jaeger for distributed tracing, Kiali for Istio service mesh-based microservice visualization and monitoring, and Prometheus and Grafana for metric collection, monitoring, and alerting. Combined with cloud platform-native monitoring and logging services, such as Stackdriver for GKE, CloudWatch for Amazon’s EKS, or Azure Monitor logs for AKS, we have a complete observability solution for modern, distributed, Cloud-based applications.
Pillar 1: Logging
Moving from JSON over HTTP to gRPC does not require any changes to the logging configuration of the Go-based service code or Kubernetes resources.
Stackdriver with Logrus
As detailed in part two of the last post, Kubernetes-based Microservice Observability with Istio Service Mesh, our logging strategy for the eight Go-based microservices and the reverse proxy continues to be the use of Logrus, the popular structured logger for Go, and Banzai Cloud’s logrus-runtime-formatter.
If you recall, the Banzai formatter automatically tags log messages with runtime/stack information, including the function name and line number, which is extremely helpful when troubleshooting. We are also using Logrus’ JSON formatter. Below, in the Stackdriver console, note how each log entry has a JSON payload containing the log level, function name, the line on which the log entry originated, and the message.
Below, we see the details of a specific log entry’s JSON payload. In this case, we can see the request headers propagated from the downstream service.
Pillar 2: Metrics
Moving from JSON over HTTP to gRPC does not require any changes to the metrics configuration of the Go-based service code or Kubernetes resources.
Prometheus
Prometheus is a completely open source and community-driven systems monitoring and alerting toolkit originally built at SoundCloud, circa 2012. Interestingly, Prometheus joined the Cloud Native Computing Foundation (CNCF) in 2016 as the second hosted-project, after Kubernetes.
Grafana
Grafana describes itself as the leading open source software for time series analytics. According to Grafana Labs, Grafana allows you to query, visualize, alert on, and understand your metrics no matter where they are stored. You can easily create, explore, and share visually-rich, data-driven dashboards. Grafana also allows you to visually define alert rules for your most important metrics. Grafana will continuously evaluate the rules and can send notifications.
According to Istio, the Grafana add-on is a pre-configured instance of Grafana. The Grafana Docker base image has been modified to start with both a Prometheus data source and the Istio Dashboard installed. Below, we see two of the pre-configured dashboards, the Istio Mesh Dashboard and the Istio Performance Dashboard.
Pillar 3: Traces
Moving from JSON over HTTP to gRPC did require a complete re-write of the tracing logic in the service code. In fact, I spent the majority of my time ensuring the correct headers were propagated from the Istio Ingress Gateway to the gRPC Gateway reverse proxy, to Service A in the gRPC context, and upstream to all the dependent, gRPC-based services. I am sure there are a number of optimizations that could be made to my current code, regarding the correct handling of traces and how this information is propagated across the service call stack.
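For reference, propagating tracing headers in the gRPC context amounts to copying them from the incoming gRPC metadata to the outgoing context, rather than from one HTTP request to the next. Below is a minimal sketch of that approach, using the grpc-go metadata package; the helper function name is illustrative and is not taken from the project’s source code.

package tracing

import (
  "context"

  "google.golang.org/grpc/metadata"
)

// propagateTraceHeaders (an illustrative helper) copies the Jaeger/B3
// tracing headers from the incoming gRPC metadata to the outgoing
// context, so the Envoy sidecars can correlate the spans into a
// single trace. Headers that are missing or empty are skipped.
func propagateTraceHeaders(ctx context.Context) context.Context {
  tracingHeaders := []string{
    "x-request-id", "x-b3-traceid", "x-b3-spanid",
    "x-b3-parentspanid", "x-b3-sampled", "x-b3-flags",
    "x-ot-span-context",
  }
  incoming, ok := metadata.FromIncomingContext(ctx)
  if !ok {
    return ctx
  }
  outgoing := metadata.MD{}
  for _, header := range tracingHeaders {
    if values := incoming.Get(header); len(values) > 0 {
      outgoing.Set(header, values...)
    }
  }
  return metadata.NewOutgoingContext(ctx, outgoing)
}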
Jaeger
According to their website, Jaeger, inspired by Dapper and OpenZipkin, is a distributed tracing system released as open source by Uber Technologies. It is used for monitoring and troubleshooting microservices-based distributed systems, including distributed context propagation, distributed transaction monitoring, root cause analysis, service dependency analysis, and performance and latency optimization. The Jaeger website contains an excellent overview of Jaeger’s architecture and general tracing-related terminology.
Below we see the Jaeger UI Traces View. In it, we see a series of traces generated by hey, a modern load generator and benchmarking tool, and a worthy replacement for Apache Bench (ab). Unlike ab, hey supports HTTP/2. The use of hey was detailed in the previous post.
A trace, as you might recall, is an execution path through the system and can be thought of as a directed acyclic graph (DAG) of spans. If you have worked with systems like Apache Spark, you are probably already familiar with DAGs.
Below we see the Jaeger UI Trace Detail View. The example trace contains 16 spans, which encompass nine components – seven of the eight Go-based services, the reverse proxy, and the Istio Ingress Gateway. The trace and the spans each have timings. The root span in the trace is the Istio Ingress Gateway. In this demo, traces do not span the RabbitMQ message queues. This means you would not see a trace which includes the decoupled, message-based communications between Service D and Service F, via RabbitMQ.
Within the Jaeger UI Trace Detail View, you also have the ability to drill into a single span, which contains additional metadata. Metadata includes the URL being called, HTTP method, response status, and several other headers.
Microservice Observability
Moving from JSON over HTTP to gRPC does not require any changes to the Kiali configuration, the Go-based service code, or the Kubernetes resources.
Kiali
According to their website, Kiali provides answers to the questions: What are the microservices in my Istio service mesh, and how are they connected? Kiali works with Istio, in OpenShift or Kubernetes, to visualize the service mesh topology, to provide visibility into features like circuit breakers, request rates and more. It offers insights about the mesh components at different levels, from abstract Applications to Services and Workloads.
The Graph View in the Kiali UI is a visual representation of the components running in the Istio service mesh. Below, filtering on the cluster’s dev Namespace, we should observe that Kiali has mapped all components in the platform, along with rich metadata, such as their version and communication protocols.
Using Kiali, we can confirm our service-to-service IPC protocol is now gRPC instead of the previous HTTP.
Conclusion
Although converting from JSON over HTTP to protocol buffers with gRPC required major code changes to the services, it did not impact the high-level observability we have of those services using the tools provided by Istio, including Prometheus, Grafana, Jaeger, and Kiali.
All opinions expressed in this post are my own and not necessarily the views of my current or past employers or their clients.
Kubernetes-based Microservice Observability with Istio Service Mesh on Google Kubernetes Engine (GKE): Part 2
Posted by Gary A. Stafford in Cloud, DevOps, GCP, Go, JavaScript, Kubernetes, Software Development on March 21, 2019
In this two-part post, we are exploring the set of observability tools that are part of the latest version of Istio Service Mesh. These tools include Prometheus and Grafana for metric collection, monitoring, and alerting, Jaeger for distributed tracing, and Kiali for Istio service-mesh-based microservice visualization. Combined with cloud platform-native monitoring and logging services, such as Stackdriver for Google Kubernetes Engine (GKE) on Google Cloud Platform (GCP), we have a complete observability solution for modern, distributed applications.
Reference Platform
To demonstrate Istio’s observability tools, in part one of the post, we deployed a reference microservices platform, written in Go, to GKE on GCP. The platform is comprised of (14) components, including (8) Go-based microservices, labeled generically as Service A through Service H, (1) Angular 7, TypeScript-based front-end, (4) MongoDB databases, and (1) RabbitMQ queue for event queue-based communications.
The reference platform is designed to generate HTTP-based service-to-service, TCP-based service-to-database (MongoDB), and TCP-based service-to-queue-to-service (RabbitMQ) IPC (inter-process communication). Service A calls Service B and Service C, Service B calls Service D and Service E, Service D produces a message on a RabbitMQ queue that Service F consumes and writes to MongoDB, and so on. The goal is to observe these distributed communications using Istio’s observability tools when the system is deployed to Kubernetes.
Pillar 1: Logging
If you recall, logs, metrics, and traces are often known as the three pillars of observability. Since we are using GKE on GCP, we will look at Google’s Stackdriver Logging. According to Google, Stackdriver Logging allows you to store, search, analyze, monitor, and alert on log data and events from GCP and even AWS. Although Stackdriver Logging is not an Istio observability feature, logging is an essential pillar of an overall observability strategy.
Go-based Microservice Logging
An effective logging strategy starts with what you log, when you log, and how you log. As part of our logging strategy, the eight Go-based microservices are using Logrus, a popular structured logger for Go. The microservices also implement Banzai Cloud’s logrus-runtime-formatter. There is an excellent article on the formatter, Golang runtime Logrus Formatter. These two logging packages give us greater control over what we log, when we log, and how we log information about our microservices. The recommended configuration of the packages is minimal.
func init() {
  formatter := runtime.Formatter{ChildFormatter: &log.JSONFormatter{}}
  formatter.Line = true
  log.SetFormatter(&formatter)
  log.SetOutput(os.Stdout)
  level, err := log.ParseLevel(getEnv("LOG_LEVEL", "info"))
  if err != nil {
    log.Error(err)
  }
  log.SetLevel(level)
}
Logrus provides several advantages over Go’s simple logging package, log. Log entries should not only be made for Fatal errors, nor should all verbose log entries be output in a Production environment. The post’s microservices take advantage of Logrus’ ability to log at seven levels: Trace, Debug, Info, Warning, Error, Fatal, and Panic. We have also parameterized the log level, allowing it to be easily changed in the Kubernetes Deployment resource at deploy-time.
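As a brief, illustrative sketch of leveled logging in practice (the messages below are examples, not from the project’s source code), with LOG_LEVEL left at its default of info, the Trace and Debug entries would be suppressed:

package main

import (
  log "github.com/sirupsen/logrus"
)

func main() {
  log.Trace("entering the handler")    // suppressed at info level
  log.Debug("request headers parsed")  // suppressed at info level
  log.Info("service started on port 80")
  log.Warning("upstream latency is elevated")
  log.Error("failed to reach service-b")
  // log.Fatal and log.Panic also log, then exit or panic, respectively.
}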
The microservices also take advantage of Banzai Cloud’s logrus-runtime-formatter. The Banzai formatter automatically tags log messages with runtime/stack information, including function name and line number; extremely helpful when troubleshooting. We are also using Logrus’ JSON formatter. Note how each log entry below has the JSON payload contained within the message.
Client-side Angular UI Logging
Likewise, we have enhanced the logging of the Angular UI using NGX Logger. NGX Logger is a popular, simple logging module, currently for Angular 6 and 7. It allows “pretty print” to the console, as well as allowing log messages to be POSTed to a URL for server-side logging. For this demo, we will only print to the console. Similar to Logrus, NGX Logger supports multiple log levels: Trace, Debug, Info, Warning, Error, Fatal, and Off. Instead of just outputting messages, NGX Logger allows us to output properly formatted log entries to the web browser’s console.
The level of logs output is dependent on the environment, Production or not Production. Below we see a combination of log entries in the local development environment, including Debug, Info, and Error.
Again below, we see the same page in the GKE-based Production environment. Note the absence of Debug-level log entries output to the console, without changing the configuration. We would not want to expose potentially sensitive information in verbose log output to our end-users in Production.
Controlling logging levels is accomplished by adding the following ternary operator to the app.module.ts file.
LoggerModule.forRoot({
  level: !environment.production ?
    NgxLoggerLevel.DEBUG : NgxLoggerLevel.INFO,
  serverLogLevel: NgxLoggerLevel.INFO
})
Pillar 2: Metrics
For metrics, we will examine Prometheus and Grafana. Both of these leading tools were installed as part of the Istio deployment.
Prometheus
Prometheus is a completely open source and community-driven systems monitoring and alerting toolkit originally built at SoundCloud, circa 2012. Interestingly, Prometheus joined the Cloud Native Computing Foundation (CNCF) in 2016 as the second hosted-project, after Kubernetes.
According to Istio, Istio’s Mixer comes with a built-in Prometheus adapter that exposes an endpoint serving generated metric values. The Prometheus add-on is a Prometheus server that comes pre-configured to scrape Mixer endpoints to collect the exposed metrics. It provides a mechanism for persistent storage and querying of Istio metrics.
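Separately from the Mixer-generated metrics, each Go-based service also exposes its own /metrics endpoint using the official Prometheus Go client’s promhttp handler (visible in the Service A source code, in part one of this post). Below is a hedged sketch of how such an endpoint is wired up; the custom counter is purely illustrative and is not something the post’s services define.

package main

import (
  "log"
  "net/http"

  "github.com/prometheus/client_golang/prometheus"
  "github.com/prometheus/client_golang/prometheus/promauto"
  "github.com/prometheus/client_golang/prometheus/promhttp"
)

// pingCounter is an illustrative custom metric, automatically
// registered with the default Prometheus registry by promauto.
var pingCounter = promauto.NewCounter(prometheus.CounterOpts{
  Name: "service_ping_requests_total",
  Help: "Total number of /api/ping requests handled.",
})

func main() {
  http.HandleFunc("/api/ping", func(w http.ResponseWriter, r *http.Request) {
    pingCounter.Inc()
    _, _ = w.Write([]byte(`[{"service":"Service-A"}]`))
  })
  // Prometheus scrapes this endpoint; the post's services expose it
  // with api.Handle("/metrics", promhttp.Handler()).
  http.Handle("/metrics", promhttp.Handler())
  log.Fatal(http.ListenAndServe(":80", nil))
}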
With the GKE cluster running, Istio installed, and the platform deployed, the easiest way to access Prometheus is to use kubectl port-forward to connect to the Prometheus server. According to Google, Kubernetes port forwarding allows using a resource name, such as a service name, to select a matching pod to port forward to, since Kubernetes v1.10. We forward a local port to a port on the Prometheus pod.
You may connect using Google Cloud Shell or copy and paste the command to your local shell to connect from a local port. Below are the port forwarding commands used in this post.
# Grafana
kubectl port-forward -n istio-system \
  $(kubectl get pod -n istio-system -l app=grafana \
  -o jsonpath='{.items[0].metadata.name}') 3000:3000 &

# Prometheus
kubectl -n istio-system port-forward \
  $(kubectl -n istio-system get pod -l app=prometheus \
  -o jsonpath='{.items[0].metadata.name}') 9090:9090 &

# Jaeger
kubectl port-forward -n istio-system \
  $(kubectl get pod -n istio-system -l app=jaeger \
  -o jsonpath='{.items[0].metadata.name}') 16686:16686 &

# Kiali
kubectl -n istio-system port-forward \
  $(kubectl -n istio-system get pod -l app=kiali \
  -o jsonpath='{.items[0].metadata.name}') 20001:20001 &
According to Prometheus, users select and aggregate time series data in real time using a functional query language called PromQL (Prometheus Query Language). The result of an expression can either be shown as a graph, viewed as tabular data in Prometheus’s expression browser, or consumed by external systems through Prometheus’ HTTP API. The expression browser includes a drop-down menu with all available metrics as a starting point for building queries. Shown below are a few PromQL examples used in this post.
up{namespace="dev",pod_name=~"service-.*"}
container_memory_max_usage_bytes{namespace="dev",container_name=~"service-.*"}
container_memory_max_usage_bytes{namespace="dev",container_name="service-f"}
container_network_transmit_packets_total{namespace="dev",pod_name=~"service-e-.*"}
istio_requests_total{destination_service_namespace="dev",connection_security_policy="mutual_tls",destination_app="service-a"}
istio_response_bytes_count{destination_service_namespace="dev",connection_security_policy="mutual_tls",source_app="service-a"}
Below, in the Prometheus console, we see an example graph of the eight Go-based microservices, deployed to GKE. The graph displays the container memory usage over a five minute period. For half the time period, the services were at rest. For the second half of the period, the services were under a simulated load, using hey. Viewing the memory profile of the services under load can help us determine the container memory minimums and limits, which impact Kubernetes’ scheduling of workloads on the GKE cluster. Metrics such as this might also uncover memory leaks or routing issues, such as the service below, which appears to be consuming 25-50% more memory than its peers.
In another example, below, we see a graph representing the total Istio requests to Service A in the dev Namespace, while the system was under load.
Compare the graph view above with the same metrics displayed in the console view. The multiple entries reflect the multiple instances of Service A in the dev Namespace, over the five-minute period being examined. The values in the individual metric elements indicate the latest metric that was collected.
Prometheus also collects basic metrics about Istio components, Kubernetes components, and the GKE cluster. Below, we can view the total memory of each of the n1-standard-2 VM nodes in the GKE cluster.
Grafana
Grafana describes itself as the leading open source software for time series analytics. According to Grafana Labs, Grafana allows you to query, visualize, alert on, and understand your metrics no matter where they are stored. You can easily create, explore, and share visually-rich, data-driven dashboards. Grafana also allows you to visually define alert rules for your most important metrics. Grafana will continuously evaluate the rules and can send notifications.
According to Istio, the Grafana add-on is a pre-configured instance of Grafana. The Grafana Docker base image has been modified to start with both a Prometheus data source and the Istio Dashboard installed. The base install files for Istio, and Mixer in particular, ship with a default configuration of global (used for every service) metrics. The pre-configured Istio Dashboards are built to be used in conjunction with the default Istio metrics configuration and a Prometheus back-end.
Below, we see the pre-configured Istio Workload Dashboard. This particular section of the larger dashboard has been filtered to show outbound service metrics in the dev Namespace of our GKE cluster.
Similarly, below, we see the pre-configured Istio Service Dashboard. This particular section of the larger dashboard is filtered to show client workloads metrics for the Istio Ingress Gateway in our GKE cluster.
Lastly, we see the pre-configured Istio Mesh Dashboard. This dashboard is filtered to show a table view of metrics for components deployed to our GKE cluster.
An effective observability strategy must include more than just the ability to visualize results. An effective strategy must also include the ability to detect anomalies and notify (alert) the appropriate resources or take action directly to resolve incidents. Grafana, like Prometheus, is capable of alerting and notification. You visually define alert rules for your critical metrics. Grafana will continuously evaluate metrics against the rules and send notifications when pre-defined thresholds are breached.
Grafana supports multiple, popular notification channels, including PagerDuty, HipChat, Email, Kafka, and Slack. Below, we see a new Grafana notification channel, which sends alert notifications to a Slack support channel.
Grafana is able to send detailed text-based and visual notifications.
Pillar 3: Traces
According to the Open Tracing website, distributed tracing, also called distributed request tracing, is a method used to profile and monitor applications, especially those built using a microservices architecture. Distributed tracing helps pinpoint where failures occur and what causes poor performance.
According to Istio, although Istio proxies are able to automatically send spans, applications need to propagate the appropriate HTTP headers, so that when the proxies send span information, the spans can be correlated correctly into a single trace. To accomplish this, an application needs to collect and propagate the following headers from the incoming request to any outgoing requests.
x-request-id
x-b3-traceid
x-b3-spanid
x-b3-parentspanid
x-b3-sampled
x-b3-flags
x-ot-span-context
The x-b3 headers originated as part of the Zipkin project. The B3 portion of the header is named for the original name of Zipkin, BigBrotherBird. Passing these headers across service calls is known as B3 propagation. According to Zipkin, these attributes are propagated in-process, and eventually downstream (often via HTTP headers), to ensure all activity originating from the same root are collected together.
In order to demonstrate distributed tracing with Jaeger, I have modified Service A, Service B, and Service E. These are the three services that make HTTP requests to other upstream services. I have added the following code in order to propagate the headers from one service to the next. The Istio sidecar proxy (Envoy) generates the first headers. It is critical that you only propagate the headers that are present in the downstream request and have a value, as the code below does. Propagating an empty header will break the distributed tracing.
headers := []string{
  "x-request-id",
  "x-b3-traceid",
  "x-b3-spanid",
  "x-b3-parentspanid",
  "x-b3-sampled",
  "x-b3-flags",
  "x-ot-span-context",
}

for _, header := range headers {
  // Only propagate headers that are present and have a value;
  // propagating an empty header will break the distributed tracing.
  if r.Header.Get(header) != "" {
    req.Header.Add(header, r.Header.Get(header))
  }
}
Below, in the highlighted Stackdriver log entry’s JSON payload, we see the required headers that contained a value, propagated from the root span, being passed from Service A to Service C in the upstream request.
Jaeger
According to their website, Jaeger, inspired by Dapper and OpenZipkin, is a distributed tracing system released as open source by Uber Technologies. It is used for monitoring and troubleshooting microservices-based distributed systems, including distributed context propagation, distributed transaction monitoring, root cause analysis, service dependency analysis, and performance and latency optimization. The Jaeger website contains a good overview of Jaeger’s architecture and general tracing-related terminology.
Below we see the Jaeger UI Traces View. The UI shows the results of a search for the Istio Ingress Gateway service over a period of about forty minutes. We see a timeline of traces across the top with a list of trace results below. As discussed on the Jaeger website, a trace is composed of spans. A span represents a logical unit of work in Jaeger that has an operation name. A trace is an execution path through the system and can be thought of as a directed acyclic graph (DAG) of spans. If you have worked with systems like Apache Spark, you are probably already familiar with DAGs.
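To make the trace-as-DAG model concrete, the minimal types below sketch the idea; they are for illustration only and are not Jaeger’s actual data model.

package tracing

// Span is a simplified span: a logical unit of work with an operation
// name, timing information, and a reference to its parent span.
type Span struct {
  TraceID      string
  SpanID       string
  ParentSpanID string // empty for the root span
  Operation    string
  DurationUs   int64 // span duration, in microseconds
}

// Trace is the set of spans sharing a TraceID. The ParentSpanID
// references form a directed acyclic graph (DAG), rooted at the span
// with no parent - in this post, the Istio Ingress Gateway.
type Trace struct {
  TraceID string
  Spans   []Span
}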
Below we see the Jaeger UI Trace Detail View. The example trace contains 16 spans, which encompass eight services – seven of the eight Go-based services and the Istio Ingress Gateway. The trace and the spans each have timings. The root span in the trace is the Istio Ingress Gateway. The Angular UI, loaded in the end user’s web browser, calls the mesh’s edge service, Service A, through the Istio Ingress Gateway. From there, we see the expected flow of our service-to-service IPC. Service A calls Services B and C. Service B calls Service E, which calls Service G and Service H. In this demo, traces do not span the RabbitMQ message queues. This means you would not see a trace which includes a call from Service D to Service F, via RabbitMQ.
Within the Jaeger UI Trace Detail View, you also have the ability to drill into a single span, which contains additional metadata. Metadata includes the URL being called, HTTP method, response status, and several other headers.
The latest version of Jaeger also includes a Compare feature and two Dependencies views, a Force-Directed Graph and a DAG. I find both views rather primitive compared to Kiali, and more similar to Service Graph. Without access to Kiali, the views are marginally useful as a dependency graph.
Kiali: Microservice Observability
According to their website, Kiali provides answers to the questions: What are the microservices in my Istio service mesh, and how are they connected? There is a common Kubernetes Secret that controls access to the Kiali API and UI. The default login is admin, and the password is 1f2d1e2e67df.
Logging into Kiali, we see the Overview menu entry, which provides a global view of all namespaces within the Istio service mesh and the number of applications within each namespace.
The Graph View in the Kiali UI is a visual representation of the components running in the Istio service mesh. Below, filtering on the cluster’s dev Namespace, we can observe that Kiali has mapped 8 applications (workloads), 10 services, and 24 edges (a graph term). Specifically, we see the Istio Ingress Gateway at the edge of the service mesh, the Angular UI, the eight Go-based microservices and their Envoy proxy sidecars that are taking traffic (Service F did not take any direct traffic from another service in this example), the external MongoDB Atlas cluster, and the external CloudAMQP cluster. Note how service-to-service traffic flows, with Istio, from the service to its sidecar proxy, to the other service’s sidecar proxy, and finally to the service.
Below, we see a similar view of the service mesh, but this time, there are failures between the Istio Ingress Gateway and Service A, shown in red. We can also observe overall metrics for the HTTP traffic, such as total requests/minute, errors, and status codes.
Kiali can also display average request times and other metrics for each edge in the graph (the communication between two components).
Kiali can also show the application versions deployed. As shown below, the microservices are a combination of versions 1.3 and 1.4.
Focusing on the external MongoDB Atlas cluster, Kiali also allows us to view TCP traffic between the four services within the service mesh and the external cluster.
The Applications menu entry lists all the applications and their error rates, which can be filtered by Namespace and time interval. Here we see that the Angular UI was producing errors at the rate of 16.67%.
On both the Applications and Workloads menu entry, we can drill into a component to view additional details, including the overall health, number of Pods, Services, and Destination Services. Below, we see details for Service B in the dev Namespace.
The Workloads detailed view also includes inbound and outbound metrics. Below, we see the outbound volume, duration, and size metrics for Service A in the dev Namespace.
Finally, Kiali presents an Istio Config menu entry, which displays a list of all of the available Istio configuration objects that exist in the user’s environment.
Oftentimes, I find Kiali to be my first stop when troubleshooting platform issues. Once I identify the specific components or communication paths having issues, I can search the Stackdriver logs and the Prometheus metrics, through the Grafana dashboard.
Conclusion
In this two-part post, we have explored the current set of observability tools, which are part of the latest version of Istio Service Mesh. These tools included Prometheus and Grafana for metric collection, monitoring, and alerting, Jaeger for distributed tracing, and Kiali for Istio service-mesh-based microservice visualization. Combined with cloud platform-native monitoring and logging services, such as Stackdriver for Google Kubernetes Engine (GKE) on Google Cloud Platform (GCP), we have a complete observability solution for modern, distributed applications.
All opinions expressed in this post are my own and not necessarily the views of my current or past employers or their clients.
Kubernetes-based Microservice Observability with Istio Service Mesh on Google Kubernetes Engine (GKE): Part 1
Posted by Gary A. Stafford in Build Automation, Client-Side Development, Cloud, DevOps, GCP, Go, JavaScript, Kubernetes, Software Development on March 10, 2019
In this two-part post, we will explore the set of observability tools which are part of the Istio Service Mesh. These tools include Jaeger, Kiali, Prometheus, and Grafana. To assist in our exploration, we will deploy a Go-based, microservices reference platform to Google Kubernetes Engine, on the Google Cloud Platform.
What is Observability?
Similar to blockchain, serverless, AI and ML, chatbots, cybersecurity, and service meshes, Observability is a hot buzz word in the IT industry right now. According to Wikipedia, observability is a measure of how well internal states of a system can be inferred from knowledge of its external outputs. Logs, metrics, and traces are often known as the three pillars of observability. These are the external outputs of the system, which we may observe.
The O’Reilly book, Distributed Systems Observability, by Cindy Sridharan, does an excellent job of detailing ‘The Three Pillars of Observability’, in Chapter 4. I recommend reading this free online excerpt, before continuing. A second great resource for information on observability is honeycomb.io, a developer of observability tools for production systems, led by well-known industry thought-leader, Charity Majors. The honeycomb.io site includes articles, blog posts, whitepapers, and podcasts on observability.
As modern distributed systems grow ever more complex, the ability to observe those systems demands equally modern tooling that was designed with this level of complexity in mind. Traditional logging and monitoring systems often struggle with today’s hybrid and multi-cloud, polyglot language-based, event-driven, container-based and serverless, infinitely-scalable, ephemeral-compute platforms.
Tools like Istio Service Mesh attempt to solve the observability challenge by offering native integrations with several best-of-breed, open-source telemetry tools. Istio’s integrations include Jaeger for distributed tracing, Kiali for Istio service mesh-based microservice visualization, and Prometheus and Grafana for metric collection, monitoring, and alerting. Combined with cloud platform-native monitoring and logging services, such as Stackdriver for Google Kubernetes Engine (GKE) on Google Cloud Platform (GCP), we have a complete observability platform for modern, distributed applications.
A Reference Microservices Platform
To demonstrate the observability tools integrated with the latest version of Istio Service Mesh, we will deploy a reference microservices platform, written in Go, to GKE on GCP. I developed the reference platform to demonstrate concepts such as API management, Service Meshes, Observability, DevOps, and Chaos Engineering. The platform is comprised of (14) components, including (8) Go-based microservices, labeled generically as Service A – Service H, (1) Angular 7, TypeScript-based front-end, (4) MongoDB databases, and (1) RabbitMQ queue for event queue-based communications. The platform and all its source code is free and open source.
The reference platform is designed to generate HTTP-based service-to-service, TCP-based service-to-database (MongoDB), and TCP-based service-to-queue-to-service (RabbitMQ) IPC (inter-process communication). Service A calls Service B and Service C, Service B calls Service D and Service E, Service D produces a message on a RabbitMQ queue that Service F consumes and writes to MongoDB, and so on. These distributed communications can be observed using Istio’s observability tools when the system is deployed to a Kubernetes cluster running the Istio service mesh.
Service Responses
On the reference platform, each upstream service responds to requests from downstream services by returning a small informational JSON payload (termed a greeting in the source code).
The responses are aggregated across the service call chain, resulting in an array of service responses being returned to the edge service and on to the Angular-based UI, running in the end user’s web browser. The response aggregation feature is simply used to confirm that the service-to-service communications, Istio components, and the telemetry tools are working properly.
Each Go microservice contains a /ping and a /health endpoint. The /health endpoint can be used to configure Kubernetes Liveness and Readiness Probes. Additionally, the edge service, Service A, is configured for Cross-Origin Resource Sharing (CORS) using the access-control-allow-origin: * response header. CORS allows the Angular UI, running in the end user’s web browser, to call the Service A /ping endpoint, which resides in a different subdomain from the UI. Shown below is the Go source code for Service A.
// author: Gary A. Stafford
// site: https://programmaticponderings.com
// license: MIT License
// purpose: Service A

package main

import (
  "encoding/json"
  "github.com/banzaicloud/logrus-runtime-formatter"
  "github.com/google/uuid"
  "github.com/gorilla/mux"
  "github.com/prometheus/client_golang/prometheus/promhttp"
  "github.com/rs/cors"
  log "github.com/sirupsen/logrus"
  "io/ioutil"
  "net/http"
  "os"
  "strconv"
  "time"
)

type Greeting struct {
  ID          string    `json:"id,omitempty"`
  ServiceName string    `json:"service,omitempty"`
  Message     string    `json:"message,omitempty"`
  CreatedAt   time.Time `json:"created,omitempty"`
}

var greetings []Greeting

func PingHandler(w http.ResponseWriter, r *http.Request) {
  w.Header().Set("Content-Type", "application/json; charset=utf-8")
  log.Debug(r)
  greetings = nil
  CallNextServiceWithTrace("http://service-b/api/ping", w, r)
  CallNextServiceWithTrace("http://service-c/api/ping", w, r)
  tmpGreeting := Greeting{
    ID:          uuid.New().String(),
    ServiceName: "Service-A",
    Message:     "Hello, from Service-A!",
    CreatedAt:   time.Now().Local(),
  }
  greetings = append(greetings, tmpGreeting)
  err := json.NewEncoder(w).Encode(greetings)
  if err != nil {
    log.Error(err)
  }
}

func HealthCheckHandler(w http.ResponseWriter, r *http.Request) {
  w.Header().Set("Content-Type", "application/json; charset=utf-8")
  _, err := w.Write([]byte("{\"alive\": true}"))
  if err != nil {
    log.Error(err)
  }
}

func ResponseStatusHandler(w http.ResponseWriter, r *http.Request) {
  params := mux.Vars(r)
  statusCode, err := strconv.Atoi(params["code"])
  if err != nil {
    log.Error(err)
  }
  w.WriteHeader(statusCode)
}

func CallNextServiceWithTrace(url string, w http.ResponseWriter, r *http.Request) {
  var tmpGreetings []Greeting
  w.Header().Set("Content-Type", "application/json; charset=utf-8")
  req, err := http.NewRequest("GET", url, nil)
  if err != nil {
    log.Error(err)
  }
  // Headers must be passed for Jaeger Distributed Tracing
  headers := []string{
    "x-request-id",
    "x-b3-traceid",
    "x-b3-spanid",
    "x-b3-parentspanid",
    "x-b3-sampled",
    "x-b3-flags",
    "x-ot-span-context",
  }
  for _, header := range headers {
    if r.Header.Get(header) != "" {
      req.Header.Add(header, r.Header.Get(header))
    }
  }
  log.Info(req)
  client := &http.Client{}
  response, err := client.Do(req)
  if err != nil {
    log.Error(err)
  }
  defer response.Body.Close()
  body, err := ioutil.ReadAll(response.Body)
  if err != nil {
    log.Error(err)
  }
  err = json.Unmarshal(body, &tmpGreetings)
  if err != nil {
    log.Error(err)
  }
  for _, r := range tmpGreetings {
    greetings = append(greetings, r)
  }
}

func getEnv(key, fallback string) string {
  if value, ok := os.LookupEnv(key); ok {
    return value
  }
  return fallback
}

func init() {
  formatter := runtime.Formatter{ChildFormatter: &log.JSONFormatter{}}
  formatter.Line = true
  log.SetFormatter(&formatter)
  log.SetOutput(os.Stdout)
  level, err := log.ParseLevel(getEnv("LOG_LEVEL", "info"))
  if err != nil {
    log.Error(err)
  }
  log.SetLevel(level)
}

func main() {
  c := cors.New(cors.Options{
    AllowedOrigins:   []string{"*"},
    AllowCredentials: true,
    AllowedMethods:   []string{"GET", "POST", "PATCH", "PUT", "DELETE", "OPTIONS"},
  })
  router := mux.NewRouter()
  api := router.PathPrefix("/api").Subrouter()
  api.HandleFunc("/ping", PingHandler).Methods("GET", "OPTIONS")
  api.HandleFunc("/health", HealthCheckHandler).Methods("GET", "OPTIONS")
  api.HandleFunc("/status/{code}", ResponseStatusHandler).Methods("GET", "OPTIONS")
  api.Handle("/metrics", promhttp.Handler())
  handler := c.Handler(router)
  log.Fatal(http.ListenAndServe(":80", handler))
}
For this demonstration, the MongoDB databases will be hosted external to the services on GCP, on MongoDB Atlas, a cloud-based MongoDB-as-a-Service platform. Similarly, the RabbitMQ queues will be hosted on CloudAMQP, a cloud-based RabbitMQ-as-a-Service platform. I have used both of these SaaS providers in several previous posts. Using external services will help us understand how Istio and its observability tools collect telemetry for communications between the Kubernetes cluster and external systems.
Shown below is the Go source code for Service F. This service consumes messages from the RabbitMQ queue, placed there by Service D, and writes the messages to MongoDB.
// author: Gary A. Stafford
// site: https://programmaticponderings.com
// license: MIT License
// purpose: Service F

package main

import (
  "bytes"
  "context"
  "encoding/json"
  "github.com/banzaicloud/logrus-runtime-formatter"
  "github.com/google/uuid"
  "github.com/gorilla/mux"
  log "github.com/sirupsen/logrus"
  "github.com/streadway/amqp"
  "go.mongodb.org/mongo-driver/mongo"
  "go.mongodb.org/mongo-driver/mongo/options"
  "net/http"
  "os"
  "time"
)

type Greeting struct {
  ID          string    `json:"id,omitempty"`
  ServiceName string    `json:"service,omitempty"`
  Message     string    `json:"message,omitempty"`
  CreatedAt   time.Time `json:"created,omitempty"`
}

var greetings []Greeting

func PingHandler(w http.ResponseWriter, r *http.Request) {
  w.Header().Set("Content-Type", "application/json; charset=utf-8")
  greetings = nil
  tmpGreeting := Greeting{
    ID:          uuid.New().String(),
    ServiceName: "Service-F",
    Message:     "Hola, from Service-F!",
    CreatedAt:   time.Now().Local(),
  }
  greetings = append(greetings, tmpGreeting)
  CallMongoDB(tmpGreeting)
  err := json.NewEncoder(w).Encode(greetings)
  if err != nil {
    log.Error(err)
  }
}

func HealthCheckHandler(w http.ResponseWriter, r *http.Request) {
  w.Header().Set("Content-Type", "application/json; charset=utf-8")
  _, err := w.Write([]byte("{\"alive\": true}"))
  if err != nil {
    log.Error(err)
  }
}

func CallMongoDB(greeting Greeting) {
  log.Info(greeting)
  ctx, _ := context.WithTimeout(context.Background(), 10*time.Second)
  client, err := mongo.Connect(ctx, options.Client().ApplyURI(os.Getenv("MONGO_CONN")))
  if err != nil {
    log.Error(err)
  }
  defer client.Disconnect(nil)
  collection := client.Database("service-f").Collection("messages")
  ctx, _ = context.WithTimeout(context.Background(), 5*time.Second)
  _, err = collection.InsertOne(ctx, greeting)
  if err != nil {
    log.Error(err)
  }
}

func GetMessages() {
  conn, err := amqp.Dial(os.Getenv("RABBITMQ_CONN"))
  if err != nil {
    log.Error(err)
  }
  defer conn.Close()
  ch, err := conn.Channel()
  if err != nil {
    log.Error(err)
  }
  defer ch.Close()
  q, err := ch.QueueDeclare(
    "service-d", // name
    false,       // durable
    false,       // delete when unused
    false,       // exclusive
    false,       // no-wait
    nil,         // arguments
  )
  if err != nil {
    log.Error(err)
  }
  msgs, err := ch.Consume(
    q.Name,      // queue
    "service-f", // consumer
    true,        // auto-ack
    false,       // exclusive
    false,       // no-local
    false,       // no-wait
    nil,         // args
  )
  if err != nil {
    log.Error(err)
  }
  forever := make(chan bool)
  go func() {
    for delivery := range msgs {
      log.Debug(delivery)
      CallMongoDB(deserialize(delivery.Body))
    }
  }()
  <-forever
}

func deserialize(b []byte) (t Greeting) {
  log.Debug(b)
  var tmpGreeting Greeting
  buf := bytes.NewBuffer(b)
  decoder := json.NewDecoder(buf)
  err := decoder.Decode(&tmpGreeting)
  if err != nil {
    log.Error(err)
  }
  return tmpGreeting
}

func getEnv(key, fallback string) string {
  if value, ok := os.LookupEnv(key); ok {
    return value
  }
  return fallback
}

func init() {
  formatter := runtime.Formatter{ChildFormatter: &log.JSONFormatter{}}
  formatter.Line = true
  log.SetFormatter(&formatter)
  log.SetOutput(os.Stdout)
  level, err := log.ParseLevel(getEnv("LOG_LEVEL", "info"))
  if err != nil {
    log.Error(err)
  }
  log.SetLevel(level)
}

func main() {
  go GetMessages()
  router := mux.NewRouter()
  api := router.PathPrefix("/api").Subrouter()
  api.HandleFunc("/ping", PingHandler).Methods("GET")
  api.HandleFunc("/health", HealthCheckHandler).Methods("GET")
  log.Fatal(http.ListenAndServe(":80", router))
}
Source Code
All source code for this post is available on GitHub in two projects. The Go-based microservices source code, all Kubernetes resources, and all deployment scripts are located in the k8s-istio-observe-backend project repository. The Angular UI TypeScript-based source code is located in the k8s-istio-observe-frontend project repository. You should not need to clone the Angular UI project for this demonstration.
git clone --branch master --single-branch --depth 1 --no-tags \
  https://github.com/garystafford/k8s-istio-observe-backend.git
Docker images, referenced in the Kubernetes Deployment resource files for the Go services and UI, are all available on Docker Hub. The Go microservice Docker images were built using the official Golang Alpine base image on Docker Hub, containing Go version 1.12.0. Using the Alpine image to compile the Go source code ensures the containers will be as small as possible and contain a minimal attack surface.
System Requirements
To follow along with the post, you will need the latest version of the gcloud CLI (min. ver. 239.0.0), part of the Google Cloud SDK, Helm, and the just-released Istio 1.1.0, installed and configured locally or on your build machine.
Set-up and Installation
To deploy the microservices platform to GKE, we will proceed in the following order.
- Create the MongoDB Atlas database cluster;
- Create the CloudAMQP RabbitMQ cluster;
- Modify the Kubernetes resources and scripts for your own environments;
- Create the GKE cluster on GCP;
- Deploy Istio 1.1.0 to the GKE cluster, using Helm;
- Create DNS records for the platform’s exposed resources;
- Deploy the Go-based microservices, Angular UI, and associated resources to GKE;
- Test and troubleshoot the platform;
- Observe the results in part two!
MongoDB Atlas Cluster
MongoDB Atlas is a fully-managed MongoDB-as-a-Service, available on AWS, Azure, and GCP. Atlas, a mature SaaS product, offers high-availability, guaranteed uptime SLAs, elastic scalability, cross-region replication, enterprise-grade security, LDAP integration, a BI Connector, and much more.
MongoDB Atlas currently offers four pricing plans, Free, Basic, Pro, and Enterprise. Plans range from the smallest, M0-sized MongoDB cluster, with shared RAM and 512 MB storage, up to the massive M400 MongoDB cluster, with 488 GB of RAM and 3 TB of storage.
For this post, I have created an M2-sized MongoDB cluster in GCP’s us-central1 (Iowa) region, with a single user database account for this demo. The account will be used to connect from four of the eight microservices, running on GKE.
Originally, I started with an M0-sized cluster, but the compute resources were insufficient to support the volume of calls from the Go-based microservices. I suggest at least an M2-sized cluster or larger.
CloudAMQP RabbitMQ Cluster
CloudAMQP provides fully-managed RabbitMQ clusters on all major cloud and application platforms. RabbitMQ will support a decoupled, eventually consistent, message-based architecture for a portion of our Go-based microservices. For this post, I have created a RabbitMQ cluster in GCP’s us-central1 (Iowa) region, the same as our GKE cluster and MongoDB Atlas cluster. I chose a minimally-configured free version of RabbitMQ. CloudAMQP also offers robust, multi-node RabbitMQ clusters for Production use.
Modify Configurations
There are a few configuration settings you will need to change in the GitHub project’s Kubernetes resource files and Bash deployment scripts.
Istio ServiceEntry for MongoDB Atlas
Modify the Istio ServiceEntry, external-mesh-mongodb-atlas.yaml file, adding your MongoDB Atlas host address. This file allows egress traffic from four of the microservices on GKE to the external MongoDB Atlas cluster.
apiVersion: networking.istio.io/v1alpha3
kind: ServiceEntry
metadata:
  name: mongodb-atlas-external-mesh
spec:
  hosts:
  - {{ your_host_goes_here }}
  ports:
  - name: mongo
    number: 27017
    protocol: MONGO
  location: MESH_EXTERNAL
  resolution: NONE
Istio ServiceEntry for CloudAMQP RabbitMQ
Modify the Istio ServiceEntry, external-mesh-cloudamqp.yaml file, adding your CloudAMQP host address. This file allows egress traffic from two of the microservices to the CloudAMQP cluster.
apiVersion: networking.istio.io/v1alpha3
kind: ServiceEntry
metadata:
  name: cloudamqp-external-mesh
spec:
  hosts:
  - {{ your_host_goes_here }}
  ports:
  - name: rabbitmq
    number: 5672
    protocol: TCP
  location: MESH_EXTERNAL
  resolution: NONE
Istio Gateway and VirtualService Resources
There are numerous strategies you may use to route traffic into the GKE cluster, via Istio. I am using a single domain for the post, example-api.com, and four subdomains. One set of subdomains is for the Angular UI, in the dev Namespace (ui.dev.example-api.com) and the test Namespace (ui.test.example-api.com). The other set of subdomains is for the edge API microservice, Service A, which the UI calls (api.dev.example-api.com and api.test.example-api.com). Traffic is routed to specific Kubernetes Service resources, based on the URL.
According to Istio, the Gateway describes a load balancer operating at the edge of the mesh, receiving incoming or outgoing HTTP/TCP connections. Modify the Istio ingress Gateway, inserting your own domains or subdomains in the hosts section. These are the hosts on port 80 that will be allowed into the mesh.
apiVersion: networking.istio.io/v1alpha3
kind: Gateway
metadata:
  name: demo-gateway
spec:
  selector:
    istio: ingressgateway
  servers:
  - port:
      number: 80
      name: http
      protocol: HTTP
    hosts:
    - ui.dev.example-api.com
    - ui.test.example-api.com
    - api.dev.example-api.com
    - api.test.example-api.com
According to Istio, a VirtualService defines a set of traffic routing rules to apply when a host is addressed. A VirtualService is bound to a Gateway to control the forwarding of traffic arriving at a particular host and port. Modify the project’s four Istio VirtualServices, inserting your own domains or subdomains. Here is an example of one of the four VirtualServices, in the istio-gateway.yaml file.
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: angular-ui-dev
spec:
  hosts:
  - ui.dev.example-api.com
  gateways:
  - demo-gateway
  http:
  - match:
    - uri:
        prefix: /
    route:
    - destination:
        port:
          number: 80
        host: angular-ui.dev.svc.cluster.local
Kubernetes Secret
The project contains a Kubernetes Secret
, go-srv-demo.yaml, with two values. One is for the MongoDB Atlas connection string and one is for the CloudAMQP connections string. Remember Kubernetes Secret
values need to be base64
encoded.
apiVersion: v1
kind: Secret
metadata:
  name: go-srv-config
type: Opaque
data:
  mongodb.conn: {{ your_base64_encoded_secret }}
  rabbitmq.conn: {{ your_base64_encoded_secret }}
On Linux and Mac, you can use the base64 program to encode the connection strings.
> echo -n "mongodb+srv://username:password@atlas-cluster.gcp.mongodb.net/test?retryWrites=true" | base64 bW9uZ29kYitzcnY6Ly91c2VybmFtZTpwYXNzd29yZEBhdGxhcy1jbHVzdGVyLmdjcC5tb25nb2RiLm5ldC90ZXN0P3JldHJ5V3JpdGVzPXRydWU= > echo -n "amqp://username:password@rmq.cloudamqp.com/cluster" | base64 YW1xcDovL3VzZXJuYW1lOnBhc3N3b3JkQHJtcS5jbG91ZGFtcXAuY29tL2NsdXN0ZXI=
Bash Scripts Variables
The bash script, part3_create_gke_cluster.sh, contains a series of environment variables. At a minimum, you will need to change the PROJECT variable in all scripts to match your GCP project name.
# Constants - CHANGE ME!
readonly PROJECT='{{ your_gcp_project_goes_here }}'
readonly CLUSTER='go-srv-demo-cluster'
readonly REGION='us-central1'
readonly MASTER_AUTH_NETS='72.231.208.0/24'
readonly GKE_VERSION='1.12.5-gke.5'
readonly MACHINE_TYPE='n1-standard-2'
The bash script, part4_install_istio.sh, includes the ISTIO_HOME variable. The value should correspond to your local path to Istio 1.1.0. On my local Mac, this value is shown below.
readonly ISTIO_HOME='/Applications/istio-1.1.0'
Deploy GKE Cluster
Next, deploy the GKE cluster using the included bash script, part3_create_gke_cluster.sh. This will create a Regional, multi-zone, 3-node GKE cluster, using the latest version of GKE at the time of this post, 1.12.5-gke.5. The cluster will be deployed to the same region as the MongoDB Atlas and CloudAMQP clusters, GCP’s us-central1 (Iowa) region. Planning where your Cloud resources will reside, for both SaaS providers and primary Cloud providers, can be critical to minimizing latency for network I/O intensive applications.
Deploy Istio using Helm
With the GKE cluster and associated infrastructure in place, deploy Istio. For this post, I have chosen to install Istio using Helm, as recommended by Istio. To deploy Istio using Helm, use the included bash script, part4_install_istio.sh.
The script installs Istio, using the Helm Chart in the local Istio 1.1.0 install/kubernetes/helm/istio directory, which you installed as a requirement for this demonstration. The Istio install script overrides several default values in the Istio Helm Chart using the --set flag. The list of available configuration values is detailed in the Istio Chart’s GitHub project. The options enable Istio’s observability features, which we will explore in part two. Features include Kiali, Grafana, Prometheus, and Jaeger.
helm install ${ISTIO_HOME}/install/kubernetes/helm/istio-init \
  --name istio-init \
  --namespace istio-system

helm install ${ISTIO_HOME}/install/kubernetes/helm/istio \
  --name istio \
  --namespace istio-system \
  --set prometheus.enabled=true \
  --set grafana.enabled=true \
  --set kiali.enabled=true \
  --set tracing.enabled=true

kubectl apply --namespace istio-system \
  -f ./resources/secrets/kiali.yaml
Below, we see the Istio-related Workloads running on the cluster, including the observability tools.
Below, we see the corresponding Istio-related Service resources running on the cluster.
Modify DNS Records
Instead of using IP addresses to route traffic to the GKE cluster and its applications, we will use DNS. As explained earlier, I have chosen a single domain for the post, example-api.com, and four subdomains. One set of subdomains is for the Angular UI, in the dev Namespace and the test Namespace. The other set of subdomains is for the edge microservice, Service A, which the UI calls. Traffic is routed to specific Kubernetes Service resources, based on the URL.
Deploying the GKE cluster and Istio triggers the creation of a Google Load Balancer, four IP addresses, and all required firewall rules. One of the four IP addresses, the one shown below, associated with the Forwarding rule, will be associated with the front-end of the load balancer.
Below, we see the new load balancer, with the front-end IP address and the backend VM pool of three GKE cluster’s worker nodes. Each node is assigned one of the IP addresses, as shown above.
As shown below, using Google Cloud DNS, I have created the four subdomains and assigned the IP address of the load balancer’s front-end to all four subdomains. Ingress traffic to these addresses will be routed through the Istio ingress Gateway and the four Istio VirtualServices, to the appropriate Kubernetes Service resources. Use your choice of DNS management tools to create the four A Type DNS records.
Deploy the Reference Platform
Next, deploy the eight Go-based microservices, the Angular UI, and the associated Kubernetes and Istio resources to the GKE cluster. To deploy the platform, use the included bash deploy script, part5a_deploy_resources.sh. If anything fails and you want to remove the existing resources and re-deploy, without destroying the GKE cluster or Istio, you can use the part5b_delete_resources.sh delete script.
The deploy script deploys all the resources to two Kubernetes Namespaces, dev and test. This will allow us to see how we can differentiate between Namespaces when using the observability tools.
Below, we see the Istio-related resources, which we just deployed. They include the Istio Gateway, four Istio VirtualService, and two Istio ServiceEntry resources.
Below, we see the platform’s Workloads (Kubernetes Deployment resources), running on the cluster. Here we see two Pods for each Workload, a total of 18 Pods, running in the dev Namespace. Each Pod contains both the deployed microservice or UI component, as well as a copy of Istio’s Envoy Proxy.
Below, we see the corresponding Kubernetes Service resources running in the dev Namespace.
Below is a similar view of the Deployment resources running in the test Namespace. Again, we have two Pods for each deployment, with each Pod containing both the deployed microservice or UI component, as well as a copy of Istio’s Envoy Proxy.
Test the Platform
We do want to ensure the platform’s eight Go-based microservices and Angular UI are working properly, communicating with each other, and communicating with the external MongoDB Atlas and CloudAMQP RabbitMQ clusters. The easiest way to test the cluster is by viewing the Angular UI in a web browser.
The UI requires you to input the host domain of Service A, the API’s edge service. Since you cannot use my subdomain, and the JavaScript code is running locally in your web browser, this option allows you to provide your own host domain. This is the same domain or domains you inserted into the two Istio VirtualService resources for the UI. This domain routes your API calls to either the FQDN (fully qualified domain name) of the Service A Kubernetes Service running in the dev Namespace, service-a.dev.svc.cluster.local, or the test Namespace, service-a.test.svc.cluster.local.
You can also use performance testing tools to load-test the platform. Many issues will not show up until the platform is under load. I recently started using hey, a modern load generator tool, as a replacement for Apache Bench (ab). Unlike ab, hey supports HTTP/2 endpoints, which is required to test the platform on GKE with Istio. Below, I am running hey directly from Google Cloud Shell. The tool is simulating 25 concurrent users, generating a total of 1,000 HTTP/2-based GET requests to Service A.
Troubleshooting
If for some reason the UI fails to display, or the call from the UI to the API fails, and assuming all Kubernetes and Istio resources are running on the GKE cluster (all green), the most common explanation is usually a misconfiguration of the following resources:
- Your four Cloud DNS records are not correct. They are not pointing to the load balancer’s front-end IP address;
- You did not configure the four Istio VirtualService resources with the correct subdomains;
- The GKE-based microservices cannot reach the external MongoDB Atlas and CloudAMQP RabbitMQ clusters. Likely, the Kubernetes Secret is constructed incorrectly, or the two ServiceEntry resources contain the wrong host information for those external clusters;
I suggest starting the troubleshooting by calling Service A, the API’s edge service, directly, using cURL or Postman. You should see a JSON response payload containing an array of greetings from the services in the call chain. If the direct call succeeds, the issue is likely with the UI, not the API.
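As a sketch of such a check in Go (the subdomain below is a placeholder for your own), decoding the response into the same Greeting struct used by the services:

package main

import (
  "encoding/json"
  "fmt"
  "log"
  "net/http"
  "time"
)

// Greeting mirrors the struct used by the Go-based services.
type Greeting struct {
  ID          string    `json:"id,omitempty"`
  ServiceName string    `json:"service,omitempty"`
  Message     string    `json:"message,omitempty"`
  CreatedAt   time.Time `json:"created,omitempty"`
}

func main() {
  // Replace with your own subdomain for Service A.
  resp, err := http.Get("http://api.dev.example-api.com/api/ping")
  if err != nil {
    log.Fatal(err)
  }
  defer resp.Body.Close()

  var greetings []Greeting
  if err := json.NewDecoder(resp.Body).Decode(&greetings); err != nil {
    log.Fatal(err)
  }
  // A healthy platform returns one greeting per service in the call
  // chain, with Service A's own greeting appended last.
  for _, g := range greetings {
    fmt.Printf("%s: %s\n", g.ServiceName, g.Message)
  }
}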
Next, confirm that the four MongoDB databases were created for Service D, Service F, Service G, and Service H. Also, confirm that new documents are being written to the databases’ collections.
Next, confirm that the new RabbitMQ queue was created, using the CloudAMQP RabbitMQ Management Console. Service D produces messages, which Service F consumes from the queue.
Lastly, review the Stackdriver logs to see if there are any obvious errors.
Part Two
In part two of this post, we will explore each observability tool, and see how they can help us manage our GKE cluster and the reference platform running in the cluster.
Since the cluster only takes minutes to fully create and deploy resources to, if you want to tear down the GKE cluster, run the part6_tear_down.sh script.
All opinions expressed in this post are my own and not necessarily the views of my current or past employers or their clients.
Getting Started with Red Hat Ansible for Google Cloud Platform
Posted by Gary A. Stafford in Bash Scripting, Build Automation, DevOps, GCP on January 30, 2019
In this post, we will explore the use of Ansible, the open source community project sponsored by Red Hat, for automating the provisioning, configuration, deployment, and testing of resources on the Google Cloud Platform (GCP). We will start by using Ansible to configure and deploy applications to existing GCP compute resources. We will then expand our use of Ansible to provision and configure GCP compute resources using the Ansible/GCP native integration with GCP modules.
Red Hat Ansible
Ansible, purchased by Red Hat in October 2015, seamlessly provides workflow orchestration with configuration management, provisioning, and application deployment in a single platform. Unlike similar tools, Ansible’s workflow automation is agentless, relying on Secure Shell (SSH) and Windows Remote Management (WinRM). Ansible has published a whitepaper on The Benefits of Agentless Architecture.
According to G2 Crowd, Ansible is a clear leader in the Configuration Management Software category, ranked right behind GitLab. Some of Ansible’s main competitors in the category include GitLab, AWS Config, Puppet, Chef, Codenvy, HashiCorp Terraform, Octopus Deploy, and TeamCity. There are dozens of published articles, comparing Ansible to Puppet, Chef, SaltStack, and more recently, Terraform.
Google Compute Engine
According to Google, Google Compute Engine (GCE) delivers virtual machines (VMs) running in Google’s data centers and on their worldwide fiber network. Compute Engine’s tooling and workflow support enables scaling from single instances to global, load-balanced cloud computing.
Comparable products to GCE in the IaaS category include Amazon Elastic Compute Cloud (EC2), Azure Virtual Machines, IBM Cloud Virtual Servers, and Oracle Compute Cloud Service.
Apache HTTP Server
According to Apache, the Apache HTTP Server (“httpd”) is an open-source HTTP server for modern operating systems including Linux and Windows. The Apache HTTP Server provides a secure, efficient, and extensible server that provides HTTP services in sync with the current HTTP standards. The Apache HTTP Server was launched in 1995 and it has been the most popular web server on the Internet since 1996. We will deploy Apache HTTP Server to GCE VMs, using Ansible.
Demonstration
In this post, we will demonstrate two different workflows with Ansible on GCP. First, we will use Ansible to configure and deploy the Apache HTTP Server to an existing GCE instance.
- Provision and configure a GCE VM instance, disk, firewall rule, and external IP, using the Google Cloud (gcloud) CLI tool;
- Deploy and configure the Apache HTTP Server and associated packages, using an Ansible Playbook containing an httpd Ansible Role;
- Manually test the GCP resources and Apache HTTP Server;
- Clean up the GCP resources using the gcloud CLI tool;
In the second workflow, we will use Ansible to provision and configure the GCP resources, as well as deploy the Apache HTTP Server to the new GCE VM.
- Provision and configure a VM instance, disk, VPC global network, subnetwork, firewall rules, and external IP address, using an Ansible Playbook containing an Ansible Role, as opposed to the gcloud CLI tool;
- Deploy and configure the Apache HTTP Server and associated packages, using an Ansible Playbook containing an httpd Ansible Role;
- Test the GCP resources and Apache HTTP Server using role-based test tasks;
- Clean up all the GCP resources using an Ansible Playbook containing an Ansible Role;
Source Code
The source code for this post may be found on the master branch of the ansible-gcp-demo GitHub repository.
git clone --branch master --single-branch --depth 1 --no-tags \
  https://github.com/garystafford/ansible-gcp-demo.git
The project has the following file structure.
.
├── LICENSE
├── README.md
├── _unused
│   ├── httpd_playbook.yml
├── ansible
│   ├── ansible.cfg
│   ├── group_vars
│   │   └── webservers.yml
│   ├── inventories
│   │   ├── hosts
│   │   └── webservers_gcp.yml
│   ├── playbooks
│   │   ├── 10_webserver_infra.yml
│   │   └── 20_webserver_config.yml
│   ├── roles
│   │   ├── gcpweb
│   │   └── httpd
│   └── site.yml
├── part0_source_creds.sh
├── part1_create_vm.sh
└── part2_clean_up.sh
Source code samples in this post are displayed as GitHub Gists which may not display correctly on all mobile and social media browsers, such as LinkedIn.
Setup New GCP Project
For this demonstration, I have created a new GCP Project containing a new service account and public SSH key. The project’s service account will be used by the gcloud CLI tool and Ansible to access and provision compute resources within the project. The SSH key will be used by both tools to SSH into the GCE VM within the project. Start by creating a new GCP Project.
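If you prefer the CLI to the console, the project can also be created with the gcloud CLI tool (a sketch; the project ID shown matches the one used throughout this post):

gcloud projects create ansible-gce-demo
gcloud config set project ansible-gce-demo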
Add a new service account to the project on the IAM & admin ⇒ Service accounts tab.
Grant the new service account permission to the ‘Compute Admin’ Role, within the project, using the Role drop-down menu. The principle of least privilege (PoLP) suggests we should limit the service account’s permissions to only the role(s) necessary to provision the required compute resources.
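The console steps above can also be scripted (a sketch, assuming the service account name and project ID used in this post):

gcloud iam service-accounts create ansible --display-name 'ansible'
gcloud projects add-iam-policy-binding ansible-gce-demo \
  --member serviceAccount:ansible@ansible-gce-demo.iam.gserviceaccount.com \
  --role roles/compute.admin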
Create a private key for the service account, on the IAM & admin ⇒ Service accounts tab. This private key is different from the SSH key we will add to the project, next. This private key contains the credentials for the service account.
Choose the JSON key type.
Download the private key JSON file and place it in a safe location, accessible to Ansible. Be careful not to check this file into source control. Again, this file contains the service account’s credentials used to programmatically access GCP and administer compute resources.
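The private key can likewise be created from the command line (a sketch; the output path is an example, not the path used in this post):

gcloud iam service-accounts keys create ~/gcp_creds/ansible-gce-demo.json \
  --iam-account ansible@ansible-gce-demo.iam.gserviceaccount.com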
We should now have a service account, associated with the new GCP project, with permissions to the ‘Compute Admin’ role, and a private key which has been downloaded and is accessible to Ansible. Note the Email address of the service account, in my case, ansible@ansible-gce-demo.iam.gserviceaccount.com; you will need to reference this later in your configuration.
Next, create an SSH public/private key pair. The SSH key will be used to programmatically access the GCE VM. Creating a separate key pair allows you to limit its use to just the new GCP project. If compromised, the key pair is easily deleted and replaced in the GCP project and in the Ansible configuration. On a Mac, you can use the following commands to create a new key pair and copy the public key to the clipboard.
ssh-keygen -t rsa -b 4096 -C "ansible"
cat ~/.ssh/ansible.pub | pbcopy
Add your new public key clipboard contents to the project, on the Compute Engine ⇒ Metadata ⇒ SSH Keys tab. Adding the key here means it is usable by any VM in the project unless you explicitly block this option when provisioning a new VM and configure a key specifically for that VM.
Note the name, ansible, associated with the key; you will need to reference this later in your configuration.
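As an alternative to the console, the project-wide SSH key can be added with the gcloud CLI tool (a sketch; the metadata value format is USERNAME:PUBLIC_KEY, and note this command replaces any existing ssh-keys metadata value, so use it with care):

gcloud compute project-info add-metadata \
  --metadata ssh-keys="ansible:$(cat ~/.ssh/ansible.pub)"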
Setup Ansible
Although this post is not a primer on Ansible, I will cover a few setup steps I have done to prepare for this demo. On my Mac, I am running Python 3.7, pip 18.1, and Ansible 2.7.6. With Python and pip installed, the easiest way to install Ansible on Mac or Linux is using pip.
pip install ansible
You will also need to install two additional packages in order to gather information about GCP-based hosts using GCE Dynamic Inventory, explained later in the post.
pip install requests google-auth
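A quick way to confirm the toolchain is ready (a sketch):

ansible --version
python -c "import requests, google.auth; print('GCE dynamic inventory dependencies found')"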
Ansible Configuration
I created a simple Ansible ansible.cfg file for this project, located in the /ansible/ sub-directory. The Ansible configuration file contains the location of the project’s roles and inventory, which is explained later. The file also contains two configuration items associated with the SSH key pair we just created. If your key is named differently or stored in a different location, update the file (gist).
[defaults]
host_key_checking = False
roles_path = roles
inventory = inventories/hosts
remote_user = ansible
private_key_file = ~/.ssh/ansible

[inventory]
enable_plugins = host_list, script, yaml, ini, auto, gcp_compute
Ansible provides a complete example configuration file, documenting all parameters, on GitHub.
Ansible Environment Variables
To decouple our specific GCP project’s credentials from the Ansible playbooks and roles, Ansible recommends setting those required module parameters as environment variables, as opposed to including them in the playbooks. Additionally, I have set the GCP project name as an environment variable, in order to also decouple it from the playbooks. To set those environment variables, source the script in the project’s root directory, using the source command (gist).
source ./part0_source_creds.sh
#!/bin/bash
#
# author: Gary A. Stafford
# site: https://programmaticponderings.com
# license: MIT License
# purpose: Source Ansible/GCP credentials
# usage: source ./ansible_gcp_creds.sh

# Constants - CHANGE ME!
export GCP_PROJECT='ansible-gce-demo'
export GCP_AUTH_KIND='serviceaccount'
export GCP_SERVICE_ACCOUNT_FILE='path/to/your/credentials/file.json'
export GCP_SCOPES='https://www.googleapis.com/auth/compute'
GCP CLI/Ansible Hybrid Workflow
Oftentimes, enterprises employ a mix of DevOps tooling to provision, configure, and deploy to compute resources. In this first workflow, we will use Ansible to configure and deploy a web server to an existing GCE VM, created in advance with the gcloud CLI tool.
Create GCP Resources
First, use the gcloud CLI tool to create a GCE VM and associated resources, including an external IP address and firewall rule for port 80 (HTTP). For simplicity, we will use the existing GCP default Virtual Private Cloud (VPC) network and the default us-east1 subnetwork. Execute the part1_create_vm.sh script in the project’s root directory. The default network should already have port 22 (SSH) open on the firewall. Note the SERVICE_ACCOUNT variable, in the script, is the service account email found on the IAM & admin ⇒ Service accounts tab, shown in the previous section (gist).
#!/bin/bash
#
# author: Gary A. Stafford
# site: https://programmaticponderings.com
# license: MIT License
# purpose: Create GCP VM instance and associated resources
# usage: sh ./part1_create_vm.sh

# Constants - CHANGE ME!
readonly PROJECT='ansible-gce-demo'
readonly SERVICE_ACCOUNT='ansible@ansible-gce-demo.iam.gserviceaccount.com'
readonly ZONE='us-east1-b'

# Create GCE VM with disk storage
time gcloud compute instances create web-1 \
  --project $PROJECT \
  --zone $ZONE \
  --machine-type n1-standard-1 \
  --network default \
  --subnet default \
  --network-tier PREMIUM \
  --maintenance-policy MIGRATE \
  --service-account $SERVICE_ACCOUNT \
  --scopes https://www.googleapis.com/auth/devstorage.read_only,https://www.googleapis.com/auth/logging.write,https://www.googleapis.com/auth/monitoring.write,https://www.googleapis.com/auth/servicecontrol,https://www.googleapis.com/auth/service.management.readonly,https://www.googleapis.com/auth/trace.append \
  --tags apache-http-server \
  --image centos-7-v20190116 \
  --image-project centos-cloud \
  --boot-disk-size 200GB \
  --boot-disk-type pd-standard \
  --boot-disk-device-name compute-disk

# Create firewall rule to allow ingress traffic from port 80
time gcloud compute firewall-rules create default-allow-http \
  --project $PROJECT \
  --description 'Allow HTTP from anywhere' \
  --direction INGRESS \
  --priority 1000 \
  --network default \
  --action ALLOW \
  --rules tcp:80 \
  --source-ranges 0.0.0.0/0 \
  --target-tags apache-http-server
The output from the script should look similar to the following. Note the external IP address associated with the VM; you will need to reference it later in the post.
Using the gcloud CLI tool or Google Cloud Console, we should be able to view our newly provisioned resources on GCP. First, our new GCE VM, using the Compute Engine ⇒ VM instances ⇒ Details tab.
Next, examine the Network interface details tab. Here we see details about the network and subnetwork our VM is running within. We see the internal and external IP addresses of the VM. We also see the firewall rules, including our new rule, allowing TCP ingress traffic on port 80.
Lastly, examine the new firewall rule, which will allow TCP traffic on port 80 from any IP address to our VM, located in the default network. Note the other, pre-existing rules controlling access to the default network.
The final GCP architecture looks as follows.
GCE Dynamic Inventory
Two core concepts in Ansible are hosts and inventory. We need an inventory of the hosts on which to run our Ansible playbooks. If we had long-lived hosts, often referred to as ‘pets’, with long-lived static IP addresses or DNS entries, then we could manually add the hosts to a static hosts file, similar to the example below.
[webservers]
34.73.171.5
34.73.170.97
34.73.172.153

[dbservers]
db1.example.com
db2.example.com
However, given the ephemeral nature of the cloud, where hosts (often referred to as ‘cattle’), IP addresses, and even DNS entries are often short-lived, we will use the Ansible concept of Dynamic Inventory.
If you recall, we pip installed two packages, requests and google-auth, during our Ansible setup for use with GCE Dynamic Inventory. According to Ansible, the best way to interact with your GCE VM hosts is to use the gcp_compute inventory plugin. The plugin allows Ansible to dynamically query GCE for the nodes that can be managed. With the gcp_compute inventory plugin, we can also selectively classify the hosts we find into Groups. We will then run playbooks, containing roles, on a group or groups of hosts.
To demonstrate how to dynamically find the new GCE host, and add it to a group, execute the following command, using the Ansible Inventory CLI.
ansible-inventory --graph -i inventories/webservers_gcp.yml
The command calls the webservers_gcp.yml file, which contains the logic necessary to associate the GCE hosts with the webservers host group. Ansible’s current documentation is pretty sparse on this subject. Thanks to Matthieu Remy for his great post, How to Use Ansible GCP Compute Inventory Plugin. For this demo, we are only looking for hosts in us-east1-b which have ‘web-’ in their name (gist).
---
plugin: gcp_compute
zones:
  - us-east1-b
projects:
  - ansible-gce-demo
filters: []
groups:
  webservers: "'web-' in name"
scopes:
  - https://www.googleapis.com/auth/compute
service_account_file: ~/Documents/Personal/gcp_creds/ansible-gce-demo-a0dbb4ac2ff7.json
auth_kind: serviceaccount
The output from the command should look similar to the following. We should observe our new VM, as indicated by its external IP address, is assigned to the webservers group. We will use the power of Dynamic Inventory to apply a playbook to all the hosts within the webservers group.
We can also view details about hosts by modifying the inventory command.
ansible-inventory --list -i inventories/webservers_gcp.yml --yaml
The output from the command should look similar to the following. This particular example was run against an earlier host, with a different external IP address.
Apache HTTP Server Playbook
For our first taste of Ansible on GCP, we will run an Ansible Playbook to install and configure the Apache HTTP Server on the new CentOS-based VM. According to Ansible, Playbooks, which are YAML-based, can declare configurations; they can also orchestrate steps of any manually-ordered process, even when different steps must bounce back and forth between sets of machines in particular orders. They can launch tasks synchronously or asynchronously. Playbooks are used to orchestrate tasks, as opposed to using Ansible’s ad-hoc task execution mode.
A playbook can be ‘monolithic’ in nature, containing all the required Variables, Tasks, and Handlers to achieve the desired outcome. If we wrote a single playbook to deploy and configure our Apache HTTP Server, it might look like the httpd_playbook.yml playbook, below (gist).
---
- name: Install Apache HTTP Server
  hosts: webservers
  become: yes
  vars:
    greeting: 'Hello Ansible on GCP!'

  tasks:
    - name: upgrade all packages
      yum:
        name: '*'
        state: latest

    - name: ensure the latest list of packages are installed
      yum:
        name: "{{ packages }}"
        state: latest
      vars:
        packages:
          - httpd
          - httpd-tools
          - php

    - name: deploy apache config file
      template:
        src: server-status.conf
        dest: /etc/httpd/conf.d/server-status.conf
      notify:
        - restart apache

    - name: deploy php document to DocumentRoot
      template:
        src: info.php
        dest: /var/www/html/info.php

    - name: deploy html document to DocumentRoot
      template:
        src: index.html.j2
        dest: /var/www/html/index.html

    - name: ensure apache is running
      service:
        name: httpd
        state: started

  handlers:
    - name: restart apache
      service:
        name: httpd
        state: restarted
We could run this playbook with the following command to deploy the Apache HTTP Server, but we won’t. Instead, next, we will run a playbook that applies the httpd role.
ansible-playbook \
-i inventories/webservers_gcp.yml \
playbooks/httpd_playbook.yml
Ansible Roles
According to Ansible, Roles are ways of automatically loading certain vars_files, tasks, and handlers based on a known file structure. Grouping content by roles also allows easy sharing of roles with other users. The usage of roles is preferred as it provides a nice organizational system.
The httpd role is identical in functionality to the httpd_playbook.yml used in the first workflow. However, the primary parts of the playbook have been decomposed into individual resource files, as described by Ansible. This structure is created using the Ansible Galaxy CLI. Ansible Galaxy is Ansible’s official hub for sharing Ansible content.
ansible-galaxy init httpd
This ansible-galaxy command creates the following structure. I added the files and Jinja2 template afterward.
.
├── README.md
├── defaults
│   └── main.yml
├── files
│   ├── info.php
│   └── server-status.conf
├── handlers
│   └── main.yml
├── meta
│   └── main.yml
├── tasks
│   └── main.yml
├── templates
│   └── index.html.j2
├── tests
│   ├── inventory
│   └── test.yml
└── vars
    └── main.yml
Within the httpd role:
- Variables are stored in the defaults/main.yml file;
- Tasks are stored in the tasks/main.yml file;
- Handlers are stored in the handlers/main.yml file;
- Files are stored in the files/ sub-directory;
- Jinja2 templates are stored in the templates/ sub-directory;
- Tests are stored in the tests/ sub-directory;
- Other sub-directories and files contain metadata about the role;
To apply the httpd role, we will run the 20_webserver_config.yml playbook. Compare this playbook, below, with the previous, monolithic httpd_playbook.yml playbook. All of the logic has now been decomposed across the httpd role’s separate backing files (gist).
---
- name: Configure GCP webserver(s)
  hosts: webservers
  gather_facts: no
  become: yes
  roles:
    - role: httpd
We can start by running our playbook using Ansible’s Check Mode (“Dry Run”). When ansible-playbook is run with --check, Ansible will not make any actual changes to the remote systems. According to Ansible, Check mode is just a simulation, and if you have steps that use conditionals that depend on the results of prior commands, it may be less useful for you. However, it is great for one-node-at-a-time basic configuration management use cases. Execute the following command using Check mode.
ansible-playbook \
  -i inventories/webservers_gcp.yml \
  playbooks/20_webserver_config.yml --check
The output from the command should look similar to the following. It shows that if we execute the actual command, we should expect seven changes to occur.
If everything looks good, then run the same command without using Check mode.
ansible-playbook \
  -i inventories/webservers_gcp.yml \
  playbooks/20_webserver_config.yml
The output from the command should look similar to the following. Note the number of items changed, seven, is identical to the results of using Check mode, above.
If we were to execute the command using Check mode for a second time, we should observe zero changed items. This means the last command successfully applied all changes and no new changes are present in the playbook.
Testing the Results
There are a number of methods and tools we could use to test the deployments of the Apache HTTP Server and server tools. First, we can use an ad-hoc ansible CLI command to confirm the httpd process is running on the VM, by calling systemctl. The systemctl application is used to introspect and control the state of the systemd system and service manager, running on the CentOS-based VM.
ansible webservers \
  -i inventories/webservers_gcp.yml \
  -a "systemctl status httpd"
The output from the command should look similar to the following. We see the Apache HTTP Server service details. We also see it being stopped and started as required by the tasks and handler in the role.
We can also check that the home page and PHP info documents, we deployed as part of the playbook, are in the correct location on the VM.
ansible webservers \
  -i inventories/webservers_gcp.yml \
  -a "ls -al /var/www/html"
The output from the command should look similar to the following. We see the two documents we deployed are in the root of the website directory.
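Another simple ad-hoc smoke test is Ansible’s ping module, which verifies SSH connectivity and a usable Python interpreter on each host in the group (a sketch):

ansible webservers \
  -i inventories/webservers_gcp.yml \
  -m ping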
Next, view our website’s home page by pointing your web browser to the external IP address we created earlier and associated with the VM, on port 80 (HTTP). We should observe the variable value in the playbook, ‘Hello Ansible on GCP!’, was injected into the Jinja2 template file, index.html.j2, and the page deployed correctly to the VM.
If you recall from the httpd role, we had a task to deploy the server status configuration file. This configuration file exposes the /server-status endpoint, as shown below. The status page shows the internal and external IP addresses assigned to the VM. It also shows the current version of Apache HTTP Server and PHP, server uptime, traffic, load, CPU usage, number of requests, number of running processes, and so forth.
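You can also query the endpoint from the command line. Apache’s mod_status supports a machine-readable variant of the page via the ?auto query parameter (a sketch; substitute the VM’s external IP address):

curl -s http://your_vms_external_ip/server-status?auto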
Testing with Apache Bench
Apache Bench (ab) is the Apache HTTP server benchmarking tool. We can use Apache Bench locally to generate CPU, memory, file, and network I/O loads on the VM. For example, using the following command, we can generate 100K requests to the server-status page, simulating 100 concurrent users.
ab -kc 100 -n 100000 http://your_vms_external_ip/server-status
The output from the command should look similar to the following. Observe this command successfully resulted in a sustained load on the web server for approximately 17.5 minutes.
Using the Compute Engine ⇒ VM instances ⇒ Monitoring tab, we see the corresponding Apache Bench CPU, memory, file, and network load on the VM, starting at about 10:03 AM, soon after running the playbook to install Apache HTTP Server.
Destroy GCP Resources
After exploring the results of our workflow, tear down the existing GCE resources before we continue to the next workflow. To delete resources, execute the part2_clean_up.sh script in the project’s root directory (gist).
#!/bin/bash
#
# author: Gary A. Stafford
# site: https://programmaticponderings.com
# license: MIT License
# purpose: Delete GCP VM instance, IP address, and firewall rule
# usage: sh ./part2_clean_up.sh

# Constants - CHANGE ME!
readonly PROJECT='ansible-gce-demo'
readonly ZONE='us-east1-b'

time yes | gcloud compute instances delete web-1 \
  --project $PROJECT --zone $ZONE

time yes | gcloud compute firewall-rules delete default-allow-http \
  --project $PROJECT
The output from the script should look similar to the following.
Ansible Workflow
In the second workflow, we will provision and configure the GCP resources, and deploy Apache HTTP Server to the new GCE VM, using Ansible. We will be using the same Project, Region, and Zone as the previous example. However, this time we will create a new global VPC network instead of using the default network as before, a new subnetwork instead of using the default subnetwork as before, and a new firewall with ingress rules to open ports 22 and 80. Lastly, we will create an external IP address and assign it to the VM.
Provision GCP Resources
Instead of using the gcloud CLI tool, we will use Ansible to provision the GCP resources. To accomplish this, I have created one playbook, 10_webserver_infra.yml, with one role, gcpweb, but two sets of tasks: one to create the GCE resources, create.yml, and one to delete the GCP resources, delete.yml. This is a typical Ansible playbook pattern. The standard file directory structure of the role looks as follows, similar to the httpd role.
.
├── README.md
├── defaults
│   └── main.yml
├── files
├── handlers
│   └── main.yml
├── meta
│   └── main.yml
├── tasks
│   ├── create.yml
│   ├── delete.yml
│   └── main.yml
├── templates
├── tests
│   ├── inventory
│   └── test.yml
└── vars
    └── main.yml
To provision the GCE resources, we run the 10_webserver_infra.yml playbook (gist).
---
- name: Create GCP webserver(s) resources
  hosts: localhost
  gather_facts: no
  connection: local
  roles:
    - role: gcpweb
This playbook runs the gcpweb role. The role’s default main.yml task file imports two other sets of tasks, one for create and one for delete. Each set of tasks has a corresponding tag associated with it (gist).
---
- import_tasks: create.yml
  tags:
    - create

- import_tasks: delete.yml
  tags:
    - delete
By calling the playbook and passing the create tag, the role will apply the associated set of create tasks. Tags are a powerful construct in Ansible. Execute the following command, passing the create tag.
ansible-playbook -t create playbooks/10_webserver_infra.yml
In the case of this playbook, the Check mode, used earlier, would fail. If you recall, this feature is not designed to work with playbooks that have steps that use conditionals that depend on the results of prior commands, as is the case with this playbook.
The create.yml file contains six tasks, which leverage Ansible GCP Modules. The tasks create a global VPC network, a subnetwork in the us-east1 Region, a firewall and rules, an external IP address, a disk, and a VM instance (gist).
---
- name: create a network
  gcp_compute_network:
    name: ansible-network
    auto_create_subnetworks: yes
    project: "{{ lookup('env','GCP_PROJECT') }}"
    state: present
  register: network

- name: create a subnetwork
  gcp_compute_subnetwork:
    name: ansible-subnet
    region: "{{ region }}"
    network: "{{ network }}"
    ip_cidr_range: "{{ ip_cidr_range }}"
    project: "{{ lookup('env','GCP_PROJECT') }}"
    state: present
  register: subnet

- name: create a firewall
  gcp_compute_firewall:
    name: ansible-firewall
    network: "projects/{{ lookup('env','GCP_PROJECT') }}/global/networks/{{ network.name }}"
    allowed:
      - ip_protocol: tcp
        ports: ['80','22']
    target_tags:
      - apache-http-server
    source_ranges: ['0.0.0.0/0']
    project: "{{ lookup('env','GCP_PROJECT') }}"
    state: present
  register: firewall

- name: create an address
  gcp_compute_address:
    name: "{{ instance_name }}"
    region: "{{ region }}"
    project: "{{ lookup('env','GCP_PROJECT') }}"
    state: present
  register: address

- name: create a disk
  gcp_compute_disk:
    name: "{{ instance_name }}"
    size_gb: "{{ size_gb }}"
    source_image: 'projects/centos-cloud/global/images/centos-7-v20190116'
    zone: "{{ zone }}"
    project: "{{ lookup('env','GCP_PROJECT') }}"
    state: present
  register: disk

- name: create an instance
  gcp_compute_instance:
    state: present
    name: "{{ instance_name }}"
    machine_type: "{{ machine_type }}"
    disks:
      - auto_delete: true
        boot: true
        source: "{{ disk }}"
    network_interfaces:
      - network: "{{ network }}"
        subnetwork: "{{ subnet }}"
        access_configs:
          - name: External NAT
            nat_ip: "{{ address }}"
            type: ONE_TO_ONE_NAT
    zone: "{{ zone }}"
    project: "{{ lookup('env','GCP_PROJECT') }}"
    tags:
      items:
        - apache-http-server
        - webserver
  register: instance
If you’re interested in what is actually happening during the execution of the playbook, add the verbose option (-v or -vv) to the above command. This can be very helpful in learning Ansible.
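For example (a sketch):

ansible-playbook -t create playbooks/10_webserver_infra.yml -vv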
The output from the command should look similar to the following. Note the changes applied to localhost. Since no GCE VM host(s) exist on GCP until the resources are provisioned, we reference localhost. The entire process took less than two minutes to create a global VPC network, subnetwork, firewall rules, VM, attached disk, and assign a public IP address.
All GCP resources are now provisioned and configured. Below, we see the new GCE VM created by Ansible.
Below, we see the new GCE VM’s network interface details console page, showing details about the VM, NIC, internal and external IP addresses, network, subnetwork, and ingress firewall rules.
Below, we see the VPC details showing each of the automatically-created regional subnets, and our new ‘ansible-subnet’, in the us-east1 region, and spanning 14 IP addresses in the 172.16.0.0/28 CIDR (Classless Inter-Domain Routing) block.
To deploy and configure Apache HTTP Server, run the httpd role exactly the same way we did in the first workflow.
ansible-playbook \
  -i inventories/webservers_gcp.yml \
  playbooks/20_webserver_config.yml
Role-based Testing
In the first workflow, we manually tested our results using a number of ad-hoc commands and by viewing web pages in our browser. These methods of testing do not lend themselves to DevOps automation. A more effective strategy is writing tests, which are part of the role and may be run each time the role is applied, as part of a CI/CD pipeline. Each role in this project contains a few simple tests to confirm the success of the tasks in the role. First, run the gcpweb role’s tests with the following command.
ansible-playbook \
  -i inventories/webservers_gcp.yml \
  roles/gcpweb/tests/test.yml
The playbook gathers facts about the GCE hosts in the host group and runs a total of five test tasks against those hosts. The tasks confirm the host’s timezone, vCPU count, OS type, OS major version, and hostname, using the facts gathered (gist).
---
- name: Test gcpweb Ansible role
  hosts: webservers
  gather_facts: yes

  tasks:
    # - name: List all ansible facts
    #   debug:
    #     msg: "{{ ansible_facts }}"

    - name: Check if timezone is UTC
      debug:
        msg: Timezone is UTC
      failed_when: ansible_facts['date_time']['tz'] != 'UTC'

    - name: Check if processor vCPUs count is 1
      debug:
        msg: Processor vCPUs count is 1
      failed_when: ansible_facts['processor_vcpus'] != 1

    - name: Check if distribution is CentOS
      debug:
        msg: Distribution is CentOS
      failed_when: ansible_facts['distribution'] != 'CentOS'

    - name: Check if distribution major version is 7
      debug:
        msg: Distribution major version is 7
      failed_when: ansible_facts['distribution_major_version'] != '7'

    - name: Check if hostname contains 'web-'
      debug:
        msg: Hostname contains 'web-'
      failed_when: "'web-' not in ansible_facts['hostname']"
The output from the command should look similar to the following. Observe that all five tasks ran successfully.
Next, run the httpd role’s tests.
ansible-playbook \
  -i inventories/webservers_gcp.yml \
  roles/httpd/tests/test.yml
Similarly, the output from the command should look similar to the following. The playbook runs four test tasks this time. The tasks confirm both files are present, the home page is accessible, and that the server-status page displays properly. Below, we see all four ran successfully.
Making a Playbook Change
To observe what happens if we apply a change to a playbook, let’s change the greeting variable value in the /roles/httpd/defaults/main.yml file in the httpd role. Recall, the original home page looked as follows.
Change the greeting variable value and re-run the playbook, using the same command.
ansible-playbook \
  -i inventories/webservers_gcp.yml \
  playbooks/20_webserver_config.yml
The output from the command should look similar to the following. As expected, we should observe that only one task, deploying the home page, was changed.
Viewing the home page again, or by modifying the associated test task, we should observe the new value is injected into the Jinja2 template file, index.html.j2, and the new page deployed correctly.
Destroy GCP Resources with Ansible
Once you are finished, you can destroy all the GCP resources by calling the 10_webserver_infra.yml playbook and passing the delete tag; the role will apply the associated set of delete tasks.
ansible-playbook -t delete playbooks/10_webserver_infra.yml
With Ansible, we delete GCP resources by changing the state from present to absent. The playbook will delete the resources in a particular order, to avoid dependency conflicts, such as trying to delete the network before the VM. Note we do not have to explicitly delete the disk since, if you recall, we provisioned the VM instance with the disks.auto_delete=true option (gist).
---
- name: delete an instance
  gcp_compute_instance:
    name: "{{ instance_name }}"
    zone: "{{ zone }}"
    project: "{{ lookup('env','GCP_PROJECT') }}"
    state: absent

- name: delete an address
  gcp_compute_address:
    name: "{{ instance_name }}"
    region: "{{ region }}"
    project: "{{ lookup('env','GCP_PROJECT') }}"
    state: absent

- name: delete a firewall
  gcp_compute_firewall:
    name: ansible-firewall
    project: "{{ lookup('env','GCP_PROJECT') }}"
    state: absent

- name: register the existing network
  gcp_compute_network:
    name: ansible-network
    project: "{{ lookup('env','GCP_PROJECT') }}"
  register: network

# - debug:
#     var: network

- name: delete a subnetwork
  gcp_compute_subnetwork:
    name: ansible-subnet
    region: "{{ region }}"
    network: "{{ network }}"
    ip_cidr_range: "{{ ip_cidr_range }}"
    project: "{{ lookup('env','GCP_PROJECT') }}"
    state: absent

- name: delete a network
  gcp_compute_network:
    name: ansible-network
    project: "{{ lookup('env','GCP_PROJECT') }}"
    state: absent
The output from the command should look similar to the following. We see the VM instance, attached disk, firewall and rules, external IP address, subnetwork, and finally the network, each being deleted.
Conclusion
In this post, we saw how easy it is to get started with Ansible on the Google Cloud Platform. Using Ansible’s 300+ cloud modules, provisioning, configuring, deploying to, and testing a wide range of GCP, Azure, and AWS resources are easy, repeatable, and completely automatable.
All opinions expressed in this post are my own and not necessarily the views of my current or past employers or their clients.
Automating Multi-Environment Kubernetes Virtual Clusters with Google Cloud DNS, Auth0, and Istio 1.0
Posted by Gary A. Stafford in Bash Scripting, Build Automation, Cloud, DevOps, Enterprise Software Development, GCP, Java Development, Kubernetes, Software Development on January 19, 2019
Kubernetes supports multiple virtual clusters within the same physical cluster. These virtual clusters are called Namespaces. Namespaces are a way to divide cluster resources between multiple users. Many enterprises use Namespaces to divide the same physical Kubernetes cluster into different virtual software development environments as part of their overall Software Development Lifecycle (SDLC). This practice is commonly used in ‘lower environments’ or ‘non-prod’ (not Production) environments. These environments commonly include Continuous Integration and Delivery (CI/CD), Development, Integration, Testing/Quality Assurance (QA), User Acceptance Testing (UAT), Staging, Demo, and Hotfix. Namespaces provide a basic form of what is referred to as soft multi-tenancy.
Generally, the security boundaries and performance requirements between non-prod environments, within the same enterprise, are less restrictive than Production or Disaster Recovery (DR) environments. This allows for multi-tenant environments, while Production and DR are normally single-tenant environments. In order to approximate the performance characteristics of Production, the Performance Testing environment is also often isolated to a single-tenant. A typical enterprise would minimally have a non-prod, performance, production, and DR environment.
Using Namespaces to create virtual separation on the same physical Kubernetes cluster provides enterprises with more efficient use of virtual compute resources, reduces Cloud costs, eases the management burden, and often expedites and simplifies the release process.
Demonstration
In this post, we will re-examine the topic of virtual clusters, similar to the recent post, Managing Applications Across Multiple Kubernetes Environments with Istio: Part 1 and Part 2. We will focus specifically on automating the creation of the virtual clusters on GKE with Istio 1.0, managing the Google Cloud DNS records associated with the cluster’s environments, and enabling both HTTPS and token-based OAuth access to each environment. We will use the Storefront API for our demonstration, featured in the previous three posts, including Building a Microservices Platform with Confluent Cloud, MongoDB Atlas, Istio, and Google Kubernetes Engine.
Source Code
The source code for this post may be found on the gke branch of the storefront-kafka-docker GitHub repository.
git clone --branch gke --single-branch --depth 1 --no-tags \
  https://github.com/garystafford/storefront-kafka-docker.git
Source code samples in this post are displayed as GitHub Gists, which may not display correctly on all mobile and social media browsers, such as LinkedIn.
This project contains all the code to deploy and configure the GKE cluster and Kubernetes resources.
To follow along, you will need to register your own domain, arrange for an Auth0, or alternative, authentication and authorization service, and obtain an SSL/TLS certificate.
SSL/TLS Wildcard Certificate
In the recent post, Securing Your Istio Ingress Gateway with HTTPS, we examined how to create and apply an SSL/TLS certificate to our GKE cluster, to secure communications. Although we are only creating a non-prod cluster, it is more and more common to use SSL/TLS everywhere, especially in the Cloud. For this post, I have registered a single wildcard certificate, *.api.storefront-demo.com. This certificate will cover the three second-level subdomains associated with the virtual clusters: dev.api.storefront-demo.com, test.api.storefront-demo.com, and uat.api.storefront-demo.com. Setting the environment name, such as dev.*, as the second-level subdomain of my storefront-demo domain, following the first-level api.* subdomain, makes the use of a wildcard certificate much easier.
As shown below, my wildcard certificate contains the Subject Name and Subject Alternative Name (SAN) of *.api.storefront-demo.com. For Production, api.storefront-demo.com, I prefer to use a separate certificate.
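One way to confirm the Subject and SAN of a certificate locally is with OpenSSL (a sketch; certificate.crt is a placeholder for your certificate file, and the -ext flag requires OpenSSL 1.1.1 or newer):

openssl x509 -in certificate.crt -noout -subject -ext subjectAltName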
Create GKE Cluster
With your certificate in hand, create the non-prod Kubernetes cluster. Below, the script creates a minimally-sized, three-node, multi-zone GKE cluster, running on GCP, with Kubernetes Engine cluster version 1.11.5-gke.5 and Istio on GKE version 1.0.3-gke.0. I have enabled the master authorized networks option to secure my GKE cluster master endpoint. For the demo, you can add your own IP address CIDR on line 9 (i.e., 1.2.3.4/32), or remove lines 30–31 to remove the restriction (gist).
- Lines 16–39: Create a 3-node, multi-zone GKE cluster with Istio;
- Line 48: Creates three non-prod Namespaces: dev, test, and uat;
- Lines 51–53: Enable Istio automatic sidecar injection within each Namespace;
#!/bin/bash
#
# author: Gary A. Stafford
# site: https://programmaticponderings.com
# license: MIT License
# purpose: Create non-prod Kubernetes cluster on GKE

# Constants - CHANGE ME!
readonly PROJECT='gke-confluent-atlas'
readonly CLUSTER='storefront-api-non-prod'
readonly REGION='us-central1'
readonly MASTER_AUTH_NETS='<your_ip_cidr>'
readonly NAMESPACES=( 'dev' 'test' 'uat' )

# Build a 3-node, single-region, multi-zone GKE cluster
time gcloud beta container \
  --project $PROJECT clusters create $CLUSTER \
  --region $REGION \
  --no-enable-basic-auth \
  --no-issue-client-certificate \
  --cluster-version "1.11.5-gke.5" \
  --machine-type "n1-standard-2" \
  --image-type "COS" \
  --disk-type "pd-standard" \
  --disk-size "100" \
  --scopes "https://www.googleapis.com/auth/devstorage.read_only","https://www.googleapis.com/auth/logging.write","https://www.googleapis.com/auth/monitoring","https://www.googleapis.com/auth/servicecontrol","https://www.googleapis.com/auth/service.management.readonly","https://www.googleapis.com/auth/trace.append" \
  --num-nodes "1" \
  --enable-stackdriver-kubernetes \
  --enable-ip-alias \
  --enable-master-authorized-networks \
  --master-authorized-networks $MASTER_AUTH_NETS \
  --network "projects/${PROJECT}/global/networks/default" \
  --subnetwork "projects/${PROJECT}/regions/${REGION}/subnetworks/default" \
  --default-max-pods-per-node "110" \
  --addons HorizontalPodAutoscaling,HttpLoadBalancing,Istio \
  --istio-config auth=MTLS_STRICT \
  --metadata disable-legacy-endpoints=true \
  --enable-autoupgrade \
  --enable-autorepair

# Get cluster creds
gcloud container clusters get-credentials $CLUSTER \
  --region $REGION --project $PROJECT
kubectl config current-context

# Create Namespaces
kubectl apply -f ./resources/other/namespaces.yaml

# Enable automatic Istio sidecar injection
for namespace in ${NAMESPACES[@]}; do
  kubectl label namespace $namespace istio-injection=enabled
done
If successful, the results should look similar to the output, below.
The cluster will contain a pool of three minimally-sized VMs, the Kubernetes nodes.
Deploying Resources
The Istio Gateway and three VirtualService resources are the primary resources responsible for routing the traffic from the ingress router to the Services, within the multiple Namespaces. Both of these resource types are new to Istio 1.0 (gist).
- Lines 9–16: Port config that only accepts HTTPS traffic on port 443 using TLS;
- Lines 18–20: The three subdomains being routed to the non-prod GKE cluster;
- Lines 28, 63, 98: The three subdomains being routed to the non-prod GKE cluster;
- Lines 39, 47, 65, 74, 82, 90, 109, 117, 125: Routing to FQDN of Storefront API Services within the three Namespaces;
apiVersion: networking.istio.io/v1alpha3
kind: Gateway
metadata:
  name: storefront-gateway
spec:
  selector:
    istio: ingressgateway
  servers:
    - port:
        number: 443
        name: https
        protocol: HTTPS
      tls:
        mode: SIMPLE
        serverCertificate: /etc/istio/ingressgateway-certs/tls.crt
        privateKey: /etc/istio/ingressgateway-certs/tls.key
      hosts:
        - dev.api.storefront-demo.com
        - test.api.storefront-demo.com
        - uat.api.storefront-demo.com
---
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: storefront-dev
spec:
  hosts:
    - dev.api.storefront-demo.com
  gateways:
    - storefront-gateway
  http:
    - match:
        - uri:
            prefix: /accounts
      route:
        - destination:
            port:
              number: 8080
            host: accounts.dev.svc.cluster.local
    - match:
        - uri:
            prefix: /fulfillment
      route:
        - destination:
            port:
              number: 8080
            host: fulfillment.dev.svc.cluster.local
    - match:
        - uri:
            prefix: /orders
      route:
        - destination:
            port:
              number: 8080
            host: orders.dev.svc.cluster.local
---
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: storefront-test
spec:
  hosts:
    - test.api.storefront-demo.com
  gateways:
    - storefront-gateway
  http:
    - match:
        - uri:
            prefix: /accounts
      route:
        - destination:
            port:
              number: 8080
            host: accounts.test.svc.cluster.local
    - match:
        - uri:
            prefix: /fulfillment
      route:
        - destination:
            port:
              number: 8080
            host: fulfillment.test.svc.cluster.local
    - match:
        - uri:
            prefix: /orders
      route:
        - destination:
            port:
              number: 8080
            host: orders.test.svc.cluster.local
---
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: storefront-uat
spec:
  hosts:
    - uat.api.storefront-demo.com
  gateways:
    - storefront-gateway
  http:
    - match:
        - uri:
            prefix: /accounts
      route:
        - destination:
            port:
              number: 8080
            host: accounts.uat.svc.cluster.local
    - match:
        - uri:
            prefix: /fulfillment
      route:
        - destination:
            port:
              number: 8080
            host: fulfillment.uat.svc.cluster.local
    - match:
        - uri:
            prefix: /orders
      route:
        - destination:
            port:
              number: 8080
            host: orders.uat.svc.cluster.local
Next, deploy the Istio and Kubernetes resources to the new GKE cluster. For the sake of brevity, we will deploy the same number of instances and the same version of each of the three Storefront API services (Accounts, Orders, Fulfillment) to each of the three non-prod environments (dev, test, uat). In reality, you would have varying numbers of instances of each service, and each environment would contain progressive versions of each service, as part of the SDLC of each microservice (gist).
- Lines 13–14: Deploy the SSL/TLS certificate and the private key;
- Line 17: Deploy the Istio Gateway and three VirtualService resources;
- Lines 20–22: Deploy the Istio Authentication Policy resources to each Namespace;
- Lines 26–37: Deploy the same set of resources to the dev, test, and uat Namespaces;
#!/bin/bash
#
# author: Gary A. Stafford
# site: https://programmaticponderings.com
# license: MIT License
# purpose: Deploy Kubernetes/Istio resources

# Constants - CHANGE ME!
readonly CERT_PATH=~/Documents/Articles/gke-kafka/sslforfree_non_prod
readonly NAMESPACES=( 'dev' 'test' 'uat' )

# Kubernetes Secret to hold the server's certificate and private key
kubectl create -n istio-system secret tls istio-ingressgateway-certs \
  --key $CERT_PATH/private.key --cert $CERT_PATH/certificate.crt

# Istio Gateway and three VirtualService resources
kubectl apply -f ./resources/other/istio-gateway.yaml

# End-user auth applied per environment
kubectl apply -f ./resources/other/auth-policy-dev.yaml
kubectl apply -f ./resources/other/auth-policy-test.yaml
kubectl apply -f ./resources/other/auth-policy-uat.yaml

# Loop through each non-prod Namespace (environment)
# Re-use same resources (incl. credentials) for all environments, just for the demo
for namespace in ${NAMESPACES[@]}; do
  kubectl apply -n $namespace -f ./resources/config/confluent-cloud-kafka-configmap.yaml
  kubectl apply -n $namespace -f ./resources/config/mongodb-atlas-secret.yaml
  kubectl apply -n $namespace -f ./resources/config/confluent-cloud-kafka-secret.yaml
  kubectl apply -n $namespace -f ./resources/other/mongodb-atlas-external-mesh.yaml
  kubectl apply -n $namespace -f ./resources/other/confluent-cloud-external-mesh.yaml
  kubectl apply -n $namespace -f ./resources/services/accounts.yaml
  kubectl apply -n $namespace -f ./resources/services/fulfillment.yaml
  kubectl apply -n $namespace -f ./resources/services/orders.yaml
done
The deployed Storefront API Services should look as follows.
Google Cloud DNS
Next, we need to enable DNS access to the GKE cluster using Google Cloud DNS. According to Google, Cloud DNS is a scalable, reliable and managed authoritative Domain Name System (DNS) service running on the same infrastructure as Google. It has low latency, high availability, and is a cost-effective way to make your applications and services available to your users.
Whenever a new GKE cluster is created, a new Network Load Balancer is also created. By default, the load balancer’s front-end is an external IP address.
Using a forwarding rule, traffic directed at the external IP address is redirected to the load balancer’s back-end. The load balancer’s back-end is comprised of three VM instances, which are the three Kubernetes nodes in the GKE cluster.
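You can view the forwarding rule and its front-end IP address with the gcloud CLI tool (a sketch, using this post’s region):

gcloud compute forwarding-rules list --filter "region:(us-central1)"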
If you are following along with this post’s demonstration, we will assume you have a domain registered and configured with Google Cloud DNS. I am using the storefront-demo.com domain, which I have used in the last three posts to demonstrate Istio and GKE.
Google Cloud DNS has a fully functional web console, part of the Google Cloud Console. However, using the Cloud DNS web console is impractical in a DevOps CI/CD workflow, where Kubernetes clusters, Namespaces, and Workloads are ephemeral. Therefore, we will use the following script. Within the script, we reset the IP address associated with the A records for each of the non-prod subdomains associated with the storefront-demo.com domain (gist).
- Lines 23–25: Find the previous load balancer’s front-end IP address;
- Lines 27–29: Find the new load balancer’s front-end IP address;
- Line 35: Start the Cloud DNS transaction;
- Lines 37–47: Add the DNS record changes to the transaction;
- Line 49: Execute the Cloud DNS transaction;
#!/bin/bash
#
# author: Gary A. Stafford
# site: https://programmaticponderings.com
# license: MIT License
# purpose: Update Cloud DNS A Records

# Constants - CHANGE ME!
readonly PROJECT='gke-confluent-atlas'
readonly DOMAIN='storefront-demo.com'
readonly ZONE='storefront-demo-com-zone'
readonly REGION='us-central1'
readonly TTL=300
readonly RECORDS=('dev' 'test' 'uat')

# Make sure any old load balancers were removed
if [ $(gcloud compute forwarding-rules list --filter "region:($REGION)" | wc -l | awk '{$1=$1};1') -gt 2 ]; then
  echo "More than one load balancer detected, exiting script."
  exit 1
fi

# Get load balancer IP address from first record
readonly OLD_IP=$(gcloud dns record-sets list \
  --filter "name=${RECORDS[0]}.api.${DOMAIN}." --zone $ZONE \
  | awk 'NR==2 {print $4}')

readonly NEW_IP=$(gcloud compute forwarding-rules list \
  --filter "region:($REGION)" \
  | awk 'NR==2 {print $3}')

echo "Old LB IP Address: ${OLD_IP}"
echo "New LB IP Address: ${NEW_IP}"

# Update DNS records
gcloud dns record-sets transaction start --zone $ZONE

for record in ${RECORDS[@]}; do
  echo "${record}.api.${DOMAIN}."
  gcloud dns record-sets transaction remove \
    --name "${record}.api.${DOMAIN}." --ttl $TTL \
    --type A --zone $ZONE "${OLD_IP}"
  gcloud dns record-sets transaction add \
    --name "${record}.api.${DOMAIN}." --ttl $TTL \
    --type A --zone $ZONE "${NEW_IP}"
done

gcloud dns record-sets transaction execute --zone $ZONE
The outcome of the script is shown below. Note how changes are executed as part of a transaction, by automatically creating a transaction.yaml file. The file contains the six DNS changes, three additions and three deletions. The command executes the transaction and then deletes the transaction.yaml file.
> sh ./part3_set_cloud_dns.sh
Old LB IP Address: 35.193.208.115
New LB IP Address: 35.238.196.231
Transaction started [transaction.yaml].
dev.api.storefront-demo.com.
Record removal appended to transaction at [transaction.yaml].
Record addition appended to transaction at [transaction.yaml].
test.api.storefront-demo.com.
Record removal appended to transaction at [transaction.yaml].
Record addition appended to transaction at [transaction.yaml].
uat.api.storefront-demo.com.
Record removal appended to transaction at [transaction.yaml].
Record addition appended to transaction at [transaction.yaml].
Executed transaction [transaction.yaml] for managed-zone [storefront-demo-com-zone].
Created [https://www.googleapis.com/dns/v1/projects/gke-confluent-atlas/managedZones/storefront-demo-com-zone/changes/53].

ID  START_TIME                STATUS
55  2019-01-16T04:54:14.984Z  pending
Based on my own domain and cluster details, the transaction.yaml file looks as follows. Again, note the three A record additions, followed by three deletions, plus the SOA record update (gist).
---
additions:
- kind: dns#resourceRecordSet
  name: storefront-demo.com.
  rrdatas:
  - ns-cloud-a1.googledomains.com. cloud-dns-hostmaster.google.com. 25 21600 3600 259200 300
  ttl: 21600
  type: SOA
- kind: dns#resourceRecordSet
  name: dev.api.storefront-demo.com.
  rrdatas:
  - 35.238.196.231
  ttl: 300
  type: A
- kind: dns#resourceRecordSet
  name: test.api.storefront-demo.com.
  rrdatas:
  - 35.238.196.231
  ttl: 300
  type: A
- kind: dns#resourceRecordSet
  name: uat.api.storefront-demo.com.
  rrdatas:
  - 35.238.196.231
  ttl: 300
  type: A
deletions:
- kind: dns#resourceRecordSet
  name: storefront-demo.com.
  rrdatas:
  - ns-cloud-a1.googledomains.com. cloud-dns-hostmaster.google.com. 24 21600 3600 259200 300
  ttl: 21600
  type: SOA
- kind: dns#resourceRecordSet
  name: dev.api.storefront-demo.com.
  rrdatas:
  - 35.193.208.115
  ttl: 300
  type: A
- kind: dns#resourceRecordSet
  name: test.api.storefront-demo.com.
  rrdatas:
  - 35.193.208.115
  ttl: 300
  type: A
- kind: dns#resourceRecordSet
  name: uat.api.storefront-demo.com.
  rrdatas:
  - 35.193.208.115
  ttl: 300
  type: A
Confirm DNS Changes
Use the dig command to confirm the DNS records are now correct and that DNS propagation has occurred. The IP address returned by dig should be the external IP address assigned to the front-end of the Google Cloud Load Balancer.
> dig dev.api.storefront-demo.com +short
35.238.196.231
Or, query all three records at once.
> printf "dev.api.storefront-demo.com\ntest.api.storefront-demo.com\nuat.api.storefront-demo.com\n" > records.txt

> dig -f records.txt +short
35.238.196.231
35.238.196.231
35.238.196.231
Optionally, get more verbose output by removing the +short option.
> dig +nocmd dev.api.storefront-demo.com

;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 30763
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 512
;; QUESTION SECTION:
;dev.api.storefront-demo.com.  IN  A

;; ANSWER SECTION:
dev.api.storefront-demo.com. 299 IN A 35.238.196.231

;; Query time: 27 msec
;; SERVER: 8.8.8.8#53(8.8.8.8)
;; WHEN: Wed Jan 16 18:00:49 EST 2019
;; MSG SIZE  rcvd: 72
The resulting records can also be reviewed in the Google Cloud DNS management console.
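Alternatively, as a quick sketch, the records can be reviewed from the command line, using the managed zone name from the earlier script:

# List the A records in the managed zone
gcloud dns record-sets list --zone storefront-demo-com-zone --filter "type=A"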
JWT-based Authentication
As discussed in the previous post, Istio End-User Authentication for Kubernetes using JSON Web Tokens (JWT) and Auth0, it is typical to restrict access to the Kubernetes cluster, Namespaces within the cluster, or Services running within Namespaces, to specific end-users, whether they are humans or other applications. In that previous post, we saw an example of applying a machine-to-machine (M2M) Istio Authentication Policy to only the uat Namespace. This scenario is common when you want to limit access to resources in non-production environments, such as UAT, to outside test teams accessing the uat Namespace through an external application. To simulate this scenario, we will apply the following Istio Authentication Policy to the uat Namespace (gist).
apiVersion: authentication.istio.io/v1alpha1
kind: Policy
metadata:
  name: default
  namespace: uat
spec:
  peers:
  - mtls: {}
  origins:
  - jwt:
      audiences:
      - "storefront-api-uat"
      issuer: "https://storefront-demo.auth0.com/"
      jwksUri: "https://storefront-demo.auth0.com/.well-known/jwks.json"
  principalBinding: USE_ORIGIN
For the dev and test Namespaces, we will apply an additional, different Istio Authentication Policy. This policy will protect against the possibility of dev and test M2M API consumers interfering with uat M2M API consumers, and vice versa. Below is the dev version of the Policy; the test version is identical, except for the Namespace (gist).
apiVersion: authentication.istio.io/v1alpha1
kind: Policy
metadata:
  name: default
  namespace: dev
spec:
  peers:
  - mtls: {}
  origins:
  - jwt:
      audiences:
      - "storefront-api-dev-test"
      issuer: "https://storefront-demo.auth0.com/"
      jwksUri: "https://storefront-demo.auth0.com/.well-known/jwks.json"
  principalBinding: USE_ORIGIN
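As a brief sketch, assuming each Policy shown above is saved to a local YAML file (the filenames below are hypothetical), the Policies could be applied and confirmed using kubectl.

# Apply the Istio Authentication Policies to their respective Namespaces
# (filenames are hypothetical)
kubectl apply -f auth-policy-uat.yaml
kubectl apply -f auth-policy-dev.yaml
kubectl apply -f auth-policy-test.yaml

# Confirm a 'default' Policy exists in each Namespace
kubectl get policies.authentication.istio.io --all-namespaces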
Testing Authentication
Using Postman, with the ‘Bearer Token’ authentication type, as detailed in the previous post, a call to a Storefront API resource in the uat Namespace should succeed. This also confirms DNS and HTTPS are working properly.
The dev and test Namespaces require different authentication. Trying to use no authentication, or authenticating as a UAT API consumer, will result in a 401 Unauthorized HTTP status, along with an ‘Origin authentication failed.’ error message.
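The same checks can be made outside of Postman. Below is a minimal sketch using curl; the UAT_ACCESS_TOKEN variable and the resource path are assumptions for illustration.

# Call a Storefront API resource in the uat Namespace with a JWT;
# UAT_ACCESS_TOKEN and the resource path are illustrative
curl -s -o /dev/null -w "%{http_code}\n" \
  -H "Authorization: Bearer ${UAT_ACCESS_TOKEN}" \
  https://uat.api.storefront-demo.com/accounts/actuator/health

# Expect 200 with a valid uat-audience token; the same call with no
# token, or with a dev/test-audience token, should return 401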
Conclusion
In this brief post, we demonstrated how to create a GKE cluster with Istio 1.0.x, containing three virtual clusters, or Namespaces. Each Namespace represents an environment within the application’s SDLC. We enforced HTTP over TLS (HTTPS) using a wildcard SSL/TLS certificate. We also enforced end-user authentication using JWT-based OAuth 2.0 with Auth0. Lastly, we provided user-friendly DNS routing to each environment, using Google Cloud DNS. Short of adding a fully managed API Gateway, like Apigee, and automating the execution of the scripts with Jenkins or Spinnaker, this cluster provides a functional path to Production for developing our Storefront API.
All opinions expressed in this post are my own and not necessarily the views of my current or past employers or their clients.
Istio End-User Authentication for Kubernetes using JSON Web Tokens (JWT) and Auth0
Posted by Gary A. Stafford in Bash Scripting, Cloud, DevOps, Enterprise Software Development, GCP, Kubernetes, Software Development on January 6, 2019
In the recent post, Building a Microservices Platform with Confluent Cloud, MongoDB Atlas, Istio, and Google Kubernetes Engine, we built and deployed a microservice-based, cloud-native API to Google Kubernetes Engine, with Istio 1.0.x, on Google Cloud Platform. For brevity, we intentionally omitted a few key features required to operationalize and secure the API. These missing features included HTTPS, user authentication, request quotas, request throttling, and the integration of a full lifecycle API management tool, like Google Apigee.
In a follow-up post, Securing Your Istio Ingress Gateway with HTTPS, we disabled HTTP access to the API running on the GKE cluster. We then enabled bidirectional encryption of communications between a client and GKE cluster with HTTPS.
In this post, we will further enhance the security of the Storefront Demo API by enabling Istio end-user authentication using JSON Web Token-based credentials. Using JSON Web Tokens (JWT), pronounced ‘jot’, will allow Istio to authenticate end-users calling the Storefront Demo API. We will use Auth0, an Authentication-as-a-Service provider, to generate JWTs for registered Storefront Demo API consumers, which Istio will then validate against Auth0’s public key set, as part of an OAuth 2.0 token-based authorization flow.
JSON Web Tokens
Token-based authentication, according to Auth0, works by ensuring that each request to a server is accompanied by a signed token which the server verifies for authenticity and only then responds to the request. JWT, according to JWT.io, is an open standard (RFC 7519) that defines a compact and self-contained way for securely transmitting information between parties as a JSON object. This information can be verified and trusted because it is digitally signed. Other common token types include Simple Web Tokens (SWT) and Security Assertion Markup Language Tokens (SAML).
JWTs can be signed using a secret with the Hash-based Message Authentication Code (HMAC) algorithm, or a public/private key pair using Rivest–Shamir–Adleman (RSA) or Elliptic Curve Digital Signature Algorithm (ECDSA). Authorization is the most common scenario for using JWT. Within the token payload, you can easily specify user roles and permissions as well as resources that the user can access.
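As a rough illustration of the token’s structure, a JWT consists of three Base64url-encoded segments, header.payload.signature, separated by periods. A minimal sketch of decoding a token’s payload on the command line follows; the token value and the claims shown are hypothetical.

# Decode the payload (middle segment) of a JWT;
# TOKEN is a placeholder for a real JWT
TOKEN='<header>.<payload>.<signature>'
# note: some tokens may require Base64 padding ('=') to be restored
echo "${TOKEN}" | cut -d '.' -f 2 | tr '_-' '/+' | base64 --decode

# Example decoded payload (claims are illustrative):
# {
#   "iss": "https://storefront-demo.auth0.com/",
#   "aud": "storefront-api-uat",
#   "iat": 1547600000,
#   "exp": 1547686400,
#   "gty": "client-credentials"
# }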
A registered API consumer makes an initial request to the Authorization server, in which they exchange some form of credentials for a token. The JWT is associated with a set of specific user roles and permissions. Each subsequent request will include the token, allowing the user to access authorized resources.
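For example, using Auth0’s client credentials grant, the credentials-for-token exchange might look similar to the following request; the client ID and secret values are placeholders.

# Exchange client credentials for a JWT access token from Auth0
# (client_id and client_secret values are placeholders)
curl -s --request POST \
  --url 'https://storefront-demo.auth0.com/oauth/token' \
  --header 'content-type: application/json' \
  --data '{
    "client_id": "YOUR_CLIENT_ID",
    "client_secret": "YOUR_CLIENT_SECRET",
    "audience": "storefront-api-uat",
    "grant_type": "client_credentials"
  }'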