Posts Tagged k8s

Observing gRPC-based Microservices on Amazon EKS running Istio

Observing a gRPC-based Kubernetes application using Jaeger, Zipkin, Prometheus, Grafana, and Kiali on Amazon EKS running Istio service mesh

Introduction

In the previous two-part post, Kubernetes-based Microservice Observability with Istio Service Mesh, we explored a set of popular open source observability tools easily integrated with the Istio service mesh. Tools included Jaeger and Zipkin for distributed transaction monitoring, Prometheus for metrics collection and alerting, Grafana for metrics querying, visualization, and alerting, and Kiali for overall observability and management of Istio. We rounded out the toolset with the addition of Fluent Bit for log processing and aggregation to Amazon CloudWatch Container Insights. We used these tools to observe a distributed, microservices-based, RESTful application deployed to an Amazon Elastic Kubernetes Service (Amazon EKS) cluster. The application platform, running on EKS, used Amazon DocumentDB as a persistent data store and Amazon MQ to exchange messages.

In this post, we will examine those same observability tools to monitor an alternate set of Go-based microservices that use Protocol Buffers (aka Protobuf) over gRPC (gRPC Remote Procedure Calls) and HTTP/2 for client-server communications as opposed to the more common RESTful JSON over HTTP. We will learn how Kubernetes, Istio, and the observability tools work seamlessly with gRPC, just as they do with JSON over HTTP on Amazon EKS.

Kiali Management Console showing gRPC-based reference application platform

Technologies

gRPC

According to the gRPC project, gRPC is a modern open source high-performance Remote Procedure Call (RPC) framework that can run in any environment. It can efficiently connect services in and across data centers with pluggable support for load balancing, tracing, health checking, and authentication. gRPC is also applicable in the last mile of distributed computing to connect devices, mobile applications, and browsers to backend services.

gRPC was initially created by Google, which has used a single general-purpose RPC infrastructure called Stubby to connect the large number of microservices running within and across its data centers for over a decade. In March 2015, Google decided to build the next version of Stubby and make it open source. gRPC is now used in many organizations outside of Google, including Square, Netflix, CoreOS, Docker, CockroachDB, Cisco, and Juniper Networks. gRPC currently supports over ten languages, including C#, C++, Dart, Go, Java, Kotlin, Node, Objective-C, PHP, Python, and Ruby.

According to widely-cited 2019 tests published by Ruwan Fernando, “gRPC is roughly 7 times faster than REST when receiving data & roughly 10 times faster than REST when sending data for this specific payload. This is mainly due to the tight packing of the Protocol Buffers and the use of HTTP/2 by gRPC.”

Protocol Buffers

With gRPC, you define your service using Protocol Buffers (aka Protobuf), a powerful binary serialization toolset and language. According to Google, Protocol buffers are Google’s language-neutral, platform-neutral, extensible mechanism for serializing structured data — think XML, but smaller, faster, and simpler. Google’s previous documentation claimed protocol buffers were “3 to 10 times smaller and 20 to 100 times faster than XML.

Once you have defined your messages, you run the protocol buffer compiler for your application’s language on your .proto file to generate data access classes. With the proto3 language version, protocol buffers currently support generated code in Java, Python, Objective-C, C++, Dart, Go, Ruby, and C#, with more languages to come. For this post, we have compiled our protobufs for Go. You can read more about the binary wire format of Protobuf on Google’s Developers Portal.

Reference Application Platform

To demonstrate the use of the observability tools, we will deploy a reference application platform to Amazon EKS on AWS. The application platform was developed to demonstrate different Kubernetes platforms, such as EKS, GKE, AKS, and concepts such as service meshes, API management, observability, CI/CD, DevOps, and Chaos Engineering. The platform comprises a backend of eight Go-based microservices labeled generically as Service A — Service H, one Angular 12 TypeScript-based frontend UI, one Go-based gRPC Gateway reverse proxy, four MongoDB databases, and one RabbitMQ message queue.

Reference Application Platform’s Angular-based UI

The reference application platform is designed to generate gRPC-based, synchronous service-to-service IPC (inter-process communication), asynchronous TCP-based service-to-queue-to-service communications, and TCP-based service-to-database communications. For example, Service A calls Service B and Service C; Service B calls Service D and Service E; Service D produces a message to a RabbitMQ queue, which Service F consumes and writes to MongoDB, and so on. The platform’s distributed service communications can be observed using the observability tools when the application is deployed to a Kubernetes cluster running the Istio service mesh.

High-level architecture of the gRPC-based Reference Application Platform

Converting to gRPC and Protocol Buffers

For this post, the eight Go microservices have been modified to use gRPC with protocol buffers over HTTP/2 instead of JSON over HTTP. Specifically, the services use version 3 (aka proto3) of protocol buffers. With gRPC, a gRPC client calls a gRPC server. Some of the platform’s services are gRPC servers, others are gRPC clients, while some act as both client and server.

gRPC Gateway

In the revised platform architecture diagram above, note the addition of the gRPC Gateway reverse proxy that replaces Service A at the edge of the API. The proxy, which translates a RESTful HTTP API into gRPC, sits between the Angular-based Web UI and Service A. Assuming for the sake of this demonstration that most consumers of an API require a RESTful JSON over HTTP API, we have added a gRPC Gateway reverse proxy to the platform. The gRPC Gateway proxies communications between the JSON over HTTP-based clients and the gRPC-based microservices. The gRPC Gateway helps to provide APIs with both gRPC and RESTful styles at the same time.

A diagram from the grpc-gateway GitHub project site demonstrates how the reverse proxy works.

Diagram courtesy: https://github.com/grpc-ecosystem/grpc-gateway

Alternatives to gRPC Gateway

As an alternative to the gRPC Gateway reverse proxy, we could convert the TypeScript-based Angular UI client to communicate via gRPC and protobufs and communicate directly with Service A. One option to achieve this is gRPC Web, a JavaScript implementation of gRPC for browser clients. gRPC Web clients connect to gRPC services via a special proxy, which by default is Envoy. The project’s roadmap includes plans for gRPC Web to be supported in language-specific web frameworks for languages such as Python, Java, and Node.

Demonstration

To follow along with this post’s demonstration, review the installation instructions detailed in part one of the previous post, Kubernetes-based Microservice Observability with Istio Service Mesh, to deploy and configure the Amazon EKS cluster, Istio, Amazon MQ, and DocumentDB. To expedite the deployment of the revised gRPC-based platform to the dev namespace, I have included a Helm chart, ref-app-grpc, in the project. Using the chart, you can ignore any instructions in the previous post that refer to deploying resources to the dev namespace. See the chart’s README file for further instructions.

Deployed gRPC-based Reference Application Platform as seen from Argo CD

Source Code

The gRPC-based microservices source code, Kubernetes resources, and Helm chart are located in the k8s-istio-observe-backend project repository in the 2021-istio branch. This project repository is the only source code you will need for this demonstration.

git clone --branch 2021-istio --single-branch \
https://github.com/garystafford/k8s-istio-observe-backend.git

Optionally, the Angular-based web client source code is located in the k8s-istio-observe-frontend repository on the new 2021-grpc branch. The source protobuf .proto file and the Buf-compiled protobuf files are located in the pb-greeting and protobuf project repositories. You do not need to clone any of these projects for this post’s demonstration.

All Docker images for the services, UI, and the reverse proxy are pulled from Docker Hub.

All images for this post are located on Docker Hub

Code Changes

Although this post is not specifically about writing Go for gRPC and protobuf, to better understand the observability requirements and capabilities of these technologies compared to the previous JSON over HTTP-based services, it is helpful to review the code changes.

Microservices

First, compare the revised source code for Service A, shown below to the original code in the previous post. The service’s code is almost completely rewritten. For example, note the following code changes to Service A, which are synonymous with the other backend services:

  • Import of the v3 greeting protobuf package;
  • Local Greeting struct replaced with pb.Greeting struct;
  • All services are now hosted on port 50051;
  • The HTTP server and all API resource handler functions are removed;
  • Headers used for distributed tracing have moved from HTTP request object to metadata passed in a gRPC Context type;
  • Service A is both a gRPC client and a server, which is called by the gRPC Gateway reverse proxy;
  • The primary GreetingHandler function is replaced by the protobuf package’s Greeting function;
  • gRPC clients, such as Service A, call gRPC servers using the CallGrpcService function;
  • CORS handling is offloaded from the services to Istio;
  • Logging methods are largely unchanged;

Source code for revised gRPC-based Service A:

// author: Gary A. Stafford
// site: https://programmaticponderings.com
// license: MIT License
// purpose: Service A – gRPC/Protobuf
package main
import (
"context"
lrf "github.com/banzaicloud/logrus-runtime-formatter"
"github.com/google/uuid"
"github.com/sirupsen/logrus"
"google.golang.org/grpc"
"google.golang.org/grpc/metadata"
"net"
"os"
"time"
pb "github.com/garystafford/protobuf/greeting/v3"
)
var (
logLevel = getEnv("LOG_LEVEL", "info")
port = getEnv("PORT", ":50051")
serviceName = getEnv("SERVICE_NAME", "Service A")
message = getEnv("GREETING", "Hello, from Service A!")
URLServiceB = getEnv("SERVICE_B_URL", "service-b:50051")
URLServiceC = getEnv("SERVICE_C_URL", "service-c:50051")
greetings []*pb.Greeting
log = logrus.New()
)
type greetingServiceServer struct {
pb.UnimplementedGreetingServiceServer
}
func (s *greetingServiceServer) Greeting(ctx context.Context, _ *pb.GreetingRequest) (*pb.GreetingResponse, error) {
greetings = nil
requestGreeting := pb.Greeting{
Id: uuid.New().String(),
Service: serviceName,
Message: message,
Created: time.Now().Local().String(),
Hostname: getHostname(),
}
greetings = append(greetings, &requestGreeting)
callGrpcService(ctx, &requestGreeting, URLServiceB)
callGrpcService(ctx, &requestGreeting, URLServiceC)
return &pb.GreetingResponse{
Greeting: greetings,
}, nil
}
func callGrpcService(ctx context.Context, requestGreeting *pb.Greeting, address string) {
conn, err := createGRPCConn(ctx, address)
if err != nil {
log.Fatal(err)
}
defer func(conn *grpc.ClientConn) {
err := conn.Close()
if err != nil {
log.Error(err)
}
}(conn)
headersIn, _ := metadata.FromIncomingContext(ctx)
log.Debugf("headersIn: %s", headersIn)
client := pb.NewGreetingServiceClient(conn)
ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
ctx = metadata.NewOutgoingContext(context.Background(), headersIn)
headersOut, _ := metadata.FromOutgoingContext(ctx)
log.Debugf("headersOut: %s", headersOut)
defer cancel()
responseGreetings, err := client.Greeting(ctx, &pb.GreetingRequest{Greeting: requestGreeting})
if err != nil {
log.Fatal(err)
}
log.Info(responseGreetings.GetGreeting())
for _, responseGreeting := range responseGreetings.GetGreeting() {
greetings = append(greetings, responseGreeting)
}
}
func createGRPCConn(ctx context.Context, addr string) (*grpc.ClientConn, error) {
var opts []grpc.DialOption
opts = append(opts,
grpc.WithInsecure(),
grpc.WithBlock())
conn, err := grpc.DialContext(ctx, addr, opts)
if err != nil {
log.Fatal(err)
return nil, err
}
return conn, nil
}
func getHostname() string {
hostname, err := os.Hostname()
if err != nil {
log.Error(err)
}
return hostname
}
func getEnv(key, fallback string) string {
if value, ok := os.LookupEnv(key); ok {
return value
}
return fallback
}
func run() error {
lis, err := net.Listen("tcp", port)
if err != nil {
log.Fatal(err)
}
grpcServer := grpc.NewServer()
pb.RegisterGreetingServiceServer(grpcServer, &greetingServiceServer{})
return grpcServer.Serve(lis)
}
func init() {
childFormatter := logrus.JSONFormatter{}
runtimeFormatter := &lrf.Formatter{ChildFormatter: &childFormatter}
runtimeFormatter.Line = true
log.Formatter = runtimeFormatter
log.Out = os.Stdout
level, err := logrus.ParseLevel(logLevel)
if err != nil {
log.Error(err)
}
log.Level = level
}
func main() {
if err := run(); err != nil {
log.Fatal(err)
os.Exit(1)
}
}
view raw main.go hosted with ❤ by GitHub

Greeting Protocol Buffers

Shown below is the greeting v3 protocol buffers .proto file. The fields within the Greeting, originally defined in the RESTful JSON-based services as a struct, remains largely unchanged, however, we now have a message— an aggregate containing a set of typed fields. The GreetingRequest is composed of a single Greeting message, while the GreetingResponse message is composed of multiple (repeated) Greeting messages. Services pass a Greeting message in their request and receive an array of one or more messages in response.

syntax = "proto3";
package greeting.v3;
import "google/api/annotations.proto";
option go_package = "github.com/garystafford/pb-greeting/gen/go/greeting/v3";
message Greeting {
string id = 1;
string service = 2;
string message = 3;
string created = 4;
string hostname = 5;
}
message GreetingRequest {
Greeting greeting = 1;
}
message GreetingResponse {
repeated Greeting greeting = 1;
}
service GreetingService {
rpc Greeting (GreetingRequest) returns (GreetingResponse) {
option (google.api.http) = {
get: "/api/greeting"
};
}
}
view raw greeting.proto hosted with ❤ by GitHub

The protobuf is compiled with Buf, the popular Go-based protocol compiler tool. Using Buf, four files are generated: Go, Go gRPC, gRPC Gateway, and Swagger (OpenAPI v2).

.
├── greeting.pb.go
├── greeting.pb.gw.go
├── greeting.swagger.json
└── greeting_grpc.pb.go

Buf is configured using two files, buf.yaml:

version: v1beta1
name: buf.build/garystafford/pb-greeting
deps:
- buf.build/beta/googleapis
- buf.build/grpc-ecosystem/grpc-gateway
build:
roots:
- proto
lint:
use:
- DEFAULT
breaking:
use:
- FILE

And, and buf.gen.yaml:

version: v1beta1
plugins:
- name: go
out: ../protobuf
opt:
- paths=source_relative
- name: go-grpc
out: ../protobuf
opt:
- paths=source_relative
- name: grpc-gateway
out: ../protobuf
opt:
- paths=source_relative
- generate_unbound_methods=true
- name: openapiv2
out: ../protobuf
opt:
- logtostderr=true

The compiled protobuf code is included in the protobuf project on GitHub, and the v3 version is imported into each microservice and the reverse proxy. Below is a snippet of the greeting.pb.go compiled Go file.

// Code generated by protoc-gen-go. DO NOT EDIT.
// versions:
// protoc-gen-go v1.27.1
// protoc v3.17.1
// source: greeting/v3/greeting.proto
package v3
import (
_ "google.golang.org/genproto/googleapis/api/annotations"
protoreflect "google.golang.org/protobuf/reflect/protoreflect"
protoimpl "google.golang.org/protobuf/runtime/protoimpl"
reflect "reflect"
sync "sync"
)
const (
// Verify that this generated code is sufficiently up-to-date.
_ = protoimpl.EnforceVersion(20 protoimpl.MinVersion)
// Verify that runtime/protoimpl is sufficiently up-to-date.
_ = protoimpl.EnforceVersion(protoimpl.MaxVersion 20)
)
type Greeting struct {
state protoimpl.MessageState
sizeCache protoimpl.SizeCache
unknownFields protoimpl.UnknownFields
Id string `protobuf:"bytes,1,opt,name=id,proto3" json:"id,omitempty"`
Service string `protobuf:"bytes,2,opt,name=service,proto3" json:"service,omitempty"`
Message string `protobuf:"bytes,3,opt,name=message,proto3" json:"message,omitempty"`
Created string `protobuf:"bytes,4,opt,name=created,proto3" json:"created,omitempty"`
Hostname string `protobuf:"bytes,5,opt,name=hostname,proto3" json:"hostname,omitempty"`
}
func (x *Greeting) Reset() {
*x = Greeting{}
if protoimpl.UnsafeEnabled {
mi := &file_greeting_v3_greeting_proto_msgTypes[0]
ms := protoimpl.X.MessageStateOf(protoimpl.Pointer(x))
ms.StoreMessageInfo(mi)
}
}
func (x *Greeting) String() string {
return protoimpl.X.MessageStringOf(x)
}
view raw greeting.pb.go hosted with ❤ by GitHub

Using Swagger, we can view the greeting protocol buffers’ single RESTful API resource, exposed with an HTTP GET method. You can use the Docker-based version of Swagger UI for viewing protoc generated swagger definitions.

docker run -p 8080:8080 -d --name swagger-ui \
-e SWAGGER_JSON=/tmp/greeting/v3/greeting.swagger.json \
-v ${GOAPTH}/src/protobuf:/tmp swaggerapi/swagger-ui

The Angular UI makes an HTTP GET request to the /api/greeting resource, which is transformed to gRPC and proxied to Service A, where it is handled by the Greeting function.

Swagger UI view of the Greeting protobuf

gRPC Gateway Reverse Proxy

As explained earlier, the gRPC Gateway reverse proxy, which translates the RESTful HTTP API into gRPC, is new. In the code sample below, note the following code features:

  1. Import of the v3 greeting protobuf package;
  2. ServeMux, a request multiplexer, matches http requests to patterns and invokes the corresponding handler;
  3. RegisterGreetingServiceHandlerFromEndpoint registers the http handlers for service GreetingService to mux. The handlers forward requests to the gRPC endpoint;
  4. x-b3 request headers, used for distributed tracing, are collected from the incoming HTTP request and propagated to the upstream services in the gRPC Context type;
// author: Gary A. Stafford
// site: https://programmaticponderings.com
// license: MIT License
// purpose: gRPC Gateway / Reverse Proxy
// reference: https://github.com/grpc-ecosystem/grpc-gateway
package main
import (
"context"
"flag"
lrf "github.com/banzaicloud/logrus-runtime-formatter"
pb "github.com/garystafford/protobuf/greeting/v3"
"github.com/grpc-ecosystem/grpc-gateway/v2/runtime"
"github.com/sirupsen/logrus"
"google.golang.org/grpc"
"google.golang.org/grpc/metadata"
"net/http"
"os"
)
var (
logLevel = getEnv("LOG_LEVEL", "info")
port = getEnv("PORT", ":50051")
URLServiceA = getEnv("SERVICE_A_URL", "service-a:50051")
log = logrus.New()
)
func injectHeadersIntoMetadata(ctx context.Context, req *http.Request) metadata.MD {
//https://aspenmesh.io/2018/04/tracing-grpc-with-istio/
otHeaders := []string{
"x-request-id",
"x-b3-traceid",
"x-b3-spanid",
"x-b3-parentspanid",
"x-b3-sampled",
"x-b3-flags",
"x-ot-span-context"}
var pairs []string
for _, h := range otHeaders {
if v := req.Header.Get(h); len(v) > 0 {
pairs = append(pairs, h, v)
}
}
return metadata.Pairs(pairs)
}
type annotator func(context.Context, *http.Request) metadata.MD
func chainGrpcAnnotators(annotators annotator) annotator {
return func(c context.Context, r *http.Request) metadata.MD {
var mds []metadata.MD
for _, a := range annotators {
mds = append(mds, a(c, r))
}
return metadata.Join(mds)
}
}
func run() error {
ctx := context.Background()
ctx, cancel := context.WithCancel(ctx)
defer cancel()
annotators := []annotator{injectHeadersIntoMetadata}
mux := runtime.NewServeMux(
runtime.WithMetadata(chainGrpcAnnotators(annotators)),
)
opts := []grpc.DialOption{grpc.WithInsecure()}
err := pb.RegisterGreetingServiceHandlerFromEndpoint(ctx, mux, URLServiceA, opts)
if err != nil {
return err
}
return http.ListenAndServe(port, mux)
}
func getEnv(key, fallback string) string {
if value, ok := os.LookupEnv(key); ok {
return value
}
return fallback
}
func init() {
childFormatter := logrus.JSONFormatter{}
runtimeFormatter := &lrf.Formatter{ChildFormatter: &childFormatter}
runtimeFormatter.Line = true
log.Formatter = runtimeFormatter
log.Out = os.Stdout
level, err := logrus.ParseLevel(logLevel)
if err != nil {
log.Error(err)
}
log.Level = level
}
func main() {
flag.Parse()
if err := run(); err != nil {
log.Fatal(err)
}
}
view raw main.go hosted with ❤ by GitHub

Istio VirtualService and CORS

With the RESTful services in the previous post, CORS was handled by Service A. Service A allowed the UI to make cross-origin requests to the backend API’s domain. Since the gRPC Gateway does not directly support Cross-Origin Resource Sharing (CORS) policy, we have offloaded the CORS responsibility to Istio using the reverse proxy’s VirtualService resource’s CorsPolicy configuration. Moving this responsibility makes CORS much easier to manage as YAML-based configuration and part of the Helm chart. See lines 20–28 below.

apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
name: rev-proxy
spec:
hosts:
{{ YOUR_API_HOSTNAME_HERE }}
gateways:
istio-gateway
http:
match:
uri:
prefix: /
route:
destination:
host: rev-proxy.dev.svc.cluster.local
port:
number: 80
weight: 100
corsPolicy:
allowOrigin:
{{ YOUR_UI_HOSTNAME_HERE }}
allowMethods:
OPTIONS
GET
allowCredentials: true
allowHeaders:
"*"

Pillar One: Logs

To paraphrase Jay Kreps on the LinkedIn Engineering Blog, a log is an append-only, totally ordered sequence of records ordered by time. The ordering of records defines a notion of “time” since entries to the left are defined to be older than entries to the right. Logs are a historical record of events that happened in the past. Logs have been around almost as long as computers and are at the heart of many distributed data systems and real-time application architectures.

Go-based Microservice Logging

An effective logging strategy starts with what you log, when you log, and how you log. As part of the platform’s logging strategy, the eight Go-based microservices use Logrus, a popular structured logger for Go, first released in 2014. The platform’s services also implement Banzai Cloud’s logrus-runtime-formatter. These two logging packages give us greater control over what you log, when you log, and how you log information about the services. The recommended configuration of the packages is minimal. Logrus’ JSONFormatter provides for easy parsing by third-party systems and injects additional contextual data fields into the log entries.

func init() {
childFormatter := logrus.JSONFormatter{}
runtimeFormatter := &lrf.Formatter{ChildFormatter: &childFormatter}
runtimeFormatter.Line = true
log.Formatter = runtimeFormatter
log.Out = os.Stdout
level, err := logrus.ParseLevel(logLevel)
if err != nil {
log.Error(err)
}
log.Level = level
}
view raw main.go hosted with ❤ by GitHub

Logrus provides several advantages over Go’s simple logging package, log. For example, log entries are not only for Fatal errors, nor should all verbose log entries be output in a Production environment. Logrus has the capability to log at seven levels: Trace, Debug, Info, Warning, Error, Fatal, and Panic. The log level of the platform’s microservices can be changed at runtime using an environment variable.

Banzai Cloud’s logrus-runtime-formatter automatically tags log messages with runtime and stack information, including function name and line number — extremely helpful when troubleshooting. There is an excellent post on the Banzai Cloud (now part of Cisco) formatter, Golang runtime Logrus Formatter.

Service A log entries as viewed from Amazon CloudWatch Insights

In 2020, Logus entered maintenance mode. The author, Simon Eskildsen (Principal Engineer at Shopify), stated they would not be introducing new features. This does not mean Logrus is dead. With over 18,000 GitHub Stars, Logrus will continue to be maintained for security, bug fixes, and performance. The author states that many fantastic alternatives to Logus now exist, such as Zerolog, Zap, and Apex.

Client-side Angular UI Logging

Likewise, I have enhanced the logging of the Angular UI using NGX Logger. NGX Logger is a simple logging module for angular (currently supports Angular 6+). It allows “pretty print” to the console and allows log messages to be POSTed to a URL for server-side logging. For this demo, the UI will only log to the web browser’s console. Similar to Logrus, NGX Logger supports multiple log levels: Trace, Debug, Info, Warning, Error, Fatal, and Off. However, instead of just outputting messages, NGX Logger allows us to output properly formatted log entries to the browser’s console.

The level of logs output is configured to be dependent on the environment, Production or not Production. Below is an example of the log output from the Angular UI in Chrome. Since the UI’s Docker Image was built with the Production configuration, the log level is set to INFO. You would not want to expose potentially sensitive information in verbose log output to our end-users in Production.

Client-side logs from the platforms’ Angular UI

Controlling logging levels is accomplished by adding the following ternary operator to the app.module.ts file.

imports: [
BrowserModule,
HttpClientModule,
FormsModule,
LoggerModule.forRoot({
level: !environment.production ?
NgxLoggerLevel.DEBUG : NgxLoggerLevel.INFO,
serverLogLevel: NgxLoggerLevel.INFO
})
]
view raw logs.js hosted with ❤ by GitHub

Platform Logs

Based on the platform built, configured, and deployed in part one, you now have access logs from multiple sources.

  1. Amazon DocumentDB: Amazon CloudWatch Audit and Profiler logs;
  2. Amazon MQ: Amazon CloudWatch logs;
  3. Amazon EKS: API server, Audit, Authenticator, Controller manager, and Scheduler CloudWatch logs;
  4. Kubernetes Dashboard: Individual EKS Pod and Replica Set logs;
  5. Kiali: Individual EKS Pod and Container logs;
  6. Fluent Bit: EKS performance, host, dataplane, and application CloudWatch logs;

Fluent Bit

According to a recent AWS Blog post, Fluent Bit Integration in CloudWatch Container Insights for EKS, Fluent Bit is an open source, multi-platform log processor and forwarder that allows you to collect data and logs from different sources and unify and send them to different destinations, including CloudWatch Logs. Fluent Bit is also fully compatible with Docker and Kubernetes environments. Using the newly launched Fluent Bit DaemonSet, you can send container logs from your EKS clusters to CloudWatch logs for logs storage and analytics.

Running Fluent Bit, the EKS cluster’s performance, host, dataplane, and application logs will also be available in Amazon CloudWatch.

Amazon CloudWatch log groups from the demonstration’s EKS cluster

Within the application log groups, you can access the individual log streams for each reference application’s components.

Amazon CloudWatch log streams from the application log group

Within each CloudWatch log stream, you can view individual log entries.

Amazon CloudWatch log stream for Service A

CloudWatch Logs Insights enables you to interactively search and analyze your log data in Amazon CloudWatch Logs. You can perform queries to help you more efficiently and effectively respond to operational issues. If an issue occurs, you can use CloudWatch Logs Insights to identify potential causes and validate deployed fixes.

Amazon CloudWatch Log Insights — latest errors found in logs for Service F

CloudWatch Logs Insights supports CloudWatch Logs Insights query syntax, a query language you can use to perform queries on your log groups. Each query can include one or more query commands separated by Unix-style pipe characters (|). For example:

fields @timestamp, @message
| filter kubernetes.container_name = "service-f"
and @message like "error"
| sort @timestamp desc
| limit 20

Pillar Two: Metrics

For metrics, we will examine CloudWatch Container Insights, Prometheus, and Grafana. Prometheus and Grafana are industry-leading tools you installed as part of the Istio deployment.

Prometheus

Prometheus is an open source system monitoring and alerting toolkit originally built at SoundCloud circa 2012. Prometheus joined the Cloud Native Computing Foundation (CNCF) in 2016 as the second project hosted after Kubernetes.

Prometheus Graph of container memory usage during load test

According to Istio, the Prometheus addon is a Prometheus server that comes preconfigured to scrape Istio endpoints to collect metrics. You can use Prometheus with Istio to record metrics that track the health of Istio and applications within the service mesh. You can visualize metrics using tools like Grafana and Kiali. The Istio Prometheus addon is intended for demonstration only and is not tuned for performance or security.

The istioctl dashboardcommand provides access to all of the Istio web UIs. With the EKS cluster running, Istio installed, and the reference application platform deployed, access Prometheus using the istioctl dashboard prometheus command from your terminal. You must be logged into AWS from your terminal to connect to Prometheus successfully. If you are not logged in to AWS, you will often see the following error: Error: not able to locate <tool_name> pod: Unauthorized. Since we used the non-production demonstration versions of the Istio Addons, there is no authentication and authorization required to access Prometheus.

According to Prometheus, users select and aggregate time-series data in real-time using a functional query language called PromQL (Prometheus Query Language). The result of an expression can either be shown as a graph, viewed as tabular data in Prometheus’s expression browser, or consumed by external systems through Prometheus’ HTTP API. The expression browser includes a drop-down menu with all available metrics as a starting point for building queries. Shown below are a few PromQL examples that were developed as part of writing this post.

istio_agent_go_info{kubernetes_namespace="dev"}
istio_build{kubernetes_namespace="dev"}
up{alpha_eksctl_io_cluster_name="istio-observe-demo", job="kubernetes-nodes"}
sum by (pod) (rate(container_network_transmit_packets_total{stack="reference-app",namespace="dev",pod=~"service-.*"}[5m]))
sum by (instance) (istio_requests_total{source_app="istio-ingressgateway",connection_security_policy="mutual_tls",response_code="200"})
sum by (response_code) (istio_requests_total{source_app="istio-ingressgateway",connection_security_policy="mutual_tls",response_code!~"200|0"})

Prometheus APIs

Prometheus has both an HTTP API and a Management API. There are many useful endpoints in addition to the Prometheus UI, available at http://localhost:9090/graph. For example, the Prometheus HTTP API endpoint that lists all the command-line configuration flags is available at http://localhost:9090/api/v1/status/flags. The endpoint that lists all the available Prometheus metrics is available at http://localhost:9090/api/v1/label/__name__/values; over 951 metrics in this demonstration.

The Prometheus endpoint that lists many available metrics with HELP and TYPE to explain their function can be found at http://localhost:9090/metrics.

Understanding Metrics

In addition to these endpoints, the standard service level metrics exported by Istio and available via Prometheus can be found in the Istio Standard Metrics documentation. An explanation of many of the metrics available in Prometheus is also found in the cAdvisor README on their GitHub site. As mentioned in this AWS Blog Post, the cAdvisor metrics are also available from the command line using the following commands:

export NODE=$(kubectl get nodes | sed -n '2 p' | awk {'print $1'})
kubectl get --raw "/api/v1/nodes/${NODE}/proxy/metrics/cadvisor"

Observing Metrics

Below is an example graph of the backend microservice containers deployed to EKS. The graph PromQL expression returns the amount of working set memory, including recently accessed memory, dirty memory, and kernel memory (container_memory_working_set_bytes), summed by pod, in megabytes (MB). There was no load on the services during the period displayed.

sum by (pod) (container_memory_working_set_bytes{namespace="dev", container=~"service-.*|rev-proxy|angular-ui"}) / (1024^2)

The container_memory_working_set_bytes metric is the same metric used by the kubectl top command (not container_memory_usage_bytes). Omitting the --containers=true flag will output pod stats versus containers.

> kubectl top pod -n dev --containers=true | \
grep -v istio-proxy | sort -k 4 -r
POD                           NAME          CPU(cores) MEMORY(bytes)
service-d-69d7469cbf-ts4t7 service-d 135m 13Mi
service-d-69d7469cbf-6thmz service-d 156m 13Mi
service-d-69d7469cbf-nl7th service-d 118m 12Mi
service-d-69d7469cbf-fz5bh service-d 118m 12Mi
service-d-69d7469cbf-89995 service-d 136m 11Mi
service-d-69d7469cbf-g4pfm service-d 106m 10Mi
service-h-69576c4c8c-x9ccl service-h 33m 9Mi
service-h-69576c4c8c-gtjc9 service-h 33m 9Mi
service-h-69576c4c8c-bjgfm service-h 45m 9Mi
service-h-69576c4c8c-8fk6z service-h 38m 9Mi
service-h-69576c4c8c-55rld service-h 36m 9Mi
service-h-69576c4c8c-4xpb5 service-h 41m 9Mi
...

In another Prometheus example, the PromQL query expression returns the per-second rate of CPU resources measured in CPU units (1 CPU = 1 AWS vCPU), as measured over the last 5 minutes, per time series in the range vector, summed by the pod. During this period, the backend services were under a consistent, simulated load of 15 concurrent users using hey. Four instances of Service D pods were consuming the most CPU units during this time period.

sum by (pod) (rate(container_cpu_usage_seconds_total{namespace="dev", container=~"service-.*|rev-proxy|angular-ui"}[5m])) * 1000

The container_cpu_usage_seconds_total metric is the same metric used by the kubectl top command. The above PromQL expression multiplies the query results by 1,000 to match the results from kubectl top, shown below.

> kubectl top pod -n dev --sort-by=cpu
NAME                          CPU(cores)   MEMORY(bytes)
service-d-69d7469cbf-6thmz 159m 60Mi
service-d-69d7469cbf-89995 143m 61Mi
service-d-69d7469cbf-ts4t7 140m 59Mi
service-d-69d7469cbf-fz5bh 135m 58Mi
service-d-69d7469cbf-nl7th 132m 61Mi
service-d-69d7469cbf-g4pfm 119m 62Mi
service-g-c7d68fd94-w5t66 59m 58Mi
service-f-7dc8f64799-qj8qv 56m 55Mi
service-c-69fbc964db-knggt 56m 58Mi
service-h-69576c4c8c-8fk6z 55m 58Mi
service-h-69576c4c8c-4xpb5 55m 58Mi
service-g-c7d68fd94-5cdc2 54m 58Mi
...

Limits

Prometheus also exposes container resource limits. For example, the memory limits set on the reference platform’s backend services, displayed in megabytes (MB), using the container_spec_memory_limit_bytes metric. When viewed alongside the real-time resources consumed by the services, these metrics are useful to properly configure and monitor Kubernetes management features such as the Horizontal Pod Autoscaler.

sum by (container) (container_spec_memory_limit_bytes{namespace="dev", container=~"service-.*|rev-proxy|angular-ui"}) / (1024^2) / count by (container) (container_spec_memory_limit_bytes{namespace="dev", container=~"service-.*|rev-proxy|angular-ui"})

Or, memory limits by Pod:

sum by (pod) (container_spec_memory_limit_bytes{namespace="dev"}) / (1024^2)

Cluster Metrics

Prometheus also contains metrics about Istio components, Kubernetes components, and the EKS cluster. For example, the total available memory in gigabytes (GB) of each of the five m5.large EC2 worker nodes in the istio-observe-demo EKS cluster’s managed-ng-1 Managed Node Group.

machine_memory_bytes{alpha_eksctl_io_cluster_name="istio-observe-demo", alpha_eksctl_io_nodegroup_name="managed-ng-1"} / (1024^3)

For total physical cores, use the machine_cpu_physical_core metric, and for vCPU cores use the machine_cpu_cores metric.

Grafana

Grafana describes itself as the leading open source software for time-series analytics. According to Grafana Labs, Grafana allows you to query, visualize, alert on, and understand your metrics no matter where they are stored. You can easily create, explore, and share visually rich, data-driven dashboards. Grafana also allows users to define alert rules for their most important metrics visually. Grafana will continuously evaluate rules and can send notifications.

If you deployed Grafana using the Istio addons process demonstrated in part one of the previous post, access Grafana similar to the other tools:

istioctl dashboard grafana
Grafana Home page

According to Istio, Grafana is an open source monitoring solution used to configure dashboards for Istio. You can use Grafana to monitor the health of Istio and applications within the service mesh. While you can build your own dashboards, Istio offers a set of preconfigured dashboards for all of the most important metrics for the mesh and the control plane. The preconfigured dashboards use Prometheus as the data source.

Below is an example of the Istio Mesh Dashboard, filtered to show the eight backend service workloads running in the dev namespace. During this period, the backend services were under a consistent simulated load of approximately 20 concurrent users using hey. You can observe the p50, p90, and p99 latency of requests to these workloads.

View of the Istio Mesh Dashboard

Dashboards are built from Panels, the basic visualization building blocks in Grafana. Each panel has a query editor specific to the data source (Prometheus in this case) selected. The query editor allows you to write your (PromQL) query. For example, below is the PromQL expression query responsible for the p50 latency Panel displayed in the Istio Mesh Dashboard.

label_join((histogram_quantile(0.50, sum(rate(istio_request_duration_milliseconds_bucket{reporter="source"}[1m])) by (le, destination_workload, destination_workload_namespace)) / 1000) or histogram_quantile(0.50, sum(rate(istio_request_duration_seconds_bucket{reporter="source"}[1m])) by (le, destination_workload, destination_workload_namespace)), "destination_workload_var", ".", "destination_workload", "destination_workload_namespace")

Below is an example of the Istio Workload Dashboard. The dashboard contains three sections: General, Inbound Workloads, and Outbound Workloads. We have filtered outbound traffic from the reference platform’s backend services in the dev namespace.

View of the Istio Workload Dashboard

Below is a different view of the Istio Workload Dashboard, the dashboard’s Inbound Workloads section filtered to a single workload, the gRPC Gateway. The gRPC Gateway accepts incoming traffic from the Istio Ingress Gateway, as shown in the dashboard’s panels.

View of the Istio Workload Dashboard

Grafana provides the ability to Explore a Panel. Explore strips away the dashboard and panel options so that you can focus on the query. Below is an example of the Panel showing a steady stream of TCP-based egress traffic for Service F, based on the istio_tcp_sent_bytes_total metric. Service F consumes messages off on the RabbitMQ queue (Amazon MQ) and writes messages to MongoDB (DocumentDB).

Exploring a Grafana dashboard panel

Istio Performance

You can monitor the resource usage of Istio with the Istio Performance Dashboard.

View of the Istio Performance Dashboard

Additional Dashboards

Grafana provides a site containing official and community-built dashboards, including the above-mentioned Istio dashboards. Importing dashboards into your Grafana instance is as simple as copying the dashboard URL or the ID provided from the Grafana dashboard site and pasting it into the dashboard import option of your Grafana instance. However, be aware that not every Kubernetes dashboard in Grafan’s site is compatible with your specific version of Kubernetes, Istio, or EKS, nor relies on Prometheus as a data source. As a result, you might have to test and tweak imported dashboards to get them working.

Below is an example of an imported community dashboard, Kubernetes cluster monitoring (via Prometheus) by Instrumentisto Team (dashboard ID 315).

Alerting

An effective observability strategy must include more than just the ability to visualize results. An effective strategy must also detect anomalies and notify (alert) the appropriate resources or directly resolve incidents. Grafana, like Prometheus, is capable of alerting and notification. You visually define alert rules for your critical metrics. Then, Grafana will continuously evaluate metrics against the rules and send notifications when pre-defined thresholds are breached.

Prometheus supports multiple popular notification channels, including PagerDuty, HipChat, Email, Kafka, and Slack. Below is an example of a Prometheus notification channel that sends alert notifications to a Slack support channel.

Below is an example of an alert based on an arbitrarily high CPU usage of 300 millicpu or millicores (m). When the CPU usage of a single pod goes above that value for more than 3 minutes, an alert is sent. The high CPU usage could be caused by the Horizontal Pod Autoscaler not functioning, or the HPA has reached its maxReplicas limit, or there are not enough resources available within the cluster’s existing worker nodes to schedule additional pods.

Triggered by the alert, Prometheus sends detailed notifications to the designated Slack channel.

Amazon CloudWatch Container Insights

Lastly, in the category of Metrics, Amazon CloudWatch Container Insights collects, aggregates, summarizes, and visualizes metrics and logs from your containerized applications and microservices. CloudWatch alarms can be set on metrics that Container Insights collects. Container Insights is available for Amazon Elastic Container Service (Amazon ECS), including Fargate, Amazon EKS, and Kubernetes platforms on Amazon EC2.

In Amazon EKS, Container Insights uses a containerized version of the CloudWatch agent to discover all running containers in a cluster. It then collects performance data at every layer of the performance stack. Container Insights collects data as performance log events using the embedded metric format. These performance log events are entries that use a structured JSON schema that enables high-cardinality data to be ingested and stored at scale.

In the previous post, we also installed CloudWatch Container Insights monitoring for Prometheus, which automates the discovery of Prometheus metrics from containerized systems and workloads.

Below is an example of a basic Performance Monitoring CloudWatch Container Insights Dashboard. The dashboard is filtered to the dev namespace of the EKS cluster, where the reference application platform is running. During this period, the backend services were put under a simulated load using hey. As the load on the application increased, the ‘Number of Pods’ increased from 20 pods to 56 pods based on the container’s requested resources and HPA configurations. There is also a CloudWatch Alarm, shown on the right of the screen. An alarm was triggered for an arbitrarily high level of network transmission activity.

Next is an example of Container Insights’ Container Map view in CPU mode. You see a visual representation of the dev namespace, with each of the backend service’s Service and Deployment resources shown.

Below, there is a warning icon indicating an Alarm on the cluster was triggered.

Lastly, CloudWatch Insights allows you to jump from the CloudWatch Insights to the CloudWatch Log Insights console. CloudWatch Insights will also write the CloudWatch Insights query for you. Below, we went from the Service D container metrics view in the CloudWatch Insights Performance Monitoring console directly to the CloudWatch Log Insights console with a query, ready to run.

Pillar 3: Traces

According to the Open Tracing website, distributed tracing, also called distributed request tracing, is used to profile and monitor applications, especially those built using a microservices architecture. Distributed tracing helps pinpoint where failures occur and what causes poor performance.

Header Propagation

According to Istio, header propagation may be accomplished through client libraries, such as Zipkin or Jaeger. Header propagation may also be accomplished manually, referred to as trace context propagation, documented in the Distributed Tracing Task. Alternately, Istio proxies can automatically send spans. Applications need to propagate the appropriate HTTP headers so that when the proxies send span information, the spans can be correlated correctly into a single trace. To accomplish this, an application needs to collect and propagate the following headers from the incoming request to any outgoing requests.

  • x-request-id
  • x-b3-traceid
  • x-b3-spanid
  • x-b3-parentspanid
  • x-b3-sampled
  • x-b3-flags
  • x-ot-span-context

The x-b3 headers originated as part of the Zipkin project. The B3 portion of the header is named for the original name of Zipkin, BigBrotherBird. Passing these headers across service calls is known as B3 propagation. According to Zipkin, these attributes are propagated in-process and eventually downstream (often via HTTP headers) to ensure all activity originating from the same root are collected together.

To demonstrate distributed tracing with Jaeger and Zipkin, the gRPC Gateway passes the b3 headers. While the RESTful JSON-based services passed these headers in the HTTP request object, with gRPC, the heders are passed in the gRPC Context object. The following code has been added to the gRPC Gateway. The Istio sidecar proxy (Envoy) generates the initial headers, which are then propagated throughout the service call chain. It is critical only to propagate the headers present in the downstream request with values, as the code below does.

func injectHeadersIntoMetadata(ctx context.Context, req *http.Request) metadata.MD {
headers := []string{
"x-request-id",
"x-b3-traceid",
"x-b3-spanid",
"x-b3-parentspanid",
"x-b3-sampled",
"x-b3-flags",
"x-ot-span-context"}
var pairs []string
for _, h := range headers {
if v := req.Header.Get(h); len(v) > 0 {
pairs = append(pairs, h, v)
}
}
return metadata.Pairs(pairs)
}
view raw main.go hosted with ❤ by GitHub

Below, in the CloudWatch logs, we see an example of the HTTP request headers recorded in a log message for Service A. The b3 headers are propagated from the gRPC Gateway reverse proxy to gRPC-based Go services. Header propagation ensures a complete distributed trace across the entire service call chain.

CloudWatch Log Insights Console showing Service A’s log entries

Headers propagated from Service A are shown below. Note the b3 headers propagated from the gRPC Gateway reverse proxy.

{
"function": "callGrpcService",
"level": "debug",
"line": "84",
"msg": "headersOut: map[:
authority:[service-a.dev.svc.cluster.local:50051]
content-type:[application/grpc]
grpcgateway-accept:[application/json, text/plain, */*]
grpcgateway-accept-language:[en-US,en;q=0.9]
grpcgateway-content-type:[application/json]
grpcgateway-origin:[https://ui.example-api.com]
grpcgateway-referer:[https://ui.example-api.com/]
grpcgateway-user-agent:[Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36]
user-agent:[grpc-go/1.39.0]
x-b3-parentspanid:[3b30be08b7a6bad0]
x-b3-sampled:[1]
x-b3-spanid:[c1f63e34996770c9]
x-b3-traceid:[7b084bbca0bade97bdc76741c3973ed6]
x-envoy-attempt-count:[1]
x-forwarded-client-cert:[By=spiffe://cluster.local/ns/dev/sa/default;Hash=9c02df616b245e7ada5394db109cb1fa4086c08591e668e5a67fc3e0520713cf;Subject=\"\";URI=spiffe://cluster.local/ns/dev/sa/default]
x-forwarded-for:[73.232.228.42,192.168.63.73, 127.0.0.6]
x-forwarded-host:[api.example-api.com]
x-forwarded-proto:[http]
x-request-id:[e83b565f-23ca-9f91-953e-03175bdafaa0]
]",
"time": "2021-07-04T13:54:06Z"
}
view raw logrus-grpc-log.txt hosted with ❤ by GitHub

Jaeger

According to their website, Jaeger, inspired by Dapper and OpenZipkin, is a distributed tracing system released as open source by Uber Technologies. Jaeger is used for monitoring and troubleshooting microservices-based distributed systems, including distributed context propagation, distributed transaction monitoring, root cause analysis, service dependency analysis, and performance and latency optimization. The Jaeger website contains a helpful overview of Jaeger’s architecture and general tracing-related terminology.

If you deployed Jaeger using the Istio addons process demonstrated in part one of the previous post, access Jaeger similar to the other tools:

istioctl dashboard jaeger

Below are examples of the Jaeger UI’s Search view, displaying the results of a search for the Angular UI and the Istio Ingress Gateway services over a period of time. We see a timeline of traces across the top with a list of trace results below. As discussed on the Jaeger website, a trace is composed of spans. A span represents a logical unit of work in Jaeger that has an operation name. A trace is an execution path through the system and can be thought of as a directed acyclic graph (DAG) of spans. If you have worked with systems like Apache Spark, you are probably already familiar with the concept of DAGs.

Latest Angular UI traces
Latest Istio Ingress Gateway traces

Below is a detailed view of a single trace in Jaeger’s Trace Timeline mode. The 16 spans encompass nine of the reference platform’s components: seven backend services, gRPC Gateway, and Istio Ingress Gateway. The spans each have individual timings, with an overall trace time of 195.49 ms. The root span in the trace is the Istio Ingress Gateway. The Angular UI, loaded in the end user’s web browser, calls gRPC Gateway via the Istio Ingress Gateway. From there, we see the expected flow of our service-to-service IPC. Service A calls Services B and Service C. Service B calls Service E, which calls Service G and Service H.

In this demonstration, traces are not instrumented to span the RabbitMQ message queue nor MongoDB. You will not see a trace that includes a call from Service D to Service F via the RabbitMQ.

Detailed view of an Istio Ingress Gateway distributed trace

The visualization of the trace’s timeline demonstrates the synchronous nature of the reference platform’s service-to-service IPC instead of the asynchronous nature of the decoupled communications using the RabbitMQ messaging queue. Service A waits for each service in its call chain to respond before returning its response to the requester.

Within Jaeger’s Trace Timeline view, you have the ability to drill into a single span, which contains additional metadata. The span’s metadata includes the API endpoint URL being called, HTTP method, response status, and several other headers.

Detailed view of an Istio Ingress Gateway distributed trace

A Trace Statistics view is also available.

Trace statistics for an Istio Ingress Gateway distributed trace

Additionally, Jaeger has an experimental Trace Graph mode that displays a graph view of the same trace.

Jaeger also includes a Compare Trace feature and two dependency views: Force-Directed Graph and DAG. I find both views rather primitive compared to Kiali. Lacking access to Kiali, the views are marginally useful as a dependency graph.

Zipkin

Zipkin is a distributed tracing system, which helps gather timing data needed to troubleshoot latency problems in service architectures. According to a 2012 post on Twitter’s Engineering Blog, Zipkin started as a project during Twitter’s first Hack Week. During that week, they implemented a basic version of the Google Dapper paper for Thrift.

Results of a search for the latest traces in Zipkin

Zipkin and Jaeger are very similar in terms of capabilities. I have chosen to focus on Jaeger in this post as I prefer it over Zipkin. If you want to try Zipkin instead of Jaeger, you can use the following commands to remove Jaeger and install Zipkin from the Istio addons extras directory. In part one of the post, we did not install Zipkin by default when we deployed the Istio addons. Be aware that running both tools simultaneously in the same Kubernetes cluster will cause unpredictable tracing results.

kubectl delete -f https://raw.githubusercontent.com/istio/istio/release-1.10/samples/addons/jaeger.yaml
kubectl apply -f https://raw.githubusercontent.com/istio/istio/release-1.10/samples/addons/extras/zipkin.yaml

Access Zipkin similar to the other observability tools:

istioctl dashboard zipkin

Below is an example of a distributed trace visualized in Zipkin’s UI, containing 16 spans, similar to the trace visualized in Jaeger, shown above. The spans encompass eight of the reference platform’s components: seven of the eight backend services and the Istio Ingress Gateway. The spans each have individual timings, with an overall trace time of ~221 ms.

Detailed view of a distributed trace in Zipkin

Zipkin can also visualize a dependency graph based on the distributed trace. Below is an example of a traffic simulation over a 24-hour period, showing network traffic flowing between the reference platform’s components, illustrated as a dependency graph.

Zipkin‘s dependency graph showing traffic over a 24-hour period

Kiali: Microservice Observability

According to their website, Kiali is a management console for an Istio-based service mesh. It provides dashboards and observability, and lets you operate your mesh with robust configuration and validation capabilities. It shows the structure of a service mesh by inferring traffic topology and displaying the mesh’s health. Kiali provides detailed metrics, powerful validation, Grafana access, and strong integration for distributed tracing with Jaeger.

If you deployed Kaili using the Istio addons process demonstrated in part one of the previous post, access Kiali similar to the other tools:

istioctl dashboard kaili

For improved security, install the latest version of Kaili using the customizable install mentioned in Istio’s documentation. Using Kiali’s Install via Kiali Server Helm Chart option adds token-based authentication, similar to the Kubernetes Dashboard.

Kiali’s Overview tab provides a global view of all namespaces within the Istio service mesh and the number of applications within each namespace.

The Graph tab in the Kiali UI represents the components running in the Istio service mesh. Below, filtering on the cluster’s dev Namespace, we can observe that Kiali has mapped 11 applications (workloads), 11 services, and 24 edges (a graph term). Specifically, we see the Istio Ingres Proxy at the edge of the service mesh, gRPC Gateway, Angular UI, and eight backend services, all with their respective Envoy proxy sidecars that are taking traffic (Service F did not take any direct traffic from another service in this example), the external DocumentDB egress point, and the external Amazon MQ egress point. Note how service-to-service traffic flows with Istio, from the service to its sidecar proxy, to the other service’s sidecar proxy, and finally to the service.

Kiali allows you to zoom in and focus on a single component in the graph and its individual metrics.

Kiali can also display average request times and other metrics for each edge in the graph (communication between two components). Kaili can even show those metrics over a given period of time, using Kiali’s Replay feature, shown below.

The Applications tab lists all the applications, their namespace, and labels.

You can drill into an individual component on both the Applications and Workloads tabs and view additional details. Details include the overall health, Pods, and Istio Config status. Below is an overview of the Service A workload in the dev Namespace.

The Workloads detailed view also includes inbound and outbound network metrics. Below is an example of the outbound for Service A in the dev Namespace.

Kiali also gives you access to the individual pod’s container logs. Although log access is not as user-friendly as other log sources discussed previously, having logs available alongside metrics (integration with Grafana), traces (integration with Jaeger), and mesh visualization, all in Kiali, can act as a very effective single pane of glass for observability.

Kiali also has an Istio Config tab. The Istio Config tab displays a list of all of the available Istio configuration objects that exist in the user’s environment.

You can use Kiali to configure and manage the Istio service mesh and its installed resources. Using Kiali, you can actually modify the deployed resources, similar to using the kubectl edit command.

Oftentimes, I find Kiali to be my first stop when troubleshooting platform issues. Once I identify the specific components or communication paths having issues, I then review the specific application logs and Prometheus metrics through the Grafana dashboard.

Tear Down

To tear down the EKS cluster, DocumentDB cluster, and Amazon MQ broker, use the following commands:

# EKS cluster
eksctl delete cluster --name $CLUSTER_NAME
# Amazon MQ
aws mq list-brokers | jq -r '.BrokerSummaries[] | .BrokerId'aws mq delete-broker --broker-id {{ your_broker_id }}
# DocumentDB
aws docdb describe-db-clusters \
| jq -r '.DBClusters[] | .DbClusterResourceId'aws docdb delete-
db-cluster \
--db-cluster-identifier {{ your_cluster_id }}

Conclusion

In this post, we explored a set of popular open source observability tools, easily integrated with the Istio service mesh. These tools included Jaeger and Zipkin for distributed transaction monitoring, Prometheus for metrics collection and alerting, Grafana for metrics querying, visualization, and alerting, and Kiali for overall observability and management of Istio. We rounded out the toolset using Fluent Bit for log processing and forwarding to Amazon CloudWatch Container Insights. Using these tools, we successfully observed a gRPC-based, distributed reference application platform deployed to Amazon EKS.


This blog represents my own viewpoints and not of my employer, Amazon Web Services (AWS). All product names, logos, and brands are the property of their respective owners.

, , , , ,

Leave a comment

Azure Kubernetes Service (AKS) Observability with Istio Service Mesh

In the last two-part post, Kubernetes-based Microservice Observability with Istio Service Mesh, we deployed Istio, along with its observability tools, Prometheus, Grafana, Jaeger, and Kiali, to Google Kubernetes Engine (GKE). Following that post, I received several questions about using Istio’s observability tools with other popular managed Kubernetes platforms, primarily Azure Kubernetes Service (AKS). In most cases, including with AKS, both Istio and the observability tools are compatible.

In this short follow-up of the last post, we will replace the GKE-specific cluster setup commands, found in part one of the last post, with new commands to provision a similar AKS cluster on Azure. The new AKS cluster will run Istio 1.1.3, released 4/15/2019, alongside the latest available version of AKS (Kubernetes), 1.12.6. We will replace Google’s Stackdriver logging with Azure Monitor logs. We will retain the external MongoDB Atlas cluster and the external CloudAMQP cluster dependencies.

Previous articles about AKS include First Impressions of AKS, Azure’s New Managed Kubernetes Container Service (November 2017) and Architecting Cloud-Optimized Apps with AKS (Azure’s Managed Kubernetes), Azure Service Bus, and Cosmos DB (December 2017).

Source Code

All source code for this post is available on GitHub in two projects. The Go-based microservices source code, all Kubernetes resources, and all deployment scripts are located in the k8s-istio-observe-backend project repository.

git clone \
  --branch master --single-branch \
  --depth 1 --no-tags \
  https://github.com/garystafford/k8s-istio-observe-backend.git

The Angular UI TypeScript-based source code is located in the k8s-istio-observe-frontend repository. You will not need to clone the Angular UI project for this post’s demonstration.

Setup

This post assumes you have a Microsoft Azure account with the necessary resource providers registered, and the Azure Command-Line Interface (CLI), az, installed and available to your command shell. You will also need Helm and Istio 1.1.3 installed and configured, which is covered in the last post.

screen_shot_2019-03-27_at_1_35_46_pm

Start by logging into Azure from your command shell.

az login \
  --username {{ your_username_here }} \
  --password {{ your_password_here }}

Resource Providers

If you are new to Azure or AKS, you may need to register some additional resource providers to complete this demonstration.

az provider list --output table

screen_shot_2019-03-27_at_5_37_46_pm

If you are missing required resource providers, you will see errors similar to the one shown below. Simply activate the particular provider corresponding to the error.

Operation failed with status:'Bad Request'. 
Details: Required resource provider registrations 
Microsoft.Compute, Microsoft.Network are missing.

To register the necessary providers, use the Azure CLI or the Azure Portal UI.

az provider register --namespace Microsoft.ContainerService
az provider register --namespace Microsoft.Network
az provider register --namespace Microsoft.Compute

Resource Group

AKS requires an Azure Resource Group. According to Azure, a resource group is a container that holds related resources for an Azure solution. The resource group includes those resources that you want to manage as a group. I chose to create a new resource group associated with my closest geographic Azure Region, East US, using the Azure CLI.

az group create \
  --resource-group aks-observability-demo \
  --location eastus

screen_shot_2019-03-26_at_6_54_39_pm

Create the AKS Cluster

Before creating the GKE cluster, check for the latest versions of AKS. At the time of this post, the latest versions of AKS was 1.12.6.

az aks get-versions \
  --location eastus \
  --output table

screen_shot_2019-03-26_at_6_56_38_pm

Using the latest GKE version, create the GKE managed cluster. There are many configuration options available with the az aks create command. For this post, I am creating three worker nodes using the Azure Standard_DS3_v2 VM type, which will give us a total of 12 vCPUs and 42 GB of memory. Anything smaller and all the Pods may not be schedulable. Instead of supplying an existing SSH key, I will let Azure create a new one. You should have no need to SSH into the worker nodes. I am also enabling the monitoring add-on. According to Azure, the add-on sets up Azure Monitor for containers, announced in December 2018, which monitors the performance of workloads deployed to Kubernetes environments hosted on AKS.

time az aks create \
  --name aks-observability-demo \
  --resource-group aks-observability-demo \
  --node-count 3 \
  --node-vm-size Standard_DS3_v2 \
  --enable-addons monitoring \
  --generate-ssh-keys \
  --kubernetes-version 1.12.6

Using the time command, we observe that the cluster took approximately 5m48s to provision; I have seen times up to almost 10 minutes. AKS provisioning is not as fast as GKE, which in my experience is at least 2x-3x faster than AKS for a similarly sized cluster.

screen_shot_2019-03-26_at_7_03_49_pm

After the cluster creation completes, retrieve your AKS cluster credentials.

az aks get-credentials \
  --name aks-observability-demo \
  --resource-group aks-observability-demo \
  --overwrite-existing

Examine the Cluster

Use the following command to confirm the cluster is ready by examining the status of three worker nodes.

kubectl get nodes --output=wide

screen_shot_2019-03-27_at_6_06_10_pm.png

Observe that Azure currently uses Ubuntu 16.04.5 LTS for the worker node’s host operating system. If you recall, GKE offers both Ubuntu as well as a Container-Optimized OS from Google.

Kubernetes Dashboard

Unlike GKE, there is no custom AKS dashboard. Therefore, we will use the Kubernetes Web UI (dashboard), which is installed by default with AKS, unlike GKE. According to Azure, to make full use of the dashboard, since the AKS cluster uses RBAC, a ClusterRoleBinding must be created before you can correctly access the dashboard.

kubectl create clusterrolebinding kubernetes-dashboard \
  --clusterrole=cluster-admin \
  --serviceaccount=kube-system:kubernetes-dashboard

Next, we must create a proxy tunnel on local port 8001 to the dashboard running on the AKS cluster. This CLI command creates a proxy between your local system and the Kubernetes API and opens your web browser to the Kubernetes dashboard.

az aks browse \
  --name aks-observability-demo \
  --resource-group aks-observability-demo

screen_shot_2019-03-26_at_7_08_54_pm

Although you should use the Azure CLI, PowerShell, or SDK for all your AKS configuration tasks, using the dashboard for monitoring the cluster and the resources running on it, is invaluable.

screen_shot_2019-03-26_at_7_06_57_pm

The Kubernetes dashboard also provides access to raw container logs. Azure Monitor provides the ability to construct complex log queries, but for quick troubleshooting, you may just want to see the raw logs a specific container is outputting, from the dashboard.

screen_shot_2019-03-29_at_9_23_57_pm

Azure Portal

Logging into the Azure Portal, we can observe the AKS cluster, within the new Resource Group.

screen_shot_2019-03-26_at_7_08_25_pm

In addition to the Azure Resource Group we created, there will be a second Resource Group created automatically during the creation of the AKS cluster. This group contains all the resources that compose the AKS cluster. These resources include the three worker node VM instances, and their corresponding storage disks and NICs. The group also includes a network security group, route table, virtual network, and an availability set.

screen_shot_2019-03-26_at_7_08_04_pm

Deploy Istio

From this point on, the process to deploy Istio Service Mesh and the Go-based microservices platform follows the previous post and use the exact same scripts. After modifying the Kubernetes resource files, to deploy Istio, use the bash script, part4_install_istio.sh. I have added a few more pauses in the script to account for the apparently slower response times from AKS as opposed to GKE. It definitely takes longer to spin up the Istio resources on AKS than on GKE, which can result in errors if you do not pause between each stage of the deployment process.

screen_shot_2019-03-26_at_7_11_44_pm

screen_shot_2019-03-26_at_7_18_26_pm

Using the Kubernetes dashboard, we can view the Istio resources running in the istio-system Namespace, as shown below. Confirm that all resource Pods are running and healthy before deploying the Go-based microservices platform.

screen_shot_2019-03-26_at_7_16_50_pm

Deploy the Platform

Deploy the Go-based microservices platform, using bash deploy script, part5a_deploy_resources.sh.

screen_shot_2019-03-26_at_7_20_05_pm

The script deploys two replicas (Pods) of each of the eight microservices, Service-A through Service-H, and the Angular UI, to the dev and test Namespaces, for a total of 36 Pods. Each Pod will have the Istio sidecar proxy (Envoy Proxy) injected into it, alongside the microservice or UI.

screen_shot_2019-03-26_at_7_21_24_pm

Azure Load Balancer

If we return to the Resource Group created automatically when the AKS cluster was created, we will now see two additional resources. There is now an Azure Load Balancer and Public IP Address.

screen_shot_2019-03-26_at_7_21_56_pm

Similar to the GKE cluster in the last post, when the Istio Ingress Gateway is deployed as part of the platform, it is materialized as an Azure Load Balancer. The front-end of the load balancer is the new public IP address. The back-end of the load-balancer is a pool containing the three AKS worker node VMs. The load balancer is associated with a set of rules and health probes.

screen_shot_2019-03-26_at_7_22_51_pm

DNS

I have associated the new Azure public IP address, connected with the front-end of the load balancer, with the four subdomains I am using to represent the UI and the edge service, Service-A, in both Namespaces. If Azure is your primary Cloud provider, then Azure DNS is a good choice to manage your domain’s DNS records. For this demo, you will require your own domain.

screen_shot_2019-03-28_at_9_43_42_pm

Testing the Platform

With everything deployed, test the platform is responding and generate HTTP traffic for the observability tools to record. Similar to last time, I have chosen hey, a modern load generator and benchmarking tool, and a worthy replacement for Apache Bench (ab). Unlike ab, hey supports HTTP/2. Below, I am running hey directly from Azure Cloud Shell. The tool is simulating 10 concurrent users, generating a total of 500 HTTP GET requests to Service A.

# quick setup from Azure Shell using Bash
go get -u github.com/rakyll/hey
cd go/src/github.com/rakyll/hey/
go build
  
./hey -n 500 -c 10 -h2 http://api.dev.example-api.com/api/ping

We had 100% success with all 500 calls resulting in an HTTP 200 OK success status response code. Based on the results, we can observe the platform was capable of approximately 4 requests/second, with an average response time of 2.48 seconds and a mean time of 2.80 seconds. Almost all of that time was the result of waiting for the response, as the details indicate.

screen_shot_2019-03-26_at_7_57_03_pm

Logging

In this post, we have replaced GCP’s Stackdriver logging with Azure Monitor logs. According to Microsoft, Azure Monitor maximizes the availability and performance of applications by delivering a comprehensive solution for collecting, analyzing, and acting on telemetry from Cloud and on-premises environments. In my opinion, Stackdriver is a superior solution for searching and correlating the logs of distributed applications running on Kubernetes. I find the interface and query language of Stackdriver easier and more intuitive than Azure Monitor, which although powerful, requires substantial query knowledge to obtain meaningful results. For example, here is a query to view the log entries from the services in the dev Namespace, within the last day.

let startTimestamp = ago(1d);
KubePodInventory
| where TimeGenerated > startTimestamp
| where ClusterName =~ "aks-observability-demo"
| where Namespace == "dev"
| where Name contains "service-"
| distinct ContainerID
| join
(
    ContainerLog
    | where TimeGenerated > startTimestamp
)
on ContainerID
| project LogEntrySource, LogEntry, TimeGenerated, Name
| order by TimeGenerated desc
| render table

Below, we see the Logs interface with the search query and log entry results.

screen_shot_2019-03-29_at_9_13_37_pm

Below, we see a detailed view of a single log entry from Service A.

screen_shot_2019-03-29_at_9_18_12_pm

Observability Tools

The previous post goes into greater detail on the features of each of the observability tools provided by Istio, including Prometheus, Grafana, Jaeger, and Kiali.

We can use the exact same kubectl port-forward commands to connect to the tools on AKS as we did on GKE. According to Google, Kubernetes port forwarding allows using a resource name, such as a service name, to select a matching pod to port forward to since Kubernetes v1.10. We forward a local port to a port on the tool’s pod.

# Grafana
kubectl port-forward -n istio-system \
  $(kubectl get pod -n istio-system -l app=grafana \
  -o jsonpath='{.items[0].metadata.name}') 3000:3000 &
  
# Prometheus
kubectl -n istio-system port-forward \
  $(kubectl -n istio-system get pod -l app=prometheus \
  -o jsonpath='{.items[0].metadata.name}') 9090:9090 &
  
# Jaeger
kubectl port-forward -n istio-system \
$(kubectl get pod -n istio-system -l app=jaeger \
-o jsonpath='{.items[0].metadata.name}') 16686:16686 &
  
# Kiali
kubectl -n istio-system port-forward \
  $(kubectl -n istio-system get pod -l app=kiali \
  -o jsonpath='{.items[0].metadata.name}') 20001:20001 &

screen_shot_2019-03-26_at_8_04_24_pm

Prometheus and Grafana

Prometheus is a completely open source and community-driven systems monitoring and alerting toolkit originally built at SoundCloud, circa 2012. Interestingly, Prometheus joined the Cloud Native Computing Foundation (CNCF) in 2016 as the second hosted-project, after Kubernetes.

Grafana describes itself as the leading open source software for time series analytics. According to Grafana Labs, Grafana allows you to query, visualize, alert on, and understand your metrics no matter where they are stored. You can easily create, explore, and share visually-rich, data-driven dashboards. Grafana also users to visually define alert rules for your most important metrics. Grafana will continuously evaluate rules and can send notifications.

According to Istio, the Grafana add-on is a pre-configured instance of Grafana. The Grafana Docker base image has been modified to start with both a Prometheus data source and the Istio Dashboard installed. Below, we see one of the pre-configured dashboards, the Istio Service Dashboard.

screen_shot_2019-03-26_at_8_16_52_pm

Jaeger

According to their website, Jaeger, inspired by Dapper and OpenZipkin, is a distributed tracing system released as open source by Uber Technologies. It is used for monitoring and troubleshooting microservices-based distributed systems, including distributed context propagation, distributed transaction monitoring, root cause analysis, service dependency analysis, and performance and latency optimization. The Jaeger website contains a good overview of Jaeger’s architecture and general tracing-related terminology.

screen_shot_2019-03-26_at_8_03_31_pm

Below, we see a typical, distributed trace of the services, starting ingress gateway and passing across the upstream service dependencies.

screen_shot_2019-03-26_at_8_03_45_pm

Kaili

According to their website, Kiali provides answers to the questions: What are the microservices in my Istio service mesh, and how are they connected? Kiali works with Istio, in OpenShift or Kubernetes, to visualize the service mesh topology, to provide visibility into features like circuit breakers, request rates and more. It offers insights about the mesh components at different levels, from abstract Applications to Services and Workloads.

There is a common Kubernetes Secret that controls access to the Kiali API and UI. The default login is admin, the password is 1f2d1e2e67df.

screen_shot_2019-03-26_at_7_59_17_pm

Below, we see a detailed view of our platform, running in the dev namespace, on AKS.

screen_shot_2019-03-26_at_8_02_38_pm

Delete AKS Cluster

Once you are finished with this demo, use the following two commands to tear down the AKS cluster and remove the cluster context from your local configuration.

time az aks delete \
  --name aks-observability-demo \
  --resource-group aks-observability-demo \
  --yes

kubectl config delete-context aks-observability-demo

Conclusion

In this brief, follow-up post, we have explored how the current set of observability tools, part of the latest version of Istio Service Mesh, integrates with Azure Kubernetes Service (AKS).

All opinions expressed in this post are my own and not necessarily the views of my current or past employers or their clients.

, , , , , , , , , , , , , ,

1 Comment