Posts Tagged Logging

Brief Introduction to Observability on AWS

Explore the wide variety of Application Performance Monitoring (APM) and Observability options on AWS

Licensed image: ArtemisDiana/Shutterstock.com
Licensed image: ArtemisDiana/Shutterstock.com

APM and Observability

Observability is “the extent to which the internal states of a system can be inferred from externally available data” (Gartner). The three pillars of observability data are metricslogs, and tracesApplication Performance Monitoring (APM), a term commonly associated with observability, is “software that enables the observation and analysis of application health, performance, and user experience” (Gartner).

Additional features often associated with APM and observability products and services include the following (in alphabetical order):

  • Advanced Threat Protection (ATP)
  • Endpoint Detection and Response (EDR)
  • Incident Detection Response (IDR)
  • Infrastructure Performance Monitoring (IPM)
  • Network Device Monitoring (NDM)
  • Network Performance Monitoring (NPM)
  • OpenTelemetry (OTel)
  • Operational (or Operations) Intelligence Platform
  • Predictive Monitoring (predictive analytics / predictive modeling)
  • Real User Monitoring (RUM)
  • Security Information and Event Management (SIEM)
  • Security Orchestration, Automation, and Response (SOAR)
  • Synthetic Monitoring (directed monitoring / synthetic testing)
  • Threat Visibility and Risk Scoring
  • Unified Security and Observability Platform
  • User Behavior Analytics (UBA)

Not all features are offered by all vendors. Most vendors tend to specialize in one or more areas. Determining which features are essential to your organization before choosing a solution is vital.

AI/ML

Given the growing volume and real-time nature of observability telemetry, many vendors have started incorporating AI and ML into their products and services to improve correlation, anomaly detection, and mitigation capabilities. Understand how these features can reduce operational burden, enhance insights, and simplify complexity.

Decision Factors

APM and observability tooling choices often come down to a “Build vs. Buy” decision for organizations. In the Cloud, this usually means integrating several individual purpose-built products and services, self-managed open-source projects, or investing in an end-to-end APM or unified observability platform. Other decision factors include the need for solutions to support:

  • Hybrid cloud environments (on-premises/Cloud)
  • Multi-cloud environments (Public Cloud, SaaS, Supercloud)
  • Specialized workloads (e.g., Mainframes, HPC, VMware, SAP, SAS)
  • Compliant workloads (e.g., PCI DSS, PII, GDPR, FedRAMP)
  • Edge Computing and IoT/IIoT
  • AI, ML, and Data Analytics monitoring (AIOps, MLOps, DataOps)
  • SaaS observability (SaaS providers who offer monitoring to their end-users as part of their service offering)
  • Custom log formats and protocols

Finally, the 5 V’s of big data: Velocity, Volume, Value, Variety, and Veracity, also influence the choice of APM and observability tooling. The real-time nature of the observability data, the sheer volume of the data, the source and type of data, and the sensitivity of the data, will all guide tooling choices based on features and cost.

Organizations can choose fully-managed native AWS services, AWS Partner products and services, often SaaS, self-managed open-source observability tooling, or a combination of options. Many AWS and Partner products and services are commercial versions of popular open-source software (COSS).

AWS Options

Data Collection, Processing, and Forwarding

  • AWS Distro for OpenTelemetry (ADOT): Open-source APIs, libraries, and agents to collect distributed traces and metrics for application monitoring. ADOT is an open-source distribution of OpenTelemetry, a “high-quality, ubiquitous, and portable telemetry [solution] to enable effective observability.
  • AWS for Fluent Bit: Fluent Bit image with plugins for both CloudWatch Logs and Kinesis Data Firehose. Fluent Bit is an open-source, “super fast, lightweight, and highly scalable logging and metrics processor and forwarder.

Security-focused Monitoring

  • AWS CloudTrail: Helps enable operational and risk auditing, governance, and compliance in your AWS environment. CloudTrail records events, including actions taken in the AWS Management Console, AWS Command Line Interface (CLI), AWS SDKs, and APIs.

Partner Options

According to the 2022 Gartner® Magic Quadrant™ — APM & Observability Report, leading vendors commonly used by AWS customers include the following (in alphabetical order):

Open Source Options

There are countless open-source observability projects to choose from, including the following (in alphabetical order):

  • Elastic: Elastic Stack: Elasticsearch, Kibana, Beats, and Logstash
  • Fluentd: Data collector for unified logging layer
  • Grafana: Platform for monitoring and observability
  • Jaeger: End-to-end distributed tracing
  • Kiali: Configure, visualize, validate, and troubleshoot Istio Service Mesh
  • Loki: Like Prometheus, but for logs
  • OpenSearch: Scalable, flexible, and extensible software suite for search, analytics, and observability applications
  • OpenTelemetry (OTel): Collection of tools, APIs, and SDKs used to instrument, generate, collect, and export telemetry data
  • Prometheus: Monitoring system and time series database
  • Zabbix: Single pane of glass view of your whole IT infrastructure stack

🔔 To keep up with future content, follow Gary Stafford on LinkedIn.


This blog represents my viewpoints and not those of my employer, Amazon Web Services (AWS). All product names, logos, and brands are the property of their respective owners.

, , , , , ,

Leave a comment

Istio Observability with Go, gRPC, and Protocol Buffers-based Microservices on Google Kubernetes Engine (GKE)

In the last two posts, Kubernetes-based Microservice Observability with Istio Service Mesh and Azure Kubernetes Service (AKS) Observability with Istio Service Mesh, we explored the observability tools which are included with Istio Service Mesh. These tools currently include Prometheus and Grafana for metric collection, monitoring, and alerting, Jaeger for distributed tracing, and Kiali for Istio service-mesh-based microservice visualization and monitoring. Combined with cloud platform-native monitoring and logging services, such as Stackdriver on GCP, CloudWatch on AWS, Azure Monitor logs on Azure, and we have a complete observability solution for modern, distributed, Cloud-based applications.

In this post, we will examine the use of Istio’s observability tools to monitor Go-based microservices that use Protocol Buffers (aka Protobuf) over gRPC (gRPC Remote Procedure Calls) and HTTP/2 for client-server communications, as opposed to the more traditional, REST-based JSON (JavaScript Object Notation) over HTTP (Hypertext Transfer Protocol). We will see how Kubernetes, Istio, Envoy, and the observability tools work seamlessly with gRPC, just as they do with JSON over HTTP, on Google Kubernetes Engine (GKE).

screen_shot_2019-04-18_at_6_03_38_pm

Technologies

Image result for grpc logogRPC

According to the gRPC project, gRPC, a CNCF incubating project, is a modern, high-performance, open-source and universal remote procedure call (RPC) framework that can run anywhere. It enables client and server applications to communicate transparently and makes it easier to build connected systems. Google, the original developer of gRPC, has used the underlying technologies and concepts in gRPC for years. The current implementation is used in several Google cloud products and Google externally facing APIs. It is also being used by Square, Netflix, CoreOS, Docker, CockroachDB, Cisco, Juniper Networks and many other organizations.

Image result for google developerProtocol Buffers

By default, gRPC uses Protocol Buffers. According to Google, Protocol Buffers (aka Protobuf) are a language- and platform-neutral, efficient, extensible, automated mechanism for serializing structured data for use in communications protocols, data storage, and more. Protocol Buffers are 3 to 10 times smaller and 20 to 100 times faster than XML. Once you have defined your messages, you run the protocol buffer compiler for your application’s language on your .proto file to generate data access classes.

Protocol Buffers are 3 to 10 times smaller and 20 to 100 times faster than XML.

Protocol buffers currently support generated code in Java, Python, Objective-C, and C++, Dart, Go, Ruby, and C#. For this post, we have compiled for Go. You can read more about the binary wire format of Protobuf on Google’s Developers Portal.

Envoy Proxy

According to the Istio project, Istio uses an extended version of the Envoy proxy. Envoy is deployed as a sidecar to a relevant service in the same Kubernetes pod. Envoy, created by Lyft, is a high-performance proxy developed in C++ to mediate all inbound and outbound traffic for all services in the service mesh. Istio leverages Envoy’s many built-in features, including dynamic service discovery, load balancing, TLS termination, HTTP/2 and gRPC proxies, circuit-breakers, health checks, staged rollouts, fault injection, and rich metrics.

According to the post by Harvey Tuch of Google, Evolving a Protocol Buffer canonical API, Envoy proxy adopted Protocol Buffers, specifically proto3, as the canonical specification of for version 2 of Lyft’s gRPC-first API.

Reference Microservices Platform

In the last two posts, we explored Istio’s observability tools, using a RESTful microservices-based API platform written in Go and using JSON over HTTP for service to service communications. The API platform was comprised of eight Go-based microservices and one sample Angular 7, TypeScript-based front-end web client. The various services are dependent on MongoDB, and RabbitMQ for event queue-based communications. Below, the is JSON over HTTP-based platform architecture.

Golang Service Diagram with Proxy v2

Below, the current Angular 7-based web client interface.

screen_shot_2019-04-15_at_10_23_47_pm

Converting to gRPC and Protocol Buffers

For this post, I have modified the eight Go microservices to use gRPC and Protocol Buffers, Google’s data interchange format. Specifically, the services use version 3 release (aka proto3) of Protocol Buffers. With gRPC, a gRPC client calls a gRPC server. Some of the platform’s services are gRPC servers, others are gRPC clients, while some act as both client and server, such as Service A, B, and E. The revised architecture is shown below.

Golang-Service-Diagram-with-gRPC

gRPC Gateway

Assuming for the sake of this demonstration, that most consumers of the API would still expect to communicate using a RESTful JSON over HTTP API, I have added a gRPC Gateway reverse proxy to the platform. The gRPC Gateway is a gRPC to JSON reverse proxy, a common architectural pattern, which proxies communications between the JSON over HTTP-based clients and the gRPC-based microservices. A diagram from the grpc-gateway GitHub project site effectively demonstrates how the reverse proxy works.

grpc_gateway.png

Image courtesy: https://github.com/grpc-ecosystem/grpc-gateway

In the revised platform architecture diagram above, note the addition of the reverse proxy, which replaces Service A at the edge of the API. The proxy sits between the Angular-based Web UI and Service A. Also, note the communication method between services is now Protobuf over gRPC instead of JSON over HTTP. The use of Envoy Proxy (via Istio) is unchanged, as is the MongoDB Atlas-based databases and CloudAMQP RabbitMQ-based queue, which are still external to the Kubernetes cluster.

Alternatives to gRPC Gateway

As an alternative to the gRPC Gateway reverse proxy, we could convert the TypeScript-based Angular UI client to gRPC and Protocol Buffers, and continue to communicate directly with Service A as the edge service. However, this would limit other consumers of the API to rely on gRPC as opposed to JSON over HTTP, unless we also chose to expose two different endpoints, gRPC, and JSON over HTTP, another common pattern.

Demonstration

In this post’s demonstration, we will repeat the exact same installation process, outlined in the previous post, Kubernetes-based Microservice Observability with Istio Service Mesh. We will deploy the revised gRPC-based platform to GKE on GCP. You could just as easily follow Azure Kubernetes Service (AKS) Observability with Istio Service Mesh, and deploy the platform to AKS.

Source Code

All source code for this post is available on GitHub, contained in three projects. The Go-based microservices source code, all Kubernetes resources, and all deployment scripts are located in the k8s-istio-observe-backend project repository, in the new grpc branch.

git clone \
  --branch grpc --single-branch --depth 1 --no-tags \
  https://github.com/garystafford/k8s-istio-observe-backend.git

The Angular-based web client source code is located in the k8s-istio-observe-frontend repository on the new grpc branch. The source protocol buffers .proto file and the generated code, using the protocol buffers compiler, is located in the new pb-greeting project repository. You do not need to clone either of these projects for this post’s demonstration.

All Docker images for the services, UI, and the reverse proxy are located on Docker Hub.

Code Changes

This post is not specifically about writing Go for gRPC and Protobuf. However, to better understand the observability requirements and capabilities of these technologies, compared to JSON over HTTP, it is helpful to review some of the source code.

Service A

First, compare the source code for Service A, shown below, to the original code in the previous post. The service’s code is almost completely re-written. I relied on several references for writing the code, including, Tracing gRPC with Istio, written by Neeraj Poddar of Aspen Mesh and Distributed Tracing Infrastructure with Jaeger on Kubernetes, by Masroor Hasan.

Specifically, note the following code changes to Service A:

  • Import of the pb-greeting protobuf package;
  • Local Greeting struct replaced with pb.Greeting struct;
  • All services are now hosted on port 50051;
  • The HTTP server and all API resource handler functions are removed;
  • Headers, used for distributed tracing with Jaeger, have moved from HTTP request object to metadata passed in the gRPC context object;
  • Service A is coded as a gRPC server, which is called by the gRPC Gateway reverse proxy (gRPC client) via the Greeting function;
  • The primary PingHandler function, which returns the service’s Greeting, is replaced by the pb-greeting protobuf package’s Greeting function;
  • Service A is coded as a gRPC client, calling both Service B and Service C using the CallGrpcService function;
  • CORS handling is offloaded to Istio;
  • Logging methods are unchanged;

Source code for revised gRPC-based Service A (gist):


// author: Gary A. Stafford
// site: https://programmaticponderings.com
// license: MIT License
// purpose: Service A – gRPC/Protobuf
package main
import (
"context"
"github.com/banzaicloud/logrus-runtime-formatter"
"github.com/google/uuid"
"github.com/grpc-ecosystem/go-grpc-middleware/tracing/opentracing"
ot "github.com/opentracing/opentracing-go"
log "github.com/sirupsen/logrus"
"google.golang.org/grpc"
"google.golang.org/grpc/metadata"
"net"
"os"
"time"
pb "github.com/garystafford/pb-greeting"
)
const (
port = ":50051"
)
type greetingServiceServer struct {
}
var (
greetings []*pb.Greeting
)
func (s *greetingServiceServer) Greeting(ctx context.Context, req *pb.GreetingRequest) (*pb.GreetingResponse, error) {
greetings = nil
tmpGreeting := pb.Greeting{
Id: uuid.New().String(),
Service: "Service-A",
Message: "Hello, from Service-A!",
Created: time.Now().Local().String(),
}
greetings = append(greetings, &tmpGreeting)
CallGrpcService(ctx, "service-b:50051")
CallGrpcService(ctx, "service-c:50051")
return &pb.GreetingResponse{
Greeting: greetings,
}, nil
}
func CallGrpcService(ctx context.Context, address string) {
conn, err := createGRPCConn(ctx, address)
if err != nil {
log.Fatalf("did not connect: %v", err)
}
defer conn.Close()
headersIn, _ := metadata.FromIncomingContext(ctx)
log.Infof("headersIn: %s", headersIn)
client := pb.NewGreetingServiceClient(conn)
ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
ctx = metadata.NewOutgoingContext(context.Background(), headersIn)
defer cancel()
req := pb.GreetingRequest{}
greeting, err := client.Greeting(ctx, &req)
log.Info(greeting.GetGreeting())
if err != nil {
log.Fatalf("did not connect: %v", err)
}
for _, greeting := range greeting.GetGreeting() {
greetings = append(greetings, greeting)
}
}
func createGRPCConn(ctx context.Context, addr string) (*grpc.ClientConn, error) {
//https://aspenmesh.io/2018/04/tracing-grpc-with-istio/
var opts []grpc.DialOption
opts = append(opts, grpc.WithStreamInterceptor(
grpc_opentracing.StreamClientInterceptor(
grpc_opentracing.WithTracer(ot.GlobalTracer()))))
opts = append(opts, grpc.WithUnaryInterceptor(
grpc_opentracing.UnaryClientInterceptor(
grpc_opentracing.WithTracer(ot.GlobalTracer()))))
opts = append(opts, grpc.WithInsecure())
conn, err := grpc.DialContext(ctx, addr, opts)
if err != nil {
log.Fatalf("Failed to connect to application addr: ", err)
return nil, err
}
return conn, nil
}
func getEnv(key, fallback string) string {
if value, ok := os.LookupEnv(key); ok {
return value
}
return fallback
}
func init() {
formatter := runtime.Formatter{ChildFormatter: &log.JSONFormatter{}}
formatter.Line = true
log.SetFormatter(&formatter)
log.SetOutput(os.Stdout)
level, err := log.ParseLevel(getEnv("LOG_LEVEL", "info"))
if err != nil {
log.Error(err)
}
log.SetLevel(level)
}
func main() {
lis, err := net.Listen("tcp", port)
if err != nil {
log.Fatalf("failed to listen: %v", err)
}
s := grpc.NewServer()
pb.RegisterGreetingServiceServer(s, &greetingServiceServer{})
log.Fatal(s.Serve(lis))
}

view raw

main.go

hosted with ❤ by GitHub

Greeting Protocol Buffers

Shown below is the greeting source protocol buffers .proto file. The greeting response struct, originally defined in the services, remains largely unchanged (gist). The UI client responses will look identical.


syntax = "proto3";
package greeting;
import "google/api/annotations.proto";
message Greeting {
string id = 1;
string service = 2;
string message = 3;
string created = 4;
}
message GreetingRequest {
}
message GreetingResponse {
repeated Greeting greeting = 1;
}
service GreetingService {
rpc Greeting (GreetingRequest) returns (GreetingResponse) {
option (google.api.http) = {
get: "/api/v1/greeting"
};
}
}

view raw

greeting.proto

hosted with ❤ by GitHub

When compiled with protoc,  the Go-based protocol compiler plugin, the original 27 lines of source code swells to almost 270 lines of generated data access classes that are easier to use programmatically.

# Generate gRPC stub (.pb.go)
protoc -I /usr/local/include -I. \
  -I ${GOPATH}/src \
  -I ${GOPATH}/src/github.com/grpc-ecosystem/grpc-gateway/third_party/googleapis \
  --go_out=plugins=grpc:. \
  greeting.proto

# Generate reverse-proxy (.pb.gw.go)
protoc -I /usr/local/include -I. \
  -I ${GOPATH}/src \
  -I ${GOPATH}/src/github.com/grpc-ecosystem/grpc-gateway/third_party/googleapis \
  --grpc-gateway_out=logtostderr=true:. \
  greeting.proto

# Generate swagger definitions (.swagger.json)
protoc -I /usr/local/include -I. \
  -I ${GOPATH}/src \
  -I ${GOPATH}/src/github.com/grpc-ecosystem/grpc-gateway/third_party/googleapis \
  --swagger_out=logtostderr=true:. \
  greeting.proto

Below is a small snippet of that compiled code, for reference. The compiled code is included in the pb-greeting project on GitHub and imported into each microservice and the reverse proxy (gist). We also compile a separate version for the reverse proxy to implement.


// Code generated by protoc-gen-go. DO NOT EDIT.
// source: greeting.proto
package greeting
import (
context "context"
fmt "fmt"
proto "github.com/golang/protobuf/proto"
_ "google.golang.org/genproto/googleapis/api/annotations"
grpc "google.golang.org/grpc"
codes "google.golang.org/grpc/codes"
status "google.golang.org/grpc/status"
math "math"
)
// Reference imports to suppress errors if they are not otherwise used.
var _ = proto.Marshal
var _ = fmt.Errorf
var _ = math.Inf
// This is a compile-time assertion to ensure that this generated file
// is compatible with the proto package it is being compiled against.
// A compilation error at this line likely means your copy of the
// proto package needs to be updated.
const _ = proto.ProtoPackageIsVersion3 // please upgrade the proto package
type Greeting struct {
Id string `protobuf:"bytes,1,opt,name=id,proto3" json:"id,omitempty"`
Service string `protobuf:"bytes,2,opt,name=service,proto3" json:"service,omitempty"`
Message string `protobuf:"bytes,3,opt,name=message,proto3" json:"message,omitempty"`
Created string `protobuf:"bytes,4,opt,name=created,proto3" json:"created,omitempty"`
XXX_NoUnkeyedLiteral struct{} `json:"-"`
XXX_unrecognized []byte `json:"-"`
XXX_sizecache int32 `json:"-"`
}
func (m *Greeting) Reset() { *m = Greeting{} }
func (m *Greeting) String() string { return proto.CompactTextString(m) }
func (*Greeting) ProtoMessage() {}
func (*Greeting) Descriptor() ([]byte, []int) {
return fileDescriptor_6acac03ccd168a87, []int{0}
}
func (m *Greeting) XXX_Unmarshal(b []byte) error {
return xxx_messageInfo_Greeting.Unmarshal(m, b)
}
func (m *Greeting) XXX_Marshal(b []byte, deterministic bool) ([]byte, error) {
return xxx_messageInfo_Greeting.Marshal(b, m, deterministic)

view raw

greeting.pb.go

hosted with ❤ by GitHub

Using Swagger, we can view the greeting protocol buffers’ single RESTful API resource, exposed with an HTTP GET method. I use the Docker-based version of Swagger UI for viewing protoc generated swagger definitions.

docker run -p 8080:8080 -d --name swagger-ui \
  -e SWAGGER_JSON=/tmp/greeting.swagger.json \
  -v ${GOAPTH}/src/pb-greeting:/tmp swaggerapi/swagger-ui

The Angular UI makes an HTTP GET request to the /api/v1/greeting resource, which is transformed to gRPC and proxied to Service A, where it is handled by the Greeting function.

screen_shot_2019-04-15_at_9_05_23_pm

gRPC Gateway Reverse Proxy

As explained earlier, the gRPC Gateway reverse proxy service is completely new. Specifically, note the following code features in the gist below:

  • Import of the pb-greeting protobuf package;
  • The proxy is hosted on port 80;
  • Request headers, used for distributed tracing with Jaeger, are collected from the incoming HTTP request and passed to Service A in the gRPC context;
  • The proxy is coded as a gRPC client, which calls Service A;
  • Logging is largely unchanged;

The source code for the Reverse Proxy (gist):


// author: Gary A. Stafford
// site: https://programmaticponderings.com
// license: MIT License
// purpose: gRPC Gateway / Reverse Proxy
// reference: https://github.com/grpc-ecosystem/grpc-gateway
package main
import (
"context"
"flag"
lrf "github.com/banzaicloud/logrus-runtime-formatter"
gw "github.com/garystafford/pb-greeting"
"github.com/grpc-ecosystem/grpc-gateway/runtime"
log "github.com/sirupsen/logrus"
"google.golang.org/grpc"
"google.golang.org/grpc/metadata"
"net/http"
"os"
)
func injectHeadersIntoMetadata(ctx context.Context, req *http.Request) metadata.MD {
//https://aspenmesh.io/2018/04/tracing-grpc-with-istio/
var (
otHeaders = []string{
"x-request-id",
"x-b3-traceid",
"x-b3-spanid",
"x-b3-parentspanid",
"x-b3-sampled",
"x-b3-flags",
"x-ot-span-context"}
)
var pairs []string
for _, h := range otHeaders {
if v := req.Header.Get(h); len(v) > 0 {
pairs = append(pairs, h, v)
}
}
return metadata.Pairs(pairs)
}
type annotator func(context.Context, *http.Request) metadata.MD
func chainGrpcAnnotators(annotators annotator) annotator {
return func(c context.Context, r *http.Request) metadata.MD {
var mds []metadata.MD
for _, a := range annotators {
mds = append(mds, a(c, r))
}
return metadata.Join(mds)
}
}
func run() error {
ctx := context.Background()
ctx, cancel := context.WithCancel(ctx)
defer cancel()
annotators := []annotator{injectHeadersIntoMetadata}
mux := runtime.NewServeMux(
runtime.WithMetadata(chainGrpcAnnotators(annotators)),
)
opts := []grpc.DialOption{grpc.WithInsecure()}
err := gw.RegisterGreetingServiceHandlerFromEndpoint(ctx, mux, "service-a:50051", opts)
if err != nil {
return err
}
return http.ListenAndServe(":80", mux)
}
func getEnv(key, fallback string) string {
if value, ok := os.LookupEnv(key); ok {
return value
}
return fallback
}
func init() {
formatter := lrf.Formatter{ChildFormatter: &log.JSONFormatter{}}
formatter.Line = true
log.SetFormatter(&formatter)
log.SetOutput(os.Stdout)
level, err := log.ParseLevel(getEnv("LOG_LEVEL", "info"))
if err != nil {
log.Error(err)
}
log.SetLevel(level)
}
func main() {
flag.Parse()
if err := run(); err != nil {
log.Fatal(err)
}
}

view raw

main.go

hosted with ❤ by GitHub

Below, in the Stackdriver logs, we see an example of a set of HTTP request headers in the JSON payload, which are propagated upstream to gRPC-based Go services from the gRPC Gateway’s reverse proxy. Header propagation ensures the request produces a complete distributed trace across the complete service call chain.

screen_shot_2019-04-15_at_11_10_50_pm

Istio VirtualService and CORS

According to feedback in the project’s GitHub Issues, the gRPC Gateway does not directly support Cross-Origin Resource Sharing (CORS) policy. In my own experience, the gRPC Gateway cannot handle OPTIONS HTTP method requests, which must be issued by the Angular 7 web UI. Therefore, I have offloaded CORS responsibility to Istio, using the VirtualService resource’s CorsPolicy configuration. This makes CORS much easier to manage than coding CORS configuration into service code (gist):


apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
name: service-rev-proxy
spec:
hosts:
api.dev.example-api.com
gateways:
demo-gateway
http:
match:
uri:
prefix: /
route:
destination:
port:
number: 80
host: service-rev-proxy.dev.svc.cluster.local
weight: 100
corsPolicy:
allowOrigin:
"*"
allowMethods:
OPTIONS
GET
allowCredentials: true
allowHeaders:
"*"

Set-up and Installation

To deploy the microservices platform to GKE, follow the detailed instructions in part one of the post, Kubernetes-based Microservice Observability with Istio Service Mesh: Part 1, or Azure Kubernetes Service (AKS) Observability with Istio Service Mesh for AKS.

  1. Create the external MongoDB Atlas database and CloudAMQP RabbitMQ clusters;
  2. Modify the Kubernetes resource files and bash scripts for your own environments;
  3. Create the managed GKE or AKS cluster on GCP or Azure;
  4. Configure and deploy Istio to the managed Kubernetes cluster, using Helm;
  5. Create DNS records for the platform’s exposed resources;
  6. Deploy the Go-based microservices, gRPC Gateway reverse proxy, Angular UI, and associated resources to Kubernetes cluster;
  7. Test and troubleshoot the platform deployment;
  8. Observe the results;

The Three Pillars

As introduced in the first post, logs, metrics, and traces are often known as the three pillars of observability. These are the external outputs of the system, which we may observe. As modern distributed systems grow ever more complex, the ability to observe those systems demands equally modern tooling that was designed with this level of complexity in mind. Traditional logging and monitoring systems often struggle with today’s hybrid and multi-cloud, polyglot language-based, event-driven, container-based and serverless, infinitely-scalable, ephemeral-compute platforms.

Tools like Istio Service Mesh attempt to solve the observability challenge by offering native integrations with several best-of-breed, open-source telemetry tools. Istio’s integrations include Jaeger for distributed tracing, Kiali for Istio service mesh-based microservice visualization and monitoring, and Prometheus and Grafana for metric collection, monitoring, and alerting. Combined with cloud platform-native monitoring and logging services, such as Stackdriver for GKE, CloudWatch for Amazon’s EKS, or Azure Monitor logs for AKS, and we have a complete observability solution for modern, distributed, Cloud-based applications.

Pillar 1: Logging

Moving from JSON over HTTP to gRPC does not require any changes to the logging configuration of the Go-based service code or Kubernetes resources.

Stackdriver with Logrus

As detailed in part two of the last post, Kubernetes-based Microservice Observability with Istio Service Mesh, our logging strategy for the eight Go-based microservices and the reverse proxy continues to be the use of Logrus, the popular structured logger for Go, and Banzai Cloud’s logrus-runtime-formatter.

If you recall, the Banzai formatter automatically tags log messages with runtime/stack information, including function name and line number; extremely helpful when troubleshooting. We are also using Logrus’ JSON formatter. Below, in the Stackdriver console, note how each log entry below has the JSON payload contained within the message with the log level, function name, lines on which the log entry originated, and the message.

screen_shot_2019-04-15_at_11_10_36_pm

Below, we see the details of a specific log entry’s JSON payload. In this case, we can see the request headers propagated from the downstream service.

screen_shot_2019-04-15_at_11_10_50_pm

Pillar 2: Metrics

Moving from JSON over HTTP to gRPC does not require any changes to the metrics configuration of the Go-based service code or Kubernetes resources.

Prometheus

Prometheus is a completely open source and community-driven systems monitoring and alerting toolkit originally built at SoundCloud, circa 2012. Interestingly, Prometheus joined the Cloud Native Computing Foundation (CNCF) in 2016 as the second hosted-project, after Kubernetes.

screen_shot_2019-04-15_at_11_04_54_pm

Grafana

Grafana describes itself as the leading open source software for time series analytics. According to Grafana Labs, Grafana allows you to query, visualize, alert on, and understand your metrics no matter where they are stored. You can easily create, explore, and share visually-rich, data-driven dashboards. Grafana allows users to visually define alert rules for your most important metrics. Grafana will continuously evaluate rules and can send notifications.

According to Istio, the Grafana add-on is a pre-configured instance of Grafana. The Grafana Docker base image has been modified to start with both a Prometheus data source and the Istio Dashboard installed. Below, we see two of the pre-configured dashboards, the Istio Mesh Dashboard and the Istio Performance Dashboard.

screen_shot_2019-04-15_at_10_45_38_pm

screen_shot_2019-04-15_at_10_46_03_pm

Pillar 3: Traces

Moving from JSON over HTTP to gRPC did require a complete re-write of the tracing logic in the service code. In fact, I spent the majority of my time ensuring the correct headers were propagated from the Istio Ingress Gateway to the gRPC Gateway reverse proxy, to Service A in the gRPC context, and upstream to all the dependent, gRPC-based services. I am sure there are a number of optimization in my current code, regarding the correct handling of traces and how this information is propagated across the service call stack.

Jaeger

According to their website, Jaeger, inspired by Dapper and OpenZipkin, is a distributed tracing system released as open source by Uber Technologies. It is used for monitoring and troubleshooting microservices-based distributed systems, including distributed context propagation, distributed transaction monitoring, root cause analysis, service dependency analysis, and performance and latency optimization. The Jaeger website contains an excellent overview of Jaeger’s architecture and general tracing-related terminology.

Below we see the Jaeger UI Traces View. In it, we see a series of traces generated by hey, a modern load generator and benchmarking tool, and a worthy replacement for Apache Bench (ab). Unlike abhey supports HTTP/2. The use of hey was detailed in the previous post.

screen_shot_2019-04-18_at_6_08_21_pm

A trace, as you might recall, is an execution path through the system and can be thought of as a directed acyclic graph (DAG) of spans. If you have worked with systems like Apache Spark, you are probably already familiar with DAGs.

screen_shot_2019-04-15_at_11_06_13_pm

Below we see the Jaeger UI Trace Detail View. The example trace contains 16 spans, which encompasses nine components – seven of the eight Go-based services, the reverse proxy, and the Istio Ingress Gateway. The trace and the spans each have timings. The root span in the trace is the Istio Ingress Gateway. In this demo, traces do not span the RabbitMQ message queues. This means you would not see a trace which includes the decoupled, message-based communications between Service D to Service F, via the RabbitMQ.

screen_shot_2019-04-15_at_11_08_07_pm

Within the Jaeger UI Trace Detail View, you also have the ability to drill into a single span, which contains additional metadata. Metadata includes the URL being called, HTTP method, response status, and several other headers.

screen_shot_2019-04-15_at_11_08_22_pm

Microservice Observability

Moving from JSON over HTTP to gRPC does not require any changes to the Kiali configuration of the Go-based service code or Kubernetes resources.

Kiali

According to their website, Kiali provides answers to the questions: What are the microservices in my Istio service mesh, and how are they connected? Kiali works with Istio, in OpenShift or Kubernetes, to visualize the service mesh topology, to provide visibility into features like circuit breakers, request rates and more. It offers insights about the mesh components at different levels, from abstract Applications to Services and Workloads.

The Graph View in the Kiali UI is a visual representation of the components running in the Istio service mesh. Below, filtering on the cluster’s dev Namespace, we should observe that Kiali has mapped all components in the platform, along with rich metadata, such as their version and communication protocols.

screen_shot_2019-04-18_at_6_03_38_pm

Using Kiali, we can confirm our service-to-service IPC protocol is now gRPC instead of the previous HTTP.

screen_shot_2019-04-14_at_11_15_49_am

Conclusion

Although converting from JSON over HTTP to protocol buffers with gRPC required major code changes to the services, it did not impact the high-level observability we have of those services using the tools provided by Istio, including Prometheus, Grafana, Jaeger, and Kiali.

All opinions expressed in this post are my own and not necessarily the views of my current or past employers or their clients.

, , , , , , , , , , , , , , , , ,

2 Comments

Azure Kubernetes Service (AKS) Observability with Istio Service Mesh

In the last two-part post, Kubernetes-based Microservice Observability with Istio Service Mesh, we deployed Istio, along with its observability tools, Prometheus, Grafana, Jaeger, and Kiali, to Google Kubernetes Engine (GKE). Following that post, I received several questions about using Istio’s observability tools with other popular managed Kubernetes platforms, primarily Azure Kubernetes Service (AKS). In most cases, including with AKS, both Istio and the observability tools are compatible.

In this short follow-up of the last post, we will replace the GKE-specific cluster setup commands, found in part one of the last post, with new commands to provision a similar AKS cluster on Azure. The new AKS cluster will run Istio 1.1.3, released 4/15/2019, alongside the latest available version of AKS (Kubernetes), 1.12.6. We will replace Google’s Stackdriver logging with Azure Monitor logs. We will retain the external MongoDB Atlas cluster and the external CloudAMQP cluster dependencies.

Previous articles about AKS include First Impressions of AKS, Azure’s New Managed Kubernetes Container Service (November 2017) and Architecting Cloud-Optimized Apps with AKS (Azure’s Managed Kubernetes), Azure Service Bus, and Cosmos DB (December 2017).

Source Code

All source code for this post is available on GitHub in two projects. The Go-based microservices source code, all Kubernetes resources, and all deployment scripts are located in the k8s-istio-observe-backend project repository.

git clone \
  --branch master --single-branch \
  --depth 1 --no-tags \
  https://github.com/garystafford/k8s-istio-observe-backend.git

The Angular UI TypeScript-based source code is located in the k8s-istio-observe-frontend repository. You will not need to clone the Angular UI project for this post’s demonstration.

Setup

This post assumes you have a Microsoft Azure account with the necessary resource providers registered, and the Azure Command-Line Interface (CLI), az, installed and available to your command shell. You will also need Helm and Istio 1.1.3 installed and configured, which is covered in the last post.

screen_shot_2019-03-27_at_1_35_46_pm

Start by logging into Azure from your command shell.

az login \
  --username {{ your_username_here }} \
  --password {{ your_password_here }}

Resource Providers

If you are new to Azure or AKS, you may need to register some additional resource providers to complete this demonstration.

az provider list --output table

screen_shot_2019-03-27_at_5_37_46_pm

If you are missing required resource providers, you will see errors similar to the one shown below. Simply activate the particular provider corresponding to the error.

Operation failed with status:'Bad Request'. 
Details: Required resource provider registrations 
Microsoft.Compute, Microsoft.Network are missing.

To register the necessary providers, use the Azure CLI or the Azure Portal UI.

az provider register --namespace Microsoft.ContainerService
az provider register --namespace Microsoft.Network
az provider register --namespace Microsoft.Compute

Resource Group

AKS requires an Azure Resource Group. According to Azure, a resource group is a container that holds related resources for an Azure solution. The resource group includes those resources that you want to manage as a group. I chose to create a new resource group associated with my closest geographic Azure Region, East US, using the Azure CLI.

az group create \
  --resource-group aks-observability-demo \
  --location eastus

screen_shot_2019-03-26_at_6_54_39_pm

Create the AKS Cluster

Before creating the GKE cluster, check for the latest versions of AKS. At the time of this post, the latest versions of AKS was 1.12.6.

az aks get-versions \
  --location eastus \
  --output table

screen_shot_2019-03-26_at_6_56_38_pm

Using the latest GKE version, create the GKE managed cluster. There are many configuration options available with the az aks create command. For this post, I am creating three worker nodes using the Azure Standard_DS3_v2 VM type, which will give us a total of 12 vCPUs and 42 GB of memory. Anything smaller and all the Pods may not be schedulable. Instead of supplying an existing SSH key, I will let Azure create a new one. You should have no need to SSH into the worker nodes. I am also enabling the monitoring add-on. According to Azure, the add-on sets up Azure Monitor for containers, announced in December 2018, which monitors the performance of workloads deployed to Kubernetes environments hosted on AKS.

time az aks create \
  --name aks-observability-demo \
  --resource-group aks-observability-demo \
  --node-count 3 \
  --node-vm-size Standard_DS3_v2 \
  --enable-addons monitoring \
  --generate-ssh-keys \
  --kubernetes-version 1.12.6

Using the time command, we observe that the cluster took approximately 5m48s to provision; I have seen times up to almost 10 minutes. AKS provisioning is not as fast as GKE, which in my experience is at least 2x-3x faster than AKS for a similarly sized cluster.

screen_shot_2019-03-26_at_7_03_49_pm

After the cluster creation completes, retrieve your AKS cluster credentials.

az aks get-credentials \
  --name aks-observability-demo \
  --resource-group aks-observability-demo \
  --overwrite-existing

Examine the Cluster

Use the following command to confirm the cluster is ready by examining the status of three worker nodes.

kubectl get nodes --output=wide

screen_shot_2019-03-27_at_6_06_10_pm.png

Observe that Azure currently uses Ubuntu 16.04.5 LTS for the worker node’s host operating system. If you recall, GKE offers both Ubuntu as well as a Container-Optimized OS from Google.

Kubernetes Dashboard

Unlike GKE, there is no custom AKS dashboard. Therefore, we will use the Kubernetes Web UI (dashboard), which is installed by default with AKS, unlike GKE. According to Azure, to make full use of the dashboard, since the AKS cluster uses RBAC, a ClusterRoleBinding must be created before you can correctly access the dashboard.

kubectl create clusterrolebinding kubernetes-dashboard \
  --clusterrole=cluster-admin \
  --serviceaccount=kube-system:kubernetes-dashboard

Next, we must create a proxy tunnel on local port 8001 to the dashboard running on the AKS cluster. This CLI command creates a proxy between your local system and the Kubernetes API and opens your web browser to the Kubernetes dashboard.

az aks browse \
  --name aks-observability-demo \
  --resource-group aks-observability-demo

screen_shot_2019-03-26_at_7_08_54_pm

Although you should use the Azure CLI, PowerShell, or SDK for all your AKS configuration tasks, using the dashboard for monitoring the cluster and the resources running on it, is invaluable.

screen_shot_2019-03-26_at_7_06_57_pm

The Kubernetes dashboard also provides access to raw container logs. Azure Monitor provides the ability to construct complex log queries, but for quick troubleshooting, you may just want to see the raw logs a specific container is outputting, from the dashboard.

screen_shot_2019-03-29_at_9_23_57_pm

Azure Portal

Logging into the Azure Portal, we can observe the AKS cluster, within the new Resource Group.

screen_shot_2019-03-26_at_7_08_25_pm

In addition to the Azure Resource Group we created, there will be a second Resource Group created automatically during the creation of the AKS cluster. This group contains all the resources that compose the AKS cluster. These resources include the three worker node VM instances, and their corresponding storage disks and NICs. The group also includes a network security group, route table, virtual network, and an availability set.

screen_shot_2019-03-26_at_7_08_04_pm

Deploy Istio

From this point on, the process to deploy Istio Service Mesh and the Go-based microservices platform follows the previous post and use the exact same scripts. After modifying the Kubernetes resource files, to deploy Istio, use the bash script, part4_install_istio.sh. I have added a few more pauses in the script to account for the apparently slower response times from AKS as opposed to GKE. It definitely takes longer to spin up the Istio resources on AKS than on GKE, which can result in errors if you do not pause between each stage of the deployment process.

screen_shot_2019-03-26_at_7_11_44_pm

screen_shot_2019-03-26_at_7_18_26_pm

Using the Kubernetes dashboard, we can view the Istio resources running in the istio-system Namespace, as shown below. Confirm that all resource Pods are running and healthy before deploying the Go-based microservices platform.

screen_shot_2019-03-26_at_7_16_50_pm

Deploy the Platform

Deploy the Go-based microservices platform, using bash deploy script, part5a_deploy_resources.sh.

screen_shot_2019-03-26_at_7_20_05_pm

The script deploys two replicas (Pods) of each of the eight microservices, Service-A through Service-H, and the Angular UI, to the dev and test Namespaces, for a total of 36 Pods. Each Pod will have the Istio sidecar proxy (Envoy Proxy) injected into it, alongside the microservice or UI.

screen_shot_2019-03-26_at_7_21_24_pm

Azure Load Balancer

If we return to the Resource Group created automatically when the AKS cluster was created, we will now see two additional resources. There is now an Azure Load Balancer and Public IP Address.

screen_shot_2019-03-26_at_7_21_56_pm

Similar to the GKE cluster in the last post, when the Istio Ingress Gateway is deployed as part of the platform, it is materialized as an Azure Load Balancer. The front-end of the load balancer is the new public IP address. The back-end of the load-balancer is a pool containing the three AKS worker node VMs. The load balancer is associated with a set of rules and health probes.

screen_shot_2019-03-26_at_7_22_51_pm

DNS

I have associated the new Azure public IP address, connected with the front-end of the load balancer, with the four subdomains I am using to represent the UI and the edge service, Service-A, in both Namespaces. If Azure is your primary Cloud provider, then Azure DNS is a good choice to manage your domain’s DNS records. For this demo, you will require your own domain.

screen_shot_2019-03-28_at_9_43_42_pm

Testing the Platform

With everything deployed, test the platform is responding and generate HTTP traffic for the observability tools to record. Similar to last time, I have chosen hey, a modern load generator and benchmarking tool, and a worthy replacement for Apache Bench (ab). Unlike ab, hey supports HTTP/2. Below, I am running hey directly from Azure Cloud Shell. The tool is simulating 10 concurrent users, generating a total of 500 HTTP GET requests to Service A.

# quick setup from Azure Shell using Bash
go get -u github.com/rakyll/hey
cd go/src/github.com/rakyll/hey/
go build
  
./hey -n 500 -c 10 -h2 http://api.dev.example-api.com/api/ping

We had 100% success with all 500 calls resulting in an HTTP 200 OK success status response code. Based on the results, we can observe the platform was capable of approximately 4 requests/second, with an average response time of 2.48 seconds and a mean time of 2.80 seconds. Almost all of that time was the result of waiting for the response, as the details indicate.

screen_shot_2019-03-26_at_7_57_03_pm

Logging

In this post, we have replaced GCP’s Stackdriver logging with Azure Monitor logs. According to Microsoft, Azure Monitor maximizes the availability and performance of applications by delivering a comprehensive solution for collecting, analyzing, and acting on telemetry from Cloud and on-premises environments. In my opinion, Stackdriver is a superior solution for searching and correlating the logs of distributed applications running on Kubernetes. I find the interface and query language of Stackdriver easier and more intuitive than Azure Monitor, which although powerful, requires substantial query knowledge to obtain meaningful results. For example, here is a query to view the log entries from the services in the dev Namespace, within the last day.

let startTimestamp = ago(1d);
KubePodInventory
| where TimeGenerated > startTimestamp
| where ClusterName =~ "aks-observability-demo"
| where Namespace == "dev"
| where Name contains "service-"
| distinct ContainerID
| join
(
    ContainerLog
    | where TimeGenerated > startTimestamp
)
on ContainerID
| project LogEntrySource, LogEntry, TimeGenerated, Name
| order by TimeGenerated desc
| render table

Below, we see the Logs interface with the search query and log entry results.

screen_shot_2019-03-29_at_9_13_37_pm

Below, we see a detailed view of a single log entry from Service A.

screen_shot_2019-03-29_at_9_18_12_pm

Observability Tools

The previous post goes into greater detail on the features of each of the observability tools provided by Istio, including Prometheus, Grafana, Jaeger, and Kiali.

We can use the exact same kubectl port-forward commands to connect to the tools on AKS as we did on GKE. According to Google, Kubernetes port forwarding allows using a resource name, such as a service name, to select a matching pod to port forward to since Kubernetes v1.10. We forward a local port to a port on the tool’s pod.

# Grafana
kubectl port-forward -n istio-system \
  $(kubectl get pod -n istio-system -l app=grafana \
  -o jsonpath='{.items[0].metadata.name}') 3000:3000 &
  
# Prometheus
kubectl -n istio-system port-forward \
  $(kubectl -n istio-system get pod -l app=prometheus \
  -o jsonpath='{.items[0].metadata.name}') 9090:9090 &
  
# Jaeger
kubectl port-forward -n istio-system \
$(kubectl get pod -n istio-system -l app=jaeger \
-o jsonpath='{.items[0].metadata.name}') 16686:16686 &
  
# Kiali
kubectl -n istio-system port-forward \
  $(kubectl -n istio-system get pod -l app=kiali \
  -o jsonpath='{.items[0].metadata.name}') 20001:20001 &

screen_shot_2019-03-26_at_8_04_24_pm

Prometheus and Grafana

Prometheus is a completely open source and community-driven systems monitoring and alerting toolkit originally built at SoundCloud, circa 2012. Interestingly, Prometheus joined the Cloud Native Computing Foundation (CNCF) in 2016 as the second hosted-project, after Kubernetes.

Grafana describes itself as the leading open source software for time series analytics. According to Grafana Labs, Grafana allows you to query, visualize, alert on, and understand your metrics no matter where they are stored. You can easily create, explore, and share visually-rich, data-driven dashboards. Grafana also users to visually define alert rules for your most important metrics. Grafana will continuously evaluate rules and can send notifications.

According to Istio, the Grafana add-on is a pre-configured instance of Grafana. The Grafana Docker base image has been modified to start with both a Prometheus data source and the Istio Dashboard installed. Below, we see one of the pre-configured dashboards, the Istio Service Dashboard.

screen_shot_2019-03-26_at_8_16_52_pm

Jaeger

According to their website, Jaeger, inspired by Dapper and OpenZipkin, is a distributed tracing system released as open source by Uber Technologies. It is used for monitoring and troubleshooting microservices-based distributed systems, including distributed context propagation, distributed transaction monitoring, root cause analysis, service dependency analysis, and performance and latency optimization. The Jaeger website contains a good overview of Jaeger’s architecture and general tracing-related terminology.

screen_shot_2019-03-26_at_8_03_31_pm

Below, we see a typical, distributed trace of the services, starting ingress gateway and passing across the upstream service dependencies.

screen_shot_2019-03-26_at_8_03_45_pm

Kaili

According to their website, Kiali provides answers to the questions: What are the microservices in my Istio service mesh, and how are they connected? Kiali works with Istio, in OpenShift or Kubernetes, to visualize the service mesh topology, to provide visibility into features like circuit breakers, request rates and more. It offers insights about the mesh components at different levels, from abstract Applications to Services and Workloads.

There is a common Kubernetes Secret that controls access to the Kiali API and UI. The default login is admin, the password is 1f2d1e2e67df.

screen_shot_2019-03-26_at_7_59_17_pm

Below, we see a detailed view of our platform, running in the dev namespace, on AKS.

screen_shot_2019-03-26_at_8_02_38_pm

Delete AKS Cluster

Once you are finished with this demo, use the following two commands to tear down the AKS cluster and remove the cluster context from your local configuration.

time az aks delete \
  --name aks-observability-demo \
  --resource-group aks-observability-demo \
  --yes

kubectl config delete-context aks-observability-demo

Conclusion

In this brief, follow-up post, we have explored how the current set of observability tools, part of the latest version of Istio Service Mesh, integrates with Azure Kubernetes Service (AKS).

All opinions expressed in this post are my own and not necessarily the views of my current or past employers or their clients.

, , , , , , , , , , , , , ,

1 Comment

Kubernetes-based Microservice Observability with Istio Service Mesh on Google Kubernetes Engine (GKE): Part 2

In this two-part post, we are exploring the set of observability tools that are part of the latest version of Istio Service Mesh. These tools include Prometheus and Grafana for metric collection, monitoring, and alerting, Jaeger for distributed tracing, and Kiali for Istio service-mesh-based microservice visualization. Combined with cloud platform-native monitoring and logging services, such as Stackdriver for Google Kubernetes Engine (GKE) on Google Cloud Platform (GCP), we have a complete observability solution for modern, distributed applications.

Reference Platform

To demonstrate Istio’s observability tools, in part one of the post, we deployed a reference microservices platform, written in Go, to GKE on GCP. The platform is comprised of (14) components, including (8) Go-based microservices, labeled generically as Service A through Service H, (1) Angular 7, TypeScript-based front-end, (4) MongoDB databases, and (1) RabbitMQ queue for event queue-based communications.

Golang Service Diagram with Proxy v2.png

The reference platform is designed to generate HTTP-based service-to-service, TCP-based service-to-database (MongoDB), and TCP-based service-to-queue-to-service (RabbitMQ) IPC (inter-process communication). Service A calls Service B and Service C, Service B calls Service D and Service E, Service D produces a message on a RabbitMQ queue that Service F consumes and writes to MongoDB, and so on. The goal is to observe these distributed communications using Istio’s observability tools when the system is deployed to Kubernetes.

Pillar 1: Logging

If you recall, logs, metrics, and traces are often known as the three pillars of observability. Since we are using GKE on GCP, we will look at Google’s Stackdriver Logging. According to Google, Stackdriver Logging allows you to store, search, analyze, monitor, and alert on log data and events from GCP and even AWS. Although Stackdriver logging is not an Istio observability feature, logging is an essential pillar of overall observability strategy.

Go-based Microservice Logging

An effective logging strategy starts with what you log, when you log, and how you log. As part of our logging strategy, the eight Go-based microservices are using Logrus, a popular structured logger for Go. The microservices also implement Banzai Cloud’s logrus-runtime-formatter. There is an excellent article on the formatter, Golang runtime Logrus Formatter. These two logging packages give us greater control over what we log, when we log, and how we log information about our microservices. The recommended configuration of the packages is minimal.

func init() {
   formatter := runtime.Formatter{ChildFormatter: &log.JSONFormatter{}}
   formatter.Line = true
   log.SetFormatter(&formatter)
   log.SetOutput(os.Stdout)
   level, err := log.ParseLevel(getEnv("LOG_LEVEL", "info"))
   if err != nil {
      log.Error(err)
   }
   log.SetLevel(level)
}

Logrus provides several advantages over Go’s simple logging package, log. Log entries are not only for Fatal errors, nor should all verbose log entries be output in a Production environment. The post’s microservices are taking advantage of Logrus’ ability to log at seven levels: Trace, Debug, Info, Warning, Error, Fatal, and Panic. We have also variabilized the log level, allowing it to be easily changed in the Kubernetes Deployment resource at deploy-time.

The microservices also take advantage of Banzai Cloud’s logrus-runtime-formatter. The Banzai formatter automatically tags log messages with runtime/stack information, including function name and line number; extremely helpful when troubleshooting. We are also using Logrus’ JSON formatter. Note how each log entry below has the JSON payload contained within the message.

screen_shot_2019-03-14_at_9_28_09_pm

Client-side Angular UI Logging

Likewise, we have enhanced the logging of the Angular UI using NGX Logger. NGX Logger is a popular, simple logging module, currently for Angular 6 and 7. It allows “pretty print” to the console, as well as allowing log messages to be POSTed to a URL for server-side logging. For this demo, we will only print to the console. Similar to Logrus, NGX Logger supports multiple log levels: Trace, Debug, Info, Warning, Error, Fatal, and Off. Instead of just outputting messages, NGX Logger allows us to output properly formatted log entries to the web browser’s console.

The level of logs output is dependent on the environment, Production or not Production. Below we see a combination of log entries in the local development environment, including Debug, Info, and Error.

screen_shot_2019-03-17_at_11_58_31_am

Again below, we see the same page in the GKE-based Production environment. Note the absence of Debug-level log entries output to the console, without changing the configuration. We would not want to expose potentially sensitive information in verbose log output to our end-users in Production.

screen_shot_2019-03-17_at_11_58_45_am

Controlling logging levels is accomplished by adding the following ternary operator to the app.module.ts file.

    LoggerModule.forRoot({
      level: !environment.production ? 
        NgxLoggerLevel.DEBUG : NgxLoggerLevel.INFO,
        serverLogLevel: NgxLoggerLevel.INFO
    })

Pillar 2: Metrics

For metrics, we will examine at Prometheus and Grafana. Both these leading tools were installed as part of the Istio deployment.

Prometheus

Prometheus is a completely open source and community-driven systems monitoring and alerting toolkit originally built at SoundCloud, circa 2012. Interestingly, Prometheus joined the Cloud Native Computing Foundation (CNCF) in 2016 as the second hosted-project, after Kubernetes.

According to Istio, Istio’s Mixer comes with a built-in Prometheus adapter that exposes an endpoint serving generated metric values. The Prometheus add-on is a Prometheus server that comes pre-configured to scrape Mixer endpoints to collect the exposed metrics. It provides a mechanism for persistent storage and querying of Istio metrics.

With the GKE cluster running, Istio installed, and the platform deployed, the easiest way to access Grafana, is using kubectl port-forward to connect to the Prometheus server. According to Google, Kubernetes port forwarding allows using a resource name, such as a service name, to select a matching pod to port forward to since Kubernetes v1.10. We forward a local port to a port on the Prometheus pod.

screen_shot_2019-03-15_at_7_32_23_pm.png

You may connect using Google Cloud Shell or copy and paste the command to your local shell to connect from a local port. Below are the port forwarding commands used in this post.

# Grafana
kubectl port-forward -n istio-system \
  $(kubectl get pod -n istio-system -l app=grafana \
  -o jsonpath='{.items[0].metadata.name}') 3000:3000 &
  
# Prometheus
kubectl -n istio-system port-forward \
  $(kubectl -n istio-system get pod -l app=prometheus \
  -o jsonpath='{.items[0].metadata.name}') 9090:9090 &
  
# Jaeger
kubectl port-forward -n istio-system \
$(kubectl get pod -n istio-system -l app=jaeger \
-o jsonpath='{.items[0].metadata.name}') 16686:16686 &
  
# Kiali
kubectl -n istio-system port-forward \
  $(kubectl -n istio-system get pod -l app=kiali \
  -o jsonpath='{.items[0].metadata.name}') 20001:20001 &

According to Prometheus, user select and aggregate time series data in real time using a functional query language called PromQL (Prometheus Query Language). The result of an expression can either be shown as a graph, viewed as tabular data in Prometheus’s expression browser, or consumed by external systems through Prometheus’ HTTP API. The expression browser includes a drop-down menu with all available metrics as a starting point for building queries. Shown below are a few PromQL examples used in this post.

up{namespace="dev",pod_name=~"service-.*"}

container_memory_max_usage_bytes{namespace="dev",container_name=~"service-.*"}
container_memory_max_usage_bytes{namespace="dev",container_name="service-f"}
container_network_transmit_packets_total{namespace="dev",pod_name=~"service-e-.*"}

istio_requests_total{destination_service_namespace="dev",connection_security_policy="mutual_tls",destination_app="service-a"}
istio_response_bytes_count{destination_service_namespace="dev",connection_security_policy="mutual_tls",source_app="service-a"}

Below, in the Prometheus console, we see an example graph of the eight Go-based microservices, deployed to GKE. The graph displays the container memory usage over a five minute period. For half the time period, the services were at rest. For the second half of the period, the services were under a simulated load, using hey. Viewing the memory profile of the services under load can help us determine the container memory minimums and limits, which impact Kubernetes’ scheduling of workloads on the GKE cluster. Metrics such as this might also uncover memory leaks or routing issues, such as the service below, which appears to be consuming 25-50% more memory than its peers.

screen_shot_2019-03-15_at_7_15_24_pm

Another example, below, we see a graph representing the total Istio requests to Service A in the dev Namespace, while the system was under load.

screen_shot_2019-03-15_at_5_23_26_pm

Compare the graph view above with the same metrics displayed the console view. The multiple entries reflect the multiple instances of Service A in the dev Namespace, over the five-minute period being examined. The values in the individual metric elements indicate the latest metric that was collected.

screen_shot_2019-03-15_at_5_24_12_pm

Prometheus also collects basic metrics about Istio components, Kubernetes components, and GKE cluster. Below we can view the total memory of each n1-standard-2 VM nodes in the GKE cluster.

screen_shot_2019-03-15_at_8_15_03_pm.png

Grafana

Grafana describes itself as the leading open source software for time series analytics. According to Grafana Labs, Grafana allows you to query, visualize, alert on, and understand your metrics no matter where they are stored. You can easily create, explore, and share visually-rich, data-driven dashboards. Grafana also users to visually define alert rules for your most important metrics. Grafana will continuously evaluate rules and can send notifications.

According to Istio, the Grafana add-on is a pre-configured instance of Grafana. The Grafana Docker base image has been modified to start with both a Prometheus data source and the Istio Dashboard installed. The base install files for Istio, and Mixer in particular, ship with a default configuration of global (used for every service) metrics. The pre-configured Istio Dashboards are built to be used in conjunction with the default Istio metrics configuration and a Prometheus back-end.

Below, we see the pre-configured Istio Workload Dashboard. This particular section of the larger dashboard has been filtered to show outbound service metrics in the dev Namespace of our GKE cluster.

screen_shot_2019-03-13_at_10_44_54_pm

Similarly, below, we see the pre-configured Istio Service Dashboard. This particular section of the larger dashboard is filtered to show client workloads metrics for the Istio Ingress Gateway in our GKE cluster.

screen_shot_2019-03-13_at_10_43_11_pm

Lastly, we see the pre-configured Istio Mesh Dashboard. This dashboard is filtered to show a table view of metrics for components deployed to our GKE cluster.

screen_shot_2019-03-13_at_10_34_16_pm

An effective observability strategy must include more than just the ability to visualize results. An effective strategy must also include the ability to detect anomalies and notify (alert) the appropriate resources or take action directly to resolve incidents. Grafana, like Prometheus, is capable of alerting and notification. You visually define alert rules for your critical metrics. Grafana will continuously evaluate metrics against the rules and send notifications when pre-defined thresholds are breached.

Prometheus supports multiple, popular notification channels, including PagerDuty, HipChat, Email, Kafka, and Slack. Below, we see a new Prometheus notification channel, which sends alert notifications to a Slack support channel.

screen_shot_2019-03-13_at_10_55_09_pm

Prometheus is able to send detailed text-based and visual notifications.

screen_shot_2019-03-14_at_6_06_22_pm

Pillar 3: Traces

According to the Open Tracing website, distributed tracing, also called distributed request tracing, is a method used to profile and monitor applications, especially those built using a microservices architecture. Distributed tracing helps pinpoint where failures occur and what causes poor performance.

According to Istio, although Istio proxies are able to automatically send spans, applications need to propagate the appropriate HTTP headers, so that when the proxies send span information, the spans can be correlated correctly into a single trace. To accomplish this, an application needs to collect and propagate the following headers from the incoming request to any outgoing requests.

  • x-request-id
  • x-b3-traceid
  • x-b3-spanid
  • x-b3-parentspanid
  • x-b3-sampled
  • x-b3-flags
  • x-ot-span-context

The x-b3 headers originated as part of the Zipkin project. The B3 portion of the header is named for the original name of Zipkin, BigBrotherBird. Passing these headers across service calls is known as B3 propagation. According to Zipkin, these attributes are propagated in-process, and eventually downstream (often via HTTP headers), to ensure all activity originating from the same root are collected together.

In order to demonstrate distributed tracing with Jaeger, I have modified Service A, Service B, and Service E. These are the three services that make HTTP requests to other upstream services. I have added the following code in order to propagate the headers from one service to the next. The Istio sidecar proxy (Envoy) generates the first headers. It is critical that you only propagate the headers that are present in the downstream request and have a value, as the code below does. Propagating an empty header will break the distributed tracing.

headers := []string{
  "x-request-id",
  "x-b3-traceid",
  "x-b3-spanid",
  "x-b3-parentspanid",
  "x-b3-sampled",
  "x-b3-flags",
  "x-ot-span-context",
}

for _, header := range headers {
  if r.Header.Get(header) != "" {
    req.Header.Add(header, r.Header.Get(header))
  }
}

Below, in the highlighted Stackdriver log entry’s JSON payload, we see the required headers, propagated from the root span, which contained a value, being passed from Service A to Service C in the upstream request.

screen_shot_2019-03-19_at_11_01_26_pm

Jaeger

According to their website, Jaeger, inspired by Dapper and OpenZipkin, is a distributed tracing system released as open source by Uber Technologies. It is used for monitoring and troubleshooting microservices-based distributed systems, including distributed context propagation, distributed transaction monitoring, root cause analysis, service dependency analysis, and performance and latency optimization. The Jaeger website contains a good overview of Jaeger’s architecture and general tracing-related terminology.

Below we see the Jaeger UI Traces View. The UI shows the results of a search for the Istio Ingress Gateway service over a period of about forty minutes. We see a timeline of traces across the top with a list of trace results below. As discussed on the Jaeger website, a trace is composed of spans. A span represents a logical unit of work in Jaeger that has an operation name. A trace is an execution path through the system and can be thought of as a directed acyclic graph (DAG) of spans. If you have worked with systems like Apache Spark, you are probably already familiar with DAGs.

screen_shot_2019-03-19_at_8_21_14_pm

Below we see the Jaeger UI Trace Detail View. The example trace contains 16 spans, which encompasses eight services – seven of the eight Go-based services and the Istio Ingress Gateway. The trace and the spans each have timings. The root span in the trace is the Istio Ingress Gateway. The Angular UI, loaded in the end user’s web browser, calls the mesh’s edge service, Service A, through the Istio Ingress Gateway.  From there, we see the expected flow of our service-to-service IPC. Service A calls Services B and C. Service B calls Service E, which calls Service G and Service H. In this demo, traces do not span the RabbitMQ message queues. This means you would not see a trace which includes a call from Service D to Service F, via the RabbitMQ.

screen_shot_2019-03-19_at_8_21_31_pm

Within the Jaeger UI Trace Detail View, you also have the ability to drill into a single span, which contains additional metadata. Metadata includes the URL being called, HTTP method, response status, and several other headers.

screen_shot_2019-03-19_at_8_22_16_pm

The latest version of Jaeger also includes a Compare feature and two Dependencies views, Force-Directed Graph, and DAG. I find both views rather primitive compared to Kiali, and more similar to Service Graph. Lacking access to Kiali, the views are marginally useful as a dependency graph.

screen_shot_2019-03-19_at_8_23_03_pm

Kiali: Microservice Observability

According to their website, Kiali provides answers to the questions: What are the microservices in my Istio service mesh, and how are they connected? There is a common Kubernetes Secret that controls access to the Kiali API and UI. The default login is admin, the password is 1f2d1e2e67df.
screen_shot_2019-03-13_at_8_33_35_pm

Logging into Kiali, we see the Overview menu entry, which provides a global view of all namespaces within the Istio service mesh and the number of applications within each namespace.

screen_shot_2019-03-18_at_11_38_36_pm

The Graph View in the Kiali UI is a visual representation of the components running in the Istio service mesh. Below, filtering on the cluster’s dev Namespace, we can observe that Kiali has mapped 8 applications (workloads), 10 services, and 24 edges (a graph term). Specifically, we see the Istio Ingres Proxy at the edge of the service mesh, the Angular UI, the eight Go-based microservices and their Envoy proxy sidecars that are taking traffic (Service F did not take any direct traffic from another service in this example), the external MongoDB Atlas cluster, and the external CloudAMQP cluster. Note how service-to-service traffic flows, with Istio, from the service to its sidecar proxy, to the other service’s sidecar proxy, and finally to the service.

screen_shot_2019-03-18_at_11_40_16_pm

Below, we see a similar view of the service mesh, but this time, there are failures between the Istio Ingress Gateway and the Service A, shown in red. We can also observe overall metrics for the HTTP traffic, such as total requests/minute, errors, and status codes.

screen_shot_2019-03-13_at_8_45_36_pm

Kiali can also display average requests times and other metrics for each edge in the graph (the communication between two components).

screen_shot_2019-03-13_at_8_51_18_pm

Kiali can also show application versions deployed, as shown below, the microservices are a combination of versions 1.3 and 1.4.

screen_shot_2019-03-18_at_11_43_41_pm

Focusing on the external MongoDB Atlas cluster, Kiali also allows us to view TCP traffic between the four services within the service mesh and the external cluster.

screen_shot_2019-03-13_at_8_46_46_pm

The Applications menu entry lists all the applications and their error rates, which can be filtered by Namespace and time interval. Here we see that the Angular UI was producing errors at the rate of 16.67%.

screen_shot_2019-03-18_at_11_43_48_pm

On both the Applications and Workloads menu entry, we can drill into a component to view additional details, including the overall health, number of Pods, Services, and Destination Services. Below, we see details for Service B in the dev Namespace.

screen_shot_2019-03-18_at_11_44_37_pm

The Workloads detailed view also includes inbound and outbound metrics. Below, the outbound volume, duration, and size metrics, for Service A in the dev Namespace.

screen_shot_2019-03-19_at_8_06_50_pm

Finally, Kiali presents an Istio Config menu entry. The Istio Config menu entry displays a list of all of the available Istio configuration objects that exist in the user’s environment.

screen_shot_2019-03-19_at_8_38_08_pm

Oftentimes, I find Kiali to be my first stop when troubleshooting platform issues. Once I identify the specific components or communication paths having issues, I can search the Stackdriver logs and the Prometheus metrics, through the Grafana dashboard.

Conclusion

In this two-part post, we have explored the current set of observability tools, which are part of the latest version of Istio Service Mesh. These tools included Prometheus and Grafana for metric collection, monitoring, and alerting, Jaeger for distributed tracing, and Kiali for Istio service-mesh-based microservice visualization. Combined with cloud platform-native monitoring and logging services, such as Stackdriver for Google Kubernetes Engine (GKE) on Google Cloud Platform (GCP), we have a complete observability solution for modern, distributed applications.

All opinions expressed in this post are my own and not necessarily the views of my current or past employers or their clients.

, , , , , , , , , , , , , ,

4 Comments

Kubernetes-based Microservice Observability with Istio Service Mesh on Google Kubernetes Engine (GKE): Part 1

In this two-part post, we will explore the set of observability tools which are part of the Istio Service Mesh. These tools include Jaeger, Kiali, Prometheus, and Grafana. To assist in our exploration, we will deploy a Go-based, microservices reference platform to Google Kubernetes Engine, on the Google Cloud Platform.

Golang Service Diagram with Proxy v2

What is Observability?

Similar to blockchain, serverless, AI and ML, chatbots, cybersecurity, and service meshes, Observability is a hot buzz word in the IT industry right now. According to Wikipedia, observability is a measure of how well internal states of a system can be inferred from knowledge of its external outputs. Logs, metrics, and traces are often known as the three pillars of observability. These are the external outputs of the system, which we may observe.

The O’Reilly book, Distributed Systems Observability, by Cindy Sridharan, does an excellent job of detailing ‘The Three Pillars of Observability’, in Chapter 4. I recommend reading this free online excerpt, before continuing. A second great resource for information on observability is honeycomb.io, a developer of observability tools for production systems, led by well-known industry thought-leader, Charity Majors. The honeycomb.io site includes articles, blog posts, whitepapers, and podcasts on observability.

As modern distributed systems grow ever more complex, the ability to observe those systems demands equally modern tooling that was designed with this level of complexity in mind. Traditional logging and monitoring systems often struggle with today’s hybrid and multi-cloud, polyglot language-based, event-driven, container-based and serverless, infinitely-scalable, ephemeral-compute platforms.

Tools like Istio Service Mesh attempt to solve the observability challenge by offering native integrations with several best-of-breed, open-source telemetry tools. Istio’s integrations include Jaeger for distributed tracing, Kiali for Istio service mesh-based microservice visualization, and Prometheus and Grafana for metric collection, monitoring, and alerting. Combined with cloud platform-native monitoring and logging services, such as Stackdriver for Google Kubernetes Engine (GKE) on Google Cloud Platform (GCP), we have a complete observability platform for modern, distributed applications.

A Reference Microservices Platform

To demonstrate the observability tools integrated with the latest version of Istio Service Mesh, we will deploy a reference microservices platform, written in Go, to GKE on GCP. I developed the reference platform to demonstrate concepts such as API management, Service Meshes, Observability, DevOps, and Chaos Engineering. The platform is comprised of (14) components, including (8) Go-based microservices, labeled generically as Service A – Service H, (1) Angular 7, TypeScript-based front-end, (4) MongoDB databases, and (1) RabbitMQ queue for event queue-based communications. The platform and all its source code is free and open source.

The reference platform is designed to generate HTTP-based service-to-service, TCP-based service-to-database (MongoDB), and TCP-based service-to-queue-to-service (RabbitMQ) IPC (inter-process communication). Service A calls Service B and Service C, Service B calls Service D and Service E, Service D produces a message on a RabbitMQ queue that Service F consumes and writes to MongoDB, and so on. These distributed communications can be observed using Istio’s observability tools when the system is deployed to a Kubernetes cluster running the Istio service mesh.

Service Responses

On the reference platform, each upstream service responds to requests from downstream services by returning a small informational JSON payload (termed a greeting in the source code).

Golang Service Diagram with Proxy v2 res

The responses are aggregated across the service call chain, resulting in an array of service responses being returned to the edge service and on to the Angular-based UI, running in the end user’s web browser. The response aggregation feature is simply used to confirm that the service-to-service communications, Istio components, and the telemetry tools are working properly.

screen_shot_2019-03-19_at_8_43_10_pm

Each Go microservice contains a /ping and /health endpoint. The /health endpoint can be used to configure Kubernetes Liveness and Readiness Probes. Additionally, the edge service, Service A, is configured for Cross-Origin Resource Sharing (CORS) using the access-control-allow-origin: * response header. CORS allows the Angular UI, running in end user’s web browser, to call the Service A /ping endpoint, which resides in a different subdomain from UI. Shown below is the Go source code for Service A.


// author: Gary A. Stafford
// site: https://programmaticponderings.com
// license: MIT License
// purpose: Service A
package main
import (
"encoding/json"
"github.com/banzaicloud/logrus-runtime-formatter"
"github.com/google/uuid"
"github.com/gorilla/mux"
"github.com/prometheus/client_golang/prometheus/promhttp"
"github.com/rs/cors"
log "github.com/sirupsen/logrus"
"io/ioutil"
"net/http"
"os"
"strconv"
"time"
)
type Greeting struct {
ID string `json:"id,omitempty"`
ServiceName string `json:"service,omitempty"`
Message string `json:"message,omitempty"`
CreatedAt time.Time `json:"created,omitempty"`
}
var greetings []Greeting
func PingHandler(w http.ResponseWriter, r *http.Request) {
w.Header().Set("Content-Type", "application/json; charset=utf-8")
log.Debug(r)
greetings = nil
CallNextServiceWithTrace("http://service-b/api/ping", w, r)
CallNextServiceWithTrace("http://service-c/api/ping", w, r)
tmpGreeting := Greeting{
ID: uuid.New().String(),
ServiceName: "Service-A",
Message: "Hello, from Service-A!",
CreatedAt: time.Now().Local(),
}
greetings = append(greetings, tmpGreeting)
err := json.NewEncoder(w).Encode(greetings)
if err != nil {
log.Error(err)
}
}
func HealthCheckHandler(w http.ResponseWriter, r *http.Request) {
w.Header().Set("Content-Type", "application/json; charset=utf-8")
_, err := w.Write([]byte("{\"alive\": true}"))
if err != nil {
log.Error(err)
}
}
func ResponseStatusHandler(w http.ResponseWriter, r *http.Request) {
params := mux.Vars(r)
statusCode, err := strconv.Atoi(params["code"])
if err != nil {
log.Error(err)
}
w.WriteHeader(statusCode)
}
func CallNextServiceWithTrace(url string, w http.ResponseWriter, r *http.Request) {
var tmpGreetings []Greeting
w.Header().Set("Content-Type", "application/json; charset=utf-8")
req, err := http.NewRequest("GET", url, nil)
if err != nil {
log.Error(err)
}
// Headers must be passed for Jaeger Distributed Tracing
headers := []string{
"x-request-id",
"x-b3-traceid",
"x-b3-spanid",
"x-b3-parentspanid",
"x-b3-sampled",
"x-b3-flags",
"x-ot-span-context",
}
for _, header := range headers {
if r.Header.Get(header) != "" {
req.Header.Add(header, r.Header.Get(header))
}
}
log.Info(req)
client := &http.Client{}
response, err := client.Do(req)
if err != nil {
log.Error(err)
}
defer response.Body.Close()
body, err := ioutil.ReadAll(response.Body)
if err != nil {
log.Error(err)
}
err = json.Unmarshal(body, &tmpGreetings)
if err != nil {
log.Error(err)
}
for _, r := range tmpGreetings {
greetings = append(greetings, r)
}
}
func getEnv(key, fallback string) string {
if value, ok := os.LookupEnv(key); ok {
return value
}
return fallback
}
func init() {
formatter := runtime.Formatter{ChildFormatter: &log.JSONFormatter{}}
formatter.Line = true
log.SetFormatter(&formatter)
log.SetOutput(os.Stdout)
level, err := log.ParseLevel(getEnv("LOG_LEVEL", "info"))
if err != nil {
log.Error(err)
}
log.SetLevel(level)
}
func main() {
c := cors.New(cors.Options{
AllowedOrigins: []string{"*"},
AllowCredentials: true,
AllowedMethods: []string{"GET", "POST", "PATCH", "PUT", "DELETE", "OPTIONS"},
})
router := mux.NewRouter()
api := router.PathPrefix("/api").Subrouter()
api.HandleFunc("/ping", PingHandler).Methods("GET", "OPTIONS")
api.HandleFunc("/health", HealthCheckHandler).Methods("GET", "OPTIONS")
api.HandleFunc("/status/{code}", ResponseStatusHandler).Methods("GET", "OPTIONS")
api.Handle("/metrics", promhttp.Handler())
handler := c.Handler(router)
log.Fatal(http.ListenAndServe(":80", handler))
}

view raw

main.go

hosted with ❤ by GitHub

For this demonstration, the MongoDB databases will be hosted, external to the services on GCP, on MongoDB Atlas, a MongoDB-as-a-Service, cloud-based platform. Similarly, the RabbitMQ queues will be hosted on CloudAMQP, a RabbitMQ-as-a-Service, cloud-based platform. I have used both of these SaaS providers in several previous posts. Using external services will help us understand how Istio and its observability tools collect telemetry for communications between the Kubernetes cluster and external systems.

Shown below is the Go source code for Service F, This service consumers messages from the RabbitMQ queue, placed there by Service D, and writes the messages to MongoDB.


// author: Gary A. Stafford
// site: https://programmaticponderings.com
// license: MIT License
// purpose: Service F
package main
import (
"bytes"
"context"
"encoding/json"
"github.com/banzaicloud/logrus-runtime-formatter"
"github.com/google/uuid"
"github.com/gorilla/mux"
log "github.com/sirupsen/logrus"
"github.com/streadway/amqp"
"go.mongodb.org/mongo-driver/mongo"
"go.mongodb.org/mongo-driver/mongo/options"
"net/http"
"os"
"time"
)
type Greeting struct {
ID string `json:"id,omitempty"`
ServiceName string `json:"service,omitempty"`
Message string `json:"message,omitempty"`
CreatedAt time.Time `json:"created,omitempty"`
}
var greetings []Greeting
func PingHandler(w http.ResponseWriter, r *http.Request) {
w.Header().Set("Content-Type", "application/json; charset=utf-8")
greetings = nil
tmpGreeting := Greeting{
ID: uuid.New().String(),
ServiceName: "Service-F",
Message: "Hola, from Service-F!",
CreatedAt: time.Now().Local(),
}
greetings = append(greetings, tmpGreeting)
CallMongoDB(tmpGreeting)
err := json.NewEncoder(w).Encode(greetings)
if err != nil {
log.Error(err)
}
}
func HealthCheckHandler(w http.ResponseWriter, r *http.Request) {
w.Header().Set("Content-Type", "application/json; charset=utf-8")
_, err := w.Write([]byte("{\"alive\": true}"))
if err != nil {
log.Error(err)
}
}
func CallMongoDB(greeting Greeting) {
log.Info(greeting)
ctx, _ := context.WithTimeout(context.Background(), 10*time.Second)
client, err := mongo.Connect(ctx, options.Client().ApplyURI(os.Getenv("MONGO_CONN")))
if err != nil {
log.Error(err)
}
defer client.Disconnect(nil)
collection := client.Database("service-f").Collection("messages")
ctx, _ = context.WithTimeout(context.Background(), 5*time.Second)
_, err = collection.InsertOne(ctx, greeting)
if err != nil {
log.Error(err)
}
}
func GetMessages() {
conn, err := amqp.Dial(os.Getenv("RABBITMQ_CONN"))
if err != nil {
log.Error(err)
}
defer conn.Close()
ch, err := conn.Channel()
if err != nil {
log.Error(err)
}
defer ch.Close()
q, err := ch.QueueDeclare(
"service-d",
false,
false,
false,
false,
nil,
)
if err != nil {
log.Error(err)
}
msgs, err := ch.Consume(
q.Name,
"service-f",
true,
false,
false,
false,
nil,
)
if err != nil {
log.Error(err)
}
forever := make(chan bool)
go func() {
for delivery := range msgs {
log.Debug(delivery)
CallMongoDB(deserialize(delivery.Body))
}
}()
<-forever
}
func deserialize(b []byte) (t Greeting) {
log.Debug(b)
var tmpGreeting Greeting
buf := bytes.NewBuffer(b)
decoder := json.NewDecoder(buf)
err := decoder.Decode(&tmpGreeting)
if err != nil {
log.Error(err)
}
return tmpGreeting
}
func getEnv(key, fallback string) string {
if value, ok := os.LookupEnv(key); ok {
return value
}
return fallback
}
func init() {
formatter := runtime.Formatter{ChildFormatter: &log.JSONFormatter{}}
formatter.Line = true
log.SetFormatter(&formatter)
log.SetOutput(os.Stdout)
level, err := log.ParseLevel(getEnv("LOG_LEVEL", "info"))
if err != nil {
log.Error(err)
}
log.SetLevel(level)
}
func main() {
go GetMessages()
router := mux.NewRouter()
api := router.PathPrefix("/api").Subrouter()
api.HandleFunc("/ping", PingHandler).Methods("GET")
api.HandleFunc("/health", HealthCheckHandler).Methods("GET")
log.Fatal(http.ListenAndServe(":80", router))
}

view raw

main.go

hosted with ❤ by GitHub

Source Code

All source code for this post is available on GitHub in two projects. The Go-based microservices source code, all Kubernetes resources, and all deployment scripts are located in the k8s-istio-observe-backend project repository. The Angular UI TypeScript-based source code is located in the k8s-istio-observe-frontend project repository. You should not need to clone the Angular UI project for this demonstration.

git clone --branch master --single-branch --depth 1 --no-tags \
  https://github.com/garystafford/k8s-istio-observe-backend.git

Docker images referenced in the Kubernetes Deployment resource files, for the Go services and UI, are all available on Docker Hub. The Go microservice Docker images were built using the official Golang Alpine base image on DockerHub, containing Go version 1.12.0. Using the Alpine image to compile the Go source code ensures the containers will be as small as possible and contain a minimal attack surface.

System Requirements

To follow along with the post, you will need the latest version of gcloud CLI (min. ver. 239.0.0), part of the Google Cloud SDK, Helm, and the just releases Istio 1.1.0, installed and configured locally or on your build machine.
screen_shot_2019-03-19_at_9_23_17_pm.png

Set-up and Installation

To deploy the microservices platform to GKE, we will proceed in the following order.

  1. Create the MongoDB Atlas database cluster;
  2. Create the CloudAMQP RabbitMQ cluster;
  3. Modify the Kubernetes resources and scripts for your own environments;
  4. Create the GKE cluster on GCP;
  5. Deploy Istio 1.1.0 to the GKE cluster, using Helm;
  6. Create DNS records for the platform’s exposed resources;
  7. Deploy the Go-based microservices, Angular UI, and associated resources to GKE;
  8. Test and troubleshoot the platform;
  9. Observe the results in part two!

MongoDB Atlas Cluster

MongoDB Atlas is a fully-managed MongoDB-as-a-Service, available on AWS, Azure, and GCP. Atlas, a mature SaaS product, offers high-availability, guaranteed uptime SLAs, elastic scalability, cross-region replication, enterprise-grade security, LDAP integration, a BI Connector, and much more.

MongoDB Atlas currently offers four pricing plans, Free, Basic, Pro, and Enterprise. Plans range from the smallest, M0-sized MongoDB cluster, with shared RAM and 512 MB storage, up to the massive M400 MongoDB cluster, with 488 GB of RAM and 3 TB of storage.

For this post, I have created an M2-sized MongoDB cluster in GCP’s us-central1 (Iowa) region, with a single user database account for this demo. The account will be used to connect from four of the eight microservices, running on GKE.

screen_shot_2019-03-09_at_7_48_00_pm

Originally, I started with an M0-sized cluster, but the compute resources were insufficient to support the volume of calls from the Go-based microservices. I suggest at least an M2-sized cluster or larger.

CloudAMQP RabbitMQ Cluster

CloudAMQP provides full-managed RabbitMQ clusters on all major cloud and application platforms. RabbitMQ will support a decoupled, eventually consistent, message-based architecture for a portion of our Go-based microservices. For this post, I have created a RabbitMQ cluster in GCP’s us-central1 (Iowa) region, the same as our GKE cluster and MongoDB Atlas cluster. I chose a minimally-configured free version of RabbitMQ. CloudAMQP also offers robust, multi-node RabbitMQ clusters for Production use.

Modify Configurations

There are a few configuration settings you will need to change in the GitHub project’s Kubernetes resource files and Bash deployment scripts.

Istio ServiceEntry for MongoDB Atlas

Modify the Istio ServiceEntry, external-mesh-mongodb-atlas.yaml file, adding you MongoDB Atlas host address. This file allows egress traffic from four of the microservices on GKE to the external MongoDB Atlas cluster.

apiVersion: networking.istio.io/v1alpha3
kind: ServiceEntry
metadata:
  name: mongodb-atlas-external-mesh
spec:
  hosts:
  - {{ your_host_goes_here }}
  ports:
  - name: mongo
    number: 27017
    protocol: MONGO
  location: MESH_EXTERNAL
  resolution: NONE

Istio ServiceEntry for CloudAMQP RabbitMQ

Modify the Istio ServiceEntry, external-mesh-cloudamqp.yaml file, adding you CloudAMQP host address. This file allows egress traffic from two of the microservices to the CloudAMQP cluster.

apiVersion: networking.istio.io/v1alpha3
kind: ServiceEntry
metadata:
  name: cloudamqp-external-mesh
spec:
  hosts:
  - {{ your_host_goes_here }}
  ports:
  - name: rabbitmq
    number: 5672
    protocol: TCP
  location: MESH_EXTERNAL
  resolution: NONE

Istio Gateway and VirtualService Resources

There are numerous strategies you may use to route traffic into the GKE cluster, via Istio. I am using a single domain for the post, example-api.com, and four subdomains. One set of subdomains is for the Angular UI, in the dev Namespace (ui.dev.example-api.com) and the test Namespace (ui.test.example-api.com). The other set of subdomains is for the edge API microservice, Service A, which the UI calls (api.dev.example-api.com and api.test.example-api.com). Traffic is routed to specific Kubernetes Service resources, based on the URL.

According to Istio, the Gateway describes a load balancer operating at the edge of the mesh, receiving incoming or outgoing HTTP/TCP connections. Modify the Istio ingress Gateway,  inserting your own domains or subdomains in the hosts section. These are the hosts on port 80 that will be allowed into the mesh.

apiVersion: networking.istio.io/v1alpha3
kind: Gateway
metadata:
  name: demo-gateway
spec:
  selector:
    istio: ingressgateway
  servers:
  - port:
      number: 80
      name: http
      protocol: HTTP
    hosts:
    - ui.dev.example-api.com
    - ui.test.example-api.com
    - api.dev.example-api.com
    - api.test.example-api.com

According to Istio, a VirtualService defines a set of traffic routing rules to apply when a host is addressed. A VirtualService is bound to a Gateway to control the forwarding of traffic arriving at a particular host and port. Modify the project’s four Istio VirtualServices, inserting your own domains or subdomains. Here is an example of one of the four VirtualServices, in the istio-gateway.yaml file.

apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: angular-ui-dev
spec:
  hosts:
  - ui.dev.example-api.com
  gateways:
  - demo-gateway
  http:
  - match:
    - uri:
        prefix: /
    route:
    - destination:
        port:
          number: 80
        host: angular-ui.dev.svc.cluster.local

Kubernetes Secret

The project contains a Kubernetes Secret, go-srv-demo.yaml, with two values. One is for the MongoDB Atlas connection string and one is for the CloudAMQP connections string. Remember Kubernetes Secret values need to be base64 encoded.

apiVersion: v1
kind: Secret
metadata:
  name: go-srv-config
type: Opaque
data:
  mongodb.conn: {{ your_base64_encoded_secret }}
  rabbitmq.conn: {{ your_base64_encoded_secret }}

On Linux and Mac, you can use the base64 program to encode the connection strings.

> echo -n "mongodb+srv://username:password@atlas-cluster.gcp.mongodb.net/test?retryWrites=true" | base64
bW9uZ29kYitzcnY6Ly91c2VybmFtZTpwYXNzd29yZEBhdGxhcy1jbHVzdGVyLmdjcC5tb25nb2RiLm5ldC90ZXN0P3JldHJ5V3JpdGVzPXRydWU=

> echo -n "amqp://username:password@rmq.cloudamqp.com/cluster" | base64
YW1xcDovL3VzZXJuYW1lOnBhc3N3b3JkQHJtcS5jbG91ZGFtcXAuY29tL2NsdXN0ZXI=

Bash Scripts Variables

The bash script, part3_create_gke_cluster.sh, contains a series of environment variables. At a minimum, you will need to change the PROJECT variable in all scripts to match your GCP project name.

# Constants - CHANGE ME!
readonly PROJECT='{{ your_gcp_project_goes_here }}'
readonly CLUSTER='go-srv-demo-cluster'
readonly REGION='us-central1'
readonly MASTER_AUTH_NETS='72.231.208.0/24'
readonly GKE_VERSION='1.12.5-gke.5'
readonly MACHINE_TYPE='n1-standard-2'

The bash script, part4_install_istio.sh, includes the ISTIO_HOME variable. The value should correspond to your local path to Istio 1.1.0. On my local Mac, this value is shown below.

readonly ISTIO_HOME='/Applications/istio-1.1.0'

Deploy GKE Cluster

Next, deploy the GKE cluster using the included bash script, part3_create_gke_cluster.sh. This will create a Regional, multi-zone, 3-node GKE cluster, using the latest version of GKE at the time of this post, 1.12.5-gke.5. The cluster will be deployed to the same region as the MongoDB Atlas and CloudAMQP clusters, GCP’s us-central1 (Iowa) region. Planning where your Cloud resources will reside, for both SaaS providers and primary Cloud providers can be critical to minimizing latency for network I/O intensive applications.

screen_shot_2019-03-09_at_5_44_33_pm

Deploy Istio using Helm

With the GKE cluster and associated infrastructure in place, deploy Istio. For this post, I have chosen to install Istio using Helm, as recommended my Istio. To deploy Istio using Helm, use the included bash script, part4_install_istio.sh.

screen_shot_2019-03-09_at_5_47_57_pm

The script installs Istio, using the Helm Chart in the local Istio 1.1.0 install/kubernetes/helm/istio directory, which you installed as a requirement for this demonstration. The Istio install script overrides several default values in the Istio Helm Chart using the --set, flag. The list of available configuration values is detailed in the Istio Chart’s GitHub project. The options enable Istio’s observability features, which we will explore in part two. Features include Kiali, Grafana, Prometheus, and Jaeger.

helm install ${ISTIO_HOME}/install/kubernetes/helm/istio-init \
  --name istio-init \
  --namespace istio-system

helm install ${ISTIO_HOME}/install/kubernetes/helm/istio \
  --name istio \
  --namespace istio-system \
  --set prometheus.enabled=true \
  --set grafana.enabled=true \
  --set kiali.enabled=true \
  --set tracing.enabled=true

kubectl apply --namespace istio-system \
  -f ./resources/secrets/kiali.yaml

Below, we see the Istio-related Workloads running on the cluster, including the observability tools.

screen_shot_2019-03-09_at_5_58_35_pm

Below, we see the corresponding Istio-related Service resources running on the cluster.

screen_shot_2019-03-09_at_5_59_14_pm

Modify DNS Records

Instead of using IP addresses to route traffic the GKE cluster and its applications, we will use DNS. As explained earlier, I have chosen a single domain for the post, example-api.com, and four subdomains. One set of subdomains is for the Angular UI, in the dev Namespace and the test Namespace. The other set of subdomains is for the edge microservice, Service A, which the API calls. Traffic is routed to specific Kubernetes Service resources, based on the URL.

Deploying the GKE cluster and Istio triggers the creation of a Google Load Balancer, four IP addresses, and all required firewall rules. One of the four IP addresses, the one shown below, associated with the Forwarding rule, will be associated with the front-end of the load balancer.screen_shot_2019-03-09_at_5_49_37_pm

Below, we see the new load balancer, with the front-end IP address and the backend VM pool of three GKE cluster’s worker nodes. Each node is assigned one of the IP addresses, as shown above.

screen_shot_2019-03-09_at_5_57_20_pm

As shown below, using Google Cloud DNS, I have created the four subdomains and assigned the IP address of the load balancer’s front-end to all four subdomains. Ingress traffic to these addresses will be routed through the Istio ingress Gateway and the four Istio VirtualServices, to the appropriate Kubernetes Service resources. Use your choice of DNS management tools to create the four A Type DNS records.

screen_shot_2019-03-09_at_5_56_29_pm

Deploy the Reference Platform

Next, deploy the eight Go-based microservices, the Angular UI, and the associated Kubernetes and Istio resources to the GKE cluster. To deploy the platform, use the included bash deploy script, part5a_deploy_resources.sh. If anything fails and you want to remove the existing resources and re-deploy, without destroying the GKE cluster or Istio, you can use the part5b_delete_resources.sh delete script.

screen_shot_2019-03-09_at_6_01_29_pm

The deploy script deploys all the resources two Kubernetes Namespaces, dev and test. This will allow us to see how we can differentiate between Namespaces when using the observability tools.

Below, we see the Istio-related resources, which we just deployed. They include the Istio Gateway, four Istio VirtualService, and two Istio ServiceEntry resources.

screen_shot_2019-03-10_at_10_48_49_pm

Below, we see the platform’s Workloads (Kubernetes Deployment resources), running on the cluster. Here we see two Pods for each Workload, a total of 18 Pods, running in the dev Namespace. Each Pod contains both the deployed microservice or UI component, as well as a copy of Istio’s Envoy Proxy.

screen_shot_2019-03-09_at_6_12_59_pm

Below, we see the corresponding Kubernetes Service resources running in the dev Namespace.

screen_shot_2019-03-09_at_6_03_02_pm

Below, a similar view of the Deployment resources running in the test Namespace. Again, we have two Pods for each deployment with each Pod contains both the deployed microservice or UI component, as well as a copy of Istio’s Envoy Proxy.

screen_shot_2019-03-09_at_6_13_16_pm

Test the Platform

We do want to ensure the platform’s eight Go-based microservices and Angular UI are working properly, communicating with each other, and communicating with the external MongoDB Atlas and CloudAMQP RabbitMQ clusters. The easiest way to test the cluster is by viewing the Angular UI in a web browser.

screen_shot_2019-03-19_at_8_43_10_pm

The UI requires you to input the host domain of the Service A, the API’s edge service. Since you cannot use my subdomain, and the JavaScript code is running locally to your web browser, this option allows you to provide your own host domain. This is the same domain or domains you inserted into the two Istio VirtualService for the UI. This domain route your API calls to either the FQDN (fully qualified domain name) of the Service A Kubernetes Service running in the dev namespace, service-a.dev.svc.cluster.local, or the test Namespace, service-a.test.svc.cluster.local.

screen_shot_2019-03-17_at_12_02_22_pm.png

You can also use performance testing tools to load-test the platform. Many issues will not show up until the platform is under load. I recently starting using hey, a modern load generator tool, as a replacement for Apache Bench (ab), Unlike ab, hey supports HTTP/2 endpoints, which is required to test the platform on GKE with Istio. Below, I am running hey directly from Google Cloud Shell. The tool is simulating 25 concurrent users, generating a total of 1,000 HTTP/2-based GET requests to Service A.

screen_shot_2019-03-19_at_8_53_47_pm

Troubleshooting

If for some reason the UI fails to display, or the call from the UI to the API fails, and assuming all Kubernetes and Istio resources are running on the GKE cluster (all green), the most common explanation is usually a misconfiguration of the following resources:

  1. Your four Cloud DNS records are not correct. They are not pointing to the load balancer’s front-end IP address;
  2. You did not configure the four Kubernetes VirtualService resources with the correct subdomains;
  3. The GKE-based microservices cannot reach the external MongoDB Atlas and CloudAMQP RabbitMQ clusters. Likely, the Kubernetes Secret is constructed incorrectly, or the two ServiceEntry resources contain the wrong host information for those external clusters;

I suggest starting the troubleshooting by calling Service A, the API’s edge service, directly, using cURL or Postman. You should see a JSON response payload, similar to the following. This suggests the issue is with the UI, not the API.

screen_shot_2019-03-17_at_12_06_27_pm.png

Next, confirm that the four MongoDB databases were created for Service D, Service, F, Service, G, and Service H. Also, confirm that new documents are being written to the database’s collections.

screen_shot_2019-03-17_at_11_55_19_am

Next, confirm new the RabbitMQ queue was created, using the CloudAMQP RabbitMQ Management Console. Service D produces messages, which Service F consumes from the queue.

screen_shot_2019-03-09_at_6_22_08_pm

Lastly, review the Stackdriver logs to see if there are any obvious errors.

screen-shot-2019-03-08-at-4_44_03-pm

Part Two

In part two of this post, we will explore each observability tool, and see how they can help us manage our GKE cluster and the reference platform running in the cluster.

screen_shot_2019-03-09_at_11_38_34_pm

Since the cluster only takes minutes to fully create and deploy resources to, if you want to tear down the GKE cluster, run the part6_tear_down.sh script.

screen_shot_2019-03-10_at_10_58_55_pm.png

All opinions expressed in this post are my own and not necessarily the views of my current or past employers or their clients.

, , , , , , , , , , , , , , , ,

3 Comments

Docker Log Aggregation and Visualization Options with the Elastic Stack

elk

As a Developer and DevOps Engineer, it wasn’t that long ago, I spent a lot of time requesting logs from Operations teams for applications running in Production. Many organizations I’ve worked with have created elaborate systems for requesting, granting, and revoking access to application logs. Requesting and obtaining access to logs typically took hours or days, or simply never got approved. Since most enterprise applications are composed of individual components running on multiple application and web servers, it was necessary to request multiple logs. What was often a simple problem to diagnose and fix, became an unnecessarily time-consuming ordeal.

Hopefully, you are still not in this situation. Given the average complexity of today’s modern, distributed, containerized application platforms, accessing individual logs is simply unrealistic and ineffective. The solution is log aggregation and visualization.

Log Aggregation and Visualization

In the context of this post, log aggregation and visualization is defined as the collection, centralized storage, and the ability to simultaneously display application logs from multiple, dissimilar sources. Take a typical modern web application. The frontend UI might be built with Angular, React, or Node. The UI is likely backed by multiple RESTful services, possibly built in Java Spring Boot or Python Flask, and a database or databases, such as MongoDB or MySQL. To support the application, there are auxiliary components, such as API gateways, load-balancers, and messaging brokers. These components are likely deployed as multiple instances, for performance and availability. All instances generate application logs in varying formats.

When troubleshooting an application, such as the one described above, you must often trace a user’s transaction from UI through firewalls and gateways, to the web server, back through the API gateway, to multiple backend services via load-balancers, through message queues, to databases, possibly to external third-party APIs, and back to the client. This is why log aggregation and visualization is essential.

Logging Options

Log aggregation and visualization solutions typically come in three varieties: cloud-hosted by a SaaS provider, a service provided by your Cloud provider, and self-hosted, either on-premises or in the cloud. Cloud-hosted SaaS solutions include Loggly, Splunk, Logentries, and Sumo Logic. Some of these solutions, such as Splunk, are also available as a self-hosted service. Cloud-provider solutions include AWS CloudWatch and Azure Application Insights. Most hosted solutions have reoccurring pricing models based on the volume of logs or the number of server nodes being monitored.

Self-hosted solutions include Graylog 2, Nagios Log Server, Splunk Free, and Elastic’s Elastic Stack. The ELK Stack (Elasticsearch, Logstash, and Kibana), as it was previously known, has been re-branded the Elastic Stack, which now includes Beats. Beats is Elastic’s lightweight shipper that send data from edge machines to Logstash and Elasticsearch.

Often, you will see other components mentioned in the self-hosted space, such as Fluentd, syslog, and Kafka. These are examples of log aggregators or datastores for logs. They lack the combined abilities to collect, store, and display multiple logs. These components are generally part of a larger log aggregation and visualization solution.

This post will explore self-hosted log aggregation and visualization of a Dockerized application on AWS, using the Elastic Stack. The post details three common variations of log collection and routing to Elasticsearch, using various Docker logging drivers, along with Logspout, Fluentd, and GELF (Graylog Extended Log Format).

Docker Swarm Cluster

The post’s example application is deployed to a Docker Swarm, built on AWS, using Docker CE for AWS. Docker has automated the creation of a Swarm on AWS using Docker Cloud, right from your desktop. Creating a Swarm is as easy as inputting a few options and clicking build. Docker uses an AWS CloudFormation script to provision all the necessary AWS resources for the Docker Swarm.

swam_mode

For this post’s logging example, I built a minimally configured Docker Swarm cluster, consisting of a single Manager Node and three Worker Nodes. The four Swarm nodes, all EC2 instances, are behind an AWS ELB, inside a new AWS VPC.

Logging Diagram AWS Diagram 3D

As seen with the docker node ls command, the Docker Swarm will look similar to the following.


$ docker node ls
ID HOSTNAME STATUS AVAILABILITY MANAGER STATUS
tc2mwa29jl21bp4x3buwbhlhx ip-172-31-5-65.ec2.internal Ready Active
ueonffqdgocfo6dwb7eckjedz ip-172-31-29-135.ec2.internal Ready Active
n9zportqziqyzqsav3cbfakng ip-172-31-39-165.ec2.internal Ready Active
ao55bv77cs0542x9arm6droyo * ip-172-31-47-42.ec2.internal Ready Active Leader

Sample Application Components

Multiple containerized copies of a simple Java Spring Boot RESTful Hello-World service, available on GitHub, along with the associated logging aggregators, are deployed to Worker Node 1 and Worker Node 2. We will explore each of these application components later in the post. The containerized components consist of the following:

  1. Fluentd (garystafford/custom-fluentd)
  2. Logspout (garystafford/custom-logspout)
  3. NGINX (garystafford/custom-nginx)
  4. Hello-World Service using Docker’s default JSON file logging driver
  5. Hello-World Service using Docker’s GELF logging driver
  6. Hello-World Service using Docker’s Fluentd logging driver

NGINX is used as a simple frontend API gateway, which to routes HTTP requests to each of the three logging variations of the Hello-World service (garystafford/hello-world).

A single container, running the entire Elastic Stack (garystafford/custom-elk) is deployed to Worker Node 3. This is to isolate the Elastic Stack from the application. Typically, in a real environment, the Elastic Stack would be running on separate infrastructure for performance and security, not alongside your application. Running a docker service ls, the deployed services appear as follows.


$ docker service ls
ID NAME MODE REPLICAS IMAGE PORTS
6va602lzfl4y dockercloud-server-proxy global 1/1 dockercloud/server-proxy *:2376->2376/tcp
jjjddhuczx35 elk-demo_elk replicated 1/1 garystafford/custom-elk:latest *:5000->5000/udp,*:5044->5044/tcp,*:5601->5601/tcp,*:9200->9200/tcp,*:12201->12201/udp
mwm1rbo3dp3t elk-demo_fluentd global 2/2 garystafford/custom-fluentd:latest *:24224->24224/tcp,*:24224->24224/udp
ofo02g2kbhg7 elk-demo_hello-fluentd replicated 2/2 garystafford/hello-world:latest
05txkpmizwxq elk-demo_hello-gelf replicated 2/2 garystafford/hello-world:latest
pjs614raq37y elk-demo_hello-logspout replicated 2/2 garystafford/hello-world:latest
9h0l0w2ej1yw elk-demo_logspout global 2/2 garystafford/custom-logspout:latest
wpxjji5wwd4j elk-demo_nginx replicated 2/2 garystafford/custom-nginx:latest *:80->80/tcp
w0y5inpryaql elk-demo_portainer global 1/1 portainer/portainer:latest *:9000->9000/tcp

Portainer

A single instance of Portainer (Docker Hub: portainer/portainer) is deployed on the single Manager Node. Portainer, amongst other things, provides a detailed view of Docker Swarm, showing each Swarm Node and the service containers deployed to them.

portainer

In my opinion, Portainer provides a much better user experience than Docker Enterprise Edition’s most recent Universal Control Plane (UCP). In the past, I have also used Visualizer (dockersamples/visualizer), one of the first open source solutions in this space. However, since the Visualizer project moved to Docker, it seems like the development of new features has completely stalled out. A good list of container tools can be found on StackShare.

Deployment

All the Docker service containers are deployed to the AWS-based Docker Swarm using a single Docker Compose file. The order of service startup is critical. Elasticsearch should fully startup first, followed by Fluentd and Logspout, then the three sets of Hello-World instances, and finally NGINX.

To deploy and start all the Docker services correctly, there are two scripts in the GitHub repository. First, execute the following command, sh ./stack_deploy.sh. This will deploy the Docker service stack and create an overlay network, containing all the services as configured in the docker-compose.yml file. Then, to ensure the services start in the correct sequence, execute sh ./service_update.sh. This will restart each service in the correct order, with pauses between services to allow time for startup; a bit of a hack, but effective.

Collection and Routing Examples

Below is a diagram showing all the components comprising this post’s examples, and includes the protocols and ports on which they communicate. Following, we will look at three variations of self-hosted log collection and routing options for the Elastic Stack.

Logging Diagram

Example 1: Fluentd

The first example of log aggregation and visualization uses Fluentd, a Cloud Native Computing Foundation (CNCF) hosted project. Fluentd is described as ‘an open source data collector for unified logging layer.’ A container running Fluentd with a custom configuration runs globally on each Worker Node where the applications are deployed, in this case, the hello-fluentd Docker service. Here is the custom Fluentd configuration file (fluent.conf):


# fluentd config for logging demo
<source>
@type forward
port 24224
</source>
<filter **>
@type concat
key log
stream_identity_key container_id
multiline_start_regexp /^\S+/
multiline_end_regexp /\s+.*more$/
flush_interval 120s
timeout_label @processdata
</filter>
<label @ERROR>
<match **>
@type stdout
</match>
</label>
<label @processdata>
<match **>
@type stdout
</match>
</label>
<match **>
@type elasticsearch
logstash_format true
host elk
port 9200
index_name fluentd
type_name fluentd
</match>

The Hello-World service is configured through the Docker Compose file to use the Fluentd Docker logging driver. The log entries from the Hello-World containers on the Worker Nodes are diverted from being output to JSON files, using the default JSON file logging driver, to the Fluentd container instance on the same host as the Hello-World container. The Fluentd container is listening for TCP traffic on port 24224.

Fluentd then sends the individual log entries to Elasticsearch directly, bypassing Logstash. Fluentd log entries are sent via HTTP to port 9200, Elasticsearch’s JSON interface.

Logging Diagram Fluentd

Using Fluentd as a transport method, log entries appear as JSON documents in Elasticsearch, as shown below. This Elasticsearch JSON document is an example of a single line log entry. Note the primary field container identifier, when using Fluentd, is container_id. This field will vary depending on the Docker driver and log collector, as seen in the next two logging examples.

fluentd-log.png

The next example shows a Fluentd multiline log entry. Using the Fluentd Concat filter plugin (fluent-plugin-concat), the individual lines of a stack trace from a Java runtime exception, thrown by the hello-fluentd Docker service, have been recombined into a single Elasticsearch JSON document.

fluentd-multiline

In the above log entries, note the DEPLOY_ENV and SERVICE_NAME fields. These values were injected into the Docker Compose file, as environment variables, during deployment of the Hello-World service. The Fluentd Docker logging driver applies these as env options, as shown in the example Docker Compose snippet, below, lines 5-9.


hello-fluentd:
image: garystafford/hello-world:latest
networks:
elk-demo
logging:
driver: fluentd
options:
tag: docker.{{.Name}}
env: SERVICE_NAME,DEPLOY_ENV
deploy:
placement:
constraints:
node.role == worker
node.hostname != ${WORKER_NODE_3}
replicas: 2
update_config:
parallelism: 1
delay: 10s
restart_policy:
condition: on-failure
max_attempts: 3
delay: 5s
environment:
SERVICE_NAME: hello-fluentd
DEPLOY_ENV: ${DEPLOY_ENV}
LOGSPOUT: ignore
command: "java \
-Dspring.profiles.active=${DEPLOY_ENV} \
-Djava.security.egd=file:/dev/./urandom \
-jar hello-world.jar"

Example 2: Logspout

The second example of log aggregation and visualization uses GliderLabs’ Logspout. Logspout is described by GliderLabs as ‘a log router for Docker containers that runs inside Docker. It attaches to all containers on a host, then routes their logs wherever you want. It also has an extensible module system.’ In the post’s example, a container running Logspout with a custom configuration runs globally on each Worker Node where the applications are deployed, identical to Fluentd.

The hello-logspout Docker service is configured through the Docker Compose file to use the default JSON file logging driver. According to Docker, ‘by default, Docker captures the standard output (and standard error) of all your containers and writes them in files using the JSON format. The JSON format annotates each line with its origin (stdout or stderr) and its timestamp. Each log file contains information about only one container.


hello-logspout:
image: garystafford/hello-world:latest
networks:
elk-demo
logging:
driver: json-file
options:
env: SERVICE_NAME,DEPLOY_ENV
max-size: 10m
deploy:
placement:
constraints:
node.role == worker
node.hostname != ${WORKER_NODE_3}
replicas: 2
update_config:
parallelism: 1
delay: 10s
restart_policy:
condition: on-failure
max_attempts: 3
delay: 5s
environment:
SERVICE_NAME: hello-logspout
DEPLOY_ENV: ${DEPLOY_ENV}
command: "java \
-Dspring.profiles.active=${DEPLOY_ENV} \
-Djava.security.egd=file:/dev/./urandom \
-jar hello-world.jar"

Normally, it is not necessary to explicitly set the default Docker logging driver to JSON files. However, in this case, Docker CE for AWS automatically configured each Swarm Nodes Docker daemon default logging driver to Amazon CloudWatch Logs logging driver. The default drive may be seen by running the docker info command while attached to the Docker daemon. Note line 12 in the snippet below.


docker info
Containers: 6
Running: 6
Paused: 0
Stopped: 0
Images: 10
Server Version: 17.07.0-ce
Storage Driver: overlay2
Backing Filesystem: extfs
Supports d_type: true
Native Overlay Diff: true
Logging Driver: awslogs
Cgroup Driver: cgroupfs
Plugins:
Volume: local
Network: bridge host ipvlan macvlan null overlay
Log: awslogs fluentd gcplogs gelf journald json-file logentries splunk syslog
Swarm: active

The hello-fluentd Docker service containers on the Worker Nodes send log entries to individual JSON files. The Fluentd container on each host then retrieves and routes those JSON log entries to Logstash, within the Elastic Stack container running on Worker Node 3, over UDP to port 5000. Logstash, which is explicitly listening for JSON via UDP on port 5000, then outputs those log entries to Elasticsearch, via HTTP to port 9200, Elasticsearch’s JSON interface.

Logging Diagram Logspout

Using Logspout as a transport method, log entries appear as JSON documents in Elasticsearch, as shown below. Note the field differences between the Fluentd log entry above and this entry. There are a number of significant variations, making it difficult to use both methods, across the same distributed application. For example, the main body of the log entry is contained in the message field using Logspout, but in the log field using Fluentd. The name of the Docker container, which serves as the primary means of identifying the container instance, is the docker.name field with Logspout, but container.name for Fluentd.

Another helpful field, provided by Logspout, is the docker.image field. This is beneficial when associating code issues to a particular code release. In this example, the Hello-World service uses the latest Docker image tag, which is not considered best practice. However, in a real production environment, the Docker tags often represents the incremental build number from the CI/CD system, which is tied to a specific build of the code.

logspout-logThe other challenge I have had with Logspout is passing the env and tag options, such as DEPLOY_ENV and SERVICE_NAME, as seen previously with the Fluentd example. Note they are blank in the above sample. It is possible, but not as straightforward as with Fluentd, and requires interacting directly with the Docker daemon on each Worker node.

Example 3: Graylog Extended Format (GELF)

The third and final example of log aggregation and visualization uses the Docker Graylog Extended Format (GELF) logging driver. According to the GELF website, ‘the Graylog Extended Log Format (GELF) is a log format that avoids the shortcomings of classic plain syslog.’ These syslog shortcomings include a maximum length of 1024 bytes, no data types, multiple dialects making parsing difficult, and no compression.

The GELF format, designed to work with the Graylog Open Source Log Management Server, work equally as well with the Elastic Stack. With the GELF logging driver, there is no intermediary logging collector and router, as with Fluentd and Logspout. The hello-gelf Docker service is configured through its Docker Compose file to use the GELF logging driver. The two hello-gelf Docker service containers on the Worker Nodes send log entries directly to Logstash, running within the Elastic Stack container, running on Worker Node 3, via UDP to port 12201.


hello-gelf:
image: garystafford/hello-world:latest
networks:
elk-demo
logging:
driver: gelf
options:
gelf-address: "udp://${ELK_IP}:12201"
tag: docker.{{.Name}}
env: SERVICE_NAME,DEPLOY_ENV
deploy:
placement:
constraints:
node.role == worker
node.hostname != ${WORKER_NODE_3}
replicas: 2
update_config:
parallelism: 1
delay: 10s
restart_policy:
condition: on-failure
max_attempts: 3
delay: 5s
environment:
LOGSPOUT: ignore
SERVICE_NAME: hello-gelf
DEPLOY_ENV: ${DEPLOY_ENV}
command: "java \
-Dspring.profiles.active=${DEPLOY_ENV} \
-Djava.security.egd=file:/dev/./urandom \
-jar hello-world.jar"

Logstash, which is explicitly listening for UDP traffic on port 12201, then outputs those log entries to Elasticsearch, via HTTP to port 9200, Elasticsearch’s JSON interface.

Logging Diagram GELF

Using the Docker Graylog Extended Format (GELF) logging driver as a transport method, log entries appear as JSON documents in Elasticsearch, as shown below. They are the most verbose of the three formats.

gelf-logAgain, note the field differences between the Fluentd and Logspout log entries above, and this GELF entry. Both the field names of the main body of the log entry and the name of the Docker container are different from both previous examples.

Another bonus with GELF, each entry contains the command field, which stores the command used to start the container’s process. This can be helpful when troubleshooting application startup issues. Often, the exact container startup command might have been injected into the Docker Compose file at deploy time by the CI Server and contained variables, as is the case with the Hello-World service. Reviewing the log entry in Kibana for the command is much easier and safer than logging into the container and executing commands to check the running process for the startup command.

Unlike Logspout, and similar to Fluentd, note the DEPLOY_ENV and SERVICE_NAME fields are present in the GELF entry. These were injected into the Docker Compose file as environment variables during deployment of the Hello-World service. The GELF Docker logging driver applies these as env options. With GELF the entry also gets the optional tag, which was passed in the Docker Compose file’s service definition, tag: docker.{{.Name}}.

Unlike Fluentd, GELF and Logspout do not easily handle multiline logs. Below is an example of a multiline Java runtime exception thrown by the hello-gelf Docker service. The stack trace is not recombined into a single JSON document in Elasticsearch, like in the Fluentd example. The stack trace exists as multiple JSON documents, making troubleshooting much more difficult. Logspout entries will look similar to GELF.

gelf-multiline

Pros and Cons

In my opinion, and based on my level of experience with each of the self-hosted logging collection and routing options, the following some of their pros and cons.

Fluentd

  • Pros
    • Part of CNCF, Fluentd is becoming the defacto logging standard for cloud-native applications
    • Easily extensible via a large number of plugins
    • Easily containerized
    • Ability to easily handle multiline log entries (ie. Java stack trace)
    • Ability to use the Fluentd container’s service name as the Fluentd address, not an IP address or DNS resolvable hostname
  • Cons
    • Using Docker’s Fluentd logging driver, if the Fluentd container is not available on the container’s host, the container logging to Fluentd will fail (major con!)

Logspout

  • Pros
    • Doesn’t require a change to the default Docker JSON file logging driver, logs are still viewable via docker logs command (big plus!)
    • Easily to add and remove functionality via Golang modules
    • Easily containerized
  • Cons
    • Inability to easily handle multiline log entries (ie. Java stack trace)
    • Logspout containers must be restarted if the Elastic Stack is restarted to restart logging
    • To reach Logstash, Logspout must use a DNS resolvable hostname or IP address, not the name of the Elastic Stack container on the same overlay network (big con!)

GELF

  • Pros
    • Application containers, using Docker GELF logging driver will not fail if the downstream Logspout container is unavailable
    • Docker GELF logging driver allows compression of logs for shipment to Logspout
  • Cons
    • Inability to easily handle multiline log entries (ie. Java stack trace)

Conclusion

Of course, there are other self-hosted logging collection and routing options, including Elastic’s Beats, journald, and various syslog servers. Each has their pros and cons, depending on your project’s needs. After building and maintaining several self-hosted mission-critical log aggregation and visualization solutions, it is easy to see the appeal of an off-the-shelf cloud-hosted SaaS solution such as Splunk or Cloud provider solutions such as Application Insights.

All opinions in this post are my own and not necessarily the views of my current employer or their clients.

, , , , , , , , , , ,

1 Comment

Streaming Docker Logs to the Elastic Stack (ELK Stack) using Fluentd

Kibana

Introduction

Fluentd and Docker’s native logging driver for Fluentd makes it easy to stream Docker logs from multiple running containers to the Elastic Stack. In this post, we will use Fluentd to stream Docker logs from multiple instances of a Dockerized Spring Boot RESTful service and MongoDB, to the Elastic Stack (ELK).

log_message_flow_notype

In a recent post, Distributed Service Configuration with Consul, Spring Cloud, and Docker, we built a Consul cluster using Docker swarm mode, to host distributed configurations for a Spring Boot service. We will use the resulting swarm cluster from the previous post as a foundation for this post.

Fluentd

According to the Fluentd website, Fluentd is described as an open source data collector, which unifies data collection and consumption for a better use and understanding of data. Fluentd combines all facets of processing log data: collecting, filtering, buffering, and outputting logs across multiple sources and destinations. Fluentd structures data as JSON as much as possible.

Logging Drivers

Docker includes multiple logging mechanisms to get logs from running containers and services. These mechanisms are called logging drivers. Fluentd is one of the ten current Docker logging drivers. According to Docker, The fluentd logging driver sends container logs to the Fluentd collector as structured log data. Then, users can utilize any of the various output plugins, from Fluentd, to write these logs to various destinations.

Elastic Stack

The ELK Stack, now known as the Elastic Stack, is the combination of Elastic’s very popular products: Elasticsearch, Logstash, and Kibana. According to Elastic, the Elastic Stack provides real-time insights from almost any type of structured and unstructured data source.

Setup

All code for this post has been tested on both MacOS and Linux. For this post, I am provisioning and deploying to a Linux workstation, running the most recent release of Fedora and Oracle VirtualBox. If you want to use AWS or another infrastructure provider instead of VirtualBox to build your swarm, it is fairly easy to switch the Docker Machine driver and change a few configuration items in the vms_create.sh script (see Provisioning, below).

Required Software

If you want to follow along with this post, you will need the latest versions of git, Docker, Docker Machine, Docker Compose, and VirtualBox installed.


cat /etc/*release | sed -n '1p'
# Fedora release 25 (Twenty Five)
git –version \
docker –version \
docker-compose –version \
docker-machine –version \
virtualbox –help | sed -n '1p'
# git version 2.9.3
# Docker version 17.04.0-ce, build 4845c56
# docker-compose version 1.11.2, build dfed245
# docker-machine version 0.10.0, build 76ed2a6
# Oracle VM VirtualBox Manager 5.1.18

Source Code

All source code for this post is located in two GitHub repositories. The first repository contains scripts to provision the VMs, create an overlay network and persistent host-mounted volumes, build the Docker swarm, and deploy Consul, Registrator, Swarm Visualizer, Fluentd, and the Elastic Stack. The second repository contains scripts to deploy two instances of the Widget Spring Boot RESTful service and a single instance of MongoDB. You can execute all scripts manually, from the command-line, or from a CI/CD pipeline, using tools such as Jenkins.

Provisioning the Swarm

To start, clone the first repository, and execute the single run_all.sh script, or execute the seven individual scripts necessary to provision the VMs, create the overlay network and host volumes, build the swarm, and deploy Consul, Registrator, Swarm Visualizer, Fluentd, and the Elastic Stack. Follow the steps below to complete this part.


git clone –depth 1 –branch fluentd \
https://github.com/garystafford/microservice-docker-demo-consul.git
cd microservice-docker-demo-consul/scripts/
sh ./run_all.sh # single uber-script
# alternately, run the individual scripts
sh ./vms_create.sh # creates vms using docker machine
sh ./swarm_create.sh # creates the swarm
sh ./ntwk_vols_create.sh # creates overlay network and volumes
sh ./consul_deploy.sh # deploys consul to all nodes
sh ./registrator_deploy.sh # deploys registrator
sh ./stack_deploy.sh # deploys fluentd, visualizer, elastic stack
sh ./stack_validate.sh # waits/tests for all containers to start

When the scripts have completed, the resulting swarm should be configured similarly to the diagram below. Consul, Registrator, Swarm Visualizer, Fluentd, and the Elastic Stack containers should be distributed across the three swarm manager nodes and the three swarm worker nodes (VirtualBox VMs).

swarm_fluentd_diagram

Deploying the Application

Next, clone the second repository, and execute the single run_all.sh script, or execute the four scripts necessary to deploy the Widget Spring Boot RESTful service and a single instance of MongoDB. Follow the steps below to complete this part.


git clone –depth 1 –branch fluentd \
https://github.com/garystafford/microservice-docker-demo-widget.git
cd ../../microservice-docker-demo-widget/scripts/
sh ./run_all.sh # single uber-script
# alternately, run the individual scripts
sh ./profiles_to_consul.sh # pushes widget spring profiles to consul
sh ./stack_deploy.sh # deploys widget and mongodb containers
sh ./stack_validate.sh # waits/tests for all containers to start
sh ./seed_widgets.sh # creates a series of sample widget entries

When the scripts have completed, the Widget service and MongoDB containers should be distributed across two of the three swarm worker nodes (VirtualBox VMs).

swarm_fluentd_diagram_b

To confirm the final state of the swarm and the running container stacks, use the following Docker commands.


$ docker-machine env manager1 \
&& eval $(docker-machine env manager1)
$ docker-machine ls
NAME ACTIVE DRIVER STATE URL SWARM DOCKER ERRORS
manager1 * virtualbox Running tcp://192.168.99.100:2376 v17.04.0-ce
manager2 – virtualbox Running tcp://192.168.99.101:2376 v17.04.0-ce
manager3 – virtualbox Running tcp://192.168.99.102:2376 v17.04.0-ce
worker1 – virtualbox Running tcp://192.168.99.103:2376 v17.04.0-ce
worker2 – virtualbox Running tcp://192.168.99.104:2376 v17.04.0-ce
worker3 – virtualbox Running tcp://192.168.99.105:2376 v17.04.0-ce
$ docker node ls
ID HOSTNAME STATUS AVAILABILITY MANAGER STATUS
1c7lyvega5qqwqrz0nsudfmpb * manager1 Ready Active Leader
86md2td2a1zxh4x82tp8zwnim worker2 Ready Active
iw82eojks2q82nvmdu07kz7ap manager3 Ready Active Reachable
valelhqb6p6rtvi1jw970cwfu worker3 Ready Active
vx34axhfwil16iudyba09pb38 manager2 Ready Active Reachable
xhnpumc22fht87z49y5ze3zkh worker1 Ready Active
$ docker stack ls
NAME SERVICES
monitoring_stack 3
widget_stack 2
$ docker service ls
ID NAME MODE REPLICAS IMAGE
6jf2bkbu3k2w widget_stack_mongodb replicated 1/1 mongo:latest
iyu9yvs3xnly widget_stack_widget global 2/2 garystafford/microservice-docker-demo-widget:latest
rzarts0oi0se monitoring_stack_visualizer global 3/3 manomarks/visualizer:latest
yjnpg9tmcss8 monitoring_stack_fluentd global 2/2 garystafford/custom-fluentd:latest
zf0oovhzf0ow monitoring_stack_elk replicated 1/1 sebp/elk:latest

Open the Swarm Visualizer web UI, using any of the swarm manager node IPs, on port 5001, to confirm the swarm health, as well as the running container’s locations.

Visualizer

Lastly, open the Consul Web UI, using any of the swarm manager node IPs, on port 5601, to confirm the running container’s health, as well as their placement on the swarm nodes.

Consul_1

Streaming Logs

Elastic Stack

If you read the previous post, Distributed Service Configuration with Consul, Spring Cloud, and Docker, you will notice we deployed a few additional components this time. First, the Elastic Stack (aka ELK), is deployed to the worker3 swarm worker node, within a single container. I have increased the CPU count and RAM assigned to this VM, to minimally run the Elastic Stack. If you review the docker-compose.yml file, you will note I am using Sébastien Pujadas’ sebp/elk:latest Docker base image from Docker Hub to provision the Elastic Stack. At the time of the post, this was based on the 5.3.0 version of ELK.

Docker Logging Driver

The Widget stack’s docker-compose.yml file has been modified since the last post. The compose file now incorporates a Fluentd logging configuration section for each service. The logging configuration includes the address of the Fluentd instance, on the same swarm worker node. The logging configuration also includes a tag for each log message.


version: '3.0'
services:
widget:
image: garystafford/microservice-docker-demo-widget:fluentd
hostname: widget
depends_on:
mongodb
ports:
8030:8030/tcp
networks:
demo_overlay_net
logging:
driver: fluentd
options:
tag: docker.{{.Name}}
fluentd-address: localhost:24224
labels: com.widget.environment
env: SERVICE_NAME, SERVICE_TAGS
labels:
"com.widget.environment: ${ENVIRONMENT}"
deploy:
mode: global
placement:
constraints:
node.role == worker
node.hostname != worker3
environment:
"CONSUL_SERVER_URL=${CONSUL_SERVER}"
"SERVICE_NAME=widget"
"SERVICE_TAGS=service"
command: "java -Dspring.profiles.active=${ACTIVE_PROFILE} \
-Djava.security.egd=file:/dev/./urandom \
-jar widget/widget-service.jar"

Fluentd

In addition to the Elastic Stack, we have deployed Fluentd to the worker1 and worker2 swarm nodes. This is also where the Widget and MongoDB containers are deployed. Again, looking at the docker-compose.yml file, you will note we are using a custom Fluentd Docker image, garystafford/custom-fluentd:latest, which I created. The custom image is available on Docker Hub.


version: '3.0'
services:
fluentd:
image: garystafford/custom-fluentd:latest
hostname: fluentd
ports:
"24224:24224/tcp"
"24224:24224/udp"
networks:
demo_overlay_net
deploy:
mode: global
placement:
constraints:
node.role == worker
node.hostname != worker3
environment:
SERVICE_NAME: fluentd
SERVICE_TAGS: monitoring

The custom Fluentd Docker image is based on Fluentd’s official onbuild Docker image, fluent/fluentd:onbuild. Fluentd provides instructions for building your own custom images, from their onbuild base images.


FROM fluent/fluentd:onbuild
LABEL maintainer "Gary A. Stafford <garystafford@rochester.rr.com>"
ENV REFRESHED_AT 2017-04-02
USER root
RUN apk add –update –virtual .build-deps \
sudo build-base ruby-dev \
# cutomize following instruction as you wish
&& sudo -u fluent gem install \
fluent-plugin-secure-forward \
fluent-plugin-elasticsearch \
fluent-plugin-concat \
&& sudo -u fluent gem sources –clear-all \
&& apk del .build-deps \
&& rm -rf /var/cache/apk/* \
/home/fluent/.gem/ruby/2.3.0/cache/*.gem
USER fluent

view raw

Dockerfile

hosted with ❤ by GitHub

There were two reasons I chose to create a custom Fluentd Docker image. First, I added the Uken Games’ Fluentd Elasticsearch Plugin, to the Docker Image. This highly configurable Fluentd Output Plugin allows us to push Docker logs, processed by Fluentd to the Elasticsearch. Adding additional plugins is a common reason for creating a custom Fluentd Docker image.

The second reason to create a custom Fluentd Docker image was configuration. Instead of bind-mounting host directories or volumes to the multiple Fluentd containers, to provide Fluentd’s configuration, I baked the configuration file into the immutable Docker image. The bare-bones, basicFluentd configuration file defines three processes, which are Input, Filter, and Output. These processes are accomplished using Fluentd plugins. Fluentd has 6 types of plugins: Input, Parser, Filter, Output, Formatter and Buffer. Fluentd is primarily written in Ruby, and its plugins are Ruby gems.


<source>
@type forward
@id input1
@label @mainstream
port 24224
</source>
<filter **>
@type stdout
</filter>
<label @mainstream>
<match **>
@type copy
<store>
@type file
@id output_docker1
path /fluentd/log/docker.*.log
symlink_path /fluentd/log/docker.log
append true
time_slice_format %Y%m%d
time_slice_wait 1m
time_format %Y%m%dT%H%M%S%z
buffer_path /fluentd/log/docker.*.log
</store>
<store>
@type elasticsearch
logstash_format true
flush_interval 5s
host elk
port 9200
index_name fluentd
type_name fluentd
</store>
</match>
</label>

view raw

fluent.conf

hosted with ❤ by GitHub

Fluentd listens for input on tcp port 24224, using the forward Input Plugin. Docker logs are streamed locally on each swarm node, from the Widget and MongoDB containers to the local Fluentd container, over tcp port 24224, using Docker’s fluentd logging driver, introduced earlier. Fluentd

Fluentd then filters all input using the stdout Filter Plugin. This plugin prints events to stdout, or logs if launched with daemon mode. This is the most basic method of filtering.

Lastly, Fluentd outputs the filtered input to two destinations, a local log file and Elasticsearch. First, the Docker logs are sent to a local Fluentd log file. This is only for demonstration purposes and debugging. Outputting log files is not recommended for production, nor does it meet the 12-factor application recommendations for logging. Second, Fluentd outputs the Docker logs to Elasticsearch, over tcp port 9200, using the Fluentd Elasticsearch Plugin, introduced above.

log_message_flow

Additional Metadata

In addition to the log message itself, in JSON format, the fluentd log driver sends the following metadata in the structured log message: container_id, container_name, and source. This is helpful in identifying and categorizing log messages from multiple sources. Below is a sample of log messages from the raw Fluentd log file, with the metadata tags highlighted in yellow. At the bottom of the output is a log message parsed with jq, for better readability.

fluentd_logs

Using Elastic Stack

Now that our two Docker stacks are up and running on our swarm, we should be streaming logs to Elasticsearch. To confirm this, open the Kibana web console, which should be available at the IP address of the worker3 swarm worker node, on port 5601.

Kibana

For the sake of this demonstration, I increased the verbosity of the Spring Boot Widget service’s log level, from INFO to DEBUG, in Consul. At this level of logging, the two Widget services and the single MongoDB instance were generating an average of 250-400 log messages every 30 seconds, according to Kibana.

If that seems like a lot, keep in mind, these are Docker logs, which are single-line log entries. We have not aggregated multi-line messages, such as Java exceptions and stack traces messages, into single entries. That is for another post. Also, the volume of debug-level log messages generated by the communications between the individual services and Consul is fairly verbose.

Kibana_3

Inspecting log entries in Kibana, we find the metadata tags contained in the raw Fluentd log output are now searchable fields: container_id, container_name, and source, as well as log. Also, note the _type field, with a value of ‘fluentd’. We injected this field in the output section of our Fluentd configuration, using the Fluentd Elasticsearch Plugin. The _type fiel allows us to differentiate these log entries from other potential data sources.

Kibana_2.png

References

All opinions in this post are my own and not necessarily the views of my current employer or their clients.

, , , , , , ,

3 Comments

Containerized Microservice Log Aggregation and Visualization using ELK Stack and Logspout

Log aggregation, visualization, analysis, and monitoring of Dockerized microservices using the ELK Stack (Elasticsearch, Logstash, and Kibana) and Logspout

Kibana Dashboard

Introduction

In the last series of posts, we learned how to use Jenkins CI, Maven, DockerDocker Compose, and Docker Machine to take a set of Java-based microservices from source control on GitHub, to a fully tested set of integrated Docker containers running within an Oracle VirtualBox VM. We performed integration tests, using a scripted set of synthetic transactions, to make sure the microservices were functioning as expected, within their containers.

In this post, we will round out our Virtual-Vehicles microservices REST API project by adding log aggregation, visualization, analysis, and monitoring, using the ELK Stack (Elasticsearch, Logstash, and Kibana) and Logspout.

ELK Stack 3D Diagram

All code for this post is available on GitHub, release version v3.1.0 on the ‘master’ branch (after running ‘git clone …’, run a ‘git checkout tags/v3.1.0’ command).

Logging

If you’re using Docker, then you’re familiar with the command, ‘docker logs container-name command‘. This command streams the log output of running services within a container, commonly used to debugging and troubleshooting. It sure beats ‘docker exec -it container-name cat /var/logs/foo/foo.log‘ and so on, for each log we need to inspect within a container.

With Docker Compose, we gain the command, ‘docker-compose logs‘. This command stream the log output of running services, of all containers defined in our ‘docker-compose.yml‘ file. Although moderately more useful for debugging, I’ve also found it fairly buggy when used with Docker Machine and Docker Swarm.

As helpful as these type of Docker commands are, when you start scaling from one container, to ten containers, to hundreds of containers, individually inspecting container logs from the command line is time-consuming and of little value. Correlating log events between containers is impossible. That’s where solutions such as the ELK Stack and Logspout really shine for containerized environments.

ELK Stack

Although not specifically designed for the purpose, the ELK Stack (Elasticsearch, Logstash, and Kibana) is an ideal tool-chain for log aggregation, visualization, analysis, and monitoring. Individually setting up Elasticsearch, Logstash, and Kibana, and configuring them to communicate with each other is not a small task. Luckily, there are several ready-made Docker images on Docker Hub, whose authors have already done much of the hard work for us. After trying several ELK containers on Docker Hub, I chose on the willdurand/elk image. This image is easy to get started with, and is easily used to build containers using Docker Compose.

Logspout

Using the ELK Stack, we have a way to collect (Logstash), store and search (Elasticsearch), and visualize and analyze (Kibana) our container’s log events. Although Logstash is capable of collecting our log events, to integrate more easily with Docker, we will add another component, Glider Lab’s Logspout, to our tool-chain. Logspout advertises itself as “a log router for Docker containers that runs inside Docker. It attaches to all containers on a host, then routes their logs wherever you want. It also has an extensible module system.”

Since Logspout is extensible through third-party modules, we will use one last component, Loop Lab’s Logspout/Logstash Adapter. Written in the go programming language, the adapter is described as “a minimalistic adapter for Glider Lab’s Logspout to write to Logstash UDP”. This adapter will allow us to collect Docker’s log events with Logspout and send them to Logstash using User Datagram Protocol (UDP).

In order to use the Logspout/Logstash adapter, we need to build a Logspout container from the /logspout Docker image, which contains a customized version of Logspout’s modules.go configuration file. This is explained in the Custom Logspout Builds section of Logspout’s README.md. Below is the modified configuration module with the addition of the adapter (see last import statement).

package main

import (
  _ "github.com/gliderlabs/logspout/adapters/raw"
  _ "github.com/gliderlabs/logspout/adapters/syslog"
  _ "github.com/gliderlabs/logspout/httpstream"
  _ "github.com/gliderlabs/logspout/routesapi"
  _ "github.com/gliderlabs/logspout/httpstream"
  _ "github.com/gliderlabs/logspout/transports/udp"
  _ "github.com/looplab/logspout-logstash"
)

One note with Logspout, according to their website, for now it Logspout only captures stdout and stderr, but a module to collect container syslog is planned. Although syslog is common centralized log collection method, the Docker logs we will collect are sent to stdout and stderr, the lack of syslog support is not a limitation for us, in this demonstration.

We will configure Logstash to accept log events from Logspout, using UDP on port 5000. Below is an abridged version of the logstash-logspout-log4j2.conf configuration file. The except from the configuration file, below, instructs Logstash to listen for Logspout’s messages over UDP on port 5000, and passes them to Elasticsearch.

input {
  udp {
    port  => 5000
    codec => json
    type  => "dockerlogs"
  }
 
# filtering section not shown...
 
output {
  elasticsearch { protocol => "http" }
  stdout { codec => rubydebug }
}

We could spend several posts on the configuration of Logstash. There are an infinite number of input, filter, and output combinations, to collect, transform, and push log events to various programs, including Logstash. The filtering section alone takes some time to learn exactly how to filter and transform log events, based upon the requirements for visualization and analysis.

Apache Log4j Logs

What about our Virtual-Vehicle microservice’s Log4j 2 logs? In the previous posts, you’ll recall we were sending our log events to physical log files within each container, using Log4j’s Rolling File appender.

<Appenders>
    <RollingFile name="RollingFile" fileName="${log-path}/virtual-authentication.log"
                 filePattern="${log-path}/virtual-authentication-%d{yyyy-MM-dd}-%i.log" >
        <PatternLayout>
            <pattern>%d{dd/MMM/yyyy HH:mm:ss,SSS}- %c{1}: %m%n</pattern>
        </PatternLayout>
        <Policies>
            <SizeBasedTriggeringPolicy size="1024 KB" />
        </Policies>
        <DefaultRolloverStrategy max="4"/>
    </RollingFile>
</Appenders>

Given the variety of appenders available with Log4j 2, we have a few options to leverage the ELK Stack with these logs events. The least disruptive change would be to send the Log4j log events to Logspout by redirecting Log4j output from the physical log file to stdout. We could do this by running a Linux link command in each microservice’s Dockerfile, as in the following example with Authentication microservice.

RUN touch /var/log/virtual-authentication.log && \
    ln -sf /dev/stdout /var/log/virtual-authentication.log

This method would not require us to change the log4j2.xml configuration files, and rebuild the services. However, the alternative we will use in this post is switching to Log4j’s Syslog appender. According to Log4j documentation, the Syslog appender is a Socket appender that writes its output to a remote destination specified by a host and port in a format that conforms with either the BSD Syslog format or the RFC 5424 format. The data can be sent over either TCP or UDP.

To use the Syslog appender option, we do need to change each log4j2.xml configuration file, and then rebuild each of the microservices. Instead of using UDP over port 5000, which is the port Logspout is currently using to communicate with Logstash, we will use UDP over port 5001. Below is a sample of the log4j2.xml configuration files for the Authentication microservice.

<Appenders>
    <Syslog name="RFC5424" format="RFC5424" host="elk" port="5001"
            protocol="UDP" appName="virtual-authentication" includeMDC="true"
            facility="SYSLOG" enterpriseNumber="18060" newLine="true"
            messageId="log4j2" mdcId="mdc" id="App"
            connectTimeoutMillis="1000" reconnectionDelayMillis="5000">
        <LoggerFields>
            <KeyValuePair key="thread" value="%t"/>
            <KeyValuePair key="priority" value="%p"/>
            <KeyValuePair key="category" value="%c"/>
            <KeyValuePair key="exception" value="%ex"/>
            <KeyValuePair key="message" value="%m"/>
        </LoggerFields>
    </Syslog>
</Appenders>

To communicate with Logstash over port 5001 with the Syslog appender, we also need to modify the logstash-logspout-log4j2.conf configuration file, again. Below is the unabridged version of the configuration file, with both the Logspout (UDP port 5000) and Log4j (UDP port 5001) configurations.

input {
  udp {
    port  => 5000
    codec => json
    type  => "dockerlogs"
  }

  udp {
    type => "log4j2"
    port => 5001
  }
}

filter {
  if [type] == "log4j2" {
    mutate {
     gsub => ['message', "\n", " "]
     gsub => ['message', "\t", " "]
    }
  }

  if [type] == "dockerlogs" {
    if ([message] =~ "^\tat ") {
      drop {}
    }

    grok {
      break_on_match => false
      match => [ "message", " responded with %{NUMBER:status_code:int}" ]
      tag_on_failure => []
    }

    grok {
      break_on_match => false
      match => [ "message", " in %{NUMBER:response_time:int}ms" ]
      tag_on_failure => []
    }
  }
}

output {
  elasticsearch { protocol => "http" }
  stdout { codec => rubydebug }
}

You will note some basic filtering in the configuration. I will touch upon this in the next section. Below is a diagram showing the complete flow of log events from both Log4j and from the Docker containers to Logspout and the ELK Stack.ELK Log Message Flow

Troubleshooting and Debugging

Trying to troubleshoot why log events may not be showing up in Kibana can be frustrating, without methods to debug the flow of log events along with way. Were the stdout Docker log events successfully received by Logspout? Did Logspout successfully forward the log events to Logstash? Did Log4j successfully push the microservice’s log events to Logstash? Probably the most frustrating of all issues, did you properly configure the Logstash configuration file(s) to receive, filter, transform, and push the log events to Elasticsearch. I spent countless hours debugging filtering, alone. Luckily, there are several ways to ensure log events are flowing. The below diagram shows some of the debug points along the way.

ELK Ports

First, we can check that the log events are making to Logspout from Docker by cURLing or browsing port 8000. Executing ‘curl -X GET --url http://api.virtual-vehicles.com:8000/logs‘ will tail incoming log events received to Logspout. You should see log events flowing into Logspout as you call the microservices through NGINX, by running the project’s integration tests, as shown in the example, below.

Logspout Debugging

Second, we can cURL or browse port 9200. This port will display information about Elasticsearch. There are several useful endpoints exposed by Elasticsearch’s REST API interface. Executing ‘curl -X GET --url http://api.virtual-vehicles.com:9200/_status?pretty‘ will display statistics about Elasticsearch, including the number of log events, referred to as ‘documents’ to Elasticsearch’s structured JSON document-based NoSQL datastore. Note the line, ‘"num_docs": 469‘, indicating 469 log events were captured by Elasticsearch as documents.

{
    "_shards": {
        "total": 32,
        "successful": 16,
        "failed": 0
    },
    "indices": {
        "logstash-2015.08.01": {
            "index": {
                "primary_size_in_bytes": 525997,
                "size_in_bytes": 525997
            },
            "translog": {
                "operations": 492
            },
            "docs": {
                "num_docs": 469,
                "max_doc": 469,
                "deleted_docs": 0
            }
        }
    }
}

If you find log events are not flowing into Logstash, a quick way to start debugging issues is to check Logstash’s log:

docker exec -it jenkins_elk_1 cat /var/log/logstash/stdout.log

If you find log events are flowing into Logstash, but not being captured by Elasticsearch, it’s probably your Logstash configuration file. Either the input, filter, and/or output sections are wrong. A quick way to debug these types of issues is to check Elasticsearch’s log. I’ve found this log often contains useful and specific error messages, which can help fix Logstash configuration issues.

docker exec -it jenkins_elk_1 cat /var/log/elasticsearch/logstash.log

Without log event documents in Elasticsearch, there is no sense moving on to Kibana. Kibana will have no data available to display.

Kibana

If you recall from our last post, the project already has Graphite and StatsD configured and running, as shown below. On its own, Graphite provides important monitoring and performance information about our microservices. In fact, we could choose to also send all our Docker log events, through Logstash, to Graphite. This would require some additional filtering and output configuration.

Graphite Dashboard

However, our main interest in this post is the ELK Stack. The way we visualize and analyze the log events we have captured is through Kibana. Kibana resembles other popular log aggregators and log search and analysis products, like Splunk, Graylog, and Sumo Logic. I suggest you familiarize yourself with Kibana before diving into the this part of the demonstration. Kibana can be confusing at first, if you are not familiar with it’s indexing, discovery, and search features.

We can access Kibana from our browser, at port 8200, ‘http://api.virtual-vehicles.com:8200‘. The first interactions with Kibana will be through the Discover view, as seen in the screen grab shown below. Kibana displays the typical vertical bar chart event timeline, based on log event timestamps. The details of each log event are displayed below the timeline. You can filter and search within this view. Searches can be saved and used later.

Kibana Discovery Tab

Heck, just the ability to view and search all our log events in one place is a huge improvement over the command line. If you look a little closer at the actual log events, as shown below, you will notice two types, ‘dockerlogs‘ and ‘log4j2‘. Looking at the Logstash configuration file again, shown previously, you see we applied the ‘type‘ tag to the log events as they were being processed by Logstash.

Kibana Discovery Message Types

In the Logstash configuration file, shown previously, you will also note the use of a few basic filters. I created a ‘status_code‘ and ‘response_time‘ filter, specifically for the Docker log events. Each Docker log event is passed through the filters. The two fields, ‘status_code‘ and  ‘response_time‘, are extracted from the main log event text and added as separate, indexable, and searchable fields. Below is an example of one such Docker log event, an HTTP DELETE call to the Valet microservice, shown as JSON. Note the two fields, showing a response time of 13ms and a http status code of 204.

{
  "_index": "logstash-2015.08.01",
  "_type": "dockerlogs",
  "_id": "AU7rcyxTA4OY8JukKyIv",
  "_score": null,
  "_source": {
    "message": "DELETE http://api.virtual-vehicles.com/valets/55bd30c2e4b0818a113883a6 
                responded with 204 No Content in 13ms",
    "docker.name": "/jenkins_valet_1",
    "docker.id": "7ef368f9fdca2d338786ecd8fe612011aebbfc9ad9b677c21578332f7c46cf2b",
    "docker.image": "jenkins_valet",
    "docker.hostname": "7ef368f9fdca",
    "@version": "1",
    "@timestamp": "2015-08-01T22:47:49.649Z",
    "type": "dockerlogs",
    "host": "172.17.0.7",
    "status_code": 204,
    "response_time": 13
  },
  "fields": {
    "@timestamp": [
      1438469269649
    ]
  },
  "sort": [
    1438469269649
  ]
}

For comparison, here is a sample Log4j 2 log event, generated by a JsonParseException. Note the different field structure. With more time spent modifying the Log4j event format, and configuring Logstash’s filtering and transforms, we could certainly improve the usability of Log4j log events.

{
  "_index": "logstash-2015.08.02",
  "_type": "log4j2",
  "_id": "AU7wJt8zA4OY8JukKyrt",
  "_score": null,
  "_source": {
    "message": "<43>1 2015-08-02T20:42:35.067Z bc45ce804859 virtual-authentication - log4j2
                [mdc@18060 category=\"com.example.authentication.objectid.JwtController\" exception=\"\"
                message=\"validateJwt() failed: JsonParseException: Unexpected end-of-input: was expecting closing
                quote for a string value  at [Source: java.io.StringReader@12a24457; line: 1, column: 27\\]\"
                priority=\"ERROR\" thread=\"nioEventLoopGroup-3-9\"] validateJwt() failed: JsonParseException:
                Unexpected end-of-input: was expecting closing quote for a string value  at [Source:
                java.io.StringReader@12a24457; line: 1, column: 27] ",
    "@version": "1",
    "@timestamp": "2015-08-02T20:42:35.188Z",
    "type": "log4j2",
    "host": "172.17.0.9"
  },
  "fields": {
    "@timestamp": [
      1438548155188
    ]
  },
  "sort": [
    1438548155188
  ]
}

Kibana Dashboard

To demonstrate the visualization capabilities of Kibana, we will create a Dashboard. Our Dashboard will be composed of a series of Kibana Visualizations. Visualizations are charts, graphs, tables, and metrics, based on the log events we see in the Discovery view. Below, I have created a rather basic Dashboard, containing some simple data visualization, based on our Docker and Log4j log events, collected over a 1-hour period. This one small screen-grab does not begin to do justice to the real power of Kibana.

Kibana Dashboard

In the dashboard above, you see a few basic metrics, such as request response times, response http status code, a chart of which containers are logging events, a graph that shows log events captured per minute, and so forth. Along with Searches, Visualizations and Dashboards can also be saved in Kibana. Note this demonstration’s Docker Compose YAML file does not configure volume mapping between the containers and host. If you destroy the containers, you destroy anything you saved in Kibana.

A key feature of Kibana’s Dashboards is their interactive capabilities. Rolling over any piece of a Visualization brings up an informative pop-up with additional details. For example, as shown below, rolling over the http status code ‘500’ pie chart slice, pops up the number of status code 500 responses. In this case, 15 log events, or 1.92% of the total 2,994 log events captured, had a ‘status_code’ field of ‘500’, within the 24-hour period the Dashboard analyzed.

Kibana Dashboard with Popup

Conveniently, Kibana also allows you to switch from a visual mode to a data table mode, for any Visualization on the Dashboard, as shown below, for a 24-hour period.

Kibana Dashboard as Tables

Conclusion

The ELK Stack is just one of a number of enterprise-class tools available to monitor and analyze the overall health of your applications running within a Dockerized environment. Having well planned logging, monitoring, and analytics strategies is key to this type of project. They should be implemented from the beginning of the project, to increase development and testing velocity, as well as provide quick troubleshooting, key business metrics, and proactive monitoring, once the application is in production.

Notes on Running the GitHub Project

If you download and run this project from GitHub, there is two key steps you should note. First, you need add an entry to your local /etc/hosts file. The IP address will be that of the Docker Machine VM, ‘test’. The hostname is ‘api.virtual-vehicles.com’. which matches the one I used throughout the demo. You should run the following bash command before building your containers from the docker-compose.yml file, but after you have built your VM using Docker Machine. The ‘test’ VM must already exist.

echo "$(docker-machine ip test)   api.virtual-vehicles.com" \
  | sudo tee --append /etc/hosts

If you want to override this domain name with your own, you will need to modify and re-build the microservices project, first. Then, copy those build artifacts into this project, replacing the ones you pulled from GitHub.

Second, in order to achieve HATEOAS in my REST API responses, I have included some variables in my docker-compose.yml file. Wait, docker-compose.yml doesn’t support variables? Well, it can if you use a template file (docker-compose-template.yml) and run a script (compose_replace.sh) to provide variable expansion. My gist explains the technique a little better.

You should also run this command before building your containers from the docker-compose.yml file, but after you have built your VM using Docker Machine. Again, the ‘test’ VM must already exist.

sh compose_replace.sh

Lastly, remember, we can run our integration tests to generate log events, using the following command.

sh tests_color.sh api.virtual-vehicles.com

Integration Tests

, , , , , , , , , , , , , , , , ,

1 Comment