Posts Tagged Google
Securing Your Istio Ingress Gateway with HTTPS
Posted by Gary A. Stafford in Bash Scripting, Cloud, Enterprise Software Development, GCP, Software Development on January 3, 2019
In the last post, Building a Microservices Platform with Confluent Cloud, MongoDB Atlas, Istio, and Google Kubernetes Engine, we built and deployed a microservice-based, cloud-native API to Google Kubernetes Engine (GKE), with Istio 1.0, on Google Cloud Platform (GCP). For brevity, we neglected a few key API features, required in Production, including HTTPS, OAuth for authentication, request quotas, request throttling, and the integration of a full lifecycle API management tool, like Google Apigee.
In this brief post, we will revisit the previous post’s project. We will disable HTTP, and secure the GKE cluster with HTTPS, using simple TLS, as opposed to mutual TLS authentication (mTLS). This post assumes you have created the GKE cluster and deployed the Storefront API and its associated resources, as explained in the previous post.
What is HTTPS?
According to Wikipedia, Hypertext Transfer Protocol Secure (HTTPS) is an extension of the Hypertext Transfer Protocol (HTTP) for securing communications over a computer network. In HTTPS, the communication protocol is encrypted using Transport Layer Security (TLS), or, formerly, its predecessor, Secure Sockets Layer (SSL). The protocol is therefore also often referred to as HTTP over TLS, or HTTP over SSL.
Further, according to Wikipedia, the principal motivation for HTTPS is authentication of the accessed website and protection of the privacy and integrity of the exchanged data while in transit. It protects against man-in-the-middle attacks. The bidirectional encryption of communications between a client and server provides a reasonable assurance that one is communicating without interference by attackers with the website that one intended to communicate with, as opposed to an impostor.
Public Key Infrastructure
According to Comodo, both the TLS and SSL protocols use what is known as an asymmetric Public Key Infrastructure (PKI) system. An asymmetric system uses two keys to encrypt communications, a public key and a private key. Anything encrypted with the public key can only be decrypted by the private key and vice-versa.
Again, according to Wikipedia, a PKI is an arrangement that binds public keys with respective identities of entities, like people and organizations. The binding is established through a process of registration and issuance of certificates at and by a certificate authority (CA).
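As a quick, minimal illustration of this public/private key relationship, we can use OpenSSL from the command line; the file names below are arbitrary.

# Generate a 2048-bit RSA private key and extract its public key
openssl genrsa -out demo_private.pem 2048
openssl rsa -in demo_private.pem -pubout -out demo_public.pem

# Encrypt a short message with the public key...
echo 'hello, storefront api' > message.txt
openssl rsautl -encrypt -pubin -inkey demo_public.pem -in message.txt -out message.enc

# ...and decrypt it with the private key
openssl rsautl -decrypt -inkey demo_private.pem -in message.enc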
SSL/TLS Digital Certificate
Again, according to Comodo, when you request an HTTPS connection to a webpage, the website will initially send its SSL certificate to your browser. This certificate contains the public key needed to begin the secure session. Based on this initial exchange, your browser and the website then initiate the SSL handshake (actually, TLS handshake). The handshake involves the generation of shared secrets to establish a uniquely secure connection between yourself and the website. When a trusted SSL digital certificate is used during an HTTPS connection, users will see the padlock icon in the browser’s address bar.
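Later in this post, once HTTPS is enabled on the gateway, you can observe this exchange yourself with OpenSSL's s_client, which prints the certificate chain, the handshake steps, and the negotiated cipher for the post's API endpoint.

openssl s_client -connect api.dev.storefront-demo.com:443 \
  -servername api.dev.storefront-demo.com < /dev/null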
Registered Domain
In order to secure an SSL Digital Certificate, required to enable HTTPS with the GKE cluster, we must first have a registered domain name. For the last post, and this post, I am using my own personal domain, storefront-demo.com. The domain's primary A record ('@') and all sub-domain A records, such as api.dev, resolve to the external IP address on the front-end of the GCP load balancer.
For DNS hosting, I happen to be using Azure DNS to host the domain, storefront-demo.com. All DNS hosting services work basically the same way, whether you choose Azure, AWS, GCP, or another third-party provider.
Let’s Encrypt
If you have used Let’s Encrypt before, then you know how easy it is to get free SSL/TLS Certificates. Let’s Encrypt is the first free, automated, and open certificate authority (CA) brought to you by the non-profit Internet Security Research Group (ISRG).
According to Let’s Encrypt, to enable HTTPS on your website, you need to get a certificate from a Certificate Authority (CA); Let’s Encrypt is a CA. In order to get a certificate for your website’s domain from Let’s Encrypt, you have to demonstrate control over the domain. With Let’s Encrypt, you do this using software that uses the ACME protocol, which typically runs on your web host. If you have generated certificates with Let’s Encrypt, you also know that performing domain validation by installing and running the Certbot ACME client can be a bit daunting, depending on your level of access and technical expertise.
SSL For Free
This is where SSL For Free comes in. SSL For Free acts as a proxy of sorts to Let’s Encrypt. SSL For Free generates certificates using their ACME server by using domain validation. Private Keys are generated in your browser and never transmitted.
SSL For Free offers three domain validation methods:
- Automatic FTP Verification: Enter FTP information to automatically verify the domain;
- Manual Verification: Upload verification files manually to your domain to verify ownership;
- Manual Verification (DNS): Add TXT records to your DNS server
The third domain validation method, manual verification using DNS, is extremely easy if you have access to your domain’s DNS recordset.
SSL For Free provides TXT records for each domain you are adding to the certificate. Below, I am adding a single domain to the certificate.

Add the TXT records to your domain’s recordset. Shown below is an example of a single TXT record that has been added to my recordset using the Azure DNS service.

SSL For Free then uses the TXT record to validate that your domain is actually yours.
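Before requesting validation, it can help to confirm the new TXT record has propagated, for example with dig. The record name below is only illustrative; use the exact name SSL For Free provides.

dig txt _acme-challenge.api.dev.storefront-demo.com +short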
With the TXT record in place and validation successful, you can download a zipped package containing the certificate, private key, and CA bundle. The CA bundle contains the CA’s intermediate and root certificates.
Decoding PEM Encoded SSL Certificate
Using a tool like SSL Shopper’s Certificate Decoder, we can decode our Privacy-Enhanced Mail (PEM) encoded SSL certificates and view all of the certificate’s information. Decoding the information contained in my certificate.crt, I see the following.
Certificate Information:
Common Name: api.dev.storefront-demo.com
Subject Alternative Names: api.dev.storefront-demo.com
Valid From: December 26, 2018
Valid To: March 26, 2019
Issuer: Let's Encrypt Authority X3, Let's Encrypt
Serial Number: 03a5ec86bf79de65fb679ee7741ba07df1e4
Decoding the information contained in my ca_bundle.crt, I see the following.
Certificate Information:
Common Name: Let's Encrypt Authority X3
Organization: Let's Encrypt
Country: US
Valid From: March 17, 2016
Valid To: March 17, 2021
Issuer: DST Root CA X3, Digital Signature Trust Co.
Serial Number: 0a0141420000015385736a0b85eca708
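The same details can also be pulled locally with OpenSSL, without a third-party decoder.

openssl x509 -in certificate.crt -noout -subject -issuer -dates -serial
openssl x509 -in ca_bundle.crt -noout -subject -issuer -dates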
The Let’s Encrypt intermediate certificate is also cross-signed by another certificate authority, IdenTrust, whose root is already trusted in all major browsers. IdenTrust cross-signs the Let’s Encrypt intermediate certificate using their DST Root CA X3, which is why DST Root CA X3 appears as the Issuer, shown above.
Configure Istio Ingress Gateway
Unzip the sslforfree.zip package and place the individual files in a location you can access from the command line.
unzip -l ~/Downloads/sslforfree.zip
Archive:  /Users/garystafford/Downloads/sslforfree.zip
  Length      Date    Time    Name
---------  ---------- -----   ----
     1943  12-26-2018 18:35   certificate.crt
     1707  12-26-2018 18:35   private.key
     1646  12-26-2018 18:35   ca_bundle.crt
---------                     -------
     5296                     3 files
Following the process outlined in the Istio documentation, Securing Gateways with HTTPS, run the following command. This will place the istio-ingressgateway-certs Secret in the istio-system namespace, on the GKE cluster.
kubectl create -n istio-system secret tls istio-ingressgateway-certs \
  --key path_to_files/sslforfree/private.key \
  --cert path_to_files/sslforfree/certificate.crt
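Before modifying the Gateway, it is worth a quick sanity check that the Secret exists and contains both the certificate and key entries.

kubectl -n istio-system get secret istio-ingressgateway-certs
kubectl -n istio-system describe secret istio-ingressgateway-certs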
Modify the existing Istio Gateway from the previous project, istio-gateway.yaml. Remove the HTTP port configuration item and replace it with the HTTPS protocol item (gist). Redeploy the Istio Gateway to the GKE cluster.
apiVersion: networking.istio.io/v1alpha3
kind: Gateway
metadata:
  name: storefront-gateway
spec:
  selector:
    istio: ingressgateway
  servers:
  # - port:
  #     number: 80
  #     name: http
  #     protocol: HTTP
  #   hosts:
  #   - api.dev.storefront-demo.com
  - port:
      number: 443
      name: https
      protocol: HTTPS
    tls:
      mode: SIMPLE
      serverCertificate: /etc/istio/ingressgateway-certs/tls.crt
      privateKey: /etc/istio/ingressgateway-certs/tls.key
    hosts:
    - api.dev.storefront-demo.com
---
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: storefront-dev
spec:
  hosts:
  - api.dev.storefront-demo.com
  gateways:
  - storefront-gateway
  http:
  - match:
    - uri:
        prefix: /accounts
    route:
    - destination:
        port:
          number: 8080
        host: accounts.dev.svc.cluster.local
  - match:
    - uri:
        prefix: /fulfillment
    route:
    - destination:
        port:
          number: 8080
        host: fulfillment.dev.svc.cluster.local
  - match:
    - uri:
        prefix: /orders
    route:
    - destination:
        port:
          number: 8080
        host: orders.dev.svc.cluster.local
By deploying the new istio-ingressgateway-certs Secret and redeploying the Gateway, the certificate and private key were deployed to the /etc/istio/ingressgateway-certs/ directory of the istio-proxy container, running on the istio-ingressgateway Pod. To confirm both the certificate and private key were deployed correctly, run the following command.
kubectl exec -it -n istio-system \
  $(kubectl -n istio-system get pods \
    -l istio=ingressgateway \
    -o jsonpath='{.items[0].metadata.name}') \
  -- ls -l /etc/istio/ingressgateway-certs/

lrwxrwxrwx 1 root root 14 Jan  2 17:53 tls.crt -> ..data/tls.crt
lrwxrwxrwx 1 root root 14 Jan  2 17:53 tls.key -> ..data/tls.key
That’s it. We should now have simple TLS enabled on the Istio Gateway, providing bidirectional encryption of communications between a client (Storefront API consumer) and server (Storefront API running on the GKE cluster). Users accessing the API will now have to use HTTPS.
Confirm HTTPS is Working
After completing the deployment, as outlined in the previous post, first test the Storefront API using HTTP. Since we removed the HTTP port configuration item from the Istio Gateway, the HTTP request should fail with a connection refused error. Insecure traffic is no longer allowed by the Storefront API.
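From the command line, the same check might look like the following; the exact error you see may vary depending on how the load balancer handles port 80.

# Expect this request to fail, since the Gateway no longer exposes port 80
curl -v http://api.dev.storefront-demo.com/accounts/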
Now try switching from HTTP to HTTPS. The page should be displayed, and the black lock icon should appear in the browser’s address bar. Clicking on the lock icon, we will see that the SSL certificate used by the GKE cluster is valid.
By clicking on the valid certificate indicator, we may observe more details about the SSL certificate used to secure the Storefront API. Observe the certificate is issued by Let’s Encrypt Authority X3. It is valid for 90 days from its time of issuance; Let’s Encrypt only issues certificates with a 90-day lifetime. Also observe the certificate is signed using SHA-256 with RSA (Rivest–Shamir–Adleman).
In Chrome, we can also use the Developer Tools Security tab to inspect the certificate. The certificate is recognized as valid and trusted. Also important, note the connection to this Storefront API is encrypted and authenticated using TLS 1.2 (a strong protocol), ECDHE_RSA with X25519 (a strong key exchange), and AES_128_GCM (a strong cipher). According to How’s My SSL?, TLS 1.2 is the latest version of TLS. The TLS 1.2 protocol provides access to advanced cipher suites that support elliptic-curve cryptography and AEAD block cipher modes. TLS 1.2 is an improvement over the earlier TLS 1.1, TLS 1.0, and SSLv3 protocols.
Lastly, the best way to really understand what is happening with HTTPS, the Storefront API, and Istio is to verbosely curl an API endpoint.

curl -Iv https://api.dev.storefront-demo.com/accounts/

Using the above curl command, we can see exactly how the client successfully verifies the server, negotiates a secure HTTP/2 connection (HTTP/2 over TLS 1.2), and makes a request (gist).
- Line 3: DNS resolution of the URL to the external IP address of the GCP load-balancer
- Line 3: HTTPS traffic is routed to TCP port 443
- Lines 4 – 5: Application-Layer Protocol Negotiation (ALPN) starts to occur with the server
- Lines 7 – 9: Certificate verify locations successfully set
- Lines 10 – 20: TLS handshake is performed and is successful using TLS 1.2 protocol
- Line 20: CHACHA is the stream cipher and POLY1305 is the authenticator in the Transport Layer Security (TLS) 1.2 protocol ChaCha20-Poly1305 Cipher Suite
- Lines 22 – 27: SSL certificate details
- Line 28: Certificate verified
- Lines 29 – 38: Establishing HTTP/2 connection with the server
- Lines 33 – 36: Request headers
- Lines 39 – 46: Response headers containing the expected 204 HTTP return code
* Trying 35.226.121.90…
* TCP_NODELAY set
* Connected to api.dev.storefront-demo.com (35.226.121.90) port 443 (#0)
* ALPN, offering h2
* ALPN, offering http/1.1
* Cipher selection: ALL:!EXPORT:!EXPORT40:!EXPORT56:!aNULL:!LOW:!RC4:@STRENGTH
* successfully set certificate verify locations:
*   CAfile: /etc/ssl/cert.pem
    CApath: none
* TLSv1.2 (OUT), TLS handshake, Client hello (1):
* TLSv1.2 (IN), TLS handshake, Server hello (2):
* TLSv1.2 (IN), TLS handshake, Certificate (11):
* TLSv1.2 (IN), TLS handshake, Server key exchange (12):
* TLSv1.2 (IN), TLS handshake, Server finished (14):
* TLSv1.2 (OUT), TLS handshake, Client key exchange (16):
* TLSv1.2 (OUT), TLS change cipher, Client hello (1):
* TLSv1.2 (OUT), TLS handshake, Finished (20):
* TLSv1.2 (IN), TLS change cipher, Client hello (1):
* TLSv1.2 (IN), TLS handshake, Finished (20):
* SSL connection using TLSv1.2 / ECDHE-RSA-CHACHA20-POLY1305
* ALPN, server accepted to use h2
* Server certificate:
*  subject: CN=api.dev.storefront-demo.com
*  start date: Dec 26 22:35:31 2018 GMT
*  expire date: Mar 26 22:35:31 2019 GMT
*  subjectAltName: host "api.dev.storefront-demo.com" matched cert's "api.dev.storefront-demo.com"
*  issuer: C=US; O=Let's Encrypt; CN=Let's Encrypt Authority X3
*  SSL certificate verify ok.
* Using HTTP2, server supports multi-use
* Connection state changed (HTTP/2 confirmed)
* Copying HTTP/2 data in stream buffer to connection buffer after upgrade: len=0
* Using Stream ID: 1 (easy handle 0x7ff997006600)
> HEAD /accounts/ HTTP/2
> Host: api.dev.storefront-demo.com
> User-Agent: curl/7.54.0
> Accept: */*
>
* Connection state changed (MAX_CONCURRENT_STREAMS updated)!
< HTTP/2 204
HTTP/2 204
< date: Fri, 04 Jan 2019 03:42:14 GMT
date: Fri, 04 Jan 2019 03:42:14 GMT
< x-envoy-upstream-service-time: 23
x-envoy-upstream-service-time: 23
< server: envoy
server: envoy
<
* Connection #0 to host api.dev.storefront-demo.com left intact
Mutual TLS
Istio also supports mutual authentication using the TLS protocol, known as mutual TLS authentication (mTLS), between external clients and the gateway, as outlined in the Istio 1.0 documentation. According to Wikipedia, mutual authentication or two-way authentication refers to two parties authenticating each other at the same time. Mutual authentication is a default mode of authentication in some protocols (IKE, SSH), but optional in TLS.
Again, according to Wikipedia, by default, TLS only proves the identity of the server to the client using X.509 certificates. The authentication of the client to the server is left to the application layer. TLS also offers client-to-server authentication using client-side X.509 authentication. Because it requires provisioning of the certificates to the clients and involves a less user-friendly experience, it is rarely used in end-user applications. Mutual TLS is much more widespread in B2B applications, where a limited number of programmatic clients connect to specific web services. The operational burden is limited, and security requirements are usually much higher as compared to consumer environments.
This form of mutual authentication would be beneficial if we had external applications or other services outside our GKE cluster, consuming our API. Using mTLS, we could further enhance the security of those types of interactions.
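As a rough sketch, if the Gateway’s tls mode were changed from SIMPLE to MUTUAL and a client certificate were issued, an external consumer would have to present that certificate with each request; the file names below are hypothetical.

curl -Iv https://api.dev.storefront-demo.com/accounts/ \
  --cacert ca-chain.cert.pem \
  --cert client.cert.pem \
  --key client.key.pem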
All opinions expressed in this post are my own and not necessarily the views of my current or past employers or their clients.
Big Data Analytics with Java and Python, using Cloud Dataproc, Google’s Fully-Managed Spark and Hadoop Service
Posted by Gary A. Stafford in Big Data, Cloud, GCP, Java Development, Python, Software Development, Technology Consulting on December 11, 2018
There is little question that big data analytics, data science, artificial intelligence (AI), and machine learning (ML), a subcategory of AI, have all experienced a tremendous surge in popularity over the last few years. Behind the hype curves and marketing buzz, these technologies are having a significant influence on all aspects of our modern lives.
However, installing, configuring, and managing the technologies that support big data analytics, data science, ML, and AI, at scale and in Production, often demands an advanced level of familiarity with Linux, distributed systems, cloud- and container-based platforms, databases, and data-streaming applications. The mere ability to manage terabytes and petabytes of transient data is beyond the capability of many enterprises, let alone performing analysis of that data.
To ease the burden of implementing these technologies, the three major cloud providers, AWS, Azure, and Google Cloud, all have multiple Big Data Analytics-, AI-, and ML-as-a-Service offerings. In this post, we will explore one such cloud-based service offering in the field of big data analytics, Google Cloud Dataproc. We will focus on Cloud Dataproc’s ability to quickly and efficiently run Spark jobs written in Java and Python, two widely adopted enterprise programming languages.
Featured Technologies
The following technologies are featured prominently in this post.
Google Cloud Dataproc
According to Google, Cloud Dataproc is a fast, easy-to-use, fully-managed cloud service for running the Apache Spark and Apache Hadoop ecosystem on Google Cloud Platform. Dataproc is a complete platform for data processing, analytics, and machine learning. Dataproc offers per-second billing, so you only pay for exactly the resources you consume. Dataproc offers frequently updated and native versions of Apache Spark, Hadoop, Pig, and Hive, as well as other related applications. Dataproc has built-in integrations with other Google Cloud Platform (GCP) services, such as Cloud Storage, BigQuery, Bigtable, Stackdriver Logging, and Stackdriver Monitoring. Dataproc’s clusters are configurable and resizable from three to hundreds of nodes, and each cluster action takes less than 90 seconds on average.
Similar Platform as a Service (PaaS) offerings to Dataproc, include Amazon Elastic MapReduce (EMR), Microsoft Azure HDInsight, and Qubole Data Service. Qubole is offered on AWS, Azure, and Oracle Cloud Infrastructure (Oracle OCI).
According to Google, Cloud Dataproc and Cloud Dataflow, both part of GCP’s Data Analytics/Big Data Product offerings, can both be used for data processing, and there’s overlap in their batch and streaming capabilities. Cloud Dataflow is a fully-managed service for transforming and enriching data in stream and batch modes. Dataflow uses the Apache Beam SDK to provide developers with Java and Python APIs, similar to Spark.
Apache Spark
According to Apache, Spark is a unified analytics engine for large-scale data processing, used by well-known, modern enterprises, such as Netflix, Yahoo, and eBay. With in-memory speeds up to 100x faster than Hadoop, Apache Spark achieves high performance for static, batch, and streaming data, using a state-of-the-art DAG (Directed Acyclic Graph) scheduler, a query optimizer, and a physical execution engine.
According to a post by DataFlair, ‘the DAG in Apache Spark is a set of Vertices and Edges, where vertices represent the RDDs and the edges represent the Operation to be applied on RDD (Resilient Distributed Dataset). In Spark DAG, every edge directs from earlier to later in the sequence. On the calling of Action, the created DAG submits to DAG Scheduler which further splits the graph into the stages of the task.’ Below, we see a three-stage DAG visualization, displayed using the Spark History Server Web UI, from a job demonstrated in this post.
Spark’s polyglot programming model allows users to write applications in Scala, Java, Python, R, and SQL. Spark includes libraries for Spark SQL (DataFrames and Datasets), MLlib (Machine Learning), GraphX (Graph Processing), and DStreams (Spark Streaming). Spark may be run using its standalone cluster mode or on Apache Hadoop YARN, Mesos, and Kubernetes.
PySpark
The Spark Python API, PySpark, exposes the Spark programming model to Python. PySpark is built on top of Spark’s Java API. Data is processed in Python and cached and shuffled in the JVM. According to Apache, Py4J enables Python programs running in a Python interpreter to dynamically access Java objects in a JVM.
Apache Hadoop
According to Apache, the Apache Hadoop project develops open-source software for reliable, scalable, distributed computing. The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. This is a rather modest description of such a significant and transformative project. When we talk about Hadoop, often it is in the context of the project’s well-known modules, which includes:
- Hadoop Common: The common utilities that support the other Hadoop modules
- Hadoop Distributed File System (HDFS): A distributed file system that provides high-throughput access to application data
- Hadoop YARN (Yet Another Resource Negotiator): A framework for job scheduling and cluster resource management, also known as ‘Hadoop NextGen’
- Hadoop MapReduce: A YARN-based system for parallel processing of large datasets
- Hadoop Ozone: An object store for Hadoop
Based on the Powered by Apache Hadoop list, there are many well-known enterprises and academic institutions using Apache Hadoop, including Adobe, eBay, Facebook, Hulu, LinkedIn, and The New York Times.
Spark vs. Hadoop
There are many articles and posts that delve into the Spark versus Hadoop debate; this post is not one of them. Although both are mature technologies, Spark, the new kid on the block, reached version 1.0.0 in May 2014, whereas Hadoop reached version 1.0.0 earlier, in December 2011. According to Google Trends, interest in both technologies has remained relatively high over the last three years. However, interest in Spark, based on the volume of searches, has been steadily outpacing Hadoop for well over a year now. The in-memory speed of Spark over HDFS-based Hadoop and the ease of Spark SQL for working with structured data are likely big differentiators for many users coming from a traditional relational database background and users with large or streaming datasets requiring near real-time processing.
In this post, all examples are built to run on Spark. This is not meant to suggest Spark is necessarily superior or that Spark runs better on Dataproc than Hadoop. In fact, Dataproc’s implementation of Spark relies on Hadoop’s core HDFS and YARN technologies to run.
Demonstration
To show the capabilities of Cloud Dataproc, we will create both a single-node Dataproc cluster and three-node cluster, upload Java- and Python-based analytics jobs and data to Google Cloud Storage, and execute the jobs on the Spark cluster. Finally, we will enable monitoring and notifications for the Dataproc clusters and the jobs running on the clusters with Stackdriver. The post will demonstrate the use of the Google Cloud Console, as well as Google’s Cloud SDK’s command line tools, for all tasks.
In this post, we will be uploading and running individual jobs on the Dataproc Spark cluster, as opposed to using the Cloud Dataproc Workflow Templates. According to Google, a Workflow Template is a reusable workflow configuration. It defines a graph of jobs with information on where to run those jobs. Workflow Templates are useful for automating your Dataproc workflows; however, automation is not the primary topic of this post.
Source Code
All open-sourced code for this post can be found on GitHub in two repositories, one for Java with Spark and one for Python with PySpark. Source code samples are displayed as GitHub Gists, which may not display correctly on all mobile and social media browsers.
Cost
Of course, there is a cost associated with provisioning cloud services. However, if you manage the Google Cloud Dataproc resources prudently, the costs are negligible. Regarding pricing, according to Google, Cloud Dataproc pricing is based on the size of Cloud Dataproc clusters and the duration of time that they run. The size of a cluster is based on the aggregate number of virtual CPUs (vCPUs) across the entire cluster, including the master and worker nodes. The duration of a cluster is the length of time, measured in minutes, between cluster creation and cluster deletion.
Over the course of writing the code for this post, as well as writing the post itself, the entire cost of all the related resources was a minuscule US$7.50. The cost includes creating, running, and deleting more than a dozen Dataproc clusters and uploading and executing approximately 75-100 Spark and PySpark jobs. Given the quick creation time of a cluster, 2 minutes on average or less in this demonstration, there is no reason to leave a cluster running longer than it takes to complete your workloads.
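For reference, deleting a cluster when you are finished with it is a single command; substitute your own cluster name and region.

gcloud dataproc clusters delete your_cluster_name --region us-east1 --quiet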
Kaggle Datasets
To explore the features of Dataproc, we will use a publicly-available dataset from Kaggle. Kaggle is a popular resource for open datasets used for big-data and ML applications. Their tagline is ‘Kaggle is the place to do data science projects’.
For this demonstration, I chose the IBRD Statement Of Loans Data dataset, from World Bank Financial Open Data, and available on Kaggle. The International Bank for Reconstruction and Development (IBRD) loans are public and publicly guaranteed debt extended by the World Bank Group. IBRD loans are made to, or guaranteed by, countries that are members of IBRD. This dataset contains historical snapshots of the Statement of Loans including the latest available snapshots.
There are two data files available. The ‘Statement of Loans’ latest available snapshots data file contains 8,713 rows of loan data (~3 MB), ideal for development and testing. The ‘Statement of Loans’ historic data file contains approximately 750,000 rows of data (~265 MB). Although not exactly ‘big data’, the historic dataset is large enough to sufficiently explore Dataproc. Both IBRD files have an identical schema with 33 columns of data (gist).
End of Period | Loan Number | Region | Country Code | Country | Borrower | Guarantor Country Code | Guarantor | Loan Type | Loan Status | Interest Rate | Currency of Commitment | Project ID | Project Name | Original Principal Amount | Cancelled Amount | Undisbursed Amount | Disbursed Amount | Repaid to IBRD | Due to IBRD | Exchange Adjustment | Borrower's Obligation | Sold 3rd Party | Repaid 3rd Party | Due 3rd Party | Loans Held | First Repayment Date | Last Repayment Date | Agreement Signing Date | Board Approval Date | Effective Date (Most Recent) | Closed Date (Most Recent) | Last Disbursement Date
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
2018-10-31T00:00:00 | IBRD00010 | EUROPE AND CENTRAL ASIA | FR | France | CREDIT NATIONAL | FR | France | NON POOL | Fully Repaid | 4.2500 | P037383 | RECONSTRUCTION | 250000000.00 | 0.00 | 0.00 | 250000000.00 | 38000.00 | 0.00 | 0.00 | 0.00 | 249962000.00 | 249962000.00 | 0.00 | 0.00 | 1952-11-01T00:00:00 | 1977-05-01T00:00:00 | 1947-05-09T00:00:00 | 1947-05-09T00:00:00 | 1947-06-09T00:00:00 | 1947-12-31T00:00:00 | |||
2018-10-31T00:00:00 | IBRD00020 | EUROPE AND CENTRAL ASIA | NL | Netherlands | NON POOL | Fully Repaid | 4.2500 | P037452 | RECONSTRUCTION | 191044211.75 | 0.00 | 0.00 | 191044211.75 | 103372211.75 | 0.00 | 0.00 | 0.00 | 87672000.00 | 87672000.00 | 0.00 | 0.00 | 1952-04-01T00:00:00 | 1972-10-01T00:00:00 | 1947-08-07T00:00:00 | 1947-08-07T00:00:00 | 1947-09-11T00:00:00 | 1948-03-31T00:00:00 | ||||||
2018-10-31T00:00:00 | IBRD00021 | EUROPE AND CENTRAL ASIA | NL | Netherlands | NON POOL | Fully Repaid | 4.2500 | P037452 | RECONSTRUCTION | 3955788.25 | 0.00 | 0.00 | 3955788.25 | 0.00 | 0.00 | 0.00 | 0.00 | 3955788.25 | 3955788.25 | 0.00 | 0.00 | 1953-04-01T00:00:00 | 1954-04-01T00:00:00 | 1948-05-25T00:00:00 | 1947-08-07T00:00:00 | 1948-06-01T00:00:00 | 1948-06-30T00:00:00 | ||||||
2018-10-31T00:00:00 | IBRD00030 | EUROPE AND CENTRAL ASIA | DK | Denmark | NON POOL | Fully Repaid | 4.2500 | P037362 | RECONSTRUCTION | 40000000.00 | 0.00 | 0.00 | 40000000.00 | 17771000.00 | 0.00 | 0.00 | 0.00 | 22229000.00 | 22229000.00 | 0.00 | 0.00 | 1953-02-01T00:00:00 | 1972-08-01T00:00:00 | 1947-08-22T00:00:00 | 1947-08-22T00:00:00 | 1947-10-17T00:00:00 | 1949-03-31T00:00:00 | ||||||
2018-10-31T00:00:00 | IBRD00040 | EUROPE AND CENTRAL ASIA | LU | Luxembourg | NON POOL | Fully Repaid | 4.2500 | P037451 | RECONSTRUCTION | 12000000.00 | 238016.98 | 0.00 | 11761983.02 | 1619983.02 | 0.00 | 0.00 | 0.00 | 10142000.00 | 10142000.00 | 0.00 | 0.00 | 1949-07-15T00:00:00 | 1972-07-15T00:00:00 | 1947-08-28T00:00:00 | 1947-08-28T00:00:00 | 1947-10-24T00:00:00 | 1949-03-31T00:00:00 | ||||||
2018-10-31T00:00:00 | IBRD00050 | LATIN AMERICA AND CARIBBEAN | CL | Chile | Ministry of Finance | CL | Chile | NON POOL | Fully Repaid | 4.5000 | P006578 | POWER | 13500000.00 | 0.00 | 0.00 | 13500000.00 | 12167000.00 | 0.00 | 0.00 | 0.00 | 1333000.00 | 1333000.00 | 0.00 | 0.00 | 1953-07-01T00:00:00 | 1968-07-01T00:00:00 | 1948-03-25T00:00:00 | 1948-03-25T00:00:00 | 1949-04-07T00:00:00 | 1954-12-31T00:00:00 | |||
2018-10-31T00:00:00 | IBRD00060 | LATIN AMERICA AND CARIBBEAN | CL | Chile | Ministry of Finance | CL | Chile | NON POOL | Fully Repaid | 3.7500 | P006577 | FOMENTO AGRIC CREDIT | 2500000.00 | 0.00 | 0.00 | 2500000.00 | 755000.00 | 0.00 | 0.00 | 0.00 | 1745000.00 | 1745000.00 | 0.00 | 0.00 | 1950-07-01T00:00:00 | 1955-01-01T00:00:00 | 1948-03-25T00:00:00 | 1948-03-25T00:00:00 | 1949-04-07T00:00:00 | 1950-01-01T00:00:00 | |||
2018-10-31T00:00:00 | IBRD00070 | EUROPE AND CENTRAL ASIA | NL | Netherlands | NON POOL | Fully Repaid | 3.5625 | P037453 | SHIPPING I | 2000000.00 | 0.00 | 0.00 | 2000000.00 | 0.00 | 0.00 | 0.00 | 0.00 | 2000000.00 | 2000000.00 | 0.00 | 0.00 | 1949-01-15T00:00:00 | 1958-07-15T00:00:00 | 1948-07-15T00:00:00 | 1948-05-21T00:00:00 | 1948-08-03T00:00:00 | 1948-08-03T00:00:00 | ||||||
2018-10-31T00:00:00 | IBRD00071 | EUROPE AND CENTRAL ASIA | NL | Netherlands | NON POOL | Fully Repaid | 3.5625 | P037453 | SHIPPING I | 2000000.00 | 0.00 | 0.00 | 2000000.00 | 0.00 | 0.00 | 0.00 | 0.00 | 2000000.00 | 2000000.00 | 0.00 | 0.00 | 1949-01-15T00:00:00 | 1958-07-15T00:00:00 | 1948-07-15T00:00:00 | 1948-05-21T00:00:00 | 1948-08-03T00:00:00 | 1948-08-03T00:00:00 |
In this demonstration, both the Java and Python jobs will perform the same simple analysis of the larger historic dataset. For the analysis, we will ascertain the top 25 historic IBRD borrowers and determine their total loan disbursements, current loan obligations, and the average interest rates they were charged for all loans. This simple analysis will be performed using Spark’s SQL capabilities. The results of the analysis, a Spark DataFrame containing 25 rows, will be saved as a CSV-format data file.
SELECT country, country_code,
       Format_number(total_disbursement, 0) AS total_disbursement,
       Format_number(total_obligation, 0) AS total_obligation,
       Format_number(avg_interest_rate, 2) AS avg_interest_rate
FROM   (SELECT country, country_code,
               Sum(disbursed)     AS total_disbursement,
               Sum(obligation)    AS total_obligation,
               Avg(interest_rate) AS avg_interest_rate
        FROM   loans
        GROUP  BY country, country_code
        ORDER  BY total_disbursement DESC
        LIMIT  25)
Google Cloud Storage
First, we need a location to store our Spark jobs, data files, and results, which will be accessible to Dataproc. Although there are a number of choices, the simplest and most convenient location for Dataproc is a Google Cloud Storage bucket. According to Google, Cloud Storage offers the highest level of availability and performance within a single region and is ideal for compute, analytics, and ML workloads in a particular region. Cloud Storage buckets are nearly identical to Amazon Simple Storage Service (Amazon S3), their object storage service.
Using the Google Cloud Console, Google’s Web Admin UI, create a new, uniquely named Cloud Storage bucket. Our Dataproc clusters will eventually be created in a single regional location. We need to ensure our new bucket is created in the same regional location as the clusters; I chose us-east1.
We will need the new bucket’s link, to use within the Java and Python code, as well as from the command line with gsutil. The gsutil tool is a Python application that lets you access Cloud Storage from the command line. The bucket’s link may be found on the Storage Browser Console’s Overview tab. A bucket’s link is always in the format gs://bucket-name.
Alternatively, we may also create the Cloud Storage bucket using gsutil with the make buckets (mb) command, as follows:
# Always best practice since features are updated frequently
gcloud components update

export PROJECT=your_project_name
export REGION=us-east1
export BUCKET_NAME=gs://your_bucket_name

# Make sure you are creating resources in the correct project
gcloud config set project $PROJECT

gsutil mb -p $PROJECT -c regional -l $REGION $BUCKET_NAME
Cloud Dataproc Cluster
Next, we will create two different Cloud Dataproc clusters for demonstration purposes. If you have not used Cloud Dataproc previously in your GCP Project, you will first need to enable the API for Cloud Dataproc.
Single Node Cluster
We will start with a single node cluster with no worker nodes, suitable for development and testing Spark and Hadoop jobs, using small datasets. Create a single-node Dataproc cluster using the Single Node Cluster mode option. Create the cluster in the same region as the new Cloud Storage bucket. This will allow the Dataproc cluster access to the bucket without additional security or IAM configuration. I used the n1-standard-1 machine type, with 1 vCPU and 3.75 GB of memory. Observe the resources assigned to Hadoop YARN for Spark job scheduling and cluster resource management.
The new cluster, consisting of a single node and no worker nodes, should be ready for use in a few minutes or less.
Note the Image version, 1.3.16-deb9. According to Google, Dataproc uses image versions to bundle operating system, big data components, and Google Cloud Platform connectors into one package that is deployed on a cluster. This image, released in November 2018, is the latest available version at the time of this post. The image contains:
- Apache Spark 2.3.1
- Apache Hadoop 2.9.0
- Apache Pig 0.17.0
- Apache Hive 2.3.2
- Apache Tez 0.9.0
- Cloud Storage connector 1.9.9-hadoop2
- Scala 2.11.8
- Python 2.7
To avoid lots of troubleshooting, make sure your code is compatible with the image’s versions. It is important to note the image does not contain a version of Python 3. You will need to ensure your Python code is built to run with Python 2.7. Alternatively, use Dataproc’s --initialization-actions flag along with bootstrap and setup shell scripts to install Python 3 on the cluster using pip or conda. Tips for installing Python 3 on Dataproc can be found on Stack Overflow and elsewhere on the Internet.
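A rough sketch of that approach might look like the following; the initialization script path and name are hypothetical, and the script itself would be responsible for installing and configuring Python 3 on the cluster’s nodes.

gcloud dataproc clusters create your_single_node_cluster_name \
  --region us-east1 \
  --zone us-east1-b \
  --single-node \
  --image-version 1.3-deb9 \
  --initialization-actions gs://your_bucket_name/install-python3.sh \
  --project your_project_name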
As an alternative to the Google Cloud Console, we are able to create the cluster using a REST command. Google provides the Google Cloud Console’s equivalent REST request, as shown in the example below.
Additionally, we have the option of using the gcloud command line tool. This tool provides the primary command-line interface to Google Cloud Platform and is part of Google’s Cloud SDK, which also includes the aforementioned gsutil. Here again, Google provides the Google Cloud Console’s equivalent gcloud command. This is a great way to learn to use the command line.

Using the dataproc clusters create command, we are able to create the same cluster as shown above from the command line, as follows:
export PROJECT=your_project_name
export CLUSTER_1=your_single_node_cluster_name
export REGION=us-east1
export ZONE=us-east1-b
export MACHINE_TYPE_SMALL=n1-standard-1

gcloud dataproc clusters create $CLUSTER_1 \
  --region $REGION \
  --zone $ZONE \
  --single-node \
  --master-machine-type $MACHINE_TYPE_SMALL \
  --master-boot-disk-size 500 \
  --image-version 1.3-deb9 \
  --project $PROJECT
There are a few useful commands to inspect your running Dataproc clusters. The dataproc clusters describe command, in particular, provides detailed information about all aspects of the cluster’s configuration and current state.
gcloud dataproc clusters list --region $REGION

gcloud dataproc clusters describe $CLUSTER_2 \
  --region $REGION --format json
Standard Cluster
In addition to the single node cluster, we will create a second three-node Dataproc cluster. We will compare the speed of a single-node cluster to that of a true cluster with multiple worker nodes. Create a new Dataproc cluster using the Standard Cluster mode option. Again, make sure to create the cluster in the same region as the new Storage bucket.
The second cluster contains a single master node and two worker nodes. All three nodes use the n1-standard-4 machine type, with 4 vCPU and 15 GB of memory. Although still considered a minimally-sized cluster, this cluster represents a significant increase in compute power over the first single-node cluster, which had a total of 1 vCPU, 3.75 GB of memory, and no worker nodes on which to distribute processing. Between the two workers in the second cluster, we have 8 vCPU and 30 GB of memory for computation.
Again, we have the option of using the gcloud command line tool to create the cluster:
export PROJECT=your_project_name
export CLUSTER_2=your_three_node_cluster_name
export REGION=us-east1
export ZONE=us-east1-b
export NUM_WORKERS=2
export MACHINE_TYPE_LARGE=n1-standard-4

gcloud dataproc clusters create $CLUSTER_2 \
  --region $REGION \
  --zone $ZONE \
  --master-machine-type $MACHINE_TYPE_LARGE \
  --master-boot-disk-size 500 \
  --num-workers $NUM_WORKERS \
  --worker-machine-type $MACHINE_TYPE_LARGE \
  --worker-boot-disk-size 500 \
  --image-version 1.3-deb9 \
  --project $PROJECT
Cluster Creation Speed: Cloud Dataproc versus Amazon EMR
In a series of rather unscientific tests, I found the three-node Dataproc cluster took less than two minutes on average to be created. Compare that time to a similar three-node cluster built with Amazon’s EMR service using their general purpose m4.4xlarge Amazon EC2 instance type. In a similar series of tests, I found the EMR cluster took seven minutes on average to be created. The EMR cluster took 3.5 times longer to create than the comparable Dataproc cluster. Again, although not a totally accurate comparison, since both services offer different features, it gives you a sense of the speed of Dataproc as compared to Amazon EMR.
Staging Buckets
According to Google, when you create a cluster, Cloud Dataproc creates a Cloud Storage staging bucket in your project or reuses an existing Cloud Dataproc-created bucket from a previous cluster creation request. Staging buckets are used to stage miscellaneous configuration and control files that are needed by your cluster. Below, we see the staging buckets created for the two Dataproc clusters.
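The staging buckets, along with the bucket we created earlier, can also be listed from the command line.

gsutil ls -p your_project_name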
Project Files
Before uploading the jobs and running them on the Cloud Dataproc clusters, we need to understand what is included in the two GitHub projects. If you recall from the Kaggle section of the post, both projects are basically the same, but written in different languages, Java and Python. The jobs they contain all perform the same basic analysis on the dataset.
Java Project
The dataproc-java-demo Java-based GitHub project contains three classes, each of which is a job to be run by Spark. The InternationalLoansApp Java class is only intended to be run locally with the smaller 8.7K rows of data in the snapshot CSV file (gist).
package org.example.dataproc;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;

public class InternationalLoansApp {

  public static void main(String[] args) {
    InternationalLoansApp app = new InternationalLoansApp();
    app.start();
  }

  private void start() {
    SparkSession spark = SparkSession.builder()
        .appName("dataproc-java-demo")
        .master("local[*]")
        .getOrCreate();

    spark.sparkContext().setLogLevel("INFO"); // INFO by default

    // Loads CSV file from local directory
    Dataset<Row> dfLoans = spark.read()
        .format("csv")
        .option("header", "true")
        .option("inferSchema", true)
        .load("data/ibrd-statement-of-loans-latest-available-snapshot.csv");

    // Basic stats
    System.out.printf("Rows of data:%d%n", dfLoans.count());
    System.out.println("Inferred Schema:");
    dfLoans.printSchema();

    // Creates temporary view using DataFrame
    dfLoans.withColumnRenamed("Country", "country")
        .withColumnRenamed("Country Code", "country_code")
        .withColumnRenamed("Disbursed Amount", "disbursed")
        .withColumnRenamed("Borrower's Obligation", "obligation")
        .withColumnRenamed("Interest Rate", "interest_rate")
        .createOrReplaceTempView("loans");

    // Performs basic analysis of dataset
    Dataset<Row> dfDisbursement = spark.sql(
        "SELECT country, country_code, "
            + "format_number(total_disbursement, 0) AS total_disbursement, "
            + "format_number(ABS(total_obligation), 0) AS total_obligation, "
            + "format_number(avg_interest_rate, 2) AS avg_interest_rate "
            + "FROM ( "
            + "SELECT country, country_code, "
            + "SUM(disbursed) AS total_disbursement, "
            + "SUM(obligation) AS total_obligation, "
            + "AVG(interest_rate) AS avg_interest_rate "
            + "FROM loans "
            + "GROUP BY country, country_code "
            + "ORDER BY total_disbursement DESC "
            + "LIMIT 25)"
    );

    dfDisbursement.show(25, 100);

    // Calculates and displays the grand total disbursed amount
    Dataset<Row> dfGrandTotalDisbursement = spark.sql(
        "SELECT format_number(SUM(disbursed),0) AS grand_total_disbursement FROM loans"
    );
    dfGrandTotalDisbursement.show();

    // Calculates and displays the grand total remaining obligation amount
    Dataset<Row> dfGrandTotalObligation = spark.sql(
        "SELECT format_number(SUM(obligation),0) AS grand_total_obligation FROM loans"
    );
    dfGrandTotalObligation.show();

    // Saves results to a local CSV file
    dfDisbursement.repartition(1)
        .write()
        .mode(SaveMode.Overwrite)
        .format("csv")
        .option("header", "true")
        .save("data/ibrd-summary-small-java");

    System.out.println("Results successfully written to CSV file");
  }
}
On line 20, the Spark Session’s Master URL, .master("local[*]"), directs Spark to run locally with as many worker threads as logical cores on the machine. There are several options for setting the Master URL, detailed here.
On line 30, the path to the data file, and on line 84, the output path for the data file, are local relative file paths.
On lines 38–42, we do a bit of cleanup on the column names, for only those columns we are interested in for the analysis. Be warned, the column names of the IBRD data are less than ideal for SQL-based analysis, containing mixed-case characters, word spaces, and brackets.
On line 79, we call the Spark DataFrame’s repartition method, dfDisbursement.repartition(1). The repartition method allows us to recombine the results of our analysis and output a single CSV file to the bucket. Ordinarily, Spark splits the data into partitions and executes computations on the partitions in parallel. Each partition’s data is written to a separate CSV file when a DataFrame is written back to the bucket.

Using coalesce(1) or repartition(1) to recombine the resulting 25-row DataFrame on a single node is okay for the sake of this demonstration, but is not practical for recombining partitions from larger DataFrames. There are more efficient and less costly ways to manage the results of computations, depending on the intended use of the resulting data.
The InternationalLoansAppDataprocSmall class is intended to be run on the Dataproc clusters, analyzing the same smaller CSV data file. The InternationalLoansAppDataprocLarge class is also intended to be run on the Dataproc clusters; however, it analyzes the larger 750K rows of data in the IBRD historic CSV file (gist).
package org.example.dataproc;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;

public class InternationalLoansAppDataprocLarge {

  public static void main(String[] args) {
    InternationalLoansAppDataprocLarge app = new InternationalLoansAppDataprocLarge();
    app.start();
  }

  private void start() {
    SparkSession spark = SparkSession.builder()
        .appName("dataproc-java-demo")
        .master("yarn")
        .getOrCreate();

    spark.sparkContext().setLogLevel("WARN"); // INFO by default

    // Loads CSV file from Google Storage Bucket
    Dataset<Row> dfLoans = spark.read()
        .format("csv")
        .option("header", "true")
        .option("inferSchema", true)
        .load("gs://dataproc-demo-bucket/ibrd-statement-of-loans-historical-data.csv");

    // Creates temporary view using DataFrame
    dfLoans.withColumnRenamed("Country", "country")
        .withColumnRenamed("Country Code", "country_code")
        .withColumnRenamed("Disbursed Amount", "disbursed")
        .withColumnRenamed("Borrower's Obligation", "obligation")
        .withColumnRenamed("Interest Rate", "interest_rate")
        .createOrReplaceTempView("loans");

    // Performs basic analysis of dataset
    Dataset<Row> dfDisbursement = spark.sql(
        "SELECT country, country_code, "
            + "format_number(total_disbursement, 0) AS total_disbursement, "
            + "format_number(ABS(total_obligation), 0) AS total_obligation, "
            + "format_number(avg_interest_rate, 2) AS avg_interest_rate "
            + "FROM ( "
            + "SELECT country, country_code, "
            + "SUM(disbursed) AS total_disbursement, "
            + "SUM(obligation) AS total_obligation, "
            + "AVG(interest_rate) AS avg_interest_rate "
            + "FROM loans "
            + "GROUP BY country, country_code "
            + "ORDER BY total_disbursement DESC "
            + "LIMIT 25)"
    );

    // Saves results to single CSV file in Google Storage Bucket
    dfDisbursement.repartition(1)
        .write()
        .mode(SaveMode.Overwrite)
        .format("csv")
        .option("header", "true")
        .save("gs://dataproc-demo-bucket/ibrd-summary-large-java");

    System.out.println("Results successfully written to CSV file");
  }
}
On line 20, note the Spark Session’s Master URL, .master("yarn"), directs Spark to connect to a YARN cluster in client or cluster mode, depending on the value of --deploy-mode when submitting the job. The cluster location will be found based on the HADOOP_CONF_DIR or YARN_CONF_DIR variable. Recall, the Dataproc cluster runs Spark on YARN.
Also, note on line 30, the path to the data file, and on line 63, the output path for the data file, point to the Cloud Storage bucket we created earlier (.load("gs://your-bucket-name/your-data-file.csv")). Cloud Dataproc clusters automatically install the Cloud Storage connector. According to Google, there are a number of benefits to choosing Cloud Storage over traditional HDFS, including data persistence, reliability, and performance.
These are the only two differences between the local version of the Spark job and the version of the Spark job intended for Dataproc. To build the project’s JAR file, which you will later upload to the Cloud Storage bucket, compile the Java project using the gradle build command from the root of the project. For convenience, the JAR file is also included in the GitHub repository.
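For reference, the build from the project root might look like the following; this assumes the project includes the Gradle wrapper, otherwise substitute a locally installed gradle.

./gradlew clean build

# The assembled JAR should appear under build/libs/
ls -l build/libs/dataprocJavaDemo-1.0-SNAPSHOT.jar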
Python Project
The dataproc-python-demo Python-based GitHub project contains two Python scripts to be run using PySpark for this post. The international_loans_local.py Python script is only intended to be run locally with the smaller 8.7K rows of data in the snapshot CSV file. It performs a few different analyses with the smaller dataset (gist).
#!/usr/bin/python

# Author: Gary A. Stafford
# License: MIT

from pyspark.sql import SparkSession


def main():
    spark = SparkSession \
        .builder \
        .master("local[*]") \
        .appName('dataproc-python-demo') \
        .getOrCreate()

    # Defaults to INFO
    sc = spark.sparkContext
    sc.setLogLevel("INFO")

    # Loads CSV file from local directory
    df_loans = spark \
        .read \
        .format("csv") \
        .option("header", "true") \
        .option("inferSchema", "true") \
        .load("data/ibrd-statement-of-loans-latest-available-snapshot.csv")

    # Prints basic stats
    print "Rows of data:" + str(df_loans.count())
    print "Inferred Schema:"
    df_loans.printSchema()

    # Creates temporary view using DataFrame
    df_loans.withColumnRenamed("Country", "country") \
        .withColumnRenamed("Country Code", "country_code") \
        .withColumnRenamed("Disbursed Amount", "disbursed") \
        .withColumnRenamed("Borrower's Obligation", "obligation") \
        .withColumnRenamed("Interest Rate", "interest_rate") \
        .createOrReplaceTempView("loans")

    # Performs basic analysis of dataset
    df_disbursement = spark.sql("""
        SELECT country, country_code,
               format_number(total_disbursement, 0) AS total_disbursement,
               format_number(ABS(total_obligation), 0) AS total_obligation,
               format_number(avg_interest_rate, 2) AS avg_interest_rate
        FROM (
            SELECT country, country_code,
                   SUM(disbursed) AS total_disbursement,
                   SUM(obligation) AS total_obligation,
                   AVG(interest_rate) AS avg_interest_rate
            FROM loans
            GROUP BY country, country_code
            ORDER BY total_disbursement DESC
            LIMIT 25)
        """).cache()

    df_disbursement.show(25, True)

    # Saves results to a local CSV file
    df_disbursement.repartition(1) \
        .write \
        .mode("overwrite") \
        .format("csv") \
        .option("header", "true") \
        .save("data/ibrd-summary-small-python")

    print "Results successfully written to CSV file"

    spark.stop()


if __name__ == "__main__":
    main()
Identical to the corresponding Java class, note on line 12 that the Spark Session’s Master URL, .master("local[*]"), directs Spark to run locally with as many worker threads as logical cores on the machine.

Also identical to the corresponding Java class, note on line 26, the path to the data file, and on line 66, the output path for the resulting data file, are local relative file paths.
The international_loans_dataproc_large.py Python script is intended to be run on the Dataproc clusters, analyzing the larger 750K rows of data in the IBRD historic CSV file (gist).
#!/usr/bin/python

# Author: Gary A. Stafford
# License: MIT

from pyspark.sql import SparkSession


def main():
    spark = SparkSession \
        .builder \
        .master("yarn") \
        .appName('dataproc-python-demo') \
        .getOrCreate()

    # Defaults to INFO
    sc = spark.sparkContext
    sc.setLogLevel("WARN")

    # Loads CSV file from Google Storage Bucket
    df_loans = spark \
        .read \
        .format("csv") \
        .option("header", "true") \
        .option("inferSchema", "true") \
        .load("gs://dataproc-demo-bucket/ibrd-statement-of-loans-historical-data.csv")

    # Creates temporary view using DataFrame
    df_loans.withColumnRenamed("Country", "country") \
        .withColumnRenamed("Country Code", "country_code") \
        .withColumnRenamed("Disbursed Amount", "disbursed") \
        .withColumnRenamed("Borrower's Obligation", "obligation") \
        .withColumnRenamed("Interest Rate", "interest_rate") \
        .createOrReplaceTempView("loans")

    # Performs basic analysis of dataset
    df_disbursement = spark.sql("""
        SELECT country, country_code,
               format_number(total_disbursement, 0) AS total_disbursement,
               format_number(ABS(total_obligation), 0) AS total_obligation,
               format_number(avg_interest_rate, 2) AS avg_interest_rate
        FROM (
            SELECT country, country_code,
                   SUM(disbursed) AS total_disbursement,
                   SUM(obligation) AS total_obligation,
                   AVG(interest_rate) AS avg_interest_rate
            FROM loans
            GROUP BY country, country_code
            ORDER BY total_disbursement DESC
            LIMIT 25)
        """).cache()

    # Saves results to single CSV file in Google Storage Bucket
    df_disbursement.repartition(1) \
        .write \
        .mode("overwrite") \
        .format("csv") \
        .option("header", "true") \
        .save("gs://dataproc-demo-bucket/ibrd-summary-large-python")

    spark.stop()


if __name__ == "__main__":
    main()
On line 12, note the Spark Session’s Master URL, .master("yarn"), directs Spark to connect to a YARN cluster.

Again, note on line 26, the path to the data file, and on line 59, the output path for the data file, point to the Cloud Storage bucket we created earlier (.load("gs://your-bucket-name/your-data-file.csv")).
These are the only two differences between the local version of the PySpark job and the version of the PySpark job intended for Dataproc. With Python, there is no pre-compilation necessary. We will upload the second script, directly.
Uploading Job Resources to Cloud Storage
In total, we need to upload four items to the new Cloud Storage bucket we created previously. The items include the two Kaggle IBRD CSV files, the compiled Java JAR file from the dataproc-java-demo project, and the Python script from the dataproc-python-demo project. Using the Google Cloud Console, upload the four files to the new Google Storage bucket, as shown below. Make sure you unzip the two Kaggle IBRD CSV data files before uploading.
Like before, we also have the option of using gsutil with the copy (cp) command to upload the four files. The cp command accepts wildcards, as shown below.
export PROJECT=your_project_name
export BUCKET_NAME=gs://your_bucket_name

gsutil cp data/ibrd-statement-of-loans-*.csv $BUCKET_NAME
gsutil cp build/libs/dataprocJavaDemo-1.0-SNAPSHOT.jar $BUCKET_NAME
gsutil cp international_loans_dataproc_large.py $BUCKET_NAME
If our Java or Python jobs were larger, or more complex and required multiple files to run, we could also choose to upload ZIP or other common compression formatted archives using the --archives flag.
Running Jobs on Dataproc
The easiest way to run a job on the Dataproc cluster is by submitting a job through the Dataproc Jobs UI, part of the Google Cloud Console.
Dataproc has the capability of running multiple types of jobs, including:
- Hadoop
- Spark
- SparkR
- PySpark
- Hive
- SparkSql
- Pig
We will be running both Spark and PySpark jobs as part of this demonstration.
Spark Jobs
To run a Spark job using the JAR file, select Job type Spark. The Region will match your Dataproc cluster and bucket locations, us-east1 in my case. You should have a choice of both clusters in your chosen region. Run both jobs at least twice, once on each cluster, for a total of four jobs.
Lastly, you will need to input the main class and the path to the JAR file. The JAR location will be:
gs://your_bucket_name/dataprocJavaDemo-1.0-SNAPSHOT.jar
The main class for the smaller dataset will be:
org.example.dataproc.InternationalLoansAppDataprocSmall
The main class for the larger dataset will be:
org.example.dataproc.InternationalLoansAppDataprocLarge
During or after job execution, you may view details in the Output tab of the Dataproc Jobs console.
Like every other step in this demonstration, we can also use the gcloud
command line tool, instead of the web console, to submit our Spark jobs to each cluster. Here, I am submitting the larger dataset Spark job to the three-node cluster.
export CLUSTER_2=your_three_node_cluster_name
export REGION=us-east1
export BUCKET_NAME=gs://your_bucket_name

gcloud dataproc jobs submit spark \
  --region $REGION \
  --cluster $CLUSTER_2 \
  --class org.example.dataproc.InternationalLoansAppDataprocLarge \
  --jars $BUCKET_NAME/dataprocJavaDemo-1.0-SNAPSHOT.jar \
  --async
PySpark Jobs
To run a Spark job using the Python script, select Job type PySpark. The Region will match your Dataproc cluster and bucket locations, us-east1 in my case. You should have a choice of both clusters. Run the job at least twice, once on each cluster.
Lastly, you will need to input the main Python file path. There is only one Dataproc Python script, which analyzes the larger dataset. The script location will be:
gs://your_bucket_name/international_loans_dataproc_large.py
Like every other step in this demonstration, we can also use the gcloud
command line tool instead of the web console to submit our PySpark jobs to each cluster. Below, I am submitting the PySpark job to the three-node cluster.
export CLUSTER_2=your_three_node_cluster_name
export REGION=us-east1
export BUCKET_NAME=gs://your_bucket_name

gcloud dataproc jobs submit pyspark \
  $BUCKET_NAME/international_loans_dataproc_large.py \
  --region $REGION \
  --cluster $CLUSTER_2 \
  --async
Including the optional --async flag with any of the dataproc jobs submit commands sends the job to the Dataproc cluster and immediately releases the terminal back to the user. If you do not use the --async flag, the terminal will be unavailable until the job is finished.
However, without the flag, we will get the standard output (stdout) and standard error (stderr) from Dataproc. The output includes some useful information, including the different stages of the job execution lifecycle and execution times.
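Even when a job is submitted with the --async flag, the driver output is not lost. Below is a minimal sketch using the gcloud dataproc jobs wait command to stream the output of a previously submitted job; the job ID shown is a placeholder for the ID returned by the submit command.
export REGION=us-east1
# Stream the driver output of a previously submitted (asynchronous) job
# 'your_job_id' is a placeholder; use the ID returned by 'jobs submit'
gcloud dataproc jobs wait your_job_id --region $REGION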
File Output
During development and testing, outputting results to the console is useful. However, in Production, the output from jobs is most often written to Apache Parquet, Apache Avro, CSV, JSON, or XML format files, persisted to an Apache Hive, SQL, or NoSQL database, or streamed to another system for post-processing, using technologies such as Apache Kafka.
Once both the Java and Python jobs have run successfully on the Dataproc cluster, you should observe the results have been saved back to the Storage bucket. Each script saves its results to a single CSV file in separate directories, as shown below.
The final dataset, written to the CSV file, contains the results of the analysis (gist).
country | country_code | total_disbursement | total_obligation | avg_interest_rate
--- | --- | --- | --- | ---
Brazil | BR | 4,302,455,404,056 | 1,253,228,385,979 | 4.08
Mexico | MX | 4,219,081,270,927 | 1,297,489,060,082 | 4.94
Indonesia | ID | 3,270,346,860,046 | 1,162,592,633,450 | 4.67
China | CN | 3,065,658,803,841 | 1,178,177,730,111 | 3.01
India | IN | 3,052,082,309,937 | 1,101,910,589,590 | 3.81
Turkey | TR | 2,797,634,959,120 | 1,111,562,740,520 | 4.85
Argentina | AR | 2,241,512,056,786 | 524,815,800,115 | 3.38
Colombia | CO | 1,701,021,819,054 | 758,168,606,621 | 4.48
Korea, Republic of | KR | 1,349,701,565,303 | 9,609,765,857 | 6.81
Philippines | PH | 1,166,976,603,303 | 365,840,981,818 | 5.38
Poland | PL | 1,157,181,357,135 | 671,373,801,971 | 2.89
Morocco | MA | 1,045,267,705,436 | 365,073,667,924 | 4.39
Russian Federation | RU | 915,318,843,306 | 98,207,276,721 | 1.70
Romania | RO | 902,736,599,033 | 368,321,253,522 | 4.36
Egypt, Arab Republic of | EG | 736,945,143,568 | 431,086,774,867 | 4.43
Thailand | TH | 714,203,701,665 | 70,485,749,749 | 5.93
Peru | PE | 655,818,700,812 | 191,464,347,544 | 3.83
Ukraine | UA | 644,031,278,339 | 394,273,593,116 | 1.47
Pakistan | PK | 628,853,154,827 | 121,673,028,048 | 3.82
Tunisia | TN | 625,648,381,742 | 202,230,595,005 | 4.56
Nigeria | NG | 484,529,279,526 | 2,351,912,541 | 5.86
Kazakhstan | KZ | 453,938,975,114 | 292,590,991,287 | 2.81
Algeria | DZ | 390,644,588,386 | 251,720,881 | 5.30
Chile | CL | 337,041,916,083 | 11,479,003,904 | 4.86
Serbia | YF | 331,975,671,975 | 173,516,517,964 | 5.30
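To spot-check the output from the command line, we could also list and preview the part file with gsutil. Below is a quick sketch, assuming the bucket and output directory names used earlier in this post.
export BUCKET_NAME=gs://your_bucket_name
# List the output directory written by the large-dataset PySpark job
gsutil ls $BUCKET_NAME/ibrd-summary-large-python/
# Preview the first few rows of the single CSV part file
gsutil cat $BUCKET_NAME/ibrd-summary-large-python/part-*.csv | head -n 10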
Cleaning Up
When you are finished, make sure to delete your running clusters. This may be done through the Google Cloud Console. Deletion of the three-node cluster took, on average, slightly more than one minute.
As usual, we can also use the gcloud
command line tool instead of the web console to delete the Dataproc clusters.
export CLUSTER_1=your_single_node_cluster_name
export CLUSTER_2=your_three_node_cluster_name
export REGION=us-east1

yes | gcloud dataproc clusters delete $CLUSTER_1 --region $REGION
yes | gcloud dataproc clusters delete $CLUSTER_2 --region $REGION
Results
Some observations, based on approximately 75 successful jobs. First, the Python and Java jobs each ran in nearly the same amount of time as one another on the single-node cluster, and likewise on the three-node cluster. This is beneficial since, although a lot of big data analysis is performed with Python, Java is still the lingua franca of many large enterprises.
Consecutive Execution
Below are the average times for running the larger dataset on both clusters, in Java, and in Python. The jobs were all run consecutively as opposed to concurrently. The best time was 59 seconds on the three-node cluster compared to the best time of 150 seconds on the single-node cluster, a difference of 256%. Given the differences in the two clusters, this large variation is expected. The average difference between the two clusters for running the large dataset was 254%.
Concurrent Execution
It is important to understand the impact of concurrently running multiple jobs on the same Dataproc cluster. To demonstrate this, both the Java and Python jobs were also run concurrently. In one such test, ten copies of the Python job were run concurrently on the three-node cluster.
Observe that the execution times of the concurrent jobs increase in near-linear time. The first job completes in roughly the same time as the consecutively executed jobs, shown above, but each subsequent job’s execution time increases linearly.
According to Apache, when running on a cluster, each Spark application gets an independent set of executor JVMs that only run tasks and store data for that application. Each application is given a maximum amount of resources it can use and holds onto them for its whole duration. Note no tuning was done to the Dataproc clusters to optimize for concurrent execution.
Really Big Data?
Although there is no exact definition of ‘big data’, 750K rows of data at 265 MB is probably not generally considered big data. Likewise, the three-node cluster used in this demonstration is still pretty diminutive. Lastly, the SQL query was less than complex. To really test the abilities of Dataproc would require a multi-gigabyte or multi-terabyte-sized dataset, divided amongst multiple files, computed on a much beefier cluster with more worker nodes and more compute resources.
Monitoring and Instrumentation
In addition to viewing the results of running and completed jobs, there are a number of additional monitoring resources, including the Hadoop Yarn Resource Manager, HDFS NameNode, and Spark History Server Web UIs, and Google Stackdriver. I will only briefly introduce these resources, and not examine any of these interfaces in detail. You’re welcome to investigate the resources for your own clusters. Apache lists other Spark monitoring and instrumentation resources in their documentation.
To access the Hadoop Yarn Resource Manager, HDFS NameNode, and Spark History Server Web UIs, you must create an SSH tunnel and run Chrome through a SOCKS proxy. Google Dataproc provides both the commands and a link to the relevant documentation, in the Dataproc Cluster tab, to help you connect.
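Below is a minimal sketch of the connection process, based on Google’s Dataproc web-interfaces documentation; the cluster name, zone, and Chrome binary path are placeholders and will vary by environment.
export CLUSTER_2=your_three_node_cluster_name
export ZONE=your_cluster_zone
# Create a SOCKS proxy tunnel to the cluster's master node (its name ends in '-m')
gcloud compute ssh ${CLUSTER_2}-m \
  --zone ${ZONE} -- -D 1080 -N &
# Start Chrome through the proxy, using a throw-away profile
/usr/bin/google-chrome \
  --proxy-server="socks5://localhost:1080" \
  --user-data-dir=/tmp/${CLUSTER_2}-m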
Hadoop Yarn Resource Manager Web UI
Once you are connected to the Dataproc cluster, via the SSH tunnel and proxy, the Hadoop Yarn Resource Manager Web UI is accessed on port 8088. The UI allows you to view all aspects of the YARN cluster and the distributed applications running on the YARN system.
HDFS NameNode Web UI
Once you are connected to the Dataproc cluster, via the SSH tunnel and proxy, the HDFS NameNode Web UI is accessed on port 9870. According to the Hadoop Wiki, the NameNode is the centerpiece of an HDFS file system. It keeps the directory tree of all files in the file system, and tracks where across the cluster the file data is kept. It does not store the data of these files itself.
Spark History Server Web UI
We can view the details of all completed jobs using the Spark History Server Web UI. Once you are connected to the cluster, via the SSH tunnel and proxy, the Spark History Server Web UI is accessed on port 18080. Of all the methods of reviewing aspects of a completed Spark job, the History Server provides the most detailed.
Using the History Server UI, we can drill into fine-grained details of each job, including the event timeline.
Also, using the History Server UI, we can see a visualization of the Spark job’s DAG (Directed Acyclic Graph). DataBricks published an excellent post on learning how to interpret the information visualized in the Spark UI.
From the UI, we can not only view the DAG but also drill into each Stage of the DAG.
Stackdriver
We can also enable Google Stackdriver for monitoring and management of services, containers, applications, and infrastructure. Stackdriver offers an impressive array of services, including debugging, error reporting, monitoring, alerting, tracing, logging, and dashboards, to mention only a few Stackdriver features.
There are dozens of metrics available, which collectively, reflect the health of the Dataproc clusters. Below we see the states of one such metric, the YARN virtual cores (vcores). A YARN vcore, introduced in Hadoop 2.4, is a usage share of a host CPU. The number of YARN virtual cores is equivalent to the number of worker nodes (2) times the number of vCPUs per node (4), for a total of eight YARN virtual cores. Below, we see that at one point in time, 5 of the 8 vcores have been allocated, with 2 more available.
Next, we see the states of the YARN memory size. YARN memory size is calculated as the number of worker nodes (2) times the amount of memory on each node (15 GB) times the fraction given to YARN (0.8), for a total of 24 GB (2 x 15 GB x 0.8). Below, we see that at one point in time, 20 GB of RAM is allocated with 4 GB available. At that instant in time, the workload does not appear to be exhausting the cluster’s memory.
Notifications
Since no one actually watches dashboards all day, waiting for something to fail, how do we know when we have an issue with Dataproc? Stackdriver offers integrations with most popular notification channels, including email, SMS, Slack, PagerDuty, HipChat, and Webhooks. With Stackdriver, we define a condition which describes when a service is considered unhealthy. When triggered, Stackdriver sends a notification to one or more channels.
Below is a preview of two alert notifications in Slack. I enabled Slack as a notification channel and created an alert which is triggered each time a Dataproc job fails. Whenever a job fails, such as the two examples below, I receive a Slack notification through the Slack Channel defined in Stackdriver.
Slack notifications contain a link, which routes you back to Stackdriver, to an incident which was opened on your behalf, due to the job failure.
For convenience, the incident also includes a pre-filtered link directly to the log entries at the time of the policy violation. Stackdriver logging offers advanced filtering capabilities to quickly find log entries, as shown below.
With Stackdriver, you get monitoring, logging, alerting, notification, and incident management as a service, with minimal cost and upfront configuration. Think about how much time and effort it takes the average enterprise to achieve this level of infrastructure observability on their own; most never do.
Conclusion
In this post, we have seen the ease-of-use, extensive feature-set, out-of-the-box integration ability with other cloud services, low cost, and speed of Google Cloud Dataproc, to run big data analytics workloads. Couple this with the ability of Stackdriver to provide monitoring, logging, alerting, notification, and incident management for Dataproc with minimal up-front configuration. In my opinion, based on these features, Google Cloud Dataproc leads other cloud competitors for fully-managed Spark and Hadoop Cluster management.
In future posts, we will examine the use of Cloud Dataproc Workflow Templates for process automation, the integration capabilities of Dataproc with services such as BigQuery, Bigtable, Cloud Dataflow, and Google Cloud Pub/Sub, and finally, DevOps for Big Data with Dataproc and tools like Spinnaker and Jenkins on GKE.
All opinions expressed in this post are my own and not necessarily the views of my current or past employers, their clients, nor Apache or Google.
Building Serverless Actions for Google Assistant with Google Cloud Functions, Cloud Datastore, and Cloud Storage
Posted by Gary A. Stafford in Cloud, GCP, JavaScript, Serverless, Software Development on August 11, 2018
Introduction
In this post, we will create an Action for Google Assistant using the ‘Actions on Google’ development platform, Google Cloud Platform’s serverless Cloud Functions, Cloud Datastore, and Cloud Storage, and the current LTS version of Node.js. According to Google, Actions are pieces of software, designed to extend the functionality of the Google Assistant, Google’s virtual personal assistant, across a multitude of Google-enabled devices, including smartphones, cars, televisions, headphones, watches, and smart-speakers.
Here is a brief YouTube video preview of the final Action for Google Assistant we will explore in this post, running on an Apple iPhone 8.
If you want to compare the development of an Action for Google Assistant with those of AWS and Azure, in addition to this post, please read my previous two posts in this series, Building and Integrating LUIS-enabled Chatbots with Slack, using Azure Bot Service, Bot Builder SDK, and Cosmos DB and Building Asynchronous, Serverless Alexa Skills with AWS Lambda, DynamoDB, S3, and Node.js. All three of the article’s demonstrations are written in Node.js, all three leverage their cloud platform’s machine learning-based Natural Language Understanding services, and all three take advantage of NoSQL database and storage services available on their respective cloud platforms.
Google Technologies
The final architecture of our Action for Google Assistant will look as follows.
Here is a brief overview of the key technologies we will incorporate into our architecture.
Actions on Google
According to Google, Actions on Google is the platform for developers to extend the Google Assistant. Similar to Amazon’s Alexa Skills Kit Development Console for developing Alexa Skills, Actions on Google is a web-based platform that provides a streamlined user-experience to create, manage, and deploy Actions. We will use the Actions on Google platform to develop our Action in this post.
Dialogflow
According to Google, Dialogflow is an enterprise-grade Natural Language Understanding (NLU) platform that makes it easy for developers to design and integrate conversational user interfaces into mobile apps, web applications, devices, and bots. Dialogflow is powered by Google’s machine learning for Natural Language Processing (NLP). Dialogflow was initially known as API.AI prior to being renamed by Google in late 2017.
We will use the Dialogflow web-based development platform and version 2 of the Dialogflow API, which became GA in April 2018, to build our Action for Google Assistant’s rich, natural-language conversational interface.
Google Cloud Functions
Google Cloud Functions is the event-driven, serverless compute platform, part of the Google Cloud Platform (GCP). Google Cloud Functions are comparable to Amazon’s AWS Lambda and Azure Functions. Cloud Functions is a relatively new service from Google, released in beta in March 2017, and only recently becoming GA at Cloud Next ’18 (July 2018). The main features of Cloud Functions include automatic scaling, high availability, fault tolerance, no servers to provision, manage, patch or update, and a payment model based on the function’s execution time. The programmatic logic behind our Action for Google Assistant will be handled by a Cloud Function.
Node.js LTS
We will write our Action’s Google Cloud Function using the Node.js 8 runtime. Google just released the ability to write Google Cloud Functions in Node 8.11.1 and Python 3.7.0, at Cloud Next ’18 (July 2018). It is still considered beta functionality. Previously, you had to write your functions in Node version 6 (currently, 6.14.0).
Node 8, also known as Project Carbon, was the first Long Term Support (LTS) version of Node to support async/await with Promises. Async/await is the new way of handling asynchronous operations in Node.js. We will make use of async/await and Promises within our Action’s Cloud Function.
Google Cloud Datastore
Google Cloud Datastore is a highly-scalable NoSQL database. Cloud Datastore is similar in features and capabilities to Azure Cosmos DB and Amazon DynamoDB. Datastore automatically handles sharding and replication and offers features like a RESTful interface, ACID transactions, SQL-like queries, and indexes. We will use Datastore to persist the information returned to the user from our Action for Google Assistant.
Google Cloud Storage
The last technology, Google Cloud Storage is secure and durable object storage, nearly identical to Amazon Simple Storage Service (Amazon S3) and Azure Blob Storage. We will store publicly accessible images in a Google Cloud Storage bucket, which will be displayed in Google Assistant Basic Card responses.
Demonstration
To demonstrate Actions for Google Assistant, we will build an informational Action that responds to the user with interesting facts about Azure, Microsoft’s Cloud computing platform (Google talking about Azure, ironic). Note this is not intended to be an official Microsoft bot and is only used for demonstration purposes.
Source Code
All open-sourced code for this post can be found on GitHub. Note code samples in this post are displayed as Gists, which may not display correctly on some mobile and social media browsers. Links to gists are also provided.
Development Process
This post will focus on the development and integration of an Action with Google Cloud Platform’s serverless and asynchronous Cloud Functions, Cloud Datastore, and Cloud Storage. The post is not intended to be a general how-to on developing and publishing Actions for Google Assistant, or how to specifically use services on the Google Cloud Platform.
Building the Action will involve the following steps.
- Design the Action’s conversation model;
- Import the Azure Facts Entities into Cloud Datastore on GCP;
- Create and upload the images to Cloud Storage on GCP;
- Create the new Actions on Google project using the Actions on Google console;
- Develop the Action’s Intent using the Dialogflow console;
- Bulk import the Action’s Entities using the Dialogflow console;
- Configure the Dialogflow Actions on Google Integration;
- Develop and deploy the Cloud Function to GCP;
- Test the Action using Actions on Google Simulator;
Let’s explore each step in more detail.
Conversational Model
The conversational model design of the Azure Tech Facts Action for Google Assistant is similar to the Azure Tech Facts Alexa Custom Skill, detailed in my previous post. We will have the option to invoke the Action in two ways, without initial intent (Explicit Invocation) and with intent (Implicit Invocation), as shown below. On the left, we see an example of an explicit invocation of the Action. Google Assistant then queries the user for more information. On the right, an implicit invocation of the Action includes the intent, being the Azure fact they want to learn about. Google Assistant responds directly, both verbally and visually with the fact.
Each fact returned by Google Assistant will include a Simple Response, Basic Card and Suggestions response types for devices with a display, as shown below. The user may continue to ask for additional facts or choose to cancel the Action at any time.
Lastly, as part of the conversational model, we will include the option of asking for a random fact, as well as asking for help. Examples of both are shown below. Again, Google Assistant responds to the user, vocally and, optionally, visually, for display-enabled devices.
GCP Account and Project
The following steps assume you have an existing GCP account and you have created a project on GCP to house the Cloud Function, Cloud Storage Bucket, and Cloud Datastore Entities. The post also assumes that you have the Google Cloud SDK installed on your development machine, and have authenticated your identity from the command line (gist).
# Authenticate with the Google Cloud SDK
export PROJECT_ID="<your_project_id>"
gcloud beta auth login
gcloud config set project ${PROJECT_ID}

# Update components or new runtime nodejs8 may be unknown
gcloud components update
Google Cloud Storage
First, the images, actually Azure icons available from Microsoft, displayed in the responses shown above, are uploaded to a Google Storage Bucket. To handle these tasks, we will use the gsutil
CLI to create, upload, and manage the images. The gsutil CLI tool, like gcloud
, is part of the Google Cloud SDK. The gsutil mb
(make bucket) command creates the bucket, gsutil cp
(copy files and objects) command is used to copy the images to the new bucket, and finally, the gsutil iam
(get, set, or change bucket and/or object IAM permissions) command is used to make the images public. I have included a shell script, bucket-uploader.sh
, to make this process easier. (gist).
#!/usr/bin/env sh

# author: Gary A. Stafford
# site: https://programmaticponderings.com
# license: MIT License

set -ex

# Set constants
PROJECT_ID="<your_project_id>"
REGION="<your_region>"
IMAGE_BUCKET="<your_bucket_name>"

# Create GCP Storage Bucket
gsutil mb \
  -p ${PROJECT_ID} \
  -c regional \
  -l ${REGION} \
  gs://${IMAGE_BUCKET}

# Upload images to bucket
for file in pics/image-*; do
  gsutil cp ${file} gs://${IMAGE_BUCKET}
done

# Make all images public in bucket
gsutil iam ch allUsers:objectViewer gs://${IMAGE_BUCKET}
From the Storage Console on GCP, you should observe the images all have publicly accessible URLs. This will allow the Cloud Function to access the bucket, and retrieve and display the images. There are more secure ways to store and display the images from the function. However, this is the simplest method since we are not concerned about making the images public.
We will need the URL of the new Storage bucket, later, when we develop our Action’s Cloud Function. The bucket URL can be obtained from the Storage Console on GCP, as shown below in the Link URL.
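For reference, publicly readable objects in Cloud Storage also resolve to a predictable URL pattern, sketched below from the command line; the bucket and image names are placeholders.
# List the uploaded images in the bucket
gsutil ls gs://your_bucket_name
# Publicly readable objects resolve to URLs of the form:
# https://storage.googleapis.com/your_bucket_name/image-16.png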
Google Cloud Datastore
In Cloud Datastore, the category data object is referred to as a Kind, similar to a Table in a relational database. In Datastore, we will have an ‘AzureFact’ Kind of data. In Datastore, a single object is referred to as an Entity, similar to a Row in a relational database. Each one of our entities represents a unique reference value from our Azure Facts Intent’s facts entities, such as ‘competition’ and ‘certifications’. Individual data is known as a Property in Datastore, similar to a Column in a relational database. We will have four Properties for each entity: name, response, title, and image. Lastly, a Key in Datastore is similar to a Primary Key in a relational database. The Key we will use for our entities is the unique reference value string from our Azure Facts Intent’s facts entities, such as ‘competition’ or ‘certifications’. The Key value is stored within the entity’s name Property.
There are a number of ways to create the Datastore entities for our Action, including manually from the Datastore console on GCP. However, to automate the process, we will use a script, written in Node.js and using the Google Cloud Datastore Node.js Client, to create the entities. We will use the Client API’s Datastore Class upsert method, which creates or updates an entire collection of entities with one call and returns a callback. The script, upsert-entities.js, is included in source control and can be run with the following commands (gist).
# Upload Google Datastore entities
cd data
npm install
node ./upsert-entities.js
Once the upsert
command completes successfully, you should observe a collection of ‘AzureFact’ Type Datastore Entities in the Datastore console on GCP.
Below, we see the structure of a single Datastore Entity, the ‘certifications’ Entity, containing the fact response, title, and name of the image, which is stored in our Google Storage bucket.
New ‘Actions on Google’ Project
With the images uploaded and the database entries created, we can start building our Actions for Google Assistant. Using the Actions on Google web console, we first create a new Actions project.
The Directory Information tab is where we define metadata about the project. This information determines how it will look in the Actions directory and is required to publish your project. The Actions directory is where users discover published Actions on the web and mobile devices.
Actions and Intents
Our project will contain a series of related Actions. According to Google, an Action is ‘an interaction you build for the Assistant that supports a specific intent and has a corresponding fulfillment that processes the intent.’ To build our Actions, we first want to create our Intents. To do so, we will want to switch from the Actions on Google console to the Dialogflow console. Actions on Google provides a link for switching to Dialogflow in the Actions tab.
We will build our Action’s Intents in Dialogflow. The term Intent, used by Dialogflow, is standard terminology across other voice-assistant platforms, such as Amazon’s Alexa and Microsoft’s Azure Bot Service and LUIS. In Dialogflow, we will be building three Intents: the Azure Facts Intent, the Welcome Intent, and the Fallback Intent.
Below, we see the Azure Facts Intent. The Azure Facts Intent is the main Intent, responsible for handling our user’s requests for facts about Azure. The Intent includes a fair number, but certainly not an exhaustive list, of training phrases. These represent all the possible ways a user might express intent when invoking the Action. According to Google, the greater the number of natural language examples in the Training Phrases section of Intents, the better the classification accuracy.
Intent Entities
Each of the highlighted words in the training phrases maps to the facts parameter, which maps to a collection of @facts Entities. Entities represent the list of values the Action is trained to understand. According to Google, there are three types of entities: system (defined by Dialogflow), developer (defined by a developer), and user (built for each individual end-user in every request) entities. We will be creating developer type entities for our Action’s Intent.
Synonyms
An entity contains Synonyms. Multiple synonyms may be mapped to a single reference value. The reference value is the value passed to the Cloud Function by the Action. For example, take the reference value of ‘competition’. A user might ask Google about Azure’s competition. However, the user might also substitute the words ‘competitor’ or ‘competitors’ for ‘competition’. Using synonyms, if the user utters any of these three words in their intent, they will receive the same response.
Although our Azure Facts Action is a simple example, typical Actions might contain hundreds of entities or more, each with several synonyms. Dialogflow provides the option of copy and pasting bulk entities, in either JSON or CSV format. The project’s source code includes both JSON and CSV formats, which may be input in this manner.
Automated Expansion
Not every possible fact, which will have a response, returned by Google Assistant, needs an entity defined. For example, we created a ‘compliance’ Cloud Datastore Entity. The Action understands the term ‘compliance’ and will return a response to the user if they ask about Azure compliance. However, ‘compliance’ is not defined as an Intent Entity, since we have chosen not to define any synonyms for the term ‘compliance’.
In order to allow this, you must enable Allow Automated Expansion. According to Google, this option allows an Agent to recognize values that have not been explicitly listed in the entity. Google describes Agents as NLU (Natural Language Understanding) modules.
Actions on Google Integration
Another configuration item in Dialogflow that needs to be completed is the Dialogflow’s Actions on Google integration. This will integrate the Azure Tech Facts Action with Google Assistant. Google provides more than a dozen different integrations, as shown below.
The Dialogflow’s Actions on Google integration configuration is simple, just choose the Azure Facts Intent as our Action’s Implicit Invocation intent, in addition to the default Welcome Intent, which is our Action’s Explicit Invocation intent. According to Google, integration allows our Action to reach users on every device where the Google Assistant is available.
Action Fulfillment
When an intent is received from the user, it is fulfilled by the Action. In the Dialogflow Fulfillment console, we see the Action has two fulfillment options, a Webhook or a Cloud Function, which can be edited inline. A Webhook allows us to pass information from a matched intent into a web service and get a result back from the service. In our example, our Action’s Webhook will call our Cloud Function, using the Cloud Function’s URL endpoint. We first need to create our function in order to get the endpoint, which we will do next.
Google Cloud Functions
Our Cloud Function, called by our Action, is written in Node.js 8. As stated earlier, Node 8 LTS was the first LTS version to support async/await with Promises. Async/await is the new way of handling asynchronous operations in Node.js, replacing callbacks.
Our function, index.js, is divided into four sections: constants, intent handlers, helper functions, and the function’s entry point. The Cloud Function attempts to follow many of the coding practices from Google’s code examples on Github.
Constants
This section defines the global constants used within the function. Note the constant for the URL of our new Cloud Storage bucket, IMAGE_BUCKET, references an environment variable, process.env.IMAGE_BUCKET. This value is set in the .env.yaml file. All environment variables in the .env.yaml file will be set during the Cloud Function’s deployment, explained later in this post. Environment variables were only recently released and are still considered beta functionality (gist).
// author: Gary A. Stafford
// site: https://programmaticponderings.com
// license: MIT License

'use strict';

/* CONSTANTS */
const {
    dialogflow,
    Suggestions,
    BasicCard,
    SimpleResponse,
    Image,
} = require('actions-on-google');

const functions = require('firebase-functions');
const Datastore = require('@google-cloud/datastore');
const datastore = new Datastore({});

const app = dialogflow({debug: true});

app.middleware(conv => {
    conv.hasScreen =
        conv.surface.capabilities.has('actions.capability.SCREEN_OUTPUT');
    conv.hasAudioPlayback =
        conv.surface.capabilities.has('actions.capability.AUDIO_OUTPUT');
});

const IMAGE_BUCKET = process.env.IMAGE_BUCKET;

const SUGGESTION_1 = 'tell me a random fact';
const SUGGESTION_2 = 'help';
const SUGGESTION_3 = 'cancel';
The npm package dependencies declared in the constants section are defined in the dependencies section of the package.json
file. Function dependencies include Actions on Google, Firebase Functions, and Cloud Datastore (gist).
"dependencies": { | |
"@google-cloud/datastore": "^1.4.1", | |
"actions-on-google": "^2.2.0", | |
"dialogflow": "^0.6.0", | |
"dialogflow-fulfillment": "^0.5.0", | |
"firebase-admin": "^6.0.0", | |
"firebase-functions": "^2.0.2" | |
} |
Intent Handlers
The three intent handlers correspond to the three intents in the Dialogflow console: Azure Facts Intent, Welcome Intent, and Fallback Intent. Each handler responds in a very similar fashion. The handlers all return a SimpleResponse
for audio-only and display-enabled devices. Optionally, a BasicCard
is returned for display-enabled devices (gist).
/* INTENT HANDLERS */
app.intent('Welcome Intent', conv => {
    const WELCOME_TEXT_SHORT = 'What would you like to know about Microsoft Azure?';
    const WELCOME_TEXT_LONG = `What would you like to know about Microsoft Azure? ` +
        `You can say things like: \n` +
        ` _'tell me about Azure certifications'_ \n` +
        ` _'when was Azure released'_ \n` +
        ` _'give me a random fact'_`;
    const WELCOME_IMAGE = 'image-16.png';

    conv.ask(new SimpleResponse({
        speech: WELCOME_TEXT_SHORT,
        text: WELCOME_TEXT_SHORT,
    }));

    if (conv.hasScreen) {
        conv.ask(new BasicCard({
            text: WELCOME_TEXT_LONG,
            title: 'Azure Tech Facts',
            image: new Image({
                url: `${IMAGE_BUCKET}/${WELCOME_IMAGE}`,
                alt: 'Azure Tech Facts',
            }),
            display: 'WHITE',
        }));

        conv.ask(new Suggestions([SUGGESTION_1, SUGGESTION_2, SUGGESTION_3]));
    }
});

app.intent('Fallback Intent', conv => {
    const FACTS_LIST = "Certifications, Cognitive Services, Competition, Compliance, First Offering, Functions, " +
        "Geographies, Global Infrastructure, Platforms, Categories, Products, Regions, and Release Date";
    const WELCOME_TEXT_SHORT = 'Need a little help?';
    const WELCOME_TEXT_LONG = `Current facts include: ${FACTS_LIST}.`;
    const WELCOME_IMAGE = 'image-15.png';

    conv.ask(new SimpleResponse({
        speech: WELCOME_TEXT_LONG,
        text: WELCOME_TEXT_SHORT,
    }));

    if (conv.hasScreen) {
        conv.ask(new BasicCard({
            text: WELCOME_TEXT_LONG,
            title: 'Azure Tech Facts Help',
            image: new Image({
                url: `${IMAGE_BUCKET}/${WELCOME_IMAGE}`,
                alt: 'Azure Tech Facts',
            }),
            display: 'WHITE',
        }));

        conv.ask(new Suggestions([SUGGESTION_1, SUGGESTION_2, SUGGESTION_3]));
    }
});

app.intent('Azure Facts Intent', async (conv, {facts}) => {
    let factToQuery = facts.toString();
    let fact = await buildFactResponse(factToQuery);

    const AZURE_TEXT_SHORT = `Sure, here's a fact about ${fact.title}`;

    conv.ask(new SimpleResponse({
        speech: fact.response,
        text: AZURE_TEXT_SHORT,
    }));

    if (conv.hasScreen) {
        conv.ask(new BasicCard({
            text: fact.response,
            title: fact.title,
            image: new Image({
                url: `${IMAGE_BUCKET}/${fact.image}`,
                alt: fact.title,
            }),
            display: 'WHITE',
        }));

        conv.ask(new Suggestions([SUGGESTION_1, SUGGESTION_2, SUGGESTION_3]));
    }
});
The Welcome Intent handler handles explicit invocations of our Action. The Fallback Intent handler handles both help requests, as well as cases when Dialogflow cannot match any of the user’s input. Lastly, the Azure Facts Intent handler handles implicit invocations of our Action, returning a fact to the user from Cloud Datastore, based on the user’s requested fact.
Helper Functions
The next section of the function contains two helper functions. The primary function is buildFactResponse, which queries Google Cloud Datastore for the fact. The second function, selectRandomFact, handles the fact value of ‘random’ by selecting a random fact value to query Datastore (gist).
/* HELPER FUNCTIONS */
function selectRandomFact() {
    const FACTS_ARRAY = ['description', 'released', 'global', 'regions',
        'geographies', 'platforms', 'categories', 'products', 'cognitive',
        'compliance', 'first', 'certifications', 'competition', 'functions'];
    return FACTS_ARRAY[Math.floor(Math.random() * FACTS_ARRAY.length)];
}

function buildFactResponse(factToQuery) {
    return new Promise((resolve, reject) => {
        if (factToQuery.toString().trim() === 'random') {
            factToQuery = selectRandomFact();
        }

        const query = datastore
            .createQuery('AzureFact')
            .filter('__key__', '=', datastore.key(['AzureFact', factToQuery]));

        datastore
            .runQuery(query)
            .then(results => {
                resolve(results[0][0]);
            })
            .catch(err => {
                console.log(`Error: ${err}`);
                reject(`Sorry, I don't know the fact, ${factToQuery}.`);
            });
    });
}

/* ENTRY POINT */
exports.functionAzureFactsAction = functions.https.onRequest(app);
Async/Await, Promises, and Callbacks
Let’s look closer at the relationship and asynchronous nature of the Azure Facts Intent handler and the buildFactResponse function. Below, note the async keyword on the intent handler and the await keyword on the buildFactResponse function call. This is typically how we see async/await applied when calling an asynchronous function, such as buildFactResponse. The await keyword allows the intent handler’s execution to wait for the buildFactResponse function’s Promise to be resolved, before attempting to use the resolved value to construct the response.
The buildFactResponse function returns a Promise. The Promise’s payload contains the results of the successful callback from the Datastore API’s runQuery function. The runQuery function returns a callback, which is then resolved and returned by the Promise (gist).
app.intent('Azure Facts Intent', async (conv, {facts}) => {
    let factToQuery = facts.toString();
    let fact = await buildFactResponse(factToQuery);

    const AZURE_TEXT_SHORT = `Sure, here's a fact about ${fact.title}`;

    conv.ask(new SimpleResponse({
        speech: fact.response,
        text: AZURE_TEXT_SHORT,
    }));

    if (conv.hasScreen) {
        conv.ask(new BasicCard({
            text: fact.response,
            title: fact.title,
            image: new Image({
                url: `${IMAGE_BUCKET}/${fact.image}`,
                alt: fact.title,
            }),
            display: 'WHITE',
        }));

        conv.ask(new Suggestions([SUGGESTION_1, SUGGESTION_2, SUGGESTION_3]));
    }
});

function buildFactResponse(factToQuery) {
    return new Promise((resolve, reject) => {
        if (factToQuery.toString().trim() === 'random') {
            factToQuery = selectRandomFact();
        }

        const query = datastore
            .createQuery('AzureFact')
            .filter('__key__', '=', datastore.key(['AzureFact', factToQuery]));

        datastore
            .runQuery(query)
            .then(results => {
                resolve(results[0][0]);
            })
            .catch(err => {
                console.log(`Error: ${err}`);
                reject(`Sorry, I don't know the fact, ${factToQuery}.`);
            });
    });
}
The payload returned by Google Datastore, through the resolved Promise to the intent handler, will resemble the example response, shown below. Note the image
, response
, and title
key/value pairs in the textPayload
section of the response payload. These are what are used to format the SimpleResponse
and BasicCard
responses (gist).
{
    title: 'Azure Functions',
    image: 'image-14.png',
    response: 'According to Microsoft, Azure Functions is a serverless compute service that enables you to run code on-demand without having to explicitly provision or manage infrastructure.',
    [Symbol(KEY)]: Key {
        namespace: undefined,
        name: 'functions',
        kind: 'AzureFact',
        path: [Getter]
    }
}
Cloud Function Deployment
To deploy the Cloud Function to GCP, use the gcloud
CLI with the beta version of the functions deploy command. According to Google, gcloud
is a part of the Google Cloud SDK. You must download and install the SDK on your system and initialize it before you can use gcloud
. You should ensure that your function is deployed to the same region as your Google Storage Bucket. Currently, Cloud Functions are only available in four regions. I have included a shell script, deploy-cloud-function.sh
, to make this step easier. (gist).
#!/usr/bin/env sh

# author: Gary A. Stafford
# site: https://programmaticponderings.com
# license: MIT License

set -ex

# Set constants
REGION="<your_region>"
FUNCTION_NAME="<your_function_name>"

# Deploy the Google Cloud Function
gcloud beta functions deploy ${FUNCTION_NAME} \
  --runtime nodejs8 \
  --region ${REGION} \
  --trigger-http \
  --memory 256MB \
  --env-vars-file .env.yaml
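For reference, the .env.yaml file passed to the --env-vars-file flag is a simple YAML map of key/value pairs. A minimal sketch is shown below; the bucket URL is a placeholder.
# .env.yaml - environment variables applied at deployment time
IMAGE_BUCKET: https://storage.googleapis.com/your_bucket_name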
The creation or update of the Cloud Function can take up to two minutes. Note the .gcloudignore file referenced in the verbose output below. This file is created the first time you deploy a new function. Using the .gcloudignore file, you can limit the deployed files to just the function (index.js) and the package.json file. There is no need to deploy any other files to GCP.
If you recall, the URL endpoint of the Cloud Function is required in the Dialogflow Fulfillment tab. The URL can be retrieved from the deployment output (shown above), or from the Cloud Functions Console on GCP (shown below). The Cloud Function is now deployed and will be called by the Action when a user invokes the Action.
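The endpoint can also be retrieved from the command line. The sketch below assumes the same function name and region used in the deployment script.
# Print the HTTPS trigger URL of the deployed Cloud Function
gcloud functions describe ${FUNCTION_NAME} \
  --region ${REGION} \
  --format 'value(httpsTrigger.url)'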
Simulation Testing and Debugging
With our Action and all its dependencies deployed and configured, we can test the Action using the Simulation console on Actions on Google. According to Google, the Action Simulation console allows us to manually test our Action by simulating a variety of Google-enabled hardware devices and their settings. You can also access debug information such as the request and response that your fulfillment receives and sends.
Below, in the Action Simulation console, we see the successful display of the initial Azure Tech Facts containing the expected Simple Response, Basic Card, and Suggestions, triggered by a user’s explicit invocation of the Action.
The simulated response indicates that the Google Cloud Function was called, and it responded successfully. It also indicates that the Google Cloud Function was able to successfully retrieve the correct image from Google Cloud Storage.
Below, we see the successful response to the user’s implicit invocation of the Action, in which they are seeking a fact about Azure’s Cognitive Services. The simulated response indicates that the Google Cloud Function was called, and it responded successfully. It also indicates that the Google Cloud Function was able to successfully retrieve the correct Entity from Google Cloud Datastore, as well as the correct image from Google Cloud Storage.
If we had issues with the testing, the Action Simulation console also contains tabs containing the request and response objects sent to and from the Cloud Function, the audio response, a debug console, and any errors.
Logging and Analytics
In addition to the Simulation console’s ability to debug issues with our service, we also have Google Stackdriver Logging. The Stackdriver logs, which are viewed from the GCP management console, contain the complete requests and responses, to and from the Cloud Function, from the Google Assistant Action. The Stackdriver logs will also contain any logs entries you have explicitly placed in the Cloud Function.
We also have the ability to view basic Analytics about our Action from within the Dialogflow Analytics console. Analytics displays metrics, such as the number of sessions, the number of queries, the number of times each Intent was triggered, how often users exited the Action from an intent, and Sessions flows, shown below.
In a simple Action such as this one, the Session flow is not very beneficial. However, in more complex Actions, with multiple Intents and a variety of potential user interactions, being able to visualize Session flows becomes essential to understanding the user’s conversational path through the Action.
Conclusion
In this post, we have seen how to use the Actions on Google development platform and the latest version of the Dialogflow API to build Google Actions. Google Actions rather effortlessly integrate with the breadth of Google Cloud Platform’s serverless offerings, including Google Cloud Functions, Cloud Datastore, and Cloud Storage.
We have seen how Google is quickly maturing their serverless functions, to compete with AWS and Azure, with the recently announced support of LTS version 8 of Node.js and Python, to create Actions for Google Assistant.
Impact of Serverless
As an Engineer, I have spent endless days, late nights, and thankless weekends, building, deploying and managing servers, virtual machines, container clusters, persistent storage, and database servers. I think what is most compelling about platforms like Actions on Google, but even more so, serverless technologies on GCP, is that I spend the majority of my time architecting and developing compelling software. I don’t spend time managing infrastructure, worrying about capacity, configuring networking and security, and doing DevOps.
¹Azure is a trademark of Microsoft
All opinions expressed in this post are my own and not necessarily the views of my current or past employers, their clients, or Google and Microsoft.
Managing Applications Across Multiple Kubernetes Environments with Istio: Part 2
Posted by Gary A. Stafford in Build Automation, DevOps, Enterprise Software Development, GCP, Software Development on April 17, 2018
In this two-part post, we are exploring the creation of a GKE cluster, replete with the latest version of Istio, often referred to as IoK (Istio on Kubernetes). We will then deploy, perform integration testing, and promote an application across multiple environments within the cluster.
Part Two
In Part One of this post, we created a Kubernetes cluster on the Google Cloud Platform, installed Istio, provisioned a PostgreSQL database, and configured DNS for routing. Under the assumption that v1 of the Election microservice had already been released to Production, we deployed v1 to each of the three namespaces.
In Part Two of this post, we will learn how to utilize the advanced API testing capabilities of Postman and Newman to ensure v2 is ready for UAT and release to Production. We will deploy and perform integration testing of a new v2 of the Election microservice, locally on Kubernetes Minikube. Once confident v2 is functioning as intended, we will promote and test v2 across the dev
, test
, and uat
namespaces.
Source Code
As a reminder, all source code for this post can be found on GitHub. The project’s README file contains a list of the Election microservice’s endpoints. To get started quickly, use one of the two following options (gist).
# clone the official v3.0.0 release for this post
git clone --depth 1 --branch v3.0.0 \
  https://github.com/garystafford/spring-postgresql-demo.git \
  && cd spring-postgresql-demo \
  && git checkout -b v3.0.0

# clone the latest version of code (newer than article)
git clone --depth 1 --branch master \
  https://github.com/garystafford/spring-postgresql-demo.git \
  && cd spring-postgresql-demo
Code samples in this post are displayed as Gists, which may not display correctly on some mobile and social media browsers. Links to gists are also provided.
This project includes a kubernetes sub-directory, containing all the Kubernetes resource files and scripts necessary to recreate the example shown in the post.
Testing Locally with Minikube
Deploying to GKE, no matter how automated, takes time and resources, whether those resources are team members or just compute and system resources. Before deploying v2 of the Election service to the non-prod GKE cluster, we should ensure that it has been thoroughly tested locally. Local testing should include the following test criteria:
- Source code builds successfully
- All unit-tests pass
- A new Docker Image can be created from the build artifact
- The Service can be deployed to Kubernetes (Minikube)
- The deployed instance can connect to the database and execute the Liquibase changesets
- The deployed instance passes a minimal set of integration tests
Minikube gives us the ability to quickly iterate and test an application, as well as the Kubernetes and Istio resources required for its operation, before promoting to GKE. These resources include Kubernetes Namespaces, Secrets, Deployments, Services, Route Rules, and Istio Ingresses. Since Minikube is just that, a miniature version of our GKE cluster, we should be able to have a nearly one-to-one parity between the Kubernetes resources we apply locally and those applied to GKE. This post assumes you have the latest version of Minikube installed, and are familiar with its operation.
This project includes a minikube sub-directory, containing all the Kubernetes resource files and scripts necessary to recreate the Minikube deployment example shown in this post. The three included scripts are designed to be easily adapted to a CI/CD DevOps workflow. You may need to modify the scripts to match your environment’s configuration. Note this Minikube-deployed version of the Election service relies on the external Amazon RDS database instance.
Local Database Version
To eliminate the AWS costs, I have included a second, alternate version of the Minikube Kubernetes resource files, minikube_db_local. This version deploys a single containerized PostgreSQL database instance to Minikube, as opposed to relying on the external Amazon RDS instance. Be aware, the database does not have persistent storage or an Istio sidecar proxy.
Minikube Cluster
If you do not have a running Minikube cluster, create one with the minikube start
command.
Minikube allows you to use normal kubectl
CLI commands to interact with the Minikube cluster. Using the kubectl get nodes
command, we should see a single Minikube node running the latest Kubernetes v1.10.0.
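A quick sketch of both commands is shown below; the resource flags are optional and environment-specific.
# Start a local Minikube cluster (flags are optional and environment-specific)
minikube start --memory 4096 --cpus 2
# Confirm the single Minikube node is in the Ready state
kubectl get nodes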
Istio on Minikube
Next, install Istio following Istio’s online installation instructions. A basic Istio installation on Minikube, without the additional add-ons, should only require a single Istio install script.
If successful, you should observe a new istio-system
namespace, containing the four main Istio components: istio-ca
, istio-ingress
, istio-mixer
, and istio-pilot
.
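As a rough sketch, assuming you have downloaded and extracted an Istio release of the same vintage used in Part One, the basic installation and verification resemble the following; file names and paths differ between Istio versions.
# From the root of the extracted Istio release (file name varies by Istio version)
kubectl apply -f install/kubernetes/istio.yaml
# Confirm the istio-system namespace and its components are running
kubectl get pods -n istio-system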
Deploy v2 to Minikube
Next, create a Minikube Development environment, consisting of a dev
Namespace, Istio Ingress, and Secret, using the part1-create-environment.sh
script. Next, deploy v2 of the Election service to thedev
Namespace, along with an associated Route Rule, using the part2-deploy-v2.sh
script. One v2 instance should be sufficient to satisfy the testing requirements.
Access to v2 of the Election service on Minikube is a bit different than with GKE. When routing external HTTP requests, there is no load balancer, no external public IP address, and no public DNS or subdomains. To access the single instance of v2 running on Minikube, we use the local IP address of the Minikube cluster, obtained with the minikube ip
command. The access port required is the Node Port (nodePort
) of the istio-ingress
Service. The command is shown below (gist) and included in the part3-smoke-test.sh
script.
export GATEWAY_URL="$(minikube ip):"\
"$(kubectl get svc istio-ingress -n istio-system -o 'jsonpath={.spec.ports[0].nodePort}')"
echo $GATEWAY_URL

curl $GATEWAY_URL/v2/actuator/health && echo
The second part of our HTTP request routing is the same as with GKE, relying on Istio Route Rules. The /v2/
sub-collection resource in the HTTP request URL is rewritten and routed to the v2 election Pod by the Route Rule. To confirm v2 of the Election service is running and addressable, curl the /v2/actuator/health
endpoint. Spring Actuator’s /health
endpoint is frequently used at the end of a CI/CD server’s deployment pipeline to confirm success. The Spring Boot application can take a few minutes to fully start up and be responsive to requests, depending on the speed of your local machine.
Using the Kubernetes Dashboard, we should see our deployment of the single Election service Pod is running successfully in Minikube’s dev
namespace.
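The equivalent checks from the command line are straightforward.
# Open the Kubernetes Dashboard for the local Minikube cluster
minikube dashboard
# Confirm the v2 Election service Pod is running in the dev namespace
kubectl get pods -n dev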
Once deployed, we run a battery of integration tests to confirm that the new v2 functionality is working as intended before deploying to GKE. In the next section of this post, we will explore the process creating and managing Postman Collections and Postman Environments, and how to automate those Collections of tests with Newman and Jenkins.
Integration Testing
The typical reason an application is deployed to lower environments, prior to Production, is to perform application testing. Although definitions vary across organizations, testing commonly includes some or all of the following types: Integration Testing, Functional Testing, System Testing, Stress or Load Testing, Performance Testing, Security Testing, Usability Testing, Acceptance Testing, Regression Testing, Alpha and Beta Testing, and End-to-End Testing. Test teams may also refer to other testing forms, such as Whitebox (Glassbox), Blackbox Testing, Smoke, Validation, or Sanity Testing, and Happy Path Testing.
The site, softwaretestinghelp.com, defines integration testing as, ‘testing of all integrated modules to verify the combined functionality after integration is termed so. Modules are typically code modules, individual applications, client and server applications on a network, etc. This type of testing is especially relevant to client/server and distributed systems.’
In this post, we are concerned that our integrated modules are functioning cohesively, primarily the Election service, Amazon RDS database, DNS, Istio Ingress, Route Rules, and the Istio sidecar Proxy. Unlike Unit Testing and Static Code Analysis (SCA), which is done pre-deployment, integration testing requires an application to be deployed and running in an environment.
Postman
I have chosen Postman, along with Newman, to execute a Collection of integration tests before promoting to the next environment. The integration tests confirm the deployed application’s name and version. The integration tests then perform a series of HTTP GET, POST, PUT, PATCH, and DELETE actions against the service’s resources. The integration tests verify a successful HTTP response code is returned, based on the type of request made.
Postman tests are written in JavaScript, similar to other popular, modern testing frameworks. Postman offers advanced features such as test-chaining. Tests can be chained together through the use of environment variables to store response values and pass them on to other tests. Values shared between tests are also stored in the Postman Environments. Below, we store the ID of the new candidate, the result of an HTTP POST to the /candidates
endpoint. We then use the stored candidate ID in proceeding HTTP GET, PUT, and PATCH test requests to the same /candidates
endpoint.
Environment-specific variables, such as the resource host, port, and environment sub-collection resource, are abstracted and stored as key/value pairs within Postman Environments, and called through variables in the request URL and within the tests. Thus, the same Postman Collection of tests may be run against multiple environments using different Postman Environments.
Postman Runner allows us to run multiple iterations of our Collection. We also have the option to build in delays between tests. Lastly, Postman Runner can load external JSON and CSV formatted test data, which is beyond the scope of this post.
Postman contains a simple Run Summary UI for viewing test results.
Test Automation
To support running tests from the command line, Postman provides Newman. According to Postman, Newman is a command-line collection runner for Postman. Newman offers the same functionality as Postman’s Collection Runner, all part of the newman
CLI. Newman is a Node.js module, installed globally as an npm package, npm install newman --global
.
Typically, Development and Testing teams compose Postman Collections and define Postman Environments, locally. Teams run their tests locally in Postman, during their development cycle. Then, those same Postman Collections are executed from the command line, or more commonly as part of a CI/CD pipeline, such as with Jenkins.
Below, the same Collection of integration tests that ran in the Postman Runner UI is run from the command line, using Newman.
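A minimal sketch of such a command-line run, assuming hypothetical Collection and Environment file names, might look like this:
# Run the Collection against the dev Postman Environment from the command line (file names are assumptions).
newman run spring-postgresql-demo.postman_collection.json \
  --environment dev.postman_environment.json \
  --reporters cli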
Jenkins
Without a doubt, Jenkins is the leading open-source CI/CD automation server. The building, testing, publishing, and deployment of microservices to Kubernetes is relatively easy with Jenkins. Generally, you would build, unit-test, push a new Docker image, and then deploy your application to Kubernetes using a series of CI/CD pipelines. Below, we see examples of these pipelines using Jenkins Blue Ocean, starting with a continuous integration pipeline, which includes unit-testing and Static Code Analysis (SCA) with SonarQube.
This is followed by a pipeline that builds the Docker Image, using the build artifact from the above pipeline, and pushes the Image to Docker Hub.
The third pipeline demonstrates building the three Kubernetes environments and deploying v1 of the Election service to the dev namespace. This pipeline is just for demonstration purposes; typically, you would separate these functions.
Spinnaker
An alternative to Jenkins for the deployment of microservices is Spinnaker, created by Netflix. According to Netflix, ‘Spinnaker is an open source, multi-cloud continuous delivery platform for releasing software changes with high velocity and confidence.’ Spinnaker is designed to integrate easily with Jenkins, dividing responsibilities for continuous integration and delivery, with deployment. Below, we see two sample Spinnaker deployment pipelines, similar to the Jenkins pipelines, for deploying v1 and v2 of the Election service to the non-prod GKE cluster.
Below, Spinnaker has deployed v2 of the Election service to dev using a Highlander deployment strategy. Subsequently, Spinnaker has deployed v2 to test using a Red/Black deployment strategy, leaving the previously released v1 Server Group in place, in case a rollback is required.
Once Spinnaker has completed the deployment tasks, the Postman Collections of smoke and integration tests are executed by Newman, as part of another Jenkins CI/CD pipeline.
In this pipeline, a set of basic smoke tests is run first to ensure the new deployment is running properly, and then the integration tests are executed.
In this simple example, we have a three-stage pipeline created from a Jenkinsfile (gist).
#!groovy
def ACCOUNT = "garystafford"
def PROJECT_NAME = "spring-postgresql-demo"
def ENVIRONMENT = "dev" // assumes 'api.dev.voter-demo.com' reachable

pipeline {
  agent any
  stages {
    stage('Checkout SCM') {
      steps {
        git changelog: true, poll: false,
          branch: 'master',
          url: "https://github.com/${ACCOUNT}/${PROJECT_NAME}"
      }
    }
    stage('Smoke Test') {
      steps {
        dir('postman') {
          nodejs('nodejs') {
            sh "sh ./newman-smoke-tests-${ENVIRONMENT}.sh"
          }
          junit '**/newman/*.xml'
        }
      }
    }
    stage('Integration Tests') {
      steps {
        dir('postman') {
          nodejs('nodejs') {
            sh "sh ./newman-integration-tests-${ENVIRONMENT}.sh"
          }
          junit '**/newman/*.xml'
        }
      }
    }
  }
}
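As a rough sketch, the newman-smoke-tests-dev.sh script called by the Smoke Test stage might resemble the following, writing its JUnit-style results where the pipeline’s junit step expects to find them; the Collection, Environment, and folder names are assumptions, not the actual repository contents.
#!/bin/bash
# Hypothetical sketch of newman-smoke-tests-dev.sh; actual file and folder names may differ.
newman run spring-postgresql-demo.postman_collection.json \
  --folder smoke-tests \
  --environment dev.postman_environment.json \
  --reporters cli,junit \
  --reporter-junit-export newman/smoke-tests-results.xml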
Test Results
Newman offers several options for displaying test results. For easy integration with Jenkins, Newman results can be delivered in a format that can be displayed as JUnit test reports. The JUnit test report format, XML, is a popular method of standardizing test results from different testing tools. Below is a truncated example of a test report file (gist).
<?xml version="1.0" encoding="UTF-8"?>
<testsuites name="spring-postgresql-demo-v2" time="13.339000000000002">
  <testsuite name="/candidates/{{candidateId}}" id="31cee570-95a1-4768-9ac3-3d714fc7e139" tests="1" time="0.669">
    <testcase name="Status code is 200" time="0.669"/>
  </testsuite>
  <testsuite name="/candidates/{{candidateId}}" id="a5a62fe9-6271-4c89-a076-c95bba458ef8" tests="1" time="0.575">
    <testcase name="Status code is 200" time="0.575"/>
  </testsuite>
  <testsuite name="/candidates/{{candidateId}}" id="2fc4c902-b931-4b35-b28a-7e264f40ee9c" tests="1" time="0.568">
    <testcase name="Status code is 204" time="0.568"/>
  </testsuite>
  <testsuite name="/candidates/summary" id="94fe972e-32f4-4f58-a5d5-999cacdf7460" tests="1" time="0.337">
    <testcase name="Status code is 200" time="0.337"/>
  </testsuite>
  <testsuite name="/candidates/summary/{election}" id="f8f817c8-4785-49f1-8d09-8055b84c4fc0" tests="1" time="0.351">
    <testcase name="Status code is 200" time="0.351"/>
  </testsuite>
  <testsuite name="/candidates/search/findByLastName?lastName=Paul" id="504f8741-e9d2-4f05-b1ad-c14136030f34" tests="1" time="0.256">
    <testcase name="Status code is 200" time="0.256"/>
  </testsuite>
</testsuites>
Translating Newman test results to JUnit reports allows the percentage of test cases successfully executed, a universal testing metric, to be tracked over multiple deployments. Below we see the JUnit Test Reports Test Result Trend graph for a series of test runs.
Deploying to Development
Development environments typically have a rapid turnover of application versions. Many teams use their Development environment as a continuous integration environment, where every commit that successfully builds and passes all unit tests is deployed. The purpose of the CI deployments is to ensure build artifacts will successfully deploy through the CI/CD pipeline, start properly, and pass a basic set of smoke tests.
Other teams use the Development environments as an extension of their local Minikube environment. The Development environment will possess some or all of the required external integration points, which the Developer’s local Minikube environment may not. The goal of the Development environment is to help Developers ensure their application is functioning correctly and is ready for the Test teams to evaluate, prior to promotion to the Test environment.
Some external integration points, such as external payment gateways, customer relationship management (CRM) systems, content management systems (CMS), or data analytics engines, are often stubbed-out in lower environments. Generally, third-party providers only offer a limited number of parallel non-Production integration environments. While an application may pass through several non-prod environments, testing against all external integration points will only occur in one or two of those environments.
With v2 of the Election service ready for testing on GKE, we deploy it to the GKE cluster’s dev namespace using the part4a-deploy-v2-dev.sh
script. We will also delete the previous v1 version of the Election service. Similar to the v1 deployment script, the v2 scripts perform a kube-inject
command, which manually injects the Istio sidecar proxy alongside the Election service into each election v2 Pod. The deployment script also deploys an alternate Istio Route Rule, which routes requests made to the api.dev.voter-demo.com/v2/*
resource to v2 of the Election service.
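A hedged sketch of the key steps such a deployment script might perform is shown below; the manifest file names are assumptions, not the actual repository contents.
# Hypothetical sketch: inject the Istio sidecar and deploy v2 to the dev namespace.
istioctl kube-inject -f election-deployment-v2.yaml | kubectl apply -n dev -f -
# Remove the previous v1 Deployment (manifest names are assumptions).
kubectl delete -n dev -f election-deployment-v1.yaml
# Apply the alternate Istio Route Rule for the /v2/* path.
kubectl apply -n dev -f election-routerule-v2.yaml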
Once deployed, we run our Postman Collection of integration tests with Newman or as part of a CI/CD pipeline. In the Development environment, we may choose to run a limited set of tests for the sake of expediency, or because not all external integration points are accessible.
Promotion to Test
With local Minikube and Development environment testing complete, we promote and deploy v2 of the Election service to the Test environment, using the part4b-deploy-v2-test.sh
script. In Test, we will not delete v1 of the Election service.
Often, an organization will maintain a running copy of all versions of an application currently deployed to Production, in a lower environment. Let’s look at two scenarios where this is common. First, v1 of the Election service has an issue in Production, which needs to be confirmed and may require a hot-fix by the Development team. Validation of the v1 Production bug is often done in a lower environment. The second scenario for having both versions running in an environment is when v1 and v2 both need to co-exist in Production. Organizations frequently support multiple API versions. Cutting over an entire API user-base to a new API version is often completed over a series of releases, and requires careful coordination with API consumers.
Testing All Versions
An essential role of integration testing should be to confirm that both versions of the Election service are functioning correctly, while simultaneously running in the same namespace. For example, we want to verify traffic is routed correctly, based on the HTTP request URL, to the correct version. Another common test scenario is database schema changes. Suppose we make what we believe are backward-compatible database changes to v2 of the Election service. We should be able to prove, through testing, that both the old and new versions function correctly against the latest version of the database schema.
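For example, a quick check of URL-based routing from the command line could compare the response codes returned by each version; the host and path prefixes below are assumptions for illustration only.
# Confirm each version responds on its own URL path (host and paths are hypothetical).
curl -s -o /dev/null -w "v1: %{http_code}\n" http://api.test.voter-demo.com/v1/candidates
curl -s -o /dev/null -w "v2: %{http_code}\n" http://api.test.voter-demo.com/v2/candidates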
There are different automation strategies that could be employed to test multiple versions of an application without creating separate Collections and Environments. A simple solution would be to templatize the Environments file, and then programmatically change the Postman Environment’s version
variable, injected from a pipeline parameter (abridged environment file shown below).
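As a rough sketch, assuming a hypothetical Jenkins pipeline parameter named VERSION, a pipeline step could generate an abridged Environment file at build time along these lines; the host value and file name are also assumptions.
# Hypothetical sketch: write an abridged Postman Environment file from a pipeline parameter.
VERSION=v2
cat > test.postman_environment.json <<EOF
{
  "name": "test",
  "values": [
    { "key": "host", "value": "api.test.voter-demo.com", "enabled": true },
    { "key": "version", "value": "${VERSION}", "enabled": true }
  ]
}
EOF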
Once initial automated integration testing is complete, Test teams will typically execute additional forms of application testing if necessary, before signing off for UAT and Performance Testing to begin.
User-Acceptance Testing
With testing in the Test environments completed, we continue on to UAT. The term UAT suggests that a set of actual end-users (API consumers) of the Election service will perform their own testing. Frequently, UAT is only done for a short, fixed period of time, often with a specialized team of Testers. Issues experienced during UAT can be expensive and can impact the ability to release an application to Production on-time if sign-off is delayed.
After deploying v2 of the Election service to UAT, and before opening it up to the UAT team, we would naturally want to repeat the same integration testing process we conducted in the previous Test environment. We must ensure that v2 is functioning as expected before our end-users begin their testing. This is where leveraging a tool like Jenkins makes automated integration testing more manageable and repeatable. One strategy would be to duplicate our existing Development and Test pipelines, and re-target the new pipeline to call v2 of the Election service in UAT.
Again, in a JUnit report format, we can examine individual results through the Jenkins Console.
We can also examine individual results from each test run using a specific build’s Console Output.
Testing and Instrumentation
To fully evaluate the integration test results, you must look beyond just the percentage of test cases executed successfully. It makes little sense to release a new version of an application if it passes all functional tests, but significantly increases client response times, unnecessarily increases memory consumption, wastes other compute resources, or is grossly inefficient in the number of calls it makes to the database or third-party dependencies. Oftentimes, integration testing uncovers potential performance bottlenecks that are incorporated into performance test plans.
Critical intelligence about the performance of the application can only be obtained through the use of logging and metrics collection and instrumentation. Istio provides this telemetry out-of-the-box with Zipkin, Jaeger, Service Graph, Fluentd, Prometheus, and Grafana. In the included Grafana Istio Dashboard below, we see the performance of v1 of the Election service, under test, in the Test environment. We can compare request and response payload size and timing, as well as request and response times to external integration points, such as our Amazon RDS database. We are able to observe the impact of individual test requests on the application and all its integration points.
As part of integration testing, we should monitor the Amazon RDS CloudWatch metrics. CloudWatch allows us to evaluate critical database performance metrics, such as the number of concurrent database connections, CPU utilization, read and write IOPS, Memory consumption, and disk storage requirements.
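For example, one of these metrics could be pulled from the command line with the AWS CLI; this is a sketch, assuming the CLI is configured, and the RDS instance identifier and time window are hypothetical.
# Hypothetical example: average concurrent connections to the RDS instance, in 5-minute periods.
aws cloudwatch get-metric-statistics \
  --namespace AWS/RDS \
  --metric-name DatabaseConnections \
  --dimensions Name=DBInstanceIdentifier,Value=election-postgresql \
  --statistics Average \
  --period 300 \
  --start-time 2018-07-01T00:00:00Z \
  --end-time 2018-07-01T01:00:00Z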
A discussion of metrics starts moving us toward load and performance testing against Production service-level agreements (SLAs). Using an approach similar to integration testing, load and performance testing should allow us to accurately estimate the sizing requirements of our new application for Production. Load and Performance Testing helps answer questions such as what type and size of compute resources are required for our GKE Production cluster and our Amazon RDS database, or how many compute nodes and Pod instances are necessary to support the expected user load.
All opinions expressed in this post are my own, and not necessarily the views of my current or past employers, or their clients.
Setting Up the Nexus 7 for Development with ADT on Ubuntu
Posted by Gary A. Stafford in Software Development on February 17, 2013
Recently, I purchased a Google Nexus 7. Excited to begin development with this Android 4.2 device, I first needed to prepare my development environment. The process of configuring my Ubuntu Linux-based laptop to debug directly on the Nexus 7 was fairly easy once I identified all the necessary steps. I thought I would share my process for those who are trying to do the same. Note that the following steps assume you already have Java installed on your development computer, and that your Nexus 7 is connected via USB.
Configuration
1) Install ADT Bundle: Download and extract the Android Developer Tools (ADT) Bundle from http://developer.android.com/sdk/index.html. I installed the Linux 64-bit version for use on my Ubuntu 12.10 laptop. The bundle, according to the website, includes ‘the essential Android SDK components and a version of the Eclipse IDE with built-in ADT to streamline your Android app development.’
2) Install IA32 Libraries: Install the ia32 shared libraries using Synaptic Package Manager, or the following command, ‘sudo apt-get install ia32-libs’. I received some initial errors with ADT, until I found a post on Stack Overflow that suggested loading the libraries; they did the trick.
3) Configure Nexus 7 to Auto-Mount: To auto-mount the Nexus 7 on your computer, you need to make some changes to your computer’s system configuration. Follow this post to ‘configure your Ubuntu computer to directly access your Nexus 7 exported filesystem in MTP mode as soon as you plug it to a USB port.’, ‘http://bernaerts.dyndns.org/linux/247-ubuntu-automount-nexus7-mtp‘. It looked a bit intimidating at first, but it actually turned out to be pretty easy and only took a few minutes.
4) Setup ADT to Debug Nexus 7: To debug on the Nexus 7 from ADT via USB, according to this next post, ‘you need to add a udev rules file that contains a USB configuration for each type of device you want to use for development.’ Follow the steps in this post, ‘http://developer.android.com/tools/device.html‘ to create the rules file, similar to step 3.
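For reference, on Ubuntu the rules file for a Google device generally ends up looking something like the sketch below; verify the vendor ID and file name against the linked documentation before using it.
# Create a udev rules file for Google USB devices (vendor ID 18d1), then reload the rules.
echo 'SUBSYSTEM=="usb", ATTR{idVendor}=="18d1", MODE="0666", GROUP="plugdev"' | \
  sudo tee /etc/udev/rules.d/51-android.rules > /dev/null
sudo chmod a+r /etc/udev/rules.d/51-android.rules
sudo udevadm control --reload-rules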
5) Setup Nexus 7 for USB Debugging: The post mentioned in step 4 also discusses setting up your Nexus 7 to enable USB debugging. The post instructs you to ‘go to Settings > About phone and tap Build number seven times. Return to the previous screen to find Developer options.’ From there, check the ‘USB debugging’ box. There are many more options you may wish to experiment with later.
6) Update your Software: After these first five steps, I suggest updating your software and restarting your computer. Run the following command to make sure your system is up-to-date, ‘sudo apt-get update && sudo apt-get upgrade’. I always run this before and after any software installations to make sure my system is current.
After completing step 3, the Nexus 7 device should appear on your Launcher. However, it is not until steps 4 and 5 that it will appear in ADT.
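If you want to confirm the connection outside of ADT, the adb tool included with the bundle’s SDK should also list the device once USB debugging has been enabled and authorized:
# List attached Android devices; the Nexus 7 should appear with a 'device' status.
adb devices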
Demonstration
Below is a simple demonstration of debugging an Android application on the Nexus 7 from ADT, via USB. Although ADT allows you to configure an Android Virtual Device (AVD), direct debugging on the Nexus 7 is substantially faster. The Nexus 7 AVD emulation was incredibly slow, even on my 64-bit, Core i5 laptop.