Posts Tagged ML
Accelerate Software Development with Six Popular Generative AI-Powered Coding Tools
Posted by Gary A. Stafford in AI/ML, Cloud, Software Development on March 25, 2023
Explore six popular generative AI-powered tools, including ChatGPT, Copilot, CodeWhisperer, Tabnine, Bing, and ChatSonic

Introduction
Modern software systems continue to grow inherently more complex over time. We have evolved from bulky monoliths to loosely coupled, event-driven, fault-tolerant, stateless, serverless, cloud-native, real-time, microservices-based, API-first, continuously deployed distributed systems, festooned with CQRS, 2PC, DDD, EDA, DOMA, Sagas, BFFs, GraphQL, gRPC, micro-frontends, contract tests, and Hexagonal architectures. Generative AI-powered coding tools do not necessarily make a developer’s job easier — they assist developers in dealing with increasing system complexity.
This post examines six popular generative AI-powered coding tools, including chat-based OpenAI ChatGPT, Microsoft’s all-new Bing Chat, and ChatSonic, as well as IDE-based Tabnine, GitHub Copilot, and Amazon CodeWhisperer (Preview). In this post, each tool will assist with developing an identical program to complete a series of common tasks on AWS. We will then compare and contrast each tool’s ease of use and the resulting code accuracy and quality.
“Generative AI coding tools are a new class of software development tools that leverage machine learning algorithms to assist developers in writing code. These tools use AI models trained on vast amounts of code to offer suggestions for completing code snippets, writing functions, and even entire blocks of code.” (quote generated by ChatGPT)
The generative AI space is evolving at a breakneck pace. Tools continue to rapidly improve their AI models, add new features, and adjust pricing. In just the short time it took to research and write this article:
- OpenAI announced GPT-4 on March 14, 2023
- Microsoft announced new Bing running on GPT-4 on March 14, 2023
- Google’s Bard AI launched in Early Access on March 21, 2023
- GPT-4 was added to Azure’s OpenAI Service on March 21, 2023
- Writesonic ChatSonic revised its pricing model on March 21, 2023
- GitHub announced GitHub Copilot X on March 22, 2023
- OpenAI announced ChatGPT plugins on March 23, 2023
Generative AI
According to McKinsey & Company in their recent article, What is generative AI?, “Generative artificial intelligence (AI) describes algorithms (such as ChatGPT) that can be used to create new content, including audio, code, images, text, simulations, and videos. Recent new breakthroughs in the field have the potential to drastically change the way we approach content creation.”
Generative AI for Code Generation
According to Papers with Code, “Code Generation is an important field to predict explicit code or program structure from multimodal data sources such as incomplete code, programs in another programming language, natural language descriptions or execution examples. Code Generation tools can assist the development of automatic programming tools to improve programming productivity.” Similarly, according to Marktechpost, “Generative AI technologies have led to a surge of interest and progress in code generation applications. These technologies use machine learning algorithms and natural language processing to assist developers in automating the time-consuming and laborious portions of coding.”
Common Features
Standard features of leading generative AI-powered coding tools include the following:
- Whole-line, full-function, and block code completion
- Natural language to code completion
- Code suggestions based on the model’s pre-trained dataset
- Context-aware recommendations based on your existing code
- Context-aware recommendations based on your code comments
- Native integration with popular IDEs
- Multi-language coding support
- Respond to follow-up instructions (resulting in refinement of code)
Benefits
The benefits of generative AI-powered code generation include the following:
- Accelerate application development
- Improve development productivity
- Decrease development costs
- Produce higher-quality, more consistent code
- Enable better code documentation and test coverage
- Learn new programming languages and coding techniques
- Contribute to the democratization of programming
Methods of Code Generation
As you explore different generative AI tools for writing code, you will likely develop techniques for generating effective code with each tool.
Long-form Narrative Approach
With chat-based tools like OpenAI ChatGPT, Microsoft Bing Chat, and ChatSonic, developers might use a longer narrative-type instruction (aka one-shot approach) rather than a series of more concise instructions to generate the code. For example:
- “Write a Python script to create a new DynamoDB table with a partition key of username, a sort key of last_name, and provisioned throughput of 5 RCUs and 5 WCUs. Pass the table name into the function as a variable. Wait until the table exists before continuing. Include try/except blocks and logging. If the table fails to be created, log a FATAL error and exit, and if the table already exists, log a WARNING. Add a main function and wrap everything in a Class called DynamoUtilities.”
- “Add three new functions, including a function to insert a new item into a table, a function to retrieve an item from a table, and a function to delete a table.”
Below is an example of OpenAI ChatGPT’s response using this approach. Of course, this method doesn’t limit you from following up with additional instructions to further enhance and refine the code.
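For reference, here is a minimal sketch, assuming the default boto3 client configuration, of the kind of code such a prompt typically yields. It illustrates the requested behavior; it is not ChatGPT’s literal output.

import logging
import sys

import boto3
from botocore.exceptions import ClientError

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)


class DynamoUtilities:
    def __init__(self):
        self.dynamodb = boto3.client('dynamodb')

    def create_table(self, table_name):
        try:
            self.dynamodb.create_table(
                TableName=table_name,
                KeySchema=[
                    {'AttributeName': 'username', 'KeyType': 'HASH'},
                    {'AttributeName': 'last_name', 'KeyType': 'RANGE'}],
                AttributeDefinitions=[
                    {'AttributeName': 'username', 'AttributeType': 'S'},
                    {'AttributeName': 'last_name', 'AttributeType': 'S'}],
                ProvisionedThroughput={'ReadCapacityUnits': 5,
                                       'WriteCapacityUnits': 5})
            # wait until the table exists before continuing
            self.dynamodb.get_waiter('table_exists').wait(TableName=table_name)
        except self.dynamodb.exceptions.ResourceInUseException:
            logger.warning('Table %s already exists', table_name)
        except ClientError as err:
            logger.fatal('Table %s failed to be created: %s', table_name, err)
            sys.exit(1)


def main():
    DynamoUtilities().create_table('users')  # hypothetical table name


if __name__ == '__main__':
    main()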

Progressive Approach
Although the previous approach makes for an impressive demonstration, that method of writing code differs from how most developers work. Instead, a developer might use a progressive approach (aka chain of thought) to generate and refine the code with chat-based tools. In my tests with chat-based tools, I found a progressive approach of asking multiple, concise instructions to guide the model toward the desired behavior produced higher-quality code with fewer errors. For example:
- “Write a Python script to create a DynamoDB table with a partition key of username, a sort key of last_name, and provisioned throughput of 5 RCUs and 5 WCUs.”
- “Pass the table name into the create table function as a variable.”
- “Wait until the table exists before continuing.”
- “Add logging with the default level of INFO.”
- “Include try/except blocks, log a FATAL error and exit if the table fails to be created and a WARNING if the table already exists.”
- “Add a function to delete the table and wait until the table is deleted before continuing.”
- “Add a main function.”
- “Add a function to insert a new item into a table and a function to get an item from a table.”
- “Wrap the code in a Class called DynamoUtilities.”
- “Convert the Python code to Java.”
Below is an example of OpenAI ChatGPT’s response using a progressive code generation approach. The final code results from an initial prompt and a series of follow-up instructions to enhance and refine the code.

Be aware that most chat-based tools have a session limit. For example, Bing has a limit of 15 chats per session, which will limit the number of follow-up instructions you can use to modify the code. Additionally, the slightest variation in how an instruction is phrased may significantly impact the code generated. There is a whole science and blossoming industry around prompt optimization!
IDE-based Code Generation
Tabnine, GitHub Copilot, and Amazon CodeWhisperer use generative AI technology to predict and suggest new lines or blocks of code based on their pre-trained models and the context and syntax of existing code and code comments within your IDE. These tools differ from tools like OpenAI ChatGPT, Microsoft Bing Chat, and ChatSonic, which generate code through a chat-based interaction with the user. IDE-based tools like Tabnine, GitHub Copilot, and Amazon CodeWhisperer are like having an AI-powered pair programming partner that enhances your development productivity. However, these tools also require a reasonable level of development experience, in my opinion.
Below, we see an example of Tabnine Pro’s ability to generate whole-line code completion (line 11) and full-function code completion (lines 12–41) based on the existing code context and syntax (lines 1–11).

Although writing code comments may be perceived as easier than using generative code completion, many industry pundits discourage what they call “comment-driven development.” Instead, they believe using generative AI-based code completion produces superior results.
Testing Generative AI Tools
With the assistance of each tool, I have written a Python script to perform the following tasks: 1) create an Amazon DynamoDB table, 2) insert an item into a table, 3) retrieve an item from a table, and finally, 4) delete a table. Although I tested the results in multiple IDEs, I only included the results from Visual Studio Code (VS Code). Tool interactions and code results were consistent across IDEs.
As a reference for “what good looks like” when evaluating code accuracy and quality, I referenced the examples provided with AWS’s Boto3 SDK and DynamoDB API documentation and the PEP 8 — Style Guide for Python Code.

An example of the final code, written with the assistance of GitHub Copilot, including unit tests, is available on GitHub.
OpenAI ChatGPT

According to Wikipedia, ChatGPT (Generative Pre-trained Transformer) “is an artificial intelligence chatbot developed by OpenAI and launched in November 2022. It is built on top of OpenAI GPT-3 and GPT-4 [released March 14, 2023] families of large language models [LLMs] and has been fine-tuned (an approach to transfer learning) using both supervised and reinforcement learning techniques.”
According to the VC-backed California-based startup OpenAI, “ChatGPT interacts in a conversational way. The dialogue format makes it possible for ChatGPT to answer follow-up questions, admit its mistakes, challenge incorrect premises, and reject inappropriate requests.” Further, “ChatGPT was optimized for dialogue using Reinforcement Learning with Human Feedback (RLHF), which uses human demonstrations and preference comparisons to guide the model toward desired behavior.”
Getting Started with ChatGPT
You can get started with ChatGPT for free using the Free Plan. However, if you want to use the latest generative AI models, get faster results, and ensure availability, you should consider ChatGPT Plus for $20/month. GPT-4 is now available through the ChatGPT Plus Plan.

ChatGPT’s Free Plan is a great way to test the generative AI waters. However, due to potentially slow chat responses and unavailability during peak times, you will want to upgrade to ChatGPT Plus if your team plans to rely on the tool.

Using ChatGPT
OpenAI ChatGPT is available through a web-based chat interface or programmatically using OpenAI APIs. For API users, OpenAI provides online API references, code examples, and an interactive coding playground. According to OpenAI, “The OpenAI API can be applied to virtually any task that involves understanding or generating natural language, code, or images.”
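As a quick illustration, here is a minimal sketch of calling the chat completions endpoint from Python with the openai package, as the API existed at the time of writing; the prompt text is only an example.

import os

import openai  # pip install openai

openai.api_key = os.getenv('OPENAI_API_KEY')

response = openai.ChatCompletion.create(
    model='gpt-3.5-turbo',
    messages=[{'role': 'user',
               'content': 'Write a Python script to create a DynamoDB table '
                          'with a partition key of username.'}])

# the generated code is returned as the message content
print(response.choices[0].message.content)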

Below is an example of output using ChatGPT’s UI. Using the UI, ChatGPT’s results are a mix of code blocks and explanatory text. ChatGPT has a straightforward, clean, and functional web-based user interface. There is a button to copy the resulting code to your clipboard, and previous chat sessions are saved and accessible.

ChatGPT Results
In my tests, using a progressive chat approach to building the code with concise instructional questions versus a longer-form narrative approach produced better-quality code. The code created using a progressive chat approach was accurate, error-free, and well-written.
A significant advantage of ChatGPT is its ability to remember what happened earlier in the conversation. According to OpenAI documentation, the model can reference up to approximately 3k words (or 4k tokens) from the current conversation; any information beyond that is not stored.

Microsoft Bing Chat

According to their February 7 blog post, Microsoft launched “an all new, AI-powered Bing search engine and Edge browser, available in preview now at Bing.com, to deliver better search, more complete answers, a new chat experience and the ability to generate content. We think of these tools as an AI copilot for the web.” The latest version of Bing “is running on a new, next-generation OpenAI large language model that is more powerful than ChatGPT and customized specifically for search. It takes key learnings and advancements from ChatGPT and GPT-3.5 — and it is even faster, more accurate and more capable.” On March 14, Microsoft confirmed the new Bing runs on OpenAI’s latest GPT-4.
Getting Started with Bing Chat
Bing’s Chat mode is only available when you access Bing using Microsoft Edge.

Microsoft Edge is available for most major platforms, including MacOS, iOS, and Android.

Using Bing Chat
Bing Chat now supports up to 15 chats per session and 150 per day. Users can supply up to 15 prompts (questions or instructions) to create and improve the code results. When responding, Bing considers the context of previous prompts in the same chat session.

Bing Chat Results
Like OpenAI ChatGPT, Bing Chat delivered accurate, error-free, and well-written code. Although ChatGPT and Bing Chat are not IDE-based tools, the resulting code is easily copied into your project or saved to disk. Additionally, these tools provide an excellent opportunity to learn new languages and coding techniques without requiring an IDE.

ChatSonic

ChatSonic is a service of Writesonic. According to Y Combinator, startup Writesonic, founded by Samanyou Garg, is backed by leading venture capital firms, including Y Combinator, HOF Capital, Rebel Fund, Soma Capital, Broom Ventures, Amino Capital, and some of the best angels from different industries.
ChatSonic claims to improve upon the limitations of ChatGPT. According to Writesonic, ChatSonic is “a powerful chatbot, powered by Google Search, allowing it to provide up-to-date and accurate content and hence superior to ChatGPT, which can not generate content on topics after September 2021. Additionally, it can create digital images and respond to voice commands and many more additional features. So, ChatSonic has addressed all the limitations of ChatGPT.”
Using ChatSonic
You can get started with ChatSonic for free, with up to 10,000 words. However, like OpenAI ChatGPT, if you want to take advantage of GPT-4, you will want to upgrade to a paid plan, starting at $13/month with an annual commitment.

Using ChatSonic is nearly identical to using OpenAI ChatGPT.

During my testing for this article, I ran into several issues with ChatSonic’s Premium Plan’s Free Trial. For example, the code block constantly flashed and refreshed on the screen using Chrome for Mac; a minor issue but highly annoying. Also, when copying or downloading code results, you get both code and the informational text, which you must separate. Worse, the final code block did not always include all the code output; the code was intermixed with informational text. Lastly, when using follow-up instructions to modify the results, the regenerated code often missed previous code modifications. In my tests, after several follow-up instructions, the resulting code was missing up to one-third of the previous statements. Due to these issues, the resulting code block was malformed and unrunnable. The more I chatted with ChatSonic, the worse the code issues became, and the issues were repeatable.


ChatSonic Results
Unlike ChatGPT and Bing Chat, ChatSonic’s Premium Plan’s Free Trial did not deliver accurate, error-free, or well-written code during my tests. Although later tests with fewer follow-up instructions provided better though incomplete results, my first impressions were unfavorable. The results might be improved using ChatSonic’s Superior or Ultimate Plans, which use the newest GPT-4 model. However, given the results of Premium, I was not keen to invest in the paid plan to find out.

Tabnine

According to the VC-backed, Israel-based startup Tabnine, “Tabnine’s mission is to create and deliver a top-to-bottom AI-assisted development workflow that empowers all code creators, in all languages, from concept through to completion.” Tabnine Pro serves whole-line, full-function, and natural language to code completions. Regarding AI, in Tabnine’s opinion, “a multi-model approach far outperforms a monolithic approach. That’s why we’ve developed language-specific code native AI models, which are pre-trained on code and provide faster and more accurate code completions based on your tech stack.” To learn more about their models, read Tabnine’s blog post, Tabnine Enterprise vs. ChatGPT Plus.
Getting Started with Tabnine
You can get started with Tabnine for free with the Starter plan. However, the features of the Starter plan are limited compared to the Pro Plan, which starts at $12/month for a single user.

You can try out the Pro plan’s features, which are free for 14 days.

Tabnine supports a wide range of programming languages.

To get started with Tabnine, choose your favorite IDE and install the required extensions or plugins.

Tabnine provides automated or manual IDE installation.

The Tabnine extension for VS Code already has a staggering 5.1 million downloads! A valuable tip I learned while testing the IDE-based tools was to disable all non-essential extensions to establish a baseline of each tool’s capabilities. Initially, with over eighty VS Code extensions enabled, I found it hard to separate and understand each tool’s features and potential conflicts with other extensions I had loaded, especially other generative AI tools.

Using Tabnine
As discussed earlier, Tabnine, like GitHub Copilot and Amazon CodeWhisperer, uses generative AI technology to predict and suggest subsequent lines of code based on context and syntax within your IDE. It is like having IntelliSense on steroids.
Like Copilot and CodeWhisperer, it takes some practice to get efficient with Tabnine. After starting with the initial import and logging statements, I used code comments to generate the functions, then returned to add exception-catching and logging.
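As an illustration of that comment-to-code flow, given imports and a descriptive comment like those below, the tool will typically suggest a complete function body. The boto3-based completion shown is one plausible suggestion, not Tabnine’s literal output.

import logging

import boto3

logging.basicConfig(level=logging.INFO)
dynamodb = boto3.resource('dynamodb')


# get an item from a DynamoDB table using the partition and sort keys
def get_item(table_name, username, last_name):
    table = dynamodb.Table(table_name)
    response = table.get_item(
        Key={'username': username, 'last_name': last_name})
    return response.get('Item')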
Tabnine, like Copilot and CodeWhisperer, includes a collapsible navigation bar. However, unlike those tools, I found Tabnine’s navigation bar provided little value. It seemed more like marketing stats and a way to encourage more users to sign up rather than providing useful development features like Copilot and CodeWhisperer.

Tabnine Results
Given my development knowledge and experience with similar tools, I found the final code results with Tabnine Pro were accurate, error-free, and well-formed (Pythonic). However, I would have liked to see additional features to differentiate Tabnine from its competitors.

GitHub Copilot

According to GitHub, “Copilot is an AI pair programmer that offers autocomplete-style suggestions as you code. You can receive suggestions from GitHub Copilot either by starting to write the code you want to use or by writing a natural language comment describing what you want the code to do. In addition, GitHub Copilot analyzes the context in the file you are editing and related files, and offers suggestions from within your text editor.” Copilot is powered by OpenAI Codex, a new AI system created by OpenAI.
Given the vast popularity and volume of code on GitHub, it is no wonder Copilot is trained on all languages that appear in GitHub’s public repositories. GitHub points out that the quality of suggestions you receive may depend on the volume and diversity of training data for that language.
GitHub Copilot X
GitHub is extending Copilot’s capabilities with the announcement of GitHub Copilot X on March 22, 2023. According to GitHub, “With chat and terminal interfaces, support for pull requests, and early adoption of OpenAI’s GPT-4, GitHub Copilot X is our vision for the future of AI-powered software development. Integrated into every part of your workflow.”

Getting Started with Copilot
Get started with GitHub Copilot for as little as $10/month. This Monthly Plan is a great way to test Copilot’s capabilities. In addition, GitHub is currently offering a 60-day free trial of Copilot.

Regarding IDE support, Copilot is currently available as an extension in Visual Studio Code, Visual Studio, Neovim, and the JetBrains suite of IDEs. The GitHub Copilot extension already has 4.2 million downloads, and the GitHub Copilot Nightly extension, used for this post, has almost 250,000 downloads.

GitHub Copilot can also be used with Codespaces, GitHub’s fully configured development environments in the Cloud based on VS Code, similar to AWS Cloud9.
Using Copilot
As discussed earlier, GitHub Copilot, like Tabnine and Amazon CodeWhisperer, uses generative AI technology to predict and suggest new lines of code based on existing context and syntax. In addition, Copilot includes an inline pop-up toolbar, which makes it easier to scroll through and accept code choices.

Copilot also has an extensive collapsible navigation bar, which can explain the code, offer language translations, provide access to Copilot’s Brushes, and even generate tests. My only disappointment with Copilot is that test generation is currently only supported for JavaScript and TypeScript, which is surprising given Python’s enormous popularity.

Below, we see an example of Copilot’s ability to generate whole-line code completion (line 14) and full-function code completion (lines 15–34) based on the existing code context and syntax (lines 1–14).

Copilot Results
Given my development knowledge and experience with similar tools, the final code results with Copilot were accurate, error-free, and well-formed. I found Copilot’s additional features, especially those found on its collapsible navigation bar, extended its functionality well beyond most similar generative AI tools I tested. Lastly, I felt Copilot’s ability to learn from the existing code context and syntax was superior to other tools.

Amazon CodeWhisperer (Preview)

According to AWS, Amazon CodeWhisperer, announced in June 2022, “is a general purpose, machine learning-powered code generator that provides you with code recommendations, in real time. As you write code, CodeWhisperer automatically generates suggestions based on your existing code and comments. Your personalized recommendations can vary in size and scope, ranging from a single line comment to fully formed functions.” CodeWhisperer code generation is powered by ML models trained on various data sources, including Amazon and open-source code.
CodeWhisperer currently supports Java, JavaScript, Python, C#, and TypeScript. Although these are some of the most popular languages, CodeWhisperer’s language support is limited compared to Tabnine and Copilot.
Getting Started with CodeWhisperer
It is easy to get started with CodeWhisperer. Unfortunately, final pricing is unavailable since Amazon CodeWhisperer is still in Preview as of late March 2023.

CodeWhisperer enhances your IDE using AWS Toolkit for JetBrains, AWS Toolkit for Visual Studio Code, AWS Lambda console, and AWS Cloud9. With Lambda and AWS Cloud9, the setup simply involves activating CodeWhisperer within the IDE.

Getting started with CodeWhisperer requires installing the AWS Toolkit if applicable, choosing your authentication method, and setting up your Builder ID, IAM Identity Center (AWS SSO), or IAM.

AWS Builder ID
According to the documentation, “the AWS Builder ID is a new personal profile for everyone who builds on AWS. Your AWS Builder ID provides access to tools and builder services on AWS, including Amazon CodeCatalyst and Amazon CodeWhisperer. You can keep your AWS Builder ID as you move between jobs, schools, or other organizations. AWS Builder IDs are free. You only pay for the AWS resources you consume in your AWS accounts.”

Using CodeWhisperer
Nearly identical to Copilot and Tabnine, Amazon CodeWhisperer provides real-time code recommendations in your IDE. Unique features of CodeWhisperer include first-class support for AWS APIs, built-in security scans, a code reference tracker, and responsible AI/ML practices that filter out code recommendations that might be considered biased or unfair.
Below, we see an example of CodeWhisperer’s ability to generate whole-line code completion (line 14) and full-function code completion (lines 15–27) based on the existing code context and syntax (lines 1–14).

CodeWhisperer Results
Like my feedback on Tabnine and Copilot, the final code results with CodeWhisperer were accurate, error-free, and well-formed. In addition, I appreciated the added benefits of CodeWhisperer’s Security Scan and Reference Tracker to identify significant code from other projects.

Other Options
There are many other tools in the Generative AI coding category of comparable quality to the six featured in this post. New tools are being released on an almost daily basis.
Google Bard
On March 21, 2023, Google announced limited access to Bard, an early experiment that lets you collaborate with generative AI. They are beginning with the U.S. and the U.K. and will expand to more countries and languages over time. Bard is based on Google’s LaMDA (Language Model for Dialogue Applications), a transformer-based model.

Azure OpenAI Service with GPT-4
On March 21, 2023, Microsoft announced that GPT-4 was available in preview in Azure OpenAI Service. Unfortunately, Azure OpenAI Service requires registration and is currently only available to approved enterprise customers and partners. To get started, if you are qualified, you must trudge through the 25-question form.

Replit Ghostwriter
California-based VC-backed startup Replit has put the power of Replit’s AI into the world’s most popular online IDE. Replit “automates away the repetitive parts of coding, so you can stay focused on making your creative vision a reality.” Replit announced Ghostwriter Chat, which allows you to chat with a coding AI directly in your IDE. It now has a proactive debugger and awareness of your project’s code.
Conclusion
In this post, we examined six popular generative AI-powered coding tools, including chat-based OpenAI ChatGPT, Microsoft’s all-new Bing Chat, and ChatSonic, as well as IDE-based GitHub Copilot, Amazon CodeWhisperer (Preview), and Tabnine. Each tool assisted with developing an identical program to complete everyday database tasks on AWS. We then compared and contrasted the ease of use and the resulting code accuracy and quality of the tools. The generative AI space is moving at a breakneck pace. Tools continue to rapidly improve their AI models, add new features, and adjust pricing.
Recommended References
- Article: Top Artificial Intelligence (AI) Tools That Can Generate Code To Help Programmers (Marktechpost)
- Article: Top 29 ChatGPT alternatives that will blow your mind in 2023 (Free & Paid) (Writesonic)
- Research Paper: Deep Learning for Source Code Modeling and Generation: Models, Applications and Challenges (Triet H. M. Le, Hao Chen, M. Ali Babar)
- Article: NLP vs. NLU vs. NLG: the differences between three natural language processing concepts (IBM)
- Article: Top 17 Generative AI-based Programming Tools (For Developers) (Toward AI)
🔔 To keep up with future content, follow Gary Stafford on LinkedIn.
This blog represents my viewpoints and not those of my employer, Amazon Web Services (AWS). All product names, logos, and brands are the property of their respective owners.
IoT Telemetry Collection using Google Protocol Buffers, Google Cloud Functions, Cloud Pub/Sub, and MongoDB Atlas
Posted by Gary A. Stafford in Big Data, Cloud, GCP, Python, Serverless, Software Development on May 21, 2019
Collect IoT sensor telemetry using Google Protocol Buffers’ serialized binary format over HTTPS, serverless Google Cloud Functions, Google Cloud Pub/Sub, and MongoDB Atlas on GCP, as an alternative to integrated Cloud IoT platforms and standard IoT protocols. Aggregate, analyze, and build machine learning models with the data using tools such as MongoDB Compass, Jupyter Notebooks, and Google’s AI Platform Notebooks.
Introduction
Most of the dominant Cloud providers offer integrated IoT (Internet of Things) and IIoT (Industrial IoT) services. Amazon has AWS IoT, Microsoft Azure has multiple offerings including IoT Central, IBM’s offerings include IBM Watson IoT Platform, Alibaba Cloud has multiple IoT/IIoT solutions for different vertical markets, and Google offers the Google Cloud IoT platform. All of these solutions are marketed as industrial-grade, highly-performant, scalable technology stacks. They are capable of scaling to tens of thousands of IoT devices or more and massive amounts of streaming telemetry.
In reality, not everyone needs a fully integrated IoT solution. Academic institutions, research labs, tech start-ups, and many commercial enterprises want to leverage the Cloud for IoT applications, but may not be ready for a fully-integrated IoT platform or are resistant to Cloud vendor platform lock-in.
Similarly, depending on the performance requirements and the type of application, organizations may not need or want to start out using IoT/IIoT industry standard data and transport protocols, such as MQTT (Message Queue Telemetry Transport) or CoAP (Constrained Application Protocol), over UDP (User Datagram Protocol). They may prefer to transmit telemetry over HTTP using TCP, or securely, using HTTPS (HTTP over TLS).
Demonstration
In this demonstration, we will collect environmental sensor data from a number of IoT device sensors and stream that telemetry over the Internet to Google Cloud. Each IoT device is installed in a different physical location. The devices contain a variety of common sensors, including humidity and temperature, motion, and light intensity.

Prototype IoT Devices used in this Demonstration
We will transmit the sensor telemetry data as JSON over HTTP to serverless Google Cloud Function HTTPS endpoints. We will then switch to using Google’s Protocol Buffers to transmit binary data over HTTP. We should observe a reduction in the message size contained in the request payload as we move from JSON to Protobuf, which should reduce system latency and cost.
Data received by Cloud Functions over HTTP will be published asynchronously to Google Cloud Pub/Sub. A second Cloud Function will respond to all published events and push the messages to MongoDB Atlas on GCP. Once in Atlas, we will aggregate, transform, analyze, and build machine learning models with the data, using tools such as MongoDB Compass, Jupyter Notebooks, and Google’s AI Platform Notebooks.
For this demonstration, the architecture for JSON over HTTP will look as follows. All sensors will transmit data to a single Cloud Function HTTPS endpoint.
For Protobuf over HTTP, the architecture will look as follows in the demonstration. Each type of sensor will transmit data to a different Cloud Function HTTPS endpoint.
Although the Cloud Functions will automatically scale horizontally to accommodate additional load created by the volume of telemetry being received, there are also other options to scale the system. For example, we could create individual pipelines of functions and topic/subscriptions for each sensor type. We could also split the telemetry data across multiple MongoDB Atlas Collections, based on sensor type, instead of a single collection. In all cases, we will still benefit from the Cloud Function’s horizontal scaling capabilities.
Source Code
All source code is available on GitHub. Use the following command to clone the project.
git clone \
  --branch master --single-branch --depth 1 --no-tags \
  https://github.com/garystafford/iot-protobuf-demo.git
You will need to adjust the project’s environment variables to fit your own development and Cloud environments. All source code for this post is written in Python. It is intended for Python 3 interpreters but has been tested using Python 2 interpreters. The project’s Jupyter Notebooks can be viewed from within the project on GitHub or using the free, online Jupyter nbviewer.
Technologies
Protocol Buffers
According to Google, Protocol Buffers (aka Protobuf) are a language- and platform-neutral, efficient, extensible, automated mechanism for serializing structured data for use in communications protocols, data storage, and more. Protocol Buffers are 3 to 10 times smaller and 20 to 100 times faster than XML.
Each protocol buffer message is a small logical record of information, containing a series of strongly-typed name-value pairs. Once you have defined your messages, you run the protocol buffer compiler for your application’s language on your .proto file to generate data access classes.
Google Cloud Functions
According to Google, Cloud Functions is Google’s event-driven, serverless compute platform. Key features of Cloud Functions include automatic scaling, high availability, fault tolerance, no servers to provision, manage, patch, or update, paying only while your code runs, and easy connections to and extensions of other cloud services. Cloud Functions natively support multiple event types, including HTTP, Cloud Pub/Sub, Cloud Storage, and Firebase. Current language support includes Python, Go, and Node.
Google Cloud Pub/Sub
According to Google, Cloud Pub/Sub is an enterprise message-oriented middleware for the Cloud. It is a scalable, durable event ingestion and delivery system. By providing many-to-many, asynchronous messaging that decouples senders and receivers, it allows for secure and highly available communication among independent applications. Cloud Pub/Sub delivers low-latency, durable messaging that integrates with systems hosted on the Google Cloud Platform and externally.
MongoDB Atlas
MongoDB Atlas is a fully-managed MongoDB-as-a-Service, available on AWS, Azure, and GCP. Atlas, a mature SaaS product, offers high-availability, uptime service-level agreements, elastic scalability, cross-region replication, enterprise-grade security, LDAP integration, BI Connector, and much more.
MongoDB Atlas currently offers four pricing plans, Free, Basic, Pro, and Enterprise. Plans range from the smallest, free M0-sized MongoDB cluster, with shared RAM and 512 MB storage, up to the massive M400 MongoDB cluster, with 488 GB of RAM and 3 TB of storage.
Cost Effectiveness of Cloud Functions
At true IIoT scale, Google Cloud Functions may not be the most efficient or cost-effective method of ingesting telemetry data. Based on Google’s pricing model, you get two million free function invocations per month, with each additional million invocations costing USD $0.40. The total cost also includes memory usage, total compute time, and outbound data transfer. If your system comprises tens or hundreds of IoT devices, Cloud Functions may prove cost-effective.
However, with thousands of devices or more, each transmitting data multiple times per minute, you could quickly outgrow the cost-effectiveness of Cloud Functions. In that case, you might look to the Google Cloud IoT platform. Alternatively, you can build your own platform with Google products such as Knative, letting you choose to run your containers either fully managed with the newly-released Cloud Run, or in your Google Kubernetes Engine cluster with Cloud Run on GKE.
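To put the pricing model in perspective, here is a rough, back-of-the-envelope estimate using hypothetical device counts, ignoring compute time, memory usage, and egress charges.

# invocation costs only, based on the pricing cited above
devices = 5000
messages_per_minute = 4

invocations = devices * messages_per_minute * 60 * 24 * 30  # per month
billable = max(invocations - 2_000_000, 0)  # first 2M invocations are free
cost = (billable / 1_000_000) * 0.40

print(f'{invocations:,} invocations/month ≈ ${cost:,.2f}')  # 864,000,000 ≈ $344.80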
Sensor Scripts
For each sensor type, I have developed separate Python scripts, which run on each IoT device. There are two versions of each script, one for JSON over HTTP and one for Protobuf over HTTP.
JSON over HTTPS
Below we see the script, dht_sensor_http_json.py, used to transmit humidity and temperature data via JSON over HTTP to a Google Cloud Function running on GCP. The JSON request payload contains a timestamp, IoT device ID, device type, and the temperature and humidity sensor readings. The URL for the Google Cloud Function is stored as an environment variable, local to the IoT devices, and set when the script is deployed.
import json
import logging
import os
import socket
import sys
import time

import Adafruit_DHT
import requests

URL = os.environ.get('GCF_URL')
JWT = os.environ.get('JWT')
SENSOR = Adafruit_DHT.DHT22
TYPE = 'DHT22'
PIN = 18
FREQUENCY = 15


def main():
    if not URL or not JWT:
        sys.exit("Are the Environment Variables set?")
    get_sensor_data(socket.gethostname())


def get_sensor_data(device_id):
    while True:
        humidity, temperature = Adafruit_DHT.read_retry(SENSOR, PIN)
        payload = {'device': device_id,
                   'type': TYPE,
                   'timestamp': time.time(),
                   'data': {'temperature': temperature,
                            'humidity': humidity}}
        post_data(payload)
        time.sleep(FREQUENCY)


def post_data(payload):
    payload = json.dumps(payload)
    headers = {
        'Content-Type': 'application/json; charset=utf-8',
        'Authorization': JWT
    }
    try:
        requests.post(URL, json=payload, headers=headers)
    except requests.exceptions.ConnectionError:
        logging.error('Error posting data to Cloud Function!')
    except requests.exceptions.MissingSchema:
        logging.error('Error posting data to Cloud Function! Are Environment Variables set?')


if __name__ == '__main__':
    sys.exit(main())
Telemetry Frequency
Although the sensors are capable of producing data many times per minute, for this demonstration, sensor telemetry is intentionally limited to only being transmitted every 15 seconds. To reduce system complexity, potential latency, back-pressure, and cost, in my opinion, you should only produce telemetry data at the frequency your requirements dictate.
JSON Web Tokens
For security, in addition to the HTTPS endpoints exposed by the Google Cloud Functions, I have incorporated the use of a JSON Web Token (JWT). JSON Web Tokens are an open, industry standard RFC 7519 method for representing claims securely between two parties. In this case, the JWT is used to verify the identity of the sensor scripts sending telemetry to the Cloud Functions. The JWT contains an id, password, and expiration, all signed with a secret key, which is known to each Cloud Function, in order to verify the IoT device’s identity. Without the correct JWT being passed in the Authorization header, the request to the Cloud Function will fail with an HTTP status code of 401 Unauthorized. Below is an example of the JWT’s payload data.
{ "sub": "IoT Protobuf Serverless Demo", "id": "iot-demo-key", "password": "t7J2gaQHCFcxMD6584XEpXyzWhZwRrNJ", "iat": 1557407124, "exp": 1564664724 }
For this demonstration, I created a temporary JWT using jwt.io. The HTTP Functions are using PyJWT, a Python library which allows you to encode and decode the JWT. The PyJWT library allows the Function to decode and validate the JWT (Bearer Token) from the incoming request’s Authorization header. The JWT token is stored as an environment variable. Deployment instructions are included in the GitHub project.
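Below is a minimal sketch of the round trip with PyJWT, assuming a shared HS256 secret; the values shown are placeholders, not the demonstration’s actual credentials.

import time

import jwt  # pip install PyJWT

SECRET_KEY = 'replace-with-a-strong-secret'  # placeholder value

claims = {
    'sub': 'IoT Protobuf Serverless Demo',
    'id': 'iot-demo-key',
    'password': 'placeholder-password',
    'iat': int(time.time()),
    'exp': int(time.time()) + (90 * 24 * 60 * 60),  # expires in 90 days
}

# the IoT device signs the claims and sends the token in the Authorization header
token = jwt.encode(claims, SECRET_KEY, algorithm='HS256')

# the Cloud Function performs the reverse operation to verify the device's identity
decoded = jwt.decode(token, SECRET_KEY, algorithms=['HS256'])
assert decoded['id'] == 'iot-demo-key'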
JSON Payload
Below is a typical JSON request payload (pretty-printed), containing DHT sensor data. This particular message is 148 bytes in size. The message format is intentionally reader-friendly. We could certainly shorten the message’s key fields, to reduce the payload size by an additional 15-20 bytes.
{ "device": "rp829c7e0e", "type": "DHT22", "timestamp": 1557585090.476025, "data": { "temperature": 17.100000381469727, "humidity": 68.0999984741211 } }
Protocol Buffers
For the demonstration, I have built a Protocol Buffers file, sensors.proto, to support the data output by three sensor types: digital humidity and temperature (DHT), passive infrared sensor (PIR), and digital light intensity (DLI). I am using the newer proto3 version of the protocol buffers language. I have created a common Protobuf sensor message schema, with the variable sensor telemetry stored in the nested data object, within each message type.
It is important to use the correct Protobuf Scalar Value Type to maintain numeric precision in the language you compile for. For simplicity, I am using a double to represent the timestamp, as well as the numeric humidity and temperature readings. Alternatively, you could choose one of Google’s Protobuf Well-Known Types, Timestamp, to store the timestamp.
syntax = "proto3"; package sensors; // DHT22 message SensorDHT { string device = 1; string type = 2; double timestamp = 3; DataDHT data = 4; } message DataDHT { double temperature = 1; double humidity = 2; } // Onyehn_PIR message SensorPIR { string device = 1; string type = 2; double timestamp = 3; DataPIR data = 4; } message DataPIR { bool motion = 1; } // Anmbest_MD46N message SensorDLI { string device = 1; string type = 2; double timestamp = 3; DataDLI data = 4; } message DataDLI { bool light = 1; }
Since the sensor data will be captured with scripts written in Python 3, the Protocol Buffers file is compiled for Python, resulting in the file, sensors_pb2.py.
protoc --python_out=. sensors.proto
Protocol Buffers over HTTPS
Below we see the alternate DHT sensor script, dht_sensor_http_pb.py, which transmits a Protocol Buffers-based binary request payload over HTTPS to a Google Cloud Function running on GCP. Note the request’s Content-Type header has been changed from application/json to application/x-protobuf. In this case, instead of JSON, the same data fields are stored in an instance of the Protobuf’s SensorDHT message type (sensors_pb2.SensorDHT()). Note the import sensors_pb2 statement. This statement imports the compiled Protocol Buffers file, which is stored locally to the script on the IoT device.
import logging
import os
import socket
import sys
import time

import Adafruit_DHT
import requests

import sensors_pb2

URL = os.environ.get('GCF_DHT_URL')
JWT = os.environ.get('JWT')
SENSOR = Adafruit_DHT.DHT22
TYPE = 'DHT22'
PIN = 18
FREQUENCY = 15


def main():
    if not URL or not JWT:
        sys.exit("Are the Environment Variables set?")
    get_sensor_data(socket.gethostname())


def get_sensor_data(device_id):
    while True:
        try:
            humidity, temperature = Adafruit_DHT.read_retry(SENSOR, PIN)
            sensor_dht = sensors_pb2.SensorDHT()
            sensor_dht.device = device_id
            sensor_dht.type = TYPE
            sensor_dht.timestamp = time.time()
            sensor_dht.data.temperature = temperature
            sensor_dht.data.humidity = humidity
            payload = sensor_dht.SerializeToString()
            post_data(payload)
            time.sleep(FREQUENCY)
        except TypeError:
            logging.error('Error getting sensor data!')


def post_data(payload):
    headers = {
        'Content-Type': 'application/x-protobuf',
        'Authorization': JWT
    }
    try:
        requests.post(URL, data=payload, headers=headers)
    except requests.exceptions.ConnectionError:
        logging.error('Error posting data to Cloud Function!')
    except requests.exceptions.MissingSchema:
        logging.error('Error posting data to Cloud Function! Are Environment Variables set?')


if __name__ == '__main__':
    sys.exit(main())
Protobuf Binary Payload
To understand the binary Protocol Buffers-based payload, we can write a sample SensorDHT message to a file on disk as a byte array.
message = sensorDHT.SerializeToString()

binary_file_output = open("./data_binary.txt", "wb")
file_byte_array = bytearray(message)
binary_file_output.write(file_byte_array)
Then, using the hexdump command, we can view a representation of the binary data file.
> hexdump -C data_binary.txt 00000000 0a 08 38 32 39 63 37 65 30 65 12 05 44 48 54 32 |..829c7e0e..DHT2| 00000010 32 1d 05 a0 b9 4e 22 0a 0d ec 51 b2 41 15 cd cc |2....N"...Q.A...| 00000020 38 42 |8B| 00000022
The binary data file size is 48 bytes on disk, as compared to the equivalent JSON file size of 148 bytes on disk (32% the size). As a test, we could then send that binary data file as the payload of a POST to the Cloud Function, as shown below using Postman. Postman will serialize the binary data file’s contents to a binary string before transmitting.
Similarly, we can serialize the same binary Protocol Buffers-based SensorDHT message to a binary string using the SerializeToString method.
message = sensorDHT.SerializeToString()
print(message)
The resulting binary string resembles the following.
b'\n\nrp829c7e0e\x12\x05DHT22\x19c\xee\xbcg\xf5\x8e\xccA"\x12\t\x00\x00\x00\xa0\x99\x191@\x11\x00\x00\x00`f\x06Q@'
The binary string length of the serialized message, and therefore the request payload sent by Postman and received by the Cloud Function for this particular message, is 111 bytes, as compared to the JSON payload size of 148 bytes (75% the size).
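The comparison is easy to reproduce. Below is a short sketch using the compiled sensors_pb2 module and the sample readings shown earlier; exact byte counts will vary slightly with JSON formatting.

import json

import sensors_pb2  # compiled from sensors.proto, as shown earlier

reading = {'device': 'rp829c7e0e', 'type': 'DHT22',
           'timestamp': 1557585090.476025,
           'data': {'temperature': 17.100000381469727,
                    'humidity': 68.0999984741211}}

# populate the Protobuf message with the same fields
sensor_dht = sensors_pb2.SensorDHT()
sensor_dht.device = reading['device']
sensor_dht.type = reading['type']
sensor_dht.timestamp = reading['timestamp']
sensor_dht.data.temperature = reading['data']['temperature']
sensor_dht.data.humidity = reading['data']['humidity']

json_size = len(json.dumps(reading).encode('utf-8'))
pb_size = len(sensor_dht.SerializeToString())
print(f'JSON: {json_size} bytes, Protobuf: {pb_size} bytes '
      f'({pb_size / json_size:.0%} the size)')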
Validate Protobuf Payload
To validate that the data contained in the Protobuf payload is identical to the JSON payload, we can parse the payload from the serialized binary string using the Protobuf ParseFromString method. We then convert it to JSON using the Protobuf MessageToJson method.
message = sensorDHT.SerializeToString()

message_parsed = sensors_pb2.SensorDHT()
message_parsed.ParseFromString(message)
print(MessageToJson(message_parsed))
The resulting JSON object is identical to the JSON payload sent using JSON over HTTPS, earlier in the demonstration.
{ "device": "rp829c7e0e", "type": "DHT22", "timestamp": 1557585090.476025, "data": { "temperature": 17.100000381469727, "humidity": 68.0999984741211 } }
Google Cloud Functions
There are a series of Google Cloud Functions, specifically four HTTP Functions, which accept the sensor data over HTTP from the IoT devices. Each function exposes an HTTPS endpoint. According to Google, you use HTTP functions when you want to invoke your function via an HTTP(S) request. To allow for HTTP semantics, HTTP function signatures accept HTTP-specific arguments.
Below, I have deployed a single function that accepts JSON sensor telemetry from all sensor types, and three functions for Protobuf, one for each sensor type: DHT, PIR, and DLI.
JSON Message Processing
Below, we see the Cloud Function, main.py, which processes the incoming JSON over HTTPS payload from all sensor types. Once the request’s JWT is validated, the JSON message payload is serialized to a byte string and sent to a common Google Cloud Pub/Sub Topic. Note the JWT secret key, id, and password, and the Google Cloud Pub/Sub Topic are all stored as environment variables, local to the Cloud Functions. In my tests, the JSON-based HTTP Functions took an average of 9–18 ms to execute successfully.
import logging
import os

import jwt
from flask import make_response, jsonify
from flask_api import status
from google.cloud import pubsub_v1

TOPIC = os.environ.get('TOPIC')
SECRET_KEY = os.getenv('SECRET_KEY')
ID = os.getenv('ID')
PASSWORD = os.getenv('PASSWORD')


def incoming_message(request):
    if not validate_token(request):
        return make_response(jsonify({'success': False}),
                             status.HTTP_401_UNAUTHORIZED,
                             {'ContentType': 'application/json'})

    request_json = request.get_json()
    if not request_json:
        return make_response(jsonify({'success': False}),
                             status.HTTP_400_BAD_REQUEST,
                             {'ContentType': 'application/json'})

    send_message(request_json)
    return make_response(jsonify({'success': True}),
                         status.HTTP_201_CREATED,
                         {'ContentType': 'application/json'})


def validate_token(request):
    auth_header = request.headers.get('Authorization')
    if not auth_header:
        return False
    auth_token = auth_header.split(" ")[1]
    if not auth_token:
        return False
    try:
        payload = jwt.decode(auth_token, SECRET_KEY)
        if payload['id'] == ID and payload['password'] == PASSWORD:
            return True
    except jwt.ExpiredSignatureError:
        return False
    except jwt.InvalidTokenError:
        return False


def send_message(message):
    publisher = pubsub_v1.PublisherClient()
    publisher.publish(topic=TOPIC, data=bytes(str(message), 'utf-8'))
The Cloud Functions are deployed to GCP using the gcloud functions deploy CLI command (I use Jenkins to automate the deployments). I have wrapped the deploy commands into bash scripts. The script also copies over a common environment variables YAML file, consumed by the Cloud Function. Each Function has a deployment script, included in the project.
# get latest env vars file
cp -f ./../env_vars_file/env.yaml .

# deploy function
gcloud functions deploy http_json_to_pubsub \
  --runtime python37 \
  --trigger-http \
  --region us-central1 \
  --memory 256 \
  --entry-point incoming_message \
  --env-vars-file env.yaml
Using a .gcloudignore file, the gcloud functions deploy CLI command deploys three files: the cloud function (main.py), the required Python packages file (requirements.txt), and the environment variables file (env.yaml). Google automatically installs dependencies using the requirements.txt file.
Protobuf Message Processing
Below, we see the Cloud Function, main.py, which processes the incoming Protobuf over HTTPS payload from DHT sensor types. Once the sensor data Protobuf message payload is received by the HTTP Function, it is deserialized to JSON and then serialized to a byte string. The byte string is then sent to a Google Cloud Pub/Sub Topic. In my tests, the Protobuf-based HTTP Functions took an average of 7–14 ms to execute successfully.
As before, note the import sensors_pb2 statement. This statement imports the compiled Protocol Buffers file, which in this case is deployed alongside the Cloud Function. It is used to parse a serialized message into its original Protobuf SensorDHT message type.
import logging
import os

import jwt
import sensors_pb2
from flask import make_response, jsonify
from flask_api import status
from google.cloud import pubsub_v1
from google.protobuf.json_format import MessageToJson

TOPIC = os.environ.get('TOPIC')
SECRET_KEY = os.getenv('SECRET_KEY')
ID = os.getenv('ID')
PASSWORD = os.getenv('PASSWORD')


def incoming_message(request):
    if not validate_token(request):
        return make_response(jsonify({'success': False}),
                             status.HTTP_401_UNAUTHORIZED,
                             {'ContentType': 'application/json'})

    data = request.get_data()
    if not data:
        return make_response(jsonify({'success': False}),
                             status.HTTP_400_BAD_REQUEST,
                             {'ContentType': 'application/json'})

    sensor_pb = sensors_pb2.SensorDHT()
    sensor_pb.ParseFromString(data)
    sensor_json = MessageToJson(sensor_pb)

    send_message(sensor_json)
    return make_response(jsonify({'success': True}),
                         status.HTTP_201_CREATED,
                         {'ContentType': 'application/json'})


def validate_token(request):
    auth_header = request.headers.get('Authorization')
    if not auth_header:
        return False
    auth_token = auth_header.split(" ")[1]
    if not auth_token:
        return False
    try:
        payload = jwt.decode(auth_token, SECRET_KEY)
        if payload['id'] == ID and payload['password'] == PASSWORD:
            return True
    except jwt.ExpiredSignatureError:
        return False
    except jwt.InvalidTokenError:
        return False


def send_message(message):
    publisher = pubsub_v1.PublisherClient()
    publisher.publish(topic=TOPIC, data=bytes(message, 'utf-8'))
Cloud Pub/Sub Functions
In addition to HTTP Functions, the demonstration uses a function triggered by Google Cloud Pub/Sub Triggers. According to Google, Cloud Functions can be triggered by messages published to Cloud Pub/Sub Topics in the same GCP project as the function. The function automatically subscribes to the Topic. Below, we see that the function has automatically subscribed to the iot-data-demo Cloud Pub/Sub Topic.
Sending Telemetry to MongoDB Atlas
The common Cloud Function, triggered by messages published to Cloud Pub/Sub, then sends the messages to MongoDB Atlas. There is a minimal amount of cleanup required to re-format the Cloud Pub/Sub messages to BSON (binary JSON). Interestingly, according to bsonspec.org, BSON can be compared to binary interchange formats, like Protocol Buffers. BSON is more schema-less than Protocol Buffers, which can give it an advantage in flexibility but also a slight disadvantage in space efficiency (BSON has overhead for field names within the serialized data).
The function uses PyMongo to connect to MongoDB Atlas. According to their website, PyMongo is a Python distribution containing tools for working with MongoDB and is the recommended way to work with MongoDB from Python.
import base64
import json
import logging
import os

import pymongo

MONGODB_CONN = os.environ.get('MONGODB_CONN')
MONGODB_DB = os.environ.get('MONGODB_DB')
MONGODB_COL = os.environ.get('MONGODB_COL')


def read_message(event, context):
    message = base64.b64decode(event['data']).decode('utf-8')
    message = message.replace("'", '"')
    message = message.replace('True', 'true')
    message = json.loads(message)

    client = pymongo.MongoClient(MONGODB_CONN)
    db = client[MONGODB_DB]
    col = db[MONGODB_COL]
    col.insert_one(message)
The function responds to the published events and sends the messages to the MongoDB Atlas cluster, running in the same Region, us-central1, as the Cloud Functions and Pub/Sub Topic. Below, we see the current options available when provisioning an Atlas cluster.
MongoDB Atlas provides a rich, web-based UI for managing and monitoring MongoDB clusters, databases, collections, security, and performance.
Although Cloud Pub/Sub to Atlas function execution times are longer in duration than the HTTP functions, the latency is greatly reduced by locating the Cloud Pub/Sub Topic, Cloud Functions, and MongoDB Atlas cluster into the same GCP Region. Cross-region execution times were as high as 500-600 ms, while same-region execution times averaged 200-225 ms. Selecting a more performant Atlas cluster would likely result in even lower function execution times.
Aggregating Data with MongoDB Compass
MongoDB Compass is a free, convenient, desktop application for interacting with your MongoDB databases. You can view the collected sensor data, review message (document) schema, manage indexes, and build complex MongoDB aggregations.
When performing analytics or machine learning, I primarily use MongoDB Compass to preview the captured telemetry data and build aggregation pipelines. Aggregation operations process data records and return computed results. This feature saves a ton of time, filtering and preparing data for further analysis, visualization, and machine learning with Jupyter Notebooks.
Aggregation pipelines can be directly exported to Java, Node, C#, and Python 3. The exported aggregation pipeline code can be placed directly into your Python applications and Jupyter Notebooks.
Below, the exported aggregation pipeline code from MongoDB Compass is used to load a resultset directly into a Pandas DataFrame. This particular aggregation returns time-series DHT sensor data from a specific IoT device over a 72-hour period.
DEVICE_1 = 'rp59adf374'

pipeline = [
    {
        '$match': {
            'type': 'DHT22',
            'device': DEVICE_1,
            'timestamp': {'$gt': 1557619200, '$lt': 1557792000}
        }
    },
    {
        '$project': {
            '_id': 0,
            'timestamp': 1,
            'temperature': '$data.temperature',
            'humidity': '$data.humidity'
        }
    },
    {
        '$sort': {'timestamp': 1}
    }
]

aggResult = iot_data.aggregate(pipeline)
df1 = pd.DataFrame(list(aggResult))
MongoDB Atlas Performance
In this demonstration, from Python3-based Jupyter Notebooks, I was able to consistently query a MongoDB Atlas collection of almost 70k documents for resultsets containing 3 days (72 hours) worth of digital temperature and humidity data, roughly 10.2k documents, in an average of 825 ms. That is round trip from my local development laptop to MongoDB Atlas running on GCP, in a different geographic region.
Query times on GCP are much faster, such as when running a Notebook in JupyterLab on Google’s AI Platform, or a PySpark job with Cloud Dataproc, against Atlas. Running the same Jupyter Notebook directly on Google’s AI Platform, the same MongoDB Atlas query took an average of 450 ms versus 825 ms (1.83x faster). This was across two different GCP Regions; same Region times should be even faster.
GCP Observability
There are several choices for observing the system’s Google Cloud Functions, Google Cloud Pub/Sub, and MongoDB Atlas. As shown above, the GCP Cloud Functions interface lets you see the individual function executions, execution times, memory usage, and active instances, over varying time intervals.
For a more detailed view of Google Cloud Functions and Google Cloud Pub/Sub, I built two custom dashboards using Stackdriver. According to Google, Stackdriver aggregates metrics, logs, and events from infrastructure, giving developers and operators a rich set of observable signals. I built a custom Stackdriver Cloud Functions dashboard (shown below) and a Cloud Pub/Sub Topics and Subscriptions dashboard.
For functions, I chose to display execution times, memory usage, the number of executions, and network egress, all in a single pane of glass, using four graphs. Below, I am using the 95th percentile average for monitoring. The 95th percentile asserts that 95% of the time, the observed values are below this amount and the remaining 5% of the time, the observed values are above that amount.
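As a simple illustration of the statistic itself, here is how a 95th percentile might be computed over a set of sampled execution times; the samples below are hypothetical.

import numpy as np

execution_times_ms = [9, 11, 12, 14, 15, 18, 22, 240]  # hypothetical samples
p95 = np.percentile(execution_times_ms, 95)
print(f'95% of executions completed in under {p95:.0f} ms')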
Data Analysis using Jupyter Notebooks
According to jupyter.org, the Jupyter Notebook is an open-source web application that allows you to create and share documents that contain live code, equations, visualizations, and narrative text. Uses include data cleaning and transformation, numerical simulation, statistical modeling, data visualization, machine learning, and much more. The widespread use of Jupyter Notebooks has grown significantly, as Big Data, AI, and ML have all experienced explosive growth.
PyCharm
JetBrains PyCharm, my favorite Python IDE, has direct integrations with Jupyter Notebooks. In fact, PyCharm’s most recent updates to the Professional Edition greatly enhanced those integrations. PyCharm offers round-trip editing in the IDE and the Jupyter Notebook web browser interface. PyCharm allows you to run and debug individual cells within the notebook. PyCharm automatically starts the Jupyter Server and appropriate kernel for the Notebook you have opened. And, one of my favorite features, PyCharm’s variable viewer tracks the current value of a variable, automatically.
Below, we see the example Analytics Notebook, included in the demonstration’s project, displayed in PyCharm 19.1.2 (Professional Edition). To effectively work with Notebooks in PyCharm really requires a full-size monitor. Working on a laptop with PyCharm’s crowded Notebook UI is workable, but certainly not as effective as on a larger monitor.
Jupyter Notebook Server
Below, we see the same Analytics Notebook, shown above in PyCharm, opened in Jupyter Notebook Server’s web-based client interface, running locally on the development workstation. The web browser-based interface also offers a rich set of features for Notebook development.
From within the Notebook, we are able to query the data from MongoDB Atlas, again using PyMongo, and load the resultsets into Pandas DataFrames. As an alternative to hard-coded values and environment variables, with Notebooks, I use the python-dotenv Python package. This package allows me to place my environment variables in a common .env file and reference them from any Notebook. The package has many options for managing environment variables.
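Below is a minimal sketch of the pattern, assuming the .env file contains the same MONGODB_CONN variable used by the Cloud Functions; the database and collection names are placeholders.

import os

import pymongo
from dotenv import load_dotenv  # pip install python-dotenv

load_dotenv()  # reads key=value pairs from a local .env file into os.environ

client = pymongo.MongoClient(os.getenv('MONGODB_CONN'))
iot_data = client['iot-demo']['iot-data']  # hypothetical database/collection names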
We can then analyze the data using a number of common frameworks, including Pandas, Matplotlib, SciPy, PySpark, and NumPy, to name but a few. Below, we see time series data from four different sensors, on the same IoT device. Viewing the data together, we can study the causal effect of one environmental variable on another, such as the impact of light on temperature or humidity.
Below, we can use histograms to visualize temperature frequencies for intervals, over time, for a given device location.
Machine Learning using Jupyter Notebooks
In addition to data analytics, we can use Jupyter Notebooks with tools such as scikit-learn to build machine learning models based on our sensor telemetry. Scikit-learn is a set of machine learning tools in Python, built on NumPy, SciPy, and matplotlib. Below, I have used JupyterLab on Google’s AI Platform and scikit-learn to build several models, based on the sensor data.
Using scikit-learn, we can build models to predict such things as which IoT device generated a specific temperature and humidity reading, or the temperature and humidity, given the time of day, device location, and external environment variables, or discover anomalies in the sensor telemetry.
Scikit-learn makes it easy to construct randomized training and test datasets, to build models, using data from multiple IoT devices, as shown below.
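Below is a simplified sketch of that workflow, using a Random Forest Classifier to predict which device produced a reading; the telemetry values are made-up placeholders, not the captured dataset.

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# placeholder telemetry from two of the demonstration's devices
df = pd.DataFrame({
    'temperature': [17.1, 18.4, 22.3, 21.9, 17.3, 22.1, 17.8, 21.5],
    'humidity': [68.1, 64.2, 41.0, 43.5, 67.4, 40.2, 66.0, 42.8],
    'device': ['rp829c7e0e', 'rp829c7e0e', 'rp59adf374', 'rp59adf374',
               'rp829c7e0e', 'rp59adf374', 'rp829c7e0e', 'rp59adf374'],
})

X = df[['temperature', 'humidity']]
y = df['device']  # predict which device produced a reading

# randomized training and test datasets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
print(accuracy_score(y_test, model.predict(X_test)))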
The project includes a Jupyter Notebook that demonstrates how to build several ML models using sensor data. Examples of supervised learning algorithms used to build the classification models in this demonstration include Support Vector Machine (SVM), k-nearest neighbors (k-NN), and Random Forest Classifier.
Having data from multiple sensors, we are able to enrich the ML models by adding additional categorical (discrete) features to our training data. For example, we could look at the effect of light, motion, and time of day on temperature and humidity.
Conclusion
Hopefully, this post has demonstrated how to efficiently collect telemetry data from IoT devices using Google Protocol Buffers over HTTPS, serverless Google Cloud Functions, Cloud Pub/Sub, and MongoDB Atlas, all on the Google Cloud Platform. Once captured, the telemetry data was easily aggregated and analyzed using common tools, such as MongoDB Compass and Jupyter Notebooks. Further, we used the data and tools to build machine learning models for prediction and anomaly detection.
All opinions expressed in this post are my own and not necessarily the views of my current or past employers or their clients.
Image: everythingpossible © 123RF.com