Apache Solr: Because your Database is not a Search Engine
Posted by Gary A. Stafford in Python, Software Development on February 25, 2019
In this post, we will examine what sets Apache Solr, a search engine, apart from conventional databases like MongoDB. We will explore the similarities and differences between Solr and MongoDB by analyzing a series of comparative queries. We will then delve into some of Solr’s more advanced search capabilities.
Why Search?
The ability to search for information is a basic requirement of many applications. Architects and developers who limit themselves to traditional databases often attempt to meet search requirements with unnecessarily complex SQL query-based solutions. They force end-users to search in unnatural or highly structured ways, or they provide results that lack a sense of relevancy. End-users are not database administrators; they do not understand the nuances of SQL. They simply want relevant responses to their inquiries.
In a scenario where data consumers are arbitrarily searching for relevant information within a distinct domain, implementing a search-optimized, Lucene-based platform, such as Elasticsearch or Apache Solr, for reads, is often an effective solution.
Separating database reads from writes is not uncommon. I’ve worked on many projects where the requirements suggested an architecture in which one type of data storage technology should be implemented to optimize for writes, while a different type or types of data storage technologies should be implemented to optimize for reads. Architectures in which this is common include the following.
- CQRS (Command Query Responsibility Segregation) and Event Sourcing;
- Reporting, Data Analytics, and Big Data;
- ML (Machine Learning) and AI (Artificial Intelligence);
- Real-Time and Streaming Data (such as IoT);
- Search: Content, Document, Knowledge;
In this post, we will examine the search capabilities of Apache Solr. We will compare and contrast Solr’s search capabilities to those of MongoDB, the leading NoSQL database. We will consider the differences between querying for data and searching for information.
Apache Lucene
According to Apache, the Apache Lucene project develops open-source search software, including the following sub-projects: Lucene Core, Solr, and PyLucene. The Lucene Core sub-project provides Java-based indexing and search technology, as well as spellchecking, hit highlighting, and advanced analysis/tokenization capabilities.
Apache Lucene 7.7.0 and Apache Solr 7.7.0 were just released in February 2019 and used for all the post’s examples.
Apache Solr
According to Apache, Apache Solr is the popular, blazing fast, open source, enterprise search platform built on Apache Lucene. Solr powers the search and navigation features of many of the world’s largest internet sites.
Apache Solr includes the ability to set up a cluster of Solr servers that combines fault tolerance and high availability. Referred to as SolrCloud, and backed by Apache Zookeeper, these capabilities provide distributed, sharded, and replicated indexing and search capabilities.
According to Wikipedia, Solr was created at CNET Networks in 2004, donated to the Apache Software Foundation in 2006, and graduated from their incubator in 2007. Solr version 1.3 was released in 2008. In 2010, the Lucene and Solr projects merged; Solr became a Lucene subproject. With Apache Solr 7.7.0 just released, Solr has well over ten years of development and enterprise adoption behind it.
MongoDB
The leading NoSQL database, MongoDB, describes itself as a document database with the scalability and flexibility that you want with the querying and indexing that users need. Mongo features include ad hoc queries, indexing, and real-time aggregation, which provide powerful ways to access and analyze your data.
Released less than a year ago, MongoDB 4.0 added multi-document ACID transactions, data type conversions, non-blocking secondary replica reads, SHA-2 authentication, MongoDB Compass aggregation pipeline builder, Kubernetes integration, and the MongoDB Stitch serverless platform. MongoDB 4.0.6 was just released in February 2019 and used for all the post’s examples.
Comparing Search Features
Solr and MongoDB appear to have many search-related features in common.
- Both Solr and MongoDB are document-based data stores;
- Both Solr and MongoDB use a non-relational data model;
- Both feature advanced querying and indexing capabilities;
- Solr implements Lucene-based search capabilities; MongoDB has text-based search capabilities;
- Solr scores the relevance of search results using the Lucene scoring algorithm; MongoDB has the capability of ranking text search results using the $meta operator;
- Solr is able to selectively boost the relative importance of search fields and specific values in a field when calculating scores; MongoDB has the capability of boosting the relative importance of fields used in a text search using text indexes;
- Both Solr and MongoDB are capable of implementing stop words, stemming, and tokenization;
Demonstration
Source Code Examples
All examples shown in this post are available as a series of Python 3 scripts, contained in an open-source project on GitHub, searching-solr-vs-mongodb. The project contains the script, query_mongo.py, which uses the Python driver for MongoDB, pymongo, to execute all the MongoDB queries in this post. The project also contains the script, query_solr.py, which uses the lightweight Python wrapper for Apache Solr, pysolr, to execute all the Solr searches in this post. Both packages, along with the ancillary requests package, may be installed with pip. The bson.json_util module, also used by the scripts, ships as part of pymongo.

pip3 install pysolr pymongo requests
MongoDB and Solr Instances
To follow along, you will need your own MongoDB and Solr instances. Both are easily stood up locally with Docker, using the official MongoDB and Solr Docker Hub images. Example docker run commands are shown below.
The second command, the Solr command, also creates a new Solr core. The command also bind-mounts the ‘conf’ directory, within the local project, into the container. This gives us the ability to modify our index’s configuration and to store that configuration in source control. Using these particular commands, all data is ephemeral; neither container persists data outside the container.
docker run --name mongo -p 27017:27017 -d mongo:latest

docker run --name solr -d \
  -p 8983:8983 \
  -v $PWD/conf:/conf \
  solr:latest \
  solr-create -c movies -d /conf
Environment Variables
The source code expects two environment variables, which contain the connection information for MongoDB and Solr. You will need to replace the values below with your own connection strings if they are different than the examples below, used for Docker.
export SOLR_URL="http://localhost:8983/solr"
export MONGODB_CONN="mongodb://localhost:27017/movies"
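For reference, below is a minimal sketch of how the two Python clients might be created from these environment variables. The client variable names (solr, mongo_db) are my own and are reused in the sketches throughout this post; they are not necessarily identical to those in the project’s scripts.

import os

import pysolr
from pymongo import MongoClient

# Connection strings from the environment variables exported above
solr_url = os.environ['SOLR_URL']
mongodb_conn = os.environ['MONGODB_CONN']

# A pysolr client points at a specific core, the 'movies' core in our case
solr = pysolr.Solr(solr_url + '/movies', timeout=10)

# The 'movies' database is named in the MongoDB connection string
mongo_client = MongoClient(mongodb_conn)
mongo_db = mongo_client['movies']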
Importing Movies to MongoDB
For this post, we will be using a publicly available movie dataset from MongoDB. A copy of the dataset is available in the project, as well as on MongoDB’s website, Setup and Import the Data.
Assuming you have an instance of MongoDB accessible and have set the two environment variables above, import the specially-formatted JSON file for MongoDB, movieDetails_mongo.json, directly into the movies database’s movieDetails collection, using the following mongoimport command.
mongoimport \
  --uri $MONGODB_CONN \
  --collection "movieDetails" \
  --drop --file "data/movieDetails_mongo.json"
Below is a view of the movies database’s movieDetails collection, running in the Docker container, as shown in the MongoDB Compass application.
Indexing Movies to Solr
Assuming you have an instance of Solr accessible and have set the two environment variables above, import the contents of the JSON file, movieDetails.json, by running the Python script, solr_index_movies.py, using the following command.
python3 ./solr_index_movies.py
The command executes a series of HTTP calls to Solr’s exposed RESTful API.
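The exact calls are in the script, but as a rough sketch, posting the documents to Solr’s update handler might look like the following (the data file path is the one from the project).

import json
import os

import requests

solr_url = os.environ['SOLR_URL']

# Load the movie documents and post them to the movies core's update
# handler; commit=true makes the documents immediately searchable
with open('data/movieDetails.json') as file:
    movies = json.load(file)

response = requests.post(
    '{}/movies/update?commit=true'.format(solr_url),
    headers={'Content-Type': 'application/json'},
    data=json.dumps(movies))
response.raise_for_status()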
Below is a view of the Solr Administration User Interface, running within the Docker container, and showing the new movies core. After running the script, we should have 2,250 movie documents indexed.
The Solr Admin UI offers a number of useful tools for examining indexes, reviewing schemas and field types, and creating, analyzing, and debugging Solr queries. Below we see the Query UI with the results of a query displayed.
Tuning the Solr Index
The movies index uses a default schema, which was created when the movie documents were indexed. To optimize our query results, we will want to make a few adjustments to the default movies schema. First, we want to ensure that our Solr searches consider the pluralization of words. For example, when we search for the term ‘Adventure’, we want Solr to also return documents containing terms like adventure, adventures, adventure’s, adventuring, and adventurer, but not misadventure. This is known as Stemming, or reducing words to their word stem. An example is shown below in the Solr Analysis UI.
The fields that we want to search, such as title, plot, and genres, were all indexed by default as the ‘text_general’ Solr field type. The ‘text_general’ field type does not implement stemming when indexing or querying. We need to switch the title, plot, and genres fields to the ‘text_en’ (English text) field type. The ‘text_en’ field type implements multiple indexing and querying filters, including the PorterStemFilterFactory filter, which removes common endings from words. Similar filters include the English Minimal Stem Filter and the English Possessive Filter.
Additionally, the MultiValued field property is set to true by default for these fields in Solr. Since the title and plot fields, amongst others, were only intended to hold a single text value, as opposed to an array of values, we will switch the MultiValued field property to false. This helps with sorting and filtering, and the correct deserialization of documents.
The solr_index_movies.py script will change the title, plot, and genres fields from ‘text_general’ to ‘text_en’ and change the title and plot fields from multi-valued to single-valued. Since we have changed the index’s schema, the script will re-import all the documents after making the schema changes.
To get a better sense of what the schema changes look like, below is the equivalent cURL command, which illustrates the field-level modifications we are making.
curl -X POST \
  "${SOLR_URL}/movies/schema" \
  -H 'Content-Type: application/json' \
  -d '{
    "replace-field":{
      "name":"title",
      "type":"text_en",
      "multiValued":false
    },
    "replace-field":{
      "name":"plot",
      "type":"text_en",
      "multiValued":false
    },
    "replace-field":{
      "name":"genres",
      "type":"text_en",
      "multiValued":true
    }
  }'
You can use the Schema UI to view the results, as shown below. Note the new field types for title, plot, and genres. Also, note the index and query analyzers, including the PorterStemFilterFactory, used by the ‘text_en’ field type.
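If you prefer to verify the changes programmatically, the same Schema API used above can also be read back with a simple GET request. A minimal sketch:

import os

import requests

solr_url = os.environ['SOLR_URL']

# Read back each modified field definition from the Schema API
for field in ['title', 'plot', 'genres']:
    response = requests.get('{}/movies/schema/fields/{}'.format(solr_url, field))
    print(response.json()['field'])
# e.g., {'name': 'title', 'type': 'text_en', 'multiValued': False}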
Comparative Queries
To demonstrate the similarities and the differences between Solr and MongoDB, we will examine a series of comparative queries, followed by a series of Solr-only searches. Again, all queries and output shown are included in the project’s two Python scripts.
Query 1a: All Documents
To start, we will perform a simple query for all the movie documents in the MongoDB collection, followed by the Solr index. With MongoDB, we use the find method. With Solr, we will use the Standard Query Parser, commonly known as the ‘lucene’ query parser, and the q (query) parameter. The result of the queries should be identical, with all 2,250 documents returned.
MongoDB:
Parameters
----------
query: {}

Results
----------
document count: 2250
Solr:
Parameters
----------
q: *:*
kwargs: {}

Results
----------
document count: 2250
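For context, here is a sketch of the two equivalent calls in Python, assuming the solr and mongo_db clients created earlier. A pysolr search returns a Results object whose hits attribute holds the total match count.

# Count all documents in the MongoDB collection
mongo_count = mongo_db.movieDetails.count_documents({})

# Match all documents in the Solr index
solr_results = solr.search('*:*')

print(mongo_count, solr_results.hits)  # 2250 2250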
Query 1b: Count Only
We can alter our first query to limit our response to only the count of documents for a given query in MongoDB; no documents will be returned. Since our query is empty, we will get back a count of all documents in the MongoDB database’s collection.
db.movieDetails.count()
Similarly, in Solr, we can set the rows parameter to zero to return only the document count. For brevity, we can also omit the Solr response header using the omitHeader parameter.
Parameters
----------
q: *:*
kwargs: {
  'omitHeader': 'true',
  'rows': '0'
}

Results
----------
document count: 2250
Query 2: Exact Search
Next, we will perform a query for the exact movie title, ‘Star Wars: Episode V – The Empire Strikes Back’ in MongoDB, then Solr. Again, the results of the queries should be identical, with one document returned, matching the title.
MongoDB:
Parameters
----------
query: {'title': 'Star Wars: Episode V - The Empire Strikes Back'}
projection: {'_id': 0, 'title': 1}

Results
----------
document count: 1

{'title': 'Star Wars: Episode V - The Empire Strikes Back'}
The quotes around the title are key for Solr to view the query as a single phrase as opposed to a series of search terms.
Solr:
Parameters
----------
q: title:"Star Wars: Episode V - The Empire Strikes Back"
kwargs: {
  'defType': 'lucene',
  'fl': 'title, score'
}

Results
----------
document count: 1

{'title': 'Star Wars: Episode V - The Empire Strikes Back', 'score': 29.41}
Note the use of 'defType': 'lucene' is optional. The standard Lucene query parser is the default parser used by Solr. I am merely showing this parameter to improve the reader’s understanding. Later, we will use other query parsers.
Query 3: Search Phrase
Next, we will perform a query for the phrase ‘star wars’. With MongoDB, we will use the $regex and $options Evaluation Query Operators. The results from both the MongoDB and Solr queries should be identical: the six Star Wars movies are returned.
MongoDB:
Parameters
----------
query: {
  'title': {'$regex': '\\bstar wars\\b', '$options': 'i'}
}
projection: {'_id': 0, 'title': 1}

Results
----------
document count: 6

{'title': 'Star Wars: Episode I - The Phantom Menace'}
{'title': 'Star Wars: Episode II - Attack of the Clones'}
{'title': 'Star Wars: Episode III - Revenge of the Sith'}
{'title': 'Star Wars: Episode IV - A New Hope'}
{'title': 'Star Wars: Episode V - The Empire Strikes Back'}
{'title': 'Star Wars: Episode VI - Return of the Jedi'}
With Solr, wrapping the phrase ‘star wars’ in quotes ensures Solr will treat the query string as an exact phrase, not individual search terms. Solr results are scored, but scores are almost all identical since all six movies contain the exact phrase.
Solr:
Parameters
----------
q: "star wars"
kwargs: {
  'defType': 'lucene',
  'df': 'title',
  'fl': 'title, score'
}

Results
----------
document count: 6

{'title': 'Star Wars: Episode VI - Return of the Jedi', 'score': 8.21}
{'title': 'Star Wars: Episode II - Attack of the Clones', 'score': 8.21}
{'title': 'Star Wars: Episode IV - A New Hope', 'score': 8.21}
{'title': 'Star Wars: Episode I - The Phantom Menace', 'score': 8.21}
{'title': 'Star Wars: Episode III - Revenge of the Sith', 'score': 8.21}
{'title': 'Star Wars: Episode V - The Empire Strikes Back', 'score': 7.55}
Here is the actual Lucene query (q) Solr will run.
title:"star war"
Query 4: Search Terms
Next, we will perform a query for all movies whose title contains either the search terms ‘star’ or ‘wars’, as opposed to the phrase, ‘star wars’. The Solr web console has a very powerful Analysis tool. Using the Analysis tool, we can examine how each filter (abbreviated in the far left column, below), associated with a particular field type, will impact the matching capabilities of Solr. To use the Analysis tool, place your search term(s) or phrase on the right side, an indexed field value on the left, and choose a field or field type from the dropdown.
Below, we see how the search terms ‘Star’ and ‘Wars’ (shown below right) would match a series of variations on the two words (shown below left) if the fields being searched are of the field type ‘text_en’, as discussed earlier. For example, a query for ‘Star’ would match ‘star’, ‘stars’, ‘star’s’, ‘starring’, ‘starred’, and ‘star-shaped’, but not ‘shaped’, ‘superstars’, or ‘started’.
Below, we see similar results for the search term, ‘Wars’.
MongoDB Text Search
To accomplish the query with MongoDB, we will use MongoDB’s $text Evaluation Query Operator. The $text operator is able to perform a text search across multiple fields indexed with a text index. MongoDB’s text indexes support text search queries on string content. The text indexes can include any field whose value is a string or an array of string elements, such as the movieDetails collection’s genres field. Although not as powerful as Solr’s search capabilities, MongoDB’s text search may address many basic search requirements without the need to augment the architecture with a search engine.
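For illustration, a minimal pymongo $text query with relevance scoring might look like the sketch below. It assumes a text index such as the title text index created in the next step, and the mongo_db client from earlier.

# $text search, projecting and sorting on the computed text score
cursor = mongo_db.movieDetails.find(
    {'$text': {'$search': 'star wars', '$caseSensitive': False}},
    {'score': {'$meta': 'textScore'}, '_id': 0, 'title': 1}
).sort([('score', {'$meta': 'textScore'})])

for movie in cursor:
    print(movie)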
For our next query, we will rely on a text index on the title field. When the Python script runs, it creates the following three indexes on the collection, including the title text index.
db.movieDetails.createIndex({ title: 1 })
db.movieDetails.createIndex({ countries: 1 })
db.movieDetails.createIndex({ title: 'text' })
With the text index in place, the result of the queries should be identical, with 18 documents returned. Both the MongoDB and Solr result sets are scored; however, they are scored differently, using different algorithms.
MongoDB:
Parameters
----------
query: {
  '$text': {'$search': 'star wars', '$language': 'en', '$caseSensitive': False},
  'countries': 'USA'
}
projection: {'score': {'$meta': 'textScore'}, '_id': 0, 'title': 1}
sort: [('score', {'$meta': 'textScore'})]

Results
----------
document count: 18

{'title': 'Star Wars: Episode I - The Phantom Menace', 'score': 1.2}
{'title': 'Star Wars: Episode IV - A New Hope', 'score': 1.17}
{'title': 'Star Wars: Episode VI - Return of the Jedi', 'score': 1.17}
{'title': 'Star Wars: Episode II - Attack of the Clones', 'score': 1.17}
{'title': 'Star Wars: Episode III - Revenge of the Sith', 'score': 1.17}
Solr:
Parameters
----------
q: star wars
kwargs: {
  'defType': 'lucene',
  'fq': 'countries: USA',
  'df': 'title',
  'fl': 'title, score',
  'rows': '5'
}

Results
----------
document count: 18

{'title': 'Star Wars: Episode VI - Return of the Jedi', 'score': 8.21}
{'title': 'Star Wars: Episode II - Attack of the Clones', 'score': 8.21}
{'title': 'Star Wars: Episode IV - A New Hope', 'score': 8.21}
{'title': 'Star Wars: Episode I - The Phantom Menace', 'score': 8.21}
{'title': 'Star Wars: Episode III - Revenge of the Sith', 'score': 8.21}
Here is the actual Lucene query (q) Solr will run. The countries filter is applied afterward.
title:star title:war
Find me a Good Western
Query 5a: Multiple Search Terms
Next, we will perform a query for movies, produced in the USA, with the search terms ‘western’, ‘action’, or ‘adventure’ in the movie genres field. The genres field may hold multiple genre values. Although this is a simple query, we can start to see the advantages of Solr’s Lucene scoring capability to provide a way to measure the relevancy of individual results.
Even limited to USA-produced movies, this genres query returns a large number of results, 244 documents. With MongoDB, we have no sense of which documents are more relevant than others. Compared to the Solr results, MongoDB matched a few of the same movies in its top five results, but not the most relevant ones, based on matching all or most of the genres.
MongoDB:
Parameters
----------
query: {
  'genres': {'$in': ['Adventure', 'Action', 'Western']},
  'countries': 'USA'
}
projection: {'_id': 0, 'genres': 1, 'title': 1}

Results
----------
document count: 244

{'title': 'Wild Wild West', 'genres': ['Action', 'Western', 'Comedy']}
{'title': 'A Million Ways to Die in the West', 'genres': ['Comedy', 'Western']}
{'title': 'An American Tail: Fievel Goes West', 'genres': ['Animation', 'Adventure', 'Family']}
{'title': 'Once Upon a Time in the West', 'genres': ['Western']}
{'title': 'How the West Was Won', 'genres': ['Western']}
However, with Solr’s scoring, we see that the first (top) result, ‘The Wild Bunch’, has a score of 7.18. Its genres contain exactly the terms ‘western’, ‘action’, and ‘adventure’. The last (bottom) result, ‘S.S. Doomtrooper’, has a score of 1.47. The most relevant result scored nearly five times higher (488%) than the least relevant result. If you were searching for a western action adventure movie, it is pretty apparent the top Solr result, ‘The Wild Bunch’, is a much better choice than the bottom result, ‘S.S. Doomtrooper’. In fact, as shown below, all five top-scoring Solr results look pretty promising based on their scores, genres, and titles.
Solr:
Parameters
----------
q: adventure action western
kwargs: {
  'defType': 'lucene',
  'fq': 'countries: USA',
  'df': 'genres',
  'fl': 'title, genres, score',
  'rows': '5'
}

Results
----------
document count: 244

{'title': 'The Wild Bunch', 'genres': ['Action', 'Adventure', 'Western'], 'score': 7.18}
{'title': 'Crossfire Trail', 'genres': ['Action', 'Western'], 'score': 6.26}
{'title': 'The Big Trail', 'genres': ['Adventure', 'Western', 'Romance'], 'score': 5.46}
{'title': 'Once Upon a Time in the West', 'genres': ['Western'], 'score': 5.26}
{'title': 'How the West Was Won', 'genres': ['Western'], 'score': 5.26}
Here is the actual Lucene query (q) Solr will run. The countries filter is applied afterward.
genres:adventur genres:action genres:western
Query 5b: Required Search Term
There are nearly endless options that can be used with Solr to influence its results. For example, we could perform the same Solr query above, but this time require that the word ‘western’ is in the genres field, by using the plus symbol (+) boolean operator. The top five results and scores are the same, but the total number of relevant results has decreased from 244 to just 24. That means 220 of the previous results contained ‘action’ and/or ‘adventure’, but not ‘western’. The opposite is also true: using the minus symbol (-) boolean operator will ensure the results do not contain a particular word or phrase.
Solr:
Parameters
----------
q: adventure action +western
kwargs: {
  'defType': 'lucene',
  'fq': 'countries: USA',
  'df': 'genres',
  'fl': 'title, genres, score',
  'rows': '5'
}

Results
----------
document count: 24

{'title': 'The Wild Bunch', 'genres': ['Action', 'Adventure', 'Western'], 'score': 7.18}
{'title': 'Crossfire Trail', 'genres': ['Action', 'Western'], 'score': 6.26}
{'title': 'The Big Trail', 'genres': ['Adventure', 'Western', 'Romance'], 'score': 5.46}
{'title': 'Once Upon a Time in the West', 'genres': ['Western'], 'score': 5.26}
{'title': 'How the West Was Won', 'genres': ['Western'], 'score': 5.26}
Here is the actual Lucene query (q) Solr will run. The countries filter is applied afterward.
(genres:adventur genres:action) +genres:western
Query 6a: eDisMax Query
For our next query, we will compare Solr’s eDisMax query parser to MongoDB’s text search capabilities.
Solr Extended DisMax
According to Solr, The DisMax query parser is designed to process simple phrases and to search for individual terms across several fields using different weighting (boosts) based on the significance of each field. Additional options enable users to influence the score based on rules specific to each use case (independent of user input). Solr’s Extended DisMax (eDisMax) query parser is an improved version of the DisMax query parser.
In my opinion, in addition to the Lucene-based scoring, the ability to easily search across multiple fields and selectively boost results with the DisMax and eDisMax query parsers is what starts to differentiate querying data in a database, from searching for relevant results with a search engine.
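Mechanically, switching parsers is just a parameter change. Here is a sketch of an eDisMax, multi-field search with pysolr, using the same parameters as the comparative query shown further below and the solr client created earlier.

# eDisMax search across the plot, title, and genres fields
results = solr.search('western action adventure', **{
    'defType': 'edismax',
    'fq': 'countries: USA',
    'qf': 'plot title genres',
    'fl': 'title, genres, score',
    'rows': '5',
})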
Multi-Field Text Index
For our next query, the Python script will drop the previous MongoDB text index on the title field and create a new compound text index, which will incorporate the title, plot, and genres fields.
db.movieDetails.dropIndex('title_text')
db.movieDetails.createIndex({
  title: 'text',
  plot: 'text',
  genres: 'text'
})
Below, we see the new compound text index in the MongoDB Compass application’s Indexes tab.
We will perform a query for movies, produced in the USA, with the search terms ‘western’, ‘action’, or ‘adventure’ in the movie title, plot, or genres fields. The results of the queries should be identical, with 259 documents returned. Both the MongoDB and Solr result sets are scored, but again, the scores and ordering of results are not identical. The two queries matched six of the same movies in their top ten results.
MongoDB:
Parameters
----------
query: {
  '$text': {'$search': 'western action adventure', '$language': 'en', '$caseSensitive': False},
  'countries': 'USA'
}
projection: {'score': {'$meta': 'textScore'}, '_id': 0, 'title': 1}

Results
----------
document count: 259

{'title': 'Zathura: A Space Adventure', 'genres': ['Action', 'Adventure', 'Comedy'], 'score': 3.3}
{'title': 'The Extraordinary Adventures of Adèle Blanc-Sec', 'genres': ['Action', 'Adventure', 'Fantasy'], 'score': 3.24}
{'title': 'The Wild Bunch', 'genres': ['Action', 'Adventure', 'Western'], 'score': 3.2}
{'title': 'The Adventures of Tintin', 'genres': ['Animation', 'Action', 'Adventure'], 'score': 2.85}
{'title': 'Adventures in Babysitting', 'genres': ['Action', 'Adventure', 'Comedy'], 'score': 2.85}
Solr:
Parameters
----------
q: western action adventure
kwargs: {
  'defType': 'edismax',
  'fq': 'countries: USA',
  'qf': 'plot title genres',
  'fl': 'title, genres, score',
  'rows': '5'
}

Results
----------
document count: 259

{'title': 'The Secret Life of Walter Mitty', 'genres': ['Adventure', 'Comedy', 'Drama'], 'score': 7.67}
{'title': 'Western Union', 'genres': ['History', 'Western'], 'score': 7.39}
{'title': 'The Adventures of Tintin', 'genres': ['Animation', 'Action', 'Adventure'], 'score': 7.36}
{'title': 'Adventures in Babysitting', 'genres': ['Action', 'Adventure', 'Comedy'], 'score': 7.36}
{'title': 'The Poseidon Adventure', 'genres': ['Action', 'Adventure', 'Drama'], 'score': 7.36}
Here is the actual Lucene query Solr will run.
+(
(title:adventur | plot:adventur | genres:adventur)
(title:action | plot:action | genres:action)
(title:western | plot:western | genres:western)
)
Query 6b: Boosting Fields
If you really wanted a ‘western action adventure’ movie, as opposed to a ‘western’, ‘action’, or ‘adventure’ movie, then neither Solr nor MongoDB probably completely satisfied you with its first five search results. Boosting or weighting fields can often provide more relevant search results, if the correct fields are boosted and the amount of positive or negative boost is appropriate.
MongoDB’s text indexes also allow for weighting individual fields. The weight of an indexed field denotes the significance of the field relative to the other indexed fields and directly impacts the text search score. Weighting fields is the equivalent of boosting fields with Solr. Below, we see a modification applied to our previous text index, in which the title field is given twice the weight of the plot field (1.0 is the default) and the genres field is given twice the weight of the title field. The Python script also applies this index for you.
db.movieDetails.dropIndex('genres_text_title_text_plot_text')
db.movieDetails.createIndex({
  title: 'text',
  plot: 'text',
  genres: 'text'
}, {
  weights: {
    genres: 4,
    title: 2
  }
})
Likewise, Solr is also capable of boosting fields for both the DisMax and eDisMax query parsers. For our next query, we will repeat the previous query, but boost fields in the eDisMax qf (Query Fields) parameter to match the boost in the MongoDB weighted text index, shown above.
The results of the queries should be identical, with 259 documents returned. MongoDB’s and Solr’s results are scored and ordered differently. However, compared to the previous, un-weighted/boosted MongoDB and Solr query results above, the relative scores are higher, the order of movies returned is different, and most importantly, the Solr results seem more relevant to the original search intent.
For example, with Solr, take the movie ‘The Secret Life of Walter Mitty’, which previously scored highest at 7.67. In the boosted search results, the movie ‘The Wild Bunch’ is now ranked first, with a score of 28.71. ‘The Secret Life of Walter Mitty’ is no longer even in the top 50 Solr results. Comparatively, other movies, like ‘The Adventures of Tintin’ and ‘Adventures in Babysitting’, barely moved in position even though their scores changed proportionally.
MongoDB:
Parameters
----------
query: {
  '$text': {'$search': 'western action adventure', '$language': 'en', '$caseSensitive': False},
  'countries': 'USA'
}
projection: {'score': {'$meta': 'textScore'}, '_id': 0, 'title': 1}

Results
----------
document count: 259

{'title': 'The Wild Bunch', 'genres': ['Action', 'Adventure', 'Western'], 'score': 12.8}
{'title': 'Zathura: A Space Adventure', 'genres': ['Action', 'Adventure', 'Comedy'], 'score': 10.27}
{'title': 'The Extraordinary Adventures of Adèle Blanc-Sec', 'genres': ['Action', 'Adventure', 'Fantasy'], 'score': 10.14}
{'title': 'The Adventures of Tintin', 'genres': ['Animation', 'Action', 'Adventure'], 'score': 9.9}
{'title': 'Adventures in Babysitting', 'genres': ['Action', 'Adventure', 'Comedy'], 'score': 9.9}
Solr:
Parameters
----------
q: western action adventure
kwargs: {
  'defType': 'edismax',
  'fq': 'countries: USA',
  'qf': 'plot title^2.0 genres^4.0',
  'fl': 'title, genres, score',
  'rows': '5'
}

Results
----------
document count: 259

{'title': 'The Wild Bunch', 'genres': ['Action', 'Adventure', 'Western'], 'score': 28.71}
{'title': 'Crossfire Trail', 'genres': ['Action', 'Western'], 'score': 25.05}
{'title': 'The Big Trail', 'genres': ['Adventure', 'Western', 'Romance'], 'score': 21.84}
{'title': 'Once Upon a Time in the West', 'genres': ['Western'], 'score': 21.05}
{'title': 'How the West Was Won', 'genres': ['Western'], 'score': 21.05}
Here is the actual Lucene query Solr will run.
+(
((title:adventur)^2.0 | plot:adventur | (genres:adventur)^4.0)
((title:action)^2.0 | plot:action | (genres:action)^4.0)
((title:western)^2.0 | plot:western | (genres:western)^4.0)
)
Query 6c: eDisMax Boosted with Required/Prohibited Terms
We can use both the plus (+) and minus (-) boolean operators to obtain more relevant search results. Let’s repeat the last Solr boosted query, but this time also require any results to contain the search term ‘western’, and prohibit the results from containing the search term ‘romance’. I would consider the new search results, based on these modifications to the Solr query, to be more relevant to the original intent of the search than the previous results. For example, the movie ‘The Big Trail’, a romantic western adventure movie, according to its genres, is no longer included in the results set.
Solr:
Parameters
----------
q: adventure action +western -romance
kwargs: {
  'defType': 'edismax',
  'fq': 'countries: USA',
  'qf': 'plot title^2.0 genres^4.0',
  'fl': 'title, genres, score',
  'rows': '5'
}

Results
----------
document count: 25

{'title': 'The Wild Bunch', 'genres': ['Action', 'Adventure', 'Western'], 'score': 28.71}
{'title': 'Crossfire Trail', 'genres': ['Action', 'Western'], 'score': 25.05}
{'title': 'Once Upon a Time in the West', 'genres': ['Western'], 'score': 21.05}
{'title': 'How the West Was Won', 'genres': ['Western'], 'score': 21.05}
{'title': 'Cowboy', 'genres': ['Western'], 'score': 21.05}
Here is the actual Lucene query Solr will run.
+(
((title:adventur)^2.0 | plot:adventur | (genres:adventur)^4.0)
((title:action)^2.0 | plot:action | (genres:action)^4.0)
+((title:western)^2.0 | plot:western | (genres:western)^4.0)
-((title:romanc)^2.0 | plot:romanc | (genres:romanc)^4.0)
)
Query 7a: The Movie Dilemma
Frequently, end-users interact with a search engine, such as Google, through a search box. We type something into a search box and get back a list of relevant results. By now, most of us have learned how to phrase our Google search to get optimal results. However, the reality is, people can and will type just about anything into a search box.
To try to improve the average search results for end-users, search engineers will often tune query parameters, such as boosting the importance of certain search fields over others, adjusting fuzzy search parameters, or ignoring irrelevant words in the search phrases by adding them to the stop words. Solr’s default English stop words include: ‘a’, ‘an’, ‘and’, ‘are’, ‘as’, ‘at’, ‘be’, ‘but’, ‘by’, ‘for’, and so forth.
Take, for example, the word ‘movie’. Someone searching for a movie to watch, using a search box, might enter the phrase ‘A cowboy movie’. The term, ‘A’, is ignored as a stop word. This leaves the search terms ‘cowboy’ and ‘movie’ to be searched for in the title, plot, and genres fields. As we see by the top ten results below, most appear to be about cowboys, having the word ‘cowboy’ in their title or plot. Then there is the result, ‘TV: The Movie’. This does not appear to be a movie about cowboys. The word ‘cowboy’ is not in the title, plot, or genres, yet here it is in third place since it contains the word ‘movie’ in the title, plot, and/or genres.
Similarly, the top movie result, ‘Cowboy Bebop: The Movie’, is probably no more relevant than the second, third, or fourth place movies. However, ‘Cowboy Bebop: The Movie’ scored significantly higher than even the number two result (11.24 vs. 7.31). This is because the movie’s title contains both search terms, ‘cowboy’ and ‘movie’, even though the word ‘movie’ is irrelevant to the original intent of the search phrase.
Solr:
Parameters
----------
q: A cowboys movie
kwargs: {
  'defType': 'edismax',
  'fq': 'countries: USA',
  'qf': 'plot title genres',
  'fl': 'title, genres, score',
  'rows': '10'
}

Results
----------
document count: 23

{'title': 'Cowboy Bebop: The Movie', 'genres': ['Animation', 'Action', 'Crime'], 'score': 11.24}
{'title': 'Cowboy', 'genres': ['Western'], 'score': 7.31}
{'title': 'TV: The Movie', 'genres': ['Comedy'], 'score': 6.42}
{'title': 'Space Cowboys', 'genres': ['Action', 'Adventure', 'Thriller'], 'score': 6.33}
{'title': 'Midnight Cowboy', 'genres': ['Drama'], 'score': 6.33}
{'title': 'Drugstore Cowboy', 'genres': ['Crime', 'Drama'], 'score': 6.33}
{'title': 'Urban Cowboy', 'genres': ['Drama', 'Romance', 'Western'], 'score': 6.33}
{'title': 'The Cowboy Way', 'genres': ['Action', 'Comedy', 'Drama'], 'score': 6.33}
{'title': 'The Cowboy and the Lady', 'genres': ['Comedy', 'Drama', 'Romance'], 'score': 6.33}
{'title': 'Toy Story', 'genres': ['Animation', 'Adventure', 'Comedy'], 'score': 5.65}
Here is the actual Lucene query Solr will run.
+(
(title:cowboi | plot:cowboi | genres:cowboi)
(title:movi | plot:movi | genres:movi)
)
Query 7b: Stop Words
To solve the movie dilemma, we might consider adding the word ‘movie’ to the stop words, since the word ‘movie’ seems to be irrelevant to the search phrase ‘A cowboys movie’, or to a movie search engine in general. However, this choice will negatively impact other searches. There are 12 movie titles containing the word ‘movie’. If you were searching for ‘The Lego Movie’, ignoring the word ‘movie’ as a stop word would negatively impact the accuracy and relevance of your search results. You would end up with only one of the two Lego movies in your search results: the one whose title does not contain the word ‘movie’, as shown below.
Solr:
Parameters
----------
q: The Lego Movie
kwargs: {
  'defType': 'edismax',
  'fq': 'countries: USA',
  'qf': 'plot title genres',
  'fl': 'title, genres, score',
  'rows': '5'
}

Results
----------
document count: 1

{'title': 'Lego DC Comics Super Heroes: Justice League vs. Bizarro League', 'genres': ['Animation', 'Action', 'Adventure'], 'score': 3.63}
Here is the actual Lucene query Solr will run. Note that neither the term ‘a’ nor the term ‘movie’ is part of the search.
+((title:lego | plot:lego | genres:lego))
Query 7c: Negative Boost
A second method to solve the movie dilemma might be to negatively boost the word ‘movie’ when it appears in the title field, thus reducing its relevance. Negatively boosting fields, or more precisely, negatively boosting a specific field value, is possible with both the DisMax and eDisMax query parsers. We can assign a negative boost to the word ‘movie’ when it appears in the title field by using the bq (Boost Query) parameter. According to Solr, the bq parameter specifies an additional, optional, query clause that will be added to the user’s main query to influence the score. As developers, we could programmatically append the negatively boosted term(s) into the query without directly altering the user’s original search phrase. Again, like stop words, boosting may also negatively impact other searches.
After some experimentation, we will try a boost value of -2. We still get 23 results, however, the top ten results now appear to be more relevant, based on the intent of our search phrase. The movies, ‘Cowboy Bebop: The Movie’ and ‘TV: The Movie’ are not present in the top ten results; their scoring was lowered. You can repeat this process for more search terms, like ‘movie’, positively or negatively boosting their scores, to improve the relevancy of the results.
Solr:
Parameters
----------
q: A cowboys movie
kwargs: {
  'defType': 'edismax',
  'fq': 'countries: USA',
  'qf': 'plot title genres',
  'bq': 'title:movie^-2.0',
  'fl': 'title, genres, score',
  'rows': '10'
}

Results
----------
document count: 23

{'title': 'Cowboy', 'genres': ['Western'], 'score': 7.31}
{'title': 'Space Cowboys', 'genres': ['Action', 'Adventure', 'Thriller'], 'score': 6.33}
{'title': 'Midnight Cowboy', 'genres': ['Drama'], 'score': 6.33}
{'title': 'Drugstore Cowboy', 'genres': ['Crime', 'Drama'], 'score': 6.33}
{'title': 'Urban Cowboy', 'genres': ['Drama', 'Romance', 'Western'], 'score': 6.33}
{'title': 'The Cowboy Way', 'genres': ['Action', 'Comedy', 'Drama'], 'score': 6.33}
{'title': 'The Cowboy and the Lady', 'genres': ['Comedy', 'Drama', 'Romance'], 'score': 6.33}
{'title': 'Toy Story', 'genres': ['Animation', 'Adventure', 'Comedy'], 'score': 5.65}
{'title': "Ride 'Em Cowboy", 'genres': ['Comedy', 'Western', 'Musical'], 'score': 5.58}
{'title': "G.M. Whiting's Enemy", 'genres': ['Mystery'], 'score': 5.32}
Here is the actual Lucene query Solr will run, accounting for the negative boost.
+((title:cowboi | plot:cowboi | genres:cowboi)
(title:movi | plot:movi | genres:movi))
(title:movi)^-2.0
Function Query
Solr’s Lucene scoring is effective, but what if we wanted to use an additional, subjective measure of relevance to enrich our search results? If we examine the movies index schema, we will see there are quite a few rating-related fields that provide a sense of the movie’s quality, as judged by viewers and organizations. For example, there is a nested ‘tomato’ object, containing a number of qualitative data fields, such as the tomato rating and a tomato user rating. Tomato refers to Rotten Tomatoes. According to their site, Rotten Tomatoes provides the world’s most trusted recommendation resources for quality entertainment. Additionally, the movies index schema includes similar ‘awards’ and ‘imdb’ objects and a ‘metacritic’ rating. Here is a snippet of data from the movie, ‘Butch Cassidy and the Sundance Kid’, showing many of those qualitative fields.
"imdb": { "id": "tt0064115", "rating": 8.1, "votes": 142642 }, "tomato": { "meter": 89, "image": "certified", "rating": 8.2, "reviews": 46, "fresh": 41, "consensus": "With its iconic pairing of Paul Newman and Robert Redford, jaunty screenplay and Burt Bacharach score, Butch Cassidy and the Sundance Kid has gone down as among the defining moments in late-'60s American cinema.", "userMeter": 93, "userRating": 4, "userReviews": 70088 }, "metacritic": 58, "awards": { "wins": 16, "nominations": 14, "text": "Won 4 Oscars. Another 16 wins & 14 nominations." }
According to Apache, Lucene scoring is a combination of the Vector Space Model (VSM) of Information Retrieval and the Boolean model. Lucene allows influencing search results by ‘boosting’. Using the Solr’s Function Query, we can apply a document-level multiplicative boost function, which will alter the scores of the query’s search results.
Query 8: Boost Function
If you remember, in our previous example we queried for ‘adventure action +western -romance’. If we run it again, without the boosted fields, we get back 25 documents, of which ‘Western Union’ ranks highest, with a score of 7.39.
Solr:
Parameters
----------
q: adventure action +western -romance
kwargs: {
  'defType': 'edismax',
  'fq': 'countries: USA',
  'qf': 'plot title genres',
  'fl': 'title, awards.wins, score',
  'rows': '5'
}

Results
----------
document count: 25

{'title': 'Western Union', 'awards.wins': [0.0], 'score': 7.39}
{'title': 'The Wild Bunch', 'awards.wins': [5.0], 'score': 7.18}
{'title': 'Western Spaghetti', 'awards.wins': [2.0], 'score': 6.64}
{'title': 'Crossfire Trail', 'awards.wins': [1.0], 'score': 6.26}
{'title': 'Butch Cassidy and the Sundance Kid', 'awards.wins': [16.0], 'score': 6.23}
Here is the actual Lucene query Solr will run.
+(
(title:adventur | plot:adventur | genres:adventur)
(title:action | plot:action | genres:action)
+(title:western | plot:western | genres:western)
-(title:romanc | plot:romanc | genres:romanc)
)
Now, we will apply a boost function. In the function below, I have arbitrarily taken the number of awards won by each movie and divided it in half. The function is applied to the eDisMax boost parameter.
div(field(awards.wins,min),2)
This function has a multiplicative effect on the Lucene scoring of the documents in the result set, by boosting scores in proportion to the number of awards each movie has won. We now get movies that are a blend of both relevant results, based on our search phrase, as well as those movies that are highly acclaimed.
The impact of the multiplicative boost function is immediately apparent with the top result, ‘Butch Cassidy and the Sundance Kid’. This widely acclaimed movie climbed from fifth place in the previous search to first place, using the boost formula. The movie won an amazing 16 awards, which is 16 more than the previous first-place movie, ‘Western Union’, which won no awards, moving it all the way down to 17th place in the boosted results. A movie that wasn’t even in the top five results, ‘Wild Wild West’, is now in second place, having received ten awards.
The impact of the boost function is most apparent in the scores. Previously, the score delta between the first and fifth positions in the results was 1.16. The delta between the first and second positions was a mere 0.21. Now, with the boost function applied, the range of scoring, and consequently the two deltas, increased significantly: 36.78 compared to 1.16, and 23.83 compared to 0.21.
Solr:
Parameters
----------
q: adventure action +western -romance
kwargs: {
  'defType': 'edismax',
  'fq': 'countries: USA',
  'qf': 'plot title genres',
  'fl': 'title, awards.wins, score',
  'boost': 'div(field(awards.wins,min),2)',
  'rows': '5'
}

Results
----------
document count: 25

{'title': 'Butch Cassidy and the Sundance Kid', 'awards.wins': [16.0], 'score': 49.86}
{'title': 'Wild Wild West', 'awards.wins': [10.0], 'score': 26.03}
{'title': 'How the West Was Won', 'awards.wins': [7.0], 'score': 18.42}
{'title': 'The Wild Bunch', 'awards.wins': [5.0], 'score': 17.95}
{'title': 'All Quiet on the Western Front', 'awards.wins': [5.0], 'score': 13.08}
Here is the actual Lucene query Solr will run, using the boost function.
FunctionScoreQuery(+((title:adventur | plot:adventur |
genres:adventur) (title:action | plot:action | genres:action)
+(title:western | plot:western | genres:western)
-(title:romanc | plot:romanc | genres:romanc)),
scored by boost(div(double(awards.wins,MIN),const(2))))
Solr’s Function Query offers a large number of mathematical functions, which can be combined into complex formulas. For example, we could also take the square root of the sum of the IMDB rating and the number of award nominations.
sqrt(sum(field(imdb.rating,min),field(awards.wins,min)))
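Like the other eDisMax options, the boost function is passed as an ordinary query parameter. A sketch with pysolr, using the alternate formula above and the solr client created earlier:

# Multiplicative boost applied through the eDisMax 'boost' parameter
results = solr.search('adventure action +western -romance', **{
    'defType': 'edismax',
    'fq': 'countries: USA',
    'qf': 'plot title genres',
    'boost': 'sqrt(sum(field(imdb.rating,min),field(awards.wins,min)))',
    'fl': 'title, awards.wins, score',
    'rows': '5',
})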
More Like This, Please
You’ve found a good movie, and now you want more movies just like that one. Maybe you want more movies by a particular director, or starring a certain actor, or based on the same theme. Solr has a solution for this, the More Like This Query Parser. The MLTQParser, for short, enables retrieving documents that are similar to a given document. It uses Lucene’s existing MoreLikeThis logic. To use the parser, you provide the unique Solr ID of the document you want to find more like and the field(s) to use for the comparison. For example:
q: {!mlt qf="genres"}da54520e-a013-4ea3-9698-230ed02c8cf0
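From Python, the parser invocation is simply part of the q parameter. A sketch with pysolr, reusing the example document ID from above:

# More Like This search on the genres field for a given document ID
results = solr.search('{!mlt qf="genres"}da54520e-a013-4ea3-9698-230ed02c8cf0', **{
    'fl': 'title, genres, score',
    'rows': '5',
})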
Query 9a: MLT Genres
In the first MLTQParser query, we will select the movie ‘Star Wars: Episode I – The Phantom Menace’. We will look for more movies, produced in the USA, that are similar to ‘Star Wars: Episode I – The Phantom Menace’, by looking for similarities in the genres field. The MLTQParser’s qf field specifies the fields to use for similarity. We will require a mintf (Minimum Term Frequency) of 1, the frequency below which search terms will be ignored in the source document. We will also require a mindf (Minimum Document Frequency) of 1, the frequency at which words will be ignored when they do not occur in at least this many documents.
The results of the Solr search appear logical: they are the other five Star Wars movies. Note that since the first five results are exact matches on The Phantom Menace’s three genres, ‘action’, ‘adventure’, and ‘fantasy’, their scores are identical, 6.33.
Solr:
Parameters
----------
q: {!mlt qf="genres" mintf=1 mindf=1}aaf956a5-afa5-4284-91c1-69455142884f
kwargs: {
  'defType': 'lucene',
  'fq': 'countries: USA',
  'fl': 'title, genres, score',
  'rows': '5'
}

Results
----------
document count: 252

{'title': 'Star Wars: Episode IV - A New Hope', 'genres': ['Action', 'Adventure', 'Fantasy'], 'score': 6.33}
{'title': 'Star Wars: Episode V - The Empire Strikes Back', 'genres': ['Action', 'Adventure', 'Fantasy'], 'score': 6.33}
{'title': 'Star Wars: Episode VI - Return of the Jedi', 'genres': ['Action', 'Adventure', 'Fantasy'], 'score': 6.33}
{'title': 'Star Wars: Episode III - Revenge of the Sith', 'genres': ['Action', 'Adventure', 'Fantasy'], 'score': 6.33}
{'title': 'Star Wars: Episode II - Attack of the Clones', 'genres': ['Action', 'Adventure', 'Fantasy'], 'score': 6.33}
Here is the actual Lucene query Solr will run.
+(genres:action genres:adventur genres:fantasi)
-id:652adcfa-c59c-4fa0-ace5-6345fec3cfff
Query 9b: The Problem with George
In the second MLTQParser query example, we will again choose the movie ‘Star Wars: Episode I – The Phantom Menace’. However, this time we will look for similar movies based on a comparison of the actors, director, and writers fields (shown below). Basically, we are looking for similarities between the people associated with ‘Star Wars: Episode I – The Phantom Menace’ and other movies.
Solr:
Parameters
----------
q: id:"aaf956a5-afa5-4284-91c1-69455142884f"
kwargs: {
  'defType': 'lucene',
  'fl': 'actors, director, writers'
}

Results
----------
document count: 1

{
  'director': ['George Lucas'],
  'writers': ['George Lucas'],
  'actors': ['Liam Neeson', 'Ewan McGregor', 'Natalie Portman', 'Jake Lloyd']
}
As you can tell by the results of the MLTQParser query below, the first nine out of ten search results make sense. However, the tenth movie result, ‘New Meet Me on South Street: The Story of JC Dobbs’, has no obvious similarities with ‘Star Wars: Episode I – The Phantom Menace’. The movie does not share a common director, writer, or actor.
Solr:
Parameters
----------
q: {!mlt qf="actors director writers" mintf=1 mindf=1}aaf956a5-afa5-4284-91c1-69455142884f
kwargs: {
'defType': 'lucene',
'fq': 'countries: USA',
'fl': 'title, actors, director, writers, score',
'rows': '10'
}
Results
----------
document count: 55
{'title': 'Star Wars: Episode III - Revenge of the Sith', 'director': ['George Lucas'], 'writers': ['George Lucas'], 'actors': ['Ewan McGregor', 'Natalie Portman', 'Hayden Christensen', 'Ian McDiarmid'], 'score': 44.84}
{'title': 'Star Wars: Episode II - Attack of the Clones', 'director': ['George Lucas'], 'writers': ['George Lucas', 'Jonathan Hales', 'George Lucas'], 'actors': ['Ewan McGregor', 'Natalie Portman', 'Hayden Christensen', 'Christopher Lee'], 'score': 44.58}
{'title': 'Star Wars: Episode IV - A New Hope', 'director': ['George Lucas'], 'writers': ['George Lucas'], 'actors': ['Mark Hamill', 'Harrison Ford', 'Carrie Fisher', 'Peter Cushing'], 'score': 23.51}
{'title': 'Star Wars: Episode VI - Return of the Jedi', 'director': ['Richard Marquand'], 'writers': ['Lawrence Kasdan', 'George Lucas', 'George Lucas'], 'actors': ['Mark Hamill', 'Harrison Ford', 'Carrie Fisher', 'Billy Dee Williams'], 'score': 11.96}
{'title': 'A Million Ways to Die in the West', 'director': ['Seth MacFarlane'], 'writers': ['Seth MacFarlane', 'Alec Sulkin', 'Wellesley Wild'], 'actors': ['Seth MacFarlane', 'Charlize Theron', 'Amanda Seyfried', 'Liam Neeson'], 'score': 11.72}
{'title': 'Run All Night', 'director': ['Jaume Collet-Serra'], 'writers': ['Brad Ingelsby'], 'actors': ['Liam Neeson', 'Ed Harris', 'Joel Kinnaman', 'Boyd Holbrook'], 'score': 11.72}
{'title': 'I Love You Phillip Morris', 'director': ['Glenn Ficarra, John Requa'], 'writers': ['John Requa', 'Glenn Ficarra', 'Steve McVicker'], 'actors': ['Jim Carrey', 'Ewan McGregor', 'Leslie Mann', 'Rodrigo Santoro'], 'score': 10.97}
{'title': 'The Island', 'director': ['Michael Bay'], 'writers': ['Caspian Tredwell-Owen', 'Alex Kurtzman', 'Roberto Orci', 'Caspian Tredwell-Owen'], 'actors': ['Ewan McGregor', 'Scarlett Johansson', 'Djimon Hounsou', 'Sean Bean'], 'score': 10.97}
{'title': 'Big Fish', 'director': ['Tim Burton'], 'writers': ['Daniel Wallace', 'John August'], 'actors': ['Ewan McGregor', 'Albert Finney', 'Billy Crudup', 'Jessica Lange'], 'score': 10.97}
{'title': 'New Meet Me on South Street: The Story of JC Dobbs', 'director': ['George Manney'], 'writers': ['George Manney'], 'actors': ['Tony Bidgood', 'Peter Stone Brown', 'Stephen Caldwell', 'Tommy Conwell'], 'score': 10.5}
Here is the actual Lucene query Solr will run.
+(writers:george director:george actors:natalie writers:lucas
actors:jake actors:portman actors:ewan actors:mcgregor
actors:liam actors:lloyd director:lucas actors:neeson)
-id:652adcfa-c59c-4fa0-ace5-6345fec3cfff
Examining the query, we can plainly see the problem with the MLTQParser. The MLTQParser query is splitting the first and last names of actors, directors, and writers, then searching for each name individually, but not for their whole names. In my opinion, this is a bug with the MLTQParser, since each value in the actors, director, and writers MultiValued fields is wrapped in double quotes. The query should treat each value as an exact phrase, not as individual search terms.
Given the MLTQParser’s query logic, it is now clear why a seemingly irrelevant movie, like ‘New Meet Me on South Street: The Story of JC Dobbs’, was part of the search results. Examining the debug output of the scoring explanation, we see the following.
"544637e5-e96d-4f62-9b22-5174c60ee512":" 10.504457 = sum of: 10.504457 = sum of: 5.5608187 = weight(writers:george in 649) [SchemaSimilarity], result of: 5.5608187 = score(doc=649,freq=1.0 = termFreq=1.0 ), product of: 4.226696 = idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) from: 26.0 = docFreq 1814.0 = docCount 1.315642 = tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)) from: 1.0 = termFreq=1.0 1.2 = parameter k1 0.75 = parameter b 4.836273 = avgFieldLength 2.0 = fieldLength 4.943639 = weight(director:george in 649) [SchemaSimilarity], result of: 4.943639 = score(doc=649,freq=1.0 = termFreq=1.0 ), product of: 4.6206594 = idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) from: 20.0 = docFreq 2081.0 = docCount 1.069899 = tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)) from: 1.0 = termFreq=1.0 1.2 = parameter k1 0.75 = parameter b 2.3801057 = avgFieldLength 2.0 = fieldLength "
George Manney was both the director and writer of ‘New Meet Me on South Street: The Story of JC Dobbs’. George Manney shares a first name with George Lucas, the director and writer of ‘Star Wars: Episode I – The Phantom Menace’. This is the only similarity between the people associated with the two movies. Therefore, there were two matches: one on the director field and one on the writers field. Unfortunately, sharing a first name was enough to negatively impact the results from the MLTQParser.
Synonyms
For our last example, we will examine the use of synonyms with Solr. With respect to the movies index, we could perform the following eDisMax search on the title, plot, and genres fields: ‘scary’ OR ‘slasher’ OR ‘spooky’ OR ‘evil’ OR ‘horror’, or more simply, ‘scary slasher spooky evil horror’. Based on this search, we would get back a truly gruesome collection of 141 films. The search is effective because it uses multiple, similar search terms to return a larger result set of movies within a similar theme. However, the search relies on each end-user to enter the same relevant search terms, every time.
With Solr’s synonym capability, we can build some intelligence into our index by defining synonymous relationships between terms. There are multiple ways to define synonymous relationships between terms in Solr. Lucidworks has an excellent article, Synonyms Files, on the different synonymous relationships. We will look at three types of relationships, as described by Lucidworks: Replacement Synonyms, Oneway Expansion Synonyms, and Multiway Expansion Synonyms.
I have pre-defined examples of each of the three types of relationships in the movies index’s synonyms.txt configuration file. This file is created when a new index is created.
## Custom synonym groups for movies index ##

# Replacement Synonyms examples
scarey => scary
ciborg => cyborg

# Multiway Expansion Synonyms examples
scary,slasher,spooky,evil,horror

# Oneway Expansion Synonyms examples
droid => droid,android,robot,cyborg
There is a copy of the file in this project, which will be used by the movies index, running in the Docker container.
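Note that Solr reads synonyms.txt when the core is loaded. If you edit the file later, reload the core so the changes take effect; a small sketch using the Cores Admin API:

import os

import requests

solr_url = os.environ['SOLR_URL']

# Reload the movies core so the edited synonyms.txt is re-read
response = requests.get('{}/admin/cores'.format(solr_url),
                        params={'action': 'RELOAD', 'core': 'movies'})
response.raise_for_status()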
Query 10a: Replacement Synonyms
We will start with a Replacement Synonyms example. I have added the following synonym mapping for a common misspelling of the word ‘cyborg’.
ciborg => cyborg
If we perform a search for the term ‘ciborg’, Solr will substitute it with the term ‘cyborg’. We can confirm this by viewing the query, as shown in the debug snippet below.
+(title:cyborg | plot:cyborg | genres:cyborg)
Performing the query returns two documents, including the most famous cyborg, Arnold Schwarzenegger, the Terminator.
Parameters
----------
q: ciborg
kwargs: {
  'defType': 'edismax',
  'qf': 'title plot genres',
  'fl': 'title, score',
  'rows': '5'
}

Results
----------
document count: 2

{'title': 'Terminator 2: Judgment Day', 'score': 8.17}
{'title': "I'm a Cyborg, But That's OK", 'score': 7.13}
Query 10b: Oneway Expansion Synonyms
Next, we will demonstrate Oneway Expansion Synonyms. I have added the following synonym mapping for the word, ‘droid’. Note we must include the word ‘droid’ in the expansion synonyms on the right side of the mapping, as well as on the left.
droid => droid,android,robot,cyborg
When we perform a search on the term ‘droid’, Solr will also search for the synonyms ‘android’, ‘robot’, and ‘cyborg’. We can confirm this by viewing the query Solr constructs for the search term ‘droid’, as shown in the debugger snippet below.
+(
Synonym(title:android title:cyborg title:droid title:robot) |
Synonym(plot:android plot:cyborg plot:droid plot:robot) |
Synonym(genres:android genres:cyborg genres:droid genres:robot)
)
Note the converse is not true since this is a one-way relationship. If we search on ‘cyborg’, Solr will not search on ‘droid’, ‘android’, and ‘robot’.
+(title:cyborg | plot:cyborg | genres:cyborg)
Performing the ‘droid’ eDisMax query returns 15 documents.
Parameters
----------
q: droid
kwargs: {
    'defType': 'edismax',
    'qf': 'title plot genres',
    'fl': 'title, score',
    'rows': '5'
}

Results
----------
document count: 15
{'title': 'Robo Jî', 'score': 7.67}
{'title': "I'm a Cyborg, But That's OK", 'score': 7.13}
{'title': 'BV-01', 'score': 6.6}
{'title': 'Robot Chicken: DC Comics Special', 'score': 6.44}
{'title': 'Terminator 2: Judgment Day', 'score': 6.23}
Query 10c: Multiway Expansion Synonyms
Lastly, we will demonstrate Multiway Expansion Synonyms. I have added the following synonym mapping for the word, ‘scary’.
scary,slasher,spooky,evil,horror
If we perform a search on the term ‘scary’, Solr will also search for the synonyms ‘slasher’, ‘spooky’, ‘evil’, and ‘horror’. Unlike the previous example, the converse is true since this is a multi-way relationship. If we search on any of the five synonyms, Solr will also search on the other four terms and return identical results. We can confirm this by viewing the query Solr constructs for the term ‘scary’, as shown in the debug snippet below.
+(Synonym(title:evil title:horror title:scari title:slasher
title:spooki) | Synonym(plot:evil plot:horror plot:scari
plot:slasher plot:spooki) | Synonym(genres:evil genres:horror
genres:scari genres:slasher genres:spooki))
Performing the ‘scary’ eDisMax query returns 141 documents.
Parameters
----------
q: scary
kwargs: {
    'defType': 'edismax',
    'qf': 'title plot genres',
    'fl': 'title, score',
    'rows': '5'
}

Results
----------
document count: 141
{'title': 'See No Evil, Hear No Evil', 'score': 7.9}
{'title': 'The Evil Dead', 'score': 7.23}
{'title': 'Evil Dead', 'score': 7.23}
{'title': 'Evil Ed', 'score': 7.23}
{'title': 'Evil Dead II', 'score': 6.37}
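Because the relationship is multi-way, searching on any one of the five terms should return the identical 141 documents. Below is a quick sketch to spot-check that claim, again using the assumed pysolr setup.

import pysolr

# Assumed Solr URL and collection name; adjust as needed.
solr = pysolr.Solr('http://localhost:8983/solr/movies', timeout=10)

kwargs = {'defType': 'edismax', 'qf': 'title plot genres', 'fl': 'title, score'}

# Each of the five synonyms should report the same document count.
for term in ['scary', 'slasher', 'spooky', 'evil', 'horror']:
    results = solr.search(term, **kwargs)
    print('{0}: {1} documents'.format(term, results.hits))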
You may have noticed the other replacement synonym mapping I placed in the index’s synonyms.txt configuration file, shown below.
scarey => scary
You might wonder: if you entered the common misspelling ‘scarey’ as a search term, would Solr replace the term with ‘scary’ and search not only on ‘scary’, but also on its four synonyms, ‘slasher’, ‘spooky’, ‘evil’, and ‘horror’? The answer is no; Solr will only search on ‘scary’. Solr does not create indirect or secondary relationships between synonym mappings. You would have to input the term with the correct spelling to take advantage of the multi-way relationship.
Managing Unique Vocabulary with Synonyms
Large corporations, industry verticals, government agencies, and other entities often use a unique lexicon. Their vocabulary includes uncommon terms, phrases, acronyms, abbreviations, and other idioms. These commonly represent products and services, organizational structures, and technical jargon. Synonyms are an excellent way to support domain-specific vocabularies within indexes. We see this in the examples Solr includes in the synonyms.txt configuration file. The two examples, shown below, demonstrate how to associate multiple abbreviations and technical terms with equivalent units of data storage.
# Some synonym groups specific to this example
GB,gib,gigabyte,gigabytes
MB,mib,megabyte,megabytes
Query 10d: Synonymous Phrases
A largely undocumented feature of synonyms is the ability to create synonymous relationships between common acronyms, abbreviations, terms, and phrases. Below, I have provided a few examples of how to create these relationships. The only apparent, logical limitation is that you cannot use stop words in the phrases; a good reason not to overuse stop words.
ai,artificial intelligence
cia,central intelligence agency
fbi,federal bureau investigation
lol,laughing out loud,league legends
For example, if we include the full phrase ‘league of legends’ in the synonyms map with ‘lol’, Solr ignores the phrase when we search on ‘LOL’. However, if we remove the stop word, ‘of’, then Solr will create a query that requires the two terms, ‘league’ and ‘legends’, or the three terms, ‘laughing’, ‘out’, and ‘loud’ (stemmed to ‘laugh’, ‘out’, and ‘loud’), or the single term ‘lol’.
+(
  (((+title:laugh +title:out +title:loud) (+title:leagu +title:legend) title:lol)) |
  (((+plot:laugh +plot:out +plot:loud) (+plot:leagu +plot:legend) plot:lol)) |
  (((+genres:laugh +genres:out +genres:loud) (+genres:leagu +genres:legend) genres:lol))
)
Solr:
Parameters
----------
q: lol
kwargs: {
    'defType': 'edismax',
    'qf': 'title plot genres',
    'fl': 'title, score',
    'rows': '5'
}

Results
----------
document count: 1
{'title': 'JK LOL', 'score': 9.05}
Conclusion
By understanding the capabilities of Apache Solr, the characteristics of the data contained in your indexes, and the search patterns of your end users, you will be able to craft queries that ensure responses contain high-quality, relevant search results.
The query examples in this post demonstrate only a very small portion of Solr’s vast search capabilities. Several additional query examples are available in each of the two Python scripts; you can uncomment them and explore their results further.
Solr’s documentation is very good; to learn more about Solr’s capabilities, I suggest reviewing the various parsers and their options, in the current Solr version 7.6 documentation.
I also suggest reviewing Solr’s Analyzers, Tokenizers, and Filters to understand how they affect the way in which Solr indexes documents and how Solr interprets the content of the indexes when performing a search.
All opinions expressed in this post are my own and not necessarily the views of my current or past employers or their clients.
Feature Illustration Copyright: Dejan Bozic (123RF)
Building a Microservices-based REST API with RestExpress, Java EE, and MongoDB: Part 3
Posted by Gary A. Stafford in Enterprise Software Development, Java Development, Software Development on June 5, 2015
Develop a well-architected and well-documented REST API, built on a tightly integrated collection of Java EE-based microservices.
Note: All code available on GitHub. For the version of the code that matches the details in this blog post, check out the master branch, v1.0.0 tag (after running git clone …, run a ‘git checkout tags/v1.0.0’ command).
Previous Posts
In Part One of this series, we introduced the microservices-based Virtual-Vehicles REST API example. The vehicle-themed Virtual-Vehicles microservices offer a comprehensive set of functionality, through a REST API, to application developers. In Part Two, we installed a copy of the Virtual-Vehicles project from GitHub and gained a basic understanding of how RestExpress works. Finally, we discovered how to get the Virtual-Vehicles microservices up and running.
Part Three
In Part Three of this series, we will take the Virtual-Vehicles for a test drive (get it? maybe it was funnier the first time…). There are several tools we can use to test the Virtual-Vehicles API. One of my favorite tools is Postman. We will explore how to use Postman, along with the Virtual-Vehicles API documentation, to test the endpoints that compose the Virtual-Vehicles API.
Testing the API
There are three categories of tools available to test RESTful APIs: GUI-based applications, command line tools, and testing frameworks. Postman, Advanced REST Client, REST Console, and SmartBear’s SoapUI and SoapUI NG Pro are examples of GUI-based applications designed specifically to test RESTful APIs. cURL and GNU Wget are two examples of command line tools which, among other capabilities, can test APIs. Lastly, JUnit is an example of a testing framework that can be used to test a RESTful API; despite its name, JUnit is not limited to unit testing. Each category of testing tools has its pros and cons, depending on your testing needs. We will explore all of these categories in this post as we test the Virtual-Vehicles REST API.
JUnit
JUnit is probably the best known of all Java unit testing frameworks. JUnit’s website describes JUnit as ‘a simple, open source framework to write and run repeatable tests. It is an instance of the xUnit architecture for unit testing frameworks.’ Most Java developers turn to JUnit for unit testing. However, JUnit is capable of other forms of testing, including integration testing. In his post, ‘Unit Testing with JUnit – Tutorial’, Lars Vogel states ‘an integration test has the target to test the behavior of a component or the integration between a set of components. The term functional test is sometimes used as a synonym for integration test. This kind of tests allow you to translate your user stories into a test suite, i.e., the test would resemble an expected user interaction with the application.’
Testing the Virtual-Vehicles RESTful API’s operations with JUnit would be considered integration (functional) testing. At a minimum, to complete a request, we call one microservice, which in turn authenticates the JWT by calling another microservice. If authenticated, the first microservice makes a request to its MongoDB database. As Vogel stated, whereas a unit test targets a small unit of code, such as a method, testing an API call exercises the integration between a set of components and requires several dependencies.
The simplest example of testing the Virtual-Vehicles API with JUnit would be to test an HTTP GET request to return a single instance of a vehicle. The code below demonstrates how this might be done. Notice the request depends on helper methods (not included, for brevity). To request the vehicle, assuming we already have a registered client, we need a valid JWT. We also need a valid vehicle ObjectId. To obtain these two pieces of data, we call helper methods, which in turn make the necessary requests to retrieve a JWT and a vehicle ObjectId.
/**
 * Test of HTTP GET to read a single vehicle.
 */
@Test
public void testVehicleRead() {
    String responseBody = "";
    String output;
    Boolean result = true;
    Boolean expResult = true;
    try {
        URL url = new URL(getBaseUrlAndPort() + "/vehicles/" + getVehicleObjectId());
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("GET");
        conn.setRequestProperty("Authorization", "Bearer " + getJwt());
        conn.setRequestProperty("Accept", "application/json");
        if (conn.getResponseCode() != 200) {
            // if not 200 response code then fail test
            result = false;
        }
        BufferedReader br = new BufferedReader(new InputStreamReader(
                (conn.getInputStream())));
        while ((output = br.readLine()) != null) {
            responseBody = output;
        }
        if (responseBody.length() < 1) {
            // if response body is empty then fail test
            result = false;
        }
        conn.disconnect();
    } catch (IOException e) {
        // if MalformedURLException, ConnectException, etc. then fail test
        result = false;
    }
    assertEquals(expResult, result);
}
Below are the results of the above test, run in NetBeans IDE, using the built-in support for JUnit.
JUnit can also be run from the command line, using the Maven surefire:test goal:

mvn -q -Dtest=com.example.vehicle.objectid.VehicleControllerIT surefire:test
cURL
One of the best-known command line tools for all types of operations centered around calling a URL is cURL. According to their website, ‘curl is a command line tool and library for transferring data with URL syntax, supporting…HTTP, HTTPS…curl supports SSL certificates, HTTP POST, HTTP PUT, FTP uploading, HTTP form based upload, proxies, HTTP/2, cookies, user+password authentication (Basic, Plain, Digest, CRAM-MD5, NTLM, Negotiate, and Kerberos), file transfer resume, proxy tunneling and more.’ I prefer the website’s briefer description: cURL ‘groks those URLs’.
Using cURL, we could make an HTTP PUT request to the Vehicle microservice’s /vehicles/{oid}.{format} endpoint. With cURL, we can add the JWT-based Authorization header and the raw request body containing the modified vehicle object. Below is an example of that cURL command, which can be run from a terminal prompt.
curl --url 'http://virtual-vehicles.com:8581/vehicles/557310cfec7291b25cd3a1c2' \
  -X PUT \
  -H 'Pragma: no-cache' \
  -H 'Cache-Control: no-cache' \
  -H 'Accept: application/json; charset=UTF-8' \
  -H 'Authorization: Bearer eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJpc3MiOiJ2aXJ0dWFsLXZlaGljbGVzLmNvbSIsImFwaUtleSI6IlJncjg0YzF6VkdtMFd1N25kWjd5UGNURSIsImV4cCI6MTQzMzY2ODEwNywiYWl0IjoxNDMzNjMyMTA3fQ.xglaKWufcj4TZtMXW3DLa9uy5JB_FJHxxtk_iF1WT6U' \
  --data-binary $'{ "year": 2015, "make": "Chevrolet", "model": "Corvette Stingray", "color": "White", "type": "Coupe", "mileage": 902, "createdAt": "2015-05-09T22:36:04.808Z" }' \
  --compressed
The response body contains the expected modified vehicle object in JSON format, along with a 201 Created response status.
The cURL commands may be incorporated into many types of automated testing processes. These might be as simple as a bash script. The script could run a series of automated tests, including the following: register an API client, use the API key to create a JWT, use the JWT to create a new vehicle, use the new vehicle’s ObjectId to modify that same vehicle, delete that vehicle, confirm the vehicle is removed using the count operation, and return a test-results report to the user.
cURL Commands from Chrome
Quick tip: instead of hand-coding complex cURL commands containing form data, URL parameters, and headers, use Chrome. First, open the Chrome Developer Tools (F12). Next, using the Postman – REST Client for Chrome, available in the Chrome App Store, execute your HTTP request. Finally, in the ‘Network’ tab of the Developer Tools, find and right-click on the request and select ‘Copy as cURL’. You now have a complete cURL command equivalent of your Postman request, which you can paste directly into the command line or insert into a script. Below is an example of using the Postman – REST Client for Chrome to generate a cURL command.
curl --url 'http://virtual-vehicles.com:8581/vehicles/554e8bd4c830093007d8b949' \
  -X PUT \
  -H 'Pragma: no-cache' \
  -H 'Origin: chrome-extension://fdmmgilgnpjigdojojpjoooidkmcomcm' \
  -H 'Accept-Encoding: gzip, deflate, sdch' \
  -H 'Accept-Language: en-US,en;q=0.8' \
  -H 'CSP: active' \
  -H 'Authorization: Bearer eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJpc3MiOiJ2aXJ0dWFsLXZlaGljbGVzLmNvbSIsImFwaUtleSI6IlBUMklPSWRaRzZoU0VEZGR1c2h6U04xRyIsImV4cCI6MTQzMzU2MDg5NiwiYWl0IjoxNDMzNTI0ODk2fQ.4q6EMuxE0vS43zILjE6e1tYrb5ulCe69-1QTFLYGbFU' \
  -H 'Content-Type: text/plain;charset=UTF-8' \
  -H 'Accept: */*' \
  -H 'User-Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.2357.81 Safari/537.36' \
  -H 'Cache-Control: no-cache' \
  -H 'Connection: keep-alive' \
  -H 'X-FirePHP-Version: 0.0.6' \
  --data-binary $'{ "year": 2015, "make": "Chevrolet", "model": "Corvette Stingray", "color": "White", "type": "Coupe", "mileage": 902, "createdAt": "2015-05-09T22:36:04.808Z" }' \
  --compressed
The generated command is a bit verbose. Compare this command to the cURL command, earlier.
Wget
Similar to cURL, GNU Wget provides the ability to call the Virtual-Vehicles API’s endpoints. According to their website, ‘GNU Wget is a free software package for retrieving files using HTTP, HTTPS and FTP, the most widely-used Internet protocols. It is a non-interactive command line tool, so it may easily be called from scripts, cron jobs, terminals without X-Windows support, etc.’ Again, like cURL, we can run Wget commands from the command line or incorporate them into scripted testing processes. The Wget website contains excellent documentation.
Using Wget, we could make the same HTTP PUT request to the Vehicle microservice’s /vehicles/{oid}.{format} endpoint. Like cURL, we can add the JWT-based Authorization header and the raw request body containing the modified vehicle object.
wget -O - 'http://virtual-vehicles.com:8581/vehicles/557310cfec7291b25cd3a1c2' \
  --method=PUT \
  --header='Authorization: Bearer eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJpc3MiOiJ2aXJ0dWFsLXZlaGljbGVzLmNvbSIsImFwaUtleSI6IlJncjg0YzF6VkdtMFd1N25kWjd5UGNURSIsImV4cCI6MTQzMzY2ODEwNywiYWl0IjoxNDMzNjMyMTA3fQ.xglaKWufcj4TZtMXW3DLa9uy5JB_FJHxxtk_iF1WT6U' \
  --header='Content-Type: text/plain;charset=UTF-8' \
  --header='Accept: application/json' \
  --body-data=$'{ "year": 2015, "make": "Chevrolet", "model": "Corvette Stingray", "color": "White", "type": "Coupe", "mileage": 902, "createdAt": "2015-05-09T22:36:04.808Z" }'
The response body contains the expected modified vehicle object in JSON format, along with a 201 Created response status.
cURL Bash Testing
We can combine cURL and Wget with several of the tools bash provides to develop fairly complex integration tests. The bash-based script below only scratches the surface of a complete set of integration tests. However, the tests demonstrate an efficient, multi-stage approach to handling the complex request requirements of RESTful services. The tests build upon each other.
After setting up some variables and doing a quick health check on one service, the tests register a new API client by calling the Authentication service. Next, they use the new client’s API key to obtain a JWT. The tests then use the JWT to authenticate themselves and create a new vehicle. Finally, they use the new vehicle’s id and the JWT to verify the existence of the new vehicle.
Although some may consider using bash for testing somewhat primitive, the following script demonstrates the effectiveness of curl, grep, sed, and awk, along with regular expressions, in testing our RESTful services. Note how we grep certain values from the response, such as the new client’s API key, and then use that value as a parameter in a subsequent test request, such as obtaining a JWT.
#!/bin/sh

########################################################################
#
# title: Virtual-Vehicles Project Integration Tests
# author: Gary A. Stafford (https://programmaticponderings.com)
# url: https://github.com/garystafford/virtual-vehicles-docker
# description: Performs integration tests on the Virtual-Vehicles
# microservices
# to run: sh tests_color.sh -v
#
########################################################################

echo --- Integration Tests ---

### VARIABLES ###
hostname="localhost"
application="Test API Client $(date +%s)" # randomized
secret="$(date +%s | sha256sum | base64 | head -c 15)" # randomized
echo hostname: ${hostname}
echo application: ${application}
echo secret: ${secret}

### TESTS ###
echo "TEST: GET request should return 'true' in the response body"
url="http://${hostname}:8581/vehicles/utils/ping.json"
echo ${url}
curl -X GET -H 'Accept: application/json; charset=UTF-8' \
  --url "${url}" \
  | grep true > /dev/null
[ "$?" -ne 0 ] && echo "RESULT: fail" && exit 1
echo "RESULT: pass"

echo "TEST: POST request should return a new client in the response body with an 'id'"
url="http://${hostname}:8587/clients"
echo ${url}
curl -X POST -H "Cache-Control: no-cache" -d "{
  \"application\": \"${application}\",
  \"secret\": \"${secret}\"
}" --url "${url}" \
  | grep '"id":"[a-zA-Z0-9]\{24\}"' > /dev/null
[ "$?" -ne 0 ] && echo "RESULT: fail" && exit 1
echo "RESULT: pass"

echo "SETUP: Get the new client's apiKey for next test"
url="http://${hostname}:8587/clients"
echo ${url}
apiKey=$(curl -X POST -H "Cache-Control: no-cache" -d "{
  \"application\": \"${application}\",
  \"secret\": \"${secret}\"
}" --url "${url}" \
  | grep -o '"apiKey":"[a-zA-Z0-9]\{24\}"' \
  | grep -o '[a-zA-Z0-9]\{24\}' \
  | sed -e 's/^"//' -e 's/"$//')
echo apiKey: ${apiKey}
echo

echo "TEST: GET request should return a new jwt in the response body"
url="http://${hostname}:8587/jwts?apiKey=${apiKey}&secret=${secret}"
echo ${url}
curl -X GET -H "Cache-Control: no-cache" \
  --url "${url}" \
  | grep '[a-zA-Z0-9_-]\{1,\}\.[a-zA-Z0-9_-]\{1,\}\.[a-zA-Z0-9_-]\{1,\}' > /dev/null
[ "$?" -ne 0 ] && echo "RESULT: fail" && exit 1
echo "RESULT: pass"

echo "SETUP: Get a new jwt using the new client for the next test"
url="http://${hostname}:8587/jwts?apiKey=${apiKey}&secret=${secret}"
echo ${url}
jwt=$(curl -X GET -H "Cache-Control: no-cache" \
  --url "${url}" \
  | grep '[a-zA-Z0-9_-]\{1,\}\.[a-zA-Z0-9_-]\{1,\}\.[a-zA-Z0-9_-]\{1,\}' \
  | sed -e 's/^"//' -e 's/"$//')
echo jwt: ${jwt}

echo "TEST: POST request should return a new vehicle in the response body with an 'id'"
url="http://${hostname}:8581/vehicles"
echo ${url}
curl -X POST -H "Cache-Control: no-cache" \
  -H "Authorization: Bearer ${jwt}" \
  -d '{
    "year": 2015,
    "make": "Test",
    "model": "Foo",
    "color": "White",
    "type": "Sedan",
    "mileage": 250
  }' --url "${url}" \
  | grep '"id":"[a-zA-Z0-9]\{24\}"' > /dev/null
[ "$?" -ne 0 ] && echo "RESULT: fail" && exit 1
echo "RESULT: pass"

echo "SETUP: Get id from new vehicle for the next test"
url="http://${hostname}:8581/vehicles?filter=make::Test|model::Foo&limit=1"
echo ${url}
id=$(curl -X GET -H "Cache-Control: no-cache" \
  -H "Authorization: Bearer ${jwt}" \
  --url "${url}" \
  | grep '"id":"[a-zA-Z0-9]\{24\}"' \
  | grep -o '[a-zA-Z0-9]\{24\}' \
  | tail -1 \
  | sed -e 's/^"//' -e 's/"$//')
echo vehicle id: ${id}

echo "TEST: GET request should return a vehicle in the response body with the requested 'id'"
url="http://${hostname}:8581/vehicles/${id}"
echo ${url}
curl -X GET -H "Cache-Control: no-cache" \
  -H "Authorization: Bearer ${jwt}" \
  --url "${url}" \
  | grep '"id":"[a-zA-Z0-9]\{24\}"' > /dev/null
[ "$?" -ne 0 ] && echo "RESULT: fail" && exit 1
echo "RESULT: pass"
Since these tests are just a bash script, they can be run from the command line or easily called from a continuous integration tool, such as Jenkins CI or Hudson.
Postman
Postman, like several similar tools, is an application designed specifically for testing API endpoints. The Postman website describes Postman as a tool that allows you to ‘build, test, and document your APIs faster.’ There are two versions of Postman in the Chrome Web Store: Postman – REST Client, the in-browser extension, which we mentioned above, and Postman, the standalone application. There is also Postman Interceptor, which helps you send requests that use browser cookies through the Postman application.
Postman and similar applications have add-ons and extensions to extend their features. In particular, Postman, which is free, offers the Jetpacks paid extension. Jetpacks add the ability to ‘write and run tests inside Postman, extract data from responses, chain requests together and test requests with thousands of variations’. Jetpacks allow you to move beyond basic, one-off, API request-based testing, to automated regression and performance testing.
Using Postman
Let’s use the same HTTP PUT example we used with cURL and Wget, and see how we would perform the same task with Postman. In the first screen grab below, you can see all the elements of the HTTP request, including the RESTful API’s URL, the URI including the vehicle’s ObjectId (/vehicles/{ObjectId}.{format}), the HTTP method (PUT), the Authorization header with JWT (Bearer), and the raw request body. The raw request body contains a JSON representation of the vehicle we want to update. Note how Postman saves the request in history so we can easily replay it later.
In the next screen-grab, we see the response to the HTTP PUT request. Note the response body, response status, timing, and response headers.
Looking at the response body in Postman, you can easily see how RestExpress demonstrates the RESTful principle we discussed in Part Two of the series, HATEOAS (Hypermedia as the Engine of Application State). Note the link to this vehicle (‘self’ href) and to the entire vehicles collection (‘up’ href).
Postman Collections
A great feature of Postman with Jetpacks is Collections. Collections are sets of requests, which can be saved, recalled, and shared. The Collection Runner runs requests in a collection, in the order in which you set them. Ordered collections are ideal for the Virtual-Vehicles API. The screen grab below shows a collection of requests, arranged in the order we would execute them to test the Virtual-Vehicles API, as it applies specifically to vehicle CRUD operations:
- Execute HTTP POST request to register the new API client, passing the application name and a shared secret in the request
  Receive the new client’s API key in response
- Execute HTTP GET request for a new JWT, passing the new client’s API key and the shared secret in the request
  Receive the new JWT in response
- Execute HTTP POST request to create a new vehicle, passing the JWT in the header for authentication (used for all following requests)
  Receive the new vehicle object in response
- Execute HTTP PUT request to modify the new vehicle, using the vehicle’s ObjectId
  Receive the modified vehicle object in response
- Execute HTTP GET request for the modified vehicle, to confirm it exists in the expected state
  Receive the vehicle object in response
- Execute HTTP DELETE request to delete the new vehicle, using the vehicle’s ObjectId
- Execute HTTP GET request for the new vehicle, to confirm it has been removed
  Receive a 404 Not Found status response, as expected
Using saved collections for testing the Virtual-Vehicles API is a real time-saver. However, the collections cannot easily be re-run without hand-editing or some advanced scripting. In the simple example above, we hard-coded a JWT and a vehicle ObjectId in the requests. Unfortunately, the JWT has an expiration of only 10 hours, by default. More immediately, the ObjectId is unique. The earlier collection test run created, then deleted, the vehicle with that ObjectId.
Negative Testing
You may also perform negative testing with Postman. For example, do you receive the expected response when you don’t include the Authorization header with a JWT in a request (401 Unauthorized status)? When you include a JWT that has expired (401 Unauthorized status)? When you request a vehicle whose ObjectId is incorrect or is not found in the database (400 Bad Request status)? Do you receive the expected response when you call an actual service, but an endpoint that doesn’t exist (405 Method Not Allowed)?
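The same negative tests are easy to script outside of Postman. Below is a rough sketch using Python’s requests library; the host, port, vehicle ObjectId, and expired JWT are placeholders based on this post’s earlier examples, and the expected statuses mirror the cases above.

import requests

base_url = 'http://virtual-vehicles.com:8581'  # placeholder host and port
vehicle_url = base_url + '/vehicles/557310cfec7291b25cd3a1c2'  # placeholder ObjectId
headers = {'Accept': 'application/json'}

# No Authorization header: expect 401 Unauthorized.
response = requests.get(vehicle_url, headers=headers)
assert response.status_code == 401, response.status_code

# Expired JWT ('expired.jwt.here' is a placeholder): expect 401 Unauthorized.
headers['Authorization'] = 'Bearer expired.jwt.here'
response = requests.get(vehicle_url, headers=headers)
assert response.status_code == 401, response.status_code

# Existing service, nonexistent endpoint: expect 405 Method Not Allowed.
response = requests.get(base_url + '/no-such-endpoint', headers=headers)
assert response.status_code == 405, response.status_code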
Postman Test Automation
In addition to manually viewing the HTTP response to verify the results of a request, Postman allows you to write and run automated tests for each request. According to their website, a ‘Postman test is essentially JavaScript code which sets values for the special tests object. You can set a descriptive key for an element in the object and then say if it’s true or false’. This allows you to write a set of response validation tests for each request.
Below is a quick example of testing the same HTTP POST request, used to create the new API client, above. In this example, we:
- Test that the Content-Type response header is present
- Test that the HTTP POST successfully returned a 201 status code
- Test that the new client’s API key was returned in the response body
- Test that the response time was less than 200ms
Reviewing Postman’s ‘Tests’ tab, above, observe that the four tests have run successfully. Using Postman’s testing feature, you can create even more advanced tests, eliminating the need to manually validate responses.
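Outside of Postman’s GUI, the same four assertions could be scripted. Below is a rough Python equivalent using the requests library; the host, port, application name, and secret are placeholders based on this post’s earlier examples.

import requests

# Placeholder host, port, and client details.
url = 'http://localhost:8587/clients'
payload = {'application': 'Test API Client', 'secret': 'd1ef2e4f'}

response = requests.post(url, json=payload)

assert 'Content-Type' in response.headers       # Content-Type response header is present
assert response.status_code == 201              # HTTP POST returned a 201 status code
assert 'apiKey' in response.json()              # new client's API key is in the response body
assert response.elapsed.total_seconds() < 0.2   # response time was less than 200 ms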
This post demonstrates a small subset of the features Postman and other similar applications provide for testing RESTful APIs. The tools and processes you use to test your RESTful API will depend on the stage of development and testing you are in, as well as the technology stacks on which you build and host your services.
Building a Microservices-based REST API with RestExpress, Java EE, and MongoDB: Part 2
Posted by Gary A. Stafford in Enterprise Software Development, Java Development, Software Development on May 31, 2015
Develop a well-architected and well-documented REST API, built on a tightly integrated collection of Java EE-based microservices.
Note: All code available on GitHub. For the version of the code that matches the details in this blog post, check out the master branch, v1.0.0 tag (after running git clone …, run a ‘git checkout tags/v1.0.0’ command).
Previous Post
In Part One of this series, we introduced the microservices-based Virtual-Vehicles REST API example. The vehicle-themed Virtual-Vehicles microservices offer a comprehensive set of functionality, through a REST API, to application developers. The developers, in turn, will use the Virtual-Vehicles REST API’s functionality to build applications and games for their end-users.
In Part One, we also decided on the proper number of microservices and the responsibility of each. We also determined the functionality each microservice needed to meet the hypothetical functional and nonfunctional requirements of Virtual-Vehicles. To review, the four microservices we are building are as follows:
Virtual-Vehicles REST API Resources

Microservice | Purpose (Business Capability)
Authentication | Manage API clients and JWT authentication
Vehicle | Manage virtual vehicles
Maintenance | Manage maintenance on vehicles
Valet Parking | Manage a valet service to park for vehicles
To review, the first five functions for each service are all basic CRUD operations: create (POST), read (GET), readAll (GET), update (PUT), and delete (DELETE). The readAll function also has find, count, and pagination functionality using query parameters. Unfortunately, RestExpress does not support PATCH for updates. However, I have updated RestExpress’ PUT HTTP methods to return the modified object in the response body instead of nothing (a status of 201 Created vs. 200 OK). See StackOverflow for an explanation.
All services also have an internal authenticateJwt function, to authenticate the JWT passed in the HTTP request header, before performing any operation. Additionally, all services have a basic health-check function, ping (GET). There are only a few other functions required for our Virtual-Vehicles example, such as for creating JWTs.
Part Two Introduction
In Part Two, we will build our four Virtual-Vehicles microservices. Recall from our first post, we will be using RestExpress. RestExpress composes best-of-breed open-source tools to enable quickly creating RESTful microservices that embrace industry best practices. Those best-of-breed tools include Java EE, Maven, MongoDB, and Netty, among others.
In this post, we will accomplish the following:
- Create a default microservice project in NetBeans using RestExpress MongoDB Maven Archetype
- Understand the basic structure of a default RestExpress microservice project
- Review the changes made to the default RestExpress microservice project to create the Virtual-Vehicles example
- Compile and run the individual microservices directly from NetBeans
I used NetBeans IDE 8.0.2 on Linux Ubuntu 14.10 to build the microservices. You may also follow along in other IDEs, such as Eclipse or IntelliJ, on Mac or Windows. We won’t cover installing MongoDB, Maven, and Java. I’ll assume, if you’re building enterprise applications, you have the basics covered.
Using the RestExpress MongoDB Maven Archetype
All the code for this project is available on GitHub. However, to understand RestExpress, you should go through the exercise of scaffolding a new microservice using the RestExpress MongoDB Maven Archetype. You will also be able to use this default microservice project to compare and contrast to the modified versions, used in the Virtual-Vehicles example. The screen grabs below demonstrate how to create a new microservice project using the RestExpress MongoDB Maven Archetype. At the time of this post, the archetype version was restexpress-mongodb version 1.15.
Default Project Architecture
Reviewing the two screen grabs below (Project tab), note the key components of the RestExpress MongoDB Maven project, which we just created:
- Base Package (com.example.vehicle)
- Configuration class reads in environment properties (see Files tab) and instantiates controllers
- Constants class contains project constants
- Relationships class defines linking resource which aids service discoverability (HATEOAS)
- Main executable class
- Routes class defines the routes (endpoints) exposed by the service and the corresponding controller class
- Model/Controllers Packages (com.example.vehicle.objectid and .uuid)
- Entity class defines the data entity – a Vehicle in this case
- Controller class contains the methods executed when the route (endpoint) is called
- Repository class defines the connection details to MongoDB
- Service class contains the calls to the persistence layer, validation, and business logic
- Serialization Package (com.example.vehicle.serialization)
- XML and JSON serialization classes
- Object ID and UUID serialization and deserialization classes
Again, I strongly recommend reviewing each of these packages’ classes. To understand the core functionality of RestExpress, you must understand the relationships between a RestExpress microservice’s Route, Controller, Service, Repository, Relationships, and Entity classes. In addition to reviewing the default Maven project, there are limited materials available on the Internet. I would recommend the RestExpress Website on GitHub, the RestExpress Google Group Forum, and the YouTube 3-part video series, Instant REST Services with RESTExpress.
Unit Tests?
Disappointingly, the current RestExpress MongoDB Maven Archetype sample project does not come with sample JUnit unit tests. I am tempted to start writing my own unit tests if I decided to continue to use the RestExpress microservices framework for future projects.
Properties Files
Also included in the default RestExpress MongoDB Maven project is a Java properties file (environment.properties). This properties file is displayed in the Files tab, as shown below. The default properties file is located in the ‘dev’ environment config folder. Later, we will create an additional properties file for our production environment.
Ports
Within the ‘dev’ environment, each microservice is configured to start on a separate port (i.e. port = 8581). Feel free to change the services’ port mappings if they conflict with previously configured components running on your system. Be careful when changing the Authentication service’s port, 8587, since this port is also mapped in all other microservices using the authentication.port property (authentication.port = 8587). Make sure you change both properties if you change the Authentication service’s port mapping.
Base URL
Also in the properties files is the base.url property. This property defines the URL the microservice’s endpoints will expect calls from, and the URL used when making internal calls between services. In our post’s example, this property in the ‘dev’ environment is set to localhost (base.url = http://localhost). You could map an alternate hostname from your hosts file (/etc/hosts). We will do this in a later post, in our ‘prod’ environment, mapping the base.url property to Virtual-Vehicles (base.url = http://virtual-vehicles.com). In the ‘dev’ environment properties file, MongoDB is also mapped to localhost (i.e. mongodb.uri = mongodb://localhost:27017/virtual_vehicle).
Metrics Plugin and Graphite
RestExpress also uses the properties file to hold configuration properties for the Metrics Plugin and Graphite. The Metrics Plugin and Graphite are both first-class citizens of RestExpress. Below is a copy of the Vehicle service’s environment.properties file for the ‘dev’ environment. Note the Metrics Plugin and Graphite are both disabled in the ‘dev’ environment.
# Default is 8081
port = 8581

# Port used to call Authentication service endpoints
authentication.port = 8587

# The size of the executor thread pool
# (that can handle blocking back-end processing).
executor.threadPool.size = 20

# A MongoDB URI/Connection string
# see: http://docs.mongodb.org/manual/reference/connection-string/
mongodb.uri = mongodb://localhost:27017/virtual_vehicle

# The base URL, used as a prefix for links returned in data
base.url = http://localhost

# Configuration for the MetricsPlugin/Graphite
metrics.isEnabled = false
#metrics.machineName =
metrics.prefix = web1.example.com
metrics.graphite.isEnabled = false
metrics.graphite.host = graphite.example.com
metrics.graphite.port = 2003
metrics.graphite.publishSeconds = 60
Choosing a Data Identification Method
RestExpress offers two identification models for managing data: the MongoDB ObjectId and the Universally Unique Identifier (UUID). MongoDB uses an ObjectId to uniquely identify a document within a collection. The ObjectId is a special 12-byte BSON type that guarantees the uniqueness of the document within the collection. Alternately, you can use the UUID identification model. The UUID identification model in RestExpress uses a UUID instead of a MongoDB ObjectId. The UUID model also contains createdAt and updatedAt properties that are automatically maintained by the persistence layer. You may stick with ObjectId, as we will in the Virtual-Vehicles example, or choose the UUID. If you use multiple database engines for your projects, using a UUID will give you a universal identification method.
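To make the difference between the two identifiers concrete, below is a short sketch contrasting them, using the bson package (installed with PyMongo) and Python’s standard uuid module; the printed values are illustrative.

from uuid import uuid4

from bson.objectid import ObjectId

# A 12-byte BSON ObjectId, as MongoDB generates for a document's _id field.
oid = ObjectId()
print(oid)              # e.g. 557310cfec7291b25cd3a1c2 (24 hex characters)
print(len(oid.binary))  # 12 bytes

# A 16-byte RFC 4122 UUID, usable across any database engine.
uid = uuid4()
print(uid)              # e.g. 544637e5-e96d-4f62-9b22-5174c60ee512
print(len(uid.bytes))   # 16 bytes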
Project Modifications
Many small code changes differentiate our Virtual-Vehicles microservices from the default RestExpress Maven Archetype project. Most changes are superficial; nothing changed about how RestExpress functions. Changes between the screen grabs above, showing the default project, and the screen grabs below, showing the final Virtual-Vehicles microservices, include:
- Remove all packages, classes, and code references to the UUID identification methods (example uses ObjectId)
- Rename several classes for convenience (dropped use of word ‘Entity’)
- Add the Utilities (com.example.utilities) and Authentication (com.example.authenticate) packages
MongoDB
Following a key principle of microservices mentioned in the first post, Decentralized Data Management, each microservice will have its own instance of a MongoDB database associated with it. The diagram below shows each service and its corresponding database, collection, and fields.
From the MongoDB shell, we can observe the individual instances of the four microservice’s databases.
In the case of the Vehicle microservice, the associated MongoDB database is virtual_vehicle. This database contains a single collection, vehicles. While the properties file defines the database name, the Vehicle entity class defines the collection name, using the org.mongodb.morphia.annotations package’s annotation functionality.
@Entity("vehicles") | |
public class Vehicle | |
extends AbstractMongodbEntity | |
implements Linkable { | |
private int year; | |
private String make; | |
private String model; | |
private String color; | |
private String type; | |
private int mileage; | |
// abridged... |
Looking at the virtual_vehicle database in the MongoDB shell, we see that the sample document’s fields correspond to the Vehicle entity class’s properties.
Each of the microservices’ MongoDB databases is configured in the environment.properties file, using the mongodb.uri property. In the ‘dev’ environment, we use localhost as our host URL (i.e. mongodb.uri = mongodb://localhost:27017/virtual_vehicle).
Authentication and JSON Web Tokens
The three microservices, Vehicle, Valet, and Maintenance, are almost identical. However, the Authentication microservice is unique. This service is called by each of the other three services, as well as being called directly. The Authentication service provides a very basic level of authentication using JSON Web Tokens (JWT), pronounced ‘jot’.
Why do we want authentication? We want to confirm that the requester using the Virtual-Vehicles REST API is the registered API client they claim to be. JWT allows us to achieve this requirement with minimal effort.
According to jwt.io, ‘a JSON Web Token is a compact URL-safe means of representing claims to be transferred between two parties. The claims in a JWT are encoded as a JSON object that is digitally signed using JSON Web Signature (JWS).‘ I recommend reviewing the JWT draft standard to fully understand the structure, creation, and use of JWTs.
Virtual-Vehicles Authentication Process
There are different approaches to implementing JWT. In our Virtual-Vehicles REST API example, we use the following process for JWT authentication:
- Register the new API client by supplying the application name and a shared secret (one time only)
- Receive an API key in response (one time only)
- Obtain a JWT using the API key and the shared secret (each user session or renew when the previous JWT expires)
- Include the JWT in each API call
In our example, we are passing four JSON fields in our set of claims. Those fields are the issuer (‘iss’), API key, expiration (‘exp’), and the time the JWT was issued (‘ait’). Both the ‘iss’ and the ‘exp’ claims are defined in the Authentication service’s environment.properties file (jwt.expire.length and jwt.issuer).
Expiration and issued date/time use the JWT standard’s recommended “Seconds Since the Epoch”. The default expiration for a Virtual-Vehicles JWT is set to an arbitrary 10 hours from the time the JWT was issued (jwt.expire.length = 36000). That amount, 36,000, is equivalent to 10 hours x 60 minutes/hour x 60 seconds/minute.
Decoding a JWT
Using the jwt.io site’s JWT Debugger tool, I have decoded a sample JWT issued by the Virtual-Vehicles REST API, generated by the Authentication service. Observe the three parts of the JWT: the JOSE Header, the Claims Set, and the JSON Web Signature (JWS).
The JWT’s header indicates that our JWT is a JWS that is MACed using the HMAC SHA-256 algorithm. The shared secret, passed by the API client, represents the HMAC secret cryptographic key. The secret is used in combination with the cryptographic hash function to calculate the message authentication code (MAC). In the example below, note how the API client’s shared secret is used to validate our JWT’s JWS.
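To illustrate the mechanics, below is a short Python sketch of how a JWS like ours can be verified: recompute the HMAC SHA-256 over the encoded header and payload using the shared secret, then compare the result to the JWT’s third segment. This demonstrates the concept only; it is not the Authentication service’s actual Java implementation.

import base64
import hashlib
import hmac

def verify_jwt_signature(jwt, shared_secret):
    """Return True if the JWT's HMAC SHA-256 signature is valid."""
    header_b64, payload_b64, signature_b64 = jwt.split('.')
    signing_input = (header_b64 + '.' + payload_b64).encode('utf-8')
    mac = hmac.new(shared_secret.encode('utf-8'), signing_input, hashlib.sha256)
    # JWT segments use unpadded base64url encoding.
    expected = base64.urlsafe_b64encode(mac.digest()).rstrip(b'=').decode('utf-8')
    # Constant-time comparison avoids timing side channels.
    return hmac.compare_digest(expected, signature_b64)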
Sequence Diagrams of Authentication Process
Below are three sequence diagrams, which detail the following processes: API client registration process, obtaining a new JWT, and a REST call being authenticated using the JWT. The end-user of the API self-registers their application using the Authentication service and receives back an API key. The API key is unique to that client.
The end-user application then uses the API key and the shared secret to receive a JWT from the Authentication service.
After receiving the JWT, the end-user application passes the JWT in the header of each API request. The JWT is validated by calling the Authentication service. If the JWT is valid, the request is fulfilled. If not valid, a ‘401 Unauthorized’ status is returned.
JWT Validation
The JWT draft standard recommends how to validate a JWT. Our Virtual-Vehicles Authentication microservice uses many of those criteria to validate the JWT, which include:
- API Key – Retrieve API client’s shared secret from MongoDB, using API key contained in JWT’s claims set (secret is returned; therefore API key is valid)
- Algorithm – confirm the algorithm (‘alg’), found in the JWT Header, which was used to encode the JWT, was ‘HS256’ (HMAC SHA-256)
- Valid JWS – use the client’s shared secret from #1 above to verify the HMAC SHA-256-signed JWS
- Expiration – confirm JWT is not expired (‘exp’)
Inter-Service Communications
By default, the RestExpress Archetype project does not offer an example of communications between microservices. Service-to-service communication for microservices is most often done using the same HTTP-based REST calls used by our Virtual-Vehicles REST API. Alternately, a message broker, such as RabbitMQ, Kafka, ActiveMQ, or Kestrel, is often used. Both methods of communicating between microservices, referred to as ‘inter-service communication mechanisms’ by InfoQ, have their pros and cons. The InfoQ website has an excellent microservices post, which discusses the topic of service-to-service communication.
For the Virtual-Vehicles microservices example, we will use HTTP-based REST calls for inter-service communications. The primary service-to-service communication in our example is between the three microservices, Vehicle, Valet, and Maintenance, and the Authentication microservice. The three services validate the JWT passed in each request to a CRUD operation by calling the Authentication service and passing the JWT, as shown in the sequence diagram above. Validation is done using an HTTP GET request to the Authentication service’s .../jwts/{jwt} endpoint. The Authentication service’s method, called by this endpoint, minus some logging and error handling, looks like the following:
public boolean authenticateJwt(Request request, String baseUrlAndAuthPort) {
    String jwt, output, valid = "";

    try {
        jwt = (request.getHeader("Authorization").split(" "))[1];
    } catch (NullPointerException | ArrayIndexOutOfBoundsException e) {
        return false;
    }

    try {
        URL url = new URL(baseUrlAndAuthPort + "/jwts/" + jwt);
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("GET");
        conn.setRequestProperty("Accept", "application/json");
        if (conn.getResponseCode() != 200) {
            return false;
        }
        BufferedReader br = new BufferedReader(new InputStreamReader(
                (conn.getInputStream())));
        while ((output = br.readLine()) != null) {
            valid = output;
        }
        conn.disconnect();
    } catch (IOException e) {
        return false;
    }

    return Boolean.parseBoolean(valid);
}
Primarily, we are using the java.net and java.io packages, along with the org.restexpress.Request class, to build and send our HTTP request to the Authentication service. Alternately, you could use just the org.restexpress package to construct the request and handle the response. This same basic method structure can be used to create unit tests for your service’s endpoints.
Health Ping
Each of the Virtual-Vehicles microservices contains a DiagnosticController in the .utilities package. In our example, we have created a ping() method. This simple method, called through the .../utils/ping endpoint, should return a 200 OK status and a boolean value of ‘true’, indicating the microservice is running and reachable. This route’s associated method could not be simpler:
public void ping(Request request, Response response) {
    response.setResponseStatus(HttpResponseStatus.OK);
    response.setResponseCode(200);
    response.setBody(true);
}
The ping health check can even be accessed with a simple curl command, curl localhost:8581/vehicles/utils/ping.
In a real-world application, we would add additional health checks to all services, providing additional insight into the health and performance of each microservice, as well as the service’s dependencies.
API Documentation
A well written RESTful API will not require extensive documentation to understand the API’s operations. Endpoints will be discoverable through linking (see Response Body links section in below example). API documentation should provide HTTP method, required headers and URL parameters, expected response body and response status, and so forth.
An API should be documented before any code is written, similar to TDD. The API documentation is the result of a clear understanding of requirements. The API documentation should make the coding even easier, since the documentation serves as a further refinement of the requirements. The requirements, in turn, are an architectural plan for the microservice’s code structure.
Sample Documentation
Below, is a sample of the Virtual-Vehicles REST API documentation. It details the function responsible for creating a new API client. The documentation provides a brief description of the function, the operation’s endpoint (URI), HTTP method, request parameters, expected response body, expected response status, and even a view of the MongoDB collection’s document for a new API client.
You can download a PDF version of the Virtual-Vehicles RESTful API documentation on GitHub or review the source document on Google Docs. It contains general notes about the API, and details about every one of the API’s operations.
Running the Individual Microservices
For development and testing purposes, the easiest way to start the microservices is in NetBeans, using the Run command. The Run command executes the Maven exec goal. Based on the DEFAULT_ENVIRONMENT constant in the org.restexpress.util Environment class, RestExpress will use the ‘dev’ environment’s environment.properties file, in the project’s /config/dev directory.
Alternately, you can use the RestExpress project’s recommended command from a terminal prompt to start each microservice from its root directory (mvn exec:java -Dexec.mainClass=test.Main -Dexec.args="dev"). You can also use this command to switch from the ‘dev’ to the ‘prod’ environment properties (-Dexec.args="prod").
You may use a variety of commands to confirm all the microservices are running. I prefer something basic like sudo netstat -tulpn | grep 858[0-9]. This will find all the ports within the ‘dev’ port range. For more in-depth information, you can use a command like ps -aux | grep com.example | grep -v grep.
Part Three: Testing our Services
We now have a copy of the Virtual-Vehicles project pulled from GitHub, a basic understanding of how RestExpress works, and our four microservices running on different ports. In Part Three of this series, we will take them for a drive (get it?). There are many ways to test our services’ endpoints. One of my favorite tools is Postman. We will explore how to use several tools, including Postman, and our API documentation, to test our microservices’ endpoints.
Building a Microservices-based REST API with RestExpress, Java EE, and MongoDB: Part 1
Posted by Gary A. Stafford in Enterprise Software Development, Java Development, Software Development on May 18, 2015
Develop a well-architected and well-documented REST API, built on a tightly integrated collection of Java EE-based microservices.
Microservices
Microservices are a popular and growing trend in software development. According to Wikipedia, microservices are “a software architecture style, in which complex applications are composed of small, independent processes communicating with each other using language-agnostic APIs. These services are small, highly decoupled and focus on doing a small task.”
Martin Fowler and James Lewis (ThoughtWorks) have done an exemplary job capturing the essence of microservice architecture in their March 2014 post, microservices. Fowler has also discussed these principles in several presentations, including the January 2015 goto; Conference, Keynote: Microservices by Martin Fowler.
Additionally, noted technical consultant and speaker, Adrian Cockcroft (Battery Ventures), has made significant contributions to the definition of microservices, such as in his December 2014 dockercon14 | eu presentation, State of the Art in Microservices.
Lastly, Zhamak Dehghani (ThoughtWorks), delivered an in-depth discussion of microservices, including customer perspectives, in her October 2014 presentation, Real-World Microservices: Lessons from the Frontline.
Some of the major characteristics of microservices and REST cited by these experts, include:
- Organized Around Business Capabilities
- Single Responsibility
- Loose Coupling / High Cohesion
- Smart Endpoints and Dumb Pipes
- Decentralized Data Management
- Hypermedia as the Engine of Application State (HATEOAS)
As we develop this post’s example, I will demonstrate how all of the above characteristics are implemented.
REST API
A REST API is the mash-up of two common software concepts, Representational State Transfer (REST) and an application programming interface (API). Although even Wikipedia doesn’t have an exact definition of a REST API, they come close in their discussion of REST. According to Wikipedia, “Web service APIs that adhere to the REST architectural constraints are called RESTful APIs. HTTP-based RESTful APIs are defined with these aspects: base URI, an Internet media type for the data, standard HTTP methods, hypertext links to a reference state, and hypertext links to reference related resources.”
An important nuance and differentiator from SOA-based APIs: RESTful APIs do not require XML-based Web service protocols (SOAP and WSDL) to support their interfaces (Wikipedia).
The author of the WebConcepts channel does an excellent job capturing the essence of REST APIs in REST API concepts and examples. Two additional presentations I strongly recommend are REST+JSON API Design – Best Practices for Developers and Designing a Beautiful REST+JSON API, both by Les Hazlewood, CTO of Stormpath. Stormpath is a leader in the commercial REST API space.
Microservices-Based REST API
A microservices-based REST API is a REST API, whose HTTP requests call an orchestrated collection or collections of language-agnostic and platform-agnostic microservices. The combination of these two trends, microservices, and a REST API, offers a simple, reliable, and scalable solution for providing flexible functionality to an end-user, in a technology-agnostic manner.
REST API Example
There is a fast-growing volume of reference materials describing the characteristics, benefits, and general architecture of microservices and REST APIs. However, in researching these topics, I have found a shortage of practical examples or tutorials on building microservices-based REST API solutions.
Undoubtedly, the complexity of even the simplest microservices-based solution limits the number of available examples. A minimally viable solution requires planning, coding, testing, and documentation. The addition of cross-cutting features such as security, logging, monitoring, and orchestration creates an enormous task for anyone building a practical microservices-based example.
In the following series of posts, we will use many of the characteristics of a modern microservice architecture as described by Fowler, Lewis, Cockcroft, and Dehghani. We will combine these microservice characteristics with the best practices of good REST API design, as described by Hazlewood and WebConcepts, to build a minimally viable microservices-based REST API.
In a future post, we will create an application, which leverages the microservices-based solution, through the REST API. Additionally, we will demonstrate how to ensure high-availability of the individual microservices and data sources.
Vehicle for Learning
In a similar vein to the publicly available Twitter, Facebook, and Google REST APIs, we will build the Virtual-Vehicles REST API. The Virtual-Vehicles REST API will constitute a collection of vehicle-themed microservices. Collectively, the microservices will offer a comprehensive set of functionality to the end-user, an application developer. They, in turn, will use the functionality of the Virtual-Vehicles REST API to build applications and games for their end-users.
Technology Choices
There are a seemingly infinite number of technology choices for building microservices and REST APIs. Your choice of development languages, databases, application servers, third-party libraries, API gateway, logging, monitoring, automated testing, ORM or ODM, and even the IDE, all define your technology stack.
For the Virtual-Vehicles solution, we will use the following key technologies:
- RestExpress – Create performant, stand-alone microservices-based REST APIs
- Java EE – Primary development language
- Netty – Async, event-driven Java network application framework
- Apache Maven – Dependency management and more
- MongoDB – Data source
- JSON Web Tokens (JWT) – Claims-based authentication
- Apache Log4j 2 – Logging
- JUnit – Unit testing
- HAProxy – Service gateway and load-balancing
- NetBeans – IDE
- Git and GitHub – SCM
- Postman – REST API testing
What is RestExpress?
According to their website, RestExpress ‘composes best-of-breed open-source tools to enable quickly creating RESTful microservices that embrace industry best practices.’ Built from the ground up for container-less, microservice architectures, RestExpress is the easiest way to create RESTful APIs in Java. RestExpress is an extremely lightweight, fast REST engine and API for Java. RestExpress is a thin wrapper on Netty IO HTTP handling. RestExpress lets you create performant, stand-alone REST APIs rapidly. RestExpress provides several Maven archetypes, which we will use as a basis for our microservices.
RestExpress will also drive our technology decisions to use Java EE, Maven, MongoDB, and Netty.
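To give a feel for that workflow, here is a minimal sketch of generating a new service project from a Maven archetype, using Maven's standard archetype:generate goal. The archetype coordinates below are placeholders, not RestExpress's actual published coordinates; look up the real values in the RestExpress documentation before running anything like this:

mvn archetype:generate -DarchetypeGroupId=<restexpress-archetype-group> -DarchetypeArtifactId=<restexpress-mongodb-archetype> -DarchetypeVersion=<version> -DgroupId=com.example.virtualvehicles -DartifactId=vehicle-service

Maven then scaffolds a buildable service skeleton, which we would modify to implement each microservice's business capability.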
Virtual-Vehicles Microservices
Adhering to the first few microservice architectural principles listed above, organized around business capabilities, single responsibility, and high cohesion, we first must determine the proper number of microservices, and their individual responsibilities. In the case of our solution, we will break down Virtual-Vehicles’ business capabilities into the following microservices:
Virtual-Vehicles Services

| Microservice | Purpose (Business Capability) |
| --- | --- |
| Authentication Service | Manage API clients and JWT authentication |
| Vehicle Service | Manage virtual vehicles |
| Maintenance Service | Manage maintenance on vehicles |
| Valet Parking Service | Manage a valet service to park vehicles |
| Sales Service | Manage the buying and selling of vehicles |
| Registration Service | Manage registration of vehicles to owners |
| Auction Service | Manage a virtual car auction |
| Car Show Service | Manage a virtual car show |
| Interaction Service | Manage interaction of users with vehicles |
For simplicity, in this post’s example we will only be exploring the four services detailed in the next section: the Authentication, Vehicle, Maintenance, and Valet Parking services.
This segmentation of service functionality is unlike what we might encounter in traditional monolithic, n-tier applications or SOA-based architectures. Traditional applications were built around application-centric functionality or the business’ organizational structure. Microservices, however, are client-centric and built around business capabilities.
REST API Functionality
The next decision we need to make is required functionality. What are the operational requirements of each business segment, represented by the microservices? Additionally, what are the non-functional requirements, such as monitoring, logging, and authentication? Requirements are translated into functionality, which is translated into the resources exposed via each service’s RESTful endpoints.
For the Virtual-Vehicles microservices solution, based on a hypothetical set of business and non-functional requirements, we will expose the following resources. Collectively, they will compose the REST API:
Virtual-Vehicles REST API Resources

| Microservice | Purpose (Business Capability) | Functions |
| --- | --- | --- |
| Authentication Service | Manage API clients and JWT authentication | create, read, readAll, update, delete, and others (detailed below) |
| Vehicle Service | Manage virtual vehicles | create, read, readAll, update, delete, and others (detailed below) |
| Maintenance Service | Manage maintenance on vehicles | create, read, readAll, update, delete, and others (detailed below) |
| Valet Parking Service | Manage a valet service to park vehicles | create, read, readAll, update, delete, and others (detailed below) |
Reviewing the table above, note the first five functions for each service are all basic CRUD operations: create (POST), read (GET), readAll (GET), update (PUT), and delete (DELETE). The readAll function also has find, count, and pagination functionality, using query parameters.

All services also have an internal authenticateJwt function, to authenticate the JWT, passed in the HTTP request header, before performing any operation. Additionally, all services have a basic health-check function, ping (GET). There are only a few other functions required for our Virtual-Vehicles example, such as for creating JWTs.
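As an illustration only, the Vehicle Service’s resources might look something like the sketch below. The URI scheme, the pagination parameter names (offset, limit), and the ping path are my assumptions for this example, not the project’s documented contract; each request would also carry the JWT in an HTTP request header:

POST   /vehicles                      (create a new vehicle)
GET    /vehicles/{vehicleId}          (read a single vehicle)
GET    /vehicles?offset=0&limit=25    (readAll, with hypothetical pagination parameters)
PUT    /vehicles/{vehicleId}          (update an existing vehicle)
DELETE /vehicles/{vehicleId}          (delete a vehicle)
GET    /vehicles/ping                 (health check)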
I’ve labeled each function with a suggested user scope. Scopes include public, admin, and internal. As the producer of the REST API, you may only want to expose certain functionality to your general end-users (public). Additional functionality may be reserved for an administrative user (admin), or only for yourself as a developer (internal). Creating a new vehicle might be a common end-user feature. However, the ability to permanently delete one or more vehicles may be reserved for an admin-level user, or not exposed at all.
REST API Patterns
We will not spend a lot of time discussing patterns for building REST APIs. There is a wealth of useful material available on the Internet regarding industry-standard patterns for REST API resource URI construction. The two presentations by Les Hazlewood, CTO of Stormpath, recommended above, are excellent. The Microservices.io, RestApiTutorial.com, swagger.io, and raml.org websites also offer solid overviews of REST patterns and RESTful standards.
A common RESTful anti-pattern, which is hard to avoid as an OOP developer, is the temptation to use verbs and method-like names, rather than nouns, in resource URIs. Remember, we are not designing an end-user application. We are building an API, used by API consumers (application developers), to build a variety of platform- and language-agnostic applications. Functions like paintCar, changeOil, or parkVehicle are not something the API should define. Instead, the Vehicle microservice exposes the update operation, which allows an application developer to change the car’s paint color in their own paintCar method. Similarly, the Valet Parking service exposes the create operation, which allows the application developer to create a function to park the vehicle (or car, or truck, in a garage, or parking lot, etc.). A good REST API allows for maximum end-user flexibility.
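To make the distinction concrete, here is a hypothetical before-and-after; these URIs are mine, invented for illustration, and not part of the Virtual-Vehicles API:

POST /vehicles/paintCar       (anti-pattern: verb-based, method-like URI)
PUT  /vehicles/{vehicleId}    (preferred: update the noun-based resource; the request body carries the new paint color)

POST /vehicles/parkVehicle    (anti-pattern)
POST /valet-transactions      (preferred: create a new resource representing the parking event)

The HTTP method supplies the verb; the URI names only the resource.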
Part Two
In Part Two, we will install a copy of the Virtual-Vehicles project from GitHub, gain a basic understanding of how RestExpress works, and discover how to get the Virtual-Vehicles microservices up and running.
Installing and Configuring the MEAN Stack, Yeoman, and Associated Tooling on Windows
Posted by Gary A. Stafford in Bash Scripting, Build Automation, Client-Side Development, DevOps, Software Development on March 8, 2014
Configure your Windows environment for developing modern web applications using the popular MEAN Stack and Yeoman suite of utilities.
Introduction
It’s an exciting time to be involved in web development. There are dozens of popular open-source JavaScript frameworks, libraries, code-generators, and associated tools exploding onto the development scene. It is now possible to use a variety of popular technology mashups to provide a complete, full-stack JavaScript application platform.
MEAN Stack
One of the JavaScript mashups gaining a lot of traction recently is the MEAN Stack. If you’re reading this post, you probably already know MEAN is an acronym for four leading technologies: MongoDB, ExpressJS, AngularJS, and Node.js. The MEAN stack provides an end-to-end JavaScript application solution. MEAN provides a NoSQL document database (Mongo), a server-side solution for JavaScript (Node/Express), and a magic client-side MV* framework (Angular).
Depending on which MEAN stack generator or pre-built project you start with, in addition to the main four technologies, you will pull down several other smaller libraries and frameworks. These most commonly include jQuery, Twitter Bootstrap, Karma (test runner), Jade (template engine), JSHint, Underscore.js (utility-belt library), Mongoose (MongoDB object modeling tool), Passport (authentication), RequireJS (file and module loader), BreezeJS (data entity management), and so forth.
Common Tooling
If you are involved in these modern web development trends, then you are aware there is also a fairly common set of tools used by a majority of these developers, including source control, IDE, OS, and other helper utilities. For SCM/VCS, Git is the clear winner. For an IDE, WebStorm, Sublime Text, and Kompozer are heavy favorites. The platform of choice most often appears to be either Mac or Linux. It’s far less common to see these technologies demonstrated, or tutorials built, on the Microsoft Windows platform.
Yeoman
Another area of commonality is helper utilities, used to make the development, building, dependency management, and deployment of modern JavaScript applications easier. Two popular ones are Brunch and Yeoman. Yeoman is also an acronym for a set of popular tools: yo, Grunt, and Bower. The first, yo, is best described as a scaffolding tool. Grunt is the build tool. Bower is a tool for client-side package and dependency management. We will install Yeoman, along with the MEAN Stack, in this post.
Windows
There is no reason Windows cannot serve as your development and hosting platform for modern web development, without specifically using Microsoft’s .NET stack. In fact, with minimal set-up, you would barely know you were using Windows as opposed to Linux or Mac. In this post, I will demonstrate how to configure your Windows machine for developing these modern web applications using the MEAN Stack and Yeoman.
Here is a list of the components we will discuss:

- Git
- Ruby
- RubyGems
- MongoDB
- Node.js
- Express
- Yeoman (yo, Grunt, and Bower)
Installations
Git
The use of Git for source control is obvious. Git is the overwhelming choice of modern developers. Git has been integrated into most major IDEs and hosting platforms, and there are hooks into Git available for most leading development tools. However, there are more benefits to using Git than just SCM. Being a Linux/Mac user, I prefer to use a Unix-like shell on Windows, versus the native Windows Command Prompt. For this reason, I use Git for Windows, available from msysGit. This package includes Git SCM, Git Bash, and Git GUI. I use the Git Bash interactive shell almost exclusively for my daily interactions requiring a command prompt, and I will be using it for this post. OpenHatch has a great post and training materials available on using Git Bash with Windows.
Git for Windows provides a downloadable Windows executable file for installation. Follow the installation file’s instructions.
To test your installation of Git for Windows, call the Git binary with the ‘--version’ flag. This flag can be used to test all the components we are installing in this post. If the command returns a value, then it’s a good indication that the component is installed properly and can be called from the command prompt:
gstafford: ~/Documents/git_repos $ git --version
git version 1.9.0.msysgit.0
You can also verify Git using the ‘where’ and ‘which’ commands. The ‘where’ command will display the location of files that match the search pattern and are in the paths specified by the PATH environment variable. The ‘which’ command tells you which file gets executed when you run a command. These commands will work for most components we will install:
gstafford: ~/Documents/git_repos $ where git
C:\Program Files (x86)\Git\bin\git.exe
C:\Program Files (x86)\Git\cmd\git.cmd
C:\Program Files (x86)\Git\cmd\git.exe

gstafford: ~/Documents/git_repos $ which git
/bin/git
Ruby
The reasons for Git are obvious, but why Ruby? Yeoman, specifically yo, requires Ruby. Installing Ruby on Windows is easy; Ruby recommends using RubyInstaller for Windows. RubyInstaller downloads an executable file, making installation easy. I am using Ruby 1.9.3. I had previously installed the latest 2.0.0, but had to roll back after some 64-bit compatibility issues with other applications.
To test the Ruby installation, use the ‘--version’ flag again:
gstafford: ~/Documents/git_repos $ ruby --version
ruby 1.9.3p484 (2013-11-22) [i386-mingw32]
RubyGems
Optionally, you might also want to install RubyGems. RubyGems allows you to add functionality to the Ruby platform, in the form of ‘Gems’. A common Gem used with the MEAN stack is Compass, the Sass-based framework for the creation and maintenance of CSS stylesheets. According to their website, Ruby 1.9 and newer ships with RubyGems built-in, but you may need to upgrade for bug fixes or new features.
On Windows, installation of RubyGems is as simple as downloading the .zip file from the RubyGems download site. To install RubyGems, unzip the downloaded file. From the root of the unzipped directory, run the following Ruby command:
ruby setup.rb
To confirm your installation:
gstafford: ~/Documents/git_repos $ gem --version
2.2.2
If you already have RubyGems installed, it’s recommended you update RubyGems before continuing. The first command below, with the ‘--system’ flag, will update to the latest RubyGems. Use the second command, without the flag, if you want to update each of your individually installed Ruby Gems:
gem update --system
gem update
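With RubyGems working, installing a Gem such as Compass, mentioned earlier, should then be a single command. This is a sketch based on standard ‘gem install’ usage; check Compass’s own documentation for its current installation instructions:

gem install compass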
MongoDB
MongoDB provides a great set of installation and configuration instructions for Windows users. To install MongoDB, download the MongoDB package. Create a ‘mongodb’ folder; Mongo recommends placing it at the root of your system drive. Unzip the MongoDB package to the ‘c:\mongodb’ folder. That’s really it; there is no installer file.
Next, to make a default Data Directory location, use the following two commands:
mkdir c://data && mkdir c://data/db
Unlike most other components, to call Mongo from the command prompt, I had to manually add the path to the Mongo binaries to my PATH environment variable. You can access your Windows environment variables using the Windows and Pause keys. Add the path ‘c:\mongodb\bin’ to the end of the PATH environment variable value.
To test the MongoDB installation, and that the PATH variable is set correctly, close any current interactive shells or command prompt windows. Open a new shell and use the same ‘--version’ flag for Mongo’s three core components:
gstafford: ~/Documents/git_repos $ mongo --version; mongod --version; mongos --version
MongoDB shell version: 2.4.9
db version v2.4.9
Sun Mar 09 16:26:48.730 git version: 52fe0d21959e32a5bdbecdc62057db386e4e029c
MongoS version 2.4.9 starting: pid=15436 port=27017 64-bit host=localhost (--help for usage)
git version: 52fe0d21959e32a5bdbecdc62057db386e4e029c
build sys info: windows sys.getwindowsversion(major=6, minor=1, build=7601, platform=2, service_pack='Service Pack 1') BOOST_LIB_VERSION=1_49
To start MongoDB, use the ‘mongod’ or ‘start mongod’ commands. Adding ‘start’ opens a new command prompt window, rather than tying up your current shell. If you are not using the default MongoDB Data Directory (‘c://data/db’) you created in the previous step, use the ‘--dbpath’ flag, as shown below.
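As a quick reference, the three start variations just described look like this; the alternate data path is only a placeholder:

mongod
start mongod
start mongod --dbpath "c://alternate/path"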
Node.js
To install Node.js, download and run Node’s .msi installer for Windows. Along with Node.js, you will get npm (Node Package Manager). You will use npm to install all your server-side components, such as Express, yo, Grunt, and Bower.
gstafford: ~/Documents/git_repos $ node --version && npm --version
v0.10.26
1.4.3
Express
To install Express, the web application framework for Node, use npm:
npm install -g express
The ‘-g’ flag (or, ‘--global’ flag) should be used. According to Stack Overflow, ‘if you’re installing something that you want to use in your shell, on the command line or something, install it globally, so that its binaries end up in your PATH environment variable’:
gstafford: ~/Documents/git_repos $ express --version
3.5.0
Yeoman – yo, Grunt, and Bower
You will also use npm to install yo, Grunt, and Bower. Actually, we will use npm to install the Grunt Command Line Interface (CLI); the Grunt task runner itself will be installed locally in your MEAN stack project later. Read these instructions on the Grunt website for a better explanation. Use the same basic command as with Express:
npm install -g yo grunt-cli bower
To test the installs, run the same command as before:
gstafford: ~/Documents/git_repos $ yo --version; grunt --version; bower --version
1.1.2
grunt-cli v0.1.13
1.2.8
If you already had Yeoman installed, confirm you have the latest versions with the ‘npm update’ command:
gstafford: ~/Documents/git_repos/gen-angular-sample $ npm update -g yo grunt-cli bower
npm http GET https://registry.npmjs.org/grunt-cli
npm http GET https://registry.npmjs.org/bower
npm http GET https://registry.npmjs.org/yo
npm http 304 https://registry.npmjs.org/bower
npm http 304 https://registry.npmjs.org/yo
npm http 200 https://registry.npmjs.org/grunt-cli
All of the npm installs, including Express, are installed and called from a common location on Windows:
gstafford: ~/Documents/git_repos/gen-angular-sample $ where express yo grunt bower
c:\Users\gstaffor\AppData\Roaming\npm\yo
c:\Users\gstaffor\AppData\Roaming\npm\yo.cmd
c:\Users\gstaffor\AppData\Roaming\npm\grunt
c:\Users\gstaffor\AppData\Roaming\npm\grunt.cmd
c:\Users\gstaffor\AppData\Roaming\npm\bower
c:\Users\gstaffor\AppData\Roaming\npm\bower.cmd

gstafford: ~/Documents/git_repos $ which express; which yo; which grunt; which bower
~/AppData/Roaming/npm/express
~/AppData/Roaming/npm/yo
~/AppData/Roaming/npm/grunt
~/AppData/Roaming/npm/bower
Use the command ‘npm list --global | less’ (or, ‘npm ls -g | less’) to view all npm packages installed globally, in a tree-view. After you have generated your project (see below), check the project-specific server-side packages with the ‘npm ls’ command from within the project’s root directory. For the client-side packages, use the ‘bower ls’ command from within the project’s root directory.
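Collected in one place, those inspection commands are:

npm list --global | less    (all globally installed npm packages, as a tree)
npm ls                      (server-side packages, run from the project root)
bower ls                    (client-side packages, run from the project root)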
If you’re in a hurry, or have more Windows boxes to configure, you can use one npm command to install all four components above:
npm install -g express yo grunt-cli bower
MEAN Boilerplate Generators and Projects
That’s it, you’ve installed most of the core components you need to get started with the MEAN stack on Windows. Next, you will want to download one of the many MEAN boilerplate projects, or use a MEAN code generator with npm and yo. I recommend trying one or all of the following projects. They are each slightly different architecturally, but fairly stable:
- Linnovate’s MEAN.io project on GitHub (boilerplate MEAN project)
- James Cryer’s npm generator-mean, a Yeoman generator for MEAN stack, inspired by mean.io
- William Lepinski’s MEAN stack generator for Yeoman, generator-meanstack, based on generator-angular.
Links
- Jim Lavin’s Introduction to the MEAN Stack on YouTube
- Google Developer’s Getting MEAN on YouTube
- Introduction to the MEAN Stack, Part One: Setting Up Your Tools
- Full-Stack JavaScript With MEAN And Yeoman
- The MEAN Stack: MongoDB, ExpressJS, AngularJS and Node.js