+353 87 1272938 firstname.lastname@example.org
This is a detailed description of the benchmark tests that we conducted to compare SIREn’s performance relative to BlockJoin with respect to search on nested documents. The BlockJoin approach is used in both Solr and ElasticSearch to support nested document search. If you are looking for a short summary, you could read this blog post.
There are a lot of use cases that require indexing of nested documents. One only needs to scan the Lucene/Solr user forums to discover examples of this need.
In this post we’ll show how SIREn is by far the only approach that can scale to real world scenarios while delivering the advanced search capabilities on structured data allow.
Both SIREn and Lucene’s Blockjoin feature offer a solution for working with nested documents. However, SIREn is based on a conceptual model that is fundamentally different from the Blockjoin approach. As such, for anyone deciding between these two solutions, it is important to understand their relative scalability and performance metrics.
In this document, we describe the results of some benchmark tests we conducted on both these approaches. We show how SIREn v1.2 offers enormously improved scalability and enhanced performance.
While the focus of this document is Lucene/Solr, the results are identically applicable to ElasticSearch which, under the hood, uses Lucene’s Blockjoin to support nested documents.
One of the core strengths of Lucene/SIREn is their ability to query over structured and unstructured content. As such, we wanted to pick a dataset that provides a combination of both.
The US Patent and Trademark Ofiice (USPTO) dataset is a good example of complex XML data and as such it was used for the test. Each patent document contains structured data about inventors, classifications, etc., as well as textual content such as the abstract, the claims being made, and even down to the single paragraph of text. All of this data is described in a XML document with multiple levels of nesting.
The XML data is transformed into the equivalent JSON for this test. The transformation is based on the json-lib library with some extensions to remove bold, italic and other tags in text values. You can find an example of such a JSON document in the Appendix.
[table caption=”Statistics about the USPTO dataset” style=”font-size:75%” ]
Number of Documents, Uncompressed JSON, Compressed JSON
44369, 5 GB, 715 MB
SIREn natively supports JSON structured data. It can index the JSON documents without any additional data preparation. Each JSON document is mapped to a Lucene’s document with the following fields:
The quickest way to test the Blockjoin approach would be to use Elasticsearch, as it too does not require transformation of the data – the only gotcha being that one has to manually modify the schema mapping to ensure that the engine does not “flatten” the nested documents and uses Blockjoin. In contrast, SIREn is truly “schemaless” and does not require any configuration or modification.
For these tests however we will use Lucene directly for greater control and precision in the results. To do this, we convert JSON to Blockjoin using the transformation described in the Appendix.
The configuration of the machine used for the running the benchmark test is described below:
SIREn v1.2 was used for this test. Lucene/Solr v4.7 was used for testing the Blockjoin’s functionality.
For the creation of the index, we have disabled auto flush and the compound format, and we have increased the ram buffer size to 256 MB. We finally perform a merge to create one single segment.
For the test dataset, the Blockjoin’s approach leads to indexes that are almost twice as large as that for SIREn. This is due to the artificial increase in the number of documents for Blockjoin.
There are an average of 1833 nested objects per patent document in the test dataset. Given that the dataset covers just the first two month of the year, we can expect 6 times more documents for indexing the dataset for a full year. This would result in 486 million documents for Blockjoin.
[table caption=”Comparison of index size in gigabytes between SIREn and Blockjoin” style=”font-size:75%” ]
, Number of Documents in Index, Index Size
SIREn, 44369, 1.317
Blockjoin, 81350921, 2.577
During the creation of the index, we perform a merge to create one single segment, and we record the time to perform the merge. Reported time has been averaged over 5 runs.
[table caption=”Comparison of index merge time in milliseconds between SIREn and Blockjoin” style=”font-size:75%” ]
, Merge Time
We can see that SIREn is 3 to 4 times faster than the Blockjoin’s approach.
The query patterns cover common operators, i.e., conjunction, disjunction and phrase, on the text but also on the structure of the document. The query patterns are restricted to match in a paragraph. A paragraph is a nested object of a description section, the description section being itself a nested object of the parent document. This ‘path’ of nested objects is quite common in the USPTO dataset (there are more than 3.7 millions paragraph objects). Querying a paragraph requires two joins: the first one being the join with the description objects, and the second one being the join between the description objects and the parent documents.
|lookup||lookup of a high frequency term in a paragraph|
|conjunction||conjunction (AND) of two high frequency terms in a paragraph|
|disjunction||disjunction (OR) of two high frequency terms in a paragraph|
|phrase||phrase of two terms in a paragraph|
|conjunction-paragraph||conjunction of two paragraphs within the same description object|
|disjunction-paragraph||disjunction of two paragraphs within the same description object|
The paths for the lookup, conjunction, disjunction and phrase queries are shown below.
"#text": "TERM / TERM1 AND TERM2 / TERM1 OR TERM2 / ‘TERM1 TERM2’",
The paths for conjunction and disjunction paragraph queries are show below.
AND / OR
The queries are generated based on the frequency of the words composing them. The word frequency determines how many entities match a given keyword. The words are grouped into three frequency ranges: high, medium and low. We first order the words by their descending frequency, and then take the first k words whose cumulative frequency is 40% of all word occurrences as high range. The medium range accounts for the next 30%, and the low range is composed of all the remaining words. For the phrase queries, we follow a similar technique. We first extract all the 2-gram from the data collection. We then compute their frequency and sort them by descending frequency. We finally create the three ranges as explained above.
For each type of query, we (1) flush the OS cache; (2) initialise a new JVM; (3) generate a set of 50 random queries, (4) warm the JVM by executing a certain number of times the set of queries, and (5) perform 60 measurements. Each measurement is made by performing n times the query execution of the 50 random queries, with n chosen so that the runtime is long enough (10 seconds) to minimise the time precision error of the OS and machine. The measurement time is the time used by the current thread to process the 50 random queries. We report the query rate (query per second) based on the mean of the measurements.
With respect to Blockjoin’s queries, the parent document filters are created prior to the measurements, and are reused across measurements.
SIREn and Blockjoin are configured to compute the score of a document based on the average score of its children. This means that we iterate over all the child node matches.
The results for the query performance is shown below. The query rates for SIREn and Blockjoin is shown along with the percentage boost offered by the former relative to the later.
[table caption=”Query rates (queries per second) per query type for SIREn and Blockjoin” style=”font-size:75%” ]
Query, Blockjoin Query Rate, SIREn Query Rate, Rate Ratio %
5. Conjunction Paragraph,48.30,163.92,339%
6. Disjunction Paragraph,26.39,74.41,282%
For query 1,3,5,6 SIREn performs an average of 3 times faster than Blockjoin. This is due to Blockjoin suffering when the child query is not highly selective as intersecting the tag filter with the child query results is costly. The necessity of Blockjoin to use a tag filter for each parent and child query incurs a performance penalty.
SIREn on the other hand suffers penalty for query 2 and 4, when the child query is highly sensitive. A specific solution for this is on its way, targeting SIREn release 1.4. We will update this benchmark as soon as available.
While query throughputs are comparable for SIREn and Blockjoin, the issues with the later surface when we consider memory requirements.
To speed up the processing of certain types of queries (Blockjoin queries, facet queries, filter queries), Solr uses various caching strategies, including bit arrays (or bitsets). The memory requirement for caching a query is proportional to the number of documents in the index. For example for an index containing 10m docs, a bitset would need 10m bits of memory (about 1.2MB).
Given that the Blockjoin approach artificially inflates the number of documents in the index, queries that require bitsets significantly impact memory requirements. For the test dataset, a bitset in SIREn would require 0.005 MB (44,369 / 8 / 1024 / 1024) whereas Blockjoin would require 9.7 MB (81,350,921 / 8 / 1024 / 1024)!
SIREn on the other hand has stable memory usage which is not dependent with the number of nested objects. This leaves more memory for caching additional information, e.g., query filters, facets, and for the operating system to cache index files, which is very important for good performance.
We can look at the impact this has on the various types of queries.
The Blockjoin approach requires a bitset to be cached for each parent child relationship that is part of the query. The test queries for instance require two bitsets.
Given the large number of nested object types (as explained in the Appendix), thousands of bitsets would be required for querying over all possible paths, each of 10mb size. SIREn on the other hand needs 0 bitsets for nested querying!
The maximum memory utilization during execution of the benchmark queries was measured. With respect to Blockjoin, this measures the impact of caching the two parent filters that are necessary for the execution of the nested queries.
[table caption=”Comparison of memory usage during the execution of the query benchmark” style=”font-size:75%” ]
, Memory Usage, Memory Usage per Filter
Blockjoin, 20 MB, 9.69 MB
SIREn, 3 MB, 0 MB
, , SIREn does not need memory per filter
Filter queries are used for caching query results and reusing them in subsequent queries. They are extensively used for supporting high volumes. Any filter query over the data would require the creation of a docset (either a BitDocSet or a SortedIntDocSet). This would be in addition to the parent filter bitsets required by Blockjoin for nested queries, as described above.
The memory utilization of Solr was measured (via a profiler) after warming up Solr with 200 random queries (non Blockjoin queries) with a random filter picked from a pool of 50 different filters on object types.
[table caption=”Comparison of memory increase after the execution of the queries with random filters” style=”font-size:75%” ]
, Memory Increase
Blockjoin, 194 MB
SIREn, 40 MB
, SIREn requires 4.85 times less memory
Facet queries are required to support interactive faceting. Lucene supports two distinct modes of faceting : enum and fieldcache (fc).
Enum requires a docset to be cached for each facet value. This is used when the number of unique facet values is low. The docset can be a BitDocSet (i.e., using a bitset described above) or a SortedIntDocSet based on whether the frequency of facet values across documents is high or low respectively.
The Fieldcache mode (which is usually the default one) uses an uninverted field data structure. This mode is recommended when the number of unique facet values is large.
Due to the difference in the index size between Blockjoin and SIREn (81m vs 40k docs), the memory requirement to support faceting is very high for Blockjoin for either of the two methods above.
If for instance, the enum mode is used for the ‘country’ field and if the BitDocSet is used for all the facet values, the Blockjoin would need 853.6MB (88 * 9.7) whereas SIREn would only need 0.44MB (88 * 0.005). However, if the frequency of a facet value is low, the SortedIntDocSet will be used and the total memory requirement will be reduced.
The memory utilization of Solr was measured after requesting a facet query for three fields: main-classification (14k values), country (88 values) and city (12k values). We also measure the average response time to answer the facet query (averaged over 50 runs after warmup).
[table caption=”Comparison of the memory increase and the query time for the execution of the facet query with the enum mode” style=”font-size:75%” ]
, Memory Increase, Query Time
Blockjoin, 289 MB, 668.36 ms
SIREn, 75 MB, 23.34 ms
, SIREn requires 3.85 times less memory, while being 28.6 times faster
[table caption=”Comparison of the memory increase and the query time for the execution of the facet query with the fieldcache mode” style=”font-size:75%” ]
, Memory Increase, Query Time
Blockjoin, 3077 MB, 90.96 ms
SIREn, 126 MB, 8.36 ms
, SIREn requires 24.42 times less memory, while being 10.88 times faster
Solr was configured with the default filter cache size (512), and with a maximum memory heap size of 4GB.
The fieldcache mode uses the “uninverted field” data structure per field. Underneath, it creates a single int[maxDoc()] per field. We can see from the results that it becomes problematic for Blockjoin to use this mode for facet queries due to a significant increase of memory requirement. Instead, the enum mode must be used. However, caching facet values using filters is suboptimal for the filter cache, especially for fields with medium to large number of distinct values, as it will reduce the cache hit rates. As a consequence, the filters will not be cached and have to be created at each request, significantly increasing the response time.
To summarize, the Blockjoin approach would require extraordinary amounts of memory for relatively tiny index sizes. A typical application would require multiple facet queries, nested queries and filter queries.
As can be seen from the results above, a fundamental issue with Blockjoin is memory requirements.
The test dataset indexes just 44,369 patent grants which have been issued over two months. If you want to index the 2 million or so grants over the last 10 years, the Blockjoin approach would require an index of 3.6 billion docs. Projecting based on a linear increase in memory requirements, you would need 135GB of memory to facet over the three fields used in the test compared to SIREn’s requirement of about 6GB!
Beyond the much smaller memory requirements, SIREn’s indexing model provides better performance for nested queries, greater query flexibility (e.g., support for nested arrays), true schemaless operations along with more advanced query operations such as wildcard path query and positional operators on structural elements.
The SIREn engine is now available as a plugin for Solr and Elasticsearch.
In this appendix we describe the procedure we follow to index the Json patent documents using the Blockjoin method. Lucene currently requires you to transform a nested document into a collection or “block” of flat documents.
We will therefore here detail how the Lucene’s documents are created. We are showing below a simple nested JSON document along with the collection of equivalent Blockjoin’s documents.
The transformation adopts the following rules:
Additional fields need to be added for all the Blockjoin’s documents.
Each unique path in the document is mapped to one document type. The document type is a necessary concept for Blockjoin to differentiate between nested documents. Each document type is uniquely identified with a tag. The root document is tagged with ‘root’. The tag of the other document types is generated based on the path to the nested object. This tagging is required to ensure that Blockjoin’s queries are able to distinguish between fields with the same name appearing in different document types (this will result in incorrect results if this is not done). A similar approach is used in Elasticsearch.
One issue to be considered is multi-valued attributes. Lucene will return false positives when querying over multi-valued fields where the values have more than one term. For instance, the following document will match for the query “q=field1:value1 field1:value3”.
"field1": ["value1 value2", "value3 value4"]
SIREn on the other hand does not have this issue. Every value of a multi-valued field is maintained as a separate “node” in the index. To simulate this with Blockjoin, a separate document is created for each of the values (see for example the child documents with id 1.1, 1.2 and 1.3).
"city": "Scotts Valley",
"orgname": "Fox Factory, Inc.",
"#text": "This application claims benefit of U.S. provisional patent ..",
"#text": "The invention generally relates to methods and apparatus ...",