Comparing SIREn 1.2 and Lucene’s Blockjoin Performance: a USPTO Patent Search Scenario

This is a detailed description of the benchmark tests that we conducted to compare SIREn’s performance relative to BlockJoin with respect to search on nested documents. The BlockJoin approach is used in both Solr and ElasticSearch to support nested document search. If you are looking for a short summary, you could read this blog post.

There are a lot of use cases that require indexing of nested documents. One only needs to scan the Lucene/Solr user forums to discover examples of this need.

In this post we’ll show how SIREn is by far the only approach that can scale to real world scenarios while delivering the advanced search capabilities on structured data allow.

Introducing SIREn vs Blockjoin

Both SIREn and Lucene’s Blockjoin feature offer a solution for working with nested documents. However, SIREn is based on a conceptual model that is fundamentally different from the Blockjoin approach. As such, for anyone deciding between these two solutions, it is important to understand their relative scalability and performance metrics.

In this document, we describe the results of some benchmark tests we conducted on both these approaches. We show how SIREn v1.2 offers enormously improved scalability and enhanced performance.

While the focus of this document is Lucene/Solr, the results are identically applicable to ElasticSearch which, under the hood, uses Lucene’s Blockjoin to support nested documents.

The Test Dataset: US Patent Grant XML Corpus

One of the core strengths of Lucene/SIREn is their ability to query over structured and unstructured content. As such, we wanted to pick a dataset that provides a combination of both.

The US Patent and Trademark Ofiice (USPTO) dataset is a good example of complex XML data and as such it was used for the test. Each patent document contains structured data about inventors, classifications, etc., as well as textual content such as the abstract, the claims being made, and even down to the single paragraph of text. All of this data is described in a XML document with multiple levels of nesting.

Preparing the Data for Indexing

The XML data is transformed into the equivalent JSON for this test. The transformation is based on the json-lib library with some extensions to remove bold, italic and other tags in text values. You can find an example of such a JSON document in the Appendix.

[table caption=”Statistics about the USPTO dataset” style=”font-size:75%” ]
Number of Documents, Uncompressed JSON, Compressed JSON
44369, 5 GB, 715 MB
[/table]

Indexing with SIREn

SIREn natively supports JSON structured data. It can index the JSON documents without any additional data preparation. Each JSON document is mapped to a Lucene’s document with the following fields:

  • id : indexed and stored field to identify the document
  • json : indexed, non-stored field configured to use SIREn’s hybrid indexing model

Indexing with Blockjoin

The quickest way to test the Blockjoin approach would be to use Elasticsearch, as it too does not require transformation of the data – the only gotcha being that one has to manually modify the schema mapping to ensure that the engine does not “flatten” the nested documents and uses Blockjoin. In contrast, SIREn is truly “schemaless” and does not require any configuration or modification.

For these tests however we will use Lucene directly for greater control and precision in the results. To do this, we convert JSON to Blockjoin using the transformation described in the Appendix.

Test Environment

The configuration of the machine used for the running the benchmark test is described below:

  • Processor: Intel(R) Core(TM) i5 CPU M 580 @ 2.67GHz
  • Memory: 8GB DDR3 Synchronous 1334 MHz
  • Disk: Intel SSDSA1M160 – Ext4
  • OS: Ubuntu 14.04 64-Bit
  • Java: 1.7.0_45 64-Bit

SIREn v1.2 was used for this test. Lucene/Solr v4.7 was used for testing the Blockjoin’s functionality.

Indexing Performance

Index Size

For the creation of the index, we have disabled auto flush and the compound format, and we have increased the ram buffer size to 256 MB. We finally perform a merge to create one single segment.

For the test dataset, the Blockjoin’s approach leads to indexes that are almost twice as large as that for SIREn. This is due to the artificial increase in the number of documents for Blockjoin.

There are an average of 1833 nested objects per patent document in the test dataset. Given that the dataset covers just the first two month of the year, we can expect 6 times more documents for indexing the dataset for a full year. This would result in 486 million documents for Blockjoin.

[table caption=”Comparison of index size in gigabytes between SIREn and Blockjoin” style=”font-size:75%” ]
, Number of Documents in Index, Index Size
SIREn, 44369, 1.317
Blockjoin, 81350921, 2.577
[/table]

[d3-source canvas=”wpd3-8687-2″]

Index Merging Time

During the creation of the index, we perform a merge to create one single segment, and we record the time to perform the merge. Reported time has been averaged over 5 runs.

[table caption=”Comparison of index merge time in milliseconds between SIREn and Blockjoin” style=”font-size:75%” ]
, Merge Time
SIREn, 91312
Blockjoin, 339002
[/table]

[d3-source canvas=”wpd3-8687-1″]

We can see that SIREn is 3 to 4 times faster than the Blockjoin’s approach.

Query Performance

Query Patterns

The query patterns cover common operators, i.e., conjunction, disjunction and phrase, on the text but also on the structure of the document. The query patterns are restricted to match in a paragraph. A paragraph is a nested object of a description section, the description section being itself a nested object of the parent document. This ‘path’ of nested objects is quite common in the USPTO dataset (there are more than 3.7 millions paragraph objects). Querying a paragraph requires two joins: the first one being the join with the description objects, and the second one being the join between the description objects and the parent documents.

lookuplookup of a high frequency term in a paragraph
conjunctionconjunction (AND) of two high frequency terms in a paragraph
disjunctiondisjunction (OR) of two high frequency terms in a paragraph
phrasephrase of two terms in a paragraph
conjunction-paragraphconjunction of two paragraphs within the same description object
disjunction-paragraphdisjunction of two paragraphs within the same description object

The paths for the lookup, conjunction, disjunction and phrase queries are shown below.

The paths for conjunction and disjunction paragraph queries are show below.

Experimental Design

The queries are generated based on the frequency of the words composing them. The word frequency determines how many entities match a given keyword. The words are grouped into three frequency ranges: high, medium and low. We first order the words by their descending frequency, and then take the first k words whose cumulative frequency is 40% of all word occurrences as high range. The medium range accounts for the next 30%, and the low range is composed of all the remaining words. For the phrase queries, we follow a similar technique. We first extract all the 2-gram from the data collection. We then compute their frequency and sort them by descending frequency. We finally create the three ranges as explained above.

For each type of query, we (1) flush the OS cache; (2) initialise a new JVM; (3) generate a set of 50 random queries, (4) warm the JVM by executing a certain number of times the set of queries, and (5) perform 60 measurements. Each measurement is made by performing n times the query execution of the 50 random queries, with n chosen so that the runtime is long enough (10 seconds) to minimise the time precision error of the OS and machine. The measurement time is the time used by the current thread to process the 50 random queries. We report the query rate (query per second) based on the mean of the measurements.

With respect to Blockjoin’s queries, the parent document filters are created prior to the measurements, and are reused across measurements.

SIREn and Blockjoin are configured to compute the score of a document based on the average score of its children. This means that we iterate over all the child node matches.

Query Throughput

The results for the query performance is shown below. The query rates for SIREn and Blockjoin is shown along with the percentage boost offered by the former relative to the later.

[table caption=”Query rates (queries per second) per query type for SIREn and Blockjoin” style=”font-size:75%” ]
Query, Blockjoin Query Rate, SIREn Query Rate, Rate Ratio %
1. Lookup,52.89,140.12,265%
2. Conjunction,307.66,192.42,63%
3. Disjunction,28.17,74.04,263%
4. Phrase,170.29,47.68,28%
5. Conjunction Paragraph,48.30,163.92,339%
6. Disjunction Paragraph,26.39,74.41,282%
[/table]

[d3-source canvas=”wpd3-8687-0″]

For query 1,3,5,6 SIREn performs an average of 3 times faster than Blockjoin. This is due to Blockjoin suffering when the child query is not highly selective as intersecting the tag filter with the child query results is costly. The necessity of Blockjoin to use a tag filter for each parent and child query incurs a performance penalty.

SIREn on the other hand suffers penalty for query 2 and 4, when the child query is highly sensitive. A specific solution for this is on its way, targeting SIREn release 1.4. We will update this benchmark as soon as available.

Query Memory Requirements

While query throughputs are comparable for SIREn and Blockjoin, the issues with the later surface when we consider memory requirements.

To speed up the processing of certain types of queries (Blockjoin queries, facet queries, filter queries), Solr uses various caching strategies, including bit arrays (or bitsets). The memory requirement for caching a query is proportional to the number of documents in the index. For example for an index containing 10m docs, a bitset would need 10m bits of memory (about 1.2MB).

Given that the Blockjoin approach artificially inflates the number of documents in the index, queries that require bitsets significantly impact memory requirements. For the test dataset, a bitset in SIREn would require 0.005 MB (44,369 / 8 / 1024 / 1024) whereas Blockjoin would require 9.7 MB (81,350,921 / 8 / 1024 / 1024)!

SIREn on the other hand has stable memory usage which is not dependent with the number of nested objects. This leaves more memory for caching additional information, e.g., query filters, facets, and for the operating system to cache index files, which is very important for good performance.

We can look at the impact this has on the various types of queries.

Evaluating Nested queries

The Blockjoin approach requires a bitset to be cached for each parent child relationship that is part of the query. The test queries for instance require two bitsets.

Given the large number of nested object types (as explained in the Appendix), thousands of bitsets would be required for querying over all possible paths, each of 10mb size. SIREn on the other hand needs 0 bitsets for nested querying!

The maximum memory utilization during execution of the benchmark queries was measured. With respect to Blockjoin, this measures the impact of caching the two parent filters that are necessary for the execution of the nested queries.

[table caption=”Comparison of memory usage during the execution of the query benchmark” style=”font-size:75%” ]
, Memory Usage, Memory Usage per Filter
Blockjoin, 20 MB, 9.69 MB
SIREn, 3 MB, 0 MB
, , SIREn does not need memory per filter
[/table]

Evaluating Filter queries

Filter queries are used for caching query results and reusing them in subsequent queries. They are extensively used for supporting high volumes. Any filter query over the data would require the creation of a docset (either a BitDocSet or a SortedIntDocSet). This would be in addition to the parent filter bitsets required by Blockjoin for nested queries, as described above.

The memory utilization of Solr was measured (via a profiler) after warming up Solr with 200 random queries (non Blockjoin queries) with a random filter picked from a pool of 50 different filters on object types.

[table caption=”Comparison of memory increase after the execution of the queries with random filters” style=”font-size:75%” ]
, Memory Increase
Blockjoin, 194 MB
SIREn, 40 MB
, SIREn requires 4.85 times less memory
[/table]

Evaluating Facet queries

Facet queries are required to support interactive faceting. Lucene supports two distinct modes of faceting : enum and fieldcache (fc).

Enum requires a docset to be cached for each facet value. This is used when the number of unique facet values is low. The docset can be a BitDocSet (i.e., using a bitset described above) or a SortedIntDocSet based on whether the frequency of facet values across documents is high or low respectively.

The Fieldcache mode (which is usually the default one) uses an uninverted field data structure. This mode is recommended when the number of unique facet values is large.

Due to the difference in the index size between Blockjoin and SIREn (81m vs 40k docs), the memory requirement to support faceting is very high for Blockjoin for either of the two methods above.

If for instance, the enum mode is used for the ‘country’ field and if the BitDocSet is used for all the facet values, the Blockjoin would need 853.6MB (88 * 9.7) whereas SIREn would only need 0.44MB (88 * 0.005). However, if the frequency of a facet value is low, the SortedIntDocSet will be used and the total memory requirement will be reduced.

The memory utilization of Solr was measured after requesting a facet query for three fields: main-classification (14k values), country (88 values) and city (12k values). We also measure the average response time to answer the facet query (averaged over 50 runs after warmup).

[table caption=”Comparison of the memory increase and the query time for the execution of the facet query with the enum mode” style=”font-size:75%” ]
, Memory Increase, Query Time
Blockjoin, 289 MB, 668.36 ms
SIREn, 75 MB, 23.34 ms
, SIREn requires 3.85 times less memory, while being 28.6 times faster
[/table]

[table caption=”Comparison of the memory increase and the query time for the execution of the facet query with the fieldcache mode” style=”font-size:75%” ]
, Memory Increase, Query Time
Blockjoin, 3077 MB, 90.96 ms
SIREn, 126 MB, 8.36 ms
, SIREn requires 24.42 times less memory, while being 10.88 times faster
[/table]

Solr was configured with the default filter cache size (512), and with a maximum memory heap size of 4GB.

The fieldcache mode uses the “uninverted field” data structure per field. Underneath, it creates a single int[maxDoc()] per field. We can see from the results that it becomes problematic for Blockjoin to use this mode for facet queries due to a significant increase of memory requirement. Instead, the enum mode must be used. However, caching facet values using filters is suboptimal for the filter cache, especially for fields with medium to large number of distinct values, as it will reduce the cache hit rates. As a consequence, the filters will not be cached and have to be created at each request, significantly increasing the response time.

To summarize, the Blockjoin approach would require extraordinary amounts of memory for relatively tiny index sizes. A typical application would require multiple facet queries, nested queries and filter queries.

Conclusions

As can be seen from the results above, a fundamental issue with Blockjoin is memory requirements.

The test dataset indexes just 44,369 patent grants which have been issued over two months. If you want to index the 2 million or so grants over the last 10 years, the Blockjoin approach would require an index of 3.6 billion docs. Projecting based on a linear increase in memory requirements, you would need 135GB of memory to facet over the three fields used in the test compared to SIREn’s requirement of about 6GB!

Beyond the much smaller memory requirements, SIREn’s indexing model provides better performance for nested queries, greater query flexibility (e.g., support for nested arrays), true schemaless operations along with more advanced query operations such as wildcard path query and positional operators on structural elements.

The SIREn engine is now available as a plugin for Solr and Elasticsearch.

Appendix – JSON to Blockjoin Indexing Procedure

In this appendix we describe the procedure we follow to index the Json patent documents using the Blockjoin method. Lucene currently requires you to transform a nested document into a collection or “block” of flat documents.

We will therefore here detail how the Lucene’s documents are created. We are showing below a simple nested JSON document along with the collection of equivalent Blockjoin’s documents.

JSONBlockjoin

The transformation adopts the following rules:

  • one JSON object is mapped to one Blockjoin document;
  • nested objects are mapped to child documents (see field2 as example)
  • textual attributes of a JSON object are mapped to a Lucene’s field
  • nested arrays are flattened (see field1 as example)

Additional fields need to be added for all the Blockjoin’s documents.

  • _root_ : indexed, non-stored field to record the block id (for deletion operations)
  • id : indexed and stored field to identify the document
  • tag : indexed, non-stored field to distinguish between the different types of nested objects in the document.

Each unique path in the document is mapped to one document type. The document type is a necessary concept for Blockjoin to differentiate between nested documents. Each document type is uniquely identified with a tag. The root document is tagged with ‘root’. The tag of the other document types is generated based on the path to the nested object. This tagging is required to ensure that Blockjoin’s queries are able to distinguish between fields with the same name appearing in different document types (this will result in incorrect results if this is not done). A similar approach is used in Elasticsearch.

One issue to be considered is multi-valued attributes. Lucene will return false positives when querying over multi-valued fields where the values have more than one term. For instance, the following document will match for the query “q=field1:value1 field1:value3”.

SIREn on the other hand does not have this issue. Every value of a multi-valued field is maintained as a separate “node” in the index. To simulate this with Blockjoin, a separate document is created for each of the values (see for example the child documents with id 1.1, 1.2 and 1.3).

A truncated version of the transformed patent document is shown below. A sample original patent document and its JSON conversions are given here and here.

No Thanks / Already Signed Up