The standard solution, e.g., as implemented behind the scenes by Elasticsearch, is to rely on Lucene’s Blockjoin capability. Blockjoin, however, works by creating one index entry per nested “object”. As a result, you need orders of magnitude more memory to handle nested documents than you do with SIREn.
To demonstrate this, we took a collection of 44,000+ US patent grant documents and indexed them using Blockjoin and SIREn. Given their rich structure with lots of nested objects (an average of 1833 nested objects per doc), these documents were ideal for this test.
We then compared the performance and memory requirements for a collection of nested queries involving conjunction/disjunction at the term and at the object level. What we discovered was that SIREn not only performed several times faster for most of the queries, it was also fundamentally more scalable, with memory requirements that were orders of magnitude lower!
You can read more about these tests in a detailed whitepaper that we have published.
Why is there such a big difference in scalability? The Blockjoin approach requires that every nested object in a document be indexed as a separate document. So, if you are looking to index 1 million documents each with 1000 nested objects, you will end up with an index of 1 billion documents.
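To make the multiplication concrete, here is a minimal Python sketch of the flattening step. It is an illustration only, not Lucene's actual API: the `flatten_to_block` and `index_size` helpers and the sample patent record are hypothetical, and the sketch assumes that every nested object at any depth becomes one index entry, with one further entry written for the parent document.

```python
def flatten_to_block(doc):
    """Return the list of index entries produced for one nested document.

    Assumption: every nested object (dict) at any depth becomes its own
    entry, and the top-level document is written last as the parent.
    """
    entries = []

    def visit(value, path):
        if isinstance(value, dict):
            for key, child in value.items():
                visit(child, path + [key])
            if path:  # a nested object, not the root document
                entries.append({"type": "child", "path": ".".join(path)})
        elif isinstance(value, list):
            for item in value:
                visit(item, path)

    visit(doc, [])
    entries.append({"type": "parent"})
    return entries


def index_size(num_docs, nested_per_doc):
    """Index entries when each document adds its nested objects plus itself."""
    return num_docs * (nested_per_doc + 1)


# Hypothetical patent-like record with three nested objects.
patent = {
    "title": "Widget",
    "inventors": [{"name": "A"}, {"name": "B"}],
    "classification": {"code": "G06F"},
}

block = flatten_to_block(patent)
print(len(block))                    # three child entries plus one parent
print(index_size(1_000_000, 1000))   # roughly 1 billion entries
```

Counting the parent entries as well, the 1 million document example comes to slightly more than 1 billion index entries, so the round figure is a fair approximation.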
Now, the amount of memory required to perform various types of queries such as nested queries, filter queries and facet queries is linearly related to the number of documents in the index. A typical app requires hundreds of these queries to be cached for optimal performance.
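A rough back-of-the-envelope sketch shows why this linear relationship matters. It assumes each cached query result is held as a dense bitset with one bit per index document, which is a common representation for cached filters but is an assumption here rather than a measured detail. The `bitset_cache_bytes` helper and the figure of 200 cached queries are purely illustrative.

```python
def bitset_cache_bytes(num_index_docs, num_cached_queries):
    """Memory for cached query results held as one bit per index document."""
    bytes_per_bitset = (num_index_docs + 7) // 8  # round up to whole bytes
    return bytes_per_bitset * num_cached_queries


MIB = 1024 * 1024

# 1 million documents flattened into roughly 1 billion index entries.
flattened = bitset_cache_bytes(1_000_000_000, 200)
# The same collection with one index entry per top-level document.
unflattened = bitset_cache_bytes(1_000_000, 200)

print(f"flattened index:   {flattened / MIB:,.0f} MiB")
print(f"unflattened index: {unflattened / MIB:,.0f} MiB")
print(f"ratio: {flattened // unflattened}x")
```

Under these assumptions the cache grows by the same factor as the index entry count, which is why the number of nested objects per document dominates the memory budget.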
Consider just facets. For the test dataset of 44k docs, Blockjoin required 3,077 MB to create facets over the three chosen fields and had a query time of 90.96 ms. SIREn, on the other hand, required just 126 MB with a query time of 8.36 ms. Blockjoin required about 24 times as much memory (2,442% of SIREn's footprint) while taking 10.88 times as long to answer!
And this is just for 44,369 patent grants issued over two months. If you wanted to index the 2 million or so grants issued over the last 10 years, you would need about 135 GB of memory to facet over these three fields with Blockjoin, compared to SIREn's requirement of about 6 GB!
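The projection can be reproduced with a few lines of Python. This sketch assumes that memory scales linearly with the number of documents, as described above, and that 1 GB is 1024 MB. The measured values are the ones quoted in this post; the `extrapolate_gb` helper is just for illustration.

```python
TEST_DOCS = 44_369        # patent grants in the test dataset
TARGET_DOCS = 2_000_000   # approximate grants over the last 10 years
BLOCKJOIN_MB = 3_077      # measured facet memory, Blockjoin
SIREN_MB = 126            # measured facet memory, SIREn
BLOCKJOIN_MS = 90.96      # measured facet query time, Blockjoin
SIREN_MS = 8.36           # measured facet query time, SIREn


def extrapolate_gb(measured_mb, measured_docs, target_docs):
    """Scale a measured footprint linearly with document count, in GB."""
    return measured_mb * (target_docs / measured_docs) / 1024


blockjoin_gb = extrapolate_gb(BLOCKJOIN_MB, TEST_DOCS, TARGET_DOCS)
siren_gb = extrapolate_gb(SIREN_MB, TEST_DOCS, TARGET_DOCS)
memory_ratio = BLOCKJOIN_MB / SIREN_MB
time_ratio = BLOCKJOIN_MS / SIREN_MS

print(f"Blockjoin at 2M docs: {blockjoin_gb:.1f} GB")
print(f"SIREn at 2M docs:     {siren_gb:.1f} GB")
print(f"memory ratio: {memory_ratio:.2f}x, time ratio: {time_ratio:.2f}x")
```

Because both projections use the same scaling factor, the memory ratio between the two approaches stays the same at any collection size under this linear model.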
Clearly, SIREn represents the only realistic solution for indexing large volumes of complex nested documents.