Often in marketing one hears about a “next generation” solution, for little reasons more than a better implementation if not just pure marketing itself.
In the case of SIREn (short for “Semantic Information Retrieval ENgine”), we believe this “tagline” to be appropriate as SIREn has been built from the ground up based on a completely different low level paradigm for encoding semi-structured data. In this post we will give a hint on the theory and technology underlying SIREn and why this makes it fundamentally more scalable and flexible than other search systems available today.
SIREn is a “semi-structured” document (a.k.a. JSON-like data) search system. We characterize semi-structured documents as data that
- can be arbitrarily shaped with no a-priori schema: each document can have its own structure, ranging from loosely to strictly defined. The data structure does not follow strict rules as in a database.
- can have a complex hierarchical structure with multiple levels of nesting (think JSON or XML);
- can evolve over time. The schema is not fixed and may change as the information grows.
- involve weakly typed attributes: the type of an attribute is not fixed and can change across documents. A same attribute can have more than one type, e.g., textual, numerical, or even more complex types, such as arrays or objects.
A semi-structured search system goal is to provide highly efficient ad-hoc querying with hybrid structured and full-text search capabilities on the collection of documents.
Existing enterprise search systems mostly all cater for at least some of the above functionalities, but SIREn provides the best coverage and scalability thanks to a different low level approach. Let’s try to understand why.
The Difference Under the Hood
Based on the Ph.D. work of Renaud Delbru, SIREn is to the best of our knowledge the only enterprise grade search system which employs a “node-labelled tree model” to index semi-structured documents.
This model encodes the hierarchical structure of semi-structured documents directly within the inverted index, offering greater flexibility while availing entirely of the full performance and scalability of a mature search system like Solr. Other search systems do not natively incorporate such a hierarchical indexing model: they are built around the traditional “flat record” model, in which each document is represented as a flat set of attributes. In practice they emulate the hierarchical model by “denormalising” the hierarchical content into a set of flat records and rely on a “block-based” join to reassemble those records into a hierarchical structure at query time. In this article we are referring to this approach as the “BlockJoin” approach.
While all this happens “under the hood” the scalability and performance limitation of the BlockJoin approach come from the fact that this approach basically multiplies by N folds the number of records in the index where N is the average number of nested elements per document in the collection of documents. An index so constructed can easily be 10x to 100x bigger than normal, which in turn affects dramatically memory usage, size on disk, robustness and management of indexes, etc.
In search systems, an efficient memory usage is very important. Systems such as Lucene/Solr rely heavily on bitsets, a compact and very efficient in-memory data structure for set operations, for caching and manipulating sets of documents. Bitsets are used in many different places and for many different purposes: indexing, searching, caching, etc. For example, Range, Join, BlockJoin and Spatial queries are instantiating bitsets during query processing. Query filters are computed and cached as bitsets. Facets are producing a large number of bitsets for caching the facet-value pairs. Grouping is another feature that uses bitsets internally. The issue is that the size of a bitset is bound by the current number of records in the index. Therefore, the BlockJoin approach, by multiplying the number of records in the index, has a direct impact on the size of the bitsets and therefore on the overall memory usage of the system. To give you a better idea of the increase of memory usage, the chart below compares the RAM usage of one single bitset between BlockJoin and SIREn. We can see that as the number of nested elements increases, the RAM usage of a single bitset becomes quickly problematic with the BlockJoin approach. On the other hand, the RAM usage in SIREn is independent of the number of nested elements and stays relatively small.
Unlike these systems, SIREn creates one single record per semi-structured document and encodes natively the tree structure into the inverted index – an approach called “tree encoding”. The tree structure can then be queried using appropriate query operators.
The differences between the two approaches are not just in term of efficiency: functionality is also, if not even more, important.
Both approaches will cater for basic operators for querying nested elements, things however quickly differ when more complex documents and queries are concerned. We will review here the principal functionality that only SIREn’s “tree encoding” approach can provide.
SIREn enables a search system to become truly schemaless. Developers do not have to invest significant modelling effort upfront in order to have the search system up and running. SIREn is able to index and query arbitrary JSON documents without upfront schema configuration or data transformation. In addition, the dynamic data typing capability of SIREn allows the attribute types to change across documents. A document can have a numeric date attribute while another one can have a textual data attribute or even a nested object attribute without interfering with the indexing or querying process.
Extended Data and Query Modelling
SIREn’s tree encoding provide enough flexibility to fully support the JSON data model. This includes multi-valued attributes (or arrays), nested arrays and complex nesting of arrays and objects. In traditional search system, attributes with more than textual values are generally flattened into a single bag of words, which in turn affects search precision. For example, if an attribute “authors” would contain two values [“Giovanni Tummarello”, “Renaud Delbru”], the search system would consider it as a match for the query authors : Giovanni AND Delbru as it has lost the separation between values during the flattening process.
SIREn also enables full-text search on attribute names (attribute names are first class citizens), proximity search on array and nested elements, as well as wildcard search for nested elements, i.e., you do not have to know the exact path of a nested object to be able to query it.
SIREn’s scalability and small memory footprint for large nested trees is the key to “relational faceted browsing”. As explained in our Lucene Revolution talk, relational faceting is achieved by denormalising relational data into a multitude of tree-structured data views. Each of this view can possibly become very large, with hundreds or even thousands of nested records. Such a scale can become quickly problematic with the BlockJoin approach as discussed previously.
Ranking, the Proper Way, for Rich Data
When it comes to ranking, other systems forcefully flatten the data, e.g., aggregating a multi-valued attribute into a single bad of words as discussed previously. What should instead happen is that the structure should be properly taken into consideration, something that you would not do in simple textual document search.
SIREn can support advanced ranking capabilities which uniquely consider semi-structured data and multi-valued attributes. The difference does matter: with an early version of this method we were able to win the Yahoo Semantic Search Challenge in 2011.
Most enterprise search systems offer some level of semi-structured search capabilities, but they were not originally designed for such a task. In all other systems but SIREn, these features have been added incrementally, resulting in a sub-optimal solution that requires an N fold increase in records in the index, a high memory usage and with a more limited search and ranking capabilities.
By using a methodology designed from the ground up for semi-structured data, SIREn gains features (currently and in our roadmap) which, intrinsically, cannot be reproduced by other existing systems.
As a conclusion, we believe SIREn to be the best choice in scenarios where the fastest and most advanced semi-structured document search capabilities are wanted.
Try it yourself! SIREn is available for download.