Lucene Revolution – Solr Performance Innovations by Yonik Seeley

Yonik Seeley reports on many new enhancements for 3.1 and 4.0 with emphasis on performance improvements.

TieredMergePolicy is the new default that ignores segment order when selecting best merge and it does not over merge. Finite State Transducer (FST) based terms index is much smaller in memory and much faster in terms.

DocumentWriterPerThread (DWPT) prevents the index from being blocked while the flushing new segments.  Only the largest DWT is flushed out concurrently while others continue to index.  This benefits performance by 250%+.

SolrCloud is the integration of ZooKeeper to make managing several Solr nodes in a cluster more efficiently. -DzkRun Java param will just run an internal ZK instance.

Distributed requests can be managed by Solr w/o load balancer/VIP.  Just use pipe to define the shard lists or you can query across all shards by adding distrib=true to the queryParams.

Extended Dismax parser (superset of dismax) does term proximity boosting and so much more! See ref. here:

  • Supports full Lucene query syntax in the absence of syntax errors
  • Supports “and”/”or” to mean “AND”/”OR” in Lucene syntax mode
  • When there are syntax errors, improved smart partial escaping of special characters is done to prevent them… in this mode, fielded queries, +/-, and phrase queries are still supported.
  • Improved proximity boosting via word bi-grams… this prevents the problem of needing 100% of the words in the document to get any boost, as well as having all of the words in a single field.
  • Advanced stopword handling… stopwords are not required in the mandatory part of the query but are still used (if indexed) in the proximity boosting part. If a query consists of all stopwords (e.g. to be or not to be) then all will be required.
  • Supports the “boost” parameter.. like the dismax bf param, but multiplies the function query instead of adding it in
  • Supports pure negative nested queries… so a query like +foo (-foo) will match all documents

New faceting performance improvements included deep faceting.  Pivot faceting allows you to do nested faceting now and Range faceting allows you do what we did with dates now with numbers.

Spatial search is perfect for geo searches.  You can now sort by any arbitrary function query (like closest item).  Pseudo-fields in Solr 4 will allow you to return distance from the point of origin.

Pseudo-fields allows you to return other info along with the document stored fields (function queries, globs, aliasing, multiple fl values).  For future, inline highlighting will provide highlighting details right with the fields rather than having it returned separately…very nice!

Group by field and group by query also look very helpful.

Pseudo-Join will allow you to do a restriction on a particular source very easily and quickly with just a simple “fq” param.  Lots of good examples shown in the slides.

Auto-suggest has two implementation (TST or FST); the latter is slower to build, but faster and more compress.

You can now index with curl and a JSON array and query results can be returned as CSV.

Last but not least, the new browse GUI for SOLR looks very nice too…

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s