Lucene Revolution – Solr Performance Innovations by Yonik Seeley

Yonik Seeley reports on many new enhancements for 3.1 and 4.0 with emphasis on performance improvements.

TieredMergePolicy is the new default merge policy; it ignores segment order when selecting the best merge and does not over-merge. The Finite State Transducer (FST) based terms index is much smaller in memory and much faster for term lookups.

DocumentsWriterPerThread (DWPT) prevents indexing from being blocked while new segments are flushed.  Only the largest DWPT is flushed, concurrently, while the others continue to index.  This improves indexing performance by 250%+.

SolrCloud integrates ZooKeeper to make managing several Solr nodes in a cluster more efficient. Passing the -DzkRun Java param simply runs an internal ZK instance.

Distributed requests can be managed by Solr without a load balancer/VIP.  Use the pipe character (|) in the shard list to name alternative servers for the same shard, or query across all shards by adding distrib=true to the query params.
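
As a rough sketch of what such a request might look like, the snippet below builds a distributed query URL. The host names, ports, and the build_distributed_query helper are made up for illustration, and it assumes the shards parameter separates shards with commas and alternative servers for one shard with pipes.

```python
from urllib.parse import urlencode


def build_distributed_query(base_url, q, shards):
    """Build a select URL.

    `shards` is a list of shards; each shard is a list of server
    addresses that hold the same data (hypothetical addresses here).
    """
    # Pipes join equivalent servers, commas join distinct shards.
    shards_param = ",".join("|".join(servers) for servers in shards)
    params = {"q": q, "shards": shards_param}
    # Keep the separators readable instead of percent-encoding them.
    return base_url + "/select?" + urlencode(params, safe=",|/:")


url = build_distributed_query(
    "http://localhost:8983/solr",
    "solr performance",
    [["localhost:8983/solr", "localhost:8900/solr"], ["localhost:7574/solr"]],
)
print(url)
```

Here the first shard has two interchangeable servers and the second shard has one, so Solr can pick either server for shard one without an external load balancer.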

Extended Dismax parser (superset of dismax) does term proximity boosting and so much more! See ref. here:

  • Supports full Lucene query syntax in the absence of syntax errors
  • Supports “and”/“or” to mean “AND”/“OR” in Lucene syntax mode
  • When there are syntax errors, improved smart partial escaping of special characters is done to prevent them… in this mode, fielded queries, +/-, and phrase queries are still supported.
  • Improved proximity boosting via word bi-grams… this prevents the problem of needing 100% of the words in the document to get any boost, as well as having all of the words in a single field.
  • Advanced stopword handling… stopwords are not required in the mandatory part of the query but are still used (if indexed) in the proximity boosting part. If a query consists of all stopwords (e.g. to be or not to be) then all will be required.
  • Supports the “boost” parameter… like the dismax bf param, but multiplies the function query instead of adding it in
  • Supports pure negative nested queries… so a query like +foo (-foo) will match all documents
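
To make the word bi-gram idea concrete, here is a small sketch. It is only an illustration of the concept, not Solr's actual code: word_bigrams and edismax_params are hypothetical helpers, and the field names title and body are assumptions. The only Solr-specific parts are the defType=edismax and qf request parameters.

```python
from urllib.parse import urlencode


def word_bigrams(query):
    """Return each pair of adjacent words as a quoted phrase."""
    words = query.split()
    return ['"%s %s"' % (a, b) for a, b in zip(words, words[1:])]


def edismax_params(query, fields):
    """Request parameters that select the extended dismax parser."""
    return {"defType": "edismax", "q": query, "qf": " ".join(fields)}


phrases = word_bigrams("new york city hotels")
params = edismax_params("new york city hotels", ["title", "body"])
print(phrases)
print(urlencode(params))
```

A document that contains only "new york" still matches one of the three phrases, so it can earn part of the proximity boost without containing every word of the query.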

New faceting performance improvements include deep faceting.  Pivot faceting now allows you to do nested faceting, and range faceting allows you to do with numbers what we previously did with dates.
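
A minimal sketch of the request parameters for pivot and range faceting might look like this. The field names (cat, inStock, price) and the range bounds are assumptions chosen for illustration; adjust them to your own schema.

```python
from urllib.parse import urlencode

# Hypothetical fields: cat and inStock for the pivot, price for the range.
facet_params = [
    ("q", "*:*"),
    ("facet", "true"),
    ("facet.pivot", "cat,inStock"),   # nested facet: inStock counts within each cat
    ("facet.range", "price"),         # numeric range facet
    ("facet.range.start", "0"),
    ("facet.range.end", "1000"),
    ("facet.range.gap", "100"),       # one bucket per 100 units of price
]

facet_query_string = urlencode(facet_params, safe="*:,")
print(facet_query_string)
```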

Spatial search is perfect for geo searches.  You can now sort by any arbitrary function query (like closest item).  Pseudo-fields in Solr 4 will allow you to return distance from the point of origin.

Pseudo-fields allow you to return other info along with the document's stored fields (function queries, globs, aliasing, multiple fl values).  In the future, inline highlighting will provide highlighting details right with the fields rather than having them returned separately…very nice!
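
Putting the spatial and pseudo-field pieces together, a request could be sketched as below. This assumes Solr 4 style fl aliasing, a location field named store, and an arbitrary point in San Francisco; all three are assumptions for illustration rather than anything shown in the talk.

```python
from urllib.parse import urlencode

spatial_params = [
    ("q", "*:*"),
    ("sfield", "store"),               # hypothetical location field
    ("pt", "37.7752,-122.4232"),       # hypothetical point of origin (lat,lon)
    ("sort", "geodist() asc"),         # closest items first
    ("fl", "id,name,dist:geodist()"),  # pseudo-field: distance aliased as "dist"
]

spatial_query_string = urlencode(spatial_params, safe="*:,()")
print(spatial_query_string)
```

The same geodist() function drives both the sort and the returned dist value, so each document comes back with its distance from the point of origin.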

Group by field and group by query also look very helpful.

Pseudo-join lets you restrict results to a particular source easily and quickly with just a simple “fq” param.  Lots of good examples are shown in the slides.
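
For a feel of what that fq might look like, here is a sketch using the join query parser's local-params syntax. The join_filter helper and the field names (id, source_id, source_type) are hypothetical; the slides have the real examples.

```python
from urllib.parse import urlencode, parse_qs


def join_filter(from_field, to_field, query):
    """Build a join filter: match `query`, collect `from_field` values,
    and keep documents whose `to_field` holds one of those values."""
    return "{!join from=%s to=%s}%s" % (from_field, to_field, query)


# Keep only documents whose source_id points at a source of type "blog".
fq = join_filter("id", "source_id", "source_type:blog")
join_query_string = urlencode({"q": "title:solr", "fq": fq})
print(fq)
print(join_query_string)
```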

Auto-suggest has two implementations (TST or FST); the latter is slower to build, but faster and more compact.

You can now index with curl and a JSON array, and query results can be returned as CSV.
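
The talk used curl; the sketch below prepares the equivalent requests in Python without sending them. The host, the document fields, and the commit=true flag are assumptions for illustration, and it assumes the JSON update handler is mapped at /update/json.

```python
import json
from urllib.request import Request

# Hypothetical documents to index, sent as a plain JSON array.
docs = [
    {"id": "1", "title": "Solr performance"},
    {"id": "2", "title": "Lucene 4.0"},
]

update_request = Request(
    "http://localhost:8983/solr/update/json?commit=true",
    data=json.dumps(docs).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)

# wt=csv asks for the query results in CSV form.
csv_query_url = "http://localhost:8983/solr/select?q=*:*&wt=csv"

# With a running Solr, urllib.request.urlopen(update_request) would post the docs.
print(update_request.get_method(), update_request.full_url)
print(csv_query_url)
```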

Last but not least, the new browse GUI for Solr looks very nice too…

Lucene Revolution Keynote – Marc Krellenstein

This week I’m back in SF and this time I’m attending the Lucene Revolution conference.  The conference kicked off with Marc Krellenstein emphatically saying, “It is easier to search than to browse.”  Ain’t that the truth.

Over the next few days I will blog my notes from the sessions that I attend at the conference.  I hope they provide some insight for others and reminders for me!

Keynote Notes

Google was first to use spell checking against terms in the docs from the index rather than just a big dictionary.

Recall is the percent of the relevant docs that are returned (50 relevant docs available, only 25 of them returned, is 50% recall).

Precision is the percent of the returned docs that are relevant (100 returned, 25 relevant, is 25% precision).

100% recall is easy, but we are really striving for 100% precision too, which is a lot harder to do.
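
The two definitions can be worked through in a few lines, using the numbers from the notes. The last example, a 1,000-doc index that returns everything, is my own made-up illustration of why 100% recall alone is easy.

```python
def recall(relevant_returned, relevant_available):
    """Fraction of all relevant docs that the search returned."""
    return relevant_returned / relevant_available


def precision(relevant_returned, total_returned):
    """Fraction of the returned docs that are relevant."""
    return relevant_returned / total_returned


print(recall(25, 50))      # 0.5  -> 50% recall
print(precision(25, 100))  # 0.25 -> 25% precision

# Hypothetical: return all 1,000 docs in an index holding 50 relevant ones.
print(recall(50, 50))       # 1.0  -> perfect recall
print(precision(50, 1000))  # 0.05 -> but only 5% precision
```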

Getting good recall

  • Use spell checking, synonyms to match users’ vocab
  • NLP
  • Normalize data
  • collect, index and search all data

Getting good precision

  • queries are too short (have users rank terms and use machine learning)
  • implicit relevance feedback is available but doubles search execution and no one really uses it although it should be considered
  • Watson and Google Translate don’t use NLP but instead rely on statistical analysis of huge data sets

Some history

  • Lucene was created by Doug Cutting, with an Apache release in 2001 and wide acceptance by 2005
  • Solr was built in 2005 by Yonik Seeley for CNET; the Apache release came in 2006 and provides Lucene capabilities over HTTP with faceting

Strengths:

  • Best segmented index (like Google)
  • Open Source
  • Great Community

The basic premise is to use Lucene/Solr since it is the best and it’s free.  It continues to innovate and has strong community support.