Lucene Revolution – Implementing Click Through Relevancy

The first page counts the most, since most people don’t go past it and definitely not beyond page 3. There are three ways to influence relevance: index-time (text analysis, morphological analysis, synonyms), query-time (boosts, dismax, synonyms, function queries), or editorial ranking.

How do you trick users into giving you feedback on the search results? They won’t really do this otherwise. The available signals are what was searched, navigation click-through, and “like” buttons. Query logs and click-through events are key to this analysis. To do it properly, you must consider each click-through in the context of what was searched before.

Some Approaches

You can label the clicked item with the query terms. This can feed collaborative filtering or a recommendation system. However, the resulting data can be sparse or noisy: users can change intent, have hidden intent, or have no intent at all.
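As a rough illustration of the labeling idea, here is a minimal Python sketch. The input shape (a list of query/docid pairs), the function name label_documents, and the whitespace tokenization are all assumptions made for the example, not something from the talk.

```python
from collections import Counter, defaultdict

def label_documents(click_events):
    """Attach query terms as labels to each clicked document.

    click_events: iterable of (query, doc_id) pairs, one per click
    (a hypothetical log shape chosen for this sketch).
    Returns {doc_id: Counter({term: click_count})}.
    """
    labels = defaultdict(Counter)
    for query, doc_id in click_events:
        # Naive whitespace tokenization; a real system would more likely
        # reuse the same analysis chain as the index.
        for term in query.lower().split():
            labels[doc_id][term] += 1
    return labels

clicks = [
    ("solr boost", "doc1"),
    ("boost query", "doc1"),
    ("lucene index", "doc2"),
]
labels = label_documents(clicks)
```

Each document ends up with a bag of terms weighted by how often a query containing that term led to a click on it, which is exactly where the sparseness and noise mentioned above show up.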

NOTE: if you don’t collect query logs, you should start today! They help with populating user profiles, query suggestions, finding the most useful information, and tracking general user interest over time.

You can do vector analysis between the query and the label. The distance between the vectors can give you a scoring attribute to use for relevancy.
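The talk notes do not say which distance measure was meant. One common choice is cosine similarity over term-count vectors, sketched below under that assumption; the function name and the sample vectors are illustrative only.

```python
import math
from collections import Counter

def cosine_similarity(a, b):
    """Cosine similarity between two sparse term-count vectors (Counters)."""
    # Only terms present in both vectors contribute to the dot product.
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector has no direction
    return dot / (norm_a * norm_b)

# Query terms vs. the click labels accumulated on one document.
query_vec = Counter("solr boost".split())
label_vec = Counter({"solr": 3, "boost": 1, "query": 1})
score = cosine_similarity(query_vec, label_vec)
```

A score near 1 means the query looks like the queries that previously led to clicks on this document; a score of 0 means no shared terms.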

Undesired effects include unbounded positive feedback (results dominated by popularity but no longer relevant), post-clicks, off-topic clicks, and noisy labels. Click data should be sub-linear and temporal in nature (so old click counts should be discounted), and finally it should be sanitized and bounded in how much effect it has on the score.
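To make the sub-linear, temporal, and bounded requirements concrete, here is one possible scoring function. The exponential half-life decay, the log damping, and the default values (30-day half-life, cap of 2.0) are assumptions picked for the sketch, not values from the talk, and sanitizing the click data is left out.

```python
import math

def click_boost(click_ages_days, half_life_days=30.0, max_boost=2.0):
    """Turn a document's click history into a bounded boost factor.

    click_ages_days: age in days of each recorded click.
    """
    # Temporal: each click loses half its weight every half_life_days.
    decayed = sum(0.5 ** (age / half_life_days) for age in click_ages_days)
    # Sub-linear: log1p dampens growth so popularity cannot run away.
    raw = 1.0 + math.log1p(decayed)
    # Bounded: clicks can never multiply the score by more than max_boost.
    return min(raw, max_boost)

no_clicks = click_boost([])
fresh = click_boost([0])
stale = click_boost([30])
viral = click_boost([0] * 1000)
```

A document with no clicks gets a neutral factor of 1.0, a fresh click counts more than a month-old one, and a thousand clicks still hit the cap.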

How do you implement this?

Doing this in Solr is not available out of the box (OOTB), but it is fairly simple to implement. You need a component for logging queries, a component for logging click-throughs (you can use a small piece of JS to report these…see Google/Yahoo!), a tool to correlate and aggregate the logs, and a tool to manage the click-through history.
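The correlate-and-aggregate step could look something like the following sketch. It assumes both logs carry a shared query_id so that a click can be tied back to the search that produced it; that identifier, the dict-based record layout, and the function name are hypothetical.

```python
from collections import Counter

def aggregate_clicks(query_log, click_log):
    """Join click events to their originating queries and count them.

    query_log: list of {"query_id": ..., "query": ...} records.
    click_log: list of {"query_id": ..., "doc_id": ...} records.
    Returns Counter keyed by (query, doc_id).
    """
    queries = {q["query_id"]: q["query"] for q in query_log}
    counts = Counter()
    for click in click_log:
        query = queries.get(click["query_id"])
        if query is None:
            continue  # click with no matching query: drop it as noise
        counts[(query, click["doc_id"])] += 1
    return counts

query_log = [
    {"query_id": "q1", "query": "solr boost"},
    {"query_id": "q2", "query": "solr boost"},
]
click_log = [
    {"query_id": "q1", "doc_id": "doc1"},
    {"query_id": "q2", "doc_id": "doc1"},
    {"query_id": "q9", "doc_id": "doc7"},  # orphaned click
]
counts = aggregate_clicks(query_log, click_log)
```

Keeping the query in the key preserves the context of what was searched, rather than reducing everything to a raw per-document popularity count.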

The next step is to take these results and use them as boost values. Use ExternalFileField to map each docid to a boost value for the field. Another approach is a full re-index that joins source docs and click data by docid, but that is not viable for large corpora. Incremental field updates will be available in the future and are probably the best fit for this use case (check back in a year). You can also use ParallelReader to keep a separate index for the click data and zip it with the main index…this approach is complicated/fragile!
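For the ExternalFileField route, Solr reads a plain text file of key=value lines from the index data directory, named external_ followed by the field name. A small sketch of producing that content is below; the function name and the three-decimal formatting are assumptions, and the schema definition for the field is not shown.

```python
def render_external_file(boosts):
    """Render {doc_id: boost} as ExternalFileField content, one key=value per line."""
    # Sorting by key is optional for Solr, but keeps the file stable
    # and easy to diff between runs.
    lines = [f"{doc_id}={boost:.3f}" for doc_id, boost in sorted(boosts.items())]
    return "\n".join(lines) + "\n"

content = render_external_file({"doc2": 1.5, "doc1": 2.0})
# The content would then be written to a file such as
# external_clickboost (hypothetical field name) in the data directory.
```

Because the file lives outside the index, it can be regenerated from the click logs on a schedule without re-indexing the documents themselves.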

The commercial solution from Lucid Imagination includes a click scoring framework that can help.

By Loutilities. Posted in Search.
