This week I’m back in SF, and this time I’m attending the Lucene Revolution conference. The conference kicked off with Marc Krellenstein emphatically saying, “It is easier to search than to browse.” Ain’t that the truth.
Over the next few days I’ll blog my notes from the sessions I attend at the conference. I hope they provide some insight for others and reminders for me!
Google was the first to run spell checking against terms in the docs from the index rather than just a big dictionary.
Recall is the percent of relevant docs returned (if 50 are available and only 25 are returned, recall is 50%)
Precision is the percent of returned docs that are relevant (100 returned, 25 relevant = 25% precision)
100% recall is easy, but we’re really striving for 100% precision too, which is a lot harder to do.
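The two measures above can be sketched in a few lines (a minimal illustration using the numbers from the notes, not Lucene code):

```python
def recall(relevant_returned, relevant_available):
    """Fraction of all relevant docs that the search actually returned."""
    return relevant_returned / relevant_available

def precision(relevant_returned, total_returned):
    """Fraction of the returned docs that are actually relevant."""
    return relevant_returned / total_returned

print(recall(25, 50))      # 0.5  -> 50% recall
print(precision(25, 100))  # 0.25 -> 25% precision
```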
Getting good recall
- Use spell checking, synonyms to match users’ vocab
- Normalize data
- Collect, index, and search all data
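To make the normalization and synonym bullets concrete, here’s a minimal Python sketch; the `SYNONYMS` table is hypothetical, and in practice Solr handles both steps with analyzer filters (e.g. a synonym filter fed by a synonyms file) at index/query time:

```python
import unicodedata

# Hypothetical synonym table for illustration only.
SYNONYMS = {"tv": {"television"}, "nyc": {"new york"}}

def normalize(term):
    """Lowercase and strip accents so 'Café' matches 'cafe'."""
    folded = unicodedata.normalize("NFKD", term.lower())
    return "".join(ch for ch in folded if not unicodedata.combining(ch))

def expand(term):
    """Return the normalized term plus any synonyms, to widen recall."""
    t = normalize(term)
    return {t} | SYNONYMS.get(t, set())

print(normalize("Café"))  # cafe
print(expand("TV"))       # {'tv', 'television'}
```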
Getting good precision
- queries are too short (have users rank terms and use machine learning)
- implicit relevance feedback is available, but it doubles search execution time; almost no one uses it, although it should be considered
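A toy sketch of what implicit relevance feedback means in practice: re-rank an initial result list by blending the engine’s score with observed click data. The names, weights, and `CLICKS` log here are all hypothetical, not a Lucene/Solr API:

```python
# Hypothetical click log gathered from past searches on the same query.
CLICKS = {"doc2": 9, "doc3": 1}

def rerank(results, weight=0.1):
    """results: list of (doc_id, score); boost docs users clicked before."""
    return sorted(
        results,
        key=lambda r: r[1] + weight * CLICKS.get(r[0], 0),
        reverse=True,
    )

initial = [("doc1", 1.0), ("doc2", 0.8), ("doc3", 0.6)]
print(rerank(initial))  # doc2 jumps to the top: 0.8 + 0.1 * 9 = 1.7
```

The “doubles search execution” caveat comes from needing a second pass (or a second query) over the results after the feedback signal is applied.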
- Watson and Google Translate don’t use deep NLP; instead they rely on statistical analysis over huge data sets
- Lucene was created by Doug Cutting and released at Apache in 2001, with wide acceptance by 2005
- Solr was built in 2005 by Yonik Seeley for CNET and released at Apache in 2006; it provides Lucene capabilities over HTTP, plus faceting
- Best segmented index design (like Google’s)
- Open Source
- Great Community
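To illustrate the “Lucene capabilities over HTTP with faceting” point, here is a sketch of building a faceted query URL for Solr’s select handler. The host, core, and field names are made up for illustration; `q`, `rows`, `facet`, and `facet.field` are standard Solr query parameters:

```python
from urllib.parse import urlencode

# Hypothetical host/core ("collection1") and field names for illustration.
params = urlencode({
    "q": "title:lucene",      # query the title field
    "rows": 10,               # page size
    "facet": "true",          # turn faceting on
    "facet.field": "category",  # facet counts per category value
    "wt": "json",             # response format
})
url = "http://localhost:8983/solr/collection1/select?" + params
print(url)
```

Faceting is what lets a search UI show counts like “Books (120), DVDs (34)” alongside the result list, which plain Lucene does not give you out of the box.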
The basic premise is to use Lucene/Solr since it is the best and it’s free. It continues to innovate and has strong community support.