Complementing MongoDB with Real-time Solr Search

Overview

I’ve been a long-time user and evangelist of Solr, given its amazing ability to full-text index large amounts of structured and unstructured data. I’ve successfully used it on a number of projects to add both Google-like search and faceted (or filtered) search to our applications. I was quite pleased to find out that MongoDB has a connector for Solr that allows the same type of searching against my MongoDB-backed application. In this blog post, we’ll explore how to configure MongoDB and Solr and demonstrate the setup with the MongoDB application I wrote several months back, outlined in my blog post Mobile GeoLocation App in 30 minutes – Part 1: Node.js and MongoDB.

Mongo-Connector: Realtime access to MongoDB with Solr

During my research I stumbled upon mongo-connector. This was exactly the sort of thing I was looking for, namely because it hooks into MongoDB’s oplog (somewhat similar to a transaction log in Oracle) and updates Solr in real time based on any create, update, or delete operations made to the system. The oplog is what MongoDB uses for replication, so it is a requirement that MongoDB be set up as a replica set (one primary and n secondaries; in my case 2). Basically, I followed the instructions here to set up a developer replica set. Once established, I started each mongod instance as follows so they would run in the background (--fork) and use minimal space due to my disk space limitation (--smallfiles).

% mongod --port 27017 --dbpath /srv/mongodb/rs0-0 --replSet rs0 --smallfiles --fork --logpath /srv/mongodb/rs0-0.log

% mongod --port 27018 --dbpath /srv/mongodb/rs0-1 --replSet rs0 --smallfiles --fork --logpath /srv/mongodb/rs0-1.log

% mongod --port 27019 --dbpath /srv/mongodb/rs0-2 --replSet rs0 --smallfiles --fork --logpath /srv/mongodb/rs0-2.log
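
With the three instances running, the replica set itself still needs to be initiated from the mongo shell. Roughly, using the hosts and ports from the commands above:

% mongo --port 27017
> rs.initiate({
    _id: "rs0",
    members: [
        { _id: 0, host: "localhost:27017" },
        { _id: 1, host: "localhost:27018" },
        { _id: 2, host: "localhost:27019" }
    ]
})
> rs.status()

rs.status() should report one PRIMARY and two SECONDARY members before you move on.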

Once you have MongoDB configured and running, you need to install mongo-connector separately. It relies on Python, so if you don’t already have it, install version 2.7 or 3. To install mongo-connector as a package, I simply ran:

% pip install mongo-connector

After it is installed you can run it as follows so that it runs in the background using nohup (hold off on running this until after the next section):

% nohup sudo python mongo_connector.py -m localhost:27017 -t http://solr-pet.xxx.com:9650/solr-pet -d ./doc_managers/solr_doc_manager.py > mongo-connector.out 2>&1 &

A couple of things to note here: the -m option points to the host and port of the primary node in the MongoDB replica set. The -t option is the location of the Solr server and context; in my case it was a remote instance of Solr. The -n option is the namespace of the Mongo database and collection I wish to have indexed by Solr (without it, the connector indexes the entire database). Finally, the -d option indicates which doc manager I wish to use, which of course, in my case, is Solr. There is a doc manager for Elasticsearch as well, if you choose to use that instead.
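
For example, a full invocation scoped to a single collection might look roughly like this (dogtags.pets is a hypothetical namespace; substitute your own database and collection name):

% nohup sudo python mongo_connector.py -m localhost:27017 -t http://solr-pet.xxx.com:9650/solr-pet -n dogtags.pets -d ./doc_managers/solr_doc_manager.py > mongo-connector.out 2>&1 &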

With this in place, your MongoDB instance is configured to start pushing updates to Solr in real time. However, let’s take a look at the next section to see what we need to do on the Solr side of things.

Configuring Solr to work with Mongo-Connector

Before we run the mongo-connector, there are a few things we need to do in Solr to get it to work properly. First, to get the mongo-connector to post documents to Solr you must be sure that the Solr REST service is available for update operations. Second, you must configure schema.xml with the specific fields that are required as well as any fields that are being stored in Mongo. On the first point, we need to be sure that the following line exists in solrconfig.xml:

<requestHandler name="/update" class="solr.UpdateRequestHandler"/>

As of version 4.0 of Solr, this request handler supports XML, JSON, CSV and javabin. It allows the mongo-connector to send data to the REST interface for incremental indexing. Regarding the schema, you must include a field for each field you have (or are going to add) in your Mongo documents. Here’s an example of what my schema.xml looks like:

<schema name="solr-suggest-box" version="1.5">
        <types>
                <fieldType name="string" class="solr.StrField" sortMissingLast="true" omitNorms="true"/>
                <fieldType name="long" class="solr.TrieLongField" precisionStep="0" omitNorms="true" positionIncrementGap="0" />
                <fieldType name="text_wslc" class="solr.TextField" positionIncrementGap="100">
                        <analyzer type="index">
                                <tokenizer class="solr.WhitespaceTokenizerFactory"/>
                                <filter class="solr.LowerCaseFilterFactory"/>
                        </analyzer>
                        <analyzer type="query">
                                <tokenizer class="solr.WhitespaceTokenizerFactory"/>
                                <filter class="solr.LowerCaseFilterFactory"/>
                        </analyzer>
                </fieldType>
                <fieldType name="tdouble" class="solr.TrieDoubleField" precisionStep="8" positionIncrementGap="0"/>
                <fieldType name="location" class="solr.LatLonType" subFieldSuffix="_coordinate"/>
                <fieldType name="tdate" class="solr.TrieDateField" omitNorms="true" precisionStep="6" positionIncrementGap="0"/>
        </types>

        <fields>
                <field name="_id" type="string" indexed="true" stored="true" required="true" />
                <field name="name" type="text_wslc" indexed="true" stored="true" />
                <field name="description" type="text_wslc" indexed="true" stored="true" />
                <field name="date" type="tdate" indexed="true" stored="true" />
                <field name="nmdsc" type="text_wslc" indexed="true" stored="true" multiValued="true" />
                <field name="coordinate" type="location" indexed="true" stored="true"/>
                <field name="_version_" type="long" indexed="true" stored="true"/>
                <field name="_ts" type="long" indexed="true" stored="true"/>
                <field name="_ns" type="string" indexed="true" stored="true"/>
                <field name="ns" type="string" indexed="true" stored="true"/>
                <field name="coords" type="string" indexed="true" stored="true" multiValued="true" />
                <dynamicField name="*" type="string" indexed="true" stored="true"/>
        </fields>

        <uniqueKey>_id</uniqueKey>

        <defaultSearchField>nmdsc</defaultSearchField>

        <!-- we don't want too many results in this usecase -->
        <solrQueryParser defaultOperator="AND"/>

        <copyField source="name" dest="nmdsc"/>
        <copyField source="description" dest="nmdsc"/>
</schema>

I found that all of the underscore fields (_id, _version_, _ts, and _ns) were required to get this working correctly. To future-proof things, I also added a catch-all dynamicField so that the schema can change without affecting the Solr configuration; a tenet of MongoDB is to have a flexible schema. Finally, I use copyField to include only the fields I wish to search against; name and description were the only ones of interest for my use case. The “nmdsc” field they are copied into is declared as the defaultSearchField and is what the UI will search against, which I will go into next.

After your config is in place and you have started the Solr server, you can launch the mongo-connector, and it will continuously push any changes saved to Mongo into Solr in real time. I used nohup to kick it off in the background as shown above.
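
A quick way to sanity-check the pipeline is to save a document through the application (or the mongo shell) and then query Solr directly. Since nmdsc is the defaultSearchField, a bare query term searches the copied name and description values (the term below is just an example):

% curl 'http://solr-pet.xxx.com:9650/solr-pet/select?q=fido&wt=json&indent=true'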

Using Solr in the DogTags Application

To tie this all together, we need to alter the UI of the original application to allow for Solr searching. See my original blog post for a refresher: Mobile GeoLocation App in 30 minutes – Part 2: Sencha Touch. Recall that this is a Sencha Touch MVC application, so all I needed to do was add a new store for the Solr REST/JSONP service that I call for searching, and update the UI with a control the user can use to conduct a search. Let’s take a look at each of these:

Ext.define('MyApp.store.PetSearcher', {
    extend: 'Ext.data.Store',
    requires: [
        'MyApp.model.Pet'
    ],
    config: {
        autoLoad: true,
        model: 'MyApp.model.Pet',
        storeId: 'PetSearcher',
        proxy: {
            type: 'jsonp',
            url: 'http://solr-pet.xxx.com:9650/solr-pet/select/',
            callbackKey: 'json.wrf',
            limitParam: 'rows',
            extraParams: {
                wt: 'json',
                'json.nl': 'arrarr'
            },
            reader: {
                root: 'response.docs',
                type: 'json'
            }
        }
    }
});

Above is the new store I’m using to call Solr and map its results back to the original model that I used before. Note the differences from the original store that are specific to Solr, namely the URL and the proxy parameters (callbackKey, limitParam, wt, and json.nl). The collection of docs is buried a bit deep in the response, so I have to point the reader’s root at response.docs accordingly.
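
For reference, the reader configuration follows the shape of Solr’s wt=json response, where the documents sit under response.docs; trimmed down, it looks something like this (field values are illustrative):

{
    "responseHeader": { "status": 0, "QTime": 2 },
    "response": {
        "numFound": 1,
        "start": 0,
        "docs": [
            { "_id": "5150a1...", "name": "Rex", "description": "friendly beagle" }
        ]
    }
}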

The next thing I need to do is add a control to my view so the user can interact with the search service. In my case I chose to use a search field docked at the top and have it update the list based on the search term. In my view, the code looks as follows:

Ext.define('MyApp.view.PetPanel', {
    extend: 'Ext.Panel',
    alias: 'widget.petListPanel',
    config: {
        layout: {
            type: 'fit'
        },
        items: [
            {
                xtype: 'toolbar',
                docked: 'top',
                title: 'Dog Tags'
            },
            {
                xtype: 'searchfield',
                docked: 'top',
                name: 'query',
                id: 'SearchQuery'
            },
            {
                xtype: 'list',
                store: 'PetTracker',
                id: 'PetList',
                itemId: 'petList',
                emptyText: "<div>No Dogs Found</div>",
                loadingText: "Loading Pets",
                itemTpl: [
                    '<div>{name} is a {description} and is located at {latitude} (latitude) and {longitude} (longitude)</div>'
                ]
            }
        ],
        listeners: [
            {
                fn: 'onPetsListItemTap',
                event: 'itemtap',
                delegate: '#PetList'
            },
            {
                fn: 'onSearch',
                event: 'change',
                delegate: '#SearchQuery'
            },
            {
                fn: 'onReset',
                event: 'clearicontap',
                delegate: '#SearchQuery'
            }
        ]
    },
    onPetsListItemTap: function (dataview, index, target, record, e, options) {
        this.fireEvent('petSelectCommand', this, record);
    },
    onSearch: function (dataview, newValue, oldValue, eOpts) {
        this.fireEvent('petSearch', this, newValue, oldValue, eOpts);
    },
    onReset: function() {
        this.fireEvent('reset', this);
    }
});

The searchfield item docked at the top adds the control, and the listeners array at the bottom of the config defines the listeners I’m using to fire events in my controller. The controller supports those events as follows:

    onPetSearch: function(view, value, oldvalue, opts) {
        if (value) {
            var store = Ext.getStore('PetSearcher');
            var list = this.getPetList();
            store.load({
                params: {q:value},
                callback: function() {
                    console.log("we searched");
                    list.setData(this._proxy._reader.rawData.response.docs);
                }
            });
            list.setStore(store);
        }
    },

    onReset: function (view) {
        var store = Ext.getStore('PetTracker');
        var list = view.down("#petList");
        store.getProxy().setUrl('http://nodetest-loutilities.rhcloud.com/dogtag/');
        store.load();
        list.setStore(store);
    },

Since the model is essentially the same between Mongo and Solr, all I have to do is swap the stores and reload them to get the results updated accordingly. In onPetSearch you can see where I pass the dynamic search term as the q parameter so that it loads the PetSearcher store with that value. When the search value is cleared, onReset goes back to the original PetTracker store and reloads the full result set. In both handlers I set the list component to the corresponding store via setStore so that the list shows results from whichever store it has been given.

Conclusion

In this short example, we established that we can provide real-time search with Solr against MongoDB and augment an existing application with a search control to use it. This has the potential of being a great complement to Mongo because it keeps us from having to add additional indexes to MongoDB for searching, which carries a performance cost, especially as the record set grows. Solr removes this burden from Mongo and leverages an incremental index that can be updated in real time for extremely fast queries. I see this approach being very powerful for modern applications.

Using the Ext JS 4 MVC architecture and a few gotchas

I recently worked on a POC to integrate Solr search with the Ext JS 4 infinite scrolling grid. This allows you to scroll through 234k+ records without having the user page through the data; the scrolling does the data buffering automatically.  Other features include hit highlighting, wild card searching, resizing windows and word-wrapped columns.  However, the most interesting part to me was using the new MVC approach that Sencha introduced in this release to organize your project much like you would a Grails or Java Web project.  I’ll detail the approach I took to make that happen and point out some gotchas along the way.

First, let’s start with the model.  In the code below you can see I’ve defined a model and a proxy, which is the piece that will pull the data.  There’s nothing too special here with the exception of the namespace I used to define the model, ESearch.model.EPart.  It is crucial that these namespaces are spelled correctly because they will be used later in other parts of the MVC.

Ext.define('ESearch.model.EPart', {
    extend: 'Ext.data.Model',
    idProperty: 'id',
    fields: [
        {name:'id', type:'int'}, 'description', 'item_number', 'part_number'
    ],
    proxy: {
        // load using script tags for cross domain, if the data in on the same domain as
        // this page, an HttpProxy would be better
        type: 'jsonp',
        url: 'http://solrdev1/solr-eti/select/',
        callbackKey: 'json.wrf',
        limitParam: 'rows',
        extraParams: {
            q: '*',
            wt:'json',
            hl:'on',
            'hl.fl': 'description',
            'json.nl':'arrarr'
        },
        reader: {
            root: 'response.docs',
            totalProperty: 'response.numFound'
        },
        // sends single sort as multi parameter
        simpleSortMode: true
    }
});

The next thing to consider is the store.  Arguably, this is not part of MVC per se, but it is used in conjunction with the model to define what type of store we want to use.  In this case, we want a buffered store with a page size of 200.  You’ll also notice that in this code I create listeners for beforeload and load so that I can make sorting work with Solr and do hit highlighting.  You’ll also notice that I link the model to the store by using its namespace for the model parameter.  Let’s take a look:

// default to wildcard search
var query = '*';

Ext.define('ESearch.store.EParts', {
    extend: 'Ext.data.Store',
    model: 'ESearch.model.EPart',
    pageSize: 200,
    remoteSort: true,
    autoLoad: false,
    // allow the grid to interact with the paging scroller by buffering
    buffered: true,
    listeners: {
        beforeload: {
            fn: function(store, options) {
                // Solr expects lowercase sort directions (asc/desc),
                // so normalize whatever Ext hands us before the request goes out
                if (options && options.sorters) {
                    var sorters = options.sorters;
                    for (var i = 0; i < sorters.length; i++) {
                        sorters[i].direction = (sorters[i].direction || 'asc').toLowerCase();
                    }
                }
            }
        },
        load: {
            fn: function(store, records, successful) {
                // highlight the current search terms in the grid body
                // (highlightText and escapeRegExChars are small helpers defined elsewhere in the app)
                if (query && query.length > 1) {
                    var queryParsed = query.replace(/\*/g, '').replace(/"/g, '').trim();
                    var queries = queryParsed.split(' ');

                    for (var i = 0; i < queries.length; i++) {
                        if (queries[i]) {
                            var q = escapeRegExChars(queries[i]);

                            // Check to highlight text only in grid-body
                            var node = Ext.get("grid-inf").dom.childNodes;
                            for (var j = 0; j < node.length; j++) {
                                if (node[j].className.indexOf("x-grid-body") >= 0) {
                                    node = node[j];
                                    break;
                                }
                            }

                            highlightText(node, q + "+", 'HL', true);
                        }
                    }
                }
                // temporary fix to address issue with scrollbars not resizing
                var grid = Ext.getCmp('grid-inf');
                grid.resetScrollers();
            }
        }
    }
});

Now that I have the data being consumed the way I want it, the next step is to put it into an infinite scrolling grid.  Here’s the code to do that:

Ext.define('ESearch.view.parts.List', {
    extend: 'Ext.grid.Panel',
    alias: 'widget.partslist',
    store: 'EParts',
    initComponent: function() {

        var groupingFeature = Ext.create('Ext.grid.feature.Grouping', {
            groupHeaderTpl: 'Group: {name} ({rows.length})',
            startCollapsed: false
        });

        var selectFeature = Ext.create('qcom.grid.SelectFeature');

        var config = {
            name: 'qparts-grid',
            id: 'grid-inf',
            verticalScrollerType: 'paginggridscroller',
            loadMask: true,
            invalidateScrollerOnRefresh: false,
            disableSelection: false,
            features: [groupingFeature,selectFeature],
            viewConfig: {
                trackOver: false
            },
            // grid columns
            columns:[{xtype: 'rownumberer',width: 45, sortable: false},{
                id: 'id-col',
                header: "ID",
                dataIndex: 'id',
                width:60
            },{
                id:"descr",
                header: "Description",
                dataIndex: 'description',
                width: 300,
                renderer: columnWrap
            },{
                id:"itemnum",
                header: "Item Numbers",
                dataIndex: 'item_number',
                width: 100
            },{
                id: "partnum",
                header: "Part Numbers",
                dataIndex: 'part_number',
                flex: 1,
                renderer: columnWrap
            }]
            ,selModel:{
           selType:'rowmodel'
          ,allowDeselect:true
          ,mode:'MULTI'
         },
            tbar:
                ['Search:',{
                     xtype: 'textfield',
                     name: 'searchField',
                     hideLabel: true,
                     width: 250,
                     emptyText: "Enter search terms separated by space",
                     listeners: {
                         change: {
                            fn: function adjustQuery(field) {
                                // temporary fix to address issue with scrollbars not resizing
                                this.store.resetData();

                                // Regex query and add wildcards where appropriate
                                if (field.value.length >= 1) {
                                    var values = field.getValue().match(/[A-Za-z0-9_%\/\.\-\|]+|"[^"]+"/g),
                                        value =[];
                                    if (values && values.length > 1) {
                                        for ( var i=0; i < values.length; i++ ) {
                                            if (values[i].indexOf("\"") >= 0 ) {
                                                value.push(values[i].toLowerCase());
                                            }
                                            else {
                                                value.push("*" + values[i].toLowerCase() + "*");
                                            }
                                        }
                                        query = value.join(" ");
                                        if (Ext.isChrome) {
                                            console.log(query);
                                        }
                                    }
                                    else {
                                        if (field.getValue().indexOf("\"") >= 0 ) {
                                            value.push(field.getValue().toLowerCase());
                                            query = value.join(" ");
                                        }
                                        else {
                                            // temporary fix because regex not picking up 1 char
                                            var temp = values ? values[0] : field.getValue();
                                            query = "*" + temp.toLowerCase() + "*";
                                        }
                                        if (Ext.isChrome) {
                                            console.log(query);
                                        }
                                    }
                                    this.store.load({
                                        params: {q:query}
                                    });
                                }
                            },
                            scope: this,
                            buffer: 500
                         }
                     }
                },
                {
                     xtype: 'tbfill'
                },{
                     xtype: 'displayfield',
                     name: 'totalText',
                     id: 'totalText',
                     hideLabel: true,
                     baseCls: 'x-toolbar-text',
                     style: 'text-align:right;',
                     width:180
                }
            ]
        };
        // apply config object
     Ext.apply(this, config);

     // call parent initComponent
     this.callParent(arguments);
    }
});

So from the above code you see that I’m defining a Grid Panel and assigning an alias to it called “partslist” (more on that later), but one gotcha I found is that I could not use the full namespace for the store definition; I had to simply call it “EParts”.  Finally, you’ll see me set up the columns and create a top bar that holds the search field.  I use a regex to process the search field, creating a wildcard search while preserving quoted phrases.  I also set the buffer to 500 so that it waits 500ms between keystrokes before firing the search again.

Now that we have the grid, we need to put it somewhere.  This is where I bring the window into the picture.  In the code below, I simply define my window size, where I want it in the browser, window capabilities like maximize, collapse and closable, and finally the items.  Notice that for the items, I’m using the alias partslist from the previously defined grid as the xtype.  This allows me to insert the grid as I defined it before without having to instantiate it as a variable.  Let’s take a look:

Ext.define( 'ESearch.view.Portal', {
    extend: 'Ext.window.Window',
    alias: 'widget.portal',
        width: 800,
        height:600,
        x: 150,
        y: 80,
        layout:'fit',
        border: false,
        closable: true,
        maximizable: true,
        collapsible: true,
        title: 'EParts Search',
        items: [{
            xtype: 'partslist',
            itemId:'myPartList'
        }]
});

So to finish up the MVC portion, we need a controller.  In the code below you will see how we create the controller and then define the models, stores, views, and any references needed.  You’ll also see in the init function where I invoke the store for the initial load of data, as well as an example of how we could listen for certain events and do something with them.  Notice again that the alias from the grid comes into play (partslist) so that we can capture button events from the grid.  This wasn’t completely implemented, but it gives an example of how it might be done.

Ext.define('ESearch.controller.Search', {
    extend: 'Ext.app.Controller',
    models:[
        'EPart'
    ],
    stores:[
        'EParts'
    ],
    views:[
        'parts.List'
    ],
    refs:[{
         ref:'PartsList',
         selector:'partslist'
    }],
    init:function(app) {
            var store = this.getEPartsStore();
            store.guaranteeRange(0, 199);
            this.control({
                   'partslist button':{
                    click:this.onButtonClick
               }
          });
    },
    onButtonClick: function(btn, e) {
        if (btn.operation === 'newSearch') {
            //TODO need to find a nice way to instantiate a new window
        }
    }
});

The last little bit of code simply defines the application and sets some criteria as to what paths we should use and which pieces of Ext JS are required for this application to function.  One gotcha I noticed is that you must map the first part of your namespace to the folder your app will fall under.  You’ll notice that I have mapped the ‘ESearch’ namespace to the ‘app’ path, so my directory structure for my application must follow something like this:
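
A sketch of that layout, reconstructed from the class names used in the code above:

webapp/
    App.js
    lib/extjs4/
    app/
        controller/
            Search.js
        model/
            EPart.js
        store/
            EParts.js
        view/
            Portal.js
            parts/
                List.js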

 

So, for instance, ESearch.view.Portal must live as a file called Portal.js under app/view, and the same goes for the other classes.  The App.js file containing the following code sits in the “webapp” directory adjacent to “app” to maintain relative pathing.  All I do is create my viewport based on its namespace and call .show() to kick the whole thing off.

Ext.Loader.setConfig({enabled: true,
        paths: {
            'Ext.ux':'lib/extjs4/ux/',
            'ESearch': 'app'
        }
});
Ext.require([
    'Ext.grid.*',
    'Ext.data.*',
    'Ext.util.*',
    'ESearch.view.Portal',
    'Ext.grid.PagingScroller',
    'Ext.ux.grid.FiltersFeature',
    'Ext.grid.feature.Grouping',
    'Ext.grid.plugin.CellEditing',
    'Ext.state.CookieProvider'
]);

Ext.application({
    name: 'ESearch',
    appFolder: 'app',
    autoCreateViewport: false,
    controllers: ['Search'],
    launch: function() {
        Ext.state.Manager.setProvider(new Ext.state.CookieProvider());
        this.viewport = Ext.create('ESearch.view.Portal', {
            stateId: 'esearchWindow'
        });
        window[this.name].app = this;

        this.viewport.show();
    }
});

And finally, we have an HTML file that points to all the necessary JS files for this application to work, which is pretty standard stuff to bootstrap the application. However, with this approach, I only had to reference the App.js file and not all the underlying JS files in the MVC portion. This is because of the path mappings we set up in the previous section; the Ext.Loader pulls in the rest dynamically.
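
The page boils down to the Ext JS resources plus App.js; a minimal sketch (the Ext JS paths here are assumptions and will vary with where your distribution lives):

<html>
<head>
    <title>EParts Search</title>
    <link rel="stylesheet" type="text/css" href="lib/extjs4/resources/css/ext-all.css"/>
    <script type="text/javascript" src="lib/extjs4/ext-all.js"></script>
    <script type="text/javascript" src="App.js"></script>
</head>
<body></body>
</html>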

I hope this is useful for folks who would like to explore MVC in Ext JS 4 a little more.  I really find it useful because it helps break up a larger component along the lines I’m used to.  In this way you can have multiple developers work on the same project pretty easily without stepping on each other.

Lucene Revolution – Solr Performance Innovations by Yonik Seeley

Yonik Seeley reports on many new enhancements for 3.1 and 4.0 with emphasis on performance improvements.

TieredMergePolicy is the new default; it ignores segment order when selecting the best merge and it does not over-merge. The Finite State Transducer (FST) based terms index is much smaller in memory and much faster for term lookups.

DocumentsWriterPerThread (DWPT) prevents indexing from being blocked while new segments are flushed.  Only the largest DWPT is flushed out while the others continue to index.  This improves indexing performance by 250%+.

SolrCloud integrates ZooKeeper to make managing several Solr nodes in a cluster more efficient. The -DzkRun Java param will run an internal ZooKeeper instance.

Distributed requests can be managed by Solr without a load balancer/VIP.  Just use pipes to define the shard lists, or you can query across all shards by adding distrib=true to the query params.

Extended Dismax parser (superset of dismax) does term proximity boosting and so much more! See ref. here:

  • Supports full Lucene query syntax in the absence of syntax errors
  • Supports "and"/"or" to mean "AND"/"OR" in Lucene syntax mode
  • When there are syntax errors, improved smart partial escaping of special characters is done to prevent them… in this mode, fielded queries, +/-, and phrase queries are still supported.
  • Improved proximity boosting via word bi-grams… this prevents the problem of needing 100% of the words in the document to get any boost, as well as having all of the words in a single field.
  • Advanced stopword handling… stopwords are not required in the mandatory part of the query but are still used (if indexed) in the proximity boosting part. If a query consists of all stopwords (e.g. to be or not to be) then all will be required.
  • Supports the "boost" parameter... like the dismax bf param, but multiplies the function query instead of adding it in
  • Supports pure negative nested queries… so a query like +foo (-foo) will match all documents

New faceting performance improvements include deep faceting.  Pivot faceting now allows nested faceting, and range faceting lets you do with numbers what we previously could only do with dates.

Spatial search is perfect for geo searches.  You can now sort by any arbitrary function query (like closest item).  Pseudo-fields in Solr 4 will allow you to return distance from the point of origin.

Pseudo-fields allow you to return other info along with the document’s stored fields (function queries, globs, aliasing, multiple fl values).  In the future, inline highlighting will provide highlighting details right with the fields rather than having them returned separately… very nice!

Group by field and group by query also look very helpful.

Pseudo-Join will allow you to do a restriction on a particular source very easily and quickly with just a simple "fq" param. Lots of good examples shown in the slides.

Auto-suggest has two implementations (TST or FST); the latter is slower to build, but faster to use and more compact.

You can now index with curl and a JSON array and query results can be returned as CSV.

Last but not least, the new browse GUI for Solr looks very nice too…

Lucene Revolution – Implementing Click Through Relevancy

The first page of results counts the most, as most people don’t go past it and definitely not beyond page 3.  There are three ways to affect relevance: indexing-time (text analysis, morphological analysis, synonyms), query-time (boosts, dismax, synonyms, function queries), or editorial ranking.

How do you get users to give you feedback on the search results? They won’t really do this explicitly, so you infer it from what was searched, navigation click-throughs, and "like" buttons.  Query logs and click-through events are key to this analysis.  To do it properly, you must consider the click-through in the context of what was searched before.

Some Approaches

You can label the clicked item with the query terms. This can feed collaborative filtering or a recommendation system.  However, this approach can be sparse or noisy: clicks can reflect changed intent, hidden intent, or no intent at all.

NOTE: if you don’t collect query logs, you should start today! They help with user profile population, query suggestions, identifying the most useful information, and tracking general user interest over time.

You can do vector analysis between the query and the label.  The distance between the vectors can give you a scoring attribute to use for relevancy.

Undesired effects include unbounded positive feedback (results dominated by popularity but no longer relevant), post-click drift, off-topic clicks, and noisy labels.  Click data should be sub-linear and temporal in nature (old click counts should be discounted), and it should be sanitized and bounded in how much effect it has on the score.

How do you implement this?

This isn’t available out of the box in Solr, but it is fairly simple to implement. You need a component for logging queries, one for logging click-throughs (you can use a small bit of JS to report this, as Google and Yahoo! do), a tool to correlate and aggregate the logs, and a tool to manage the click-through history.

The next step is to take these results and use them as boost values.  Use ExternalFileField to map each docid to its boost.  Another approach is a full re-index that joins source docs and click data by docid; this is not viable for large corpora.  Incremental field updates will be available in the future and are probably the best fit for this use case (check back in a year).  You can also use ParallelReader to keep a separate index for the click data and zip it together with the main index, but this approach is complicated and fragile!
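
A rough sketch of the ExternalFileField approach (the field name, file contents, and query below are illustrative assumptions, not from the talk): declare the field type in schema.xml, drop a key=value file of boosts into Solr’s data directory, and multiply it into the score at query time.

<!-- schema.xml: an external file field keyed on the unique id -->
<fieldType name="clickBoost" class="solr.ExternalFileField" keyField="id" defVal="1" valType="float"/>
<field name="clicks" type="clickBoost"/>

# data/external_clicks, regenerated from the aggregated click logs
doc1=2.5
doc2=1.0

# query-time multiplicative boost using edismax
q=ipod&defType=edismax&boost=field(clicks)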

Commercial solution from Lucid Imagination has a click scoring framework that can help.

Lucene Revolution Keynote – Marc Kellenstein

This week I’m back in SF and this time I’m attending the Lucene Revolution conference.  The conference kicked off with Marc Kellenstein emphatically saying, “It is easier to search than to browse.”  Ain’t that the truth.

Over the next few days I’ll blog my notes from the sessions I attend at the conference.  I hope they provide some insight for others and reminders for me!

Keynote Notes

Google was first to use spell checking against terms in the docs from the index rather than just a big dictionary.

Recall is the percent of relevant docs returned (if 50 relevant docs exist and only 25 are returned, recall is 50%).

Precision is the percent of returned docs that are relevant (100 returned, 25 relevant = 25% precision).

100% recall is easy, but we are really striving for 100% precision too, which is a lot harder to do.

Getting good recall

  • Use spell checking, synonyms to match users’ vocab
  • NLP
  • Normalize data
  • Collect, index, and search all data

Getting good precision

  • Queries are too short (have users rank terms and use machine learning)
  • Implicit relevance feedback is available, but it doubles search execution time and no one really uses it, although it should be considered
  • Watson and Google Translate don’t use NLP but instead rely on statistical analysis of huge data sets

Some history

  • Lucene was created by Doug Cutting; Apache release in 2001, wide acceptance by 2005
  • Solr was built in 2005 by Yonik Seeley for CNET; Apache release in 2006, providing Lucene capabilities over HTTP with faceting

Strengths:

  • Best segmented index (like Google)
  • Open Source
  • Great Community

The basic premise: use Lucene/Solr since it is the best and it’s free.  It continues to innovate and has strong community support.