Complementing MongoDB with Real-time Solr Search

Overview

I’ve been a long-time user and evangelist of Solr, given its amazing ability to full-text index large amounts of structured and unstructured data. I’ve successfully used it on a number of projects to add both Google-like search and faceted (or filtered) search to our applications. I was quite pleased to find out that there is a connector that lets Solr provide that same type of searching against an application backed by MongoDB. In this blog post, we’ll explore how to configure MongoDB and Solr together and demonstrate their usage with the MongoDB application I wrote several months back, outlined in my blog post Mobile GeoLocation App in 30 minutes – Part 1: Node.js and MongoDB.

Mongo-Connector: Real-time access to MongoDB with Solr

I stumbled upon this connector during my research: mongo-connector. This was exactly the sort of thing I was looking for, namely because it hooks into MongoDB’s oplog (somewhat similar to a transaction log in Oracle) and updates Solr in real time based on any create-update-delete operations made to the system. The oplog is critical to MongoDB’s replication, so it is a requirement that MongoDB be set up as a replica set (one primary and n secondaries; in my case, two). I followed the instructions here to set up a developer replica set. Once established, I started each mongod instance as follows so they would run in the background (--fork) and use minimal space due to my disk space limitation (--smallfiles).

% mongod --port 27017 --dbpath /srv/mongodb/rs0-0 --replSet rs0 --smallfiles --fork --logpath /srv/mongodb/rs0-0.log

% mongod --port 27018 --dbpath /srv/mongodb/rs0-1 --replSet rs0 --smallfiles --fork --logpath /srv/mongodb/rs0-1.log

% mongod --port 27019 --dbpath /srv/mongodb/rs0-2 --replSet rs0 --smallfiles --fork --logpath /srv/mongodb/rs0-2.log
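
Before the replica set will accept writes, it has to be initiated once from the mongo shell. Here’s a minimal sketch, assuming the three local instances above (the replica set instructions linked earlier cover this in more detail):

% mongo --port 27017

// from the mongo shell, initiate the replica set with all three members
rs.initiate({
  _id: "rs0",
  members: [
    { _id: 0, host: "localhost:27017" },
    { _id: 1, host: "localhost:27018" },
    { _id: 2, host: "localhost:27019" }
  ]
})

// confirm one PRIMARY and two SECONDARY members before proceeding
rs.status()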

Once you have MongoDB configured and running, you need to install mongo-connector separately. It relies on Python, so if you don’t have it, you will want to install version 2.7 or 3.x. To install mongo-connector, I simply ran this command to install it as a package:

% pip install mongo-connector

After it is installed, you can run it as follows so that it, too, runs in the background using nohup (hold off on running this until after the next section):

% nohup sudo python mongo_connector.py -m localhost:27017 -t http://solr-pet.xxx.com:9650/solr-pet -d ./doc_managers/solr_doc_manager.py > mongo-connector.out 2>&1

A couple of things to note here: the -m option points to the host and port of the primary node in the MongoDB replica set. The -t option is the base URL of the Solr server and context; in my case, it was a remote instance of Solr. The -n option (not shown above) is the namespace of the Mongo database and collection I wish to have indexed by Solr; without it, the connector indexes the entire database. Finally, the -d option indicates which doc manager I wish to use, which, of course, in my case is the Solr one. There is a doc manager for Elasticsearch as well, if you choose to use that instead.
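
For example, to limit indexing to a single collection, the same command with a namespace added might look like the following (pettracker.dogtags is a hypothetical namespace; substitute your own database and collection):

% nohup sudo python mongo_connector.py -m localhost:27017 -t http://solr-pet.xxx.com:9650/solr-pet -n pettracker.dogtags -d ./doc_managers/solr_doc_manager.py > mongo-connector.out 2>&1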

With this in place, your MongoDB instance is configured to start pushing updates to Solr in real time. However, let’s take a look at the next section to see what we need to do on the Solr side of things.

Configuring Solr to work with Mongo-Connector

Before we run the mongo-connector, there are a few things we need to do in Solr to get it to work properly. First, for the mongo-connector to post documents to Solr, you must be sure that the Solr REST service is available for update operations. Second, you must configure the schema.xml with the specific fields that are required, as well as any fields that are being stored in Mongo. On the first point, we need to be sure that the following line exists in solrconfig.xml:

<requestHandler name="/update" class="solr.UpdateRequestHandler"/>

As of version 4.0 of Solr, this request handler supports XML, JSON, CSV and javabin. It allows the mongo-connector to send data to the REST interface for incremental indexing. Regarding the schema, you must include a field for each field you have (or are going to add) in your Mongo documents. Here’s an example of what my schema.xml looks like:

<schema name="solr-suggest-box" version="1.5">
        <types>
                <fieldType name="string" class="solr.StrField" sortMissingLast="true" omitNorms="true"/>
                <fieldType name="long" class="solr.TrieLongField" precisionStep="0" omitNorms="true" positionIncrementGap="0" />
                <fieldType name="text_wslc" class="solr.TextField" positionIncrementGap="100">
                        <analyzer type="index">
                                <tokenizer class="solr.WhitespaceTokenizerFactory"/>
                                <filter class="solr.LowerCaseFilterFactory"/>
                        </analyzer>
                        <analyzer type="query">
                                <tokenizer class="solr.WhitespaceTokenizerFactory"/>
                                <filter class="solr.LowerCaseFilterFactory"/>
                        </analyzer>
                </fieldType>
                <fieldType name="tdouble" class="solr.TrieDoubleField" precisionStep="8" positionIncrementGap="0"/>
                <fieldType name="location" class="solr.LatLonType" subFieldSuffix="_coordinate"/>
                <fieldType name="tdate" class="solr.TrieDateField" omitNorms="true" precisionStep="6" positionIncrementGap="0"/>
        </types>

        <fields>
                <field name="_id" type="string" indexed="true" stored="true" required="true" />
                <field name="name" type="text_wslc" indexed="true" stored="true" />
                <field name="description" type="text_wslc" indexed="true" stored="true" />
                <field name="date" type="tdate" indexed="true" stored="true" />
                <field name="nmdsc" type="text_wslc" indexed="true" stored="true" multiValued="true" />
                <field name="coordinate" type="location" indexed="true" stored="true"/>
                <field name="_version_" type="long" indexed="true" stored="true"/>
                <field name="_ts" type="long" indexed="true" stored="true"/>
                <field name="_ns" type="string" indexed="true" stored="true"/>
                <field name="ns" type="string" indexed="true" stored="true"/>
                <field name="coords" type="string" indexed="true" stored="true" multiValued="true" />
                <dynamicField name="*" type="string" indexed="true" stored="true"/>
        </fields>

        <uniqueKey>_id</uniqueKey>

        <defaultSearchField>nmdsc</defaultSearchField>

        <!-- we don't want too many results in this usecase -->
        <solrQueryParser defaultOperator="AND"/>

        <copyField source="name" dest="nmdsc"/>
        <copyField source="description" dest="nmdsc"/>
</schema>

I found that all the underscore fields (_id, _version_, _ts and _ns) were required to get this working correctly. To future-proof it, I added a catch-all dynamicField so that the schema can change without affecting the Solr configuration; a tenet of MongoDB, after all, is flexible schema. Finally, I use copyField to funnel just the fields I wish to search against into a single multi-valued field; name and description were the only ones of interest for my use case. That combined “nmdsc” field is declared as the defaultSearchField, which the UI will search against, as I will go into next.

After your config is in place and you start the Solr server, you can launch the mongo-connector successfully, and it will continuously update Solr with any changes that are saved to Mongo in real time. I used nohup to kick it off in the background, as shown above.
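
To sanity-check the pipeline end to end, insert a document on the Mongo side and query Solr for it a moment later. Here’s a minimal sketch, assuming the hypothetical pettracker.dogtags namespace from earlier and the Solr URL used above:

% mongo --port 27017
> use pettracker
> db.dogtags.insert({ name: "Rex", description: "black lab mix" })

% curl "http://solr-pet.xxx.com:9650/solr-pet/select?q=nmdsc:rex&wt=json"

The curl response should include the new document under response.docs within a second or two of the insert.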

Using Solr in the DogTags Application

To tie this all together, we need to alter the UI of the original application to allow for Solr searching. See my original blog post for a refresher: Mobile GeoLocation App in 30 minutes – Part 2: Sencha Touch. Recall that this is a Sencha Touch MVC application, so all I needed to do was add a new store for the Solr REST/JSONP service that I will call for searching, and update the UI to provide a control for the user to conduct a search. Let’s take a look at each of these:

Ext.define('MyApp.store.PetSearcher', {
    extend: 'Ext.data.Store',
    requires: [
        'MyApp.model.Pet'
    ],
    config: {
        autoLoad: true,
        model: 'MyApp.model.Pet',
        storeId: 'PetSearcher',
        proxy: {
            type: 'jsonp',
            url: 'http://solr-pet.xxx.com:9650/solr-pet/select/',
            callbackKey: 'json.wrf',
            limitParam: 'rows',
            extraParams: {
                wt: 'json',
                'json.nl': 'arrarr'
            },
            reader: {
                root: 'response.docs',
                type: 'json'
            }
        }
    }
});

Above is the new store I’m using to call Solr and map its results back to the original model that I used before. Note the differences from the original store that are specific to Solr, namely the URL and the proxy parameters (callbackKey, limitParam, wt and json.nl). The collection of docs is buried a bit deep in the response, so I set the reader’s root to response.docs accordingly.
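
For context, when this store loads, the proxy builds a JSONP request to Solr roughly like the following (the json.wrf callback name is generated by Sencha at runtime, so this URL is just illustrative):

http://solr-pet.xxx.com:9650/solr-pet/select/?q=rex&wt=json&json.nl=arrarr&rows=25&json.wrf=Ext.data.JsonP.callback1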

The next thing I need to do is add a control to my view so the user can interact with the search service. In my case I chose to use a search field docked at the top and have it update the list based on the search term. In my view, the code looks as follows:

Ext.define('MyApp.view.PetPanel', {
    extend: 'Ext.Panel',
    alias: 'widget.petListPanel',
    config: {
        layout: {
            type: 'fit'
        },
        items: [
            {
                xtype: 'toolbar',
                docked: 'top',
                title: 'Dog Tags'
            },
            {
                xtype: 'searchfield',
                docked: 'top',
                name: 'query',
                id: 'SearchQuery'
            },
            {
                xtype: 'list',
                store: 'PetTracker',
                id: 'PetList',
                itemId: 'petList',
                emptyText: "<div>No Dogs Found</div>",
                loadingText: "Loading Pets",
                itemTpl: [
                    '<div>{name} is a {description} and is located at {latitude} (latitude) and {longitude} (longitude)</div>'
                ]
            }
        ],
        listeners: [
            {
                fn: 'onPetsListItemTap',
                event: 'itemtap',
                delegate: '#PetList'
            },
            {
                fn: 'onSearch',
                event: 'change',
                delegate: '#SearchQuery'
            },
            {
                fn: 'onReset',
                event: 'clearicontap',
                delegate: '#SearchQuery'
            }
        ]
    },
    onPetsListItemTap: function (dataview, index, target, record, e, options) {
        this.fireEvent('petSelectCommand', this, record);
    },
    onSearch: function (dataview, newValue, oldValue, eOpts) {
        this.fireEvent('petSearch', this, newValue, oldValue, eOpts);
    },
    onReset: function() {
        this.fireEvent('reset', this);
    }
});

The searchfield item docked below the toolbar adds the control, and the listeners config wires up the events I’m using to fire events in my controller: itemtap on the list, plus change and clearicontap on the search field. The controller supports those events as follows:

    onPetSearch: function(view, value, oldvalue, opts) {
        if (value) {
            var store = Ext.getStore('PetSearcher');
            var list = this.getPetList();
            store.load({
                params: {q:value},
                callback: function() {
                    console.log("we searched");
                    list.setData(this._proxy._reader.rawData.response.docs);
                }
            });
            list.setStore(store);
        }
    },

    onReset: function (view) {
        var store = Ext.getStore('PetTracker');
        var list = view.down("#petList");
        store.getProxy().setUrl('http://nodetest-loutilities.rhcloud.com/dogtag/');
        store.load();
        list.setStore(store);
    },

Since the model is essentially the same between Mongo and Solr, all I have to do is swap the stores and reload them to get the results updated accordingly. In onPetSearch, you can see where I pass the dynamic search term as the q parameter so that it loads the PetSearcher store with that value. When the search value is cleared, I want to go back to the original PetTracker store and reload the full results, which is what onReset does. In both handlers, I set the list component’s store to the corresponding one so that the list shows the results from whichever store it has been given.
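
For completeness, the view’s petSearch and reset events have to be wired to those handlers in the controller’s config. Here’s a minimal sketch of that wiring (the controller and ref names are assumptions based on the view above, not the original source):

Ext.define('MyApp.controller.PetController', {
    extend: 'Ext.app.Controller',
    config: {
        refs: {
            petListPanel: 'petListPanel', // matches the view's widget alias
            petList: '#PetList'           // provides this.getPetList() used in onPetSearch
        },
        control: {
            petListPanel: {
                petSearch: 'onPetSearch', // fired by the view's onSearch
                reset: 'onReset'          // fired by the view's onReset
            }
        }
    }
    // onPetSearch and onReset are the handlers shown above
});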

Conclusion

In this short example, we established that we could provide real-time search with Solr against MongoDB and augment an existing application with a search control to use it. This has the potential to be a great complement to Mongo because it keeps us from having to add additional indexes to MongoDB for searching, which carries a performance cost, especially as the record set grows. Solr removes this burden from Mongo and maintains an incremental index that can be updated in real time for extremely fast queries. I see this approach being very powerful for modern applications.

21 comments on “Complementing MongoDB with Real-time Solr Search”

  1. Thanks, mate. That is exactly what I am looking for: leveraging the benefits of each technology and making them work together. Great stuff!

  2. Thanks. It looks very promising. However, I could not find the connector in the mongo-connector Github link you provided.

  3. Pingback: "pysolr.SolrError: [Reason: /solr4/update/]" when running mongo_connector.py | Technology & Programming

  4. Thank you for your article! It helps a lot!
    I have a lot of nested documents in mongo. Is it correct to make a flat document in solr_doc_manager? For example, transform { user: { name: example } } to { user.name: example }? Or is there any other way to handle nested documents?
    I found scala-mongo-connector https://github.com/SelfishInc/solr-mongo-connector. It makes flat documents from nested documents, but there is no information about this tool anywhere except the GitHub readme.

  5. Really interested in implementing something like this so that I have to write no Solr/DB syncing code in my web app. But I have a question: what if one of the fields in the mongo collection is a blob, like a PDF document, which I also need Solr to index? Will I be stuck in this case?

    Also, when I stop and restart mongo, is it potentially going to have to spend ages re-indexing stuff all over again every time I stop and restart?

    • You’ll have to experiment with blobs to see if it’s an acceptable level of performance. Starting/stopping mongo doesn’t force a re-index, since Solr is maintaining that index and can incrementally manage it.

  6. Is there a configuration in mongo-connector that’ll delay the updates to Solr? I want to pull the data from mongo to solr at frequent intervals (configurable, say every 1 hr) instead of real time. Can you please tell me if that’s possible?

    • Not that I’m aware. Why would you want your Solr and Mongo data out of sync for that long? If that’s what you want, mongo-connector isn’t really the right tool, since it hooks into the oplog.

      • Thanks Lou. My use case has frequent CRUD operations (high volume) against mongo, and we don’t need the data synced with Solr until the data is final in mongo (which we would know by a specific operation that gets triggered against mongo). This is to avoid heavy load on Solr, and I was wondering if there is a way to delay/control the push from mongo to Solr. Looks like I might have to go with the DataImportHandler approach that Solr supports for my needs.

  7. Thanks so much for writing this up. There’s not a lot of information out there on how to use the mongo-connector so this is really helpful.

  8. Hi, we have found a strange problem. Mongo and Solr are installed on CentOS 6.2. When the MongoDB size reaches ~30GB, the synchronization does not work stably; sometimes we experience delays of several minutes. We connect using localhost. Is there any trick to this?
