Distributed Search with SOLR Multicore – Part 2

May 2, 2010

In the previous post we got the bare minimum SOLR install up and running with apache, tomcat, and SOLR with multicore/multiindex. So you should be able to navigate to http://localhost/ we should be seeing the manage your cores screen.

solr

All is well with this but we need to modify our default install a bit to more represent a real world multi-core setup.

Download SOLR Example

The schemas provided by the SOLR distribution are rather basic and do not include field types commonly found in production deployments. To simplify this I created a new template that you can use to get a jump-start. It includes;

  • Settings: A more robust schema.xml that includes field types such as text_ws, string, int, decimal, and ignore.
  • Performance: A solr core exists [see: distrib] that simply provides distributed search via a requestHandler that auto injects the shards parameter on any search request.
  • Performance: The solrconfg.xml of each has a more aggressive caching mechanism in place to provide better scalability as well as some further options to enhance performance out of the box (etag, cache control headers, etc).
  • Security: Disabled remote streaming.

You can download the example here.

How to issue multi-core/multi-index queries

To issue a multi-core query by hand we can use a url template like below:

http://localhost/core1/select/?q=solr&shards=localhost/core01,localhost/core02

You will notice the use of the shards parameter. This allows us to query multiple cores (indexes) simultaneously. This proves invaluable as you horizontally scale your search as shards can be on different pcs all together. In our case, however, we are simply using it to query across indexes in a search on the same PC, but you don’t have to do it like this.

Distributed Search Scenarios

From the above we can explore two scenarios for deployment depending on the needs of our search architecture and current scale. Scenarios differ only in their physical deployment so moving from scenario 1 to scenario 2 is as painless as it can get. Note: The reasons for moving to a distributed platform usually revolve around bottlenecks at the IO level either IO is too slow, or size of index is too large (>1 million)

As the size of our indexes grow we may move a shard to a different server to get the benefit of higher IO. However, keep in mind, this will increase CPU on the core servers respectively. We might also consider still keeping 1-2 search servers in place and as IO becomes an issue mount new drives to take care of the load. I would strongly recommend SSD drives for any search server as 99% of the time you are concerned with how fast your “gets” are and not your “sets”.

figure1

Benefits

  • Easier to maintain
  • Allows migration to multi-server solutions
  • Allows higher up-time through multi-cores

Potential Drawbacks

  • Scalability at start is diminished

Alternate Considerations

  • Consider scenario 2 when you need massive scalability.
  • Consider a replicated index with multiple hosts.

figure2

Benefits

  • Massive horizontal scale

Potential Drawbacks

  • The increased scale can be hard to manage
  • Effective performance tuning is required
  • More testing to establish best configuration

Alternate Considerations

  • Reconsider scenario 1 if IO is the main bottleneck, instead of scaling horizontal with physical servers, mount different RAID configurations

Additional Resources

Scaling up Large Scale Search from 500,000 volumes to 5 Million volumes and beyond

SOLR Performance Benchmark Single index vs Multiple (Multicore)

Multicore admin commands for SOLR

Distributed Search with SOLR Multicore – Part 1

May 1, 2010

Taking a break today from windows and decided to put together a quick guide on how to get a fully working search server up and running that supports multi-core instances in SOLR.

What is multi-core on SOLR?

Technically, an index in SOLR is basically like a
single-table database schema. Imagine a massive spreadsheet, if you will. In-spite of this limitation, there is nothing to stop you from putting different types of data into a single index, thereby, in effect mitigating this limitation.

All you have to do is use different fields for the different document types, and use a field to discriminate between the types. An identifier field would need to be unique across all documents in this index, no matter the type, so you could easily do this by concatenating the field type and the entity’s identifier. This may appear really ugly from a relational database design standpoint, but this isn’t a database.

More importantly, unlike a database, there is no overhead whatsoever for a document to have no data in a field. On the other hand, databases usually set aside space for data in a row whether it is filled or not. This is where the spreadsheet metaphor can break down, because a blank cell in a spreadsheet takes up space, but not in an index.

Why use multiple indexes?

For a large number of documents, a strategy using multiple indices will prove to be more scalable. Only testing will indicate what “large” is for your data and your queries, but less than a million documents will not likely benefit from multiple indices. Ten million documents have been suggested as a reasonable maximum number for a single index.

Committing changes to a Solr index invalidates the caches used to speed up querying. If this happens often, and the changes are usually to one type of entity in the index, then you will get better query performance by using separate indices.

Drawbacks?

Using a multi-core (read: multi-index) configuration of SOLR does prevent you from searching all indexes at once out-of-the-box. However, as we will discover, this can be easy to implement as we can use another SOLR feature (distributed search) to provide a seemless distributed search platform that can scale horizontally and vertically.

Tomcat 6.0.26

At the time of this article the latest tomcat is version 6.0.26. Please update links accordingly to whatever the new version is at the time of reading.

wget http://download.nextag.com/apache/tomcat/tomcat-6/v6.0.26/bin/apache-tomcat-6.0.26.zip
unzip apache-tomcat-6.0.26
sudo mv apache-tomcat-6.0.26 /etc/tomcat

Tomcat should now be runnable by going to /etc/tomcat/bin/ and executing startup.sh. By default tomcat will be on port 8080 so open up a browser and navigate to http://yourhost:8080/.

Tomcat 6.0.26 – Automatic Startup

It would be nice if we got our apache install to startup automatically when we restart so lets go ahead and create the appropriate /etc/init.d entry.

sudo pico /etc/init.d/tomcat

Paste the following code in;

# Tomcat auto-start
#
# description: Auto-starts tomcat
# processname: tomcat
# pidfile: /var/run/tomcat.pid

case $1 in
start)
        sh /etc/tomcat/bin/startup.sh
        ;;
stop)
        sh /etc/tomcat/bin/shutdown.sh
        ;;
restart)
        sh /etc/tomcat/bin/shutdown.sh
        sh /etc/tomcat/bin/startup.sh
        ;;
esac
exit 0

Now we just need to make it executable and create the symbolic links to have it startup.

sudo chmod 755 /etc/init.d/tomcat
sudo ln -s /etc/init.d/tomcat /etc/rc1.d/K99tomcat
sudo ln -s /etc/init.d/tomcat /etc/rc2.d/S99tomcat

Install SOLR 1.4.0

Download apache via http://lucene.apache.org/solr/.

wget http://apache.deathculture.net/lucene/solr/1.4.0/apache-solr-1.4.0.zip
unzip apache-solr-1.4.0.zip
cd apache-solr-1.4.0
ant dist
sudo mv example/multicore/ /etc/solr
mv dist/apache-solr-1.4.0.war /etc/tomcat/webapps/

Now let’s set solr.home to the location of our solr install by changing our .bash_profile to set the JAVA_OPTS parameter.

# ensure we get nice colors when using putty
if [ -f ~/.bashrc ];
then
source ~/.bashrc
fi

export JAVA_HOME=/usr/lib/jvm/java-6-sun-1.6.0.17
export TOMCAT_HOME=/etc/tomcat
export JAVA_OPTS="$JAVA_OPTS -Dsolr.solr.home=/etc/solr"

PATH=$PATH:$JAVA_HOME/bin

With these changes in place lets go ahead and reload our bash using source .bash_profile. And restart tomcat /etc/tomcat/bin/shutdown.sh and /etc/tomcat/bin/startup.sh.

Now we should be able to browse to http://localhost:8080/apache-solr-1.4.0/.

solr

Apache + Tomcat

Now for effective hosting we can direct apache to mod_proxy, or we could simply have tomcat be the host.

So lets go and get apache

apt-get install apache2-mpm-prefork
cd /etc/apache2
# enable apache proxy
ln -s ../mods-available/proxy.conf proxy.config
ln -s ../mods-available/proxy.load proxy.load
ln -s ../mods-available/proxy_http.load proxy_http.load
<IfModule mod_proxy.c>
ProxyRequests Off
ProxyPreserveHost On
ProxyStatus On

ProxyPass / http://localhost:8080/apache-solr-1.4.0/
ProxyPassReverse / http://localhost:8080/apache-solr-1.4.0/
</IfModule>

Edit /etc/apache2/sites-enabled/000-default to include the below snippet.

<VirtualHost>
// ....
// ....
// ....
// right before end
    Include /etc/apache2/mods-enabled/proxy.config
</VirtualHost>

In the next post we will cover how to create a distributed search across our cores!

Split XML into Seperate Files using XSLT

February 13, 2010

I have been working with multiple data-feeds lately that come in the form of one giant XML file and needed a way to transform them into smaller more succinct packages. Was considering a quick little C# program with the help of snippetcompiler but thought that a pure XSLT option could be used.

Giant XML File

The example XML shown below is about 1.5GB in size on disk and is a bit unwieldy to operate on. Even with optimized XSLT you end up taking a lot of system resources doing standard string manipulation. But if you’ve got 12 GB on a server free to use then by all means don’t bother writing to disk.

In my case I didn’t have that amount of space (4GB of which 100MB was about max I’d like to consume). So my plan was to split the document into separate files on a specific node and name the file a value found within the child nodes. In my case this was the node.

<Root>
	<Investment><Id>234b131</Id><!-- removed for brevity --></Investment>
	<Investment><Id>234b130</Id><!-- removed for brevity --></Investment>
	<Investment><Id>234b132</Id><!-- removed for brevity --></Investment>
	<!-- ...and on... -->
	<!-- ...and on it goes... -->
</Root>

This seemed straightforward as a bit of researching (read: google) I stumbled upon the “xsl:result-document” XSLT tag. Not many XSLT engines implemented it on windows so finding the tool was the hardest part.

Saxon seemed to be the best fit as it could be used freely (the non-commercial version) and had an excellent command line option as well as multi-platform support (linux guy at heart).

Lately the author merged and separated some functions and the best version right now for open source is the Saxon-B which I’ve uploaded to the skydrive account.

XSLT

The below XSLT explains it all, after a short while (in my case for 1.5GB file) about 10 minutes neat little compact files await further processing.

<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema">
<xsl:output omit-xml-declaration="yes" indent="yes"/>
<!-- HOTO USE: -->
<!-- ************************************** -->
<!-- $ "Transform -xsl:split.xslt -s:split.xslt" inside main directory where target XML is found -->
<!-- ************************************** -->

<!-- grab every xml file in directory and process it into smaller files -->
<xsl:template match="/">
    <xsl:for-each select="collection(iri-to-uri('./?select=*.xml;recurse=yes'))">
        <xsl:for-each select="//Investment">
            <xsl:variable name="Id" select="Id" />
            <!-- create the file name used when outputting this node to its own file -->
            <xsl:variable name="filename" select="concat('./_Out/', $Id, '.xml')" />
            <xsl:value-of select="$filename" />
            <xsl:result-document href="{$filename}">
                <Root>
                    <xsl:copy-of select="node()"/>
                </Root>
            </xsl:result-document>
        </xsl:for-each>
    </xsl:for-each>
</xsl:template>
</xsl:stylesheet>