The Sphinx search engine and MySQL

Using sphinx to search text stored in mysql

While MySQL has support for full text searches, it is just too slow to use on big data. The sphinx search engine is very well suited to search text stored in mysql. This article shows how to configure and use sphinx on text stored in mysql.

installing sphinx (on debian)

Currently (February 2016), sphinx is not yet in debian stable, but there are debian packages avaiable for the stable release of debian at http://sphinxsearch.com/downloads/release/.

When installing stand-alone packages like this with dpkg you might encounter unsatified dependencies (whether or not that will happen depends on the set of packages that you have installed when you install sphinx).

sudo dpkg -i sphinxsearch_2.2.10-release-1~jessie_amd64.deb

I got an error because libpq5 was not installed on my system, but that was easily resolved:

sudo apt-get install libpq5

configuring sphinx

What information about your system does sphinx need? Well, to start with it needs the name of the database, and a username and password to access it. To optimize the search results, sphinx also need to know what language the text is written in, and if there are special characters that it should know about.

Edit /etc/sphinxsearch/sphinx.conf. This file has different sections:

source

You need to modify the following parameters:

For example,

source src1
{
        type                    = mysql
        sql_host                = localhost
        sql_user                = foo
        sql_pass                = bar
        sql_db                  = baz
        sql_query               = \
SELECT id, content \
FROM bal
}

index

This section is language and charset dependent. I have swedish text in UTF-8, and I had to define how to map some upper case letters to their corresponding lower case letters: Ää Åå Ææ Èè Éé Öö Øø Üü, which is done in the charset_table parameter. I also created my own stopwords.txt and put it in /var/lib/sphinxsearch/data/stopwords.txt. If you have swedish text, you might find it useful and it is avaible here.

index bal
{
        source                  = src1
        path                    = /var/lib/sphinxsearch/data/bal
        docinfo                 = none
        morphology              = libstemmer_sv
        stopwords               = /var/lib/sphinxsearch/data/stopwords.txt
        min_word_len            = 3
        ignore_chars            = U+0085, U+0094
        charset_table           = english, _, U+00C4->U+00E4, U+00C5->U+00E5, U+00C6->U+00E6, U+00C8->U+00E8, U+00C9->U+00E9, U+00D6->U+00F6, U+00D8->U+00F8, U+00DC->U+00FC, U+00E4, U+00E5, U+00E6, U+00E8, U+00E9, U+00F6, U+00F8, U+00FC
        # Ää Åå Ææ Èè Éé Öö Øø Üü
}

indexer

You don't need to change anything here.

searchd

The default values work well. If you have a single-core CPU, change workers to none.

Starting sphinx

When you are done editing /etc/sphinxsearch/sphinx.conf, issue:

sudo indexer --all
sudo /etc/init.d/sphinxsearch restart

using sphinx

To test the search enginge, use your mysql client but hook up to with the port where searchd is listening, which defaults to port 9306.

mysql -P 9306 -h0

To view the tables that are indexed, use

show tables;

To search for a word, use regular sql like this (here I use a table tweets, which has text from tweets from twitter, and I search for tweets matching "enhörning".

mysql> select * from tweets where match('enhörning');
+--------------------+
| id                 |
+--------------------+
| 507203754782105600 |
| 510777573513887744 |
| 534460203480408064 |
| 545244354508779521 |
| 546060439508234240 |
| 554614933741379585 |
| 561643506452942848 |
| 603994619366498305 |
| 661571498767982592 |
| 671092677128101892 |
| 696072621692186624 |
| 304319166431760384 |
| 507199036722778113 |
| 507199036722786305 |
| 507201130594852864 |
| 507201516965728256 |
| 507202874506752001 |
| 507388678059343873 |
| 507420920756322304 |
| 507585580557799424 |
+--------------------+
20 rows in set (0.00 sec)

If you want all occurences, and not only the 20 top best matches, use

select * from tweets where match('enhörning') limit 10000 option max_matches=10000;

Both of "option max_matches=10000" and "limit 10000" are necessary.

comments powered by Disqus


Back to the index

Blog roll

R-bloggers, Debian Weekly
Valid XHTML 1.0 Strict [Valid RSS] Valid CSS! Emacs Muse Last modified: oktober 12, 2017