While MySQL has support for full text searches, it is just too slow to use on big data. The sphinx search engine is very well suited to search text stored in mysql. This article shows how to configure and use sphinx on text stored in mysql.
Currently (February 2016), sphinx is not yet in debian stable, but there are debian packages avaiable for the stable release of debian at http://sphinxsearch.com/downloads/release/.
When installing stand-alone packages like this with dpkg you might encounter unsatified dependencies (whether or not that will happen depends on the set of packages that you have installed when you install sphinx).
sudo dpkg -i sphinxsearch_2.2.10-release-1~jessie_amd64.deb
I got an error because libpq5 was not installed on my system, but that was easily resolved:
sudo apt-get install libpq5
What information about your system does sphinx need? Well, to start with it needs the name of the database, and a username and password to access it. To optimize the search results, sphinx also need to know what language the text is written in, and if there are special characters that it should know about.
Edit /etc/sphinxsearch/sphinx.conf
. This file has different sections:
You need to modify the following parameters:
For example,
source src1 { type = mysql sql_host = localhost sql_user = foo sql_pass = bar sql_db = baz sql_query = \ SELECT id, content \ FROM bal }
This section is language and charset dependent. I have swedish text in UTF-8, and I had to define how to map some upper case letters to their corresponding lower case letters: Ää Åå Ææ Èè Éé Öö Øø Üü, which is done in the charset_table
parameter. I also created my own stopwords.txt
and put it in /var/lib/sphinxsearch/data/stopwords.txt
. If you have swedish text, you might find it useful and it is avaible here.
index bal { source = src1 path = /var/lib/sphinxsearch/data/bal docinfo = none morphology = libstemmer_sv stopwords = /var/lib/sphinxsearch/data/stopwords.txt min_word_len = 3 ignore_chars = U+0085, U+0094 charset_table = english, _, U+00C4->U+00E4, U+00C5->U+00E5, U+00C6->U+00E6, U+00C8->U+00E8, U+00C9->U+00E9, U+00D6->U+00F6, U+00D8->U+00F8, U+00DC->U+00FC, U+00E4, U+00E5, U+00E6, U+00E8, U+00E9, U+00F6, U+00F8, U+00FC # Ää Åå Ææ Èè Éé Öö Øø Üü }
You don't need to change anything here.
The default values work well. If you have a single-core CPU, change workers
to none
.
When you are done editing /etc/sphinxsearch/sphinx.conf
, issue:
sudo indexer --all sudo /etc/init.d/sphinxsearch restart
To test the search enginge, use your mysql client but hook up to with the port where searchd is listening, which defaults to port 9306.
mysql -P 9306 -h0
To view the tables that are indexed, use
show tables;
To search for a word, use regular sql
like this (here I use a table tweets
, which has text from tweets from twitter, and I search for tweets matching "enhörning".
mysql> select * from tweets where match('enhörning'); +--------------------+ | id | +--------------------+ | 507203754782105600 | | 510777573513887744 | | 534460203480408064 | | 545244354508779521 | | 546060439508234240 | | 554614933741379585 | | 561643506452942848 | | 603994619366498305 | | 661571498767982592 | | 671092677128101892 | | 696072621692186624 | | 304319166431760384 | | 507199036722778113 | | 507199036722786305 | | 507201130594852864 | | 507201516965728256 | | 507202874506752001 | | 507388678059343873 | | 507420920756322304 | | 507585580557799424 | +--------------------+ 20 rows in set (0.00 sec)
If you want all occurences, and not only the 20 top best matches, use
select * from tweets where match('enhörning') limit 10000 option max_matches=10000;
Both of "option max_matches=10000" and "limit 10000" are necessary.