Using Sphinx search engine in Ruby on Rails

Posted by Dmytro Shteflyuk under Ruby & Rails

Almost all web applications need data-search logic, and quite often this logic should have full-text search capabilities. If you are using a MySQL database, you can use its FULLTEXT search, but it’s not efficient when you have a large amount of data. In this case third-party search engines are used, and one of them (and, I think, the most efficient) is Sphinx. In this article I’ll present my port of the Sphinx client library for Ruby and show how to use it.

First of all, what is Sphinx itself? Sphinx is a full-text search engine, meant to provide fast, size-efficient and relevant full-text search functions to other applications. Sphinx was specially designed to integrate well with SQL databases and scripting languages. Currently, the built-in data sources support fetching data either via a direct connection to MySQL or from an XML pipe.

The current Sphinx distribution includes the following software:

  • indexer: a utility to create full-text indices;
  • search: a simple (test) utility to query full-text indices from the command line;
  • searchd: a daemon to search through full-text indices from external software (such as Web scripts);
  • sphinxapi: a set of API libraries for popular Web scripting languages (currently, PHP).

I will not describe how to install the engine; if you are new to Sphinx, take a look at the official documentation (but if you want my take on it, you can always ask in the comments, and I will explain the installation procedure in one of my future posts). Instead, I will present my port of the Sphinx client library to Ruby and show how to use it (to use this library you need Sphinx 0.9.7-RC2).

First, you need to download the plugin from RubyForge or from this site.

This is a Ruby on Rails plugin, therefore just unpack it into your <app>/vendor/plugins directory (the library can also be used outside of a Rails application). Now you can write something like the following in your code:

# Create a client, set the matching mode, and run a query
sphinx = Sphinx.new
sphinx.set_match_mode(Sphinx::SPH_MATCH_ANY)
result = sphinx.query('term1 term2')

# Fetch corresponding models
ids = result[:matches].map { |id, value| id }.join(',')
posts = Post.find :all, :conditions => "id IN (#{ids})"

# Get excerpts
docs = posts.map { |post| post.body }
excerpts = sphinx.build_excerpts(docs, 'index', 'term1 term2')

It’s pretty simple, isn’t it? There are several options you can use to get more relevant search results (see the sketch after this list):

  • set_limits(offset, limit) – the offset of the first document to fetch and the number of documents to return.
  • set_match_mode(mode) – matching mode (can be SPH_MATCH_ALL – match all words, SPH_MATCH_ANY – match any of the words, SPH_MATCH_PHRASE – match the exact phrase, SPH_MATCH_BOOLEAN – match a boolean query).
  • set_sort_mode(mode) – sorting mode (can be SPH_SORT_RELEVANCE – sort by document relevance desc, then by date; SPH_SORT_ATTR_DESC – sort by document date desc, then by relevance desc; SPH_SORT_ATTR_ASC – sort by document date asc, then by relevance desc; SPH_SORT_TIME_SEGMENTS – sort by time segments (hour/day/week/etc.) desc, then by relevance desc).
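
As a rough illustration, here is how these calls might be combined before running a query (a minimal sketch based only on the methods listed above; the particular limit and modes are arbitrary choices, not recommendations):

sphinx = Sphinx.new
sphinx.set_limits(0, 20)                          # return only the first 20 matches
sphinx.set_match_mode(Sphinx::SPH_MATCH_ALL)      # every query word must be present
sphinx.set_sort_mode(Sphinx::SPH_SORT_RELEVANCE)  # most relevant documents first
result = sphinx.query('term1 term2')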

Other options can be found in the API documentation.

If you are interested in this library, have found bugs, or have ideas on how to improve it, please leave a comment.

Update: unfortunately, there are no Windows binaries for the latest Sphinx 0.9.7-RC2 version. I’ve built Sphinx for Windows and added my config file to the archive. You can download my build here.

37 Responses to this entry


said on November 26th, 2006 at 15:02 · Permalink

Gorgeous stuff, thanks! I’ve been looking for a way to search both Russian and English texts for a long time. I’ll recommend it to all the Rails folks I know. I hope the search engine has no problems with UTF-8?

said on November 26th, 2006 at 18:45 · Permalink

No problems at all :-) I’ve just double-checked it once again. Moreover, the engine implements morphological search for the Russian language. Also, the author is open to suggestions, so if you have any remarks or improvement ideas, you are welcome on the forum. There is a very good chance that requested features will be included in future releases.

said on November 27th, 2006 at 09:22 · Permalink

:) Absolutely great! I’ll announce it in the Russian-language RoR group.

How is the performance, compared to Ferret?

said on November 27th, 2006 at 14:19 · Permalink

Eh, if only it worked with Postgres…

Roman Semenenko
said on November 27th, 2006 at 16:09 · Permalink

Could you describe the advantages of this engine over Ferret?

said on November 27th, 2006 at 17:16 · Permalink

I haven’t found a single word about Russian morphology in the HyperEstraier documentation. It may well be better, of course, but how do you localize it for the needs of a Russian-language project?

said on December 14th, 2006 at 23:18 · Permalink

Bregor,
Sphinx works with PostgreSQL just fine.

guest,
and in what way exactly is HyperEstraier better?

Sam
said on December 21st, 2006 at 17:27 · Permalink

What are the advantages over Ferret?

said on December 25th, 2006 at 14:59 · Permalink

To be honest, I haven’t seen a comparison of these engines anywhere. If I have time, I’ll run one…

said on December 25th, 2006 at 15:25 · Permalink

At the very least, I couldn’t immediately find distributed search support in Ferret, i.e. there is a question of scalability.

It is tempting to compare speed, but that has to be done carefully: keep in mind that by default (MATCH_ALL) Sphinx computes the degree of phrase match, which is a noticeably more expensive operation than simply counting word frequencies in a document.

John
said on January 12th, 2007 at 07:33 · Permalink

Why Sphinx? You know there is already a port of Apache Lucene to Ruby called Ferret, and it’s supposed to be even faster. The “acts_as_ferret” plugin for Rails builds the functionality right into your models :)

Dmitry
said on January 15th, 2007 at 12:30 · Permalink

The xmlpipe data source lets you index local files at high speed with any kind of preprocessing.
In my case it looked roughly like this: 50 GB of data, doc files packed into zip archives, were processed sequentially (unzip, rtf2txt) and then converted to XML.
With that setup, a search takes on average 0.001 to 0.005 seconds on a standard server (3 GHz, 1 GB, SATA RAID).

In the next version (it is already in CVS) Andrew has promised “practically” wildcard search.

shawn
said on February 3rd, 2007 at 06:07 · Permalink

Hi, I found a bug in the plugin code. When you read the attrs, they are put in a hash, which isn’t guaranteed to be in a specific order. Then they are used to unpack the data in order. This was resulting in some attrs being mixed up when doing a grouping query (@count was switched with @groupby, etc).

Here is a fix that worked for me. I just tracked the attr names in an array so that we are guaranteed they stay in the same order, then use those to unpack the attrs in order. The only lines that are changed are the ones where the new attrs_names_in_order variable is used:



shawn
said on February 3rd, 2007 at 06:10 · Permalink

Hmmm, looks like it doesn’t like the brackets in the code. Let’s try this again:

Hi, I found a bug in the plugin code. When you read the attrs, they are put in a hash, which isn’t guaranteed to be in a specific order. Then they are used to unpack the data in order. This was resulting in some attrs being mixed up when doing a grouping query (@count was switched with @groupby, etc).

Here is a fix that worked for me. I just tracked the attr names in an array so that we are guaranteed they stay in the same order, then use those to unpack the attrs in order. The only lines that are changed are the ones where the new attrs_names_in_order variable is used:

    fields = []
    attrs = {}
    attrs_names_in_order = []
   
    nfields = response[p, 4].unpack('N*').first
    p += 4
    while nfields > 0 and p < max
      nfields -= 1
      len = response[p, 4].unpack('N*').first
      p += 4
      fields << response[p, len]
      p += len
    end
    result[:fields] = fields

    nattrs = response[p, 4].unpack('N*').first
    p += 4
    while nattrs > 0 && p < max
      nattrs -= 1
      len = response[p, 4].unpack('N*').first
      p += 4
      attr = response[p, len]
      p += len
      type = response[p, 4].unpack('N*').first
      p += 4
      attrs[attr.to_sym] = type;
      attrs_names_in_order << attr.to_sym
    end
    result[:attrs] = attrs
   
    # read match count
    count = response[p, 4].unpack('N*').first
    p += 4
   
    # read matches
    result[:matches] = {}
    while count > 0 and p < max
      count -= 1
      doc, weight = response[p, 8].unpack('N*N*')
      p += 8

      result[:matches][doc] ||= {}
      result[:matches][doc][:weight] = weight
      for attr in attrs_names_in_order
        val = response[p, 4].unpack('N*').first
        p += 4
        result[:matches][doc][:attrs] ||= {}
        result[:matches][doc][:attrs][attr] = val
      end
    end

Hopefully you can add the fix in and maybe get the updated Ruby API distributed with Sphinx 0.9.7 when it gets released.

said on February 7th, 2007 at 00:16 · Permalink

I think you may have a minor error on line 339 in sphinx.rb: one of the values in the division should be a float so that it returns a float value.

-    result[:time] = '%.3f' % (result[:time] / 1000)
+    result[:time] = '%.3f' % (result[:time] / 1000.0)

said on February 7th, 2007 at 00:25 · Permalink

Update… actually, I think Sphinx returns time values where 1 = 1/10,000 of a second, not 1 = 1/1,000th… let me know if you find otherwise:

- result[:time] = '%.3f' % (result[:time] / 1000)
+ result[:time] = '%.3f' % (result[:time] / 10000.0)

Danila
said on February 9th, 2007 at 18:20 · Permalink

How do I install Sphinx under Windows if I use the Denwer developer package?

said on February 22nd, 2007 at 01:03 · Permalink

Thanks for the comment! I will review it shortly and post update. Thanks again

said on February 22nd, 2007 at 01:14 · Permalink

Danila, Sphinx is installed as a standalone application. Just grab a Windows build (mine or the one from the official site), set up the config, and start searchd.

Nikolay Karev
said on March 3rd, 2007 at 10:12 · Permalink

This piece of code is potentially problematic:

posts = Post.find :all, :conditions => "id IN (#{ids})"

If ids contains several thousand results, there is a chance the query parser in the DBMS will die, and everything will fall over rather grimly.
So IMHO it’s better either to limit the number of results returned by Sphinx, or to split them into chunks and fetch them from the DBMS chunk by chunk (see the sketch below).
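
For example, a chunked fetch along those lines might look like the sketch below (just an illustration of the suggestion above, not part of the plugin; the chunk size of 500 is an arbitrary assumption):

require 'enumerator'  # provides each_slice on Ruby 1.8

# Fetch matched records in chunks instead of one huge IN (...) list
ids = result[:matches].map { |id, value| id }

posts = []
ids.each_slice(500) do |chunk|
  posts.concat(Post.find(:all, :conditions => ["id IN (?)", chunk]))
end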

said on March 20th, 2007 at 14:37 · Permalink

Hi,
I have used the full-text search feature in many of my projects. I have used Ferret and HyperEstraier. You can use acts_as_ferret for Ferret searching and acts_as_searchable for HyperEstraier. Ferret provides multiple-model search and the other doesn’t. I prefer HyperEstraier for full-text search. :)

joost
said on March 27th, 2007 at 16:32 · Permalink

thx! :)

joost
said on March 27th, 2007 at 16:36 · Permalink

BTW, is there an update already which includes the above fixes? That would be great!

said on March 27th, 2007 at 17:51 · Permalink

The update will be published tomorrow or the day after tomorrow. Currently I’m finishing RSpec tests which will cover the whole functionality.

joost
said on April 3rd, 2007 at 18:35 · Permalink

Currently I get the following error using the plugin (with v0.9.7 of Sphinx). All database fields are MySQL INT(11).

../config/../vendor/plugins/sphinx/lib/sphinx.rb:256:in 'pack': bignum too big to convert into 'unsigned long' (RangeError)
        from ../config/../vendor/plugins/sphinx/lib/sphinx.rb:256:in 'query'
        from ../config/../vendor/plugins/sphinx/lib/sphinx.rb:253:in 'each'
        from ../config/../vendor/plugins/sphinx/lib/sphinx.rb:253:in 'query'
        from ./test.rb:136:in 'search_entry'
        from ./test.rb:149

Any idea? Please let me know.. Also about an update!! :)

said on April 3rd, 2007 at 19:04 · Permalink

joost, do you use set_filter_range in your code? Could you show me the values you have passed to this method? Also, it would be great if you could contact me directly so we can fix it quickly.

I’m updating the API now and will upload it in the next few days.
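
In case the overflow comes from the range bounds themselves, one possible workaround is to keep them within the unsigned 32-bit range that the protocol packs with 'N'. The sketch below is purely hypothetical: it assumes a set_filter_range(attribute, min, max) signature mirroring the PHP API, and the attribute name and variables are made up for illustration:

MAX_UINT32 = 2**32 - 1  # largest value the packed 'N' format can hold

# Hypothetical illustration: clamp range bounds before passing them to the client
min_time = start_date.to_i
max_time = [end_date.to_i, MAX_UINT32].min
sphinx.set_filter_range('created_at', min_time, max_time)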

said on September 10th, 2007 at 05:12 · Permalink
Index: vendor/plugins/sphinx/lib/client.rb
===================================================================
--- vendor/plugins/sphinx/lib/client.rb (revision 5885)
+++ vendor/plugins/sphinx/lib/client.rb (working copy)
@@ -391,18 +391,20 @@
       count = response[p, 4].unpack('N*').first; p += 4
       
       # read matches
-      result['matches'] = {}
+      result['matches'] = []
       while count > 0 and p < max
         count -= 1
         doc, weight = response[p, 8].unpack('N*N*'); p += 8
   
-        result['matches'][doc] ||= {}
-        result['matches'][doc]['weight'] = weight
+        doc_data = {}
+        doc_data['weight'] = weight
         attrs_names_in_order.each do |attr|
           val = response[p, 4].unpack('N*').first; p += 4
-          result['matches'][doc]['attrs'] ||= {}
-          result['matches'][doc]['attrs'][attr] = val
+          doc_data['attrs'] ||= {}
+          doc_data['attrs'][attr] = val
         end
+        
+        result['matches'] << [doc, doc_data]
       end
       result['total'], result['total_found'], msecs, words = response[p, 16].unpack('N*N*N*N*'); p += 16
       result['time'] = '%.3f' % (msecs / 1000.0)

tolya
said on September 11th, 2008 at 12:54 · Permalink

Hi everyone!

I have a question about Sphinx, please help me find a solution.

I have the following structure in my configuration file:

sphinx.conf:

source sphinx_users_main
source sphinx_users_delta : sphinx_users_main
source sphinx_spaces_main
source sphinx_spaces_delta : sphinx_spaces_main
index users_main
index users_delta : users_main
index spaces_main
index spaces_delta : spaces_main

I came up with this structure so that a search could return IDs from a particular table (by specifying which index from the configuration file to search).

Everything seems to work correctly:

search -a test

Sphinx 0.9.8-release (r1371)
Copyright (c) 2001-2008, Andrew Aksyonoff

using config file '/usr/local/etc/sphinx.conf'...
index 'users_main': query 'test ': returned 14 matches of 14 total in 0.000 sec

displaying matches:
1. document=3592, weight=2
2. document=4178, weight=2
3. document=4179, weight=2
4. document=4181, weight=2
5. document=6192, weight=2
6. document=2807, weight=1
7. document=3593, weight=1
8. document=4717, weight=1
9. document=4740, weight=1
10. document=6090, weight=1
11. document=6196, weight=1
12. document=6218, weight=1
13. document=6219, weight=1
14. document=6220, weight=1

words:
1. 'test': 14 documents, 19 hits

index 'users_delta': query 'test ': returned 0 matches of 0 total in 0.000 sec

words:
1. 'test': 0 documents, 0 hits

index 'spaces_main': query 'test ': returned 17 matches of 17 total in 0.000 sec

displaying matches:
1. document=937, weight=1
2. document=940, weight=1
3. document=942, weight=1
4. document=943, weight=1
5. document=944, weight=1
6. document=945, weight=1
7. document=964, weight=1
8. document=983, weight=1
9. document=984, weight=1
10. document=985, weight=1
11. document=986, weight=1
12. document=987, weight=1
13. document=988, weight=1
14. document=989, weight=1
15. document=990, weight=1
16. document=991, weight=1
17. document=992, weight=1

words:
1. 'test': 17 documents, 17 hits

index 'spaces_delta': query 'test ': returned 0 matches of 0 total in 0.000 sec

words:
1. 'test': 0 documents, 0 hits

But I can’t figure out how to get Sphinx to search only the index I specify, like I do from the console, for example:

search -i spaces_main -a test

Sphinx 0.9.8-release (r1371)
Copyright (c) 2001-2008, Andrew Aksyonoff

using config file '/usr/local/etc/sphinx.conf'...
index 'spaces_main': query 'test ': returned 17 matches of 17 total in 0.000 sec

displaying matches:
1. document=937, weight=1
2. document=940, weight=1
3. document=942, weight=1
4. document=943, weight=1
5. document=944, weight=1
6. document=945, weight=1
7. document=964, weight=1
8. document=983, weight=1
9. document=984, weight=1
10. document=985, weight=1
11. document=986, weight=1
12. document=987, weight=1
13. document=988, weight=1
14. document=989, weight=1
15. document=990, weight=1
16. document=991, weight=1
17. document=992, weight=1

words:
1. 'test': 17 documents, 17 hits

Could you please tell me how this can be done?

Thanks

said on September 11th, 2008 at 14:51 · Permalink

The second parameter of the Query method is the name of the index to search:

sphinx.Query('test', 'spaces_main');

tolya
said on September 12th, 2008 at 14:27 · Permalink

Thanks a lot for the answer.

Could you please tell me how I can change the template in which Sphinx returns query results?
For example, for the query sphinx.Query('test')
I would like to be able to get, among other things: test16, test_12, [email protected].

Thanks

Anatoliy
said on September 24th, 2008 at 17:23 · Permalink

Hi everyone!!!

Could you please tell me how to implement in Sphinx the same kind of search that, for example, '…LIKE %name%…' would give?

Thanks
