sphinx | Dmytro Shteflyuk's Home

Scribd open source projects

Dmytro Shteflyuk — Tue, 08 Sep 2009 02:06:05 +0000

It’s time to summarize what we have done for the open source community. Scribd is a pretty open company: we release a lot of code to the public after a while (sometimes sooner, sometimes later). Here I want to list all the code we have open-sourced. Keep in mind that we keep publishing more and more code, so I will update this post periodically. Stay tuned, and follow me on Twitter to get instant updates.

Here is the list of our projects in alphabetical order:

bounces-handler — Email bounce processing system with a Rails plugin that prevents Rails mailers from sending messages to blocked addresses.
db-charmer — ActiveRecord Connections Magic (slaves, multiple connections, etc).
easy-prof — Simple and easy to use Ruby code profiler, which could be used as a Rails plugin.
Fast Sessions — Sessions class for ActiveRecord sessions store created to work fast (really fast).
loops — Simple background loops framework for Ruby on Rails and Merb.
magic-enum — Method used to define ENUM-like attributes in your model (int fields actually).
rlibsphinxclient — A Ruby wrapper for pure C searchd client API library.
rscribd — Ruby client library for the Scribd API.
Rspec Cells — A library for testing applications that are using Cells in RSpec.
Scribd Desktop Uploader — A fully native Cocoa Macintosh uploader app for the Scribd.com website.

bounces-handler

The bounces-handler package is a simple set of scripts that automatically process email bounces and ISP feedback loop emails and maintain your mailing blacklists, plus a Ruby on Rails plugin for using those blacklists in your RoR applications.

This piece of software was developed as part of a broader effort to improve mailing quality at Scribd.com, and it was one of the most critical steps after setting up reverse DNS records, DKIM and SPF.

Links: Project Home Page on GitHub | Introduction Blog Post | RDoc Documentation.

db-charmer

DbCharmer is a simple yet powerful plugin for ActiveRecord that does a few things:

Allows you to easily manage AR models’ connections (switch_connection_to method)
Allows you to switch AR models’ default connections to separate servers/databases
Allows you to easily choose where your query should go (Model.on_db methods)
Allows you to automatically send read queries to your slaves while masters would handle all the updates.
Adds multiple databases migrations to ActiveRecord

It requires Ruby on Rails version 2.3 or later. The main purpose of this plugin is to put all the database-related code we have been using at Scribd for a while into a single easy-to-use package.
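To make the read/write splitting idea concrete, here is a minimal plain-Ruby sketch. It is not DbCharmer's API: the `QueryRouter` class, its `connection_for` method and the symbols standing in for connections are all hypothetical, and a real plugin has to deal with transactions and replication lag as well.

```ruby
# Hypothetical router illustrating read/write splitting; not DbCharmer's API.
class QueryRouter
  def initialize(master, slaves)
    @master = master
    @slaves = slaves
    @counter = 0
  end

  # SELECT statements rotate through the slaves; everything else,
  # and every query when no slaves exist, goes to the master.
  def connection_for(sql)
    return @master if @slaves.empty? || sql !~ /\A\s*select\b/i

    slave = @slaves[@counter % @slaves.size]
    @counter += 1
    slave
  end
end

router = QueryRouter.new(:master, [:slave1, :slave2])
p router.connection_for('SELECT * FROM documents')
p router.connection_for('UPDATE documents SET title = "x"')
p router.connection_for('select 1')
```

Round-robin is just the simplest policy to show; the sketch says nothing about how the plugin itself picks a slave.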

Links: Project Home Page on GitHub | Test Rails Application on GitHub | RDoc Documentation.

easy-prof

Simple and easy to use Ruby code profiler, which could be used as a Rails plugin. The main idea behind easy-prof is creating checkpoints in your code and measuring the time needed to execute code blocks. Here is an example of easy-prof output:

[home#index] Benchmark results:
[home#index] debug: Logged in user home page
[home#index] progress: 0.7002 s [find top videos]
[home#index] progress: 0.0452 s [build categories list]
[home#index] progress: 0.0019 s [build tag cloud]
[home#index] progress: 0.0032 s [find featured videos]
[home#index] progress: 0.0324 s [find latest videos]
[home#index] debug: VIEW STARTED
[home#index] progress: 0.0649 s [top videos render]
[home#index] progress: 0.0014 s [categories render]
[home#index] progress: 2.5887 s [tag cloud render]
[home#index] progress: 0.0488 s [latest videos render]
[home#index] progress: 0.1053 s [featured video render]
[home#index] results: 3.592 s

From this output you can see which checkpoints take longer to reach and which code fragments are fast.
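The checkpoint idea itself can be sketched in a few lines of plain Ruby. This is only an illustration of the technique and not easy-prof's actual API: the `CheckpointProfiler` class and its `progress` and `results` methods are hypothetical names chosen to mirror the output above.

```ruby
# Hypothetical sketch of checkpoint-based profiling; not the easy-prof API.
class CheckpointProfiler
  attr_reader :lines

  def initialize(name)
    @name = name
    @started = @last = now
    @lines = ["[#{@name}] Benchmark results:"]
  end

  # Record the time elapsed since the previous checkpoint.
  def progress(message)
    current = now
    @lines << format('[%s] progress: %.4f s [%s]', @name, current - @last, message)
    @last = current
  end

  # Record the total time since the profiler was created.
  def results
    @lines << format('[%s] results: %.4f s', @name, now - @started)
    @lines
  end

  private

  # A monotonic clock is not affected by system clock adjustments.
  def now
    Process.clock_gettime(Process::CLOCK_MONOTONIC)
  end
end

profiler = CheckpointProfiler.new('home#index')
sleep 0.01
profiler.progress('find top videos')
profiler.progress('build tag cloud')
puts profiler.results
```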

Links: Project Home Page on GitHub | Introduction Blog Post | RDoc Documentation.

Fast Sessions

FastSessions is a session class for the ActiveRecord session store, created to work fast (really fast). It uses some techniques that are not widely known among developers; they usually come up only when performance consultants are called in to fix serious problems.

The FastSessions plugin was born as a hack for Scribd.com (a large RoR-based web project), which was suffering from InnoDB auto-increment table-level locks on the sessions table.

So, first of all, we removed the id field from the table. The next step was to make lookups faster, and we used the following technique: instead of using (session_id) as the lookup key, we started using (CRC32(session_id), session_id). This two-column key really helps MySQL find sessions faster because key cardinality is higher (so MySQL can find a record earlier without checking lots of index rows). We benchmarked this approach and it showed a 10–15% performance gain on large sessions tables.

The last, but most powerful, change was to not create database records for empty sessions and to not save session data back to the database if it has not changed during the current request. With this change we reduce the number of inserts by 50-90% (depending on the application).
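As a rough illustration of the two-column key, the sketch below builds the (CRC32(session_id), session_id) pair with Ruby's standard `Zlib.crc32` and uses it as a lookup key in an in-memory hash. The `SessionIndex` class is a hypothetical stand-in for the sessions table, and I am assuming here that `Zlib.crc32` produces the same checksum as MySQL's CRC32(); verify that before relying on it.

```ruby
require 'zlib'

# Sketch only: an in-memory stand-in for a table whose lookup key is
# (CRC32(session_id), session_id). SessionIndex is a hypothetical name.
class SessionIndex
  def initialize
    @rows = {}
  end

  # Build the two-column key: a 32-bit checksum followed by the full id.
  def key_for(session_id)
    [Zlib.crc32(session_id), session_id]
  end

  def store(session_id, data)
    @rows[key_for(session_id)] = data
  end

  def find(session_id)
    @rows[key_for(session_id)]
  end
end

index = SessionIndex.new
index.store('0a1b2c3d4e5f', 'user_id=20')
p index.key_for('0a1b2c3d4e5f')
p index.find('0a1b2c3d4e5f')
```

The full session_id stays in the key because two different ids can share a checksum; the integer prefix only narrows the search.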
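The "write only when needed" rule can be sketched as follows. This is a simplified illustration and not the plugin's code: `LazySession` and its `writes` counter are hypothetical, and the counter stands in for an actual INSERT or UPDATE.

```ruby
# Hypothetical sketch: persist a session only when its data has changed.
class LazySession
  attr_reader :writes

  def initialize(stored = {})
    @original = Marshal.load(Marshal.dump(stored)) # deep copy for comparison
    @data = stored
    @writes = 0
  end

  def [](key)
    @data[key]
  end

  def []=(key, value)
    @data[key] = value
  end

  # Called at the end of a request; returns true when a write happened.
  def save
    # A brand new empty session equals its empty original, so it is skipped too.
    return false if @data == @original

    @writes += 1
    @original = Marshal.load(Marshal.dump(@data))
    true
  end
end

session = LazySession.new
session.save            # empty and untouched: nothing is written
session[:user_id] = 20
session.save            # data changed: one write
session.save            # unchanged since the last save: skipped
puts session.writes
```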

All of these changes were implemented and you can use them automatically after a simple plugin installation.

There is a fork patched by mudge for full compatibility with Ruby on Rails version 2.3 or later.

Links: Project Home Page on Google Code | Introduction Blog Post | Fork on GitHub Compatible with Rails 2.3 and Later.

loops

loops is a small and lightweight library for Ruby on Rails, Merb and other frameworks created to support simple background loops in your application which are usually used to do some background data processing on your servers (queue workers, batch tasks processors, etc).

Originally the loops plugin was created to make our own loops code more organized. We used to have tens of different modules with methods that were called with script/runner and then run with nohup and other not-so-convenient backgrounding techniques. When you have that many loops/workers to run in the background, managing them on a regular basis (restarts, code upgrades, status/health checking, etc.) becomes a nightmare.

After a short time of writing our loops in a more organized way we were able to generalize most of the loops code, so now our loops look like classes with a single mandatory public method called run. Everything else (spawning many workers, managing them, logging, backgrounding, pid-file management, etc.) is handled by the plugin itself.
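A loop in this style might look like the sketch below. Only the `run` method name comes from the description above; `QueueWorkerLoop`, `run_loops` and the use of threads are assumptions made to keep the example self-contained, and they do not reflect how the plugin actually spawns or manages workers.

```ruby
# Hypothetical loop class: the only required public method is `run`.
class QueueWorkerLoop
  attr_reader :processed

  def initialize(queue)
    @queue = queue
    @processed = []
  end

  # Drain the queue, doing some trivial "work" on each item.
  def run
    until @queue.empty?
      @processed << @queue.shift.upcase
    end
  end
end

# Minimal stand-in for the plugin's job: start every loop and wait for all
# of them. Logging, pid files and restarts are left out of this sketch.
def run_loops(loops)
  loops.map { |worker| Thread.new { worker.run } }.each(&:join)
end

workers = [QueueWorkerLoop.new(%w[a b]), QueueWorkerLoop.new(%w[c])]
run_loops(workers)
p workers.map(&:processed)
```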

Links: Project Home Page on GitHub | Introduction Blog Post | RDoc Documentation.

magic-enum

Method used to define ENUM-like attributes in your model (int fields actually). It’s easier to show what it does in code rather than to explain in plain English:

Statuses = {
  :unknown => 0,
  :draft => 1,
  :published => 2,
  :approved => 3
}
define_enum :status, :default => 1, :raise_on_invalid => true, :simple_accessors => true

is identical to

Statuses = {
  :unknown => 0,
  :draft => 1,
  :published => 2,
  :approved => 3
}
StatusesInverted = Statuses.invert

# Reader: map the stored integer back to a symbol, falling back to the default.
def status
  StatusesInverted[self[:status].to_i] || StatusesInverted[1]
end

# Writer: reject unknown symbols, store the integer value.
def status=(value)
  if Statuses[value].nil?
    raise ArgumentError, "Invalid value \"#{value}\" for :status attribute of the #{self.class} model"
  end
  self[:status] = Statuses[value]
end

def unknown?
  status == :unknown
end

def draft?
  status == :draft
end

def published?
  status == :published
end

def approved?
  status == :approved
end
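For readers curious how such accessors can be generated, here is a plain-Ruby sketch built on `define_method`. It is not the plugin's implementation: `SimpleEnum`, `define_simple_enum` and the `@attributes` hash (standing in for ActiveRecord attributes) are hypothetical, and the real plugin supports more options.

```ruby
# Hypothetical re-implementation sketch; not the magic-enum source.
module SimpleEnum
  def define_simple_enum(name, values, default)
    inverted = values.invert

    # Reader: stored integer back to a symbol, with a fallback default.
    define_method(name) do
      inverted[@attributes[name].to_i] || inverted[default]
    end

    # Writer: reject unknown symbols, store the integer value.
    define_method("#{name}=") do |value|
      unless values.key?(value)
        raise ArgumentError, "Invalid value \"#{value}\" for :#{name} attribute"
      end
      @attributes[name] = values[value]
    end

    # One predicate per value, e.g. draft? and published?
    values.each_key do |key|
      define_method("#{key}?") { public_send(name) == key }
    end
  end
end

class Document
  extend SimpleEnum

  def initialize
    @attributes = {}
  end

  define_simple_enum :status, { :unknown => 0, :draft => 1, :published => 2, :approved => 3 }, 1
end

doc = Document.new
doc.status = :published
p doc.status
p doc.published?
```

Note that an unset attribute converts to 0 via `to_i`, so the default only applies to stored integers that have no matching symbol; the expanded code above behaves the same way.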

This plugin was originally developed for Best Tech Videos and was later cleaned up in the Scribd repository and released to the public.

Links: Project Home Page on GitHub | RDoc Documentation.

rlibsphinxclient

A Ruby wrapper for the pure C searchd client API library. It works much faster than any pure Ruby client for Sphinx, so you can try it to make sure your application works as fast as possible.

Please note: this is a *highly experimental* library, so use it at your own risk.

Links: Project Home Page on GitHub | RDoc Documentation.

rscribd

Ruby client library for the Scribd API. This gem provides a simple and powerful library for the Scribd API, allowing you to write Ruby applications or Ruby on Rails websites that upload, convert, display, search, and control documents in many formats. For more information on the Scribd platform, visit the Scribd Platform Documentation page.

The main features are:

Upload your documents to Scribd’s servers and access them using the gem
Upload local files or from remote web sites
Search, tag, and organize documents
Associate documents with your users’ accounts

Links: Project Home Page on GitHub | Scribd Platform Documentation | RDoc Documentation.

Rspec Cells

This plugin allows you to test your cells easily using RSpec. Basically, it adds an example group especially for cells, with several helpers to perform cells rendering.

If you are not sure what Cells is, please visit its home page.

Spec for a regular cell could look like:

describe VideoCell do
  integrate_views

  context '.videos' do
    it 'should initialize :videos variable' do
      params[:id] = 10
      session[:user_id] = 20
      opts[:opt] = 'value'
      result = render_cell :videos, { :videos => [] }, :slug => 'hello'
      result.should have_tag('div', :class => :videos)
    end
  end
end

Links: Project Home Page on GitHub | Cells Home Page | Cells Home Page on GitHub.

Scribd Desktop Uploader

A fully native Cocoa Macintosh uploader app for the Scribd.com website. It supports the following features:

Upload many files at once from your desktop.
Edit titles, tags, and other metadata before uploading.
Quickly and easily manage bulk uploads, straight from your desktop.
Right-click to start uploading files directly to Scribd (Windows only).

Links: Project Home Page on GitHub | Home Page on Scribd.com.

Changelog

September 9, 2009
- Fixed a link to GitHub homepage of the rspec-cells plugin.
September 8, 2009
- Added rspec-cells plugin.

The post Scribd open source projects first appeared on Dmytro Shteflyuk's Home.

Sphinx Client API 0.3.1 and 0.4.0 r909 for Sphinx 0.9.8 r909 released

Dmytro Shteflyuk — Sun, 09 Dec 2007 19:33:10 +0000

I have good news: Sphinx Client API has been updated and now supports all the brand new features of the unstable Sphinx 0.9.8 development snapshot. What does this mean for you as a developer? What features will you get if you decide to switch to the new version? In this article I will describe the most valuable improvements in Sphinx and show how to use them with the new Sphinx Client API 0.4.0 r909.

Multi-query support
Extended engine V2
64-bit document and word IDs support
Multiple-valued attributes
Geodistance feature
Download

Multi-query support

What does it mean? Multi-query support means sending multiple search queries to Sphinx at once. It saves network connection overhead and other round-trip costs. But what’s much more important, it unlocks possibilities to optimize “related” queries internally. Here is a quote from the Sphinx home page:

One typical Sphinx usage pattern is to return several different “views” on the search results. For instance, one might need to display per-category match counts along with product search results, or maybe a graph of matches over time. Yes, that could be easily done earlier using the grouping features. However, one had to run the same query multiple times, but with different settings.

From now on, if you submit such queries through newly added multi-query interface (as a side note, ye good olde Query() interface is not going anywhere, and compatibility with older clients should also be in place), Sphinx notices that the full-text search query is the same and it is just sorting/grouping settings which are different. In this case it only performs expensive full-text search once, but builds several different (differently sorted and/or grouped) result sets from retrieved matches. I’ve seen speedups of 1.5-2 times on my simple synthetic queries; depending on different factors, the speedup could be even greater in practice.

To perform a multi-query, add several queries using the AddQuery method (its parameters are exactly the same as in a Query call) and then call RunQueries. Please note that all parameters, filters and query settings are preserved between AddQuery calls. This means that if you set the sort mode with SetSortMode before the first AddQuery call, the second AddQuery call will use the same sort mode. Currently you can reset only filters (using ResetFilters) and group-by settings (using ResetGroupBy). By the way, you can still use Query as usual to perform a single query, but don’t call it after you have added a query to the batch with AddQuery.

Enough talk, let’s look at an example:

sphinx = Sphinx::Client.new
sphinx.SetFilter('group_id', [1])
sphinx.AddQuery('wifi')

sphinx.ResetFilters
sphinx.SetFilter('group_id', [2])
sphinx.AddQuery('wifi')

results = sphinx.RunQueries
pp results

As a result we get an array of two hashes:

[{"total_found"=>2,
"status"=>0,
"matches"=>
[{"attrs"=>{"group_id"=>1, "created_at"=>1175658647}, "weight"=>2, "id"=>3},
{"attrs"=>{"group_id"=>1, "created_at"=>1175658490}, "weight"=>1, "id"=>1}],
"error"=>"",
"words"=>{"wifi"=>{"hits"=>6, "docs"=>3}},
"time"=>"0.000",
"attrs"=>{"group_id"=>1, "created_at"=>2},
"fields"=>["name", "description"],
"total"=>2,
"warning"=>""},
{"total_found"=>1,
"status"=>0,
"matches"=>
[{"attrs"=>{"group_id"=>2, "created_at"=>1175658555}, "weight"=>2, "id"=>2}],
"error"=>"",
"words"=>{"wifi"=>{"hits"=>6, "docs"=>3}},
"time"=>"0.000",
"attrs"=>{"group_id"=>1, "created_at"=>2},
"fields"=>["name", "description"],
"total"=>1,
"warning"=>""}]

Each hash contains the same data as the result of a Query call. Each also has the additional fields error and warning, which contain the error and warning messages respectively when not empty.

Note: I added a ResetFilters call before creating the second query. Without this call the second query would have two filters with conflicting conditions, so it would return no results at all.
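A small sketch of how the per-query error and warning fields might be checked. The `results` array below is hand-written to mimic the shape of the output above, and the second entry with its warning text is made up for illustration; no Sphinx server is involved.

```ruby
# Hand-written data in the shape of a RunQueries result (illustrative only).
results = [
  { 'total_found' => 2, 'error' => '', 'warning' => '', 'matches' => [{ 'id' => 3 }, { 'id' => 1 }] },
  { 'total_found' => 0, 'error' => '', 'warning' => 'example warning', 'matches' => [] }
]

# Summarize each result set, looking at error first and warning second.
summaries = results.each_with_index.map do |result, index|
  if !result['error'].empty?
    "query #{index}: failed (#{result['error']})"
  elsif !result['warning'].empty?
    "query #{index}: #{result['total_found']} found, warning: #{result['warning']}"
  else
    "query #{index}: #{result['total_found']} found"
  end
end

puts summaries
```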

Extended engine V2

The new querying engine (codenamed “extended engine V2”) is going to gradually replace all the currently existing matching modes. At the moment it is fully identical to extended mode in functionality, but is much less CPU intensive for some queries. Here are notes from the Sphinx author:

I have already seen improvements of up to 3-5 times in extreme cases. The only currently known case when it’s slower is processing complex extended queries with tens to thousands keywords; but forthcoming optimizations will fix that.

V2 engine is currently in alpha state and does not affect any other matching mode yet. Temporary SPH_MATCH_EXTENDED2 mode was added to provide a way to test it easily. We are in the middle of extensive internal testing process (under simulated production load, and then actual production load) right now. Your independent testing results would be appreciated, too!

So, to use the new engine we should set the SPH_MATCH_EXTENDED2 matching mode. Let’s do it!

sphinx = Sphinx::Client.new
sphinx.SetMatchMode(Sphinx::Client::SPH_MATCH_EXTENDED2)
sphinx.Query('wifi')

Easy enough, right? You should try it yourself to feel the power of the new engine. Please note that this mode is temporary and will be removed after the release.

64-bit document and word IDs support

Before version 0.9.8 Sphinx was limited to indexing up to 4 billion documents because it used 32-bit keys. It can now use 64-bit IDs, and the new feature does not affect the performance of 32-bit keys. Let’s look at an example. First we will query an index with 32-bit keys:

sphinx = Sphinx::Client.new
result = sphinx.Query('wifi')
pp result['matches'][0]['id'].class

As you can see, the class of the id field is Fixnum. Now try querying an index with 64-bit keys: you will get Bignum as the result, which means that you can have more than 4 billion documents!
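To show where the 4 billion limit comes from and how a 64-bit ID can be carried as two 32-bit words, here is a plain-Ruby sketch. The `pack_id64` and `unpack_id64` helpers are hypothetical, and the big-endian high-word-first layout is an assumption for illustration rather than a statement about the client's wire format. Also note that Ruby 2.4 and later report Integer for both small and large values instead of Fixnum and Bignum.

```ruby
# Largest document ID that fits in 32 bits (a little over 4 billion).
MAX_32BIT_ID = 2**32 - 1

# Split a 64-bit ID into high and low 32-bit words, packed big-endian.
def pack_id64(id)
  [id >> 32, id & 0xFFFFFFFF].pack('NN')
end

# Reverse of pack_id64: combine the two words back into one integer.
def unpack_id64(bytes)
  high, low = bytes.unpack('NN')
  (high << 32) | low
end

big_id = MAX_32BIT_ID + 10
packed = pack_id64(big_id)
p packed.bytesize
p unpack_id64(packed) == big_id
```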

Multiple-valued attributes

Plain attributes allow only one value per document. However, there are cases (such as tags or categories) when it is necessary to attach multiple values of the same attribute and be able to apply filtering to value lists. In these cases we can now use multiple-valued attributes.

sphinx = Sphinx::Client.new
sphinx.SetFilter('tag', [1,5])
pp sphinx.Query('wifi')

When tag is a multiple-valued attribute you will get a result like:

{"total_found"=>2,
"status"=>0,
"matches"=>
[{"attrs"=>
{"tag"=>[4, 5],
"group_id"=>2,
"created_at"=>1175658555},
"weight"=>2,
"id"=>2},
{"attrs"=>
{"tag"=>[1, 2, 3],
"group_id"=>1,
"created_at"=>1175658490},
"weight"=>1,
"id"=>1}],
"error"=>"",
"words"=>{"wifi"=>{"hits"=>6, "docs"=>3}},
"time"=>"0.000",
"attrs"=>
{"price"=>5,
"tag"=>1073741825,
"is_active"=>4,
"group_id"=>1,
"created_at"=>2},
"fields"=>["name", "description"],
"total"=>2,
"warning"=>""}

As you can see, multiple-valued attributes are returned as arrays of integers.
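The result above suggests that a document passes the filter when its value list contains at least one of the requested values: tags [4, 5] and [1, 2, 3] both match the filter [1, 5]. The sketch below reproduces that behaviour on plain Ruby data; `filter_mva` is a hypothetical helper and document 3 is made up to show a non-match.

```ruby
# Documents shaped like the matches above (document 3 is made up).
documents = [
  { 'id' => 1, 'tag' => [1, 2, 3] },
  { 'id' => 2, 'tag' => [4, 5] },
  { 'id' => 3, 'tag' => [7] }
]

# Keep documents whose value list shares at least one value with the filter.
def filter_mva(documents, attribute, values)
  documents.select { |doc| !(doc[attribute] & values).empty? }
end

p filter_mva(documents, 'tag', [1, 5]).map { |doc| doc['id'] }
```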

Geodistance feature

Sphinx is now able to compute the geographical distance between two points specified by latitude and longitude pairs (in radians). So you can now specify a per-query “anchor point” (and the attribute names to fetch per-entry latitude and longitude from), and then use the “@geodist” virtual attribute both in filters and in the sorting clause. In this case the distance (in meters) from the anchor point to each match will be computed, used for filtering and/or sorting, and returned as a virtual attribute too.

sphinx = Sphinx::Client.new
sphinx.SetGeoAnchor('lat', 'long', 0.87248, 0.63195)
result = sphinx.Query('wifi')
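Since the anchor has to be given in radians and the distance comes back in meters, a small conversion sketch may help. The haversine formula below is a common way to compute great-circle distance; I am not claiming it is the exact formula Sphinx uses. The helper names, the Earth radius constant and the second point are assumptions for illustration.

```ruby
EARTH_RADIUS_M = 6_371_000.0 # mean Earth radius in meters (assumed constant)

# Convert degrees, as usually stored, to the radians the anchor expects.
def to_radians(degrees)
  degrees * Math::PI / 180.0
end

# Haversine great-circle distance in meters; all arguments are in radians.
def haversine_m(lat1, long1, lat2, long2)
  a = Math.sin((lat2 - lat1) / 2)**2 +
      Math.cos(lat1) * Math.cos(lat2) * Math.sin((long2 - long1) / 2)**2
  2 * EARTH_RADIUS_M * Math.asin(Math.sqrt(a))
end

anchor = [0.87248, 0.63195]                      # same anchor as above, in radians
point  = [to_radians(50.45), to_radians(30.52)]  # hypothetical document location
puts haversine_m(*anchor, *point).round
```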

Download

As always, you can download Sphinx Client API from the project home page. Take into account that version 0.3.1 of the client API is intended for use with Sphinx 0.9.7, while Sphinx Client API 0.4.0 r909 requires the Sphinx 0.9.8 r909 development snapshot. You can download Sphinx from the Download section of the Sphinx home page.

The post Sphinx Client API 0.3.1 and 0.4.0 r909 for Sphinx 0.9.8 r909 released first appeared on Dmytro Shteflyuk's Home.

Sphinx Search Engine 0.9.7, Ruby Client API 0.3.0

Dmytro Shteflyuk — Thu, 05 Apr 2007 14:44:36 +0000

It has happened! We have all been waiting for a Sphinx update, and Andrew Aksyonoff has finally released version 0.9.7 of his wonderful search engine (if you don’t know about it, see my previous posts here and here).

Major Sphinx updates include:

separate groups sorting clause in group-by mode
support for 1-grams, prefix and infix indexing
improved documentation

Now about Sphinx Client API for Ruby. In this version I decided that it is not good to have different interfaces in different languages (BuildExcerpts in PHP and build_excerpts in Ruby). Therefore, applications that use version 0.1.0 or 0.2.0 of the API should be reviewed after the update. Check the documentation for details.

New things in the Sphinx Ruby API:

Completely synchronized API with PHP version.
Fixed a bug with processing attributes in query response (thanks to shawn).
Fixed a bug with rounding of query processing time (thanks to michael).
100% covered by RSpec specifications.

You can always download the latest version from the Sphinx Client API for Ruby page.

If you are using Sphinx in your Ruby on Rails application, you should try the acts_as_sphinx plugin.

The post Sphinx Search Engine 0.9.7, Ruby Client API 0.3.0 first appeared on Dmytro Shteflyuk's Home.

Sphinx 0.9.7-RC2 released, Ruby API updated

Dmytro Shteflyuk — Wed, 20 Dec 2006 06:33:29 +0000

Today I found that Sphinx search engine has been updated. Major new features include:

extended query mode with boolean, field limits, phrases, and proximity support (eg.: @title "hello world"~10 | @body example program);
extended sorting mode (eg.: @weight DESC @id ASC);
combined phrase+statistical ranking which takes words frequencies into account (currently in extended mode only);
official Python API;
contributed Perl and Ruby APIs.

I have updated Sphinx Client Library along with Sphinx 0.9.7-RC2 Windows build.

The post Sphinx 0.9.7-RC2 released, Ruby API updated first appeared on Dmytro Shteflyuk's Home.

Using Sphinx search engine in Ruby on Rails

Dmytro Shteflyuk — Sun, 26 Nov 2006 08:55:20 +0000

Almost every web application needs search logic, and quite often this logic should have full-text search capabilities. If you are using a MySQL database, you can use its FULLTEXT search, but it is not efficient when you have a large amount of data. In this case third-party search engines are used, and one of them (and, I think, the most efficient) is Sphinx. In this article I’ll present my port of the Sphinx client library to Ruby and show how to use it.

First of all, what is the Sphinx itself? Sphinx is a full-text search engine, meant to provide fast, size-efficient and relevant fulltext search functions to other applications. Sphinx was specially designed to integrate well with SQL databases and scripting languages. Currently built-in data sources support fetching data either via direct connection to MySQL, or from an XML pipe.

Current Sphinx distribution includes the following software:

indexer: a utility to create fulltext indices;
search: a simple (test) utility to query fulltext indices from command line;
searchd: a daemon to search through fulltext indices from external software (such as Web scripts);
sphinxapi: a set of API libraries for popular Web scripting languages (currently, PHP);

I will not describe how to install the engine; if you are new to Sphinx, see the official documentation (but if you want my take, just ask in the comments and I will explain the installation procedure in a future post). Instead I will present a port of the Sphinx client library to Ruby and show how to use it (to use this library you need Sphinx 0.9.7-RC2).

First you need to download the plugin from RubyForge, or from this site.

This is a Ruby on Rails plugin, so just unpack it into your /vendor/plugins directory (the library can also be used outside a Rails application). Now you can write something like the following in your code:

sphinx = Sphinx.new
sphinx.set_match_mode(Sphinx::SPH_MATCH_ANY)
result = sphinx.query('term1 term2')

# Fetch corresponding models
ids = result[:matches].map { |id, value| id }.join(',')
posts = Post.find :all, :conditions => "id IN (#{ids})"

# Get excerpts
docs = posts.map { |post| post.body }
excerpts = sphinx.build_excerpts(docs, 'index', 'term1 term2')

It’s pretty simple, isn’t it? There are several options you can use to get more relevant search results:

set_limits(offset, limit) – first document to fetch and number of documents.
set_match_mode(mode) – matching mode (can be SPH_MATCH_ALL – match all words, SPH_MATCH_ANY – match any of words, SPH_MATCH_PHRASE – match exact phrase, SPH_MATCH_BOOLEAN – match boolean query).
set_sort_mode(mode) – sorting mode (can be SPH_SORT_RELEVANCE – sort by document relevance desc, then by date, SPH_SORT_ATTR_DESC – sort by document date desc, then by relevance desc, SPH_SORT_ATTR_ASC – sort by document date asc, then by relevance desc, SPH_SORT_TIME_SEGMENTS – sort by time segments (hour/day/week/etc) desc, then by relevance desc).

Other options can be found in the API documentation.

If you are interested in this library, have found bugs or have ideas on how to improve it, please leave a comment.

Updated: Unfortunately, there are no Windows binaries for the latest Sphinx 0.9.7-rc2 version. I’ve built Sphinx for Windows and added my config file to the archive. You can download my build here.
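For paging through results, the offset and limit passed to set_limits can be derived from a page number. The `limits_for` helper below is hypothetical and only does the arithmetic; the commented-out call shows where its result would go with the client from the example above.

```ruby
# Hypothetical helper: turn a 1-based page number into [offset, limit].
def limits_for(page, per_page = 20)
  page = 1 if page < 1 # treat invalid page numbers as the first page
  [(page - 1) * per_page, per_page]
end

offset, limit = limits_for(3, 10)
p [offset, limit]
# With a client instance this would be passed on as:
#   sphinx.set_limits(offset, limit)
```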

The post Using Sphinx search engine in Ruby on Rails first appeared on Dmytro Shteflyuk's Home.
