
Rubicon Search Speed

The speed of a Rubicon search is a function of several factors, including the type of search, ranking, match dataset creation, specificity of the search, and database performance.  The Rubicon TrbSearch component contains several properties that allow you to manage search performance. 

Type of Search

Most searches will be performed almost instantaneously.  Rubicon simply reads the indexes for each word in the search and performs the necessary boolean operations to determine the matching locations.

For a proximity search -- one that uses phrases, near searches, or the SubFieldNames option -- there is an additional step that requires extra processing.  Because a Rubicon index records only whether a word appears at a location, Rubicon cannot tell from the index where or how often the word appears there.  To process a proximity search, Rubicon must therefore read the text at each candidate location to determine whether it meets the proximity criteria.

Finally, broadly defined wildcard searches (e.g. "s*") can be time consuming because the number of words (and hence indexes read) matching the criteria is large.

Ranking

As described above, Rubicon indexes do not include information on how many times a word appears in a given location.  As a result, when ranking is enabled, Rubicon must read the text at all the matching locations in order to perform ranking.

Match Dataset Creation

Creating a Match dataset means creating a dataset and populating it with the matching records.  The fastest way to do this is to use an in-memory dataset, which avoids the overhead of creating the dataset on disk.  Reducing the number of fields included in the Match dataset also improves performance.

A Match dataset is a very convenient way to work with the search results, but it is not required.  A potentially faster approach is to use the TrbSearch FindFirst and FindNext methods to move to each matching record and create the search results in code.
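The navigation approach might look roughly like the sketch below.  It is not taken from the Rubicon sources: it assumes that FindFirst and FindNext return True while a matching record was found (as the TDataSet methods of the same name do), and Table1, ListBox1 and the 'Subject' field are hypothetical names for the searched table, a list box on the form, and a field worth displaying.

```pascal
procedure TForm1.ListMatchesClick(Sender: TObject);
begin
  rbSearch1.SearchFor := Edit1.Text;
  rbSearch1.Execute;

  ListBox1.Items.BeginUpdate;
  try
    ListBox1.Items.Clear;
    // Assumption: FindFirst/FindNext return True when they have moved
    // the searched table (Table1, hypothetical) to a matching record.
    if rbSearch1.FindFirst then
      repeat
        // Build the result line in code instead of copying the record
        // into a Match dataset.  'Subject' is a hypothetical field.
        ListBox1.Items.Add(Table1.FieldByName('Subject').AsString);
      until not rbSearch1.FindNext;
  finally
    ListBox1.Items.EndUpdate;
  end;
end;
```

Because only the fields you actually display are read, nothing is copied that the user never sees.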

Matching Locations

If you just need to get the index values of the matching records, then call the TrbSearch MatchingLocations method.  This method returns a TList populated with the index values of the matching records.  Since the values of a TList are pointer types, you'll have to cast these as Integers (e.g. Integer(MyList[0]) is the index value of the first matching record).  Since this method requires no additional dataset calls, it is very fast.
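The cast can be tried in isolation with the small console program below.  It does not call Rubicon at all: the TList is filled by hand as a stand-in for the list MatchingLocations returns, and the index values 3, 17 and 42 are made up.  The intermediate NativeInt cast is an addition for 64-bit compilers, where a pointer no longer has the same size as an Integer; on 32-bit Delphi the plain Integer(MyList[0]) cast described above is sufficient.

```pascal
program MatchingLocationsCast;

{$IFDEF FPC}{$mode delphi}{$ELSE}{$APPTYPE CONSOLE}{$ENDIF}

uses
  Classes;

var
  Locations: TList;
  i: Integer;
begin
  // Stand-in for the list returned by TrbSearch.MatchingLocations:
  // each index value is stored directly in a pointer slot.
  Locations := TList.Create;
  try
    Locations.Add(Pointer(3));
    Locations.Add(Pointer(17));
    Locations.Add(Pointer(42));

    for i := 0 to Locations.Count - 1 do
      // Pointer -> NativeInt -> Integer keeps the cast valid when
      // pointers are 64 bits wide.
      WriteLn('Match ', i, ' has index value ',
              Integer(NativeInt(Locations[i])));
  finally
    Locations.Free;
  end;
end.
```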

Specificity of the Search

A search that is specific will result in fewer matching locations, and thus the amount of work required to check proximity, perform ranking, and create a Match dataset will be much less, resulting in better overall search performance.

Database Performance

All of the above factors depend on the performance of the database.  As a general rule, a local database such as Paradox will perform operations faster than a server-based database such as InterBase.

Searching External Files

When searching external files (e.g. HTML files), proximity searches and ranking incur the extra overhead of opening and closing the files.  Performance can be significantly improved if these files are placed in a database.

Converting to Plain Text

Converting formatted text to plain text can be time consuming, and this will slow down proximity searches and ranking.  Performance can be improved by creating and maintaining a plain text copy of the formatted text and having Rubicon index, update, and search the plain text version.

Managing Search Performance

There are several ways that the performance of a search can be managed so that the search completes within a certain amount of time.  In the following code the time consuming lines are #6 and #10:

 1: procedure TForm1.Button1Click(Sender: TObject); 
 2: begin
 3:  with rbSearch1 do
 4:  begin
 5:   SearchFor := Edit1.Text;
 6:   Execute;
 7:   Label1.Caption := Format('%d matches (%1.3f sec)',
 8:                            [MatchCount,SearchTime/1000]);
 9:   if MatchCount > 0 then
10:    rbMatchMaker1.Execute
11:  end
12: end;

Line 6 will execute very quickly unless a proximity or broad wildcard search is performed.  The rbSearch1.TimeLimit property may be used to limit such lengthy operations to the number of milliseconds you wish to allow the search to run.  When the search times out, the results will reflect all the locations that were checked (for a proximity search) or all the words processed (for a wildcard search) during the time allowed.
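As a minimal sketch, the limit can be set just before the search is run.  The fragment belongs inside a handler such as Button1Click above and uses the same rbSearch1 and Edit1 components; the 15000 ms value is only illustrative.

```pascal
// Allow proximity checking or wildcard expansion to run for
// at most 15 seconds (TimeLimit is expressed in milliseconds).
rbSearch1.TimeLimit := 15000;
rbSearch1.SearchFor := Edit1.Text;
rbSearch1.Execute;
```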

The rbSearch1.OnPreviewWord event may be used to catch broadly defined wildcard searches.  This is especially important for SQL databases because Rubicon will query the Words table with a SQL LIKE statement that will not be limited by the TimeLimit property.  An example of OnPreviewWord appears  below.

Two potentially time consuming activities are occurring in line 10.  The obvious one is that a Match dataset is being created.  If there are a large number of matching records, then these records have to be copied to the Match dataset.  Besides using an in-memory dataset (e.g. TClientDataSet) to speed things up, it may also be useful to use the rbMatchMaker1.MatchLimit property to limit the number of records copied to the Match dataset.  Another option is to use the rbMatchMaker1.TimeLimit property to limit the creation of the Match dataset to a certain number of milliseconds.

If ranking is enabled, then the second activity occurring in line 10 is the ranking process.  By using the rbSearch1.RankLimit property, this activity can be restricted to ranking the first RankLimit matching locations.  For more information on ranking, please see Ranking and Navigation.
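The three limits that affect line 10 could be set together before rbMatchMaker1.Execute is called, for example in the form's OnCreate handler.  The values below are illustrative only and would need tuning for a real database; the sketch also assumes that each property takes a plain integer.

```pascal
procedure TForm1.FormCreate(Sender: TObject);
begin
  // Copy no more than 100 matching records into the Match dataset.
  rbMatchMaker1.MatchLimit := 100;
  // Stop building the Match dataset after 5 seconds (milliseconds).
  rbMatchMaker1.TimeLimit := 5000;
  // Rank only the first 500 matching locations.
  rbSearch1.RankLimit := 500;
end;
```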

Keep in mind that you do not need to create a Match dataset.  TrbMatchMaker is provided because creating a Match dataset is a convenient way of working with the search results, but it is not efficient since it is duplicating data.  Instead, you may wish to simply navigate to each matching record by calling rbSearch1.FindFirst and FindNext (if ranking is enabled, the ranking will be performed when FindFirst is first called).

At the Limit

When a search reaches one of the Limit property restrictions, which locations are left unprocessed?  Locations are processed from one end of the Text to the other end.  By default, this process starts at the beginning of the Text, but processing can begin at the end and move backwards through the Text by using the soNavReverse SearchOption.  Thus a proximity search that is cut short by a TimeLimit or ranking that is restricted to RankLimit will not have processed the locations at the end of the Text (or at the beginning of the Text if soNavReverse is set).  For more detailed information, please see Ranking and Navigation.
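Reversing the processing order might be done as sketched below.  The option name soNavReverse comes from the text above, but the name of the set property that holds it (SearchOptions here) is an assumption and should be checked against the TrbSearch reference.

```pascal
// Assumption: the search options are exposed as a set property
// named SearchOptions.  Adding soNavReverse makes processing start
// at the end of the Text, so when a limit is reached it is the
// locations at the beginning that are left unprocessed.
rbSearch1.SearchOptions := rbSearch1.SearchOptions + [soNavReverse];
```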

For wildcard searches that exceed the TimeLimit, the search engine will not have had time to read all the possible matching words, so the search results will be incomplete.

Managing the Search in Action

The Tamarack newsgroup search application uses all of the above techniques to manage search performance.  Note how the application limits search results to the most recent 100 matching messages.  Messages are added to the database in chronological order, so to process the table from the most recent to the least recent message, the soNavReverse search option is set.

If you try to perform a proximity search that uses common words (e.g. "borland delphi"), the search will end after 15 seconds because the TimeLimit property is set to 15000.  

Finally, if you try to perform a broad wildcard search (e.g. "b*"), you will get an error because the OnPreviewWord event is set to:

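procedure TForm1.rbSearch1PreviewWord(Sender: TObject;
           var Word: String);
var i : Integer;
begin
 // Reject the word if a wildcard appears in either of its first two
 // characters.  IMin returns the smaller of its two Integer arguments,
 // so a one-character word is checked without reading past its end.
 for i := 1 to IMin(2,Length(Word)) do
  if Word[i] in ['?','*'] then
   raise Exception.Create(
    'Wildcards must be preceded by at least two ' +
    'characters (e.g. "st*", not "s*" or "*")');
end;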
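To see which search words such a handler lets through, the same test can be pulled out into a plain function and tried from a console program.  This is only a sketch: WildcardTooEarly is a hypothetical name, and Min from the standard Math unit stands in for IMin so that the program compiles without Rubicon.

```pascal
program WildcardGuardDemo;

{$IFDEF FPC}{$mode delphi}{$ELSE}{$APPTYPE CONSOLE}{$ENDIF}

uses
  Math;

// Returns True when a wildcard appears in the first or second
// character, which is the condition the handler above rejects.
function WildcardTooEarly(const AWord: string): Boolean;
var
  i: Integer;
begin
  Result := False;
  for i := 1 to Min(2, Length(AWord)) do
    if (AWord[i] = '?') or (AWord[i] = '*') then
    begin
      Result := True;
      Exit;
    end;
end;

begin
  // Wildcard in third position: allowed.
  WriteLn('st*   rejected: ', WildcardTooEarly('st*'));
  // Wildcard in second or first position: rejected.
  WriteLn('s*    rejected: ', WildcardTooEarly('s*'));
  WriteLn('*     rejected: ', WildcardTooEarly('*'));
  // No wildcard at all: allowed.
  WriteLn('delphi rejected: ', WildcardTooEarly('delphi'));
end.
```

Keeping the test in a separate function also makes it easy to change the minimum prefix length later without touching the event handler.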


Copyright 2003 © Tamarack Associates 

www.TamarackA.com Last updated 05/18/01 www.FullTextSearch.com