SWISH Bug Fixes and Enhancements

Bug Fixes

The following bugs have been fixed in SWISH-E:

Wild card *
problem before fix: in a multi-word search, the results varied with the position of the term containing the asterisk in the query.
Merge option -M
problem before fix: the merged file was not written in the right format, so any search on that index caused swish to hang.
Unary operator "not"
problem before fix: unreliable results
Explicit nested boolean
problem before fix: unreliable results

New Features

- Ignore specified characters when in final position.
  It is sometimes convenient for certain characters to be treated as normal
characters in the middle of a word but disregarded in final position.
To use this option, put the following lines in the config.h file:

#define IGNORELAST 1
#define IGNORELASTCHAR "<list of char>"

For example, if "." is listed in the IGNORELASTCHAR variable, words
will be indexed as follows:
Word            Indexed as
z39.50          z39.50
z39.50.         z39.50

  Note that the characters listed in the IGNORELASTCHAR variable must
also be listed in the ENDCHARS variable; otherwise the word is discarded
as invalid. The characters are written in sequence within the quotes,
with no separators between them.
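
As a rough illustration of the rule above (not the actual SWISH-E code, which is written in C), the following Python sketch drops listed characters from the final position of a word. The function name strip_ignored_last is hypothetical, and removing several repeated trailing characters is an assumption; the table above only shows a single trailing period being dropped.

```python
# Illustrative sketch only; names and behavior are assumptions.
IGNORELASTCHAR = "."  # plays the role of the config.h variable of the same name


def strip_ignored_last(word, ignore_chars=IGNORELASTCHAR):
    """Drop characters listed in ignore_chars from the end of word.

    Assumption: repeated trailing characters are all removed.
    Characters in the middle of the word are left untouched.
    """
    return word.rstrip(ignore_chars)


for raw in ("z39.50", "z39.50."):
    print(raw, "->", strip_ignored_last(raw))
```

Both inputs come out as z39.50, matching the "Indexed as" column above.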

- Printing of removed common words
  This version of swish automatically prints all the words that are not
indexed because they are too common according to the limits set in the
PLIMIT and FLIMIT variables in the config.h file.
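
The config.h example at the end of this document describes a common word as one that occurs in over PLIMIT percent of all indexed files and in over FLIMIT files. A minimal Python sketch of that test follows; the function name is_too_common and the document counts are made up for illustration, and SWISH-E's own implementation may differ in detail.

```python
PLIMIT = 80    # percentage threshold, as in the config.h example
FLIMIT = 256   # absolute file-count threshold, as in the config.h example


def is_too_common(files_with_word, total_files, plimit=PLIMIT, flimit=FLIMIT):
    """Return True when a word exceeds both limits."""
    if total_files == 0:
        return False
    percent = 100.0 * files_with_word / total_files
    return percent > plimit and files_with_word > flimit


# Hypothetical per-word file counts for a 1000-file collection.
counts = {"the": 990, "swish": 300, "z39.50": 12}
removed = sorted(w for w, n in counts.items() if is_too_common(n, 1000))
print("removed as too common:", removed)
```

Because both limits must be exceeded, a word found in 9 of 10 files is kept: it passes the percentage limit but not the file-count limit.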


- META data tag support
  It is now possible to search META tags for words associated with a
particular metaName.

  There are two ways to associate a word with a metaName:

1) <META NAME="metaName" CONTENT="words"> the usual HTML tag used 
within <HEAD></HEAD>

2) <!--META START NAME="metaName" -->
    some text of any length
   <!--META END -->

The second form makes it possible to mark almost any part of the text. Note,
however, that words associated with metaNames are not searchable in a
plain search.

NOTE: Nested or overlapping META tags are not allowed and will lead to
unpredictable search results.
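
To make the two tag forms concrete, here is a small Python sketch that collects the words associated with each metaName from a piece of HTML. It is only an approximation under stated assumptions: the regular expressions, the function name meta_words, and splitting words on whitespace are all illustrative, whereas SWISH-E decides what counts as a word from its config.h character sets.

```python
import re

# Form 1: <META NAME="metaName" CONTENT="words">
META_TAG = re.compile(
    r'<META\s+NAME="([^"]+)"\s+CONTENT="([^"]*)"\s*>', re.IGNORECASE)
# Form 2: <!--META START NAME="metaName" --> text <!--META END -->
META_BLOCK = re.compile(
    r'<!--\s*META\s+START\s+NAME="([^"]+)"\s*-->(.*?)<!--\s*META\s+END\s*-->',
    re.IGNORECASE | re.DOTALL)


def meta_words(html):
    """Map each lower-cased metaName to the list of words associated with it."""
    found = {}
    for pattern in (META_TAG, META_BLOCK):
        for name, text in pattern.findall(html):
            found.setdefault(name.lower(), []).extend(text.split())
    return found


sample = '''<HEAD><META NAME="Author" CONTENT="Hill Hughes"></HEAD>
<!--META START NAME="subject" -->
indexing engines
<!--META END -->'''
print(meta_words(sample))
```

The sketch does not detect nested or overlapping tags, which the note above says are not allowed.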

Step-by-step indexing and search:
The user configuration file takes a new variable, MetaNames, listing the
metaNames that will be used in the files (see the user config file example
at the end of this doc). After adding the list of metaNames to the
file, indexing proceeds as usual:
%swish-e -c <config.file>

If a metaName found in a file during indexing is not listed in the
config.file, SWISH-E can either abort the indexing with an error, or
issue a warning naming the unlisted metaName and the file that contains
it and continue building the index, in which case the words are not
associated with any metaName. To make this choice, set the variable
OKNOMETA in the config.h file (see the config.h example at the end).
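
The two behaviours can be sketched in Python as follows. This is an assumption-laden model rather than SWISH-E's code: the function name resolve_meta_name is hypothetical, the warning text is invented, and an aborted indexing run is modelled by raising ValueError.

```python
import sys

OKNOMETA = 1  # 1: warn and continue; 0: abort (mirrors the config.h variable)


def resolve_meta_name(name, filename, known_names, ok_no_meta=OKNOMETA):
    """Return the lower-cased metaName if it is listed, or None if its
    words should be indexed without a metaName. Raises ValueError to
    model an aborted indexing run."""
    key = name.lower()
    if key in known_names:
        return key
    message = "metaName '%s' in file %s is not listed in the config file" % (
        name, filename)
    if ok_no_meta:
        print("Warning: " + message, file=sys.stderr)
        return None
    raise ValueError(message)


known = {n.lower() for n in ("NaMe1", "nAme2")}
print(resolve_meta_name("NAME1", "a.html", known))   # listed name
print(resolve_meta_name("other", "b.html", known))   # warning on stderr
```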

Meta names are case insensitive, so they can be written with any
combination of upper- and lower-case letters.

The search query has a slightly different syntax, of the form:
%swish-e -w "metaName = word" -f <index.file>

The equal sign indicates the presence of a metaName. The search
returns all the files where the META tag with NAME="metaName" has
CONTENT="word" (or where "word" is contained in the area marked by the
<!--META START...> and <!--META END..> tags).

It is not necessary to have spaces on either side of the '=',
so the following are equivalent:
%swish-e -w "metaName = word" -f <index.file>
%swish-e -w "metaName=word" -f <index.file>
%swish-e -w "metaName= word" -f <index.file>

To search on a word that contains a '=', precede the '=' with a '/':
%swish-e -w "test/=3 = x/=4 or y/=5" -f <index.file>
This query returns the files where the word "x=4" is associated with
the metaName "test=3" or that contain the word "y=5" not associated
with any metaName.
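
One way to picture how such queries break into pieces is the following Python tokenizer sketch. It is an illustration built from the rules stated above, not SWISH-E's parser: the function name tokenize and the ("word", ...) and ("op", "=") token shapes are assumptions, and parentheses are not handled.

```python
def tokenize(query):
    """Split a query into ("word", text) and ("op", "=") tokens.

    An unescaped '=' is a token of its own, with or without spaces
    around it, while '/=' contributes a literal '=' to the current word.
    """
    tokens, current = [], ""
    i = 0
    while i < len(query):
        if query.startswith("/=", i):      # escaped '=': part of the word
            current += "="
            i += 2
            continue
        ch = query[i]
        if ch == "=" or ch.isspace():      # both end the current word
            if current:
                tokens.append(("word", current))
                current = ""
            if ch == "=":
                tokens.append(("op", "="))
        else:
            current += ch
        i += 1
    if current:
        tokens.append(("word", current))
    return tokens


variants = ["metaName = word", "metaName=word", "metaName= word"]
print(all(tokenize(v) == tokenize(variants[0]) for v in variants))
print(tokenize("test/=3 = x/=4 or y/=5"))
```

The three spacing variants produce the same token list, and in the escaped query the words "test=3", "x=4" and "y=5" each survive as a single token.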

Queries can also be constructed using any of the usual search features,
and metaName and plain searches can be mixed in a single query:
%swish-e -w "metaName1 = (a1 or a4) not (a3 and a7)"  -f yyy

This query retrieves all the files in which "metaName1" is
associated with either "a1" or "a4" and that do not contain the words
"a3" and "a7", where "a3" and "a7" are not associated with any metaName.


config.h example
** SWISH Default Configuration File
** Kevin Hughes, kevinh@eit.com 
** 3/11/94
** Two variables added: IGNORELAST and IGNORELASTCHAR
**        G. Hill 3/12/97 ghill@library.berkeley.edu
** Added OKNOMETA so that indexing need not fail when a META name is
** not listed in the user configuration file
**        G. Hill 4/15/97 ghill@library.berkeley.edu
** The following are user-definable options that you can change
** to fine-tune SWISH's default options.

/* #define NEXTSTEP */

/* You may need to define this if compiling on a NeXTstep machine.

#define INDEXPERMS 0644

/* After SWISH generates an index file, it changes the permissions
** of the file to this mode. Change to the mode you like
** (note that it must be an octal number). If you don't want
** permissions to be changed for you, comment out this line.

#define PLIMIT 80
#define FLIMIT 256

/* SWISH uses these parameters to automatically mark words as
** being too common while indexing. For instance, if I defined PLIMIT
** as 80 and FLIMIT as 256, SWISH would define a common word as
** a word that occurs in over 80% of all indexed files and over
** 256 files. Making these numbers lower will most likely make your
** index files smaller. Making PLIMIT and FLIMIT small will also
** ensure that searching consumes only so much CPU resources.

#define VERBOSE 2

/* You can define VERBOSE to be a number from 0 to 3. 0 is totally
** silent operation; 3 is very verbose.

#define MAXHITS 500

/* MAXHITS is the maximum number of results to return from a search.


/* If a list of search words is specified without booleans,
** SWISH will assume they are connected by a default rule.
** This can be AND_RULE or OR_RULE.


/* This is how many lines deep SWISH will look into an HTML file to
** attempt to find a <TITLE> tag.


/* Normally, words within HTML comments are not assigned a higher
** relevance rank. If you're including keywords in comments
** define this as 1 so matching results will rise to the top
** of search results.


/* This is the minimum length of a word. Anything shorter will not
** be indexed.


/* This is the maximum length of a word. Anything longer will not
** be indexed.


/* If defined as 1, all entities in search words and indexed
** words will be converted to an ASCII equivalent. For instance,
** with this feature you can index the word "resum&eacute;" or
** "resum&#233;" and it will be indexed as the word "resume".
** If defined as 0, only numerical entities will be converted
** to named entities, if they exist.

#define IGNOREALLV 0
#define IGNOREALLC 0
#define IGNOREALLN 0

/* If IGNOREALLV is 1, words containing all vowels won't be indexed.
** If IGNOREALLC is 1, words containing all consonants won't be indexed.
** If IGNOREALLN is 1, words containing all digits won't be indexed.
** Define as 0 to allow words with consistent characters.
** Vowels are defined as "aeiou", digits are "0123456789".

#define IGNOREROWV 6
#define IGNOREROWC 8
#define IGNOREROWN 7

/* IGNOREROWV is the maximum number of consecutive vowels a word can have.
** IGNOREROWC is the maximum number of consecutive consonants a word can have.
** IGNOREROWN is the maximum number of consecutive digits a word can have.
** Vowels are defined as "aeiou", digits are "0123456789".

#define IGNORESAME 15

/* IGNORESAME is the maximum times a character can repeat in a word.

#define WORDCHARS "abcdefghijklmnopqrstuvwxyz=&#;0123456789.@\|/-"

/* WORDCHARS is a string of characters which SWISH permits to
** be in words. Any strings which do not include these characters
** will not be indexed. You can choose from any character in
** the following string:
** abcdefghijklmnopqrstuvwxyz0123456789_\|/-+=?!@$%^'\"`~,.[]{}()
** Note that if you omit "0123456789&#;" you will not be able to
** index HTML entities. DO NOT use the asterisk (*), lesser than
** and greater than signs (<), (>), or colon (:).
** Including any of these four characters may cause funny things to happen.
** If you have a pressing need to index 8-bit characters, please contact
** me for possible user testing in the future.
** Also note that if you specify the backslash character (\) or
** double quote (") you need to type a backslash before them to
** make the compiler understand them.

#define BEGINCHARS "abcdefghijklmnopqrstuvwxyz&0123456789"

/* Of the characters that you decide can go into words, this is
** a list of characters that words can begin with. It should be
** a subset of (or equal to) WORDCHARS.

#define ENDCHARS "abcdefghijklmnopqrstuvwxyz;0123456789,."

/* This is the same as BEGINCHARS, except you're testing for
** valid characters at the ends of words.

/* Note that if you really want to edit the default stopwords, (words
** that are deemed too common to be indexed) then you can do so in the
** file "swish.h". They don't have to be in alphabetical order.

#define IGNORELAST 1

/* Variable that, if set to 1, will cause the characters in IGNORELASTCHAR
** to be disregarded when in the final position in a word. This variable was
** introduced to solve the z39.50 problem: to have certain characters valid
** in the middle of a word but disregarded at the end, e.g. a period.
** Default is false.


/* Array that contains the characters that, while valid in the middle of
** a word, need to be disregarded when at the end. It is important to also
** list these characters in the ENDCHARS array, otherwise the word will not
** be indexed because it is considered invalid.

#define OKNOMETA 1
/* Variable that defines whether indexing may continue when a META name is not
** listed in the MetaNames variable. A value of 1 causes the words to be indexed
** as regular words with no metaName attached, and only a warning naming the
** meta name and the file in which it was found is issued.

#define INDEXTAGS 0

/* Normally, all data in tags in HTML files (except for words in
** comments) is ignored. If you want to index HTML files with the
** text within tags and all, define this to be 1 and not 0.
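
The WORDCHARS, BEGINCHARS and ENDCHARS settings in the config.h example above can be read as three membership tests on a candidate word. The Python sketch below shows that reading; it is an interpretation, not SWISH-E's code. The function name is_valid_word is hypothetical, the WORDCHARS string is a simplified copy of the example with the backslash left out, and the other limits (IGNOREROWV, IGNORESAME and so on) are not modelled.

```python
# Simplified copies of the config.h example values (backslash omitted).
WORDCHARS = "abcdefghijklmnopqrstuvwxyz=&#;0123456789.@|/-"
BEGINCHARS = "abcdefghijklmnopqrstuvwxyz&0123456789"
ENDCHARS = "abcdefghijklmnopqrstuvwxyz;0123456789,."


def is_valid_word(word):
    """Check a lower-cased candidate word against the three character sets."""
    if not word:
        return False
    return (all(ch in WORDCHARS for ch in word)
            and word[0] in BEGINCHARS
            and word[-1] in ENDCHARS)


for candidate in ("z39.50.", "-abc", "abc-", "a*b"):
    print(candidate, is_valid_word(candidate))
```

With these values "z39.50." passes because "." appears in ENDCHARS, which is why characters in IGNORELASTCHAR must also be listed there.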


User configuration file example

# Sample SWISH configuration file
# Kevin Hughes, kevinh@eit.com, 3/11/95
# Added MetaNames variable to support META tags
# G.Hill ghill@library.berkeley.edu 4/97

IndexDir /home/ghill/swish/dir5/records
# This is a space-separated list of files and
# directories you want indexed. You can specify
# more than one of these directives.

IndexFile /home/ghill/swish/dir5/myindex5
# This is what the generated index file will be.

MetaNames NaMe1 nAme2
# List of metaNames used in the files to index; names
# are case insensitive.

IndexName "Improvement index"
IndexDescription "This is an index to test bug fixes in swish." 
IndexPointer "http://xxxx"
IndexAdmin "Name, (e-mail address)"
# Extra information you can include in the index file.

IndexOnly .html
# Only files with these suffixes will be indexed.

IndexReport 3
# This is how detailed you want reporting. You can specify numbers
# 0 to 3 - 0 is totally silent, 3 is the most verbose.

FollowSymLinks no
# Put "yes" to follow symbolic links in indexing, else "no".

NoContents .gif .xbm .au .mov .mpg .pdf .ps
# Files with these suffixes will not have their contents indexed -
# only their file names will be indexed.

#ReplaceRules replace "/home/cleita/public_html/index/links" "http://sunsite.berkeley.edu/InternetIndex/Data"
# ReplaceRules allow you to make changes to file pathnames
# before they're indexed.

FileRules pathname contains admin testing demo trash construction confidential
FileRules filename contains # % ~ .bak .orig .old old.
FileRules title contains construction example pointers
FileRules directory contains .htaccess
# Files matching the above criteria will *not* be indexed.

IgnoreLimit 50 1000
# This automatically omits words that appear too often in the files
# (these words are called stopwords). Specify a whole percentage
# and a number, such as "80 256". This omits words that occur in
# over 80% of the files and appear in over 256 files. Comment out
# to turn off auto-stopwording.

#IgnoreWords SwishDefault
# The IgnoreWords option allows you to specify words to ignore.
# Comment out for no stopwords; the word "SwishDefault" will
# include a list of default stopwords. Words should be separated by spaces
# and may span multiple directives.
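
The layout of the file above (one directive per line, values separated by spaces, quoted strings, and comment lines starting with "#") can be read with a few lines of Python. This is a sketch under assumptions, since the document does not spell out SWISH-E's parsing rules: the function name parse_config is hypothetical, and a "#" is treated as a comment marker only at the start of a line so that the "#" value in FileRules is kept.

```python
import shlex


def parse_config(text):
    """Map each directive name to a list of value lists, one per occurrence."""
    settings = {}
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue  # blank line or comment line
        # comments=False keeps a '#' that appears inside a directive's values
        parts = shlex.split(stripped, comments=False)
        settings.setdefault(parts[0], []).append(parts[1:])
    return settings


sample = '''# Sample SWISH configuration file
IndexDir /home/ghill/swish/dir5/records
MetaNames NaMe1 nAme2
IndexName "Improvement index"
FileRules filename contains # % ~ .bak .orig .old old.
FileRules directory contains .htaccess
'''
config = parse_config(sample)
print(config["MetaNames"])
print(config["FileRules"])
```

Keeping one value list per occurrence preserves the grouping of repeated directives such as FileRules.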

SWISH is Copyright © 1989, 1991 Free Software Foundation, Inc.
59 Temple Place - Suite 330, Boston, MA 02111-1307, USA
SWISH-E is distributed with no warranty under the terms of the GNU Public License.
Public questions may be posted to the SWISH-E Discussion.
Document maintained at http://sunsite.berkeley.edu/SWISH-E/changes.html by the SunSITE Manager.
Last update 8/12/97. SunSITE Manager: manager@sunsite.berkeley.edu