SWISH was created by Kevin Hughes to fill the need of the growing number of Web administrators on the Internet - many current indexing systems are not well documented, are hard to use and install, and are too complex for their own good. The system was widely used for several years, long enough to collect some bug fixes and requests for enhancements. In Fall 1996, the Library of UC Berkeley received permission from Kevin Hughes to implement bug fixes and enhancements to the original binary. The result is SWISH-Enhanced, or SWISH-E, brought to you by the SWISH-E Development Team. For more information about its strengths, see SWISH-E Features. The SWISH-E release also includes AutoSwish, a Perl program that makes setting up, indexing, and maintaining your SWISH-E indexes a breeze. To see it in action, go to the AutoSwish Demonstration.
First, download the package from ftp://sunsite.berkeley.edu/pub/swish-e/. Then:
1. Place the files (swish-e.1.tar.Z, swish-efiles.1.tar.Z) on your computer
2. Uncompress them (uncompress swish-e*)
3. Unpack the archives (tar -xvf swish-e*)
4. Change to the swish-e directory
5. Read the README file for compilation instructions
Everything was written in C, so it should work just about anywhere. It has been tested on the following systems: Solaris 2.5.1 (on a Sun SPARCcenter 2000E), Digital UNIX (on a DEC Alpha), and BSDI 2.0.
The swish-e program can go under /usr/local/bin - you may want to put other SWISH-E things somewhere such as /usr/local/httpd/swish-e, if you're using NCSA's httpd. You'll also want to create a directory to hold SWISH-E databases, somewhere like /usr/local/httpd/swish-e/sources. You can store the files anywhere you like, as long as you remember where they are!
After you've compiled (and installed) SWISH-E, make sure the swish-e program is somewhere in your executable path (somewhere such as /usr/local/bin).
Also available is the AutoSwish CGI script. For complete installation instructions see the AutoSwish README.
If all this seems a little daunting, there is an easier way. AutoSwish is a CGI script that works in conjunction with SWISH-E and allows you or your users to index a Web site by simply filling out a form. AutoSwish will then automatically write a SWISH-E configuration file, index the directories and files that have been specified, create a CGI script for searching the index, and generate a fully functional search form, which can be used immediately. In just minutes your users can make their Web pages fully searchable without having to come running to you!
AutoSwish Demonstration: http://sunsite.berkeley.edu/SWISH-E/AutoSwish/
AutoSwish Readme: http://sunsite.berkeley.edu/SWISH-E/AutoSwish/README
Download AutoSwish: ftp://sunsite.berkeley.edu/pub/swish-e/
Searching with SWISH-E
In the SWISH-E distribution, there's a sample SWISH-E index (called sample.swish-e), and you can do a simple search on it. Try typing this:
swish-e -f sample.swish-e -w internet and resources and archie
This will search the file sample.swish-e for files containing all of the words internet, resources, and archie. You should get something back like this:
# SWISH-E format 1.1
search words: internet and resources and archie
# Name: Index of EIT's Web
# Saved as: sample.swish-e
# Counts: 7316 words, 94 files
# Indexed on: 12/03/95 17:50:43 PST
# Description: This is a full index of EIT's web site.
# Pointer: http://www.eit.com/cgi-bin/wwwwais/
# Maintained by: Kevin Hughes (kevinh@eit.com)
1000 http://www.eit.com/web/www.guide/guide.15.html "Guide to Cyberspace 6.1: Index/Glossary" 11566
360 http://www.eit.com/web/netservices.html "Internet Resources List" 48391
.
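The results tell you the format of the index, the search words used, and the index's name, file name, word and file counts, indexing date, description, pointer, and maintainer. Each result line then gives a relevance score (1000 is the best match), the URL or path of the file, its title in quotes, and its size in bytes. A single period on a line by itself marks the end of the results.
If there are errors, instead of the results list, you may get one of the following error lines. These lines will always be prefixed with "err:".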
err: no results
err: could not open index file
err: no search words specified
err: a word is too common
err: the index file is empty
err: the index file format is unknown
err: the Metaname <name> does not exist in the user configuration file.
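If you plan to read this output from a program, the layout above can be split into header fields, result lines, and an error message. The following is a minimal Python sketch, not part of the SWISH-E distribution. It assumes every result line has the shape shown in the sample (score, URL or path, quoted title, size in bytes), and the function name parse_swish_output is made up for illustration.

```python
import re

# Assumed shape of a result line: score, URL or path, quoted title, size in bytes.
RESULT_RE = re.compile(r'^(\d+)\s+(\S+)\s+"(.*)"\s+(\d+)$')


def parse_swish_output(text):
    """Split swish-e search output into header fields, results, and an error."""
    header, results, error = {}, [], None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line == ".":
            continue  # blank line or end-of-results marker
        if line.startswith("err:"):
            error = line[len("err:"):].strip()
        elif line.startswith("#"):
            # Header lines look like "# Name: value"; lines without a colon are skipped.
            key, sep, value = line[1:].partition(":")
            if sep:
                header[key.strip()] = value.strip()
        else:
            match = RESULT_RE.match(line)
            if match:
                score, url, title, size = match.groups()
                results.append({"score": int(score), "url": url,
                                "title": title, "size": int(size)})
    return header, results, error
```

A caller would typically check the error value first and only then walk the results list.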
SWISH-E has the capability to use configuration files in which you can specify all sorts of options for indexing. To use a configuration file, call it something such as swish-e.conf, and place it somewhere such as /usr/local/httpd/swish-e/. The configuration file below is an example of a typical SWISH-E configuration file:
# Sample SWISH configuration file
# Kevin Hughes, kevinh@eit.com, 3/11/95
#
# Added MetaNames variable to support META tags
# G.Hill ghill@library.berkeley.edu 4/97

IndexDir /home/ghill/swish/dir5/records
# This is a space-separated list of files and
# directories you want indexed. You can specify
# more than one of these directives.

IndexFile /home/ghill/swish/dir5/myindex5
# This is what the generated index file will be.

MetaNames NaMe1 nAme2
# List of metaNames used in the files to index; names
# are case insensitive.

IndexName "Improvement index"
IndexDescription "This is an index to test bug fixes in swish."
IndexPointer "http://xxxx"
IndexAdmin "Name, (e-mail address)"
# Extra information you can include in the index file.

IndexOnly .html
# Only files with these suffixes will be indexed.

IndexReport 3
# This is how detailed you want reporting. You can specify numbers
# 0 to 3 - 0 is totally silent, 3 is the most verbose.

FollowSymLinks no
# Put "yes" to follow symbolic links in indexing, else "no".

NoContents .gif .xbm .au .mov .mpg .pdf .ps
# Files with these suffixes will not have their contents indexed -
# only their file names will be indexed.

ReplaceRules replace "/home/cleita/public_html/index/links" "http://sunsite.berkeley.edu/InternetIndex/Data"
# ReplaceRules allow you to make changes to file pathnames
# before they're indexed.

FileRules pathname contains admin testing demo trash construction confidential
FileRules filename contains # % ~ .bak .orig .old old.
FileRules title contains construction example pointers
FileRules directory contains .htaccess
# Files matching the above criteria will *not* be indexed.

IgnoreLimit 50 1000
# This automatically omits words that appear too often in the files
# (these words are called stopwords). Specify a whole percentage
# and a number, such as "80 256". This omits words that occur in
# over 80% of the files and appear in over 256 files. Comment out
# to turn off auto-stopwording.

#IgnoreWords SwishDefault
# The IgnoreWords option allows you to specify words to ignore.
# Comment out for no stopwords; the word "SwishDefault" will
# include a list of default stopwords. Words
# should be separated by spaces and may span multiple directives.
To index a site using the options in a configuration file, type:
swish-e -c /usr/local/httpd/swish-e/swish-e.conf
This runs swish-e and indexes your site.
Taking as an example the above configuration in the script, you'd have the directory /usr/local/httpd/swish-e/sources and one file called index.swish-e in the directory. The name of the database you've just created is index.swish-e.
You can specify variables and values in the configuration file by typing the variable name (it's not case sensitive), a space (tabs are OK), and the value you want for the variable. If the value has spaces, you can enclose it in quotes to keep the space. If you want to specify multiple values, separate the values with a single space. In the configuration file, lines beginning with a hash mark (#) and blank lines are ignored. MetaNames must be one word with no quotes.
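To make these syntax rules concrete, here is a rough Python sketch that reads configuration text the way the paragraph above describes it. It is not swish-e's own parser. The function name parse_config is hypothetical, and the choice to accumulate the values of a repeated variable is an assumption that fits directives such as IndexDir but may not match how swish-e treats every variable.

```python
import shlex


def parse_config(text):
    """Parse swish-e style configuration text into {variable: [values, ...]}."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and lines beginning with # are ignored
        # posix=True drops the quotes but keeps the spaces inside them;
        # comments=False keeps a literal "#" that appears among the values.
        tokens = shlex.split(line, comments=False, posix=True)
        name, values = tokens[0].lower(), tokens[1:]  # names are not case sensitive
        # Assumption: a repeated variable adds to the earlier values.
        settings.setdefault(name, []).extend(values)
    return settings
```

For example, the line IndexName "Improvement index" yields the single value Improvement index, while IndexDir /usr/local/www /src/code.html yields two values.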
IndexDir directory
The IndexDir variable tells swish-e what directories and files to index. Each specified directory will be indexed recursively. You can use more than one of these directives - here are some examples:
IndexDir /usr/local/www /src/code.html
IndexDir /users/tony/public_html/home.html /web
IndexFile indexfile
The IndexFile variable tells swish-e what to save the indexed results as. Indexes generated by swish-e should have a suffix of .swish-e.
IndexOnly .suffix1 .suffix2 .suffix3 ...
Only files with these suffixes will be indexed. If you omit this variable, swish-e will index every file it comes across. Suffix checking is not case sensitive.
IndexReport 3
This variable can have the values 0 to 3. If you specify 3, swish-e will tell you what's going on while it's indexing, printing out directory and file names, number of words indexed, and so on, as well as give information about other operations. The value 0 will make swish-e completely silent.
FollowSymLinks value
Normally swish-e ignores symbolic links to files when indexing. If you want it to follow such links, define this value as yes, else define it as no.
NoContents .suffix1 .suffix2 .suffix3 ...
This variable lets you control what files will have their contents indexed. If a file with a suffix in this list is indexed, only its file name (and not any words in the file) will be indexed. This is useful because normally SWISH-E will try to index the contents of every file, even files without words (such as images or movies). Suffix checking is case-insensitive.
IgnoreWords word1 word2 ...
Here you can specify words to ignore when searching. Usually these words (called stopwords) are words that occur too many times in your data to make indexing them worthwhile. If you specify a word as SwishDefault, it will be replaced with swish-e's default list - a few hundred very common English words.
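IgnoreLimit number1 number2
After indexing, swish-e can automatically tell which words are the most common and omit them from the index according to these parameters. The first number is a whole percentage of the indexed files and the second is a number of files; a word is omitted when it occurs in over that percentage of the files and in over that number of files. Here are some examples:
1. IgnoreLimit 80 256
2. IgnoreLimit 50 50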
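The two IgnoreLimit thresholds work together, which is easy to get wrong when picking values. The Python sketch below only illustrates the rule as the sample configuration file describes it (a word is dropped when it occurs in over the given percentage of files and in over the given number of files). It is not swish-e's implementation, and the function name and the word counts are invented for the example.

```python
def auto_stopwords(file_counts, total_files, percent, min_files):
    """Return the words found in over `percent`% of files AND in over `min_files` files.

    file_counts maps each word to the number of files that contain it.
    """
    return sorted(
        word for word, n in file_counts.items()
        if n * 100 > percent * total_files and n > min_files
    )


# Hypothetical counts for a 100-file site.
counts = {"the": 95, "web": 60, "swish": 10}
print(auto_stopwords(counts, 100, 50, 50))   # both limits are exceeded by "the" and "web"
print(auto_stopwords(counts, 100, 80, 256))  # nothing appears in over 256 files
```

On a small site the second number is usually the one that prevents any word from being dropped, so it is worth setting it well below your total file count.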
Using IgnoreLimit and IgnoreWords can help trim the size of your index files considerably - experiment with parameters to see what works best at your site. You can also use IgnoreLimit to limit the CPU resources that searches take.
IndexName "value"
IndexDescription "value"
IndexPointer "value"
IndexAdmin "value"
These variables specify information that goes into index files to help users and administrators. IndexName should be the name of your index, like a book title. IndexDescription is a short description of the index or a URL pointing to a fuller description. IndexPointer should be a pointer to the original information, most likely a URL. IndexAdmin should be the name of the index maintainer and can include name and email information. These values should not be more than 70 or so characters and should be contained in quotes. Note that the automatically generated date in index files is in D/M/Y and 24-hour format.
The MetaNames variable specifies the meta names used in the .html files. Do not comment out or erase this line. Each meta name must be one word with no quotes.
When results are returned from swish-e searches, you may get a bunch of funny pathnames to files that you can't access. Using ReplaceRules, you can specify a series of operations to perform on the pathname result to change it into a URL and other things if you desire.
There are three operations you can specify: replace, append, and prepend. They will parse the pathname in the order you've typed these commands. More than one command and its arguments can appear on the same line, but it's easier to read when commands are broken up over a few lines. You can't put a command and its argument(s) on different lines, however.
Here's the syntax:
replace "the string you want replaced" "what to change it to"
This replaces all occurrences of the old string with the new one.
prepend "a string to add before the result"
append "a string to add after the result"
Study the above sample configuration file and try things out. You'll find that by having swish-e return URLs instead of pathnames, you can create interfaces to swish-e that can allow users to get to the search results over the World-Wide Web.
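To see what a chain of ReplaceRules does to a pathname before you commit to it, you can try the operations out on a few sample paths. The Python sketch below mimics the three operations as described above and applies them in the order given. It is an illustration only, the function name is made up, and www.example.org stands in for your own server.

```python
def apply_replace_rules(path, rules):
    """Apply (operation, argument, ...) rules to a pathname in the order given."""
    for rule in rules:
        op, args = rule[0].lower(), rule[1:]
        if op == "replace":
            old, new = args
            path = path.replace(old, new)  # replaces all occurrences
        elif op == "prepend":
            path = args[0] + path
        elif op == "append":
            path = path + args[0]
        else:
            raise ValueError(f"unknown ReplaceRules operation: {op}")
    return path


rules = [
    ("replace", "/usr/local/www", ""),
    ("prepend", "http://www.example.org"),
]
print(apply_replace_rules("/usr/local/www/docs/a.html", rules))
```

Because the operations run in order, a prepend placed before a replace can change what the replace matches, so it helps to test the rules on real pathnames from your site.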
You can specify certain file directives in the configuration file - any files or directories matching these criteria will be ignored and will not be indexed. Prepend all of these operations with the FileRules directive:
pathname contains string1 string2 string3 ...
Any path names containing exactly these strings, whether they be paths to directories or paths to files, will be ignored. Using this you can avoid indexing temporary directories or private material.
filename is filename
Any file name exactly matching the specified file name will be ignored (this is case-sensitive). This cannot be a path.
filename contains string1 string2 string3 ...
Any file name containing these strings will be ignored (this is not case-sensitive). This cannot be a path.
title contains string1 string2 string3 ...
Any HTML file with a title that contains these strings will be ignored (this is case-insensitive).
directory contains string1 string2 string3 ...
Any directory that contains any of these specified file names will be ignored (this is case-insensitive).
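The exclusion rules above can be sketched as a single test per file. This Python fragment is only an approximation for reasoning about which files a set of rules would skip, not the code swish-e runs. The dictionary keys are labels chosen for this example, and treating the pathname check as case-sensitive is an assumption, since the text above does not say.

```python
import posixpath


def is_excluded(path, title, dir_listing, rules):
    """Return True if any FileRules-style rule in `rules` matches the file.

    rules maps a rule label to a list of strings; dir_listing holds the names
    of the files that sit in the same directory as `path`.
    """
    name = posixpath.basename(path)
    lower_names = {entry.lower() for entry in dir_listing}
    return (
        # Assumption: pathname matching is case-sensitive.
        any(s in path for s in rules.get("pathname contains", []))
        # Exact, case-sensitive file name match.
        or name in rules.get("filename is", [])
        # The remaining checks ignore case, as described above.
        or any(s.lower() in name.lower() for s in rules.get("filename contains", []))
        or any(s.lower() in title.lower() for s in rules.get("title contains", []))
        or any(s.lower() in lower_names for s in rules.get("directory contains", []))
    )


rules = {
    "pathname contains": ["admin", "testing"],
    "filename contains": [".bak", "~"],
    "title contains": ["construction"],
    "directory contains": [".htaccess"],
}
print(is_excluded("/www/docs/new.html", "Under Construction", ["new.html"], rules))
```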
usage: swish-e [-i dir file ... ] [-c file] [-f file] [-l] [-v (num)]
       swish-e -w word1 word2 ... [-f file1 file2 ...] [-m num] [-t str]
       swish-e -M index1 index2 ... outputfile
       swish-e -D file
       swish-e -V
options: defaults are in brackets
  -i : create an index from the specified files
  -w : search for words "word1 word2 ..."
  -t : tags to search in - specify as a string
       "HBthec" - in head, body, title, header, emphasized, or comments
  -f : index file to create or search from [index.swish-e]
  -c : configuration file to use for indexing
  -v : verbosity level (0 to 3) [0]
  -l : follow symbolic links when indexing
  -m : the maximum number of results to return [40]
  -M : merges index files
  -D : decodes an index file
  -V : prints the current version
version: 1.0
   docs: http://sunsite.berkeley.edu/SWISH-E/
To see the usage, run swish-e with a -z or -? option.
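The -w option searches for the words you give it. If you don't specify an index file with -f, swish-e searches the file index.swish-e in the current directory. You don't need to put quotes around search words.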
You can use the booleans and, or, or not in searching. Without these booleans, SWISH-E will assume you're anding the words together. [Note: you can change the default to oring by changing the variable DEFAULT.RULE in the config.h file and recompiling SWISH-E.] Evaluation takes place from left to right only, although you can use parentheses to force the order of evaluation. The boolean operators are case sensitive: use lowercase only.
You can also use an asterisk (*) to truncate a search word. For example, by searching on "librar*" you will find all occurrences of "library", "libraries" and "librarians".
example 1: swish-e -w john and doe or jane
example 2: swish-e -w john and (doe or not jane)
example 3: swish-e -w not (john or jane) and doe
example 4: swish-e -w j* and doe
In example 3, john or jane will be evaluated first, a not operation will be performed on that, then everything will be anded with doe.
Example 4 returns all files that contain words starting with j and that also contain doe.
The equal sign indicates the presence of a metaName. The search returns all the files where the META tag with NAME="metaName" has CONTENT="word" (or where "word" is contained in the area marked by the <!--META START...> and <!--META END..> tags).
It is not necessary to have spaces at either side of the '=', consequently the following are equivalent:
example 1: swish-e -w "metaName = word" -f
example 2: swish-e -w "metaName=word" -f
example 3: swish-e -w "metaName= word" -f
To search on a word that contains a '=', put a '/' before the '=':
example: swish-e -w "test/=3 = x/=4 or y/=5" -f
This query returns the files where the word "x=4" is associated with the metaName "test=3" or that contain the word "y=5" not associated with any metaName.
Queries can also be constructed using any of the usual search features, and metaName and plain searches can be mixed in a single query.
example: swish-e -w "metaName1 = (a1 or a4) not (a3 and a7)" -f yyy
This query will retrieve all the files in which "metaName1" is associated with either "a1" or "a4" and that do not contain the words "a3" and "a7", where "a3" and "a7" are not associated with any meta name.
The -t option allows you to search for words that exist only in specific HTML tags. Each character in the string you specify in the argument to this option represents a different tag to search for the word in. H means all <HEAD> tags, B stands for <BODY> tags, t is all <TITLE> tags, h is <H1> to <H6> (header) tags, e is emphasized tags (this may be <B>, <I>, <EM>, or <STRONG>), and c is HTML comment tags (<!-- ... -->).
example 1: swish-e -w apples oranges -t t
example 2: swish-e -w keywords draft release -t c
example 3: swish-e -w world wide web -t the
The -m option specifies the maximum number of results to return while searching. The default is 40. If no numerical value is given, the default is assumed. If the value is 0 or the string all, there will be no limit to the number of results. The configuration file value overrides this value.
The -i option specifies the directories and/or files to index. Directories will be indexed recursively.
The -c option specifies the configuration file to use for indexing. You can use this as the only option to swish-e to do automatic indexing, if all the necessary variables are set in the configuration file.
If you specify a directory to index, an index file, or the verbose option on the command-line, these values will override any specified in the configuration file.
You can specify multiple configuration files in order to split up common preferences. For instance, you might store a file with the stopwords in it and have multiple other files that have different index file information.
example 1: swish-e -c swish-e.conf
example 2: swish-e -i /usr/local/www -f index.swish-e -v -c swish-e.conf
example 3: swish-e -c swish-e.conf stopwords.conf
In example 3, the variables in swish-e.conf will be read, then the variables in stopwords.conf will be read. Note that if the same variables occur in both files, older values may be written over.
The -f option names the index file. If you are indexing, this specifies the file to save the generated index in, and you can only specify one file. If you are searching, this specifies the index files (one or more) to search from. The default index file is index.swish-e in the current directory.
Specifying the -l option tells swish-e to follow symbolic links when indexing. The configuration file value will currently override the command-line value.
The -M option allows you to merge two or more index files - the last file you specify on the list will be the output file. Merging removes all redundant file and word data. To estimate how much memory the operation will need, sum up the sizes of the files to be merged and divide by two. That's about the maximum amount of memory that will be used. You can use the -v option to produce feedback while merging and the -c option with a configuration file to include new administrative information in the new index file.
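The memory rule of thumb for merging is simple arithmetic, and a short Python sketch makes it easy to check before starting a large merge. The function names here are invented for the example, and the result is only the rough estimate described above, not a guarantee.

```python
import os


def estimate_merge_memory(sizes_in_bytes):
    """Rough upper bound in bytes: half the combined size of the index files."""
    return sum(sizes_in_bytes) // 2


def index_sizes(paths):
    """Look up the size in bytes of each index file on disk."""
    return [os.path.getsize(p) for p in paths]


# Two hypothetical indexes of 4 MB and 6 MB need roughly 5 MB to merge.
print(estimate_merge_memory([4_000_000, 6_000_000]))
```

With real files you would pass index_sizes(["index1.swish-e", "index2.swish-e"]) (hypothetical file names) to estimate_merge_memory.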
The -D option is provided so you can check the word, file, and maintenance information in index files. You can specify multiple files to decode.
The -v option can take a numerical value from 0 to 3. Specify 0 for completely silent operation and 3 for detailed reports. If no value is given then 3 is assumed.
Once your SWISH-E index has been created, you will likely want to make it accessible to Web users. The manual method for doing this is to write a program that interfaces with the SWISH-E index and presents the results to the user. We are providing a simple Perl program for doing this. Edit the program variables at the top of the file for your specific index, and change any of the HTML coding in the body of the program until you get the desired result. Access to Perl is required.
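If you would rather write your own interface, its core is a call to swish-e with the search words and options. The following Python sketch is not the Perl program mentioned above. It shows one way to build the argument list from the options documented here (-w, -f, -m, -t). The function names are hypothetical, and passing each search word as a separate argument mirrors the unquoted command-line examples, which is an assumption about how your swish-e build reads its arguments.

```python
import subprocess


def build_search_command(words, index_files=("index.swish-e",),
                         max_results=None, tags=None, program="swish-e"):
    """Build an argument list for a swish-e search (no shell is involved)."""
    argv = [program, "-w", *words, "-f", *index_files]
    if max_results is not None:
        argv += ["-m", str(max_results)]
    if tags:
        argv += ["-t", tags]
    return argv


def run_search(argv):
    """Run the command and return its standard output as text."""
    return subprocess.run(argv, capture_output=True, text=True).stdout


print(build_search_command(["internet", "and", "resources"],
                           ["sample.swish-e"], max_results=10))
```

Handing subprocess a list rather than a single command string keeps the search words away from the shell, which matters when the words come from a Web form.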
Swish crashes and burns on a certain file. What can I do?
You can use a FileRules operation to exclude the particular file name, or pathname, or its title. If there are serious problems in indexing certain types of files, they may not have valid text in them (they may be binary files, for instance). You can use NoContents to exclude that type of file.
Swish isn't indexing a certain word or phrase.
By default, swish-e makes its best guesses as to what it thinks are reasonable words and filters out "garbage" words according to a set of rules. For instance, if swish-e encounters a word that has no vowels, it doesn't index it. You can change these rules by editing the conf.h file in the src directory of the swish-e distribution package. By editing the rules, you may be able to index quite a few more words, or fewer, depending on your preference.
How can I index all my compressed files?
Swish doesn't currently have the capability to do on-the-fly filtering of files. In the meantime, first index the uncompressed data, compress it, and using a ReplaceRules operation, change the suffix of indexed files to .Z or whatever is appropriate. That way users can retrieve the compressed information.
Can I index 8-bit text?
Yes, if the text uses the HTML equivalents for the ISO-Latin-1 (ISO8859-1) character set. Upon indexing swish-e will convert all numbered entities it finds (such as &#169;) to named entities (such as &copy;). To search for words including these codes, type the named entity (if it exists) in place of the 8-bit character. Swish will also convert entities to ASCII equivalents, so words that might look like this in HTML: resum&eacute; can be searched as this: resume.
How can I index phrases?
Currently the only way to do this is to use the HTML entity for a non-breaking space to represent a space in your HTML. It will then be indexed with a space. To search for the phrase, you'd have to enter the same entity to represent a space also.
How can I implement keywords in my documents?
In your HTML files you can put keywords in HTML META tags, such as:
<META NAME="DC.subject" CONTENT="digital libraries">
The above example uses the Dublin Core metadata draft standard and a proposed convention for embedding Dublin Core tags in HTML documents.
Then, to inform SWISH-E about the existence of DC.subjects in your documents, edit the line in your configuration file (using either AutoSwish or a text editor like vi or Pico) so it reads like so:
MetaNames DC.subject
You will also need to include any other MetaNames you are using in your files. Then, you can search your documents for words that occur within these META tags in your documents. See above for search syntax, or if you use AutoSwish, it will automatically create a search form that includes fields for searching those fields.
I want to generate a list of files to be indexed and pass it to swish-e.
One thing you can do is make a simple script to generate a configuration file full of IndexDir directives. For instance, make a separate file called files.conf and put something like this in it:
IndexDir /this_is_file_1/file.html
IndexDir /usr/local/www
IndexDir file2.html /some/directory/
...
Then call swish-e like this (assuming you're using a main swish-e.conf file):
swish-e -c swish-e.conf files.conf
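A script for generating such a file can be very short. The Python sketch below turns a list of paths into IndexDir lines. The function name is made up, and it assumes none of the paths contain spaces, since values are separated by spaces in the configuration file.

```python
def indexdir_lines(paths, per_line=1):
    """Return IndexDir directives as text, with `per_line` paths per directive."""
    lines = []
    for start in range(0, len(paths), per_line):
        chunk = paths[start:start + per_line]
        lines.append("IndexDir " + " ".join(chunk))
    return "".join(line + "\n" for line in lines)


paths = ["/this_is_file_1/file.html", "/usr/local/www", "file2.html", "/some/directory/"]
print(indexdir_lines(paths), end="")
```

Write the returned text to files.conf and pass that file to swish-e after your main configuration file.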
I run out of memory trying to index my files.
It's true that indexing can take up a lot of memory! One thing you can do is make many indices of smaller content instead of trying to do everything at once. You can then merge all the smaller pieces together.
SWISH is Copyright © 1989, 1991 Free Software Foundation, Inc.
59 Temple Place - Suite 330, Boston, MA 02111-1307, USA
SWISH-E is distributed with no warranty under the terms of the GNU Public License.
Public questions may be posted to the SWISH-E Discussion.
Document maintained at http://sunsite.berkeley.edu/SWISH-E/manual.html by the SunSITE Manager.
Last update October 8, 1997. SunSITE Manager: manager@sunsite.berkeley.edu