
Chapter 28 Search Engines and Indexed Databases

by William Robert Stanek

CONTENTS

What Are Indexers and How Are They Used?
What Are Search Engines and How Are They Used?
Accessing a WAIS Database
Building an Indexed Database
- Installing and Configuring freeWAIS
- Installing and Configuring SWISH
Other Search Engines
Summary

The hypertext facilities of the World Wide Web put the world's most powerful search engines at your fingertips. Search engines are the gateways to the vast storehouses and databases available on the Web. Thousands of Web search engines are used every day, and if you've browsed the Web, you know online searches are easy to perform. You simply enter keywords, press Enter, and the search engine takes over.

A search engine is an application specifically designed and optimized for searching databases. Search engines can race through megabytes of information in a fraction of a second. They achieve this speed and efficiency thanks to an application called an indexer. An indexer is an application specifically designed and optimized for indexing files. Using the index built by the indexer, the search engine can jump almost immediately to the sections of the database containing the information you are looking for. Thus, creating an indexed database of documents at your Web site requires two applications: a search engine and an indexer. The search engine and indexer are normally part of a larger application, such as a Wide Area Information Server (WAIS).

The trick to creating Web documents that provide access to a search engine is knowing how to integrate the capabilities of the search engine using the existing structure of hypertext and CGI. This chapter unlocks the mysteries of search engines and indexed databases.

What Are Indexers and How Are They Used?

The index for Web Publishing Unleashed is an invaluable resource for quickly finding information in this book. Using the index, you can quickly find the topic or subtopic you want to learn more about. You do this by following an alphabetical listing of keywords selected by the human indexer who combed the manuscript in search of the gems you would be interested in.

The indexer used the text of the entire book to create an alphabetical list of keywords and related concepts. The alphabetical listing is broken down into categories and subcategories. The first categories are broad and divided based on the letters A to Z. Next, the broad categories are subdivided based on keywords. The keyword categories are sometimes further divided based on related concepts. For quick reference, a page number is associated with keywords and their related concepts. You've probably noticed that articles, such as a, an, and the, and prepositions, such as in, with, and to, are never listed in the index. The indexer has a list of hundreds of common words such as these that should be excluded from the index because they occur too often to be helpful.

A computer-coded indexer builds an index in much the same way. The indexer application uses a list of common words to figure out which words to exclude from the index, searches through the list of documents you specified, and finally builds an index containing the relevant associations between the remaining words within all the specified documents. Because most indexers build a full-text index based on your documents, the index is often larger than the original files. For example, if your Web site has 15MB of data in 125 documents, the indexer would create an index slightly larger than 15MB.
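The indexing steps just described can be illustrated with a short Python sketch. It is not the code of any particular indexer: the stop word list, the document names, and the `build_index` function are assumptions made up for this example.

```python
import re
from collections import defaultdict

# Hypothetical stop word list; real indexers ship with hundreds of entries.
STOP_WORDS = {"a", "an", "the", "in", "with", "to", "is", "and", "of"}

def build_index(documents):
    """Map each non-stop word to {document name: [word positions]}."""
    index = defaultdict(dict)
    for name, text in documents.items():
        words = re.findall(r"[a-z0-9]+", text.lower())
        for position, word in enumerate(words):
            if word in STOP_WORDS:
                continue  # common words are excluded from the index
            index[word].setdefault(name, []).append(position)
    return index

# Made-up documents standing in for the files at a Web site.
docs = {
    "wais.html": "WAIS is a database retrieval system",
    "forms.html": "The gateway script passes the form data to WAIS",
}
index = build_index(docs)
print(sorted(index["wais"]))   # names of the documents that contain "wais"
print("the" in index)          # stop words never reach the index
```

Because the index records every position of every remaining word, it is easy to see why a full-text index can end up larger than the files it describes.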

Most indexers enable you to add to or subtract from the list of common words. Often you can do this by editing an appropriate file or by creating a stop word file. A stop word file contains an alphabetized list of words that the indexer should ignore. Most indexers have a predefined stop word file, but you can override it to use your list of stop words instead.

Another type of word list indexers use is the synonym list. Synonym lists make it easier for readers to find what they need on your Web site without knowing the exact word to use in the search. Each line of a synonym file contains words that can be used interchangeably. A search for any word in the list will be matched to other words in the same line. Instead of getting no results from the search, the reader will get a list of results that match related words.

Suppose a reader wants to learn how forms are processed on the server but doesn't know the right keyword to use to get a related response. This line from a synonym file could be used to help the reader find what she is looking for:


cgi cgi-bin script gateway interface programming

Before you create a synonym list, think carefully about the words you want to use in the list. The indexer program uses synonym lists to create an index. Whenever you change the synonym list, you will have to reindex your Web site.
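As a rough Python illustration of how a synonym file behaves, the sketch below expands a search word into every word on its line. Note one simplification: it applies the synonyms at query time, whereas the indexers described here apply them when the index is built. The function names are hypothetical.

```python
def load_synonyms(lines):
    """Map every word on a line to the full set of words on that line."""
    table = {}
    for line in lines:
        group = set(line.lower().split())
        for word in group:
            table.setdefault(word, set()).update(group)
    return table

def expand_query(word, table):
    """Return the word plus any interchangeable words from the synonym table."""
    word = word.lower()
    return sorted(table.get(word, {word}))

# The synonym line is the one shown above.
synonyms = load_synonyms(["cgi cgi-bin script gateway interface programming"])
print(expand_query("gateway", synonyms))   # all six words on the line
print(expand_query("perl", synonyms))      # no synonyms, so only the word itself
```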

What Are Search Engines and How Are They Used?

Hundreds of search engines are used in commercial and proprietary applications. Usually search engines are part of a larger application, such as a database management system. When Web publishers looked for indexing and searching solutions, they looked at the search engines available and found that most of them were not well suited for use on the World Wide Web, mainly because they weren't designed to be used on distributed networks.

One solution Web publishers did find was the Wide Area Information Server (WAIS). WAIS is a database retrieval system that searches indexed databases. The databases can contain any type of file, including text, sound, graphics, and even video. The WAIS interface is easy to use; you can perform a search on any topic simply by entering a keyword and pressing Enter. When you press the Enter key, the WAIS search engine takes over. You can find both commercial and freeware versions of WAIS.

WAIS was developed as a joint project whose founders include Apple Computer, Dow Jones, Thinking Machines, and KPMG Peat Marwick. In the early days of WAIS, Thinking Machines maintained a free version of WAIS suitably titled freeWAIS. This version of WAIS enjoys the widest usage. freeWAIS is now maintained by the Clearinghouse for Networked Information Discovery and Retrieval (CNIDR). CNIDR started handling freeWAIS when the founders of WAIS turned to commercial ventures such as WAIS, Inc. Because freeWAIS is so popular and easy to use, it is heavily featured in this chapter. Beyond freeWAIS, there are many commercial, shareware, and freeware options. This chapter examines many of those options, including wwwwais.c, SFgate, SWISH, Excite for Web servers, and Livelink search.

The largest database on the Web, the Lycos Catalog of the Internet, enables you to perform searches for information on over 90 percent of Internet sites using a powerful indexer called a Web crawler. The Web crawler uses URLs at Web sites to find new information, and thus can index every page at a Web site and even find new Web sites. Lycos combines the Web crawler with a powerful search engine. Using the search engine, Web users can find the information they are looking for in a matter of seconds.

Note

A Web crawler is a generalized term for an indexer that searches and catalogs sites using the links in hypertext documents. By using the links, the indexer crawls through a document and all its related documents one link at a time.
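The link-following idea in the note can be sketched in Python with an in-memory table of links instead of real network requests. This is only an illustration of the general technique (a breadth-first traversal), not a description of how Lycos works; the page names are made up.

```python
from collections import deque

def crawl(start, links):
    """Visit every page reachable from start, one link at a time (breadth-first)."""
    seen = {start}
    queue = deque([start])
    order = []
    while queue:
        page = queue.popleft()
        order.append(page)              # a real crawler would index the page here
        for target in links.get(page, []):
            if target not in seen:      # never queue the same page twice
                seen.add(target)
                queue.append(target)
    return order

# Hypothetical site: each page maps to the pages it links to.
site = {
    "/index.html": ["/books.html", "/about.html"],
    "/books.html": ["/index.html", "/fiction.html"],
    "/about.html": [],
}
print(crawl("/index.html", site))
```

The `seen` set is what keeps the crawler from looping forever when two pages link to each other.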

The Lycos search engine is characteristic of the dozens of search engines you will find on the Web that use WAIS or are modeled after WAIS. You can enter a query at Lycos using the simple one-box form shown in Figure 28.1. By default, the Lycos search engine finds all documents matching any keyword you type in the query box. Because the exclusion list for Lycos contains articles, prepositions, and other nonuseful words for searching, you can enter a query as a complete sentence, as demonstrated in the following examples:


What is WAIS?

Where is WAIS?

How do I use WAIS?

Figure 28.1 : Lycos search engine.

You can enter as many keywords as you want on the query line. Because the search is not case-sensitive, the keywords do not have to be capitalized. If you type in two keywords that are not on the exclusion list, the search engine assumes you want to search on both words. For example, if you entered the following:


WAIS Web

The search engine would search its index for all documents containing either WAIS or Web. Although you could specify the "or" explicitly in a search, such as WAIS OR Web, you generally do not have to; the OR is assumed whenever you do not specify otherwise. To search the index only for documents containing both WAIS and Web, you could use the following:


WAIS AND Web

The AND tells the search engine you are only interested in documents containing the words WAIS and Web. You can combine the basic functions of logical OR and logical AND in many ways. Often, you will be searching for material on a specific subject; you can use multiple keywords related to the subject to help you get better results on your searches. Suppose you are looking for publishers on the Web; you might try the following keywords:


book

fiction

magazine

nonfiction

publisher

publishing
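The implied OR and explicit AND behavior described above can be sketched in Python as follows. The sketch evaluates a flat query from left to right with no operator precedence, which is a simplifying assumption rather than a statement about how Lycos or WAIS parses queries. The index contents are invented.

```python
def search(query, index):
    """Evaluate a flat query; OR is implied between words unless AND is given."""
    result = None
    operator = "OR"
    for token in query.split():
        if token.upper() in ("AND", "OR"):
            operator = token.upper()
            continue
        matches = index.get(token.lower(), set())   # search is not case-sensitive
        if result is None:
            result = set(matches)
        elif operator == "AND":
            result &= matches                        # keep documents with both words
        else:
            result |= matches                        # keep documents with either word
        operator = "OR"                              # fall back to the default
    return result or set()

# Hypothetical index: keyword -> set of documents containing it.
index = {
    "wais": {"a.html", "b.html"},
    "web": {"b.html", "c.html"},
}
print(sorted(search("WAIS Web", index)))       # union of both sets
print(sorted(search("WAIS AND Web", index)))   # intersection only
```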

Often your main topic will reveal dozens or hundreds of relevant sites. As Figure 28.2 shows, a single search on the keyword WAIS at Lycos returned over 26,000 matches. The Lycos search engine returns a summary of each document matching the search. The summary is in the form of an abstract for small documents and a combined outline and abstract for long documents. However, most search engines return a two-line summary of the related document that includes the size, type, and title of the document.

Figure 28.2 : Results of a Lycos search.

Documents matching your search are ranked, with what should be the most relevant documents displayed first and what should be the least relevant documents displayed last. Each document receives a relevance score, usually on a 0.0 to 1.0 scale or a 0 to 1,000 scale. The most relevant documents (those containing the greatest number of search words, positioned closest together) receive a high score, with the highest score being 1.0 or 1,000. In other words, two criteria determine relevance: the number of words that match the search string and the proximity of those words to each other in the document.

The overall scores based on these two criteria are used to present the matching documents in order from most relevant to least relevant. Although this method of ranking by relevance is used widely, it can be misleading. For this reason, other descriptive features, like a title and summary, are provided with the search results.
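To make the two criteria concrete, here is a small Python sketch that combines word coverage and word proximity into a score and scales the best document to 1,000. The weighting formula is an assumption chosen for illustration; it is not the formula used by WAIS or Lycos.

```python
def relevance(positions_by_word):
    """Score one document from the positions of each query word in it.

    positions_by_word maps a query word to the list of positions where it
    occurs (an empty list if the word does not occur in the document).
    """
    matched = {word: pos for word, pos in positions_by_word.items() if pos}
    if not matched:
        return 0.0
    # Criterion 1: fraction of the query words found in the document.
    coverage = len(matched) / len(positions_by_word)
    # Criterion 2: proximity, based on the smallest gap between two different words.
    words = list(matched)
    gaps = [
        abs(a - b)
        for i, first in enumerate(words)
        for second in words[i + 1:]
        for a in matched[first]
        for b in matched[second]
    ]
    # A single matched word has no gap to measure, so it gets full proximity credit.
    proximity = 1.0 / max(min(gaps), 1) if gaps else 1.0
    return coverage * (0.5 + 0.5 * proximity)

def rank(documents):
    """Return (name, score) pairs, best first, with the top score scaled to 1000."""
    raw = {name: relevance(pos) for name, pos in documents.items()}
    top = max(raw.values(), default=0.0) or 1.0
    scaled = [(name, round(1000 * score / top)) for name, score in raw.items()]
    return sorted(scaled, key=lambda item: item[1], reverse=True)

# Made-up word positions for the query "WAIS Web" in three documents.
ranked = rank({
    "near.html": {"wais": [3], "web": [4]},
    "far.html": {"wais": [0], "web": [10]},
    "one.html": {"wais": [7], "web": []},
})
print(ranked)
```

A document that contains both words side by side outranks one where they are far apart, which in turn outranks one that contains only a single search word.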

Accessing a WAIS Database

Originally, you could only access a WAIS database using a WAIS client. WAIS clients have built-in functions and are used much like other clients. Yet few people want to download a new type of client onto their system and learn how to use it, especially when the client can be used only for the specific purpose of searching a database. If you've ever used a WAIS client, you know they aren't the friendliest clients on the block. This is why Web users prefer to use their browsers, which provide a simple interface to just about every resource on the Internet.

Currently, most Web users access WAIS databases using a simple fill-out form, such as the one shown in Figure 28.1. When a user enters data into the form, the data is passed to the server and directed to a gateway script designed for search and retrieval. The script does five things:

Processes the input
Passes the input to the search engine
Receives the results from the search engine
Processes the results
Passes the output to the client
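The five steps above can be sketched in Python as a single function. The `fake_search` stand-in, the `keywords` field name, and the sample catalog are assumptions for illustration only; a real gateway would call the WAIS search engine instead.

```python
from html import escape
from urllib.parse import parse_qs

def fake_search(keywords):
    """Stand-in for the search engine; returns (score, title, url) tuples."""
    catalog = [(1000, "What Is WAIS?", "/docs/wais.html")]
    return [hit for hit in catalog
            if any(k.lower() in hit[1].lower() for k in keywords)]

def gateway(query_string, search=fake_search):
    # 1. Process the input from the form.
    fields = parse_qs(query_string)
    keywords = fields.get("keywords", [""])[0].split()
    # 2 and 3. Pass the input to the search engine and receive the results.
    hits = search(keywords)
    # 4. Process the results into HTML.
    items = "".join(
        f'<LI><A HREF="{escape(url)}">{escape(title)}</A> (score {score})\n'
        for score, title, url in hits
    ) or "<LI>Nothing found.\n"
    # 5. Pass the output to the client as a CGI response.
    return "Content-type: text/html\n\n<UL>\n" + items + "</UL>\n"

print(gateway("keywords=wais+web"))
```

Escaping the titles and URLs before placing them in the page keeps unexpected characters in the results from breaking the HTML.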

The gateway script creates the interface between the client browser and WAIS. Creating a gateway script to interface with WAIS is not an extremely complex process. Although you could create such a script using fewer than 100 lines of Perl code, dozens of ready-made WAIS gateways are already available. Some of these WAIS gateways are simple and involve efficient Perl scripts packed into a few kilobytes of file space. Other WAIS gateways are part of all-in-one software packages that contain an indexer and search engines as part of a WAIS server and a gateway script to create the Web-to-WAIS interface. The sections that follow discuss five freeware options for WAIS gateways.

Basic WAIS Gateways

Three ready-made solutions for processing the information from WAIS searches are

wais.pl
son-of-wais.pl
kidofwais.pl

`wais.pl`

The very first WAIS gateway, a Perl script called wais.pl, is a quick-and-dirty solution for accessing WAIS. The wais.pl script was created by Tony Sanders and is included with the NCSA Web server software. You can obtain wais.pl from NCSA's FTP server at


ftp://ftp.ncsa.uiuc.edu/Web/httpd/Unix/ncsa_httpd/cgi/wais.tar.Z

`son-of-wais.pl`

Although wais.pl is used widely on the Web, it is slowly being replaced by its offspring: son-of-wais.pl and kidofwais.pl. The son-of-wais.pl script is the second evolution of wais.pl. This Perl script, created by Eric Morgan, beats its generic parent hands down because it is more advanced than the original and more robust. You can obtain son-of-wais.pl from NCSU at


http://dewey.lib.ncsu.edu/staff/morgan/son-of-wais.html

`kidofwais.pl`

The third evolution of wais.pl is a script called kidofwais.pl. This script, created by Michael Grady, is a somewhat advanced WAIS gateway programmed entirely in Perl. The features of kidofwais.pl include debugging, multiple formatting options, table titles, and more. You can obtain kidofwais.pl from UIUC at


http://www.cso.uiuc.edu/grady.html

The results of a search on the word computer using the kidofwais.pl script are shown in Figure 28.3. Many Web publishers prefer the clean output of kidofwais.pl. As you can see, matches are generally displayed on a single line of a bulleted list. Each item on the list is displayed in ranked relevance order with the scores omitted. The title, size, and type of matching documents form the basis of each list item.

Figure 28.3 : Search results using the kidofwais.pl script.

Advanced WAIS Gateways

Two advanced and powerful solutions for your WAIS needs are

SFgate
wwwwais.c

`SFgate`

Created by Miao-Jane Lin and Ulrich Pfeifer, SFgate is one of the most advanced freeware WAIS gateways. Unlike other gateways discussed so far, SFgate uses a group of shell scripts to create a smooth and feature-rich interface to your WAIS server. This WAIS gateway is roughly 20 times larger than the other WAIS gateways and uses more than 500KB of disk space. In the days of 2GB hard drives, 500KB of disk space is negligible, but when you compare this amount to the kidofwais.pl script, which is only 24KB in size, you can easily see that SFgate is certainly a more involved gateway. Fortunately, the SFgate distribution includes an installation routine and good documentation.

SFgate provides you with advanced control over the search options. Not only can you search on a keyword, but you can also tell SFgate the specific areas of the indexed database to search. You can search by document type, title, author, date, and contents. You can also tell SFgate precisely how to format the search results. Increased flexibility in the search parameters and output style produces more meaningful results. You can learn more about SFgate and download the latest version at


http://ls6-www.informatik.uni-dortmund.de/SFgate/SFgate.html

`wwwwais.c`

The wwwwais.c gateway is proof positive that you can pack a lot of power into a small C program. Created by Kevin Hughes of Enterprise Integration Technologies (EIT) and packed into 54KB of C code, wwwwais.c is arguably the most powerful freeware WAIS gateway. EIT uses the wwwwais.c gateway to search the databases at its Web site. Figure 28.4 shows the results from a search using EIT's wwwwais.c gateway.

Figure 28.4 : Using wwwwais.c.

The output from wwwwais.c is similar to other WAIS gateways discussed previously. Matches are generally displayed in a numbered list. Each item on the list is displayed in ranked relevance order. The title, size, type, and ranked score of the matching documents form the basis of each list item. You can download the latest version of this WAIS gateway at


http://www.eit.com/software/wwwwais/

How WAIS Gateways Work

The best way to see how a WAIS gateway works is to examine the code for a script. Ideally, the script should be slightly advanced, yet not so advanced that its inner workings cannot be easily studied. The son-of-wais.pl script fits this description well. Listing 28.1 shows the code for son-of-wais.pl for you to study.

If you go through the code line by line, you will see that the first part of the script begins with an overview of changes Eric Morgan made to wais.pl. This section also contains contact information. Because such documentation makes it easier to use and maintain a script, good programmers always add it to a script.

After the overview, the code assigns configuration variables. Because these variables will be unique to your Web server, you will need to update them accordingly. Follow the inline documentation to update the paths to where your data is stored on the server, and be sure to update the contact and title information. One variable that you should pay particular attention to is the one that sets the location of the search engine to be used to perform the search. In the script, the variable is $waisq. Using waisq and waissearch to access WAIS databases is discussed later in this chapter.

The next section builds the command that is passed to waisq. The brevity of this section of the code surprises most beginning Web programmers. Keep in mind, though, that waisq is the program that actually performs the search against your WAIS database.

The final section of the script creates the output. Although the code for the search fills only a handful of lines, massaging the output and creating the textual portion of the output fills dozens of lines. You can modify the output message to suit your needs. However, the output page should contain the general information provided in the script, which ensures the reader knows how to use the index if they've had problems. If you follow the script, you can see that brief summaries for documents matching the search are displayed according to their relevance. Ranked relevance is described by scores associated with the documents.

Listing 28.1. The son-of-wais.pl script.


#!/usr/bin/perl
#
# wais.pl -- WAIS search interface
#
# $Id$
#
# Tony Sanders <sanders@bsdi.com>, Nov 1993
#
# Example configuration (in local.conf):
#     map topdir wais.pl &do_wais($top, $path, $query, "database", "title")
#
# Modified to present the user "human-readable" titles, better instructions as
# well as the ability to do repeated searches after receiving results.
#
# by Eric Lease Morgan, NCSU Libraries, April 1994
# eric_morgan@ncsu.edu
# http://www.lib.ncsu.edu/staff/morgan/morgan.html
# To read more about this script try:
# http://www.lib.ncsu.edu/staff/morgan/son-of-wais.html
#
# where is your waisq binary?
$waisq = "/usr/users/temp/wais/freeWAIS-0.202/bin/waisq";

# where are your source files?
$waisd = "/usr/users/temp/gopher/data/.wais";

# what database do you want to search?
$src = "ncsu-libraries-www";

# what is the opening title you want to present to users?
$openingTitle = "Search the NCSU Libraries Webbed Information System";

# after searching, what do you want the title to be?
$closingTitle = "Search results of the NCSU Libraries Information System";

# specify the path to add
# this is the same path you subtracted when you waisindexed
$toAdd = "/usr/users/temp/www/httpd/data/";

# specify the leader to subtract
# again, this is the same string you added when you waisindexed
$toSubtract = "http://www.lib.ncsu.edu/";

# who maintains this service?
# (the @ is escaped so Perl does not treat it as an array)
$maintainer = "<A HREF=http://www.lib.ncsu.edu/staff/morgan/morgan.html>Eric Lease Morgan</A> (eric_morgan\@ncsu.edu)";

# and when was it last modified?
$modified = "April 15, 1994";

# you shouldn't have to edit anything below this line,
# except if you want to change the help text

sub extractTitle {
  # get the string
  $theFile = $headline;

  # parse out the file name
  $theFile =~ s/^.*$toSubtract//i;

  # Concatenate the "toAdd" variable with the file name
  $theFile = $toAdd.$theFile;

  # open the file
  open( DATA, $theFile) || die "Can't open $theFile\n";

  # read the file and extract the title
  $linenum = 1;
  $foundtitle = 0;
  $humanTitle = "(No title found in document!) Call $maintainer.";
  while ( $line = <DATA>) {
    last if ($linenum > 5);
    $linenum++;
    if ($line =~ s/^.*<title>//i ) {
      chop( $line);
      $line =~ s!</title>.*$!!i;
      $humanTitle = $line;
      $humanTitle =~ s/^\s*//;
      $humanTitle =~ s/\s*$//;
      $foundtitle = 1;
      last;
    }
  }

  # close the file
  close (DATA);

  # return the final results
  return $humanTitle;
  }

sub send_index {
    print "Content-type: text/html\n\n";

    print "<HEAD>\n<TITLE>$openingTitle</TITLE>\n<ISINDEX></HEAD>\n";
    print "<BODY>\n<H2>", $openingTitle, "</H2>\n";

    print "<p>";
    print "This is an index of the information on this server. ";
    print "To use this function, simply enter a query.<P>";
    print "Since this is a WAIS index, you can enter complex queries. For example:<P>";
    print "<DT><b>Right-hand truncation</b> (stemming) queries";
    print "<DD>The query 'astro*' will find documents containing the words";
    print " 'astronomy' as well as 'astrophysics'.<P>";
    print "<DT>Boolean '<b>And</b>' queries";
    print "<DD>The query 'red and blue' will find the <B>intersection</b> of all";
    print " the documents containing the words 'red', and 'blue'.";
    print "The use of 'and' limits your retrieval.<p>";
    print "<DT>Boolean '<b>Or</b>' queries";
    print "<DD>The query 'red or blue' will find the <B>union</b> of all the";
    print " documents containing the words 'red' and 'blue'.";
    print "The use of 'or' increases your retrieval.<p>";
    print "<DT>Boolean '<b>Not</b>' queries";
    print "<DD>The query 'red not green' will find the all the documents containing";
    print " the word 'red', and <b>excluding</b> the documents containing the word 'green'.";
    print "The use of 'not' limits your retrieval.<p>";
    print "<DT><b>Nested</b> Boolean queries";
    print "<DD>The query '(red and green) or blue not pink' will find the union of all";
    print " the documents containing the words 'red', and 'green'. It will then add (union)";
    print " all documents containing the word 'blue'. Finally, it will exclude all documents";
    print " containing the word 'pink'";
    print "<HR>";
    print "This page is maintained by $maintainer, and it was last modified on $modified.<p>";
}

sub do_wais {
#    local($top, $path, $query, $src, $title) = @_;

    # show the search form when no query was supplied
    do { &send_index; return; } unless @ARGV;
    local(@query) = @ARGV;
    local($pquery) = join(" ", @query);

    print "Content-type: text/html\n\n";

    open(WAISQ, "-|") || exec ($waisq, "-c", $waisd,
                                "-f", "-", "-S", "$src.src", "-g", @query);

    print "<HEAD>\n<TITLE>$closingTitle</TITLE>\n<ISINDEX></HEAD>\n";
    print "<BODY>\n<H2>", $closingTitle, "</H2>\n";

    print "Index \`$src\' contains the following\n";
    print "items relevant to \`$pquery\':<P>\n";
    print "<DL>\n";

    local($hits, $score, $headline, $lines, $bytes, $type, $date);
    while (<WAISQ>) {
        /:score\s+(\d+)/ && ($score = $1);
        /:number-of-lines\s+(\d+)/ && ($lines = $1);
        /:number-of-bytes\s+(\d+)/ && ($bytes = $1);
        /:type "(.*)"/ && ($type = $1);
        /:headline "(.*)"/ && ($headline = $1);         # XXX
        /:date "(\d+)"/ && ($date = $1, $hits++, &docdone);
    }
    close(WAISQ);
    print "</DL>\n";
    print "<HR>";
    print "This page is maintained by $maintainer.<P>";

    if ($hits == 0) {
        print "Nothing found.\n";
    }
    print "</BODY>\n";
}

sub docdone {
    if ($headline =~ /Search produced no result/) {
        print "<HR>";
        print $headline, "<P>\n<PRE>";
# the following was &'safeopen
        open(WAISCAT, "$waisd/$src.cat") || die "$src.cat: $!";
        while (<WAISCAT>) {
            s#(Catalog for database:)\s+.*#$1 <A HREF="/$top/$src.src">$src.src</A>#;
            s#Headline:\s+(.*)#Headline: <A HREF="$1">$1</A>#;
            print;
        }
        close(WAISCAT);
        print "\n</PRE>\n";
    } else {
        $title = &extractTitle ($headline);
        print "<DT><A HREF=\"$headline\">$humanTitle</A>\n";
        print "<DD>Score: $score, Lines: $lines, Bytes: $bytes\n";
    }
    $score = $headline = $lines = $bytes = $type = $date = '';
}

eval '&do_wais';
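The result-reading loop in Listing 28.1 can be restated in Python to make its logic easier to follow. The patterns below are taken from the listing, and a `:date` line is treated as the end of one document's record, just as in the Perl code. The sample lines are hypothetical and shaped only to satisfy those patterns; they are not captured waisq output.

```python
import re

# Patterns copied from the while loop in Listing 28.1.
FIELDS = {
    "score": re.compile(r":score\s+(\d+)"),
    "lines": re.compile(r":number-of-lines\s+(\d+)"),
    "bytes": re.compile(r":number-of-bytes\s+(\d+)"),
    "headline": re.compile(r':headline "(.*)"'),
}
DATE = re.compile(r':date "(\d+)"')

def parse_hits(lines):
    """Collect one dict per document; a :date line marks the end of a record."""
    hits, current = [], {}
    for line in lines:
        for name, pattern in FIELDS.items():
            match = pattern.search(line)
            if match:
                current[name] = match.group(1)
        if DATE.search(line):
            hits.append(current)
            current = {}
    return hits

# Hypothetical engine output for a single matching document.
sample = [
    ':score 1000',
    ':number-of-lines 42',
    ':number-of-bytes 2048',
    ':headline "http://www.example.com/index.html"',
    ':date "940415"',
]
hits = parse_hits(sample)
print(hits)
```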

How to Create an HTML Document for a WAIS Gateway

Creating an HTML document for your WAIS gateway is easy. All you have to do is create a document with a fill-out form that sends the proper values to your WAIS gateway of choice. Depending on the WAIS gateway you choose, this form can be a simple one-line form for entering keywords or a complex multiple-line form that has space for entering keywords as well as search and retrieval options. Listing 28.2 is the HTML code for a document using a simple form for use with wwwwais.c.

Listing 28.2. Simple form for use with wwwwais.c.


<HTML>
<HEAD>
<TITLE>Using WWWWAIS.C</TITLE>
</HEAD>
<BODY>
<CENTER>
<FORM METHOD=GET ACTION="/cgi-bin/wwwwais">
<P><B>Search for:</B>
<INPUT TYPE=TEXT NAME="keywords" SIZE=40>
</FORM>
</CENTER>
</BODY>
</HTML>

Because the previous form has only one input field, the submit and reset buttons are not necessary. When the user presses Enter, the form is automatically submitted to wwwwais.c. The wwwwais.c gateway passes the value of the keywords variable to the WAIS search engine you have installed on your system.

Forms designed for use with the SFgate script can be as simple or complex as you make them because SFgate gives you advanced control over how searches are performed and the way results are formatted. Figure 28.5 shows the search section of an advanced form designed to be used with SFgate. Figure 28.6 shows how users could be allowed to alter your default search and debug parameters. Listing 28.3 is the HTML code for the document shown in Figure 28.5 and Figure 28.6.

Listing 28.3. Advanced form for use with SFgate.


<HTML>
<HEAD>
<TITLE>Using SFgate</TITLE>
</HEAD>
<BODY>
<H1>Accessing a WAIS database with SFgate</H1>
<FORM METHOD=GET ACTION="/usr/cgi-bin/SFgate">
<INPUT NAME="database" TYPE="hidden" VALUE="www.tvpress.com/site.db">
<H2>Search by:</H2>
<DL>
<DT>Title
<DD><INPUT TYPE=TEXT NAME="ti">
<DT>Author name
<DD><INPUT TYPE=TEXT NAME="au">
<DT>Text
<DD><INPUT TYPE=TEXT NAME="text" SIZE=60>
<DT>Publication year
<DD><SELECT NAME="py_p">
<OPTION> &gt;
<OPTION> =
<OPTION> &lt;
</SELECT>
<INPUT TYPE=TEXT NAME="py" SIZE=4 VALUE="1995">
</DL>
<INPUT TYPE="submit">
<INPUT TYPE="reset">
<H1>Change default search and debug parameters</H1>
<H2>Enter search and retrieval options:</H2>
<P>Fetch documents using direct WAIS URL?</P>
<SELECT NAME="directwais">
<OPTION> off
<OPTION> on
</SELECT>
<P>Use redirection capabilities?</P>
<SELECT NAME="redirect">
<OPTION> off
<OPTION> on
</SELECT>
<P>Language for return results?</P>
<SELECT NAME="language">
<OPTION>english
<OPTION>french
<OPTION>german
</SELECT>
<P>How do you want the results to be listed?</P>
<INPUT TYPE="radio" NAME="listenv" CHECKED VALUE="DL">descriptive list
<INPUT TYPE="radio" NAME="listenv" VALUE="PRE">preformatted list
<P>What type of title headings do you want to see in the list?</P>
<INPUT TYPE="radio" NAME="verbose" CHECKED VALUE="1">verbose headings
<INPUT TYPE="radio" NAME="verbose" VALUE="0">short headings
<P>What is the maximum number of hits you want the search to return?</P>
<INPUT NAME="maxhits" TYPE=TEXT VALUE="40" SIZE=3>
<H2>Enter debug options:</H2>
<P>Dump environment to an HTML document instead of processing the query?</P>
<SELECT NAME="dmpenv">
<OPTION> no
<OPTION> yes
</SELECT>
<P>Show Debug information?</P>
<SELECT NAME="debug">
<OPTION> off
<OPTION> on
</SELECT>
</FORM>
</BODY>
</HTML>

Figure 28.5 : Advanced form for use with SFgate.

Figure 28.6 : Setting additional search and debug parameters.

The form used with the SFgate script has many fields. The NAME attribute of each field is set to a keyword that has special meaning to SFgate. The primary search and retrieval parameters are ti, au, text, and py. The title parameter ti enables keyword searches of titles. The author parameter au enables keyword searches of the authors of documents indexed in the database. The text parameter text enables keyword searches of the full text of documents indexed in the database. The publication year parameter py is used to search based on the date the indexed documents were published.

The database variable defines the name of the WAIS database you want to search. In this example, this variable is assigned to a hidden input field; in this way, you could use SFgate to search different databases at your Web site. You could even let the user search different databases at your Web site using the same form by changing the input field for the database from a hidden field to one the user can manipulate.

Most of the additional search variables are set to default values automatically and do not have to be specified. Specifying parameters for these variables enables you to provide additional controls to users. The debug parameters are used for testing and troubleshooting problems and are not normally included in your final search form.
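Because the form in Listing 28.3 uses METHOD=GET, the browser appends the field names and values to the URL as a query string. The Python sketch below shows roughly what that request looks like. The field names and the ACTION path come from the listing; the values are made-up user input, and a browser's exact encoding may differ slightly.

```python
from urllib.parse import urlencode

# Field names come from Listing 28.3; the values are made-up user input.
fields = {
    "database": "www.tvpress.com/site.db",
    "ti": "publishing",
    "au": "",
    "text": "search engines",
    "py_p": ">",
    "py": "1995",
    "maxhits": "40",
}
query = urlencode(fields)   # spaces become "+", reserved characters become %XX
print("/usr/cgi-bin/SFgate?" + query)
```

Fields left at their defaults, such as an empty author name, still travel with the request, which is why the gateway must be prepared to ignore empty values.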

Installing a WAIS Gateway

Installing a WAIS gateway may not be as easy as you think. This section looks at installing basic and advanced WAIS gateways.

Configuring Basic WAIS Gateways

Installing one of the basic WAIS gateways (wais.pl, son-of-wais.pl, and kidofwais.pl) is easy. You simply obtain the script, move it to an appropriate directory, such as cgi-bin, and modify the configuration parameters in the beginning of the script. The easiest gateway to configure is wais.pl. Configuring wais.pl involves modifying four lines of code at the beginning of the script:

Set the path to the search engine, which is normally waisq if you've installed the freeWAIS server:
$waisq = "/usr/local/bin/wais/waisq";
Specify the location of the directory containing your WAIS databases:
$waisd = "/usr/local/wais.db/";
Specify the indexed database for the search:
$src = "sitedb.src";
Specify the title for the HTML document used to display the results:
$title = "Search Results";

If all WAIS gateways were as easy to configure as wais.pl, Web publishers would have no problems creating an interface to WAIS. Although configuring son-of-wais.pl and kidofwais.pl is slightly more difficult, the scripts have good step-by-step documentation that explains the process.

Configuring Advanced WAIS Gateways

Advanced WAIS gateways present more problems to Web publishers because more options and variables are involved. This section looks at configuring an advanced WAIS gateway called wwwwais.c. To install EIT's wwwwais.c, you have to make a minor modification to the source code, compile the source code, move the compiled program to an appropriate directory, and update the configuration file.

Preparing the wwwwais.c Script

Because the wwwwais.c gateway uses a separate configuration file, you can install the configuration file wherever you would like. For this reason, you must specify the path to the configuration file in the source code. This minor modification is easy to make; simply edit the source code using your favorite editor. To ensure that the configuration file will be easy to find if you need to update it later, you may want to place the file in the same directory as the configuration file for your Web server, such as


/usr/local/httpd/conf/wwwwais.conf

Note

Because you specify the full path to the configuration file in the source code, you can name the file anything you want. In the preceding example, the configuration file is called wwwwais.conf.

After you modify the source code, compile it using your favorite C compiler, such as gcc. The wwwwais.c source should compile without errors. After the program is compiled, move it to an appropriate directory on your Web server. Usually this directory is your server's cgi-bin directory. After moving the program, make sure it is executable. You may want to use chmod 711, which lets you (the owner) read, write, and execute the program but lets others only execute it.
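
To see exactly which permissions mode 711 grants, the bits can be decoded with a few lines of Python. This is only an illustrative sketch and not part of the wwwwais distribution; the helper name describe_mode is hypothetical.

```python
import stat

def describe_mode(mode):
    """Return the ls-style permission string for a regular file with this mode."""
    # S_IFREG marks the entry as a regular file, so the string starts with "-".
    return stat.filemode(stat.S_IFREG | mode)

# Owner: read, write, execute. Group and others: execute only.
print(describe_mode(0o711))  # prints -rwx--x--x
```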

Updating the wwwwais.c Configuration File

The wwwwais.c configuration file enables you to set many useful parameters for searching indexed databases and displaying the results. The configuration file contains parameters that can be passed to wwwwais.c. Variables are specified by variable name and associated value. The space between the variable name and value is necessary. Listing 28.4 is an example of a wwwwais.c configuration file.

Listing 28.4. Sample wwwwais.c configuration file.


# WWWWAIS configuration file



# If PageTitle is a string, it will be a title only.

# If PageTitle specifies an HTML file, this file will be prepended to

# wwwwais results.

PageTitle "waistitle.html"



# The self-referencing URL for wwwwais.

SelfURL "http://www.tvpress.com/cgi-bin/wwwwais"



# The maximum number of results to return.

MaxHits 40



# How results are sorted. This can be "score", "lines", "bytes",

# "title", or "type".

SortType score



# AddrMask is used to specify the IP addresses of sites authorized access

# to your database

# Only addresses specified here will be allowed to use the gateway.

# These rules apply:

# 1) You can use asterisks in specifying the string, at either

#    end of the string:

#    "192.100.*", "*100*", "*2.100.2"

# 2) You can make lists of masks:

#    "*192.58.2,*.2", "*.100,*171.128*", ".58.2,*100"

# 3) A mask without asterisks will match EXACTLY:

#    "192.100.58.2"

# 4) Define as "all" to allow all sites.

AddrMask all



# The full path to your waisq program.

WaisqBin /usr/local/bin/waisq

# The full path to your waissearch program.

WaissearchBin /usr/local/bin/waissearch

# The full path to your SWISH program.

SwishBin /usr/local/bin/swish



# WAIS source file descriptions

# These represent the path to the indexed databases

# For SWISH sources:

#    SwishSource full_path_to_source/source.swish "description"

SwishSource /usr/local/httpd/wais/index/index.swish "Search our Web"

SourceRules replace "/usr/local/www/" "http://www.tvpress.com/"

# For waisq sources:

#    WaisSource full_path_to_source/source.src "description"

WaisSource /usr/local/httpd/wais/index/index.src "Search our Web"

SourceRules replace "/usr/local/www/" "http://www.tvpress.com/"

WaisSource /usr/local/httpd/wais/index/index.src "Search our Web"

SourceRules replace "/usr/local/www/" "/"

SourceRules prepend "http://www.tvpress.com/cgi-bin/print_hit_bold.pl"

SourceRules append "?$KEYWORDS#first_hit"

# For waissearch sources:

#    WaisSource host.name port source "description"

WaisSource quake.think.com 210 directory-of-servers "WAIS directory of servers"



# Do you want to use icons?

UseIcons yes



# Where are your icons kept?

IconUrl http://www.tvpress.com/software/wwwwais/icons



# Determining file type based on suffix.

# Suffix matching is not case sensitive and is entered in the form:

#    TypeDef .suffix "description" file://url.to.icon.for.this.type/ MIME-type

# You can use $ICONURL in the icon URL to substitute the root icon directory.

# You can define new document types and their associated icons here.

TypeDef .html "HTML file" $ICONURL/text.xbm text/html

TypeDef .txt "text file" $ICONURL/text.xbm text/plain

TypeDef .ps "PostScript file" $ICONURL/image.xbm application/postscript

TypeDef .gif "GIF image" $ICONURL/image.xbm image/gif

TypeDef .src "WAIS index" $ICONURL/index.xbm text/plain

TypeDef .?? "unknown" $ICONURL/unknown.xbm text/plain
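
The name-and-value layout used in Listing 28.4 is simple enough to read with a short script, which can be handy for checking a file before you install it. The following Python sketch is an assumption about how such a check might look, not the parser that wwwwais.c itself uses; the function name parse_config is hypothetical.

```python
import shlex

def parse_config(text):
    """Split each non-comment line into a (name, [values]) pair."""
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and whole-line comments
        # shlex keeps a double-quoted description together as one value.
        name, *values = shlex.split(line)
        # A list of pairs (not a dict) preserves repeated names such as WaisSource.
        entries.append((name, values))
    return entries

sample = '''
# The maximum number of results to return.
MaxHits 40
SwishSource /usr/local/httpd/wais/index/index.swish "Search our Web"
'''
for name, values in parse_config(sample):
    print(name, values)
```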

When you update the configuration file for use on your system, look closely at every line of the file containing a parameter assignment. Because you will need to change almost every parameter assignment, be wary of any assignments that you do not change. The most important updates to the configuration file involve specifying the proper paths to essential files on the system. Here's how you should assign these essential values:

`SelfURL`	The URL path to `wwwwais.c`.
`WaisqBin`	The full path to your `waisq` program.
`WaissearchBin`	The full path to your `waissearch` program.
`SwishBin`	The full path to your `SWISH` program. `SWISH` is an alternative to `freeWAIS` and is discussed later in the chapter.
`SwishSource`	The location of local databases indexed with `SWISH` and a brief description. If there are multiple `SwishSource` lines, the user is prompted to specify the database to search.
`WaisSource`	The name and location of a WAIS database. Local WAIS databases are accessed with `waisq`, and remote WAIS databases are accessed with `waissearch`. For a local database, specify the location of the database and a brief description; the database name should include the `.src` extension. For a remote database, specify a host name, port, database name, and description; the database name should not include the `.src` extension.
`SourceRules`	The action to take on the results. Valid actions are:
	`append`	Add information after the results
	`prepend`	Add information before the results
	`replace`	Replace the local path with a URL path so Web users can access the documents
`TypeDef`	The MIME type definition. This parameter allows the script to match file name extensions to MIME types. Any MIME types not configured are assigned to the type `unknown`.
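
The effect of the three SourceRules actions can be sketched in Python using the values from Listing 28.4. This is a rough model under stated assumptions: the function name apply_rules is hypothetical, and the substitution of $KEYWORDS with the search terms is inferred from the append rule in the listing rather than taken from the wwwwais.c source.

```python
def apply_rules(path, rules, keywords=""):
    """Apply (action, argument...) rules in order to one result path."""
    for action, *args in rules:
        if action == "replace":
            old, new = args
            path = path.replace(old, new, 1)  # swap the local prefix for a URL prefix
        elif action == "prepend":
            path = args[0] + path
        elif action == "append":
            # Assumption: $KEYWORDS stands for the user's search terms.
            path = path + args[0].replace("$KEYWORDS", keywords)
        else:
            raise ValueError("unknown SourceRules action: " + action)
    return path

rules = [("replace", "/usr/local/www/", "http://www.tvpress.com/")]
print(apply_rules("/usr/local/www/docs/index.html", rules))
# prints http://www.tvpress.com/docs/index.html
```

Because the rules are applied in order, a replace that maps the local path to "/" can be followed by a prepend and an append to route each hit through another script, as the last WaisSource entry in Listing 28.4 does.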

Passing Additional Parameters to wwwwais.c

You can pass additional parameters to wwwwais.c as input from a fill-out form or with environment variables set in a script that calls wwwwais.c. Any additional parameters you reference override parameters set in your configuration file. The simple form used earlier to pass keywords to wwwwais.c can be easily updated to accommodate these additional parameters. The variables you can set include the following:

`host`	The name of the remote host machine to search with `waissearch`. The host information should include the domain, as this example does: host=tvp.com
`iconurl`	The URL path to icons. The `iconurl` should include the transfer protocol, as in the following example: iconurl=http://tvp.com/icons/
`isindex`	The index to search on.
`keywords`	The keywords to search on.
`maxhits`	The maximum number of matches to return after a search.
`port`	The port number on which to contact the remote host machine.
`searchprog`	The search engine to use. This variable can be set to one of the following:
	`searchprog=swish`	A local search using `SWISH`
	`searchprog=waisq`	A local search using `waisq`
	`searchprog=waissearch`	A remote search using `waissearch`
`selection`	The indexed database to use as specified by the description set in the configuration file.
`sorttype`	The sorting method for the results. This variable can be set to one of the following:
	`sorttype=bytes`	Sort by the byte size of the documents
	`sorttype=lines`	Sort by the number of lines
	`sorttype=score`	Sort by score
	`sorttype=title`	Sort by document title
	`sorttype=type`	Sort by document type
`source`	The indexed database to search.
`sourcedir`	The directory of the indexed database.
`useicons`	Whether icons based on file type are used. This variable can be set to one of the following:
	`useicons=no`	Do not use icons
	`useicons=yes`	Use icons
`version`	Verification of the version number of your WAIS applications. The default value `false` can be set to `true` as follows: version=true

You can use either the GET or POST method to submit data from an HTML form to wwwwais.c. You can set variables yourself using hidden fields or allow the users to set these variables using input fields. The wwwwais.c script supports the PATH_INFO variable as well, so you can add additional parameters to the end of the URL path to wwwwais.c in URL-encoded format. Listing 28.5 shows how you could create a form with additional parameters already added to the URL path:

Listing 28.5. wwwwais.c form with additional parameters.


<HTML>

<HEAD>

<TITLE>Search our Web site</TITLE>

</HEAD>

<BODY>

<FORM METHOD=GET

ACTION="/cgi-bin/wwwwais/useicons=yes&maxhits=50&sorttype=score">

<P><B>Search for:</B>

<INPUT TYPE=TEXT NAME="keywords" SIZE=40>

</FORM>

</BODY>

</HTML>
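
The request that the form in Listing 28.5 submits can also be assembled by hand, which is useful for testing the gateway without a browser. The Python sketch below builds such a URL; the function name search_url is hypothetical, and the split between path parameters and the query string simply mirrors the listing.

```python
from urllib.parse import urlencode

def search_url(keywords, **options):
    """Build a GET URL: options travel in the path, keywords in the query string."""
    base = "/cgi-bin/wwwwais"
    if options:
        base += "/" + urlencode(options)  # extra parameters appended to the path
    return base + "?" + urlencode({"keywords": keywords})

print(search_url("indexed databases", useicons="yes", maxhits="50", sorttype="score"))
# prints /cgi-bin/wwwwais/useicons=yes&maxhits=50&sorttype=score?keywords=indexed+databases
```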

To set parameters in a script that calls wwwwais.c, you use environment variables. You can change any of the variables discussed earlier into an environment variable that wwwwais.c will recognize by putting WWWW_ before the variable name and writing the whole name in uppercase, as in WWWW_MAXHITS. Listing 28.6 is a simple csh script to show how you could set variables and call wwwwais.c.

Listing 28.6. csh script to set variables for wwwwais.c.


#!/bin/csh
# Shell script for setting environment variables for wwwwais
setenv WWWW_USEICONS yes
setenv WWWW_MAXHITS 50
setenv WWWW_SORTTYPE type
# Call wwwwais
/usr/local/cgi-bin/wwwwais
exit
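
If the calling script is not written in csh, the same naming convention can be applied in any language that can set environment variables. The Python sketch below follows the variable names used in Listing 28.6; the helper name wwww_env is hypothetical, and the call that would start the gateway is shown only as a comment so the sketch runs on a machine without wwwwais installed.

```python
import os

def wwww_env(params):
    """Map wwwwais parameter names to WWWW_-prefixed, uppercase environment names."""
    return {"WWWW_" + name.upper(): str(value) for name, value in params.items()}

env = dict(os.environ)
env.update(wwww_env({"useicons": "yes", "maxhits": 50, "sorttype": "type"}))

# The gateway could then be started with, for example:
#   subprocess.run(["/usr/local/cgi-bin/wwwwais"], env=env)
print(sorted(name for name in env if name.startswith("WWWW_")))
```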

Building an Indexed Database

So far this chapter has discussed the basics of indexers, search engines, WAIS, and WAIS gateways. Now that you have read the section on accessing a WAIS database, you should understand how WAIS gateways work and how to create an HTML document for a WAIS gateway. The next step is to install a search and retrieval application that includes an indexer and a search engine.

As I mentioned earlier in this chapter, one of the most widely used Wide Area Information Servers is freeWAIS. The freeWAIS server is actually a series of scripts for building and searching an indexed database. An alternative to freeWAIS is SWISH. Developed by the team at EIT Corporation, the Simple Web Indexing System for Humans offers ease of installation and ease of use.

Installing and Configuring `freeWAIS`

Many versions of freeWAIS are in use on the Internet. The two main variants you may be interested in are the standard freeWAIS package and the freeWAIS-sf package. Standard freeWAIS is the most widely used WAIS system. The freeWAIS-sf package is optimized for use with SFgate.

You can find information on the current version of freeWAIS and obtain the source code at these locations:


http://www.eit.com/software/

http://cnidr.org/

ftp://ftp.cnidr.org/pub/NIDR.tools/freewais/

You can find information on the current version of freeWAIS-sf and obtain the source code at these locations:


http://ls6-www.informatik.uni-dortmund.de/SFgate/SFgate.html

ftp://mirror-site/mirror-dir/SFgate/

ftp://mirror-site/mirror-dir/freeWAIS-sf-1.2/freeWAIS-sf/

After you download the source code to your computer from one of the listed locations and uncompress the source code as necessary, you can begin installing and configuring freeWAIS. Both variants of freeWAIS include essentially the same applications:

waisserver waisq waissearch waisindex

Using `waisserver`

The waisserver program is the primary server program. You need to run waisserver only if you want to be able to search locally available databases. Before you start waisserver, you will need to know three things:

You need to know the port on which you want the waisserver to allow connections. This port is normally 210. You invoke waisserver with the -p option to set the port number.
You need to know the directory where your source databases are located. The waisserver program will allow any database in the specified directory to be searched. You invoke waisserver with the -d option to set the directory path to your indexed WAIS databases.
You need to know how you want errors to be treated. Although tracking errors is not mandatory, it is a sound administrative practice. Invoke waisserver with the -e option to specify a log file for tracking errors.

To start waisserver, change directories to where waisserver is installed. In the following example, waisserver answers requests on port 210, source databases are in the /usr/local/httpd/wais/sources directory, and errors are logged in /usr/local/httpd/logs/wais.log. To start waisserver using these options, you would type the following all on one line:


./waisserver -p 210 -d /usr/local/httpd/wais/sources -e /usr/local/httpd/logs/wais.log &

Note

The ampersand symbol puts waisserver in the background. If you do not put the server process in the background, the server will stop running when you exit your login. Additionally, to ensure waisserver is started if the host computer is rebooted, update the appropriate configuration files. For example, you could add the following to the rc.local file on most UNIX systems to make sure that waisserver is started automatically:

#Added to start the waisserver process
#waisserver is used to enable searching of the local WAIS databases
/usr/local/httpd/wais/waisserver -p 210 -d /usr/local/httpd/wais/sources -e /usr/local/httpd/logs/wais.log &

Using `waisq` and `waissearch`

The waisq and waissearch programs search WAIS databases for the information you're looking for. The waisq search engine looks in databases on the local host, and waissearch looks in databases on remote machines. The waissearch program does things remotely by contacting WAIS servers on different machines, each of which has its own database. In order for waissearch to work properly, you must tell it a host name and a port to which to connect. Additionally, the remote host must have a WAIS server of its own running on the port you specify. Your WAIS gateway calls waisq or waissearch for you, so you generally do not access these search engines directly.

Using `waisindex`

The waisindex program creates indexed WAIS databases. When you create an index, you can index all or any portion of the files on the host computer. Generally, files are indexed into a database according to their directory. When you index files, you can specify the following information:

Invoke waisindex with the -d option to set the path and name of the database that is created as a result of running waisindex. Indexed WAIS databases end with the .src extension, but waisindex adds the extension for you, so name the database without it.
Invoke waisindex with the -r option to specify that you want subdirectories to be indexed.
Specify the files and directories you want to index as the final arguments on the command line.
Specify whether you want waisindex to index the full contents of the document or just the file name. The default is to index the full contents of the documents. For files indexed with the -nocontents flag, only the file names are indexed.

You can specify these additional parameters if you wish:

Invoke waisindex with the -e option to specify a log file for tracking errors.
Invoke waisindex with the -l option to set the level of detail for logging what waisindex is doing on your system. The higher the number, the more verbose and detailed the logging will be.
Invoke waisindex with the -m option to specify the amount of memory and resources to use for indexing. The higher the number, the more system resources will be used.

To run waisindex, either change to the directory where waisindex is installed or specify the full path to the program. In the following example, waisindex is located in the /usr/local/httpd/wais directory, the verbosity of the output is set to 1 for minimal logging, errors are logged in /usr/local/httpd/logs/wais.log, the directory to index is /users/webdocs, the path to the database is /usr/local/httpd/wais/sources, and the name of the database is webdocuments. To run waisindex using these options, you would type the following all on one line:


/usr/local/httpd/wais/waisindex -l 1 -d /usr/local/httpd/wais/sources/webdocuments -e /usr/local/httpd/logs/wais.log -r /users/webdocs

Although you could run waisindex by hand whenever you needed to reindex your site, the best way to handle indexing is to set up a cron job to handle the task. In UNIX environments, cron jobs are run automatically at times you specify. Most systems have multiple cron tables, one per user. Jobs in a cron table run with the permissions of the table's owner, and the tables are normally located in the /usr/spool/cron/crontabs directory.

You will usually want to update your indexes daily, especially on a host that changes frequently. The best time to run waisindex is when system usage is low. Often, this is in the early morning hours. To add a statement to the root cron table to update your index daily at 1 a.m., you could insert the following lines:


# Root Cron

# Entry added to build waisindex

00 01 * * * /usr/local/httpd/wais/waisindex -l 1 -d /usr/local/httpd/wais/sources/webdocuments -e /usr/local/httpd/logs/wais.log -r /users/webdocs
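
A cron table entry consists of five schedule fields (minute, hour, day of the month, month, and day of the week) followed by the command to run. The Python sketch below separates those parts so you can check that an entry such as the one above really means 1 a.m. every day. The function name split_cron_entry is hypothetical, and the arguments to waisindex are left off the sample entry for brevity.

```python
FIELDS = ("minute", "hour", "day_of_month", "month", "day_of_week")

def split_cron_entry(entry):
    """Separate the five schedule fields of a crontab line from its command."""
    parts = entry.split(None, 5)  # at most five splits, so the command stays whole
    if len(parts) < 6:
        raise ValueError("expected five schedule fields followed by a command")
    fields = dict(zip(FIELDS, parts[:5]))
    fields["command"] = parts[5]
    return fields

entry = "00 01 * * * /usr/local/httpd/wais/waisindex"
for name, value in split_cron_entry(entry).items():
    print(name, "=", value)
```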

The previous example assumes you want to index only a single directory and its subdirectories, but you can add more statements to the cron tab to build additional indexed databases. This solution for indexing your site works best on simple document structures. If your host has a complex document structure, you can build the index using links or build the index using a script.

Building a WAIS Index Using Links

You can add links from your document directories to a base directory that you will index using the -r option to recursively index subdirectories. To do this, you could create a base directory, such as /users/webdocs, add subdirectories to this directory, and link your document directories to the subdirectories. Here's how you would do this on most UNIX systems that enable symbolic links:


$ mkdir /users/webdocs
$ cd /users/webdocs
$ mkdir HTML
$ mkdir TEXT
$ mkdir PDF
$ mkdir gif
$ ln -s /usr/local/httpd/docs/html /users/webdocs/HTML
$ ln -s /home/users/local/text /users/webdocs/TEXT
$ ln -s /usr/bin/adobe/acrobat/docs /users/webdocs/PDF
$ ln -s /usr/local/images/samples/gif /users/webdocs/gif

Now if you ran the command defined earlier, you would index all the appropriate directories you have linked to the /users/webdocs directory. Keep in mind that the actual database would be located in the /usr/local/httpd/wais/sources directory and would have the name webdocuments.src. The waisindex program adds the .src extension to the database name to indicate that the file is a source file.

Building a WAIS Index Using a Script

For the most complex document structures, you should use a shell script. A script also enables you to easily specify the types of files you want to be indexed and the types of files you want to ignore.

The csh script in Listing 28.7 was written by Kevin Hughes of EIT and can be used to index documents at your site with waisindex. Documents you don't want to index the contents of, such as gif images, are specified with the nocontents flag. This flag tells waisindex to index only the file name and not the contents.

Listing 28.7. csh script for indexing documents using waisindex.


#! /bin/csh



set rootdir = /usr/local/www

#       This is the root directory of the Web tree you want to index.



set index = /usr/local/httpd/wais/sources/index

#       This is the name your WAIS indexes will be built under.

#       Index files will be called index.* in the /usr/local/httpd/wais/sources

#       directory, in this example.



set indexprog = /usr/local/httpd/wais/waisindex

#       The full pathname to your waisindex program.



set nonomatch

cd $rootdir

set num = 0

foreach pathname (`du $rootdir | cut -f2 | tail -r`)



        echo "The current pathname is: $pathname"

        if ($num == 0) then

                set exportflag = "-export"

        else

                set exportflag = "-a"

        endif

        $indexprog -l 0 -nopairs -nocat -d $index $exportflag $pathname/*.html

        $indexprog -l 0 -nopairs -nocat -d $index -a $pathname/*.txt

        $indexprog -l 0 -nopairs -nocat -d $index -a $pathname/*.c

        $indexprog -nocontents -l 0 -nopairs -nocat -d $index -a $pathname/*.ps

        $indexprog -nocontents -l 0 -nopairs -nocat -d $index -a $pathname/*.gif

        $indexprog -nocontents -l 0 -nopairs -nocat -d $index -a $pathname/*.au

        $indexprog -nocontents -l 0 -nopairs -nocat -d $index -a $pathname/*.hqx

        $indexprog -nocontents -l 0 -nopairs -nocat -d $index -a $pathname/*.xbm

        $indexprog -nocontents -l 0 -nopairs -nocat -d $index -a $pathname/*.mpg

        $indexprog -nocontents -l 0 -nopairs -nocat -d $index -a $pathname/*.pict

        $indexprog -nocontents -l 0 -nopairs -nocat -d $index -a $pathname/*.tiff

        @ num++

end

echo "$num directories were indexed."
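
The pipeline of du, cut, and tail in Listing 28.7 exists only to produce a list of every directory under the root, with parent directories ahead of their children. As a point of comparison, the same list can be produced in Python with os.walk. This is an illustrative sketch, not part of the original script; the function name list_directories is hypothetical, and the demonstration runs on a throwaway directory tree rather than a real Web root.

```python
import os
import tempfile

def list_directories(rootdir):
    """Return rootdir and every directory beneath it, parents before children."""
    found = []
    # os.walk does not follow symbolic links unless followlinks=True is passed.
    for dirpath, _dirnames, _filenames in os.walk(rootdir):
        found.append(dirpath)
    return found

# Demonstrate on a temporary tree so the sketch is self-contained.
with tempfile.TemporaryDirectory() as rootdir:
    os.makedirs(os.path.join(rootdir, "docs", "images"))
    for pathname in list_directories(rootdir):
        print("The current pathname is:", pathname)
```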

The script shown in Listing 28.8 for indexing directories based on file type was created by Michael Grady from the University of Illinois Computing & Communications Services Office. This Perl script is based on the csh script in Listing 28.7. Although both scripts are terrific and get the job done right, the Perl script offers more control over the indexing.

Listing 28.8. Perl script for indexing documents using waisindex.


#!/usr/local/bin/perl

# Michael Grady,  Univ. of Illinois Computing & Communications Services Office

# Perl script to index the contents of a www tree. This is derived from a csh

# script that Kevin Hughes of EIT constructed for indexing files.



$rootdir = "/var/info/www/docs";

#       This is the root directory of the Web tree you want to index

$index = "/var/info/www/wais-sources/ccso-main-www";

#       This is the name and location of the index to be created

$indexprog = "/var/info/gopher/src/fw02sf/bin/waisindex";

#       The full pathname of the waisindex program

$url = "http://www.uiuc.edu";

#       The main URL for your Web. No slash at the end!



$numdir = $num = 0;



# Generate a list of directory names, then for each directory, generate an

# array of all the filenames in that directory except for . and .. . Sort this

# list so that if there is an .htaccess file in that directory, it comes near

# the front of the list. We assume that if you've bothered to put special

# access controls into a directory, then maybe you don't want these files

# indexed in a general index. You of course can remove this restriction if you

# want. Then we separate all the files in the directory into two lists: one

# list is those file types for which it is appropriate to index the contents of

# the files, and the second list are those whose file types are such we don't

# want to index the contents, just the filename (gif, for instance). Then

# if there are any files in either of these lists, we call waisindex to index

# them. The first time we index, we do not include the -a flag, so that the

# index replaces the current one. Every subsequent call to waisindex includes

# the -a flag so that we then add to the new index we are building. We include

# the -nopairs option on all waisindex calls, because this saves a lot of

# unused info from being put into the index.



# If this is run by cron, redirect print statements to file (or /dev/null).

# Probably want to add a "-l 0" option to the waisindex call also.

#open (LOGIT, ">>/tmp/waisindex.run");

#select LOGIT;



# Put in the appropriate path on your system to each of the commands

# "du", "cut" and "tail", in case you want to run this from a cronjob and

# these commands are not in the default path. Note that "du" will not follow

# symbolic links out of this "tree".

open (PATHNAMES,"/usr/bin/du $rootdir | /usr/bin/cut -f2 |/usr/bin/tail -r |");

DO_PATH: while ( $pathname = <PATHNAMES>) {

        chop $pathname;



        # The following are "path patterns" that we don't want to

        # follow (subdirectories whose files we do not want to index).

        # Add or subtract from this list as appropriate. These may

        # be directories you don't want to index at all, or directories

        # for which you want to build their own separate index.

        next DO_PATH if $pathname =~ /uiucnet/i;

        #next DO_PATH if $pathname =~ /demopict/i;

        next DO_PATH if $pathname =~ /images/i;

        next DO_PATH if $pathname =~ /testdir/i;



        print "Current pathname is: $pathname\n";

        $numdir++;

        @contents = @nocontents = ();

        opendir(CURRENT_DIR, "$pathname")

                        || die "Can't open directory $pathname: $!\n";

        @allfiles = sort (grep(!/^\.\.?$/, readdir(CURRENT_DIR)));

        closedir(CURRENT_DIR);



        DO_FILE: foreach $file (@allfiles) {

                        # skip directories that contain a .htaccess file

                        # note this is NOT smart enough to be recursive (if a

                        # directory below this does not itself contain an

                        # .htaccess file, it WILL be indexed).

                next DO_PATH if $file eq '.htaccess';

                        # filetypes for which we want to index contents

                $file =~ /\.html$/i &&

                   do { push(@contents, "$pathname/$file"); next DO_FILE;};

                $file =~ /\.te?xt$/i &&

                   do { push(@contents, "$pathname/$file"); next DO_FILE;};

                $file =~ /\.pdf$/i &&

                   do { push(@contents, "$pathname/$file"); next DO_FILE;};

                #$file =~ /\.ps$/i &&

                   #do { push(@contents, "$pathname/$file"); next DO_FILE;};



                        # filetypes for which we DON'T want to index contents

                $file =~ /\.gif$/i &&

                   do { push(@nocontents, "$pathname/$file"); next DO_FILE;};

                #$file =~ /\.au$/i &&

                   #do { push(@nocontents, "$pathname/$file"); next DO_FILE;};

                #$file =~ /\.mpg$/i &&

                   #do { push(@nocontents, "$pathname/$file"); next DO_FILE;};

                #$file =~ /\.hqx$/i &&

                   #do { push(@nocontents, "$pathname/$file"); next DO_FILE;};

        # Comment out the above lines to your liking, depending on what

        # filetypes you are actually interested in indexing.

#       For instance, if the ".mpg" line is commented out, then

#       MPEG files will *not* be indexed into the database (and thus

#       won't be searchable by others).

        } # end DO_FILE loop



        if ($#contents >= 0) {          # Index if any files in list.

                @waisflags = ("-a", "-nopairs");

                @waisflags = ("-nopairs") if $num == 0;

                $num ++;

                system($indexprog, "-d", $index, @waisflags, "-t", "URL",

                                $rootdir, $url, @contents);

        }

        if ($#nocontents >= 0) {        # Index if any files in list.

                @waisflags = ("-a", "-nopairs");

                @waisflags = ("-nopairs") if $num == 0;

                $num ++;

                system($indexprog, "-d", $index, @waisflags, "-t", "URL",

                                $rootdir, $url, "-nocontents", @nocontents);

                # note that "-nocontents" flag must follow any -T or -t option

        }

} # end DO_PATH loop



close(PATHNAMES);

print "Waisindex called $num times.\n";

print "Tried indexing $numdir directories.\n";

# end of script
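
The heart of Listing 28.8 is the step that sorts file names into two lists by suffix: files whose contents are indexed and files for which only the name is indexed. That step is sketched below in Python for readers who find the Perl idioms dense. The suffix sets match the lines that are not commented out in the listing; the function name classify is hypothetical, and the .htaccess check and the call to waisindex are omitted.

```python
import os

CONTENTS = {".html", ".txt", ".text", ".pdf"}  # index the full text
NOCONTENTS = {".gif"}                          # index the file name only

def classify(filenames):
    """Split file names into (contents, nocontents) lists by suffix."""
    contents, nocontents = [], []
    for name in sorted(filenames):
        suffix = os.path.splitext(name)[1].lower()  # case-insensitive match
        if suffix in CONTENTS:
            contents.append(name)
        elif suffix in NOCONTENTS:
            nocontents.append(name)
        # Any other suffix is left out of the index entirely.
    return contents, nocontents

print(classify(["logo.GIF", "index.html", "notes.txt", "movie.mpg"]))
# prints (['index.html', 'notes.txt'], ['logo.GIF'])
```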

Testing the WAIS Database

After you have installed freeWAIS, started waisserver, and built an index, you will want to test your new WAIS system. You can do this using waisq. If the database was indexed with the following command,


/usr/local/httpd/wais/waisindex -l 1 -d /usr/local/httpd/wais/sources/webdocuments -e /usr/local/httpd/logs/wais.log -r /users/webdocs

you could invoke waisq as follows to test the database:


/usr/local/httpd/wais/waisq -m 40 -c /usr/local/httpd/wais/sources -f - -S webdocuments -g Stanek

This command tells waisq to return a maximum of 40 matches and to search the webdocuments source file located in the /usr/local/httpd/wais/sources directory for the keyword Stanek. If all goes well and some documents contain the keyword, the server should respond with output similar to the following:


Searching webdocuments.src . . . Initializing connection . . . Found 28 items.

After this message, the server should produce output containing the search word used and the results of the query. Keep in mind that this output is normally interpreted by your WAIS gateway. The WAIS gateway processes this output, creates a document containing the results, and sends the document to the client originating the search.

Installing and Configuring `SWISH`

SWISH, the Simple Web Indexing System for Humans, is an easy-to-use alternative to freeWAIS. SWISH is a good choice if you want to experiment with indexing and search engines. Besides being easy to install, SWISH creates very small indexes compared to a WAIS index. Using the PLIMIT and FLIMIT settings, you can squeeze what otherwise would be a large index into about one-tenth of the file space. Because a smaller file is quicker to search, SWISH can display results faster than many other search engines. However, there is a trade-off between file size and search results: a smaller index contains less data, so the smaller the index, the less accurate the results of the search.

SWISH has a couple of limitations. Because it can search only local SWISH databases, you must use another indexing system if you need to access remote hosts. Additionally, SWISH works best with small to medium-size databases, so if you have a large site with hundreds of megabytes of files to index, you may want to use freeWAIS instead of SWISH.

You can find information on the current version of SWISH and obtain the source code from EIT Corporation at


http://www.eit.com/software/

After you download the source code to your computer from the EIT Web site and uncompress the source code as necessary, you can begin installing and configuring SWISH. The first step is to change directories to the SWISH source directory and update the config.h file. If you've just uncompressed SWISH, you should be able to change directories to swish/src or simply src.

In the config.h file, you need to set parameters for your specific system. This file is also where you update the PLIMIT and FLIMIT variables that control the size of your index files. After you set those parameters by following the inline documentation, you can compile SWISH. SWISH compiles fine with any C compiler, even plain old gcc.

Setting Up the SWISH Configuration File

The next step is to edit the SWISH configuration file. This file is usually located in the src directory and is used to configure environment variables for search and retrieval results. After you've updated the configuration file, you can name it anything you want, such as swish.conf. Listing 28.9 is a sample configuration file for SWISH.

Listing 28.9. Sample SWISH configuration file.


# SWISH configuration file



IndexDir /usr/webdocs

# This is a space-separated list of files and directories you want to index.



IndexFile /usr/local/httpd/swish/sources/index.swish

# This is the name of your SWISH-indexed database.



IndexAdmin "William Stanek publisher@tvp.com"

IndexDescription "Index of key documents at the Web site"

IndexName "Index of TVP Web site"

IndexPointer "http://tvp.com/cgi-bin/wwwwais/"

# Additional information that can be used to describe the index,

# the WAIS gateway used, and the administrator



FollowSymLinks yes

# If you want to follow symbolic links, put yes. Otherwise, put no.



IndexOnly .html .txt .c .ps .gif .au .hqx .xbm .mpg .pict .tiff

# Only files with these suffixes will be indexed.



IndexVerbose yes

# Set this to yes to show indexing information as SWISH is working.



NoContents .ps .gif .au .hqx .xbm .mpg .pict .tiff

# Files with these suffixes won't have their contents indexed,

# only their file names.



IgnoreLimit 75 200

# To ignore words that occur too frequently, you will want to

# set this parameter. The numbers say ignore words that occur

# in this percentage of the documents and occur in at least this

# many files. Here, ignore words that occur in 75% of the files

# and occur in over 200 files. If this variable is not set, SWISH

# uses a default setting.



IgnoreWords SwishDefault

# This variable allows you to set your own stop words.

# To do this, you replace the word SwishDefault with a space-

# separated list of stop words. You can use multiple assignments

# if necessary.
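
The IgnoreLimit comment in Listing 28.9 describes a two-part test: a word is dropped only if it is both widespread and found in a large number of files. The following Python sketch shows that rule as the comment states it. The function name and the sample counts are hypothetical, and SWISH's own C code may handle the boundary cases differently.

```python
def should_ignore(files_with_word, total_files, percent_limit=75, file_limit=200):
    """Return True if a word is common enough to be dropped from the index.

    Hypothetical helper that mirrors the IgnoreLimit comment: a word is
    ignored only when BOTH thresholds are met.
    """
    if total_files == 0:
        return False
    percent = 100.0 * files_with_word / total_files
    return percent >= percent_limit and files_with_word > file_limit


# A word found in 900 of 1,000 files passes both tests and is ignored.
print(should_ignore(900, 1000))   # True
# A word found in 90 of 100 files is frequent but under the 200-file floor.
print(should_ignore(90, 100))     # False
```

The second threshold is what keeps a small site from losing useful words: with only 100 documents, no word can reach the 200-file floor, so nothing is ignored by this rule.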

The most important variables in the configuration file are IndexDir and IndexFile. The IndexDir variable enables you to specify the files and directories to index. If you enter multiple directories and file names, separate them with spaces. You can make more than one IndexDir assignment, if necessary. The IndexFile variable tells SWISH where to store the index. Because SWISH does not add the .src extension to the file name, you can name the file anything you want. However, you may want to use an extension of .swish so you know the file is a SWISH-indexed database.
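
If you want to check a configuration file before you run the indexer, a few lines of Python can show you how the directives add up. This is only a sketch and not SWISH's own parser. The function name is hypothetical, and the extra IndexDir entries in the sample are made up to show how repeated assignments combine.

```python
import shlex
from collections import defaultdict


def parse_swish_config(text):
    """Collect directives from SWISH-style configuration text.

    Hypothetical helper, not SWISH's parser. Repeated directives such as
    IndexDir accumulate, and quoted values stay together as one item.
    """
    settings = defaultdict(list)
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        name, *values = shlex.split(line)
        settings[name].extend(values)
    return dict(settings)


sample = """
# Two IndexDir assignments are merged into one list.
IndexDir /usr/webdocs /usr/webdocs/archive
IndexDir /usr/local/httpd/htdocs/faq.html
IndexFile /usr/local/httpd/swish/sources/index.swish
IndexName "Index of TVP Web site"
"""

config = parse_swish_config(sample)
print(config["IndexDir"])      # every file and directory to index
print(config["IndexFile"][0])  # where the index will be written
```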

Compiling and Running SWISH

After you update the configuration file, you can move the compiled SWISH program, swish, and the configuration file to an appropriate directory, such as:


/usr/local/httpd/swish/

To run SWISH and index the files and directories specified in the configuration file, change directories to where SWISH is located and type the following:


./swish -c /usr/local/httpd/swish/swish.conf

Based on the settings in the previously defined configuration file, when SWISH finishes indexing your site, the indexed database will be located here:


/usr/local/httpd/swish/sources/index.swish
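
If you plan to rebuild the index regularly, you may find it convenient to wrap the command in a short script. The following Python sketch assumes the swish program was moved to /usr/local/httpd/swish/ as described above. The helper names and the second configuration file are hypothetical, and the only SWISH option used is the -c option shown earlier.

```python
import subprocess


def build_index_command(swish_bin, config_path):
    """Return the argument list for indexing with a given configuration file."""
    return [swish_bin, "-c", config_path]


def run_indexer(swish_bin, config_path):
    """Run SWISH and raise CalledProcessError if it exits with an error."""
    subprocess.run(build_index_command(swish_bin, config_path), check=True)


# One configuration file per database; the second path is hypothetical.
configs = [
    "/usr/local/httpd/swish/swish.conf",
    "/usr/local/httpd/swish/swish-docs.conf",
]
for path in configs:
    print(" ".join(build_index_command("/usr/local/httpd/swish/swish", path)))
    # On a host with SWISH installed, call run_indexer() here instead:
    # run_indexer("/usr/local/httpd/swish/swish", path)
```

Keeping the command in a list rather than a single string avoids shell quoting problems if a path ever contains spaces.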

Because SWISH lets you specify the full path to the configuration file, you can have different configuration files for different databases. To use SWISH with a gateway, you must ensure that the script has been modified to work with SWISH or is SWISH-friendly. To modify a gateway so that it is SWISH-friendly, you may only have to change the path for its search engine from its current setting to the full path to the SWISH executable file. An example of a SWISH-friendly gateway is wwwwais.c. The wwwwais.c program enables you to set the path to SWISH executable files and sources.

These are the settings that make the program SWISH-friendly:


# The full path to your SWISH program.
SwishBin /usr/local/bin/swish

# WAIS source file descriptions
# These represent the path to the indexed databases
# for SWISH sources:
#    SwishSource full_path_to_source/source.swish "description"
SwishSource /usr/local/httpd/wais/index/index.swish "Search our Web"
SourceRules replace "/usr/local/www/" "http://www.tvpress.com/"
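
The SourceRules line is what turns file paths stored in the index into URLs your visitors can follow. The following Python sketch illustrates the idea behind a replace rule. It is not the gateway's own code; the function name and the sample file path are hypothetical.

```python
def apply_replace_rule(path, old, new):
    """Swap the first occurrence of `old` in `path` for `new`.

    Hypothetical helper; wwwwais performs its own substitution in C.
    """
    return path.replace(old, new, 1)


result_path = "/usr/local/www/docs/search.html"  # hypothetical indexed file
print(apply_replace_rule(result_path, "/usr/local/www/", "http://www.tvpress.com/"))
# http://www.tvpress.com/docs/search.html
```

For a rule like this to have any effect, the text it replaces must match the start of the paths you indexed, so check it against your IndexDir setting.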

Other Search Engines

Dozens of commercial, freeware, and shareware search engines are available. If you are looking for a commercial-grade search engine that is free, you may have to look no further than Excite for Web Servers (see Figure 28.7).

Figure 28.7 : Excite for Web Servers: a next-generation search engine.

Excite for Web Servers is a next-generation search engine that accepts natural language input, which means users can enter search information in whole sentences and don't have to use keywords. Another great feature of EWS is the capability to browse the index, which allows users to search through the database using a directory tree structure. EWS is available for most UNIX operating systems and Windows NT. You can test drive EWS at Excite before you download and install it:


http://www.excite.com/navigate/home.html

The developers at Excite claim you can download, install, and have EWS running on your system in 30 minutes. Although this claim is definitely true, there is a downside to EWS. EWS requires a minimum of 32MB of RAM, and EWS searches are system resource-intensive. Therefore, before you install EWS, carefully consider how your server will handle the additional load.

If you are looking for the best search engine available and cost is not a major consideration, you may be looking for Livelink Search from Open Text, Inc. (see Figure 28.8). Most search engines only allow you to search HTML and ASCII formatted documents, but Livelink Search allows you to search just about any type of text-based document, including HTML, SGML, Acrobat PDF, Microsoft Word, WordPerfect, and most other word-processing and spreadsheet formats. Not only does Livelink Search allow you to index non-HTML documents, it also converts non-HTML documents on the fly to HTML for viewing in Web browsers.

Figure 28.8 : Livelink Search allows searching and indexing of non-HTML documents in dozens of popular formats.

Livelink Search is a complete Web publishing solution and includes the Netscape Enterprise Web server. You can test drive Livelink Search at Open Text:


http://www.opentext.com/livelink/ll_search.html

Summary

Building an indexed database and creating Web documents that access the database via a gateway require a lot of effort on the part of the Web publisher. Yet if you take the process one step at a time, you can join the thousands of Web publishers who have indexed their Web sites and given Web users the ability to search the site quickly and efficiently. Enabling the interface from a fill-out form in your Web document to an indexed database involves these steps:

1. Obtain the appropriate software. If you use freeWAIS, the package includes waisserver, waisq, waissearch, and waisindex. These programs will handle searching and indexing. You will also need to select a gateway, such as wwwwais.c.
2. Install and configure the software.
3. Build your indexed databases.
4. Create a fill-out form to submit data to the gateway.
5. Test the search capabilities of the index.
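
When you reach the testing step, it can help to build the same request your fill-out form would send and try it directly. The following Python sketch assembles such a URL. The gateway address is based on the IndexPointer setting in Listing 28.9, and the field names keywords and maxhits are assumptions, so substitute the names your gateway and form actually use.

```python
from urllib.parse import urlencode


def build_search_url(gateway_url, keywords, max_hits=40):
    """Return a GET URL that submits a query the way a fill-out form would.

    The field names `keywords` and `maxhits` are assumptions; use whatever
    names your gateway expects.
    """
    query = urlencode({"keywords": keywords, "maxhits": max_hits})
    return f"{gateway_url}?{query}"


print(build_search_url("http://tvp.com/cgi-bin/wwwwais", "web publishing"))
# http://tvp.com/cgi-bin/wwwwais?keywords=web+publishing&maxhits=40
```

Pasting the resulting URL into a browser lets you confirm that the gateway and the index respond before you spend time polishing the form itself.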
