W3 Catalog History
This page briefly describes the CUI W3 Catalog, how it
got started, and why it has come to an end.
What is W3 Catalog?
W3 Catalog was one of the first search engines that attempted to
provide a general searchable catalog for WWW resources.
It ran from September 2, 1993 to November 8, 1996, at the
Centre Universitaire d'Informatique (CUI) of the University of Geneva.
How did W3 Catalog get started?
The World Wide Web was developed by Tim Berners-Lee at CERN in the early 1990s. Initially, the only widely available browsers were purely textual; the only graphical browser was Tim's own NeXT implementation. This changed in the spring of 1993, when NCSA introduced its Mosaic browser for X platforms. At the same time, multiple servers for different platforms became available.
Since CERN was just up the road from the University of Geneva,
we invited Tim to give a seminar on the WWW with a live demo.
For the demo we set up our own server, and on June 25, 1993,
the CUI server was publicly announced.
Although the navigational possibilities of the Web were self-evident, it was not clear how one could (or should) provide query facilities for Web resources. We had a number of applications waiting for an easy way to be implemented on the Web (such as a searchable interface to the CUI library database).
At the time, the CERN http server seemed to be very difficult to
configure to run ``active pages'' (i.e., pages whose output would be dynamically generated). The search for a better platform led us to switch on August 10, 1993 to the
Plexus server, implemented in Perl.
This made it very easy for us to install ad hoc search engines
for The Language List and the
OO Bibliography Database as Perl packages directly integrated into our server.
Since the ability to set up a search engine seemed generally useful, I decided to implement a simple, configurable Perl package for the Plexus server that could easily be adapted to different applications. (Basically, all the package, called parscan.pl, would do was return the HTML paragraphs matching an ISINDEX query string.)
Parscan was made available on September 2, 1993.
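The original parscan.pl is not reproduced here, but its core behaviour can be sketched in a few lines of Perl. The following is a hypothetical reconstruction (the real package was more configurable): it treats the input file as a sequence of blank-line-separated HTML paragraphs and prints every paragraph that matches the query string, which is essentially what parscan did with an ISINDEX query.

    #!/usr/bin/perl
    # Hypothetical sketch of parscan's core idea (not the original package):
    # read a file of blank-line-separated HTML paragraphs and print every
    # paragraph that matches the query string.
    use strict;
    use warnings;

    my ($file, $query) = @ARGV;
    die "usage: parscan-sketch file query\n" unless defined $query;

    open my $fh, '<', $file or die "cannot open $file: $!\n";
    local $/ = '';                     # "paragraph mode": read one chunk at a time
    while (my $para = <$fh>) {
        print $para, "\n" if $para =~ /\Q$query\E/i;   # case-insensitive match
    }
    close $fh;
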
At the same time, I noticed that many industrious souls had gone about creating high-quality lists of WWW resources, and made these lists available as part of other services, such as CERN's
WWW
Virtual Library.
The only problem with these lists was that they were not searchable. With parscan, a simple solution suggested itself:
- Periodically download lists of resources using a Perl script that connects to the HTTP servers where the lists are stored (such mirroring programs are now common).
- Use an ad hoc Perl script to reformat the lists into individual HTML paragraphs (or ``chunks''), and glue all the chunks together into one searchable file (a rough sketch of such a chunker appears after this list).
- Provide a parscan interface to query the file.
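As a concrete illustration of the second step, a chunker for a simple HTML bulleted list might look something like the following. This is an assumed example, not the script that was actually used; real lists were messier, which is exactly why the chunker kept breaking.

    #!/usr/bin/perl
    # Hypothetical chunker sketch (not the script actually used): turn each
    # <LI> item of a downloaded HTML list into a stand-alone paragraph
    # ("chunk"), separated by blank lines so that a paragraph-at-a-time
    # matcher like parscan can search the result.
    use strict;
    use warnings;

    my $source = shift @ARGV or die "usage: chunk-sketch list.html\n";

    open my $in, '<', $source or die "cannot open $source: $!\n";
    my $html = do { local $/; <$in> };     # slurp the whole file
    close $in;

    # Split on list items and discard everything before the first <LI>.
    my @items = split /<li[^>]*>/i, $html;
    shift @items;

    for my $item (@items) {
        $item =~ s{</(li|ul|ol|body|html)>.*}{}is;   # drop trailing markup
        $item =~ s/\s+/ /g;                          # normalize whitespace
        print "<p>$item</p>\n\n";                    # one chunk per paragraph
    }
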
The search engine (briefly called ``jughead'', but soon renamed ``w3catalog'')
was announced on September 2, 1993.
How did W3 Catalog evolve?
Essentially very little has changed since W3 Catalog was initially installed.
The main changes are:
- New lists were gradually added to the original sources that made up
the searchable catalog. At no time was there any intention to
allow people to directly register resources with W3 Catalog.
Resources appearing in the consulted lists would automatically
appear in W3 Catalog.
- The ``chunking'' script was frequently modified to cope with changes
in the formats of the individual lists. This was always a problem
with the approach, since it was not self-evident how to write a
completely general-purpose chunker.
Parscan itself evolved independently, since it served as a generic
implementation for multiple search engines. In October 1993,
several http servers adopted a convention that would allow active
pages to be implemented as separate programs and scripts.
Such a program would be put in a standard directory (called ``htbin''),
and files found there would be executed by the server instead of being displayed as text or HTML.
This seemed like an ideal way to open up parscan and make it available
as a utility for multiple servers. On October 27, 1993, I released
an htbin package for Plexus, allowing the Plexus server to run
htbin scripts, and at about the same time I converted parscan so that it
could be run as an htbin script.
My timing was a little off, though, since shortly afterwards the
CGI standard was proposed and adopted by most of the servers.
Nowadays most simple active pages are implemented as ``cgi scripts''.
Finally, on May 12, 1994, parscan was reborn as a generic cgi script, and renamed htgrep.
Htgrep is now used as a back-end for numerous search engines around the WWW.
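For readers unfamiliar with the convention, the CGI mechanism is simple: the server passes the search terms to the script in the QUERY_STRING environment variable, and whatever the script writes to standard output (an HTTP header followed by an HTML body) is returned to the browser. The following is a minimal sketch of a CGI search script in the spirit of htgrep; it is not the actual htgrep code, and the catalog path is an assumption.

    #!/usr/bin/perl
    # Minimal sketch of a CGI-style search script in the spirit of htgrep
    # (not the actual htgrep code; the catalog path below is an assumption).
    # The server passes the search terms in QUERY_STRING; the script replies
    # with an HTML document on standard output.
    use strict;
    use warnings;

    my $catalog = '/home/www/data/w3catalog.html';   # assumed location of the chunk file
    my $query   = $ENV{QUERY_STRING} // '';
    $query =~ s/\+/ /g;                              # undo form encoding
    $query =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge;

    print "Content-type: text/html\n\n";
    print "<html><body><h1>Search results</h1>\n";

    if ($query ne '') {
        open my $fh, '<', $catalog or die "cannot open $catalog: $!\n";
        local $/ = '';                               # paragraph mode, as in parscan
        while (my $para = <$fh>) {
            print $para, "\n" if $para =~ /\Q$query\E/i;
        }
        close $fh;
    }
    print "</body></html>\n";
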
Why has W3 Catalog been stopped?
Although W3 Catalog was very popular, it has been made obsolete
for a number of technical and practical reasons.
- Maintenance overhead:
W3 Catalog was designed to rebuild itself automatically without
administrator intervention. Unfortunately, there is still a certain amount of work involved in keeping track of the lists that are consulted, making sure that URLs are valid, and answering email requests. Nobody has the time or the interest to keep doing this.
- Fragility:
The weak link in the chain is the chunking. If the format of the lists changes slightly, the chunking script can break, and the catalog will start to contain ``ugly'' records (i.e., records that are several pages long, or records that don't form complete or valid HTML paragraphs). Writing a more robust chunker is a non-trivial task.
- Implementation overhead:
Htgrep was designed to be a simple, general-purpose back-end to ad hoc search engines. It is cheap (free!), portable (Perl), and relatively simple to install (well, there is a
FAQ).
Unfortunately, htgrep does not lend itself very well to applications like W3 Catalog because of its extreme simplicity:
- For each request, a new process is created to run an instance of the htgrep CGI script. This, of course, starts a copy of the Perl interpreter, which reads and compiles htgrep for every request.
- For each request, the entire W3 Catalog database is read and scanned.
This overhead is fine for small catalogs or for relatively infrequently accessed search engines. But the W3 Catalog database is several megabytes, and, at peak times, several requests are received every second. This quickly brings the CUI server to its knees.
- Competition:
Frankly, much better, faster, and more comprehensive search engines are
now available. Alta
Vista and HotBot are two
popular search engines whose contents are generated by periodically
scanning the entire WWW. Their implementations are certainly
not based on CGI, and are highly optimized.
Is there a life after death?
Well, in this case, I don't think so. W3 Catalog has had its run, and
has been widely referenced in many of the early books on the WWW.
There is still a need for a search engine that is based on
high-quality, human-managed lists, but a coordinated effort is needed
to provide a common interface to such lists. Experience shows that
individual list maintainers are generally reluctant to switch to a
standard format if they feel this will make more work for them. (If
you have ever maintained a list, you will sympathize with this
position!) Still, if anyone is interested in reviving such an effort, I
will be happy to offer my 2 cents worth.
Oscar Nierstrasz
November 8, 1996