11.0328 altered vistas revisited

Humanist Discussion Group (humanist@kcl.ac.uk)
Fri, 10 Oct 1997 08:19:28 +0100 (BST)

Humanist Discussion Group, Vol. 11, No. 328.
Centre for Computing in the Humanities, King's College London

[1] From: "Charles L. Creegan" <ccreegan@ncwc.edu> (12)
Subject: Re: 11.0327 altered vistas

[2] From: "C. M. Sperberg-McQueen" <cmsmcq@hd.uib.no> (135)
Subject: Re: 11.0327 altered vistas & eroded feet of clay

Date: Tue, 07 Oct 1997 16:55:02
From: "Charles L. Creegan" <ccreegan@ncwc.edu>
Subject: Re: 11.0327 altered vistas

....all of which is to say, use MetaCrawler. The syntax is necessarily
limited, but you do get answers from several indices and search engines,
each of which has a slightly different protocol. There's no systematic
research behind my recommendation; I just notice consistently useful results.
I don't *want* to know *everything* the web has to say on a topic!

I did find arbitrary Geocities pages I know of using MetaCrawler. They
were returned by the WebCrawler engine.

BTW: a 6,000 page site should probably have tables of contents, at least,
for every major topic in the top 300 pages. At a minimum a page full of
keywords and a link to your own internal search engine...

Charles L. Creegan    N.C. Wesleyan College    ccreegan@ncwc.edu

--[2]------------------------------------------------------------------
Date: Wed, 8 Oct 1997 10:19:10 +0100 (MET)
From: "C. M. Sperberg-McQueen" <cmsmcq@hd.uib.no>
Subject: Re: 11.0327 altered vistas & eroded feet of clay

At 08:41 PM 10/7/97 +0100, Willard McCarty forwarded a note from Matthew Kirschenbaum, who enclosed a posting from John Pike on the Red Rock Eater list; one or more of these individuals (it's not always clear to me who wrote what) wrote:

> ... that the large search engines are
>not nearly as current or comprehensive as we tend to think

The phrase 'as current or comprehensive as we tend to think' is as near to irrefutable as one can come while still making what sounds like a substantive statement. So I won't try to refute it. But I will point out that it doesn't say as much as one might be inclined to suppose on first reading.

In fact, I have no problem with any of the substantive observations in the posting, whether from Dr. Pike, Dr. Kirschenbaum, or Dr. McCarty. It's more than plausible to me that a web crawler might take several weeks or months to visit all the pages on a plausible starting list of Web pages, and that an indexer which must not only visit, but index, each page is likely to take weeks, if not months, before revisiting and reindexing pages which have changed.

What does bother me is the shock and dismay, and the tone of exposure; these, like some of the characterizations of search engine behavior, do not seem to me to be warranted by the claims made. If we tend to think that AV can do the implausible, because we have not bothered to do even a back-of-the-envelope calculation of the bandwidth and indexing speed which would be required, then it seems to me that the shock and dismay ought to be directed at users' unrealistic and ill-founded expectations of the Web search engines, rather than at the search engines themselves. I didn't see anything in the posting to suggest that the search engines have actively misrepresented their services or their algorithms; at the most, there is some evidence that they have chosen to pass in silence over the misconceptions some users will bring to their services, rather than attempt to refute them. This may be sad, but it doesn't surprise me that much.
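A back-of-the-envelope calculation of this kind might look like the sketch below. It is only an illustration: the 31 million page figure is the one AltaVista is quoted as claiming later in this message, while the fetch-and-index rate of 10 pages per second is a purely hypothetical assumption, not a measured or published number.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def days_for_full_pass(total_pages: int, pages_per_second: float) -> float:
    """Days needed to fetch and index every page once at a steady rate."""
    if pages_per_second <= 0:
        raise ValueError("pages_per_second must be positive")
    return total_pages / (pages_per_second * SECONDS_PER_DAY)


def required_rate(total_pages: int, days: float) -> float:
    """Pages per second needed to cover total_pages within the given days."""
    if days <= 0:
        raise ValueError("days must be positive")
    return total_pages / (days * SECONDS_PER_DAY)


INDEXED_PAGES = 31_000_000  # index size quoted from AltaVista in this message
ASSUMED_RATE = 10.0         # hypothetical pages per second, for illustration

print(f"{days_for_full_pass(INDEXED_PAGES, ASSUMED_RATE):.1f} days per full pass")
print(f"{required_rate(INDEXED_PAGES, 1):.0f} pages/second for a daily pass")
```

Under that assumed rate a single pass over the index takes about 36 days, and revisiting every indexed page daily would need roughly 359 pages per second, sustained. Different assumptions give different numbers, but the order of magnitude is the point.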

>>While AltaVista is indeed an estimable implementation, most
>>web.surfers will be astonished to learn that, contrary to this
>>conventional wisdom, AltaVista indexes only a small, flawed,
>>arbitrary and not even random sample of what is on the web today.

None of the adjectives used here seem to me to be justified by anything else in the posting. 'Small' is justified only in comparison to the expectation (based on what? On journalistic oversimplifications in the Washington Post?) that search engines ought to index every page on the Web. In any absolute sense, I think 31 million pages is hard to call 'small'. The words 'flawed', 'arbitrary' and 'not ... random' may be true, or may not be. But if any evidence is given to support them, I missed it.

>>Estimates of the total content of the web are of necessity
>>speculative, but run as high as 150 million pages. AltaVista
>>claims < http://altavista.digital.com/ > to be "the largest Web
>>index: 31 million pages found on 476,000 servers." So where are
>>the missing pages ?? [or as Ronald Reagan asked "where is
>>the rest of me??].

Not only are they speculative, but I rather suspect the high end of the range of estimates includes (does it not?) generated text, PDF and other formats often not indexed, pages behind fire walls, and pages flagged do-not-index. What is the high estimate after such pages are deducted? What is the low estimate?

>>There are many reasons a web page might not show up in the
>>AltaVista index. ...
>>But surely this does not explain why the estimable AltaVista
>>indexes only 20% of the web.

'Surely'? Why not? At the least, it makes 20% an implausibly low number.

>>This certainly creates the impression that once AltaVista has even
>>one URL from a site, it will automatically [in the fullness of time,
>>but that is another story as well....] include the entire site in
>>its widely used index. Certainly, this claim is the reason that

When did 'the impression' become a 'claim'?

>>AltaVista is so widely relied upon, and the reason that most
>>web.users assume that "if it ain't in AltaVista, it ain't online"

I don't know about anyone else, but I rely on AltaVista because Digital has devoted (at last report) eight Alphas to perform the searches, sharing a RAM array larger than my hard disk, and the engine is (I assume for that reason) pretty fast. An implausible and uncritical assumption that it somehow captures even everything on the Web (let alone the rather larger set of everything online) as of any given date does not, as far as I can tell, play any part in my decision.

I have not read about AV's algorithm for finding new Web pages, but from the statement quoted I think it would be folly to assume that the submission of a single URL from a site would lead automatically and unconditionally to the indexing of the entire site. At the very most I might hope that (a) if my site constitutes a connected graph (every page is reachable by following links from some single page on the site -- not true, in fact, for the sites I maintain) and (b) the page whose URL I submit is one of the (non-existent, in my case) ones from which I can reach every other page on the site, then (c) a sufficiently determined Web crawler could find every page on my site.
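Conditions (a) through (c) amount to a reachability check on the site's link graph, which can be sketched as follows. This is a minimal illustration under stated assumptions: the page names and the `site` link map are invented, and nothing here describes how AltaVista or any other crawler actually works.

```python
from collections import deque


def reachable_pages(links, start):
    """Return every page reachable from start by following links."""
    seen = {start}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for target in links.get(page, ()):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return seen


def covers_site(links, start):
    """True if a crawler starting at start could find every page on the site."""
    all_pages = set(links)
    for targets in links.values():
        all_pages.update(targets)
    return reachable_pages(links, start) >= all_pages


# Hypothetical site: nothing links to orphan.html, so a crawl that
# starts from index.html can never discover it.
site = {
    "index.html": ["a.html", "b.html"],
    "a.html": ["b.html"],
    "b.html": [],
    "orphan.html": ["index.html"],
}

print(covers_site(site, "index.html"))   # False
print(covers_site(site, "orphan.html"))  # True
```

In this toy site only the otherwise unlinked page happens to reach everything, so submitting the home page alone would leave part of the site undiscoverable however determined the crawler.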

>> ... Recently I noticed that the Alta Vista search engine seemed to only
>>index about 600 of our pages. I thought that this was rather odd, since I had
>>long had the impression that AltaVista indexed pretty much everything, or at
>>least made a good-faith best effort to do so.

To be fair, I should admit that I had pretty much this impression, too.

But I'd prefer to distinguish between impressions one has got and the stated policies of a search service. The curious thing, for me, about Dr. Pike's exchange with AV's technical support is that they do not seem to describe, and he does not ask about, the algorithm they use to decide what pages to index.

>>>That is probably a good estimate...We have 600 pages from you indexed in
>>>the system. You will probably not see much more than that for any one
>>>domain. Geocities has 300...and they have 300,000 members.

>>For a medium to large site, such as ours, it means that they are only
>>indexing some arbitrarily selected subset of our total content.

Since the algorithm is not described, I'm not sure it's strictly fair to characterize it as 'arbitrary'.

>> But it is to say
>>that anyone whose online presence has been predicated on their entire site
>>[large or small] showing up in AltaVista had better think again. And that
>>anyone trying to search the 'entire' web [as opposed to some arbitrary
>>sample thereof] had best look somewhere other than AltaVista.

Any suggestions? One reason one might choose to limit the depth of indexing in a given domain is the risk of falling into a black hole, where every page you index gives you two more URLs to index, for a long enough time that your indexer effectively stalls in that domain for an appreciable period of time. So even if one did want to index every page one could find, one might reasonably choose to put all the newly found URLs at the bottom of the list of pending URLs -- which would (if it took a long time to get to the bottom of the list) look to users a lot like the AltaVista behavior Dr. Pike describes.

In other words, is there any reason to believe it is feasible to index the 'entire' web instead of a sample (arbitrary or otherwise)? If it's feasible, is there any reason to believe any search service available for free or for hire actually does so?
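The queueing strategy described above (newly found URLs go to the bottom of the pending list, optionally with a limit per domain) can be sketched in a few lines. This is a guess at one plausible design, not a description of AltaVista's indexer: the function names, the `per_domain_cap` parameter, and the two example hosts are all hypothetical, and `fake_links` stands in for real page fetching.

```python
from collections import deque
from urllib.parse import urlparse


def crawl_order(seeds, fetch_links, max_pages, per_domain_cap=None):
    """Visit URLs first-in, first-out; new URLs go to the bottom of the list."""
    pending = deque(seeds)
    seen = set(seeds)
    per_domain = {}
    visited = []
    while pending and len(visited) < max_pages:
        url = pending.popleft()
        domain = urlparse(url).netloc
        if per_domain_cap is not None and per_domain.get(domain, 0) >= per_domain_cap:
            continue  # this domain has used up its share; skip the page
        per_domain[domain] = per_domain.get(domain, 0) + 1
        visited.append(url)
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                pending.append(link)
    return visited


def fake_links(url):
    """Hypothetical link source: hole.example yields two new URLs per page."""
    parsed = urlparse(url)
    if parsed.netloc == "hole.example":
        n = int(parsed.path.strip("/"))
        return [f"http://hole.example/{2 * n + 1}",
                f"http://hole.example/{2 * n + 2}"]
    small_site = {"http://small.example/": ["http://small.example/about"]}
    return small_site.get(url, [])


seeds = ["http://hole.example/0", "http://small.example/"]
capped = crawl_order(seeds, fake_links, max_pages=10, per_domain_cap=3)
print(capped)
```

With a cap of three pages per domain, the sketch visits three pages of the endlessly branching site and both pages of the small one, then stops. From the outside that looks much like a fixed, fairly low number of indexed pages per domain.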

>>AltaVista claims to be used nearly 30 million times a day,
>>so this "undocumented feature" of AltaVista affects nearly
>>everyone who uses the web [doesn't everyone???].

Did I miss something? Dr. Pike does not describe any search of any AltaVista documentation seeking clarification on the indexing and page-selection policy. And now it's suddenly "undocumented"?

The mismatch between common expectations and the reality of the search engines is an important topic. But I could do without the adjectives like 'arbitrary', 'flawed', and the implications that innocent users have somehow been misled. The only really misleading statement cited in the entire piece is from the Washington Post, not from AltaVista.

-C. M. Sperberg-McQueen

-------------------------------------------------------------------------
Humanist Discussion Group
Information at <http://www.kcl.ac.uk/humanities/cch/humanist/>
<http://www.princeton.edu/~mccarty/humanist/>
=========================================================================