Spiders and Worms on the Web - Vincent Maes
As some of you already know, by the time you read this page I will have left Pfizer
Belgium and the health information field to take on other responsibilities. As a
consequence, I will no longer be able to fulfil any function within EAHIL, including
that of Internet Page Editor.
Thus, this is the last Internet page for me... but not for you!
I think a lot of things remain to be done and said, so the work has to continue.
My functions were:
- member of the Editorial Committee
- a quarterly Internet article (the Internet Page)
- the maintenance of the Newsletter Homepage
If you are interested in all or part of these functions, please contact the
Newsletter Editor, Luisa Vercellesi, at Luisa.Vercellesi@basiglio.zeneca.com
I hope this page was useful. Yours, Vincent Maes
About Spiders and Worms
We have seen in several past issues that there is some very interesting, quality
information available on the Internet. We will now consider the means of accessing that
information. We have already talked about specialized medical information resources (OMNI,
Medical Matrix, ...); this time we will take a broader look at general search engines.
There are a lot of search engines: some estimates put the number at more than 1,000. We
can broadly classify them into a few general types:
General
Perhaps the most well-known type of search engine. These consist of a giant database,
filled by a robot (also called a spider, worm or Web wanderer) that automatically screens
the WWW. When it finds sites not already included, it indexes them according to a certain
policy. Sites can also be included by submission, but even if you supply some keywords,
most of the indexing will still be performed by the robot.
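To make the idea more concrete, here is a rough, minimal sketch in modern Python of how
such a robot might work; the starting URL and the page limit are only placeholders, and a
real robot also respects robots.txt, paces its requests and parses pages far more
carefully.

# A minimal web-spider sketch: fetch a page, extract its links,
# and visit pages breadth-first until a limit is reached.
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(start_url, max_pages=10):
    seen, queue = set(), deque([start_url])
    while queue and len(seen) < max_pages:
        url = queue.popleft()
        if url in seen:
            continue
        seen.add(url)
        try:
            html = urlopen(url).read().decode("utf-8", errors="replace")
        except OSError:
            continue  # unreachable or dead link: skip it
        parser = LinkExtractor()
        parser.feed(html)
        for link in parser.links:
            queue.append(urljoin(url, link))  # resolve relative links
    return seen

# Example call (placeholder URL):
# print(crawl("http://www.example.com/"))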
The major problem with these search engines is the sheer number of results you receive.
To help you a bit, the results are ranked by "relevance". This relevance is the
result of a complex calculation based on the position and the frequency of the keywords
you have typed in, and sometimes on other criteria:
* link popularity (the number of pages that link to that page), for
Excite: http://www.excite.com/
Infoseek: http://www.infoseek.com/
WebCrawler: http://webcrawler.com
* the number of visits and number of votes from visitors for
WebOrama: http://www.weborama.fr
Naturally, in order to be ranked ahead of their "competitors", web designers take
these ranking calculations into account when building their homepages.
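To make such a relevance calculation concrete, here is a toy sketch in Python; the
weighting (term frequency plus a bonus for terms appearing near the top of the page) is
invented for illustration and is certainly not any real engine's formula.

# Toy relevance ranking: score a page by how often each query term
# appears and by how close to the top of the page it first appears.
def relevance(page_words, query_terms):
    words = [w.lower() for w in page_words]
    score = 0.0
    for term in query_terms:
        term = term.lower()
        count = words.count(term)          # frequency component
        if count:
            first = words.index(term)      # position component
            score += count + 1.0 / (1 + first)
    return score

pages = {
    "A": "search engines index the web using a robot or spider".split(),
    "B": "this page mentions a spider once near the end".split(),
}
query = ["spider", "robot"]
ranked = sorted(pages, key=lambda p: relevance(pages[p], query), reverse=True)
print(ranked)   # pages listed from most to least 'relevant'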
To use these search engines correctly, you have to know how they work, on the indexing
side as well as on the results side:
* depth of gathering: the robot either tries to gather the whole site, or only sample
pages (Infoseek, Lycos, WebCrawler)
* frames, dynamically created or password-protected sites, ... Few search engines
handle the information in such sites correctly! For example, it seems only AltaVista and
Northern Light are able to follow links in frames
* whether metatags are taken into account. Metatags are fields placed in the page header,
and hence invisible to the reader. Generally, three metatags are used: description, author
and keywords (a small extraction sketch follows these examples), e.g.
<meta name="description" content="Newsletter of the EAHIL - European
Association for Health Information and Libraries / Bulletin de l'AEIBS - Association
Europeenne des Bibliotheques de Sante">
<meta name="keywords" content="association, eahil, aeibs, europe,
medicine, health, libraries">
<meta name="author" content="Vincent Maes">
There are projects to use these metatags as a kind of electronic equivalent of the
ISBD, such as the Dublin Core Metadata, or PICS, the Platform for Internet Content
Selection, which can be used e.g. for quality filtering.
For more information about the metatags:
Dublin Core Metadata: http://purl.oclc.org/metadata/dublin_core/
The Web Developer's Virtual Library. META Tagging for Search Engines: http://www.stars.com/Search/Meta/Tag.html
Platform for Internet Content Selection (PICS): http://www.w3.org/PICS/
Metadata, PICS and Quality: http://www.ariadne.ac.uk/issue9/pics/
An article by Chris Armstrong, from the Centre for Information Quality Management
(CIQM), published in the excellent electronic journal Ariadne, on the potential of PICS
as a quality filter.
When searching, you ought to know the characteristics of the system, so as to make
full use of the features that let you refine your search:
* an advanced search mode is usually available
* some are case-sensitive (AltaVista, HotBot, Infoseek, Northern Light, ...)! A
lower-case term retrieves all case variations
* stop words: to avoid questions that are too general, the engines maintain a list of stop
words. Some are quite general: web, information, internet, ... If your keywords seem too
general, check which terms were actually searched (at the top or end of the results page)
* truncation (the "*"), e.g. search engine*
* Boolean operators OR and AND. These are mostly implicit: sites with all the terms are
ranked higher, followed by sites containing fewer terms; but you can generally mark a term
as required with a "+" before it, e.g. +search +engines
* exclusion of terms (the Boolean NOT), generally with a "-" before the term,
e.g. +search +engines -altavista
* expressions (also called exact phrases), usually within quotes, e.g. "angina
pectoris"; useful for searching expressions that contain stop words (a small sketch after
this list illustrates how such a query could be interpreted)
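A small sketch of how such a query could be interpreted, assuming only the simplified
conventions described above ("+" marks a required term, "-" an excluded one, and plain
terms merely raise the ranking); quoted phrases and truncation are left out for brevity.

# Interpret a simple query: "+term" is required, "-term" is excluded,
# plain terms are optional but raise the ranking (the implicit OR/AND).
def matches(page_words, query):
    words = {w.lower() for w in page_words}
    required = [t[1:] for t in query if t.startswith("+")]
    excluded = [t[1:] for t in query if t.startswith("-")]
    optional = [t for t in query if t[0] not in "+-"]
    if any(t not in words for t in required):
        return None                 # a required term is missing
    if any(t in words for t in excluded):
        return None                 # an excluded term is present
    return sum(t in words for t in required + optional)

query = ["+search", "+engines", "-altavista", "robot"]
page = "a list of search engines and the robot behind each".split()
print(matches(page, query))   # 3: both required terms plus 'robot'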
Each search engine has specific features. Don't miss them!
AltaVista: http://www.altavista.com/
Probably the most well-known search engine. A lot of features are available: natural
language questions, limiting to a specific language, to a "field" (title, metatag,
hyperlink, image text), to a site or to a date range, links to books (through Amazon),
refining your search using a kind of classification, a proximity operator (10 words),
"translation" of the sites found (for some languages), category search, ...
Northern Light: http://www.northernlight.com/
Two main features:
- results are organized into folders, i.e. groups of results based on subject, type of
information, source or language, that help you narrow your search step by step
- on subscription, you also have access to 4,500 special collection titles (journals,
reviews, books, magazines and news wires, most from 1995 onwards)
Plus field searching, internal truncation, and date and site limits.
HotBot: http://www.hotbot.com
Features: limiting to a language, a field, a domain, the presence of multimedia elements
(images, video, audio, Acrobat, ...), or a creation or modification date. Word stemming
(grammatical variations) is also possible.
Never forget that even if these engines give enormous amounts of information, they are
never complete:
* coverage: even the biggest (AltaVista, HotBot) index only 100 to 150 million pages,
about one third of the Web; see Lawrence S, Giles CL. September 1998 Search Engine Coverage Update. http://www.neci.nj.nec.com/homepages/lawrence/websize98.html
* currency: the Internet is a moving environment, so a lot of the sites in search engine
databases are no longer active (dead links), or their content has changed. Most engines
need several weeks to refresh information or to include new pages. Submitted pages are
included much faster than "automatic" inclusions.
* exclusion: some webmasters choose not to be indexed by search engines (a small sketch
of how this works follows this list)
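This exclusion normally relies on a robots.txt file placed at the root of the site (see
The Web Robots Pages listed further down). As a sketch, modern Python's standard
robotparser module can check such a file; the file contents and URLs below are
hypothetical.

# Check whether a well-behaved robot may index a page, using a
# hypothetical robots.txt that keeps spiders out of /private/.
from urllib.robotparser import RobotFileParser

robots_txt = """\
User-agent: *
Disallow: /private/
"""
rp = RobotFileParser()
rp.parse(robots_txt.splitlines())
print(rp.can_fetch("*", "http://www.example.com/private/report.html"))  # False
print(rp.can_fetch("*", "http://www.example.com/public/index.html"))    # True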
The only way to approach complete coverage of the Web is to combine several engines,
e.g. by using metaengines. This special type of programme translates your request into
the specific query language of each search engine selected for interrogation. Most of the
classical features of those engines are also available here (Boolean operators,
truncation, exclusion, ...), plus some more specific ones: (rough) deduplication, a
maximum number of sites per search engine, ...
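To illustrate the (rough) deduplication step, here is a small Python sketch that merges
the result lists returned from several engines and drops repeated URLs; the engine names
and result URLs are invented placeholders.

# Merge result lists from several engines, dropping duplicate URLs
# and keeping the order in which results first appear.
def merge(results_per_engine):
    seen, merged = set(), []
    for results in results_per_engine.values():
        for url in results:
            if url not in seen:
                seen.add(url)
                merged.append(url)
    return merged

results = {
    "engine_a": ["http://www.example.com/a", "http://www.example.com/b"],
    "engine_b": ["http://www.example.com/b", "http://www.example.com/c"],
}
print(merge(results))
# ['http://www.example.com/a', 'http://www.example.com/b', 'http://www.example.com/c']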
Metacrawler: http://www.metacrawler.com/
Apart from the classical search across several engines, with the classical features
listed before, you can also customize your searches: which engines are queried, the
origin of the data, and the maximum time to wait for results.
SavvySearch: http://www.savvysearch.com/
Here you select the type of information to search for, or the engines you want to use.
AskJeeves: http://www.askjeeves.com
Lets you enter a question in natural language. As in real life with some people
;-) it answers with other questions when your question is a little complicated. It
gives you at most 10 links drawn from several search engines (Excite, AltaVista,
WebCrawler, Lycos, Yahoo, ...).
Metaengines are different from a Configurable Unified Search
Engine (CUSI), which gives you a common interface to a lot of search engines but
lets you search only one at a time, e.g.
Virtual Newest Search Engines:
http://www.dreamscape.com/frankvad/search.newest.html
over 1000 search engines classified in 50 categories
You can find a big list of both at Yahoo!:
http://dir.yahoo.com/Computers_and_Internet/Internet/World_Wide_Web/
Searching_the_Web/All_in_One_Search_Pages/
Directories
In these search tools, sites are organized by humans into a classification. To locate
relevant sites, you follow the classification tree, or use a complementary tool that
lets you search the whole directory or only a given category. Less complete than the
general search engines, these tools add value through their selection and through the
comments the indexers add about the sites. Sites are found using a robot, or by
submission. In the latter case, you have to choose the place(s) in the classification
tree where the site should be listed.
* Yahoo: http://www.yahoo.com
* Lycos: http://www.lycos.com
* Einet Galaxy: http://galaxy.einet.net/galaxy.html
To refine further, some have also rated the sites, and you can limit your search to the
"best" ones or the most visited ones:
* Lycos Top 5%: http://point.lycos.com/categories/
The best 5% in each category
* Magellan: http://www.mckinley.com/
Offers access to reviewed and rated sites
* 100hot.com: http://www.100hot.com/
The 100 most frequently visited sites, by category
The philosophy of the following resources is different. They provide a classification
that gives central access to value-added topical guides written by independent
individuals and matching their quality criteria. For these two directories, a category is
available for health and medicine.
The World-Wide Web Virtual Library: http://www.vlib.org/
Run by volunteers, it's the oldest catalog of the web, started by Tim Berners-Lee, the
creator of the web itself.
The Argus Clearinghouse: http://www.clearinghouse.net/
Maintained by librarians. To be included, a resource guide must meet strict criteria
for its level of description and evaluation of the sites, and for its organization and
design. Guides are also rated.
Top of the Web: http://www.december.com/web/top.html
Selected commented resources in several categories (by J. December)
Some already use a library-type classification:
CyberStacks: http://www.public.iastate.edu/~CYBERSTACKS/homepage.html
Significant Internet resources with summaries categorized using the Library of Congress
classification scheme.
More information:
Vizine-Goetz D. Using Library Classification Schemes for Internet Resources. OCLC
Internet Cataloging Project Colloquium. Position Paper http://www.oclc.org/oclc/man/colloq/v-g.htm
Specialized
You already know subject-specific search tools such as OMNI, Medical Matrix, ... that
cover only biomedical material, but others focus on other specializations: the type,
format, origin, ... of the information.
Here are some examples:
* Yahoo! People Search: http://people.yahoo.com/
Previously known as Four11, this service is intended to help you find people's e-mail
addresses. Data is gathered from mailing lists or by registration.
* DejaNews: http://www.dejanews.com
Even if some general search engines include Usenet newsgroup postings (AltaVista, HotBot,
...), this is the first and most well-known Usenet postings finder. Don't forget that some
newsgroups are only replications of discussion lists, such as medlib-l, which becomes
bit.listserv.medlib-l. The PowerSearch feature lets you limit your search by newsgroup,
author, date and language. Moreover, results can also be sorted by those fields.
* FileZ: http://www.filez.com
For searching for specific files
* EuroFerret: http://www.euroferret.com/
This is a typical example of an engine that covers only certain languages or
geographical areas. You can find a lot of examples for France and French-speaking
resources, such as Nomade (http://www.nomade.fr),
Ecila (http://www.ecila.fr),
WebOrama (which we already mentioned), or Annuaire E/R (http://www.urec.cnrs.fr/annuaire/),
where you can even limit your search by region or by town. For a list of search engines
by country, look at:
Search Engines Worldwide: http://www.twics.com/~takakuwa/search/search.html
A lot more on search engines and search techniques:
* Search Engine Watch: http://searchenginewatch.com
THE reference site for search engines. A free monthly newsletter provides updates on
search engine facts. A subscription gives access to more in-depth information on specific
search engines.
* Web Site Search Tools: http://www.searchtools.com/
Information, Guides and News
* Web Search (from The Mining Company): http://websearch.miningco.com/
Guides, search engine features, search by subject
* Sherlock: http://www.intermediacy.com/sherlock/index.phtml
"Centre for effective Internet search" + Bi-weekly short article, and a
collection of tips for searching. You may also post questions about techniques, ...
* l'URFIST de Strasbourg propose...:
http://www-scd-ulp.u-strasbg.fr/urfist/cours-internet.htm
An excellent resource offering guides, articles, ... to enhance Internet search
performance
* The Web Robots Pages: http://info.webcrawler.com/mak/projects/robots/robots.html
Papers and articles, with a FAQ, a detailed list of robots, a mailing list, ...
specific to automatic web page indexers.
* Motrech, a French-speaking mailing list on search engines (problems,
techniques, comparisons)
To subscribe, send an empty message to: motrech-subscribe@makelist.com
Hope it helps ;-)
Bye, bye! Vincent Maes