Spiders and Worms on the Web - Vincent Maes
As some of you already know, by the time you read this page I will have left Pfizer
Belgium and the health information field to take on other responsibilities. As a
consequence, I will no longer be able to fulfil any function within EAHIL, including
that of Internet Page Editor.
Thus, this is the last Internet page for me... but not for you!
I think a lot of things remain to be done and said, so the work has to continue.
My functions were:
- member of the Editorial Committee
- a quarterly Internet article (the Internet Page)
- the maintenance of the Newsletter Homepage
If you are interested in all or part of these functions, please contact the
Newsletter Editor, Luisa Vercellesi, at Luisa.Vercellesi@basiglio.zeneca.com
I hope this page was useful. Yours, Vincent Maes
About Spiders and Worms
We have seen in several past issues that there is some very interesting, quality
information available on the Internet. We will now consider the means of accessing that
information. We have already talked about specialized medical information resources (OMNI,
Medical Matrix, ...); this time we will take a broader look at general search engines.
There are a lot of search engines: some estimates put the number at more than 1,000. We
can broadly classify them into a few general types:
General
Perhaps the most well-known type of search engine. These consist of a giant database,
filled by a robot (also called a spider, worm or Web wanderer) that automatically screens
the WWW. When it finds sites not already included, it indexes them according to a certain
policy. Sites can also be included by submission, but even if you supply some keywords,
most of the indexing will still be performed by the robot.
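To make the idea more concrete, here is a rough, minimal sketch in modern Python of how
such a robot might work; the starting URL and the page limit are only placeholders, and a
real robot also respects robots.txt, paces its requests and parses pages far more
carefully.

# A minimal web-spider sketch: fetch a page, extract its links,
# and visit pages breadth-first until a limit is reached.
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(start_url, max_pages=10):
    seen, queue = set(), deque([start_url])
    while queue and len(seen) < max_pages:
        url = queue.popleft()
        if url in seen:
            continue
        seen.add(url)
        try:
            html = urlopen(url).read().decode("utf-8", errors="replace")
        except OSError:
            continue  # unreachable or dead link: skip it
        parser = LinkExtractor()
        parser.feed(html)
        for link in parser.links:
            queue.append(urljoin(url, link))  # resolve relative links
    return seen

# Example call (placeholder URL):
# print(crawl("http://www.example.com/"))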
The major problem with these search engines is the sheer number of results you receive.
To help you a bit, the results are ranked by "relevance". This relevance is the
result of a complex calculation based on the position and the frequency of the keywords
you have typed in, and sometimes on other criteria:
* link popularity (the number of pages that link to that page), for
Excite: http://www.excite.com/
Infoseek: http://www.infoseek.com/
WebCrawler: http://webcrawler.com
* the number of visits and number of votes from visitors for
WebOrama: http://www.weborama.fr
Naturally, in order to be ranked ahead of their "competitors", web designers take
these ranking calculations into account when building their homepages.
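To make such a relevance calculation concrete, here is a toy sketch in Python; the
weighting (term frequency plus a bonus for terms appearing near the top of the page) is
invented for illustration and is certainly not any real engine's formula.

# Toy relevance ranking: score a page by how often each query term
# appears and by how close to the top of the page it first appears.
def relevance(page_words, query_terms):
    words = [w.lower() for w in page_words]
    score = 0.0
    for term in query_terms:
        term = term.lower()
        count = words.count(term)          # frequency component
        if count:
            first = words.index(term)      # position component
            score += count + 1.0 / (1 + first)
    return score

pages = {
    "A": "search engines index the web using a robot or spider".split(),
    "B": "this page mentions a spider once near the end".split(),
}
query = ["spider", "robot"]
ranked = sorted(pages, key=lambda p: relevance(pages[p], query), reverse=True)
print(ranked)   # pages listed from most to least 'relevant'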
To use these search engines correctly, you have to know how they work, on the indexing
side as well as on the results side:
* depth of gathering: the robot either tries to gather the whole site, or only sample
pages (Infoseek, Lycos, WebCrawler)
* frames, dynamically created or password-protected sites, ... Few search engines
handle the information in such sites correctly! For example, it seems only AltaVista and
Northern Light are able to follow links in frames
* whether metatags are taken into account. Metatags are fields placed in the page header,
and hence invisible to the reader. Generally, three metatags are used: description, author
and keywords (a small extraction sketch follows these examples), e.g.
<meta name="description" content="Newsletter of the EAHIL - European
Association for Health Information and Libraries / Bulletin de l'AEIBS - Association
Europeenne des Bibliotheques de Sante">
<meta name="keywords" content="association, eahil, aeibs, europe,
medicine, health, libraries">
<meta name="author" content="Vincent Maes">
There are projects to use these metatags as a kind of electronic equivalent of the
ISBD, such as the Dublin Core Metadata, or PICS, the Platform for Internet Content
Selection, which can be used e.g. for quality filtering.
For more information about the metatags:
Dublin Core Metadata: http://purl.oclc.org/metadata/dublin_core/
The Web Developer's Virtual Library. META Tagging for Search Engines: http://www.stars.com/Search/Meta/Tag.html
Platform for Internet Content Selection (PICS): http://www.w3.org/PICS/
Metadata, PICS and Quality: http://www.ariadne.ac.uk/issue9/pics/
An article by Chris Armstrong, from the Centre for Information Quality Management
(CIQM), published in the excellent electronic journal Ariadne, on the potential of PICS
as a quality filter.
When searching, you ought to know the characteristics of the system, so as to make
full use of the features that let you refine your search:
* an advanced search mode is usually available
* some are case-sensitive (AltaVista, HotBot, Infoseek, Northern Light, ...)! A
lower-case term retrieves all case variations
* stop words: to avoid questions that are too general, the engines maintain a list of stop
words. Some are quite general: web, information, internet, ... If your keywords seem too
general, check which terms were actually searched (at the top or end of the results page)
* truncation (the "*"), e.g. search engine*
* Boolean operators OR and AND. These are mostly implicit: sites with all the terms are
ranked higher, followed by sites containing fewer terms; but you can generally mark a term
as required with a "+" before it, e.g. +search +engines
* exclusion of terms (the Boolean NOT), generally with a "-" before the term,
e.g. +search +engines -altavista
* expressions (also called exact phrases), usually within quotes, e.g. "angina
pectoris"; useful for searching expressions that contain stop words (a small sketch after
this list illustrates how such a query could be interpreted)
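A small sketch of how such a query could be interpreted, assuming only the simplified
conventions described above ("+" marks a required term, "-" an excluded one, and plain
terms merely raise the ranking); quoted phrases and truncation are left out for brevity.

# Interpret a simple query: "+term" is required, "-term" is excluded,
# plain terms are optional but raise the ranking (the implicit OR/AND).
def matches(page_words, query):
    words = {w.lower() for w in page_words}
    required = [t[1:] for t in query if t.startswith("+")]
    excluded = [t[1:] for t in query if t.startswith("-")]
    optional = [t for t in query if t[0] not in "+-"]
    if any(t not in words for t in required):
        return None                 # a required term is missing
    if any(t in words for t in excluded):
        return None                 # an excluded term is present
    return sum(t in words for t in required + optional)

query = ["+search", "+engines", "-altavista", "robot"]
page = "a list of search engines and the robot behind each".split()
print(matches(page, query))   # 3: both required terms plus 'robot'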
Each search engine has specific features. Don't miss them!
AltaVista: http://www.altavista.com/
Probably the most well-known search engine. A lot of features are available: natural
language questions, limiting to a specific language, to a "field" (title, metatag,
hyperlink, image text), to a site or to a date range, links to books (through Amazon),
refining your search using a kind of classification, a proximity operator (10 words),
"translation" of the sites found (for some languages), category search, ...
Northern Light: http://www.northernlight.com/
Two main features:
- results are organized into folders, i.e. groups of results based on subject, type of
information, source or language, that help you narrow your search step by step
- on subscription, you also have access to 4,500 special collection titles (journals,
reviews, books, magazines and news wires, most from 1995 onwards)
Plus field searching, internal truncation, and date and site limits.
HotBot: http://www.hotbot.com
Features: limiting to a language, a field, a domain, the presence of multimedia elements
(images, video, audio, Acrobat, ...), or a creation or modification date. Word stemming
(grammatical variations) is also possible.
Never forget that even if these engines give enormous amounts of information, they are
never complete:
* coverage: even the biggest (AltaVista, HotBot) index only 100 to 150 million pages,
about one third of the Web; see Lawrence S, Giles CL. September 1998 Search Engine Coverage Update. http://www.neci.nj.nec.com/homepages/lawrence/websize98.html
* currency: the Internet is a moving environment, so a lot of the sites in search engine
databases are no longer active (dead links), or their content has changed. Most engines
need several weeks to refresh information or to include new pages. Submitted pages are
included much faster than "automatic" inclusions.
* exclusion: some webmasters choose not to be indexed by search engines (a small sketch
of how this works follows this list)
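This exclusion normally relies on a robots.txt file placed at the root of the site (see
The Web Robots Pages listed further down). As a sketch, modern Python's standard
robotparser module can check such a file; the file contents and URLs below are
hypothetical.

# Check whether a well-behaved robot may index a page, using a
# hypothetical robots.txt that keeps spiders out of /private/.
from urllib.robotparser import RobotFileParser

robots_txt = """\
User-agent: *
Disallow: /private/
"""
rp = RobotFileParser()
rp.parse(robots_txt.splitlines())
print(rp.can_fetch("*", "http://www.example.com/private/report.html"))  # False
print(rp.can_fetch("*", "http://www.example.com/public/index.html"))    # True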
The only way to approach complete coverage of the Web is to combine several engines,
e.g. by using metaengines. This special type of programme translates your request into
the specific query language of each search engine selected for interrogation. Most of the
classical features of those engines are also available here (Boolean operators,
truncation, exclusion, ...), plus some more specific ones: (rough) deduplication, a
maximum number of sites per search engine, ...
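To illustrate the (rough) deduplication step, here is a small Python sketch that merges
the result lists returned from several engines and drops repeated URLs; the engine names
and result URLs are invented placeholders.

# Merge result lists from several engines, dropping duplicate URLs
# and keeping the order in which results first appear.
def merge(results_per_engine):
    seen, merged = set(), []
    for results in results_per_engine.values():
        for url in results:
            if url not in seen:
                seen.add(url)
                merged.append(url)
    return merged

results = {
    "engine_a": ["http://www.example.com/a", "http://www.example.com/b"],
    "engine_b": ["http://www.example.com/b", "http://www.example.com/c"],
}
print(merge(results))
# ['http://www.example.com/a', 'http://www.example.com/b', 'http://www.example.com/c']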
Metacrawler: http://www.metacrawler.com/
Apart from the classical search across several engines, with the classical features
listed before, you can also customize your searches: which engines are queried, the
origin of the data, and the maximum time to wait for results.
SavvySearch: http://www.savvysearch.com/
Here you select the type of information to search for, or the engines you want to use.
AskJeeves: http://www.askjeeves.com
Lets you enter a question in natural language. As in real life with some people
;-) it answers with other questions when your question is a little complicated. It
gives you at most 10 links drawn from several search engines (Excite, AltaVista,
WebCrawler, Lycos, Yahoo, ...).
Metaengines are different from a Configurable Unified Search
Engine (CUSI), which gives you a common interface to a lot of search engines but
lets you search only one at a time, e.g.
Virtual Newest Search Engines:
http://www.dreamscape.com/frankvad/search.newest.html
over 1000 search engines classified in 50 categories
You can find a big list of both at Yahoo!:
http://dir.yahoo.com/Computers_and_Internet/Internet/World_Wide_Web/
Searching_the_Web/All_in_One_Search_Pages/
Directories
In these search tools, sites are organized by humans into a classification. To locate
relevant sites, you follow the classification tree, or use a complementary tool that
lets you search the whole directory or only a given category. Less complete than the
general search engines, these tools add value through their selection and through the
comments the indexers add about the sites. Sites are found using a robot, or by
submission. In the latter case, you have to choose the place(s) in the classification
tree where the site should be listed.
* Yahoo: http://www.yahoo.com
* Lycos: http://www.lycos.com
* Einet Galaxy: http://galaxy.einet.net/galaxy.html
To refine further, some have also rated the sites, and you can limit your search to the
"best" ones or the most visited ones:
* Lycos Top 5%: http://point.lycos.com/categories/
The best 5% in each category
* Magellan: http://www.mckinley.com/
Offers access to reviewed and rated sites
* 100hot.com: http://www.100hot.com/
The 100 most frequently visited sites, by category
The philosophy of the following resources is different. They provide a classification
that gives central access to value-added topical guides written by independent
individuals and matching their quality criteria. For these two directories, a category is
available for health and medicine.
The World-Wide Web Virtual Library: http://www.vlib.org/
Run by volunteers, it's the oldest catalog of the web, started by Tim Berners-Lee, the
creator of the web itself.
The Argus Clearinghouse: http://www.clearinghouse.net/
Maintained by librarians. To be included, a resource guide must meet strict criteria
for its level of description and evaluation of the sites, and for its organization and
design. Guides are also rated.
Top of the Web: http://www.december.com/web/top.html
Selected commented resources in several categories (by J. December)
Some already use a library-type classification:
CyberStacks: http://www.public.iastate.edu/~CYBERSTACKS/homepage.html
Significant Internet resources with summaries categorized using the Library of Congress
classification scheme.
More information:
Vizine-Goetz D. Using Library Classification Schemes for Internet Resources. OCLC
Internet Cataloging Project Colloquium. Position Paper http://www.oclc.org/oclc/man/colloq/v-g.htm
Specialized
You already know subject-specific search tools such as OMNI, Medical Matrix, ... that
cover only biomedical material, but others focus on other specializations: the type,
format, origin, ... of the information.
Here are some examples:
* Yahoo! People Search: http://people.yahoo.com/
Previously known as Four11, this service is intended to help you find people's e-mail
addresses. Data is gathered from mailing lists or by registration.
* DejaNews: http://www.dejanews.com
Even if some general search engines include Usenet newsgroup postings (AltaVista, HotBot,
...), this is the first and most well-known Usenet postings finder. Don't forget that some
newsgroups are only replications of discussion lists, such as medlib-l, which becomes
bit.listserv.medlib-l. The PowerSearch feature lets you limit your search by newsgroup,
author, date and language. Moreover, results can also be sorted by those fields.
* FileZ: http://www.filez.com
For searching for specific files
* EuroFerret: http://www.euroferret.com/
This is a typical example of an engine that covers only certain languages or
geographical areas. You can find a lot of examples for France and French-speaking
resources, such as Nomade (http://www.nomade.fr),
Ecila (http://www.ecila.fr),
WebOrama (which we already mentioned), or Annuaire E/R (http://www.urec.cnrs.fr/annuaire/),
where you can even limit your search by region or by town. For a list of search engines
by country, look at:
Search Engines Worldwide: http://www.twics.com/~takakuwa/search/search.html
A lot more on search engines and search techniques:
* Search Engine Watch: http://searchenginewatch.com
THE reference site for search engines. A free monthly newsletter provides updates on
search engine facts. A subscription gives access to more in-depth information on specific
search engines.
* Web Site Search Tools: http://www.searchtools.com/
Information, Guides and News
* Web Search (from The Mining Company): http://websearch.miningco.com/
Guides, search engine features, search by subject
* Sherlock: http://www.intermediacy.com/sherlock/index.phtml
"Centre for effective Internet search" + Bi-weekly short article, and a
collection of tips for searching. You may also post questions about techniques, ...
* l'URFIST de Strasbourg propose...:
http://www-scd-ulp.u-strasbg.fr/urfist/cours-internet.htm
An excellent resource offering guides, articles, ... to enhance Internet search
performance
* The Web Robots Pages: http://info.webcrawler.com/mak/projects/robots/robots.html
Papers and articles, with a FAQ, a detailed list of robots, a mailing list, ...
specific to automatic web page indexers.
* Motrech, a French-speaking mailing list on search engines (problems,
techniques, comparisons)
To subscribe, send an empty message to: motrech-subscribe@makelist.com
Hope it helps ;-)
Bye, bye! Vincent Maes