Spiders and Worms on the Web - Vincent Maes
          
        As some of you already know, by the time you read this page I will have left Pfizer
        Belgium and the health information field to take on other responsibilities. As a
        consequence, I will no longer be able to fulfil any functions within EAHIL, including
        that of Internet Page Editor.
        Thus, this is the last Internet page for me... but not for you!
        I think a lot remains to be done and said, so the work has to continue.
        My functions were:
        - serving as a member of the Editorial Committee
        - writing a quarterly Internet article (the Internet Page)
        - maintaining the Newsletter Homepage
        If you are interested in taking on all or part of these functions, please contact the
        Newsletter Editor, Luisa Vercellesi, at Luisa.Vercellesi@basiglio.zeneca.com

        I hope this page was useful. Yours, Vincent Maes
         
         
        About Spiders and Worms 
        We have seen in several past issues that some very interesting, high-quality
        information is available on the Internet. We will now consider the means of accessing
        that information. We have already talked about specialized medical information resources
        (OMNI, Medical Matrix, ...); we will now take a broader look at general search engines.
        There are a lot of search engines: some sources speak of more than 1000. We can
        broadly classify them into a few general types:
         
         
         
         
        General  
        Perhaps the best-known search engines, these consist of a giant database filled by a
        robot (also called a spider, worm or Web wanderer) that automatically scans the WWW. When
        it finds sites not already included, it indexes them according to a certain policy. Sites
        can also be included by submission; but even if you supply some keywords, most of the
        indexing will still be performed by the robot.
        The major problem with these search engines is the number of results you receive. To
        help you a bit, the results are ranked by "relevance". This relevance is the
        result of a complex calculation based on the position and the frequency of the keywords
        you have typed in, and sometimes on other criteria:
        * link popularity (the number of pages that link to a given page) for
          Excite: http://www.excite.com/
          Infoseek: http://www.infoseek.com/
          WebCrawler: http://webcrawler.com
        * the number of visits and the number of votes from visitors for
          WebOrama: http://www.weborama.fr
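To make the idea of "relevance" more concrete, here is a minimal sketch in Python. It is not the formula of any real engine (those are not published in this article); it simply assumes that a page scores higher when a keyword is frequent and when it first appears early in the text. The function names, the weighting and the example.org pages are all hypothetical.

```python
import re

def relevance(text: str, keywords: list[str]) -> float:
    """Toy relevance score: keyword frequency plus a bonus for an early first position."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    if not words:
        return 0.0
    score = 0.0
    for keyword in keywords:
        keyword = keyword.lower()
        positions = [i for i, word in enumerate(words) if word == keyword]
        if not positions:
            continue  # a missing keyword contributes nothing
        frequency = len(positions) / len(words)      # share of the page made up by the keyword
        earliness = 1.0 - positions[0] / len(words)  # 1.0 when the keyword is the very first word
        score += frequency + earliness
    return score

def rank(pages: dict[str, str], keywords: list[str]) -> list[str]:
    """Return the page URLs sorted from most to least relevant."""
    return sorted(pages, key=lambda url: relevance(pages[url], keywords), reverse=True)

# Hypothetical pages, for illustration only.
pages = {
    "http://example.org/a": "Angina treatment guide: angina symptoms and angina care",
    "http://example.org/b": "Hospital news. A short note on angina",
    "http://example.org/c": "Library opening hours",
}
ranking = rank(pages, ["angina"])
print(ranking)
```

A page that repeats the keyword and mentions it at the very start comes first, which is exactly why web designers pay attention to where and how often their keywords appear.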
         
         
         
        To be ranked ahead of their "competitors", web designers ideally take these
        ranking calculations into account when they build their homepages.
        To use these search engines correctly, you have to know how they work, on the indexing
        side as well as on the results side:
        * depth of gathering: the robot either tries to gather the whole site, or only samples
          some pages (Infoseek, Lycos, WebCrawler)
        * frames, dynamically created or password-protected sites, ...: few search engines
          handle the information in these correctly! For example, it seems that only AltaVista
          and Northern Light are able to follow links in frames
        * metatags: some engines take them into account. Metatags are fields placed in the
          header of the page, and hence invisible to the reader. Generally, three metatags are
          used: description, author and keywords, e.g.
        <meta name="description" content="Newsletter of the EAHIL - European
        Association for Health Information and Libraries / Bulletin de l'AEIBS - Association
        Europeenne des Bibliotheques de Sante">  
        <meta name="keywords" content="association, eahil, aeibs, europe,
        medicine, health, libraries">  
        <meta name="author" content="Vincent Maes">  
        There are some projects to use these metatags as a kind of electronic equivalent of
        ISBD, such as the Dublin Core Metadata, or PICS, the Platform for Internet Content
        Selection, which could be used for quality filtering, for example.
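As an illustration of how a robot might read the description, keywords and author metatags shown above, here is a short sketch using Python's standard html.parser module. The class name MetaTagParser is hypothetical, the header is a shortened copy of the example above, and real engines certainly apply their own rules on top of this.

```python
from html.parser import HTMLParser

class MetaTagParser(HTMLParser):
    """Collect <meta name="..." content="..."> pairs from a page header."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = attributes.get("name")
        content = attributes.get("content")
        if name and content is not None:
            # Collapse line breaks and repeated spaces inside the content value.
            self.meta[name.lower()] = " ".join(content.split())

# Shortened version of the header shown in the article.
html_header = """
<head>
<meta name="description" content="Newsletter of the EAHIL - European
Association for Health Information and Libraries">
<meta name="keywords" content="association, eahil, aeibs, europe,
medicine, health, libraries">
<meta name="author" content="Vincent Maes">
</head>
"""

parser = MetaTagParser()
parser.feed(html_header)
keywords = [keyword.strip() for keyword in parser.meta["keywords"].split(",")]
print(parser.meta["author"])
print(keywords)
```

The keywords end up as a clean list that an indexer could store next to the words found in the visible text of the page.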
        For more information about the metatags:  
        Dublin Core Metadata: http://purl.oclc.org/metadata/dublin_core/
         
        The Web Developer's Virtual Library. META Tagging for Search Engines: http://www.stars.com/Search/Meta/Tag.html
         
        Platform for Internet Content Selection (PICS): http://www.w3.org/PICS/  
        Metadata, PICS and Quality: http://www.ariadne.ac.uk/issue9/pics/  
        An article by Chris Armstrong, from the Centre for Information Quality Management
        (CIQM), published in the excellent electronic journal Ariadne, on the potential of PICS
        as a quality filter.
         
         
        When searching, you need to know the characteristics of the system, so as to make
        full use of the features that let you refine your search:
        * an advanced search mode is usually available
        * case: some engines are case-sensitive (AltaVista, HotBot, Infoseek, Northern Light,
          ...)! A term typed in lower case matches all case variations
        * stop words: to avoid overly general queries, engines maintain a list of stop words
          that they ignore. Some of these are quite general: web, information, internet, ...
          If your keywords seem too general, check which terms were actually searched (shown
          at the top or bottom of the results page)
        * truncation (with "*"), e.g. search engine*
        * boolean operators OR and AND. These are mostly implicit: sites containing all the
          terms are ranked higher, followed by sites containing fewer of the terms; but you
          can generally mark a term as required with a "+" before it, e.g. +search +engines
        * exclusion of terms (the boolean NOT), generally with a "-" before the term,
          e.g. +search +engines -altavista
        * expressions (also called exact phrases), usually within quotes, e.g. "angina
          pectoris". They are also the way to search for expressions containing stop words.
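The "+", "-", quote and "*" conventions described above can be sketched in a few lines of Python. This is only an illustration of the general idea: the function names are hypothetical, every engine has its own exact rules, and for simplicity the sketch ignores case and stop words entirely.

```python
import re

def parse_query(query: str) -> dict:
    """Split a query into required, excluded and optional terms, plus exact phrases."""
    parsed = {"required": [], "excluded": [], "optional": [], "phrases": []}
    # A token is an optional +/- sign followed by a quoted phrase or a bare word.
    for sign, phrase, word in re.findall(r'([+-]?)(?:"([^"]+)"|(\S+))', query):
        if phrase:
            parsed["phrases"].append(phrase.lower())
        elif sign == "+":
            parsed["required"].append(word.lower())
        elif sign == "-":
            parsed["excluded"].append(word.lower())
        else:
            parsed["optional"].append(word.lower())
    return parsed

def term_matches(term: str, words: list[str]) -> bool:
    """A trailing '*' means truncation: any word starting with the stem matches."""
    if term.endswith("*"):
        return any(word.startswith(term[:-1]) for word in words)
    return term in words

def matches(document: str, query: str) -> bool:
    """Decide whether a document satisfies the query."""
    text = " ".join(document.lower().split())
    words = re.findall(r"[a-z0-9]+", text)
    parsed = parse_query(query)
    if any(term_matches(term, words) for term in parsed["excluded"]):
        return False  # boolean NOT
    if not all(term_matches(term, words) for term in parsed["required"]):
        return False  # explicit AND
    if not all(phrase in text for phrase in parsed["phrases"]):
        return False  # exact phrase (simple substring test)
    if parsed["optional"] and not (parsed["required"] or parsed["phrases"]):
        # Only optional terms were given: implicit OR, at least one must be present.
        return any(term_matches(term, words) for term in parsed["optional"])
    return True

print(parse_query("+search +engines -altavista"))
print(matches("A guide to search engines such as HotBot", "+search +engines -altavista"))
print(matches("Treatment of angina pectoris", '"angina pectoris"'))
```

In a real engine the optional terms would also feed the relevance ranking rather than act as a simple yes/no filter.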
         
         
        Each search engine has specific features. Don't miss them!  
        AltaVista: http://www.altavista.com/
         
        Probably the best-known search engine. A lot of features are available: natural
        language questions; limits to a specific language, to a "field" (title, metatag,
        hyperlink, image text), to a site or to a date range; links to books (through Amazon);
        refining your search using a kind of classification; a proximity operator (10 words);
        "translation" of the sites found (for some languages); category search; ...
        Northern Light: http://www.northernlight.com/  
        2 main features:  
        - results are organized into folders, which are groups of results based on subject,
        type of information, source, language that help you narrow your search step by step  
        - on subscription, you also have access to 4,500 special collection titles (journals,
        reviews, books, magazines and news wire, most 1995-)  
        + field searching, internal truncation, date and site limit  
        HotBot: http://www.hotbot.com
         
        Features: limit to a language, to a field, a domain, presence of multimedia features
        (images, video, audio, acrobat, ...), creation or modification date. Possibility of word
        stemming (grammatical variations)  
         
         
        Never forget that even if these engines return enormous amounts of information, they
        are never complete:
        * coverage: even the biggest (AltaVista, HotBot) only index 100 to 150 million pages,
          about 1/3 of the Web; see Lawrence S, Giles CL. September 1998 Search Engine
          Coverage Update. http://www.neci.nj.nec.com/homepages/lawrence/websize98.html
        * currency: the Internet is a moving environment, so a lot of the sites in search
          engine databases are no longer active (dead links), or their content has changed.
          Most engines need several weeks to refresh their information or to include new
          sites. Submitted pages are included a lot faster than pages found "automatically"
          by the robot.
        * exclusion: webmasters can choose not to have their sites indexed by search engines
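One common way for a webmaster to keep robots out is a robots.txt file placed at the root of the site, following the robots exclusion convention. The sketch below uses Python's standard urllib.robotparser module to show how a well-behaved robot would interpret such a file. The file content, the robot names and the example.org addresses are invented for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: every robot is kept out of /private/,
# and the robot called "ExampleSpider" is kept out of the whole site.
robots_txt = """\
User-agent: *
Disallow: /private/

User-agent: ExampleSpider
Disallow: /
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

allowed_public = parser.can_fetch("AnyRobot", "http://www.example.org/index.html")
allowed_private = parser.can_fetch("AnyRobot", "http://www.example.org/private/report.html")
allowed_spider = parser.can_fetch("ExampleSpider", "http://www.example.org/index.html")

print(allowed_public, allowed_private, allowed_spider)
```

Note that this convention relies on the goodwill of the robot: it tells robots what not to index, but it does not protect the pages themselves.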
         
         
        The only way to get closer to covering the whole Web is to combine several engines,
        e.g. by using metaengines. This special type of programme translates your request into
        the specific query language of each search engine you select. Most of the classical
        features of those engines are also available here (boolean operators, truncation,
        exclusion, ...), plus some more specific ones: (rough) deduplication, a maximum number
        of sites per search engine, ...
        Metacrawler: http://www.metacrawler.com/  
        Apart from the classical search in search engines with classical features we listed
        before, you can also customize your use: engines searched, origin of the data, maximum
        time for results.  
        SavvySearch: http://www.savvysearch.com/  
        Here you select the type of information searched or engine you want to use.  
        AskJeeves: http://www.askjeeves.com
         
        Permits you to enter a question in natural language. Like in real life with some people
        ;-) it answers you with other questions when your question is a little complicated. It
        gives you max. 10 links to several search engines (Excite, AltaVista, WebCrawler, Lycos,
        Yahoo, ...)  
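The "(rough) deduplication" and the "maximum number of sites per search engine" offered by metaengines can be sketched as follows. This is a simplified illustration, not the method of Metacrawler or any other service: the engine names and result lists are invented, and the rule used to decide that two addresses are the same page (ignore "www." and a trailing slash) is an assumption.

```python
from urllib.parse import urlsplit

def normalize(url: str) -> str:
    """Rough key for a page: lower-case host without 'www.', path without trailing slash."""
    parts = urlsplit(url)
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    return host + parts.path.rstrip("/")

def merge_results(results_by_engine: dict[str, list[str]], max_per_engine: int = 10) -> list[dict]:
    """Merge the result lists of several engines, removing rough duplicates."""
    merged: dict[str, dict] = {}
    for engine, urls in results_by_engine.items():
        for url in urls[:max_per_engine]:  # keep only the first results of each engine
            entry = merged.setdefault(normalize(url), {"url": url, "engines": []})
            entry["engines"].append(engine)
    # Pages reported by more engines come first; ties keep their original order.
    return sorted(merged.values(), key=lambda entry: len(entry["engines"]), reverse=True)

# Invented result lists from two hypothetical engines.
results = {
    "EngineA": [
        "http://www.example.org/",
        "http://example.org/search-tips.html",
        "http://example.org/extra.html",
    ],
    "EngineB": ["http://example.org", "http://example.org/robots.html"],
}
merged = merge_results(results, max_per_engine=2)
for entry in merged:
    print(entry["url"], entry["engines"])
```

The deduplication is "rough" precisely because two different addresses can still point to the same page in ways such a simple rule cannot detect.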
         
         
        Metaengines are different from a Configurable Unified Search
        Engine (CUSI), which gives you a common interface to a lot of search engines, but
        lets you search only one at a time, e.g.
        Virtual Newest Search Engines:  
        http://www.dreamscape.com/frankvad/search.newest.html
         
        over 1000 search engines classified in 50 categories  
         
         
        You can find a long list of both at Yahoo!:
        http://dir.yahoo.com/Computers_and_Internet/Internet/World_Wide_Web/Searching_the_Web/All_in_One_Search_Pages/
         
         
        Directories  
        In these search tools, sites are organised by humans into a classification. To locate
        relevant sites, you follow the classification tree, or use a complementary tool that
        lets you search the whole directory or only a given category. Less complete than the
        general search engines, directories add value through selection and through the
        comments the indexers add about the sites. Sites are found using a robot, or by
        submission. In the latter case, you have to choose the place(s) in the classification
        tree where the site should be listed.
        * Yahoo: http://www.yahoo.com
         
        * Lycos: http://www.lycos.com
         
        * Einet Galaxy: http://galaxy.einet.net/galaxy.html  
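The difference between browsing the classification tree and searching within it can be shown with a small sketch. The directory below is entirely hypothetical (the category names are invented and only a few site names from this article are used as sample entries); it is stored as nested Python dictionaries, and the search function can be limited to one branch of the tree.

```python
# Hypothetical directory: categories are dictionaries, leaf categories hold lists of sites.
directory = {
    "Health": {
        "Medicine": ["OMNI", "Medical Matrix"],
        "Libraries": ["EAHIL Newsletter"],
    },
    "Computers and Internet": {
        "Searching the Web": ["Search Engine Watch", "The Web Robots Pages"],
    },
}

def walk(node, path=()):
    """Yield (category path, site) for every site found under this node."""
    if isinstance(node, list):
        for site in node:
            yield path, site
    else:
        for name, child in node.items():
            yield from walk(child, path + (name,))

def search(tree, keyword, category=()):
    """Search site names in the whole tree, or only under the given category path."""
    node = tree
    for name in category:  # follow the classification tree down to the chosen category
        node = node[name]
    keyword = keyword.lower()
    return [
        (" > ".join(category + path), site)
        for path, site in walk(node)
        if keyword in site.lower()
    ]

print(search(directory, "search"))
print(search(directory, "matrix", ("Health",)))
```

Limiting the search to a category is what keeps the number of results small compared with a general search engine.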
         
         
        To refine, some have also rated the sites, and you can limit your search to the
        "best" ones, or the most visited ones:  
        * Lycos Top 5%: http://point.lycos.com/categories/
         
        The best 5% of sites in each category
        * Magellan: http://www.mckinley.com/
         
        Possibility to access reviewed and rated sites  
        * 100hot.com: http://www.100hot.com/
         
        The 100 most frequently visited sites by category
        The philosophy of the following resources is different. They provide a classification
        that gives central access to value-added topical guides, written by independent
        individuals, that match their quality criteria. For these two directories, a category is
        available for health and medicine.
        The World-Wide Web Virtual Library: http://www.vlib.org/  
        Run by volunteers, it's the oldest catalog of the web, started by Tim Berners-Lee, the
        creator of the web itself.  
        The Argus Clearinghouse: http://www.clearinghouse.net/  
        Maintained by librarians. To be included, a resource guide must meet strict criteria
        for level of description and evaluation of the sites, and for organization and design.
        Guides are also rated.
        Top of the Web: http://www.december.com/web/top.html 
        Selected commented resources in several categories (by J. December)  
        Some already use a library-type classification:  
        CyberStacks: http://www.public.iastate.edu/~CYBERSTACKS/homepage.html
         
        Significant Internet resources with summaries categorized using the Library of Congress
        classification scheme.  
         
         
        More information:  
        Vizine-Goetz D. Using Library Classification Schemes for Internet Resources. OCLC
        Internet Cataloging Project Colloquium. Position Paper http://www.oclc.org/oclc/man/colloq/v-g.htm
         
         
         
        Specialized  
        You already know subject-specific search tools such as OMNI, Medical Matrix, ... that
        cover only biomedical resources, but other tools focus on other specializations: the
        type, format, origin, ... of the information.
        Here are some examples:  
        * Yahoo! People Search: http://people.yahoo.com/  
        Previously known as Four11, this service is intended to help you find the electronic
        mail address of people you know. Data is gathered from mailing lists or by registration.
        * DejaNews: http://www.dejanews.com
         
        Even if general search engines include Usenet newsgroup postings (AltaVista, HotBot,
        ...), this is the first and best-known Usenet postings finder. Don't forget that some
        newsgroups are only replications of discussion lists, such as medlib-l, which becomes
        bit.listserv.medlib-l. The PowerSearch feature lets you limit your search by newsgroup,
        author, date and language. Moreover, results can also be sorted by those fields.
        * FileZ: http://www.filez.com
         
        For finding specific files
        * EuroFerret: http://www.euroferret.com/  
        This is a typical example of the engines that cover only some languages or some
        geographical areas. You can find a lot of examples for France and French-speaking
        resources, such as Nomade (http://www.nomade.fr),
        Ecila (http://www.ecila.fr),
        WebOrama (mentioned above), or Annuaire E/R (http://www.urec.cnrs.fr/annuaire/),
        where you can even limit by region or by town. For a list of search engines by country,
        look at:
        Search Engines Worldwide: http://www.twics.com/~takakuwa/search/search.html
         
         
         
        A lot more on search engines and search techniques:  
        * Search Engine Watch: http://searchenginewatch.com  
        THE reference site for search engines. Free monthly newsletter with updates on
        search engine facts. A subscription gives access to more in-depth information on specific
        search engines.
        * Web Site Search Tools: http://www.searchtools.com/  
        Information, Guides and News  
        * Web Search (from The Mining Company): http://websearch.miningco.com/
        Guides, search engines features, search by subject  
        * Sherlock: http://www.intermediacy.com/sherlock/index.phtml
         
        "Centre for effective Internet search" + Bi-weekly short article, and a
        collection of tips for searching. You may also post questions about techniques, ...  
        * l'URFIST de Strasbourg propose...:  
        http://www-scd-ulp.u-strasbg.fr/urfist/cours-internet.htm
         
        Excellent resource that offers guides, articles, ... to improve Internet search
        performance
        * The Web Robots Pages: http://info.webcrawler.com/mak/projects/robots/robots.html
         
        Papers and articles, with a FAQ, a detailed list of robots, a mailing list, ...
        specific to automatic web page indexers.  
        * Motrech, a French-speaking mailing list on search engines (problems,
        techniques, comparisons)
        To subscribe, send an empty message to: motrech-subscribe@makelist.com
         
         
        Hope it helps ;-)  
        Bye, bye! Vincent Maes  
         
         
          
         
        Last Updated March 22, 1999 by Suzanne Bakker