My name is Lincoln Sayger. My interests include writing essays and novels, and I use the Internet for a fair portion of my research. Because of this, I believe that the proposal contained herein will be greatly helpful to me and to other writers.
Search engines and indices have made the task of finding information online easier, but they have their limits. I propose to develop a system for categorizing Web pages, sites, and domains using the Dewey Decimal System, the Library of Congress Classification, or a similar classification scheme.
Work has already begun on developing a new system, one that builds on the lessons of earlier classification schemes and includes categories that have proven viable as our society has embraced the World Wide Web.
No one can rationally deny that the World Wide Web, among the other things it is and does, serves as a vast storehouse of information. Since educational institutions have their own top-level domain, and since many hundreds of these institutions maintain Web sites, many of which publish academic research and critical articles, credible information on a myriad of topics is available to anyone with access to the World Wide Web. Beyond the information stored on academic servers, credible information also exists on other sites, including those of reliable media authorities and commercial sites devoted to a specific topic. These sites gain credibility by providing reliable information, so they have a vested interest in making sure that what they publish is accurate.
However, the very multitude of sites and documents leads to one of the main problems with researching a given subject on the Internet. As LaBrie and St. Louis put it, large amounts of data make it possible "to have quality information captured in a knowledge management system, but to have no effective mechanism to retrieve that knowledge" 1. Since "there is no comprehensive 'card catalogue' system to organize the gargantuan library," as TIME Magazine points out 2, Internet users must rely on search engines, on directories and indices, or on browsing along the hyperlinks sites provide. Browsing hyperlinks is, of course, far too slow and haphazard to be practical for research.
In 1990, the first search engine, Archie, was created 3, and others soon followed: Gopher, Veronica, Jughead, and more familiar names such as WebCrawler, Yahoo!, and Lycos in 1994 4.
These search engines and directories, along with their successors, have made finding information on the Web much easier than following links alone, but every search engine and directory has strengths and weaknesses, and those weaknesses can be problematic.
Directories are good at breaking things down into topics, but they are slow to index new sites and often cover only the most common topics. Search engines index the greatest number of sites, but they have problems of their own.
Search engines often use shortcuts to work around the limits of machine intelligence, but those workarounds, while they often lead to improved searching, allow for various means of cheating.
Search engines use these shortcuts because of limits in the various components of their systems and in the communication among those components. Shortcuts are the only way to complete searches in an acceptable amount of time given the constraints of disk space, working memory, search-term matching, analysis time, bandwidth, and responsiveness to new pages added to the World Wide Web. Indices face limits of their own: their artificial intelligence has the same shortcomings, and those maintained by staff or volunteers are bound by the speed of human data processing. Human-categorized pages will be classified more accurately than machine-categorized pages, but the trade-off is speed.
Search engines often analyze the full text of a page, which lets authors pad the body of a page with highly rated search terms that have nothing to do with its actual content. Search engines also use meta keyword tags as search criteria; these allow a site author to list related terms that searchers are likely to use but that may not appear in the body text, yet they also allow unscrupulous authors to list disparate terms that bear no relation to the page's actual content.
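To make this weakness concrete, the following sketch (written in Python, with hypothetical pages and a deliberately naive scoring rule invented for illustration) shows how a ranker that merely counts term occurrences in the body and meta keywords rewards a stuffed but irrelevant page over a genuinely useful one.

    import re

    # Toy illustration (hypothetical pages, toy scoring rule): a naive ranker
    # that counts how often the query term appears in the body text and the
    # meta keywords will rank a stuffed, irrelevant page above a relevant one.

    def naive_score(page, term):
        """Count occurrences of the term in the body text plus the meta keywords."""
        text = (page["body"] + " " + page["meta_keywords"]).lower()
        return len(re.findall(re.escape(term.lower()), text))

    relevant_page = {
        "url": "example.edu/gardening-guide",
        "body": "A practical guide to home gardening and soil preparation.",
        "meta_keywords": "gardening, soil, compost",
    }

    stuffed_page = {
        "url": "example.com/ads",
        "body": "gardening gardening gardening buy now " * 10,
        "meta_keywords": "gardening, loans, celebrities, free music",
    }

    for page in (relevant_page, stuffed_page):
        print(page["url"], naive_score(page, "gardening"))
    # The stuffed page wins on raw term count even though it says nothing useful.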
Search engines also face difficulty in dealing with human languages. In computer code, an item used for a variety of functions inherits some context from the functions that call it, but human language uses words for multiple purposes and may or may not supply anything to lend them context; a word might appear alone on a page.
Consider "SHALIMAR" as an example of a word with multiple uses. It can refer to a city 5, a perfume 6, A novel by Manohar Malgonkar 7, or a section of the Kolkata suburban railway system 8, as well as other possibilities, such as singer/musical group names.
But a person using it as a search term usually wants only one of those meanings. In addition to having multiple meanings, a word may come from a human language other than the main language the search engine indexes. Except for some common words, these semantic differences are unlikely to be resolved by context, since the cultural context that would resolve them for a human is not available to a machine.
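The following sketch (again in Python, over a small hypothetical collection of documents echoing the senses listed above) illustrates the point: a plain keyword lookup has no way to tell which sense of "Shalimar" the searcher wants, so it simply returns every document containing the word.

    # Toy illustration (hypothetical documents): a plain keyword lookup cannot
    # tell which sense of an ambiguous term the searcher wants, so it returns
    # every document that contains the word.

    documents = {
        "doc1": "Shalimar is the name of a city.",
        "doc2": "Shalimar is a well-known perfume.",
        "doc3": "Shalimar is a novel by Manohar Malgonkar.",
        "doc4": "Shalimar is a section of the Kolkata suburban railway system.",
    }

    def keyword_search(term):
        """Return every document whose text contains the term, regardless of sense."""
        term = term.lower()
        return [doc_id for doc_id, text in documents.items() if term in text.lower()]

    print(keyword_search("shalimar"))  # ['doc1', 'doc2', 'doc3', 'doc4']
    # A researcher interested only in the novel still has to sift through the rest.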
Paper libraries are a good place to start when considering possible methods of classifying materials, since libraries have been using classification systems to sort books and other materials for many decades. In fact, collections of materials have used coding systems for thousands of years 9. Today's libraries are organized by standardized systems which are reasonably well understood by the general reading public. The two classification systems most commonly used by public and university libraries are the Dewey Decimal Classification system and the Library of Congress Classification system.
The Dewey Decimal Classification system was developed by Melvil Dewey during the 1870s 10. It uses ten major categories (called classes), each broken into ten subcategories (called divisions), and each division is broken in turn into ten sections, for one hundred divisions and one thousand sections in all, though not all of the available numbers are in use 11.
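This digit-per-level design is easy to picture as a nested data structure. The fragment below (an illustrative subset only, written in Python with labels abbreviated) shows how membership in a branch of the hierarchy reduces to a simple prefix test on the class number.

    # A fragment of the Dewey hierarchy expressed as a nested mapping
    # (illustrative subset only; labels abbreviated).

    dewey = {
        "500": {"label": "Natural sciences and mathematics", "divisions": {
            "510": {"label": "Mathematics", "sections": {
                "512": "Algebra",
                "516": "Geometry",
            }},
            "530": {"label": "Physics", "sections": {}},
        }},
        "800": {"label": "Literature", "divisions": {
            "810": {"label": "American literature in English", "sections": {}},
            "820": {"label": "English literature", "sections": {}},
        }},
    }

    # Because each level simply appends a digit, membership tests reduce to
    # prefix checks: every code that starts with "51" falls under Mathematics.
    print("516".startswith("51"))  # True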
The Library of Congress Classification system was developed in the late 19th century and continues to be developed 12. The original design was influenced by the leadership of Herbert Putnam and by the Dewey Decimal Classification and the Cutter Expansive Classification 13.
The Dewey system is strictly hierarchical, while the LCC is only loosely hierarchical. Both have served libraries well for decades, but in recent years libraries have worked to bring the data-processing power of computers to the organization of, and access to, their collections.
Antelman et al. summarize the history of online card catalogs, pointing out that "first generation online catalogs (1960s and 1970s) provided the same access points as the card catalog," that second-generation catalogs added Boolean searching but did not solve the difficulties of searching by subject, and that newer catalogs allow some partial matching of terms 14. Despite the adoption by some libraries of somewhat hierarchical databases, many libraries still use catalogs that function much like Internet search engines and share the drawbacks of free text searching.
Because of these efforts to bring computers into library classification, perhaps we can learn from libraries' experience and bring the structure of library classification to the Internet more elegantly. Combining hierarchy with keyword searching could lead to faster, more efficient paths to knowledge.
I propose a means of adding hierarchy to free text searching.
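As a rough illustration of what I have in mind, the sketch below (in Python) assumes hypothetical page records that carry a Dewey-style classification code alongside their text; the pages, URLs, and codes are invented for the example. Restricting a keyword match to one branch of the hierarchy narrows the candidate set before any free-text matching takes place.

    # A minimal sketch of the proposed hybrid lookup, assuming (hypothetically)
    # that each page record carries a Dewey-style classification code alongside
    # its text. The records and codes below are invented for illustration.

    pages = [
        {"url": "example.edu/euclid", "code": "516",
         "text": "Notes on Euclidean geometry"},
        {"url": "example.com/shalimar-perfume", "code": "668",
         "text": "Shalimar perfume review"},
        {"url": "example.org/malgonkar", "code": "823",
         "text": "Shalimar, a novel by Manohar Malgonkar"},
    ]

    def hybrid_search(term, class_prefix=""):
        """Keyword search restricted to pages whose code begins with class_prefix."""
        term = term.lower()
        return [p["url"] for p in pages
                if p["code"].startswith(class_prefix) and term in p["text"].lower()]

    print(hybrid_search("shalimar"))       # the perfume page and the novel page
    print(hybrid_search("shalimar", "8"))  # only the literature branch: the novel page

In this sketch the classification code does the work that cultural context does for a human reader: a searcher who asks for "Shalimar" within the literature branch of the hierarchy is never shown the perfume page at all.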