However, when the number of documents to search is potentially large or the quantity of search queries to perform is substantial, the problem of full text search is often divided into two tasks: indexing and searching. The indexing stage scans the text of all the documents and builds a list of search terms, often called an index but more correctly named a concordance. In the search stage, when a specific query is performed, only the index is referenced, rather than the text of the original documents.
The indexer will make an entry in the index for each term or word found in a document, and possibly its relative position within the document. Usually the indexer will ignore stop words, such as the English "the", which are both too common and carry too little meaning to be useful for searching. Some indexers also employ language-specific stemming on the words being indexed; for example, the words "drives", "drove", and "driven" may all be recorded in the index under the single concept word "drive".
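The indexing process described above can be sketched as follows. This is a minimal illustration, not a production indexer: the stop-word list is tiny, and the suffix-stripping stemmer is a toy stand-in for real algorithms such as Porter stemming (note that it unifies "driver"/"drives" but not irregular forms like "drove").

```python
import re
from collections import defaultdict

STOP_WORDS = {"the", "a", "an", "of", "to", "and"}  # small illustrative list

def stem(word):
    # Toy stemmer for illustration only; real indexers use algorithms
    # such as Porter stemming, which handle far more cases.
    for suffix in ("ing", "er", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def build_index(documents):
    """Map each indexed term to the (doc_id, position) pairs where it occurs."""
    index = defaultdict(list)
    for doc_id, text in documents.items():
        for pos, token in enumerate(re.findall(r"[a-z]+", text.lower())):
            if token in STOP_WORDS:
                continue  # too common to be useful for searching
            index[stem(token)].append((doc_id, pos))
    return index

docs = {1: "The driver drives the truck", 2: "We drove to the market"}
index = build_index(docs)
```

At query time, a search for "drives" would be stemmed the same way and answered entirely from `index`, without rescanning the document text.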
Due to the ambiguities of natural language, a full text search typically produces a retrieval list that has low precision: most of the items retrieved are irrelevant. Controlled-vocabulary searching solves this problem by tagging the documents in such a way that the ambiguities are eliminated. However, a controlled vocabulary search may have low recall: it may fail to retrieve some documents that are actually relevant to the search question. Despite the presence of many irrelevant documents in a free text search's retrieval list, a free text search may be able to locate a document that a controlled vocabulary search failed to retrieve.
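The trade-off between the two approaches can be made concrete with the standard definitions: precision is the fraction of retrieved documents that are relevant, and recall is the fraction of relevant documents that were retrieved. The document IDs and result sets below are hypothetical.

```python
def precision_recall(retrieved, relevant):
    """Precision: fraction of retrieved docs that are relevant.
    Recall: fraction of relevant docs that were retrieved."""
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical result sets for the same information need:
free_text = {1, 2, 3, 4, 5, 6, 7, 8}  # free text search retrieves broadly
controlled = {1, 2, 3}                # controlled vocabulary retrieves narrowly
relevant = {1, 2, 3, 4}

# Free text: low precision (4/8) but full recall (4/4).
# Controlled vocabulary: full precision (3/3) but lower recall (3/4) --
# it misses relevant document 4, which the free text search found.
```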
Free text searching is likely to retrieve many documents that are not relevant to the intended search question. Such documents are called false positives. The retrieval of irrelevant documents is often caused by the inherent ambiguity of natural language.
Certain clustering techniques based on Bayesian algorithms (similar to the spam filter used in Gmail) can help reduce false positive errors. If the search term is "football", for example, these techniques can categorize the document/data universe into categories such as "American football", "corporate football", and so on. Depending on the occurrences of words in a document, it can fall into one or more of these categories. These techniques are extensively deployed in the e-discovery domain.
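One way such Bayesian categorization can work is sketched below with a small multinomial Naive Bayes classifier; the categories, vocabulary, and training documents are hypothetical, and real e-discovery systems are far more elaborate. Each category is scored by combining its prior probability with smoothed per-word likelihoods.

```python
import math
from collections import Counter

def train(labeled_docs):
    """labeled_docs: list of (category, token list) pairs. Returns
    per-category token counts, document counts, and the vocabulary."""
    counts, totals = {}, Counter()
    for cat, tokens in labeled_docs:
        counts.setdefault(cat, Counter()).update(tokens)
        totals[cat] += 1
    vocab = {t for c in counts.values() for t in c}
    return counts, totals, vocab

def classify(tokens, counts, totals, vocab):
    """Pick the category maximizing log prior + log likelihood,
    with add-one (Laplace) smoothing for unseen words."""
    n_docs = sum(totals.values())
    best, best_score = None, -math.inf
    for cat, cc in counts.items():
        score = math.log(totals[cat] / n_docs)
        denom = sum(cc.values()) + len(vocab)
        for t in tokens:
            score += math.log((cc[t] + 1) / denom)
        if score > best_score:
            best, best_score = cat, score
    return best

# Hypothetical training data: "football" is ambiguous between categories.
training = [
    ("American football", "quarterback touchdown football field".split()),
    ("American football", "football touchdown helmet".split()),
    ("soccer", "football goal penalty pitch".split()),
    ("soccer", "football goalkeeper pitch".split()),
]
model = train(training)
```

A document mentioning "touchdown football" lands in the "American football" category even though "football" alone is ambiguous, because the surrounding words shift the likelihood.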
The deficiencies of free text searching have been addressed in two ways: By providing users with tools that enable them to express their search questions more precisely, and by developing new search algorithms that improve retrieval precision.
Technological advances have greatly improved the performance of free text searching. For example, Google's PageRank algorithm gives more prominence to documents to which other Web pages have linked. This algorithm dramatically improves users' perception of search precision, a fact that explains its popularity among Internet users. See search engine for additional examples.
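The core of the link-based idea can be sketched with a minimal iterative PageRank computation. The link graph, damping factor, and iteration count below are illustrative, and this is a simplification of what Google actually deploys.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links: dict mapping each page to the list of pages it links to.
    Returns a score per page; pages receiving links from high-scoring
    pages rank higher, which is the prominence effect described above."""
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new = {p: (1.0 - damping) / len(pages) for p in pages}
        for page, outgoing in links.items():
            if not outgoing:  # dangling page: spread its rank evenly
                for p in pages:
                    new[p] += damping * rank[page] / len(pages)
            else:
                for target in outgoing:
                    new[target] += damping * rank[page] / len(outgoing)
        rank = new
    return rank

# Hypothetical link graph: both A and C link to B, so B ranks highest.
graph = {"A": ["B"], "B": ["A", "C"], "C": ["B"]}
ranks = pagerank(graph)
```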