
The Cutting Edge / COMPUTING / TECHNOLOGY / INNOVATION : New Ways to Find Needle in Data Haystack : Information: Novel software is making the database search faster, more efficient.


In the summer of 1991, while using an on-line legal database to research cases on judicial review, Yale Law School student Daniel Egger joined the ranks of frustrated database searchers. He spent hours each day in front of a computer terminal, yet his searches were retrieving dozens of irrelevant documents while skipping some important cases altogether.

"I thought there's got to be a better way," Egger recalls. "What lawyers are really trying to do is find a line of cases that develop a particular legal idea over time. I realized I could do that with mathematical modeling techniques."

That brainstorm led to V-Search, a program designed to comb databases more efficiently by using the relationships between documents to help find the ones that are useful.

V-Search is one of a handful of novel products that use concepts rather than "keywords" to help database users find information. While traditional database search tools use simplistic "true or false" principles to determine whether a keyword or specific phrase is contained in a document, the new breed of programs uses statistical analysis to identify key concepts and find the linkages between related documents.

These new programs are emerging in the nick of time. The quantity of information stored in electronic databases is growing by leaps and bounds, and doctors, lawyers, journalists, financial professionals and many others are increasingly dependent on them. And yet it is often impossible to find what amounts to a needle of information in an immense haystack of data.

"The basic problem is information overload," said Steven Fingerhood, general partner of SLF Partners, a San Francisco venture capital firm that has invested in Egger's Durham, N.C.-based company, Libertech. "Everyone agrees it's one of the major problems that has to be solved for effective use of electronic information."

The new "search engines," as they are known, are a quantum leap ahead of those that concentrate on locating keywords, said Vinod Khosla, a partner in the venture capital firm Kleiner Perkins Caufield & Byers, which has invested in a Palo Alto-based search-engine company called Architext. Comparing the two is "like calling a car a bicycle," Khosla said. "Both of them get you from one place to another," but the car is far more advanced.

The market for software packages that help people access and sort through databases was $748 million last year, and is projected to grow to $960 million next year, according to International Data Corp., a market research firm based in Framingham, Mass. Fortune 1000 companies ranked "improved access to data" as their second-most important concern in a recent survey conducted by IDC.

The vendors of keyword-based search engines certainly aim to keep their share of the booming business. They are designing sophisticated interfaces that allow users to query databases in plain English, as well as providing pre-programmed thesauruses, so that if someone is looking for documents about the New York Stock Exchange, the search engine will also retrieve documents that refer to the exchange as the NYSE.
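For readers who want to see the idea in miniature, a thesaurus-assisted keyword search can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names THESAURUS, expand_query and keyword_search are hypothetical, the sample documents are invented, and no vendor's actual product is being described.

```python
# Hypothetical thesaurus: maps a phrase to alternate forms of the same name.
THESAURUS = {"new york stock exchange": ["nyse"]}

def expand_query(phrase, thesaurus=THESAURUS):
    """Return the phrase plus any alternate forms listed in the thesaurus."""
    phrase = phrase.lower()
    return [phrase] + thesaurus.get(phrase, [])

def keyword_search(phrase, docs):
    """'True or false' matching: keep a document if any query form appears in it."""
    terms = expand_query(phrase)
    return [doc_id for doc_id, text in docs.items()
            if any(term in text.lower() for term in terms)]

# Invented sample documents, for illustration only.
docs = {
    "a": "Trading on the NYSE was heavy.",
    "b": "The New York Stock Exchange opened late.",
    "c": "Bond yields fell.",
}
matches = keyword_search("New York Stock Exchange", docs)
```

Here the search finds documents "a" and "b" but not "c". The match is still a simple yes-or-no test on the text; the thesaurus only widens the set of strings being tested.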

But the new engines promise a breakthrough in speed and accuracy. First, they read through each of the files in a database. By counting the number of times each of the words appears in the documents, and by noticing the other words that appear nearby, the search engines can discern the key concepts in a document. Then, using statistical analysis, the search engines compare the concepts in each of the documents to find ones that are closely related to each other.
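A rough sketch of the counting-and-comparing step might look like the following Python. It is a simplification, not the method of any product named here: it counts words and compares documents with cosine similarity, a standard statistical measure, and it leaves out the analysis of which words appear near each other. The function names and sample documents are hypothetical.

```python
import math
import re
from collections import Counter

def term_counts(text):
    """Count how often each lowercase word appears in a document."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine_similarity(a, b):
    """Cosine of the angle between two word-count vectors (0.0 to 1.0)."""
    dot = sum(a[word] * b[word] for word in a if word in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def most_related(query_id, docs):
    """Rank every other document by similarity to the chosen one."""
    target = term_counts(docs[query_id])
    scores = {doc_id: cosine_similarity(target, term_counts(text))
              for doc_id, text in docs.items() if doc_id != query_id}
    return sorted(scores, key=scores.get, reverse=True)

# Invented sample documents, for illustration only.
docs = {
    "ip": "software piracy harms copyright holders and copyright law",
    "piracy": "piracy of software violates copyright",
    "stocks": "the stock exchange closed higher today",
}
ranking = most_related("ip", docs)
```

With these samples, the "piracy" document ranks ahead of "stocks" because it shares several words with the "ip" document, while "stocks" shares none.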

"The main thing is to look for relationships between words and groups of words," explains Graham Spencer, senior scientist at Architext, which has developed a search engine similar to the one that powers V-Search. "It helps us pin down what a human thinks of when he thinks of a concept."

For example, when searching a database for information about intellectual property, a concept-based search engine will retrieve documents about piracy because the two concepts are closely linked, Spencer said. A typical search engine would only find piracy documents if they made an explicit reference to intellectual property.

That strategy works particularly well for databases of documents--like legal cases and technical journal articles--that refer to earlier documents.

"Our system (V-Search) analyzes the network of explicit links among documents and finds groups of closely related documents," Egger said. V-Search, which will be unveiled officially today at the Folio Infobase 95 conference in San Diego, has already been licensed to several major legal publishers, and Egger hopes to expand beyond the legal database market by the end of the year.
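The article does not say how V-Search does this analysis. As one possible illustration, the Python sketch below groups documents by how many of the same earlier documents they cite, a standard technique known as bibliographic coupling. The function names, the 0.5 threshold and the case labels are all hypothetical.

```python
def coupling_strength(citations, a, b):
    """Share of cited documents that a and b have in common (Jaccard overlap)."""
    refs_a = citations.get(a, set())
    refs_b = citations.get(b, set())
    union = refs_a | refs_b
    return len(refs_a & refs_b) / len(union) if union else 0.0

def related_by_citation(citations, doc_id, threshold=0.5):
    """List documents whose citations overlap with doc_id's at or above the threshold."""
    return sorted(other for other in citations
                  if other != doc_id
                  and coupling_strength(citations, doc_id, other) >= threshold)

# Invented citation data: each case maps to the earlier cases it cites.
citations = {
    "case_C": {"case_A", "case_B"},
    "case_D": {"case_A", "case_B"},
    "case_E": {"case_B", "case_X"},
    "case_F": {"case_Y"},
}
related = related_by_citation(citations, "case_C")
```

In this toy example only case_D is grouped with case_C, since the two cite exactly the same earlier cases; case_E shares one citation out of three and falls below the threshold.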

Los Angeles Times Articles