Computer Program Beats Scientists at Indexing Experimental Data
In August, a supercomputer managed to predict scientific discoveries. Now, computers continue to threaten scientists with obsolescence: a cutting-edge program has performed on par with or better than human scientists at extracting and cataloging experimental data from scientific journals.
Advances in any scientific field are difficult to synthesize into one database, because new data appears in hundreds of different journals. Entering this often qualitative data into a database is a relatively subjective process: someone must read each article and enter the extracted information by hand. The Paleobiology Database, for example, has been under construction for sixteen years, but because cataloging is so arduous, it still does not contain all of the knowledge from paleontological studies. In response to this problem, researchers from the University of Wisconsin-Madison developed PaleoDeepDive, which, like human researchers, reads articles and extracts structured data, such as species names, time periods, and geographic locations. They then compared the data that the program gleaned from the articles with the data that human scientists had entered into the Paleobiology Database from the same articles.
"We demonstrated that the system was no worse than people on all the things we measured, and it was better in some categories," said Christopher Ré, who developed the technology.
According to author Shanan Peters, a UW professor of geoscience, the project "marks a milestone in the quest to rapidly and precisely summarize, collate and index the vast output of scientists around the globe."
According to Ré, the program's groundbreaking performance is the result of the team's innovative approach to extraction. While other programs developed by companies like IBM or Google attempt to determine one "correct" reading of the articles, PaleoDeepDive operates according to probabilities. For every pertinent term that it extracts, such as a species name or place name, it assesses the probability that the term is related to other extracted terms in each of several possible ways. As a result, the program is much better equipped to correct and update its information than human researchers. According to Peters, "Information that was manually entered into the Paleobiology Database by humans cannot be assessed or enhanced without going back to the library and re-examining original documents. Our machine system, on the other hand, can extend and improve results essentially on the fly as new information is added." As a result of this increased efficiency, the researchers predict that this program, or a program very much like it, will be widely used for data mining and cataloging in the near future.
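To make the probabilistic idea concrete, here is a minimal Python sketch. It is not the researchers' system: the class name, the log-odds update rule, the evidence weights, and the example facts are all assumptions chosen for illustration. It shows only the general pattern described above, in which each candidate relation carries a probability that can be revised as new evidence arrives, instead of being fixed as a single "correct" reading.

```python
import math
from collections import defaultdict


class CandidateFacts:
    """Toy store of candidate facts, each with a running log-odds score.

    Hypothetical illustration only; not the actual program's model.
    """

    def __init__(self, prior_log_odds=0.0):
        # Every unseen fact starts at the prior (0.0 log-odds = 50%).
        self.log_odds = defaultdict(lambda: prior_log_odds)

    def add_evidence(self, fact, weight):
        # Positive weight supports the fact; negative weight counts against it.
        self.log_odds[fact] += weight

    def probability(self, fact):
        # Convert log-odds to a probability with the logistic function.
        return 1.0 / (1.0 + math.exp(-self.log_odds[fact]))

    def accepted(self, threshold=0.9):
        # Facts whose current probability clears the threshold.
        return sorted(f for f in self.log_odds if self.probability(f) >= threshold)


# Illustrative facts and weights (made up for this sketch).
hell_creek = ("Tyrannosaurus rex", "found_in", "Hell Creek Formation")
morrison = ("Tyrannosaurus rex", "found_in", "Morrison Formation")

facts = CandidateFacts()
facts.add_evidence(hell_creek, 1.5)   # one article supports this reading
facts.add_evidence(morrison, -0.5)    # weak evidence against this reading
facts.add_evidence(hell_creek, 1.5)   # a second article agrees

print(facts.accepted())               # only the well-supported fact is listed
```

Because each fact keeps a score and not a final yes-or-no answer, calling `add_evidence` again later changes which facts clear the threshold without re-reading the earlier articles, which is the kind of "on the fly" improvement Peters describes.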
Peters, for his part, is fairly explicit about the possibility of computers serving as replacements for scientists: "Ultimately, we hope to have the ability to create a computer system that can do almost immediately what many geologists and paleontologists try to do on a smaller scale over a lifetime: read a bunch of papers, arrange a bunch of facts, and relate them to one another in order to address big questions."