A catalogue of billions of phrases from 107 million papers could ease computerized searching of the literature.
In a project that could unlock the world’s research papers for easier computerized analysis, an American technologist has released online a gigantic index of the words and short phrases contained in more than 100 million journal articles — including many paywalled papers.
The catalogue, which was released on 7 October and is free to use, holds tables of more than 355 billion words and sentence fragments listed next to the articles in which they appear. It is an effort to help scientists use software to glean insights from published work even if they have no legal access to the underlying papers, says its creator, Carl Malamud. He released the files under the auspices of Public Resource, a non-profit corporation that he founded in Sebastopol, California.
Malamud says that because his index doesn’t contain the full text of articles, but only sentence snippets up to five words long, releasing it does not breach publishers’ copyright restrictions on the re-use of paywalled articles. However, one legal expert says that publishers might question the legality of how Malamud created the index in the first place.
Some researchers who have had early access to the index say it’s a major development in helping them to search the literature with software — a procedure known as text mining. Gitanjali Yadav, a computational biologist at the University of Cambridge, UK, who studies volatile organic compounds emitted by plants, says she aims to comb through Malamud’s index to produce analyses of the plant chemicals described in the world’s research papers. “There is no way for me — or anyone else — to experimentally analyse or measure the chemical fingerprint of each and every plant species on Earth. Much of the information we seek already exists, in published literature,” she says. But researchers are restricted by lack of access to many papers, Yadav adds.
Malamud’s ‘General Index’, as he calls it, aims to address the problems faced by researchers such as Yadav. Computer scientists already text mine papers to build databases of genes, drugs and chemicals found in the literature, and to explore papers’ content faster than a human could read. But they often note that publishers ultimately control the speed and scope of their work, and that scientists are restricted to mining only open-access papers, or those articles they (or their institutions) have subscriptions to. Some publishers have said that researchers looking to mine the text of paywalled papers need their authorization.
And although free search engines such as Google Scholar have — with publishers’ agreement — indexed the text of paywalled literature, they only allow users to search with certain types of text queries, and restrict automated searching. That doesn’t allow large-scale computerized analysis using more specialized searches, Malamud says.
Terabytes of data
Malamud’s project is his latest venture in a career spent releasing locked-up information for free access online — often in the face of legal challenges. He originally focused on publishing government-produced legal and financial information. But more recently he has turned his attention to opening up the scientific literature.
He began with a project to allow scientists to text mine — but not read — a giant store of research papers he is holding on a server in India, an idea he says he is still working on. The General Index now allows anyone to mine scientific works, but it doesn’t have its own web-search portal, so scientists who want to search it will have to download its files and develop their own programs. Malamud is hoping that users will make any search engines they create available to others.
In its compressed format the catalogue totals almost 5 terabytes, and then expands to 38 terabytes. As well as sentence fragments, the files also include tables of nearly 20 billion keywords in the literature, and tables of a paper’s title, authors and DOI (article identifier), so that users can track down a full paper if they have access to read it.
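For researchers writing their own tools, using the index essentially means downloading the tables and querying them locally. The sketch below is not drawn from Malamud’s documentation: it assumes, purely for illustration, that one of the n-gram tables has been loaded into a local SQLite database, and it uses hypothetical table and column names (ngrams, ngram, doi) that may not match the real files.

```python
import sqlite3

# Hypothetical illustration only: assumes an n-gram table from the General
# Index has been imported into a local SQLite database with a table "ngrams"
# holding columns "ngram" (the text fragment) and "doi" (the article it came
# from). The actual distribution may use different schemas and names.
def find_papers_mentioning(db_path: str, phrase: str) -> set[str]:
    """Return the DOIs of articles whose n-gram entries contain the phrase."""
    conn = sqlite3.connect(db_path)
    try:
        cursor = conn.execute(
            "SELECT DISTINCT doi FROM ngrams WHERE ngram LIKE ?",
            (f"%{phrase.lower()}%",),
        )
        return {row[0] for row in cursor}
    finally:
        conn.close()

if __name__ == "__main__":
    # Example: collect DOIs of papers whose snippets mention a plant volatile,
    # the kind of query Yadav describes for surveying plant chemicals.
    dois = find_papers_mentioning("general_index_sample.db", "methyl salicylate")
    print(f"{len(dois)} articles matched")
```

Because the index stores only fragments and identifiers, a query like this returns DOIs rather than passages; researchers would then need their own access rights to read the matching papers in full.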
Michael Carroll, a legal researcher at the American University Washington College of Law, says that distributing the index should be legal worldwide because the files do not copy enough of an underlying article to infringe the publisher’s copyright — although laws vary by country. “Copyright does not protect facts and ideas, and these results would be treated as communication of facts derived from the analysis of the copyrighted articles,” he says.
The only legal question, Carroll adds, is whether Malamud’s obtaining and copying of the underlying papers was done without breaching publishers’ terms. Malamud says that he did have to get copies of the 107 million articles referenced in the index in order to create it; he declined to say how, but emphasizes that researchers will not have access to the full texts of the papers, which are stored in a secured, undisclosed location in the United States.
“I am very confident that what I’m doing is legal. We are not doing this to provoke a lawsuit, we are doing it to advance science,” he says.
Nature contacted six publishers about the General Index for this article: all but one declined to comment. In a statement, Springer Nature said that the company supports open-research initiatives that use technology and algorithms to meet the needs of researchers. “We have seen some initiatives run into trouble, however, when the necessary rights have not been secured to enable their sustainability,” the statement added. (Springer Nature publishes this journal; Nature’s news team is editorially independent of its publisher.)
Another legal researcher, Arul George Scaria at Delhi’s National Law University, says that any publishers that tried to use copyright laws to prevent researchers from using the General Index “would eventually be disappointed”. The release of the index, Scaria says, is a “major development for the wealth of information it has unlocked from those 107 million journal articles”.