Using Chemistry Central's open access full-text corpus for text mining research
As of 20 Oct 2014, BioMed Central (with Chemistry Central and SpringerOpen) has published 219,356 peer-reviewed research articles. All of them are covered by our open access license agreement, which allows free distribution and re-use of the full-text article, including the highly structured XML version.
As a result, Chemistry Central's open access corpus is ideally suited for use by text mining researchers.
An XSLT preview stylesheet, which will render any Chemistry Central article XML file into HTML, is available:
Sample code for developers, demonstrating the use of the stylesheet, is also available:
How to download Chemistry Central's corpus
1. By FTP
|/articles/||A subdirectory containing the full-text XML file for each article, each named after its unique identifier, i.e. [ui].xml|
|/articles.zip||A single ZIP-compressed file containing all the full-text XML files|
Remember to set FTP transfer mode to BINARY
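The FTP route can be sketched in Python as follows. This is a minimal illustration, not an official client: the host name, login details and local file names are assumptions to be supplied by the reader, and the only layout taken from this page is the /articles.zip path and the [ui].xml naming. The standard-library call `ftplib.FTP.retrbinary` switches the connection to binary mode before the transfer, which takes care of the BINARY requirement.

```python
import ftplib
import zipfile
import xml.etree.ElementTree as ET


def download_binary(host, remote_path, local_path, user="anonymous", passwd=""):
    """Fetch one file over FTP in binary mode.

    `host`, `user` and `passwd` are placeholders; this page does not state them.
    """
    with ftplib.FTP(host) as ftp:
        ftp.login(user, passwd)
        with open(local_path, "wb") as out:
            # retrbinary issues TYPE I (binary) before sending the RETR command
            ftp.retrbinary("RETR " + remote_path, out.write)


def iter_articles(zip_source):
    """Yield (member_name, root_element) for every .xml member of the archive.

    `zip_source` may be a path or a binary file-like object.
    """
    with zipfile.ZipFile(zip_source) as archive:
        for name in archive.namelist():
            if name.lower().endswith(".xml"):
                with archive.open(name) as handle:
                    yield name, ET.parse(handle).getroot()
```

A typical session would call `download_binary(host, "/articles.zip", "articles.zip")` with the real FTP host, then loop over `iter_articles("articles.zip")` to process one parsed article at a time without unpacking the whole archive to disk.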
2. Via the Open Archives Initiative Protocol for Metadata Harvesting (OAI protocol)
The OAI protocol is an HTTP/XML web service standard for the exchange of data between archives and repositories. Full-text XML is one of the metadata formats that the Chemistry Central OAI protocol interface supports. See Chemistry Central's OAI page for more details.
Use the following OAI 'set' to download all open access articles via Chemistry Central's OAI interface.
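A harvest over the OAI protocol can be sketched as below, assuming the interface follows the standard OAI-PMH 2.0 conventions (the `ListRecords` verb, the `metadataPrefix` and `set` arguments, and `resumptionToken` paging). The base URL, the metadata prefix for full-text XML and the set name are not given here, so they appear only as function parameters that the reader must fill in from Chemistry Central's OAI page.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

# Namespace defined by the OAI-PMH 2.0 specification
OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"


def list_records_url(base_url, metadata_prefix=None, set_spec=None,
                     resumption_token=None):
    """Build a ListRecords request URL.

    In OAI-PMH, resumptionToken is an exclusive argument: when it is
    present, metadataPrefix and set must be left out of the request.
    """
    if resumption_token:
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
        if set_spec:
            params["set"] = set_spec
    return base_url + "?" + urllib.parse.urlencode(params)


def parse_page(xml_bytes):
    """Return (list of <record> elements, resumption token or None)."""
    root = ET.fromstring(xml_bytes)
    records = root.findall(f".//{OAI_NS}record")
    token_el = root.find(f".//{OAI_NS}resumptionToken")
    has_text = token_el is not None and token_el.text
    token = token_el.text.strip() if has_text else None
    return records, token or None


def harvest(base_url, metadata_prefix, set_spec=None):
    """Yield every record in the set, following resumption tokens."""
    token = None
    while True:
        url = list_records_url(base_url, metadata_prefix, set_spec, token)
        with urllib.request.urlopen(url) as response:
            records, token = parse_page(response.read())
        yield from records
        if not token:
            break
```

Keeping URL construction and response parsing separate from the network loop makes both easy to check offline before pointing `harvest` at the live interface.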
3. From PubMed Central (standard NLM DTD format)
Chemistry Central's open access articles are part of the open access subset of PubMed Central. This means that they are available for bulk XML download from PubMed Central in the standard NLM Archiving and Interchange DTD format, via FTP or OAI.
Publish your text mining research with Chemistry Central
Chemistry Central and BioMed Central are keen to publish high-quality research in the area of text mining and the analysis of biomedical and chemical literature.
See this list of recent publications on this topic that have appeared in BioMed Central's journals.
All research articles published by Chemistry Central are covered by our open access policy, and so are freely available without subscription.
For more information on using Chemistry Central's articles for text mining purposes, contact firstname.lastname@example.org.
- BioNLP - a collection of resources relating to textual analysis of the biological literature
- BioLINK - a special interest group on text mining, which in 2003 ran a text mining competition that made use of Chemistry Central's corpus
- BLIMP - a collection of links to publications on the subject of biomedical text mining
- Data mining Open Access research - an article in the 8 September 2003 edition of Open Access Now