Table 1 Document sets (corpora) used in this work

From: Automatic reconstruction of a bacterial regulatory network using Natural Language Processing

 

|   | ID | Name | # of docs | size in MB | type | description |
|---|----|------|-----------|------------|------|-------------|
| 1 | RN | RegulonDB Network References | 724 | 24.9 | full-text | Full-text papers from the RegulonDB database references that curators have identified as referring specifically to the regulatory network, as opposed to those referring to other objects from the database. |
| 2 | RP | RegulonDB papers | 2,475 | 99 | full-text | Full text papers from the complete RegulonDB references that we were able to access and download. |
| 3 | RA | RegulonDB Abstracts | 3,075 | 3.3 | abstracts | Abstracts from the complete RegulonDB references, as of June of 2006. |
| 4 | RS | RegulonDB search strategies | 12,059 | 12.3 | abstracts | Corpus generated by using the RegulonDB curator's search strategies, without any subsequent filtering. |
| 5 | EA | EcoCyc Abstracts | 13,334 | 14.4 | abstracts | Abstracts from references in the 2006 EcoCyc database that describes the genome and the biochemical machinery of E. coli. |
| 6 | ST | STRING-IE | 58,312 | 10.7 | sentences | Corpus of distinct sentences generated by the STRING-IE team by searching in PubMed for "E. coli" (and synonyms), and two gene/protein names in the same abstract, from 195,000 abstracts. |

  1. Description of the different full text and abstract corpora used for extraction of regulatory interactions. The document sets are based on PubMed searches and on reference lists from database curation efforts.