Commit

Merge pull request #68 from HHS/meshrdf-dev
Release URI preserving algorithm and README changes
danizen committed Dec 3, 2015
2 parents f36bfe3 + d5389d4 commit 0cf7fcb
Showing 3 changed files with 71 additions and 56 deletions.
112 changes: 61 additions & 51 deletions README.md
@@ -1,4 +1,4 @@
# MeSH® RDF
# MeSH® RDF

***Status: Beta. Feedback is [welcome](https://github.com/HHS/meshrdf/issues).***

@@ -35,34 +35,34 @@ MeSH XML files from the NLM server. The latter option is described first.

Since the complete MeSH data files are quite large, we assume that they'll be kept in a
location on the filesystem that is separate from this GitHub repository. Set the environment variable
$MESHRDF_HOME to point to that location. For example,
`$MESHRDF_HOME` to point to that location. For example,

export MESHRDF_HOME=/var/data/mesh-rdf

You can run the script *bin/fetch-mesh-xml.sh*, which downloads all the XML and corresponding
DTD files from the NLM FTP server. For now, run it with the following command:
You can run the script `bin/fetch-mesh-xml.sh`, which downloads all the XML and corresponding
DTD files from the NLM FTP server.

MESHRDF_URI=ftp://ftp.nlm.nih.gov/online/mesh/.xmlmesh bin/fetch-mesh-xml.sh
bin/fetch-mesh-xml.sh

This saves the XML files to the *data* subdirectory of $MESHRDF_HOME.
This saves the XML files to the `data` subdirectory of `$MESHRDF_HOME`.

By default, it downloads the following:

* desc2015.dtd
* desc2015.xml
* pa2015.dtd
* pa2015.xml
* qual2015.dtd
* qual2015.xml
* supp2015.dtd
* supp2015.xml
* `desc2016.xml`
* `qual2016.xml`
* `supp2016.xml`

If you want to download a different year's data, set the MESHRDF_YEAR environment variable
before executing the script. For example,
If you want to download a different year's data, use the `-y` argument when executing the script.
For example:

MESHRDF_YEAR=2014 \
MESHRDF_URI=ftp://ftp.nlm.nih.gov/online/mesh/.xmlmesh \
bin/fetch-mesh-xml.sh
bin/fetch-mesh-xml.sh -y 2015

When downloading a year less than or equal to 2015, `bin/fetch-mesh-xml.sh` will also download the DTDs.
For example:

* `desc2015.xml` and `desc2015.dtd`
* `qual2015.xml` and `qual2015.dtd`
* `supp2015.xml` and `supp2015.dtd`
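
As a rough sketch of the rule above, and not the script's actual code, the expected file names for a given year could be listed like this. The function name `mesh_expected_files` is hypothetical; the three file stems and the 2015 cutoff for DTDs are taken from the description above.

```shell
# Hypothetical helper: print the files the fetch step is described as
# downloading for a given MeSH year. Not part of bin/fetch-mesh-xml.sh.
mesh_expected_files() {
    year=$1
    for stem in desc qual supp; do
        printf '%s\n' "$stem$year.xml"
        # Per the README, DTDs accompany the XML only for 2015 and earlier.
        if [ "$year" -le 2015 ]; then
            printf '%s\n' "$stem$year.dtd"
        fi
    done
}

mesh_expected_files 2016
```

Comparing this list against the contents of `$MESHRDF_HOME/data` is one way to check that a download completed.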


### Getting Saxon
@@ -79,11 +79,11 @@ download version 9.5 (which is known to work with these XSLTs) from the command

wget http://sourceforge.net/projects/saxon/files/Saxon-HE/9.5/SaxonHE9-5-1-5J.zip

Unzip that into the *saxon9he* subdirectory. For example:
Unzip that into the `saxon9he` subdirectory. For example:

unzip -d saxon9he SaxonHE9-5-1-5J.zip

Set an environment variable SAXON_JAR to point to the executable Jar file:
Set an environment variable `SAXON_JAR` to point to the executable Jar file:

export SAXON_JAR=*repository-dir*/saxon9he/saxon9he.jar

@@ -94,40 +94,50 @@ appropriately.

### Converting the complete MeSH data set

The conversion script is *mesh-xml2rdf.sh*. This shell script will run the XSLTs to convert each of
The conversion script is `bin/mesh-xml2rdf.sh`. This shell script will run the XSLTs to convert each of
the three main MeSH XML files into RDF N-Triples format, and put the results into the
*$MESHRDF_HOME/out* directory.
`$MESHRDF_HOME/out` directory.

By default, it looks for 2015 data files, and will produce *mesh.nt*, which is the
RDF in N-triples format, and *mesh.nt.gz*, a gzipped version. Also by default, these
By default, it looks for 2016 data files, and will produce `mesh.nt`, which is the
RDF in N-triples format, and `mesh.nt.gz`, a gzipped version. Also by default, these
data files will have RDF URIs that do not include the year. For example, the descriptor for
Ofloxacin would have the URI http://id.nlm.nih.gov/mesh/D015242.
Ofloxacin would have the URI `http://id.nlm.nih.gov/mesh/D015242`.

As with the fetch script, described above, you can use the MESHRDF_YEAR environment variable
to specify that it convert a different set of data files. For example:
As with the fetch script, described above, you can use the `-y` argument to
specify that it convert a different set of data files. For example:

MESHRDF_YEAR=2014 bin/mesh-xml2rdf.sh
bin/mesh-xml2rdf.sh -y 2015

This uses the 2014 data files to produce the "current" RDF output files *out/mesh.nt*
and *out/mesh.nt.gz*.
This uses the 2015 data files to produce the "current" RDF output files `out/mesh.nt`
and `out/mesh.nt.gz`.

To produce RDF data that has URIs with the year, then you should also set the
MESHRDF_URI_YEAR variable to "yes". Thus, the following uses the 2014 MeSH XML files to
generate the data that has RDF URIs that include the year:
To produce RDF data that has URIs with the year, you should also use the `-u` argument.
For example, the following generates RDF URIs that include the year:

MESHRDF_YEAR=2014 MESHRDF_URI_YEAR=yes bin/mesh-xml2rdf.sh
bin/mesh-xml2rdf.sh -y 2015 -u

In this case, the output data files will be written to *out/2014/mesh2014.nt* and
*out/2014/mesh2014.nt.gz*.
In this case, the output data files will be written to `out/2015/mesh2015.nt` and
`out/2015/mesh2015.nt.gz`.
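
The naming convention described above can be summarized in a small sketch. The helper `mesh_out_path` is hypothetical and only mirrors the documented behavior; it is not how `bin/mesh-xml2rdf.sh` is actually written. Paths are relative to `$MESHRDF_HOME`.

```shell
# Hypothetical helper: where the N-triples output is described as landing.
# $1 = MeSH year, $2 = "yes" when year-specific URIs (-u) were requested.
mesh_out_path() {
    year=$1
    uri_year=${2:-no}
    if [ "$uri_year" = "yes" ]; then
        printf '%s\n' "out/$year/mesh$year.nt"
    else
        # Without -u the "current" output file is used, whatever the year.
        printf '%s\n' "out/mesh.nt"
    fi
}

mesh_out_path 2015 yes   # prints out/2015/mesh2015.nt
mesh_out_path 2015       # prints out/mesh.nt
```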

### URI preservation and versioning

The vocabulary, `meta/vocabulary.ttl`, includes data properties used to indicate
which entities are still present in MeSH XML, and which are no longer present.
These scripts do not add those properties themselves: the N-triples files they produce
are inputs to the data processing that preserves URIs and adds these properties.

You can get N-triples files preserving URIs from
[mesh.nt.gz](ftp://ftp.nlm.nih.gov/online/mesh/mesh.nt.gz) and
[mesh.nt](ftp://ftp.nlm.nih.gov/online/mesh/mesh.nt) online.
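
To see which entities in a downloaded file carry the `lastActiveYear` property, something like the following may help. This is a minimal sketch: the function name `last_active_years` is hypothetical, and it assumes one triple per line with the year stored as a literal that contains no spaces.

```shell
# Hypothetical helper: print subject and year for every lastActiveYear
# triple in an N-Triples file. Assumes the object literal has no spaces.
last_active_years() {
    awk '$2 == "<http://id.nlm.nih.gov/mesh/vocab#lastActiveYear>" { print $1, $3 }' "$1"
}

# Usage (after downloading and unpacking mesh.nt):
# last_active_years mesh.nt | head
```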


### Generating and converting the sample files

In the *samples* subdirectory are a number of sample files that can be used for testing.
In the `samples` subdirectory are a number of sample files that can be used for testing.
The XML files here are generated from the full MeSH XML files, but are included in the
repository so that anyone can get up and running, and try things out, very easily.

The *sample-list.txt* file has the list of items from each of the three main XML
The `sample-list.txt` file has the list of items from each of the three main XML
files that provide a fairly good coverage of the variation of data found within MeSH.

These three sample files, corresponding to that list and the three main XML files,
@@ -137,14 +147,14 @@ are included in the repository:
* qual-samples.xml
* supp-samples.xml

The Perl script *make-samples.pl* can be used to regenerate these sample files from the
master XML files, extracting just those items that are listed in the *sample-list.txt*
The Perl script `make-samples.pl` can be used to regenerate these sample files from the
master XML files, extracting just those items that are listed in the `sample-list.txt`
file, if any of those change. So, keep in mind that these samples in the repository are
used for testing/demo purposes, and are not necessarily up-to-date with the latest MeSH
release.

Finally, the script *convert-samples.sh* can be used to convert the sample XML files into
RDF, the final output being *samples.nt*.
Finally, the script `convert-samples.sh` can be used to convert the sample XML files into
RDF, the final output being `samples.nt`.

***Note that the generated RDF will be missing a lot of meshv:parentTreeNumber
relationships, because those are generated from the tree node identifiers to link between
@@ -159,17 +169,19 @@ These are the subdirectories of this project -- either part of the repository, o
* *bin* - Scripts for fetching the XML and running the conversions
* *meta* - Schema for the RDF, and other documentation
* *rnc* - Relax NG Compact version of the MeSH XML file schema (experimental, not normative)
* *samples* - XML data files for testing and demo purposes, which each contain a small subset
* *samples* - XML data files and scripts for testing and demo purposes, which each contain a small subset
of the items from the real XML data files, as described above.
* *xslt* - The main XSLT processor files that convert the XML into RDF.
* *data* - An N-Triples file containing rdfs:label values for Central Nervous System diseases in 14 languages. This is included as an example.


These are the subdirectories of the $MESHRDF_HOME directory, which typically (but not necessarily)
These are the subdirectories of the `$MESHRDF_HOME` directory, which typically (but not necessarily)
is set to some separate location:

* *data* - Source MeSH XML and DTD files. These files are quite large, and change often,
* *data* - Source MeSH XML files. These files are moderately large, and change often,
so they are not part of the repository, but should be downloaded separately, as described above.
* *out* - Product RDF files, in n-triples format. The conversion scripts write these product
files here.
files here. Copies of the data files may also appear here.


## Virtuoso setup
@@ -193,7 +205,7 @@ Build:
./autogen.sh
CFLAGS="-O2 -m64"
export CFLAGS
./configure --prefix=$VIRTUOSO_HOME
./configure --prefix=$VIRTUOSO_HOME --with-readline
make
make install

@@ -205,9 +217,7 @@ Start up of server:
Shutdown of server (see [the Virtuoso
documentation](http://data-gov.tw.rpi.edu/wiki/How_to_install_virtuoso_sparql_endpoint#Manual_Shutdown)):

$VIRTUOSO_HOME/bin/isql 1111 dba <password>
SQL> shutdown();

kill -s SIGTERM `cut -d= -f2 $VIRTUOSO_HOME/virtuoso/var/lib/virtuoso/db/virtuoso.lck`
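
A slightly more defensive variant of the command above might validate the lock file before signalling. This is only a sketch: the helper `virtuoso_pid` is hypothetical, and it assumes the lock file holds a single `KEY=PID` line, which is the layout the `cut -d= -f2` call implies.

```shell
# Hypothetical helper: read the PID from a KEY=PID style lock file.
# Fails if the file is unreadable or the value is not a plain integer.
virtuoso_pid() {
    lockfile=$1
    [ -r "$lockfile" ] || return 1
    pid=$(cut -d= -f2 "$lockfile")
    case $pid in
        ''|*[!0-9]*) return 1 ;;
    esac
    printf '%s\n' "$pid"
}

# Usage: signal the server only when a valid PID was found.
# pid=$(virtuoso_pid "$VIRTUOSO_HOME/virtuoso/var/lib/virtuoso/db/virtuoso.lck") && kill -s TERM "$pid"
```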

## Technical documentation on GitHub pages

7 changes: 6 additions & 1 deletion bin/mesh-xml2rdf.sh
@@ -30,12 +30,13 @@ else
fi
cd $($READLINK -e `dirname $0`/..)

while getopts "h:j:y:u" opt; do
while getopts "h:j:y:o:u" opt; do
case $opt in
h) export MESHRDF_HOME=$OPTARG ;;
j) export SAXON_JAR=$OPTARG ;;
y) export MESHRDF_YEAR=$OPTARG ;;
u) export MESHRDF_URI_YEAR="yes" ;;
o) export OUTFILE_FORCE=$OPTARG ;;
*) echo "Usage: $0 [-h mesh-rdf-home] [-j saxon-jar-path] [-y year] [-o outfile] [-u]" 1>&2 ; exit 1 ;;
esac
done
@@ -69,6 +70,10 @@ else
URI_PREFIX="http://id.nlm.nih.gov/mesh"
fi

if [ -n "$OUTFILE_FORCE" ]; then
OUTFILE=$OUTFILE_FORCE
fi


# Do the conversions

8 changes: 4 additions & 4 deletions meta/vocabulary.ttl
@@ -310,7 +310,7 @@ dct:description rdf:type :AnnotationProperty .

rdfs:label "active" ;

dct:description "A property of MeSH objects indicating whether they still appear in the current version of the most recently released MeSH year" ;
dct:description "A property of all classes. A Boolean-typed property that indicates whether MeSH content is active in the current MeSH year." ;

rdfs:range xsd:boolean .

@@ -430,13 +430,13 @@ dct:description rdf:type :AnnotationProperty .



### http://id.nlm.nih.gov/mesh/vocab#lastActive
### http://id.nlm.nih.gov/mesh/vocab#lastActiveYear

<http://id.nlm.nih.gov/mesh/vocab#lastActiveYear> rdf:type :DatatypeProperty ;

rdfs:label "lastActiveYear" ;

dct:description "The lastActiveYear property value is the year in which the subject last appeared. If that year is still hosted, a new URI can be constructed." ;
dct:description "A property of all classes. Indicates the most recent year in which inactive MeSH content was active." ;

rdfs:range xsd:string .

@@ -458,7 +458,7 @@ dct:description rdf:type :AnnotationProperty .

rdfs:label "nlmClassificationNumber" ;

dct:description "Each MeSH Descriptor has a corresponding class number in the NLM Classification. This classification is similar to the Library of Congress Classification (LCC).";
dct:description "A property of Descriptors. Most MeSH Descriptors have a corresponding NLM Classification (the system for the organization of literature). Descriptors that lack an NLM Classification Number include those that point to more than one number, those in the Z Tree (Geographicals), and many of those in the V Tree (Publication Characteristics). This NLM classification is similar to the Library of Congress Classification (LCC)." ;

rdfs:range xsd:string .

