rdfs:subClassOf relationships missing from MeSH RDF #153

Closed
dhimmel opened this issue Dec 4, 2020 · 9 comments


dhimmel commented Dec 4, 2020

This query returns results (online explorer):

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT *
WHERE {
  ?s rdfs:subClassOf ?p .
}

This query returns no results (online explorer):

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT *
FROM <http://id.nlm.nih.gov/mesh>
WHERE {
  ?s rdfs:subClassOf ?p .
}

The difference is that the latter query specifies FROM <http://id.nlm.nih.gov/mesh>. Using FROM <http://id.nlm.nih.gov/mesh/2020> also returns no results.

The original query run via rdflib after loading ftp://ftp.nlm.nih.gov/online/mesh/rdf/2020/mesh2020.nt also returns no results.

I think this is the same issue as #65, but it wasn't clear to me why this is or how to get rdfs:subClassOf relationships.

Thanks for the help... am new to accessing MeSH via SPARQL / RDF.

Contributor

danizen commented Dec 5, 2020

Behind the UI, we use Virtuoso, the open-source version. As you've seen, it is really a quadstore, so that it stores tuples of the form <graph, subject, property, object>. The graph with IRI http://id.nlm.nih.gov/mesh/vocab stores the vocabulary itself, which can then be used as the RDFS ruleset for the other graphs. You can tease out the graphs by adding that to your query:

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT *
WHERE {
  GRAPH ?g {  
    ?s rdfs:subClassOf ?p .
  }
}
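
As a rough illustration of the quadstore idea, here is a toy sketch in plain Python. It is not how Virtuoso works internally. The two graph IRIs come from this thread, while the sample statements and the `match` helper are made up for illustration. It shows why restricting a search to the data graph hides statements that live in the vocabulary graph.

```python
# Toy quad store: every statement is (graph, subject, predicate, object).
# The graph IRIs come from this thread; the statements are illustrative only.
MESH = "http://id.nlm.nih.gov/mesh"
VOCAB = "http://id.nlm.nih.gov/mesh/vocab"

QUADS = [
    (VOCAB, "meshv:TopicalDescriptor", "rdfs:subClassOf", "meshv:Descriptor"),
    (VOCAB, "meshv:Descriptor", "rdfs:subClassOf", "owl:Thing"),
    (MESH, "mesh:D000001", "rdf:type", "meshv:TopicalDescriptor"),
]


def match(quads, predicate, graph=None):
    """Return (graph, subject, object) for statements using `predicate`.

    graph=None searches every graph, roughly like GRAPH ?g { ... };
    passing a graph IRI restricts the search, roughly like FROM <graph>.
    """
    return [
        (g, s, o)
        for g, s, p, o in quads
        if p == predicate and (graph is None or g == graph)
    ]


everywhere = match(QUADS, "rdfs:subClassOf")
data_only = match(QUADS, "rdfs:subClassOf", graph=MESH)
print(len(everywhere), "rdfs:subClassOf statements across all graphs")
print(len(data_only), "rdfs:subClassOf statements in", MESH)
```

In this toy data, both subclass statements sit in the vocabulary graph, so the search restricted to the data graph comes back empty.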

Does this answer your questions? I'm really glad you are benefiting from it - it does not get as much manual traffic as you might think, even though there is a lot of API usage of the system.

Author

dhimmel commented Dec 7, 2020

it is really a quadstore, so that it stores tuples of the form <graph, subject, property, object>

I see! I was initially looking at https://hhs.github.io/meshrdf/descriptors and I assumed all visualized nodes were from the same graph.

So if I want to query a MeSH release with SPARQL, but serve the database locally, I would need to load both of these files from the ftp site?

  • ftp://ftp.nlm.nih.gov/online/mesh/rdf/2020/vocabulary_1.0.0.ttl
  • ftp://ftp.nlm.nih.gov/online/mesh/rdf/2020/mesh2020.nt.gz

Would it be okay to load both of these files into a single rdflib merged graph? My goal is to write queries that can access the rdfs:subClassOf relationships as well as the MeSH data.

You can tease out the graphs by adding that to your query

Ah good to know. Pasting the results from that query below as a reference:

g s p
http://www.w3.org/ns/ldp# http://www.w3.org/ns/ldp#DirectContainer http://www.w3.org/ns/ldp#Container
http://www.w3.org/ns/ldp# http://www.w3.org/ns/ldp#BasicContainer http://www.w3.org/ns/ldp#Container
http://www.w3.org/ns/ldp# http://www.w3.org/ns/ldp#IndirectContainer http://www.w3.org/ns/ldp#Container
mesh:vocab meshv:Concept owl:Thing
mesh:vocab meshv:SCR_Chemical meshv:SupplementaryConceptRecord
mesh:vocab meshv:SCR_Disease meshv:SupplementaryConceptRecord
mesh:vocab meshv:TreeNumber owl:Thing
mesh:vocab meshv:SCR_Organism meshv:SupplementaryConceptRecord
mesh:vocab meshv:AllowedDescriptorQualifierPair meshv:DescriptorQualifierPair
mesh:vocab meshv:DisallowedDescriptorQualifierPair meshv:DescriptorQualifierPair
mesh:vocab meshv:GeographicalDescriptor meshv:Descriptor
mesh:vocab meshv:PublicationType meshv:Descriptor
mesh:vocab meshv:TopicalDescriptor meshv:Descriptor
mesh:vocab meshv:CheckTag meshv:Descriptor
mesh:vocab meshv:SCR_Protocol meshv:SupplementaryConceptRecord
mesh:vocab owl:Thing owl:Thing
mesh:vocab meshv:Descriptor owl:Thing
mesh:vocab meshv:DescriptorQualifierPair owl:Thing
mesh:vocab meshv:SupplementaryConceptRecord owl:Thing
mesh:vocab meshv:Qualifier owl:Thing
mesh:vocab meshv:Term owl:Thing

Does this answer your questions? I'm really glad you are benefiting from it

Thanks! My current goal is to load MeSH into a Python networkx directed graph (using nxontology). Basically, I want a single directed acyclic graph of concepts. I'm thinking that means I want to add meshv:Descriptor and meshv:SupplementaryConceptRecord records as nodes. Feel free to point me to any complementary resources or efforts.

Contributor

danizen commented Dec 7, 2020

You can certainly do that. How you make use of the vocabulary depends a lot on how your triple store does inference, and on whether your research needs inference at all. I've used rdflib for little things, but never for the full model, and so I don't feel like I am the expert to tell you what to do.

I can however expand a bit on inference. Inference makes a property statement such as "?d a meshv:Descriptor" work. Without it, you must be very explicit, maybe using SPARQL UNION queries. So, in general, you can always rewrite queries to get around a lack of inference in a bespoke system, but it limits things if you are, for instance, implementing a question answering system.

Different triple stores do inference differently. Virtuoso uses separate graphs as a set of rules (and only does RDFS inference). Oracle Spatial and Graph calculates an "entailment", which is the full set of inferred triples; those are then loaded into another graph, and you define a union graph with some sort of aliasing. A quick web search finds https://github.com/RDFLib/OWL-RL, which does limited OWL inferencing as well as RDFS inferencing. So, that would be enough, but I'm not sure whether this is the leading way to do inferencing with rdflib, or whether you need inferencing.
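
To make the idea of an "entailment" concrete, here is a minimal sketch in plain Python. It is only an assumption about what such a computation could look like, not what Virtuoso, Oracle, or OWL-RL actually do. The subclass pairs are copied from the vocabulary graph results pasted earlier in this thread; the `inferred_types` helper is a hypothetical name. It follows rdfs:subClassOf links upward so that something asserted as a meshv:CheckTag is also treated as a meshv:Descriptor.

```python
# rdfs:subClassOf pairs taken from the mesh:vocab results earlier in this thread
# (subclass -> set of direct superclasses); only the Descriptor branch is shown.
SUBCLASS_OF = {
    "meshv:TopicalDescriptor": {"meshv:Descriptor"},
    "meshv:GeographicalDescriptor": {"meshv:Descriptor"},
    "meshv:PublicationType": {"meshv:Descriptor"},
    "meshv:CheckTag": {"meshv:Descriptor"},
    "meshv:Descriptor": {"owl:Thing"},
    "owl:Thing": {"owl:Thing"},  # the vocabulary lists this self-reference
}


def inferred_types(asserted_type, subclass_of):
    """Return the asserted class plus every class reachable via rdfs:subClassOf."""
    found = set()
    stack = [asserted_type]
    while stack:
        cls = stack.pop()
        if cls in found:
            continue  # already visited; also guards against the owl:Thing self-loop
        found.add(cls)
        stack.extend(subclass_of.get(cls, ()))
    return found


print(sorted(inferred_types("meshv:CheckTag", SUBCLASS_OF)))
```

A store with RDFS inference does this kind of expansion for you at query time or load time; without it, the expansion has to live in your queries or your own code.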

Contributor

danizen commented Dec 7, 2020

Since you are explicitly wanting to calculate the extra nodes you need to take it into a DAG system such as networkx, you can ignore the vocabulary file and create your own "entailment", adding the triples you need to make the entailment work by doing something like this:

PREFIX meshv: <http://id.nlm.nih.gov/mesh/vocab#>
SELECT ?d FROM <http://id.nlm.nih.gov/mesh>
WHERE {
  { ?d a meshv:TopicalDescriptor }
  UNION { ?d a meshv:GeographicalDescriptor }
  UNION { ?d a meshv:PublicationType }
  UNION { ?d a meshv:CheckTag }
}

Use the results to generate the new nodes you need and insert them into your graph. You can do a similar thing with other relationships you need.

I caution that networkx will certainly scale to MeSH RDF, but if you are thinking of adding something bigger such as PubChem RDF or SNOMED CT, you may want to think about a graph database such as neo4j. Using a system like that will give you hosting options if you are going beyond research to a production system.

Contributor

danizen commented Dec 7, 2020

One more comment - the reason we have our own vocabulary rather than using something like OWL is that MeSH cannot be properly represented as a DAG. You should find the motivating paper by Olivier Bodenreider before proceeding to "flatten" it into a DAG. It may of course work for a specific purpose, but our goal is to fully represent MeSH in RDF without loss of semantic richness.

Contributor

danizen commented Dec 7, 2020

I misspoke below. MeSH RDF cannot be represented as a tree, but should be able to be represented as a DAG.

One more comment - the reason we have our own vocabulary rather than using something like OWL is that MeSH cannot be properly represented as a DAG. You should find the motivating paper by Olivier Bodenreider before proceeding to "flatten" it into a DAG. It may of course work for a specific purpose, but our goal is to fully represent MeSH in RDF without loss of semantic richness.
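
The tree versus DAG distinction in the correction above can be sketched with the standard library. This is an illustrative check under my own assumptions: the node names are invented, and "tree" here loosely means every node has at most one parent. A node with two parents still forms a valid DAG, but it is no longer a tree.

```python
from graphlib import CycleError, TopologicalSorter

# Each mapping is node -> set of parents. All node names are invented.
SINGLE_PARENT = {"root": set(), "A": {"root"}, "B": {"root"}, "C": {"A"}}
MULTI_PARENT = {"root": set(), "A": {"root"}, "B": {"root"}, "C": {"A", "B"}}
CYCLIC = {"A": {"B"}, "B": {"A"}}


def is_dag(parents):
    """True if the parent relation contains no cycle."""
    try:
        TopologicalSorter(parents).prepare()  # raises CycleError on a cycle
    except CycleError:
        return False
    return True


def is_tree(parents):
    """True if acyclic and no node has more than one parent."""
    return is_dag(parents) and all(len(p) <= 1 for p in parents.values())


for name, graph in [("single", SINGLE_PARENT), ("multi", MULTI_PARENT), ("cyclic", CYCLIC)]:
    print(name, "dag:", is_dag(graph), "tree:", is_tree(graph))
```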

Author

dhimmel commented Dec 7, 2020

you can ignore the vocabulary file and create your own "entailment"

This is probably the easiest solution, since we can list all classes we're interested in. Then there are a few ways to structure the SPARQL query.
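
Two of the possible query shapes can be generated from a class list, sketched below. The function names `union_query` and `values_query` are hypothetical, and I am assuming the usual meshv prefix IRI (http://id.nlm.nih.gov/mesh/vocab#). The UNION form mirrors the query suggested above; the VALUES form is a standard SPARQL 1.1 alternative that also returns which class matched.

```python
MESHV = "http://id.nlm.nih.gov/mesh/vocab#"  # assumed prefix IRI for meshv:
DESCRIPTOR_CLASSES = [
    "TopicalDescriptor",
    "GeographicalDescriptor",
    "PublicationType",
    "CheckTag",
]


def union_query(classes):
    """Build one UNION branch per class, as in the query suggested above."""
    branches = "\n  UNION ".join(f"{{ ?d a meshv:{name} }}" for name in classes)
    return (
        f"PREFIX meshv: <{MESHV}>\n"
        "SELECT ?d\n"
        "WHERE {\n"
        f"  {branches}\n"
        "}"
    )


def values_query(classes):
    """Build a single pattern constrained by a VALUES list of classes."""
    values = " ".join(f"meshv:{name}" for name in classes)
    return (
        f"PREFIX meshv: <{MESHV}>\n"
        "SELECT ?d ?class\n"
        "WHERE {\n"
        f"  VALUES ?class {{ {values} }}\n"
        "  ?d a ?class .\n"
        "}"
    )


print(union_query(DESCRIPTOR_CLASSES))
print(values_query(DESCRIPTOR_CLASSES))
```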

We really only need two queries: one for nodes and one for relationships. But rdflib is struggling here: it runs indefinitely on queries for which https://id.nlm.nih.gov/mesh/query returns results within seconds.

So it might be nice to query a more performant database. You mentioned Virtuoso and neo4j. My main goals are SPARQL support and ease-of-setup. I like neo4j, but it probably isn't the right tool as it's not a native triplestore. I'd also be fine running our queries on the NLM Virtuoso instance, but I couldn't figure out how to access the full results when there were over 1000 results: see #150.
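
One generic way to collect more rows than a per-request cap allows is to page with LIMIT and OFFSET. The sketch below is written against a hypothetical `run_query` callable (any function that takes a SPARQL string and returns a list of rows). I have not verified that the NLM endpoint honors OFFSET this way, and #150 may describe a different mechanism, so treat this as an assumption. The `fake_endpoint` function only simulates a server so the loop can be exercised offline.

```python
import re


def fetch_all(run_query, query, page_size=1000):
    """Re-run `query` with increasing OFFSET until a short page comes back.

    `run_query` is a hypothetical callable: SPARQL string in, list of rows out.
    The query should include ORDER BY so that pages are stable between requests.
    """
    rows = []
    offset = 0
    while True:
        page = run_query(f"{query}\nLIMIT {page_size} OFFSET {offset}")
        rows.extend(page)
        if len(page) < page_size:
            return rows
        offset += page_size


def fake_endpoint(sparql):
    """Stand-in for a real endpoint: serves slices of 2500 fake rows."""
    found = re.search(r"LIMIT (\d+) OFFSET (\d+)", sparql)
    limit, offset = int(found.group(1)), int(found.group(2))
    return list(range(2500))[offset:offset + limit]


rows = fetch_all(fake_endpoint, "SELECT ?d WHERE { ?d a ?type } ORDER BY ?d")
print(len(rows), "rows collected")
```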

You should find the motivating paper by Olivier Bodenreider

Okay, the following papers look relevant. Will review:

  1. Desiderata for an authoritative Representation of MeSH in RDF
    Rainer Winnenburg, Olivier Bodenreider
    AMIA (2014-11-14) https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4419968/
    PMID: 25954433 · PMCID: PMC4419968

  2. Transforming the Medical Subject Headings into Linked Data: Creating the Authorized Version of MeSH in RDF
    Barbara Bushman, David Anderson, Gang Fu
    Journal of library metadata (2015) https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4749162/
    DOI: 10.1080/19386389.2015.1099967 · PMID: 26877832 · PMCID: PMC4749162

Author

dhimmel commented Dec 8, 2020

rdfs:subClassOf graph

Would it be okay to load both of these files into a single rdflib merged graph?

I loaded vocabulary_1.0.0.ttl into rdflib and was able to access the rdfs:subClassOf relationships. Here's a graph of all the rdfs:subClassOf relationships in the mesh vocab:

[Image: mesh-subclassof]

Also available as SVG at https://bit.ly/36W5up9.

python source & output graphviz dot

python source

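import pandas as pd
import fsspec
import rdflib
import networkx as nx
from networkx.drawing.nx_pydot import write_dot


def sparql_results_to_df(results):
    """Convert an rdflib SELECT result to a pandas DataFrame.

    This helper was not defined in the original post; this is one possible version.
    """
    columns = [str(var) for var in results.vars]
    rows = [
        [None if value is None else value.toPython() for value in row]
        for row in results
    ]
    return pd.DataFrame(rows, columns=columns)


rdf = rdflib.Graph()
# load MeSH vocabulary
url = "ftp://ftp.nlm.nih.gov/online/mesh/rdf/2020/vocabulary_1.0.0.ttl"
with fsspec.open(url, "rt") as src:
    # https://github.com/HHS/meshrdf/issues/153
    rdf.parse(source=src, format="n3")

# select every rdfs:subClassOf pair, keeping only the IRI fragment after "#"
query = '''
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?subject_suffix ?object_suffix
WHERE {
  ?subject rdfs:subClassOf ?object .
  BIND( STRAFTER(STR(?subject), "#") AS ?subject_suffix)
  BIND( STRAFTER(STR(?object), "#") AS ?object_suffix)
}
ORDER BY ?subject_suffix ?object_suffix
'''
results = rdf.query(query)
subclass_df = sparql_results_to_df(results)

# build a directed graph with edges pointing from superclass to subclass
graph = nx.DiGraph()
for row in subclass_df.itertuples():
    graph.add_edge(row.object_suffix, row.subject_suffix)

# write the graph in graphviz dot format
write_dot(graph, "mesh-subclassof.dot")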

graphviz source

# Medical Subject Headings (MeSH) Vocabulary rdfs:subClassOf graph
digraph  {
DescriptorQualifierPair;
AllowedDescriptorQualifierPair;
Descriptor;
CheckTag;
Thing;
Concept;
DisallowedDescriptorQualifierPair;
GeographicalDescriptor;
PublicationType;
Qualifier;
SupplementaryConceptRecord;
SCR_Chemical;
SCR_Disease;
SCR_Organism;
SCR_Protocol;
Term;
TopicalDescriptor;
TreeNumber;
DescriptorQualifierPair -> AllowedDescriptorQualifierPair;
DescriptorQualifierPair -> DisallowedDescriptorQualifierPair;
Descriptor -> CheckTag;
Descriptor -> GeographicalDescriptor;
Descriptor -> PublicationType;
Descriptor -> TopicalDescriptor;
Thing -> Concept;
Thing -> Descriptor;
Thing -> DescriptorQualifierPair;
Thing -> Qualifier;
Thing -> SupplementaryConceptRecord;
Thing -> Term;
Thing -> Thing;
Thing -> TreeNumber;
SupplementaryConceptRecord -> SCR_Chemical;
SupplementaryConceptRecord -> SCR_Disease;
SupplementaryConceptRecord -> SCR_Organism;
SupplementaryConceptRecord -> SCR_Protocol;
}

I am going to close this issue since my original question has been answered. But happy to continue discussion on my subsequent questions.

@dhimmel dhimmel closed this as completed Dec 8, 2020
Contributor

danizen commented Dec 8, 2020

Very cool - when they ask why we need "the software architect" maintaining this software, I may point to this discussion and ask whether they'd rather have a "principal investigator" from the group that does the science. Feel free to open an issue just to report back how it worked out.
