My recent interest in mass spectrometry (MS) open data followed logically from my earlier experiments with analyzing open metagenomics datasets. Both MS and metagenomics can be applied in an untargeted mode and with a serendipitous mindset, yielding data that frequently contains unexpected results, depending solely on the questions asked.
In the case of metagenomics I used the Kraken software ([1]) to find unexpected species in sequenced samples: old versions of invertebrate genomes deposited at NCBI RefSeq turned out to be contaminated with bacterial DNA (unpublished data at [2]); thankfully RefSeq submissions are now screened for this and it can no longer happen. In another investigation I probed human samples for specific fungi ([3]), following up on a study that suggested the fungi's involvement in Alzheimer's disease.
If the only promise of open mass spectrometry datasets were the taxonomic identification of samples, by finding taxon-specific peptides and metabolites, that would be interesting enough, for example when scanning clinical data for signals of unexpected infections. But untargeted metabolomics datasets hold much more: over 90% of their spectra come from unidentified molecules, some of which may never have been seen before. The crux is that going from an MS spectrum to a structural identification is very difficult. At the moment (2022) it cannot be done in bulk in an automated way ([4]).
Having open datasets is a prerequisite to even begin analyzing MS spectra, but even more important, especially with MS, is the information about the experiments: the metadata. It is highly structured, specialized, and voluminous, because a mass spectrometer can run many experiments in a short time. Accurate and open metadata is a necessity for analyzing the growing number of deposited datasets. This leads to our question: what is the state of Linked Open Metadata of mass spectrometry datasets?
Wikipedia summarizes the different quality levels of Open Data ([5]):
Tim Berners-Lee has suggested a 5-star scheme for grading the quality of open data on the web, for which the highest ranking is Linked Open Data:
- 1 star: data is openly available in some format.
- 2 stars: data is available in a structured format, such as Microsoft Excel file format (.xls).
- 3 stars: data is available in a non-proprietary structured format, such as Comma-separated values (.csv).
- 4 stars: data follows W3C standards, like using RDF and employing URIs.
- 5 stars: all of the other, plus links to other Linked Open Data sources.
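As a rough illustration of how this scheme turns into a rank, here is a minimal Python sketch. The `MetadataOffer` class and its field names are my own hypothetical model of what a repository offers, not part of the scheme or of any repository's API. The rule it encodes is the cumulative one from the list above: each star requires all the previous ones.

```python
from dataclasses import dataclass


@dataclass
class MetadataOffer:
    """Hypothetical description of a repository's metadata download."""
    openly_available: bool = False    # 1 star: available in some format
    structured: bool = False          # 2 stars: structured format (e.g. .xls)
    non_proprietary: bool = False     # 3 stars: open structured format (e.g. .csv)
    uses_rdf_and_uris: bool = False   # 4 stars: W3C standards such as RDF, URIs
    links_to_other_lod: bool = False  # 5 stars: links to other Linked Open Data


def star_rank(offer: MetadataOffer) -> int:
    """Count the criteria met in order, stopping at the first one not met."""
    criteria = [
        offer.openly_available,
        offer.structured,
        offer.non_proprietary,
        offer.uses_rdf_and_uris,
        offer.links_to_other_lod,
    ]
    stars = 0
    for met in criteria:
        if not met:
            break
        stars += 1
    return stars


def render(stars: int) -> str:
    """Render a rank as filled and empty stars, always five characters."""
    return "★" * stars + "☆" * (5 - stars)


# A repository that offers a CSV export of its metadata, but no RDF.
csv_export = MetadataOffer(openly_available=True, structured=True,
                           non_proprietary=True)
print(render(star_rank(csv_export)))  # prints ★★★☆☆
```

A repository without any metadata download gets zero stars under this sketch, whatever else it offers.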
In this spirit I will give a list of open MS dataset repositories, together with the star rank of their metadata that I encountered in October 2022. Study-wide metadata usually contains sample-wide and experiment-wide metadata. The following entries are in no particular order.
Study data
★★★☆☆ MetaboLights: XML available but it is unclear if this is complete, as many original datasets aren't converted to open formats.
★★★☆☆ massive.ucsd.edu: Export of limited structured study-metadata from searches. See below.
★★★☆☆ redu.ucsd.edu: More extensive and curated study-/sample-/experiment-metadata of MassIVE datasets. The caveat of unconverted datasets applies as well.
☆☆☆☆☆ metabolomicsworkbench.org: No metadata download. They require only minimal metadata on upload anyway. The caveat of unconverted datasets applies as well.
Spectrum Libraries
★★★☆☆ mona.fiehnlab.ucdavis.edu: Structured data of compound-spectrum-experiment aggregates, meaning limited metadata is copied to each spectrum/compound entry. Limited to identified spectra but includes in silico models.
☆☆☆☆☆ gmd.mpimp-golm.mpg.de: Spectra download possible, but no metadata associated with spectra.
☆☆☆☆☆ HMDB: The experimental spectra file download link gave me a 502 Bad Gateway.
☆☆☆☆☆ mzcloud: No metadata download.
☆☆☆☆☆ metlin.scripps.edu: No metadata download.
☆☆☆☆☆ metabolome-express.org: Site unreachable.
☆☆☆☆☆ mmcd.nmrfam.wisc.edu: Site unreachable.
Conclusion
There is no 5-star Linked Open Metadata of mass spectrometry datasets. MetaboLights and ReDU are clearly dedicated to providing metadata in a structured format with the datasets, and the same applies to MoNA, which does a great job of associating metadata with single spectra. The importance and the promise of having these associations is convincingly shown in the ReDU ([6]) and MoNA ([7]) papers. It remains to be seen whether it will be possible to extract full metadata from the datasets deposited at Metabolomics Workbench.
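To make the missing 4- and 5-star levels concrete, here is a minimal Python sketch of what such metadata could look like, serialized as N-Triples. Much of it is assumed for illustration: the `https://example.org/ms/` namespace, the dataset identifier `EXAMPLE0001` and the `sampleOrganism` predicate are hypothetical and belong to no repository. Only the Dublin Core `title` term and the OBO identifier `NCBITaxon_9606` (Homo sapiens) are existing vocabulary. The literal escaping is deliberately simplified and handles only backslashes and double quotes.

```python
DCT = "http://purl.org/dc/terms/"   # Dublin Core terms (existing vocabulary)
EX = "https://example.org/ms/"      # hypothetical namespace, illustration only


def iri(value: str) -> str:
    """Wrap a URI in angle brackets, as N-Triples expects."""
    return f"<{value}>"


def literal(text: str) -> str:
    """Quote a plain string literal (simplified escaping)."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def to_ntriples(triples) -> str:
    """Serialize (subject, predicate, object) tuples, one triple per line."""
    return "\n".join(f"{s} {p} {o} ." for s, p, o in triples)


# Hypothetical dataset record.
dataset = iri(EX + "dataset/EXAMPLE0001")

triples = [
    # 4 stars: the metadata is RDF and every resource is named by a URI.
    (dataset, iri(DCT + "title"),
     literal("Example untargeted metabolomics study")),
    # 5 stars: the object links out to another Linked Open Data source.
    (dataset, iri(EX + "sampleOrganism"),
     iri("http://purl.obolibrary.org/obo/NCBITaxon_9606")),
]

print(to_ntriples(triples))
```

The second triple is what separates 5 stars from 4: the sample organism is a link into a shared taxonomy that other datasets can point to as well, not a free-text string.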
2022-Oct-12
Ralf Stephan, developer and biocurator