Unsexy Science: 2022

My recent interest in open mass spectrometry (MS) data followed logically from my earlier experiments with analyzing open metagenomics datasets. Both MS and metagenomics can be applied in an untargeted mode and with a serendipitous mindset, yielding data that frequently contains unexpected results, depending solely on the questions asked.

In the case of metagenomics I used the Kraken software ([1]) to find unexpected species in sequenced samples: old versions of invertebrate genomes deposited at NCBI RefSeq turned out to be contaminated with bacterial DNA (unpublished data at [2]); thankfully RefSeq submissions are now screened for this and it can no longer happen. In another investigation I probed human samples for specific fungi ([3]), following up on a study that suggested the fungi's involvement in Alzheimer's disease.

If the promise of open mass spectrometry datasets were only the taxonomic identification of samples, through finding taxon-specific peptides and metabolites, it would be interesting enough: one could, for example, look at clinical data and scan for possible signals of unexpected infections. But untargeted metabolomics datasets hold much more. Over 90% of their spectra belong to unidentified molecules, some of which may never have been seen before. The crux is that going from an MS spectrum to a structural identification is very difficult. At the moment (2022) this is impossible to do in bulk, in an automated way ([4]).

Having open datasets is a prerequisite to even begin analyzing MS spectra, but even more important, especially with MS, is the information about the experiments: the metadata. It is highly structured, specialized, and voluminous, owing to the number of experiments that can be performed in a short time on a mass spectrometer. Accurate and open metadata is a necessity for analyzing the growing number of deposited datasets. This leads to our question: what is the state of Linked Open Metadata of mass spectrometry datasets?

Wikipedia describes the different quality levels of open data ([5]):

Tim Berners-Lee has suggested a 5-star scheme for grading the quality of open data on the web, for which the highest ranking is Linked Open Data:

1 star: data is openly available in some format.
2 stars: data is available in a structured format, such as Microsoft Excel file format (.xls).
3 stars: data is available in a non-proprietary structured format, such as Comma-separated values (.csv).
4 stars: data follows W3C standards, like using RDF and employing URIs.
5 stars: all of the other, plus links to other Linked Open Data sources.
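The cumulative nature of the scheme, where each star presupposes all the previous ones, can be sketched in a few lines of Python. This is only an illustration: the `MetadataOffer` class and its field names are hypothetical and come neither from Wikipedia nor from any of the repositories discussed here.

```python
from dataclasses import dataclass


@dataclass
class MetadataOffer:
    """Hypothetical description of how a repository exposes its metadata."""
    openly_available: bool = False    # 1 star
    structured: bool = False          # 2 stars
    non_proprietary: bool = False     # 3 stars
    uses_rdf_and_uris: bool = False   # 4 stars
    links_to_other_lod: bool = False  # 5 stars


def star_rank(offer: MetadataOffer) -> int:
    """Count stars cumulatively: a level only counts if all lower levels hold."""
    criteria = [
        offer.openly_available,
        offer.structured,
        offer.non_proprietary,
        offer.uses_rdf_and_uris,
        offer.links_to_other_lod,
    ]
    stars = 0
    for met in criteria:
        if not met:
            break
        stars += 1
    return stars


def render(stars: int) -> str:
    """Render a rank in the five-symbol notation used in the list below."""
    return "★" * stars + "☆" * (5 - stars)


# A repository offering open, structured, non-proprietary metadata without RDF
csv_export = MetadataOffer(openly_available=True, structured=True, non_proprietary=True)
print(render(star_rank(csv_export)))
```

Under this reading, a site without any metadata download gets zero stars, however good its spectra are.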

In this spirit I will give a list of open MS dataset repositories, together with the star rank of their metadata as I found it in October 2022. Study-wide metadata usually contains sample-wide and experiment-wide metadata. The following entries are in no particular order.

Study data

★★★☆☆ MetaboLights: XML available but it is unclear if this is complete, as many original datasets aren't converted to open formats.

★★★☆☆ massive.ucsd.edu: Export of limited structured study-metadata from searches. See below.

★★★☆☆ redu.ucsd.edu: More extensive and curated study-/sample-/experiment-metadata of MassIVE datasets. The caveat of unconverted datasets applies as well.

☆☆☆☆☆ metabolomicsworkbench.org: No metadata download. Only minimal metadata is required on upload anyway. The caveat of unconverted datasets applies as well.

Spectrum Libraries

★★★☆☆ mona.fiehnlab.ucdavis.edu: Structured data of compound-spectrum-experiment aggregates, meaning limited metadata is copied to each spectrum/compound entry. Limited to identified spectra but includes in silico models.

☆☆☆☆☆ gmd.mpimp-golm.mpg.de: Spectra download possible, but no metadata associated with spectra.

☆☆☆☆☆ HMDB: The experimental spectra file download link gave me a 502 Bad Gateway.

☆☆☆☆☆ mzcloud: No metadata download.

☆☆☆☆☆ metlin.scripps.edu: No metadata download.

☆☆☆☆☆ metabolome-express.org: Site unreachable.

☆☆☆☆☆ mmcd.nmrfam.wisc.edu: Site unreachable.

Conclusion

There is no 5-Star Linked Open Metadata of mass spectrometry datasets. MetaboLights and ReDU are clearly dedicated to providing metadata in a structured format with the datasets, and MoNA does a great job of associating metadata with single spectra. The importance and the promise of having these associations is convincingly shown in the ReDU ([6]) and MoNA ([7]) papers. It remains to be seen whether it will be possible to extract full metadata from the datasets deposited at Metabolomics Workbench.

2022-Oct-12

Ralf Stephan, developer and biocurator

The promise of automatic speech recognition (ASR) providing a hands-free experience is fulfilled in countless specialist PC software products, for example dictation software. Once started, dictation software does not need a wake-word to do its job, even after minutes of silence. The same holds for military speech recognition: imagine a fighter pilot needing to speak a wake-word in a high-risk situation. So why does Alexa (or, for that matter, Google Assistant and Siri) have wake-words?

But it's not only the wake-word. If you wait too long to reply to a question from Alexa, or to give a second command that follows up on a previous one, Alexa will stop listening, make a blip sound, and require you to say the wake-word again before it listens again. The time Alexa listens to your silence is fixed at eight seconds (same with Google). Isn't this annoying? The Follow-Up mode implemented in Alexa in 2021 does not remove this restriction; it just removes the requirement for a wake-word within those eight seconds.

The differences between PC or military ASR and virtual assistants are manifold:

The former run on a PC or embedded system and are used in a workplace setting, while most assistants are installed at home. Dictation software is also used by disabled people to control the PC in a home setting, but by the numbers this is a minor use.
Speech-to-text transformation in specialist ASR happens completely on the PC or device. In contrast, virtual assistants transfer speech to the Cloud for processing. This is a legal nightmare for privacy reasons. But apparently the central processing and the reduced need for software updates in devices make this system design attractive despite the legal minefields.
Alexa in particular allows skill distribution by external developers in the Amazon Skill Store, much like apps for mobile devices. However, any mobile app using the microphone has to be given explicit permission by the user, while Alexa uses the microphone by default unless it is explicitly turned off. So Amazon just assumes the worst case and avoids legal problems by making Alexa listen for strictly eight seconds in all skills.

You can probably see now why you will never play a game of chess using Alexa, sitting on your sofa in front of a big TV showing a chess board and leisurely moving pieces by saying "move pawn to e4". In that ideal game, after seeing your opponent's move you would ponder for minutes and then say "take the pawn". You could say "pause" or "quit game" after minutes, and Alexa would know what you mean.

No. It will always be like this: The opponent moves. You ponder. After eight seconds you hear a faint BLIP. Now you need to say "Alexa, open chess game and take pawn", and you need to do so for every one of your moves that takes longer than eight seconds to ponder, because the device needs to hear the wake-word to start listening again, and because the skill context has been lost.
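The behaviour described above can be written down as a toy state machine. To be clear, this is a sketch of the behaviour as I describe it in this post, not of how Alexa is actually implemented; the `ToyAssistant` class, its `hear` method, and the simple "open <skill> and <command>" parsing are all made up for illustration.

```python
from dataclasses import dataclass
from typing import Optional

LISTEN_WINDOW_S = 8.0  # the fixed listening window mentioned in the text


@dataclass
class ToyAssistant:
    """Hypothetical model: the skill context survives only inside the window."""
    wake_word: str = "alexa"
    active_skill: Optional[str] = None
    listening_until: float = float("-inf")

    def hear(self, now: float, utterance: str) -> str:
        text = utterance.lower().replace(",", "").strip()
        has_wake = text.startswith(self.wake_word + " ")
        if has_wake:
            text = text[len(self.wake_word):].strip()
        if now > self.listening_until:
            # Window expired: the skill context is gone, and without
            # the wake-word nothing is processed at all.
            self.active_skill = None
            if not has_wake:
                return "ignored"
        if text.startswith("open "):
            # Assumed phrasing: "open <skill> and <command>"
            skill, _, text = text[len("open "):].partition(" and ")
            self.active_skill = skill.strip()
        self.listening_until = now + LISTEN_WINDOW_S
        if self.active_skill is None:
            return "no skill open"
        return f"{self.active_skill}: {text.strip()}"


assistant = ToyAssistant()
print(assistant.hear(0.0, "Alexa, open chess game and move pawn to e4"))
print(assistant.hear(5.0, "take the pawn"))    # inside the window: accepted
print(assistant.hear(200.0, "take the pawn"))  # after pondering: ignored
print(assistant.hear(201.0, "Alexa, open chess game and take the pawn"))
```

The third call is the whole problem: a move spoken after minutes of thinking never reaches the skill.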

And that is very unsexy for any developer trying to provide services to the user. I wanted to write a chess skill for Alexa to help me or other physically disabled persons. But now I'm starting to think about PC-centered solutions again.

2022-10-12

Is there 5-Star Linked Open Metadata of mass spectrometry datasets?

2022-02-12

Why you won't play a relaxed game of chess with Alexa in the foreseeable future