Welcome!

Ramblings about work on subjects uninteresting to most people.

2022-10-12

Is there 5-Star Linked Open Metadata of mass spectrometry datasets?

 My recent interest in mass spectrometry (MS) open data followed logically after I played with analyzing open metagenomics datasets. Both MS and metagenomics can be applied in an untargeted mode, and with a serendipitous mindset, yielding data that frequently contains unexpected results, solely depending on the questions asked.

In the case of metagenomics I used the Kraken software ([1]) to find unexpected species in sequenced samples: old versions of invertebrate genomes deposited at NCBI RefSeq turned out to be contaminated with bacterial DNA (unpublished data at [2]); thankfully RefSeq submissions are now screened for this and it can no longer happen. In another investigation I probed human samples for specific fungi ([3]), following up on a study that suggested the fungi's involvement in Alzheimer's disease.

If the promise of open mass spectrometry datasets were only taxonomic identification of samples, through finding taxon-specific peptides and metabolites, it would be interesting enough, for example for scanning clinical data for possible signals of unexpected infections. But untargeted metabolomics datasets hold much more: over 90% of their spectra belong to unidentified molecules, some of which might turn out never to have been seen before. The crux is that going from an MS spectrum to a structural identification is very difficult. At the moment (2022) this is impossible to do in bulk, in an automated way ([4]).

Having open datasets is a prerequisite to even begin analysis of MS spectra, but even more important, especially with MS, is the information about the experiments, the metadata. It is highly structured, specialized, and voluminous due to the number of experiments that can be performed in a short time using a mass spectrometer. Accurate and open metadata is a necessity for analyzing the growing number of deposited datasets. This leads to our question: what is the state of Linked Open Metadata of mass spectrometry datasets?

Wikipedia gives a definition of the different quality levels of Open Data ([5]):

Tim Berners-Lee has suggested a 5-star scheme for grading the quality of open data on the web, for which the highest ranking is Linked Open Data:

  • 1 star: data is openly available in some format.
  • 2 stars: data is available in a structured format, such as Microsoft Excel file format (.xls).
  • 3 stars: data is available in a non-proprietary structured format, such as Comma-separated values (.csv).
  • 4 stars: data follows W3C standards, like using RDF and employing URIs.
  • 5 stars: all of the other, plus links to other Linked Open Data sources.
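The cumulative nature of the scheme (each star requires all lower ones) can be sketched in a few lines of Python. This is only an illustration: the function and parameter names are made up here, and each criterion is reduced to a yes/no answer.

```python
def star_rank(openly_available, structured, non_proprietary,
              uses_rdf_and_uris, links_to_other_lod):
    """Return the number of stars (0-5) under the 5-star scheme.

    The criteria are cumulative: counting stops at the first one not met.
    """
    criteria = [openly_available, structured, non_proprietary,
                uses_rdf_and_uris, links_to_other_lod]
    stars = 0
    for met in criteria:
        if not met:
            break
        stars += 1
    return stars


def render(stars):
    """Render a rank as filled and empty stars, e.g. 3 -> '★★★☆☆'."""
    return "★" * stars + "☆" * (5 - stars)


# Hypothetical repository: open, structured, non-proprietary export, no RDF.
print(render(star_rank(True, True, True, False, False)))
```

A repository that offers RDF but only in a proprietary container would still stop at two stars in this sketch, because the third criterion is not met.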

In this spirit I give a list of open MS dataset repositories, together with the star rank of their metadata as I encountered it in October 2022. Study-wide metadata usually contains sample-wide and experiment-wide metadata. The following entries are in no particular order.

Study data

★★★☆☆ MetaboLights: XML available but it is unclear if this is complete, as many original datasets aren't converted to open formats.

★★★☆☆ massive.ucsd.edu: Export of limited structured study-metadata from searches. See below.

★★★☆☆ redu.ucsd.edu: More extensive and curated study-/sample-/experiment-metadata of Massive datasets. The caveat of unconverted datasets applies as well.

☆☆☆☆☆ metabolomicsworkbench.org: no metadata download. They require only minimal metadata on upload, anyway. The caveat of unconverted datasets applies as well.

Spectrum Libraries

★★★☆☆ mona.fiehnlab.ucdavis.edu: Structured data of compound-spectrum-experiment aggregates, meaning limited metadata is copied to each spectrum/compound entry. Limited to identified spectra but includes in silico models.

☆☆☆☆☆ gmd.mpimp-golm.mpg.de: spectra download possible, but no metadata associated with spectra

☆☆☆☆☆ HMDB: the experimental spectra file download link gave me a 502 Bad Gateway

☆☆☆☆☆ mzcloud: no metadata download

☆☆☆☆☆ metlin.scripps.edu: no metadata download

☆☆☆☆☆ metabolome-express.org: site unreachable

☆☆☆☆☆ mmcd.nmrfam.wisc.edu: site unreachable

Conclusion

There is no 5-Star Linked Open Metadata of mass spectrometry datasets. MetaboLights and ReDU are clearly dedicated to providing metadata in a structured format with the datasets, and the same applies to MoNA, which does a great job of associating metadata with single spectra. The importance and the promise of having these associations is convincingly shown in the ReDU ([6]) and MoNA ([7]) papers. It remains to be seen if it will be possible to extract full metadata from the datasets deposited at Metabolomics Workbench.

2022-Oct-12

Ralf Stephan, developer and biocurator

2022-02-12

Why you won't play a relaxed game of chess with Alexa in the foreseeable future

The promise of automatic speech recognition (ASR) providing a hands-free experience is fulfilled in countless specialist PC software products, for example dictation software. After starting up, dictation software does not need a wake-word to do its job, even after minutes of silence. Nor does military speech recognition; imagine a fighter pilot needing to speak a wake-word in a high-risk situation. So why do Alexa and, for that matter, Google Assistant and Siri have wake-words?

But it's not only the wake-word. If you wait too long before replying to a question from Alexa, or before giving a second command associated with a previous one, Alexa will stop listening with a blip sound and will require you to say the wake-word again before it listens again. The time Alexa listens to your silence is fixed at eight seconds (same with Google). Isn't this annoying? The Follow-Up mode implemented in Alexa in 2021 does not remove this restriction; it just removes the requirement for a wake-word within those eight seconds.

The differences between PC or military ASR and virtual assistants are manifold:

  • the former run on a PC or embedded system, and are used in a workplace setting. Most assistants are installed at home. Dictation software is also used by disabled people to control the PC in a home setting, but this is minor usage if you count the numbers.
  • speech-to-text transformation of specialist ASR happens completely in the PC/device. In contrast, virtual assistants transfer speech to the Cloud for processing. This is a legal nightmare for privacy reasons. But apparently the central processing and the reduced need for software updates in devices make this system design attractive despite the legal minefields.
  • Alexa in particular allows skill distribution by external developers in the Amazon Skill Store, much like apps for mobile devices. However, any mobile app using the microphone has to be given explicit permission by the user, while Alexa uses the microphone by default unless explicitly turned off. So Amazon just assumes the worst case and avoids legal problems by making Alexa listen for no more than eight seconds in all skills.

You can probably see now why you never will play a game of chess using Alexa, sitting on your sofa in front of a big TV showing a chess board, and leisurely moving pieces by saying "move pawn to e4". After you see the move of your opponent you ponder for minutes and say "take the pawn". And you can say "pause" or "quit game" after minutes and Alexa will know what you mean.

No. It will always be like this: The opponent moves. You ponder. After eight seconds you hear a faint BLIP. Now you need to say "Alexa, open chess game and take pawn". This happens for every one of your moves that takes longer than eight seconds to ponder, because the device needs to hear the wake-word to start listening again, and because the skill context has been lost.
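To make the annoyance concrete, here is a toy model in Python. It only counts how many moves in a game would fall outside the eight-second window; the pondering times are invented for illustration and nothing here talks to a real Alexa API.

```python
LISTEN_WINDOW_S = 8  # listening window in seconds, as described above


def moves_needing_wake_word(ponder_times_s, window_s=LISTEN_WINDOW_S):
    """Count moves where the player pondered longer than the listening window."""
    return sum(1 for t in ponder_times_s if t > window_s)


# Hypothetical pondering times (in seconds) for five moves of a relaxed game.
ponder = [5, 40, 120, 7, 300]
n = moves_needing_wake_word(ponder)
print(f"{n} of {len(ponder)} moves need the wake-word and skill name again")
```

With these made-up numbers, three of five moves would require re-opening the skill, and in a real game with minutes of thinking per move it would be nearly all of them.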

And that is very unsexy for any developer trying to provide services to the user. I wanted to write a chess skill for Alexa to help me or other physically disabled persons. But now I'm starting to think about PC-centered solutions again.

2018-02-15




This is Chess960 position #310, the prototype of the hanging-rook motif. After 1.c3 Black has only 1...Ng6 or 1...g6.
The latter leaves the knight in the corner. A possible continuation: 2.d4 Bg7 3.e4 c5 4.Ng3 cxd4 5.cxd4 O-O 6.d5 and Black is cramped. It looks like only 1...Ng6 will lead to balanced positions, after some forced moves. After 2.h4 h5 3.d4 c6 4.Ng3 Black will want to break symmetry with 4...e5. After 5.Bg5 Be7 6.Bxe7 Nxe7 the initial problems are resolved.

After 6...Nxe7
This means that there is at least one Chess960 starting position that, with best play by White, forces Black tactically to make specific moves. So if you play the Fischer chess variant you need to know these positions.
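For readers without the diagram, the starting position can be derived from its number. The sketch below assumes that #310 refers to the widely used Scharnagl numbering (0 to 959, with 518 being the classical setup); the post itself does not say which scheme is meant. Under that assumption it yields a queen on b1 and an undefended pawn on h7, which fits the moves discussed above.

```python
# Placement of knights, rooks and king on the five squares left over after
# bishops and queen are placed, indexed 0..9 (Scharnagl's table).
KNIGHT_PATTERNS = ["NNRKR", "NRNKR", "NRKNR", "NRKRN", "RNNKR",
                   "RNKNR", "RNKRN", "RKNNR", "RKNRN", "RKRNN"]


def chess960_start(number):
    """Return White's back rank (files a..h) for a Scharnagl position number."""
    if not 0 <= number <= 959:
        raise ValueError("position number must be in 0..959")
    rank = [None] * 8
    number, b1 = divmod(number, 4)
    rank[2 * b1 + 1] = "B"          # light-squared bishop: file b, d, f or h
    number, b2 = divmod(number, 4)
    rank[2 * b2] = "B"              # dark-squared bishop: file a, c, e or g
    number, q = divmod(number, 6)
    free = [i for i, piece in enumerate(rank) if piece is None]
    rank[free[q]] = "Q"             # queen on the q-th remaining square
    free = [i for i, piece in enumerate(rank) if piece is None]
    for i, piece in zip(free, KNIGHT_PATTERNS[number]):
        rank[i] = piece             # knights, rooks and king fill the rest
    return "".join(rank)


print(chess960_start(310))  # NQBRKBRN
print(chess960_start(518))  # RNBQKBNR (classical chess)
```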

2015-10-22

Survey: Sage and Enumerative Combinatorics

The project I have been helping out with for nearly two years now is Sage Math, which has 700k lines of Python code that glue about a hundred open source math software packages into one tool conglomerate. My mathematical interest has always been discrete math, and the recent developments in symbolic computation fueling the new experimental mathematics fascinate me especially. This naturally made me gravitate towards advancing symbolics in Sage which, I regret to say, is in a poor (unsexy) state because most main developers of Sage are interested in abstract algebra and number theory, and less in enumerative combinatorics, experimental symbolics, or even calculus.

Experimental symbolics is very successful in discrete math, especially enumerative combinatorics. The natural question arises: how far can Sage help with this branch of symbolics? To this end I present a table of the respective mathematical objects and algorithms, and the support Sage has for them. I'm leaning heavily on the recent summary of computer algebra relevant for enumerative combinatorics by Manuel Kauers (published in Bóna's new Handbook of Enumerative Combinatorics).

Sage capability survey (Fall 2015)
Computation in/with: status and comments
Finite fields: Documentation
Lattice reduction: Documentation
Multivariate polynomials: Documentation
Gröbner bases: Documentation
Algebraic number arithmetic: Documentation
Cylindrical Algebraic Decomposition: Documentation (from Sage version 6.10.p2 up)
Formal power series: two implementations, a fast one missing most symbolic function expansions, and a slower one with function expansions that is neglected and has many bugs. The two do not interoperate. Documentation1, Documentation2
Lazy power series: rudimentary. Documentation
Laurent series: only univariate available
Puiseux series
Ore algebras: optional package ore-algebra
C-finite sequences: Documentation
D-finite sequences
Combinatorial species: Documentation
Omega analysis (partitions)
Ehrhart theory: incomplete, in progress
Computational group theory: available via GAP
Symbolic summation, Gosper's algorithm: part of sum(), available via Maxima
Zeilberger's algorithm: part of sum(), available via Maxima
Petkovšek's algorithm
Karr's algorithm
Creative telescoping
ΠΣ-theory
Holonomic functions


2013-09-21

Random 100 sequences from the OEIS---a survey.

Summary: there were 16 holonomic, 16 prime, 11 digital, 7 constants, 4 arbitrary, 28 number theoretic, 13 combinatorial, four group theoretical, and one physics sequences in a random sample of 100 sequences from the OEIS.

The field I feel most at home in is mathematics, and I think my most successful work is associated with the OEIS database of integer sequences, which sparked all my papers so far. To get an impression of what types of OEIS entries there are, I decided to work on a random sample of one hundred of them and try to classify them.

So, let's get a random sample. Welcome to a hundred random numbers between 1 and 229000:
? for(i=1,100,print1(random(229000)+1,","))
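For those without PARI/GP at hand, the same idea can be sketched in Python. This is not the command used for the survey: it draws without replacement, so no entry can appear twice, and it formats the numbers as A-numbers. The function name and the upper bound as a parameter are choices made here for illustration.

```python
import random


def random_a_numbers(count=100, highest=229000, seed=None):
    """Draw `count` distinct OEIS A-numbers from A000001..A<highest>."""
    rng = random.Random(seed)  # pass a seed to make the sample reproducible
    picks = rng.sample(range(1, highest + 1), count)
    return [f"A{n:06d}" for n in sorted(picks)]


print(",".join(random_a_numbers()))
```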
  • First, there are polynomials in n, linear recurrences with constant coefficients (or lin-recs as the editors frequently call them), and other holonomic sequences. This is basic stuff, although not completely uninteresting. Many sequences that look really interesting at first later turn out to be of this type: A004315, A005056, A009671, A012845, A013081, A029920, A070358, A107396, A109794, A132200, A133886, A135493, A140405, A175485, A193931, A213036
  • Then, the sequences involving primes. In my personal opinion most such sequences are random (no formulae possible), and you can't say much about them in terms of conjectures, although they may not be unimportant to have in the database: A003631, A007996, A013637, A022465, A045467, A066520, A086762, A088592, A090725, A100669, A105998, A118812, A120853, A122413, A142247, A188754
  • Sequences involving decimal and other digits: A034967, A037914, A053974, A061958, A075009, A092995, A095827, A102120, A117860, A141063, A209859
  • A certain number of OEIS entries are decimal expansions of constants. The justification for including them is the benefit for inverse calculations, and as a place to collect statements and references about the respective constant: A088543, A153205, A154167, A196505, A196758, A198565, A201848
  • Some sequences are so arbitrary that, although they could be interesting, it would be better to look at a definition or formula with small constants first and generalize from that. If the submitter gives no reason for the importance of such an arbitrary sequence, it is most likely unimportant. I found the following that fit this description: A030835, A040566, A152339, A182771
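As a reminder of what a lin-rec from the first group looks like, here is a minimal Python sketch that generates terms of a linear recurrence with constant coefficients. The function name is invented here, and the Fibonacci numbers serve only as a familiar example; they are not part of the sample.

```python
def linear_recurrence(coeffs, initial, n_terms):
    """Terms of a(n) = coeffs[0]*a(n-1) + coeffs[1]*a(n-2) + ...

    `initial` must supply at least len(coeffs) starting values.
    """
    if len(initial) < len(coeffs):
        raise ValueError("need at least as many initial terms as coefficients")
    terms = list(initial[:n_terms])
    while len(terms) < n_terms:
        # Combine the most recent terms, newest first, with the coefficients.
        terms.append(sum(c * terms[-1 - i] for i, c in enumerate(coeffs)))
    return terms


# Fibonacci numbers: a(n) = a(n-1) + a(n-2), a(0) = 0, a(1) = 1.
print(linear_recurrence([1, 1], [0, 1], 10))  # [0, 1, 1, 2, 3, 5, 8, 13, 21, 34]
```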
Now, the rest is what many OEIS editors agree to be interesting.
The really interesting sequences can be divided according to the field of mathematics they arise in, so let me list them so grouped. From here I will give one-liner definitions and make them clickable.

Number theory
A002547 Numerator of {n-th harmonic number H(n) divided by (n+1)}.
A004618 Divisible only by primes congruent to 4 mod 5.
A033831 Number of d dividing n such that d>=3 and 1<=n/d<=d-2.
A049384 a(0)=1, a(n+1) = (n+1)^a(n).
A060553 Symmetric patterns in the cellular automaton that generates Pascal's triangle modulo 2.
A064031 Product of non-unitary divisors of n!. 
A081474 Distinct lines through the origin in n-dimensional cube of side length n.  
A088138 Generalized Gaussian Fibonacci integers.
A088303 Smallest integer value of n!/ ( 1!a!b!c!...) ...
A089552 Sum of legs of primitive Pythagorean triangles having legs that add up to a square, sorted on hypotenuse.
A094234 Period of terms in continued fraction expansion of 2^n*tanh(1).
A117658 Number of solutions to x^(k+1)=x^k mod n for some k>=1.
A120615 sum(k=0,n,floor(phi*floor(n/phi))) where phi=(1+sqrt(5))/2.
A139799 n>=2 such that there is an integer k>1 with k divides n and k divides (n/k)+1.
A140418 Position of cubes in the EKG sequence.
A141321 Special sum of divisors of n.
A152066 Coefficients of certain polynomials.
A160394 Numbers n = p*q*r (p, q, r prime) congruent to 0 mod p+q+r.
A172819 Number of n X 9 0..4 arrays with row sums 9 and column sums n.
A173931 Primitive numbers k such that m/k is in the Cantor set for some m. 
A178272 Number of collinear point 7-tuples in an n X n .. X n 4-dimensional cubical grid.
A178535 Matrix inverse of A178534.
A185383 Denominator of the fraction |n^2/A049417(n)-A064380(n)|.
A189675 Composition of Catalan and Fibonacci numbers.
A200521 Numbers n such that omega(n)=4 but bigomega(n)>4.
A218335 Even numbers n such that the largest value in trajectory of n under the juggler map is greater than n.
A227128 The twisted Euler phi-function for the non-principal Dirichlet character mod 3.
A227434 Value of row n in Pascal's triangle mod 3 seen as ternary number.

Enumerative combinatorics
A028461 Number of perfect matchings in graph P_{3} X C_{4} X P_{n}.
A057545 Maximum cycle size in range...
A124419 Number of partitions of the set {1,2,...n} having no blocks that contain both odd and even entries.
A135493 Number of ways to toss a coin n times and not get a run of six.
A149516 Number of walks within N^3 (the first octant of Z^3) starting...
A183882 Number of arrangements of n+2 numbers in 0..7 with ...
A185334 Number of not necessarily connected 3-regular simple graphs on 2n vertices with girth at least 4.
A186764 Permutations of {1,2,...,n} having k increasing even cycles.
A207224 Number of nX4 0..2 arrays avoiding the patterns ...
A208545 Number of 7-bead necklaces of n colors allowing reversal, with no adjacent beads having the same color.
A211359 Noncrossing partitions up to rotation and reflection of an n-set that contain k singleton blocks.
A214130 Partitions of n into parts congruent to +-2, +-3 (mod 13).
A227189 (k+1)-th part of the unordered partition which has been encoded in the binary expansion of n.

Group theory
A019537 Number of special orbits for dihedral group of degree n.
A057743 Maximal order of element of alternating group A_{2n+1}.
A170263 Number of reduced words of length n in Coxeter group on 14 generators
A214464 Degrees of irreducible representations of Suzuki group Sz(32).

Mathematical physics
A008199 Coordination sequence T4 for Zeolite Code MTW.

So, now you have a pretty good overview of what kind of OEIS entries exist, and what OEIS editors think are interesting submissions. If such pearls as the above can be found in a random sample of 100, what treasures might lurk there in the whole thing? Look for yourself!

2011-09-03

Methionine

This story has three parts: Met salvage, catabolism, and urology. And it spans three decades of missing research.

L-Methionine (Met) is an essential amino acid. Its use is to take part in Met-RNA and protein biosynthesis, and the synthesis of S-Adenosylmethionine (SAM). In all cases it is recycled. Even when SAM is used to produce polyamines, the sulfur is recycled to Met via the Met salvage pathway. However, if you take a Met overdose -- say 1 or 2 grams orally -- the excess doesn't show in the blood for long, and is degraded or changed quickly. It appears to be well known[1] that this excess leads to an excess of sulfate which is excreted with urine. Around 1985, at least two reactions were hypothesized for excess Met -- transamination to 4-methylthio-2-oxobutanoate (MOB) and transmethylation-transsulfuration via SAM, homocysteine and cystathionine -- with inconclusive results on which is the main path[2]. The transamination reaction to MOB certainly plays a role[3], but where the sulfate comes from quantitatively (MOB or cystathionine) is still unclear, as is the whole regulation issue in such a tightly regulated system. Possibly the location, cytosol or mitochondria, makes a difference. Meanwhile, a review elucidated the cysteine catabolic branch[4]. So, a complete characterization of the Met-catabolic pathway via transamination -- or the proof of it being irrelevant -- awaits the trophy-hungry lab rat.

Additionally, in the Met salvage pathway, we don't know exactly which human gene produces the necessary methylthioribulose 1-phosphate dehydratase activity (EC 4.2.1.109). From homology to yeast, it might be APIP but the human activity was never shown. And finally, while transamination to and from Met is proven, which of the many transaminases has that broad specificity to also take on Met? Our guess is that it's the GGT, but no one has bothered to test it for decades.

Finally, the sulfate excretion accounts for the acidification potential of Met[5]; according to my urologist, Met is the only compound with that effect in humans. There may also be ammonium chloride (ref?). Okay, there is the n=60 study[6] showing diluted vinegar to be effective in urinary tract infection (UTI), but would you drink it daily to prevent infections? Surprisingly, although the beneficial effect of low pH urine for UTI prevention is beyond doubt, there is no clinical study using Met for this. It would be so easy, as the pH test strips and Met itself are inexpensive, so please someone take up this piece of Unsexy Science!

Refs:
1. Mudd, S. H., and H. L. Levy. 1983. Disorders of Transsulfuration. In: The Metabolic Basis of Inherited Disease. 5th edition. J. B. Stanbury, J. B. Wyngaarden, D. S. Fredrickson, J. L. Goldstein, and M. S. Brown, editors. McGraw-Hill Book Co., Inc., New York. 522-559. (unchecked)
2. J. D. Finkelstein, J. J. Martin: Methionine metabolism in mammals. Adaptation to methionine excess. In: J biol chem 261, 4, 1986, 1582–1587. PMID 3080429.
3. W. A. Gahl, I. Bernardini et al.: Transsulfuration in an adult with hepatic methionine adenosyltransferase deficiency. In: J clin. invest. 81, 2, 1988, 390–397. doi:10.1172/JCI113331. PMID 3339126. PMC 329581.
4. M. H. Stipanuk, I. Ueki: Dealing with methionine/homocysteine sulfur: cysteine metabolism to taurine and inorganic sulfur. In: Journal of inherited metabolic disease 34, 1, 2011, 17–32. doi:10.1007/s10545-009-9006-9. PMID 20162368. PMC 290177. (Review)
5. D. L. Bella, M. H. Stipanuk: Effects of protein, methionine, or chloride on acid-base balance and on cysteine catabolism. In: Am J phys 269, 5 Pt 1, 1995, E910–E917. PMID 7491943.
6. Y. C. Chung, H. H. Chen, M. L. Yeh: Vinegar for Decreasing Catheter-Associated Bacteriuria in Long-Term Catheterized Patients: A Randomized Controlled Trial. In: Biological research for nursing epub 2011. doi:10.1177/1099800411412767. PMID 21708892.

The case of the one hand clapping

Fatty acid synthesis happens alike in all organisms. Like on an assembly line, parts are hung onto a template until it grows into a long chain. The template is fixed to a bench, the ACP protein domain, and half a dozen enzymes work around it, with recurring activity, until the required length is reached. In one of the steps an acyl moiety is fused to a malonyl moiety, and the chain is thereby elongated. Imagine my surprise when I found the reaction depicted everywhere as

acyl-ACP + malonyl-ACP = 3-oxoacyl-ACP + CO2 + ACP         [3]

Twice ACP? That would be fine in mitochondria or bacteria, as there the ACP domain is on a separate protein and, well, let's just take two of them. But in the cytosol of animals all enzymatic and ACP domains are on a single enzyme, the fatty acid synthase (FAS). Now, this FAS is a dimer in nature, which could account for the second ACP. Theoretically. We learn from the literature[1] that both monomers are sandwiched in a way that puts the two ACP domains far apart. Moreover, it is known[2] that the dimer can contain only one phosphopantetheine (PPT), and this also means only one usable ACP domain.

Well, I would say one of the ACPs in the reaction actually is CoA in the cytosol of animals, but who is inclined to show it experimentally? Certainly not the pharma industry. Mostly known physiology is a boring subject; nothing wholly surprising or lucrative is to be expected. It's all Unsexy Science!

Ref.:
1. A. Witkowski, V. S. Rangan et al.: Structural organization of the multifunctional animal fatty-acid synthase. In: European journal of biochemistry / FEBS 198, Nr 3, June 1991, 571–579. PMID 2050137
2. A. Jayakumar, M. H. Tai et al.: Human fatty acid synthase: properties and molecular cloning. In: Proceedings of the National Academy of Sciences of the United States of America 92, Nr 19, September 1995, 8695–8699. PMID 7567999. PMC 41033
3. IUBMB Enzyme Nomenclature, EC 2.3.1.41 Website