Strings of sound can be given an interpretation in at least two ways. The usual method, by which we typically define "possible interpretation", is "strictly using the rules of the grammar of this language". Another method – the one actually used by the overwhelming majority of humans – is "using any and all evidence and methods available". The latter method is extremely useful for individuals who are attempting to communicate but do not share a common language: it allows you to be understood even when you massively violate the rules of grammar.
There are many pre-compiled strings that are used in language differently from what you would expect given their literal semantics; for example, "the pot calling the kettle black" is literally an absurdity. Moreover, the actual interpretation of such an utterance cannot be derived by rule from word meanings and the principles of compositional semantics.
The answer to your question depends on having an explicit theory of lexical meaning, syntax, and semantic composition. It presupposes that syntactic distribution is blind to semantic properties – a possible but not logically necessary position. All attempts that I am aware of to compose the meaning of strings from the meaning of the components (including abstract "thematic role" marking) are very general – they say that you can combine "P", "x" and "y" to derive the proposition "P(x,y)" as well as "P(y,x)". In such compositional theories, all well-formed sentences have a semantic interpretation. But: not all such propositions describe actually-possible states of affairs. Avicenna's defense of the law of non-contradiction ("Those who deny the first principle should be flogged or burned until they admit that it is not the same thing to be burned and not burned, or whipped and not whipped") reminds us that we can conceptualize things that do not exist, and that contradict the nature of the universe.
In other words, it depends on what you mean by "mean". If you think meaning is about actual states of affairs in the real world, then it's easy to construct grammatically well-formed strings that have no meaning because they don't describe facts. I just think that is the wrong theory of meaning.
The entire poem "Jabberwocky" is completely nonsensical, but completely grammatically correct.
- syntax - Are there different "kinds" of meaningless sentences? - Linguistics Stack Exchange
- terminology - Is there a linguistics term meaning "it's grammatically correct, but nobody says that"? - Linguistics Stack Exchange
- semantics - Semantically meaningless words - English Language & Usage Stack Exchange
- syntactic analysis - Syntactically correct, semantically incorrect sentence - English Language & Usage Stack Exchange
I think the common term would be non-idiomatic – "idiomatic" here not referring to idioms like "kick the bucket", but to the natural ways a language is spoken.
In pragmatics, if an utterance is syntactically well-formed and makes sense but cannot occur, it is called infelicitous. Unacceptability judgments are broader, since they may also reflect semantic incoherence:
- The true circle with four sides in my backyard creeps me out.
- Colorless green ideas sleep furiously.
Unacceptability judgments may also include infelicitous or ungrammatical statements (this can be problematic in poorly designed elicited-response tasks), so I do not think unacceptability is exactly the phenomenon the OP is targeting. Non-idiomatic may also capture some cases of "nobody says that", but if the utterance never occurs for pragmatic reasons, I would suggest infelicity is the linguistic property you are trying to identify.
Specifically answering your request for a lexicon of such words, this webpage includes a link to download multiple lists of function words.
"Although" appears in its list of English conjunctions. The class of function words is defined by this source as including auxiliary verbs, conjunctions, determiners, prepositions, pronouns, and quantifiers. The OP may choose to disregard some of these sub-categories as not applicable to their purpose.
People who study the branch of linguistics known as relevance theory describe words like "although" as procedural items. Words like "although" are said not to have any conceptual content; rather, they work by constraining the inferential processes of the listener. The word "although" can be thought of as cutting off further implicatures that would otherwise follow from the clause that follows it.
So we can describe such words as a) not having any conceptual content and b) being procedural items.
Many function words in English can be thought of as having no conceptual meaning.
The panda eats, shoots and leaves.
The syntax is correct: it relates an observation of a panda eating before shooting and leaving. However, the misplaced comma makes the sentence semantically incorrect, as the intended meaning is that pandas eat shoots and leaves, not that this panda was shooting. (No offense to the Kung Fu Panda, who may actually shoot.)
Noam Chomsky famously used the sentence "Colorless green ideas sleep furiously". The syntax is flawless, but it has no meaning.
Assuming that each record always pairs one semantically correct sentence with one that isn't, you can split the type0 and type1 sentences into two separate examples and classify them individually, e.g.:
id,type0,type1
0,He married to a dinosaur.,He married to a women.
1,She drinks a beer.,She drinks a banana.
2,He lifted a 500 tons.,He lifted a 50kg.
Becomes:
id,sentence
0,He married to a dinosaur.
1,He married to a women.
2,She drinks a beer.
3,She drinks a banana.
4,He lifted a 500 tons.
5,He lifted a 50kg.
However, this won't work if your data contains records where one sentence is only slightly less sensible than the other, i.e. where there is an actual need to compare the two sentences against each other.
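The reshaping above can be sketched in Python using only the standard library (the column names `type0` and `type1` follow the example data; no labels are carried over, since each flattened sentence is meant to be classified on its own):

```python
import csv
import io

# Paired data in the original wide format: one row per (type0, type1) pair.
paired = """id,type0,type1
0,He married to a dinosaur.,He married to a women.
1,She drinks a beer.,She drinks a banana.
2,He lifted a 500 tons.,He lifted a 50kg.
"""

rows = list(csv.DictReader(io.StringIO(paired)))

# Flatten: emit each sentence as its own record with a fresh id.
flat = [
    {"id": i, "sentence": sentence}
    for i, sentence in enumerate(
        s for row in rows for s in (row["type0"], row["type1"])
    )
]

for record in flat:
    print(f"{record['id']},{record['sentence']}")
```

With pandas available, the same reshape is a one-liner with `pd.melt`, but the plain-`csv` version keeps the transformation explicit.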
Maybe you can consider not only unigrams (treating each word individually as a variable) but also bigrams. This can help identify combinations of words that are nonsensical. (Remove stop words from the text first.)
So you would have variables such as "married dinosaur" or "drinks banana" instead of each word alone.
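A minimal sketch of the idea, with a toy stop-word list invented for illustration (a real pipeline would use something like scikit-learn's `CountVectorizer` with `ngram_range=(1, 2)` and `stop_words='english'`):

```python
# Toy stop-word list for illustration only; use a proper list in practice.
STOP_WORDS = {"a", "an", "the", "to", "she", "he"}

def bigrams(sentence):
    """Lowercase, strip punctuation, drop stop words, pair adjacent words."""
    words = [w.strip(".,!?").lower() for w in sentence.split()]
    words = [w for w in words if w and w not in STOP_WORDS]
    return [f"{w1} {w2}" for w1, w2 in zip(words, words[1:])]

print(bigrams("She drinks a banana."))       # → ['drinks banana']
print(bigrams("He married to a dinosaur."))  # → ['married dinosaur']
```

Bigrams like "married dinosaur" are rare or absent in ordinary corpora, which is exactly the signal a nonsense detector can exploit.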
I'd use tidytext (for R), but if you're looking for something similar in Python you could check out this:
https://github.com/michelleful/TidyTextMining-Python