Posted by Fabian Offert on February 22, 2017.

Vector space models are mathematical models that make it possible to represent *multiple* complex objects as *commensurable* entities. They became widely used in information retrieval in the mid-1970s^{1}, and subsequently found their way into the digital humanities. This development is not surprising, given that the above definition, applied to literary texts, is very much a description of distant reading^{2} in its most pragmatic interpretation. There is no doubt that vector space models work well, not only as a tool for distant reading, but also as a tool for more general natural language processing and machine learning tasks. Consequently, however, the justification of their use is often suspiciously circular.

In this post I will try to trace the way in which vector space models generate knowledge, for the particular case of the digital humanities. I will try to answer the question: what is the price of the commensurability that vector space models provide? In other words: if we use a vector space model to compare two or more complex aesthetic objects, what are the implicit epistemological assumptions enabling the commensurability of these objects?

In the digital humanities, quantitative methods are used to answer qualitative questions. The first, general implicit epistemological assumption is thus that intuitive concepts are rational concepts, i.e. that intuition is Cartesian intuition. The second, more concrete implicit epistemological assumption is that intuitive concepts can be *sufficiently* modeled formally, or, in Frieder Nake's words, that "the computer is a means of production enabling the mechanization of mental work"^{4}.

This *strong algorithmic modeling assumption* has a long tradition in computer science, starting with the arguably most important appeal to intuition of all time: the Church-Turing thesis. In the digital humanities, the strong algorithmic modeling assumption is exactly the implicit application of Turing's method^{5} to texts. Some intuitive property of a text, it posits, can be modeled, like the Turing machine, after a human "computor"^{6} (or, in Post's terms, a "worker"^{7}) counting and comparing words and "bags of words", all in one intuitive glance. I would even argue that Franco Moretti's notion of "operationalizing" is not only derived from Bridgman's "operational point of view", but also implicitly acknowledges the legacy of Turing, who closes the most famous passage in his 1936 paper ("We may compare a man in the process of computing a real number to a machine [...]") with the sentence: "It is my contention that these operations include all those which are used in the computation of a number."^{8}

The *generality* of the strong algorithmic modeling assumption, however, hides the fact that, for the specific field of the digital humanities, it has some specific implications that make it more difficult to just waive it and move on to more productive work. This becomes apparent in a close reading of two popular methods of distant reading: cosine similarity and automated analogical reasoning with word embedding models.

Cosine similarity is a way of estimating the similarity of documents by representing them in high-dimensional vector space. Concretely, it is a measurement of the angle between two vectors: the smaller their angle, the larger their cosine similarity. Vectors with the same *orientation* have a cosine similarity of \(1\), orthogonal vectors have a cosine similarity of \(0\), and vectors pointing in opposite directions have a cosine similarity of \(-1\). To measure the cosine similarity of two documents, we represent them as vectors within a high-dimensional vector space where each dimension corresponds to one word in the two documents' common vocabulary.
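The three reference cases named above can be checked with a minimal sketch. The function name `cosine_similarity` and the two-dimensional example vectors are illustrative assumptions, not taken from any particular library; the computation itself is the standard dot product divided by the product of the vector lengths.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        # The angle to a zero vector is undefined.
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_u * norm_v)

# Hypothetical example vectors, chosen only to illustrate the three cases.
same_orientation = cosine_similarity([1, 2], [2, 4])   # approximately 1
orthogonal = cosine_similarity([1, 0], [0, 1])         # 0
opposite = cosine_similarity([1, 2], [-1, -2])         # approximately -1
```

Note that the vector `[2, 4]` is twice as long as `[1, 2]`, yet the similarity is still (up to floating point rounding) 1: only the orientation matters, not the length.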

Cosine similarity corresponds very well to our intuitive concept of similarity. More precisely: if we were asked to quantify our intuition regarding the similarity of two short sentences on a scale \(-1 \leq i \leq 1\), the result would very likely be in the general neighborhood of the cosine similarity of their vectors.

This is surprising, as there are at least two properties of our intuitive concept of similarity that seem to prohibit such a correspondence. First, our intuitive concept of similarity seems to be independent of word order. We recognize the "subject matter" to be similar even if a sentence is reversed. Second, it also seems to be immune to syntactically insignificant but semantically significant changes. We recognize the "subject matter" to be similar even if a sentence is negated.

In other words, our intuitive concept of similarity is an inherently *fuzzy* concept. How, then, is it possible that cosine similarity -- clearly *not* a fuzzy concept -- seems to sufficiently model it? The answer is simple: it doesn't. What models our intuitive concept of similarity is the bag-of-words model implicit in the particular form of vectorization *preceding* the cosine similarity computation. As so often in computer science^{9}, cosine similarity just happens to be the easiest way of comparing the direction of vectors in general, completely independent of the properties of the objects they represent.
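A small sketch can make visible where the tolerance to word order and negation comes from. Everything here is an illustrative assumption: the whitespace tokenizer, the helper names `bag_of_words` and `cosine_similarity`, and the example sentences. The point is only that the reordering is already erased by the vectorization step, before any angle is computed.

```python
import math
from collections import Counter

def bag_of_words(sentence):
    """Count word occurrences; word order is discarded at this step."""
    # Hypothetical minimal tokenizer: lowercase and split on whitespace.
    return Counter(sentence.lower().split())

def cosine_similarity(bag_a, bag_b):
    """Cosine similarity of two word-count vectors over their joint vocabulary."""
    vocabulary = set(bag_a) | set(bag_b)
    # A Counter returns 0 for words it has not seen.
    dot = sum(bag_a[word] * bag_b[word] for word in vocabulary)
    norm_a = math.sqrt(sum(count * count for count in bag_a.values()))
    norm_b = math.sqrt(sum(count * count for count in bag_b.values()))
    return dot / (norm_a * norm_b)

original = bag_of_words("the dog chased the cat")
reordered = bag_of_words("cat the chased dog the")
negated = bag_of_words("the dog never chased the cat")
unrelated = bag_of_words("stocks fell sharply on monday")

sim_reordered = cosine_similarity(original, reordered)  # identical bags
sim_negated = cosine_similarity(original, negated)      # one extra word
sim_unrelated = cosine_similarity(original, unrelated)  # no shared words
```

The reordered sentence produces exactly the same bag as the original, so its similarity is maximal regardless of which vector comparison is used afterwards. The negated sentence differs by a single count and stays close to the original, while the sentence without shared vocabulary is orthogonal to it.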

We see that the strong modeling assumption gives rise to a more fine-grained, and, more importantly, *false* assumption, which I will call the assumption of bijection. The assumption of bijection states that, to arrive at a formal concept from an intuitive concept, we simply "translate" the intuitive concept, i.e. that there exists a *direct one-to-one correspondence* between the set of intuitive concepts and the set of formal concepts. As we have seen, however, this is not necessarily true: by assuming that cosine similarity sufficiently models intuitive similarity, we ignore the fact that it is the bag-of-words model that sufficiently models intuitive similarity, while cosine similarity -- not surprisingly -- only sufficiently models general vector similarity.

This intermediate step, however, transcends the analogy of translation, as there exists no language in between two natural languages. More precisely: contrary to the assumption of bijection, to arrive at a formal concept from an intuitive concept, we have to pass through one or multiple *intermediate*, *counter-intuitive* spaces, hidden in plain sight.

This becomes even more apparent if we consider another intuitive concept and its "translation", the concept of analogy itself.

With the recent comeback of machine learning in general and neural networks in particular, new technologies have emerged that enable much more complex investigations into texts. Among them are word embedding models^{10}, vector space models which, unlike the bag-of-words vectors compared by cosine similarity, preserve word order, or, more precisely, word contexts up to a certain window size.

The most frequently used of these models, word2vec, uses a shallow neural network to construct a high-dimensional vector space that reflects not only syntactic but also semantic properties of the source corpus^{11}. This is possible because, instead of using \(n\)-dimensional vectors to represent specific n-grams in relation to a vocabulary of \(n\) n-grams, neural networks use smaller, real-valued feature vectors "tuned" over several iterations according to a loss function. Most prominently, word2vec is able to answer analogy queries, like "what is to woman what king is to man" ("queen", of course; this used to be a hard problem to solve computationally, though)^{12}.

However, the solution to the analogy query is given by the algorithm not as a definite answer, but as a hierarchy of existing data points. Why? Simply because there are no "intermediate" words. If the best possible analogy is a (new) data point right in between two (existing) data points representing words in the source corpus vocabulary, the best possible analogy is neither of them, but still can only be described in terms of them. Even if the input vocabulary consisted of all words in the English language, the vector operations that solve the analogy query could still produce a data point that is "in between everything".
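The gap between the computed point and the reported words can be illustrated with a toy sketch. The three-dimensional vectors below are hand-made and purely hypothetical (real word2vec vectors are learned from a corpus and typically have hundreds of dimensions), and the function name `analogy` is an assumption. The offset arithmetic and the ranking by cosine similarity follow the common way such queries are answered.

```python
import math

# Hypothetical, hand-made "embeddings" for illustration only.
embeddings = {
    "man":    [0.10, 0.90, 0.30],
    "woman":  [0.10, 0.10, 0.30],
    "king":   [0.90, 0.90, 0.20],
    "queen":  [0.90, 0.10, 0.25],
    "prince": [0.70, 0.90, 0.60],
    "apple":  [0.05, 0.50, 0.90],
}

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def analogy(a, b, c, vectors):
    """Answer 'a is to b what c is to ?' as a ranking of vocabulary words."""
    # The actual solution is a new point in the space: b - a + c.
    target = [vb - va + vc
              for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    # It can only be reported in terms of existing points, ranked by closeness.
    candidates = [word for word in vectors if word not in (a, b, c)]
    ranking = sorted(candidates,
                     key=lambda word: cosine_similarity(target, vectors[word]),
                     reverse=True)
    return target, ranking

target, ranking = analogy("man", "king", "woman", embeddings)
# ranking starts with "queen" for these toy vectors, but target itself
# does not coincide with the vector of any word in the vocabulary.
```

The function returns both the computed point and the ranking precisely to keep the two apart: `target` is the vector space solution, while `ranking` is its description in terms of the vocabulary.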

Every computational solution to an analogy task is thus, ironically, itself an analogy -- an analogy which, in a peculiar reversal of the usual narrative of violent quantification^{13}, is a violent "qualification" of the machine's solution^{14}. This also means: our interpretation of any results produced by word2vec is actually an interpretation of an interpretation.

In other words: we, again, encounter the assumption of bijection, this time, however, on the "other end" of the process of quantification. We assume that the solution to the analogy query is the hierarchy of references to existing high-dimensional data points presented to us by the algorithm, while in fact this hierarchy is a *visualization* of the underlying vector space solution.

Why a "visualization"? High-dimensional vector spaces are *geometrically counter-intuitive* spaces. Not only is it impossible to imagine a vector in eleven dimensions, high-dimensional vector spaces have counter-intuitive properties as well: this is what is known as the "curse of dimensionality". One of the core problems of higher dimensions is that "under certain broad conditions [...], as dimensionality increases, the distance to the nearest neighbor approaches the distance to the farthest neighbor."^{15}

In other words: in high-dimensional vector space, distances between data points have a tendency to become *illegible*: they lose, or at least significantly change their meaning. Thus, they become inaccessible to intuition.
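This effect can be observed in a small simulation. The sketch below rests on assumptions that are not part of the argument above: points are drawn uniformly at random from the unit hypercube, distances are Euclidean, and the function name, sample size, and seed are arbitrary choices. It computes the ratio of the nearest to the farthest distance from a query point, which moves towards 1 as the number of dimensions grows.

```python
import math
import random

def nearest_to_farthest_ratio(dimensions, n_points=200, seed=0):
    """Ratio of the smallest to the largest distance from a random query point."""
    rng = random.Random(seed)  # fixed seed so that runs are repeatable
    query = [rng.random() for _ in range(dimensions)]
    distances = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dimensions)]
        distances.append(math.dist(query, point))  # Euclidean distance
    return min(distances) / max(distances)

# A ratio near 0 means "nearest" and "farthest" are clearly distinguishable;
# a ratio near 1 means they have become almost the same distance.
ratios = {d: nearest_to_farthest_ratio(d) for d in (2, 10, 100, 1000)}
```

`math.dist` requires Python 3.8 or later. Under these assumptions the ratio is small in two dimensions and close to 1 in a thousand dimensions, which is the sense in which distances lose their discriminating power.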

More generally, for us as humans, the only *geometrically intuitive* space is Euclidean space. We simply do not have the mental capabilities to think *spatially* about anything above \(n=3\) (or \(n=4\) if we include time). Consequently, to be able to think spatially about a space with \(n > 3\) dimensions, we have to find a way to map it to a space with \(n \leq 3\) dimensions, or, in other words, *visualize* it.
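What such a mapping costs can be shown with the simplest possible case. The sketch below assumes the crudest mapping available, dropping all but the first two coordinates, and uses two hypothetical five-dimensional points; actual visualization techniques are more sophisticated, but they face the same constraint that not all distances can survive the reduction.

```python
import math

def project_to_plane(vector):
    """Keep only the first two coordinates (an orthogonal projection)."""
    return vector[:2]

# Two hypothetical points that differ only in the dimensions we cannot see.
a = [1.0, 2.0, 0.0, 0.0, 0.0]
b = [1.0, 2.0, 3.0, 4.0, 12.0]

distance_in_5d = math.dist(a, b)
distance_in_2d = math.dist(project_to_plane(a), project_to_plane(b))
# The points are 13 units apart in five dimensions,
# but they land on exactly the same spot in the plane.
```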

This means: when word2vec produces a *legible* (i.e. *visible*) answer to an analogy query, it actually produces two visualizations: a *numerical* visualization, and a *geometric* visualization. The numerical visualization is the hierarchy of references to existing high-dimensional data points presented to us by the algorithm. The geometric visualization is the "collapsed"^{16} high-dimensional space, presented to us as the computed real (i.e. floating point) values of these data points.

We have seen that there exists a more fine-grained implicit epistemological assumption, derived from the general strong algorithmic modeling assumption, that influences the way in which we perceive the generation of knowledge through quantitative methods in the digital humanities. While, obviously, the presence of this implicit epistemological assumption does not necessarily undermine the validity of results derived from applications of cosine similarity and word2vec, it *does* put in doubt how far we are able to intuitively read these results, and hence whether we are really doing what we think we are doing when we interpret them.

^{1} For the very interesting history of VSMs, see David Dubin, “The Most Influential Paper Gerard Salton Never Wrote,” *Library Trends* 52, no. 4 (2004): 748.

^{2} Franco Moretti, *Distant Reading* (New York, NY: Verso Books, 2013).

^{3} Gerard Salton, Anita Wong, and Chung-Shu Yang, “A Vector Space Model for Automatic Indexing,” *Communications of the ACM* 18, no. 11 (1975): 613–20.

^{4} "Der Computer ist jenes Produktionsmittel, das zur Maschinisierung von Kopfarbeit entwickelt und eingesetzt wird." Frieder Nake, *Ästhetik Als Informationsverarbeitung. Grundlagen Und Anwendung Der Informatik Im Bereich ästhetischer Produktion Und Kritik* (Springer, 1974).

^{5} Alan Mathison Turing, “Systems of Logic Based on Ordinals,” *Proceedings of the London Mathematical Society* 2, no. 1 (1939): 161–228.

^{6} A term introduced in Robin Gandy, “The Confluence of Ideas in 1936,” in *The Universal Turing Machine. A Half-Century Survey*, ed. Rolf Herken (Hamburg: Kammerer und Unverzagt, 1988) to distinguish the computing human from the computing machine.

^{7} Emil L. Post, “Finite Combinatory Processes. Formulations I,” in *The Undecidable. Basic Papers on Undecidable Propositions, Unsolvable Problems and Computable Functions*, ed. Martin Davis (Mineola, NY: Dover, 1965).

^{8} Turing, “Systems of Logic Based on Ordinals.”

^{9} As Claude Shannon famously states: "Frequently the messages have meaning; that is they refer to or are correlated according to some system with certain physical or conceptual entities. These semantic aspects of communication are irrelevant to the engineering problem." Claude Elwood Shannon, “A Mathematical Theory of Communication,” *The Bell System Technical Journal* 27 (1948): 379–423, 623–56.

^{10} Tomas Mikolov et al., “Distributed Representations of Words and Phrases and Their Compositionality,” in *Advances in Neural Information Processing Systems*, 2013, 3111–9, http://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality; Tomas Mikolov et al., “Efficient Estimation of Word Representations in Vector Space,” *arXiv Preprint arXiv:1301.3781*, 2013, https://arxiv.org/abs/1301.3781.

^{11} I have written elsewhere about the interesting possibilities of such models, and many others have as well.

^{12} For a more detailed examination of how word2vec works, see this great writeup by Piotr Migdał, the introduction to Google's TensorFlow implementation, and this article by Matthew Honnibal.

^{13} Theodor W. Adorno and Max Horkheimer, *Dialektik Der Aufklärung*, ed. Rolf Tiedemann, vol. 3, Gesammelte Schriften (Frankfurt am Main: Suhrkamp, 1974).

^{14} Alan Liu has pointed out to me that this is strikingly similar to the rhetorical notion of "catachresis" in J. Hillis Miller's analysis of Kant and Derrida (Joseph Hillis Miller, *The Ethics of Reading: Kant, de Man, Eliot, Trollope, James, and Benjamin* (New York, NY: Columbia University Press, 1987), 20-21), where he writes: "What, then, is the law as such? The reader would like to know. He would like to have access to it, to confront it face to face, to see it written down somewhere, so he can know whether or not he is obeying it. Well, Kant cannot tell you exactly what the law as such is, in so many words, nor can he tell you exactly where it is, or where it comes from. The law, as Jacques Derrida puts it, gives itself without giving itself. It may only be confronted in its delegates or representatives or by its effects on us or on others. It is those effects that generate respect for the law. But if Kant cannot tell you exactly what the law is, where it is, or where it comes from, he can nevertheless tell you to what it is analogous. Into the vacant place where there is no direct access to the law as such, but where we stand respectfully, like the countryman in Kafka's parable, "before the law," is displaced by metaphor or some other form of analogy two forms of feeling that can be grasped and named directly. Respect for the law is said to be analogous to just those two feelings which it has been said not to be: inclination and fear. The name for this procedure of naming by figures of speech what cannot be named literally because it cannot be faced directly is catachresis or, as Kant calls it in paragraph fifty-nine of the Critique of Judgment, "hypotyposis" (Hypotypose). Kant's linguistic procedure in this footnote is an example of the forced or abusive transfer of terms from an alien realm to name something which has no proper name in itself since it is not an object which can be directly confronted by the senses. That is what the word catachresis means; etymologically: "against usage." What is "forced or abusive" in this case is clear enough. Kant has said that respect for the law is not based on fear or inclination, but since there is no proper word for what it is based on, he is forced to say it is like just those two feelings, fear and inclination, he has said it is not like."

^{15} Kevin Beyer et al., “When Is ‘Nearest Neighbor’ Meaningful?” in *International Conference on Database Theory* (Springer, 1999), 217–35.

^{16} This is a well-known problem in other disciplines. Quantum mechanics, for instance, in its von Neumann articulation, relies on the counter-intuitive representation of the counter-intuitive physical property of particle superposition as a data point in complex Hilbert space. Measurements, however, "collapse" the wave function, and produce a single, real-valued number. In the von Neumann formalism they are exclusively represented by self-adjoint operators (i.e. square matrices that are equal to their conjugate transpose) to make sure its results are exclusively real, as imaginary numbers have no meaning as experimental results.