Intuition and Epistemology of High-Dimensional Vector Space

Vector space models are mathematical models that make it possible to represent multiple complex objects as commensurable entities. They became widely used in Information Retrieval in the mid-1970s (Dubin 2004) and subsequently found their way into the digital humanities, a development that is not surprising, given that the above definition, applied to literary texts, is very much a description of distant reading (Moretti 2013) in its most pragmatic interpretation. There is no doubt that vector space models work well, not only as a tool for distant reading, but also as a tool for more general natural language processing and machine learning tasks. Precisely because they work so well, however, the justification of their use is often suspiciously circular.

Vector space model from “A vector space model for automatic indexing” (Salton, Wong, and Yang 1975).

In this post I will try to trace the way in which vector space models generate knowledge, for the particular case of the digital humanities. I will try to answer the question: what is the price of the commensurability that vector space models provide? In other words: if we use a vector space model to compare two or more complex aesthetic objects, what are the implicit epistemological assumptions enabling the commensurability of these objects?

The Strong Algorithmic Modeling Assumption

In the digital humanities, quantitative methods are used to answer qualitative questions. The first, general implicit epistemological assumption is thus that intuitive concepts are rational concepts, i.e. that intuition is Cartesian intuition. The second, more concrete implicit epistemological assumption is that intuitive concepts can be sufficiently modeled formally, or, in Frieder Nake’s words, that “the computer is a means of production enabling the mechanization of mental work” (Nake 1974).

This strong algorithmic modeling assumption has a long tradition in computer science, starting with arguably the most important appeal to intuition of all time: the Church-Turing thesis. In the digital humanities, the strong algorithmic modeling assumption is exactly the implicit application of Turing’s (1936) method to texts. Some intuitive property of a text, it posits, can be modeled, like the Turing machine, after a human “computor” (a term introduced by Gandy (1988) to distinguish the computing human from the computing machine; in Post’s (1965) terms: a “worker”) counting and comparing words and “bags of words”, all in one intuitive glance. I would even argue that Franco Moretti’s notion of “operationalizing” is not only derived from Bridgman’s “operational point of view”, but also implicitly acknowledges Turing’s legacy.

The generality of the strong algorithmic modeling assumption, however, hides the fact that, for the specific field of the digital humanities, it has some specific implications that make it more difficult to simply wave it aside and move on to more productive work. This becomes apparent in a close reading of two popular methods of distant reading: cosine similarity and automated analogical reasoning with word embedding models.

Intuitive Similarity and Cosine Similarity

Cosine similarity is a way of estimating the similarity of documents by representing them in high-dimensional vector space. Concretely, it is a measure of the angle between two vectors: the smaller the angle, the larger their cosine similarity. Vectors with the same orientation have a cosine similarity of 1, orthogonal vectors have a cosine similarity of 0, and vectors pointing in opposite directions have a cosine similarity of -1. To measure the cosine similarity of two documents, we represent them as vectors within a high-dimensional vector space where each dimension corresponds to one word in the two documents’ common vocabulary.
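
To make this construction concrete, here is a minimal sketch in Python of the procedure just described, counting words into vectors over a shared vocabulary and measuring the angle between them (the helper name and the example sentences are mine, not part of any canonical implementation):

```python
# A minimal sketch: bag-of-words count vectors over the two documents'
# common vocabulary, compared by the cosine of their angle.
import math
from collections import Counter

def cosine_similarity(doc_a: str, doc_b: str) -> float:
    a, b = Counter(doc_a.lower().split()), Counter(doc_b.lower().split())
    vocab = set(a) | set(b)                      # one dimension per word
    dot = sum(a[w] * b[w] for w in vocab)        # dot product of count vectors
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b)

print(cosine_similarity("the cat sat on the mat", "the dog sat on the mat"))  # 0.875
```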

Cosine similarity corresponds very well to our intuitive concept of similarity. More precisely: if we were asked to quantify our intuition about the similarity of two short sentences on a scale from -1 to 1, the result would very likely land in the general neighborhood of the cosine similarity of their vectors.

This is surprising, as there are at least two properties of our intuitive concept of similarity that seem to prohibit such a correspondence. First, our intuitive concept of similarity seems to be independent of word order. We recognize the “subject matter” to be similar even if a sentence is reversed. Second, it also seems to be immune to syntactically insignificant but semantically significant changes. We recognize the “subject matter” to be similar even if a sentence is negated.

In other words, our intuitive concept of similarity is an inherently fuzzy concept. How, then, is it possible that cosine similarity – clearly not a fuzzy concept – seems to sufficiently model it? The answer is simple: it doesn’t. What models our intuitive concept of similarity is the bag-of-words model implicit in the particular form of vectorization preceding the cosine similarity computation. As so often in computer science (Shannon 1948), cosine similarity just happens to be the easiest way of comparing the direction of vectors in general, completely independent of the properties of the objects they represent.
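
The sketch above makes this easy to see: reusing the cosine_similarity helper, a reversed sentence produces exactly the same bag of words, and a negated one a nearly identical one (the example sentences are again mine):

```python
# Reusing cosine_similarity from the sketch above: word order is invisible
# to the bag-of-words representation, and a single negating word barely
# moves the vector.
original  = "the senate approved the budget"
reversed_ = "budget the approved senate the"
negated   = "the senate never approved the budget"

print(cosine_similarity(original, reversed_))  # 1.0: identical bags of words
print(cosine_similarity(original, negated))    # ≈ 0.94 despite the negation
```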

Intuitive Analogies and Word Embedding Analogies

With the recent comeback of machine learning in general and neural networks in particular, new technologies have emerged that enable much more complex investigations into texts. Among them are word embedding models (Mikolov, Chen, et al. 2013; Mikolov, Sutskever, et al. 2013), vector space models which, unlike the bag-of-words vectors behind cosine similarity, preserve word order, or, more precisely, word contexts up to a certain window size.

The most frequently used of these models, word2vec, uses a shallow neural network to construct a high-dimensional vector space that reflects not only syntactic but also semantic properties of the source corpus. This is possible because, instead of using n-dimensional vectors to represent specific n-grams in relation to a vocabulary of n n-grams, neural networks use smaller, real-weighted feature vectors that are “tuned” over several iterations according to a loss function. Most prominently, word2vec is able to answer analogy queries, like “what is to woman what king is to man” (“queen”, of course; this used to be a hard problem to solve computationally, though).
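
For reference, the famous query can be reproduced in a few lines with gensim and a pretrained model; the specific model name and the download step below are assumptions about a typical setup, not something prescribed by the original papers:

```python
# A sketch of the analogy query against pretrained Google News vectors
# (assumed setup: gensim installed, large model download on first run).
import gensim.downloader as api

model = api.load("word2vec-google-news-300")
print(model.most_similar(positive=["woman", "king"], negative=["man"], topn=3))
# Returns a ranked list of existing words, typically headed by ('queen', ...)
```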

However, the solution to the analogy query is given by the algorithm not as a definite answer, but as a hierarchy of existing data points. Why? Simply because there are no “intermediate” words. If the best possible analogy is a (new) data point right in between two (existing) data points representing words in the source corpus vocabulary, the best possible analogy is neither of them, but still can only be described in terms of them. Even if the input vocabulary consisted of all words in the English language, the vector operations that solve the analogy query could still produce a data point that is “in between everything”.
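
This becomes tangible when the arithmetic is done by hand. Continuing with the model loaded in the sketch above, the offset vector itself names nothing; all we can ask the model for is a ranking of the existing vocabulary around that point:

```python
# Continuing with `model` from the previous sketch: the offset vector is a
# new, nameless point in the space; the "answer" is only ever a ranking of
# existing word vectors in its neighborhood.
target = model["king"] - model["man"] + model["woman"]
print(model.similar_by_vector(target, topn=5))
# 'king' itself and 'queen' typically rank highest; the point has no word of its own
```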

Every computational solution to an analogy task is thus, ironically, itself an analogy – an analogy which, in a peculiar reversal of the usual narrative of violent quantification (Adorno 2003), is a violent “qualification” of the machine’s solution. This also means: our interpretation of any results produced by word2vec is actually an interpretation of an interpretation.

In other words: we encounter, once again, an assumption of bijection, this time, however, at the “other end” of the process of quantification. We assume that the solution to the analogy query is the hierarchy of references to existing high-dimensional data points presented to us by the algorithm, while in fact this hierarchy is merely a visualization of the underlying vector space solution.

Solving Is Visualizing

Why a “visualization”? High-dimensional vector spaces are geometrically counter-intuitive. Not only is it impossible to imagine a vector in eleven dimensions; the metric behavior of such spaces also defies intuition: this is what is known as the “curse of dimensionality”. One of the core problems of higher dimensions is that “under certain broad conditions […], as dimensionality increases, the distance to the nearest neighbor approaches the distance to the farthest neighbor” (Beyer et al. 1999).
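
The effect is easy to reproduce numerically. The following sketch (with arbitrarily chosen sample sizes) draws random points and compares, for a random query point, the distance to its nearest and to its farthest neighbor as the number of dimensions grows:

```python
# A small numerical illustration of the Beyer et al. observation: the ratio
# between nearest- and farthest-neighbor distance approaches 1 as the
# dimensionality grows (sample sizes are arbitrary).
import numpy as np

rng = np.random.default_rng(0)
for dim in (2, 10, 100, 1000):
    points = rng.random((1000, dim))              # 1000 uniformly random points
    query = rng.random(dim)                       # one random query point
    dists = np.linalg.norm(points - query, axis=1)
    print(dim, round(dists.min() / dists.max(), 3))
```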

In other words: in high-dimensional vector space, distances between data points have a tendency to become illegible: they lose, or at least significantly change, their meaning. Thus, they become inaccessible to intuition.

More generally, for us as humans, the only geometrically intuitive space is Euclidean space. We simply do not have the mental capabilities to think spatially about anything above n=3 (or n=4 if we include time). Consequently, to be able to think spatially about a larger space, we have to find a way to map the larger space to Euclidean space, or, in other words, to visualize the larger space.
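
The most familiar instance of such a mapping is a dimensionality-reducing projection; the choice of PCA below (rather than, say, t-SNE or UMAP) is mine, and `model` is again the pretrained word2vec model from the earlier sketch:

```python
# A literal "visualization" in the everyday sense: projecting 300-dimensional
# word vectors down to two coordinates we can actually look at.
from sklearn.decomposition import PCA

words = ["king", "queen", "man", "woman", "prince", "princess"]
coords = PCA(n_components=2).fit_transform([model[w] for w in words])
for word, (x, y) in zip(words, coords):
    print(f"{word:10s} {x:+.3f} {y:+.3f}")       # a 2D shadow of 300 dimensions
```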

This means: when word2vec produces a legible (i.e. visible) answer to an analogy query, it actually produces two visualizations: a numerical visualization and a geometric visualization. The numerical visualization is the hierarchy of references to existing high-dimensional data points presented to us by the algorithm. The geometric visualization is the “collapsed” high-dimensional space, presented to us as the computed real (i.e. floating point) values of these data points.

Conclusion

We have seen that there exists a more fine-grained implicit epistemological assumption, derived from the general strong algorithmic modeling assumption, that influences the way in which we perceive the generation of knowledge through quantitative methods in the digital humanities. While, obviously, the presence of this implicit epistemological assumption does not necessarily undermine the validity of results derived from applications of cosine similarity and word2vec, it does cast doubt on how far we are able to intuitively read these results, and hence on whether we really do what we think we do when we interpret them.

References

Adorno, Theodor W. 2003. Vorlesungen über Negative Dialektik. Edited by Rolf Tiedemann. Nachgelassene Schriften, vol. IV-16. Frankfurt am Main: Suhrkamp.

Beyer, Kevin, Jonathan Goldstein, Raghu Ramakrishnan, and Uri Shaft. 1999. “When Is ‘Nearest Neighbor’ Meaningful?” In International Conference on Database Theory, 217–35. Springer.

Dubin, David. 2004. “The Most Influential Paper Gerard Salton Never Wrote.” Library Trends 52 (4): 748.

Gandy, Robin. 1988. “The Confluence of Ideas in 1936.” In The Universal Turing Machine. A Half-Century Survey, edited by Rolf Herken. Hamburg: Kammerer und Unverzagt.

Mikolov, Tomas, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. “Efficient Estimation of Word Representations in Vector Space.” arXiv Preprint arXiv:1301.3781.

Mikolov, Tomas, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. “Distributed Representations of Words and Phrases and Their Compositionality.” In Advances in Neural Information Processing Systems, 3111–9.

Moretti, Franco. 2013. Distant Reading. New York, NY: Verso Books.

Nake, Frieder. 1974. Ästhetik Als Informationsverarbeitung. Grundlagen Und Anwendung Der Informatik Im Bereich ästhetischer Produktion Und Kritik. Springer.

Post, Emil L. 1965. “Finite Combinatory Processes. Formulation 1.” In The Undecidable. Basic Papers on Undecidable Propositions, Unsolvable Problems and Computable Functions, edited by Martin Davis. Mineola, NY: Dover.

Salton, Gerard, Anita Wong, and Chung-Shu Yang. 1975. “A Vector Space Model for Automatic Indexing.” Communications of the ACM 18 (11): 613–20.

Shannon, Claude Elwood. 1948. “A Mathematical Theory of Communication.” The Bell System Technical Journal 27.

Turing, Alan Mathison. 1936. “On Computable Numbers, with an Application to the Entscheidungsproblem.” Proceedings of the London Mathematical Society, 2nd ser., 42: 230–65.
