Although probabilistic classifiers using a bag of words approach prove largely effective, bag of words has several disadvantages.
Word correlation. Bag of words assumes words are independent of one another in a document or corpus. Election is more likely to appear in shared context with president than with poet. In measuring individual term frequencies, bag of words does not account for correlations in usage between words. Because bag of words extracts each word in a document as a feature of the model, with term frequency as that feature’s weight, two or more correlated words can theoretically induce multicollinearity in statistical classifiers built on that model. Nevertheless, Naïve Bayes’ simplifying independence assumption has been shown to produce robust models despite this potential shortcoming.5
Compound words. Word correlation extends to bag of words representations of compound phrases, in which two or more words operate as one semantic unit. For instance, a simple bag of words model may represent Mr. Darcy as two unique and unrelated words even though they function in tandem. Such a representation fails to reflect the semantic and syntactic nature of multi-word concepts.
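A minimal sketch of the problem, using whitespace tokenization on a made-up sentence: a unigram bag of words splits the compound name into separate tokens, while adding bigram features (pairs of adjacent tokens) recovers it as one unit.

```python
# Hypothetical example sentence; tokenization is naive whitespace splitting.
text = "Mr. Darcy greeted Mr. Bingley"
tokens = text.split()

# Unigram features: "Mr." and "Darcy" become unrelated dimensions.
unigrams = set(tokens)
# Bigram features: adjacent token pairs, preserving the compound.
bigrams = {" ".join(pair) for pair in zip(tokens, tokens[1:])}

print("Darcy" in unigrams)       # True: half of the name, in isolation
print("Mr. Darcy" in bigrams)    # True: the compound survives as one feature
```

Extending the feature set with n-grams is a common mitigation, though it enlarges the vocabulary and worsens the sparsity issue discussed below.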
Polysemous words. Many words have multiple, markedly different meanings. For instance, bat can signify a sports implement or an animal, and these meanings usually occur in significantly different contexts. Similarly, a word can change meaning depending on the placement of its stress in spoken language—for example, CON-tent versus con-TENT. Because bag of words does not regard context and meaning when modeling words, it collapses all of these distinct meanings into one word, thereby eliding potentially significant information about a text’s subject (and thus its potential classification).
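A small sketch, using two invented sentences: both senses of bat increment the same dimension of the count vector, so the representation cannot distinguish them.

```python
from collections import Counter

# Hypothetical sentences using "bat" in two different senses.
sports = "he swung the bat at the ball"
wildlife = "a bat flew out of the cave"

vocab = sorted(set(sports.split()) | set(wildlife.split()))

def vectorize(doc):
    """Bag of words count vector over the shared vocabulary."""
    counts = Counter(doc.split())
    return [counts[w] for w in vocab]

# Both senses map to the single "bat" dimension identically.
i = vocab.index("bat")
print(vectorize(sports)[i], vectorize(wildlife)[i])  # 1 1
```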
Sparsity. In a bag of words model, each word is a feature, or dimension, of the model, and each document is a vector. Because a single document does not use every word in the model’s vocabulary, many of the feature values for a given vector are zero. When the majority of values across vectors are zero, the model is sparse (when the vectors are represented as a matrix, this is called a sparse matrix). Model sparsity accompanies high dimensionality, which, in turn, leads to overfitting on training data.6
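Sparsity appears even in a tiny, hypothetical corpus: with four short documents on different topics, most entries in the document-term matrix are already zero.

```python
from collections import Counter

# Hypothetical corpus of four short documents on unrelated topics.
docs = [
    "the cat sat on the mat",
    "dogs chase cats",
    "stock prices rose sharply today",
    "the senate passed the bill",
]

vocab = sorted({w for d in docs for w in d.split()})
# Document-term matrix: one row per document, one column per word.
vectors = [[Counter(d.split())[w] for w in vocab] for d in docs]

total = len(vectors) * len(vocab)
zeros = sum(v == 0 for row in vectors for v in row)
print(f"{zeros}/{total} entries are zero")  # well over half
```

As the corpus grows, the vocabulary (and hence dimensionality) grows far faster than any single document's length, so the proportion of zeros only increases; sparse-matrix representations store just the nonzero entries to keep such models tractable.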