The task of text segmentation may be undertaken at many levels in text
analysis---paragraphs, sentences, words, or even letters. Here, we focus on a
relatively fine scale of segmentation, hypothesizing it to be in accord with a
stochastic model of language generation, as the smallest scale where
independent units of meaning are produced. Our goals in this letter include the
development of methods for segmenting text into these minimal independent units,
yielding feature representations of texts that align with the independence
assumption of the bag-of-terms model, commonly used for prediction and
classification in computational text analysis. We also propose measuring
texts' association (with respect to realized segmentations) with the model of
language generation. We find (1) that our segmentations of phrases exhibit much
better association with the generation model than words do, and (2) that texts
which are well fit are generally topically homogeneous. Because our generative
model produces Zipf's law, our study further suggests that Zipf's law may be a
consequence of homogeneity in language production.
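
To make the rank-frequency relation behind Zipf's law concrete, the following is a minimal sketch, not the segmentation method or generative model of this letter. It builds a bag-of-terms count from a toy text and estimates a Zipf exponent by least squares on log-frequency against log-rank. Whitespace splitting stands in for a real segmentation, and the function names `rank_frequency` and `zipf_exponent` are hypothetical, introduced only for illustration.

```python
import math
from collections import Counter


def rank_frequency(terms):
    """Return term counts sorted from most to least frequent."""
    return sorted(Counter(terms).values(), reverse=True)


def zipf_exponent(frequencies):
    """Negated least-squares slope of log(frequency) on log(rank).

    Assumes at least two frequencies, all positive and sorted by rank.
    """
    xs = [math.log(rank) for rank in range(1, len(frequencies) + 1)]
    ys = [math.log(f) for f in frequencies]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return -sxy / sxx


# Toy input: whitespace splitting is an assumption standing in for
# a real segmentation into independent units of meaning.
text = "the cat sat on the mat and the dog sat on the rug"
terms = text.split()
freqs = rank_frequency(terms)
exponent = zipf_exponent(freqs)
```

An exponent near 1 corresponds to the classical form of Zipf's law. On a text this short the estimate is not meaningful; the sketch only shows the mechanics of going from a segmentation to a rank-frequency fit.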
Title
Zipf's law is a consequence of coherent language production