Computer Science - Computation and Language; Physics - Physics and Society
Phys. Rev. E 91, 052811 (2015)
Natural languages are full of rules and exceptions. One of the most famous
quantitative rules is Zipf's law, which states that the frequency of occurrence
of a word is approximately inversely proportional to its rank. Though this
'law' of ranks has been found to hold across disparate texts and forms of data,
analyses of increasingly large corpora over the last 15 years have revealed the
existence of two scaling regimes. These regimes have thus far been explained by
a hypothesis suggesting a separability of languages into core and non-core
lexica. Here, we present and defend an alternative hypothesis, that the two
scaling regimes result from the act of aggregating texts. We observe that text
mixing leads to an effective decay of word introduction, which we show provides
accurate predictions of the location and severity of breaks in scaling. Upon
examining large corpora from 10 languages in the Project Gutenberg eBooks
collection (eBooks), we find emphatic empirical support for the universality of
our claim.
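
As a rough illustration of the two quantities the abstract refers to, the sketch below computes a rank-frequency distribution and a windowed word introduction rate for two concatenated toy texts. It is not the authors' method or data: the function names `rank_frequency` and `introduction_rate`, the window size, and the example sentences are all assumptions made for this illustration.

```python
from collections import Counter


def rank_frequency(tokens):
    """Return word counts sorted from most to least frequent (rank 1 first)."""
    counts = Counter(tokens)
    return sorted(counts.values(), reverse=True)


def introduction_rate(tokens, window):
    """Fraction of tokens in each consecutive window that are new word types."""
    seen = set()
    rates = []
    for start in range(0, len(tokens), window):
        chunk = tokens[start:start + window]
        new = 0
        for tok in chunk:
            if tok not in seen:
                seen.add(tok)
                new += 1
        rates.append(new / len(chunk))
    return rates


# Toy "texts" (hypothetical data, for illustration only).
text_a = "the cat sat on the mat the cat ran".split()
text_b = "the dog sat on the log the dog ran".split()

# Mixing here is plain concatenation; words already seen in the first
# text are no longer new when they recur in the second.
mixed = text_a + text_b

freqs = rank_frequency(mixed)
rates = introduction_rate(mixed, window=len(text_a))

for rank, freq in enumerate(freqs, start=1):
    print(rank, freq)
print(rates)
```

In this toy case the second text introduces fewer new words than the first because the two share vocabulary, which is the kind of decay in word introduction the abstract attributes to text mixing. A real analysis would need large corpora and a fitted scaling model, neither of which is attempted here.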
Details
Title
Text mixing shapes the anatomy of rank-frequency distributions: A modern Zipfian mechanics for natural language