Why are there so many homophones in the Chinese language?

The reason is that sound changes simplified the complex consonant clusters of Old Chinese and deleted final consonants over time. Old Chinese typically had short monosyllabic words with coda consonants, much like most of the native Germanic vocabulary of English (e.g., words like pink, strike, screw, sleep, dream, wink, wing, back, bend, first, sixth).

In most Chinese varieties today, none of these words have coda consonants or clusters:

As these consonants dropped off, tones evolved from the pitch contours that the lost consonants had left behind on the syllable. Even with tones, though, many monosyllables ended up with dozens of homophones.

Crosslinguistically, this is bizarre from the perspective of Indo-European, Afro-Asiatic, Niger-Congo, Pama-Nyungan, or Austronesian languages. Typically, languages avoid homophones, and sound changes that would create them often fail to go through. Additionally, languages with smaller consonant inventories and fewer clusters tend to evolve longer words.

Linguists generally think homophone avoidance (the tendency to avoid sound changes that result in homophones) is universal. Recently, Trott and Bergen (2020) went as far as testing the idea computationally: they generated artificial lexicons modeled on real languages and compared how many homophones turned up in each. Their artificial lexicons did not come out with fewer homophones than the real languages, and if anything could hold more, which is what you would expect if real languages are under pressure to avoid homophones.
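The logic of comparing a real vocabulary against a randomly generated one can be illustrated with a toy simulation. This is only a sketch under my own assumptions, not Trott and Bergen's actual procedure: the consonant and vowel inventories, the strict CV syllable shape, and the figure of 2,000 meanings are all hypothetical. It shows how quickly random wordforms collide when words are short and the sound inventory is small.

```python
import random
from collections import Counter

def random_lexicon(n_meanings, consonants, vowels, syllables_per_word, rng):
    """Give each meaning a random wordform built from CV syllables."""
    forms = []
    for _ in range(n_meanings):
        form = "".join(
            rng.choice(consonants) + rng.choice(vowels)
            for _ in range(syllables_per_word)
        )
        forms.append(form)
    return forms

def homophone_rate(forms):
    """Fraction of meanings whose wordform is shared with another meaning."""
    counts = Counter(forms)
    shared = sum(c for c in counts.values() if c > 1)
    return shared / len(forms)

rng = random.Random(0)    # fixed seed so the run is repeatable
consonants = "ptkmnslh"   # hypothetical 8-consonant inventory
vowels = "aiu"            # hypothetical 3-vowel inventory

rates = {}
for n_syll in (1, 2, 3):
    forms = random_lexicon(2000, consonants, vowels, n_syll, rng)
    rates[n_syll] = homophone_rate(forms)
    print(n_syll, "syllable(s):", round(rates[n_syll], 3))
```

With only 24 possible one-syllable forms, essentially every meaning shares its form with others, while three-syllable words leave most meanings with a form of their own. A real study would replace the uniform random choice with a model of the language's actual sound patterns.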

One Austronesian language, Tagalog, is exactly the kind of language homophone avoidance would predict. It has a consonant inventory even smaller than Mandarin’s, three vowels in native words, five vowels in Spanish and English loans, and no lexical tone.

Yet Tagalog doesn’t have more homophones than English. The simple reason is that it has almost no monosyllabic words. No Tagalog sound changes have clipped off the ends of words. If you casually leaf through the Tagalog Swadesh list, you will see only four words or morphological forms out of roughly 200 that are monosyllabic:

Appendix:Tagalog Swadesh list – Wiktionary, the free dictionary

https://en.wiktionary.org/wiki/Appendix:Tagalog_Swadesh_list

They are ka, the second person singular ‘you’ in the absolutive case; the preposition sa; the conjunction at, which means ‘and’; and another conjunction, kung, which means ‘if.’ That’s it! The core vocabulary of Tagalog is only about 2% monosyllabic.
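A count like this is easy to reproduce. The sketch below is a rough approximation of my own, not a linguistic tool: it assumes each vowel letter in Tagalog spelling marks one syllable, which works for ordinary native words but would need special handling for abbreviated spellings such as ng. The sample is just a handful of pronouns and function words, not the full list.

```python
VOWELS = set("aeiou")

def syllable_count(word):
    """Approximate syllables by counting vowel letters (an assumption)."""
    return sum(ch in VOWELS for ch in word.lower())

def monosyllabic_share(words):
    """Return the one-syllable words and their share of the list."""
    mono = [w for w in words if syllable_count(w) == 1]
    return mono, len(mono) / len(words)

# Small illustrative sample, not the full Swadesh list.
sample = [
    "ako", "ikaw", "siya", "tayo", "kami", "sila", "ito", "dito",
    "sino", "ano", "saan", "hindi", "lahat", "ka", "sa", "at", "kung",
]

mono, share = monosyllabic_share(sample)
print("monosyllabic in sample:", mono)

# The figure quoted in the text: four monosyllables out of roughly 200 entries.
print(f"{4 / 200:.0%}")
```

Run over the whole Swadesh list, the same function would give the share directly instead of relying on a count by eye.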

Outside of this core vocabulary, the only monosyllabic Tagalog words I know are some of the ergative pronouns (second person singular mo and first person singular ko), a few clitics like the question marker ba and din/rin, which means ‘also,’ and the case markers: absolutive ang and ergative ng. If you look at a block of Tagalog text, you’ll frequently see paragraphs without a single monosyllabic word besides the very important case markers ang/ng, the preposition sa, the conjunction o (‘or’), and the inverter ay (used whenever a sentence departs from the default Verb-ergative-absolutive word order).

All of the nouns and verbs are at least two syllables. I can’t think of a single noun in Tagalog that’s monosyllabic, and the only verb is an edge case, copular may, which means ‘there is.’

It’s a bit of a simplification to say Chinese mostly has monosyllabic words. The stems are monosyllabic, but most Chinese words are used in compounds, where often two nouns with similar meaning are stuck together. Sampson (2013) argues that because Chinese has to reduce homophones by compounding nouns, homophone avoidance is not universal or even a sound theoretical concept.

I wouldn’t go that far. If, say, 99% of the world’s languages follow homophone avoidance, I don’t think we should throw out the whole theory just because Chinese doesn’t. It is an interesting question, though, why Chinese underwent sound changes that created so many homophones. We understand the individual sound changes involved, but why they were allowed to go through is a mystery.

References:

Sampson, G. (2013). A counterexample to homophony avoidance. Diachronica, 30(4), 579-591.

https://www.grsampson.net/ACth.pdf

Li, P., & Yip, M. C. (1998). Context effects and the processing of spoken homophones. Reading and Writing, 10, 223-243.

https://blclab.org/wp-content/uploads/2013/02/rw98.pdf

Trott, S., & Bergen, B. (2020). Why do human languages have homophones? Cognition, 205, 104449.

https://pages.ucsd.edu/~bkbergen/papers/trott_bergen_2020.pdf