from Hacker News

English Letter Frequency Counts: Mayzner Revisited (2013)

by sindoc on 1/3/14, 1:26 PM with 15 comments

  • by yoyo1999 on 1/3/14, 2:31 PM

    Can anybody help me understand how this data can be useful?

    I was playing with n-grams for a while and even produced similar results, but I don't see how the data can be useful to anybody.

  • by ableal on 1/3/14, 3:21 PM

  • by Rezo on 1/3/14, 3:16 PM

    I used Norvig's frequency counts as input for the board generation algorithms (in Scala) for my Android word game "5 Star Words" [1]. With this as the start plus a few other tricks, I'm typically able to reach an average of ~300 common English words (or easily 400+ when including less common and swear words) on a 4x4 letter board.

    [1] https://play.google.com/store/apps/details?id=com.starwords
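
    As a rough illustration of frequency-weighted board generation (this is not Rezo's Scala code, and it leaves out the "other tricks" mentioned above), a minimal Python sketch might look like the following. The name `generate_board` and the rounded percentage weights are assumptions made for this example; real use would substitute the published counts.

```python
import random

# Rough English letter frequencies in percent. These are rounded,
# illustrative values (an assumption of this sketch); substitute the
# published counts for real use.
LETTER_WEIGHTS = {
    "E": 12.5, "T": 9.3, "A": 8.0, "O": 7.6, "I": 7.6, "N": 7.2,
    "S": 6.5, "R": 6.3, "H": 5.1, "L": 4.1, "D": 3.8, "C": 3.3,
    "U": 2.7, "M": 2.5, "F": 2.4, "P": 2.1, "G": 1.9, "W": 1.7,
    "Y": 1.7, "B": 1.5, "V": 1.1, "K": 0.5, "X": 0.2, "J": 0.2,
    "Q": 0.1, "Z": 0.1,
}


def generate_board(size=4, rng=None):
    """Return a size x size grid of letters sampled by frequency weight."""
    rng = rng or random.Random()
    letters = list(LETTER_WEIGHTS)
    weights = list(LETTER_WEIGHTS.values())
    # Sample with replacement so common letters can appear more than once.
    picks = rng.choices(letters, weights=weights, k=size * size)
    return [picks[row * size:(row + 1) * size] for row in range(size)]


if __name__ == "__main__":
    # A fixed seed makes the board reproducible between runs.
    for row in generate_board(rng=random.Random(0)):
        print(" ".join(row))
```

    Sampling each cell independently is the simplest possible choice; a real generator would presumably also score candidate boards by how many dictionary words they contain.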

  • by bane on 1/3/14, 3:47 PM

    I think natural language designers might also look at the letter frequencies and question why 'E' shows up so much. Is the canonical sound it makes just common in English or is there some problem with its "design"? It turns out E is way overloaded in English:

    - it's silent when it modifies a preceding vowel across an intervening consonant e.g. hat vs. hate, bat vs. bate

    - and in older English (or English that wants to feel old) was a superfluous final letter e.g. olde, pubbe

    - as a silent letter entirely e.g. eagle

    - as itself e.g. egg, education

    - as a silent or nearly silent suffix separator for -ed e.g. dropped, judged

    - as a non-silent suffix for -ed e.g. educated

    - silent as an immediate vowel modifier in vowel digraphs (in some spellings) e.g. archaeology, encyclopaedia, Caesar, where the pair used to be written as a ligature because the 'e' was so incidental

    - silent as a modifier on itself e.g. teen, feel

    - one of several representations of schwa, ə, e.g. taken (takən), enemy (enəmy)

    etc.

    'e' is a mess. It's mostly silent, either ignored completely or modifying something else (an issue even Benjamin Franklin tried to solve with a proposed spelling reform). It's also conflated with schwa, the most common vowel sound in English, which has no single representation of its own.

    A language reformer would probably tackle this letter first and fix a great deal of the spelling problems in English.

  • by JeffJenkins on 1/3/14, 3:29 PM

    Fun fact: if you're picking one vowel and five consonants, the Wheel of Fortune letters RSTLNE (not in that order) are the letters most likely to occur.
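
    One way to check a claim like this against any corpus is to count letters and take the top vowel and the top five consonants. The sketch below is only an illustration: `letter_counts` and `top_picks` are hypothetical names, Y is treated as a consonant by assumption, and the outcome depends entirely on the text supplied.

```python
from collections import Counter

# Assumption of this sketch: Y counts as a consonant.
VOWELS = set("AEIOU")


def letter_counts(text):
    """Count A-Z letters in text, ignoring case and non-letters."""
    return Counter(ch for ch in text.upper() if "A" <= ch <= "Z")


def top_picks(counts, n_vowels=1, n_consonants=5):
    """Return the most frequent vowels and consonants from a letter Counter."""
    # most_common() with no argument ranks every letter by descending count.
    ranked = [letter for letter, _ in counts.most_common()]
    vowels = [c for c in ranked if c in VOWELS][:n_vowels]
    consonants = [c for c in ranked if c not in VOWELS][:n_consonants]
    return vowels, consonants
```

    Calling `top_picks(letter_counts(corpus_text))` on a large corpus returns the picks for that corpus, which can then be compared with RSTLNE.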

  • by triplesec on 1/3/14, 2:53 PM

    I love this. One minor representation issue: for the "Letter Counts by Position Within Word" section, the charting approach is less than helpful. Improvements within the structure he uses might be coding each letter with its own colour, and giving each letter its own column, reordered by length. However, charting experts could easily come up with a more useful approach than I can off the top of my head.
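
    To make the "each letter in its own column" idea concrete, here is a hedged sketch that tabulates letter counts by position within a word and prints them as plain text, one column per letter. It is not the chart from the article; `position_counts` and `format_table` are hypothetical helpers, and colouring or reordering is left out.

```python
import re
from collections import defaultdict


def position_counts(text):
    """Map each letter to {position_in_word: count}, with 1-based positions."""
    table = defaultdict(lambda: defaultdict(int))
    for word in re.findall(r"[A-Za-z]+", text):
        for pos, ch in enumerate(word.upper(), start=1):
            table[ch][pos] += 1
    return table


def format_table(table, max_pos):
    """Render the counts with one column per letter and one row per position."""
    letters = sorted(table)
    lines = ["pos " + " ".join(f"{letter:>4}" for letter in letters)]
    for pos in range(1, max_pos + 1):
        # .get() avoids creating empty entries in the nested defaultdict.
        cells = " ".join(f"{table[letter].get(pos, 0):>4}" for letter in letters)
        lines.append(f"{pos:<3} {cells}")
    return "\n".join(lines)
```

    Printing `format_table(position_counts(text), max_pos=7)` gives a grid that can be scanned down a single letter's column, which is the comparison the original chart makes hard.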