Burmese Character Frequency

I made an attempt to collect general statistics on Burmese character frequency, using data from Wikipedia and blogs. Download Data Sheet here. The image and data are redistributable under the CC BY-SA 3.0 license.



sorlok_reaves's picture

This is good! It would also be a good idea to get the conditional frequencies (or n-grams), though.

You know, like now you have P(U+1000), P(U+1001), etc.
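For reference, the single-character probabilities mentioned here could be computed along these lines. This is only a minimal sketch, not the method used for the data sheet: the function name `unigram_probs` and the choice to keep only code points in the Myanmar block (U+1000 to U+109F) are assumptions made for illustration.

```python
from collections import Counter

def unigram_probs(text):
    """Estimate P(c) for each Myanmar-block character c in text."""
    # Assumption: only code points in U+1000..U+109F are counted;
    # spaces, Latin letters and punctuation are ignored.
    chars = [c for c in text if "\u1000" <= c <= "\u109f"]
    counts = Counter(chars)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {c: n / total for c, n in counts.items()}

# Tiny made-up sample: U+1000 three times, U+1001 once, plus noise.
sample = "\u1000\u1001 \u1000\u1000 abc"
probs = unigram_probs(sample)
for ch, p in sorted(probs.items()):
    print(f"P(U+{ord(ch):04X}) = {p:.2f}")
```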

You could also do P(U+1000|U+1000), P(U+1000|U+1001), etc. (P(U+1000|U+1001) is the probability of U+1000 directly after U+1001.)

...and P(U+1000|U+1001U+1002) (Probability of U+1000 directly after U+1001 then U+1002). 
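A rough sketch of how these conditional frequencies might be counted, assuming plain maximum-likelihood estimates (count of context followed by character, divided by count of context). The name `conditional_probs` and its `context_len` parameter are made up for this example; `context_len=1` gives the two-character case and `context_len=2` the three-character case. Nothing is filtered here, so spaces and punctuation would count as context too.

```python
from collections import Counter

def conditional_probs(text, context_len=1):
    """Estimate P(ch | context), where context is the context_len
    characters directly before ch, from raw counts."""
    context_counts = Counter()
    pair_counts = Counter()
    for i in range(context_len, len(text)):
        ctx = text[i - context_len:i]
        context_counts[ctx] += 1
        pair_counts[(ctx, text[i])] += 1
    return {
        (ctx, ch): n / context_counts[ctx]
        for (ctx, ch), n in pair_counts.items()
    }

# Made-up sample: U+1001 U+1000 U+1001 U+1002 U+1000
sample = "\u1001\u1000\u1001\u1002\u1000"

bigram = conditional_probs(sample, context_len=1)
trigram = conditional_probs(sample, context_len=2)

# P(U+1000 | U+1001): U+1001 is followed once by U+1000, once by U+1002
print(bigram[("\u1001", "\u1000")])
# P(U+1000 | U+1001 U+1002)
print(trigram[("\u1001\u1002", "\u1000")])
```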

I think the interesting data are in the relationships (for example, P(U+103C|X) is medium-high if X is a consonant, medium if X is U+103B, and zero otherwise.)
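To check a relationship like that against real text, one option is to group the preceding character into classes and estimate P(U+103C | class). The sketch below is only an illustration: the helper names are invented, and treating U+1000 to U+1021 as "consonants" is a simplifying assumption that may need adjusting.

```python
from collections import Counter

def context_class(ch):
    """Bucket the preceding character into a coarse class."""
    # Assumption: U+1000..U+1021 are treated as the consonants.
    if "\u1000" <= ch <= "\u1021":
        return "consonant"
    if ch == "\u103b":
        return "U+103B"
    return "other"

def prob_after_class(text, target="\u103c"):
    """Estimate P(target | class of the previous character)."""
    seen = Counter()
    hits = Counter()
    for prev, cur in zip(text, text[1:]):
        cls = context_class(prev)
        seen[cls] += 1
        if cur == target:
            hits[cls] += 1
    return {cls: hits[cls] / seen[cls] for cls in seen}

# Made-up sample, not real Burmese text.
sample = "\u1000\u103c\u1000\u103b\u103c\u1000"
by_class = prob_after_class(sample)
for cls, p in sorted(by_class.items()):
    print(f"P(U+103C | {cls}) = {p:.2f}")
```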


Lionslayer's picture

Someone advised me the same just now on fb :P I should really do n-grams when I get time.