Creating a language model with KenLM

Hi guys, I'm creating a language model with KenLM, and I know that KenLM uses the N-gram model. I have 2 questions about this:

  1. When I build an *.arpa file from a text.txt file, do all the sentences in text.txt need to be 3 to 5 words long to get the best LM? I ask because my text has about 12000 sentences, and more than 80% of them are about 8-15 words long.

  2. I'm using this command to build the *.arpa file: ./lmplz --text text.txt --arpa text2.arpa --o 5. Do I need to change the value of the last parameter (currently 5) to some other value, like 3 or 4, based on my data as described above?

Can anyone help me, please?

  1. I don't think sentence length is limited to 3-5 words.
  2. It depends on your own requirements: do you need a 5-gram model, or a 3- or 4-gram one? Without more background knowledge, others cannot answer the question. You can try different values and see which one gives you the best tradeoff between results, performance, and resource usage.
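
To illustrate point 1, here is a minimal Python sketch of how a sentence longer than N still contributes N-grams: a window of N tokens simply slides across it. This is a simplified illustration, not KenLM's actual counting code. The `ngrams` helper and the example sentence are made up for this post, and the `<s>` / `</s>` markers stand in for the usual sentence-boundary tokens.

```python
def ngrams(tokens, n):
    """Return all contiguous n-token windows of `tokens` as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Hypothetical 10-word sentence, wrapped in sentence-boundary markers.
sentence = "the quick brown fox jumps over the lazy sleeping dog".split()
tokens = ["<s>"] + sentence + ["</s>"]

for order in (3, 4, 5):
    windows = ngrams(tokens, order)
    print(f"order {order}: {len(windows)} n-grams, first = {windows[0]}")
```

Under these assumptions, an 8-15 word sentence yields several windows at every order from 3 to 5, so it is usable training data for any of them. A sentence shorter than the window yields no full-length window in this sketch, which is one reason very short sentences carry less information for the higher orders.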

So the value 5 in the command ./lmplz --text text.txt --arpa text2.arpa --o 5 is the N in the N-gram LM, right? @eggonlea