Creating a language model with KenLM

bem0302 · April 30, 2019, 3:25pm

Hi guys, I'm creating a language model with KenLM, and I know that KenLM uses an N-gram model. I have 2 questions about this:

When I build an *.arpa file from a text.txt file, do all the sentences in text.txt need to be 3 to 5 words long to get the best LM? I ask because my text has about 12,000 sentences, and more than 80% of them are about 8-15 words long.
I’m using this command to build the *.arpa file: ./lmplz --text text.txt --arpa text2.arpa --o 5. Do I need to change the value of the last parameter (currently 5) to some other value like 3 or 4, based on my data as described above?

bem0302 · May 2, 2019, 3:49am

Can anyone help me, please?

eggonlea · May 2, 2019, 11:09pm

I don’t think sentence length has a 3-5 word limitation.
It depends on your own requirements: whether you need a 5-gram or a 3/4-gram model. Without more background on your use case, others cannot answer the question. You can try different parameters and see which one gives you the best tradeoff between result quality, performance, and resource usage.
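
To illustrate why sentence length is not tied to the N-gram order, here is a minimal Python sketch of how N-word windows slide over a longer sentence. It is not KenLM code: the `ngrams` helper and the example sentence are made up for illustration, and it ignores the sentence-boundary tokens and smoothing that a real toolkit applies.

```python
def ngrams(tokens, n):
    """Return every contiguous window of n tokens from the list."""
    # A sentence shorter than n yields no windows of that size.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Hypothetical 10-word sentence, similar in length to the 8-15 word
# sentences described in the question.
sentence = "the quick brown fox jumps over the lazy dog again".split()

for order in (3, 4, 5):
    windows = ngrams(sentence, order)
    # Print the order, how many windows it produced, and the first one.
    print(order, len(windows), windows[0])
```

A 10-word sentence contributes 8 windows of 3 words, 7 of 4 words, and 6 of 5 words, so longer sentences simply supply more N-grams. The order mostly changes how much context each window captures and how large the resulting model gets.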

bem0302 · May 3, 2019, 1:30am

So the value 5 in the command ./lmplz --text text.txt --arpa text2.arpa --o 5 is the N in the N-gram LM, right? @eggonlea