I tested the link you sent me, but I’m stuck at one point. It may be something silly, because it seems that way to me.
I successfully created my unigram using:
but when I want to create the lm.binary, it fails:
/bin/build_binary -T -s words.arpa lm.binary
Reading words.arpa
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
****************************************************************************************************
lm/read_arpa.cc:85 in void lm::ReadNGramHeader(util::FilePiece&, unsigned int) threw FormatLoadException'.
Was expecting n-gram header \1-grams: but got \end\ instead Byte: 209
ERROR
Did I do something wrong somewhere? I don’t know if you have tested it yet, but if you have any tips or insights, they’d be helpful.
Thanks !
EDIT: It seems to be an error when reading the words.arpa file: it didn’t read the \1-grams: header. I tried adding another one, and it gets further but throws an error because there are now two \1-grams: headers…
kenlm-master/bin/build_binary -T -s words.arpa lm.binary
Reading words.arpa
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
****************************************************************************************************
Could not parse "\1-grams:" into a number in the 1-gram at byte 38 Byte: 38
ERROR
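For anyone hitting the same pair of errors, a quick structural check of the ARPA file may help narrow things down. An ARPA file normally has a `\data\` block with one `ngram N=count` line per order, then one `\N-grams:` section per order, then a closing `\end\`. The first error above reads as if the parser reached `\end\` where it expected the `\1-grams:` header, and the second as if a `\1-grams:` line was read as a unigram entry. The sketch below is only a guess at a useful diagnostic: `check_arpa` is a hypothetical helper, not part of KenLM, and it checks the layout only, not the probabilities.

```python
import re

def check_arpa(text):
    # Hypothetical helper (not part of KenLM): returns a list of
    # structural problems found in ARPA-format text.
    problems = []
    declared = {}    # order -> count declared in the \data\ block
    found = {}       # order -> number of entries actually seen
    current = None   # order of the section currently being read
    seen_end = False
    for line in text.splitlines():
        line = line.strip()
        if not line or line == "\\data\\":
            continue
        m = re.fullmatch(r"ngram (\d+)=(\d+)", line)
        if m and current is None:
            declared[int(m.group(1))] = int(m.group(2))
            continue
        m = re.fullmatch(r"\\(\d+)-grams:", line)
        if m:
            order = int(m.group(1))
            if order in found:
                problems.append(f"duplicate \\{order}-grams: header")
            found.setdefault(order, 0)
            current = order
            continue
        if line == "\\end\\":
            seen_end = True
            break
        if current is None:
            problems.append(f"unexpected line before any section: {line!r}")
        else:
            found[current] += 1
    for order, count in sorted(declared.items()):
        if order not in found:
            problems.append(f"missing \\{order}-grams: section")
        elif found[order] != count:
            problems.append(
                f"{order}-grams: declared {count}, found {found[order]}"
            )
    if not seen_end:
        problems.append("missing \\end\\ marker")
    return problems

# A tiny, made-up unigram file with the expected layout.
SAMPLE = "\n".join([
    "\\data\\",
    "ngram 1=2",
    "",
    "\\1-grams:",
    "-0.30103\tsusan",
    "-0.30103\tmichael",
    "",
    "\\end\\",
])

if __name__ == "__main__":
    print(check_arpa(SAMPLE) or "no structural problems found")
```

Running it on the real file would be something like `check_arpa(open("words.arpa", encoding="utf-8").read())`. An empty list only means the layout looks plausible; KenLM may still reject the file for other reasons.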
Earlier I misread what you’d proposed, but now I think I get it. This seems like a really useful idea to look into. However, what about a third option:
#3: Mixtures of names (first or last) optionally with a little context.
A couple of reasons:
There are increasing numbers of names that are not definitively a first or a last name (Hunter, Jackson, James, Carter, Cooper, etc.).
It could easily be useful to handle lists of names (“Has anyone seen Susan or Michael?”), and names are sometimes listed in reverse order (“Smith, Luke, reporting for duty”).
Isn’t the main point of the LM to determine the probability of a word given the sentence? Wouldn’t option 1 just bias it towards first names followed by last names? Also, as names are going to be fairly mixed up in reality, it’s not going to help the system make a massively better guess at which name comes next: the first name won’t have much predictive power for the second (I admit there would be some, but it’s limited). I guess a pair of names from two distinct cultures might seem marginally less likely (“Marmaduke Zhao”), and a culturally closer pair (“Rebecca Jones”) a little more likely, but that has to be a fairly limited effect in lots of modern contexts (certainly in diverse international cities).
One last general problem with names is that their pronunciation can drift a long way from how they’re written, e.g. St. John (“Sinjun”), Cholmondeley (“Chumley”) and Beauchamp (“Beacham”). I doubt the LM alone could do anything for those! Not that this should dishearten you.
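To make option #3 a bit more concrete, here is a minimal sketch of how a mixed training corpus could be generated. Everything in it is an assumption for illustration: the templates, the `make_corpus` name and the lower-casing and punctuation-free output are not from any existing tool, and a real corpus would need far more names and contexts.

```python
import random

# Hypothetical templates; {a} and {b} are filled with any name,
# without caring whether it is a first or a last name.
TEMPLATES = [
    "{a}",
    "{a} {b}",
    "has anyone seen {a} or {b}",
    "{a} {b} reporting for duty",
    "this is {a}",
]

def make_corpus(names, n_lines, seed=0):
    # Build n_lines of lower-cased text, one sentence per line, in the
    # shape an n-gram LM toolkit would typically take as input.
    rng = random.Random(seed)
    lines = []
    for _ in range(n_lines):
        a, b = rng.sample(names, 2)  # two distinct names
        template = rng.choice(TEMPLATES)
        lines.append(template.format(a=a, b=b).lower())
    return lines

if __name__ == "__main__":
    names = ["Susan", "Michael", "Smith", "Luke", "Hunter", "Jackson"]
    for line in make_corpus(names, 5):
        print(line)
```

Because any name can land in either slot, the resulting LM should not learn a strong "first name, then last name" ordering, which is the bias raised above for option 1.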
This is an interesting proposal. It would be really awesome if we can make it work.
I tried different ASR systems from Google and some other vendors. They all have the same issue you mentioned above. When I say Rose in a sentence, they don’t know if it’s a name or a kind of flower.
The difficulty is that this is outside the scope of an LM. Analyzing the context around names is such a complex task that I guess another powerful NLP model is required, or you need a huge dataset to train the model to cover so many possibilities.
It’s really beyond what I can do right now. That’s why I want to have a standalone LM for names first.
BTW, I’ve been looking for sources on names and also for challenging cases, so I hope to gather some of that together when I next get some time. Earlier I stumbled upon this Wikipedia page, but it does sound like just dealing with regular standalone names, as you suggest, would be plenty to be getting on with.
There’s more I’d like to do to make the list more comprehensive, but for now I took a slightly “quick & dirty” approach to getting a list of names with IPA details, which I’ve put into a gist here:
It was produced by extracting the phonemic spelling of all words in Wiktionary, then keeping the capitalised ones (i.e. proper nouns), and then filtering out cases which obviously weren’t used as names for people. There’s some ambiguity around e.g. place names that are also people’s names, so I’ve generally kept those in (but there are probably lots missing). To keep things quick I went with instinct and fairly minimal checking, so there could well be some foreign terms that sound to me like people’s names but are in fact some other kind of proper noun.
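The filtering described above could look roughly like the sketch below. It is not the script used for the gist: it assumes the Wiktionary data has already been reduced to (word, IPA) pairs, `candidate_names` is a hypothetical name, and the sample transcriptions are illustrative values rather than entries taken from the gist.

```python
def candidate_names(entries, not_names=frozenset()):
    # entries: iterable of (word, ipa) pairs.
    # not_names: hand-made set of capitalised words to drop
    # (days, months and so on).
    kept = {}
    for word, ipa in entries:
        if not word[:1].isupper():
            continue  # keep capitalised words only (likely proper nouns)
        if word in not_names:
            continue  # manual filter for obvious non-names
        kept.setdefault(word, []).append(ipa)
    return kept

# Illustrative sample only.
SAMPLE_ENTRIES = [
    ("rose", "/ɹəʊz/"),
    ("Rose", "/ɹəʊz/"),
    ("Susan", "/ˈsuːzən/"),
    ("Tuesday", "/ˈtjuːzdeɪ/"),
]

if __name__ == "__main__":
    names = candidate_names(SAMPLE_ENTRIES, not_names={"Tuesday"})
    for word, ipas in sorted(names.items()):
        print(word, ", ".join(ipas))
```

The manual `not_names` set is where the "instinct and minimal checking" step lives, so it is also where most mistakes would creep in.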
If I get more time I might try to explore other sources (e.g. cast lists from films, authors of academic papers) to build lists of individual first and last names that can then be cross-checked against this list. This would also be useful for getting a sense of which names are more or less frequent (e.g. some I recognise as names, but they’re obscure or not widely used).
I suspect the list above will have a bias toward “western” names (esp. English and, to a degree, European ones), with under-representation of African and Asian names (given they’re less common in my source and I have slightly less experience recognising them as names). It also seems (to me at least) to under-represent last names somewhat.
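A rough sketch of that cross-check, assuming the other sources can be reduced to plain "First Last" strings: `cross_check` is a hypothetical helper, the whitespace split is a simplification that will mishandle multi-word names, and the sample data is made up.

```python
from collections import Counter

def cross_check(known_names, observed_full_names):
    # Count how often each name part appears in the observed full names,
    # then split the counts by whether the part is already in the list.
    counts = Counter()
    for full in observed_full_names:
        for part in full.split():
            counts[part] += 1
    known = set(known_names)
    confirmed = {n: c for n, c in counts.items() if n in known}
    unlisted = {n: c for n, c in counts.items() if n not in known}
    return confirmed, unlisted

if __name__ == "__main__":
    known = {"Rebecca", "Jones", "Luke"}
    observed = ["Rebecca Jones", "Luke Smith", "Rebecca Smith"]
    confirmed, unlisted = cross_check(known, observed)
    print("confirmed:", confirmed)
    print("unlisted:", unlisted)
```

The counts in `confirmed` give a crude frequency signal, and `unlisted` points at names the list is missing.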