While validating content for Italian, I have come across a lot of low-quality sentences that seem to have made it past the sentence collector.
Should I hit no on those? Skip? I imagine it would be better if they could be taken out of circulation entirely, so that users don’t waste their time speaking them.
Some examples:
- Lots of foreign technical terms, some of them highly specific to software development (did someone upload technical documentation?). Even for more common terms, like open source, I’ve noticed that about half the speakers say /sɔɹs/ as if it were English, and half say /surs/, mimicking French. Surely this can’t be good for the dataset?
- Lots of foreign first and last names. For some, e.g. Bob or Obama, pronunciation will be obvious to everyone (?). Others, like Schoenberner or Veheran, get mangled in various inconsistent ways.
- Several sentences appear to be taken from a fantasy novel. They contain what must be made-up names. These look nothing like Italian words, and their pronunciation isn’t at all obvious, like Zipak (/'dzipak/? /tsi'pak/? /'zaipek/? no clue) or Bonard (pronounce it like English? French?).
- Very repetitive sentences. I must have come across at least 10 sentences from a novel which involve someone called De Vincenzi.
Can I give someone a list of the sentences like these that I come across?