Multiple speakers within one record for training

Hello everyone!
Just wanted to find out how bad it would be to use records in which more than one speaker is talking.
Let me explain a bit more. I extract speech data from YouTube based on the manual subtitles provided. To collect as much data as possible in a short time, I perform almost no post-processing. The music, noise, and other acoustic effects are kept; I guess and hope this will lead to a more robust model. Am I wrong?
And since the subtitles contain no speaker information (who spoke when) and I leave them as they are, quite often multiple people's speech is present within a single record. How bad is that (if it is bad at all)? After all, I'm going to add this data to a clean dataset of 300 hours (with a single speaker per record).
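For context, here is roughly what my extraction step looks like: a minimal sketch, assuming standard SRT-style subtitles with `HH:MM:SS,mmm` timestamps and cutting clips with ffmpeg (the file names and paths here are just placeholders):

```python
# Sketch of subtitle-driven extraction: parse SRT cues and build one
# ffmpeg command per cue to cut that span of audio into its own record.
# Assumes standard "HH:MM:SS,mmm" SRT timestamps; paths are placeholders.
import re

def srt_time_to_seconds(ts: str) -> float:
    """Convert an SRT timestamp like '00:01:02,500' to seconds."""
    h, m, rest = ts.split(":")
    s, ms = rest.split(",")
    return int(h) * 3600 + int(m) * 60 + int(s) + int(ms) / 1000.0

def parse_srt(text: str):
    """Yield (start_sec, end_sec, subtitle_text) for each cue."""
    pattern = re.compile(
        r"(\d{2}:\d{2}:\d{2},\d{3})\s*-->\s*(\d{2}:\d{2}:\d{2},\d{3})\s*\n"
        r"(.*?)(?:\n\n|\Z)",
        re.S,
    )
    for start, end, body in pattern.findall(text):
        yield srt_time_to_seconds(start), srt_time_to_seconds(end), body.strip()

def ffmpeg_cut_cmd(src: str, start: float, end: float, dst: str) -> list:
    """Build an ffmpeg command that cuts [start, end) from src into dst."""
    return ["ffmpeg", "-ss", f"{start:.3f}", "-to", f"{end:.3f}",
            "-i", src, "-c", "copy", dst]
```

Since the cue boundaries come straight from the subtitles, a single cut can contain whoever happened to be speaking in that span, which is exactly where the multi-speaker records come from.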
Thank you all for the suggestions!