Here’s a random idea I’m hoping we can start brainstorming on. With any luck, someone with better knowledge of computer vision than I have can tell me whether it’s feasible.
Users holding a camera are going to move it slightly, whether they want to or not. As a consequence, successive frames see the scene from slightly different viewpoints, and objects close to the camera shift more between frames than distant ones (parallax) — so for nearby objects we get slightly more information than a single 2d picture. How hard would it be to extract some rough 3d information from those 2d pictures and use it to filter out big chunks of the background?
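For what it’s worth, the core trick (motion parallax: blocks belonging to near objects displace more between two frames than background blocks) can be sketched with naive block matching. Everything below — the synthetic scene, the block size, the thresholds — is made up for illustration; a real implementation would use a proper optical-flow or structure-from-motion method:

```python
import numpy as np

def estimate_shifts(f1, f2, block=8, max_shift=6):
    """Per-block horizontal displacement between two frames, via SSD block matching."""
    h, w = f1.shape
    shifts = np.zeros((h // block, w // block), dtype=int)
    for by in range(h // block):
        for bx in range(w // block):
            y, x = by * block, bx * block
            patch = f1[y:y + block, x:x + block]
            best_err, best_s = np.inf, 0
            for s in range(-max_shift, max_shift + 1):
                xs = x + s
                if xs < 0 or xs + block > w:
                    continue  # candidate window would fall outside the frame
                err = np.sum((patch - f2[y:y + block, xs:xs + block]) ** 2)
                if err < best_err:
                    best_err, best_s = err, s
            shifts[by, bx] = best_s
    return shifts

# Synthetic scene: a textured "far" background and a closer 16x16 patch.
# A 1-pixel camera move shifts the background by 1 px but the near patch by 4 px.
rng = np.random.default_rng(0)
bg = rng.random((64, 72))
fg = rng.random((16, 16))

def render(cam_dx):
    frame = np.roll(bg, cam_dx, axis=1).copy()
    fx = 24 + 4 * cam_dx  # near object: 4x the background parallax
    frame[24:40, fx:fx + 16] = fg
    return frame

f1, f2 = render(0), render(1)
shifts = estimate_shifts(f1, f2)
# Blocks moving much more than the background are flagged as "near" (foreground).
mask = np.abs(shifts) >= 3
```

In this toy setup the blocks covering the near patch report a displacement of 4 while background blocks report 1, so a simple threshold on the displacement magnitude separates them — which is exactly the kind of coarse background rejection the idea is after.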