The idea relies on mapping relationships between detected/recognised objects or users and their activity in the audio-visual scene, so that a user's 2D or 3D pose can be inferred automatically and the belief maintained over longer time spans even when the audio-visual data is only partially observable (e.g. the user is not visible within the camera frustum, but the microphone array can still recognise and localise the sounds she makes while walking, speaking or performing some other action).

For example: the device recognises an object/user, but this object then leaves the camera view (e.g. due to a limited field of view, occlusion, etc.). The device keeps tracking the user during this period based on the natural sounds she makes (noise, speech, ...). It can therefore infer when the object/user re-enters the camera view and successfully perform data association.
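The behaviour described above can be sketched as a minimal belief-maintenance loop. This is an illustrative toy, not the actual system: the class name, the scalar uncertainty model, the blending weight, and the gating threshold are all assumptions chosen for clarity; a real implementation would use a proper probabilistic filter (e.g. a Kalman or particle filter) over pose.

```python
import math
from dataclasses import dataclass
from typing import Optional


@dataclass
class Belief:
    """2D position belief with a scalar uncertainty radius (hypothetical units)."""
    x: float
    y: float
    sigma: float  # grows while only coarse or no observations arrive


class AudioVisualTracker:
    """Toy tracker: keeps a position belief alive from microphone-array fixes
    while the user is outside the camera frustum, then re-associates when a
    visual detection reappears (nearest-neighbour gating on the belief)."""

    def __init__(self, gate_sigmas: float = 3.0):
        self.belief: Optional[Belief] = None
        self.gate_sigmas = gate_sigmas

    def update_visual(self, x: float, y: float) -> bool:
        """Precise camera detection. Returns True if it associates with the
        existing belief (same user), False if it starts a new track."""
        if self.belief is None:
            self.belief = Belief(x, y, sigma=0.1)
            return False
        dist = math.hypot(x - self.belief.x, y - self.belief.y)
        associated = dist <= self.gate_sigmas * self.belief.sigma
        self.belief = Belief(x, y, sigma=0.1)  # vision resets uncertainty
        return associated

    def update_audio(self, bearing_rad: float, range_est: float) -> None:
        """Coarse acoustic fix (bearing plus rough range). Nudges the belief
        toward the acoustic estimate but keeps the uncertainty high, since
        audio localisation is much noisier than a visual detection."""
        if self.belief is None:
            return
        ax = range_est * math.cos(bearing_rad)
        ay = range_est * math.sin(bearing_rad)
        w = 0.3  # blend gently: audio bounds the belief, it does not pin it
        self.belief = Belief(
            (1 - w) * self.belief.x + w * ax,
            (1 - w) * self.belief.y + w * ay,
            sigma=max(self.belief.sigma, 0.5),
        )

    def predict(self, dt: float, walk_speed: float = 1.5) -> None:
        """No observation this step: uncertainty grows with possible motion."""
        if self.belief is not None:
            self.belief.sigma += walk_speed * dt
```

A usage sequence mirroring the example: the user is seen once, leaves the frustum, is followed acoustically for a while, and on re-entering the camera view the new detection falls inside the gate, so the tracker re-associates rather than spawning a new identity.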