In modern dialogue systems, the broad aim is to extract from speech acts (whether spoken or textual) the semantics necessary to determine a reply that is appropriate or optimal, given some desired end.

Semantic understanding is typically limited to classifying a user's intent and extracting values to fill slots, but it ignores both (i) linguistic and paralinguistic cues that might add emotional, personality, or psychological relevance to the speech act, and (ii) cues from vision and/or any other non-linguistic measurements of the user or the world around them.
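As a minimal illustration of the point above (the names and values here are hypothetical, not from any particular system), a conventional NLU component reduces an utterance to an intent label and slot values, discarding any affective signal carried by phrasing, prosody, or vision:

```python
# Hypothetical utterance; the trailing phrasing hints at user frustration.
utterance = "Book me a table for two at 7pm... if that's not too hard?!"

# Typical NLU output: an intent label plus slot values, nothing more.
nlu_frame = {
    "intent": "book_restaurant",
    "slots": {"party_size": "two", "time": "7pm"},
}

# Cues outside this frame -- exasperated wording, prosody in speech, or a
# facial expression -- carry affective information that the frame discards.
# A purely intent/slot representation has no field for them at all.
print(nlu_frame["intent"])           # book_restaurant
print(sorted(nlu_frame["slots"]))    # ['party_size', 'time']
```

The frame captures what the user wants done, but nothing about how the user feels while asking, which is exactly the gap discussed here.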

In general, such cues are needed for an agent to be fully informed and situationally aware before selecting the best action to take next. For instance, a user may become confused or frustrated at some point during an interaction; this information is hidden from a purely textual NLU, yet it is essential for an agent attempting to provide a good user experience.