Two requirements must be met to develop a practical multimodal interface system: (1) integration of data that arrives with delays, and (2) resolution of ambiguity in the recognition results of each modality. This paper presents an efficient and generic methodology for interpreting multimodal input that satisfies these requirements. The proposed methodology integrates delayed-arrival data satisfactorily and efficiently interprets multimodal input that contains ambiguity. In the proposed approach, the multimodal interpretation process is regarded as hypothetical reasoning, and the control mechanism of interpretation is formalized by applying the assumption-based truth maintenance system (ATMS). The proposed method is applied to an interface agent system that accepts multimodal input consisting of voice and direct-indication gestures on a touch display. The system communicates with the user through a human-like interface agent rendered as a three-dimensional motion image with facial expressions, gestures, and a synthesized voice.
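
As a rough illustration only (not the authors' implementation), the following Python sketch shows the flavor of ATMS-style interpretation described above: each recognition candidate from a modality is treated as an assumption with a confidence score, a joint interpretation is a consistent set of assumptions, and late-arriving data from another modality simply adds assumptions, after which consistent interpretations are re-derived. All names (Assumption, consistent, interpretations) and the example candidates are hypothetical.

    from dataclasses import dataclass
    from itertools import product
    from typing import FrozenSet, List

    @dataclass(frozen=True)
    class Assumption:
        modality: str   # e.g. "voice" or "gesture"
        value: str      # recognition candidate, e.g. "delete" or "icon_3"
        score: float    # recognizer confidence

    def consistent(env: FrozenSet[Assumption]) -> bool:
        # An environment is inconsistent (a "nogood" in ATMS terms) if it
        # contains two different candidates from the same modality.
        seen = {}
        for a in env:
            if seen.setdefault(a.modality, a.value) != a.value:
                return False
        return True

    def interpretations(candidates: List[List[Assumption]]) -> List[FrozenSet[Assumption]]:
        # Enumerate consistent environments across modalities, best first.
        envs = [frozenset(combo) for combo in product(*candidates)]
        envs = [e for e in envs if consistent(e)]
        return sorted(envs, key=lambda e: -sum(a.score for a in e))

    # Voice candidates arrive first; gesture candidates arrive later.
    voice = [Assumption("voice", "delete", 0.8), Assumption("voice", "move", 0.5)]
    print(interpretations([voice]))            # partial interpretation from voice alone
    gesture = [Assumption("gesture", "icon_3", 0.9)]
    print(interpretations([voice, gesture]))   # re-derived after delayed gesture data

In this sketch, delayed arrival is handled by recomputing the ranked interpretations whenever a new modality's candidates become available, and ambiguity is handled by keeping all consistent candidate combinations ordered by combined confidence rather than committing to a single reading early.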