VLM-Guided Toddler Behavior Recognition from Semi-Structured Triadic Play Videos
Accurate recognition of behaviors exhibited by children with Autism Spectrum Disorder (ASD) is critical for early detection and timely intervention. Characterization of such behaviors supports the development of downstream ASD prediction models toward early diagnosis. Key behaviors relevant to the prediction of ASD can be categorized into three primary groups: facial attribute-based behaviors, social interaction-related behaviors, and play-based behaviors. In this study, we focus on the recognition of three key behaviors relevant for ASD prediction: gaze (facial attribute-based), imitation (social interaction-based), and functional play (play-based), using videos collected in a semi-structured triadic interaction setting. We first establish a strong vision-only baseline using a Video Swin Transformer for spatial feature extraction and a Long Short-Term Memory (LSTM) network for temporal modeling. Classification using vision-only features achieves an accuracy of 73%. Building on this foundation, we introduce a multimodal framework that integrates language understanding through Vision-Language Models (VLMs). Leveraging recent advancements in zero-shot inference, we propose a two-step prompting strategy for VLMs such as Video-LLaVA, LLaVA-Next-Video, and Gemma-3 to extract behavior-relevant textual cues from the videos. A late fusion mechanism is employed to combine visual features from the vision-only path with textual cues from the VLMs, thereby enhancing behavior recognition. Experimental results demonstrate that integrating VLM-derived cues, particularly those from LLaVA-Next-Video, yields a 6% improvement in recognition accuracy over the vision-only baseline, highlighting the effectiveness of VLM-guided behavior analysis in ASD research.
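To make the described pipeline concrete, the sketch below illustrates one way the late-fusion design could be wired up: per-segment spatial features from a Video Swin backbone, an LSTM over the segment sequence, and VLM-derived textual cues (assumed here to be precomputed sentence embeddings) concatenated with the pooled visual representation before classification. This is a minimal illustration, not the authors' implementation; the torchvision Swin3D-T backbone, the segment-based input layout, the hidden sizes, the text-embedding dimension, and the three-class head (gaze, imitation, functional play) are all illustrative assumptions.

```python
# Minimal late-fusion sketch (illustrative, not the paper's code).
# Assumes clips are pre-split into S short segments per video and that the
# VLM textual cues have already been embedded by a sentence encoder.
import torch
import torch.nn as nn
from torchvision.models.video import swin3d_t, Swin3D_T_Weights


class LateFusionBehaviorClassifier(nn.Module):
    def __init__(self, text_dim: int = 768, lstm_hidden: int = 256, num_classes: int = 3):
        super().__init__()
        # Pretrained Video Swin-T backbone; drop its Kinetics classification head
        self.backbone = swin3d_t(weights=Swin3D_T_Weights.KINETICS400_V1)
        feat_dim = self.backbone.head.in_features          # 768 for Swin3D-T
        self.backbone.head = nn.Identity()

        # Temporal modeling over the sequence of per-segment Swin features
        self.lstm = nn.LSTM(feat_dim, lstm_hidden, batch_first=True)

        # Project VLM-derived text embeddings into the fusion space
        self.text_proj = nn.Linear(text_dim, lstm_hidden)

        # Late fusion: concatenate pooled visual and textual representations
        self.classifier = nn.Linear(lstm_hidden * 2, num_classes)

    def forward(self, clips: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
        # clips: (B, S, 3, T, H, W) -- S segments of T frames per video
        b, s = clips.shape[:2]
        seg_feats = self.backbone(clips.flatten(0, 1)).view(b, s, -1)
        _, (h_n, _) = self.lstm(seg_feats)                  # h_n: (1, B, lstm_hidden)
        visual = h_n[-1]
        textual = torch.relu(self.text_proj(text_emb))      # text_emb: (B, text_dim)
        return self.classifier(torch.cat([visual, textual], dim=-1))


# Example forward pass with dummy inputs
model = LateFusionBehaviorClassifier()
clips = torch.randn(1, 2, 3, 16, 224, 224)    # 1 video, 2 segments of 16 frames
text_emb = torch.randn(1, 768)                # embedding of the VLM textual cue
logits = model(clips, text_emb)               # shape: (1, 3)
```

In this sketch, fusion happens only at the final representation level (feature concatenation before a linear head), which matches the late-fusion description in the abstract; how the VLM captions are generated via the two-step prompting strategy and how they are embedded are left outside the snippet.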