Project Title: Super Rapid Annotator 2.0: Advanced Multimodal Video Annotation Agent
Mentors: Raúl Sánchez Sánchez ([email protected]), Manish Kumar Thota ([email protected]), Cristobal Pagán Cánovas ([email protected]), Rosa Illán Castillo ([email protected])
Project Size: Medium (175-hour project)
Difficulty Level: Medium
Objective:
Building upon the foundation of the previous Super Rapid Annotator project (https://github.com/manishkumart/Super-Rapid-Annotator-Multimodal-Annotation-Tool), this initiative aims to develop an advanced annotation agent that leverages state-of-the-art multimodal large language models (MLLMs) and reasoning models. The agent will process videos and generate structured CSV outputs for annotation, and will be operable from a command-line interface (CLI) or directly from Python.
We have an existing tool called Rapid Annotator (https://sites.google.com/case.edu/techne-public-site/red-hen-rapid-annotator). Students upload a set of videos and watch them one by one, annotating features such as whether the person is indoors or outdoors, whether they wear glasses, and so on. We want to automate this process where possible, sparing students the repetitive work, and produce a resulting CSV with all the annotations using a multimodal model.
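As a concrete illustration, here is a minimal Python sketch of how the agent's CLI entry point might look. The flag names and the annotate_folder helper are hypothetical, not part of the existing codebase:

import argparse

def annotate_folder(videos_dir: str, schema_path: str, out_csv: str) -> None:
    # Placeholder: the real agent would run the multimodal model over
    # every video; a fuller sketch appears after the example workflow below.
    raise NotImplementedError

def main() -> None:
    parser = argparse.ArgumentParser(description="Super Rapid Annotator 2.0")
    parser.add_argument("--videos", required=True, help="folder with video files")
    parser.add_argument("--schema", required=True, help="annotation schema (JSON)")
    parser.add_argument("--out", default="annotations.csv", help="output CSV path")
    args = parser.parse_args()
    annotate_folder(args.videos, args.schema, args.out)

if __name__ == "__main__":
    main()

The same annotate_folder function could be imported and called from Python, covering both modes of operation mentioned above.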
System Components:
Example Workflow:
Input:
Folder with video files
Annotation schema:
[
  {
    "description": "Is the person in the image standing?",
    "value": "standing"
  },
  {
    "description": "Are the person's hands visible?",
    "value": "hands_visible"
  },
  {
    "description": "Is the setting indoors or outdoors?",
    "value": "indoor"
  },
  {
    "description": "Does the word 'touch' in the video transcription have a physical or an emotional sense?",
    "value": "physicaltouch"
  }
]
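A minimal sketch, assuming the schema above is saved as schema.json, of how the agent could load it and derive the CSV columns; load_schema is a hypothetical helper:

import json

def load_schema(path: str) -> list[dict]:
    with open(path, encoding="utf-8") as f:
        items = json.load(f)
    # Each item pairs a natural-language question ("description") with
    # the CSV column it populates ("value").
    for item in items:
        assert {"description", "value"} <= item.keys(), f"malformed item: {item}"
    return items

columns = [item["value"] for item in load_schema("schema.json")]
print(columns)  # ['standing', 'hands_visible', 'indoor', 'physicaltouch']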
Output:
CSV file:
video_file,standing,hands_visible,indoor,physicaltouch
video1.mp4,true,true,false,true
video2.mp4,true,false,false,true
video3.mp4,false,true,false,true
video4.mp4,true,true,true,true
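A minimal end-to-end sketch of how such a CSV could be produced. Here ask_model stands in for whichever multimodal model backend is chosen; everything about it is an assumption, not an existing API:

import csv
import json
from pathlib import Path

def ask_model(video: Path, question: str) -> bool:
    # Hypothetical: send the video plus one schema question to the MLLM
    # and coerce its answer to a boolean.
    raise NotImplementedError

def annotate_folder(videos_dir: str, schema_path: str, out_csv: str) -> None:
    schema = json.loads(Path(schema_path).read_text(encoding="utf-8"))
    columns = [item["value"] for item in schema]
    with open(out_csv, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["video_file", *columns])
        writer.writeheader()
        for video in sorted(Path(videos_dir).glob("*.mp4")):  # one row per video
            row = {"video_file": video.name}
            for item in schema:
                row[item["value"]] = ask_model(video, item["description"])
            writer.writerow(row)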
Output example: [example of a resultant .csv with 10 annotated videos]