Project Title: Super Rapid Annotator 2.0: Advanced Multimodal Video Annotation Agent
Mentors: Raúl Sánchez Sánchez ([email protected]), Manish Kumar Thota ([email protected]), Cristobal Pagán Cánovas ([email protected]), Rosa Illán Castillo ([email protected])
Project Size: Medium (175-hour project)
Difficulty Level: Medium
Objective:
Building upon the foundation of the previous Super Rapid Annotator project (https://github.com/manishkumart/Super-Rapid-Annotator-Multimodal-Annotation-Tool), this initiative aims to develop an advanced annotation agent that leverages state-of-the-art multimodal large language models (MLLMs) and reasoning models. The agent will process videos and generate structured CSV outputs for annotation, and will be operable from a command-line interface (CLI) or directly from Python.
We have an existing tool called Rapid Annotator (https://sites.google.com/case.edu/techne-public-site/red-hen-rapid-annotator). Students upload a set of videos and watch them one by one, annotating features such as whether the person is indoors or outdoors, whether they wear glasses, and so on. We want to automate this process where possible, sparing students the repetitive work, and produce a resulting CSV with all the annotations using a multimodal model.
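As a concrete illustration, here is a minimal Python sketch of how the agent's CLI entry point might look. The flag names and the annotate_folder helper are hypothetical, not part of the existing codebase:

import argparse

def annotate_folder(videos_dir: str, schema_path: str, out_csv: str) -> None:
    # Placeholder: the real agent would run the multimodal model over
    # every video; a fuller sketch appears after the example workflow below.
    raise NotImplementedError

def main() -> None:
    parser = argparse.ArgumentParser(description="Super Rapid Annotator 2.0")
    parser.add_argument("--videos", required=True, help="folder with video files")
    parser.add_argument("--schema", required=True, help="annotation schema (JSON)")
    parser.add_argument("--out", default="annotations.csv", help="output CSV path")
    args = parser.parse_args()
    annotate_folder(args.videos, args.schema, args.out)

if __name__ == "__main__":
    main()

The same annotate_folder function could be imported and called from Python, covering both modes of operation mentioned above.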
System Components:
Example Workflow:
Input:
Folder with video files
Annotation schema:
[
  {
    "description": "Is the person in the image standing?",
    "value": "standing"
  },
  {
    "description": "Are the person's hands visible?",
    "value": "hands_visible"
  },
  {
    "description": "Is the setting indoors or outdoors?",
    "value": "indoor"
  },
  {
    "description": "Does the word 'touch' in the video transcription have a physical or an emotional sense?",
    "value": "physicaltouch"
  }
]
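A minimal sketch, assuming the schema above is saved as schema.json, of how the agent could load it and derive the CSV columns; load_schema is a hypothetical helper:

import json

def load_schema(path: str) -> list[dict]:
    with open(path, encoding="utf-8") as f:
        items = json.load(f)
    # Each item pairs a natural-language question ("description") with
    # the CSV column it populates ("value").
    for item in items:
        assert {"description", "value"} <= item.keys(), f"malformed item: {item}"
    return items

columns = [item["value"] for item in load_schema("schema.json")]
print(columns)  # ['standing', 'hands_visible', 'indoor', 'physicaltouch']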
Output:
CSV file:
video_file,standing,hands_visible,indoor,physicaltouch
video1.mp4,true,true,false,true
video2.mp4,true,false,false,true
video3.mp4,false,true,false,true
video4.mp4,true,true,true,true
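A minimal end-to-end sketch of how such a CSV could be produced. Here ask_model stands in for whichever multimodal model backend is chosen; everything about it is an assumption, not an existing API:

import csv
import json
from pathlib import Path

def ask_model(video: Path, question: str) -> bool:
    # Hypothetical: send the video plus one schema question to the MLLM
    # and coerce its answer to a boolean.
    raise NotImplementedError

def annotate_folder(videos_dir: str, schema_path: str, out_csv: str) -> None:
    schema = json.loads(Path(schema_path).read_text(encoding="utf-8"))
    columns = [item["value"] for item in schema]
    with open(out_csv, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["video_file", *columns])
        writer.writeheader()
        for video in sorted(Path(videos_dir).glob("*.mp4")):  # one row per video
            row = {"video_file": video.name}
            for item in schema:
                row[item["value"]] = ask_model(video, item["description"])
            writer.writerow(row)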
Output example: [example of a resultant .csv with 10 annotated videos]