Inputs and Data Preparation for Multimodal LLMs #
Multimodal LLMs are language models that can process and reason over multiple data types, most commonly:
- Text
- Images
- (Optionally: audio, video, or other modalities)
They are designed to understand both visual and linguistic context, enabling tasks like visual question answering, image captioning, grounding, and perception-based reasoning.
🖼️ + 💬 Input Format #
Inputs typically include:
- Image(s): RGB images, optionally annotated (e.g., bounding boxes, circles)
- Text Prompt: Task instruction or question (e.g., “Which object is closer?”)
- Answer Choices (optional): For classification-style tasks like BLINK
```python
inputs = {
    "images": [...],  # preprocessed (resized, normalized) tensors or raw image paths
    "text": "Which point is closer to the camera? (A) A (B) B"
}
```
Some APIs accept JSON-style mixed prompts with interleaved text and image tokens.
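As an illustration, here is a minimal sketch of such an interleaved message, assuming an OpenAI-style content list (the `type`, `text`, and `image_url` field names vary across providers and are an assumption here, not a universal schema):

```python
import base64

def encode_image(path: str) -> str:
    """Read an image file and return a base64 string suitable for embedding in a request."""
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

# Interleaved text + image content (OpenAI-style schema; other APIs use
# different keys for the same idea).
message = {
    "role": "user",
    "content": [
        {"type": "text", "text": "Which point is closer to the camera?"},
        {"type": "image_url",
         "image_url": {"url": f"data:image/jpeg;base64,{encode_image('img001.jpg')}"}},
        {"type": "text", "text": "(A) A (B) B"},
    ],
}
```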
Data Preparation Pipeline #
- Image Collection: Use open datasets (COCO, LVIS, IIW, WikiArt) or your own; resize consistently (e.g., 224x224 or 1024px).
- Visual Prompt Annotation: Add circles (keypoints), boxes (objects), or masks (regions) using tools like OpenCV, CVAT, or FiftyOne (see the sketch after this list).
- Text Prompt Design: Write clear, natural, or templated questions.
  - e.g., “Which image completes the jigsaw?”
  - e.g., “Is the laptop to the left of the bear?”
- Label Encoding:
  - Classification: (A), (B), (C), (D)
  - Generation: free-text string
  - Evaluation: ground-truth match or similarity
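The sketch below walks these steps for a single point-correspondence example using OpenCV: resize the image, draw the reference point and labeled candidates, and pair the result with a templated question and answer letter. The file name, coordinates, 224x224 target size, and question template are illustrative assumptions, not a fixed format.

```python
import cv2  # pip install opencv-python

def prepare_example(image_path, ref_point, candidates, answer, size=224):
    """Resize an image, draw visual prompts, and pair it with a templated question."""
    img = cv2.imread(image_path)          # returns None if the path is wrong
    h, w = img.shape[:2]
    img = cv2.resize(img, (size, size))

    # Rescale point coordinates to the resized image.
    sx, sy = size / w, size / h
    def scale(p):
        return int(p[0] * sx), int(p[1] * sy)

    # Visual prompt annotation: red circle for the reference point,
    # blue labeled circles (A, B, C, ...) for each candidate.
    cv2.circle(img, scale(ref_point), 6, (0, 0, 255), 2)
    for i, pt in enumerate(candidates):
        x, y = scale(pt)
        cv2.circle(img, (x, y), 6, (255, 0, 0), 2)
        cv2.putText(img, chr(ord("A") + i), (x + 8, y),
                    cv2.FONT_HERSHEY_SIMPLEX, 0.5, (255, 0, 0), 1)

    # Templated text prompt and classification-style label.
    options = " ".join(f"({chr(ord('A') + i)}) {chr(ord('A') + i)}"
                       for i in range(len(candidates)))
    prompt = f"Which point corresponds to the reference point (REF)? {options}"
    return {"image": img, "prompt": prompt, "answer": answer}

example = prepare_example("img001.jpg", ref_point=(120, 80),
                          candidates=[(40, 60), (200, 150), (90, 210), (180, 30)],
                          answer="C")
```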
Example Entry (BLINK-style) #
```json
{
  "image_1": "img001.jpg",
  "image_2": "img002.jpg",
  "prompt": "Which point corresponds to the reference point (REF)? (A) A (B) B (C) C (D) D",
  "visual_prompts": {
    "ref_point": [x1, y1],
    "candidates": [[x2, y2], [x3, y3], [x4, y4], [x5, y5]]
  },
  "answer": "C"
}
```
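Entries like this can be scored by exact match against the ground-truth letter. The sketch below assumes the entries are stored one per line in a JSONL file and that `query_model` is a hypothetical callable that sends the images and prompt to a multimodal LLM and returns its free-text reply:

```python
import json
import re

def extract_choice(reply: str):
    """Pull the first option letter, e.g. '(C)' or a bare 'C', out of a free-text reply."""
    match = re.search(r"\(([A-D])\)|\b([A-D])\b", reply)
    return (match.group(1) or match.group(2)) if match else None

def evaluate(jsonl_path: str, query_model) -> float:
    """Score BLINK-style entries by exact match between extracted and gold answers.

    `query_model(entry)` is a hypothetical callable supplied by the caller.
    """
    correct = total = 0
    with open(jsonl_path) as f:
        for line in f:
            entry = json.loads(line)
            reply = query_model(entry)
            correct += int(extract_choice(reply) == entry["answer"])
            total += 1
    return correct / max(total, 1)
```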
Use Cases #
- Visual Question Answering (VQA)
- Visual Grounding & Alignment
- Perception-based Evaluation (e.g., BLINK)
- Medical Image Reasoning
- Image Captioning / Retrieval