Inputs and Data Preparation for Multimodal LLMs #
Multimodal LLMs are language models that can process and reason over multiple data types, most commonly:
- Text
- Images
- (Optionally: audio, video, or other modalities)
They are designed to understand both visual and linguistic context, enabling tasks like visual question answering, image captioning, grounding, and perception-based reasoning.
🖼️ + 💬 Input Format #
Inputs typically include:
- Image(s): RGB images, optionally annotated (e.g., bounding boxes, circles)
- Text Prompt: Task instruction or question (e.g., “Which object is closer?”)
- Answer Choices (optional): For classification-style tasks like BLINK
```python
inputs = {
    "images": [...],  # preprocessed (resized, normalized) tensors or raw image paths
    "text": "Which point is closer to the camera? (A) A (B) B"
}
```
Some APIs accept JSON-style mixed prompts with interleaved text and image tokens.
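As an illustration, here is a minimal sketch of such an interleaved message, assuming an OpenAI-style content list (the `type`, `text`, and `image_url` field names vary across providers and are an assumption here, not a universal schema):

```python
import base64

def encode_image(path: str) -> str:
    """Read an image file and return a base64 string suitable for embedding in a request."""
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

# Interleaved text + image content (OpenAI-style schema; other APIs use
# different keys for the same idea).
message = {
    "role": "user",
    "content": [
        {"type": "text", "text": "Which point is closer to the camera?"},
        {"type": "image_url",
         "image_url": {"url": f"data:image/jpeg;base64,{encode_image('img001.jpg')}"}},
        {"type": "text", "text": "(A) A (B) B"},
    ],
}
```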
Data Preparation Pipeline #
- Image Collection: Use open datasets (COCO, LVIS, IIW, WikiArt) or your own; resize consistently (e.g., 224x224 or 1024px).
- Visual Prompt Annotation: Add circles (keypoints), boxes (objects), or masks (regions) using tools like OpenCV, CVAT, or FiftyOne (see the sketch after this list).
- Text Prompt Design: Write clear, natural, or templated questions.
  - e.g., “Which image completes the jigsaw?”
  - e.g., “Is the laptop to the left of the bear?”
- Label Encoding:
  - Classification: (A), (B), (C), (D)
  - Generation: free-text string
  - Evaluation: ground-truth match or similarity
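The sketch below walks these steps for a single point-correspondence example using OpenCV: resize the image, draw the reference point and labeled candidates, and pair the result with a templated question and answer letter. The file name, coordinates, 224x224 target size, and question template are illustrative assumptions, not a fixed format.

```python
import cv2  # pip install opencv-python

def prepare_example(image_path, ref_point, candidates, answer, size=224):
    """Resize an image, draw visual prompts, and pair it with a templated question."""
    img = cv2.imread(image_path)          # returns None if the path is wrong
    h, w = img.shape[:2]
    img = cv2.resize(img, (size, size))

    # Rescale point coordinates to the resized image.
    sx, sy = size / w, size / h
    def scale(p):
        return int(p[0] * sx), int(p[1] * sy)

    # Visual prompt annotation: red circle for the reference point,
    # blue labeled circles (A, B, C, ...) for each candidate.
    cv2.circle(img, scale(ref_point), 6, (0, 0, 255), 2)
    for i, pt in enumerate(candidates):
        x, y = scale(pt)
        cv2.circle(img, (x, y), 6, (255, 0, 0), 2)
        cv2.putText(img, chr(ord("A") + i), (x + 8, y),
                    cv2.FONT_HERSHEY_SIMPLEX, 0.5, (255, 0, 0), 1)

    # Templated text prompt and classification-style label.
    options = " ".join(f"({chr(ord('A') + i)}) {chr(ord('A') + i)}"
                       for i in range(len(candidates)))
    prompt = f"Which point corresponds to the reference point (REF)? {options}"
    return {"image": img, "prompt": prompt, "answer": answer}

example = prepare_example("img001.jpg", ref_point=(120, 80),
                          candidates=[(40, 60), (200, 150), (90, 210), (180, 30)],
                          answer="C")
```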
Example Entry (BLINK-style) #
```json
{
  "image_1": "img001.jpg",
  "image_2": "img002.jpg",
  "prompt": "Which point corresponds to the reference point (REF)? (A) A (B) B (C) C (D) D",
  "visual_prompts": {
    "ref_point": [x1, y1],
    "candidates": [[x2, y2], [x3, y3], [x4, y4], [x5, y5]]
  },
  "answer": "C"
}
```
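Entries like this can be scored by exact match against the ground-truth letter. The sketch below assumes the entries are stored one per line in a JSONL file and that `query_model` is a hypothetical callable that sends the images and prompt to a multimodal LLM and returns its free-text reply:

```python
import json
import re

def extract_choice(reply: str):
    """Pull the first option letter, e.g. '(C)' or a bare 'C', out of a free-text reply."""
    match = re.search(r"\(([A-D])\)|\b([A-D])\b", reply)
    return (match.group(1) or match.group(2)) if match else None

def evaluate(jsonl_path: str, query_model) -> float:
    """Score BLINK-style entries by exact match between extracted and gold answers.

    `query_model(entry)` is a hypothetical callable supplied by the caller.
    """
    correct = total = 0
    with open(jsonl_path) as f:
        for line in f:
            entry = json.loads(line)
            reply = query_model(entry)
            correct += int(extract_choice(reply) == entry["answer"])
            total += 1
    return correct / max(total, 1)
```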
Use Cases #
- Visual Question Answering (VQA)
- Visual Grounding & Alignment
- Perception-based Evaluation (e.g., BLINK)
- Medical Image Reasoning
- Image Captioning / Retrieval