Advancing Multimodal LLMs by Large-Scale 3D Visual Instruction Dataset Generation

Purdue University, Amazon

Generating a Synthetic Visual Instruction Dataset. Our framework uses open-sourced 3D assets to generate photo-realistic images with precisely controlled camera-object relations. Corresponding text instructions are generated by a Large Language Model (LLM). The resulting Ultimate3D dataset (240K) and benchmark (8K) advance baseline MLLMs (LLaVA-1.6, Llama-3.2-Vision, etc.) to outperform commercial MLLMs (GPT-4o, Claude-3-Sonnet, etc.) on camera-object relation recognition tasks.

Abstract

Multimodal Large Language Models (MLLMs) struggle to accurately capture camera-object relations, especially object orientation, camera viewpoint, and camera shot. This stems from the fact that existing MLLMs are trained on images with limited diversity in camera-object relations and in the corresponding textual descriptions. To address this, we propose a synthetic generation pipeline to create large-scale 3D visual instruction datasets. Our framework takes 3D assets as input and uses rendering and diffusion-based image generation models to create photorealistic images that preserve precise camera-object relations. Additionally, large language models (LLMs) are used to generate text prompts for guiding visual instruction tuning and controlling image generation. We create Ultimate3D, a dataset of 240K VQAs with precise camera-object annotations, together with a corresponding benchmark. MLLMs fine-tuned on our proposed dataset outperform commercial models by a large margin, achieving an average accuracy improvement of 33.4% on camera-object relation recognition tasks. Our code, dataset, and benchmark will contribute to broad MLLM applications.
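To make the camera-object relation concrete, below is a minimal sketch of how a 3D visual prior with a known camera pose could be rendered from an open-sourced asset. It assumes trimesh and pyrender for offscreen rendering; the asset path, parameter ranges, and sampling strategy are illustrative placeholders, not the paper's actual renderer configuration.

```python
import numpy as np
import trimesh
import pyrender

def look_at(eye, target=np.zeros(3), up=np.array([0.0, 1.0, 0.0])):
    """Camera-to-world pose; pyrender cameras look down their local -Z axis."""
    forward = eye - target
    forward /= np.linalg.norm(forward)
    right = np.cross(up, forward)
    right /= np.linalg.norm(right)
    pose = np.eye(4)
    pose[:3, 0] = right
    pose[:3, 1] = np.cross(forward, right)
    pose[:3, 2] = forward
    pose[:3, 3] = eye
    return pose

# Sample a ground-truth camera-object relation (beta): azimuth, elevation, distance.
beta = {
    "azimuth_deg": float(np.random.uniform(0, 360)),
    "elevation_deg": float(np.random.uniform(-10, 60)),
    "distance": float(np.random.uniform(1.5, 3.0)),
}
az, el = np.deg2rad(beta["azimuth_deg"]), np.deg2rad(beta["elevation_deg"])
eye = beta["distance"] * np.array(
    [np.cos(el) * np.sin(az), np.sin(el), np.cos(el) * np.cos(az)]
)

# Load a (hypothetical) open-sourced asset and render it from the sampled pose.
mesh = pyrender.Mesh.from_trimesh(trimesh.load("assets/car.glb", force="mesh"))
scene = pyrender.Scene()
scene.add(mesh)
cam_pose = look_at(eye)
scene.add(pyrender.PerspectiveCamera(yfov=np.pi / 3.0), pose=cam_pose)
scene.add(pyrender.DirectionalLight(intensity=3.0), pose=cam_pose)

color, depth = pyrender.OffscreenRenderer(512, 512).render(scene)
# `color`/`depth` serve as the 3D visual prior; `beta` is its exact annotation.
```

Because the camera pose is sampled programmatically, the annotation is exact by construction rather than estimated from the image afterwards.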

The Pipeline

Figure 2. Pipeline. Given open-sourced 3D assets, our approach leverages a 3D renderer to generate 3D visual priors (\( I_{\beta} \)) that preserve the ground-truth camera-object relation (\( \beta \)). Meanwhile, LLMs take the 3D asset category as input to generate diverse image descriptions (\( \mathcal{T}_{img} \)) as conditional guidance. Both the 3D visual priors and the diverse text prompts are fed to multiple ControlNet-based networks to generate synthetic images (\( I_{syn} \)). The corresponding text QA instructions (\( \mathcal{T}_{qa} \)) are generated by LLMs from the ground-truth camera-object relation. Our Ultimate3D dataset and benchmark (i.e., pairs of \( \mathcal{T}_{qa} \) and \( I_{syn} \)) support the fine-tuning and evaluation of MLLMs.
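As a rough illustration of the synthesis step, the sketch below conditions a Stable Diffusion ControlNet pipeline (via Hugging Face diffusers) on a rendered prior \( I_{\beta} \) and an LLM-written prompt \( \mathcal{T}_{img} \). The checkpoint names, file paths, prompt text, and QA template are assumptions for illustration only; the paper combines multiple ControlNet-based networks whose exact configuration is not reproduced here.

```python
import torch
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline
from diffusers.utils import load_image

# Illustrative checkpoints: a depth-conditioned ControlNet on top of SD 1.5.
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-depth", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    controlnet=controlnet,
    torch_dtype=torch.float16,
).to("cuda")

# I_beta: the rendered 3D visual prior (e.g., a depth map) that encodes beta.
visual_prior = load_image("renders/car_azi120_ele30_depth.png")

# T_img: an LLM-written scene description used as conditional guidance.
prompt = "a red vintage sports car parked on a rainy city street at dusk, photorealistic"

# I_syn: a photorealistic image that keeps the camera-object relation of I_beta.
image = pipe(prompt, image=visual_prior, num_inference_steps=30).images[0]
image.save("ultimate3d/car_azi120_ele30.png")

# T_qa: a QA instruction derived from the ground-truth relation beta (template is hypothetical).
qa = {
    "question": "From which side of the car is this photo taken?",
    "answer": "From the rear-left side (azimuth 120 degrees).",
}
```

The key point of the design is that the diffusion model controls appearance and context, while the ControlNet conditioning keeps the geometry, and therefore the annotation \( \beta \), fixed.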

Quantitative Comparisons

Figure 4. Quantitative Comparisons. We fine-tune LLaVA models on the Ultimate3D dataset, then evaluate MLLM response accuracy (%) on the Ultimate3D and MMVP benchmarks (MMVP is a public benchmark that demonstrates our cross-dataset capability). The fine-tuned LLaVA models outperform SOTAs by an average of 33.4% across all three tasks.
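For reference, response accuracy on such multiple-choice VQA benchmarks can be computed with a simple exact-match script like the sketch below. The file layout, field names, and answer normalization are assumptions, not the benchmark's official evaluation code.

```python
import json

def multiple_choice_accuracy(pred_file, gt_file):
    """Exact-match accuracy over single-letter multiple-choice answers.

    Hypothetical file layout: each JSON file is a list of {"id": ..., "answer": "A"}.
    """
    preds = {r["id"]: r["answer"].strip().upper()[:1] for r in json.load(open(pred_file))}
    gts = {r["id"]: r["answer"].strip().upper()[:1] for r in json.load(open(gt_file))}
    correct = sum(preds.get(qid) == ans for qid, ans in gts.items())
    return 100.0 * correct / len(gts)

print(f"accuracy: {multiple_choice_accuracy('preds.json', 'ultimate3d_bench.json'):.1f}%")
```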

Diversity of Ultimate3D

Figure 5. Diversity of Ultimate3D. Our Ultimate3D dataset and benchmark cover 100 object categories, span diverse camera-object relation settings, and maintain plausible image quality. (Each row shows images with the same orientation but with diverse subjects and contexts.)

BibTeX


Coming soon!