ImageGem Dataset

NYU · Stanford
ICCV 2025



Our proposed ImageGem dataset and its applications. The left side illustrates image and generative model retrieval. On the right, we demonstrate a novel task of generative model personalization through LoRA weights-to-weights (W2W) space construction.

Abstract

We propose a dataset to enable the study of generative models that understand fine-grained individual preferences. We posit that a key challenge hindering the development of such generative models is the lack of in-the-wild, fine-grained user preference annotations. Our dataset features real-world interaction data from 57K different users, who collectively have built 242K customized LoRAs, written 3M text prompts, and created 5M generated images. Our dataset enables a range of applications. With aggregate-level user preferences from our dataset, we were able to train better preference alignment models. In addition, leveraging individual-level user preferences, we benchmark retrieval models and a vision-language model on personalized image retrieval and generative model recommendation, and highlight the room for improvement. Finally, we demonstrate that our dataset enables, for the first time, a generative model personalization paradigm: editing customized diffusion models in a latent weight space to align with individual user preferences.

Dataset Overview

Our dataset contains 4,916,134 images with 2,895,364 unique prompts, generated by 242,118 LoRA models, after applying a safety filter. We visualize the data distribution with WizMap, using grid tiles to display keywords extracted from image prompts or model tags.


The left panel shows a UMAP embedding of 1M images sampled from the dataset, while the right panel illustrates a contour plot of LoRA model checkpoints.
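As a rough sketch of how such an embedding map can be reproduced, the snippet below encodes a sample of images with CLIP and projects the features to 2-D with UMAP (the kind of coordinates a WizMap-style view consumes). The checkpoint name, sample list, and batching are illustrative assumptions, not the exact pipeline behind the figure.

```python
# Sketch: embed sampled images with CLIP, then project to 2-D with UMAP.
# Assumptions: openai/clip-vit-base-patch32 as the encoder and a
# hypothetical list of image paths; not the authors' actual pipeline.
import torch
import umap
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").to(device)
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def embed(paths, batch_size=64):
    feats = []
    for i in range(0, len(paths), batch_size):
        images = [Image.open(p).convert("RGB") for p in paths[i:i + batch_size]]
        inputs = processor(images=images, return_tensors="pt").to(device)
        with torch.no_grad():
            f = model.get_image_features(**inputs)
        feats.append(torch.nn.functional.normalize(f, dim=-1).cpu())
    return torch.cat(feats).numpy()

sampled_image_paths = [...]  # hypothetical: paths to the image sample
features = embed(sampled_image_paths)
xy = umap.UMAP(n_components=2, metric="cosine").fit_transform(features)
```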

Aggregate-level Preference Alignment

We sampled three topics, Cars, Dogs, and Scenery, from both our ImageGem dataset and Pick-a-Pic, and trained DiffusionDPO models on each.



Qualitative comparison of DiffusionDPO results on images generated with out-of-distribution (OOD) prompts in the three topics, sampled from DiffusionDB. For each prompt, the random seed and all other hyperparameters are kept the same.
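For context, the DiffusionDPO objective (Wallace et al., 2023) trains the model to reduce its denoising error on the preferred image relative to a frozen reference model. Below is a minimal single-timestep sketch of that loss in PyTorch; the function signature, variable names, and β value are assumptions for illustration, not the training code used here.

```python
# Minimal sketch of the Diffusion-DPO preference loss; `model` and
# `ref_model` are noise-prediction networks. Signature is hypothetical.
import torch
import torch.nn.functional as F

def diffusion_dpo_loss(model, ref_model, noisy_w, noisy_l, t, noise, cond,
                       beta=5000.0):
    """noisy_w / noisy_l: latents of the preferred / rejected images,
    diffused with the same `noise` at timestep `t` under prompt `cond`."""
    err = lambda m, x: F.mse_loss(m(x, t, cond), noise,
                                  reduction="none").mean(dim=(1, 2, 3))
    model_diff = err(model, noisy_w) - err(model, noisy_l)
    with torch.no_grad():
        ref_diff = err(ref_model, noisy_w) - err(ref_model, noisy_l)
    # Push the model's winner-vs-loser error gap below the reference's.
    return -F.logsigmoid(-beta * (model_diff - ref_diff)).mean()
```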

Retrieval and Generative Recommendation

With rich individual preference data, we enable personalized image retrieval and generative model recommendation using a two-stage approach: collaborative filtering (CF) retrieves the top-k candidate items, followed by a vision-language model (VLM) for refined ranking.



Image ranking results from different recommendation models, where the VLM demonstrates superior performance.
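A minimal sketch of the two-stage pipeline is shown below; `cf_model`, `vlm_rank`, and the user-history format are hypothetical placeholders rather than the paper's actual interfaces.

```python
# Sketch of the two-stage recommendation pipeline: CF proposes top-k
# candidates, then a VLM reranks them against the user's liked examples.
# All interfaces here are hypothetical placeholders.
def recommend(user_id, cf_model, vlm_rank, user_history, k=50, n=10):
    # Stage 1: collaborative filtering retrieves top-k candidate item ids.
    candidates = cf_model.top_k(user_id, k=k)

    # Stage 2: the VLM scores how well each candidate matches the user's
    # past preferences, and candidates are re-sorted by that score.
    scores = [vlm_rank(user_history[user_id], item) for item in candidates]
    ranked = [item for _, item in
              sorted(zip(scores, candidates), reverse=True)]
    return ranked[:n]
```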

Generative Model Personalization

We construct a LoRA weight space with PCA. To handle large LoRA weights with varying ranks, we experimented with several methods, including standardizing LoRAs to rank 1 via SVD, or using only the feed-forward (FF) layers or attention value (attn-v) layers. Our results show that the SVD-based strategy yields the most robust transformations.



Editing results using the SVD-based W2W space for anime-to-realistic transformation and the reverse. The base model's outputs are shown in the first column, followed by results with increasing tuning strength. Each row uses a fixed generation seed.
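One plausible way to implement the SVD-based construction is sketched below: each LoRA update is reduced to rank 1 via SVD so that checkpoints of different ranks become comparable, then PCA over the flattened factors yields the W2W space. Layer names, shapes, and the number of components are illustrative assumptions, not the paper's exact configuration.

```python
# Sketch: standardize each LoRA update (delta_W = B @ A) to rank 1 via
# SVD, flatten the factors, and fit PCA over all checkpoints.
import numpy as np
from sklearn.decomposition import PCA

def rank1_flatten(lora):
    """lora: dict mapping layer name -> (A, B) low-rank factors.
    Assumes every checkpoint covers the same set of layers."""
    parts = []
    for name in sorted(lora):
        A, B = lora[name]                          # delta_W = B @ A, any rank
        U, S, Vt = np.linalg.svd(B @ A, full_matrices=False)
        parts.append(U[:, 0] * np.sqrt(S[0]))      # rank-1 left factor
        parts.append(Vt[0] * np.sqrt(S[0]))        # rank-1 right factor
    return np.concatenate(parts)

# `loras` is a hypothetical list of LoRA checkpoints in the dict format above.
X = np.stack([rank1_flatten(l) for l in loras])
w2w = PCA(n_components=100).fit(X)                 # principal W2W directions
codes = w2w.transform(X)                           # per-model coordinates
```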


Building upon the anime-realistic (ani-real) transformation, we extend our approach to learn personalized editing directions within the W2W space in the human figure domain.



Each user’s visual preference is shown at the top, with generated samples below. Left images are from the unedited SDXL base model; right images are from the edited models.
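A hedged sketch of how such a personalized direction might be derived and applied follows, reusing `w2w` and `codes` from the previous sketch: the direction is taken as the difference between the mean W2W codes of a user's liked and disliked LoRAs, then applied with a tuning strength alpha. This construction and the index arrays are illustrative assumptions, not the paper's exact method.

```python
# Sketch (assumption): a personalized editing direction in the W2W space.
# `liked_idx` / `disliked_idx` are hypothetical index arrays into `codes`.
import numpy as np

def preference_direction(codes, liked_idx, disliked_idx):
    d = codes[liked_idx].mean(axis=0) - codes[disliked_idx].mean(axis=0)
    return d / np.linalg.norm(d)

def edit_model(w2w, code, direction, alpha):
    """Shift a model's W2W coordinates along the preference direction and
    decode back to flattened rank-1 LoRA weights; larger alpha gives a
    stronger edit."""
    shifted = code + alpha * direction
    return w2w.inverse_transform(shifted[None])[0]
```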
