MongoDB Image Database Documentation

Overview

This database stores metadata and analysis results for a large corpus of images, primarily sourced from anime-related platforms.

Database Name: images
Database Size: ~100.83 GiB
Primary Collection: pdxl
Document Count: 23,637,222

Document Schema

Each document in the pdxl collection includes the following structured fields:

1. Basic Metadata

Field	Description
`_id`	MongoDB ObjectId
`filepath`	Full path to the image file (indexed, unique)
`filename`	Filename of the image
`source`	Source platform (e.g., `rule34`, `danbooru`; indexed)
`image_id`	Unique image identifier (indexed, unique)
`album_id`	Album/group identifier (indexed)
`version_major`	Versioning information
`version_minor`	Versioning information
`created_at`	Timestamp when document was added
`original_created_at`	Timestamp when image was originally created
`valid`	Boolean flag indicating a usable/clean image (indexed)

2. Image Properties

Field	Description
`width`	Image width (pixels)
`height`	Image height (pixels)
`itype`	Image type (e.g., `"anime"`)
`phash6`	6-byte perceptual hash (indexed)

3. Tags and Metadata

Field	Description
`ori_img_tags`	Original tags, divided into subfields:
— `gen`	General tags (indexed)
— `char`	Character tags
— `copy`	Copyright tags
— `art`	Artist tags
— `meta`	Meta/informational tags
`auto_tags`	Tags generated by automated systems
`auto_caption`	AI-generated caption or image description

4. Scoring

Field	Description
`aes_score`	Aesthetic evaluation scores (indexed on `siglip_2_5`):
— `siglip_2_5`	Aesthetic score from SigLIP model
`cv_scores`	Computer vision-derived metrics:
— `edge_density`	Edge density measure
— `focus_measure`	Focus/blur estimation
— `texture_score`	Texture complexity
— `noise_level`	Estimated noise
— `saturation`	Color saturation
— `contrast`	Contrast level
— `brightness`	Brightness level
— `avg_dynamic_range`	Dynamic range of image

Indexed Fields

Indexes are used to optimize query performance. Key indexes include:

[
  { "key": { "_id": 1 }, "name": "_id_" },
  { "key": { "filepath": 1 }, "name": "filepath_1", "unique": true },
  { "key": { "source": 1 }, "name": "source_1" },
  { "key": { "valid": 1 }, "name": "valid_1" },
  { "key": { "image_id": 1 }, "name": "image_id_1", "unique": true },
  { "key": { "phash6": 1 }, "name": "phash6_1" },
  { "key": { "album_id": 1 }, "name": "album_id_1" },
  { "key": { "ori_album_tags": 1 }, "name": "ori_album_tags_1" },
  { "key": { "ori_img_tags.gen": 1 }, "name": "ori_img_tags.gen_1" },
  { "key": { "aes_score.siglip_2_5": 1 }, "name": "aes_score.siglip_2_5_1" }
]
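
To see at a glance which conditions in a filter can be served by one of the indexes listed above, a small helper along the following lines may be useful. It is only a sketch: `INDEXES` is copied from the list above (which may not be exhaustive), and the function names `indexedFields` and `unindexedKeys` are hypothetical, not part of any existing tooling for this database.

```javascript
// Index definitions copied from the list above (possibly not exhaustive).
const INDEXES = [
  { key: { _id: 1 }, name: "_id_" },
  { key: { filepath: 1 }, name: "filepath_1", unique: true },
  { key: { source: 1 }, name: "source_1" },
  { key: { valid: 1 }, name: "valid_1" },
  { key: { image_id: 1 }, name: "image_id_1", unique: true },
  { key: { phash6: 1 }, name: "phash6_1" },
  { key: { album_id: 1 }, name: "album_id_1" },
  { key: { ori_album_tags: 1 }, name: "ori_album_tags_1" },
  { key: { "ori_img_tags.gen": 1 }, name: "ori_img_tags.gen_1" },
  { key: { "aes_score.siglip_2_5": 1 }, name: "aes_score.siglip_2_5_1" },
];

// Hypothetical helper: the set of fields that lead at least one index.
function indexedFields(indexes) {
  return new Set(indexes.map((ix) => Object.keys(ix.key)[0]));
}

// Hypothetical helper: top-level filter keys with no supporting index.
// Operators such as $and / $or are skipped; only plain field names are checked.
function unindexedKeys(filter, indexes = INDEXES) {
  const indexed = indexedFields(indexes);
  return Object.keys(filter).filter(
    (k) => !k.startsWith("$") && !indexed.has(k)
  );
}

console.log(unindexedKeys({ source: "danbooru", width: { $gte: 1000 } }));
// logs: [ 'width' ]
```

In mongosh, `db.pdxl.getIndexes()` returns the live index list, and appending `.explain("executionStats")` to a query shows which plan the server actually chose.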

Source Distribution

Approximate distribution of images by source:

Source	Count
`rule34`	8,505,871
`danbooru`	7,865,216
`pixiv`	3,856,761
`twitter`	2,208,117
`yandere`	1,060,776
`bangumi`	99,074
`anime-video`	40,547
`horse_cleaned`	800
`tiger_bench`	60

Example Queries

1. Basic Filters

db.pdxl.find({source: "danbooru"}).limit(10)
db.pdxl.find({valid: true}).limit(10)
db.pdxl.find({image_id: "danbooru-7"})
db.pdxl.find({album_id: "danbooru-7"})

2. Tag-Based Search

db.pdxl.find({"ori_img_tags.gen": "1girl"}).limit(10)
 
db.pdxl.find({"ori_img_tags.gen": {$all: ["1girl", "long_hair"]}}).limit(10)
 
db.pdxl.find({"ori_img_tags.gen": {$in: ["1girl", "long_hair"]}}).limit(10)
 
db.pdxl.find({
  $and: [
    {"ori_img_tags.gen": {$all: ["1girl", "blonde_hair"]}},
    {"ori_img_tags.char": "chii"}
  ]
}).limit(10)
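
The tag patterns above can also be assembled programmatically. The following is a minimal sketch, assuming the tag subfields are arrays of strings as the queries above imply. `buildTagFilter` and its option names (`all`, `any`, `none`) are hypothetical, and `monochrome` is just an illustrative tag.

```javascript
// Hypothetical helper: build a filter for one tag subfield of ori_img_tags.
//   all  - every listed tag must be present     ($all)
//   any  - at least one listed tag is present   ($in)
//   none - no listed tag may be present         ($nin)
function buildTagFilter(subfield, { all = [], any = [], none = [] } = {}) {
  const ops = {};
  if (all.length) ops.$all = all;
  if (any.length) ops.$in = any;
  if (none.length) ops.$nin = none;
  // No constraints: return an empty filter, which matches every document.
  if (Object.keys(ops).length === 0) return {};
  return { [`ori_img_tags.${subfield}`]: ops };
}

// Filters on different subfields can be merged; MongoDB ANDs top-level keys.
const tagFilter = {
  ...buildTagFilter("gen", { all: ["1girl", "blonde_hair"], none: ["monochrome"] }),
  ...buildTagFilter("char", { any: ["chii"] }),
};
console.log(JSON.stringify(tagFilter));

// In mongosh: db.pdxl.find(tagFilter).limit(10)
```

Because top-level conditions on different fields are combined with an implicit AND, the merged object is equivalent to wrapping the same conditions in an explicit `$and`.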

3. Aesthetic Score Filters

db.pdxl.find({"aes_score.siglip_2_5": {$gt: 7}}).limit(10)
db.pdxl.find({"aes_score.siglip_2_5": {$gte: 6, $lte: 8}}).limit(10)

4. Image Properties

db.pdxl.find({width: {$gte: 1920}, height: {$gte: 1080}}).limit(10)
db.pdxl.find({itype: "anime"}).limit(10)

5. Perceptual Hash

db.pdxl.find({phash6: "ee9c079a1"}).limit(5)
db.pdxl.find({phash6: /^ee9c/}).limit(5)  // Prefix match

6. Complex Query Example

db.pdxl.find({
  source: "danbooru",
  valid: true,
  "ori_img_tags.gen": {$all: ["1girl", "blonde_hair"]},
  "aes_score.siglip_2_5": {$gte: 5},
  width: {$gte: 1000}
}).limit(10)
 
db.pdxl.find({"ori_img_tags.gen": "1girl"})
  .sort({"aes_score.siglip_2_5": -1})
  .limit(10)

Limitations

No Perceptual Hash Similarity: MongoDB has no native Hamming-distance operator, so server-side lookups on `phash6` are limited to exact or prefix matches.
Index Usage: MongoDB typically uses a single index per query, so multi-condition queries may benefit from only one of the indexes listed above.
No Vector Search Support: This database does not include vector embeddings or any vector similarity search mechanism.
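
Since Hamming distance cannot be evaluated inside a query here, one workaround is to fetch candidate documents and compare hashes in the client. The sketch below assumes `phash6` is stored as a hexadecimal string, as the example queries suggest. `hammingHex` and `nearDuplicates` are hypothetical names, the hashes are made up, and how candidates are selected (for instance with a prefix regex) is left to the caller.

```javascript
// Hamming distance between two equal-length hex strings, counted in bits.
function hammingHex(a, b) {
  if (a.length !== b.length) {
    throw new Error("hashes must have the same length");
  }
  let distance = 0;
  for (let i = 0; i < a.length; i++) {
    // XOR one hex digit (4 bits) at a time, then count the set bits.
    let x = parseInt(a[i], 16) ^ parseInt(b[i], 16);
    while (x) {
      distance += x & 1;
      x >>= 1;
    }
  }
  return distance;
}

// Keep only candidate documents whose phash6 is within maxBits of the target.
function nearDuplicates(target, candidates, maxBits) {
  return candidates.filter((doc) => hammingHex(target, doc.phash6) <= maxBits);
}

// Made-up documents, for illustration only.
const docs = [
  { image_id: "a", phash6: "ee9c079a1" },
  { image_id: "b", phash6: "ee9c079a3" }, // differs from the target in 1 bit
  { image_id: "c", phash6: "119c079a1" }, // differs from the target in 8 bits
];
console.log(nearDuplicates("ee9c079a1", docs, 2).map((d) => d.image_id));
// logs: [ 'a', 'b' ]
```

In mongosh, candidates could come from something like `db.pdxl.find({phash6: /^ee9c/}, {image_id: 1, phash6: 1}).toArray()`. Note that a prefix filter misses near-duplicates whose differing bits fall inside the prefix, so this remains an approximation.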

Performance Notes

Single-tag queries on the indexed `ori_img_tags.gen` field are fast, but queries that combine many tags can slow down as the number of tags grows.
Place rare/selective tags first in $and queries for better performance.
For common tag patterns, consider defining compound indexes to accelerate queries.
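
The advice on selective tags can be sketched as follows. The tag counts here are invented for illustration; in practice they might come from `db.pdxl.countDocuments({"ori_img_tags.gen": tag})` or a precomputed table. `orderBySelectivity` is a hypothetical helper, and whether clause order actually changes the chosen plan depends on the MongoDB version and its query planner, so it is worth confirming with `explain()`.

```javascript
// Hypothetical helper: sort tags so the rarest (most selective) come first.
// Tags missing from `counts` are treated as rarest (count 0).
function orderBySelectivity(tags, counts) {
  return [...tags].sort((a, b) => (counts[a] ?? 0) - (counts[b] ?? 0));
}

// Invented counts, for illustration only.
const tagCounts = { "1girl": 9000000, long_hair: 4000000, blonde_hair: 1500000 };

const ordered = orderBySelectivity(["1girl", "long_hair", "blonde_hair"], tagCounts);
const selectiveFilter = { "ori_img_tags.gen": { $all: ordered } };
console.log(JSON.stringify(selectiveFilter));
// logs: {"ori_img_tags.gen":{"$all":["blonde_hair","long_hair","1girl"]}}

// A compound index for a frequent pattern might look like this in mongosh
// (the field choice is an assumption, not an existing index):
// db.pdxl.createIndex({ source: 1, "aes_score.siglip_2_5": -1 })
```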
