MongoDB Image Database Documentation

Overview

This database stores metadata and analysis results for a large corpus of images, primarily sourced from anime-related platforms.

  • Database Name: images

  • Database Size: ~100.83 GiB

  • Primary Collection: pdxl

  • Document Count: 23,637,222


Document Schema

Each document in the pdxl collection includes the following structured fields:

1. Basic Metadata

Field                Description
_id                  MongoDB ObjectId
filepath             Full path to the image file (indexed, unique)
filename             Filename of the image
source               Source platform (e.g., rule34, danbooru)
image_id             Unique image identifier (indexed, unique)
album_id             Album/group identifier (indexed)
version_major        Versioning information
version_minor        Versioning information
created_at           Timestamp when document was added
original_created_at  Timestamp when image was originally created
valid                Boolean flag indicating a usable/clean image

2. Image Properties

Field   Description
width   Image width (pixels)
height  Image height (pixels)
itype   Image type (e.g., "anime")
phash6  6-byte perceptual hash (indexed)

3. Tags and Metadata

Field         Description
ori_img_tags  Original tags, divided into subfields:
  gen         General tags (indexed)
  char        Character tags
  copy        Copyright tags
  art         Artist tags
  meta        Meta/informational tags
auto_tags     Tags generated by automated systems
auto_caption  AI-generated caption or image description

4. Scoring

Field                Description
aes_score            Aesthetic evaluation scores (indexed)
  siglip_2_5         Aesthetic score from SigLIP model
cv_scores            Computer vision-derived metrics:
  edge_density       Edge density measure
  focus_measure      Focus/blur estimation
  texture_score      Texture complexity
  noise_level        Estimated noise
  saturation         Color saturation
  contrast           Contrast level
  brightness         Brightness level
  avg_dynamic_range  Dynamic range of image

Indexed Fields

Indexes are used to optimize query performance. Key indexes include:

[
  { "key": { "_id": 1 }, "name": "_id_" },
  { "key": { "filepath": 1 }, "name": "filepath_1", "unique": true },
  { "key": { "source": 1 }, "name": "source_1" },
  { "key": { "valid": 1 }, "name": "valid_1" },
  { "key": { "image_id": 1 }, "name": "image_id_1", "unique": true },
  { "key": { "phash6": 1 }, "name": "phash6_1" },
  { "key": { "album_id": 1 }, "name": "album_id_1" },
  { "key": { "ori_album_tags": 1 }, "name": "ori_album_tags_1" },
  { "key": { "ori_img_tags.gen": 1 }, "name": "ori_img_tags.gen_1" },
  { "key": { "aes_score.siglip_2_5": 1 }, "name": "aes_score.siglip_2_5_1" }
]

Source Distribution

Approximate distribution of images by source:

Source         Count
rule34         8,505,871
danbooru       7,865,216
pixiv          3,856,761
twitter        2,208,117
yandere        1,060,776
bangumi        99,074
anime-video    40,547
horse_cleaned  800
tiger_bench    60
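The per-source counts can presumably be recomputed with a grouping aggregation. The following is a minimal sketch rather than the query actually used to produce the table. The name sourceCountPipeline is introduced here for illustration, and the guard around db only lets the same snippet load outside mongosh. On a collection of this size the aggregation will likely read every document, so expect it to take a while.

```javascript
// Aggregation pipeline: one output row per distinct "source" value,
// with the number of documents for that source, largest first.
const sourceCountPipeline = [
  { $group: { _id: "$source", count: { $sum: 1 } } },
  { $sort: { count: -1 } },
];

// `db` only exists inside mongosh; skip execution anywhere else.
if (typeof db !== "undefined") {
  db.pdxl.aggregate(sourceCountPipeline).forEach((row) => printjson(row));
}
```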

Example Queries

1. Basic Filters

db.pdxl.find({source: "danbooru"}).limit(10)
db.pdxl.find({valid: true}).limit(10)
db.pdxl.find({image_id: "danbooru-7"})
db.pdxl.find({album_id: "danbooru-7"})

2. Tag Queries

db.pdxl.find({"ori_img_tags.gen": "1girl"}).limit(10)

// Images carrying all of the listed tags
db.pdxl.find({"ori_img_tags.gen": {$all: ["1girl", "long_hair"]}}).limit(10)

// Images carrying at least one of the listed tags
db.pdxl.find({"ori_img_tags.gen": {$in: ["1girl", "long_hair"]}}).limit(10)

// Combine general tags with a character tag
db.pdxl.find({
  $and: [
    {"ori_img_tags.gen": {$all: ["1girl", "blonde_hair"]}},
    {"ori_img_tags.char": "chii"}
  ]
}).limit(10)

3. Aesthetic Score Filters

db.pdxl.find({"aes_score.siglip_2_5": {$gt: 7}}).limit(10)
db.pdxl.find({"aes_score.siglip_2_5": {$gte: 6, $lte: 8}}).limit(10)

4. Image Properties

db.pdxl.find({width: {$gte: 1920}, height: {$gte: 1080}}).limit(10)
db.pdxl.find({itype: "anime"}).limit(10)

5. Perceptual Hash

db.pdxl.find({phash6: "ee9c079a1"}).limit(5)
db.pdxl.find({phash6: /^ee9c/}).limit(5)  // Prefix match

6. Complex Query Example

db.pdxl.find({
  source: "danbooru",
  valid: true,
  "ori_img_tags.gen": {$all: ["1girl", "blonde_hair"]},
  "aes_score.siglip_2_5": {$gte: 5},
  width: {$gte: 1000}
}).limit(10)

// Highest-scoring images for a tag, best first
db.pdxl.find({"ori_img_tags.gen": "1girl"})
  .sort({"aes_score.siglip_2_5": -1})
  .limit(10)

Limitations

  1. No Perceptual Hash Similarity:
    Hamming distance is not natively supported in MongoDB queries.

  2. Index Usage:
    MongoDB typically uses a single index per query, so in a multi-condition query the remaining conditions are checked against the documents that index returns rather than against further indexes.

  3. No Vector Search Support:
    This database does not include vector embeddings or any vector similarity search mechanism.
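One possible workaround for the missing Hamming-distance support is to fetch a candidate set with an ordinary query and compare hashes on the client. The sketch below assumes that phash6 values are hexadecimal strings of equal length, as the example queries suggest. The helper names hammingDistanceHex and filterBySimilarity are hypothetical and not part of this database or of MongoDB.

```javascript
// Hamming distance, in bits, between two equal-length hex strings.
// Assumption: both inputs contain only hexadecimal digits.
function hammingDistanceHex(a, b) {
  if (a.length !== b.length) {
    throw new Error("hashes must have the same length");
  }
  let distance = 0;
  for (let i = 0; i < a.length; i++) {
    // XOR the two 4-bit digits, then count the bits that differ.
    let diff = parseInt(a[i], 16) ^ parseInt(b[i], 16);
    while (diff) {
      distance += diff & 1;
      diff >>= 1;
    }
  }
  return distance;
}

// Keep only the documents whose phash6 is within maxBits of targetHash.
function filterBySimilarity(docs, targetHash, maxBits) {
  return docs.filter(
    (doc) => hammingDistanceHex(doc.phash6, targetHash) <= maxBits
  );
}

// Possible use inside mongosh (not executed here):
//   const candidates = db.pdxl.find({phash6: /^ee9c/}).limit(1000).toArray();
//   const similar = filterBySimilarity(candidates, "ee9c079a1", 4);
```

Narrowing candidates with a prefix match keeps the transfer small, but it misses near-duplicates whose hashes differ in the leading characters, so this is a trade-off and not a full similarity search.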


Performance Notes

  • Single-tag queries on the indexed ori_img_tags.gen field are fast, but queries that combine many tags slow down as the number of tags grows.

  • Place rare/selective tags first in $and queries for better performance.

  • For common tag patterns, consider defining compound indexes to accelerate queries.
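The advice to put rare tags first can be applied mechanically when queries are built in code. The sketch below assumes a lookup of approximate per-tag document counts is available. The tagCounts argument and both function names are hypothetical, and the counts in the usage comment are made-up placeholders, not measurements from this database.

```javascript
// Look up a tag's approximate document count; unknown tags sort last.
function countOf(tag, tagCounts) {
  return Object.prototype.hasOwnProperty.call(tagCounts, tag)
    ? tagCounts[tag]
    : Number.MAX_SAFE_INTEGER;
}

// Return a copy of `tags` ordered from rarest to most common.
function orderTagsBySelectivity(tags, tagCounts) {
  return [...tags].sort(
    (a, b) => countOf(a, tagCounts) - countOf(b, tagCounts)
  );
}

// Build an $and filter on ori_img_tags.gen with the rarest tag first.
function buildTagQuery(tags, tagCounts) {
  const ordered = orderTagsBySelectivity(tags, tagCounts);
  return { $and: ordered.map((tag) => ({ "ori_img_tags.gen": tag })) };
}

// Possible use inside mongosh (placeholder counts, not executed here):
//   const counts = { "1girl": 1000000, "blonde_hair": 100000 };
//   db.pdxl.find(buildTagQuery(["1girl", "blonde_hair"], counts)).limit(10)
//
// A compound index for a frequent filter-and-sort pattern might look like:
//   db.pdxl.createIndex({ source: 1, "aes_score.siglip_2_5": -1 })
```

Whether the ordering changes the chosen plan depends on the query planner, so it is worth confirming with explain() before relying on it.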