MongoDB Image Database Documentation
Overview
This database stores metadata and analysis results for a large corpus of images, primarily sourced from anime-related platforms.
- Database Name: images
- Database Size: ~100.83 GiB
- Primary Collection: pdxl
- Document Count: 23,637,222
Document Schema
Each document in the pdxl collection includes the following structured fields:
1. Basic Metadata
| Field | Description |
|---|---|
| _id | MongoDB ObjectId |
| filepath | Full path to the image file (indexed, unique) |
| filename | Filename of the image |
| source | Source platform (e.g., rule34, danbooru) |
| image_id | Unique image identifier (indexed, unique) |
| album_id | Album/group identifier (indexed) |
| version_major | Versioning information |
| version_minor | Versioning information |
| created_at | Timestamp when document was added |
| original_created_at | Timestamp when image was originally created |
| valid | Boolean flag indicating a usable/clean image |
2. Image Properties
| Field | Description |
|---|---|
| width | Image width (pixels) |
| height | Image height (pixels) |
| itype | Image type (e.g., "anime") |
| phash6 | 6-byte perceptual hash (indexed) |
3. Tags and Metadata
| Field | Description |
|---|---|
| ori_img_tags | Original tags, divided into subfields: |
| └ gen | General tags (indexed) |
| └ char | Character tags |
| └ copy | Copyright tags |
| └ art | Artist tags |
| └ meta | Meta/informational tags |
| auto_tags | Tags generated by automated systems |
| auto_caption | AI-generated caption or image description |
4. Scoring
| Field | Description |
|---|---|
| aes_score | Aesthetic evaluation scores, divided into subfields: |
| └ siglip_2_5 | Aesthetic score from SigLIP model (indexed) |
| cv_scores | Computer vision-derived metrics: |
| └ edge_density | Edge density measure |
| └ focus_measure | Focus/blur estimation |
| └ texture_score | Texture complexity |
| └ noise_level | Estimated noise |
| └ saturation | Color saturation |
| └ contrast | Contrast level |
| └ brightness | Brightness level |
| └ avg_dynamic_range | Dynamic range of image |
Indexed Fields
Indexes are used to optimize query performance. Key indexes include:
[
{ "key": { "_id": 1 }, "name": "_id_" },
{ "key": { "filepath": 1 }, "name": "filepath_1", "unique": true },
{ "key": { "source": 1 }, "name": "source_1" },
{ "key": { "valid": 1 }, "name": "valid_1" },
{ "key": { "image_id": 1 }, "name": "image_id_1", "unique": true },
{ "key": { "phash6": 1 }, "name": "phash6_1" },
{ "key": { "album_id": 1 }, "name": "album_id_1" },
{ "key": { "ori_album_tags": 1 }, "name": "ori_album_tags_1" },
{ "key": { "ori_img_tags.gen": 1 }, "name": "ori_img_tags.gen_1" },
{ "key": { "aes_score.siglip_2_5": 1 }, "name": "aes_score.siglip_2_5_1" }
]
Source Distribution
Approximate distribution of images by source:
| Source | Count |
|---|---|
| rule34 | 8,505,871 |
| danbooru | 7,865,216 |
| pixiv | 3,856,761 |
| twitter | 2,208,117 |
| yandere | 1,060,776 |
| bangumi | 99,074 |
| anime-video | 40,547 |
| horse_cleaned | 800 |
| tiger_bench | 60 |
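
As a sanity check on the table above, the per-source counts can be summed and compared with the document count given in the Overview. The sketch below does that in plain JavaScript and also defines an aggregation pipeline that could recompute the distribution. The names `sourceCounts` and `sourcePipeline` are illustrative and not part of the database, and the pipeline assumes every document carries a `source` field.

```javascript
// Counts copied from the Source Distribution table.
const sourceCounts = {
  rule34: 8505871,
  danbooru: 7865216,
  pixiv: 3856761,
  twitter: 2208117,
  yandere: 1060776,
  bangumi: 99074,
  "anime-video": 40547,
  horse_cleaned: 800,
  tiger_bench: 60,
};

// Sum of all per-source counts, for comparison with the Overview figure.
const total = Object.values(sourceCounts).reduce((sum, n) => sum + n, 0);

// Pipeline that groups documents by source and sorts by descending count.
// In mongosh it could be run as: db.pdxl.aggregate(sourcePipeline)
const sourcePipeline = [
  { $group: { _id: "$source", count: { $sum: 1 } } },
  { $sort: { count: -1 } },
];
```

The sum comes to 23,637,222, which equals the Document Count listed in the Overview. Note that the `$group` stage scans the whole collection, so on a collection of this size the pipeline is likely to take a while.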
Example Queries
1. Basic Filters
db.pdxl.find({source: "danbooru"}).limit(10)
db.pdxl.find({valid: true}).limit(10)
db.pdxl.find({image_id: "danbooru-7"})
db.pdxl.find({album_id: "danbooru-7"})
2. Tag-Based Search
db.pdxl.find({"ori_img_tags.gen": "1girl"}).limit(10)
db.pdxl.find({"ori_img_tags.gen": {$all: ["1girl", "long_hair"]}}).limit(10)
db.pdxl.find({"ori_img_tags.gen": {$in: ["1girl", "long_hair"]}}).limit(10)
db.pdxl.find({
  $and: [
    {"ori_img_tags.gen": {$all: ["1girl", "blonde_hair"]}},
    {"ori_img_tags.char": "chii"}
  ]
}).limit(10)
3. Aesthetic Score Filters
db.pdxl.find({"aes_score.siglip_2_5": {$gt: 7}}).limit(10)
db.pdxl.find({"aes_score.siglip_2_5": {$gte: 6, $lte: 8}}).limit(10)
4. Image Properties
db.pdxl.find({width: {$gte: 1920}, height: {$gte: 1080}}).limit(10)
db.pdxl.find({itype: "anime"}).limit(10)
5. Perceptual Hash
db.pdxl.find({phash6: "ee9c079a1"}).limit(5)
db.pdxl.find({phash6: /^ee9c/}).limit(5) // Prefix match
6. Complex Query Example
db.pdxl.find({
  source: "danbooru",
  valid: true,
  "ori_img_tags.gen": {$all: ["1girl", "blonde_hair"]},
  "aes_score.siglip_2_5": {$gte: 5},
  width: {$gte: 1000}
}).limit(10)
db.pdxl.find({"ori_img_tags.gen": "1girl"})
  .sort({"aes_score.siglip_2_5": -1})
  .limit(10)
Limitations
- No Perceptual Hash Similarity: MongoDB queries have no native Hamming-distance operator, so lookups on phash6 are limited to exact and prefix matches.
- Index Usage: MongoDB typically uses only one index per query. Multi-condition queries may not fully benefit from indexing.
- No Vector Search Support: This database does not include vector embeddings or any vector similarity search mechanism.
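
One possible workaround for the missing Hamming-distance support is to fetch a candidate set with an indexed query and compare hashes on the client. The sketch below assumes `phash6` is stored as a hexadecimal string, as the example queries suggest; the helper names `hammingDistanceHex` and `filterByHamming` are hypothetical and not part of the database.

```javascript
// Count the differing bits between two equal-length hex strings.
function hammingDistanceHex(a, b) {
  if (a.length !== b.length) {
    throw new Error("hashes must have the same length");
  }
  // XOR leaves a 1 bit wherever the two hashes disagree.
  let diff = BigInt("0x" + a) ^ BigInt("0x" + b);
  let count = 0;
  while (diff > 0n) {
    count += Number(diff & 1n);
    diff >>= 1n;
  }
  return count;
}

// Keep only documents whose phash6 is within maxDistance bits of target.
// Documents with a missing or differently sized hash are skipped.
function filterByHamming(docs, target, maxDistance) {
  return docs.filter(
    (d) =>
      typeof d.phash6 === "string" &&
      d.phash6.length === target.length &&
      hammingDistanceHex(d.phash6, target) <= maxDistance
  );
}

// Possible use in mongosh (assumption: a prefix query is an acceptable
// way to narrow candidates before the client-side comparison):
//   const candidates = db.pdxl.find({phash6: /^ee9c/}).toArray();
//   filterByHamming(candidates, "ee9c079a1", 4);
```

Narrowing by prefix keeps the candidate set small, but it will miss near-duplicates whose hashes differ in the leading characters, so this is a trade-off rather than a full similarity search.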
Performance Notes
- Indexed tag queries are fast, but compound filtering can slow down as tag count increases.
- Place rare/selective tags first in $and queries for better performance.
- For common tag patterns, consider defining compound indexes to accelerate queries.
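
The last two notes can be sketched as follows. `buildTagFilter` orders tags from rarest to most common before building an `$all` filter, and `tagScoreIndex` describes one compound index that might serve the "tag plus sort by aesthetic score" query shown in the examples. Both names, the index name, and the per-tag counts passed in are hypothetical; real counts would have to be measured from the collection, and whether tag order changes the chosen plan may depend on the server version.

```javascript
// Order tags from rarest to most common using caller-supplied counts,
// then build an $all filter on ori_img_tags.gen.
// Tags without a known count are treated as very common and placed last.
function buildTagFilter(tags, tagCounts) {
  const countOf = (tag) =>
    tag in tagCounts ? tagCounts[tag] : Number.MAX_SAFE_INTEGER;
  const ordered = [...tags].sort((a, b) => countOf(a) - countOf(b));
  return { "ori_img_tags.gen": { $all: ordered } };
}

// Hypothetical compound index: equality on a general tag, then the
// aesthetic score in descending order to support sorted results.
// Only one array field per compound index is allowed, which holds here
// if aes_score.siglip_2_5 is a scalar.
const tagScoreIndex = {
  key: { "ori_img_tags.gen": 1, "aes_score.siglip_2_5": -1 },
  name: "gen_tag_by_siglip_score", // illustrative name
};

// Possible use in mongosh:
//   db.pdxl.createIndex(tagScoreIndex.key, { name: tagScoreIndex.name });
//   db.pdxl.find(buildTagFilter(["1girl", "blonde_hair"], measuredCounts))
//     .explain("executionStats");
```

Building an index on a collection of this size takes time and disk space, so it is worth confirming with `explain("executionStats")` that the planner actually selects the new index before keeping it.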