Aesthetic Scoring — Training Plan (v1.3)
0) Summary
Train a scalar aesthetic scorer for images (optionally conditioned on prompt text $t$), using 4-image, same-prompt batches from production logs plus Likert 1–7 internal labels. The primary objective is listwise learning from user-interaction–derived soft labels; the secondary objective is absolute calibration via internal labels. We will also test model-family hypotheses: DINOv3 (context-free) vs CLIP-like / SigLIP2 (text-aware).
1) Objectives & hypotheses
Primary objective. Learn a continuous score usable (a) within batch (ranking) and (b) globally (comparable across prompts).
Hypotheses.
- A (context-free): Aesthetic quality is largely image-intrinsic → DINOv3 features + a light head outperform language-aware encoders.
- B (context-aware): Text context helps judge aesthetic fit with intent → CLIP-like (SigLIP2) encoders outperform.
Initial prior: B (based on DINOv2-era results); we will test DINOv3 vs SigLIP2 directly.
2) Data & cohorts
2.1 Sources
- Internal labellers: Likert 1–7 (aligned with historical “aesthetic 1.2”), enabling absolute calibration and backward compatibility.
- User interactions (PixAI logs): millions of 4-image batches per prompt with per-image user actions.
2.2 Cohort balance & drift
- User expertise: logs skew “newbie.” Create strata (newbie/intermediate/senior) via heuristics (tenure, usage frequency, creation count). Sample proportionally or reweight to avoid overfitting to casual tastes; retain a “production-like” validation split.
- Recency weighting: apply exponential time-decay weights to log data to track fast-moving taste shifts (see the sketch after this list).
- Category balance: prompt domain, style, NSFW filters (if applicable), aspect ratio.
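A minimal sketch of the recency weighting above, assuming a half-life–parameterized exponential decay; `half_life_days` is an illustrative tuning knob, not a value fixed by this plan.

```python
import numpy as np

def recency_weights(age_days: np.ndarray, half_life_days: float = 30.0) -> np.ndarray:
    """Exponential time-decay weight per log row; half_life_days is a placeholder to tune."""
    return np.power(0.5, age_days / half_life_days)

# Example: with a 30-day half-life, a 60-day-old interaction gets 1/4 the weight of a fresh one.
weights = recency_weights(np.array([0.0, 30.0, 60.0]))   # -> [1.0, 0.5, 0.25]
```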
2.3 Cleaning
- Deduplicate near-identical images; drop obvious corrupts.
- Bot/low-trust filters (heuristics on burstiness, improbable action patterns).
- For Likert, enforce double-rating overlap to estimate rater reliability; use rater-bias correction (z-scoring per rater).
3) Signal construction (from logs)
3.1 Action weights
Two interchangeable approaches:
- Fixed mapping (simple): assign each action type a hand-set weight reflecting its strength as a preference signal.
- Estimated mapping (preferred): fit a small GLM (logistic/Poisson) predicting conversion from observed actions and use its coefficients as action weights. Optionally include recency and user-cohort terms (a fit sketch follows below).
For each (user, batch, image), keep the max action value as the per-image reward $r_i$.
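A minimal sketch of the estimated mapping, assuming per-image action indicators and a binary conversion target; the column names are hypothetical, and sklearn's `LogisticRegression` stands in for whichever GLM fitter we use.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Placeholder design matrix: one indicator column per action type observed on an image,
# optionally plus recency / user-cohort features. y marks downstream conversion (0/1).
ACTION_COLS = ["action_type_1", "action_type_2", "action_type_3"]  # hypothetical names

def estimate_action_weights(X: np.ndarray, y: np.ndarray) -> dict[str, float]:
    """Fit a logistic GLM and read per-action weights off its coefficients."""
    glm = LogisticRegression(max_iter=1000)
    glm.fit(X, y)
    return dict(zip(ACTION_COLS, glm.coef_[0].tolist()))
```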
3.2 Soft labels per batch
Convert rewards to listwise soft labels:
$$p_i = \frac{\exp(r_i/\tau)}{\sum_{j=1}^{4}\exp(r_j/\tau)}$$
Tune the temperature $\tau$ to match empirical sharpness.
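A minimal sketch of the soft-label construction above for one 4-pack; $\tau$ is the temperature to tune.

```python
import numpy as np

def soft_labels(rewards: np.ndarray, tau: float = 1.0) -> np.ndarray:
    """Temperature-scaled softmax over the four per-image rewards in one pack."""
    z = (rewards - rewards.max()) / tau          # subtract max for numerical stability
    p = np.exp(z)
    return p / p.sum()

# Example: rewards [0, 1, 0, 2] with tau=1 put most of the mass on the last image.
p = soft_labels(np.array([0.0, 1.0, 0.0, 2.0]))
```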
4) Model families & I/O
Families.
- Context-free: DINOv3 (frozen or lightly tuned) + 2–3 layer MLP head → scalar $s$.
- Context-aware: SigLIP2/CLIP-like encoder; fuse text $t$ with the image via pooled features → MLP head → scalar $s$ (a head sketch follows below).
Input. Image $x$ (optional text $t$, metadata).
Output. Scalar score $s = f(x)$ or $s = f(x, t)$.
Batching. Tensors shaped $(B, 4, \ldots)$ with a shared prompt per 4-pack.
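A minimal head sketch (PyTorch), assuming pooled encoder features are computed upstream; `feat_dim` and the hidden width are placeholders that depend on the chosen encoder family.

```python
import torch
import torch.nn as nn

class ScoreHead(nn.Module):
    """2–3 layer MLP head mapping pooled (image or fused image+text) features to a scalar score."""

    def __init__(self, feat_dim: int, hidden: int = 512):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.GELU(),
            nn.Linear(hidden, hidden), nn.GELU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (B, 4, feat_dim) -> scores: (B, 4)
        return self.mlp(feats).squeeze(-1)
```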
5) Training objectives
Batch centering (shift-invariant):
$$\tilde{s}_i = s_i - \tfrac{1}{4}\sum_{j=1}^{4} s_j$$
Primary (listwise cross-entropy):
$$\mathcal{L}_{\text{list}} = -\sum_{i=1}^{4} p_i \log q_i, \qquad q_i = \frac{\exp(\tilde{s}_i)}{\sum_{j=1}^{4}\exp(\tilde{s}_j)}$$
Optional (pairwise Bradley–Terry) for DPO-compatibility:
$$\mathcal{L}_{\text{pair}} = -\log \sigma(s_w - s_l), \quad \text{for a within-pack preferred image } w \text{ over } l$$
Optional (triplet, representation-oriented):
$$\mathcal{L}_{\text{trip}} = \max\!\big(0,\; \lVert z_a - z_p\rVert_2^2 - \lVert z_a - z_n\rVert_2^2 + m\big), \quad \text{on encoder embeddings } z \text{ with margin } m$$
Absolute calibration head (internal Likert).
Multi-task with an auxiliary ordinal/regression loss on internally-labeled images:
- Ordinal (recommended): CORN / ordinal-BCE.
- Regression: MSE on the Likert target $y$ through a monotone link $g$, i.e. $\mathcal{L}_{\text{reg}} = (g(s) - y)^2$.
Total loss (if multi-task):
$$\mathcal{L} = \lambda_{\text{list}}\,\mathcal{L}_{\text{list}} + \lambda_{\text{cal}}\,\mathcal{L}_{\text{cal}} + \lambda_{\text{pair}}\,\mathcal{L}_{\text{pair}} + \lambda_{\text{trip}}\,\mathcal{L}_{\text{trip}}$$
Start with $\lambda_{\text{list}} = 1$ and $\lambda_{\text{cal}} = 0.25$; set the others to $0$. Add pair/triplet terms only if they help. A minimal listwise-loss sketch follows.
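A minimal sketch of the centering plus listwise cross-entropy above (PyTorch); `scores` come from the head and `p` are the log-derived soft labels.

```python
import torch
import torch.nn.functional as F

def listwise_loss(scores: torch.Tensor, p: torch.Tensor) -> torch.Tensor:
    """Listwise cross-entropy over 4-packs.
    scores: (B, 4) raw scores from the head; p: (B, 4) soft labels."""
    s = scores - scores.mean(dim=1, keepdim=True)   # per-pack centering (softmax is shift-invariant)
    log_q = F.log_softmax(s, dim=1)
    return -(p * log_q).sum(dim=1).mean()
```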
6) Absolute scale & inference normalization
Relative training implies an arbitrary shift/scale for $s$. Use:
- Train-time centering (as above).
- Inference-time normalization: global z-score (per category & overall). Keep rolling means/stds by time slice; monitor drift.
- Optional anchors: fit an isotonic (or linear) mapping from $s$ to Likert using an internal-label dev set; freeze the mapping for releases (a minimal fit is sketched below).
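A hedged sketch of the inference-time normalization and the optional isotonic anchor, using sklearn's `IsotonicRegression`; the rolling statistics `mu`/`sigma` are assumed to be maintained elsewhere per time slice and category.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

def zscore(scores: np.ndarray, mu: float, sigma: float) -> np.ndarray:
    """Global z-score normalization with rolling per-slice statistics."""
    return (scores - mu) / max(sigma, 1e-8)

def fit_likert_anchor(dev_scores: np.ndarray, dev_likert: np.ndarray) -> IsotonicRegression:
    """Monotone mapping from raw scores to the 1-7 Likert scale, fit on the internal-label dev set."""
    iso = IsotonicRegression(y_min=1.0, y_max=7.0, out_of_bounds="clip")
    iso.fit(dev_scores, dev_likert)
    return iso   # freeze (serialize) this mapping per release
```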
7) Evaluation
Within-batch (logs-style):
- Top-1 accuracy vs the soft-label argmax (argmax agreement).
- NDCG@4 using $p_i$ as relevance (see the sketch after this list).
- Pairwise accuracy (if preference edges are built).
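A minimal sketch of the within-batch metrics, assuming `scores` and soft labels `p` are `(num_packs, 4)` arrays.

```python
import numpy as np

def top1_accuracy(scores: np.ndarray, p: np.ndarray) -> float:
    """Share of packs where the model's argmax matches the soft-label argmax."""
    return float((scores.argmax(axis=1) == p.argmax(axis=1)).mean())

def ndcg_at_4(scores: np.ndarray, p: np.ndarray) -> float:
    """NDCG@4 per pack with soft labels as graded relevance, averaged over packs."""
    disc = 1.0 / np.log2(np.arange(4) + 2)                       # discounts for ranks 1..4
    dcg = np.take_along_axis(p, np.argsort(-scores, axis=1), axis=1) @ disc
    idcg = np.take_along_axis(p, np.argsort(-p, axis=1), axis=1) @ disc
    return float((dcg / np.maximum(idcg, 1e-8)).mean())
```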
Global (absolute):
- Spearman $\rho$ with Likert labels.
- ECE/Brier after binning scores into ordinal bins.
- KS/QQ diagnostics for score-distribution stability (per cohort/domain).
Ablations (all report CIs via paired bootstrap across batches):
- Encoder family (DINOv3 vs SigLIP2).
- With/without text $t$.
- Temperature $\tau$ softness; GLM-estimated vs fixed action weights.
- Multi-task weight $\lambda_{\text{cal}}$.
- Data strata (newbie vs senior).
Sanity checks.
Published images’ score distribution should stochastically dominate non-published; expect monotone lift in offline replay.
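A hedged sketch of the stochastic-dominance sanity check, using SciPy's one-sided Mann–Whitney U test plus a KS statistic as the distributional diagnostic.

```python
import numpy as np
from scipy.stats import mannwhitneyu, ks_2samp

def dominance_check(published: np.ndarray, non_published: np.ndarray) -> dict[str, float]:
    """Test that published images' scores sit above non-published ones."""
    mwu = mannwhitneyu(published, non_published, alternative="greater")
    ks = ks_2samp(published, non_published)
    return {"mwu_pvalue": float(mwu.pvalue), "ks_statistic": float(ks.statistic)}
```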
8) Experiment matrix (minimum set)
| Axis | Levels |
|---|---|
| Encoder | DINOv3, SigLIP2 |
| Context | Image-only, Image+Text |
| Temperature $\tau$ | 0.5, 1, 2 |
| Calibration weight $\lambda_{\text{cal}}$ | 0, 0.25 |
| Soft labels | Fixed vs GLM-estimated weights |
| Calibration | z-score only vs z-score + isotonic |
Minimum 12–16 runs to cover interactions; cap with early-stop on dev NDCG@4 plateau.
9) Training protocol & infra
- Sampler: draw 4-packs by prompt; enforce diversity across users/time.
- Optim: AdamW with warmup then cosine decay; LR grid tuned per family (see the sketch after this list).
- Freezing: begin with frozen encoders + head; unfreeze last block if gains plateau.
- Augmentations: light (resize/crop); avoid style-altering transforms.
- Batch size / grad-accum: target effective 256 images (64 four-packs).
- Logging: MLflow/W&B; log params, metrics, artifacts, calibration plots.
- Versioning: Git SHA, Docker image tag, dataset snapshot IDs.
- Seeding: fixed seeds; determinism where feasible.
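A minimal optimizer/schedule sketch matching the AdamW + warmup + cosine-decay recipe above; the learning rate, weight decay, and step counts are placeholders for the per-family grid.

```python
import math
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

def build_optimizer(params, lr: float = 1e-4, weight_decay: float = 0.05,
                    warmup_steps: int = 500, total_steps: int = 20_000):
    """AdamW with linear warmup followed by cosine decay to zero."""
    opt = AdamW(params, lr=lr, weight_decay=weight_decay)

    def lr_lambda(step: int) -> float:
        if step < warmup_steps:
            return step / max(1, warmup_steps)                   # linear warmup
        t = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        return 0.5 * (1.0 + math.cos(math.pi * min(t, 1.0)))     # cosine decay

    return opt, LambdaLR(opt, lr_lambda)
```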
10) Risks & mitigations
- Casual-user bias / taste drift. Reweighting + recency decay; separate “pro” validation split.
- Non-comparability across prompts. Calibration head + anchors.
- Mode collapse in listwise training. Add a small pairwise term or entropy regularization on $q$.
- Spec creep. Freeze this plan at w01; only scoped changes via change log.
11) Timeline (September 2025)
| Week | Event | Notes / Gates |
|---|---|---|
| w01 | Finalize internal-label protocol | Definitions, rater guide, double-rating for reliability. |
| w01–w02 | Mine, clean, analyze user interactions | Build GLM for action weights; set the $\tau$ grid; cohort stats. |
| w02 | Start internal labeling | Daily inter-rater checks; bias correction. |
| w02 | Stand up new trainer in trainlib | 4-pack sampler, listwise loss, logging, evaluation suite. |
| w02–w03 | First training sweep (new loss/architectures) | DINOv3 vs SigLIP2, image-only vs image+text. Gate G1: choose family & context. |
| w03 | Iterate on data & objectives | Tune $\tau$, try GLM-estimated weights; optionally add a small pair/triplet term. Gate G2: fix loss mix. |
| w04 | Internal labels complete; on-site JP | Mix “aes 1.2” + new Likert labels into a quick 1.3 calibration head; rerun. Gate G3: release-candidate report. |
12) Deliverables
- Model artifacts: encoder+head weights; calibration mapping.
- Data cards: provenance, cohort stats, cleaning rules, GLM spec.
- Experiment report: metrics table, ablations, calibration curves, drift analysis.
- Integration: trainer merged into trainlib, reproducible run scripts.
- Backward-compat note: mapping from the 1.2 → 1.3 score scale.