Aesthetic Scoring — Training Plan (v1.3)

0) Summary

Train a scalar aesthetic scorer for images (optionally conditioned on the prompt text), using 4-image, same-prompt batches from production logs plus internal Likert 1–7 labels. The primary objective is listwise learning from user-interaction-derived soft labels; the secondary objective is absolute calibration via the internal labels. We will also test model-family hypotheses: DINOv3 (context-free) vs CLIP-like / SigLIP2 (text-aware).


1) Objectives & hypotheses

Primary objective. Learn a continuous score usable (a) within batch (ranking) and (b) globally (comparable across prompts).

Hypotheses.

  • A (context-free): Aesthetic quality is largely image-intrinsic; DINOv3 features + a light head should outperform language-aware encoders.
  • B (context-aware): Text context helps judge aesthetic fit with the prompt's intent; CLIP-like (SigLIP2) encoders should outperform.
    Initial prior: B (based on DINOv2 results); we will test DINOv3 vs SigLIP2 directly.

2) Data & cohorts

2.1 Sources

  • Internal labellers: Likert 1–7 (aligned with historical “aesthetic 1.2”), enabling absolute calibration and backward compatibility.
  • User interactions (PixAI logs): millions of 4-image batches per prompt, each with the user actions taken on every image.

2.2 Cohort balance & drift

  • User expertise: logs skew “newbie.” Create strata (newbie/intermediate/senior) via heuristics (tenure, usage frequency, creation count). Sample proportionally or reweight to avoid overfitting to casual tastes; retain a “production-like” validation split.
  • Recency weighting: exponential decay for logs to track fast-moving taste shifts.
  • Category balance: prompt domain, style, NSFW filters (if applicable), aspect ratio.

2.3 Cleaning

  • Deduplicate near-identical images; drop obvious corrupts.
  • Bot/low-trust filters (heuristics on burstiness, improbable action patterns).
  • For Likert, enforce double-rating overlap to estimate rater reliability; use rater-bias correction (z-scoring per rater).
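
A minimal sketch of the per-rater z-scoring mentioned above, assuming Likert ratings arrive as a long-format table with rater_id, image_id, and rating columns (column names are illustrative):

```python
import pandas as pd

def debias_ratings(df: pd.DataFrame) -> pd.DataFrame:
    """Z-score each rater's Likert ratings to remove per-rater offset/scale bias."""
    df = df.copy()
    g = df.groupby("rater_id")["rating"]
    mu = g.transform("mean")
    sd = g.transform("std").fillna(0.0)
    # Guard against raters with a single rating or zero variance.
    df["rating_z"] = (df["rating"] - mu) / sd.where(sd > 0, 1.0)
    # Aggregate per image: mean of debiased ratings across overlapping raters.
    return df.groupby("image_id", as_index=False)["rating_z"].mean()
```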

3) Signal construction (from logs)

3.1 Action weights

Two interchangeable approaches:

  • Fixed mapping (simple): assign a hand-picked weight to each action type.
  • Estimated mapping (preferred): small GLM (logistic/Poisson) predicting conversion from observed actions; use coefficients as action weights. Optionally include recency and user-cohort terms.

For each (user, batch, image) triple, keep the maximum action value as the per-image reward r_i.
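
A sketch of the GLM-estimated action weights and the max-action reward described above, assuming binary per-image action indicators and a downstream conversion label; the action names and the converted target are illustrative, not the production schema:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Illustrative action set; one binary indicator per action per image.
ACTIONS = ["enlarged", "downloaded", "published", "regenerated"]

def fit_action_weights(X: np.ndarray, converted: np.ndarray) -> dict[str, float]:
    """Fit a logistic GLM predicting conversion from observed actions;
    use its coefficients as action weights."""
    glm = LogisticRegression(max_iter=1000)
    glm.fit(X, converted)
    return dict(zip(ACTIONS, glm.coef_[0]))

def image_reward(actions_taken: set[str], weights: dict[str, float]) -> float:
    """Per-image reward = max action value observed for that image (0 if none)."""
    return max([weights[a] for a in actions_taken if a in weights], default=0.0)
```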

3.2 Soft labels per batch

Convert rewards to listwise soft labels via a temperature-scaled softmax over the 4-pack:

p_i = exp(r_i / τ) / Σ_j exp(r_j / τ),  i, j ∈ {1, …, 4}

Tune τ to match the empirical sharpness of preferences.
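
A minimal worked sketch of this conversion, with an illustrative reward vector and τ = 1:

```python
import numpy as np

def soft_labels(rewards: np.ndarray, tau: float = 1.0) -> np.ndarray:
    """Temperature-scaled softmax over the 4 per-image rewards of one batch."""
    z = rewards / tau
    z = z - z.max()            # numerical stability
    p = np.exp(z)
    return p / p.sum()

# e.g. rewards [0, 1, 0, 3] -> p ≈ [0.04, 0.11, 0.04, 0.81] at tau = 1
print(soft_labels(np.array([0.0, 1.0, 0.0, 3.0])))
```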


4) Model families & I/O

Families.

  • Context-free: DINOv3 (frozen or lightly tuned) + 2–3 layer MLP head → scalar score s.
  • Context-aware: SigLIP2/CLIP-like encoder; fuse text with image via pooled features → MLP head → scalar score s.

Input. Image (optional prompt text, metadata).
Output. Scalar score s.
Batching. Tensors grouped into 4-packs sharing a prompt (e.g., shape (B, 4, C, H, W)).
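
A sketch of the scoring head on top of pooled encoder features; encoder loading is omitted, and the feature dimensions and concatenation-based text fusion are assumptions rather than a fixed design:

```python
from typing import Optional
import torch
import torch.nn as nn

class AestheticHead(nn.Module):
    """2-3 layer MLP mapping pooled features to a scalar score.
    Context-aware variant: pooled text features are concatenated to the
    pooled image features before the MLP."""

    def __init__(self, img_dim: int, txt_dim: int = 0, hidden: int = 1024):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(img_dim + txt_dim, hidden), nn.GELU(),
            nn.Linear(hidden, hidden // 4), nn.GELU(),
            nn.Linear(hidden // 4, 1),
        )

    def forward(self, img_feat: torch.Tensor,
                txt_feat: Optional[torch.Tensor] = None) -> torch.Tensor:
        x = img_feat if txt_feat is None else torch.cat([img_feat, txt_feat], dim=-1)
        return self.mlp(x).squeeze(-1)   # (B, 4) scores for (B, 4, D) inputs

# Dims illustrative: image-only AestheticHead(img_dim=1024);
# image+text AestheticHead(img_dim=1152, txt_dim=1152).
```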


5) Training objectives

Batch centering (shift-invariant):

s̃_i = s_i − (1/4) Σ_j s_j

Primary (listwise cross-entropy):

L_list = − Σ_i p_i · log( exp(s̃_i) / Σ_j exp(s̃_j) )

Optional (pairwise Bradley–Terry) for DPO-compatibility:

L_pair = − log σ(s_w − s_l),  where image w is preferred over image l

Optional (triplet, representation-oriented), e.g. a margin loss on anchor/positive/negative embeddings:

L_trip = max(0, m + d(z_a, z_p) − d(z_a, z_n))

Absolute calibration head (internal Likert).
Multi-task with an auxiliary ordinal/regression loss on internally-labeled images:

  • Ordinal (recommended): CORN / ordinal-BCE.
  • Regression: MSE against Likert with a monotone link from s onto the 1–7 scale.

Total loss (if multi-task):

L_total = λ_list·L_list + λ_cal·L_cal + λ_pair·L_pair + λ_trip·L_trip

Start with λ_list = 1 and λ_cal = 0.25; set the other weights to 0. Add the pair/triplet terms only if they help.


6) Absolute scale & inference normalization

Relative training implies an arbitrary shift/scale for s. Use:

  • Train-time centering (as above).
  • Inference-time normalization: global z-score (per category & overall). Keep rolling means/stds by time slice; monitor drift.
  • Optional anchors: fit an isotonic (or linear) mapping from s to Likert using an internal-label dev set; freeze the mapping for releases.
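
A sketch of inference-time normalization plus the optional isotonic anchor, assuming rolling per-slice mean/std statistics are maintained elsewhere; the mapping is fit once on an internal-label dev set and then frozen:

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

def z_normalize(scores: np.ndarray, mean: float, std: float) -> np.ndarray:
    """Global (or per-category) z-score using rolling statistics for a time slice."""
    return (scores - mean) / max(std, 1e-8)

def fit_likert_anchor(dev_scores: np.ndarray, dev_likert: np.ndarray) -> IsotonicRegression:
    """Monotone mapping from normalized score to the 1-7 Likert scale; freeze per release."""
    iso = IsotonicRegression(y_min=1.0, y_max=7.0, out_of_bounds="clip")
    iso.fit(dev_scores, dev_likert)
    return iso

# calibrated = fit_likert_anchor(dev_z, dev_likert).predict(z_normalize(raw, mu, sd))
```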

7) Evaluation

Within-batch (logs-style):

  • Top-1 accuracy vs the soft labels p (argmax agreement).
  • NDCG@4 using p_i as relevance (see the sketch after this list).
  • Pairwise accuracy (if edges built).
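
A sketch of the two main within-batch metrics, with model scores s and soft labels p as arrays of shape (N, 4):

```python
import numpy as np

def top1_accuracy(s: np.ndarray, p: np.ndarray) -> float:
    """Fraction of 4-packs where argmax of the model score matches argmax of p."""
    return float(np.mean(np.argmax(s, axis=1) == np.argmax(p, axis=1)))

def ndcg_at_4(s: np.ndarray, p: np.ndarray) -> float:
    """NDCG@4 per batch using p as graded relevance, averaged over batches."""
    discounts = 1.0 / np.log2(np.arange(2, 6))     # positions 1..4
    order = np.argsort(-s, axis=1)                 # model ranking, best first
    gains = np.take_along_axis(p, order, axis=1)
    ideal = -np.sort(-p, axis=1)                   # relevance sorted descending
    dcg = (gains * discounts).sum(axis=1)
    idcg = (ideal * discounts).sum(axis=1)
    return float(np.mean(dcg / np.maximum(idcg, 1e-12)))
```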

Global (absolute):

  • Spearman with Likert.
  • ECE/Brier after binning scores into ordinal bins.
  • KS/QQ diagnostics for score-distribution stability (per cohort/domain).

Ablations (all report CIs via paired bootstrap across batches):

  • Encoder family (DINOv3 vs SigLIP2).
  • With/without text conditioning.
  • Soft-label temperature τ; GLM-estimated vs fixed action weights.
  • Multi-task weight λ_cal.
  • Data strata (newbie vs senior).

Sanity checks.
Published images’ score distribution should stochastically dominate non-published; expect monotone lift in offline replay.
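
A sketch of the published-vs-non-published dominance check, using a one-sided Mann–Whitney U test as a proxy for stochastic dominance (the significance threshold is illustrative):

```python
import numpy as np
from scipy.stats import mannwhitneyu

def check_dominance(published_scores: np.ndarray,
                    other_scores: np.ndarray,
                    alpha: float = 0.01) -> bool:
    """True if published scores are stochastically larger than non-published ones."""
    stat, pval = mannwhitneyu(published_scores, other_scores, alternative="greater")
    return pval < alpha
```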


8) Experiment matrix (minimum set)

Axis | Levels
Encoder | DINOv3, SigLIP2
Context | Image-only, Image+Text
Soft-label temperature τ | 0.5, 1, 2
Calibration weight λ_cal | 0, 0.25
Soft labels | Fixed vs GLM-estimated weights
Calibration | z-score only vs z-score + isotonic

Minimum 12–16 runs to cover interactions; cap with early-stop on dev NDCG@4 plateau.


9) Training protocol & infra

  • Sampler: draw 4-packs by prompt; enforce diversity across users/time.
  • Optim: AdamW, cosine decay, warmup; LR grid (tuned per family).
  • Freezing: begin with frozen encoders + head; unfreeze last block if gains plateau.
  • Augmentations: light (resize/crop); avoid style-altering transforms.
  • Batch size / grad-accum: target effective 256 images (64 four-packs).
  • Logging: MLflow/W&B; log params, metrics, artifacts, calibration plots.
  • Versioning: Git SHA, Docker image tag, dataset snapshot IDs.
  • Seeding: fixed seeds; determinism where feasible.
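
A sketch of the optimizer/schedule setup described above; the learning rate, warmup length, and step count are placeholders to be set by the per-family grid:

```python
import math
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

def build_optimizer(head_params, encoder_params=(), lr=1e-4, weight_decay=0.05,
                    warmup_steps=500, total_steps=20_000):
    """AdamW with linear warmup then cosine decay; encoder group is empty while frozen."""
    groups = [{"params": list(head_params), "lr": lr}]
    enc = list(encoder_params)
    if enc:
        groups.append({"params": enc, "lr": lr * 0.1})  # lower LR if/when unfrozen
    opt = AdamW(groups, weight_decay=weight_decay)

    def lr_lambda(step):
        if step < warmup_steps:
            return step / max(1, warmup_steps)
        progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        return 0.5 * (1.0 + math.cos(math.pi * progress))

    return opt, LambdaLR(opt, lr_lambda)
```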

10) Risks & mitigations

  • Casual-user bias / taste drift. Reweighting + recency decay; separate “pro” validation split.
  • Non-comparability across prompts. Calibration head + anchors.
  • Mode collapse in listwise training. Add a small pairwise term or entropy regularization on the predicted batch distribution.
  • Spec creep. Freeze this plan at w01; only scoped changes via change log.

11) Timeline (September 2025)

Week | Event | Notes / Gates
w01 | Finalize internal-label protocol | Definitions, rater guide, double-rating for reliability.
w01–w02 | Mine, clean, analyze user interactions | Build GLM for action weights; set the τ grid; cohort stats.
w02 | Start internal labeling | Daily inter-rater checks; bias correction.
w02 | Stand up new trainer in trainlib | 4-pack sampler, listwise loss, logging, evaluation suite.
w02–w03 | First training sweep (new loss/architectures) | DINOv3 vs SigLIP2, image-only vs image+text. Gate G1: choose family & context.
w03 | Iterate on data & objectives | Tune τ, try GLM weights; optionally add small pair/triplet terms. Gate G2: fix loss mix.
w04 | Internal labels complete; on-site JP | Mix “aes 1.2” + new Likert into a quick 1.3 calibration head; rerun. Gate G3: release-candidate report.

12) Deliverables

  1. Model artifacts: encoder+head weights; calibration mapping.
  2. Data cards: provenance, cohort stats, cleaning rules, GLM spec.
  3. Experiment report: metrics table, ablations, calibration curves, drift analysis.
  4. Integration: trainer merged into trainlib, reproducible run scripts.
  5. Backward-compat note: mapping from the 1.2 to the 1.3 score scale.

13) Appendix — Losses (for copy-paste)
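
A copy-paste PyTorch sketch of the §5 losses (batch centering, listwise cross-entropy, optional Bradley–Terry pair term) plus the simplest regression-based calibration variant; the recommended CORN/ordinal head is omitted here, and all weights and the sigmoid link are assumptions to be tuned:

```python
import torch
import torch.nn.functional as F

def center_scores(s: torch.Tensor) -> torch.Tensor:
    """Shift-invariant batch centering: subtract the 4-pack mean score."""
    return s - s.mean(dim=-1, keepdim=True)

def listwise_ce(s: torch.Tensor, p: torch.Tensor) -> torch.Tensor:
    """Primary listwise cross-entropy between soft labels p and softmax of centered scores.
    s, p: (B, 4)."""
    log_q = F.log_softmax(center_scores(s), dim=-1)
    return -(p * log_q).sum(dim=-1).mean()

def bradley_terry(s_win: torch.Tensor, s_lose: torch.Tensor) -> torch.Tensor:
    """Optional pairwise Bradley-Terry term (DPO-compatible): -log sigma(s_w - s_l)."""
    return F.softplus(-(s_win - s_lose)).mean()

def calibration_mse(s_labeled: torch.Tensor, likert: torch.Tensor) -> torch.Tensor:
    """Simplest absolute-calibration variant: regression with a monotone link onto 1-7.
    (The recommended CORN / ordinal-BCE head would replace this.)"""
    pred = 1.0 + 6.0 * torch.sigmoid(s_labeled)
    return F.mse_loss(pred, likert)

def total_loss(s, p, s_labeled=None, likert=None,
               lambda_list=1.0, lambda_cal=0.25) -> torch.Tensor:
    """Multi-task combination; pair/triplet terms can be added analogously if helpful."""
    loss = lambda_list * listwise_ce(s, p)
    if s_labeled is not None and likert is not None and lambda_cal > 0:
        loss = loss + lambda_cal * calibration_mse(s_labeled, likert)
    return loss
```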