Aesthetic Scoring — Training Plan (v1.3)
0) Summary
Train a scalar aesthetic scorer for images (optionally conditioned on prompt text $t$), using 4-image, same-prompt batches from production logs plus Likert 1–7 internal labels. The primary objective is listwise learning from user-interaction–derived soft labels; the secondary objective is absolute calibration via internal labels. We will also test model-family hypotheses: DINOv3 (context-free) vs CLIP-like / SigLIP2 (text-aware).
1) Objectives & hypotheses
Primary objective. Learn a continuous score usable (a) within batch (ranking) and (b) globally (comparable across prompts).
Hypotheses.
- A (context-free): Aesthetic quality is largely image-intrinsic → DINOv3 features + a light head outperform language-aware encoders.
- B (context-aware): Text context helps judge aesthetic fit with intent → CLIP-like (SigLIP2) encoders outperform.
Initial prior: B (based on DINOv2-era results); we will test DINOv3 vs SigLIP2 directly.
2) Data & cohorts
2.1 Sources
- Internal labellers: Likert 1–7 (aligned with historical “aesthetic 1.2”), enabling absolute calibration and backward compatibility.
- User interactions (PixAI logs): millions of 4-image batches per prompt with per-image user actions.
2.2 Cohort balance & drift
- User expertise: logs skew “newbie.” Create strata (newbie/intermediate/senior) via heuristics (tenure, usage frequency, creation count). Sample proportionally or reweight to avoid overfitting to casual tastes; retain a “production-like” validation split.
- Recency weighting: apply exponential time-decay weights to log data to track fast-moving taste shifts (see the sketch after this list).
- Category balance: prompt domain, style, NSFW filters (if applicable), aspect ratio.
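A minimal sketch of the recency weighting above, assuming a half-life–parameterized exponential decay; `half_life_days` is an illustrative tuning knob, not a value fixed by this plan.

```python
import numpy as np

def recency_weights(age_days: np.ndarray, half_life_days: float = 30.0) -> np.ndarray:
    """Exponential time-decay weight per log row; half_life_days is a placeholder to tune."""
    return np.power(0.5, age_days / half_life_days)

# Example: with a 30-day half-life, a 60-day-old interaction gets 1/4 the weight of a fresh one.
weights = recency_weights(np.array([0.0, 30.0, 60.0]))   # -> [1.0, 0.5, 0.25]
```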
2.3 Cleaning
- Deduplicate near-identical images; drop obvious corrupts.
- Bot/low-trust filters (heuristics on burstiness, improbable action patterns).
- For Likert, enforce double-rating overlap to estimate rater reliability; use rater-bias correction (z-scoring per rater).
3) Signal construction (from logs)
3.1 Action weights
Two interchangeable approaches:
- Fixed mapping (simple): assign each action type a hand-set weight reflecting its strength as a preference signal.
- Estimated mapping (preferred): fit a small GLM (logistic/Poisson) predicting conversion from observed actions and use its coefficients as action weights. Optionally include recency and user-cohort terms (a fit sketch follows below).
For each (user, batch, image), keep the max action value as the per-image reward $r_i$.
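A minimal sketch of the estimated mapping, assuming per-image action indicators and a binary conversion target; the column names are hypothetical, and sklearn's `LogisticRegression` stands in for whichever GLM fitter we use.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Placeholder design matrix: one indicator column per action type observed on an image,
# optionally plus recency / user-cohort features. y marks downstream conversion (0/1).
ACTION_COLS = ["action_type_1", "action_type_2", "action_type_3"]  # hypothetical names

def estimate_action_weights(X: np.ndarray, y: np.ndarray) -> dict[str, float]:
    """Fit a logistic GLM and read per-action weights off its coefficients."""
    glm = LogisticRegression(max_iter=1000)
    glm.fit(X, y)
    return dict(zip(ACTION_COLS, glm.coef_[0].tolist()))
```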
3.2 Soft labels per batch
Convert rewards to listwise soft labels:
$$p_i = \frac{\exp(r_i/\tau)}{\sum_{j=1}^{4}\exp(r_j/\tau)}$$
Tune the temperature $\tau$ to match empirical sharpness.
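A minimal sketch of the soft-label construction above for one 4-pack; $\tau$ is the temperature to tune.

```python
import numpy as np

def soft_labels(rewards: np.ndarray, tau: float = 1.0) -> np.ndarray:
    """Temperature-scaled softmax over the four per-image rewards in one pack."""
    z = (rewards - rewards.max()) / tau          # subtract max for numerical stability
    p = np.exp(z)
    return p / p.sum()

# Example: rewards [0, 1, 0, 2] with tau=1 put most of the mass on the last image.
p = soft_labels(np.array([0.0, 1.0, 0.0, 2.0]))
```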
4) Model families & I/O
Families.
- Context-free: DINOv3 (frozen or lightly tuned) + 2–3 layer MLP head → scalar $s$.
- Context-aware: SigLIP2/CLIP-like encoder; fuse text $t$ with the image via pooled features → MLP head → scalar $s$ (a head sketch follows below).
Input. Image $x$ (optional text $t$, metadata).
Output. Scalar score $s = f(x)$ or $s = f(x, t)$.
Batching. Tensors shaped $(B, 4, \ldots)$ with a shared prompt per 4-pack.
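A minimal head sketch (PyTorch), assuming pooled encoder features are computed upstream; `feat_dim` and the hidden width are placeholders that depend on the chosen encoder family.

```python
import torch
import torch.nn as nn

class ScoreHead(nn.Module):
    """2–3 layer MLP head mapping pooled (image or fused image+text) features to a scalar score."""

    def __init__(self, feat_dim: int, hidden: int = 512):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.GELU(),
            nn.Linear(hidden, hidden), nn.GELU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (B, 4, feat_dim) -> scores: (B, 4)
        return self.mlp(feats).squeeze(-1)
```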
5) Training objectives
Batch centering (shift-invariant):
$$\tilde{s}_i = s_i - \tfrac{1}{4}\sum_{j=1}^{4} s_j$$
Primary (listwise cross-entropy):
$$\mathcal{L}_{\text{list}} = -\sum_{i=1}^{4} p_i \log q_i, \qquad q_i = \frac{\exp(\tilde{s}_i)}{\sum_{j=1}^{4}\exp(\tilde{s}_j)}$$
Optional (pairwise Bradley–Terry) for DPO-compatibility:
$$\mathcal{L}_{\text{pair}} = -\log \sigma(s_w - s_l), \quad \text{for a within-pack preferred image } w \text{ over } l$$
Optional (triplet, representation-oriented):
$$\mathcal{L}_{\text{trip}} = \max\!\big(0,\; \lVert z_a - z_p\rVert_2^2 - \lVert z_a - z_n\rVert_2^2 + m\big), \quad \text{on encoder embeddings } z \text{ with margin } m$$
Absolute calibration head (internal Likert).
Multi-task with an auxiliary ordinal/regression loss on internally-labeled images:
- Ordinal (recommended): CORN / ordinal-BCE.
- Regression: MSE on the Likert target $y$ through a monotone link $g$, i.e. $\mathcal{L}_{\text{reg}} = (g(s) - y)^2$.
Total loss (if multi-task):
$$\mathcal{L} = \lambda_{\text{list}}\,\mathcal{L}_{\text{list}} + \lambda_{\text{cal}}\,\mathcal{L}_{\text{cal}} + \lambda_{\text{pair}}\,\mathcal{L}_{\text{pair}} + \lambda_{\text{trip}}\,\mathcal{L}_{\text{trip}}$$
Start with $\lambda_{\text{list}} = 1$ and $\lambda_{\text{cal}} = 0.25$; set the others to $0$. Add pair/triplet terms only if they help. A minimal listwise-loss sketch follows.
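A minimal sketch of the centering plus listwise cross-entropy above (PyTorch); `scores` come from the head and `p` are the log-derived soft labels.

```python
import torch
import torch.nn.functional as F

def listwise_loss(scores: torch.Tensor, p: torch.Tensor) -> torch.Tensor:
    """Listwise cross-entropy over 4-packs.
    scores: (B, 4) raw scores from the head; p: (B, 4) soft labels."""
    s = scores - scores.mean(dim=1, keepdim=True)   # per-pack centering (softmax is shift-invariant)
    log_q = F.log_softmax(s, dim=1)
    return -(p * log_q).sum(dim=1).mean()
```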
6) Absolute scale & inference normalization
Relative training implies an arbitrary shift/scale for $s$. Use:
- Train-time centering (as above).
- Inference-time normalization: global z-score (per category & overall). Keep rolling means/stds by time slice; monitor drift.
- Optional anchors: fit an isotonic (or linear) mapping from $s$ to Likert using an internal-label dev set; freeze the mapping for releases (a minimal fit is sketched below).
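A hedged sketch of the inference-time normalization and the optional isotonic anchor, using sklearn's `IsotonicRegression`; the rolling statistics `mu`/`sigma` are assumed to be maintained elsewhere per time slice and category.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

def zscore(scores: np.ndarray, mu: float, sigma: float) -> np.ndarray:
    """Global z-score normalization with rolling per-slice statistics."""
    return (scores - mu) / max(sigma, 1e-8)

def fit_likert_anchor(dev_scores: np.ndarray, dev_likert: np.ndarray) -> IsotonicRegression:
    """Monotone mapping from raw scores to the 1-7 Likert scale, fit on the internal-label dev set."""
    iso = IsotonicRegression(y_min=1.0, y_max=7.0, out_of_bounds="clip")
    iso.fit(dev_scores, dev_likert)
    return iso   # freeze (serialize) this mapping per release
```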
7) Evaluation
Within-batch (logs-style):
- Top-1 accuracy vs the soft-label argmax (argmax agreement).
- NDCG@4 using $p_i$ as relevance (see the sketch after this list).
- Pairwise accuracy (if preference edges are built).
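A minimal sketch of the within-batch metrics, assuming `scores` and soft labels `p` are `(num_packs, 4)` arrays.

```python
import numpy as np

def top1_accuracy(scores: np.ndarray, p: np.ndarray) -> float:
    """Share of packs where the model's argmax matches the soft-label argmax."""
    return float((scores.argmax(axis=1) == p.argmax(axis=1)).mean())

def ndcg_at_4(scores: np.ndarray, p: np.ndarray) -> float:
    """NDCG@4 per pack with soft labels as graded relevance, averaged over packs."""
    disc = 1.0 / np.log2(np.arange(4) + 2)                       # discounts for ranks 1..4
    dcg = np.take_along_axis(p, np.argsort(-scores, axis=1), axis=1) @ disc
    idcg = np.take_along_axis(p, np.argsort(-p, axis=1), axis=1) @ disc
    return float((dcg / np.maximum(idcg, 1e-8)).mean())
```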
Global (absolute):
- Spearman $\rho$ with Likert labels.
- ECE/Brier after binning scores into ordinal bins.
- KS/QQ diagnostics for score-distribution stability (per cohort/domain).
Ablations (all report CIs via paired bootstrap across batches):
- Encoder family (DINOv3 vs SigLIP2).
- With/without text $t$.
- Temperature $\tau$ softness; GLM-estimated vs fixed action weights.
- Multi-task weight $\lambda_{\text{cal}}$.
- Data strata (newbie vs senior).
Sanity checks.
Published images’ score distribution should stochastically dominate non-published; expect monotone lift in offline replay.
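A hedged sketch of the stochastic-dominance sanity check, using SciPy's one-sided Mann–Whitney U test plus a KS statistic as the distributional diagnostic.

```python
import numpy as np
from scipy.stats import mannwhitneyu, ks_2samp

def dominance_check(published: np.ndarray, non_published: np.ndarray) -> dict[str, float]:
    """Test that published images' scores sit above non-published ones."""
    mwu = mannwhitneyu(published, non_published, alternative="greater")
    ks = ks_2samp(published, non_published)
    return {"mwu_pvalue": float(mwu.pvalue), "ks_statistic": float(ks.statistic)}
```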
8) Experiment matrix (minimum set)
| Axis | Levels |
|---|---|
| Encoder | DINOv3, SigLIP2 |
| Context | Image-only, Image+Text |
| Temperature $\tau$ | 0.5, 1, 2 |
| Calibration weight $\lambda_{\text{cal}}$ | 0, 0.25 |
| Soft labels | Fixed vs GLM-estimated weights |
| Calibration | z-score only vs z-score + isotonic |
Minimum 12–16 runs to cover interactions; cap with early-stop on dev NDCG@4 plateau.
9) Training protocol & infra
- Sampler: draw 4-packs by prompt; enforce diversity across users/time.
- Optim: AdamW with warmup then cosine decay; LR grid tuned per family (see the sketch after this list).
- Freezing: begin with frozen encoders + head; unfreeze last block if gains plateau.
- Augmentations: light (resize/crop); avoid style-altering transforms.
- Batch size / grad-accum: target effective 256 images (64 four-packs).
- Logging: MLflow/W&B; log params, metrics, artifacts, calibration plots.
- Versioning: Git SHA, Docker image tag, dataset snapshot IDs.
- Seeding: fixed seeds; determinism where feasible.
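A minimal optimizer/schedule sketch matching the AdamW + warmup + cosine-decay recipe above; the learning rate, weight decay, and step counts are placeholders for the per-family grid.

```python
import math
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

def build_optimizer(params, lr: float = 1e-4, weight_decay: float = 0.05,
                    warmup_steps: int = 500, total_steps: int = 20_000):
    """AdamW with linear warmup followed by cosine decay to zero."""
    opt = AdamW(params, lr=lr, weight_decay=weight_decay)

    def lr_lambda(step: int) -> float:
        if step < warmup_steps:
            return step / max(1, warmup_steps)                   # linear warmup
        t = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        return 0.5 * (1.0 + math.cos(math.pi * min(t, 1.0)))     # cosine decay

    return opt, LambdaLR(opt, lr_lambda)
```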
10) Risks & mitigations
- Casual-user bias / taste drift. Reweighting + recency decay; separate “pro” validation split.
- Non-comparability across prompts. Calibration head + anchors.
- Mode collapse in listwise training. Add a small pairwise term or entropy regularization on $q$.
- Spec creep. Freeze this plan at w01; only scoped changes via change log.
11) Timeline (September 2025)
| Week | Event | Notes / Gates |
|---|---|---|
| w01 | Finalize internal-label protocol | Definitions, rater guide, double-rating for reliability. |
| w01–w02 | Mine, clean, analyze user interactions | Build GLM for action weights; set the $\tau$ grid; cohort stats. |
| w02 | Start internal labeling | Daily inter-rater checks; bias correction. |
| w02 | Stand up new trainer in trainlib | 4-pack sampler, listwise loss, logging, evaluation suite. |
| w02–w03 | First training sweep (new loss/architectures) | DINOv3 vs SigLIP2, image-only vs image+text. Gate G1: choose family & context. |
| w03 | Iterate on data & objectives | Tune $\tau$, try GLM-estimated weights; optionally add a small pair/triplet term. Gate G2: fix loss mix. |
| w04 | Internal labels complete; on-site JP | Mix “aes 1.2” + new Likert labels into a quick 1.3 calibration head; rerun. Gate G3: release-candidate report. |
12) Deliverables
- Model artifacts: encoder+head weights; calibration mapping.
- Data cards: provenance, cohort stats, cleaning rules, GLM spec.
- Experiment report: metrics table, ablations, calibration curves, drift analysis.
- Integration: trainer merged into trainlib, reproducible run scripts.
- Backward-compat note: mapping from the 1.2 → 1.3 score scale.