Aesthetic Model Pretraining Log

Project Goal: Identify and stabilize a strong base architecture and training setup for future aesthetic scoring models.


TL;DR Summary

  • Architectures Tested: DINOv2 outperformed ConvNeXt-v2 by a small margin, but both lack native text alignment. SigLIP v2 was chosen as the final architecture for future work, primarily for its ~2% performance improvement and its built-in text-image alignment (CLIP-style contrastive pretraining).
  • Standardization: Going forward, all aesthetic models will use the same architecture (SigLIP v2) for consistency in inference and model merging.
  • Configuration: Training configs are included in the appendix.
  • Untracked Notes: Some findings were noted during training but not summarized—consider reconstructing from logs if relevant.

Objective

Provide strong, consistent pretraining weights for aesthetic models using different image encoder architectures.

Hypothesis: Different model families (ConvNeXt-v2, DINOv2, ViT, etc.) may exhibit different levels of trainability and generalization for aesthetic regression tasks.


Dataset

import pandas as pd

# Load the training metadata and drop rows with missing values.
df = pd.read_csv("data.csv")
df = df.dropna().reset_index(drop=True)

Architectures Evaluated

Model             | Notes
DINOv2 (large)    | Best performance among non-CLIP architectures
ConvNeXt-v2       | Slightly worse than DINOv2; required a lower LR
SigLIP v2 (base)  | Chosen as final baseline; ~2% better, text-aligned
TODO              | Evaluate Swin; add transformer-based variants

Experiments & Findings

1. Warmup vs. No Warmup (DINOv2)

  • Removing warmup produced slightly better MSE early on, but long-term stability is unclear.
  • Pending: evaluate whether warmup remains beneficial when continuing training from these checkpoints (see the scheduler sketch below).
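A minimal sketch of the two schedule variants under comparison, assuming the transformers cosine-with-warmup helper; the stand-in model and step count are illustrative, not the actual training code.

import torch
from transformers import get_cosine_schedule_with_warmup

def make_optimizer_and_scheduler(warmup_steps: int, total_steps: int = 120_000):
    # Stand-in module; the real runs use the DINOv2 backbone + regression head.
    model = torch.nn.Linear(1024, 1)
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-6)
    scheduler = get_cosine_schedule_with_warmup(
        optimizer, num_warmup_steps=warmup_steps, num_training_steps=total_steps
    )
    return optimizer, scheduler

# Variant A: 1000-step warmup (current default). Variant B: no warmup.
opt_a, sched_a = make_optimizer_and_scheduler(warmup_steps=1000)
opt_b, sched_b = make_optimizer_and_scheduler(warmup_steps=0)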

2. Input Transformations for Regression Stability

Three target transformations were tested on raw fav counts:

import numpy as np

# Candidate target transforms for raw fav counts (x >= 0).
def transform1(x): return np.log1p(x)                             # log1p
def transform2(x): return np.sqrt(x) * np.log(x + 10)             # sqrt * log
def transform3(x): return np.power(x + 1, 1/3) * np.log(x + 10)   # cbrt * log
  • log1p: Simple, stable gradient; effective baseline
  • sqrt * log / cbrt * log: Compresses extreme values; may distort gradient shape and complicate optimization
  • Next step: compare validation error across value ranges after inverse-transforming predictions back to raw fav counts (see the sketch below)
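A minimal sketch of the planned inverse-transform step, assuming the transforms above: expm1 inverts log1p exactly, while the monotone sqrt/cbrt variants are inverted numerically on a grid. Names like y_true and y_pred_t are placeholders, not variables from the training code.

def inverse_on_grid(transform, y, x_max=1e7, n=200_000):
    # Numeric inverse of a monotone transform via interpolation.
    grid = np.linspace(0.0, x_max, n)
    return np.interp(y, transform(grid), grid)

y_true = np.array([3.0, 150.0, 9000.0])            # raw fav counts (example)
y_pred_t = transform1(y_true) + 0.05               # model output in log1p space

y_pred = np.expm1(y_pred_t)                        # exact inverse of log1p
# y_pred = inverse_on_grid(transform2, y_pred_t)   # numeric inverse otherwise
abs_err = np.abs(y_pred - y_true)                  # error on the original fav scale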

3. Numerical Range Effects

  • Twitter scores are skewed; transformations aim to:
    • Improve regression error stability
    • Reduce over-weighting of high-fav outliers
  • Plan: Analyze error vs. original value per model and transform (a binning sketch follows)
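A short sketch of the planned per-range error breakdown, assuming predictions have already been mapped back to the original fav scale; the column names and bin edges are illustrative.

import numpy as np
import pandas as pd

results = pd.DataFrame({
    "fav_count": [2, 15, 90, 600, 4000, 25000],   # ground truth (raw scale)
    "pred":      [3, 10, 120, 480, 5200, 15000],  # predictions (raw scale)
})

bins = [0, 10, 100, 1000, 10_000, np.inf]
results["range"] = pd.cut(results["fav_count"], bins=bins)

# Mean absolute error per fav-count range: shows whether high-fav outliers
# dominate the error relative to the low-fav bulk.
per_range_mae = (
    results.assign(abs_err=(results["pred"] - results["fav_count"]).abs())
           .groupby("range", observed=True)["abs_err"]
           .mean()
)
print(per_range_mae)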

4. Hyperparameter Sensitivity (DINOv2)

  • Learning rate needed to be lowered significantly to avoid divergence (down to 1e-6)
  • Warmup and cosine scheduler helped maintain smoother convergence

5. Architecture Alignment

  • DINO and ConvNeXt lack natural text-image alignment
  • SigLIP offers built-in alignment, which may benefit multitask settings or downstream fusion (see the loading sketch below)
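A minimal sketch of how a SigLIP-based scorer could be wired up: vision tower plus a linear regression head on the pooled embedding. The class and checkpoint name are assumptions (the SigLIP v1 base checkpoint stands in for the SigLIP 2 weights adopted going forward); this is not the project's actual model code.

import torch
from transformers import SiglipVisionModel

class SiglipScorer(torch.nn.Module):
    def __init__(self, name="google/siglip-base-patch16-224"):
        super().__init__()
        self.backbone = SiglipVisionModel.from_pretrained(name)
        self.head = torch.nn.Linear(self.backbone.config.hidden_size, 1)

    def forward(self, pixel_values):
        # pooler_output: pooled image embedding from the vision tower
        feats = self.backbone(pixel_values=pixel_values).pooler_output
        return self.head(feats).squeeze(-1)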

Training Configuration

  • Optimizer: AdamW
  • Scheduler: Cosine w/ 1000-step warmup
  • Batch Size: 64 × 8 = 512
  • Hardware: 8xH100 (1 node)
  • Time: ~27 hours (10 epochs)
  • Loss: MSE
  • Precision: FP16
  • Additional Notes: Default max grad norm = 1.0

Example launch command (DINOv2-large, Twitter log-fav run):
python train_with_parquet.py \
    --model_name_or_path "facebook/dinov2-with-registers-large" \
    --output_dir ./dinov2-large-reg_twitter-logfav \
    --remove_unused_columns False \
    --image_column_name "local_path" \
    --label_column_name "score" \
    --task regression \
    --do_train True \
    --do_eval True \
    --learning_rate 1e-6  \
    --num_train_epochs 10 \
    --per_device_train_batch_size 64 \
    --per_device_eval_batch_size 32 \
    --logging_steps 10 \
    --eval_steps 2400 \
    --save_steps 1200 \
    --seed 1337 \
    --dataloader_num_workers 16 \
    --fp16 True \
    --warmup_steps 1000 \
    --parquet_path /lv0/test_aesthetics/trainlib/projects/aesthetics/data/legacy/data_twitter_fav-normed_full_filtered.parquet \
    --max_grad_norm 1.0 \
    --lr_scheduler_type cosine

Evaluation Plans

  • Transformation vs. Loss Design:
    • Consider whether good transformations can replace complex loss weighting (see the weighted-MSE sketch after this list)
  • Range-Aware Error Evaluation:
    • Error over low-fav vs. high-fav samples
  • Architecture Error Profiles:
    • How different encoders handle value distribution
  • Smooth Gradients:
    • Choose transformations that preserve predictable derivatives for stable optimization
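A minimal sketch contrasting the two options in question: plain MSE on a transformed target vs. weighted MSE on the raw target. The inverse-log weighting below is an illustrative assumption, not a chosen design.

import torch

def mse_on_transformed(pred_t, target_raw):
    # Option (a): transform the target (here log1p), keep the loss plain MSE.
    target_t = torch.log1p(target_raw)
    return torch.mean((pred_t - target_t) ** 2)

def weighted_mse_on_raw(pred_raw, target_raw):
    # Option (b): keep the raw target, down-weight high-fav outliers.
    weights = 1.0 / torch.log1p(target_raw).clamp(min=1.0)
    return torch.mean(weights * (pred_raw - target_raw) ** 2)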

Preliminary Metrics (placeholder)

Architecture | Transform | Train MSE | Val MSE | Notes
DINOv2       | log1p     | 0.xx      | 0.xx    | Best so far (non-CLIP)
ConvNeXt-v2  | log1p     | 0.xx      | 0.xx    | Required lower LR
SigLIP v2    | log1p     | 0.xx      | 0.xx    | Slightly better overall

Appendix

Training Config Summary

  • Base LR: 1e-6 for DINO, potentially higher for SigLIP
  • Scheduler: Cosine
  • GradNorm: 1.0
  • Max steps: ~120k (10 epochs)
  • Precision: FP16

Related Notes

  • [2025-03: Twitter Logfav Transforms Study] (2025-03-aes-twitter-logfav-transforms): Deep dive into gradient stability and value compression techniques
  • [Notes on Alignment & Architecture Selection]: Why CLIP-based models may generalize better in score regression tasks