Aesthetic Model Pretraining Log
Project Goal: Identify and stabilize a strong base architecture and training setup for future aesthetic scoring models.
TL;DR Summary
- Architectures Tested: DINOv2 outperformed ConvNeXt-v2 by a small margin, but both lack native text alignment. SigLIP v2 was chosen as the final architecture for future work, primarily due to a ~2% performance improvement over DINOv2 and its built-in text-image alignment (CLIP-style contrastive pretraining, current SOTA).
- Standardization: Going forward, all aesthetic models will use the same architecture (SigLIP v2) for consistency in inference and model merging.
- Configuration: Training configs are included in the appendix.
- Untracked Notes: Some findings were noted during training but not summarized—consider reconstructing from logs if relevant.
Objective
Provide strong, consistent pretraining weights for aesthetic models using different image encoder architectures.
Hypothesis: Different model families (ConvNeXt v2, DINO v2, ViT, etc.) may exhibit different levels of trainability and generalization for aesthetic regression tasks.
Dataset
- Source: Twitter Aesthetic Scores (HuggingFace)
- Size: ~3.6M training / 72K validation (2% split)
- Format: Local path to image + continuous score
- Preprocessing:

```python
import pandas as pd

# Load the image-path/score table and drop incomplete rows
df = pd.read_csv("data.csv")
df = df.dropna().reset_index(drop=True)
```

Architectures Evaluated
| Model | Notes |
|---|---|
| DINOv2 (large) | Best performance among non-CLIP architectures |
| ConvNeXt-v2 | Slightly worse than DINO; required lower LR |
| SigLIP v2 (base) | Chosen as final baseline; 2% better, text-aligned |
| TODO | Evaluate Swin; add transformer-based variants |
Experiments & Findings
1. Warmup vs. No Warmup (DINOv2)
- Removing warmup produced slightly better MSE early in training, but long-term stability remains unclear.
- Pending: evaluate whether warmup still helps during the continued-training phase.
2. Input Transformations for Regression Stability
Three transformations were tested on fav counts:

```python
import numpy as np

def transform1(x): return np.log1p(x)                            # log1p
def transform2(x): return np.sqrt(x) * np.log(x + 10)            # sqrt * log
def transform3(x): return np.power(x + 1, 1/3) * np.log(x + 10)  # cbrt * log
```

- log1p: Simple, stable gradient; effective baseline
- sqrt * log / cbrt * log: Compress extreme values; may distort gradient shape and complicate optimization
- Next step: compare validation error across value ranges (after inverse-transform)
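A minimal sketch of that comparison step: log1p inverts exactly via expm1, while the sqrt * log and cbrt * log transforms are monotonic but have no closed-form inverse, so one option is a numeric root-solve (the helper names below are hypothetical):

```python
import numpy as np
from scipy.optimize import brentq

def inverse_log1p(y):
    return np.expm1(y)  # exact closed-form inverse of log1p

def inverse_numeric(transform, y, lo=0.0, hi=1e12):
    """Invert a monotonic increasing transform at a single value y."""
    return brentq(lambda x: transform(x) - y, lo, hi)

# Example: round-trip a fav count through the sqrt * log transform
t2 = lambda x: np.sqrt(x) * np.log(x + 10)
x = inverse_numeric(t2, t2(1234.0))  # recovers ~1234.0
```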
3. Numerical Range Effects
- Twitter scores are skewed; transformations aim to:
- Improve regression error stability
- Reduce over-weighting of high-fav outliers
- Plan: Analyze error vs. original value per model and transform
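A minimal sketch of that analysis, assuming predictions have already been inverse-transformed back to the original fav scale (the column names are hypothetical):

```python
import pandas as pd

def error_by_range(df, pred_col, target_col="fav_count", n_bins=10):
    """Mean absolute error per quantile bucket of the original fav value."""
    out = df.copy()
    out["bucket"] = pd.qcut(out[target_col], q=n_bins, duplicates="drop")
    out["abs_err"] = (out[pred_col] - out[target_col]).abs()
    return out.groupby("bucket", observed=True)["abs_err"].mean()

# e.g. error_by_range(val_df, pred_col="pred_dinov2_log1p")
```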
4. Hyperparameter Sensitivity (DINOv2)
- Learning rate had to be lowered significantly (down to 1e-6) to avoid divergence
- Warmup and cosine scheduler helped maintain smoother convergence
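For reference, a minimal sketch of that optimizer/scheduler pairing with the transformers utility (illustrative only: the actual run configures this through the Trainer via the CLI flags in the Training Configuration section; the stand-in model and step count are placeholders):

```python
import torch
from transformers import get_cosine_schedule_with_warmup

model = torch.nn.Linear(8, 1)  # stand-in for the encoder + regression head
total_steps = 120_000          # ~10 epochs at the effective batch size

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-6)
scheduler = get_cosine_schedule_with_warmup(
    optimizer, num_warmup_steps=1000, num_training_steps=total_steps
)
# Each update: optimizer.step(); scheduler.step(); optimizer.zero_grad()
```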
5. Architecture Alignment
- DINO and ConvNeXt lack natural text-image alignment
- SigLIP offers built-in alignment, which may benefit multitask settings or downstream fusion
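A minimal sketch of what that alignment provides, assuming a transformers SigLIP v2 checkpoint (the checkpoint name below is an assumption, not necessarily the weights used here):

```python
import torch
from PIL import Image
from transformers import AutoModel, AutoProcessor

ckpt = "google/siglip2-base-patch16-224"  # assumed checkpoint name
model = AutoModel.from_pretrained(ckpt)
processor = AutoProcessor.from_pretrained(ckpt)

image = Image.new("RGB", (224, 224))  # placeholder image
inputs = processor(images=[image], return_tensors="pt")
with torch.no_grad():
    feats = model.get_image_features(**inputs)  # lives in the shared text-image space
```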
Training Configuration
- Optimizer: AdamW
- Scheduler: Cosine w/ 1000-step warmup
- Batch Size: 64 per device × 8 GPUs = 512 effective
- Hardware: 8xH100 (1 node)
- Time: ~27 hours (10 epochs)
- Loss: MSE
- Precision: FP16
- Additional Notes: Default gradnorm = 1.0
```bash
python train_with_parquet.py \
--model_name_or_path "facebook/dinov2-with-registers-large" \
--output_dir ./dinov2-large-reg_twitter-logfav \
--remove_unused_columns False \
--image_column_name "local_path" \
--label_column_name "score" \
--task regression \
--do_train True \
--do_eval True \
--learning_rate 1e-6 \
--num_train_epochs 10 \
--per_device_train_batch_size 64 \
--per_device_eval_batch_size 32 \
--logging_steps 10 \
--eval_steps 2400 \
--save_steps 1200 \
--seed 1337 \
--dataloader_num_workers 16 \
--fp16 True \
--warmup_steps 1000 \
--parquet_path /lv0/test_aesthetics/trainlib/projects/aesthetics/data/legacy/data_twitter_fav-normed_full_filtered.parquet \
--max_grad_norm 1.0 \
    --lr_scheduler_type cosine
```

Evaluation Plans
- Transformation vs. Loss Design:
- Consider whether good transformations can replace complex loss weighting
- Range-Aware Error Evaluation:
- Error over low-fav vs. high-fav samples
- Architecture Error Profiles:
- How different encoders handle value distribution
- Smooth Gradients:
- Choose transformations that preserve predictable derivatives for stable optimization
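As a quick check of that last criterion, the derivatives of the three candidate transforms can be compared numerically across the fav-count range (a sketch using central differences; the transforms are restated inline):

```python
import numpy as np

def numeric_grad(f, x, eps=1e-4):
    """Central-difference derivative of a transform at points x."""
    return (f(x + eps) - f(x - eps)) / (2 * eps)

xs = np.logspace(0, 6, 7)  # fav counts from 1 to 1e6
for name, f in [
    ("log1p", np.log1p),
    ("sqrt*log", lambda x: np.sqrt(x) * np.log(x + 10)),
    ("cbrt*log", lambda x: np.power(x + 1, 1/3) * np.log(x + 10)),
]:
    print(name, numeric_grad(f, xs))
```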
Preliminary Metrics (placeholder)
| Architecture | Transform | Train MSE | Val MSE | Notes |
|---|---|---|---|---|
| DINOv2 | log1p | 0.xx | 0.xx | Best so far (non-CLIP) |
| ConvNeXt-v2 | log1p | 0.xx | 0.xx | Required lower LR |
| SigLIP v2 | log1p | 0.xx | 0.xx | Slightly better overall |
Appendix
Training Config Summary
- Base LR: 1e-6 for DINO, potentially higher for SigLIP
- Scheduler: Cosine
- GradNorm: 1.0
- Max steps: ~120k (10 epochs)
- Precision: FP16
Related Work
- [2025-03: Twitter Logfav Transforms Study] → 2025-03-aes-twitter-logfav-transforms: deep dive into gradient stability and value compression techniques
- [Notes on Alignment & Architecture Selection]: why CLIP-based models may generalize better in score regression tasks