Aesthetic Model Pretraining Log
Project Goal: Identify and stabilize a strong base architecture and training setup for future aesthetic scoring models.
TL;DR Summary
- Architectures Tested: DINOv2 outperformed ConvNeXt-v2 by a small margin, but both lack native text alignment. SigLIP v2 was chosen as the final architecture for future work, primarily due to a ~2% performance improvement over DINOv2 and its built-in text-image alignment (CLIP-style contrastive pretraining, current SOTA).
- Standardization: Going forward, all aesthetic models will use the same architecture (SigLIP v2) for consistency in inference and model merging.
- Configuration: Training configs are included in the appendix.
- Untracked Notes: Some findings were noted during training but not summarized—consider reconstructing from logs if relevant.
Objective
Provide strong, consistent pretraining weights for aesthetic models using different image encoder architectures.
Hypothesis: Different model families (ConvNeXt v2, DINO v2, ViT, etc.) may exhibit different levels of trainability and generalization for aesthetic regression tasks.
Dataset
- Source: Twitter Aesthetic Scores (HuggingFace)
- Size: ~3.6M training / 72K validation (2% split)
- Format: Local path to image + continuous score
- Preprocessing:

```python
import pandas as pd

# Load the image-path/score table and drop incomplete rows
df = pd.read_csv("data.csv")
df = df.dropna().reset_index(drop=True)
```

Architectures Evaluated
| Model | Notes |
|---|---|
| DINOv2 (large) | Best performance among non-CLIP architectures |
| ConvNeXt-v2 | Slightly worse than DINO; required lower LR |
| SigLIP v2 (base) | Chosen as final baseline; 2% better, text-aligned |
| TODO | Evaluate Swin; add transformer-based variants |
Experiments & Findings
1. Warmup vs. No Warmup (DINOv2)
- Removing warmup produced slightly better MSE early in training, but long-term stability remains unclear.
- Pending: evaluate whether warmup still helps during the continued-training phase.
2. Input Transformations for Regression Stability
Three transformations were tested on fav counts:

```python
import numpy as np

def transform1(x): return np.log1p(x)                            # log1p
def transform2(x): return np.sqrt(x) * np.log(x + 10)            # sqrt * log
def transform3(x): return np.power(x + 1, 1/3) * np.log(x + 10)  # cbrt * log
```

- log1p: Simple, stable gradient; effective baseline
- sqrt * log / cbrt * log: Compress extreme values; may distort gradient shape and complicate optimization
- Next step: compare validation error across value ranges (after inverse-transform)
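A minimal sketch of that comparison step: log1p inverts exactly via expm1, while the sqrt * log and cbrt * log transforms are monotonic but have no closed-form inverse, so one option is a numeric root-solve (the helper names below are hypothetical):

```python
import numpy as np
from scipy.optimize import brentq

def inverse_log1p(y):
    return np.expm1(y)  # exact closed-form inverse of log1p

def inverse_numeric(transform, y, lo=0.0, hi=1e12):
    """Invert a monotonic increasing transform at a single value y."""
    return brentq(lambda x: transform(x) - y, lo, hi)

# Example: round-trip a fav count through the sqrt * log transform
t2 = lambda x: np.sqrt(x) * np.log(x + 10)
x = inverse_numeric(t2, t2(1234.0))  # recovers ~1234.0
```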
3. Numerical Range Effects
- Twitter scores are skewed; transformations aim to:
- Improve regression error stability
- Reduce over-weighting of high-fav outliers
- Plan: Analyze error vs. original value per model and transform
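A minimal sketch of that analysis, assuming predictions have already been inverse-transformed back to the original fav scale (the column names are hypothetical):

```python
import pandas as pd

def error_by_range(df, pred_col, target_col="fav_count", n_bins=10):
    """Mean absolute error per quantile bucket of the original fav value."""
    out = df.copy()
    out["bucket"] = pd.qcut(out[target_col], q=n_bins, duplicates="drop")
    out["abs_err"] = (out[pred_col] - out[target_col]).abs()
    return out.groupby("bucket", observed=True)["abs_err"].mean()

# e.g. error_by_range(val_df, pred_col="pred_dinov2_log1p")
```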
4. Hyperparameter Sensitivity (DINOv2)
- Learning rate had to be lowered significantly (down to 1e-6) to avoid divergence
- Warmup and cosine scheduler helped maintain smoother convergence
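For reference, a minimal sketch of that optimizer/scheduler pairing with the transformers utility (illustrative only: the actual run configures this through the Trainer via the CLI flags in the Training Configuration section; the stand-in model and step count are placeholders):

```python
import torch
from transformers import get_cosine_schedule_with_warmup

model = torch.nn.Linear(8, 1)  # stand-in for the encoder + regression head
total_steps = 120_000          # ~10 epochs at the effective batch size

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-6)
scheduler = get_cosine_schedule_with_warmup(
    optimizer, num_warmup_steps=1000, num_training_steps=total_steps
)
# Each update: optimizer.step(); scheduler.step(); optimizer.zero_grad()
```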
5. Architecture Alignment
- DINO and ConvNeXt lack natural text-image alignment
- SigLIP offers built-in alignment, which may benefit multitask settings or downstream fusion
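A minimal sketch of what that alignment provides, assuming a transformers SigLIP v2 checkpoint (the checkpoint name below is an assumption, not necessarily the weights used here):

```python
import torch
from PIL import Image
from transformers import AutoModel, AutoProcessor

ckpt = "google/siglip2-base-patch16-224"  # assumed checkpoint name
model = AutoModel.from_pretrained(ckpt)
processor = AutoProcessor.from_pretrained(ckpt)

image = Image.new("RGB", (224, 224))  # placeholder image
inputs = processor(images=[image], return_tensors="pt")
with torch.no_grad():
    feats = model.get_image_features(**inputs)  # lives in the shared text-image space
```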
Training Configuration
- Optimizer: AdamW
- Scheduler: Cosine w/ 1000-step warmup
- Batch Size: 64 per device × 8 GPUs = 512 effective
- Hardware: 8xH100 (1 node)
- Time: ~27 hours (10 epochs)
- Loss: MSE
- Precision: FP16
- Additional Notes: Default gradnorm = 1.0
```bash
python train_with_parquet.py \
--model_name_or_path "facebook/dinov2-with-registers-large" \
--output_dir ./dinov2-large-reg_twitter-logfav \
--remove_unused_columns False \
--image_column_name "local_path" \
--label_column_name "score" \
--task regression \
--do_train True \
--do_eval True \
--learning_rate 1e-6 \
--num_train_epochs 10 \
--per_device_train_batch_size 64 \
--per_device_eval_batch_size 32 \
--logging_steps 10 \
--eval_steps 2400 \
--save_steps 1200 \
--seed 1337 \
--dataloader_num_workers 16 \
--fp16 True \
--warmup_steps 1000 \
--parquet_path /lv0/test_aesthetics/trainlib/projects/aesthetics/data/legacy/data_twitter_fav-normed_full_filtered.parquet \
--max_grad_norm 1.0 \
    --lr_scheduler_type cosine
```

Evaluation Plans
- Transformation vs. Loss Design:
- Consider whether good transformations can replace complex loss weighting
- Range-Aware Error Evaluation:
- Error over low-fav vs. high-fav samples
- Architecture Error Profiles:
- How different encoders handle value distribution
- Smooth Gradients:
- Choose transformations that preserve predictable derivatives for stable optimization
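As a quick check of that last criterion, the derivatives of the three candidate transforms can be compared numerically across the fav-count range (a sketch using central differences; the transforms are restated inline):

```python
import numpy as np

def numeric_grad(f, x, eps=1e-4):
    """Central-difference derivative of a transform at points x."""
    return (f(x + eps) - f(x - eps)) / (2 * eps)

xs = np.logspace(0, 6, 7)  # fav counts from 1 to 1e6
for name, f in [
    ("log1p", np.log1p),
    ("sqrt*log", lambda x: np.sqrt(x) * np.log(x + 10)),
    ("cbrt*log", lambda x: np.power(x + 1, 1/3) * np.log(x + 10)),
]:
    print(name, numeric_grad(f, xs))
```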
Preliminary Metrics (placeholder)
| Architecture | Transform | Train MSE | Val MSE | Notes |
|---|---|---|---|---|
| DINOv2 | log1p | 0.xx | 0.xx | Best so far (non-CLIP) |
| ConvNeXt-v2 | log1p | 0.xx | 0.xx | Required lower LR |
| SigLIP v2 | log1p | 0.xx | 0.xx | Slightly better overall |
Appendix
Training Config Summary
- Base LR: 1e-6 for DINO, potentially higher for SigLIP
- Scheduler: Cosine
- GradNorm: 1.0
- Max steps: ~120k (10 epochs)
- Precision: FP16
Related Work
- [2025-03: Twitter Logfav Transforms Study] → 2025-03-aes-twitter-logfav-transforms: deep dive into gradient stability and value compression techniques
- [Notes on Alignment & Architecture Selection]: why CLIP-based models may generalize better in score regression tasks