Related:

Experiment plan:

Prioritize:

  1. Check data quality of the collected responses

  2. Minimize error on the reference dataset (the samples that were rated by 5 people)

  3. See how a model trained on the collected data generalizes to unseen data (e.g. danbooru)

This over X:

  • … Haven’t thought of any good ones yet

Data process:

  • (only speculation, because I haven’t looked at the dump yet)

  • Currently, that’s all I plan to use, but maybe I will use something more in the future

  • Remove voters with < 100 responses: assuming no prior voting experience, too few responses could also mean unstable predictions
  • Remove data packs with < 20% responses: same reason as above
  • Apply averaging: for samples with more than 1 rating, use the mean rating
  • Select samples with exactly 5 responses: some multi-rated samples don't have enough responses; only samples with all 5 responses will be used as the reference set, the rest as the training set (maybe)
  (future):
  • Maybe remove outliers using cleanlab: will probably make the model behave more consistently, but it has tradeoffs
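A minimal pandas sketch of these filters, with an assumed schema (column names like voter_id / pack_id / sample_id / rating are guesses, since I haven't looked at the dump yet):

import pandas as pd

# Assumed schema: one row per response, with voter_id, pack_id, sample_id, rating.
df = pd.read_parquet("responses.parquet")  # placeholder path
pack_sizes = df.groupby("pack_id")["sample_id"].nunique()  # proxy for pack size

# 1. Drop voters with < 100 responses (likely unstable raters).
df = df[df.groupby("voter_id")["rating"].transform("count") >= 100]

# 2. Drop packs where < 20% of their samples received responses.
answered = df.groupby("pack_id")["sample_id"].nunique()
kept = answered[answered / pack_sizes.reindex(answered.index) >= 0.2].index
df = df[df["pack_id"].isin(kept)]

# 3. Average ratings for samples with more than one response.
per_sample = df.groupby("sample_id")["rating"].agg(["mean", "count"])

# 4. Exactly 5 responses -> reference set; the rest -> training set (maybe).
reference = per_sample[per_sample["count"] == 5]
training = per_sample[per_sample["count"] != 5]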

Code references:

Data processing code:

Training code:

Feedback loop:

  • Eagle plugin

(side project: distill novelai v4's aesthetic model):

training over a long period creates over-confidence:

import pandas as pd
import matplotlib.pyplot as plt

# res_all: per-sample predictions from the 20-epoch and 100-epoch checkpoints
df = res_all.copy()

# Identify samples where the two checkpoints disagree
disagreements = df[df['label_20e'] != df['label_100e']]

# Plot confidence distributions over the disagreements
plt.figure(figsize=(8, 5))
plt.hist(disagreements['confidence_20e'], bins=20, alpha=0.5, label='20e Confidence', color='blue')
plt.hist(disagreements['confidence_100e'], bins=20, alpha=0.5, label='100e Confidence', color='red')
plt.xlabel("Confidence")
plt.ylabel("Frequency")
plt.title("Confidence Distribution in Disagreements")
plt.legend()
plt.show()
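For context, the confidence_* / label_* columns could be derived from each checkpoint's logits along these lines (a sketch; the actual construction of res_all isn't in these notes):

import torch
import pandas as pd

def confidence_columns(logits: torch.Tensor, tag: str) -> pd.DataFrame:
    # Max softmax probability = "confidence"; argmax = predicted label.
    probs = torch.softmax(logits, dim=-1)   # logits: [N, num_classes]
    conf, label = probs.max(dim=-1)
    return pd.DataFrame({f"confidence_{tag}": conf.numpy(),
                         f"label_{tag}": label.numpy()})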

image

CLIP models:

  1. CLIP-style models seem to perform better on binary classification (AI throw / keep) than the DINOv2-large models.

on an 8xH100 machine, with training args:

  • bs = 24×8; total 11k samples × 10 epochs ≈ 500 steps
#!/bin/bash
 
# Define variables
BASE_MODEL="google/siglip2-base-patch16-512"
DATASET="distill-lab/COMBINE_nai-distill_00-01_eagle.library"
TASK="classification"
NUM_EPOCHS=10
 
 
# Run training command
python -m trainlib.hf_trainer.cli \
  --model_name_or_path $BASE_MODEL \
  --dataset_name $DATASET \
  --output_dir distill-n4_00-01_combined_cls_v1b2_classification_$BASE_MODEL \
  --remove_unused_columns False \
  --label_column_name star \
  --task $TASK \
  --do_train \
  --do_eval \
  --eval_strategy steps \
  --eval_steps 100 \
  --learning_rate 5e-6 \
  --num_train_epochs $NUM_EPOCHS \
  --per_device_train_batch_size 24 \
  --per_device_eval_batch_size 24 \
  --logging_strategy steps \
  --logging_steps 2 \
  --save_total_limit 1 \
  --seed 1337 \
  --lr_scheduler_type cosine \
  --dataloader_num_workers 16 \
  --ignore_mismatched_sizes True
 
 

model = google/siglip2-base-patch16-512:

  • (376M params)
wandb: Run summary:
wandb:            eval/accuracy 0.76684
wandb:                eval/loss 0.49165
wandb:             eval/runtime 13.1276
wandb:  eval/samples_per_second 134.602
wandb:    eval/steps_per_second 0.762
wandb:               total_flos 4.381485869237797e+19
wandb:              train/epoch 10.0
wandb:        train/global_step 530
wandb:          train/grad_norm 16.72753
wandb:      train/learning_rate 0.0
wandb:               train/loss 0.3167
wandb:               train_loss 0.43538
wandb:            train_runtime 508.2728
wandb: train_samples_per_second 197.001
wandb:   train_steps_per_second 1.043

model = google/siglip2-large-patch16-512:

        # torchvision.transforms; `size` and `normalize` come from the image processor
        train_transforms = Compose([
            RandomResizedCrop(size),
            RandomHorizontalFlip(),
            ToTensor(),
            normalize,
        ])
  • (882M params)
  • roughly twice the size for ~1 pt gain in accuracy (0.767 → 0.775)
wandb: Run summary:
wandb:            eval/accuracy 0.77533
wandb:                eval/loss 0.4809
wandb:             eval/runtime 15.9025
wandb:  eval/samples_per_second 111.114
wandb:    eval/steps_per_second 0.692
wandb:               total_flos 1.4915777670524436e+20
wandb:              train/epoch 10.0
wandb:        train/global_step 570
wandb:          train/grad_norm 375217.9375
wandb:      train/learning_rate 0.0
wandb:               train/loss 0.286
wandb:               train_loss 0.40591
wandb:            train_runtime 1032.5423
wandb: train_samples_per_second 96.974
wandb:   train_steps_per_second 0.552

modified augmentations:

        train_transforms = Compose([
            RandomResizedCrop(size=size, scale=(0.8, 1.0), ratio=(0.9, 1.1)),
            RandomRotation(5),
            RandomHorizontalFlip(p=0.2),
            ColorJitter(brightness=0.1, contrast=0.1, saturation=0.1, hue=0.05),
            RandomApply([GaussianBlur(kernel_size=3, sigma=(0.5, 1.5))], p=0.1),
            ToTensor(),
            normalize,
        ])
  • (weaker random crop; added light transforms like rotation / jitter)
  • eval loss goes way up (while train loss drops much faster)
  • does training with more subtle augmentations overfit?
wandb: Run summary:
wandb:            eval/accuracy 0.77363
wandb:                  eval/f1 0.49109
wandb:                eval/loss 0.66314
wandb:           eval/precision 0.5452
wandb:              eval/recall 0.44676
wandb:             eval/roc_auc 0.77921
wandb:             eval/runtime 17.0701
wandb:  eval/samples_per_second 103.514
wandb:    eval/steps_per_second 0.644
wandb:               total_flos 1.4915777670524436e+20
wandb:              train/epoch 10.0
wandb:        train/global_step 570
wandb:          train/grad_norm 524651.125
wandb:      train/learning_rate 0.0
wandb:               train/loss 0.0637
wandb:               train_loss 0.26663
wandb:            train_runtime 1073.7748
wandb: train_samples_per_second 93.25
wandb:   train_steps_per_second 0.531

modified augmentation 2 (heavier):

        train_transforms = Compose([
            # RandomResizedCrop(size=size, scale=(0.8, 1.0), ratio=(0.9, 1.1)),
            RandomResizedCrop(size),
            RandomRotation(5),
            # RandomHorizontalFlip(p=0.2),
            RandomHorizontalFlip(),
            ColorJitter(brightness=0.1, contrast=0.1, saturation=0.1, hue=0.05),
            RandomApply([GaussianBlur(kernel_size=3, sigma=(0.5, 1.5))], p=0.1),
            ToTensor(),
            normalize,
        ])
  • on binary classification for AI image filtering, accuracy (and the other metrics) seems to benefit from heavier augmentation:

image

  • purple: original augments (crop/flip)

  • blue: augments1 (less crop/flip, slight jitter/rotate/blur)

  • green: augments2 (original crop/flip + slight jitter/rotate/blur)

  • augment too little: overfits (eval loss goes up)

  • augment moderately: works (eval loss stays in check)

wandb: Run summary:
wandb:            eval/accuracy 0.77646
wandb:                  eval/f1 0.46837
wandb:                eval/loss 0.4708
wandb:           eval/precision 0.55949
wandb:              eval/recall 0.40278
wandb:             eval/roc_auc 0.7864
wandb:             eval/runtime 17.4719
wandb:  eval/samples_per_second 101.134
wandb:    eval/steps_per_second 0.63
wandb:               total_flos 1.4915777670524436e+20
wandb:              train/epoch 10.0
wandb:        train/global_step 570
wandb:          train/grad_norm 515768.625
wandb:      train/learning_rate 0.0
wandb:               train/loss 0.3154
wandb:               train_loss 0.41385
wandb:            train_runtime 1075.1719
wandb: train_samples_per_second 93.129
wandb:   train_steps_per_second 0.53

modified augmentation 3 (even heavier):

image

# T = torchvision.transforms; `size` and `normalize` come from the image processor
train_transforms_aug3 = Compose([
    T.RandomResizedCrop(size=size, scale=(0.5, 1.0), ratio=(0.75, 1.33)),
    T.RandomRotation(5),
    T.RandomHorizontalFlip(p=0.5),
    T.RandomVerticalFlip(p=0.15),  # Optional, depending on your task
    T.ColorJitter(brightness=0.15, contrast=0.15, saturation=0.15, hue=0.05),
    T.RandomApply([T.GaussianBlur(kernel_size=3, sigma=(0.5, 2.0))], p=0.3),
    T.RandomPerspective(distortion_scale=0.3, p=0.5),
    T.RandomAffine(degrees=2, translate=(0.1, 0.1), scale=(0.95, 1.05), shear=5),
    T.ToTensor(),
    normalize,
    T.RandomErasing(p=0.5, scale=(0.02, 0.03), ratio=(0.3, 3.3)),
])
 
  • worse results than aug2 (which is more subtle)
  • accuracy went down; loss went up
wandb: Run summary:
wandb:            eval/accuracy 0.76853
wandb:                  eval/f1 0.45101
wandb:                eval/loss 0.50415
wandb:           eval/precision 0.53674
wandb:              eval/recall 0.38889
wandb:             eval/roc_auc 0.77035
wandb:             eval/runtime 17.9685
wandb:  eval/samples_per_second 98.339
wandb:    eval/steps_per_second 0.612
wandb:               total_flos 1.4915777670524436e+20
wandb:              train/epoch 10.0
wandb:        train/global_step 570
wandb:          train/grad_norm 494826.5
wandb:      train/learning_rate 0.0
wandb:               train/loss 0.2347
wandb:               train_loss 0.37664
wandb:            train_runtime 1082.901
wandb: train_samples_per_second 92.465
wandb:   train_steps_per_second 0.526

modified augmentation 4 (tweak back a bit):

slightly reduce the ranges from aug3 and see if accuracy improves (so we know whether the overall magnitude is too much vs. one of the transforms being fundamentally NOT good):

image

# Requires `import torch` for the noise Lambda; T = torchvision.transforms
train_transforms_aug4 = Compose([
    # Geometric transforms
    T.RandomResizedCrop(size=size, scale=(0.8, 1.0), ratio=(0.75, 1.33)),
    T.RandomRotation(15),
    T.RandomHorizontalFlip(p=0.5),
    T.RandomVerticalFlip(p=0.15),
    T.RandomPerspective(distortion_scale=0.1, p=0.5),  # Reduced from 0.3 for subtlety
    T.RandomAffine(degrees=2, translate=(0.1, 0.1), scale=(0.98, 1.02), shear=3),

    # Color and quality transforms
    T.ColorJitter(brightness=0.15, contrast=0.15, saturation=0.15, hue=0.05),  # Kept conservative
    T.RandomApply([T.GaussianBlur(kernel_size=3, sigma=(0.5, 2.0))], p=0.3),
    T.RandomAdjustSharpness(sharpness_factor=1.5, p=0.3),  # New: subtle sharpness adjustment

    # Conversion to tensor
    T.ToTensor(),

    # Post-tensor transforms
    T.Lambda(lambda x: x + torch.randn_like(x) * 0.01),  # New: slight Gaussian noise
    normalize,
    T.RandomErasing(p=0.5, scale=(0.01, 0.02), ratio=(0.3, 3.3)),  # Adjusted scale slightly up
    # T.RandomErasing(p=0.2, scale=(0.1, 0.2), ratio=(0.3, 3.3)),  # New: occasional larger erasure
])

Using Focal Loss:

 
# Run training command
python -m trainlib.hf_trainer.cli \
  --model_name_or_path $BASE_MODEL \
  --dataset_name $DATASET \
  --output_dir distill-n4_00-01_combined_cls_v1b2_classification_aug2_focal-loss_$BASE_MODEL \
  --remove_unused_columns False \
  --label_column_name star \
  --task $TASK \
  --do_train \
  --do_eval \
  --eval_strategy steps \
  --eval_steps 100 \
  --learning_rate 5e-6 \
  --num_train_epochs $NUM_EPOCHS \
  --per_device_train_batch_size 22 \
  --per_device_eval_batch_size 22 \
  --logging_strategy steps \
  --logging_steps 2 \
  --save_total_limit 1 \
  --seed 1337 \
  --lr_scheduler_type cosine \
  --dataloader_num_workers 16 \
  --ignore_mismatched_sizes True \
  --fp16 True  # EXTRA ARGUMENT
 

(siglip2 large, augment2, focal loss): similar performance to the regular loss
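For reference, a sketch of the standard focal-loss formulation (Lin et al., 2017); trainlib's actual implementation isn't shown here:

import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0):
    # Down-weights easy examples by (1 - p_t)^gamma so training focuses on hard ones.
    ce = F.cross_entropy(logits, targets, reduction="none")  # per-sample -log p_t
    p_t = torch.exp(-ce)                                     # prob. of the true class
    return ((1.0 - p_t) ** gamma * ce).mean()

Note the focal eval/loss below isn't directly comparable to the earlier cross-entropy numbers, since the (1 - p_t)^gamma factor shrinks the loss.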

wandb: Run summary:
wandb:            eval/accuracy 0.77193
wandb:                  eval/f1 0.46338
wandb:                eval/loss 0.24877
wandb:           eval/precision 0.54545
wandb:              eval/recall 0.40278
wandb:             eval/roc_auc 0.78564
wandb:             eval/runtime 17.8045
wandb:  eval/samples_per_second 99.245
wandb:    eval/steps_per_second 0.618
wandb:               total_flos 1.4915777670524436e+20
wandb:              train/epoch 10.0
wandb:        train/global_step 570
wandb:          train/grad_norm 212554.17188
wandb:      train/learning_rate 0.0
wandb:               train/loss 0.1774
wandb:               train_loss 0.22022
wandb:            train_runtime 1070.7678
wandb: train_samples_per_second 93.512
wandb:   train_steps_per_second 0.532
wandb: 

Using lower LR:

NEW: lower the learning rate by 10× and see what happens, since this issue suggests 1e-7 / 1e-8: https://github.com/openai/CLIP/issues/150

python -m trainlib.hf_trainer.cli \
  --model_name_or_path $BASE_MODEL \
  --dataset_name $DATASET \
  --output_dir distill-n4_00-01_combined_cls_v1b2_classification_aug2_focal-loss_lowerLR_$BASE_MODEL \
  --remove_unused_columns False \
  --label_column_name star \
  --task $TASK \
  --do_train \
  --do_eval \
  --eval_strategy steps \
  --eval_steps 100 \
  --learning_rate 5e-7 \
  --num_train_epochs $NUM_EPOCHS \
  --per_device_train_batch_size 22 \
  --per_device_eval_batch_size 22 \
  --logging_strategy steps \
  --logging_steps 2 \
  --save_total_limit 1 \
  --seed 1337 \
  --lr_scheduler_type cosine \
  --dataloader_num_workers 16 \
  --ignore_mismatched_sizes True \
  --fp16 True  # EXTRA ARGUMENT

metrics are more stable now (green curve): image

Using Lower LR, Longer:

(5e-7, 30 epochs: mediocre results, no better than 5e-6 for 10 epochs)

image

wandb: Run summary:
wandb:            eval/accuracy 0.7691
wandb:                  eval/f1 0.4104
wandb:                eval/loss 0.2466
wandb:           eval/precision 0.54615
wandb:              eval/recall 0.3287
wandb:             eval/roc_auc 0.76975
wandb:             eval/runtime 17.5273
wandb:  eval/samples_per_second 100.814
wandb:    eval/steps_per_second 0.628
wandb:               total_flos 4.474733301157331e+20
wandb:              train/epoch 30.0
wandb:        train/global_step 1710
wandb:          train/grad_norm 352201.21875
wandb:      train/learning_rate 0.0
wandb:               train/loss 0.2293
wandb:               train_loss 0.24453
wandb:            train_runtime 3093.8595
wandb: train_samples_per_second 97.092
wandb:   train_steps_per_second 0.553

Using siglip2 giant:

actually had lower accuracy than large, and overfits more easily. Could be run-to-run fluctuation, but maybe not.

wandb: Run summary:
wandb:            eval/accuracy 0.75891
wandb:                  eval/f1 0.45524
wandb:                eval/loss 0.56475
wandb:           eval/precision 0.50857
wandb:              eval/recall 0.41204
wandb:             eval/roc_auc 0.77055
wandb:             eval/runtime 19.4088
wandb:  eval/samples_per_second 91.041
wandb:    eval/steps_per_second 0.721
wandb:               total_flos 3.090267103867837e+20
wandb:              train/epoch 10.0
wandb:        train/global_step 790
wandb:          train/grad_norm 1113629.75
wandb:      train/learning_rate 0.0
wandb:               train/loss 0.2394
wandb:               train_loss 0.35256
wandb:            train_runtime 1717.9909
wandb: train_samples_per_second 58.283
wandb:   train_steps_per_second 0.46

Random things to paste

Here’s a very heavy augmentation that kind of looks cool:

# T = torchvision.transforms; `size` and `normalize` come from the image processor
train_transforms = Compose([
    # T.RandomResizedCrop(size=size, scale=(0.85, 1.0), ratio=(0.95, 1.05)),
    T.RandomResizedCrop(size),
    T.RandomRotation(5, fill=255),
    T.RandomHorizontalFlip(p=0.5),
    T.ColorJitter(brightness=0.1, contrast=0.1, saturation=0.1, hue=0.05),
    T.ColorJitter(brightness=0.1, contrast=0.1, saturation=0.1, hue=0.05),  # applied twice, compounding the jitter
    T.RandomApply([T.GaussianBlur(kernel_size=3, sigma=(0.5, 0.8))], p=0.05),
    T.RandomAffine(degrees=0, translate=(0.03, 0.03), scale=(0.97, 1.03), fill=255),
    T.ToTensor(),
    normalize,
])
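To eyeball a pipeline like this, a preview grid can be generated along these lines (a sketch; "sample.png" is a placeholder path, and how the image below was actually produced isn't shown):

import matplotlib.pyplot as plt
from PIL import Image

img = Image.open("sample.png").convert("RGB")
fig, axes = plt.subplots(2, 4, figsize=(12, 6))
for ax in axes.flat:
    aug = train_transforms(img).permute(1, 2, 0)  # [C, H, W] -> [H, W, C]
    # un-normalizing is skipped, so colors look off; fine for a rough check
    ax.imshow(aug.clip(0, 1).numpy())
    ax.axis("off")
plt.tight_layout()
plt.show()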
 

image

AES-iter3-min3:

getting labels from the Argilla ratings: choose images with at least 3 ratings and train the model on their average rating.
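A sketch of that label prep (the export path and rating column name are assumptions; local_path / aes_mean_score match the training args below):

import pandas as pd

ratings = pd.read_parquet("argilla_ratings.parquet")  # placeholder path
agg = ratings.groupby("local_path")["rating"].agg(["mean", "count"])
labels = (agg[agg["count"] >= 3]                      # keep images with >= 3 ratings
          .rename(columns={"mean": "aes_mean_score"})
          .reset_index()[["local_path", "aes_mean_score"]])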

run2: (1st run had very bad params)

 
 
# =================== BEGIN NOTES =======================
# 1. forgot to turn warmup down from the previous train; but RMSE seems
#    good somehow (so the very low effective LR helped?)
#    also forgot to turn the LR down.
#
# next run: lower LR; proper warmup
# =================== END NOTES =========================
python train_localpath_e05.py \
    --model_name_or_path "google/siglip2-large-patch16-512" \
    --output_dir test_train_rating_min3_run2 \
    --remove_unused_columns False \
    --image_column_name "local_path" \
    --label_column_name "aes_mean_score" \
    --task regression \
    --do_train True \
    --do_eval True \
    --learning_rate 1e-5  \
    --num_train_epochs 20 \
    --per_device_train_batch_size 22 \
    --per_device_eval_batch_size 22 \
    --logging_strategy steps \
    --logging_steps 10 \
    --eval_strategy steps \
    --eval_steps 100 \
    --save_strategy steps \
    --save_steps 100 \
    --seed 1337 \
    --allow_no_dataset_arg True \
    --dataloader_num_workers 16 \
    --fp16 True \
    --parquet_path "hf://datatmp/aesthetic-iter3-labelling-samples-min_3_rated_local_path" \
    --ignore_mismatched_sizes True \
    --warmup_steps 1000 \
    --max_grad_norm 1.0 \
    --lr_scheduler_type cosine
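For the regression metrics in the summary below, a typical HF Trainer compute_metrics would look like this (a sketch; trainlib's actual metric code isn't shown):

import numpy as np

def compute_metrics(eval_pred):
    preds, labels = eval_pred
    preds = np.squeeze(preds)               # [N, 1] -> [N]
    mse = float(np.mean((preds - labels) ** 2))
    return {"mse": mse, "rmse": float(np.sqrt(mse))}  # matches eval/mse, eval/rmse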
wandb: Run summary:
wandb:                eval/loss 0.55636
wandb:                 eval/mse 0.55711
wandb:                eval/rmse 0.7464
wandb:             eval/runtime 12.7365
wandb:  eval/samples_per_second 32.427
wandb:    eval/steps_per_second 0.236
wandb:               total_flos 6.9595812462801715e+19
wandb:              train/epoch 20.0
wandb:        train/global_step 280
wandb:          train/grad_norm 1618059.375
wandb:      train/learning_rate 0.0
wandb:               train/loss 0.3619
wandb:               train_loss 2.0436
wandb:            train_runtime 709.7309
wandb: train_samples_per_second 65.828
wandb:   train_steps_per_second 0.395

run3: