Quick Notes

warmup vs. no warmup (on dino v2 large):

  • no warmup seemed to give slightly better MSE (but the run did not go very far)
  • haven't yet tried whether adding warmup helps when continuing training

twitter logfav vs tailored transform on numerical ranges:

  • TODO
import numpy as np

def transform1(x):
    # plain log1p: simple, stable gradient of 1 / (1 + x)
    return np.log1p(x)

def transform2(x):
    # sqrt x log: spreads large fav counts out much more than log1p
    return np.sqrt(x) * np.log(x + 10)

def transform3(x):
    # cbrt x log: the tailored transform referenced in the eval ideas below
    return np.power(x + 1, 1/3) * np.log(x + 10)
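
For context, a quick look at how the three transforms above compress a few representative fav counts (illustrative only; the values are not from the training data):

import numpy as np

favs = np.array([0, 10, 100, 1_000, 10_000, 100_000], dtype=float)
print("transform1 (log1p)     :", np.round(np.log1p(favs), 2))
print("transform2 (sqrt x log):", np.round(np.sqrt(favs) * np.log(favs + 10), 2))
print("transform3 (cbrt x log):", np.round(np.power(favs + 1, 1/3) * np.log(favs + 10), 2))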

new architectures (TODO):

  • swin
  • siglip v2

Objective

Provide reasonably good pretrained weights for the new aesthetics model.

Hypothesis: different model architectures (Convnext V2, Dino v2, ViT, etc.) may present different levels of training difficulty for aesthetics classification.

Setup

Dataset

# Key preprocessing code snippet
import pandas as pd

# load the raw metadata, drop rows with missing values, and reindex
df = pd.read_csv("data.csv")
df = df.dropna().reset_index(drop=True)
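
A minimal sketch of how the ~2% evaluation split mentioned in the training script could be carved out of the parquet data; the column names local_path and score match the script below, but the split logic here is an illustration, not the actual pipeline:

import pandas as pd

# assumed columns: local_path (image file), score (transformed fav target)
df = pd.read_parquet("data_twitter_fav-normed_full_filtered.parquet")

# hold out ~2% of rows for evaluation (roughly 3.6m train / 72k val at full size)
val_df = df.sample(frac=0.02, random_state=1337)
train_df = df.drop(val_df.index)
print(len(train_df), len(val_df))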

Model Architecture

# trainlib/src/trainlib/hf_trainer/image_classification.py
    model = AutoModelForImageClassification.from_pretrained(
        model_args.model_name_or_path,
        config=config,
        cache_dir=model_args.cache_dir,
        revision=model_args.model_revision,
        token=model_args.token,
        trust_remote_code=model_args.trust_remote_code,
        ignore_mismatched_sizes=model_args.ignore_mismatched_sizes,
    )
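
For the regression task the config presumably wires a single output head trained with MSE; a sketch of what that setup looks like with the HF API (the model name comes from the script below, the config details are an assumption about image_classification.py):

from transformers import AutoConfig, AutoModelForImageClassification

# assumption: one regression label, so the classification head acts as an MSE regressor
config = AutoConfig.from_pretrained(
    "facebook/dinov2-with-registers-large",
    num_labels=1,
    problem_type="regression",
)
model = AutoModelForImageClassification.from_pretrained(
    "facebook/dinov2-with-registers-large",
    config=config,
    ignore_mismatched_sizes=True,  # the head is re-initialized, so size mismatches are expected
)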

Training Configuration

  • Optimizer: [AdamW]
  • Learning rate: [1e-6, cosine schedule]
  • Batch size: [64*8=512]
  • Hardware: [8xH100 1node]
  • Training time: [around 27 hours for 10 epochs]
#!/bin/bash
 
# 2% of total rows are used for evaluation (around 3.6m train / 72k val)
# data also available at: https://huggingface.co/datasets/datatmp/data_twitter_fav-normed_full_filtered
# max_grad_norm also seems to default to 1.0, without needing a flag
 
# DINOv2-large:
# Learn Rate: https://github.com/facebookresearch/dinov2/issues/252
# Warmup is longer (1000 steps)
# https://x.com/i/grok/share/IqCZCK1BH3VveKNYkaJjvgOAE
 
# (the Convnext-v2 run may have used too high an LR?)
# (used a much lower LR on dino v2 this time)
 
python train_with_parquet.py \
    --model_name_or_path "facebook/dinov2-with-registers-large" \
    --output_dir ./dinov2-large-reg_twitter-logfav \
    --remove_unused_columns False \
    --image_column_name "local_path" \
    --label_column_name "score" \
    --task regression \
    --do_train True \
    --do_eval True \
    --learning_rate 1e-6  \
    --num_train_epochs 10 \
    --per_device_train_batch_size 64 \
    --logging_strategy steps \
    --logging_steps 10 \
    --eval_strategy steps \
    --eval_steps 2400 \
    --per_device_eval_batch_size 32 \
    --save_strategy steps \
    --save_steps 1200 \
    --seed 1337 \
    --allow_no_dataset_arg True \
    --dataloader_num_workers 16 \
    --fp16 True \
    --parquet_path "/lv0/test_aesthetics/trainlib/projects/aesthetics/data/legacy/data_twitter_fav-normed_full_filtered.parquet" \
    --ignore_mismatched_sizes True \
    --warmup_steps 1000 \
    --max_grad_norm 1.0 \
    --lr_scheduler_type cosine
 
    # --load_best_model_at_end True \
    # --save_total_limit 10 \
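
For reference, a quick back-of-the-envelope on the schedule implied by the numbers above (3.6m train rows, global batch 64 x 8 = 512); my own arithmetic, not logged output:

train_rows = 3_600_000
global_batch = 64 * 8                         # per-device 64 x 8 GPUs = 512
steps_per_epoch = train_rows // global_batch  # ~7,031
total_steps = steps_per_epoch * 10            # ~70,310 over 10 epochs
print(steps_per_epoch, total_steps)
# so warmup_steps=1000 is ~1.4% of training, and eval_steps=2400 gives ~3 evals per epoch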

Eval ideas (on twitter logfav):

better gradient?

  • log1p(x + 10): straightforward gradient, vs. the current power transform (more difficult gradient); see the quick gradient comparison sketched below
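
A quick sketch (mine, not from the training code) comparing gradient magnitudes of a plain log1p against the cbrt x log transform via finite differences:

import numpy as np

def log1p_t(x):
    return np.log1p(x)

def cbrt_log_t(x):  # mirrors transform3 above
    return np.power(x + 1, 1/3) * np.log(x + 10)

xs = np.array([0.0, 10.0, 100.0, 1_000.0, 100_000.0])
eps = 1e-3
for f in (log1p_t, cbrt_log_t):
    grad = (f(xs + eps) - f(xs - eps)) / (2 * eps)  # central difference
    print(f.__name__, np.round(grad, 5))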

better numerical range?

  • train on both log1p and cbrt x log, then compare error across real-value ranges after projecting predictions back; a bucketed-comparison sketch follows below
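
A rough sketch of that bucketed comparison; log1p inverts cleanly with expm1, while cbrt x log has no closed-form inverse and is inverted numerically here. All names (y_true, pred_log1p, pred_cbrt) are hypothetical placeholders:

import numpy as np
from scipy.optimize import brentq

def cbrt_log(x):
    return np.power(x + 1, 1/3) * np.log(x + 10)

def inv_cbrt_log(y, hi=1e9):
    # invert the monotonic transform numerically; clip y into its range on [0, hi]
    y = float(np.clip(y, cbrt_log(0.0), cbrt_log(hi)))
    return brentq(lambda x: cbrt_log(x) - y, 0.0, hi)

def bucketed_mae(y_true, y_pred, edges=(0, 10, 100, 1_000, 10_000, np.inf)):
    # mean absolute error per real fav-count bucket
    out = {}
    for lo, hi in zip(edges[:-1], edges[1:]):
        m = (y_true >= lo) & (y_true < hi)
        if m.any():
            out[(lo, hi)] = float(np.mean(np.abs(y_true[m] - y_pred[m])))
    return out

# hypothetical usage, after predicting in transformed space:
# real_pred_log1p = np.expm1(pred_log1p)
# real_pred_cbrt = np.array([inv_cbrt_log(v) for v in pred_cbrt])
# print(bucketed_mae(y_true, real_pred_log1p))
# print(bucketed_mae(y_true, real_pred_cbrt))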

better architecture?

  • train on log1p on both dinov2-large and siglip2-base

better hyperparameters?

  • in the dino v2 experiment the LR was changed a bit; analyze those results

transformation vs. loss penalty?

  • if a transformation fixes it, that is more straightforward than hacking the loss
  • some transformations may make optimization harder by distorting the gradient/derivatives too much (e.g. cbrt x log vs. the more stable log1p); a loss-side sketch follows below
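
For reference, a minimal sketch of what a loss-side fix could look like (up-weighting high-score samples in the MSE via a custom Trainer); this is one assumption about what "hacking the loss" might mean, not the current setup:

import torch
from transformers import Trainer

class WeightedMSETrainer(Trainer):
    """Hypothetical: penalize errors on rare high-score samples more, instead of re-transforming the target."""

    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        labels = inputs.pop("labels")
        outputs = model(**inputs)
        preds = outputs.logits.squeeze(-1)
        weights = 1.0 + labels.abs()  # weight grows with the (already transformed) target
        loss = (weights * (preds - labels.float()) ** 2).mean()
        return (loss, outputs) if return_outputs else loss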

Results

Metrics

maybe add each architecture's error across numerical ranges here

Metric     Training    Validation    Test
Loss       0.X         0.X           0.X
Acc        0.X         0.X           0.X
[Other]    0.X         0.X           0.X

maybe add more components from the template here

an important revisit of the data transforms for the skewed twitter fav distribution:

2025-03-twitter-logfav-transforms