Quick Notes

warmup vs. no warmup (on dino v2 large):

  • no warmup seemed to give slightly better MSE (but the run did not go very far)
  • haven't yet tried whether adding warmup helps when continuing training

twitter logfav vs tailored transform on numerical ranges:

  • TODO
import numpy as np

def transform1(x):
    # plain log1p: simple, stable gradient of 1 / (1 + x)
    return np.log1p(x)

def transform2(x):
    # sqrt x log: spreads large fav counts out much more than log1p
    return np.sqrt(x) * np.log(x + 10)

def transform3(x):
    # cbrt x log: the tailored transform referenced in the eval ideas below
    return np.power(x + 1, 1/3) * np.log(x + 10)
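
For context, a quick look at how the three transforms above compress a few representative fav counts (illustrative only; the values are not from the training data):

import numpy as np

favs = np.array([0, 10, 100, 1_000, 10_000, 100_000], dtype=float)
print("transform1 (log1p)     :", np.round(np.log1p(favs), 2))
print("transform2 (sqrt x log):", np.round(np.sqrt(favs) * np.log(favs + 10), 2))
print("transform3 (cbrt x log):", np.round(np.power(favs + 1, 1/3) * np.log(favs + 10), 2))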

new architectures (TODO):

  • swin
  • siglip v2

Objective

Provide reasonably good pretrained weights for the new aesthetics model.

Hypothesis: different model architectures (Convnext V2, Dino v2, ViT, etc.) may present different levels of training difficulty for aesthetics classification.

Setup

Dataset

# Key preprocessing code snippet
import pandas as pd

# load the raw metadata, drop rows with missing values, and reindex
df = pd.read_csv("data.csv")
df = df.dropna().reset_index(drop=True)
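
A minimal sketch of how the ~2% evaluation split mentioned in the training script could be carved out of the parquet data; the column names local_path and score match the script below, but the split logic here is an illustration, not the actual pipeline:

import pandas as pd

# assumed columns: local_path (image file), score (transformed fav target)
df = pd.read_parquet("data_twitter_fav-normed_full_filtered.parquet")

# hold out ~2% of rows for evaluation (roughly 3.6m train / 72k val at full size)
val_df = df.sample(frac=0.02, random_state=1337)
train_df = df.drop(val_df.index)
print(len(train_df), len(val_df))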

Model Architecture

# trainlib/src/trainlib/hf_trainer/image_classification.py
    model = AutoModelForImageClassification.from_pretrained(
        model_args.model_name_or_path,
        config=config,
        cache_dir=model_args.cache_dir,
        revision=model_args.model_revision,
        token=model_args.token,
        trust_remote_code=model_args.trust_remote_code,
        ignore_mismatched_sizes=model_args.ignore_mismatched_sizes,
    )
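
For the regression task the config presumably wires a single output head trained with MSE; a sketch of what that setup looks like with the HF API (the model name comes from the script below, the config details are an assumption about image_classification.py):

from transformers import AutoConfig, AutoModelForImageClassification

# assumption: one regression label, so the classification head acts as an MSE regressor
config = AutoConfig.from_pretrained(
    "facebook/dinov2-with-registers-large",
    num_labels=1,
    problem_type="regression",
)
model = AutoModelForImageClassification.from_pretrained(
    "facebook/dinov2-with-registers-large",
    config=config,
    ignore_mismatched_sizes=True,  # the head is re-initialized, so size mismatches are expected
)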

Training Configuration

  • Optimizer: [AdamW]
  • Learning rate: [1e-6, cosine schedule]
  • Batch size: [64*8=512]
  • Hardware: [8xH100 1node]
  • Training time: [around 27 hours for 10 epochs]
#!/bin/bash
 
# 2% of total rows are used for evaluation (around 3.6m train / 72k val)
# data also available at: https://huggingface.co/datasets/datatmp/data_twitter_fav-normed_full_filtered
# max_grad_norm also seems to default to 1.0, without needing a flag
 
# DINOv2-large:
# Learn Rate: https://github.com/facebookresearch/dinov2/issues/252
# Warmup is longer (1000 steps)
# https://x.com/i/grok/share/IqCZCK1BH3VveKNYkaJjvgOAE
 
# (the Convnext-v2 run may have used too high an LR?)
# (used a much lower LR on dino v2 this time)
 
python train_with_parquet.py \
    --model_name_or_path "facebook/dinov2-with-registers-large" \
    --output_dir ./dinov2-large-reg_twitter-logfav \
    --remove_unused_columns False \
    --image_column_name "local_path" \
    --label_column_name "score" \
    --task regression \
    --do_train True \
    --do_eval True \
    --learning_rate 1e-6  \
    --num_train_epochs 10 \
    --per_device_train_batch_size 64 \
    --logging_strategy steps \
    --logging_steps 10 \
    --eval_strategy steps \
    --eval_steps 2400 \
    --per_device_eval_batch_size 32 \
    --save_strategy steps \
    --save_steps 1200 \
    --seed 1337 \
    --allow_no_dataset_arg True \
    --dataloader_num_workers 16 \
    --fp16 True \
    --parquet_path "/lv0/test_aesthetics/trainlib/projects/aesthetics/data/legacy/data_twitter_fav-normed_full_filtered.parquet" \
    --ignore_mismatched_sizes True \
    --warmup_steps 1000 \
    --max_grad_norm 1.0 \
    --lr_scheduler_type cosine
 
    # --load_best_model_at_end True \
    # --save_total_limit 10 \
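
For reference, a quick back-of-the-envelope on the schedule implied by the numbers above (3.6m train rows, global batch 64 x 8 = 512); my own arithmetic, not logged output:

train_rows = 3_600_000
global_batch = 64 * 8                         # per-device 64 x 8 GPUs = 512
steps_per_epoch = train_rows // global_batch  # ~7,031
total_steps = steps_per_epoch * 10            # ~70,310 over 10 epochs
print(steps_per_epoch, total_steps)
# so warmup_steps=1000 is ~1.4% of training, and eval_steps=2400 gives ~3 evals per epoch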

Eval ideas (on twitter logfav):

better gradient?

  • log1p(x + 10): straightforward gradient, vs. the current power transform (more difficult gradient); see the quick gradient comparison sketched below
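
A quick sketch (mine, not from the training code) comparing gradient magnitudes of a plain log1p against the cbrt x log transform via finite differences:

import numpy as np

def log1p_t(x):
    return np.log1p(x)

def cbrt_log_t(x):  # mirrors transform3 above
    return np.power(x + 1, 1/3) * np.log(x + 10)

xs = np.array([0.0, 10.0, 100.0, 1_000.0, 100_000.0])
eps = 1e-3
for f in (log1p_t, cbrt_log_t):
    grad = (f(xs + eps) - f(xs - eps)) / (2 * eps)  # central difference
    print(f.__name__, np.round(grad, 5))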

better numerical range?

  • train on both log1p and cbrt x log, then compare error across real-value ranges after projecting predictions back; a bucketed-comparison sketch follows below
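
A rough sketch of that bucketed comparison; log1p inverts cleanly with expm1, while cbrt x log has no closed-form inverse and is inverted numerically here. All names (y_true, pred_log1p, pred_cbrt) are hypothetical placeholders:

import numpy as np
from scipy.optimize import brentq

def cbrt_log(x):
    return np.power(x + 1, 1/3) * np.log(x + 10)

def inv_cbrt_log(y, hi=1e9):
    # invert the monotonic transform numerically; clip y into its range on [0, hi]
    y = float(np.clip(y, cbrt_log(0.0), cbrt_log(hi)))
    return brentq(lambda x: cbrt_log(x) - y, 0.0, hi)

def bucketed_mae(y_true, y_pred, edges=(0, 10, 100, 1_000, 10_000, np.inf)):
    # mean absolute error per real fav-count bucket
    out = {}
    for lo, hi in zip(edges[:-1], edges[1:]):
        m = (y_true >= lo) & (y_true < hi)
        if m.any():
            out[(lo, hi)] = float(np.mean(np.abs(y_true[m] - y_pred[m])))
    return out

# hypothetical usage, after predicting in transformed space:
# real_pred_log1p = np.expm1(pred_log1p)
# real_pred_cbrt = np.array([inv_cbrt_log(v) for v in pred_cbrt])
# print(bucketed_mae(y_true, real_pred_log1p))
# print(bucketed_mae(y_true, real_pred_cbrt))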

better architecture?

  • train on log1p on both dinov2-large and siglip2-base

better hyperparameters?

  • in the dino v2 experiment the LR was changed a bit; analyze those results

transformation vs. loss penalty?

  • if a transformation fixes it, that is more straightforward than hacking the loss
  • some transformations may make optimization harder by distorting the gradient/derivatives too much (e.g. cbrt x log vs. the more stable log1p); a loss-side sketch follows below
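
For reference, a minimal sketch of what a loss-side fix could look like (up-weighting high-score samples in the MSE via a custom Trainer); this is one assumption about what "hacking the loss" might mean, not the current setup:

import torch
from transformers import Trainer

class WeightedMSETrainer(Trainer):
    """Hypothetical: penalize errors on rare high-score samples more, instead of re-transforming the target."""

    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        labels = inputs.pop("labels")
        outputs = model(**inputs)
        preds = outputs.logits.squeeze(-1)
        weights = 1.0 + labels.abs()  # weight grows with the (already transformed) target
        loss = (weights * (preds - labels.float()) ** 2).mean()
        return (loss, outputs) if return_outputs else loss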

Results

Metrics

maybe add each architecture's error across numerical ranges here

Metric     Training    Validation    Test
Loss       0.X         0.X           0.X
Acc        0.X         0.X           0.X
[Other]    0.X         0.X           0.X

maybe add more components from the template here

an important revisit of the data transforms for the skewed twitter fav distribution:

2025-03-twitter-logfav-transforms