Quick Notes
Warmup vs. no warmup (on DINOv2-large):
- No warmup seemed to give slightly better MSE (though the run did not go very far).
- Haven't yet tested whether adding warmup helps when continuing training.
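For a concrete picture of what the comparison means, here is a small sketch (not part of the training code) that traces the start of transformers' cosine-with-warmup schedule at the run's 1e-6 base LR; total_steps is a placeholder, the real run has far more optimizer steps:

import torch
from transformers import get_cosine_schedule_with_warmup

def lr_curve(warmup_steps, total_steps=10_000, base_lr=1e-6):
    # dummy single-parameter optimizer, only used to drive the scheduler
    opt = torch.optim.AdamW([torch.nn.Parameter(torch.zeros(1))], lr=base_lr)
    sched = get_cosine_schedule_with_warmup(opt, warmup_steps, total_steps)
    lrs = []
    for _ in range(total_steps):
        lrs.append(sched.get_last_lr()[0])
        opt.step()
        sched.step()
    return lrs

# the warmup run ramps from ~0 to 1e-6 over the first 1000 steps; no-warmup starts at 1e-6
with_warmup, no_warmup = lr_curve(1000), lr_curve(0)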
Twitter logfav vs. tailored transforms on numerical ranges:
- TODO
import numpy as np

def transform1(x):
    return np.log1p(x)  # plain log1p
def transform2(x):
    return np.sqrt(x) * np.log(x + 10)  # sqrt × log blend
def transform3(x):
    return np.power(x + 1, 1/3) * np.log(x + 10)  # cube-root × log ("cbrt × log" below)
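To compare the transforms on real value ranges later, model outputs have to be projected back to raw counts. log1p inverts exactly; transform2/transform3 have no closed-form inverse, so here is a rough numeric-inversion sketch (my addition, reusing the functions above and assuming non-negative raw values below the hi bound):

from scipy.optimize import brentq

def inverse_transform1(y):
    return np.expm1(y)  # exact inverse of log1p

def numeric_inverse(f, y, lo=0.0, hi=1e9):
    # transform2/transform3 are monotonic for x >= 0, so invert by root finding;
    # assumes f(lo) <= y <= f(hi)
    return brentq(lambda x: f(x) - y, lo, hi)

# e.g. numeric_inverse(transform3, y_pred) recovers a raw value from a cbrt × log prediction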
new architectures (TODO):
- Swin
- SigLIP 2
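As a quick sanity check before committing to a run, something like the following (my sketch; the Swin and SigLIP 2 checkpoint ids are guesses and should be verified on the Hub) confirms each candidate loads and shows how their preprocessing differs:

from transformers import AutoImageProcessor

CANDIDATES = {
    "facebook/dinov2-with-registers-large": "current baseline",
    "microsoft/swinv2-base-patch4-window8-256": "Swin V2 (assumed id)",
    "google/siglip2-base-patch16-224": "SigLIP 2 (assumed id)",
}

for ckpt, note in CANDIDATES.items():
    proc = AutoImageProcessor.from_pretrained(ckpt)
    # input resolution / normalization differ per architecture, which affects throughput
    print(ckpt, note, getattr(proc, "size", None))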
Objective
Provide reasonably good pretrained weights for the new aesthetics model.
Hypothesis: different model architectures (ConvNeXt V2, DINOv2, ViT, etc.) may differ in how hard they are to train for aesthetic classification.
Setup
Dataset
- Source: datatmp/data_twitter_comp_score_full_filtered (Hugging Face Datasets)
- Size: [~3.6M train / 72k val (2%)]
- Preprocessing: [key preprocessing steps]
# Key preprocessing code snippet if relevant
import pandas as pd

df = pd.read_csv("data.csv")
df = df.dropna().reset_index(drop=True)
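The run below actually feeds a parquet file via --parquet_path; a rough sketch of the equivalent load and 2% eval split (the real split logic lives in train_with_parquet.py, and the seed here just echoes the script's --seed 1337):

import pandas as pd

df = pd.read_parquet("data_twitter_fav-normed_full_filtered.parquet")  # includes local_path, score
eval_df = df.sample(frac=0.02, random_state=1337)  # ~72k rows held out for evaluation
train_df = df.drop(eval_df.index)                  # ~3.6M rows for training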
Model Architecture
- Base: facebook/dinov2-with-registers-large
- Modifications: [None]
- Parameters: [See training config for details]
# trainlib/src/trainlib/hf_trainer/image_classification.py
from transformers import AutoModelForImageClassification

model = AutoModelForImageClassification.from_pretrained(
    model_args.model_name_or_path,
    config=config,
    cache_dir=model_args.cache_dir,
    revision=model_args.model_revision,
    token=model_args.token,
    trust_remote_code=model_args.trust_remote_code,
    ignore_mismatched_sizes=model_args.ignore_mismatched_sizes,
)
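The config passed above is presumably built with num_labels=1 for the regression task; a sketch of roughly what that looks like (the actual code in image_classification.py may differ). With num_labels=1 and problem_type="regression", transformers' classification head falls back to an MSE loss, which is the metric discussed in the notes above:

from transformers import AutoConfig

config = AutoConfig.from_pretrained(
    model_args.model_name_or_path,
    num_labels=1,                 # single output neuron for the score
    problem_type="regression",    # built-in loss becomes MSELoss(logits, labels)
    cache_dir=model_args.cache_dir,
    revision=model_args.model_revision,
)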
Training Configuration
- Optimizer: [AdamW]
- Learning rate: [1e-6, cosine schedule]
- Batch size: [64*8=512]
- Hardware: [8xH100 1node]
- Training time: [around 27 hours for 10 epochs]
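Back-of-the-envelope step counts under this config (assuming ~3.6M training rows, per the comment in the script below), mainly to see how big the 1000-step warmup is relative to the whole run:

steps_per_epoch = 3_600_000 // (64 * 8)  # ~7.0k optimizer steps per epoch at batch 512
total_steps = steps_per_epoch * 10       # ~70k steps over 10 epochs
warmup_fraction = 1000 / total_steps     # warmup covers only ~1.4% of training
print(steps_per_epoch, total_steps, round(warmup_fraction, 4))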
#!/bin/bash
# 2% of the data is used for evaluation (around 3.6M train / 72k val)
# data also available at: https://huggingface.co/datasets/datatmp/data_twitter_fav-normed_full_filtered
# also, max grad norm seems to default to 1.0 without needing a flag
# DINOv2-large:
# Learn Rate: https://github.com/facebookresearch/dinov2/issues/252
# Warmup is longer (1000 steps)
# https://x.com/i/grok/share/IqCZCK1BH3VveKNYkaJjvgOAE
# (the ConvNeXt V2 run seems to have used too high an LR?)
# (used a much lower one on DINOv2 this time)
python train_with_parquet.py \
--model_name_or_path "facebook/dinov2-with-registers-large" \
--output_dir ./dinov2-large-reg_twitter-logfav \
--remove_unused_columns False \
--image_column_name "local_path" \
--label_column_name "score" \
--task regression \
--do_train True \
--do_eval True \
--learning_rate 1e-6 \
--num_train_epochs 10 \
--per_device_train_batch_size 64 \
--logging_strategy steps \
--logging_steps 10 \
--eval_strategy steps \
--eval_steps 2400 \
--per_device_eval_batch_size 32 \
--save_strategy steps \
--save_steps 1200 \
--seed 1337 \
--allow_no_dataset_arg True \
--dataloader_num_workers 16 \
--fp16 True \
--parquet_path "/lv0/test_aesthetics/trainlib/projects/aesthetics/data/legacy/data_twitter_fav-normed_full_filtered.parquet" \
--ignore_mismatched_sizes True \
--warmup_steps 1000 \
--max_grad_norm 1.0 \
--lr_scheduler_type cosine
# --load_best_model_at_end True \
# --save_total_limit 10 \
Eval ideas (on twitter logfav):
Better gradient?
- log1p(x + 10): straightforward gradient, vs. the current power transform (harder gradient)
Better numerical range?
- Train on both log1p and cbrt × log, and compare error across real value ranges after projecting back (see the sketch after this list)
Better architecture?
- Train with log1p on both dinov2-large and siglip2-base
Better hyperparameters?
- The DINOv2 experiment changed the LR a bit; analyze those results
Transformation vs. loss penalty?
- If a transformation fixes it, that is more straightforward than hacking the loss
- Some transformations may make optimization harder by distorting the gradient/derivatives too much (e.g. cbrt × log compared to the more stable log1p)
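A minimal sketch of the per-range comparison (my own, not from the repo): project predictions back to raw fav counts, bucket by the true count, and compare MSE per bucket. The bucket edges are arbitrary; for the cbrt × log model the inverse would come from a numeric inversion like the one sketched in the Quick Notes:

import numpy as np
import pandas as pd

def bucketed_mse(y_true_raw, y_pred_raw,
                 edges=(-1, 10, 100, 1_000, 10_000, 100_000, float("inf"))):
    # group samples by their true fav count and report MSE within each bucket
    buckets = pd.cut(np.asarray(y_true_raw), bins=list(edges))
    err = (np.asarray(y_pred_raw) - np.asarray(y_true_raw)) ** 2
    return pd.Series(err).groupby(buckets, observed=True).mean()

# e.g. for the log1p model, predictions live in transformed space, so invert first:
# print(bucketed_mse(y_true_raw, np.expm1(preds_log1p)))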
Results
Metrics
Maybe add each architecture's error across numerical value ranges here.
| Metric | Training | Validation | Test |
|---|---|---|---|
| Loss | 0.X | 0.X | 0.X |
| Acc | 0.X | 0.X | 0.X |
| [Other] | 0.X | 0.X | 0.X |
maybe add more components from the template here
Related
An important revisit of the data transforms for the skewed Twitter fav distribution: