A quickly put together (and not very complete) summary of current research ideas on anime aesthetics classification:
Model Architecture & Training Strategies
Human labels are not scalable, so we want to develop ways to leverage existing human signals efficiently and transfer them better.
Architecture Comparison:
- We’re examining how different neural network architectures (e.g., ConvNeXt V2 vs. Vision Transformers) affect fine-tuning for aesthetic classification.
- (model performance differences aren’t very significant at small data scales, but appear more pronounced in larger-scale training runs)
Architectural Tendencies:
- Different architectures such as Vision Transformers and ConvNeXts place varying emphasis on textures versus structure, which affects their performance in aesthetic classification.
- Paper: Architectural Analysis ← also introduced the idea of using a synthetic dataset for ablations
- (and a few other papers on the topic)
Pre-training Strategies:
- We’re considering the impact of pre-training on large, general datasets and how it affects fine-tuning on our smaller, aesthetic-specific dataset.
- An alternative is to initialize the model from Danbooru taggers directly and see whether it generalizes better than general pre-trains such as ImageNet / SO400M.
- We’re also analyzing how different pre-training targets (classification, regression, unsupervised) influence results.
  - E.g., regression targets with tailored transforms may or may not transfer well to ordinal classification (human labels)
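One concrete recipe for bridging that gap (a sketch; the function names and the quantile-calibration approach are my illustration, not the project's actual method) is to calibrate cut points on a small labeled set so continuous regression outputs map onto ordinal classes:

```python
import numpy as np

def fit_quantile_thresholds(scores: np.ndarray, n_classes: int) -> np.ndarray:
    """Learn cut points that split continuous scores into n_classes
    roughly equally populated ordinal bins."""
    qs = np.linspace(0.0, 1.0, n_classes + 1)[1:-1]  # interior quantiles only
    return np.quantile(scores, qs)

def scores_to_ordinal(scores: np.ndarray, thresholds: np.ndarray) -> np.ndarray:
    """Map continuous regression outputs to ordinal class indices 0..K-1."""
    return np.searchsorted(thresholds, scores, side="right")
```

The thresholds would be fit once on whatever labeled data is available, then reused to turn the regression head's outputs into ordinal predictions for comparison against human labels.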
Data Quality & Labeling Strategies
Same idea as the first part, but improving along the data-quality axis.
Data Cleaning Effects:
- We’re exploring whether removing outliers from the dataset improves model performance or makes it more brittle, especially with a limited dataset.
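One simple way to operationalize "outlier" here (a hedged sketch, one option among many, not the project's chosen filter) is to flag samples whose label disagrees strongly with a reference model's prediction, using a robust MAD-based z-score so the filter itself isn't skewed by the outliers:

```python
import numpy as np

def filter_label_outliers(labels: np.ndarray, preds: np.ndarray,
                          z_thresh: float = 3.0) -> np.ndarray:
    """Return a boolean keep-mask: True for samples whose label/prediction
    residual lies within z_thresh robust (MAD-based) standard deviations."""
    residuals = labels - preds
    med = np.median(residuals)
    mad = np.median(np.abs(residuals - med)) + 1e-8  # avoid divide-by-zero
    robust_z = 0.6745 * (residuals - med) / mad       # 0.6745 ≈ MAD→σ factor
    return np.abs(robust_z) <= z_thresh
```

Training once with and once without the flagged samples would give a direct measurement of the brittleness question above.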
Soft Labeling:
- To address label scarcity, we’re considering using soft labels based on the most probable predictions to augment the dataset.
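A minimal sketch of that idea (function names are illustrative): run the current model on unlabeled images, keep only confident predictions, and use the full softmax distribution — not a hard argmax one-hot — as the augmented training target:

```python
import numpy as np

def softmax(logits: np.ndarray) -> np.ndarray:
    z = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def make_soft_labels(logits: np.ndarray, confidence: float = 0.8):
    """Return (mask, soft_labels): keep only samples whose top predicted
    probability exceeds `confidence`; the soft label is the full
    probability distribution rather than a one-hot."""
    probs = softmax(logits)
    mask = probs.max(axis=-1) >= confidence
    return mask, probs
```

The confidence threshold trades label quantity against label noise, which makes it a natural knob to ablate.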
Semi-supervised Learning:
- This approach could enhance model accuracy but requires more investment compared to simple soft labeling.
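To make the extra investment concrete, here is a sketch of one standard semi-supervised ingredient, FixMatch-style consistency training (my example, not something the source commits to): a confident pseudo-label from a weakly augmented view supervises the prediction on a strongly augmented view of the same image.

```python
import numpy as np

def softmax(z: np.ndarray) -> np.ndarray:
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def consistency_loss(weak_logits: np.ndarray, strong_logits: np.ndarray,
                     confidence: float = 0.9) -> float:
    """FixMatch-style consistency: hard pseudo-labels from the weakly
    augmented view supervise the strongly augmented view, masked so only
    confident weak predictions contribute."""
    weak_probs = softmax(weak_logits)
    pseudo = weak_probs.argmax(axis=-1)
    mask = weak_probs.max(axis=-1) >= confidence
    strong_probs = softmax(strong_logits)
    nll = -np.log(strong_probs[np.arange(len(pseudo)), pseudo] + 1e-12)
    return float((nll * mask).sum() / max(mask.sum(), 1))
```

Compared with one-off soft labeling, this loss runs inside the training loop every step, which is where the extra engineering and compute cost comes from.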
Rater-Specific Modeling
Extending the data-quality idea: consider training ensembles of models on individual raters, or optimizing further with stratified ensembles over rater groups.
Individual Classifiers for Raters:
- Training classifiers tailored to individual raters could improve robustness and potentially help in cleaning up outlier data.
Clustering Raters by Preference:
- Grouping raters based on preferences and using stratified averaging may provide a better representation of the population.
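The stratified-averaging step can be sketched as follows (assuming cluster assignments already exist, e.g. from k-means over rater preference vectors; the function name is mine): average within each preference cluster first, then average the cluster means, so large rater groups don't dominate the aggregate.

```python
import numpy as np

def stratified_rating_mean(ratings, cluster_ids) -> float:
    """Average ratings per preference cluster first, then average the
    cluster means, giving each cluster equal weight regardless of size."""
    ratings = np.asarray(ratings, dtype=float)
    cluster_ids = np.asarray(cluster_ids)
    cluster_means = [ratings[cluster_ids == c].mean()
                     for c in np.unique(cluster_ids)]
    return float(np.mean(cluster_means))
```

For example, with ratings [1, 1, 1, 5] where the three 1s come from one cluster and the 5 from another, the plain mean is 2.0 but the stratified mean is 3.0 — the minority preference group keeps its full weight.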
Explainability / Importance of Features:
- Findings from human labelling may give some insight into “what’s generally preferred” for future data collections.
- Ex. saturation, contrast, subject matter, structural information
- Ex2. anatomy may or may not be tied to perceived better aesthetics.
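A first-pass probe for the low-level features above (a sketch; these particular statistics and function names are my choice): compute simple per-image saturation and contrast, then correlate each against the human ratings.

```python
import numpy as np

def saturation(img: np.ndarray) -> float:
    """Mean HSV-style saturation of an RGB float image in [0, 1]."""
    mx = img.max(axis=-1)
    mn = img.min(axis=-1)
    return float(np.mean(np.where(mx > 0, (mx - mn) / np.maximum(mx, 1e-8), 0.0)))

def contrast(img: np.ndarray) -> float:
    """RMS contrast: standard deviation of the luminance channel."""
    lum = img @ np.array([0.299, 0.587, 0.114])  # ITU-R BT.601 weights
    return float(lum.std())

def feature_score_correlation(features, scores) -> float:
    """Pearson correlation between one hand-crafted feature and ratings."""
    return float(np.corrcoef(features, scores)[0, 1])
```

Subject matter and structure would of course need richer features than this, but even crude correlations can tell us which axes are worth controlling in future data collection.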
Input & Computational Considerations
Finding computationally efficient settings for faster training and higher experiment churn rates.

Resolution Effects:
- We’re analyzing how different input resolutions impact model performance, balancing detail capture against computational cost.
- Planning to create and train on a synthetic dataset for better controls
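For planning the resolution sweep, a back-of-envelope cost estimate helps (illustrative sketch assuming a ViT-style model with 16-pixel patches; CNN scaling differs): token count grows quadratically with resolution, and self-attention cost grows quadratically with token count.

```python
def vit_token_count(resolution: int, patch: int = 16) -> int:
    """Number of patch tokens for a square input at the given resolution."""
    return (resolution // patch) ** 2

def relative_attention_cost(res_a: int, res_b: int, patch: int = 16) -> float:
    """How much more expensive self-attention is at res_a vs. res_b
    (attention FLOPs scale with tokens squared)."""
    ta, tb = vit_token_count(res_a, patch), vit_token_count(res_b, patch)
    return (ta / tb) ** 2
```

For example, going from 224px to 448px quadruples the tokens (196 → 784) but makes the attention roughly 16× more expensive, which is why higher-resolution runs cut experiment churn so sharply.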
MISC

Leveraging VLM / human critique datasets:
- https://huggingface.co/blog/PandorAI1995/image-analysis-janus
- (There’s also a dataset on this; update later)