Diffusion-based models have become a dominant approach to image generation. However, these models focus almost entirely on noise perception and neglect image understanding, which limits further improvement. To bridge this gap, we introduce a novel architecture that combines a feature-aggregation encoder with a noise-aware decoder. The encoder, built on a ViT backbone, extracts multi-level features and strengthens the model's understanding of images. The decoder handles noise awareness by modeling the interaction between noise intensity and image features. Together, the encoder and decoder enable the model to generate realistic images. Experiments show that our model achieves competitive results against the baseline on both class-conditional image generation and image classification.
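To make the two components concrete, the following is a minimal sketch and not the implementation described in this paper. It assumes that multi-level features are fused by simple averaging and that the noise intensity interacts with the features through a FiLM-style scale and shift derived from a sinusoidal embedding. The function names `aggregate_features`, `noise_embedding`, and `modulate` are hypothetical, and no learned weights are involved.

```python
import math


def aggregate_features(levels):
    """Average same-length feature vectors taken from several encoder depths.

    Assumption: mean pooling stands in for the feature aggregation step.
    """
    if not levels:
        raise ValueError("need at least one feature level")
    dim = len(levels[0])
    if any(len(v) != dim for v in levels):
        raise ValueError("all levels must share one dimension")
    return [sum(v[i] for v in levels) / len(levels) for i in range(dim)]


def noise_embedding(sigma, dim):
    """Sinusoidal embedding of a scalar noise intensity (dim must be even)."""
    if dim % 2 != 0:
        raise ValueError("dim must be even")
    half = dim // 2
    freqs = [math.exp(-math.log(10000.0) * k / half) for k in range(half)]
    sines = [math.sin(sigma * f) for f in freqs]
    cosines = [math.cos(sigma * f) for f in freqs]
    return sines + cosines


def modulate(features, sigma):
    """Scale and shift features by noise-dependent terms (FiLM-style).

    Assumption: the sine half of the embedding acts as the scale and the
    cosine half as the shift; a real model would learn this mapping.
    """
    dim = len(features)
    emb = noise_embedding(sigma, 2 * dim)
    scale, shift = emb[:dim], emb[dim:]
    return [x * (1.0 + s) + b for x, s, b in zip(features, scale, shift)]


# Example: fuse two hypothetical feature levels, then condition on noise.
fused = aggregate_features([[1.0, 2.0], [3.0, 4.0]])
conditioned = modulate(fused, sigma=0.5)
```

In this sketch the conditioned features change smoothly with `sigma`, which is one simple way a decoder could be made aware of the current noise level.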
Draft submission deadline: September 20, 2024
Registration deadline: September 22, 2024