Paper figures can now be generated automatically with a diffusion model, and the work was accepted by ICLR

What if the figures in a paper no longer had to be drawn by hand? Wouldn't that be a convenience for researchers? Some have explored exactly this, generating paper figures from text descriptions, and the results are quite impressive.

Editors: Du Wei, Zi Wen

Image source: Generated by Unbounded AI

Generative AI has taken the artificial intelligence community by storm. Individuals and enterprises alike are keen to build modality-conversion applications such as text-to-image, text-to-video, and text-to-music generation.

Recently, researchers from institutions including ServiceNow Research and LIVIA set out to generate paper figures from text descriptions. To this end, they proposed a new method called FigGen, and the accompanying paper was accepted as a Tiny Paper at ICLR 2023.

Paper address:

Some may ask: what is so difficult about generating the figures in a paper, and how does it help scientific research?

Scientific figures help disseminate research results in a concise, understandable way, and generating them automatically offers researchers clear advantages, such as saving the time and effort of designing figures from scratch. Moreover, visually appealing and comprehensible figures make a paper accessible to a wider audience.

Generating figures, however, poses its own challenges: a figure must represent complex relationships between discrete components such as boxes, arrows, and text. Unlike natural images, the concepts in paper figures can have many different valid representations and require fine-grained understanding; generating a neural-network diagram, for example, is an ill-posed problem with high variance.

The researchers therefore train a generative model on a dataset of figure-description pairs from papers, capturing the relationship between figure components and the corresponding text. This requires handling text descriptions of varying length and high technicality, as well as diverse figure styles, image aspect ratios, and different text rendering fonts, sizes, and orientations.

In the implementation, the researchers drew inspiration from recent text-to-image successes and used a diffusion model to generate figures, proposing FigGen, a latent diffusion model that generates scientific figures from text descriptions.

What makes this diffusion model special? Let's look at the details.

Model and method

The researchers trained a latent diffusion model from scratch.

An image autoencoder is first learned to map images into a compressed latent representation; the image encoder is trained with a KL loss and an OCR perceptual loss. The text encoder used for conditioning is learned end to end during diffusion-model training. Table 3 below lists the detailed parameters of the image autoencoder architecture.

The diffusion model then operates directly in this latent space: a forward schedule progressively corrupts the data, while a denoising U-Net conditioned on the timestep and the text learns to reverse the process.

As for the dataset, the researchers used Paper2Fig100k, which consists of figure-text pairs from papers and contains 81,194 training samples and 21,259 validation samples. Figure 1 below shows examples of figures generated from text descriptions in the Paper2Fig100k test set.

Model details

First, the image encoder. In the first stage, the image autoencoder learns a mapping from pixel space to a compressed latent representation, which makes diffusion-model training faster. The autoencoder must also learn to map latents back to pixel space without losing important details of the figure (such as text rendering quality).

To this end, the researchers define a convolutional encoder-decoder with a bottleneck that downsamples images by a factor of f=8. The encoder is trained to minimize a KL loss toward a Gaussian distribution, a VGG perceptual loss, and an OCR perceptual loss.
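To make the loss combination concrete, here is a minimal PyTorch sketch of how a KL term, a VGG perceptual term, and an OCR perceptual term could be summed, assuming a diagonal Gaussian latent. This is not the authors' code: the OCR feature extractor is a hypothetical placeholder (the paper uses features from a pretrained OCR model), and the loss weights are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import vgg16, VGG16_Weights

class PerceptualLoss(nn.Module):
    """L1 distance between intermediate features of a frozen network."""
    def __init__(self, feature_net: nn.Module):
        super().__init__()
        self.net = feature_net.eval()
        for p in self.net.parameters():
            p.requires_grad_(False)

    def forward(self, recon, target):
        return F.l1_loss(self.net(recon), self.net(target))

def kl_to_standard_normal(mu, logvar):
    # KL(q(z|x) || N(0, I)) for a diagonal Gaussian posterior
    return 0.5 * torch.mean(torch.exp(logvar) + mu**2 - 1.0 - logvar)

# Frozen VGG features for the perceptual term; the OCR network is a stand-in.
vgg_loss = PerceptualLoss(vgg16(weights=VGG16_Weights.IMAGENET1K_V1).features[:16])
ocr_loss = PerceptualLoss(nn.Sequential(nn.Conv2d(3, 32, 3, 2, 1), nn.ReLU()))  # placeholder

def autoencoder_loss(recon, target, mu, logvar,
                     w_kl=1e-6, w_vgg=1.0, w_ocr=1.0):   # weights are illustrative
    rec = F.l1_loss(recon, target)                        # pixel reconstruction
    return (rec
            + w_kl * kl_to_standard_normal(mu, logvar)
            + w_vgg * vgg_loss(recon, target)
            + w_ocr * ocr_loss(recon, target))
```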

Next, the text encoder. The researchers found that general-purpose text encoders are poorly suited to the figure-generation task. They therefore define a BERT-style transformer trained from scratch during diffusion training, with an embedding size of 512, which is also the embedding size that conditions the U-Net's cross-attention layers. They also explore varying the number of transformer layers (8, 32, and 128).
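The sketch below illustrates, in generic PyTorch, how text embeddings of size 512 can condition a U-Net feature map through cross-attention. It is a simplified assumption about the mechanism, not the paper's implementation; the block shape, head count, and token length are made up for the example.

```python
import torch
import torch.nn as nn

class CrossAttentionBlock(nn.Module):
    def __init__(self, channels: int, context_dim: int = 512, num_heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(channels)
        self.attn = nn.MultiheadAttention(
            embed_dim=channels, num_heads=num_heads,
            kdim=context_dim, vdim=context_dim, batch_first=True)

    def forward(self, x, context):
        # x: (B, C, H, W) U-Net feature map; context: (B, T, 512) text tokens
        b, c, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)            # (B, H*W, C)
        attended, _ = self.attn(self.norm(tokens), context, context)
        tokens = tokens + attended                        # residual connection
        return tokens.transpose(1, 2).reshape(b, c, h, w)

# Example: conditioning a 64-channel feature map on 77 text tokens
feat = torch.randn(2, 64, 16, 16)
text = torch.randn(2, 77, 512)
out = CrossAttentionBlock(channels=64)(feat, text)        # same shape as feat
```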

Finally, the latent diffusion model. Table 2 below shows the U-Net architecture. The diffusion process operates on a perceptually equivalent latent representation of the image, compressed to 64x64x4, which makes the diffusion model faster. The researchers use 1,000 diffusion steps and a linear noise schedule.
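As a quick illustration, here is a small sketch of a 1,000-step linear noise schedule and the forward corruption applied to 64x64x4 latents. The beta endpoints are common defaults and only an assumption; the source does not specify them.

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 2e-2, T)            # linear schedule (assumed endpoints)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

def q_sample(z0, t, noise=None):
    """Sample z_t ~ q(z_t | z_0) for a batch of latents z0 at timesteps t."""
    if noise is None:
        noise = torch.randn_like(z0)
    a = alphas_cumprod[t].view(-1, 1, 1, 1)
    return a.sqrt() * z0 + (1.0 - a).sqrt() * noise

# Example: corrupt a batch of two 64x64x4 latents at random timesteps
z0 = torch.randn(2, 4, 64, 64)
t = torch.randint(0, T, (2,))
zt = q_sample(z0, t)
```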

Training Details

To train the image autoencoder, the researchers used an Adam optimizer with an effective batch size of 4 samples and a learning rate of 4.5e-6, on four 12GB NVIDIA V100 GPUs. For training stability, they warm up the model for 50k iterations before enabling the discriminator.
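The following self-contained sketch shows how such a warmup gate might look in practice: Adam at lr 4.5e-6, with the adversarial term switched on only after 50k steps. The tiny modules, the L1 reconstruction term, and the specific GAN loss are all placeholders standing in for the real autoencoder, discriminator, and full loss.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

autoencoder = nn.Sequential(nn.Conv2d(3, 3, 3, padding=1))     # placeholder model
discriminator = nn.Sequential(nn.Conv2d(3, 1, 4, stride=2))    # placeholder model

opt = torch.optim.Adam(autoencoder.parameters(), lr=4.5e-6)
WARMUP_STEPS = 50_000

def training_step(step: int, images: torch.Tensor):
    recon = autoencoder(images)
    loss = F.l1_loss(recon, images)                # stands in for the full KL/VGG/OCR loss
    if step >= WARMUP_STEPS:                       # adversarial term only after warmup
        loss = loss + F.softplus(-discriminator(recon)).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

# Example call with a dummy batch of 4 images (the effective batch size above)
_ = training_step(step=0, images=torch.randn(4, 3, 256, 256))
```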

To train the latent diffusion model, they again use the Adam optimizer, with an effective batch size of 32 and a learning rate of 1e-4. Training on the Paper2Fig100k dataset used eight 80GB NVIDIA A100 GPUs.
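For intuition, here is a hedged sketch of a single latent-diffusion training step with those optimizer settings and a standard epsilon-prediction objective. The `unet` and `text_encoder` below are crude placeholders, not the architectures in Tables 2 and 3, and the real denoiser would also take the timestep and text context as inputs.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

unet = nn.Conv2d(4, 4, 3, padding=1)               # placeholder denoiser
text_encoder = nn.Embedding(30522, 512)            # placeholder BERT-style encoder
opt = torch.optim.Adam(
    list(unet.parameters()) + list(text_encoder.parameters()), lr=1e-4)

T = 1000
betas = torch.linspace(1e-4, 2e-2, T)              # same assumed schedule as above
abar = torch.cumprod(1.0 - betas, dim=0)

def diffusion_step(z0, token_ids):
    t = torch.randint(0, T, (z0.shape[0],))
    noise = torch.randn_like(z0)
    a = abar[t].view(-1, 1, 1, 1)
    zt = a.sqrt() * z0 + (1.0 - a).sqrt() * noise  # forward corruption
    context = text_encoder(token_ids)              # (B, T, 512); unused by the placeholder
    pred = unet(zt)                                # real U-Net also consumes t and context
    loss = F.mse_loss(pred, noise)                 # epsilon-prediction objective
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

# Example with the effective batch size of 32 and dummy latents/token ids
_ = diffusion_step(torch.randn(32, 4, 64, 64), torch.randint(0, 30522, (32, 77)))
```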

Experimental results

For generation, the researchers adopted a DDIM sampler with 200 steps and generated 12,000 samples per model to compute FID, IS, KID, and OCR-SIM. Throughout, they use classifier-free guidance (CFG) to test the effect of strengthening the conditioning.
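The sketch below shows the classifier-free guidance combination that a DDIM loop would apply at each of its steps: the conditional and unconditional noise predictions are mixed with a guidance scale. The tiny denoiser, the null-context convention, and the scale value 7.5 are illustrative assumptions, not details from the paper.

```python
import torch
import torch.nn as nn

class TinyDenoiser(nn.Module):
    """Placeholder denoiser; the real U-Net is conditioned via cross-attention."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(4, 4, 3, padding=1)
        self.ctx_proj = nn.Linear(512, 4)

    def forward(self, zt, t, context):
        # crude conditioning: add a pooled, projected text embedding per channel
        cond = self.ctx_proj(context.mean(dim=1))[:, :, None, None]
        return self.conv(zt + cond)                # timestep t ignored by this stand-in

unet = TinyDenoiser()

@torch.no_grad()
def cfg_epsilon(zt, t, context, null_context, scale: float):
    eps_cond = unet(zt, t, context)
    eps_uncond = unet(zt, t, null_context)          # empty ("null") text condition
    # CFG: push the prediction away from the unconditional estimate
    return eps_uncond + scale * (eps_cond - eps_uncond)

# One guided prediction for two latents; a 200-step DDIM sampler would call this
# at every step and use the result to update z_t.
z = torch.randn(2, 4, 64, 64)
text = torch.randn(2, 77, 512)
eps = cfg_epsilon(z, t=torch.tensor([500, 500]), context=text,
                  null_context=torch.zeros_like(text), scale=7.5)
```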

Table 1 below shows the results for the different text encoders. The large text encoder produces the best quantitative results, and conditional generation improves as the CFG scale is increased. Although the qualitative samples are not yet of sufficient quality to solve the task, FigGen has clearly grasped the relationship between text and figures.

Figure 2 below shows additional FigGen samples generated while tuning the classifier-free guidance (CFG) scale. The researchers observed that increasing the CFG scale, an effect they also quantified, improved image quality.

Figure 3 below shows more examples generated by FigGen. Note the variation in description length across samples, as well as how technical the text is; both strongly affect how difficult it is for the model to generate intelligible figures.

The researchers acknowledge, however, that while these generated figures cannot yet offer practical help to paper authors, they remain a promising direction of exploration.
