Paper figures can also be generated automatically: this diffusion-model approach was accepted by ICLR
Editors: Du Wei, Zi Wen
Generative AI has taken the artificial intelligence community by storm. Individuals and enterprises alike are keen to build cross-modal applications such as text-to-image, text-to-video, and text-to-music generation.
Recently, researchers from institutions including ServiceNow Research and LIVIA set out to generate the figures found in papers from text descriptions. To this end, they proposed a new method called FigGen, and the related paper was accepted as a Tiny Paper at ICLR 2023.
Some may ask: what is so hard about generating the figures in a paper, and how does this help scientific research?
Scientific figure generation helps disseminate research results in a concise and understandable way, and automating it offers researchers clear advantages, such as saving the time and effort of designing figures from scratch. Moreover, visually appealing and comprehensible figures make a paper accessible to a wider audience.
Generating such figures, however, is challenging: the model must capture complex relationships between discrete components such as boxes, arrows, and text. Unlike natural images, the concepts in paper figures can have many valid renderings and require fine-grained understanding; generating a neural-network diagram, for example, is an ill-posed problem with high variance.
The researchers therefore train a generative model on a dataset of figure-text pairs from papers, capturing the relationship between figure components and the corresponding text. This requires handling text descriptions of varying length and highly technical content, diverse figure styles, different image aspect ratios, and varying text-rendering fonts, sizes, and orientations.
In the concrete implementation, the researchers drew inspiration from recent text-to-image work and used a diffusion model to generate figures, proposing a latent diffusion model for generating scientific figures from text descriptions: FigGen.
What is unique about this diffusion model? Let's move on to the details.
Model and method
The researchers trained a latent diffusion model from scratch.
An image autoencoder is first learned to map images into compressed latent representations; it is trained with a KL loss and an OCR perceptual loss. The text encoder used for conditioning is learned end-to-end during the training of the diffusion model. Table 3 below shows the detailed parameters of the image autoencoder architecture.
The diffusion model then operates directly in the latent space, applying the forward data-corruption schedule while learning to reverse the process with a denoising U-Net conditioned on time and text.
First is the image encoder. In the first stage, the image autoencoder learns a mapping from pixel space to a compressed latent representation, which makes diffusion-model training faster. It must also learn to map the latents back to pixel space without losing important details of the figure (such as the quality of rendered text).
To this end, they define a bottlenecked convolutional encoder-decoder that downsamples images by a factor of f=8. The encoder is trained to minimize a KL loss toward a Gaussian distribution, a VGG perceptual loss, and an OCR perceptual loss.
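As a rough illustration, the combined autoencoder objective could be assembled as in the sketch below. The feature extractors (vgg_features, ocr_features) and the loss weights are placeholders for this sketch, not values reported in the paper.

import torch
import torch.nn.functional as F

def autoencoder_loss(x, x_rec, mu, logvar, vgg_features, ocr_features,
                     w_kl=1e-6, w_vgg=1.0, w_ocr=1.0):
    """Reconstruction + KL + perceptual terms for the f=8 image autoencoder.

    x, x_rec     : original and reconstructed images, shape (B, 3, H, W)
    mu, logvar   : parameters of the Gaussian latent predicted by the encoder
    vgg_features : callable returning VGG feature maps (perceptual loss)
    ocr_features : callable returning features of a text-recognition backbone
                   (OCR perceptual loss, to preserve rendered text)
    Loss weights are illustrative placeholders.
    """
    rec = F.l1_loss(x_rec, x)                                      # pixel reconstruction
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())  # KL toward N(0, I)
    vgg = F.l1_loss(vgg_features(x_rec), vgg_features(x))          # VGG perceptual term
    ocr = F.l1_loss(ocr_features(x_rec), ocr_features(x))          # OCR perceptual term
    return rec + w_kl * kl + w_vgg * vgg + w_ocr * ocr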
Second is the text encoder. The researchers found that general-purpose text encoders are not well suited to figure generation. They therefore define a BERT-style transformer trained from scratch during diffusion training, with an embedding size of 512, which is also the embedding size that conditions the U-Net's cross-attention layers. They also explored varying the number of transformer layers across settings (8, 32, and 128).
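A minimal sketch of such a from-scratch, BERT-style encoder with 512-dimensional outputs is shown below; the vocabulary size, layer count, and maximum sequence length are assumptions for illustration only.

import torch
import torch.nn as nn

class TextEncoder(nn.Module):
    """BERT-style transformer trained from scratch; its 512-d token embeddings
    are used as context for the U-Net's cross-attention layers."""
    def __init__(self, vocab_size=30522, embed_dim=512, num_layers=8,
                 num_heads=8, max_len=256):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, embed_dim)
        self.pos_emb = nn.Parameter(torch.zeros(1, max_len, embed_dim))
        layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, token_ids):                  # token_ids: (B, L) int64
        h = self.token_emb(token_ids) + self.pos_emb[:, :token_ids.size(1)]
        return self.encoder(h)                     # (B, L, 512) conditioning context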
Finally there is the latent diffusion model. Table 2 below shows the U-Net network architecture. The diffusion process is performed on a perceptually equivalent latent representation of the image, compressed to an input size of 64x64x4, which makes the diffusion model faster. They define 1,000 diffusion steps and a linear noise schedule.
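The core training step of such a latent diffusion model could look roughly as follows. Here unet stands in for the time- and text-conditioned denoiser described above, text_ctx for the text encoder's output, and the beta range of the linear schedule is a common default rather than a value reported in the paper.

import torch
import torch.nn.functional as F

T = 1000
betas = torch.linspace(1e-4, 2e-2, T)              # linear noise schedule (illustrative range)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

def diffusion_loss(unet, z0, text_ctx):
    """One training step on image latents z0 of shape (B, 4, 64, 64)."""
    B = z0.size(0)
    t = torch.randint(0, T, (B,), device=z0.device)   # random timestep per sample
    noise = torch.randn_like(z0)
    a = alphas_cumprod.to(z0.device)[t].view(B, 1, 1, 1)
    z_t = a.sqrt() * z0 + (1 - a).sqrt() * noise      # forward corruption q(z_t | z_0)
    pred = unet(z_t, t, text_ctx)                     # time- and text-conditioned denoising U-Net
    return F.mse_loss(pred, noise)                    # noise-prediction objective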
To train the image autoencoder, the researchers used an Adam optimizer with an effective batch size of 4 samples and a learning rate of 4.5e-6, on four 12GB NVIDIA V100 GPUs. For training stability, they warm up the model for 50K iterations without engaging the discriminator.
For the latent diffusion model, they also use the Adam optimizer, with an effective batch size of 32 and a learning rate of 1e-4. Training on the Paper2Fig100k dataset used eight 80GB NVIDIA A100 GPUs.
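The reported optimizer settings can be expressed as a small configuration sketch; the module names (autoencoder, unet, text_encoder) refer to the placeholder sketches above, not to the authors' code.

import torch

# Stage 1: image autoencoder (effective batch size 4, four 12GB V100s,
# discriminator disabled during the 50K-iteration warm-up)
ae_opt = torch.optim.Adam(autoencoder.parameters(), lr=4.5e-6)

# Stage 2: latent diffusion model on Paper2Fig100k
# (effective batch size 32, eight 80GB A100s)
ldm_opt = torch.optim.Adam(
    list(unet.parameters()) + list(text_encoder.parameters()), lr=1e-4)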
Experimental results
During generation, the researchers used a DDIM sampler with 200 steps and generated 12,000 samples per model to compute FID, IS, KID, and OCR-SIM. Classifier-free guidance (CFG) is also used to test the effect of over-conditioning.
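A minimal sketch of DDIM sampling with classifier-free guidance is given below, assuming the schedule and placeholder unet from the training sketch; null_ctx stands for the unconditional (empty-prompt) context, and the guidance scale is a free parameter, not a value from the paper.

import torch

@torch.no_grad()
def ddim_sample(unet, text_ctx, null_ctx, steps=200, guidance=3.0,
                shape=(1, 4, 64, 64), device="cuda"):
    """Deterministic DDIM sampling (eta=0) with classifier-free guidance."""
    ts = torch.linspace(T - 1, 0, steps, device=device).long()  # 200 steps over 1,000
    z = torch.randn(shape, device=device)
    ac = alphas_cumprod.to(device)
    for i, t in enumerate(ts):
        # Guided noise estimate: uncond + scale * (cond - uncond)
        eps_c = unet(z, t.expand(shape[0]), text_ctx)
        eps_u = unet(z, t.expand(shape[0]), null_ctx)
        eps = eps_u + guidance * (eps_c - eps_u)
        a_t = ac[t]
        a_prev = ac[ts[i + 1]] if i + 1 < steps else torch.tensor(1.0, device=device)
        z0 = (z - (1 - a_t).sqrt() * eps) / a_t.sqrt()   # predicted clean latent
        z = a_prev.sqrt() * z0 + (1 - a_prev).sqrt() * eps
    return z                                             # decode with the image autoencoder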
Table 1 below shows the results for different text encoders. The large text encoder produces the best qualitative results, and conditional generation improves as the CFG scale increases. Although the qualitative samples are not yet of sufficient quality to solve the task, FigGen has clearly grasped the relationship between text and figures.