Creating Art with Vieutopia: a Revolutionary AI Art Generation App
Unlock the potential of your words and create one-of-a-kind digital artwork with our intuitive art creation resource
Introduction
Welcome to the exciting world of text-generated art! Have you ever wondered what your words would look like if they were transformed into mesmerizing digital artwork? Well, wonder no more! With our revolutionary text-generated art app, you can easily turn simple words into unique and visually stunning digital masterpieces.
Our app harnesses the power of artificial intelligence to understand the input text provided by the user and then generates corresponding digital artwork based on that understanding. The process is easy and requires no prior technical knowledge. With a user-friendly interface, you can simply input your desired text and generate art with the click of a button, with no learning curve.
But what sets our app apart from others is that we strive to create truly artistic, one-of-a-kind pieces that are not just a product of technology, but also have a touch of human creativity. Our app is designed to be a tool for artists, designers, and anyone who wants to explore and express their creativity in new ways.
Meanwhile, we understand the value and importance of traditional individual artists and their works, and we respect their creativity and hard work. Our app is intended to be a supplement to traditional art forms, not a replacement. It’s a new way to express yourself and generate new ideas, and we believe that it can coexist with traditional art forms to create new possibilities.
In this blog, we’ll take a closer look at how our text-generated art app works and the technology behind it, as well as show you some of the amazing artwork that can be generated with just a few simple words. So, let’s dive in and discover the exciting possibilities of text-generated art!
Background
Text-to-image technology, which generates digital pieces by interpreting and processing text inputs, is becoming increasingly popular and widely adopted. By combining the power of AI with human creativity, it can produce unique, visually stunning pieces, and it brings benefits to fields such as advertising, gaming, and more.
Since the late 1990s, Golan Levin has been using computer vision techniques to create art pieces. However, the technology at the time was not as advanced, and the resulting artworks fell short of expectations. The advent of deep learning models has revolutionized the field of text-to-art, making it a viable and usable tool for creating artwork. CogView and DALL·E, both built on the Transformer architecture, were truly impressive examples of text-generated art at the time of their release. One particular demo, the image of an avocado armchair, went viral on the internet, and I believe it left an indelible impression on anyone following the field at that time.
A mere year later, the introduction of the diffusion model propelled text-to-art to new heights. Applications such as Stable Diffusion are capable of generating images that match their text prompts with high resolution and intricate detail. This has not only made text-to-art popular in the professional field but has also garnered widespread attention and appeal among the general public.
While text-to-image models are able to generate photorealistic images, they may struggle to capture the diverse art styles and the texture of brushstrokes found in traditional art forms, especially for pieces with a strong artistic style. Our original intention was to build a model that can generate artworks with textures similar to fine art while covering as many art forms as possible, blending traditional art forms with new technology.
Technical Details
Text Encoder
First of all, the key to successful image generation from text prompts is ensuring that the algorithm understands them accurately and efficiently. Natural language is designed to convey abstract concepts and ideas, which can sometimes lead to ambiguous and subjective interpretations.
For example, “a red apple on a table” can have multiple visual representations, like a ripe red apple on a wooden table or a rotten red apple on a steel table, with varying lighting, background, angles and styles.
Similarly, a traditional artist would need to carefully consider all the details they want to include before starting to paint. That’s where the Text Encoder comes in. It plays the role of converting the textual description of an image into a numerical representation, known as a latent code, that the diffusion model can use as input.
The architecture of the text encoder we used is the ViT-B version of OpenCLIP. To ensure that our model can handle stylized images, we continued training it on CC0 art images, starting from the open-source weights pretrained on LAION-400M.
In simple terms, OpenCLIP converts both the natural-language text and the image into continuous latent representations in a shared embedding space, which is both memory-efficient and effective. After training, the model is able to establish relationships between these latent representations, so that a caption and its matching image end up close together.
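To make this concrete, here is a minimal sketch of how a prompt can be turned into a latent code with OpenCLIP. The model and checkpoint names (“ViT-B-32”, “laion400m_e32”) are assumptions for illustration; our fine-tuned art checkpoint is not shown, and the real pipeline differs in its details.

```python
import torch
import open_clip

# Load a ViT-B OpenCLIP model with publicly available LAION-400M weights.
# (Illustrative checkpoint; the app uses a version further trained on CC0 art images.)
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion400m_e32"
)
tokenizer = open_clip.get_tokenizer("ViT-B-32")

# Convert a prompt into its latent code, the numerical representation
# that the diffusion model later consumes as conditioning.
tokens = tokenizer(["a red apple on a table"])
with torch.no_grad():
    text_latent = model.encode_text(tokens)                  # shape (1, 512)
    text_latent = text_latent / text_latent.norm(dim=-1, keepdim=True)
```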
Diffusion Model
Our application uses the cutting-edge diffusion model as its backbone. While this model has gained widespread popularity and you may have encountered information about it elsewhere, to keep things consistent for our readers, I would like to briefly explain the diffusion model in this blog post. For those interested in a more in-depth mathematical explanation, I recommend checking out the original paper or this paper.
The diffusion model is a type of generative model trained on a dataset of images and captions. Its training process can be separated into two stages: adding noise to the image and denoising the noisy image. The first stage, adding noise, is exactly what it sounds like: noise is gradually introduced to the image, and the process is reparameterised so that any noise level can be sampled in a single step during training. In the second stage, the model attempts to restore the original image by reversing the noise-adding process, guided by the encoded prompt. This is also the stage used to generate images once the model is sufficiently trained. During training, the goal is for the restored image to closely resemble the input image, so a well-trained model can generate high-quality images from the prompts it is given.
Think of it like an artist tracing a piece: the artist tries to mimic every line and stroke so that the finished work is as close as possible to the original. The diffusion model does the same, except it imitates pixels rather than strokes.
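For readers who prefer code to prose, here is a minimal sketch of a single training step covering these two stages, in the style of a standard noise-prediction objective. The `unet`, `prompt_latents`, and `alphas_cumprod` names are placeholders; the real training loop involves much more (noise schedules, EMA weights, guidance, and so on).

```python
import torch
import torch.nn.functional as F

def diffusion_training_step(unet, images, prompt_latents, alphas_cumprod,
                            num_timesteps=1000):
    batch = images.shape[0]

    # Stage 1: add noise. Thanks to the reparameterised forward process,
    # a random noise level can be applied to each image in a single step.
    t = torch.randint(0, num_timesteps, (batch,), device=images.device)
    noise = torch.randn_like(images)
    a_bar = alphas_cumprod[t].view(batch, 1, 1, 1)
    noisy_images = a_bar.sqrt() * images + (1 - a_bar).sqrt() * noise

    # Stage 2: denoise. The network predicts the noise that was added,
    # conditioned on the encoded text prompt.
    predicted_noise = unet(noisy_images, t, prompt_latents)

    # The better this prediction, the closer the restored image is to the original.
    return F.mse_loss(predicted_noise, noise)
```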
Stroke Detection Algorithm
Have you ever wondered why existing models struggle to capture the intricate details and nuances of brushstrokes in art pieces? One likely reason is that they are trained to restore images pixel by pixel, rather than to emulate the strokes of human artists. This results in generated artworks that lack the coherence of strokes we see in human-created pieces.
However, it is not practical to expect the model to replicate brushstrokes exactly, as it does not understand the image in the same way that humans do. But what if we could teach the model to focus more on stroke information? That’s where my stroke detection algorithm comes in. I designed and trained a stroke detection algorithm that uses computer vision techniques to analyze the brushstrokes in training images, extracting information such as stroke width, direction, and colour.
By incorporating the output of this algorithm into the loss function of the diffusion model, we allow the model to learn to generate images with brushstrokes similar to those in the training images, resulting in more artistic and coherent generated artworks.
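Below is a hypothetical sketch of how such a stroke term could sit alongside the standard denoising objective. The `stroke_features` network stands in for the stroke detection algorithm described above (its details are not public), and the weighting is purely illustrative.

```python
import torch.nn.functional as F

def stroke_loss(stroke_features, generated, target):
    # Compare stroke maps (e.g. width, direction, and colour channels)
    # extracted from the generated image and the original image.
    return F.l1_loss(stroke_features(generated), stroke_features(target))

def total_loss(predicted_noise, noise, stroke_features, restored, original,
               stroke_weight=0.1):
    # Standard denoising objective plus a weighted stroke-consistency term.
    return (F.mse_loss(predicted_noise, noise)
            + stroke_weight * stroke_loss(stroke_features, restored, original))
```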
Style Encoding
Now that we have the ‘real’ strokes, how do we guide the model to generate different styles of artwork? Simply using natural language would not be precise enough for the model.
One solution is an existing technique called Textual Inversion, which allows the diffusion model to learn new concepts from a small set of images. However, textual inversion is more focused on content than on style. Additionally, retraining style concepts wastes computing resources, as the model has already learned these concepts; it simply does not represent them well.
To address this issue, we propose a technique called style encoding extraction. This involves using the same model that encodes the input prompts to encode the prompts of the training-set images belonging to a specific style, and identifying the codes specific to that style. Essentially, it is a feature extraction technique, and it works because these codes are exactly how the model understands and learns the style.
Thus, the input to the diffusion model is an amalgamation of the style encoding and the text encoding, yielding a unique and visually stunning image that truly embodies the selected style. With the combination of these two components, the diffusion model is able to produce images that are not only contextually relevant to the input text but also aesthetically appealing in the chosen style.
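As a rough illustration, the extraction and combination could look like the sketch below. Averaging prompt latents and linearly blending them with the user’s prompt are assumptions made for clarity; the exact extraction and fusion used in the app may differ.

```python
import torch

def extract_style_code(encode_text, style_prompts):
    # Encode every prompt attached to training images of one style and pool
    # the latents into a single code summarising how the model "sees" that style.
    latents = torch.stack([encode_text(p) for p in style_prompts])
    return latents.mean(dim=0)

def build_conditioning(encode_text, user_prompt, style_code, style_strength=0.5):
    # Blend the user's prompt encoding with the style code before handing
    # the result to the diffusion model as conditioning.
    prompt_code = encode_text(user_prompt)
    return (1 - style_strength) * prompt_code + style_strength * style_code
```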
Prompts Extension
So far, a professional AI art creator can produce stunning images with our application. However, we understand that not all users may have the technical knowledge or the patience to craft detailed and precise prompts. This is where our prompts extension model comes into play.
The prompts extension model is a GPT-2-based AI model, trained on a large corpus of AI-generated images and their ratings. It uses advanced feature extraction techniques to analyze the simple prompts provided by users and generate additional relevant information. The model has learned to recognize patterns and relationships between different concepts, allowing it to generate more complete and descriptive prompts.
For instance, if a user inputs a prompt like “a red apple on a table,” the model can extend it to a more vivid description like “a shiny red apple on a wooden table in a bright room with a white background.” With this additional information, the diffusion model can generate a more accurate representation of the scene described by the prompt.
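A bare-bones version of this idea, using the public GPT-2 checkpoint from the transformers library rather than our fine-tuned prompt-extension weights (which are not shown here), might look like this:

```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

def extend_prompt(prompt, max_new_tokens=32):
    # Continue the user's short prompt with additional descriptive detail.
    inputs = tokenizer(prompt, return_tensors="pt")
    output = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        do_sample=True,          # sampling keeps the extensions varied
        top_p=0.9,
        temperature=0.8,
        pad_token_id=tokenizer.eos_token_id,
    )
    return tokenizer.decode(output[0], skip_special_tokens=True)

print(extend_prompt("a red apple on a table"))
# Illustrative output: "a red apple on a table, wooden table, bright room, ..."
```

In the app, it is the fine-tuning on rated prompts that steers these continuations toward descriptors that tend to produce well-received images.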
With the feature extraction-based prompt extension function, users can easily generate more complex and intricate images without having to spend time crafting detailed prompts from scratch. This function seamlessly blends the power of AI with the creativity of the user, enabling users to create unique and beautiful artwork with ease.
Function Introduction
Innovation in AI has made the generation of stunning artwork, videos, and animations more accessible than ever before. Our app harnesses the latest AI algorithms introduced above to provide you with a user-friendly platform that can easily produce unique and captivating pieces. Our aim is to offer a smooth experience by doing as much as we can behind the scenes and minimizing the effort required from the user. With our text-to-art function, you can turn plain text into artistic masterpieces. Our image-to-image function allows you to transform an image into a completely different and personalized work of art. Additionally, our bird’s eye view video generation function takes your text prompts and generates eye-catching videos from a bird’s eye perspective. Whether you’re an artist, a filmmaker, or just eager to explore your creative side, our app’s functions are the ideal tools to unleash your imagination and bring your vision to life.
Text-to-Art
This function transforms written descriptions into stunning artwork. Simply input a text prompt describing the scene you want to create and the style you want to create it in, and the model will generate an image based on your description. With its advanced AI algorithms, the model can understand and interpret a wide range of textual cues, allowing you to create rich and detailed artwork with ease.
Bird’s Eye View Video Generation — Will be released in version 1.0.3
This function generates stunning bird’s eye view videos from either images or written descriptions. Simply describe the scene you want to create with a text prompt, and the model will generate a video from a bird’s eye perspective. With its advanced AI algorithms, the model can understand and interpret a wide range of textual cues, allowing you to create smooth and seamless videos that give you a unique perspective on the world around you. Whether you’re a filmmaker, travel enthusiast, or just want to explore new places, this function is a powerful tool for creating captivating and immersive videos.
Image-to-Image — Will be released in version 1.0.4
This function enables you to manipulate and modify existing images. You can change the colours, adjust the brightness and contrast, or even transform an image into a different style. Whether you want to touch up a photo or create a new work of art, the Image-to-Image function makes it easy to get the results you want.
More Styles …
In our ongoing efforts to make art accessible to everyone, we are committed to exploring and introducing new styles. Our next series of styles will focus on highlighting the rich cultural heritage of different communities, including African futurism. By incorporating diverse art forms, we aim to ensure that everyone has the opportunity to express their emotions and creativity through AI-generated art.
Our Commitment to Respect for Traditional Art and Copyright Protection
In our mission to make art creation accessible to everyone, we hope to empower and inspire the masses, not replace traditional artists. We respect and value the unique contributions of individual artists and strive to protect their interests in the art world. To ensure the protection of original works, we avoid using copyrighted data in our dataset whenever possible. If an artist would like to opt out of having their artistic style potentially generated by our model, they can simply reach out to us via email at support@vieutopia.com to request the removal of their pieces from the dataset (if included) and to restrict user input for their style. Our ultimate goal is to provide a platform for creativity and artistic expression for all while respecting the rights and contributions of individual artists.
In conclusion, with the advancements in AI technology, creating art has become more accessible to the masses. Our goal is not to replace traditional artists, but to provide a new platform for people to express themselves creatively. By leveraging our innovative diffusion model, text encoding, and other advanced techniques, anyone can turn their thoughts and feelings into unique and beautiful pieces of art. Whether you’re an experienced artist or just looking for a new way to express yourself, our AI art creation app has something for everyone.
We would like to extend our gratitude to Leonor Guedes and Karen Yan for their invaluable contributions to this blog. Their insights and hard work have helped make this blog a reality, and we couldn’t have done it without them.