A Data Scientist’s Guide to Using Image Generation Models
Are you wondering how to get started creating your own images with AI? There are a host of tools to choose from, and new advancements are made almost weekly. I’ve tinkered with a few tools, and I’m here to tell you my favorite, how I use it, and what you need to understand about the model architecture behind it as a Data Scientist.
This tutorial is for Data Scientists, by Data Scientists. If you follow my tips for becoming a great Data Scientist, you already know that understanding the math behind these models helps us get the most value from them and apply them properly. We don’t look at models like this as a black box. So in this article, I’ll talk through not only how to use the tools, but also the core math concepts behind these modeling methods.
The Art
Here is a secret. Although I’m a deeply technical person, I also have a creative side I’ve been keeping under wraps. When I’m not yelling about doing MLOps the right way and scaling out Engineering organizations, I’m actually spending my time on the creative side of my brain by working on my books of poetry, short stories, and a novel about the Zombie apocalypse. So like many other artists, I’m wondering how AI will play a role in how we create art.
So how do I use AI to assist in the creative process?
Well, I definitely started out by experimenting with asking ChatGPT to help me write text. But I quickly gave that up. Here is why. The content it produced was mediocre crap with mass appeal 😉. And there is a mathematical reason for that, which I’ll dive into. Sure, we can get the AI to mimic great writing if we feed it enough examples. But due to the heavy guardrails around these tools today, and the fact that these are probabilistic language models that are essentially just averaging the internet (or whatever data we dumped into them), I know that these models are only capable of generating fairly generic material with mass appeal. While that might be something that can be monetized, it’s not the purpose of my art. So after this experiment, I am of the strong opinion that a human in the loop is still required to create meaningful poetry and writing, as this form of art is drawn from deeply personal and individual experiences and perspectives.
Here is how I’m actually using AI to help my writing today.
I found that the image capabilities of Generative AI are far more impressive than the text capabilities in terms of artistic output. So I create AI art to help myself visualize and illustrate the storylines of my writing. I’ve found that through creating these images, I’ve refined the storylines and developed the characters at a deeper level. So generating AI art for my stories has become a method to overcome writer’s block!
I can also illustrate my more abstract poems. This validates that the poem is evoking the mental image I hoped to inspire the reader with. Pretty cool, huh?
So I’ve been tinkering with a few tools, and my favorites are Midjourney and, more recently, Dall-E 3. Before we dive into the tech, let’s look at an example of how I created one of these images. This is one of my favorite images I created using Dall-E 3 in ChatGPT 4.
To create your own AI art, you simply need to take these three easy steps:
- Log in to ChatGPT 4
- Write a great prompt
- Iterate on the image through the language chat interface
The thing to notice here is that the images are generated from written text. The model is actually interpreting human language and attempting to construct the images as described. This means that to create stunning and unique images, you need to get good at prompting.
The Tech
A generative text-to-image model is actually a system of multiple neural networks. These networks encode the text into embeddings, and then construct the image from those embeddings. So the models are trained to map words to images.
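To make that pipeline concrete, here is a minimal sketch using the open-source diffusers library, with Stable Diffusion standing in for Dall-E 3 (whose weights aren’t public). The model name and settings are illustrative assumptions, not the setup behind ChatGPT:

```python
# A minimal sketch of a text-to-image system using the open-source
# diffusers library. Stable Diffusion is a stand-in here; Dall-E 3's
# weights are not public. Assumes: pip install diffusers transformers torch
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",  # bundles a text encoder, a U-Net, and a VAE
    torch_dtype=torch.float16,
)
pipe = pipe.to("cuda")  # a GPU is strongly recommended

# The pipeline encodes the prompt into embeddings, denoises in latent
# space, then decodes the latents back into pixels.
image = pipe("a mermaid sitting near a frozen waterfall").images[0]
image.save("mermaid.png")
```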
In digital terms, an image is just data. It’s RGB pixels stored in various formats like JPEG or PNG.
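You can verify this yourself in a few lines of Python with Pillow and NumPy. This assumes a local file like the mermaid.png saved above:

```python
# Quick sanity check that an image really is just numbers: load a PNG
# and inspect the raw RGB array.
import numpy as np
from PIL import Image

img = Image.open("mermaid.png").convert("RGB")
pixels = np.asarray(img)

print(pixels.shape)  # (height, width, 3): one channel each for R, G, B
print(pixels.dtype)  # uint8: each channel is an integer in [0, 255]
print(pixels[0, 0])  # the top-left pixel's [R, G, B] values
```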
Computer vision is a field of AI that uses models to process this visual data. Let’s take a minute to look at the five core Computer Vision tasks: Image Classification, Object Detection, Semantic Segmentation, Instance Segmentation, and Image Generation.
- Image Classification: Image classification is a computer vision task that involves classifying an input image into predefined categories. The goal is to assign a single label or class to the entire image. For example, classifying an image of a mermaid as “mermaid” (see the code sketch after this list).
- Object Detection: Object detection is a computer vision task that goes beyond image classification by identifying AND locating multiple objects within the image. It provides both the class labels of detected objects and the positions of the objects in the image in terms of bounding boxes.
- Semantic Segmentation: Semantic segmentation is a pixel-level technique for object detection. Instead of using bounding boxes, each pixel in an image is assigned a class label to indicate the object label.
- Instance Segmentation: Instance segmentation is an advanced computer vision task that combines object detection and semantic segmentation. It not only identifies and segments individual objects in an image but also distinguishes between multiple instances of the same object class. Each object instance is assigned a unique label, allowing for precise object separation within the same class.
- Image Generation: Generative computer vision models create images from scratch or modify existing images to produce new visual content. These models produce images that mimic or create new visual data based on patterns and information learned from existing data.
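Here is that promised sketch of the first task, image classification, using a pretrained ResNet from torchvision. Note that the 1,000 ImageNet categories don’t actually include “mermaid,” so treat this purely as a demonstration of the mechanics:

```python
# Image classification with a pretrained ResNet-50 from torchvision.
# Assumes: pip install torch torchvision, and a local "mermaid.png".
import torch
from torchvision import models
from PIL import Image

weights = models.ResNet50_Weights.DEFAULT
model = models.resnet50(weights=weights).eval()

preprocess = weights.transforms()      # resize, crop, normalize
img = Image.open("mermaid.png").convert("RGB")
batch = preprocess(img).unsqueeze(0)   # add a batch dimension

with torch.no_grad():
    logits = model(batch)
probs = logits.softmax(dim=1)
top = probs.argmax(dim=1).item()

# Prints the single most likely ImageNet class label and its probability.
print(weights.meta["categories"][top], probs[0, top].item())
```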
What is a Probabilistic Language Model?
“Hallucination” is the term used to refer to the phenomenon where LLMs generate information that is not accurate or was not present in the training data or prompt. I actually like the hallucination behavior in the context of creativity. When these systems hallucinate, they are using patterns found in past data to come up with something new: the model is making a probabilistic prediction. These models predict the most likely word in a sentence given the current context and previous words. So instead of regurgitating facts it’s been fed, or telling you it can’t answer those questions (one of the most annoying responses imho), the hallucination is the model’s way of inferring a likely response. So if you have a context gap in your data, you still get a response; it just probably comes with low probability scores. Hallucinations are getting a bad rap these days, but that is just how statistical inference works, and it’s important to understand that when we apply these models and get a “hallucination.”
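A toy example makes the “probabilistic prediction” point concrete. This is not a real LLM; the vocabulary and probabilities below are made up purely to illustrate sampling:

```python
# A toy illustration (not a real LLM) of next-word prediction as
# sampling from a probability distribution over a vocabulary.
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical model output: probabilities for the next word after
# "the mermaid sat on the ...". These numbers are invented.
vocab = ["rock", "cliff", "ice", "moon", "spreadsheet"]
probs = [0.40, 0.30, 0.20, 0.07, 0.03]

# Sampling rather than always taking the argmax is what lets the model
# produce novel continuations instead of one fixed answer.
for _ in range(5):
    print(rng.choice(vocab, p=probs))
```

Every once in a while the sampler picks a low-probability word; that same mechanism, at a massive scale, is what produces a hallucination.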
What are Embeddings?
All of these Computer Vision and Natural Language Processing tasks are not possible without embeddings. Embeddings are an important data format used to represent unstructured image and text data. Models are just algorithms that detect patterns in numbers and encode those patterns mathematically to make future predictions. So for this image and text data to be useful to a mathematical model, we have to turn it into numbers. That is what an embedding does. Embeddings are vector representations of this data. They capture semantic similarities and visual patterns in the data, encoded as vectors in an n-dimensional latent space.
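Here is a small demo using the sentence-transformers library (the model name is just one popular choice, not a requirement) showing how semantically similar text lands close together in the latent space:

```python
# Text embeddings with sentence-transformers.
# Assumes: pip install sentence-transformers
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

sentences = [
    "a mermaid beneath a starlit sky",
    "a sea maiden under the stars",
    "quarterly revenue exceeded forecasts",
]
embeddings = model.encode(sentences)  # one vector per sentence
print(embeddings.shape)               # (3, 384): a 384-dimensional latent space

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Semantically similar sentences get high similarity scores.
print(cosine(embeddings[0], embeddings[1]))  # high: both describe a mermaid
print(cosine(embeddings[0], embeddings[2]))  # low: unrelated topics
```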
What is a Variational Autoencoder?
A Variational Autoencoder (VAE) is a neural network architecture that learns to encode data into a lower-dimensional latent space and decode it back. These models are used to create embeddings. A VAE uses probabilistic techniques for encoding and decoding, making it useful for data generation. Once trained, VAEs can generate new data samples by sampling from the latent space and decoding those samples.
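A minimal PyTorch sketch shows the three moving parts: an encoder that outputs the mean and log-variance of a latent distribution, the reparameterization trick for sampling, and a decoder. The layer sizes here are illustrative assumptions (flattened 28x28 images):

```python
# A minimal VAE sketch in PyTorch. Dimensions are illustrative.
import torch
import torch.nn as nn

class VAE(nn.Module):
    def __init__(self, input_dim=784, hidden_dim=256, latent_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(input_dim, hidden_dim), nn.ReLU())
        self.fc_mu = nn.Linear(hidden_dim, latent_dim)      # mean of q(z|x)
        self.fc_logvar = nn.Linear(hidden_dim, latent_dim)  # log-variance of q(z|x)
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, input_dim), nn.Sigmoid(),
        )

    def forward(self, x):
        h = self.encoder(x)
        mu, logvar = self.fc_mu(h), self.fc_logvar(h)
        # Reparameterization trick: sample z while keeping gradients flowing.
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        return self.decoder(z), mu, logvar

# Once trained, new samples come from decoding random latent vectors:
vae = VAE()
with torch.no_grad():
    new_images = vae.decoder(torch.randn(4, 32))  # 4 samples from the latent space
print(new_images.shape)  # torch.Size([4, 784])
```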
So then what is Diffusion?
When we map words and sentences to images, the images that come out are pretty fuzzy and blah. That is what diffusion helps with. Diffusion models are trained to denoise images: they take a noisy image and return a crisp one by removing the Gaussian noise. So the diffusion model is just a step that denoises the images generated from the text.
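Here is a small NumPy sketch of the idea behind diffusion training, the forward “noising” process from the DDPM family of models; the schedule value is an illustrative assumption:

```python
# The diffusion idea in NumPy. The forward process gradually mixes an
# image with Gaussian noise; the network is trained to predict that
# noise so it can be removed. All values here are illustrative.
import numpy as np

rng = np.random.default_rng(0)
image = rng.random((64, 64, 3))  # stand-in for a training image scaled to [0, 1]

def add_noise(x0, alpha_bar):
    """Forward process: x_t = sqrt(alpha_bar) * x_0 + sqrt(1 - alpha_bar) * noise."""
    noise = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * noise, noise

# alpha_bar near 1 keeps mostly image; alpha_bar near 0 leaves mostly noise.
x_noisy, true_noise = add_noise(image, alpha_bar=0.5)

# Training objective (conceptually): show the network x_noisy and the
# timestep, and have it predict true_noise.
```

Generation runs this in reverse: start from pure Gaussian noise and let the trained network strip the noise away step by step until a crisp image remains.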
Prompting Guide
The art is in the prompt! If you know how to write great prompts, you enable these models to easily make that translation from text to image. So when you’re prompting a diffusion model, you’ll want to be direct and descriptive, and using the right vocabulary matters. Understanding the technical vocabulary of lighting, photography, and painting helps. The example from Dall-E’s website shows a great prompt leading to a pretty stunning image.
Here is an example of one of my favorite prompts and the response image from Dall-E 3.
Prompt: draw an icy landscape under a starlit sky, where a magnificent, beautiful mermaid is in the center of the scene. The mermaid is sitting near a waterfall that flows over a cliff. She is glowing and glittering, and the flowers in her hair contrast starkly with the cold icy surroundings.
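If you’d rather script this than use the chat interface, here is a sketch using OpenAI’s Python SDK. It assumes the openai package is installed and an OPENAI_API_KEY environment variable is set:

```python
# Sending the same prompt to Dall-E 3 through OpenAI's Python SDK.
# Assumes: pip install openai, and OPENAI_API_KEY set in the environment.
from openai import OpenAI

client = OpenAI()

prompt = (
    "draw an icy landscape under a starlit sky, where a magnificent, "
    "beautiful mermaid is in the center of the scene. The mermaid is "
    "sitting near a waterfall that flows over a cliff. She is glowing "
    "and glittering, and the flowers in her hair contrast starkly with "
    "the cold icy surroundings."
)

response = client.images.generate(
    model="dall-e-3",
    prompt=prompt,
    size="1024x1024",
    n=1,  # dall-e-3 generates one image per request
)
print(response.data[0].url)  # a temporary URL for the generated image
```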
The Politics
These Generative AI models create content (images and text) based on two things:
- The common patterns detected from the data it has been provided
- The prompt the user created
So who owns the output? And who is the real artist here? The answer is it depends, but it’s also pretty simple, really. Here is my hot take as someone working in the AI industry.
Art that has been randomly generated by AI based on a large corpus of data or a non-specific prompt belongs to the masses. In the context of copyrighting this art or attributing royalties, the “art” that these foundational models generate is no different from a person seeing art all their lives and then creating art of their own. Their art might be loosely inspired by some paintings that stood out to them in a gallery once, or by all the millions of pictures they’ve seen over their lifetime. But those artists aren’t claiming ownership of the result this person just produced.
Also, this general art inspired by the average of a million works of art is going to be pretty unoriginal until a true artist starts to add their own personal touch. So art that has been specifically generated based on a certain existing artist’s style or body of work obviously needs to be attributed to that artist.