You upload a photo, wait a bit, get a result back. That is the whole user-facing surface of an AI clothes removal tool, which is part of why people assume there is one big neural network behind the button doing all the work. There isn't. There is a chain of separate models, each doing a narrow job, and how well those models cooperate is most of what determines whether the output looks plausible or surreal.
I want to walk through what is actually happening in there, because I think the architecture details explain a lot of the visible failure modes, and they tell you more about a tool's quality than any marketing page does.
The pipeline, end to end
A typical modern pipeline has four logical stages:
- Body and clothing segmentation. The model identifies which pixels belong to the body, which belong to clothing, and which belong to background.
- Pose and shape estimation. A second model estimates the underlying body geometry, joint positions, silhouette, often a coarse 3D body mesh.
- Inpainting. The clothed regions are masked out and a generative model fills them in, conditioned on the surrounding skin tones, lighting, and the estimated body shape.
- Refinement. A final pass cleans up boundaries, harmonizes lighting, and resolves artifacts where the inpainted region meets the original photo.
Single-pass models exist. The pitch is that doing everything in one forward pass should be cleaner, and the inference is faster. In practice they almost always look worse than staged pipelines, and the reason is mundane: training signal. Each stage of a pipeline can be supervised on data that matches its specific objective. A monolith has to learn segmentation, pose estimation, and inpainting jointly from a much weaker signal, and you can see the result in the output.
Segmentation: the foundation everything else stands on
Segmentation is the boring part. It is also the part that caps everything else. If the segmenter thinks part of an arm is clothing, that arm gets erased. If it misses a thin strap or a hair tie, the strap gets baked into the final output as some weird piece of skin. You can usually spot a bad segmenter in the result before you spot anything else.
The current standard for human segmentation is a transformer-based architecture trained on datasets like ATR, LIP, and CIHP. These datasets contain hundreds of thousands of human images with pixel-level labels for around twenty body parts and clothing categories. Transformer segmenters such as Mask2Former and SegFormer have largely displaced older convolutional approaches like DeepLab, mainly because attention captures the long-range dependencies needed to reason about clothing that wraps around the body.
Why segmentation fails
The hard cases are predictable. Loose flowing fabric blurs the boundary between clothing and background. Sheer or transparent fabric confuses the model into labeling skin as clothing. Heavy patterns can cause the segmenter to over-segment a single garment into several pieces. Partial occlusion, an arm crossed over a torso, a hand holding a phone, is a known weak point because the training data underrepresents these poses.
GANs versus diffusion: the architecture question
The inpainting stage is where most of the visible quality comes from, and it is also where the architecture choice matters most. Two families dominate.
GAN-based inpainting
Generative adversarial networks (GANs), introduced by Goodfellow and colleagues in 2014, train a generator against a discriminator. The generator tries to produce images that fool the discriminator into thinking they are real; the discriminator gets better at spotting fakes; both improve over time.
For inpainting specifically, architectures like StyleGAN2 and its descendants produce remarkably sharp, high-frequency detail, skin texture, fine shadows, and convincing local lighting. The original DeepNude tool from 2019 was a pix2pixHD GAN, and the visual signature of GAN-based inpainting (sharp but sometimes locally inconsistent) is still recognizable.
The downside of GANs is mode collapse and global inconsistency. A GAN can produce a beautiful local patch that does not quite agree with the rest of the body in terms of lighting direction or skin tone. They are also notoriously hard to train and sensitive to hyperparameters.
Diffusion-based inpainting
Denoising diffusion probabilistic models (DDPM), formalized by Ho and colleagues in 2020 and scaled into Stable Diffusion in 2022, learn to reverse a gradual noising process. To inpaint, the model starts from pure noise in the masked region and runs a sequence of denoising steps conditioned on the unmasked surroundings.
Diffusion models trade a little sharpness for far better global consistency. They handle lighting harmonization more gracefully and are less prone to the locally-perfect-but-globally-wrong artifacts that GANs sometimes produce. The trade-off is compute: a diffusion forward pass is one step, but generation typically requires twenty to fifty steps, making diffusion models slower and more expensive to run.
In practice, modern systems often combine both, a diffusion model for the bulk of the inpainting and a GAN-based refiner for the final sharpening pass.
The role of training data
The quality of any model in this pipeline is bounded by the diversity of its training data. Bodies vary along many axes, age, body composition, skin tone, hair, scarring, tattoos, and a model only handles what it has seen. Underrepresentation in training data shows up as systematic failure modes: certain skin tones rendered with subtly wrong undertones, certain body shapes producing unnatural proportions, and so on.
This is also why benchmarks matter. A model that scores well on a narrow benchmark of one demographic can perform badly on a broader population, and the only way to know is to test across the actual distribution of users.
Where the technology fails
It is worth being honest about the limitations. Even a well-built pipeline struggles with:
- Loose or layered clothing. Coats, dresses, and oversized garments hide the body shape so completely that the pose estimator has very little to work with. The output becomes a guess.
- Complex poses. Crossed limbs, sitting positions, and unusual angles fall outside the bulk of the training distribution. Expect artifacts at occlusion boundaries.
- Multiple subjects. Most pipelines assume one person per image. With two or more subjects, segmentation can confuse limbs across people.
- Low resolution and heavy compression. The pipeline only has whatever signal exists in the input. JPEG artifacts, motion blur, and resolutions below 512 pixels on the long edge produce visibly worse output.
- Strong stylization. Anime, illustrations, and heavily filtered photos sit outside the realistic-photo training distribution and require purpose-trained models.
Practical takeaway
The shortest version of all of this: AI clothes removal is a chain of imperfect models, and the visible quality is set by the weakest link. Tools that look sharper usually have a better refiner. Tools that handle more poses usually have richer pose-estimation data. Tools that fail in obvious ways (extra limbs, melted boundaries) usually have a segmentation problem, not an inpainting problem.
If you want to dig further, the next post in this series covers what happens to your image after upload, how AI image tools handle privacy, and the third post compares how the leading tools in the category have evolved since the original DeepNude in 2019, in our DeepNude vs modern alternatives overview.