May 5, 2026
A Comprehensive Survey of AI/ML Models for Background Removal (Image Matting & Segmentation), 2004–2025
A definitive technical survey of 21 years of AI background removal research — from GrabCut to BiRefNet and diffusion matting — with model-by-model benchmarks, licensing guidance, and practical recommendations for 2025.
TL;DR
- For general-purpose, high-resolution background removal in 2025, BiRefNet (Zheng et al., CAAI AIR 2024) is the open-source SOTA. BRIA's RMBG-2.0, a BiRefNet fine-tune on a 15,000-image licensed corpus, is the best legally clean commercial-grade variant (90% success rate on Bria's internal benchmark, versus 85% for vanilla BiRefNet). For portraits and video, MODNet and RobustVideoMatting remain the practical defaults, while the diffusion-based DiffDIS (ICLR 2025) and SDMatte (ICCV 2025) define the new accuracy frontier.
- The field has progressed in five distinct waves: classical graph-cut/closed-form matting (2004–2010) → CNN encoder–decoder matting with trimaps (Deep Image Matting, 2017; FBA, 2020) → trimap-free saliency/portrait nets (U²-Net, MODNet, BackgroundMattingV2, 2020–2022) → high-resolution dichotomous segmentation (IS-Net/DIS5K 2022, InSPyReNet, BiRefNet, MVANet) → foundation/diffusion-prompted matting (SAM-MAM 2023, ViTMatte 2023, DiffDIS 2024, SDMatte 2025).
- Practitioner takeaway: install rembg + BiRefNet/RMBG-2.0 for production cutouts; MODNet or RVM for real-time/edge portrait video; ViTMatte or SDMatte when you have a trimap or click and need pixel-perfect hair; SAM-2 + Matting-Anything for promptable/text-controlled selection. Avoid U²-Net for new projects: it is now meaningfully behind.
Key Findings
1. The dominant 2024–2025 architecture is "Swin-Large encoder + dual-reference / multi-view decoder." BiRefNet (Swin-V1-Large, 215 M params) reaches S_m=0.898 on DIS-VD and a 6.8% average S-measure gain over the previous SOTA on DIS5K, 2.0% on HRSOD, 5.6% on COD. MVANet (CVPR 2024 Highlight, Swin-V1-Base) reaches S_m=0.905 on DIS-VD at 4.6 FPS on an RTX 3090 — twice the speed of InSPyReNet — by reformulating DIS as a multi-view perception problem.
2. Diffusion priors now lead pure accuracy. DiffDIS (ICLR 2025) repurposes Stable Diffusion 2.1's U-Net via one-step denoising and posts the highest aggregate scores on DIS5K (DIS-VD F_β^max = 0.918, MAE = 0.029). SDMatte (ICCV 2025, vivo Camera Research) grafts SD v2 onto interactive matting with point/box/mask prompts — the first diffusion model to clearly beat ViTMatte on edge fidelity. Both pay a latency cost (~0.3 s on H800 for DiffDIS; multi-second on consumer GPUs for SDMatte).
3. Commercial 2025 leaders are essentially BiRefNet derivatives or proprietary diffusion stacks. BRIA RMBG-2.0 explicitly states it is "developed on the BiRefNet architecture enhanced with our proprietary dataset and training scheme." Cloudflare Images' "remove background" feature uses BiRefNet directly through Workers AI. Photoroom built its own hybrid stack (segmentation transformer + DiT-based generative backgrounds) and open-sourced its 1.3-billion-parameter PRX text-to-image model under Apache 2.0. Adobe Firefly's Remove Background V2 API is a proprietary Sensei/Firefly model; remove.bg (acquired by Canva in February 2021) does not disclose architecture.
4. Trimap-free portrait matting plateaued; the action moved to general DIS. MODNet (AAAI 2022) at 67 FPS on a 1080 Ti and RobustVideoMatting (WACV 2022) at 76 FPS at 4K still anchor real-time portrait pipelines, with no clearly superior real-time replacement in 2024–2025; meanwhile high-quality matting research migrated to general object DIS where BiRefNet/MVANet/BEN2 compete.
5. Open-source is now plentiful but licensing is the gating factor. BiRefNet is MIT-licensed; U²-Net is Apache 2.0; MODNet/RVM are Creative Commons or research-only; RMBG-1.4/2.0 are CC-BY-NC-4.0 (commercial use requires a Bria license). For commercial work the cleanest open path today is BiRefNet weights or paid BRIA licenses.
Details
1. Early Methods & Evolution (2004–2016)
GrabCut (Rother, Kolmogorov, Blake — Microsoft Research, SIGGRAPH 2004) is the canonical pre-deep-learning interactive segmentation method. The user draws a rectangle; GrabCut models foreground and background as Gaussian mixture models, builds a Markov random field over pixel labels with a graph-cut energy, and iterates. It also introduced "border matting" for the alpha layer at object boundaries. OpenCV's grabCut() made it ubiquitous. Limitations: low-frequency color priors fail on complex backgrounds and hair.
Closed-form matting (Levin, Lischinski, Weiss, CVPR 2006/PAMI 2008) introduced the matting Laplacian under a color-line assumption — the energy minimum admits a closed-form solution given a trimap. With KNN matting (Chen et al., CVPR 2012) and information-flow matting (Aksoy et al., CVPR 2017), this family dominated alphamatting.com benchmarks until deep learning arrived.
FCN (Long, Shelhamer, Darrell, CVPR 2015) and DeepLab v1–v3+ (Chen et al., 2014–2018) introduced the encoder-decoder + atrous-convolution / ASPP recipe that remains the structural ancestor of most deep models in this report. They were trained on PASCAL/COCO classes rather than for background removal per se, but the ASPP block is reused in MODNet's e-ASPP and elsewhere.
2. CNN-Era Trimap-Based Matting (2017–2020)
Deep Image Matting (Xu, Price, Cohen, Huang, Adobe Research / UIUC, CVPR 2017) was the first end-to-end deep matting network. A VGG encoder–decoder takes the RGB image concatenated with a trimap and predicts the alpha matte; a small refinement net sharpens edges. The authors introduced the Adobe Composition-1k dataset (493 foregrounds, 1,000 test composites), which nearly every subsequent matting paper uses. The model achieved 2nd place on alphamatting.com and 1st on videomatting.com.
IndexNet Matting (Lu et al., ICCV 2019), Context-Aware Matting (Hou & Liu, ICCV 2019), and AdaMatting (Cai et al., ICCV 2019) progressively added attention, context aggregation, and trimap refinement.
FBA Matting (Forte & Pitié, 2020) introduced jointly predicting the foreground colour F, the background colour B, and the alpha α from a single network with a ResNet-50 encoder using Group Normalisation + Weight Standardisation, and a richer loss (L1 alpha + compositional + Laplacian + foreground/background). It became the long-standing alphamatting.com leader and the de facto base model for many products and rembg post-processing.
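For reference, the quantities F, B, and α that this section keeps returning to are tied together by the standard compositing equation. The block below states it per pixel i, followed by a generic L1 form of a compositional loss. The loss is written here only as an illustration of the idea; the exact formulation and weighting used by Deep Image Matting or FBA may differ.

```latex
% Compositing (matting) equation for pixel i
I_i = \alpha_i F_i + (1 - \alpha_i) B_i, \qquad \alpha_i \in [0, 1]

% Generic compositional loss over N pixels (illustrative form, hats denote predictions)
\mathcal{L}_{\mathrm{comp}} = \frac{1}{N} \sum_{i=1}^{N}
  \left\lVert I_i - \hat{\alpha}_i \hat{F}_i - (1 - \hat{\alpha}_i) \hat{B}_i \right\rVert_1
```

Trimap-based methods solve for α only in the unknown band of the trimap, where F and B are both plausible; everywhere else α is fixed to 0 or 1.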
3. Trimap-Free Portrait Segmentation & Matting (2018–2022)
PortraitNet (Zhang, Dong, Li, Yang — CAD&Graphics 2019) — a lightweight U-shape network with two auxiliary losses (boundary loss + consistency-constraint loss). Hits 30 FPS for 224×224 on iPhone 7 — the first real-time mobile portrait segmenter to gain wide adoption for video chat backgrounds.
U²-Net (Qin et al., Pattern Recognition 2020, Best Paper Award 2022) — a "U-Net of U-Nets" using nested ReSidual U-blocks (RSU). Two variants (176.3 MB and 4.7 MB). U²-Net was a watershed: it powers Pixelmator Pro, the original rembg library, and dozens of mobile cutout apps. Trained for salient object detection on DUTS, then re-purposed by the community for portraits and arbitrary objects.
Background Matting V1 (Sengupta et al., CVPR 2020) required the user to capture an additional clean background image; the network solves for the alpha matte by comparing the input frame against that captured background.
BackgroundMattingV2 (Lin, Ryabtsev, Sengupta, Curless, Seitz, Kemelmacher-Shlizerman — CVPR 2021) keeps the captured-background trick but uses a base/refinement architecture: a low-resolution base net produces α_raw and an error map, then a refinement net only re-processes selective high-error patches. Real-time at 4K @ 30 FPS and HD @ 60 FPS on RTX 2080 Ti. Introduced VideoMatte240K and PhotoMatte13K/85 datasets — the first large-scale matting datasets that aren't synthetic Adobe-1k composites.
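The selective-refinement step described above can be sketched as a top-k selection over the base network's error map. This is a minimal illustration, not the authors' implementation: the function name, the toy error map, and the choice of k are all hypothetical, and the real model works on tensors and crops image patches around each selected location.

```python
from typing import List, Tuple


def select_refinement_patches(error_map: List[List[float]], k: int) -> List[Tuple[int, int]]:
    """Return (row, col) indices of the k cells with the highest predicted error."""
    cells = [
        (err, r, c)
        for r, row in enumerate(error_map)
        for c, err in enumerate(row)
    ]
    # Sort by descending error; the sort is stable, so ties keep row-major order.
    cells.sort(key=lambda t: -t[0])
    return [(r, c) for _, r, c in cells[:k]]


# Toy 3x4 error map standing in for the output of a low-resolution base network.
coarse_error = [
    [0.01, 0.02, 0.40, 0.03],
    [0.00, 0.55, 0.35, 0.02],
    [0.01, 0.03, 0.04, 0.01],
]
patches = select_refinement_patches(coarse_error, k=3)
print(patches)  # [(1, 1), (0, 2), (1, 2)]
```

Only the selected locations would be passed to the refinement network, which is what keeps the cost of 4K inference bounded regardless of image size.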
MODNet (Ke, Sun, Li, Yan, Lau, AAAI 2022) is the dominant trimap-free portrait matting model. It is single-stage and takes a single RGB input. It decomposes matting into three sub-objectives (semantic estimation, detail prediction, and semantic-detail fusion) that are trained jointly. It introduces e-ASPP (efficient atrous spatial pyramid pooling) and a self-supervised sub-objectives consistency (SOC) strategy for domain adaptation, and runs at 67 FPS on a 1080 Ti. The authors also released the PPM-100 photographic-portrait benchmark.
RobustVideoMatting / RVM (Lin, Yang, Saleemi, Sengupta — WACV 2022) drops the captured-background requirement entirely. A recurrent ConvGRU architecture exploits temporal context; trained jointly on matting + segmentation losses. 4K at 76 FPS, HD at 104 FPS on a 1080 Ti. Still the default real-time human video matting baseline in 2025; integrated in Replicate, OBS plugins, and Zoom-style virtual-background pipelines.
4. Salient Object Detection → Dichotomous Image Segmentation (2022–)
IS-Net (Qin, Dai, Hu, Fan, Shao, Van Gool — ECCV 2022) introduced the Dichotomous Image Segmentation (DIS) task and the DIS5K dataset: 5,470 high-resolution (2K–4K+) images of camouflaged, salient, or "meticulous" objects with extremely fine-grained labels. Also introduced the HCE (Human Correction Effort) metric — approximating the number of mouse clicks needed to fix a mask. IS-Net itself is a U²-Net backbone with intermediate supervision at both feature and mask level. Reference DIS-VD scores: F_β^max=.791, S_m=.813, MAE=.074, HCE=1116.
InSPyReNet (Kim et al., ACCV 2022) — Inverse Saliency Pyramid Reconstruction Network. Builds a strict image pyramid of saliency maps and blends LR + HR pyramids without requiring HR training data. Released as the popular transparent-background PyPI package; ~2.2 FPS on RTX 3090. DIS-VD numbers (3rd-party reproduction): F_β^max=0.889, S_m=0.900.
BiRefNet (Zheng, Gao, Fan, Liu, Laaksonen, Ouyang, Sebe — CAAI Artificial Intelligence Research 2024, arXiv:2401.03407) — currently the most influential open-source background-removal model. The Bilateral Reference idea: an inward reference (hierarchical patches of the original-resolution image fed into the decoder) plus an outward reference (gradient maps of the predicted mask under auxiliary gradient supervision). Architecture: Swin-V1-Large encoder (215 M params) + Localization Module + Reconstruction Module. Hybrid loss: BCE + IoU + SSIM + CE (weights 30/0.5/10/5) plus auxiliary gradient BCE and a 219-class auxiliary head. Reported gains over previous SOTA: 6.8% average S_m on DIS5K, 2.0% on HRSOD, 5.6% on COD. DIS-VD: F_β^max=.891, S_m=.898, MAE=.038, HCE=989. ~12 FPS on A100. BiRefNet's model zoo includes general-purpose, portrait, matting, matting-HR (2048×2048), matting-lite, DIS, HRSOD, and COD weights — the matting variants are trimap-free.
MVANet (Yu, Zhao, Pang, Zhang, Lu — CVPR 2024 Highlight, arXiv:2404.07445) — Multi-view Aggregation Network. Reformulates DIS as a multi-view problem: a downsampled "distant view" plus non-overlapping local "close-up" patches share a single Swin-B encoder and a unified decoder with multi-view complementary localization (MCLM) and refinement (MCRM) modules. DIS-VD: F_β^max=0.913, S_m=0.905, MAE=0.036; 4.6 FPS on RTX 3090 — twice as fast as InSPyReNet at higher accuracy.
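The multi-view input described for MVANet can be illustrated by computing the patch layout alone. The sketch below assumes a 1024×1024 input and a 2×2 grid, which are illustrative choices here rather than values taken from the paper's code; the function names are hypothetical.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (top, left, bottom, right), half-open ranges


def close_up_boxes(height: int, width: int, grid: int = 2) -> List[Box]:
    """Split an image into grid x grid non-overlapping close-up patches."""
    boxes = []
    for i in range(grid):
        for j in range(grid):
            top, bottom = i * height // grid, (i + 1) * height // grid
            left, right = j * width // grid, (j + 1) * width // grid
            boxes.append((top, left, bottom, right))
    return boxes


def distant_view_size(height: int, width: int, grid: int = 2) -> Tuple[int, int]:
    """Size of the downsampled global view, chosen to match one patch."""
    return height // grid, width // grid


boxes = close_up_boxes(1024, 1024)
print(boxes)
print(distant_view_size(1024, 1024))
```

Because the distant view and every close-up patch end up the same size, all views can be batched through one shared encoder, which is the property the paragraph above attributes to the design.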
BEN / BEN2 (PramaLLC, arXiv:2501.06230) — Background Erase Network introduces Confidence-Guided Matting (CGM): a lightweight refiner targets pixels where the base model's confidence is low. BEN_Base+Refiner on DIS-VD: F_β^ω=0.8956, S_m=0.9166, MAE=0.0270, Dice=0.8989 — the highest reported DIS-VD weighted-F to date among open-source non-diffusion models. BEN2 (released 2025) was trained on DIS5k + a 22K proprietary set.
DiffDIS (Yu et al., ICLR 2025, arXiv:2410.10105) — Probes Stable Diffusion 2.1's U-Net (initialized from SD-Turbo) with one-step deterministic denoising plus an auxiliary edge-generation task. DIS-VD: F_β^max=0.918, F_β^ω=0.888, S_m=0.904, MAE=0.029 — leads pure accuracy on DIS5K. Inference 0.33 s/image on H800.
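MVANet, BiRefNet, and BEN2 are essentially co-equal SOTA in late 2025. DiffDIS posts the best DIS-VD F_β^max and MAE among the venue-published models, at higher latency; BEN_Base+Refiner's arXiv-reported figures are marginally better still on every metric in the Section 8 leaderboard, subject to the reproducibility caveats at the end of this report.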
5. Transformer-Based & Foundation Matting (2023–)
Segment Anything (Kirillov et al., Meta, ICCV 2023) — SAM, trained on 1.1 B masks. Promptable (point/box/mask) but produces hard binary masks, not alpha mattes; raw SAM is unsuitable for hair-level matting.
ViTMatte (Yao, Wang, Yang, Wang — Information Fusion, March 2024, arXiv:2305.15272) — first to plug a pretrained plain ViT (DINO-pretrained) into matting via a hybrid local-window/global-attention scheme plus a lightweight convolution detail-capture module. SOTA on Composition-1k and Distinctions-646; integrated into HuggingFace Transformers as VitMatteForImageMatting.
Matting Anything Model / MAM (Li, Jain, Shi — SHI-Labs, arXiv:2306.05399, June 2023) — bolts a 2.7 M-parameter Mask-to-Matte (M2M) module onto SAM's frozen ViT-H feature maps with iterative refinement. Single model handles semantic, instance, and referring matting from box/point/text prompts.
Segment-and-Matte-Anything / SAMA (AAAI 2026, arXiv:2601.12147) — extends SAM with a Multi-View Localization Encoder and a Localization Adapter, plus dual prediction heads for segmentation and matting, achieving SOTA across both tasks simultaneously.
SDMatte (Huang et al., vivo Camera Research, ICCV 2025, arXiv:2508.00443) — first diffusion-grafted interactive matting model. Adapts Stable Diffusion v2's U-Net with visual-prompt-driven cross-attention, coordinate embeddings of point/box prompts, opacity embeddings, and masked self-attention. The diffusion prior gives noticeably better hair/translucency than ViTMatte.
SAM 2 (Ravi et al., Meta, August 2024, arXiv:2408.00714) extends SAM to video with a streaming-memory transformer and the SA-V dataset (35.5 M masks across 50.9 K videos). It is 6× faster than SAM-1 on images and needs 3× fewer interactions for video. It is not a matting model itself, but it is the dominant pre-segmentation step before applying ViTMatte/SDMatte/MAM.
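Since SAM and SAM 2 output hard binary masks while trimap-based matters such as ViTMatte expect a trimap, a glue step is needed between the two. One common approach, sketched here as an assumption rather than as any project's official pipeline, is to mark a band around the mask boundary as unknown. The function name, the 0/128/255 encoding, and the square neighbourhood are illustrative; production code would typically use morphological erosion and dilation from an image library.

```python
from typing import List


def mask_to_trimap(mask: List[List[int]], radius: int = 1) -> List[List[int]]:
    """Convert a binary mask (0/1) into a trimap: 0 = background,
    255 = foreground, 128 = unknown band near the mask boundary."""
    h, w = len(mask), len(mask[0])
    trimap = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            # Collect labels in the (2*radius+1)^2 window, clipped at the borders.
            window = [
                mask[yy][xx]
                for yy in range(max(0, y - radius), min(h, y + radius + 1))
                for xx in range(max(0, x - radius), min(w, x + radius + 1))
            ]
            if all(window):
                trimap[y][x] = 255   # window is entirely foreground
            elif not any(window):
                trimap[y][x] = 0     # window is entirely background
            else:
                trimap[y][x] = 128   # mixed labels: leave for the matting model
    return trimap


# 7x7 toy mask with a 3x3 foreground square in the middle.
toy_mask = [[1 if 2 <= y <= 4 and 2 <= x <= 4 else 0 for x in range(7)] for y in range(7)]
toy_trimap = mask_to_trimap(toy_mask, radius=1)
for row in toy_trimap:
    print(row)
```

A larger radius gives the matting model more room to recover hair and translucency but also more pixels to solve for; the right value depends on resolution and subject.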
6. Proprietary Commercial Models
BRIA RMBG-1.4 — IS-Net retrained on 12,000 manually labeled, fully licensed images. Still the most-downloaded HuggingFace background-remover (15.9 million lifetime downloads as of May 2026). CC-BY-NC-4.0.
BRIA RMBG-2.0 (released Nov 2024) — explicitly "developed on the BiRefNet architecture enhanced with our proprietary dataset and training scheme" with 15,000 manually labeled licensed images. 7.2 million lifetime HuggingFace downloads as of May 2026. Bria's own benchmark reports 90% "good/very good" rate vs 85% for vanilla BiRefNet, vs ~74% for RMBG-1.4. Outputs non-binary alpha (256 levels) for natural compositing. CC-BY-NC-4.0; commercial license required from Bria.
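To make concrete why a 256-level alpha matters for compositing, the sketch below blends one foreground pixel over a new background using an 8-bit alpha value. The function name and the sample colours are illustrative only and are not part of any Bria API.

```python
from typing import Tuple

RGB = Tuple[int, int, int]


def composite_pixel(fg: RGB, bg: RGB, alpha_8bit: int) -> RGB:
    """Blend fg over bg with an 8-bit alpha (0 = fully background, 255 = fully foreground)."""
    a = alpha_8bit / 255.0
    return tuple(round(a * f + (1.0 - a) * b) for f, b in zip(fg, bg))


red, blue = (255, 0, 0), (0, 0, 255)
print(composite_pixel(red, blue, 255))  # (255, 0, 0)
print(composite_pixel(red, blue, 0))    # (0, 0, 255)
print(composite_pixel(red, blue, 128))  # (128, 0, 127)
```

A binary mask can only produce the first two outputs, which is what causes hard, haloed edges around hair and semi-transparent regions; intermediate alpha values give the blended third case.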
Photoroom V4 (announced 2024) — 4th-generation segmentation transformer combined with the Photoroom Experimental (PRX) generative diffusion-transformer for backgrounds. Photoroom open-sourced PRX (1.3 B parameters, Apache 2.0, trained on 32 NVIDIA H200 GPUs) as Photoroom/prx-1024-t2i-beta on HuggingFace but kept the segmentation backbone proprietary.
Adobe Firefly / Photoshop "Remove Background" — proprietary Sensei/Firefly model; not disclosed publicly. Widely believed to be a Mask2Former/SAM-style transformer with custom matting head.
Cloudflare Images uses BiRefNet directly through Workers AI — officially disclosed in Cloudflare's blog and Images docs.
remove.bg (Kaleido AI, acquired by Canva on 23 February 2021) — does not publicly disclose architecture; community reverse-engineering points to a heavily customized U²-Net/IS-Net descendant fine-tuned per category (people, cars, products, animals).
Canva's BG Remover is built on remove.bg (same parent company).
7. Datasets & Benchmarks
- Adobe Composition-1k (Xu et al., 2017), with 493 foregrounds and 1,000 test composites; the trimap-matting standard.
- alphamatting.com (Rhemann et al., 2009) — 27 small-scale natural images; saturated.
- DUTS (Wang et al., 2017) — 15,572 saliency images; SOD standard.
- VideoMatte240K / PhotoMatte13K/85 (BackgroundMattingV2, CVPR 2021).
- PPM-100 (MODNet, 2022) — photographic portrait matting.
- AIM-500 (Li et al., 2021) — 500 natural automatic-image-matting test images.
- AM-2k (Li et al., GFM, IJCV 2022) — Animal Matting 2000.
- P3M-10k (Li, Ma, Zhang et al., ACM MM 2021 / IJCV 2023) — 10,421 face-blurred portraits, the only privacy-preserving portrait-matting benchmark.
- Distinctions-646 (Qiao et al., CVPR 2020).
- DIS5K (Qin et al., ECCV 2022) — 5,470 HR images, 4 graded test sets DIS-TE1–4 + DIS-VD; the dominant general-DIS benchmark.
- HRSOD / UHRSD / HRS10K — high-resolution salient-object detection benchmarks.
- HIM2K (Sun et al., ECCV 2022) — Human Image Matting at 2K.
- SA-V (SAM 2) — 50.9K videos, 35.5M masks.
8. Quantitative Leaderboard on DIS5K
DIS-VD (1024×1024 input):
| Model | F_β^max ↑ | F_β^ω ↑ | E_φ^m ↑ | S_m ↑ | MAE ↓ | HCE ↓ |
|---|---|---|---|---|---|---|
| IS-Net (ECCV 2022) | .791 | .717 | .856 | .813 | .074 | 1116 |
| InSPyReNet (ACCV 2022) | .889 | .834 | .914 | .900 | .042 | – |
| BiRefNet-SwinL (2024) | .891 | .854 | .931 | .898 | .038 | 989 |
| MVANet (CVPR 2024) | .913 | .856 | .938 | .905 | .036 | – |
| DiffDIS (ICLR 2025) | .918 | .888 | .948 | .904 | .029 | – |
| BEN_Base+Refiner (2025) | .919 | .896 | .958 | .917 | .027 | – |
Backbones: IS-Net = U²-Net; InSPyReNet/MVANet/BEN = Swin-V1-Base; BiRefNet = Swin-V1-Large; DiffDIS = Stable Diffusion 2.1 U-Net.
Speed (best-effort, different GPUs): RVM 76 FPS @ 4K (1080 Ti), MODNet 67 FPS (1080 Ti), MVANet 4.6 FPS (RTX 3090), InSPyReNet 2.2 FPS (RTX 3090), BiRefNet-SwinL ~12 FPS (A100), DiffDIS ~3 FPS-equivalent (H800).
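Two of the leaderboard columns are simple enough to sketch directly. The code below computes MAE and the maximum F-measure over thresholds on flattened pixel lists, assuming the β² = 0.3 convention that is standard in the salient-object-detection literature. It is a simplified illustration: official evaluation toolkits average per image over a dataset and differ in threshold handling, and the weighted F-measure, E-measure, S-measure, and HCE are not covered here.

```python
from typing import List


def mae(pred: List[float], gt: List[int]) -> float:
    """Mean absolute error between a soft prediction in [0, 1] and a binary ground truth."""
    return sum(abs(p - g) for p, g in zip(pred, gt)) / len(pred)


def f_beta(pred: List[float], gt: List[int], threshold: float, beta_sq: float = 0.3) -> float:
    """F-measure of the prediction binarized at the given threshold."""
    binar = [p >= threshold for p in pred]
    tp = sum(1 for b, g in zip(binar, gt) if b and g == 1)
    fp = sum(1 for b, g in zip(binar, gt) if b and g == 0)
    fn = sum(1 for b, g in zip(binar, gt) if not b and g == 1)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return (1 + beta_sq) * precision * recall / (beta_sq * precision + recall)


def max_f_beta(pred: List[float], gt: List[int], steps: int = 255) -> float:
    """Best F-measure over evenly spaced thresholds in (0, 1]."""
    return max(f_beta(pred, gt, t / steps) for t in range(1, steps + 1))


# Four-pixel toy example: two foreground pixels, two background pixels.
toy_pred = [0.9, 0.8, 0.3, 0.1]
toy_gt = [1, 1, 0, 0]
print(mae(toy_pred, toy_gt))
print(max_f_beta(toy_pred, toy_gt))
```

The toy prediction separates cleanly at any threshold between 0.3 and 0.8, so its maximum F-measure is perfect even though its MAE is not zero; this is why the table reports both.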
9. Best Model by Use Case, Late 2025
| Use case | Recommended model | Why |
|---|---|---|
| General product/e-commerce cutouts (open source) | BiRefNet-Matting / BiRefNet-HR | MIT license, top-tier hair/edge quality, extensive HF/ComfyUI tooling, runs in ~80 ms on A100 |
| General cutouts (commercial-clean license) | BRIA RMBG-2.0 API | BiRefNet architecture + 15K licensed-image fine-tune; non-binary alpha; 90% "good" on Bria benchmark; legally indemnified |
| Pixel-perfect hair / fine details (with prompt) | SDMatte (ICCV 2025) or ViTMatte | Diffusion or ViT prior preserves strand-level translucency given trimap/click |
| Promptable / text-driven selection | SAM 2 → MAM (Matting Anything) | Box/point/text prompts; cleanly separates instance selection from matting |
| Real-time portrait video | RobustVideoMatting (RVM) | 4K @ 76 FPS, temporally coherent, no auxiliary input |
| Real-time portrait image (mobile) | MODNet | 67 FPS on 1080 Ti, e-ASPP module, fits ARM NEON / Core ML |
| Edge / browser deployment | U²-Netp (4.7 MB) or BiRefNet-lite/PVTv2-b0 (11 MB) or MobileSAM | Fits in WebAssembly / mobile RAM; runs via ONNX Runtime in the browser |
| Highest-accuracy single-image (research) | DiffDIS | Top of DIS5K leaderboard at the cost of latency |
| Anime / illustration cutouts | isnet-anime (in rembg) or BiRefNet anime fine-tunes | Trained on stylized data |
10. Open-Source Availability
- HuggingFace: ZhengPeng7/BiRefNet (and BiRefNet-DIS5K, BiRefNet-portrait, BiRefNet-matting variants); briaai/RMBG-1.4 and briaai/RMBG-2.0; PramaLLC/BEN2; nielsr/vitmatte-base-composition-1k; SHI-Labs/Matting-Anything; Photoroom/prx-1024-t2i-beta (background generation, not removal).
- GitHub: xuebinqin/U-2-Net; xuebinqin/DIS (IS-Net); ZHKKKe/MODNet and ZHKKKe/PPM; PeterL1n/BackgroundMattingV2 and PeterL1n/RobustVideoMatting; MarcoForte/FBA_Matting; hustvl/ViTMatte; ZhengPeng7/BiRefNet; qianyu-dlut/MVANet and qianyu-dlut/DiffDIS; vivoCameraResearch/SDMatte; plemeri/InSPyReNet; PramaLLC/BEN2; SHI-Labs/Matting-Anything.
- Convenience wrappers: rembg (CLI/Python/Docker, 22.5K GitHub stars as of April 2025, supports U²-Net, IS-Net, BiRefNet, BRIA-RMBG, SAM); 1038lab/ComfyUI-RMBG (RMBG-2.0, INSPYRENET, BEN, BEN2, BiRefNet, SDMatte, SAM/SAM2/SAM3, GroundingDINO); transparent-background (PyPI wrapper for InSPyReNet).
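As a starting point for the rembg route, a minimal sketch might look like the following. It assumes rembg's `new_session` and `remove` functions and the `birefnet-general` model name; both should be checked against the installed rembg version, since available model names vary by release. The helper names and the output-naming scheme are hypothetical.

```python
from pathlib import Path


def cutout_path(src: str) -> Path:
    """Derive the output path; PNG is used because it stores an alpha channel."""
    p = Path(src)
    return p.with_name(p.stem + "_cutout.png")


def remove_background(src: str, model_name: str = "birefnet-general") -> Path:
    """Write a transparent-background PNG next to the source image."""
    # Imported lazily so cutout_path() works even where rembg is not installed.
    from rembg import new_session, remove

    # Assumption: the model name is supported by the installed rembg release;
    # weights are downloaded on first use.
    session = new_session(model_name)
    out = cutout_path(src)
    out.write_bytes(remove(Path(src).read_bytes(), session=session))
    return out


print(cutout_path("shots/shoe.jpg").name)  # shoe_cutout.png
```

For batch jobs, create the session once and reuse it across images rather than calling `new_session` per file, since model loading dominates the cost of a single call.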
11. Commercial Tools and What Powers Them
| Tool | Architecture (disclosed or strongly indicated) |
|---|---|
| Cloudflare Images "remove background" | BiRefNet via Workers AI (officially disclosed in Cloudflare docs) |
| BRIA RMBG-2.0 API / Replicate | BiRefNet architecture + proprietary 15K licensed dataset |
| Photoroom V4 | Proprietary segmentation transformer + open-sourced PRX 1.3B DiT for generative BG |
| remove.bg (Kaleido AI / Canva, since Feb 2021) | Proprietary; widely understood to be U²-Net/IS-Net descendant fine-tuned per category |
| Canva BG Remover | Same as remove.bg (parent: Canva) |
| Adobe Firefly / Photoshop Remove Background V2 | Proprietary Sensei model; not disclosed |
| Pixelmator Pro Magic Eraser | U²-Net (officially acknowledged in U²-Net README) |
| Apple Visual Look Up / iOS "Lift Subject" | Proprietary on-device transformer (DETR-style); not disclosed |
| Pixelcut, Bazaart, Removal.AI, Pixa | Proprietary; mix of U²-Net / IS-Net derivatives + matting refiners |
Recommendations
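1. If you are starting a new background-removal project today, default to BiRefNet (MIT license). Use the BiRefNet-Matting or BiRefNet-HR weights via HuggingFace Transformers. It will outperform U²-Net by roughly 10 S-measure points on DIS5K and is now the de facto open standard. MVANet is a reasonable alternative at comparable DIS-VD accuracy; note that its 2× speed advantage is measured against InSPyReNet, and the FPS figures quoted in this report for MVANet (RTX 3090) and BiRefNet (A100) come from different GPUs, so benchmark both on your own hardware before switching.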
2. If you ship a commercial product and cannot risk training-data provenance, use BRIA RMBG-2.0 with a paid commercial license (or the Bria API). Same architecture as BiRefNet but trained on 15,000 fully-licensed images; Bria offers IP indemnification.
3. For real-time video / WebRTC / Zoom-like virtual backgrounds, RobustVideoMatting remains the right answer in 2025 — no public successor has matched its 4K-76 FPS quality/speed point. Pair with TensorRT or CoreML for production.
4. For mobile or browser, prefer the smallest BiRefNet variant (PVT-v2-b0, 11 MB) or MobileSAM; for true offline browser-only operation use U²-Netp or BiRefNet-lite via ONNX Runtime + WASM — the architecture used by BG Remove Free.
5. For interactive product/photo workflows where users can click, adopt SAM-2 + Matting-Anything (MAM) for selection, then refine with ViTMatte or SDMatte. The two-stage "segment then matte" pattern dominates 2025 production pipelines (ComfyUI-RMBG is the canonical reference implementation).
6. Avoid building new systems on U²-Net, IS-Net, or pure DeepLab. They are cited only as baselines in 2024–2025 papers and are 5–10 S-measure points behind the BiRefNet/MVANet/BEN tier.
Decision thresholds (when to revisit): if your DIS5K-style mIoU drops below ~0.85 on internal data, retrain or fine-tune BiRefNet on your domain. If a single image requires more than ~5 manual mouse clicks of cleanup, you are below SOTA — switch model. If video flicker exceeds ~5% inter-frame mask delta, switch from per-frame matting to RVM.
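The thresholds above are easy to monitor in code. The sketch below is one possible reading of them, under stated assumptions: masks are flattened binary lists, "inter-frame mask delta" is taken to mean the fraction of pixels whose label changes between consecutive frames, and the 0.05 default mirrors the ~5% figure above. The function names are hypothetical.

```python
from typing import List


def iou(pred: List[int], gt: List[int]) -> float:
    """Intersection over union of two flattened binary masks."""
    inter = sum(1 for p, g in zip(pred, gt) if p and g)
    union = sum(1 for p, g in zip(pred, gt) if p or g)
    return 1.0 if union == 0 else inter / union


def inter_frame_delta(prev: List[int], curr: List[int]) -> float:
    """Fraction of pixels whose binary label differs between two frames."""
    changed = sum(1 for a, b in zip(prev, curr) if a != b)
    return changed / len(curr)


def needs_temporal_model(frames: List[List[int]], max_delta: float = 0.05) -> bool:
    """True if any consecutive pair of masks changes by more than max_delta."""
    deltas = [inter_frame_delta(a, b) for a, b in zip(frames, frames[1:])]
    return max(deltas, default=0.0) > max_delta


# Toy clip of 20-pixel masks: the second frame flips two pixels (10% delta).
frame_a = [1] * 10 + [0] * 10
frame_b = [1] * 9 + [0] * 1 + [1] * 1 + [0] * 9
print(iou(frame_a, frame_b))
print(needs_temporal_model([frame_a, frame_b]))  # True
print(needs_temporal_model([frame_a, frame_a]))  # False
```

On real footage a moving subject changes the mask legitimately, so the delta is best measured on static or slow shots, or after motion compensation, before concluding that the model is flickering.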
Caveats
- All "average % improvement" numbers are model-author claims computed against the best baselines they chose; cross-paper rankings can shift by 1–2 points depending on input resolution (most modern papers report at 1024×1024).
- BiRefNet's frequently-cited "8.0% S-measure improvement on DIS5K" is from an early preprint; the camera-ready CAAI AIR 2024 version revised this to 6.8%. Some secondary blogs and tool docs still propagate the 8.0% number.
- BEN2 has no peer-reviewed quantitative table — only the BEN paper (which evaluates BEN_Base / BEN_Base+Refiner on DIS-VD) is published. BEN2's claimed gains are vendor marketing.
- Remove.bg, Canva, Adobe, and Apple do not disclose architectures. All "what powers them" claims for those tools are inference based on community analysis, employee blog posts, and reverse engineering — not official statements.
- License compliance is non-trivial. RMBG-1.4 / 2.0, MODNet, and many matting datasets (Adobe Composition-1k, Distinctions-646) are non-commercial or research-only; deploying them commercially requires either training-data swap-out or explicit licensing.
- Diffusion-based matting (DiffDIS, SDMatte) is impressive but slow — 0.3–3 seconds per image on a high-end GPU. Use only when accuracy strictly dominates throughput.
- The DIS-VD numbers between papers are not always reproducible: BEN's local re-run of MVANet differs slightly from MVANet's own paper, and InSPyReNet's DIS5K numbers come from third-party reproductions because the original paper did not benchmark DIS5K.