BRepCLIP

Contrastive Multimodal Pretraining on BRep Primitives for CAD Understanding

DFKI Kaiserslautern · RPTU Kaiserslautern-Landau · MindGarage

* Equally contributing supervisors

Overview

Native CAD structure as the substrate for multimodal understanding.

Point clouds discard analytic surface types, curve primitives, and topology. BRepCLIP keeps faces and edges as first-class CAD entities, tokenizes them with separate face and edge vocabularies, and trains a transformer encoder for open-vocabulary CAD retrieval and evaluation.

400K ABC training CAD models
3 retrieval benchmarks
39 FabWave zero-shot classes
15K generation-evaluation samples

Core Contributions

What BRepCLIP brings

Native BRep Encoder

Represents CAD models through typed face and edge primitives rather than unordered points.

Hybrid Dual-dVAE

Uses separate codebooks for surface geometry and curve geometry; see the quantization sketch after this list.

Multimodal Alignment

Aligns BRep geometry with text descriptions and multiview image renderings.

BRepCLIP-Score

Provides a structure-aware similarity metric for text- and image-conditioned CAD generation.
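As a rough illustration of the dual-codebook idea behind the Hybrid Dual-dVAE, the sketch below quantizes per-primitive feature vectors against separate face and edge vocabularies. The codebook sizes, feature dimension, and random stand-in features are all assumptions for illustration, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 64                                        # latent feature dimension (assumed)
face_codebook = rng.normal(size=(512, D))     # surface-geometry vocabulary (assumed size)
edge_codebook = rng.normal(size=(256, D))     # curve-geometry vocabulary (assumed size)

def quantize(features, codebook):
    """Map each feature vector to the index of its nearest codebook entry."""
    # Squared Euclidean distances between every feature and every code vector.
    d2 = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return d2.argmin(axis=1)

faces = rng.normal(size=(10, D))              # stand-ins for encoded face primitives
edges = rng.normal(size=(24, D))              # stand-ins for encoded edge primitives

face_tokens = quantize(faces, face_codebook)  # discrete face token ids
edge_tokens = quantize(edges, edge_codebook)  # discrete edge token ids
print(face_tokens.shape, edge_tokens.shape)   # (10,), (24,)
```

Keeping the two vocabularies separate lets surface-type and curve-type structure stay distinguishable in the resulting token sequences.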

Pipeline

How BRepCLIP works

The pipeline is intentionally split into two stages: primitive tokenization first, followed by contrastive alignment with text and image encoders.

Stage 1

Hybrid face-edge tokenization

Stage 1 face and edge dVAE tokenization pipeline
Stage 2

Contrastive BRep-text-image alignment

Stage 2 BRepCLIP contrastive alignment architecture
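A minimal sketch of the Stage 2 objective, assuming a CLIP-style symmetric InfoNCE loss over in-batch BRep-text pairs; the same form would pair BRep embeddings with image embeddings. Batch size, embedding dimension, and the fixed temperature are illustrative assumptions.

```python
import numpy as np

def log_softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def clip_loss(brep_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE: matching BRep-text pairs sit on the diagonal."""
    brep_emb = brep_emb / np.linalg.norm(brep_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = brep_emb @ text_emb.T / temperature                 # (B, B) similarities
    diag = np.arange(len(logits))
    loss_b2t = -log_softmax(logits, axis=1)[diag, diag].mean()   # BRep -> text
    loss_t2b = -log_softmax(logits, axis=0)[diag, diag].mean()   # text -> BRep
    return 0.5 * (loss_b2t + loss_t2b)

rng = np.random.default_rng(0)
print(clip_loss(rng.normal(size=(8, 64)), rng.normal(size=(8, 64))))
```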

Applications

One embedding space, three CAD workflows.

BRepCLIP is trained as a representation model, but its embedding can be used directly in practical CAD workflows: retrieval, classification, and generation evaluation.

01

Text-to-CAD Retrieval

Engineers describe a target part in natural language. BRepCLIP embeds the text query and ranks native CAD models by cosine similarity in the shared embedding space.

  • Preserves hole count, fillets, chamfers, and surface type cues.
  • Works on unseen galleries such as CADParser and Automate.
  • Improves Top-1 retrieval over OpenShape on all evaluated datasets.
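The ranking step itself is plain cosine similarity in the shared space. A minimal sketch, assuming precomputed gallery embeddings from the BRep encoder and a stand-in text query embedding:

```python
import numpy as np

def retrieve_top_k(text_emb, gallery_embs, k=5):
    """Rank a gallery of CAD embeddings by cosine similarity to one text query."""
    text_emb = text_emb / np.linalg.norm(text_emb)
    gallery_embs = gallery_embs / np.linalg.norm(gallery_embs, axis=1, keepdims=True)
    sims = gallery_embs @ text_emb            # cosine similarities, shape (N,)
    top_k = np.argsort(-sims)[:k]             # indices of best-matching models
    return top_k, sims[top_k]

rng = np.random.default_rng(0)
gallery = rng.normal(size=(1000, 64))         # stand-in BRep embeddings
query = rng.normal(size=64)                   # stand-in text embedding
indices, scores = retrieve_top_k(query, gallery)
print(indices, scores)
```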
Experiment

Train on 400K ABC samples from CADCap-1M, then evaluate text queries against full CAD galleries: 91K ABC, 40K CADParser, and 65K Automate models.

8.59 ABC Top-1 · 5.00 CADParser Top-1 · 9.42 Automate Top-1
02

Zero-Shot CAD Classification

Without fine-tuning, BRepCLIP matches each CAD embedding against class-level text descriptors, making category prediction possible for new engineering repositories.

  • Transfers from ABC training to FabWave classes.
  • Uses text labels as the classifier, not a task-specific head.
  • Reaches 38.62 Top-1 and 86.71 Top-10 on FabWave.
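A minimal sketch of the zero-shot protocol, assuming each class name is wrapped in a text prompt and embedded once; the prompt template and the random stand-in embeddings are hypothetical, standing in for the trained text and BRep encoders.

```python
import numpy as np

def normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

rng = np.random.default_rng(0)
class_names = ["bracket", "bearing", "gear"]            # e.g. FabWave categories
# Stand-ins for embed_text(f"a CAD model of a {name}") per class and
# embed_brep(model) for the query; both encoders are hypothetical names.
class_embs = rng.normal(size=(len(class_names), 64))
cad_emb = rng.normal(size=64)

sims = normalize(class_embs) @ normalize(cad_emb)       # one score per class
print("predicted:", class_names[int(sims.argmax())])    # nearest text descriptor
```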
Experiment

Evaluate on FabWave after filtering invalid assets, using 4,378 CAD models across 39 engineering categories. No FabWave samples are used during training.

38.62 Top-1 · 70.28 Top-5 · 86.71 Top-10
03

CAD Generation Evaluation

BRepCLIP-Score compares a generated CAD model with its input prompt using the learned geometry-language space, so small but important engineering mistakes are penalized.

  • Detects wrong hole counts, wrong primitive type, and mismatched topology.
  • Correlates better with human and GPT judgments than CLIP score.
  • Provides a CAD-aware alternative to image-only and point-distance metrics.
Experiment

Test score sensitivity on 10K CAD models with original captions, mildly corrupted captions, and fully mismatched captions; benchmark text-to-CAD methods on 15K ABC examples.

104.17% mismatch drop · 0.61 ground-truth score · 0.35 CADFusion score

Experiment

Quantitative results

BRepCLIP is evaluated on text-to-CAD retrieval, zero-shot CAD classification, and CAD generation scoring. Retrieval CD values are scaled by 10³.

Text-to-CAD Retrieval

Given a text query, models retrieve the matching CAD object from a full gallery using cosine similarity in the shared embedding space.

Method        ABC Top-1   ABC Top-5   ABC Top-20   CADParser Top-1   Automate Top-1   CD ↓
Point-BERT    2.60        9.36        22.72        1.10              0.91             61.56
PointNet      3.31        12.07       29.60        0.40              3.33             62.27
PointMLP      0.90        3.50        9.50         1.10              1.02             68.43
BRepEncoder   4.30        16.30       33.90        2.10              4.82             61.11
ULIP          2.30        4.00        12.20        0.70              0.92             63.48
OpenShape     6.12        18.17       34.36        4.10              7.60             71.63
BRepCLIP      8.59        24.52       47.89        5.00              9.42             58.16

Evaluation

BRepCLIP-Score

BRepCLIP-Score evaluates whether generated CAD geometry matches a text prompt using the learned BRep embedding, making the metric sensitive to topology, hole counts, surface types, and edge structure.

CD

Geometry distance

Chamfer Distance is computed between point samples from the generated CAD model and the target CAD geometry. Lower values indicate closer global shape reconstruction.
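One common form of the metric as a sketch; conventions vary (squared vs. unsquared distances, sample counts), so the choices below are assumptions rather than the paper's exact setup.

```python
import numpy as np

def chamfer_distance(p, q):
    """Symmetric mean nearest-neighbor squared distance between point sets."""
    d2 = ((p[:, None, :] - q[None, :, :]) ** 2).sum(axis=-1)  # (|P|, |Q|) pairwise
    return d2.min(axis=1).mean() + d2.min(axis=0).mean()

rng = np.random.default_rng(0)
generated = rng.uniform(size=(2048, 3))   # points sampled from the generated model
target = rng.uniform(size=(2048, 3))      # points sampled from the target geometry
print(chamfer_distance(generated, target))
```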

CLIP Score

Image-text similarity

We render multiview images of each generated CAD model and compare them with the prompt using CLIP image and text embeddings. This captures visual alignment, but not native BRep structure.
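A sketch of the multiview scoring rule, assuming the CLIP image embeddings of the rendered views are averaged against the prompt embedding; the view count and the random stand-in embeddings are illustrative, and the renderer and CLIP encoders are assumed.

```python
import numpy as np

def unit(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

rng = np.random.default_rng(0)
view_embs = unit(rng.normal(size=(6, 512)))        # stand-ins for CLIP image embeddings
text_emb = unit(rng.normal(size=512))              # stand-in for the prompt embedding

clip_score = float((view_embs @ text_emb).mean())  # mean image-text cosine over views
print(clip_score)
```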

BRepCLIP Score

Prompt-to-BRep similarity

The prompt embedding is compared directly with the generated CAD model's BRep embedding using cosine similarity, so errors in holes, surfaces, edges, and topology affect the score.

Human / GPT Eval

Semantic faithfulness

Five CAD designers and GPT-based evaluators score multiview renderings from 0 to 10 based on how faithfully each generated CAD model matches the input caption.

BRepCLIP-Score examples comparing correct and incorrect CAD generations
Qualitative score examples. Each generated CAD model is paired with a text prompt. BRepCLIP embeds the prompt and the generated BRep model, then reports cosine similarity. Prompt-faithful generations receive higher scores, while models with wrong hole counts, wrong primitive type, or mismatched topology receive lower scores.
Score analysis comparing CLIP, LongCLIP, and BRepCLIP sensitivity
Prompt-corruption sensitivity test. We sample 10K CAD models and score each model with three captions: the original caption, a mildly corrupted caption, and a fully mismatched caption. BRepCLIP-Score drops much more sharply under semantic corruption, showing stronger sensitivity to CAD-specific geometric errors than image-only CLIP metrics.
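One plausible reading of the "mismatch drop" statistic (an assumption; the page does not define the formula) is the relative fall in mean score from original to fully mismatched captions. Because cosine similarity can go negative, a relative drop above 100% is possible.

```python
def relative_drop(mean_original: float, mean_mismatched: float) -> float:
    """Percentage fall in mean score from original to mismatched captions."""
    return 100.0 * (mean_original - mean_mismatched) / mean_original

# Illustrative values only: a negative mismatched score pushes the drop past 100%.
print(relative_drop(0.50, -0.05))  # -> 110.0
```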

Text-to-CAD Generation Benchmark

BRepCLIP-Score is compared against CD, CLIP score, human ratings, and GPT ratings on generated CAD outputs from recent text-to-CAD methods.

Method          CD ↓     CLIP Score ↑   Human Score ↑   GPT Score ↑   BRepCLIP Score ↑
Ground Truth    -        0.37           9.7             9.8           0.61
DeepCAD         86.54    0.24           2.2             2.4           0.15
Text2CAD        86.54    0.26           3.6             3.5           0.16
CADRille        155.80   0.26           3.5             3.7           0.16
Text2CQ (Q3B)   68.15    0.33           5.0             4.9           0.31
Text2CQ (GL)    71.27    0.32           4.6             4.5           0.25
Text2CQ (CG)    77.91    0.31           4.1             3.9           0.22
CADFusion       56.36    0.29           5.5             5.8           0.35

Collaborators

Contact and profiles

Author profile links follow the public project and profile pages used in related CAD generation work. Email addresses are listed where publicly available.