Visual Struct API: Convert UI Screenshots into Structured JSON

Engineering · 2026-04-23

What the Visual Struct API Does

The Visual Struct API converts UI screenshots into structured data. Send an image and receive a hierarchical JSON tree with element types, bounding boxes, text content, layout relationships, and confidence scores.

A UI screenshot is, to a computer, a grid of pixels. To a designer or developer, it is a card, heading, price, rating, button, chart, table, or navigation bar. The API bridges those views by turning pixels into named, typed elements that software can query and manipulate.

Use it when you need screenshot-to-JSON, screenshot-to-Figma, visual QA, UI search, dataset labeling, or design-to-code automation.

The shape of the response

Every element has the same core fields. Here's a header cropped from a landing-page screenshot:

json
{
  "elementId": "header_section_001",
  "elementName": "HeaderSection",
  "elementType": "header",
  "displayName": "Header Section",
  "layoutConfig": {
    "positionMode": "flex",
    "flexibleMode": "row",
    "flexAttributes": {
      "justifyContent": "space-between",
      "alignItems": "center"
    }
  },
  "boundingBox": [0, 0, 1440, 88],
  "childElements": [
    { "elementType": "logo", /* ... */ },
    { "elementType": "nav", /* ... */ },
    { "elementType": "button", "displayName": "Sign Up", /* ... */ }
  ]
}

A few things to notice:

  • elementType is the main field to program against. The API recognizes dozens of types (header, button, input, card, badge, icon, tab, table, chart, and so on), and the underlying model is trained on millions of real UIs.
  • layoutConfig captures how children are arranged inside their parent. When the detector sees a row of elements with equal vertical centers and similar spacing, it emits a flex row with the appropriate justifyContent. Free-positioned elements are marked absolute instead. You get a layout graph, not just a bag of boxes.
  • childElements is recursive. The whole image is a single tree rooted at a page container, and you can walk it depth-first to render, diff, or index.
  • Text is handled by a multilingual OCR stage — 50+ languages out of the box — and the recognized strings land in their correct container, not as floating text runs.
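Here's a minimal sketch of the depth-first walk in Python. It uses only field names that appear in the header example above (elementType, displayName, childElements); the trimmed-down header dict is illustrative, since the real child nodes carry more fields than shown here.

```python
from typing import Any, Dict, Iterator, Tuple

Element = Dict[str, Any]


def walk(element: Element, depth: int = 0) -> Iterator[Tuple[int, Element]]:
    """Yield (depth, element) pairs in depth-first, pre-order."""
    yield depth, element
    for child in element.get("childElements", []):
        yield from walk(child, depth + 1)


# Trimmed-down version of the header example (illustrative data only).
header: Element = {
    "elementType": "header",
    "displayName": "Header Section",
    "boundingBox": [0, 0, 1440, 88],
    "childElements": [
        {"elementType": "logo"},
        {"elementType": "nav"},
        {"elementType": "button", "displayName": "Sign Up"},
    ],
}

# Print an indented outline of the tree.
for depth, el in walk(header):
    label = el.get("displayName", "")
    suffix = f" ({label})" if label else ""
    print("  " * depth + el["elementType"] + suffix)
```

The same generator works as the base for rendering, diffing, or indexing: each of those is a different loop body over the same traversal.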

How accuracy and speed are measured

On our internal benchmarks, output quality varies by source quality, resolution, visual density, and text clarity across desktop web, mobile web, iOS, Android, and data-dense dashboards. End-to-end latency usually sits in the 2–5 second range per screenshot, with the top of the range reserved for high-resolution, text-heavy screens.

The bounding boxes are pixel-accurate in the sense that box IoU (intersection over union) against human-annotated ground truth averages above 0.9 for visually distinct elements. Nested text inside tight layouts (think a six-column data table) is where you'll see the most deviation, and where each element's processingMeta.detectionScore earns its keep: filter out the long tail of low-confidence nodes before downstream use.
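If you want to reproduce the IoU check against your own annotations, the sketch below shows the calculation. It assumes boundingBox is ordered [x_min, y_min, x_max, y_max]. The header example ([0, 0, 1440, 88]) fits that reading but would also fit [x, y, width, height], so confirm the ordering in the schema reference before relying on it.

```python
from typing import Sequence


def iou(a: Sequence[float], b: Sequence[float]) -> float:
    """Intersection over union of two axis-aligned boxes.

    ASSUMPTION: each box is [x_min, y_min, x_max, y_max].
    """
    # Width and height of the overlap, clamped at zero for disjoint boxes.
    overlap_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    overlap_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    intersection = overlap_w * overlap_h

    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


# A predicted header box that is 8px taller than a hand-labeled one.
predicted = [0, 0, 1440, 88]
annotated = [0, 0, 1440, 80]
print(round(iou(predicted, annotated), 3))  # 80 / 88, about 0.909
```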

Output formats

The default response is JSON. Two alternative formats are available on the same endpoint family:

  • SVG: a vector re-rendering of the detected scene. Useful when you want a resolution-independent, editable version of the screenshot that isn't tied to a specific design tool.
  • Figma-compatible nodes — a tree shaped to drop directly into a Figma file via the REST or plugin API. This is what the Codia Figma plugin consumes.

The JSON version is the most flexible; SVG and Figma output are convenience wrappers that post-process the same underlying schema.

Five things to build

1. Screenshot → Figma pipeline. Drop an image into a plugin, get a fully editable Figma file. The plugin is a thin client over the API. If you're doing design ingestion inside your own tool, skip the plugin and call the API directly. For a product-focused workflow, read Screenshot to Figma.

2. Visual QA at scale. Instead of pixel-diffing screenshots between CI runs, diff their element trees. Element-level diffs ignore sub-pixel rendering differences, font fallbacks, and anti-aliasing noise, and they give you diffs that a human can actually read: "button 'Continue' moved 12px right, gained a 4px border-radius."
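As a rough sketch of element-level diffing, the Python below matches elements between two trees and reports position changes. The matching key (elementType plus displayName) is an assumption made for this example, not something the API prescribes, and it will collide when a screen has several identically named elements. The first two boundingBox entries are treated as the top-left corner.

```python
from typing import Any, Dict, Iterator, List, Tuple

Element = Dict[str, Any]
Key = Tuple[str, str]


def flatten(element: Element) -> Iterator[Element]:
    """Yield every element in the tree, depth-first."""
    yield element
    for child in element.get("childElements", []):
        yield from flatten(child)


def index_by_key(root: Element) -> Dict[Key, Element]:
    # Hypothetical matching key: (elementType, displayName).
    return {
        (el.get("elementType", ""), el.get("displayName", "")): el
        for el in flatten(root)
    }


def diff_trees(before: Element, after: Element) -> List[str]:
    old, new = index_by_key(before), index_by_key(after)
    changes: List[str] = []
    for kind, name in sorted(old.keys() - new.keys()):
        changes.append(f"removed {kind} '{name}'")
    for kind, name in sorted(new.keys() - old.keys()):
        changes.append(f"added {kind} '{name}'")
    for key in sorted(old.keys() & new.keys()):
        a, b = old[key].get("boundingBox"), new[key].get("boundingBox")
        if a and b and a != b:
            dx, dy = b[0] - a[0], b[1] - a[1]
            if dx or dy:
                changes.append(f"{key[0]} '{key[1]}' moved dx={dx}px dy={dy}px")
            else:
                changes.append(f"{key[0]} '{key[1]}' changed size")
    return changes


before = {
    "elementType": "page", "displayName": "Checkout",
    "boundingBox": [0, 0, 1440, 900],
    "childElements": [{"elementType": "button", "displayName": "Continue",
                       "boundingBox": [100, 200, 220, 240]}],
}
after = {
    "elementType": "page", "displayName": "Checkout",
    "boundingBox": [0, 0, 1440, 900],
    "childElements": [{"elementType": "button", "displayName": "Continue",
                       "boundingBox": [112, 200, 232, 240]}],
}
print(diff_trees(before, after))  # ["button 'Continue' moved dx=12px dy=0px"]
```

In CI you would likely add a small tolerance (say, ignore shifts under a pixel or two) so that detector jitter between runs doesn't fail the build.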

3. Semantic indexing for AI search. If you're building a Pinterest-for-UI product — or just want your design team to grep for "dark-mode dashboards with sparklines" — the schema gives you queryable structure. Index element types, colors, and text content rather than raw images.
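One simple way to make the trees searchable is an inverted index from terms to screenshot IDs. The sketch below indexes element types and the words in displayName. The screenshot IDs and the "type:" / "text:" term prefixes are made up for this example, and color fields are left out because they don't appear in the sample response above.

```python
import re
from collections import defaultdict
from typing import Any, Dict, Iterator, Set

Element = Dict[str, Any]


def iter_elements(element: Element) -> Iterator[Element]:
    yield element
    for child in element.get("childElements", []):
        yield from iter_elements(child)


def build_index(screens: Dict[str, Element]) -> Dict[str, Set[str]]:
    """Map each term to the set of screenshot IDs that contain it."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for screen_id, root in screens.items():
        for el in iter_elements(root):
            if el.get("elementType"):
                index["type:" + el["elementType"]].add(screen_id)
            for word in re.findall(r"[a-z0-9]+", el.get("displayName", "").lower()):
                index["text:" + word].add(screen_id)
    return index


def search(index: Dict[str, Set[str]], *terms: str) -> Set[str]:
    """Return the screenshots that match ALL of the given terms."""
    matches = [index.get(term, set()) for term in terms]
    return set.intersection(*matches) if matches else set()


# Illustrative trees; real responses carry many more fields.
screens = {
    "pricing.png": {"elementType": "page", "childElements": [
        {"elementType": "chart"},
        {"elementType": "button", "displayName": "Sign Up"},
    ]},
    "login.png": {"elementType": "page", "childElements": [
        {"elementType": "input"},
        {"elementType": "button", "displayName": "Log In"},
    ]},
}
index = build_index(screens)
print(search(index, "type:button", "text:sign"))  # {'pricing.png'}
```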

4. Accessibility lint for images. A screenshot is not accessible; a tree of typed elements can be. Feed the JSON through an a11y checker to surface missing alt text, low-contrast combinations, or unlabeled buttons in designs that live as PNGs in tickets.
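Two of those checks are easy to sketch in Python. The first flags buttons with no label anywhere in their subtree, treating displayName as the label (an assumption, since displayName may be a generated name rather than visible text). The second computes the WCAG 2.x contrast ratio from two hex colors; how colors are exposed in the response isn't shown in this article, so they are passed in as plain strings here.

```python
from typing import Any, Dict, Iterator, List

Element = Dict[str, Any]


def iter_elements(element: Element) -> Iterator[Element]:
    yield element
    for child in element.get("childElements", []):
        yield from iter_elements(child)


def has_label(element: Element) -> bool:
    # ASSUMPTION: a non-empty displayName on the node or a descendant counts as a label.
    return any(e.get("displayName", "").strip() for e in iter_elements(element))


def unlabeled_buttons(root: Element) -> List[Element]:
    return [e for e in iter_elements(root)
            if e.get("elementType") == "button" and not has_label(e)]


def _linear(channel: int) -> float:
    """Convert an 8-bit sRGB channel to linear light (WCAG 2.x formula)."""
    c = channel / 255
    return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4


def luminance(hex_color: str) -> float:
    h = hex_color.lstrip("#")
    r, g, b = (int(h[i:i + 2], 16) for i in (0, 2, 4))
    return 0.2126 * _linear(r) + 0.7152 * _linear(g) + 0.0722 * _linear(b)


def contrast_ratio(foreground: str, background: str) -> float:
    lighter, darker = sorted((luminance(foreground), luminance(background)), reverse=True)
    return (lighter + 0.05) / (darker + 0.05)


page = {"elementType": "page", "childElements": [
    {"elementType": "button", "displayName": "Sign Up"},
    {"elementType": "button", "childElements": [{"elementType": "icon"}]},
]}
print(len(unlabeled_buttons(page)))              # 1 (the icon-only button)
print(contrast_ratio("#777777", "#ffffff") < 4.5)  # True: fails the AA threshold for body text
```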

5. Code generation from screenshots. Because the schema shares semantics with our PDF and design pipelines, you can pipe it straight into Codia's code generators to produce React, Vue, Flutter, or HTML/Tailwind. The same tree that describes an image also generates the UI.

When to Use Visual Struct Instead of OCR

Traditional OCR extracts text. Visual Struct extracts UI structure. If all you need is the text inside an image, OCR may be enough. If you need to know that a string belongs inside a button, a price card, a table cell, or a navigation item, use Visual Struct.

That distinction matters for design automation. A screenshot with "Continue" text is not useful by itself. A screenshot with a button node named "Continue", positioned inside a checkout form with sibling inputs and validation text, is useful.

Handling edge cases

A few practical notes from integrations:

  • Very long screenshots (like full-page captures of marketing sites) are best split at natural section boundaries. The model still works on 10k-pixel-tall images, but you'll get more stable groupings by chunking.
  • Dark mode UIs are well-supported — the detector does not rely on background assumptions — but if a screenshot is a rendered dark theme of a light-mode design, expect the extracted colors to reflect what's on screen, not the token values behind them.
  • Hand-drawn or sketch-fidelity mockups work, though the elementType distribution skews toward generic container and text nodes until the sketch is tightened up.
  • Multiple nested scrollable regions (a side panel that scrolls inside a page that scrolls) are flattened into a single tree. The API has no concept of scroll, so track scroll containers on your side.
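For the long-screenshot case, here is one way to plan the splits in Python. It assumes you already have candidate section boundaries as y-coordinates (from the DOM, a design file, or a manual pass); the function name and the 3000px default limit are illustrative choices, not API parameters. It greedily cuts at the last boundary that fits and falls back to a hard cut when no boundary is available.

```python
from typing import List, Tuple


def plan_chunks(height: int, boundaries: List[int],
                max_height: int = 3000) -> List[Tuple[int, int]]:
    """Return (top, bottom) pixel ranges covering a tall screenshot.

    Each chunk is at most max_height tall and ends on a section
    boundary whenever one falls inside the allowed range.
    """
    if max_height <= 0:
        raise ValueError("max_height must be positive")
    cuts = sorted(b for b in set(boundaries) if 0 < b < height)
    chunks: List[Tuple[int, int]] = []
    top = 0
    while height - top > max_height:
        fitting = [b for b in cuts if top < b <= top + max_height]
        # Prefer the furthest boundary that fits; otherwise cut at the limit.
        bottom = fitting[-1] if fitting else top + max_height
        chunks.append((top, bottom))
        top = bottom
    chunks.append((top, height))
    return chunks


print(plan_chunks(10000, [900, 2400, 4100, 5200, 7800, 9000]))
# [(0, 2400), (2400, 5200), (5200, 7800), (7800, 10000)]
```

When you merge the per-chunk trees back together, remember to add each chunk's top offset to the y-coordinates of its bounding boxes so everything is in page coordinates again.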

Getting started

bash
curl 'https://api.codia.ai/v1/open/image_to_design' \
  -H 'Authorization: Bearer {codia_api_key}' \
  -H 'Content-Type: application/json' \
  --data '{ "image_url": "https://example.com/ui.png" }'

  1. Get a key at codia.ai/dashboard/developer.
  2. Send an image (URL or base64). A 1440×900 desktop screenshot is the sweet spot.
  3. Walk the tree. If you only care about elements above a confidence threshold, pre-filter on processingMeta.detectionScore > 0.6 and skip the noise.
  4. Full reference — schema definitions, rate limits, error codes — is at /api.
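The same flow in Python, as a sketch using only the standard library. build_request mirrors the curl call above, and prune applies the confidence pre-filter from step 3. Two things here are assumptions: where the element tree sits inside the response body (check the reference), and the policy of keeping nodes that have no score while dropping a low-confidence node together with its whole subtree.

```python
import json
import urllib.request
from typing import Any, Dict, Optional

Element = Dict[str, Any]
API_URL = "https://api.codia.ai/v1/open/image_to_design"


def build_request(api_key: str, image_url: str) -> urllib.request.Request:
    """Build the same POST request as the curl example."""
    body = json.dumps({"image_url": image_url}).encode("utf-8")
    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }
    return urllib.request.Request(API_URL, data=body, headers=headers, method="POST")


def prune(element: Element, threshold: float = 0.6) -> Optional[Element]:
    """Return a copy of the tree without low-confidence nodes.

    ASSUMPTIONS: nodes without a score are kept, and dropping a node
    drops its entire subtree.
    """
    score = element.get("processingMeta", {}).get("detectionScore")
    if score is not None and score <= threshold:
        return None
    kept = dict(element)
    children = (prune(c, threshold) for c in element.get("childElements", []))
    kept["childElements"] = [c for c in children if c is not None]
    return kept


# Sending the request (left as a comment so the sketch runs offline):
#   req = build_request("YOUR_API_KEY", "https://example.com/ui.png")
#   with urllib.request.urlopen(req) as resp:
#       payload = json.load(resp)

# Illustrative tree to show the filter in action.
tree = {"elementType": "page", "childElements": [
    {"elementType": "button", "processingMeta": {"detectionScore": 0.9}},
    {"elementType": "icon", "processingMeta": {"detectionScore": 0.3}},
]}
print([c["elementType"] for c in prune(tree)["childElements"]])  # ['button']
```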

If you're considering this for an enterprise workload — on-prem, private region, or custom element types specific to your product — [email protected] is the right door. The model is the same; the packaging changes.

UIs are going to keep being captured as images — in bug reports, in Slack threads, in Dribbble shots, in competitor research. Treat those images as queryable structure and you unlock a pile of work you previously had to do by hand.

FAQ

Can the Visual Struct API convert screenshots to Figma?

Yes. The API can return Figma-compatible node data, and the Codia Figma plugin uses the same underlying structure to create editable design files.

Is this the same as OCR?

No. OCR extracts text. Visual Struct extracts text plus layout, element types, hierarchy, and bounding boxes.

What image size works best?

A clear 1440×900 desktop screenshot or a high-resolution mobile screen is a good starting point. Very long pages should be split at natural section boundaries.

#visual-struct-api #screenshot-to-json #screenshot-to-figma #computer-vision #ocr