Building a shippable Record & Replay app for macOS

Codex Record & Replay points to a product idea I find very interesting: sometimes users do not want to explain a workflow from scratch. They already know how to do the work. They just want to show it once and turn that demonstration into something reusable.
That is a strong idea. But a standalone product cannot be a simple browser recorder or a mouse-coordinate macro. If people are going to install it, trust it, and use it repeatedly, it has to record the real browser and the real desktop, protect sensitive data, and replay the intent of the workflow instead of blindly replaying pixels.
The product I would build is a local-first Mac app:
record a browser + desktop workflow
-> review what was captured
-> compile it into an editable routine
-> replay it with verification
The advantage over Codex is not “a better general agent.” The advantage is product ownership: local storage, bring-your-own-model settings, a routine library, run history, scheduling, logs, and exportable workflow files.
The core product bet
A useful Record & Replay app should turn one demonstration into a routine that can run again.
Demonstrate
-> capture evidence
-> compile a routine
-> run the routine
-> verify the result
The important word is “compile.” Raw events are evidence. They are not the final automation.
If the user clicks a button, the product should not remember only x=1040, y=72. It should remember the app, the window, the accessible target, the surrounding text, the browser DOM, a screenshot if needed, and the reason that step mattered.
That is the difference between a useful workflow system and a fragile macro recorder.
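To make that concrete, here is a minimal sketch of what a recorded click target could carry. The `CapturedTarget` class and all of its field names are hypothetical, chosen only to mirror the list above; they are not part of any existing trace format.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class CapturedTarget:
    """Evidence recorded for one user action (hypothetical schema)."""
    app: str
    window_title: str
    role: Optional[str] = None            # Accessibility or ARIA role
    label: Optional[str] = None           # accessible name or visible text
    selector: Optional[str] = None        # DOM selector, browser surface only
    nearby_text: list[str] = field(default_factory=list)
    screenshot_ref: Optional[str] = None  # reference to a stored keyframe
    point: Optional[tuple[int, int]] = None  # raw coordinates, fallback evidence only
    reason: str = ""                      # why the step mattered

    def has_semantic_anchor(self) -> bool:
        # A selector, or a role plus a label, lets replay find the target
        # without relying on screen coordinates.
        return bool(self.selector or (self.role and self.label))


click = CapturedTarget(
    app="Chrome",
    window_title="Dashboard",
    role="button",
    label="Export",
    point=(1040, 72),
    reason="Start the report export",
)
```

The coordinates are still stored, but only as one piece of evidence next to the role and label that replay would prefer.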
The first version must cover browser and desktop
Browser-only automation is useful, but it is not enough for this product.
Real workflows often cross boundaries:
- open a page in Chrome;
- download a file;
- find it in Finder;
- upload it somewhere else;
- copy a value from Notes or Slack;
- confirm a macOS dialog;
- return to the browser and submit.
If the product can only see the browser, it misses the part that makes Record & Replay different.
So the first credible version needs two surfaces:
flowchart LR
User["User demonstrates workflow"] --> App["Mac app"]
App --> Browser["Browser surface\nextension + native messaging"]
App --> Desktop["Desktop surface\nAccessibility + input events + screenshots"]
Browser --> Trace["Local trace\nsession.json\nevents.jsonl\nkeyframes"]
Desktop --> Trace
Trace --> Review["Review and redaction"]
Review --> Compile["Routine compiler"]
Compile --> Routine["workflow.json\nroutine.md\nassets"]
Routine --> Runtime["Replay runtime"]
Runtime --> Verify["Verification"]
The browser surface should capture URL, title, DOM target, ARIA role/name, selector candidates, input changes, navigation, and key screenshots.
The desktop surface should capture foreground app, window title, Accessibility role/name/value, focus, selection, input events, and screenshots.
Both should write into one timeline. The routine compiler should see the workflow as one story, not as two disconnected logs.
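One way to get a single timeline is to merge the two per-surface streams by timestamp. The sketch below assumes each event carries a `ts` field taken from the same clock and that each stream is already sorted; both the field name and that guarantee are assumptions.

```python
import heapq
from operator import itemgetter


def merge_timelines(browser_events, desktop_events):
    """Merge two event lists, each already sorted by 'ts', into one timeline."""
    return list(heapq.merge(browser_events, desktop_events, key=itemgetter("ts")))


browser = [
    {"ts": 1.0, "surface": "browser", "type": "click"},
    {"ts": 4.2, "surface": "browser", "type": "input"},
]
desktop = [
    {"ts": 2.5, "surface": "desktop", "type": "focus"},
]
timeline = merge_timelines(browser, desktop)
```

If the two surfaces cannot share a clock, the merge would need an offset correction first, which this sketch does not attempt.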
Trace is evidence, not the product
The trace format should be boring and explicit.
Each event needs to say:
- what surface it came from;
- what app and window were active;
- what the user did;
- what target was involved;
- what context was captured;
- which fields are redacted;
- which screenshot, DOM, or Accessibility snapshot supports the event.
A simple event might look like this:
{
  "eventId": "evt_001",
  "surface": "browser",
  "app": "Chrome",
  "windowTitle": "Dashboard",
  "type": "input",
  "target": {
    "role": "textbox",
    "label": "Search",
    "selector": "[aria-label='Search']"
  },
  "value": {
    "kind": "text",
    "redacted": false,
    "preview": "invoice 2026"
  },
  "contextRefs": ["dom_001", "frame_001"]
}
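Events shaped like the one above could be appended to `events.jsonl` one line at a time. This is a rough sketch; the `REQUIRED_FIELDS` set and the helper names are my own assumptions about what "boring and explicit" would mean in practice.

```python
import json
from pathlib import Path

# Assumed minimum set of keys every event must carry.
REQUIRED_FIELDS = {"eventId", "surface", "app", "windowTitle", "type", "target", "contextRefs"}


def append_event(path: Path, event: dict) -> None:
    """Validate an event and append it as one JSON line."""
    missing = REQUIRED_FIELDS - event.keys()
    if missing:
        raise ValueError(f"event is missing fields: {sorted(missing)}")
    with path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(event, ensure_ascii=False) + "\n")


def read_events(path: Path) -> list[dict]:
    """Read the trace back, skipping blank lines."""
    with path.open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

Rejecting incomplete events at write time keeps the compiler from having to guess later.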
The product should not store everything as plain text forever. Raw content, summaries, screenshots, and model-bound context should be separated. Before anything leaves the device, the user should be able to see it.
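A small sketch of that separation: derive a model-bound copy of each event and leave the stored original untouched. The `raw` field and the `model_bound_view` helper are hypothetical; only `redacted` and `preview` come from the example event.

```python
import copy


def model_bound_view(event: dict) -> dict:
    """Return the copy of an event that is allowed to leave the device."""
    view = copy.deepcopy(event)
    value = view.get("value")
    if isinstance(value, dict):
        value.pop("raw", None)  # hypothetical field holding unredacted content
        if value.get("redacted"):
            value["preview"] = "[redacted]"
    # Screenshots and DOM snapshots are not embedded; only their refs remain.
    return view


event = {
    "eventId": "evt_002",
    "type": "input",
    "value": {"kind": "text", "redacted": True, "preview": "s3cret", "raw": "s3cret"},
    "contextRefs": ["frame_002"],
}
preview = model_bound_view(event)
```

The same `preview` object is what a context preview screen could render, so the user sees exactly what would be sent.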
The compiler is the real product
The compiler turns noisy evidence into an editable routine.
flowchart TD
Raw["Raw events"] --> Clean["Clean noise"]
Clean --> Segment["Segment steps"]
Segment --> Anchor["Build stable anchors"]
Anchor --> Params["Detect variables"]
Params --> Verify["Write verification"]
Verify --> Routine["routine.md + workflow.json"]
Good compilation means:
- merge typing into one input step;
- remove accidental clicks and idle time;
- detect which values change between runs;
- prefer semantic targets over coordinates;
- write verification for important steps;
- mark risks and sensitive fields;
- ask the user only when intent is unclear.
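The first two items in that list can be sketched in a few lines. This assumes a simplified event shape (`type`, `target`, and `char` for key events, plus an `idle` event type), all of which are illustrative; it does not attempt accidental-click removal or variable detection.

```python
def compile_steps(events: list[dict]) -> list[dict]:
    """Drop idle events and merge consecutive keystrokes on one target."""
    steps: list[dict] = []
    for ev in events:
        if ev["type"] == "idle":
            continue  # idle time carries no intent
        prev = steps[-1] if steps else None
        if ev["type"] == "key":
            if prev and prev["type"] == "input" and prev["target"] == ev["target"]:
                prev["text"] += ev["char"]  # extend the current input step
            else:
                steps.append({"type": "input", "target": ev["target"], "text": ev["char"]})
        else:
            steps.append(dict(ev))
    return steps


raw = [
    {"type": "click", "target": "Search"},
    {"type": "key", "target": "Search", "char": "i"},
    {"type": "key", "target": "Search", "char": "n"},
    {"type": "idle"},
    {"type": "key", "target": "Search", "char": "v"},
    {"type": "click", "target": "Submit"},
]
steps = compile_steps(raw)
```

Six raw events collapse into three steps: a click, one input of "inv", and a click.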
A good routine should read like this:
Open the report page.
Choose the date range.
Upload the selected file.
Submit the draft.
Verify that the success message appears.
Not like this:
Move mouse to 1040,72.
Click.
Wait 500ms.
Press Tab.
Replay needs a verification loop
Replay should be a state machine, not a script that blindly runs line by line.
stateDiagram-v2
[*] --> LoadRoutine
LoadRoutine --> CollectInputs
CollectInputs --> Preflight
Preflight --> ExecuteStep
ExecuteStep --> VerifyStep
VerifyStep --> ExecuteStep: Pass and more steps remain
VerifyStep --> Recover: Fail
Recover --> ExecuteStep: Recovered
Recover --> HumanTakeover: Needs help
HumanTakeover --> ExecuteStep: User resumes
VerifyStep --> Done: All steps pass
Done --> [*]
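The execute, verify, recover, and takeover part of that diagram can be sketched as a small loop. The function names, the callback signatures, and the `max_recoveries` cap are assumptions, and the LoadRoutine, CollectInputs, and Preflight states are left out for brevity.

```python
from enum import Enum, auto


class State(Enum):
    EXECUTE = auto()
    VERIFY = auto()
    RECOVER = auto()
    HUMAN = auto()
    DONE = auto()


def replay(steps, execute, verify, recover, ask_human, max_recoveries=2):
    """Run steps through the verification loop; return the visited states."""
    trail = []
    if not steps:
        return trail
    state, index, recoveries = State.EXECUTE, 0, 0
    while state is not State.DONE:
        trail.append((state.name, index))
        step = steps[index]
        if state is State.EXECUTE:
            execute(step)
            state = State.VERIFY
        elif state is State.VERIFY:
            if not verify(step):
                state = State.RECOVER
            elif index + 1 < len(steps):
                index, recoveries, state = index + 1, 0, State.EXECUTE
            else:
                state = State.DONE
        elif state is State.RECOVER:
            recoveries += 1
            # Give up on automatic recovery after a few attempts (assumed cap).
            if recoveries <= max_recoveries and recover(step):
                state = State.EXECUTE
            else:
                state = State.HUMAN
        elif state is State.HUMAN:
            ask_human(step)  # returns once the user resumes
            recoveries = 0
            state = State.EXECUTE
    return trail
```

Returning the trail of visited states gives the run history something concrete to store for each replay.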
The runtime should try tools in this order:
1. Use the most stable semantic path first: API, MCP, Apple Events, browser DOM action, or Accessibility action.
2. Use structure next: selector, ARIA label, visible text, Accessibility role, or window hierarchy.
3. Use visual matching when structure is incomplete.
4. Use coordinates only as a last resort, tied to a known window and screenshot.
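That ordering amounts to a fallback chain. The sketch below uses stub resolvers that only inspect a recorded target dictionary; a real runtime would query the DOM, the Accessibility tree, or a screenshot matcher instead, and every key name here is an assumption.

```python
from typing import Callable, Optional

Resolver = Callable[[dict], Optional[str]]


def resolve_target(target: dict, resolvers: list[tuple[str, Resolver]]):
    """Try resolvers from most to least stable; return (strategy, handle) or None."""
    for name, resolver in resolvers:
        handle = resolver(target)
        if handle is not None:
            return name, handle
    return None


def by_semantic_action(t: dict) -> Optional[str]:
    return t.get("action_id")  # e.g. an API or Accessibility action identifier


def by_structure(t: dict) -> Optional[str]:
    if t.get("selector"):
        return t["selector"]
    if t.get("role") and t.get("label"):
        return f'{t["role"]}:{t["label"]}'
    return None


def by_visual_match(t: dict) -> Optional[str]:
    return t.get("image_template")  # reference to a cropped reference image


def by_coordinates(t: dict) -> Optional[str]:
    # Coordinates only count when tied to a known window.
    if t.get("point") and t.get("window_title"):
        x, y = t["point"]
        return f'{t["window_title"]}@{x},{y}'
    return None


CHAIN = [
    ("semantic", by_semantic_action),
    ("structure", by_structure),
    ("visual", by_visual_match),
    ("coordinates", by_coordinates),
]
```

Logging which strategy won for each step would also show the user how fragile a routine is before it fails.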
Dangerous actions need explicit confirmation. Deleting data, submitting payments, changing passwords, uploading personal files, installing software, or sending sensitive information should not run unattended.
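A confirmation gate for those actions can be very small. The `risk` field on a step and the category names in `RISKY_KINDS` are hypothetical labels for the cases listed above.

```python
# Assumed risk labels, one per category of dangerous action.
RISKY_KINDS = {
    "delete_data",
    "payment",
    "password_change",
    "personal_upload",
    "install_software",
    "send_sensitive",
}


def may_run(step: dict, unattended: bool, confirm) -> bool:
    """Decide whether a step may execute right now."""
    if step.get("risk") not in RISKY_KINDS:
        return True
    if unattended:
        return False  # risky steps never run without a person present
    return bool(confirm(step))  # explicit confirmation from the user
```

In a scheduled run, a `False` result would hand the step to human takeover rather than skip it silently.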
BYOK is a real reason to exist
Bring-your-own-model support is not just a settings feature. It is a product advantage.
It lets users make their own trade-off between cost, privacy, and model quality. The app should support at least:
- OpenAI-compatible base URL + API key + model;
- Anthropic;
- Gemini;
- OpenRouter;
- local models through Ollama or a compatible server.
Keys should live in Keychain. The settings screen should include connection tests, model capability hints, context preview, and rough cost estimates.
Cheap models can clean text and summarize traces. Stronger multimodal models can handle ambiguous screens or visual recovery.
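Routing between cheap and stronger models could start as a simple lookup. Everything in this sketch is illustrative: the profile names, the placeholder URLs, the task labels, and the prices are made up, not real provider data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelProfile:
    name: str
    base_url: str
    multimodal: bool
    usd_per_million_input_tokens: float


# Tasks assumed to need image input.
VISION_TASKS = {"ambiguous_screen", "visual_recovery"}


def pick_model(task: str, profiles: list[ModelProfile]) -> ModelProfile:
    """Pick the cheapest configured model that can handle the task."""
    needs_vision = task in VISION_TASKS
    candidates = [p for p in profiles if p.multimodal or not needs_vision]
    if not candidates:
        raise LookupError(f"no configured model can handle {task!r}")
    return min(candidates, key=lambda p: p.usd_per_million_input_tokens)


def estimate_cost_usd(profile: ModelProfile, input_tokens: int) -> float:
    """Rough input-side cost estimate for the settings screen."""
    return input_tokens / 1_000_000 * profile.usd_per_million_input_tokens


profiles = [
    ModelProfile("local-small", "http://localhost.invalid/v1", False, 0.0),
    ModelProfile("hosted-vision", "https://api.example.invalid/v1", True, 3.0),
]
```

With these profiles, trace summarization goes to the local model and visual recovery goes to the hosted multimodal one.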
Distribution is part of the plan
This kind of app should start outside the Mac App Store.
It needs Accessibility, Screen Recording, and Input Monitoring permissions, Apple Events in some cases, helper processes, and browser extension setup. Those requirements fit poorly with the App Sandbox that the Mac App Store requires, so Developer ID signing and notarization are the more realistic first route.
The user experience also has to explain permissions clearly. A workflow recorder sees a lot. If the app cannot explain what it records, where it stores data, and what it sends to models, it will not deserve trust.
A realistic v1
The first sellable version should be narrow, but complete.
It should support:
- Chrome or Brave workflow recording;
- macOS app/window tracking and Accessibility targets;
- one local trace timeline;
- a trace review screen with deletion and redaction;
- routine compilation into `routine.md` and `workflow.json`;
- browser + desktop replay with verification;
- human takeover when the app is unsure;
- local-first storage and context preview;
- signed and notarized distribution.
Good first workflows are boring on purpose: back-office browser forms, browser + Finder upload/download flows, and simple native app steps in Finder, Notes, Mail, Calendar, or Slack.
I would avoid banking, payments, government, medical, games, complex design tools, and fully unattended sensitive actions in v1.
My final take
This is a viable product direction if the positioning is honest.
It is not “automate every app.”
It is not “record pixels and replay them.”
It is not “browser automation with a nicer UI.”
The stronger promise is:
> Record a browser + Mac workflow once, turn it into an editable AI routine, and replay it with your own model.
That gives the product a clear reason to exist next to Codex: local-first control, model choice, portable routines, and a UI built for repeated real-world runs.


