pseudo-jarvis
Real-time voice dictation that types directly into Cursor Agent — the first step toward a personal AI you can simply talk to.
The goal is an Iron Man–style JARVIS: speak naturally, and your AI assistant understands intent, confirms when needed, and acts. Today this project wires voice → text → Cursor; tomorrow it layers agents on top of that stream.
02 — Why pseudo-jarvis?
Not the same as Cursor's microphone
Cursor has a built-in voice button in Agent chat (hold Ctrl+M / Cmd+M, or click the mic).
It records a clip, runs batch speech-to-text, and drops the transcript into the input field when you confirm.
pseudo-jarvis is a different pipeline — built for hands-free, continuous dictation into Cursor
with voice commands and an agent rule that confirms intent before acting.
Cursor mic built-in
- Push-to-talk or hold shortcut — one recording, then transcribe
- Transcript appears only inside Cursor's chat input
- You still click Send (or type) to submit the message
- No custom voice commands or pause punctuation
- Uses Cursor's own STT stack (quality varies by release / window)
pseudo-jarvis this project
- Continuous dictation — text types at the cursor as you speak
- Simulates keystrokes (needs macOS Accessibility)
- Say send after a pause → Enter + @voice-input-confirmation.mds
- freeze / resume commands, and pause > 2 s → . + Shift+Enter
- Agent rule asks for Confirm. before implementing
| | Cursor microphone | pseudo-jarvis |
|---|---|---|
| When to use | Quick one-shot prompts; you review text before sending | Long dictation sessions; JARVIS-style talk → confirm → act |
| Recording model | Record clip → batch STT → paste into input | Always listening; chunks transcribed in a priority queue |
| Submit message | Manual Send button or Enter | Voice command send (Enter + rule mention) |
| Agent safety | Whatever you typed is sent as-is | @voice-input-confirmation.mds restates intent; waits for Confirm. |
| Setup | Microphone permission for Cursor only | Separate app; Microphone + Accessibility; ADD each project once |
Use Cursor's mic when you want a short prompt inside chat with no extra tooling. Use pseudo-jarvis when you want to dictate continuously, break sentences on long pauses, submit with your voice, and have the agent confirm before it edits code or runs commands.
03 — System design
Architecture & tech stack
A macOS tkinter app captures speech, transcribes via Google Speech Recognition, and simulates
keystrokes into Cursor Agent. Run from source (python gui_app.py) or install
pseudo-jarvis.app built with PyInstaller.
App shell active
- gui_app.py: tkinter GUI
- app/session.py: session runner
- build_app.sh → .app bundle
Voice engine active
- VoiceToText in app/voice_to_text.py
- Priority queue for send / resume
- Stop button → stop() (no stray typing)
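A priority queue of this kind can be sketched with the standard library. How pseudo-jarvis actually assigns priorities is not stated here, so the sketch takes the priority as an input; `ChunkQueue`, `PRIORITY_COMMAND`, and `PRIORITY_SPEECH` are hypothetical names, not the project's API.

```python
import itertools
import queue

# Hypothetical priority levels: lower numbers are dequeued first.
PRIORITY_COMMAND = 0  # chunks that should jump the line (e.g. send / resume)
PRIORITY_SPEECH = 1   # ordinary dictation chunks


class ChunkQueue:
    """Priority queue that keeps arrival order within the same priority."""

    def __init__(self):
        self._queue = queue.PriorityQueue()
        self._counter = itertools.count()

    def put(self, chunk, priority=PRIORITY_SPEECH):
        # The counter breaks ties, so equal-priority chunks stay in arrival
        # order and the chunk payloads themselves are never compared.
        self._queue.put((priority, next(self._counter), chunk))

    def get(self, timeout=None):
        _priority, _seq, chunk = self._queue.get(timeout=timeout)
        return chunk

    def empty(self):
        return self._queue.empty()
```

With this shape, a command chunk queued after two speech chunks is still handled first, while the speech chunks keep their spoken order.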
Audio & speech active
- PortAudio (Homebrew)
- PyAudio + SpeechRecognition
- Google Speech API (online)
Cursor integration active
- pyautogui + pynput (click, paste, type)
- Paste @voice-input-confirmation.mds
- Agent rule confirms intent
Project registry active
- app/project_registry.py
- ADD copies voice rule into projects
- setup-variables/subscribed-projects.txt
AI agents planned
Task runners, automations, and multi-step reasoning built on the voice transcript pipeline.
High-level data flow
flowchart LR
subgraph Input
MIC[Microphone]
BTN[GUI Start / Stop]
end
subgraph pseudo_jarvis
GUI[gui_app.py]
SESS[app/session.py]
VTT[VoiceToText]
QUEUE[Priority queue]
SR[Google Speech API]
end
subgraph Output
CUR[Cursor Agent typing box]
end
MIC --> VTT
BTN --> GUI
GUI --> SESS --> VTT
VTT --> QUEUE --> SR
SR --> VTT
VTT -->|pyautogui / paste| CUR
BTN -->|Stop| VTT
Component map
flowchart TB
subgraph Startup
A[Mic dropdown in GUI] --> B[Start button]
B --> C[wait for click in Cursor]
C --> D[type rule mention]
D --> E[listen_and_transcribe]
end
subgraph Dictation
E --> F[_on_audio callback]
F --> G[_recognition_worker]
G --> H{command?}
H -->|speech| I[_type_recognized_text]
H -->|send| J[Enter + rule mention]
H -->|freeze| K[halt typing]
H -->|resume| L[click + rule mention]
end
I --> CUR[(Cursor focus)]
J --> CUR
L --> CUR
04 — Application window
App UI (gui_app.py)
The pseudo-jarvis window is a single tkinter app. Everything below runs on macOS only; grant Microphone and Accessibility to pseudo-jarvis (or to Terminal when running from source).
- Subscribed projects: reads setup-variables/subscribed-projects.txt. Expand to see which Cursor repos already have the voice rule installed.
- ADD project: Project root text field · Browse… folder picker · ADD button. Copies voice-input-confirmation.mds into .cursor/rules/, registers the path, and adds the rule file to that project's .gitignore.
- Session: Microphone dropdown (from list_input_devices) · Start begins a session on a background thread · Stop calls VoiceToText.stop() and re-enables the mic picker.
- Session log: print() output is redirected here via QueueLogWriter, covering calibration, listening hints, freeze/resume messages, and errors.
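The log redirection can be illustrated with a small file-like class. Only the name QueueLogWriter comes from the project; the body below is an assumed minimal implementation, and `drain` is a hypothetical helper standing in for whatever the GUI does to pull text out of the queue.

```python
import contextlib
import queue


class QueueLogWriter:
    """File-like object that forwards everything written to it into a queue."""

    def __init__(self, log_queue):
        self._queue = log_queue

    def write(self, text):
        if text:
            self._queue.put(text)
        return len(text)

    def flush(self):
        # Nothing is buffered, so there is nothing to flush.
        pass


def drain(log_queue):
    """Collect all text currently waiting in the queue (hypothetical helper)."""
    parts = []
    while True:
        try:
            parts.append(log_queue.get_nowait())
        except queue.Empty:
            return "".join(parts)


log_queue = queue.Queue()
with contextlib.redirect_stdout(QueueLogWriter(log_queue)):
    print("Calibrating microphone...")
    print("Listening.")
captured = drain(log_queue)
```

A queue is a reasonable handoff here because the session runs on a background thread while tkinter widgets should only be updated from the main thread; the GUI could poll the queue periodically, for example with a widget's `after` callback.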
First launch
A dialog explains that you must ADD each Cursor project root before dictating into it.
Start
Disables Start and mic picker, enables Stop, spawns run_session(device_index) on a worker thread, and binds the live VoiceToText instance for Stop.
Stop
Sets the stop flag, drains the recognition queue, and joins worker threads so no keystrokes fire after the session ends.
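The stop sequence (set flag, drain queue, join workers) can be sketched as follows. This is not the project's VoiceToText code; `RecognitionWorker` and its `handle` callback are hypothetical stand-ins for the recognition worker and the function that types text.

```python
import queue
import threading


class RecognitionWorker:
    """Processes jobs on a background thread until stop() is called."""

    def __init__(self, handle):
        self._handle = handle  # e.g. a function that types recognized text
        self._jobs = queue.Queue()
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def submit(self, job):
        # Refuse new work once the session is stopping.
        if not self._stop.is_set():
            self._jobs.put(job)

    def _run(self):
        while not self._stop.is_set():
            try:
                job = self._jobs.get(timeout=0.05)
            except queue.Empty:
                continue
            # Re-check the flag so a job dequeued during shutdown is dropped.
            if self._stop.is_set():
                break
            self._handle(job)

    def stop(self):
        self._stop.set()
        # Drain anything still queued so it can never be typed later.
        while True:
            try:
                self._jobs.get_nowait()
            except queue.Empty:
                break
        self._thread.join(timeout=2.0)

    def is_running(self):
        return self._thread.is_alive()
```

Checking the flag both before and after dequeuing is what keeps a job that was already pulled off the queue from being handled once Stop has been pressed.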
Distribution
./build_app.sh produces dist/pseudo-jarvis.app. The bundled app stores config under ~/Library/Application Support/pseudo-jarvis/ when not run from the repo.
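One way to choose the config location is to check whether the process is running from a PyInstaller bundle, which sets `sys.frozen`. The sketch below is an assumption about how such a switch might look: `config_dir` is a hypothetical helper, and using the repo's setup-variables/ folder for source runs is inferred from the registry path mentioned earlier.

```python
import sys
from pathlib import Path

APP_NAME = "pseudo-jarvis"


def config_dir(frozen=None, repo_root=None, home=None):
    """Return the directory where config files are kept.

    frozen, repo_root and home can be passed explicitly (useful in tests);
    by default they are detected from the running process.
    """
    if frozen is None:
        # PyInstaller bundles set sys.frozen; plain source runs do not.
        frozen = bool(getattr(sys, "frozen", False))
    if frozen:
        base = Path(home) if home is not None else Path.home()
        return base / "Library" / "Application Support" / APP_NAME
    # Assumption: source runs keep config in the repo's setup-variables/.
    root = Path(repo_root) if repo_root is not None else Path.cwd()
    return root / "setup-variables"
```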
GUI → session → Cursor
sequenceDiagram
actor User
participant GUI as gui_app.py
participant Sess as app/session.py
participant VTT as VoiceToText
participant Cursor as Cursor Agent
User->>GUI: choose mic, click Start
GUI->>Sess: run_session on background thread
Sess->>User: click typing box
User->>Cursor: click
Sess->>Cursor: paste @voice-input-confirmation.mds
Sess->>VTT: listen_and_transcribe
GUI->>GUI: session log shows hints
User->>Cursor: speak
VTT->>Cursor: type at cursor
User->>GUI: click Stop
GUI->>VTT: stop
ADD project flow
flowchart TD
A[Browse or paste project root] --> B[Click ADD]
B --> C[Copy voice-input-confirmation.mds]
C --> D[Append path to subscribed-projects.txt]
D --> E[Add rule path to project .gitignore]
E --> F[Show in Subscribed projects list]
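The ADD steps above can be sketched with the standard library. This is a minimal illustration rather than the code in app/project_registry.py; `add_project` and `_append_line_once` are hypothetical names, and skipping duplicate entries is an assumption.

```python
import shutil
from pathlib import Path

RULE_NAME = "voice-input-confirmation.mds"


def _append_line_once(path, line):
    """Append line to a text file unless it is already present."""
    existing = path.read_text().splitlines() if path.exists() else []
    if line not in existing:
        existing.append(line)
        path.write_text("\n".join(existing) + "\n")


def add_project(project_root, rule_source, registry_file):
    """Install the voice rule into a project and register the project."""
    project_root = Path(project_root).resolve()

    # 1. Copy the rule into .cursor/rules/ inside the project.
    rules_dir = project_root / ".cursor" / "rules"
    rules_dir.mkdir(parents=True, exist_ok=True)
    installed = rules_dir / RULE_NAME
    shutil.copyfile(rule_source, installed)

    # 2. Record the project path in the registry file.
    registry_file = Path(registry_file)
    registry_file.parent.mkdir(parents=True, exist_ok=True)
    _append_line_once(registry_file, str(project_root))

    # 3. Keep the copied rule out of the project's version control.
    _append_line_once(project_root / ".gitignore", f".cursor/rules/{RULE_NAME}")
    return installed
```

Making each step idempotent means clicking ADD twice for the same project leaves one registry entry and one .gitignore line.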
05 — How it works
Command & feature workflows
Voice commands are matched as whole phrases after a pause. The sections below show each flow. Voice signal reliability is still being tuned; see the disclaimer above.
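Whole-phrase matching means a command word only counts when it is the entire utterance, not when it appears inside a longer sentence. A minimal sketch, assuming punctuation and case are normalized first (`match_command` is a hypothetical helper, not the project's function):

```python
import re

COMMANDS = ("send", "freeze", "resume")


def match_command(transcript):
    """Return the command name if the whole transcript is a command, else None."""
    # Drop punctuation the recognizer may add, then normalize case and spacing.
    normalized = re.sub(r"[^\w\s]", "", transcript).strip().lower()
    return normalized if normalized in COMMANDS else None
```

Under this rule, "Send." is a command, while "please send the email" is typed as ordinary dictation.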
Session startup
sequenceDiagram
actor User
participant GUI as pseudo-jarvis GUI
participant App as VoiceToText
participant Cursor as Cursor Agent
User->>GUI: open app, pick mic, Start
GUI->>User: click typing box
User->>Cursor: click
App->>Cursor: paste @voice-input-confirmation.mds
App->>App: start listening
User->>Cursor: speak — text appears at cursor
Continuous dictation
flowchart LR
A[Speech chunk] --> B[Google transcribe]
B --> C{voice command?}
C -->|no| D[Type at cursor]
C -->|yes| E[Run command handler]
D --> F[Short pause joins with space]
F --> G[Keep listening]
Pause > 2 seconds
flowchart TD
A[Silence exceeds 2 s] --> B[Type period and space]
B --> C[Shift+Enter new line in Cursor]
C --> D[Next phrase types without leading space]
| Trigger | Action | Typed |
|---|---|---|
| Pause > 2 s (once per pause) | Sentence break | . + Shift+Enter |
| Pause ≤ 2 s | Same thought | space + next words |
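The pause rules in the table can be modeled as a small state tracker that returns abstract actions instead of pressing keys. `PauseTracker` and the action tuples are hypothetical; the sketch also assumes no sentence break is typed before the first phrase of a session.

```python
class PauseTracker:
    """Decides what to type based on silence length and incoming phrases."""

    def __init__(self, threshold_s=2.0):
        self.threshold_s = threshold_s
        # Start as if a break was already emitted, so leading silence types
        # nothing (an assumption of this sketch).
        self._break_emitted = True
        self._needs_space = False

    def on_silence(self, silence_s):
        """Called while the user is silent; emits the break once per pause."""
        if silence_s > self.threshold_s and not self._break_emitted:
            self._break_emitted = True
            self._needs_space = False  # next phrase starts a fresh line
            return [("type", ". "), ("hotkey", "shift+enter")]
        return []

    def on_phrase(self, phrase):
        """Called when a phrase is recognized; joins short pauses with a space."""
        prefix = " " if self._needs_space else ""
        self._break_emitted = False
        self._needs_space = True
        return [("type", prefix + phrase)]
```

Returning actions rather than calling a keyboard library directly keeps the timing logic testable without a focused Cursor window.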
send — submit message
sequenceDiagram
actor User
participant App as VoiceToText
participant Cursor as Cursor Agent
User->>App: say "send" after pause
App->>Cursor: Enter
Note over App: wait 0.5 s
App->>Cursor: paste @voice-input-confirmation.mds
Note over Cursor: Agent asks for confirm.
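The send sequence above (Enter, a 0.5 s wait, then the rule mention) can be sketched with the keyboard backend injected as callables. `send_message`, `press_key`, and `paste_text` are hypothetical names; in a real session they might wrap something like `pyautogui.press` and a clipboard paste.

```python
import time

RULE_MENTION = "@voice-input-confirmation.mds"


def send_message(press_key, paste_text, sleep=time.sleep, delay_s=0.5):
    """Submit the current message, then mention the rule in the fresh input."""
    press_key("enter")        # submit what has been dictated so far
    sleep(delay_s)            # give Cursor time to clear the input box
    paste_text(RULE_MENTION)  # start the next message with the rule mention
```

Passing the backend in as parameters lets the ordering be checked with simple recording functions, without sending real keystrokes.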
freeze — pause dictation
flowchart TD
A[Say freeze after pause] --> B[Dictation halted]
B --> C[Mic still listens]
C --> D[Only resume accepted]
B --> E[No text typed at cursor]
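The freeze behavior amounts to a two-state gate: while frozen, everything except resume is ignored. A minimal sketch with a hypothetical `DictationGate` class; treating a stray "resume" outside a freeze as ordinary text is an assumption of this sketch.

```python
class DictationGate:
    """Tracks whether dictation is frozen and classifies each utterance."""

    def __init__(self):
        self.frozen = False

    def handle(self, utterance):
        """Return 'type', 'send', 'freeze', 'resume', or 'ignored'."""
        word = utterance.strip().lower()
        if self.frozen:
            # The mic still listens, but only resume gets through.
            if word == "resume":
                self.frozen = False
                return "resume"
            return "ignored"
        if word == "freeze":
            self.frozen = True
            return "freeze"
        if word == "send":
            return "send"
        return "type"
```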
resume — continue after freeze
sequenceDiagram
actor User
participant App as VoiceToText
participant Cursor as Cursor Agent
User->>App: say "resume" after pause
App->>User: click typing box
User->>Cursor: click
App->>Cursor: paste @voice-input-confirmation.mds
App->>App: enable dictation
User->>Cursor: speak again
Stop session — GUI Stop button
flowchart TD
A[Click Stop in pseudo-jarvis] --> B[VoiceToText.stop sets flag]
B --> C[Mic listener stops]
C --> D[Queue drained]
D --> E[No further typing or paste]
E --> F[Session log: stopped message]
F --> G[Start re-enabled, mic picker unlocked]
| Command / control | When | Result |
|---|---|---|
| send | After pause | Enter → wait 0.5 s → rule mention → agent confirmation |
| freeze | After pause | Stop typing; mic listens for resume |
| resume | After freeze | Click → rule mention → dictate again |
| Stop button | Any time during session | End session; zero keystrokes after stop |