WebRTC-based Realtime Voice AI
For a JARVIS-like experience, we’re going to set up a voice AI interface that uses the OpenAI Realtime API with WebRTC to send and receive audio data in real time. This time, instead of having the AI SDK execute controller methods on the backend, we’re going to derive tools from the generated RPC modules to make authorized requests directly in the browser.
Because we’re using WebRTC, audio data is exchanged directly between the browser and the OpenAI Realtime API without passing through our backend server, and tool calls run right in the browser, which keeps the whole interaction low-latency.
Backend Setup
On the backend, we’re going to create a session endpoint that follows the official OpenAI guide, Realtime API with WebRTC. The endpoint accepts the SDP offer from the client along with a voice-selection query parameter and returns the SDP answer from the OpenAI Realtime API.
import { procedure, prefix, post, HttpException, HttpStatus } from "vovk";
import { z } from "zod";
import { sessionGuard } from "@/decorators/sessionGuard";

@prefix("realtime")
export default class RealtimeController {
  @post("session")
  @sessionGuard()
  static session = procedure({
    query: z.object({
      voice: z.enum(["ash", "ballad", "coral", "sage", "verse"]),
    }),
    body: z.object({ sdp: z.string() }),
    output: z.object({ sdp: z.string() }),
    async handle({ vovk }) {
      const voice = vovk.query().voice;
      const { sdp: sdpOffer } = await vovk.body();
      const sessionConfig = JSON.stringify({
        type: "realtime",
        model: "gpt-realtime",
        audio: { output: { voice } },
      });
      const fd = new FormData();
      fd.set("sdp", sdpOffer);
      fd.set("session", sessionConfig);
      try {
        const r = await fetch("https://api.openai.com/v1/realtime/calls", {
          method: "POST",
          headers: {
            Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
          },
          body: fd,
        });
        // Send back the SDP we received from the OpenAI REST API
        const sdp = await r.text();
        return { sdp };
      } catch (error) {
        throw new HttpException(
          HttpStatus.INTERNAL_SERVER_ERROR,
          "Failed to generate token. " + String(error),
        );
      }
    },
  });
}

The code above is fetched from the GitHub repository.
Frontend Setup
WebRTC Audio Session Hook
Next, we’re going to create a custom hook useWebRTCAudioSession that manages the WebRTC session: starting and stopping it, handling audio streams, and managing the data channel for function calling.
The hook accepts the selected voice and the tools list as parameters. It returns the session state (isActive, isTalking) along with functions to start, stop, and toggle the session (startSession, stopSession, toggleSession).
The crucial parts of the hook are:
- the onopen event handler of the data channel, where we send a session.update message with the tools list to inform the OpenAI Realtime API about the available tools, and
- the onmessage event handler, where we listen for function call requests from the model via the response.function_call_arguments.done event, execute the corresponding tool, and send back the results.
The onmessage handler also takes care of sending a response.create message so that the Realtime API responds after the tool call, unless the result returned from the tool’s execute function contains a __preventResponseCreate flag set to true.
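To make the exchange concrete, here is roughly what these JSON events look like in sequence. The shapes mirror the hook below; the call_id and argument values are purely illustrative.

// 1. Sent by us when the data channel opens: register the available tools
const sessionUpdate = {
  type: "session.update",
  session: { type: "realtime", tools: [/* { name, description, parameters, type } */] },
};

// 2. Received once the model has finished streaming a function call's arguments
const functionCallDone = {
  type: "response.function_call_arguments.done",
  name: "scroll",
  call_id: "call_abc123", // illustrative value
  arguments: '{"direction":"down"}',
};

// 3. Sent by us with the serialized result of the executed tool
const functionCallOutput = {
  type: "conversation.item.create",
  item: {
    type: "function_call_output",
    call_id: "call_abc123",
    output: '{"message":"Scrolled down by 900px","__preventResponseCreate":true}',
  },
};

// 4. Sent by us to ask the model to respond — skipped when the tool result
// sets __preventResponseCreate to true
const responseCreate = { type: "response.create" };

Here is the full hook: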
"use client";
import { useState, useRef, useCallback, useEffect } from "react";
import { VovkTool } from "vovk";
import { RealtimeRPC } from "vovk-client";
/**
* Hook to manage a real-time session with OpenAI's Realtime endpoints.
* @example const { isActive, isTalking, handleStartStopClick } = useWebRTCAudioSession(voice, tools);
*/
export default function useWebRTCAudioSession(
voice: "ash" | "ballad" | "coral" | "sage" | "verse",
tools: VovkTool[],
) {
const audioElement = useRef<HTMLAudioElement | null>(null);
const [isActive, setIsActive] = useState(false);
// Data channel ref
const dcRef = useRef<RTCDataChannel | null>(null);
// Media stream ref for microphone
const mcRef = useRef<MediaStream | null>(null);
// talking state + refs
const [isTalking, setIsTalking] = useState(false);
const remoteAnalyserRef = useRef<AnalyserNode | null>(null);
const remoteMonitorIntervalRef = useRef<number | null>(null);
const remoteAudioContextRef = useRef<AudioContext | null>(null);
const startSession = useCallback(async () => {
// Create a peer connection
const pc = new RTCPeerConnection();
// Set up to play remote audio from the model
audioElement.current = document.createElement("audio");
audioElement.current.autoplay = true;
pc.ontrack = (e) => {
audioElement.current!.srcObject = e.streams[0];
// Simple audio activity monitor
try {
const audioCtx = new AudioContext();
remoteAudioContextRef.current = audioCtx;
const source = audioCtx.createMediaStreamSource(e.streams[0]);
const analyser = audioCtx.createAnalyser();
analyser.fftSize = 256;
source.connect(analyser);
remoteAnalyserRef.current = analyser;
remoteMonitorIntervalRef.current = window.setInterval(() => {
if (!remoteAnalyserRef.current) return;
const a = remoteAnalyserRef.current;
const data = new Uint8Array(a.fftSize);
a.getByteTimeDomainData(data);
let sum = 0;
for (let i = 0; i < data.length; i++) {
const v = (data[i] - 128) / 128;
sum += v * v;
}
const rms = Math.sqrt(sum / data.length);
setIsTalking(rms > 0.02); // simple threshold
}, 200);
} catch {
// ignore audio activity errors
}
};
// Add local audio track for microphone input in the browser
const ms = await navigator.mediaDevices.getUserMedia({
audio: true,
});
mcRef.current = ms;
pc.addTrack(ms.getTracks()[0]);
// Set up data channel for sending and receiving events
const dc = pc.createDataChannel("oai-events");
dcRef.current = dc;
// Start the session using the Session Description Protocol (SDP)
const offer = await pc.createOffer();
await pc.setLocalDescription(offer);
const { sdp } = await RealtimeRPC.session({
body: { sdp: offer.sdp! },
query: { voice },
});
await pc.setRemoteDescription({
type: "answer",
sdp,
});
dc.onopen = () => {
const sessionUpdate = {
type: "session.update",
session: {
type: "realtime",
tools: tools.map(({ name, description, parameters, type }) => ({
name,
description,
parameters,
type,
})),
},
};
dc.send(JSON.stringify(sessionUpdate));
};
dc.onmessage = async (event) => {
const msg = JSON.parse(event.data);
// Handle function call completions
if (msg.type === "response.function_call_arguments.done") {
const execute = tools.find((tool) => tool.name === msg.name)?.execute;
if (execute) {
const args = JSON.parse(msg.arguments);
const result = (await execute(args)) as {
__preventResponseCreate?: boolean;
};
// Respond with function output
const response = {
type: "conversation.item.create",
item: {
type: "function_call_output",
call_id: msg.call_id,
output: JSON.stringify(result),
},
};
dcRef.current?.send(JSON.stringify(response));
if (!result?.__preventResponseCreate) {
const responseCreate = {
type: "response.create",
};
dcRef.current?.send(JSON.stringify(responseCreate));
}
}
}
};
setIsActive(true);
}, []);
const stopSession = useCallback(() => {
// Close data channel and peer connection
dcRef.current?.close();
dcRef.current = null;
// Stop microphone tracks
mcRef.current?.getTracks().forEach((track) => track.stop());
mcRef.current = null;
// Close remote audio context
remoteAudioContextRef.current?.close();
remoteAudioContextRef.current = null;
remoteAnalyserRef.current = null;
// Stop the audio immediately
if (audioElement.current) {
audioElement.current.srcObject = null;
audioElement.current = null;
}
// Clear monitoring interval
if (remoteMonitorIntervalRef.current) {
clearInterval(remoteMonitorIntervalRef.current);
remoteMonitorIntervalRef.current = null;
}
setIsTalking(false);
setIsActive(false);
}, []);
const toggleSession = useCallback(() => {
if (isActive) {
stopSession();
} else {
startSession();
}
}, [isActive, startSession, stopSession]);
// Cleanup on unmount
useEffect(() => {
return () => stopSession();
}, []);
return {
startSession,
stopSession,
toggleSession,
isActive,
isTalking,
};
}

The code above is fetched from the GitHub repository.
Client-side Tools
The useWebRTCAudioSession hook accepts a tools list that combines tools derived via deriveTools({ modules: { UserRPC, TaskRPC } }) with custom client-side tools created with createTool for navigation, scrolling, and other UI interactions. The getCurrentTime and partyMode tools were borrowed from this repository, which served as inspiration for this demo.
Since the component is mounted in layout.tsx, its state and the WebRTC connection persist across page navigations within this route. This allows you to navigate the app via voice commands using the navigateTo tool, which in turn uses the Next.js useRouter hook.
The client-side tool execution functions are located in the lib/tools folder for better organization.
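For illustration, here is a minimal sketch of what a tool execution function such as lib/tools/scroll.ts could look like; this is an assumption for the article, and the actual implementation in the repository may differ. It returns __preventResponseCreate: true so the hook skips the response.create message and the model doesn’t narrate every scroll.

// lib/tools/scroll.ts — illustrative sketch, not the exact repository code
export async function scroll({
  direction,
  px,
}: {
  direction: "up" | "down";
  px?: number;
}) {
  // Scroll by the requested amount, or by one viewport height by default
  const amount = px ?? window.innerHeight;
  window.scrollBy({
    top: direction === "down" ? amount : -amount,
    behavior: "smooth",
  });
  return {
    message: `Scrolled ${direction} by ${amount}px`,
    // Ask the hook not to send response.create after this tool call
    __preventResponseCreate: true,
  };
}

Here is the component that wires everything together: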
"use client";
import { useRouter } from "next/navigation";
import { TaskRPC, UserRPC } from "vovk-client";
import { createTool, deriveTools } from "vovk";
import z from "zod";
import useWebRTCAudioSession from "@/hooks/useWebRTCAudioSession";
import { getCurrentTime } from "@/lib/tools/getCurrentTime";
import { partyMode } from "@/lib/tools/partyMode";
import { scroll } from "@/lib/tools/scroll";
import { getVisiblePageSection } from "@/lib/tools/getVisiblePageSection";
import Floaty from "./Floaty";
const RealTimeDemo = () => {
const router = useRouter();
const { isActive, isTalking, toggleSession } = useWebRTCAudioSession("ash", [
...deriveTools({
modules: { TaskRPC, UserRPC },
}).tools,
createTool({
name: "getCurrentTime",
description: "Gets the current time in the user's timezone",
outputSchema: z
.object({ time: z.string(), timezone: z.string(), message: z.string() })
.meta({ description: "Current time info." }),
execute: getCurrentTime,
}),
createTool({
name: "partyMode",
description: "Triggers a confetti animation on the page",
execute: partyMode,
}),
createTool({
name: "navigateTo",
description:
"Navigates the user to a specified URL within the application.",
inputSchema: z.object({
url: z
.enum(["/", "/openapi"])
.meta({ description: "The URL to navigate to." }),
}),
outputSchema: z
.string()
.meta({ description: "Navigation confirmation message." }),
execute: async ({ url }: { url: string }) => {
router.push(url);
return `Navigating to ${url}`;
},
}),
createTool({
name: "scroll",
description: "Scrolls the page up or down.",
inputSchema: z.object({
direction: z
.enum(["up", "down"])
.meta({ description: "The direction to scroll" }),
px: z.number().optional().meta({
description:
"The number of pixels to scroll. If not provided, scrolls by one viewport height.",
}),
}),
outputSchema: z.object({
message: z
.string()
.meta({ description: "Scroll action confirmation message." }),
__preventResponseCreate: z
.boolean()
.meta({ description: "Flag to prevent response creation." }),
}),
execute: scroll,
}),
createTool({
name: "getVisiblePageSection",
description: "Gets the currently visible section of the page",
outputSchema: z
.string()
.meta({ description: "Visible text content from the page." }),
execute: getVisiblePageSection,
}),
]);
return (
<Floaty
isActive={isActive}
isTalking={isTalking}
handleClick={toggleSession}
/>
);
};
export default RealTimeDemo;

The code above is fetched from the GitHub repository.
The code for the Floaty component is not shown here for brevity, but you can find it in the repository.
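As a rough sketch, RealTimeDemo might be mounted in layout.tsx like this so the WebRTC session survives page navigations; the file path, import path, and surrounding markup are assumptions, so check the repository for the actual layout.

// app/layout.tsx — illustrative sketch; the actual layout in the repository may differ
import type { ReactNode } from "react";
import RealTimeDemo from "@/components/RealTimeDemo"; // assumed import path

export default function RootLayout({ children }: { children: ReactNode }) {
  return (
    <html lang="en">
      <body>
        {children}
        {/* Mounted once at the layout level so the session persists across navigations */}
        <RealTimeDemo />
      </body>
    </html>
  );
}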
With that, you now have a fully functional Realtime Voice AI interface that lets users interact with your application through natural spoken language, powered by OpenAI’s Realtime API.