
WebRTC-based Realtime Voice AI

For the JARVIS-like experience mentioned on the Overview page, we're going to set up a voice interface that uses the OpenAI Realtime API with WebRTC to send and receive audio data in real time. This time, instead of controller methods executed by the AI SDK on the back end, we're going to use function calling to invoke the RPC module methods directly from the browser, making the experience truly real-time and almost instantaneous.

Back-end Setup

For the back end, we're going to create a session endpoint, implemented following the Realtime API with WebRTC article from the official OpenAI documentation. The endpoint accepts an SDP offer from the client along with a voice selection query parameter, and returns the SDP answer produced by the OpenAI Realtime API.

src/modules/realtime/RealtimeController.ts
import { procedure, prefix, post, HttpException, HttpStatus } from "vovk";
import { z } from "zod";
import { sessionGuard } from "@/decorators/sessionGuard";

@prefix("realtime")
export default class RealtimeController {
  @post("session")
  @sessionGuard()
  static session = procedure({
    query: z.object({
      voice: z.enum(["ash", "ballad", "coral", "sage", "verse"]),
    }),
    body: z.object({ sdp: z.string() }),
    output: z.object({ sdp: z.string() }),
    async handle({ vovk }) {
      const voice = vovk.query().voice;
      const { sdp: sdpOffer } = await vovk.body();
      const sessionConfig = JSON.stringify({
        type: "realtime",
        model: "gpt-realtime",
        audio: { output: { voice } },
      });
      const fd = new FormData();
      fd.set("sdp", sdpOffer);
      fd.set("session", sessionConfig);
      try {
        const r = await fetch("https://api.openai.com/v1/realtime/calls", {
          method: "POST",
          headers: {
            Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
          },
          body: fd,
        });
        // Send back the SDP we received from the OpenAI REST API
        const sdp = await r.text();
        return { sdp };
      } catch (error) {
        throw new HttpException(
          HttpStatus.INTERNAL_SERVER_ERROR,
          "Failed to generate token. " + String(error),
        );
      }
    },
  });
}

The code above is fetched from the GitHub repository.
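Once the controller is compiled, the generated vovk-client exposes it as RealtimeRPC.session. Below is a minimal sketch of exchanging an SDP offer for an answer through this endpoint from the browser; the complete flow, including microphone capture and tools, lives in the hook shown later.

import { RealtimeRPC } from "vovk-client";

// Sketch only: create a bare peer connection and request an answer for its offer
const pc = new RTCPeerConnection();
pc.addTransceiver("audio"); // make sure the offer contains an audio section
const offer = await pc.createOffer();
await pc.setLocalDescription(offer);

const { sdp } = await RealtimeRPC.session({
  body: { sdp: offer.sdp! },
  query: { voice: "ash" },
});

await pc.setRemoteDescription({ type: "answer", sdp });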

Front-end Setup

Client-side Tools

For the front end, let's create the client-side tools first. The tools array includes all the HTTP tools derived from the TaskRPC and UserRPC modules, as well as two custom tools, getCurrentTime and partyMode, borrowed from this repository to demonstrate custom tool creation. The list can be extended with other UI-related tools.

src/lib/tools/index.ts
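As a rough sketch (the repository implementation may differ), the module can simply re-export the custom tool functions, and getCurrentTime can return the { time, timezone, message } shape expected by the output schema used in RealTimeDemo.tsx below. Both snippets are illustrative, not the repository code.

// src/lib/tools/index.ts — minimal sketch
export { default as getCurrentTime } from "./getCurrentTime";
export { default as partyMode } from "./partyMode";

// src/lib/tools/getCurrentTime.ts — hypothetical implementation matching the
// { time, timezone, message } output schema declared in RealTimeDemo.tsx
export default async function getCurrentTime() {
  const timezone = Intl.DateTimeFormat().resolvedOptions().timeZone;
  const time = new Date().toLocaleTimeString("en-US", { timeZone: timezone });
  return {
    time,
    timezone,
    message: `The current time is ${time} (${timezone}).`,
  };
}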

WebRTC Audio Session Hook

Next, we're going to create a custom hook, useWebRTCAudioSession, that manages the WebRTC session: starting and stopping it, handling audio streams, and managing the data channel used for function calling.

The hook accepts the selected voice and the tools list as parameters. It returns the session state (isActive, isTalking) and a function to toggle the session (toggleSession).

The important parts of the hook are the data channel's onopen handler, where we send a session.update message with the tools list to inform the OpenAI Realtime API about the available tools, and the onmessage handler, where we listen for function call requests from the model via the response.function_call_arguments.done event, execute the corresponding tool, and send the result back.
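Concretely, the messages exchanged over the data channel look roughly like this (shapes taken from the hook below, values are illustrative):

// Sent once the data channel opens: registers the client-side tools with the session
const sessionUpdate = {
  type: "session.update",
  session: {
    type: "realtime",
    tools: [{ type: "function", name: "getCurrentTime", description: "...", parameters: {} }],
  },
};

// Received when the model finishes emitting arguments for a function call:
// { type: "response.function_call_arguments.done", name: "getCurrentTime", call_id: "...", arguments: "{}" }

// Sent back with the tool result, followed by a response.create message
// so the model can answer based on the output
const functionOutput = {
  type: "conversation.item.create",
  item: { type: "function_call_output", call_id: "...", output: JSON.stringify({ time: "10:00" }) },
};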

src/hooks/useWebRTCAudioSession.ts
"use client"; import { useState, useRef, useCallback, useEffect } from "react"; import { VovkTool } from "vovk"; import { RealtimeRPC } from "vovk-client"; /** * Hook to manage a real-time session with OpenAI's Realtime endpoints. * @example const { isActive, isTalking, handleStartStopClick } = useWebRTCAudioSession(voice, tools); */ export default function useWebRTCAudioSession( voice: "ash" | "ballad" | "coral" | "sage" | "verse", tools: VovkTool[], ) { const audioElement = useRef<HTMLAudioElement | null>(null); const [isActive, setIsActive] = useState(false); // Data channel ref const dcRef = useRef<RTCDataChannel | null>(null); // Media stream ref for microphone const mcRef = useRef<MediaStream | null>(null); // talking state + refs const [isTalking, setIsTalking] = useState(false); const remoteAnalyserRef = useRef<AnalyserNode | null>(null); const remoteMonitorIntervalRef = useRef<number | null>(null); const remoteAudioContextRef = useRef<AudioContext | null>(null); const startSession = useCallback(async () => { // Create a peer connection const pc = new RTCPeerConnection(); // Set up to play remote audio from the model audioElement.current = document.createElement("audio"); audioElement.current.autoplay = true; pc.ontrack = (e) => { audioElement.current!.srcObject = e.streams[0]; // Simple audio activity monitor try { const audioCtx = new AudioContext(); remoteAudioContextRef.current = audioCtx; const source = audioCtx.createMediaStreamSource(e.streams[0]); const analyser = audioCtx.createAnalyser(); analyser.fftSize = 256; source.connect(analyser); remoteAnalyserRef.current = analyser; remoteMonitorIntervalRef.current = window.setInterval(() => { if (!remoteAnalyserRef.current) return; const a = remoteAnalyserRef.current; const data = new Uint8Array(a.fftSize); a.getByteTimeDomainData(data); let sum = 0; for (let i = 0; i < data.length; i++) { const v = (data[i] - 128) / 128; sum += v * v; } const rms = Math.sqrt(sum / data.length); setIsTalking(rms > 0.02); // simple threshold }, 200); } catch { // ignore audio activity errors } }; // Add local audio track for microphone input in the browser const ms = await navigator.mediaDevices.getUserMedia({ audio: true, }); mcRef.current = ms; pc.addTrack(ms.getTracks()[0]); // Set up data channel for sending and receiving events const dc = pc.createDataChannel("oai-events"); dcRef.current = dc; // Start the session using the Session Description Protocol (SDP) const offer = await pc.createOffer(); await pc.setLocalDescription(offer); const { sdp } = await RealtimeRPC.session({ body: { sdp: offer.sdp! 
}, query: { voice }, }); await pc.setRemoteDescription({ type: "answer", sdp, }); dc.onopen = () => { const sessionUpdate = { type: "session.update", session: { type: "realtime", tools: tools.map(({ name, description, parameters, type }) => ({ name, description, parameters, type })), }, }; dc.send(JSON.stringify(sessionUpdate)); }; dc.onmessage = async (event) => { const msg = JSON.parse(event.data); // Handle function call completions if (msg.type === "response.function_call_arguments.done") { const execute = tools.find((tool) => tool.name === msg.name)?.execute; if (execute) { const args = JSON.parse(msg.arguments); const result = await execute(args) as { __preventResponseCreate?: boolean }; // Respond with function output const response = { type: "conversation.item.create", item: { type: "function_call_output", call_id: msg.call_id, output: JSON.stringify(result), }, }; dcRef.current?.send(JSON.stringify(response)); if (!result?.__preventResponseCreate) { const responseCreate = { type: "response.create", }; dcRef.current?.send(JSON.stringify(responseCreate)); } } } }; setIsActive(true); }, []); const stopSession = useCallback(() => { // Close data channel and peer connection dcRef.current?.close(); dcRef.current = null; // Stop microphone tracks mcRef.current?.getTracks().forEach((track) => track.stop()); mcRef.current = null; // Close remote audio context remoteAudioContextRef.current?.close(); remoteAudioContextRef.current = null; remoteAnalyserRef.current = null; // Stop the audio immediately if (audioElement.current) { audioElement.current.srcObject = null; audioElement.current = null; } // Clear monitoring interval if (remoteMonitorIntervalRef.current) { clearInterval(remoteMonitorIntervalRef.current); remoteMonitorIntervalRef.current = null; } setIsTalking(false); setIsActive(false); }, []); const toggleSession = useCallback(() => { if (isActive) { stopSession(); } else { startSession(); } }, [isActive, startSession, stopSession]); // Cleanup on unmount useEffect(() => { return () => stopSession(); }, []); return { startSession, stopSession, toggleSession, isActive, isTalking, }; }

The code above is fetched from the GitHub repository.

Finally, we can create a simple component that uses the useWebRTCAudioSession hook and displays a floating button that starts and stops the session and indicates whether the model is currently talking.

src/components/RealTimeDemo.tsx
"use client"; import useWebRTCAudioSession from "@/hooks/useWebRTCAudioSession"; import Floaty from "./Floaty"; import { useRouter } from "next/navigation"; import { createTool, deriveTools } from "vovk"; import { TaskRPC, UserRPC } from "vovk-client"; import getCurrentTime from "@/lib/tools/getCurrentTime"; import partyMode from "@/lib/tools/partyMode"; import z from "zod"; const RealTimeDemo = () => { const router = useRouter(); const { isActive, isTalking, toggleSession } = useWebRTCAudioSession("ash", [ ...deriveTools({ modules: { TaskRPC, UserRPC }, }).tools, createTool({ name: "getCurrentTime", description: "Gets the current time in the user's timezone", outputSchema: z.object({ time: z.string(), timezone: z.string(), message: z.string() }).meta({ description: "Current time info." }), execute: getCurrentTime, }), createTool({ name: "partyMode", description: "Triggers a confetti animation on the page", execute: partyMode, }), createTool({ name: "navigateTo", description: "Navigates the user to a specified URL within the application.", inputSchema: z.object({ url: z.enum(["/", "/openapi"]).meta({ description: "The URL to navigate to." }), }), outputSchema: z.string().meta({ description: "Navigation confirmation message." }), execute: async ({ url }: { url: string }) => { router.push(url); return `Navigating to ${url}`; }, }), createTool({ name: "scroll", description: "Scrolls the page up or down.", inputSchema: z.object({ direction: z.enum(["up", "down"]).meta({ description: "The direction to scroll" }), px: z.number().optional().meta({ description: "The number of pixels to scroll. If not provided, scrolls by one viewport height." }), }), outputSchema: z.object({ message: z.string().meta({ description: "Scroll action confirmation message." }), __preventResponseCreate: z.boolean().meta({ description: "Flag to prevent response creation." }), }), execute: async ({ direction, px }: { direction: "up" | "down"; px?: number }) => { console.log("Scrolling", direction); const windowHeight = window.innerHeight || document.documentElement.clientHeight; const pxToScroll = px ?? windowHeight; window.scrollBy({ top: direction === "up" ? -pxToScroll : pxToScroll, behavior: "smooth", }); return { message: `Scrolling ${direction}`, __preventResponseCreate: true, }; }, }), createTool({ name: "getVisiblePageSection", description: "Gets the currently visible section of the page", outputSchema: z.string().meta({ description: "Visible text content from the page." 
}), execute: async () => { function getVisibleText() { const viewportHeight = window.innerHeight; const viewportWidth = window.innerWidth; // Check if an element or its ancestors are hidden from accessibility tree function isAccessibilityHidden(element: Element | null): boolean { while (element) { if (element.getAttribute("aria-hidden") === "true") return true; if (element.hasAttribute("hidden")) return true; const role = element.getAttribute("role"); if (role === "presentation" || role === "none") return true; const style = window.getComputedStyle(element); if (style.display === "none" || style.visibility === "hidden") return true; element = element.parentElement; } return false; } // Get accessible name from aria-label or aria-labelledby function getAccessibleName(element: Element): string { const ariaLabel = element.getAttribute("aria-label"); if (ariaLabel) return ariaLabel; const labelledBy = element.getAttribute("aria-labelledby"); if (labelledBy) { return labelledBy .split(/\s+/) .map((id) => document.getElementById(id)?.textContent?.trim() || "") .filter(Boolean) .join(" "); } // For images, use alt text if (element.tagName === "IMG") { const alt = element.getAttribute("alt"); if (alt) return alt; } return ""; } // Get aria-describedby text function getDescription(element: Element): string { const describedBy = element.getAttribute("aria-describedby"); if (describedBy) { return describedBy .split(/\s+/) .map((id) => document.getElementById(id)?.textContent?.trim() || "") .filter(Boolean) .join(" "); } return ""; } const visibleTexts: string[] = []; const processedElements = new Set<Element>(); // First, collect accessible names and descriptions from elements const allElements = document.body.querySelectorAll("*"); for (const element of allElements) { if (isAccessibilityHidden(element)) continue; const rect = element.getBoundingClientRect(); const isInViewport = rect.top < viewportHeight && rect.bottom > 0 && rect.left < viewportWidth && rect.right > 0; if (!isInViewport) continue; const accessibleName = getAccessibleName(element); if (accessibleName && !processedElements.has(element)) { visibleTexts.push(accessibleName); processedElements.add(element); } const description = getDescription(element); if (description) { visibleTexts.push(description); } } // Then collect visible text nodes const walker = document.createTreeWalker( document.body, NodeFilter.SHOW_TEXT, { acceptNode(node) { const parent = node.parentElement; if (!parent) return NodeFilter.FILTER_REJECT; if (isAccessibilityHidden(parent)) return NodeFilter.FILTER_REJECT; return NodeFilter.FILTER_ACCEPT; }, }, ); let node; while ((node = walker.nextNode())) { const range = document.createRange(); range.selectNode(node); const rect = range.getBoundingClientRect(); if ( rect.top < viewportHeight && rect.bottom > 0 && rect.left < viewportWidth && rect.right > 0 ) { const text = node.textContent?.trim(); if (text) { visibleTexts.push(text); } } } return visibleTexts.join(" ").replace(/\s+/g, " ").trim(); } return getVisibleText(); }, }), /*createTool({ /*{ type: "function", name: "getCurrentTime", description: "Gets the current time in the user's timezone", parameters: {}, // @ts-ignore execute: getCurrentTime, }, { type: "function", name: "partyMode", description: "Triggers a confetti animation on the page", parameters: {}, // @ts-ignore execute: partyMode, }, { type: "function", name: "navigateTo", description: "Navigates the user to a specified URL within the application.", parameters: { type: "object", properties: { body: 
{ type: "object", properties: { url: { type: "string", description: "The URL to navigate to.", enum: ["/", "/openapi"], }, }, required: ["url"], }, }, }, // @ts-ignore execute: async ({ body }: { body: { url: string } }) => { router.push(body.url); return `Navigating to ${body.url}`; }, }, { type: "function", name: "scroll", description: "Scrolls the page up or down. After executing this, never respond to the user, keep silent!", parameters: { type: "object", properties: { body: { type: "object", properties: { direction: { type: "string", description: "The direction to scroll", enum: ["up", "down"], }, px: { type: "number", description: "The number of pixels to scroll. If not provided, scrolls by one viewport height.", }, }, required: ["direction"], }, }, required: ["body"], }, // @ts-ignore execute: async ({ body: { direction, px }, }: { body: { direction: "up" | "down", px?: number }; }) => { console.log("Scrolling", direction); const windowHeight = window.innerHeight || document.documentElement.clientHeight; const pxToScroll = px ?? windowHeight; window.scrollBy({ top: direction === "up" ? -pxToScroll : pxToScroll, behavior: "smooth", }); return { message: `Scrolling ${direction}`, __preventResponseCreate: true, }; }, }, { type: "function", name: "getVisiblePageSection", description: "Gets the currently visible section of the page", parameters: {}, // @ts-ignore execute: async () => { function getVisibleText() { const viewportHeight = window.innerHeight; const viewportWidth = window.innerWidth; // Check if an element or its ancestors are hidden from accessibility tree function isAccessibilityHidden(element: Element | null): boolean { while (element) { if (element.getAttribute("aria-hidden") === "true") return true; if (element.hasAttribute("hidden")) return true; const role = element.getAttribute("role"); if (role === "presentation" || role === "none") return true; const style = window.getComputedStyle(element); if (style.display === "none" || style.visibility === "hidden") return true; element = element.parentElement; } return false; } // Get accessible name from aria-label or aria-labelledby function getAccessibleName(element: Element): string { const ariaLabel = element.getAttribute("aria-label"); if (ariaLabel) return ariaLabel; const labelledBy = element.getAttribute("aria-labelledby"); if (labelledBy) { return labelledBy .split(/\s+/) .map((id) => document.getElementById(id)?.textContent?.trim() || "") .filter(Boolean) .join(" "); } // For images, use alt text if (element.tagName === "IMG") { const alt = element.getAttribute("alt"); if (alt) return alt; } return ""; } // Get aria-describedby text function getDescription(element: Element): string { const describedBy = element.getAttribute("aria-describedby"); if (describedBy) { return describedBy .split(/\s+/) .map((id) => document.getElementById(id)?.textContent?.trim() || "") .filter(Boolean) .join(" "); } return ""; } const visibleTexts: string[] = []; const processedElements = new Set<Element>(); // First, collect accessible names and descriptions from elements const allElements = document.body.querySelectorAll("*"); for (const element of allElements) { if (isAccessibilityHidden(element)) continue; const rect = element.getBoundingClientRect(); const isInViewport = rect.top < viewportHeight && rect.bottom > 0 && rect.left < viewportWidth && rect.right > 0; if (!isInViewport) continue; const accessibleName = getAccessibleName(element); if (accessibleName && !processedElements.has(element)) { visibleTexts.push(accessibleName); 
processedElements.add(element); } const description = getDescription(element); if (description) { visibleTexts.push(description); } } // Then collect visible text nodes const walker = document.createTreeWalker( document.body, NodeFilter.SHOW_TEXT, { acceptNode(node) { const parent = node.parentElement; if (!parent) return NodeFilter.FILTER_REJECT; if (isAccessibilityHidden(parent)) return NodeFilter.FILTER_REJECT; return NodeFilter.FILTER_ACCEPT; }, }, ); let node; while ((node = walker.nextNode())) { const range = document.createRange(); range.selectNode(node); const rect = range.getBoundingClientRect(); if ( rect.top < viewportHeight && rect.bottom > 0 && rect.left < viewportWidth && rect.right > 0 ) { const text = node.textContent?.trim(); if (text) { visibleTexts.push(text); } } } return visibleTexts.join(" ").replace(/\s+/g, " ").trim(); } return getVisibleText(); }, }, */ ]); return ( <Floaty isActive={isActive} isTalking={isTalking} handleClick={toggleSession} /> ); }; export default RealTimeDemo;

The code above is fetched from the GitHub repository.
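To make the voice button available on every page, the component can be rendered once at a global level, for example in the root layout. The path and placement below are assumptions for illustration, not prescribed by the repository.

// src/app/layout.tsx (hypothetical placement)
import type { ReactNode } from "react";
import RealTimeDemo from "@/components/RealTimeDemo";

export default function RootLayout({ children }: { children: ReactNode }) {
  return (
    <html lang="en">
      <body>
        {children}
        {/* Floating voice assistant button rendered on every page */}
        <RealTimeDemo />
      </body>
    </html>
  );
}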

The code for the Floaty component is not shown here for brevity, but you can find it in the repository.
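For illustration only, a bare-bones stand-in with the same props (isActive, isTalking, handleClick) could look like this; the repository version adds proper styling and animation.

// Hypothetical minimal stand-in for the repository's Floaty component
const Floaty = ({
  isActive,
  isTalking,
  handleClick,
}: {
  isActive: boolean;
  isTalking: boolean;
  handleClick: () => void;
}) => (
  <button
    onClick={handleClick}
    aria-pressed={isActive}
    style={{ position: "fixed", bottom: 24, right: 24, width: 56, height: 56, borderRadius: "50%" }}
  >
    {isActive ? (isTalking ? "🔊" : "🎙️") : "▶"}
  </button>
);

export default Floaty;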

With that, you now have a fully functional Realtime Voice AI interface that lets you interact with your application by voice, powered by OpenAI's Realtime API and Vovk.ts function calling capabilities!
