WebRTC-based Realtime Voice AI
For the JARVIS-like experience mentioned on the Overview page, we're going to set up a voice interface that uses the OpenAI Realtime API with WebRTC to send and receive audio data in real time. This time, instead of controller methods executed by the AI SDK on the back-end, we're going to use the model's function calling capabilities to call the RPC module methods directly from the browser, making the experience truly real-time and almost instantaneous.
Back-end Setup
For the back-end, we're going to create a session endpoint implemented with the help of the Realtime API with WebRTC article from the official OpenAI documentation. The endpoint accepts an SDP offer from the client in the request body along with a voice selection query parameter, and returns the SDP answer received from the OpenAI Realtime API.
import { procedure, prefix, post, HttpException, HttpStatus } from "vovk";
import { z } from "zod";
import { sessionGuard } from "@/decorators/sessionGuard";
@prefix("realtime")
export default class RealtimeController {
@post("session")
@sessionGuard()
static session = procedure({
query: z.object({
voice: z.enum(["ash", "ballad", "coral", "sage", "verse"]),
}),
body: z.object({ sdp: z.string() }),
output: z.object({ sdp: z.string() }),
async handle({ vovk }) {
const voice = vovk.query().voice;
const { sdp: sdpOffer } = await vovk.body();
const sessionConfig = JSON.stringify({
type: "realtime",
model: "gpt-realtime",
audio: { output: { voice } },
});
const fd = new FormData();
fd.set("sdp", sdpOffer);
fd.set("session", sessionConfig);
try {
const r = await fetch("https://api.openai.com/v1/realtime/calls", {
method: "POST",
headers: {
Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
},
body: fd,
});
// Send back the SDP we received from the OpenAI REST API
const sdp = await r.text();
return { sdp };
} catch (error) {
throw new HttpException(
HttpStatus.INTERNAL_SERVER_ERROR,
"Failed to generate token. " + String(error),
);
}
},
});
}
The code above is fetched from the GitHub repository.
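To preview how this endpoint is consumed from the browser, here is a minimal sketch of the SDP offer/answer exchange using the generated vovk-client; the RealtimeRPC.session call matches the controller above, and the same steps appear inside the hook later in this section:
import { RealtimeRPC } from "vovk-client";

// Minimal sketch: exchange a local SDP offer for the OpenAI SDP answer
// via the session endpoint defined in RealtimeController above.
async function exchangeSDP(
  pc: RTCPeerConnection,
  voice: "ash" | "ballad" | "coral" | "sage" | "verse",
) {
  const offer = await pc.createOffer();
  await pc.setLocalDescription(offer);
  const { sdp } = await RealtimeRPC.session({
    body: { sdp: offer.sdp! },
    query: { voice },
  });
  await pc.setRemoteDescription({ type: "answer", sdp });
}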
Front-end Setup
Client-side Tools
For the front-end, let's create the client-side tools first. The tools array is going to include all the HTTP tools derived from the TaskRPC and UserRPC modules, as well as two custom tools, getCurrentTime and partyMode, borrowed from this repository to demonstrate custom tool creation. The list can be extended with other UI-related tools.
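Here is a hedged sketch of what the two custom tool implementations could look like; the actual files in the repository may differ, and the use of the canvas-confetti package for partyMode is an assumption:
// lib/tools/getCurrentTime.ts — hypothetical sketch matching the output schema used below
export default async function getCurrentTime() {
  const timezone = Intl.DateTimeFormat().resolvedOptions().timeZone;
  const time = new Date().toLocaleTimeString(undefined, { timeZone: timezone });
  return { time, timezone, message: `The current time is ${time} (${timezone})` };
}

// lib/tools/partyMode.ts — hypothetical sketch; assumes the canvas-confetti package
import confetti from "canvas-confetti";

export default async function partyMode() {
  // Fire a short confetti burst to acknowledge the request visually
  confetti({ particleCount: 150, spread: 80, origin: { y: 0.6 } });
  return { message: "Party mode activated!" };
}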
WebRTC Audio Session Hook
Next, we're going to create a custom hook, useWebRTCAudioSession, that manages the WebRTC session: starting and stopping it, handling the audio streams, and managing the data channel used for function calling.
The hook accepts the selected voice and the tools list as parameters. It returns the session state (isActive, isTalking) and a function to toggle the session (toggleSession).
The important parts of the hook are the data channel's onopen and onmessage event handlers. In onopen we send a session.update message with the tools list to inform the OpenAI Realtime API about the available tools; in onmessage we listen for function call requests from the model via the response.function_call_arguments.done event, execute the corresponding tool, and send the result back.
"use client";
import { useState, useRef, useCallback, useEffect } from "react";
import { VovkTool } from "vovk";
import { RealtimeRPC } from "vovk-client";
/**
* Hook to manage a real-time session with OpenAI's Realtime endpoints.
* @example const { isActive, isTalking, handleStartStopClick } = useWebRTCAudioSession(voice, tools);
*/
export default function useWebRTCAudioSession(
voice: "ash" | "ballad" | "coral" | "sage" | "verse",
tools: VovkTool[],
) {
const audioElement = useRef<HTMLAudioElement | null>(null);
const [isActive, setIsActive] = useState(false);
// Data channel ref
const dcRef = useRef<RTCDataChannel | null>(null);
// Media stream ref for microphone
const mcRef = useRef<MediaStream | null>(null);
// talking state + refs
const [isTalking, setIsTalking] = useState(false);
const remoteAnalyserRef = useRef<AnalyserNode | null>(null);
const remoteMonitorIntervalRef = useRef<number | null>(null);
const remoteAudioContextRef = useRef<AudioContext | null>(null);
const startSession = useCallback(async () => {
// Create a peer connection
const pc = new RTCPeerConnection();
// Set up to play remote audio from the model
audioElement.current = document.createElement("audio");
audioElement.current.autoplay = true;
pc.ontrack = (e) => {
audioElement.current!.srcObject = e.streams[0];
// Simple audio activity monitor
try {
const audioCtx = new AudioContext();
remoteAudioContextRef.current = audioCtx;
const source = audioCtx.createMediaStreamSource(e.streams[0]);
const analyser = audioCtx.createAnalyser();
analyser.fftSize = 256;
source.connect(analyser);
remoteAnalyserRef.current = analyser;
remoteMonitorIntervalRef.current = window.setInterval(() => {
if (!remoteAnalyserRef.current) return;
const a = remoteAnalyserRef.current;
const data = new Uint8Array(a.fftSize);
a.getByteTimeDomainData(data);
let sum = 0;
for (let i = 0; i < data.length; i++) {
const v = (data[i] - 128) / 128;
sum += v * v;
}
const rms = Math.sqrt(sum / data.length);
setIsTalking(rms > 0.02); // simple threshold
}, 200);
} catch {
// ignore audio activity errors
}
};
// Add local audio track for microphone input in the browser
const ms = await navigator.mediaDevices.getUserMedia({
audio: true,
});
mcRef.current = ms;
pc.addTrack(ms.getTracks()[0]);
// Set up data channel for sending and receiving events
const dc = pc.createDataChannel("oai-events");
dcRef.current = dc;
// Start the session using the Session Description Protocol (SDP)
const offer = await pc.createOffer();
await pc.setLocalDescription(offer);
const { sdp } = await RealtimeRPC.session({
body: { sdp: offer.sdp! },
query: { voice },
});
await pc.setRemoteDescription({
type: "answer",
sdp,
});
dc.onopen = () => {
const sessionUpdate = {
type: "session.update",
session: {
type: "realtime",
tools: tools.map(({ name, description, parameters, type }) => ({ name, description, parameters, type })),
},
};
dc.send(JSON.stringify(sessionUpdate));
};
dc.onmessage = async (event) => {
const msg = JSON.parse(event.data);
// Handle function call completions
if (msg.type === "response.function_call_arguments.done") {
const execute = tools.find((tool) => tool.name === msg.name)?.execute;
if (execute) {
const args = JSON.parse(msg.arguments);
const result = await execute(args) as { __preventResponseCreate?: boolean };
// Respond with function output
const response = {
type: "conversation.item.create",
item: {
type: "function_call_output",
call_id: msg.call_id,
output: JSON.stringify(result),
},
};
dcRef.current?.send(JSON.stringify(response));
if (!result?.__preventResponseCreate) {
const responseCreate = {
type: "response.create",
};
dcRef.current?.send(JSON.stringify(responseCreate));
}
}
}
};
setIsActive(true);
}, []);
const stopSession = useCallback(() => {
// Close data channel and peer connection
dcRef.current?.close();
dcRef.current = null;
// Stop microphone tracks
mcRef.current?.getTracks().forEach((track) => track.stop());
mcRef.current = null;
// Close remote audio context
remoteAudioContextRef.current?.close();
remoteAudioContextRef.current = null;
remoteAnalyserRef.current = null;
// Stop the audio immediately
if (audioElement.current) {
audioElement.current.srcObject = null;
audioElement.current = null;
}
// Clear monitoring interval
if (remoteMonitorIntervalRef.current) {
clearInterval(remoteMonitorIntervalRef.current);
remoteMonitorIntervalRef.current = null;
}
setIsTalking(false);
setIsActive(false);
}, []);
const toggleSession = useCallback(() => {
if (isActive) {
stopSession();
} else {
startSession();
}
}, [isActive, startSession, stopSession]);
// Cleanup on unmount
useEffect(() => {
return () => stopSession();
}, []);
return {
startSession,
stopSession,
toggleSession,
isActive,
isTalking,
};
}
The code above is fetched from the GitHub repository.
Finally, we can create a simple component that uses the useWebRTCAudioSession hook and displays a floating button to start and stop the session, as well as to indicate whether the model is currently talking.
"use client";
import useWebRTCAudioSession from "@/hooks/useWebRTCAudioSession";
import Floaty from "./Floaty";
import { useRouter } from "next/navigation";
import { createTool, deriveTools } from "vovk";
import { TaskRPC, UserRPC } from "vovk-client";
import getCurrentTime from "@/lib/tools/getCurrentTime";
import partyMode from "@/lib/tools/partyMode";
import z from "zod";
const RealTimeDemo = () => {
const router = useRouter();
const { isActive, isTalking, toggleSession } = useWebRTCAudioSession("ash", [
...deriveTools({
modules: { TaskRPC, UserRPC },
}).tools,
createTool({
name: "getCurrentTime",
description: "Gets the current time in the user's timezone",
outputSchema: z.object({ time: z.string(), timezone: z.string(), message: z.string() }).meta({ description: "Current time info." }),
execute: getCurrentTime,
}),
createTool({
name: "partyMode",
description: "Triggers a confetti animation on the page",
execute: partyMode,
}),
createTool({
name: "navigateTo",
description: "Navigates the user to a specified URL within the application.",
inputSchema: z.object({
url: z.enum(["/", "/openapi"]).meta({ description: "The URL to navigate to." }),
}),
outputSchema: z.string().meta({ description: "Navigation confirmation message." }),
execute: async ({ url }: { url: string }) => {
router.push(url);
return `Navigating to ${url}`;
},
}),
createTool({
name: "scroll",
description:
"Scrolls the page up or down.",
inputSchema: z.object({
direction: z.enum(["up", "down"]).meta({ description: "The direction to scroll" }),
px: z.number().optional().meta({ description: "The number of pixels to scroll. If not provided, scrolls by one viewport height." }),
}),
outputSchema: z.object({
message: z.string().meta({ description: "Scroll action confirmation message." }),
__preventResponseCreate: z.boolean().meta({ description: "Flag to prevent response creation." }),
}),
execute: async ({ direction, px }: { direction: "up" | "down"; px?: number }) => {
console.log("Scrolling", direction);
const windowHeight =
window.innerHeight || document.documentElement.clientHeight;
const pxToScroll = px ?? windowHeight;
window.scrollBy({
top: direction === "up" ? -pxToScroll : pxToScroll,
behavior: "smooth",
});
return {
message: `Scrolling ${direction}`,
__preventResponseCreate: true,
};
},
}),
createTool({
name: "getVisiblePageSection",
description: "Gets the currently visible section of the page",
outputSchema: z.string().meta({ description: "Visible text content from the page." }),
execute: async () => {
function getVisibleText() {
const viewportHeight = window.innerHeight;
const viewportWidth = window.innerWidth;
// Check if an element or its ancestors are hidden from accessibility tree
function isAccessibilityHidden(element: Element | null): boolean {
while (element) {
if (element.getAttribute("aria-hidden") === "true") return true;
if (element.hasAttribute("hidden")) return true;
const role = element.getAttribute("role");
if (role === "presentation" || role === "none") return true;
const style = window.getComputedStyle(element);
if (style.display === "none" || style.visibility === "hidden") return true;
element = element.parentElement;
}
return false;
}
// Get accessible name from aria-label or aria-labelledby
function getAccessibleName(element: Element): string {
const ariaLabel = element.getAttribute("aria-label");
if (ariaLabel) return ariaLabel;
const labelledBy = element.getAttribute("aria-labelledby");
if (labelledBy) {
return labelledBy
.split(/\s+/)
.map((id) => document.getElementById(id)?.textContent?.trim() || "")
.filter(Boolean)
.join(" ");
}
// For images, use alt text
if (element.tagName === "IMG") {
const alt = element.getAttribute("alt");
if (alt) return alt;
}
return "";
}
// Get aria-describedby text
function getDescription(element: Element): string {
const describedBy = element.getAttribute("aria-describedby");
if (describedBy) {
return describedBy
.split(/\s+/)
.map((id) => document.getElementById(id)?.textContent?.trim() || "")
.filter(Boolean)
.join(" ");
}
return "";
}
const visibleTexts: string[] = [];
const processedElements = new Set<Element>();
// First, collect accessible names and descriptions from elements
const allElements = document.body.querySelectorAll("*");
for (const element of allElements) {
if (isAccessibilityHidden(element)) continue;
const rect = element.getBoundingClientRect();
const isInViewport =
rect.top < viewportHeight &&
rect.bottom > 0 &&
rect.left < viewportWidth &&
rect.right > 0;
if (!isInViewport) continue;
const accessibleName = getAccessibleName(element);
if (accessibleName && !processedElements.has(element)) {
visibleTexts.push(accessibleName);
processedElements.add(element);
}
const description = getDescription(element);
if (description) {
visibleTexts.push(description);
}
}
// Then collect visible text nodes
const walker = document.createTreeWalker(
document.body,
NodeFilter.SHOW_TEXT,
{
acceptNode(node) {
const parent = node.parentElement;
if (!parent) return NodeFilter.FILTER_REJECT;
if (isAccessibilityHidden(parent)) return NodeFilter.FILTER_REJECT;
return NodeFilter.FILTER_ACCEPT;
},
},
);
let node;
while ((node = walker.nextNode())) {
const range = document.createRange();
range.selectNode(node);
const rect = range.getBoundingClientRect();
if (
rect.top < viewportHeight &&
rect.bottom > 0 &&
rect.left < viewportWidth &&
rect.right > 0
) {
const text = node.textContent?.trim();
if (text) {
visibleTexts.push(text);
}
}
}
return visibleTexts.join(" ").replace(/\s+/g, " ").trim();
}
return getVisibleText();
},
}),
]);
return (
<Floaty
isActive={isActive}
isTalking={isTalking}
handleClick={toggleSession}
/>
);
};
export default RealTimeDemo;
The code above is fetched from the GitHub repository.
The code for the Floaty component is not shown here for brevity, but you can find it in the repository.
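If you want a quick stand-in, a minimal Floaty could look like the sketch below (a hypothetical simplification; the actual component in the repository is more elaborate):
"use client";

interface FloatyProps {
  isActive: boolean;
  isTalking: boolean;
  handleClick: () => void;
}

// A minimal, hypothetical stand-in for the Floaty button used above.
export default function Floaty({ isActive, isTalking, handleClick }: FloatyProps) {
  return (
    <button
      onClick={handleClick}
      aria-pressed={isActive}
      style={{
        position: "fixed",
        bottom: 24,
        right: 24,
        width: 64,
        height: 64,
        borderRadius: "50%",
        color: "white",
        // Green while the model is talking, red while a session is active, gray otherwise
        background: isTalking ? "#22c55e" : isActive ? "#ef4444" : "#6b7280",
      }}
    >
      {isActive ? "Stop" : "Start"}
    </button>
  );
}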
With that, you now have a fully functional realtime voice AI interface that can interact with your application by voice, using natural language, powered by OpenAI's Realtime API and Vovk.ts function calling capabilities!