Audio Transcription

@ai_kit/core includes model-agnostic audio transcription that works with any OpenAI-compatible endpoint (Scaleway Whisper large v3, OpenAI whisper-1, etc.).

Four public primitives

Export	Role
`createTranscriptionModel(config)`	Creates a `TranscriptionModelV3` provider
`createTranscriptionStreamingModel(config)`	Creates a native streaming model (no AI SDK) that emits partial text over SSE
`transcribe(options)`	Standalone function: loads audio (path / URL / buffer), calls the model, returns the transcript
`createTranscriptionTool(model, options?)`	Returns an AI SDK `tool()` to attach directly to an `Agent`

`createTranscriptionModel`

import { createTranscriptionModel } from "@ai_kit/core";

const whisperModel = createTranscriptionModel({
  modelId: "whisper-large-v3",
  apiKey: process.env.SCALEWAY_API_KEY!,
  baseURL: "https://api.scaleway.ai/v1",
  providerName: "scaleway", // optional, used in logs
});

Works with any OpenAI-compatible /audio/transcriptions endpoint; requests are sent with response_format=verbose_json.
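To make the endpoint contract concrete, here is a minimal sketch of the request an OpenAI-compatible transcription endpoint expects. `buildTranscriptionRequest` and `TranscriptionRequest` are hypothetical names used for illustration only; they are not part of @ai_kit/core, and the library's internal request code may differ.

```typescript
// Hypothetical helper, not part of @ai_kit/core.
interface TranscriptionRequest {
  url: string;
  headers: Record<string, string>;
  // Text fields sent as multipart/form-data next to the `file` part.
  fields: Record<string, string>;
}

function buildTranscriptionRequest(config: {
  baseURL: string;
  apiKey: string;
  modelId: string;
  language?: string;
}): TranscriptionRequest {
  // Avoid a double slash when baseURL ends with "/".
  const base = config.baseURL.replace(/\/+$/, "");
  const fields: Record<string, string> = {
    model: config.modelId,
    response_format: "verbose_json",
  };
  if (config.language) fields.language = config.language;
  return {
    url: `${base}/audio/transcriptions`,
    headers: { Authorization: `Bearer ${config.apiKey}` },
    fields,
  };
}

const request = buildTranscriptionRequest({
  baseURL: "https://api.scaleway.ai/v1/",
  apiKey: "sk-example", // placeholder key for the sketch
  modelId: "whisper-large-v3",
  language: "fr",
});
console.log(request.url);
```

The text fields travel as multipart/form-data alongside a `file` part that carries the audio bytes.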

`transcribe`

import { transcribe } from "@ai_kit/core";

// From a file path
const result = await transcribe({
  model: whisperModel,
  audio: "/path/to/audio.wav",
  inputType: "path",         // "path" | "url" | "buffer" — auto-detected if omitted
  language: "fr",            // optional ISO-639-1 code
});

console.log(result.text);
// result.segments → [{ text, startSecond, endSecond }]
// result.language, result.durationInSeconds

audio accepts a file path, an http(s) URL, or a Buffer / Uint8Array. The inputType is auto-detected when omitted.
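The auto-detection described above could look roughly like the sketch below. `detectInputType` is a hypothetical function written for this page; the rules the library actually applies are an assumption and may differ (for example in how it treats `file://` URLs).

```typescript
type InputType = "path" | "url" | "buffer";

// Hypothetical detection logic; the library's real rules may differ.
function detectInputType(audio: string | Uint8Array): InputType {
  // Node's Buffer is a subclass of Uint8Array, so this covers both.
  if (audio instanceof Uint8Array) return "buffer";
  // Only http(s) URLs are treated as remote; any other string is a path.
  if (/^https?:\/\//i.test(audio)) return "url";
  return "path";
}

console.log(detectInputType("/path/to/audio.wav"));
console.log(detectInputType("https://example.com/audio.mp3"));
console.log(detectInputType(new Uint8Array([0x52, 0x49, 0x46, 0x46])));
```

Passing inputType explicitly remains the safer choice when a string could be ambiguous.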

Return value

interface TranscribeResult {
  text: string;
  segments: Array<{ text: string; startSecond: number; endSecond: number }>;
  language: string | undefined;
  durationInSeconds: number | undefined;
}
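As an illustration of what the segments array makes possible, the sketch below turns segments into SubRip (.srt) subtitle text. `toTimestamp` and `segmentsToSrt` are hypothetical helpers, not library exports; the `Segment` type simply mirrors the segment shape shown above.

```typescript
interface Segment {
  text: string;
  startSecond: number;
  endSecond: number;
}

// Formats seconds as HH:MM:SS,mmm (the SubRip timestamp layout).
function toTimestamp(seconds: number): string {
  const totalMs = Math.round(seconds * 1000);
  const ms = totalMs % 1000;
  const totalSeconds = Math.floor(totalMs / 1000);
  const s = totalSeconds % 60;
  const m = Math.floor(totalSeconds / 60) % 60;
  const h = Math.floor(totalSeconds / 3600);
  const pad = (n: number, width = 2) => String(n).padStart(width, "0");
  return `${pad(h)}:${pad(m)}:${pad(s)},${pad(ms, 3)}`;
}

// Builds one numbered subtitle cue per segment, separated by blank lines.
function segmentsToSrt(segments: Segment[]): string {
  return segments
    .map(
      (seg, i) =>
        `${i + 1}\n${toTimestamp(seg.startSecond)} --> ${toTimestamp(seg.endSecond)}\n${seg.text.trim()}`,
    )
    .join("\n\n");
}

const srt = segmentsToSrt([
  { text: " Bonjour.", startSecond: 0, endSecond: 1.5 },
  { text: " Comment allez-vous ?", startSecond: 1.5, endSecond: 3.25 },
]);
console.log(srt);
```

In real use, the array passed to `segmentsToSrt` would be `result.segments` from a `transcribe` call.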

`createTranscriptionStreamingModel` — streaming (native)

For long recordings you can stream the transcript as it is produced instead of waiting for the whole file. This primitive talks directly to the OpenAI-compatible /audio/transcriptions endpoint with stream=true and parses the server-sent events natively — it does not use the AI SDK’s experimental_transcribe.

import { createTranscriptionStreamingModel } from "@ai_kit/core";

const streamingModel = createTranscriptionStreamingModel({
  modelId: "whisper-large-v3",
  apiKey: process.env.SCALEWAY_API_KEY!,
  baseURL: "https://api.scaleway.ai/v1",
  providerName: "scaleway",
});

let full = "";
for await (const chunk of streamingModel.stream({
  audio: "/path/to/audio.wav", // path / URL / Buffer / Uint8Array (auto-detected)
  language: "fr",              // optional ISO-639-1 code
})) {
  if (chunk.type === "delta") {
    full += chunk.textDelta;
    process.stdout.write(chunk.textDelta); // print as it arrives
  } else {
    // final event
    console.log("\nDone:", chunk.text, chunk.durationInSeconds);
  }
}

Output starts streaming as soon as the provider has processed the first 30-second chunk, so text appears a few seconds in rather than after the whole file has been transcribed.
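For readers curious about what parsing the server-sent events involves, here is a simplified sketch. It assumes the provider emits OpenAI-style events named `transcript.text.delta` and `transcript.text.done`; those names, and the helpers `parseSseLine` and `parseSseBody`, are assumptions for illustration and not the library's actual implementation. A real parser would also buffer partial lines arriving across network reads.

```typescript
type SseChunk =
  | { type: "delta"; textDelta: string }
  | { type: "done"; text: string };

// Parses one SSE line. Returns null for blank lines, non-data lines,
// the "[DONE]" sentinel, and event types this sketch does not know.
function parseSseLine(line: string): SseChunk | null {
  if (!line.startsWith("data:")) return null;
  const payload = line.slice("data:".length).trim();
  if (payload === "" || payload === "[DONE]") return null;
  const event = JSON.parse(payload) as {
    type?: string;
    delta?: string;
    text?: string;
  };
  if (event.type === "transcript.text.delta" && typeof event.delta === "string") {
    return { type: "delta", textDelta: event.delta };
  }
  if (event.type === "transcript.text.done" && typeof event.text === "string") {
    return { type: "done", text: event.text };
  }
  return null;
}

// Parses a complete SSE body that has already been fully received.
function parseSseBody(body: string): SseChunk[] {
  const chunks: SseChunk[] = [];
  for (const line of body.split(/\r?\n/)) {
    const chunk = parseSseLine(line);
    if (chunk) chunks.push(chunk);
  }
  return chunks;
}

const sample = [
  'data: {"type":"transcript.text.delta","delta":"Bonjour "}',
  "",
  'data: {"type":"transcript.text.delta","delta":"le monde"}',
  "",
  'data: {"type":"transcript.text.done","text":"Bonjour le monde"}',
  "",
  "data: [DONE]",
].join("\n");

console.log(parseSseBody(sample));
```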

Chunk shape

type TranscriptionStreamChunk =
  | { type: "delta"; textDelta: string }
  | { type: "done"; text: string; durationInSeconds?: number };

delta events carry incremental text; the single closing done event carries the full accumulated text (equal to the concatenation of all deltas). Pass an AbortSignal via abortSignal to cancel mid-stream.
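Because the chunk type is a discriminated union, draining a stream can be wrapped in a small helper. The sketch below is one way to do it; `collectTranscript` is a hypothetical helper, and `fakeStream` is a stand-in for `streamingModel.stream(...)` so the example runs without a network call.

```typescript
type TranscriptionStreamChunk =
  | { type: "delta"; textDelta: string }
  | { type: "done"; text: string; durationInSeconds?: number };

// Drains a chunk stream. Prefers the text of the final "done" event and
// falls back to the concatenated deltas if the stream ends without one.
async function collectTranscript(
  stream: AsyncIterable<TranscriptionStreamChunk>,
  onDelta?: (textDelta: string) => void,
): Promise<{ text: string; durationInSeconds?: number }> {
  let accumulated = "";
  for await (const chunk of stream) {
    if (chunk.type === "delta") {
      accumulated += chunk.textDelta;
      onDelta?.(chunk.textDelta);
    } else {
      return { text: chunk.text, durationInSeconds: chunk.durationInSeconds };
    }
  }
  return { text: accumulated };
}

// Stand-in for streamingModel.stream(...), so the sketch runs offline.
async function* fakeStream(): AsyncGenerator<TranscriptionStreamChunk> {
  yield { type: "delta", textDelta: "Bonjour " };
  yield { type: "delta", textDelta: "le monde" };
  yield { type: "done", text: "Bonjour le monde", durationInSeconds: 2.4 };
}

collectTranscript(fakeStream()).then((result) => {
  console.log(result.text, result.durationInSeconds);
});
```

The fallback to accumulated deltas only matters if a stream is cut short, for example after an abort.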

Streaming uses the default JSON stream format, so verbose_json (and therefore per-segment timestamps) is not available while streaming. Use transcribe or createTranscriptionModel when you need segments.

`createTranscriptionTool` — attach to an Agent

import {
  Agent,
  createTranscriptionModel,
  createTranscriptionTool,
  scaleway,
} from "@ai_kit/core";

const whisperModel = createTranscriptionModel({
  modelId: "whisper-large-v3",
  apiKey: process.env.SCALEWAY_API_KEY!,
  baseURL: "https://api.scaleway.ai/v1",
});

const agent = new Agent({
  name: "medical-assistant",
  model: scaleway("gpt-oss-120b"),
  tools: {
    transcribeAudio: createTranscriptionTool(whisperModel, {
      description: "Transcribe a medical audio recording to text",
    }),
  },
});

const result = await agent.generate({
  prompt: "Transcribe this file: /recordings/consultation.mp3",
});

The tool schema exposed to the LLM has three fields: audio (a path, URL, or base64 string), inputType, and language.

Supported audio formats

flac, mp3, mp4, mpeg, mpga, m4a, ogg, wav, webm (identical to OpenAI Whisper).