Chat Messages

Roles

public enum ChatMessageRole: String {
  case user
  case system
  case assistant
  case tool
}
Include .tool messages when you append function-call results back into the conversation.
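For example, a hedged sketch of appending a tool result (the JSON payload and the history array are illustrative assumptions, not SDK requirements):

// Serialize the tool's output however your model expects it (JSON shown here).
let toolResult = #"{"temperature": 72, "unit": "F"}"#
let toolMessage = ChatMessage(
    role: .tool,
    content: [.text(toolResult)]
)
history.append(toolMessage)  // `history` is your own [ChatMessage] transcript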

Message Structure

public struct ChatMessage {
  public var role: ChatMessageRole
  public var content: [ChatMessageContent]
  public var reasoningContent: String?
  public var functionCalls: [LeapFunctionCall]?

  public init(
    role: ChatMessageRole,
    content: [ChatMessageContent],
    reasoningContent: String? = nil,
    functionCalls: [LeapFunctionCall]? = nil
  )

  public init(from json: [String: Any]) throws
}
  • content: Ordered fragments of the message. The SDK supports .text, .image, and .audio parts.
  • reasoningContent: Optional text produced inside <think> tags by eligible models.
  • functionCalls: Attach the calls returned by MessageResponse.functionCall when you include tool execution results in the history (see the sketch below).
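A hedged sketch of echoing the assistant's calls back into the transcript ahead of the matching .tool result; calls stands for whatever MessageResponse.functionCall delivered during generation, and the empty content array is an assumption:

let assistantMessage = ChatMessage(
    role: .assistant,
    content: [],            // assumption: no text when the turn produced only calls
    functionCalls: calls    // `calls` came from MessageResponse.functionCall
)
history.append(assistantMessage)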

Message Content

public enum ChatMessageContent {
  case text(String)
  case image(Data)   // JPEG bytes
  case audio(Data)   // WAV bytes

  public init(from json: [String: Any]) throws
}
Provide JPEG-encoded bytes for .image and WAV data for .audio. Helper initializers such as ChatMessageContent.fromUIImage, ChatMessageContent.fromNSImage, ChatMessageContent.fromWAVData, and ChatMessageContent.fromFloatSamples(_:sampleRate:channelCount:) simplify interop with platform-native buffers. On the wire, image parts are encoded as OpenAI-style image_url payloads and audio parts as input_audio arrays with Base64 data.
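For example, a mixed text-and-image message built directly from JPEG bytes with UIKit (the asset name is illustrative):

import LeapSDK
import UIKit

// Encode the image as JPEG, as required by the .image case.
let photo = UIImage(named: "receipt")!              // illustrative asset name
let jpegBytes = photo.jpegData(compressionQuality: 0.8)!

let message = ChatMessage(
    role: .user,
    content: [
        .text("What store is this receipt from?"),
        .image(jpegBytes)
    ]
)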

Audio Format Requirements

The LEAP inference engine requires WAV-encoded audio that meets the following constraints:
Property      Required Value                  Notes
Format        WAV (RIFF)                      Only WAV format is supported
Sample Rate   16000 Hz (16 kHz) recommended   Other sample rates are automatically resampled to 16 kHz
Encoding      PCM (various bit depths)        Supports Float32, Int16, Int24, Int32
Channels      Mono (1 channel)                Required - stereo audio will be rejected
Byte Order    Little-endian                   Standard WAV format
Supported PCM Encodings:
  • Float32: 32-bit floating point, normalized to [-1.0, 1.0]
  • Int16: 16-bit signed integer, range [-32768, 32767] (recommended)
  • Int24: 24-bit signed integer, range [-8388608, 8388607]
  • Int32: 32-bit signed integer, range [-2147483648, 2147483647]
The inference engine only accepts WAV format. M4A, MP3, AAC, or other compressed formats are not supported and will cause errors. Audio must be converted to WAV before sending to the model.
Automatic Resampling: The inference engine automatically resamples audio to 16 kHz if provided at a different sample rate. However, for best performance and quality, provide audio at 16 kHz to avoid resampling overhead.
Mono Channel Required: The inference engine strictly requires single-channel (mono) audio. Multi-channel or stereo WAV files will be rejected with an error. Convert stereo audio to mono before sending.
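
If your source audio is stereo or at another sample rate, the following hedged AVFoundation sketch (illustrative glue code, not an SDK API) decodes a file and converts it to 16 kHz mono Float32 samples suitable for fromFloatSamples:

import AVFoundation

// Decode any file AVFoundation can read and convert it to 16 kHz mono Float32.
func loadMono16kSamples(from url: URL) throws -> [Float] {
    let file = try AVAudioFile(forReading: url)
    let target = AVAudioFormat(
        commonFormat: .pcmFormatFloat32,
        sampleRate: 16_000,
        channels: 1,
        interleaved: false
    )!
    guard let converter = AVAudioConverter(from: file.processingFormat, to: target) else {
        throw NSError(domain: "AudioConversion", code: -1)
    }

    var samples: [Float] = []
    var done = false
    while !done {
        let outBuffer = AVAudioPCMBuffer(pcmFormat: target, frameCapacity: 4096)!
        var conversionError: NSError?
        let status = converter.convert(to: outBuffer, error: &conversionError) { _, inputStatus in
            // Pull the next chunk of source audio on demand.
            let inBuffer = AVAudioPCMBuffer(
                pcmFormat: file.processingFormat, frameCapacity: 4096)!
            guard (try? file.read(into: inBuffer)) != nil, inBuffer.frameLength > 0 else {
                inputStatus.pointee = .endOfStream
                return nil
            }
            inputStatus.pointee = .haveData
            return inBuffer
        }
        if let conversionError { throw conversionError }
        if let channel = outBuffer.floatChannelData?[0], outBuffer.frameLength > 0 {
            samples.append(contentsOf: UnsafeBufferPointer(
                start: channel, count: Int(outBuffer.frameLength)))
        }
        done = (status == .endOfStream) || (status == .error)
    }
    return samples
}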

Creating Audio Content from WAV Files

import LeapSDK

// Load WAV file
let wavURL = Bundle.main.url(forResource: "audio", withExtension: "wav")!
let wavData = try Data(contentsOf: wavURL)

let message = ChatMessage(
    role: .user,
    content: [
        .text("What is being said in this audio?"),
        .audio(wavData)
    ]
)

Creating Audio Content from Raw PCM Samples

Use the fromFloatSamples helper to create WAV-encoded data from raw audio samples:
import LeapSDK

// Float samples normalized to -1.0 to 1.0 (e.g., captured PCM audio)
let samples: [Float] = [0.1, 0.2, 0.15, -0.3]  // truncated for illustration

// Create WAV-encoded Data
let audioContent = ChatMessageContent.fromFloatSamples(
    samples,
    sampleRate: 16000,
    channelCount: 1
)

let message = ChatMessage(
    role: .user,
    content: [
        .text("Transcribe this audio"),
        audioContent
    ]
)

Recording Audio on iOS

When recording audio from the device microphone, configure AVAudioRecorder with the correct settings:
import AVFoundation

let audioURL = FileManager.default.temporaryDirectory
    .appendingPathComponent("recording.wav")

let settings: [String: Any] = [
    AVFormatIDKey: kAudioFormatLinearPCM,           // Linear PCM
    AVSampleRateKey: 16000.0,                       // 16 kHz
    AVNumberOfChannelsKey: 1,                       // Mono
    AVLinearPCMBitDepthKey: 16,                     // 16-bit
    AVLinearPCMIsFloatKey: false,                   // Integer samples
    AVLinearPCMIsBigEndianKey: false                // Little-endian
]

let audioRecorder = try AVAudioRecorder(url: audioURL, settings: settings)
audioRecorder.record()

// ... wait for user to finish speaking ...

audioRecorder.stop()

// Read the WAV file
let wavData = try Data(contentsOf: audioURL)
let audioContent: ChatMessageContent = .audio(wavData)
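
On iOS, also request microphone permission and activate the audio session before recording; a minimal sketch using standard AVFoundation calls (add an NSMicrophoneUsageDescription entry to Info.plist):

import AVFoundation

// Configure and activate the shared audio session before creating the recorder.
let session = AVAudioSession.sharedInstance()
try session.setCategory(.record, mode: .default)
try session.setActive(true)

// Ask for microphone permission before starting the recorder.
session.requestRecordPermission { granted in
    guard granted else { return }  // handle denial in real code
    // Safe to start recording here.
}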

Audio Duration Considerations

  • Minimum duration: At least 1 second of audio is recommended for reliable speech recognition
  • Maximum duration: Limited by the model’s context window (typically several minutes)
  • Silence: Trim excessive silence from the beginning and end for better results (see the trimming sketch below)
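
A hedged sketch of amplitude-based trimming over Float samples; the 0.01 threshold is an illustrative default, not an SDK value:

// Drop leading/trailing samples whose absolute amplitude stays below a threshold.
func trimSilence(_ samples: [Float], threshold: Float = 0.01) -> [Float] {
    guard let first = samples.firstIndex(where: { abs($0) > threshold }),
          let last = samples.lastIndex(where: { abs($0) > threshold })
    else { return [] }  // the whole clip was silence
    return Array(samples[first...last])
}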

Audio Output from Models

When generating audio responses (e.g., with LFM2.5-Audio-1.5B), the model outputs audio at 24 kHz sample rate:
for try await response in conversation.generateResponse(message: userMessage) {
    switch response {
    case .audioSample(let samples, let sampleRate):
        // samples: [Float] (32-bit float PCM, normalized -1.0 to 1.0)
        // sampleRate: Int (typically 24000 Hz for audio generation models)

        // Accumulate samples or play immediately
        audioPlayer.enqueue(samples: samples, sampleRate: sampleRate)

    default:
        break
    }
}
Note: Audio input should be 16 kHz, but audio output from generation models is typically 24 kHz. Make sure your audio playback code supports the correct sample rate.
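
The audioPlayer above is a placeholder for your own playback code. A hedged AVAudioEngine sketch that schedules mono Float32 buffers at the model's reported sample rate:

import AVFoundation

// Illustrative player for streamed Float32 samples; not part of the SDK.
final class StreamingAudioPlayer {
    private let engine = AVAudioEngine()
    private let node = AVAudioPlayerNode()
    private var format: AVAudioFormat?

    func enqueue(samples: [Float], sampleRate: Int) {
        // Lazily set up the engine once the model's sample rate is known.
        if format == nil {
            let fmt = AVAudioFormat(
                commonFormat: .pcmFormatFloat32,
                sampleRate: Double(sampleRate),
                channels: 1,
                interleaved: false
            )!
            format = fmt
            engine.attach(node)
            engine.connect(node, to: engine.mainMixerNode, format: fmt)
            try? engine.start()
            node.play()
        }
        guard let fmt = format,
              let buffer = AVAudioPCMBuffer(
                  pcmFormat: fmt, frameCapacity: AVAudioFrameCount(samples.count)),
              let channel = buffer.floatChannelData?.pointee
        else { return }
        buffer.frameLength = AVAudioFrameCount(samples.count)
        for i in 0..<samples.count { channel[i] = samples[i] }
        node.scheduleBuffer(buffer, completionHandler: nil)
    }
}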