Turn any Python function into a real-time audio and video stream over WebRTC or WebSockets.
```bash
pip install fastrtc
```
To use the built-in pause detection (see ReplyOnPause) and text-to-speech (see Text To Speech), install the `vad` and `tts` extras:

```bash
pip install "fastrtc[vad, tts]"
```
- 🗣️ Automatic Voice Detection and Turn Taking built-in; only worry about the logic for responding to the user.
- 💻 Automatic UI - Use the `.ui.launch()` method to launch the WebRTC-enabled built-in Gradio UI.
- 🔌 Automatic WebRTC Support - Use the `.mount(app)` method to mount the stream on a FastAPI app and get a WebRTC endpoint for your own frontend!
- ⚡️ WebSocket Support - Use the `.mount(app)` method to mount the stream on a FastAPI app and get a WebSocket endpoint for your own frontend!
- 📞 Automatic Telephone Support - Use the `fastphone()` method of the stream to launch the application and get a free temporary phone number!
- 🤖 Completely customizable backend - A `Stream` can easily be mounted on a FastAPI app, so you can extend it to fit your production application. See the Talk To Claude demo for an example of how to serve a custom JS frontend.
See the Cookbook for examples of how to use the library.
| Demo | Video |
|------|-------|
| Stream both your webcam video and audio feeds to Google Gemini. You can also upload images to augment your conversation! | gemini-audio-video-first.mp4 |
| Talk to Gemini in real time using Google's voice API. | gemini-live-chat.mp4 |
| Talk to ChatGPT in real time using OpenAI's voice API. | openai-live-chat.mp4 |
| Say "computer" before asking your question! | 2025-02-20_00-05-11.mp4 |
| Create and edit HTML pages with just your voice! Powered by SambaNova Systems. | llama-code-editor.mp4 |
| Use the Anthropic and Play.Ht APIs to have an audio conversation with Claude. | talk-to-claude.mp4 |
| Have Whisper transcribe your speech in real time! | whisper-realtime.mp4 |
| Run the YOLOv10 model on a user's webcam stream in real time! | yolov10-stream.mp4 |
| Kyutai's Moshi is a novel speech-to-speech model for modeling human conversations. | talk-to-moshi.mp4 |
| A code editor built with Llama 3.3 70b that is triggered by the phrase "Hello Llama". Build a Siri-like coding assistant in 100 lines of code! | hey-llama-final.mp4 |
This is a shortened version of the official usage guide.
- `.ui.launch()`: Launch a built-in UI for easily testing and sharing your stream. Built with Gradio.
- `.fastphone()`: Get a free temporary phone number to call into your stream. Hugging Face token required.
- `.mount(app)`: Mount the stream on a FastAPI app. Perfect for integrating with your existing production system.
```py
from fastrtc import Stream, ReplyOnPause
import numpy as np

def echo(audio: tuple[int, np.ndarray]):
    # The function will be passed the audio until the user pauses.
    # Implement any iterator that yields audio.
    # See "LLM Voice Chat" for a more complete example.
    yield audio

stream = Stream(
    handler=ReplyOnPause(echo),
    modality="audio",
    mode="send-receive",
)
```
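To try it out, launch the built-in Gradio UI (the same `.ui.launch()` method covered in the deployment section below):

```py
stream.ui.launch()
```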
```py
from fastrtc import (
    ReplyOnPause, AdditionalOutputs, Stream,
    audio_to_bytes, aggregate_bytes_to_16bit
)
import gradio as gr
import numpy as np
from groq import Groq
import anthropic
from elevenlabs import ElevenLabs

groq_client = Groq()
claude_client = anthropic.Anthropic()
tts_client = ElevenLabs()


# See "Talk to Claude" in Cookbook for an example of how to keep
# track of the chat history.
def response(
    audio: tuple[int, np.ndarray],
):
    # Transcribe the user's speech with Groq's Whisper endpoint.
    prompt = groq_client.audio.transcriptions.create(
        file=("audio-file.mp3", audio_to_bytes(audio)),
        model="whisper-large-v3-turbo",
        response_format="verbose_json",
    ).text
    # Generate a text reply with Claude.
    response = claude_client.messages.create(
        model="claude-3-5-haiku-20241022",
        max_tokens=512,
        messages=[{"role": "user", "content": prompt}],
    )
    response_text = " ".join(
        block.text
        for block in response.content
        if getattr(block, "type", None) == "text"
    )
    # Stream the reply back as 24 kHz PCM audio via ElevenLabs TTS.
    iterator = tts_client.text_to_speech.convert_as_stream(
        text=response_text,
        voice_id="JBFqnCBsd6RMkjVDRZzb",
        model_id="eleven_multilingual_v2",
        output_format="pcm_24000",
    )
    for chunk in aggregate_bytes_to_16bit(iterator):
        audio_array = np.frombuffer(chunk, dtype=np.int16).reshape(1, -1)
        yield (24000, audio_array)

stream = Stream(
    modality="audio",
    mode="send-receive",
    handler=ReplyOnPause(response),
)
```
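The example imports `AdditionalOutputs` and `gradio` without using them; they come into play when you want to surface extra data, such as the chat transcript, alongside the audio in the built-in UI. A minimal sketch of that pattern, assuming the handler also yields the transcript; treat the exact wiring as illustrative and see the FastRTC docs on Additional Outputs for the authoritative API:

```py
# Inside the handler, after yielding the audio chunks, emit extra data:
#     yield AdditionalOutputs([{"role": "assistant", "content": response_text}])

stream = Stream(
    modality="audio",
    mode="send-receive",
    handler=ReplyOnPause(response),
    # Gradio component that displays the extra output in the built-in UI.
    additional_outputs=[gr.Chatbot(type="messages")],
    # Called with the previous and newly yielded values; keep the latest here.
    additional_outputs_handler=lambda prev, current: current,
)
```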
```py
from fastrtc import Stream
import numpy as np

def flip_vertically(image):
    # Video handlers map an input frame (a numpy array) to an output frame.
    return np.flip(image, axis=0)

stream = Stream(
    handler=flip_vertically,
    modality="video",
    mode="send-receive",
)
```
```py
from fastrtc import Stream
import gradio as gr
import cv2
from huggingface_hub import hf_hub_download
from .inference import YOLOv10

model_file = hf_hub_download(
    repo_id="onnx-community/yolov10n", filename="onnx/model.onnx"
)

# git clone https://huggingface.co/spaces/fastrtc/object-detection
# for the YOLOv10 implementation
model = YOLOv10(model_file)

def detection(image, conf_threshold=0.3):
    image = cv2.resize(image, (model.input_width, model.input_height))
    new_image = model.detect_objects(image, conf_threshold)
    return cv2.resize(new_image, (500, 500))

stream = Stream(
    handler=detection,
    modality="video",
    mode="send-receive",
    # The slider's current value is passed to the handler as `conf_threshold`.
    additional_inputs=[
        gr.Slider(minimum=0, maximum=1, step=0.01, value=0.3)
    ],
)
```
Run:

```py
stream.ui.launch()
```
Telephone:

```py
stream.fastphone()
```
FastAPI:

```py
from fastapi import FastAPI
from fastapi.responses import HTMLResponse

app = FastAPI()
stream.mount(app)

# Optional: Add routes
@app.get("/")
async def _():
    return HTMLResponse(content=open("index.html").read())

# uvicorn app:app --host 0.0.0.0 --port 8000
```
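If you prefer starting the server from Python rather than the uvicorn CLI shown in the comment, the standard uvicorn API works too (a small sketch; host and port are arbitrary):

```py
import uvicorn

# Equivalent to: uvicorn app:app --host 0.0.0.0 --port 8000
uvicorn.run(app, host="0.0.0.0", port=8000)
```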