# Workers AI

## Learning Focus
By the end of this lesson you will understand how to run serverless AI models on Workers, which models are available, and how to use the free tier.
## What Is Workers AI?
Workers AI allows you to run machine learning models on Cloudflare's global network of GPUs. Instead of managing GPU servers or paying high fees for external AI APIs, you can run inference (predictions) directly inside your Workers.
```mermaid
flowchart LR
    USER["User"] --> WORKER["Worker\n(Edge)"]
    WORKER -->|"Inference Request"| GPU["Cloudflare GPU\n(Local PoP)"]
    GPU -->|"Prediction"| WORKER
    WORKER -->|"AI Response"| USER
    style GPU fill:#7c3aed,color:#fff,stroke:#6d28d9
    style WORKER fill:#f6821f,color:#fff,stroke:#e5711e
```
## Model Categories
| Category | Typical Models | Use Case |
|---|---|---|
| Text Generation | Llama 3, Mistral, Gemma | Chatbots, summarization, content gen |
| Text-to-Image | Stable Diffusion | Generating unique art or placeholders |
| Translation | M2M-100 | Multi-language support |
| Speech-to-Text | Whisper | Transcribing audio |
| Embeddings | bge-small, bge-base | Vector search, RAG |
| Classification | ResNet, DistilBERT | Sentiment analysis, image labeling |
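The embeddings row above can be exercised with a short Worker. Here is a minimal sketch, assuming the `@cf/baai/bge-small-en-v1.5` model ID and its documented response shape (check the model catalog for current names); the cosine-similarity helper shows how the returned vectors are typically compared in vector search:

```typescript
export interface Env {
  AI: any; // Workers AI binding
}

// Cosine similarity between two embedding vectors -- the usual
// metric for comparing texts in vector search.
export function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    // One embedding vector is returned per input string.
    const { data } = await env.AI.run("@cf/baai/bge-small-en-v1.5", {
      text: ["How do I deploy a Worker?", "Deploying Workers with Wrangler"],
    });
    const similarity = cosineSimilarity(data[0], data[1]);
    return Response.json({ similarity });
  },
};
```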
## Free Tier (Beta)
During the beta period, Cloudflare offers a generous free tier for certain models. Check the Cloudflare Dashboard for the most current limits.
| Resource | Typical Free Usage |
|---|---|
| Inference | Free for "beta" tagged models |
| Daily Limit | Up to 1,000 requests/day (varies by model) |
| GPU Access | Built-in to the global network |
## Running AI in a Worker

### 1. Configure the Binding

```toml
# wrangler.toml
[ai]
binding = "AI"
```
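If your project uses Wrangler's JSON configuration format instead of `wrangler.toml`, the equivalent binding (same binding name, `AI`) looks like:

```jsonc
// wrangler.jsonc
{
  "ai": {
    "binding": "AI"
  }
}
```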
### 2. Run Inference (LLM Example)

```typescript
// src/index.ts
export interface Env {
  AI: any; // Workers AI binding (an Ai type is available via @cloudflare/workers-types)
}

export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    const url = new URL(request.url);
    const prompt = url.searchParams.get("prompt") || "Why is Cloudflare awesome?";

    // Run the Llama 3 model
    const response = await env.AI.run("@cf/meta/llama-3-8b-instruct", {
      messages: [
        { role: "system", content: "You are a helpful assistant." },
        { role: "user", content: prompt },
      ],
    });

    return Response.json(response);
  },
};
```
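For chat-style UIs you usually want tokens to arrive as they are generated rather than after the whole reply is ready. A sketch of the same call with streaming enabled: with `stream: true`, Workers AI returns a `ReadableStream` of server-sent events, which you can pass straight through to the client.

```typescript
export interface Env {
  AI: any; // Workers AI binding
}

// Headers for a server-sent-events response (small helper for clarity).
export function sseHeaders(): Record<string, string> {
  return { "Content-Type": "text/event-stream" };
}

export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    // stream: true switches the return value from a JSON object
    // to a ReadableStream of server-sent events.
    const stream = await env.AI.run("@cf/meta/llama-3-8b-instruct", {
      messages: [{ role: "user", content: "Why is Cloudflare awesome?" }],
      stream: true,
    });

    // Pass the model's event stream straight through to the client.
    return new Response(stream, { headers: sseHeaders() });
  },
};
```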
### 3. Image Generation Example

```typescript
// The model returns raw PNG bytes as a stream, so the Response
// can forward it directly with an image content type.
const response = await env.AI.run("@cf/stabilityai/stable-diffusion-xl-base-1.0", {
  prompt: "A futuristic city in the clouds, cyberpunk aesthetic",
});

return new Response(response, {
  headers: { "Content-Type": "image/png" },
});
```
## Using Workers AI outside of Workers

You can also call Workers AI via a standard REST API using your Cloudflare API token.

```bash
# Call AI via REST
curl https://api.cloudflare.com/client/v4/accounts/{ACCOUNT_ID}/ai/run/@cf/meta/llama-3-8b-instruct \
  -H "Authorization: Bearer {API_TOKEN}" \
  -H "Content-Type: application/json" \
  -d '{ "messages": [{ "role": "user", "content": "Hello!" }] }'
```
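The same call can be made from any fetch-capable runtime. A minimal sketch; the `aiRunUrl` and `runRemote` helpers are illustrative names, and the account ID and token are placeholders for your own credentials:

```typescript
// Build the inference endpoint for a given account and model.
export function aiRunUrl(accountId: string, model: string): string {
  return `https://api.cloudflare.com/client/v4/accounts/${accountId}/ai/run/${model}`;
}

// POST a chat prompt to the REST endpoint and return the parsed JSON.
export async function runRemote(accountId: string, apiToken: string, prompt: string) {
  const res = await fetch(aiRunUrl(accountId, "@cf/meta/llama-3-8b-instruct"), {
    method: "POST",
    headers: {
      Authorization: `Bearer ${apiToken}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({ messages: [{ role: "user", content: prompt }] }),
  });
  return res.json();
}
```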
## Common Strategy: RAG (Retrieval Augmented Generation)

Workers AI is most powerful when combined with Vectorize and D1:

1. Store documents in D1 or R2.
2. Generate embeddings using `bge-small` on Workers AI.
3. Store and search the embeddings in Vectorize.
4. Use the search results as context for Llama 3 to provide accurate answers.
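The steps above can be sketched in one handler. This is an illustrative outline, not a finished implementation: the `VECTORIZE` binding name and the shape of its `matches` are assumptions (the Vectorize lesson covers the real API), while the two `env.AI.run` calls follow the examples earlier in this lesson.

```typescript
export interface Env {
  AI: any;        // Workers AI binding
  VECTORIZE: any; // hypothetical Vectorize index binding
}

// Build the system prompt from retrieved document snippets (pure helper).
export function buildContextPrompt(snippets: string[]): string {
  return (
    "Answer using only the context below.\n\nContext:\n" +
    snippets.map((s, i) => `[${i + 1}] ${s}`).join("\n")
  );
}

export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    const question = new URL(request.url).searchParams.get("q") ?? "What is Workers AI?";

    // 1. Embed the question.
    const { data } = await env.AI.run("@cf/baai/bge-small-en-v1.5", { text: [question] });

    // 2. Find the closest stored documents (result shape assumed for illustration).
    const { matches } = await env.VECTORIZE.query(data[0], { topK: 3, returnMetadata: true });
    const snippets = matches.map((m: any) => m.metadata?.text ?? "");

    // 3. Ask the LLM, grounding it in the retrieved context.
    const answer = await env.AI.run("@cf/meta/llama-3-8b-instruct", {
      messages: [
        { role: "system", content: buildContextPrompt(snippets) },
        { role: "user", content: question },
      ],
    });
    return Response.json(answer);
  },
};
```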
## Key Takeaways
- Workers AI runs machine learning models on GPUs at the edge.
- Supports LLMs, Image Gen, Translation, Speech, and Embeddings.
- Free tier available for "beta" models during the preview period.
- Integrated natively with the Workers ecosystem (bindings).
- No cold starts for AI: models are pre-loaded on the network.
## What's Next
- Continue to Vectorize to learn about vector databases for AI search.