
Workers AI

Learning Focus

By the end of this lesson you will understand how to run serverless AI models on Workers, which models are available, and how to use the free tier.

What Is Workers AI?

Workers AI allows you to run machine learning models on Cloudflare's global network of GPUs. Instead of managing GPU servers or paying high fees for external AI APIs, you can run inference (predictions) directly inside your Workers.

flowchart LR
USER["User"] --> WORKER["Worker\n(Edge)"]
WORKER -->|"Inference Request"| GPU["Cloudflare GPU\n(Local PoP)"]
GPU -->|"Prediction"| WORKER
WORKER -->|"AI Response"| USER

style GPU fill:#7c3aed,color:#fff,stroke:#6d28d9
style WORKER fill:#f6821f,color:#fff,stroke:#e5711e

Model Categories

| Category | Typical Models | Use Case |
| --- | --- | --- |
| Text Generation | Llama 3, Mistral, Gemma | Chatbots, summarization, content gen |
| Text-to-Image | Stable Diffusion | Generating unique art or placeholders |
| Translation | M2M-100 | Multi-language support |
| Speech-to-Text | Whisper | Transcribing audio |
| Embeddings | bge-small, bge-base | Vector search, RAG |
| Classification | ResNet, DistilBERT | Sentiment analysis, image labeling |
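
The embedding models in this table return numeric vectors rather than text. As a rough sketch (the `embed` helper is hypothetical, and the `{ data: number[][] }` response shape is an assumption about the binding's output), calling one looks like:

```typescript
// Hypothetical helper: embed a batch of strings with bge-small.
// `ai` is the Workers AI binding (`env.AI`, configured later in this
// lesson); the response shape here is illustrative.
type AiBinding = { run: (model: string, input: unknown) => Promise<any> };

async function embed(ai: AiBinding, texts: string[]): Promise<number[][]> {
  const { data } = await ai.run("@cf/baai/bge-small-en-v1.5", { text: texts });
  return data; // one embedding vector per input string
}
```

In a Worker you would call `embed(env.AI, [...])` and hand the vectors to Vectorize, as the RAG section below describes.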

Free Tier (Beta)

During the beta period, Cloudflare offers a generous free tier for certain models. Check the Cloudflare Dashboard for the most current limits.

| Resource | Typical Free Usage |
| --- | --- |
| Inference | Free for "beta" tagged models |
| Daily Limit | Up to 1,000 requests/day (varies by model) |
| GPU Access | Built into the global network |

Running AI in a Worker

1. Configure the Binding

wrangler.toml
[ai]
binding = "AI"

2. Run Inference (LLM Example)

src/index.ts
export interface Env {
  AI: any;
}

export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    const url = new URL(request.url);
    const prompt = url.searchParams.get("prompt") || "Why is Cloudflare awesome?";

    // Run the Llama 3 model
    const response = await env.AI.run("@cf/meta/llama-3-8b-instruct", {
      messages: [
        { role: "system", content: "You are a helpful assistant." },
        { role: "user", content: prompt },
      ],
    });

    return Response.json(response);
  },
};
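
For long completions you can stream tokens to the client instead of waiting for the full response. A minimal sketch, assuming the same `AI` binding (Workers AI accepts a `stream: true` option, in which case `run()` resolves to a `ReadableStream` of server-sent events):

```typescript
interface Env {
  AI: any;
}

// Drop this object in as your Worker's `export default`.
const worker = {
  async fetch(request: Request, env: Env): Promise<Response> {
    // With `stream: true`, run() resolves to a ReadableStream of
    // server-sent events rather than a finished JSON object.
    const stream = await env.AI.run("@cf/meta/llama-3-8b-instruct", {
      messages: [{ role: "user", content: "Tell me a short story." }],
      stream: true,
    });

    // Pass the stream straight through so the client sees tokens
    // as they are generated.
    return new Response(stream, {
      headers: { "Content-Type": "text/event-stream" },
    });
  },
};
```

Streaming keeps time-to-first-token low, which matters more for perceived latency than total generation time.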

3. Image Generation Example

const response = await env.AI.run("@cf/stabilityai/stable-diffusion-xl-base-1.0", {
  prompt: "A futuristic city in the clouds, cyberpunk aesthetic",
});

// The model returns the image as binary PNG data, so serve it directly
return new Response(response, {
  headers: { "Content-Type": "image/png" },
});

Using Workers AI outside of Workers

You can also call Workers AI via a standard REST API using your Cloudflare API token.

Call AI via REST
curl https://api.cloudflare.com/client/v4/accounts/{ACCOUNT_ID}/ai/run/@cf/meta/llama-3-8b-instruct \
-H "Authorization: Bearer {API_TOKEN}" \
-d '{ "messages": [{ "role": "user", "content": "Hello!" }] }'
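
The same endpoint works from any language. Here is a sketch in TypeScript (the `runModel` helper is hypothetical; the URL and bearer-token header mirror the curl command above, and the Cloudflare API wraps model output in a `result` envelope):

```typescript
// Hypothetical helper: call Workers AI over REST from outside a Worker,
// e.g. from a Node script. accountId and apiToken come from your
// Cloudflare dashboard.
async function runModel(
  accountId: string,
  apiToken: string,
  model: string,
  input: unknown,
): Promise<any> {
  const url = `https://api.cloudflare.com/client/v4/accounts/${accountId}/ai/run/${model}`;
  const res = await fetch(url, {
    method: "POST",
    headers: {
      Authorization: `Bearer ${apiToken}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify(input),
  });
  if (!res.ok) throw new Error(`Workers AI request failed: ${res.status}`);
  return res.json();
}
```

Note that the REST route leaves Cloudflare's edge-local inference path: requests go through the public API rather than the Worker binding.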

Common Strategy: RAG (Retrieval Augmented Generation)

Workers AI is most powerful when combined with Vectorize and storage products like D1 and R2:

  1. Store documents in D1 or R2.
  2. Generate embeddings using bge-small on Workers AI.
  3. Store and search embeddings in Vectorize.
  4. Use the search results as context for Llama 3 to provide accurate answers.
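
The steps above can be sketched as a single Worker-side function. Everything here is illustrative, not a definitive implementation: `VECTORS` stands in for a Vectorize index binding, and the response shapes for the AI and query calls are assumptions.

```typescript
// Assumed binding shapes for this sketch.
interface RagEnv {
  AI: { run: (model: string, input: any) => Promise<any> };
  VECTORS: {
    query: (
      vector: number[],
      opts: { topK: number; returnMetadata?: boolean },
    ) => Promise<any>;
  };
}

async function answerWithContext(env: RagEnv, question: string): Promise<string> {
  // Step 2: embed the question with bge-small
  const { data } = await env.AI.run("@cf/baai/bge-small-en-v1.5", {
    text: [question],
  });

  // Step 3: find the most similar stored documents in Vectorize
  const results = await env.VECTORS.query(data[0], { topK: 3, returnMetadata: true });
  const context = results.matches.map((m: any) => m.metadata?.text ?? "").join("\n");

  // Step 4: ask Llama 3, grounding the answer in the retrieved context
  const result = await env.AI.run("@cf/meta/llama-3-8b-instruct", {
    messages: [
      { role: "system", content: `Answer using only this context:\n${context}` },
      { role: "user", content: question },
    ],
  });
  return result.response;
}
```

Step 1 (storing the source documents in D1 or R2) happens at ingestion time and is omitted here.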

Key Takeaways

  • Workers AI runs machine learning models on GPUs at the edge.
  • Supports LLMs, Image Gen, Translation, Speech, and Embeddings.
  • Free tier available for "beta" models during the preview period.
  • Integrated natively with the Workers ecosystem (bindings).
  • No cold starts for AI — models are pre-loaded on the network.

What's Next

  • Continue to Vectorize to learn about vector databases for AI search.