
Workers AI

Learning Focus

By the end of this lesson you will understand how to run serverless AI models on Workers, which models are available, and how to use the free tier.

What Is Workers AI?

Workers AI allows you to run machine learning models on Cloudflare's global network of GPUs. Instead of managing GPU servers or paying high fees for external AI APIs, you can run inference (predictions) directly inside your Workers.

flowchart LR
USER["User"] --> WORKER["Worker\n(Edge)"]
WORKER -->|"Inference Request"| GPU["Cloudflare GPU\n(Local PoP)"]
GPU -->|"Prediction"| WORKER
WORKER -->|"AI Response"| USER

style GPU fill:#7c3aed,color:#fff,stroke:#6d28d9
style WORKER fill:#f6821f,color:#fff,stroke:#e5711e

Model Categories

| Category | Typical Models | Use Case |
| --- | --- | --- |
| Text Generation | Llama 3, Mistral, Gemma | Chatbots, summarization, content gen |
| Text-to-Image | Stable Diffusion | Generating unique art or placeholders |
| Translation | M2M-100 | Multi-language support |
| Speech-to-Text | Whisper | Transcribing audio |
| Embeddings | bge-small, bge-base | Vector search, RAG |
| Classification | ResNet, DistilBERT | Sentiment analysis, image labeling |
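
The embedding models in this table return numeric vectors rather than text. As a rough sketch (the `embed` helper is hypothetical, and the `{ data: number[][] }` response shape is an assumption about the binding's output), calling one looks like:

```typescript
// Hypothetical helper: embed a batch of strings with bge-small.
// `ai` is the Workers AI binding (`env.AI`, configured later in this
// lesson); the response shape here is illustrative.
type AiBinding = { run: (model: string, input: unknown) => Promise<any> };

async function embed(ai: AiBinding, texts: string[]): Promise<number[][]> {
  const { data } = await ai.run("@cf/baai/bge-small-en-v1.5", { text: texts });
  return data; // one embedding vector per input string
}
```

In a Worker you would call `embed(env.AI, [...])` and hand the vectors to Vectorize, as the RAG section below describes.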

Free Tier (Beta)

During the beta period, Cloudflare offers a generous free tier for certain models. Check the Cloudflare Dashboard for the most current limits.

| Resource | Typical Free Usage |
| --- | --- |
| Inference | Free for "beta" tagged models |
| Daily Limit | Up to 1,000 requests/day (varies by model) |
| GPU Access | Built into the global network |

Running AI in a Worker

1. Configure the Binding

wrangler.toml
[ai]
binding = "AI"

2. Run Inference (LLM Example)

src/index.ts
export interface Env {
  AI: any;
}

export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    const url = new URL(request.url);
    const prompt = url.searchParams.get("prompt") || "Why is Cloudflare awesome?";

    // Run the Llama 3 model
    const response = await env.AI.run("@cf/meta/llama-3-8b-instruct", {
      messages: [
        { role: "system", content: "You are a helpful assistant." },
        { role: "user", content: prompt },
      ],
    });

    return Response.json(response);
  },
};
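
For long completions you can stream tokens to the client instead of waiting for the full response. A minimal sketch, assuming the same `AI` binding (Workers AI accepts a `stream: true` option, in which case `run()` resolves to a `ReadableStream` of server-sent events):

```typescript
interface Env {
  AI: any;
}

// Drop this object in as your Worker's `export default`.
const worker = {
  async fetch(request: Request, env: Env): Promise<Response> {
    // With `stream: true`, run() resolves to a ReadableStream of
    // server-sent events rather than a finished JSON object.
    const stream = await env.AI.run("@cf/meta/llama-3-8b-instruct", {
      messages: [{ role: "user", content: "Tell me a short story." }],
      stream: true,
    });

    // Pass the stream straight through so the client sees tokens
    // as they are generated.
    return new Response(stream, {
      headers: { "Content-Type": "text/event-stream" },
    });
  },
};
```

Streaming keeps time-to-first-token low, which matters more for perceived latency than total generation time.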

3. Image Generation Example

const response = await env.AI.run("@cf/stabilityai/stable-diffusion-xl-base-1.0", {
  prompt: "A futuristic city in the clouds, cyberpunk aesthetic",
});

// The model returns the image as binary PNG data, so serve it directly
return new Response(response, {
  headers: { "Content-Type": "image/png" },
});

Using Workers AI outside of Workers

You can also call Workers AI via a standard REST API using your Cloudflare API token.

Call AI via REST
curl https://api.cloudflare.com/client/v4/accounts/{ACCOUNT_ID}/ai/run/@cf/meta/llama-3-8b-instruct \
-H "Authorization: Bearer {API_TOKEN}" \
-d '{ "messages": [{ "role": "user", "content": "Hello!" }] }'
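
The same endpoint works from any language. Here is a sketch in TypeScript (the `runModel` helper is hypothetical; the URL and bearer-token header mirror the curl command above, and the Cloudflare API wraps model output in a `result` envelope):

```typescript
// Hypothetical helper: call Workers AI over REST from outside a Worker,
// e.g. from a Node script. accountId and apiToken come from your
// Cloudflare dashboard.
async function runModel(
  accountId: string,
  apiToken: string,
  model: string,
  input: unknown,
): Promise<any> {
  const url = `https://api.cloudflare.com/client/v4/accounts/${accountId}/ai/run/${model}`;
  const res = await fetch(url, {
    method: "POST",
    headers: {
      Authorization: `Bearer ${apiToken}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify(input),
  });
  if (!res.ok) throw new Error(`Workers AI request failed: ${res.status}`);
  return res.json();
}
```

Note that the REST route leaves Cloudflare's edge-local inference path: requests go through the public API rather than the Worker binding.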

Common Strategy: RAG (Retrieval Augmented Generation)

Workers AI is most powerful when combined with Vectorize and storage products like D1 and R2:

  1. Store documents in D1 or R2.
  2. Generate embeddings using bge-small on Workers AI.
  3. Store and search embeddings in Vectorize.
  4. Use the search results as context for Llama 3 to provide accurate answers.
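
The steps above can be sketched as a single Worker-side function. Everything here is illustrative, not a definitive implementation: `VECTORS` stands in for a Vectorize index binding, and the response shapes for the AI and query calls are assumptions.

```typescript
// Assumed binding shapes for this sketch.
interface RagEnv {
  AI: { run: (model: string, input: any) => Promise<any> };
  VECTORS: {
    query: (
      vector: number[],
      opts: { topK: number; returnMetadata?: boolean },
    ) => Promise<any>;
  };
}

async function answerWithContext(env: RagEnv, question: string): Promise<string> {
  // Step 2: embed the question with bge-small
  const { data } = await env.AI.run("@cf/baai/bge-small-en-v1.5", {
    text: [question],
  });

  // Step 3: find the most similar stored documents in Vectorize
  const results = await env.VECTORS.query(data[0], { topK: 3, returnMetadata: true });
  const context = results.matches.map((m: any) => m.metadata?.text ?? "").join("\n");

  // Step 4: ask Llama 3, grounding the answer in the retrieved context
  const result = await env.AI.run("@cf/meta/llama-3-8b-instruct", {
    messages: [
      { role: "system", content: `Answer using only this context:\n${context}` },
      { role: "user", content: question },
    ],
  });
  return result.response;
}
```

Step 1 (storing the source documents in D1 or R2) happens at ingestion time and is omitted here.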

Key Takeaways

  • Workers AI runs machine learning models on GPUs at the edge.
  • Supports LLMs, Image Gen, Translation, Speech, and Embeddings.
  • Free tier available for "beta" models during the preview period.
  • Integrated natively with the Workers ecosystem (bindings).
  • No cold starts for AI — models are pre-loaded on the network.

What's Next

  • Continue to Vectorize to learn about vector databases for AI search.