
Running an LLM in a CI pipeline

Overview

With the recent explosion of AI and large language models (LLMs), I've been brainstorming how to take advantage of AI capabilities within a CI/CD pipeline.

Most of the major AI providers have a REST API, so I could of course easily use that in a CI pipeline, but there are many situations where this isn't an option:

  • Cost: As many "AI wrapper" companies quickly discovered, these APIs are expensive, and queries in a CI pipeline that could run hundreds of times per day add up quickly.
  • Security: Many organizations handling sensitive or proprietary data don't want their information sent to a third party like OpenAI or Google.

To solve these issues, I wanted to see if it's possible to run an LLM locally in a CI job, to which I can send queries without worrying about API cost or revealing sensitive data.

How it's done

Tools

All the tools I'm using in this article are free to use.

Name            Description
Ollama          A free, open-source tool for running LLMs locally
GitLab CI       A free CI/CD pipeline system developed by GitLab for running automated jobs in the same environment as your git repository
GitHub Actions  Same as GitLab CI, but provided by GitHub

Note

In this article I won't be getting too deep into exactly what Ollama is and how it works. To learn more about it, check out their GitHub.

Setup

To start, you'll need either a GitHub or GitLab account, and you'll need to create your first repository. Once that's done, create a basic CI/CD pipeline--we'll name it ci:

# GitHub Actions
name: ci
on:
  push:

# GitLab CI
workflow:
  name: ci

This creates a basic structure for a pipeline that runs on every commit. To limit the pipeline to run only on a certain branch, modify GitHub's on.push option or GitLab's workflow:rules. For example:

# GitHub Actions
name: ci
on:
  push:
    branches:
      - main

# GitLab CI
workflow:
  name: ci
  rules:
    - if: $CI_COMMIT_BRANCH == 'main'

Run an LLM in a job

The ollama CLI is great for running a local, interactive chat session in your terminal. But for a non-interactive, automated CI job it's best to interface with the Ollama API. To do this, we first need to define our ollama job and run Ollama as a service accessible from the job.

# GitHub Actions
jobs:
  ollama:
    runs-on: ubuntu-latest
    services:
      ollama:
        image: ollama/ollama

# GitLab CI
ollama:
  services:
    - name: ollama/ollama
      alias: ollama

Next we'll add our script. When we request a response from the LLM we'll need to specify a large language model to generate that response. These models can be found in Ollama's library. Any model will work, but keep in mind that models with more parameters--while providing much better responses--are much larger in size. The 671 billion parameter version of deepseek-r1, for example, is 404GB in size. As such, it's ideal to use smaller models such as Meta's llama3.2.
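
Most models in the library are also published in multiple parameter sizes, selectable by tag. For example, assuming the 1b tag is still listed for llama3.2 in Ollama's library, the request body for the pull API we'll use below only needs to name that tag to fetch the smaller 1 billion parameter variant:

# Pull the 1B-parameter tag instead of the default (assumes the tag exists in the library)
curl -sS -X POST -d '{"model":"llama3.2:1b","stream":false}' ollama:11434/api/pull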

Before generating a response, we'll need to pull the model we want using Ollama's pull API. Then we generate the response with the generate API. Both endpoints are reachable from the job at the service's hostname, ollama, on Ollama's default port 11434. Any Docker image will work for this job as long as it can send web requests with a tool like wget or curl. For this example we'll be using curl with the alpine/curl image.

# GitHub Actions
container: alpine/curl
steps:
  - name: Generate response
    run: |
      curl -sS -X POST -d '{"model":"llama3.2","stream":false}' ollama:11434/api/pull
      curl -sS -X POST -d '{"model":"llama3.2","stream":false,"prompt":"Hello world"}' ollama:11434/api/generate

# GitLab CI
image: alpine/curl
script: |
  curl -sS -X POST -d '{"model":"llama3.2","stream":false}' ollama:11434/api/pull
  curl -sS -X POST -d '{"model":"llama3.2","stream":false,"prompt":"Hello world"}' ollama:11434/api/generate

Note

Ideally, the pull and generate operations would run in separate steps. GitHub provides the steps functionality for this; however, the comparable functionality in GitLab (run) is still experimental. For simplicity, this article runs both commands in a single script on both GitHub and GitLab.

To accomplish the same in separate steps would look like this:

# GitHub Actions
container: alpine/curl
steps:
  - name: Pull model
    run: curl -sS -X POST -d '{"model":"llama3.2","stream":false}' ollama:11434/api/pull

  - name: Generate response
    run: curl -sS -X POST -d '{"model":"llama3.2","stream":false,"prompt":"Hello world"}' ollama:11434/api/generate

# GitLab CI
image: alpine/curl
run:
  - name: Pull model
    script: curl -sS -X POST -d '{"model":"llama3.2","stream":false}' ollama:11434/api/pull

  - name: Generate response
    script: curl -sS -X POST -d '{"model":"llama3.2","stream":false,"prompt":"Hello world"}' ollama:11434/api/generate

That's all we need--let's see the response:

> curl -sS -X POST -d '{"model":"llama3.2","stream":false}' ollama:11434/api/pull
{"status":"success"}
> curl -sS -X POST -d '{"model":"llama3.2","stream":false,"prompt":"Hello world"}' ollama:11434/api/generate
{"model":"llama3.2","created_at":"2025-02-06T18:46:52.362892453Z","response":"Hello! It's nice to meet you. Is there something I can help you with or would you like to chat?","done":true,"done_reason":"stop","context":[128004,9125,128007,276,39766,3303,33025,2696,22,8790,220,2366,11,271,128009,128006,882,128007,271,9906,1917,128009,128006,78191,128007,271,9906,0,1102,596,6555,311,3449,499,13,2209,1070,2555,358,649,1520,499,449,477,1053,499,1093,311,6369,30],"total_duration":9728821911,"load_duration":2319403269,"prompt_eval_count":27,"prompt_eval_duration":3406000000,"eval_count":25,"eval_duration":4001000000}

Parse the output

This is great, but the JSON output is a bit verbose. We can pare the response down to just the fields we care about using the jq command.

# GitHub Actions
steps:
  - name: Install jq
    run: apk add jq
  - name: Generate response
    run: |
      curl -sS -X POST -d '{"model":"llama3.2","stream":false}' ollama:11434/api/pull | jq -r .status
      curl -sS -X POST -d '{"model":"llama3.2","stream":false,"prompt":"Hello world"}' ollama:11434/api/generate | jq -r .response

# GitLab CI
before_script: apk add jq
script: |
  curl -sS -X POST -d '{"model":"llama3.2","stream":false}' ollama:11434/api/pull | jq -r .status
  curl -sS -X POST -d '{"model":"llama3.2","stream":false,"prompt":"Hello world"}' ollama:11434/api/generate | jq -r .response

This looks much better:

> curl -sS -X POST -d '{"model":"llama3.2","stream":false}' ollama:11434/api/pull | jq -r .status
success
> curl -sS -X POST -d '{"model":"llama3.2","stream":false,"prompt":"Hello world"}' ollama:11434/api/generate | jq -r .response
Hello! It's nice to meet you. Is there something I can help you with or would you like to chat?
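
One caveat: because each command's exit status is taken from jq at the end of the pipe, the job can still pass when the API call itself fails--jq just prints null for a missing field. If you'd rather have the job fail in that case, a minimal hardening sketch (not part of the final pipeline below) using curl's --fail flag could look like this:

# Capture the body first so curl's exit code isn't masked by jq; --fail turns HTTP errors into non-zero exits
RESPONSE=$(curl -sS --fail -X POST -d '{"model":"llama3.2","stream":false,"prompt":"Hello world"}' ollama:11434/api/generate) || exit 1
echo "$RESPONSE" | jq -r .response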

Put it all together

This is our final product:

# GitHub Actions workflow (e.g. .github/workflows/ci.yml)
name: ci
on:
  push:

jobs:
  ollama:
    runs-on: ubuntu-latest
    services:
      ollama:
        image: ollama/ollama
    container: alpine/curl
    steps:
      - name: Install jq
        run: apk add jq
      - name: Generate response
        run: |
          curl -sS -X POST -d '{"model":"llama3.2","stream":false}' ollama:11434/api/pull | jq -r .status
          curl -sS -X POST -d '{"model":"llama3.2","stream":false,"prompt":"Hello world"}' ollama:11434/api/generate | jq -r .response

# GitLab CI pipeline (.gitlab-ci.yml)
workflow:
  name: ci

ollama:
  image: alpine/curl
  services:
    - name: ollama/ollama
      alias: ollama
  before_script: apk add jq
  script: |
    curl -sS -X POST -d '{"model":"llama3.2","stream":false}' ollama:11434/api/pull | jq -r .status
    curl -sS -X POST -d '{"model":"llama3.2","stream":false,"prompt":"Hello world"}' ollama:11434/api/generate | jq -r .response

Summary

With just a few lines of code, we're able to run an Ollama server, pull down a large language model, and generate responses--all completely local to our CI job. We can now use this capability to generate release notes, automate code review, write documentation--the possibilities are endless.
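
As a quick illustration of the first of those ideas, here's a rough sketch of a release-notes prompt. It assumes the same ollama service and alpine/curl image as above, and that the repository (and git) is available inside the job; the prompt wording is only a placeholder:

# Sketch: summarize recent commit subjects into release notes
apk add git jq
COMMITS=$(git log --oneline -n 20)
# jq -n builds the request body so the commit text is safely escaped as JSON
jq -n --arg prompt "Write brief release notes for these commits: $COMMITS" '{model:"llama3.2",stream:false,prompt:$prompt}' | curl -sS -X POST --data-binary @- ollama:11434/api/generate | jq -r .response

From there, the output could be posted as a merge request comment, attached to a release, or committed to a changelog.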
