Tabby-API-Ollama

Tech Stack:
Python
ExLlamaV2
FastAPI

A fork of Tabby-API that functions as a drop-in replacement for Ollama.

Motivation

I really disliked every LLM frontend besides open-webui. But open-webui only works with Ollama, and Ollama doesn't support my favorite model format, ExLlamaV2, which is quite fast for models that fit entirely in the GPU(s). So why not make open-webui compatible with my go-to server, TabbyAPI?

The Process

I fired up open-webui and pointed it at the TabbyAPI endpoint. I read the error message, then the open-webui backend code to see what it was expecting, and then the TabbyAPI server code to figure out how to resolve it. I ultimately made all the changes in Tabby, since that was simplest and would make it a drop-in replacement for Ollama. For each endpoint open-webui tried to hit (model lists, version numbers, chat completions, and so on), I worked out what it expected and implemented it in Tabby.
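To give a sense of the shape of that work, here is a minimal sketch (not the actual Tabby code) of the kind of Ollama-style routes involved. The model name and the exact response fields are assumptions for illustration, based on my reading of what open-webui probes for.

```python
# Hypothetical sketch: Ollama-shaped endpoints served from FastAPI.
# Field names approximate Ollama's /api/version and /api/tags responses;
# the real Tabby implementation differs in the details.
from fastapi import FastAPI

app = FastAPI()

@app.get("/api/version")
async def version():
    # open-webui probes this to confirm it is talking to an "Ollama" server.
    return {"version": "0.1.0"}

@app.get("/api/tags")
async def tags():
    # Report whatever models the backend has available, reshaped into
    # the model-list format open-webui expects from Ollama.
    return {
        "models": [
            {
                "name": "my-exl2-model:latest",   # hypothetical model name
                "model": "my-exl2-model:latest",
                "modified_at": "2024-01-01T00:00:00Z",
                "size": 0,
                "digest": "",
                "details": {"format": "exl2", "family": "llama"},
            }
        ]
    }
```

The same pattern repeats for each endpoint: find out what open-webui asks for, translate it to what the backend already knows, and answer in the shape open-webui expects.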

I worked like this for a couple of days, bouncing between reading open-webui code and implementing the matching behavior in Tabby. I finally got text generating and streaming back to open-webui, but open-webui wouldn't display it; I got a JSON parsing error instead. After an embarrassingly long time, I realized the answer was in the Ollama backend all along: I just had to mimic its streaming response format in Tabby.
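As a hedged sketch of the gist: Ollama streams newline-delimited JSON objects (one complete object per line, ending with a `"done": true` chunk) rather than SSE-style `data:` lines. The fields below are my approximation of that chat stream, with a placeholder token generator standing in for the real ExLlamaV2 backend, not a copy of Tabby's code.

```python
# Sketch of Ollama-style streaming: newline-delimited JSON chunks,
# each a complete object, finished with a "done": true message.
import json
from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()

async def generate_tokens(prompt: str):
    # Placeholder token stream; the real server pulls tokens from ExLlamaV2.
    for piece in ["Hello", " there", "!"]:
        yield piece

@app.post("/api/chat")
async def chat(payload: dict):
    prompt = payload["messages"][-1]["content"]

    async def ndjson_stream():
        async for token in generate_tokens(prompt):
            chunk = {
                "model": payload.get("model", ""),
                "message": {"role": "assistant", "content": token},
                "done": False,
            }
            yield json.dumps(chunk) + "\n"
        # Final chunk signals completion; open-webui waits for this.
        yield json.dumps({
            "model": payload.get("model", ""),
            "message": {"role": "assistant", "content": ""},
            "done": True,
        }) + "\n"

    return StreamingResponse(ndjson_stream(), media_type="application/x-ndjson")
```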

With that final issue resolved, I can finally use the two together and experience the sweet relief of being able to move on with my life.