Standalone Server #21

Closed · 9 of 12 tasks
abetlen opened this issue Apr 4, 2023 · 9 comments

@abetlen (Owner) commented Apr 4, 2023

Since the server is one of the goals / highlights of this project, I'm planning to move it into a subpackage, e.g. llama-cpp-python[server] or something like that.

Work that needs to be done first:

  • Ensure compatibility with OpenAI
    • Response objects match
    • Request objects match
    • Loaded model appears under /v1/models endpoint
    • Test OpenAI client libraries
    • Unsupported parameters should be silently ignored
  • Ease-of-use
    • Integrate server as a subpackage
    • CLI tool to run the server

Future work

  • Prompt caching to improve latency
  • Support multiple models in the same server
  • Add tokenization endpoints to make it easier for small clients to calculate context window sizes
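
As a rough illustration of the request-compatibility goals above (a sketch only, not this repo's actual code): the field names follow the OpenAI /v1/completions request, and pydantic's extra = "ignore" setting is one way to silently drop unsupported parameters.

```python
# Sketch of an OpenAI-compatible /v1/completions request model (hypothetical,
# not the project's actual code).  Unknown OpenAI parameters are accepted and
# silently dropped via pydantic's `extra = "ignore"` config.
from typing import List, Optional, Union

from pydantic import BaseModel


class CreateCompletionRequest(BaseModel):
    model: Optional[str] = None        # name reported by the /v1/models endpoint
    prompt: Union[str, List[str]] = ""
    max_tokens: int = 16
    temperature: float = 0.8
    top_p: float = 0.95
    stream: bool = False
    stop: Optional[List[str]] = None

    # llama.cpp-specific extras can live alongside the OpenAI fields:
    top_k: int = 40
    repeat_penalty: float = 1.1

    class Config:
        extra = "ignore"  # unsupported parameters are ignored, not rejected
```
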
@MillionthOdin16 (Contributor) commented:

Just a note: I found a package, fastapi-code-generator, that you can feed the OpenAI OpenAPI spec into, and it will generate a server skeleton with the correct models and endpoints. Similarly, there are packages that can create test cases for the endpoints based on the API spec. This might save some time, and we can return a Not Implemented error for endpoints that our server doesn't support.
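
For example, a minimal FastAPI stub along these lines (a sketch only, with made-up endpoint choices; the skeleton generated by fastapi-code-generator would of course look different) could answer unsupported OpenAI paths with 501 until they are implemented:

```python
# Sketch: stub out not-yet-supported OpenAI endpoints with a 501 response.
# Hypothetical, not generated output and not this project's server code.
from fastapi import FastAPI
from fastapi.responses import JSONResponse

app = FastAPI(title="llama-cpp-python server (sketch)")

# Paths from the OpenAI spec that this sketch does not implement.
NOT_IMPLEMENTED_PATHS = ["/v1/edits", "/v1/images/generations", "/v1/moderations"]


def make_stub(path: str):
    async def stub() -> JSONResponse:
        # 501 tells clients "valid endpoint, just not implemented here".
        return JSONResponse(status_code=501, content={"error": f"{path} is not implemented"})
    return stub


for path in NOT_IMPLEMENTED_PATHS:
    app.post(path)(make_stub(path))
```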

abetlen changed the title from "Standalone Server as a Subpackage" to "Standalone Server" on Apr 5, 2023
@abetlen (Owner, Author) commented Apr 5, 2023

With the latest commit, we now handle all the request parameters for the /v1/completions, /v1/chat/completions, and /v1/embeddings endpoints. The server accepts additional llama.cpp-specific parameters and ignores any that we currently don't support.

The last step is really just to bundle this into the PyPI package as a subpackage so it can be installed with pip install llama-cpp-python[server] and then run with python -m llama_cpp.server, or something like that.
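
Roughly, the packaging side could look like the sketch below (hypothetical metadata; the dependency lists are assumptions, not the package's actual setup). With a llama_cpp/server/__main__.py entry point in place, python -m llama_cpp.server then only has to start uvicorn.

```python
# setup.py (sketch, not the project's actual build configuration):
# the FastAPI/uvicorn dependencies only come in via the optional extra,
# i.e. `pip install llama-cpp-python[server]`.
from setuptools import find_packages, setup

setup(
    name="llama-cpp-python",
    packages=find_packages(),
    install_requires=["typing-extensions"],   # assumed base dependency
    extras_require={
        "server": ["fastapi", "uvicorn"],     # assumed server-only dependencies
    },
)
```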

@MillionthOdin16 (Contributor) commented:

Awesome work! Just FYI, llama.cpp got some major bug fixes in the last hour that improve performance. There should no longer be a performance degradation as the context size increases. Hopefully this translates into better performance for us too 🔥

@abetlen (Owner, Author) commented Apr 5, 2023

Awesome, I'll update the package!

@abetlen (Owner, Author) commented Apr 5, 2023

@MillionthOdin16 I've pushed the updated llama.cpp and the standalone server.

Do you mind testing it for me?

Just update from pip and run MODEL=/path/to/model python3 -m llama_cpp.server
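
If it helps, here's a quick way to exercise it from the OpenAI Python client once the server is up (a sketch assuming the pre-1.0 openai package and the server's default address of http://localhost:8000; adjust host/port if yours differ):

```python
# Sketch: point the OpenAI Python client (pre-1.0 API) at the local server
# started with `MODEL=/path/to/model python3 -m llama_cpp.server`.
import openai

openai.api_key = "sk-no-key-needed"           # the local server doesn't check the key
openai.api_base = "http://localhost:8000/v1"  # assumed default host/port

completion = openai.Completion.create(
    model="local-model",                      # model name is informational here
    prompt="Q: Name the planets in the solar system. A:",
    max_tokens=64,
)
print(completion["choices"][0]["text"])
```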

abetlen pinned this issue on Apr 5, 2023
@MillionthOdin16 (Contributor) commented:

@abetlen done! I created #29 with some fixes, especially for Windows. The API is super nice. I did experience significantly slower chat_completion performance compared to the other endpoints (as you previously mentioned), but overall, super cool!

@MillionthOdin16 (Contributor) commented Apr 5, 2023

One extra note on usability: I think it would be nice to pass the model (and eventually a model folder) as an argument to llama_cpp.server instead of using an env var. That would make it more similar to other tools, I think.
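
A possible shape for that (a sketch only; the flag names here are made up, not the server's actual CLI):

```python
# Sketch: accept --model (and later --model-dir) on the command line,
# falling back to the MODEL environment variable.  Hypothetical flags.
import argparse
import os


def parse_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser(prog="llama_cpp.server")
    parser.add_argument(
        "--model",
        default=os.environ.get("MODEL"),
        help="Path to the ggml model file (falls back to $MODEL).",
    )
    parser.add_argument("--host", default="localhost", help="Interface to bind to.")
    parser.add_argument("--port", type=int, default=8000, help="Port to listen on.")
    args = parser.parse_args()
    if args.model is None:
        parser.error("a model path is required (use --model or set MODEL)")
    return args


if __name__ == "__main__":
    args = parse_args()
    print(f"Would start the server on {args.host}:{args.port} with {args.model}")
```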

@abetlen (Owner, Author) commented Apr 6, 2023

Next steps on the server (in no particular order):

  • multiple models selectable by model parameter
  • prompt caching (if possible or maybe just hack this with multiple contexts)
  • server cli options
  • logprobs
  • investigate adding /models/{model} endpoint
  • model aliasing (kind of a hack but could fix some issues)

I'll close this issue and spin these out individually.
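
For the multiple-models and model-aliasing items above, one possible shape is sketched below (not the server's actual design; llama_cpp.Llama is the real class, everything else here is made up):

```python
# Sketch: a registry that maps the request's `model` field (or an alias)
# to a loaded llama.cpp model.  Hypothetical, not the server's design.
from typing import Dict, List, Optional

from llama_cpp import Llama


class ModelRegistry:
    """Maps a requested model name (or alias) to a loaded llama.cpp model."""

    def __init__(self, model_paths: Dict[str, str], aliases: Optional[Dict[str, str]] = None):
        # model_paths: {"llama-7b": "/models/7B/ggml-model.bin", ...}
        self._models = {name: Llama(model_path=path) for name, path in model_paths.items()}
        # Aliases let OpenAI clients keep familiar names, e.g. {"gpt-3.5-turbo": "llama-7b"}.
        self._aliases = aliases or {}

    def get(self, requested: str) -> Llama:
        name = self._aliases.get(requested, requested)
        if name not in self._models:
            raise KeyError(f"unknown model: {requested!r}")
        return self._models[name]

    def list_models(self) -> List[dict]:
        # Shape loosely follows the OpenAI /v1/models response.
        return [{"id": name, "object": "model"} for name in self._models]
```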

abetlen closed this as completed on Apr 6, 2023
abetlen unpinned this issue on Apr 6, 2023
xaptronic pushed a commit to xaptronic/llama-cpp-python that referenced this issue Jun 13, 2023
@riverzhou commented:

I want to save the chat log. What are the best practices for this?
