Flaky server responses with llama 3 #6785
Llama 3 is not yet supported; please wait for:
It works perfectly fine when it's responding slowly. I do not use a chat template; I use my own client that calls
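(The commenter's client code is not included in the thread. Purely as a rough sketch, the request below targets the llama.cpp server's /completion endpoint with its documented JSON fields; the prompt, port, and sampling parameters are placeholders, not taken from the comment.)

import json
import urllib.request

# Minimal request to a locally running llama-server instance.
# Endpoint and field names follow the llama.cpp server API; the
# prompt and sampling parameters here are only placeholders.
payload = {
    "prompt": "Explain what a GGUF file is in one sentence.",
    "n_predict": 128,
    "temperature": 0.0,
}

req = urllib.request.Request(
    "http://127.0.0.1:8080/completion",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)

with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["content"])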
Feel free to reopen once those two PRs are merged.
I don't have permission to reopen issues in this repo. Also, I looked through those PRs; they have nothing to do with this problem.
Llama 3 is not supported; do you understand that the GGUF you are using is probably just wrong?
If your issue persists once you have re-converted the HF model and are running the latest server code with those PRs merged, please ping me and I will reopen.
I noticed that some of the responses I get from the llama.cpp server (latest master) are unnaturally fast for a 70B model, and it happens randomly. When this happens, the response quality is worse. The model I'm using is https://huggingface.co/NousResearch/Meta-Llama-3-70B-Instruct-GGUF/blob/main/Meta-Llama-3-70B-Instruct-Q5_K_M.gguf with the command line
llama-server -m Meta-Llama-3-70B-Instruct-Q5_K_M.gguf -c 0 -t 24 -ngl 24
The model is only partially offloaded to the GPU (with ROCm on Linux), so maybe llama.cpp somehow doesn't use all layers when it responds quickly.
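(Not part of the original report, but one way to make this flakiness easier to see is to send the same deterministic prompt several times and compare timings and output lengths; a reply that comes back much faster than the others would match the behaviour described above. The endpoint, port, and fields below are assumed from the llama.cpp server defaults; the prompt and seed are placeholders.)

import json
import time
import urllib.request

URL = "http://127.0.0.1:8080/completion"  # default llama-server address (assumed)
PROMPT = "List three facts about llamas."  # placeholder prompt

def ask() -> tuple[float, str]:
    """Send one deterministic request and return (seconds, generated text)."""
    payload = {"prompt": PROMPT, "n_predict": 64, "temperature": 0.0, "seed": 42}
    req = urllib.request.Request(
        URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    start = time.time()
    with urllib.request.urlopen(req) as resp:
        text = json.loads(resp.read())["content"]
    return time.time() - start, text

for i in range(5):
    seconds, text = ask()
    # With temperature 0 and a fixed seed the replies should be near-identical;
    # a run that finishes much faster than the others matches the flaky
    # behaviour described in this issue.
    print(f"run {i}: {seconds:6.1f}s, {len(text)} chars")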