
Fixes and Tweaks to Defaults #29


Merged · 5 commits · Apr 7, 2023

Conversation

MillionthOdin16 (Contributor) commented:

Wow. It took a while to figure out that mlock was silently causing the model to fail without an exception. 😢 haha

It's awesome now that I have it working, though. I still need to look into where the built library gets placed and where it's searched for. I think it's looked for in the same directory as the executable, but I may have had to move it around manually to make it happy; I don't know why that is.
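
As a hypothetical diagnostic (not part of llama-cpp-python itself), something like this lists whatever compiled libraries ended up next to the installed llama_cpp package, which is one likely place a ctypes-based binding would look:

# Hypothetical diagnostic snippet: show which shared libraries sit alongside
# the installed llama_cpp package. This is not the package's own loader logic,
# just a quick way to see what got built and where it landed.
import pathlib
import llama_cpp

package_dir = pathlib.Path(llama_cpp.__file__).parent
candidates = sorted(p.name for p in package_dir.iterdir() if p.suffix in {".so", ".dll", ".dylib"})
print("llama_cpp package dir:", package_dir)
print("shared libraries found:", candidates or "none")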

Summary:

  1. Added pydantic as a dependency in setup.py. The code was previously throwing errors when pydantic was not present.
    ❗ This might also need to be done for scikit-build. I installed it manually at the very start. You probably know more about the best place to do it.

  2. Set the default value of n_batch to 8 in both examples/high_level_api/fastapi_server.py and llama_cpp/server/__main__.py. This replaces the previous default value of 2048.

  3. Reduced the default value of n_threads to half of the available CPU count in both examples/high_level_api/fastapi_server.py and llama_cpp/server/__main__.py. This protects against locking up the system; I usually run only a third of the available threads myself. Users can always turn it up, but defaulting to 100 threads is kind of shocking. More details in the actual code.

  4. Disabled the use of mlock by default in both examples/high_level_api/fastapi_server.py and llama_cpp/server/__main__.py. The previous setting caused silent failures on platforms that don't support mlock, such as Windows. We could either check whether the platform supports it or let users enable it manually; some people in the llama.cpp discussions don't recommend enabling it by default anyway. (A rough sketch of these defaults follows the list.)

  5. Updated .gitignore to ignore the .idea/ folder.
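
As a rough sketch only (assuming the server keeps its configuration in a pydantic v1 BaseSettings model, which is an assumption and not necessarily the PR's exact code), the tweaked defaults could look like this:

# Illustrative sketch of the new defaults, assuming a pydantic v1 BaseSettings
# configuration model; field names mirror the llama.cpp parameters discussed above.
import multiprocessing
from pydantic import BaseSettings, Field

class Settings(BaseSettings):
    model: str  # path to the ggml model file
    n_ctx: int = 2048
    # llama.cpp's own default batch size; the old default of 2048 was heavy
    # and has been reported (perhaps wrongly) to affect generation quality.
    n_batch: int = 8
    # Half the available cores, so the host system stays responsive.
    n_threads: int = Field(default_factory=lambda: max(multiprocessing.cpu_count() // 2, 1))
    # mlock is not supported everywhere (the Windows log below shows the failure),
    # so keep it opt-in.
    use_mlock: bool = False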


For additional info, and just for future reference, this is how mlock fails on an unsupported platform:

llama_model_load: loading model from 'D:\models\gpt4all\gpt4all-lora-unfiltered-quantized-llama-nmap.bin' - please wait ...
llama_model_load: n_vocab = 32001
llama_model_load: n_ctx   = 2048
llama_model_load: n_embd  = 4096
llama_model_load: n_mult  = 256
llama_model_load: n_head  = 32
llama_model_load: n_layer = 32
llama_model_load: n_rot   = 128
llama_model_load: f16     = 2
llama_model_load: n_ff    = 11008
llama_model_load: n_parts = 1
llama_model_load: type    = 1
llama_model_load: ggml map size = 4017.70 MB
llama_model_load: ggml ctx size =  81.25 KB
llama_model_load: mem required  = 5809.78 MB (+ 1026.00 MB per state)
llama_model_load: loading tensors from 'D:\models\gpt4all\gpt4all-lora-unfiltered-quantized-llama-nmap.bin'
llama_model_load: model size =  4017.27 MB / num tensors = 291

can't mlock because it's not supported on this system     <------- Here

AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 |
INFO:     Started server process [728180]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://localhost:8000 (Press CTRL+C to quit)
INFO:     ::1:59853 - "GET / HTTP/1.1" 404 Not Found
INFO:     ::1:59853 - "GET /favicon.ico HTTP/1.1" 404 Not Found
INFO:     ::1:59853 - "GET /docs HTTP/1.1" 200 OK
INFO:     ::1:59853 - "GET /openapi.json HTTP/1.1" 200 OK
INFO:     ::1:59854 - "POST /v1/completions HTTP/1.1" 500 Internal Server Error
ERROR:    Exception in ASGI application
Traceback (most recent call last):
  File "C:\Users\Odin\Documents\GitHub\llama-cpp-python\venv\lib\site-packages\uvicorn\protocols\http\httptools_impl.py", line 398, in run_asgi
    result = await app(self.scope, self.receive, self.send)
  File "C:\Users\Odin\Documents\GitHub\llama-cpp-python\venv\lib\site-packages\uvicorn\middleware\proxy_headers.py", line 45, in __call__
    return await self.app(scope, receive, send)
  File "C:\Users\Odin\Documents\GitHub\llama-cpp-python\venv\lib\site-packages\fastapi\applications.py", line 276, in __call__
    await super().__call__(scope, receive, send)
  File "C:\Users\Odin\Documents\GitHub\llama-cpp-python\venv\lib\site-packages\starlette\applications.py", line 122, in __call__
    await self.middleware_stack(scope, receive, send)
  File "C:\Users\Odin\Documents\GitHub\llama-cpp-python\venv\lib\site-packages\starlette\middleware\errors.py", line 184, in __call__
    raise exc
  File "C:\Users\Odin\Documents\GitHub\llama-cpp-python\venv\lib\site-packages\starlette\middleware\errors.py", line 162, in __call__
    await self.app(scope, receive, _send)
  File "C:\Users\Odin\Documents\GitHub\llama-cpp-python\venv\lib\site-packages\starlette\middleware\cors.py", line 92, in __call__
    await self.simple_response(scope, receive, send, request_headers=headers)
  File "C:\Users\Odin\Documents\GitHub\llama-cpp-python\venv\lib\site-packages\starlette\middleware\cors.py", line 147, in simple_response
    await self.app(scope, receive, send)
  File "C:\Users\Odin\Documents\GitHub\llama-cpp-python\venv\lib\site-packages\starlette\middleware\exceptions.py", line 79, in __call__
    raise exc
  File "C:\Users\Odin\Documents\GitHub\llama-cpp-python\venv\lib\site-packages\starlette\middleware\exceptions.py", line 68, in __call__
    await self.app(scope, receive, sender)
  File "C:\Users\Odin\Documents\GitHub\llama-cpp-python\venv\lib\site-packages\fastapi\middleware\asyncexitstack.py", line 21, in __call__
    raise e
  File "C:\Users\Odin\Documents\GitHub\llama-cpp-python\venv\lib\site-packages\fastapi\middleware\asyncexitstack.py", line 18, in __call__
    await self.app(scope, receive, send)
  File "C:\Users\Odin\Documents\GitHub\llama-cpp-python\venv\lib\site-packages\starlette\routing.py", line 718, in __call__
    await route.handle(scope, receive, send)
  File "C:\Users\Odin\Documents\GitHub\llama-cpp-python\venv\lib\site-packages\starlette\routing.py", line 276, in handle
    await self.app(scope, receive, send)
  File "C:\Users\Odin\Documents\GitHub\llama-cpp-python\venv\lib\site-packages\starlette\routing.py", line 66, in app
    response = await func(request)
  File "C:\Users\Odin\Documents\GitHub\llama-cpp-python\venv\lib\site-packages\fastapi\routing.py", line 237, in app
    raw_response = await run_endpoint_function(
  File "C:\Users\Odin\Documents\GitHub\llama-cpp-python\venv\lib\site-packages\fastapi\routing.py", line 165, in run_endpoint_function
    return await run_in_threadpool(dependant.call, **values)
  File "C:\Users\Odin\Documents\GitHub\llama-cpp-python\venv\lib\site-packages\starlette\concurrency.py", line 41, in run_in_threadpool
    return await anyio.to_thread.run_sync(func, *args)
  File "C:\Users\Odin\Documents\GitHub\llama-cpp-python\venv\lib\site-packages\anyio\to_thread.py", line 31, in run_sync
    return await get_asynclib().run_sync_in_worker_thread(
  File "C:\Users\Odin\Documents\GitHub\llama-cpp-python\venv\lib\site-packages\anyio\_backends\_asyncio.py", line 937, in run_sync_in_worker_thread
    return await future
  File "C:\Users\Odin\Documents\GitHub\llama-cpp-python\venv\lib\site-packages\anyio\_backends\_asyncio.py", line 867, in run
    result = context.run(func, *args)
  File "C:\Users\Odin\Documents\GitHub\llama-cpp-python\llama_cpp\server\__main__.py", line 106, in create_completion
    return llama(
  File "C:\Users\Odin\Documents\GitHub\llama-cpp-python\llama_cpp\llama.py", line 527, in __call__
    return self.create_completion(
  File "C:\Users\Odin\Documents\GitHub\llama-cpp-python\llama_cpp\llama.py", line 488, in create_completion
    completion: Completion = next(completion_or_chunks)  # type: ignore

  File "C:\Users\Odin\Documents\GitHub\llama-cpp-python\llama_cpp\llama.py", line 305, in _create_completion
    assert self.ctx is not None    <------- Causes this
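
One of the options from point 4 above, checking whether the platform actually supports mlock before enabling it, could look roughly like the hypothetical helper below (this is not llama-cpp-python API; it just probes libc's mlock on POSIX and treats Windows as unsupported, matching the log above):

import ctypes
import platform

def mlock_supported() -> bool:
    # Hypothetical probe: try to lock a small buffer with libc's mlock.
    if platform.system() == "Windows":
        # The build in the log above reports "can't mlock" on Windows,
        # so treat it as unsupported rather than probing further.
        return False
    try:
        libc = ctypes.CDLL(None)
        buf = ctypes.create_string_buffer(4096)
        if libc.mlock(buf, len(buf)) != 0:
            return False
        libc.munlock(buf, len(buf))
        return True
    except (OSError, AttributeError):
        return False

# e.g. only pass use_mlock=True when mlock_supported() returns True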

…s relating to TypeDict or subclass() if the version is too old or new...
…when n_ctx was missing and n_batch was 2048.
Change the batch size to the llama.cpp default of 8. I've seen issues reported in llama.cpp where batch size affects the quality of generations (it shouldn't), but in case that's still a problem I switched to the default.

Set the auto-determined number of threads to half the system count. ggml will sometimes peg cores at 100% while doing nothing; this is being addressed upstream, but in the meantime it makes for a bad user experience when every core is pegged.
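
For reference, the same conservative defaults when using the high-level API directly might look like this (the model path is a placeholder and the prompt is arbitrary; parameter names follow llama_cpp.Llama's constructor):

# Illustrative use of the high-level API with the conservative defaults above.
import multiprocessing
from llama_cpp import Llama

llm = Llama(
    model_path="./models/ggml-model-q4_0.bin",  # placeholder path
    n_ctx=2048,
    n_batch=8,  # llama.cpp's default batch size
    n_threads=max(multiprocessing.cpu_count() // 2, 1),  # half the cores, at least 1
    use_mlock=False,  # opt in only on platforms where mlock actually works
)
output = llm("Q: Name the planets in the solar system. A:", max_tokens=32)
print(output["choices"][0]["text"])
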
MillionthOdin16 mentioned this pull request on Apr 5, 2023.

abetlen (Owner) commented on Apr 5, 2023:

Thank you!

With regard to pydantic, did you install with pip install llama-cpp-python[server]? It should come in as a dependency of fastapi, and I don't think it's used outside of the server subpackage.

MillionthOdin16 (Contributor, Author) commented:

With that command it does install pydantic properly. Will it still get pydantic if I clone the repo and do a python setup.py develop install from the repo root?

abetlen (Owner) commented on Apr 5, 2023:

Unfortunately, I think it does with python setup.py develop easy_install "llama_cpp[server]". This is related to the need to update the build/release system; setup.py is not the recommended solution, but it works for the moment. For now, let's not install pydantic for general users.

@@ -19,6 +19,7 @@
     entry_points={"console_scripts": ["llama_cpp.server=llama_cpp.server:main"]},
     install_requires=[
         "typing-extensions>=4.5.0",
+        "pydantic==1.10.7",

abetlen (Owner) commented:
Revert this for now; we can fix it properly when we migrate away from setup.py. At the moment it should only affect developers who are also working on the server (a very small number of people), versus requiring pydantic for every install (which would affect all build steps, for example).
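
A minimal sketch of that direction (an assumed layout, not the repo's actual setup.py; it also assumes the build goes through scikit-build's skbuild.setup, per the scikit-build note in the summary): keep pydantic out of install_requires and let the server extra pull in the web stack, which already depends on pydantic.

# Illustrative setup.py layout (assumed): pydantic is not a base requirement,
# and `pip install llama-cpp-python[server]` brings it in via fastapi.
from skbuild import setup  # scikit-build's wrapper around setuptools.setup

setup(
    name="llama_cpp_python",
    packages=["llama_cpp", "llama_cpp.server"],
    entry_points={"console_scripts": ["llama_cpp.server=llama_cpp.server:main"]},
    install_requires=["typing-extensions>=4.5.0"],
    extras_require={
        "server": ["fastapi", "uvicorn"],
    },
)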
