
Fixes and Tweaks to Defaults #29


Merged · 5 commits · Apr 7, 2023

Conversation

MillionthOdin16 (Contributor) commented:

Wow. It took a while to figure out that mlock was silently causing the model to fail without an exception. 😢 haha

It's awesome now that I have it working, though. I still need to look into where the built library gets placed and where it's searched for. I think it's looked for in the same directory as the executable, but I may have had to move it around manually to make it happy; I don't know why that is.
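
As a hypothetical diagnostic (not part of llama-cpp-python itself), something like this lists whatever compiled libraries ended up next to the installed llama_cpp package, which is one likely place a ctypes-based binding would look:

# Hypothetical diagnostic snippet: show which shared libraries sit alongside
# the installed llama_cpp package. This is not the package's own loader logic,
# just a quick way to see what got built and where it landed.
import pathlib
import llama_cpp

package_dir = pathlib.Path(llama_cpp.__file__).parent
candidates = sorted(p.name for p in package_dir.iterdir() if p.suffix in {".so", ".dll", ".dylib"})
print("llama_cpp package dir:", package_dir)
print("shared libraries found:", candidates or "none")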

Summary:

  1. Added pydantic as a dependency in setup.py. The code was previously throwing errors when pydantic was not present.
    ❗ This might also need to be done for scikit-build. I installed it manually at the very start. You probably know more about the best place to do it.

  2. Set the default value of n_batch to 8 in both examples/high_level_api/fastapi_server.py and llama_cpp/server/__main__.py. This replaces the previous default value of 2048.

  3. Reduced the default value of n_threads to half of the available CPU count in both examples/high_level_api/fastapi_server.py and llama_cpp/server/__main__.py. This protects against locking up the system; I usually run only a third of the available threads myself. Users can always turn it up, but defaulting to 100 threads is kind of shocking. More details in the actual code.

  4. Disabled the use of mlock by default in both examples/high_level_api/fastapi_server.py and llama_cpp/server/__main__.py. The previous setting caused silent failures on platforms that don't support mlock, such as Windows. We could either check whether the platform supports it or let users enable it manually; some people in the llama.cpp discussions don't recommend enabling it by default anyway. (A rough sketch of these defaults follows the list.)

  5. Updated .gitignore to ignore the .idea/ folder.
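
As a rough sketch only (assuming the server keeps its configuration in a pydantic v1 BaseSettings model, which is an assumption and not necessarily the PR's exact code), the tweaked defaults could look like this:

# Illustrative sketch of the new defaults, assuming a pydantic v1 BaseSettings
# configuration model; field names mirror the llama.cpp parameters discussed above.
import multiprocessing
from pydantic import BaseSettings, Field

class Settings(BaseSettings):
    model: str  # path to the ggml model file
    n_ctx: int = 2048
    # llama.cpp's own default batch size; the old default of 2048 was heavy
    # and has been reported (perhaps wrongly) to affect generation quality.
    n_batch: int = 8
    # Half the available cores, so the host system stays responsive.
    n_threads: int = Field(default_factory=lambda: max(multiprocessing.cpu_count() // 2, 1))
    # mlock is not supported everywhere (the Windows log below shows the failure),
    # so keep it opt-in.
    use_mlock: bool = False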


For additional info, and just for future reference, this is how mlock fails on an unsupported platform:

llama_model_load: loading model from 'D:\models\gpt4all\gpt4all-lora-unfiltered-quantized-llama-nmap.bin' - please wait ...
llama_model_load: n_vocab = 32001
llama_model_load: n_ctx   = 2048
llama_model_load: n_embd  = 4096
llama_model_load: n_mult  = 256
llama_model_load: n_head  = 32
llama_model_load: n_layer = 32
llama_model_load: n_rot   = 128
llama_model_load: f16     = 2
llama_model_load: n_ff    = 11008
llama_model_load: n_parts = 1
llama_model_load: type    = 1
llama_model_load: ggml map size = 4017.70 MB
llama_model_load: ggml ctx size =  81.25 KB
llama_model_load: mem required  = 5809.78 MB (+ 1026.00 MB per state)
llama_model_load: loading tensors from 'D:\models\gpt4all\gpt4all-lora-unfiltered-quantized-llama-nmap.bin'
llama_model_load: model size =  4017.27 MB / num tensors = 291

can't mlock because it's not supported on this system     <------- Here

AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 |
INFO:     Started server process [728180]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://localhost:8000 (Press CTRL+C to quit)
INFO:     ::1:59853 - "GET / HTTP/1.1" 404 Not Found
INFO:     ::1:59853 - "GET /favicon.ico HTTP/1.1" 404 Not Found
INFO:     ::1:59853 - "GET /docs HTTP/1.1" 200 OK
INFO:     ::1:59853 - "GET /openapi.json HTTP/1.1" 200 OK
INFO:     ::1:59854 - "POST /v1/completions HTTP/1.1" 500 Internal Server Error
ERROR:    Exception in ASGI application
Traceback (most recent call last):
  File "C:\Users\Odin\Documents\GitHub\llama-cpp-python\venv\lib\site-packages\uvicorn\protocols\http\httptools_impl.py", line 398, in run_asgi
    result = await app(self.scope, self.receive, self.send)
  File "C:\Users\Odin\Documents\GitHub\llama-cpp-python\venv\lib\site-packages\uvicorn\middleware\proxy_headers.py", line 45, in __call__
    return await self.app(scope, receive, send)
  File "C:\Users\Odin\Documents\GitHub\llama-cpp-python\venv\lib\site-packages\fastapi\applications.py", line 276, in __call__
    await super().__call__(scope, receive, send)
  File "C:\Users\Odin\Documents\GitHub\llama-cpp-python\venv\lib\site-packages\starlette\applications.py", line 122, in __call__
    await self.middleware_stack(scope, receive, send)
  File "C:\Users\Odin\Documents\GitHub\llama-cpp-python\venv\lib\site-packages\starlette\middleware\errors.py", line 184, in __call__
    raise exc
  File "C:\Users\Odin\Documents\GitHub\llama-cpp-python\venv\lib\site-packages\starlette\middleware\errors.py", line 162, in __call__
    await self.app(scope, receive, _send)
  File "C:\Users\Odin\Documents\GitHub\llama-cpp-python\venv\lib\site-packages\starlette\middleware\cors.py", line 92, in __call__
    await self.simple_response(scope, receive, send, request_headers=headers)
  File "C:\Users\Odin\Documents\GitHub\llama-cpp-python\venv\lib\site-packages\starlette\middleware\cors.py", line 147, in simple_response
    await self.app(scope, receive, send)
  File "C:\Users\Odin\Documents\GitHub\llama-cpp-python\venv\lib\site-packages\starlette\middleware\exceptions.py", line 79, in __call__
    raise exc
  File "C:\Users\Odin\Documents\GitHub\llama-cpp-python\venv\lib\site-packages\starlette\middleware\exceptions.py", line 68, in __call__
    await self.app(scope, receive, sender)
  File "C:\Users\Odin\Documents\GitHub\llama-cpp-python\venv\lib\site-packages\fastapi\middleware\asyncexitstack.py", line 21, in __call__
    raise e
  File "C:\Users\Odin\Documents\GitHub\llama-cpp-python\venv\lib\site-packages\fastapi\middleware\asyncexitstack.py", line 18, in __call__
    await self.app(scope, receive, send)
  File "C:\Users\Odin\Documents\GitHub\llama-cpp-python\venv\lib\site-packages\starlette\routing.py", line 718, in __call__
    await route.handle(scope, receive, send)
  File "C:\Users\Odin\Documents\GitHub\llama-cpp-python\venv\lib\site-packages\starlette\routing.py", line 276, in handle
    await self.app(scope, receive, send)
  File "C:\Users\Odin\Documents\GitHub\llama-cpp-python\venv\lib\site-packages\starlette\routing.py", line 66, in app
    response = await func(request)
  File "C:\Users\Odin\Documents\GitHub\llama-cpp-python\venv\lib\site-packages\fastapi\routing.py", line 237, in app
    raw_response = await run_endpoint_function(
  File "C:\Users\Odin\Documents\GitHub\llama-cpp-python\venv\lib\site-packages\fastapi\routing.py", line 165, in run_endpoint_function
    return await run_in_threadpool(dependant.call, **values)
  File "C:\Users\Odin\Documents\GitHub\llama-cpp-python\venv\lib\site-packages\starlette\concurrency.py", line 41, in run_in_threadpool
    return await anyio.to_thread.run_sync(func, *args)
  File "C:\Users\Odin\Documents\GitHub\llama-cpp-python\venv\lib\site-packages\anyio\to_thread.py", line 31, in run_sync
    return await get_asynclib().run_sync_in_worker_thread(
  File "C:\Users\Odin\Documents\GitHub\llama-cpp-python\venv\lib\site-packages\anyio\_backends\_asyncio.py", line 937, in run_sync_in_worker_thread
    return await future
  File "C:\Users\Odin\Documents\GitHub\llama-cpp-python\venv\lib\site-packages\anyio\_backends\_asyncio.py", line 867, in run
    result = context.run(func, *args)
  File "C:\Users\Odin\Documents\GitHub\llama-cpp-python\llama_cpp\server\__main__.py", line 106, in create_completion
    return llama(
  File "C:\Users\Odin\Documents\GitHub\llama-cpp-python\llama_cpp\llama.py", line 527, in __call__
    return self.create_completion(
  File "C:\Users\Odin\Documents\GitHub\llama-cpp-python\llama_cpp\llama.py", line 488, in create_completion
    completion: Completion = next(completion_or_chunks)  # type: ignore

  File "C:\Users\Odin\Documents\GitHub\llama-cpp-python\llama_cpp\llama.py", line 305, in _create_completion
    assert self.ctx is not None    <------- Causes this
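
One of the options from point 4 above, checking whether the platform actually supports mlock before enabling it, could look roughly like the hypothetical helper below (this is not llama-cpp-python API; it just probes libc's mlock on POSIX and treats Windows as unsupported, matching the log above):

import ctypes
import platform

def mlock_supported() -> bool:
    # Hypothetical probe: try to lock a small buffer with libc's mlock.
    if platform.system() == "Windows":
        # The build in the log above reports "can't mlock" on Windows,
        # so treat it as unsupported rather than probing further.
        return False
    try:
        libc = ctypes.CDLL(None)
        buf = ctypes.create_string_buffer(4096)
        if libc.mlock(buf, len(buf)) != 0:
            return False
        libc.munlock(buf, len(buf))
        return True
    except (OSError, AttributeError):
        return False

# e.g. only pass use_mlock=True when mlock_supported() returns True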

…s relating to TypeDict or subclass() if the version is too old or new...
…when n_ctx was missing and n_batch was 2048.
Change the batch size to the llama.cpp default of 8. I've seen issues reported in llama.cpp where batch size affects the quality of generations (it shouldn't), but in case that's still a problem I switched to the default.

Set the auto-determined number of threads to half the system count. ggml will sometimes peg cores at 100% while doing nothing; this is being addressed upstream, but in the meantime it makes for a bad user experience when every core is pegged.
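
For reference, the same conservative defaults when using the high-level API directly might look like this (the model path is a placeholder and the prompt is arbitrary; parameter names follow llama_cpp.Llama's constructor):

# Illustrative use of the high-level API with the conservative defaults above.
import multiprocessing
from llama_cpp import Llama

llm = Llama(
    model_path="./models/ggml-model-q4_0.bin",  # placeholder path
    n_ctx=2048,
    n_batch=8,  # llama.cpp's default batch size
    n_threads=max(multiprocessing.cpu_count() // 2, 1),  # half the cores, at least 1
    use_mlock=False,  # opt in only on platforms where mlock actually works
)
output = llm("Q: Name the planets in the solar system. A:", max_tokens=32)
print(output["choices"][0]["text"])
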
MillionthOdin16 mentioned this pull request on Apr 5, 2023.

abetlen (Owner) commented on Apr 5, 2023:

Thank you!

With regard to pydantic, did you install with pip install llama-cpp-python[server]? It should come in as a dependency of fastapi, and I don't think it's used outside of the server subpackage.

MillionthOdin16 (Contributor, Author) commented:

With that command it does install pydantic properly. Will it still get pydantic if I clone the repo and do a python setup.py develop install from the repo root?

abetlen (Owner) commented on Apr 5, 2023:

Unfortunately, I think it does with python setup.py develop easy_install "llama_cpp[server]". This is related to the need to update the build/release system; setup.py is not the recommended solution, but it works for the moment. For now, let's not install pydantic for general users.

@@ -19,6 +19,7 @@
     entry_points={"console_scripts": ["llama_cpp.server=llama_cpp.server:main"]},
     install_requires=[
         "typing-extensions>=4.5.0",
+        "pydantic==1.10.7",

abetlen (Owner) commented:
Revert this for now; we can fix it properly when we migrate away from setup.py. At the moment it should only affect developers who are also working on the server (a very small number of people), versus requiring pydantic for every install (which would affect all build steps, for example).
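
A minimal sketch of that direction (an assumed layout, not the repo's actual setup.py; it also assumes the build goes through scikit-build's skbuild.setup, per the scikit-build note in the summary): keep pydantic out of install_requires and let the server extra pull in the web stack, which already depends on pydantic.

# Illustrative setup.py layout (assumed): pydantic is not a base requirement,
# and `pip install llama-cpp-python[server]` brings it in via fastapi.
from skbuild import setup  # scikit-build's wrapper around setuptools.setup

setup(
    name="llama_cpp_python",
    packages=["llama_cpp", "llama_cpp.server"],
    entry_points={"console_scripts": ["llama_cpp.server=llama_cpp.server:main"]},
    install_requires=["typing-extensions>=4.5.0"],
    extras_require={
        "server": ["fastapi", "uvicorn"],
    },
)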
