
Commit 9f5275d

Added two new caches: FileCache, MemoryCache plus more tests
* added two new caches: FileCache, MemoryCache
* raise if similarity gets unexpected response
* updated interactive mode and added test cases
* updated web to work with Evaluator.Match
* separated evaluator classes into sub files
1 parent 703ca1e commit 9f5275d

28 files changed: +733 -221 lines

README.md

Lines changed: 26 additions & 3 deletions
@@ -1,6 +1,6 @@
 # 🏋️‍♂️ BenchLLM 🏋️‍♀️
 
-🦾 Continuous Integration for LLM powered applications 🦙🦅🤖
+🦾 Continuous Integration for LLM powered applications 🦙🦅🤖
 
 [![GitHub Repo stars](https://img.shields.io/github/stars/v7labs/BenchLLM?style=social)](https://github.com/v7labs/BenchLLM/stargazers)
 [![Twitter Follow](https://img.shields.io/twitter/follow/V7Labs?style=social)](https://twitter.com/V7Labs)
@@ -10,7 +10,6 @@
 
 BenchLLM is actively used at [V7](https://www.v7labs.com) for improving our LLM applications and is now Open Sourced under MIT License to share with the wider community
 
-
 ## 💡 Get help on [Discord](https://discord.gg/x7ExfHb3bG) or [Tweet at us](https://twitter.com/V7Labs)
 
 <hr/>
@@ -26,7 +25,7 @@ Use BenchLLM to:
 
 > ⚠️ **NOTE:** BenchLLM is in the early stage of development and will be subject to rapid changes.
 >
->For bug reporting, feature requests, or contributions, please open an issue or submit a pull request (PR) on our GitHub page.
+> For bug reporting, feature requests, or contributions, please open an issue or submit a pull request (PR) on our GitHub page.
 
 ## 🧪 BenchLLM Testing Methodology
 
@@ -116,6 +115,16 @@ The non interactive evaluators also supports `--workers N` to run in the evaluat
 $ bench run --evaluator string-match --workers 5
 ```
 
+To accelerate the evaluation process, BenchLLM uses a cache. If a (prediction, expected) pair has been evaluated in the past and a cache was used, the evaluation output will be saved for future evaluations. There are several types of caches:
+
+- `memory`, only caches output values during the current run. This is particularly useful when running with `--retry-count N`
+- `file`, stores the cache at the end of the run as a JSON file in output/cache.json. This is the default behavior.
+- `none`, does not use any cache.
+
+```bash
+$ bench run examples --cache memory
+```
+
 ### 🧮 Eval
 
 While _bench run_ runs each test function and then evaluates their output, it can often be beneficial to separate these into two steps. For example, if you want a person to manually do the evaluation or if you want to try multiple evaluation methods on the same function.
@@ -163,6 +172,20 @@ results = evaluator.run()
 print(results)
 ```
 
+If you want to incorporate caching and run multiple parallel evaluation jobs, you can modify your evaluator as follows:
+
+```python
+from benchllm.cache import FileCache
+
+...
+
+evaluator = FileCache(StringMatchEvaluator(workers=2), Path("path/to/cache.json"))
+evaluator.load(predictions)
+results = evaluator.run()
+```
+
+In this example, `FileCache` is used to enable caching, and the `workers` parameter of `StringMatchEvaluator` is set to `2` to allow for parallel evaluations. The cache results are saved in a file specified by `Path("path/to/cache.json")`.
+
 ## ☕️ Commands
 
 - `bench add`: Add a new test to a suite.
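Note: the README hunk above only shows the on-disk `FileCache`. A minimal in-memory counterpart, following the same setup as the README snippet (a sketch only, not part of this commit; `predictions` is assumed to come from the same place as in that example):

```python
from benchllm import StringMatchEvaluator
from benchllm.cache import MemoryCache

# In-memory counterpart of the FileCache example above: evaluations are reused
# within this run (handy with --retry-count) but nothing is written to disk.
# `predictions` is assumed to come from the same place as in the README example.
evaluator = MemoryCache(StringMatchEvaluator(workers=2))
evaluator.load(predictions)
results = evaluator.run()
print(evaluator.num_cache_hits, evaluator.num_cache_misses)
```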

benchllm/__init__.py

Lines changed: 2 additions & 2 deletions
@@ -1,10 +1,10 @@
 import inspect
 from pathlib import Path
-from typing import Any, Callable, Generic, Optional, Type, TypeVar
+from typing import Callable, Type, TypeVar
 
 from .data_types import Evaluation, Prediction, Test # noqa
 from .evaluator import Evaluator, SemanticEvaluator, StringMatchEvaluator # noqa
-from .input_types import ChatInput, SimilarityInput
+from .input_types import ChatInput, SimilarityInput # noqa
 from .similarity import semantically_similar # noqa
 from .singleton import TestSingleton # noqa
 from .tester import Tester # noqa

benchllm/cache.py

Lines changed: 96 additions & 0 deletions
@@ -0,0 +1,96 @@
+import json
+from pathlib import Path
+from typing import Optional
+
+from benchllm.data_types import Evaluation, Prediction
+from benchllm.evaluator import Evaluator
+from benchllm.input_types import Json
+from benchllm.listener import EvaluatorListener
+
+
+class MemoryCache(Evaluator):
+    """Caches the results of the evaluator in memory"""
+
+    def __init__(self, evaluator: Evaluator):
+        super().__init__(workers=evaluator.workers)
+        self._data: dict = {}
+        self._evaluator = evaluator
+        self._num_cache_misses = 0
+        self._num_cache_hits = 0
+
+    def _key(self, answer1: Json, answer2: Json) -> str:
+        key1, key2 = json.dumps([answer1, answer2]), json.dumps([answer2, answer1])
+        return key1 if key1 < key2 else key2
+
+    def lookup(self, answer1: Json, answer2: Json) -> Optional[bool]:
+        return self._data.get(self._key(answer1, answer2), None)
+
+    def store(self, answer1: Json, answer2: Json, value: bool) -> None:
+        key = self._key(answer1, answer2)
+        self._data[key] = value
+
+    def evaluate_prediction(self, prediction: Prediction) -> Optional[Evaluator.Match]:
+        uncached_expectations = []
+        for expected in prediction.test.expected:
+            lookup = self.lookup(expected, prediction.output)
+            if lookup is None:
+                uncached_expectations.append(expected)
+            elif lookup:
+                # If we find a positive match we can stop comparing and just return.
+                # For negative matches we still need to check the other expected answers.
+                self._num_cache_hits += 1
+                return Evaluator.Match(prediction=prediction.output, expected=expected)
+
+        # If all expectations were found in the cache but were negative matches,
+        # we increment the cache hits counter and return None as there's no match.
+        if not uncached_expectations:
+            self._num_cache_hits += 1
+            return None
+
+        self._num_cache_misses += 1
+        # set prediction.test.expected to only the ones that were not cached
+        prediction = Prediction(**prediction.dict())
+        prediction.test.expected = uncached_expectations
+        result = self._evaluator.evaluate_prediction(prediction)
+        if result:
+            self.store(result.expected, result.prediction, True)
+        else:
+            for expected in prediction.test.expected:
+                self.store(expected, prediction.output, False)
+        return result
+
+    @property
+    def num_cache_hits(self) -> int:
+        return self._num_cache_hits
+
+    @property
+    def num_cache_misses(self) -> int:
+        return self._num_cache_misses
+
+
+class FileCache(MemoryCache, EvaluatorListener):
+    """Caches the results of the evaluator in a json file"""
+
+    def __init__(self, evaluator: Evaluator, path: Path):
+        super().__init__(evaluator)
+        self._path = path
+        self.add_listener(self)
+        self._load()
+
+    def _load(self) -> None:
+        if self._path.exists():
+            try:
+                cache = json.loads(self._path.read_text(encoding="UTF-8"), parse_int=str)
+                if cache["version"] != "1":
+                    raise ValueError("Unsupported cache version")
+                self._data = cache["entries"]
+            except Exception:
+                print(f"Failed to load cache file {self._path}")
+                self._data = {}
+
+    def _save(self) -> None:
+        cache = {"entries": self._data, "version": "1"}
+        self._path.write_text(json.dumps(cache, indent=4), encoding="UTF-8")
+
+    def evaluate_ended(self, evaluations: list[Evaluation]) -> None:
+        self._save()
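Note: the cache key built by `MemoryCache._key` is order-insensitive, which is easy to miss when skimming the diff. A small sketch, assuming `benchllm` is installed, that exercises `store`/`lookup` directly:

```python
from benchllm import StringMatchEvaluator
from benchllm.cache import MemoryCache

# Wrap any evaluator; the wrapped evaluator is only consulted on a cache miss.
cache = MemoryCache(StringMatchEvaluator(workers=1))

# _key() serialises the pair in both orders and keeps the smaller string,
# so (a, b) and (b, a) share a single cache entry.
cache.store("Paris", "The capital of France is Paris", True)
print(cache.lookup("The capital of France is Paris", "Paris"))  # True
print(cache.lookup("Paris", "Berlin"))  # None: not cached yet
```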

benchllm/cli/commands/evaluate.py

Lines changed: 8 additions & 4 deletions
@@ -1,13 +1,13 @@
 from pathlib import Path
 
+from benchllm.cache import FileCache
 from benchllm.cli.listener import ReportListener, RichCliListener
-from benchllm.cli.utils import get_evaluator
-from benchllm.evaluator import load_prediction_files
-from benchllm.utils import find_json_yml_files
+from benchllm.cli.utils import add_cache, get_evaluator
+from benchllm.utils import find_json_yml_files, load_prediction_files
 
 
 def evaluate_predictions(
-    file_or_dir: list[Path], model: str, output_dir: Path, workers: int, evaluator_name: str
+    file_or_dir: list[Path], model: str, output_dir: Path, workers: int, evaluator_name: str, cache: str
 ) -> bool:
     files = find_json_yml_files(file_or_dir)
 
@@ -17,6 +17,10 @@ def evaluate_predictions(
     load_prediction_files(file_or_dir)
 
     evaluator = get_evaluator(evaluator_name, model, workers)
+    evaluator = add_cache(cache, evaluator, output_dir.parent / "cache.json")
+
+    cli_listener.set_evaulator(evaluator)
+
     evaluator.add_listener(cli_listener)
     evaluator.add_listener(report_listener)
     for file in files:
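Note: `add_cache` is imported from `benchllm.cli.utils`, but its implementation is not part of the visible diff. A plausible sketch of such a helper, inferred from the call sites above and the `--cache` values documented in the README; purely illustrative, not the actual implementation:

```python
from pathlib import Path

from benchllm.cache import FileCache, MemoryCache
from benchllm.evaluator import Evaluator


def add_cache(cache: str, evaluator: Evaluator, path: Path) -> Evaluator:
    """Hypothetical helper: wrap the evaluator according to the --cache option."""
    if cache == "file":
        return FileCache(evaluator, path)
    if cache == "memory":
        return MemoryCache(evaluator)
    return evaluator  # "none": leave the evaluator unwrapped
```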

benchllm/cli/commands/run_suite.py

Lines changed: 7 additions & 1 deletion
@@ -2,8 +2,9 @@
 
 import typer
 
+from benchllm.cache import FileCache
 from benchllm.cli.listener import ReportListener, RichCliListener
-from benchllm.cli.utils import get_evaluator
+from benchllm.cli.utils import add_cache, get_evaluator
 from benchllm.tester import Tester
 from benchllm.utils import find_files
 
@@ -17,6 +18,7 @@ def run_suite(
     workers: int,
     evaluator_name: str,
     retry_count: int,
+    cache: str,
 ) -> bool:
     files = find_files(file_search_paths)
     if not files:
@@ -45,6 +47,10 @@
         return True
 
     evaluator = get_evaluator(evaluator_name, model, workers)
+    evaluator = add_cache(cache, evaluator, output_dir.parent / "cache.json")
+
+    cli_listener.set_evaulator(evaluator)
+
     evaluator.add_listener(cli_listener)
     evaluator.add_listener(report_listener)
     evaluator.load(tester.predictions)

benchllm/cli/evaluator.py

Lines changed: 0 additions & 78 deletions
This file was deleted.

benchllm/cli/evaluator/__init__.py

Lines changed: 2 additions & 0 deletions
@@ -0,0 +1,2 @@
+from benchllm.cli.evaluator.interactive import InteractiveEvaluator # noqa
+from benchllm.cli.evaluator.web import WebEvaluator # noqa

benchllm/cli/evaluator/interactive.py

Lines changed: 31 additions & 0 deletions
@@ -0,0 +1,31 @@
+from typing import Optional
+
+import click
+import typer
+
+from benchllm.data_types import Prediction
+from benchllm.evaluator import Evaluator
+
+
+class InteractiveEvaluator(Evaluator):
+    def evaluate_prediction(self, prediction: Prediction) -> Optional[Evaluator.Match]:
+        header = (
+            f'{typer.style("Does ", bold=True)}'
+            f"{typer.style(prediction.output, fg=typer.colors.BRIGHT_BLUE, bold=True)}"
+            f'{typer.style(" match any of the following expected prompts?", bold=True)}'
+        )
+        typer.echo("")
+        typer.echo(header)
+
+        for i, expected in enumerate(prediction.test.expected, start=1):
+            typer.secho(f"{i}. ", fg=typer.colors.BRIGHT_BLUE, bold=True, nl=False)
+            typer.secho(expected, bold=True)
+
+        options = [str(idx) for idx, _ in enumerate(prediction.test.expected, start=1)] + ["n"]
+
+        prompt_string = f"[{typer.style('matching number', fg=typer.colors.GREEN, bold=True)} or {typer.style('n', fg=typer.colors.RED, bold=True)}]"
+        click_choice = click.Choice(options)
+        response = typer.prompt(prompt_string, default="n", type=click_choice, show_choices=False).lower()
+        if response == "n":
+            return None
+        return Evaluator.Match(prediction=prediction.output, expected=prediction.test.expected[int(response) - 1])
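Note: since `InteractiveEvaluator` is just another `Evaluator`, it can be combined with the new caches so manual verdicts are remembered between runs. A sketch of that pairing, assuming a worker count of 1 and an example cache path; this wiring is not part of the commit:

```python
from pathlib import Path

from benchllm.cache import FileCache
from benchllm.cli.evaluator import InteractiveEvaluator

# Illustrative pairing: persist manual verdicts so the same
# (prediction, expected) pair is not asked about again on the next run.
evaluator = FileCache(InteractiveEvaluator(workers=1), Path("output/cache.json"))
# evaluator.load(predictions)   # predictions from a Tester run or prediction files
# results = evaluator.run()
```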

benchllm/cli/evaluator/web.py

Lines changed: 47 additions & 0 deletions
@@ -0,0 +1,47 @@
+import signal
+from typing import Optional
+
+import typer
+from pywebio import session
+from pywebio.input import radio
+from pywebio.output import put_markdown
+
+from benchllm.data_types import Prediction
+from benchllm.evaluator import Evaluator
+
+
+class WebEvaluator(Evaluator):
+    def __init__(self) -> None:
+        super().__init__(workers=1)
+
+        @session.defer_call
+        def on_close() -> None:
+            print("shutting down")
+            typer.secho(
+                f"The evaluation was interrupted. Run bench eval to start again", fg=typer.colors.RED, bold=True
+            )
+            # sys.exit doesn't work here, so we have to raise a signal to kill the process
+            signal.raise_signal(signal.SIGINT)
+
+        put_markdown("# BenchLLM Web Evaluator")
+
+    def evaluate_prediction(self, prediction: Prediction) -> Optional[Evaluator.Match]:
+        test_name = prediction.test.file_path or prediction.test.id
+
+        put_markdown(f"## {test_name}")
+        put_markdown(f"*Question*: `{prediction.test.input}`")
+        put_markdown(f"*Prediction*: `{prediction.output}`")
+
+        table = [["Question:", f"{prediction.test.input}", ""], ["Prediction:", prediction.output], ""]
+        label = f"Question: {prediction.test.input}Prediction: {prediction.output}"
+
+        options: list[dict[str, Optional[int | str]]] = [
+            {"label": expected, "value": idx} for idx, expected in enumerate(prediction.test.expected)
+        ]
+        options.append({"label": "None", "value": None, "selected": True})
+        answer = radio("Pick the matching answer", options=options, required=True)
+
+        if answer and isinstance(answer, int):
+            return Evaluator.Match(prediction=prediction.output, expected=prediction.test.expected[answer])
+        else:
+            return None
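Note: both new CLI evaluators implement the same single-method contract: `evaluate_prediction` returns an `Evaluator.Match` or `None`. A minimal custom evaluator under that contract, as a sketch; the class name and strict equality check are illustrative, not part of this commit:

```python
from typing import Optional

from benchllm.data_types import Prediction
from benchllm.evaluator import Evaluator


class ExactMatchEvaluator(Evaluator):
    """Illustrative sketch: accept a prediction only if it equals an expected answer verbatim."""

    def evaluate_prediction(self, prediction: Prediction) -> Optional[Evaluator.Match]:
        for expected in prediction.test.expected:
            if expected == prediction.output:
                return Evaluator.Match(prediction=prediction.output, expected=expected)
        return None
```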
