How to use off-the-shelf evaluators (Python only)
LangChain provides a suite of off-the-shelf evaluators you can use right away to evaluate your application's performance without writing any custom code. These evaluators are intended as a starting point for evaluation.
Create a dataset and set up the LangSmith client in Python to follow along:
```python
from langsmith import Client

client = Client()

# Create a dataset
examples = [
    ("Ankush", "Hello Ankush"),
    ("Harrison", "Hello Harrison"),
]

dataset_name = "Hello Set"
dataset = client.create_dataset(dataset_name=dataset_name)
inputs, outputs = zip(
    *[({"input": input}, {"expected": expected}) for input, expected in examples]
)
client.create_examples(inputs=inputs, outputs=outputs, dataset_id=dataset.id)
```
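If you want to sanity-check that the examples were created, one minimal sketch is to list them back with the client (this only reuses the dataset name defined above):

```python
# List the examples stored in the "Hello Set" dataset to confirm they were created
for example in client.list_examples(dataset_name=dataset_name):
    print(example.inputs, example.outputs)
```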
Use question and answer (correctness) evaluators
Question and answer (QA) evaluators help to measure the correctness of a response to a user query or question. If you have a dataset with reference labels or reference context docs, these are the evaluators for you!
Three QA evaluators you can load are: "qa", "context_qa", and "cot_qa". Based on our meta-evals, we recommend using "cot_qa", or Chain of Thought QA.

Here is a trivial example that uses a "cot_qa" evaluator to evaluate a simple pipeline that prefixes the input with "Hello":
```python
from langsmith import Client
from langsmith.evaluation import LangChainStringEvaluator, evaluate

cot_qa_evaluator = LangChainStringEvaluator("cot_qa")

client = Client()
evaluate(
    lambda input: "Hello " + input["input"],
    data=dataset_name,
    evaluators=[cot_qa_evaluator],
)
```
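The other two QA evaluators are loaded the same way, by name. The sketch below reuses the same dataset and target function; note that "context_qa" expects reference context in your dataset outputs, so with the trivial "Hello" dataset it is shown only for illustration:

```python
from langsmith.evaluation import LangChainStringEvaluator, evaluate

# Load the other two QA evaluators by name
qa_evaluator = LangChainStringEvaluator("qa")
context_qa_evaluator = LangChainStringEvaluator("context_qa")

evaluate(
    lambda input: "Hello " + input["input"],
    data=dataset_name,
    evaluators=[qa_evaluator, context_qa_evaluator],
)
```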
Use criteria evaluators
If you don't have ground truth reference labels, you can evaluate your run against a custom set of criteria using the "criteria" evaluator. This is helpful when there are high-level semantic aspects of your model's output you'd like to monitor that aren't captured by other explicit checks or rules.
- The "criteria" evaluator instructs an LLM to assess if a prediction satisfies the given criteria, outputting a binary score (0 or 1) for each criterion.
```python
from langsmith import Client
from langsmith.evaluation import LangChainStringEvaluator, evaluate

criteria_evaluator = LangChainStringEvaluator(
    "criteria",
    config={
        "criteria": {
            "says_hello": "Does the submission say hello?",
        }
    }
)

client = Client()
evaluate(
    lambda input: "Hello " + input["input"],
    data=dataset_name,
    evaluators=[
        criteria_evaluator,
    ],
)
```
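Because LangChainStringEvaluator forwards its config to the underlying LangChain evaluator, you can also choose which model acts as the judge. A minimal sketch, assuming langchain-openai is installed and an "llm" config key is accepted (the model name is illustrative):

```python
from langchain_openai import ChatOpenAI
from langsmith.evaluation import LangChainStringEvaluator

# Use an explicit judge model instead of the evaluator's default LLM
criteria_evaluator = LangChainStringEvaluator(
    "criteria",
    config={
        "criteria": {"says_hello": "Does the submission say hello?"},
        "llm": ChatOpenAI(model="gpt-4o", temperature=0),
    },
)
```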
Default criteria are implemented for the following aspects: conciseness, relevance, correctness, coherence, harmfulness, maliciousness, helpfulness, controversiality, misogyny, and criminality. To specify custom criteria, write a mapping of a criterion name to its description, such as:
```python
criterion = {"creativity": "Is this submission creative, imaginative, or novel?"}
criteria_evaluator = LangChainStringEvaluator(
    "labeled_criteria",
    config={"criteria": criterion}
)
```
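For the built-in aspects listed above, a minimal sketch is to pass the criterion name as a string instead of writing your own description (this assumes the underlying LangChain criteria evaluator accepts the built-in name directly):

```python
from langsmith.evaluation import LangChainStringEvaluator

# Reference a built-in criterion by name rather than supplying a custom description
conciseness_evaluator = LangChainStringEvaluator(
    "criteria",
    config={"criteria": "conciseness"},
)
```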
Evaluation scores don't have an inherent "direction" (i.e., higher is not necessarily better). The direction of the score depends on the criteria being evaluated. For example, a score of 1 for "helpfulness" means that the prediction was deemed to be helpful by the model. However, a score of 1 for "maliciousness" means that the prediction contains malicious content, which, of course, is "bad".
Use labeled criteria evaluators
If you have ground truth reference labels, you can evaluate your run against custom criteria while also providing that reference information to the LLM using the "labeled_criteria" or "labeled_score_string" evaluators.
- The "labeled_criteria" evaluator instructs an LLM to assess if a prediction satisfies the criteria, taking into account the reference label.
- The "labeled_score_string" evaluator instructs an LLM to assess the prediction against a reference label on a specified scale.
```python
from langsmith import Client
from langsmith.evaluation import LangChainStringEvaluator, evaluate

labeled_criteria_evaluator = LangChainStringEvaluator(
    "labeled_criteria",
    config={
        "criteria": {
            "helpfulness": (
                "Is this submission helpful to the user,"
                " taking into account the correct reference answer?"
            )
        }
    }
)

labeled_score_evaluator = LangChainStringEvaluator(
    "labeled_score_string",
    config={
        "criteria": {
            "accuracy": "How accurate is this prediction compared to the reference on a scale of 1-10?"
        },
        "normalize_by": 10,
    }
)

client = Client()
evaluate(
    lambda input: "Hello " + input["input"],
    data=dataset_name,
    evaluators=[
        labeled_criteria_evaluator,
        labeled_score_evaluator
    ],
)
```
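Each evaluate call records its feedback in a LangSmith experiment that you can browse in the UI. If you'd rather inspect the scores locally, a minimal sketch (assuming pandas is installed and the returned results object exposes to_pandas) is to capture the return value:

```python
results = evaluate(
    lambda input: "Hello " + input["input"],
    data=dataset_name,
    evaluators=[labeled_criteria_evaluator, labeled_score_evaluator],
)

# Convert the experiment results to a pandas DataFrame for local inspection
df = results.to_pandas()
print(df.head())
```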