Support for Large Language Models¶
Large Language Models (LLMs) are finding more and more applications as their capacity and availability grow. Even though skorch is not primarily focused on working with pre-trained models, it does provide a few integrations with the Hugging Face ecosystem and thus with the pre-trained models hosted there. This document describes some of the LLM use cases that skorch supports.
Using language models as zero- and few-shot classifiers¶
In this section, we will show how to use skorch to perform zero-shot and few-shot classification.
Getting started with zero-shot classification¶
One application of LLMs is zero-shot and few-shot classification. This means that, without its weights being updated in a training step, the model can make useful predictions about which class best fits the input data.
As an example, let’s assume we have customer reviews and would like to know their sentiment, i.e. whether they are positive or negative. A customer review could look like this:
I’m very happy with this new smartphone. The display is excellent and the battery lasts for days. My only complaint is the camera, which could be better. Overall, I can highly recommend this product.
Thanks to its ability to understand text, an LLM should be perfectly capable of classifying this review as either positive or negative. With the help of skorch and pre-trained Hugging Face models, we can perform this type of prediction in a convenient way. Let’s see how this looks in code:
from skorch.llm import ZeroShotClassifier
clf = ZeroShotClassifier('bigscience/bloomz-1b1')
clf.fit(X=None, y=['positive', 'negative'])
review = """I'm very happy with this new smartphone. The display is excellent
and the battery lasts for days. My only complaint is the camera, which could
be better. Overall, I can highly recommend this product."""
clf.predict([review]) # returns 'positive'
clf.predict_proba([review]) # returns array([[0.05547452, 0.94452548]])
Let’s unpack this step by step. First of all, to run this, we need to install the Hugging Face transformers library in our environment. To do this, run:
python -m pip install transformers
Then, we import the skorch class ZeroShotClassifier. This class takes care of all the heavy lifting, like loading the LLM and generating the predictions. As expected from skorch, it is scikit-learn compatible, so you can use it for grid searching, etc. (more on that later).
As you can see, we initialize the model simply by passing the argument 'bigscience/bloomz-1b1'. What does this mean? This string is the name of a Large Language Model hosted on Hugging Face. It is called “bloomz” and, as implied by the name, it is the variant with roughly 1 billion parameters (so it’s a “small” large language model by today’s standards). You can browse the Hugging Face models page to find more models that may be suitable for the task.
Note
It is also possible to pass a model and tokenizer directly when initializing ZeroShotClassifier, like this: ZeroShotClassifier(model=model, tokenizer=tokenizer). In this case, the model and tokenizer are not downloaded from Hugging Face. This can be useful if you want to use your own models, which works as long as they are compatible with Hugging Face transformers. In particular, they have to implement the .generate method.
After initializing the model, we fit it. What does “fitting” mean in the context of zero-shot learning? In actuality, there is no fitting. The only thing that happens under the hood is that skorch prepares a few things, such as downloading the model and tokenizer. This is why we pass X=None: we don’t actually need any training data for zero-shot learning. We do, however, need to pass y=['positive', 'negative'], because this information will be used by the model to determine which labels it is allowed to predict.
Then, we take the example review from above and pass it to the .predict method. As expected, the model predicts 'positive', which is indeed the sentiment that best fits the review.
When calling .predict_proba, we get back 0.055 and 0.945. The first value corresponds to “negative” and the second to “positive”. The order of these results is determined by the alphabetical order of the labels; check clf.classes_ if you’re unsure. So the model predicts a 94.5% probability that the label is positive.
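To make the column order concrete, the following sketch pairs each probability with its label in plain Python. It only assumes, as described above, that the columns follow the alphabetically sorted labels. The numbers are copied from the example output, and the variable names are chosen for illustration.

```python
# Labels in the order they were passed to fit
labels = ['positive', 'negative']

# Assumption (from the text above): the columns of predict_proba follow the
# alphabetically sorted labels, which is what clf.classes_ would show
labels_in_column_order = sorted(labels)

# One row of predict_proba output, copied from the example above
proba_row = [0.05547452, 0.94452548]

# Pair every label with its probability and pick the most likely one
proba_by_label = dict(zip(labels_in_column_order, proba_row))
predicted = max(proba_by_label, key=proba_by_label.get)

print(proba_by_label)
print(predicted)
```

With a fitted classifier, the sorted list would simply be replaced by clf.classes_.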
Of course, sentiment analysis is not the only thing that you can do with LLMs. Maybe you would like to know whether the review mentions the shopping experience on your e-commerce website. Or you want to identify all phone reviews that mention the battery life? This information could be part of a bigger machine learning pipeline (say, helping customers find phones with good battery life according to reviews). There is almost no limit to what can be done.
Prompting¶
One important aspect we haven’t mentioned yet is prompting. As you may know, to get the best results from an LLM, the prompt has to be crafted carefully. There is a lot of information on the web about how best to prompt LLMs, so we won’t repeat that here. But the nice thing about using ZeroShotClassifier is that it only takes a few lines of code to change the prompt and see if it improves the results or not.
Given the example of predicting the sentiment of customer reviews, let’s assume we have a (small) dataset of manually labeled data X and y. Now let’s say we want to test which of two prompts results in better accuracy. This is how we can achieve it:
from sklearn.metrics import accuracy_score
X, y = ... # your data
# first the default prompt
clf_default = ZeroShotClassifier('bigscience/bloomz-1b1')
clf_default.fit(X=None, y=['negative', 'positive'])
accuracy_default = accuracy_score(y, clf_default.predict(X))
# now a custom prompt
my_prompt = """Your job is to analyze the sentiment of customer reviews.
The available sentiments are: {labels}
The customer review is:
```
{text}
```
Your response:"""
clf_my_prompt = ZeroShotClassifier('bigscience/bloomz-1b1', prompt=my_prompt)
clf_my_prompt.fit(X=None, y=['negative', 'positive'])
accuracy_my_prompt = accuracy_score(y, clf_my_prompt.predict(X))
In this example, we check whether the default prompt that skorch provides or our own customized prompt results in better accuracy. For our own prompt, we have to take care to include two placeholders, {labels} and {text}. The {labels} placeholder is where the possible labels are placed, in this case “negative” and “positive”. The {text} placeholder is for the input taken from X. Also note that we delimit the text input using “```”. Although it’s not strictly necessary, it is often a good idea to let the LLM know what part of the text belongs to the review and what part belongs to the instructions. Try experimenting with different delimiters.
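To get a feeling for what the LLM ends up seeing, it can help to fill in the placeholders by hand. The sketch below assumes that the placeholders behave like Python's str.format fields and that the labels are rendered as a Python list; the exact rendering inside skorch may differ. The helpers missing_placeholders and fill_prompt are hypothetical and not part of skorch.

```python
# Build the delimiter programmatically; it is three backticks, as in the
# prompt shown above
DELIM = "`" * 3

my_prompt = "\n".join([
    "Your job is to analyze the sentiment of customer reviews.",
    "The available sentiments are: {labels}",
    "The customer review is:",
    DELIM,
    "{text}",
    DELIM,
    "Your response:",
])


def missing_placeholders(prompt, required=("{labels}", "{text}")):
    """Hypothetical helper: list the required placeholders a prompt lacks."""
    return [p for p in required if p not in prompt]


def fill_prompt(prompt, labels, text):
    """Hypothetical helper: substitute both placeholders via str.format."""
    return prompt.format(labels=labels, text=text)


print(missing_placeholders(my_prompt))
filled = fill_prompt(my_prompt, ['negative', 'positive'], "The battery lasts for days.")
print(filled)
```

Running a check like missing_placeholders before an expensive evaluation is a cheap way to catch a prompt that silently drops the labels or the input text.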
Grid searching LLM parameters¶
Investigating the best prompt in the way described above can become quite tedious if we have a lot of prompts and metrics. Also, there might be other hyper-parameters we want to check. This is where sklearn.model_selection.GridSearchCV enters the stage. Since ZeroShotClassifier is sklearn-compatible, we can just plug it into a grid search and let sklearn do the tedious work for us. Let’s see how that works in practice:
from sklearn.model_selection import GridSearchCV
from skorch.llm import DEFAULT_PROMPT_ZERO_SHOT
params = {
    'model_name': ['bigscience/bloomz-1b1', 'gpt2', 'tiiuae/falcon-7b-instruct'],
    'prompt': [DEFAULT_PROMPT_ZERO_SHOT, my_prompt],
}
metrics = ['accuracy', 'neg_log_loss']
search = GridSearchCV(clf, param_grid=params, cv=2, scoring=metrics, refit=False)
search.fit(X, y)
print(search.cv_results_)
In this example, we grid search over three different LLMs and two different prompts, and calculate two scores for each combination.
Be aware that predicting with LLMs can be quite slow, depending on your available hardware, so this grid search could take quite a while. If you have a CUDA-capable GPU with sufficient memory, it is recommended to pass device='cuda' to ZeroShotClassifier.
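Before starting a long-running search, it can be worth counting how much work it implies. The following sketch enumerates the parameter combinations of the grid above with itertools.product. The prompt strings are placeholders standing in for the real prompts, and the snippet only counts candidates; it does not run any model.

```python
from itertools import product

params = {
    'model_name': ['bigscience/bloomz-1b1', 'gpt2', 'tiiuae/falcon-7b-instruct'],
    # Placeholder strings standing in for DEFAULT_PROMPT_ZERO_SHOT and my_prompt
    'prompt': ['<default prompt>', '<my prompt>'],
}
cv = 2  # number of cross-validation splits, as in the grid search above

# Every combination of parameter values is one candidate
candidates = [dict(zip(params, values)) for values in product(*params.values())]

# Each candidate is fitted and scored once per cross-validation split
n_rounds = len(candidates) * cv

print(len(candidates), n_rounds)
```

Since every round involves predicting with an LLM on the validation split, adding one more model or prompt to the grid increases the total runtime noticeably.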
Few-shot classification¶
As research has shown, providing a few examples to the LLM about how to perform its task can result in much improved performance. This technique is called few-shot learning and we support this in skorch as well. It works almost the same as zero-shot learning, with a few notable differences that we’ll explain in a minute. First, let’s check out a code example:
from sklearn.metrics import accuracy_score
from skorch.llm import FewShotClassifier
X_train, y_train, X_test, y_test = ... # your data
clf = FewShotClassifier('bigscience/bloomz-1b1', max_samples=5)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
accuracy_score(y_test, y_pred)
This example is almost identical to what we saw previously. We need to import FewShotClassifier and provide an LLM name. We also set the max_samples parameter here. This indicates how many samples we want to use for few-shot learning. Those will later be added to the prompt.
In contrast to zero-shot learning, during the fit call, we pass some actual data. This is because from the passed X_train and y_train, 5 samples will be picked to augment the prompt with examples. skorch will try to pick examples such that all labels are represented at least once, if possible.
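The idea of covering every label at least once can be sketched in a few lines of plain Python. This is not skorch's actual selection code; pick_examples is a hypothetical helper that only illustrates one way to satisfy the described property, and the toy data is made up.

```python
import random


def pick_examples(X, y, max_samples=5, seed=0):
    """Hypothetical helper: pick up to max_samples (text, label) pairs,
    covering each label once before filling the remaining slots."""
    rng = random.Random(seed)
    indices = list(range(len(X)))
    rng.shuffle(indices)

    chosen, seen_labels = [], set()
    # First pass: one example per label, as far as max_samples allows
    for i in indices:
        if len(chosen) >= max_samples:
            break
        if y[i] not in seen_labels:
            chosen.append(i)
            seen_labels.add(y[i])
    # Second pass: fill the remaining slots with examples not chosen yet
    for i in indices:
        if len(chosen) >= max_samples:
            break
        if i not in chosen:
            chosen.append(i)
    return [(X[i], y[i]) for i in chosen]


# Made-up toy data
X_train = ["great phone", "awful battery", "love it", "broke quickly",
           "fine", "superb", "bad"]
y_train = ["positive", "negative", "positive", "negative",
           "positive", "positive", "negative"]

examples = pick_examples(X_train, y_train, max_samples=5)
print(examples)
```

If max_samples is smaller than the number of distinct labels, not every label can be represented, which is why the guarantee only holds "if possible".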
Since the examples are picked from the data and their label is shown to the LLM, it is important in this case to split train and test data, which is something we didn’t need to bother with when using zero-shot learning. Of course, if you want to later run a grid search, sklearn will automatically do this for you.
The number of samples taken to augment the prompt is controlled by the max_samples parameter. By default, this value is 5, i.e. 5 samples from X_train and y_train are used. The samples that are not used don’t actually influence the outcome. Therefore, you should keep X_train and y_train as small as possible. This leaves more samples for your validation/testing data.
The rest is no different from what we saw earlier. We can call .predict and .predict_proba and calculate scores based on those predictions.
When it comes to testing different prompts, the approach is almost the same as for zero-shot classification. One difference is that on top of placeholders for {labels} and {text}, there should be one more placeholder for {examples}. This is where the few-shot examples will be placed.
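As an illustration, here is what a custom few-shot prompt with all three placeholders could look like once it is filled in. The template text, the format_examples helper, and the way each example is rendered are assumptions made for this sketch; skorch's own default prompt and example formatting may look different.

```python
# Hypothetical few-shot prompt template with the three required placeholders
few_shot_prompt = "\n".join([
    "Your job is to analyze the sentiment of customer reviews.",
    "The available sentiments are: {labels}",
    "Here are some examples:",
    "{examples}",
    "The customer review is:",
    "{text}",
    "Your response:",
])


def format_examples(pairs):
    """Hypothetical formatting of (text, label) pairs as prompt lines."""
    return "\n".join(
        f"Review: {text}\nSentiment: {label}" for text, label in pairs
    )


pairs = [("great phone", "positive"), ("awful battery", "negative")]

# Assumption: placeholders are filled like str.format fields
filled = few_shot_prompt.format(
    labels=['negative', 'positive'],
    examples=format_examples(pairs),
    text="The display is excellent.",
)
print(filled)
```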
One big disadvantage of few-shot learning is that since the prompt is augmented with examples, it becomes much longer. Not only will this lead to slower prediction times, it also makes it more likely that the prompt length will exceed the context window of the LLM. If that happens, we will get a warning from Hugging Face transformers. If in doubt, just run a grid search on the max_samples parameter; you will see how it affects the scores and inference time.
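A rough way to see this effect without loading a model is to build prompts with a growing number of examples and measure their length. The sketch below uses a made-up prompt builder and counts whitespace-separated words as a crude stand-in for tokens. The limit CONTEXT_WINDOW is a hypothetical number chosen for illustration; real context windows are model specific and measured in tokens.

```python
def build_prompt(n_examples):
    """Hypothetical prompt builder that repeats one demo example n times."""
    example = "Review: The battery lasts for days.\nSentiment: positive"
    examples = "\n".join([example] * n_examples)
    return (
        "Classify the review.\nExamples:\n"
        f"{examples}\n"
        "Review: Great display.\nSentiment:"
    )


def rough_length(prompt):
    """Crude stand-in for a tokenizer: count whitespace-separated words."""
    return len(prompt.split())


# Hypothetical limit, only for illustration
CONTEXT_WINDOW = 100

# Prompt length for different numbers of few-shot examples
lengths = {n: rough_length(build_prompt(n)) for n in (0, 1, 5, 20)}
too_long = [n for n, length in lengths.items() if length > CONTEXT_WINDOW]

print(lengths)
print(too_long)
```

For a real measurement, the word count would be replaced by the length of the tokenized prompt, using the tokenizer that belongs to the chosen LLM.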
Detecting issues with LLMs¶
LLMs are a bit of a black box, so it can be difficult to detect possible issues. skorch provides some options to help a little bit with noticing issues in the first place.
In general, if the model doesn’t perform as expected, you should find that reflected in the metrics. In the examples above, if the prompt was not well chosen or if the LLM was not up to the task, we would expect the accuracy to have a low value. This should be a first indicator that something is wrong.
To dig deeper, it can be useful to figure out if the probabilities that the model assigns to the labels are low. At first glance, this is not easy to detect. If we call predict_proba, the probabilities of the labels will always sum to 1 for each sample, as is expected from probabilities. However, if we initialize the classifier with the parameter probas_sum_to_1=False, we will receive the unnormalized probabilities.
For example, given that we have set this option, let’s assume that the returned probabilities for “negative” and “positive” are 0.1 and 0.2. This means that the LLM assigns a probability of 0.7 to neither of these two labels. If we observe this, it is a strong indicator that the prompt is not working well for this specific LLM, as it tries to steer the generated text in the wrong direction. Consider adjusting the prompt, choosing a different LLM, or performing few-shot learning if you aren’t already.
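The arithmetic behind this example is short enough to spell out. The sketch below takes the two unnormalized values from the text, computes the probability mass that goes to neither label, and shows what the same row would look like after normalization. It assumes that normalization means dividing each value by the row total.

```python
# Unnormalized probabilities for one sample, taken from the example above
raw = {'negative': 0.1, 'positive': 0.2}

total = sum(raw.values())

# Probability mass the LLM assigns to anything other than the two labels
leftover = 1.0 - total

# Assumption: normalizing means dividing each value by the row total
normalized = {label: p / total for label, p in raw.items()}

print(round(leftover, 3))
print({label: round(p, 3) for label, p in normalized.items()})
```

After normalization, the row looks like a fairly confident prediction for “positive”, which is exactly why the low total is hard to spot without the unnormalized values.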
skorch provides more options to detect such issues. Set the argument error_low_prob to "warn" and you will get a warning when the (unnormalized) probabilities are too low. Set the argument to "raise" to raise an error instead.
If you find that the model only occasionally assigns low probabilities to the labels, you may want to set error_low_prob='return_none'. In that case, skorch will not complain about low probabilities, but for each sample with low probabilities, .predict will actually return None instead of the label with the highest probability (.predict_proba is unaffected). You can later decide what to do with those particular predictions.
If you make use of any of those options, you should also change threshold_low_prob. This is a float that indicates what probability is actually considered to be “low”. For the sentiment example from above, if we set this value to 0.5, it means that the probability is considered low if the sum of the probabilities for both “negative” and “positive” is less than 0.5.
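The threshold check itself is easy to reproduce on unnormalized probabilities, which can be handy when exploring a sensible threshold value. The sketch below is not skorch code: flag_low_prob is a hypothetical helper, the rows are made-up numbers, and the last step merely mimics the behavior described above where low-probability samples are predicted as None.

```python
def flag_low_prob(proba_rows, threshold_low_prob):
    """Hypothetical helper: True for each row whose total label
    probability is below the threshold."""
    return [sum(row) < threshold_low_prob for row in proba_rows]


labels = ['negative', 'positive']

# Made-up unnormalized probabilities, one row per sample
rows = [
    [0.1, 0.2],   # total 0.3: low
    [0.05, 0.9],  # total 0.95: fine
    [0.4, 0.3],   # total 0.7: fine
]

flags = flag_low_prob(rows, threshold_low_prob=0.5)

# Mimic returning None for samples with a low total probability, and the
# label with the highest probability otherwise
preds = [
    None if low else labels[row.index(max(row))]
    for row, low in zip(rows, flags)
]

print(flags)
print(preds)
```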
A common issue you may discover is that the model always returns a very low probability. How can we make some progress in that case? A tip is to inspect what the model would actually generate for a given prompt. This will help to figure out what the LLM is trying to achieve, which may guide us towards finding a better prompt. Generating the output is not difficult because skorch just uses Hugging Face transformers models, so we can use them to generate sequences without any constraint. Here is a code snippet:
clf = ZeroShotClassifier(..., probas_sum_to_1=False)
clf.fit(X, y)
y_proba = clf.predict_proba(X)
# we notice that y_proba values are quite low, let's check what the LLM tries
# to do:
prompt = clf.get_prompt(X[0]) # get prompt for 1st sample from X
inputs = clf.tokenizer_(prompt, return_tensors='pt').to(clf.device_)
# adjust the min_new_tokens argument as needed
output = clf.model_.generate(**inputs, min_new_tokens=20)
generation = clf.tokenizer_.decode(output[0])
print(generation)
Now we can see what the LLM actually tries to generate if we’re not forcing it to predict one of the labels. This can often help us understand the underlying issue better.
Some examples of issues this helps identify:
- If the model doesn’t return the label but instead generates new example inputs, it might be confused about the structure of the expected response; carefully rewording the prompt or using few-shot learning may help here.
- If the model tries to insert a spurious new line/space before the label, add it to your prompt. If the model still insists, change the label to contain that new line/space, e.g. from “positive” to “ positive”.
- If the model doesn’t produce any output, check that clf.model_.max_length is not set too low.
Hopefully, these tips can help resolve the most common issues.
Advantages¶
What are some advantages of using skorch for zero-shot and few-shot classification?
- Working with few or even no labeled samples: Supervised ML methods are not an option when there is not enough labeled data. Using LLMs might provide a solution that is good enough for your use case.
- Forcing the LLM to output one of the labels: By using ZeroShotClassifier and FewShotClassifier, you can ensure that the LLM will only ever predict one of the desired labels. Usually, when working with LLMs, this can be quite tricky – no matter how well the prompt is phrased, there is no guarantee that the LLM won’t return an undesired output. With skorch, you don’t have to worry about that.
- Returning probabilities: When generating texts with LLMs, you will usually only get the generated text as output. But often, we would like to know the associated probabilities: Is the model 99% sure that the label is “positive” or only 51%? skorch provides the .predict_proba method, which will return those probabilities to you.
- Caching: skorch performs some caching under the hood. This can lead to faster prediction times, especially when the labels are long and share a common prefix.
- Scikit-learn compatibility: Thanks to skorch, the classifiers are compatible with sklearn. You can call fit, predict, and predict_proba as you always do. You can run a grid search to identify the best LLM and prompt. You can use the classifier as a drop-in replacement for your existing sklearn model. Or you can start prototyping with a zero/few-shot classifier and later, when more labeled data is available, replace it with an sklearn model, without the need to change any further code.
- Everything runs locally: Once the model and tokenizer have been downloaded, everything runs locally on your machine. No data is sent to any API provider, such as OpenAI. If you work with sensitive data or company data, you don’t have to worry about leaking it to the outside world. (Tip: If you actually prefer to use the OpenAI API, take a look at scikit-llm.)
That said, there are situations where you should not use zero-shot and few-shot classification with skorch. When you have sufficient amounts of labeled data, a supervised ML approach (like skorch.NeuralNetClassifier) will most likely work better, require less memory and compute, and thus have faster inference. If you need more open-ended text generation, say abstractive summarization instead of predicting one of a fixed list of labels, this is also not a good use case. Another concern could be interpretability: LLMs are mostly a black box, whereas some other ML methods lend themselves much better to interpretation.
Technical details¶
Q: How do ZeroShotClassifier and FewShotClassifier ensure that only the given labels, such as “positive” and “negative”, are generated? Usually, a language model can generate any tokens.
A: Under the hood, we intercept the model predictions (the logits) and force them to be one of the labels. That way, we can ensure that the model will never predict anything that we’re not expecting. This is possible because of integration with Hugging Face transformers.
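The general technique of restricting a model to a set of allowed tokens can be illustrated with a toy example. The sketch below is not skorch's implementation: it uses a made-up vocabulary of six token ids and plain Python lists instead of real model outputs, and it shows only a single generation step.

```python
import math


def restrict_to_allowed(logits, allowed_ids):
    """Set the logits of all tokens outside allowed_ids to -inf."""
    return [x if i in allowed_ids else -math.inf for i, x in enumerate(logits)]


def softmax(logits):
    """Convert logits to probabilities in a numerically stable way."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy logits for a vocabulary of 6 tokens; ids 1 and 4 stand for the
# first tokens of the two allowed labels
logits = [2.0, 0.5, 3.0, -1.0, 1.5, 0.0]
allowed = {1, 4}

# Without any constraint, the model would pick token 2
unconstrained = max(range(len(logits)), key=logits.__getitem__)

# With the constraint, only tokens 1 and 4 can receive probability mass
probs = softmax(restrict_to_allowed(logits, allowed))
next_token = max(range(len(probs)), key=probs.__getitem__)

print(unconstrained, next_token)
print([round(p, 3) for p in probs])
```

Tokens outside the allowed set end up with a probability of exactly zero, so the constrained model cannot produce them no matter how the prompt is phrased.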
Q: How do ZeroShotClassifier and FewShotClassifier derive the probabilities? Usually, a language model just returns text.
A: We don’t use the generated text but instead directly inspect the logits returned by the language model. For example, let’s assume that the label “positive” is represented as two tokens, [123, 456]. Then we first check the logit that the language model assigns to token 123, given the input prompt, then force the model to predict 123 and check the logit assigned to token 456. The logits are then converted to probabilities and aggregated.
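The two-token example can be traced with toy numbers. In the sketch below, a vocabulary of four tokens replaces the real one, and the token ids 1 and 3 stand in for 123 and 456. The text only says that the probabilities are aggregated; multiplying the per-token probabilities (the chain rule for sequences) is an assumption of this sketch, and label_probability is a hypothetical helper.

```python
import math


def log_softmax(logits):
    """Convert logits to log-probabilities in a numerically stable way."""
    m = max(logits)
    log_total = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_total for x in logits]


def label_probability(step_logits, token_ids):
    """Hypothetical helper: probability of a token sequence.

    step_logits[k] are the logits at step k, obtained after the model was
    forced to emit token_ids[:k]. The per-token probabilities are multiplied
    (summed in log space), which is an assumption about the aggregation.
    """
    log_prob = 0.0
    for logits, token_id in zip(step_logits, token_ids):
        log_prob += log_softmax(logits)[token_id]
    return math.exp(log_prob)


# Toy logits for a vocabulary of 4 tokens; the label is tokenized as [1, 3]
step_logits = [
    [0.0, 2.0, 0.0, 0.0],  # logits given only the prompt
    [0.0, 0.0, 0.0, 3.0],  # logits after forcing token 1
]

p = label_probability(step_logits, [1, 3])
print(round(p, 4))
```

Working in log space avoids numerical underflow when labels consist of many tokens, since a product of many small probabilities quickly becomes tiny.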
More examples¶
To see a complete working example of using ZeroShotClassifier and FewShotClassifier, take a look at this notebook about using LLMs for classification. It contains a movie review sentiment analysis use case and also compares the results with those of other methods.