Author: Josh Bottum
Date: July 25th, 2023
Within 30 minutes of reading this post, you should be able to complete model serving requests against two variants of a popular Python-based large language model (LLM) using LangChain on your local computer, without the connection to, or costs of, an external third-party API server such as HuggingFaceHub or OpenAI. This exercise provides scripts that let you test these LLMs' ability to answer three prompt types: knowledge retrieval, six forms of reasoning questions, and a long question that supplies context in its details. After providing some background on the models and LangChain, we will walk you through installing the dependencies, and we will provide the code and the output of each model. We will also provide side-by-side comparisons of model performance and processing times. We hope that these examples will help you develop your LLM testing plans, especially for your LLM's reasoning requirements.
Caveats and notes - Although you will not need a real-time connection to HuggingFace for model serving, you will need a connection to HuggingFace the first time you run the script so it can fetch the model code and weights. You will not need a HuggingFaceHub API token.
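If you would like to fetch the model files ahead of time (for example, on a machine that will later have limited connectivity), one optional approach is to load each model once so the files land in your local HuggingFace cache. This is only a sketch, separate from the main script, and it assumes the same model IDs used throughout this post:

# Optional: pre-download and cache the two Flan-T5 checkpoints (several GB of data)
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

for model_id in ['google/flan-t5-large', 'google/flan-t5-xl']:
    # from_pretrained() downloads the files on first use and caches them
    # (by default under ~/.cache/huggingface) for later runs
    AutoTokenizer.from_pretrained(model_id)
    AutoModelForSeq2SeqLM.from_pretrained(model_id)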
Some of the reasons why you may need to run your model locally, and not use an external API server, include:
More than anything, you may want to protect private internal information that appears in prompt contexts from being exposed to models that might later use those prompts for training.
In this blog, we will show how to run the Flan-T5-Large and Flan-T5-XL models. This family of transformer models, open-sourced by Google, is designed for natural language processing tasks and provides both text-to-text and text-generation capabilities, especially for question answering.
The Flan-T5-Large version is based on the T5 (Text-To-Text Transfer Transformer) architecture and has 780M parameters. This paper claims that Flan-T5-Large achieved an MMLU Direct score of 45.1%, which compares well with GPT-3's score of 43.9% (see page 10). It is a fairly popular model, with 446,125 downloads last month. For more detailed information on this model's background, performance, and capabilities, please see this link on HuggingFaceHub. For reference, the Measuring Massive Multitask Language Understanding (MMLU) tests cover 57 tasks, including elementary mathematics, US history, computer science, law, and more. To attain high accuracy on this test, models must possess extensive world knowledge and problem-solving ability. Please find more on MMLU in this paper.
The Flan-T5-XL version is based on the T5 (Text-To-Text Transfer Transformer) architecture and has 3B parameters. It is a fairly popular model, with 349,257 downloads last month. It achieved an MMLU score of 52%, better than both Flan-T5-Large and GPT-3. For more detailed information on this model's background, performance, and capabilities, please see its page on HuggingFaceHub: https://huggingface.co/google/flan-t5-xl.
The text in this section is from https://python.LangChain.com/en/latest/index.html
LangChain is a framework for developing applications powered by language models. We believe that the most powerful and differentiated applications will not only call out to a language model, but will also be:
Data-aware: connect a language model to other sources of data
Agentic: allow a language model to interact with its environment
The LangChain framework is designed around these principles. This is the Python-specific portion of the documentation; the LangChain site also provides a purely conceptual guide, JavaScript documentation, and a page on concepts and terminology.
These modules are the core abstractions which we view as the building blocks of any LLM-powered application. For each module LangChain provides standard, extendable interfaces. LangChain also provides external integrations and even end-to-end implementations for off-the-shelf use. The docs for each module contain quickstart examples, how-to guides, reference docs, and conceptual guides.
The modules are (from least to most complex): Models, Prompts, Memory, Indexes, Chains, and Agents.
LangChain also documents best practices and built-in implementations for common use cases, such as question answering, chatbots, and summarization.
As you can see, LangChain includes many advanced features that enable complex model processing. In our example, we will use models, prompts, and pipelines for question answering.
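To make that concrete before the full script, here is a minimal sketch of the pattern we will use: wrap a local HuggingFace text2text pipeline in LangChain's HuggingFacePipeline and ask it a single question. It uses the same google/flan-t5-large model and max_length setting as the full script below.

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, pipeline
from langchain.llms import HuggingFacePipeline

model_id = 'google/flan-t5-large'
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

# Wrap the local text2text pipeline so LangChain can call it like any other LLM
pipe = pipeline("text2text-generation", model=model, tokenizer=tokenizer, max_length=512)
local_llm = HuggingFacePipeline(pipeline=pipe)

print(local_llm("What is the capital of Germany?"))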
From the terminal, please run the commands below:
pip3 install transformers
pip3 install langchain
pip3 install torch
pip3 install matplotlib
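Once the installs finish, you can optionally confirm that the packages import cleanly with a quick check such as the following (the versions printed on your machine will differ):

python3 -c "import transformers, langchain, torch, matplotlib; print(transformers.__version__, langchain.__version__, torch.__version__, matplotlib.__version__)"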
After installing the dependencies, please build your Python script. In your terminal or code editor, create a file named t5pat.py in your working directory (e.g. t5pat), and paste the following code into it.
import time
import matplotlib.pyplot as plt
from langchain.llms import HuggingFacePipeline
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, pipeline
import os
# Disable parallelism and avoid the warning message
os.environ["TOKENIZERS_PARALLELISM"] = "false"
# Define model IDs
model_ids = ['google/flan-t5-large', 'google/flan-t5-xl']
# Define prompts and types
prompts = [
'What is the capital of Germany?',
'What is the capital of Spain?',
'What is the capital of Canada?',
'What is the next number in the sequence: 2, 4, 6, 8, ...? If all cats have tails, and Fluffy is a cat, does Fluffy have a tail?',
'If you eat too much junk food, what will happen to your health? How does smoking affect the risk of lung cancer?',
'In the same way that pen is related to paper, what is fork related to? If tree is related to forest, what is brick related to?',
'Every time John eats peanuts, he gets a rash. Does John have a peanut allergy? Every time Sarah studies for a test, she gets an A. Will Sarah get an A on the next test if she studies?',
'All dogs have fur. Max is a dog. Does Max have fur? If it is raining outside, and Mary does not like to get wet, will Mary take an umbrella?',
'If I had studied harder, would I have passed the exam? What would have happened if Thomas Edison had not invented the light bulb?',
'The center of Tropical Storm Arlene, at 02/1800 UTC, is near 26.7N 86.2W. This position is about 425 km/230 nm to the west of Fort Myers in Florida, and it is about 550 km/297 nm to the NNW of the western tip of Cuba. The tropical storm is moving southward, or 175 degrees, 4 knots. The estimated minimum central pressure is 1002 mb. The maximum sustained wind speeds are 35 knots with gusts to 45 knots. The sea heights that are close to the tropical storm are ranging from 6 feet to a maximum of 10 feet. Precipitation: scattered to numerous moderate is within 180 nm of the center in the NE quadrant. Isolated moderate is from 25N to 27N between 80W and 84W, including parts of south Florida. Broad surface low pressure extends from the area of the tropical storm, through the Yucatan Channel, into the NW part of the Caribbean Sea. Where and when will the storm make landfall?'
]
types = [
'Knowledge Retrieval',
'Knowledge Retrieval',
'Knowledge Retrieval',
'Logical Reasoning',
'Cause and Effect',
'Analogical Reasoning',
'Inductive Reasoning',
'Deductive Reasoning',
'Counterfactual Reasoning',
'In Context'
]
# Create empty lists to store generation times, model load times, tokenizer load times, and pipeline load times
xl_generation_times = []
large_generation_times = []
xl_model_load_times = []
large_model_load_times = []
xl_tokenizer_load_times = []
large_tokenizer_load_times = []
xl_pipeline_load_times = []
large_pipeline_load_times = []
prompt_types = []
for model_id in model_ids:
    # Load tokenizer
    tokenizer_start_time = time.time()
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    tokenizer_end_time = time.time()

    # Load model
    model_start_time = time.time()
    model = AutoModelForSeq2SeqLM.from_pretrained(model_id)
    model_end_time = time.time()

    # Load pipeline
    pipe_start_time = time.time()
    pipe = pipeline("text2text-generation", model=model, tokenizer=tokenizer, max_length=512)
    local_llm = HuggingFacePipeline(pipeline=pipe)
    pipe_end_time = time.time()

    # Store loading times
    if model_id == 'google/flan-t5-large':
        large_model_load_times.append(model_end_time - model_start_time)
        large_tokenizer_load_times.append(tokenizer_end_time - tokenizer_start_time)
        large_pipeline_load_times.append(pipe_end_time - pipe_start_time)
    elif model_id == 'google/flan-t5-xl':
        xl_model_load_times.append(model_end_time - model_start_time)
        xl_tokenizer_load_times.append(tokenizer_end_time - tokenizer_start_time)
        xl_pipeline_load_times.append(pipe_end_time - pipe_start_time)

    # Print model results
    print()
    print(f"Results for model: {model_id}")
    print("=" * 30)

    # Loop through the prompt list, measure the time to generate each answer, and print the prompt, answer, time, and type
    for i, prompt in enumerate(prompts):
        start_time = time.time()
        answer = local_llm(prompt)
        end_time = time.time()

        print(f"Prompt: {prompt}")
        print(f"Answer: {answer}")
        print(f"Generation Time: {end_time - start_time:.5f} seconds")
        print(f"Type: {types[i]}")
        print()

        # Store the prompt type and the time taken to generate the answer
        prompt_types.append(types[i])  # Store the prompt type
        if model_id == 'google/flan-t5-large':
            large_generation_times.append(end_time - start_time)
        elif model_id == 'google/flan-t5-xl':
            xl_generation_times.append(end_time - start_time)

    # Print loading times
    print(f"Loading times for model {model_id}")
    print("Tokenizer Loading Time:", f"{tokenizer_end_time - tokenizer_start_time:.5f}", "seconds")
    print("Model Loading Time:", f"{model_end_time - model_start_time:.5f}", "seconds")
    print("Pipeline Loading Time:", f"{pipe_end_time - pipe_start_time:.5f}", "seconds\n\n")
# Plot model load times
model_load_times = [sum(xl_model_load_times), sum(large_model_load_times)]
model_labels = ['XL Model', 'Large Model']
plt.figure(figsize=(18, 6))
plt.subplot(131)
plt.bar(model_labels, model_load_times, color=['blue', 'orange'])
plt.ylabel('Load Time (seconds)')
plt.xlabel('Model')
plt.title('Model Load Time Comparison')
# Plot tokenizer load times
tokenizer_load_times = [sum(xl_tokenizer_load_times), sum(large_tokenizer_load_times)]
plt.subplot(132)
plt.bar(model_labels, tokenizer_load_times, color=['blue', 'orange'])
plt.ylabel('Load Time (seconds)')
plt.xlabel('Model')
plt.title('Tokenizer Load Time Comparison')
# Plot pipeline load times
pipeline_load_times = [sum(xl_pipeline_load_times), sum(large_pipeline_load_times)]
plt.subplot(133)
plt.bar(model_labels, pipeline_load_times, color=['blue', 'orange'])
plt.ylabel('Load Time (seconds)')
plt.xlabel('Model')
plt.title('Pipeline Load Time Comparison')
# Plot generation times
plt.figure(figsize=(9, 6))
plt.barh(range(len(types)), xl_generation_times, height=0.4, align='center', color='blue', label='XL Model')
plt.barh([x + 0.4 for x in range(len(types))], large_generation_times, height=0.4, align='center', color='orange', alpha=0.5, label='Large Model')
plt.yticks(range(len(types)), types)
plt.ylabel('Type')
plt.xlabel('Generation Time (seconds)')
plt.title('Generation Time Comparison')
plt.legend()
plt.tight_layout()
plt.show()
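If you prefer to save the charts to files rather than rely on the figure windows, you could optionally add calls like the following just before plt.show(); the file names here are only illustrative:

# Optional: save the two figures to PNG files (figure 1 holds the three load-time
# subplots, figure 2 holds the generation-time chart)
plt.figure(1)
plt.savefig('load_times.png', dpi=150, bbox_inches='tight')
plt.figure(2)
plt.savefig('generation_times.png', dpi=150, bbox_inches='tight')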
To run your script, open a terminal in the directory that holds the file (e.g. t5pat) and run the following command:
python3 t5pat.py
The following provides sample model output from running the script. Your answers and generation times will likely differ. The script produces text output followed by four charts across two figure windows. You can save the charts using the file button on the chart displays or simply close them; either action releases the script and returns you to the terminal prompt.
Results for model: google/flan-t5-large
==============================
Prompt: What is the capital of Germany?
Answer: berlin
Generation Time: 1.06194 seconds
Type: Knowledge Retrieval
Prompt: What is the capital of Spain?
Answer: turin
Generation Time: 0.73172 seconds
Type: Knowledge Retrieval
Prompt: What is the capital of Canada?
Answer: toronto
Generation Time: 1.12487 seconds
Type: Knowledge Retrieval
Prompt: What is the next number in the sequence: 2, 4, 6, 8, ...? If all cats have tails, and Fluffy is a cat, does Fluffy have a tail?
Answer: yes
Generation Time: 1.08774 seconds
Type: Logical Reasoning
Prompt: If you eat too much junk food, what will happen to your health? How does smoking affect the risk of lung cancer?
Answer: no
Generation Time: 0.69614 seconds
Type: Cause and Effect
Prompt: In the same way that pen is related to paper, what is fork related to? If tree is related to forest, what is brick related to?
Answer: brick is related to brick
Generation Time: 1.51508 seconds
Type: Analogical Reasoning
Prompt: Every time John eats peanuts, he gets a rash. Does John have a peanut allergy? Every time Sarah studies for a test, she gets an A. Will Sarah get an A on the next test if she studies?
Answer: yes
Generation Time: 1.24550 seconds
Type: Inductive Reasoning
Prompt: All dogs have fur. Max is a dog. Does Max have fur? If it is raining outside, and Mary does not like to get wet, will Mary take an umbrella?
Answer: yes
Generation Time: 1.28181 seconds
Type: Deductive Reasoning
Prompt: If I had studied harder, would I have passed the exam? What would have happened if Thomas Edison had not invented the light bulb?
Answer: no one would have invented the light bulb
Generation Time: 2.15294 seconds
Type: Counterfactual Reasoning
Prompt: The center of Tropical Storm Arlene, at 02/1800 UTC, is near 26.7N 86.2W. This position is about 425 km/230 nm to the west of Fort Myers in Florida, and it is about 550 km/297 nm to the NNW of the western tip of Cuba. The tropical storm is moving southward, or 175 degrees, 4 knots. The estimated minimum central pressure is 1002 mb. The maximum sustained wind speeds are 35 knots with gusts to 45 knots. The sea heights that are close to the tropical storm are ranging from 6 feet to a maximum of 10 feet. Precipitation: scattered to numerous moderate is within 180 nm of the center in the NE quadrant. Isolated moderate is from 25N to 27N between 80W and 84W, including parts of south Florida. Broad surface low pressure extends from the area of the tropical storm, through the Yucatan Channel, into the NW part of the Caribbean Sea. Where and when will the storm make landfall?
Answer: about 425 km/230 nm to the west of Fort Myers in Florida, and it is about 550 km/297 nm to the NNW of the western tip of Cuba
Generation Time: 10.67541 seconds
Type: In Context
Loading times for model google/flan-t5-large
Tokenizer Loading Time: 0.94174 seconds
Model Loading Time: 17.28348 seconds
Pipeline Loading Time: 0.11213 seconds
Loading checkpoint shards: 100%|██████████████████| 2/2 [01:38<00:00, 49.17s/it]
Results for model: google/flan-t5-xl
==============================
Prompt: What is the capital of Germany?
Answer: berlin
Generation Time: 43.58305 seconds
Type: Knowledge Retrieval
Prompt: What is the capital of Spain?
Answer: santander
Generation Time: 2.80783 seconds
Type: Knowledge Retrieval
Prompt: What is the capital of Canada?
Answer: ottawa
Generation Time: 3.06489 seconds
Type: Knowledge Retrieval
Prompt: What is the next number in the sequence: 2, 4, 6, 8, ...? If all cats have tails, and Fluffy is a cat, does Fluffy have a tail?
Answer: yes
Generation Time: 2.89040 seconds
Type: Logical Reasoning
Prompt: If you eat too much junk food, what will happen to your health? How does smoking affect the risk of lung cancer?
Answer: It increases the risk of developing lung cancer.
Generation Time: 5.07974 seconds
Type: Cause and Effect
Prompt: In the same way that pen is related to paper, what is fork related to? If tree is related to forest, what is brick related to?
Answer: building
Generation Time: 2.60167 seconds
Type: Analogical Reasoning
Prompt: Every time John eats peanuts, he gets a rash. Does John have a peanut allergy? Every time Sarah studies for a test, she gets an A. Will Sarah get an A on the next test if she studies?
Answer: yes
Generation Time: 3.53700 seconds
Type: Inductive Reasoning
Prompt: All dogs have fur. Max is a dog. Does Max have fur? If it is raining outside, and Mary does not like to get wet, will Mary take an umbrella?
Answer: yes
Generation Time: 2.90499 seconds
Type: Deductive Reasoning
Prompt: If I had studied harder, would I have passed the exam? What would have happened if Thomas Edison had not invented the light bulb?
Answer: the world would be dark
Generation Time: 3.81147 seconds
Type: Counterfactual Reasoning
Prompt: The center of Tropical Storm Arlene, at 02/1800 UTC, is near 26.7N 86.2W. This position is about 425 km/230 nm to the west of Fort Myers in Florida, and it is about 550 km/297 nm to the NNW of the western tip of Cuba. The tropical storm is moving southward, or 175 degrees, 4 knots. The estimated minimum central pressure is 1002 mb. The maximum sustained wind speeds are 35 knots with gusts to 45 knots. The sea heights that are close to the tropical storm are ranging from 6 feet to a maximum of 10 feet. Precipitation: scattered to numerous moderate is within 180 nm of the center in the NE quadrant. Isolated moderate is from 25N to 27N between 80W and 84W, including parts of south Florida. Broad surface low pressure extends from the area of the tropical storm, through the Yucatan Channel, into the NW part of the Caribbean Sea. Where and when will the storm make landfall?
Answer: Fort Myers in Florida
Generation Time: 14.06618 seconds
Type: In Context
Loading times for model google/flan-t5-xl
Tokenizer Loading Time: 0.54048 seconds
Model Loading Time: 131.81162 seconds
Pipeline Loading Time: 0.57841 seconds
Note - the following message might display in the terminal while the figures are rendering. It is informational and can be ignored.
Python[23374:939566] +[CATransaction synchronize] called within transaction
For each model, the script prints text output for each prompt, including the prompt, the answer, the generation time, and the prompt type. It also prints the loading times for the tokenizer, model, and pipeline, and it produces charts of the generation time for each prompt type and of the model, tokenizer, and pipeline load times.
In our tests, the XL model took significantly longer than the Large model in every respect, but it answered more questions correctly. Neither model did a good job with prompts that contained two questions: both models mostly answered the second question and ignored the first. The answers did not provide much context, although we did not ask for context. Determining what is correct or incorrect for a reasoning question can involve some subjectivity. The XL model was stronger than the Large model on knowledge retrieval, in-context, cause-and-effect, and analogical reasoning questions. The XL model did not perform well initially on the analogical or counterfactual questions, but its answers improved as we repeated the runs. Both models were better at answering reasoning questions than knowledge retrieval or in-context questions.
The following tables provide a summary of the models' correct answers. We recognize that the format of the prompts, especially asking two questions in one prompt, can affect the model. We used these more complex examples because they may better reflect human interaction. As you can see, a model's performance can vary with the question type and the prompt construction. This is to be expected and could be improved with fine-tuning, which is a potential topic for follow-on discussions and/or further experimentation.
The Large model did not answer the first question in the prompts that contained multiple questions; if you count those first questions, the Large model missed more questions than it answered correctly.
Let's analyze the Large model's output. The following table shows each question type, whether the answer was judged correct, and the time required to generate the answer.
| Prompt | Correct | Time in sec |
|---|---|---|
| Knowledge retrieval 1 | 100% | 1.1 |
| Knowledge retrieval 2 | 0% | 0.7 |
| Knowledge retrieval 3 | 0% | 1.1 |
| Logical Reasoning | 50% | 1.1 |
| Cause and Effect Reasoning | 0% | 0.7 |
| Analogical Reasoning | 0% | 1.5 |
| Inductive Reasoning | 50% | 1.2 |
| Deductive Reasoning | 50% | 1.3 |
| Counterfactual Reasoning | 50% | 2.2 |
| In Context | 0% | 10.7 |
Now, let's examine the results of the flan-t5-xl model. The XL model takes longer but provides better answers. Before the XL model produces answers to the prompts, it prints the following informational message as the script loads its checkpoint shards. In our test, this took about 49 seconds per shard (roughly 98 seconds in total) and caused a noticeable delay between the end of the Large model output and the start of the XL model output.
Loading checkpoint shards: 100%|██████████████████| 2/2 [01:38<00:00, 49.17s/it]
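If the XL load time is an issue on your machine, one optional experiment (not used in our tests, and dependent on your hardware) is to load the XL checkpoint in half precision via the torch_dtype argument, for example:

import torch
from transformers import AutoModelForSeq2SeqLM

# Hypothetical tweak: bfloat16 roughly halves the memory footprint of the 3B model;
# answers may differ slightly from the full-precision runs shown above
model = AutoModelForSeq2SeqLM.from_pretrained('google/flan-t5-xl', torch_dtype=torch.bfloat16)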
Next, let's analyze the XL model's output. The following table shows each question type, whether the answer was judged correct, and the time required to generate the answer.
| Prompt | Correct | Time in sec |
|---|---|---|
| Knowledge retrieval 1 | 100% | 43.6 |
| Knowledge retrieval 2 | 0% | 2.8 |
| Knowledge retrieval 3 | 100% | 3.1 |
| Logical Reasoning | 50% | 2.9 |
| Cause and Effect Reasoning | 50% | 5.1 |
| Analogical Reasoning | 50% | 2.6 |
| Inductive Reasoning | 50% | 3.5 |
| Deductive Reasoning | 50% | 2.9 |
| Counterfactual Reasoning | 50% | 3.8 |
| In Context | 25% | 14.1 |
The XL model did a pretty good job. It answered 2 of the 3 knowledge retrieval questions correctly, and it provided an answer to the In Context question that could be correct. It answered the reasoning questions it addressed correctly; however, because it only answered the second question in each two-question reasoning prompt, we gave those prompts a grade of 50%.
This initial post is intended to help you to develop a plan to test reasoning questions in your LLMs. In a follow-on post, we plan to provide more details and descriptions of the code and output. We look forward to your feedback and hope these examples provide you with ideas for your LLM testing.
For more on building applications with large language models, check out: