Chat models
Overviewβ
Large Language Models (LLMs) are advanced machine learning models that excel in a wide range of language-related tasks such as text generation, translation, summarization, question answering, and more, without needing task-specific tuning for every scenario.
Modern LLMs are typically accessed through a chat model interface that takes a list of messages as input and returns a message as output.
The newest generation of chat models offer additional capabilities:
- Tool calling: Many popular chat models offer a native tool calling API. This API allows developers to build rich applications that enable AI to interact with external services, APIs, and databases. Tool calling can also be used to extract structured information from unstructured data and perform various other tasks.
- Structured output: A technique to make a chat model respond in a structured format, such as JSON that matches a given schema.
- Multimodality: The ability to work with data other than text; for example, images, audio, and video.
Featuresβ
LangChain provides a consistent interface for working with chat models from different providers while offering additional features for monitoring, debugging, and optimizing the performance of applications that use LLMs.
- Integrations with many chat model providers (e.g., Anthropic, OpenAI, Ollama, Microsoft Azure, Google Vertex, Amazon Bedrock, Hugging Face, Cohere, Groq). Please see chat model integrations for an up-to-date list of supported models.
- Use either LangChain's messages format or OpenAI format.
- Standard tool calling API: standard interface for binding tools to models, accessing tool call requests made by models, and sending tool results back to the model.
- Standard API for structuring outputs (/docs/concepts/structured_outputs) via the
with_structured_output
method. - Provides support for async programming, efficient batching, a rich streaming API.
- Integration with LangSmith for monitoring and debugging production-grade applications based on LLMs.
- Additional features like standardized token usage, rate limiting, caching and more.
Available integrationsβ
LangChain has many chat model integrations that allow you to use a wide variety of models from different providers.
These integrations are one of two types:
- Official models: These are models that are officially supported by LangChain and/or model provider. You can find these models in the
langchain-<provider>
packages. - Community models: There are models that are mostly contributed and supported by the community. You can find these models in the
langchain-community
package.
LangChain chat models are named with a convention that prefixes "Chat" to their class names (e.g., ChatOllama
, ChatAnthropic
, ChatOpenAI
, etc.).
Please review the chat model integrations for a list of supported models.
Models that do not include the prefix "Chat" in their name or include "LLM" as a suffix in their name typically refer to older models that do not follow the chat model interface and instead use an interface that takes a string as input and returns a string as output.
Interfaceβ
LangChain chat models implement the BaseChatModel interface. Because [BaseChatModel] also implements the Runnable Interface, chat models support a standard streaming interface, async programming, optimized batching, and more. Please see the Runnable Interface for more details.
Many of the key methods of chat models operate on messages as input and return messages as output.
Chat models offer a standard set of parameters that can be used to configure the model. These parameters are typically used to control the behavior of the model, such as the temperature of the output, the maximum number of tokens in the response, and the maximum time to wait for a response. Please see the standard parameters section for more details.
In documentation, we will often use the terms "LLM" and "Chat Model" interchangeably. This is because most modern LLMs are exposed to users via a chat model interface.
However, LangChain also has implementations of older LLMs that do not follow the chat model interface and instead use an interface that takes a string as input and returns a string as output. These models are typically named without the "Chat" prefix (e.g., Ollama
, Anthropic
, OpenAI
, etc.).
These models implement the BaseLLM interface and may be named with the "LLM" suffix (e.g., OllamaLLM
, AnthropicLLM
, OpenAILLM
, etc.). Generally, users should not use these models.
Key methodsβ
The key methods of a chat model are:
- invoke: The primary method for interacting with a chat model. It takes a list of messages as input and returns a list of messages as output.
- stream: A method that allows you to stream the output of a chat model as it is generated.
- batch: A method that allows you to batch multiple requests to a chat model together for more efficient processing.
- bind_tools: A method that allows you to bind a tool to a chat model for use in the model's execution context.
- with_structured_output: A wrapper around the
invoke
method for models that natively support structured output.
Other important methods can be found in the BaseChatModel API Reference.
Inputs and outputsβ
Modern LLMs are typically accessed through a chat model interface that takes messages as input and returns messages as output. Messages are typically associated with a role (e.g., "system", "human", "assistant") and one or more content blocks that contain text or potentially multimodal data (e.g., images, audio, video).
LangChain supports two message formats to interact with chat models:
- LangChain Message Format: LangChain's own message format, which is used by default and is used internally by LangChain.
- OpenAI's Message Format: OpenAI's message format.
Standard parametersβ
Many chat models have standardized parameters that can be used to configure the model:
Parameter | Description |
---|---|
model | The name or identifier of the specific AI model you want to use (e.g., "gpt-3.5-turbo" or "gpt-4" ). |
temperature | Controls the randomness of the model's output. A higher value (e.g., 1.0) makes responses more creative, while a lower value (e.g., 0.1) makes them more deterministic and focused. |
timeout | The maximum time (in seconds) to wait for a response from the model before canceling the request. Ensures the request doesnβt hang indefinitely. |
max_tokens | Limits the total number of tokens (words and punctuation) in the response. This controls how long the output can be. |
stop | Specifies stop sequences that indicate when the model should stop generating tokens. For example, you might use specific strings to signal the end of a response. |
max_retries | The maximum number of attempts the system will make to resend a request if it fails due to issues like network timeouts or rate limits. |
api_key | The API key required for authenticating with the model provider. This is usually issued when you sign up for access to the model. |
base_url | The URL of the API endpoint where requests are sent. This is typically provided by the model's provider and is necessary for directing your requests. |
rate_limiter | An optional BaseRateLimiter to space out requests to avoid exceeding rate limits. See rate-limiting below for more details. |
Some important things to note:
- Standard parameters only apply to model providers that expose parameters with the intended functionality. For example, some providers do not expose a configuration for maximum output tokens, so max_tokens can't be supported on these.
- Standard params are currently only enforced on integrations that have their own integration packages (e.g.
langchain-openai
,langchain-anthropic
, etc.), they're not enforced on models inlangchain-community
.
ChatModels also accept other parameters that are specific to that integration. To find all the parameters supported by a ChatModel head to the API reference for that model.
Tool callingβ
Chat models can call tools to perform tasks such as fetching data from a database, making API requests, or running custom code. Please see the tool calling guide for more information.
Structured outputsβ
Chat models can be requested to respond in a particular format (e.g., JSON or matching a particular schema). This feature is extremely useful for information extraction tasks. Please read more about the technique in the structured outputs guide.
Multimodalityβ
Large Language Models (LLMs) are not limited to processing text. They can also be used to process other types of data, such as images, audio, and video. This is known as multimodality.
Currently, only some LLMs support multimodal inputs, and almost none support multimodal outputs. Please consult the specific model documentation for details.
Context windowβ
A chat model's context window refers to the maximum size of the input sequence the model can process at one time. While the context windows of modern LLMs are quite large, they still present a limitation that developers must keep in mind when working with chat models.
If the input exceeds the context window, the model may not be able to process the entire input and could raise an error. In conversational applications, this is especially important because the context window determines how much information the model can "remember" throughout a conversation. Developers often need to manage the input within the context window to maintain a coherent dialogue without exceeding the limit. For more details on handling memory in conversations, refer to the memory.
The size of the input is measured in tokens which are the unit of processing that the model uses.
Advanced topicsβ
Rate-limitingβ
Many chat model providers impose a limit on the number of requests that can be made in a given time period.
If you hit a rate limit, you will typically receive a rate limit error response from the provider, and will need to wait before making more requests.
You have a few options to deal with rate limits:
- Try to avoid hitting rate limits by spacing out requests: Chat models accept a
rate_limiter
parameter that can be provided during initialization. This parameter is used to control the rate at which requests are made to the model provider. Spacing out the requests to a given model is a particularly useful strategy when benchmarking models to evaluate their performance. Please see the how to handle rate limits for more information on how to use this feature. - Try to recover from rate limit errors: If you receive a rate limit error, you can wait a certain amount of time before retrying the request. The amount of time to wait can be increased with each subsequent rate limit error. Chat models have a
max_retries
parameter that can be used to control the number of retries. See the standard parameters section for more information. - Fallback to another chat model: If you hit a rate limit with one chat model, you can switch to another chat model that is not rate-limited.
Cachingβ
Chat model APIs can be slow, so a natural question is whether to cache the results of previous conversations. Theoretically, caching can help improve performance by reducing the number of requests made to the model provider. In practice, caching chat model responses is a complex problem and should be approached with caution.
The reason is that getting a cache hit is unlikely after the first or second interaction in a conversation if relying on caching the exact inputs into the model. For example, how likely do you think that multiple conversations start with the exact same message? What about the exact same three messages?
An alternative approach is to use semantic caching, where you cache responses based on the meaning of the input rather than the exact input itself. This can be effective in some situations, but not in others.
A semantic cache introduces a dependency on another model on the critical path of your application (e.g., the semantic cache may rely on an embedding model to convert text to a vector representation), and it's not guaranteed to capture the meaning of the input accurately.
However, there might be situations where caching chat model responses is beneficial. For example, if you have a chat model that is used to answer frequently asked questions, caching responses can help reduce the load on the model provider and improve response times.
Please see the how to cache chat model responses guide for more details.
Related resourcesβ
- How-to guides on using chat models: how-to guides.
- List of supported chat models: chat model integrations.