Streaming LLM output#

Large Language Models produce their output token by token, and each token can take some time to generate. If you wait for the model to finish the entire response before showing any output, those delays add up quickly!

An alternative to this is streaming: similar to video streaming, where you don’t wait for the entire video to download before playing it, you can stream the output of an LLM as it is generated. In this recipe, we will explore how to do that with langchain_dartmouth!

Note

Many LLMs in the LangChain ecosystem support streaming, not just the ones in langchain_dartmouth! You could replace the model in this notebook with, e.g., ChatOpenAI from langchain_openai, and it would work exactly the same!
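For example, a minimal sketch of that swap might look like the following (this assumes you have the langchain_openai package installed and an OpenAI API key set in your environment; the model name is just an illustration):

from langchain_openai import ChatOpenAI

# Illustrative model name; any chat model that supports streaming works the same way
openai_llm = ChatOpenAI(model="gpt-4o-mini", streaming=True)

for chunk in openai_llm.stream("Write a haiku about Dartmouth College"):
    print(chunk.content)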

Importing and instantiating a model#

Just as in the previous recipe, we will import a chat model and then instantiate it. This time, however, we will set the streaming parameter to tell the model that we want it to stream its output!

from langchain_dartmouth.llms import ChatDartmouth
from dotenv import find_dotenv, load_dotenv

load_dotenv(find_dotenv())
True
llm = ChatDartmouth(model_name="llama-3-1-8b-instruct", streaming=True)

Streaming the output#

We could use the invoke method as we did before, but that would still make us wait for the entire response to be generated before anything is returned. Since we have set our model’s streaming parameter to True, we can instead call the stream method. This returns a generator object, which we can iterate through, printing each chunk as it is generated:

for chunk in llm.stream("Write a haiku about Dartmouth College"):
    print(chunk.content)
Green
 hills
 of
 Han
over


D
art
mouth
's
 quiet
,
 wise
 hearts
Learning
's
 gentle
 shore

We can see that the chunks come in one at a time, because the print function starts a new line after every chunk. We can use print’s end parameter to avoid the extra line breaks:

for chunk in llm.stream("Write a haiku about Dartmouth College"):
    print(chunk.content, end="")
Ancient trees stand tall
Han
over hills echo silence
Learning's
 quiet place

That looks better! Let’s try a longer response to show the benefit of streaming:

for chunk in llm.stream("Write five haiku about Dartmouth College"):
    print(chunk.content, end="")
Here are
 five haiku about Dart
mouth College
:

1.
Snowy
 Hanover
Dartmouth's
 green woods
 call me home
Winter
's peaceful
 nest

2.
River
 runs so
 free
Connecticut's
 gentle waters
Dartmouth's
 gentle flow
3.
Liberal arts
 shine
Tuck School's wisdom
 guides
 my path
Dartmouth
's
 noble aim

4.
Winter
 Carnival
Snowflakes dance,
 and
 I am free
Dart
mouth
's joyful cheer

5.
As
pen leaves turn gold
D
art
mouth's beauty shines within
Aut
umn's fleeting joy

Summary#

This recipe showed how to stream output from an LLM using the stream method. Streaming long responses makes for a better user experience and a more efficient use of time when working with an LLM interactively.
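
If you also need the complete response after streaming has finished, for example to store it or pass it along to another step, you can accumulate the chunks as they arrive. The following is a minimal sketch using the llm defined above; it relies on the fact that LangChain message chunks can be added together to form a single message:

full_response = None
for chunk in llm.stream("Write a haiku about Dartmouth College"):
    print(chunk.content, end="")
    # Add each chunk to the running message (the first chunk starts the accumulation)
    full_response = chunk if full_response is None else full_response + chunk

print()
print(full_response.content)

This way, you get the responsiveness of streaming while still ending up with the full message, just as if you had called invoke.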