Output parsing#
We often use LLMs as text processing engines to convert unstructured text into structured data. In this recipe, we will show how we can use an LLM to extract a structured timeline from a narrative paragraph containing years.
Input data#
We will use the following text, which is a paragraph on the history of the Baker-Berry Library from Wikipedia:
unstructured_text = """The original, historic library building is the Fisher Ames Baker Memorial Library; it opened in 1928 with a collection of 240,000 volumes. The building was designed by Jens Fredrick Larson, modeled after Independence Hall in Philadelphia, and funded by a gift to Dartmouth College by George Fisher Baker in memory of his uncle, Fisher Ames Baker, Dartmouth class of 1859. The facility was expanded in 1941 and 1957–1958 and received its one millionth volume in 1970.
In 1992, John Berry and the Baker family donated US $30 million for the construction of a new facility, the Berry Library designed by architect Robert Venturi, adjoining the Baker Library. The new complex, the Baker-Berry Library, opened in 2000 and was completed in 2002.[6] The Dartmouth College libraries presently hold over 2 million volumes in their collections."""
print(unstructured_text)
The original, historic library building is the Fisher Ames Baker Memorial Library; it opened in 1928 with a collection of 240,000 volumes. The building was designed by Jens Fredrick Larson, modeled after Independence Hall in Philadelphia, and funded by a gift to Dartmouth College by George Fisher Baker in memory of his uncle, Fisher Ames Baker, Dartmouth class of 1859. The facility was expanded in 1941 and 1957–1958 and received its one millionth volume in 1970.
In 1992, John Berry and the Baker family donated US $30 million for the construction of a new facility, the Berry Library designed by architect Robert Venturi, adjoining the Baker Library. The new complex, the Baker-Berry Library, opened in 2000 and was completed in 2002.[6] The Dartmouth College libraries presently hold over 2 million volumes in their collections.
As we can see, the history is written in a narrative style. It mentions various important points in the history of Baker-Berry, but not in a way that makes them easy to extract. There are two major challenges here:
Not all years in the text are actually relevant to the task (“class of 1859”)
Each year needs a succinct summary of the corresponding event
We can solve both of these challenges with an LLM.
Data extraction#
The most straightforward approach is to simply prompt a model to extract the timeline of events from the unstructured text. Let’s go ahead and try that first!
We will start by using what we have learned in previous recipes about instantiating and invoking a chat model:
from langchain_dartmouth.llms import ChatDartmouth
from dotenv import find_dotenv, load_dotenv
load_dotenv(find_dotenv())
True
Basic approach#
llm = ChatDartmouth(model_name="llama-3-1-8b-instruct")
prompt = (
    "Extract a succinct timeline of events directly related to the Library from the following text: \n\n"
    + unstructured_text
)
response = llm.invoke(prompt)
print(response.content)
Here's a succinct timeline of events directly related to the Library:
- 1859: Fisher Ames Baker graduates from Dartmouth College.
- 1928: Fisher Ames Baker Memorial Library opens with a collection of 240,000 volumes.
- 1941: The library building is expanded.
- 1957-1958: The library building is expanded again.
- 1970: The library receives its one millionth volume.
- 1992: John Berry and the Baker family donate $30 million for the construction of a new library.
- 2000: The new Baker-Berry Library complex opens.
- 2002: The construction of the Baker-Berry Library complex is completed.
It worked! Well, kind of. The goal was to extract the data as a data structure (like a Python `list` or a `dict`), not just another string. We could now parse the response manually, but if you run the above cell multiple times, you will see that the model may format the timeline differently each time. This makes it hard to write code that can reliably and reproducibly extract the data. Let’s take it one step further and instruct the model to return the data in a specific format.
Explicit output structure#
Ideally, we want the data to be a `list` of `dict` objects, where each `dict` has two fields: `'year'` and `'event'`. The LLM can’t return actual Python objects; it can only ever return a string. But we can instruct our model to format the string in such a way that we can parse it back into an object in Python. JSON is a great format for this, because it is string-based and can be easily parsed by Python’s `json` module. Here is how we could modify our prompt:
prompt = (
    "Extract a succinct timeline of events directly related to the Library from the following text. Return the timeline as a list of dictionaries, where each dictionary has two keys: 'year' and 'event'. Format your output as JSON. The text:\n\n"
    + unstructured_text
)
Now let’s give it another try:
response = llm.invoke(prompt)
print(response.content)
```json
[
{"year": 1928, "event": "Fisher Ames Baker Memorial Library opened with a collection of 240,000 volumes"},
{"year": 1941, "event": "The library building was expanded"},
{"year": 1957, "event": "The library building was further expanded"},
{"year": 1958, "event": "The expansion was completed"},
{"year": 1970, "event": "The library received its one millionth volume"},
{"year": 1992, "event": "John Berry and the Baker family donated $30 million for a new library facility"},
{"year": 2000, "event": "The new Baker-Berry Library opened"},
{"year": 2002, "event": "The library construction was completed"}
]
```
Close! But if you run the cell multiple times, you will see that the model adds additional material around the JSON string: Markdown tags for a code block, or sometimes a brief preamble or closing summary. However, many LLMs, including Llama 3.1, are trained to wrap the actual data in a Markdown code block (everything between the triple backticks `` ``` ``). So we could write a parser that first extracts the text between those backticks, and then parses that string as JSON.
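Such a parser could look roughly like this — a minimal sketch using only the standard library, where the function name and regular expression are our own, and real responses can be messier than this handles:

```python
import json
import re

def parse_json_response(text: str):
    """Extract the first ``` code block, if any, and parse its contents as JSON."""
    match = re.search(r"```(?:json)?\s*(.*?)```", text, re.DOTALL)
    payload = match.group(1) if match else text  # no code block: try the raw text
    return json.loads(payload)

response_text = 'Here is the timeline:\n```json\n[{"year": 2000, "event": "Complex opens"}]\n```'
print(parse_json_response(response_text))  # -> [{'year': 2000, 'event': 'Complex opens'}]
```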
Lucky for us, LangChain already includes such a parser! Since `langchain_dartmouth` is built on LangChain, we can use that component directly with `ChatDartmouth`!

By convention, most components in LangChain use the `invoke` method to “do their thing”, so here is how we can fit everything together:
from langchain_core.output_parsers import JsonOutputParser
parser = JsonOutputParser()
response = llm.invoke(prompt)
timeline = parser.invoke(response)
for event in timeline:
    print(event)
{'year': 1928, 'event': 'Fisher Ames Baker Memorial Library opens with a collection of 240,000 volumes'}
{'year': 1941, 'event': 'Facility expansion'}
{'year': 1957, 'event': 'Facility expansion begins'}
{'year': 1958, 'event': 'Facility expansion completes'}
{'year': 1970, 'event': 'One millionth volume is added to the collection'}
{'year': 1992, 'event': 'John Berry and the Baker family donate $30 million for a new facility'}
{'year': 2000, 'event': 'Baker-Berry Library complex opens'}
{'year': 2002, 'event': 'Baker-Berry Library complex completes'}
There are many more output parsers available in LangChain for all sorts of different desired output formats. All of them have the same usage pattern demonstrated above: Instruct the model to return the data in a specific format, then pass the model’s response through the parser.
If there is a specific format you need that is not already supported by any of the available parsers, you can also write your own by subclassing one of them. Let’s say that instead of generic JSON, we wanted to extract a Pandas `DataFrame`. We could create such a parser by subclassing the `JsonOutputParser` and adding an additional step to its `invoke` method:
import pandas as pd
class DataFrameParser(JsonOutputParser):
    def invoke(self, input, config=None) -> pd.DataFrame:
        # Parse the JSON response first, then turn the records into a DataFrame
        json_data = super().invoke(input, config)
        return pd.DataFrame.from_records(json_data)
parser = DataFrameParser()
response = llm.invoke(prompt)
df = parser.invoke(response)
df
|  | year | event |
|---|---|---|
| 0 | 1928 | The Fisher Ames Baker Memorial Library opened ... |
| 1 | 1941 | The library building was expanded. |
| 2 | 1957 | The library building was expanded again. |
| 3 | 1958 | The library building was completed after the s... |
| 4 | 1970 | The library received its one millionth volume. |
| 5 | 1992 | John Berry and the Baker family donated US $30... |
| 6 | 2000 | The Baker-Berry Library opened. |
| 7 | 2002 | The Baker-Berry Library was completed. |
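With the timeline in a `DataFrame`, all the usual pandas operations are available for further processing. For example, with a couple of hand-written rows standing in for the parsed output:

```python
import pandas as pd

# Hand-written rows standing in for the parsed timeline
df = pd.DataFrame.from_records([
    {"year": 1928, "event": "Library opens"},
    {"year": 1992, "event": "Donation for a new facility"},
    {"year": 2000, "event": "Baker-Berry Library complex opens"},
])

# Filter, sort, export -- standard DataFrame operations now apply
recent = df[df["year"] >= 1992]
print(recent["year"].tolist())  # -> [1992, 2000]
```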
Summary#
In this recipe, we saw that LLMs are great at extracting structured data from unstructured text. Since LLMs can only output strings, output parsers are a great tool to convert the text representation of the structured data into Python objects (like lists, dictionaries, or even data frames) for further processing.