Output parsing#
We often use LLMs as text processing engines to convert unstructured text into structured data. In this recipe, we will show how we can use an LLM to extract a structured timeline from a narrative paragraph containing years.
Input data#
We will use the following text, which is a paragraph on the history of the Baker-Berry Library from Wikipedia:
unstructured_text = """The original, historic library building is the Fisher Ames Baker Memorial Library; it opened in 1928 with a collection of 240,000 volumes. The building was designed by Jens Fredrick Larson, modeled after Independence Hall in Philadelphia, and funded by a gift to Dartmouth College by George Fisher Baker in memory of his uncle, Fisher Ames Baker, Dartmouth class of 1859. The facility was expanded in 1941 and 1957–1958 and received its one millionth volume in 1970.
In 1992, John Berry and the Baker family donated US $30 million for the construction of a new facility, the Berry Library designed by architect Robert Venturi, adjoining the Baker Library. The new complex, the Baker-Berry Library, opened in 2000 and was completed in 2002.[6] The Dartmouth College libraries presently hold over 2 million volumes in their collections."""
print(unstructured_text)
The original, historic library building is the Fisher Ames Baker Memorial Library; it opened in 1928 with a collection of 240,000 volumes. The building was designed by Jens Fredrick Larson, modeled after Independence Hall in Philadelphia, and funded by a gift to Dartmouth College by George Fisher Baker in memory of his uncle, Fisher Ames Baker, Dartmouth class of 1859. The facility was expanded in 1941 and 1957–1958 and received its one millionth volume in 1970.
In 1992, John Berry and the Baker family donated US $30 million for the construction of a new facility, the Berry Library designed by architect Robert Venturi, adjoining the Baker Library. The new complex, the Baker-Berry Library, opened in 2000 and was completed in 2002.[6] The Dartmouth College libraries presently hold over 2 million volumes in their collections.
As we can see, the history is written in a narrative style. It mentions various important points in the history of Baker-Berry, but not in a way that makes them easy to extract. There are two major challenges here:
Not all years in the text are actually relevant to the task (“class of 1859”)
Each year needs a succinct summary of the corresponding event
We can solve both of these challenges with an LLM.
Data extraction#
The most straightforward approach is to simply prompt a model to extract the timeline of events from the unstructured text. Let’s go ahead and try that first!
We will start by using what we have learned in previous recipes about instantiating and invoking a chat model:
from langchain_dartmouth.llms import ChatDartmouth
from dotenv import find_dotenv, load_dotenv
load_dotenv(find_dotenv())
True
Basic approach#
llm = ChatDartmouth(model_name="llama-3-1-8b-instruct")
prompt = (
    "Extract a succinct timeline of events directly related to the Library from the following text: \n\n"
    + unstructured_text
)
response = llm.invoke(prompt)
print(response.content)
Here's a succinct timeline of events directly related to the Library:
- 1859: Fisher Ames Baker graduates from Dartmouth College.
- 1928: Fisher Ames Baker Memorial Library opens with a collection of 240,000 volumes.
- 1941: The library building is expanded.
- 1957-1958: The library building is expanded again.
- 1970: The library receives its one millionth volume.
- 1992: John Berry and the Baker family donate $30 million for the construction of a new library.
- 2000: The new Baker-Berry Library complex opens.
- 2002: The construction of the Baker-Berry Library complex is completed.
It worked! Well, kind of. The goal was to extract the data as a data structure (like a Python `list` or a `dict`), not just another string. We could now parse the response manually, but if you run the above cell multiple times, you will see that the model may format the timeline differently each time. This makes it hard to write code that can reliably and reproducibly extract the data. Let’s take it one step further and instruct the model to return the data in a specific format.
Explicit output structure#
Ideally, we want the data to be a `list` of `dict` objects, where each `dict` has two fields: `'year'` and `'event'`. The LLM can’t return actual Python objects; it can only ever return a string. But we can instruct our model to format the string in such a way that we can parse it back into an object in Python. JSON is a great format for this, because it is string-based and can be easily parsed by Python’s `json` module. Here is how we could modify our prompt:
prompt = (
    "Extract a succinct timeline of events directly related to the Library from the following text. Return the timeline as a list of dictionaries, where each dictionary has two keys: 'year' and 'event'. Format your output as JSON. The text:\n\n"
    + unstructured_text
)
Now let’s give it another try:
response = llm.invoke(prompt)
print(response.content)
```json
[
{"year": 1928, "event": "Fisher Ames Baker Memorial Library opened with a collection of 240,000 volumes"},
{"year": 1941, "event": "The library building was expanded"},
{"year": 1957, "event": "The library building was further expanded"},
{"year": 1958, "event": "The expansion was completed"},
{"year": 1970, "event": "The library received its one millionth volume"},
{"year": 1992, "event": "John Berry and the Baker family donated $30 million for a new library facility"},
{"year": 2000, "event": "The new Baker-Berry Library opened"},
{"year": 2002, "event": "The library construction was completed"}
]
```
Close! But if you run the cell multiple times, you will see that the model adds additional material around the JSON string: Markdown tags for a code block, or sometimes a brief preamble or closing summary. However, many LLMs, including Llama 3.1, are trained to wrap the actual data in a Markdown code block (everything between the triple backticks `` ``` ``). So we could write a parser that first extracts the text between those backticks, and then parses that string as JSON.
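Such a parser could look roughly like this — a minimal sketch using only the standard library, where the function name and regular expression are our own, and real responses can be messier than this handles:

```python
import json
import re

def parse_json_response(text: str):
    """Extract the first ``` code block, if any, and parse its contents as JSON."""
    match = re.search(r"```(?:json)?\s*(.*?)```", text, re.DOTALL)
    payload = match.group(1) if match else text  # no code block: try the raw text
    return json.loads(payload)

response_text = 'Here is the timeline:\n```json\n[{"year": 2000, "event": "Complex opens"}]\n```'
print(parse_json_response(response_text))  # -> [{'year': 2000, 'event': 'Complex opens'}]
```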
Lucky for us, LangChain already includes such a parser! Since `langchain_dartmouth` is built on LangChain, we can use that component directly with `ChatDartmouth`!

By convention, most components in LangChain use the `invoke` method to “do their thing”, so here is how we can fit everything together:
from langchain_core.output_parsers import JsonOutputParser
parser = JsonOutputParser()
response = llm.invoke(prompt)
timeline = parser.invoke(response)
for event in timeline:
    print(event)
{'year': 1928, 'event': 'Fisher Ames Baker Memorial Library opens with a collection of 240,000 volumes'}
{'year': 1941, 'event': 'Facility expansion'}
{'year': 1957, 'event': 'Facility expansion begins'}
{'year': 1958, 'event': 'Facility expansion completes'}
{'year': 1970, 'event': 'One millionth volume is added to the collection'}
{'year': 1992, 'event': 'John Berry and the Baker family donate $30 million for a new facility'}
{'year': 2000, 'event': 'Baker-Berry Library complex opens'}
{'year': 2002, 'event': 'Baker-Berry Library complex completes'}
There are many more output parsers available in LangChain for all sorts of different desired output formats. All of them have the same usage pattern demonstrated above: Instruct the model to return the data in a specific format, then pass the model’s response through the parser.
If there is a specific format you need that is not already supported by any of the available parsers, you can also write your own by subclassing one of them. Let’s say that instead of generic JSON, we wanted to extract a Pandas `DataFrame`. We could create such a parser by subclassing the `JsonOutputParser` and adding an additional step to its `invoke` method:
import pandas as pd
class DataFrameParser(JsonOutputParser):
    def invoke(self, input, config=None) -> pd.DataFrame:
        # Parse the JSON response first, then turn the records into a DataFrame
        json_data = super().invoke(input, config)
        return pd.DataFrame.from_records(json_data)
parser = DataFrameParser()
response = llm.invoke(prompt)
df = parser.invoke(response)
df
|  | year | event |
|---|---|---|
| 0 | 1928 | The Fisher Ames Baker Memorial Library opened ... |
| 1 | 1941 | The library building was expanded. |
| 2 | 1957 | The library building was expanded again. |
| 3 | 1958 | The library building was completed after the s... |
| 4 | 1970 | The library received its one millionth volume. |
| 5 | 1992 | John Berry and the Baker family donated US $30... |
| 6 | 2000 | The Baker-Berry Library opened. |
| 7 | 2002 | The Baker-Berry Library was completed. |
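With the timeline in a `DataFrame`, all the usual pandas operations are available for further processing. For example, with a couple of hand-written rows standing in for the parsed output:

```python
import pandas as pd

# Hand-written rows standing in for the parsed timeline
df = pd.DataFrame.from_records([
    {"year": 1928, "event": "Library opens"},
    {"year": 1992, "event": "Donation for a new facility"},
    {"year": 2000, "event": "Baker-Berry Library complex opens"},
])

# Filter, sort, export -- standard DataFrame operations now apply
recent = df[df["year"] >= 1992]
print(recent["year"].tolist())  # -> [1992, 2000]
```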
Summary#
In this recipe, we saw that LLMs are great at extracting structured data from unstructured text. Since LLMs can only output strings, output parsers are a great tool to convert the text representation of the structured data into Python objects (like lists, dictionaries, or even data frames) for further processing.