Extracting data from Harry Potter with GPT-3
In this blog post, I’ll show how I used Julia and a GPT-3 model (via an online API) in an attempt to analyze the monetary value of items in the Harry Potter novels, and what I learned in the process.
Harry Potter and money
Anybody who has read the Harry Potter books has probably noticed that the monetary value of items in the wizarding world is a bit… inconsistent at times. For example, the Weasley family has only a single gold coin in their vault, while a wand costs seven Galleons and the special fireworks from the Weasley twins in the later books cost ten. We don’t know how much time or effort it takes to make a wand, but presumably more than mass-produced fireworks, and I would hope the Weasleys could have afforded food and clothes worth at least as much as a batch of fireworks.
I have sometimes wondered how inconsistent the books really are on this point. So I wanted to visualize how the value of items mentioned across the books develops over time, because my hypothesis was that J.K. Rowling might have added ever more expensive things while she was writing, just to contrast them against previously introduced items. For example, something relatively expensive in the first book (the wand) looks very cheap compared to later items (the fireworks).
My initial plan for this exercise was to extract all prices from the books and plot them over time (position in the books), or in a sorted bar graph on a logarithmic axis, to relate all the things you could buy in the wizarding world. However, I never got there, because the more interesting part proved to be how to assemble the list of items of value in the first place without manually going through all the books.
(In the end, I could have just googled anyway, and would have found the table on this page, but that would not have been as much fun!)
Preparing the data
First of all, I downloaded the Harry Potter plain text corpora from kaggle.com and extracted them into a folder, deleting the attached list of characters.
Then, I loaded all seven of them and joined them into a single string:
using DataFrames
using JSON3
using HTTP
using CSV
using Chain
using DataFrameMacros
corpus = @chain begin
    readdir("archive", join = true)
    read.(String)
    join
    replace(r"(\n *){2,}" => "\n")
    replace(r"^Page \|.*$"m => "")
end
The last two replace calls remove superfluous line breaks and page number annotations. With that, I had a reasonably clean starting point.
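As a small illustration on a made-up fragment (not the actual corpus), the two patterns act like this:

julia> sample = "cupboard under the stairs.\nPage | 2\n \n\nNearly ten years had passed...";

julia> @chain sample begin
           replace(r"(\n *){2,}" => "\n")   # collapse runs of (possibly indented) blank lines
           replace(r"^Page \|.*$"m => "")   # strip page number annotations
           print
       end
cupboard under the stairs.

Nearly ten years had passed...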
Next, I wanted to find all the places in the books where the wizarding currency was mentioned. I achieved this by looking for Galleons, Sickles and Knuts via regular expression:
snippet_ranges = @chain begin
    findall(r"galleon|sickle|knut"i, corpus)
    foldl(_[2:end], init = _[1:1]) do list, rng
        if abs(list[end].stop - rng.start) < 300
            list[end] = list[end].start:rng.stop
        else
            push!(list, rng)
        end
        list
    end
    [x.start-300:x.stop+300 for x in _]
end
The foldl command takes the list of occurrence ranges and merges entries that are less than 300 characters apart. That’s because I assumed that multiple money-related words would often occur within one coherent paragraph. I stored the merged ranges with an additional 300 characters of padding before and after, for context. I found 98 snippets this way.
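To make the merging logic concrete, here is the same fold run on a few made-up ranges:

julia> ranges = [100:110, 250:260, 900:910, 2000:2010];

julia> foldl(ranges[2:end], init = ranges[1:1]) do list, rng
           if abs(list[end].stop - rng.start) < 300
               list[end] = list[end].start:rng.stop  # close enough, merge
           else
               push!(list, rng)                      # too far apart, start a new snippet
           end
           list
       end
3-element Vector{UnitRange{Int64}}:
 100:260
 900:910
 2000:2010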
With a helper function, I could then cut out snippets from the corpus.
snippet(corpus, range) = corpus[prevind(corpus, range.start):nextind(corpus, range.stop)]
Note that I used the prevind and nextind functions to shift the stored indices to the nearest valid indices in the UTF-8 string. I only added this after some snippets failed to extract; I had assumed most of Harry Potter would be ASCII anyway and that the naive 300-character shift would be fine (it wasn’t).
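Here is the failure mode on a small example (a made-up string, not from the corpus):

julia> s = "Beauxbâtons";  # "â" takes two bytes in UTF-8

julia> s[8]  # byte 8 falls inside "â"
ERROR: StringIndexError: invalid index [8], valid nearby indices [7]=>'â', [9]=>'t'

julia> s[prevind(s, 8)], s[nextind(s, 8)]  # shift to the surrounding valid indices
('â', 't')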
Here’s one example of using the function on the corpus:
julia> snippet(corpus, snippet_ranges[2]) |> print
ht more eyes. He turned his head in every direction as they walked up
the street, trying to look at everything at once: the
shops, the things outside them, the people doing their
shopping. A plump woman outside an Apothecary
was shaking her head as they passed, saying,
“Dragon liver, sixteen Sickles an ounce, they’re mad.”
A low, soft hooting came from a dark shop with a sign
saying Eeylops Owl Emporium — Tawny, Screech,
Barn, Brown, and Snowy. Several boys of about
Harry’s age had their noses pressed against a window
with broomsticks in it. “Look,” Harry heard one of them sa
An English speaker reading this should be able to extract the item of value being talked about in this paragraph: Dragon liver for sixteen Sickles.
Of course I didn’t want to manually go through all 98 snippets, so I thought about ways to extract the data I needed automatically. Carefully constructed regexes are usually closer to my mode of thinking than more opaque machine learning methods, but in this case it was pretty clear that there was no generally exploitable sentence structure to extract both the price and the item. Sometimes the item is mentioned three sentences before the price, or only alluded to. Therefore, I thought, why not try one of the fancy language models that are talked about so much these days. Would I be able to automate the task with them?
Using GPT-3
Because I didn’t even want to attempt running models on my laptop, I looked into using a web API to feed my data to a language model. I settled on beta.openai.com/playground, which offers some free credits to start with and was relatively painless to get working.
I briefly played around with the user interface on the site to come up with a suitable prompt. My goal was to make the model extract a CSV table of items with their price separated by Galleons, Sickles and Knuts. After 10 minutes or so of prompt engineering, I settled on this version:
function prompt(snippet)
    """
    The following is a text snippet from Harry Potter. One or several items are mentioned, together with their prices in galleons, sickles and knuts.
    Return a comma-separated table with the column headers Item, Galleons, Sickles, Knuts.
    Example input:
    #########
    ...After visiting Diagon Alley, Harry bought a spell book for three Galleons
    and fourteen Sickles, as well as a wand made of chocolate which cost him
    two Sickles and three Knuts...
    #########
    Example output:
    Item,Galleons,Sickles,Knuts
    "Spell Book",3,14,0
    "Chocolate Wand",0,2,3
    Input:
    #########
    ...$snippet...
    #########
    Result:
    """
end
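To check what the model will actually receive, the filled-in template can be printed for a single snippet before spending any API credits:

julia> prompt(snippet(corpus, snippet_ranges[2])) |> print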
Here’s what the model’s output was for the snippet mentioned above:
Item,Galleons,Sickles,Knuts
"Dragon Liver",0,16,0
"Broomsticks",0,0,0
First of all, I was very impressed that the model managed to output valid CSV, and that “Dragon Liver” for sixteen Sickles was extracted correctly.
Interestingly, this snippet also produced “Broomsticks” for zero Galleons. The relevant sentence in the snippet is “Several boys of about Harry’s age had their noses pressed against a window with broomsticks in it.” Broomsticks are mentioned, and they clearly are something of value, being displayed in a shop’s window. But the price isn’t mentioned, so of course it would be silly for a human to include the item in the list.
This already foreshadowed how the overall data extraction would fare.
Using the GPT-3 API
To automatically run the model against every snippet in my database, I used the following function:
function get_response(snippet)
    p = prompt(snippet)
    response = HTTP.post(
        "https://api.openai.com/v1/completions",
        Dict(
            "Content-Type" => "application/json",
            "Authorization" => "Bearer $(read("token.txt", String))"
        ),
        JSON3.write(Dict(
            :model => "text-davinci-002",
            :prompt => p,
            :temperature => 0.1,
            :max_tokens => 50,
        ))
    )
    s = String(response.body)
    JSON3.read(s).choices[1].text
end
My account’s API token was stored in a separate text file; you’ll have to get your own if you want to try this out as well. I chose "text-davinci-002" as the model, which is supposed to be the most capable one. I had to set temperature lower and max_tokens higher than the defaults until the test responses looked good. If the number of tokens is too low, the full table will not be printed.
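With the token in place, a single call against the snippet from earlier reproduces the table shown above (modulo some leading whitespace in the completion):

julia> get_response(snippet(corpus, snippet_ranges[2])) |> print
Item,Galleons,Sickles,Knuts
"Dragon Liver",0,16,0
"Broomsticks",0,0,0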
I then ran this function for all the items in my collection, taking care not to exceed 60 requests per minute as the requests started failing due to rate limiting the first time I tried it.
results = map(enumerate(snippet_ranges)) do (i, snippet_range)
    t = time()
    snip = snippet(corpus, snippet_range)
    println("\n\n\n\n$i of $(length(snippet_ranges))")
    println(snip)
    response = get_response(snip)
    println(response)
    # rate limit 60/min
    sleep(max(0, 1 - (time() - t)))
    (; snippet_range, snippet = snip, response)
end

# turn the ranges into tuples so the json output doesn't include a zillion numbers
results_corrected = map(results) do result
    (; snippet_range = (result.snippet_range.start, result.snippet_range.stop),
        result.snippet, result.response)
end

JSON3.write("results.json", results_corrected)
Finally, I parsed each response as a CSV and concatenated them all into a big DataFrame:
df = @chain begin
    map(results_corrected) do result
        CSV.read(IOBuffer(result.response), DataFrame)
    end
    reduce(vcat, _, source = :snippet_id, cols = :union)
    sort!([:Galleons, :Sickles, :Knuts], rev = true)
    @transform! @subset(:Knut !== missing) :Knuts = :Knut
    # @subset !(:Galleons == :Sickles == :Knuts == 0)
    select(Not(:Knut))
end

CSV.write("result_df.csv", df)
One entry had the column Knut instead of Knuts, so I fixed this; other than that, the format of the table appeared to be followed correctly for every entry. I was already quite impressed by that.
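As an aside, for the sorted log-scale plot I had originally planned, the three currency columns would have to be collapsed into a single comparable number. A minimal sketch of that conversion (assuming the exchange rates from the books, 17 Sickles to a Galleon and 29 Knuts to a Sickle, and using coalesce since some cells are missing after the union):

# hypothetical follow-up: one total value per item, expressed in Knuts
@transform! df :TotalKnuts = (coalesce(:Galleons, 0) * 17 + coalesce(:Sickles, 0)) * 29 + coalesce(:Knuts, 0)
sort!(df, :TotalKnuts, rev = true)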
But let’s have a look at the results:
The top of the list looks good at first: the prize money for the Triwizard Cup was indeed 1000 Galleons. A few entries further down, however, we see “Goblet of Fire entry fee”. The goblet itself didn’t have an entry fee, and the entry fee also wasn’t a thousand Galleons, so this is obviously wrong. We can have a look at the snippet:
julia> print(snippet(corpus, snippet_ranges[34]))
me the truth,” he
said. “If you don’t want everyone else to know, fine,
but I don’t know why you’re bothering to lie, you
didn’t get into trouble for it, did you? That friend of
the Fat Lady’s, that Violet, she’s already told us all
Dumbledore’s letting you enter. A thousand Galleons
prize money, eh? And you don’t have to do end-of-
year tests either. ...”
“I didn’t put my name in that goblet!” said Harry,
starting to feel angry.
“Yeah, okay,” said Ron, in exactly the same sceptical
tone as Cedric. “Only you said this morning you’d
have done it last nig
The language here is not ambiguous for a human: even with no background knowledge, the thousand Galleons are clearly prize money and not an entry fee, even if an uninformed reader would not know what the prize money is for. Another interesting detail is that the snippet never mentions that it’s a goblet of fire, so this part is clearly knowledge that the model added itself.
The next glaring error is the Canary Creams at a whopping 1000 Galleons. Let’s have a look at the snippet:
julia> print(snippet(corpus, snippet_ranges[40]))
ld do with a few laughs. We could all
Harry Potter and the Goblet of Fire - J.K. Rowling
do with a few laughs. I’ve got a feeling we’re going to
need them more than usual before long.”
“Harry,” said George weakly, weighing the money bag
in his hands, “there’s got to be a thousand Galleons
in here.”
“Yeah,” said Harry, grinning. “Think how many
Canary Creams that is.”
The twins stared at him.
“Just don’t tell your mum where you got it ...
although she might not be so keen for you to join the
Ministry anymore, come to think of it. ...”
“Harry,” Fred began,
This one’s interesting, because technically this passage does talk about 1000 Galleons’ worth of Canary Creams. A human would never put this in the list, however, because the meaning a reader would infer from the entry is that Canary Creams cost 1000 Galleons. While the passage is “unfair” to the model to some degree, this just goes to show that extracting information from text is a delicate affair, and there are lots of ways it can go wrong.
The list continues with many more entries, some looking kind of correct, others obviously incorrect, some doubtful. Let’s look at just one more that jumped out at me: “Dumbledore’s arrest” for a single measly Knut. What prompted this response?
julia> print(snippet(corpus, snippet_ranges[68]))
re calmly.
“Yes, shut up, Potter!” barked Fudge, who was still
ogling Dumbledore with a kind of horrified delight.
“Well, well, well — I came here tonight expecting to
expel Potter and instead — ”
“Instead you get to arrest me,” said Dumbledore,
smiling. “It’s like losing a Knut and finding a Galleon,
isn’t it?”
“Weasley!” cried Fudge, now positively quivering with
delight, “Weasley, have you written it all down,
everything he’s said, his confession, have you got it?”
“Yes, sir, I think so, sir!” said Percy eagerly, whose
nose was splattered with ink from the speed of his
This one is kind of impressive in the peculiar way it manages to be wrong. Dumbledore is using a figure of speech here, his arrest being like losing something of small value (a Knut) and gaining something much more valuable instead (a Galleon). So, technically, if his arrest is like losing a Knut, one could translate that as his arrest having the value of 1 Knut (never mind the Galleon gained). Well done, model!
The list ends with many items worth 0,0,0. These are all errors, given that the prompt asks for things that have value, but some of them are less wrong than others. You can have a look at the results JSON written out above if you’re interested in the snippet and response behind each entry.
Conclusion
I didn’t end up making any plots, which had been my initial goal, because it was more fun to play around with the model and try to understand its outputs. The quality of the data is, to put it bluntly, garbage – even though at first glance it’s impressive that the whole pipeline worked so smoothly at all.
I did actually make attempts at engineering the prompt away from the errors I was seeing, like trying to exclude items of zero value, or returning “nothing” or other placeholder responses if the model wasn’t sufficiently “sure”. This mostly broke the outputs, though, instead of improving the responses.
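For example, one such variant appended an instruction along these lines to the template (a rough sketch of the kind of wording, not the exact prompt I used):

Only include items whose price is explicitly stated in the text.
If no priced items are mentioned, output only the word "nothing".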
Humans also make mistakes, but they mostly do so in predictable or understandable ways. The model, however, sometimes returns sensible results and sometimes complete nonsense, with no easy way to tell the two apart. In practice, I would have to hand-check each entry to see if it was correct, saving me essentially no time. The only real time saver for a human in this scenario is the automated extraction of snippets, which is a classic task that dumb computers are good at.
I don’t really think the prompt left much room for interpretation (for a human), so the question is, if the prompt could be improved somehow, how do we know in which direction to go? This amounts to trial and error, essentially. Just because some test outputs look better with a changed prompt, that doesn’t mean that the model suddenly “understands” better. It just means that the pattern it outputs matches the pattern we expect more, but the reason for that is entirely opaque. You can’t ask the model to “show its work” or “explain its reasoning” (yet).
To me, this little excursion proved again what many online discussions of AI language tools like GitHub Copilot have already concluded. The output of such models can look convincing at first glance, and by chance, it can be indistinguishable from that of a human. But you cannot know when it will fail, or in what way. It cannot ask for clarification if it is unsure, it cannot mark passages for review or step outside the box of its prompt in any other way. It will just happily output response after response, with the human operator being fully responsible if they continue working with the result. For code generation, it might be valuable to generate “plausible-looking options” for a human to take inspiration from. For the specific task discussed here, simple as it is, there is no benefit at all in employing the model. Therefore, I remain very sceptical about what the future will bring, especially because of such language models’ ability to produce plausible-looking but completely false output. The models are great at fooling humans into thinking their output is meaningful, because its structure looks correct. But judging whether it actually is meaningful requires going beyond the structure of the text and into its meaning, which is the real challenge here. It remains to be seen whether even larger models will suddenly cross that gap in the coming years, or whether they remain what this one is: a novelty toy that inspires thought, but doesn’t replace it.