Where the rubber meets the horse you road in on.
The thesis of this project has been that the extremely human nature of puns is deceptive: many of their qualities can be worked out without building an advanced AI or neural network.
To that end, as I played with various data sources and tools, I intentionally shied away from ones that looked like full-on natural language processing.
I jumped through a lot of hoops to find data on definitions, synonyms, and sounds that would let me work with them on my own terms.
Sounds the alarm! This is going phonemenally!
Sounds went great. I found the CMUDict project, which has most words encoded in Arpabet, an ASCII phonetic notation that does for English sounds roughly what the IPA does. I was quickly able to find sound sequences within other words and compare them via Python’s excellent string comparison methods. There’s tons more that can be done there and I’ll be playing with sounds for a long while yet.
Definitions, not so much. After someone’s JSON-encoded dictionary turned out to lack some words (what kind of dictionary doesn’t include the word ‘potato’?), I tried to scrape the dictionary included in OS X. This involved someone’s Python wrapper for Objective-C that pulled one definition at a time and spewed the raw text out to the terminal. I got partway through parsing those definitions into something resembling a usable data structure before I abandoned ship.
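To make the idea concrete, here is a minimal sketch of finding one word's sounds inside another. The three entries are hand-typed in the style of CMUDict with stress digits left out, and the helper names (`contains_sounds`, `sound_similarity`, `words_containing`) are made up for illustration. Using `difflib` for the comparison is also an assumption on my part, not necessarily what the project uses.

```python
from difflib import SequenceMatcher

# Illustrative entries in the style of CMUDict (hand-typed, stress digits omitted).
PRONUNCIATIONS = {
    "cat": ["K", "AE", "T"],
    "scatter": ["S", "K", "AE", "T", "ER"],
    "dog": ["D", "AO", "G"],
}

def contains_sounds(needle, haystack):
    """Return True if the phoneme list `needle` occurs contiguously in `haystack`."""
    n = len(needle)
    return any(haystack[i:i + n] == needle for i in range(len(haystack) - n + 1))

def sound_similarity(a, b):
    """Score from 0 to 1 for how much two phoneme lists overlap."""
    return SequenceMatcher(None, a, b).ratio()

def words_containing(word, pronunciations=PRONUNCIATIONS):
    """List the other words whose pronunciation contains all of `word`'s sounds."""
    target = pronunciations[word]
    return sorted(
        other for other, phones in pronunciations.items()
        if other != word and contains_sounds(target, phones)
    )

print(words_containing("cat"))
print(sound_similarity(PRONUNCIATIONS["cat"], PRONUNCIATIONS["scatter"]))
```

Comparing lists of phonemes rather than one joined string keeps a symbol like T from accidentally matching inside TH.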
Casting a word net to spell out my plans
In the end, what I had been resisting (natural language processing) offered a granular enough solution for me. Princeton’s WordNet project and NLTK (the Natural Language Toolkit), a Python library that provides an interface to it, gave me what I needed.
I experimented with the various methods built into it. One promised to tell me, as a number, how short the path between two words was. To me that sounded like a wonderful way to approximate a word’s subject.
As it turned out, that didn’t really work at all. The numbers were too abstract and didn’t map onto subjects in the way I recognize words as sharing a subject.
I turned to hyponyms (more specific words that fall under a word’s sense) and hypernyms (the broader, overarching words that contain the word being asked about). These are segregated by part of speech (hyponyms and hypernyms for nouns do not contain verbs or adjectives, and vice versa), but combined with synsets (groups of words that share a sense with the word being asked about) they provided a close enough approximation of subject that I could move forward.
This in turn provided me with interesting algorithmic problems: how should I crawl a branching tree of words of indeterminate depth and be sure I have found every word within it?
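A toy version shows why a path number is hard to read as a subject. The hypernym chain below is invented for illustration and is far shallower than real WordNet, and `path_score` is a hypothetical helper that scores two words as 1 / (1 + shortest path length). As far as I understand it, that is the general shape of a path-based similarity score, but treat the details as an assumption.

```python
# Toy hypernym links (child -> parent); invented for illustration, not WordNet data.
HYPERNYM = {
    "dog": "canine",
    "canine": "carnivore",
    "cat": "feline",
    "feline": "carnivore",
    "carnivore": "animal",
    "animal": "entity",
    "potato": "vegetable",
    "vegetable": "food",
    "food": "entity",
}

def ancestors(word):
    """Map the word and each of its ancestors to its distance from the word."""
    distances = {}
    steps = 0
    while word is not None:
        distances[word] = steps
        word = HYPERNYM.get(word)
        steps += 1
    return distances

def path_score(a, b):
    """Return 1 / (1 + shortest path between a and b), or None if unconnected."""
    up_a, up_b = ancestors(a), ancestors(b)
    shared = set(up_a) & set(up_b)
    if not shared:
        return None
    shortest = min(up_a[node] + up_b[node] for node in shared)
    return 1 / (1 + shortest)

print(path_score("dog", "cat"))
print(path_score("dog", "potato"))
```

Even in this tiny tree, "dog" and "cat" score 0.2 while "dog" and "potato" score 0.125. Nothing about those two numbers says "same subject" for one pair and "different subject" for the other.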
I worked through these and had a great time teasing out the problems incurred there.
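One way to approach that crawl, sketched here with an explicit stack and a set of already-seen words. The `HYPONYMS` table is a made-up stand-in for whatever a real lookup returns (in NLTK, the children would presumably come from a synset's `hyponyms()` method), and `collect_hyponyms` is a hypothetical name, not the function Puntenshawl actually uses.

```python
# Toy hyponym links (word -> more specific words); invented for illustration.
HYPONYMS = {
    "food": ["vegetable", "fruit"],
    "vegetable": ["potato", "tomato"],
    "fruit": ["tomato", "apple"],
    "potato": ["russet"],
}

def collect_hyponyms(root, graph=HYPONYMS):
    """Return every word reachable below `root`, however deep the branches go."""
    seen = set()
    stack = [root]
    while stack:
        word = stack.pop()
        for child in graph.get(word, []):
            # The seen set stops a word reachable by two routes from being
            # crawled twice, and keeps the loop from running forever on cycles.
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen

print(sorted(collect_hyponyms("food")))
```

Using a stack instead of recursion means the depth of the tree never runs into Python's recursion limit, and the loop ends only when there is nothing left to visit, which is what guarantees every word was found.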
Material improvements and other alloys in my quest
At this point Puntenshawl doesn’t output puns per se, but it gives me material to work with in making puns. I’m increasingly able to ask it about a word or series of words and have it tell me various options for making puns on them. While not funny in and of itself, this definitely provides me with more material, so in a sense it is successful so far.
It’s been an adventure, and there’s yet more to go, but for now this is a good place to be.