Deep Learning: GPT-2 tested on key NLU tasks

Here are some simple but ‘next-level’ experiments I ran on the amazing new deep learning neural networks: GPT-2 and GPT-3 from OpenAI (co-founded by Elon Musk) and T5 from Google.

These are a new breed of humongous neural network trained on English text. Like terabytes of text.

At their simplest, after training, they give you what’s called a Language Model (LM). An LM basically lets you predict the next word based on (1) the previous words you give it and (2) its training.
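To make that concrete, here’s a minimal sketch of next-word prediction. It assumes the HuggingFace transformers library and the small public ‘gpt2’ checkpoint (my experiments further down used an online interface, so this is purely illustrative):

# Minimal sketch: ask a pretrained LM for its most likely next words.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

prompt = "The man walked"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits            # shape: (1, sequence_length, vocab_size)

next_token_probs = torch.softmax(logits[0, -1], dim=-1)
top = torch.topk(next_token_probs, k=5)        # the five most likely next tokens
for prob, idx in zip(top.values, top.indices):
    print(f"{tokenizer.decode(int(idx))!r}  p = {prob.item():.3f}")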

Sounds a bit useful and a bit … boring?

Well: right and … wrong.

Language models

Historically, LMs were (and still are) used, for example, to bias speech recognition systems towards the most likely interpretation given the current utterance, the previous utterance and … the training of the LM.

So, for example, if the speech recognition system has transcribed ‘The man walked …’ and the next word sounds anything like ‘to’, we’ll put it down as ‘to’, because that’s the most likely next word even if it sounds a bit garbled.
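As a toy sketch of that rescoring idea (the candidate transcriptions here are made up for illustration, and this assumes the same transformers plus ‘gpt2’ setup as above, not any real speech recogniser):

# Hedged sketch: rank acoustically confusable transcriptions by how probable the LM finds them.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

candidates = [
    "The man walked to the shop.",
    "The man walked two the shop.",
    "The man walked too the shop.",
]

with torch.no_grad():
    for sentence in candidates:
        ids = tokenizer(sentence, return_tensors="pt").input_ids
        loss = model(ids, labels=ids).loss     # mean negative log-likelihood per token
        print(f"avg NLL = {loss.item():.3f}   {sentence}")

The transcription with the lowest average negative log-likelihood is the one the LM would nudge the recogniser towards.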

That covers useful.

Now, what about non-boring?

Well, these new, hugely trained (and trainable, thanks to their sheer model size) Deep Learning LMs are so big that they predict not only the next word or words, but the next sentences.

And the thing is, because of the way neural networks work, they’re not doing it just by rote. The sentences a model predicts to go next in your essay may never have been seen before. Based on the patterns of word use in your input, it will predict what, most likely, should go next.
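As a rough sketch of that kind of open-ended continuation (again assuming the transformers library and the small public ‘gpt2’ checkpoint, which is not the exact system behind the web demos):

# Rough sketch: sample a multi-sentence continuation from GPT-2.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

prompt = "The explorers pushed through the last of the jungle and saw"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

output_ids = model.generate(
    input_ids,
    max_length=80,                        # prompt plus continuation, in tokens
    do_sample=True,                       # sample, rather than always taking the single top token
    top_k=50,
    top_p=0.95,
    pad_token_id=tokenizer.eos_token_id,  # silences a padding warning for GPT-2
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))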

So it’s learning deeply. It almost understands.

Examples on the web (search for GPT-2 or GPT-3) include an entire fictional news story about the discovery of unicorns. Only half a sentence was fed in; the rest of the article was constructed by the GPT-2 or GPT-3 LM and, they checked, that story does not exist in its training data. It makes sense for about three sentences, then starts, ever so surely, to contradict itself and eventually begins to sound like rubbish. It’s the accumulation of errors.

Story understanding

Well, the test I wanted to give it was how much it has absorbed about (1) anaphoric co-reference and (2) entity tracking. Put another way: using pronouns to refer to previously mentioned people, and tracking the story plot at least as far as who did what.

I haven’t seen anyone test this yet so I was quite excited when I got into an online GPT-2 interface.

INPUT: The man had a hammer. The girl had a saw. The boy had a glue gun. She gave the

OUTPUT: saw to the boy.

Fantastic! GPT-2 was able to do enough probabilistic neural-network mumbo-jumbo to absorb the sentences, associate the saw with the girl, and work out that she would probably give it to one of the other two people.
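If you want to re-run this kind of probe yourself, a rough local equivalent looks like the sketch below. It assumes the transformers library and the small public ‘gpt2’ checkpoint; the online interface I used may well run a larger model, so your continuation will probably differ.

# Rough sketch of re-running the anaphora probe locally; output varies with model size.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

prompt = ("The man had a hammer. The girl had a saw. "
          "The boy had a glue gun. She gave the")
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

output_ids = model.generate(
    input_ids,
    max_new_tokens=8,                     # just a few words of continuation
    do_sample=False,                      # greedy decoding, so the run is repeatable
    pad_token_id=tokenizer.eos_token_id,
)
continuation = output_ids[0][input_ids.shape[1]:]
print(tokenizer.decode(continuation, skip_special_tokens=True))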

That’s a non-trivial result and a more scientific test than almost all the tests I’ve seen on the web so far. I might do the systematic work and write up a paper on the limits of these networks, but if I don’t, at least you get the idea.

Well, unfortunately for the current state of AI, I discovered that GPT-2 is not 100% accurate, even for short extracts. It might be better than most existing anaphora-resolution algorithms, but it was quite easy to fool with similarly short, equally easy examples.

Next I tested its ability to track three people labelled by their names. My tests looked like this:

INPUT: Peter wore a red tie. Mike had a yellow hat. Mary has a green tie. Peter took his tie off. It was coloured

OUTPUT: red.

But again, the results were not 100% accurate, which made me think there’s room for improvement with these NNs (which is coming, actually) and also that traditional, logic-based algorithms still have a role to play.
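If you want to probe this more systematically than eyeballing one sampled continuation, you can compare the probability the model assigns to each candidate colour as the next word. A hedged sketch, with the same assumed transformers plus ‘gpt2’ setup as the earlier ones:

# Hedged sketch: score each candidate colour as the next token instead of sampling.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

prompt = ("Peter wore a red tie. Mike had a yellow hat. Mary has a green tie. "
          "Peter took his tie off. It was coloured")
candidates = ["red", "yellow", "green"]

with torch.no_grad():
    logits = model(tokenizer(prompt, return_tensors="pt").input_ids).logits[0, -1]
    probs = torch.softmax(logits, dim=-1)

for word in candidates:
    # GPT-2's tokenizer folds the leading space into the token, hence " red" etc.
    token_id = tokenizer.encode(" " + word)[0]
    print(f"{word:>6}  p = {probs[token_id].item():.4f}")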

Next

In the next NLU post I’ll cover Transfer Learning with these networks, T5 in particular. What is it and what can you do with it?
