
# “Math Rookie” ChatGPT knows human preferences all too well: asked for a random number, it picks the ultimate answer to the universe

Even when generating random numbers, ChatGPT follows human habits.

ChatGPT may be a bullshit artist and purveyor of misinformation, but it is not a “mathematician”!

Recently, Colin Fraser, a Meta data scientist, found that ChatGPT does not generate true random numbers, but more like “human random numbers.”

Through experiments, Fraser concluded: “ChatGPT likes the numbers 42 and 7 very much.”

Netizens pointed out that this really means humans like these numbers very much.

## ChatGPT also loves “The Ultimate Answer of the Universe”

In his tests, Fraser entered the prompt as follows:

“Pick a random number between 1 and 100. Just return the number; Don’t include any other text or punctuation in the response.”

By asking ChatGPT for a random number between 1 and 100 over and over, Fraser collected 2,000 responses and compiled them into a table.

The number 42 occurs most frequently, accounting for as much as 10% of the responses. Numbers containing the digit 7 are also very common.

Numbers from 71 to 79 are especially frequent. Outside that range, 7 also often appears as the second digit.
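A tally of this kind can be sketched in a few lines of Python. This is only a minimal illustration, not Fraser's actual code: the function names are hypothetical, and the responses below come from Python's own pseudorandom generator as placeholder data, not from ChatGPT. For reference, a fair generator would return each number about 1% of the time, and since 19 of the numbers from 1 to 100 contain the digit 7, about 19% of its responses would contain a 7.

```python
import random
from collections import Counter


def tally(responses):
    """Return (number, share of responses) pairs, most frequent first."""
    counts = Counter(responses)
    total = len(responses)
    return [(number, count / total) for number, count in counts.most_common()]


def digit_seven_share(responses):
    """Return the share of responses whose decimal form contains a 7."""
    return sum("7" in str(number) for number in responses) / len(responses)


# Placeholder data from a fair generator, NOT real ChatGPT output.
rng = random.Random(0)
responses = [rng.randint(1, 100) for _ in range(2000)]

print(tally(responses)[:5])          # the five most frequent numbers
print(digit_seven_share(responses))  # close to 0.19 for a fair generator
```

Running the same two functions on responses collected from a chatbot would show at a glance whether one number, such as 42, takes a share far above 1%.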

What does 42 mean?

Anyone who has read Douglas Adams’ blockbuster science fiction novel “The Hitchhiker’s Guide to the Galaxy” knows that 42 is “the ultimate answer to life, the universe, and everything.”

Simply put, 42 and 69 are meme numbers online. This shows that ChatGPT is not actually a random number generator; it simply picks numbers that are popular in the huge data set collected from the internet.

In addition, 7 appears frequently, which just reflects that ChatGPT caters to human preferences.

In Western culture, 7 is generally regarded as a lucky number, hence the phrase “Lucky 7.” It is much like the fondness for the number 8 in Chinese culture.

Interestingly, Fraser also found that GPT-4 seems to compensate for this.

When GPT-4 was asked for more numbers, the numbers it returned were too evenly distributed to look truly random.
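One way to make “too evenly distributed” concrete is Pearson's chi-square statistic against a uniform distribution. The sketch below is an illustration under stated assumptions, not Fraser's method: the function name is hypothetical and all three samples are synthetic placeholders. With 100 equally likely values, a truly random sample gives a statistic with an expected value of 99; a value near 0 means the counts are suspiciously even, and a value far above 99 means some numbers are heavily favored.

```python
import random
from collections import Counter


def chi_square_uniform(responses, low=1, high=100):
    """Pearson chi-square statistic of the responses against a uniform distribution."""
    expected = len(responses) / (high - low + 1)
    counts = Counter(responses)
    return sum(
        (counts.get(number, 0) - expected) ** 2 / expected
        for number in range(low, high + 1)
    )


rng = random.Random(0)

# Three synthetic samples of 2,000 numbers each (placeholders, not model output).
fair = [rng.randint(1, 100) for _ in range(2000)]
too_even = [number for number in range(1, 101) for _ in range(20)]
peaked = fair[:1800] + [42] * 200  # 42 is heavily over-represented

print(chi_square_uniform(fair))      # typically in the neighborhood of 99
print(chi_square_uniform(too_even))  # 0.0: every number appears exactly 20 times
print(chi_square_uniform(peaked))    # far above 99
```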

In short, ChatGPT produces its answers by predicting likely text, rather than by actually “thinking.”

So a chatbot touted as almost omnipotent turns out to be a little silly.

Ask it to plan a road trip and it will route you through a town that doesn’t even exist. Ask it for a random number and the answer is most likely based on a popular meme.

Some netizens tried it for themselves and found that GPT-4 really likes 42.

If ChatGPT ends up just repeating online clichés, what’s the point?

## GPT-4, violating the rules of machine learning

The birth of GPT-4 is exciting, but also disappointing.

OpenAI released little information about GPT-4 and did not even disclose the size of the model, yet it emphasized that GPT-4 outperforms humans on many professional and standardized tests.

On the U.S. bar exam, for example, GPT-3.5 scored around the 10th percentile, while GPT-4 scored around the 90th percentile.

However, Arvind Narayanan, a professor of computer science at Princeton University, and Sayash Kapoor, a doctoral student, wrote that OpenAI may have tested GPT-4 on its training data. They also argued that human benchmarks are meaningless for chatbots.

Specifically, OpenAI may have violated a cardinal rule of machine learning: don’t test on training data. Test data and training data must be kept separate; otherwise, a high score may reflect memorization and overfitting rather than the ability to handle unseen problems.
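A toy example shows why the rule matters. The “model” below is just a lookup table that memorizes its training pairs; it is a deliberately extreme, hypothetical stand-in, not a description of how GPT-4 works, and the questions are made up for illustration.

```python
def train_memorizer(examples):
    """'Train' by storing every (question, answer) pair verbatim."""
    return dict(examples)


def accuracy(model, examples):
    """Share of examples for which the model returns the stored answer."""
    correct = sum(model.get(question) == answer for question, answer in examples)
    return correct / len(examples)


# Made-up arithmetic questions used purely for illustration.
train_set = [("2+2", "4"), ("3+5", "8"), ("7+1", "8")]
held_out_set = [("2+3", "5"), ("4+4", "8")]

model = train_memorizer(train_set)
print(accuracy(model, train_set))     # 1.0: perfect score on data it has seen
print(accuracy(model, held_out_set))  # 0.0: no ability on unseen questions
```

Scoring this model on its own training set would suggest it has mastered arithmetic, while the held-out set shows it has learned nothing that generalizes.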

That problem aside, there’s a bigger problem.

Language models solve problems differently than humans do, so these results say little about how a chatbot will perform on the real-world problems that professionals face. A lawyer’s job isn’t to answer bar exam questions all day long.

### Problem 1: Training Data Pollution

To evaluate GPT-4’s programming ability, OpenAI tested it on problems from Codeforces, a Russian competitive programming website.

Surprisingly, Horace He pointed out online that, in the easy category, GPT-4 solved 10 pre-2021 problems but none of the 10 most recent problems.

The training data cutoff for GPT-4 is September 2021.

This strongly suggests that the model is able to memorize the solutions in its training set, or at least partially memorize them, enough to fill in what it cannot recall.

To provide further evidence for this hypothesis, Arvind Narayanan tested GPT-4 on Codeforces competition problems at various times in 2021.

He found that GPT-4 could solve easy-category problems released before September 5, but none of the problems released after September 12.
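The comparison behind this kind of experiment is a solve rate computed on each side of a date. The sketch below is a hedged illustration: the function name is hypothetical, the cutoff of September 8, 2021 is an arbitrary date chosen between the two dates mentioned above, and the results list is made-up placeholder data that only mimics the reported pattern. It is not Narayanan's data.

```python
from datetime import date


def rate(outcomes):
    """Share of True values, or None if there are no outcomes."""
    return sum(outcomes) / len(outcomes) if outcomes else None


def solve_rate_by_cutoff(results, cutoff):
    """Split (release_date, solved) pairs at the cutoff and return both solve rates."""
    before = [solved for released, solved in results if released < cutoff]
    after = [solved for released, solved in results if released >= cutoff]
    return rate(before), rate(after)


# Placeholder results (release date, solved?), invented for illustration only.
results = [
    (date(2021, 8, 20), True),
    (date(2021, 8, 29), True),
    (date(2021, 9, 4), True),
    (date(2021, 9, 13), False),
    (date(2021, 9, 25), False),
    (date(2021, 10, 2), False),
]

print(solve_rate_by_cutoff(results, cutoff=date(2021, 9, 8)))  # (1.0, 0.0)
```

A sharp drop in solve rate right at the training cutoff, with problem difficulty held constant, is the signature of memorization that the experiment looks for.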

In fact, we can definitively show that it has memorized problems in its training set: when GPT-4 is prompted with the title of a Codeforces problem, it includes a link to the exact contest in which the problem appeared. It’s worth noting that GPT-4 doesn’t have access to the internet, so memorization is the only explanation.

For benchmarks other than programming, Prof. Narayanan said, “We don’t know how to cleanly separate problems by time period, so we think it is difficult for OpenAI to avoid data pollution. For the same reason, we cannot run an experiment to test how performance varies with date.”

However, the question can be approached from the other side. If the model is relying on memorization, then GPT must be highly sensitive to the wording of a question.

In February, Melanie Mitchell, a professor at the Santa Fe Institute, gave an example using an MBA exam question: changing a few details was enough to fool ChatGPT (GPT-3.5), even though the same changes would not fool a person.

More detailed experiments like this would be valuable.

Due to OpenAI’s lack of transparency, Professor Narayanan can’t say with certainty that it is a data pollution problem. But what is certain is that OpenAI’s approach to detecting contamination is sloppy:

“We measure the cross-contamination between the evaluation dataset and the pre-training data using a substring matching method. Both the evaluation and training data are processed to remove all whitespace and symbols, leaving only characters (including numbers). For each evaluation example, we randomly select three substrings of length 50 characters (or use the entire example if the example length is less than 50 characters). If any one of the sampled evaluation substrings is a substring of the processed training example, then the match is considered successful. This results in a list of tainted examples. We discard these and rerun to get the untainted score.”
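The procedure in this quote can be sketched in Python as follows. This is a reconstruction from the quoted description only, not OpenAI's code: the function names are hypothetical, and details the quote leaves open (such as whether letter case is preserved) are assumptions marked in the comments. The example question and documents are made up.

```python
import random


def normalize(text):
    """Remove whitespace and symbols, keeping only letters and digits.

    Assumption: letter case is left unchanged, since the quote does not say.
    """
    return "".join(ch for ch in text if ch.isalnum())


def sample_substrings(text, length=50, k=3, rng=random):
    """Pick k random substrings of the given length, or the whole text if it is short."""
    if len(text) <= length:
        return [text]
    starts = [rng.randrange(len(text) - length + 1) for _ in range(k)]
    return [text[start:start + length] for start in starts]


def is_contaminated(eval_example, training_docs, rng=random):
    """Flag the example if any sampled substring occurs in any processed training doc."""
    probes = [p for p in sample_substrings(normalize(eval_example), rng=rng) if p]
    processed_docs = [normalize(doc) for doc in training_docs]
    return any(probe in doc for probe in probes for doc in processed_docs)


# Made-up example question and training documents.
question = ("Alice has 12 apples and gives 5 to Bob. "
            "How many apples does Alice have left after the exchange?")
verbatim_doc = "Forum post: " + question + " Answer: 7."
reworded_doc = ("Carol has 13 apples and gives 6 to Dave. "
                "How many apples does Carol have left after the exchange?")

print(is_contaminated(question, [verbatim_doc]))  # True: exact copy is caught
print(is_contaminated(question, [reworded_doc]))  # False: renamed copy slips through
```

In this example, every 50-character window of the normalized question spans the name “Bob”, so a copy that changes only the names and numbers can never match.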

This approach is fragile.

If a test question is present in the training set with its names and numbers changed, it will not be detected. More reliable methods are available, such as embedding distance.

If OpenAI is going to use the method of embedding distance, how much similarity is too similar? There is no objective answer to this question.
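The threshold problem can be illustrated with a simple similarity score. The sketch below is a stand-in under loud assumptions: instead of a learned embedding model, it uses counts of character trigrams as a crude vector representation and cosine similarity in place of embedding distance. The function names, the example texts, and the two thresholds are all hypothetical.

```python
import math
from collections import Counter


def trigram_vector(text):
    """Crude stand-in for an embedding: counts of character trigrams."""
    cleaned = "".join(ch.lower() for ch in text if ch.isalnum())
    return Counter(cleaned[i:i + 3] for i in range(len(cleaned) - 2))


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[gram] * b[gram] for gram in a.keys() & b.keys())
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def flag_contaminated(eval_example, training_doc, threshold):
    """Flag the pair as contaminated if similarity reaches the chosen threshold."""
    similarity = cosine_similarity(
        trigram_vector(eval_example), trigram_vector(training_doc)
    )
    return similarity >= threshold


# Made-up texts: the second is the first with names and numbers changed.
question = ("Alice has 12 apples and gives 5 to Bob. "
            "How many apples does Alice have left after the exchange?")
reworded = ("Carol has 13 apples and gives 6 to Dave. "
            "How many apples does Carol have left after the exchange?")

print(cosine_similarity(trigram_vector(question), trigram_vector(reworded)))
print(flag_contaminated(question, reworded, threshold=0.5))  # True
print(flag_contaminated(question, reworded, threshold=0.8))  # False
```

The same pair of texts is contaminated under one threshold and clean under another, which is exactly the judgment call the question above points to.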

Thus, even something as seemingly simple as measuring performance on multiple-choice standardized tests involves a great deal of subjectivity.

### Problem 2: Professional exams are not a valid way to compare human and robot capabilities

Memorization is a spectrum. Even if a language model has not seen an exact question in its training set, the training corpus is so large that it has inevitably seen many very similar examples.

That means it can get by without deeper reasoning. The benchmark results therefore do not give us evidence that language models are acquiring the deep reasoning skills that human test takers need.

For some practical tasks, GPT-4’s shallow reasoning may be sufficient, but not always.

Benchmarks have been widely used to compare large models and have been criticized by many for reducing multidimensional evaluations to a single number.

Unfortunately, OpenAI chose to rely heavily on these tests in its evaluation of GPT-4, and its measures for handling data pollution were insufficient.
