
New DeepMind study improves Google's chain of thought method

AI has set a new bar on math problems. As is well known, with Google's chain-of-thought prompting, AI models can already write out problem-solving steps the way humans do. This time, scientists at DeepMind asked a practical question: how do you make sure that both the steps and the answer are correct?

To answer it, they systematically compared process-based and outcome-based supervision on the GSM8K dataset and combined the strengths of both to train their best model. The resulting model's final-answer error rate fell from 16.8% to 12.7%, and the rate of correct answers reached through faulty reasoning fell from 14.0% to 3.4%.


A Double Guarantee: Correct Steps and Correct Answers

Before introducing the new research, we need to revisit the chain-of-thought concept that Google proposed in a paper this January. Put simply, a chain-of-thought prompt is a special kind of in-context learning. Unlike a standard prompt, which only gives examples of input-output pairs, a chain-of-thought prompt also spells out the intermediate reasoning for each example.
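To make the contrast concrete, here is a minimal sketch in Python; the example questions and wording are made up for illustration and are not taken from Google's paper:

```python
# Illustrative sketch: standard few-shot prompt vs. chain-of-thought prompt.
# The exemplar text below is invented for demonstration purposes.

question = ("A farmer has 3 pens with 7 chickens each. He buys 5 more chickens. "
            "How many chickens does he have?")

# Standard prompt: examples show only input -> final answer.
standard_prompt = (
    "Q: Tom has 2 boxes with 4 apples each. How many apples does he have?\n"
    "A: 8\n\n"
    f"Q: {question}\n"
    "A:"
)

# Chain-of-thought prompt: each example also spells out the reasoning steps.
cot_prompt = (
    "Q: Tom has 2 boxes with 4 apples each. How many apples does he have?\n"
    "A: Each box has 4 apples and there are 2 boxes, so 2 * 4 = 8. The answer is 8.\n\n"
    f"Q: {question}\n"
    "A:"
)

print(cot_prompt)
```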

The method has been verified on three large language models: LaMDA 137B, GPT-3 175B, and PaLM 540B. Compared with standard prompts, it significantly improved accuracy on a series of arithmetic reasoning tasks. One problem with this approach, however, is that in some cases the AI generates the correct answer while the reasoning behind it is wrong.


Now researchers at DeepMind have improved on this, focusing not only on the final result but also on the accuracy of the reasoning process. To this end, they present the first comprehensive comparison of process-based and outcome-based methods on a natural language processing task.

Specifically, the comparison covers several settings: few-shot prompting, supervised fine-tuning, reinforcement learning via expert iteration, and reward models used for reranking and for reinforcement learning.

GSM8K was chosen for two reasons. First, it consists of grade-school math word problems whose answers are all integers, which makes exact scoring straightforward. Second, the dataset provides reference solution steps that can serve as offline supervision for the reasoning, and the reasoning can also be annotated online by humans.
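As a rough illustration of why integer answers make scoring easy, checking a model's output can reduce to extracting the final number and comparing it exactly. The record below is a simplified stand-in for the real GSM8K format, which packs the reasoning and final answer into a single annotated string:

```python
import re

# Simplified GSM8K-style record: a word problem, reference reasoning steps,
# and an integer final answer.
example = {
    "question": "Natalia sold clips to 48 friends in April, and half as many in May. "
                "How many clips did she sell altogether?",
    "reference_steps": ["48 / 2 = 24 clips in May", "48 + 24 = 72 clips in total"],
    "answer": 72,
}

def extract_final_integer(model_output: str):
    """Pull the last integer out of a model's generated solution."""
    numbers = re.findall(r"-?\d+", model_output.replace(",", ""))
    return int(numbers[-1]) if numbers else None

model_output = "She sold 48 / 2 = 24 clips in May, so 48 + 24 = 72 clips in total."
print(extract_final_integer(model_output) == example["answer"])  # exact-match scoring
```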


Two things stand out from the results. First, process-based and outcome-based methods produce nearly identical final-answer error rates, which means outcome supervision alone is sufficient to achieve a low answer error rate. Second, improving the accuracy of the reasoning steps requires either process supervision or a reward model that mimics it. Although the final-answer error rates are similar, the trace error rate under outcome supervision (19.8%) is significantly higher than under process supervision (11.4%).
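The difference between the two forms of supervision can be sketched as follows: outcome-based labels look only at the final answer, while process-based labels judge each reasoning step. The per-step checker here is a toy stand-in for the human annotations used in the study:

```python
def outcome_label(final_answer, reference_answer):
    # Outcome-based supervision: a single label per sample,
    # based only on whether the final answer matches the reference.
    return final_answer == reference_answer

def process_labels(solution_steps, step_is_valid):
    # Process-based supervision: one label per reasoning step.
    # `step_is_valid` is a toy stand-in for the human judgement used in the study.
    return [step_is_valid(step) for step in solution_steps]

def check_arithmetic(step: str) -> bool:
    """Toy validity check for steps written as '<expression> = <integer>'."""
    expression, result = step.split("=")
    return eval(expression) == int(result)  # acceptable only for this controlled toy input

# A trace that reaches the right answer through a wrong step:
# outcome supervision marks it correct, process supervision flags the bad step.
steps = ["48 / 2 = 25", "48 + 24 = 72"]
print(outcome_label(72, 72))                    # True  -> looks fine to outcome supervision
print(process_labels(steps, check_arithmetic))  # [False, True] -> the faulty step is caught
```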

The researchers then combined the strengths of the two approaches to train their best model, pairing supervised fine-tuning with reinforcement learning against reward models. This model's final-answer error rate dropped from the previous best of 16.8% to 12.7%, and the rate of cases where the answer was correct but the reasoning was wrong fell from 14.0% to 3.4%. When the model was allowed to abstain on 30% of the questions, the final-answer error rate fell further, to 2.7%.
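The "abstain on 30% of the questions" result comes from letting the model skip the answers its reward model scores lowest. A minimal sketch of best-of-N reranking plus selective answering, with hypothetical `generate_candidates` and `reward_model_score` functions standing in for the trained components, might look like this:

```python
# Sketch of reward-model reranking plus selective answering.
# `generate_candidates` and `reward_model_score` are hypothetical stand-ins
# for the sampled solutions and the learned reward model described in the paper.

def rerank(candidates, reward_model_score):
    """Best-of-N reranking: keep the candidate the reward model scores highest."""
    return max(candidates, key=reward_model_score)

def answer_selectively(questions, generate_candidates, reward_model_score,
                       abstain_fraction=0.3):
    """Answer only the highest-scoring questions, abstaining on the rest."""
    best = []
    for q in questions:
        cand = rerank(generate_candidates(q), reward_model_score)
        best.append((q, cand, reward_model_score(cand)))
    best.sort(key=lambda item: item[2], reverse=True)
    keep = int(len(best) * (1 - abstain_fraction))
    return best[:keep]  # the questions the model actually answers

# Toy usage with made-up candidates and scores.
questions = ["q1", "q2", "q3"]
fake_candidates = {"q1": ["a", "b"], "q2": ["c"], "q3": ["d", "e"]}
fake_scores = {"a": 0.9, "b": 0.2, "c": 0.1, "d": 0.7, "e": 0.4}
answered = answer_selectively(
    questions,
    generate_candidates=lambda q: fake_candidates[q],
    reward_model_score=lambda cand: fake_scores[cand],
)
print(answered)  # q1 and q3 are answered; q2 (lowest score) is abstained on
```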

