Critic-CoT: Boosting the reasoning abilities of large language model via Chain-of-Thought Critic (2025)

Xin Zheng1,2  Jie Lou3  Boxi Cao1,2  Xueru Wen1,2  Yuqiu Ji3  Hongyu Lin1  Yaojie Lu1  Xianpei Han1  Debing Zhang3  Le Sun1
1 Chinese Information Processing Laboratory, Institute of Software, Chinese Academy of Sciences
2 University of Chinese Academy of Sciences
3 Xiaohongshu Inc
{zhengxin2020,boxi2020,wenxueru2022}@iscas.ac.cn
{hongyu,luyaojie,xianpei,sunle}@iscas.ac.cn
{yinyue2,dengyang}@xiaohongshu.com
This work was done when Xin Zheng interned at Xiaohongshu.

Abstract

Self-critic has become a crucial mechanism for enhancing the reasoning performance of LLMs. However, current approaches mainly involve basic prompts for intuitive instance-level feedback, which resembles System-1 processes and limits the reasoning capabilities. Moreover, there is a lack of in-depth investigations into the relationship between an LLM’s ability to criticize and its task-solving performance. To address these issues, we propose Critic-CoT, a novel framework that pushes LLMs toward System-2-like critic capability. Through a step-wise CoT reasoning paradigm and the automatic construction of distant-supervision data without human annotation, Critic-CoT enables LLMs to engage in slow, analytic self-critique and refinement, thereby improving their reasoning abilities. Experiments on GSM8K and MATH demonstrate that our enhanced model significantly boosts task-solving performance by filtering out invalid solutions or by iterative refinement. Furthermore, we investigate the intrinsic correlation between critique and task-solving abilities within LLMs, discovering that these abilities can mutually reinforce each other rather than conflict.

1 Introduction

Enhancing the reasoning abilities of large language models is essential for creating more intelligent and reliable AI systems, which has drawn extensive attention from researchers (Chollet, 2019; Bubeck et al., 2023; Morris et al., 2024). From a cognitive perspective, the procedure of human reasoning involves constant reflection and revision (Hegel et al., 1991; Kierkegaard, 1989; Popper, 1934), which has inspired increasing focus on integrating self-critic mechanisms in the reasoning process of large-scale models (Kim et al., 2023; Shinn et al., 2023; Madaan et al., 2023). This involves iteratively allowing the model to generate feedback on its own responses and then refining its reasoning based on the feedback. Compared with traditional critic methods that depend on feedback from external sources (Saunders et al., 2022; McAleese et al., 2024), self-critic relies solely on the model’s internal capabilities, thus reducing the high cost of additional human annotation, and serving as a promising potential solution to scalable oversight (Leike et al., 2018; Burns et al., 2023; Cao et al., 2024).

[Figure 1]

However, current studies primarily focus on utilizing LLMs’ critique abilities to enhance their performance, while relatively little attention has been given to the investigation and development of the critique ability itself. Firstly, existing critique methods are often overly simplistic, typically relying on a basic prompt to directly point out the error, without step-wise Chain-of-Thought examination or a training procedure, which leads to relatively poor self-critic accuracy (Luo et al., 2023; West et al., 2024). Specifically, proposing a valid critique is a complicated task that requires a thorough understanding of statements and precise negativity. However, current LLMs are normally not explicitly trained for critic capability. Therefore, these simple approaches usually tend to “criticize” like System-1, which is more intuitive and likely to make mistakes, rather than the more rigorous and deliberate System-2 (Kahneman, 2011; Yu et al., 2024), while shifting LLMs from System-1 toward System-2 emerges as a promising approach for improving the reasoning capability (OpenAI, 2024). This limitation diminishes the effectiveness of self-critic and, further, self-correction (Huang et al., 2024). Secondly, the capabilities of task-solving and self-critic are both dependent on the model’s inherent knowledge, while there is currently a lack of in-depth exploration regarding the correlation between these two capabilities within LLMs. In that case, it’s challenging to balance the task-solving and the self-critic capabilities of the model within the self-critic framework, which poses a significant obstacle to the subsequent development in this direction.

To this end, this paper is devoted to diving into the following critical research questions:

  • How can we enhance a model’s critique ability, pushing it toward System 2 reasoning?

  • What is the relationship between a model’s critique ability and its task-solving capability?

To answer the above questions, as shown in Figure 1, we propose Critic-CoT, a novel framework designed to enhance LLMs’ reasoning abilities. Through a step-wise Chain-of-Thought critique format and distant supervision, our method is able to strengthen System-2-like critic ability, without the intensive cost of human annotation. Specifically, during training, we let LLMs criticize and refine their solutions in a complete CoT way, and collect successful pairs that convert wrong solutions into correct ones, or affirm the validity of original right solutions. After supervised fine-tuning on the obtained step-wise critic-refine data, we enable the target LLM to analyze and criticize each step of its generated reasoning procedure, so that it can filter out wrong attempts and preserve the correct ones with greater precision. During inference, to leverage the model’s abilities of CoT-critique and refinement, we employ two strategies: (1) majority vote filtering, which uses the critic model to evaluate multiple generated solutions and filter out the incorrect ones; and (2) iterative refinement, which repeatedly critiques and refines a solution until no further error is detected.

Through a series of experiments on GSM8K (Cobbe et al., 2021a) and MATH (Hendrycks et al., 2021), we found that our trained critic model can fairly distinguish incorrect solutions from correct ones, and improve the reasoning accuracy via iterative refinement or critic filtering. These results demonstrate the helpfulness and effectiveness of our proposed method. Additionally, we observed that our critic model already exhibits noticeable performance improvements in task-solving, even in the absence of additional critique steps during the decoding phase. Such findings reveal that strengthening the ability to critique and refine does not compromise the task-solving performance, but improves it. This also suggests the presence of an intrinsic mechanism by which critique ability and task-solving capability mutually reinforce one another.

We summarize our main contributions as follows:

2 Related Works

2.1 Discriminative Verifier for Mathematics

To further improve the reasoning ability of large language models, one applicable approach is through the use of reward models, which can either be used in reinforcement learning during training (Ouyang et al., 2022) or rejection sampling at test time (Cobbe et al., 2021b). While outcome-supervised reward models (ORMs) allow for the automatic collection of training data based on the signal of the gold answer, process-supervised reward models (PRMs) would be more advantageous for more precise feedback, better interpretability and stronger alignment (Lightman et al., 2024).

To reduce the considerable human labeling cost and difficulty for dense annotation, a series of works based on automatic approaches have been proposed (Wang et al., 2023a; Chen et al., 2024b; Luo et al., 2024; Snell et al., 2024), all under the heuristic that for an incorrect solution, the first error step is where the continuation of the previous step would lead to a correct answer. This may bring noise into training data due to false positives and negatives (Luo et al., 2024). Moreover, annotation based on the implicit solution continuation alone does not leverage LLMs’ emerging ability of critic, which works in a more explicit and analytic way and brings better explainability (Saunders et al., 2022; Yuan et al., 2024; Luo et al., 2023; McAleese et al., 2024). Additionally, binary 0/1 discrimination alone, whether outcome-based or process-based, remains more similar to System-1 reasoning than to the desirable System-2, and thus may not fully leverage the computation power supported by empirically successful Chain-of-Thought prompting (Feng et al., 2023; Li et al., 2024).

2.2 Critic Model

Learning from natural language feedback could be beneficial (Chen et al., 2024a). With the development of LLMs, whether a model can discriminate and criticize its own output in a text-generation manner becomes an interesting topic (Luo et al., 2023; Zeng et al., 2023), with doubts at least on off-the-shelf LLMs that are not specially trained for such a task (Huang et al., 2024; West et al., 2024). Current applications, such as response evaluation, heavily rely on the reference (Zheng et al., 2023). Therefore, given the limited critic ability of current LLMs, how to train a robust and applicable critic model is worth investigating. Concurrently, Zhang et al. (2024) trained a generative reward model on the outcome level rather than the process level but did not incorporate refinement into the schema.

From the perspective of recursive reward modeling (Leike et al., 2018; Saunders et al., 2022) and scalable oversight (Burns et al., 2023), McAleese et al. (2024) recently trained “CriticGPT” to assist human labelers, which aims to improve the ability of humans rather than that of the base model, i.e., to improve the overall recall of error detection rather than precision. In this paper, by contrast, we explore whether the reasoning ability of LLMs can be improved without costly human annotation.

3 Method

To equip LLMs with the ability to criticize and refine themselves step-by-step, we propose Critic-CoT. As shown in Figure 2, it consists of two modules: distant-supervision-based auto-train and self-check at inference time. First, we introduce the distant-supervision principles in Section 3.1, followed by the training process in Section 3.2, and finally, the inference strategies in Section 3.3.

[Figure 2]

3.1 Chain-of-Thought Critique

In this work, we utilize a step-wise chain-of-thought critique, which makes the critique-refine process both controllable and formalizable, thereby facilitating the collection of distant supervision data. Formally, given the question $Q$ and the corresponding gold answer $Ans$, we have the $n$-step attempt $Att=[s_1,\dots,s_n]$ with predicted answer $Pred$ sampled by generator $G$. The corresponding critique $Cri$ can then be represented as $L=[l_1,\dots,l_n]$, where the step label $l_i=+1$ indicates that step $i$ is predicted to be correct, and $l_i=-1$ that it is predicted to be incorrect. The refinement $Att'=[s'_i,\dots,s'_{n'}]$ then starts from the first incorrect step $i$ and yields a new answer $Pred'$. To automatically annotate the process labels for the attempts, we assume that (1) if the final answer is wrong, then there is one earliest mistake, and by refining from this mistake, we could reach a correct answer; (2) if the final answer is correct, then all the intermediate steps are correct. Thus, we enumerate the following cases:

  • $Pred\neq Ans,\ -1\notin L$: The attempt is wrong, yet the critique did not discover any error step. Thus the critique itself is problematic, and we need to sample another critique.

  • $Pred\neq Ans,\ -1\in L,\ Pred'\neq Ans$: The attempt is wrong, and the critique found an error, but still, the refinement is not correct. There could be two cases for this situation: (1) the refinement is unsuccessful; (2) the critique did not detect an earlier mistake. We simply sample another critique and corresponding refinement for this situation.

  • $Pred\neq Ans,\ -1\in L,\ Pred'=Ans$: Not only did the critique point out the error, but also the refinement reached the correct answer. We then believe the critique is valid, and collect the critique data instance $C=(Q,Att,Cri)$ and the refinement data $R=(Q,Att,Cri_{-1},Att')$, where $Cri_{-1}$ is the critique of the last step, since explaining why previous steps are correct may not be helpful for refinement.

  • $Pred=Ans,\ -1\notin L$: The attempt is correct, and the critique believes it is correct. So we can collect the positive critique data instance $C=(Q,Att,Cri)$.

  • $Pred=Ans,\ -1\in L$: The attempt reached the correct answer, yet the critique found an error. Then, the critique could be wrong, and we need to sample another critique.
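
The five cases above can be read as a small decision procedure. The following is a minimal sketch of it, not the authors' implementation: the class `CritiqueSample`, the function `classify`, and the action names are hypothetical, and answers are compared by plain string equality, which is an assumption (real math answers usually need normalization before comparison).

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class CritiqueSample:
    pred: str                           # Pred: answer of the original attempt
    gold: str                           # Ans: gold answer
    labels: List[int]                   # L: one +1 / -1 label per step
    refined_pred: Optional[str] = None  # Pred': answer after refinement, if any


def classify(sample: CritiqueSample) -> str:
    """Map one (attempt, critique, refinement) triple to a data-collection action."""
    found_error = -1 in sample.labels
    if sample.pred == sample.gold:
        # Correct attempt: keep the critique only if it flags no step.
        return "resample_critique" if found_error else "keep_critique"
    if not found_error:
        # Wrong attempt but nothing flagged: the critique missed the mistake.
        return "resample_critique"
    if sample.refined_pred == sample.gold:
        # Error flagged and the refinement reaches the gold answer.
        return "keep_critique_and_refinement"
    # Error flagged, but the refinement is still wrong.
    return "resample_critique_and_refinement"
```

Only the two "keep" outcomes contribute training instances; every other outcome triggers another sampling round.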

3.2 Auto Train: Two-Stage Training

To enable the model to acquire self-critiquing and refining capabilities, we first need to provide it with basic critiquing abilities, followed by self-critique for further enhancement. The overall training procedure is divided into two stages.

Stage 1

In the first step, we collect high-quality critique data to provide the model’s basic critiquing ability. Specifically, we first sample both positive and negative solutions from a representative instruction-following model $\mathcal{M}_G$ on the dataset $D$. Then, we utilize SOTA LLMs like GPT4-Turbo to serve as the critic model $\mathcal{M}_C$. For each generated attempt $Att$, the critic model will retry at most $k$ times to produce a valid critique until it satisfies one of the distant supervision constraints. This forms the critic-refine dataset $D_1=\{(Q,Att,Cri)\}\cup\{(Q,Att,Cri_{-1},Att')\}$ for fine-tuning the initial model $\mathcal{M}_0$ into the critic model $\mathcal{M}_1$. Note that in this process, we actually distill Pass1@N of the teacher model $\mathcal{M}_C$ into Top1@N of the student model. So, the theoretical upper bound of the student model is not necessarily limited by the teacher model’s performance.

Stage 2

In the second step, we leverage the model’s self-critique to enhance its critiquing and refining capabilities further. Namely, we let the learned critic model $\mathcal{M}_1$ criticize and refine its own output. We first sample $M$ correct-answer solutions and $M$ incorrect-answer solutions for each question $Q$ in the original dataset $D$. Then, for each attempt $Att$, we employ $\mathcal{M}_1$ to repeatedly criticize and refine at most $k$ times. In case the model fails to produce a successful critique even after $k$ tries, we fall back on the critique from a stronger yet frozen model $\mathcal{M}_C$ as the final choice. Finally, we collect dataset $D_2=\{(Q,Att,Cri)\}\cup\{(Q,Att,Cri_{-1},Att')\}$ and use $D_1\cup D_2$ to train the initial model $\mathcal{M}_0$ into the final critic model $\mathcal{M}_2$, which is similar to Wang et al. (2024). This procedure helps the model to learn to criticize and refine its own reasoning outputs better.
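
The retry-then-fallback collection described for Stage 2 might look roughly like the sketch below. It is an illustration under stated assumptions: `collect_critique`, `is_valid`, and the `Critic` callable type are hypothetical names, the fallback model is called once and its critique is also validated (the text does not specify either point), and answers are compared as plain strings.

```python
from typing import Callable, List, Optional, Tuple

# A critique is (step labels L, refined answer Pred' or None if no step was flagged).
Critique = Tuple[List[int], Optional[str]]
Critic = Callable[[str, str], Critique]  # (question, attempt) -> critique


def is_valid(pred: str, gold: str, critique: Critique) -> bool:
    """Distant-supervision check: does the critique agree with the gold answer?"""
    labels, refined_pred = critique
    if pred == gold:
        return -1 not in labels  # a correct attempt must pass unflagged
    return -1 in labels and refined_pred == gold  # a wrong one must be fixed


def collect_critique(
    question: str, attempt: str, pred: str, gold: str,
    critic: Critic, fallback: Critic, k: int = 16,
) -> Tuple[Optional[Critique], str]:
    """Try the learned critic up to k times, then the frozen stronger model."""
    for _ in range(k):
        critique = critic(question, attempt)
        if is_valid(pred, gold, critique):
            return critique, "self"
    critique = fallback(question, attempt)  # assumption: a single fallback call
    if is_valid(pred, gold, critique):
        return critique, "fallback"
    return None, "failed"
```

Stage 1 corresponds to the same loop with the teacher model passed as `critic`.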

3.3 Inference: Self-Check

To leverage our learned abilities of critique and refinement for more precise reasoning, we employ two different inference strategies: “iterative refine” and “critic as filter”.

Iterative Refine

One single-turn refinement, which consists of multiple steps, may still contain errors. Therefore, we iteratively inspect the refined solution and re-refine whenever the critique finds a mistake, outputting the final solution only once it convinces the critic or the maximum number of retries is reached. To avoid degeneration after too many refinements, we set the maximum refine depth $d=8$, and restart from the initial solution after $d$ unsuccessful refinements, at most $n=8$ times. Figure 3 presents a single successful round of critique and refinement.
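
As a rough sketch of this loop, assuming hypothetical callables `critique` (returns the step labels for a solution) and `refine` (returns a new solution given the labels): the function name `iterative_refine` is illustrative, and returning the last refined solution once the whole budget is exhausted is an assumption, since the text does not say what is output in that case.

```python
from typing import Callable, List, TypeVar

S = TypeVar("S")  # whatever type represents a solution


def iterative_refine(
    initial: S,
    critique: Callable[[S], List[int]],
    refine: Callable[[S, List[int]], S],
    max_depth: int = 8,     # d in the text
    max_restarts: int = 8,  # n in the text
) -> S:
    solution = initial
    for _ in range(max_restarts):
        solution = initial  # restart from the initial solution
        for _ in range(max_depth):
            labels = critique(solution)
            if -1 not in labels:
                return solution  # the critic finds no error
            solution = refine(solution, labels)
        if -1 not in critique(solution):
            return solution  # the d-th refinement passed the critic
    return solution  # budget exhausted (assumption: keep the last refinement)
```

With stochastic sampling, each restart can follow a different refinement path, which is why restarting from the initial solution is not redundant.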

Critic As Filter

Self-consistency is an effective way to reduce variance and improve accuracy. With the ability to critique, we can filter out predicted-to-be-wrong answers to further boost the performance. Specifically, for the $m$ attempts $S=\{(Att,Pred)\}$, we first let our model $\mathcal{M}$ check each attempt and obtain the step-wise labels, which gives $S_c=\{(Att,Pred,L)\}$. Then the attempts in which an error is detected at some step are filtered out, leaving $S'_c=\{(Att,Pred,L)\mid -1\notin L\}$. Finally, we perform the majority vote over $S'_c$ to get the answer.
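
A minimal sketch of the filter-then-vote step follows. The function names are hypothetical, each attempt is reduced to a `(Pred, L)` pair, and falling back to an unfiltered vote when the critic rejects every attempt is an assumption, since the text does not cover that case.

```python
from collections import Counter
from typing import List, Sequence, Tuple

Attempt = Tuple[str, List[int]]  # (Pred, step labels L)


def majority_vote(preds: Sequence[str]) -> str:
    """Return the most frequent answer among the predictions."""
    return Counter(preds).most_common(1)[0][0]


def critic_filtered_vote(attempts: Sequence[Attempt]) -> str:
    """Drop attempts with any step labeled -1, then vote on the rest."""
    kept = [pred for pred, labels in attempts if -1 not in labels]
    if not kept:
        # Assumption: if every attempt is rejected, vote over all of them.
        kept = [pred for pred, _ in attempts]
    return majority_vote(kept)
```

For example, if three of five attempts answer "8" but all three are flagged by the critic, the two unflagged attempts answering "7" decide the vote.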

4 Experiment

We apply the Critic-CoT training process on GSM8K and MATH (Section 4.1) and observe a noticeable performance improvement in our trained model (Section 4.3); out-of-domain evaluations on AGIEval and StrategyQA further exhibit the generalization of our trained critic ability (Section 4.4). We also conduct a series of ablation studies to demonstrate the effectiveness of our proposed Critic-CoT method (Section 4.5). For more analysis on the critique and refinement during test time, see Appendix A.1, and the prompt is presented in Appendix A.6.

4.1 Setup

4.1.1 Model

We fine-tune the critic-refine model on Llama-3-70B-Instruct (Dubey et al., 2024), which was pre-trained on more than 15 trillion tokens and has a context length of 8,192. For critique / refinement sampling, we use GPT4-Turbo (OpenAI, 2023) of the version gpt-4-0125-preview. We use the Huggingface Transformers (Wolf et al., 2020), DeepSpeed (Rajbhandari et al., 2021) and FastChat (Zheng et al., 2023) libraries for training. We use the vLLM library (Kwon et al., 2023) for model inference, adopting top-p sampling with $p=0.95$, with temperature 0.7 for solution sampling, which follows Cobbe et al. (2021a), and 0.5 for critique and refinement. All inferences are zero-shot.

4.1.2 Dataset

Train & In-Domain Eval

Separately, we train our model on the problems of GSM8K (Cobbe et al., 2021a) and MATH (Hendrycks et al., 2021). GSM8K is a grade-school-level math word problem dataset, with 7,473 training instances and 1,319 test instances. MATH is a challenging high school math competition dataset, which consists of 7,500 training problems and 5,000 test problems. For the MATH dataset, we also follow the data split of Lightman et al. (2024), which moves 4,500 test problems into the training set and therefore contains 12,000 training instances and 500 representative test instances.

4.2 Metric

Solution

For the evaluation of the solution, we compute the metrics of Top-1 Accuracy Acc and Refine Accuracy Refine-Acc, in which the original Top-1 predicted answer is replaced with a refined one if the critic model found an error and made iterative refinement (Section 3.3). We also compute Majority Vote Accuracy Maj1@N (Wang et al., 2023b) and Majority Vote Accuracy After Critique Critic + Maj1@N (Section 3.3), which select the most frequent answer among $N$ samples, i.e. $\arg\max_{a}\sum_{i=1}^{N}\mathds{1}\left(\mathbf{a}_i=a\right)$. Following Liu et al. (2023) and Havrilla et al. (2024), we compute Pass@N, which selects the gold answer $g$ among the $N$ predictions if present, i.e. $\arg\max_{a}\mathds{1}\left(g=a\right)$.
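
To make the difference between the two sampling metrics concrete, here is a small sketch that scores Maj1@N and Pass@N over a set of problems. The helper names are hypothetical and answers are assumed to be already-normalized strings.

```python
from collections import Counter
from typing import List, Sequence, Tuple

Problem = Tuple[List[str], str]  # (N sampled answers a_1..a_N, gold answer g)


def maj1_at_n_correct(preds: Sequence[str], gold: str) -> bool:
    """Maj1@N: the most frequent sampled answer equals the gold answer."""
    return Counter(preds).most_common(1)[0][0] == gold


def pass_at_n_correct(preds: Sequence[str], gold: str) -> bool:
    """Pass@N: the gold answer appears among the N samples."""
    return gold in preds


def dataset_accuracy(problems: Sequence[Problem], metric) -> float:
    """Fraction of problems solved under the given per-problem metric."""
    return sum(metric(preds, gold) for preds, gold in problems) / len(problems)
```

Pass@N is an upper bound on Maj1@N for the same samples, since the majority answer can only be correct if the gold answer was sampled at all.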

Critique

For the “evaluation of evaluation”, we compute Precision, Recall and F1 for error detection. Also, we compute Critic Accuracy, where the critique should find the error in wrong answer solutions and pass the correct answer solution:

$$P=\frac{|\{i : Pred_i\neq Ans_i\land -1\in L_i\}|}{|\{i : -1\in L_i\}|},\quad R=\frac{|\{i : Pred_i\neq Ans_i\land -1\in L_i\}|}{|\{i : Pred_i\neq Ans_i\}|},\quad F1=\frac{2\cdot P\cdot R}{P+R}$$

$$CriticAcc=\frac{\sum_{i=1}^{N}\mathds{1}\left[(Pred_i=Ans_i\land -1\notin L_i)\lor(Pred_i\neq Ans_i\land -1\in L_i)\right]}{N}$$
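
The critique metrics can be computed directly from per-solution records. The sketch below is illustrative only: `critic_metrics` and the `Record` layout are hypothetical, a wrong-answer solution counts as the positive class for error detection, the record list is assumed to be non-empty, and returning 0.0 when a denominator is zero is an assumption.

```python
from typing import Dict, List, Sequence, Tuple

Record = Tuple[str, str, List[int]]  # (Pred_i, Ans_i, L_i)


def critic_metrics(records: Sequence[Record]) -> Dict[str, float]:
    wrong = [pred != gold for pred, gold, _ in records]    # Pred_i != Ans_i
    flagged = [-1 in labels for _, _, labels in records]   # -1 in L_i
    true_pos = sum(w and f for w, f in zip(wrong, flagged))
    precision = true_pos / sum(flagged) if any(flagged) else 0.0
    recall = true_pos / sum(wrong) if any(wrong) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    # A critique is right when it flags a wrong solution or passes a correct one.
    critic_acc = sum(w == f for w, f in zip(wrong, flagged)) / len(records)
    return {"precision": precision, "recall": recall, "f1": f1, "critic_acc": critic_acc}
```
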
Out-of-Domain Eval

To further evaluate our critic model’s generalization capabilities beyond mathematical tasks, we assess its performance on reasoning tasks using the StrategyQA and AGIEval datasets, which cover different domains. StrategyQA (Geva et al., 2021) is a multi-step reasoning task constructed from Wikipedia, with binary answers indicating either true or false. AGIEval (Zhong et al., 2023) comprises standardized exam questions from various fields, including college entrance exams, law school admission tests, math competitions, and lawyer qualification tests. Given the overlap with the MATH dataset, we evaluated our model using the original 7,500/5,000 training and validation split from MATH, rather than the extended 12,000/500 split.

4.2.1 Critic Data Construction

GSM8K

On GSM8K, since GPT-4 already achieves 92.0% accuracy on the test set (OpenAI, 2023), which makes it hard to obtain negative data, we instead use GPT-3.5-Turbo-0125 to sample 10 solutions for each question in the training set. Then, we use GPT-4-Turbo as the critic-refine model to criticize the solutions (Table 7), with $K=16$ retries. We obtain 63,485 cases, with 49,832 positive examples and 13,653 negative examples.

In the second stage of GSM8K critique construction, we use the learned critic model to repeatedly sample until we obtain at most 5 positive and 5 negative solutions. For strong LLMs like LLaMA-3, it’s challenging to get enough negative solutions even among 512 samples, so the size of the negative data is slightly smaller. Then, we use the learned critic model to criticize itself, also with $K=16$ retries. In stage two, we obtain 62,877 instances, with 36,876 positive and 26,001 negative. Across the two stages, we obtain 126,362 instances, with 86,708 positive and 39,654 negative.

MATH

On MATH, in the first stage, we directly use the 90,074 GPT-4 generated solutions of the PRM800K dataset (Lightman et al., 2024), with 11,665 positive instances in which all the step labels are correct, and 78,409 negative instances in which one step label is incorrect. Since the MATH dataset is challenging, in order to reduce retries of GPT-4-Turbo and avoid failing to obtain a valid critique, for the critique of a negative solution we additionally append the reference solution to the input prompt and hint that the solution might contain mistakes, as suggested in prior work (Zelikman et al., 2022); for a positive solution, we simply hint that it is correct. After obtaining the initial critique, we use GPT-4-Turbo again to remove hint phrases like “According to the reference” or “Given the hint”, since we do not have any hint or reference at test time. In stage one, we obtain 1,606 positive cases and 69,775 negative cases.

Similarly, in the second stage of MATH, we use the learned critic model to sample at most 5 positive and 5 negative solutions. Then, we first use the critic model itself to critique its solutions without any hints, under $K=16$ retries, and if that fails, use GPT-4-Turbo to retry another $K=16$ times with hints. We construct 51,618 positive cases and 65,456 negative cases. Across the two stages, we obtain 188,455 cases, with 53,224 positive and 135,231 negative.
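
As a quick arithmetic cross-check, the MATH counts reported for the two stages add up to the stated totals. The snippet below uses only numbers quoted in the text; the variable names are illustrative.

```python
# PRM800K solutions used as the stage-one source pool.
prm800k_positive, prm800k_negative = 11_665, 78_409
prm800k_total = prm800k_positive + prm800k_negative  # 90,074 solutions

# Critique instances collected on MATH in each stage.
stage1 = {"positive": 1_606, "negative": 69_775}
stage2 = {"positive": 51_618, "negative": 65_456}

# Sum per class, then overall.
total = {key: stage1[key] + stage2[key] for key in stage1}
grand_total = sum(total.values())

print(prm800k_total, total, grand_total)
```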

4.3 Main Results

Table 1 (GSM8K)

| Model | Sampling Method | Acc. |
|---|---|---|
| Llama-3-70B-Instruct (Dubey et al., 2024) | - | 89.6 |
| | Maj1@96 | 94.1 |
| Llama-3.1-70B-Instruct (Dubey et al., 2024) | - | 94.5 |
| GPT4-0314 (OpenAI, 2023) | - | 92.0 |
| DeepSeek-V2 Chat-236B (DeepSeek-AI et al., 2024) | - | 92.2 |
| Qwen2-72B (Yang et al., 2024) | - | 93.2 |
| Mistral-7B: MetaMATH (Gao et al., 2024) | PRM+Maj1@256 | 87.8 |
| InternLM-MATH-20B (Ying et al., 2024) | PRM Best-of-100 | 89.3 |
| DART-Math-Llama3-70B (Tong et al., 2024) | - | 89.6 |
| DeepSeek-67B: MetaMATH (Wang et al., 2023a) | PRM+Maj1@256 | 92.5 |
| Critic-CoT, Llama-3-70B-Instruct (Ours) | - | 91.7 |
| | Iterative Refine | 93.3 (↑1.6) |
| | Maj1@96 | 94.8 |
| | Critic + Maj1@96 | 95.4 (↑0.6) |

GSM8K

The results on the GSM8K dataset highlight the effectiveness of the Critic-CoT approach in enhancing solution accuracy. After training, our model’s top-1 accuracy increases from 89.6% to 91.7%, and the iterative refine strategy further raises the accuracy to 93.3%. Additionally, the Maj1@96 method combined with the critic’s filter achieves the highest accuracy of 95.4%, an improvement of 0.6 percentage points over the non-critic-assisted Maj1@96 approach. These results suggest that, even on a relatively easy task where baseline solving accuracy is already high, the Critic-CoT method can still boost performance via critic-refine training and by filtering out invalid solutions or making corrections at test time. For a concrete example of a single-turn refinement, please refer to Appendix A.5.

Table 2 (MATH, original 7,500/5,000 split)

| Model | Sampling Method | Acc. |
|---|---|---|
| Llama-3-70B-Instruct (Dubey et al., 2024) | - | 51.0 |
| | Maj1@96 | 63.5 |
| | Maj1@512 | 64.3 |
| Llama-3.1-70B-Instruct (Dubey et al., 2024) | - | 68.0 |
| DeepSeek-V2 Chat-236B (DeepSeek-AI et al., 2024) | - | 53.9 |
| Qwen2-72B (Yang et al., 2024) | - | 69.0 |
| GPT4-0314 (OpenAI, 2023) | - | 42.5 |
| GPT4-Turbo | - | 72.6 |
| Critic-CoT, Llama-3-70B-Instruct (Ours) | - | 56.2 |
| | Iterative Refine | 56.6 (↑0.4) |
| | Maj1@96 | 64.2 |
| | Critic + Maj1@96 | 65.0 (↑0.8) |
| | Maj1@512 | 64.4 |
| | Critic + Maj1@512 | 66.4 (↑2.0) |

Table 3 (MATH500)

| Model | Sampling Method | Acc. |
|---|---|---|
| Llama-3-70B-Instruct | - | 50.4 |
| | Maj1@96 | 62.2 |
| | Maj1@512 | 63.4 |
| Mistral-7B: MetaMATH (Gao et al., 2024) | PRM+Maj1@256 | 38.6 |
| DeepSeek-67B: MetaMATH (Wang et al., 2023a) | PRM+Maj1@256 | 48.1 |
| InternLM-MATH-20B (Ying et al., 2024) | PRM Best-of-100 | 50.0 |
| DART-Math-Llama3-70B (Tong et al., 2024) | - | 56.1 |
| GPT-4-MathMix (Lightman et al., 2024) | PRM Best-of-100 | 74.5 |
| | PRM Best-of-1860 | 78.2 |
| Critic-CoT, Llama-3-70B-Instruct (Ours) | - | 57.6 |
| | Iterative Refine | 57.8 (↑0.2) |
| | Maj1@96 | 64.6 |
| | Critic + Maj1@96 | 66.6 (↑2.0) |
| | Maj1@512 | 65.4 |
| | Critic + Maj1@512 | 68.4 (↑3.0) |

MATH

As presented in Table 3, on the MATH500 test set, the baseline Llama-3-70B-Instruct reaches 50.4% accuracy and Tong et al. (2024) reach 56.1% with difficulty-aware rejection tuning, while our Critic-CoT approach improves the model’s top-1 accuracy to 57.6%, with a slight further increase to 57.8% through Iterative Refine. Figure 3 presents a concrete example of step-wise CoT critique, which detects an error in problem understanding in Step 3, followed by a successful refinement that fixes the error in Step 3 and reaches the correct answer. Compared with GSM8K, gaining from refinement is much harder on MATH. However, critic filtering, which may be a slightly easier task than refinement, still provides a notable improvement: the accuracy rises from 64.6% with Maj1@96 to 66.6% when critic filtering is applied, a 2.0% improvement. Furthermore, for Maj1@512, the accuracy rises from 65.4% to 68.4% after critic filtering, an increase of 3.0%. While the closed-source model GPT-4-MathMix achieves the highest accuracy of 78.2% with extensive Best-of-1860 sampling, the Critic-CoT approach on the open-source model can still significantly enhance the accuracy of the base model, particularly through effective error detection. The trend remains consistent under the original 7,500/5,000 split setting (Table 2). Overall, the results demonstrate the effectiveness of our method in training reasonable critic-refine capabilities on the challenging MATH dataset. More detailed analysis on GSM8K and MATH500 is in Appendix A.1.

4.4 Out-of-Domain Results

Model | Acc.
Llama-3-70B-Instruct | 56.6
Llama-3.1-70B-Instruct | 61.8
DeepSeek-V2 Chat-236B | 61.4
GPT4o | 65.2
Critic-CoT, GSM8K | 54.7
- Iterative Refine | 55.6 (↑ 0.8)
- Maj1@96 | 60.7
- Critic + Maj1@96 | 60.3 (↓ 0.4)
Critic-CoT, MATH | 59.8
- Iterative Refine | 63.7 (↑ 3.9)
- Maj1@96 | 61.0
- Critic + Maj1@96 | 61.2 (↑ 0.2)

Model | Acc.
Llama-3-70B-Instruct | 76.2
Llama-3.1-70B-Instruct | 84.3
DeepSeek-V2 Chat-236B | 75.6
GPT4-0314 | 83.6
Critic-CoT, GSM8K | 77.5
- Iterative Refine | 78.8 (↑ 1.3)
- Maj1@96 | 78.7
- Critic + Maj1@96 | 80.5 (↑ 1.8)
Critic-CoT, MATH | 78.0
- Iterative Refine | 80.1 (↑ 2.1)
- Maj1@96 | 78.3
- Critic + Maj1@96 | 79.7 (↑ 1.4)

On the StrategyQA dataset, the critic models trained on either dataset show performance gains from both iterative refinement and majority vote with the critic filter. On the more challenging AGIEval dataset, the critic model trained on GSM8K improves with iterative refinement but slightly hurts performance when filtering samples, indicating the limitations of a grade-school-level critic model in handling more complex, multi-domain tasks. Conversely, the Critic-CoT model trained on MATH shows significant improvements from iterative refinement, and majority vote after critic filtering does not hurt performance.

Overall, the results illustrate that our critic models generalize to other domains, and achieve performance improvements. This underscores the potential of our proposed critic-refine method in improving reasoning accuracy in diverse and challenging tasks beyond the training domain of math.

4.5 Ablation Study

The results of the ablation study are shown in Tables 5(a) and 5(b), demonstrating the effectiveness of our Critic-CoT design. At the level of critique output, to assess the necessity of our proposed step-wise CoT critic, we first remove the CoT mechanism and train the critic model only to directly predict whether each step is correct (Process Label), for example, “Step 1 is correct. Step 2 is incorrect.”. Then, we further remove the step-wise label and let the critic model predict whether the entire solution is correct, without printing anything else (Outcome Label), for example, “Some step from Step 1 to Step 4 is incorrect.” or “Each step from Step 1 to Step 4 is correct.”. We find that removing the Chain-of-Thought intermediate output, and further the step-wise labels, falls back toward System-1 reasoning and negatively impacts the recall metric. Consequently, the critic model misses more errors, resulting in significantly lower critic accuracy, despite its tendency to pass correct solutions more easily.
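For readers who want to reproduce critic metrics of the kind reported in the ablation, the following is a minimal sketch. It assumes that the positive class is "the solution contains an error", which is consistent with the discussion of recall above but is not spelled out in this section; the function name and argument layout are hypothetical.

```python
from typing import Sequence, Tuple


def critic_metrics(
    solution_is_wrong: Sequence[bool],
    critic_flags_wrong: Sequence[bool],
) -> Tuple[float, float, float, float]:
    """Return (precision, recall, F1, accuracy) of the critic's verdicts.

    Assumption of this sketch: the positive class is "solution is wrong".
    """
    pairs = list(zip(solution_is_wrong, critic_flags_wrong))
    tp = sum(1 for wrong, flag in pairs if wrong and flag)
    fp = sum(1 for wrong, flag in pairs if not wrong and flag)
    fn = sum(1 for wrong, flag in pairs if wrong and not flag)
    tn = sum(1 for wrong, flag in pairs if not wrong and not flag)

    # Guard against empty denominators so the sketch never divides by zero.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return precision, recall, f1, accuracy
```

Under this convention, a critic that rarely flags anything can still score high precision while its recall drops, since most erroneous solutions go undetected.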

At the training data level, to evaluate the effect of different data types, we either remove the second-stage data and use only the critiques and refinements produced by GPT, or remove the first-stage data and use only the critiques and refinements of self-sampled solutions. In addition, we conduct a vertical ablation by removing either the critic data or the refinement data across both stages. Regarding the roles of critique and refinement, the results suggest that refinement contributes more to policy improvement, which echoes the finding of An et al. (2024). Yet only by combining critique and refinement during training can we enhance the policy while also leveraging the critic’s ability for further performance gains. Finally, training only on the critiques of GPT models proves better at identifying faults, but at the cost of precision. In contrast, training only on the model’s critiques of its own solutions is less effective than simply using data from both stages.

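Model | Critic P | Critic R | Critic F1 | Critic Acc. | Refine Init. Acc. | Refine Ref. Acc. | Pass1@N | Maj1@N | Maj1@N + Critic
Outcome Label | 95.5 | 28.9 | 44.4 | 88.0 | 87.7 | 89.7 | 99.0 | 93.6 | 93.7
Process Label | 67.9 | 22.8 | 34.1 | 89.5 | 88.0 | 89.2 | 99.0 | 93.0 | 93.0
Only Refine | 30.0 | 11.4 | 16.6 | 90.8 | 92.0 | 88.2 | 98.9 | 95.2 | 95.2
Only Critic | 57.1 | 31.0 | 40.2 | 91.9 | 91.2 | 91.4 | 98.9 | 94.4 | 94.5
Stage 1 | 42.5 | 41.5 | 42.0 | 89.3 | 90.7 | 91.1 | 98.9 | 93.6 | 94.2
Stage 2 | 50.0 | 25.0 | 33.3 | 85.5 | 90.5 | 91.3 | 99.0 | 94.4 | 94.4
Critic-CoT | 53.3 | 58.2 | 55.7 | 92.3 | 91.7 | 93.3 | 99.1 | 94.8 | 95.4
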
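Model | Critic P | Critic R | Critic F1 | Critic Acc. | Refine Init. Acc. | Refine Ref. Acc. | Pass1@N | Maj1@N | Maj1@N + Critic
Outcome Label | 84.4 | 39.0 | 53.3 | 63.0 | 51.8 | 53.6 | 84.0 | 56.2 | 56.2
Process Label | 80.2 | 35.9 | 49.6 | 63.8 | 50.4 | 52.6 | 78.6 | 49.4 | 50.8
Only Refine | 62.3 | 60.1 | 61.2 | 66.0 | 55.4 | 49.8 | 90.4 | 63.0 | 62.8
Only Critic | 67.9 | 75.4 | 71.5 | 71.6 | 52.8 | 55.8 | 89.0 | 60.6 | 60.6
Stage 1 | 64.6 | 93.7 | 76.5 | 69.0 | 53.2 | 41.2 | 90.4 | 63.4 | 63.0
Stage 2 | 79.7 | 45.8 | 58.2 | 71.8 | 57.2 | 57.4 | 90.4 | 64.6 | 65.0
Critic-CoT | 66.1 | 73.7 | 69.7 | 72.2 | 57.6 | 57.8 | 89.2 | 64.6 | 66.6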

5 Conclusion

In this paper, we introduced the Critic-CoT paradigm to enhance the reasoning abilities of Large Language Models through a more System-2-like, step-by-step Chain-of-Thought critique. Our approach leverages distant supervision to construct training data for critiques and refinements, thereby reducing the reliance on extensive human annotation. We demonstrated the effectiveness of our method through substantial improvements on the GSM8K and MATH datasets. Additionally, our results show that training on the capabilities of critique and refinement alone improves task-solving performance, which indicates a mutual-reinforcement mechanism within LLMs. We hope our work may inspire further investigations into the advancement of the self-critic framework and the transition toward System-2 reasoning.

References

  • An etal. (2024)Shengnan An, Zexiong Ma, Zeqi Lin, Nanning Zheng, Jian-Guang Lou, and Weizhu Chen.Learning from mistakes makes llm better reasoner, 2024.URL https://arxiv.org/abs/2310.20689.
  • Bubeck etal. (2023)Sébastien Bubeck, Varun Chandrasekaran, Ronen Eldan, Johannes Gehrke, Eric Horvitz, Ece Kamar, Peter Lee, YinTat Lee, Yuanzhi Li, Scott Lundberg, Harsha Nori, Hamid Palangi, MarcoTulio Ribeiro, and YiZhang.Sparks of artificial general intelligence: Early experiments with gpt-4, 2023.URL https://arxiv.org/abs/2303.12712.
  • Burns etal. (2023)Collin Burns, Pavel Izmailov, JanHendrik Kirchner, Bowen Baker, Leo Gao, Leopold Aschenbrenner, Yining Chen, Adrien Ecoffet, Manas Joglekar, Jan Leike, Ilya Sutskever, and Jeff Wu.Weak-to-strong generalization: Eliciting strong capabilities with weak supervision, 2023.URL https://arxiv.org/abs/2312.09390.
  • Cao etal. (2024)Boxi Cao, Keming Lu, Xinyu Lu, Jiawei Chen, Mengjie Ren, Hao Xiang, Peilin Liu, Yaojie Lu, Ben He, Xianpei Han, LeSun, Hongyu Lin, and Bowen Yu.Towards scalable automated alignment of llms: A survey, 2024.URL https://arxiv.org/abs/2406.01252.
  • Chen etal. (2024a)Angelica Chen, Jérémy Scheurer, JonAnder Campos, Tomasz Korbak, JunShern Chan, SamuelR. Bowman, Kyunghyun Cho, and Ethan Perez.Learning from natural language feedback.Transactions on Machine Learning Research, 2024a.ISSN 2835-8856.URL https://openreview.net/forum?id=xo3hI5MwvU.
  • Chen etal. (2024b)Guoxin Chen, Minpeng Liao, Chengxi Li, and Kai Fan.Alphamath almost zero: process supervision without process.CoRR, abs/2405.03553, 2024b.doi: 10.48550/ARXIV.2405.03553.URL https://doi.org/10.48550/arXiv.2405.03553.
  • Chollet (2019)François Chollet.On the measure of intelligence, 2019.URL https://arxiv.org/abs/1911.01547.
  • Cobbe etal. (2021a)Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman.Training verifiers to solve math word problems.CoRR, abs/2110.14168, 2021a.URL https://arxiv.org/abs/2110.14168.
  • Cobbe etal. (2021b)Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman.Training verifiers to solve math word problems, 2021b.URL https://arxiv.org/abs/2110.14168.
  • DeepSeek-AI etal. (2024)DeepSeek-AI, Aixin Liu, Bei Feng, Bin Wang, Bingxuan Wang, BoLiu, and etal.Deepseek-v2: A strong, economical, and efficient mixture-of-experts language model.CoRR, abs/2405.04434, 2024.doi: 10.48550/ARXIV.2405.04434.URL https://doi.org/10.48550/arXiv.2405.04434.
  • Dubey etal. (2024)Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, Anirudh Goyal, Anthony Hartshorn, Aobo Yang, Archi Mitra, and etal.The llama 3 herd of models, 2024.URL https://arxiv.org/abs/2407.21783.
  • Feng etal. (2023)Guhao Feng, Bohang Zhang, Yuntian Gu, Haotian Ye, DiHe, and Liwei Wang.Towards revealing the mystery behind chain of thought: A theoretical perspective.In Thirty-seventh Conference on Neural Information Processing Systems, 2023.URL https://openreview.net/forum?id=qHrADgAdYu.
  • Gao etal. (2024)Bofei Gao, Zefan Cai, Runxin Xu, Peiyi Wang, CeZheng, Runji Lin, Keming Lu, Junyang Lin, Chang Zhou, Wen Xiao, Junjie Hu, Tianyu Liu, and Baobao Chang.LLM critics help catch bugs in mathematics: Towards a better mathematical verifier with natural language feedback.CoRR, abs/2406.14024, 2024.doi: 10.48550/ARXIV.2406.14024.URL https://doi.org/10.48550/arXiv.2406.14024.
  • Geva et al. (2021) Mor Geva, Daniel Khashabi, Elad Segal, Tushar Khot, Dan Roth, and Jonathan Berant. Did aristotle use a laptop? a question answering benchmark with implicit reasoning strategies. Transactions of the Association for Computational Linguistics, 9:346–361, 2021. doi: 10.1162/tacl_a_00370. URL https://aclanthology.org/2021.tacl-1.21.
  • Havrilla etal. (2024)Alexander Havrilla, Yuqing Du, SharathChandra Raparthy, Christoforos Nalmpantis, Jane Dwivedi-Yu, Eric Hambro, Sainbayar Sukhbaatar, and Roberta Raileanu.Teaching large language models to reason with reinforcement learning.In AI for Math Workshop @ ICML 2024, 2024.URL https://openreview.net/forum?id=mjqoceuMnI.
  • Hegel et al. (1991) G.W.F. Hegel, T.F. Geraets, W.A. Suchting, and H.S. Harris. The Encyclopaedia Logic, with the Zusätze: Part I of the Encyclopaedia of Philosophical Sciences with the Zusätze. Hackett Classics Series. Hackett, 1991. ISBN 9780872200708. URL https://books.google.ca/books?id=4BNUFZ_hQ1wC.
  • Hendrycks etal. (2021)Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt.Measuring mathematical problem solving with the MATH dataset.In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2), 2021.URL https://openreview.net/forum?id=7Bywt2mQsCe.
  • Huang etal. (2024)Jie Huang, Xinyun Chen, Swaroop Mishra, HuaixiuSteven Zheng, AdamsWei Yu, Xinying Song, and Denny Zhou.Large language models cannot self-correct reasoning yet.In The Twelfth International Conference on Learning Representations, 2024.URL https://openreview.net/forum?id=IkmD3fKBPQ.
  • Kahneman (2011)Daniel Kahneman.Thinking Fast and Slow.Farrar, Straus and Giroux, 2011.
  • Kierkegaard (1989)Søren Kierkegaard.Kierkegaard’s Writings, II, Volume 2: The Concept of Irony, with Continual Reference to Socrates/Notes of Schelling’s Berlin Lectures.Princeton University Press, 1989.ISBN 9780691073545.URL http://www.jstor.org/stable/j.ctt24hr3n.
  • Kim etal. (2023)Geunwoo Kim, Pierre Baldi, and StephenMarcus McAleer.Language models can solve computer tasks.In Thirty-seventh Conference on Neural Information Processing Systems, 2023.URL https://openreview.net/forum?id=M6OmjAZ4CX.
  • Kwon etal. (2023)Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, CodyHao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica.Efficient memory management for large language model serving with pagedattention.In Proceedings of the 29th Symposium on Operating Systems Principles, SOSP ’23, pp. 611–626, New York, NY, USA, 2023. Association for Computing Machinery.ISBN 9798400702297.doi: 10.1145/3600006.3613165.URL https://doi.org/10.1145/3600006.3613165.
  • Leike etal. (2018)Jan Leike, David Krueger, Tom Everitt, Miljan Martic, Vishal Maini, and Shane Legg.Scalable agent alignment via reward modeling: a research direction, 2018.URL https://arxiv.org/abs/1811.07871.
  • Li etal. (2024)Zhiyuan Li, Hong Liu, Denny Zhou, and Tengyu Ma.Chain of thought empowers transformers to solve inherently serial problems.In The Twelfth International Conference on Learning Representations, 2024.URL https://openreview.net/forum?id=3EWTEy9MTM.
  • Lightman etal. (2024)Hunter Lightman, Vineet Kosaraju, Yuri Burda, Harrison Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe.Let’s verify step by step.In The Twelfth International Conference on Learning Representations, 2024.URL https://openreview.net/forum?id=v8L0pN6EOi.
  • Liu etal. (2023)Yixin Liu, Avi Singh, C.Daniel Freeman, JohnD. Co-Reyes, and PeterJ. Liu.Improving large language model fine-tuning for solving math problems.CoRR, abs/2310.10047, 2023.doi: 10.48550/ARXIV.2310.10047.URL https://doi.org/10.48550/arXiv.2310.10047.
  • Luo etal. (2023)Liangchen Luo, ZiLin, Yinxiao Liu, Lei Shu, Yun Zhu, Jingbo Shang, and Lei Meng.Critique ability of large language models, 2023.URL https://arxiv.org/abs/2310.04815.
  • Luo etal. (2024)Liangchen Luo, Yinxiao Liu, Rosanne Liu, Samrat Phatale, Harsh Lara, Yunxuan Li, Lei Shu, Yun Zhu, Lei Meng, Jiao Sun, and Abhinav Rastogi.Improve mathematical reasoning in language models by automated process supervision, 2024.URL https://arxiv.org/abs/2406.06592.
  • Madaan etal. (2023)Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, Shashank Gupta, BodhisattwaPrasad Majumder, Katherine Hermann, Sean Welleck, Amir Yazdanbakhsh, and Peter Clark.Self-refine: Iterative refinement with self-feedback.In Thirty-seventh Conference on Neural Information Processing Systems, 2023.URL https://openreview.net/forum?id=S37hOerQLB.
  • McAleese etal. (2024)Nat McAleese, RaiMichael Pokorny, Juan FelipeCeron Uribe, Evgenia Nitishinskaya, Maja Trebacz, and Jan Leike.Llm critics help catch llm bugs, 2024.URL https://arxiv.org/abs/2407.00215.
  • Morris etal. (2024)MeredithRingel Morris, Jascha Sohl-Dickstein, Noah Fiedel, Tris Warkentin, Allan Dafoe, Aleksandra Faust, Clement Farabet, and Shane Legg.Position: Levels of AGI for operationalizing progress on the path to AGI.In Forty-first International Conference on Machine Learning, 2024.URL https://openreview.net/forum?id=0ofzEysK2D.
  • OpenAI (2023)OpenAI.GPT-4 technical report.CoRR, abs/2303.08774, 2023.doi: 10.48550/ARXIV.2303.08774.URL https://doi.org/10.48550/arXiv.2303.08774.
  • OpenAI (2024)OpenAI.Openai o1 system card.2024.URL https://assets.ctfassets.net/kftzwdyauwt9/67qJD51Aur3eIc96iOfeOP/71551c3d223cd97e591aa89567306912/o1_system_card.pdf.
  • Ouyang etal. (2022)Long Ouyang, Jeffrey Wu, XuJiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, PaulF Christiano, Jan Leike, and Ryan Lowe.Training language models to follow instructions with human feedback.In S.Koyejo, S.Mohamed, A.Agarwal, D.Belgrave, K.Cho, and A.Oh (eds.), Advances in Neural Information Processing Systems, volume35, pp. 27730–27744. Curran Associates, Inc., 2022.URL https://proceedings.neurips.cc/paper_files/paper/2022/file/b1efde53be364a73914f58805a001731-Paper-Conference.pdf.
  • Popper (1934)KarlRaimund Popper.The Logic of Scientific Discovery.Routledge, New York, 1934.
  • Rajbhandari etal. (2021)Samyam Rajbhandari, Olatunji Ruwase, Jeff Rasley, Shaden Smith, and Yuxiong He.Zero-infinity: breaking the gpu memory wall for extreme scale deep learning.In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC ’21, New York, NY, USA, 2021. Association for Computing Machinery.ISBN 9781450384421.doi: 10.1145/3458817.3476205.URL https://doi.org/10.1145/3458817.3476205.
  • Saunders etal. (2022)William Saunders, Catherine Yeh, Jeff Wu, Steven Bills, Long Ouyang, Jonathan Ward, and Jan Leike.Self-critiquing models for assisting human evaluators, 2022.URL https://arxiv.org/abs/2206.05802.
  • Shinn etal. (2023)Noah Shinn, Federico Cassano, Ashwin Gopinath, KarthikR Narasimhan, and Shunyu Yao.Reflexion: language agents with verbal reinforcement learning.In Thirty-seventh Conference on Neural Information Processing Systems, 2023.URL https://openreview.net/forum?id=vAElhFcKW6.
  • Snell etal. (2024)Charlie Snell, Jaehoon Lee, Kelvin Xu, and Aviral Kumar.Scaling llm test-time compute optimally can be more effective than scaling model parameters, 2024.URL https://arxiv.org/abs/2408.03314.
  • Tong etal. (2024)Yuxuan Tong, Xiwen Zhang, Rui Wang, Ruidong Wu, and Junxian He.Dart-math: Difficulty-aware rejection tuning for mathematical problem-solving, 2024.URL https://arxiv.org/abs/2407.13690.
  • Wang etal. (2023a)Peiyi Wang, Lei Li, Zhihong Shao, R.X. Xu, Damai Dai, Yifei Li, Deli Chen, Y.Wu, and Zhifang Sui.Math-shepherd: Verify and reinforce llms step-by-step without human annotations.CoRR, abs/2312.08935, 2023a.doi: 10.48550/ARXIV.2312.08935.URL https://doi.org/10.48550/arXiv.2312.08935.
  • Wang etal. (2024)Tianlu Wang, Ilia Kulikov, Olga Golovneva, Ping Yu, Weizhe Yuan, Jane Dwivedi-Yu, RichardYuanzhe Pang, Maryam Fazel-Zarandi, Jason Weston, and Xian Li.Self-taught evaluators, 2024.URL https://arxiv.org/abs/2408.02666.
  • Wang etal. (2023b)Xuezhi Wang, Jason Wei, Dale Schuurmans, QuocV Le, EdH. Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou.Self-consistency improves chain of thought reasoning in language models.In The Eleventh International Conference on Learning Representations, 2023b.URL https://openreview.net/forum?id=1PL1NIMMrw.
  • West etal. (2024)Peter West, Ximing Lu, Nouha Dziri, Faeze Brahman, Linjie Li, JenaD. Hwang, Liwei Jiang, Jillian Fisher, Abhilasha Ravichander, Khyathi Chandu, Benjamin Newman, PangWei Koh, Allyson Ettinger, and Yejin Choi.The generative AI paradox: “what it can create, it may not understand”.In The Twelfth International Conference on Learning Representations, 2024.URL https://openreview.net/forum?id=CF8H8MS5P8.
  • Wolf etal. (2020)Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, TevenLe Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and AlexanderM. Rush.Transformers: State-of-the-art natural language processing.In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp. 38–45, Online, October 2020. Association for Computational Linguistics.URL https://www.aclweb.org/anthology/2020.emnlp-demos.6.
  • Yang etal. (2024)AnYang, Baosong Yang, Binyuan Hui, BoZheng, Bowen Yu, Chang Zhou, and etal.Qwen2 technical report, 2024.URL https://arxiv.org/abs/2407.10671.
  • Ying etal. (2024)Huaiyuan Ying, Shuo Zhang, Linyang Li, Zhejian Zhou, Yunfan Shao, Zhaoye Fei, Yichuan Ma, Jiawei Hong, Kuikun Liu, Ziyi Wang, Yudong Wang, Zijian Wu, Shuaibin Li, Fengzhe Zhou, Hongwei Liu, Songyang Zhang, Wenwei Zhang, Hang Yan, Xipeng Qiu, Jiayu Wang, Kai Chen, and Dahua Lin.Internlm-math: Open math large language models toward verifiable reasoning.CoRR, abs/2402.06332, 2024.doi: 10.48550/ARXIV.2402.06332.URL https://doi.org/10.48550/arXiv.2402.06332.
  • Yu etal. (2024)Ping Yu, Jing Xu, Jason Weston, and Ilia Kulikov.Distilling system 2 into system 1, 2024.URL https://arxiv.org/abs/2407.06023.
  • Yuan etal. (2024)Weizhe Yuan, RichardYuanzhe Pang, Kyunghyun Cho, Xian Li, Sainbayar Sukhbaatar, Jing Xu, and JasonE Weston.Self-rewarding language models.In Forty-first International Conference on Machine Learning, 2024.URL https://openreview.net/forum?id=0NphYCmgua.
  • Zelikman etal. (2022)Eric Zelikman, Yuhuai Wu, Jesse Mu, and Noah Goodman.STar: Bootstrapping reasoning with reasoning.In AliceH. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho (eds.), Advances in Neural Information Processing Systems, 2022.URL https://openreview.net/forum?id=_3ELRdg2sgI.
  • Zeng etal. (2023)Zhongshen Zeng, Pengguang Chen, Shu Liu, Haiyun Jiang, and Jiaya Jia.Mr-gsm8k: A meta-reasoning benchmark for large language model evaluation.CoRR, abs/2312.17080, 2023.doi: 10.48550/ARXIV.2312.17080.URL https://doi.org/10.48550/arXiv.2312.17080.
  • Zhang etal. (2024)Lunjun Zhang, Arian Hosseini, Hritik Bansal, Mehran Kazemi, Aviral Kumar, and Rishabh Agarwal.Generative verifiers: Reward modeling as next-token prediction, 2024.URL https://arxiv.org/abs/2408.15240.
  • Zheng etal. (2023)Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, ZiLin, Zhuohan Li, Dacheng Li, Eric Xing, Hao Zhang, JosephE Gonzalez, and Ion Stoica.Judging llm-as-a-judge with mt-bench and chatbot arena.In A.Oh, T.Naumann, A.Globerson, K.Saenko, M.Hardt, and S.Levine (eds.), Advances in Neural Information Processing Systems, volume36, pp. 46595–46623. Curran Associates, Inc., 2023.URL https://proceedings.neurips.cc/paper_files/paper/2023/file/91f18a1287b398d378ef22505bf41832-Paper-Datasets_and_Benchmarks.pdf.
  • Zhong etal. (2023)Wanjun Zhong, Ruixiang Cui, Yiduo Guo, Yaobo Liang, Shuai Lu, Yanlin Wang, Amin Saied, Weizhu Chen, and Nan Duan.Agieval: A human-centric benchmark for evaluating foundation models, 2023.URL https://arxiv.org/abs/2304.06364.

Appendix A Appendix

A.1 Analysis

A.2 Critic Performance

For both datasets, the critic model’s accuracy continues to grow as the sample size $N$ increases, ultimately surpassing the performance of the majority vote, which gradually converges. Specifically, in the MATH dataset, the critic model achieves substantially higher accuracy than the solution accuracy, consistently outperforming the naive majority vote due to the critic filter’s superior performance. This stark contrast highlights the critic model’s effectiveness in identifying and promoting correct answers. In the GSM8K dataset, despite having a critic accuracy of only 92.3%, the critic model still manages to deliver higher accuracy gains. This outcome suggests that the critic model successfully filters answers to increase the density of correct answers and decrease the density of wrong answers, compared to the normal answer distribution. The overall results demonstrate the critic model’s robust capability to enhance accuracy across different datasets, validating its practical utility in improving prediction outcomes.


A.3 Inspect on Iterative Refine

Round | Refine Acc. | True → True | False → True
0 | 91.7 | - | -
1 | 91.7 | 48.2 | 45.3
2 | 92.6 | 78.6 | 37.5
3 | 92.7 | 64.3 | 53.1
4 | 93.0 | 73.2 | 50.0
5 | 93.2 | 75.0 | 53.1
6 | 93.2 | 76.8 | 53.1
7 | 93.3 | 80.4 | 50.0
8 | 93.3 | 80.4 | 50.0

Round | Refine Acc. | True → True | False → True
0 | 57.6 | - | -
1 | 53.4 | 29.0 | 17.7
2 | 57.2 | 65.7 | 13.9
3 | 55.2 | 48.6 | 15.2
4 | 57.2 | 60.9 | 15.9
5 | 57.4 | 60.0 | 17.1
6 | 57.6 | 61.4 | 17.1
7 | 57.8 | 60.0 | 18.4
8 | 57.8 | 62.9 | 16.5

The iterative refinement process for the GSM8K and MATH datasets demonstrates different levels of effectiveness due to their complexity, as shown in Table 6. GSM8K, being simpler, shows a higher success rate in refinement. For effective refinement, the number of false answers corrected (False → True) must exceed the number of true answers incorrectly changed (True → False). Despite occasional mistakes by the critic, correct answers are not always altered incorrectly.

For GSM8K (Table 6(a)), accuracy improves from 91.7% initially to 93.3% by the seventh round, with significant gains in both true-to-true and false-to-true transformations. In contrast, MATH (Table 6(b)) starts at 57.6% accuracy, reaching 57.8% by the seventh round. The iterative refinement process tends to converge, which is expected.
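The round-by-round procedure analysed above can be sketched as a simple loop. This is an illustration under stated assumptions, not the authors' implementation: `critique` and `refine` are hypothetical callables standing in for model calls, `critique` is assumed to return `None` when it flags no step, and the default of 8 rounds simply mirrors the number of rounds listed in Table 6.

```python
from typing import Callable, Optional


def iterative_refine(
    problem: str,
    attempt: str,
    critique: Callable[[str, str], Optional[str]],
    refine: Callable[[str, str, str], str],
    max_rounds: int = 8,
) -> str:
    """Alternate critique and refinement until the critic accepts the attempt.

    critique(problem, attempt) returns the criticism of the first wrong
    step, or None if every step is judged correct (an assumed convention).
    refine(problem, attempt, criticism) returns a revised attempt.
    """
    for _ in range(max_rounds):
        criticism = critique(problem, attempt)
        if criticism is None:
            break  # the critic found no error, so stop refining
        attempt = refine(problem, attempt, criticism)
    return attempt
```

Because the critic is imperfect, a loop of this kind can also turn a correct attempt into an incorrect one, which is why the False → True rate has to outweigh the True → False rate for refinement to pay off.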

A.4 Group By Difficulty Level


For the MATH dataset, the difficulty level is given from 1 to 5. For the GSM8K dataset, we set the difficulty level according to the number of expressions $n$ that appear in the reference solution, i.e., $\max(1, \min(5, n))$. As illustrated in Figure 5, the performance on the GSM8K dataset shows a gradual decline as the difficulty level increases. This trend is accompanied by the emerging effects of the critic and refine stages, which become more prominent at higher difficulty levels. In contrast, the accuracy on the MATH dataset declines sharply as the problems become more challenging. Generally, the refine stage proves effective across all levels, while the critic stage is beneficial at most levels, with some minor exceptions. These observations suggest potential areas for further improvements in the critic mechanism.
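The GSM8K difficulty rule can be written as a short Python sketch. The clamping follows the formula in the text; how expressions are counted is not specified here, so counting the `<<...>>` calculator annotations found in GSM8K reference solutions is an assumption of this sketch, and both function names are hypothetical.

```python
import re

# Assumption: one "expression" per <<...>> calculator annotation
# in the GSM8K reference solution.
_EXPRESSION = re.compile(r"<<[^<>]*>>")


def count_expressions(reference_solution: str) -> int:
    """Count calculator annotations in a GSM8K-style reference solution."""
    return len(_EXPRESSION.findall(reference_solution))


def gsm8k_difficulty(num_expressions: int) -> int:
    """Clamp the expression count n to a level in 1..5: max(1, min(5, n))."""
    return max(1, min(5, num_expressions))
```

For instance, a reference solution with two annotated expressions maps to level 2, while any solution with five or more maps to level 5.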

A.5 An Example of Refinement on GSM8K

As presented in Figure 6, the model forgot to add one year at Step 3; then, through CoT critique, the model found that while Step 1 and Step 2 are correct, Step 3 contains this omission. Finally, guided by the critique of Step 3, the model made a correction and reached the gold answer of 13.

A.6 Prompts

Table 7, Table 10, and Table 8 present the prompts for critic-refine data collection using GPT4-Turbo, with Table 9 for removing the hint phrases (Section 3.2). Table 11, Table 12, and Table 13 show the prompts of the trained model for solving, critique, and refinement during stage-2 training (Section 3.2) and inference (Section 3.3). Table 14, Table 15, and Table 16 present the prompts and responses of a single-turn critique-refinement under Critic-CoT, Step-wise Label Critic, and Final Label Critic, respectively.
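The critique prompts in this appendix require every step evaluation to end with "Conclusion: Step [i] is correct" or "Conclusion: Step [i] is incorrect". A response in that format could be parsed as in the sketch below; the regular expression and the function name are illustrative assumptions, not the parsing code used in the paper.

```python
import re
from typing import Optional

# Matches "Conclusion: Step 3 is incorrect" and also tolerates a
# bracketed index such as "Step [3]".
_CONCLUSION = re.compile(
    r"Conclusion:\s*Step\s*\[?(\d+)\]?\s+is\s+(correct|incorrect)",
    re.IGNORECASE,
)


def first_incorrect_step(critique: str) -> Optional[int]:
    """Return the index of the first step judged incorrect, or None.

    None means every parsed conclusion was "correct", i.e. the critic
    accepts the attempt.
    """
    for match in _CONCLUSION.finditer(critique):
        if match.group(2).lower() == "incorrect":
            return int(match.group(1))
    return None
```

A result of `None` would correspond to a solution that passes the critic filter, while an integer identifies the step whose criticism is handed to the refinement prompt.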

Prompt
How do you evaluate the following attempt with respect to the problem?
<problem>
{problem}
</problem>
<attempt>
{attempt}
</attempt>
-----
**Notes**:
- Please think step by step.
- Your reasoning should precede any claims or conclusions you make to avoid unwarranted assertions.
- At the end of the evaluation for each step, YOU MUST articulate the conclusion using the format ”Conclusion: Step [i] is correct” or ”Conclusion: Step [i] is incorrect”. Words like ”partially correct” are prohibited.
- You shall not evaluate multiple steps at a time, so words like ”Step 7 to Step 24:” or ”Step 4 through 6” are forbidden.
- Once a mistake is identified and stated, stop the evaluation, and enumerate the corrected steps starting from the step where the mistake was detected, and label this part of your response with <correction> at the start and </correction> at the end. Also, the final answer should be a single number, in the form \boxed{}, at the final step.
Prompt
How do you evaluate the following attempt with respect to the problem, with the help of reference solution?
Hint: There could be a mistake.
<problem>
{problem}
</problem>
<reference_solution>
{reference_solution}
</reference_solution>
<attempt>
{attempt}
</attempt>
-----
**Notes**:
- Please think step by step.
- Your reasoning should precede any claims or conclusions you make to avoid unwarranted assertions.
- Please ensure that the output text does not include phrases implying the use of a reference solution or hint, even though these resources are being utilized.
- At the end of the evaluation for each step, YOU MUST articulate the conclusion using the format ”Conclusion: Step [i] is correct” or ”Conclusion: Step [i] is incorrect”. Words like ”partially correct” are prohibited.
- You shall not evaluate multiple steps at a time, so words like ”Step 7 to Step 24:” or ”Step 4 through 6” are forbidden.
- Once a mistake is identified and stated, stop the evaluation, and enumerate the corrected steps starting from the step where the mistake was detected, and label this part of your response with <correction> at the start and </correction> at the end. Also, the final answer should be in the form \boxed{}, at the final step.
Prompt
For the following text, remove any phrases like ”reference solution” or ”hint”, and keep all the other content. Do not miss the ”<correction>” and ”</correction>” labels that exist in the text. Do not respond to anything else.
-----
{critique_refinement}
Prompt
How do you evaluate the following attempt with respect to the problem?
Hint: All the steps are correct, and the attempt reached a correct answer.
<problem>
{problem}
</problem>
<attempt>
{attempt}
</attempt>
-----
**Notes**:
- Please think step by step.
- Your reasoning should precede any claims or conclusions you make to avoid unwarranted assertions.
- Please ensure that the output text does not include phrases implying the use of a reference solution or hint, even though these resources are being utilized.
- At the end of the evaluation for each step, YOU MUST articulate the conclusion using the format ”Conclusion: Step [i] is correct” or ”Conclusion: Step [i] is incorrect”. Words like ”partially correct” are prohibited.
- You shall not evaluate multiple steps at a time, so words like ”Step 7 to Step 24:” or ”Step 4 through 6” are forbidden.
- Once a mistake is identified and stated, stop the evaluation, and enumerate the corrected steps starting from the step where the mistake was detected, and label this part of your response with <correction> at the start and </correction> at the end. Also, the final answer should be in the form \boxed{}, at the final step.
Prompt
## Problem
{problem}
-----
Solve the problem step by step, marking each step as ”Step [i]:”.
Your final answer should be in the form \boxed{answer}, at the end of your response.
Prompt
How do you evaluate the following attempt with respect to the problem?
<problem>
{problem}
</problem>
<attempt>
{attempt}
</attempt>
-----
**Notes**:
- Please think step by step.
- Your reasoning should precede any claims or conclusions you make to avoid unwarranted assertions.
- At the end of the evaluation for each step, YOU MUST articulate the conclusion using the format ”Conclusion: Step [i] is correct” or ”Conclusion: Step [i] is incorrect”. Words like ”partially correct” are prohibited.
Prompt
How do you refine the following attempt with respect to the problem, given the criticism?
<problem>
{problem}
</problem>
<attempt>
{attempt}
</attempt>
<criticism>
{wrong_step_criticism}
</criticism>
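The refinement template takes only the criticism of the erroneous step (the `{wrong_step_criticism}` slot), not the whole critique. A minimal sketch of how that slot might be filled from a step-wise critique is given below. The function names are assumptions made for illustration, and the extraction assumes each step's critique begins with "Step k:" and ends with its "Conclusion:" line.

```python
import re

REFINE_TEMPLATE = (
    "How do you refine the following attempt with respect to the problem, "
    "given the criticism?\n"
    "<problem>\n{problem}\n</problem>\n"
    "<attempt>\n{attempt}\n</attempt>\n"
    "<criticism>\n{wrong_step_criticism}\n</criticism>"
)


def extract_wrong_step_criticism(critique: str):
    """Return the critique text of the first step judged incorrect, or None."""
    match = re.search(r"Conclusion:\s*Step\s*(\d+)\s*is\s*incorrect\.?", critique)
    if match is None:
        return None
    # Look backwards for where the discussion of that step starts.
    start = critique.rfind(f"Step {match.group(1)}:", 0, match.start())
    if start == -1:
        start = 0
    return critique[start:match.end()].strip()


def build_refine_prompt(problem: str, attempt: str, wrong_step_criticism: str) -> str:
    """Fill the refinement template; braces inside the values are left untouched."""
    return REFINE_TEMPLATE.format(
        problem=problem, attempt=attempt, wrong_step_criticism=wrong_step_criticism
    )
```

When `extract_wrong_step_criticism` returns `None`, every step was judged correct and no refinement prompt needs to be built.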
Prompt of Critique under Critic-CoT
How do you evaluate the following attempt with respect to the problem?
<problem>
Consider the geometric sequence $\frac{125}{9}, \frac{25}{3}, 5, 3, \ldots$. What is the eighth term of the sequence? Express your answer as a common fraction.
</problem>
<attempt>
Step 1: Identify the first term of the sequence.
The first term $a_1$ is: $a_1 = \frac{125}{9}$
Step 2: Calculate the common ratio $r$.
The common ratio $r$ can be found by dividing the second term $a_2$ by the first term $a_1$: $r = \frac{a_2}{a_1} = \frac{\frac{25}{3}}{\frac{125}{9}} = \frac{25}{3} \times \frac{9}{125}$
Calculating the right-hand side: $= \frac{25 \cdot 9}{3 \cdot 125} = \frac{225}{375}$
Now simplify $\frac{225}{375}$: $= \frac{225 \div 75}{375 \div 75} = \frac{3}{5}$
Thus, the common ratio is: $r = \frac{3}{5}$
Step 3: Use the formula for the $n$-th term of a geometric sequence.
The $n$-th term of a geometric sequence can be calculated using the formula: $a_n = a_1 \cdot r^{n-1}$
Step 4: Calculate the eighth term $a_8$.
Substituting $n = 8$ into the formula: $a_8 = a_1 \cdot r^{8-1} = \frac{125}{9} \cdot \left(\frac{3}{5}\right)^{7}$
Step 5: Calculate $\left(\frac{3}{5}\right)^{7}$.
Calculating $\left(\frac{3}{5}\right)^{7}$: $\left(\frac{3}{5}\right)^{7} = \frac{3^{7}}{5^{7}} = \frac{2187}{78125}$
Step 6: Substitute back into the expression for $a_8$.
Now we compute: $a_8 = \frac{125}{9} \cdot \frac{2187}{78125}$
Step 7: Multiply the fractions.
To multiply: $a_8 = \frac{125 \cdot 2187}{9 \cdot 78125}$
Calculating the denominator: $9 \cdot 78125 = 703125$
Calculating the numerator: $125 \cdot 2187 = 273375$
Thus: $a_8 = \frac{273375}{703125}$
Step 8: Simplify the fraction $\frac{273375}{703125}$.
To simplify: First, calculate the GCD of the numerator and denominator. We can find that both 273375 and 703125 are divisible by 135.
Calculating: $\frac{273375 \div 135}{703125 \div 135} = \frac{2025}{5205}$
Continuing to simplify, both 2025 and 5205 are divisible by 15: $\frac{2025 \div 15}{5205 \div 15} = \frac{135}{347}$
This is now fully simplified because 135 and 347 have no common factors.
Final Answer: Thus, the eighth term of the sequence is: $\boxed{\color[rgb]{1,0,0}\frac{135}{347}}$
</attempt>
-----
**Notes**:
- Please think step by step.
- Your reasoning should precede any claims or conclusions you make to avoid unwarranted assertions.
- At the end of the evaluation for each step, YOU MUST articulate the conclusion using the format "Conclusion: Step [i] is correct" or "Conclusion: Step [i] is incorrect". Words like "partially correct" are prohibited.
Response of Critique under Critic-CoT
Step 1: The first term $a_1 = \frac{125}{9}$ is identified correctly.
Conclusion: Step 1 is correct.
Step 2: The common ratio $r$ is calculated as: $r = \frac{a_2}{a_1} = \frac{\frac{25}{3}}{\frac{125}{9}} = \frac{25}{3} \times \frac{9}{125} = \frac{225}{375} = \frac{3}{5}$
This calculation is correct.
Conclusion: Step 2 is correct.
Step 3: The formula for the $n$-th term is correctly stated as: $a_n = a_1 \cdot r^{n-1}$
Conclusion: Step 3 is correct.
Step 4: The eighth term is computed as: $a_8 = \frac{125}{9} \cdot \left(\frac{3}{5}\right)^{7}$
This is correct.
Conclusion: Step 4 is correct.
Step 5: Calculating $\left(\frac{3}{5}\right)^{7}$: $\left(\frac{3}{5}\right)^{7} = \frac{2187}{78125}$
This calculation is correct.
Conclusion: Step 5 is correct.
Step 6: Substituting back: $a_8 = \frac{125}{9} \cdot \frac{2187}{78125}$
This is correct.
Conclusion: Step 6 is correct.
Step 7: Multiplying the fractions: $a_8 = \frac{125 \cdot 2187}{9 \cdot 78125}$
The denominator calculation is correct: $9 \cdot 78125 = 703125$
However, for the numerator: $125 \cdot 2187 = 273375$
This is also correct.
Conclusion: Step 7 is correct.
Step 8: The goal is to simplify the fraction $\frac{273375}{703125}$. The attempt made to find the GCD and simplify.
To simplify the fraction, we need to find the greatest common divisor (GCD) of 273375 and 703125.
The standard approach involves either prime factorization or using the Euclidean algorithm.
Prime Factorization:
- 273375:
  - Dividing by 5: $273375 \div 5 = 54675$
  - Dividing by 5 again: $54675 \div 5 = 10935$
  - Dividing by 5 again: $10935 \div 5 = 2187$
  - The number 2187 is $3^{7}$.
  - Therefore, $273375 = 5^{3} \cdot 3^{7}$.
- 703125:
  - Dividing by 5: $703125 \div 5 = 140625$
  - Dividing by 5 again: $140625 \div 5 = 28125$
  - Dividing by 5 again: $28125 \div 5 = 5625$
  - Dividing by 5 again: $5625 \div 5 = 1125$
  - Dividing by 5 again: $1125 \div 5 = 225$
  - Dividing by 5 again: $225 \div 5 = 45$
  - Finally, $45 = 3^{2} \cdot 5^{1}$.
  - Therefore, $703125 = 5^{6} \cdot 3^{2}$.
The GCD can now be derived from the lowest powers of the common factors:
- For 5: $5^{3}$
- For 3: $3^{2}$
Thus, the GCD is: $5^{3} \cdot 3^{2}$
Therefore, 273375 and 703125 are not divisible by 135.
Conclusion: Step 8 is incorrect.
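The prime-factorization argument in this critique can be checked mechanically. The sketch below counts the powers of 3 and 5 in both numbers, rebuilds the GCD from the lowest exponents, and compares it with `math.gcd`. The helper `exponent_of` is a hypothetical name introduced here. Note that the loop finds seven factors of 5 in 703125 (the quoted response writes $5^{6}$), which leaves the GCD $5^{3} \cdot 3^{2}$ unchanged.

```python
from math import gcd


def exponent_of(n: int, p: int) -> int:
    """Count how many times the prime p divides n."""
    count = 0
    while n % p == 0:
        n //= p
        count += 1
    return count


num, den = 273375, 703125
exps_num = {p: exponent_of(num, p) for p in (3, 5)}
exps_den = {p: exponent_of(den, p) for p in (3, 5)}

# GCD from the lowest power of each common prime factor.
gcd_from_factors = 1
for p in (3, 5):
    gcd_from_factors *= p ** min(exps_num[p], exps_den[p])

# 135 is a common divisor only if it divides the GCD.
is_135_common_divisor = gcd_from_factors % 135 == 0
```

Here `gcd_from_factors` agrees with `gcd(num, den)`, and `is_135_common_divisor` is `False`, which supports the conclusion that Step 8 of the attempt is incorrect.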
Prompt of Refinement under Critic-CoT
How do you refine the following attempt with respect to the problem, given the criticism?
<problem>
Consider the geometric sequence $\frac{125}{9}, \frac{25}{3}, 5, 3, \ldots$. What is the eighth term of the sequence? Express your answer as a common fraction.
</problem>
<attempt>
Step 1: Identify the first term of the sequence.
The first term $a_1$ is: $a_1 = \frac{125}{9}$
Step 2: Calculate the common ratio $r$.
The common ratio $r$ can be found by dividing the second term $a_2$ by the first term $a_1$: $r = \frac{a_2}{a_1} = \frac{\frac{25}{3}}{\frac{125}{9}} = \frac{25}{3} \times \frac{9}{125}$
Calculating the right-hand side: $= \frac{25 \cdot 9}{3 \cdot 125} = \frac{225}{375}$
Now simplify $\frac{225}{375}$: $= \frac{225 \div 75}{375 \div 75} = \frac{3}{5}$
Thus, the common ratio is: $r = \frac{3}{5}$
Step 3: Use the formula for the $n$-th term of a geometric sequence.
The $n$-th term of a geometric sequence can be calculated using the formula: $a_n = a_1 \cdot r^{n-1}$
Step 4: Calculate the eighth term $a_8$.
Substituting $n = 8$ into the formula: $a_8 = a_1 \cdot r^{8-1} = \frac{125}{9} \cdot \left(\frac{3}{5}\right)^{7}$
Step 5: Calculate $\left(\frac{3}{5}\right)^{7}$.
Calculating $\left(\frac{3}{5}\right)^{7}$: $\left(\frac{3}{5}\right)^{7} = \frac{3^{7}}{5^{7}} = \frac{2187}{78125}$
Step 6: Substitute back into the expression for $a_8$.
Now we compute: $a_8 = \frac{125}{9} \cdot \frac{2187}{78125}$
Step 7: Multiply the fractions.
To multiply: $a_8 = \frac{125 \cdot 2187}{9 \cdot 78125}$
Calculating the denominator: $9 \cdot 78125 = 703125$
Calculating the numerator: $125 \cdot 2187 = 273375$
Thus: $a_8 = \frac{273375}{703125}$
Step 8: Simplify the fraction $\frac{273375}{703125}$.
To simplify: First, calculate the GCD of the numerator and denominator. We can find that both 273375 and 703125 are divisible by 135.
Calculating: $\frac{273375 \div 135}{703125 \div 135} = \frac{2025}{5205}$
Continuing to simplify, both 2025 and 5205 are divisible by 15: $\frac{2025 \div 15}{5205 \div 15} = \frac{135}{347}$
This is now fully simplified because 135 and 347 have no common factors.
Final Answer: Thus, the eighth term of the sequence is: $\boxed{\frac{135}{347}}$
</attempt>
<criticism>
Step 8: The goal is to simplify the fraction $\frac{273375}{703125}$. The attempt made to find the GCD and simplify.
To simplify the fraction, we need to find the greatest common divisor (GCD) of 273375 and 703125.
The standard approach involves either prime factorization or using the Euclidean algorithm.
Prime Factorization:
- 273375:
  - Dividing by 5: $273375 \div 5 = 54675$
  - Dividing by 5 again: $54675 \div 5 = 10935$
  - Dividing by 5 again: $10935 \div 5 = 2187$
  - The number 2187 is $3^{7}$.
  - Therefore, $273375 = 5^{3} \cdot 3^{7}$.
- 703125:
  - Dividing by 5: $703125 \div 5 = 140625$
  - Dividing by 5 again: $140625 \div 5 = 28125$
  - Dividing by 5 again: $28125 \div 5 = 5625$
  - Dividing by 5 again: $5625 \div 5 = 1125$
  - Dividing by 5 again: $1125 \div 5 = 225$
  - Dividing by 5 again: $225 \div 5 = 45$
  - Finally, $45 = 3^{2} \cdot 5^{1}$.
  - Therefore, $703125 = 5^{6} \cdot 3^{2}$.
The GCD can now be derived from the lowest powers of the common factors:
- For 5: $5^{3}$
- For 3: $3^{2}$
Thus, the GCD is: $5^{3} \cdot 3^{2}$
Therefore, 273375 and 703125 are not divisible by 135.
Conclusion: Step 8 is incorrect.
</criticism>
Response of Refinement under Critic-CoT
<correction>
Step 8: Simplify $\frac{273375}{703125}$. First, we divide it by 375.
Calculating: $\frac{273375 \div 375}{703125 \div 375} = \frac{729}{1875}$
Step 9: Further simplify $\frac{729}{1875}$. The numerator 729 is $3^{6}$ and the denominator 1875 is $3 \cdot 625$. The GCD is 3.
Calculating: $\frac{729 \div 3}{1875 \div 3} = \frac{243}{625}$
This fraction is fully simplified.
Final Answer: Thus, the eighth term of the sequence is: $\boxed{\color[rgb]{0.1953125,0.796875,0}\frac{243}{625}}$
</correction>
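The two-stage simplification in this correction (divide by 375, then by the remaining GCD of 3) can be replayed with exact integer arithmetic. The following is a small sketch; `divide_both` is a hypothetical helper that refuses to divide when the divisor is not common to both numbers.

```python
from math import gcd


def divide_both(num: int, den: int, d: int) -> tuple[int, int]:
    """Divide numerator and denominator by d, failing if d is not a common divisor."""
    if num % d != 0 or den % d != 0:
        raise ValueError(f"{d} does not divide both {num} and {den}")
    return num // d, den // d


# Step 8 of the correction: divide by 375.
step8 = divide_both(273375, 703125, 375)
# Step 9 of the correction: divide by the remaining GCD.
step9 = divide_both(*step8, gcd(*step8))
fully_reduced = gcd(*step9) == 1
```

The same helper raises `ValueError` for the divisor 135 used in the original attempt, since 135 does not divide 703125.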
Prompt of Critique under Step-wise Label Critic
How do you evaluate the following attempt with respect to the problem?
<problem>
[...]
</problem>
<attempt>
[...]
</attempt>
-----
**Notes**:
- For each step, you only need to output "Step [i] is correct" if right, or "Step [i] is incorrect" if wrong. Do not provide anything else in your output.
Response of Critique under Step-wise Label Critic
Step 1 is correct.
Step 2 is correct.
Step 3 is correct.
Step 4 is correct.
Step 5 is incorrect.
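Under this label-only baseline the critic emits no reasoning, so the only recoverable signal is the index of the first step marked incorrect. A minimal parsing sketch follows, assuming one label per line in exactly the format requested by the prompt; `first_incorrect_step` is an illustrative name, not from the paper.

```python
import re

# One label per line, e.g. "Step 4 is correct." or "Step 5 is incorrect."
LABEL_RE = re.compile(r"^Step\s+(\d+)\s+is\s+(correct|incorrect)\.?$", re.MULTILINE)


def first_incorrect_step(response: str):
    """Return the index of the first step labelled incorrect, or None."""
    for index, verdict in LABEL_RE.findall(response):
        if verdict == "incorrect":
            return int(index)
    return None
```

For the response above this returns 5, whereas the Critic-CoT critique of the same attempt judged Step 5 correct and located the error at Step 8.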
Prompt of Refinement under Step-wise Label Critic
How do you refine the following attempt with respect to the problem, given the criticism?
<problem>
[...]
</problem>
<attempt>
[...]
</attempt>
<criticism>
Step 5 is incorrect.
</criticism>
Response of Refinement under Step-wise Label Critic
<correction>
Step 5: Calculate $\left(\frac{3}{5}\right)^{7}$.
$\left(\frac{3}{5}\right)^{7} = \frac{3^{7}}{5^{7}} = \frac{2187}{78125}\text{ (correct)}$
Step 6: Substitute back into the expression for $a_8$.
$a_8 = \frac{125}{9} \cdot \frac{2187}{78125}$
Step 7: Multiply the fractions.
Calculating the numerator: $125 \cdot 2187 = 273375\text{ (correct)}$
Calculating the denominator: $9 \cdot 78125 = 703125\text{ (correct)}$
Step 8: Simplify the fraction $\frac{273375}{703125}$.
To simplify: Calculating the GCD shows that both 273375 and 703125 are divisible by 675, so: $\frac{273375 \div 675}{703125 \div 675} = \frac{405}{1045}$
Continuing to simplify, both 405 and 1045 are divisible by 5: $\frac{405 \div 5}{1045 \div 5} = \frac{81}{209}$
Thus, the final answer is: $\boxed{\color[rgb]{1,0,0}\frac{81}{209}}$
</correction>
Prompt of Critique under Outcome Label Critic
How do you evaluate the following attempt with respect to the problem?
<problem>
[...]
</problem>
<attempt>
[...]
</attempt>
-----
**Notes**:
- For the whole attempt, you only need to output "Each step from Step 1 to Step 8 is correct." if right, or "Some step from Step 1 to Step 8 is incorrect" if wrong. Do not provide anything else in your output.
Response of Critique under Outcome Label Critic
Some step from Step 1 to Step 8 is incorrect.
Prompt of Refinement under Outcome Label Critic
How do you refine the following attempt with respect to the problem, given the criticism?
<problem>
[...]
</problem>
<attempt>
[...]
</attempt>
<criticism>
Some step from Step 1 to Step 8 is incorrect.
</criticism>
Response of Refinement under Outcome Label Critic
<correction>
Step 1: Identify the first term of the sequence.
The first term $a_1$ is: $a_1 = \frac{125}{9}$
Step 2: Calculate the common ratio $r$.
The common ratio $r$ can be found by dividing the second term $a_2$ by the first term $a_1$: $r = \frac{a_2}{a_1} = \frac{\frac{25}{3}}{\frac{125}{9}} = \frac{25}{3} \times \frac{9}{125} = \frac{225}{375}$
Now simplify $\frac{225}{375}$: $= \frac{225 \div 75}{375 \div 75} = \frac{3}{5}$
Thus, the common ratio is: $r = \frac{3}{5}$
Step 3: Use the formula for the $n$-th term of a geometric sequence.
The $n$-th term of a geometric sequence can be calculated using the formula: $a_n = a_1 \cdot r^{n-1}$
Step 4: Calculate the eighth term $a_8$.
Substituting $n = 8$ into the formula: $a_8 = a_1 \cdot r^{8-1} = \frac{125}{9} \cdot \left(\frac{3}{5}\right)^{7}$
Step 5: Calculate $\left(\frac{3}{5}\right)^{7}$.
Calculating $\left(\frac{3}{5}\right)^{7}$: $\left(\frac{3}{5}\right)^{7} = \frac{3^{7}}{5^{7}} = \frac{2187}{78125}$
Step 6: Substitute back into the expression for $a_8$.
Now we compute: $a_8 = \frac{125}{9} \cdot \frac{2187}{78125}$
Step 7: Multiply the fractions.
To multiply: $a_8 = \frac{125 \cdot 2187}{9 \cdot 78125}$
Calculating the denominator: $9 \cdot 78125 = 703125$
Calculating the numerator: $125 \cdot 2187 = 273375$
Thus: $a_8 = \frac{273375}{703125}$
Step 8: Simplify the fraction $\frac{273375}{703125}$.
To simplify: Finding the GCD, both 273375 and 703125 are divisible by 675: $\frac{273375 \div 675}{703125 \div 675} = \frac{405}{1042}$
Now we check if $\frac{405}{1042}$ can be simplified further. Both numbers do not have common factors.
Final Answer: Thus, the eighth term of the sequence is: $\boxed{\color[rgb]{1,0,0}\frac{405}{1042}}$
</correction>
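The three refinement outputs in this case study end in different boxed answers. As a sanity check, the sketch below recomputes the eighth term directly from the first two terms of the sequence using exact rational arithmetic and compares each final answer against it. The function `geometric_term` and the dictionary layout are illustrative assumptions.

```python
from fractions import Fraction


def geometric_term(a1: Fraction, r: Fraction, n: int) -> Fraction:
    """n-th term of a geometric sequence: a_n = a_1 * r^(n-1)."""
    return a1 * r ** (n - 1)


a1, a2 = Fraction(125, 9), Fraction(25, 3)
reference = geometric_term(a1, a2 / a1, 8)

# Final boxed answers of the three refinement responses in this case study.
final_answers = {
    "Critic-CoT": Fraction(243, 625),
    "Step-wise Label Critic": Fraction(81, 209),
    "Outcome Label Critic": Fraction(405, 1042),
}
verdicts = {name: answer == reference for name, answer in final_answers.items()}
```

Only the refinement guided by the Critic-CoT critique matches the reference value; the two label-only baselines both mis-simplify the fraction in Step 8.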