Planning
One of the drawbacks of CoT-style reasoning is that the LLM must greedily decode a path to the answer. This is a problem for complex tasks, such as math questions or games, where it is difficult to predict the right path without trial and error. In 2023, the community made progress on this issue with new frameworks that enable planning with LLMs.
➡️ Conceptualizing CoT as “System 1” reasoning, characterized by its automatic and unconscious nature, raises the following question: can we use LLMs to replicate humans’ more deliberate “System 2” reasoning? Two methodologies are relevant here: Reasoning via Planning (RAP) and Tree of Thoughts (ToT). Both let LLMs navigate the space of possible reasoning steps and search for the optimal reasoning chain based on some evaluation. RAP additionally uses an LLM as a “world model” that predicts the next state following an action, which allows the LLM to operate in a self-simulated world rather than interacting with an external environment. Both algorithms are now available in the LLM Reasoners library.
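The core idea of searching over reasoning steps can be sketched in a few lines. This is a minimal toy stand-in, not the actual RAP/ToT implementation: the proposer and evaluator are stubbed with simple arithmetic functions, whereas in the real algorithms both roles are played by prompted LLM calls.

```python
# Toy sketch of ToT-style search: beam search over chains of "reasoning steps".
# The "LLM" proposer and evaluator are stubs; the task is to reach a target
# number from a start number using the operations +3, *2, -1.

def propose_steps(state):
    """Stub for the LLM proposing candidate next steps."""
    x, path = state
    return [(x + 3, path + ["+3"]), (x * 2, path + ["*2"]), (x - 1, path + ["-1"])]

def evaluate(state, target):
    """Stub for the LLM scoring a partial chain (closer to target = better)."""
    x, _ = state
    return -abs(target - x)

def tree_search(start, target, beam_width=3, max_depth=6):
    """Keep the best `beam_width` partial chains at each depth."""
    frontier = [(start, [])]
    for _ in range(max_depth):
        candidates = [s for state in frontier for s in propose_steps(state)]
        for x, path in candidates:
            if x == target:
                return path
        candidates.sort(key=lambda s: evaluate(s, target), reverse=True)
        frontier = candidates[:beam_width]
    return None

path = tree_search(start=2, target=13)
```

Greedy CoT decoding corresponds to `beam_width=1` with no backtracking; widening the beam is what lets the model recover from locally attractive but globally wrong steps.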
Self series
The Self series is a set of techniques that replace human effort with LLM predictions in the LLM development loop. A significant number of papers along this line were published in 2023. Let’s take a look at some representative works.
➡️ Many people have had the experience that ChatGPT does not produce the desired output on the first try, but can often be corrected by pointing out its mistake. Self-Debugging and Self-Refine automate this step by replacing human feedback with machine feedback. The feedback comes from a program executor, or from an LLM that compares the generation against the problem description. One important observation is that the performance of self-refinement depends on the quality of the feedback: stronger base models, which provide better feedback, benefit more. Such iterative refinement methods have also proven very effective in pose estimation and protein structure prediction, where it is difficult to predict the structure in a single run.
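The execute-then-revise loop can be sketched as follows. This is a self-contained illustration under assumed names: `mock_llm_fix` stands in for a real LLM revision call, and here it simply patches a known off-by-one bug so the loop is runnable end to end.

```python
# Toy sketch of the self-debugging loop: run a candidate program against its
# tests; on failure, feed the error back to the "model" for a revised version.
import traceback

def run_candidate(code, test):
    """Execute the candidate and its test; return (ok, feedback)."""
    env = {}
    try:
        exec(code, env)
        exec(test, env)
        return True, "all tests passed"
    except Exception:
        return False, traceback.format_exc(limit=1)

def mock_llm_fix(code, feedback):
    """Stub for the LLM revision step: here it fixes a known off-by-one bug."""
    return code.replace("range(1, n)", "range(1, n + 1)")

def self_debug(code, test, max_rounds=3):
    for _ in range(max_rounds):
        ok, feedback = run_candidate(code, test)
        if ok:
            return code
        code = mock_llm_fix(code, feedback)
    return code

buggy = "def fact(n):\n    r = 1\n    for i in range(1, n):\n        r *= i\n    return r"
fixed = self_debug(buggy, "assert fact(4) == 24")
```

The executor's traceback plays the role of the human pointing out the mistake; the loop terminates as soon as the tests pass.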
➡️ In Memory-of-Thought (MoT), Li and Qiu ask the LLM to generate CoT rationales on an unlabeled dataset and use them for RAG. You may wonder how this is useful, since the generated rationales often contain errors. The key trick is to filter the rationales by majority voting or entropy minimization (a similar idea is used for rationale filtering in Wan et al.). Once good rationales are obtained for the unlabeled dataset, few-shot samples are retrieved dynamically based on the test question, which has been shown to be much better than fixed few-shot samples. MoT can be interpreted as converting a parametric model into a non-parametric model without additional supervision.
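The majority-voting filter at the heart of this pipeline is simple to sketch. The `samples` below are hard-coded stand-ins for sampled LLM generations; the function keeps only rationales whose final answer agrees with the majority vote.

```python
# Toy sketch of MoT-style rationale filtering: sample several (rationale,
# answer) pairs per question, then keep rationales agreeing with the majority.
from collections import Counter

def filter_by_majority(samples):
    """samples: list of (rationale, answer) pairs."""
    votes = Counter(ans for _, ans in samples)
    majority, _ = votes.most_common(1)[0]
    kept = [r for r, ans in samples if ans == majority]
    return majority, kept

samples = [
    ("3 apples plus 4 apples gives 7 apples", 7),
    ("3 + 4 makes 7 in total", 7),
    ("3 * 4 = 12", 12),  # faulty rationale, filtered out
]
answer, memory = filter_by_majority(samples)
```

The surviving rationales form the "memory" from which few-shot exemplars are later retrieved per test question.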
➡️ Going beyond MoT, Yasunaga et al. proposed analogical prompting, which eliminates the need to dump rationales onto an unlabeled dataset. Analogical prompting asks the LLM to recall relevant exemplars based on the question, thereby generating dynamic few-shot samples from scratch. Indeed, the authors found that analogical prompting is an emergent ability of large language models, echoing previous findings on open-domain question answering: large LLMs can self-generate better exemplars than standard RAG solutions retrieve. Additionally, this work offers a nice trick of using markdown grammar to fuse multiple generation steps into a single prompt, a godsend for prompt engineers on a tight budget. 💡
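The single-prompt fusion trick looks roughly like the template below. The wording is an illustrative assumption, not the paper's exact template: markdown headers partition one prompt into a "recall exemplars" stage and a "solve" stage, so one generation call does both.

```python
# Sketch of the markdown trick in analogical prompting: one prompt asks the
# model to first self-generate relevant exemplars, then solve the problem,
# with markdown headers separating the stages.

def analogical_prompt(question):
    return "\n".join([
        "# Problem",
        question,
        "# Relevant problems",
        "Recall three relevant problems and describe their solutions.",
        "# Solution",
        "Using the recalled problems, solve the initial problem step by step.",
    ])

prompt = analogical_prompt("What is the area of a square with perimeter 12?")
```

Because the model fills in the sections in order, the self-generated exemplars are already in context when it reaches the final "Solution" section, at the cost of a single API call.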
➡️ Are self-refinement and self-generation the limits of LLM reasoning? Yang et al. demonstrate a more advanced use of LLMs’ reasoning abilities: optimizing prompts based on the history of generated prompts. This is a cool reinvention of the famous meta-learning paper “Learning to Learn by Gradient Descent by Gradient Descent”, except that every step here is performed on text by an LLM. At each step, the LLM is given the previous solutions and their corresponding performance metrics, and attempts to propose a new solution. Notably, without being told how to perform the optimization, the LLM gradually finds better solutions that maximize the metric. Maybe this work brings prompt engineers one step closer to unemployment?
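The optimization loop can be sketched as below. In the real method, the proposer is an LLM prompted with the scored trajectory; here `mock_llm_propose` is a stub that perturbs the best solution so far, so the loop actually runs, and the objective is an arbitrary stand-in for a task metric.

```python
# Toy sketch of optimization by prompting: keep a history of (solution, score)
# pairs and repeatedly ask the "optimizer" for a new solution given that history.

def score(x):
    """Stand-in objective to maximize: peaks at x = 7."""
    return -(x - 7) ** 2

def mock_llm_propose(history):
    """Stub optimizer: inspect the best past solution and try its neighbors."""
    best = max(history, key=lambda h: h[1])[0]
    tried = {x for x, _ in history}
    for candidate in (best + 1, best - 1):
        if candidate not in tried:
            return candidate
    return best

def optimize(start, steps=10):
    history = [(start, score(start))]
    for _ in range(steps):
        x = mock_llm_propose(history)
        history.append((x, score(x)))
    return max(history, key=lambda h: h[1])[0]
```

Note that the stub is never told the objective's formula, only past scores; that mirrors the paper's observation that the LLM improves solutions without being told how to optimize.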
🔁 Perhaps the most eye-opening 👀 piece in the Self series is the Self-Taught Optimizer (STOP) by Zelikman et al. An LLM is guided by text prompts, takes text as input, and outputs text. These texts are usually treated as separate variables, but what if we model them as a single variable? In STOP, the authors take inspiration from self-modifying code and use a self-improvement prompt to improve itself.
While the seed prompt is no more sophisticated than a random search algorithm, powerful LLMs can discover many advanced metaheuristic algorithms with it. Interestingly, GPT-4 discovered several prompting strategies, such as ToT and Parsel, that were published after GPT-4’s training cutoff. It seems the day when LLMs conduct their own research is approaching. A step in this direction is a recent study by Huang et al., which shows that LLMs can design ML models that are competitive on common benchmarks and Kaggle challenges.
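The recursive structure of STOP can be illustrated with a deliberately abstract toy. Everything here is an illustrative stand-in, not the paper's setup: "programs" are parameter dicts rather than real code, `improver` is a crude local search, and the self-improvement step applies the improver to its own description under a meta-utility.

```python
# Toy sketch of the STOP idea: an improver takes (program, utility) and returns
# a better program; it is then applied to its own description.

def improver(program, utility, tries=3):
    """Propose `tries` perturbed variants and keep the best under `utility`."""
    best = program
    for delta in range(1, tries + 1):
        variant = dict(program, strength=program["strength"] + delta)
        if utility(variant) > utility(best):
            best = variant
    return best

def task_utility(program):
    """Downstream task score: more strength helps, capped at 10."""
    return min(program["strength"], 10)

def meta_utility(improver_program):
    """Score an improver by how good a program it produces on the task."""
    # Interpret the improver's own "strength" as how many variants it tries.
    seed = {"strength": 0}
    out = improver(seed, task_utility, tries=improver_program["strength"])
    return task_utility(out)

# Self-improvement: run the improver on its own description under meta_utility.
self_description = {"strength": 3}  # an improver that tries 3 variants
improved_improver = improver(self_description, meta_utility)
```

The key move, as in STOP, is that the same improvement routine is evaluated both as a solver of the task and as an object to be improved, collapsing the usual program/prompt distinction into one variable.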
Evaluation and discussion
➡️ Kandpal et al. conducted a systematic study of memorization in LLMs. They asked LLMs factual questions from Wikipedia and found that, regardless of model size, accuracy is highly correlated with the frequency of the queried entities in the pre-training documents. Extrapolating the trend, the authors estimate that a model with 10¹⁸ parameters would be needed to match human performance on long-tail entities, far larger than any LLM today. The takeaway: use LLM reasoning for tasks involving high-frequency knowledge, and consider RAG or other tools for tasks involving long-tail knowledge.
➡️ As the community builds ever larger mixtures for training LLMs, one concern is that LLMs may simply memorize solutions from the training distribution, similar to humans teaching to the test, rather than actually learning to reason. Wu et al. address this concern by comparing the performance of GPT-4 with zero-shot CoT on 11 different tasks under default and counterfactual settings. They observed that while the LLM performs better than random in the counterfactual settings, its performance there is consistently worse than in the default settings. How to train models to focus on reasoning rather than memorization remains an open question.
➡️ Saparov et al. extended the synthetic dataset PrOntoQA to an OOD setting and tested the generalization ability of LLMs in deductive reasoning, with control over depth, width, compositional structure, etc. The authors found that CoT generalizes to compositional and longer proofs. This contrasts with previous conclusions on compositional semantic parsing, probably because deductive reasoning only requires composing deduction steps, whereas semantic parsing additionally has to process ever-growing outputs. LLMs can use most deduction rules, but proof by cases and proof by contradiction require explicit demonstrations. There are also counterintuitive qualitative differences between in-context learning and supervised learning.
➡️ Regarding parametric knowledge in LLMs, Berglund et al. discovered a phenomenon they call the Reversal Curse: an LLM trained to memorize “A is B” cannot infer “B is A” in closed-book question answering, even though this deduction is trivial. This shows that LLMs lack a certain kind of symmetry in their parametric knowledge, and endowing LLMs with such symmetry is important for better generalization. In fact, the knowledge graph community leads in this area, with works on double permutation equivariance and relation rotation. It will be interesting to see how these ideas apply to LLMs.