Summary
The paper proposes a new approach to chain-of-thought reasoning: instead of reasoning in natural language, as in explicit CoT, the LM reasons through intermediate computation that stays implicit in its hidden states. The main advantage is speed (throughput is much higher), because the reasoning happens “vertically” across the hidden states of different layers rather than “horizontally” by producing intermediate tokens one by one as in explicit CoT.
Dataset / Task
- Multi-digit multiplication task from the BIG-bench benchmark
  - 4x4 and 5x5 multiplication
  - Augmentation: randomly sample equations that do not overlap with the BIG-bench dataset
- Grade-school math problems
  - GSM8K dataset
  - Augmentation: 400k additional math problems of the same format, generated with GPT-4
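For the multiplication task, the augmentation described above can be sketched in a few lines. This is an illustrative guess at the procedure, not the authors' code; the `held_out` set stands in for the real BIG-bench evaluation pairs.

```python
import random

def sample_multiplication(n_digits, held_out, rng=random.Random(0)):
    """Sample an n-digit x n-digit multiplication whose operand pair is
    not in `held_out` (e.g. the BIG-bench eval set); return (prompt, answer)."""
    lo, hi = 10 ** (n_digits - 1), 10 ** n_digits - 1
    while True:
        a, b = rng.randint(lo, hi), rng.randint(lo, hi)
        if (a, b) not in held_out:
            return f"{a} * {b} =", str(a * b)

held_out = {(1234, 5678)}  # placeholder for the real evaluation pairs
prompt, answer = sample_multiplication(4, held_out)
```

Rejection sampling like this is cheap here because the space of operand pairs is vastly larger than the held-out set.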
Methodology
At a high level, the idea is knowledge distillation:
- Mind-Reading the Teacher: We train a student model to “read” the teacher’s “thought process”, i.e. the continuous hidden states produced while the teacher generates its intermediate reasoning steps. Rather than replicating those steps, the student uses a subset of the teacher’s hidden states to produce the answer.
- Thought Emulation: We then employ knowledge distillation (Hinton et al., 2015; Kim & Rush, 2016) to train an emulator that predicts the teacher’s hidden states from the input “vertically”, across layers, eliminating the need for “horizontal” explicit reasoning steps.
- Couple and Optimize: Finally, we combine the emulator, which predicts the teacher’s thought process, with the mind-reading student, which produces the final answer from the emulated thought process. This combined system is then optimized end-to-end, allowing the student model to develop its own reasoning methods that might differ from the teacher’s approach.
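As a rough illustration of the three steps (not the paper's actual training setup, which uses transformer LMs), here is a toy linear version: the "teacher" is a fixed two-layer linear map, its intermediate activation plays the role of the continuous thought, and each step is a least-squares fit.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "teacher": a fixed 2-layer linear map whose intermediate hidden
# state Z stands in for the continuous thoughts produced during
# explicit CoT generation (illustrative assumption: everything linear).
d_in, d_hid, n = 4, 6, 200
W1 = rng.normal(size=(d_hid, d_in))   # input -> hidden "thought"
w2 = rng.normal(size=d_hid)           # hidden "thought" -> answer

X = rng.normal(size=(n, d_in))
Z_teacher = X @ W1.T                  # teacher's hidden states
y = Z_teacher @ w2                    # teacher's answers

# Step 1 -- Mind-reading the teacher: fit a student that maps the
# teacher's hidden states (not its text steps) to the answer.
v_student, *_ = np.linalg.lstsq(Z_teacher, y, rcond=None)

# Step 2 -- Thought emulation: fit an emulator that predicts the
# teacher's hidden states directly from the input ("vertically").
U_emulator, *_ = np.linalg.lstsq(X, Z_teacher, rcond=None)

# Step 3 -- Couple and optimize: chain emulator -> student and refit
# end-to-end on (input, answer) pairs only; the emulated "thoughts"
# are now free to drift from the teacher's.
Z_emulated = X @ U_emulator
v_coupled, *_ = np.linalg.lstsq(Z_emulated, y, rcond=None)

y_pred = Z_emulated @ v_coupled       # near-exact in this linear toy
```

In the linear toy the end-to-end fit is essentially exact; the point is only the data flow: no intermediate tokens are ever produced, and the answer is computed from emulated hidden states.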



Experiment

Shortcomings & Limitations
- Lower accuracy than explicit CoT
- Not interpretable, unlike explicit CoT
- Requires a significant amount of training data, which makes it costly
- Unclear how it would perform on out-of-distribution data