Research conducted jointly by Stanford University and UC Berkeley evaluated GPT-3.5 and GPT-4, two widely used large language model (LLM) services. The study, carried out in March and June 2023, covered a range of tasks, including math problems, sensitive questions, opinion surveys, multi-hop knowledge-intensive questions, code generation, US Medical License exams, and visual reasoning. The findings revealed significant variation in the performance and behavior of both GPT-3.5 and GPT-4 over time.
For instance, GPT-4’s accuracy at identifying prime vs. composite numbers was reasonable (84%) in March 2023 but fell sharply to 51% by June 2023. This decline was partly attributed to GPT-4’s reduced amenability to chain-of-thought prompting. Interestingly, GPT-3.5 showed notable improvement on this task in June compared to March.
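To make the evaluation concrete, the sketch below scores prime-vs-composite labels against a ground-truth primality test. The model answers here are hypothetical stand-ins for illustration, not actual GPT-3.5 or GPT-4 output, and the scoring logic is an assumption about how such a benchmark might be graded rather than the study's exact harness.

```python
def is_prime(n: int) -> bool:
    """Ground-truth primality check by trial division."""
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True

def accuracy(answers: dict[int, str]) -> float:
    """Fraction of model labels that match the ground truth."""
    correct = sum(
        (label == "prime") == is_prime(n) for n, label in answers.items()
    )
    return correct / len(answers)

# Hypothetical model responses, for illustration only.
mock_answers = {7: "prime", 9: "prime", 11: "prime", 15: "composite"}
print(accuracy(mock_answers))  # 9 is mislabeled, so accuracy is 0.75
```

In the actual study, a score like this was computed over many integers per snapshot, which is what allows the March and June figures (84% vs. 51%) to be compared directly.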
Moreover, GPT-4 was less willing to answer sensitive questions and opinion surveys in June than in March. However, GPT-4’s performance on multi-hop questions improved in June, while GPT-3.5’s performance in the same area declined.
Additionally, both GPT-4 and GPT-3.5 made more formatting mistakes in code generation in June than in March. These observations underscore the importance of continuously monitoring LLMs, as the behavior of the “same” service can change substantially within a relatively short span.
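The monitoring idea can be sketched as a simple drift check: record per-task accuracy at each snapshot and flag tasks whose scores move by more than some threshold. The prime-identification figures below come from the study (84% in March, 51% in June); the code-generation numbers and the 10-point threshold are hypothetical choices for illustration.

```python
DRIFT_THRESHOLD = 0.10  # flag changes larger than 10 percentage points (assumed)

# Per-task accuracy at two snapshots of the "same" service.
# prime_id reflects the reported figures; code_gen is hypothetical.
march_scores = {"prime_id": 0.84, "code_gen": 0.52}
june_scores = {"prime_id": 0.51, "code_gen": 0.10}

def drifted_tasks(before: dict, after: dict, threshold: float = DRIFT_THRESHOLD) -> list:
    """Return tasks whose accuracy shifted by more than `threshold`."""
    return sorted(
        task for task in before
        if abs(after[task] - before[task]) > threshold
    )

print(drifted_tasks(march_scores, june_scores))  # ['code_gen', 'prime_id']
```

A real monitoring pipeline would re-run a fixed benchmark suite against the live service on a schedule and alert on flagged tasks, but the comparison step reduces to this kind of threshold check.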