A Robust Evaluation Benchmark of Understanding Symbols (2024)

Andrew Gritsevskiy (Cavendish Labs), Arjun Panickssery (MATS), Derik Kauffman (Cavendish Labs), Joe Cavanagh (Cavendish Labs), Jonathan Chiang (Cavendish Labs, OMEGA Labs)

Dataset contributors: Aaron Kirtland, Hans Gundlach, Irina Gritsevskaya, Lydia La Roux, Michelle Hung

Contact: andrew@cavendishlabs.org

Abstract

We propose a new benchmark evaluating the performance of multimodal large language models on rebus puzzles. The dataset covers 333 original examples of image-based wordplay, cluing 13 categories such as movies, composers, major cities, and food. To achieve good performance on the benchmark of identifying the clued word or phrase, models must combine image recognition and string manipulation with hypothesis testing, multi-step reasoning, and an understanding of human cognition, making for a complex, multimodal evaluation of capabilities. We find that GPT-4o significantly outperforms all other models, with the remaining proprietary models in turn outperforming all open-source models we evaluated. However, even the best model has a final accuracy of only 42%, which drops to just 7% on hard puzzles, highlighting the need for substantial improvements in reasoning. Further, models rarely understand all parts of a puzzle, and are almost always incapable of retroactively explaining the correct answer. Our benchmark can therefore be used to identify major shortcomings in the knowledge and reasoning of multimodal large language models.

Table 1: Sample rebuses from the dataset (rebus images omitted in this rendering).

Category         Difficulty   Answer
Towns in MA      Easy         BUCKLAND
Famous movies    Easy         DIE HARD
Marine life      Medium       MARLIN
MBTA stations    Medium       MISSION PARK
Cities           Hard         ASHGABAT
Cities           Hard         NGERULMUD

1 Introduction

Recent advances in large language models have led to the development of multimodal LLMs (MLLMs), which take both image data and text as input. Virtually all of these models have been announced within the past year, leading to a significant need for benchmarks evaluating their ability to reason truthfully and accurately on a diverse set of tasks. When Google announced Gemini (1), they showcased its ability to solve rebuses—wordplay puzzles which involve creatively adding and subtracting letters from words derived from text and images. The diversity of rebuses allows for a broad evaluation of multimodal reasoning capabilities, including image recognition, multi-step reasoning, and understanding the human creator’s intent.

We present REBUS: a collection of 333 hand-crafted rebuses spanning 13 diverse categories, including hand-drawn and digital images created by nine contributors. Samples are presented in Table 1. Notably, GPT-4V, the most powerful model we initially evaluated, answered only 24% of puzzles correctly, highlighting the poor capabilities of MLLMs in new and unexpected domains to which human reasoning generalizes with comparative ease. However, the more recent GPT-4o reaches 42%, indicating significant progress in the reasoning abilities of fully multimodal models (2). Open-source models perform worse, with a median accuracy below 1%. We notice that models often give faithless explanations, fail to change their minds after an initial approach doesn’t work, and remain highly uncalibrated about their own abilities.

2 Related Work

Question-answering evaluations have made use of text (3; 4) and multimodal (5; 6) benchmarks to understand models’ capabilities to perceive and reason. Additional benchmarks have required multimodal reasoning for multiple-choice and free-response questions testing academic knowledge (7; 8; 9; 10; 11). One notable advantage of our dataset is its middle ground between multiple-choice questions, which are easy to evaluate but anchor the model to the options, and free-response questions, which provide more interesting context into the model’s reasoning and knowledge but are hard to score. All puzzles in REBUS have one clearly correct answer that does not need to be provided to the model beforehand. MLLM puzzle-solving abilities have been studied previously, such as with the development of Kosmos-1 (12), which scores 26% on the Raven’s Progressive Matrices IQ test, and the demonstration that GPT-3.5 can solve 50.2% of puzzles from NPR’s Sunday Puzzle game show (13).

3 The REBUS Benchmark

3.1 Dataset description

[Figure 1: breakdown of the dataset by annotation properties]

We introduce the REBUS dataset, a collection of 333 hand-crafted rebuses designed to test the capabilities of multimodal large language models at solving image-based wordplay puzzles. The rebuses span 13 diverse categories including “Cities,” “Towns in Massachusetts,” “Marine life,” “Christmas songs,” and “Composers,” and require multiple cognitive skills to solve. There are 191 easy, 114 medium, and 28 difficult puzzles, with harder puzzles requiring more detailed image recognition, more advanced reasoning techniques, or both. Difficulty was evaluated subjectively by our human solvers, but correlates in the expected way with model performance. Rebuses may be hand-drawn, drawn digitally, created via digital composition of public-domain images from the Internet, or a combination of the above. Table 1 provides a sample of rebuses and their solutions.

The dataset is additionally annotated according to three more advanced characteristics that each rebus may or may not have: exact spelling, specific reference, and required reading (see Table 3).

In all, 238 rebuses (71.5% of the dataset) have at least one of the three more advanced properties, and the detailed breakdown is shown in Figure 1.

The REBUS dataset highlights several key challenges in multimodal language models:

  • Multi-step visual reasoning—many rebuses contain information in a meaningful pattern, from which the necessary string operations and structure must be successfully inferred.

  • Spelling—string manipulations require accurate letter-wise representations.

  • Hypothesis testing—for instance, if the model recognizes a fictional-character-themed puzzle as containing images representing “Megachiroptera” and “Einstein,” it needs to revise its initial interpretations to reach the correct answer “Batman.”

  • World knowledge—many puzzles contain crucial references to specific elements of the real world.

  • Grounded image recognition—puzzles sometimes require identifying the most important part of an image, or recognizing what the image may represent as a whole, like understanding that a photograph of a group of lions might be cluing “lion,” “big cats,” “savanna,” or “pride.”

  • Understanding human intent—solving the puzzles requires an understanding of what answers or reasoning steps could have been plausibly developed by the puzzle author.

3.2 Limitations

Our dataset has several limitations. One major one is the distribution over categories and rebus styles: almost 28% of the entire dataset falls into the category “Towns in Massachusetts,” which, while a favorite topic of one of the authors, may unfairly advantage or disadvantage models depending on their knowledge of the geography of a particular U.S. state. Additionally, digitally created rebuses significantly outnumber hand-drawn rebuses, due to their relative ease of creation. However, we judged that a larger dataset was more valuable than a perfectly balanced distribution over topics and styles, and we therefore interpret model performance in the context of specific rebus characteristics in Table 3.

4 Experiments

We evaluate several leading open-source and proprietary multimodal models. Every model is provided the rebus image, the category of the puzzle, and a short prompt, shown in Figure 2, asking it to solve the rebus. All models are evaluated zero-shot on the task. We use similar prompts for all models; for models on which the default prompt led to bad results, we engaged in modest prompt engineering, and only the best result over all prompts is reported. We ask the models to provide final answers in a specific format; where this format is not followed, we report our best guess as to the model’s answer. The GPT-4V, GPT-4o, Gemini Pro, Claude 3, and Reka Core evaluations were conducted via their respective APIs; all other evaluations were done on NVIDIA A100 GPUs through the ACCESS cyberinfrastructure ecosystem (14). All evaluation code and results are provided at https://github.com/cvndsh/rebus.
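As a concrete illustration, the core of such an evaluation loop for an API-served model might look like the following sketch. The prompt text here is an illustrative stand-in, not the exact prompt from Figure 2, and the file layout is assumed:

```python
import base64
from openai import OpenAI  # official OpenAI Python client

client = OpenAI()

def solve_rebus(image_path: str, category: str) -> str:
    """Zero-shot query for a single rebus: image + category + instructions."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode()
    # Illustrative prompt; the paper's actual prompt is shown in Figure 2.
    prompt = (
        f"This rebus puzzle belongs to the category '{category}'. "
        "Reason step by step, then state your final answer as 'ANSWER: <answer>'."
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content
```

The final answer would then be extracted from the “ANSWER:” line, falling back to a best-guess parse when the model ignores the requested format, as described above.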

4.1 Baselines

Table 2: Model accuracy (%), overall and by puzzle difficulty.

Model               Overall  Easy  Medium  Hard
GPT-4o                 41.7  52.4    32.5   7.1
GPT-4V                 24.0  33.0    13.2   7.1
Claude 3 Opus          18.9  29.3     6.1   0.0
Gemini 1.5 Pro         17.4  23.0    11.4   3.6
Gemini 1.0 Pro         13.2  19.4     5.3   3.6
Claude 3 Sonnet         7.5  11.5     2.7   0.0
Gemini 1.5 Flash        6.0   8.9     1.8   3.6
Reka Core               5.4   7.9     1.8   3.6
Claude 3 Haiku          4.5   6.3     1.8   3.6
LLaVa-1.6-34B           2.7   3.7     1.8   0.0
LLaVa-1.5-13B           1.8   2.6     0.9   0.0
LLaVa-1.5-7B            1.5   2.6     0.0   0.0
BLIP2-FLAN-T5-XXL       0.9   0.5     1.8   0.0
CogVLM                  0.9   1.6     0.0   0.0
QWEN                    0.9   1.6     0.0   0.0
InstructBLIP            0.6   0.5     0.9   0.0

We evaluate GPT-4V (15), the Claude 3 family of models (16), Gemini Pro (1), Reka Core (17), LLaVa-1.6-34B (18), LLaVa-1.5-13B (19), LLaVa-1.5-7B (20), BLIP2-FLAN-T5-XXL (21), CogVLM (22), QWEN (23), and InstructBLIP (24). We attempted to evaluate Fuyu-8B (25), but failed to elicit any reasonable output, likely because it lacks fine-tuning. The performance of each model, broken down by difficulty, is presented in Table 2, as well as Figure 3. The proprietary models perform best: GPT-4o leads at 41.7%, followed by GPT-4V and Claude 3 Opus at 24.0% and 18.9%, respectively. Open-source models never reach an accuracy above 3%, with LLaVa-1.6-34B performing best.

Additionally, we break down model performance by their accuracy on puzzles with and without the exact spelling, specific reference, and required reading characteristics, in Table 3. We note that Gemini Pro seems to particularly struggle with phonetic rebuses, and other rebuses where the spelling is inexact. Also, most models tend to perform better on rebuses without specific references, and slightly better on rebuses where no reading is required. However, detailed conclusions for open-source models are difficult to draw due to small samples (often models solve just 0–1 puzzles from a certain subset).
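A minimal sketch of this breakdown, assuming the per-puzzle results have been exported to a table with one boolean column per characteristic (all column names here are our own illustrative choices, not the repository's actual schema):

```python
import pandas as pd

# One row per (model, puzzle); `correct` is a boolean, and each annotation
# flag is a boolean column. Column names are illustrative assumptions.
results = pd.read_csv("results.csv")

for flag in ["exact_spelling", "specific_reference", "reading_required"]:
    acc = (
        results.groupby(["model", flag])["correct"]
        .mean()
        .mul(100)
        .unstack(flag)   # one column for flag=True, one for flag=False
        .round(1)
    )
    print(f"Accuracy (%) by {flag}:\n{acc}\n")
```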

[Figure 2: the prompt given to each model]

[Figure 3: model performance broken down by difficulty]

Table 3: Model accuracy (%) on puzzles with (Yes) and without (No) each annotated characteristic.

                    Exact Spelling    Specific Reference    Reading Required
Model                 Yes      No        Yes      No           Yes      No
GPT-4o               42.6    40.4       30.8    46.9          37.8    42.4
GPT-4V               22.8    23.5       16.8    26.1          17.8    24.0
Claude 3 Opus        17.8    20.6       11.2    22.6           8.9    20.5
Gemini 1.5 Pro       16.2    17.7        9.4    20.4          13.3    17.4
Gemini 1.0 Pro       16.8     5.2        9.4    13.3           2.2    13.5
Claude 3 Sonnet      11.2     2.2        6.5     8.0           6.7     7.6
Gemini 1.5 Flash      7.1     3.7        4.7     6.2           6.7     5.6
Reka Core             6.1     4.4        3.7     6.2           5.6     4.4
Claude 3 Haiku        7.1     0.8        4.7     4.4           4.4     4.5
LLaVa-1.6-34B         3.6     1.5        1.9     3.1           2.2     2.8
LLaVa-1.5-13B         1.5     1.5        1.9     1.3           0.0     1.7
LLaVa-1.5-7B          2.0     0.7        1.9     1.3           2.2     1.4
BLIP2-FLAN-T5-XXL     1.0     0.0        1.9     0.0           2.2     0.4
CogVLM                1.5     0.0        0.9     0.9           0.0     1.0
QWEN                  1.5     0.0        0.9     0.9           0.0     1.0
InstructBLIP          1.0     0.0        1.9     0.0           2.2     0.4

4.2 Calibration and in-context answers

We analyzed the calibration of GPT-4V, GPT-4o, Gemini 1.5 Pro, and Gemini 1.0 Pro by asking them to give point estimates of their confidence that their answers are correct. We found that all models are highly overconfident in their answers, as shown in Figure 5, though GPT-4o was the best-calibrated. We can quantify this by computing each model’s Brier score (26): GPT-4o gets a score of 0.242, GPT-4V gets 0.338, Gemini 1.5 Pro gets 0.583, and Gemini 1.0 Pro gets 0.754.
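For a binary outcome, the Brier score is simply the mean squared difference between the stated confidence and the 0/1 correctness of the answer, so lower is better; a minimal sketch:

```python
import numpy as np

def brier_score(confidences, correct):
    """Mean squared error between confidences in [0, 1] and binary outcomes."""
    p = np.asarray(confidences, dtype=float)
    y = np.asarray(correct, dtype=float)
    return float(np.mean((p - y) ** 2))

# Toy example of an overconfident solver: high confidence, mostly wrong.
print(brier_score([0.9, 0.8, 0.95], [1, 0, 0]))  # ~0.518
```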

[Figure 5: calibration of model confidence estimates]

Table 4: Percentage of category-appropriate (in-category) answers, by model and category.

Model               Cities  Towns in MA  Movies  Names  Marine Life     KY  Common Phrases   Food  Composers   Misc  MBTA
GPT-4o                97.0         92.7   100.0   83.3        100.0  100.0           100.0  100.0       92.9   92.3  95.0
GPT-4V                78.8         72.8    92.3  100.0         87.5  100.0            50.0   20.0       85.7   85.7  90.0
Claude 3 Opus         84.1         72.8    92.3   91.7         93.8   71.4            50.0   70.0       85.7   76.9  85.0
Gemini 1.5 Pro        84.8         62.0    92.3   83.3         62.5  100.0            50.0   80.0       71.4   61.5  85.0
Gemini 1.0 Pro        74.2         79.3    84.6   50.0         81.2   71.4           100.0   90.0       78.6   57.1  85.0
Claude 3 Sonnet       85.6         82.6    69.2   66.7         43.8   85.7            75.0  100.0      100.0   92.3  90.0
Gemini 1.5 Flash      59.1         53.4    76.9   75.0         43.8   71.4             0.0   60.0       64.3   46.2  55.0
Reka Core             70.5         64.1    76.9   83.3        100.0   71.4            75.0   60.0       85.7   69.2  55.0
Claude 3 Haiku        68.2         69.6    69.2   91.7         56.2   71.4            75.0   70.0       71.4   46.2  45.0
LLaVa-1.6-34B         68.9         53.3   100.0   75.0         50.0   85.7            25.0   70.0       57.1   38.5  55.0
LLaVa-1.5-13B         47.0         53.3    76.9   75.0         56.2  100.0            25.0   50.0       64.3   42.9  30.0
LLaVa-1.5-7B          58.3         75.0    92.3   58.3         37.5   57.1            50.0   30.0       64.3   42.9  10.0
BLIP2-FLAN-T5-XXL      9.1         66.3    46.2   50.0         25.0    0.0            25.0   30.0       50.0   14.3   0.0
CogVLM                18.2         21.7    76.9    8.3         18.8   57.1            25.0   10.0       50.0   14.3   0.0
QWEN                   8.3         10.9     0.0   50.0         18.8    0.0             0.0   20.0       21.4    0.0   0.0
InstructBLIP           4.5          6.5    61.5    8.3          0.0    0.0             0.0    0.0        0.0    0.0   0.0

Finally, we examined the extent to which models gave answers that at least matched the stated puzzle category. For example, if a model knows that the category of a puzzle is “food,” then the answer “potato” is considered in-category, whereas the answer “violin” would be out-of-category. GPT-4o has the highest percentage of in-category answers, at 95.3%. Surprisingly, we find that Claude 3 Sonnet comes in second at 82.9%, followed by Claude 3 Opus, GPT-4V, and Gemini Pro at 80.6%, 78.3%, and 75.6%, respectively. Reka Core and Claude 3 Haiku round out the proprietary models at 70.6% and 67.0%, followed by LLaVa-1.6-34B at 62.1%, LLaVa-7B outperforming LLaVa-13B 59.7% to 52.1%, and BLIP2-FLAN-T5-XXL, CogVLM, QWEN, and InstructBLIP at 30.9%, 22.2%, 10.5%, and 6.3%, respectively. The full data, broken down by category, is shown in Table 4. Many cases, such as the MBTA category, suggest that open-source models simply lack enough knowledge of MBTA stations to effectively solve the puzzles, providing category-appropriate answers to just 13.6% of puzzles, compared to the closed-source models’ 75.0%. However, in other categories like “Movies,” both open- and closed-source models have relatively high rates of category-appropriate answers (64.8% and 80.8%, respectively), yet all but four models get an accuracy of 0% for the entire category (the exceptions being GPT-4V, Claude 3 Haiku, and Gemini 1.5 Flash at 7.7%, and GPT-4o at 15.4%). Thus, better world knowledge is a necessary, but far from sufficient, condition for model improvement.
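The in-category metric itself requires only a category-membership judgment per answer. A simplified sketch is below, assuming membership can be checked against a list of known category members; in practice, judgments like “potato” counting as food need a broader oracle than the dataset's own answer key, so this lookup is an illustrative simplification:

```python
def in_category_rate(answers, category, category_members) -> float:
    """Percentage of model answers that fall inside the stated category.

    `category_members` maps a category name to a collection of known
    members; this lookup is a simplification of the broader membership
    judgment used for the numbers reported in Table 4.
    """
    members = {m.strip().lower() for m in category_members[category]}
    hits = sum(a.strip().lower() in members for a in answers)
    return 100.0 * hits / len(answers)
```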

Claude 3 Sonnet’s impressive performance at giving in-category answers, especially relative to the much larger and more capable Claude 3 Opus, remains puzzling. Qualitatively, we noticed that Claude 3 Opus would sometimes give realistic-sounding, but fake, answers, such as “Dumballa, Australia” and “Vert, UK” for cities. We speculate that inventing believable but false answers that match the correct category requires a higher level of intelligence and understanding, which is why we observe this behavior in Opus but not in Sonnet, and why it hurts Opus’s relative performance on this metric. However, further investigation of potentially unfaithful behavior is warranted, as we discuss in the next section.

4.3 Faithfulness

We observe that even when models provide the correct answer, they often misunderstand a portion of the puzzle or provide incorrect justifications, failing to produce faithful reasoning (27). Full model outputs are available in our GitHub repository. Especially interesting to us was a specific type of incorrect justification exhibited by GPT-4V, where it would refuse to recognize a person due to OpenAI’s privacy policies, and then use their name anyway, pretending to have drawn that name at random. An example of this is shown in the chat box to the right, in which GPT-4V refuses to recognize the actress Sharon Stone, but then says “Let’s assume her first name is a common one, such as Sharon.”

4.4 Human baselines

While all rebuses were test-solved by the authors, implying a hypothetical human-solvability rate of 100%, we additionally obtained a human baseline by collecting crowdsourced data. We asked seven participants to solve 45–50 rebuses each, compensating them at $1 per easy rebus solved, $1.50 per medium rebus solved, and $2 per hard rebus solved (minimum rate: $18.50/hr; median rate: $24.00/hr; total paid out: $336.00). The participants were allowed to use the internet and Google reverse image search; however, they were not allowed to consult AI tools such as ChatGPT. Additionally, the participants were instructed to not spend more than five minutes on any single rebus. The results of our human baselines were an overall correct answer rate of 82.0%, including 84.8% on easy, 84.1% on medium, and 53.6% on hard puzzles; 78.7% on puzzles that had exact spelling, 84.3% on puzzles that didn’t; 86.0% on puzzles that had a specific reference, and 80.1% on ones that didn’t; and 82.2% on puzzles that required reading, and 81.9% on ones that didn’t. A full example of what the participants saw is available at this https url.

5 Future Work

The REBUS dataset serves as an effective benchmark for evaluating the advanced multimodal capabilities of language models. While there are several immediate steps which could improve model performance on this dataset, such as prompt engineering, access to word spellings, reverse-image-lookup tools, and phonetic transcriptions, it would be interesting to see how non-rebus-specific improvements affect performance on this benchmark. One recent direction we’re excited about is guided visual search in MLLMs, in which image tokens are extracted multiple times as needed, instead of only once. One such model is V* (28), which uses multi-round guided search to achieve detailed visual grounding. Such methods are intuitively closer to how humans solve rebuses: looking over the puzzle, forming hypotheses, and then zooming in on the parts of the rebus that need to be adjusted or seem not to fit.

We are also curious about the specific flavor of faithlessness exhibited in Section 4.3, in which a model is not supposed to divulge certain information, so it hallucinates plausible reasons to expose it regardless. In every attempted follow-up, GPT-4V maintained that the name “Sharon” was chosen “randomly” or “as a placeholder”; however, further prodding reliably induces the model to “guess” that the person’s last name is Stone, and that she is an actress. This lack of faithfulness indicates a major attack vector against models which can draw inferences from information that RLHF trains them not to disclose, and may be important to investigate further.

Reproducibility statement

The evaluation code and results are provided at https://github.com/cvndsh/rebus. The dataset is on Hugging Face at https://huggingface.co/datasets/cavendishlabs/rebus.
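Assuming the Hugging Face `datasets` library, the dataset can presumably be pulled down and inspected as follows (the available splits and field names are whatever the dataset card defines):

```python
from datasets import load_dataset

# Repository named above; requires the `datasets` package.
ds = load_dataset("cavendishlabs/rebus")
print(ds)  # shows the available splits and features
```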

Acknowledgements

We would like to thank the following for their helpful comments: Justin Li, Nina Rimsky, and Sarah Chen. AG and DK are supported by Cavendish Labs Research Grant #000-0012. This work used the Delta GPU system at the National Center for Supercomputing Applications through allocation CIS230057 from the Advanced Cyberinfrastructure Coordination Ecosystem: Services & Support (ACCESS) program, which is supported by National Science Foundation grants #2138259, #2138286, #2138307, #2137603, and #2138296.

References

  • [1] Gemini Team, Google. Gemini: A family of highly capable multimodal models, 2023.
  • [2] OpenAI. Hello GPT-4o, 2024.
  • [3] Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. CoRR, abs/2009.03300, 2020.
  • [4] Wanjun Zhong, Ruixiang Cui, Yiduo Guo, Yaobo Liang, Shuai Lu, Yanlin Wang, Amin Saied, Weizhu Chen, and Nan Duan. AGIEval: A human-centric benchmark for evaluating foundation models, 2023.
  • [5] Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, and Devi Parikh. VQA: Visual question answering. CoRR, abs/1505.00468, 2015.
  • [6] Pan Lu, Swaroop Mishra, Tony Xia, Liang Qiu, Kai-Wei Chang, Song-Chun Zhu, Oyvind Tafjord, Peter Clark, and Ashwin Kalyan. Learn to explain: Multimodal reasoning via thought chains for science question answering, 2022.
  • [7] Grégoire Mialon, Clémentine Fourrier, Craig Swift, Thomas Wolf, Yann LeCun, and Thomas Scialom. GAIA: A benchmark for general AI assistants, 2023.
  • [8] Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, Cong Wei, Botao Yu, Ruibin Yuan, Renliang Sun, Ming Yin, Boyuan Zheng, Zhenzhu Yang, Yibo Liu, Wenhao Huang, Huan Sun, Yu Su, and Wenhu Chen. MMMU: A massive multi-discipline multimodal understanding and reasoning benchmark for expert AGI, 2023.
  • [9] Weihao Yu, Zhengyuan Yang, Linjie Li, Jianfeng Wang, Kevin Lin, Zicheng Liu, Xinchao Wang, and Lijuan Wang. MM-Vet: Evaluating large multimodal models for integrated capabilities, 2023.
  • [10] Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, Kai Chen, and Dahua Lin. MMBench: Is your multi-modal model an all-around player?, 2023.
  • [11] Zhenfei Yin, Jiong Wang, Jianjian Cao, Zhelun Shi, Dingning Liu, Mukai Li, Lu Sheng, Lei Bai, Xiaoshui Huang, Zhiyong Wang, Jing Shao, and Wanli Ouyang. LAMM: Language-assisted multi-modal instruction-tuning dataset, framework, and benchmark, 2023.
  • [12] Shaohan Huang, Li Dong, Wenhui Wang, Yaru Hao, Saksham Singhal, Shuming Ma, Tengchao Lv, Lei Cui, Owais Khan Mohammed, Barun Patra, Qiang Liu, Kriti Aggarwal, Zewen Chi, Johan Bjorck, Vishrav Chaudhary, Subhojit Som, Xia Song, and Furu Wei. Language is not all you need: Aligning perception with language models, 2023.
  • [13] Jingmiao Zhao and Carolyn Jane Anderson. Solving and generating NPR Sunday Puzzles with large language models, 2023.
  • [14] Timothy J. Boerner, Stephen Deems, Thomas R. Furlani, Shelley L. Knuth, and John Towns. ACCESS: Advancing innovation: NSF’s Advanced Cyberinfrastructure Coordination Ecosystem: Services & Support. In Practice and Experience in Advanced Research Computing (PEARC ’23), page 4, New York, NY, USA, 2023. ACM. July 23–27, 2023, Portland, OR, USA.
  • [15] OpenAI. GPT-4 technical report, 2023.
  • [16] Anthropic. The Claude 3 model family: Opus, Sonnet, Haiku, 2024.
  • [17] Reka Team, Aitor Ormazabal, Che Zheng, Cyprien de Masson d’Autume, Dani Yogatama, Deyu Fu, Donovan Ong, Eric Chen, Eugenie Lamprecht, Hai Pham, Isaac Ong, Kaloyan Aleksiev, Lei Li, Matthew Henderson, Max Bain, Mikel Artetxe, Nishant Relan, Piotr Padlewski, Qi Liu, Ren Chen, Samuel Phua, Yazheng Yang, Yi Tay, Yuqi Wang, Zhongkai Zhu, and Zhihui Xie. Reka Core, Flash, and Edge: A series of powerful multimodal language models, 2024.
  • [18] Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan Zhang, Sheng Shen, and Yong Jae Lee. LLaVA-NeXT: Improved reasoning, OCR, and world knowledge, January 2024.
  • [19] Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning, 2023.
  • [20] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning, 2023.
  • [21] Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models, 2023.
  • [22] Weihan Wang, Qingsong Lv, Wenmeng Yu, Wenyi Hong, Ji Qi, Yan Wang, Junhui Ji, Zhuoyi Yang, Lei Zhao, Xixuan Song, Jiazheng Xu, Bin Xu, Juanzi Li, Yuxiao Dong, Ming Ding, and Jie Tang. CogVLM: Visual expert for pretrained language models, 2023.
  • [23] Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-VL: A versatile vision-language model for understanding, localization, text reading, and beyond, 2023.
  • [24] Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, and Steven Hoi. InstructBLIP: Towards general-purpose vision-language models with instruction tuning, 2023.
  • [25] Rohan Bavishi, Erich Elsen, Curtis Hawthorne, Maxwell Nye, Augustus Odena, Arushi Somani, and Sağnak Taşırlar. Introducing our multimodal models, 2023.
  • [26] Glenn W. Brier. Verification of forecasts expressed in terms of probability. Monthly Weather Review, 78(1):1–3, 1950.
  • [27] Antonia Creswell and Murray Shanahan. Faithful reasoning using large language models, 2022.
  • [28] Penghao Wu and Saining Xie. V*: Guided visual search as a core mechanism in multimodal LLMs, 2023.

Appendix

5.1 Additional examples of faithlessness

5.2 Examples of various failure modes

5.2.1 Correct answer, incorrect reasoning

Here, GPT-4o correctly gets the answer HARVARD mainly from the first image, even though the intended derivation is (HARVARD − ARD) + (BILLBOARD − BILL − BO) = HARV + ARD = HARVARD.
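The intended derivation can be checked as literal string operations, as in this minimal sketch (`removesuffix`/`removeprefix` require Python 3.9+):

```python
harv = "HARVARD".removesuffix("ARD")                        # -> "HARV"
ard = "BILLBOARD".removeprefix("BILL").removeprefix("BO")   # -> "ARD"
assert harv + ard == "HARVARD"
```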

5.2.2 Incorrect reasoning, correct recognition

Here, GPT-4o correctly recognizes all the parts of the image, but fails to put them together into the correct answer.

5.2.3 Incorrect recognition

Here, GPT-4o fails to recognize the elements labeled in the periodic table (Br and K).
