DS-1000: by model


p-values for model pairs

The null hypothesis is that models A and B each win with probability 1/2 on every example where their outputs differ; ties are ignored. The p-value is the probability, under this null hypothesis, of a difference at least as extreme as the one observed. Across all model pairs it depends mainly on the difference in accuracy. Hover over a model pair for detailed information.
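Concretely, this null hypothesis yields a two-sided exact sign test: the number of A-wins among the decisive examples is Binomial(n, 1/2). A minimal sketch (the function name is ours, not the site's code):

```python
from math import comb

def sign_test_p(a_wins: int, b_wins: int) -> float:
    """Two-sided exact sign test. Under the null, each decisive
    example (ties dropped) is a fair coin flip between A and B."""
    n = a_wins + b_wins
    k = max(a_wins, b_wins)
    # One tail: P(X >= k) for X ~ Binomial(n, 1/2)
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    # Double it for the two-sided test; clamp at 1 when a_wins == b_wins
    return min(1.0, 2 * tail)
```

For example, a 8-vs-2 split on 10 decisive examples is not significant at the 5% level, which is why large accuracy gaps on few examples can still fail to separate two models.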

p-values vs. differences

The range of possible p-values vs. the difference in accuracy over all pairs.

Differences vs inconsistencies

Here is a more informative figure of the raw counts used to compute the p-values. Any model pair to the right of the parabola is statistically different at the given significance level. The transition is fairly sharp: there are no model pairs with a small #A_win + #B_win, and only such pairs could reach significance at a small |#A_win - #B_win|. For more explanation see the doc.
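The parabola can be reproduced by asking, for each total count n = #A_win + #B_win of decisive examples, how large the win gap must be before the sign test drops below the significance level. A sketch under that reading (names are ours):

```python
from math import comb

def min_significant_gap(n: int, alpha: float = 0.05) -> int:
    """Smallest |#A_win - #B_win| (same parity as n) that is
    significant at level alpha given n decisive examples."""
    for gap in range(n % 2, n + 1, 2):
        k = (n + gap) // 2  # wins of the better model
        p = min(1.0, 2 * sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n)
        if p < alpha:
            return gap
    return n  # even a clean sweep is not significant for very small n

# The boundary grows roughly like 1.96 * sqrt(n), which traces the
# parabola in the (#A_win + #B_win, |#A_win - #B_win|) plane.
```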

Results table by model

We show three methods currently used for evaluating code models: raw accuracy (as reported by benchmarks), average win rate over all other models (used by BigCode), and Elo (Bradley-Terry coefficients, following Chatbot Arena). Average win rate correlates well with Elo. GPT-3.5 is anchored at an Elo of 1000 when available; otherwise the average Elo is set to 1000.
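The Elo column can be derived from pairwise win counts via the standard minorization-maximization fit of the Bradley-Terry model, followed by a change of scale (400 Elo points per factor of 10 in odds) and anchoring. A minimal sketch with a made-up two-model example; the real leaderboard fits all pairs jointly, and the `wins` dict here is illustrative:

```python
import math

def bradley_terry_elo(wins, iters=200, anchor=None):
    """Fit Bradley-Terry strengths from wins[(a, b)] = times a beat b,
    then map to the Elo scale: rating = 400 * log10(strength)."""
    models = sorted({m for pair in wins for m in pair})
    s = {m: 1.0 for m in models}
    for _ in range(iters):
        for m in models:
            w = sum(wins.get((m, o), 0) for o in models)  # total wins of m
            denom = sum((wins.get((m, o), 0) + wins.get((o, m), 0))
                        / (s[m] + s[o]) for o in models if o != m)
            if denom > 0:
                s[m] = w / denom  # MM (Zermelo) update
        norm = math.prod(s.values()) ** (1 / len(models))
        s = {m: v / norm for m, v in s.items()}  # fix the overall scale
    elo = {m: 400 * math.log10(v) for m, v in s.items()}
    if anchor in elo:  # shift so the anchor model sits at 1000
        shift = 1000 - elo[anchor]
        elo = {m: e + shift for m, e in elo.items()}
    return elo
```

With 9 wins against 1, the fitted gap is 400 * log10(9), about 382 Elo points, matching the Chatbot Arena convention that a 400-point gap means 10:1 odds.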

model  pass@1  win_rate  Elo
gpt-4o-2024-08-06 59.9% 90.0% 1263.5
claude-3-5-sonnet-20240620 54.3% 85.0% 1180.6
gpt-4-turbo-2024-04-09 54.0% 86.5% 1200.0
deepseek-ai-deepseek-coder-V2-SFT 53.2% 85.7% 1187.4
Qwen-Qwen2-72B-Instruct 52.8% 85.3% 1183.1
deepseek-chat-V2.5 51.2% 84.1% 1162.6
mistralai-Codestral-22B-v0.1 51.2% 84.1% 1165.8
gpt-4-0613 51.0% 83.1% 1152.2
gpt-4o-mini-2024-07-18 50.5% 82.5% 1144.5
meta-llama-Llama-3-70b-chat-hf 48.6% 80.0% 1116.9
deepseek-ai-deepseek-coder-V2-Base 46.7% 79.3% 1106.9
microsoft-wavecoder-ultra-6.7b 46.0% 78.0% 1091.9
deepseek-ai-deepseek-coder-33b-instruct 45.4% 76.3% 1072.8
m-a-p-OpenCodeInterpreter-DS-6.7B 42.0% 72.9% 1043.9
deepseek-ai-deepseek-coder-33b-base 41.7% 74.0% 1054.8
meta-llama-Llama-3-70B 40.9% 71.8% 1034.4
deepseek-ai-deepseek-llm-67b-chat 40.7% 71.3% 1026.4
microsoft-Phi-3-medium-4k-instruct 40.6% 72.1% 1035.3
Phind-Phind-CodeLlama-34B-v2 40.4% 70.5% 1023.3
Qwen-Qwen1.5-110B-Chat 40.2% 71.4% 1025.9
mistralai-Mixtral-8x22B 40.0% 71.3% 1031.1
codellama-CodeLlama-70b-hf 39.8% 70.3% 1022.7
m-a-p-OpenCodeInterpreter-CL-7B 39.5% 68.4% 1002.9
gpt-3.5-turbo-0125 39.4% 68.4% 1006.1
m-a-p-OpenCodeInterpreter-SC2-7B 38.9% 65.9% 977.9
codellama-CodeLlama-70b-Python-hf 38.9% 68.9% 1005.2
codellama-CodeLlama-34b-Python-hf 38.9% 69.0% 1005.9
gpt-3.5-turbo-0613 38.6% 67.8% 1000.0
codex002 38.6% 69.6% 1010.9
m-a-p-OpenCodeInterpreter-SC2-3B 38.6% 66.9% 990.7
deepseek-ai-deepseek-V2-chat 38.5% 67.6% 997.1
microsoft-Phi-3-small-8k-instruct 37.7% 67.1% 994.3
bigcode-starcoder2-15b 37.0% 66.2% 983.6
WizardLM-WizardCoder-Python-34B-V1.0 36.7% 65.1% 974.0
Qwen-Qwen1.5-72B-Chat 35.5% 63.9% 965.1
google-codegemma-7b 34.8% 62.9% 960.7
ibm-granite-granite-34b-code-base 34.8% 62.3% 953.9
codellama-CodeLlama-34b-hf 34.6% 62.3% 950.0
Qwen-Qwen1.5-72B 34.3% 61.5% 950.8
deepseek-ai-deepseek-coder-7b-base-v1.5 34.2% 61.7% 950.8
ibm-granite-granite-8b-code-base 33.8% 60.7% 939.6
Qwen-Qwen1.5-32B-Chat 32.8% 58.4% 923.9
microsoft-wavecoder-ds-6.7b 32.8% 58.3% 924.8
microsoft-Phi-3-mini-4k-instruct 32.1% 57.0% 916.8
meta-llama-Llama-3-8B 31.5% 56.0% 910.4
bigcode-starcoder2-7b 31.4% 56.0% 907.0
microsoft-Phi-3-mini-128k-instruct 31.3% 55.4% 906.3
microsoft-wavecoder-pro-6.7b 31.2% 55.1% 900.9
deepseek-ai-deepseek-coder-6.7b-base 31.1% 55.5% 900.4
codellama-CodeLlama-13b-Python-hf 31.0% 55.2% 900.6
Qwen-Qwen2-7B 31.0% 55.3% 900.5
deepseek-ai-deepseek-coder-V2-Lite-Base 30.5% 54.2% 890.4
openchat-openchat-3.5-0106 30.3% 53.5% 886.8
ibm-granite-granite-20b-code-base 30.0% 53.0% 886.6
google-codegemma-1.1-7b-it 29.7% 52.3% 881.6
Doubao-pro-4k 29.1% 51.1% 872.7
mistralai-Mixtral-8x7B-v0.1 28.8% 50.6% 868.9
Qwen-Qwen1.5-32B 28.5% 49.9% 862.3
codellama-CodeLlama-13b-hf 27.8% 48.5% 854.0
Qwen-CodeQwen1.5-7B 27.6% 48.1% 847.9
bigcode-starcoder2-3b 27.3% 47.4% 842.8
google-codegemma-7b-it 26.2% 45.3% 831.4
google-gemma-7b 26.1% 44.8% 823.3
codellama-CodeLlama-7b-Python-hf 26.0% 44.6% 824.4
stabilityai-stable-code-3b 25.6% 43.8% 820.2
meta-llama-Llama-2-70b-hf 25.2% 42.7% 807.6
m-a-p-OpenCodeInterpreter-DS-1.3B 25.0% 43.0% 811.1
Qwen-Qwen1.5-14B 24.8% 41.9% 804.1
THUDM-codegeex2-6b 24.1% 40.5% 791.0
deepseek-ai-deepseek-coder-V2-Instruct 23.3% 40.2% 798.5
claude-3-sonnet-20240229 23.2% 39.5% 790.8
codellama-CodeLlama-7b-hf 22.9% 37.6% 770.5
ibm-granite-granite-3b-code-base 22.8% 37.2% 765.1
claude-3-opus-20240229 21.6% 36.6% 769.7
microsoft-phi-2 21.5% 34.9% 749.2
Qwen-Qwen1.5-14B-Chat 21.4% 35.4% 755.4
Qwen-Qwen1.5-7B 20.1% 31.5% 720.2
mistralai-Mixtral-8x22B-Instruct-v0.1 19.9% 35.4% 764.2
mistralai-Mistral-7B-v0.3 19.7% 31.0% 713.2
google-gemma-1.1-7b-it 18.3% 29.3% 700.0
meta-llama-Llama-3-8b-chat-hf 17.8% 29.3% 709.5
deepseek-ai-deepseek-coder-1.3b-base 17.5% 25.8% 667.3
deepseek-ai-deepseek-V2-Lite 16.9% 25.2% 665.0
google-codegemma-1.1-2b 16.6% 24.9% 661.8
claude-3-haiku-20240307 16.3% 25.7% 674.7
Doubao-lite-4k 15.7% 24.0% 655.6
Salesforce-codegen25-7b-mono_P 15.6% 23.6% 652.5
google-codegemma-2b 13.3% 18.4% 592.3
Qwen-Qwen2-1.5B 11.8% 16.9% 576.9
meta-llama-Llama-2-13b-hf 11.6% 15.5% 552.9
google-gemma-7b-it 11.4% 15.5% 553.8
google-gemma-2b 10.3% 12.4% 509.3
microsoft-phi-1 9.1% 10.6% 471.3
ERNIE-Speed-8K 8.8% 11.3% 490.9
codellama-CodeLlama-70b-Instruct-hf 8.7% 12.2% 511.2
google-gemma-1.1-2b-it 8.5% 11.2% 488.4
microsoft-phi-1_5 8.3% 10.4% 471.3
codellama-CodeLlama-13b-Instruct-hf 7.9% 10.6% 470.0
meta-llama-Llama-2-7b-hf 6.9% 7.9% 414.4
mistralai-Mistral-7B-Instruct-v0.3 6.9% 7.1% 399.4
meta-llama-Llama-2-7b-chat-hf 6.4% 7.2% 400.2
google-gemma-2b-it 6.0% 7.0% 393.6
smallcloudai-Refact-1_6B-fim 5.7% 7.2% 395.9
codellama-CodeLlama-34b-Instruct-hf 5.2% 5.7% 354.4
Qwen-Qwen2-0.5B 3.9% 3.0% 232.7
meta-llama-Llama-2-70b-chat-hf 3.7% 4.3% 305.1
mistralai-Mixtral-8x7B-Instruct-v0.1 3.7% 4.2% 294.6