p-values for model pairs

The null hypothesis is that models A and B each have a 1/2 chance of winning whenever their results differ; ties are ignored. The p-value is the probability, under the null hypothesis, of getting a difference at least as extreme as the one observed. Across all pairs of models, the p-value depends mainly on the difference in accuracy. Hover over each model pair for detailed information.
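The null hypothesis above corresponds to a sign test on the examples where the two models disagree. As a minimal sketch, assuming a two-sided exact binomial test (the page does not say whether its p-values are one-sided or two-sided, and the function name `sign_test_p_value` is hypothetical), the computation could look like this:

```python
from math import comb


def sign_test_p_value(a_wins: int, b_wins: int) -> float:
    """Two-sided exact sign test; ties are assumed to be dropped already.

    a_wins: number of examples that A solves and B does not.
    b_wins: number of examples that B solves and A does not.
    """
    n = a_wins + b_wins
    if n == 0:
        # The models never disagree, so there is no evidence against the null.
        return 1.0
    k = max(a_wins, b_wins)
    # P(X >= k) for X ~ Binomial(n, 1/2), computed with exact integers.
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2**n
    # Double the upper tail for a two-sided test and cap at 1.
    return min(1.0, 2.0 * tail)


print(sign_test_p_value(8, 2))
```

Only the two disagreement counts enter the computation, which is why examples that both models solve or both fail have no effect on the p-value.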

p-values vs. differences

The range of possible p-values vs. the difference in accuracy over all pairs.

Differences vs. inconsistencies

Here is a more informative figure of the source information used to compute the p-values. Any model pair to the right of the parabola is statistically different at the given level. The plot shows a fairly sharp transition because no model pair has a small #A_win + #B_win, which rules out significant results at a small |#A_win - #B_win|. For more explanation see doc.
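One way to see why the boundary is a parabola: under the null hypothesis each non-tied example contributes +1 or -1 to #A_win - #B_win with equal probability, so the difference has mean 0 and variance #A_win + #B_win. With a normal approximation, a pair is significant when |#A_win - #B_win| exceeds z times the square root of #A_win + #B_win. The sketch below assumes this normal approximation and a two-sided level `alpha`; the page may use an exact test instead, and the function names are hypothetical.

```python
import math
from statistics import NormalDist


def significance_boundary(total: int, alpha: float = 0.05) -> float:
    """Smallest |#A_win - #B_win| that is significant for a given
    total = #A_win + #B_win, under the normal approximation."""
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    return z * math.sqrt(total)


def is_significant(a_wins: int, b_wins: int, alpha: float = 0.05) -> bool:
    """True if the pair lies beyond the parabola at level alpha."""
    total = a_wins + b_wins
    if total == 0:
        return False
    return abs(a_wins - b_wins) > significance_boundary(total, alpha)


print(significance_boundary(100))
print(is_significant(70, 30), is_significant(55, 45))
```

Solving the boundary for the total gives #A_win + #B_win = (|#A_win - #B_win| / z)^2, which is the parabola in the plot under these assumptions.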

Results table by model

We show three methods currently used for evaluating code models: raw accuracy (used by benchmarks), average win rate over all other models (used by BigCode), and Elo (Bradley-Terry coefficients, following Chatbot Arena). Average win rate always correlates well with Elo. GPT-3.5 is assigned an Elo of 1000 when available; otherwise the average Elo is set to 1000.

model	pass1	win_rate	elo
gpt-4o-2024-08-06	59.9%	90.0%	1263.5
claude-3-5-sonnet-20240620	54.3%	85.0%	1180.6
gpt-4-turbo-2024-04-09	54.0%	86.5%	1200.0
deepseek-ai-deepseek-coder-V2-SFT	53.2%	85.7%	1187.4
Qwen-Qwen2-72B-Instruct	52.8%	85.3%	1183.1
deepseek-chat-V2.5	51.2%	84.1%	1162.6
mistralai-Codestral-22B-v0.1	51.2%	84.1%	1165.8
gpt-4-0613	51.0%	83.1%	1152.2
gpt-4o-mini-2024-07-18	50.5%	82.5%	1144.5
meta-llama-Llama-3-70b-chat-hf	48.6%	80.0%	1116.9
deepseek-ai-deepseek-coder-V2-Base	46.7%	79.3%	1106.9
microsoft-wavecoder-ultra-6.7b	46.0%	78.0%	1091.9
deepseek-ai-deepseek-coder-33b-instruct	45.4%	76.3%	1072.8
m-a-p-OpenCodeInterpreter-DS-6.7B	42.0%	72.9%	1043.9
deepseek-ai-deepseek-coder-33b-base	41.7%	74.0%	1054.8
meta-llama-Llama-3-70B	40.9%	71.8%	1034.4
deepseek-ai-deepseek-llm-67b-chat	40.7%	71.3%	1026.4
microsoft-Phi-3-medium-4k-instruct	40.6%	72.1%	1035.3
Phind-Phind-CodeLlama-34B-v2	40.4%	70.5%	1023.3
Qwen-Qwen1.5-110B-Chat	40.2%	71.4%	1025.9
mistralai-Mixtral-8x22B	40.0%	71.3%	1031.1
codellama-CodeLlama-70b-hf	39.8%	70.3%	1022.7
m-a-p-OpenCodeInterpreter-CL-7B	39.5%	68.4%	1002.9
gpt-3.5-turbo-0125	39.4%	68.4%	1006.1
m-a-p-OpenCodeInterpreter-SC2-7B	38.9%	65.9%	977.9
codellama-CodeLlama-70b-Python-hf	38.9%	68.9%	1005.2
codellama-CodeLlama-34b-Python-hf	38.9%	69.0%	1005.9
gpt-3.5-turbo-0613	38.6%	67.8%	1000.0
codex002	38.6%	69.6%	1010.9
m-a-p-OpenCodeInterpreter-SC2-3B	38.6%	66.9%	990.7
deepseek-ai-deepseek-V2-chat	38.5%	67.6%	997.1
microsoft-Phi-3-small-8k-instruct	37.7%	67.1%	994.3
bigcode-starcoder2-15b	37.0%	66.2%	983.6
WizardLM-WizardCoder-Python-34B-V1.0	36.7%	65.1%	974.0
Qwen-Qwen1.5-72B-Chat	35.5%	63.9%	965.1
google-codegemma-7b	34.8%	62.9%	960.7
ibm-granite-granite-34b-code-base	34.8%	62.3%	953.9
codellama-CodeLlama-34b-hf	34.6%	62.3%	950.0
Qwen-Qwen1.5-72B	34.3%	61.5%	950.8
deepseek-ai-deepseek-coder-7b-base-v1.5	34.2%	61.7%	950.8
ibm-granite-granite-8b-code-base	33.8%	60.7%	939.6
Qwen-Qwen1.5-32B-Chat	32.8%	58.4%	923.9
microsoft-wavecoder-ds-6.7b	32.8%	58.3%	924.8
microsoft-Phi-3-mini-4k-instruct	32.1%	57.0%	916.8
meta-llama-Llama-3-8B	31.5%	56.0%	910.4
bigcode-starcoder2-7b	31.4%	56.0%	907.0
microsoft-Phi-3-mini-128k-instruct	31.3%	55.4%	906.3
microsoft-wavecoder-pro-6.7b	31.2%	55.1%	900.9
deepseek-ai-deepseek-coder-6.7b-base	31.1%	55.5%	900.4
codellama-CodeLlama-13b-Python-hf	31.0%	55.2%	900.6
Qwen-Qwen2-7B	31.0%	55.3%	900.5
deepseek-ai-deepseek-coder-V2-Lite-Base	30.5%	54.2%	890.4
openchat-openchat-3.5-0106	30.3%	53.5%	886.8
ibm-granite-granite-20b-code-base	30.0%	53.0%	886.6
google-codegemma-1.1-7b-it	29.7%	52.3%	881.6
Doubao-pro-4k	29.1%	51.1%	872.7
mistralai-Mixtral-8x7B-v0.1	28.8%	50.6%	868.9
Qwen-Qwen1.5-32B	28.5%	49.9%	862.3
codellama-CodeLlama-13b-hf	27.8%	48.5%	854.0
Qwen-CodeQwen1.5-7B	27.6%	48.1%	847.9
bigcode-starcoder2-3b	27.3%	47.4%	842.8
google-codegemma-7b-it	26.2%	45.3%	831.4
google-gemma-7b	26.1%	44.8%	823.3
codellama-CodeLlama-7b-Python-hf	26.0%	44.6%	824.4
stabilityai-stable-code-3b	25.6%	43.8%	820.2
meta-llama-Llama-2-70b-hf	25.2%	42.7%	807.6
m-a-p-OpenCodeInterpreter-DS-1.3B	25.0%	43.0%	811.1
Qwen-Qwen1.5-14B	24.8%	41.9%	804.1
THUDM-codegeex2-6b	24.1%	40.5%	791.0
deepseek-ai-deepseek-coder-V2-Instruct	23.3%	40.2%	798.5
claude-3-sonnet-20240229	23.2%	39.5%	790.8
codellama-CodeLlama-7b-hf	22.9%	37.6%	770.5
ibm-granite-granite-3b-code-base	22.8%	37.2%	765.1
claude-3-opus-20240229	21.6%	36.6%	769.7
microsoft-phi-2	21.5%	34.9%	749.2
Qwen-Qwen1.5-14B-Chat	21.4%	35.4%	755.4
Qwen-Qwen1.5-7B	20.1%	31.5%	720.2
mistralai-Mixtral-8x22B-Instruct-v0.1	19.9%	35.4%	764.2
mistralai-Mistral-7B-v0.3	19.7%	31.0%	713.2
google-gemma-1.1-7b-it	18.3%	29.3%	700.0
meta-llama-Llama-3-8b-chat-hf	17.8%	29.3%	709.5
deepseek-ai-deepseek-coder-1.3b-base	17.5%	25.8%	667.3
deepseek-ai-deepseek-V2-Lite	16.9%	25.2%	665.0
google-codegemma-1.1-2b	16.6%	24.9%	661.8
claude-3-haiku-20240307	16.3%	25.7%	674.7
Doubao-lite-4k	15.7%	24.0%	655.6
Salesforce-codegen25-7b-mono_P	15.6%	23.6%	652.5
google-codegemma-2b	13.3%	18.4%	592.3
Qwen-Qwen2-1.5B	11.8%	16.9%	576.9
meta-llama-Llama-2-13b-hf	11.6%	15.5%	552.9
google-gemma-7b-it	11.4%	15.5%	553.8
google-gemma-2b	10.3%	12.4%	509.3
microsoft-phi-1	9.1%	10.6%	471.3
ERNIE-Speed-8K	8.8%	11.3%	490.9
codellama-CodeLlama-70b-Instruct-hf	8.7%	12.2%	511.2
google-gemma-1.1-2b-it	8.5%	11.2%	488.4
microsoft-phi-1_5	8.3%	10.4%	471.3
codellama-CodeLlama-13b-Instruct-hf	7.9%	10.6%	470.0
meta-llama-Llama-2-7b-hf	6.9%	7.9%	414.4
mistralai-Mistral-7B-Instruct-v0.3	6.9%	7.1%	399.4
meta-llama-Llama-2-7b-chat-hf	6.4%	7.2%	400.2
google-gemma-2b-it	6.0%	7.0%	393.6
smallcloudai-Refact-1_6B-fim	5.7%	7.2%	395.9
codellama-CodeLlama-34b-Instruct-hf	5.2%	5.7%	354.4
Qwen-Qwen2-0.5B	3.9%	3.0%	232.7
meta-llama-Llama-2-70b-chat-hf	3.7%	4.3%	305.1
mistralai-Mixtral-8x7B-Instruct-v0.1	3.7%	4.2%	294.6
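To illustrate how an Elo column like the one above can be derived from pairwise results, here is a minimal sketch of a Bradley-Terry fit. It uses a simple minorization-maximization iteration rather than the logistic-regression fit used by Chatbot Arena, so it is a stand-in and not the method behind this table. The function name, the `wins` layout, and the 400-point base-10 scale are assumptions.

```python
import math


def bradley_terry_elo(wins, anchor=None, iters=200):
    """Fit Bradley-Terry strengths and map them to an Elo-like scale.

    wins[a][b] is the number of examples where model a beats model b
    (ties dropped). Assumes every model has at least one win and at
    least one comparison; otherwise its strength is not finite.
    """
    models = sorted(wins)
    p = {m: 1.0 for m in models}
    for _ in range(iters):
        new_p = {}
        for a in models:
            others = [b for b in models if b != a]
            total_wins = sum(wins[a].get(b, 0) for b in others)
            denom = sum(
                (wins[a].get(b, 0) + wins[b].get(a, 0)) / (p[a] + p[b])
                for b in others
            )
            new_p[a] = total_wins / denom
        # Strengths are only defined up to scale, so renormalize each round.
        scale = sum(new_p.values())
        p = {m: v / scale for m, v in new_p.items()}
    raw = {m: 400.0 * math.log10(p[m]) for m in models}
    # Shift so the anchor model (or the average, if no anchor) sits at 1000.
    if anchor is not None:
        offset = 1000.0 - raw[anchor]
    else:
        offset = 1000.0 - sum(raw.values()) / len(raw)
    return {m: r + offset for m, r in raw.items()}


# Toy example with two hypothetical models: A beats B three times, B beats A once.
ratings = bradley_terry_elo({"A": {"B": 3}, "B": {"A": 1}}, anchor="B")
print(ratings)
```

The `anchor` argument mirrors the convention described for the table: one reference model is pinned at 1000 when it is present, and the average is pinned at 1000 otherwise.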

DS1000: by models
