DS-1000: A Natural and Reliable Benchmark for Data Science Code Generation

1The University of Hong Kong, 2Peking University, 3Stanford University, 4University of California, Berkeley, 5University of Washington, 6Meta AI, 7Carnegie Mellon University

DS-1000 is a code generation benchmark with a thousand data science questions spanning seven Python libraries that (1) reflects diverse, realistic, and practical use cases, (2) has a reliable metric, and (3) defends against memorization by perturbing questions.

Abstract

We introduce DS-1000, a code generation benchmark with a thousand data science problems spanning seven Python libraries, such as NumPy and Pandas. Compared to prior work, DS-1000 incorporates three core features. First, our problems reflect diverse, realistic, and practical use cases, since we collected them from StackOverflow. Second, our automatic evaluation is highly specific (reliable): across all Codex-002-predicted solutions that our evaluation accepts, only 1.8% are incorrect; we achieve this with multi-criteria metrics, checking both functional correctness by running test cases and surface-form constraints by restricting API usages or keywords. Finally, we proactively defend against memorization by slightly modifying our problems so that they differ from the original StackOverflow sources; consequently, models cannot answer them correctly by memorizing solutions seen during pre-training. The current best public system (Codex-002) achieves 43.3% accuracy, leaving ample room for improvement.

Data Statistics and Examples

DS-1000 contains 1000 problems originating from 451 unique StackOverflow problems. To defend against potential memorization, more than half of the DS-1000 problems are modified from the original StackOverflow problems; these include 152 surface perturbations, 235 semantic perturbations, and 162 difficult rewrites.

Below are more DS-1000 examples. For each example, the model needs to fill in code at the "[insert]" slot in the prompt; the completed program is then executed against the multi-criteria automatic evaluation, which includes both test cases and surface-form constraints. A reference solution is provided with each example.

A NumPy example problem involving randomness, which requires specialist knowledge to test.
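To make the multi-criteria evaluation concrete, below is a rough sketch of an acceptance check that combines the two criteria. This is our illustration, not the benchmark's actual harness; the function names and the required-API convention are hypothetical.

<code>
import ast

def calls_api(code, api_name):
    # Surface-form check: does the code call an attribute named `api_name`
    # (e.g. "clip" matches np.clip(...))?
    return any(
        isinstance(node, ast.Attribute) and node.attr == api_name
        for node in ast.walk(ast.parse(code))
    )

def accept(program, test_program, required_api=None):
    # Functional check: the candidate program must run its test cases
    # without raising an error.
    try:
        exec(program + "\n" + test_program, {})
    except Exception:
        return False
    # Surface-form check: the solution must actually use the required API.
    return required_api is None or calls_api(program, required_api)

# Accepted only if it both passes the test and uses the required API.
print(accept(
    "import numpy as np\na = np.array([1, 0, 3])\nb = np.clip(a, 0, 1)",
    "assert b.tolist() == [1, 0, 1]",
    required_api="clip",
))  # True
</code>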

Perturbation and Prompt

We equipped the questions collected from StackOverflow with prompts, test cases, and evaluation functions, and call this version Origin. To prevent models from simply recalling solutions seen during pre-training, we perturbed the questions in two ways: surface perturbations and semantic perturbations. To make DS-1000 more challenging, we additionally introduced Difficult Rewrites.

  • Origin:   The original StackOverflow question, equipped with a prompt, test cases, and an evaluation function.
  • Surface Perturbation:   We paraphrased the question or modified the code context in the question, but the reference solution should stay the same after the perturbation.
  • Semantic Perturbation:   We changed the semantics of the reference solution without changing its difficulty (see the sketch after this list).
  • Difficult Rewrite:   We rewrote the question so that the required solution is more complex and the problem is more difficult.
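As a small, hypothetical illustration of the perturbation types (not an actual DS-1000 problem pair):

<code>
import numpy as np

x = np.array([[1, 2], [3, 4]])

# Hypothetical Origin problem: compute the row-wise maximum.
orig = x.max(axis=1)  # [2, 4]

# Surface perturbation: the question is reworded or its code context is
# changed, but this same reference solution still applies.

# Semantic perturbation: the task becomes the row-wise minimum, so the
# reference solution changes without the problem getting harder.
sem = x.min(axis=1)  # [1, 3]
</code>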

We also provide an Insertion-style prompt and a Completion-style prompt for each question (including its perturbations).

  • Insertion:   Models need to predict what should be written in place of "[insert]".
  • Completion:   Models need to predict what should be written after the prompt, as illustrated below.
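Schematically (our paraphrase; see the released prompts for the exact format), a Completion-style prompt is the corresponding Insertion-style prompt truncated at the solution slot:

Problem:
(question text, as in the Insertion prompt below)
A:
<code>
import numpy as np
a = np.array([1, 0, 3])
</code>
BEGIN SOLUTION
<code>

The model's continuation after the final "<code>" is then taken as the solution.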

The performance of Codex-davinci-002 on DS-1000 is reported in the leaderboard under Baselines below.

Below is an example Insertion-style prompt; you can copy it and run models in the OpenAI Playground.

Insertion:

Problem:
Let's say I have a 1d numpy positive integer array like this:
a = array([1,0,3])
I would like to encode this as a 2D one-hot array(for natural number)
b = array([[0,1,0,0], [1,0,0,0], [0,0,0,1]])
The leftmost element corresponds to 0 in `a`(NO MATTER whether 0 appears in `a` or not.), and the rightmost vice versa.
Is there a quick way to do this only using numpy? Quicker than just looping over a to set elements of b, that is.
A:
<code>
import numpy as np
a = np.array([1, 0, 3])
</code>
BEGIN SOLUTION
<code>
[insert]
</code>
END SOLUTION
<code>
print(b)
</code>
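One correct completion for the "[insert]" slot is the standard NumPy one-hot construction (our sketch; the benchmark's reference solution may differ in detail):

<code>
# One row per element of `a`, one column per value in 0..a.max(),
# then set the column indexed by each a[i] to 1.
b = np.zeros((a.size, a.max() + 1), dtype=int)
b[np.arange(a.size), a] = 1
</code>

With the code context above, print(b) then outputs the expected matrix [[0 1 0 0], [1 0 0 0], [0 0 0 1]].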

Comparison

Table 4 compares DS-1000 to other datasets. Notably, the average number of words per problem in DS-1000 is much larger than in other data-science-related datasets (e.g., DSP and CoNaLa).

More importantly, the problems in DS-1000 represent more diverse and naturalistic intents and context formats than are found in any other dataset.

Unlike generic Python code generation benchmarks (MBPP and HumanEval), data science code generation benchmarks have fewer test cases, since annotators need to define program inputs with complex objects such as square matrices, classifiers, or dataframes, rather than simple primitives such as floats or lists.

Nevertheless, even a few test cases suffice for DS-1000 – only 1.8% of the Codex-002-predicted solutions accepted by our evaluation are incorrect.
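For intuition about why such test cases are costlier to write, here is a hypothetical DS-1000-style check (our illustration, not an actual benchmark test): the input and the expected output are DataFrames, so correctness needs a structure-aware comparison rather than equality on primitives.

<code>
import pandas as pd

def test(solution):
    # The program input is a complex object (a DataFrame), not a float or a list.
    df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
    expected = pd.DataFrame({"a": [2, 4], "b": [6, 8]})
    # Structure-aware comparison: checks values, dtypes, index, and columns.
    pd.testing.assert_frame_equal(solution(df), expected)

test(lambda df: df * 2)  # a correct solution passes silently
</code>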

Baselines

Notation: t = temperature, top-p = nucleus (top-p) sampling cutoff, len = maximum generation length.

Rank | Date         | Model                          | Institution                                               | Reference                               | Style      | Decoding                    | Score
-----|--------------|--------------------------------|-----------------------------------------------------------|-----------------------------------------|------------|-----------------------------|------
1    | Jun 05, 2023 | gpt-3.5-turbo + SelfEvolve     | Shanghai Jiao Tong University                             | Shuyang et al., '23                     | Insertion  | t=0.2, top-p=0.95, len=1024 | 57.1
2    | Nov 08, 2022 | codex-davinci-002              | OpenAI                                                    | Chen et al., '21                        | Insertion  | t=0.2, top-p=0.95, len=1024 | 43.3
3    | Nov 08, 2022 | codex-davinci-002              | OpenAI                                                    | Chen et al., '21                        | Completion | t=0.2, top-p=0.95, len=1024 | 39.2
4    | Dec 04, 2023 | MagicoderS-CL                  | University of Illinois at Urbana-Champaign                | Yuxiang et al., '23                     | Completion | t=0.2, top-p=0.5, len=1024  | 37.5
5    | Oct 10, 2023 | Lemur-70B-Chat                 | XLang Lab & University of Hong Kong & Salesforce Research | Yiheng and Hongjin and Chen et al., '23 | Completion | -                           | 34.5
6    | Nov 29, 2023 | CodeLlama-13B + Self-infilling | The University of Hong Kong & ByteDance                   | Zheng et al., '23                       | Completion | greedy, len=2048+512        | 33.1
7    | Jun 14, 2023 | WizardCoder                    | Microsoft                                                 | Luo et al., '23                         | Insertion  | t=0.2, top-p=0.5, len=1024  | 32.8
8    | Oct 10, 2023 | Lemur-70B                      | XLang Lab & University of Hong Kong & Salesforce Research | Yiheng and Hongjin and Chen et al., '23 | Completion | -                           | 30.7
9    | Dec 04, 2023 | Magicoder-CL                   | University of Illinois at Urbana-Champaign                | Yuxiang et al., '23                     | Completion | t=0.2, top-p=0.5, len=1024  | 29.9
10   | Jun 14, 2023 | WizardCoder                    | Microsoft                                                 | Luo et al., '23                         | Completion | t=0.2, top-p=0.5, len=1024  | 29.2
11   | Nov 29, 2023 | CodeLlama-7B + Self-infilling  | The University of Hong Kong & ByteDance                   | Zheng et al., '23                       | Completion | greedy, len=2048+512        | 28.7
12   | May 09, 2023 | StarCoder                      | Hugging Face & ServiceNow Research                        | Li et al., '23                          | Completion | t=0.2, top-p=0.5, len=1024  | 26.0
13   | May 09, 2023 | StarCoder                      | Hugging Face & ServiceNow Research                        | Li et al., '23                          | Insertion  | t=0.2, top-p=0.5, len=1024  | 25.4
14   | May 09, 2023 | StarCoderBase                  | Hugging Face & ServiceNow Research                        | Li et al., '23                          | Insertion  | t=0.2, top-p=0.5, len=1024  | 24.0
15   | May 09, 2023 | StarCoderBase                  | Hugging Face & ServiceNow Research                        | Li et al., '23                          | Completion | t=0.2, top-p=0.5, len=1024  | 23.8
16   | Nov 08, 2022 | codex-davinci-001              | OpenAI                                                    | Chen et al., '21                        | Completion | t=0.2, top-p=0.95, len=1024 | 20.2
17   | Nov 08, 2022 | codex-cushman-001              | OpenAI                                                    | Chen et al., '21                        | Completion | t=0.2, top-p=0.95, len=1024 | 18.1
18   | Nov 08, 2022 | CodeGen-6B                     | Salesforce Research                                       | Nijkamp et al., '22                     | Completion | t=0.2, top-p=0.95, len=1024 | 8.4
19   | Nov 08, 2022 | InCoder-6B                     | Facebook AI                                               | Fried et al., '22                       | Insertion  | t=0.2, top-p=0.95, len=1024 | 7.5
20   | Nov 08, 2022 | InCoder-6B                     | Facebook AI                                               | Fried et al., '22                       | Completion | t=0.2, top-p=0.95, len=1024 | 7.4

Acknowledgement

We thank Noah A. Smith, Tianbao Xie, and Shuyang Jiang for their helpful feedback on this work.

BibTeX

@article{Lai2022DS1000,
  title={DS-1000: A Natural and Reliable Benchmark for Data Science Code Generation},
  author={Lai, Yuhang and Li, Chengxi and Wang, Yiming and Zhang, Tianyi and Zhong, Ruiqi and Zettlemoyer, Luke and Yih, Wen-Tau and Fried, Daniel and Wang, Sida and Yu, Tao},
  journal={ArXiv},
  year={2022},
  volume={abs/2211.11501}
}