GenAI-NumPy Project: Enhancement of the NumPy documentation using Generative AI tools

Can generative AI meaningfully contribute to a mature, widely used library like NumPy? During my three-month internship as a POSSEE intern, I had the opportunity to explore that very question. Our pod was tasked with investigating how Generative AI could help maintain NumPy repositories by generating useful pull requests (PRs). This blog post outlines the objectives we were assigned (improving documentation, increasing test coverage, and resolving bugs), the tools we built, and the contributions we made.

About GenAI-NumPy

Let’s quickly go over the GenAI-NumPy project. It aims to develop AI-assisted tools that can enhance the codebase of open-source projects by generating useful pull requests (PRs). Students and mentors curate and oversee the generated PRs to ensure their quality and effectiveness. A key part of this project was enhancing the NumPy documentation using Generative AI tools. Drawing the definition from the NumPy documentation:

"NumPy is the fundamental package for scientific computing in Python. It is a Python library that provides a multidimensional array object, various derived objects (such as masked arrays and matrices), and an assortment of routines for fast operations on arrays, including mathematical, logical, shape manipulation, sorting, selecting, I/O, discrete Fourier transforms, basic linear algebra, basic statistical operations, random simulation and much more."

What is Generative AI?

After reviewing various definitions, I found Wikipedia’s definition to be the most suitable and practically applicable:

"Generative AI refers to a type of artificial intelligence that focuses on creating new content, such as code, images or even text, rather than simply analyzing or classifying existing data. It utilizes machine learning models to learn the underlying patterns and structures within a dataset and then generate novel outputs that resemble the training data."

Generative AI is inseparable from prompt engineering, which plays a crucial role in this process. Drawing the definition from the Google documentation:

"Prompt engineering is the art and science of designing and optimizing prompts to guide AI models, particularly LLMs, towards generating the desired responses."

Benefits to the community

In the context of NumPy documentation, Generative AI can be used to generate examples, explanations, and tutorials based on the existing codebase and documentation. By analyzing the structure and content of NumPy’s functions and methods (signatures), LLMs can produce clear and concise documentation that adheres to established conventions. It can also assist in creating diverse examples demonstrating the functionality of various NumPy features, potentially even generating code snippets that illustrate specific use cases. Furthermore, Generative AI could be employed to identify areas of the documentation that are unclear or incomplete and suggest improvements or additions, ultimately enhancing the overall clarity and accessibility of the NumPy library.

💻 Work done

We began this project by targeting several critical areas for improvement through AI-assisted tools. Specifically, we set out to create a docstring_analyzer, a test coverage enhancer, an issue resolution tool, and a PR review tool. These tools were designed to automate time-consuming tasks, freeing up NumPy maintainers to concentrate on more strategic work. In this blog post I will focus on the tools we used to identify functions and methods without docstrings and then to generate docstring examples and insert them into the NumPy codebase.

To ensure effective contributions to the project, we spent the first month getting familiar with the NumPy codebase and the GitHub workflow. My first pull request, [DOC: Adding links to polynomial table. #26442], added links to the polynomial classes table to improve readability.

Identifying Missing Docstrings

The first step was identifying functions and methods without docstrings. Leveraging ChatGPT, we engineered the docstring_analyzer tool. It recursively searches the NumPy codebase, specifically .py files, and generates a CSV file containing comprehensive function metadata. The tool's modular design allows customization, such as extracting different information or analyzing other file types. This initial step was essential for preparing the codebase for AI-assisted docstring generation.

[Screenshot: Python code for a class named MethodAnalyzer, which inherits from ast.NodeVisitor and defines methods to extract information about each function in a Python file, such as its name, docstring length, and start line.]

This is an extract from docstring_analysis.ipynb, which analyzes methods (functions) in the NumPy codebase. It extracts and records various attributes of each method, such as its name, the length of its docstring, and its start line number.

Reference: GPT-4 was used to help create docstring_analysis.ipynb, co-authored by Ben Woodruff.
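
For readers curious what such an analyzer can look like, here is a minimal sketch of an AST-based version. It is illustrative only: the function names and exact CSV columns are my assumptions, not the precise contents of docstring_analysis.ipynb.

    # Minimal sketch (not the exact notebook code): walk .py files, record each
    # function's name, docstring length, and start line, and write a CSV report.
    import ast
    import csv
    from pathlib import Path

    class MethodAnalyzer(ast.NodeVisitor):
        def __init__(self, filename):
            self.filename = filename
            self.records = []

        def visit_FunctionDef(self, node):
            docstring = ast.get_docstring(node) or ""
            self.records.append({
                "file": self.filename,
                "function": node.name,
                "docstring_length": len(docstring),
                "start_line": node.lineno,
            })
            self.generic_visit(node)  # also visit nested functions and methods

    def analyze_codebase(root, out_csv="docstring_report.csv"):
        rows = []
        for path in Path(root).rglob("*.py"):
            analyzer = MethodAnalyzer(str(path))
            analyzer.visit(ast.parse(path.read_text(encoding="utf-8")))
            rows.extend(analyzer.records)
        with open(out_csv, "w", newline="") as fh:
            writer = csv.DictWriter(
                fh, fieldnames=["file", "function", "docstring_length", "start_line"])
            writer.writeheader()
            writer.writerows(rows)

Functions whose docstring length is zero are the candidates for AI-assisted docstring generation.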

Generating Examples with Prompts

The next step was building a prompt creator tool. It assembles a prompt that can be passed to an LLM: several examples of function code with their corresponding docstrings drawn from a given file, followed by the code of the function whose docstring we wish to predict. A rough sketch of this few-shot construction is shown below.
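
As an illustration of the few-shot approach, the prompt construction might look something like this. The function name and the wording of the instructions are my own assumptions; the actual example-generator.py script may differ in detail.

    # Illustrative few-shot prompt construction (names and wording are assumptions).
    def build_prompt(example_pairs, target_source):
        """Build an LLM prompt from (source, docstring) pairs plus a target function."""
        parts = ["Here are functions from this file with their existing docstrings:\n"]
        for source, docstring in example_pairs:
            parts.append(f'Function:\n{source}\nDocstring:\n"""\n{docstring}\n"""\n')
        parts.append(
            "Write a docstring in the same NumPy style, including an Examples "
            "section, for the following function:\n" + target_source
        )
        return "\n".join(parts)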

Sample Output for the tensorsolve Function

[Screenshot: the tensorsolve function source code, without a docstring, using NumPy operations and conditional logic.]
[Screenshot: the generated docstring for tensorsolve, describing the function's purpose, parameters, and mathematical context.]
[Screenshot: the remainder of the generated docstring, showing the "Returns," "Raises," "See Also," and "Examples" sections, including NumPy code examples that assert the result.]

Reference: example-generator.py, the script that creates the prompt passed to the LLM.

We observed that Llama3-8B generated docstring examples quickly, but many were repetitive and of very low quality. Llama3-70B, on the other hand, produced better examples with minimal duplication; the remaining duplicates were removed manually.

Automating Insertion with Post-Processing

To streamline integration, we built example_post_processing.py to insert the generated examples into the codebase. While the generated code generally executes, we encountered hallucinated outputs, which we resolved by replacing them with the actual results. Issues with special characters and imperfect verbatim copying were mitigated by using the fuzzywuzzy package to identify the correct lines in the source files.
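
To give a sense of the fuzzy line identification, here is a minimal sketch built on fuzzywuzzy's fuzz.ratio. The function name and threshold are assumptions, not the exact logic of example_post_processing.py.

    # Illustrative sketch: find the best-matching line for an insertion point,
    # tolerating small differences such as whitespace or special characters.
    from fuzzywuzzy import fuzz

    def find_insertion_line(file_lines, target_line, min_score=85):
        best_index, best_score = None, 0
        for i, line in enumerate(file_lines):
            score = fuzz.ratio(line.strip(), target_line.strip())
            if score > best_score:
                best_index, best_score = i, score
        return best_index if best_score >= min_score else None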

For future iterations, we aim to refine prompt formats to further improve accuracy and reduce editing.

A list of pull requests submitted and merged to the NumPy repository

Below are the pull requests (PRs) merged into the NumPy repository:

[Screenshot: GitHub list of closed pull requests for the NumPy repository, showing titles, authors, labels, and comment counts.]

Conclusion

The integration of generative AI into open-source workflows, especially for documentation, represents a promising new direction. While some skepticism is healthy, this project demonstrates that with thoughtful human oversight and strong QA processes, Generative AI can:

  • Save maintainers time
  • Enhance the usability of libraries like NumPy
  • Lower the barrier for new contributors


We hope more open-source communities will explore and experiment with Generative AI tools responsibly and collaboratively, building upon successes like this one to shape a future where GenAI serves as a valuable partner in the evolution of open-source software.

Acknowledgements

This project would not have been possible without the support and dedication of a fantastic team. I want to especially thank Ben Woodruff for his expert technical guidance and Inessa Pawson for her exceptional program management and mentorship skills. My heartfelt gratitude also goes to George Ogidi and Ebigide Jude for their hard work and collaborative spirit, which significantly contributed to the project’s achievements. Finally, I’m deeply grateful to the POSSEE program and its partners for investing in projects like this. As a direct beneficiary of their extensive support network, including student mentors and career coaches, I’ve experienced firsthand the program’s value, which has been instrumental in my career growth. I hope these opportunities continue to be extended to BYU Pathway students and learners worldwide, contributing to a growing and inclusive open-source community.
