Learn Aider: AI Coder

Overview

This blog documents my journey reading the source code of Aider. I will explore the basic structure of the repository and highlight key learnings. Aider also has an official blog and documentation that include insights from the author.

Main

Aider is a CLI tool. In the main function, it creates a coder object and calls coder.run(). If a Git repository doesn’t already exist, Aider will set one up for you. The io object configures the terminal, handling text input/output and even reading images.
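The flow above can be sketched with a toy version. The names here (InputOutput, Coder, create, run) mirror Aider's, but the implementations are simplified stand-ins, not the real API:

```python
# Toy sketch of main(): ensure a git repo, build io, create a coder, run it.
from pathlib import Path

class InputOutput:
    """Stand-in for Aider's io object: terminal read/write."""
    def tool_output(self, msg):
        print(msg)

class Coder:
    @classmethod
    def create(cls, io):
        # The real factory picks a coder subclass based on the edit format.
        return cls(io)

    def __init__(self, io):
        self.io = io

    def run(self):
        self.io.tool_output("chat loop starts here")
        return "done"

def main(repo_dir="."):
    # Aider offers to initialize a git repo when none exists yet.
    if not (Path(repo_dir) / ".git").exists():
        pass  # the real code would run `git init` after confirming with the user
    io = InputOutput()
    coder = Coder.create(io=io)
    return coder.run()
```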

The main logic resides in the Coder class.

Coder

Inside the run() method, there’s a while loop:

while new_user_message:
    new_user_message = self.send_new_user_message(new_user_message)

Within send_new_user_message, the following steps are executed:

  1. Format the message:

    • Retrieve multiple system prompts.
    • Obtain related code snippets from repomap.
    • Count tokens.
  2. Send message to LLM:

    • Send the formatted message to the Language Model.
  3. Get output and apply changes:

    • Receive the output and make the necessary code changes.
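The three steps can be condensed into a toy sketch. The helper names here (repo_map, token_count, the llm callable) are stand-ins for illustration, not Aider's actual API:

```python
# Toy sketch of send_new_user_message: format, send, apply.
class MiniCoder:
    def __init__(self, llm, repo_map):
        self.llm = llm            # callable: prompt text -> reply text
        self.repo_map = repo_map  # callable: () -> related code snippets

    def token_count(self, text):
        return len(text.split())  # crude stand-in for a real tokenizer

    def send_new_user_message(self, user_message):
        # 1. Format the message: system prompts + repo-map snippets + user text.
        prompt = "\n".join(
            ["SYSTEM: you are a coding assistant", self.repo_map(), user_message]
        )
        # Token counting would guard the context limit here.
        assert self.token_count(prompt) > 0

        # 2. Send the formatted message to the LLM.
        reply = self.llm(prompt)

        # 3. Parse the reply and apply the edits (here we just return it).
        return reply

coder = MiniCoder(llm=lambda p: "edits applied", repo_map=lambda: "def foo(): ...")
print(coder.send_new_user_message("rename foo to bar"))
```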

There are three main coder types:

  • EditBlockCoder
  • UnifiedDiffCoder
  • WholeFileCoder

Only GPT-4 and Claude 3 can provide unified diff (udiff) output, which is the most efficient way to update the code.
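To make the edit formats concrete, here is a minimal sketch of what an EditBlockCoder-style edit boils down to: an exact search/replace on the file contents. The function name and error handling are my own, not Aider's:

```python
def apply_edit_block(content, search, replace):
    """Replace an exact `search` snippet with `replace`, the way an
    edit-block style coder applies an LLM-proposed change."""
    if search not in content:
        raise ValueError("search block not found exactly in file")
    return content.replace(search, replace, 1)  # only the first occurrence

original = "def greet():\n    print('hi')\n"
print(apply_edit_block(original, "print('hi')", "print('hello')"))
```

A udiff-style coder instead parses a standard unified diff from the model's reply, while WholeFileCoder simply rewrites the entire file.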

The high-level idea is clear, but there is a lot of code related to input/output and error handling.

The key part of this project is the repomap. Let’s dive into it.

RepoMap

In repomap.py, the main function is get_repo_map, with the main logic in get_ranked_tags. Here, chatfiles are the files to be updated, and otherfiles are all other files in the repository.

    def get_repo_map(self, chat_files, other_files):
        files_listing = self.get_ranked_tags_map(chat_files, other_files)
        num_tokens = self.token_count(files_listing)  # logged in the real code
        repo_content = files_listing  # the real code prepends an explanatory prefix
        return repo_content

    def get_ranked_tags_map(self, chat_fnames, other_fnames):
        ranked_tags = self.get_ranked_tags(chat_fnames, other_fnames)

        num_tags = len(ranked_tags)

        lower_bound = 0
        upper_bound = num_tags
        best_tree = None

        chat_rel_fnames = [self.get_rel_fname(fname) for fname in chat_fnames]

        while lower_bound <= upper_bound:
            middle = (lower_bound + upper_bound) // 2
            tree = self.to_tree(ranked_tags[:middle], chat_rel_fnames)
            num_tokens = self.token_count(tree)
            if num_tokens < self.max_map_tokens:
                best_tree = tree
                lower_bound = middle + 1
            else:
                upper_bound = middle - 1

        return best_tree


    def get_ranked_tags(self, chat_fnames, other_fnames):
        defines = defaultdict(set)
        references = defaultdict(list)
        definitions = defaultdict(set)

        personalization = dict()

        fnames = set(chat_fnames).union(set(other_fnames))
        chat_rel_fnames = set()

        fnames = sorted(fnames)

        if self.cache_missing:
            fnames = tqdm(fnames)
        self.cache_missing = False

        for fname in fnames:
            if not Path(fname).is_file():
                if fname not in self.warned_files:
                    if Path(fname).exists():
                        self.io.tool_error(
                            f"Repo-map can't include {fname}, it is not a normal file"
                        )
                    else:
                        self.io.tool_error(
                            f"Repo-map can't include {fname}, it no longer exists"
                        )

                self.warned_files.add(fname)
                continue

            # dump(fname)
            rel_fname = self.get_rel_fname(fname)

            if fname in chat_fnames:
                personalization[rel_fname] = 1.0
                chat_rel_fnames.add(rel_fname)

            tags = list(self.get_tags(fname, rel_fname))
            if not tags:
                continue

            for tag in tags:
                if tag.kind == "def":
                    defines[tag.name].add(rel_fname)
                    key = (rel_fname, tag.name)
                    definitions[key].add(tag)

                if tag.kind == "ref":
                    references[tag.name].append(rel_fname)

    
        if not references:
            references = dict((k, list(v)) for k, v in defines.items())

        idents = set(defines.keys()).intersection(set(references.keys()))

        G = nx.MultiDiGraph()

        for ident in idents:
            definers = defines[ident]
            for referencer, num_refs in Counter(references[ident]).items():
                for definer in definers:
                    # if referencer == definer:
                    #    continue
                    G.add_edge(referencer, definer, weight=num_refs, ident=ident)

        if personalization:
            pers_args = dict(personalization=personalization, dangling=personalization)
        else:
            pers_args = dict()

        try:
            ranked = nx.pagerank(G, weight="weight", **pers_args)
        except ZeroDivisionError:
            return []

        # distribute the rank from each source node, across all of its out edges
        ranked_definitions = defaultdict(float)
        for src in G.nodes:
            src_rank = ranked[src]
            total_weight = sum(
                data["weight"] for _src, _dst, data in G.out_edges(src, data=True)
            )
            # dump(src, src_rank, total_weight)
            for _src, dst, data in G.out_edges(src, data=True):
                data["rank"] = src_rank * data["weight"] / total_weight
                ident = data["ident"]
                ranked_definitions[(dst, ident)] += data["rank"]

        ranked_tags = []
        ranked_definitions = sorted(
            ranked_definitions.items(), reverse=True, key=lambda x: x[1]
        )


        for (fname, ident), rank in ranked_definitions:
            # print(f"{rank:.03f} {fname} {ident}")
            if fname in chat_rel_fnames:
                continue
            ranked_tags += list(definitions.get((fname, ident), []))

        rel_other_fnames_without_tags = set(
            self.get_rel_fname(fname) for fname in other_fnames
        )

        fnames_already_included = set(rt[0] for rt in ranked_tags)

        top_rank = sorted(
            [(rank, node) for (node, rank) in ranked.items()], reverse=True
        )
        for rank, fname in top_rank:
            if fname in rel_other_fnames_without_tags:
                rel_other_fnames_without_tags.remove(fname)
            if fname not in fnames_already_included:
                ranked_tags.append((fname,))

        for fname in rel_other_fnames_without_tags:
            ranked_tags.append((fname,))

        return ranked_tags
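The binary search in get_ranked_tags_map can be isolated into a small sketch: find the largest prefix of ranked tags whose rendered tree still fits the token budget. The token_count here is a crude whitespace tokenizer, not Aider's real one:

```python
def best_prefix(tags, max_tokens, token_count=lambda t: len(t.split())):
    """Binary-search for the largest prefix of `tags` that fits the budget."""
    lo, hi, best = 0, len(tags), None
    while lo <= hi:
        mid = (lo + hi) // 2
        tree = "\n".join(tags[:mid])
        if token_count(tree) < max_tokens:
            best = tree      # fits: try a larger prefix
            lo = mid + 1
        else:
            hi = mid - 1     # too big: try a smaller prefix
    return best

tags = ["def foo", "def bar", "def baz", "def qux"]
print(best_prefix(tags, max_tokens=5))
```

The monotonic relationship (more tags means more tokens) is what makes binary search valid here; a linear scan would also work but costs more token-counting calls.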


In get_ranked_tags, Aider uses Tree-sitter to parse each file and extract a list of tags like:

[Tag(rel_fname='app/we.py', fname='/path/app/we.py', line=6, name='we', kind='def'),
 Tag(rel_fname='app/we.py', fname='/path/app/we.py', line=7, name='__init__', kind='def'),
 Tag(rel_fname='app/we.py', fname='/path/app/we.py', line=8, name='super', kind='ref'),
 Tag(rel_fname='app/we.py', fname='/path/app/we.py', line=8, name='__init__', kind='ref')]
  • def indicates definitions in this file.
  • ref indicates a reference to a name, which may be defined in the same file or in another file.
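These tags are what feed the dependency graph shown earlier: files that reference a name get edges to files that define it, and PageRank then ranks the definitions. Here is a small sketch using networkx, as the real code does; the tag data is made up:

```python
# Build a reference graph from (file, name, kind) tags and rank files.
from collections import Counter, defaultdict
import networkx as nx

tags = [
    ("app/we.py", "we", "def"),
    ("app/main.py", "we", "ref"),
    ("app/main.py", "we", "ref"),
]

defines = defaultdict(set)
references = defaultdict(list)
for fname, name, kind in tags:
    if kind == "def":
        defines[name].add(fname)
    else:
        references[name].append(fname)

G = nx.MultiDiGraph()
for ident in set(defines) & set(references):
    for referencer, num in Counter(references[ident]).items():
        for definer in defines[ident]:
            # Edge weight = how many times the referencer mentions the name.
            G.add_edge(referencer, definer, weight=num, ident=ident)

ranked = nx.pagerank(G, weight="weight")
print(max(ranked, key=ranked.get))  # the defining file ranks highest
```

Because edges point from referencers to definers, rank flows toward heavily-referenced definitions, which is exactly what you want to surface in a limited context window.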

Summary

Every software engineer dreams of having AI assist with their work, and Aider partially achieves this. One major challenge for any similar project is the context-length limitation of LLMs, which prevents including an entire repository. Aider's repo map condenses the repository into a compact context for the LLM, which is a very clear and effective solution.
