Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

vecbench: enable benchmarking against CRDB vector index #142480

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

andy-kimball
Copy link
Contributor

Refactor vecbench to use a new VectorProvider index that abstracts two different vector indexing implementations:

  1. A MemProvider that wraps the existing in-memory vector index and persists its state to a file on disk.
  2. A new SQLProvider that runs against CockroachDB and uses SQL to create, insert into, and search vector indexes.

Epic: CRDB-42943

Release note: None

Copy link

blathers-crl bot commented Mar 7, 2025

Your pull request contains more than 1000 changes. It is strongly encouraged to split big PRs into smaller chunks.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@cockroach-teamcity
Copy link
Member

This change is Reviewable

Copy link

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR Overview

This PR refactors vecbench to support benchmarking against the CockroachDB vector index by introducing a new VectorProvider interface with two implementations (SQLProvider and MemProvider). Key changes include:

  • Adding progress reporting via an updated ProgressWriter.
  • Implementing the VectorProvider interface in both SQLProvider and MemProvider.
  • Updating vector set functionality with a new Slice method and corresponding tests.
  • Minor improvements in transaction and fixup worker logic.

Reviewed Changes

File Description
pkg/cmd/vecbench/progress_writer.go Adds progress reporting with a progress writer implementation
pkg/cmd/vecbench/vector_provider.go Introduces the VectorProvider interface
pkg/cmd/vecbench/sql_provider.go Implements SQLProvider for vector indexing
pkg/cmd/vecbench/mem_provider.go Implements MemProvider with in-memory vector store and persistence
pkg/util/vector/vector_set.go Adds a documented Slice method for subsetting vector sets
pkg/util/vector/vector_set_test.go Adds tests for the new Slice method
pkg/sql/vecindex/cspann/fixup_worker.go Improves error handling in deferred transaction finalization
pkg/sql/vecindex/cspann/index_test.go Adjusts goroutine wait handling in index insertion tests
pkg/sql/vecindex/vecstore/store_txn.go Adds notes to clarify scan behavior during transactional operations
pkg/sql/vecindex/cspann/memstore/memstore_txn.go Enhances transaction time advancement handling

Copilot reviewed 11 out of 11 changed files in this pull request and generated 1 comment.


var idxCtx cspann.Context
idxCtx.Init(txn)
for i := range vectors.Count {
Copy link
Preview

Copilot AI Mar 7, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The loop iteration 'for i := range vectors.Count' is incorrect since 'vectors.Count' is an integer. Use a traditional for loop such as 'for i := 0; i < vectors.Count; i++' instead.

Suggested change
for i := range vectors.Count {
for i := 0; i < vectors.Count; i++ {

Copilot is powered by AI, so mistakes are possible. Review output carefully before use.

Positive Feedback
Negative Feedback

Provide additional feedback

Please help us improve GitHub Copilot by sharing more details about this comment.

Please select one or more of the options
Refactor vecbench to use a new VectorProvider index that abstracts two
different vector indexing implementations:

1. A MemProvider that wraps the existing in-memory vector index and
   persists its state to a file on disk.
2. A new SQLProvider that runs against CockroachDB and uses SQL to create,
   insert into, and search vector indexes.

Epic: CRDB-42943

Release note: None
Copy link
Collaborator

@DrewKimball DrewKimball left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewed 11 of 11 files at r1, 1 of 1 files at r2, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @andy-kimball and @mw5h)


pkg/cmd/vecbench/main.go line 416 at r2 (raw file):

			panic(err)
		}
		return provider, errors.Wrap(err, "creating in-memory provider")

Shouldn't we return nil for the error here?


pkg/cmd/vecbench/sql_provider.go line 127 at r2 (raw file):

		}
		defer func() {
			_ = tx.Rollback(ctx)

Should the Rollback be unconditional?

Copy link
Contributor

@mw5h mw5h left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 11 of 11 files at r1, 1 of 1 files at r2, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @andy-kimball and @DrewKimball)


pkg/cmd/vecbench/main.go line 416 at r2 (raw file):

Previously, DrewKimball (Drew Kimball) wrote…

Shouldn't we return nil for the error here?

Also, NewMemProvider can only return nil for the error, so that return value is a bit misleading.


pkg/cmd/vecbench/sql_provider.go line 127 at r2 (raw file):

Previously, DrewKimball (Drew Kimball) wrote…

Should the Rollback be unconditional?

It's probably fine to rollback after a commit, but a check on the return value would be cleaner.


pkg/cmd/vecbench/sql_provider.go line 84 at r2 (raw file):

// Load implements the VectorProvider interface.
func (s *SQLProvider) Load(ctx context.Context) (bool, error) {
	// Assume that if the table exists, that it contains the dataset rows.

nit: Maybe in a future version we could write a status row to a table when we get done loading data so that we have more confidence that the loaded table isn't truncated in some way.


pkg/cmd/vecbench/main.go line 95 at r2 (raw file):

	Test      vector.Set
	Neighbors [][]int64
}

nit: I personally find comments on data structures to be very high value compared to almost any other kind.

Code quote:

type searchData struct {
	Count     int
	Test      vector.Set
	Neighbors [][]int64
}

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants