Fuzzy Find a Github Repository - Part Deux

Remember our first installment?  We iterated on building a command-line tool for quickly selecting a private Github repo to clone.  We built a solution using fish, fzf, and Go, then moved the Github API interaction into a purpose-built Go app leveraging Github's V4 GraphQL API.  That gained us a lot of speed.  However, after some usage, I've decided it's still a little slow for my taste when fetching a large list of repositories, which requires paging through the Github API.

One thing that occurred to me is that the list of repositories doesn't change often, and I'm rarely cloning a repo that is brand new.   Why not cache the results and leverage that cache to present choices more quickly?

If we are attempting to choose a repo that's already in the cache, the UI will be lightning fast.  If we're choosing a repo that is newly added and it takes a little longer to show up in the list, then we are no worse off than we were without the cache.

This all seems completely reasonable at first glance.  However, there are a few tendrils:

  1. Cache + Updates: fzf only takes in a list of choices to present, supplied via stdin/pipe.  There are not many good examples of 'feeding' fzf from both a pre-cached set of results -and- a set of updates to that cache as they roll in gradually over 3-10 seconds.
  2. Improve UI Performance: One of the main reasons we chose fzf was UI performance.  To gain performance over what we have already built, we need to be able to immediately feed a decent cache of results to fzf.  Cache updates also need to be fed to fzf so we don't break the experience of typing ahead to match against an entry which has not been cached, so we have to think a little bit about cache invalidation and updating here as well.
  3. Duplicates: In our case, when a persisted cache is one source and an incremental update from API responses is another, it is extremely likely there will be a lot of overlap between them.  fzf does not offer a built-in way to de-dupe.  Showing duplicates to the user would be annoying at best and confusing at worst.  Additionally, duplicates won't be adjacent, or even guaranteed to be temporally close, so tools like sort -u (which must read all input before emitting anything) or uniq (which only removes adjacent duplicates) would not help; they would hold up the whole pipeline and decimate UI performance and user experience.

Let's tackle these in order of how they will build on each other:

Duplicates

One wrinkle is that while the cache may be immediately available, the results of a cache update operation, which fetches results from the Github API, will roll in over time as we fetch several pages of results.  Additionally, the entry ordering from the API fetch is not guaranteed to match the sort order of what is in the cache.  If we're going to aggressively feed the combination of the cache and cache updates into fzf as they arrive, we need a way to de-duplicate out-of-order results.

For example, consider the following sequence of events populating a list of repository choices in fzf, where a cache returns immediately.

relative timing | source                | repository  | is 'dupe'?
immediate       | cache                 | repo1       | -
immediate       | cache                 | repo2       | -
immediate       | cache                 | repo3...101 | -
2 seconds       | API fetch, request #1 | repo1       | yes
2 seconds       | API fetch, request #1 | repo2...100 | yes
4 seconds       | API fetch, request #2 | repo101     | yes
4 seconds       | API fetch, request #2 | repo102     | no

If we don't do anything to filter out repo1 through repo101 as they roll in from the API fetch, then fzf will present dupes to the user.

However, we can't wait or do any smart diffing against the cache, since we're dealing with a page of results at a time and need to send entries into the UI as quickly as possible, without waiting on subsequent API requests or having the full data set.

There are several existing command-line filters for non-adjacent unique lines.  The most barebones and naive implementation of such a filter generally leverages a map/dictionary for duplicate checking.  Since we're not dealing with more than a few thousand entries, that sort of implementation should be fine.  The scope of such a tool makes it a great candidate for writing some Go code and learning more about writing tests in Go.

Our implementation of a non-adjacent uniq tool will function roughly as follows:

  1. Read lines from stdin.
  2. If a line is new, print it out.

The key logic to implement here is a function that will answer the question: "Is a line new?"

Go has a widely accepted idiom called Table Tests.  Let's express our expectations for this tool in tests.  I'm going to present all the tests here without all the iterations that got me there, for the sake of brevity.

type lineset map[string]struct{}

tests := []struct {
    name       string
    pastlines  lineset
    line       string
    new        bool
}{
    {"No Entries - Add one", lineset{}, "test", true},
    {"Single Entry - Add dupe", lineset{"test": {}}, "test", false},
    {"Multiple Entries - adjacent tail dupe", lineset{"test": {}, "test2": {}}, "test2", false},
    {"Multiple Entries - non-adjacent head dupe", lineset{"test": {}, "test2": {}}, "test", false},
    {"Multiple Entries - non-adjacent dupe in middle", lineset{"test": {}, "test2": {}, "test3": {}}, "test2", false},
}

⬆️ Here we lay out our expectations for tests of a given name, in the tests struct.  Given the accumulated state in pastlines, we expect a function to answer whether line is new.  With that structure in place, we then supply 5 named test cases.

A couple of points of interest here:

  1. Empty structs ( {}'s ) are a Go-ism; see the quick sketch just after this list.  The fact that they are here for the initialization of each test case's starting state feels like implementation leaking into the test.  I'll look to improve that in the future.
  2. We define a type 'lineset' to reduce the curly-bracket clutter of leveraging empty structs '{}' inside map literal definitions (which also use {}'s).
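
As a quick aside, here is a minimal sketch (mine, not from the tool itself) of why the empty struct is the conventional value type for set-like maps in Go: it occupies zero bytes, so the map effectively pays only for its keys.

package main

import (
    "fmt"
    "unsafe"
)

func main() {
    // An empty struct has zero size, so map[string]struct{} behaves like a
    // set without allocating anything for the values.
    fmt.Println(unsafe.Sizeof(struct{}{})) // prints 0

    set := map[string]struct{}{"repo1": {}, "repo2": {}}
    _, present := set["repo1"]
    fmt.Println(present) // prints true
}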

A driver loops over the tests setting up and executing each test from the table:

for _, tt := range tests {
    t.Run(tt.name, func(t *testing.T) {
        us := &UniqueStrings{
            entries: tt.pastlines,
        }
        if got := us.add(tt.line); got != tt.new {
            t.Errorf("UniqueStrings.add() = %v, expected %v", got, tt.new)
        }
    })
}
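
For completeness, both fragments above live inside a single test function in a _test.go file.  A minimal wrapper might look like the sketch below; the function name is my own choice for illustration, since Go's test runner picks up any function of the form TestXxx(t *testing.T).

package main

import "testing"

// Skeleton only: the table and driver shown above slot into the body.
func TestUniqueStringsAdd(t *testing.T) {
    type lineset map[string]struct{}

    // tests := []struct{ ... }{ ... }   // the table of cases shown above
    // for _, tt := range tests { ... }  // the driver loop shown above
}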

One thing I found really helpful when iterating on this table test was the Go autotest plugin for VS Code.  You 'pin' the test you're working on, and then every time you save the file, the test is re-executed.

You can't beat their hilarious tagline either:

Have you ever been banging your head against the wall, trying to get that one test to pass? This VS Code extension for Go projects makes the head banging faster!

My first cut at an implementation to satisfy these tests tracks every line passed in via a map of strings (keys) to empty structs (values).  A Go map has an efficient way to test whether a given key is already present, because it uses hashing to optimize lookups.

So a bare bones implementation is a map:

entries map[string]struct{} // string keys, zero-width/empty struct values

and a method that reports whether a line is new, given the entries map recording prior lines:

func (us *UniqueStrings) add(line string) bool {
    if _, present := us.entries[line]; !present {
        us.entries[line] = struct{}{}
        return true
    }

    return false
}
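
Tying it together: the main program loop further down constructs the tracker via NewUniqueStrings.  A minimal sketch of the type and constructor, assuming nothing more than the entries map is needed, would be:

// UniqueStrings tracks every line it has seen so far.
type UniqueStrings struct {
    entries map[string]struct{} // string keys, zero-width/empty struct values
}

// NewUniqueStrings returns a tracker with no prior lines recorded.
func NewUniqueStrings() *UniqueStrings {
    return &UniqueStrings{entries: map[string]struct{}{}}
}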

Ok ok, let's see it in action!

Automated Unit Tests Pass!

Now let's use it in our main program loop:

lines := NewUniqueStrings()
scanner := bufio.NewScanner(os.Stdin)

for scanner.Scan() {
    line := scanner.Text()

    if lines.add(line) {
        fmt.Println(line)
    }
}
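
One detail the loop glosses over: bufio.Scanner stops silently on read errors (and on lines longer than its default 64 KB token limit), so a slightly more defensive version would check scanner.Err() once the loop exits.  A minimal sketch of that check:

// After the Scan loop: surface any read error instead of exiting silently.
if err := scanner.Err(); err != nil {
    fmt.Fprintln(os.Stderr, "error reading stdin:", err)
    os.Exit(1)
}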

And now, let's vet it live to make sure all the glue is in the right place:

YES!

Ok.... That was fun, and we now have a purpose-built tool for filtering a stream down to its non-adjacent unique lines.

Present Cache and Updates

Given the sequence of events:

  1. Retrieve a list of repositories from the persisted cache and send it to the UI (fzf)
  2. Retrieve a list of repositories from Github, update the cache, and send updates to the UI (fzf) progressively, de-duplicating along the way

To start off, here is our pipeline sending a list to fzf, before caching:

listrepo_gql $GITHUB_TOKEN $ADD_PUB_GITHUB_ORG \
  | fzf \
  | read -l repo

A first pass at leveraging a cache looks like:

cat $CACHE 2>/dev/null; listrepo_gql $GITHUB_TOKEN $ADD_PUB_GITHUB_ORG \
  | tee $CACHE \
  | fzf \
  | read -l repo

That is:

  1. Render (cat) the cache to stdout; fetch the repo list (listrepo_gql) to stdout.  Pipe (|) stdout to the next process
  2. Replicate (tee) to a file $CACHE (the cache update), -and- pipe to the next process
  3. fzf renders the UI, the user selects, then fzf renders the selection to stdout
  4. The selected line is stored in a shell variable (read -l repo)

This works ok, but has two problems: Duplicates and Stalling.  

There are duplicates because the cache has significant overlap (sometimes 100%) with the list that comes back from the Github API.  Here is where the nauniq tool we built earlier comes in.  Insert it in the chain just before fzf to eliminate the overlap between the cache and the live feed.

cat $CACHE 2>/dev/null; listrepo_gql $GITHUB_TOKEN $ADD_PUB_GITHUB_ORG \
  | tee $CACHE \
  | nauniq \
  | fzf \
  | read -l repo

To fix the stalling, we're going to need to dig a bit deeper...

Improve UI Performance

In bash, the stalling can be eliminated by grouping the execution of the sub-shells to feed an unnamed pipe:

(cat $CACHE 2>/dev/null; listrepo_gql $GITHUB_TOKEN $ADD_PUB_GITHUB_ORG) \
  | tee $CACHE \
  | nauniq \
  | fzf \
  | read -l repo

In fish, those grouped commands aren't sub-shells/forks, and tee ends up stalling for a moment waiting on the second command to start filling the pipe (buffering? EOFs? not sure).  In any case, it's an opportunity to learn about named pipes.

Assuming a unique name for a pipe $QUEUE and filename for a cache $CACHE, we can make a named pipe:

mkfifo $QUEUE

queue up a background job to feed all cached values to it:

cat $CACHE 2>/dev/null > $QUEUE &

queue up a background job to feed rolling cache updates to it:

listrepo_gql $GITHUB_TOKEN $ADD_PUB_GITHUB_ORG | tee $CACHE > $QUEUE &

and then connect that named pipe as the data source for nauniq + fzf, launched in the foreground:

cat $QUEUE | nauniq | fzf

Voilà!  

Instant results in the UI, which feels like great UI performance for most usages:

fclone UI performance with caching

. . . and the uncached results roll in gradually, updating the list of options as they are received, as indicated by the spinner at the bottom.  All the while, the user can type ahead to fuzzy-match against the list of choices without waiting on anything to finish.

Known Issues

It's always worth mentioning known cons and design compromises, to distinguish known issues from misses and unknowns.

  • Removed Items: If a repo is removed from Github, it will not disappear from the list of choices until a full cache update has completed and replaced the cache written by a prior cycle.  This is partly due to the design of this solution, and partly due to the fact that there is no way to send a command to fzf to remove an item from the list it is presenting.  I don't consider it a big issue because I'm never intentionally trying to clone a repository that has been dropped, and the worst case is a rare experience of it erroring out, once.
  • nauniq holds the repo list in memory to de-dupe.  This should be fine for anyone dealing with a couple hundred or even several thousand repos.  I haven't really tested beyond that, i.e. a Microsoft-sized number of Github repos.
  • The named pipe approach, with two processes feeding it, assumes the disk-based cache finishes feeding fzf before the first API request returns.  This avoids having to use something like parallel with line buffering.  It just works because reading from disk always finishes faster than getting the first response back from the Github API.