Bazaar Performance Roadmap

Contents

1   About the performance roadmap

1.1   What should be in the roadmap?

A good roadmap provides a place for contributors to look for tasks, it provides users with a sense of when we will fix things that are affecting them, and it also allows us all to agree about where we are headed. So the roadmap should contain enough things to let all this happen.

I think that it needs to contain the analysis work which is required, a list of the use cases to be optimised, the disk changes required, and the broad sense of the api changes required. It also needs to list the inter-dependencies between these things: we should aim for a large surface area of 'ready to be worked on' items, that makes it easy to improve performance without having to work in lockstep with other developers.

Clearly the analysis step is an immediate bottleneck - we cannot tell if an optimisation for use case A is a pessimism for use case B until we have analysed both A and B. I propose that we complete the analysis of say a dozen core use cases end to end during the upcoming sprint in London. We should then be able to fork() for much of the detailed design work and regroup with disk and api changes shortly thereafter.

I suspect that clarity of layering will make a big difference to developer parallelism, so another proposal I have is for us to look at the APIs for Branch and Repository in London in the light of what we have learnt over the last years.

1.2   What should the final system look like, how is it different to what we have today?

One of the things I like the most about bzr is its rich library API, and I've heard this from numerous other folk. So anything that will remove that should be considered a last resort.

Similarly our relatively excellent cross platform support is critical for projects that are themselves cross platform, and thats a considerable number these days.

And of course, our focus on doing the right thing is what differentiates us from some of the other VCS's, so we should be focusing on doing the right thing quickly :).

What we have today though has grown organically in response to us identifying bottlenecks over several iterations of back end storage, branch metadata and the local tree representation. I think we are largely past that and able to describe the ideal characteristics of the major actors in the system - primarily Tree, Branch, Repository - based on what we have learnt.

1.3   What use cases should be covered?

My list of use cases is probably not complete - its just the ones I happen to see a lot :). I think each should be analysed comprehensively so we dont need to say 'push over the network' - its implied in the scaling analysis that both semantic and file operation latency will be considered.

These use cases are ordered by roughly the ease of benchmarking, and the frequency of use. This ordering is so that when people are comparing bzr they are going to get use cases we have optimised; and so that as we speed things up our existing users will have the things they do the most optimised.

  • status tree
  • status subtree
  • commit
  • commit to a bound branch
  • incremental push/pull
  • log
  • log path
  • add
  • initial push or pull [both to a new repo and an existing repo with different data in it]
  • diff tree
  • diff subtree
  • revert tree
  • revert subtree
  • merge from a branch
  • merge from a bundle
  • annotate
  • create a bundle against a branch
  • uncommit
  • missing
  • update
  • cbranch

1.4   How is development on the roadmap coordinated?

I think we should hold regular get-togethers (on IRC) to coordinate on our progress, because this is a big task and its a lot easier to start helping out some area which is having trouble if we have kept in contact about each areas progress. This might be weekly or fortnightly or some such.

we need a shared space to record the results of the analysis and the roadmap as we go forward. Given that we'll need to update these as new features are considered, I propose that we use doc/design as a working space, and as we analyse use cases we include them in there - including the normal review process for each patch. We also need documentation about doing performance tuning - not the minutiae, though that is needed, but about how to effective choose things to optimise which will give the best return on time spent - that is what the roadmap should help with, but this looks to be a large project and an overview will be of great assistance I think. We want to help everyone that wishes to contribute to performance to do so effectively.

Finally, its important to note that coding is not the only contribution - testing, giving feedback on current performance, helping with the analysis are all extremely important tasks too and we probably want to have clear markers of where that should be done to encourage such contributions.

1.5   Planned changes to the bzr core

Delivering the best possible performance requires changing the bzr core design from that present in 0.16. Some of these changes are incremental and can be done with no impact on disk format. Many of them however do require changes to the disk format, and these can be broken into two sets of changes, those which are sufficiently close to the model bzr uses today to interoperate with the 0.16 disk formats, and those that are not able to interoperate with the 0.16 disk formats - specifically some planned changes may result in data which cannot be exported to bzr 0.16's disk formats and then imported back to the new format without losing critical information. If/when this takes place it will be essentially a migration for users to switch from their bzr 0.16 repository to a bzr that supports them. We plan to batch all such changes into one large 'experimental' repository format, which will be complete stable and usable before we migrate it to become a supported format. Getting new versions of bzr in widespread use at that time will be very important, otherwise the user base may be split in two - users that have upgraded and users that have not.

The following changes are grouped according to their compatability impact: library only, disk format but interoperable, disk format interoperability unknown, and disk format, not interoperable.

1.5.1   Library changes

These changes will change bzrlib's API but will not affect the disk format and thus do not pose a significant migration issue.

  • For our 20 core use cases, we plan to add targeted API's to bzrlib that are repository-representation agnostic. These will instead reflect the shape of data access most optimal for that case.
  • Deprecate 'versioned files' as a library concept. Instead of asking for information about a file-over-time as a special case, we will move to an API that assumes less coupling between the historical information and the ability to obtain texts/deltas etc. Specifically, we need to remove all API's that act in terms of on disk representation except those within a given repository implementation.
  • Create a validator for revisions that is more amenable to use by other parts of the code base than just the gpg signing facility. This can be done today without changing disk, possibly with a performance hit until the disk formats match the validatory logic. It will be hard to tell if we have the right routine for that until all the disk changes are complete, so while this is a library only change, its likely one that will be delayed to near the end of the process.
  • Add an explicit API for managing cached annotations. While annotations are considered a cache this is not exposed in such a way that cache operations like 'drop the cache' can be performed. On current disk formats the cache is mandatory, but an API to manage would allow refreshing of the cache (e.g. after ghosts are filled in in baz conversions).
  • Use the _iter_changes API to perform merges. This is a small change that may remove the need to use inventories in merge, making a dramatic difference to merge performance once the tree shape comparison optimisations are implemented.
  • Create a network-efficient revision graph API. This is the logic at the start of push and pull operations, which currently scales O(graph size). Fixing the scaling can be done, but there are tradeoffs to latency and performance to consider, making it a little tricky to get right.
  • Working tree disk operation ordering. We plan to change the order in which some operations are done (specifically TreeTransform ones) to improve performance. There is already a 66% performance boost in that area going through review.
  • Stop requiring full memory copies of files. Currently bzr requires that it can hold 3 copies of any file its versioning in memory. Solving this is tricky, particularly without performance regressions on small files, but without solving it versioning of .iso and other large objects will continue to be extremely painful.
  • Add an API for per-file graph access that alllows incremental access and is suitable for on-demand generation if desired.
  • Repository stacking API. Allowing multiple databases to be stacked to give a single 'repository' will allow implementation of some long desired features like history horizons, and bundle usage where the bundle is not added to the local repository just to examine its contents.
  • Revision data manipulation API. We need a single streaming API for adding data to or getting it from a repository. This will need to allow hints such as 'optimise for size', or 'optimise for fast-addition' to meet the various users planned, but it is a core part of the library today, and its not sufficiently clean to let us simplify/remove a lot of related code today.

1.5.2   Interoperable disk changes

  • New container format to allow single-file description of multiple named objects. This will provide the basis for transmission of revisions over the network, the new bundle format, and possibly a new repository format as well.
  • Separate the annotation cache from the storage of actual file texts and make the annotation style, and when to do it, configurable. This will reduce data sent over the wire when repositories have had 'needs-annotations' turned off, which very large trees may choose to do - generating just-in-time annotations may be desirable for those trees (even when performing annotation based merges).
  • Repository disk operation ordering. The order that tasks access data within the repository and the layout of the data should be harmonised. This will require disk format changes but does not inherently alter the model, so its straight forward to export from a repository that has been optimised in this way to a 0.16 based repository.
  • Inventory representation. An inventory is a logical description of the shape of a version controlled tree. Currently we operate on the whole inventory as a tree broken down per directory, but we store it as a flat file. This scale very poorly as even a minor change between inventories requires us to scan the entire file, and in large trees this is many megabytes of data to consider. We are investigating the exact form, but the intent is to change the serialisation of inventories so that comparing two inventories can be done in some smaller time - e.g. O(log N) scaling. Whatever form this takes, a repository that can export it directly will be able to perform operations between two historical trees much more efficiently than the current repositories.
  • Delta storage optimisation. We plan to change the delta storage logic to use a binary delta like xdelta rather than using line based deltas from python. These binary deltas could be done along ancestry ordering, or other arbitrary patterns chosen for their intended use. Line based deltas will still be created for cached annotations. This is still under some discussion. http://bazaar-vcs.org/PerformanceRoadmap/Xdelta
  • Greatest distance from origin cache. This is a possible change to introduce, but it may be unnecessary - listed here for completeness till it has been established as [un]needed.

1.5.3   Possibly non-interoperable disk changes

  • Removing of derivable data from the core of bzr. Much of the data that bzr stores is derivable from the users source files. For instance the annotations that record who introduced a line. Given the full history for a repository we can recreate that at any time. We want to remove the dependence of the core of bzr on any data that is derivable, because doing this will give us the freedom to:

    • Improve the derivation algorithm over time.
    • Deal with bugs in the derivation algorithms without having 'corrupt repositories' or such things.

    However, some of the data that is technically derived, like the per-file merge graph, is both considered core, and can be generated differently when certain circumstances arive, by bzr 0.16. Any change to the 'core' status of that data will discard data that cannot be recreated and thus lead to the inability to export from a format where that is derived data to bzr 0.16's formats without errors occuring in those circumstances. Some of the data that may be considered for this includes:

    • Per file merge graphs
    • Annotations

1.5.4   Non-interoperable disk changes

  • Drop the per-file merge graph 'cache' currently held in the FILE-ID.kndx files. A specific case of removing derivable data, this may allow smaller inventory metadata and also make it easier to allow two different trees (in terms of last-change made, e.g. if one is a working tree) to be compared using a hash-tree style approach.
  • Use hash based names for some objects in the bzr database. Because it would force total-knowledge-of-history on the graph revision objects will not be namable via hash's and neither will revisio signatures. Other than that though we can in principle use hash's e.g. SHA1 for everything else. There are many unanswered questions about hash based naming related to locality of reference impacts, which need to be answered before this becomes a definite item.

2   Analysis of use cases

2.1   Analysing a specific use case

The analysis of a use case needs to provide as outputs:
  • The functional requirements that the use case has to satisfy.
  • The file level operations and access patterns that will give the best performance.
  • A low friction API which will allow the use case to be implemented.
  • The release of bzr (and thus the supported features) for which the analysis was performed. The feature set of bzr defines the access patterns and data required to implement any use case. So when we add features, their design changes the requirements for the parts of the system they alter, so we need to re-analyse use cases when bzr's feature set changes. If future plans are considered in the analysis with the intention of avoiding rework, these should also be mentioned.

2.2   Performing the analysis

The analysis needs to be able to define the characteristics of the involved disk storage and APIs. That means we need to examine the data required for the operation, in what order it is required, on both the read and write sides, and how that needs to be presented to be consistent with our layering.

As a quick example: 'annotation of a file requires the file id looked up from the tree, the basis revision id from the tree, and then the text of that fileid-revisionid pair along with the creating revision id allocated to each line, and the dotted revision number of each of those revision ids.' All three of our key domain objects are involved here, but we haven't defined any characteristics of the api or disk facilities yet. We could then do that by saying something like 'the file-id lookup should degrade gracefully as trees become huge. The tree basis id should be constant time. Retrieval of the annotated text should be roughly constant for any text of the same size regardless of the number of revisions contributing to its content. Mapping of the revision ids to dotted revnos could be done as the text is retrieved, but its completely fine to post-process the annotated text to obtain dotted-revnos.'

2.3   What factors should be considered?

Obviously, those that will make for an extremely fast system :). There are many possible factors, but the ones I think are most interesting to design with are:

  • baseline overhead:

    • The time to get bzr ready to begin the use case.
  • scaling: how does performance change when any of the follow aspects of the system are ratched massively up or down:

    • number of files/dirs/symlinks/subtrees in a tree (both working and revision trees)
    • size of any particular file
    • number of elements within a single directory
    • length of symlinks
    • number of changes to any file over time (subordinately also the number of merges of the file)
    • number of commits in the ancestry of a branch (subordinately also the number of merges)
    • number of revisions in a repository
    • number of fileids in a repository
    • number of ghosts in a given graph (revision or per-file)
    • number of branches in a repository
    • number of concurrent readers for a tree/branch/repository
    • number of concurrent writers for objects that support that.
    • latency to perform file operations (e.g. slow disks, network file systems, our VFS layer and FTP/SFTP/etc)
    • bandwidth to the disk storage
    • latency to perform semantic operations (hpss specific)
    • bandwidth when performing semantic operations.
  • locality of reference: If an operation requires data that is located within a small region at any point, we often get better performance than with an implementation of the same operation that requires the same amount of data but with a lower locality of reference. Its fairly tricky to add locality of reference after the fact, so I think its worth considering up front.

Using these factors, to the annotate example we can add that its reasonable to do two 'semantic' round trips to the local tree, one to the branch object, and two to the repository. In file-operation level measurements, in an ideal world there would be no more than one round trip for each semantic operation. What there must not be is one round trip per revision involved in the revisionid->dotted number mapping, nor per each revision id attributed to a line in the text.

Not all the items mentioned above are created equal. The analysis should include the parameters considered and the common case values for each - the optimisation should be around the common cases not around the exceptions.

For instance, we have a smart server now; file level operations are relatively low latency and we should use that as the common case. At this point we intend to preserve the performance of the dumb protocol networking, but focus on improving network performance via the smart server and thus escape the file-level operation latency considerations.

Many performance problems only become visible when changing the scaling knobs upwards to large trees. On small trees its our baseline performance that drives incremental improvements; on large trees its the amount of processing per item that drives performance. A significant goal therefore is to keep the amouont of data to be processed under control. Ideally we can scale in a sublinear fashion for all operations, but we MUST NOT scale even linearly for operations that invoke a latency multiplier. For example, reading a file on disk requires finding the inode for the file, then the block with the data and returning the contents. Due to directory grouping logic we pay a massive price to read files if we do not group the reads of files within the same directory.

3   Use cases

3.1   Initial push / pull

3.1.1   Optimal case

(a motivating example of ultimate performance) Assume there is a file with exactly the right data in compressed form. This may be a tarred branch, a bundle, or a blob format. Performance in this case scales with the size of the file.

3.1.2   Disk case

Assume current repo format. Attempt to achieve parity with cp -r. Read each file only 1 time.

  • read knit graph for revisions
  • write filtered copy of revision knit O(d+a)
  • write filtered copy of knit index O(d)
  • Open knit index for inventory
  • Write a filtered copy of inventory knit and simultaneously not all referenced file-ids O(b+d)
  • Write filtered copy of inventory knit index O(d)
  • For each referenced file-id:
    • Open knit index for each file knit O(e)
    • If acceptable threshold of irrelevant data hard-link O(f)
    • Otherwise write filtered copy of text knit and simultaneously write the fulltext to tree transform O(h)
  • Write format markers O(1)
a:size of aggregate revision metadata
b:size of inventory changes for all revisions
c:size of text changes for all files and all revisions (e * g)
d:number of relevant revisions
e:number of relevant versioned files
f:size of the particular versioned file knit index
g:size of the filtered versioned file knit
h:size of the versioned file fulltext
i:size of the largest file fulltext

3.1.3   Smart Network Case

3.1.3.1   Phase 1

Push: ask if there is a repository, and if not, what formats are okay Pull: Nothing

3.1.3.2   Phase 2

Push: send initial push command, streaming data in acceptable format, following disk case strategy Pull: receive initial pull command, specifying format

Pull client complexity: O(a), memory cost O(1) Push client complexity: procesing and memory cost same as disk case

3.1.4   Dumb Network Case

Pull: same as disk case, but request all file knit indices at once and request al file knits at once. Push: same as disk case, but write all files at once.

3.1.5   Wants

  • Read partial graph
  • Read multiple segments of multiple files on http and sftp
  • Write multiple files over SFTP

3.2   Incremental push/pull

This use case covers pulling in or pushing out some number of revisions which is typically a small fraction of the number already present in the target repository. Pushing and pulling are defined as branch level operations for ease of interaction with VCS systems that have no repository abstraction (such as bzr-svn or GNU Arch) but within bzrlib's core they are currently the responsibility of the Repository object.

3.2.1   Functional Requirements

A push or pull operation must:
  • Copy all the data to reconstruct the selected revisions in the target branch. This is the goal of push and pull after all.
  • Reject corrupt data. As bzr has no innate mechanism for discarding corrupted data, corrupted data should not be incorporated accidentally.

3.2.2   Factors which should add work for push/pull

  • Baseline overhead: The time to connect to both branches.
  • Actual new data in the revisions being pulled (drives the amount of data to move around, includes the commit messages etc)
  • Number of revisions in the two repositories (scaling affects the determination of what revisions to move around).

3.2.3   Push/pull overview

  1. New data is identified in the source repository.
  2. That data is read from the source repository.
  3. The same data is verified and written to the target repository in such a manner that its not visible to readers until its ready for use.

3.2.3.1   New data identification

We have a single top level data object: revisions. Everything else is subordinate to revisions, so determining the revisions to propogate should be all thats needed. This depends on revisions with partial data - such as those with no signature - being flagged in some efficient manner.

We could do this in two manners: determine revisions to sync and signatures to sync in two passes, or change the 'value' of a revision implicitly when the signature is different. E.g. by using merkle hash trees with the signature data a separate component the signatures will naturally be identified to sync.

We want to only exchange data proportional to the number of new revisions and signatures in the system though. One way to achieve this for revisions is to walk the graph out from the desired tips until the surface area intersection is found. For signatures a set difference seems to be needed as there is no DAG of signatures: the presence of one has no implications on the presence of another, so a full pass over the set of signatures would be required to confirm no new signatures are needed (let alone replaced signatures).

IFF we can determine 'new revisions' and 'new signatures' without full graph access then we can scale acceptable for push and pull.

Ghosts are revisions which are not present in a particular repository. Filling ghosts refers to removing ghosts in the target repository when the ghost is present in the source repository. Filling ghosts can be either an explicit or implicit action. The common case is no ghosts.

3.2.3.1.1   Set synchronisation approaches

A set synchronisation approach is one which synchronises two sets without regard for innate structure. This can be very efficient but requires adding a new node to be processed with every commit. Caching of the results of the various set based syncs I've seen is possible but because the data structures look different depending on the tip revision being synced up to the cache needs to be very complex. I recommend not using such an approach for the common case pull because of the failure to scale. We can use such an approach for synchronisation of new signatures and ghosts, which should be an explicit option in both cases.

3.2.3.1.2   DAG synchronisation approaches

A DAG based approach to synchronistion is one that uses the DAG structure to determine the difference in present nodes. It can as a result operate from the tip of the DAG backwards. A dag based approach should allow incremental access to data and not require a full-graph scan for incremental operations.

3.2.3.1.3   File level scaling

We should read roughly as much of the revision level graph as is needed from each repository to determine the node difference. If requested we should perform a detailed scan to pick up ghost revisions and revisions which have had signatures added. This should not be the default as it requires full history access in both cases.

Expected file IO and access pattern:

  • Common case: repo with many branches of one project, to the same.

    1. Source and Target branch tips read.
    2. Find the tip of each branch in their repo (will require reading some of the revision graph but is typically near the end of the graph).
    3. Read and parse increasing amounts of the revision graph until one is found to be a subset of the other, or a complete list of revisions to be transmitted is created.
  • Uncommon cases:

    1. Repositories with many projects or branches which are very old may require reading a lot of unrelated graph data.
    1. Initial push/pull scenarios should not require reading an entire graph.
3.2.3.1.4   API scaling
  1. Get branch tips.
  2. Determine one sided graph difference. To avoid obtaining a full graph over the wire this needs to be done without reference to the full graph, and with some logarthmic scaling algorithm. There are several already available for this.

With ghost and new-signature detection:

  • File IO access pattern will read the entire graph on the 'target' side - if no ghosts are present then stop, otherwise seek the new revisions on the source side with the regular algorithm and also explicitly search for the ghost points from the target; plus a set difference search is needed on signatures.
  • Semantic level can probably be tuned, but as its also complex I suggest deferring analysis for optimal behaviour of this use case.

3.2.3.2   Data reading

When transferring information about a revision the graph of data for the revision is walked: revision -> inventory, revision -> matching signature, inventory -> file ids:revision pairs.

3.2.3.2.1   File level scaling

As we're reading already committed data, as long as nothing is mutating data on disk reading should be race free. We will:

  • read each revision object
  • read the matching inventory delta
  • attempt to read a signature object
  • parse the inventory delta
  • read the fileid:revisionid compressed chunk for each line in the inventory delta

Theres no point validating that the data read is valid, as transmission through to the client writing the data might invalidate it; we need to validate before we write.

3.2.3.2.2   API scaling

Given that we have established the revisions needed, a single API call should suffice to obtain all data; the API should present the data in such an order that it can be validated as it arrives and thus not require large scale buffering on disk. Specifically each item of data should be validatable (e.g. for some file data we want the fileid:revisionid:validationhash + content).

3.2.3.3   Data Verification and writing

New data written to a repository should be completed intact when it is made visible. This suggests that either all the data for a revision must be made atomically visible (e.g. by renaming a single file) or the leaf nodes of the reference graph must become visible first.

Data is referred to via the following graph: revision -> revision revision -> signature revision -> inventory inventory -> fileid:revisionid fileid:revisionid -> fileid:revisionid

Data is verifiable via a different ordering: signature -> revision -> inventory -> fileid:revisionid texts.

We dont gpg verify each revision today; this analysis only speaks to hash verification of contents.

To validate a revision we need to validate the data it refers to. But to validate the contents of a revision we need the new texts in the inventory for the revision - to check a fileid:revisionid we need to know the expected sha1 of the full text and thus also need to read the delta chain to construct the text as we accept it to determine if its valid. Providing separate validators for the chosen representation would address this. e.g: For an inventory entry FILEID:REVISIONID we store the validator of the full text :SHA1:. If we also stored the validator of the chosen disk representation (:DELTASHA1:) we could validate the transmitted representation without expanding the delta in the common case. If that failed we could expand the delta chain and try against the full text validator, and finally fail. As different delta generators might generate different deltas, :DELTASHA1: should not become part of the revision validator, only the inventory disk encoding. In a related manner a transmission format that allowed cheap validation of content without applying locally stored deltas would be advantageous because no local reads would be incurred to validate new content. For instance, always sending a full text for any file, possibly with a delta-chain when transmitting multiple revisionids of the file, would allow this. (git pack-files have this property).

3.2.3.3.1   Overview summary

A single-file local format would allow safe atomic addition of data while allowing optimisal transmission order of data. Failing this the validation of data should be tuned to not require reading local texts during data addition even in the presence of delta chains. We should have transmission-validators separate from content validators that allow validation of the delta-transmitted form of objects.

3.2.3.3.2   File level scaling
  • Every new file text requires transmission and local serialisation.
  • Every commit requires transmission and storage of a revision, signature and inventory.

Thus 4000 commits to a 50000 path tree of 10 files on averages requires (with knits) between 26 writes (2*(3+10)) and 80006 (2*(4000*10 + 3)) writes. In all cases there are 4000 * 13 distinct objects to record.

Grouping data by fileid, content and metadata, gives the figures above. Data grouping:

  • File per full identifier (fileid:revisionid:meta|content): 104000
  • Delta-chain per object: object id count * constant overhead per object id (26 -> 80006)
  • Collation/pack file: 1
Performance for these depends heavily on implementation:
  • Using full ids we could name by validator or by id, giving best performance that depends on either receiving data in validator order or in id order.
  • using delta-chain per object we get least seek overhead and syscall overhead if we recieve in topological order within the object id, and object ids in lexical order.
  • Using a collation/pack file we can stream it into place and validate as we go, giving near ideal performance.
3.2.3.3.3   API scaling

The api for writing new data recieved over the network will need to be geared to the transmission and local storage method. What we need is for the transmission method to reasonably closely match the desired write ordering locally. This suggests that once we decide on the best local storage means we should design the api.

take N commits from A to B, if B is local then merge changes into the tree. copy ebough data to recreate snapshots avoid ending up wth corrupt/bad data

3.2.4   Notes from London

  1. setup

look at graph of revisions for ~N comits to deretmine eligibility for if preserve mainline is on, check LH only

identify objects to send that are not on the client repo
  • revision - may be proportional to the graph
  • inventory - proportional to work
  • texts - proportional to work
  • signatures - ???
  1. data transmission
  • send data proportional to the new information
  • validate the data:
  1. validate the sha1 of the full text of each transmitted text.
  2. validate the sha1:name mapping in each newly referenced inventory item.
  3. validate the sha1 of the XML of each inventory against the revision. this is proportional to tree size and must be fixed
  1. write the data to the local repo. The API should output the file texts needed by the merge as by product of the transmission
  2. tree application

Combine the output from the transmission step with additional 'new work data' for anything already in the local repository that is new in this tree. should write new files and stat existing files proportional to the count of the new work and the size of the full texts.

3.3   Add

Add is used to recursively version some paths supplied by the user. Paths that match ignore rules are not versioned, and paths that become versioned are versioned in the nearest containing bzr tree. Currently we only do this within a single tree, but perhaps with nested trees this should change.

3.3.1   Least work we can hope to perform

  • Read a subset of the full versioned paths data for the tree matching the scope of the paths the user supplied.
  • Seek once to each directory within the scope and readdir its contents.
  • Probe if each directory is a child tree to avoid adding data for paths within a child tree.
  • Calculate the ignored status for paths not previously known to be ignored
  • Write data proportional to the newly versioned file count to record their versioning.
  • Assign a fileid for each path (so that merge --uncommitted can work immediately)

Optionally:

  • Print the ignore rule for each ignored path in the scope.
  • Print the path of each added file.
  • Print the total count of ignored files within the scopes.
  • Record the result of calculating ignored status for ignored files. (proportional to the number we actually calculate).

3.3.2   Per file algorithm

  1. If the path is versioned, and it is a directory, push onto the recurse stack.
  2. If the path is supplied by the user or is not ignored, version it, and if a directory, push onto the recurse stack. Versioning the path may require versioning the paths parents.
  3. Output or otherwise record the ignored rule as per the user interface selected.

3.4   Garbage Collection

Garbage collection is used to remove data from a repository that is no longer referenced.

Generally this involves locking the repository and scanning all its branches then generating a new repository with less data.

3.4.1   Least work we can hope to perform

  • Read all branches to get initial references - tips + tags.
  • Read through the revision graph to find unreferenced revisions. A cheap HEADS list might help here by allowing comparison of the initial references to the HEADS - any unreferenced head is garbage.
  • Walk out via inventory deltas to get the full set of texts and signatures to preserve.
  • Copy to a new repository
  • Bait and switch back to the original
  • Remove the old repository.

A possibility to reduce this would be to have a set of grouped 'known garbage free' data - 'ancient history' which can be preserved in total should its HEADS be fully referenced - and where the HEADS list is deliberate cheap (e.g. at the top of some index).

possibly - null data in place without saving size.

3.5   Revert

Change users selected paths to be the same as those in a given revision making backups of any paths that bzr did not set the last contents itself.

3.5.1   Least work we can hope to perform

We should be able to do work proportional to the scope the user is reverting and the amount of changes between the working tree and the revision being reverted to.

This depends on being able to compare unchanged subtrees without recursing so that the mapping of paths to revert to ids to revert can be done efficiently. Specifically we should be able to avoid getting the transitive closure of directory contents when mapping back to paths from ids at the start of revert.

One way this might work is to: for the selected scopes, for each element in the wt:

1. get hash tree data for that scope. 1. get 'new enough' hash data for the siblings of the scope: it can be out of date as long as its not older than the last move or rename out of that siblings scope. 1. Use the hash tree data to tune the work done in finding matching paths/ids which are different in the two trees.

For each thing that needs to change - group by target directory?

1. Extract new content. 1. Backup old content or replace-in-place (except windows where we move and replace).

3.6   Annotate

Broadly tries to ascribe parts of the tree state to individual commits.

There appear to be three basic ways of generating annotations:

If the annotation works by asking the storage layer for successive full texts then the scaling of this will be proportional to the time to diff throughout the history of thing being annotated.

If the annotation works by asking the storage layer for successive deltas within the history of the thing being annotated we believe we can make it scale broadly proportional to the depth of the tree of revisions of the annotated object.

If the annotation works by combining cached annotations such that creating a full text recreates annotations for it then it will scale with the cost of obtaining that text.

Generally we want our current annotations but it would be nice to be able to do whitespace annotations and potentially other diff based annotations.

Some things to think about:

  • Perhaps multiparent deltas would allow us to not store the cached annotations in each delta without losing performance or accuracy.

3.7   Scaling analysys of Merge

  1. Fetch revisions O(a)
  2. Common Ancestor [O(b)] O(h)
  3. Calculate tree merge O(c) [+ O(b) + O(d)] + O(i)
  • text merge O(e * e * f) + O(b)
  1. Find filesystem conflicts O(c)
  2. Resolve filesystem conflicts O(g)
  3. Apply changes O(c) + O(log(d))
  4. Set pending merges O(1)
  5. Print conflicts O(g)
  6. Print changes O(c)
a:revisions missing from repo:
b:nodes in the revision graph:
c:files that differ between base and other:
d:number of files in the tree
e:number of lines in the text
f:number number of files requiring text merge
g:number of conflicts (g <= c)
h:humber of uncommon ancestors
i:number of revisions between base and other

3.7.1   Needs

  • Access to revision graph proportional to number of revisions read
  • Access to changed file metadata proportional to number of changes and number of intervening revisions.
  • O(1) access to fulltexts

3.7.2   Notes

Multiparent deltas may offer some nice properties for performance of annotation based merging.

3.8   Bundle Creation

  1. Find common ancestor [O(a)] O(b)
  2. Emit bundle [O(a)] O(b) O(h)

Per revision

  1. emit metadata O(1)
  2. emit changes for files
  1. find changed files [O(c)] O(f)
  2. emit file metadata O(d)
  3. emit diff [O(e * e) * O(f) + O(h)] O(i)
  4. base64 encode O(g)
  1. emit overal diff (or maybe do interdiff) O(e * e) * O(f)
a:nodes in revision graph
b:number of descendants of common ancestor
c:number of files in the tree
d:length of metadata
e:number of lines
f:number of modified files
g:length of diff
h:nodes in knit graph of modified files
i:length of stored diff

3.8.1   Needs

  • Improved common ancestor algorithm
  • Access to partial revision graph proportional to relevant revisions
  • Access to changed files proportional to number of change files and intervening revisions
  • Use knit deltas without recomputing
  • Access to knit deltas in O(1) time
  • Access to snapshots in O(1) amortized time
  • All snapshots must have knit deltas