Concepts of Distributed Version Control

So far, we've only talked about one repository: your own local one. But Mercurial is a distributed version control system, which means there are (potentially) lots of repositories.

With traditional non-distributed version control systems (such as SVN), there is one central repository that you commit to. You check out a copy of the code to your local machine, which you can edit however you want, but when you check stuff in, you do so back into this central repository.


One side effect of this is that when you check stuff in, your changes affect everyone else on the team. (OK, technically, you could check in to a branch, and only affect people who are using that branch also, but the point is, the central repository is shared by everyone.)

This is usually good; everyone gets your changes right off the bat. But it's occasionally bad. When two people are working on different things, sometimes the changes made by one person force the second person to make additional changes to the code they're working on. With the way a centralized version control system works, the second person has to update their working copy and deal with the new changes before they can commit and save their own work.

This whole situation, in a centralized system, means that people hold off on committing until they're certain everything is exactly the way they want it. Which means that sometimes a person will start developing a feature and go off into a hole and do development for weeks at a time before committing. When they finally do commit, it's a lot of changes all at once, and the version control system has a harder time making all of the changes mesh together correctly. So you end up having to manually sort out merge conflicts more often.

With distributed version control, everything is going to be different. In this tutorial, we'll finally open things up to multiple repositories, covering terminology, theory, and the groundwork. I apologize that it's a lot of reading and not a lot of hands-on work. It just makes the most sense to me to lay the conceptual foundation first. We'll get hands-on experience in the next tutorial.

Basic Distributed Version Control Architectures

In a distributed version control system, there will be lots of repositories. As far as Mercurial is concerned, these repositories are all equivalent. They're peers of each other.

In practice, the developers who use distributed version control rarely see all of their repositories as being equal, and they organize them to fit their team's specific needs.

Let's start by talking about some simple, possible architectures.

The Local One-Person Team Architecture

Let's start simple. The setup that we've been working with the past few tutorials looks like this:


You have your local repository, and you commit your changes right there in place, in the repository that sits in the same place as your working copy.

This setup works very well. It's very simple.

But it does have a couple of limitations: it's not backed up anywhere off of your computer (so if your hard drive crashes, you've lost it all) and this is only really going to work when you're alone.

So let's start expanding this.

The One-Person Team with Off-Box Backup Architecture

We'll take what we had before, and build in a second repository that is on a different computer:


In this setup, we do what we've always done. Make some changes, commit the code or revert. Make some more changes, commit or revert. All commits are done to the local repository, like in the past.

On occasion, we'll replicate our changes to a second repository (labeled "Central Repository" in the above diagram). This second repository is usually on a different computer (perhaps a server that you've got somewhere, or BitBucket, which we'll talk about in the next tutorial). Because it's on a different computer, if your hard drive fails, you've still got the off-box backup.

Moving changes from one repository to another is called pushing or pulling, depending on the direction the changes are going. For instance, if you make a change and commit it to your local repository, you can then push that change from your local repository over to the central repository. Alternatively, if you're working from the central repository, you can pull changes from the local repository. (Which also means that if someone else puts changes in the central repository, you can pull those changes from there into your local repository.)

Pushing and pulling do the same thing (moving a set of changes from one repository to another), but it's called "pushing" if you're moving changes out to another repository, and "pulling" if you're bringing changes in from another repository.
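If it helps to see that symmetry spelled out, here's a minimal sketch in Python. This is a toy model I made up for illustration, not how Mercurial actually works inside: the Repo class and the changeset IDs ("a1", "b2", "c3") are assumptions, and a repository is treated as nothing more than a set of changeset IDs.

```python
# Toy model (not Mercurial's real implementation): a repository is just
# a named set of changeset IDs.
class Repo:
    def __init__(self, name, changesets=()):
        self.name = name
        self.changesets = set(changesets)

    def push(self, other):
        """Send the changesets that `other` is missing, from self to other."""
        other.changesets |= self.changesets

    def pull(self, other):
        """Bring the changesets that self is missing, from other into self."""
        self.changesets |= other.changesets


local = Repo("local", {"a1", "b2", "c3"})
central = Repo("central", {"a1"})

# Run from the local repository: push outward.
local.push(central)

# The same transfer, run from the central repository: pull inward.
central_again = Repo("central", {"a1"})
central_again.pull(local)

print(sorted(central.changesets))                       # ['a1', 'b2', 'c3']
print(central.changesets == central_again.changesets)   # True
```

Either way, the central repository ends up with the same changesets. The only difference is which side you were standing on when you started the transfer.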

I want to reiterate once again: Mercurial doesn't pay attention to the names I've applied here. There's no "central" repository in Mercurial's eyes. But from a developer's perspective, it makes sense to assign a name and a purpose to each of them.

How often should you be pushing to the central repository? Whenever. Some people might push every change to the central repository. Others may only do so every few days, when they've got something very stable.

One final thing about pushing changes. Let's say you make 10 commits to your local repository, and then you push those changes to the central repository. The central repository sees this as 10 separate commits, not one. Even though it only got pushed there once, it still understands that this happened across multiple steps. Indeed, knowing that helps it keep track of what changed and makes it smarter at being able to do merges.
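Here's a rough sketch of that idea in Python. Again, this is a toy model under my own assumptions (a history is a plain list of labels, and the push function is hypothetical), not Mercurial's real machinery. The point is only that one push carries every individual commit across.

```python
# Toy model: a repository's history is an ordered list of changesets.
local_history = []
central_history = []

# Ten separate local commits.
for i in range(1, 11):
    local_history.append(f"commit {i}")


def push(source, destination):
    """Copy every changeset the destination lacks, one by one, in order."""
    missing = [c for c in source if c not in destination]
    destination.extend(missing)
    return len(missing)


# A single push...
transferred = push(local_history, central_history)

# ...still arrives as ten distinct changesets.
print(transferred)                        # 10
print(central_history[0])                 # commit 1
print(central_history == local_history)   # True
```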

If you're working alone, you'll do your local edits and commit to your local repository. You'll periodically push to the central repository. You'll almost never pull from the central repository, because you've already got all of the changes. This will change quickly when you bring on someone else to the team.

Speaking of that…

Small Team Architecture

Let's say you've got your one-person team going from the previous architecture, and someone else comes along and is going to start developing with you.

The setup that you'll want will be something like this:


Basically, Dave will come in and clone (or fork) the central repository to get his own copy of it. Now he's got a local repository, just like you! You both pick tasks and go to work. You each make changes and commit to your own local repository. When you feel ready for it, you push your changes from your local repository up to the central repository. Your changes now live in your local repository and the central repository, but not Dave's.

To get these changes, Dave will pull changes from the central repository down to his local repository, and then do an update to get the changes into his working copy.

In other words, commit puts changes from your working copy (the actual file system where you're editing files) into your local repository, update puts changes from your local repository to your working copy. Push will move changes out from one repository to another, while pull brings changes from another repository into the one you're working from.
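To make the four verbs concrete, here's a small Python sketch of the two-developer setup. Everything in it is an assumption for illustration: the Repo and Developer classes, the transfer helper, the changeset ID "c1", and the file name are all made up, and "update" here just jumps to the newest changeset, ignoring branches and merges.

```python
class Repo:
    """Toy repository: an ordered list of (id, snapshot) changesets."""
    def __init__(self):
        self.changesets = []

    def ids(self):
        return [cid for cid, _ in self.changesets]


def transfer(source, destination):
    """Copy changesets the destination lacks (shared by push and pull)."""
    known = set(destination.ids())
    for cid, snapshot in source.changesets:
        if cid not in known:
            destination.changesets.append((cid, snapshot))


class Developer:
    def __init__(self, name, central):
        self.name = name
        self.central = central
        self.local = Repo()
        self.working_copy = {}
        # Cloning is modeled here as "empty repo, then pull, then update".
        self.pull()
        self.update()

    def commit(self, cid):
        # working copy -> local repository
        self.local.changesets.append((cid, dict(self.working_copy)))

    def update(self):
        # local repository -> working copy
        if self.local.changesets:
            self.working_copy = dict(self.local.changesets[-1][1])

    def push(self):
        # local repository -> central repository
        transfer(self.local, self.central)

    def pull(self):
        # central repository -> local repository
        transfer(self.central, self.local)


central = Repo()
you = Developer("you", central)
dave = Developer("dave", central)

you.working_copy["readme.txt"] = "hello"
you.commit("c1")   # now in your local repository only
you.push()         # now in the central repository too

dave.pull()        # now in Dave's local repository...
print(dave.working_copy)   # {}  (...but not yet in his working copy)
dave.update()
print(dave.working_copy)   # {'readme.txt': 'hello'}
```

Notice that pulling alone doesn't touch Dave's files. The changes only show up in his working copy after the update.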

When Dave makes changes, you'll have to pull those down to your local repository from the central one as well, and then update your working copy.

(We'll talk about how to do all of this in a second.)

Interestingly, because Mercurial sees no practical difference between all of these repositories, Mercurial would have allowed us to push and pull changes between our two local copies, if we had wanted. Mercurial wouldn't require us to go through the central repository. But it matches the mental model and simplifies things if we go through the central repository anyway.

Obviously, this setup isn't limited to just two people. Any other programmers could come in, clone the central repository, start making changes, and start pushing those changes back into the central repository. (Well… that assumes the person has the appropriate rights to push to the repository.) So this model works whether you've got two or two hundred people.

The Open Source (or Gatekeeper) Architecture

Related to the small team architecture that I just described is a similar one that many open source projects tend to follow. This addresses the point I made in the last paragraph in parentheses.

In open source projects, a lot of times you want everybody and anybody to be able to see the code and pull it down. Open source projects encourage people to make changes, which will inevitably happen. With a team of people that are all working together as co-workers, side-by-side, giving everyone rights to commit to the central repository isn't that big of a deal. The whole team is generally on the same page, code reviews happen, and if somebody commits something stupid, they're right there to fix it.

But in an open source project, your team is spread across the world. People have different visions of what the project could or should be. And while most changes are done with the intent of being helpful and useful, sometimes they don't match the vision of the person or people who are trying to manage the project (the Benevolent Dictator for Life). Sometimes they're even malicious: people trying to sneak backdoors into the project that they'll later exploit for various nefarious purposes.

In these scenarios, it's highly unlikely that we want the entire world to be able to push changes to the repository. This leads people to build a slightly different architecture:


In this setup, somebody or some small group of people might still have direct access to the central repository, but not everyone. For them, things work just like they did previously.

But for everyone else, what they do is clone the central repository into a public location which the BDFL (Benevolent Dictator for Life) can see. In this context, this is often called forking, but it is still just a simple clone. They make their changes and commit to their local repository. When they've got something that they want to share with the world, they push these changes to their public clone.

Here's where this gets interesting. You can't push from your public clone back to the central repository. You don't have permissions to do so. But your repository is public (at least in the sense that people have read access to it, even if they don't have write access). You can't push to them, but they can pull from you. You just have to tell them what to pull.

You create a pull request, which includes a definition of what to pull for Mercurial's purposes, and a human-readable description of what changes have been made for the BDFL's sake.

The BDFL in charge of the project will see the pull request, review it, and make a decision on whether it's a good change or not. If they accept it, they'll pull the changes from your repository and include it in the central repository. Mission accomplished! Bring out the banners and aircraft carriers!
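Here's the gatekeeper flow as a minimal Python sketch. It's a toy model with made-up names (the Repo class, the "writers" set, the user names, and the "fix-typo" changeset are all assumptions), and it simplifies things by having the BDFL pull straight into the central repository. A real pull request on a hosting service carries more than this.

```python
class Repo:
    def __init__(self, name, writers, changesets=()):
        self.name = name
        self.writers = set(writers)          # who is allowed to push here
        self.changesets = set(changesets)


def push(user, source, destination):
    """Pushing requires write access to the destination."""
    if user not in destination.writers:
        raise PermissionError(f"{user} may not push to {destination.name}")
    destination.changesets |= source.changesets


def pull(source, destination):
    """Pulling only needs read access to the source, which is public here."""
    destination.changesets |= source.changesets


central = Repo("central", writers={"bdfl"}, changesets={"a1"})

# The contributor's public fork is just a clone that they can write to.
fork = Repo("fork", writers={"contributor"}, changesets=central.changesets)
fork.changesets.add("fix-typo")

# The contributor cannot push to the central repository.
try:
    push("contributor", fork, central)
except PermissionError as err:
    print(err)   # contributor may not push to central

# In this model, a pull request is a pointer to the fork plus a description.
pull_request = {"source": fork, "description": "Fix a typo in the docs"}

# After reviewing it, the BDFL pulls from the fork into central.
accepted = True
if accepted:
    pull(pull_request["source"], central)

print(sorted(central.changesets))   # ['a1', 'fix-typo']
```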

This method doesn't have to just apply to open source projects, but that's where it's typically seen. A business could set things up in the same way. But it does mean someone (the team lead, perhaps) is going to have to spend a lot of time analyzing pull requests, and it does have a certain smell of distrust by the management.

Other Architectures

For most of the people reading this, one of the repository architectures that I described above will be good enough. But using the tools and tricks that we've now talked about, there are plenty of ways you could set things up.

For example, on a very large project, you may have individual teams of 5 or 10 people working on different parts of the program. In this situation, you might have team repositories for each team. When each team finishes major milestones, they'll push their changes to the true central repository, creating a layered effect.

A second example: even on a small team, if a couple of people want to work together to build a specific new feature, they can spin up a temporary shared repository between them. They push changes to it and pull changes from it using their local repositories, and then push the finished work to the central repository when everything's done. Once it clears all the hurdles, the shared repository could be trashed.

A third example: your testers may be in a separate team from the development team. In this case, the development team could have one central repository and the testing team another. When the development team thinks they're ready, they submit a pull request to the testing team, which will grab the latest changes and test them out. Each team gets their own "central" repository, and changes flow from the dev repository to the test repository.

Really, the list could go on and on. Unlike centralized version control, distributed version control allows us to make as many repositories as makes sense for us, and push and pull changes among them as we feel the need. But there is overhead in having lots of repositories sitting around. There's more to maintain, more to forget, and more to wonder what the purpose of an old repository is. The intention is that you come up with a system that is as simple as it can be for your project's and team's needs.