Merge Git Authors

March 19, 2019

Stats! Stats are awesome. You can collect them, you can analyze and visualize them, you can boil, grill and fry them… Wait, that’s not food, right?

git shortlog

Let’s say we have a Git repository. We want to count commits per author. There is a Git command for that!

$ git shortlog --summary --numbered

  2  Bender Rodriguez
  2  Philip J. Fry
  1  Bending Unit 22
  1  Turanga Leela

There are actually three authors, but because the names are inconsistent, Git reports four. It gets worse with emails.

$ git shortlog --summary --numbered --email

  2  Philip J. Fry <philip.j.fry@planet-express.earth>
  1  Bender Rodriguez <bender.rodriguez@planet-express.earth>
  1  Bender Rodriguez <bending-unit-22@mom.corp>
  1  Bending Unit 22 <bender.rodriguez@planet-express.earth>
  1  Turanga Leela <turanga.leela@planet-express.earth>

Bender has three identities:

Name              Email
Bender Rodriguez  bender.rodriguez@planet-express.earth
Bender Rodriguez  bending-unit-22@mom.corp
Bending Unit 22   bender.rodriguez@planet-express.earth

This is a made-up repository, but the underlying issue is very real. Git history gets messed up. The reasons vary: a new computer, multiple Git identities on a single machine, a company domain change, the dog ate it. The result is always the same: the history is not consistent.

The real problem is not about stats, though (useful as they are). A more frequent task is research: finding the person who made a change and the motivation behind it. I’m talking about git blame and the tooling around it.

Fortunately, Git provides a tool for such situations. It is called .mailmap. Like HashMap and ConcurrentHashMap, but MailMap.

The following .mailmap content will resolve our consistency issues.

Bender Rodriguez <bender.rodriguez@planet-express.earth>
Bender Rodriguez <bender.rodriguez@planet-express.earth> <bending-unit-22@mom.corp>

The first line ties the primary name to the primary email address, so any commit made with that address gets the primary name. The second line aliases the secondary email address to the primary identity. Let’s check.

$ git shortlog --summary --numbered

  3  Bender Rodriguez
  2  Philip J. Fry
  1  Turanga Leela

$ git shortlog --summary --numbered --email

  3  Bender Rodriguez <bender.rodriguez@planet-express.earth>
  2  Philip J. Fry <philip.j.fry@planet-express.earth>
  1  Turanga Leela <turanga.leela@planet-express.earth>
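Single identities can also be checked without reading through shortlog output. Git has a check-mailmap command that prints the canonical form of a given "Name <email>" contact. The sketch below tries it in a throwaway repository created with mktemp, so nothing real is touched; the repo variable and the scratch setup are only there for illustration.

```shell
# Scratch repository: check-mailmap reads .mailmap from the top of the work tree.
repo="$(mktemp -d)"
git init --quiet "$repo"

# The same two entries as in the .mailmap above.
cat > "$repo/.mailmap" <<'EOF'
Bender Rodriguez <bender.rodriguez@planet-express.earth>
Bender Rodriguez <bender.rodriguez@planet-express.earth> <bending-unit-22@mom.corp>
EOF

# Both stray identities should resolve to
# Bender Rodriguez <bender.rodriguez@planet-express.earth>
git -C "$repo" check-mailmap "Bending Unit 22 <bender.rodriguez@planet-express.earth>"
git -C "$repo" check-mailmap "Bender Rodriguez <bending-unit-22@mom.corp>"
```

An identity with no matching entry is printed back unchanged, which makes this a cheap way to test a new .mailmap line before committing it.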

git log

There is a catch. With the %an and %ae placeholders, git log will not show mail-mapped values.

$ git log --format="%an <%ae>: %s"

Bender Rodriguez <bending-unit-22@mom.corp>: 01101000 01110101 01101101 01100001 01101110 01110011
Philip J. Fry <philip.j.fry@planet-express.earth>: Delivering pizza to D. Frosted Wang.
Bending Unit 22 <bender.rodriguez@planet-express.earth>: 01100001 01101100 01101100
Bender Rodriguez <bender.rodriguez@planet-express.earth>: 01101011 01101001 01101100 01101100
Philip J. Fry <philip.j.fry@planet-express.earth>: Delivering pizza to I. C. Wiener.
Turanga Leela <turanga.leela@planet-express.earth>: Blast off!

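The thing is, git shortlog and git blame use .mailmap by default. git log is a different story: its default output has applied the mapping only since Git 2.23, and the lowercase %an and %ae placeholders always print the values recorded in the commit. The uppercase %aN and %aE placeholders apply .mailmap.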

Yeah, that’s complicated. It works though!

$ git log --format="%aN <%aE>: %s"

Bender Rodriguez <bender.rodriguez@planet-express.earth>: 01101000 01110101 01101101 01100001 01101110 01110011
Philip J. Fry <philip.j.fry@planet-express.earth>: Delivering pizza to D. Frosted Wang.
Bender Rodriguez <bender.rodriguez@planet-express.earth>: 01100001 01101100 01101100
Bender Rodriguez <bender.rodriguez@planet-express.earth>: 01101011 01101001 01101100 01101100
Philip J. Fry <philip.j.fry@planet-express.earth>: Delivering pizza to I. C. Wiener.
Turanga Leela <turanga.leela@planet-express.earth>: Blast off!
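The difference between the lowercase and uppercase placeholders is easy to reproduce in a scratch repository. The following is a minimal sketch under a few assumptions: the repository lives in a mktemp directory, commit_as is a made-up helper, and the commits are empty ones created only to carry an author identity.

```shell
repo="$(mktemp -d)"
git init --quiet "$repo"

# Hypothetical helper: create an empty commit under the given identity.
commit_as() {
    GIT_AUTHOR_NAME="$1" GIT_AUTHOR_EMAIL="$2" \
    GIT_COMMITTER_NAME="$1" GIT_COMMITTER_EMAIL="$2" \
        git -C "$repo" -c commit.gpgsign=false \
            commit --quiet --allow-empty -m "$3"
}

commit_as "Bending Unit 22" "bender.rodriguez@planet-express.earth" "first"
commit_as "Bender Rodriguez" "bending-unit-22@mom.corp" "second"

# The same mapping as in the .mailmap above.
cat > "$repo/.mailmap" <<'EOF'
Bender Rodriguez <bender.rodriguez@planet-express.earth>
Bender Rodriguez <bender.rodriguez@planet-express.earth> <bending-unit-22@mom.corp>
EOF

# %an/%ae print what the commits recorded; %aN/%aE apply .mailmap.
raw="$(git -C "$repo" log --format='%an <%ae>')"
mapped="$(git -C "$repo" log --format='%aN <%aE>')"

printf 'raw:\n%s\n\nmapped:\n%s\n' "$raw" "$mapped"
```

The raw list should show two different identities, while the mapped list should show the primary identity twice.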

Tools

.mailmap is nice and all, but tool support is hit-and-miss.

Nevertheless, it is better to have it than not. Projects such as Gradle, SymPy, TypeScript and Git itself maintain one.

Preemptive Strike

.mailmap is an after-the-fact measure. Ideally it is better to have before-the-fact measures in place.

Bitbucket has a plugin for that: it checks that the Git author email address matches the Bitbucket account email address. Surprisingly, I haven’t found anything close for GitHub.

A pre-commit Git hook will do the trick as well.

#!/usr/bin/env bash
# Reject the commit unless the configured email is a company address.
if [[ "$(git config user.email)" != *"@planet-express.earth" ]]; then
    echo "Danger! High Voltage!" >&2
    exit 1
fi
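To try the hook without risking a real repository, it can be installed into a scratch one. This is only a sketch: the mktemp repository and the try_commit helper are made up for the demo, and in a real setup the script goes into .git/hooks/pre-commit of the actual repository and must be executable. Keep in mind that the hook looks at the configured user.email, so an author set through --author or GIT_AUTHOR_EMAIL would slip past it.

```shell
repo="$(mktemp -d)"
git init --quiet "$repo"
mkdir -p "$repo/.git/hooks"

# Install the hook; the quoted heredoc keeps $(...) from expanding here.
cat > "$repo/.git/hooks/pre-commit" <<'EOF'
#!/usr/bin/env bash
if [[ "$(git config user.email)" != *"@planet-express.earth" ]]; then
    echo "Danger! High Voltage!" >&2
    exit 1
fi
EOF
chmod +x "$repo/.git/hooks/pre-commit"

# Hypothetical helper: set the identity in the scratch repo, then commit.
try_commit() {
    git -C "$repo" config user.name "$1"
    git -C "$repo" config user.email "$2"
    git -C "$repo" -c commit.gpgsign=false \
        commit --quiet --allow-empty -m "$3"
}

if try_commit "Bender Rodriguez" "bending-unit-22@mom.corp" "should be rejected"; then
    echo "unexpected: foreign address accepted"
fi
if try_commit "Bender Rodriguez" "bender.rodriguez@planet-express.earth" "should pass"; then
    echo "company address accepted"
fi
```

The first attempt should be stopped by the hook, and only the second commit should land in the history.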

Obviously, such checks will not work in the OSS world, but for companies this seems like the way to go.