Git Line Endings

Saturday, June 30, 2018 6:34 PM git umbraco

It happened during the June Umbraco v8 Hackathon: someone made a few changes to a file named ContentTypeFactory.cs and pushed a pull request. And then, according to the GitHub PR diff page, the file had completely changed. As in, all the lines were different. Which tends to scare reviewers.

Now, GitHub has an option to hide whitespace changes:

img1.png

And sure enough, checking that box reduces the changed lines to two lines. Way better. This means that GitHub sees a lot of changes, all related to whitespace. Or, in our case, to line endings. Line endings in Git are not really something new: they have been discussed, for example, in this 2012 post, and GitHub has a documentation page about them.

Without going into details, most projects are configured to use LF line endings in the repository, and the platform's native line endings on the developer's disk. Which means that, on a Windows machine, what you see on disk is not exactly what is in the Git repository: everything on disk uses CRLF line endings, whereas everything in the repository should use LF.

Show me the repo

And yet, GitHub seems to think that what is in the repository changed entirely with the PR. Which brings us to the first task: finding out what is in the repository. Without checking anything out, as that would cause Git to change line endings. This is achieved with the git show command:

git show temp8:src/.../ContentTypeFactory.cs

Problem is, the console is clever enough to deal with line endings, and hide them, all of them: LF, CR, and CRLF. We have to pipe the result to cat -v to explicitly display line endings:

git show temp8:src/.../ContentTypeFactory.cs | cat -v

And there, all lines look like:

namespace Umbraco.Core.Persistence.Factories^M

The trailing ^M means that, within the Git repository, the file has CRLF line endings. Now try the same command for the PR file...

git show temp8-pr:src/.../ContentTypeFactory.cs | cat -v

... and sure enough, no ^M. So... the file in the repository is wrong, it should have LF line endings, and the PR is fixing this. But... why would the file be wrong?
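To get an idea of how many files are in the same situation, Git can report the line endings of what sits in the index of the current checkout. The following is only a sketch: it assumes Git 2.8 or later (for ls-files --eol) and is meant to be run from inside a clone. The function name list_crlf_in_index is made up for this post.

```shell
# List tracked files whose index (repository-side) content has CRLF or mixed
# line endings. "git ls-files --eol" prints, for each file, something like
#   i/<eol> w/<eol> attr/<attributes> <TAB> <path>
# where the i/ column describes the index and the w/ column the working tree.
list_crlf_in_index() {
  git ls-files --eol | awk -F'\t' '$1 ~ /^i\/(crlf|mixed)/ { print $2 }'
}
```

Running list_crlf_in_index with the branch checked out prints one path per line, and prints nothing when the index is clean.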

Experimenting

Let us start again from the original temp8 branch. Make a very small change to the file, save, push... and still, CRLF everywhere in the repository, where Git should have translated everything to LF.

A bit of Googling reveals that Git will not translate files with mixed line endings. So... if there is one single line already ending with LF, the file will remain unchanged. However, examining the file with a hex editor reveals that all lines end in CRLF.
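For those who would rather not open a hex editor, the same check can be sketched in the shell by counting both kinds of line endings in whatever git show outputs. The helper name count_line_endings is hypothetical, and a last line without a terminating newline is counted as LF.

```shell
# Read a file on stdin and report how many lines end in CRLF and how many
# end in a bare LF. A file with mixed endings shows non-zero values for both.
count_line_endings() {
  awk '{ if (sub(/\r$/, "")) crlf++; else lf++ }
       END { printf "crlf=%d lf=%d\n", crlf, lf }'
}

# Usage, from inside a clone (path elided as in the rest of this post):
#   git show temp8:src/.../ContentTypeFactory.cs | count_line_endings
```

A result with lf=0 means every line in the blob ends in CRLF, which is what the hex editor showed here.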

More Googling reveals that Git... can get confused, and can stop translating some files.

How come the PR is fixing everything then? Well, the commits were created with GitKraken, and I suspect that it translates files more aggressively.

Can we fix this?

Yes we can. Git provides a command that normalizes every file according to .gitattributes settings:

git add --renormalize .

And then, we should get a list of modified files that just need to be committed and pushed. However, in our case... along with a large number of modified files, we find some added files. Which all appear under the same directory. Out of curiosity, let us look at the tree in GitHub:

img2.png

Can you spot the weird thing? How many "umbraco" directories are you seeing? Yes, there is both an umbraco and an Umbraco directory! Of course, this is invisible on a clone on a Windows machine, where the filesystem is case-insensitive, but within the repository Git believes they are two different directories, and this confuses the re-normalization.

Fixing, again

One can actually see this on a Windows machine, by listing the content of the Git tree.

git ls-tree -r temp8:src/Umbraco.Web.UI

This indeed shows blobs under both umbraco and Umbraco. And now we need to tell Git to please move everything from umbraco to Umbraco. To make it short, this is how it is done in bash:

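# ls-tree prints "<mode> <type> <hash><TAB><path>": split on the tab so that
# paths containing spaces survive, then move each file to the Umbraco directory
git ls-tree -r temp8:src/Umbraco.Web.UI/umbraco \
   | awk -F'\t' '{print $2}' \
   | xargs -I {} git mv src/Umbraco.Web.UI/umbraco/{} src/Umbraco.Web.UI/Umbraco/{}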

Since the content of each file is unchanged, Git will detect these moves as renames, and the history of each file remains traceable (for example with git log --follow).
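Whether any other paths in the tree differ only by case can be checked without relying on the filesystem at all. This is a rough sketch: find_case_collisions is a name invented here, and it only handles ASCII case differences.

```shell
# Read one path per line on stdin and print, in lowercase, every path that
# occurs under more than one spelling (for example umbraco and Umbraco).
find_case_collisions() {
  LC_ALL=C sort -u \
    | tr '[:upper:]' '[:lower:]' \
    | LC_ALL=C sort \
    | uniq -d
}

# Usage, from inside a clone:
#   git ls-tree -r -d --name-only temp8 | find_case_collisions   # directories
#   git ls-tree -r --name-only temp8 | find_case_collisions      # files
```

Empty output means no two paths in the tree collide on a case-insensitive filesystem.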

Wrapping up

With the moves committed, we can try to re-normalize again. This time all we have are modified files, with clean line endings. Commit, push, fixed! We now have a consistent repository (until next time?).
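To watch the re-normalization in isolation, it can be tried on a throwaway repository first. The sketch below assumes Git 2.16 or later (for --renormalize). The file name sample.txt, the demo identity and the function name renormalize_demo are all made up for the example, and nothing in it touches an existing clone.

```shell
# Create a scratch repository, commit a file with CRLF endings, then add a
# .gitattributes file and re-normalize. Prints the ls-files --eol line for
# the file before and after, so the index side can be compared.
renormalize_demo() {
  dir=$(mktemp -d) || return 1
  (
    cd "$dir" || exit 1
    git init -q .
    git config core.autocrlf false            # store files exactly as written
    printf 'one\r\ntwo\r\n' > sample.txt      # hypothetical CRLF file
    git add sample.txt
    git -c user.name=demo -c user.email=demo@example.invalid \
        commit -q -m "Add file with CRLF endings"
    git ls-files --eol sample.txt             # before: index holds CRLF

    printf '* text=auto\n' > .gitattributes   # ask Git to normalize text files
    git add .gitattributes                    # --renormalize skips untracked files
    git add --renormalize .
    git ls-files --eol sample.txt             # after: index holds LF
  )
  rm -rf "$dir"
}
```

The first line printed should start with i/crlf and the second with i/lf, while the w/ column (the working tree) stays at CRLF in both.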

Caveat: pre-existing PRs may appear to change tons of lines. Careless merging may corrupt everything again. We'll see.

Notes

Just to be complete, this post was also inspired by this page, this page, and this page. And, there is another way to renormalize:

git rm --cached -r .
git reset --hard
git add .
git status
git commit -m "Normalize line endings"