Can SourceTree handle mixed UTF-16/ANSI file formats?

Ken Ismert April 22, 2013

I have a project with SQL files stored in a mixture of UTF-16 and ANSI formats. The newer files were produced in SQL Server Management Studio, and the older ones are legacy.

Unfortunately, msysGit and Git Extensions have no easy way to work with mixed format files.

Does SourceTree fix this problem, or is it bound by the limitations of Git/msysGit, just like any other Git client?

Thanks,

-Ken

3 answers

0 votes
HomeAway
Rising Star
Rising Stars are recognized for providing high-quality answers to other users. Rising Stars receive a certificate of achievement and are on the path to becoming Community Leaders.
May 16, 2013

Like Ken, we're an MS SQL Server shop and we've been dealing with this problem as well. Not only does it make diffs unfriendly and break the .gitattributes CRLF-to-LF conversions, we suspect it is causing the msysGit client to report files as modified when they are not (even after git reset --hard, git checkout ., git stash, etc.!), which has a cumulative, cascading effect when frustrated DBAs have to commit files they didn't change. The *only* apparent strategy seems to be the clean/smudge filter, which we're implementing with crossed fingers, hoping it fixes this.

My question: why would the smudge be necessary? I don't know of a modern editor that can't simply handle UTF-8. We're going to do a bulk conversion with CRLF-to-LF normalization, then place the .gitattributes and .gitconfig with the clean filter in every repo root and be done!
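The bulk conversion described above could look roughly like the following Python sketch. It is only an illustration, not the poster's actual script: the function name `to_utf8_lf` and the `cp1252` fallback for legacy "ANSI" files are assumptions, and detection relies on byte-order marks only.

```python
import codecs

# BOM -> codec name. The "utf-16" and "utf-8-sig" codecs consume the BOM while decoding.
BOMS = [
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]

def to_utf8_lf(data: bytes, legacy_encoding: str = "cp1252") -> bytes:
    """Return data re-encoded as BOM-less UTF-8 with LF line endings."""
    for bom, codec in BOMS:
        if data.startswith(bom):
            text = data.decode(codec)
            break
    else:
        # No BOM: try UTF-8 first, then fall back to the assumed legacy code page.
        try:
            text = data.decode("utf-8")
        except UnicodeDecodeError:
            text = data.decode(legacy_encoding)
    return text.replace("\r\n", "\n").encode("utf-8")
```

To convert a tree, one would read each .sql file in binary mode, pass the bytes through `to_utf8_lf`, and write the result back. A UTF-16 file saved without a BOM would be misread by this sketch, so a trial run on a copy of the repository is advisable.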

We've been tracking this here in discussions with Bryan Turner from Atlassian:

https://support.atlassian.com/browse/SSP-1042

Bottom line: Atlassian and other companies stand to make a killing with Stash, and the fact that its adoption is so closely coupled with the adoption of Git itself should compel you to take the lead in offering a viable, stable solution for this. I'm frankly surprised you're not already all over it.

Also, TortoiseGit provides its diff tool with UCS-2/UTF-16 support (despite its other warts).

0 votes
Ken Ismert April 28, 2013

Steve,
I have been working on a solution with help from the msysGit people, and have come up with this clean/smudge filter. The filter uses the GNU file and iconv commands to determine the encoding of the file, and to convert it to and from msysGit's internal UTF-8 format.

1. Get GNU libiconv (http://gnuwin32.sourceforge.net/packages/libiconv.htm) and file (http://gnuwin32.sourceforge.net/packages/file.htm), and install both.
2. Ensure that the GnuWin32\bin directory (usually "C:\Program Files\GnuWin32\bin") is in your %PATH%
3. Add the following to ~\Git\etc\gitconfig:

[filter "mixedtext"]
    clean = iconv -sc -f $(file -b --mime-encoding %f) -t utf-8
    smudge = iconv -sc -f utf-8 -t $(file -b --mime-encoding %f)
    required

4. Add a line to your global ~/Git/etc/gitattributes or local ~/.gitattributes to handle mixed format text files, for example:

*.txt filter=mixedtext

I have used this on a directory with SQL files in ANSI, UTF-16, and UTF-8 formats. This looks to be the 20% effort that should cover 80% of all Windows text format problems.
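To make the two directions of the filter concrete, here is a minimal Python sketch of what the clean and smudge steps do to the bytes. It is not the filter itself: the function names are illustrative, the encoding is passed in explicitly rather than detected with file, and unlike iconv -c it raises on unconvertible characters instead of dropping them.

```python
def clean(worktree_bytes: bytes, encoding: str) -> bytes:
    """Working tree -> repository, roughly `iconv -f ENCODING -t utf-8`."""
    return worktree_bytes.decode(encoding).encode("utf-8")

def smudge(repo_bytes: bytes, encoding: str) -> bytes:
    """Repository -> working tree, roughly `iconv -f utf-8 -t ENCODING`."""
    return repo_bytes.decode("utf-8").encode(encoding)

# A UTF-16 (little-endian) file as it might sit in the working tree.
original = "SELECT N'na\u00efve';\r\n".encode("utf-16-le")

stored = clean(original, "utf-16-le")     # what the repository would hold
restored = smudge(stored, "utf-16-le")    # what a checkout would write back
```

The stored form contains no NUL bytes, so Git can treat it as text, and the round trip gives back the original bytes as long as both directions agree on the encoding.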

-Ken

stevestreeting
Rising Star
April 28, 2013

That's very interesting, thanks Ken! This would only help diffs on commit details rather than diffs of the working copy (since if I understand correctly the contents would be UTF-8 only in the repo, and in the other encoding on disk), but it's certainly interesting. I'm also not entirely clear about how this handles new files, since it derives the encoding on disk assuming the file is already there, so if I add a new file and someone else checks it out, the smudge filter won't work - unless I'm missing something?
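The new-file concern can be modelled in a few lines of Python. This is a sketch under stated assumptions, not a description of what file and iconv actually do when the path is missing: `detect_on_disk` is a hypothetical BOM-sniffing stand-in for `file -b --mime-encoding`, and `smudge_for_checkout` simply leaves the content as UTF-8 when there is nothing on disk to inspect.

```python
import codecs
import os
import tempfile

def detect_on_disk(path):
    """Hypothetical stand-in for `file -b --mime-encoding PATH` (BOM sniffing only)."""
    if not os.path.exists(path):
        return None  # nothing on disk to inspect
    with open(path, "rb") as f:
        head = f.read(4)
    if head.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    return "utf-8"

def smudge_for_checkout(repo_bytes, path):
    """Convert repository UTF-8 to whatever encoding the existing file has."""
    encoding = detect_on_disk(path)
    if encoding is None:
        return repo_bytes  # assumption: with no file to inspect, content stays UTF-8
    return repo_bytes.decode("utf-8").encode(encoding)

with tempfile.TemporaryDirectory() as workdir:
    stored = "SELECT 1;\n".encode("utf-8")

    # A collaborator checks out a file that was never in their working copy.
    new_path = os.path.join(workdir, "new.sql")
    fresh = smudge_for_checkout(stored, new_path)

    # The original author already has a UTF-16 copy on disk.
    old_path = os.path.join(workdir, "old.sql")
    with open(old_path, "wb") as f:
        f.write("SELECT 1;\n".encode("utf-16"))
    existing = smudge_for_checkout(stored, old_path)
```

In this model the two users end up with different bytes on disk for the same commit, which is the situation the question is asking about. Recording the intended encoding per path, rather than sniffing the file, would be one way around it.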

Ken Ismert April 29, 2013

Steve:

I'm pretty new to Git, so I am relying heavily on others' advice, and lightly on my own experience. To quote a frequent msysGit contributor (Karsten Blees):

There has been a discussion about handling UTF-16 on the git ML a while back, see http://thread.gmane.org/gmane.comp.version-control.git/159708

As suggested there, I would try to use a clean/smudge filter (i.e. store UTF-16 files as UTF-8 in the repository and convert back to UTF-16 on checkout). That way git can treat your UTF-16 files as text in most cases (i.e. you can merge them, git-grep works, gitattributes work (eol-conversion, ident-replacement, built-in diff patterns...)).

...

As described above, I think a diff filter is not the right tool for the job. The only universal format for text content that works reasonably well with established text-based technologies (merge algorithms, regex etc.) is UTF-8. If we want to benefit from these technologies, git should store text files as UTF-8 and convert from / to platform-specific formats on checkin / checkout or for display.

The gitattributes man page has a fair discussion of filters. My impression is filters work not just for diffs, but also a number of other use cases. Diff filters are more limited, working only for diffs.

Regarding your new file question, I don't know. I assume the only way to check out a new file is to first check it in and commit it, so smudge should work. But I could be ignorant here.

Part of my reason for pushing this out is to have other people try it and see what its limitations are. My current usage requirements for Git are fairly modest, and unlikely to fully exercise the solution.

Thanks,

-Ken

0 votes
stevestreeting
Rising Star
April 22, 2013

We're limited by what the underlying Git binaries do. In my experience this means that UTF-16 files will show as binary (so no merging or diffs). ANSI formats will be considered text, but right now SourceTree only supports UTF-8 for extended characters in the diff. Merging etc. will still work, but extended characters won't display correctly in the diff. Support for old-skool codepages is on the TODO list though.
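As an illustration of why UTF-16 ends up classified as binary: to the best of my understanding, Git treats content as binary when a NUL byte appears near the start of the file, and UTF-16 encodes every ASCII character with a zero byte. The Python sketch below models that check. The 8000-byte window and the name `looks_binary` are assumptions made for the example, not a copy of Git's code.

```python
FIRST_FEW_BYTES = 8000  # assumed size of the prefix that gets inspected

def looks_binary(data: bytes) -> bool:
    """Model of a NUL-byte heuristic: any zero byte in the prefix means 'binary'."""
    return b"\x00" in data[:FIRST_FEW_BYTES]

sql = "SELECT 'caf\u00e9';\r\n"

as_utf16 = sql.encode("utf-16-le")  # every ASCII character carries a 0x00 byte
as_ansi = sql.encode("cp1252")      # single-byte code page, no NUL bytes
as_utf8 = sql.encode("utf-8")       # multi-byte for the accented letter, still no NUL bytes
```

Under this model only the UTF-16 version is flagged, which matches the behaviour described above: ANSI and UTF-8 files are handled as text, while UTF-16 files lose diffs and merges.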
