How to search for Non-ASCII characters in a Tab-Indented Keyword List
Here's a familiar scenario to anyone editing or creating their own Keyword Lists: you start to import the list to Lightroom, and half-way through the import process, everything halts with a less-than-helpful "Only text files encoded with ASCII or unicode UTF-8 are supported when importing keywords" message.
Hey - I'm a photographer. I've heard of ASCII but I'm not sure what it means. Not got a clue about 'unicode UTF-8'. What do I do next?
Don't worry. You will soon get the problem fixed with the information on this page. When creating or editing a tab-indented Keyword List that you intend to import into Lightroom, Breeze Browser, Photo Mechanic, or other popular image management and storage programs, there are certain characters that should be avoided, as they cause an error that in many cases will stop the import of the Keyword List part-way through.
These characters are commonly referred to as 'Non-ASCII characters'. Though they will display quite safely in PSPad or many other text editors and word processors, the image management and storage programs regard them as outside the range of safe text. Looking at this a bit closer, we learn that safe ASCII characters are defined as those with character codes within the range 000-127 in decimal notation, which is 00-7F in hex, and 0000000-1111111 in binary. The ones that you want to avoid, the so-called 'Non-ASCII characters', are those within the range 128-255 in decimal notation, which is 80-FF in hex, or 10000000-11111111 in binary. Counting the digits in the two binary strings, we can soon spot the real reason they are unpopular: ASCII characters fit into seven bits, while the Non-ASCII characters require an extra, eighth bit to store them in.
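If you are curious, the ranges above can be checked with a few lines of Python. This is only a minimal sketch: the helper name `is_ascii` is made up for this illustration, and nothing here is needed for the PSPad method described further down.

```python
def is_ascii(ch: str) -> bool:
    """Return True if the character code falls in the safe range 0-127."""
    return ord(ch) <= 127


# Show the decimal, hex and binary code of a plain 'e' and an accented 'é'.
for ch in ["e", "é"]:
    code = ord(ch)
    print(repr(ch), code, format(code, "X"), format(code, "b"), is_ascii(ch))
```

The plain 'e' has a code below 128 and a seven-digit binary form, while the accented 'é' has code 233 and needs eight binary digits.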
Is this going to cause problems, in the real world? Unfortunately, yes it will. This forbidden range of characters includes all of the foreign accented letters: A, E, I, O, U with dots and dashes and slanted lines above them, plus a range of others. If these are in your Keyword List you will need to transliterate them to their approximate safe equivalents - in other words swap the characters for the letters as they would appear if they didn't have extra marks above or below them.
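For readers comfortable with a little scripting, the swap described above can be sketched in Python using the standard `unicodedata` module. This is one possible approach, not the method used by any particular tool, and the function name `strip_accents` is an assumption made for this example. It splits each accented letter into a base letter plus a separate accent mark, then discards the marks.

```python
import unicodedata


def strip_accents(text: str) -> str:
    """Replace accented letters with their unaccented base letters."""
    # NFD normalization splits 'é' into 'e' followed by a combining accent.
    decomposed = unicodedata.normalize("NFD", text)
    # Keep everything except the combining accent marks.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(strip_accents("\tAlgérie"))
```

Tabs are left untouched, so the indentation of the Keyword List survives. Note that letters which are not a base letter plus an accent, such as 'ø' or 'ß', are not changed by this sketch and would still need to be replaced by hand.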
Other problems that you might encounter come when copying and pasting text from a webpage or pdf document into your Keyword List. It is quite common for authors to use a variety of characters that 'look' the same as normal inverted commas, quotation marks, dashes, etc. but are in fact from a Non-ASCII range of characters that will trip-up your list the moment you try to use it.
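As an illustration, the look-alike punctuation mentioned above can be swapped for plain ASCII equivalents with a small Python lookup table. This is a hedged sketch: the table `PUNCTUATION_MAP` and the function `normalize_punctuation` are names invented for this example, and the table covers only a handful of common offenders, not every possible character.

```python
# Map common typographic characters to plain ASCII equivalents.
PUNCTUATION_MAP = str.maketrans({
    "\u2018": "'",    # left single quotation mark
    "\u2019": "'",    # right single quotation mark / curly apostrophe
    "\u201c": '"',    # left double quotation mark
    "\u201d": '"',    # right double quotation mark
    "\u2013": "-",    # en dash
    "\u2014": "-",    # em dash
    "\u2026": "...",  # horizontal ellipsis
    "\u00a0": " ",    # non-breaking space
})


def normalize_punctuation(text: str) -> str:
    """Replace the characters listed in PUNCTUATION_MAP; leave the rest alone."""
    return text.translate(PUNCTUATION_MAP)


print(normalize_punctuation("\u201cSt John\u2019s\u201d \u2013 harbour"))
```

Any character not listed in the table passes through unchanged, so it is worth searching the list again afterwards to catch anything the table missed.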
So, now we know about these characters, is there any way to spot them automatically? Yes there is, and it's quite easy to do this as long as your text editor supports 'Regular Expressions'. In previous pages we have been using the PSPad text editor, so I will continue to use it here, and demonstrate how to spot Non-ASCII characters.
Open the file that you wish to check, then click 'Search > Find' or press 'Ctrl+F'. The 'Find' window will open, as shown in the image above. Copy and paste the following characters into the white 'Find' field:
[\x80-\xFF]
and check the box marked 'Regular Expressions'. Don't worry - you don't need to understand what regular expressions are or how they work to do this! Click 'OK' and the first Non-ASCII character will be highlighted - in our case it's the accented 'e' in Algerie. Press the 'F3' key to find the next one, and so on. Alternatively, instead of clicking 'OK' you can click the 'List' button, and every occurrence of a Non-ASCII character, with the line number that it was found on, will be shown in a list at the bottom of the PSPad window.
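If you would rather run the same kind of search outside PSPad, here is a rough Python equivalent of the 'List' button. It is a sketch under a few assumptions: the function name `list_non_ascii` and the sample lines are made up for illustration, and the pattern is written as `[^\x00-\x7F]` (anything that is not ASCII) rather than `[\x80-\xFF]`, because Python works with Unicode text, where characters such as curly quotes have codes above FF.

```python
import re

# Matches any single character outside the ASCII range 00-7F.
NON_ASCII = re.compile(r"[^\x00-\x7F]")


def list_non_ascii(lines):
    """Return (line number, column, character) for every Non-ASCII character."""
    hits = []
    for line_number, line in enumerate(lines, start=1):
        for match in NON_ASCII.finditer(line):
            # Columns are counted from 1, and a tab counts as one column.
            hits.append((line_number, match.start() + 1, match.group()))
    return hits


# Hypothetical three-line Keyword List used only for this demonstration.
sample = ["Africa", "\tAlgérie", "\tEgypt"]
for line_number, column, char in list_non_ascii(sample):
    print(f"line {line_number}, column {column}: {char!r}")
```

To check a real file, the lines could be read with Python's built-in `open()` using `encoding="utf-8"` and passed to the same function.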
An alternative to the above method is to use an automatic tool that will transliterate all foreign accented characters and catch other Non-ASCII characters too. It also catches errors, finds duplicates, stores comments for each line, visualizes the list, plus much more. You can find it here: Tab-List Tools.