This version of John is UTF-8 and codepage aware. This means that unlike
"core John", this version can recognize national vowels, lower or upper
case characters, etc. in most common encodings.

By default, nothing of this is enabled and John Jumbo works just like Solar's
"core John" a.k.a "John proper". If you only care about 7-bit ASCII passwords,
you can stop reading now and move on to next document.

The traditional behavior, and what is still happening if you don't specify any
encodings, is that John will assume ISO-8859-1 when converting plaintexts or
salts to UTF-16 (this happens to be very fast), and assume ASCII in most
other cases. The rules engine will accept 8-bit candidates as-is, but it will
not upper/lower-case them or recognise letters etc. And some truncation or
insert operation might split a multi-byte UTF-8 character in the middle,
resulting in meaningless garbage. Nearly all other password crackers have these
limitations.

For proper function, it's imperative that you let John know about what
encodings are involved. For example, if your wordlist is encoded in
UTF-8, you need to use the "--encoding=UTF-8" option (unless you have set
that as default in john.conf). But you also need to know what encoding the
hashes were made from - for example, LM hashes are always made from a legacy
MS-DOS codepage like CP850. This can be specified by using the option
"--target-encoding=CP850". John will convert to/from Unicode as needed.

Finally, there's the special case where both input (wordlist) and output (eg.
hashes from a website) are UTF-8 but you want to use rules including eg. upper
or lower-casing non-ASCII characters. In this case you can use an intermediate
encoding (--internal-encoding option) and pick some codepage that can
shelter as much of the needed input as possible. For US/Western Europe, any
"Latin-1" codepage will do fine, eg. CP850, ISO-8859-1 or CP1252 (if you do
this with a Unicode format like NT, it will silently be treated in another
way internally for performance reasons but the outcome will be the same).

Mask mode also honors --internal-encoding (or plain --encoding). For
example, the mask ?l that normally is a placeholder for [a-z] will also
include all lowercase Greek letters if you use CP737.

The limitation is if you use --target-encoding or --internal-encoding,
the input encoding must be UTF-8. The recommended, and easiest, use is to
un-comment all encoding parameters in john.conf and only use UTF-8 wordlists.
This will work for most cases without too much impact on cracking speed
and you will almost never have to give any command-line options.

Some new reject rules and character classes are implemented, see doc/RULES.
If you use rules without --internal-encoding, some wordlist rules may cut
UTF-8 multibyte sequences in the middle, resulting in garbage. You can reject
such rules with -U to have them in use only when UTF-8 is not used.

Caveats:
Beware of UTF-8 BOM's. They will cripple the first word in your wordlist or
the first entry in your hashfile. Try to avoid using Windows tools that add
them. This version does try to discard them though.

Unicode beyond U+FFFF (4-byte UTF-8) is not supported by default for the NT
formats because it hits performance and because the chance of it being used
in the wild is pretty slim. Supply --enable-nt-full-unicode to configure when
building if you need that support.

Examples:
1. LM hashes from Western Europe, using a UTF-8 wordlist:

   ./john hashfile -form:lm -enc:utf8 -target:cp850 -wo:spanish.lst

2. NT hashes, using a legacy Latin-1 wordlist. Since NT is a Unicode format,
   you do not have to worry about target encoding at all - any input encoding
   can be used:

   ./john hashfile -form:nt -enc:8859-1 -wo:german.lst

3. Using a UTF-8 wordlist with an internal encoding for rules processing:

   ./john hashfile -enc:utf8 -int=CP1252 -wo:french.lst -ru

4. Using mask mode to print all possible "Latin-1" words of length 4,
   first letter upper case:

   ./john -stdout -enc:utf8 -int=8859-1 -mask:?u?l?l?l

5. Using the recommended john.conf settings mentioned above:

   $ ../run/john hashfile -form:lm -single
   Using default input encoding: UTF-8
   Target encoding: CP850
   Loaded 3 password hashes with no different salts (LM [DES 128/128 AVX-16])
   Press 'q' or Ctrl-C to abort, almost any other key for status
   GEN              (Kübelwagen:2)
   KÜBELWA          (Kübelwagen:1)
   MÜLLER           (Müller)
   3g 0:00:00:00 DONE (2014-03-28 20:48) 300.0g/s 12800p/s 12800c/s 38400C/s
   Warning: passwords printed above might be partial
   Use the "--show" option to display all of the cracked passwords reliably
   Session completed

   $ ../run/john hashfile -form:nt -loopback
   Rules engine using CP850 for Unicode
   Loaded 2 password hashes with no different salts (NT [MD4 128/128 SSE2-16])
   Assembling cracked LM halves for loopback
   Loop-back mode: Reading candidates from pot file $JOHN/john.pot
   Press 'q' or Ctrl-C to abort, almost any other key for status
   Kübelwagen       (Kübelwagen)
   müller           (Müller)
   2g 0:00:00:00 DONE (2014-03-28 20:48) 200.0g/s 3200p/s 3200c/s 6400C/s
   Use the "--show" option to display all of the cracked passwords reliably
   Session completed


Currently supported encodings:
UTF-8, ISO-8859-1, ISO-8859-2, ISO-8859-7, ISO-8859-15,
CP437, CP737, CP850, CP852, CP858, CP866,
CP1250, CP1251, CP1252, CP1253 and KOI8-R.

New encodings can be added with ease, using automated tools that rely on the
Unicode Database (see Openwall wiki, or just post a request on john-users
mailing list).

--

These contributions to John are hereby placed in the public domain. In case
that is not applicable, they are Copyright 2009-2014 by magnum and
JimF and hereby released to the general public. Redistribution and use in
source and binary forms, with or without modification, is permitted.
