CS 465 - Vigenere Cipher Lab

Objectives:

  1. Learn about a cipher known to be insecure for over a century, but which is still commonly proposed and implemented.
  2. Understand how cryptanalysis using statistical methods can trivially break this cipher and many others like it.

Overview:

The Vigenere cipher was proposed by Blaise de Vigenere in the 16th century. Charles Lutwidge Dodgson, the noted maths lecturer at Christ Church College, Oxford, described this cipher as unbreakable in his 1868 paper “The Alphabet-Cipher” (published in a children's magazine). Read it here:

the_alphabet_cipher.txt

You'll implement this substitution cipher, then use statistical methods to break it.

The Vigenere cipher is very similar to a stream cipher. Variants of it are commonly proposed in public forums and implemented in commercial applications by people who are apparently not aware of Kasiski's 1863 paper demonstrating its insecurity.

Another perspective on this story was kindly provided by Kevin Buhr in a post to sci.crypt (note that the algorithm he describes is not the algorithm to implement for this lab):

[...]

I know this is only a quick class project, but you might want to take
a look at Kahn's _Codebreakers_ or maybe a few other sources.  Kahn
writes, "the comedy of errors and neglect that constitutes so much of
the historiography of cryptology reached a climax of irony when it
came to the inventor [Vigenere] of the second and acceptable autokey
system.  It ignored this important contribution and instead named a
regressive and elementary cipher for him *though he had nothing to do
with it*." (my emphasis)

Intead, Kahn credits Giovan Batista Belaso with publishing a booklet
in 1553 that "proposed the use of a literal, easily remembered, and
easily changed key...for a polyalphabetic cipher", what we know today
as the Vigenere cipher.

He credits Vigenere with the development of a far more sophisticated
system, an "autokey" that uses the plaintext as its own key.  For
example, if we want to encrypt "THIS IS A SECRET MESSAGE", we choose a
secret seed key character, say "D", and we write:

   autokey:  DTHISISASECRETMESSAG
   message:  THISISASECRETMESSAGE

You can see how the autokey consists of the seed character followed by
the plaintext message itself shifted once to the right.  Now, the key
and message characters are looked up in a tableau to produce the
enciphered message, not entirely unlike the usual, so-called "Vigenere
cipher".  However, there's one important difference.  While the
tableau still consists of standard (shifted) alphabets, Vigenere
proposed scrambling the row and column indexing alphabets at the top
and side.  This scrambling plus the seed character would form what
we'd consider the "secret key" nowadays.

Decryption is rather interesting.  Using the seed character "D", the
intended recipient can decipher the first ciphertext character to get
the first plaintext character---this plaintext character is now the
key character to use to decrypt the second ciphertext character, and
so on.

There is still a great deal of structure in this system to
cryptanalyze, but it isn't nearly as easy to break as the simple
cipher *attributed* to Vigenere.

[...]

Also, according to Kahn, Kasiski's 1863 work wasn't really a "paper".
It was a 95-page book "Die Geheimschriften und die Dechiffrir-kunst"
(Cryptographs and the Art of Deciphering, or something to that effect)
three-quarters of which concentrated on a general solution for
polyalphabetic ciphers with a regularly repeating series of alphabets,
including the Vigenere case where the alphabets are all unscrambled
and keyed by a (usually plaintext) word or phrases, but also including
the case of completely random tableaux.

[...]

Break the code:

There are many ways to statistically break Vignere ciphers. Here's how they all generally work. You can implement any method you choose.

Vigenere ciphers can be thought of as combinations of Caesar ciphers, where each character of the key specifies a different rotation. For example, using "abcde" to encrypt this string:

the words are squeamish ossifrage
abc deabc dea bcdeabcde abcdeabcd

is like breaking up the plaintext into 5 different threads and encrypting each with a Ceasar cipher:

abcde

|||||
vvvvv

thewo
rdsar
esque
amish
ossif
rage

Once you know the key length, you can break up the ciphertext into threads as in the example above, then break each one individually by using the character frequency within each thread to deduce the amount of rotation (which just happens to indicate what that character of the key was). We'll elaborate on that.

The attacks on the Vigenere cipher are based on the frequency distribution of characters in the plaintext. Here's a graph of letter frequency in the Book of Mormon:

A ##################################################
B ########
C ###########
D #############################
E #############################################################################
F ##############
G ##########
H #################################################
I ###################################
J #
K ###
L ######################
M #################
N #########################################
O #############################################
P ##########
Q 
R ###############################
S ################################
T #########################################################
U ##############
V #####
W ############
X 
Y ###########
Z 

As you might expect, "E" is the most common letter, followed by "T". Consider what happens if we use a Caesar cipher on the plaintext, shifting each letter by 1 (a->b, b->c, z->a, etc.). Here's the frequency distribution of the ciphertext:

A 
B ##################################################
C ########
D ###########
E #############################
F #############################################################################
G ##############
H ##########
I #################################################
J ###################################
K #
L ###
M ######################
N #################
O #########################################
P #############################################
Q ##########
R 
S ###############################
T ################################
U #########################################################
V ##############
W #####
X ############
Y 
Z ###########

Note that the shape of the distribution is identical, but it has been shifted down by one letter. Not suprisingly, "f" is now the most frequent character, followed by "u". Just this frequency table of the ciphertext should tip off a cryptanalyst that she's dealing with English text rotated by one character using a Caesar cipher.

Now consider what happens with the Vigenere cipher. The string "eeeee" has a boring distribution:

A 
B
C
D
E #####

Encrypt "eeeee" with the string "abcde", and you get "efghi", which has a much flatter distribution:

E #
F # 
G #
H #
I #

Likewise, the Book of Mormon encrypted with a certain 12-character password has a much flatter distribution, since each "e" and "t" (along with everything else) has been shifted to one of 12 different positions instead of piling up:

A #########################
B #############################
C #################
D #######################
E ######################
F ##########################
G ##########################
H #####################
I ########################
J ######################
K ###################
L ##########
M ############################
N #######################
O ##########################
P ######################
Q #######################
R ##################
S ######################
T #########################
U ###############################
V #####################
W ################
X #####################
Y #############
Z ##############################

Statistical attacks work by breaking up the ciphertext into "threads", one for each character in the key, and treating each one as a Caesar cipher. To find the length of the key, guess different lengths and see which one gives a distribution that looks the most like the assumed distribution of the plaintext.

In the case of English text, an easy way to find out which key length gives the best distribution and where the "e" has been shifted is to calculate the mode of each distribution. Wrong key lengths will produce threads with flatter distributions than those of correct key lengths, so the key length which gives the highest frequency for its most common character is likely to be the correct one (or a multiple thereof). Likewise, the most common character probably represents "e" - and remember, since each thread is like a Caesar cipher, all the other characters will be shifted by the same amount.

Another way of performing the statistical analysis can be found on this page at Oregon State.

Assignment:

  1. Implement the Vigenere cipher as described in "The Alphabet-Cipher". Remove all spaces and punctuation from the plaintext before encrypting to discourage attacks based on word placement and length. Verify that your program encrypts and decrypts properly.
  2. Write a program that, given only the ciphertext, determines the key used to encrypt the plaintext. The plaintext is taken from the Book of Mormon and is a very large passage (357,370 characters). This will be plenty of text with which to perform an attack.
  3. You must pass off in person, before the due date, during TA hours. To passoff, you will be using the pass off scripts, starting here.
  4. You must show a TA that you have written a program that will encrypt and decrypt using the vigenere cipher. To test your program, use it to discover the key used to encrypt the following file: cipherText.txt (file last updated: Aug. 30, 2004). The file cipherText.txt contains a passage from the Book of Mormon and was encrypted using the vigenere cipher. If your program takes more than a few minutes to discover the key, run the program before you get the TA. Be prepared to show your code to the TA and explain in detail how your program works.