Decrypting a "monoalphabetic" substitution cipher - A Case Study

Acknowledgements: Mike Wright pointed out (22 Jul 2001) that this page actually describes a cipher even simpler than monoalphabetic substitution. Mike is quite right, of course, and I've made a couple of minor changes in this page as a result.


On Tuesday 8th May 2001, a friend of mine sent me some sample ciphertext, which he said was a monoalphabetic substitution cipher, where each byte had been XORd with an integer value - and he asked me to demonstrate how to decrypt it. My pleasure.

First of all, let's nail down the encryption algorithm. It's very simple. Let the key be some integer K. Let the plaintext be P, and the ciphertext be C.

For each character in the plaintext, then, we perform the following operation:

C = P ^ K

where ^ means XOR.

So we take the first byte of plaintext, XOR it with the key, and write out the result to a file. Then we take the next byte, XOR it with the key, and write that one to the file. We keep doing that over and over, until we run out of plaintext. If this doesn't sound terribly secure to you, you're right! It won't take at all long to crack.

Here's a short extract from the ciphertext, to give us some flavour:

535 535 535 535 634 638 628 613 632 612 632 625 611 535 625 632 610 633 627
630 611 638 632 633 535 628 635 630 612 612 535 635 638 629 613 630 613 622
535 525 535 611 606 602 606 601 592 628 603 606 594 601 579 573 522 522 522
522 522 522 522 522 522 522 522 522 522 522 522 522 522 522 522 522 522 522
522 522 522 522 522 522 522 522 522 522 522 522 522 522 522 522 522 522 522
522 522 522 522 522 522 522 522 522 522 522 522 522 522 522 522 522 522 522
522 522 522 522 522 522 522 522 522 522 522 522 573 573 573 630 583 583 608
606 589 598 581 595 535 607 598 580 535 596 581 594 598 579 594 595 535 579
607 606 580 535 611 606 602 606 601 592 628 603 606 594 601 579 535 598 583
583 603 606 596 598 579 606 600 601 535 593 600 581 535 590 600 578 537 535
535 611 607 606 580 535 598 583 583 603 606 596 598 579 606 600 601 573 601

Now, we can see that all the numbers have three digits. That's handy to know, although by no means essential. What is useful to know is how many discrete symbols there are, and how many times each symbol occurs. To discover some of that information, I passed the data through this program:

#include <stdio.h>

int main(void)
{
  int ch = 0;
  while((ch = getchar()) != EOF)
  {
    if(ch == ' ' || ch == '\n')
    {
      putchar('\n');
    }
    else
    {
      putchar((unsigned char)ch);
    }
  }

  return 0;
}

Details: I saved the ciphertext in a file, cipher.txt. I called the above program ts.c, compiled it, and ran it like so:

ts < cipher.txt > cipher.ts

(So I still have the original ciphertext, of course.)

The next stage was to sort my cipher.ts file...

sort < cipher.ts > cipher.srt

Now we can write another C program.  :-)  This one reports the number of occurrences of each symbol. Note that it requires sorted input (hence the previous sort). Note also that it might contain bugs. I don't actually mind this as much as I usually would - if it gets one or two symbol counts wrong, it doesn't matter. (Of course, that doesn't mean I put any bugs in deliberately...) The reason it doesn't matter is that we are about to take some wild guesses anyway...

#include <stdio.h>
#include <string.h>

int main(void)
{
  char buffer[128] = {0};
  char old[128] = {0};
  unsigned long count = 0;

  while(fgets(buffer, sizeof buffer, stdin))
  {
    if(0 != strcmp(buffer, old))
    {
      if(0 == strcmp(old, ""))
      {
        /* this is just the first line - do nothing */
      }
      else
      {
        printf("%09lu : %s", count, buffer);
      }
      count = 1;
      strcpy(old, buffer);
    }
    else
    {
      ++count;
    }
  }

  if(0 == strcmp(old, buffer))
  {
    printf("%09lu : %s", count, buffer);
  }
  else
  {
    printf("000000001 : %s", buffer);
  }

  printf("\n");

  return 0;
}
 

symcount < cipher.srt | sort /R > cipher.cnt

Note that the program produces text output with the numeric key zero-padded on the left, making it highly suitable for sorting - and that's exactly what I've done. Note also that I have used the DOS sort program's /R switch to reverse the sense of the sort. This moves the most commonly used symbols to the top of the list. Here are the first few lines of the output:

000000516 : 536
000000308 : 537
000000241 : 607
000000239 : 595
000000186 : 580
000000180 : 581
000000167 : 600
000000160 : 601
000000155 : 602
000000144 : 525
000000122 : 604
000000121 : 583

So symbol '536' is used 516 times, symbol '537' is used 308 times, etc. I have quoted the frequencies of only the first twelve symbols here.

Believe it or not, we're almost finished. If you look at the table above, you'll see that the first few symbols are used a lot, and the frequency declines quite drastically. This is beginning to look very much like English text. Now, in English, the most commonly used letters are, in order, E T A O I N S H R D L U (and indeed newspaper typesetters used to keep the letters in little cases stored in this order - with the capital letters stored in the upper case, and the small letters stored in the lower case - so now you know).

Of course, one of those frequently used characters could be a space character or something, but that won't hold us up for long. It's time for another program.

#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char **argv)
{
  unsigned long key = 0;
  char *endp = NULL;
  unsigned long ciphersym = 0;
  char ciphertext[128] = {0};
  char *ciphertoken = NULL;
  char plain = 0;

  if(argc > 2)
  {
    ciphersym = strtoul(argv[1], &endp, 10);
    if(endp == argv[1])
    {
      fprintf(stderr, "Can't parse arg 1.\n");
    }
    else
    {
      key = ciphersym ^ (unsigned char)argv[2][0];
    }
  }
 
  while(fgets(ciphertext, sizeof ciphertext, stdin))
  {
    ciphertoken = strtok(ciphertext, " \t\n");
    while(ciphertoken != NULL)
    {
      ciphersym = strtoul(ciphertoken, NULL, 10);
      plain = (char)((unsigned char)(ciphersym ^ key));
      if(isprint((unsigned char)plain))
      {
        printf("%c", plain);
      }
      else
      {
        printf(".", plain);
      }
      fflush(stdout);
      ciphertoken = strtok(NULL, " \t\n");
    }
  }

  fprintf(stderr, "If this looks good, the key is %lu\n", key);

  return 0;
}

This program (guess.c) accepts the ciphertext on standard input. In argv[1], you give it a symbol, and in argv[2] you give it your guess as to the meaning of that symbol. Since 'e' is the most common letter in normal text, it's quite likely to occur at least once, so it makes a useful hook into the ciphertext. Now, what the program does is very simple - it works out the XOR of your symbol and (the ASCII value of) your guess. Rather conveniently, given that C = P ^ K, it is also true that K = C ^ P. Thus, given a symbol and a guess at the meaning of that symbol, we can work out a guessed key. Now, since 'e' is the most common letter in typical English text, we can start off by guessing like this:

guess 536 e < cipher.txt

This gives us complete gibberish. Ah well. We move down the list a little:

guess 537 e < cipher.txt

Gibberish. Next...

guess 607 e < cipher.txt

Still gibberish. We should expect several punctuation marks to be about as common as 'e' (the space character, for example), so let's not lose hope. There are, in fact, only 64 unique symbols within the ciphertext and, if need be, we will try them all.

guess 595 e < cipher.txt

More gibberish. But, a few symbols further down the list, we hit this guess:

guess 594 e < cipher.txt

And here is the output:

========================================================================.
MICROSOFT FOUNDATION CLASS LIBRARY : TimingClient.==============================
==========================================...AppWizard has created this TimingCl
ient application for you.  This application.not only demonstrates the basics of
using the Microsoft Foundation classes.but is also a starting point for writing
your application...This file contains a summary of what you will find in each of
 the files that.make up your TimingClient application...TimingClient.dsp.    Thi
s file (the project file) contains information at the project level and.    is u
sed to build a single project or subproject. Other users can share the.    proje
ct (.dsp) file, but they should export the makefiles locally...TimingClient.h.
  This is the main header file for the application.  It includes other.    proje
ct specific headers (including Resource.h) and declares the.    CTimingClientApp
 application class...TimingClient.cpp.    This is the main application source fi
le that contains the application.    class CTimingClientApp...TimingClient.rc.
  This is a listing of all of the Microsoft Windows resources that the.    progr
am uses.  It includes the icons, bitmaps, and cursors that are stored.    in the
 RES subdirectory.  This file can be directly edited in Microsoft..Visual C++...
TimingClient.clw.    This file contains information used by ClassWizard to edit
existing.    classes or add new classes.  ClassWizard also uses this file to sto
re.    information needed to create and edit message maps and dialog data.    ma
ps and to create prototype member functions...res\TimingClient.ico.    This is a
n icon file, which is used as the application's icon.  This.    icon is included
 by the main resource file TimingClient.rc...res\TimingClient.rc2.    This file
contains resources that are not edited by Microsoft ..Visual C++.  You should pl
ace all resources not editable by..the resource editor in this file......///////
//////////////////////////////////////////////////////////////////////..AppWizar
d creates one dialog class:..TimingClientDlg.h, TimingClientDlg.cpp - the dialog
.    These files contain your CTimingClientDlg class.  This class defines.    th
e behavior of your application's main dialog.  The dialog's.    template is in T
imingClient.rc, which can be edited in Microsoft..Visual C++....////////////////
/////////////////////////////////////////////////////////////.Other standard fil
es:..StdAfx.h, StdAfx.cpp.    These files are used to build a precompiled header
 (PCH) file.    named TimingClient.pch and a precompiled types file named StdAfx
.obj...Resource.h.    This is the standard header file, which defines new resour
ce IDs..    Microsoft Visual C++ reads and updates this file.../////////////////
////////////////////////////////////////////////////////////.Other notes:..AppWi
zard uses "TODO:" to indicate parts of the source code you.should add to or cust
omize...If your application uses MFC in a shared DLL, and your application is .i
n a language other than the operating system's current language, you.will need t
o copy the corresponding localized resources MFC42XXX.DLL.from the Microsoft Vis
ual C++ CD-ROM onto the system or system32 directory,.and rename it to be MFCLOC
.DLL.  ("XXX" stands for the language abbreviation..For example, MFC42DEU.DLL co
ntains resources translated to German.)  If you.don't do this, some of the UI el
ements of your application will remain in the.language of the operating system..
./////////////////////////////////////////////////////////////////////////////.I
f this looks good, the key is 567

Well, there we are. The 'e' character is represented by 594, and the program kindly displays the key for us: 567.

My friend actually provided the key for me as a check, and my key agrees with his, so I know for sure that I'm done - but the above output is highly unlikely to have occurred by chance - the key he gave me was effectively redundant. (Actually, not quite redundant. I wrote a quick known-key decryption program to check that he wasn't just yanking my chain with random 'ciphertext'!)

Anyway, we're done. That's how it's done - or at least it's one way to do it. Frighteningly easy. Don't rely on such simple ciphers to protect your data.
 

  C_Dreamer
  8 May 2001