Statistical Techniques for Language Recognition: An Introduction and Guide for Cryptanalysts

Author/Creator ORCID

Date

2010-06-04

Department

Program

Citation of Original Publication

Ravi Ganesan & Alan T. Sherman (1993) STATISTICAL TECHNIQUES FOR LANGUAGE RECOGNITION: AN INTRODUCTION AND GUIDE FOR CRYPTANALYSTS, CRYPTOLOGIA, 17:4, 321-366, DOI: 10.1080/0161-119391867980

Rights

This item is likely protected under Title 17 of the U.S. Copyright Law. Unless on a Creative Commons license, for uses protected by Copyright Law, contact the copyright holder or the author.
“This is an Accepted Manuscript of an article published by Taylor & Francis in Cryptologia on04 Jun 2010, available online: http://www.tandfonline.com/10.1080/0161-119391867980.”

Abstract

We explain how to apply statistical techniques to solve several language-recognition problems that arise in cryptanalysis and other domains. Language recognition is important in cryptanalysis because, among other applications, an exhaustive key search of any cryptosystem from ciphertext alone requires a test that recognizes valid plaintext. Written for cryptanalysts, this guide should also be helpful to others as an introduction to statistical inference on Markov chains. Modeling language as a finite stationary Markov process, we adapt a statistical model of pattern recognition to language recognition. Within this framework we consider four well-defined language-recognition problems: 1) recognizing a known language, 2) distinguishing a known language from uniform noise, 3) distinguishing unknown 0th-order noise from unknown lst-order language, and 4) detecting non-uniform unknown language. For the second problem we give a most powerful test based on the Neyman-Pearson Lemma. For the other problems, which typically have no uniformly most powerful tests, we give likelihood ratio tests. We also discuss the chi-squared test statistic X 2 and the Index of Coincidence IC. In addition, we point out useful works in the statistics and pattern-matching literature for further reading about these fundamental problems and test statistics.