
    Statistical Techniques For Language Recognition: An Empirical Study Using Real And Simulated English

    Files
    ShermanCryptologia94.pdf (1012.Kb)
    Links to Files
    https://www.tandfonline.com/doi/pdf/10.1080/0161-119491882919?needAccess=true
    Permanent Link
    http://hdl.handle.net/11603/12834
    Collections
    • UMBC Center for Information Security and Assurance (CISA)
    • UMBC Computer Science and Electrical Engineering Department
    • UMBC Faculty Collection
    Metadata
    Author/Creator
    Ganesan, Ravi
    Sherman, Alan T.
    Date
    2010-06-04
    Type of Work
    21 pages
    Text
    journal articles postprints
    Citation of Original Publication
    Ravi Ganesan & Alan T. Sherman (1994) STATISTICAL TECHNIQUES FOR LANGUAGE RECOGNITION: AN EMPIRICAL STUDY USING REAL AND SIMULATED ENGLISH, CRYPTOLOGIA, 18:4, 289-331, DOI: 10.1080/0161-119491882919
    Rights
    This item is likely protected under Title 17 of the U.S. Copyright Law. Unless on a Creative Commons license, for uses protected by Copyright Law, contact the copyright holder or the author.
    “This is an Accepted Manuscript of an article published by Taylor & Francis in CRYPTOLOGIA, on 04 Jun 2010, available online: http://www.tandfonline.com/10.1080/0161-119491882919”
    Subjects
    automatic language recognition
    statistical approaches to language recognition
    cryptography
    cryptanalysis
    Abstract
    Computer experiments compare the effectiveness of five test statistics at recognizing and distinguishing several types of real and simulated English strings. These experiments measure the statistical power and robustness of the test statistics X², ML, IND, S, and IC when applied to samples of everyday American English from the Brown Corpus and Wall Street Journal and to simulated English generated from 1st-order Markov models based on these samples. An empirical approach is needed because the asymptotic theory of statistical inference on Markov chains does not apply to short strings drawn from natural language. Here, X² is the chi-squared test statistic; ML is a likelihood ratio test for recognizing a known language; IND is a likelihood ratio test for distinguishing unknown 0th-order noise from unknown 1st-order language; S is a log-likelihood function that is a most-powerful test for distinguishing a known language from uniform noise; and IC is the index of coincidence. The test languages comprise four types of real English, two types of simulated 1st-order English, and three types of noise. Two experiments characterize the distributions of these test statistics when applied to nine test languages, presented as strings of different lengths and contaminated with various amounts of noise. Experiment 1 varies the length of the string from 2 to 2¹⁷ characters. Experiment 2 adds uniform noise to samples of three fixed lengths (2⁴, 2⁷, 2¹⁰), with the amount of added noise ranging from 0% to 100%. These experiments assess the performance of the test statistics under realistic cryptographic constraints. Using graphs and tables of observed statistical power, we compare the effectiveness of the test statistics at distinguishing various pairs of languages at several critical levels. Although no statistic dominated all others for all critical levels and string lengths, each test performed well at its designated task. For distinguishing a known type of English from uniform noise at critical levels 0.1 through 0.0001, X² attained the highest power, with ML and S also performing well. For distinguishing uniform noise from a known type of English at the same critical levels, ML had the overall best performance, with IC, X², S, and IND also performing well. And for each of these tasks under noisy conditions, ML attained the highest power. In addition, through histograms we describe the actual distribution of each statistic on various language types. These detailed results, which show relationships between power, critical level, and string length, will help cryptanalysts and others apply statistical methods to practical language-recognition problems.
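
As a rough illustration of two of the statistics named in the abstract, the following Python sketch computes the index of coincidence (IC) and a Pearson chi-squared statistic (X²) for a string. It is a simplification and not the authors' procedure: it works on single-letter counts over a 26-letter alphabet and tests against uniform noise only, whereas the study uses 1st-order Markov models. The names `letter_counts`, `index_of_coincidence`, and `chi_squared_vs_uniform` are hypothetical, chosen for this example.

```python
from collections import Counter

# Assumption: a 26-letter lowercase alphabet; the paper's alphabet may differ.
ALPHABET = "abcdefghijklmnopqrstuvwxyz"


def letter_counts(text):
    """Count each letter a-z, ignoring case and all non-letter characters."""
    counts = Counter(ch for ch in text.lower() if ch in ALPHABET)
    return [counts[ch] for ch in ALPHABET]


def index_of_coincidence(text):
    """Probability that two distinct positions chosen at random hold the same letter."""
    counts = letter_counts(text)
    n = sum(counts)
    if n < 2:
        raise ValueError("need at least two letters")
    return sum(c * (c - 1) for c in counts) / (n * (n - 1))


def chi_squared_vs_uniform(text):
    """Pearson chi-squared statistic of the letter counts against uniform noise."""
    counts = letter_counts(text)
    n = sum(counts)
    if n == 0:
        raise ValueError("need at least one letter")
    expected = n / len(ALPHABET)  # every letter equally likely under the null
    return sum((c - expected) ** 2 / expected for c in counts)


if __name__ == "__main__":
    sample = "the quick brown fox jumps over the lazy dog"
    print(index_of_coincidence(sample))
    print(chi_squared_vs_uniform(sample))
```

For orientation, uniform noise over 26 letters has an expected IC of 1/26 (about 0.0385), while ordinary English text is typically near 0.066, so larger values of either statistic point away from uniform noise. On very short strings both statistics are highly variable, which is the regime the abstract says calls for an empirical rather than asymptotic treatment.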


    Albin O. Kuhn Library & Gallery
    University of Maryland, Baltimore County
    1000 Hilltop Circle
    Baltimore, MD 21250
    www.umbc.edu/scholarworks

    Contact information:
    Email: scholarworks-group@umbc.edu
    Phone: 410-455-3021


    If you wish to submit a copyright complaint or withdrawal request, please email mdsoar-help@umd.edu.

     

     
