Patrick Juola

Duquesne University, Pittsburgh, Pennsylvania

Feature Sets in Authorship Attribution: A Software-Based Case Study

Computational authorship attribution is a well-understood problem with a well-known and widely used general structure for its solution, but with literally thousands of individual variations. This makes it relatively easy to support different variations within a single program and to compare how different methods perform. This is the approach taken by JGAAP, a program developed by the Evaluating Variations in Language (EVL) laboratory at Duquesne University. I will demonstrate JGAAP and describe its architecture.
A key question in authorship attribution is which specific document features to examine. De Morgan and Mendenhall suggested word lengths in the 19th century. Mosteller and Wallace used a problem-specific list of words to analyze The Federalist Papers. Burrows and later researchers suggested using the most common words of a set of documents, while Stamatatos (among others) suggests using character clusters (n-grams). I myself have proposed using a mixture of independent features such as common words, word pairs, word lengths, and character clusters. Probably more than 50% of the research in this area focuses simply on words, characters, or clusters of words or characters.
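To make the feature families named above concrete, here is a minimal Python sketch of three of them: word-length distributions, relative frequencies of the most common words, and character n-grams. It is an illustration only and does not reflect how JGAAP implements these features; the function names, the simple regular-expression tokenizer, and the default values (`k=50`, `n=3`) are all assumptions made for the example.

```python
import re
from collections import Counter


def tokenize(text):
    # Assumed tokenizer: lowercase runs of letters and apostrophes.
    return re.findall(r"[a-z']+", text.lower())


def word_length_profile(text):
    # Relative frequency of each word length in the text.
    lengths = Counter(len(token) for token in tokenize(text))
    total = sum(lengths.values())
    return {n: c / total for n, c in lengths.items()} if total else {}


def most_common_words(documents, k=50):
    # The k most frequent words across a whole set of documents.
    counts = Counter()
    for doc in documents:
        counts.update(tokenize(doc))
    return [word for word, _ in counts.most_common(k)]


def common_word_profile(text, vocabulary):
    # Relative frequency in one text of each word in a fixed vocabulary.
    tokens = tokenize(text)
    counts = Counter(tokens)
    total = len(tokens)
    return {w: (counts[w] / total if total else 0.0) for w in vocabulary}


def char_ngram_profile(text, n=3):
    # Relative frequency of each overlapping character n-gram.
    grams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    total = sum(grams.values())
    return {g: c / total for g, c in grams.items()} if total else {}
```

Each function returns a dictionary that can be treated as a feature vector for one document. A mixture of features, as suggested above, could then be built by computing several of these profiles for the same text and combining them.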
What are some other features that have been used? What other features could be used, or what new technology will need to be developed to make the use of other features practical and competitively accurate? A lively discussion is encouraged, to push the field of authorship studies creatively forward.

