Error Reporting

We use the basic ANTLR 4 strategy for error recovery but have customized the way errors are reported.

Redirecting Parsing Errors

Parser errors are redirected to our Validator class, which implements the ANTLRErrorListener interface. In addition, the methods in the DefaultErrorStrategy class that report parser errors are overridden in a new class, MyErrorStrategy, which extends the default class and contains grammar-specific code for reporting parser errors.

Redirecting Token Recognition Errors (Lexer)

Lexer errors are also redirected to the Validator. Handling and redirecting token recognition errors in ANTLR is a bit more involved than rerouting parser errors, because LexerNoViableAltException is not routed through the reportError method in DefaultErrorStrategy. We want to retrieve the character sequence that caused the lexer problem without the standard 'token recognition error' message. To achieve this, we override the public void notifyListeners(LexerNoViableAltException e) method in the lexer, simply by creating a MyLexer class that extends the generated lexer.

String Token Problems

A problem that is not easily solved is the missing quote problem: a user has forgotten either the opening or the closing (double) quote of a string. Strings are typically defined as tokens, for example:

StringLiteral
    : '"' ('\\"' | .)*? '"'
    ;

With this definition, however, a missing double quote '"' easily derails the lexer: it tries to match all remaining text, from the last double quote found to the end of the file, as a StringLiteral token. This is not very helpful to a user. The resulting LexerNoViableAltException produces a very long error message, and the parser is no longer fed any useful tokens, so it becomes impossible to generate good error messages. One way to address at least the very long messages is to shorten them and to indicate that the end of the file was reached, for example:

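/**
 * Reduces the length of a token text for use in an error message.
 *
 * @param text The (token) text to be shortened. Assumes text.length() > 30!
 * @param endOfFile Whether the token runs until the end of the file.
 * @return the shortened text; at end of file, the original text is returned
 *         when appending the end-of-file note would not make the message shorter
 */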
private String reduceTooLongTokens(String text, boolean endOfFile) {
    String reduced = text.substring(0,26) + "...";
    if (endOfFile) {
        if (text.length()-reduced.length() > 34) {
            reduced += " (which continues till end of file!)";
        } else {
            reduced = text;
        }
    }
    return reduced;
}

Here, boolean eof = ((Lexer) recognizer).nextToken().getType() == Token.EOF; is used to detect whether we have reached the end of the file.
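The shortening logic can be tried out in isolation. The following is a minimal sketch, not the project's actual code: the class name TokenTextShortener, the method name shorten, and the MAX_LENGTH constant are assumptions made for illustration. Unlike the method above, it guards against short texts instead of assuming text.length() > 30.

```java
/** Standalone sketch of the token-shortening logic; all names here are illustrative. */
class TokenTextShortener {
    // Assumed threshold, taken from the precondition stated in the Javadoc above.
    static final int MAX_LENGTH = 30;

    /** Shortens text only when it exceeds MAX_LENGTH, so substring is always safe. */
    static String shorten(String text, boolean endOfFile) {
        if (text.length() <= MAX_LENGTH) {
            return text; // short enough: report as-is
        }
        String reduced = text.substring(0, 26) + "...";
        if (endOfFile) {
            if (text.length() - reduced.length() > 34) {
                // The note only pays off when enough text was cut away.
                reduced += " (which continues till end of file!)";
            } else {
                reduced = text;
            }
        }
        return reduced;
    }
}
```

For a 100-character token that runs to the end of the file, this yields the first 26 characters followed by "... (which continues till end of file!)"; a 40-character token at the end of the file is returned unchanged, because the note would make the message longer than the token itself.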

This approach, however, is not very satisfactory, because it does not address the fact that the parser no longer receives any useful tokens.

Delegating All Issues to Parser

An alternative approach is to reroute all token recognition issues and to make sure that all characters can be handled by the lexer. This can be achieved by introducing a simple ERROR token definition at the end of a (lexer) grammar:

  ERROR :  .  ;

This token matches every character that fails to match any other token, which should get rid of LexerNoViableAltException completely. For this suggestion, see also: http://stackoverflow.com/questions/22415208/get-rid-of-token-recognition-error. In practice, however, although this delegates the forgotten double quote problem to the parser, the parser's recovery strategy does not appear to be much better suited to dealing with it.

A Simple Solution to the Missing Quote Problem

Given that lexers are not context-sensitive (at all), a simple, pragmatic, and probably the best way to avoid the worst-case scenarios discussed above is to redefine the StringLiteral token so that it can no longer span multiple lines. We follow a suggestion that is also used in the Eclipse JDT and redefine StringLiteral as follows:

StringLiteral
    : '"' ('\\"' | ~[\r\n"])* '"'
    ;
UnterminatedStringLiteral
    : '"' ('\\"' | ~[\r\n"])*
    ;

This solution exploits the fact that users divide their code over multiple lines and usually keep line length within the bounds of their (editor) window. In practice the solution works well: it is reasonably user-friendly on the coding side and offers great benefits to a user from an error reporting point of view.
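To see how the two rules divide the work, the sketch below approximates them with java.util.regex patterns. This is only an illustration under stated assumptions: it is not ANTLR code, the class name StringRuleSketch and the method classify are hypothetical, and the string body is assumed to exclude unescaped double quotes and line breaks.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Regex approximation of the two lexer rules; not ANTLR code. */
class StringRuleSketch {
    // Approximates StringLiteral: quote, then escaped quotes or
    // characters other than quote/CR/LF, then a closing quote.
    static final Pattern TERMINATED =
            Pattern.compile("\"(?:\\\\\"|[^\\r\\n\"])*\"");
    // Approximates UnterminatedStringLiteral: same body, no closing quote.
    static final Pattern UNTERMINATED =
            Pattern.compile("\"(?:\\\\\"|[^\\r\\n\"])*");

    /** Classifies the text starting at the first double quote of a single line. */
    static String classify(String line) {
        int start = line.indexOf('"');
        if (start < 0) {
            return "no string";
        }
        Matcher m = TERMINATED.matcher(line).region(start, line.length());
        if (m.lookingAt()) {
            return "StringLiteral: " + m.group();
        }
        m = UNTERMINATED.matcher(line).region(start, line.length());
        m.lookingAt(); // always succeeds, since the region starts at a quote
        return "UnterminatedStringLiteral: " + m.group();
    }
}
```

A line such as x = "oops is classified as an unterminated literal that ends at the end of that line, so the damage of a forgotten quote stays confined to a single line.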

The solution can be further improved using the UnterminatedStringLiteral token above: we override the token emit method in the MyLexer class we created earlier as follows (again derived from the suggestion mentioned above):

@Override
public Token emit() {
    switch (getType()) {
    case UnterminatedStringLiteral:
        // Hand the parser a regular string token so that parsing can continue.
        setType(StringLiteral);
        Token result = super.emit();

        // report error
        String msg = "String literal is not properly closed by a double-quote";

        ANTLRErrorListener listener = getErrorListenerDispatch();
        listener.syntaxError(this, null, _tokenStartLine, _tokenStartCharPositionInLine, msg,
                new LexerNoViableAltException(this, result.getInputStream(), result.getStartIndex(), null));
        return result;
    default:
        return super.emit();
    }
}

Finally, by introducing a parser rule string : StringLiteral ('+' StringLiteral)*; it is still possible to define strings that span multiple lines: the user splits the string into multiple sub-parts, one per line.

Displaying Tokens

In order to display tokens in a readable manner to users, it is important to avoid printing short token names typically used in lexer grammars such as ID (instead of the longer Identifier). The most straightforward way to do that in ANTLR 4 is to override the getTokenErrorDisplay method in the MyErrorStrategy class and to do a case-by-case analysis of tokens that need improved output (get the relevant token types from the generated parser) and relabel the token names where desired. Make sure to also change the places where expecting.toString() is used in the DefaultErrorStrategy by introducing your own toString(IntervalSet tokens) method, e.g.:

/**
 * Helper method for reporting multiple expected alternatives.
 */
private String toString(IntervalSet tokens) {
    List<Integer> types = tokens.toList(); // requires java.util.List; computed once
    int size = types.size();
    StringBuilder str = new StringBuilder(size > 1 ? "either " : "");

    for (int i = 0; i < size; i++) {
        int type = types.get(i);
        // Token.EOF is -1 and therefore has no entry in tokenNames.
        String rawName = (type == Token.EOF) ? "<EOF>" : MAS2GParser.tokenNames[type];
        str.append(improvedTokenDisplay(rawName, type));
        if (i < size - 1) {
            str.append(" or ");
        }
    }

    return str.toString();
}
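The case-by-case relabeling itself can be sketched without any ANTLR types. In the example below, the class name TokenDisplaySketch, the label table, and the one-argument improvedTokenDisplay are hypothetical stand-ins for the helper referenced above; the token names and labels are made up for illustration.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Sketch of relabeling terse token names; all names and labels are illustrative. */
class TokenDisplaySketch {
    private static final Map<String, String> LABELS = new HashMap<>();
    static {
        LABELS.put("ID", "an identifier");
        LABELS.put("StringLiteral", "a string");
        LABELS.put("INT", "a number");
    }

    /** Returns a reader-friendly label, falling back to the raw token name. */
    static String improvedTokenDisplay(String tokenName) {
        String label = LABELS.get(tokenName);
        return label != null ? label : tokenName;
    }

    /** Joins the expected alternatives as "either A or B or C". */
    static String describeExpected(List<String> tokenNames) {
        StringBuilder sb = new StringBuilder(tokenNames.size() > 1 ? "either " : "");
        for (int i = 0; i < tokenNames.size(); i++) {
            if (i > 0) {
                sb.append(" or ");
            }
            sb.append(improvedTokenDisplay(tokenNames.get(i)));
        }
        return sb.toString();
    }
}
```

With these labels, the expected set ID, StringLiteral is reported as "either an identifier or a string", while token names without an entry in the table are printed unchanged.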

Tweaking Error Messages

Implementing a new error strategy also makes it possible to tweak the specific error messages that are produced. Modify the various report methods, such as reportNoViableAlternative, to make this happen.

Validation

Validation (or semantic analysis) takes into account whether the parser has already detected errors: each visitor method includes a simple flag that prevents additional (validation) errors from being reported in that case:

boolean problem = (ctx.exception != null);
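How such a flag might be used inside a visitor method is sketched below. This is an assumption-laden illustration rather than the actual validator: Ctx is a stand-in for a generated parse-tree context (ANTLR's ParserRuleContext exposes a public exception field in a similar way), and the class name, the visitModule method, and the empty-name check are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch of the validation flag; Ctx is a stand-in, not an ANTLR class. */
class ValidationSketch {
    static class Ctx {
        final String moduleName;
        final RuntimeException exception; // non-null if the parser reported an error here

        Ctx(String moduleName, RuntimeException exception) {
            this.moduleName = moduleName;
            this.exception = exception;
        }
    }

    final List<String> errors = new ArrayList<>();

    /** Reports a validation error only when the parser found no syntax error. */
    void visitModule(Ctx ctx) {
        boolean problem = (ctx.exception != null);
        if (!problem && ctx.moduleName.isEmpty()) {
            errors.add("Module name must not be empty");
        }
    }
}
```

The point of the design is that a context which already carries a syntax error produces no follow-up validation messages, so the user sees only the original problem.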