Lex Strings

Quoted strings frequently appear in programming languages. Here is one way to match a string in lex:

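%{
    #include <stdio.h>
    #include <string.h>
    char *yylval;
%}
%%
\"[^"\n]*["\n] {
        /* copy the match without its opening quote */
        yylval = strdup(yytext+1);
        /* the last matched character is either '"' or a newline */
        if (yylval[yyleng-2] != '"')
            fprintf(stderr, "improperly terminated string\n");
        /* drop the closing quote (or the newline) */
        yylval[yyleng-2] = 0;
        printf("found '%s'\n", yylval);
    }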

The pattern above stops at the first newline, so a string cannot cross a line boundary, and the action removes the enclosing quotes. If we wish to support escape sequences, such as \n, start states simplify matters:

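%{
#include <stdio.h>      /* printf */
#include <stdlib.h>     /* exit */
char buf[100];          /* decoded string; this example does not check for overflow */
char *s;                /* next free position in buf */
%}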
%x STRING

%%

\"              { BEGIN STRING; s = buf; }
<STRING>\\n     { *s++ = '\n'; }
<STRING>\\t     { *s++ = '\t'; }
<STRING>\\\"    { *s++ = '\"'; }
<STRING>\"      { 
                  *s = 0;
                  BEGIN 0;
                  printf("found '%s'\n", buf);
                }
<STRING>\n      { printf("invalid string\n"); exit(1); }
<STRING>.       { *s++ = *yytext; }

The exclusive start state STRING is declared with %x in the definitions section. When the scanner detects a quote, the BEGIN macro shifts lex into the STRING state. Because the state is exclusive, lex recognizes only patterns prefixed with <STRING> until another BEGIN is executed. Thus we have a mini-environment for scanning strings. When the trailing quote is recognized, BEGIN 0 switches back to the initial state.
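
To make the role of the start state concrete, here is a minimal hand-written C sketch of the same idea: a state variable that changes which characters are accepted. It is not what lex generates. The names scan_string, scan_state, ST_INITIAL and ST_STRING are hypothetical and chosen only for this illustration. Unlike the lex example, the sketch also refuses to write past the end of the output buffer.

```c
#include <assert.h>
#include <stddef.h>

/* The two scanner states, mirroring lex's initial state and %x STRING. */
enum scan_state { ST_INITIAL, ST_STRING };

/*
 * scan_string is a hypothetical helper, not part of lex.
 * It expects src to begin with a double quote, decodes \n, \t and \"
 * into out, and returns the number of input characters consumed
 * (including both quotes). It returns -1 if the string is not opened
 * by a quote, contains a newline, is unterminated, or does not fit.
 */
int scan_string(const char *src, char *out, size_t outsize)
{
    enum scan_state state = ST_INITIAL;
    size_t n = 0;                       /* characters stored in out so far */
    const char *p;

    assert(src != NULL && out != NULL);
    if (outsize == 0)
        return -1;

    for (p = src; *p != '\0'; p++) {
        char c = *p;

        if (state == ST_INITIAL) {      /* like  \"  { BEGIN STRING; } */
            if (c != '"')
                return -1;
            state = ST_STRING;
            continue;
        }
        if (c == '"') {                 /* like <STRING>\" : string is done */
            out[n] = '\0';
            return (int)(p - src) + 1;
        }
        if (c == '\n')                  /* like <STRING>\n : invalid string */
            return -1;
        if (c == '\\' && (p[1] == 'n' || p[1] == 't' || p[1] == '"')) {
            c = (p[1] == 'n') ? '\n' : (p[1] == 't') ? '\t' : '"';
            p++;                        /* skip the escaped character */
        }
        if (n + 1 >= outsize)           /* keep room for the terminating 0 */
            return -1;
        out[n++] = c;                   /* like <STRING>. : copy the character */
    }
    return -1;                          /* input ended before the closing quote */
}
```

Calling scan_string on the six input characters "a\nb" (quote, a, backslash, n, b, quote) stores a, a newline, and b in out and returns 6. As in the lex rules, a backslash that is not followed by n, t, or a quote is copied unchanged.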