extend this awk script to handle apostrophes

The following awk script correctly converts text so that runs of consecutive uppercase letters such as ABCDE become \textsc{abcde}. For example, this text:

~> cat old-file.txt
   THIS SENTENCE IS ALL CAPS except not really
   But this sentence has ONE capped word
   This Sentence Has Many Caps

When run through this script:

~> cat tst.awk
{
    while ( match( $0, /([[:upper:]]{2,}[[:space:]]*)+/) ) {
        rstart  = RSTART
        rlength = RLENGTH
        if ( match( substr($0,RSTART,RLENGTH), /[[:space:]]+$/) ) {
            rlength = rlength - RLENGTH
        }
        $0 = substr($0,1,rstart-1) \
             "\\textsc{" tolower(substr($0,rstart,rlength)) "}" \
             substr($0,rstart+rlength)
    }

    print
}

Produces this output:

~> awk -f tst.awk old-file.txt  > new-file.txt
~> cat new-file.txt
\textsc{this sentence is all caps} except not really
But this sentence has \textsc{one} capped word
This Sentence Has Many Caps
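
A note on why the script copies RSTART and RLENGTH into rstart and rlength: the nested match() call that looks for trailing whitespace overwrites both built-in variables. The sketch below is only an illustration; the simplified pattern and the sample input line (with two spaces after ONE) are assumptions, not part of the original script.

```shell
# Sketch: a second match() call overwrites RSTART and RLENGTH.
# The input line and the simplified pattern are made up for illustration.
out=$(printf '%s\n' 'has ONE  capped' | awk '{
    # Outer match: the capitalized run plus its trailing spaces.
    match($0, /[[:upper:]]+[[:space:]]*/)
    rstart  = RSTART
    rlength = RLENGTH
    # Inner match: trailing spaces inside the matched substring.
    match(substr($0, RSTART, RLENGTH), /[[:space:]]+$/)
    # rstart and rlength keep the outer values; RSTART and RLENGTH
    # now describe the inner match.
    print rstart, rlength, RSTART, RLENGTH
}')
printf '%s\n' "$out"   # prints: 5 5 4 2
```

The outer match starts at column 5 and is 5 characters long; the inner match finds 2 trailing spaces, which is the amount the script subtracts from rlength.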

Bounty: Please extend the script tst.awk so that it can handle runs of uppercase letters separated by apostrophes. Specifically, a sentence such as I'LL BUY YOU A DRINK should become \textsc{i'll buy you a drink}.

Just for clarification, should the Ls in I'LL in your test sentence be capitalized too? Also, do words that start with and end with an apostrophe next to a space need to be matched too? Should JONES' HOUSE and JUST 'CAUSE (just because) be matched? What if the apostrophe is the first or last character in the string?
maccam912 over 6 years ago
Yes, the Ls in I'LL in the test sentence should have been capitalized (fixed). "JONES' HOUSE" should become "\textsc{jones' house}" and "JUST 'CAUSE" should become "\textsc{just 'cause}". If the apostrophe is first or last, it doesn't matter whether you include it, so "BLAZIN' " could become either "\textsc{blazin}' " or "\textsc{blazin'} ", your call.
suchow over 6 years ago
awarded to maccam912


1 Solution

Winning solution

One assumption I had to make (correct me if I'm wrong): even though the script doesn't match when only one character is capitalized, as in "This Sentence Has Many Caps", I still made it match a single capitalized character if that character is surrounded by two spaces. It seemed you wanted to ignore words that were not all caps but still match single-letter words, as in "I'LL BUY YOU A DRINK" (that A in there). Then, because the word I is always capitalized, I only wanted to match it when a word before or after it is capitalized too.

Sadly, this made the regular expression quite a bit longer. It works, however. Here is my new old-file.txt:

THIS SENTENCE IS ALL CAPS except not really
But this sentence has ONE capped word
This Sentence Has Many Caps
New Line's Have apostrophe's in Incorrect Grammar Places
Now LET'S TRY it with CERTAI'N WORD'S in CAPS
I'LL BUY YOU A DRINK
Hello 'BEGINNING and MID'DLE and END' and L'O'T'S' 'O'F' 'T'H'E'M' work
We also don't want to match single letter words like I is
Unless the single letter word is surrounded by
other WORDS THAT I CAPITALIZED

and here is the output:

\textsc{this sentence is all caps} except not really
But this sentence has \textsc{one} capped word
This Sentence Has Many Caps
New Line's Have apostrophe's in Incorrect Grammar Places
Now \textsc{let's try} it with \textsc{certai'n word's} in \textsc{caps}
\textsc{i'll buy you a drink}
Hello \textsc{'beginning} and \textsc{mid'dle} and \textsc{end'} and \textsc{l'o't's' 'o'f' 't'h'e'm'} work
We also don't want to match single letter words like I is
Unless the single letter word is surrounded by
other \textsc{words that i capitalized}

If that looks like correct behavior, then here is the updated awk script. If not, let me know and I can undo some of my "fixes": (note that the whole regex on the 'while' line is supposed to be on one line)

{
    while ( match( $0, /((([']*[[:upper:]]+){2,}[']*[[:space:]]*)|[[:upper:]][[:space:]][[:upper:]][[:space:]]|[[:space:]][[:upper:]][[:space:]][[:upper:]])+/) ) {
        rstart  = RSTART
        rlength = RLENGTH
        if ( match( substr($0,RSTART,RLENGTH), /[[:space:]]+$/) ) {
            rlength = rlength - RLENGTH
        }
        $0 = substr($0,1,rstart-1) \
             "\\textsc{" tolower(substr($0,rstart,rlength)) "}" \
             substr($0,rstart+rlength)
    }

    print
}
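
Since the one-line regex is hard to read, here is a rough sketch of the same pattern assembled from named pieces in a BEGIN block and used as a dynamic regex. The file name tst2.awk and the variable names (up, sp, chunk, word, one_a, one_b, re) are made up for illustration. It also spells the {2,} repetition as "chunk chunk+", on the assumption that some older awk implementations do not support interval expressions; treat it as a sketch, not a replacement for the script above.

```shell
# Hypothetical file name tst2.awk; all variable names are illustrative.
cat > tst2.awk <<'EOF'
BEGIN {
    up    = "[[:upper:]]"
    sp    = "[[:space:]]"
    chunk = "([']*" up "+)"                 # optional apostrophes, then capitals
    # chunk chunk+ stands in for chunk{2,}
    word  = "(" chunk chunk "+[']*" sp "*)"
    one_a = up sp up sp                     # single letters, e.g. "A B "
    one_b = sp up sp up                     # single letters, e.g. " A B"
    re    = "(" word "|" one_a "|" one_b ")+"
}
{
    while (match($0, re)) {
        rstart  = RSTART
        rlength = RLENGTH
        # Trim trailing whitespace from the matched run.
        if (match(substr($0, rstart, rlength), /[[:space:]]+$/))
            rlength -= RLENGTH
        $0 = substr($0, 1, rstart - 1) \
             "\\textsc{" tolower(substr($0, rstart, rlength)) "}" \
             substr($0, rstart + rlength)
    }
    print
}
EOF

out=$(printf '%s\n' "I'LL BUY YOU A DRINK" "JONES' HOUSE" \
      "This Sentence Has Many Caps" | awk -f tst2.awk)
printf '%s\n' "$out"
```

With those three sample lines, the first two should come out as \textsc{i'll buy you a drink} and \textsc{jones' house}, and the third should be left unchanged.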

And in case the formatting is messed up on this site for some reason, the plaintext of it is also here: http://pastie.org/private/tbs8jmyfieek0psq4pdgg

Thanks, you did a much better job of answering my question than I did of asking it. This is great.
suchow over 6 years ago
Nice regex. I was trying but it exceeded my brain's pattern buffer :-)
elwood over 6 years ago