URL encoding/decoding with sed

There are various ways of encoding/decoding urls. Programmers often use ready-made functions for this. But do you really know what these functions are doing? For this article, I've chosen sed as the tool to replace the codes and I point out the RFCs that discuss the subject.

Estimated reading time: 4 minutes

Introduction

URL encoding must be applied every time it is necessary to use a reserved character in a URL. But what are these reserved characters and who defined them?

The reserved characters are explicitly described in RFC 3986: https://datatracker.ietf.org/doc/html/rfc3986#section-2.2

They are:

reserved    = gen-delims / sub-delims

gen-delims  = ":" / "/" / "?" / "#" / "[" / "]" / "@"

sub-delims  = "!" / "$" / "&" / "'" / "(" / ")"
              / "*" / "+" / "," / ";" / "="

In other words: “:” , “,” , “?” , “#” , “[” , “]” , “@”, “!” , “$” , “&” , “‘” , “(” , “)”, “*” , “+” , “,” , “;” , “=”

To make the decode, know that the hexadecimal codes in the ASCII table are used. Check out the table in the man:

man ascii

Encoded URLs display the % before each hexadecimal number in the ASCII table. Therefore, to create the encoding, simply replace the reserved character with %hexadecimal! =D

Consider that your shell will read and interpret the hexadecimal with the corresponding escape code. For bash, we’ll use \x. Look at the table above and do your own tests. For the exclamation !, the code is 21. See:

$ echo -e "\x21"
!

Replacing with sed

Let’s analyze a simple function with sed to do the decode:

    #!/bin/bash
    URL_DECODE="$(echo "$1" | sed -E 's/%([0-9a-fA-F]{2})/\\x\1/g;s/\+/ /g'"
    echo -e "$URL_DECODE"

Basically, the sed command s/%([0-9a-fA-F]{2})/\x\1/g replaces all % with \x, provided that the following 2 characters represent a hexadecimal number (from 00 to FF). Then, the -e option of echo is activated to interpret this hexadecimal. Oh, and the second sed command s/\+/ /g is replacing any + signs with space =). The -E in sed is to enable the use of modern regular expressions, to avoid too many escape characters that clutter the syntax.

For a slightly more sophisticated script, which also does the encoding, then we use a bunch of sed commands in sequence.

See the complete code, which includes all the reserved characters from RFC 3986:

    #!/bin/bash
    #
    # Enconding e Decoding de URL com sed
    #
    # Por Daniel Cambría
    # [email protected]
    #
    # jul/2021

    function url_decode() {
    echo "$@" \
        | sed -E 's/%([0-9a-fA-F]{2})/\\x\1/g;s/\+/ /g'
    }

    function url_encode() {
        # Conforme RFC 3986
        echo "$@" \
        | sed \
        -e 's/ /%20/g' \
        -e 's/:/%3A/g' \
        -e 's/,/%2C/g' \
        -e 's/\?/%3F/g' \
        -e 's/#/%23/g' \
        -e 's/\[/%5B/g' \
        -e 's/\]/%5D/g' \
        -e 's/@/%40/g' \
        -e 's/!/%41/g' \
        -e 's/\$/%24/g' \
        -e 's/&/%26/g' \
        -e "s/'/%27/g" \
        -e 's/(/%28/g' \
        -e 's/)/%29/g' \
        -e 's/\*/%2A/g' \
        -e 's/\+/%2B/g' \
        -e 's/,/%2C/g' \
        -e 's/;/%3B/g' \
        -e 's/=/%3D/g'
    }

    echo -e "URL decode: " $(url_decode "$1")
    echo -e "URL encode: " $(url_encode "$1")

Note on encoding query strings

Often the + sign will appear in URLs to replace the space. This occurs when the text is in a query string. See this section in RFC1866: https://datatracker.ietf.org/doc/html/rfc1866#section-8.2.1

But for any other HTML encoding, you must use percent-encoding (URL encoding).

Unicode

Okay, now that you’ve understood the logic of the thing, you’re probably wondering: what if I use accented characters?

Well, accents are not in the ASCII table, but in the Unicode standard. This standard can appear as UTF-8, UTF-16 and UTF-32 (UTF= Unicode Transformation Format, read more at https://www.unicode.org/faq/utf_bom.html). You can find out more about Unicode directly from the source https://unicode.org/.

If the default for using a hexadecimal number is \x, the unicode is \u. For example:

echo -e "\u2623"
printf "\u2623"
python -c 'print u"\u2623"'

To check the hexadecimal code, use hexdump:

$ echo -en "☣" | hexdump
0000000 e2 98 a3

$ echo -e "\xe2\x98\xa3"
☣

To check the unicode code from a hexadecimal code, this area of the Unicode website which looks up the code charts of the characters should be enough, but it only accepts UTF-16 hexadecimal. However, it is possible to check which Unicode code is UTF-8 hexadecimal through the lookup that the scarfboy site performs. Check it out:

Unicode in MacOS bash

On MacOS, due to software licensing problems with bash from version 4.0 onwards, we can’t generate unicode like ☣ with echo -e "\u2623". But you can install the very up-to-date bash 5+ using brew install bash. If you’ve never used brew, it’s very easy to install, check out the author’s website https://brew.sh. And in this other article, the procedure for making the new bash the default is very detailed. Fortunately, Linux users won’t have this problem. =D

If you want to generate a HUGE list of unicode characters, try this script below. Remember to save it in a file and give it executable permission with chmod +x. If no characters appear, re-read the previous paragraph and update your bash XD

#!/bin/bash
for y in $(seq 0 524287)
  do
  for x in $(seq 0 7)
  do
    a=$(expr $y \* 8 + $x)
    echo -ne "$a \\u$a "
  done
  echo
done

See also

Questions? Post in the comments.
Until next time!

Sobre o autor

Comentários

Mais artigos