{"id":4888,"date":"2021-07-10T02:56:17","date_gmt":"2021-07-10T05:56:17","guid":{"rendered":"https:\/\/bureau-it.com\/artigos\/url-encoding-decoding-with-sed\/"},"modified":"2024-09-19T22:34:12","modified_gmt":"2024-09-20T01:34:12","slug":"url-encoding-decoding-with-sed","status":"publish","type":"post","link":"https:\/\/bureau-it.com\/en\/artigos\/url-encoding-decoding-with-sed\/","title":{"rendered":"URL encoding\/decoding with sed"},"content":{"rendered":"\n


Introduction<\/h2>\n\n

URL encoding (percent-encoding) must be applied whenever a reserved character has to appear in a URL as plain data rather than as a delimiter.\nBut what are these reserved characters, and who defined them? <\/p>\n\n

The reserved characters are explicitly described in RFC 3986: https:\/\/datatracker.ietf.org\/doc\/html\/rfc3986#section-2.2<\/a><\/p>\n\n

They are:<\/p>\n\n

reserved    = gen-delims \/ sub-delims\n\ngen-delims  = \":\" \/ \"\/\" \/ \"?\" \/ \"#\" \/ \"[\" \/ \"]\" \/ \"@\"\n\nsub-delims  = \"!\" \/ \"$\" \/ \"&\" \/ \"'\" \/ \"(\" \/ \")\"\n              \/ \"*\" \/ \"+\" \/ \",\" \/ \";\" \/ \"=\"<\/pre>\n\n

In other words: “:” , “\/” , “?” , “#” , “[” , “]” , “@”, “!” , “$” , “&” , “'” , “(” , “)”, “*” , “+” , “,” , “;” , “=”<\/p>\n\n

To decode, keep in mind that percent-encoding uses the hexadecimal codes from the ASCII table.\nYou can check the table in the man page: <\/p>\n\n

man ascii<\/pre>\n\n
\"\"<\/figure>\n\n

In an encoded URL, each encoded character appears as % followed by its two-digit hexadecimal code from the ASCII table.\nTherefore, to encode, simply replace the reserved character with %hexadecimal<\/strong>! =D <\/p>\n\n

Your shell can interpret a hexadecimal code when you write it with the corresponding escape sequence.\nIn bash, that is \\x.\nLook at the table above and run your own tests.\nFor the exclamation mark !, the code is 21.\nSee: <\/p>\n\n

$ echo -e \"\\x21\"\n!<\/pre>\n\n
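The lookup can also go the other way, from a character to its code, without reading the table by eye. A minimal sketch, assuming ASCII input; percent_code is a hypothetical helper name, not something from the original script:

```shell
#!/bin/bash
# Hypothetical helper: print the percent-encoded form of one ASCII character.
percent_code() {
    # A leading single quote makes printf use the numeric code of the
    # character that follows it; %02X prints that code as two hex digits.
    printf '%%%02X\n' "'$1"
}

percent_code '!'    # %21
percent_code '&'    # %26
percent_code ' '    # %20
```

The values match the hexadecimal column of man ascii.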

Replacing with sed<\/h2>\n\n

Let’s analyze a simple function with sed<\/strong> to do the decode:<\/p>\n\n

    #!\/bin\/bash\n    URL_DECODE=\"$(echo \"$1\" | sed -E 's\/%([0-9a-fA-F]{2})\/\\\\x\\1\/g;s\/\\+\/ \/g')\"\n    echo -e \"$URL_DECODE\"<\/pre>\n\n

Basically, the sed command s\/%([0-9a-fA-F]{2})\/\\\\x\\1\/g<\/strong> replaces every % with \\x, provided that the next 2 characters form a hexadecimal number (from 00 to FF).\nThen the -e<\/strong> option of echo<\/code> interprets each \\x escape as the corresponding character.\nOh, and the second sed command s\/\\+\/ \/g <\/code>replaces every + sign with a space =).\nThe -E option of sed enables extended regular expressions, which avoids many of the escape characters that clutter the syntax. <\/p>\n\n
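To see the two stages separately, here is a small sketch that prints the intermediate text produced by sed before it is interpreted. It assumes bash, and it uses printf %b in place of echo -e (both interpret \x escapes in bash); the sample string and variable names are only illustrative.

```shell
#!/bin/bash
# Step 1: sed rewrites each %XX as \xXX and each + as a space.
intermediate=$(printf '%s\n' 'Hello%2C+world%21' \
    | sed -E 's/%([0-9a-fA-F]{2})/\\x\1/g;s/\+/ /g')
printf '%s\n' "$intermediate"    # Hello\x2C world\x21

# Step 2: printf %b turns the \xXX escapes into the actual characters.
decoded=$(printf '%b' "$intermediate")
printf '%s\n' "$decoded"         # Hello, world!
```

Note that an encoded plus sign (%2B) survives as a real +, because by the time the second sed command runs it has already become \x2B.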

For a slightly more sophisticated script that also does the encoding, we use a series of sed commands in sequence.<\/p>\n\n

See the complete code, which includes all the reserved characters from RFC 3986:<\/p>\n\n

    #!\/bin\/bash\n    #\n    # URL encoding and decoding with sed\n    #\n    # By Daniel Cambr\u00eda\n    # daniel.cambria@bureau-it.com\n    #\n    # Jul\/2021\n\n    function url_decode() {\n        # Turn each %XX into \\xXX and each + into a space;\n        # the echo -e at the end of the script interprets the \\x escapes.\n        echo \"$@\" \\\n        | sed -E 's\/%([0-9a-fA-F]{2})\/\\\\x\\1\/g;s\/\\+\/ \/g'\n    }\n\n    function url_encode() {\n        # According to RFC 3986.\n        # % is encoded first so that the % signs added by the later rules are not encoded again.\n        # [?] and [+] are bracket expressions, which keep ? and + literal in basic regular expressions.\n        echo \"$@\" \\\n        | sed \\\n        -e 's\/%\/%25\/g' \\\n        -e 's\/ \/%20\/g' \\\n        -e 's\/:\/%3A\/g' \\\n        -e 's|\/|%2F|g' \\\n        -e 's\/[?]\/%3F\/g' \\\n        -e 's\/#\/%23\/g' \\\n        -e 's\/\\[\/%5B\/g' \\\n        -e 's\/\\]\/%5D\/g' \\\n        -e 's\/@\/%40\/g' \\\n        -e 's\/!\/%21\/g' \\\n        -e 's\/\\$\/%24\/g' \\\n        -e 's\/&\/%26\/g' \\\n        -e \"s\/'\/%27\/g\" \\\n        -e 's\/(\/%28\/g' \\\n        -e 's\/)\/%29\/g' \\\n        -e 's\/\\*\/%2A\/g' \\\n        -e 's\/[+]\/%2B\/g' \\\n        -e 's\/,\/%2C\/g' \\\n        -e 's\/;\/%3B\/g' \\\n        -e 's\/=\/%3D\/g'\n    }\n\n    echo -e \"URL decode: $(url_decode \"$1\")\"\n    echo \"URL encode: $(url_encode \"$1\")\"<\/pre>\n\n
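The script above lists each character by hand. As an alternative sketch (a different technique, not the author's method), a loop can percent-encode every character that is not in the RFC 3986 unreserved set (letters, digits, "-", ".", "_", "~"). It assumes bash and ASCII-only input, and url_encode_loop is a hypothetical name:

```shell
#!/bin/bash
# Hypothetical alternative: encode everything outside the unreserved set.
# Assumes ASCII input; multi-byte characters are not handled here.
url_encode_loop() {
    local input=$1 out='' c hex i
    for (( i = 0; i < ${#input}; i++ )); do
        c=${input:i:1}
        case $c in
            [a-zA-Z0-9.~_-])
                out+=$c ;;                    # unreserved: keep as-is
            *)
                printf -v hex '%%%02X' "'$c"  # anything else: %XX
                out+=$hex ;;
        esac
    done
    printf '%s\n' "$out"
}

url_encode_loop 'a b&c=d/e?f'    # a%20b%26c%3Dd%2Fe%3Ff
url_encode_loop '100%'           # 100%25
```

This trades the explicit list for a rule, so characters such as the double quote or the percent sign are covered without adding more sed expressions.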

Note on encoding query strings<\/h2>\n\n

The + sign often appears in URLs in place of a space.\nThis happens when the text is in a query string produced by an HTML form (the form-urlencoded media type).\nSee this section in RFC 1866: https:\/\/datatracker.ietf.org\/doc\/html\/rfc1866#section-8.2.1 <\/p>\n\n

But in any other part of a URL, you must use percent-encoding (URL encoding), so a space becomes %20.<\/p>\n\n
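A small sketch of that distinction: the + to space rule is applied only to the query string, while the part before the ? is decoded with percent-decoding alone. It assumes bash and a URL that contains a ?; decode_percent, decode_form and the example.com URL are hypothetical names used for illustration.

```shell
#!/bin/bash
# Hypothetical helpers for illustration; not part of the original script.
decode_percent() {
    # Decode %XX only; a literal + is left untouched.
    printf '%b\n' "$(printf '%s\n' "$1" | sed -E 's/%([0-9a-fA-F]{2})/\\x\1/g')"
}

decode_form() {
    # Query-string rule: + means space, then decode %XX.
    decode_percent "${1//+/ }"
}

url='https://example.com/a+b?q=c+d%21'
path=${url%%\?*}     # everything before the first ?
query=${url#*\?}     # everything after the first ?

decode_percent "$path"    # https://example.com/a+b
decode_form "$query"      # q=c d!
```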

Unicode<\/h2>\n\n

Okay, now that you’ve understood the logic of the thing, you’re probably wondering: what if I use accented characters?<\/p>\n\n

Well, accented characters are not in the ASCII table; they are part of the Unicode standard.\nUnicode text can be encoded as UTF-8, UTF-16 or UTF-32 (UTF = Unicode Transformation Format, read more at https:\/\/www.unicode.org\/faq\/utf_bom.html)<\/a>.\nYou can find out more about Unicode directly from the source: https:\/\/unicode.org\/.<\/a> <\/p>\n\n

Just as \\x is the escape for a hexadecimal byte, \\u is the escape for a Unicode code point.\nFor example: <\/p>\n\n

echo -e \"\\u2623\"\nprintf \"\\u2623\\n\"\npython3 -c 'print(\"\\u2623\")'<\/pre>\n\n

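To check the hexadecimal bytes that UTF-8 uses for a character, dump them one byte at a time with od (plain hexdump without options groups the bytes into 16-bit words):<\/p>\n\n

$ echo -en \"\u2623\" | od -An -tx1\n e2 98 a3\n\n$ echo -e \"\\xe2\\x98\\xa3\"\n\u2623<\/pre>\n\n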
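Since percent-encoding works on bytes, a non-ASCII character is encoded as one %XX group for each of its UTF-8 bytes (e2 98 a3 in the example above). A minimal sketch of that idea, assuming the input is already UTF-8; utf8_percent_encode is a hypothetical helper, and it encodes every byte, including plain ASCII ones:

```shell
#!/bin/bash
# Hypothetical helper: percent-encode every byte of its argument.
utf8_percent_encode() {
    # Dump the bytes as hex pairs and strip the spaces and newlines od adds.
    hex=$(printf '%s' "$1" | od -An -v -tx1 | tr -d ' \n')
    # Put a % in front of every pair and use uppercase hex digits.
    printf '%s\n' "$hex" | sed 's/../%&/g' | tr 'a-f' 'A-F'
}

# The octal escapes below are the UTF-8 bytes e2 98 a3 (U+2623).
biohazard=$(printf '\342\230\243')
utf8_percent_encode "$biohazard"    # %E2%98%A3
```

Decoding needs nothing new: the same %XX to \xXX replacement shown earlier rebuilds the three bytes, and the terminal displays them as one character.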

To find a character from its hexadecimal code<\/a>, the code chart lookup on the Unicode<\/a> website<\/a> should be enough, but it only accepts UTF-16 hexadecimal, not UTF-8 bytes.\nTo find out which Unicode code corresponds to a UTF-8 hexadecimal sequence, you can use the lookup on the scarfboy site.\nCheck it out: <\/p>\n\n