{"id":4888,"date":"2021-07-10T02:56:17","date_gmt":"2021-07-10T05:56:17","guid":{"rendered":"https:\/\/bureau-it.com\/artigos\/url-encoding-decoding-with-sed\/"},"modified":"2024-09-19T22:34:12","modified_gmt":"2024-09-20T01:34:12","slug":"url-encoding-decoding-with-sed","status":"publish","type":"post","link":"https:\/\/bureau-it.com\/en\/artigos\/url-encoding-decoding-with-sed\/","title":{"rendered":"URL encoding\/decoding with sed"},"content":{"rendered":"\n
<\/span>Estimated reading time: <\/span>4<\/span> minutes<\/span><\/p>\n\n URL encoding must be applied whenever a reserved character has to appear in a URL as plain data.\nBut what are these reserved characters and who defined them? <\/p>\n\n The reserved characters are explicitly described in RFC 3986: https:\/\/datatracker.ietf.org\/doc\/html\/rfc3986#section-2.2<\/a><\/p>\n\n They are:<\/p>\n\n In other words: “:” , “\/” , “?” , “#” , “[” , “]” , “@”, “!” , “$” , “&” , “'” , “(” , “)”, “*” , “+” , “,” , “;” , “=”<\/p>\n\n To decode, keep in mind that the hexadecimal codes from the ASCII table are used.\nCheck out the table in the man page: <\/p>\n\n Encoded URLs put a % before each two-digit hexadecimal code from the ASCII table.\nTherefore, to encode, simply replace the reserved character with %hexadecimal<\/strong>! =D <\/p>\n\n Your shell can turn a hexadecimal code back into a character if you give it the corresponding escape sequence.\nFor bash, we’ll use \\x.\nLook at the table above and do your own tests.\nFor the exclamation mark !, the code is 21.\nSee: <\/p>\n\n Let’s analyze a simple function with sed<\/strong> to do the decoding:<\/p>\n\n Basically, the sed command s\/%([0-9a-fA-F]{2})\/\\\\x\\1\/g<\/strong> replaces every % with \\x, provided that the following 2 characters form a hexadecimal number (from 00 to FF).\nThen, the -e<\/strong> option of For a slightly more sophisticated script, which also does the encoding, we chain a series of sed commands.<\/p>\n\n See the complete code, which includes all the reserved characters from RFC 3986:<\/p>\n\n Often the + sign will appear in URLs in place of a space.\nThis happens when the text is in a query string submitted by an HTML form.\nSee this section in RFC1866: https:\/\/datatracker.ietf.org\/doc\/html\/rfc1866#section-8.2.1 <\/p>\n\n Everywhere else in a URL, you must use percent-encoding (URL encoding).<\/p>\n\n Okay, now that you’ve understood the logic of the thing, you’re probably wondering: what if I use accented 
characters?<\/p>\n\n Well, accented characters are not in the ASCII table, but they are in the Unicode standard.\nUnicode text can be encoded as UTF-8, UTF-16 or UTF-32 (UTF = Unicode Transformation Format, read more at https:\/\/www.unicode.org\/faq\/utf_bom.html)<\/a>.\nYou can find out more about Unicode directly from the source https:\/\/unicode.org\/.<\/a> <\/p>\n\n Where \\x is the escape for a hexadecimal byte, \\u is the escape for a Unicode code point.\nFor example: <\/p>\n\n To check the hexadecimal bytes of a character, use hexdump:<\/p>\n\n To find a Unicode character from a hexadecimal code<\/a>, this area of the Unicode<\/a> website, which looks up the code charts of<\/a> the characters, should be enough, but it only accepts UTF-16 hexadecimal.\nHowever, you can find which Unicode character corresponds to a UTF-8 hexadecimal sequence through the lookup that the scarfboy site performs.\nCheck it out: <\/p>\n\n On macOS, due to software licensing problems with bash from version 4.0 onwards, we can’t generate Unicode characters like \u2623 with If you want to generate a HUGE list of Unicode characters, try this script below.\nRemember to save it in a file and give it executable permission with Questions?\nPost in the comments. There are various ways of encoding\/decoding URLs.Introduction<\/h2>\n\n
reserved = gen-delims \/ sub-delims\n\ngen-delims = \":\" \/ \"\/\" \/ \"?\" \/ \"#\" \/ \"[\" \/ \"]\" \/ \"@\"\n\nsub-delims = \"!\" \/ \"$\" \/ \"&\" \/ \"'\" \/ \"(\" \/ \")\"\n \/ \"*\" \/ \"+\" \/ \",\" \/ \";\" \/ \"=\"<\/pre>\n\n
man ascii<\/pre>\n\n
<\/figure>\n\n
$ echo -e \"\\x21\"\n!<\/pre>\n\n
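If you would rather not read the table by eye, printf can do the lookup for you. This is only a small sketch: the function name percent_code is made up for this example, and it assumes a single ASCII character as input.

```shell
#!/bin/bash
# Hypothetical helper: print the percent-encoded form of one ASCII character.
# A leading single quote in the argument makes printf use the numeric code
# of the character that follows it.
percent_code() {
  printf '%%%02X\n' "'$1"
}

for c in '!' '#' '&' '='; do
  printf '%s becomes ' "$c"
  percent_code "$c"
done
```

The exclamation mark comes out as %21, which matches the code 21 from the ASCII table.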
Replacing with sed<\/h2>\n\n
#!\/bin\/bash\n # Turn each %XX into \\xXX and each + into a space, then let echo -e expand the escapes\n URL_DECODE=\"$(echo \"$1\" | sed -E 's\/%([0-9a-fA-F]{2})\/\\\\x\\1\/g;s\/\\+\/ \/g')\"\n echo -e \"$URL_DECODE\"<\/pre>\n\n
echo<\/code> makes it interpret the resulting \\x escapes.\nOh, and the second sed command,
s\/\\+\/ \/g<\/code>, replaces every + sign with a space =).\nThe -E option makes sed use extended regular expressions, which avoids the extra backslashes that clutter the basic syntax. <\/p>\n\n
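As a variation on the decode function above, here is a sketch that uses printf '%b' instead of echo -e, since echo options behave differently from shell to shell. The function name url_decode_printf is hypothetical. Note that any backslash already present in the input would be interpreted as an escape too.

```shell
#!/bin/bash
# Hypothetical helper, not part of the original script.
url_decode_printf() {
  local escaped
  # Same two substitutions as before: + becomes a space, %XX becomes \xXX
  escaped=$(printf '%s' "$1" | sed -E 's/\+/ /g;s/%([0-9a-fA-F]{2})/\\x\1/g')
  # %b makes printf expand the \xXX escapes, just as echo -e would
  printf '%b\n' "$escaped"
}

url_decode_printf 'Hello%2C+world%21'
```

An encoded plus sign (%2B) survives as a real + because the sed stage only rewrites it to \x2B, and that is expanded afterwards by printf.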
#!\/bin\/bash\n #\n # URL encoding and decoding with sed\n #\n # By Daniel Cambr\u00eda\n # daniel.cambria@bureau-it.com\n #\n # Jul\/2021\n\n function url_decode() {\n # Turn each %XX into \\xXX and each + into a space\n echo \"$@\" \\\n | sed -E 's\/%([0-9a-fA-F]{2})\/\\\\x\\1\/g;s\/\\+\/ \/g'\n }\n\n function url_encode() {\n # Per RFC 3986: the space, then gen-delims, then sub-delims\n echo \"$@\" \\\n | sed \\\n -e 's\/ \/%20\/g' \\\n -e 's\/:\/%3A\/g' \\\n -e 's|\/|%2F|g' \\\n -e 's\/[?]\/%3F\/g' \\\n -e 's\/#\/%23\/g' \\\n -e 's\/\\[\/%5B\/g' \\\n -e 's\/\\]\/%5D\/g' \\\n -e 's\/@\/%40\/g' \\\n -e 's\/!\/%21\/g' \\\n -e 's\/\\$\/%24\/g' \\\n -e 's\/&\/%26\/g' \\\n -e \"s\/'\/%27\/g\" \\\n -e 's\/(\/%28\/g' \\\n -e 's\/)\/%29\/g' \\\n -e 's\/\\*\/%2A\/g' \\\n -e 's\/[+]\/%2B\/g' \\\n -e 's\/,\/%2C\/g' \\\n -e 's\/;\/%3B\/g' \\\n -e 's\/=\/%3D\/g'\n }\n\n # echo -e expands the \\xXX escapes produced by url_decode\n echo -e \"URL decode: $(url_decode \"$1\")\"\n echo \"URL encode: $(url_encode \"$1\")\"<\/pre>\n\n
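One thing the script above leaves alone is the % sign itself. It is not in the reserved list, but since it introduces every escape, a literal % in the input would be misread when decoding. A minimal sketch of one way to handle it, assuming hypothetical function names and only a handful of characters to keep it short, is to encode % before anything else:

```shell
#!/bin/bash
# Sketch only: hypothetical names, and just a few characters are covered.
encode_with_percent() {
  # % goes first, so the % signs added by the later rules are not re-encoded
  printf '%s\n' "$1" | sed \
    -e 's/%/%25/g' \
    -e 's/ /%20/g' \
    -e 's/!/%21/g' \
    -e 's|/|%2F|g'
}

decode_percent() {
  # No + handling here: this encoder always writes spaces as %20
  printf '%b\n' "$(printf '%s' "$1" | sed -E 's/%([0-9a-fA-F]{2})/\\x\1/g')"
}

encoded=$(encode_with_percent '100% sure!')
echo "$encoded"
decode_percent "$encoded"
```

The first line printed is 100%25%20sure%21 and the second is the original text again, so the round trip is lossless for this input.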
Note on encoding query strings<\/h2>\n\n
Unicode<\/h2>\n\n
echo -e \"\\u2623\"\nprintf \"\\u2623\\n\"\npython3 -c 'print(\"\\u2623\")'<\/pre>\n\n
$ echo -en \"\u2623\" | hexdump -C\n00000000  e2 98 a3                                          |...|\n00000003\n\n$ echo -e \"\\xe2\\x98\\xa3\"\n\u2623<\/pre>\n\n
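Those three bytes are exactly what ends up in a URL: a character outside ASCII is percent-encoded one UTF-8 byte at a time. The sketch below shows the idea with od, tr and sed. The function name percent_encode_bytes is made up for this example, and it encodes every byte, including plain letters, which is valid but more verbose than necessary.

```shell
#!/bin/bash
# Hypothetical helper: percent-encode every byte of the argument.
percent_encode_bytes() {
  printf '%s' "$1" \
    | od -An -v -tx1 \
    | tr -d ' \n' \
    | sed -E 's/([0-9a-f]{2})/%\1/g' \
    | tr 'a-f' 'A-F'
}

# The bytes e2 98 a3 from the hexdump above, written as octal escapes
symbol=$(printf '\342\230\243')
percent_encode_bytes "$symbol"
echo
```

For the symbol used in this article the result is %E2%98%A3, one escape per byte.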
Unicode in macOS bash<\/h2>\n\n
echo -e \"\\u2623\"<\/code>.\nBut you can install an up-to-date bash 5+ using
brew install bash<\/code>.\nIf you’ve never used brew, it’s very easy to install: check out the official website, https:\/\/brew.sh. This other article<\/a> describes in detail how to make the new bash your default shell.\nFortunately, Linux users won’t have this problem.\n=D <\/p>\n\n
chmod +x<\/code>.\nIf no characters appear, re-read the previous paragraph and update your bash XD <\/p>\n\n
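#!\/bin\/bash\n# Print the Unicode code points U+0000 to U+10FFFF, eight per line.\nfor y in $(seq 0 139263)\n do\n for x in $(seq 0 7)\n do\n a=$((y * 8 + x))\n # \\U expects hexadecimal digits, so convert the decimal counter first\n h=$(printf '%04X' \"$a\")\n echo -ne \"$h \\\\U$h \"\n done\n echo\ndone<\/pre>\n\n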
See also<\/h2>\n\n
Until next time! <\/p>\n","protected":false},"excerpt":{"rendered":"
\nProgrammers often use ready-made functions for this.
\nBut do you really know what these functions are doing?
\nFor this article, I’ve chosen sed as the tool to replace the codes and I point out the RFCs that discuss the subject. <\/p>\n","protected":false},"author":2,"featured_media":4615,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"jetpack_post_was_ever_published":false,"_jetpack_newsletter_access":"","_jetpack_dont_email_post_to_subs":false,"_jetpack_newsletter_tier_id":0,"_jetpack_memberships_contains_paywalled_content":false,"_jetpack_memberships_contains_paid_content":false,"resumo_insta":"","imagem_insta":"","acessibilidade_insta":"","hashtags_insta":"","resumo_linkedin":"","imagem_linkedin":"","hashtag_linkedin":"","resumo_face":"","imagem_face":"","hashtag_face":"","footnotes":""},"categories":[96],"tags":[100,102],"class_list":["post-4888","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-shell-script-en","tag-sed-en","tag-url-encoding-en"],"yoast_head":"\n