Information encoding is the low-level mapping used to represent the information being handled.
Understanding encoding schemes is a big advantage during the detection and exploitation of vulnerabilities in web application penetration testing.
Before we talk about encoding, let's first look at what a character set is.
It is a set of character symbols (what the user sees on the screen, like A, B, ...) together with their code points (a numeric index, like hexadecimal 41, which is decimal 65, for the character A).
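As a minimal sketch of the symbol and code point relationship, the following Python snippet uses the built-in `ord` and `chr` functions. The variable names are only illustrative.

```python
# Map a character to its code point and back.
symbol = "A"
code_point = ord(symbol)      # numeric index of the symbol in the charset

print(code_point)             # 65 (decimal)
print(hex(code_point))        # 0x41 (hexadecimal)
print(chr(0x41))              # A
```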
Charset vs character encoding
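Encoding is the representation, in one or more bytes, of the symbols of a charset.
An example is the Unicode charset, which has three character encodings: UTF-8, UTF-16 and UTF-32. The numbers 8, 16 and 32 are the size in bits of the code unit each encoding uses. UTF-8 and UTF-16 are variable-length, so a code point may take more than one unit (1 to 4 bytes in UTF-8, 2 or 4 bytes in UTF-16), while UTF-32 always uses 4 bytes.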
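To make the difference between the three encodings concrete, here is a small Python sketch that counts how many bytes a few sample characters take in each one. The helper name `encoded_lengths` and the sample characters are assumptions chosen for illustration.

```python
samples = ["A", "é", "€", "😀"]

def encoded_lengths(ch):
    """Return the number of bytes the character takes in each Unicode encoding."""
    return {
        "utf-8": len(ch.encode("utf-8")),
        "utf-16": len(ch.encode("utf-16-be")),  # "-be" avoids the byte order mark
        "utf-32": len(ch.encode("utf-32-be")),
    }

for ch in samples:
    # Print the code point in the usual U+XXXX notation, then the byte counts.
    print(f"U+{ord(ch):04X}", encoded_lengths(ch))
```

The same code point gets a different byte length depending on the encoding, which is why the charset and the encoding have to be treated as separate things.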
You may have seen a charset declaration in an HTML page header before (for example, a meta tag with a charset attribute). What does it do for us? It specifies the character encoding of the HTML document, which lets browsers handle characters and symbols correctly.
These encoding schemes can be applied to all applications, not just web applications.
HTML has many characters with a special meaning. They may be interpreted as part of the language and not shown to the end user, like the < and > characters that delimit tags.
What if we need to display those symbols? HTML entities exist for that purpose.
An HTML entity is simply a string (starting with & or &# and ending with ;) that corresponds to a symbol. For example, &lt; stands for < and &gt; stands for >.
The browser shows the corresponding symbol and does not interpret it as an HTML language element.
Note: Even though HTML entities are not a security feature, their use can limit most client-side attacks (like XSS).
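As a rough illustration of entity encoding, the sketch below uses Python's standard `html` module. The sample string `raw` is made up for the example.

```python
import html

# A string containing characters that have a special meaning in HTML.
raw = '<script>alert("x")</script>'

escaped = html.escape(raw)
print(escaped)   # &lt;script&gt;alert(&quot;x&quot;)&lt;/script&gt;

# Decoding the entities gives back the original symbols.
print(html.unescape(escaped) == raw)      # True
print(html.unescape("&#60; &lt; &copy;")) # < < ©
```

When the escaped string is placed in a page, the browser displays the tag as text instead of running it.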
URLs sent over the Internet may only contain characters from the US-ASCII charset.
If unsafe characters are present in a URL, they must be encoded.
The characters allowed in a URL are a specific subset:
General Chars: [a-zA-Z] [0-9] [- . _ ~]
Reserved Chars (that have a specific purpose): : / ? # [ ] @ ! $ & ' ( ) * + , ; =
The percent sign (%) introduces an encoded character, so a literal % must itself be encoded as %25.
Other characters, and Reserved Chars when they have no special role inside the URL, must be encoded as a percent character (%) followed by two hexadecimal digits (that is why it is called percent encoding).
Example: # becomes %23 and ? becomes %3F. For a complete list, see http://www.w3schools.com/tags/ref_urlencode.asp
URL encoding is usually performed automatically by your browser.
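Percent encoding can also be done by hand. The following is a minimal sketch using `quote` and `unquote` from Python's standard `urllib.parse` module. The sample `value` is an assumption for illustration, and `safe=""` is passed so that reserved characters such as / are encoded as well.

```python
from urllib.parse import quote, unquote

print(quote("#"))   # %23
print(quote("?"))   # %3F

# A value containing a space and several reserved characters.
value = "a b&c=d/e?f#g"
encoded = quote(value, safe="")
print(encoded)      # a%20b%26c%3Dd%2Fe%3Ff%23g

# Decoding restores the original string.
print(unquote(encoded) == value)   # True
```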
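Base64 is a binary-to-text encoding scheme used to convert binary data to text so it can be sent over the Internet, like attached files in emails or images in web pages.
Its alphabet is composed of the digits [0-9] and upper and lower case letters [a-zA-Z], for a total of 62 values.
To complete the character set to 64 there are the plus (+) and slash (/) characters. Some implementations use different characters for these last two values (for example, - and _ in the URL-safe variant). The equals sign (=) is not part of the alphabet; it is used as padding at the end of the output.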
Example of a Base64 image in an HTML document:
<img src="data:image/gif;base64, zwQVEEkFRG2aCVqslieiOa5ruCtc1eCoqAeDdJnoi6N1B1lfpR6acujsqtM1quoZHf0fOH9DgnzQ……………==" alt="Base64 encoded image" width="150" height="150">
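To see how the alphabet and the padding work, here is a short sketch using Python's standard `base64` module. The input bytes are arbitrary values picked for the example.

```python
import base64

# Three input bytes map to exactly four Base64 characters, so no padding.
print(base64.b64encode(b"Hi!"))   # b'SGkh'

# Shorter inputs are padded with '=' up to a multiple of four characters.
print(base64.b64encode(b"Hi"))    # b'SGk='
print(base64.b64encode(b"H"))     # b'SA=='

# The standard alphabet uses '+' and '/', the URL-safe variant '-' and '_'.
print(base64.b64encode(b"\xfb\xff"))           # b'+/8='
print(base64.urlsafe_b64encode(b"\xfb\xff"))   # b'-_8='

# Decoding returns the original bytes.
print(base64.b64decode("SGkh"))   # b'Hi!'
```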
URL encoding and Base64 may be used to bypass some basic filtering in web application penetration testing.
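The sketch below illustrates the idea with a deliberately simplistic filter. The function `naive_filter` and the payload are hypothetical and only stand in for a basic check that looks at the raw input without decoding it first.

```python
from urllib.parse import quote, unquote

def naive_filter(value):
    """Hypothetical filter: accept input unless it contains a literal '<script'."""
    return "<script" not in value.lower()

payload = "<script>alert(1)</script>"
encoded = quote(payload, safe="")

print(naive_filter(payload))         # False: the plain payload is rejected
print(naive_filter(encoded))         # True: the encoded form passes the check
print(unquote(encoded) == payload)   # True: decoding later restores the payload
```

If the application decodes the value after the filter has run, the original payload reaches the vulnerable code path, which is why filters should be applied to the decoded input.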
Burp Suite has a simple and useful decoder that can encode or decode URL encoding, Base64 and other types.