New protocols are being invented and deployed rapidly, but the basics remain the same at the foundation level.
- Binary Protocol Structure
- Dates and Times
- Tag, Length, Value Pattern
- Multiplexing and Fragmentation
- Network Address Information
- Structured Binary Formats
- Text Protocol Structures
- Encoding Binary Data
A good understanding of these underlying structures is essential when analysing many aspects of information security, including monitoring, forensics, and malware analysis.
Binary Protocol Structures
Binary protocols work at the binary machine level; the smallest unit of data is a single binary digit, or bit (0 or 1). Dealing with single bits is difficult, so we’ll use 8-bit units called octets, commonly known as bytes. The octet is the de facto unit of network protocols. Although octets can be broken down into individual bits (for example, to represent a set of flags), we’ll treat all network data in 8-bit units.
When showing individual bits, we’ll use the bit format, which shows bit 7, the most significant bit (MSB), on the left and bit 0, the least significant bit (LSB), on the right. (Some architectures, such as PowerPC, define the bit numbering in the opposite direction.)
- Numeric Data
- Unsigned Integers
- Signed Integers
- Variable Length Integers
- Floating Point Data
- Booleans
- Bit Flags
- Binary Endian
- Text and Human Readable Data
- Code Pages
- Multi-byte Character Sets
- Unicode
- UTF-16
- UTF-32
- UTF-8
- Variable Binary Length Data
- Terminated Data
- Length Prefixed Data
- Implicit-Length Data
- Padded Data
Numeric Data
Data values representing numbers are usually at the core of a binary protocol. These values can be integers or decimal values. Numbers can be used to represent the length of data, to identify tag values, or simply to represent a number.
Note: In binary, numeric values can be represented in a few different ways, and a protocol’s method of choice depends on the value it’s representing.
The following are some of the more common formats.
- Unsigned Integers
- Signed Integers
- Variable-Length Integers
- Floating-Point Data
Unsigned Integers
Unsigned integers are the most obvious representation of a binary number. Each bit has a specific value based on its position, and these values are added together to represent the integer.
Table 3-1 shows the decimal value of each bit in an 8-bit integer.
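As a minimal sketch (assuming Python and its struct module, used here only for illustration), unsigned integers of different widths can be pulled out of raw protocol bytes like this:

import struct

# Example buffer: one 8-bit value, one 16-bit value, one 32-bit value
data = b"\x29\x00\x01\x00\x00\x00\x02"

(u8,) = struct.unpack_from("B", data, 0)    # 8-bit unsigned: 0x29 = 41
(u16,) = struct.unpack_from(">H", data, 1)  # 16-bit unsigned, big endian: 1
(u32,) = struct.unpack_from(">I", data, 3)  # 32-bit unsigned, big endian: 2

print(u8, u16, u32)                         # 41 1 2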
Signed Integers
Not all integers in real-world calculations are positive; calculating the difference between two numbers, for example, can require a negative result. Because CPUs only understand and interpret binary, there must be a way to represent negative numbers.
The most common signed interpretation is two’s complement. The term refers to the way in which the signed integer is represented within a native integer value in the CPU; only signed integers can hold negative values.
Conversion between unsigned and signed values in two’s complement is done by taking the bitwise NOT (where a 0 bit is converted to a 1 and 1 is converted to a 0) of the integer and adding 1.
For example, the following figure shows the 8-bit integer 41 converted to its two’s complement representation.
The two’s complement representation has one dangerous security consequence. For example, an 8-bit signed integer has the range -128 to 127, so the magnitude of the minimum is larger than the maximum. If the minimum value is negated, the result is itself; in other words, -(-128) is -128. This can cause calculations to be incorrect in parsed formats, leading to security vulnerabilities.
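The conversion and the negation quirk are easy to demonstrate. A small sketch (Python assumed; the helper name is just illustrative):

def twos_complement_negate(value, bits=8):
    # Bitwise NOT within the given width, then add 1,
    # keeping the result inside the same width.
    mask = (1 << bits) - 1
    return ((~value & mask) + 1) & mask

print(hex(twos_complement_negate(41)))    # 0xd7, the 8-bit representation of -41
print(hex(twos_complement_negate(0x80)))  # 0x80 again: -(-128) stays -128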
Variable-Length Integers
Efficient transfer of network data has historically been very important. Even though today’s high-speed networks might make efficiency concerns unnecessary, there are still advantages to reducing a protocol’s bandwidth. It can be beneficial to use variable-length integers when the most common integer values being represented are within a very limited range.
For example, consider length fields that might otherwise be sent as 32-bit words: when sending blocks of data between 0 and 127 bytes in size, you could use a 7-bit variable-length integer representation. At most, five octets are required to represent the entire 32-bit range. But if your protocol tends to assign values between 0 and 127, it will only use one octet, which saves a considerable amount of space.
That said, if you parse more than five octets (or even 32 bits), the resulting integer from the parsing operation will depend on the parsing program.
Some programs (including those developed in C) will simply drop any bits beyond a given range, whereas other development environments will generate an overflow error. If not handled correctly, this integer overflow might lead to vulnerabilities, such as buffer overflows, which could cause a smaller than expected memory buffer to be allocated, in turn resulting in memory corruption.
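A hedged sketch of decoding such a 7-bit variable-length integer, assuming the layout described earlier (the low 7 bits of each octet carry data and a set MSB means another octet follows), with an explicit cap so oversized values are rejected instead of silently truncated:

def decode_varint(data, max_octets=5):
    # Little-endian base-128: 7 bits of value per octet.
    result = 0
    for i, octet in enumerate(data):
        if i >= max_octets:
            # More octets than a 32-bit value needs: reject rather than
            # drop bits or overflow.
            raise ValueError("varint too long")
        result |= (octet & 0x7F) << (7 * i)
        if not octet & 0x80:          # MSB clear: this is the last octet
            return result, i + 1
    raise ValueError("varint truncated")

print(decode_varint(b"\x2a"))      # (42, 1)  -- values 0 to 127 need one octet
print(decode_varint(b"\xac\x02"))  # (300, 2)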
Floating-Point Data
Sometimes, integers aren’t enough to represent the range of decimal values needed for a protocol.
For example, a protocol for a multiplayer computer game might require sending the coordinates of players or objects in the game’s virtual world. If this world is large, it would be easy to run up against the limited range of a 32- or even 64-bit fixed-point value.
The floating-point format used most often is the IEEE format specified in the IEEE Standard for Floating-Point Arithmetic (IEEE 754).
Although the standard specifies a number of different binary and even decimal formats for floating-point values, you’re likely to encounter only two: a single-precision binary representation, which is a 32-bit value, and a double-precision, 64-bit value. Each format specifies the position and bit size of the significand (mantissa) and exponent.
A sign bit is also specified, indicating whether the value is positive or negative.
Figure: layout of an IEEE floating-point value.
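To make the two layouts concrete, a quick sketch (Python's struct module assumed) packing the same value in both precisions:

import struct

# Single precision: 1 sign bit, 8 exponent bits, 23 significand bits (32 bits)
# Double precision: 1 sign bit, 11 exponent bits, 52 significand bits (64 bits)
print(struct.pack(">f", 1.5).hex())  # 3fc00000
print(struct.pack(">d", 1.5).hex())  # 3ff8000000000000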
Booleans
Because Booleans are very important to computers, it’s no surprise to see them reflected in a protocol. Each protocol determines how to represent whether a Boolean value is true or false, but there are some common conventions.
The basic way to represent a Boolean is with a single-bit value. A 0 bit means false and a 1 means true. This is certainly space efficient but not necessarily the simplest way to interface with an underlying application.
It’s more common to use a single byte for a Boolean value (rather than a single bit) because it’s far easier to manipulate. It’s also common to use zero to represent false and non-zero to represent true.
Bit Flags
Bit flags are one way to represent specific Boolean states in a protocol. For example, in TCP a set of bit flags is used to determine the current state of a connection.
When making a connection, the client/initiator sends a packet with the synchronize (SYN) flag set to indicate that the connection should synchronize its timers. The server/listener can then respond with both the acknowledgment (ACK) flag, to indicate it has received the client request, and the SYN flag, to establish synchronization with the client (a SYN-ACK). The connection is then established by a final ACK flag from the initiator/client. If this handshake used single enumerated values, this dual state would be impossible without a distinct SYN-ACK state.
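A small sketch of how such flags are usually tested, using the bit values from the TCP header and plain bitwise operations (Python assumed):

# TCP flag bits as defined in the TCP header
FIN = 0x01
SYN = 0x02
RST = 0x04
PSH = 0x08
ACK = 0x10

flags = SYN | ACK                  # a SYN-ACK packet

if flags & SYN and flags & ACK:
    # Both states are visible at once, which a single enumerated
    # value could not express without a dedicated SYN-ACK entry.
    print("SYN-ACK")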
Binary Endian
The endianness of data is a very important part of interpreting binary protocols correctly. It comes into play whenever a multi-octet value, such as a 32-bit word, is transferred. Endianness is an artifact of how computers store data in memory.
Because octets are transmitted sequentially on the network, it’s possible to send either the most significant octet of a value first or the reverse, the least significant octet first. The order in which octets are sent determines the endianness of the data. Failure to correctly handle the endian format can lead to subtle bugs in the parsing of protocols.
Modern platforms use two main endian formats: big and little. Big endian stores the most significant byte at the lowest address, whereas little endian stores the least significant byte in that location. Figure 3-5 shows how the 32-bit integer 0x01020304 is stored in both forms.
The endianness of a value is commonly referred to as either network order or host order. Because the Internet RFCs invariably use big endian as the preferred type for all network protocols they specify (unless there are legacy reasons for doing otherwise), big endian is referred to as network order. But your computer could be either big or little endian. Processor architectures such as x86 use little endian; others, such as SPARC, use big endian.
Some processor architectures, including SPARC, ARM, and MIPS, may have onboard logic that specifies the endianness at runtime, usually by toggling a processor control flag. When developing network software, make no assumptions about the endianness of the platform you might be running on. The networking API used to build an application will typically contain convenience functions for converting to and from these orders. Other platforms, such as PDP-11, use a middle endian format where 16-bit words are swapped; however, you’re unlikely to ever encounter one in everyday life, so don’t dwell on it.
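A short sketch showing both byte orders for the value from Figure 3-5, plus the conversion helpers that most sockets APIs provide (Python assumed):

import socket
import struct

value = 0x01020304

print(struct.pack(">I", value).hex())  # 01020304 -- big endian / network order
print(struct.pack("<I", value).hex())  # 04030201 -- little endian, as on x86

# htonl/ntohl convert between host and network order; on a little
# endian host this swaps the octets, on a big endian host it does nothing.
print(hex(socket.htonl(value)))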
Text and Human-Readable Data
Along with numeric data, strings are the value type you’ll most commonly encounter, whether they’re being used for passing authentication credentials or resource paths. When inspecting a protocol designed to send only English characters, the text will probably be encoded using ASCII. The original ASCII standard defined a 7-bit character set from 0 to 0x7F, which includes most of the characters needed to represent the English language (shown below).
The ASCII standard was originally developed for text terminals (physical devices with a moving printing head). Control characters were used to send messages to the terminal to move the printing head or to synchronize serial communications between the computer and the terminal. The ASCII character set contains two types of characters: control and printable.
Control characters: values 0x00 to 0x1F, plus 0x7F (DEL).
Most of the control characters are relics of those devices and are virtually unused. But some still provide information on modern computers, such as CR and LF, which are used to end lines of text.
The printable characters are the ones you can see. This set of characters consists of many familiar symbols and alphanumeric characters; however, they won’t be of much use if you want to represent international characters, of which there are thousands. It’s impossible to represent even a fraction of the possible characters in all the world’s languages in a 7-bit number.
Three strategies are commonly employed to counter this limitation: code pages, multibyte character sets, and Unicode. A protocol will either require that you use one of these three ways to represent text, or it will offer an option that an application can select.
Code Pages
The simplest way to extend the ASCII character set is by recognizing that if all your data is stored in octets, 128 unused values (from 128 to 255) can be repurposed for storing extra characters. Although 256 values are not enough to store all the characters in every available language, you have many different ways to use the unused range. Which characters are mapped to which values is typically codified in specifications called code pages or character encodings.
Multibyte Character Sets
In languages such as Chinese, Japanese, and Korean (collectively referred to as CJK), you simply can’t come close to representing the entire written language with 256 characters, even if you use all available space. The solution is to use multibyte character sets combined with ASCII to encode these languages. Common encodings are Shift-JIS for Japanese and GB2312 for simplified Chinese.
Multibyte character sets allow you to use two or more octets in sequence to encode a desired character, although you’ll rarely see them in use. In fact, if you’re not working with CJK text, you probably won’t see them at all.
Unicode
The Unicode standard, first standardized in 1991, aims to represent all languages within a unified character set. You might think of Unicode as another multibyte character set. But rather than focusing on a specific language, such as Shift-JIS does with Japanese, it tries to encode all written languages, including some archaic and constructed ones, into a single universal character set.
Unicode defines two related concepts: character mapping and character encoding. Character mappings include mappings between a numeric value and a character, as well as many other rules and regulations on how characters are used or combined. Character encodings define the way these numeric values are encoded in the underlying file or network protocol.
For analysis purposes, it’s far more important to know how these numeric values are encoded.
Each character in Unicode is assigned a code point that represents a unique character. Code points are commonly written in the format U+ABCD, where ABCD is the code point’s hexadecimal value. For the sake of compatibility, the first 128 code points match what is specified in ASCII, and the second 128 code points are taken from ISO/IEC 8859-1.
The resulting value is encoded using a specific scheme, sometimes referred to as Universal Character Set (UCS) or Unicode Transformation Format (UTF) encodings. (Subtle differences exist between UCS and UTF formats,but for the sake of identification and manipulation, these differences are unimportant.)
Three common Unicode encodings in use are UTF-16, UTF-32, and UTF-8.
UCS-2/UTF-16
UCS-2/UTF-16 is the native format on modern Microsoft Windows platforms, as well as in the Java and .NET virtual machines when they are running code. It encodes code points as sequences of 16-bit integers and has little and big endian variants.
UCS-4/UTF-32
UCS-4/UTF-32 is a common format used in Unix applications because it’s the default wide-character format in many C/C++ compilers. It encodes code points in sequences of 32-bit integers and has different endian variants.
UTF-8
UTF-8 is probably the most common format on Unix. It is also the default input and output format for varying platforms and technologies, such as XML. Rather than having a fixed integer size for code points, it encodes them using a simple variable length value. Table 3-3 shows how code points are encoded in UTF-8.
UTF-8 has many advantages. For one, its encoding definition ensures that the ASCII character set, code points U+0000 through U+007F, are encoded using single bytes. This scheme makes this format not only ASCII compatible but also space efficient. In addition, UTF-8 is compatible with C/C++ programs that rely on NUL-terminated strings.
For all of its benefits, UTF-8 does come at a cost, because languages like Chinese and Japanese consume more space than they do in UTF-16.
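The trade-off is easy to see by encoding the same text in all three schemes (a sketch, Python assumed):

text = "Hello"          # ASCII-range characters
cjk = "\u65e5\u672c"    # two CJK code points (Japanese for "Japan")

for s in (text, cjk):
    print(len(s.encode("utf-8")),
          len(s.encode("utf-16-le")),
          len(s.encode("utf-32-le")))
# "Hello": 5, 10, 20 octets -- UTF-8 matches plain ASCII
# CJK:     6,  4,  8 octets -- UTF-16 is smaller for these characters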
Variable Binary Length Data
If the protocol developer knows in advance exactly what data must be transmitted, they can ensure that all values within the protocol are of a fixed length. In reality this is quite rare; even simple authentication credentials benefit from the ability to specify variable username and password string lengths. Protocols use several strategies to produce variable-length data values: the most common (terminated data, length-prefixed data, implicit-length data, and padded data) are discussed in the following sections.
Terminated Data
You saw an example of variable-length data when variable-length integers were discussed earlier. The variable-length integer value was terminated when the octet’s MSB was 0. We can extend the concept of terminating values further to elements like strings or data arrays.
A terminated data value has a terminal symbol defined that tells the data parser that the end of the data value has been reached. The terminal symbol is used because it’s unlikely to be present in typical data, ensuring that the value isn’t terminated prematurely. With string data, the terminating value can be a NUL value (represented by 0) or one of the other control characters in the ASCII set.
If the chosen terminal symbol can occur during normal data transfer, you need a mechanism to escape it. With strings, it’s common to see the terminating character either prefixed with a backslash (\) or repeated twice to prevent it from being identified as the terminal symbol. This approach is especially useful when a protocol doesn’t know ahead of time how long a value is, for example, if it’s generated dynamically.
Bounded data is often terminated by a symbol that matches the first character in the variable-length sequence. For example, when using string data, you might find a quoted string sandwiched between quotation marks. The initial double quote tells the parser to look for the matching character to end the data.
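For the simplest case, a sketch of reading NUL-terminated values out of a buffer (Python assumed; the helper name is illustrative):

def read_terminated(data, offset=0, terminator=b"\x00"):
    end = data.index(terminator, offset)   # raises ValueError if the terminator is missing
    return data[offset:end], end + len(terminator)

buf = b"user\x00password\x00"
name, pos = read_terminated(buf)
secret, _ = read_terminated(buf, pos)
print(name, secret)                        # b'user' b'password'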
Length-Prefixed Data
If a data value is known in advance, it’s possible to insert its length into the protocol directly. The protocol’s parser can read this value and then read the appropriate number of units (say characters or octets) to extract the original value. This is a very common way to specify variable-length data.
The actual size of the length prefix is usually not that important, although it should be reasonably representative of the types of data being transmitted.
Most protocols won’t need to specify the full range of a 32-bit integer; however, you’ll often see that size used as a length field, if only because it fits well with most processor architectures and platforms. For example, Figure 3-11 shows a string with an 8-bit length prefix.
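A minimal sketch of parsing a string with an 8-bit length prefix, as in Figure 3-11 (Python's struct module assumed):

import struct

def read_length_prefixed(data, offset=0):
    (length,) = struct.unpack_from("B", data, offset)  # 8-bit length prefix
    start = offset + 1
    value = data[start:start + length]
    if len(value) != length:
        raise ValueError("truncated value")            # never trust the prefix blindly
    return value, start + length

buf = b"\x05Hello\x06World!"
first, pos = read_length_prefixed(buf)
second, _ = read_length_prefixed(buf, pos)
print(first, second)                                   # b'Hello' b'World!'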
Implicit-Length Data
Sometimes the length of the data value is implicit in the values around it.
For example, think of a protocol that is sending data back to a client using a connection-oriented protocol such as TCP. Rather than specifying the size of the data up front, the server could close the TCP connection, thus implicitly signifying the end of the data. This is how data is returned in an HTTP version 1.0 response.
Another example would be a higher-level protocol or structure that has already specified the length of a set of values. The parser might extract that higher-level structure first and then read the values contained within it. The protocol could use the fact that this structure has a finite length to implicitly calculate the length of a contained value, in a similar fashion to closing the connection (without actually closing it, of course).
Padded Data
Padded data is used when there is a maximum upper bound on the length of a value, such as a 32-octet limit. For the sake of simplicity, rather than prefixing the value with a length or having an explicit terminating value, the protocol could instead send the entire fixed-length string but terminate the value by padding the unused data with a known value. Figure 3-13 shows an example.
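A tiny sketch of decoding such a field, assuming (purely for illustration) a 32-octet field padded with NUL bytes:

FIELD_SIZE = 32
PAD = b"\x00"

def read_padded(data, offset=0):
    field = data[offset:offset + FIELD_SIZE]
    return field.rstrip(PAD), offset + FIELD_SIZE

buf = b"Hello".ljust(FIELD_SIZE, PAD)     # "Hello" followed by 27 padding octets
value, _ = read_padded(buf)
print(value)                              # b'Hello'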
Dates and Times
It can be very important for a protocol to get the correct date and time. Both can be used as metadata, such as file modification timestamps in a network file protocol, as well as to determine the expiration of authentication credentials. Failure to correctly implement the timestamp might cause serious security issues. The method of date and time representation depends on usage requirements, the platform the applications are running on, and the protocol’s space requirements. I discuss two common representations, POSIX/Unix time and Windows FILETIME, in the following sections.
POSIX/Unix Time
Currently, POSIX/Unix time is stored as a 32-bit signed integer value representing the number of seconds that have elapsed since the Unix epoch, which is usually specified as 00:00:00 (UTC), 1 January 1970. Although this isn’t a high-resolution timer, it’s sufficient for most scenarios. As a 32-bit integer, this value is limited to 03:14:07 (UTC) on 19 January 2038, at which point the representation will overflow. Some modern operating systems now use a 64-bit representation to address this problem.
Windows FILETIME
The Windows FILETIME is the date and time format used by Microsoft Windows for its filesystem timestamps. As the only format on Windows with simple binary representation, it also appears in a few different protocols.
The FILETIME format is a 64-bit unsigned integer. One unit of the integer represents a 100 ns interval. The epoch of the format is 00:00:00 (UTC), 1 January 1601. This gives the FILETIME format a larger range than the POSIX/Unix time format.
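Because both epochs and units are fixed, converting between the two representations is simple arithmetic. A sketch (Python assumed; the constant is the number of 100 ns intervals between 1 January 1601 and 1 January 1970):

EPOCH_DIFFERENCE = 116444736000000000   # 100 ns intervals from 1601 to 1970

def filetime_to_unix(filetime):
    return (filetime - EPOCH_DIFFERENCE) // 10_000_000   # 10,000,000 * 100 ns = 1 s

def unix_to_filetime(unix_time):
    return unix_time * 10_000_000 + EPOCH_DIFFERENCE

print(filetime_to_unix(116444736000000000))   # 0, i.e. 00:00:00 UTC, 1 January 1970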
Tag, Length, Value Pattern
It’s easy to imagine how one might send unimportant data using simple protocols, but sending more complex and important data takes some explaining. For example, a protocol that can send different types of structures must have a way to represent the bounds of a structure and its type.
One way to represent data is with a Tag, Length, Value (TLV) pattern. The Tag value represents the type of data being sent by the protocol, which is commonly a numeric value (usually an enumerated list of possible values).
But the Tag can be anything that provides the data structures with a unique pattern. The Length and Value are variable-length values. The order in which the values appear isn’t important; in fact, the Tag might be part of the Value.
The Tag value sent can be used to determine how to further process the data. For example, given two types of Tags, one that indicates the authentication credentials to the application and another that represents a message being transmitted to the parser, we must be able to distinguish between the two types of data. One big advantage to this pattern is that it allows us to extend a protocol without breaking applications that have not been updated to support the updated protocol. Because each structure is sent with an associated Tag and Length, a protocol parser could ignore the structures that it doesn’t understand.
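A sketch of a TLV parser, assuming (for illustration only) a layout of a 1-octet Tag followed by a big endian 16-bit Length and then the Value:

import struct

def parse_tlv(data):
    offset = 0
    while offset < len(data):
        tag, length = struct.unpack_from(">BH", data, offset)
        value = data[offset + 3:offset + 3 + length]
        if len(value) != length:
            raise ValueError("truncated TLV record")
        yield tag, value
        offset += 3 + length

records = b"\x01\x00\x04user\x02\x00\x05hello"
for tag, value in parse_tlv(records):
    # A parser can simply skip tags it doesn't recognize, which is
    # what makes the pattern easy to extend without breaking old code.
    print(tag, value)
# 1 b'user'
# 2 b'hello'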
Multiplexing and Fragmentation
Often, multiple operations need to take place over a network at the same time, for example transferring display updates while also streaming audio from a remote computer.
This complex data transfer would not result in a very rich experience if display updates had to wait for a 10-minute audio file to finish before updating the display. Of course, a workaround would be opening multiple connections to the remote computer, but those would use more resources.
Instead, many protocols use multiplexing, which allows multiple connections to share the same underlying network connection.
Multiplexing defines an internal channel mechanism that allows a single connection to host multiple types of traffic by fragmenting large transmissions into smaller chunks. Multiplexing then combines these chunks into a single connection. When analyzing a protocol, you may need to de-multiplex these channels to get the original data back out.
Unfortunately, some network protocols restrict the type of data that can be transmitted and how large each packet of data can be, a problem commonly encountered when layering protocols. For example, Ethernet defines the maximum size of traffic frames as 1500 octets, and running IP on top of that causes problems because an IP packet can be up to 65,535 octets in size. Fragmentation is designed to solve this problem: it uses a mechanism that allows the network stack to convert large packets into smaller fragments when the application or OS knows that the entire packet cannot be handled by the next layer.
Network Address Information
The representation of network address information in a protocol usually follows a fairly standard format. Because we’re almost certainly dealing with TCP or UDP protocols, the most common binary representation is the IP address as either a 4- or 16-octet value (for IPv4 or IPv6) along with a 2-octet port. By convention, these values are typically stored as big endian integer values.
You might also see hostnames sent instead of raw addresses. Because hostnames are just strings, they follow the patterns used for sending variable-length strings, which was discussed earlier in “Variable Binary Length Data”.
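A small sketch of the conventional big endian layout for an IPv4 endpoint (Python's socket and struct modules assumed):

import socket
import struct

def pack_endpoint(ip, port):
    return socket.inet_aton(ip) + struct.pack(">H", port)   # 4 + 2 octets, big endian

def unpack_endpoint(data):
    ip = socket.inet_ntoa(data[:4])
    (port,) = struct.unpack(">H", data[4:6])
    return ip, port

packed = pack_endpoint("192.168.1.1", 8080)
print(packed.hex())             # c0a801011f90
print(unpack_endpoint(packed))  # ('192.168.1.1', 8080)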
Structured Binary Formats
Although custom network protocols have a habit of reinventing the wheel, sometimes it makes more sense to repurpose existing designs when describing a new protocol. For example, one common format encountered in binary protocols is Abstract Syntax Notation 1 (ASN.1). ASN.1 is the basis for protocols such as the Simple Network Management Protocol (SNMP), and it is the encoding mechanism for all manner of cryptographic values, such as X.509 certificates.
ASN.1 is standardized by the ISO, IEC, and ITU in the X.680 series. It defines an abstract syntax for representing structured data; how that data is represented on the wire depends on the encoding rules, and numerous encodings exist. But you’re most likely to encounter the Distinguished Encoding Rules (DER), which are designed to represent ASN.1 structures in a way that cannot be misinterpreted, a useful property for cryptographic protocols. The DER representation is a good example of a TLV protocol.
Text Protocol Structures
Text protocols are a good choice when the main purpose is to transfer text, which is why mail transfer protocols, instant messaging, and news aggregation protocols are usually text based. Text protocols must have structures similar to binary protocols. The reason is that, although their main content differs, both share the goal of transferring data from one place to another.
The following section details some common text protocol structures that you’ll likely encounter in the real world.
Numeric Data
Integers
It’s easy to represent integer values using the current character set’s representation of the characters 0 through 9 (or A through F if hexadecimal). In this simple representation, size limitations are no concern, and if a number needs to be larger than a binary word size, you can add digits. Of course, you’d better hope that the protocol parser can handle the extra digits or security issues will inevitably occur.
To make a signed number, you add the minus (-) character to the front of the number; the plus (+) symbol for positive numbers is implied.
Decimal Numbers
Decimal numbers are usually defined using human-readable forms. For example, you might write a number as 1.234, using the dot character to separate the integer and fractional components of the number; however, you’ll still need to consider the requirement of parsing a value afterward.
Binary representations, such as floating point, can’t represent all decimal values precisely with finite precision (just as decimals can’t represent numbers like 1/3). This fact can make some values difficult to represent in text format and can cause security issues, especially when values are compared to one another.
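A one-line illustration of the comparison pitfall (Python floats are IEEE 754 doubles):

a = 0.1 + 0.2
print(a == 0.3)              # False: neither side is exactly representable in binary
print(abs(a - 0.3) < 1e-9)   # True: comparing within a tolerance avoids the problem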
Text Booleans
Booleans are easy to represent in text protocols. Usually, they’re represented using the words true or false. But just to be difficult, some protocols might require that words be capitalized exactly to be valid. And sometimes integer values will be used instead of words, such as 0 for false and 1 for true, but not very often.
Dates and Times
At a simple level, it’s easy to encode dates and times: just represent them as they would be written in a human-readable language. As long as all applications agree on the representation, that should suffice.
Unfortunately, not everyone can agree on a standard format, so typically many competing date representations are in use. This can be a particularly acute issue in applications such as mail clients, which need to process all manner of international date formats.
Variable-Length Data
All but the most trivial protocols must have a way to separate important text fields so they can be easily interpreted. When a text field is separated out of the original protocol, it’s commonly referred to as a token. Some protocols specify a fixed length for tokens, but it’s far more common to require some type of variable-length data.
Delimited Text
Delimiting characters are a very common way to separate tokens and fields that’s simple to understand and easy to construct and parse. Any character can be used as the delimiter (depending on the type of data being transferred), but whitespace is encountered most often in human-readable formats. That said, the delimiter doesn’t have to be whitespace.
For example, the Financial Information Exchange (FIX) protocol delimits tokens using the ASCII Start of Header (SOH) character, which has a value of 1.
Terminated Text
Protocols that specify a way to separate individual tokens must also have a way to define an End of Command condition. If a protocol is broken into separate lines, the lines must be terminated in some way. Most well-known, text-based Internet protocols are line oriented, such as HTTP and IRC; lines typically delimit entire structures, such as the end of a command.
What constitutes the end-of-line character? That depends on whom you ask. OS developers usually define the end-of-line character as either the ASCII Line Feed (LF), which has the value 10; the Carriage Return (CR) with the value 13; or the combination CR LF. Protocols such as HTTP and Simple Mail Transfer Protocol (SMTP) specify CR LF as the official end-of-line combination. However, so many incorrect implementations occur that most parsers will also accept a bare LF as the end-of-line indication.
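A sketch of the tolerant line splitting most parsers end up implementing, treating CR LF as the official terminator while also accepting a bare LF (Python assumed):

def split_lines(data):
    # Normalize CR LF to LF first, then split on LF.
    return data.replace(b"\r\n", b"\n").split(b"\n")

request = b"GET / HTTP/1.1\r\nHost: example.com\nConnection: close\r\n\r\n"
for line in split_lines(request):
    print(line)
# b'GET / HTTP/1.1'
# b'Host: example.com'
# b'Connection: close'
# b''
# b''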
Structured Text Formats
As with structured binary formats such as ASN.1, there is normally no reason to reinvent the wheel when you want to represent structured data in a text protocol. You might think of structured text formats as delimited text on steroids; as such, rules must be in place for how values are represented and hierarchies constructed. With this in mind, I’ll describe three formats in common use within real-world text protocols.
Multipurpose Internet Mail Extensions
Originally developed for sending multipart email messages, Multipurpose Internet Mail Extensions (MIME) found its way into a number of protocols, such as HTTP. The specification, contained in RFCs 2045, 2046, and 2047, along with numerous other related RFCs, defines a way of encoding multiple discrete attachments in a single MIME-encoded message.
MIME messages separate the body parts by defining a common separator line prefixed with two dashes (--). The message is terminated by following this separator with the same two dashes. Listing 3-3 shows an example of a text message combined with a binary version of the same message.
From: Some One <someone@example.com>
MIME-Version: 1.0
Content-Type: multipart/mixed;
boundary="XXXXboundary text"
This is a multipart message in MIME format.
--XXXXboundary text
Content-Type: text/plain
this is the body text
--XXXXboundary text
Content-Type: text/plain;
Content-Disposition: attachment;
filename="test.txt"
this is the attachment text
--XXXXboundary text--
One of the most common uses of MIME is for Content-Type values, which are usually referred to as MIME types. A MIME type is widely used when serving HTTP content and in operating systems to map an application to a particular content type. Each type consists of the form of the data it represents, such as text or application, and the format of the data: for example, plain indicates unencoded text, whereas octet-stream indicates an arbitrary series of bytes.
JavaScript Object Notation
JavaScript Object Notation (JSON) was designed as a simple representation for a structure based on the object format provided by the JavaScript programming language. It was originally used to transfer data between a web page in a browser and a backend service, such as in Asynchronous JavaScript and XML (AJAX). Currently, it’s commonly used for web service data transfer and all manner of other protocols.
The JSON format is simple: a JSON object is enclosed between the brace ({}) ASCII characters. Within these braces are zero or more member entries, each consisting of a key and a value. For example, Listing 3-4 shows a simple JSON object consisting of an integer index value, a string, and an array of strings.
{
"index":0,
"str":"Hello World",
"arr":["A", "B"]
}
The JSON format was designed for JavaScript processing, and it can be parsed using the “eval” function. Unfortunately, using this function comes with a significant security risk; namely, it’s possible to insert arbitrary script code during object creation. Although most modern applications use a parsing library that doesn’t need a connection to JavaScript, it’s worth ensuring that arbitrary JavaScript code is not executed in the context of the application. The reason is that it could lead to potential security issues, such as cross-site scripting (XSS), a vulnerability where attacker-controlled JavaScript can be executed in the context of another web page, allowing the attacker to access the page’s secure resources.
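A minimal sketch of the safer approach, parsing Listing 3-4 with a dedicated JSON library instead of eval (Python's json module assumed):

import json

message = '{"index": 0, "str": "Hello World", "arr": ["A", "B"]}'

obj = json.loads(message)        # parses data only; never executes script
print(obj["str"], obj["arr"])    # Hello World ['A', 'B']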
Extensible Markup Language
Extensible Markup Language (XML) is a markup language for describing a structured document format. Developed by the W3C, it’s derived from Standard Generalized Markup Language (SGML). It has many similarities to HTML, but it aims to be stricter in its definition in order to simplify parsers and create fewer security issues.
At a basic level, XML consists of elements, attributes, and text. Elements are the main structural values. They have a name and can contain child elements or text content. Only one root element is allowed in a single document. Attributes are additional name-value pairs that can be assigned to an element, taking the form name="value". Text content is just that, text.
Text is a child of an element or the value component of an attribute.
Following is a very simple XML document with elements, attributes, and text values.
<value index="0">
  <str>Hello World!</str>
  <arr>
    <value>A</value>
    <value>B</value>
  </arr>
</value>
All XML data is text; no type information is provided for in the XML specification, so the parser must know what the values represent. Certain specifications, such as XML Schema, aim to remedy this type information deficiency but they are not required in order to process XML content. The XML specification defines a list of well-formed criteria that can be used to determine whether an XML document meets a minimal level of structure.
XML is used in many different places to define the way information is transmitted in a protocol, such as in Rich Site Summary (RSS). It can also be part of a protocol, as in Extensible Messaging and Presence Protocol (XMPP).
Encoding Binary Data
In the early history of computer communication, 8-bit bytes were not the norm. Because most communication was text based and focused on English-speaking countries, it made economic sense to send only 7 bits per byte as required by the ASCII standard. This allowed other bits to provide control for serial link protocols or to improve performance. This history is reflected heavily in some early network protocols, such as the SMTP or Network News Transfer Protocol (NNTP), which assume 7-bit communication channels.
But a 7-bit limitation presents a problem if you want to send that amusing picture to your friend via email or you want to write your mail in a non-English character set. To overcome this limitation, developers devised a number of ways to encode binary data as text, each with varying degrees of efficiency or complexity.
As it turns out, the ability to convert binary content into text still has its advantages. For example, if you wanted to send binary data in a structured text format, such as JSON or XML, you might need to ensure that delimiters were appropriately escaped. Instead, you can choose an existing encoding format, such as Base64, to send the binary data, and it will be easily understood on both sides.
Let’s look at some of the more common binary-to-text encoding schemes you’re likely to encounter when inspecting a text protocol.
Hex Encoding
One of the most naive encoding techniques for binary data is hex encoding. In hex encoding, each octet is split into two 4-bit values that are converted to two text characters denoting the hexadecimal representation. The result is a simple representation of the binary in text form.
Although simple, hex encoding is not space efficient because all binary data automatically becomes 100 percent larger than it was originally. But one advantage is that encoding and decoding operations are fast and simple and little can go wrong, which is definitely beneficial from a security perspective.
HTTP specifies a similar encoding for URLs and some text protocols, called percent encoding. Rather than encoding all of the data, only bytes that can’t appear directly in a URL are converted to hex, and each encoded value is signified by prefixing it with a % character. For example, a space character is encoded as %20.
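A quick sketch of both encodings (Python's binascii and urllib modules assumed):

import binascii
from urllib.parse import quote, unquote

data = b"Hello World!"

hexed = binascii.hexlify(data)
print(hexed)                        # b'48656c6c6f20576f726c6421' -- double the size
print(binascii.unhexlify(hexed))    # b'Hello World!'

# Percent encoding converts only characters that are unsafe in a URL.
print(quote("Hello World!"))        # Hello%20World%21
print(unquote("Hello%20World%21"))  # Hello World!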
Base64
To counter the obvious inefficiencies in hex encoding, we can use Base64, an encoding scheme originally developed as part of the MIME specifications. The 64 in the name refers to the number of characters used to encode the data.
The input binary is separated into individual 6-bit values, enough to represent 0 through 63. This value is then used to look up a corresponding character in an encoding table.
But there’s a problem with this approach: when 8 bits are divided by 6, 2 bits remain. To counter this problem, the input is taken in units of three octets, because dividing 24 bits by 6 bits produces 4 values. Thus, Base64 encodes 3 bytes into 4, representing an increase of only 33 percent, which is significantly better than the increase produced by hex encoding. Figure 3-20 shows an example of encoding a three-octet sequence into Base64.
But yet another issue is apparent with this strategy. What if you have only one or two octets to encode? Would that not cause the encoding to fail? Base64 gets around this issue by defining a placeholder character, the equal sign (=). If in the encoding process, no valid bits are available to use, the encoder will encode that value as the placeholder. Figure 3-21 shows an example of only one octet being encoded. Note that it generates two placeholder characters. If two octets were encoded, Base64 would generate only one.
To convert Base64 data back into binary, you simply follow the steps in reverse. But what happens when a non-Base64 character is encountered during the decoding? Well that’s up to the application to decide. We can only hope that it makes a secure decision.
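A short sketch demonstrating the 3-to-4 expansion and the placeholder behavior (Python's base64 module assumed):

import base64

print(base64.b64encode(b"ABC"))   # b'QUJD' -- 3 octets become 4 characters, no padding
print(base64.b64encode(b"AB"))    # b'QUI=' -- 2 octets, one '=' placeholder
print(base64.b64encode(b"A"))     # b'QQ==' -- 1 octet, two '=' placeholders

print(base64.b64decode(b"QUJD"))  # b'ABC'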
Special Content Note
Contents of this post are mainly taken from James Forshaw’s ‘Attacking Network Protocols’ book. I kept this as a post for my quick reference. If you are using this material in your thesis or research, do not forget to cite the original author.