www.pa2jfx.nl

TCP provides a connection oriented, reliable, byte stream service. The term connection-oriented means the two applications using TCP must establish a TCP connection with each other before they can exchange data. It is a full duplex protocol, meaning that each TCP connection supports a pair of byte streams, one flowing in each direction. TCP includes a flow-control mechanism for each of these byte streams that allows the receiver to limit how much data the sender can transmit. TCP also implements a congestion-control mechanism

Two processes communicating via TCP sockets. Each side of a TCP connection has a socket which can be identified by the pair < IP_address, port_number >. Two processes communicating over TCP form a logical connection that is uniquely identifiable by the two sockets involved, that is by the combination < local_IP_address, local_port, remote_IP_address, remote_port>.

TCP provides the following facilities to:

From the application's viewpoint, TCP transfers a contiguous stream of bytes. TCP does this by grouping the bytes in TCP segments, which are passed to IP for transmission to the destination. TCP itself decides how to segment the data and it may forward the data at its own convenience.

TCP assigns a sequence number to each byte transmitted, and expects a positive acknowledgment (ACK) from the receiving TCP. If the ACK is not received within a timeout interval, the data is retransmitted. The receiving TCP uses the sequence numbers to rearrange the segments when they arrive out of order, and to eliminate duplicate segments.

The receiving TCP, when sending an ACK back to the sender, also indicates to the sender the number of bytes it can receive beyond the last received TCP segment, without causing overrun and overflow in its internal buffers. This is sent in the ACK in the form of the highest sequence number it can receive without problems.

To allow for many processes within a single host to use TCP communication facilities simultaneously, the TCP provides a set of addresses or ports within each host. Concatenated with the network and host addresses from the internet communication layer, this forms a socket. A pair of sockets uniquely identifies each connection.

The reliability and flow control mechanisms described above require that TCP initializes and maintains certain status information for each data stream. The combination of this status, including sockets, sequence numbers and window sizes, is called a logical connection. Each connection is uniquely identified by the pair of sockets used by the sending and receiving processes.

TCP provides for concurrent data streams in both directions.

TCP data is encapsulated in an IP datagram. The figure shows the format of the TCP header. Its normal size is 20 bytes unless options are present. Each of the fields is discussed below:

The SrcPort and DstPort fields identify the source and destination ports,respectively. These two fields plus the source and destination IP addresses, combine to uniquely identify each TCP connection.

The sequence number identifies the byte in the stream of data from the sending TCP to the receiving TCP that the first byte of data in this segment represents.

The Acknowledgement number field contains the next sequence number that the sender of the acknowledgement expects to receive. This is therefore the sequence number plus 1 of the last successfully received byte of data. This field is valid only if the ACK flag is on. Once a connection is established the Ack flag is always on.

The Acknowledgement, SequenceNum, and AdvertisedWindow fields are all involved in TCP's sliding window algorithm.The Acknowledgement and AdvertisedW indow fields carry information about the flow of dat going in the other direction. In TCP's sliding window algorithm the reciever advertises a window size to the sender. This is done using the AdvertisedWindow field. The sender is then limited to having no more than a value of AdvertisedWindow bytes of un acknowledged data at any given time. The receiver sets a suitable value for the AdvertisedWindow based on the amount of memory allocated to the connection for the purpose of buffering data.

The header length gives the length of the header in 32-bit words. This is required because the length of the options field is variable.

The 6-bit Flags field is used to relay control information between TCP peers. The possible flags include SYN, FIN, RESET, PUSH, URG, and ACK.

The SYN and Fin flags are used when establishing and terminating a TCP connection, respectively.
The ACK flag is set any time the Acknowledgement field is valid, implying that the receiver should pay attention to it.
The URG flag signifies that this segment contains urgent data. When this flag is set, the UrgPtr field indicates where the non-urgent data contained in this segment begins.
The PUSH flag signifies that the sender invoked the push operation, whic h indicates to the receiving side of TCP that it should notify the receiving process of this fact.
Finally, the RESET flag signifies that the receiver has become confused and so wants to abort the connection.

The Checksum covers the TCP segment: the TCP header and the TCP data. This is a mandatory field that must be calculated by the sender, and then verified by the receiver.

The Option field is the maximum segment size option, called the MSS. Each end of the connection normally specifies this option on the first segment exchanged. It specifies the maximum sized segment the sender wants to recieve.

The data portion of the TCP segment is optional.

The two transitions leading to the ESTABLISHED state correspond to the opening of a connection, and the two transitions leading from the ESTABLISHED state are for the termination of a connection. The ESTABLISHED state is where data transfer can occur between the two ends in both the directions.

If a connection is in the LISTEN state and a SYN segment arrives, the connection makes a transition to the SYN_RCVD state and takes the action of replying with an ACK+SYN segment. The client does an active open which causes its end of the connection to send a SYN segment to the server and to move to the SYN_SENT state. The arrival of the SYN+ACK segment causes the client to move to the ESTABLISHED state and to send an ack back to the server. When this ACK arrives the server finally moves to the ESTABLISHED state. In other words, we have just traced the THREE-WAY HANDSHAKE.

In the process of terminating a connection, the important thing to keep in mind is that the application process on both sides of the connection must independently close its half of the connection. Thus, on any one side there are three combinations of transition that get a connection from the ESTABLISHED state to the CLOSED state:

This side closes first:

ESTABLISHED -> FIN_WAIT_1-> FIN_WAIT_2 -> TIME_WAIT -> CLOSED.

The other side closes first:

ESTABLISHED -> CLOSE_WAIT -> LAST_ACK -> CLOSED.

Both sides close at the same time:

ESTABLISHED -> FIN_WAIT_1-> CLOSING ->TIME_WAIT -> CLOSED.

The main thing to recognize about connection teardown is that a connection in the TIME_WAIT state cannot move to the CLOSED state until it has waited for two times the maximum amount of time an IP datagram might live in the Inter net. The reason for this is that while the local side of the connection has sent an ACK in response to the other side's FIN segment, it does not know that the ACK was successfully delivered. As a consequence this other side might re transmit its FIN segment, and this second FIN segment might be delayed in the network. If the connection were allowed to move directly to the CLOSED state, then another pair of application processes might come along and open the same connection, and the delayed FIN segment from the earlier incarnation of the connection would immediately initiate the termination of the later incarnation of that connection.

The sliding window serves several purposes:

it guarantees the reliable delivery of data
it ensures that the data is delivered in order
it enforces flow control between the sender and the receiver

The sending and receiving sides of TCP interact in the following manner to implement reliable and ordered delivery:

Each byte has a sequence number

ACKs are cumulative.

LastByteAcked <=LastByteSent
LastByteSent <= LastByteWritten
bytes between LastByteAcked and LastByteWritten must be buffered

LastByteRead < NextByteExpected
NextByteExpected <= LastByteRcvd + 1
bytes between NextByteRead and LastByteRcvd must be buffered.

Sender buffer size : MaxSendBuffer
Receive buffer size : MaxRcvBuffer

LastByteRcvd - NextBytteRead <= MaxRcvBuffer
AdvertisedWindow = MaxRcvBuffer - (LastByteRcvd - NextByteRead)

LastByteSent - LastByteAcked <= AdvertisedWindow
EffectiveWindow = AdvertisedWindow - (LastByteSent - LastByteAcked)
LastByteWritten - LastByteAcked <= MaxSendBuffer
Block sender if (LastByteWritten - LastByteAcked) + y > MaxSendBuffer

Always send ACK in response to an arriving data segment

Persist when AdvertisedWindow = 0

TCP guarantees reliable delivery and so it retransmits each segment if an ACK is not received in a certainperiod of time. TCP sets this timeout as a function of the RTT it expects between the two ends of the connection. Unfortunately, given the range of possible RTT's between any pair of hosts in the Internet, as well as the variation in RTT between the same two hosts over time, choosing an appropriate timeout value is not that easy. To address this problem, TCP uses an adaptive retransmission medchanism. We describe this mechanism and how it has evolved over time.

Compute weighted average of RTT
EstimatedRTT = a*EstimatedRTT + b*SampleRTT, where a+b = 1
a between 0.8 and 0.9
b between 0.1 and 0.2
Set timeout based on EstimatedRTT
TimeOut = 2 * EstimatedRTT

New calculation for average RTT
Difference = SampleRTT - EstimatedRTT
EstimatedRTT = EstimatedRTT + ( d * Difference)
Deviation = Deviation + d ( |Difference| - Deviation)), where d is a fraction between 0 and 1
Consider variance when setting timeout value
Timeout = u * EstimatedRTT + q * Deviation, where u = 1 and q = 4

It operates by observing that the rate at which new packets should be injected into the network is the rate at which the acknowledgments are returned by the other end.

Slow start adds another window to the sender's TCP: the congestion window, called "cwnd". When a new connection is established with a host on another network, the congestion window is initialized to one segment (i.e., the segment size announced by the other end, or the default, typically 536 or 512). Each time an ACK is received, the congestion window is increased by one segment. The sender can transmit up to the minimum of the congestion window and the advertised window. The congestion window is flow control imposed by the sender, while the advertised window is flow control imposed by the receiver. The former is based on the sender's assessment of perceived network congestion; the latter is related to the amount of available buffer space at the receiver for this connection.

The sender starts by transmitting one segment and waiting for its ACK. When that ACK is received, the congestion window is incremented from one to two, and two segments can be sent. When each of those two segments is acknowledged, the congestion window is increased to four. This provides an exponential growth, although it is not exactly exponential because the receiver may delay its ACKs, typically sending one ACK for every two segments that it receives.

At some point the capacity of the internet can be reached, and an intermediate router will start discarding packets. This tells the sender that its congestion window has gotten too large.

Early implementations performed slow start only if the other end was on a different network. Current implementations always perform slow start.

Congestion can occur when data arrives on a big pipe (a fast LAN) and gets sent out a smaller pipe (a slower WAN). Congestion can also occur when multiple input streams arrive at a router whose output capacity is less than the sum of the inputs. Congestion avoidance is a way to deal with lost packets.

The assumption of the algorithm is that packet loss caused by damage is very small (much less than 1%), therefore the loss of a packet signals congestion somewhere in the network between the source and destination. There are two indications of packet loss: a timeout occurring and the receipt of duplicate ACKs.

Congestion avoidance and slow start are independent algorithms with different objectives. But when congestion occurs TCP must slow down its transmission rate of packets into the network, and then invoke slow start to get things going again. In practice they are implemented together.

Congestion avoidance and slow start require that two variables be maintained for each connection: a congestion window, cwnd, and a slow start threshold size, ssthresh. The combined algorithm operates as follows:

Initialization for a given connection sets cwnd to one segment and ssthresh to 65535 bytes.
The TCP output routine never sends more than the minimum of cwnd and the receiver's advertised window.
When congestion occurs (indicated by a timeout or the reception of duplicate ACKs), one-half of the current window size (the minimum of cwnd and the receiver's advertised window, but at least two segments) is saved in ssthresh. Additionally, if the congestion is indicated by a timeout, cwnd is set to one segment (i.e., slow start).
When new data is acknowledged by the other end, increase cwnd, but the way it increases depends on whether TCP is performing slow start or congestion avoidance.If cwnd is less than or equal to ssthresh, TCP is in slow start; otherwise TCP is performing congestion avoidance. Slow start continues until TCP is halfway to where it was when congestion occurred (since it recorded half of the window size that caused the problem in step 2), and then congestion avoidance takes over.

Slow start has cwnd begin at one segment, and be incremented by one segment every time an ACK is received. As mentioned earlier, this opens the window exponentially: send one segment, then two, then four, and so on. Congestion avoidance dictates that cwnd be incremented by segsize*segsize/cwnd each time an ACK is received, where segsize is the segment size and cwnd is maintained in bytes. This is a linear growth of cwnd, compared to slow start's exponential growth. The increase in cwnd should be at most one segment each round-trip time (regardless how many ACKs are received in that RTT), whereas slow start increments cwnd by the number of ACKs received in a round-trip time.

TCP may generate an immediate acknowledgment (a duplicate ACK) when an out- of-order segment is received. This duplicate ACK should not be delayed. The purpose of this duplicate ACK is to let the other end know that a segment was received out of order, and to tell it what sequence number is expected.

Since TCP does not know whether a duplicate ACK is caused by a lost segment or just a reordering of segments, it waits for a small number of duplicate ACKs to be received. It is assumed that if there is just a reordering of the segments, there will be only one or two duplicate ACKs before the reordered segment is processed, which will then generate a new ACK. If three or more duplicate ACKs are received in a row, it is a strong indication that a segment has been lost. TCP then performs a retransmission of what appears to be the missing segment, without waiting for a retransmission timer to expire.

After fast retransmit sends what appears to be the missing segment, congestion avoidance, but not slow start is performed. This is the fast recovery algorithm. It is an improvement that allows high throughput under moderate congestion, especially for large windows.

The reason for not performing slow start in this case is that the receipt of the duplicate ACKs tells TCP more than just a packet has been lost. Since the receiver can only generate the duplicate ACK when another segment is received, that segment has left the network and is in the receiver's buffer. That is, there is still data flowing between the two ends, and TCP does not want to reduce the flow abruptly by going into slow start.

The fast retransmit and fast recovery algorithms are usually implemented together as follows.

When the third duplicate ACK in a row is received, set ssthresh to one-half the current congestion window, cwnd, but no less than two segments. Retransmit the missing segment. Set cwnd to ssthresh plus 3 times the segment size. This inflates the congestion window by the number of segments that have left the network and which the other end has cached .
Each time another duplicate ACK arrives, increment cwnd by the segment size. This inflates the congestion window for the additional segment that has left the network. Transmit a packet, if allowed by the new value of cwnd.
When the next ACK arrives that acknowledges new data, set cwnd to ssthresh (the value set in step 1). This ACK should be the acknowledgment of the retransmission from step 1, one round-trip time after the retransmission. Additionally, this ACK should acknowledge all the intermediate segments sent between the lost packet and the receipt of the first duplicate ACK. This step is congestion avoidance, since TCP is down to one-half the rate it was at when the packet was lost.

Sometimes when traffic goes through a network, you can successfully use the ping command and Telnet, but you cannot download Internet pages or transfer files using File Transfer Protocol (FTP).

In the diagram above, when the Client wants to access a page on the Internet, it establishes a TCP session with the Web Server. During this process, the Client and Web Server announce their maximum segment size (MSS), indicating to each other that they can accept TCP segments up to this size. Upon receiving the MSS option, each device calculates the size of the segment that can be sent. This is called the Send Max Segment Size (SMSS), and it equals the smaller of the two MSSs.

For more information about TCP Maximum Segment Size, see RFC 879

For the sake of argument, let's say the Web Server in the example above determines that it can send packets up to 1500 bytes in length. It therefore sends a 1500 byte packet to the Client, and, in the IP header, it sets the "don't fragment" (DF) bit. When the packet arrives at R2, the router tries encapsulating over the network, the IP maximum transmission unit (MTU) is 24 bytes less than the IP MTU of the real outgoing interface. For an Ethernet outgoing interface that means the IP MTU on the tunnel interface would be 1500 minus 24, or 1476 bytes. R2 is trying to send a 1500 byte IP packet into a 1476 byte IP MTU interface. Since this is not possible, R2 needs to fragment the packet, creating one packet of 1476 bytes (data and IP header) and one packet of 44 bytes (24 bytes of data and a new IP header of 20 bytes). R2 then encapsulates both of these packets to get 1500 and 68 byte packets, respectively. These packets can now be sent out the real outbound interface, which has a 1500 byte IP MTU. However, remember that the packet received by R2 has the DF bit set. Therefore, R2 can't fragment the packet, and instead, it needs to instruct the Web Server to send smaller packets. It does this by sending an Internet Control Message Protocol (ICMP) type 3 code 4 packet (Destination Unreachable; Fragmentation Needed and DF set). This ICMP message contains the correct MTU to be used by the Web Server, which should receive this message and adjust the packet size accordingly.

NOTE:You can view the ICMP messages sent by R2 by enabling the debug ip icmp command:
ICMP: dst (10.10.10.10) frag. needed and DF set unreachable sent to 10.1.3.4

A common problem occurs when ICMP messages are blocked along the path to the Web server. When this happens, the ICMP packet never reaches the Web server, thereby preventing data from passing between client and server.

One of these three solutions should solve the problem:

Find out where along the path the ICMP message is blocked, and see if you can get it allowed. Set the MTU on the Client's network interface to 1476 bytes, forcing the SMSS to be smaller, so packets will not have to be fragmented when they reach R2. However, if you change the MTU for the Client, you should also change the MTU for all devices that share the network with this Client. On an Ethernet segment, this could be a large number of devices. Use a proxy-server (or, even better, a Web cache engine) between R2 and the Gateway router, and let the proxy-server request all the Internet pages.

If the above options are not feasible then these options can be useful:

Use policy routing to clear and set the DF bit in the data IP packet (available in Cisco IOS? Software Release 12.1(6) and later).

TERMINAL

interface ethernet0 ... ip policy route-map clear-df !--- This command is used to identify a route map !--- to use for policy routing on an interface, !--- use the ip policy route-map command in !--- interface configuration mode. route-map clear-df permit 10 match ip address 101 set ip df 0 !--- This command is used to change the Don't Fragment (DF) !--- bit value in the IP header, use this command !--- in route-map configuration mode. access-list 101 permit tcp 10.1.3.0 0.0.0.255 any

This will allow the data IP packet to be fragmented before it is encapsulated. The receiving end host must then reassemble the data IP packets. This is usually not a problem.

Change the TCP MSS option value on SYN packets that traverse through the router (available in IOS 12.2(4)T and higher). This reduces the MSS option value in the TCP SYN packet so that it's smaller than the value in the ip tcp adjust-mss value command, in this case 1436 (MTU minus the size of the IP, TCP and PPP headers). The end hosts now send TCP/IP packets no larger than this value.

TERMINAL

interface ethernet0 ... ip tcp adjust-mss 1436 !--- This command is used to adjust the maximum segment size (MSS) !--- value of TCP SYN packets going through the router. !--- The maximum segment size is in the range from 500 to 1460.

A final option is to increase the IP MTU on the tunnel interface to 1500 (available in IOS 12.0and later). However, increasing the interface IP MTU causes the tunnel packets to be fragmented because the DF bit of the original packet is not copied to the tunnel packet header. In this scenario, the router on the other end of the network must reassemble the packet before it can remove the additional header and forward the inner packet. IP packet reassembly is done in process-switch mode and uses memory. Therefore, this option can significantly reduce the packet throughput through the network.

TERMINAL

interface ethernet0 ... ip mtu 1500 !--- This command is used to set the maximum transmission unit (MTU) !--- size of IP packets sent on an interface. The minimum size !--- you can configure is 128 bytes; the maximum depends on the interface medium.

In conclusion, the most common cause of not being able to browse the Internet over a network is due to the above mentioned fragmentation issue. The solution is to allow the ICMP packets or work around the ICMP problem with any of the above solutions.

How IP Fragmentation and Path Maximum Transmission Unit Discovery (PMTUD) work and to discuss some scenarios involving the behavior of PMTUD when combined with different combinations of IP tunnels. The current widespread use of IP tunnels in the Internet has brought the problems involving IP Fragmentation and PMTUD to the forefront.

The IP protocol was designed for use on a wide variety of transmission links. Although the maximum length of an IP datagram is 64K, most transmission links enforce a smaller maximum packet length limit, called a MTU. The value of the MTU depends on the type of the transmission link. The design of IP accommodates MTU differences by allowing routers to fragment IP datagrams as necessary. The receiving station is responsible for reassembling the fragments back into the original full size IP datagram.

IP fragmentation involves breaking a datagram into a number of pieces that can be reassembled later. The IP source, destination, identification, total length, and fragment offset fields, along with the "more fragments" and "don't fragment" flags in the IP header, are used for IP fragmentation and reassembly. For more information about the mechanics of IP fragmentation and reassembly,

NOTE: please see RFC 791

The image below depicts the layout of an IP header.

The identification is 16 bits and is a value assigned by the sender of an IP datagram to aid in reassembling the fragments of a datagram.
The fragment offset is 13 bits and indicates where a fragment belongs in the original IP datagram. This value is a multiple of eight bytes.
In the flags field of the IP header, there are three bits for control flags. It is important to note that the "don't fragment" (DF) bit plays a central role in PMTUD because it determines whether or not a packet is allowed to be fragmented.
Bit 0 is reserved, and is always set to 0. Bit 1 is the DF bit (0 = "may fragment," 1 = "don't fragment"). Bit 2 is the MF bit (0 = "last fragment," 1 = "more fragments").

Value	bit 0 Reserved	Bit 1 DF	Bit 2 MF
0	0	May	Last
1	0	Do Not	More

The graphic below shows an example of fragmentation. If you add up all the lengths of the IP fragments, the value exceeds the original IP datagram length by 60. The reason that the overall length is increased by 60 is because three additional IP headers were created, one for each fragment after the first fragment.
The first fragment has an offset of 0, the length of this fragment is 1500; this includes 20 bytes for the slightly modified original IP header.
The second fragment has an offset of 185 (185 x 8 = 1480), which means that the data portion of this fragment starts 1480 bytes into the original IP datagram. The length of this fragment is 1500; this includes the additional IP header created for this fragment.
The third fragment has an offset of 370 (370 x 8 = 2960), which means that the data portion of this fragment starts 2960 bytes into the original IP datagram. The length of this fragment is 1500; this includes the additional IP header created for this fragment.
The fourth fragment has an offset of 555 (555 x 8 = 4440), which means that the data portion of this fragment starts 4440 bytes into the original IP datagram. The length of this fragment is 700 bytes; this includes the additional IP header created for this fragment.
It is only when the last fragment is received that the size of the original IP datagram can be determined.
The fragment offset in the last fragment (555) gives a data offset of 4440 bytes into the original IP datagram. If you then add the data bytes from the last fragment (680 = 700 - 20), that gives you 5120 bytes, which is the data portion of the original IP datagram. Then, adding 20 bytes for an IP header equals the size of the original IP datagram (4440 + 680 + 20 = 5140).

There are several issues that make IP fragmentation undesirable. There is a small increase in CPU and memory overhead to fragment an IP datagram. This holds true for the sender as well as for a router in the path between a sender and a receiver. Creating fragments simply involves creating fragment headers and copying the original datagram into the fragments. This can be done fairly efficiently because all the information needed to create the fragments is immediately available.

Fragmentation causes more overhead for the receiver when reassembling the fragments because the receiver must allocate memory for the arriving fragments and coalesce them back into one datagram after all of the fragments are received. Reassembly on a host is not considered a problem because the host has the time and memory resources to devote to this task.

But, reassembly is very inefficient on a router whose primary job is to forward packets as quickly as possible. A router is not designed to hold on to packets for any length of time. Also a router doing reassembly chooses the largest buffer available (18K) with which to work because it has no way of knowing the size of the original IP packet until the last fragment is received.

Another fragmentation issue involves handling dropped fragments. If one fragment of an IP datagram is dropped, then the entire original IP datagram must be resent, and it will also be fragmented. You see an example of this with Network File System (NFS). NFS, by default, has a read and write block size of 8192, so a NFS IP/UDP datagram will be approximately 8500 bytes (including NFS, UDP, and IP headers). A sending station connected to an Ethernet (MTU 1500) will have to fragment the 8500 byte datagram into six pieces; five 1500 byte fragments and one 1100 byte fragment. If any of the six fragments is dropped because of a congested link, the complete original datagram will have to be retransmitted, which means that six more fragments will have to be created. If this link drops one in six packets, then the odds are low that any NFS data can be transferred over this link, since at least one IP fragment would be dropped from each NFS 8500 byte original IP datagram.

Firewalls that filter or manipulate packets based on Layer 4 (L4) through Layer 7 (L7) information in the packet may have trouble processing IP fragments correctly. If the IP fragments are out of order, a firewall may block the non-initial fragments because they do not carry the information that would match the packet filter. This would mean that the original IP datagram could not be reassembled by the receiving host. If the firewall is configured to allow non-initial fragments with insufficient information to properly match the filter, then a non-initial fragment attack through the firewall could occur. Also, some network devices (such as Content Switch Engines) direct packets based on L4 through L7 information, and if a packet spans multiple fragments, then the device may have trouble enforcing its policies.

The TCP Maximum Segment Size (MSS) defines the maximum amount of data that a host is willing to accept in a single TCP/IP datagram. This TCP/IP datagram may be fragmented at the IP layer. The MSS value is sent as a TCP header option only in TCP SYN segments. Each side of a TCP connection reports its MSS value to the other side. Contrary to popular belief, the MSS value is not negotiated between hosts. The sending host is required to limit the size of data in a single TCP segment to a value less than or equal to the MSS reported by the receiving host.

Originally, MSS meant how big a buffer (greater than or equal to 65496K) was allocated on a receiving station to be able to store the TCP data contained within a single IP datagram. MSS was the maximum segment (chunk) of data that the TCP receiver was willing to accept. This TCP segment could be as large as 64K (the maximum IP datagram size) and it could be fragmented at the IP layer in order to be transmitted across the network to the receiving host. The receiving host would reassemble the IP datagram before it handed the complete TCP segment to the TCP layer.

Below are a couple of scenarios showing how MSS values are set and used to limit TCP segment sizes, and therefore, IP datagram sizes.

Scenario 1 illustrates the way MSS was first implemented. Host A has a buffer of 16K and Host B a buffer of 8K. They send and receive their MSS values and adjust their send MSS for sending data to each other. Notice that Host A and Host B will have to fragment the IP datagrams that are larger than the interface MTU but still less than the send MSS because the TCP stack could pass 16K or 8K bytes of data down the stack to IP. In Host B's case, packets could be fragmented twice, once to get onto the Token Ring LAN and again to get onto the Ethernet LAN.

There are three things that can break PMTUD, two of which are uncommon and one of which is common.

A router can drop a packet and not send an ICMP message. (Uncommon)
A router can generate and send an ICMP message but the ICMP message gets blocked by a router or firewall between this router and the sender. (Common)
A router can generate and send an ICMP message, but the sender ignores the message. (Uncommon)

The first and last of the three bullets above are uncommon and are usually the result of an error, but the middle bullet describes a common problem. People that implement ICMP packet filters tend to block all ICMP message types rather than only blocking certain ICMP message types. A packet filter can block all ICMP message types except those that are "unreachable" or "time-exceeded." The success or failure of PMTUD hinges upon ICMP unreachable messages getting through to the sender of a TCP/IP packet. ICMP time-exceeded messages are important for other IP issues. An example of such a packet filter, implemented on a router is shown below. such a packet filter, implemented on a router is shown below.

TERMINAL

access-list 101 permit icmp any any unreachable access-list 101 permit icmp any any time-exceeded access-list 101 deny icmp any any access-list 101 permit ip any any

There are other techniques that can be used to help alleviate the problem of ICMP being completely blocked.

Clear the DF bit on the router and allow fragmentation anyway (This may not be a good idea, though.
Manipulate the TCP MSS option value MSS using the interface command ip tcp adjust-mss <500-1460>.

Router A and Router B are in the same administrative domain. Router C is inaccessible and is blocking ICMP, so PMTUD is broken. A workaround for this situation is to clear the DF bit in both directions on Router B to allow fragmentation. This can be done using policy routing. The syntax to clear the DF bit is available in Cisco IOS? Software Release 12.1(6) and later.

TERMINAL

interface serial0 ... ip policy route-map clear-df-bit route-map clear-df-bit permit 10 match ip address 111 set ip df 0 access-list 111 permit tcp any any

Another option is to change the TCP MSS option value on SYN packets that traverse the router (available in Cisco IOS 12.2(4)T and later). This reduces the MSS option value in the TCP SYN packet so that it's smaller than the value (1460) in the ip tcp adjust-mss command. The result is that the TCP sender will send segments no larger than this value. The IP packet size will be 40 bytes larger (1500) than the MSS value (1460 bytes) to account for the TCP header (20 bytes) and the IP header (20 bytes).

You can adjust the MSS of TCP SYN packets with the ip tcp adjust-mss command. The following syntax will reduce the MSS value on TCP segments to 1460. This command effects traffic both inbound and outbound on interface serial0.

TERMINAL

int s0 ip tcp adjust-mss 1460

IP fragmentation issues have become more widespread since IP tunnels have become more widely deployed. The reason that tunnels cause more fragmentation is because the tunnel encapsulation adds "overhead" to the size a packet. For example, adding Generic Router Encapsulation (GRE) adds 24 bytes to a packet, and after this increase the packet may need to be fragmented because it is larger then the outbound MTU. In a later section of this document, you will see examples of the kinds of problems that can arise with tunnels and IP fragmentation.

TCP (Transmission Control Protocol)