IP Networks - TCP
TCP
TCP
- Transport Layer Protocols: UDP and TCP
- TCP Connection Management
- TCP States
- TCP Flow Control
- TCP Error Control
- TCP Congestion Control
- TCP Delay Analysis
Internet Transport-Layer Protocols
- More than one transport protocol available to applications
- Internet: TCP and UDP
User Datagram Protocol (UDP) [RFC 768]
- Connectionless
- Unreliable
Transmission Control Protocol (TCP) [RFC: 793, 1122, …]
- Connection-oriented
- Flow and error Control
- Congestion Control
Revisit the concept of UDP
- UDP Header Fields
- Source and destination port addresses
- Length: total length of the entire segment (in bytes, at least 8 bytes)
- Checksum: error detection (header plus data) (optional)
- Data sequence not guaranteed
- Reception not guaranteed
- Connectionless
- Data can be sent to multiple destinations and received from multiple destinations
- No flow control
- Some applications of UDP
- Route updating protocols
- Real time data
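As a concrete illustration of UDP's connectionless operation, here is a minimal sketch using the standard sockets API in Python; the loopback address and port 9999 are arbitrary placeholder values, not anything mandated by UDP.

```python
import socket

ADDR = ("127.0.0.1", 9999)   # placeholder address/port for the sketch

# Receiver: bind and wait for datagrams; there is no connection setup and no ACK.
rx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
rx.bind(ADDR)

# Sender: each sendto() is an independent datagram; UDP gives no ordering or
# delivery guarantee, and a different destination can be used for each datagram.
tx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
tx.sendto(b"hello", ADDR)

data, peer = rx.recvfrom(2048)
print(data, "from", peer)
rx.close()
tx.close()
```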
Well-known Ports used with UDP
UDP vs TCP
Reliability and Control:
- UDP treats multiple datagrams belonging to a single transmission as entirely separate units, unrelated to each other, while TCP provides reliable delivery of the entire message generated by the sending application
- All TCP segments must be received and acknowledged before the transmission is considered complete and the logical connection is terminated
- By setting up a connection, flow and error controls can be exercised in TCP
Overhead:
- UDP incurs less overhead in terms of:
- header size – hence UDP is more suitable for very small datagrams
- time for connection control – no need to set up a connection before data transmission
Transmission Control Protocol (TCP)
- TCP is a reliable but complex transport protocol
- Connection-oriented means that a logical connection must be established between the sender and receiver
- Stream-oriented means that the sending process delivers data as a stream of bytes and the receiving process obtains data as a stream of bytes
- TCP begins transmission by alerting the receiver that TCP segments are on their way, and ends transmission with an explicit connection termination
Basic Operation of TCP
- At the sending side, TCP divides a long stream of data into small data units called segments.
- Segments are carried across networks, encapsulated inside IP datagrams
- Note that IP may divide a TCP segment into multiple IP fragments
- At the receiving side, TCP collects each segment as it arrives and reorders the segments based on their sequence numbers.
- Recall that IP provides a best-effort datagram delivery service that may arrive out of order and/or with errors
- TCP uses a byte count as the sequence number (bytes, not segments, are numbered)
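The byte-based numbering can be illustrated with a small sketch; the MSS of 536 bytes and the initial sequence number 8000 are assumed values for the example, not fixed by TCP.

```python
MSS = 536   # assumed maximum segment size (bytes) for this sketch

def segmentize(stream: bytes, isn: int):
    """Split a byte stream into MSS-sized segments, labelling each segment with
    the byte number of its first byte (TCP numbers bytes, not segments)."""
    segments = []
    seq = isn + 1                        # first data byte follows the SYN
    for offset in range(0, len(stream), MSS):
        payload = stream[offset:offset + MSS]
        segments.append((seq, payload))
        seq += len(payload)              # advance by the number of bytes carried
    return segments

for seq, payload in segmentize(b"x" * 1400, isn=8000):
    print("segment: seq =", seq, "length =", len(payload))
# segment: seq = 8001 length = 536
# segment: seq = 8537 length = 536
# segment: seq = 9073 length = 328
```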
TCP Segment Structure
- Source port address (16 bits)
- Define the application program in the source computer
- Destination port address (16 bits)
- Define the application program in the destination computer, e.g. Telnet = 23
- Sequence number (32 bits)
- Define the number assigned to the first byte of data contained in a segment
- Acknowledgement number (32 bits)
- Valid only if the ACK bit in the control field is set; in that case, it defines the sequence number of the next byte expected (in units of bytes)
- To piggyback positive acknowledgment for the receipt of data from the other communication device
- ACK is “cumulative” – i.e. ACK n means all data before byte n have been received without error
- HLEN (Header Length) (4 bits)
- Length of TCP header (in unit of 4 bytes).
- Reserved (6 bits) - all 0s now, reserved for future use
- Control field (6 bits):
- URG (Urgent) - the urgent pointer field is valid
- ACK - the acknowledgment number is valid
- PSH (Push) - the segment should be read immediately
- RST (Reset) - reset the connection now
- SYN - used in connection establishment
- FIN (Finish) - used in connection termination
- Window size (16 bits) (also called the receive window value)
- Defines the size of the sliding window advertised by the receiver (rwnd)
- The maximum window size is 2^16 – 1 bytes
- Checksum (16 bits)
- Error detection
- Urgent pointer (16 bits)
- Valid only when the URG control bit is set
- Used when the segment contains urgent data
- The urgent data occupy byte numbers from (Sequence number) to (Sequence number + Urgent pointer)
- Options - the commonly used options are:
- Maximum Segment Size
- Specifies the maximum segment size (usually chosen to match the receiver’s buffer size)
- Window Scale Factor
- Scales the window up by a factor of 2^F, where the maximum allowed F is 14
- Timestamp
- Used to measure how long a segment takes to travel from the sender to the receiver (e.g. for round-trip-time estimation)
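As a rough sketch of how the fixed 20-byte header above is laid out on the wire, the fields can be packed with Python's struct module; the port numbers, sequence number and window value below are placeholders chosen for the example, and the checksum computation is omitted.

```python
import struct

# TCP control-flag bit positions (the low 6 bits of the offset/flags word)
FIN, SYN, RST, PSH, ACK, URG = 0x01, 0x02, 0x04, 0x08, 0x10, 0x20

def build_tcp_header(src_port, dst_port, seq, ack, flags, window,
                     checksum=0, urgent=0):
    """Pack a minimal 20-byte TCP header (HLEN = 5, no options)."""
    hlen = 5                                   # header length in 32-bit words
    offset_reserved_flags = (hlen << 12) | (flags & 0x3F)
    return struct.pack("!HHIIHHHH",
                       src_port, dst_port,     # source / destination ports
                       seq, ack,               # sequence / acknowledgement numbers
                       offset_reserved_flags,  # HLEN + reserved + control bits
                       window,                 # advertised receive window (rwnd)
                       checksum, urgent)       # checksum, urgent pointer

header = build_tcp_header(12345, 23, seq=8000, ack=0, flags=SYN, window=8192)
assert len(header) == 20
```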
Data Transfer in TCP
Connection Establishment
As TCP transmits data in full-duplex mode, each party must initialize communication and get approval from the other party before any data are transferred. This is done with a three-way handshake.
- Connection request from Client
- Connection confirmation from Server
- Acknowledgment of confirmation from Client
The sequence number normally advances only when a segment carries data.
The connection request, the connection confirmation and the acknowledgment of confirmation carry no data.
However, the connection request and the connection confirmation each consume one sequence number (sequence numbers are measured in bytes).
- A SYN segment cannot carry data, but it consumes one sequence number.
- A SYN + ACK segment cannot carry data, but it also consumes one sequence number.
The acknowledgment of confirmation does not consume a sequence number.
- An ACK segment, if carrying no data, consumes no sequence number.
Therefore, Segment 1 and Segment 2 each consume one sequence number, while Segment 3 consumes none.
Before the client makes the request, the server must first become ready:
- The server program tells its TCP that it is ready to accept a connection.
- This is called a “passive open”.
- The client program then issues a request for an “active open”.
This is the same as the previous diagram, but the “passive open” is now shown.
- Segment 3 can carry sequence number 8000 or 8001, but the next segment must start at 8001 because Segment 1 (the SYN) already consumed one sequence number (see the sketch below).
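The sequence-number bookkeeping can be summarised in a tiny sketch; the client ISN of 8000 matches the example above, while the server ISN of 15000 is an assumed value.

```python
CLIENT_ISN = 8000    # from the example above
SERVER_ISN = 15000   # assumed value for the sketch

# Segment 1: client -> server, SYN (carries no data but consumes one sequence number)
segment1 = {"flags": "SYN", "seq": CLIENT_ISN}

# Segment 2: server -> client, SYN + ACK (also consumes one sequence number);
# its ACK number is the next byte the server expects from the client.
segment2 = {"flags": "SYN+ACK", "seq": SERVER_ISN, "ack": CLIENT_ISN + 1}

# Segment 3: client -> server, pure ACK (no data, so it consumes no sequence number)
segment3 = {"flags": "ACK", "seq": CLIENT_ISN + 1, "ack": SERVER_ISN + 1}

# The client's first data byte will therefore be numbered CLIENT_ISN + 1 = 8001.
print(segment1, segment2, segment3, sep="\n")
```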
Data Transfer using 3-Way Handshake
Connection Termination using 3-Way Handshake
- Terminate by setting the FIN bit to 1
- The FIN segment consumes 1 sequence number.
- The FIN + ACK segment consumes 1 sequence number.
- An ACK segment, if carrying no data, consumes no sequence number.
A special situation (Half-Close)
If the client has no more data to send, it will try to close the connection.
But if at that time the server still has data to send, the connection goes into half-close mode.
- The server sends back an ACK to indicate that it has received the close request, but the server itself is not yet ready to close.
- No more data is sent from the client to the server afterwards, until the server has finished sending its remaining data segments (meanwhile the client still sends ACKs to the server).
- After the server finishes sending all of its data to the client, it sends its own termination request by setting the FIN and ACK bits to 1.
- The client receives this confirmation and then sends the final ACK. After the server receives that ACK, the server closes its side of the connection (a socket-level sketch follows).
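At the application level, a half-close is requested through the sockets API by shutting down only the sending direction. A minimal sketch, assuming some server is listening at the placeholder address 127.0.0.1:7000:

```python
import socket

SERVER = ("127.0.0.1", 7000)   # placeholder; assumes a server is listening here

with socket.create_connection(SERVER) as sock:
    sock.sendall(b"request")
    # Half-close: send our FIN (no more data from this side) but keep receiving.
    sock.shutdown(socket.SHUT_WR)
    # The server may continue sending; we keep reading/ACKing until it sends FIN.
    while True:
        chunk = sock.recv(4096)
        if not chunk:          # empty read: the server closed its direction too
            break
        print("received", len(chunk), "bytes")
```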
State Transition Diagram of TCP
- To keep track of all the different events happening during connection establishment, connection termination, and data transfer, TCP is specified as a finite state machine (FSM)
TCP Flow Control
receiver controls sender, so sender won’t overflow receiver’s buffer by transmitting too much, too fast
- TCP uses sliding window to control the flow of data so that the destination does not become overwhelmed with data.
- Window Size = minimum (rwnd, cwnd)
- Receiver “advertises” free buffer space by including rwnd value in TCP header of receiver-to- sender segments
- RcvBuffer size set via socket options (typical default is 4096 bytes)
- many operating systems auto adjust RcvBuffer
- Sender limits the amount of unacked (“in-flight”) data to the receiver’s rwnd value
- Guarantees receive buffer will not overflow
TCP Flow Control - Setting the Receiving Window Size
Flow control prevents the sender from overwhelming the destination’s buffer.
- Use a sliding window to make transmission more efficient as well as to control the flow of data
- The window size (rwnd) depends on the available buffer space at the receiver
- Available buffer size: rwnd = Receiver Buffer Size – Buffered Data
Example
- 1- The client first sends a SYN segment.
- The server sets up its receiving window rwnd.
- 2- The server then sends back SYN + ACK to advertise the size of rwnd (800 in this example).
- 3- The client then sets its sending window.
- The sender can send at most 800 bytes (according to the server’s receiving window).
- 4- A first segment containing 200 bytes is sent from the client to the server.
- The available buffer space at the server becomes smaller (800 - 200 = 600).
- 5- The server sends back an ACK advertising the new rwnd value (i.e. the current available buffer).
- After receiving the ACK, the client slides its window to the right by 200.
- The client’s sending window now becomes 600.
- 6- A second segment containing 300 bytes is sent from the client to the server.
- The available buffer space at the server becomes smaller (600 - 300 = 300).
- At the same time, we assume the server application consumes 100 bytes, so the available buffer becomes 300 + 100 = 400.
- 7- The server sends back an ACK advertising the new rwnd value (i.e. the current available buffer).
- After receiving the ACK, the client slides its window to the right by 300.
- The client’s sending window now becomes 400.
- After some time, we assume another 200 bytes are consumed, so the available buffer becomes 600.
- 8- The server sends back an ACK advertising the new rwnd value (i.e. the current available buffer).
- After receiving the ACK, the client expands its window (the right edge moves) by 200.
- The client’s sending window now becomes 600.
Note that in practice each device has one sending window and one receiving window (the arithmetic of this example is checked in the sketch below).
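A small sketch that reproduces the example's arithmetic. The 800-byte receive buffer, the 200- and 300-byte segments, and the 100- and 200-byte consumptions are the assumed values of the example, not properties of TCP.

```python
buffer_size = 800   # assumed receiver buffer size (bytes)
buffered = 0        # bytes received but not yet consumed by the application

def rwnd():
    # Available buffer space advertised back to the sender
    return buffer_size - buffered

print("initial rwnd:", rwnd())                          # 800

buffered += 200                                         # first segment arrives
print("after 200-byte segment:", rwnd())                # 600

buffered += 300                                         # second segment arrives
buffered -= 100                                         # application consumes 100 bytes
print("after 300-byte segment, 100 consumed:", rwnd())  # 400

buffered -= 200                                         # application consumes 200 more
print("after 200 more consumed:", rwnd())               # 600
```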
TCP Error Control
- TCP retransmits only the corrupted or lost segments
- TCP can
- Detect and resend corrupted segments
- Resend lost segments
- Store out-of-order segments until the gap is filled
- Detect and discard duplicated segments
TCP: Retransmission Scenarios
- Lost data scenario
- Timeout -> resend the data
- Lost ACK scenario
- Timeout -> resend the data
- The receiver drops the duplicate data
- Cumulative ACK scenario (one ACK is lost, but a later ACK arrives before the timeout)
- No action needed: the later cumulative ACK also acknowledges the earlier data
- Premature timeout scenario (the ACK is only delayed)
- Duplicate data are sent and will be discarded by the receiver, but the ACK is sent again (see the sketch below)
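A toy sender-side sketch of these reactions, assuming cumulative ACKs and a single fixed retransmission timeout; this illustrates the idea only and is not the actual TCP code.

```python
import time

TIMEOUT = 1.0   # assumed retransmission timeout (seconds)

class ToySender:
    """Toy model of TCP-style error control: cumulative ACKs plus timeout resend."""

    def __init__(self):
        self.unacked = {}                    # first byte number -> last send time

    def send(self, seq):
        print("send segment starting at byte", seq)
        self.unacked[seq] = time.monotonic()

    def on_ack(self, ack_no):
        # Cumulative ACK: ACK n acknowledges every byte before n, so a lost
        # earlier ACK causes no harm once a later ACK arrives (no action needed).
        for seq in list(self.unacked):
            if seq < ack_no:
                del self.unacked[seq]

    def check_timeouts(self):
        # Lost data, lost ACK and premature timeout all look the same here:
        # anything still unacknowledged after TIMEOUT is simply resent.
        # The receiver detects and discards any duplicates that result.
        now = time.monotonic()
        for seq, sent_at in list(self.unacked.items()):
            if now - sent_at > TIMEOUT:
                self.send(seq)

sender = ToySender()
sender.send(1201)
sender.send(1401)
sender.on_ack(1601)        # cumulative ACK clears both outstanding segments
```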
TCP ACK Generation
TCP Congestion Control
Principles of Congestion Control
- Congestion: “Too many sources sending too much data too fast for network to handle”
- Different from flow control
- Effects:
- Lost packets (buffer overflow at routers)
- Long delays (queueing in router buffers)
Congestion
- Advertised window (rwnd) size (flow control) is used to ensure that receiver’s buffer will not overflow
- However, buffers at intermediate routers between source and destination may overflow
- If the input rate > output rate, packets queue up and cause congestion
- Congestion occurs when total arrival rate from all packet flows exceeds R over a sustained period of time
- Buffers at multiplexer will fill and packets will be lost
Congestion Collapse
- Inefficiency occurs when resources are consumed by traffic that will later be discarded
- In the extreme, we have congestion collapse
- This was observed in the Internet (80s) until we added congestion control
- When too much traffic is offered, congestion sets in and performance degrades sharply
- Increasing the number of packets sent can actually reduce the number of packets delivered
- Causes
- Packets arrive faster than the router can process them
- possibly due to a sudden burst of traffic
- packets queue up, increasing the round-trip time
- Hosts time out and retransmit
- bringing even more packets into the network
- Buffers overflow and packets are lost
- followed by more retransmissions
- Each packet ends up being sent several times, and the cycle repeats: packets arrive even faster than the router can process them
Phases of Congestion Behavior
- Light traffic
(Green Pointer in the graph)
- Arrival Rate << R
- Low delay
- Can accommodate more
- Knee (congestion onset)
(Blue Pointer in the graph)
- Arrival rate approaches R
- Delay increases rapidly
- Throughput begins to saturate
- Congestion collapse
(Red Pointer in the graph)
- Arrival rate > R
- Large delays, packet loss
- Useful application throughput drops
Model of Congestion Control
- The primary goal of congestion control is to alleviate congestion
- A control problem, relevant issues:
- What is the form of feedback?
- Is it stable?
- Is it fair?
Binary Feedback
- The simplest form of feedback is binary:
- Either the network is congested, or not congested
- A proposal for carrying binary feedback in the data packet header, which is then echoed back to the sender in the ACKs
- Originally from DECnet (known as DECbit)
- Later proposed and adopted in ATM
- Known as Explicit Congestion Notification (ECN) in IETF
- Current TCP is based on using packet drop (or not) as binary congestion indication
Approaches Towards Congestion Control
Two broad approaches towards congestion control:
- End-to-End Congestion Control
- No explicit feedback from network
- Congestion inferred from end-system observed loss, delay
- Approach taken by TCP
- Network-assisted congestion control
- Routers provide feedback to end systems
- Single bit indicating congestion (SNA, DECbit, TCP/IP ECN, ATM)
- Explicit rate sender should send at
TCP Congestion Control
TCP uses the end-to-end approach, which means there is no explicit feedback from the network.
- End-to-end control (no network assistance)
The end system determines when to adjust the sending rate by observing error/loss events at the sending end.
- The cwnd (Congestion Window) will be adjusted
- TCP sender maintains a congestion window cwnd to control congestion at intermediate routers
- Effective window is minimum of congestion window and advertised window
- Problem: a source does not know what its “fair” share of the available bandwidth should be
- Solution: adapt dynamically to the available bandwidth
- Sources probe the network by increasing cwnd (congestion window)
- When congestion is detected, sources reduce their rate
- Ideally, each source’s sending rate stabilizes near the ideal point
- (Start from a minimum rate, then slowly increase the rate)
Congestion Window
How does the TCP congestion algorithm change congestion window dynamically according to the most up-to-date state of the network?
- At light traffic: each segment is ACKed quickly
- Increase cwnd aggressively
- At knee: segment ACKs arrive, but more slowly
- Slow down increase in cwnd
- At congestion: segments encounter large delays (so retransmission timeouts occur); segments are dropped in router buffers (resulting in time out / duplicate ACKs)
- Reduce transmission rate, then probe again
TCP Congestion Control Mechanisms
Several mechanisms defined in RFC 2581:
- Slow start – quickly find an approximately right window size
- Congestion avoidance – based on AIMD, adapt the window size to a value that is fair
- Fast retransmit and fast recovery – a faster way to detect loss while keeping the window open for retransmission
The slow start and congestion avoidance algorithms MUST be used by a TCP sender to control the amount of outstanding data being injected into the network.
- Two variables are added to the TCP per-connection state:
- Congestion Window cwnd
- Slow Start Threshold ssthresh
- Determines whether the slow start or the congestion avoidance algorithm is used to control data transmission
Slow Start Algorithm
initialize: cwnd = 1
for each ACK received: cwnd = cwnd + 1
- Exponential increase (per RTT) in window size (not so slow!)
- 1, 2, 4, 8, 16, 32, …
- Loss event: a timeout (Tahoe TCP) and/or three duplicate ACKs (Reno TCP)
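A short sketch of the slow start growth under this rule, with cwnd counted in segments and an assumed ssthresh of 16:

```python
cwnd, ssthresh = 1, 16     # ssthresh = 16 is an assumed value for the sketch
rtt_round = 0
while cwnd < ssthresh:
    print(f"round {rtt_round}: cwnd = {cwnd}")
    # Every segment sent in this round is ACKed, and every ACK adds one segment
    # to cwnd, so cwnd doubles each RTT: 1, 2, 4, 8, ...
    cwnd += cwnd
    rtt_round += 1
print(f"round {rtt_round}: cwnd = {cwnd} -> switch to congestion avoidance")
```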
Congestion Avoidance Algorithm
/* slow start is over; cwnd >= ssthresh */
for each ACK received: cwnd = cwnd + 1/cwnd (so cwnd grows by about 1 MSS per RTT)
- TCP congestion avoidance:
- additive increase, multiplicative decrease (AIMD)
- increase the window by 1 MSS per RTT
- decrease the window by a factor of 2 on a loss event
- E.g. a loss event at transmission round 8 (a possible cwnd trace is sketched below)
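A possible cwnd trace for that example, sketched under assumed values: an initial cwnd of 1 segment, ssthresh of 8, and a single loss at round 8 handled Reno-style.

```python
cwnd, ssthresh = 1, 8                      # assumed starting values for the trace
for rnd in range(1, 15):
    print(f"round {rnd:2d}: cwnd = {cwnd}")
    if rnd == 8:                           # assumed loss event (3 duplicate ACKs)
        ssthresh = max(cwnd // 2, 2)       # multiplicative decrease
        cwnd = ssthresh                    # Reno restarts here; Tahoe would use cwnd = 1
    elif cwnd < ssthresh:
        cwnd *= 2                          # slow start: exponential growth per RTT
    else:
        cwnd += 1                          # congestion avoidance: +1 MSS per RTT
```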
Fast Retransmit
- When the receiver receives an out-of-order packet, it sends a duplicate ACK for the packet before the gap and caches the out-of-order packet
- If the sender receives 3 or more duplicate ACKs for a packet, the packet following it is assumed lost
Fast retransmit Algorithm
when (3 duplicate ACKs received):
retransmit the missing segment immediately, without waiting for its retransmission timer to expire
Fast Recovery
- Also triggered when sender receives 3 or more duplicate acks of a packet
- Sender enters the “fast recovery” phase
- Set ssthresh = cwnd/2
- Temporarily reduce cwnd to (ssthresh + #dup_ack)
- Retransmit missing packet, assuming receiver caches out-of-order packets
- When receiver receives the retransmitted packet (filling the gap) it acks all packets (including out-of-order ones)
- Sender set cwnd to ssthresh (cwnd/2) and returns to congestion avoidance phase
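A simplified Reno-style sketch of the window arithmetic during fast retransmit and fast recovery, with cwnd and ssthresh counted in segments; this is an illustration of the idea above, not the full RFC state machine.

```python
def on_three_dup_acks(cwnd: int, dup_acks: int = 3):
    """Reaction once 3 duplicate ACKs have been counted for the same segment."""
    ssthresh = max(cwnd // 2, 2)           # multiplicative decrease
    cwnd = ssthresh + dup_acks             # temporary inflation during recovery
    print("fast retransmit the missing segment; cwnd =", cwnd)
    return cwnd, ssthresh

def on_recovery_complete(ssthresh: int):
    """ACK for the retransmitted segment fills the gap: leave fast recovery."""
    cwnd = ssthresh                        # deflate back to ssthresh (= old cwnd / 2)
    print("resume congestion avoidance with cwnd =", cwnd)
    return cwnd

cwnd, ssthresh = on_three_dup_acks(cwnd=16)    # cwnd becomes 8 + 3 = 11
cwnd = on_recovery_complete(ssthresh)          # cwnd becomes 8
```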
State diagram
Summary: TCP Congestion Control
MSS stands for Maximum Segment Size.
TCP Versions
- Old Tahoe [RFC793] is the early implementation of TCP, which used a go-back-N retransmission scheme with cumulative acknowledgements and a retransmission timer for the recovery of lost packets. The timeout value is a multiple of the RTT.
- Tahoe [RFC1122] appends additional features to Old Tahoe such as slow start, congestion avoidance, fast retransmit and a refined timeout value calculation.
- Reno [RFC2581] extends the congestion control scheme by adding the fast recovery scheme to Tahoe. No slow start is performed after 3 duplicate ACKs are received in this version, which results in increased performance in the case of single packet drops.
- New Reno [RFC3782]
- widely deployed today (Windows 7/8/10, BSD Unix, Linux)
- (Improved in fast retransmit and recovery process)
How TCP set the Timeout Values
RTO Timer Setting
- RTT: Round Trip Time between a pair of hosts on the Internet.
- How to set the TimeOut value (RTO)?
- Original Algorithm:
- Keep a running average of RTT and compute the TimeOut value as a function of this RTT:
- Send a packet and record the send timestamp t1.
- When the ACK arrives, record the arrival timestamp t2.
- SampleRTT = t2 - t1
- Compute a smoothed (exponentially weighted) average, written in shorthand as:
- EstimatedRTT = (1 - α) × EstimatedRTT + α × SampleRTT
- Original TCP spec: α in the range 0.1 to 0.2
- We do not want the newly measured value to take too large a portion of the average.
- TimeOut = 2 × EstimatedRTT
Smoothed RTT:
- Originally → no value
- After the 1st measurement → EstimatedRTT = SampleRTT
- After the 2nd and later measurements → EstimatedRTT = (1 - α) × EstimatedRTT + α × SampleRTT
-
- An obvious flaw in the original algorithm:
- Whenever there is a retransmission, it is impossible to know whether to associate the ACK with the original packet or with the retransmitted packet.
- The RTT sample is then either an over-estimate or an under-estimate.
Another problem with the original algorithm is that it does not take into account the variance of the measured RTT.
Karn/Partridge Algorithm
- Do not measure SampleRTT for a packet that has been sent more than once (i.e. retransmitted).
- For each retransmission, set TimeOut to double the last TimeOut.
- Note: this is a form of exponential backoff, based on the belief that the lost packet was dropped because of congestion.
Smoothed RTT (SRTT):
- Originally → no value
- After the 1st measurement → SRTT = SampleRTT
- After the 2nd and later measurements → SRTT = (1 - α) × SRTT + α × SampleRTT, with α = 1/8
RTT Deviation (RTTVAR):
- Originally → no value
- After the 1st measurement → RTTVAR = SampleRTT / 2
- After the 2nd and later measurements → RTTVAR = (1 - β) × RTTVAR + β × |SRTT - SampleRTT|, with β = 1/4
Retransmission Timeout (RTO)
- Originally → no value
- After any measurement → RTO = SRTT + 4 × RTTVAR [RFC 6298]
Example:
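The original numeric example is not reproduced here; instead, the sketch below simply applies the formulas above to a few assumed RTT samples (in milliseconds):

```python
samples = [100, 120, 90, 300, 110]          # hypothetical RTT measurements (ms)
srtt = rttvar = None

for r in samples:
    if srtt is None:                        # after the 1st measurement
        srtt, rttvar = r, r / 2
    else:                                   # later measurements: alpha = 1/8, beta = 1/4
        rttvar = 0.75 * rttvar + 0.25 * abs(srtt - r)
        srtt = 0.875 * srtt + 0.125 * r
    rto = srtt + 4 * rttvar                 # retransmission timeout
    print(f"sample={r:3d} ms  SRTT={srtt:6.1f}  RTTVAR={rttvar:6.1f}  RTO={rto:6.1f}")
```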
TCP Fairness
- Whether the TCP control mechanism is fair to all TCP connections
- Fairness goal: if K TCP sessions share same bottleneck link of bandwidth R, each should have average rate of R/K
AIMD is a fair mechanism.
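To see why, here is a small two-flow sketch: both flows add 1 unit per RTT and halve on a shared loss whenever their combined rate exceeds an assumed bottleneck capacity of 100 units. The deliberately unequal starting rates converge toward each other, i.e. toward the fair share.

```python
capacity = 100                 # assumed bottleneck capacity (arbitrary units)
x, y = 80.0, 10.0              # deliberately unfair starting rates

for _ in range(200):
    x += 1.0                   # additive increase (per RTT) for both flows
    y += 1.0
    if x + y > capacity:       # shared loss event at the bottleneck
        x /= 2                 # multiplicative decrease halves both rates,
        y /= 2                 # which also halves the difference between them

print(f"flow 1 ~ {x:.1f}, flow 2 ~ {y:.1f}   (the difference shrinks toward 0)")
```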
TCP Latency Modeling
How long does it take to receive an object from a Web server after sending a request?
- TCP connection establishment
- data transfer delay
Notation, assumptions:
- Assume one link between client and server of rate R
- Assume a fixed congestion window of W segments
- S: MSS (Maximum Segment Size) (bits)
- O: object size (bits)
- no retransmissions (no loss, no corruption)
Two cases to consider:
- WS/R >= RTT + S/R
- ACK for first segment in window returns before window’s worth of data sent
- WS/R < RTT + S/R
- wait for ACK after sending window’s worth of data
First Case
- ACK for first segment in window returns before window’s worth of data sent (WS/R >= RTT + S/R)
- Latency = 2 RTT + O/R
Second Case
- wait for ACK after sending window’s worth of data (WS/R < RTT + S/R)
- Latency = 2 RTT + O/R + (K - 1) [S/R + RTT - WS/R]
- where K = O/(WS) is the number of windows that cover the object
TCP Delay Modeling: Slow Start
Now suppose the window grows according to slow start.
It can be shown that the delay for one object is:
Latency = 2 RTT + O/R + P [RTT + S/R] - (2^P - 1) S/R
where P = min{Q, K - 1} is the number of times TCP idles at the server:
- Q is the number of times the server would idle if the object were of infinite size.
- K is the number of windows that cover the object.
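A small calculator sketch for these formulas, using assumed parameter values; the closed forms used below for K and Q (the number of windows that cover the object, and the idle count for an infinite object) are one standard way to evaluate the definitions above when the initial window is one segment.

```python
from math import ceil, log2

# Assumed parameters for the sketch: 100 kbit object, 1 kbit segments,
# a 10 kbit/s link, a fixed window of 4 segments, and an RTT of 1 second.
O, S, R, W, RTT = 100e3, 1e3, 10e3, 4, 1.0

# Static (fixed) congestion window model
K = ceil(O / (W * S))                       # number of windows that cover the object
if W * S / R >= RTT + S / R:                # first case: the pipe never drains
    static_latency = 2 * RTT + O / R
else:                                       # second case: stall after each window
    static_latency = 2 * RTT + O / R + (K - 1) * (S / R + RTT - W * S / R)

# Slow start model (window of 1, 2, 4, ... segments)
K_ss = ceil(log2(O / S + 1))                # windows that cover the object
Q = ceil(log2(1 + RTT * R / S))             # idle times if the object were infinite
P = min(Q, K_ss - 1)                        # number of times the server idles
slow_start_latency = 2 * RTT + O / R + P * (RTT + S / R) - (2 ** P - 1) * S / R

print(f"static window: {static_latency:.1f} s   slow start: {slow_start_latency:.1f} s")
```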