TCP

  • Transport Layer Protocols: UDP and TCP
  • TCP Connection Management
  • TCP States
  • TCP Flow Control
  • TCP Error Control
  • TCP Congestion Control
  • TCP Delay Analysis

Internet Transport-Layer Protocols

  • More than one transport protocol available to applications
    • Internet: TCP and UDP

User Datagram Protocol (UDP) [RFC 768]

  • Connectionless
  • Unreliable

Transmission Control Protocol (TCP) [RFC: 793, 1122, …]

  • Connection-oriented
  • Flow and error Control
  • Congestion Control

Revisit the concept of UDP

  • UDP header fields (see the sketch after this list):
    • Source and destination port addresses
    • Length: total length of the entire segment (in bytes, at least 8 bytes)
    • Checksum: error detection over the header plus data (optional)
img
  • Data sequence not guaranteed
  • Reception not guaranteed
  • Connectionless
  • Data can be sent to multiple destinations and received from multiple destinations
  • No flow control
  • Some applications that use UDP
    • Routing update protocols (e.g. RIP)
    • Real-time data (e.g. streaming audio and video)
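
To make the header layout concrete, here is a minimal sketch (my addition, not part of the original notes) that packs the four UDP header fields in Python; the port numbers and payload are made-up values, and the checksum is simply left as 0, which the optional-checksum rule permits.

```python
import struct

def build_udp_header(src_port: int, dst_port: int, payload: bytes) -> bytes:
    """Pack the four 16-bit UDP header fields: source port,
    destination port, length (header + data), checksum."""
    length = 8 + len(payload)          # header is always 8 bytes
    checksum = 0                       # 0 = "no checksum" (checksum is optional)
    return struct.pack("!HHHH", src_port, dst_port, length, checksum)

# Example: a 12-byte datagram (8-byte header + 4 bytes of data)
header = build_udp_header(5000, 53, b"data")
print(len(header) + 4)                 # -> 12, matching the Length field
```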

Well-known Ports used with UDP

img

UDP vs TCP

Reliability and Control:

  • UDP treats multiple datagrams belonging to a single transmission as entirely separate units, unrelated to each other, while TCP provides reliable delivery of the entire message generated by the sending application
    • All TCP segments must be received and acknowledged before the transmission is considered complete and the logical connection is terminated
    • By setting up a connection, flow and error controls can be exercised in TCP

Overhead:

  • UDP incurs less overhead in terms of:
    • header size – hence UDP is more suitable for very small datagrams
    • time for connection control – no need to set up a connection before data transmission
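
As an illustration of the overhead difference just described, the following sketch (an addition, not from the notes) contrasts a UDP send, which needs no connection setup, with a TCP send, which must complete the three-way handshake first. The host and port are placeholders, and the TCP connect only succeeds if something is actually listening there.

```python
import socket

HOST, PORT = "127.0.0.1", 9000   # placeholder address

# UDP: connectionless -- just send; no handshake, 8-byte header
udp = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
udp.sendto(b"hello", (HOST, PORT))
udp.close()

# TCP: connection-oriented -- connect() performs the 3-way handshake
# before any application data is sent; the header is at least 20 bytes
tcp = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
tcp.connect((HOST, PORT))        # SYN, SYN+ACK, ACK
tcp.sendall(b"hello")
tcp.close()                      # FIN exchange on termination
```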

Transmission Control Protocol (TCP)

  • TCP is a reliable but complex transport protocol
  • Connection-oriented means that a logical connection must be established between the sender and receiver
  • Stream-oriented means that the sending process delivers data as a stream of bytes and the receiving process obtains data as a stream of bytes
  • TCP begins transmission by alerting the receiver that TCP segments are on their way, and ends transmission with an explicit connection termination

Basic Operation of TCP

  • At the sending side, TCP divides a long stream of data into small data units called segments.

    • Segments are carried across networks, encapsulated inside IP datagrams
      • Note that IP may divide a TCP segment into multiple IP fragments

  • At the receiving side, TCP collects each segment as it comes in, and reorders the segments based on their sequence numbers

    • Recall that IP provides a best-effort datagram delivery service, so datagrams may arrive out of order and/or with errors

img

In TCP, sequence numbers count bytes, not segments: for example, if a segment carries bytes 1001–2000, the next segment starts with sequence number 2001.

TCP Segment Structure

img

  • Source port address (16 bits)
    • Define the application program in the source computer
  • Destination port address (16 bits)
    • Define the application program in the destination computer, e.g. Telnet = 23
  • Sequence number (32 bits)
    • Define the number assigned to the first byte of data contained in a segment
  • Acknowledgement number (32 bits)
    • Valid only if the ACK bit in the control field is set – in this case, it defines the byte sequence number that is next expected (in unit of bytes)
    • To piggyback positive acknowledgment for the receipt of data from the other communication device
    • ACK is “cumulative” – i.e. ACK n means all data before byte n have been received without error
  • HLEN (Header Length) (4 bits)
    • Length of the TCP header (in units of 4 bytes)
  • Reserved (6 bits) - all 0s now, reserved for future use
  • Control field (6 bits):
    • URG (Urgent) - the urgent pointer field is valid
    • ACK - the acknowledgment number is valid
    • PSH (Push) - the segment should be read immediately
    • RST (Reset) - reset the connection now
    • SYN - used in connection establishment
    • FIN (Finish) - used in connection termination
  • Window size (16 bits) (also called the receiver window value)
    • Defines the size of the sliding window advertised by the receiver side (rwnd)
    • Maximum window size is 2^16 − 1 bytes
  • Checksum (16 bits)
    • Error detection
  • Urgent pointer (16 bits)
    • Valid only when the URG control bit is set
    • Used when the segment contains urgent data
    • The urgent data occupy byte numbers from (Sequence number) to (Sequence number + Urgent pointer)
  • Options - The common use options are
    • Maximum Segment Size
      • specify the maximum segment size (usually to match with the receiver’s buffer size)
    • Window Scale Factor
      • Scales the window up by 2^F, where the maximum allowed F is 14
    • Timestamp
      • Used to measure how long a segment takes to travel from the sender to the receiver
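
As a companion to the field list above, here is a minimal sketch (my addition) that unpacks the fixed 20-byte TCP header; the field names in the returned dictionary are my own choices, and any options beyond the fixed header are ignored.

```python
import struct

def parse_tcp_header(segment: bytes) -> dict:
    """Unpack the fixed 20-byte TCP header described above."""
    (src_port, dst_port, seq, ack,
     offset_reserved_flags, window, checksum, urgent) = struct.unpack(
        "!HHIIHHHH", segment[:20])
    return {
        "src_port": src_port,
        "dst_port": dst_port,
        "seq": seq,                                  # number of first data byte
        "ack": ack,                                  # next byte expected (if ACK set)
        "hlen": (offset_reserved_flags >> 12) * 4,   # HLEN is in 4-byte words
        "flags": offset_reserved_flags & 0x3F,       # URG/ACK/PSH/RST/SYN/FIN bits
        "rwnd": window,                              # advertised receive window
        "checksum": checksum,
        "urgent_ptr": urgent,
    }
```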

Data Transfer in TCP

Connection Establishment

As TCP transmits data in full-duplex mode, each party must initialize communication and get approval from the other party before any data are transferred. This is the three-way handshake.

  • Connection request from Client
  • Connection confirmation from Server
  • Acknowledgment of confirmation from Client

In general, the sequence number advances only when a segment carries data.

The connection request, connection confirmation and acknowledgment of confirmation carry no data.

However, the connection request (SYN) and connection confirmation (SYN + ACK) each still consume one sequence number (sequence numbers are measured in bytes).

  • A SYN segment cannot carry data, but it consumes one sequence number
  • A SYN + ACK segment cannot carry data, but does consume one sequence number.

The acknowledgment of confirmation does not consume a sequence number.

  • An ACK segment, if carrying no data, consumes no sequence number.

Therefore, segments 1 and 2 each consume one sequence number, while segment 3 consumes none.

Before the client makes the connection request, the server must first be ready.

  • Server program tells its TCP that it is ready to accept a connection
    • Called “passive open”
  • Then Client program issues a request for an “active open”

This is the same exchange as before, but with the “passive open” shown.

  • The sequence number of segment 3 can be 8000 or 8001, but the first data segment must use 8001, because segment 1 (the SYN) already consumed sequence number 8000 (see the sketch below).
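
The sketch below (an illustration, not from the notes) traces the sequence-number bookkeeping of the three-way handshake, assuming the client ISN of 8000 from the figure and a made-up server ISN of 15000.

```python
# Sequence-number bookkeeping in the 3-way handshake (illustrative values:
# client ISN = 8000 as in the figure, server ISN = 15000 is made up).
client_isn, server_isn = 8000, 15000

segments = [
    # (sender,  flags,     seq,            ack)
    ("client", "SYN",      client_isn,     None),             # consumes seq 8000
    ("server", "SYN+ACK",  server_isn,     client_isn + 1),   # consumes seq 15000
    ("client", "ACK",      client_isn + 1, server_isn + 1),   # no data: consumes nothing
]
for sender, flags, seq, ack in segments:
    print(f"{sender:6s} {flags:8s} seq={seq} ack={ack}")

# The first data segment from the client must start at 8001,
# because the SYN already consumed sequence number 8000.
print("first data byte from client has seq =", client_isn + 1)
```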

Data Transfer using 3-Way Handshake

Connection Termination using 3-Way Handshake

  • Terminate by setting the FIN bit to 1
  • The FIN segment consumes 1 sequence number.
  • The FIN + ACK segment consumes 1 sequence number.
  • An ACK segment, if carrying no data, consumes no sequence number.

Some special situation (Half-Close)

If the client has no data to send, it will try to close the connection.

But if at that time the server still has some data to send, the connection goes into half-close mode.

  • The server sends back an ACK to indicate that it has received the request, but the connection is not yet fully closed.
    • No more data flows from the client to the server; the client only sends ACKs while the server finishes sending its remaining data segments.
  • After the server has sent all of its data, it sends its own termination request by setting the FIN and ACK bits to 1.
  • The client receives this confirmation and replies with an ACK; once the server receives that ACK, it closes its side of the connection. (A sockets sketch of half-close follows.)
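
Half-close maps directly onto the sockets API: shutdown(SHUT_WR) closes only the sending direction while reception continues. A minimal sketch (my addition; the host and port are placeholders):

```python
import socket

HOST, PORT = "example.com", 8080   # placeholder server

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.connect((HOST, PORT))
sock.sendall(b"request")

# Half-close: send FIN for our direction only. We can no longer send,
# but we keep receiving until the server finishes and sends its FIN.
sock.shutdown(socket.SHUT_WR)

while True:
    chunk = sock.recv(4096)
    if not chunk:          # empty read: the server's FIN arrived
        break
    # ... process chunk ...

sock.close()               # release the socket after the full close
```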

State Transition Diagram of TCP

  • To keep track of all the different events happening during connection establishment, connection termination, and data transfer, TCP is specified as a finite state machine (FSM)
img img

TCP Flow Control

The receiver controls the sender so that the sender won’t overflow the receiver’s buffer by transmitting too much, too fast.

  • TCP uses sliding window to control the flow of data so that the destination does not become overwhelmed with data.
img
  • Window Size = minimum (rwnd, cwnd)
  • Receiver “advertises” free buffer space by including the rwnd value in the TCP header of receiver-to-sender segments
    • RcvBuffer size set via socket options (typical default is 4096 bytes)
    • many operating systems auto adjust RcvBuffer
  • Sender limits the amount of unacked (“in-flight”) data to the receiver’s rwnd value
  • Guarantees receive buffer will not overflow
img

TCP Flow Control - Setting the Receiving Window Size

Flow control prevents the sender from overwhelming the destination’s buffer.

  • Use a sliding window to make transmission more efficient as well as to control the flow of data
  • The window size (rwnd) depends on the available buffer size at the receiver:
    • Available buffer size rwnd = Receiver Buffer Size − Buffered Data
Example

  • 1 - The client first sends a SYN segment.
  • The server sets up its receiving window rwnd.
  • 2 - The server then sends back SYN + ACK to advertise the size of rwnd (800 in this example).
  • 3 - The client then sets its sending window accordingly.
  • The client can now send at most 800 bytes (according to the server’s receiving window).
  • 4 - A first segment containing 200 bytes is sent from the client to the server.
  • The available buffer at the server shrinks (800 − 200 = 600).
  • 5 - The server sends back an ACK advertising the new rwnd value (i.e. the current available buffer, 600).
  • After receiving this ACK, the client slides its window to the right by 200 (the bytes just acknowledged).
  • The client’s sending window is now 600.
  • 6 - A second segment containing 300 bytes is sent from the client to the server.
  • The available buffer at the server shrinks again (600 − 300 = 300).
  • At the same time, assume the server application consumes 100 bytes, so the available buffer becomes 300 + 100 = 400.
  • 7 - The server sends back an ACK advertising the new rwnd value (400).
  • After receiving this ACK, the client slides its window to the right by 300 (the bytes just acknowledged).
  • The client’s sending window is now 400.
  • After some time, assume another 200 bytes are consumed, so the available buffer becomes 400 + 200 = 600.
  • 8 - The server sends back an ACK advertising the new rwnd value (600).
  • After receiving this ACK, the right edge of the client’s window opens (moves right) by 200.
  • The client’s sending window is now 600.

Note that in practice each device has one sending window and one receiving window, since TCP is full-duplex. (A small numeric sketch of this bookkeeping follows.)
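
A minimal sketch of the rwnd arithmetic in the example above (my addition; the send sizes and consumption amounts are taken from the example):

```python
# Track the server's available buffer (rwnd) through the example above.
RECV_BUFFER = 800          # server's receive buffer size (bytes)
buffered = 0               # bytes held in the buffer, not yet consumed

def rwnd() -> int:
    return RECV_BUFFER - buffered

events = [("send", 200), ("send", 300), ("consume", 100), ("consume", 200)]

print("initial rwnd:", rwnd())                 # 800
for kind, nbytes in events:
    if kind == "send":                         # client data arrives, fills buffer
        buffered += nbytes
    else:                                      # application reads, frees buffer
        buffered -= nbytes
    print(f"{kind:7s} {nbytes:3d} -> available buffer (rwnd) = {rwnd()}")

# The sender's usable window at any moment is min(rwnd, cwnd).
```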

TCP Error Control

  • TCP retransmits only the corrupted or lost segments
  • TCP can
    • Detect and resend corrupted segments
    • Resend lost segments
    • Store out-of-order segments
    • Detect and discard duplicated segments

TCP: Retransmission Scenarios

img

  • Lost Data scenario
    • Timeout -> Resend data
  • Lost ACK scenario
    • Timeout -> Resend data
    • Drop the duplicate data

img

  • Cumulative ACK scenario (one ACK is lost, but the second ACK arrives before the timeout)
    • No action needed
  • Premature timeout scenario (the ACK is merely delayed)
    • Duplicate data is sent and will be discarded by the receiver, but the ACK is sent again

TCP ACK Generation

img

TCP Congestion Control

Principles of Congestion Control

  • Congestion: “Too many sources sending too much data too fast for network to handle”
    • Different from flow control
  • Effects:
    • Lost packets (buffer overflow at routers)
    • Long delays (queueing in router buffers)

Congestion

  • Advertised window (rwnd) size (flow control) is used to ensure that receiver’s buffer will not overflow
  • However, buffers at intermediate routers between source and destination may overflow
    • If the input rate > output rate, packets queue up and congestion results
img
  • Congestion occurs when total arrival rate from all packet flows exceeds R over a sustained period of time
  • Buffers at multiplexer will fill and packets will be lost

Congestion Collapse

img
  • Inefficiency occurs when resources are consumed by traffic that will later be discarded
  • In the extreme, we have congestion collapse
  • This was observed in the Internet (80s) until we added congestion control
  • When too much traffic is offered, congestion sets in and performance degrades sharply
  • Increasing the number of packets sent can actually reduce the number of packets delivered
  • Causes
    • Packets arrive faster than the router can process them
      • may be due to a sudden burst
      • packets queue up, increasing the round-trip time
    • Hosts time out and retransmit
      • bringing even more packets into the network
    • Buffers overflow and packets are lost
      • followed by yet more retransmissions
    • In the end, each packet has been sent several times

Phases of Congestion Behavior

img
  • Light traffic (Green Pointer in the graph)
    • Arrival Rate << R
    • Low delay
    • Can accommodate more
  • Knee (congestion onset) (Blue Pointer in the graph)
    • Arrival rate approaches R
    • Delay increases rapidly
    • Throughput begins to saturate
  • Congestion collapse (Red Pointer in the graph)
    • Arrival rate > R
    • Large delays, packet loss
    • Useful application throughput drops

Model of Congestion Control

img
  • The primary goal of congestion control is to alleviate congestion
  • A control problem, relevant issues:
    • What is the form of feedback?
    • Is it stable?
    • Is it fair?

Binary Feedback

  • The simplest form of feedback is binary:
    • Either the network is congested, or not congested
  • One proposal carries binary feedback in the data packet header, which is then echoed back to the sender in ACKs
    • Originally from DECnet (known as DECbit)
    • Later proposed and adopted in ATM
    • Known as Explicit Congestion Notification (ECN) in IETF
  • Current TCP is based on using packet drop (or not) as binary congestion indication

Approaches Towards Congestion Control

Two broad approaches towards congestion control:

  • End-to-End Congestion Control
    • No explicit feedback from network
    • Congestion inferred from end-system observed loss, delay
    • Approach taken by TCP
  • Network-assisted congestion control
    • Routers provide feedback to end systems
      • Single bit indicating congestion (SNA, DECbit, TCP/IP ECN, ATM)
      • Explicit rate sender should send at

TCP Congestion Control

TCP uses the end-to-end approach, which means there is no explicit feedback from the network.

  • End-to-end control (no network assistance)

The end system determines when to adjust the sending rate by observing loss events at the sending end.

  • The cwnd (Congestion Window) will be adjusted
  • TCP sender maintains a congestion window cwnd to control congestion at intermediate routers
  • Effective window is minimum of congestion window and advertised window

img

  • Problem: a source does not know what its “fair” share of the available bandwidth should be

  • Solution: adapt dynamically to the available bandwidth

    • Sources probe the network by increasing cwnd (the congestion window)
    • When congestion is detected, sources reduce their rate
    • Ideally, each source’s sending rate stabilizes near the ideal point
    • (Start from a minimum rate, then slowly increase the rate)

Congestion Window

How does the TCP congestion algorithm change congestion window dynamically according to the most up-to-date state of the network?

  • At light traffic: each segment is ACKed quickly
    • Increase cwnd aggressively
  • At knee: segment ACKs arrive, but more slowly
    • Slow down increase in cwnd
  • At congestion: segments encounter large delays (so retransmission timeouts occur); segments are dropped in router buffers (resulting in time out / duplicate ACKs)
    • Reduce transmission rate, then probe again

TCP Congestion Control Mechanisms

Several mechanisms defined in RFC 2581:

  • Slow start – quickly find an approximately right window size
  • Congestion avoidance – based on AIMD, adapt the window size to a value that is fair
  • Fast retransmit and fast recovery – a faster way to detect loss while keeping the window open for retransmission

The slow start and congestion avoidance algorithms MUST be used by a TCP sender to control the amount of outstanding data being injected into the network.

  • Two variables are added to the TCP per-connection state:
    • Congestion Window cwnd
    • Slow Start Threshold ssthresh
      • To determine whether the slow start or congestion avoidance algorithm is used to control data transmission

Slow Start Algorithm

img
```
initialize: Cwnd = 1
for (each segment ACKed)
    Cwnd = Cwnd + 1
until (loss event OR Cwnd >= ssthresh)
perform (Congestion Avoidance Algorithm)
```
  • Exponential increase (per RTT) in window size (not so slow!)
    • 1, 2, 4, 8, 16, 32, …
  • Loss event: timeout (Tahoe TCP) and/or three duplicate ACKs (Reno TCP)

Congestion Avoidance Algorithm

```
/* slow start is over: Cwnd >= ssthresh */
until (loss event)
{
    for (every Cwnd non-duplicated ACKs received)
        Cwnd = Cwnd + 1
}
/* on a loss event: */
ssthresh = Cwnd / 2
Cwnd = 1
perform (Slow Start Algorithm)
```
  • TCP congestion avoidance:

    • additive increase, multiplicative decrease (AIMD)
      • increase window by 1 per RTT
      • decrease window by factor of 2 on loss event
  • E.g. Loss event at transmission round 8

img
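
The following is a minimal, timeout-only (Tahoe-style) sketch of how cwnd and ssthresh evolve under slow start and congestion avoidance. It is an illustration rather than the RFC 2581 text, and it counts cwnd in segments rather than bytes.

```python
class TahoeStyleSender:
    """Toy model of slow start + congestion avoidance (cwnd in segments)."""

    def __init__(self, ssthresh: int = 16):
        self.cwnd = 1.0            # congestion window
        self.ssthresh = ssthresh   # slow start threshold

    def on_ack(self) -> None:
        if self.cwnd < self.ssthresh:
            self.cwnd += 1                    # slow start: +1 per ACK (doubles per RTT)
        else:
            self.cwnd += 1.0 / self.cwnd      # congestion avoidance: +1 per RTT (AIMD)

    def on_timeout(self) -> None:
        self.ssthresh = max(self.cwnd / 2, 2) # multiplicative decrease
        self.cwnd = 1.0                       # restart from slow start

# The effective send window at any moment is min(cwnd, rwnd).
s = TahoeStyleSender()
for _ in range(40):
    s.on_ack()
print(round(s.cwnd, 2), s.ssthresh)
```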

Fast Retransmit

  • When the receiver receives an out-of-order packet, it sends a duplicate ACK for the last in-order packet (the one before the gap) and caches the out-of-order packet
  • If the sender receives 3 or more duplicate ACKs for a packet, the packet that follows it is assumed lost
img

Fast retransmit Algorithm

```
when (3 duplicate ACKs received)
{
    ssthresh = cwnd / 2
    retransmit the lost segment
}
```

Fast Recovery

  • Also triggered when sender receives 3 or more duplicate acks of a packet
  • Sender enters the “fast recovery” phase
    • Set ssthresh = cwnd/2
    • Temporarily reduce cwnd to (ssthresh + #dup_ack)
    • Retransmit missing packet, assuming receiver caches out-of-order packets
  • When receiver receives the retransmitted packet (filling the gap) it acks all packets (including out-of-order ones)
  • Sender set cwnd to ssthresh (cwnd/2) and returns to congestion avoidance phase
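
A minimal Reno-style sketch of fast retransmit and fast recovery (my simplification of the behaviour described above; cwnd is counted in segments and the starting values are arbitrary):

```python
class RenoStyleSender:
    """Toy model of fast retransmit / fast recovery (cwnd in segments)."""

    def __init__(self, cwnd: float = 16.0, ssthresh: float = 32.0):
        self.cwnd = cwnd
        self.ssthresh = ssthresh
        self.dup_acks = 0
        self.in_fast_recovery = False

    def on_dup_ack(self) -> None:
        self.dup_acks += 1
        if self.dup_acks == 3:                       # fast retransmit trigger
            self.ssthresh = max(self.cwnd / 2, 2)
            self.cwnd = self.ssthresh + 3            # halved window + 3 dup ACKs
            self.in_fast_recovery = True
            print("retransmit missing segment")
        elif self.in_fast_recovery:
            self.cwnd += 1                           # inflate per extra dup ACK

    def on_new_ack(self) -> None:
        if self.in_fast_recovery:                    # gap filled at the receiver
            self.cwnd = self.ssthresh                # back to congestion avoidance
            self.in_fast_recovery = False
        self.dup_acks = 0

s = RenoStyleSender()
for _ in range(4):
    s.on_dup_ack()
s.on_new_ack()
print(s.cwnd, s.ssthresh)     # 8.0 8.0 -> resumes congestion avoidance
```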

State diagram

img

Summary: TCP Congestion Control

img

MSS stands for Maximum Segment Size.

TCP Versions

  • Old Tahoe [RFC 793] is the early implementation of TCP, which uses a go-back-N retransmission scheme with cumulative acknowledgements and a retransmission timer for the recovery of lost packets. The timeout value is a multiple of the RTT.
  • Tahoe [RFC 1122] adds features to Old Tahoe such as slow start, congestion avoidance, fast retransmit and a refined timeout calculation.
  • Reno [RFC 2581] extends the congestion control scheme by adding fast recovery to Tahoe. No slow start is performed after 3 duplicate ACKs are received, which improves performance in the case of single packet drops.
  • New Reno [RFC3782]
    • widely deployed today (Windows 7/8/10, BSD Unix, Linux)
    • (Improved in fast retransmit and recovery process)

How TCP set the Timeout Values

RTO Timer Setting

  • RTT: Round Trip Time between a pair of hosts on the Internet.
  • How to set the TimeOut value (RTO)?
  • Original Algorithm:
    • Keep a running average of RTT and compute TimeOut as a function of this RTT.
      • Send a packet and record its send timestamp $t_s$.
      • When the ACK arrives, record the arrival timestamp $t_a$.

$$\text{Measured Round Trip Time} = \mathrm{RTT_M} = t_a - t_s$$

  • Compute a smoothed running average:

$$\text{New Running Average Value} = (1-\alpha) \times \text{Old Running Average Value} + \alpha \times \text{Measured RTT}$$

or, written in shorthand:

$$\mathrm{RTT_S} = (1-\alpha) \times \mathrm{RTT_S} + \alpha \times \mathrm{RTT_M}$$

Original TCP spec: $\alpha$ in the range 0.1 to 0.2

We don’t want any single measured value to carry too much weight.

$$\text{Retransmission TimeOut} = \mathrm{RTO} = 2 \times \mathrm{RTT_S}$$

Smoothed RTT, $\mathrm{RTT_S}$:

  • Original → no value

  • After 1st measurement → $\mathrm{RTT_S} = \mathrm{RTT_M}$

  • 2nd measurement onwards → $\mathrm{RTT_S} = (1-\alpha) \times \mathrm{RTT_S} + \alpha \times \mathrm{RTT_M}$

  • An obvious flaw in the original algorithm:

    • Whenever there is a retransmission it is impossible to know whether to associate the ACK with the original packet or the retransmitted packet.
      • Either over-estimates or under-estimates the RTT

img

The problem with the original algorithm is that it did not take into account the variance of Measured RTT.

Karn/Partridge Algorithm

  • Do not take an RTT measurement ($\mathrm{RTT_M}$) for a packet that has been sent more than once.
  • For each retransmission, set the TimeOut to double the last TimeOut.
    • Note - this is a form of exponential backoff, based on the belief that the lost packet was due to congestion.

Smoothed RTT, $\mathrm{RTT_S}$:

  • Original → no value
  • After 1st measurement → $\mathrm{RTT_S} = \mathrm{RTT_M}$
  • 2nd measurement onwards → $\mathrm{RTT_S} = (1-\alpha) \times \mathrm{RTT_S} + \alpha \times \mathrm{RTT_M}$

RTT Deviation, $\mathrm{RTT_D}$:

  • Original → no value
  • After 1st measurement → $\mathrm{RTT_D} = 0.5 \times \mathrm{RTT_M}$
  • 2nd measurement onwards → $\mathrm{RTT_D} = (1-\beta) \times \mathrm{RTT_D} + \beta \times \left|\mathrm{RTT_S} - \mathrm{RTT_M}\right|$

Retransmission Timeout (RTO):

  • Original → no value
  • After any measurement → $\mathrm{RTO} = \mathrm{RTT_S} + 4 \times \mathrm{RTT_D}$
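
A minimal sketch of these update rules (my addition): α = 0.125 and β = 0.25 are commonly used values rather than ones given in the notes, and the initial RTO of 1 second is likewise an assumption. Karn/Partridge backoff is applied on retransmission.

```python
class RTOEstimator:
    """Smoothed RTT + deviation estimator with Karn-style backoff."""

    def __init__(self, alpha: float = 0.125, beta: float = 0.25):
        self.alpha, self.beta = alpha, beta
        self.rtt_s = None        # smoothed RTT (no value initially)
        self.rtt_d = None        # RTT deviation (no value initially)
        self.rto = 1.0           # assumed initial timeout, in seconds

    def on_measurement(self, rtt_m: float) -> None:
        """Update from a measurement of a segment that was sent exactly once."""
        if self.rtt_s is None:                       # first measurement
            self.rtt_s = rtt_m
            self.rtt_d = 0.5 * rtt_m
        else:                                        # subsequent measurements
            self.rtt_d = (1 - self.beta) * self.rtt_d + self.beta * abs(self.rtt_s - rtt_m)
            self.rtt_s = (1 - self.alpha) * self.rtt_s + self.alpha * rtt_m
        self.rto = self.rtt_s + 4 * self.rtt_d

    def on_retransmission(self) -> None:
        """Karn/Partridge: take no measurement; double the timeout instead."""
        self.rto *= 2

est = RTOEstimator()
for sample in (0.10, 0.14, 0.12):    # made-up RTT samples in seconds
    est.on_measurement(sample)
print(round(est.rto, 3))
```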

Example:

img

TCP Fairness

  • Whether the TCP control mechanism is fair to all TCP connections
  • Fairness goal: if K TCP sessions share same bottleneck link of bandwidth R, each should have average rate of R/K
img

AIMD is a fair mechanism.

TCP Latency Modeling

How long does it take to receive an object from a Web server after sending a request?

  • TCP connection establishment
  • data transfer delay

Notation, assumptions:

  • Assume one link between client and server of rate RR
  • Assume: fixed congestion window, WW segments
  • SS: MSS (Maximum Segment Size) (bits)
  • OO: object size (bits)
  • no retransmissions (no loss, no corruption)

$R = \text{Data Rate}$

$S = \text{Segment Size}$

$W = \text{Window Size}$

Two cases to consider:

  • $\frac{WS}{R} \geq \mathrm{RTT} + \frac{S}{R}$
    • The ACK for the first segment in the window returns before a window’s worth of data has been sent
  • $\frac{WS}{R} < \mathrm{RTT} + \frac{S}{R}$
    • The sender must wait for an ACK after sending a window’s worth of data

First Case

$$\frac{WS}{R} \geq \mathrm{RTT} + \frac{S}{R}$$

  • The ACK for the first segment in the window returns before a window’s worth of data has been sent

$$\text{delay} = 2\,\mathrm{RTT} + \frac{O}{R}$$

img

Second Case

$$\frac{WS}{R} < \mathrm{RTT} + \frac{S}{R}$$

  • The sender must wait for an ACK after sending a window’s worth of data

$$\text{delay} = 2\,\mathrm{RTT} + \frac{O}{R} + (K-1)\left[\frac{S}{R} + \mathrm{RTT} - \frac{WS}{R}\right]$$

  • where $K = O/(WS)$, rounded up to an integer, is the number of windows that cover the object
img
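
A small calculator for these two fixed-window cases (my addition; the numeric inputs at the bottom are arbitrary examples):

```python
import math

def fixed_window_latency(O: float, S: float, R: float, RTT: float, W: int) -> float:
    """Latency for a fixed congestion window of W segments, no loss.
    O = object size (bits), S = segment size (bits), R = rate (bits/s)."""
    if W * S / R >= RTT + S / R:
        # Case 1: ACKs return before the window is exhausted -> no stalling
        return 2 * RTT + O / R
    # Case 2: the sender stalls after every window except the last
    K = math.ceil(O / (W * S))
    return 2 * RTT + O / R + (K - 1) * (S / R + RTT - W * S / R)

# Example: 500 kbit object, 10 kbit segments, 1 Mbit/s link, RTT = 100 ms
print(fixed_window_latency(O=500e3, S=10e3, R=1e6, RTT=0.1, W=4))
```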

TCP Delay Modeling: Slow Start

Now suppose the window grows according to slow start.

We will show that the delay for one object is:

$$\text{Latency} = 2 \times \text{Round Trip Time} + \text{Object Transmission Time} + \text{Waiting Time}$$

$$\text{Latency} = 2\,\mathrm{RTT} + \frac{O}{R} + P\left[\mathrm{RTT} + \frac{S}{R}\right] - \left(2^{P}-1\right)\frac{S}{R}$$

where $P$ is the number of times TCP idles at the server:

$$P = \min\{Q, K-1\}$$

  • where $Q$ is the number of times the server idles if the object were of infinite size,
  • and $K$ is the number of windows that cover the object.
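
A sketch that evaluates this formula (my addition): K is computed by simulating the doubling window until it covers the object, and Q by applying the stall condition from the fixed-window analysis to an ever-growing window, which is my reading of the definitions above rather than something stated explicitly in the notes.

```python
import math

def slow_start_latency(O: float, S: float, R: float, RTT: float) -> float:
    """Latency with slow start (windows of 1, 2, 4, ... segments), no loss."""
    n_segments = math.ceil(O / S)

    # K: number of windows that cover the object (window doubles each round)
    K, covered = 0, 0
    while covered < n_segments:
        covered += 2 ** K          # the (K+1)-th window holds 2^K segments
        K += 1

    # Q: number of windows after which the sender would stall, were the object
    # infinite: stall while the window's transmit time 2^(k-1)*S/R < S/R + RTT
    Q = 0
    while (2 ** Q) * S / R < S / R + RTT:
        Q += 1

    P = min(Q, K - 1)              # number of times TCP actually idles
    return 2 * RTT + O / R + P * (RTT + S / R) - (2 ** P - 1) * S / R

# Same example inputs as in the fixed-window sketch above
print(slow_start_latency(O=500e3, S=10e3, R=1e6, RTT=0.1))
```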