Chapter 5 Link Layer

The network layer is concerned with arranging for the end-to-end delivery of data. We have seen that this is basically a matter of routing packets along a path from router to router until you reach the final destination. The link layer is concerned with how the packet is moved across one of the links in such a path. The link level packets are usually referred to as frames.

  1. The network layer packet is typically encapsulated in a link level frame by adding a header and sometimes also a trailer. The header typically contains addressing information, where the link level addresses are referred to as physical or MAC addresses. For example, we have already seen that ethernet uses 6 byte addresses whereas the network level IP addresses are 4 bytes long.
  2. The link level may provide for reliable delivery and for flow control, but only across the single link, which is distinct from the problem of providing the same service on an end-to-end (i.e. network level) basis.
  3. The link level protocol may provide for error detection and/or correction.
  4. Link level protocols may be either half or full duplex, meaning that data may be able to move only in one direction or in both directions at the same time.

I. Error Detection and Correction

We have already seen the Internet checksum used as a means for detecting errors in IP headers and in transport level data packets. In addition, you have no doubt already run into parity checking as an error detection mechanism. In both cases, it is rather easy to find errors which are NOT detected.

The idea of any error detection mechanism is to add redundant data to the transmission. Basically, if you wish to send data D, then you compute a functional value f(D) and send the pair (D, f(D)). The receiver gets a pair (E, F) where hopefully E = D and F = f(D). The receiver calculates f(E) and concludes that an error has occurred if f(E) is not equal to F. Note that either D or f(D) can be corrupted; so the test can declare some correctly received data to be bad and some bad data to be good.

One good property of both the internet checksum and parity checks is that f(D) is small, and so there is not too much overhead in sending the pair. (Another method that might detect more errors would be to send multiple copies of D; this would not be as economical.)

One bad property of both the internet checksum and parity checks is that it is fairly easy to find errors which would not be detected. For example, we saw that both miss a number of two bit errors. Note that the goal is not to find an algorithm that makes no errors; the goal is to find one which detects most errors that are likely to occur often.

One simple way to improve parity checking is to calculate parity in multiple ways. A typical example: for each byte, add one parity bit so that all 9 bits sum to 0 modulo 2. Then, for each block of bytes, calculate a parity bit for each of the 9 bit positions. This is referred to as a two-dimensional parity scheme -- it is the method that was used on 9 track magnetic tape for years.

Note that this method allows you to detect and correct all 1 bit errors in a block of data (and parity bits). The idea is that you can use the per-byte parity bits to determine which byte has the error, and then use the bit position parity bits to determine in which bit position the error is located. You can then flip the single bit and your error is corrected. (Note that this even corrects the error when it occurs in a parity bit -- but it only works if the error is actually a one bit error.)
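To make the correction procedure concrete, here is a minimal C sketch of the idea; it stores the per-byte parity bits separately rather than as a 9th bit on tape, and the data and the injected error are of course just illustrative.

#include <stdio.h>
#include <stdint.h>

/* Parity of one byte: returns 1 if the number of 1 bits is odd. */
static int parity8(uint8_t b) {
    b ^= b >> 4; b ^= b >> 2; b ^= b >> 1;
    return b & 1;
}

int main(void) {
    uint8_t block[4] = { 'D', 'A', 'T', 'A' };
    uint8_t row[4];                 /* one parity bit per byte               */
    uint8_t col = 0;                /* one parity bit per bit position       */
    for (int i = 0; i < 4; i++) { row[i] = parity8(block[i]); col ^= block[i]; }

    block[2] ^= 0x08;               /* inject a single bit error             */

    /* The row whose parity no longer matches identifies the bad byte; the
       column parity mismatch identifies the bad bit position.               */
    uint8_t colcheck = 0;
    int badrow = -1;
    for (int i = 0; i < 4; i++) {
        if (parity8(block[i]) != row[i]) badrow = i;
        colcheck ^= block[i];
    }
    uint8_t badbits = colcheck ^ col;      /* exactly one bit set            */
    if (badrow >= 0 && badbits != 0)
        block[badrow] ^= badbits;          /* flip it back                   */
    printf("recovered: %.4s\n", (char *)block);
    return 0;
}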

Cyclic Redundancy Check (CRC)

Link level protocols typically use more robust error detection mechanisms than those described so far. The redundant bits are typically calculated with the aid of special hardware. One common method is the cyclic redundancy check (CRC); it is the method used, e.g., by ethernet. The CRC adds a small number of redundant bits (e.g. 32 bits per frame for ethernet) but detects far more errors than the earlier schemes. For example, we will see that the ethernet CRC detects all single burst errors of length at most 32 bits and all 3 bit errors in a packet. (The full analysis of this is quite complex.)

How can you accomplish this? The basic idea is to consider the message as a sequence of bits; since these bits are either 0 or 1, they can be regarded as the coefficients of a polynomial whose coefficients are integers modulo 2. For example, the letter A has an ASCII value of 65, which in base 2 is 1000001. This is the polynomial x^6 + 1.

Now the CRC algorithm requires another polynomial, which in the case of ethernet is

P(x) = x^32 + x^26 + x^23 + x^22 + x^16 + x^12 + x^11 + x^10 + x^8 + x^7 + x^5 + x^4 + x^2 + x + 1
If the message is the polynomial G(x), then calculate the remainder R(x) when G(x)x^32 is divided by P(x), i.e.
   G(x)x^32 = Q(x)P(x) + R(x), degree(R) < 32
(Of course, all of this is done modulo 2.) The idea here is that polynomials with coefficients which are integers modulo 2 are well behaved algebraically, e.g. every polynomial factors uniquely, etc. In our earlier notation, G(x) is our message D and R(x) is the f(D).

So, now we have an algorithm. If this is all somewhat intimidating, then there is good news and bad news. The good news is that we can actually understand a bit of it, if not all the details, and implementing it can be done efficiently in hardware. The bad news is that the implementation in software is both inefficient and totally opaque. If you are curious, have a look at a short, if incomprehensible, implementation as a C program.
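In the meantime, here is a minimal bit-at-a-time sketch in C of the division described above. It computes the remainder R(x) directly; note that the real ethernet FCS additionally specifies bit reversal and initial/final complementing steps which are omitted here, so the output will not match an off-the-shelf CRC-32 routine.

#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* The ethernet polynomial P(x) with the x^32 term left implicit. */
#define POLY 0x04C11DB7u

/* Long division modulo 2: shift the message, followed by 32 zero bits,
   through a 32 bit register; whenever a 1 bit falls off the top, subtract
   (i.e. XOR) P(x).  What is left at the end is the remainder R(x).         */
uint32_t crc_remainder(const uint8_t *msg, size_t len) {
    uint32_t rem = 0;
    for (size_t i = 0; i < len + 4; i++) {       /* + 4 bytes: the x^32 shift */
        uint8_t byte = (i < len) ? msg[i] : 0;
        for (int bit = 7; bit >= 0; bit--) {
            int top = (rem >> 31) & 1;
            rem = (rem << 1) | ((byte >> bit) & 1);
            if (top) rem ^= POLY;
        }
    }
    return rem;                                   /* degree(R) < 32           */
}

int main(void) {
    const char *s = "A";                          /* the x^6 + 1 example      */
    printf("R(x) = 0x%08x\n", crc_remainder((const uint8_t *)s, strlen(s)));
    return 0;
}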

To try and understand why the CRC check is so effective:

  1. What errors will go undetected? If we transmit (D, f(D)) and receive (E, F), think of these as polynomials. On receipt, we divide E(x)x^32 + F(x) by P(x) and declare an error if the remainder is non-zero. So the error goes undetected exactly when the received polynomial is a multiple of P(x).
  2. We can already understand why any single bit error will be detected. A single bit error means that the received polynomial is the transmitted polynomial D(x)x^32 + R(x) plus x^k for some k. Since the first two terms form a multiple of P(x), the error would go undetected only if x^k were also a multiple of P(x). But this is never true -- if M(x) is a polynomial of degree d whose lowest degree monomial has degree e, then M(x)P(x) has a term of degree d + 32 and also a term of degree precisely e (obtained by multiplying the lowest degree monomials of the two factors together), so any non-zero multiple of P(x) has at least two terms and cannot equal the single term x^k.

    The general idea is that algebraic properties of the particular polynomial P(x) determine that particular types of errors cannot go undetected. As an exercise, check that you understand that all single bit errors would be detected by any CRC algorithm defined by a polynomial P(x) with at least two non-zero terms.

  3. Although the ethernet P(x) is not divisible by x + 1, many of the standard CRC polynomials are chosen to be multiples of x + 1. This is because any such P(x) detects any error involving an odd number of error bits, e.g. all single bit errors, all 3 bit errors, all five bit errors, etc. (To understand why, choose any polynomial with an odd number of terms and try dividing it by x + 1.)
  4. A burst error of length k is an error in which all the erroneous bits lie in a contiguous sequence of bits of length k. These are important, because typical communication errors are burst errors spanning one or more bytes. You should be able to explain why a polynomial P(x) of degree d should detect all single burst errors of length no larger than d.

II. Ethernet

Ethernet has become the standard multiple access link layer protocol for IP based networks. During the late 1960's, Abramson at the University of Hawaii developed Aloha, a multiple access satellite based network. The idea was that ground based systems would broadcast to the satellite, which would relay the signal back to the ground based systems. Each sender could transmit at will; if two senders sent packets at almost the same time, the packets could interfere with each other, and this interference would be seen as a garbled radio signal. When collisions were detected, the packet would be resent. Analysis of the algorithm showed that the system would saturate at about an 18 percent utilization level. An alternative, in which transmissions were confined to fixed time slots, allowed this maximum utilization level to be doubled.

Metcalfe and Boggs at Xerox PARC modified this algorithm for use as a LAN protocol. The basic algorithm is often described as CSMA/CD.

  1. MA simply indicates that the protocol is multiple access, like the Aloha protocols.
  2. CS stands for carrier sense. The sender first checks the cable to see if another node is sending, i.e. whether the ether cable is active at the point where the sender is connected to it. The sender can transmit only if the cable has been idle for a minimum of 96 bit times.
  3. CD stands for collision detect. While sending, the cable is monitored for indications of a collision with a packet sent by another sender. A collision is detected because the voltage levels on the channel are approximately doubled while it is occurring. If a collision is detected, the sender stops sending and retries the transmission at a later time.
  4. The resend time is an integral number of 512 bit time slots later, where the number is randomly chosen in a range which doubles with each resend attempt on the same packet.
The original ethernet in 1972 was 2.94 megabits per second and was called the Alto Aloha Network (since it connected Alto computers). A year later, Metcalfe changed the name to ethernet. Ethernet became an open standard, the DIX Ethernet (DEC-Intel-Xerox); in 1985, it was also standardized as the IEEE 802.3 CSMA/CD standard. The two standards are still both in common use today; the basic protocols are essentially identical, but the header formats differ slightly in a manner which allows both types of frames to be used on the same physical ethernet. The headers you have been using this semester are DIX headers.

A. Ethernet Frames

The basic DIX frame looks like:
  1. Preamble: 8 bytes long, the first 7 bytes are 0xaa and the 8th is 0xab.
  2. Destination Address: 48 bits long. The address is written as hyphen separated byte values in hexadecimal, e.g. 00-b0-d0-15-d0-ca. When the address is transmitted, it is transmitted in this same byte order, i.e. leftmost byte first. But the individual bits of each byte are sent in little endian order, i.e. least significant bit first. (Token Ring (802.5) transmits most significant bit first....)

    The first bit of the address sent (i.e. the least significant bit of the first byte) is 0 for unicast and 1 for multi-cast.

  3. Source Address: This has the same format as the Destination address.
  4. Type: This is the 2 byte protocol type, and it allows one to distinguish various higher level protocols. For example, IP uses 0x800, and ARP uses 0x806. All protocol types are larger than 1500 decimal so that you can distinguish DIX packets from 802.3 packets, which use the same field as a length field.
  5. Data: Up to 1500 bytes
  6. FCS: Frame Check Sequence is a 32 bit CRC check value.

After the FCS, the channel must remain idle for 96 bit times (12 byte times). It is permissible for dribble bits to occur in the byte after the FCS. These bits are hardware anomalies and are just discarded. Note that it is important that the carrier be gone before the first byte after the FCS is finished; otherwise, you will misjudge the length of the data, as the header does not have a length field.

The IEEE 802.3 frame is similar, but it has a more complicated way of specifying the Type and, as a result, only allows for up to 1497 bytes of data. In addition, it allows for locally allocated addresses -- the next to least significant bit of the first byte of the address is 1; for DIX, this bit is always 0, i.e. all addresses are globally assigned (except for the broadcast address, which is all 48 bits set to 1).
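For concreteness, here is a sketch of how the DIX header might be declared in C. The field names are illustrative (not taken from any particular driver), and the preamble is omitted since the hardware strips it before the frame reaches software.

#include <stdint.h>

#pragma pack(push, 1)
struct dix_header {
    uint8_t  dst[6];     /* destination MAC address                          */
    uint8_t  src[6];     /* source MAC address                               */
    uint16_t type;       /* protocol type in network byte order:
                            0x0800 = IP, 0x0806 = ARP                        */
};
#pragma pack(pop)

/* A unicast address has the least significant bit of the first byte equal
   to 0; a multicast (or the broadcast) address has that bit set to 1.       */
static inline int dix_is_multicast(const uint8_t mac[6]) { return mac[0] & 1; }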

B. Media Access Control

When can a node send data? Here are the rules:
  1. Before sending data, the node must have seen no carrier (i.e. the channel idle) for at least 96 bit times (the interframe gap (IFG)). Note that this is 9.6 microseconds for 10 MB ether and 960 ns for 100 MB ether. Before attempting to send the frame for the first time, the collision count is set to zero.
  2. If the node sees carrier, then it must wait, i.e. defer to the current sender.
  3. If the node detects a collision while sending, it must send a jam signal for 32 bit times. More precisely, the sender must finish the preamble if it is not yet done, and then send the jam signal, which is simply 32 bit times of data, after which it must stop transmitting. The purpose of the jam signal is to make sure that enough data has been transmitted so that other nodes are sure to see that the packet transmission was attempted. After the collision is detected, the collision counter is incremented.
  4. If the first 512 bits have been transmitted without detecting a collision, then the channel is said to have been acquired. There should be no collisions after this time; if there is, then it is called a late collision which is a serious error; the sending of the frame is aborted and no retries are made. If the entire frame is sent without encountering a collision, then the collision count is set back to zero. The 512 bit time unit is referred to as a slot time.
  5. If a collision is detected before the channel is acquired, then up to 16 retries are attempted. If the channel cannot be acquired in 16 retries, the send is aborted.
  6. Suppose that the collision count is c and a retry is to be scheduled. No retry is attempted until after an integral number of 512 bit times later. This number is randomly chosen between 0 and 2^min(c,10) - 1. After this wait, the retry can be attempted (but only after the channel has been idle for at least the IFG).
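As a small sketch of this scheduling rule (assuming a generic rand() source of randomness; a real controller does this in hardware):

#include <stdlib.h>

#define SLOT_BIT_TIMES 512
#define MAX_ATTEMPTS   16

/* Returns how many bit times to wait before the next retry, or -1 if the
   frame should be abandoned.  c is the current collision count.            */
long backoff_bit_times(int c) {
    if (c >= MAX_ATTEMPTS)
        return -1;                       /* give up after 16 attempts        */
    int k = (c < 10) ? c : 10;           /* the exponent is capped at 10     */
    long slots = rand() % (1L << k);     /* uniform in 0 .. 2^k - 1          */
    return slots * SLOT_BIT_TIMES;       /* an integral number of slot times */
}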

The retry time has the following properties:

  1. It is randomized so that the various colliding senders are not likely to retry at the same time.
  2. The exponential back off takes into account the possibility of many potential senders. Basically, if there are many contending for the channel, you can expect a second collision because there are only 2 possible times when the first retry would occur. But the exponential back off gives more and more possible retry times as the collision count increases.
  3. The retries occur at discrete times, spaced at slot time intervals, arranged so that if there is a single sender with the minimal retry time, then it should acquire the channel. Compare this to what would happen if one simply chose random retry delays without any consideration of the slot time. (Of course, non-retry transmissions CAN occur at other times.)

C. The Ethernet Slot Time

Clearly the 512 bit time figure is important to the media access control system. Note that this is also the same as the amount of time it takes to send a minimally sized frame: 14 bytes of header + 46 bytes of data + 4 bytes of checksum amount to 64 bytes, or 512 bits.

This magic number comes about because one needs to be able to detect a collision before the frame is completely sent. Let's see how this works. In the worst case, the collision would be between senders at opposite ends of the ethernet. Sender A would start sending; the signal would propagate down the wire and arrive just at the instant that Sender B starts sending; the collision signal (i.e. the high voltage level on the wire) would then propagate back to Sender A. So, in the worst case, the signal has to propagate the entire length of the ethernet and back, and all this before Sender A is done sending the frame.

The minimum frame is 512 bits (including 46 data bytes). One bit time is 100 ns for 10 MB ether or 10 ns for 100 MB ether. So 512 bit times, assuming a propagation speed of 0.6 times the speed of light in a vacuum, would allow the signal to move approximately 9 km on a 10 MB ethernet. Since the signal has to travel twice the length of the ethernet, this would mean that the ethernet could not be longer than about 4500 meters. Actually, life is not this simple; the standard limits old thick ethernet to no more than 500 meters in any segment of wire, and these segments must be connected with repeaters -- it allows for 5 segments, which is 2500 meters. (This does not take into account the delays at the repeaters, but neither does it take advantage of the additional 64 bit times of the preamble.) So, there are additional complications, and the rules for how you can connect ethernet segments together are quite complex; hopefully, however, this conveys the general idea of why the slot time is chosen as it is.

D. Channel Capture

This is a subtle bit of unfairness in the ethernet algorithm. One would hope that the algorithm is fair in the sense that a single node should not be able to monopolize more than its fair share of the bandwidth. However, if there is a sender who always has data to send, then the following anomaly (called channel capture) is possible.

  1. Suppose both A and B wish to send. They both do, and a collision occurs. Suppose A's retry time is 0 and B's retry time turns out to be 1 slot time. Then A will retry, acquire the channel, and complete sending the packet; its collision count is therefore set back to 0, while B's is still at 1. So far, things appear fair.
  2. Next, A tries to send its second packet and B retries its first packet, since its slot time has expired. Again, a collision occurs. Now A must wait either 0 or 1 slot time to retry. But B now has collision count 2 and so it must wait either 0, 1, 2, or 3 slot times.
  3. It is more likely that A will have to wait less time than B. (The probability that A will retry first is 5/8, the probability that B will retry first is 1/8, and they will tie with probability 1/4 -- see the small enumeration after this list.)
  4. If A retries first, it will acquire the channel and complete transmission of its second packet; again its collision count is set to 0. Unless A's packet is small, it is likely that A and B will collide again, and the probability of B getting the channel goes down even further, because B's collision count would then be 3 and A's only 1.
  5. If A and B tie, then a collision occurs and B's collision count would be 3 whereas A's would be 2. So the probability of A getting to retry first on the next round is still better than B's.
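The probabilities quoted in item 3 can be checked by enumerating the second contention round; here is a small sketch that does so.

#include <stdio.h>

/* A (collision count 1) draws its retry slot uniformly from {0,1}; B
   (collision count 2) draws uniformly from {0,1,2,3}.  Count who goes first. */
int main(void) {
    int a_first = 0, b_first = 0, tie = 0;
    for (int a = 0; a < 2; a++)
        for (int b = 0; b < 4; b++) {
            if (a < b)      a_first++;
            else if (b < a) b_first++;
            else            tie++;
        }
    printf("A first: %d/8, B first: %d/8, tie: %d/8\n", a_first, b_first, tie);
    return 0;
}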

E. Full Duplex

What we have described so far is the standard CSMA/CD behavior of half duplex ethernet. Nowadays, however, a common setup is not this at all. For example, in the INSLAB, one has a 100 MB ethernet switch. The switch is actually a bridge. The effect is that each machine plugged into the switch actually has its own private ethernet, with just the switch and the machine on it. Furthermore, the connection is by a patch cable which has separate channels for sending and receiving data. This means that you can send and receive data at the same time, i.e. you have a full duplex connection. Under these circumstances, ethernet does not use the media access control protocol we described above. Rather, each sender simply sends whenever it wants to -- except that it still needs to wait for the IFG between frames. Effectively, you have two 100 MB contention free channels, a point-to-point connection, with just the IFG as the only additional overhead.

When the traffic is between two of the machines connected to the same switch, the switch acts in store-and-forward mode, like a router. So, the frame is received and stored in buffers until it can be sent out another of its ports. In principle, it would be possible to overflow the buffers in the switch. However, the switch is constructed with enough buffer space that it can handle full speed traffic at 100 MB/sec in each direction on about 20 of its ports simultaneously without overflowing its buffers. (Of course, this assumes that the traffic is spread across the ports -- that you are not asking for more than 100 MB/sec out of any one port. If you direct all the traffic to a single port, its buffers will overflow quite quickly.)

The switch also allows for cut-through mode. In this mode, the packet is forwarded immediately (when the destination port is free) -- so as soon as the destination is known (note that the destination address comes first in the frame), the bytes are routed to the output port instead of waiting until the frame is fully received. This optimization does mean that you cannot change your mind if you should notice at the end of the frame that the FCS does not match; so you could be forwarding incorrect data.

F. Gigabit Ethernet

The basic algorithm for CSMA/CD has remained almost identical to the original ethernet specification, now almost three decades old. During this time, we have moved from 10 MB to 100 MB and now to 1000 MB in speed. Remember that the main point was that we had to make sure that packets were sized appropriately so that collisions could be detected before the sender was done sending a packet. So, at 10 MB, the frame was required to hold at least 46 bytes of data (i.e. 512 bits in all); this allowed for ethernets which were approximately 2.5 km in size. In moving to 100 MB, it was decided to keep the old minimum frame size. This meant that the maximum sized ethernet would have to shrink by about a factor of 10; so 100 MB ethernets cannot be larger than about 210 meters. With gigabit ether, the speed again increased by a factor of 10. If one kept the same minimum frame size, one would be down to about 20 meters for an ethernet. Although this is adequate for most connections within a machine room, it would mean that the IEEE standard structured wiring for commercial buildings would be too long -- that standard allows for 100 meters from a hub to an office workspace, and so can require 200 meters between the two farthest workstations.

On the other hand, changing the minimum frame size is not attractive either; it means that you cannot easily bridge from 100 MB to gigabit ethernet. (You would have to extend all the short packets to the gigabit minimum.) So, the problem was how do you keep the same frame structure and yet allow for gigabit speeds?

Recall that, with 10 MB ether, the minimum ethernet frame size has to be accommodated by the higher level protocols. The approach with IP is to pad the actual IP datagram with additional bytes, extending it out to 46 bytes of payload in the ether frame. This means that IP needs a length field in its header so that it can determine where the packet ends. (Although in principle possible, one does not attempt to pack several small IP packets within a maximal sized ethernet frame; instead the use of longer packets is encouraged.) This is the basic idea of what is done with gigabit ether, but it is done outside the frame. You add padding after the end of the packet by keeping the carrier on -- out to a minimum of 512 bytes (4096 bit times) for a minimum sized packet. This means that a minimum sized frame uses an additional 3584 bit times of channel time. These additional bits are non-data bits and are referred to as extension bits. The good news is that you maintain the original frame structure even though your speed is increased by a factor of 10.

But this is a bit illusory -- basically, in the case of minimum sized frames, we increase the bit rate by a factor of 10, but increase the overhead so that the extension bits are 7 times as long as the entire original frame. So, for small frames, we have less than doubled our effective speed. For frames longer than 4096 bits, there is no additional overhead; also, for full duplex operation, there is no additional overhead, as the extension bits are NOT required there.
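To make "less than doubled" concrete, here is the arithmetic for the channel time occupied by a minimum sized frame (ignoring the preamble and the IFG):

    at 100 MB:   512 bits / 100 megabits per second           = 5.12 microseconds
    at 1000 MB:  (512 + 3584) bits / 1000 megabits per second = 4.096 microseconds

so minimum sized frames move only about 1.25 times faster, despite the tenfold increase in raw bit rate.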

To address this poor performance with small packets, gigabit ethernet also allows frames to be packed together. Recall that after each frame, the 10 MB (and 100 MB) specification requires that one turn off the carrier and only start the next frame after having sensed no carrier for at least 96 bit times (the interframe gap, IFG). The reason for this is to oblige hosts to share the ethernet channel; but notice that if fair sharing means the ability to pass a fair share of the total number of data bytes, then this is unfair to users of small frames. The gigabit standard addresses this with frame bursts.

With gigabit ethernet, you are allowed to send a frame and, at the end of the frame, keep the carrier on and go ahead and send additional frames separated by an IFG amount of idle time (but with the carrier left on to prevent others from contending for the channel). So, the idea is:

  1. If carrier has not been present for at least 96 bit times, transmit the first frame.
  2. If the frame is less than 4096 bits long, leave the carrier on after the frame has been transmitted, padding out to 4096 bits. (If a collision occurs, then the frame needs to be resent, even though its data bits were already out before you detected the collision.)
  3. After 4096 bit times, the channel is acquired, and so any further collisions are errors.
  4. At the end of the first frame, you can either release the channel by turning off the carrier or you can continue with the frame burst. To continue, you leave the carrier on for 96 bit times (the IFG).
  5. You then send additional frames separated by IFG with the carrier left on. The maximum length of the additional frames (those beyond the first) is 12,000 bit times. Because the channel is acquired, no collisions should occur for any of the additional packets.

Notice that the protocol requires you to use extension bits for the first frame even though you COULD have been sending additional frames. This is done so that one always knows that a legal collision always involves the first frame of a burst -- so there is no question as to which of the frames needs to be resent.

III. Address Resolution Protocol (ARP)

In order to actually send an IP packet across an ethernet, it needs to be encapsulated in an ethernet frame. To do so, one needs to know the ethernet destination address. This is referred to as the Neighbor Discovery problem. Finding out the ethernet address of hosts on a directly connected ethernet is the purpose of Address Resolution Protocol (ARP). For the original description, see RFC 826.

A. The Basic ARP Algorithm

It is reasonable for each host to remember certain ethernet addresses, e.g. its own ethernet address is queried from its interface card and remembered. Also, finding the ethernet address associated with a neighbor host needs to be done for each outgoing IP packet; so it needs to be efficient. As a result, recently used mappings are kept in a cache on each host. One can obtain a listing of the current contents of that cache with the arp -a command (for Unix or Windows). Here is typical output:

Interface: on Interface 0x1000003
  Internet Address      Physical Address      Type
                        00-e0-fe-14-68-c0     dynamic
                        00-50-da-64-0e-ff     dynamic
                        00-60-08-31-f4-14     dynamic
                        00-01-02-64-7c-dd     dynamic
                        00-50-04-aa-2d-10     dynamic
                        00-50-da-64-0e-ff     dynamic
                        00-50-04-9a-0d-6b     dynamic
                        00-02-b3-25-0a-bb     dynamic
The command also allows you to add and delete entries by hand from the cache. All the entries here are marked dynamic which means that they will time out and be deleted from the cache if they are unused for a while. The exact timing is controlled by registry entries on Windows systems, but the default behavior is:
  1. Entries which are not referenced for 2 minutes are removed from the cache.
  2. Entries that are older than 10 minutes are removed from the cache whether or not they have been referenced.
The entries that are added with the arp command are static entries; they never time out.

Here is the basic ARP algorithm:

  1. To map a given IP address to its corresponding ethernet address for a particular interface, first check to see if it is in the ARP cache. If so, use the entry.
  2. If not, send a broadcast ethernet ARP request packet out the interface with the local IP address, local ethernet address for the same interface, and the IP address for which the mapping is desired.
  3. A host on the local ethernet (the one using the IP address for which the mapping is desired) responds with a unicast ARP response, which includes the same information as the request together with the requested ethernet address.
  4. Upon receipt, the information is placed in the ARP cache of the requestor. The received address is used to address the ethernet frame.

The reasons that the requestor sends its own addressing information are:

  1. This allows the responder to unicast its response, rather than using another broadcast message.
  2. The responder puts the information about the requestor into its own ARP cache. The idea is that it will likely need that address soon, in order to reply to the IP traffic that prompted the request.
  3. In fact, ANY host receiving the request which has an entry for the address in its cache will use the information from the request to update its cache.

The ARP protocol uses an ethernet frame; i.e. it is NOT an IP packet. It uses ethernet type 0x806 in the DIX header. The ethernet broadcast address is ff-ff-ff-ff-ff-ff. Here is the format of the body of the frame:

2 bytes   Hardware Address Space (ethernet = 1)
2 bytes   Protocol Address Space (IP = 0x800)
1 byte    Hardware Address Length in bytes
1 byte    Protocol Address Length in bytes
2 bytes   Opcode (request = 1, response = 2)
n bytes   Sender's Hardware Address
m bytes   Sender's Protocol Address
n bytes   Receiver's Hardware Address
m bytes   Receiver's Protocol Address

The receiver's hardware address is typically not filled in for the request packet, as its value is unknown. The RFC suggests that one might use the broadcast address there if convenient.
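Assuming the ethernet/IP case (n = 6, m = 4), the body of the frame can be written as a fixed C struct; the field names below are illustrative rather than taken from any particular implementation, and all multi-byte fields are in network byte order.

#include <stdint.h>

#pragma pack(push, 1)
struct arp_ip_over_ether {
    uint16_t hw_space;        /* hardware address space, 1 = ethernet        */
    uint16_t proto_space;     /* protocol address space, 0x800 = IP          */
    uint8_t  hw_len;          /* 6                                           */
    uint8_t  proto_len;       /* 4                                           */
    uint16_t opcode;          /* 1 = request, 2 = response                   */
    uint8_t  sender_hw[6];
    uint8_t  sender_proto[4];
    uint8_t  target_hw[6];    /* typically not filled in for a request       */
    uint8_t  target_proto[4];
};
#pragma pack(pop)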

B. Gratuitous ARP

DHCP is used to distribute IP numbers to machines on the network. When a host receives an offer from a DHCP server, it checks whether the offered address already appears to be in use on the network by sending out a so-called gratuitous ARP. This is a request in which the target and the sender IP address are the same. The idea is that any system already using the IP address (referred to as the defender) will send a reply -- the reply informs the requesting host that the IP number is actually in use. In this case, the client sends a DECLINE message to the DHCP server, and the server takes the number out of service.

Exercise: Actually, there is one more step. The defender will itself send a gratuitous ARP. This is called Address Conflict Exchange. Explain why this additional step is needed. What would be the consequence if this ARP were not heard?

C. Proxy ARP

Normally, the host whose IP address is being resolved should be the one which answers the ARP request. However, there are situations in which it is convenient to let other hosts act as a proxy answering the ARP request on behalf of another host; in this case we refer to the protocol as Proxy ARP.

As an example of a situation in which proxy ARP is useful, consider a host A which is connected to a LAN, but which also has a modem through which machines connect using PPP. If B is connected through PPP, then A can be a proxy for B -- the end result is that packets destined for B will be encapsulated in frames addressed to A. In this case, B does not see the ARP request packets because it is not on the ethernet; in fact, it does not even have an ethernet address.

Proxy ARP can be used by network gateways as well, in order to make it appear as if machines are working in an unsubnetted environment. This is what is done here at U.K. Notice that the routing table on the machines in INSLAB shows all of the campus network as directly connected to the local ethernet. What is actually happening is that the router is acting as a proxy for all the machines which are not on the local ethernet. To see that this is indeed happening, analyze the following output of the arp command, taken shortly after I had sent a ping to the two main campus nameservers as well as to one other host outside the local ethernet.

Interface: on Interface 0x1000003
  Internet Address      Physical Address      Type
                        00-e0-fe-14-68-c0     dynamic
                        00-e0-fe-14-68-c0     dynamic
                        00-e0-fe-14-68-c0     dynamic
                        00-50-da-64-0e-ff     dynamic
                        00-60-08-31-f4-14     dynamic
                        00-01-02-64-7c-dd     dynamic
                        00-50-04-aa-2d-10     dynamic
                        00-50-da-64-0e-ff     dynamic
                        00-50-04-9a-0d-6b     dynamic
                        00-02-b3-25-0a-bb     dynamic
For more details, see RFC 1027.

D. Implementation Notes

The purpose of this section is to give the flavor of how one might implement an ARP module in the operating system. Our implementation will omit important details needed for an actual implementation, but hopefully will be detailed enough to show the basic lay of the land. Here are some facets of the environment:

  1. An operating system is an interrupt driven runtime environment. Some of these interrupts include:
    1. Device interrupts: the network interface cards request interrupts after an input or output operation has completed. The interrupt service routine checks for errors, schedules the higher level handling of the I/O completion, and starts the next I/O operation for the device.
    2. Timer interrupts: an interval timer requests an interrupt at regular intervals, for example every 10 milliseconds. The interrupt service routine updates clocks, starts scheduled jobs, and updates the structures used for scheduling threads.
  2. The operating system is a multithreaded environment -- access to various shared resources such as operating system tables is controlled by locking protocols in order to preserve data integrity.
  3. The operating system is a virtual memory environment, maintained by the operating system itself. This means that one needs to be aware that certain required structures may or may not be accessible at any given instant. Although in an application environment you can be assured that referencing a memory location will cause a page fault that brings the page of memory into place, in the operating system the page fault mechanism may be disabled as a consequence of the locking mechanisms mentioned above. Similarly, care must be taken because expected niceties like error trapping, thread management, etc. are implemented within the operating system itself, and so may not be available as services at the time the code in the module is executing.

One of the problems which every operating system has to address is how to handle a wide variety of I/O devices. Typically, the system must accommodate even hardware that had not been invented when the operating system version was written. This is typically done by virtualizing the devices: you extract a certain number of standard operations like read and write. One maintains a table of devices which includes, for each device, a pointer to the code that implements each virtual operation for that device; operating system code references the device through this table. The code that implements the virtual operations is called the device driver; new devices can be added to the system because the operating system can typically link a new device driver into itself dynamically, by loading the code into its address space and updating the table. At the same time, various higher level structures may also be created.

In our particular case, one of the higher level structures is the network interface. One normally expects that the operating system would contain a network interface table, which would normally be a linked list in order to allow it to be dynamic. The table would have a row for each interface, one for each network card or modem as well as one for the LOOPBACK interface. You will get an idea of some of the publicly viewable parts of the network interface when you use GetIfTable for your fourth programming assignment. Each row of the table might be described as a netif structure. Here are typical fields you might expect:

  1. name: Usable by system administrators and other humans.
  2. state: For example, is the interface UP or DOWN.
  3. Various information about the interface, e.g. IP numbers, net mask, broadcast address, MTU, MAC address, etc.
  4. Various statistics like: type of interface, speed, number of frames received, number of frames sent, various error counts, etc.
  5. Input and output queue(s) together with any information needed to manage these, such as maximum sizes. The idea is that the output routines will simply add the packet to the output queue; the interrupt service routine will schedule the actual I/O by working through the output queues. Similarly, an incoming packet is queued to the input queue by the interrupt service routine which also schedules the task of handling the input packet -- it will be executed later when all higher priority tasks are idle.

Suppose an ethernet packet is input by the network interface card. It requests an interrupt, the interrupt service routine checks for errors, allocates memory for a structure it can use to add the packet to the appropriate interface's input queue, makes the entry, and schedules the remaining input handling. When additional interrupts are handled and all higher priority tasks are finished, the interface input handler is called. This routine's job is to demultiplex the ether stream: In our case, it will examine the ethernet Type field in the frame header and call the appropriate handler -- e.g. a separate routine for arp packets, one for IP packets, etc. If there is no handler for the particular type, the memory used for the frame needs to be deallocated; in all cases, statistics in the netif structure need to be updated.

Now we are ready for the actual ARP implementation. The ARP cache can be implemented in a number of ways:

  1. Although it could be implemented simply as a fixed length table, one would normally want the cache to be dynamic; this allows it to typically contain very few elements, while letting it grow to hundreds of entries when, e.g., you are running your 4th programming assignment. Because the ARP cache is referenced at a very low level in the operating system, it may be kept in non-paged memory, and, if so, memory resources are particularly tight.
  2. Because the cache is accessed very often, one would want a reasonably quick access mechanism. If the cache is almost always small, like less than 10 entries, linear search might be adequate; however, if it is often hundreds of entries long, you may want to set up a hashed lookup scheme.
  3. One needs to decide if there should be a single cache for the entire machine or one cache for each interface. If you opt for a global cache, then there are concerns about particular interfaces getting more than a fair share of the resources and starving other interfaces.

What is stored in each ARP cache entry is defined by an arpEntry struct. Here are some fields you might expect (given that you know about the MIB_IPNETROW typedef):

  1. state: Is the entry STATIC, DYNAMIC, or INVALID. We would want the third possibility so that we can have placeholders in the cache created when you make an ARP request, but whose MAC address is not used until the ARP response is received. If the cache were a static table, you would also need a FREE state, indicating that the row is not in use.
  2. Descriptive fields telling such things as hardware type, protocol, address lengths, as well as the hardware and protocol addresses. Also a pointer back to the relevant interface structure.
  3. Output queue for frames to be output as soon as the hardware address is known. This might include a length -- you probably don't want to allow arbitrarily long queues.
  4. Various counters including
    1. Number of ARP Requests which have been sent. This would allow you to make multiple attempts to determine the MAC address before giving up and discarding the frames waiting to be output.
    2. Resend Timer: How many ticks to wait before doing the next resend of the ARP Request. For example, you may schedule resends every second or you might try an exponential backoff scheme.
    3. Cache Timeout Timers: how many ticks until the entry is discarded if it is not referenced, and another timer indicating how many ticks to retain the entry whether or not it is referenced.
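Pulling these fields together, an entry might look something like the sketch below; the field names are illustrative and are not Windows' actual MIB_IPNETROW layout.

#include <stdint.h>

struct netif;                              /* defined with the interface table */
struct frame;                              /* a queued link level frame        */

typedef enum { ARP_FREE, ARP_INVALID, ARP_DYNAMIC, ARP_STATIC } ArpState;

typedef struct arpEntry {
    ArpState      state;
    struct netif *pNetif;                  /* back pointer to the interface    */
    uint16_t      hwType, protoType;
    uint8_t       hwLen, protoLen;
    uint8_t       hwAddr[6];               /* MAC address, once known          */
    uint8_t       protoAddr[4];            /* IP address                       */
    struct frame *outputQueue;             /* frames awaiting the MAC address  */
    int           queueLen;                /* bounded, so it cannot grow freely*/
    int           resendCnt;               /* ARP requests sent so far         */
    int           resendTimer;             /* ticks until the next resend      */
    int           idleTTL;                 /* ticks since the last reference   */
    int           absoluteTTL;             /* ticks until unconditional removal*/
    struct arpEntry *next;                 /* the cache as a linked list       */
} ARPENTRY;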

You will need several utility subroutines:

  1. ARPLookup(protocolAddr, protocolType, netif *): returns a pointer to the ARPENTRY corresponding to the protocol address, or NULL if there is none. ARPLookup should reset the timer in the returned cache entry to show that it has been accessed.
  2. SendARP(ARPENTRY *): sends an ARP request for the address specified by its parameter.
  3. AddARP(NETIF *netif, ARP *pArp): adds an ARPENTRY to the ARP cache, initializing it with the information from the source addresses of the ARP request.

We are now ready for the generic output routine, which would be called by e.g. the IP implementation when it wants to send a packet:

int netOutput(NETIF *netif, IPPACKET *pIP, int length) {
    if (netif->state != UP) {
        return ERROR;
    }
    if (netif specifies LOCALHOST) {
        locally deliver the packet;
        return SUCCESS;
    } else if (pIP specifies a broadcast packet) {
        Add the frame to the output queue (don't forget to lock) of netif,
        using the ethernet broadcast address ff-ff-ff-ff-ff-ff;
        return SUCCESS;
    } else {
        ARPENTRY *pEntry = ARPLookup(pIP->nextHopAddr, pIP->type, netif);
        if (pEntry == NULL) {
            Allocate an ARPENTRY, add it to the cache, and initialize it
            with this frame in its output queue;
            Call SendARP to send the request;
        } else if (pEntry->state == DYNAMIC || pEntry->state == STATIC) {
            Add the frame to the output queue of netif,
            using the MAC address in pEntry;
        } else {                /* pEntry->state == INVALID: request pending */
            if (pEntry's output queue is FULL) {
                discard the frame (and update the error statistics);
            } else {
                Add the frame to pEntry's output queue;
            }
        }
        return SUCCESS;
    }
}
Here is the routine for handling ARP packets; it is called from the ethernet demultiplexer routine.

int ARPInput(NETIF *netif, ETHERFRAME *pEther) {
    if (Packet does not specify ethernet or does not specify IP) {
        return SUCCESS;
    }
    if (Packet is a Request packet) {
        if (Packet is requesting a local address) {
            if (sender and target IP are identical) {
                /* a gratuitous ARP: someone is probing our address */
                Send ARP Response frame to defend the IP number;
                Send our own gratuitous ARP Request frame;
            } else {
                Call AddARP to remember the source address;
                Send ARP Response packet;
            }
        } else {
            Call ARPLookup to see if the source address is in the cache;
            If so, update the entry with the source MAC address;
        }
    }
    if (Packet is a Response packet) {
        Call ARPLookup;
        If found, update the entry and re-queue all frames in the entry's
            output queue to the netif output queue;
    }
    return SUCCESS;
}

In addition, one also needs a timer routine which is called to update the ARP cache. It decrements various time counters in the entries, discarding the entries when appropriate. It should be run once every second.

void ARPTimer(int ticks) {
    for (Every ARP cache entry) {
        entry.TTL -= ticks;
        if (entry.TTL < 0) {
            clean up and freeMemory(&entry);
        } else if (entry.state == INVALID) {
            entry.resendTimer -= ticks;
            if (entry.resendTimer <= 0) {
                if (entry.resendCnt < MAX_RESENDS) {
                    SendARP(&entry);
                    entry.resendCnt++;
                    entry.resendTimer = RETRY_TIME;
                } else {
                    clean up and freeMemory(&entry);
                }
            }
        }
    }
}

IV. Bridges

Suppose we have multiple LAN's, say all running ethernet. Working at the network level, we have connected them using routers, which are also referred to as layer 3 switches. The router basically copies an IP packet from one network to the other. On the other hand, we might connect them with repeaters (a.k.a. layer 1 switches). These are devices which simply copy bits from one LAN to the other. This means that if two nodes send frames at the same time, they will collide, even if the nodes are on different LAN's. We say that the two LAN's form a single collision domain. Multiport repeaters are often referred to as hubs. The third possibility is to connect the LAN's with a bridge (a.k.a. a layer 2 switch). In this case, we are copying link layer frames (i.e. ethernet frames) from one LAN to the other. A layer 2 switch reads in an ethernet frame, buffering it until it can be transmitted onto the other LAN. Because of this store-and-forward model, the two LAN's form separate collision domains. Multiport layer 2 switches are often referred to as simply switches.

Bridges can help alleviate a number of limitations of the CSMA/CD LAN:

  1. Physical Size Limitation: 802.3 limits the physical size of the LAN because two nodes in a single collision domain must be close enough that a signal can make a round trip between the nodes within the transmission delay time of a minimum sized packet.
  2. Bandwidth Limitation: All traffic within a single collision domain shares the same bandwidth. Frames which do not leave their respective LAN's can be transmitted at the same time, thus increasing the total bandwidth of the system.
  3. Number of nodes Limitation: In a token ring, there is a limitation on the number of nodes that can participate in a single token ring caused by time drift in long rings.

The rest of this section will describe bridges. It is convenient to break the functionality into three levels: basic bridges, learning bridges, and transparent bridges. The functionality of all three is standardized in 802.1 and implemented in all modern bridges.

A. Basic Bridges

Most nodes on an ethernet are configured to listen for frames with a small number of destination addresses -- e.g. the MAC address of the interface card, the broadcast address, and a small number of multicast ranges. For diagnostic purposes, however, one sometimes listens for and handles all the frames that pass the node; this is called operating in promiscuous mode. The basic bridge simply listens promiscuously on each interface, storing each frame until it can be transmitted on its other interfaces (in accordance with the media access protocol). The description holds for 2 port as well as multiport bridges.

The basic bridge addresses the physical size limitation and (for token ring) the limitation on the number of nodes. However, it does not help with the bandwidth limitation because every frame is transferred to every port. (The buffering capability of the bridge can help slightly with bandwidth in the case of simultaneous data bursts.)

B. Learning Bridges

A frame need only be transferred from the LAN of the source node to the LAN of the destination node. In particular, if the source and destination are on the same LAN, the frame need not be handled by the bridge at all. This could be handled by statically configuring the bridges so that they know precisely where each MAC address is located. However, it is often possible with a simple algorithm for the bridge to learn the location of the MAC addresses.

The learning bridge operates as follows:

  1. The bridge listens promiscuously for all traffic.
  2. For each frame, the bridge caches the source address of the packet together with the port number of the interface on which the packet was seen.
  3. For each frame received, the bridge looks up the destination address in its cache.
    1. If the destination address is in its cache and is associated with a port other than the one on which the frame was seen, the frame is forwarded to the port associated with the destination address.
    2. If the destination address is in its cache and is associated with the same port as the one on which the frame was seen, the frame is discarded.
    3. If the destination address is not in its cache, the frame is forwarded to all ports other than the one on which the frame was seen.
  4. Entries in the cache are timed out and discarded unless they are seen as source addresses on new packets. (Should the address be seen as a source on a port other than that associated with the address, the bridge assumes the node has moved and discards the original entry in preference for the new one.)
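A sketch of this forwarding decision in C, using a simple linear table; a real bridge would use a hashed lookup and would also age the entries as described in item 4.

#include <string.h>
#include <stdint.h>

#define FLOOD -1               /* send out every port except in_port         */
#define DROP  -2               /* source and destination on the same LAN     */
#define MAX_ENTRIES 1024

struct fwd_entry { uint8_t mac[6]; int port; };
static struct fwd_entry table[MAX_ENTRIES];
static int n_entries;

static struct fwd_entry *lookup(const uint8_t mac[6]) {
    for (int i = 0; i < n_entries; i++)
        if (memcmp(table[i].mac, mac, 6) == 0) return &table[i];
    return NULL;
}

/* Returns the port to forward to, or FLOOD, or DROP. */
int bridge_decide(const uint8_t src[6], const uint8_t dst[6], int in_port) {
    /* Learning: remember (or move) the source address on the incoming port. */
    struct fwd_entry *e = lookup(src);
    if (e == NULL && n_entries < MAX_ENTRIES) e = &table[n_entries++];
    if (e != NULL) { memcpy(e->mac, src, 6); e->port = in_port; }

    /* Forwarding: look up the destination address.                          */
    e = lookup(dst);
    if (e == NULL)          return FLOOD;
    if (e->port == in_port) return DROP;
    return e->port;
}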

The learning bridge works well if the LAN's are connected with bridges in a loop-free topology -- i.e. in a tree. However, it fails spectacularly if there are loops in the network. Here is an example:

  1. Suppose two LAN's A1 and A2 are connected by two bridges B1 and B2. Assume that a frame F1 is sent from a node S on A1 to a node D on A2.
  2. If B1 sees it first, then it will learn that S is on A1 and will copy it as F2 onto A2 (it doesn't know yet where D is located.)
  3. It may happen that B2 sees the original frame F1 first and behaves just as B1 did. But it may also happen that B2 sees the copy F2 on A2 first. It would then learn that S is on A2 and so would copy it as F3 onto A1.
  4. B1 now sees F3 and will copy it as F4 onto A2... Notice that our single original frame is now being cloned over and over again onto the two networks.
Note that the bridges have no means of knowing that the duplicates are not new frames. Furthermore, the situation is much worse than what we saw with routing. At least with routing loops, the same packet did not multiply itself -- note that here BOTH bridges move each copy to the other network, so the amount of traffic keeps increasing. Also, in the case of routing, the TTL field put a lifetime on the packets. In this case, the copies will keep reproducing themselves forever.

C. Spanning Tree Algorithm

One of the problems with simple learning bridges is that they must be connected in a loop free topology. In particular, this means that one cannot add redundant bridges to improve reliability. The way to resolve the problem is with the so-called transparent bridge -- a bridge that implements the learning behavior as well as the so-called spanning tree algorithm. A collection of such bridges autoconfigures itself in a plug-and-play manner so that there are no loops; if redundant connections are present, certain of the bridges will become dormant in order for the others to form a minimal cost spanning tree. More specifically:

  1. The bridges will elect a single bridge to be the Root bridge.
  2. The bridges will find the shortest path (using hop counts) from themselves to the root bridge.
  3. Amongst all the bridges attached to a given LAN, the one of minimal distance to the root bridge will be elected as the Designated Bridge for that LAN. The designated bridge on a LAN is the bridge which will carry all traffic from the LAN toward the root.
  4. Each bridge will determine its root port, one of the ports closest to the root bridge, which will carry all traffic toward the root bridge.
  5. The ports to be included in the spanning tree are:
    1. The root port of each bridge.
    2. Each port of every bridge which is connected to a LAN on which that bridge is the designated bridge.
Frames are routed through the ports that are part of the spanning tree. No Frames (except for Spanning Tree Algorithm frames) are routed through ports not part of the spanning tree.

To accomplish all this, the bridges send special configuration frames out each of their ports. These configuration frames are multicast out the ports -- the address is one such that the frames will be received by all bridges on the same LAN as the port, and such that no bridge will forward the frame off that LAN. Here is the format of the body of the frame:

2 bytes   Protocol Id (0)
1 byte    Version (0)
1 byte    Message Type (0)
8 bytes   Proposed root bridge Id
4 bytes   Cost (to proposed root bridge)
8 bytes   Sending bridge Id
2 bytes   Sending port Id
2 bytes   Message age
2 bytes   Maximum age
2 bytes   Hello time
2 bytes   Forward delay

The basic ingredients are:

  1. The Proposed root bridge, cost, sending bridge Id and port Id are the basic information required to construct the minimal cost spanning tree.
    1. Each bridge chooses a best message from amongst those received at a given port -- best is the lexicographically smallest value based on these four fields, from Proposed root bridge Id down to port Id (see the comparison sketch after this list).
    2. The Proposed root Id and bridge Id are typically 6 byte addresses, but there is space for a two byte bridge priority -- which can be administratively set to influence the likelihood that a particular bridge would be chosen as the root.
    3. The Port Id is 2 bytes long in order to allow for priority ordering amongst ports as well.
  2. The last four time fields are measured in 1/256 second units.
    1. Message Age: estimated time since the proposed root originally transmitted the configuration message from which this one was derived as it was passed along by the protocol.
    2. Maximum Age: time after which the message should simply be deleted. Default: 20 sec.
    3. Hello Time: time between messages sent by the proposed root. Default: 2 sec.
    4. Forward Delay: Instead of immediately using a particular port as a link for forwarding packets, the bridge continues to listen for additional configuration frames for this amount of time. When it ends, the bridge examines frames to learn addresses for the same amount of time; then it starts routing frames through this port. Default: 15 sec.
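The "lexicographically smallest" rule above can be made precise with a comparison function over the four fields; here is one sketch, assuming the Ids are kept in their network (big endian) byte order so that memcmp gives the right ordering.

#include <stdint.h>
#include <string.h>

struct config_msg {
    uint8_t  root_id[8];       /* 2 byte priority + 6 byte MAC address        */
    uint32_t cost;             /* cost to the proposed root                   */
    uint8_t  bridge_id[8];     /* sending bridge Id                           */
    uint16_t port_id;          /* sending port Id                             */
};

/* Returns negative if a is better (smaller), positive if b is better, and 0
   if they are equal; the fields are compared in priority order.              */
int config_cmp(const struct config_msg *a, const struct config_msg *b) {
    int r = memcmp(a->root_id, b->root_id, 8);
    if (r != 0) return r;
    if (a->cost != b->cost) return (a->cost < b->cost) ? -1 : 1;
    r = memcmp(a->bridge_id, b->bridge_id, 8);
    if (r != 0) return r;
    return (int)a->port_id - (int)b->port_id;
}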

Here is the basic algorithm:

  1. A bridge starts off by sending a configuration message nominating itself as the proposed root bridge, with a cost of 0. The message is sent out all ports.
  2. If a port sees a smaller (in the lexicographic order) message than the one which it would transmit onto the particular LAN where the message is seen, the bridge stops sending configuration messages through that port.
  3. When a bridge receives a configuration message with a lower proposed root bridge ID, then it starts using that bridge in its own configuration messages -- the cost is set to be one larger than the lowest cost seen to the new proposed root bridge.
  4. The root port for the bridge is the port with the smallest port Id through which one has seen a configuration message at minimal distance to the current proposed root bridge.
  5. The bridge now no longer sends configuration frames through the ports on which it has seen least cost paths to the current proposed root bridge. It does send configuration frames through all the remaining ports; the bridge is the designated bridge for the LANs attached to the ports through which it continues to send the messages.
  6. The ports to go in the spanning tree are:
    1. The bridge's root port.
    2. All ports for which a bridge is the designated bridge for the attached LAN.
Such ports in the spanning tree are said to be in the forwarding state; the others are in the blocked state. Ports headed for the forwarding state are not actually used for forwarding until the bridge has left the listening state (after Forward Delay seconds have passed) and then the learning state (after another Forward Delay seconds have passed).

Configuration Updates are sent out regularly:

  1. The root bridge sends out configuration messages with message age 0 every Hello time seconds. Various designated bridges forward the message keeping the message age of 0.
  2. Each bridge keeps the current configuration message associated with each port. Each time it receives a message, it compares it with the current one. If the message is smaller (better) in the lexicographic order or if it is the same but with a smaller age, it replaces the current configuration message and the bridge recalculates the root bridge, cost to the root bridge, and root port.
  3. With each timer tick (256 Hz clock), the age of the current configuration message for each port is incremented. If the age reaches the maximum age, then the bridge discards the current configuration message and recalculates the root bridge, cost to the root bridge, and root port.
  4. If a new bridge comes up and sends a configuration frame to a given bridge, then that bridge sends out its current configuration message on that port, but with the age set to the actual age of that message. This allows the new bridge to receive a configuration frame in response earlier than would otherwise be the case, without giving the other bridges on the same LAN the impression that a configuration message has just been received from the root bridge.

Finally, we should consider the interaction between the spanning tree algorithm and the learning bridge function. The learning bridge's cached entries do need to time out -- but they only become invalid because of a topology change. So typical timeout values are 15 minutes -- about how long it takes to do a physical bridge move. However, the topology can also change simply because of a change in the spanning tree, and for this type of change 15 minutes is far too long to keep bad cache entries. So one needs a way to tell bridges to time out cache entries faster when the spanning tree changes. This is accomplished by Topology Change Notification frames:

2 bytes   Protocol Identifier (0)
1 byte    Version (0)
1 byte    Message Type (128)
A bridge that notices a change in the spanning tree sends these messages out its root port, toward the root -- it repeats the message every Hello Time seconds until it receives, from the designated bridge on that LAN, a configuration update with the Topology Change Acknowledgement flag bit set. In this fashion the message moves hop by hop toward the root. Upon its arrival, the root sends its configuration messages with the Topology Change flag set. It continues to send its configuration messages with this bit set for a time equal to the sum of the Forward Delay and the maximum age. Any bridge which receives such configuration messages uses the Forward Delay time to time out cached addresses rather than the longer 15 minute timer. It stops doing so when it receives a configuration frame from the root with the flag no longer set.

V. Serial Line Protocols (SLIP and PPP)

We will restrict ourselves to asynchronous byte oriented protocols, specifically SLIP and PPP -- the first because it is so simple, and the second because it is such a common protocol.

A. Serial Line Internet Protocol (SLIP)

Details are available in RFC 1055. This is a very simple protocol that provides nothing more than frame delimitation. This is done by adding an END character (ASCII 0xC0) right before and right after the IP packet. Since the packet might also contain END characters, one has to escape them with a character stuffing scheme:

  1. Every occurrence of END in the IP datagram is replaced with two characters ESC (ASCII 0xDB) followed by the character 0xDC.
  2. Every occurrence of the ESC character in the IP datagram is replaced with the pair 0xDB (ESC) followed by 0xDD.
Upon receipt, you check for END characters to delimit the packet and then scan the datagram to map the stuffed sequences back to the original characters.
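A minimal sketch of the sending side in C (the constant names are illustrative; RFC 1055 contains similar sample code):

#include <stdint.h>
#include <stddef.h>

#define SLIP_END      0xC0
#define SLIP_ESC      0xDB
#define SLIP_ESC_END  0xDC
#define SLIP_ESC_ESC  0xDD

/* Copies an IP datagram into out[], adding the END delimiters and the
   character stuffing described above.  Returns the encoded length; in the
   worst case out must be able to hold 2*len + 2 bytes.                      */
size_t slip_encode(const uint8_t *pkt, size_t len, uint8_t *out) {
    size_t n = 0;
    out[n++] = SLIP_END;                     /* leading END flushes line noise */
    for (size_t i = 0; i < len; i++) {
        if (pkt[i] == SLIP_END)      { out[n++] = SLIP_ESC; out[n++] = SLIP_ESC_END; }
        else if (pkt[i] == SLIP_ESC) { out[n++] = SLIP_ESC; out[n++] = SLIP_ESC_ESC; }
        else                           out[n++] = pkt[i];
    }
    out[n++] = SLIP_END;
    return n;
}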

RFC 1055 describes an MTU of 1006 characters, which is typical for SLIP implementations. Some implementations, like the one in Windows, allow for 1500 characters, so that IP packets coming from ethernets do not need to be fragmented in order to pass over the serial line.

In addition, there is a protocol for compressing TCP/IP headers. For details, see RFC 1144.

B. Point-to-Point Protocol (PPP)

This is a rather complex protocol described in RFC 1661. It includes:

  1. A data link layer encapsulation that supports multiple protocols on the same link.
  2. Link Control Protocol (LCP) is a protocol for negotiating data link layer options.
  3. Internet Protocol Control Protocol (IPCP) is used to negotiate network layer properties, such as nameservers and header compression. See RFC 1332 and RFC 1877 for details. There are analogous control protocols for each of the protocol families that use PPP.

PPP encapsulation is based on HDLC (High-level Data Link Control, used originally for synchronous serial connections). This explains some of the extraneous fields which can occur in the frame -- beyond this, one has packet delimiting done with byte stuffing as in SLIP and a Protocol field to allow for multiple protocol families on the same connection. Here is the frame:

1 byte     Flag 0x7E
1 byte     Address 0xFF
1 byte     Control 0x03
2 bytes    Protocol
variable   IP Datagram
2 bytes    Frame Check Sequence
1 byte     Flag 0x7E
  1. The Flag bytes are used to delimit the frames. Again byte stuffing is used:
    1. FLAG (0x7E) characters delimit the frame.
    2. Any FLAG characters in the frame are replaced by the two bytes 0x7D 0x5E.
    3. Any ESC (0x7D) characters in the frame are replaced by the two bytes 0x7D 0x5D.
    4. Any control character ch in the frame is replaced by 0x7D followed by ch XOR 0x20 -- i.e. a character with ASCII value less than 32 decimal is changed by flipping the bit with value 0x20. (This last part is done only if negotiated via the LCP; it avoids problems because control characters may be used for serial line flow control.)
  2. HDLC allows for multiple addresses. In our case, the link is being used for point-to-point; so use 0xff, the broadcast address.
  3. HDLC allows for sequencing and acknowledgements to provide a reliable data link service. In our case, we are just using it for datagram traffic; the correct value is 0x03, for unnumbered information (UI) frames.
  4. The protocol field allows for, amongst others:
    1. IP 0x0021
    2. AppleTalk 0x29
    3. IPX 0x2b
  5. The frame check sequence is calculated using CRC-16: x^16 + x^12 + x^5 + 1. For details and example code, see RFC 1662.

Because the standard frame given above includes a certain amount of redundant data, one can negotiate with LCP to avoid sending it. A typical frame would then look like:

1 byte     Flag 0x7E
1 byte     Protocol
variable   IP Datagram
2 bytes    Frame Check Sequence
1 byte     Flag 0x7E

The MTU is by default 1500 bytes.

As an interesting aside, it is possible to combine multiple PPP links so they are treated as a single IP connection. So, for example, one could combine two analogue phone lines to provide a service at twice the effective bandwidth. This is called PPP Multilink Protocol and is documented in RFC 1990.