Papers - Ethernet Academy Articles
Saturday, 29 November 2008 06:06

Putting 50-ms in Perspective

Written by  Lionel Florit


1. Abstract

No matter which new networking technology is invented, when it comes to availability, vendors quickly claim that their equipment meets a so-called 50-ms requirement for convergence. This article explores the historical reasons behind this figure, as well as the problems service providers have in achieving it. It also concludes that 50 ms, in itself, is not relevant to most of the applications running over Metro Ethernet Networks. Furthermore, as networks become more intelligent, other mechanisms can mitigate the effect of network outages.

2. Introduction

Subscribers demand reliable services, as they perceive reliability. Historically, service providers (SPs) have built their networks with as much redundancy as they can afford in order to get as close as possible to a so-called 50-ms convergence time. Equipment vendors tout 50-ms-capable equipment. On the surface, everything looks consistent. However, as we examine the problem more closely, we will see that subscribers don't really need networks to converge in 50 ms. The 50 ms figure comes from the historical requirements of a voice component that is no longer in the network. It is more appropriate to look at the convergence required by the applications running on the network and to provide the service they require. In most cases, it is extremely difficult for service providers to offer that level of availability, and equipment vendors are quick to confuse end-to-end convergence, which is what is really needed, with simple and bounded failure scenarios. There is great misunderstanding about what people mean when they talk about “50 ms”, and the marketing machines confuse the issue further. To lift the fog, we look at where this figure comes from, what it means, where it applies, and which applications really need it.

3. Where does 50 ms come from?

The 50 ms figure historically originated from the specifications of APS (Automatic Protection Switching) subsystems in early digital transmission systems and was not actually based on any particular service requirement (footnote 1). Early digital transmission systems embodied 1:N APS, which typically required about 20 ms for fault detection, 10 ms for signaling, and 10 ms for the operation of the tail-end transfer relay; consequently, the specification for APS switching times was reasonably set at 50 ms, allowing a 10 ms margin. Early generations of DS1 channel banks (from the 1970s) also had a Carrier Group Alarm (CGA) threshold of about 230 ms. The CGA is a time threshold for persistence of any alarm state (such as loss of signal or loss of frame synchronization) on the transmission line side, after which all trunk channels would be busied out. The 230 ms CGA threshold reinforced the need for 50 ms APS switches at the DS3 transmission level, to allow for worst-case reframe times all the way down the DS3, DS2, DS1 hierarchy with suitable margin against the 230-ms CGA deadline. However, it was long ago realized that a 230 ms CGA time was far too short. Many minor line interruptions would trigger an associated switching machine into mass call-dropping because of spurious CGA activations. As a result, the persistence time before call dropping was raised to 2.5 ± 0.5 s by ITU recommendations in the 1980s. Nevertheless, the requirement for 50-ms APS switching stayed in place, mainly because this was still technically quite feasible at no extra cost in the design of APS subsystems. The apparent sanctity of 50 ms was further entrenched in the 1990s by vendors who promoted only ring-based transport solutions and found it advantageous to insist on 50 ms as the requirement, effectively precluding distributed mesh restoration alternatives that had been under equal consideration at the start of the SONET era.
As a marketing strategy, the 50 ms issue thus served as the "mesh killer" for the 1990s[..]. On the other hand, there was also real urgency in the early 1990s to deploy some kind of fast automated restoration method. This led to the quick adoption of ring-based solutions, which had only incremental development requirements over 1+1 APS transmission systems.

Footnote 1: Section from the book Mesh-Based Survivable Networks by Wayne Grover [1].

However, once rings were deployed, the effect was only to reinforce further the cultural assumption of 50 ms as the standard. Thus, as sometimes happens in engineering, what was initially a performance capability in one specific context (APS switching time) evolved into a perceived requirement in all other contexts.

4. What do we mean by 50 ms recovery?

The expression “50 ms recovery” is overused. What does it mean that “the network must converge in 50 ms”? Does it mean that a failure must be detected in less than 50 ms and the recovery will take place later? If a port fails, must a backup link be brought up within 50 ms? Must end-to-end service be restored in 50 ms? Is a service restored when the first frame of that application makes it through the backup path of the network, or when the application resumes its work? Does it apply to all types of services, point-to-point and multipoint? Which types of failure does the 50 ms figure cover: fiber cut, port down, box down, POP down, CPE down? After a failure, are we allowed to drop less important traffic in order to provide bandwidth for more important traffic? These questions are very difficult to answer. There is no common view on all of them, and it is easy to claim “50 ms” in one context and be less than stellar in another. Let's tackle the questions above using a generic example. Consider the following network diagram:

fig1-50ms.png
This diagram represents a company’s headquarters, HQ (right-hand side), connected to two branch offices over a multipoint EVC with UNIs A, B, C, and D in that EVC. Can this network offer 50 ms protection? Let's take a closer look. The first thing we notice is that there are three locations. This is a multipoint situation; therefore, one could use H-VPLS (footnote 2), with L2 or MPLS in the access part of the network, or L2 spanning tree end to end (unlikely).

Footnote 2: Hierarchical VPLS

4.1 Failure at the UNI

In our example, the only protected UNI is located at the HQ (right-hand side of the diagram). If UNI B fails, the customer equipment (CE) must make the decision to switch to UNI D. In the best-case scenario (as far as recovery speed is concerned), the CE is a single piece of equipment, either a router or a switch. If it is a switch, the CE is probably running Rapid Spanning Tree Protocol (RSTP) to block one of the two UNIs so that a loop with the service provider is not created. Because the EVC is multipoint, RSTP has detected several neighbors and is now running with 30-s timers. If the CE is a router, the backup path will be used after the routing protocol converges, which is likely to take longer than 50 ms. Either way, the CE can’t take advantage of the 50 ms. In the access and aggregation networks, recovery is more complex. A frame coming from the branch offices and going through Agg 4 must now be directed to the other edge device (UNI D). This means all the headquarters’ MAC addresses mapped to UNI B must first be forgotten before the forwarding tables can begin to be rebuilt. Depending on the size of the access network, this could easily take more than 50 ms.

4.2 Failure in the access layer

The failure of a link in the access layer of the network is probably what most people have in mind when they talk about 50 ms recovery. This event is not likely to involve the CE, and the failure can be dealt with locally, between the aggregation node and the edge node within the service provider's domain. This is the classic “backhoe” incident that causes a fiber cut. Equipment vendors claiming 50 ms recovery as a feature of their equipment are likely to protect against this type of scenario. If the access is a ring, technologies such as SONET (footnote 3) or Cisco REP (footnote 4) will switch traffic to the other side of the ring very quickly. If the access network is hub and spoke, then MPLS FRR (footnote 5) will also switch traffic quickly. As long as the data path doesn't have to use a different aggregation node, recovery will be prompt.

4.3 Failure at the aggregation layer

A failure of the aggregation node itself is more involved. In our example, if Agg4 were to fail altogether, all other aggregation nodes would have to send their traffic to Agg3. This could be challenging, because potentially a lot of addresses have to be mapped to different ports on Agg2 and Agg6. The EVC in our example is likely to share this network with many other EVCs, which will have to be moved away from Agg 4. This is a typical example of why the 50 ms context matters. It is one thing to apply this ideal delay to a single segment, quite another to apply it to an entire network.

Some questions are left unanswered. In Ethernet technologies, flooding is a common mechanism for recovering from a failure. After detecting a failure, network elements flood unicast frames until they relearn the location of the destination address, and stop flooding at that point. Because flooding causes frames to be multiplied, congestion may occur in other parts of the network that were not affected by the failure in the first place. However, traffic is indeed flowing end to end. When do we decide the network has converged? Can we stop the clock that measures when we hit the “50 ms mark”, even if new congestion is introduced in other parts of the network? As we see, we are still far away from the point-to-point context described in Section 3. Observed from a network-wide perspective, convergence is more complicated than a simple link failover. Let's now turn our attention to the application layer.

Footnote 3: Note that standard SONET performance figures are given in the context of a single ring with 16 nodes, and with adequate unused bandwidth reserved for protection.
Footnote 4: REP (link to document)
Footnote 5: MPLS FRR: MPLS fast reroute (MPLS-FRR) mechanisms divert traffic in case of network failures.
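The flood-and-relearn behaviour described in this section can be illustrated with a toy learning switch. This is a minimal sketch, not any vendor's implementation: the class name, port numbers, and the "hq"/"branch" addresses are hypothetical, and real bridges add aging timers, VLANs, and topology-change signaling.

```python
class LearningSwitch:
    """Toy Ethernet switch: learns source MACs, floods unknown destinations."""

    def __init__(self, ports):
        self.ports = list(ports)
        self.mac_table = {}  # MAC address -> port where it was last seen

    def flush(self):
        """Forget all learned addresses, as after a detected failure."""
        self.mac_table.clear()

    def receive(self, in_port, src, dst):
        """Learn the source, then return the ports the frame is sent out of."""
        self.mac_table[src] = in_port
        out_port = self.mac_table.get(dst)
        if out_port is None:
            # Unknown destination: flood to every port except the ingress port.
            return [p for p in self.ports if p != in_port]
        return [out_port] if out_port != in_port else []


sw = LearningSwitch(ports=[1, 2, 3, 4])
sw.receive(1, "hq", "branch")              # hq learned on port 1
sw.receive(3, "branch", "hq")              # branch learned on port 3
known = sw.receive(1, "hq", "branch")      # forwarded on one port only
sw.flush()                                 # failure detected, table cleared
flooded = sw.receive(1, "hq", "branch")    # destination unknown: flooded
sw.receive(3, "branch", "hq")              # reply lets the switch relearn
relearned = sw.receive(1, "hq", "branch")  # back to a single port
```

Between the flush and the first reply, every frame toward the branch is multiplied across all other ports, which is the source of the transient congestion discussed above.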

5. Which application needs 50 ms protection?

As mentioned previously, looking at protection from the application’s point of view should be the ultimate goal. After all, networks ultimately carry application traffic. If we thought the failure scenarios described in Section 4 could be complicated, the impact of a network failure on a user application is even more complex to evaluate. How the application behaves when packets are lost, and how this translates into user experience, depends on many parameters: protocol behavior, computer speed at each end, type of application (voice vs. data), type of user (stock trader vs. instant messenger), and so on. Furthermore, it is nearly impossible to measure down to the millisecond the effect of a network outage on an application. Let's adopt a commonly accepted simplification of this problem. Let’s assume the application is considered to be fully recovered as soon as the first packet of that application, after a failure, is transported across the service provider's network. Assume, furthermore, that after the failure it doesn't matter whether the packets are out of order, or even if some of them are missing. Finally, assume that there is only a single failure event. With this in mind, let's try to make sense of the application space and find which applications really need 50 ms. The leaves of the tree below show applications commonly perceived to require a maximum 50-ms outage. The list is not exhaustive.

fig2-50ms.jpg

The application space can be divided into three categories: data, voice, and video. The diagram shows “mission-critical” as a classification. A mission-critical application is viewed by subscribers as important enough that they would be willing to pay the service provider more in order to guarantee fast recovery.

5.1 Data oriented applications

In the data space, unicast and multicast applications are the two main subcategories. We will not talk about applications running over TCP. These applications expect packet loss and are adapted to retransmission mechanisms built into TCP.  They will not be affected by short network outages.

5.1.1 Unicast applications

The unicast applications of interest run over a connectionless protocol. Therefore, network outages have a greater impact because packets can't be retransmitted. Let’s look at three categories: military, industrial Ethernet, and network timing applications.

5.1.1.1 Military applications

Military applications cover, for example, multiple-target tracking problems, Coastal Air Defense System, real-time imaging, and the like. As seen in RFC 1679, the Navy's High-Performance Network (HPN) working group has studied the requirements of mission-critical applications on Navy platforms. However, these applications are deployed on submarines, on aircraft, on ships, and on bases, and the military owns the WAN. This is typically not a service provider play.

5.1.1.2 Industrial Ethernet applications

PLCs (programmable logic controllers) are simple dedicated-function computers that are used to automate real-world processes, such as controlling machinery on factory assembly lines. PLCs are connected to control stations by means of proprietary protocols, such as Modbus, Profibus, CANopen and DeviceNet. Recently there has been a push to use Ethernet. Industrial Ethernet LAN products need to be heavy duty to stand up to electrical noise, dust and dirt, humidity, extended media distances, and extended temperature extremes. Using Ethernet to interconnect PLCs and sensors over fiber media is typical. Very tight time synchronization among machines is needed (below 1 ms, IEC 61850 part 5). This is not a WAN application, but rather a LAN application. It is out of the scope of this document.

5.1.1.3 Time synchronization protocols (NTP)

The synchronization accuracy of a WAN using NTP is typically within the range of 10 to 100 ms; on a LAN, this is typically a few milliseconds. A broadcast server sends out a packet about every 64 seconds. A non-broadcast client/server exchange requires 2 packets per transaction. When first started, the transactions occur about once per minute, increasing gradually to once per 17 minutes under normal conditions. Clients using lower-quality clocks must poll more frequently than well-synchronized clients. If a packet is lost, no retransmission takes place; the stations simply wait for the next update. Therefore, with NTP a network outage could last as long as several seconds and cause the loss of only a single timing packet, with little effect on accuracy. IEEE 1588 offers much better accuracy than NTP. Typically, 1588 will generate a few packets per second. During a network outage, the client clock would drift for the duration of the outage (the drift will vary depending on the precision and quality of the clock). However, the short-term drift of a Stratum 3 clock is less than 3.7 x 10^-7 in 24 hours. This amounts to approximately 255 frame slips in 24 hours while the system is holding. An outage as long as 500 ms (10 times our archetypical 50-ms figure!) won't introduce a significant drift.
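The drift figures quoted above can be checked with a few lines of arithmetic. The sketch below assumes, as a worst case, that the clock runs at a constant fractional offset equal to the quoted bound for the whole period, and that a frame slip corresponds to one 125-microsecond TDM frame (8000 frames per second); both are assumptions made for illustration.

```python
DRIFT_BOUND = 3.7e-7             # fractional offset quoted in the text
SECONDS_PER_DAY = 24 * 60 * 60
FRAME_PERIOD_S = 125e-6          # assumed: one TDM frame at 8000 frames/s


def time_error(duration_s, fractional_offset=DRIFT_BOUND):
    """Accumulated time error after free-running for duration_s seconds."""
    return duration_s * fractional_offset


# Frame slips accumulated over 24 hours of holdover.
slips_per_day = time_error(SECONDS_PER_DAY) / FRAME_PERIOD_S

# Time error accumulated during a 500 ms outage, in nanoseconds.
error_500ms_ns = time_error(0.5) * 1e9

print(f"{slips_per_day:.1f} frame slips in 24 hours")
print(f"{error_500ms_ns:.0f} ns of drift in a 500 ms outage")
```

Under these assumptions the 24-hour figure lands just under 256 slips, consistent with the "approximately 255" in the text, while a 500 ms outage accumulates only a fraction of a microsecond of error.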

5.1.2 Multicast and unicast data applications

There are a few data-oriented applications that have more stringent requirements. Two examples are real-time distributed applications and trading applications.

5.1.2.1 Real-Time Distributed Applications

Distributed real-time applications typically depend upon fast communications between nodes. Protocols such as UDP or TCP provide adequate transport services. However, in case of a network outage, some applications can't retransmit. Sometimes the data received is obsolete after the time it would take to retransmit it, or the transmit side can’t buffer what has been sent. Application architects can design the applications to be resilient to some loss if the application requires it. This is commonly done in a middleware layer (footnote 6). Real-time applications are sensitive to loss, but software solutions can accommodate outages larger than 50 ms.

5.1.2.2 Trading Applications: the race to the best price

Automated order routing systems and the dawn of algorithmic trading mean that the window of optimum trading opportunity is increasingly measured in milliseconds. This is the world of zero packet loss (and big dollar loss). A 50 ms outage is not acceptable. Uptime is essential. If a single packet is lost, a transaction may not take place and money will be lost. Most market feeds are point-to-point T1s. Metro Ethernet providers may be used as backup to T1 lines. Banks and trading companies will constantly monitor the performance of their backup connections. However, these applications use networks other than Metro Ethernet.

5.2 Voice applications

Voice applications are very interactive. There are two broad categories: Voice over IP and Circuit Emulation over Packet.

5.2.1 Voice over IP

There are two classes of traffic: the voice signal itself (bearer traffic) and the out-of-band call-signaling traffic. If an outage occurs during a conversation, the end user may lose contact for the duration of the outage. If the outage lasts less than a few seconds, the call itself is not dropped. If the outage occurs during the call setup phase, it takes longer to set up the call, or the user might simply have to dial again. VoIP deployments over networks designed to converge in 800 ms or more are very common.

Footnote 6: See, for example, Real-Time Publish-Subscribe (RTPS) [2].

5.2.2 Circuit Emulation over Packet

Of all the applications we have seen so far, this one is the most challenging. Circuit Emulation Service over Packet (CESoP) is defined by the IETF, Metro Ethernet Forum, MPLS Forum, and ITU-T. The entire T1 (framing, signaling, and payload) is carried transparently across the Ethernet network. If a packet gets dropped (or excessively delayed) in the network, its content in the egress data stream is replaced with a configurable idle pattern, as shown in Figure 3.

fig3-50ms.jpg

When a lot of packets are lost, an alarm is sent towards the packet source and the destination will see all ones. In the case of unframed service, all ones is an alarm in itself. However, in the case of framed service, the framing is preserved and only the payload is replaced by FFs. As a result, the destination will not see an alarm at all. An outage of more than 50 ms (or even 200 ms) will not cause calls to be dropped. Mobile backhaul applications add a little twist to this equation. Mobile operators want an end-to-end delay of less than 10 ms so that phone calls are not dropped during tower site handoffs. This implies a de-jitter buffer size of less than 10 ms, which means a 10 ms network outage will cause an alarm. However, once the alarm is raised, nothing else should happen if the outage is less than 500 ms. From the user-experience perspective, there will be a glitch, more or less noticeable depending on the length of the outage. Note that running CES over IP allows more flexibility in terms of packet loss, and common implementations can accommodate a 500 ms outage.

5.3 Video applications

Video applications over data networks can have many different forms:

  • Video conferencing, which is real time, interactive, and bidirectional
  • Broadcast content (TVoDSL) to the living-room TV, which is near real-time, noninteractive, and unidirectional
  • Video on demand (VoD), which is non-real-time, interactive (VCR-like controls), and unidirectional
  • Near VoD, network personal video recorder (PVR)
  • Security applications – surveillance
  • Internet streaming to PC desktop
  • Broadcast contribution and TV production networks
  • Etc.

These applications have different requirements, and addressing all of them is beyond the scope of this paper. However, when people talk about video quality, they often use the example of the final touchdown of a Super Bowl game, the last penalty kick of soccer’s World Cup final, a brain surgeon using an HD video feed to stitch synapses on a patient located on another continent, or the president having a video conference with his generals to order (or not) the launch of a nuclear strike. We can imagine the consequences of a 500-ms glitch in these situations. “Did he say launch or not?” We can't say that these situations will never happen, but one can’t design a network based on these requirements either!

Nonetheless, let's look at the effect of a network outage on a video stream. Let's use a common MPEG-2 stream as the basis of the discussion. Video streams travel compressed over data networks. A data packet contains a certain amount of information describing a video frame. There are about 30 video frames (or images) per second. There are different kinds of video frames: I, B, and P frames. I, P, and B frames are assembled into a Group of Pictures (GOP). A GOP is typically bounded by I frames and is 12 to 15 frames long, but this can vary with frame rate, content complexity, and encoder implementation.

An I picture is a reference picture containing all the pixel information needed to represent the picture accurately. A P picture (also called a predicted picture) contains all the motion vectors needed to describe the new positions of the macroblocks, along with the difference data that must be added to those macroblocks. P pictures require approximately half the data of an I picture and are based on the previous picture (I or P). B pictures are based on past and future I and P pictures and are not derived from each other. Like P pictures, they contain vectors and difference data. They usually require about a quarter of the data of an I picture. Because video frames are linked to each other, the loss of consecutive packets translates into a two-dimensional effect, in space and time, as shown in Figure 4.

fig4-50ms.jpg

One of the most visible effects of a frame loss occurs when some data of an I frame is not received by the decoder. Typically, 20% of the packets of a video stream carry information used to construct an I frame. A network outage of less than 500 ms affects up to two consecutive GOPs. The user will see pixelation, but the video program resumes after the network recovers without the need for user intervention. It is important to note that the duration of the artifact on the video screen will be longer than the duration of the network outage itself. No matter how short the network outage is, as long as frames are lost, there is a possibility of seeing an artifact on the screen. Several schemes exist to compensate for packet loss; some of the most efficient are forward error correction, error repair, and live-live protection.
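The "up to two consecutive GOPs" statement follows from simple timing arithmetic: at 30 frames per second a 15-frame GOP lasts 500 ms, so any outage shorter than that can straddle at most one GOP boundary. The sketch below is a deliberately simplified model (the function name and parameters are hypothetical); it treats a GOP as a fixed time slice and ignores packetization, decoder concealment, and dependencies across GOP boundaries.

```python
def gops_touched(start_ms, outage_ms, fps=30, gop_frames=15):
    """Number of GOP time slices overlapped by an outage.

    start_ms is the outage start time measured from a GOP boundary;
    outage_ms must be at least 1.
    """
    gop_ms = gop_frames * 1000 // fps             # 500 ms with the defaults
    first = start_ms // gop_ms                    # GOP where the outage begins
    last = (start_ms + outage_ms - 1) // gop_ms   # GOP where it ends
    return last - first + 1


inside_one_gop = gops_touched(start_ms=0, outage_ms=50)
across_boundary = gops_touched(start_ms=480, outage_ms=50)

# Worst case over every possible start offset for a 499 ms outage.
worst_case_499 = max(gops_touched(s, 499) for s in range(500))
```

Even a 50 ms outage touches two GOPs when it happens to straddle a boundary, which is one reason the visible artifact can last longer than the outage itself.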

5.3.1 Forward Error Correction (FEC)

An FEC-capable client receives FEC repair packets and searches for missing Real-Time Transport Protocol (RTP) sequence numbers encountered across an FEC-protected block period of N packets. FEC protection periods are determined at the headend by the definition of the FEC block size. Missing RTP packets within the FEC block coverage are automatically corrected by the client, which uses the received FEC protection packets. Any missing RTP packets beyond FEC coverage are forwarded to an error repair function. FEC is a good way to improve the quality of experience, but it introduces overhead, latency, and consequently cost. The more we budget for overhead and latency, the longer the network outage we can absorb without seeing an image artifact. FEC can handle a network outage of 50 or 100 ms or more. FEC budget and network design go hand in hand. A 50 ms or 100 ms limit on the network as the only requirement will not prevent artifacts due to a network failure.
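The principle of block-based FEC can be illustrated with the simplest possible code: one XOR parity packet per block of N packets. This is an illustrative sketch only, not the scheme used by any particular video FEC product; the function names and packet contents are hypothetical, and a single parity packet can repair only one loss per block.

```python
def xor_bytes(a, b):
    """Byte-wise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def make_parity(block):
    """XOR all equal-length packets of a block into one repair packet."""
    parity = bytes(len(block[0]))  # all-zero bytes of the packet length
    for packet in block:
        parity = xor_bytes(parity, packet)
    return parity


def recover(received, parity):
    """Repair a block in which lost packets are marked as None.

    Returns the repaired block, or None when more than one packet is
    missing, i.e. the loss is beyond the coverage of this simple code.
    """
    missing = [i for i, p in enumerate(received) if p is None]
    if not missing:
        return list(received)
    if len(missing) > 1:
        return None  # hand off to an error repair function
    rebuilt = parity
    for packet in received:
        if packet is not None:
            rebuilt = xor_bytes(rebuilt, packet)
    repaired = list(received)
    repaired[missing[0]] = rebuilt
    return repaired


block = [b"pkt0", b"pkt1", b"pkt2", b"pkt3"]
parity = make_parity(block)
one_loss = recover([b"pkt0", None, b"pkt2", b"pkt3"], parity)
two_losses = recover([None, None, b"pkt2", b"pkt3"], parity)
```

With N = 4 the overhead is one extra packet in four, and the client must wait for the whole block before it can repair, which shows the overhead and latency trade-off mentioned above in miniature.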

5.3.2 Error Repair

The client waits for missing RTP sequence numbers. If packets have been dropped and remain uncorrected following FEC repair, the client requests retransmission of the packets from its designated VQE server (footnote 7). Before they are handed off to an MPEG demux, retransmitted packets are re-sequenced and de-jittered in the client's network. A single RTCP message may request the retransmission of multiple contiguous or non-contiguous packets.

fig5-50ms.jpg
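The gap detection performed by the client can be sketched as follows. RTP sequence numbers are 16 bits wide and wrap around, so the sketch works modulo 65536. The function names and the (start, length) run representation are hypothetical; the sketch does not model the actual RTCP message encoding, timers, or the retransmission server.

```python
SEQ_MOD = 1 << 16  # RTP sequence numbers are 16-bit and wrap around


def missing_sequence_numbers(received, first, count):
    """Sequence numbers absent from a window of `count` packets starting at `first`."""
    seen = set(received)
    window = [(first + i) % SEQ_MOD for i in range(count)]
    return [seq for seq in window if seq not in seen]


def group_contiguous(seqs):
    """Group missing numbers into (start, length) runs for a single request."""
    runs = []
    for seq in seqs:
        if runs and (runs[-1][0] + runs[-1][1]) % SEQ_MOD == seq:
            runs[-1] = (runs[-1][0], runs[-1][1] + 1)
        else:
            runs.append((seq, 1))
    return runs


# Packets that arrived, spanning the wrap from 65535 back to 0.
received = [65533, 65534, 1, 2, 5]
missing = missing_sequence_numbers(received, first=65533, count=9)
request = group_contiguous(missing)
```

Here one request covers two separate gaps, one of which spans the sequence-number wrap, which mirrors the statement that a single message may cover contiguous and non-contiguous packets.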

5.3.3 Live-live protection

In certain cases, such as head-end redundancy, a lossless delivery may be required. In this situation, we need to achieve protection against a single network failure of any length. A solution consists of sending two copies of the multicast video stream on two physically separated paths. The last core edge router or a VQE element receives the two copies but passes only one. Such a design protects against a single failure of any length and makes the 50 ms discussion irrelevant.
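The selection step at the receiving element can be sketched as a duplicate filter keyed on sequence number: whichever copy of a packet arrives first is passed, and the second copy is discarded. This is a minimal illustration with hypothetical names; a real element would bound the memory of seen packets, handle sequence-number wrap, and reorder within a small buffer.

```python
def live_live_merge(arrivals):
    """Pass the first copy of each packet seen on either path.

    arrivals is an iterable of (path, seq) tuples in arrival order.
    """
    seen = set()
    output = []
    for _path, seq in arrivals:
        if seq not in seen:
            seen.add(seq)
            output.append(seq)
    return output


# Path A fails after packet 2; path B keeps delivering every packet.
arrivals = [("A", 1), ("B", 1), ("A", 2), ("B", 2), ("B", 3), ("B", 4), ("B", 5)]
delivered = live_live_merge(arrivals)
```

The output stream is complete regardless of how long path A stays down, which is why convergence time stops mattering for this design as long as only one path fails.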

5.3.4 Summary of Video Applications

As we saw in the introduction to this section, there are numerous video applications, each with its own requirements. Some applications may require zero loss; others may accommodate some loss. A network outage of any length will cause packets to be lost, and a 50-ms or longer network outage can create visible video artifacts. The faster a network converges, the more satisfactory the user experience. However, aside from network convergence, there are other mechanisms that improve video delivery. These new tools should be built into the video delivery solution to improve the overall performance of the network. Examples of such tools are FEC, repair packets, live-live feeds, time-shifted streams, and so on. These mechanisms can correct the degradation of quality resulting from a network failure. Relying solely on a 50-ms convergence requirement is not likely to lead to the most satisfactory solution.

Footnote 7: A server located at the aggregation site that allows retransmission of missing packets. More information about Cisco VQE is available (link to document)

6. Conclusion

As the context in which 50 ms has been reflexively used has expanded, the definition of failure and recovery has become less meaningful. A common trap is to look at localized failure scenarios and claim to achieve “50 ms” for all cases. When we look at failures across the entire network, we see it is very difficult for a service provider to design a network that will accommodate 50-ms recovery for any possible failure. We have also established that the concept of recovery itself, when viewed network wide, is not well defined. Exactly when a network has recovered after a failure is open to debate. We have reviewed a set of applications perceived to be very sensitive to packet loss. We have established that, in most cases, these applications don't mandate a hard 50-ms figure. Most of the time, they can cope with a much longer outage. Examples of such applications are Voice over IP, time synchronization, and real-time distributed systems. We have also encountered examples of applications that will be affected by a 50-ms outage. Trading applications can’t accept any loss. For circuit emulation over Ethernet, a 10 ms loss could have an effect (alarms), but this doesn’t mean catastrophic consequences. Finally, packet loss will create video impairments. There are effective mechanisms beyond fast network convergence that can be used to complement a given network’s performance. Such mechanisms are FEC, VQE, live-live, and time-shifted streams. It is up to the service providers to balance the investment between fast network convergence and error correction based on the level of quality of experience they want to achieve. Service providers and vendors continue to strive to provide solutions that recover from failures as quickly as possible. However, trying to achieve an artificial goal of 50 ms is likely to affect the affordability, scalability, and flexibility of the solution. Deciding when we reach the right balance, using sound technical and economic justification, will serve the interest of both the provider and the consumer.

7. References

[1] Grover, Wayne, Mesh-Based Survivable Networks, Prentice Hall, ISBN-10: 0-13-494576-X
[2] NDDS and RTPS information: http://rti.com/resources.html

Last Updated on Saturday, 07 August 2010 05:52
 