IP Performance Measurement                                     C. Paasch
Internet-Draft                                                  R. Meyer
Intended status: Experimental                                S. Cheshire
Expires: January 12, 2023                                     O. Shapira
                                                              Apple Inc.
                                                               M. Mathis
                                                             Google, Inc
                                                           July 11, 2022

              Responsiveness under Working Conditions
                 draft-ietf-ippm-responsiveness-01

Abstract

For many years, a lack of responsiveness, variously called lag, latency, or bufferbloat, has been recognized as an unfortunate, but common, symptom in today's networks. Even after a decade of work on standardizing technical solutions, it remains a common problem for end users.

Everyone "knows" that it is "normal" for a video conference to have problems when somebody else at home is watching a 4K movie or uploading photos from their phone. However, there is no technical reason for this to be the case. In fact, various queue management solutions (fq_codel, cake, PIE) have solved the problem.

Our networks remain unresponsive, not from a lack of technical solutions, but rather from a lack of awareness of the problem and its solutions. We believe that creating a tool whose measurement matches people's everyday experience will create the necessary awareness, and result in a demand for products that solve the problem.

This document specifies the "RPM Test" for measuring responsiveness. It uses common protocols and mechanisms to measure user experience specifically when the network is under working conditions. The measurement is expressed as "Round-trips Per Minute" (RPM) and should be included with throughput (up and down) and idle latency as critical indicators of network quality.

Status of This Memo

This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts.

Paasch, et al.            Expires January 12, 2023              [Page 1]

Internet-Draft     Responsiveness under Working Conditions     July 2022
The list of current Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

This Internet-Draft will expire on January 12, 2023.

Copyright Notice

Copyright (c) 2022 IETF Trust and the persons identified as the document authors. All rights reserved.

This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (https://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Simplified BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Simplified BSD License.

Table of Contents

   1.  Introduction . . . . . . . . . . . . . . . . . . . . . . . .   3
     1.1.  Terminology . . . . . . . . . . . . . . . . . . . . . . .  3
   2.  Design Constraints . . . . . . . . . . . . . . . . . . . . .   4
   3.  Goals  . . . . . . . . . . . . . . . . . . . . . . . . . . .   6
   4.  Measuring Responsiveness Under Working Conditions  . . . . .   6
     4.1.  Working Conditions . . . . . . . . . . . . . . . . . . .   6
       4.1.1.  From single-flow to multi-flow . . . . . . . . . . .   7
       4.1.2.  Parallel vs Sequential Uplink and Downlink . . . . .   7
       4.1.3.  Reaching full link utilization . . . . . . . . . . .   8
       4.1.4.  Final "Working Conditions" Algorithm . . . . . . . .   8
     4.2.  Measuring Responsiveness . . . . . . . . . . . . . . . .  10
       4.2.1.  Aggregating the Measurements . . . . . . . . . . . .  11
   5.  Interpreting responsiveness results  . . . . . . . . . . . .  11
     5.1.  Elements influencing responsiveness  . . . . . . . . . .  11
       5.1.1.
Client side influence . . . . . . . . . .  12
       5.1.2.  Network influence . . . . . . . . . . . . . . . . .  12
       5.1.3.  Server side influence . . . . . . . . . . . . . . .  13
     5.2.  Root-causing Responsiveness  . . . . . . . . . . . . . .  13
   6.  RPM Test Server API  . . . . . . . . . . . . . . . . . . . .  13
   7.  RPM Test Server Discovery  . . . . . . . . . . . . . . . . .  15
   8.  Security Considerations  . . . . . . . . . . . . . . . . . .  16
   9.  IANA Considerations  . . . . . . . . . . . . . . . . . . . .  16
   10. Acknowledgments  . . . . . . . . . . . . . . . . . . . . . .  16
   11. Informative References . . . . . . . . . . . . . . . . . . .  16
   Appendix A.  Example Server Configuration  . . . . . . . . . . .  17
     A.1.  Apache Traffic Server  . . . . . . . . . . . . . . . . .  17
   Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . .  18

1.  Introduction

For many years, a lack of responsiveness, variously called lag, latency, or bufferbloat, has been recognized as an unfortunate, but common, symptom in today's networks [Bufferbloat]. Solutions like fq_codel [RFC8290] or PIE [RFC8033] have been standardized and are to some extent widely implemented. Nevertheless, people still suffer from bufferbloat.

Although significant, the impact on user experience can be transitory - that is, its effect is not always visible to the user. Whenever a network is actively being used at its full capacity, buffers can fill up and create latency for traffic. The duration of those full buffers may be brief: a medium-sized file transfer, like an email attachment or uploading photos, can create bursts of latency spikes. An example of this is lag occurring during a videoconference, where a connection is briefly shown as unstable. These short-lived disruptions make it hard to narrow down the cause.
We believe that it is necessary to create a standardized way to measure and express responsiveness. Existing network measurement tools could incorporate a responsiveness measurement into their set of metrics. Doing so would also raise awareness of the problem and would help establish a new expectation: that the standard measures of network quality should - in addition to throughput and idle latency - also include latency under load, or, as we prefer to call it, responsiveness under working conditions.

1.1.  Terminology

A word about the term "bufferbloat" - the undesirable latency that comes from a router or other network equipment buffering too much data. This document uses the term as a general description of bad latency, using more precise wording where warranted.

"Latency" is a poor measure of responsiveness, since it can be hard for the general public to understand. The units are unfamiliar ("what is a millisecond?") and counterintuitive ("100 msec - that sounds good - it's only a tenth of a second!").

Instead, we create the term "Responsiveness under working conditions" to make it clear that we are measuring all, not just idle, conditions, and use "round-trips per minute" as the metric. The advantage of round-trips per minute is two-fold: First, it is a metric where higher is better, which tends to be more intuitive for end users. Second, the values typically fall in the four-digit range, which is easy to read and compare. Finally, we abbreviate the measurement to "RPM", a wink to the "revolutions per minute" that we use for car engines.

This document defines an algorithm for the "RPM Test" that explicitly measures responsiveness under working conditions.

2.  Design Constraints

There are many challenges around measurements on the Internet.
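As a parenthetical illustration of the RPM metric defined in Section 1.1 (a sketch of the unit conversion only, not part of the test specification):

```python
def rpm(round_trip_seconds: float) -> float:
    """Convert a measured round-trip time into Round-trips Per Minute."""
    return 60.0 / round_trip_seconds

# 6 ms per round trip is 10000 RPM; 100 ms is only 600 RPM,
# illustrating the "higher is better" property of the metric.
print(round(rpm(0.006)))  # 10000
print(round(rpm(0.100)))  # 600
```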
They include the dynamic nature of the Internet, the diverse nature of the traffic, the large number of devices that affect traffic, and the difficulty of attaining appropriate measurement conditions.

Internet paths are changing all the time. Daily fluctuations in demand make the bottlenecks ebb and flow. To minimize the variability introduced by routing changes, it is best to keep the test duration relatively short.

TCP and UDP traffic, or traffic on ports 80 and 443, may take significantly different paths on the Internet and be subject to entirely different Quality of Service (QoS) treatment. A good test will use standard transport-layer traffic - typical for people's use of the network - that is subject to the transport's congestion control, which might reduce the traffic's rate and thus its buffering in the network.

Traditionally, one thinks of bufferbloat happening on the routers and switches of the Internet. However, the networking stacks of clients and servers can also have huge buffers. Data sitting in TCP sockets or waiting for the application to send or read causes artificial latency, and affects user experience the same way as "traditional" bufferbloat.

Finally, it is crucial to recognize that significant queueing only happens on entry to the lowest-capacity (or "bottleneck") hop on a network path. For any flow of data between two communicating devices, there is always one hop along the path where the capacity available to that flow at that hop is the lowest among all the hops of that flow's path at that moment in time. It is important to understand that the existence of a lowest-capacity hop on a network path is not itself a problem. In a heterogeneous network like the Internet, some hop along the path must necessarily have the lowest capacity for that path.
If that hop were to be improved to make it no longer the lowest-capacity hop, then some other hop would become the new lowest-capacity hop for that path. In this context a "bottleneck" should not be seen as a problem to be fixed, because any attempt to "fix" the bottleneck is futile - such a "fix" can never remove the existence of a bottleneck on a path; it just moves the bottleneck somewhere else.

Arguably, this heterogeneity of the Internet is one of its greatest strengths. Allowing individual technologies to evolve and improve at their own pace, without requiring the entire Internet to change in lock-step, has enabled enormous improvements over the years in technologies like DSL, cable modems, Ethernet, and Wi-Fi, each advancing independently as new developments became ready. As a result of this flexibility we have moved incrementally, one step at a time, from 56 kb/s dial-up modems in the 1990s to Gb/s home Internet service and Gb/s wireless connectivity today.

Note that in a shared datagram network, conditions do not remain static. The hop that is the current bottleneck may change from moment to moment. For example, changes in other traffic may result in changes to a flow's share of a given hop. A user moving around may cause the Wi-Fi transmission rate to vary widely, from a few Mb/s when far from the Access Point, all the way up to Gb/s or more when close to the Access Point.

Consequently, if we wish to enjoy the benefits of the Internet's great flexibility, we need software that embraces and celebrates this diversity and adapts intelligently to the varying conditions it encounters.

Because significant queueing only happens on entry to the bottleneck hop, the queue management at this critical hop of the path almost entirely determines the responsiveness of the entire flow.
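The relationship between queue occupancy and delay is simple arithmetic: data already queued at a hop must drain at that hop's capacity before a newly arriving packet is transmitted. A rough sketch with illustrative numbers (the function name and values are ours, not taken from this draft):

```python
def queueing_delay_seconds(queued_bytes: int, capacity_bits_per_s: float) -> float:
    """Time a newly arriving packet waits behind `queued_bytes` of
    standing queue at a hop draining at `capacity_bits_per_s`."""
    return queued_bytes * 8 / capacity_bits_per_s

# 5 MB of buffered data draining through a 20 Mb/s uplink adds
# two full seconds of latency to every packet behind it.
print(queueing_delay_seconds(5_000_000, 20e6))  # 2.0
```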
If the bottleneck hop's queue management algorithm allows an excessively large queue to form, this results in excessively large delays for packets sitting in that queue awaiting transmission, significantly degrading overall user experience.

In order to discover the depth of the buffer at the bottleneck hop, the RPM Test mimics normal network operations and data transfers, to cause this bottleneck buffer to fill to capacity, and then measures the resulting end-to-end latency under these operating conditions. A well managed bottleneck queue keeps its queue occupancy under control, resulting in consistently low round-trip time and consistently good responsiveness. A poorly managed bottleneck queue will not.

3.  Goals

The algorithm described here defines an RPM Test that serves as a good proxy for user experience. This means:

1.  Today's Internet traffic primarily uses HTTP/2 over TLS. Thus, the algorithm should use that protocol. As a side note: other types of traffic are gaining in popularity (HTTP/3) and/or are already being used widely (RTP). Traffic prioritization and QoS rules on the Internet may subject traffic to completely different paths: these could also be measured separately.

2.  The Internet is marked by the deployment of countless middleboxes like transparent TCP proxies or traffic prioritization for certain types of traffic. The RPM Test must take into account their effect on TCP-handshake [RFC0793], TLS-handshake, and request/response.

3.  The test result should be expressed in an intuitive, nontechnical form.

4.  Finally, to be useful to a wide audience, the measurement should finish within a short time frame. Our target is 20 seconds.

4.  Measuring Responsiveness Under Working Conditions

To make an accurate measurement, the algorithm must reliably put the network in a state that represents those "working conditions".
During this process, the algorithm measures the responsiveness of the network. The following explains how the former and the latter are achieved.

4.1.  Working Conditions

There are many different ways to define the state of "working conditions" under which to measure responsiveness. There is no one true answer to this question. It is a tradeoff between using realistic traffic patterns and pushing the network to its limits.

In this document we aim to generate a realistic traffic pattern by using standard HTTP transactions, while exploring the worst-case scenario by creating multiple such transactions and using very large data objects in them. This allows us to create a stable state of working conditions during which the network is used at nearly its full capacity, without generating DoS-like traffic patterns (e.g., intentional UDP flooding). This creates a realistic traffic mix representative of what a typical user's network experiences in normal operation.

Finally, as end-user usage of the network evolves to newer protocols and congestion control algorithms, it is important that the working conditions can also evolve to continuously represent a realistic traffic pattern.

4.1.1.  From single-flow to multi-flow

A single TCP connection may not be sufficient to reach the capacity of a path quickly. With a 4 MB receive window over a network with a 32 ms round-trip time, a single TCP connection can achieve up to 1 Gb/s throughput. For higher throughput and/or networks with higher round-trip time, TCP allows larger receive window sizes, up to 1 GB. For most applications there is little reason to open multiple parallel TCP connections in an attempt to achieve higher throughput.
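The 1 Gb/s figure above follows from the window-limited throughput bound: a connection can have at most one receive window of data in flight per round trip. A quick check of the arithmetic, using decimal megabytes (an illustrative sketch, not a normative formula):

```python
def window_limited_throughput_bps(rwnd_bytes: int, rtt_seconds: float) -> float:
    """Upper bound on one TCP connection's throughput: at most one
    receive window of data delivered per round-trip time."""
    return rwnd_bytes * 8 / rtt_seconds

# 4 MB receive window over a 32 ms round-trip path:
gbps = window_limited_throughput_bps(4_000_000, 0.032) / 1e9
print(gbps)  # ~1.0 Gb/s, matching the figure in the text
```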
However, it may take some time for a single TCP connection to ramp up to full speed, and one of the goals of the RPM Test is to quickly load the network to capacity, take its measurements, and then finish. Additionally, traditional loss-based TCP congestion control algorithms react aggressively to packet loss by reducing the congestion window. This reaction (intended by the protocol design) decreases the queueing within the network, making it harder to determine the depth of the bottleneck queue reliably.

The purpose of the RPM Test is not to productively move data across the network in a useful way, the way a normal application does. The purpose of the RPM Test is, as quickly as possible, to simulate a representative traffic load as if real applications were doing sustained data transfers, measure the resulting round-trip time occurring under those realistic conditions, and then end the test. Because of this, using multiple simultaneous parallel connections allows the RPM Test to complete its task more quickly, in a way that overall is less disruptive and less wasteful of network capacity than a test using a single TCP connection that would take longer to bring the bottleneck hop to a stable saturated state.

4.1.2.  Parallel vs Sequential Uplink and Downlink

Poor responsiveness can be caused by queues in either (or both) the upstream and the downstream direction. Furthermore, the two directions may differ significantly due to access link conditions (e.g., 5G downstream and LTE upstream) or routing changes within the ISPs. To measure responsiveness under working conditions, the algorithm must explore both directions.

One approach could be to measure responsiveness in the uplink and downlink in parallel. It would allow for a shorter test run-time.
However, a number of caveats come with measuring in parallel:

o  Half-duplex links may not permit simultaneous uplink and downlink traffic. This means the test might not reach the path's capacity in both directions at once and thus not expose all the potential sources of low responsiveness.

o  Debuggability of the results becomes harder: During parallel measurement it is impossible to differentiate whether the observed latency happens in the uplink or the downlink direction.

Thus, we recommend testing uplink and downlink sequentially. Parallel testing is considered a future extension.

4.1.3.  Reaching full link utilization

The RPM Test gradually increases the number of TCP connections and measures "goodput" - the sum of actual data transferred across all connections in a unit of time. When the goodput stops increasing, the network is being used at its full capacity. At this point we are creating the worst-case scenario within the limits of the realistic traffic pattern.

Throughput increases rapidly until the TCP connections complete their slow-start phase. At that point, throughput eventually stalls, often due to receive window limitations, particularly in cases of high network bandwidth, high network round-trip time, low receive window size, or a combination of all three. The only means to further increase throughput is by adding more TCP connections to the pool of load-generating connections. If new connections leave the throughput unchanged, full link utilization has been reached and - more importantly - the working conditions are stable.

4.1.4.  Final "Working Conditions" Algorithm

The following algorithm reaches the working conditions of a network by using HTTP/2 upload (POST) or download (GET) requests of infinitely large files. The algorithm is the same for upload and download and uses the same term "load-generating connection" for each. The actions of the algorithm take place at regular intervals.
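The ramp-up and plateau detection described in Section 4.1.3 can be sketched as follows. This is a simplified illustration, not the draft's normative algorithm: the names and the toy link model are ours, and the draft's actual stability check smooths goodput with a moving average over several intervals rather than comparing single intervals:

```python
def ramp_to_saturation(measure_goodput, tolerance=0.05, max_conns=16):
    """Add load-generating connections one per interval until aggregate
    goodput stops growing by more than `tolerance` (a fraction).
    Returns (connections, goodput) at the detected plateau."""
    conns = 1
    goodput = measure_goodput(conns)
    while conns < max_conns:
        conns += 1
        new_goodput = measure_goodput(conns)
        if new_goodput <= goodput * (1 + tolerance):
            return conns, new_goodput  # plateau: full link utilization
        goodput = new_goodput
    return conns, goodput

# Toy link model: each connection contributes up to 100 Mb/s,
# and the bottleneck link caps aggregate goodput at 350 Mb/s.
conns, goodput = ramp_to_saturation(lambda n: min(n * 100.0, 350.0))
print(conns, goodput)  # the 5th connection adds no goodput: 5 350.0
```

The tolerance exists because real goodput measurements fluctuate; a plateau is declared when growth falls below it, not only when goodput is exactly unchanged.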
For the current draft, the interval is defined as one second. Where:

o  i: The index of the current interval. The variable i is initialized to 0 when the algorithm begins and increases by one for each interval.

o  instantaneous aggregate goodput at interval p: The number of total bytes of data transferred within interval p, divided by the interval duration. If p is negative (i.e., a time interval logically prior to the start of the test, used in moving average calculations), the number of total bytes of data transferred within that interval is considered to be 0.

o  moving average aggregate goodput at interval p: The number of total bytes of data transferred within interval p and the three immediately preceding intervals, divided by four times the interval duration.

o  moving average stability during the period between intervals b and e: Whether or not, for all b<=x.

11.  Informative References

[RFC6335]  Cotton, M., Eggert, L., Touch, J., Westerlund, M., and S. Cheshire, "Internet Assigned Numbers Authority (IANA) Procedures for the Management of the Service Name and Transport Protocol Port Number Registry", BCP 165, RFC 6335, DOI 10.17487/RFC6335, August 2011.

[RFC6762]  Cheshire, S. and M. Krochmal, "Multicast DNS", RFC 6762, DOI 10.17487/RFC6762, February 2013.

[RFC6763]  Cheshire, S. and M. Krochmal, "DNS-Based Service Discovery", RFC 6763, DOI 10.17487/RFC6763, February 2013.

[RFC8033]  Pan, R., Natarajan, P., Baker, F., and G. White, "Proportional Integral Controller Enhanced (PIE): A Lightweight Control Scheme to Address the Bufferbloat Problem", RFC 8033, DOI 10.17487/RFC8033, February 2017.

[RFC8259]  Bray, T., Ed., "The JavaScript Object Notation (JSON) Data Interchange Format", STD 90, RFC 8259, DOI 10.17487/RFC8259, December 2017.
[RFC8290]  Hoeiland-Joergensen, T., McKenney, P., Taht, D., Gettys, J., and E. Dumazet, "The Flow Queue CoDel Packet Scheduler and Active Queue Management Algorithm", RFC 8290, DOI 10.17487/RFC8290, January 2018.

[RFC8446]  Rescorla, E., "The Transport Layer Security (TLS) Protocol Version 1.3", RFC 8446, DOI 10.17487/RFC8446, August 2018.

[RFC8766]  Cheshire, S., "Discovery Proxy for Multicast DNS-Based Service Discovery", RFC 8766, DOI 10.17487/RFC8766, June 2020.

Appendix A.  Example Server Configuration

This section shows fragments of sample server configurations to host a responsiveness measurement endpoint.

A.1.  Apache Traffic Server

Apache Traffic Server, starting at version 9.1.0, supports configuration as a responsiveness server. It requires the generator and statichit plugins. The sample remap configuration file then is:

   map https://nq.example.com/api/v1/config \
       http://localhost/ \
       @plugin=statichit.so \
       @pparam=--file-path=config.example.com.json \
       @pparam=--mime-type=application/json

   map https://nq.example.com/api/v1/large \
       http://localhost/cache/8589934592/ \
       @plugin=generator.so

   map https://nq.example.com/api/v1/small \
       http://localhost/cache/1/ \
       @plugin=generator.so

   map https://nq.example.com/api/v1/upload \
       http://localhost/ \
       @plugin=generator.so

Authors' Addresses

   Christoph Paasch
   Apple Inc.
   One Apple Park Way
   Cupertino, California 95014
   United States of America

   Email: cpaasch@apple.com

   Randall Meyer
   Apple Inc.
   One Apple Park Way
   Cupertino, California 95014
   United States of America

   Email: rrm@apple.com

   Stuart Cheshire
   Apple Inc.
   One Apple Park Way
   Cupertino, California 95014
   United States of America

   Email: cheshire@apple.com

   Omer Shapira
   Apple Inc.
   One Apple Park Way
   Cupertino, California 95014
   United States of America

   Email: oesh@apple.com

   Matt Mathis
   Google, Inc
   1600 Amphitheatre Parkway
   Mountain View, CA 94043
   United States of America

   Email: mattmathis@google.com