Minutes - Speech Services Control WG (speechsc)
Reported by Tom Taylor
Wednesday, July 17 at 0900-1130
===================================

Chairs: Eric Burger (eburger@showshore.com)
        David Oran (oran@cisco.com)

0900 - Agenda Bashing/Charter Review (Chairs)
=============================================

The proposed agenda was accepted.

0910 - Work Roadmap & Timeline (Chairs)
=======================================

Dave Oran presented. The charts are available at
http://www.ietf.org/proceedings/02jul/slides/speechsc-0/sld002.htm

Charter
-------

The Working Group is chartered. The name has changed from CATS to
SPEECHSC. The scope is initially limited to Automatic Speech
Recognition (ASR), Text To Speech (TTS), and Speaker Verification (SV).
This will expand later, after the group has demonstrated its ability to
meet deliverables. Scott Bradner (sob@harvard.edu) noted that there had
been concern within the IESG that the group is too narrowly focused; he
would be disappointed if the scope didn't expand.

There should be a strong bias toward protocol reuse. The group is to
coordinate with ETSI Aurora, ITU-T SG 16 (Question 15), W3C, and any
other interested groups that emerge.

Work Items
----------

Dave listed the milestones set by the charter. The Working Group is
already late on requirements publication, hence this is the highest
priority.

Timeline for Work Items
-----------------------

The Chairs would like to do Working Group Last Call on requirements by
early August. (Hence the meeting will focus on this.) They would like
to kick off work on protocol analysis immediately following Working
Group Last Call of requirements. (The meeting wrap-up will include a
discussion of ways and means.)

0930 - Discuss requirements document (draft-burger-speechsc-reqts-00.txt)
=========================================================================

This document, an update to draft-burger-cats-reqts-00.txt, was posted
a month ago. A small number of people generated a substantial number of
postings on the list. The Chairs wanted to take as long as necessary to
cover the open issues identified in the document and posted on the
list.

Open Issues
-----------

Identified in the reqts document:

(1) Means of detection of Speech Synthesis Markup Language (SSML).

Proposed resolution: require a content type header. Accepted. (An
illustrative sketch appears after issue (3) below.)

(2) Should control channels be long-lived?

There was only one comment on the list: allow, but do not require,
long-lived control channels. Question: does this mean requiring that
the control channels be set up in advance? The discussion distinguished
long-lived vs. session-based vs. on-demand control channel setup.
Long-lived: set up in advance. Session-based: (note "session" is
undefined). On-demand: per utterance. It was proposed that the protocol
should support the first two, and may allow on-demand setup. On-demand
raises design issues if support is stronger than MAY. Proposed
summation: there is agreement on session and something larger than
session, but there is some question of whether smaller-than-session
duration is needed.

(3) For parameters that persist across a session, allow setting on a
per-session basis?

The proposal is to allow session parameters. There was no discussion.
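For illustration, a minimal Python sketch of the issue (1) resolution:
a server front end dispatching on a content type header to distinguish
SSML from plain text. The header name and the media type string
"application/ssml+xml" are assumptions for the sketch, not agreed
protocol syntax.

    # Sketch only: classify synthesis input by content type header.
    # The media type value is an assumption, not agreed protocol syntax.

    def classify_synthesis_input(headers: dict) -> str:
        """Return 'ssml' or 'plain' based on the Content-Type header."""
        content_type = (headers.get("Content-Type", "text/plain")
                        .split(";")[0].strip().lower())
        if content_type == "application/ssml+xml":  # assumed SSML media type
            return "ssml"
        return "plain"

    # Example: classify_synthesis_input({"Content-Type": "application/ssml+xml"})
    # returns 'ssml'; anything else falls back to plain text handling.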
(4) Allow for speech markers, as specified for MRCP over RTSP?

Two comments on the list: Stephane Maes (smaes@us.ibm.com) stated that
speech markers are needed and must be efficient. Dan Burnett
(burnett@nuance.com) asked whether SSML was not adequate. The proposed
resolution is that SSML is a good initial hypothesis.

Discussants noted that we have to support markers in messaging. SSML is
acceptable for now. The protocol must provide an efficient mechanism
for reporting that a marker has been sensed. Stephane Maes noted that
SSML can reference audio files: you don't know at the beginning how
many files you are going to play. It was recognized that this is a
separate issue from markers. The proposed resolution was accepted.

(5) Should ASR support alternative grammar formats?

Stephane Maes said yes, we need that. Stephane added that we need an
extensibility mechanism, but not discovery. Dan Burnett agreed.
Stephane noted that we should differentiate between capability
discovery for resource management and capability discovery for control.
Dave Oran restated the conclusion: there is a need to discover the
capabilities of a given device, but this is not necessarily part of
this protocol. There was further discussion, but Dave suggested we read
RFC 2533 and then revisit this discussion. It may be a matter of
incorporating that protocol within this one, as SIP has done. Proposal:
the protocol must be able to explicitly signal grammar format and
support extensibility, but we will say nothing for now on capability
discovery.

(6) Is there a need for all the parameters specified for MRCP over
RTSP?

List comments: yes, we need to go beyond the W3C grammar, and we also
need extensibility (Maes, Burnett). Proposed resolution: yes. Moreover,
we need to be able to specify parameters on a per-session basis. The
exact set is to be decided as part of the protocol analysis and design
phase.

At this point there was some discussion of parameter setting beyond the
session and within a session. There SHOULD be a capability to reset
parameters within a session. It was noted that processor adjustment is
done per call, hence the protocol at least needs to allow adjustment
per call. There is also some need to transfer data between servers
(e.g. on background noise). Note: session and call are not necessarily
related concepts.

A question was raised on the handling of conferences (multiple
speakers). Dave Oran suggested a protocol requirement to recognize
different SSRCs in the RTP stream (an illustrative sketch appears after
issue (7) below). There is a problem here: a conference could have
multiple speakers associated with an SSRC.

(7) The scope of the requirements should go beyond ASR, TTS, and SV/SR
(Speaker Recognition).

Proposed resolution: not for now. Steve (***don't know E-mail address)
remarked that the main market is still for pre-recorded speech. The
Chairs responded that this is a solved problem, not something we need
to work on. However, we can recognize that it will be present. Text to
express this is requested.

Stephane Maes suggested that we need some requirement for extensibility
of scope. Dave Oran asked how, given such a requirement, one would
determine that the protocol meets it. He saw it as preferable to leave
this to the design stage. Text is requested for consideration, if this
avenue is to be pursued.

The question was raised whether DTMF is in scope. Dave Oran noted that
other mechanisms are available for handling DTMF. Eric Burger added
that in ASR, DTMF would be invisible to the protocol: it would be
specified in the grammar. For a DTMF server, use another protocol such
as Megaco. There was a suggestion that one might want conversion
between voice and DTMF. Eric responded that this was an application
function.
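For illustration, a minimal Python sketch related to the conference
discussion under issue (6): grouping incoming RTP packets by SSRC so
that each speaker's stream can be handled separately. The sketch
assumes raw RTP packets with the fixed 12-byte header (RFC 1889) and
ignores the CSRC list and header extensions; it does not address the
noted problem of several speakers mixed under one SSRC.

    # Sketch only: demultiplex raw RTP packets by their SSRC field.
    import struct
    from collections import defaultdict

    def ssrc_of(rtp_packet: bytes) -> int:
        """Extract the SSRC (bytes 8-11 of the fixed RTP header)."""
        return struct.unpack("!I", rtp_packet[8:12])[0]

    def demux_by_ssrc(packets):
        """Group packet payloads by SSRC (CSRC list and extensions ignored)."""
        streams = defaultdict(list)
        for pkt in packets:
            streams[ssrc_of(pkt)].append(pkt[12:])  # payload after fixed header
        return streams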
(8) Does the protocol have to cope with both parallel and serial
composition of servers?

Proposal: the charter limits topology. Serial chaining involves OPES
proxy issues. There was some discussion about cases associated with
wireless LAN. The Chairs' response was that OPES issues are matters of
delegation, trust, security, and traceability. We would have to
convince the IESG that these issues do not arise or are well met in
this case. It would represent a major expansion of work to generate the
required analysis. See RFC 3238 for more information. Compromise: note
this as an area of research and possible future enhancement.

(9) Does the requirement not to redo RTSP or SIP/msuri restrict the
ability to use markers and other playout options like pacing?

Proposed resolution: reword the requirement to clarify that the intent
is not to impose such a restriction.

(10) Clarify the OPES requirement.

Proposed resolution: add a reference to RFC 3238. The intent is that
the client side of the protocol will operate on behalf of one user.
Stephane Maes will supply text.

(11) Load balancing.

The Chairs noted that the current text captured the outcome of lengthy
discussions. The requirements must not preclude load balancing but also
must not require it. The general feeling was that it is not a fruitful
area of effort.

(12) Must be able to control language and prosody for plain text.

Proposed: this is a matter of clarification: SSML provides the desired
control.

(13) Need "full control" over the TTS engine (Maes). VCR and other
fine-grained controls should be lower priority (Burnett).

Dan Burnett clarified: VCR controls are audio controls, not TTS
controls. It was agreed that such controls are needed, but they are not
a high priority for TTS applications. The counter-argument was that we
have the analogue in text operations: e.g. skip paragraph, go back to
previous page. Stephane's point was that real-time controls are needed,
and he is not sure why we would specifically call them out as lower
priority. This issue is one for the list to consider. We need a more
detailed explication of control requirements. Note that there is a
problem of interaction with SSML. There is also the question of what
kind of units to skip ahead by, for instance: seconds, paragraphs, ...

(14) Must handle prompting, recording, possibly utterance verification,
and retraining, in addition to recording for analysis (Maes).

Proposed resolution: design for extensibility, but no specific
requirements in the protocol other than for recording for now.

(15) Grammar sharing.

The Chairs proposed to adopt the Burnett phrasing of the requirements:
(i) a server implementation needs to be able to store large grammars
originally provided by the client, and (ii) we need the ability within
the protocol to reference grammars already known to the server (e.g.
built in). (An illustrative sketch appears after issue (16) below.)

Dave saw this as a name space issue. The distinction between globally
unique and well-known was noted, but seen as a design issue. The
question of control of grammar use was raised. The Chairs suggested
that this is a matter of passing grammars only to trusted entities.
There was a suggestion that (i) is a matter of cache control. There is
the issue of who can use which grammar, but the meeting agreed that
this is outside of scope. Discovery of grammars is also out of scope.
It was agreed that the protocol must not preclude grammar sharing
across sessions. Dan Burnett is to supply text.

(16) Need to cover speaker enrollment, identification, and
classification, as well as recognition, as part of SV. Multiple methods
are needed.

Resolution: will add this to the requirements. Dan Burnett is to
provide more details.
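For illustration, a minimal Python sketch of the two grammar cases
agreed under issue (15): (i) a large grammar supplied inline by the
client and cached by the server, and (ii) a reference to a grammar
already known to the server (e.g. built in). The field names and the
"builtin:" URI form are assumptions for the sketch; the name space
design is left open above.

    # Sketch only: a client-side handle for either an inline grammar or a
    # reference to a grammar the server already knows.
    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class GrammarRef:
        uri: Optional[str] = None      # e.g. "builtin:grammar/digits" (hypothetical)
        inline: Optional[str] = None   # grammar text uploaded by the client
        content_type: Optional[str] = None

        def is_reference(self) -> bool:
            """True if the grammar is named by reference rather than carried inline."""
            return self.uri is not None

    # Examples (hypothetical identifiers):
    #   GrammarRef(uri="builtin:grammar/digits")
    #   GrammarRef(inline="<grammar>...</grammar>", content_type="application/grammar+xml")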
(17) Why a requirement on cross-utterance state?

Dan Burnett explained: he wants to make sure the implementation option
remains open. Hence his concern is that there be no requirement that
cross-utterance information be held only in the client. Stephane saw
this as an example of a number of cases where extensibility will be
needed. Dave Oran suggested we need a way to express in the protocol
that some barrier has been crossed and resynchronization is needed.
Looking at it another way: we need to be able to indicate that
different transactions, not necessarily sequential, are correlated.
Stephane suggested we add to this that the specific kind of correlation
is proprietary. Following on, it is important that the server be able
to give a result and say what context it applies to.

(19) Need simultaneous performance of multiple functions on the same
streams.

The meeting agreed to add the requirement but not to consider parallel
decomposition for now. (It could be happening behind the scenes, due to
OPES issues.) Stephane wondered whether we always assume the output of
an engine goes back to the issuer of the command. The Chairs' answer
was "yes", on security grounds: there are too many hacking scenarios
otherwise. It was noted that the security section needs expansion. It
should distinguish between requirements on the protocol (being put
together in this document) and requirements on the system (not to be
documented).

Other agenda points
===================

The requirements discussion took all the time available, so the
intervening points of the agenda were not covered.

1115 - Wrap-up and next steps
=============================

The intent is to reissue the requirements draft by July 27. The group
would aim for Working Group Last Call by early August, with text going
to the IESG by the end of August.

Issue: do use cases go into the requirements, or will they just be used
as a guide? Stephane Maes proposed a short summary in the requirements,
but mainly using them as a guide.

Steve asked what the group would do about discovery and resource
management. Dave Oran pointed out that this is a generic problem for
client-server protocols. He suggested just leaving it to system
architects. This implies that clients and servers are limited in
applicability by the discovery mechanisms they implement.

The list is now at speechsc@ietf.org.

Note 1: the posted IETF agenda still has this as a "CATS" BoF. We are
in fact an approved WG, and are called SPEECHSC, as we previously
reported to the mailing list.

Note 2: the mrcp@showshore.com mailing list will be decommissioned
immediately following this IETF. PLEASE subscribe to the
speechsc@ietf.org mailing list as soon as convenient.