Små ljud av stor betydelse

Små ljud av stor betydelse - prosodisk manipulering av syntetiska backchannels på svenska
Minor sounds of major importance - prosodic manipulation of synthetic backchannels in Swedish

Åsa Wallers

Supervisors:
Jens Edlund, Department of Speech, Music and Hearing, KTH
Görel Sandström, Department of Philosophy and Linguistics, Umeå University

Degree project in cognitive science, Cognitive Science Programme, Umeå University, autumn term 2005
Department of Psychology, Umeå

MA Dissertation submitted in partial fulfillment of the requirements for award of the Degree of Master of Arts in Cognitive Science at Umeå University

ABSTRACT

Prosody plays a major role in human speech and is used to express emotions and attitudes in addition to the lexical words. This study examines whether prosody can be used to express attitudes with backchannels, such as /oh/, /m/ and /uhu/. Two monosyllabic backchannels, /a/ and /m/, were chosen, created synthetically and manipulated systematically with regard to the prosodic properties pitch, peak position, and duration. Each stimulus was put in a context, and the participants' task was to choose, among five answer alternatives, a suitable interpretation of the backchannel. The results show that the meaning of a backchannel changes mainly with the peak position property, but interaction effects with pitch and duration could also be seen. Some stimuli received fairly unambiguous results, which indicates that they sound natural and are easy to interpret. This is a positive result for continued research on the use of backchannels in spoken dialogue systems.

SAMMANFATTNING (SUMMARY)

Prosody is of great importance in human speech, and is used to express emotions and attitudes in addition to the lexical words.
This study addresses the possibility of using prosody to express different attitudes by means of so-called backchannels, or feedback words (returord), such as /m/, /a/, /aha/ and /mhm/. Two monosyllabic backchannels, /a/ and /m/, were selected, created synthetically and manipulated systematically with respect to the prosodic properties fundamental frequency, position of the fundamental frequency peak, and duration. Each stimulus was placed in a context, and the participants judged, from five answer alternatives, what they considered the backchannel to mean in that context. The results showed that the meaning of a backchannel changed mainly with the position of the fundamental frequency peak, but interaction effects with fundamental frequency and duration could also be demonstrated. Some stimuli received unambiguous answers, which suggests that they sound natural and are easy to interpret. This is seen as positive for continued research on the use of backchannels in dialogue systems.

TABLE OF CONTENTS

CHAPTER 1: BACKGROUND
    PAPER OUTLINE
CHAPTER 2: INTRODUCTION
    BACKCHANNELS
        Summary
    PROSODY
        Prosody and backchannels
        Summary
    PRAGMATICS
        Conversational structure
        Levels of grounding
        Summary
    SPEECH PRODUCING SYSTEMS
        Primitive dialogue systems
        Spoken dialogue systems
        Components
        A note on written language bias
        Summary
    OBJECT OF STUDY
CHAPTER 3: FEASIBILITY STUDY
    MATERIAL
    METHOD
    RESULTS AND DISCUSSION
CHAPTER 4: MAIN EXPERIMENT
    METHOD
    MATERIAL
    PARTICIPANTS
    PROCEDURE
CHAPTER 5: RESULTS
    STATISTICS
    MAIN EFFECTS
        Peak position
        Peak height
        Duration
    INTERACTION EFFECTS
        Position and height combined
        Position and duration combined
        Three-way combination
CHAPTER 6: DISCUSSION
    PARTICIPANTS' REPORTS
    LEXICAL DIFFERENCES
    SOME STIMULI IN CLOSE-UP
    LEVELS OF GROUNDING REVISITED
    PRACTICAL USAGE
CHAPTER 7: ADDITIONAL REFLECTIONS
CHAPTER 8: SUMMARY AND CONCLUSION
CHAPTER 9: FUTURE WORK
REFERENCES
APPENDICES

LIST OF FIGURES

FIGURE 2-1: AN ILLUSTRATION OF THE SDS FUNCTIONALITY
FIGURE 3-1: F0 PLOT FOR THE LEVEL 3 BACKCHANNEL JAHA.
THE LIGHT AREA HIGHLIGHTS THE DURATION OF THE BACKCHANNEL
FIGURE 3-2: F0 PLOT FOR THE LEVEL 4 BACKCHANNEL JAHA. THE LIGHT AREA HIGHLIGHTS THE DURATION OF THE BACKCHANNEL
FIGURE 4-1: AN EXAMPLE FROM THE WAVESURFER ENVIRONMENT, A SHORT /A/ WITH AN EARLY HIGH PEAK IS SHOWN
FIGURE 4-2: A SCHEMA FOR THE PROSODIC PROPERTIES OF THE SHORT STIMULI, WHERE THE HIGH PEAKS ARE AT 160 HZ AND THE LOW AT 130 HZ
FIGURE 4-3: A SCHEMA FOR THE PROSODIC PROPERTIES OF THE LONG STIMULI, WITH PEAK PLATEAUS LASTING 100 MS. THE HIGH PEAKS ARE AT 160 HZ AND THE LOW AT 130 HZ
FIGURE 5-1: THE DISTRIBUTION OF VOTES FOR /A/ AND /M/ OVER THE FIVE ANSWER TYPES
FIGURE 5-2: THE DISTRIBUTION OF VOTES FOR THE BACKCHANNEL /A/ OVER THE FIVE ANSWER TYPES FOR EARLY, MID AND LATE PEAK POSITIONS, RESPECTIVELY
FIGURE 5-3: THE DISTRIBUTION OF VOTES FOR THE BACKCHANNEL /M/ OVER THE FIVE ANSWER TYPES FOR EARLY, MID AND LATE PEAK POSITIONS, RESPECTIVELY
FIGURE 5-4: THE DISTRIBUTION OF VOTES FOR THE BACKCHANNEL /A/ OVER THE FIVE ANSWER TYPES FOR HIGH AND LOW PEAK HEIGHT, RESPECTIVELY
FIGURE 5-5: THE DISTRIBUTION OF VOTES FOR THE BACKCHANNEL /M/ OVER THE FIVE ANSWER TYPES FOR HIGH AND LOW PEAK HEIGHT, RESPECTIVELY
FIGURE 5-6: THE DISTRIBUTION OF VOTES FOR THE BACKCHANNEL /A/ OVER THE FIVE ANSWER TYPES FOR LONG AND SHORT DURATION, RESPECTIVELY
FIGURE 5-7: THE DISTRIBUTION OF VOTES FOR THE BACKCHANNEL /M/ OVER THE FIVE ANSWER TYPES FOR LONG AND SHORT DURATION, RESPECTIVELY
FIGURE 5-8: THE DISTRIBUTION OF VOTES FOR THE BACKCHANNEL /A/ OF THE COMBINATION EARLY-HIGH PEAK, AND MID-HIGH PEAK, RESPECTIVELY
FIGURE 5-9: THE DISTRIBUTION OF VOTES FOR THE BACKCHANNEL /A/ OF THE COMBINATION EARLY-LOW PEAK, AND MID-LOW PEAK, RESPECTIVELY
FIGURE 5-10: THE DISTRIBUTION OF VOTES FOR THE BACKCHANNEL /A/ OF THE COMBINATION LATE-HIGH PEAK, AND LATE-LOW PEAK, RESPECTIVELY
FIGURE 5-11: THE DISTRIBUTION OF VOTES FOR THE BACKCHANNEL /M/ OF THE COMBINATION EARLY-HIGH PEAK, AND MID-HIGH PEAK, RESPECTIVELY
FIGURE 5-12: THE DISTRIBUTION OF VOTES FOR THE BACKCHANNEL /M/ OF THE COMBINATION EARLY-LOW PEAK, AND MID-LOW PEAK, RESPECTIVELY
FIGURE 5-13: THE DISTRIBUTION OF VOTES FOR THE BACKCHANNEL /M/ OF THE COMBINATION LATE-HIGH PEAK, AND LATE-LOW PEAK, RESPECTIVELY
FIGURE 5-14: THE DISTRIBUTION OF VOTES FOR THE BACKCHANNEL /A/ OF THE COMBINATION EARLY-LONG, EARLY-SHORT, MID-LONG, AND MID-SHORT, RESPECTIVELY
FIGURE 5-15: THE DISTRIBUTION OF VOTES FOR THE BACKCHANNEL /M/ OF THE COMBINATION LATE-LONG AND LATE-SHORT, RESPECTIVELY
FIGURE 7-1: THE DISTRIBUTION OF VOTES FOR /A/ IN THE TWO CLUSTER-CATEGORIES SURPRISED AND NEUTRAL
FIGURE 7-2: THE DISTRIBUTION OF VOTES FOR /M/ IN THE TWO CLUSTER-CATEGORIES SURPRISED AND NEUTRAL

LIST OF TABLES

TABLE 3-1: THE FIVE MOST COMMON BACKCHANNELS IN THE FEASIBILITY STUDY MATERIAL, WITH CORRESPONDING PERCENTAGES OF OCCURRENCES AND AVERAGE LENGTH OF BACKCHANNEL
TABLE 4-1: THE PARAPHRASES USED IN DECIDING THE MEANING OF THE BACKCHANNELS
TABLE 5-1: GOODNESS-OF-FIT STATISTICS FOR /A/, N=
TABLE 5-2: GOODNESS-OF-FIT STATISTICS FOR /M/, N=
TABLE 5-3: MEASURES OF ASSOCIATION FOR BOTH /A/ AND /M/

ACKNOWLEDGEMENTS

First, I would like to thank my supervisor Jens Edlund, KTH, for his valuable comments, discussions and test setups. I would also like to thank my internal supervisor Görel Sandström, Umeå University, for adding another perspective to the study. In addition, I would like to thank Mattias Heldner, KTH, for valuable help with the statistics. Finally, I would like to thank Rolf Carlson, KTH, for his comments on the report, and my opponent Samuel Munkstedt, KTH, for reviewing the final paper.

CHAPTER 1: BACKGROUND

Spoken dialogue among humans is a very intricate and fine-tuned process which puts high demands on the participants' ability to perceive, produce and adjust inputs and outputs according to the flow of the dialogue, as well as to the context and the environment.
In a conversation, the participants take turns talking, and the speaker transition seems, for the most part, to be a very smooth interaction in which the speech does not overlap notably (Levinson, 1983). This is no small achievement, considering that there is no fixed protocol to follow and no pre-decided order in which to speak. What, then, makes conversation flow so smoothly among humans? There have been theories of syntactic completion, of specific prosodic patterns, of attitudinal completion, and more. The concept of turn-taking in spoken dialogue is a major field of study in (psycho)linguistic research, and is too wide an issue to be addressed in this thesis (but see section for a short discussion).

Today, computerized systems using speech technology are becoming increasingly good at analyzing human speech online (in real time) and at giving suitable responses. Efforts have been made to make systems give properly timed feedback, but there are great difficulties in imitating and replicating the split-second timing humans show when taking turns in spoken dialogue. Many researchers working on the development of spoken dialogue systems have shown interest in prosodic features when trying to make the system handle the turns of the conversation properly (Edlund & Heldner, 2005; Hirschberg, 2002; Nöth et al., 2002; Shriberg et al., 1998; Swerts & Ostendorf, 1997).

One of the starting points for this thesis is the paper by Edlund, House and Skantze (2005), where it was found that prosody in monosyllabic words, used in clarification ellipses, can create different meanings depending on the fundamental frequency (F0) pattern. Their results show that different F0 peak positions can be mapped to meanings corresponding to different LEVELS OF GROUNDING (described in section 2.3.2). This thesis will use their findings and extend their research to a specific type of utterance: BACKCHANNELS (described in section 2.1).
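
As a rough illustration of what extending the F0 peak-position manipulation to backchannels involves, the sketch below enumerates a stimulus set from the properties named in the abstract and in the captions of Figures 4-2 and 4-3: two backchannels (/a/ and /m/), three peak positions, two peak heights (160 Hz and 130 Hz) and two durations. It assumes, without confirmation from the text so far, that the factors are fully crossed; the class and function names are hypothetical and chosen only for this example.

```python
from dataclasses import dataclass
from itertools import product
from typing import List

# Factor levels taken from the abstract and the figure captions.
SEGMENTS = ("a", "m")
PEAK_POSITIONS = ("early", "mid", "late")
PEAK_HEIGHTS_HZ = {"high": 160, "low": 130}
DURATIONS = ("short", "long")


@dataclass(frozen=True)
class Stimulus:
    """One synthetic backchannel (hypothetical representation)."""
    segment: str
    peak_position: str
    peak_height: str
    duration: str

    @property
    def peak_hz(self) -> int:
        # F0 value at the peak, in Hz.
        return PEAK_HEIGHTS_HZ[self.peak_height]

    @property
    def label(self) -> str:
        return f"{self.segment}-{self.peak_position}-{self.peak_height}-{self.duration}"


def build_design() -> List[Stimulus]:
    # Assumption: every combination of factor levels is used (full crossing).
    return [
        Stimulus(s, p, h, d)
        for s, p, h, d in product(SEGMENTS, PEAK_POSITIONS, PEAK_HEIGHTS_HZ, DURATIONS)
    ]


design = build_design()
```

Under the full-crossing assumption this gives 2 x 3 x 2 x 2 combinations; the stimulus set actually used is described in chapter 4.
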
Another starting point is to make a contribution to speech technology research, and to investigate the possibility of using backchannels in dialogue systems. The approach here is to use a human-human dialogue metaphor to investigate various aspects of human speech that could be incorporated into computerized dialogue systems. The backchannels used in this study will be created synthetically and evaluated with regard to prosody.

Paper outline

To be able to conduct the proposed experiments, background information from various fields of research is required. The concept of backchannels will be described in section 2.1. Prosody in general and in connection to backchannels will be described in section 2.2. Some conversational structure and the use of language will be described in section 2.3. The basic structure and functionality of speech producing systems will be described in section 2.4. Finally, the object of the study will be presented in section 2.5. A feasibility study has been conducted to determine what material should be used, and is described in chapter 3. The main experiment will be described in chapter 4. The results of the experiment will be presented in chapter 5. The discussion of the results will be presented in chapter 6. A post-experimental reorganization of the results, and implications thereof, will be presented in chapter 7. Finally, summary and conclusions will be presented in chapter 8, and some suggestions for future work will be discussed in chapter 9.

CHAPTER 2: INTRODUCTION

2.1. Backchannels

In human spoken dialogue, it has been shown that people use a large amount of feedback (Allwood, 1987). The feedback can consist of words, phrases, hand gestures, head nods etc. One type of feedback is made up of small utterances, such as monosyllabic words or noises. These utterances are known as backchannels (Gumperz, 1982).
They are not considered to be attempts to take the conversational floor, but are rather thought of as means to indicate that the channel is still open, that communication is still working, and that the speaker can go on talking. Backchannels can also be used to indicate disagreement or confusion, but are still not (for the most part) an attempt to take the floor. One definition, provided by Ward (1996), states that backchannel feedback:

1. responds directly to the content of an utterance of the speaker,
2. is optional, and
3. does not require acknowledgment by the speaker.

According to Ward, the first characteristic rules out mere grunts, which (he claims) often seem to emphasize the speaker's previous utterance. In a later publication (Ward & Tsukahara, 2000) grunts are not mentioned, but an explanation of the first characteristic states that it rules out "post-completion vocalization" (Ward & Tsukahara, 2000, p. 1182), i.e. feedback that a speaker adds after finishing the original statement, or self-feedback. The first characteristic is also said to rule out feedback which occurs several seconds after the previous utterance (Ward, 1996). The second characteristic is said to rule out direct answers to questions, even if they are just grunts. The third characteristic is said to rule out questions like "huh?" (see footnote 1). It is also said to rule out grunts which become full utterances (where the speaker continues talking) (Ward, 1996).

Footnote 1: However, one might come across backchannels that clearly indicate confusion even though they are not direct questions, but merely grunts. This type of backchannel may not have to be acknowledged by the speaker, but probably should be in order for the communication to progress smoothly. Upon hearing a disagreeing or confused backchannel, the speaker therefore has the choice either to ignore it, or to go back and further explain what was previously said.
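
Ward's three characteristics, together with the cases each is said to rule out, can be read as a simple filter over candidate listener utterances. The sketch below is one possible encoding, not Ward's own procedure: the field names are hypothetical annotations invented for this example, and the 2.0-second latency limit is a placeholder for "several seconds", which the definition does not quantify. The claim about mere grunts under the first characteristic is left out because it is not operationalized in the text.

```python
from dataclasses import dataclass

# Assumption: the source only says "several seconds"; 2.0 s is a placeholder.
MAX_LATENCY_S = 2.0


@dataclass(frozen=True)
class Feedback:
    """Hypothetical annotation of one short utterance."""
    follows_own_utterance: bool          # self-feedback after the producer's own statement
    latency_s: float                     # delay after the utterance it responds to
    answers_question: bool               # direct answer to a question
    is_question: bool                    # e.g. "huh?"
    continues_into_full_utterance: bool  # the producer keeps talking


def is_backchannel(fb: Feedback) -> bool:
    # 1. Responds directly to the content of the speaker's utterance:
    #    excludes self-feedback and feedback that comes too late.
    responds_directly = (not fb.follows_own_utterance) and fb.latency_s < MAX_LATENCY_S
    # 2. Is optional: excludes direct answers to questions.
    is_optional = not fb.answers_question
    # 3. Does not require acknowledgment: excludes questions and
    #    grunts that grow into full utterances.
    needs_no_ack = not (fb.is_question or fb.continues_into_full_utterance)
    return responds_directly and is_optional and needs_no_ack


continuer = Feedback(False, 0.3, False, False, False)
huh = Feedback(False, 0.3, False, True, False)
```

Here `is_backchannel(continuer)` holds, while `is_backchannel(huh)` does not, because a question calls for acknowledgment. The confused grunt discussed in footnote 1 would pass this filter, which matches the observation that such backchannels need not, but probably should, be acknowledged.
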
Backchannels have not, with a few exceptions, been included in traditional linguistic research. Since backchannels lack meaning in the conventional dictionary sense of the word, they do not fit into that research. In addition, they are not incorporated into clausal structures, which has led to their exclusion from recent research (Gardner, 2001).

Backchannels can be grouped into different categories based on their function. Jurafsky et al. (1998) use four types of utterances (continuers, assessments, incipient speakership, and agreements) as subtypes of backchannels, and note that they strongly overlap in their lexical realization and are distinguished mainly by their prosodic features (further discussed in section 2.2). The most common kind of backchannel is the CONTINUER, a short utterance which indicates that it is okay for the speaker to go on talking (Jurafsky et al., 1998). Gardner (2001) extends the list of backchannel subtypes to eight (continuers, acknowledgements, news markers, change-of-activity tokens, assessments, brief questions, collaborative completions, and non-verbal vocalizations) and discusses four of these (continuers, acknowledgements, news markers, change-of-activity tokens) as RESPONSE TOKENS. Response tokens, according to Gardner, are examples of action types of a non-primary speaker (or current listener) in interactive talk, and demonstrate the evidence of the stance that the recipient in the talk is taking at that moment (Gardner, 2001, p. 3). The backchannels I have studied in this paper can be thought of as response tokens, but I will continue referring to them simply as backchannels.

Summary

In this section, I have presented the major topic of this thesis, backchannels.
Backchannels are very common in spoken dialogue among humans, and seem to hold a lot of information about how the listener perceives the information delivered by the speaker, the listener's attitudes and beliefs, as well as the status of other processes involved in the dialogue.

2.2. Prosody

Traditionally, linguistic research has been conducted on written material, with lexical properties and syntactic structure as the main focus (Linell, 1982). However, when spoken language is of interest (as in research on spoken dialogue systems), lexical properties alone are seldom a good measure of how the language is used and understood. Spoken language has unique properties not found in written language, one of the more important being prosody, or the melody of the spoken language. Prosody consists of the phonetic parameters duration (perceived as length), intensity (perceived as loudness) and pitch (mainly the fundamental frequency, F0) (Kent & Read, 2002). Prosody is one of the keys to resolving ambiguity, and to adding emotions and attitudes to the semantic content of the spoken words. Fodor (2002) argues that prosody is present even in silent reading, i.e. that prosody is mentally projected by readers onto the written word string, and that it should therefore not be excluded even from the text-based studies of linguistic research.

Footnote 2: Gardner (2001) notes that the concept of backchannels is a broad notion, that a wide range of functionally very varied tokens is covered by the term, and that the differences can easily be obscured. This is probably true, but I will not engage further in the discussion of distinguishing one type of backchannel from another, since my study only deals with a certain type of backchannel.

The Scandinavian languages are characterized by their word accents, a property that occurs in addition to word stress in most dialects of Danish, Norwegian and Swedish (Bruce & Hermans, 1999).
Word accent is important for the lexical meaning in a couple of hundred polysyllabic word pairs, for example anden [ándɛn] ("the duck") and anden [àndɛn] ("the spirit"), and the difference can be hard for non-native speakers to hear (Elert, 1989). Stress in Swedish is usually put on the first syllable of a word, but this rule is far from without exceptions. In some word pairs the lexical meaning changes depending on the position of the stress, as in formel ['fɔrmɛl] ("formula") and formell [fɔr'mɛl] ("formal") (Elert, 1989).

Bruce (1998) writes that prosody has a number of linguistic and communicative functions such as prominence, grouping, and a number of other discourse and dialogue functions. However, Bruce's work is mainly aimed at prominence and grouping, while the interest of this study is the discourse and dialogue functions.

Prosody and backchannels

Most of the research done in the area of prosodic analysis and backchannels has been aimed at detecting which features of the prosodic signal might elicit backchannels from the listener. Regions of low pitch seem to have a backchannel-eliciting effect, and these regions also often occur at points where the speaker considers the information to be transmitted (also see the discussion of TRPs in section 2.3.1). This can be seen as the speaker saying "I'm done with that thought, did you follow?" (Ward & Tsukahara, 2000).

When it comes to prosodic properties within backchannels, Allwood (1987) notes that prosodic modification is primarily used to connect attitudes and emotions to basic communicative functions (compare the levels of grounding discussed in section 2.3.2). Allwood also refers to an as yet unpublished article and writes that by adding prosody to backchannels, a number of attitudes can be expressed (Allwood, 1987).
Cerrato (2005), for example, studied the prosody of the words ja in Swedish and sí in Italian (both meaning "yes"), and found that by using a flat F0 curve on ja the speaker indicated that he/she wanted to continue talking, while a rising F0 indicated that the other person could talk.

Summary

In this section, I have introduced the other main topic of this thesis, prosody. Prosody is a way for the speaker to add emotions and attitudes to the spoken words, and to bring clarity to ambiguous words. The object of this study involves manipulating some of the prosodic properties to elicit different responses from the listener.

2.3. Pragmatics

Pragmatics can, in short, be describ