
Statistics in musicology

Interdisciplinary Statistics

STATISTICS IN MUSICOLOGY

Jan Beran

Chapman & Hall/CRC, A CRC Press Company
Boca Raton  London  New York  Washington, D.C.
©2004 CRC Press LLC
Library of Congress Cataloging-in-Publication Data

Beran, Jan, 1959-
Statistics in musicology / Jan Beran.
p. cm. — (Interdisciplinary statistics series)
Includes bibliographical references and indexes.
ISBN 1-58488-219-0 (alk. paper)
1. Musical analysis—Statistical methods. I. Title. II. Interdisciplinary statistics
MT6.B344 2003
781.2—dc21 2003048488

This book contains information obtained from authentic and highly regarded sources. Reprinted material is quoted with permission, and sources are indicated. A wide variety of references are listed. Reasonable efforts have been made to publish reliable data and information, but the author and the publisher cannot assume responsibility for the validity of all materials or for the consequences of their use.

Neither this book nor any part may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopying, microfilming, and recording, or by any information storage or retrieval system, without prior permission in writing from the publisher. The consent of CRC Press LLC does not extend to copying for general distribution, for promotion, for creating new works, or for resale. Specific permission must be obtained in writing from CRC Press LLC for such copying. Direct all inquiries to CRC Press LLC, 2000 N.W. Corporate Blvd., Boca Raton, Florida 33431.

Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are used only for identification and explanation, without intent to infringe.

Visit the CRC Press Web site at www.crcpress.com

© 2004 by Chapman & Hall/CRC
No claim to original U.S. Government works
International Standard Book Number 1-58488-219-0
Library of Congress Card Number 2003048488
Printed in the United States of America 1 2 3 4 5 6 7 8 9 0
Printed on acid-free paper
Contents

Preface

1 Some mathematical foundations of music
  1.1 General background
  1.2 Some elements of algebra
  1.3 Specific applications in music

2 Exploratory data mining in musical spaces
  2.1 Musical motivation
  2.2 Some descriptive statistics and plots for univariate data
  2.3 Specific applications in music – univariate
  2.4 Some descriptive statistics and plots for bivariate data
  2.5 Specific applications in music – bivariate
  2.6 Some multivariate descriptive displays
  2.7 Specific applications in music – multivariate

3 Global measures of structure and randomness
  3.1 Musical motivation
  3.2 Basic principles
  3.3 Specific applications in music

4 Time series analysis
  4.1 Musical motivation
  4.2 Basic principles
  4.3 Specific applications in music

5 Hierarchical methods
  5.1 Musical motivation
  5.2 Basic principles
  5.3 Specific applications in music

6 Markov chains and hidden Markov models
  6.1 Musical motivation
  6.2 Basic principles
  6.3 Specific applications in music

7 Circular statistics
  7.1 Musical motivation
  7.2 Basic principles
  7.3 Specific applications in music

8 Principal component analysis
  8.1 Musical motivation
  8.2 Basic principles
  8.3 Specific applications in music

9 Discriminant analysis
  9.1 Musical motivation
  9.2 Basic principles
  9.3 Specific applications in music

10 Cluster analysis
  10.1 Musical motivation
  10.2 Basic principles
  10.3 Specific applications in music

11 Multidimensional scaling
  11.1 Musical motivation
  11.2 Basic principles
  11.3 Specific applications in music

List of figures

References
Preface

An essential aspect of music is structure. It is therefore not surprising that a connection between music and mathematics was recognized long before our time. Perhaps best known among the ancient "quantitative musicologists" are the Pythagoreans, who found fundamental connections between musical intervals and mathematical ratios. An obvious reason why mathematics comes into play is that a musical performance results in sound waves that can be described by physical equations. Perhaps more interesting, however, is the intrinsic organization of these waves that distinguishes music from "ordinary noise". Also, since music is intrinsically linked with human perception, emotion, and reflection as well as the human body, the scientific study of music goes far beyond physics. For a deeper understanding of music, a number of different sciences, such as psychology, physiology, history, physics, mathematics, statistics, computer science, semiotics, and of course musicology – to name only a few – need to be combined. This, together with the lack of available data, prevented, until recently, a systematic development of quantitative methods in musicology. In the last few years, the situation has changed dramatically. Collection of quantitative data is no longer a serious problem, and a number of mathematical and statistical methods have been developed that are suitable for analyzing such data. Statistics is likely to play an essential role in future developments of musicology, mainly for the following reasons: a) statistics is concerned with finding structure in data; b) statistical methods and structures are mathematical, and can often be carried over to various types of data – statistics is therefore an ideal interdisciplinary science that can link different scientific disciplines; and c) musical data are massive and complex – and therefore basically useless, unless suitable tools are applied to extract essential features.
This book is addressed to anybody who is curious about how one may analyze music in a quantitative manner. Clearly, the question of how such an analysis may be done is very complex, and no ultimate answer can be given here. Instead, the book summarizes various ideas that have proven useful in musical analysis and may provide the reader with "food for thought" or inspiration to do his or her own analysis. Specifically, the methods and applications discussed here may be of interest to students and researchers in music, statistics, mathematics, computer science, communication, and engineering. There is a large variety of statistical methods that can be applied in music. Selected topics are discussed in this book, ranging from simple descriptive statistics to formal modeling by parametric and nonparametric processes. The theoretical foundations of each method are discussed briefly, with references to more detailed literature. The emphasis is on examples that illustrate how to use the results in musical analysis. The methods can be divided into two groups: general classical methods and specific new methods developed to solve particular questions in music. Examples illustrate on one hand how standard statistical methods can be used to obtain quantitative answers to musicological questions. On the other hand, the development of more specific methodology illustrates how one may design new statistical models to answer specific questions. The data examples are kept simple in order to be understandable without extended musicological terminology. This implies many simplifications from the point of view of music theory – and leaves scope for more sophisticated analysis that may be carried out in future research. Perhaps this book will inspire the reader to join the effort.

Chapters are essentially independent to allow selective reading. Since the book describes a large variety of statistical methods in a nutshell, it can be used as a quick reference for applied statistics, with examples from musicology.

I would like to thank the following libraries, institutes, and museums for their permission to print various pictures, manuscripts, facsimiles, and photographs: Zentralbibliothek Zürich (Ruth Häusler, Handschriftenabteilung; Anikó Ladányi and Michael Kotrba, Graphische Sammlung); Belmont Music Publishers (Anne Wirth); Philippe Gontier, Paris; Österreichische Post AG; Deutsche Post AG; Elisabeth von Janota-Bzowski, Düsseldorf; University Library Heidelberg; Galerie Neuer Meister, Dresden; Robert-Sterl-Haus (K.M. Mieth); Béla Bartók Memorial House (János Szirányi); Frank Martin Society (Maria Martin); Karadar-Bertoldi Ensemble (Prof. Francesco Bertoldi); col legno (Wulf Weinmann). Thanks also to B. Repp for providing us with the tempo data for Schumann's Träumerei. I would also like to thank numerous colleagues from mathematics, statistics, and musicology who encouraged me to write this book. Finally, I would like to thank my wife and my daughter for their encouragement and support, without which this book could not have been written.

Jan Beran
Konstanz, March 2003
CHAPTER 1

Some mathematical foundations of music

1.1 General background

The study of music by means of mathematics goes back several thousand years. Well documented are, for instance, mathematical and philosophical studies by the Pythagorean school in ancient Greece (see e.g. van der Waerden 1979). Advances in mathematics, computer science, psychology, semiotics, and related fields, together with technological progress (in particular computer technology) led to a revival of quantitative thinking in music in the last two to three decades (see e.g. Archibald 1972, Solomon 1973, Schnitzler 1976, Balzano 1980, Götze and Wille 1985, Lewin 1987, Mazzola 1990a, 2002, Vuza 1991, 1992a,b, 1993, Keil 1991, Lendvai 1993, Lindley and Turner-Smith 1993, Genevois and Orlarey 1997, Johnson 1997; also see Hofstadter 1999, Andreatta et al. 2001, Leyton 2001, and Babbitt 1960, 1961, 1987, Forte 1964, 1973, 1989, Rahn 1980, Morris 1987, 1995, Andreatta 1997; for early accounts of mathematical analysis of music also see Graeser 1924, Perle 1955, Norden 1964). Many recent references can be found in specialized journals such as Computing in Musicology, Music Theory Online, Perspectives of New Music, Journal of New Music Research, Intégral, Music Perception, and Music Theory Spectrum, to name a few.

Music is, to a large extent, the result of a subconscious intuitive "process". The basic question of quantitative musical analysis is in how far music may nevertheless be described or explained partially in a quantitative manner. The German philosopher and mathematician Leibniz (1646-1716) (Figure 1.5) called music the "arithmetic of the soul". This is a profound philosophical statement; however, the difficulty is to formulate what exactly it may mean. Some composers, notably in the 20th century, consciously used mathematical elements in their compositions.
Typical examples are permutations, the golden section, transformations in two- or higher-dimensional spaces, random numbers, and fractals (see e.g. Schönberg, Webern, Bartók, Xenakis, Cage, Lutosławski, Eimert, Kagel, Stockhausen, Boulez, Ligeti, Barlow; Figures 1.1, 1.4, 1.15). More generally, conscious "logical" construction is an inherent part of composition. For instance, the forms of sonata and symphony were developed based on reflections about well balanced proportions. The tormenting search for "logical perfection" is well
Figure 1.1 Quantitative analysis of music helps to understand creative processes. (Pierre Boulez, photograph courtesy of Philippe Gontier, Paris; and "Jim" by J.B.)

Figure 1.2 J.S. Bach (1685-1750). (Engraving by L. Sichling after a painting by Elias Gottlob Haussmann, 1746; courtesy of Zentralbibliothek Zürich.)
documented in Beethoven's famous sketchbooks. Similarly, the art of counterpoint that culminated in J.S. Bach's (Figure 1.2) work relies to a high degree on intrinsically mathematical principles. A rather peculiar early account of explicit applications of mathematics is the use of permutations in change ringing in English churches since the 10th century (Fletcher 1956, Price 1969, Stewart 1992, White 1983, 1985, 1987, Wilson 1965). More standard are simple symmetries, such as retrograde (e.g. Crab fugue, or Canon cancricans), inversion, arpeggio, or augmentation. A curious example of this sort is Mozart's "Spiegel Duett" (or mirror duett, Figures 1.6, 1.7; the attribution to Mozart is actually uncertain). In the 20th century, composers such as Messiaen or Xenakis (Xenakis 1971; Figure 1.15) attempted to develop mathematical theories that would lead to new techniques of composition. From a strictly mathematical point of view, their derivations are not always exact. Nevertheless, their artistic contributions were very innovative and inspiring. More recent, mathematically stringent approaches to music theory, or certain aspects of it, are based on modern tools of abstract mathematics, such as algebra, algebraic geometry, and mathematical statistics (see e.g. Reiner 1985, Mazzola 1985, 1990a, 2002, Lewin 1987, Fripertinger 1991, 1999, 2001, Beran and Mazzola 1992, 1999a,b, 2000, Read 1997, Fleischer et al. 2000, Fleischer 2003).

The most obvious connection between music and mathematics is due to the fact that music is communicated in form of sound waves. Musical sounds can therefore be studied by means of physical equations. Already in ancient Greece (around the 5th century BC), Pythagoreans found the relationship between certain musical intervals and numeric proportions, and calculated intervals of selected scales. These results were probably obtained by studying the vibration of strings. Similar studies were done in other cultures, but are mostly not well documented.
In practical terms, these studies led to singling out specific frequencies (or frequency proportions) as "musically useful" and to the development of various scales and harmonic systems. A more systematic approach to the physics of musical sounds, music perception, and acoustics was initiated in the second half of the 19th century by path-breaking contributions by Helmholtz (1863) and other physicists (see e.g. Rayleigh 1896). Since then, a vast amount of knowledge has been accumulated in this field (see e.g. Backus 1969, 1977, Morse and Ingard 1968, 1986, Benade 1976, 1990, Rigden 1977, Yost 1977, Hall 1980, Berg and Stork 1995, Pierce 1983, Cremer 1984, Rossing 1984, 1990, 2000, Johnston 1989, Fletcher and Rossing 1991, Graff 1975, 1991, Roederer 1995, Rossing et al. 1995, Howard and Angus 1996, Beament 1997, Crocker 1998, Nederveen 1998, Orbach 1999, Kinsler et al. 2000, Raichel 2000). For a historic account on musical acoustics see e.g. Bailhache (2001).

It may appear at first that once we have mastered modeling musical sounds by physical equations, music is understood. This is, however, not so. Music is not just an arbitrary collection of sounds – music is "organized sound".
Figure 1.3 Ludwig van Beethoven (1770-1827). (Drawing by E. Dürck after a painting by J.K. Stieler, 1819; courtesy of Zentralbibliothek Zürich.)

Figure 1.4 Anton Webern (1883-1945). (Courtesy of Österreichische Post AG.)
Figure 1.5 Gottfried Wilhelm Leibniz (1646-1716). (Courtesy of Deutsche Post AG and Elisabeth von Janota-Bzowski.)

Physical equations for sound waves only describe the propagation of air pressure. They do not provide, by themselves, an understanding of how and why certain sounds are connected, nor do they tell us anything (at least not directly) about the effect on the audience. As far as structure is concerned, one may even argue – for the sake of argument – that music does not necessarily need "physical realization" in form of a sound. Musicians are able to hear music just by looking at a score. Beethoven (Figures 1.3, 1.16) composed his ultimate masterpieces after he lost his hearing. Thus, on an abstract level, music can be considered as an organized structure that follows certain laws. This structure may or may not express feelings of the composer. Usually, the structure is communicated to the audience by means of physical sounds – which in turn trigger an emotional experience of the audience (not necessarily identical with the one intended by the composer). The structure itself can be analyzed, at least partially, using suitable mathematical structures. Note, however, that understanding the mathematical structure does not necessarily tell us anything about the effect on the audience. Moreover, any mathematical structure used for analyzing music describes certain selected aspects only. For instance, studying symmetries of motifs in a composition by purely algebraic means ignores psychological, historical, perceptual, and other important issues. Ideally, all relevant scientific disciplines would need to interact to gain a broad understanding. A further complication is that the existence of a unique "truth" is by no means certain (and is in fact rather unlikely). For instance, a composition may contain certain structures that are important for some listeners but are ignored by others.
This problem became apparent in the early 20th century with the introduction of 12-tone music. The general public was not ready to perceive the complex structures of dodecaphonic music and was rather appalled by the seemingly chaotic noise, whereas a minority of "specialized" listeners was enthusiastic. Another example is the
comparison of performances. Which pianist is the best? This question has no unique answer, if any. There is no fixed gold standard and no unique solution that would represent the ultimate unchangeable truth. What one may hope for at most is a classification into types of performances that are characterized by certain quantifiable properties – without attaching a subjective judgment of "quality".

The main focus of this book is statistics. Statistics is essential for connecting theoretical mathematical concepts with observed "reality", to find and explore structures empirically and to develop models that can be applied and tested in practice. Until recently, traditional musical analysis was mostly carried out in a purely qualitative, and at least partially subjective, manner. Applications of statistical methods to questions in musicology and performance research are very rare (for examples see Yaglom and Yaglom 1967, Repp 1992, de la Motte-Haber 1996, Steinberg 1995, Waugh 1996, Nettheim 1997, Widmer 2001, Stamatatos and Widmer 2002) and mostly consist of simple applications of standard statistical tools to confirm results or conjectures that had been known or "derived" before by musicological, historic, or psychological reasoning. An interesting overview of statistical applications in music, and many references, can be found in Nettheim (1997). The lack of quantitative analysis may be explained, in part, by the impossibility of collecting "objective" data. Meanwhile, however, due to modern computer technology, an increasing number of musical data are becoming available. An in-depth statistical analysis of music is therefore no longer unrealistic. On the theoretical side, the development of sophisticated mathematical tools such as algebra, algebraic geometry, mathematical statistics, and their adaptation to the specific needs of music theory, made it possible to pursue a more quantitative path.
Because of the complex, highly organized nature of music, existing, mostly qualitative, knowledge about music must be incorporated into the process of mathematical and statistical modeling. The statistical methods that will be discussed in the subsequent chapters can be divided into two categories:

1. Classical methods of mathematical statistics and exploratory data analysis: many classical methods can be applied to analyze musical structures, provided that suitable data are available. A number of examples will be discussed. The examples are relatively simple from the point of view of musicology, the purpose being to illustrate how the appropriate use of statistics can yield interesting results, and to stimulate the reader to invent his or her own statistical methods that are appropriate for answering specific musicological questions.

2. New methods developed specifically to answer concrete questions in musicology: in the last few years, questions in music composition and performance led to the development of new statistical methods that are specifically designed to solve questions such as classification of performance styles, identification and modeling of metric, melodic, and harmonic structures, quantification of similarities and differences between compositions and performance styles, automatic identification of musical events and structures from audio signals, etc. Some of these methods will be discussed in detail.

A mathematical discipline that is concerned specifically with abstract definitions of structures is algebra. Some elements of basic algebra are therefore discussed in the next section. Naturally, depending on the context, other mathematical disciplines also play an equally important role in musical analysis, and will be discussed later where necessary. Readers who are familiar with modern algebra may skip the following section. A few examples that illustrate applications of algebraic structures to music are presented in Section 1.3. An extended account of mathematical approaches to music based on algebra and algebraic geometry is given, for instance, in Mazzola (1990a, 2002) (also see Lewin 1987 and Benson 1995-2002).

1.2 Some elements of algebra

1.2.1 Motivation

Algebraic considerations in music theory have gained increasing popularity in recent years. The reason is that there are striking similarities between musical and algebraic structures. Why this is so can be illustrated by a simple example: notes (or rather pitches) that differ by an octave can be considered equivalent with respect to their harmonic "meaning". If an instrument is tuned according to equal temperament, then, from the harmonic perspective, there are only 12 different notes. These can be represented as integers modulo 12. Similarly, there are only 12 different intervals. This means that we are dealing with the set Z12 = {0, 1, ..., 11}. The sum of two elements x, y ∈ Z12, z = x + y, is interpreted as the note/interval resulting from "increasing" the note/interval x by the interval y.
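This modular arithmetic can be carried out directly. A minimal sketch (my own illustration, not from the book; the numbering 0 = C is an assumed but common convention):

```python
# Pitch classes in equal temperament: integers modulo 12 (assuming 0 = C).
NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

def transpose(pitch_class: int, interval_: int) -> int:
    """Add an interval to a pitch class: z = x + y in Z12."""
    return (pitch_class + interval_) % 12

def interval(x: int, y: int) -> int:
    """The interval leading from pitch class x to pitch class y."""
    return (y - x) % 12

# Octave equivalence: a perfect fifth (7 semitones) above G (= 7) is D (= 2).
print(NAMES[transpose(7, 7)])  # D
print(interval(0, 7))          # 7, the fifth from C to G
```

The same two functions also express the identification of notes and intervals mentioned above: both live in Z12 and are combined with the same addition.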
The set Z12 of notes (intervals) is then an additive group (see definition below).

1.2.2 Definitions and results

We discuss some important concepts of algebra that are useful to describe musical structures. A more comprehensive overview of modern algebra can be found in standard textbooks such as those by Albert (1956), Herstein (1975), Zassenhaus (1999), Gilbert (2002), and Rotman (2002). The most fundamental structures in algebra are group, ring, field, module, and vector space.

Definition 1 Let G be a nonempty set with a binary operation + such that a + b ∈ G for all a, b ∈ G and the following holds:
1. (a + b) + c = a + (b + c) (Associativity)
2. There exists a zero element 0 ∈ G such that 0 + a = a + 0 = a for all a ∈ G
3. For each a ∈ G, there exists an inverse element (−a) ∈ G such that (−a) + a = a + (−a) = 0
Then (G, +) is called a group. The group (G, +) is called commutative (or abelian) if for each a, b ∈ G, a + b = b + a. The number of elements in G is called the order of the group and is denoted by o(G). If the order is finite, then G is called a finite group.

In a multiplicative way this can be written as

Definition 2 Let G be a nonempty set with a binary operation · such that a · b ∈ G for all a, b ∈ G and the following holds:
1. (a · b) · c = a · (b · c) (Associativity)
2. There exists an identity element e ∈ G such that e · a = a · e = a for all a ∈ G
3. For each a ∈ G, there exists an inverse element a^(−1) ∈ G such that a^(−1) · a = a · a^(−1) = e
Then (G, ·) is called a group. The group (G, ·) is called commutative (or abelian) if for each a, b ∈ G, a · b = b · a.

For subsets we have

Definition 3 Let (G, ·) and (H, ·) be groups and H ⊂ G. Then H is called a subgroup of G.

Some groups can be generated by a single element of the group:

Definition 4 Let (G, ·) be a group with n < ∞ elements denoted by a^i (i = 0, 1, ..., n − 1) and such that
1. a^0 = a^n = e
2. a^i · a^j = a^(i+j) if i + j ≤ n, and a^i · a^j = a^(i+j−n) if i + j > n
Then G is called a cyclic group. Furthermore, if G = (a) = {a^i : i ∈ Z}, where a^i denotes the product with all i terms equal to a, then a is called a generator of G.

An important notion is given in the following

Definition 5 Let G be a group that "acts" on a set X by assigning to each x ∈ X and g ∈ G an element g(x) ∈ X. Then, for each x ∈ X, the set G(x) = {y : y = g(x), g ∈ G} is called the orbit of x.

Note that, given a group G that acts on X, the set X is partitioned into disjoint orbits.

If there are two operations + and ·, then a ring is defined by

Definition 6 Let R be a nonempty set with two binary operations + and · such that the following holds:
1. (R, +) is an abelian group
2. a · b ∈ R for all a, b ∈ R
3. (a · b) · c = a · (b · c) (Associativity)
4. a · (b + c) = a · b + a · c and (b + c) · a = b · a + c · a (Distributive laws)
Then (R, +, ·) is called an (associative) ring. If also a · b = b · a for all a, b ∈ R, then R is called a commutative ring.

Further useful definitions are:

Definition 7 Let R be a commutative ring and a ∈ R, a ≠ 0, such that there exists an element b ∈ R, b ≠ 0, with a · b = 0. Then a is called a zero-divisor. If R has no zero-divisors, then it is called an integral domain.

Definition 8 Let R be a ring such that (R \ {0}, ·) is a group. Then R is called a division ring. A commutative division ring is called a field.

A module is defined as follows:

Definition 9 Let (R, +, ·) be a ring and M a nonempty set with a binary operation +. Assume that
1. (M, +) is an abelian group
2. For every r ∈ R, m ∈ M, there exists an element r · m ∈ M
3. r · (a + b) = r · a + r · b for every r ∈ R and a, b ∈ M
4. r · (s · a) = (r · s) · a for every r, s ∈ R, a ∈ M
5. (r + s) · a = r · a + s · a for every r, s ∈ R, a ∈ M
Then M is called an R-module or module over R. If R has a unit element e and if e · a = a for all a ∈ M, then M is called a unital R-module. A unital R-module where R is a field is called a vector space over R.

There is an enormous amount of literature on groups, rings, modules, etc. Some of the standard results are summarized, for instance, in textbooks such as those given above. Here, we cite only a few theorems that are especially useful in music. We start with a few more definitions.

Definition 10 Let H ⊂ G be a subgroup of G such that for every a ∈ G, a · H · a^(−1) ⊂ H. Then H is called a normal subgroup of G.

Definition 11 Let G be such that the only normal subgroups are H = G and H = {e}. Then G is called a simple group.

Definition 12 Let G be a group and H1, ..., Hn normal subgroups such that

  G = H1 · H2 · · · Hn   (1.1)

and any a ∈ G can be written uniquely as a product

  a = b1 · b2 · · · bn   (1.2)

with bi ∈ Hi.
Then G is said to be the (internal) direct product of H1, ..., Hn.
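A sketch of how direct products arise in the musical setting introduced above (my own illustration, resting on the standard Chinese-remainder fact that 3 and 4 are coprime, not on an example from the book): the pitch-class group Z12 decomposes as a direct product of Z3 and Z4.

```python
# Hedged sketch: the map phi(x) = (x mod 3, x mod 4) exhibits Z12 as a
# direct product Z3 x Z4, since gcd(3, 4) = 1.

def phi(x: int) -> tuple:
    return (x % 3, x % 4)

# Bijectivity: the 12 images are pairwise distinct.
images = {phi(x) for x in range(12)}
assert len(images) == 12

# Homomorphism property: phi(x + y) = phi(x) + phi(y), componentwise.
for x in range(12):
    for y in range(12):
        fx, fy = phi(x), phi(y)
        assert phi((x + y) % 12) == ((fx[0] + fy[0]) % 3, (fx[1] + fy[1]) % 4)
```

The two coordinates have a loose musical reading: the first counts steps of a major third (4 semitones generate the 3-element subgroup {0, 4, 8}), the second steps of a minor third (3 semitones generate {0, 3, 6, 9}).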
Definition 13 Let G1 and G2 be two groups, define G = G1 × G2 = {(a, b) : a ∈ G1, b ∈ G2} and the operation · by (a1, b1) · (a2, b2) = (a1 · a2, b1 · b2). Then the group (G, ·) is called the (external) direct product of G1 and G2.

Definition 14 Let M be an R-module and M1, ..., Mn submodules such that every a ∈ M can be written uniquely as a sum

  a = a1 + a2 + ... + an   (1.3)

with ai ∈ Mi. Then M is said to be the direct sum of M1, ..., Mn.

We now turn to the question which subgroups of finite groups exist.

Theorem 1 Let H be a subgroup of a finite group G. Then o(H) is a divisor of o(G).

Theorem 2 (Sylow) Let G be a group and p a prime number such that p^m is a divisor of o(G). Then G has a subgroup H with o(H) = p^m.

Definition 15 A subgroup H ⊂ G with o(H) = p^m, such that p^m is a divisor of o(G) but p^(m+1) is not a divisor, is called a p-Sylow subgroup.

The next theorems help to decide whether a ring is a field.

Theorem 3 Let R be a finite integral domain. Then R is a field.

Corollary 1 Let p be a prime number and R = Zp = {x mod p : x ∈ N} be the set of integers modulo p (with the operations + and · defined accordingly). Then R is a field.

An essential way to compare algebraic structures is in terms of operation-preserving mappings. The following definitions are needed:

Definition 16 Let (G1, ·) and (G2, ·) be two groups. A mapping g : G1 → G2 such that

  g(a · b) = g(a) · g(b)   (1.4)

is called a (group-)homomorphism. If g is a one-to-one (group-)homomorphism, then it is called an isomorphism (or group-isomorphism). Moreover, if G1 = G2, then g is called an automorphism (or group-automorphism).

Definition 17 Two groups G1, G2 are called isomorphic if there is an isomorphism g : G1 → G2.

Analogous definitions can be given for rings and modules:

Definition 18 Let R1 and R2 be two rings. A mapping g : R1 → R2 such that

  g(a + b) = g(a) + g(b)   (1.5)

and

  g(a · b) = g(a) · g(b)   (1.6)

is called a (ring-)homomorphism.
If g is a one-to-one (ring-)homomorphism, then it is called an isomorphism (or ring-isomorphism). Furthermore, if R1 = R2, then g is called an automorphism (or ring-automorphism).
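To illustrate the notion of automorphism on the pitch-class group (again my own sketch, not an example from the book): multiplication by 7 is an automorphism of the additive group Z12, because gcd(7, 12) = 1, and it maps the ascending chromatic scale onto the circle of fifths. In pitch-class set theory this map is commonly called the M7 operation.

```python
# Hedged sketch: g(x) = 7x mod 12 is a group automorphism of (Z12, +).

def g(x: int) -> int:
    return (7 * x) % 12

# Homomorphism: g(x + y) = g(x) + g(y) in Z12.
assert all(g((x + y) % 12) == (g(x) + g(y)) % 12
           for x in range(12) for y in range(12))

# Bijective, hence an automorphism: the images permute Z12.
fifths = [g(x) for x in range(12)]
assert sorted(fifths) == list(range(12))
print(fifths)  # [0, 7, 2, 9, 4, 11, 6, 1, 8, 3, 10, 5] -- the circle of fifths
```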
Definition 19 Two rings R1, R2 are called isomorphic if there is an isomorphism g : R1 → R2.

Definition 20 Let M1 and M2 be two modules over R. A mapping g : M1 → M2 such that for every a, b ∈ M1, r ∈ R,

  g(a + b) = g(a) + g(b)   (1.7)

and

  g(r · a) = r · g(a)   (1.8)

is called a (module-)homomorphism (or a linear transformation). If g is a one-to-one (module-)homomorphism, then it is called an isomorphism (or module-isomorphism). Furthermore, if M1 = M2, then g is called an automorphism (or module-automorphism).

Definition 21 Two modules M1, M2 are called isomorphic if there is an isomorphism g : M1 → M2.

Finally, a general family of transformations is defined by

Definition 22 Let g : M1 → M2 be a (module-)homomorphism. Then a mapping h : M1 → M2 defined by

  h(a) = c + g(a)   (1.9)

with c ∈ M2 is called an affine transformation. If M1 = M2, then h is called a symmetry of M. Moreover, if h is invertible, then it is called an invertible symmetry of M.

Studying properties of groups is equivalent to studying groups of automorphisms:

Theorem 4 (Cayley's theorem) Let G be a group. Then there is a set S such that G is isomorphic to a subgroup of A(S), where A(S) is the set of all one-to-one mappings of S onto itself.

Definition 23 Let S = {1, 2, ..., n}. Then the group (A(S), ◦) (where a ◦ b denotes successive application of the functions a and b) is called the symmetric group of order n, and is denoted by Sn.

Note that Sn is isomorphic to the group of permutations of the numbers 1, 2, ..., n, and has n! elements.

Another important concept is motivated by representation in coordinates, as we are used to from Euclidean geometry. Such a representation is possible because, in terms of isomorphy, the internal and external direct products can be shown to be equivalent:

Theorem 5 Let G = H1 · H2 · · · Hn be the internal direct product of H1, ..., Hn and G* = H1 × H2 × ... × Hn the external direct product. Then G and G* are isomorphic, through the isomorphism g : G* → G defined by g(a1, ..., an) = a1 · a2 · ... · an.

This theorem implies that one does not need to distinguish between the internal and external direct product. The analogous result holds for modules:
  • 18. Theorem 6 Let M be a direct sum of M1 , ..., Mn . Then M is isomor-phic to the module M ∗ = {(a1 , a2 , ..., an ) : ai ∈ Mi } with the opera-tions (a1 , a2 , ...) + (b1 , b2 , ...) = (a1 + b1 , a2 + b2 , ...) and r · (a1 , a2 , ...) =(r · a1 , r · a2 , ...).Thus, a module M = M1 + M2 + ... + Mn can be described in terms ofits coordinates with respect to Mi (i = 1, ..., n) and the structure of M isknown as soon as we know the structure of Mi (i = 1, ..., n). Direct products can be used, in particular, to characterize the structureof finite abelian groups:Theorem 7 Let (G, ·) be a finite commutative group. Then G is isomor-phic to the direct product of its Sylow-subgroups.Theorem 8 Let (G, ·) be a finite commutative group. Then G is the directproduct of cyclic groups.Similar, but slightly more involved, results can be shown for modules, butwill not be needed here.1.3 Specific applications in musicIn the following, the usefulness of algebraic structures in music is illus-trated by a few selected examples. This is only a small selection fromthe extended literature on this topic. For further reading see e.g. Graeser(1924), Sch¨nberg (1950), Perle (1955), Fletcher (1956), Babbitt (1960, o1961), Price (1969), Archibald (1972), Halsey and Hewitt (1978), Balzano(1980), Rahn (1980), G¨tze and Wille (1985), Reiner (1985), Berry (1987), oMazzola (1990a, 2002 and references therein), Vuza (1991, 1992a,b, 1993),Fripertinger (1991), Lendvai (1993), Benson (1995-2002), Read (1997), Noll(1997), Andreatta (1997), Stange-Elbe (2000), among others.1.3.1 The Mathieu groupIt can be shown that finite simple groups fall into families that can bedescribed explicitly, except for 26 so-called sporadic groups. One such groupis the so-called Mathieu group M12 which was discovered by the Frenchmathematician Mathieu in the 19th century (Mathieu 1861, 1873, also seee.g. Conway and Sloane 1988). In their study of probabilistic properties of(card) shuffling, Diaconis et al. 
(1983) show that M12 can be generated by two permutations (which they call Mongean shuffles), namely

    π1 = ( 1 2 3 4 5 6  7 8  9 10 11 12
           7 6 8 5 9 4 10 3 11  2 12  1 )    (1.10)

and

    π2 = ( 1 2 3 4 5 6 7  8 9 10 11 12
           6 7 5 8 4 9 3 10 2 11  1 12 )    (1.11)
where the lower rows denote the images of the numbers 1, ..., 12. The order of this group is o(M12) = 95040 (!) An interesting application of these permutations can be found in Ile de feu 2 by Olivier Messiaen (Berry 1987), where π1 and π2 are used to generate sequences of tones and durations.

1.3.2 Campanology

A rather peculiar example of group theory "in action" (though perhaps rather trivial mathematically) is campanology, or change ringing (Fletcher 1956, Wilson 1965, Price 1969, White 1983, 1985, 1987, Stewart 1992). The art of change ringing started in England in the 10th century and is still performed today. The problem that is to be solved is as follows: there are k swinging bells in the church tower. One starts playing a melody that consists of a certain sequence in which the bells are played, each bell being played only once. Thus, the initial sequence is a permutation of the numbers 1, ..., k. Since it is not interesting to repeat the same melody over and over, the initial melody has to be varied. However, the bells are very heavy, so that it is not easy to change the timing of the bells. Each variation is therefore restricted, in that in each "round" only one pair of adjacent bells can exchange their position. Thus, for instance, if k = 4 and the previous sequence was (1, 2, 3, 4), then the only permissible permutations are (2, 1, 3, 4), (1, 3, 2, 4), and (1, 2, 4, 3). A further, mainly aesthetic, restriction is that no sequence should be repeated, except that the last one is identical with the initial sequence. A typical solution to this problem is, for instance, the "Plain Bob" that starts with (1, 2, 3, 4), (2, 1, 4, 3), (2, 4, 1, 3), ... and continues until all permutations in S4 are visited.

1.3.3 Representation of music

Many aspects of music can be "embedded" in a suitable algebraic module (see e.g. Mazzola 1990a). Here are some examples:

1.
Apart from glissando effects, the essential frequencies in most types of music are of the form

    ω = ω0 · ∏_{i=1}^{K} p_i^{x_i}    (1.12)

where K < ∞, ω0 is a fixed basic frequency, the pi are certain fixed prime numbers, and xi ∈ Q. Thus,

    ψ = log ω = ψ0 + Σ_{i=1}^{K} xi ψi    (1.13)

where ψ0 = log ω0 and ψi = log pi (i ≥ 1). Let Ψ = {ψ : ψ = Σ_{i=1}^{K} xi ψi, xi ∈ Q} be the set of all log-frequencies generated this way. Then Ψ is a module over Q. Two typical examples are:
(a) ω0 = 440 Hz, K = 3, p1 = 2, p2 = 3, p3 = 5: This is the so-called Euler module in which most Western music operates. An important subset consists of frequencies of the just intonation with the pure intervals octave (ratio of frequencies 2), fifth (ratio of frequencies 3/2) and major third (ratio of frequencies 5/4):

    ψ = log ω = log 440 + x1 log 2 + x2 log 3 + x3 log 5    (1.14)

(xi ∈ Z). The notes (frequencies) ψ can then be represented by points in a three-dimensional space of integers Z³. Note that, using the notation a = (a1, a2, a3) and b = (b1, b2, b3), the pitch obtained by the addition c = a + b corresponds to the frequency ω0 · 2^{a1+b1} · 3^{a2+b2} · 5^{a3+b3}.

(b) ω0 = 440 Hz, K = 1, p1 = 2, and x1 = p/12, where p ∈ Z: This corresponds to the well-tempered tuning, where an octave is divided into 12 equal intervals. Thus, the ratio 2 is decomposed into 12 ratios 2^{1/12}, so that

    ψ = log 440 + (p/12) log 2    (1.15)

If notes that differ by one or several octaves are considered equivalent, then we can identify the set of notes with the Z-module Z12 = {0, 1, ..., 11}.

2. Consider a finite module of notes (frequencies), such as for instance the well-tempered module M = Z12. Then a scale is an element of S = {(x1, ..., xk) : k ≤ |M|, xi ∈ M, xi ≠ xj (i ≠ j)}, the set of all finite vectors with different components.

1.3.4 Classification of circular chords and other musical objects

A central element of the classical theory of harmony is the triad. An algebraic property that distinguishes harmonically important triads from other chords can be described as follows: let x1, x2, x3 ∈ Z12, such that (a) xi ≠ xj (i ≠ j) and (b) there is an "inner" symmetry g : Z12 → Z12 such that {y : y = g^k(x1), k ∈ N} = {x1, x2, x3}. It can be shown that all chords (x1, x2, x3) for which (a) and (b) hold are standard chords that are harmonically important in the traditional theory of harmony.
Consider for instance the major triad (c, e, g) = (0, 4, 7) and the minor triad (c, e♭, g) = (0, 3, 7). For the first triad, the symmetry g(x) = 3x + 7 yields the desired result: g(0) = 7 = g, g(7) = 4 = e and g(4) = 7 = g. For the minor triad, the only inner symmetry is g(x) = 3x + 3, with g(7) = 0 = c, g(0) = 3 = e♭ and g(3) = 0 = c. This type of classification of chords can be carried over to more complicated configurations of notes (see e.g. Mazzola 1990a, 2002, Straub 1989). In particular, musical scales can be classified by comparing their inner symmetries.
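The orbit condition (b) can be checked mechanically. A minimal sketch in Python (the helper name `orbit` is mine; the affine maps and starting notes are those given above):

```python
def orbit(g, start, mod=12):
    """Iterate x -> g(x) mod 12 and collect values until the orbit closes."""
    seen, x = [], start
    while x not in seen:
        seen.append(x)
        x = g(x) % mod
    return set(seen)

major = {0, 4, 7}   # (c, e, g)
minor = {0, 3, 7}   # (c, e-flat, g)

# g(x) = 3x + 7 generates the major triad from the root c = 0:
assert orbit(lambda x: 3 * x + 7, 0) == major
# g(x) = 3x + 3 generates the minor triad from the fifth g = 7:
assert orbit(lambda x: 3 * x + 3, 7) == minor
print("both inner symmetries verified")
```

Note that the starting note matters: iterating g from another note of the chord may yield only a proper subset of the triad (e.g. {4, 7} for the major triad starting from e = 4), which is why the condition is stated for the orbit of a suitable x1.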
1.3.5 Torus of thirds

Consider the group G = (Z12, +) of pitches modulo octave. Then G is isomorphic to the direct sum of the Sylow groups Z3 and Z4 by applying the isomorphism

    g : Z12 → Z3 + Z4,    (1.16)
    x ↦ y = (y1, y2) = (x mod 3, −x mod 4)    (1.17)

Geometrically, the elements of Z3 + Z4 can be represented as points on a torus, y1 representing the position on the vertical meridian and y2 the position on the horizontal equatorial circle (Figure 1.8). This representation has a musical meaning: a movement along a meridian corresponds to a major third, whereas a movement along a horizontal circle corresponds to a minor third. One can then define the "torus-distance" dtorus(x, y) as the minimal number of steps needed to move from x to y. The value of dtorus(x, y) expresses to what extent there is a third-relationship between x and y. The possible values of dtorus are 0 (if x = y), 1, 2, and 3 (smallest third-relationship). Note that dtorus can be decomposed into d3 + d4, where d3 counts the number of meridian steps and d4 the number of equatorial steps.

1.3.6 Transformations

For suitably chosen integers p1, p2, p3, p4, consider the four-dimensional module M = Zp1 × Zp2 × Zp3 × Zp4 over Z, where the coordinates represent onset time, pitch (well-tempered tuning if p2 = 12), duration, and volume. Transformations in this space play an essential role in music. A selection of historically relevant transformations used by classical composers is summarized in Table 1.1 (also see Figure 1.13). Generally, one may say that affine transformations are most important, and among these the invertible ones.
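The coordinate map (1.16)-(1.17) and the decomposition dtorus = d3 + d4 of Section 1.3.5 can be sketched in a few lines of Python (function names are mine, not from the text):

```python
def torus(x):
    """Map a pitch class x in Z12 to its (Z3, Z4) torus coordinates."""
    return (x % 3, (-x) % 4)

def d_torus(x, y):
    """Minimal number of third-steps between pitch classes x and y:
    meridian (major-third) steps plus equatorial (minor-third) steps."""
    (x1, x2), (y1, y2) = torus(x), torus(y)
    d3 = min((x1 - y1) % 3, (y1 - x1) % 3)   # steps on the 3-cycle
    d4 = min((x2 - y2) % 4, (y2 - x2) % 4)   # steps on the 4-cycle
    return d3 + d4

print(d_torus(0, 4))    # c to e (major third): 1
print(d_torus(0, 3))    # c to e-flat (minor third): 1
print(d_torus(0, 10))   # c to b-flat: 3, the weakest third-relationship
```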
In particular, it can be shown that each symmetry of Z12 can be written as a product (in the group of symmetries Symm(Z12)) of the following musically meaningful transformations:

• Multiplication by −1 (inversion);
• Multiplication by 5 (ordering of notes according to the circle of fourths);
• Addition of 3 (transposition by a minor third);
• Addition of 4 (transposition by a major third).

All these transformations have been used by composers for many centuries. Some examples of apparent similarities between groups of notes (or motifs) are shown in Figures 1.10 through 1.12. In order not to clutter the pictures, only a small selection of similar motifs is marked. In dodecaphonic and serial music, transformation groups have been applied systematically (see e.g. Figure 1.9). For instance, in Schönberg's Orchestervariationen op.
31, the full orbit generated by inversion, retrograde and transposition is used. Webern used 12-tone series that are diagonally symmetric in the two-dimensional space spanned by pitch and onset time. Other famous examples include Eimert's rotation by 45 degrees together with a dilatation by √2 (Eimert 1964), and serial compositions such as Boulez's "Structures" and Stockhausen's "Kontra-Punkte". With advanced computer technology (e.g. composition soft- and hardware such as Xenakis' UPIC graphics/computer system or the recently developed Presto software by Mazzola 1989/1994), the application of affine transformations in musical spaces of arbitrary dimension is no longer the tedious work of the early dodecaphonic era. On the contrary, the practical ease and enormous artistic flexibility lead to an increasing popularity of computer-aided transformations among contemporary composers (see e.g. Iannis Xenakis, Kurt Dahlke, Wilfried Jentzsch, Guerino Mazzola 1990b, Dieter Salbert, Karl-Heinz Schöppner, Tamas Ungvary, Jan Beran 1987, 1991, 1992, 2000; cf. Figure 1.14).

Table 1.1 Some affine transformations used in classical music

  Shift: f(x) = x + a.  Musical meaning: transposition, repetition, change of duration, change of loudness.
  Shear, e.g. of x = (x1, ..., x4)^t w.r.t. the line y = β0 + t·(0, 1, 0, 0): f(x) = x + a·(0, 1, 0, 0) for x not on the line, f(x) = x for x on the line.  Musical meaning: arpeggio.
  Reflection, e.g. w.r.t. v = (a, 0, 0, 0): f(x) = (a − (x1 − a), x2, x3, x4).  Musical meaning: retrograde, inversion.
  Dilatation, e.g. w.r.t. pitch: f(x) = (x1, a·x2, x3, x4).  Musical meaning: augmentation.
  Exchange of coordinates: f(x) = (x2, x1, x3, x4).  Musical meaning: exchange of "parameters" (20th century).
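Reading Symm(Z12) as the group of invertible affine maps x ↦ ax + b on Z12 (an assumption consistent with the transformations listed above; this group has 4 · 12 = 48 elements, since a must be a unit of Z12), the claim that inversion, multiplication by 5, and the two transpositions generate the whole group can be verified by brute-force closure:

```python
from itertools import product

# An affine map x -> a*x + b on Z12 is stored as the pair (a, b).
def compose(f, g):
    """(f o g)(x) = a_f * (a_g * x + b_g) + b_f, reduced mod 12."""
    (af, bf), (ag, bg) = f, g
    return ((af * ag) % 12, (af * bg + bf) % 12)

generators = [
    (11, 0),  # multiplication by -1 (inversion)
    (5, 0),   # multiplication by 5 (circle of fourths)
    (1, 3),   # addition of 3 (transposition by a minor third)
    (1, 4),   # addition of 4 (transposition by a major third)
]

# Close the generator set under composition.
group = set(generators)
while True:
    new = {compose(f, g) for f, g in product(group, repeat=2)} - group
    if not new:
        break
    group |= new

print(len(group))   # 48: all invertible affine maps of Z12
```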
Figure 1.6 W.A. Mozart (1756-1791) (authorship uncertain) – Spiegel-Duett. [Musical score omitted: violin, Allegro, quarter note = 120.]
Figure 1.7 Wolfgang Amadeus Mozart (1756-1791). (Engraving by F. Müller after a painting by J.W. Schmidt; courtesy of Zentralbibliothek Zürich.)

Figure 1.8 The torus of thirds Z3 + Z4.
Figure 1.9 Arnold Schönberg – Sketch for the piano concerto op. 42 – notes with tone row and its inversions and transpositions. (Used by permission of Belmont Music Publishers.)

Figure 1.10 Notes of "Air" by Henry Purcell. (For better visibility, only a small selection of related "motifs" is marked.)
Figure 1.11 Notes of Fugue No. 1 (first half) from "Das Wohltemperierte Klavier" by J.S. Bach. (For better visibility, only a small selection of related "motifs" is marked.)

Figure 1.12 Notes of op. 68, No. 2 from "Album für die Jugend" by Robert Schumann. (For better visibility, only a small selection of related "motifs" is marked.)
Figure 1.13 A miraculous transformation caused by high exposure to Wagner operas. (Caricature from a 19th century newspaper; courtesy of Zentralbibliothek Zürich.)

Figure 1.14 Graphical representation of pitch and onset time in Z2 together with instrumentation of polygonal areas. (Excerpt from Sānti – Piano concerto No. 2 by Jan Beran, col legno CD 20062; courtesy of col legno, Germany.)
Figure 1.15 Iannis Xenakis (1922-1998). (Courtesy of Philippe Gontier, Paris.)

Figure 1.16 Ludwig van Beethoven (1770-1827). (Courtesy of Zentralbibliothek Zürich.)
CHAPTER 2

Exploratory data mining in musical spaces

2.1 Musical motivation

The primary aim of descriptive statistics is to summarize data by a small set of numbers or graphical displays, with the purpose of finding typical relevant features. An in-depth descriptive analysis explores the data as far as possible in the hope of finding anything interesting. This activity is therefore also called "exploratory data analysis" (EDA; see Tukey 1977), or "data mining". EDA does not require a priori model assumptions – the purpose is simply free exploration. Many exploratory tools are, however, inspired by probabilistic models and designed to detect features that may be captured by these.

Descriptive or exploratory analysis is of special interest in music. The reason is that in music very subtle local changes play an important role. For instance, a good pianist may achieve a desired emotional effect by slight local variations of tempo, dynamics, etc. Composers are able to do the same by applying subtle variations. Extreme examples of small gradual changes can be found, for instance, in minimal music (e.g. Reich, Glass, Riley). As a result, observed data consist of a dominating deterministic component plus many other very subtle (and presumably also deterministic, i.e. intended) components. Thus, because of their subtle nature, many musically relevant features are difficult to detect and can often be identified in a descriptive way only – for instance by suitable graphical displays. A formal statistical "proof" that these features are indeed real, and not just accidental, is then only possible if more similar data are collected.

To illustrate this, consider the tempo curves of three performances of Robert Schumann's (1810-1856) Träumerei by Vladimir Horowitz (1903-1989), displayed in Figure 2.2. It is obvious that the three curves are very similar even with respect to small details.
However, since these details are of a local nature and we observed only three performances, it is not an easy task to show formally (by statistical hypothesis testing or confidence intervals) that, apart from an overall smooth trend, Horowitz's tempo variations are not random. An even more difficult task is to "explain" these features, i.e. to attach an explicit musical meaning to the local tempo changes.
Figure 2.1 Robert Schumann (1810-1856) – Träumerei op. 15, No. 7. [Musical score omitted.]
  • 31. 0 1947 -5 1963 log(tempo) 1965 -10 -15 0 10 20 30 onset timeFigure 2.2 Tempo curves of Schumann’s Tr¨umerei performed by Vladimir aHorowitz.2.2 Some descriptive statistics and plots for univariate data2.2.1 DefinitionsWe give a brief summary of univariate descriptive statistics. For a com-prehensive discussion we refer the reader to standard text books such asTukey (1977), Mosteller and Tukey (1977), Hoaglin (1977), Tufte (1977),Velleman and Hoaglin (1981), Chambers et al. (1983), Cleveland (1985). Suppose that we observe univariate data x1 , x2 , ..., xn . To summarizegeneral characteristics of the data, various numerical summary statisticscan be calculated. Essential features are in particular center (location),variability, asymmetry, shape of distribution, and location of unusual values(outliers). The most frequently used statistics are listed in Table 2.1. We recall a few well known properties of these statistics:• Sample mean: The sample mean can be understood as the “center of gravity” of the data, whereas the median divides the sample in two halves©2004 CRC Press LLC
Table 2.1 Simple descriptive statistics

  Empirical distribution function: Fn(x) = n⁻¹ Σ_{i=1}^{n} 1{xi ≤ x}.  Feature measured: proportion of observations ≤ x.
  Minimum: xmin = min{x1, ..., xn}.  Smallest value.
  Maximum: xmax = max{x1, ..., xn}.  Largest value.
  Range: xrange = xmax − xmin.  Total spread.
  Sample mean: x̄ = n⁻¹ Σ_{i=1}^{n} xi.  Center.
  Sample median: M = inf{x : Fn(x) ≥ 1/2}.  Center.
  Sample α-quantile: qα = inf{x : Fn(x) ≥ α}.  Border of lower 100α%.
  Lower and upper quartile: Q1 = q_{1/4}, Q2 = q_{3/4}.  Border of lower 25%, upper 75%.
  Sample variance: s² = (n − 1)⁻¹ Σ_{i=1}^{n} (xi − x̄)².  Variability.
  Sample standard deviation: s = +√(s²).  Variability.
  Interquartile range: IQR = Q2 − Q1.  Variability.
  Sample skewness: m3 = n⁻¹ Σ_{i=1}^{n} [(xi − x̄)/s]³.  Asymmetry.
  Sample kurtosis: m4 = n⁻¹ Σ_{i=1}^{n} [(xi − x̄)/s]⁴ − 3.  Flat/sharp peak.

with an (approximately) equal number of observations. In contrast to the median, the mean is sensitive to outliers, since observations that are far from the majority of the data have a strong influence on its value.

• Sample standard deviation: The sample standard deviation is a measure of variability. In contrast to the variance, s is directly comparable with the data, since it is measured in the same unit. If observations are drawn independently from the same normal probability distribution (or a distribution that is similar to a normal distribution), then the following rule of thumb applies: (a) approximately 68% of the data are in the interval x̄ ± s; (b) approximately 95% of the data are in the interval x̄ ± 2s; (c) almost all data are in the interval x̄ ± 3s. For a sufficiently large sample size, these conclusions can be carried over to the population from which the data were drawn.
• Interquartile range: The interquartile range also measures variability. Its advantage, compared to s, is that it is much less sensitive to outliers. If the observations are drawn from the same normal probability distribution, then IQR/1.35 (or, more precisely, IQR/[Φ⁻¹(0.75) − Φ⁻¹(0.25)], where Φ⁻¹ is the quantile function of the standard normal distribution) estimates the same quantity as s, namely the population standard deviation.

• Quantiles: For α = i/n (i = 1, ..., n), qα coincides with at least one observation. For other values of α, qα can be defined as in Table 2.1 or, alternatively, by interpolating neighboring observed values as follows: let β = i/n < α < γ = (i+1)/n. Then the interpolated quantile q̃α is defined by

    q̃α = qβ + ((α − β)/(1/n)) · (qγ − qβ)    (2.1)

Note that a slightly different convention used by some statisticians is to call inf{x : Fn(x) ≥ α} the (α − 0.5/n)-quantile (see e.g. Chambers et al. 1983).

• Skewness: Skewness measures symmetry/asymmetry. For exactly symmetric data, m3 = 0; for data with a long right tail, m3 > 0; for data with a long left tail, m3 < 0.

• Kurtosis: The kurtosis is mainly meaningful for unimodal distributions, i.e. distributions with one peak. For a sample from a normal distribution, m4 ≈ 0. The reason is that then E[(X − µ)⁴] = 3σ⁴, where µ = E(X). For samples from unimodal distributions with a sharper or flatter peak than the normal distribution, we then tend to have m4 > 0 and m4 < 0 respectively.

Simple, but very useful graphical displays are:

• Histogram: 1. Divide an interval (a, b] that includes all observations into disjoint intervals I1 = (a1, b1], ..., Ik = (ak, bk]. 2. Let n1, ..., nk be the number of observations in the intervals I1, ..., Ik respectively. 3. Above each interval Ij, plot a rectangle of width wj = bj − aj and height hj = nj/wj. Instead of the absolute frequencies, one can also use relative frequencies nj/n, where n = n1 + ... + nk.
The essential point is that the area is proportional to nj. If the data are drawn from a probability distribution with density function f, then the histogram is an estimate of f.

• Kernel estimate of a density function: The histogram is a step function, and in that sense does not resemble most density functions. This can be improved as follows. If the data are realizations of a continuous random variable X with distribution F(x) = P(X ≤ x) = ∫_{−∞}^{x} f(u) du, then a smooth estimate of the probability density function f can be defined by a kernel estimate (Rosenblatt 1956, Parzen 1962, Silverman 1986) of the
form

    f̂(x) = (nb)⁻¹ Σ_{i=1}^{n} K((xi − x)/b)    (2.2)

where K(u) = K(−u) ≥ 0 and ∫_{−∞}^{∞} K(u) du = 1. Most kernels used in practice also satisfy the condition K(u) = 0 for |u| > 1. The "bandwidth" b then specifies which data in the neighborhood of x are used to estimate f(x). In situations where one has partial knowledge of the shape of f, one may incorporate this into the estimation procedure. For instance, Hjort and Glad (2002) combine parametric estimation based on a preliminary density function f(x; θ̂) with kernel smoothing of the "remaining density" f/f(x; θ̂). They show that major efficiency gains can be achieved if the preliminary model is close to the truth.

• Barchart: If data can assume only a few different values, or if data are qualitative (i.e. we only record which category an item belongs to), then one can plot the possible values or names of categories on the x-axis and on the vertical axis the corresponding (relative) frequencies.

• Boxplot (simple version): 1. Calculate Q1, M, Q2 and IQR = Q2 − Q1. 2. Draw parallel lines (in principle of arbitrary length) at the levels Q1, M, Q2, A1 = Q1 − (3/2)·IQR, A2 = Q2 + (3/2)·IQR, B1 = Q1 − 3·IQR and B2 = Q2 + 3·IQR. The points A1, A2 are called inner fence, and B1, B2 are called outer fence. 3. Identify the observation(s) between Q1 and A1 that is closest to A1 and draw a line connecting Q1 with this point. Do the same for Q2 and A2. 4. Identify observation(s) between A1 and B1 and draw points (or other symbols) at those places. Do the same for A2 and B2. 5. Draw points (or other symbols) for observations beyond B1 and B2 respectively.

The boxplot can be interpreted as follows: the relative positions of Q1, M, Q2 and the inner and outer fences indicate symmetry or asymmetry. Moreover, the distance between Q1 and Q2 is the IQR and thus measures variability. The inner and outer fences help to identify outliers, i.e.
values lying unusually far from most of the other observations.

• Q-q-plot for comparing two data sets x1, ..., xn and y1, ..., ym: 1. Define a certain number of points 0 < p1 < ... < pk ≤ 1 (the standard choice is pi = (i − 0.5)/N, where N = min(n, m)). 2. Plot the pi-quantiles (i = 1, ..., N) of the y-observations versus those of the x-observations. Alternative plots for comparing distributions are discussed e.g. in Ghosh and Beran (2000) and Ghosh (1996, 1999).
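The statistics of Table 2.1 are straightforward to compute; a minimal sketch in plain Python (function names are mine; quantiles follow the definition qα = inf{x : Fn(x) ≥ α} from the table):

```python
import math

def quantile(xs, alpha):
    """q_alpha = inf{x : Fn(x) >= alpha} for the empirical distribution."""
    xs = sorted(xs)
    k = math.ceil(alpha * len(xs))   # smallest k with k/n >= alpha
    return xs[max(k, 1) - 1]

def describe(xs):
    n = len(xs)
    mean = sum(xs) / n
    s = math.sqrt(sum((x - mean) ** 2 for x in xs) / (n - 1))
    q1, med, q2 = (quantile(xs, a) for a in (0.25, 0.5, 0.75))
    m3 = sum(((x - mean) / s) ** 3 for x in xs) / n       # skewness
    m4 = sum(((x - mean) / s) ** 4 for x in xs) / n - 3   # kurtosis
    return {"mean": mean, "median": med, "s": s,
            "IQR": q2 - q1, "skewness": m3, "kurtosis": m4}

# The outlier 13 pulls the mean well above the median:
stats = describe([2.0, 3.0, 3.0, 4.0, 5.0, 13.0])
print(stats["mean"], stats["median"])   # 5.0 3.0
```

The example illustrates the outlier sensitivity of the mean discussed above: removing the value 13 would leave the median at 3 but shift the mean down to 3.4.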
2.3 Specific applications in music – univariate

2.3.1 Tempo curves

Figure 2.3 displays 28 tempo curves for performances of Schumann's Träumerei op. 15, No. 7, by 24 pianists. The names of the pianists and dates of the recordings (in brackets) are Martha Argerich (before 1983), Claudio Arrau (1974), Vladimir Ashkenazy (1987), Alfred Brendel (before 1980), Stanislav Bunin (1988), Sylvia Capova (before 1987), Alfred Cortot (1935, 1947 and 1953), Clifford Curzon (about 1955), Fanny Davies (1929), Jörg Demus (about 1960), Christoph Eschenbach (before 1966), Reine Gianoli (1974), Vladimir Horowitz (1947, before 1963 and 1965), Cyprien Katsaris (1980), Walter Klien (date unknown), André Krust (about 1960), Antonin Kubalek (1988), Benno Moisewitsch (about 1950), Elly Ney (about 1935), Guiomar Novaes (before 1954), Cristina Ortiz (before 1988), Artur Schnabel (1947), Howard Shelley (before 1990), Yakov Zak (about 1960).

Tempo is more likely to be varied in a relative rather than absolute way. For instance, a musician may play a certain passage twice as fast as the previous one, but may care less about the exact absolute tempo. This suggests consideration of the logarithm of tempo. Moreover, the main interest lies in comparing the shapes of the curves. Therefore, the plotted curves consist of standardized logarithmic tempo (each curve has sample mean zero and variance one).

Schumann's Träumerei is divided into four main parts, each consisting of about eight bars, the first two and the last one being almost identical (see Figure 2.1). Thus, the structure is: A, A', B, and A''. Already a very simple exploratory analysis reveals interesting features. For each pianist, we calculate the following statistics for the four parts respectively: x̄, M, s, Q1, Q2, m3 and m4. Figures 2.4a through e show a distinct pattern that corresponds to the division into A, A', B, and A''. Tempo is much lower in A'' and generally highest in B.
Also, A' seems to be played at a slightly slower tempo than A – though this distinction is not quite so clear (Figures 2.4a,b). Tempo is varied most towards the end and considerably less in the first half of the piece (Figure 2.4c). Skewness is generally negative, which is due to occasional extreme "ritardandi". This is most extreme in part B and, again, least pronounced in the first half of the piece (A, A'). A mirror image of this pattern, with most extreme positive values in B, is observed for kurtosis. This indicates that in B (and also in A''), most tempo values vary little around an average value, but occasionally extreme tempo changes occur. Also, for A, there are two outliers with an extremely negative skewness – these turn out to be Fanny Davies and Jörg Demus. Figures 2.4f through h show another interesting comparison of boxplots. In Figure 2.4f, the differences between the lower quartiles in A and A' for performances before 1965 are compared with those from performances recorded in 1965 or later. The clear difference indicates that, at least for the
Figure 2.3 Twenty-eight tempo curves of Schumann's Träumerei performed by 24 pianists. (For Cortot and Horowitz, three tempo curves were available.) [Plot of log(tempo) against onset time, one curve per performance.]

sample considered here, pianists of the "modern era" tend to make a much stronger distinction between A and A' in terms of slow tempi. The only exceptions (outliers in the left boxplot) are Moiseiwitsch and Horowitz's first performance, and Ashkenazy (outlier in the right boxplot). The comparison of skewness and kurtosis in Figures 2.4g and h also indicates that "modern" pianists seem to prefer occasional extreme ritardandi. The only exception in the "early 20th century group" is Artur Schnabel, with an extreme skewness of −2.47 and a kurtosis of 7.04. Direct comparisons of tempo distributions are shown in Figures 2.5a
Figure 2.4 Boxplots of descriptive statistics for the 28 tempo curves in Figure 2.3.

through f. The following observations can be made: a) compared to Demus (quantiles on the horizontal axis), Ortiz has a few relatively extreme slow tempi (Figure 2.5a); b) similarly, but in a less extreme way, Cortot's interpretation includes occasional extremely slow tempo values (Figure 2.5b); c) Ortiz and Argerich have practically the same (marginal) distribution (Figure 2.5c); d) Figure 2.5d is similar to 2.5a and b, though less extreme; e) the tempo distribution of Cortot's performance (Figure 2.5e) did not change much in 1947 compared to 1935; f) similarly, Horowitz's tempo distributions in 1947 and 1963 are almost the same, except for slight changes for very low tempi (Figure 2.5f).

Figure 2.5 q-q-plots of several tempo curves (from Figure 2.3). [Panels: 2.5a Demus (1960) vs. Ortiz (1988); 2.5b Demus (1960) vs. Cortot (1935); 2.5c Ortiz (1988) vs. Argerich (1983); 2.5d Demus (1960) vs. Krust (1960); 2.5e Cortot (1935) vs. Cortot (1947); 2.5f Horowitz (1947) vs. Horowitz (1963).]

2.3.2 Notes modulo 12

In most classical music, a central tone around which notes "fluctuate" can be identified, and a small selected number of additional notes or chords (often triads) play a special role. For instance, from about 400 to 1500 A.D., music was mostly written using so-called modes. The main notes
were the first one (finalis, the "final note") and the fifth note of the scale (dominant). The system of 12 major and 12 minor scales was developed later, adding more flexibility with respect to modulation and scales. The main "representatives" of a major/minor scale are three triads, obtained by "adding" thirds, starting at the basic note corresponding to the first (tonic), fourth (subdominant) and fifth (dominant) note of the scale respectively. Other triads are also – but to a lesser degree – associated with the properties "tonic", "subdominant" and/or "dominant". In the 20th century, and partially already in the late 19th century, other systems of scales as well as systems that do not rely on any specific scales were proposed (in particular 12-tone music).

Figure 2.6 Frequencies of notes 0, 1, ..., 11 for moving windows of onset-length 16. [Panels: 2.6a J.S. Bach, Fugue 1; 2.6b W.A. Mozart, KV 545; 2.6c R. Schumann, op. 15/2; 2.6d R. Schumann, op. 15/3. Horizontal axis: (notes − tonic) mod 12; windows are notes number i, i in [1+j, 16+j].]

A very simple illustration of this development can be obtained by counting the frequencies of notes (pitches) in the following way: consider a score in equal temperament. Ignoring transposition by octaves, we can represent all notes x(t1), ..., x(tn) by the integers 0, 1, ..., 11. Here, t1 ≤ t2 ≤ ... ≤ tn
[Figure panels 2.7a–d: frequencies of notes number i, i in [1+j, 16+j] (j = 0, ..., 64), plotted against (Notes − Tonic) mod 12, for a) A. Scriabin – op. 51/2, b) A. Scriabin – op. 51/4, c) F. Martin – Prélude 6, d) F. Martin – Prélude 7.]

Figure 2.7 Frequencies of notes 0, 1, ..., 11 for moving windows of onset-length 16.

denote the score-onset times of the notes. To make different compositions comparable, the notes are centered by subtracting the central note, which is defined to be the most frequent note. Given a prespecified integer k (in our case k = 16), we calculate the relative frequencies

p_j(x) = (2k+1)^{-1} \sum_{i=j}^{j+2k} 1\{x(t_i) = x\}

where 1\{x(t_i) = x\} = 1 if x(t_i) = x and zero otherwise, and j = 1, 2, ..., n - 2k. This means that we calculate the distribution of notes for a moving window of 2k + 1 notes. Figures 2.6a through d and 2.7a through d display the distributions p_j(x) (j = 4, 8, ..., 64) for the following compositions: Fugue 1 from "Das Wohltemperierte Klavier I" by J.S. Bach (1685–1750), Sonata KV 545 (first movement) by W.A. Mozart (1756–1791; Figure 2.8), Kinderszenen No. 2 and 3 by R. Schumann (1810–1856; Figure 2.9), Préludes op. 51, No. 2 and 4 by A. Scriabin (1872–1915), and Préludes No.
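The moving-window frequencies just defined translate directly into code. The following is a minimal sketch (not the code used to produce the figures): the function name, the small k, and the toy pitch sequence are invented for illustration; the figures themselves use k = 16.

```python
# Sketch: moving-window pitch-class frequencies p_j(x), with pitches reduced
# mod 12 and centered at the most frequent pitch class (the "central note").
from collections import Counter

def moving_window_frequencies(pitches, k=16):
    """Return a list of dicts p_j with p_j[x] = relative frequency of pitch
    class x among the 2k+1 notes x(t_j), ..., x(t_{j+2k})."""
    classes = [p % 12 for p in pitches]
    central = Counter(classes).most_common(1)[0][0]
    centered = [(c - central) % 12 for c in classes]
    n = len(centered)
    out = []
    for j in range(n - 2 * k):               # j = 1, ..., n - 2k (0-based here)
        window = centered[j:j + 2 * k + 1]   # 2k + 1 notes
        counts = Counter(window)
        out.append({x: counts.get(x, 0) / (2 * k + 1) for x in range(12)})
    return out

# toy example: a C major arpeggio pattern (MIDI-like pitch numbers)
freqs = moving_window_frequencies([60, 64, 67, 72, 64, 60, 67, 64] * 10, k=3)
```

Each dictionary in `freqs` is one window's distribution; plotting them against (note − tonic) mod 12 gives curves of the kind shown in Figures 2.6 and 2.7.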
Figure 2.8 Johannes Chrysostomus Wolfgangus Theophilus Mozart (1756–1791) in the house of Salomon Gessner in Zurich. (Courtesy of Zentralbibliothek Zürich.)
Figure 2.9 R. Schumann (1810–1856) – lithography by H. Bodmer. (Courtesy of Zentralbibliothek Zürich.)
6 and 7 by F. Martin (1890–1971). For each j = 4, 8, ..., 64, the frequencies p_j(0), ..., p_j(11) are joined by lines respectively. The obvious common feature for Bach, Mozart, and Schumann is a distinct preference (local maximum) for the notes 5 and 7 (apart from 0). Note that if 0 is the root of the tonic triad, then 5 corresponds to the root of the subdominant triad. Similarly, 7 is the root of the dominant triad. Also relatively frequent are the notes 3 = minor third (second note of the tonic triad in minor) and 10 = minor seventh, which is the fourth note of the dominant seventh chord to the subdominant. Also note that, for Schumann, the local maxima are somewhat less pronounced. A different pattern can be observed for Scriabin and even more for Martin. In Scriabin's Prélude op. 51/2, the perfect fifth almost never occurs, but instead the major sixth is very frequent. In Scriabin's Prélude op. 51/4, the tonal system is dissolved even further, as the clearly dominating note is 6, which builds together with 0 the augmented fourth (or diminished fifth) – an interval that is considered highly dissonant in tonal music. Nevertheless, even in Scriabin's compositions, the distribution of notes does not change very rapidly, since the sixteen overlaid curves are almost identical. This may indicate that the notion of scales or a slow harmonic development still plays a role. In contrast, in Frank Martin's Prélude No. 6, the distribution changes very quickly. This is hardly surprising, since Martin's style incorporates, among other influences, dodecaphonism (12-tone music) – a compositional technique that does not impose traditional restrictions on the harmonic structure.

2.4 Some descriptive statistics and plots for bivariate data

2.4.1 Definitions

We give a short overview of important descriptive concepts for bivariate data. For a comprehensive treatment we refer the reader to the standard textbooks given above (also see e.g. Plackett 1960, Ryan 1996, Srivastava and Sen 1997, Draper and Smith 1998, and Rao 1973 for basic theoretical results).

Correlation

If each observation consists of a pair of measurements (x_i, y_i), then the main objective is to investigate the relationship between x and y. Consider, for example, the case where both variables are quantitative. The data can then be displayed in a scatter plot (y versus x). Useful statistics are Pearson's sample correlation

r = \frac{1}{n}\sum_{i=1}^{n}\left(\frac{x_i - \bar{x}}{s_x}\right)\left(\frac{y_i - \bar{y}}{s_y}\right) = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2 \sum_{i=1}^{n}(y_i - \bar{y})^2}}   (2.3)
where s_x^2 = n^{-1}\sum_{i=1}^{n}(x_i - \bar{x})^2 and s_y^2 = n^{-1}\sum_{i=1}^{n}(y_i - \bar{y})^2, and Spearman's rank correlation

r_{Sp} = \frac{1}{n}\sum_{i=1}^{n}\left(\frac{u_i - \bar{u}}{s_u}\right)\left(\frac{v_i - \bar{v}}{s_v}\right) = \frac{\sum_{i=1}^{n}(u_i - \bar{u})(v_i - \bar{v})}{\sqrt{\sum_{i=1}^{n}(u_i - \bar{u})^2 \sum_{i=1}^{n}(v_i - \bar{v})^2}}   (2.4)

where u_i denotes the rank of x_i among the x-values and v_i is the rank of y_i among the y-values. In (2.3) and (2.4) it is assumed that s_x, s_y, s_u, and s_v are not zero. Recall that these definitions imply the following properties: a) -1 \le r, r_{Sp} \le 1; b) r = 1 if and only if y_i = \beta_o + \beta_1 x_i and \beta_1 > 0 (exact linear relationship with positive slope); c) r = -1 if and only if y_i = \beta_o + \beta_1 x_i and \beta_1 < 0 (exact linear relationship with negative slope); d) r_{Sp} = 1 if and only if x_i > x_j implies y_i > y_j (strictly monotonically increasing relationship); e) r_{Sp} = -1 if and only if x_i > x_j implies y_i < y_j (strictly monotonically decreasing relationship); f) r measures the strength (and sign) of the linear relationship; g) r_{Sp} measures the strength (and sign) of monotonicity; h) if the data are realizations of a bivariate random variable (X, Y), then r is an estimate of the population correlation \rho = cov(X, Y)/\sqrt{var(X)var(Y)}, where cov(X, Y) = E[XY] - E[X]E[Y], var(X) = cov(X, X) and var(Y) = cov(Y, Y). When using these measures of dependence one should bear in mind that each of them measures a specific type of dependence only, namely linear and monotonic dependence respectively. Thus, a Pearson or Spearman correlation near or equal to zero does not necessarily mean independence. Note also that correlation can be interpreted in a geometric way as follows: defining the n-dimensional vectors x = (x_1, ..., x_n)^t and y = (y_1, ..., y_n)^t, r is equal to the standardized scalar product between x and y, and is therefore equal to the cosine of the angle between these two vectors.

A special type of correlation is interesting for time series. Time series are data that are taken in a specific ordered (usually temporal) sequence.
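Formulas (2.3) and (2.4) translate directly into code. The following is a minimal sketch (function names and toy data are ours, and ties among the ranks are ignored for simplicity):

```python
# Sketch: Pearson's r (2.3) and Spearman's rank correlation (2.4),
# written directly from the formulas (no libraries, no tie handling).
import math

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mx) ** 2 for a in x) *
                    sum((b - my) ** 2 for b in y))
    return num / den

def ranks(x):
    # rank 1 for the smallest value (ties ignored in this toy version)
    order = sorted(range(len(x)), key=lambda i: x[i])
    r = [0] * len(x)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

def spearman(x, y):
    return pearson(ranks(x), ranks(y))

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [1.0, 4.0, 9.0, 16.0, 25.0]   # strictly increasing but nonlinear
```

On these data `spearman(x, y)` equals 1 exactly (property d), while `pearson(x, y)` is below 1, illustrating the difference between monotonic and linear dependence.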
If Y_1, Y_2, ..., Y_n are random variables observed at time points i = 1, ..., n, then one would like to know whether there is any linear dependence between observations Y_i and Y_{i-k}, i.e. between observations that are k time units apart. If this dependence is the same for all time points i, and the expected value of Y_i is constant, then the corresponding population correlation can be written as a function of k only (see Chapter 4),

\frac{cov(Y_i, Y_{i+k})}{\sqrt{var(Y_i)var(Y_{i+k})}} = \rho(k)   (2.5)

and a simple estimate of \rho(k) is the sample autocorrelation (acf)

\hat{\rho}(k) = \frac{1}{n}\sum_{i=1}^{n-k}\left(\frac{y_i - \bar{y}}{s}\right)\left(\frac{y_{i+k} - \bar{y}}{s}\right)   (2.6)

where s^2 = n^{-1}\sum_{i=1}^{n}(y_i - \bar{y})^2. Note that here summation stops at
n - k, because no data are available beyond (n - k) + k = n. For large lags (large compared to n), \hat{\rho}(k) is not a very precise estimate, since there are only very few pairs that are k time units apart.

The definition of \rho(k) and \hat{\rho}(k) can be extended to multivariate time series, taking into account that dependence between different components of the series may be delayed. For instance, for a bivariate time series (X_i, Y_i) (i = 1, 2, ...), one considers lag-k sample cross-correlations

\hat{\rho}_{XY}(k) = \frac{1}{n}\sum_{i=1}^{n-k}\left(\frac{x_i - \bar{x}}{s_X}\right)\left(\frac{y_{i+k} - \bar{y}}{s_Y}\right)   (2.7)

as estimates of the population cross-correlations

\rho_{XY}(k) = \frac{cov(X_i, Y_{i+k})}{\sqrt{var(X_i)var(Y_{i+k})}}   (2.8)

where s_X^2 = n^{-1}\sum(x_i - \bar{x})^2 and s_Y^2 = n^{-1}\sum(y_i - \bar{y})^2. If |\hat{\rho}_{XY}(k)| is high, then there is a strong linear dependence between X_i and Y_{i+k}.

Regression

In addition to measuring the strength of dependence between two variables, one is often interested in finding an explicit functional relationship. For instance, it may be possible to express the response variable y in terms of an explanatory variable x by y = g(x, \varepsilon), where \varepsilon is a variable representing the part of y that is unexplained. More specifically, we may have, for example, an additive relationship y = g(x) + \varepsilon or a multiplicative equation y = g(x)e^{\varepsilon}. The simplest relationship is given by the simple linear regression equation

y = \beta_o + \beta_1 x + \varepsilon   (2.9)

where \varepsilon is assumed to be a random variable with E(\varepsilon) = 0 (and usually finite variance \sigma^2 = var(\varepsilon) < \infty). Thus, the data are y_i = \beta_o + \beta_1 x_i + \varepsilon_i (i = 1, ..., n), where the \varepsilon_i's are generated by the same zero-mean distribution. Often the \varepsilon_i's are also assumed to be uncorrelated or even independent – this is however not a necessary assumption. An obvious estimate of the unknown parameters \beta_o and \beta_1 is obtained by minimizing the total sum of squared errors

SSE = SSE(b_o, b_1) = \sum_{i=1}^{n}(y_i - b_o - b_1 x_i)^2 = \sum_{i=1}^{n} r_i^2(b_o, b_1)   (2.10)

with respect to b_o, b_1.
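Returning briefly to time series, the sample autocorrelation (2.6) and cross-correlation (2.7) can be sketched as follows; the function names and the toy series are invented for the example:

```python
# Sketch: sample autocorrelation (2.6) and cross-correlation (2.7),
# with s^2, s_X^2, s_Y^2 the sample variances.
import math

def acf(y, k):
    n = len(y)
    ybar = sum(y) / n
    s2 = sum((v - ybar) ** 2 for v in y) / n
    return sum((y[i] - ybar) * (y[i + k] - ybar)
               for i in range(n - k)) / (n * s2)

def ccf(x, y, k):
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sx = math.sqrt(sum((v - xbar) ** 2 for v in x) / n)
    sy = math.sqrt(sum((v - ybar) ** 2 for v in y) / n)
    return sum((x[i] - xbar) * (y[i + k] - ybar)
               for i in range(n - k)) / (n * sx * sy)

y = [1.0, -1.0] * 50    # period-2 oscillation
```

For this alternating series the lag-1 autocorrelation is close to -1 and the lag-2 autocorrelation close to +1, as one would expect; note also how the factor 1/n (rather than 1/(n - k)) shrinks \hat{\rho}(k) toward zero for large lags.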
The solution is found by setting the partial derivatives with respect to b_o and b_1 equal to zero. A more elegant way to find the solution is obtained by interpreting the problem geometrically: defining the n-dimensional vectors 1 = (1, ..., 1)^t, b = (b_o, b_1)^t and the n \times 2 matrix X with columns 1 and x, we have

SSE = ||y - b_o 1 - b_1 x||^2 = ||y - Xb||^2
where ||\cdot|| denotes the euclidean norm, or length, of a vector. It is then clear that SSE is minimized by the orthogonal projection of y on the plane spanned by 1 and x. The estimate of \beta = (\beta_o, \beta_1)^t is therefore

\hat{\beta} = (\hat{\beta}_o, \hat{\beta}_1)^t = (X^t X)^{-1} X^t y   (2.11)

and the projection – which is the vector of estimated values \hat{y}_i – is given by

\hat{y} = (\hat{y}_1, ..., \hat{y}_n)^t = X(X^t X)^{-1} X^t y   (2.12)

Defining the measure of the total variability of y, SST = ||y - \bar{y}1||^2 (total sum of squares), and the quantities SSR = ||\hat{y} - \bar{y}1||^2 (regression sum of squares = variability due to the fact that the fitted line is not horizontal) and SSE = ||y - \hat{y}||^2 (error sum of squares, variability unexplained by the regression line), we have by Pythagoras

SST = SSR + SSE   (2.13)

The proportion of variability "explained" by the regression line \hat{y} = \hat{\beta}_o + \hat{\beta}_1 x is therefore

R^2 = \frac{\sum_{i=1}^{n}(\hat{y}_i - \bar{y})^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2} = \frac{||\hat{y} - \bar{y}1||^2}{||y - \bar{y}1||^2} = \frac{SSR}{SST} = 1 - \frac{SSE}{SST}.   (2.14)

By definition, 0 \le R^2 \le 1, and R^2 = 1 if and only if \hat{y}_i = y_i (i.e. all points are on the regression line). Moreover, for simple regression we also have R^2 = r^2. The advantage of defining R^2 as above (instead of via r^2) is that the definition remains valid for the multiple regression model (see below), i.e. when several explanatory variables are available. Finally, note that an estimate of \sigma^2 is obtained by \hat{\sigma}^2 = (n - 2)^{-1}\sum r_i^2(\hat{\beta}_o, \hat{\beta}_1).

In analogy to the sample mean and the sample variance, the least squares estimates of the regression parameters are sensitive to the presence of outliers. Outliers in regression can occur in the y-variable as well as in the x-variable. The latter are also called influential points. Outliers may often be correct and in fact very interesting observations (e.g. telling us that the assumed model may not be correct).
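The matrix formulas (2.11), (2.12), and (2.14) can be sketched as follows; this is a minimal illustration with simulated data, not code from the book:

```python
# Sketch: least squares via the normal equations and R^2 = 1 - SSE/SST.
import numpy as np

def ols(x, y):
    X = np.column_stack([np.ones(len(x)), x])   # columns 1 and x
    beta = np.linalg.solve(X.T @ X, X.T @ y)    # (X^t X)^{-1} X^t y, eq. (2.11)
    yhat = X @ beta                             # orthogonal projection, (2.12)
    sst = np.sum((y - y.mean()) ** 2)
    sse = np.sum((y - yhat) ** 2)
    return beta, 1.0 - sse / sst                # R^2, eq. (2.14)

rng = np.random.default_rng(0)
x = np.arange(20, dtype=float)
y = 2.0 + 0.5 * x + 0.1 * rng.standard_normal(20)   # true beta_o=2, beta_1=0.5
beta, r2 = ols(x, y)
```

With the small noise level used here, the estimates land close to the true intercept and slope and R^2 is close to one. The same function handles multiple regression unchanged if further columns are appended to X.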
However, since least squares estimates are highly influenced by outliers, it is often difficult to notice that there may be a problem, since the fitted curve tends to lie close to the outliers. Alternative, robust estimates can be helpful in such situations (see Huber 1981, Hampel et al. 1986). For instance, instead of minimizing the residual sum of squares we may minimize \sum \rho(r_i) where \rho is a bounded function. If \rho is differentiable, then the solution can usually also be found by solving the equations

\sum_{i=1}^{n} \rho'\left(\frac{r_i}{\hat{\sigma}}\right) \frac{\partial}{\partial b_j} r_i(b) = 0   (j = 0, ..., p)   (2.15)

where \hat{\sigma}^2 is a robust estimate of \sigma^2 obtained from an additional equation and p is the number of explanatory variables. This leads to estimates that
are (up to a certain degree) robust with respect to outliers in y, not however with respect to influential points (outliers in x). To control the effect of influential points one can, for instance, solve a set of equations

\sum_{i=1}^{n} \psi_j\left(\frac{r_i}{\hat{\sigma}}, x_i\right) = 0   (j = 0, ..., p)   (2.16)

where \psi is such that it downweighs outliers in x as well. For a comprehensive theory of robustness see e.g. Huber (1981) and Hampel et al. (1986). For more recent, efficient and highly robust methods see Yohai (1987), Rousseeuw and Yohai (1984), Gervini and Yohai (2002), and references therein.

The results for simple linear regression can be extended easily to the case where more than one explanatory variable is available. The multiple linear regression model with p explanatory variables is defined by y = \beta_o + \beta_1 x_1 + ... + \beta_p x_p + \varepsilon. For data we write y_i = \beta_o + \beta_1 x_{i1} + ... + \beta_p x_{ip} + \varepsilon_i (i = 1, ..., n). Note that the word "linear" refers to linearity in the parameters \beta_o, ..., \beta_p. The function itself can be nonlinear. For instance, we may have polynomial regression with y = \beta_o + \beta_1 x + ... + \beta_p x^p + \varepsilon. The same geometric arguments as above apply, so that (2.11) and (2.12) hold with \beta = (\beta_o, ..., \beta_p)^t and the n \times (p+1) matrix X = (x^{(1)}, ..., x^{(p+1)}) with columns x^{(1)} = 1 and x^{(j+1)} = x_j = (x_{1j}, ..., x_{nj})^t (j = 1, ..., p).

Regression smoothing

A more general, but more difficult, approach to modeling a functional relationship is to impose less restrictive assumptions on the function g. For instance, we may assume

y = g(x) + \varepsilon   (2.17)

with g being a twice continuously differentiable function. Under suitable additional conditions on x and \varepsilon it is then possible to estimate g from observed data by nonparametric smoothing. As a special example consider observations y_i taken at time points i = 1, 2, ..., n. A standard model is

y_i = g(t_i) + \varepsilon_i   (2.18)

where t_i = i/n, and the \varepsilon_i are independent identically distributed (iid) random variables with E(\varepsilon_i) = 0 and \sigma^2 = var(\varepsilon_i) < \infty.
The reason for using standardized time t_i \in [0, 1] is that this way g is observed on an increasingly fine grid. This makes it possible to ultimately estimate g(t) for all values of t by using neighboring values t_i, provided that g is not too "wild". A simple estimate of g can be obtained, for instance, by a weighted average (kernel smoothing)

\hat{g}(t) = \sum_{i=1}^{n} w_i y_i   (2.19)
with suitable weights w_i \ge 0, \sum w_i = 1. For example, one may use the Nadaraya-Watson weights

w_i = w_i(t; b, n) = \frac{K(\frac{t - t_i}{b})}{\sum_{j=1}^{n} K(\frac{t - t_j}{b})}   (2.20)

with b > 0, and a kernel function K \ge 0 such that K(u) = K(-u), K(u) = 0 (|u| > 1) and \int_{-1}^{1} K(u)\,du = 1. The role of b is to restrict observations that influence the estimate to a small window of neighboring time points. For instance, the rectangular kernel K(u) = \frac{1}{2} 1\{|u| \le 1\} yields the sample mean of observations y_i in the "window" n(t - b) \le i \le n(t + b). An even more elegant formula can be obtained by approximating the Riemann sum \frac{1}{nb}\sum_{j=1}^{n} K(\frac{t - t_j}{b}) by the integral \int_{-1}^{1} K(u)\,du = 1:

\hat{g}(t) = \sum_{i=1}^{n} w_i y_i = \frac{1}{nb}\sum_{i=1}^{n} K\left(\frac{t - t_i}{b}\right) y_i   (2.21)

In this case, the sum of the weights is not exactly equal to one, but asymptotically (as n \to \infty and b \to 0 such that nb^3 \to \infty) this error is negligible. It can be shown that, under fairly general conditions on g and \varepsilon, \hat{g} converges to g, in a certain sense that depends on the specific assumptions (see e.g. Gasser and Müller 1979, Gasser and Müller 1984, Härdle 1991, Beran and Feng 2002, Wand and Jones 1995, and references therein).

An alternative to kernel smoothing is local polynomial fitting (Fan and Gijbels 1995, 1996; also see Feng 1999). The idea is to fit a polynomial locally, i.e. to data in a small neighborhood of the point of interest. This can be formulated as a weighted least squares problem as follows:

\hat{g}(t) = \hat{\beta}_o   (2.22)

where \hat{\beta} = (\hat{\beta}_o, \hat{\beta}_1, ..., \hat{\beta}_p)^t solves a local least squares problem defined by

\hat{\beta} = \arg\min_{a} \sum_{i=1}^{n} K\left(\frac{t_i - t}{b}\right) r_i^2(a).   (2.23)

Here r_i = y_i - [a_o + a_1(t_i - t) + ... + a_p(t_i - t)^p], K is a kernel as above, and b > 0 is the bandwidth defining the window of neighboring observations. It can be shown that asymptotically, a local polynomial smoother can be written as a kernel estimator (Ruppert and Wand 1994). A difference only
occurs at the borders (t close to 0 or 1) where, in contrast to the local polynomial estimate, the kernel smoother has to be modified. The reason is that observations are no longer symmetrically spaced in the window [t - b, t + b]. A major advantage of local polynomials is that they automatically provide estimates of derivatives, namely \hat{g}'(t) = \hat{\beta}_1, \hat{g}''(t) = 2\hat{\beta}_2, etc. Kernel smoothing can also be used for estimation of derivatives; however, different (and rather complicated) kernels have to be used for each derivative (Gasser and Müller 1984, Gasser et al. 1985). A third alternative, so-called wavelet
thresholding, will not be discussed here (see e.g. Daubechies 1992, Donoho and Johnstone 1995, 1998, Donoho et al. 1995, 1996, Vidakovic 1999, Percival and Walden 2000, and references therein). A related method based on wavelets is discussed in Chapter 5.

Smoothing of two-dimensional distributions, sharpening

Estimating a relationship between x and y (where x and y are realizations of random variables X and Y respectively) amounts to estimating the joint two-dimensional distribution function F(x, y) = P(X \le x, Y \le y). For continuous variables with F(x, y) = \int_{u \le x}\int_{v \le y} f(u, v)\,du\,dv, the density function f can be estimated, for instance, by a two-dimensional histogram. For visual and theoretical reasons, a better estimate is obtained by kernel estimation (see e.g. Silverman 1986) defined by

\hat{f}(x, y) = \frac{1}{nb_1 b_2}\sum_{i=1}^{n} K(x_i - x, y_i - y; b_1, b_2)   (2.24)

where the kernel K is such that K(u, v) = K(-u, v) = K(u, -v) \ge 0, and \int\int K(u, v)\,du\,dv = 1. Usually, b_1 = b_2 = b and K(u, v) has compact support. Examples of kernels are K(u, v) = \frac{1}{4} 1\{|u| \le 1\} 1\{|v| \le 1\} (rectangular kernel with rectangular support), K(u, v) = \pi^{-1} 1\{u^2 + v^2 \le 1\} (rectangular kernel with circular support), K(u, v) = 2\pi^{-1}[1 - u^2 - v^2] 1\{u^2 + v^2 \le 1\} (Epanechnikov kernel with circular support), or K(u, v) = (2\pi)^{-1}\exp[-\frac{1}{2}(u^2 + v^2)] (normal density kernel with infinite support). In analogy to one-dimensional density estimation, it can be shown that under mild regularity conditions, \hat{f}(x, y) is a consistent estimate of f(x, y), provided that b_1, b_2 \to 0, and nb_1, nb_2 \to \infty.
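As a minimal sketch of (2.24), the following computes \hat{f}(x, y) with the Epanechnikov kernel on circular support; the function name and toy data are invented for illustration:

```python
# Sketch: two-dimensional kernel density estimate (2.24) with the
# Epanechnikov kernel K(u,v) = (2/pi)(1 - u^2 - v^2) on u^2 + v^2 <= 1.
import math

def kde2d(data, x, y, b1=1.0, b2=1.0):
    total = 0.0
    for xi, yi in data:
        u, v = (xi - x) / b1, (yi - y) / b2
        if u * u + v * v <= 1.0:
            total += (2.0 / math.pi) * (1.0 - u * u - v * v)
    return total / (len(data) * b1 * b2)

# toy data: a small cluster near the origin plus one distant point
pts = [(0.0, 0.0), (0.1, -0.1), (-0.1, 0.1), (3.0, 3.0)]
```

The estimate is largest near the cluster, smaller near the isolated point, and exactly zero far away from all data, reflecting the compact support of the kernel. The same \hat{f} is what sharpening (described below) thresholds.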
Graphical representations of two-dimensional distribution functions are:

• 3-dimensional perspective plot: z = f(x, y) (or \hat{f}(x, y)) is plotted against x and y;
• contour plot: like in a geographic map, curves corresponding to equal levels of f are drawn in the x-y-plane;
• image plot: coloring of the x-y-plane, with the color at point (x, y) corresponding to the value of f.

A simple way of enhancing the visual understanding of scatterplots is so-called sharpening (Tukey and Tukey 1981; also see Chambers et al. 1983): for given numbers a and b, only points with a \le \hat{f}(x, y) \le b are drawn in the scatterplot. Alternatively, one may plot all points and highlight points with a \le \hat{f}(x, y) \le b.

Interpolation

Often a process may be generated in continuous time, but is observed at discrete time points. One may then wish to guess the values of the points
in between. Kernel and local polynomial smoothing provide this possibility, since \hat{g}(t) can be calculated for any t \in (0, 1). Alternatively, if the observations are assumed to be completely without "error", i.e. y_i = g(t_i), then deterministic interpolation can be used. The most popular method is spline interpolation. For instance, cubic splines connect neighboring observed values y_{i-1}, y_i by cubic polynomials such that the first and second derivatives at the endpoints t_{i-1}, t_i are equal. For observations y_1, ..., y_n at equidistant time points t_i with t_i - t_{i-1} = t_j - t_{j-1} = \Delta t (i, j = 1, ..., n), we have n - 1 polynomials

p_i(t) = a_i + b_i(t - t_i) + c_i(t - t_i)^2 + d_i(t - t_i)^3   (i = 1, ..., n - 1)   (2.25)

To achieve smoothness at the points t_i where two polynomials p_{i-1}, p_i meet, one imposes the condition that the polynomials and their first two derivatives are equal at t_i. This, together with the conditions p_i(t_i) = y_i, leads to a system of 3(n - 2) + n = 4(n - 1) - 2 equations for the 4(n - 1) parameters a_i, b_i, c_i, d_i (i = 1, ..., n - 1). To specify a unique solution one therefore needs two additional conditions at the border. A typical assumption is p''(t_1) = p''(t_n) = 0, which defines so-called natural splines. Cubic splines have a physical meaning, since these are the curves that form when a thin rod is forced to pass through n knots (in our case the knots are t_1, ..., t_n), corresponding to minimum strain energy. The term "spline" refers to the thin flexible rods that were used in the past by draftsmen to draw smooth curves in ship design. In spite of their "natural" meaning, interpolation splines (and similarly other methods of interpolation) can be problematic, since the interpolated values may be highly dependent on the specific method of interpolation and are therefore purely hypothetical unless the aim is indeed to build a ship.
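For equidistant knots, the natural cubic spline can be computed by solving a tridiagonal system for the second derivatives at the knots. The following sketch assumes equal spacing and the natural boundary conditions; it is an illustration of the construction, not the book's implementation:

```python
# Sketch: natural cubic spline interpolation at equidistant knots.
# M_i = p''(t_i) solve M_{i-1} + 4 M_i + M_{i+1} = 6 (y_{i-1} - 2 y_i + y_{i+1}) / h^2,
# with the "natural" conditions M_1 = M_n = 0.
import numpy as np

def natural_spline(t, y):
    n = len(t)
    h = t[1] - t[0]                      # equidistant spacing assumed
    A = np.zeros((n, n))
    rhs = np.zeros(n)
    A[0, 0] = A[-1, -1] = 1.0            # natural boundary: M_1 = M_n = 0
    for i in range(1, n - 1):
        A[i, i - 1], A[i, i], A[i, i + 1] = 1.0, 4.0, 1.0
        rhs[i] = 6.0 * (y[i - 1] - 2.0 * y[i] + y[i + 1]) / h ** 2
    M = np.linalg.solve(A, rhs)

    def s(x):
        # locate the interval [t_i, t_{i+1}] containing x, then evaluate
        i = min(max(int((x - t[0]) // h), 0), n - 2)
        a, b = t[i], t[i + 1]
        return (M[i] * (b - x) ** 3 + M[i + 1] * (x - a) ** 3) / (6 * h) \
             + (y[i] / h - M[i] * h / 6) * (b - x) \
             + (y[i + 1] / h - M[i + 1] * h / 6) * (x - a)
    return s

t = np.arange(6, dtype=float)
y = np.array([0.0, 1.0, 0.0, -1.0, 0.0, 1.0])
s = natural_spline(t, y)     # s(t_i) reproduces y_i at every knot
```

Between the knots, s is a cubic polynomial with continuous first and second derivatives, which is exactly the smoothness condition described above.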
Splines can also be used for smoothing purposes by removing the restriction that the curve has to go through all observed points. More specifically, one looks for a function \hat{g}(t) such that

V(\lambda) = \sum_{i=1}^{n}(y_i - \hat{g}(t_i))^2 + \lambda \int_{-\infty}^{\infty}[\hat{g}''(t)]^2\,dt   (2.26)

is minimized. The parameter \lambda > 0 controls the smoothness of the resulting curve. For small values of \lambda, the fitted curve will be rather rough but close to the data; for large values more smoothness is achieved but the curve is, in general, not as close to the data. The question of which \lambda to choose reflects a standard dilemma in statistical smoothing: one needs to balance the aim of achieving a small bias (\lambda small) against the aim of a small variance (\lambda large). For a given value of \lambda, the solution to the minimization problem above turns out to be a natural cubic spline (see Reinsch 1967; also see Wahba 1990 and references therein). The solution can also be written as a kernel smoother with a kernel function K(u) proportional
to \exp(-|u|/\sqrt{2})\sin(\pi/4 + |u|/\sqrt{2}) and a bandwidth b proportional to \lambda^{1/4} (Silverman 1986). If t_i = i/n, then the bandwidth is exactly equal to \lambda^{1/4}.

Statistical inference

In this section, correlation, linear regression, nonparametric smoothing, and interpolation were introduced in an informal way, without an exact discussion of probabilistic assumptions and statistical inference. All these techniques can be used in an informal way to explore possible structures without specific model assumptions. Sometimes, however, one wishes to obtain more solid conclusions by statistical tests and confidence intervals. There is an enormous literature on statistical inference in regression, including nonparametric approaches. For selected results see the references given above. For nonparametric methods also see Wand and Jones (1995), Simonoff (1996), Bowman and Azzalini (1997), Eubank (1999), and references therein.

2.5 Specific applications in music – bivariate

2.5.1 Empirical tempo-acceleration

Consider the tempo curves in Figure 2.3. An approximate measure of tempo-acceleration may be defined by

a(t_i) = \frac{\Delta^2 y(t_i)}{\Delta^2 t} = \frac{[y(t_i) - y(t_{i-1})] - [y(t_{i-1}) - y(t_{i-2})]}{[t_i - t_{i-1}][t_{i-1} - t_{i-2}]}   (2.27)

where y(t) is the tempo (or log-tempo) at time t. Figures 2.10a through f show a(t) for the three performances each by Cortot and Horowitz. From the pictures it is not quite easy to see how far there are similarities or differences. Consider now the pairs (a_j(t_i), a_l(t_i)), where a_j, a_l are acceleration measurements of performances j and l respectively. We calculate the sample correlations for each pair (j, l) \in \{1, ..., 28\} \times \{1, ..., 28\} (j \ne l). Figure 2.11a shows the correlations between Cortot 1 (1935) and the other performances. As expected, Cortot correlates best with Cortot: the correlation between Cortot 1 and Cortot's other two performances (1947, 1953) is clearly highest. The analogous observation can be made for Horowitz 1 (1947) (Figure 2.11b).
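The acceleration measure (2.27) can be sketched as follows; for equidistant onsets with spacing \Delta t it reduces to the usual discrete second derivative (y_i - 2y_{i-1} + y_{i-2})/(\Delta t)^2. Function name and toy data are invented for the example:

```python
# Sketch: discrete tempo "acceleration" as a second difference of the tempo
# curve, divided by the product of the two consecutive onset-time spacings.
def acceleration(t, y):
    a = []
    for i in range(2, len(y)):
        d2y = (y[i] - y[i - 1]) - (y[i - 1] - y[i - 2])
        a.append(d2y / ((t[i] - t[i - 1]) * (t[i - 1] - t[i - 2])))
    return a

t = [0.0, 0.5, 1.0, 1.5, 2.0]
y = [1.0, 1.25, 2.0, 3.25, 5.0]   # y = 1 + t^2 sampled at these onsets
```

For this quadratic toy tempo curve the discrete acceleration is constant and equal to 2, matching the second derivative of 1 + t^2.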
Also interesting is to compare how much overall resemblance there is between a selected performance and the other performances. For each of the 28 performances, the average and the maximal correlation with the other performances were calculated. Figures 2.11c and d indicate that, in terms of acceleration, Cortot's style appears to be quite unique among the pianists considered here. The overall (average and maximal) similarity between each of his three acceleration curves and the other performances is much smaller than for any other pianist.
[Figure panels 2.10a–f: acceleration a(t) versus onset time t for a) Cortot (1935), b) Cortot (1947), c) Cortot (1953), d) Horowitz (1947), e) Horowitz (1963), f) Horowitz (1965).]

Figure 2.10 Acceleration of tempo curves for Cortot and Horowitz.

2.5.2 Interpolated and smoothed tempo curves – velocity and acceleration

Conceptually it is plausible to assume that musicians control tempo in continuous time. The measure of acceleration given above is therefore a rather crude estimate of the actual acceleration curve. Interpolation splines provide a simple possibility to "guess" the tempo and its derivatives between the observed time points. One should bear in mind, however, that interpolation is always based on specific assumptions. For instance, cubic splines assume that the curve between two consecutive time points where observations are available is, or can be well approximated by, a third-degree polynomial. This assumption can hardly be checked experimentally and can lead to undesirable effects. Figure 2.12 shows the observed and interpolated tempo for Martha Argerich. While most of the interpolated values seem plausible, there are a few rather doubtful interpolations (marked with arrows) where the cubic polynomial by far exceeds each of the two observed values at the neighboring knots.
[Figure panels 2.11a–d: a) Acceleration – correlations of Cortot (1935) with other performances; b) Acceleration – correlations of Horowitz (1947) with other performances; c) mean correlations with other pianists; d) maximal correlations with other pianists. Performances: Argerich, Arrau, Askenaze, Brendel, Bunin, Capova, Cortot 1–3, Curzon, Davies, Demus, Eschenbach, Gianoli, Horowitz 1–3, Katsaris, Klien, Krust, Kubalek, Moiseiwitsch, Ney, Novaes, Ortiz, Schnabel, Shelley, Zak.]

Figure 2.11 Tempo acceleration – correlation with other performances.
Figure 2.12 Martha Argerich – interpolation of tempo curve by cubic splines.

2.5.3 Tempo – hierarchical decomposition by smoothing

The tempo curve may be thought of as an aggregation of mostly smooth tempo curves at different onset-time scales. This corresponds to the general structure of music as a mixture of global and local structures at various scales. It is therefore interesting to look at smoothed tempo curves, and their derivatives, at different scales. Reasonable smoothing bandwidths may be guessed from the general structure of the composition, such as time signature(s), rhythmic, metric, melodic, and harmonic structure, and so on. For tempo curves of Schumann's Träumerei (Figure 2.3), even multiples of 1/8th are plausible. Figures 2.13 through 2.16 show the following kernel-smoothed tempo curves with b_1 = 8, b_2 = 1, and b_3 = 1/8 respectively:

\hat{g}_1(t) = (nb_1)^{-1} \sum K\left(\frac{t - t_i}{b_1}\right) y_i   (2.28)

\hat{g}_2(t) = (nb_2)^{-1} \sum K\left(\frac{t - t_i}{b_2}\right)[y_i - \hat{g}_1(t_i)]   (2.29)

\hat{g}_3(t) = (nb_3)^{-1} \sum K\left(\frac{t - t_i}{b_3}\right)[y_i - \hat{g}_1(t_i) - \hat{g}_2(t_i)]   (2.30)

and the residuals

\hat{e}(t_i) = y_i - \hat{g}_1(t_i) - \hat{g}_2(t_i) - \hat{g}_3(t_i).   (2.31)
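The layered decomposition (2.28)–(2.31) can be sketched with a normalized (Nadaraya-Watson) smoother and a rectangular kernel; the bandwidths and toy tempo curve are illustrative, the residuals are formed at the observed onsets, and normalized weights stand in for the unnormalized form (2.21):

```python
# Sketch: hierarchical tempo decomposition by repeated smoothing of residuals,
# using a Nadaraya-Watson smoother with a rectangular kernel.
def nw_smooth(t, y, b):
    out = []
    for s in t:
        w = [1.0 if abs(s - ti) <= b else 0.0 for ti in t]
        tot = sum(w)                       # > 0: the window always contains s
        out.append(sum(wi * yi for wi, yi in zip(w, y)) / tot)
    return out

def decompose(t, y, bandwidths=(8.0, 1.0, 0.125)):
    layers, resid = [], list(y)
    for b in bandwidths:
        g = nw_smooth(t, resid, b)                     # smooth current residuals
        layers.append(g)
        resid = [r - gi for r, gi in zip(resid, g)]    # y - g1 - ... - gk
    return layers, resid

t = [i / 8 for i in range(64)]                         # onsets in units of 1/8
y = [2.0 + 0.5 * (ti % 1.0) for ti in t]               # toy "tempo curve"
layers, resid = decompose(t, y)
```

By construction, the three layers and the residual add back up to the observed tempo at every onset, so each bandwidth isolates structure at its own scale, as in Figures 2.13 through 2.16.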
  • 55. ARGERICH ARRAU ASKENAZE BRENDEL -0.4 -0.4 -0.4 -0.4 0 5 10 15 20 25 30 0 5 10 15 20 25 30 0 5 10 15 20 25 30 0 5 10 15 20 25 30 t t t t BUNIN CAPOVA CORTOT1 CORTOT2 -0.6 -0.6 -0.6 -0.6 0 5 10 15 20 25 30 0 5 10 15 20 25 30 0 5 10 15 20 25 30 0 5 10 15 20 25 30 t t t t CORTOT3 CURZON DAVIES DEMUS -0.6 -0.6 -0.6 -0.6 0 5 10 15 20 25 30 0 5 10 15 20 25 30 0 5 10 15 20 25 30 0 5 10 15 20 25 30 t t t t ESCHENBACH GIANOLI HOROWITZ1 HOROWITZ2 -0.6 -0.6 -0.6 -0.6 0 5 10 15 20 25 30 0 5 10 15 20 25 30 0 5 10 15 20 25 30 0 5 10 15 20 25 30 t t t t HOROWITZ3 KATSARIS KLIEN KRUST -0.6 -0.6 -0.6 -0.6 0 5 10 15 20 25 30 0 5 10 15 20 25 30 0 5 10 15 20 25 30 0 5 10 15 20 25 30 t t t t KUBALEK MOISEIWITSCH NEY NOVAES -0.6 -0.6 -0.6 -0.6 0 5 10 15 20 25 30 0 5 10 15 20 25 30 0 5 10 15 20 25 30 0 5 10 15 20 25 30 t t t t ORTIZ SCHNABEL SHELLEY ZAK -0.6 -0.6 -0.6 -0.6 0 5 10 15 20 25 30 0 5 10 15 20 25 30 0 5 10 15 20 25 30 0 5 10 15 20 25 30 t t t t Figure 2.13 Smoothed tempo curves g1 (t) = (nb1 )−1 ˆ K( t−ti )yi (b1 = 8). b1©2004 CRC Press LLC
  • 56. ARGERICH ARRAU ASKENAZE BRENDEL -0.5 -1.5 -1.5 -1.5 0 5 10 15 20 25 30 0 5 10 15 20 25 30 0 5 10 15 20 25 30 0 5 10 15 20 25 30 t t t t BUNIN CAPOVA CORTOT1 CORTOT2 -1.5 -1.5 -1.5 -1.5 0 5 10 15 20 25 30 0 5 10 15 20 25 30 0 5 10 15 20 25 30 0 5 10 15 20 25 30 t t t t CORTOT3 CURZON DAVIES DEMUS -1.5 -1.5 -1.5 -1.5 0 5 10 15 20 25 30 0 5 10 15 20 25 30 0 5 10 15 20 25 30 0 5 10 15 20 25 30 t t t t ESCHENBACH GIANOLI HOROWITZ1 HOROWITZ2 -1.5 -1.5 -1.5 -1.5 0 5 10 15 20 25 30 0 5 10 15 20 25 30 0 5 10 15 20 25 30 0 5 10 15 20 25 30 t t t t HOROWITZ3 KATSARIS KLIEN KRUST -1.5 -1.5 -1.5 -1.5 0 5 10 15 20 25 30 0 5 10 15 20 25 30 0 5 10 15 20 25 30 0 5 10 15 20 25 30 t t t t KUBALEK MOISEIWITSCH NEY NOVAES 1.0 -1.5 -1.5 -1.5 -2.0 0 5 10 15 20 25 30 0 5 10 15 20 25 30 0 5 10 15 20 25 30 0 5 10 15 20 25 30 t t t t ORTIZ SCHNABEL SHELLEY ZAK -2.0 -2.0 -2.0 -2.0 0 5 10 15 20 25 30 0 5 10 15 20 25 30 0 5 10 15 20 25 30 0 5 10 15 20 25 30 t t t tFigure 2.14 Smoothed tempo curves g2 (t) = (nb2 )−1 ˆ K( t−ti )[yi − g1 (t)] (b2 = b2 ˆ1).©2004 CRC Press LLC
Figure 2.15 Smoothed tempo curves \hat{g}_3(t) = (n b_3)^{-1} \sum_i K((t - t_i)/b_3) [y_i - \hat{g}_1(t_i) - \hat{g}_2(t_i)] (b_3 = 1/8). One panel per performance, as in Figure 2.13.
Figure 2.16 Smoothed tempo curves – residuals \hat{e}(t_i) = y_i - \hat{g}_1(t_i) - \hat{g}_2(t_i) - \hat{g}_3(t_i). One panel per performance, as in Figure 2.13.
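The caption formulas can be turned into a short computational sketch. The following is a minimal illustration on synthetic tempo data; the kernel K is assumed Gaussian, and Nadaraya-Watson weights replace the (nb)^{-1} normalization of the captions, which differs only by a near-constant factor for equispaced onset times:

```python
import numpy as np

def ksmooth(t_eval, t_obs, z, b):
    # Kernel smoother with Gaussian kernel K. Nadaraya-Watson weights are
    # used instead of the (n b)^{-1} factor in the captions; for equispaced
    # onsets the two differ only by an approximately constant factor.
    d = (t_eval[:, None] - t_obs[None, :]) / b
    w = np.exp(-0.5 * d ** 2)
    return (w @ z) / w.sum(axis=1)

# Synthetic tempo curve: 32 bars sampled at eighth-note resolution
t = np.arange(0, 32, 0.125)
rng = np.random.default_rng(1)
y = -0.02 * t + 0.3 * np.sin(2 * np.pi * t / 4) + 0.1 * rng.standard_normal(t.size)

g1 = ksmooth(t, t, y, b=8.0)              # overall tendency        (b1 = 8)
g2 = ksmooth(t, t, y - g1, b=1.0)         # 8 x 4-bar fluctuations  (b2 = 1)
g3 = ksmooth(t, t, y - g1 - g2, b=0.125)  # 2/8-level fluctuations  (b3 = 1/8)
e = y - g1 - g2 - g3                      # residuals at the 1/8 level
```

By construction, y = g1 + g2 + g3 + e, so each observed curve is split exactly into the components displayed in the four figures.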
The tempo curves are thus decomposed into curves corresponding to a hierarchy of bandwidths. Each component reveals specific features. The first component reflects the overall tendency of the tempo. Most pianists have an essentially monotonically decreasing curve corresponding to a gradual, and towards the end emphasized, ritardando. For some performances (in particular Bunin, Capova, Gianoli, Horowitz 1, Kubalek, and Moiseiwitsch) there is a distinct initial acceleration with a local maximum in the middle of the piece. The second component \hat{g}_2(t) reveals tempo fluctuations that correspond to a natural division of the piece into 8 times 4 bars. Some pianists, like Cortot, greatly emphasize the 8×4 structure. For other pianists, such as Horowitz, the 8×4 structure is less evident: the smoothed tempo curve is mostly quite flat, though the main, but smaller, tempo changes do take place at the junctions of the eight parts. Also striking is the distinction between part B (bars 17 to 24) and the other parts (A, A', A'') of the composition – in particular in Argerich's performance. The third component characterizes fluctuations at the resolution level of 2/8th. At this very local level, tempo changes frequently for pianists like Horowitz, whereas there is less local movement in Cortot's performances. Finally, the residuals \hat{e}(t) consist of the remaining fluctuations at the finest resolution of 1/8th. The similarity between the three residual curves by Horowitz illustrates that even at this very fine level, the "seismic" variation of tempo is a highly controlled process that is far from random.

2.5.4 Tempo curves and melodic indicator

In Chapter 3, the so-called melodic indicator will be introduced. One of its aims will be to "explain" some of the variability in tempo curves by melodic structures in the score. Consider a simple melodic indicator m(t) = w_{melod}(t) (see Section 3.3.4) that is essentially obtained by adding all indicators corresponding to individual motifs.
Figures 2.17a and d display smoothed curves obtained by local polynomial smoothing of −m(t) using a large and a small bandwidth, respectively. Figures 2.17b and e show the first derivatives of the two curves in 2.17a and d. Similarly, the second derivatives are given in Figures 2.17c and f. For the tempo curves, the first and second derivatives of local polynomial fits with b = 4 are given in Figures 2.18 and 2.19, respectively. A resemblance can be found in particular between the second derivative of −m(t) in Figure 2.17f and the second derivatives of the tempo curves in Figure 2.19. There are also interesting similarities and differences between the performances with respect to the local variability of the first two derivatives. Many pianists start with a very small second derivative, with strongly increased values in part B.
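Derivatives of a local polynomial fit are read off directly from the fitted coefficients: a local quadratic fit of y on (t_i − t) and (t_i − t)^2 around each evaluation point t yields the slope as an estimate of g'(t) and twice the quadratic coefficient as an estimate of g''(t). A minimal sketch (rectangular window; treating the span parameter as the window width is an assumption):

```python
import numpy as np

def local_quadratic_derivs(t_eval, t_obs, y, span):
    # Fit y ~ b0 + b1*(ti - t) + b2*(ti - t)^2 in a window of width `span`
    # around each t; then b1 estimates g'(t) and 2*b2 estimates g''(t).
    d1, d2 = [], []
    for t in t_eval:
        sel = np.abs(t_obs - t) <= span / 2
        X = np.vander(t_obs[sel] - t, 3)     # columns: u^2, u, 1
        b = np.linalg.lstsq(X, y[sel], rcond=None)[0]
        d1.append(b[1])                      # slope -> g'(t)
        d2.append(2 * b[0])                  # curvature -> g''(t)
    return np.array(d1), np.array(d2)

# Sanity check on a known curve: g(t) = sin(t), so g' = cos and g'' = -sin
t_obs = np.linspace(0, 6, 601)
y = np.sin(t_obs)
t_eval = np.array([1.0, 2.0, 3.0])
d1, d2 = local_quadratic_derivs(t_eval, t_obs, y, span=0.5)
```

The same construction applied to the tempo curves gives the first and second derivative panels of Figures 2.18 and 2.19.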
Figure 2.17 Melodic indicator – local polynomial fits together with first and second derivatives. Panels: a) −m(t) (span = 24/32); b) −m'(t) (span = 24/32); c) −m''(t) (span = 24/32); d) −m(t) (span = 8/32); e) −m'(t) (span = 8/32); f) −m''(t) (span = 8/32).

2.5.5 Tempo and loudness

By invitation of Prince Charles, Vladimir Horowitz gave a benefit recital at London's Royal Festival Hall on May 22, 1982. It was his first European appearance in 31 years. One of the pieces played at the concert was Schumann's Kinderszene op. 15, No. 4. Figure 2.20 displays the (approximate) sound wave of Horowitz's performance sampled from the CD recording. Two variables that can be extracted quite easily by visual inspection are: a) on the horizontal axis, the times when notes are played (and, derived from this quantity, the tempo); and b) on the vertical axis, loudness. More specifically, let t_1, ..., t_n be the score onset times and u(t_1), ..., u(t_n) the corresponding performance times. Then an approximate tempo at score onset time t_i can be defined by y(t_i) = (t_{i+1} − t_i)/(u(t_{i+1}) − u(t_i)). A complication with loudness is that the amplitude level of piano sounds decreases gradually in a complex manner, so that "loudness" as such is not defined exactly. For simplicity, we therefore define loudness as the initial amplitude level (or rather its logarithm). Moreover, we consider only events where the score onset time is a multiple of 1/8. For illustration, the first four events (score onset times 1/8, 2/8, 3/8, 4/8) are marked with arrows in Figure 2.20. An interesting question is what kind of relationship there may be between time delay y and loudness level x. The autocorrelations of x(t_i) =
Figure 2.18 Tempo curves (Figure 2.3) – first derivatives obtained from local polynomial fits (span 24/32). One panel per performance (Argerich through Zak); horizontal axis: onset time t.
Figure 2.19 Tempo curves (Figure 2.3) – second derivatives obtained from local polynomial fits (span 8/32). One panel per performance (Argerich through Zak); horizontal axis: onset time t.
Figure 2.20 Kinderszene No. 4 – sound wave of performance by Horowitz at the Royal Festival Hall in London on May 22, 1982.

log(Amplitude) and y(t_i), as well as the cross-autocorrelations between the two time series, are shown in Figure 2.21a. The main remarkable cross-autocorrelation occurs at lag 8. This can also be seen visually when plotting y(t_{i+8}) against x(t_i) (Figure 2.21b). There appears to be a strong relationship between the two variables, with the exception of four outliers. The three fitted lines correspond to a) a least squares linear regression fit using all data; b) a robust high breakdown point and high efficiency regression (Yohai et al. 1991); and c) a least squares fit excluding the outliers. It should be noted that the "outliers" all occur together in a temporal cluster (see Figure 2.21c) and correspond to a phase where tempo is at its extreme (lowest for the first three outliers and fastest for the last outlier). This indicates that these are informative "outliers" (in contrast to wrong measurements) that should not be dismissed, since they may tell us something about the intention of the performer. Finally, Figure 2.21d displays a sharpened version of the scatterplot in Figure 2.21b: points with high estimated joint density \hat{f}(x, y) are marked with "O". In contrast to what one would expect from a regression model with random errors ε_i that are independent of x, the points with highest density gather around a horizontal line rather than the regression line(s) fitted in Figure 2.21b. Thus, a linear regression model is hardly applicable. Instead, the data may possibly be divided into three clusters: a) a cluster with low loudness and low tempo; b) a second cluster with medium loudness and low to medium tempo; and c) a third cluster with a high level of loudness and medium to high tempo.
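The lagged dependence can be quantified with the sample cross-correlation r(k) between x(t_i) and y(t_{i+k}). A minimal sketch on synthetic series (the lag-8 structure is built in by construction; all variable names are illustrative):

```python
import numpy as np

def cross_corr(x, y, max_lag):
    # Sample cross-correlation between x(t_i) and y(t_{i+k}), k = 0..max_lag,
    # standardizing both series by their full-sample mean and std.
    x = (x - x.mean()) / x.std()
    y = (y - y.mean()) / y.std()
    n = len(x)
    return np.array([np.dot(x[:n - k], y[k:]) / n for k in range(max_lag + 1)])

rng = np.random.default_rng(7)
z = rng.standard_normal(300)
x = z                                                  # "loudness"
y = np.concatenate([rng.standard_normal(8), z[:-8]])   # "tempo", echoing x at lag 8
r = cross_corr(x, y, max_lag=10)
```

With these synthetic series, the largest cross-correlation appears at lag 8, mirroring the structure found for Horowitz's performance.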
Figure 2.21 log(Amplitude) and tempo for Kinderszene No. 4 – auto- and cross-correlations (a), scatter plot with fitted least squares and robust lines (b), time series plots (c), and sharpened scatter plot (d).
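Scatterplot sharpening of the kind shown in Figure 2.21d can be sketched as follows: estimate the joint density at each observation and mark the points whose estimated density exceeds a quantile threshold. A Gaussian product kernel and the `frac` parameter are assumptions of this sketch, not details from the book:

```python
import numpy as np

def sharpen(x, y, bx, by, frac=0.25):
    # Mark the fraction `frac` of points with the highest estimated joint
    # density f_hat(x_i, y_i), using a Gaussian product-kernel estimate.
    dx = (x[:, None] - x[None, :]) / bx
    dy = (y[:, None] - y[None, :]) / by
    f = np.exp(-0.5 * (dx ** 2 + dy ** 2)).mean(axis=1) / (2 * np.pi * bx * by)
    return f >= np.quantile(f, 1 - frac)

# Dense cluster plus diffuse background: sharpening should pick the cluster
rng = np.random.default_rng(3)
x = np.concatenate([rng.normal(0, 0.1, 80), rng.uniform(-3, 3, 20)])
y = np.concatenate([rng.normal(0, 0.1, 80), rng.uniform(-3, 3, 20)])
mask = sharpen(x, y, bx=0.3, by=0.3)
```

The marked points trace the high-density backbone of the point cloud, which is what reveals the horizontal band in Figure 2.21d.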
2.5.6 Loudness and tempo – two-dimensional distribution function

In the example above, the correlation between loudness and tempo, when measured at the same time, turned out to be relatively small, whereas there appeared to be quite a clear lagged relationship. Does this mean that there is indeed no "immediate" relationship between these two variables? Consider x(t_i) = log(Amplitude) and the logarithm of tempo. The scatterplot and the boxplot in Figures 2.22a and b rather suggest that there may be a relationship, but the dependence is nonlinear. This is further supported by the two-dimensional histogram (Figure 2.23a), the smoothed density (Figure 2.24a), and the corresponding image plots (Figures 2.23b and 2.24b; the actual observations are plotted as stars). The density was estimated by a kernel estimate with the Epanechnikov kernel. Since correlation only measures linear dependence, it cannot detect this kind of highly nonlinear relationship.

Figure 2.22 Horowitz' performance of Kinderszene No. 4 – log(tempo) versus log(Amplitude) and boxplots of log(tempo) for three ranges of amplitude.

2.5.7 Melodic tempo-sharpening

Sharpening can also be applied by using an "external" variable. This is illustrated in Figures 2.25 through 2.27. Figure 2.25a displays the estimated density function of log(m + 1), where m(t) is the value of a melodic indicator at onset time t. The marked region corresponds to very high values of the density function f (namely f(x) > 0.793). This defines a set I_{sharp} of corresponding "sharpening onset times". The series m(t) is shown in Figure 2.25b, with sharpening onset times t ∈ I_{sharp} highlighted by vertical
Figure 2.23 Horowitz' performance of Kinderszene No. 4 – two-dimensional histogram of (x, y) = (log(tempo), log(Amplitude)) displayed in a perspective and an image plot, respectively.

Figure 2.24 Horowitz' performance of Kinderszene No. 4 – kernel estimate of the two-dimensional distribution of (x, y) = (log(tempo), log(Amplitude)) displayed in a perspective and an image plot, respectively.
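A two-dimensional kernel density estimate with the Epanechnikov kernel, of the kind underlying Figure 2.24, can be sketched as follows. The product-kernel form, per-axis bandwidths, and grid size are assumptions of this sketch; the book does not specify these details:

```python
import numpy as np

def epan(u):
    # Epanechnikov kernel K(u) = 0.75 * (1 - u^2) on |u| <= 1, else 0
    return 0.75 * np.clip(1.0 - u ** 2, 0.0, None)

def kde2d(x, y, bx, by, grid=40):
    # Product-kernel density estimate on a grid spanning the data range:
    # f_hat(s, t) = (n bx by)^{-1} sum_i K((s - x_i)/bx) K((t - y_i)/by)
    gx = np.linspace(x.min(), x.max(), grid)
    gy = np.linspace(y.min(), y.max(), grid)
    kx = epan((gx[:, None] - x[None, :]) / bx)   # grid x data
    ky = epan((gy[:, None] - y[None, :]) / by)
    f = (kx[:, None, :] * ky[None, :, :]).mean(axis=2) / (bx * by)
    return gx, gy, f

rng = np.random.default_rng(5)
x, y = rng.standard_normal(2000), rng.standard_normal(2000)
gx, gy, f = kde2d(x, y, bx=0.5, by=0.5)
```

The grid of values f can then be rendered as a perspective or image plot, as in Figure 2.24.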
Figure 2.25 R. Schumann, Träumerei op. 15, No. 7 – density of melodic indicator with sharpening region (a) and melodic curve plotted against onset time, with sharpening points highlighted (b).
Figure 2.26 R. Schumann, Träumerei op. 15, No. 7 – tempo by Cortot and Horowitz at sharpening onset times. Panels: Cortot 1-3 and Horowitz 1-3.

Figure 2.27 R. Schumann, Träumerei op. 15, No. 7 – tempo "derivatives" for Cortot and Horowitz at sharpening onset times. Panels: Cortot 1-3 and Horowitz 1-3.
lines. Figures 2.26 and 2.27 show the tempo y and its discrete "derivative" v(t_i) = [y(t_{i+1}) − y(t_i)]/(t_{i+1} − t_i) for t_i ∈ I_{sharp} and the performances by Cortot and Horowitz. The pictures indicate a systematic difference between Cortot and Horowitz. A common feature is the negative derivative at the fifth and sixth sharpening onset times.

2.6 Some multivariate descriptive displays

2.6.1 Definitions

Suppose that we observe multivariate data x_1, x_2, ..., x_n where each x_i is a p-dimensional vector (x_{i1}, ..., x_{ip})^t ∈ R^p. Obvious numerical summary statistics are the sample mean

\bar{x} = (\bar{x}_1, \bar{x}_2, ..., \bar{x}_p)^t,

where \bar{x}_j = n^{-1} \sum_{i=1}^n x_{ij}, and the p × p covariance matrix S with elements

S_{jl} = (n - 1)^{-1} \sum_{i=1}^n (x_{ij} - \bar{x}_j)(x_{il} - \bar{x}_l).

Most methods for analyzing multivariate data are based on these two statistics. One of the main tools consists of dimension reduction by suitable projections, since it is easier to find and visualize structure in low dimensions. These techniques go far beyond descriptive statistics. We therefore postpone the discussion of these methods to Chapters 8 to 11. Another set of methods consists of visualizing individual multivariate observations. The main purpose is a simple visual identification of similarities and differences between observations, as well as the search for clusters and other patterns. Typical examples are:

• Faces: x_i = (x_{i1}, ..., x_{ip})^t is represented by a face with features depending on the values of corresponding coordinates.
For instance, the face function in S-Plus has the following correspondence between coordinates and feature parameters: x_{i,1} = area of face; x_{i,2} = shape of face; x_{i,3} = length of nose; x_{i,4} = location of mouth; x_{i,5} = curve of smile; x_{i,6} = width of mouth; x_{i,7} = location of eyes; x_{i,8} = separation of eyes; x_{i,9} = angle of eyes; x_{i,10} = shape of eyes; x_{i,11} = width of eyes; x_{i,12} = location of pupil; x_{i,13} = location of eyebrow; x_{i,14} = angle of eyebrow; x_{i,15} = width of eyebrows.

• Stars: Each coordinate is represented by a ray in a star, the length of each corresponding to the value of the coordinate. More specifically, a star for a data vector x_i = (x_{i1}, ..., x_{ip})^t is constructed as follows:
1. Scale x_i to the range [0, r]: 0 ≤ x_{1j}, ..., x_{nj} ≤ r;
2. Draw p rays at angles φ_j = 2π(j − 1)/p (j = 1, ..., p); for a star with
origin 0 representing observation x_i, the end point of the jth ray has the coordinates r · (x_{ij} cos φ_j, x_{ij} sin φ_j);
3. For visual reasons, the end points of the rays may be connected by straight lines.

• Profiles: An observation x_i = (x_{i1}, ..., x_{ip})^t is represented by a plot of x_{ij} versus j, where neighboring points x_{i,j−1} and x_{ij} (j = 1, ..., p) are connected.

• Symbol plot: The horizontal and vertical positions represent x_{i1} and x_{i2} respectively (or any other two coordinates of x_i). The other coordinates x_{i3}, ..., x_{ip} determine p − 2 characteristic shape parameters of a geometric object that is plotted at point (x_{i1}, x_{i2}). Typical symbols are circles (one additional dimension), rectangles (two additional dimensions), stars (arbitrary number of additional dimensions), and faces (arbitrary number of additional dimensions).

2.7 Specific applications in music – multivariate

2.7.1 Distribution of notes – Chernoff faces

In music that is based on scales, pitch (modulo 12) is usually not equally distributed. Notes that belong to the main scale are more likely to occur, and within these, there are certain preferred notes as well (e.g. the roots of the tonic, subtonic and supertonic triads). To illustrate this, we consider the following compositions: 1. Saltarello (Anonymus, 13th century); 2. Prelude and Fugue No. 1 from "Das Wohltemperierte Klavier" (J. S. Bach, 1685-1750); 3. Kinderszene op. 15, No. 1 (R. Schumann, 1810-1856); 4. Piano piece op. 19, No. 2 (A. Schönberg, 1874-1951; Figure 2.28); 5. Rain Tree Sketch 1 (T. Takemitsu, 1930-1996). For each composition, the distribution of notes (pitches) modulo 12 is calculated and centered around the "central pitch" (defined as the most frequent pitch modulo 12). Thus, the central pitch is defined as zero. We then obtain five vectors of relative frequencies p_j = (p_{j0}, ..., p_{j11})^t (j = 1, ..., 5) characterizing the five compositions.
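The construction of the vectors p_j can be sketched as follows; MIDI-style integer pitches are assumed as input, and the rotation places the most frequent pitch class at position 0:

```python
import numpy as np

def pitch_profile(midi_pitches):
    # Relative frequencies of pitch classes (pitch mod 12), rotated so that
    # the "central pitch" (most frequent class) sits at index 0.
    counts = np.zeros(12)
    for m in midi_pitches:
        counts[m % 12] += 1
    p = counts / counts.sum()
    return np.roll(p, -int(np.argmax(p)))

# Toy example: C occurs most often, so class 0 (= C here) becomes the center
p = pitch_profile([60, 60, 60, 62, 64, 65, 67, 67])
```

For the toy input, p[0] = 3/8 (the central pitch) and p[7] = 2/8 (the fifth above it), with the remaining mass spread over the other scale degrees.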
In addition, for each of these vectors the number n_j of local peaks in p_j is calculated. We say that a local peak occurs at i ∈ {1, ..., 10} if p_{ji} > max(p_{j,i−1}, p_{j,i+1}). For i = 11, we say that a local peak occurs if p_{ji} > p_{j,i−1}. Figure 2.29a displays Chernoff faces of the 12-dimensional vectors v_j = (n_j, p_{j1}, ..., p_{j11})^t. In Figure 2.29b, the coordinates of v_j (and thus the assignment of feature variables) were permuted. The two plots illustrate the usefulness of Chernoff faces, and at the same time the difficulties in finding an objective interpretation. On the one hand, the method discovers a plausible division into two groups: both pictures show a clear distinction between classical tonal music (first three faces) and the three representatives of "avant-garde" music of the 20th century. On the other hand, the
exact nature of the distinction cannot be seen. In Figure 2.29a, the classical faces look much more friendly than the rather miserable avant-garde fellows. The judgment of conservative music lovers that "avant-garde" music is unbearable, depressing, or even bad for health, seems to be confirmed! Yet, bad temper is the response of the classical masters to a simple permutation of the variables (Figure 2.29b), whereas the grim avant-garde seems to be much more at ease. The difficulty in interpreting Chernoff faces is that the result depends on the order of the variables, whereas due to their psychological effect most feature variables are not interchangeable.

Figure 2.28 Arnold Schönberg (1874-1951), self-portrait. (Courtesy of Verwertungsgesellschaft Bild-Kunst, Bonn.)

2.7.2 Distribution of notes – star plots

We consider once more the distribution vectors p_j = (p_{j0}, ..., p_{j11})^t of pitch modulo 12 where 0 is the tonal center. In contrast to Chernoff faces, permutation of coordinates in star plots is much less likely to have a subjective influence on the interpretation of the picture. Nevertheless, certain patterns can become more visible when using an appropriate ordering of the variables. From the point of view of tonal music, a natural ordering of pitch can be obtained, for instance, from the ascending circle of fourths. This leads to the following permutation p*_j = (p_5, p_10, p_3, p_8, p_1, p_6, p_11, p_4, p_9, p_2, p_7)^t. (p_0 is omitted, since it is maximal by definition for all compositions.) Since stars are easy to look at, it is possible to compare a large number of observations simultaneously. We consider the following set of compositions:
Figure 2.29 a) Chernoff faces for 1. Saltarello (Anonymus, 13th century); 2. Prelude and Fugue No. 1 from "Das Wohltemperierte Klavier" (J. S. Bach, 1685-1750); 3. Kinderszene op. 15, No. 1 (R. Schumann, 1810-1856); 4. Piano piece op. 19, No. 2 (A. Schönberg, 1874-1951); 5. Rain Tree Sketch 1 (T. Takemitsu, 1930-1996).

Figure 2.29 b) Chernoff faces for the same compositions as in Figure 2.29a, after permuting coordinates.
• A. de la Halle (1235?-1287): "Or est Bayard en la pature, hure!";
• J. de Ockeghem (1425-1495): Canon epidiatesseron;
• J. Arcadelt (1505-1568): a) Ave Maria, b) La ingratitud, c) Io dico fra noi;
• W. Byrd (1543-1623): a) Ave Verum Corpus, b) Alman, c) The Queen's Alman;
• J.P. Rameau (1683-1764): a) La Poplinière, b) Le Tambourin, c) La Triomphante;
• J.S. Bach (1685-1750): Das Wohltemperierte Klavier – Preludes and Fugues No. 5, 6 and 7;
• D. Scarlatti (1685-1757): Sonatas K 222, K 345 and K 381;
• J. Haydn (1732-1809): Sonata op. 34, No. 2;
• W.A. Mozart (1756-1791): 2nd movements of Sonatas KV 332, KV 545 and KV 333;
• M. Clementi (1752-1832): Gradus ad Parnassum – Studies 2 and 9 (Figure 11.4);
• R. Schumann (1810-1856): Kinderszenen op. 15, No. 1, 2, and 3;
• F. Chopin (1810-1849): a) Nocturne op. 9, No. 2, b) Nocturne op. 32, No. 1, c) Etude op. 10, No. 6;
• R. Wagner (1813-1883): a) Bridal Chorus from "Lohengrin", b) Overture to Act 3 of "Die Meistersinger";
• C. Debussy (1862-1918): a) Clair de lune, b) Arabesque No. 1, c) Reflets dans l'eau;
• A. Scriabin (1872-1915): Preludes op. 2/2, op. 11/14 and op. 13/2;
• B. Bartók (1881-1945): a) Bagatelle op. 11, No. 2 and 3, b) Sonata for Piano;
• O. Messiaen (1908-1992): Vingt regards sur l'enfant Jésus, No. 3;
• S. Prokoffieff (1891-1953): Visions fugitives No. 11, 12 and 13;
• A. Schönberg (1874-1951): Piano piece op. 19, No. 2;
• T. Takemitsu (1930-1996): Rain Tree Sketch No. 1;
• A. Webern (1883-1945): Orchesterstück op. 6, No. 6;
• J. Beran (*1959): Śānti – Piano Concerto No. 2 (beginning of 2nd Mov.)

The star plots of p*_j are given in Figure 2.31. From Halle (cf. Figure 2.30) up to about the early Scriabin, the long beams form more or less a half-circle. This means that the most frequent notes are neighbors in the circle of fourths and are much more frequent than all other notes. This is indeed what one would expect in music composed in the tonal system.
The picture starts changing in the neighborhood of Scriabin, where long beams are either
isolated (most extremely for Bartók's Bagatelle No. 3) or tend to cover more or less the whole range of notes (e.g. Bartók, Prokoffieff, Takemitsu, Beran). Due to the variety of styles in the 20th century, the specific shape of each of the stars would need to be discussed in detail individually. For instance, Messiaen's shape may be explained by the specific scales (Messiaen scales) he used. Generally speaking, the difference between star plots of the 20th century and earlier music reflects the replacement of the traditional tonal system with major/minor scales by other principles.

Figure 2.30 The minnesinger Burchard von Wengen (1229-1280), contemporary of Adam de la Halle (1235?-1288). (From Codex Manesse, courtesy of the University Library Heidelberg.) (Color figures follow page 152.)
Figure 2.31 Star plots of p*_j = (p_5, p_10, p_3, p_8, p_1, p_6, p_11, p_4, p_9, p_2, p_7)^t for compositions from the 13th to the 20th century. Plot title: "Distribution of notes ordered according to ascending circle of fourths"; one star per composition (Halle through Beran).

2.7.3 Joint distribution of interval steps of envelopes

Consider a composition consisting of onset times t_i and pitch values x(t_i). In a polyphonic score, several notes may be played simultaneously. To simplify analysis, we define a simplified score by considering the lower and upper envelope:

Definition 24 Let

C = {(t_i, x(t_i)) : t_i ∈ A, x(t_i) ∈ B, i = 1, 2, ..., N} = \bigcup_{j=1}^{n} C_j

where A = {t*_1, ..., t*_n} ⊂ Z_+ (t*_1 < t*_2 < ... < t*_n), B ⊂ R or Z, and C_j = {(t, x(t)) ∈ C : t = t*_j}. Then the lower and upper envelope of C are
defined by

E_low = {(t*_j, min_{(t,x(t)) ∈ C_j} x(t)) : j = 1, ..., n}

and

E_up = {(t*_j, max_{(t,x(t)) ∈ C_j} x(t)) : j = 1, ..., n}.

In other words, for each onset time, the lowest and highest notes are selected to define the lower and upper envelope, respectively. In the example below, we consider interval steps ∆y(t_i) = y(t_{i+1}) − y(t_i) mod 12 for the upper envelope of a composition with onset times t_1, ..., t_n and pitches y(t_1), ..., y(t_n). A simple aspect of melodic and harmonic structure is the question in which sequence intervals are likely to occur. Here, we look at the empirical two-dimensional distribution of (∆y(t_i), ∆y(t_{i+1})). For each pair (i, j) (−11 ≤ i, j ≤ 11; i, j ≠ 0), we count the number n_{ij} of occurrences and define N_{ij} = log(n_{ij} + 1). (The value 0 is excluded here, since repetitions of a note – or transposition by an octave – are less interesting.) If only the type of interval and not its direction is of interest, then i, j assume the values 1 to 11 only. A useful representation of N_{ij} can be obtained by a symbol plot. In Figures 2.32 and 2.33, the x- and y-coordinates correspond to i and j, respectively. The radius of a circle with center (i, j) is proportional to N_{ij}. The compositions considered here are: a) J.S. Bach: Präludium No. 1 from "Das Wohltemperierte Klavier"; b) W.A. Mozart: Sonata KV 545 (beginning of 2nd Movement); c) A. Scriabin: Prélude op. 51, No. 4; and d) F. Martin: Prélude No. 6. For Bach's piece, there is a clear clustering into three main groups in the first plot (there are almost never two successive interval steps downwards) and a horseshoe-like pattern for absolute intervals. Remarkable are the clear negative correlation in Mozart's first plot and the concentration on a few selected interval sequences. A negative correlation in the plots of interval steps with sign can also be found for Scriabin and Martin.
However, considering only the types of intervals without their sign, the number and variety of interval sequences that are used relatively frequently is much higher for Scriabin, and even more so for Martin. For Martin, the plane of absolute intervals (Figure 2.33d) is filled almost uniformly.

2.7.4 Pitch distribution – symbol plots with circles

Consider once more the distribution vectors p_j = (p_{j0}, ..., p_{j11})^t of pitch modulo 12, as in the star plot example above. The star plots show a clear distinction between "modern" compositions and classical tonal compositions. Symbol plots can be used to see more clearly which composers (or compositions) are close with respect to p_j. In Figure 2.34, the x- and y-axes correspond to p_{j5} and p_{j7}. Recall that if 0 is the root of the tonic triad, then 5 is the root of the subtonic and 7 the root of the dominant
Figure 2.32 Symbol plot of the distribution of successive interval pairs (∆y(t_i), ∆y(t_{i+1})) (a, c) and their absolute values (b, d), respectively, for the upper envelopes of Bach's Präludium No. 1 (Das Wohltemperierte Klavier I) and Mozart's Sonata KV 545 (beginning of 2nd movement).
Figure 2.33 Symbol plot of the distribution of successive interval pairs (∆y(t_i), ∆y(t_{i+1})) (a, c) and their absolute values (b, d), respectively, for the upper envelopes of Scriabin's Prélude op. 51, No. 4 and F. Martin's Prélude No. 6.
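The quantities behind these plots can be sketched directly from Definition 24. The signed mod-12 reduction of interval steps below is an assumption about how intervals larger than an octave are treated; all function names are illustrative:

```python
import numpy as np
from collections import defaultdict

def upper_envelope(events):
    # events: iterable of (onset_time, pitch) pairs; for each onset time,
    # keep the highest sounding pitch (Definition 24).
    env = {}
    for t, x in events:
        env[t] = max(env.get(t, x), x)
    return [env[t] for t in sorted(env)]

def step(a, b):
    # Signed interval step with magnitude reduced mod 12 (octaves removed)
    d = b - a
    return int(np.sign(d)) * (abs(d) % 12)

def interval_pair_log_counts(pitches):
    # N_ij = log(n_ij + 1) over successive pairs (dy_i, dy_{i+1}); step 0
    # (repetition or octave transposition) is excluded, as in the text.
    dy = [step(a, b) for a, b in zip(pitches, pitches[1:])]
    n = defaultdict(int)
    for i, j in zip(dy, dy[1:]):
        if i != 0 and j != 0:
            n[(i, j)] += 1
    return {k: np.log(v + 1) for k, v in n.items()}

env = upper_envelope([(0, 60), (0, 64), (1, 62), (2, 67), (3, 60)])
N = interval_pair_log_counts(env)
```

The resulting dictionary N supplies the circle radii of the symbol plots: each key (i, j) is a circle center, each value its radius.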
triad. The radius of the circles in Figure 2.34 is proportional to p_{j1}, the frequency of the "dissonant" minor second. In color Figure 2.35, the radius represents p_{j6}, i.e. the augmented fourth. Both plots show a clear positive relationship between p_{j5} and p_{j7}. Moreover, the circles tend to be larger for small values of x and y. The positioning in the plane, together with the size of the circles, separates (apart from a few exceptions) classical tonal compositions from more recent ones. To visualize this, four different colors are chosen for "early music" (black), "baroque and classical" (green), "romantic" (blue) and "20/21st century" (red). The clustering of the four colors indicates that there is indeed an approximate clustering according to the four time periods. Interesting exceptions can be observed for "early" music, with two extreme "outliers" (Halle and Arcadelt). Also, one piece by Rameau is somewhat far from the rest.

Figure 2.34 Symbol plot with x = p_{j5}, y = p_{j7} and radius of circles proportional to p_{j1}.

2.7.5 Pitch distribution – symbol plots with rectangles

By using rectangles, four dimensions can be represented. Color Figure 2.36 shows a symbol plot with (x, y)-coordinates (p_{j5}, p_{j7}) and rectangles with width
Figure 2.35 Symbol plot with x = p_{j5}, y = p_{j7} and radius of circles proportional to p_{j6}. (Color figures follow page 152.)

p_{j1} (diminished second) and height p_{j6} (augmented fourth). Using the same colors for the names as above, a similar clustering as in the circle plot can be observed. The picture not only visualizes a clear four-dimensional relationship between p_{j1}, p_{j5}, p_{j6} and p_{j7}, but also shows that these quantities are related to the time period.

2.7.6 Pitch distribution – symbol plots with stars

Five dimensions are visualized in color Figure 2.37 with (x, y) = (p_{j5}, p_{j7}) and the variables p_{j1}, p_{j6} and p_{j10} (diminished seventh) defining a star plot for each observation, the first variable starting on the right and the subsequent variables winding counterclockwise around the star (in this case a triangle). The shape of the triangle is obviously a characteristic of the time period. For tonal music composed mostly before about 1900, the stars are very narrow with a relatively long beam in the direction of the diminished seventh. The diminished seventh is indeed an important pitch in tonal music, since it is the fourth note in the dominant seventh chord to the subtonic. In contrast, notes that are a diminished second and an
  • 81. 0.0 RAMEAU ARCADELT RAMEAU SCHUMANN ARCADELT BYRD RAMEAU SCRIABIN SCARLATTI CLEMENTI SCARLATTI MOZART DEBUSSY BYRD BYRD PROKOFFIEFF BACH MOZART OCKEGHEM CHOPIN SCRIABIN BACH SCARLATTI CHOPIN DEBUSSY SCHUMANN BACH HAYDN WAGNER WEBERN CLEMENTI MOZART SCRIABIN DEBUSSY CHOPIN SCHUMANN PROKOFFIEFF BARTOK WAGNER TAKEMITSU SCHOENBERG MESSIAEN ARCADELT BERAN BARTOK PROKOFFIEFF BARTOK HALLE -0.1 0.0 0.05 0.10 0.15 0.20Figure 2.36 Symbol plot with x = pj5 , y = pj7 . The rectangles have width pj1(diminished second) and height pj 6 (augmented fourth). (Color figures follow page152.)©2004 CRC Press LLC
  • 82. augmented fourth above the root of the tonic triad build, together withthe tonic root, highly dissonant intervals and are therefore less frequent intonal music. Color Figure 2.37 shows the triangles; the names without thetriangles are plotted in color Figure 2.38.2.7.7 Pitch distribution – profile plotsFinally, as an alternative to star plots, Figure 2.39 displays profile plotsof p∗ = (p5 , p10 , p3 , p8 , p1 , p6 , p11 , p4 , p9 , p2 , p7 )t . For compositions up to jabout 1900, the profiles are essentially U-shaped. This corresponds to starswith clustered long and short beams respectively, as seen previously. For“modern” compositions, there is a large variety of shapes different from aU-shape.©2004 CRC Press LLC
[Figure 2.37 Symbol plot with x = pj5, y = pj7, and triangles defined by pj1 (diminished second), pj6 (augmented fourth) and pj10 (diminished seventh). (Color figures follow page 152.)]
[Figure 2.38 Names plotted at locations (x, y) = (pj5, pj7). (Color figures follow page 152.)]
[Figure 2.39 Profile plots of p*_j = (p5, p10, p3, p8, p1, p6, p11, p4, p9, p2, p7)^t.]
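The construction behind the profile plots can be made concrete with a small sketch. The following Python code is my own illustration, not code from the book: the function name and the toy pitch data are invented, and pitch classes are counted relative to an assumed tonic. It computes the relative frequencies p_j of the eleven non-tonic pitch classes and arranges them in the profile order used above:

```python
from collections import Counter

# Order of the profile plots: intervals (semitones above the tonic)
# arranged so that tonal music tends to yield a U-shaped profile
PROFILE_ORDER = [5, 10, 3, 8, 1, 6, 11, 4, 9, 2, 7]

def pitch_profile(pitches, tonic):
    """Relative frequencies p_j of the non-tonic pitch classes
    (intervals 1..11 above the tonic, octaves ignored), returned in
    the profile-plot order."""
    intervals = [(p - tonic) % 12 for p in pitches]
    counts = Counter(intervals)
    n = len(pitches)
    freq = {j: counts.get(j, 0) / n for j in range(12)}
    return [freq[j] for j in PROFILE_ORDER]

# Toy C-major fragment (MIDI-style pitches; an invented example)
notes = [60, 64, 67, 60, 62, 64, 65, 67, 72, 67, 64, 60]
profile = pitch_profile(notes, tonic=60)
print([round(v, 2) for v in profile])
```

For this tonal toy fragment the consonant intervals (fourth and fifth, at the two ends of the vector) carry most of the mass, while the dissonant intervals in the middle are rare, i.e. the U-shape described in the text.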
CHAPTER 3

Global measures of structure and randomness

3.1 Musical motivation

Essential aspects of music may be summarized under the keywords "structure", "information", and "communication". Even aleatoric pieces where events are generated randomly (e.g. Cage, Xenakis, Lutosławski) have structure and information induced by the definition of specific random distributions. It is therefore meaningful to measure the amount of structure and information contained in a composition. Clearly, this is a nontrivial task and many different, and possibly controversial, definitions can be invented. In this chapter, two types of measures are discussed: 1) general global measures of information or randomness, and 2) specific local measures indicating metric, melodic, and harmonic structures.

3.2 Basic principles

3.2.1 Measuring information and randomness

There is an enormous amount of literature on information measures and their applications. In this section, only some basic fundamental definitions and results are reviewed. These and other classical results can be found, in particular, in Fisher (1925, 1956), Hartley (1928), Bhattacharyya (1946a), Erdős (1946), Wiener (1948), Shannon (1948), Shannon and Weaver (1949), Barnard (1951), McMillan (1953), Mandelbrot (1953, 1956), Khinchin (1953, 1956), Goldman (1953), Bartlett (1955), Brillouin (1956), Kolmogorov (1956), Ashby (1956), Joshi (1957), Kullback (1959), Wolfowitz (1957, 1958, 1961), Woodward (1953), and Rényi (1959a,b, 1961, 1965, 1970). Also see e.g. Ash (1965) for an overview. A classical measure of information (or randomness) is entropy, which is also called Shannon information (Shannon 1948, Shannon and Weaver 1949). To explain its meaning, consider the following question: how much information is contained in a message, or more specifically, what is the necessary number of digits to encode the message unambiguously in the binary system?
For instance, if the entire vocabulary only consisted of the words "I", "hungry", "not", "very", then the words could be identified with the binary numbers 00 = "I", 01 = "hungry", 10 = "not" and 11 = "very". Thus, for a vocabulary V of |V| = N = 2^2 words, n = 2 digits would be sufficient. More generally, suppose that we have a set V with N = 2^n elements. Then we need n = log_2 N digits for encoding the elements in the binary system. The number n is then called the information of a message from vocabulary V. Note that in the special case where V consists of one element only, n = 0, i.e. the information content of a message is zero, because we know which element of V will be contained in the message even before receiving it.

[Figure 3.1 Ludwig Boltzmann (1844-1906). (Courtesy of Österreichische Post AG.)]

An extension of this definition to integers N that are not necessarily powers of 2 can be justified as follows: consider a sequence of k elements from V. The number of sequences v_1, ..., v_k (v_i ∈ V) is N^k. (Note that one element is allowed to occur more than once.) The number of binary digits to express a sequence v_1, ..., v_k is n_k where 2^{n_k - 1} < N^k ≤ 2^{n_k}. The average number of digits needed to express an element in this sequence is n_k/k where k log_2 N ≤ n_k < k log_2 N + 1. We then have

\lim_{k \to \infty} n_k/k = \log_2 N.

The following definition is therefore meaningful:

Definition 25 Let V_N be a finite set with N elements. Then the information necessary to characterize the elements of V_N is defined by

I(V_N) = \log_2 N    (3.1)

This definition can also be derived by postulating the following properties a measure of information should have:

1. Additivity: If |V_K| = N·M, then I(V_K) = I(V_N) + I(V_M)
2. Monotonicity: I(V_N) ≤ I(V_{N+1})

3. Definition of unit: I(V_2) = 1.

The only function that satisfies these conditions is I(V_N) = log_2 N.

Consider now a more complex situation where V_N = ∪_{j=1}^k V_j, V_j ∩ V_l = ∅ (j ≠ l) and |V_j| = N_j (and hence N = N_1 + ... + N_k), and define p_j = N_j/N. Suppose that we select an element from V randomly, each element having the same probability of being chosen. If an element v ∈ V is known to belong to a specific V_j, then the additional information needed to identify it within V_j is equal to I(V_j) = log_2 N_j. The expected value of this additional information is therefore

I_2 = \sum_{j=1}^{k} p_j \log_2 N_j = \sum_{j=1}^{k} p_j \log_2(N p_j)    (3.2)

Let I_1 be the information needed to identify the set V_j which v belongs to. Then the total information needed for identifying (encoding) elements of V is

\log_2 N = I_1 + I_2    (3.3)

On the other hand, \sum p_j \log_2 N = \log_2 N, so that we obtain Shannon's famous formula

I_1 = -\sum_{j=1}^{k} p_j \log_2(p_j)    (3.4)

I_1 is also called Shannon information. Shannon information is thus the expected information about the occurrence of the sets V_1, ..., V_k contained in a randomly chosen element from V. Note that the term "information" can be used synonymously for "uncertainty": the information obtained from a random experiment diminishes uncertainty by the same amount. The derivation of Shannon information is credited to Shannon (1948) and, independently, Wiener (1948). In physics, an analogous formula is known as entropy and is a measure of the disorder of a system (see Boltzmann 1896, Figure 3.1).

Shannon's formula can also be derived by postulating the following properties for a measure of information of the outcome of a random experiment: let V_1, ..., V_k be the possible outcomes of a random experiment and denote by p_j = P(V_j) the corresponding probabilities. Then a measure of information, say I, obtained by the outcome of the random experiment should have the following properties:

1. Function of probabilities: I = I(p_1, ..., p_k), i.e. I depends on the probabilities p_j only;

2. Symmetry: I(p_1, ..., p_k) = I(p_{π(1)}, ..., p_{π(k)}) for any permutation π;

3. Continuity: I(p, 1 − p) is a continuous function of p (0 ≤ p ≤ 1);

4. Definition of unit: I(1/2, 1/2) = 1;
5. Additivity and weighting by probabilities:

I(p_1, ..., p_k) = I(p_1 + p_2, p_3, ..., p_k) + (p_1 + p_2) I\left(\frac{p_1}{p_1 + p_2}, \frac{p_2}{p_1 + p_2}\right)    (3.5)

The meaning of the first four properties is obvious. The last property can be interpreted as follows: suppose the outcome of an experiment does not distinguish between V_1 and V_2, i.e. if v turns out to be in one of these two sets, we only know that v ∈ V_1 ∪ V_2. Then the information provided by the experiment is I(p_1 + p_2, p_3, ..., p_k). If the experiment did distinguish between V_1 and V_2, then it is reasonable to assume that the information would be larger by the amount

(p_1 + p_2) I\left(\frac{p_1}{p_1 + p_2}, \frac{p_2}{p_1 + p_2}\right).

Equation (3.5) tells us exactly that: the complete information I(p_1, ..., p_k) can be obtained by adding the partial and the additional information. It turns out that the only function for which the postulates hold is Shannon's information:

Theorem 9 Let I be a functional that assigns each finite discrete distribution function P (defined by probabilities p_1, ..., p_k, k ≥ 1) a real number I(P), such that the properties above hold. Then

I(P) = I(p_1, ..., p_k) = -\sum_{j=1}^{k} p_j \log_2 p_j    (3.6)

Shannon information has an obvious upper bound that follows from Jensen's inequality: recall that Jensen's inequality states that for a convex function g and weights w_j ≥ 0 with \sum w_j = 1 we have g(\sum w_j x_j) ≤ \sum w_j g(x_j). In particular, for g(x) = x \log_2 x,

k^{-1} \sum g(p_j) = k^{-1} \sum p_j \log_2 p_j \ge g(k^{-1} \sum p_j) = -k^{-1} \log_2 k.

Hence,

I(P) ≤ \log_2 k    (3.7)

This bound is achieved by the uniform distribution p_j = 1/k. The other extreme case is p_j = 1 for some j. This means that event V_j occurs with certainty and I(p_1, ..., p_k) = I(p_j) = I(1) = I(1, 0) = I(1, 0, 0) etc. Then from the fifth property we have I(1, 0) = I(1) + I(1, 0) so that I(1) = 0. The interpretation is that, if it is clear a priori which event will occur, then a random experiment does not provide any information.
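Formula (3.6) and the bound (3.7) are easy to check numerically. A minimal Python sketch (my own illustration, not from the book):

```python
import math

def shannon_information(p):
    """Shannon information I(p_1, ..., p_k) = -sum p_j log2 p_j of (3.6),
    with the usual convention 0 * log2(0) = 0."""
    return -sum(pj * math.log2(pj) for pj in p if pj > 0)

# The uniform distribution attains the upper bound log2(k) of (3.7):
k = 8
print(shannon_information([1.0 / k] * k))      # 3.0 = log2(8)

# A degenerate distribution provides no information: I(1) = 0
print(shannon_information([1.0, 0.0, 0.0]))    # 0.0

# Any other distribution lies below the bound log2(4) = 2:
print(shannon_information([0.5, 0.25, 0.125, 0.125]))  # 1.75
```

The three printed values illustrate, in order, the maximal-entropy case, the zero-information case, and a distribution strictly between the two extremes.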
The notion of information can be extended in an obvious way to the case where one has an infinite but countable number of possible outcomes.
The information contained in the realization of a random variable X with possible outcomes x_1, x_2, ... is defined by

I(X) = -\sum_j p_j \log_2 p_j

where p_j = P(X = x_j). More subtle is the extension to continuous distributions and random variables. A nice illumination of the problem is given in Rényi (1970): for a random variable with uniform distribution on (0,1), the digits in the binary expansion of X are infinitely many independent 0-1 random variables where 0 and 1 occur with probability 1/2 each. The information furnished by a realization of X would therefore be infinite. Nevertheless, a meaningful measure of information can be defined as a limit of discrete approximations:

Theorem 10 Let X be a random variable with density function f. Define X_N = [N X]/N where [x] denotes the integer part of x. If I(X_1) < ∞, then the following holds:

\lim_{N \to \infty} \frac{I(X_N)}{\log_2 N} = 1    (3.8)

\lim_{N \to \infty} (I(X_N) - \log_2 N) = -\int_{-\infty}^{\infty} f(x) \log_2 f(x)\,dx    (3.9)

We thus have

Definition 26 Let X be a random variable with density function f. Then

I(X) = -\int_{-\infty}^{\infty} f(x) \log_2 f(x)\,dx    (3.10)

is called the information (or entropy) of X.

Note that, in contrast to discrete distributions, information can be negative. This is due to the fact that I(X) is in fact the limit of a difference of informations.

The notion of entropy can also be carried over to measuring randomness in stationary time series in the sense of correlations. (For the definition of stationarity and time series in general see Chapter 4.)

Definition 27 Let X_t (t ∈ Z) be a stationary process with var(X_t) = 1, and spectral density f. Then the spectral entropy of X_t is defined by

I(X_t, t ∈ Z) = -\int_{-\pi}^{\pi} f(x) \log_2 f(x)\,dx    (3.11)

This definition is plausible, because for a process with unit variance, f has the same properties as a probability distribution and can be interpreted as a distribution on frequencies. The process X_t is uncorrelated if and only if f is constant, i.e. if f is the uniform distribution on [−π, π].
Exactly in this case entropy is maximal, and knowledge of past observations does not help to predict future observations. On the other hand, if f has one or more extreme peaks, then entropy is very low (and in the limit minus infinity). This corresponds to the fact that in this case future observations can be predicted with high accuracy from past values. Thus, future observations do not contain as much new information as in the case of independence.

3.2.2 Measuring metric, melodic, and harmonic importance

General idea

Western classical music is usually structured in at least three aspects: melody, metric structure, and harmony. With respect to representing the essential melodic, metric, and harmonic structures, not all notes are equally important. For a given composition K, we may therefore try to find metric, melodic, and harmonic structures and quantify them in a weight function w : K → R^3 (which we will also call an "indicator"). For each note event x ∈ K, the three components of w(x) = (w_melodic(x), w_metric(x), w_harmonic(x)) quantify the "importance" of x with respect to the melodic, metric, and harmonic structure of the composition respectively.

Omnibus metric, melodic, and harmonic indicators

Specific definitions of structural indicators (or weight functions) are discussed for instance in Mazzola et al. (1995), Fleischer et al. (2000), and Beran and Mazzola (2001). To illustrate the general approach, we give a full definition of metric weights. Melodic and harmonic weights are defined in a similar fashion, taking into account the specific nature of melodic and harmonic structures respectively.

Metric structures characterize local periodic patterns in symbolic onset times. This can be formalized as follows: let K ⊂ Z^4 be a composition (with coordinates "Onset Time", "Pitch", "Loudness", and "Duration"), T ⊂ Z its set of onset times (i.e. the projection of K on the first axis) and let t_max = max{t : t ∈ T}. Without loss of generality the smallest onset time in T is equal to one.

Definition 28 For each triple (t, l, p) ∈ Z × N × N the set

B(t, l, p) = {t + kp : 0 ≤ k ≤ l}

is called a meter with starting point t, length l and period p.
The meter is called admissible, if B(t, l, p) ⊂ T. The non-negative length l of a local meter M = B(t, l, p) is uniquely determined by the set M and is denoted by l(M).

Note that by definition, t ∈ B(t, l, p) for any (t, l, p) ∈ Z × N × N. The importance of events at onset time s is now measured by the number of meters this onset is contained in. For a given triple (t, l, p), three situations can occur:
1. B(t, l, p) is admissible and there is no other admissible local meter B′ = B′(t′, l′, p′) such that B ⊊ B′;

2. B(t, l, p) is not admissible;

3. B(t, l, p) is admissible, but there is another admissible local meter B′ = B′(t′, l′, p′) such that B ⊊ B′.

We count only case 1. This leads to the following definition:

Definition 29 An admissible meter B(t, l, p) for a composition K ⊂ Z^4 is called a maximal local meter if and only if it is not a proper subset of another admissible local meter B(t′, l′, p′) of K. Denote by M(K) the set of maximal local meters of K and by M(K, t) the set of maximal local meters of K containing onset t.

Note that the set M(K) is always a covering of T. Metric weights can now be defined, for instance, by

Definition 30 Let x ∈ K be a note event at onset time t(x) ∈ T, M = M(K, t) the set of maximal local meters of K containing t(x), and h a nondecreasing real function on Z. Specify a minimal length l_min. Then the metric indicator (or metric weight) of x, associated with the minimal length l_min, is given by

w_metric(x) = \sum_{M \in \mathcal{M},\, l(M) \ge l_{min}} h(l(M))    (3.12)

In a similar fashion, melodic indicators w_melodic and harmonic indicators w_harmonic can be derived from a melodic and harmonic analysis respectively.

Specific indicators

A possible objection to weight functions as defined above is that only information about pitch and onset time is used. A score, however, usually contains much more symbolic information that helps musicians to read it correctly. For instance, melodic phrases are often connected by a phrasing slur, notes are grouped by beams, separate voices are made visible by suitable orientation of note stems, etc. Ideally, structural indicators should take into account such additional information. An improved indicator that takes into account knowledge about musical "motifs" can be defined for example as follows:

Definition 31 Let M = {(τ_1, y_1), ..., (τ_k, y_k)}, τ_1 < τ_2 < ... < τ_k, be a "motif" where y denotes pitch and τ onset time.
Given a composition K ⊂ T × Z ⊂ Z^2, define for each score-onset time t_i ∈ T (i = 1, ..., n) and u ∈ {1, ..., k}, the shifted motif

M(t_i, u) = {(t_i + τ_1 − τ_u, y_1), ..., (t_i + τ_k − τ_u, y_k)}
and denote by

T_u(t_i) = {t_i + τ_1 − τ_u, ..., t_i + τ_k − τ_u} = {s_1, ..., s_k}

the corresponding onset times. Moreover, let

X_u(t_i) = {x = (x(s_1), ..., x(s_k)) : (s_i, x(s_i)) ∈ K}

be the set of all pitch-vectors with onset set T_u(t_i). Then we define the distance

d_u(t_i) = \min_{x \in X_u(t_i)} \sum_{i=1}^{k} (x(s_i) - y_i)^2    (3.13)

If X_u is empty, then d_u(t_i) is not defined or set equal to an arbitrary upper bound D < ∞.

In this definition, it is assumed that the motif is identified beforehand by other means (e.g. "by hand" using traditional musical analysis). The distance d_u(t_i) thus measures in how far there are notes that are similar to those in M, if t_i is at the u-th place of the rhythmic pattern of motif M. Note that the Euclidean distance \sum (x(s_i) - y_i)^2 could be replaced by any other reasonable distance.

Analogously, distance or similarity can be measured by correlation:

Definition 32 Using the same definitions as above, let

x^o = \arg\min_{x \in X_u(t_i)} \sum_{i=1}^{k} (x(s_i) - y_i)^2,

and define r_u(t_i) to be the sample correlation between x^o and y = (y_1, ..., y_k). If M(t_i, u) ⊄ K, then set r_u(t_i) = 0.

Disregarding the position within a motif, we can now define overall motivic indicators (or weights), for instance by

w_{d,mean}(t_i) = g\left(\sum_{u=1}^{k} d_u(t_i)\right)    (3.14)

where g is a monotonically decreasing function,

w_{d,min}(t_i) = \min_{1 \le u \le k} d_u(t_i)    (3.15)

or

w_{corr}(t_i) = \max_{1 \le u \le k} r_u(t_i)    (3.16)

Finally, given weights for p different motifs, we may combine these into one overall indicator. For instance, an overall melodic indicator based on correlations can be defined by

w_{melod}(t_i) = \sum_{j=1}^{p} h(w_{corr,j}(t_i), L_i)    (3.17)
where w_{corr,j} is the weight function for motif number j and L_i is the number of elements in the motif. Including L_i has the purpose of attributing higher weights to the presence of longer motifs.

The advantage of the motif-based definition is that one can first search for possible motifs in the score, making full use of the available information in the score as well as musicological and historical knowledge, and then incorporate these in the definition of melodic weights. Similar definitions may be obtained for metric and harmonic indicators.

3.2.3 Measuring dimension

There are many different definitions of dimension, each measuring a specific aspect of "objects". Best known is the topological dimension. In the usual Euclidean space R^k with scalar product \langle x, y \rangle = \sum_{i=1}^{k} x_i y_i and distances |x - y| = \sqrt{\langle x - y, x - y \rangle}, the topological dimension of the space is equal to k. The dimension of an object in this space is equal to the dimension of the subspace it is contained in. The Euclidean space is, however, rather special since it is metric with a scalar product.

More generally, one can define a topological dimension in any topological (not necessarily metric) space in terms of coverings. We start with the definition of a topological space: a topological space is a nonempty set X together with a family O of so-called open subsets of X satisfying the following conditions:

1. X ∈ O and ∅ ∈ O (∅ denotes the empty set)

2. If U_1, U_2 ∈ O, then U_1 ∪ U_2 ∈ O

3. If U_1, U_2 ∈ O, then U_1 ∩ U_2 ∈ O.

A covering of a set S ⊆ X is a collection U ⊆ O of open sets such that

S ⊆ ∪_{U ∈ U} U.

A refinement of a covering U is a covering U* such that for each set U* ∈ U* there exists a set U ∈ U with U* ⊆ U.
The definition of topological dimension is now as follows:

Definition 33 A topological space X has topological dimension m, if every covering U of X has a refinement U* in which every point of X occurs in at most m + 1 sets of U*, and m is the smallest such integer.

The topological dimension of a subset S ⊆ X is defined analogously. For instance, a straight line in a Euclidean space can be divided into open intervals such that at most two intervals intersect, so that d_T = 1. Similarly, a simple geometric figure in the plane, such as a disk or a rectangle (including the inner area), can be covered with arbitrarily small circles or rectangles such that at most three such sets intersect; this number can, however, not be made smaller. Thus, the topological dimension of such an object is d_T = 3 − 1 = 2.
The topological dimension is a relatively rough measure of dimension, since it can assume integer values only and thus classifies sets (in a topological space) into a finite or countable number of categories. On the other hand, d_T is defined for very general spaces where a metric (i.e. distances) need not exist. A finer definition of dimension, which is however confined to metric spaces, is the Hausdorff-Besicovitch dimension. Suppose we have a set A in a metric space X. In a metric space, we can define open balls of radius r around each point x ∈ X by

U(r) = {y ∈ X : d_X(x, y) < r}

where d_X is the metric in X. The idea is now to measure the size of A by covering it with a finite number of balls U_r = {U_1(r), ..., U_k(r)} of radius r and to calculate an approximate measure of A by

\mu_{U_r, r, h}(A) = \sum h(r)    (3.18)

where the sum is taken over all balls and h is some positive function. This measure depends on r, the specific covering U_r and h. To obtain a measure that is independent of a specific covering, we define the measure

\mu_{r,h}(A) = \inf_{U_\rho : \rho < r} \mu_{U_\rho, \rho, h}(A)    (3.19)

This measure is still only an approximation of A. The question is now whether we can get a measure that corresponds exactly to the set A. This is done by taking the limit r → 0:

\mu_h(A) = \lim_{r \to 0} \mu_{r,h}(A)    (3.20)

Clearly, as r tends to zero, \mu_{r,h} becomes at most larger and therefore has a limit. The limit can be either zero (if \mu_{r,h} = 0 already), infinity, or a finite number. This leads to the following definition:

Definition 34 A function h for which

0 < \mu_h(A) < ∞

is called intrinsic function of A.

Consider, for example, a simple shape in the plane such as a circle with radius R. The area of the circle A can be measured by covering it by small circles of radius r and evaluating \mu_h(A) using the function h(r) = \pi r^2. It is well known that \lim_{r \to 0} \mu_{r,h}(A) exists and is equal to \mu_h(A) = \pi R^2. On the other hand, if we took h(r) = \pi r^\alpha with \alpha < 2, then \mu_h(A) = ∞, whereas for \alpha > 2, \mu_h(A) = 0.
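The covering construction can be imitated numerically. The sketch below is my own illustration, not from the book; for simplicity it covers the disk with axis-aligned grid squares of side ε rather than balls, which does not change the limiting value of the total measure. With h(r) proportional to r^2, the quantity N(ε)·ε² approaches the area πR² from above as ε → 0:

```python
import math

def dist_to_interval(a, b):
    """Distance from the origin (0) to the interval [a, b]."""
    return 0.0 if a <= 0.0 <= b else min(abs(a), abs(b))

def n_covering_cells(R, eps):
    """Number N(eps) of grid squares of side eps that intersect
    (and hence together cover) the disk x^2 + y^2 <= R^2."""
    m = int(math.ceil(R / eps)) + 1
    count = 0
    for i in range(-m, m):
        for j in range(-m, m):
            # distance from the origin to the nearest point of cell (i, j)
            dx = dist_to_interval(i * eps, (i + 1) * eps)
            dy = dist_to_interval(j * eps, (j + 1) * eps)
            if dx * dx + dy * dy <= R * R:
                count += 1
    return count

# Total measure of the covering, N(eps) * eps^2, for R = 1:
for eps in (0.1, 0.05, 0.01):
    print(eps, n_covering_cells(1.0, eps) * eps ** 2)
```

The printed values shrink toward π ≈ 3.1416 as ε decreases, mirroring the limit (3.20) for the disk with h(r) = πr².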
For standard sets, such as circles, rectangles, triangles, cylinders, etc., it is generally true that the intrinsic function for a set A with topological dimension d_T = d is given by (Hausdorff 1919)

h(r) = h_d(r) = \frac{\{\Gamma(\frac{1}{2})\}^d}{\Gamma(1 + \frac{d}{2})} r^d.    (3.21)
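Equation (3.21) can be checked against familiar formulas: for d = 1, 2, 3 it reproduces the length, area, and volume of a d-dimensional ball of radius r. A small Python sketch (my own illustration):

```python
import math

def h_d(r, d):
    """Intrinsic function h_d(r) = Gamma(1/2)^d / Gamma(1 + d/2) * r^d
    of equation (3.21)."""
    return math.gamma(0.5) ** d / math.gamma(1 + d / 2) * r ** d

# For d = 1, 2, 3 this equals 2r, pi r^2, and (4/3) pi r^3 respectively:
r = 2.0
print(h_d(r, 1), 2 * r)
print(h_d(r, 2), math.pi * r ** 2)
print(h_d(r, 3), 4 / 3 * math.pi * r ** 3)
```

Each pair of printed numbers agrees, since Γ(1/2) = √π, Γ(3/2) = √π/2, Γ(2) = 1, and Γ(5/2) = (3/4)√π.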
Many other more complicated sets, including randomly generated sets, have intrinsic functions of the form h(r) = L(r) r^d for some d > 0 which is not always equal to d_T, and L a function that is slowly varying at the origin (see e.g. Hausdorff 1919, Besicovitch 1935, Besicovitch and Ursell 1937, Mandelbrot 1977, 1983, Falconer 1985, 1986, Kono 1986, Telcs 1990, Devaney 1990). Here, L is called slowly varying at zero, if for any u > 0, \lim_{r \to 0} [L(ur)/L(r)] = 1. This leads to the following definition of dimension:

Definition 35 Let A be a subset of a metric space and h(r) = L(r) · r^d an intrinsic function of A where L(r) is slowly varying. Then d_H = d is called the Hausdorff-Besicovitch dimension (or Hausdorff dimension) of A.

The definition of Hausdorff dimension leads to the definition of fractals (see e.g. Mandelbrot 1977):

Definition 36 Let A be a subset of a metric space. Suppose that A has topological dimension d_T and Hausdorff dimension d_H such that d_H > d_T. Then A is called a fractal.

[Figure 3.2 Fractal pictures (by Céline Beran, computer generated). (Color figures follow page 152.)]

Intuitively, d_H > d_T means that the set A is "more complicated" than a standard set with topological dimension d_T. An alternative definition of Hausdorff dimension is the fractal dimension:

Definition 37 Let A be a compact subset of a metric space. For each ε > 0, denote by N(ε) the smallest number of balls of radius r ≤ ε necessary to cover A. If

d_F = -\lim_{\varepsilon \to 0} \frac{\log N(\varepsilon)}{\log \varepsilon}    (3.22)

exists, then d_F is called the fractal dimension of A.

It can be shown that d_F ≥ d_T. Moreover, in R^k one has d_F ≤ k = d_T. Beautiful examples of fractal curves and surfaces (cf. Figure 3.2) can be found in Mandelbrot (1977) and other related books. Many phenomena, not only in nature but also in art, appear to be fractal. For instance, fractal shapes can be found in Jackson Pollock's (1912-1956) abstract drip paintings (Taylor 1999a,b,c, 2000). In music, the idea of fractals was used by some contemporary composers, though mainly as a conceptual inspiration rather than an exact algorithm (e.g. Harri Vuori, György Ligeti; Figure 3.3).

[Figure 3.3 György Ligeti (*1923). (Courtesy of Philippe Gontier, Paris.)]

The notion of fractals is closely related to self-similarity (see Mandelbrot 1977 and references therein). Self-similar geometric objects have the property that the same shapes are repeated at infinitely many scales. By drawing recursively m smaller copies of the same shape, rescaling them by a factor s, one can construct fractals. For self-similar objects, the fractal dimension can be calculated directly from the scaling factor s and the number m of repetitions of the rescaled objects by

d_F = \frac{\log m}{\log s}    (3.23)

For many purposes more realistic are random fractals where, instead of the shape itself, the distribution remains the same after rescaling. More specifically, we have

Definition 38 Let X_t (t ∈ R) be a stochastic process. The process is called self-similar with self-similarity parameter H, if for any c > 0

X_t =_d c^{-H} X_{ct}

where =_d means equality of the two processes in distribution.

The parameter H is also called the Hurst exponent. Self-similar processes are (like their deterministic counterparts) very special models. However, they play a central role for stochastic processes just like the normal distribution for random variables. The reason is that, under very general conditions, the limit of partial sum processes (see Lamperti 1962, 1972) is always a self-similar process:
Theorem 11 Suppose that Z_t (t ∈ R_+) is a stochastic process such that Z_1 ≠ 0 with positive probability and Z_t is the limit in distribution of the sequence of normalized partial sums

a_n^{-1} S_{nt} = a_n^{-1} \sum_{s=1}^{[nt]} X_s \quad (n = 1, 2, ...)    (3.24)

where X_1, X_2, ... is a stationary discrete time process with zero mean and a_1, a_2, ... a sequence of positive normalizing constants such that \log a_n \to \infty. Then there exists an H > 0 such that for any u > 0, \lim_{n \to \infty} (a_{nu}/a_n) = u^H, Z_t is self-similar with self-similarity parameter H, and Z_t has stationary increments.

The self-similarity parameter therefore also makes sense for processes that are not exactly self-similar themselves, since it is defined by the rate n^{-H} needed to standardize partial sums. Moreover, H is related to the fractal dimension; the exact relationship between H and the fractal dimension, however, depends on some other properties of the process as well. For instance, sample paths of (univariate) Gaussian self-similar processes, so-called fractional Brownian motion (see Chapter 4), have, with probability one, a fractal dimension of 2 − H with possible values of H in the interval (0, 1). Thus, the closer H is to 1, the more a sample path is similar to a simple geometric line with dimension one. On the other hand, as H approaches zero, a typical sample path fills up most of the plane so that the dimension approaches two. Practically, H can be determined from an observed series X_1, ..., X_n, for example by maximum likelihood estimation.

For a thorough discussion of self-similar and related processes and statistical methods see e.g. Beran (1994).
Further references on fractals apart from those given above are, for instance, Edgar (1990), Falconer (1990), Peitgen and Saupe (1988), Stoyan and Stoyan (1994), and Tricot (1995). A cautionary remark should be made at this point: in view of Theorem 11, the fact that we do find self-similarity in aggregated time series is hardly surprising and can therefore not be interpreted as something very special that would distinguish the particular series from other data. What may be special at most is which particular value of H is obtained and which particular self-similar process the normalized aggregated series converges to.

3.3 Specific applications in music

3.3.1 Entropy of melodic shapes

Let x(t_i) be the upper and y(t_i) the lower envelope of a composition at score-onset times t_i (i = 1, ..., n). To investigate the shape of the melodic movement we consider the first and second discrete "derivatives"

x^{(1)}(t_i) = \frac{\Delta x(t_i)}{\Delta t_i} = \frac{x(t_{i+1}) - x(t_i)}{t_{i+1} - t_i}    (3.25)

and

x^{(2)}(t_i) = \frac{\Delta^2 x(t_i)}{\Delta^2 t_i} = \frac{[x(t_{i+2}) - x(t_{i+1})] - [x(t_{i+1}) - x(t_i)]}{[t_{i+2} - t_{i+1}] - [t_{i+1} - t_i]}    (3.26)

Alternatively, if octaves "do not count", we define

x^{(1;12)}(t_i) = \frac{[x(t_{i+1}) - x(t_i)]_{12}}{t_{i+1} - t_i}    (3.27)

and

x^{(2;12)}(t_i) = \frac{[x(t_{i+2}) - x(t_{i+1})]_{12} - [x(t_{i+1}) - x(t_i)]_{12}}{[t_{i+2} - t_{i+1}] - [t_{i+1} - t_i]}    (3.28)

where [x]_k = x mod k. Thus, in this definition intervals between successive notes x(t_i), x(t_{i+1}) and x(t_j), x(t_{j+1}) respectively are considered identical if they differ by octaves only.

The number of possible values of x^{(2)} and x^{(2;12)} is finite, however potentially very large. In first approximation we may therefore consider both variables to be continuous. In the following, the distribution of x^{(2)} and x^{(2;12)} is approximated by a continuous density kernel estimate f̂ (see Chapter 2). For illustration, we define the following measures of entropy:

1. E_1 = -\int \hat{f}(x) \log_2 \hat{f}(x)\,dx    (3.29)

where f̂ is obtained from the observed data x^{(2;12)}(t_1), ..., x^{(2;12)}(t_n) by kernel estimation.

2. E_2: Same as E_1, but using x^{(2)}(t_1), ..., x^{(2)}(t_n) instead.

3. E_3 = -\int\int \hat{f}(x, y) \log_2 \hat{f}(x, y)\,dx\,dy    (3.30)

where f̂(x, y) is a kernel estimate based on observations (a_i, b_i) with a_i = x^{(2)}(t_{i-1}) and b_i = x^{(2)}(t_i). Thus, E_3 is the (empirical) entropy of the joint distribution of two successive values of x^{(2)}.

4. E_4: Same as E_3, but using (x^{(2;12)}(t_{i-1}), x^{(2;12)}(t_i)) instead.

5. E_5: Same as E_3, but using (x(t_i) − y(t_i))^{(1)} instead.

6. E_6: Same as E_3, but using (x(t_i) − y(t_i))^{(1;12)} instead.

7. E_7: Same as E_1, but using (x(t_i) − y(t_i))^{(1)} instead.

8. E_8: Same as E_1, but using (x(t_i) − y(t_i))^{(1;12)} instead.
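The one-dimensional entropies such as E_1 can be approximated numerically: estimate f̂ by a kernel density estimate and replace the integral (3.29) by a Riemann sum. The Python sketch below is my own illustration, not the book's implementation; the Gaussian kernel, the rule-of-thumb bandwidth, and the grid are my choices:

```python
import math
import random

def kde_entropy(data, grid_points=512, pad=3.0):
    """Approximate E = -int f^(x) log2 f^(x) dx, where f^ is a Gaussian
    kernel density estimate with a rule-of-thumb bandwidth, using a
    midpoint Riemann sum over an extended data range."""
    n = len(data)
    mean = sum(data) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in data) / (n - 1))
    h = 1.06 * sd * n ** (-1 / 5)          # rule-of-thumb bandwidth
    lo, hi = min(data) - pad * h, max(data) + pad * h
    dx = (hi - lo) / grid_points
    c = 1.0 / (n * h * math.sqrt(2 * math.pi))
    ent = 0.0
    for g in range(grid_points):
        x = lo + (g + 0.5) * dx
        f = c * sum(math.exp(-0.5 * ((x - v) / h) ** 2) for v in data)
        if f > 0:
            ent -= f * math.log2(f) * dx
    return ent

# Sanity check: for N(0,1) samples the estimate should be near the true
# differential entropy 0.5 * log2(2 * pi * e), roughly 2.05 bits
random.seed(7)
sample = [random.gauss(0.0, 1.0) for _ in range(1000)]
e = kde_entropy(sample)
print(round(e, 2))
```

In the applications above, `data` would be the observed second differences x^(2;12)(t_1), ..., x^(2;12)(t_n) of the melodic envelope rather than Gaussian noise.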
Figure 3.4 Comparison of entropies 1, 2, 3, and 4 for J.S. Bach's Cello Suite No. I and R. Schumann's op. 15, No. 2, 3, 4, and 7, and op. 68, No. 2 and 16.
Each of these entropies characterizes the information content (or randomness) of certain aspects of melodic patterns in the upper and lower envelope. Figures 3.4a through d show boxplots of entropies 1 through 4 for Bach and Schumann (Figure 3.8). The pieces considered here are: J.S. Bach – Cello Suite No. I (each of the six movements separately), Präludium und Fuge No. 1 and 8 from "Das Wohltemperierte Klavier" I (each piece separately); R. Schumann – op. 15, No. 2, 3, 4, and 7, and op. 68, No. 2 and 16. Obviously there is a difference between Bach and Schumann in all four entropy measures. In Bach's pieces, entropy is higher, indicating a more uniform mixture of local melodic shapes.

3.3.2 Spectral entropy of local interval variability

Consider the local variability of intervals y_i = x(t_{i+1}) − x(t_i) between successive notes. Specifically, we consider a moving "nearest neighbor" window [t_i, t_{i+4}] (i = 1, ..., n − 4) and define local variances

v_i = (4 − 1)^{−1} Σ_{j=0}^{3} (y_{i+j} − ȳ_i)²   (3.31)

where ȳ_i = 4^{−1} Σ_{j=0}^{3} y_{i+j}. Based on this, a SEMIFAR model is fitted to the time series z_i = log(v_i + 1/2) (see Chapter 4 for the definition of SEMIFAR models). The fitted spectral density f(λ; θ̂) is then used to define the spectral entropy

E_9 = −∫_{−π}^{π} f(λ; θ̂) log f(λ; θ̂) dλ   (3.32)

If octaves do not count, then intervals are circular, so that an estimate of variability for circular data should be used. Here, we use R* = 2(1 − R) as defined in Chapter 7. To transform the range [0, 2] of R* to the real line, the logistic transformation is applied, defining

z_i = log((R* + ε)/(2 + ε − R*))

where ε is a small positive number that is needed so that −∞ < z_i < ∞ even if R* = 0 or 2, respectively. Fitting a SEMIFAR model to z_i, we then define E_10 the same way as E_9 above.

Figure 3.6 shows a comparison of E_9 and E_10 for the same compositions as in 3.3.1. In contrast to the previous measures of entropy, Bach is consistently lower than Schumann.
With respect to E_10 this is also the case in comparison with Scriabin (Figure 3.5) and Martin. Thus, for Bach there appears to be a high degree of nonrandomness (i.e. organization) in the way the variability of interval steps changes sequentially.
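The construction of the series z_i in (3.31) and a spectral entropy of type (3.32) can be sketched as follows. To keep the example self-contained, an ordinary least-squares AR(p) fit stands in for the SEMIFAR fit used in the text; all function names are our own:

```python
import numpy as np

def local_log_variances(pitches, window=4):
    """z_i = log(v_i + 1/2), with v_i the variance of `window` successive intervals, cf. (3.31)."""
    y = np.diff(np.asarray(pitches, dtype=float))   # intervals between successive notes
    v = np.array([y[i:i + window].var(ddof=1) for i in range(len(y) - window + 1)])
    return np.log(v + 0.5)

def ar_spectral_entropy(z, p=2, n_freq=1024):
    """Fit AR(p) by least squares, return -integral f log f over [-pi, pi], cf. (3.32).
    (An AR stand-in for the SEMIFAR model of the text.)"""
    z = np.asarray(z, dtype=float)
    z = z - z.mean()
    X = np.column_stack([z[p - k:len(z) - k] for k in range(1, p + 1)])
    phi, *_ = np.linalg.lstsq(X, z[p:], rcond=None)
    sigma2 = ((z[p:] - X @ phi) ** 2).mean()
    lam = np.linspace(-np.pi, np.pi, n_freq)
    A = 1 - np.exp(-1j * np.outer(lam, np.arange(1, p + 1))) @ phi
    f = sigma2 / (2 * np.pi) / np.abs(A) ** 2       # AR(p) spectral density
    return -np.sum(f * np.log(f)) * (lam[1] - lam[0])
```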
Figure 3.5 Alexander Scriabin (1871–1915) (at the piano) and the conductor Serge Koussevitzky. (Painting by Robert Sterl, 1910; courtesy of Gemäldegalerie Neuer Meister, Dresden, and Robert-Sterl-House.)

Figure 3.6 Comparison of entropies 9 and 10 for Bach, Schumann, and Scriabin/Martin.
3.3.3 Omnibus metric, melodic, and harmonic indicators for compositions by Bach, Schumann, and Webern

Figures 3.7, and 3.9 through 3.11 show the "omnibus" metric, melodic, and harmonic weight functions for Bach's Canon cancricans, Schumann's op. 15/2 and 7, and for Webern's Variations op. 27. For Bach's composition, the almost perfect symmetry around the middle of the composition can be seen. Moreover, the metric curve exhibits a very regular up and down. Schumann's curves, in particular the melodic one, show clear periodicities. This appears to be quite typical for Schumann and becomes even clearer when plotting a kernel-smoothed version of the curves (here a bandwidth of 8/8 was used). Interestingly, this type of pattern can also be observed for Webern. In view of the historic development of 12-tone music as a logical continuation of the harmonic freedom and romantic gesture achieved in the 19th and early 20th centuries, this similarity is not completely unexpected.

Figure 3.7 Metric, melodic, and harmonic global indicators for Bach's Canon cancricans.

Finally, note that a relationship between metric, melodic, and harmonic structure cannot be seen directly from the "raw" curves. However, smoothed weights as shown in the figures above reveal clear connections between the three weight functions. This is even the case for Webern, in spite of the absence of tonality.
Figure 3.8 Robert Schumann (1810–1856). (Courtesy of Zentralbibliothek Zürich.)

3.3.4 Specific melodic indicators for Schumann's Träumerei

Schumann's Träumerei is rich in local motifs. Here, we consider eight of these, as indicated in Figure 3.12. Figure 3.13 displays the individual indicator functions obtained from (3.16). The overall indicator function m(t) = w_melod(t) displayed in Figure 3.15 is defined by (3.17) with h(w, L) = [2 · max(w, 0.5)]^L and L_j = number of notes in motif j. The contributions h(w_corr,j(t_i), L_j) of w_corr,j (j = 1, ..., 8) are given in Figure 3.14.
Figure 3.9 Metric, melodic, and harmonic global indicators for Schumann's op. 15, No. 2 (upper figure), together with smoothed versions (lower figure).
Figure 3.10 Metric, melodic, and harmonic global indicators for Schumann's op. 15, No. 7 (upper figure), together with smoothed versions (lower figure).
Figure 3.11 Metric, melodic, and harmonic global indicators for Webern's Variations op. 27, No. 2 (upper figure), together with smoothed versions (lower figure).
Figure 3.12 R. Schumann – Träumerei: motifs used for specific melodic indicators.
Figure 3.13 R. Schumann – Träumerei: indicators of individual motifs.

Figure 3.14 R. Schumann – Träumerei: contributions of individual motifs to overall melodic indicator.
Figure 3.15 R. Schumann – Träumerei: overall melodic indicator.
CHAPTER 4

Time series analysis

4.1 Musical motivation

Musical events are ordered according to a specific temporal sequence. Time series analysis deals with observations that are indexed by an ordered variable (usually time). It is therefore not surprising that time series analysis is important for analyzing musical data. Traditional applications are concerned with "raw physical data" in the form of audio signals (e.g. digital CD recording, sound analysis, frequency recognition, synthetic sounds, modeling musical instruments). In the last few years, time series models have been developed for modeling symbolic musical data and analyzing "higher level" structures in musical performance and composition. A few examples are discussed in this chapter.

4.2 Basic principles

4.2.1 Deterministic and random components, basic definitions

Time series analysis in its most sophisticated form is a complex subject that cannot be summarized in one short chapter. Here, we briefly mention some of the main ingredients only. For a thorough systematic account of the topic we refer the reader to standard textbooks such as Priestley (1981a,b), Brillinger (1981), Brockwell and Davis (1991), Diggle (1990), Beran (1994), and Shumway and Stoffer (2000).

A time series is a family of (usually, but not necessarily) real variables X_t with an ordered index t. For simplicity, we assume that observations are taken at equidistant discrete time points t ∈ Z (or N). Usually, observations are random with certain deterministic components. For instance, we may have an additive decomposition X_t = µ(t) + U_t where U_t is such that E(U_t) = 0 and µ(t) is a deterministic function of t. One of the main aims of time series analysis is to identify the probability model that generated an observed time series x_1, ..., x_n. In the additive model this would mean estimating the mean function µ(t) and the probability distribution of the random sequence U_1, U_2, ....
Note that a random sequence can also be understood as a function mapping positive integers t to the real numbers U_t. The main difficulties in identifying the correct distribution are:
1. The probability law has to be defined on an infinite-dimensional space of vectors (X_1, X_2, ...). This difficulty is even more serious for continuous time series, where a sample path is a function on R.
2. The finite sample vector X(n) = (X_1, ..., X_n)^t has an arbitrary n-dimensional distribution, so that it cannot be estimated consistently from observed values x_1, ..., x_n unless some minimal assumptions are made.

Difficulty 1 can be solved by applying appropriate mathematical techniques and is described in detail in standard books on stochastic processes and time series analysis (see e.g. Billingsley 1986 and the references above). Difficulty 2 cannot be solved by mathematical arguments only. It is of course possible to give necessary or sufficient conditions such that the probability distribution can be estimated with arbitrary accuracy (measured in an appropriate sense) as n tends to infinity. However, which concrete assumptions should be used depends on the specific application. Assumptions should neither be too general (otherwise population quantities cannot be estimated) nor too restrictive (otherwise results are unrealistic).

A standard, and almost necessary, assumption is that X_t can be reduced to a stationary process U_t by applying a suitable transformation. For instance, we may have a deterministic "trend" µ(i) plus stationary "noise" U_i,

X_i = µ(i) + U_i,   (4.1)

or an integrated process of order m for which the mth difference is stationary, i.e.

(1 − B)^m X_i = U_i   (4.2)

where (1 − B)X_i = X_i − X_{i−1}. In the latter case, X_t is called m-difference stationary. Stationarity is defined as follows:

Definition 39 A time series X_i is called strictly stationary if, for any k, i_1, ..., i_n ∈ N,

P(X_{i_1} ≤ x_1, ..., X_{i_n} ≤ x_n) = P(X_{i_1+k} ≤ x_1, ..., X_{i_n+k} ≤ x_n)   (4.3)

The time series is called weakly (or second order) stationary if

µ(i) = E(X_i) = µ = const   (4.4)

and for any i, j ∈ N, the autocovariance depends on the lag k = |i − j| only, i.e.
cov(X_i, X_{i+k}) = γ(k) = γ(−k)   (4.5)

A second order stationary process can be decomposed into uncorrelated random components that correspond to periodic signals, via the so-called spectral representation

X_t = µ + ∫_{−π}^{π} e^{itλ} dZ_X(λ).   (4.6)

Here Z_X(λ) = Z_{X,1}(λ) + iZ_{X,2}(λ) ∈ C is a so-called orthogonal increment
process (in λ) with the following properties: Z_X(0) = 0, E[Z_X(λ)] = 0, and for λ_1 > λ_2 ≥ ν_1 > ν_2,

E[ΔZ̄_X(λ_2, λ_1) ΔZ_X(ν_2, ν_1)] = 0   (4.7)

where ΔZ_X(u, v) = Z_X(u) − Z_X(v). The integral in (4.6) is defined as a limit in mean square. It can be constructed by approximating the function e^{itλ} by step functions

g_n(λ) = Σ α_{i,n} 1{a_{i,n} < λ ≤ b_{i,n}}   (n ∈ N).

For step functions we have the integrals

I_n = ∫_{−π}^{π} g_n(λ) dZ_X(λ) = Σ α_{i,n} [Z(b_{i,n}) − Z(a_{i,n})].

As g_n → e^{itλ}, the integrals I_n converge to a random variable I, in the sense that

lim_{n→∞} E[(I − I_n)²] = 0.

The random variable I is then denoted by ∫ exp(itλ) dZ(λ). The spectral representation is especially useful when one needs to identify (random) periodicities. For this purpose one defines the spectral distribution function

F_X(λ) = E[|Z_X(λ) − Z_X(0)|²] = E[|Z_X(λ)|²]   (4.8)

The variance is then decomposed into frequency contributions by

var(X_t) = ∫_{−π}^{π} E[|dZ_X(λ)|²] = ∫_{−π}^{π} dF_X(λ)   (4.9)

This means that the expected contribution (expected squared amplitude) of components with frequencies in the interval (λ, λ + ε] to the variance of X_t is equal to F(λ + ε) − F(λ). Two interesting special cases can be distinguished:

Case 1 – F differentiable: In this case,

F(λ + ε) − F(λ) = F′(λ)ε + o(ε) = f(λ)ε + o(ε).

The function f is called the spectral density and can also be defined directly by

f(λ) = (2π)^{−1} Σ_{k=−∞}^{∞} γ_X(k) e^{ikλ}   (4.10)

where γ_X(k) = cov(X_t, X_{t+k}). The inverse relationship is

γ_X(k) = ∫_{−π}^{π} e^{ikλ} f(λ) dλ   (4.11)

A high peak of f at a frequency λ_o means that the component(s) at (or in the neighborhood of) λ_o contribute largely to the variability of X_t. Note
that the period of exp(itλ), as a function of t, is T = 2π/λ (sometimes one therefore defines λ̃ = λ/(2π) as the frequency, in order that the period T is directly the inverse of the frequency). Thus, a peak of f at λ_o implies that a sample path of X_t is likely to exhibit a strong periodic component with frequency λ_o. Periodicity is, however, random – the observed series is not a periodic function. The meaning of random periodicity can be explained best in the simplest case where T is an integer: if f has a peak at frequency λ_o = 2π/T, then the correlation between X_t and X_{t+jT} (j ∈ Z) is relatively high compared to other correlations with similar lags. A further complication that blurs periodicity is that, if f is continuous around a peak at λ_o, then the observed signal is a weighted sum of infinitely (in fact uncountably) many relatively large components with frequencies similar to λ_o. The sharper the peak, the less this "blurring" takes place, and a distinct periodicity (though still random) can be seen. In the other extreme case where f is constant, there is no preference for any frequency, and γ_X(k) = 0 (k ≠ 0), i.e. observations are uncorrelated.

Case 2 – F is a step function with a finite or countable number of jumps: this corresponds to processes of the form

X_t = Σ_{j=1}^{k} A_j e^{iλ_j t}

for some k ≤ ∞, and λ_j ∈ [0, π], A_j ∈ C. We then have

F(λ) = Σ_{j: λ_j ≤ λ} E[|A_j|²],   (4.12)

var(X_t) = Σ_{j=1}^{k} E[|A_j|²]   (4.13)

This means that the variance is a sum of contributions that are due to the frequencies λ_j (1 ≤ j ≤ k). A sample path of X_t cannot be distinguished from a deterministic periodic function, because the randomly selected amplitudes A_j are then fixed.

Finally, it should be noted that not all frequencies are observable when observations are taken at discrete time points t = 1, 2, ..., n. The smallest identifiable period is 2, which corresponds to a highest observable frequency of 2π/2 = π.
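The fact that frequencies outside [−π, π] are indistinguishable at integer observation times is easy to verify numerically; a minimal sketch with arbitrarily chosen numbers:

```python
import numpy as np

# At integer times t = 1, ..., n, exp(i*t*lam) = exp(i*t*(lam - 2*pi)), so any
# frequency outside [-pi, pi] coincides with one inside it.
t = np.arange(1, 101)
lam = 0.4 * np.pi

# 2*pi - lam folds back onto -lam, and lam + 2*pi onto lam itself.
assert np.allclose(np.cos(lam * t), np.cos((2 * np.pi - lam) * t))
assert np.allclose(np.cos(lam * t), np.cos((lam + 2 * np.pi) * t))

# The shortest observable period is 2 (frequency pi): cos(pi*t) merely alternates.
assert np.allclose(np.cos(np.pi * t), (-1.0) ** t)
```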
The largest identifiable period is n/2, which corresponds to the smallest frequency 4π/n. As n increases, the lowest frequency tends to zero; the highest, however, does not. In other words, the highest frequency resolution does not improve with increasing sample size.

To obtain more general models, one may wish to relax the condition of stationarity. An asymptotic concept of local stationarity is defined in Dahlhaus (1996a,b, 1997): a sequence of stochastic processes X_{t,n}
(n ∈ N) is called locally stationary if it has a spectral representation

X_{t,n} = µ(t/n) + ∫_{−π}^{π} e^{itλ} A_{t,n}(λ) dZ_X(λ),   (4.14)

with "=" meaning almost sure (a.s.) equality, µ(u) continuous, and there exists a 2π-periodic function A : [0, 1] × R → C such that A(u, −λ) = Ā(u, λ), A(u, λ) is continuous in u, and

sup_{t,λ} |A(t/n, λ) − A_{t,n}(λ)| ≤ c n^{−1}   (4.15)

(a.s.) for some constant c < ∞. Intuitively, this means that for n large enough, the observed process can be approximated locally in a small time window t ± ε by the stationary process ∫ exp(itλ) A(t/n, λ) dZ_X(λ). The order n^{−1} of the approximation is chosen such that most standard estimation procedures, such as maximum likelihood estimation, can be applied locally and their usual properties (e.g. consistency, asymptotic normality) still hold. Under smoothness conditions on A one can prove that a meaningful "evolving" spectral density f_X(u, λ) (u ∈ (0, 1)) exists such that

f_X(u, λ) = lim_{n→∞} (2π)^{−1} Σ_{k=−∞}^{∞} cov(X_{[u·n−k/2],n}, X_{[u·n+k/2],n}) e^{−ikλ}   (4.16)

The function f_X(u, λ) is called the evolutionary spectral density. Note that, for fixed u,

lim_{n→∞} cov(X_{[u·n−k/2],n}, X_{[u·n+k/2],n}) = γ_X(k) = ∫_{−π}^{π} exp(ikλ) f_X(u, λ) dλ.

Thumfart (1995) carries this concept over to series with discrete spectra. A simplified definition can be given as follows: a sequence of stochastic processes X_{t,n} (n ∈ N) is said to have a discrete evolutionary spectrum F_X(u, λ) if

X_{t,n} = µ(t/n) + Σ_{j∈M} A_j(t/n) e^{iλ_j(t/n) t}   (4.17)

where M ⊆ Z, and µ(u), A_j(u), and λ_j(u) are twice continuously differentiable. The discrete evolutionary spectrum can be defined in analogy to the continuous case. For other definitions of nonstationary processes see e.g. Priestley (1965, 1981), Ghosh et al. (1997), and Ghosh and Draghicescu (2002a,b).

4.2.2 Sampling of continuous-time time series

Often time series observed at discrete time points t = j · Δτ (j = 1, 2, 3, ...) actually "happen" in continuous time τ ∈ R. Sampling in discrete time
leads to information loss in the following way: let Y_τ be a second order stationary time series with τ ∈ R. (Stationarity in continuous time is defined in exact analogy to Definition 39.) Then Y_τ has a spectral representation

Y_τ = ∫_{−∞}^{∞} e^{iτλ} dZ_Y(λ),   (4.18)

a spectral distribution function

F_Y(λ) = ∫_{−∞}^{λ} E[|dZ(λ)|²]   (4.19)

and, if F′ exists, a spectral density function

f_Y(λ) = F′(λ) = (2π)^{−1} ∫_{−∞}^{∞} e^{−iτλ} γ_Y(τ) dτ   (4.20)

We also have

γ_Y(τ) = cov(Y_t, Y_{t+τ}) = ∫ e^{iλτ} f(λ) dλ.

The reason why the frequency range extends to (−∞, ∞), instead of [−π, π], is that in continuous time, by definition, arbitrarily large frequencies (i.e. arbitrarily small periods) are observable.

Suppose now that Y_τ is observed at discrete time points t = j · Δτ, i.e. we observe

X_t = Y_{j·Δτ}   (4.21)

Then we can write

X_t = ∫_{−∞}^{∞} e^{ij(Δτ λ)} dZ_Y(λ) = Σ_{u=−∞}^{∞} ∫_{−π/Δτ+(2π/Δτ)u}^{π/Δτ+(2π/Δτ)u} e^{ij(Δτ λ)} dZ_Y(λ)   (4.22)

= Σ_{u=−∞}^{∞} ∫_{−π/Δτ}^{π/Δτ} e^{ij(Δτ λ)} dZ_Y(λ + (2π/Δτ)u) = ∫_{−π/Δτ}^{π/Δτ} e^{itλ} dZ_X(λ)   (4.23)

where

dZ_X(λ) = Σ_{u=−∞}^{∞} dZ_Y(λ + (2π/Δτ)u)   (4.24)

Moreover, if Y_τ has spectral density f_Y, then the spectral density of X_t is

f_X(λ) = Σ_{u=−∞}^{∞} f_Y(λ + (2π/Δτ)u)   (4.25)

for λ ∈ [−π/Δτ, π/Δτ]. This result can be interpreted as follows: a frequency λ > π/Δτ can be written as λ = λ_o − (2π/Δτ)j for some j ∈ N, where λ_o is in the interval [−π/Δτ, π/Δτ]. The contributions of the two frequencies λ and
λ_o to the observed function X_t (in discrete time) are confounded, i.e. they cannot be distinguished. Thus, if we observe a peak of f_X at a frequency λ ∈ (0, π/Δτ], then this may be due to any of the periodic components with periods 2π/(λ + (2π/Δτ)u), u = 0, 1, 2, ..., or a combination of these. This has, for instance, direct implications for the sampling of sound signals. Suppose that 22050 Hz (i.e. λ = 22050 · 2π ≈ 138544.2) is the highest frequency that we want to identify (and later reproduce) correctly, instead of attributing it to a lower frequency. This would cover the range perceivable by the human ear. Then Δτ must be so small that π/Δτ ≥ 22050 · 2π. Thus the time gap Δτ between successive measurements of the sound wave must not exceed 1/44100.

4.2.3 Linear filters

Suppose we need to extract or eliminate frequency components from a signal X_t with spectral density f_X. The aim is thus, for instance, to produce an output signal Y_t whose spectral density f_Y is zero for a frequency interval a ≤ λ ≤ b. The simplest, though not necessarily best, way to do this is linear filtering. A linear filter maps an input series X_t to an output series Y_t by

Y_t = Σ_{j=−∞}^{∞} a_j X_{t−j}   (4.26)

The coefficients must fulfill certain conditions in order that the sum is defined. If X_t is second order stationary, then we need Σ a_j² < ∞. The resulting spectral density of Y_t is

f_Y(λ) = |A(λ)|² f_X(λ)   (4.27)

where

A(λ) = Σ_{j=−∞}^{∞} a_j e^{−ijλ}.   (4.28)

To eliminate a certain frequency band [a, b] one thus needs a linear filter such that A(λ) ≡ 0 in this interval. Equation (4.27) also helps to construct and simulate time series models with desired spectral densities: a series with spectral density f_Y(λ) = (2π)^{−1} |A(λ)|² can be simulated by passing a series of independent observations X_t through the filter A(λ).
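The effect of (4.27)–(4.28) can be illustrated with a simple symmetric moving average; the filter coefficients a_j = 1/3 for j ∈ {−1, 0, 1} are an arbitrary illustrative choice of ours:

```python
import numpy as np

def transfer(lam, coeffs, lags):
    """Transfer function A(lam) = sum_j a_j exp(-i*j*lam), cf. (4.28)."""
    lam = np.atleast_1d(lam)
    return np.exp(-1j * np.outer(lam, lags)) @ coeffs

a = np.full(3, 1.0 / 3.0)        # moving-average filter a_j = 1/3, j in {-1, 0, 1}
lags = np.array([-1, 0, 1])

# For this filter A(lam) = (1 + 2 cos(lam))/3: the zero frequency passes
# unchanged, while lam = 2*pi/3 is annihilated, so f_Y = |A|^2 f_X vanishes there.
assert np.allclose(transfer(0.0, a, lags), 1.0)
assert np.allclose(transfer(2 * np.pi / 3, a, lags), 0.0)
```

High frequencies are damped as well: at λ = π the transfer function equals (1 + 2 cos π)/3 = −1/3, so the spectral density there is reduced by the factor 1/9.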
Note that, in reality, one can use only a finite number of terms in the filter, so that only an approximation can be achieved.

4.2.4 Special models

When modeling time series statistically, one may use one of the following approaches: a) parametric modeling; b) nonparametric modeling; and c)
semiparametric modeling. In parametric modeling, the probability distribution of the time series is completely specified a priori, except for a finite-dimensional parameter θ = (θ_1, ..., θ_p)^t. In contrast, for nonparametric models, an infinite-dimensional parameter is unknown and must be estimated from the data. Finally, semiparametric models have parametric and nonparametric components. A link between parametric and nonparametric models can also be established by data-based choice of the length p of the unknown parameter vector θ, with p tending to infinity with the sample size. Some typical parametric models are:

1. White noise: X_t second order stationary, var(X_t) = σ², f_X(λ) = σ²/(2π), and γ_X(k) = 0 (k ≠ 0).

2. Moving average process of order q, MA(q):

X_t = µ + ε_t + Σ_{k=1}^{q} ψ_k ε_{t−k}   (4.29)

with µ ∈ R, ε_t independent identically distributed (iid) random variables, E(ε_t) = 0 and σ_ε² = var(ε_t) < ∞. This can also be written as

X_t − µ = ψ(B) ε_t   (4.30)

where B is the backshift operator with BX_t = X_{t−1}, and ψ(B) = Σ_{k=0}^{q} ψ_k B^k. If Σ_{k=0}^{q} ψ_k z^k = 0 implies |z| > 1, then X_t is invertible, in the sense that it can also be written as

X_t − µ = Σ_{k=1}^{∞} ϕ_k (X_{t−k} − µ) + ε_t.

3. Autoregressive process of order p, AR(p):

(X_t − µ) − Σ_{k=1}^{p} ϕ_k (X_{t−k} − µ) = ε_t   (4.31)

or ϕ(B)(X_t − µ) = ε_t, where ϕ(B) = 1 − Σ_{k=1}^{p} ϕ_k B^k. If 1 − Σ_{k=1}^{p} ϕ_k z^k = 0 implies |z| > 1, then X_t is stationary.

4. Autoregressive moving average process, ARMA(p, q):

ϕ(B)(X_t − µ) = ψ(B) ε_t.   (4.32)

The spectral density is

f_X(λ) = (σ_ε²/(2π)) |ψ(e^{iλ})|² / |ϕ(e^{iλ})|².   (4.33)

5. Linear process:

X_t = µ + Σ_{j=−∞}^{∞} ψ_j ε_{t−j}   (4.34)
where the ψ_j depend on a finite-dimensional parameter vector θ. The spectral density is f_X(λ) = (σ_ε²/(2π)) |ψ(e^{iλ})|².

6. Integrated ARIMA process, ARIMA(p, d, q) (Box and Jenkins 1970):

ϕ(B)((1 − B)^d X_t − µ) = ψ(B) ε_t   (4.35)

with d = 0, 1, 2, ..., where ϕ(z) and ψ(z) are not zero for |z| ≤ 1. This means that the dth difference (1 − B)^d X_t is a stationary ARMA process.

7. Fractional ARIMA process, FARIMA(p, d, q) (Granger and Joyeux 1980, Hosking 1981, Beran 1995):

(1 − B)^δ ϕ(B){(1 − B)^m X_t − µ} = ψ(B) ε_t   (4.36)

with d = m + δ, −1/2 < δ < 1/2, m = 0, 1. Here,

(1 − B)^d = Σ_{k=0}^{∞} (d choose k)(−1)^k B^k

with

(d choose k) = Γ(d + 1)/(Γ(k + 1)Γ(d − k + 1)).

The spectral density of (1 − B)^m X_t is

f_X(λ) = (σ_ε²/(2π)) |1 − e^{iλ}|^{−2δ} |ψ(e^{iλ})|² / |ϕ(e^{iλ})|².   (4.37)

The fractional differencing parameter δ plays an important role. If δ = 0, then (1 − B)^m X_t is an ordinary ARIMA(p, 0, q) process, with spectral density such that f_X(λ) converges to a finite value f_X(0) as λ → 0, and the covariances decay exponentially, i.e. |γ_X(k)| ≤ C a^k for some 0 < C < ∞, 0 < a < 1. The process is therefore said to have short memory. For δ > 0, f_X has a pole at the origin of the form f_X(λ) ∝ λ^{−2δ} as λ → 0, and γ_X(k) ∝ k^{2δ−1}, so that

Σ_{k=−∞}^{∞} γ_X(k) = ∞.

This case is also known as long memory, since autocorrelations decay very slowly (see Beran 1994). On the other hand, if δ < 0, then f_X(λ) ∝ λ^{−2δ} converges to zero at the origin and

Σ_{k=−∞}^{∞} γ_X(k) = 0.

This is called antipersistence, since for large lags there is a negative correlation. The fractional differencing parameter δ, or d = δ + m, is also called the long-memory parameter, and is related to the fractal or Hausdorff dimension d_H (see Chapter 3). For an extended discussion of long-memory and antipersistent processes see e.g. Beran (1994) and references therein.
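The binomial expansion of (1 − B)^d above can be computed by a simple recursion; a sketch (the recursion is an algebraically equivalent form of the (−1)^k (d choose k) coefficients, and the helper name is ours):

```python
import numpy as np

def frac_diff_coeffs(d, n):
    """First n coefficients of (1 - B)^d = sum_k c_k B^k, via the recursion
    c_0 = 1, c_k = c_{k-1} * (k - 1 - d) / k, equivalent to (-1)^k * binom(d, k)."""
    c = np.empty(n)
    c[0] = 1.0
    for k in range(1, n):
        c[k] = c[k - 1] * (k - 1 - d) / k
    return c

# For integer d the expansion terminates: (1 - B)^1 has coefficients 1, -1, 0, ...
assert np.allclose(frac_diff_coeffs(1, 4), [1, -1, 0, 0])

# For fractional d (here 0 < d < 1/2) the coefficients decay only hyperbolically,
# so the operator has infinite "memory"; partial sums approach 0 slowly from above.
c = frac_diff_coeffs(0.3, 1000)
assert 0 < c.sum() < 0.2
```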
8. Fractional Gaussian noise (Mandelbrot and van Ness 1968, Mandelbrot and Wallis 1969): recall that a stochastic process Y_t (t ∈ R) is called self-similar with self-similarity parameter H if, for any c > 0, Y_t =_d c^{−H} Y_{ct}. This definition implies that the covariances of Y_t are equal to

cov(Y_t, Y_{t+s}) = (σ²/2)(|t|^{2H} + |s|^{2H} − |t − s|^{2H})

where σ² > 0. If Y_t is Gaussian (i.e. all joint distributions are normal), then the process is fully determined by its expected value and the covariance function. Therefore, there is only one self-similar Gaussian process. This process is called fractional Brownian motion B_H(t), with self-similarity parameter 0 < H < 1. The discrete-time increment process

X_t = B_H(t) − B_H(t − 1)  (t ∈ N)   (4.38)

is called fractional Gaussian noise (FGN). FGN is stationary with autocovariances

γ(k) = (σ²/2)(|k + 1|^{2H} + |k − 1|^{2H} − 2|k|^{2H}),   (4.39)

and the spectral density is equal to (Sinai 1976)

f(λ) = 2c_f (1 − cos λ) Σ_{j=−∞}^{∞} |2πj + λ|^{−2H−1},  λ ∈ [−π, π]   (4.40)

with c_f = c_f(H, σ²) = σ²(2π)^{−1} sin(πH)Γ(2H + 1) and σ² = var(X_i). For further discussion see e.g. Beran (1994).

9. Polynomial trend model:

X_t = Σ_{j=0}^{p} β_j t^j + U_t   (4.41)

where U_t is stationary.

10. Harmonic or seasonal trend model:

X_t = Σ_{j=0}^{p} (α_j cos λ_j t + β_j sin λ_j t) + U_t   (4.42)

with U_t stationary.

11. Nonparametric trend model:

X_{t,n} = g(t/n) + U_t   (4.43)

with g : [0, 1] → R a "smooth" function (e.g. twice continuously differentiable) and U_t stationary.

12. Semiparametric fractional autoregressive model, SEMIFAR(p, d, q) (Beran 1998, Beran and Ocker 1999, 2001, Beran and Feng 2002a,b):

(1 − B)^δ ϕ(B){(1 − B)^m X_t − g(s_t)} = U_t   (4.44)
where d, ϕ, ε_t, and g are as above, and m = 0, 1. In this case, the centered differenced process Y_t = (1 − B)^m X_t − g(s_t) is a fractional ARIMA(p, δ, 0) model. The SEMIFAR model incorporates stationarity, difference stationarity, antipersistence, short memory, and long memory, as well as an unspecified trend. Incorporating all these components enables us to distinguish statistically which of the components are present in an observed time series (see Beran and Feng 2002a,b). A software implementation by Beran is included in the S-Plus package FinMetrics and described in Zivot and Wang (2002).

4.2.5 Fitting parametric models

If X_t is a second order stationary model with a distribution function that is known except for a finite-dimensional parameter θ^o = (θ_1^o, ..., θ_k^o)^t ∈ Θ ⊆ R^k, then the standard estimation technique is the maximum likelihood method: given an observed time series x_1, ..., x_n, estimate θ by

θ̂ = arg max_{θ∈Θ} h(x_1, ..., x_n; θ)   (4.45)

where h is the joint density function of (X_1, ..., X_n). If observations are discrete, then h is the joint probability P(X_1 = x_1, ..., X_n = x_n). Equivalently, we may maximize the log-likelihood L(x_1, ..., x_n; θ) = log h(x_1, ..., x_n; θ). Under fairly general regularity conditions, θ̂ is asymptotically consistent, in the sense that it converges in probability to θ^o. In other words, lim_{n→∞} P(|θ̂ − θ^o| > ε) = 0 for all ε > 0.
In the case of a Gaussian time series with spectral density f_X(λ; θ), we have

L(x_1, ..., x_n; θ) = −(1/2)[n log 2π + log |Σ_n| + (x − x̄)^t Σ_n^{−1} (x − x̄)]   (4.46)

where x = (x_1, ..., x_n)^t, x̄ = x̄ · (1, 1, ..., 1)^t, and |Σ_n| is the determinant of the covariance matrix of (X_1, ..., X_n)^t with elements [Σ_n]_{ij} = cov(X_i, X_j). Since under general conditions n^{−1} log |Σ_n| converges to (2π)^{−1} times the integral of log f_X (Grenander and Szegö 1958), and the (j, l)th element of Σ_n^{−1} can be approximated by a constant times ∫_{−π}^{π} f_X^{−1}(λ) exp{i(j − l)λ} dλ, an approximation to θ̂ can be obtained by the so-called Whittle estimator θ̃ (Whittle 1953; see also e.g. Fox and Taqqu 1986, Dahlhaus 1987) that minimizes

L_n(θ) = (4π)^{−1} ∫_{−π}^{π} [log f_X(λ; θ) + I(λ)/f_X(λ; θ)] dλ   (4.47)

An alternative approximation for Gaussian processes is obtained by using an autoregressive representation of the type X_t = Σ_{j=1}^{∞} b_j X_{t−j} + ε_t, where the ε_t are independent identically distributed zero-mean normal variables with variance σ². This leads to minimizing the sum of the squared residuals, as explained below in Equation (4.50) (see e.g. Box and Jenkins 1970, Beran 1995).
In general, the actual mathematical and practical difficulty lies in defining a computationally feasible estimation procedure and in obtaining the asymptotic distribution of θ̂. There is a large variety of models for which this has been achieved. Most results are known for linear models X_t = Σ ψ_j ε_{t−j} with iid ε_t. (All examples given in the previous section are linear.) The reason is that, if the distribution of ε_t is known, then the distribution of the process can be recovered by looking at the autocovariances only, or equivalently the spectral density. Furthermore, if X_t is invertible, i.e. if X_t can be written as X_t = Σ_{k=1}^{∞} ϕ_k X_{t−k} + ε_t, then θ^o can be estimated by maximizing the log-likelihood of the independent variables ε_t:

θ̂ = arg max_{θ∈Θ} Σ_{t=1}^{n} log h_ε(e_t(θ))   (4.48)

where h_ε is the probability density of ε and e_t(θ) = x_t − Σ_{k=1}^{∞} ϕ_k x_{t−k}. For a finite sample, e_t(θ) is approximated by ê_t(θ) = x_t − Σ_{k=1}^{t−1} ϕ_k x_{t−k}. In the simplest case where the ε_t are normally distributed with h_ε(x) = (2πσ_ε²)^{−1/2} exp{−x²/(2σ_ε²)} and θ = (σ_ε², θ_2, ..., θ_p) = (σ_ε², η), we have e_t(θ) = e_t(η) and

θ̂ = arg min_{θ∈Θ} [n log σ_ε² + Σ_{t=1}^{n} e_t²(η)/σ_ε²]   (4.49)

Differentiating with respect to θ leads to

η̂ = arg min_η Σ_{t=1}^{n} e_t²(η)   (4.50)

and σ̂_ε² = n^{−1} Σ e_t²(η̂). Under mild regularity conditions, as n tends to infinity, the distribution of √n(θ̂ − θ) tends to a normal distribution N(0, V) with covariance matrix V = 2B^{−1}, where B is a p × p matrix with elements

B_{ij} = (2π)^{−1} ∫_{−π}^{π} (∂/∂θ_i) log f(λ; θ) (∂/∂θ_j) log f(λ; θ) dλ

(see e.g. Box and Jenkins 1970, Beran 1995).

The estimation method above assumes that the order of the model, i.e. the length p of the parameter vector θ, is known. This is not the case in general, so p has to be estimated from the data.
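For an AR(1) model, the least-squares criterion (4.50) has a closed-form minimizer; a minimal simulation sketch (the parameter values are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(2)

# Simulate AR(1): X_t = phi*X_{t-1} + eps_t, a special case of (4.31) with mu = 0.
phi_true, n = 0.6, 50000
eps = rng.normal(size=n)
x = np.zeros(n)
for t in range(1, n):
    x[t] = phi_true * x[t - 1] + eps[t]

# Minimizing sum_t e_t(phi)^2 with e_t(phi) = x_t - phi*x_{t-1}, cf. (4.50),
# gives phi_hat = sum(x_t x_{t-1}) / sum(x_{t-1}^2).
phi_hat = (x[1:] * x[:-1]).sum() / (x[:-1] ** 2).sum()
sigma2_hat = ((x[1:] - phi_hat * x[:-1]) ** 2).mean()   # sigma_hat^2 = n^{-1} sum e_t^2

assert abs(phi_hat - phi_true) < 0.02
assert abs(sigma2_hat - 1.0) < 0.02
```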
Information-theoretic considerations (based on definitions discussed in Section 3.1) lead to Akaike's famous criterion (AIC; Akaike 1973a,b):

p̂ = arg min_p {−2 log likelihood + 2p}   (4.51)

More generally, we may minimize AIC_α = −2 log likelihood + αp with respect to p. This includes the AIC (α = 2), the BIC (Bayesian information criterion; Schwarz 1978, Akaike 1979) with α = log n, and the HIC (Hannan and Quinn 1979) with α = 2c log log n (c > 1). It can be shown that, if the observed process is indeed generated by a process from the postulated class of models, and if its order is p_o, then for α ≥ O(2c log log n) the estimated order is asymptotically correct with probability one. In contrast, if α/(2c log log n) → 0 as n → ∞, then the criterion tends to choose too many parameters, in the sense that P(p̂ > p_o) converges to a positive probability. This is, for instance, the case for Akaike's criterion. Thus, if identification of a correct model is the aim, and the observed process is indeed likely to be at least very close to the postulated model class, then α ≥ O(2c log log n) should be used. On the other hand, one may argue that no model is ever correct, so that increasing the number of parameters with increasing sample size may be the right approach. In this case, the original AIC is a good candidate. It should be noted, however, that if p̂ → ∞ as n → ∞, then the asymptotic distribution and even the rate of convergence of θ̂ change, since this is a kind of nonparametric modeling with an ultimately infinite-dimensional parameter.

4.2.6 Fitting non- and semiparametric models

Most techniques for fitting nonparametric models rely on smoothing, combined with additional estimation of parameters needed for fine tuning of the smoothing procedure. To illustrate this, consider, for instance,

(1 − B)^m X_t = g(s_t) + U_t   (4.52)

as defined above, where U_t is second order stationary and s_t = t/n. If m is known, then g may be estimated, for instance, by a kernel smoother

ĝ(t_o) = (nb)^{−1} Σ_{t=1}^{n} K((s_t − s_{t_o})/b) y_t   (4.53)

as defined in Chapter 2, with y_t = (1 − B)^m x_t. However, results may differ considerably depending on the choice of the bandwidth b (see e.g. Gasser and Müller 1979, Beran and Feng 2002a,b). The optimal bandwidth depends on the nature of the residual process U_t.
A criterion for optimality is, for instance, the integrated mean squared error

IMSE = ∫ E{[ĝ(s) − g(s)]²} ds.

The IMSE can be written as

IMSE = ∫ {E[ĝ(s)] − g(s)}² ds + ∫ var(ĝ(s)) ds = ∫ {Bias² + variance} ds.

The bias only depends on the function g, and is thus independent of the error process. The variance, on the other hand, is a function of the covariances γ_U(k) = cov(U_t, U_{t+k}), or equivalently the spectral density f_U.
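The bias–variance tradeoff behind the IMSE is easy to see empirically. The following sketch uses a Gaussian kernel with normalized weights (a Nadaraya–Watson variant of (4.53), which behaves similarly away from the boundary); the trend, noise level, and bandwidths are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(3)

def kernel_smooth(y, b):
    """Gaussian-kernel estimate of g at each s_t = t/n, with normalized weights."""
    n = len(y)
    s = np.arange(1, n + 1) / n
    w = np.exp(-0.5 * ((s[:, None] - s[None, :]) / b) ** 2)
    return (w * y[None, :]).sum(axis=1) / w.sum(axis=1)

# Trend plus iid noise: X_t = g(t/n) + U_t with g(u) = sin(2*pi*u).
n = 500
s = np.arange(1, n + 1) / n
g = np.sin(2 * np.pi * s)
x = g + rng.normal(scale=0.5, size=n)

# Empirical MSE for two bandwidths: a tiny bandwidth undersmooths (high variance),
# a moderate one trades a little bias for a much smaller variance.
mse = {b: ((kernel_smooth(x, b) - g) ** 2).mean() for b in (0.002, 0.05)}
assert mse[0.05] < mse[0.002]
```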
The bandwidth that minimizes the IMSE thus depends on the unknown quantities g and f_U. Both g and f_U therefore have to be estimated simultaneously, in an iterative fashion. For instance, in a SEMIFAR model, the asymptotically optimal bandwidth can be shown to be equal to

b_opt = C_opt n^{(2δ−1)/(5−2δ)}

where C_opt is a constant that depends on the unknown parameter vector θ = (σ², d, ϕ_1, ..., ϕ_p)^t. Note that in this case, m is also part of the unknown vector. An algorithm for estimating g as well as θ can be defined by starting with an initial estimate of θ, calculating the corresponding optimal bandwidth, subtracting ĝ from x_t, reestimating θ, estimating the new optimal bandwidth, and so on. Note that in addition the order p is unknown, so that a model choice criterion has to be used at some stage. This complicates matters considerably, and special care has to be taken to define a reliable algorithm. Algorithms that work theoretically as well as practically for reasonably small sample sizes are discussed in Beran and Feng (2002a,b).

4.2.7 Spectral estimation

Sometimes one is interested only in the spectral density f_X of a stationary process or, equivalently, the autocovariances γ_X(k), without modeling the whole distribution of the time series. The reason can be, for instance, that, as discussed above, one may be mainly interested in (random) periodicities, which are identifiable as peaks in the spectral density.

A natural nonparametric estimate of γ_X(k) is the sample autocovariance

γ̂(k) = n^{−1} Σ_{t=1}^{n−k} (x_t − x̄)(x_{t+k} − x̄)   (4.54)

for k ≥ 0 and γ̂(−k) = γ̂(k). The corresponding estimate of f_X is the periodogram

I(λ) = (2π)^{−1} Σ_{k=−(n−1)}^{n−1} γ̂(k) e^{ikλ} = (2πn)^{−1} |Σ_{t=1}^{n} (x_t − x̄) e^{itλ}|²   (4.55)

Sometimes a so-called tapered periodogram is used:

I_w(λ) = (2πn)^{−1} |Σ_{t=1}^{n} w(t/n)(x_t − x̄) e^{itλ}|²

where w is a weight function. It can be shown that E[I(λ)] → f_X(λ) as n → ∞.
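The behavior of the raw periodogram can be checked by simulation; a sketch for white noise (the helper name and parameter values are ours):

```python
import numpy as np

rng = np.random.default_rng(5)

def periodogram_ordinates(x):
    """I(lambda_j) = (2*pi*n)^{-1} |sum_t (x_t - mean) exp(i*t*lambda_j)|^2
    at the Fourier frequencies lambda_j = 2*pi*j/n, j = 1, ..., [(n-1)/2]."""
    n = len(x)
    d = np.fft.fft(x - x.mean())
    m = (n - 1) // 2
    return np.abs(d[1:m + 1]) ** 2 / (2 * np.pi * n)

# For white noise with variance sigma^2, f_X(lambda) = sigma^2/(2*pi). Individual
# ordinates stay noisy however large n is, but their average stabilizes.
sigma2 = 4.0
I = periodogram_ordinates(rng.normal(scale=2.0, size=4096))

assert abs(I.mean() - sigma2 / (2 * np.pi)) < 0.05   # mean is close to f_X
assert I.std() / I.mean() > 0.5                      # single ordinates remain volatile
```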
However, for lags k close to n − 1, γ̂(k) is very inaccurate, because one averages over n − k observed pairs only. For instance, for k = n − 1, there is only one observed pair, namely (x₁, x_n), with this lag! As a result, I(λ) does
not converge to f_X(λ). Instead, the following holds under mild regularity conditions: if 0 < λ₁ < ... < λ_k < π, then, as n → ∞, the distribution of [2I(λ₁)/f_X(λ₁), ..., 2I(λ_k)/f_X(λ_k)] converges to the distribution of (Z₁, ..., Z_k), where the Z_i are independent χ²₂-distributed random variables. This result is also true for sequences of frequencies 0 < λ_{1,n} < ... < λ_{k,n} < π, as long as the smallest distance between the frequencies, min |λ_{i,n} − λ_{j,n}|, does not converge to zero faster than n^{−1}. Because of the latter condition, and also for computational reasons (fast Fourier transform, FFT; see Cooley and Tukey 1965, Brigham 1988), one usually calculates I(λ) at the so-called Fourier frequencies λ_j = 2πj/n (j = 1, ..., m, with m = [(n − 1)/2]) only. Note that for Fourier frequencies, \(\sum_{t=1}^n e^{it\lambda_j} = 0\), so that

\[ I(\lambda_j) = (2\pi n)^{-1}\Big|\sum_{t=1}^{n} x_t e^{it\lambda_j}\Big|^2. \]

Thus, the sample mean actually does not need to be subtracted. The periodogram at Fourier frequencies can also be understood as a decomposition of the variance into orthogonal components, analogous to classical analysis of variance (Scheffé 1959): for n odd,

\[ \sum_{t=1}^{n} (x_t - \bar x)^2 = 4\pi \sum_{j=1}^{m} I(\lambda_j) \qquad (4.56) \]

and for n even,

\[ \sum_{t=1}^{n} (x_t - \bar x)^2 = 4\pi \sum_{j=1}^{m} I(\lambda_j) + 2\pi I(\pi). \qquad (4.57) \]

This means that I(λ_j) corresponds to the (empirically observed) contribution of periodic components with frequency λ_j to the overall variability of x₁, ..., x_n. A consistent estimate of f_X can be obtained by eliminating or downweighing sample autocovariances with too large lags:

\[ \hat f(\lambda) = \frac{1}{2\pi}\sum_{k=-(n-1)}^{n-1} w_n(k)\,\hat\gamma(k)\,e^{ik\lambda} \qquad (4.58) \]

where w_n(k) = 0 (or becomes negligible) for k > M_n, with M_n/n → 0 and M_n → ∞. Equivalently, one can define a smoothed periodogram

\[ \hat f(\lambda) = \int W_n(\nu - \lambda)\, I(\nu)\,d\nu \qquad (4.59) \]

for a suitable sequence of window functions W_n such that \(\int W_n(\nu-\lambda)f(\nu)\,d\nu\) converges to f(λ) as n → ∞. See e.g. Priestley (1981) for a detailed discussion.
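The analysis-of-variance decomposition (4.56) is exact, not asymptotic, and easy to verify numerically via the FFT (the data below are arbitrary white noise):

```python
import numpy as np

def periodogram_fourier(x):
    """Periodogram I(lambda_j) at the Fourier frequencies lambda_j = 2*pi*j/n,
    j = 1, ..., [(n-1)/2]. The mean need not be subtracted for j >= 1."""
    n = len(x)
    m = (n - 1) // 2
    dft = np.fft.fft(x)                       # dft[j] = sum_t x_t e^{-2*pi*i*j*t/n}
    return np.abs(dft[1:m + 1]) ** 2 / (2 * np.pi * n)

rng = np.random.default_rng(2)
x = rng.standard_normal(201)                  # n odd, so (4.56) applies
lhs = ((x - x.mean()) ** 2).sum()
rhs = 4 * np.pi * periodogram_fourier(x).sum()
print(lhs, rhs)                               # equal: the decomposition (4.56)
```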
Finally, it should be noted that, in spite of its inconsistency, the raw periodogram is very useful for finding periodicities. In particular, in the case
of deterministic periodicities with frequencies ω_j, I(λ) diverges to infinity for λ = ω_j and remains finite (proportional to a χ²₂-variable) elsewhere.

4.2.8 The harmonic regression model

An important approach to analyzing musical sounds is the harmonic regression model

\[ X_t = \sum_{j=1}^{p} [\alpha_j \cos \omega_j t + \beta_j \sin \omega_j t] + U_t \qquad (4.60) \]

with U_t stationary. Note that, theoretically, this model can also be understood as a stationary process with jumps in the spectral distribution F_X (see Section 4.2.1). Given ω = (ω₁, ..., ω_p)^t, the parameter vector θ = (α₁, ..., α_p, β₁, ..., β_p)^t can be estimated by the least squares or, more generally, weighted least squares method,

\[ \hat\theta = \arg\min_{\theta} \sum_{t=1}^{n} w\big(\tfrac{t}{n}\big)\Big[x_t - \sum_{j=1}^{p}(\alpha_j \cos \omega_j t + \beta_j \sin \omega_j t)\Big]^2 \qquad (4.61) \]

where w is a weight function. The solution is obtained from the usual linear regression formulas. In many applications the situation is more complex, since the frequencies ω₁, ..., ω_p are also unknown. This leads to a nonlinear regression problem. A simple approximate solution can be given by (Walker 1971, Hannan 1973, Hassan 1982, Brown 1990, Quinn and Thomson 1991)

\[ \hat\omega = \arg\max_{0<\omega_1,...,\omega_p\le\pi} \sum_{j=1}^{p}\Big|\sum_{t=1}^{n} w\big(\tfrac{t}{n}\big)x_t e^{i\omega_j t}\Big|^2 = \arg\max_{\omega} \sum_{j=1}^{p} I_w(\omega_j), \qquad (4.62) \]

\[ \hat\alpha_j = \frac{\sum_{t=1}^{n} w(\tfrac{t}{n})\,x_t \cos \hat\omega_j t}{\sum_{t=1}^{n} w(\tfrac{t}{n})}, \qquad (4.63) \]

and

\[ \hat\beta_j = \frac{\sum_{t=1}^{n} w(\tfrac{t}{n})\,x_t \sin \hat\omega_j t}{\sum_{t=1}^{n} w(\tfrac{t}{n})}. \qquad (4.64) \]

Note that (4.62) means that we look for the p largest peaks in the (w-tapered) periodogram. Under quite general assumptions, the asymptotic distribution of the estimates can be shown to be as follows: the vectors

\[ Z_{n,j} = \big[\sqrt n(\hat\alpha_j - \alpha_j),\ \sqrt n(\hat\beta_j - \beta_j),\ n^{\frac32}(\hat\omega_j - \omega_j)\big]^t \]

(j = 1, ..., p) are asymptotically mutually independent, each having a 3-dimensional normal distribution with expected value zero and covariance matrix C(ω_j) that depends on f_U(ω_j) and the weight function w. The formulas for C are as follows (Irizarry 1998, 2000, 2001, 2002):

\[ C(\omega_j) = \frac{4\pi f_U(\omega_j)}{\alpha_j^2 + \beta_j^2}\, V(\omega_j) \qquad (4.65) \]
where

\[ V(\omega_j) = \begin{pmatrix} c_1\alpha_j^2 + c_2\beta_j^2 & -c_3\alpha_j\beta_j & -c_4\beta_j \\ -c_3\alpha_j\beta_j & c_2\alpha_j^2 + c_1\beta_j^2 & c_4\alpha_j \\ -c_4\beta_j & c_4\alpha_j & c_0 \end{pmatrix}, \qquad (4.66) \]

\[ c_0 = a_0 b_0,\quad c_1 = U_0 W_0^{-2},\quad c_2 = a_0 b_1, \qquad (4.67) \]

\[ c_3 = a_0 W_1 W_0^{-2}(W_0^2 W_1 U_2 - W_1^3 U_0 - 2W_0^2 W_2 U_1 + 2W_0 W_1 W_2 U_0), \qquad (4.68) \]

\[ c_4 = a_0 (W_0 W_1 U_2 - W_1^2 U_1 - W_0 W_2 U_1 + W_1 W_2 U_0), \qquad (4.69) \]

\[ a_0 = (W_0 W_2 - W_1^2)^{-2}, \qquad (4.70) \]

\[ b_n = W_n^2 U_2 + W_{n+1}(W_{n+1} U_0 - 2W_n U_1) \quad (n = 0, 1), \qquad (4.71) \]

\[ U_n = \int_0^1 s^n w^2(s)\,ds, \qquad (4.72) \]

\[ W_n = \int_0^1 s^n w(s)\,ds. \qquad (4.73) \]

This result can be used to obtain tests and confidence intervals for α_j, β_j and ω_j (j = 1, 2, ..., p), with the unknown quantities α_j, β_j and f_U(ω_j) then replaced by estimates. Note that this involves, in particular, estimation of the spectral density of the residual process U_t. A quantity that is of particular interest is the difference between the partial ω_j and the corresponding multiple j · ω₁ of the fundamental frequency,

\[ \Delta_j = \omega_j - j\cdot\omega_1. \qquad (4.74) \]

For many musical instruments, this difference is exactly or approximately equal to zero. The asymptotic distribution given above can be used to test the null hypothesis H₀: Δ_j = 0 or to construct confidence intervals for Δ_j. More specifically, \(n^{\frac32}(\hat\Delta_j - \Delta_j)\) is asymptotically normal with zero mean and variance

\[ v_\Delta = 4\pi c_0 \Big[\frac{f_U(\omega_j)}{\alpha_j^2 + \beta_j^2} + \frac{j^2 f_U(\omega_1)}{\alpha_1^2 + \beta_1^2}\Big]. \qquad (4.75) \]

This can be generalized to any hypothesized relationship Δ_j = ω_j − g(j)ω₁ (see the example of a guitar mentioned in the next section).

4.2.9 Dominating frequencies in random series

In the harmonic regression model, the main signal consists of deterministic periodic functions. For less harmonic "noisy" signals, a weaker form
of periodicity may be observed. This can be modeled by a purely random process whose mth difference Y_t = (1 − B)^m X_t is stationary (m = 0, 1, ...), with a spectral density f that has distinct local maxima. Estimation of local maxima and identification of the corresponding frequencies is considered, for instance, in Newton and Pagano (1983) and Beran and Ghosh (2000). Beran and Ghosh (2000) consider the case where Y_t is a fractional ARIMA(p, d, 0) process of unknown order p. Suppose we want to estimate the frequency ω_max where f assumes the largest local maximum. In a first step, the parameter vector θ = (σ², d, φ₁, ..., φ_p) (with d = δ + m) is estimated by maximum likelihood, and p is chosen by the BIC. Let θ* = (σ², δ, θ₃, ..., θ_{p+2}) = (σ², η*) and let

\[ f(\lambda; \theta^*) = \frac{\sigma^2}{2\pi}\,|\phi(e^{i\lambda})|^{-2}\,|1 - e^{i\lambda}|^{-2\delta} = \frac{\sigma^2}{2\pi}\, g(\lambda; \eta^*) \qquad (4.76) \]

be the spectral density of Y_t = (1 − B)^m X_t. Then ω̂_max is set equal to the frequency where the estimated spectral density \(f(\lambda; \hat\theta^*)\) assumes its maximum. Define

\[ V_p(\eta^*) = 2W^{-1} \qquad (4.77) \]

where

\[ W_{ij} = (2\pi)^{-1}\Big[\int_{-\pi}^{\pi} \frac{\partial}{\partial u_i}\log g(x; u)\,\frac{\partial}{\partial u_j}\log g(x; u)\,dx\Big]\Big|_{u=\eta^*} \quad (i, j = 1, ..., p+1). \qquad (4.78) \]

Then, as n → ∞,

\[ \sqrt n\,(\hat\omega_{max} - \omega_{max}) \to_d N(0, \tau_p) \qquad (4.79) \]

with

\[ \tau_p = \tau_p(\eta^*) = \frac{1}{[g''(\omega_{max}; \eta^*)]^2}\,[\dot g'(\omega_{max}; \eta^*)]^T\, V_p(\eta^*)\,[\dot g'(\omega_{max}; \eta^*)] \qquad (4.80) \]

where →_d denotes convergence in distribution, g′ and g″ denote derivatives with respect to frequency, and ġ the derivative with respect to the parameter vector. Note in particular that the order of var(ω̂_max) is n^{−1}, whereas in the harmonic regression model the frequency estimates have variances of the order n^{−3}. The reason is that a deterministic periodic signal is a much stronger form of periodicity and is therefore easier to identify.

4.3 Specific applications in music

4.3.1 Analysis and modeling of musical instruments

There is an abundance of literature on mathematical modeling of sound signals produced by musical instruments.
Since a musical instrument is a very complex physical system, even if conditions are kept fixed, not only deterministic but also statistical models are important. In addition to that,
various factors can play a role. For instance, the sound of a violin depends on the wood it is made of, which manufacturing procedure was used, current atmospheric conditions (temperature, humidity, air pressure), who plays the violin, which particular notes are played in which context, etc. The standard approach that makes modeling feasible is to think of a sound as the result of harmonic components that may change slowly in time, plus "noise" components that may be described by random models. It should be noted, however, that sound is not only produced by an instrument but also perceived by the human ear and brain. Thus, when dealing with the "significance" or "effect" of sounds, physiology, psychology, and related scientific disciplines come into play. Here, we are first concerned with the actual "objective" modeling of the physical sound wave. This is a formidable task on its own, and far from being solved in a satisfactory manner. The scientific study of musical sound signals by physical equations goes back to the 19th century. Helmholtz (1863) proved experimentally that musical sound signals are mainly composed of frequency components that are multiples of a fundamental frequency (also see Rayleigh 1894). Ohm conjectured that the human ear perceives sounds by analyzing the power spectrum (i.e. essentially the periodogram), without taking into account relative phases of the sounds. These conjectures have been mostly confirmed by psychological and physiological experiments (see e.g. Grey 1977, Pierce 1983/1992). Recent mathematical models of instrumental sound waves (see e.g. Fletcher and Rossing 1991) lead to the assumption that, for short time segments, a musical sound signal is stationary and can be written as a harmonic regression model with ω₁ < ω₂ < ... < ω_p. To analyze a musical sound wave, one therefore can divide time into small blocks and fit the harmonic regression model as described above.
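As an illustration of this block-wise fitting, the following sketch applies the peak-picking rule (4.62) and ordinary least squares for the coefficients to one block of a synthetic two-partial signal. The data, the placement of the frequencies exactly on the Fourier grid, and the use of an untapered periodogram are simplifying assumptions of this illustration, not the procedure used for the figures below:

```python
import numpy as np

def fit_harmonic_block(x, n_partials=2):
    """Fit the harmonic regression model (4.60) to one block: take the n_partials
    largest periodogram peaks as frequency estimates, then least squares for the
    cosine/sine coefficients at those frequencies."""
    n = len(x)
    m = (n - 1) // 2
    freqs = 2 * np.pi * np.arange(1, m + 1) / n
    I = np.abs(np.fft.fft(x)[1:m + 1]) ** 2 / (2 * np.pi * n)
    omegas = np.sort(freqs[np.argsort(I)[-n_partials:]])
    t = np.arange(1, n + 1)
    X = np.column_stack([f(w * t) for w in omegas for f in (np.cos, np.sin)])
    coef, *_ = np.linalg.lstsq(X, x, rcond=None)
    return omegas, coef.reshape(-1, 2)      # row j: (alpha_j, beta_j)

rng = np.random.default_rng(3)
n = 1024
t = np.arange(1, n + 1)
w1, w2 = 2 * np.pi * 32 / n, 2 * np.pi * 64 / n   # fundamental and one partial
x = 2.0 * np.cos(w1 * t) + 0.8 * np.sin(w2 * t) + 0.1 * rng.standard_normal(n)
omegas, coefs = fit_harmonic_block(x)
print(omegas)   # recovers w1 and w2
print(coefs)    # approximately (2.0, 0) and (0, 0.8)
```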
The lowest frequency ω₁ is called the fundamental frequency and corresponds to what one calls "pitch" in music. The higher frequencies ω_j (j ≥ 2) are called partials, overtones, or harmonics. The amplitudes of the partials, and how they change gradually, are main factors in determining the "timbre" of a sound. For illustration, Figure 4.1 shows the sound wave (air pressure amplitudes) of a piano during 1.9 seconds where first a c′ and then an f′ are played. The signal was sampled in 16-bit format at a sampling rate of 44100 Hz. This corresponds to CD quality and means that every second, 44100 measurements of the sound wave were taken, each of the measurements taking an integer value between −32768 and 32767 (32767 + 32768 + 1 = 2^16). Figure 4.2 shows an enlarged picture of the shaded area in Figure 4.1 (2050 measurements, corresponding to 0.046 seconds). The periodogram (in log-coordinates) of this subseries is plotted in Figure 4.3. The largest peak occurs approximately at the fundamental frequency ω₁ = 441 · 2^{−9/12} ≈ 262.22 Hz of c′. Note that, since the periodogram is calculated at Fourier frequencies only, ω₁ cannot be identified exactly (see also the remarks below). A small number of partials ω_j (j ≥ 2) can also be seen in Figure 4.3; the contribution of
Figure 4.1 Sound wave of c′ and f′ played on a piano.

higher partials is, however, relatively small. In contrast, the periodogram of e′′♭ played on a harpsichord shows a large number of distinctly important partials (Figures 4.4, 4.5). There is obviously a clear difference between piano and harpsichord in terms of amplitudes of higher partials. A comprehensive study of instrumental or vocal sounds also needs to take into account different techniques in which an instrument is played, and other factors such as the particular pitch ω₁ that is played. This would, however, be beyond the scope of this introductory chapter. A specific component that is important for "timbre" is the way in which the coefficients α_j, β_j change in time (see e.g. Risset and Mathews 1969). Readers familiar with synthesizers may recall "envelopes" that are controlled by parameters such as "attack" and "decay". The development of α_j, β_j can be studied by calculating the periodogram for a moving time window and plotting its values against time and frequency in a 3-dimensional or image plot. Thus, we plot the local periodogram (in this context also called
spectrogram)

\[ I(t, \lambda) = \frac{1}{2\pi \sum_{j=1}^{n} W^2\big(\frac{t-j}{nb}\big)}\,\Big|\sum_{j=1}^{n} W\big(\frac{t-j}{nb}\big)\, e^{-i\lambda j}\, x_j\Big|^2 \qquad (4.81) \]

where W: ℝ → ℝ⁺ is a weight function such that W(u) = 0 for |u| > 1, and b > 0 is a bandwidth that determines how large the window (block) is, i.e. how many consecutive observations are considered to correspond approximately to a harmonic regression model with fixed coefficients α_j, β_j and stationary noise U_t. This is illustrated in color Figure 4.7 for a harpsichord sound, with W(u) = 1{|u| ≤ 1}. Intense pink corresponds to high values of I(t, λ). Figures 4.6a through d show explicitly the change in I(t, λ) between four different blocks. Since the note was played "staccato", the sound wave is very short, namely about 0.1 seconds. Nevertheless, there is a change in the spectrum of the sound, with some of the higher harmonics fading away.

Figure 4.2 Zoomed piano sound wave – shaded area in Figure 4.1 (c′ played by piano).

Figure 4.3 Periodogram of piano sound wave in Figure 4.2.

Figure 4.4 Sound wave of e′′♭ played on a harpsichord (0.25 sec at sampling rate 44100 Hz).

Figure 4.5 Periodogram of harpsichord sound wave in Figure 4.4.

Figure 4.6 Harpsichord sound – periodogram plots for different time frames (moving windows of time points; blocks 1, 22, 42, and 62).

Figure 4.7 A harpsichord sound and its spectrogram. Intense pink corresponds to high values of I(t, λ). (Color figures follow page 152.)

Apart from the relative amplitudes of partials, most musical sounds include a characteristic nonperiodic noise component. This is a further justification, apart from possible measurement errors, to include a random deviation part in the harmonic regression equation. The properties of the stochastic process U_t are believed to be characteristic for specific instruments (see e.g. Serra and Smith 1991, Rodet 1997). Typical noise components are, for instance, transient noise in percussive instruments, breath noise in wind instruments, or bow noise of string instruments. For a discussion of statistical issues in this context see e.g. Irizarry (2001). For most instruments, not only the harmonic amplitudes but also the characteristics of the noise component change gradually. This may be modeled by smoothly changing processes as defined, for instance, in Ghosh et al. (1997). Other approaches are discussed in Priestley (1965) and Dahlhaus (1996a,b, 1997) (see Section 4.2.1 above). Some interesting applications of the asymptotic results in Section 4.2.8 to questions arising in the analysis of musical sounds are discussed in Irizarry
(2001). In particular, the following experiment is described: recordings of a professional clarinet player trying to play concert pitch A (ω₁ = 441 Hz) and a professional guitar player playing D (ω₁ = 146.8 Hz) were made. For the analysis of the clarinet sound, a one-second segment was divided into non-overlapping blocks consisting of 1025 measurements (≈ 23 milliseconds), and the harmonic regression model was fitted to each block separately. For the guitar, the same was done with 60 non-overlapping intervals with 3000 observations each. Two types of results were obtained:

1. The clarinet player turned out to be always out of tune in the sense that the estimated fundamental frequency ω̂₁ was always outside the 95%-acceptance region \(441\,\mathrm{Hz} \pm 1.96\sqrt{C_{33}(\omega_1^o)}\,n^{-\frac32}\), where the null hypothesis is H₀: ω₁ = ω₁ᵒ = 441 Hz. On the other hand, from the point of view of musical perception, the clarinet player was not out of tune, because the deviation from 441 Hz was less than 0.76 Hz, which corresponds to 0.03 semitones. According to experimental studies, the human ear cannot distinguish notes that are 0.03 semitones apart (Pierce 1983/1992).

2. Physical models (see e.g. Fletcher and Rossing 1991) postulate the following relationships between the fundamental frequency and the partials: for a "harmonic instrument" such as the clarinet, one expects

\[ \omega_j = j\cdot\omega_1, \]

whereas for a "plucked string instrument", such as the guitar, one should have

\[ \omega_j \approx c\,j^2\cdot\omega_1 \]

where c is a constant determined by properties of the strings. The experiment described in Irizarry (2001) supports the assumption for the clarinet in the sense that, in general, the 95%-confidence intervals for the difference ω_j − jω₁ contained 0.
For the guitar, his findings suggest a relationship of the form ω_j ≈ c(a + j)²ω₁ with a = 0.

4.3.2 Licklider's theory of pitch perception

Thumfart (1995) uses the theory of discrete evolutionary spectra to derive a simple linear model for pitch perception as proposed by Licklider (1951). The general biological background is as follows (see e.g. Kelly 1991): vibrations of the ear drum caused by sound waves are transferred to the inner ear (cochlea) by three ossicles in the middle ear. The inner ear is a spiral structure that is partitioned along its length by the basilar membrane. The sound wave causes a traveling wave on the basilar membrane, which in turn causes hair cells positioned at different locations to release a chemical transmitter. The chemical transmitter generates nerve impulses to the auditory nerve. At which location on the membrane the highest amplitude occurs, and thus which groups of hair cells are activated, depends on the frequency
of the sound wave. This means that certain frequency regions correspond to certain hair groups. Frequency bands with high spectral density f (or high increments dF of the spectral distribution) activate the associated hair groups. To obtain a simple model for the effect of a sound on the basilar membrane movement, Slaney and Lyon (1991) partition the cochlea into 86 sections, each section corresponding to a particular group of cells. Thumfart (1995) assumes that each group of cells acts like a separate linear filter Ψ_j (j = 1, ..., 86). (This is a simplification compared to Slaney and Lyon, who use nonlinear models.) The wave entering the inner ear is assumed to be the original sound wave X_t, filtered by the outer ear by a linear filter A₁, and by the middle ear by a linear filter A₂. Thus, the output of the inner ear that generates the final nerve impulses consists of 86 time series

\[ Y_{t,j} = \Psi_j(B)A_2(B)A_1(B)X_t \quad (j = 1, ..., 86). \qquad (4.82) \]

Calculating tapered local periodograms I_j(u, λ) of Y_{t,j} for each of the 86 sections (j = 1, ..., 86), one can then define the quantity

\[ c(k, j, u) = \int_{-\pi}^{\pi} I_j(u, \lambda)\,e^{ik\lambda}\,d\lambda \qquad (4.83) \]

which Slaney and Lyon call a "correlogram". This is in fact an estimated local autocovariance at lag k for section j and the time segment with midpoint u. The "Slaney-Lyon correlogram" thus essentially characterizes the local autocovariance structure of the resulting nerve impulse series. Thumfart (1995) shows formally how, and under which conditions, this model can be defined within the framework of processes with a discrete evolutionary spectrum. He also suggests a simple method for estimating pitch (the fundamental frequency) at local time u by setting ω̂₁(u) = 2π/k_max(u), where k_max(u) = arg max_k C(k, u) and \(C(k, u) = \sum_{j=1}^{86} c(k, j, u)\).

4.3.3 Identification of pitch, tone separation and purity of intonation

In a recent study, Weihs et al. (2001) investigate objective criteria for judging the quality of singing (also see Ligges et al. 2002).
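The pitch rule ω̂₁(u) = 2π/k_max(u) can be illustrated in a drastically simplified single-channel form: with one channel and no cochlear filter bank, the summed correlogram C(k, u) reduces to an ordinary autocovariance, and the pitch period is the lag of its largest peak. The synthetic data and the minimum-lag cutoff below are assumptions of this sketch:

```python
import numpy as np

def pitch_by_autocorrelation(x, fs, kmin=20):
    """Single-channel sketch of the correlogram pitch rule: the pitch period is the
    lag at which the autocovariance C(k) is largest; the pitch in Hz is fs/k_max.
    kmin (an assumption here) skips the trivial maximum near lag 0."""
    xc = x - x.mean()
    c = np.correlate(xc, xc, mode="full")[len(x) - 1:]   # C(k), k = 0, 1, ...
    kmax = kmin + np.argmax(c[kmin:])
    return fs / kmax

fs = 8000
t = np.arange(4000) / fs
# a "sound" with fundamental 200 Hz and two partials at 400 and 600 Hz
x = np.sin(2 * np.pi * 200 * t) + 0.5 * np.sin(2 * np.pi * 400 * t) \
    + 0.3 * np.sin(2 * np.pi * 600 * t)
print(pitch_by_autocorrelation(x, fs))   # 200.0 (period of 40 samples at 8 kHz)
```

Note that the autocovariance peaks at the common period of all partials, so the rule recovers the fundamental even when higher partials carry substantial energy.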
The main question asked in their analysis is how to assess purity of intonation. In an experimental setting, with standardized playback piano accompaniment in a recording studio, 17 singers were asked to sing Händel's "Tochter Zion" and Beethoven's "Ehre Gottes aus der Natur". The audio signal of the vocal performance was recorded in CD quality in 16-bit format at a sampling rate of 44100 Hz. For the actual statistical analysis, the data is reduced to 11000 Hz for computational reasons, and standardized to the interval [−1, 1]. The first question is how to identify the fundamental frequency (pitch) ω₁. In the harmonic regression model above, the estimates of ω₁ and the partials ω_j (2 ≤ j ≤ k) are identical with the k frequencies where the periodogram assumes its k largest values. Weihs et al. suggest a simplified (though clearly suboptimal) version of this, in that they consider the periodogram at Fourier frequencies λ_j = 2πj/n (j = 1, 2, ..., m = [(n − 1)/2]) only and set

\[ \tilde\omega_1 = \min_{\lambda_j \in \{\lambda_2, ..., \lambda_{m-1}\}} \{\lambda_j : I(\lambda_j) > \max[I(\lambda_{j-1}), I(\lambda_{j+1})]\}. \qquad (4.84) \]

In other words, ω̃₁ corresponds to the Fourier frequency where the first peak of the periodogram occurs. Because of the restriction to Fourier frequencies, the periodogram may have two adjacent peaks, and the estimate is too inaccurate in general. An empirical interpolation formula is suggested by the authors to obtain an improved estimate ω̂₁. A comparison with harmonic regression is not made, however, so that it is not clear how well the interpolation works in comparison. Given a procedure for pitch identification, an automatic note separation procedure can be defined. This is a procedure that identifies time points in a sound signal where a new note starts. The interesting result in Weihs et al. is that automatic note separation works better for amateur singers than for professionals. The reason may be the absence of vibrato in amateur voices. In a third step, Weihs et al. address the question of how to assess computationally the purity of intonation based on a vocal time series. This is done using discriminant analysis. The discussion of these results is therefore postponed to Chapter 9.

4.3.4 Music as 1/f noise?

In the 1970s, Voss and Clarke (1975, 1978) discovered a seemingly universal "law" according to which music has a 1/f spectrum. By a 1/f spectrum one means that the observed process has a spectral density f such that f(λ) ∝ λ^{−1} as λ → 0.
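Rule (4.84) is simple to implement; the sketch below uses a synthetic signal, not the study's vocal data, and adds a small relative threshold (an assumption not in the text) so that numerical noise in near-zero periodogram bins cannot create spurious "first peaks":

```python
import numpy as np

def first_peak_pitch(x, fs, rel_thresh=0.01):
    """Pitch candidate via (4.84): the first Fourier frequency whose periodogram
    value exceeds both neighbors (and a small noise floor). Returns Hz."""
    n = len(x)
    m = (n - 1) // 2
    I = np.abs(np.fft.fft(x)[:m + 1]) ** 2 / (2 * np.pi * n)
    floor = rel_thresh * I[1:m].max()        # hypothetical guard, not in (4.84)
    for j in range(2, m):                    # lambda_2 .. lambda_{m-1}
        if I[j] > floor and I[j] > max(I[j - 1], I[j + 1]):
            return fs * j / n
    return None

fs = 11000                                   # the reduced sampling rate of the study
t = np.arange(2200) / fs
x = np.sin(2 * np.pi * 440 * t) + 0.4 * np.sin(2 * np.pi * 880 * t)
print(first_peak_pitch(x, fs))               # 440.0
```

The restriction to the Fourier grid means the answer is only accurate to fs/n Hz, which is exactly the inaccuracy the interpolation formula of Weihs et al. is meant to reduce.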
In the sense of definition (4.10), such a density actually does not exist; however, a generalized version of the spectral density exists in the sense that the expected value of the periodogram converges to this function (see Matheron 1973, Solo 1992, Hurvich and Ray 1995). Specifically, Voss and Clarke analyzed acoustic music signals by first transforming the recorded signal X_t in the following way: a) X_t is filtered by a band-pass filter (frequencies outside the interval [10 Hz, 10000 Hz] are eliminated); and b) the "instantaneous power" Y_t = X_t² is filtered by another low-pass filter (frequencies above 20 Hz are eliminated). This filtering technique essentially removes higher frequencies but retains the overall shape (or envelope) of each sound wave corresponding to a note, and the relative position on the onset axis. In this sense, Voss and Clarke actually analyzed rhythmic structures. A recent, statistically more sophisticated study along this line is described in Brillinger and Irizarry (1998). One objection to this approach can be that in acoustic signals, structural
Figure 4.8 A harpsichord sound wave (a), logarithm of squared amplitudes (b), histogram of the series (c), and its periodogram on log-scale (d) together with the fitted SEMIFAR-spectrum (d = 0.51).

properties of the composition may be confounded with those of the instruments. Consider, for instance, the harpsichord sound wave in Figure 4.8a. The square of the wave is displayed in Figure 4.8b on logarithmic scale. The picture illustrates that, apart from obvious oscillation, the (envelope of the) signal changes slowly. Fitting a SEMIFAR model (with order p ≤ 8 chosen by the BIC) yields a good fit to the periodogram. The estimated fractional differencing parameter is d̂ = 0.51, with a 95%-confidence interval of [0.29, 0.72]. This corresponds to a spectral density (defined in the generalized sense above) that is proportional to λ^{−1.02}, or approximately λ^{−1}. Thus, even in a composition consisting of one single note one would detect 1/f noise in the resulting sound wave. Instead of recorded sound waves, we therefore consider the score itself, independently of which instrument is supposed to play. This is similar but not identical to considering zero crossings of a sound signal (see Voss and
Clarke 1975, 1978, Voss 1988; Brillinger and Irizarry 1998). Figures 4.9a and c show the log-frequencies plotted against onset time for the first movement of Bach's first Cello Suite and for Paganini's Capriccio No. 24. For Bach, the SEMIFAR fit yields d̂ ≈ 0.7, with a 95%-confidence interval of [0.46, 0.93]. This corresponds to a 1/f^{1.4} spectrum; however, 1/f (d = 1/2) is included in the confidence interval. Thus, there is not enough evidence against the 1/f hypothesis. In contrast, for Paganini (Figure 4.9c,d) we obtain d̂ ≈ 0.21, with a 95%-confidence interval of [0.07, 0.35], which excludes 1/f noise. This indicates that there is a larger variety of fractal behavior than the "1/f law" would suggest. Note also that in both cases there is also a trend in the data, which is in fact an even stronger type of long memory than the stochastic one. Moreover, Bach's (and also, to a lesser degree, Paganini's) spectrum has local maxima in the spectral density, indicating periodicities (see Section 4.2.9). Thus, there is no "pure" 1/f^α behavior but instead a mixture of long-range dependence, expressed by the power law near the origin, and short-range periodicities.

Figure 4.9 Log-frequencies with fitted SEMIFAR-trend and log-log-periodogram together with SEMIFAR-fit for Bach's first Cello Suite (1st movement; a,b) and Paganini's Capriccio No. 24 (c,d), respectively.

Finally, consider an alternative quantity, namely the local variability of notes modulo octave. Since we are in Z₁₂, a measure of variability for circular data should be used. Here, we use the measure V = (1 − R̄) as defined in Chapter 7, or rather the transformed variable log[(V + 0.05)/(1.05 − V)]. The resulting standardized time series are displayed in Figures 4.10a and c. The log-log-plots of the periodograms and fitted SEMIFAR-spectra are given in Figures 4.10b and d, respectively. The estimated long-memory parameters
Figure 4.10 Local variability with fitted SEMIFAR-trend and log-log-periodogram together with SEMIFAR-fit for Bach's first Cello Suite (1st movement; a,b) and Paganini's Capriccio No. 24 (c,d), respectively.

are similar to before, namely d̂ = 0.51 ([0.20, 0.81]) for Bach and 0.33 ([0.24, 0.42]) for Paganini.
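The SEMIFAR estimates above come from a full parametric fit. A much cruder but instructive approximation (an assumption of this sketch, not the method used in the text) estimates d by regressing the log-periodogram on the log frequency near the origin, exploiting f(λ) ∝ λ^{−2d} as λ → 0:

```python
import numpy as np

def log_periodogram_d(x, n_freq=200):
    """Crude long-memory estimate: -1/2 times the slope of log I(lambda_j) on
    log(lambda_j) over the n_freq lowest Fourier frequencies."""
    n = len(x)
    j = np.arange(1, n_freq + 1)
    lam = 2 * np.pi * j / n
    I = np.abs(np.fft.fft(x - x.mean())[j]) ** 2 / (2 * np.pi * n)
    slope = np.polyfit(np.log(lam), np.log(I), 1)[0]
    return -slope / 2

rng = np.random.default_rng(4)
white = rng.standard_normal(4096)
print(log_periodogram_d(white))              # near 0 for white noise
print(log_periodogram_d(np.cumsum(white)))   # near 1 for a random walk
```

Unlike SEMIFAR, this ignores short-memory structure and trends, which is precisely why the estimates for musical scores in the text require the more careful model.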
Figure 4.11 Niccolò Paganini (1782-1840). (Courtesy of Zentralbibliothek Zürich.)
CHAPTER 5

Hierarchical methods

5.1 Musical motivation

Musical structures are typically generated in a hierarchical manner. Most compositions can be divided approximately into natural segments (e.g. movements of a sonata); these are again divided into smaller units (e.g. exposition, development, and coda of a sonata movement). These can again be divided into smaller parts (e.g. melodic phrases), and so on. Different parts even at the same hierarchical level need not be disjoint. For instance, different melodic lines may overlap. Moreover, different parts are usually closely related within and across levels. A general mathematical approach to understanding the vast variety of possibilities can be obtained, for instance, by considering a hierarchy of maps defined in terms of a manifold (see e.g. Mazzola 1990a). The concept of hierarchical relationships and similarities is also related to "self-similarity" and fractals as defined in Mandelbrot (1977) (see Chapter 3). To obtain more concrete results, hierarchical regression models have been developed in the last few years (Beran and Mazzola 1999a,b, 2000, 2001).

5.2 Basic principles

5.2.1 Hierarchical aggregation and decomposition

Suppose that we have two time series Y_t, X_t and we wish to model the relationship between Y_t and X_t. The simplest model is simple linear regression

\[ Y_t = \beta_0 + \beta_1 X_t + \varepsilon_t \qquad (5.1) \]

where ε_t is a stationary zero mean process independent of X_t. If Y_t and X_t are expected to be "hierarchical", then we may hope to find a more realistic model by first decomposing X_t (and possibly also Y_t) and searching for dependence structures between Y_t (or its components) and the components of X_t. Thus, given a decomposition X_t = X_{t,1} + ... + X_{t,M}, we consider the multiple regression model

\[ Y_t = \beta_0 + \sum_{j=1}^{M} \beta_j X_{t,j} + \varepsilon_t \qquad (5.2) \]
with ε_t second-order stationary and E(ε_t) = 0. Alternatively, if Y_t = Y_{t,1} + ... + Y_{t,L}, we may consider a system of L regressions

\[ Y_{t,1} = \beta_{01} + \sum_{j=1}^{M} \beta_{j1} X_{t,j} + \varepsilon_{t,1} \]
\[ Y_{t,2} = \beta_{02} + \sum_{j=1}^{M} \beta_{j2} X_{t,j} + \varepsilon_{t,2} \]
\[ \vdots \]
\[ Y_{t,L} = \beta_{0L} + \sum_{j=1}^{M} \beta_{jL} X_{t,j} + \varepsilon_{t,L}. \]

Three methods of hierarchical regression based on decompositions will be discussed here: HIREG, hierarchical regression using explanatory variables obtained by kernel smoothing with predetermined fixed bandwidths; HISMOOTH, hierarchical smoothing models with automatic bandwidth selection; and HIWAVE, hierarchical wavelet models.

5.2.2 Hierarchical regression

Given an explanatory time series X_t (t = 1, 2, ..., n), a smoothing kernel K, and a hierarchy of bandwidths b₁ > b₂ > ... > b_M > 0, define

\[ X_{t,1} = \frac{1}{nb_1}\sum_{s=1}^{n} K\Big(\frac{t-s}{nb_1}\Big)X_s \qquad (5.3) \]

and, for 1 < j ≤ M,

\[ X_{t,j} = \frac{1}{nb_j}\sum_{s=1}^{n} K\Big(\frac{t-s}{nb_j}\Big)\Big[X_s - \sum_{l=1}^{j-1} X_{s,l}\Big]. \qquad (5.4) \]

The collection of time series {X_{1,j}, ..., X_{n,j}} (j = 1, ..., M) is called a hierarchical decomposition of X_t. The HIREG model is then defined by (5.2). If the ε_t (t = 1, 2, ...) are independent, then the usual techniques of multiple linear regression can be used (see e.g. Plackett 1960, Rao 1973, Ryan 1996, Srivastava and Sen 1997, Draper and Smith 1998). In the case of correlated errors ε_t, appropriate adjustments of tests, confidence intervals, and parameter selection techniques must be made. The main assumption in the HIREG model is that we know which bandwidths to use. In some cases this may indeed be true. For instance, if there is a three-four meter at the beginning of a musical score, then bandwidths that are divisible by three are plausible.
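The decomposition (5.3)-(5.4) can be sketched directly: each level smooths what the coarser levels have not yet explained. The box kernel, bandwidths, and synthetic two-scale series below are assumptions of this illustration:

```python
import numpy as np

def hierarchical_decomposition(x, bandwidths):
    """Hierarchical kernel decomposition (5.3)-(5.4) with the box kernel
    K(u) = 0.5 * 1{|u| <= 1}; bandwidths must satisfy b1 > b2 > ... > bM."""
    n = len(x)
    t = np.arange(n)
    components = []
    residual = x.astype(float)
    for b in bandwidths:
        h = n * b
        K = 0.5 * (np.abs((t[:, None] - t[None, :]) / h) <= 1)
        comp = K @ residual / h       # (1/(n b)) sum_s K((t-s)/(n b)) * residual_s
        components.append(comp)
        residual = residual - comp    # the next, smaller bandwidth sees the rest
    return components

rng = np.random.default_rng(5)
t = np.arange(300)
x = np.sin(2 * np.pi * t / 300) + 0.3 * np.sin(2 * np.pi * t / 30) \
    + 0.1 * rng.standard_normal(300)
parts = hierarchical_decomposition(x, bandwidths=[0.1, 0.02])
# parts[0] follows the slow wave, parts[1] the faster one
```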
5.2.3 Hierarchical smoothing

Beran and Mazzola (1999b) consider the case where the bandwidths b_j are not known a priori. Essentially, this amounts to a nonlinear regression model \(Y_t = \beta_0 + \sum_{j=1}^{M}\beta_j X_{t,j} + \varepsilon_t\) where not only the β_j (j = 0, ..., p) are unknown, but also b₁, ..., b_M, and possibly the order M, have to be estimated. The following definition formalizes the idea (for simplicity it is given for the case of one explanatory series X_t only):

Definition 40 For integers M, n > 0, let β = (β₁, ..., β_M) ∈ ℝ^M, b = (b₁, ..., b_M) ∈ ℝ^M with b₁ > b₂ > ... > b_M > 0, t_i ∈ [0, T], 0 < T < ∞, t₁ < t₂ < ... < t_n, and θ = (β, b)^t. Denote by K: [0, 1] → ℝ⁺ a non-negative symmetric kernel function such that ∫K(u) du = 1 and K is twice continuously differentiable, and define, for b > 0 and t ∈ [0, T], the Nadaraya-Watson weights (Nadaraya 1964, Watson 1964)

\[ a_b(t, t_i) = \frac{K\big(\frac{t-t_i}{b}\big)}{\sum_{j=1}^{n} K\big(\frac{t-t_j}{b}\big)}. \qquad (5.5) \]

Also, let ε_i (i ∈ ℤ) be a stationary zero mean process satisfying suitable moment conditions, f_ε the spectral density of ε_i, and assume ε_i to be independent of X_i. Then the sequence of bivariate time series {(X_{1,n}, Y_{1,n}), ..., (X_{n,n}, Y_{n,n})} (n = 1, 2, 3, ...) is a Hierarchical Smoothing Model (or HISMOOTH model) if

\[ Y_{i,n} = Y(t_i) = \sum_{j=1}^{M} \beta_j\, g(t_i; b_j) + \varepsilon_i \qquad (5.6) \]

where t_i = i/n and

\[ g(t_i; b_j) = \sum_{l=1}^{n} a_{b_j}(t_i, t_l)X_{l,n}. \qquad (5.7) \]

Denote by θᵒ = (βᵒ, bᵒ)^t the true parameter vector. Then θᵒ can be estimated by a nonlinear least squares method as follows: define

\[ e_i(\theta) = Y(t_i) - \sum_{j=1}^{M} \beta_j\, g(t_i; b_j) \qquad (5.8) \]

as a function of θ = (β, b)^t, let \(S(\theta) = \sum_{i=1}^{n} e_i^2(\theta)\) and \(\dot g = \frac{\partial}{\partial b}g\). Then

\[ \hat\theta = \arg\min_\theta S(\theta) \qquad (5.9) \]

or, equivalently,

\[ \sum_{i=1}^{n} \psi(t_i, y; \hat\theta) = 0 \qquad (5.10) \]
where ψ = (ψ_1, ..., ψ_{2M})^t,

  ψ_j(t, y; θ) = e_i(θ) g(t; b_j)    (5.11)

for j = 1, ..., M, and

  ψ_j(t, y; θ) = e_i(θ) β_j ġ(t; b_j)    (5.12)

for j = M + 1, ..., 2M. Under suitable assumptions, the estimate θ̂ is asymptotically normal. More specifically, set

  h_i(t; θ^o) = g(t; b_i)  (i = 1, ..., M),    (5.13)
  h_i(t; θ^o) = β_i ġ(t; b_i)  (i = M + 1, ..., 2M),    (5.14)
  Σ = [γ_ε(i − j)]_{i,j=1,...,n} = [cov(ε_i, ε_j)]_{i,j=1,...,n},    (5.15)

and define the 2M × n matrix

  G = G_{2M×n} = [h_i(t_j; θ^o)]_{i=1,...,2M; j=1,...,n}    (5.16)

and the 2M × 2M matrix

  V_n = (GG^t)^{−1} (GΣG^t) (GG^t)^{−1}.    (5.17)

The following assumptions are sufficient to obtain asymptotic normality:

(A1) f_ε(λ) ∼ c_f |λ|^{−2d} (c_f > 0) as λ → 0, with −1/2 < d < 1/2.

(A2) Let

  a_r = n^{−1} Σ_{i,j=1}^{n} γ_ε(i − j) g(t_i; b_r) g(t_j; b_r),
  b_{rs} = n^{−1} Σ_{i,j=1}^{n} γ_ε(i − j) ġ(t_i; b_r) ġ(t_j; b_s).

Then, as n → ∞, lim inf |a_r| > 0 and lim inf |b_{rs}| > 0 for all r, s ∈ {1, ..., M}.

(A3) x(t_i) = ξ(t_i), where ξ: [0, T] → R is a function in C[0, T], T < ∞.

(A4) The set of time points converges to a set A that is dense in [0, T].

Then we have (Beran and Mazzola 1999b):

Theorem 12 Let Θ_1 and Θ_2 be compact subsets of R and R^+ respectively, Θ = Θ_1^M × Θ_2^M, and let η = (1/2) min{1, 1 − 2d}. Suppose that (A1), (A2), (A3), and (A4) hold and θ^o is in the interior of Θ. Then, as n → ∞,

(i) θ̂ →_p θ^o;
(ii) V_n → V, where V is a symmetric positive definite 2M × 2M matrix;
(iii) n^η (θ̂ − θ^o) →_d N(0, V).
Thus, θ̂ is asymptotically normal, but for d > 0 (i.e. long-memory errors), the rate of convergence n^{1/2−d} is slower than the usual n^{1/2} rate.

A particular aspect of HISMOOTH models is that the bandwidths b_j are fixed positive unknown parameters that are estimated from the data. This means that, in contrast to nonparametric regression models (see e.g. Gasser and Müller 1979, Simonoff 1996, Bowman and Azzalini 1997, Eubank 1999), the notion of an optimal bandwidth does not exist here. There is a fixed true bandwidth (or a vector of true bandwidths) that has to be estimated. A HISMOOTH model is in fact a semiparametric nonlinear regression rather than a nonparametric smoothing model.

Theorem 12 can be interpreted as multiple linear regression where uncertainty due to (explanatory) variable selection is taken into account. The set of possible combinations of explanatory variables is parametrized by a continuous bandwidth-parameter vector b ∈ Θ_2^M. Confidence intervals for β based on the asymptotic distribution of θ̂ take into account additional uncertainty due to "variable selection" from the (infinite) parametric family of explanatory variables X = {(x_{b_1}, ..., x_{b_M}): b_j ∈ Θ_2, b_1 > b_2 > ... > b_M}.

For the practical implementation of the model, the following algorithms, which include estimation of M, are defined in Beran and Mazzola (1999b). If M is fixed, then the algorithm consists of two basic steps: a) generation of the set of all possible explanatory variables x_s (s ∈ S), and b) selection of the M variables (bandwidths) that maximize R². This means that after step a), the estimation problem is reduced to variable selection in multiple regression, with a fixed number M of explanatory variables.
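The two-step scheme just described — generate candidate smooths, then pick the bandwidth combination maximizing R² — can be sketched as follows. This is a toy version under simplifying assumptions: centered moving averages stand in for the kernel smooths g(·; s), ordinary least squares is solved by hand via the normal equations, and all names are illustrative.

```python
import itertools
import math

def moving_average(x, half_width):
    # crude stand-in for a kernel smooth with "bandwidth" half_width
    n = len(x)
    out = []
    for t in range(n):
        lo, hi = max(0, t - half_width), min(n, t + half_width + 1)
        out.append(sum(x[lo:hi]) / (hi - lo))
    return out

def solve(A, b):
    # Gaussian elimination with partial pivoting, for small systems
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    beta = [0.0] * n
    for i in range(n - 1, -1, -1):
        beta[i] = (M[i][n] - sum(M[i][j] * beta[j] for j in range(i + 1, n))) / M[i][i]
    return beta

def ols_r2(y, columns):
    # regress y on an intercept plus the given columns; return (R^2, beta)
    n, p = len(y), len(columns) + 1
    X = [[1.0] + [c[i] for c in columns] for i in range(n)]
    A = [[sum(X[i][j] * X[i][k] for i in range(n)) for k in range(p)] for j in range(p)]
    b = [sum(X[i][j] * y[i] for i in range(n)) for j in range(p)]
    beta = solve(A, b)
    fit = [sum(X[i][j] * beta[j] for j in range(p)) for i in range(n)]
    ybar = sum(y) / n
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fit))
    sst = sum((yi - ybar) ** 2 for yi in y)
    return 1.0 - sse / sst, beta

def select_bandwidths(y, x, grid, M):
    # exhaustive search over decreasing bandwidth combinations,
    # keeping the R^2-maximizing one (steps b) above)
    cand = {s: moving_average(x, s) for s in grid}
    best = None
    for combo in itertools.combinations(sorted(grid, reverse=True), M):
        r2, beta = ols_r2(y, [cand[s] for s in combo])
        if best is None or r2 > best[0]:
            best = (r2, combo, beta)
    return best
```

For example, if y is exactly twice a moving average of x at half-width 3, the search over the grid {1, 3, 6} with M = 1 recovers half-width 3, slope near 2, and R² near 1.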
Standard regression software, such as the function leaps in S-Plus, can be used for this purpose. The detailed algorithm is as follows:

Algorithm 1 Define a sufficiently fine grid S = {s_1, ..., s_k} ⊂ Θ_2 and carry out the following steps:

Step 1: Define k explanatory time series x_s = [x_s(t_1), ..., x_s(t_n)]^t (s ∈ S) by x_s(t_i) = g(t_i, s).

Step 2: For each b = (b_1, ..., b_M) ∈ S^M with b_i > b_j (i < j), define the n × M matrix X = (x_{b_1}, ..., x_{b_M}) and let β̂ = β̂(b) = (X^t X)^{−1} X^t y. Also, denote by R²(b) the corresponding value of R² obtained from least squares regression of y on X.

Step 3: Define θ̂ = (β̂, b̂)^t by b̂ = argmax_b R²(b) and β̂ = β̂(b̂).

If M is unknown, then the algorithm can be modified, for instance by increasing M as long as all β-coefficients are significant. In order to calculate the standard deviation of β̂ at each stage, the error process ε_i needs to be modeled explicitly. Beran and Mazzola (1999b) use fractional autoregressive models together with the BIC for choosing the order of the process. This leads to

Algorithm 2 Define a sufficiently fine grid S = {s_1, ..., s_k} ⊂ Θ_2 for the
bandwidths, and calculate k explanatory time series x_s (s ∈ S) by x_s(t_i) = g(t_i, s). Furthermore, define a significance level α, set M_o = 0, and carry out the following steps:

Step 1: Set M = M_o + 1.

Step 2: For each b = (b_1, ..., b_M) ∈ S^M with b_i > b_j (i < j), define the n × M matrix X = (x_{b_1}, ..., x_{b_M}) and let β̂ = β̂(b) = (X^t X)^{−1} X^t y. Also, denote by R²(b) the corresponding value of R² obtained from least squares regression of y on X.

Step 3: Define θ̂ = (b̂, β̂)^t by b̂ = argmax_b R²(b) and β̂ = β̂(b̂).

Step 4: Let e(θ̂) = [e_1, ..., e_n]^t be the vector of regression residuals. Assume that e_i is a fractional autoregressive process of unknown order p characterized by a parameter vector ζ = (σ_ε², d, φ_1, ..., φ_p). Estimate p and ζ by maximum likelihood and the BIC.

Step 5: Calculate for each j = 1, ..., M the estimated standard deviation σ_j(ζ̂) of β̂_j, and set

  p_j = 2[1 − Φ(|β̂_j| σ_j^{−1}(ζ̂))]

where Φ denotes the cumulative standard normal distribution function. If max(p_j) < α, set M_o = M_o + 1 and repeat Steps 1 through 5. Otherwise, stop the iteration and set M̂ = M_o and θ̂ equal to the corresponding estimate.

5.2.4 Hierarchical wavelet models

Wavelet decomposition has become very popular in statistics and many fields of application in the last few years. This is due to its flexibility in depicting local features at different levels of resolution. There is an extensive literature on wavelets, spanning a vast range from profound mathematical foundations and mathematical statistics to concrete applications such as data compression, image and sound processing, and data analysis, to name only a few. For references see for example Daubechies (1992), Meyer (1992, 1993), Kaiser (1994), Antoniadis and Oppenheim (1995), Ogden (1996), Mallat (1998), Härdle et al. (1998), Vidakovic (1999), Percival and Walden (2000), Jansen (2001), Jaffard et al. (2001).
The essential principle of wavelets is to express square integrable functions in terms of orthogonal basis functions that are zero except in a small neighborhood, the neighborhoods being hierarchical in size. The set of basis functions Ψ = {φ_{ok}, k ∈ Z} ∪ {ψ_{jk}, j, k ∈ Z} is generated by two functions only, the father wavelet φ and the mother wavelet ψ, by up/downscaling and by shifting of location. If scaling is done by powers of 2 and shifting by integers, then the basis functions are:

  φ_{ok}(x) = φ_{oo}(x − k) = φ(x − k)  (k ∈ Z)    (5.18)
  ψ_{jk}(x) = 2^{j/2} ψ_{oo}(2^j x − k) = 2^{j/2} ψ(2^j x − k)  (j ∈ N, k ∈ Z)    (5.19)

With respect to the scalar product <g, h> = ∫ g(x)h(x) dx, these basis functions are orthonormal:

  <φ_{ok}, φ_{om}> = 0 (k ≠ m),  <φ_{ok}, φ_{ok}> = ||φ_{ok}||² = 1    (5.20)
  <ψ_{jk}, ψ_{lm}> = 0 (k ≠ m or j ≠ l),  <ψ_{jk}, ψ_{jk}> = ||ψ_{jk}||² = 1    (5.21)
  <ψ_{jk}, φ_{ol}> = 0    (5.22)

Every function g in L²(R) (the space of square integrable functions on R) has a unique representation

  g(x) = Σ_{k=−∞}^{∞} a_k φ_{ok}(x) + Σ_{j=0}^{∞} Σ_{k=−∞}^{∞} b_{jk} ψ_{jk}(x)    (5.23)
       = Σ_{k=−∞}^{∞} a_k φ(x − k) + Σ_{j=0}^{∞} Σ_{k=−∞}^{∞} b_{jk} 2^{j/2} ψ(2^j x − k)    (5.24)

where

  a_k = <g, φ_{ok}> = ∫ g(x) φ_{ok}(x) dx    (5.25)

and

  b_{jk} = <g, ψ_{jk}> = ∫ g(x) ψ_{jk}(x) dx.    (5.26)

Note in particular that ∫ g²(x) dx = Σ_k a_k² + Σ_{j,k} b_{jk}². The purpose of this representation is a decomposition with respect to frequency and time. A simple wavelet, for which the meaning of the decomposition can be understood directly, is the Haar wavelet with

  φ(x) = 1{0 ≤ x < 1}    (5.27)

where 1{0 ≤ x < 1} = 1 for 0 ≤ x < 1 and zero otherwise, and

  ψ(x) = 1{0 ≤ x < 1/2} − 1{1/2 ≤ x < 1}.    (5.28)

For the Haar basis functions φ_{ok}, we have coefficients

  a_k = ∫_k^{k+1} g(x) dx.    (5.29)

Thus, the coefficients of the basis functions φ_{ok} are equal to the average value of g in the interval [k, k + 1]. For ψ_{jk} we have

  b_{jk} = 2^{j/2} [∫_{2^{−j}k}^{2^{−j}(k + 1/2)} g(x) dx − ∫_{2^{−j}(k + 1/2)}^{2^{−j}(k+1)} g(x) dx]    (5.30)

which is proportional to the difference between the average values of g in the intervals 2^{−j} k ≤ x < 2^{−j}(k + 1/2) and 2^{−j}(k + 1/2) ≤ x < 2^{−j}(k + 1). This can be interpreted as a (signed) measure of variability. Since each interval I_{jk} =
[2^{−j} k, 2^{−j}(k + 1)] has length 2^{−j} and midpoint 2^{−j}(k + 1/2), the coefficients b_{jk} (or their squares b_{jk}²) characterize the variability of g at different scales 2^{−j} (j = 0, 1, 2, ...) and on a grid of locations 2^{−j}(k + 1/2) that becomes finer as the scale decreases with increasing values of j.

Suppose now that a time series (function) y_t is observed at a finite number of discrete time points t = 1, 2, ..., n with n = 2^m. To relate this to wavelet decomposition in continuous time, one can construct a piecewise constant function in continuous time by

  g_n(x) = Σ_{k=0}^{n−1} y_k 1{k/n ≤ x < (k+1)/n} = Σ_{k=0}^{n−1} y_k 1{2^{−m} k ≤ x < 2^{−m}(k + 1)}.    (5.31)

Since g_n is a step function (like the Haar basis functions themselves) and zero outside the interval [0, 1), the Haar wavelet decomposition of g_n has only a finite number of nonzero terms:

  g_n(x) = a_{oo} + Σ_{j=0}^{m−1} Σ_{k=0}^{2^j − 1} b_{jk} ψ_{jk}(x).    (5.32)

Note that g_n assumes only a finite number of values g_n(x) = y_{nx} (x = 1/n, 2/n, ..., 1). Moreover, for x = k/n, ψ_{jk}(x) = 2^{j/2} ψ(2^j x − k) is nonzero for 0 ≤ k < 1/(2^{m−j} − 1) only. Therefore, Equation (5.32) can be written in matrix form, and calculation of the coefficients a_{oo} and b_{jk} can be done by matrix inversion. Since matrix inversion may not be feasible for large data sets, various efficient algorithms, such as the so-called discrete wavelet transform, have been developed (see e.g. Percival and Walden 2000).

An interesting interpretation of wavelet decomposition can be given in terms of total variability. The total variability of an observed series can be decomposed into contributions of the basis functions by

  Σ_t (y_t − ȳ)² = Σ_{j=0}^{m−1} Σ_{k=0}^{2^j − 1} b_{jk}².    (5.33)

A plot of b_{jk}² against j (or 2^j = "frequency", or 2^{−j} = "period") and k ("location") shows for each k and j how much of the signal's variability is due to variation at the corresponding location k and frequency 2^j.
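The identity (5.33) is easy to verify numerically using the pyramid form of the discrete Haar transform (successive pairwise averages and differences, each scaled by 1/√2). The code below is a minimal sketch for illustration, not the implementation used for the figures in this chapter.

```python
import math

def haar_dwt(y):
    # Pyramid Haar transform; len(y) must be a power of two.
    # Returns the detail coefficients level by level (finest first)
    # and the final scaling coefficient, which equals sqrt(n) * mean(y).
    s = list(y)
    details = []
    while len(s) > 1:
        d = [(s[2 * k] - s[2 * k + 1]) / math.sqrt(2) for k in range(len(s) // 2)]
        s = [(s[2 * k] + s[2 * k + 1]) / math.sqrt(2) for k in range(len(s) // 2)]
        details.append(d)
    return details, s[0]
```

Orthonormality yields the decomposition of total variability: the sum of squared detail coefficients reproduces Σ(y_t − ȳ)², while the squared scaling coefficient carries the n ȳ² part.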
To illustrate how wavelet decomposition works, consider the following simulated example: let x_i = 2 cos(2πi/90) if i ∈ {1, ..., 300} ∪ {501, ..., 700} ∪ {901, ..., 1024}. For 301 ≤ i ≤ 500, set x_i = (1/2) cos(2πi/10), and for 701 ≤ i ≤ 900, x_i = 15 cos(2πi/10) + (1/10000)(i − 200)². The observed signal thus consists of several periodic segments with different frequencies and amplitudes, the largest amplitude occurring between t = 701 and 900, together with a slight trend. Figure 5.1a displays x_i. The coefficients for the four highest levels (i.e. j = 0, 1, 2, 3) are plotted against time in Figure
5.1b. Note that D stands for the mother and S for the father wavelet. Moreover, the numbering in the plot (as given in S-Plus) is opposite to the one given above: s4 and d4 in the plot correspond to the coarsest level j = 0 above. The corresponding functions at the different levels are given in Figure 5.1c. The ten and fifty largest basis contributions are given in Figures 5.1d and e respectively (together with the data on top and residuals at the bottom). Figure 5.1f shows the time frequency plot of the squared coefficients in the wavelet decomposition of x_i. Bright shading corresponds to large coefficients. All plots emphasize the high-frequency portion with large amplitude between i = 701 and 900. Moreover, the trend at this location is visible through the coefficient values of the father wavelet φ (s4 in the plot) and the slightly brighter shading in the lowest frequency band of the time-frequency plot.

An alternative to HISMOOTH models can be defined via wavelets (the following definition is a slight modification of Beran and Mazzola 2001):

Definition 41 Let φ, ψ ∈ L²(R) be a father and the corresponding mother wavelet respectively, φ_k(·) = φ(· − k), ψ_{j,k} = 2^{j/2} ψ(2^j · − k) (k ∈ Z, j ∈ N) the orthogonal wavelet basis generated by φ and ψ, and u_i and ε_i (i ∈ Z) independent stationary zero mean processes satisfying suitable moment conditions. Assume X(t_i) = g(t_i) + u_i with g ∈ L²[0, T], t_i ∈ [0, T], and wavelet decomposition g(t) = Σ a_k φ_k(t) + Σ b_{j,k} ψ_{j,k}(t). For 0 = c_{M+1} < c_M < ... < c_1 < c_o = ∞, let

  g(t; c_{i−1}, c_i) = Σ_{c_i ≤ |a_k| < c_{i−1}} a_k φ_k(t) + Σ_{c_i ≤ |b_{j,k}| < c_{i−1}} b_{j,k} ψ_{j,k}(t).

Then (X(t_i), Y(t_i)) (i = 1, ..., n) is a Hierarchical Wavelet Model (HIWAVE model) of order M, if there exist M ∈ N, β = (β_1, ..., β_M) ∈ R^M, and η = (η_1, ..., η_M) ∈ R_+^M with 0 < η_M < ... < η_1 < η_o = ∞, such that

  Y(t_i) = Σ_{l=1}^{M} β_l g(t_i; η_{l−1}, η_l) + ε_i.    (5.34)
The definition means that the time series Y(t) is decomposed into orthogonal components that are proportional to certain "bands" in the wavelet decomposition of the explanatory series X(t), the bands being defined by the size of the wavelet coefficients. As for HISMOOTH models, the parameter vector θ = (β, η)^t can be estimated by nonlinear least squares regression. To illustrate how HIWAVE models may be used, consider the following simulated example: let x_i = g(t_i) (i = 1, ..., 1024) as in the previous example. The function g is decomposed into g(t) = g(t; ∞, η_1) + g(t; η_1, 0) = g_1(t) + g_2(t), where η_1 is such that 50 wavelet coefficients of g are larger than or equal to η_1. Figure 5.2 shows g, g_1, and g_2. A simulated series of response variables, defined by Y(t_i) = 2 g_1(t_i) + ε_i (i = 1, ..., 1024) with independent zero-mean normal errors ε_i with variance σ_ε² = 100, is shown in Figure 5.3b.
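A toy version of the band construction g(t; c_{i−1}, c_i) can be sketched with the Haar transform: threshold the detail coefficients by magnitude, invert each band separately, and the bands sum back to the original series. The function names, and the choice to assign the scaling coefficient to the first band, are ours.

```python
import math

def haar_dwt(y):
    # pairwise averages/differences scaled by 1/sqrt(2); n a power of two
    s, details = list(y), []
    while len(s) > 1:
        details.append([(s[2*k] - s[2*k+1]) / math.sqrt(2) for k in range(len(s)//2)])
        s = [(s[2*k] + s[2*k+1]) / math.sqrt(2) for k in range(len(s)//2)]
    return details, s[0]

def haar_idwt(details, s0):
    # exact inverse of haar_dwt
    s = [s0]
    for d in reversed(details):
        nxt = []
        for k in range(len(d)):
            nxt.append((s[k] + d[k]) / math.sqrt(2))
            nxt.append((s[k] - d[k]) / math.sqrt(2))
        s = nxt
    return s

def split_by_magnitude(y, n_keep):
    # g1: inverse transform of the n_keep largest-magnitude detail
    # coefficients (plus the scaling term); g2: the remainder.
    details, s0 = haar_dwt(y)
    ranked = sorted(((abs(c), j, k) for j, lev in enumerate(details)
                     for k, c in enumerate(lev)), reverse=True)
    keep = {(j, k) for _, j, k in ranked[:n_keep]}
    big = [[c if (j, k) in keep else 0.0 for k, c in enumerate(lev)]
           for j, lev in enumerate(details)]
    small = [[0.0 if (j, k) in keep else c for k, c in enumerate(lev)]
             for j, lev in enumerate(details)]
    return haar_idwt(big, s0), haar_idwt(small, 0.0)
```

By orthogonality, g1 and g2 reconstruct the input exactly: g1 + g2 = y.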
Figure 5.1 Simulated signal (a) and wavelet coefficients (b). (Panel b: coefficients up to j = 4, levels d1-d4 and s4, numbered in reversed order.)
Figure 5.1 c and d: wavelet components of the simulated signal in a. (Panel c: components up to j = 4; panel d: the largest ten components, with data on top and residuals at the bottom.)
Figure 5.1 e and f: wavelet components of the simulated signal in a, and frequency plot of the coefficients.
Figure 5.2 Decomposition of the x-series in the simulated HIWAVE model (g1 = first 50 components of x; g2 = x − g1).

A comparison of the two scatter plots in Figures 5.3c and d shows a much clearer dependence between y and g_1 than between y and x = g. Figure 5.3e illustrates that there is no relationship between y and g_2. Finally, the time-frequency plot in Figure 5.3f indicates that the main periodic behavior occurs for t ∈ {701, ..., 900}. The difficulty in practice is that the correct decomposition of x into g_1 and the redundant component g_2 is not known a priori. Figure 5.4 shows y and the HIWAVE curve β̂_o + β̂_1 g(t_i; ∞, η̂_1) fitted by nonlinear least squares regression (for graphical reasons the fitted curve is shifted vertically). Apparently, the algorithm identified η_1, and hence the relevant time span [701, 900], quite exactly, since g(t_i; ∞, η̂_1) corresponds to the sum of the largest 51 wavelet components. The estimated coefficients are β̂_o = −0.36 and β̂_1 = 1.95. If we assume (incorrectly of course) that η_1 had been known a priori, then we can give confidence intervals for both parameters as in linear least squares regression. These intervals are generally too short, since they do not take into account that η_1 is estimated. However, if a null hypothesis is not rejected using these intervals, then it will not be rejected by the correct test either. In our case, the linear regression confidence intervals for β_o and β_1 are [−0.96, 0.24] and [1.81, 2.09] respectively, and thus contain the true values β_o = 0 and β_1 = 2.
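The fit just described can be imitated in a few lines for an order-1 HIWAVE model: scan over the number K of retained coefficients (equivalently, over the threshold η_1), regress y on the reconstruction g(·; ∞, η_1), and keep the K maximizing R². This is a schematic stand-in for the nonlinear least squares fit, with our own names and a noise-free toy signal.

```python
import math

def haar_dwt(y):
    # pyramid Haar transform; len(y) must be a power of two
    s, details = list(y), []
    while len(s) > 1:
        details.append([(s[2*k] - s[2*k+1]) / math.sqrt(2) for k in range(len(s)//2)])
        s = [(s[2*k] + s[2*k+1]) / math.sqrt(2) for k in range(len(s)//2)]
    return details, s[0]

def haar_idwt(details, s0):
    # exact inverse of haar_dwt
    s = [s0]
    for d in reversed(details):
        nxt = []
        for k in range(len(d)):
            nxt.append((s[k] + d[k]) / math.sqrt(2))
            nxt.append((s[k] - d[k]) / math.sqrt(2))
        s = nxt
    return s

def top_k_reconstruction(x, k_keep):
    # g(.; infinity, eta_K): keep only the k_keep largest detail coefficients
    details, s0 = haar_dwt(x)
    ranked = sorted(((abs(c), j, k) for j, lev in enumerate(details)
                     for k, c in enumerate(lev)), reverse=True)
    keep = {(j, k) for _, j, k in ranked[:k_keep]}
    kept = [[c if (j, k) in keep else 0.0 for k, c in enumerate(lev)]
            for j, lev in enumerate(details)]
    return haar_idwt(kept, s0)

def simple_regression(y, g):
    # least squares fit y = b0 + b1 * g; returns (b0, b1, R^2)
    n = len(y)
    gbar, ybar = sum(g) / n, sum(y) / n
    sxy = sum((gi - gbar) * (yi - ybar) for gi, yi in zip(g, y))
    sxx = sum((gi - gbar) ** 2 for gi in g)
    b1 = sxy / sxx
    b0 = ybar - b1 * gbar
    sse = sum((yi - b0 - b1 * gi) ** 2 for gi, yi in zip(g, y))
    sst = sum((yi - ybar) ** 2 for yi in y)
    return b0, b1, 1.0 - sse / sst

def fit_hiwave_order1(y, x, k_max):
    # scan the threshold (via K) and keep the R^2-maximizing fit
    best = None
    for k_keep in range(1, k_max + 1):
        g = top_k_reconstruction(x, k_keep)
        b0, b1, r2 = simple_regression(y, g)
        if best is None or r2 > best[0] + 1e-12:
            best = (r2, k_keep, b0, b1)
    return best
```

When y is built exactly from the five largest wavelet components of x with slope 2, the scan recovers K = 5 and a slope near 2, mirroring how the fit in the text recovered η̂_1 and β̂_1 ≈ 2.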
Figure 5.3 Simulated HIWAVE model: explanatory series g1 (a), y-series (b), y versus x (c), y versus g1 (d), y versus g2 = x − g1 (e), and time frequency plot of y (f).
COLOR FIGURE 2.30 The minnesinger Burchard von Wengen (1229-1280), contemporary of Adam de la Halle (1235?-1288). (From Codex Manesse, courtesy of the University Library, Heidelberg.)

COLOR FIGURE 2.35 Symbol plot with x = pj5, y = pj7, and radius of circles proportional to pj6.

COLOR FIGURE 2.36 Symbol plot with x = pj5, y = pj7. The rectangles have width pj1 (diminished second) and height pj6 (augmented fourth).

COLOR FIGURE 2.37 Symbol plot with x = pj5, y = pj7, and triangles defined by pj1 (diminished second), pj6 (augmented fourth), and pj10 (diminished seventh).

COLOR FIGURE 2.38 Names plotted at locations (x, y) = (pj5, pj7).

COLOR FIGURE 3.2 Fractal pictures (by Céline Beran, computer generated).

COLOR FIGURE 4.7 A harpsichord sound and its spectrogram. Intense pink corresponds to high values of I(t, λ).

COLOR FIGURE 9.6 Graduale written for an Augustinian monastery of the diocese Konstanz, 13th century. (Courtesy of Zentralbibliothek Zürich.)
Figure 5.4 HIWAVE time series and fitted function ĝ_1.

5.3 Specific applications in music

5.3.1 Hierarchical decomposition of metric, melodic, and harmonic weights

Decomposition of metric, melodic, and harmonic weights as in (5.3) and (5.4) can reveal structures and relationships that are not obvious in the original series. To illustrate this, Figures 5.5a through d and 5.5e through h show a decomposition of these weights for Bach's Canon cancricans from "Das Musikalische Opfer" BWV 1079 and Webern's Variation op. 27/2 respectively. The bandwidths were chosen based on time signature and bar grouping. Webern's piano piece is written in 2/4 signature; its formal grouping is 1 + 11 + 11 + 11 + 11. However, Webern insists on a grouping in 2-bar portions, suggesting the bandwidths 5.5 (11 bars), 1 (2 bars), and 0.5 (1 bar). Bach's canon is written in 4/4 signature; the grouping is 9 + 9 + 9 + 9. The chosen bandwidths are 9 (9 bars), 3 (3 bars), and 1 (1 bar). For both compositions, much stronger similarities between the smoothed metric, melodic, and harmonic components can be observed than for the original weights. An extended discussion of these and other examples can be found in Beran and Mazzola (1999a).
Figure 5.5 Hierarchical decomposition of metric, melodic, and harmonic indicators for Bach's "Canon cancricans" (Das Musikalische Opfer BWV 1079) and Webern's Variation op. 27, No. 2.

5.3.2 HIREG models of the relationship between tempo and melodic curves

Quantitative analysis of performance data is an attempt to understand "objectively" how musicians interpret a score (Figure 5.6). For the analysis of tempo curves for Schumann's Träumerei (Figure 2.3), Beran and Mazzola (1999a) construct the following matrix of explanatory variables by decomposing structural weight functions into components of different
smoothness: let x_1 = x_metric = metric weight, x_2 = x_melod = melodic weight, and x_3 = x_hmean = harmonic (mean) weight (see Chapter 3). Define the bandwidths b_1 = 4 (4 bars), b_2 = 2 (2 bars), and b_3 = 1 (1 bar), and denote the corresponding components in the decomposition of x_1, x_2, x_3 by x_{j,metric} = x_{j,1}, x_{j,melod} = x_{j,2}, x_{j,hmean} = x_{j,3}. More exactly, since harmonic weights are originally defined for each note, two alternative variables are considered for the harmonic aspect: x_hmean(t_l) = average harmonic weight at onset time t_l, and x_hmax(t_l) = maximal harmonic weight at onset time t_l. Thus, the decomposition of four different weight functions x_metric, x_melod, x_hmean, and x_hmax is used in the analysis. Moreover, for each curve, discrete derivatives are defined by

  dx(t_j) = (x(t_j) − x(t_{j−1})) / (t_j − t_{j−1})

and

  dx^{(2)}(t_{j−1}) = (dx(t_j) − dx(t_{j−1})) / (t_j − t_{j−1}).

Each of these variables is decomposed hierarchically into four components, as described above, with the bandwidths b_1 = 4 (weighted averaging over 8 bars), b_2 = 2 (4 bars), b_3 = 1 (2 bars), and b_4 = 0 (residual, no averaging). We thus obtain 48 variables (functions):

  x_{metric,1}  x_{metric,2}  x_{metric,3}  x_{metric,4}
  dx_{metric,1}  dx_{metric,2}  dx_{metric,3}  dx_{metric,4}
  d²x_{metric,1}  d²x_{metric,2}  d²x_{metric,3}  d²x_{metric,4}
  x_{melodic,1}  x_{melodic,2}  x_{melodic,3}  x_{melodic,4}
  dx_{melodic,1}  dx_{melodic,2}  dx_{melodic,3}  dx_{melodic,4}
  d²x_{melodic,1}  d²x_{melodic,2}  d²x_{melodic,3}  d²x_{melodic,4}
  x_{hmax,1}  x_{hmax,2}  x_{hmax,3}  x_{hmax,4}
  dx_{hmax,1}  dx_{hmax,2}  dx_{hmax,3}  dx_{hmax,4}
  d²x_{hmax,1}  d²x_{hmax,2}  d²x_{hmax,3}  d²x_{hmax,4}
  x_{hmean,1}  x_{hmean,2}  x_{hmean,3}  x_{hmean,4}
  dx_{hmean,1}  dx_{hmean,2}  dx_{hmean,3}  dx_{hmean,4}
  d²x_{hmean,1}  d²x_{hmean,2}  d²x_{hmean,3}  d²x_{hmean,4}

In addition to these variables, the following score information is modeled in a simple way:

1. Ritardandi There are four onset intervals R_1, R_2, R_3, and R_4 with an explicitly written ritardando instruction, starting at onset times t_o(R_j) (j = 1, 2, 3, 4) respectively.
This is modeled by linear functions

  x_{rit_j}(t) = 1{t ∈ R_j} · (t − t_o(R_j)),  j = 1, 2, 3, 4.    (5.35)
Figure 5.6 Quantitative analysis of performance data is an attempt to understand "objectively" how musicians interpret a score, without attaching any subjective judgement. (Left: "Freddy" by J.B.; right: J.S. Bach, woodcutting by Ernst Würtemberger, Zürich. Courtesy of Zentralbibliothek Zürich.)

2. Suspensions There are four onset intervals S_1, S_2, S_3, and S_4 with suspensions, starting at onset times t_o(S_j) (j = 1, 2, 3, 4) respectively. The effect is modeled by the variables

  x_{sus_j}(t) = 1{t ∈ S_j} · (t − t_o(S_j)),  j = 1, 2, 3, 4.    (5.36)

3. Fermatas There are two onset intervals F_1, F_2 with fermatas. Their effect is modeled by indicator functions

  x_{ferm_j}(t) = 1{t ∈ F_j},  j = 1, 2.    (5.37)

The variables are summarized in an n × 57 matrix X. After orthonormalization, the following model is assumed:

  y(j) = Z β(j) + ε(j)

where y(j) = [y(t_1, j), y(t_2, j), ..., y(t_n, j)]^t are the tempo measurements for performance j, Z is the orthonormalized X-matrix, β(j) is the vector of coefficients (β_1(j), ..., β_p(j))^t, and ε(j) = [ε(t_1, j), ε(t_2, j), ..., ε(t_n, j)]^t is a vector of n identically distributed, but possibly correlated, zero mean random variables ε(t_i, j) (t_i ∈ T) with variance var(ε(t_i, j)) = σ²(j). Beran and Mazzola (1999a) select the most important variables for each of the 28 performances separately, by stepwise linear regression. The main aim of the analysis is to study the relationship between structural weight functions and tempo with respect to a) existence, b) type and complexity, and c) comparison of different performances. It should perhaps be emphasized at this point that quantitative analysis of performance data aims at gaining a better "objective" understanding of how pianists interpret a score
without attaching any subjective judgement. The aim is thus not to find the "ideal performance", which may in fact not exist, or to state an opinion about the quality of a performance. The values of R², obtained for the full model with all explanatory variables, vary between 0.65 and 0.85. Note, however, that the number of potential explanatory variables is very large, so that high values of R² do not necessarily imply that the regression model is meaningful. On the other hand, musical performance is a very complex process. It is therefore not unreasonable that a large number of explanatory variables may be necessary. This is confirmed formally, in that for most performances the selected models turn out to be complex (with many variables), all variables being statistically significant (at the 5% level) even when correlations in the errors are taken into account. For instance, for Brendel's performance (R² = 0.76), seventeen significant variables are selected (including first and second derivatives). In spite of the complexity, there is a large degree of similarity between the performances in the following sense: a) all except at most 3 of the 57 coefficients β_j have the same sign for all performances (the results are therefore hardly random); b) there are "canonical" variables that are chosen by stepwise regression for (almost) all performances; and c) the same is true if one considers (for each performance separately) the explanatory variables with the largest coefficient. Figure 5.7 shows three of these curves. The upper curve is the most important explanatory variable for 24 of the 28 performances. The exceptions are all three Cortot performances and Krust, with a preference for the middle curve, which reflects the division of the piece into 8 parts, and the performance by Ashkenazy, with a curve similar to Cortot's.
Apparently, Cortot, Krust, and Ashkenazy put special emphasis on the division into 8 parts. The results can also be used to visualize the structure of tempo curves in the following way: using the size of |β̂_k| as the criterion for the importance of variable k, we may add the terms in the regression equation sequentially to obtain a hierarchy of tempo curves ranging from very simple to complex. This is illustrated in Figures 5.8a and b for Ashkenazy and Horowitz's third performance.

5.3.3 HISMOOTH models for the relationship between tempo and structural curves

An analysis of the relationship between a melodic curve (Chapter 3) and the 28 tempo curves for Schumann's Träumerei is discussed in Beran and Mazzola (1999b). In a first step, effects of fermatas and ritardandi are subtracted from each of the 28 tempo series individually, using linear regression. The component of the melodic curve m_t orthogonal to these variables is then used. The second algorithm for HISMOOTH models is used, with a grid G that takes into account that 0 ≤ t ≤ 32 and that only certain multiples of 1/8 correspond to musically interesting neighborhoods: G = {32, 30, 28, 26, 24,
Figure 5.7 Most important melodic curves obtained from HIREG fit to tempo curves for Schumann's Träumerei.

Figure 5.8 Successive aggregation of HIREG components for tempo curves by Ashkenazy (a) and Horowitz, third performance (b); estimated and observed log(tempo) against onset time.
22, 20, 18, 16, 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1.5, 1, 0.75, 0.5, 0.25, 0.125}. Note that, since for large bandwidths the resulting curves g do not vary much, large trial bandwidths do not need to be too close together. The error process ε is modeled by a fractional AR(p, d) process, the order being estimated from the data by the BIC. Note that, from the musicological point of view, the fractional differencing parameter can be interpreted as a measure of self-similarity (see Chapter 3).

For illustration, consider the performances CORTOT1 and HOROWITZ1 (see Figures 5.9b and c). In both cases, the number M of explanatory variables estimated by Algorithm 2 turns out to be 3 (with a level of significance of α = 0.05). The estimated bandwidths (and 95%-confidence intervals) are b̂_1 = 4.0 ([2.66, 5.34]), b̂_2 = 2.0 ([1.10, 2.90]), and b̂_3 = 0.5 ([0.17, 0.83]) for CORTOT1, and b̂_1 = 4 ([2.26, 5.74]), b̂_2 = 1 ([0.39, 1.62]), and b̂_3 = 0.25 ([0.04, 0.46]) for HOROWITZ1. The estimates of β are β̂_1 = −0.81 ([−1.53, −0.10]), β̂_2 = 1.08 ([0.21, 1.05]), and β̂_3 = −0.624 ([−1.15, −0.10]), and β̂_1 = −0.42 ([−0.66, −0.18]), β̂_2 = 0.54 ([0.13, 0.95]), and β̂_3 = −0.68 ([−1.08, −0.28]) respectively. Finally, the fitted error process for Cortot is a fractional AR(1) process with d̂ = −0.25 ([−0.60, 0.09]) and φ̂_1 = 0.77 ([0.48, 1]). For Horowitz we obtain a fractional AR(2) process with d̂ = 0.30 ([0.14, 0.45]), φ̂_1 = 0.26 ([0.09, 0.42]), and φ̂_2 = −0.43 ([−0.55, −0.30]).

A possible interpretation of the results is as follows: the largest bandwidth b̂_1 = 4 (one bar) is the same for both performers. A relatively large portion of the shaping of the tempo "happens" at this level. Apart from this, however, Horowitz's bandwidths are smaller. Horowitz appears to emphasize very local melodic structures more than Cortot.
Moreover, for Horowitz, d̂ > 0 (long-range dependence): while the small scale structures are "explained" by the melodic structure of the score, the remaining "unexplained" part of the performance is still "coherent" in the sense that there is a relatively strong (self-)similarity and positive correlations even between remote parts. On the other hand, for Cortot, d̂ < 0 (antipersistence): while larger scale structures are "explained" by the melodic structure of the score, more local fluctuations are still "coherent" in the sense that there is a relatively strong negative autocorrelation even between remote parts; these smaller scale structures are, however, difficult to relate directly to the melodic structure of the score.

Figures 5.9a through d also show simplified tempo curves for all 28 performances, obtained by HISMOOTH fits with M = 3. The comparison of typical characteristics is now much easier than for the original curves. In particular, there is a strong similarity between all three performances by Horowitz on one hand, and the three performances by Cortot on the other hand. Several performers (Moisewitsch, Novaes, Ortiz, Krust, Schnabel, Katsaris) put even higher emphasis on global melodic features than Cortot. Striking similarities can also be seen between Horowitz, Klien, and
Figure 5.9 a and b: HISMOOTH fits to tempo curves (performances 1-14).

Brendel. Another group of similar performances consists of Cortot, Argerich, Capova, Demus, Kubalek, and Shelley.

5.3.4 Digital encoding of musical sounds (CD, MPEG)

Wavelet decomposition plays an important role in modern techniques of digital sound and image processing. Digital encoding of sounds (e.g. CD, MPEG) relies on algorithms that make it possible to compress complex data into as few storage units as possible. Wavelet decomposition is one such technique: instead of storing a complete function (evaluated or measured at a very large number of time points on a fine grid), one only needs to keep the relatively small number of wavelet coefficients. There is an extensive literature on how exactly this can be done to suit particular engineering needs. Since the focus here is on "genuine" musical questions rather than signal processing, we do not pursue this further. The interested reader is referred to the engineering literature such as Effelsberg and Steinmetz (1998) and references therein.

5.3.5 Wavelet analysis of tempo curves

Figure 5.9 c and d: HISMOOTH fits to tempo curves (performances 15-28).

Consider the tempo curves for Schumann's Träumerei. Wavelet analysis can help one to understand some of the similarities and differences between tempo curves. This is illustrated in Figures 5.10a through f, where time-frequency plots of the three tempo curves by Cortot are compared with those by Horowitz. (More specifically, only the first 128 observations are used here.) The obvious difference is that Horowitz has more power in the high frequency range. Figures 5.11a through f compare the wavelet coefficients of residuals obtained after subtracting a kernel-smoothed version of the tempo curves (bandwidth 1/8, i.e. averaging was done over one quarter of a bar). This provides an overview of local details of the curves. In particular, it can be seen at which level of resolution each pianist kept essentially the same "profile" throughout the years. For instance, for Horowitz the complete profile at level 2 (d2) remains essentially the same. An even better adaptation to the data is achieved by using so-called wavelet packets, which are generalizations of wavelets, in conjunction with a "best-basis algorithm". The idea of the algorithm is to find the type of basis functions best suited to approximate an observed time series with as few basis functions as possible. This is a way out of the limitation due to the very specific shape of a particular class of wavelet functions (see e.g. Haar wavelets, where we are confined to step functions). For detailed references on wavelet packets see e.g. Coifman et al. (1992) and Coifman and Wickerhauser (1992). Figures 5.12 through 5.14 illustrate the usefulness of this approach: the 28 tempo curves of Schumann's Träumerei are approximated by the most important
Figure 5.10 Time frequency plots for Cortot's and Horowitz's three performances.

two (Figure 5.12), five (Figure 5.13), and ten (Figure 5.14) best basis functions. The plots show interesting and plausible similarities and differences. Particularly striking are Cortot's 4-bar oscillations, Horowitz's "seismic" local fluctuations, the relatively unbalanced tempo with a few extreme tempo variations for Eschenbach, Klien, Ortiz, and Schnabel, the irregular shapes for Moiseiwitsch, and also a strong similarity between Horowitz1 and Moiseiwitsch with respect to the general shape (Figure 5.12).

5.3.6 HIWAVE models of the relationship between tempo and melodic curves

HIWAVE models can be used, for instance, to establish a relationship between structural curves obtained from a score and a performance of the score. Here, we consider the tempo curves by Cortot and Horowitz (Figure 5.15a), and the melodic weight function m(t) defined in Section 3.3.4. Assuming a HIWAVE-model of order 1, Figure 5.15b displays the value of R2
Figure 5.11 Wavelet coefficients for Cortot's and Horowitz's three performances (panels a-c: residual coefficients for Cortot 1-3; panels d-f: Horowitz 1-3; rows idwt, d1, d2, s2).
Figure 5.12 Tempo curves – approximation by the most important 2 best basis functions (28 panels: Argerich, Arrau, Askenaze, Brendel, Bunin, Capova, Cortot 1-3, Curzon, Davies, Demus, Eschenbach, Gianoli, Horowitz 1-3, Katsaris, Klien, Krust, Kubalek, Moiseiwitsch, Ney, Novaes, Ortiz, Schnabel, Shelley, Zak).
Figure 5.13 Tempo curves – approximation by the most important 5 best basis functions (same 28 panels as Figure 5.12).
©2004 CRC Press LLC
Figure 5.14 Tempo curves – approximation by the most important 10 best basis functions (same 28 panels as Figure 5.12).

for the simple linear regression model yi = β0 + β1 g(ti; ∞, η) as a function of the number of wavelet coefficients of mi that are larger than or equal to η. Two observations can be made: a) for almost all choices of η, the fit for Horowitz (gray lines) is better, and b) the best value of η is practically the same for all six performances. Figure 5.15c shows the fitted HIWAVE-curves for Cortot and Horowitz separately. The result shows an amazing agreement between the three Cortot performances on the one hand and the three Horowitz curves on the other. The HIWAVE-fits seem to have extracted a major aspect of the performance styles. Horowitz appears to build blocks of almost horizontal tempo levels and "adds", within these blocks, very fine tempo variations.
In contrast, for Cortot, blocks have a more "parabolic" shape. It should be noted, of course, that, since Haar wavelets were used here, these features (in particular Horowitz's horizontal blocks) may be somewhat overemphasized. Analogous pictures are displayed in Figures 5.16a through c and 5.17a through c for the first and second difference of the tempo, respectively. Particularly interesting are Figures 5.17b and c: the values of R2 are practically the same for all Horowitz performances and clearly lower than for Cortot. Moreover, as before, both pianists show an amazing consistency in their performances.
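Section 5.3.4 describes compression by keeping only a small number of wavelet coefficients. The following is a minimal, self-contained sketch of that idea in plain Python, using the Haar wavelets mentioned above; the function names and the toy signal are ours, not from the book.

```python
import math

def haar_step(x):
    # One Haar level: pairwise scaled averages (smooth) and differences (detail).
    r2 = math.sqrt(2.0)
    s = [(x[2 * i] + x[2 * i + 1]) / r2 for i in range(len(x) // 2)]
    d = [(x[2 * i] - x[2 * i + 1]) / r2 for i in range(len(x) // 2)]
    return s, d

def haar_transform(x):
    # Full decomposition for a signal whose length is a power of two.
    s, coeffs = list(x), []
    while len(s) > 1:
        s, d = haar_step(s)
        coeffs = d + coeffs
    return s + coeffs  # [overall mean term, coarse details, ..., finest details]

def compress(coeffs, k):
    # Keep only the k largest coefficients (in absolute value); zero the rest.
    keep = set(sorted(range(len(coeffs)), key=lambda i: -abs(coeffs[i]))[:k])
    return [c if i in keep else 0.0 for i, c in enumerate(coeffs)]

# A step-shaped "signal" needs only 2 of its 8 Haar coefficients.
x = [1, 1, 1, 1, 5, 5, 5, 5]
coeffs = haar_transform(x)
kept = compress(coeffs, 2)
```

For step-like functions the Haar basis is extremely economical; for smoother curves other wavelet families, or the wavelet packets with a best-basis algorithm described above, do better.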
Figure 5.15 Tempo curves (a) by Cortot (three curves on top) and Horowitz, R2 obtained in HIWAVE-fit plotted against trial cut-off parameter η (b), and fitted HIWAVE-curves (c).
Figure 5.16 First derivative of tempo curves (a) by Cortot (three curves on top) and Horowitz, R2 obtained in HIWAVE-fit plotted against trial cut-off parameter η (b), and fitted HIWAVE-curves (c).
Figure 5.17 Second derivative of tempo curves (a) by Cortot (three curves on top) and Horowitz, R2 obtained in HIWAVE-fit plotted against trial cut-off parameter η (b), and fitted HIWAVE-curves (c).
CHAPTER 6

Markov chains and hidden Markov models

6.1 Musical motivation

Musical events can often be classified into a finite or countable number of categories that occur in a temporal sequence. A natural question is then whether the transitions between different categories can be characterized by probabilities. In particular, a successful model may be able to reproduce formally a listener's expectation of "what happens next" by giving appropriate conditional probabilities. Markov chains are simple models in discrete time that are defined by conditioning on the immediate past only. The theory of Markov chains is well developed and many beautiful results are available. More complicated, but very flexible, are hidden Markov processes. For these models, the probability distribution itself changes dynamically according to a Markov process. Many of the developments on hidden Markov models have been stimulated by problems in speech recognition. It is therefore not surprising that these models are also very useful for analyzing musical signals. Here, a very brief introduction to Markov chains and hidden Markov models is given. For an extended discussion see, for instance, Chung (1967), Isaacson and Madsen (1976), Kemeny et al. (1976), Billingsley (1986), Elliott et al. (1995), MacDonald and Zucchini (1997), Norris (1998), and Bremaud (1999).

6.2 Basic principles

6.2.1 Definition of Markov chains

Let X0, X1, ... be a sequence of random variables with possible outcomes Xt = xt ∈ S. Then the sequence is called a Markov chain if

M1. The state space S is finite or countable;

M2. For any t ∈ N,

P(Xt+1 = j | X0 = i0, X1 = i1, ..., Xt = it) = P(Xt+1 = j | Xt = it)   (6.1)

Condition M2 means that the future development of the process, given the past, depends on the most recent value only. In the following we also
assume that the Markov chain is homogeneous in the sense that for any i, j ∈ N, the conditional probability P(Xt+1 = j | Xt = i) does not depend on time t. The probability distribution of the process Xt (t = 0, 1, 2, ...) is then fully specified by the initial distribution

πi = P(X0 = i)   (6.2)

and the (finite or infinite dimensional) matrix of transition probabilities

pij = P(Xt+1 = j | Xt = i)   (i, j = 1, 2, ..., |S|)   (6.3)

where |S| = m ≤ ∞ is the number of elements in the state space S. Without loss of generality, we may assume S = {1, 2, ..., m}. Note that the vector π = (π1, ..., πm)^t and the matrix M = (pij)i,j=1,2,...,m have the following properties: 0 ≤ πi, pij ≤ 1,

Σ_{i=1}^{m} πi = 1   and   Σ_{j=1}^{m} pij = 1

Probabilities of events can be obtained by matrix multiplication, since

pij^(n) = P(Xt+n = j | Xt = i) = Σ_{j1,...,jn−1=1}^{m} pij1 pj1j2 ··· pjn−1,j = [M^n]ij   (6.4)

and

pj^(n) = P(Xt+n = j) = [π^t M^n]j   (6.5)

6.2.2 Transience, persistence, irreducibility, periodicity, and stationarity

The dynamic behavior of a Markov chain can essentially be characterized by the notions of transience–persistence, irreducibility–reducibility, aperiodicity–periodicity, and stationarity–nonstationarity. These properties will be discussed now.

Consider the probability that the first visit to state j occurs at time n, given that the process started in state i,

fij^(n) = P(X1 ≠ j, ..., Xn−1 ≠ j, Xn = j | X0 = i)   (6.6)

Note that fij^(n) can also be written as

fij^(n) = P(Tj = n | X0 = i)
where

Tj = min{n ≥ 1 : Xn = j}

is the first time the process reaches state j. The conditional probability that the process ever visits state j can be written as

fij = P(Tj < ∞ | X0 = i) = P(∪_{n=1}^{∞} {Xn = j} | X0 = i) = Σ_{n=1}^{∞} fij^(n)   (6.7)

We then have the following

Definition 42 A state i is called

i) transient, if fii < 1;

ii) persistent, if fii = 1.

Persistence means that we return to the same state again with certainty. For transient states it can occur, with positive probability, that we never return to the same place. As it turns out, a positive probability of never returning implies that there is indeed a "point of no return", i.e. a time point after which one never returns. This can be seen as follows. Conditionally on X0 = i, the probability that state j is reached at least k + 1 times is equal to fij fjj^k. Hence, for k → ∞, we obtain the probability of returning infinitely often,

qij = P(Xn = j infinitely often | X0 = i) = fij lim_{k→∞} fjj^k.   (6.8)

This implies qij = 0 for fjj < 1 and qij = 1 for fjj = 1. A simple way of checking whether a state is persistent or not is given by

Theorem 13 The following holds for a Markov chain:

i) A state j is transient ⇔ qjj = 0 ⇔ Σ_{n=1}^{∞} pjj^(n) < ∞;

ii) A state j is persistent ⇔ qjj = 1 ⇔ Σ_{n=1}^{∞} pjj^(n) = ∞.

The condition on Σ_{n=1}^{∞} pii^(n) can be simplified further for irreducible Markov chains:

Definition 43 A Markov chain is called irreducible if for each i, j ∈ S, pij^(n) > 0 for some n.

Irreducibility means that wherever we start, any state j can be reached in due time with positive probability. This excludes the possibility of being caught forever in a certain subset of S. With respect to persistent and transient states, the situation simplifies greatly for irreducible Markov chains:

Theorem 14 Suppose that Xt (t = 0, 1, ...) is an irreducible Markov chain. Then one of the following possibilities is true:
i) All states are transient.

ii) All states are persistent.

Instead of speaking of transient and persistent states, one therefore also speaks of a transient or persistent Markov chain, respectively.

Another important property is stationarity of Markov chains. The word "stationarity" implies that the distribution remains stable in some sense. The first definition concerns initial distributions:

Definition 44 A distribution π is called stationary if

Σ_{i=1}^{k} πi pij = πj,   (6.9)

or in matrix form,

π^t M = π^t.   (6.10)

This means that if we start with distribution π, then the distribution of all subsequent Xt is again π.

The next question is to what extent the initial distribution influences the dynamic behavior (probability distribution) into the infinite future. A possible complication is that the process may be periodic in the sense that one may return to certain states periodically:

Definition 45 A state j is said to have period τ if pjj^(n) > 0 implies that n is a multiple of τ.

For an irreducible Markov chain, all states have the same period. Hence, the following definition is meaningful:

Definition 46 An irreducible Markov chain is called periodic if τ > 1, and it is called aperiodic if τ = 1.

It can be shown that for an aperiodic Markov chain there is at most one stationary distribution and, if there is one, then the initial distribution ultimately does not play any role:

Theorem 15 If Xt (t = 0, 1, ...) is an aperiodic irreducible Markov chain for which a stationary distribution π exists, then the following holds:

(i) the Markov chain is persistent;

(ii) lim_{n→∞} pij^(n) = πj > 0 for all i, j;

(iii) the stationary distribution π is unique.

In the other case of an aperiodic irreducible Markov chain for which no stationary distribution exists, we have

lim_{n→∞} pij^(n) = 0

for all i, j. Note that this is the case even if the Markov chain is persistent. One can then classify irreducible aperiodic Markov chains into three classes:
Theorem 16 If Xt (t = 0, 1, 2, ...) is an irreducible aperiodic Markov chain, then one of the following three possibilities is true:

(i) Xt is transient,

lim_{n→∞} pij^(n) = 0   and   Σ_{n=1}^{∞} pij^(n) < ∞

(ii) Xt is persistent, but no stationary distribution π exists,

lim_{n→∞} pij^(n) = 0,   Σ_{n=1}^{∞} pij^(n) = ∞   and   µj = Σ_{n=1}^{∞} n fjj^(n) = ∞

(iii) Xt is persistent, and a unique stationary distribution π exists,

lim_{n→∞} pij^(n) = πj > 0

for all i, j, and the average number of steps until the process returns to state j is given by

µj = πj^{−1}

For Markov chains with a finite state space, the results simplify further:

Theorem 17 If Xt is an irreducible aperiodic Markov chain with a finite state space, then the following holds:

(i) Xt is persistent;

(ii) a unique stationary distribution π = (π1, ..., πk)^t exists and is the solution of

π^t (I − M) = 0,   (0 ≤ πj ≤ 1, Σ πj = 1)   (6.11)

where I is the m × m identity matrix.

Note that Σ_j Mij = Σ_j pij = 1, so that Σ_j (I − M)ij = 0, i.e. the matrix (I − M) is singular. (If this were not the case, then the only solution to the system of linear equations would be 0, so that no stationary distribution would exist.) Thus, there are infinitely many solutions of (6.11). However, there is only one solution that satisfies the conditions 0 ≤ πj ≤ 1 and Σ πj = 1.
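Theorems 15 and 17 suggest a simple way to compute the stationary distribution of a finite chain numerically. The sketch below (plain Python; not from the book) iterates π^t ← π^t M from the uniform distribution: for an irreducible aperiodic chain the iterates converge to the unique stationary π, the same vector that solves π^t (I − M) = 0.

```python
def stationary(M, tol=1e-12):
    # Power iteration: start from the uniform distribution and apply
    # pi <- pi M until the change is below tol.  By Theorem 15,
    # p_ij^(n) -> pi_j for an irreducible aperiodic chain with
    # finitely many states, so the iterates converge.
    m = len(M)
    pi = [1.0 / m] * m
    while True:
        new = [sum(pi[i] * M[i][j] for i in range(m)) for j in range(m)]
        if max(abs(a - b) for a, b in zip(pi, new)) < tol:
            return new
        pi = new

# Two-state example: 0.1*pi_1 = 0.5*pi_2 forces pi = (5/6, 1/6).
M = [[0.9, 0.1], [0.5, 0.5]]
pi = stationary(M)
```

For this M the second eigenvalue is 0.4, so the iteration converges geometrically; for large or periodic chains one would solve the linear system (6.11) directly instead.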
6.2.3 Hidden Markov models

A hidden Markov model is, as the name says, a model where an underlying Markov process is not directly observable. Instead, observations Xt (t = 1, 2, ...) are generated by a series of probability distributions which in turn are controlled by an unobserved Markov chain. More specifically, the following definitions are used: let θt (t = 1, 2, ...) be a Markov chain with initial distribution π, so that P(θ1 = j) = πj, and transition probabilities

pij = P(θt+1 = j | θt = i).   (6.12)

The state of the Markov chain determines the probability distribution of the observable random variables Xt by

ψij = P(Xt = j | θt = i)   (6.13)

In particular, if the state spaces of θt and Xt are finite with dimensions m1 and m2 respectively, then the probability distribution of the process Xt is determined by the m1-dimensional vector π, the m1 × m1-dimensional transition matrix M = (pij)i,j=1,...,m1, and the m1 × m2-dimensional matrix Ψ = (ψij)i=1,...,m1; j=1,...,m2 that links θt with Xt. Analogous models can be defined for the case where the Xt (t ∈ N) are continuous variables.

The flexibility of hidden Markov models is due to the fact that Xt can be an arbitrary quantity with an arbitrary distribution that can change in time. For instance, Xt itself can be equal to a time series Xt = (Z1, ..., Zn) = (Z1(t), ..., Zn(t)) whose distribution depends on θt. Typically, such models are used in automatic speech processing (see e.g. Levinson et al. 1983, Juang and Rabiner 1991). The variable θt may represent the unobservable state of the vocal tract at time t, which in turn "produces" an observable acoustic signal Z1(t), ..., Zn(t) generated by a distribution characterized by θt. Given observations Xt (t = 1, 2, ..., N), the aim is to guess which configurations θt (t = 1, 2, ..., N) the vocal tract was in. More specifically, it is sometimes assumed that there is only a finite number of possible acoustic signals.
We may therefore denote by Xt the label of the observed signal and estimate θt by maximizing the a posteriori probability P(θt = j | Xt = i). Using the Bayes rule, this leads to

θ̂t = argmax_{j=1,...,m1} P(θt = j | Xt = i) = argmax_{j=1,...,m1} [ P(Xt = i | θt = j) P(θt = j) / Σ_{l=1}^{m1} P(Xt = i | θt = l) P(θt = l) ]   (6.14)

6.2.4 Parameter estimation for Markov and hidden Markov models

In principle, parameter estimation for Markov chains and hidden Markov models is simple, since the likelihood function can be written down explicitly in terms of simple conditional probabilities. The main difficulties that can occur are:

1. Large number of unknown parameters: the unknown parameters for a Markov chain are the initial distribution π and the transition matrix M = (pij)i,j=1,...,m. If m is finite, then the number of unknown parameters is (m−1) + m(m−1). If the initial distribution does not matter, then this reduces to m(m−1). Both numbers can be quite large compared to the available sample size, since they increase quadratically in m. The situation is even worse if the state space is infinite, since then the number of unknown parameters is infinite. A solution to this problem is to impose restrictions on the parameters or to define parsimonious models where M is characterized by a low-dimensional parameter vector.

2. Implicit solution: the maximum likelihood estimate of the unknown parameters is the solution of a system of nonlinear equations, and therefore must be found by a suitable numerical algorithm. For real-time applications with massive data input, as they typically occur in speech processing or processing of musical sound signals, fast algorithms are required.

3. Asymptotic distribution: the asymptotic distribution of maximum likelihood estimates is not always easy to derive.

6.3 Specific applications in music

6.3.1 Stationary distribution of intervals modulo 12

We consider intervals between successive notes modulo octave for the upper envelopes of the following compositions:

• Anonymus: a) Saltarello (13th century); b) Saltarello (14th century); c) Alle Psallite (13th century); d) Troto (13th century)
• A. de la Halle (1235?-1287): Or est Bayard en la pature, hure!
• J. de Ockeghem (1425-1495): Canon epidiatesseron
• J. Arcadelt (1505-1568): a) Ave Maria, b) La Ingratitud, c) Io Dico Fra Noi
• W. Byrd (1543-1623): a) Ave Verum Corpus, b) Alman, c) The Queen's Alman
• J. Dowland (1562-1626): a) Come Again, b) The Frog Galliard, c) The King of Denmark's Galliard
• H.L.
Hassler (1564-1612): a) Galliard, b) Kyrie from "Missa secunda", c) Sanctus et Benedictus from "Missa secunda"
• G.P. Palestrina (1525-1594): a) Jesu Rex admirabilis, b) O bone Jesu, c) Pueri Hebraeorum
• J.P. Rameau (1683-1764): a) La Popliniere, b) Tambourin, c) La Triomphante (Figure 6.1)
• J.F. Couperin (1668-1733): a) Barricades mysterieuses, b) La Linotte Effarouchée, c) Les Moissonneurs, d) Les Papillons
• J.S. Bach (1685-1750): Das Wohltemperierte Klavier; Cello-Suites I to VI (1st Movements)
• D. Scarlatti (1660-1725): a) Sonata K 222, b) Sonata K 345, c) Sonata K 381
• J. Haydn (1732-1809): Sonata op. 34, No. 2
• W.A. Mozart (1756-1791): a) Sonata KV 332, 2nd Mov., b) Sonata KV 545, 2nd Mov., c) Sonata KV 333, 2nd Mov.
• F. Chopin (1810-1849): a) Nocturne op. 9, No. 2, b) Nocturne op. 32, No. 1, c) Etude op. 10, No. 6 (Figure 6.2)
• R. Schumann (1810-1856): Kinderszenen op. 15
• J. Brahms (1833-1897): a) Hungarian dances No. 1, 2, 3, 6, 7, b) Intermezzo op. 117, No. 1 (Figures 6.12, 9.7, 11.5)
• C. Debussy (1862-1918): a) Clair de lune, b) Arabesque No. 1, c) Reflections dans l'eau
• A. Scriabin (1872-1915): Preludes a) op. 2, No. 2, b) op. 11, No. 14, c) op. 13, No. 2
• S. Rachmaninoff (1873-1943): a) Prelude op. 3, No. 2, b) Preludes op. 23, No. 3, 5, 9
• B. Bartók (1881-1945): a) Bagatelle op. 11, No. 2, b) Bagatelle op. 11, No. 3, c) Sonata for piano
• O. Messiaen (1908-1992): Vingt regards sur l'enfant de Jésus, No. 3
• S. Prokoffieff (1891-1953): Visions fugitives a) No. 11, b) No. 12, c) No. 13
• A. Schönberg (1874-1951): Piano piece op. 19, No. 2
• T. Takemitsu (1930-1996): Rain tree sketch No. 1
• A. Webern (1883-1945): Orchesterstück op. 6, No. 6

Since we are not interested in note repetitions, zero is excluded, i.e. the state space of Xt consists of the numbers 1, ..., 11. For the sake of simplicity, Xt is assumed to be a Markov chain. This is, of course, not really true; nevertheless, an "approximation" by a Markov chain may reveal certain characteristics of the composition.
The elements of the transition matrix M = (pij)i,j=1,...,11 are estimated by relative frequencies

p̂ij = Σ_{t=2}^{n} 1{xt−1 = i, xt = j} / Σ_{t=1}^{n−1} 1{xt = i},   (6.15)
Figure 6.1 Jean-Philippe Rameau (1683-1764). (Engraving by A. St. Aubin after J. J. Cafferi, Paris after 1764; courtesy of Zentralbibliothek Zürich.)

and the stationary distribution π of the Markov chain with transition matrix M̂ = (p̂ij)i,j=1,...,11 is estimated by solving the system of linear equations

π^t (I − M̂) = 0

as described above. Figures 6.3a through l show the resulting values of π̂j (joined by lines). For each composition, the vector π̂j is plotted against j. For visual clarity, points at neighboring states j and j−1 are connected. The figures illustrate how the characteristic shape of π changed in the course of the last 500 years. The most dramatic change occurred in the 20th century with a "flattening" of the peaks. Starting with Scriabin, a pioneer of atonal music though still rooted in the romantic style of the late 19th century, this is most extreme for the compositions by Schönberg, Webern, Takemitsu, and Messiaen. On the other hand, Prokoffieff's "Visions fugitives" exhibit clear peaks, but at varying locations. The estimated stationary distributions can also be used to perform a cluster analysis. Figure 6.4 shows the result of the single linkage algorithm with the Manhattan norm (see Chapter 10). To make names legible, only a subsample of the data was used. An almost perfect separation between Bach and composers from the classical and romantic period can be seen.
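The estimation step in (6.15) is a simple counting exercise. The sketch below (plain Python; the toy interval sequence is invented for illustration, not taken from one of the compositions above) estimates a transition matrix by relative frequencies.

```python
def estimate_transitions(x, states):
    # Relative-frequency estimate of p_ij as in (6.15): count transitions
    # i -> j and divide by the number of visits to i among x_1, ..., x_{n-1}.
    idx = {s: k for k, s in enumerate(states)}
    m = len(states)
    counts = [[0] * m for _ in range(m)]
    for a, b in zip(x[:-1], x[1:]):
        counts[idx[a]][idx[b]] += 1
    P = []
    for row in counts:
        total = sum(row)
        P.append([c / total if total else 0.0 for c in row])
    return P

# Toy sequence of interval classes (hypothetical data)
x = [2, 2, 1, 2, 1, 1, 2]
P = estimate_transitions(x, states=[1, 2])
```

Each row of the estimated matrix sums to one; feeding P into a stationary-distribution solver then reproduces the analysis of this section.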
Figure 6.2 Frédéric Chopin (1810-1849). (Courtesy of Zentralbibliothek Zürich.)

6.3.2 Stationary distribution of interval torus values

An analogous analysis can be carried out replacing the interval numbers by the corresponding values of the torus distance (see Chapter 1). Excluding zeroes, the state space consists of the three numbers 1, 2, 3 only. For the same compositions as above, the stationary probabilities π̂j (j = 1, 2, 3) are calculated. A cluster analysis as above, but with the new probabilities, yields practically the same result as before (Figure 6.5). Since the state space contains three elements only, it is now even easier to find the patterns that determine clustering. In particular, log-odds-ratios log(π̂i/π̂j) (i ≠ j) appear to be characteristic. Boxplots are shown in Figures 6.6a, 6.7a, and 6.8a for categories of composers defined by date of birth as follows: a) before 1600 ("early music"); b) [1600, 1720) ("baroque"); c) [1720, 1800) ("classic"); d) [1800, 1880) ("romantic and early 20th century") (Figure 6.12); e) 1880 and later ("20th century"). This is a simple, though somewhat arbitrary, division with some inaccuracies; for instance, Schönberg is classified in category 4 instead of 5. The log-odds-ratio between π̂1 and π̂2 is highest in the "classical" period and generally tends to decrease afterwards. Moreover, there is a distinct jump from the baroque to the classical period. This jump is also visible for log(π̂1/π̂3). Here, however, the attained level is kept in the subsequent time periods. For log(π̂2/π̂3) a gradual increase
Figure 6.3 Stationary distributions π̂j (j = 1, ..., 11) of Markov chains with state space Z12 \ {0}, estimated for the transition between successive intervals.
Figure 6.4 Cluster analysis based on stationary Markov chain distributions for compositions by Bach, Mozart, Haydn, Chopin, Schumann, Brahms, and Rachmaninoff.

can be observed. The differences are even more visible when comparing individual composers. This is illustrated in Figures 6.9a and b, where Bach's and Schumann's log(π̂1/π̂3) and log(π̂2/π̂3) are compared, and in Figures 6.10a through f, where the median and lower and upper quartiles of π̂j are plotted against j. Finally, Figure 6.11 shows the plots of log(π̂1/π̂3) and log(π̂2/π̂3) against the date of birth.

6.3.3 Classification by hidden Markov models

Chai and Vercoe (2001) study classification of folk songs using hidden Markov models. They consider, essentially, four ways of representing a melody, namely by a) a vector of pitches modulo 12; b) a vector of pitches modulo 12 together with duration (duration being represented by repeating the same pitch); c) a sequence of intervals (differenced series of pitches); and d) a sequence of intervals, with intervals being classified into only five interval classes {0}, {−1, −2}, {1, 2}, {x ≤ −3}, and {x ≥ 3}. The observed data consist of 187 Irish, 200 German, and 104 Austrian homophonic melodies from folk songs. For each melody representation, the authors estimate the parameters of several hidden Markov models which differ mainly with respect to the size of the hidden state space. The models are fitted for each
Figure 6.5 Cluster analysis based on stationary Markov chain distributions of torus distances for compositions by Bach, Mozart, Haydn, Chopin, Schumann, Brahms, and Rachmaninoff.

country separately. Only 70% of the data are used for estimation. The remaining 30% are used for validation of a classification rule defined as follows: a melody is assigned to country j if the corresponding likelihood (calculated using the country's hidden Markov model) is the largest. Not surprisingly, the authors conclude that the most reliable distinction can be made between Irish and non-Irish songs.
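The classification rule just described can be sketched in a few lines for the simpler case of ordinary (non-hidden) Markov chain models; the principle — assign a sequence to the model under which its (log-)likelihood is largest — is the same. The two "country" models and the melody below are entirely made up for illustration.

```python
import math

def log_likelihood(x, pi, P, idx):
    # log P(x_1) + sum over t of log p_{x_{t-1}, x_t} under one model
    ll = math.log(pi[idx[x[0]]])
    for a, b in zip(x[:-1], x[1:]):
        ll += math.log(P[idx[a]][idx[b]])
    return ll

def classify(x, models, idx):
    # Assign the melody to the model with the largest likelihood.
    return max(models, key=lambda name: log_likelihood(x, *models[name], idx))

# Two hypothetical "countries" with different transition behavior
idx = {1: 0, 2: 1}
models = {
    "A": ([0.5, 0.5], [[0.8, 0.2], [0.2, 0.8]]),  # favors repeated interval classes
    "B": ([0.5, 0.5], [[0.2, 0.8], [0.8, 0.2]]),  # favors alternation
}
melody = [1, 1, 1, 2, 2, 2, 2]
label = classify(melody, models, idx)
```

The melody mostly repeats its interval class, so model "A" wins. For hidden Markov models the likelihood is computed with the forward algorithm instead, but the decision rule is unchanged.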
Figure 6.6 Comparison of log odds ratios log(π̂1/π̂2) of stationary Markov chain distributions of torus distances: a) for five different periods (birth before 1600, 1600-1720, 1720-1800, 1800-1880, from 1880); b) "classic" (birth 1720-1800) vs. "not classic".

Figure 6.7 Comparison of log odds ratios log(π̂1/π̂3) of stationary Markov chain distributions of torus distances: a) for five different periods; b) "up to baroque" (birth before 1720) vs. "after baroque" (birth 1720 and later).
Figure 6.8 Comparison of log odds ratios log(π̂2/π̂3) of stationary Markov chain distributions of torus distances: a) for five different periods; b) "up to baroque" vs. "after baroque".

Figure 6.9 Comparison of log odds ratios log(π̂1/π̂3) and log(π̂2/π̂3) of stationary Markov chain distributions of torus distances (panels a and b: Bach and Schumann).
Figure 6.10 Comparison of stationary Markov chain distributions of torus distances.

Figure 6.11 Log odds ratios log(π̂1/π̂3) and log(π̂2/π̂3) plotted against date of birth of composer (panels a and b; x-axis: year, 1200-1800).
Figure 6.12 Johannes Brahms (1833-1897). (Courtesy of Zentralbibliothek Zürich.)

6.3.4 Reconstructing scores from acoustic signals

One of the ultimate dreams of musical signal recognition is to reconstruct a musical score from the acoustic signal of a musical performance. This is a highly complex task that has not yet been solved in a satisfactory manner. Consider, for instance, the problem of polyphonic pitch tracking, defined as follows: given a musical audio signal, identify the pitches of the music. This problem is not easy for at least two reasons: a) different instruments have different harmonics and a different evolution of the spectrum; and b) in polyphonic music, one must be able to distinguish different voices (pitches) that are played simultaneously by the same or different instruments. An approach based on a rather complex hierarchical model is proposed, for instance, in Walmsley, Godsill, and Rayner (1999). Suppose that a maximal number N of notes can be played simultaneously, and denote by ν = (ν1, ..., νN)^t the vector of 0-1 variables indicating whether note j (j = 1, ..., N) is played or not. Each note j is associated with a harmonic representation (see Chapter 4) with a fundamental frequency and amplitudes b1(j), ..., bk(j) (k = number of harmonics). Time is divided into
disjoint time intervals, so-called frames. In each frame i of length mi, the sound signal is assumed to be equal to yi(t) = µi(t) + ei(t), where µi(t) (t = 1, ..., mi) is the sum of the harmonic representations of the notes, and ei is random noise. Walmsley et al. assume ei to be iid (independent identically distributed) normal with zero mean and variance σi^2. Taking everything together, the probability distribution of the acoustic signal is fully specified by a finite dimensional parameter vector θ. In principle, given an observed signal, θ could be estimated by maximizing the likelihood (see Chapter 4). The difficulty is, however, that the dimension of θ is very high compared to the number of observations. The solution proposed by Walmsley et al. is to circumvent this problem by a Bayesian approach, in that θ is assumed to be generated by an a priori distribution. Given the data, consisting of a sound signal, and an a priori distribution p(θ), the a posteriori distribution p(θ|yi) of θ is given by

p(θ|yi) = f(yi|θ) p(θ) / ∫ f(yi|θ̃) p(θ̃) dθ̃   (6.16)

where

f(yi|θ) = (2πσi^2)^{−mi/2} exp(− Σ_{t=1}^{mi} ei^2(t) / (2σi^2))

and ei(t) = ei(t; θ). How many notes and which pitches are played can then be decided, for instance, by searching for the mode of the distribution. Even if this model is assumed to be realistic, a major practical difficulty remains: the dimension of θ can be several hundred. The computation of the a posteriori distribution is therefore very difficult, since calculation of ∫ f(yi|θ̃) p(θ̃) dθ̃ involves high-dimensional numerical integration. A further complication is that some of the parameters may be highly correlated. Walmsley et al. therefore propose to use Markov chain Monte Carlo methods (see e.g. Gilks et al. 1996). The essential idea is to simulate the integral by a sample mean of f(yi|θ), where θ is sampled randomly from the a priori distribution p(θ). Sampling can be done by using a Markov process whose stationary distribution is p.
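For a single frame and a discretized parameter, the Bayes computation in (6.16) reduces to a weighted sum. The sketch below (plain Python; the candidate set, prior, signal model, and noise level are invented for illustration, and a pure sine stands in for the harmonic representation) evaluates the posterior over a grid of candidate fundamental frequencies and picks its mode.

```python
import math

def posterior(y, candidates, prior, mean_fn, sigma):
    # p(theta | y) is proportional to f(y | theta) p(theta), with the Gaussian
    # frame likelihood f(y|theta) = (2 pi sigma^2)^(-m/2) exp(-sum e^2/(2 sigma^2));
    # the integral in (6.16) becomes a sum over the discrete candidate grid.
    def log_f(theta):
        e = [yt - mean_fn(theta, t) for t, yt in enumerate(y)]
        return (-0.5 * len(y) * math.log(2 * math.pi * sigma ** 2)
                - sum(v * v for v in e) / (2 * sigma ** 2))
    w = [math.exp(log_f(th)) * p for th, p in zip(candidates, prior)]
    z = sum(w)
    return [wi / z for wi in w]

# Toy frame: a noise-free sine at frequency 2 (cycles per 16 samples).
mean_fn = lambda th, t: math.sin(2 * math.pi * th * t / 16)
y = [mean_fn(2, t) for t in range(16)]
post = posterior(y, candidates=[1, 2, 3], prior=[1/3, 1/3, 1/3],
                 mean_fn=mean_fn, sigma=0.5)
mode = max(range(3), key=lambda j: post[j])
```

With hundreds of correlated parameters this grid evaluation is hopeless, which is exactly why Walmsley et al. resort to Markov chain Monte Carlo; the sketch only illustrates what (6.16) computes.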
The simulation can be simplified further by the so-called Gibbs sampler, which uses suitable one-dimensional conditional distributions (Besag 1989).
  A more modest task than polyphonic pitch tracking is automatic segmentation of monophonic music. The task is as follows: given a monophonic musical score and a sampled acoustic signal of a performance of the score, identify for each note and rest in the score the corresponding time interval in the performance. A possible approach based on hidden Markov processes and Bayesian models is proposed in Raphael (1999) (also see Raphael 2001a,b). Raphael, who is a professional oboist and a mathematical statistician, also implemented his method in a computer system, called Music Plus One, that performs the role of a musical accompanist.
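The Gibbs idea mentioned above, drawing each coordinate from its one-dimensional conditional distribution given all the others, can be illustrated on a toy bivariate normal target (a stand-in chosen for clarity, not the audio model itself):

```python
import numpy as np

# Target: bivariate standard normal with correlation rho.
# One-dimensional conditionals:
#   x | y ~ N(rho*y, 1 - rho^2),   y | x ~ N(rho*x, 1 - rho^2)
rng = np.random.default_rng(1)
rho = 0.8
n_iter, burn_in = 20000, 1000
x, y = 0.0, 0.0                     # arbitrary starting point
draws = []
for it in range(n_iter):
    x = rng.normal(rho * y, np.sqrt(1 - rho ** 2))   # sample x | y
    y = rng.normal(rho * x, np.sqrt(1 - rho ** 2))   # sample y | x
    if it >= burn_in:
        draws.append((x, y))
draws = np.array(draws)
emp_rho = np.corrcoef(draws[:, 0], draws[:, 1])[0, 1]
```

After burn-in, the chain's stationary distribution is the target, so the empirical correlation of the draws is close to rho; the same alternation over hundreds of coordinates is what makes the approach feasible for the high-dimensional θ above.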
CHAPTER 7

Circular statistics

7.1 Musical motivation

Many phenomena in music are circular. The best known examples are repeated rhythmic patterns, the circles of fourths and fifths, and scales modulo octave in the well-tempered system. In the circle of fourths, for example, one progresses by steps of a fourth and arrives, after 12 steps, at the initial starting point modulo octave. It is not immediately clear whether and how to "calculate" in such situations, and what type of statistical procedures may be used. The theory of circular statistics has been developed to analyze data on circles where angles have a meaning. Originally, this was motivated by data in biology (e.g. direction of bird flight), meteorology (e.g. direction of wind), and geology (e.g. magnetic fields). Here we give a very brief introduction, mostly to descriptive statistics. For an extended account of methods and applications of circular statistics see, for instance, Mardia (1972), Batschelet (1981), Watson (1983), Fisher (1993), and Jammalamadaka and SenGupta (2001). In music, circular methods can be applied to situations where angles measure a meaningful distance between points on the circle and arithmetic operations in the sense of circular data are well defined.

7.2 Basic principles

7.2.1 Some descriptive statistics

Circular data are observations on a circle. In other words, observations consist of directions expressed in terms of angles. The first question is which statistics describe the data in a meaningful way or, at an even more basic level, how to calculate at all when "moving" on a circle. The difficulty can be seen easily by trying to determine the "average direction". Suppose we observe two angles ϕ1 = 330° and ϕ2 = 10°. It is plausible to say that the average direction is 350°. However, the average is (330° + 10°)/2 = 170°, which is almost the opposite direction. Calculating the sample mean of angles is obviously not meaningful.
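The example can be checked numerically, anticipating the vector-based remedy developed below (np.arctan2 takes care of the quadrant bookkeeping):

```python
import numpy as np

phi = np.radians([330.0, 10.0])
naive = np.degrees(phi.mean())                  # arithmetic mean of the angles
S, C = np.sin(phi).sum(), np.cos(phi).sum()
circ = np.degrees(np.arctan2(S, C)) % 360.0     # direction of the resultant vector
print(naive, circ)                              # 170.0 (wrong side) vs 350.0
```

The arithmetic mean lands at 170°, nearly opposite the data, while the direction of the summed unit vectors is the intuitive 350°.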
The simple solution is to interpret angular observations as vectors in the plane, with end points on the unit circle, and to apply vector addition
instead of adding angles. Thus, we replace ϕi (i = 1, ..., n) by

    xi = (cos ϕi, sin ϕi)ᵗ

where ϕ is measured anti-clockwise relative to the horizontal axis. The following descriptive statistics can then be defined.

Definition 47 Let

    C = Σ_{i=1}^n cos ϕi,  S = Σ_{i=1}^n sin ϕi,  R = √(C² + S²).        (7.1)

The (vector of the) mean direction of ϕi (i = 1, ..., n) is equal to

    x̄ = (cos ϕ̄, sin ϕ̄)ᵗ = (C/R, S/R)ᵗ        (7.2)

Equivalently one may use the following

Definition 48 The (angle of the) mean direction of ϕi (i = 1, ..., n) is equal to

    ϕ̄ = arctan(S/C) + π·1{C < 0} + 2π·1{C > 0, S < 0}        (7.3)

Moreover, we have

Definition 49 The mean resultant length of ϕi (i = 1, ..., n) is equal to

    R̄ = R/n        (7.4)

Note that R is the length of the vector nx̄ obtained by adding all observed vectors. If all angles are identical, then R = n so that R̄ = 1. In all other cases, we have 0 ≤ R̄ < 1. In the other extreme case with ϕi = 2πi/n (i.e. the angles are scattered uniformly over [0, 2π], there are no clusters of directions), we have R̄ = 0. In this sense, R̄ measures the amount of concentration around the mean direction. This leads to

Definition 50 The sample circular variance of ϕi (i = 1, ..., n) is equal to

    V = 1 − R̄        (7.5)

Note, however, that R̄ is not a perfect measure of concentration, since R̄ = 0 does not necessarily imply that the data are scattered uniformly. For instance, suppose n is even, ϕ_{2i+1} = π and ϕ_{2i} = 0. Thus there are two preferred directions. Nevertheless, R̄ = 0.
  Alternative measures of center and variability, respectively, are the median and the difference between the lower and upper quartile. The median direction is a direction Mn = ϕ° determined as follows: a) find the axis (straight line through zero) such that the data are divided into two groups of equal size (if n is odd, then the axis passes through at least one point, otherwise through the midpoint between the two observations in the middle); b) take the direction φ on the chosen axis for which the more points
xi are closer to the point (cos φ, sin φ)ᵗ defined by φ. Similarly, the lower and upper quartiles, Q1 and Q2, can be defined by dividing each of the halves into two halves again. An alternative measure of variability is then given by IQR = Q2 − Q1.
  Since we are dealing with vectors in the two-dimensional plane, all quantities above can be expressed in terms of complex numbers. In particular, one can define trigonometric moments by

Definition 51 For p = 1, 2, ... let

    Cp = Σ_{i=1}^n cos pϕi,  Sp = Σ_{i=1}^n sin pϕi,  Rp = √(Cp² + Sp²)        (7.6)

    C̄p = Cp/n,  S̄p = Sp/n,  R̄p = Rp/n        (7.7)

and

    ϕ̄(p) = arctan(Sp/Cp) + π·1{Cp < 0} + 2π·1{Cp > 0, Sp < 0}        (7.8)

Then

    mp = C̄p + iS̄p = R̄p e^{iϕ̄(p)}        (7.9)

is called the pth trigonometric sample moment.

For p = 1, this definition yields

    m1 = C̄1 + iS̄1 = R̄1 e^{iϕ̄(1)}

with C1 = C, S1 = S, R1 = R, and ϕ̄(1) = ϕ̄ as before. Similarly, we have

Definition 52 Let

    Cp° = Σ_{i=1}^n cos p(ϕi − ϕ̄(1)),  Sp° = Σ_{i=1}^n sin p(ϕi − ϕ̄(1))        (7.10)

    C̄p° = Cp°/n,  S̄p° = Sp°/n        (7.11)

    ϕ°(p) = arctan(Sp°/Cp°) + π·1{Cp° < 0} + 2π·1{Cp° > 0, Sp° < 0}        (7.12)

Then

    mp° = C̄p° + iS̄p° = R̄p° e^{iϕ°(p)}        (7.13)

is called the pth centered trigonometric (sample) moment, centered relative to the mean direction ϕ̄(1).

Note, in particular, that Σ_{i=1}^n sin(ϕi − ϕ̄(1)) = 0, so that m1° = R̄1. An overview of descriptive measures of center and variability is given in Table 7.1.
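Definitions 47-52 translate directly into code; the following is a minimal sketch (function names are mine) using the complex-number form mp = R̄p e^{iϕ̄(p)}:

```python
import numpy as np

def trig_moment(phi, p=1):
    """p-th trigonometric sample moment m_p (Definition 51) as a complex number."""
    return np.exp(1j * p * np.asarray(phi, dtype=float)).mean()

def circular_summary(phi):
    """Mean direction (angle), mean resultant length, and circular variance."""
    m1 = trig_moment(phi, 1)
    mean_dir = np.angle(m1) % (2 * np.pi)   # quadrant-corrected, as in (7.3)
    R_bar = abs(m1)                          # mean resultant length (7.4)
    return mean_dir, R_bar, 1.0 - R_bar      # V = 1 - R_bar (7.5)

phi = np.array([0.1, 0.4, 6.2])              # angles in radians, clustered near 0
mean_dir, R_bar, V = circular_summary(phi)

# Centering at the mean direction makes the sine part vanish, so the first
# centered moment m1° equals R̄1 and is real.
m1_centered = trig_moment(phi - mean_dir, 1)

# Two opposite clusters: R̄ = 0 although the data are far from uniform.
_, R0, _ = circular_summary([0.0, np.pi, 0.0, np.pi])
```

The last line reproduces the caveat above: R̄ = 0 for two balanced opposite clusters, so a small R̄ alone does not prove uniformity.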
Table 7.1 Some Important Descriptive Statistics for Circular Data

  Name                    Definition                                                    Feature measured
  ----------------------  ------------------------------------------------------------  -----------------------
  Sample mean             x̄ = (C/R, S/R)ᵗ with R = √(C² + S²)                           Center (direction)
  Mean resultant length   R̄ = R/n                                                       Concentration
  Mean direction          ϕ̄ = arctan(S/C) + π1{C < 0} + 2π1{C > 0, S < 0}               Center (angle)
  Median direction        Mn = arg max g(φ), where g(φ) = Σ_{i=1}^n |π − |ϕi − φ||       Center (angle)
  Quartiles Q1, Q2        Q1 = median of {ϕi : Mn − π ≤ ϕi ≤ Mn},                        Center of "left"
                          Q2 = median of {ϕi : Mn ≤ ϕi ≤ Mn + π}                         and "right" half
  Modal direction         M̃n = arg max f̂(ϕ), where f̂(ϕ) = estimate of density f        Center (angle)
  Principal direction     a = first eigenvector of S = Σ_{i=1}^n xi xiᵗ                  Center (direction,
                                                                                         unit vector)
  Concentration           λ̂1 = first eigenvalue of S                                    Variability
  Circular variance       Vn = 1 − R̄                                                    Variability
  Circular stand. dev.    sn = √(−2 log(1 − Vn))                                         Variability
  Circular dispersion     dn = (1 − R̄2)/(2R̄²)                                          Variability
  Mean deviation          Dn = π − (1/n) Σ_{i=1}^n |π − |ϕi − Mn||                       Variability
  Interquartile range     IQR = Q2 − Q1                                                  Variability

7.2.2 Correlation and autocorrelation

A model for perfect "linear" association between two circular random variables ϕ, ψ is

    ϕ = ±ψ + c (mod 2π)        (7.14)

where c ∈ [0, 2π) is a fixed constant. A sample statistic that measures how close we are to this perfect association is

    r_{ϕ,ψ} = Σ_{i,j=1; i≠j}^n sin(ϕi − ϕj) sin(ψi − ψj) / √( Σ_{i,j=1; i≠j}^n sin²(ϕi − ϕj) · Σ_{i,j=1; i≠j}^n sin²(ψi − ψj) )        (7.15)

or

    r_{ϕ,ψ} = det(n⁻¹ Σ_{i=1}^n xi yiᵗ) / √( det(n⁻¹ Σ_{i=1}^n xi xiᵗ) · det(n⁻¹ Σ_{i=1}^n yi yiᵗ) )        (7.16)
where xi = (cos ϕi, sin ϕi)ᵗ and yi = (cos ψi, sin ψi)ᵗ. For a time series ϕt (t = 1, 2, ...) of circular data, this definition can be carried over to autocorrelations

    r(k) = Σ_{i,j=1; i≠j}^n sin(ϕi − ϕj) sin(ϕ_{i+k} − ϕ_{j+k}) / Σ_{i,j=1; i≠j}^n sin²(ϕi − ϕj)        (7.17)

or

    rϕ(k) = det(n⁻¹ Σ_{i=1}^{n−k} xi x_{i+k}ᵗ) / det(n⁻¹ Σ_{i=1}^{n−k} xi xiᵗ)        (7.18)

7.2.3 Probability distributions

A probability distribution for circular data is a distribution F on the interval [0, 2π). The sample statistics defined in Section 7.2.1 are estimates of the corresponding population counterparts in Table 7.2. The most frequently used distributions are the uniform, cardioid, wrapped, von Mises, and mixture distributions.

Uniform distribution U([0, 2π)):

    F(u) = P(0 ≤ ϕ ≤ u) = (u/(2π)) 1{0 ≤ u < 2π},

    f(ϕ) = F'(ϕ) = (1/(2π)) 1{0 ≤ ϕ < 2π}.

In this case, µp = ρp = 0, the mean direction µϕ is not defined, and the circular standard deviation σ and dispersion δ are infinite. This expresses the fact that there is no preference for any direction and variability is therefore maximal.

Cardioid (or cosine) distribution C(µ, ρ):

    F(u) = [ (ρ/π) sin(u − µ) + u/(2π) ] 1{0 ≤ u < 2π}

and

    1