chapter two

In the last half-century, Music Technology has established itself as a highly respected line of research and study. Much work has already been done across this multi-disciplinary area, ranging from electronically-inclined studies into digital signal processing (DSP), through physical studies into acoustics, to more mathematical and melodic pursuits, such as algorithmic composition.

   It is important to ascertain the role computers are taking in today's music and, technologically, from what springboard MIVI's realisation must begin. This will entail a review of previous and current research and technologies in the field of computer music, with special emphasis on human-computer interaction (HCI), musically-driven graphics and computer music performance systems (CMPSs).

   The fundamental basis for MIVI will be an interface for the interaction of man, machine and music, and therefore, in addition to an exploration of the technological, it will be worth quickly delving into the psychological, with reference to applications of technology in education.

   Finally, following reviews of all the relevant fields and technologies, we will take a closer look at a project similar to MIVI, called DIVA [26], and see whether the findings of the Finnish team who conducted it are of use in our venture.

  2.1 MIDI



Since its conception in 1982, the MIDI protocol has played an important role in many music-related projects, sometimes even to the point of inspiring or prompting them. Indeed, the connection between it and this project goes beyond the spelling. However, while MIVI may be founded on a MIDI base, its goal will require us to develop upon the specification and protocol. It will therefore be wise, in addition to reviewing the MIDI specification and General MIDI (GM) protocol, to consider some aspects that lie beyond their scopes, and previous attempts to extend them.


2.1.1 introducing MIDI




This sub-section aims to furnish the reader with a basic understanding of MIDI. Contained herein is an explanation of all the concepts and terminology of the protocol prerequisite to the reading of this report. However, for those desiring a deeper or less specialised overview of the subject, the author recommends the books listed as items [30], [53] and [54] of the bibliography.

   MIDI stands for Musical Instrument Digital Interface and is the technical specification [31] for a language of music - an encoding of music at the semantic level. It contains no audio information or guidelines for auralisation, such as sounds and waveforms, but instead, like a score to a pianist, specifies properties like pitch and duration for each note, and characteristics like tempo for a whole piece.

practical uses


   In the common instance, it allows music to be recorded by a computer, or dedicated hardware, from one instrument, then output for performance by another. The process can be synchronous, by attaching the MIDI-out (output port) of the first instrument to the MIDI-in (input port) of the other, or asynchronous, by having the computer record the signals from the first device to a file, which it can store and play back through the second at any time. The reader is referred to figure 1.1, in the previous chapter.



   A performance can be encoded in this way and stored in an SMF (Standard MIDI File), which can, in turn, be transmitted to a MIDI device. A piano, if MIDI-enabled, can then pose as its own pianist and play itself. Indeed, most modern electronic musical instruments now support the MIDI standard – for example, the Yamaha CS1x [55] keyboard or Korg Trinity-Rack tone generator.



   MIDI performances can be augmented over time. You could record the right-hand part of a piece of piano music, then rewind and record the left. The computer can then carry out simultaneous playback of the two phrases. The total number of notes (pitches) a device can play simultaneously is denoted by its polyphony. If this is only one, the device is monophonic.

   The number of different sounds (timbres) a device can play at any one time is called its timbrality, where one timbre might be a violin and another a piano, etc. If it is more than one, the device is multi-timbral.

   Note that it is possible, and common, to have a multi-timbral, polyphonic MIDI instrument, composed of both monophonic and polyphonic voices – for example, a 16-part multi-timbral sound module with 128 voices might include monophonic flute and violin voices, as well as polyphonic piano and guitar voices.



   An SMF can contain the encoding of multiple instruments, even a whole orchestra, regardless of what MIDI input device you use; you can play in a violin – or even drum – solo using your MIDI piano. It is the job of the computer, through a program called a sequencer, such as Steinberg Cubase VST or Cakewalk Pro Audio, to keep time and co-ordinate the performance of every instrument – fulfilling the typical role of the conductor. The sequencer takes the performance (sequence), encoded in the SMF, as input and, when commanded, outputs a stream of notes and performance instructions, as MIDI messages, to a MIDI device for immediate execution (often auralisation). An apt analogy for a sequencer is a cassette recorder – allowing both the playback and editing of music.



   MIDI messages are very small packets of only a few bytes, which carry information on the note or performance instruction – the MIDI event – that is to be played or executed. Each MIDI event is a packet of varying length, comprising a single status byte and zero or more data bytes. An example is the 3-byte Note On message, where the status byte (0x90 in hexadecimal) tells the device to start a note for the active instrument at a certain pitch and velocity (volume), as denoted by the accompanying two data bytes. The Note Off message can then be used to terminate it; alternatively, a Note On message with null velocity will also stop the note.
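The status/data-byte layout described above can be sketched in a few lines of Python (the helper functions are illustrative, not part of any MIDI library; channel numbers are zero-based, as in the raw protocol):

```python
# A minimal sketch of MIDI channel-voice message construction: one status
# byte, whose high nibble names the event and whose low nibble carries the
# channel, followed by two 7-bit data bytes.

NOTE_ON = 0x90   # status nibble for Note On
NOTE_OFF = 0x80  # status nibble for Note Off

def note_on(channel, pitch, velocity):
    """Build the 3-byte Note On message for the given channel (0-15)."""
    return bytes([NOTE_ON | channel, pitch, velocity])

def note_off(channel, pitch):
    """Terminate a note; velocity 0x40 is a conventional release value."""
    return bytes([NOTE_OFF | channel, pitch, 0x40])

# Middle C (pitch 60) at moderate velocity on channel 0:
msg = note_on(0, 60, 100)    # -> b'\x90\x3c\x64'
# A Note On with null velocity is equivalent to a Note Off:
silent = note_on(0, 60, 0)
```

Note that the pitch and velocity bytes must each stay in the range 0–127, since the top bit is reserved to distinguish status bytes from data bytes.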

General MIDI


   Most MIDI devices are capable of posing and performing as several instruments – ordinarily, up to 16 different instruments can be played at once, each receiving messages through its own channel. In 1991, a specification called General MIDI (GM) [32] defined 128 different MIDI voices (the full list of these is given in Appendix A), which can be assigned to any of the channels. In addition, since percussion sounds do not vary in pitch, it is wasteful to have whole voices dedicated to a single bass drum or snare drum, etc. The specification therefore describes a standard percussion kit, where note C in the lowest octave is a bass drum, D a snare and so on. For example, the first channel might be set to an Acoustic Grand Piano (voice #1) and the second to a Nylon-string Guitar (voice #25), but another (often the tenth, for percussion) to a Standard Drum Kit.
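The channel set-up just described is performed with Program Change messages. A sketch, following the same illustrative conventions as before (note the off-by-one detail: GM voice #1 corresponds to program byte 0):

```python
# Sketch of GM channel set-up: a 2-byte Program Change (status 0xC0 | channel)
# assigns one of the 128 GM voices to a channel.

PROGRAM_CHANGE = 0xC0

def select_voice(channel, gm_voice_number):
    """Build the Program Change for a 1-based GM voice number."""
    return bytes([PROGRAM_CHANGE | channel, gm_voice_number - 1])

setup = (
    select_voice(0, 1)     # channel 1: Acoustic Grand Piano (voice #1)
    + select_voice(1, 25)  # channel 2: Nylon-string Guitar (voice #25)
)
# Channel 10 (index 9) is conventionally reserved for the percussion kit,
# so no Program Change is normally needed there.
```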

   An SMF is broken up into tracks, each having its own channel. The subtle difference between a track and a channel is that you can have two tracks with the same channel. Going back to our piano example, the left-hand could be on Track 1 and the right on Track 2, both being sent to voice #1 (piano) on Channel 1.

   This structure is illustrated in figure 2.1. The hierarchy has, as its root, the sequence, encoded in the SMF, which is broken up into tracks, each with its own channel and information about MIDI devices, etc. Each track contains all the information about the events, from the opening to the closing note of the piece, for its respective part.

system messages


   A sequencer is normally connected to more than one input or output device. In music, it is imperative that these devices cooperate and coordinate their activities with each other, like the sections of an orchestra. Three different types of messages exist to help the sequencer in this pursuit – System Common, System Real-Time and System Exclusive (Sysex).

fig 2.1 -
MIDI hierarchy






   The first two concern the timing of the piece. Whereas System Common messages give the absolute position of playback in a piece, System Real-Time messages issue the start and stop commands to control it, as well as transmitting MIDI clock (tick) messages, which devices can use to synchronise with each other.

   The MIDI specification’s endurance, however, can largely be attributed to the third. Sysex messages allow for unrestricted byte-stream communication between MIDI devices, through a MIDI connection. Originally designed for the ‘bulk dumping’ of settings, or even waveform audio, between devices, the mechanism has since been brought to bear in the real-time environment, where manufacturers use it to implement controls and functions that are necessary to fully exploit new MIDI instruments, but are not natively supported by the base MIDI command set. It is, therefore, never necessary to replace the specification – only to extend it.
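The framing of a sysex message is simple: the byte stream is delimited by 0xF0 (start) and 0xF7 (end), with a manufacturer ID identifying whose format the payload follows. A sketch (the dump-request payload here is invented for illustration):

```python
# Sketch of System Exclusive framing. Payload bytes must keep their top bit
# clear (values 0-127), since bytes >= 0x80 are reserved for status.

SYSEX_START, SYSEX_END = 0xF0, 0xF7

def sysex(manufacturer_id, payload):
    """Wrap a 7-bit payload in sysex start/end delimiters."""
    if any(b > 0x7F for b in payload):
        raise ValueError("sysex data bytes must be 7-bit (0-127)")
    return bytes([SYSEX_START, manufacturer_id]) + bytes(payload) + bytes([SYSEX_END])

# A hypothetical device-specific message; 0x43 is Yamaha's manufacturer ID.
msg = sysex(0x43, [0x10, 0x4C, 0x00])
```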

   As the name suggests, sysex formats are defined exclusively at the system level and, thus, the sysex messages for one device are not necessarily compatible with another. However, the emergence and monopolisation of commercial standards has brought some of these extensions into wider, sometimes universal, use. In the next section, we will briefly look at two such standards, with special attention to their extensions to the GM specification.


2.1.2 MIDI and GM extensions




Two of the manufacturing giants of the music industry, Roland and Yamaha, upon the widespread adoption of the MIDI protocol, recognised both the potential and the inadequacy of the GM specification, and seized the opportunity to release their own extensions – General Sound (GS) [36] and Extended (or Expressive) General MIDI (XG) [56], respectively.

   The extensions address exactly the same deficiencies in their predecessor, and do so using almost exactly the same principles and methods. However, due to the competitive nature of the market, the implementations differ and, thus, most XG devices will not respond to GS commands and vice-versa. Therefore, for our purposes, reviewing either will yield as much insight as the other and, under the widely-held conviction that Yamaha’s offering is superior, we choose to cover the XG format. Before we review each of the improvements borne by the standard, though, it is important to establish the failings of General MIDI that both companies set out to address.

deficiencies in the
GM specification


   Music, as an expressive art, is an imperfect science – it involves, and assigns value to, nuances, quirks and irregularities – and demands the ability to step outside the norm. Though now a mature technology, MIDI was introduced at a time of relative simplicity in the computer – it had to be simple and efficient, too. MIDI is thus highly abstract and technical and, as a medium for expression, crude and inflexible.

   A symphony orchestra conductor will be the first to notice that confining the number of different instruments to 128, as General MIDI does, is extremely crude. When you consider that Voice #41 is not only a violin, but also the MIDI 'ambassador' to all violins in the world, the deficiency is magnified. Properties such as the violin's size, maker and origin - all of which can have a profound effect on the timbre (character) of a note - are instantiated to one generic set of parameters. Imagine the reproduction of a string quartet, in which the two violinists on occasion play the same tune. In the concert hall, though we cannot distinguish which is which, we are still aware of two violins. On the average MIDI instrument, the waveforms are identical and, once superposed, would sound like just one – albeit at twice the volume, or partially phased.

ensembles and
orchestral sections


   The proposed solution to this problem, in part, only exacerbates the inflexibility of General MIDI. We notice that voices #51 and #52 are not single instruments, but string sections. Seemingly, this is to compensate for the loss in polyphony that would derive from emulating each violin individually. It is also provided as a quick and dirty solution to the problem of the superposed violins that such an endeavour might present.

   Interestingly, however, strings and brass are the only sections to benefit from any adaptation that would permit their usage in an ensemble context – though this is more a comment on the quality of the MIDI sounds available when the protocol was introduced. Although considerations of sound quality might deter classical musicians, who will afford themselves a real orchestra for any number of the reasons listed in this section, contemporary artists, particularly of the 1980s, are more tolerant of, and even praise, the synthetic sound of MIDI instruments.

   Nevertheless, although it would be fairly painless to design and implement MIVI as a multi-instrument application, there is negligible practical advantage in displaying more than one instrument (i.e. ensembles) at a time – as discussed in section 1.2, MIVI is principally for educational applications, and teaching multiple instruments simultaneously is merely a recipe for disaster. Neither will this report cover the implementation of either a string or a brass instrument. The reader, however, can assume that ensemble GM voices (#49-#52 and #62) would be reduced to their respective solo visual incarnations. On the other hand, it is conceivable that, in its maturity, and combined with other research projects [22][29], MIVI could one day be used as a conductor training tool.

   Thus, in general, one of the biggest criticisms of the specification was the lack of freedom of expression in terms of both instrument varieties and individual instrument usage.

improving upon
the GM soundset


   Instead of simply increasing the number of available instruments beyond 128, Yamaha’s research department opted to make the voice list for XG multi-dimensional. So, for each instrument, there can be up to 128 sub-categories, drastically increasing the total number of available timbres to 16,384. An example structure of XG's Violin voices, taken from the Yamaha CS1x synthesizer [55], is illustrated in figure 2.2. It should be noted that only a small number of the 128 possible sub-categories actually differ in character from the original instrument and, furthermore, that any variation in timbre is the result of the original sound being put through an effects processor, as opposed to coming from a different source.


fig 2.2 - Yamaha’s XG
extension to GM

Voice 41   Bank 0    Basic Violin
           Bank 8    Slow Violin
           Bank 16   Bright Violin
           Bank 35   Octave higher
           Bank 36   Two Octaves higher


   This is clearly an improvement on the original specification, and manages to maintain legacy compatibility with General MIDI, but is still not ideal. Ideally, we would want categories like 'Stradivarius Violin' and so forth. Sadly, their inclusion would be of limited use, since today's sound synthesis engines are not able to reproduce tones of sufficient realism, especially when applied to solo string instruments. Our exploration of MIDI extensions thus requires a review of current synthesis techniques.
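In terms of the protocol, the second dimension of figure 2.2 is reached with the Bank Select controllers (CC#0 MSB, CC#32 LSB) sent before a Program Change. The MSB/LSB split sketched below is an assumption modelled on Yamaha's XG practice, in which melodic voices use bank MSB 0 and the variation bank is carried in the LSB:

```python
# Sketch of two-dimensional voice selection in the manner of XG: choose the
# bank (sub-category) with Bank Select, then the instrument with Program
# Change. The melodic-voice MSB of 0 is an assumption based on XG practice.

CONTROL_CHANGE, PROGRAM_CHANGE = 0xB0, 0xC0
BANK_MSB, BANK_LSB = 0x00, 0x20   # controller numbers CC#0 and CC#32

def select_xg_voice(channel, gm_voice_number, bank=0):
    """Bank Select MSB + LSB, then Program Change, as one byte string."""
    return bytes([
        CONTROL_CHANGE | channel, BANK_MSB, 0,
        CONTROL_CHANGE | channel, BANK_LSB, bank,
        PROGRAM_CHANGE | channel, gm_voice_number - 1,
    ])

# Figure 2.2's Bright Violin: voice #41, bank 16.
bright_violin = select_xg_voice(0, 41, bank=16)
```

Because a GM-only device simply ignores the Bank Select controllers, this scheme preserves the legacy compatibility noted above.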


2.2 sound synthesis




Graphics and sound are intrinsically linked, often as products of the same device – audible objects are frequently visible, and vice-versa. Our exploration of the visual side of music will therefore benefit from an analysis of the audible.

In our endeavour to recreate instruments, we must know how they work to produce sound. More to the point, to meaningfully teach these workings, it is imperative to understand how the instruments are manipulated to produce not only sound, but also music.

   Initially, this will involve a brief discussion of instruments in the real world. However, both instruments and sound generation techniques have already been extensively studied from an aural standpoint – a core component of the Music Technology field is the pursuit of more realistic and expressive technologically generated sounds and music – and thus, we shall cover, in more detail, previous translations of the art into science.

   As we shall see, many of these aural enterprises can give us guidance in our own endeavour.


2.2.1 instruments in the real world




Today, a large proportion of music listening is done in the absence of visual stimuli. This is made possible by the nature of sound – although it involves the movement of objects, which are often visible, sound is the result of minuscule vibrations, undetectable to the human eye. The irrelevance of the visual means that sound is definable from base physical principles. Music is reduced to simply the manipulation of these sound waves, and we must thus consider the instrument from this aspect. It should be noted that this is a centuries-old area of study, and much more is known about it than is permissible or relevant in this report. The author recommends Rossing [39] as a truly remarkable book, yielding deep insight into the subject.

the physics of music


   A violin, for example, uses the friction of a bow to induce oscillations in a string. In this form alone, the induced sound waves lack the amplitude (in essence, volume) to be heard, but the resultant minor vibrations of the violin’s bridge permeate into the hollow body of the instrument, where the larger internal surface area gives rise to an amplified wave. This wave can then reflect off the interior of the body several times, before leaving through the f-holes of the violin. The waves’ collision with the human eardrum causes physical displacements, which are translated into electrical impulses and sent to the brain. The brain then interprets the frequency, and variations in amplitude, that it receives as the commonly recognisable violin sound.

   This process is similar to that in the other members of the string family. Furthermore, pianos and guitars, as ‘stringed’ instruments, use much the same process – differing only in the initial induction of the source wave: a piano string is hit with a padded hammer, and a guitar string plucked with a finger.

   Farther afield, even more diverse families employ degrees of the same process – both brass and woodwind instruments rely on the reflection of waves in a resonant corpus. This time, the body is an open-ended tube and the blowing of air induces the wave. Whereas for brass, this excites the resultant wave directly, for woodwind, it is used to excite vibrations in a reed that, in turn, produce the resultant wave.

   Each method contributes towards the unique timbre of sound produced by the instrument, but it should be obvious to the reader that the methods themselves are not so different within certain instrument groups.


2.2.2 sample-based synthesis




Most modern sound synthesis techniques are sample-based – that is, for each MIDI voice, the notes have been digitally recorded from a performance in the real world. However, recording a quality sample for every note is costly – a high-quality sample for each piano key (88 in total) would fill the average instrument's allotted memory, leaving little or no room for the other 127 voices. Instead, one pitch is recorded and stored in memory, and, by electronically varying the playing speed (and hence frequency) of the recorded pitch, the other notes can be simulated. Generally, however, as you get farther away from the original pitch, this electronic ‘transposition’ results in a noticeable loss of realism.
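The playing-speed trick rests on the equal-tempered relationship between pitch and frequency: raising a recording by n semitones means playing it back at 2^(n/12) times the original rate. A sketch, using simple linear interpolation (real samplers use higher-quality interpolation):

```python
# Sketch of sample transposition by playback-rate variation. The sample is
# a list of amplitude values; stepping through it at a rate of 2**(n/12)
# shifts its pitch by n semitones (and shortens or lengthens it accordingly).

def transpose(sample, semitones):
    """Resample a list of amplitude values to shift its pitch."""
    rate = 2.0 ** (semitones / 12.0)
    out, pos = [], 0.0
    while pos < len(sample) - 1:
        i = int(pos)
        frac = pos - i
        # linear interpolation between adjacent stored values
        out.append(sample[i] * (1 - frac) + sample[i + 1] * frac)
        pos += rate
    return out

# An octave up (12 semitones) doubles the rate, halving the length:
octave_up = transpose([0.0] * 100, 12)   # ~50 samples long
```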




   Therefore, for a higher quality voice, a multi-sample is used. A piano voice, for example, might have the C note of each octave recorded in memory and use it to produce the rest of that octave’s notes. The higher the density of notes sampled, the better the quality of the reproduction. In the extreme, some digital pianos do dedicate their entire memory to one or two piano voices, sacrificing variety for quality. Note, however, that they thereby also sacrifice their adherence to the General MIDI protocol.
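The multi-sampling scheme can be sketched as a key map: for any requested note, the synthesiser picks the stored recording nearest in pitch, minimising the transposition and hence the loss of realism (the one-C-per-octave layout below follows the piano example; the pitch numbers are illustrative):

```python
# Sketch of multi-sample zone selection: one stored recording per zone
# (here, the C of each octave, as MIDI pitch numbers); any requested note
# is rendered from the nearest stored sample, keeping the shift small.

SAMPLED_PITCHES = [36, 48, 60, 72, 84]  # one C per octave

def nearest_sample(pitch):
    """Return (stored_pitch, semitone_shift) for the requested MIDI pitch."""
    stored = min(SAMPLED_PITCHES, key=lambda p: abs(p - pitch))
    return stored, pitch - stored

# A above middle C (pitch 69) is synthesised from the C at 72, shifted -3:
print(nearest_sample(69))   # -> (72, -3)
```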

   The pinnacle example of sample-based synthesis came recently, in the form of Hans Zimmer, one of the most popular composers in Hollywood. Speculating to accumulate, he booked the London Philharmonic Orchestra for an extended private session, during which he proceeded to journey around the orchestra, instrument by instrument, section by section, and sample every note and phrase he (or they) could imagine. At the end of the session, he had stored gigabyte upon gigabyte of audio data, and has since managed to remove the need for all but the most trivial orchestral participation in his movie scores. It is also worth noting that the hardware used by Zimmer does not employ any accepted instrument / sample naming convention – such as GM, GS or XG – other than that set out by himself.

   In section 2.4.1, we compare MIVI to video tuition and note that video is not as flexible as the real instrument or a simulation thereof. Zimmer’s approach similarly suffers from this inflexibility – any musical phrase he does not have, he cannot synthesise without referring back to the orchestra.

   Sample-based synthesis, by itself, exploits no implicit similarities between instruments or their families. In addition to extending the voice list for General MIDI, GS and XG both tried to develop the freedom of expression available to the instruments of the GM soundset by using sound effects processing. Although not evident in the violin voice of figure 2.2, these extensions do acknowledge the existence of such implicit relationships in other voices. In figure 2.3, which shows voice #25 – the Nylon-string Guitar – the reader should notice the Ukulele in bank 96.


fig 2.3 - Yamaha’s XG
extension to GM

Voice 25   Bank 0    Nylon-stringed Guitar
           Bank 16   Nylon-stringed Guitar 2
           Bank 25   Nylon-stringed Guitar 3
           Bank 43   Guitar Harmonics
           Bank 96   Ukulele

   The explanation is simple and inherent in the architecture of the XG system. Effects processes are simply manipulations of the sound at the waveform level. In the previous section, we noted the relative similarity of sound generation techniques. In this instance, the step from a Nylon-stringed Guitar to a Ukulele is trivial, as their actual construction is very similar: both have multiple strings, wooden bodies, etc. The only differences lie in the shape of the resonant chamber, the material of the strings and the type of wood. The generic XG effects process has modelled the consequences of these altered parameters, resulting in the abstraction of two instrument timbres to a common source waveform – an instrument has been added without adding a sample to memory.

   In the next section, we talk about the ultimate extension of this abstraction – the entirely synthetic production of sounds.


2.2.3 physical-modelling synthesis




Using precise knowledge of the processes inherent to real instruments, as introduced in section 2.2.1 and detailed in Rossing [39], we can artificially fabricate the waveform, by generating basic wave oscillations and simulating the appropriate reflections, refractions, amplifications and damping, etc.

   Furthermore, once this is achieved for a violin, we can also adapt the algorithms for other string instruments relatively painlessly. It then also follows that the combination of more adaptation and further innovation would yield synthesis techniques for other instrument families and genres.

   This is not a new theory and has already been the subject of considerable research and successful implementation. One research project, in the late 1990s, was successful and mature enough to breach the academic boundary and enter the commercial market – Yamaha and Stanford University’s illustrious SONDIUS XG Virtual Acoustic Modelling system [57].



   Contrary to what the name suggests, SONDIUS XG has no voice naming hierarchy as in its namesake, XG. Instead, there is a more implicit system of inheritance between voices, dictated by the modelling algorithms that they employ.

   Before this, the very acceptance of conventions such as orchestral families already recognised the similarity of groups of instruments. In both cases, the distinctions are based both on the method each family uses to produce sound – whether it be plucking, bowing, blowing or hitting – and on other structural properties of the instrument – for example, its material, such as wood or brass.

   For physical-modelling techniques like SONDIUS XG, these families translate conveniently to synthesis models. String instruments, like violins, violas and cellos, as well as most keyboard instruments, draw upon algorithms which simulate the sound waves produced by the oscillations of a string upon plucking, bowing or hitting. Wind instruments, like oboes, clarinets and flutes, rely on the modelling of sound waves passing down a wooden or metal tube. The SONDIUS XG system also suggests the possibility of partitions based on the driver and resonant body components, further abstracting them from their physical models.

fig 2.4 - the SONDIUS
XG architecture






   Because most stringed instruments use a bow, the respective model is invariably generic: an algorithm coupling friction-based scraping and string oscillation. The difference simply lies in the parameters – the coefficient of friction, the length of the string, etc. This implicit categorisation in the system’s naming of instruments, as combinations of Drivers and Resonant Systems, is illustrated in figure 2.4. Note that one remarkable feature of this architecture is that any combination of driver and resonant system can be used, allowing for imaginative instruments such as breath-driven violins.

fig 2.5 - the string
family: (a) violin,
(b) viola (c) cello,
(d) double bass







   Furthermore, although it may be musical blasphemy to say that a cello is simply a big violin, visually this is essentially the case (as can be gauged from figure 2.5). Thus, physical modelling can apply the same generalisation to the resonant body, simply altering the dimensions to enable the correct internal reflection and refraction of the sound waves.

abstracting instruments


   This extrapolation is especially beneficial when we consider that the driver can change part-way through a piece of music. A violin, for example, can be bowed or plucked. Indeed, a cello can even be bowed, fingered and plucked simultaneously.

However, on the score, whereas all this requires is a comment above the stave saying arco or pizzicato (respectively), MIDI has no formal method of encoding this performance direction. To simulate it, you must assign the relevant music to a separate bowed violin instrument (voice #41) or plucked violin instrument (voice #46), invariably over two different MIDI tracks or channels. In MIVI, were we to automate which instrument is displayed (or driver used), based on the incoming MIDI data, this would create a problem. For this reason, amongst others (discussed later), control over MIVI’s instrument selection will rest with the user. Aside from a small burden to convenience, the impact is minimal in the context of MIVI’s application.

   As regards resonant bodies, the string family is perhaps too simple an example. By inspecting the brass family, we see that our categorisation of instruments should be more involved than simple segregation into families, and we are again encouraged towards the driver / body split. However, whereas forced air (or pneumatics) is the driving force universal to almost all brass instruments, the resonant body and the method for controlling the pitch can, this time, vary as well. Take, for example, the trombone with its slide, as opposed to the trumpet with its valves.


2.3 performance




Returning to the problem of differentiating the violins in our string quartet scenario, posed in section 2.1.2: the sound of an instrument in the real world is also varied by the manner in which it is played – the proficiency and technique of the performer, or lack thereof. Indeed, much of what tells us that something is real (or human) is, cynically, what is imperfect about it – the gasping breath of a trumpeter, the pale scraping of a bow sliding too lightly across a violin string, etc.

   Computer-aided music is often condemned for its lack of expression which, though one could blame the DJs of today's music scene, is more likely attributable to the MIDI specification. Indeed, MIDI has often been criticised for being too centred on keyboard instruments and interfaces [26], resulting in an expressive model little more advanced than variation in volume – velocity.

   Performance is central to our goal – MIVI is an instrument performance tutor. The lack of expression in MIDI, also identified in section 2.1.2, must be overcome if we are to instruct in the performance of any instrument save the piano.


2.3.1 expression




Research by a number of institutions has produced CMPSs, or Computer Music Performance Systems, which are designed to record more expression in music, allowing the computer to emulate more accurately the human element in performance. These systems can also bring expression to instruments where it was not previously found, such as synthesizers.

computer music
performance systems (CMPSs)


   FORMULA (Forth Music Language) [2] is one such system. It was an attempt, in the early 1990s, to build upon the MIDI protocol – to add emotion, expression, etc. to the audio output through the creation of a music programming language. On p.23 of the paper, Anderson and Kuivila also briefly mention the potential of visual output, but concede that the hardware of the time, combined with their system architecture, would introduce ‘unacceptable delays’ and prevent a meaningful exploration of the medium.




   There are two immediate problems involved in promoting such data to the visual layer, in an application like MIVI. The first is technical: "How does one show emotion in graphics?" The pursuit of an answer to this could be defined by the history of ‘Art'. Realistically and, perhaps, crudely, the only way to do it would be to have a representation of the performer and his or her face, as he or she plays the piece. Indeed, the face is often used in art as an interface to the emotion of a painting’s subject. It is, after all, not the instrument that humans perceive as having the emotion, but the player.

   The other problem is one of practicality. In an educational program, the useful applications of carrying expressive performance directives through to the visual layer appear minimal, since expression is often an aggregate of one's own soul and experience – tacit knowledge, which is difficult, if not impossible, to teach. However, there can often be more than one method, or technique, available for playing a note or phrase, of which it is important for the performer to be aware.


2.3.2 technique




Technique, although related, is not synonymous with skill – it is a means to skilfulness. Although, on a piano, teachers will train students to use particular fingers on particular occasions, this need not always improve the quality of the performance. Instead, it makes the performance easier, so that the move to the next echelon of ability becomes more achievable.



   One aspect of performance that instantly presents itself, when stepping outside the piano, is fingering choice. A guitar, for example, has six strings, and the pitch range of each overlaps with not only that of the adjacent string, but beyond that as well. There thus exists a choice of methods to finger a single note, and some will be easier than others. If we are to display the instrument as it relates to the score, we must choose one of these methods. However, as an educational tool, we must make sure that our choice is suitable for the learner.

fig 2.6 - table of
guitar fingering






   In figure 2.6, we illustrate a diagram identifying the possible fingerings for various pitches on a guitar fretboard, where each number denotes the distance (in semitones) from the fundamental pitch – in this case E(00) – and each row is a different string. One can see the repeated occurrence of equivalent pitches across several different strings (the equivalent C(20)s have been emphasised). The problem facing the performer – and, by transitivity, our visual interface – is which to choose at any one given time.

   In the above scenario, let us assume that we want to play the C and that the succeeding note is an A(17), so, if possible, the algorithm should, for simplicity's sake, avoid changes of string or hand position. Intuitively, it would be best to pre-emptively place the first finger on an A, and the second (or third) on an adjacent C, so the transition can be made painlessly and involve only a simple removal of the extraneous finger. This rules out the C on the B-string, since no such A exists, leaving two options, which – under our current constraints – are equally attractive.

   So, we introduce another constraint – that of sound quality. Plucking an open (un-fingered) string produces a much 'cleaner' sound than if the player were to finger the note by depressing a lower string nearer the bridge (towards the right of our diagram). Simplistically, this gives us the heuristic: the further left the number on our diagram, the better the quality of the tone. Readers interested in implementing a more detailed set of constraints that account for quality are referred to Taylor's paper [50].
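To make the two constraints concrete, the following Python sketch enumerates the candidate fingerings for a pitch and applies, in order, the string-continuity and open-string quality heuristics discussed above. The standard-tuning offset table, the fret limit and the function names are illustrative assumptions, not drawn from any cited implementation:

```python
# Illustrative sketch of the two fingering constraints; pitches are
# semitone distances from the fundamental, E(00), as in figure 2.6.
OPEN_STRINGS = [0, 5, 10, 15, 19, 24]   # assumed tuning: E, A, D, G, B, e
MAX_FRET = 12                           # assumed playable fret range

def candidates(pitch):
    """All (string, fret) pairs that can sound the given pitch."""
    return [(s, pitch - off) for s, off in enumerate(OPEN_STRINGS)
            if 0 <= pitch - off <= MAX_FRET]

def choose(pitch, next_pitch=None):
    """Constraint 1: prefer a string that can also play the successor
    (avoiding string/position changes). Constraint 2: of the survivors,
    take the lowest fret - 'the further left, the better the tone'."""
    opts = candidates(pitch)
    if next_pitch is not None:
        same_string = [(s, f) for s, f in opts
                       if any(s2 == s for s2, _ in candidates(next_pitch))]
        if same_string:
            opts = same_string
    return min(opts, key=lambda sf: sf[1])
```

For C(20) followed by A(17), the sketch discards the B-string option (on which no A exists) and breaks the remaining tie in favour of the lower fret, mirroring the worked example in the text.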

generic fingering


   Incidentally, on other fingered string instruments, such as the violin, players are encouraged to choose fingerings in preference to open strings, in order to maintain a uniform sound quality across all notes. Thus, for applications servicing multiple instruments, such as MIVI, it will be important to give careful consideration to the domain of their fingering algorithms. For example, how much of a guitar fingering algorithm might translate to a violin application? Even on a lower level – how much modification is required to adapt a violin fingering algorithm to a cello? Does the introduction of the thumb, in this case, present a large problem, or can we simply treat it as a fifth finger? We leave these exercises to the reader and the referenced literature.



   Returning, and restricting ourselves, to our guitar scenario: we have seen that, with just two constraints, a decision can be made for each note (the previous decision is shaded in the diagram). Notice, however, that we also made an assumption (the position of the succeeding A), which allowed us to further constrain the decision. A note's reliance on its successor suggests a 'lookahead' is called for – the succeeding note's position must be calculated before the current note's and, of course, the same is true for the succeeding note. This recursive relationship propagates the decision to the final note of the music, where a choice, based on no successor, must be made arbitrarily. The results then cascade back to the initial note. Thus we see that, although implementable using a constraint-satisfaction algorithm [18], fingering calculation in this form cannot be a real-time operation, but must be pre-processed.
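The cascade-from-the-end scheme just described can be sketched as a small dynamic programme: costs are accumulated backwards from the final note, then the decisions are read off forwards. The tuning offsets and transition costs below are illustrative assumptions:

```python
# Pre-processing sketch: decisions propagate back from the final note
# (where the choice is arbitrary) and then cascade forward again.
OPEN_STRINGS = [0, 5, 10, 15, 19, 24]   # assumed standard tuning
MAX_FRET = 12

def candidates(pitch):
    return [(s, pitch - off) for s, off in enumerate(OPEN_STRINGS)
            if 0 <= pitch - off <= MAX_FRET]

def transition_cost(a, b):
    """Penalise string changes and hand-position shifts (illustrative)."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def fingerings(pitches):
    """Whole-piece fingering choice - not a real-time operation."""
    # best[i][f] = (cost to finish from note i with fingering f, next f)
    best = [{f: (0, None) for f in candidates(pitches[-1])}]
    for pitch in reversed(pitches[:-1]):
        layer = {}
        for f in candidates(pitch):
            nxt = min(best[0].items(),
                      key=lambda kv: kv[1][0] + transition_cost(f, kv[0]))
            layer[f] = (nxt[1][0] + transition_cost(f, nxt[0]), nxt[0])
        best.insert(0, layer)
    # cascade forward from the cheapest starting fingering
    f = min(best[0], key=lambda k: best[0][k][0])
    out = [f]
    for layer in best[:-1]:
        f = layer[f][1]
        out.append(f)
    return out
```

Because every layer must exist before the forward pass begins, the whole piece is consumed before the first fingering is known – precisely why this formulation must be pre-processed.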

   This is how the DIVA system (discussed in section 2.5) works. The music is analysed and all the fingering positions are worked out and stored before the performance. It is a simple approach and will result in a polished performance, depending on the algorithm's constraints. It requires, however, a pre-processing step before each new piece of music – the duration of which will vary with the complexity of the piece and the number of musicians in the ensemble, and could, potentially, be quite costly. Even so, this step will be drastically shorter than a standard ensemble's equivalent during rehearsal sessions.

   However, let us consider the human performer more closely. The player's approach depends on their level of skill and experience, most notably in the sphere of sight-reading. A less experienced musician might play the piece through once, annotating the score with fingering observations as they occur, restarting each time to verify them – the human equivalent of the DIVA system.

finite lookahead


   An experienced musician, on the other hand, might be able to perform a small lookahead in his mind and anticipate optimal fingerings, similar to a Grandmaster of chess deducing the next seven or eight moves before they happen. This latter tactic gives us insight into how to implement a real-time fingering system. By restricting the lookahead to a limited amount and forcing the arbitrary decision before reaching the final note, we could conceivably avoid the heavy load on the processor, enabling the decision to be taken as the note is played.

   Indeed, it is self-evident that the fingering for the final note in a symphony movement has little or no influence on that of the first. We can apply this analogy at much smaller intervals and, by selecting a suitable lookahead size in relation to these intervals, not only reduce the processing overhead sufficiently to allow decisions to be made in real-time, but forego any significant hit in performance quality. There are several points in a score where influence does not propagate, such as rests (silences), and less critical points where even more involved fingering shifts are less costly, such as following long notes or stretches of repeated notes, where thinking time is more abundant – both are examples of good lookahead limits that occur frequently throughout most pieces.

   However, in the absence of such appropriate junctures, an algorithm might have to force a lookahead limit in order to prevent hogging of the processor, thus avoiding glitches in song playback. In this case, a decision is forced, for the current note, based on less than all the facts.
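A finite-lookahead variant of the same idea might look like the following sketch, which searches exhaustively within a small window (here an assumed four notes) and commits only to the current note. All names, costs and the window size are illustrative assumptions:

```python
from itertools import product

OPEN_STRINGS = [0, 5, 10, 15, 19, 24]   # assumed standard tuning
MAX_FRET = 12
LOOKAHEAD = 4                           # assumed window size

def candidates(pitch):
    return [(s, pitch - off) for s, off in enumerate(OPEN_STRINGS)
            if 0 <= pitch - off <= MAX_FRET]

def transition_cost(a, b):
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def next_fingering(prev, upcoming):
    """Commit to a fingering for upcoming[0], looking at most LOOKAHEAD
    notes ahead - the 'forced' decision described in the text."""
    window = upcoming[:LOOKAHEAD]
    best_f, best_cost = None, float('inf')
    # exhaustive search is acceptable because the window is tiny
    for path in product(*(candidates(p) for p in window)):
        cost = sum(transition_cost(a, b) for a, b in zip(path, path[1:]))
        if prev is not None:
            cost += transition_cost(prev, path[0])
        if cost < best_cost:
            best_f, best_cost = path[0], cost
    return best_f
```

Each call is cheap enough to run as the note is played, at the price of occasionally committing to a choice that fuller knowledge would have overruled.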

expert systems


   Such decisions can often be simplified by the use of expert systems or neural networks, such as Bayesian Networks [23] or Hebbian Learning [40] (respectively), where statistical reasoning is employed instead of simple deterministic choices. Hence, after a modest lookahead, though we might not be able to say for certain what the best choice is, we have statistical weights denoting which choice is most likely to be the best. Bayesian Network systems that allow for learning, such as BUGS [45], can even be combined with cached experience, which can assist us further by basing the fingering for a phrase of finite length on what action was taken the last time the question, or a similar question, was posed.

a hybrid
fingering algorithm


   Using a similar principle of learned experience, Sayegh [41] successfully employed the Optimum Path Paradigm (OPP), an approach based around constraint satisfaction, to tackle the fingering problem in stringed instruments. Taking the number of strings to be S and the number of playable notes on each string to be N, with NT (= S × N) representing the total number of finger positions possible, the algorithm first populates an NT × NT matrix, W, with the cost of every possible transition between fingerings, penalising string and hand position changes – as in our original example. The matrix can then be queried one or more times, given a sequence of notes, to produce a set of solutions. All existing solutions are then combined to form a weighted, directed graph (see figure 2.7), which can be solved in polynomial time using the Viterbi algorithm [52], a dynamic-programming relative of Dijkstra's least-cost path algorithm [15].

fig.2.7 – Sayegh’s use
of weighted, directed
graphs in the sequence
G, A, C, E






   For a classical guitar (S=6, N=36), the number of calculations to populate W is a phenomenal 216 × 216 = 46,656. Unfortunately, aside from a trivial identity mapping, no ‘optimising’ observations can be employed to streamline the process – fingering transitions are not normally commutative. However, the algorithm need only be executed once per instrument – the matrix can be stored for use in the future.
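Populating W is thus a one-off, quadratic sweep over all NT positions. The cost function in this sketch is an illustrative stand-in for Sayegh's actual penalties:

```python
# One-off population of Sayegh's transition-cost matrix W for a
# classical guitar: S strings, N playable notes per string.
S, N = 6, 36
NT = S * N                  # 216 finger positions in total

def cost(i, j):
    """Illustrative penalty on string and hand-position changes;
    position index i encodes (string, fret) as i = string * N + fret."""
    si, fi = divmod(i, N)
    sj, fj = divmod(j, N)
    return abs(si - sj) + abs(fi - fj)

# NT x NT entries: 216 * 216 = 46,656 cost calculations
W = [[cost(i, j) for j in range(NT)] for i in range(NT)]
```

Since the matrix depends only on the instrument, it can be serialised once and reloaded for every subsequent piece, as the text notes.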

   Sayegh refines the approach by introducing a second stage, based on Hebbian learning [40], a form of expert system. He uses a second matrix of equal dimensions, W', the entries of which are initialised to zero. The results of the first stage then go through a learning phase, using W to generate responses to various training material in the form of sequences of musical notes. Each time the algorithm suggests a transition between two fingerings, the corresponding entry in W' is incremented by 1. After a sufficiently large sample input, the new matrix represents experience, on which fingering decisions can be based, either solely or in addition to the original matrix. The accuracy of this sort of Hebbian learning (as with Bayesian Networks) is often difficult to accept from first principles, and those new to the idea often benefit from practical and quantitative examples, which the reader should be able to find in most of the books on the subject [23][40].
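The learning stage itself reduces to a simple counting update. In this sketch, the transition indices are assumed to come from running stage one over training material; the names are illustrative:

```python
# Hebbian-style learning stage: W' starts at zero and the entry for
# every transition suggested by stage one is incremented by 1.
NT = 6 * 36                              # positions on a classical guitar
W_prime = [[0] * NT for _ in range(NT)]  # experience matrix, initially empty

def learn(suggested_transitions):
    """suggested_transitions: iterable of (from_position, to_position)
    index pairs produced by the first-stage algorithm on training music."""
    for i, j in suggested_transitions:
        W_prime[i][j] += 1
```

After a large enough training set, frequently-suggested transitions dominate the matrix, which can then drive fingering decisions alone or alongside W.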



   In MIVI, we are not designing a playing aid, but a teaching aid. The performers – in this case, guitarists – need to learn how to apply the constraints and work out fingering themselves, and should not become reliant on the presence of a computer. The constraints should be considered equivalent to 'playing tips and tricks', and can be presented as such in the program. Therefore, given a piece of music, the user should be able to specify which techniques the computer employs, and thus which to tackle learning themselves. The true beginner starts with none, possibly using only his forefinger to adjust the pitch of his instrument, and can then introduce the 'tricks' one by one, adding them and advancing at his leisure. Furthermore, when the choice of employing a specific technique conflicts with that of another, or represents a different – as opposed to 'better' – practice, the decision that must be made is one of playing 'style'. Sayegh provides a convenient method of implementing such a system, with each set of constraints (or playing style) represented by a pre-compiled matrix.

playing style


   Whereas the 'tips and tricks' indicative of a playing style might be defined by the constraints of the algorithm's first stage, an expert guitarist's playing style might be encoded by running just the learning stage on them, instead of the algorithm – getting them to manually increment the contents of matrix W', as they make their own decisions, given the learning material.

   Hypothetically, a 'complete' implementation would not only allow you to define sets of constraints corresponding to the playing styles of guitarists (John Williams, Eric Clapton, BB King, etc.), but also more general sets, optimised for their genres (respectively: classical, rock, blues, etc.). Simply restricting the learning phase's training set to pieces in the required genre could develop such genre-orientated sets.

   The initial implementation of MIVI, documented later in the report, will not include such performance memories and fingering algorithms. However, when it becomes time to codify our flute, we shall see that a simple fingering decision is required. Furthermore, we shall attempt to ensure that if – or when – the MIVI system is extended for other instruments, such as the guitar, the architecture is able to support the inclusion of fingering algorithms.

  2.4 education




The awkward and visible sign is the syllabus, a table of contents which lays down what the student is required to do and what is examined… The syllabus narrows the student's vision of the edge of knowledge and cuts him off from precisely those fuzzy areas at the edges of subjects that are the most interesting and rewarding.

Christopher Small, p.186-7 in Music-Society-Education [44]



The moral of this fable is that, if you’re not sure where you’re going, you’re liable to end up somewhere else.

Robert Mager, Preface to Preparing Instructional Objectives [28]


Today, when it comes to learning an instrument, there is a multitude of teaching methods available to the music student and, as a player of several instruments, the author has much experience in this role. In this section, we give a concise and critical history of the academic field, with focus on the practices of musical instrument tuition and the use of computers as an adjunct to education.


2.4.1 music education




Tradition demands [24][35] the tried and tested technique of one-to-one – teacher to pupil – lessons on a regular and frequent basis. During daily, weekly or fortnightly lessons, the pupil’s performance is assessed by the teacher, whose competence then determines the quality and availability of positive critical feedback. Furthermore, the instructor also sets the syllabus, perhaps via an examining body, and suggests homework and beneficial extra-curricular activities. It is in this way that the author has learnt to play the violin.

   The strict scheduling of lessons and proliferation of deadlines and exams can, however, contradict the rationale of music itself – to be an enjoyable and entertaining occupation. After all, for most people, playing music will be considered a leisure activity. The two passages, given at the beginning of section 2.4, are quoted from Swanwick [48], who is but one of many scholars [17][35] who have arrived at this conclusion.

   Additionally, and often more importantly, the expense of expert tuition can sometimes be prohibitive [19]. In the case of many instruments, especially during the sometimes painfully slow early stages, it can be difficult to identify what your money is paying for.

the self-taught


   This might persuade some students that a self-taught approach is more financially viable – if only to get them to a stage where they can evaluate the merit of expert, third-party tuition. However, even at the individual level, there are a number of different approaches, many complemented by commercially available teaching aids.

Naturally, the first is the bloody-minded, do-or-die approach. In the case of learning to swim, this involves jumping in the deep end and, more often than not, results in either drowning or the somewhat crude and expedited erudition of the fundamentals of swimming – namely, the doggy paddle. In the case of the aspirant maestro and, at one time, the author, this might involve a complex piano sonata, Bach for example, and practising until they can play it.

   This approach will only work for some, notably pre-seasoned musicians, and, although coarse and cursory, will elevate the subject's proficiency across a much larger repertoire than simply the chosen piece. Indeed, once one has deciphered Bach, little stands between them and most other Baroque piano music, and the same holds for many other genres.

   However, music is an age-old pursuit, and it is naïve to think that the score and the instrument alone will endow one with enough experience to become a maestro. Indeed, the score has been as heavily criticised [35] as credited for its provision of information and encapsulation of the performance. As in the extreme of our aquatic example, one may be able to swim, but not necessarily to swim well.

the explicit, the
implicit and the tacit


   As an art, most notably a performing art, musical knowledge extends beyond the explicit – what is written on the page. In this way, traditional tuition will always have an advantage over the entirely self-supported approach, for it allows the elicitation of implicit knowledge – that which can be articulated but isn’t – and tacit knowledge – that which can’t – from the expert to the learner. So, whereas the self-taught student might be able to play the music, they will be ignorant of tips, tricks and shortcuts, to allow for easier playing and more freedom to express themselves – another defining characteristic of art. Indeed, for many instruments, such gaps in knowledge will inhibit the student from moving to higher echelons of play. Moreover, for instruments less intuitive than the piano, the lack of knowledge and guidance about how to equate the notes on the page to the instrument can severely hinder, often prevent, even the slightest progress.

   Music, incidentally, is the only art form requiring literacy [35] – whereas a painter need only apply the brush to a canvas, a musician is reduced to working through a layer of abstraction – the score. As we saw in section 2.1, such obstacles can be compounded with further layers, such as MIDI.

teaching materials


   Therefore, when teaching oneself, a natural step is to find literature on the subject. In searching for such secondary materials, one is seeking an expert who has tried to put everything they know on paper – both stating the explicit and articulating the implicit, so that it is, in its new form, explicit. The extent to which the expert has achieved this goal determines the relative quality of a given material, and there are several different approaches to the problem available.

When one is looking for such literature, as this author endeavoured to do when learning the guitar, the sheer variety available is daunting. In its infancy, this market was dominated by books [34] where the expert and publisher employed text and nothing else. Soon after, these were superseded by ones [33] containing simple diagrams to help tackle explanations of fingering and posture, etc. Then, with the advent, and relative cheapness, of black & white [51] and, later, colour photography, we have books advertising the "all-visual approach to learning to play the [instrument]" [10].

visual learning


   In music, exposure to the visual aspect of playing an instrument can sometimes be as important as exposure to the audible aspect – the classic 'Monkey See, Monkey Do' philosophy is actually of benefit in this case, and has been formalised, in research, more than once.

   Chappell [11] states that the ability to internalise music (to hear, or picture, it in one's mind) is of considerable advantage in the process of developing musical skill. Nevertheless, other studies [1][16] show this skill to be present in only a handful of the world's greatest musical geniuses.

   Ben-Or [4], whose views on the subject are widely held, states "If one can really perceive a passage of music with all clarity and represent it to oneself mentally as it relates to the instrument, then there is no obstacle left for the body to freely express it in sound". Thus, to present music in a visual 'instrumental' form, pre-converted from its abstract score format, should remove a number of such obstacles.

pictures and


   Static visualisations, though a step in the right direction, often do not convey the subtleties of playing the instrument. On a purely technical level, pictures can rarely capture motion. For a simple example, a book may illustrate 3 key stages of strumming on a guitar: the up, the strum and the return to the default hand position. In all, the strum should last 1 second. If the average human eye works at a frequency of 72Hz (72 pictures a second), the motion comprises 72 frames, and it is up to the individual to anticipate the other 69 of them – 96% of the motion.

   Strumming is a relatively simple procedure and most, if not all, beginners are able to master it after but a few attempts, allowing books to economise in the early stages. However, more involved motions, which demand even more enhanced illustration, are usually accompanied by none. Though the cynic might put this down to the cost to a publisher of printing pictures, the true reason is more likely the limited amount of information expressed in a single image. It simply isn't plausible to express, in a page, complicated techniques where 10 or more key stages, each shot from multiple angles, are required to deliver a true understanding – a deficiency compounded in some books by a tendency to illustrate the instrument from the listener's view, rather than the player's, forcing a translation, this time of geometry, on the learner.

   Another, more general failing of 2D (diagrams) and static pseudo-3D (photos / illustrations) material is the lack of the fourth dimension – time. A reader, following examples note by note, cannot gauge their progress: although they are playing the right notes in the right manner, they are not aware of the temporal aspects of the piece. In the case of tempo itself, the speed of their performance is likely to be lower than intended, allowing for deceptively easy playing.

audio examples


   Therefore, some books are accompanied by an Audio-CD [10] that demonstrates what the reader should be reciting. This helps to a great extent, but still lacks cohesion, in that, although the music on the page may be recognisable in the playback, the process that connects one to the other still relies on the quality of text and pictures, and requires some quick-thinking on the reader’s part. Slack [43] also recommends the use of audio aids in music education, but stresses the importance of accompanying explanation7. As far back as 1964, he advocated the augmentation of audio with visual aids [43] in the learning process – extolling the virtues of being able to zoom in and focus on parts of the instruments. Thus, as the technology became available, the trend took us from the written word through the spoken and illustrated, to the motion picture.

the motion picture
and MIVI


   This philosophy is not new to education, and has even been brought to bear in the form of multimedia music tutorials – CD-ROMs of text, pictures and video demonstrations. Video is digitised film, which is to MIVI what the Audio-CD is to MIDI. Instead of a data stream, like video or CD audio, MIVI will be built on an object-oriented representation of performances. Where MIDI can tell you the pitch and volume of each note, and allows you to alter tempo, etc., MIVI will draw upon this to provide a total-immersion graphical playing environment, with editable angles, magnification and speeds, plus the ability to isolate a single instrument and its components for closer scrutiny.

fig.2.8 - domains of education methods






   In figure 2.8, the common stages and routes to musical expertise, as discussed in the previous paragraphs, are summarised with relation to their competences. Some features of the table are worth mentioning. Firstly, it is clear that when one becomes an expert, tuition is no longer required, and development now relies on a self-supported exploration of music and performance, although possibly with external influences from other performers. Secondly, we conclude that a degree of tacit knowledge, not serviceable by oneself or media, is required to achieve expertise.




   It should also be noted that the table holds no information relating to the speed, ease or expense of any particular method, or combination of methods. Given the problems with current self-tuition materials, and, equally, those of expert third-party tuition (introduced in section 1.2), most learners still demand the latter. It is hoped that MIVI, by plucking some of the advantages of the third party, will close the gap between the two – in a sense simulating a virtual music teacher for the earlier stages.

   As Roland – a world-renowned musical instrument manufacturer – states [38], arguing a similar case for its own interactive tutoring products, "[M]usic learning with private lessons begins with 30 minutes of excellent guidance and coaching. But, now a crucial difference takes place. Instead of the regular, routine play-through of the school ensemble music, the student goes home to a week-long period of unguided, occasional practice."

   They continue by identifying the distinction between teacher and tutor – the first provides direction; the second guides the student on their journey. In their case and ours, the respective products' competences fall into the latter category.

   The promotion to the former might involve the inclusion of implicit and explicit technique theory, including the migration of expertise from real to virtual incarnations, which has resulted in the division of MIVI into two domains in figure 2.8. The first represents the intended capabilities of the implemented MIVI system as documented in this report: introducing the student to the instrument8 and teaching the student how to relate the score to said instrument. The second builds on the first with the ominous requisite of 'performance knowledge', and was discussed in section 2.3.2.


2.4.2 computers in education




The modern buzzword in multimedia education is 'interactivity'. Having argued the 'learning by seeing' case, many of the same principles can be, and are, similarly employed in advocating 'learning by doing' [9][47][48]. Using interactive materials, the user can get not only instruction, but feedback too. The interactive device can have algorithms to analyse the user's performance and give qualitative and quantitative feedback on their errors, as would a genuine tutor.

games, simulations
and drills


   E.R. Steinberg [47] mentions several methods of interactive learning through the use of the computer. ‘Drills’ are the purest form of Computer-Assisted Instruction (CAI) and simply aid the memorisation of symbols, such as the periodic table, or connections, such as our score note to key mapping, through the standard stimulus / response format. Even in this simple context, detailed feedback is possible, since the computer can tell the learner what they got wrong, where they tend to fail (statistically) and to what extent they are in error. Furthermore, the execution of the drill can be tailored from learner response – getting harder as they do better or vice-versa. Drills, however, only test knowledge already present in the user.
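A toy sketch of such a drill follows; all class and method names are illustrative and no resemblance to any system Steinberg describes is claimed. It tracks per-item errors for statistical feedback and adjusts a difficulty level from the learner's responses:

```python
# Minimal stimulus/response drill with adaptive difficulty and
# per-item error statistics, as described in the text.
class Drill:
    def __init__(self, items):
        self.items = items                    # stimulus -> correct response
        self.errors = {k: 0 for k in items}   # where the learner fails
        self.level = 1                        # current difficulty

    def check(self, stimulus, response):
        correct = self.items[stimulus] == response
        if correct:
            self.level += 1                   # harder as they do better
        else:
            self.errors[stimulus] += 1
            self.level = max(1, self.level - 1)
        return correct

    def weakest(self):
        """Statistical feedback: the item the learner most often misses."""
        return max(self.errors, key=self.errors.get)
```

A score-note-to-key drill, for instance, would map notated pitches to piano keys and report which mappings the learner tends to get wrong.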

   Either alternatively or in combination, E.R. Steinberg advocates the use of ‘simulations’, where the learner is presented with a virtual representation of the subject material. She states, "Insights about complex scientific principles often come from experience with such concepts in an interactive environment".

   A flight simulator, for example, allows trainee pilots to familiarise themselves with the cockpit of an aeroplane without the expense (in terms of money, space and time) of a real plane or flight.

   Closer to our context, the use of typing tutor software is also an embodiment of this concept. As both a drill and a simulation, it is able to show the user what to do, regulate the act itself and give qualitative and quantitative feedback on their performance.

eliciting tacit


   We have seen how interactive media can replace the implicit teachings of a music teacher, but how can we extend this to tacit knowledge? As it becomes more widely acknowledged that we are living in an age where information is paramount, industry looks to retain such knowledge, without necessarily retaining workers, through Knowledge Management and Information Systems.

   One approach is to concede that tacit knowledge cannot be encoded and copied, but 'cloned' instead. In our case, instead of telling an expert guitarist to write a book about how the reader should play the guitar – which, we have shown, will not implant his expertise in the reader, invariably losing something in the translation – we simply tell him to write a book about how he plays the guitar himself. Although the knowledge cannot be copied, the rendition of a particular piece of music (the performance) can be. So, we instruct the learner to recreate the performance(s), trusting that, in their efforts and attempts to execute the task, the learner will derive the tacit knowledge for himself and will, in future, be able to sub-consciously apply it to the generic music score.

   Thus, returning to our discussion on the relative merits of MIVI over existing soft and hard literature we see that, whereas existing teaching material is often restricted to a specific genre, such as "Jazz Piano" or "Blues Guitar", MIVI should have the capacity to cross these boundaries, simply by selecting a MIDI file and performance memory from the desired genre.

  2.5 the DIVA project

fig.2.9 - the DIVA
system in action





In this section, we will introduce the reader to DIVA, a project closely linked with MIVI, and review some of the core principles and ideas surrounding it. We will also discuss the relative merits and shortcomings of DIVA, and isolate the differences between it and our own endeavour, MIVI. Other, more specific aspects of the DIVA project are covered elsewhere in the report, as they become relevant.


2.5.1 DIVA history






Beginning in 1992, a team of scholars, from the Helsinki University of Technology, embarked on a number of research projects [21][22][26][49] that, in 1997, culminated in the design and development of a Digital Interactive Virtual Acoustic (DIVA) environment [26]. This involved the creation of a totally immersive 3D environment, embodying the performance of a small Virtual Orchestra of Virtual Musicians, led by a human conductor, with physically-modelled instruments and acoustics, both audibly and visually accurate down to the spatial dispersion of sound and fingering of instruments, as illustrated in figure 2.9.

   Drawing from expertise in multiple fields, such as digital signal processing (DSP), graphics, acoustic modelling and sound synthesis, the team has studied and successfully combined technologies in all these areas, crowned by an interactive performance at the SIGGRAPH'97 conference.


2.5.2 the DIVA system




The impressively endowed system architecture (figure 2.10 (a)) comprises no fewer than three multi-processor SGI workstations, each assigned a select number of competences, networked to external synthesizers and an Ascension MotionStar for conductor input.

   Figure 2.10 (b) is adapted from Huopaniemi et al [21] and illustrates the DIVA system information flow, with the corresponding scope of the MIVI system appended. It should be noted that our project will, in fact, address an even more confined scope than that indicated by the diagram, with our MIDI-to-movement mapper and Animation control avoiding the overhead involved with the processing of virtual biped musicians and multiple instruments. It is intended that the MIVI system should be suitable for execution on a traditional, self-contained, uni-processor system - a home computer or other such workstation.

   However, while sharing some common attributes, MIVI and DIVA differ in objective and target audiences. Thus, we find that, in a couple of our interest areas, such as fingering and visual detail, the DIVA research is less extensive.

fig 2.10 (a) – DIVA
system architecture



fig 2.10 (b) – the DIVA
process flow



   This stems from the fact that the DIVA system is principally orientated towards the use of sound engineers and experts – be that for movie sound or general recording studio work. MIVI, on the other hand, will focus on establishing itself in an educational role, aimed, initially, at the unskilled user, possibly in a studio environment, but more likely in the school or the home. The group's research, as such, illustrates an emphasis on acoustics and audio realism, with principal focus on sound spatialisation, room response modelling and HRTF (head-related transfer functions), rather than graphical and performance accuracy that, as discussed, will be crucial in our own venture.

   Many articles and papers, by different members of the group, provide overviews of the various aspects of the project, and it is, therefore, necessary to direct the reader towards Lokki et al [26] for one with more relevance to our field – the visualisation of musical instruments in real-time.

   Recalling our discussion of fingering algorithms in section 2.3.2, DIVA employs a simple and computationally efficient, yet somewhat musically naïve, least-distance algorithm to resolve cases where there exists more than one way to finger a note. When presented with a choice of fingerings, this approach comprises simply choosing the solution requiring the least hand or finger movement – favouring ease over quality. Such decisions are made using Critical Path Analysis (see [40]) – an extension of Sayegh's [41] use of weighted, directed graphs (discussed in section 2.3.2).
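Stripped of the Critical Path Analysis machinery, the least-distance rule itself amounts to a single minimisation over the candidate set. The tuning offsets and the distance metric in this sketch are illustrative assumptions, not taken from the DIVA literature:

```python
# Illustrative least-distance fingering choice: of all the ways to
# sound a pitch, take the one nearest the current hand position,
# with no regard for tone quality.
OPEN_STRINGS = [0, 5, 10, 15, 19, 24]   # assumed standard guitar tuning
MAX_FRET = 12

def least_distance(current, pitch):
    """current: the (string, fret) the hand is at; pitch: semitones
    above the instrument's fundamental."""
    options = [(s, pitch - off) for s, off in enumerate(OPEN_STRINGS)
               if 0 <= pitch - off <= MAX_FRET]
    return min(options, key=lambda sf: abs(sf[0] - current[0])
                                       + abs(sf[1] - current[1]))
```

Every choice it makes is playable, but, as the text observes, ease of movement is optimised at the expense of the quality considerations an educational tool would also need.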

   Although, as claimed, the fingerings are "realistic" [26] – in that they are possible – they are far from optimal and, although adequate in an entertainment environment, should be further scrutinised in an educational or reference role. This compromise is also evident in the conducting styles recognised by the system when it processes data from the input baton – only the most generic are identified. The implementation of both features is, nonetheless, more than adequate for demonstrating the practicality of their inclusion. The team’s fingering efforts do, however, introduce an interesting aspect of visual simulation – dynamic (or timed) fingering. Actions like guitar strums or percussion hits require preparation, such as the ‘up’ movement (mentioned in section 2.4.1) or the initial movement of the drumstick towards the drum skin.

   Lokki et al [26] also concede that more weight has been given to the realism of the ‘scene’ than to that of the individual instruments and musicians themselves. This is most markedly illustrated by the ‘3D cartoon puppet’ format of the musicians. Citing psychological reasons [26], the DIVA team conclude that maintaining a high animation frame rate and focusing on realistic character motions allow for a scene that the human brain can more easily accept as, or equate to, reality. The appearance of instruments and musicians has thus been optimised for efficiency rather than quality. Indeed, from the stills, it appears that the flute’s visualisation does not extend to the rendering of the flute’s keys.

   To further minimise the computational load at run-time, instead of computing the fingering and musician animation in real-time during playback, the motions are compiled for the entire performance pre-execution. It is not clear from the literature whether the DIVA system utilises an automatic inline system to compile the motions – such as that suggested by Lytle [27] – or a separate standalone application, to the same effect, or whether the animation files must be compiled by hand. If either of the latter two is the case, then MIVI has the great advantage of an immediately accessible repertoire of MIDI files, unrestrained by the pre-requisite of an accompanying animation file – in theory, every Standard MIDI File. Furthermore, the real-time nature of MIVI allows for spontaneity in the music, such as the introduction of musical improvisation.
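   The real-time alternative can be sketched very simply: each MIDI message is translated into an animation command as it arrives, with no pre-compiled motion file. The event tuples and movement names below are the author's assumptions for illustration, not structures drawn from the DIVA literature.

```python
# Illustrative real-time MIDI-to-movement mapping: each event is
# dispatched on arrival, rather than compiled pre-execution.
# Event format (status, note, velocity) and command names are assumed.

def map_event(event):
    """Translate one channel-voice MIDI event into an animation command."""
    status, note, velocity = event
    if status == 0x90 and velocity > 0:                  # note on
        return ("press", note, velocity / 127.0)
    if status == 0x80 or (status == 0x90 and velocity == 0):
        return ("release", note, 0.0)                    # note off
    return None                                          # ignore other messages

# A short stream: middle C pressed, then released.
for event in [(0x90, 60, 100), (0x80, 60, 0)]:
    print(map_event(event))
```

Such per-event mapping is what permits improvisation, at the cost of forgoing any look-ahead: preparatory movements, like the drumstick's approach to the skin, must instead be inferred or approximated at the moment the note sounds.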

   As with the animation files, the DIVA team store the data for instruments in external files [26]. In both cases, the system parses ASCII descriptions and must be generic enough to process varied instruments (flutes, violins, drums, etc.). Such a ‘plug-in’ architecture of instruments and players to the DIVA Virtual Orchestra allows for flexible system extendibility. How the team achieved this, and the extent to which they did, is not clear from the available literature, but its description alone encourages consideration of similar functionality in our MIVI application (see Chapter 4).

   MIVI's focus on just the instrument should also permit us to dedicate considerably more of the available real-time computational power to this problem, as well as allow us to research more elegant and involved solutions in this area. Thus, for MIVI, the lines from the MIDI file and Instrument definition boxes to the MIDI-to-movement mapper, in figure 2.10 (b), should be emphasised in bold, indicating the real-time nature of this flow. 

   Indeed, it would be interesting to see the fruits of the DIVA project updated for today’s hardware. The progress, particularly in 3D graphics hardware, during the last few years, has been pronounced. Excepting the distributed and multi-processor advantages of the DIVA hardware, the performance of high-end consumer systems, such as those used in the development of MIVI, could represent a comparable level of computational power. A combination of DIVA’s scene dynamics and MIVI’s attention to detail could lead to one of the most realistic real-time computer simulations yet.


All content, including code and media, (c) Copyright 2002 Chris Nash.