In the last half-century, Music Technology has established itself
as a highly respected line of research and study. Much work has
already been done across this multi-disciplinary area, ranging from
electronically-inclined studies into digital signal processing (DSP),
through physical studies into acoustics, to more mathematical and
melodic studies such as composition through algorithms.
It is important to ascertain the role computers are taking in today's
music and from what springboard, technologically, MIVI's realisation
must begin. This will entail a review of previous and current research
and technologies in the field of computer music, with special emphasis
on human-computer interaction (HCI), musically-driven graphics and
computer music performance systems (CMPS).
The fundamental basis for MIVI will be an interface for the interaction
of man, machine and music, and therefore, in addition to an exploration
of the technological, it will be worth quickly delving into the
psychological, with reference to applications of technology in education.
Finally, following reviews of all the relevant fields and technologies,
we'll take a closer look at a similar project to MIVI, called DIVA,
and see whether the findings of the Finnish team who conducted it
are of use in our venture.
Since its conception
in 1982, the MIDI protocol has played an important role in many
music-related projects, sometimes even to the point of inspiring
or prompting them. Indeed, the connection between it and this project
goes beyond just the spelling. However, whereas MIVI may be founded
on a MIDI base, the goal will require us to develop upon
the specification and protocol. It will therefore be wise, in addition
to reviewing the MIDI specification and General MIDI (GM) protocol,
to consider some aspects that lie beyond their scopes, and the previous
attempts to extend them.
This section aims to furnish the reader with a basic understanding of MIDI.
Contained herein is an explanation of all the concepts and terminology of
the protocol that are prerequisite to reading this report. However,
for those desiring a deeper or less specialised overview of the
subject, the author recommends the books mentioned as items ,
 and  of the bibliography.
MIDI stands for Musical Instrument Digital Interface and is the
technical specification  for a language of music - an encoding
of music at the semantic level. It contains no audio information
or guidelines for auralisation, such as sounds and waveforms, but
instead, like a score to a piano, specifies properties like pitch
and duration for each note, and characteristics like tempo for a piece.
In the common instance, it allows music to be recorded by a computer,
or dedicated hardware, from one instrument, then output for performance
by another. The process can be synchronous, by attaching the MIDI-out
(output port) of the first instrument to the MIDI-in (input
port) of the other, or asynchronous, by having the computer
record the signals from the first device to a file, which it can
store and play back through the second at any time. The reader is
referred to figure 1.1, in the previous chapter.
A performance can be encoded in this way and stored in an SMF
(Standard MIDI File), which can, in turn, be transmitted to a MIDI
device. A piano, if MIDI-enabled, can then pose as its own pianist
and play itself. Indeed, most modern electronic musical instruments
now support the MIDI standard – for example, the Yamaha CS1x
keyboard or the Korg Trinity-Rack.
MIDI performances can be augmented over time. You could record the
right-hand of a piece of piano music, then rewind and record the
left. The computer can then carry out the simultaneous playback
of the two phrases. The total number of notes (pitches) a device
can play simultaneously is denoted by its polyphony. If it
is only one, the device is monophonic.
The number of different sounds (timbres) a device can play at any
one time is called its timbrality, where one timbre might
be a violin and another a piano, etc. If it is more than one, the
device is multi-timbral.
Note that it is possible, and common, to have a multi-timbral, polyphonic
MIDI instrument, composed of both monophonic and polyphonic voices
– for example, a 16-part multi-timbral sound module with 128 voices
might include monophonic flute and violin voices, as well as polyphonic
piano and guitar voices.
An SMF can contain the encoding of multiple instruments, even a
whole orchestra, regardless of what MIDI input device you use; you
can play in a violin – or even drum – solo using your MIDI piano.
It is the job of the computer, through a program called a sequencer,
such as Steinberg Cubase VST or Cakewalk Pro Audio, to
keep time and co-ordinate the performance of every instrument –
fulfilling the typical role of the conductor. The sequencer takes
the performance (sequence), encoded in the SMF, as input
and, when commanded, streams, as output, notes and performance instructions,
as MIDI messages, to a MIDI device for immediate execution
(often auralisation). An apt analogy to a sequencer is a cassette
recorder – allowing both the playback and editing of music.
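To make the data flow concrete, the following minimal sketch plays back
an SMF in the manner of a sequencer. It assumes Python and the third-party
mido library (our choice of tooling, not part of the MIDI specification),
plus a hypothetical file performance.mid:

    # A minimal sequencer sketch: stream a stored SMF to a MIDI device.
    import mido

    sequence = mido.MidiFile('performance.mid')   # the stored performance
    with mido.open_output() as device:            # the MIDI-out connection
        for message in sequence.play():           # play() keeps time for us
            device.send(message)                  # immediate execution

The editing half of the cassette-recorder analogy is then simply the
manipulation of the stored message list before playback.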
MIDI messages are very small packets of only a few bytes, which
carry information on the note or performance instruction – MIDI
event – that is to be played or executed. Each MIDI event is
a packet of varying length comprising a single status byte, and
zero or more data bytes. An example is the 3-byte Note On
message, where the status byte (0x90
in hexadecimal) tells the device to start a note for the active
instrument at a certain pitch and velocity (volume), as denoted
by the accompanying two data bytes. By contrast, the Note Off
message can then be used to terminate it. Alternatively, using a
Note On message with null velocity will also stop the note.
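To make the byte format concrete, the sketch below assembles the raw bytes
of these messages in Python; the pitch and velocity values are purely
illustrative, and channel 1 is encoded as 0 in the low nibble of the
status byte:

    # Raw bytes of Note On / Note Off channel-voice messages.
    NOTE_ON, NOTE_OFF = 0x90, 0x80

    def note_on(pitch, velocity, channel=0):
        # status byte: command in the high nibble, channel in the low
        return bytes([NOTE_ON | channel, pitch, velocity])

    def note_off(pitch, channel=0):
        return bytes([NOTE_OFF | channel, pitch, 0])

    start = note_on(60, 96)    # start middle C, moderately loud
    stop  = note_on(60, 0)     # null velocity: equivalent to Note Off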
Most MIDI devices are capable of posing and performing as several
instruments – ordinarily, up to 16 different instruments can be
played at once, each receiving messages through their own channel.
In 1991, the General MIDI (GM) specification  defined
128 different MIDI voices (the full list of
these is given in Appendix A), which can be assigned to any of the
channels. In addition, since percussion sounds do not vary in pitch,
it is wasteful to have whole instruments dedicated to a single bass
drum or snare drum, etc. The specification therefore describes a
standard percussion kit, where note C in the lowest octave
is a bass drum, D a snare and so on. For example, the first
channel might be set to an Acoustic Grand Piano (voice #1) and the
second, a Nylon-string Guitar (voice #25), but another (often the
tenth, for percussion) to a Standard Drum Kit.
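Such an arrangement is established with Program Change messages (status
0xC0 plus channel). A hedged sketch, continuing the byte-level examples
above (recall that GM voice numbers are 1-based, while the data byte is
0-based):

    # Assign GM voices to channels with Program Change messages.
    def program_change(channel, gm_voice):
        return bytes([0xC0 | channel, gm_voice - 1])   # data byte is 0-based

    setup = [
        program_change(0, 1),    # channel 1: Acoustic Grand Piano
        program_change(1, 25),   # channel 2: Nylon-string Guitar
        # channel 10 (index 9) conventionally carries the percussion kit
        # under GM, so no Program Change is needed there
    ]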
An SMF is broken up into tracks, each having its own channel.
The subtle difference between a track and a channel is that you
can have two tracks with the same channel. Going back to our piano
example, the left-hand could be on Track 1 and the right on Track
2, both being sent to voice #1 (piano) on Channel 1.
This structure is illustrated in figure 2.1. The hierarchy has,
as its root, the sequence, encoded in the SMF, which is broken up
into tracks that have their own channel and contain all information
about MIDI devices, etc. Each track contains all the information
about the events from the opening to closing note of the piece,
for its respective part.
A sequencer is normally
connected to more than one input or output device. Thus, in music,
it is imperative that they cooperate and coordinate their activities
with each other, like the sections of an orchestra. Three different
types of messages exist to help the sequencer in this pursuit –
System Common, System Real-Time and System Exclusive (sysex).
The first two concern the timing of the piece. Whereas System Common
messages give the absolute position of playback in a piece, System
Real-Time messages issue the start and stop commands to control it,
as well as transmit MIDI clock (tick) messages, which devices can
use to synchronise with one another.
The MIDI specification’s endurance, however, can largely be attributed
to the third. Sysex messages allow for unrestricted byte-stream
communication through a MIDI connection, between MIDI devices. Originally
designed for the ‘bulk dumping’ of settings, or even waveform audio
between devices, it has been brought to bear in the real-time environment
and has been used by manufacturers to implement controls and functions
that are necessary to fully exploit the functionality of new MIDI
instruments, but not natively supported by the base MIDI command
set. It has, therefore, never been necessary to replace the specification,
merely to extend it.
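All sysex traffic shares only its framing: a start-of-exclusive byte
(0xF0), a manufacturer ID, an arbitrary payload and a terminator (0xF7).
A generic sketch, with a purely illustrative payload (the payload bytes
mean nothing outside a particular device's documentation):

    # Generic sysex framing; the payload's meaning is manufacturer-defined.
    SOX, EOX = 0xF0, 0xF7

    def sysex(manufacturer_id, payload):
        # data bytes must stay below 0x80 - the top bit marks status bytes
        assert all(b < 0x80 for b in payload)
        return bytes([SOX, manufacturer_id]) + bytes(payload) + bytes([EOX])

    message = sysex(0x43, [0x10, 0x4C, 0x00])   # 0x43 is Yamaha's ID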
As the name suggests,
sysex formats are defined exclusively at the system level and, thus,
the sysex messages for one device are not necessarily compatible
with another. However, the emergence and monopolisation of commercial
standards has brought some of these extensions into wider, sometimes
universal, use. In the next section, we will briefly look at two
such standards, with special attention to their extensions to the
GM soundset.
MIDI and GM extensions
The manufacturing giants of the music industry, Roland and Yamaha,
upon the widespread adoption of the MIDI protocol, recognised both
the potential and inadequacy of the GM specification, and seized
the opportunity to release their own extensions – General Sound
(GS)  and Extended (or Expressive) General MIDI (XG) , respectively.
The extensions address exactly the same deficiencies of their predecessor
and do so using almost exactly the same principles and methods.
However, due to the competitive nature of the market, the implementations
differ and, thus, most XG devices will not respond to GS commands
and vice-versa. Therefore, for our purposes, reviewing one will yield
as much insight into the other and, under the widely-held conviction
that Yamaha's offering is superior, we choose to cover the XG format.
Before we review each of the improvements borne by the standard,
though, it is important to establish the failing of General MIDI
that both companies set out to address.
deficiencies in the GM specification
Music, as an expressive art, is an imperfect science – it involves
and assigns value to nuances, quirks and irregularities – and demands
the ability to step outside the norm. MIDI, by contrast, was introduced
at a time of relative simplicity in the computer
– it had to be simple and efficient, too. MIDI is thus highly abstract
and technical, and, as a medium for expression, crude and inflexible.
A symphony orchestra conductor will be the first to notice that
confining the number of different instruments to 128, as General
MIDI does, is extremely crude. When you consider that Voice #41
is not only a violin, but also the MIDI 'ambassador' to all violins
in the world, the deficiency is magnified. Properties such as the
violin's size, maker and origin - all of which can have a profound
effect on the timbre (character) of a note - are instantiated to
one generic set of parameters. Imagine the reproduction of a string
quartet, where the two violinists on occasion play the same tune.
In the concert hall, though we can't distinguish which is which,
we are still aware of two violins. On the average MIDI instrument,
the waveforms are identical and, once superposed, would sound like
just one – albeit either twice the volume or partially phased.
The proposed solution to this problem, in part, only tends to exacerbate
the inflexibility of General MIDI. We notice that voice #51 and
voice #52 are not single instruments, but string sections.
Seemingly, this is to compensate for the loss in polyphony that
would derive from emulating each violin individually. It is also
provided as a quick and dirty solution to the problem of the superposed
violins that such an endeavour might present.
Interestingly, however, strings and brass are the only sections
to benefit from any adaptation that would permit their usage in
an ensemble context. This, though, is more a comment on the quality
of the MIDI sounds available when the protocol was introduced. Although
considerations of sound quality might deter classical musicians,
who will afford themselves a real orchestra for any number of the
reasons listed in this section, contemporary artists, particularly
of the 80’s, are more tolerant of, and even praise, the synthetic
sound of MIDI instruments.
Nevertheless, although it would be fairly painless to design and
implement MIVI as a multi-instrument application, there exists negligible
practical advantage to displaying more than one instrument (i.e.
ensembles) at a time – as discussed in section 1.2, MIVI is principally
for educational applications, and teaching multiple instruments
simultaneously is merely a recipe for disaster. Neither will this
report cover the implementation of either a string or a brass instrument.
The reader, however, can assume that ensemble GM voices (#49-#52
and #62) would be reduced to their respective solo visual incarnations.
On the other hand, it is conceivable that, in its maturity, and
combined with other research projects , MIVI could one day
be used as a conductor training tool.
Thus, in general, one of the biggest criticisms of the specification
was the lack of freedom of expression in terms of both instrument
varieties and individual instrument usage.
the GM soundset
Instead of just increasing
the number of available instruments to more than 128, Yamaha’s research
department opted to make the voice list for XG multi-dimensional.
So, for each instrument, there can be up to 128 sub-categories,
drastically increasing the total number of available timbres to
16,384. An example structure of XG's Violin voices, taken from the
Yamaha CS1x synthesizer , is illustrated in figure 2.2. It should
be noted that only a small number of the 128 possible sub-categories
actually differ in character from that of the original instrument
and, furthermore, that any variation in timbre is the result of
the original sound being put through an effects processor, as opposed
to coming from a different source.
fig 2.2 - Yamaha's extension to GM: Violin voice sub-categories (e.g. 'Two Octaves higher')
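In practice, this second dimension is addressed through the standard
Bank Select controllers (CC#0 for the most significant byte, CC#32 for
the least), sent before a Program Change. A sketch; the bank number 8
below is a hypothetical sub-category, not a documented XG assignment:

    # Select an extended voice: Bank Select MSB/LSB, then Program Change.
    def select_voice(channel, bank_msb, bank_lsb, gm_voice):
        return (bytes([0xB0 | channel, 0x00, bank_msb]) +   # CC#0:  bank MSB
                bytes([0xB0 | channel, 0x20, bank_lsb]) +   # CC#32: bank LSB
                bytes([0xC0 | channel, gm_voice - 1]))      # Program Change

    message = select_voice(0, 0, 8, 41)   # a hypothetical violin variation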
This is clearly an improvement
on the original specification, and manages to maintain legacy compatibility
with General MIDI, but is still not ideal. Ideally, we'd want categories
like 'Stradivarius Violin' and so forth. Sadly, their inclusion
would be of limited use, since today's sound synthesis engines are
not able to reproduce tones of sufficient realism, especially when
applied to solo string instruments. Our exploration of MIDI extensions
thus requires a review of current synthesis techniques.
From the principle that graphics
and sound are intrinsically linked, often as products of the same
device – hence, follows the visibility of audible objects and vice-versa
– our exploration of the visual side of music will benefit from
an analysis of the audible.
In our endeavour to recreate instruments, we must know how they work to
produce sound. More to the point, to meaningfully teach these workings,
it is imperative to understand how the instruments are manipulated
to produce not only sound, but also music.
Initially, this will involve a brief discussion of instruments in
the real world. However, both instruments and sound generation techniques
have already been extensively studied from an aural standpoint –
a core component of the Music Technology field is the pursuit of
more realistic and expressive technologically generated sounds and
music – and thus, we shall cover, in more detail, previous translations
of the art into science.
As we shall see, many
of these aural enterprises can give us guidance in our own endeavour.
instruments in the real world
A large proportion of music listening is done in the absence of
visual stimuli. This is made possible by the nature of sound – although
involving the movement of objects, which are often visible, sound
is the result of minuscule vibrations, undetectable to the human
eye. The irrelevance of the visual means that sound is definable
from base physical principles. Music is reduced to simply the manipulation
of these sound waves and we must thus consider the instrument from
this aspect. It should be noted that this is a centuries-old area
of study, and much more is known about it than is permissible or
relevant in this report. The author recommends Rossing  as a
truly remarkable book, yielding deep insight into the subject.
the physics of music
A violin, for example, uses the friction of a bow to induce oscillations
in a string. In this form alone, the induced sound waves haven’t
the amplitude (in essence, volume) to be heard, but the resultant
minor vibrations of the violin’s bridge permeate into the hollow
body of the instrument, where the larger internal surface area gives
rise to an amplified wave. This wave can then reflect off the interior
of the body several times, before leaving through the f-holes of
the violin. The waves' collision with the human eardrum causes physical
displacements, which are translated into electrical impulses and
sent to the brain. The brain then interprets the frequency and variations
in amplitude that it receives into the commonly recognisable violin
timbre.
This process is similar to that in the other members of the string
family. Furthermore, pianos and guitars, as 'stringed' instruments,
use much the same process – differing only in the initial induction
of the source wave: a piano string is hit with a padded hammer,
and a guitar plucked with a finger.
Farther afield, even more diverse families employ degrees of the
same process – both brass and woodwind instruments rely on the reflection
of waves in a resonant corpus. This time, the body is an open-ended
tube and the blowing of air induces the wave. Whereas for brass,
this excites the resultant wave directly, for woodwind, it is used
to excite vibrations in a reed that, in turn, produces the resultant
wave.
Each method contributes
towards the unique timbre of sound produced by the instrument, but
it should be obvious to the reader that the methods themselves are
not so different within certain instrument groups.
Most sound synthesis techniques are sample based – that is, for each
MIDI voice, the notes have been digitally recorded from a performance
in the real world. However, recording a quality sample for all notes
is costly – a high quality sample for each piano key (88 in total)
would fill the average instrument's allotted memory, leaving little
or no room for the other 127 voices. Instead, one pitch is recorded
and stored in memory, and by electronically varying the playing
speed (or frequency) of the recorded pitch, the other notes can
be simulated. Generally, however, as you get farther away from the
original pitch, the electronic ‘transposition’ results in a noticeable
loss of realism.
Therefore, for a higher quality voice, a multi-sample is used. A
piano voice, for example, might have the C note of each octave recorded
in memory and use it to produce the rest of the octaves’ notes.
The higher the density of sampled notes, the better the quality of the
reproduction. In the extreme, some digital pianos do dedicate their
entire memory to one or two piano voices, sacrificing variety for
quality. Note, however, that they also sacrifice their adherence
to the General MIDI protocol.
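The arithmetic behind this 'transposition' is simple: under equal
temperament, shifting by n semitones means replaying the sample at a
speed ratio of 2^(n/12). A sketch of multi-sample selection on this
principle, with a hypothetical table of recorded pitches:

    # Multi-sampling: pick the nearest recorded pitch, then resample it.
    recorded = [24, 36, 48, 60, 72, 84, 96]   # hypothetical Cs, one per octave

    def playback(target_note):
        source = min(recorded, key=lambda s: abs(s - target_note))
        semitones = target_note - source
        return source, 2 ** (semitones / 12)   # playback speed ratio

    source, ratio = playback(67)   # G above middle C: nearest sample is 72,
                                   # replayed at roughly 0.75x speed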
The pinnacle example of sample-based synthesis came recently, in
the form of Hans Zimmer, one of the most popular composers in Hollywood.
Speculating to accumulate, he booked the London Philharmonic Orchestra
for an extended
private session, during which he proceeded to journey around the
orchestra, instrument by instrument, section by section, and sample
every note and phrase he (or they) could imagine. At the end of
the session, he had stored gigabyte upon gigabyte of audio data,
and has since managed to remove the need for all but the most trivial
orchestral participation in his movie scores. It is also worth noting
that the hardware used by Zimmer does not employ any accepted instrument
/ sample naming convention – such as GM, GS or XG – other than that
set out by himself.
In section 2.4.1, we compare MIVI to video tuition and note that
video is not as flexible as the real instrument or a simulation
thereof. Zimmer’s approach similarly suffers from this inflexibility
– any musical phrase he does not have, he cannot synthesise without
referring back to the orchestra.
Sample-based synthesis, by itself, exploits no implicit similarities
between instruments or their families. In addition to extending
the voice list for General MIDI, GS and XG both tried to develop
on the freedom of expression available to the instruments of the
GM soundset, by using sound effects processing. Although not evident
in the violin voice, in figure 2.2, these extensions do acknowledge
the existence of such implicit relationships in other voices. In
figure 2.3, which shows voice #25 – the Nylon-string Guitar – the
reader should notice the Ukulele in bank 96.
fig 2.3 - Yamaha's extension to GM: Nylon-string Guitar voice sub-categories (e.g. 'Nylon-stringed Guitar 2', 'Nylon-stringed Guitar 3')
The explanation is simple and inherent in the architecture
of the XG system. Effects processes are simply manipulations of
the sound at the waveform level. From the previous section, we noted
the relative similarity of sound generation techniques. In this
instance, the step from a Nylon-stringed Guitar to a Ukulele is
trivial and their actual composition is very similar: they both
have multiple strings, wooden bodies, etc. The only difference could
be in the shape of the resonant chamber, the material of the string
or type of wood. The generic XG effect process has modelled the
consequence of these altered parameters and resulted in the abstraction
of two instrument timbres to a common source waveform – an instrument
has been added without adding a sample to memory.
In the next section, we talk about the ultimate extension of this
abstraction – the entirely synthetic production of sounds.
Given precise knowledge of the processes inherent to real instruments,
as introduced in section 2.2.1 and detailed in Rossing , we
can artificially fabricate the waveform by generating basic wave
oscillations and simulating the appropriate reflections, refractions,
amplifications and dampening, etc.
Furthermore, once this is achieved for a violin, we can also adapt
the algorithms for other string instruments relatively painlessly.
It then also follows that the combination of more adaptation and
further innovation would yield synthesis techniques for other instrument
families and genres.
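As a minimal illustration of the principle – and emphatically not the
proprietary SONDIUS XG algorithms discussed below – the classic
Karplus-Strong model synthesises a plucked string by circulating a burst
of noise through a damped delay line:

    # Karplus-Strong plucked-string synthesis: a physical-modelling sketch.
    import random

    def pluck(frequency, duration, sample_rate=44100, damping=0.996):
        period = int(sample_rate / frequency)    # delay line ~ string length
        line = [random.uniform(-1, 1) for _ in range(period)]   # the 'pluck'
        out = []
        for _ in range(int(duration * sample_rate)):
            sample = line.pop(0)
            # averaging adjacent samples models energy loss in the string
            line.append(damping * 0.5 * (sample + line[0]))
            out.append(sample)
        return out

    samples = pluck(440.0, 2.0)   # two seconds of a plucked A

Altering only the frequency (string length) and damping yields the rest
of the string family, echoing the adaptability claimed above.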
This is not a new theory and has been the subject of considerable
research and successful implementation already. One research project,
in the late 90’s, was successful and mature enough to breach the
academic boundary and enter the commercial market – Yamaha and Stanford
University's illustrious SONDIUS XG Virtual Acoustic Modelling system .
Contrary to what the name suggests, SONDIUS XG has no voice naming
hierarchy as in its namesake, XG. Instead, there is a more implicit
system of inheritance between voices, dictated by the modelling
algorithms that they employ.
Before this, the very acceptance of a convention, such as orchestral
families, already recognised the similarity of a group of instruments.
In both cases, the distinctions are based on both the method each
family uses to produce sound, whether it be plucking, bowing, blowing
or hitting, and other structural properties of the instrument –
for example, its material, such as wood or brass.
For physical-modelling techniques, like SONDIUS XG, these families
translate conveniently to synthesis models. String instruments,
like violins, violas and cellos, as well as most keyboard instruments,
draw upon algorithms, which simulate the sound waves produced by
the oscillations of a string upon plucking, bowing or hitting. Wind
instruments, like oboes, clarinets and flutes, rely on the modelling
of sound waves passing down a wood or metal tube. The SONDIUS
XG system also suggests the possibility of partitions based on the
driver and resonant body components, and further abstracts them
from their physical models.
fig 2.4 - the SONDIUS XG driver / resonant system architecture
Because most string instruments use a bow, the respective model is invariably
generic: an algorithm coupling friction-based scraping, and string
oscillation. The difference simply lies in the parameters – the
coefficient of friction, the length of the string, etc. This implicit
categorisation in the system’s naming of instruments as combinations
of Drivers and Resonant Systems is illustrated in figure 2.4. Note
that one remarkable feature of this architecture is that any combination
of driver and resonant system can be used, allowing for imaginative
instruments such as breath-driven violins.
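The architecture invites an object-oriented sketch in which any driver
may excite any resonant system; all class and parameter names below are
our own illustrations, not SONDIUS XG terminology:

    # Illustrative driver / resonant-system composition.
    class Driver:
        def excite(self):                 # returns a source waveform
            raise NotImplementedError

    class Bow(Driver):
        def excite(self):
            return "friction-scrape source"

    class Breath(Driver):
        def excite(self):
            return "air-jet source"

    class ResonantSystem:
        def __init__(self, length, material):
            self.length, self.material = length, material

        def resonate(self, source):       # filters the driver's output
            return f"{source} resonated in {self.material}, length {self.length}m"

    # any pairing is legal - even the breath-driven violin:
    hybrid = ResonantSystem(0.35, "maple").resonate(Breath().excite())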
fig 2.5 - the string family: (a) violin, (b) viola, (c) cello, (d) double bass
Furthermore, although it may be musical blasphemy to say that a
cello is simply a big violin, visually, this is essentially the
case (as can be gauged from figure 2.5). Thus, physical modelling
can apply the same generalisation to the resonant body, simply altering
the dimensions to enable the correct internal reflection and refraction
of the sound waves.
This extrapolation is especially beneficial when we consider that
the driver can change part-way through a piece of music. A violin,
for example, can be bowed or plucked. Indeed, a cello can even be
bowed, fingered and plucked simultaneously.
On the score, all this requires is a comment above the stave
saying arco or pizzicato (respectively); MIDI, however, has
no formal method of encoding this performance direction. To simulate
it, you must assign the relevant music to a separate bowed violin
instrument (voice #41) or plucked violin instrument (voice #46),
invariably over two different MIDI tracks or channels. In MIVI,
if we were to automate which instrument is displayed (or driver
is used), based on the incoming MIDI data, this might create a problem.
For this reason, amongst others (discussed later), control over
MIVI’s instrument selection will rest with the user. Aside from
a small burden to convenience, the impact is minimal, in the context
of MIVI's application.
As regards resonant bodies,
the string family is perhaps too simple an example. By inspecting
the brass family, we see that our categorisation of instruments
should be more involved than simple segregation into families, and
we are again encouraged towards the driver / body split. However,
whereas forced air (or pneumatics) is the driving force, universal
to almost all brass instruments, the resonant body and method for
controlling the pitch can, this time, vary as well. For example,
take the trombone, with its slide, as opposed to the trumpet, with
its valves.
Returning to the problem of differentiating
violins in our string quartet scenario, posed in section 2.1.2,
the sound of an instrument in the real world is also varied by the
manner in which it is played – the proficiency and technique of
the performer, or lack thereof. Indeed, much of what tells us that
something is real (or human) is, cynically, what is imperfect
about it – the gasping breath sounds of a trumpeter, the pale scraping
of a bow, sliding too lightly across the violin string, etc.
Computer aided music is often condemned for its lack of expression,
which, though one could blame on DJ's of today's music scene, is
more likely attributable to the MIDI specification. Indeed, it has
often been criticised for being too centred on keyboard instruments
and interfaces , resulting in an expressive model which is little
more advanced than variation in volume – velocity.
Performance is central
to our goal – MIVI is an instrument performance tutor. The lack
of expression in MIDI, also identified in section 2.1.2, must be
overcome if we are to instruct in performance of any instrument
save the piano.
Research by a number of institutions has produced CMPS's, or Computer Music
Performance Systems, which are designed to record more expression
in music, allowing the computer to more accurately emulate the human
element in performance. These systems can also bring expression
to instruments where it wasn't previously found, such as synthesizers.
computer music performance systems (CMPS's)
FORMULA (Forth Music Language)  is such a system. It was an attempt,
in the early 90’s, to build upon the MIDI protocol – to add emotion
and expression, etc. to the audio output, through the creation of
a music programming language. On p.23 of the paper, Anderson and
Kuivila also briefly mention the potential of visual output, but
concede that the hardware of the time, combined with their system
architecture, would introduce ‘unacceptable delays’ and prevent
a meaningful exploration of the medium.
There are two immediate problems involved with promoting such data
to the visual layer, in an application like MIVI. The first is technical;
"How does one show emotion in graphics?" The pursuit of
an answer to this could be defined by the history of ‘Art'. Realistically
and, perhaps, crudely, the only way to do it would be to have a
representation of the performer and his face as he (or she) plays
the piece. Indeed, the face is often used in art as an interface
to the emotion of a painting’s subject. It is, after all, not the
instrument that humans perceive as having the emotion, but the player.
The other problem is of practicality. In an educational program,
the useful applications of carrying expressive performance directives
through to the visual layer appear minimal, since expression is
often an aggregate of one's own soul and experience – tacit
knowledge, which is difficult, if not impossible to teach. However,
there can often be more than one method, or technique, available
for playing a note or phrase, which is important for the performer
to be aware of.
Technique, although related, is not synonymous with skill – it is a means to
skilfulness. Although, on a piano, teachers will train students
to use particular fingers on particular occasions, it need not always
improve the quality of the performance. Instead, it makes the performance
easier, so that the move to the next echelon of ability becomes attainable.
One aspect of performance
that instantly presents itself, when stepping outside the piano,
is fingering choice. A guitar, for example, has six strings, and
the dynamic range of each overlaps with not only the adjacent, but
beyond that as well. There thus exists a choice of methods to finger
a single note, and some will be easier than others. If we are to
display the instrument as it relates to the score, we must choose
one of these methods. However, as an educational tool, we must make
sure that our choice is suitable for the learner.
fig 2.6 - table of possible guitar fingerings
In figure 2.6, we illustrate a diagram identifying the possible
fingerings for various pitches on a guitar fret, where each number
denotes the distance from the fundamental pitch (in semitones) –
in this case E(00)
– and each row is a different string. One can see the repeated occurrence
of equivalent pitches across several different strings (equivalents
have been emphasised). The problem facing the performer – and, by
transitivity, our visual interface – is which to choose at any one
time.
In the above scenario, let us assume that we want to play the C
and that the succeeding note is an A(17),
so, if possible, the algorithm should, for simplicity's sake, avoid
changes of string or hand position. Intuitively, it would be best
to pre-emptively place the first finger on an A,
and the second (or third) on an adjacent C,
so the transition can be made painlessly and involve only a simple
removal of the extraneous finger. This rules out the C on the
B-string, since no such A exists, leaving two options, which –
under our current constraints – are equally attractive.
So, we introduce another constraint - that of sound quality. Plucking
an open (un-fingered) string produces a much 'cleaner' sound than
if the player were to finger the note by depressing a lower string
nearer the bridge (towards the right of our diagram). Simplistically,
this gives us the heuristic: the further left the number on our
diagram, the better the quality of the tone. Readers interested in
implementing a more detailed set of constraints that account for
quality are referred to Taylor's  paper.
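These two constraints translate directly into a toy scoring function over
(string, fret) candidates; all weights here are hypothetical, chosen only
to show the shape of such a heuristic:

    # Toy fingering heuristic: penalise string and hand-position changes,
    # and prefer positions nearer the open string ('further left').
    def cost(position, previous=None):
        string, fret = position
        penalty = 1.0 * fret
        if previous is not None:
            prev_string, prev_fret = previous
            penalty += 2.0 * abs(string - prev_string)   # string change
            penalty += 1.5 * abs(fret - prev_fret)       # hand-position shift
        return penalty

    candidates = [(2, 1), (3, 5), (4, 10)]   # hypothetical equivalent pitches
    best = min(candidates, key=lambda p: cost(p, previous=(3, 2)))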
Incidentally, in the case of other fingered-string instruments,
like the violin, players are encouraged to choose fingerings
in preference to open strings, in order to maintain a uniform sound
quality across all notes. Thus, for applications servicing multiple
instruments, such as MIVI, it will be important to give careful
consideration to the domain of their fingering algorithms. For example,
how much of a guitar fingering algorithm might translate to a violin
application? Even on a lower level – how much modification is required
to adapt a violin-fingering algorithm to a cello? Does the introduction
of the thumb, in this case, present a large problem, or can we simply
treat it as a fifth finger? We leave these exercises to the reader
and referenced literature.
Returning, and restricting ourselves, to our guitar scenario: we
have seen that, with just two constraints, a decision can be made for
each note (the previous decision has been shaded in the diagram).
Notice, however, that we have also made an assumption (the position
of the succeeding A), which allowed us to further constrain the
decision. A note's reliance on its successor suggests a ‘lookahead’
is called for – the succeeding note’s position must be calculated
before the current note’s and, of course, the same is true for the
succeeding note. This recursive relationship will propagate the
decision to the final note of the music, where a decision, based
on no successor, must be arbitrarily made. The results then cascade
back to the initial note. Thus, we see that, in this implementation,
although implementable using a constraint-satisfaction algorithm
, fingering calculations cannot be a real-time operation, but
must be pre-processed.
This is how the DIVA system (discussed in section 2.5) works.
The music is analysed and all the fingering positions are worked
out and stored before the performance. It is a simple approach and
will result in a polished performance, depending on the algorithm’s
constraints. It requires, however, a pre-processing step before
each new piece of music – the duration of which will vary depending
on the complexity of the piece and the number of musicians in the
ensemble, and could, potentially, be quite costly. In comparison
to the typical human approach, though, the duration of this step
will be drastically smaller than a standard ensemble’s equivalent
during rehearsal sessions.
However, let us consider the human performer more closely. The player’s
approach depends on their level of skill and experience, most notably
in the sphere of sight-reading. A less experienced musician might
play the piece through once, annotating the score with fingering
observations as they occur, restarting, each time, to verify them
– the human equivalent of the DIVA system.
An experienced musician, on the other hand, might be able to perform
a small lookahead in his mind and anticipate optimal fingerings,
similar to a Grandmaster of chess deducing the next seven or eight
moves before they happen. This latter tactic gives us insight into
how to implement a real-time fingering system. By restricting the
lookahead to a limited amount and forcing the arbitrary decision
before reaching the final note, we could conceivably avoid the heavy
load on the processor, enabling the decision to be taken as the
note is played.
Indeed, it is self-evident that the fingering for the final note
in a symphony movement has little or no influence on that of the
first. We can apply this analogy on much smaller intervals and,
by selecting a suitable lookahead size, in relation to these intervals,
not only reduce the processing overhead sufficiently to allow for
the decisions to be made in real-time, but forego any significant
hit in performance quality. There are several points in a score
where influence does not propagate, such as rests (silences), and
less critical points where even more involved fingering shifts are
less costly, such as following long notes, or stretches of repeated
notes, where thinking time is more abundant – both are examples
of good lookahead limits that occur frequently throughout most
pieces of music.
However, in the absence of such appropriate junctures, an algorithm
might have to force a lookahead limit in order to prevent hogging
of the processor, thus avoiding glitches in song playback. In this
case, a decision is forced, for the current note, based on less
than all the facts.
Such decisions can often be simplified by the use of expert systems
or neural networks, such as Bayesian Networks  or Hebbian Learning
 (respectively), where instead of simple deterministic choices,
statistical reasoning is employed. Hence, after a modest lookahead,
though we might not be able to say for certain what the best choice
is, we have statistical weights denoting which choice is most likely
to be the best choice. Bayesian Network systems that allow for learning,
such as BUGS , can even be combined with cached experience,
which can assist us further, by basing the fingering, for a phrase
of finite length, on what action was taken last time the question,
or similar question, was posed.
Using a similar principle
of learned experience, Sayegh  successfully employed the Optimum
Path Paradigm (OPP), an approach based around constraint satisfaction,
to tackle the fingering problem in stringed instruments. Taking
the number of strings to be S, the number of playable notes
on each string, N, with NT (= S × N)
representing the total number of finger positions possible,
the algorithm first populates an NT × NT
matrix, W, with the cost of every possible transition between
fingerings, penalising string and hand position changes – as in
our original example. The matrix can then be queried one or more
times, given a sequence of notes, to produce a set of solutions.
All existing solutions are then combined to form a weighted, directed
graph (see figure 2.7), which can be solved in polynomial time using
the Viterbi algorithm , a refinement of the algorithm for finding
paths of least cost by Dijkstra .
fig.2.7 – Sayegh's use of weighted, directed graphs for the sequence G, A, C, E
For a classical guitar
(S=6, N=36), the number of calculations to populate W is
a phenomenal 216 × 216 = 46,656. Unfortunately, aside from
a trivial identity mapping, no ‘optimising’ observations can be
employed to streamline the process – fingering transitions are not
normally commutative. However, the algorithm need only be executed
once per instrument – the matrix can be stored for use in the future.
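A sketch of both stages of the query follows, reusing a transition-cost
function like the toy heuristic sketched earlier; the least-cost path
through the layered graph is then recovered with a Viterbi-style dynamic
programme:

    # Populate W, then query it over the candidate fingerings of each note.
    def build_w(positions, transition_cost):
        return {(a, b): transition_cost(a, b)
                for a in positions for b in positions}

    def best_fingering(note_candidates, w):
        # note_candidates: one list of candidate positions per note
        costs = {p: 0.0 for p in note_candidates[0]}
        back = []
        for layer in note_candidates[1:]:
            step, choice = {}, {}
            for b in layer:
                a = min(costs, key=lambda p: costs[p] + w[(p, b)])
                step[b], choice[b] = costs[a] + w[(a, b)], a
            back.append(choice)
            costs = step
        path = [min(costs, key=costs.get)]    # cheapest final position
        for choice in reversed(back):
            path.append(choice[path[-1]])     # walk the backpointers
        return list(reversed(path))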
Sayegh refines the approach by introducing a second stage, based
on Hebbian learning , a form of expert system. He uses a second
matrix of equal dimensions, W', the entries of which are
initialised to zero. The results of the first stage then go through
a learning phase, using W to generate responses to various
training material in the form of sequences of musical notes. Each
time the algorithm suggests a transition between two fingerings,
the corresponding entry in W' is incremented by 1.
After a sufficiently large sample input, the new matrix represents
experience, which can be used to base fingering decisions on, either
solely, or in addition to the original matrix. The accuracy of this
sort of Hebbian learning (as with Bayesian Networks) is often difficult
to accept from base principles, and those new to the idea often
benefit from practical and quantitative examples, which the reader
should be able to find in most of the books on the subject .
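Reduced to code, the learning stage is a simple counting rule. A sketch,
reusing best_fingering from the previous listing:

    # Hebbian second stage: count the transitions the first stage suggests.
    from collections import defaultdict

    def train(training_pieces, w):
        w_prime = defaultdict(int)               # W', initialised to zero
        for note_candidates in training_pieces:
            path = best_fingering(note_candidates, w)
            for a, b in zip(path, path[1:]):
                w_prime[(a, b)] += 1             # reinforce chosen transition
        return w_prime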
In MIVI, we are not designing a playing aid, but a teaching aid.
The performers – in this case, guitarists – need to learn how to
apply the constraints and work out fingering themselves, and should
not become reliant on the presence of a computer. The constraints
should be considered equivalent to 'playing tips and tricks', and
can be presented as such in the program. Therefore, given a piece
of music, the user should be able to specify which techniques the
computer employs, and thus which to tackle learning themselves.
The true beginner starts with none, possibly using only his forefinger
to adjust the pitch of his instrument, and can then introduce
the 'tricks' one-by-one, adding them and advancing at his leisure.
Furthermore, when the choice of employing a specific technique conflicts
with that of another, or represents a different – as opposed to
'better' – practice, the decision that must be made is one of playing
'style'. Sayegh provides a convenient method of implementing such
a system, with each set of constraints (or playing style), represented
by a pre-compiled matrix.
Whereas the 'tips and tricks' indicative of a playing style might
be defined by the constraints of the algorithm's first stage, an
expert guitarist's playing style might be encoded by running just
the learning stage on them, instead of the algorithm – getting them
to manually increment the contents of matrix W', as they
make their own decisions, given the learning material.
Hypothetically, a 'complete' implementation would not only allow
you to define sets of constraints corresponding to the playing styles
of guitarists (John Williams, Eric Clapton, BB King, etc.), but
also more general sets, optimised for their genres (respectively:
classical, rock, blues, etc.). Simply restricting the learning phase's
training set to pieces in the required genre could develop such
sets.
The initial implementation
of MIVI, documented later in the report, will not include such performance
memories and fingering algorithms. However, when it becomes time
to codify our flute, we shall see that a simple fingering decision
is required. Furthermore, we shall attempt to ensure that if – or
when – the MIVI system is extended for other instruments, such as
the guitar, the architecture is able to support the inclusion of
such algorithms.
The awkward and visible sign is the syllabus, a table of contents
which lays down what the student is required to do and what
is examined… The syllabus narrows the student's vision of the
edge of knowledge and cuts him off from precisely those fuzzy
areas at the edges of subjects that are the most interesting
Small, p.186-7 in Music-Society-Education 
The moral of this fable is that, if you're not sure where you're
going, you’re liable to end up somewhere else.
Mager, Preface to Preparing Instructional Objectives 
Today, when it comes to learning an
instrument, there is a multitude of teaching methods available to
the music student and, as a player of several instruments, the author
has much experience in this role. In this section, we give a concise
and critical history of the academic field, with focus on the practices
of musical instrument tuition and the use of computers as an adjunct.
Traditional tuition demands  the tried and tested technique of one-to-one –
teacher to pupil – lessons on a regular and frequent basis. During
daily, weekly or fortnightly lessons, the pupil’s performance is
assessed by the teacher, whose competence then determines the quality
and availability of positive critical feedback. Furthermore, the
instructor also sets the syllabus, perhaps via an examining body,
and suggests homework and beneficial extra-curricular activities.
It is in this way that the author has learnt to play the violin.
The strict scheduling of lessons and proliferation of deadlines
and exams can, however, contradict the rationale of music itself
– to be an enjoyable and entertaining occupation. After all, for
most people, playing music will be considered a leisure activity.
The two passages, given at the beginning of section 2.4, are quoted
from Swanwick , who is but one of many scholars  who
have arrived at this conclusion.
Additionally, and often more importantly, the expense of expert
tuition can sometimes be prohibitive . In the case of many instruments,
especially during the sometimes painfully slow early stages, it
can be difficult to identify what your money is paying for.
For some, this might persuade the student that a self-taught approach
is more financially viable – if only to get them to a stage where
they can evaluate the merit of expert, third party tuition. However,
even at the individual level, there are a number of different approaches,
many complemented by commercially available teaching aids.
The first is the bloody-minded, do-or-die approach. In the case
of learning to swim, this involves jumping in the deep end and,
more often than not, results in either drowning or the somewhat
crude and expedited erudition of the fundamentals of swimming –
namely, the doggy paddle. In the case of the aspirant maestro and,
at one time, the author, this might involve selecting a complex piano
sonata – Bach, for example – and practising until they can play it.
This approach will only work for some, notably pre-seasoned musicians,
and, although coarse and cursory, will elevate the subject’s proficiency
across a much larger repertoire than simply the chosen piece. Indeed,
once one has deciphered Bach, little stands between them and most
other Baroque piano music, and the same holds for many other
genres.
However, music is an age-old pursuit, and it is naïve to think that
just the score and the instrument will endow one with enough experience
to become a maestro. Indeed, the score has been as heavily criticised
 as credited for its provision of information and encapsulation
of the performance. As in the extreme of our aquatic example, one
may be able to swim, but not necessarily to swim well.
the explicit, the
implicit and the tacit
As an art, most notably a performing art, musical knowledge extends
beyond the explicit – what is written on the page. In this way,
traditional tuition will always have an advantage over the entirely
self-supported approach, for it allows the elicitation of implicit
knowledge – that which can be articulated but isn’t – and tacit
knowledge – that which can’t – from the expert to the learner. So,
whereas the self-taught student might be able to play the music,
they will be ignorant of tips, tricks and shortcuts, to allow for
easier playing and more freedom to express themselves – another
defining characteristic of art. Indeed, for many instruments, such
gaps in knowledge will inhibit the student from moving to higher
echelons of play. Moreover, for instruments less intuitive than
the piano, the lack of knowledge and guidance about how to equate
the notes on the page to the instrument can severely hinder, often
prevent, even the slightest progress.
Music, incidentally, is the only art form requiring literacy 
– whereas a painter need only apply the brush to a canvas, a musician
is reduced to working through a layer of abstraction – the score.
As we saw in section 2.1, such obstacles can be compounded with
further layers, such as MIDI.
Therefore, when teaching oneself, a natural step is to find literature
on the subject. In searching for such secondary materials, one is
seeking an expert who has tried to put everything they know on paper
– both stating the explicit and articulating the implicit, so that
it is, in its new form, explicit. To what extent the expert has
achieved this goal, determines the relative quality of a material,
and there are several different approaches to the problem available.
When one is looking for such literature, as this author endeavoured to do
when learning the guitar, the sheer variety available is daunting.
In its infancy, this market was dominated by books  where the
expert and publisher employed text and nothing else. Soon after,
these were superseded by ones  containing simple diagrams to
help tackle explanations of fingering and posture, etc. Then, with
the advent, and relative cheapness, of black & white  and,
later, colour photography, we have books advertising the "all-visual
approach to learning to play the [instrument]" .
In music, the importance of exposure to the visual aspect of playing
an instrument can sometimes be as important as the audible aspect
– the classic 'Monkey See, Monkey Do' philosophy is actually of
benefit in this case and has been formalised, in research, more
recently. Chappell  states that the ability to internalise music
(to hear, or picture, it in one's mind) is of considerable advantage
in the process of developing musical skill. Nevertheless, other
studies  show this skill to be present in only a handful
of the world's greatest musical geniuses.
Ben-Or , whose views on the subject are widely held, states "If
one can really perceive a passage of music with all clarity and
represent it to oneself mentally as it relates to the instrument,
then there is no obstacle left for the body to freely express it
in sound". Thus, to present music in a visual 'instrumental'
form, pre-converted from its abstract score format, should remove
a number of such obstacles.
Static visualisations, though a step in the right direction, often
do not convey the subtleties of playing the instrument. On a purely
technical level, pictures can rarely capture motion. For a simple
example, the book may illustrate 3 key stages of strumming on a
guitar; the up, the strum and the return to the default hand position.
In all, the strum should last 1 second. With the average human eye,
working at a frequency of 72Hz (72 pictures a second), it is up
to the individual to anticipate 69 of them, or 96% of the motion.
Strumming is a relatively simple procedure and most, if not all, beginners
are able to master it after but a few attempts, allowing books to
economise in the early stages. However, more involved motions, which
should be accompanied by even more enhanced illustrations, are,
instead, usually accompanied by none. Though the cynic might put
this down to the price it costs a publisher to print pictures, the
true reason is likely to be the limited level of information expressed
in only one image. It simply isn’t plausible to express, in a page,
complicated techniques where 10 or more key stages, each shot from
multiple angles, are required to deliver a true understanding –
a deficiency compounded in some books by a tendency to illustrate
the instrument from the listener's view, rather than the player's,
forcing a translation, this time of geometry, on the learner.
Another, more general failing with 2D (diagrams) and static pseudo-3D
(photos / illustrations) material is the lack of the fourth
dimension of time – a reader, following examples, note by note,
cannot gauge their progress, since although they are playing the right
notes in the right manner, they are not aware of temporal aspects
of the piece. In the case of tempo itself, the speed (or lack thereof)
of their performance is likely to be lower, allowing for deceptively
easy progress.
Therefore, some books are accompanied by an Audio-CD  that demonstrates
what the reader should be reciting. This helps to a great extent,
but still lacks cohesion, in that, although the music on the page
may be recognisable in the playback, the process that connects one
to the other still relies on the quality of text and pictures, and
requires some quick-thinking on the reader’s part. Slack  also
recommends the use of audio aids in music education, but stresses
the importance of accompanying explanation.
As far back as 1964, he advocated the augmentation of audio with
visual aids  in the learning process – extolling the virtues
of being able to zoom in and focus on parts of the instruments.
Thus, as the technology became available, the trend took us from
the written word through the spoken and illustrated, to the motion
picture.
the motion picture
This philosophy is not
new to education, and has even been brought to bear in the form
of multimedia music tutorials - CD-ROMs of text, pictures and video
demonstrations. Videos are digitised film, which is to MIVI what
the Audio-CD is to MIDI. Instead of a data stream, like videos and
CD’s, MIVI will be built on an object-oriented structure for performances.
Where MIDI can tell you the pitch and volume of each note, plus
allow you to alter tempo, etc., MIVI will draw upon this to provide
total-immersion graphical playing environments, with editable angles,
magnification and speeds plus allow the isolation of a single instrument
and its components, for closer scrutiny.
fig.2.8 - domains of education methods
In figure 2.8, the
common stages and routes to musical expertise, as discussed in the
previous paragraphs, are summarised with relation to their competences.
Some features of the table are worth mentioning. Firstly, it is
clear that when one becomes an expert, tuition is no longer required,
and development now relies on a self-supported exploration of music
and performance, although possibly with external influences from
other performers. Secondly, we conclude that a degree of tacit knowledge,
not serviceable by oneself or media, is required to achieve expertise.
It should also be noted that the table holds no information relating
to speed, ease or expense of any particular method, or combination
of methods. Given the problems with current self-tuition materials,
and yet, also those of expert third-party tuition (introduced in
section 1.2), most learners demand the latter. It is hoped that
MIVI, by plucking some of the advantages of the third-party, will
close the gap between the two – in a sense simulating a virtual
music teacher, for the earlier stages.
As Roland – a world-renowned musical instrument manufacturer – states
, arguing a similar case for its own interactive tutoring products,
"[M]usic learning with private lessons begins with 30 minutes
of excellent guidance and coaching. But, now a crucial difference
takes place. Instead of the regular, routine play-through of the
school ensemble music, the student goes home to a week-long period
of unguided, occasional practice."
They continue by identifying the distinctions between teacher and
tutor – the first is to provide direction, the second to guide
them on their journey. In their case and ours, the respective products'
competences fall into the latter.
The promotion to the
former might involve the inclusion of implicit and explicit technique
theory, including the migration of expertise from the real
to virtual incarnations, which has resulted in the division of MIVI
into two domains in figure 2.8. The first represents
the intended capabilities of the implemented MIVI system as documented
in this report; introducing the student to the instrument
and teaching the student how to relate the score to said instrument.
The second builds on the first with the ominous requisite of ‘performance
knowledge', and was discussed in section 2.3.2.
computers in education
The buzzword in multimedia education is 'interactivity'. Having argued
the ‘learning by seeing’ case, many of the same principles can be,
and are, similarly employed in advocating ‘learning by doing’ .
Using interactive materials, not only can the user get instruction,
but feedback too. The interactive device can have algorithms to
analyse the user's performance and give qualitative and quantitative
feedback on their errors, as would a genuine tutor.
E.R. Steinberg  mentions several methods of interactive learning
through the use of the computer. ‘Drills’ are the purest form of
Computer-Assisted Instruction (CAI) and simply aid the memorisation
of symbols, such as the periodic table, or connections, such as
our score note to key mapping, through the standard stimulus / response
format. Even in this simple context, detailed feedback is possible,
since the computer can tell the learner what they got wrong, where
they tend to fail (statistically) and to what extent they are in
error. Furthermore, the execution of the drill can be tailored from
learner response – getting harder as they do better or vice-versa.
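A minimal sketch of such an adaptive drill loop follows; the question
bank, the five-response window and the promotion thresholds are all
hypothetical choices:

    # An adaptive drill: difficulty tracks the learner's recent accuracy.
    def run_drill(questions_by_level, ask, rounds=20):
        level, recent = 0, []
        for _ in range(rounds):
            question, answer = questions_by_level[level].pop()
            recent = (recent + [ask(question) == answer])[-5:]   # last five
            if sum(recent) >= 4 and level < len(questions_by_level) - 1:
                level += 1        # doing well: promote to harder material
            elif sum(recent) <= 1 and level > 0:
                level -= 1        # struggling: drop back
        return level

Here, ask is whatever stimulus / response channel the drill uses – a
prompt on screen, or a note to be played on a MIDI keyboard.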
Drills, however, only test knowledge already present in the user.
Either alternatively or in combination, E.R. Steinberg advocates
the use of ‘simulations’, where the learner is presented with a
virtual representation of the subject material. She states, "Insights
about complex scientific principles often come from experience with
such concepts in an interactive environment".
A flight simulator, for example, allows trainee pilots to familiarise
themselves with the cockpit of an aeroplane without the expense
(in terms of money, space and time) of a real plane or flight.
Closer to our context, the use of typing tutor software is also
an embodiment of this concept. As a drill and a simulation, it is
able to show the user what to do, regulate the act itself and give
qualitative and quantitative feedback on their performance.
We have seen how interactive media can replace the implicit teachings
of a music teacher, but how can we extend this to tacit knowledge?
As it becomes more widely acknowledged that we are living in an
age where information is paramount, industry looks to retain such
knowledge, without necessarily retaining workers, through Knowledge
Management and Information Systems.
One approach is to concede that tacit knowledge cannot be encoded
and copied, but ‘cloned’ instead. In our case, instead of asking
an expert guitarist to write a book about how the reader should
play the guitar – which, as we have shown, will not implant his expertise
in the reader, invariably losing something in the translation –
we simply ask him to write a book about how he himself plays
the guitar. Although the knowledge cannot be copied, the
rendition of a particular piece of music (the performance) can
be. So, we instruct the learner to recreate the performance(s),
trusting that, in their efforts and attempts to execute the task,
the learner will derive the tacit knowledge for himself and will,
in future, be able to sub-consciously apply it to the generic musical case.
Thus, returning to our
discussion on the relative merits of MIVI over existing soft and
hard literature, we see that, whereas existing teaching material
is often restricted to a specific genre, such as "Jazz Piano"
or "Blues Guitar", MIVI should have the capacity to cross
these boundaries, simply by selecting a MIDI file and performance
memory from the desired genre.
the DIVA project
fig. 2.9 – the DIVA system in action
In this section, we will introduce
the reader to DIVA, a project closely linked with MIVI, and review
some of the core principles and ideas surrounding it. We will also
discuss the relative merits and shortcomings of DIVA, as well as
isolate the differences between it and our own endeavour, MIVI.
Other more specific aspects of the DIVA project are covered elsewhere
in the report, as they become relevant to our own endeavour.
In 1992, a team of scholars from the Helsinki University of Technology
embarked on a number of research projects  that,
in 1997, culminated in the design and development of a Digital Interactive
Virtual Acoustic (DIVA) environment . This involved the creation
of a totally immersive 3D environment, embodying the performance
of a small Virtual Orchestra of Virtual Musicians, led by a human
conductor, with physically-modelled instruments and acoustics, accurate
both aurally and visually, down to the spatial dispersion of
sound and the fingering of instruments, as illustrated in figure 2.9.
Drawing from expertise
in multiple fields, such as digital signal processing (DSP), graphics,
acoustic modelling and sound synthesis, the team studied and
successfully combined technologies in all these areas, culminating
in an interactive performance at the SIGGRAPH'97 conference.
the DIVA system
The impressively endowed system architecture (figure 2.10 (a))
comprises no fewer than three multi-processor SGI workstations, each
assigned a select number of competences, networked to external
synthesisers and an Ascension MotionStar for conductor input.
Figure 2.10 (b)
is adapted from Huopaniemi et al  and illustrates the DIVA system
information flow, with the corresponding scope of the MIVI system
appended. It should be noted that our project will, in fact, address
an even more confined scope than that indicated by the diagram,
with our MIDI-to-movement mapper and Animation control avoiding
the overhead involved with the processing of virtual biped musicians
and multiple instruments. It is intended that the MIVI system should
be suitable for execution on a traditional, self-contained, uni-processor
system – a home computer or other such workstation.
However, while sharing
some common attributes, MIVI and DIVA differ in objective and target
audiences. Thus, we find that, in a couple of our interest areas,
such as fingering and visual detail, the DIVA research is less extensive.
fig. 2.10 (a) – the DIVA system architecture
fig. 2.10 (b) – the DIVA system information flow, with the scope of MIVI appended
This stems from the fact that the DIVA system is principally orientated
towards use by sound engineers
and experts – be that for movie sound or general recording-studio
work. MIVI, on the other hand, will focus on establishing itself
in an educational role, aimed, initially, at the unskilled user,
possibly in a studio environment, but more likely in the school
or the home. The group's research, as such, illustrates an emphasis
on acoustics and audio realism, with principal focus on sound spatialisation,
room-response modelling and HRTFs (head-related transfer functions),
rather than on the graphical and performance accuracy that, as discussed,
will be crucial in our own venture.
Many articles and papers, by different members of the group, provide
overviews of the various aspects of the project, and it is, therefore,
necessary to direct the reader towards Lokki et al  for one
with more relevance to our field – the visualisation of musical
instruments in real-time.
Recalling our discussion of fingering algorithms in section 2.2.3,
DIVA employs a simple and computationally efficient, yet slightly
musically naïve, least-distancing algorithm to solve problems arising
when there exists more than one way to finger a note. When presented
with a choice of fingerings, this approach consists of simply choosing
the solution requiring the least hand or finger movement – favouring
ease over quality. Such decisions are made using Critical Path Analysis
(see ) – an extension to Sayegh’s  use of weighted, directed
graphs (discussed in section 2.2.3).
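To make the principle concrete, the following is a minimal sketch, in Python, of least-movement fingering selection, phrased as a shortest path through a layered, weighted, directed graph in the spirit of Sayegh’s formulation. The candidate fingerings and the movement-cost function are invented for illustration and are not taken from the DIVA or Sayegh papers.

def least_movement_fingering(candidates, cost):
    """candidates: a list of layers, where layer i holds the possible
    fingerings for note i. cost(a, b): the movement cost of going from
    fingering a to fingering b. Returns the fingering sequence with the
    minimal total movement, found by dynamic programming over the
    layered graph (each layer depends only on the one before it)."""
    # best[f] = (total cost of the cheapest way to reach f, path taken)
    best = {f: (0.0, [f]) for f in candidates[0]}
    for layer in candidates[1:]:
        nxt = {}
        for f in layer:
            c, path = min(
                (prev_cost + cost(prev, f), prev_path)
                for prev, (prev_cost, prev_path) in best.items()
            )
            nxt[f] = (c, path + [f])
        best = nxt
    return min(best.values())[1]

# Toy example: fingerings as fret positions on a single guitar string,
# two candidate frets per note; movement cost is distance along the neck.
if __name__ == "__main__":
    note_options = [[3, 9], [5, 10], [7, 12]]
    print(least_movement_fingering(note_options, cost=lambda a, b: abs(a - b)))
    # prints [9, 10, 12] - the least total hand travel (1 + 2 = 3 frets)

Note how, exactly as criticised above, the optimisation considers only hand travel: a musically preferable but more distant fingering can never win.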
Although, as claimed, the fingerings are "realistic" 
– in that they are possible – they are far from optimal and, although
adequate in an entertainment environment, should be further scrutinised
in an educational or reference role. This compromise is also evident
in the conducting styles recognised by the system when it processes
data from the input baton – only the most generic are identified.
The implementation of both features is, nonetheless, more than adequate
to demonstrate the practicality of their inclusion. The team’s
fingering efforts do, however, introduce an interesting aspect of
visual simulation – dynamic (or timed) fingering. Actions like guitar
strums or percussion hits require preparation, such as the ‘up’
movement (mentioned in section 2.4.1) or the initial movement of
the drumstick towards the drum skin.
Lokki et al  also concede that more weight has been given to
the realism of the ‘scene’ than to that of the individual instruments
and musicians themselves. This is most markedly illustrated by the
‘3D cartoon puppet’ format of the musicians. Citing psychological
reasons , the DIVA team conclude that maintaining a high animation
frame rate and focusing on realistic character motions allow
for a scene that the human brain can more easily accept as, or equate
to, reality. The appearance of instruments and musicians has thus
been optimised for efficiency rather than quality. Indeed, from
the stills, it appears that the flute’s visualisation does not extend
to the rendering of the flute’s keys.
To further minimise the computational load at run-time, instead
of computing the fingering and musician animation in real-time during
playback, the motions are compiled for the entire performance prior to execution.
It is not clear from the literature whether the DIVA system utilises
an automatic inline system to compile the motions – such as that
suggested by Lytle  – or a separate standalone application,
to the same effect, or whether the animation files must be compiled
by hand. If either of the latter two is the case, then MIVI has
the great advantage of an immediately accessible repertoire of MIDI
files, unrestrained by the pre-requisite of an accompanying animation
file – in theory, every Standard Midi File. Furthermore, the real-time
nature of MIVI allows for spontaneity in the music, like the introduction
of musical improvisation.
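By way of illustration, the following is a minimal sketch, in Python, of such a real-time mapping: each incoming MIDI channel-voice message is translated into a movement target the moment it arrives, so any Standard Midi File – or live, improvised input – can drive the display without a pre-compiled animation file. The MovementTarget structure, the fixed finger choice and the key-position formula are illustrative assumptions, not MIVI’s actual design.

from dataclasses import dataclass

@dataclass
class MovementTarget:
    finger: int       # which finger to animate (fixed here for brevity)
    position: float   # where the key sits along the keyboard
    press: bool       # True for a key press, False for a release

def key_position(note):
    # Hypothetical layout: one unit per semitone from A0 (MIDI note 21),
    # the lowest key of an 88-key piano.
    return (note - 21) * 1.0

def map_event(status, note, velocity):
    """Translate one MIDI channel-voice message into a movement target.
    status: 0x90 = note-on, 0x80 = note-off (channel bits masked off);
    a note-on with velocity 0 is, per the MIDI specification, a note-off."""
    if status == 0x90 and velocity > 0:
        return MovementTarget(finger=1, position=key_position(note), press=True)
    if status == 0x80 or (status == 0x90 and velocity == 0):
        return MovementTarget(finger=1, position=key_position(note), press=False)
    return None   # other messages (controllers etc.) produce no movement

# Example: middle C (MIDI note 60) pressed, then released.
if __name__ == "__main__":
    print(map_event(0x90, 60, 100))
    print(map_event(0x80, 60, 0))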
Like the animation files, the DIVA team store the data for instruments
in external files . In both cases, the system parses ASCII descriptions
and must be generic enough to be able to process varied instruments
(flutes, violins, drums, etc.). Such a ‘plug-in’ architecture,
attaching instruments and players to the DIVA Virtual Orchestra,
allows for flexible system extendibility. How the team achieved this, and the
extent to which they did, is not clear from the available literature,
but its description alone encourages consideration of similar functionality
in our MIVI application (see Chapter 4).
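As a sketch of how such a plug-in scheme might look, the following Python fragment parses a simple ASCII instrument description into a generic structure. The key = value syntax below is entirely invented for illustration; the DIVA literature confirms only that instruments are described in external ASCII files, not how those files are laid out.

def load_instrument(text):
    """Parse a simple ASCII instrument description into a dict.
    Lines are 'key = value'; indexed keys ('key.0', 'key.1', ...) collect
    into lists, so varied instruments (flutes, violins, drums) can declare
    however many keys, strings or drum pads they have."""
    instrument = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()   # strip comments and blanks
        if not line:
            continue
        key, _, value = (part.strip() for part in line.partition("="))
        if "." in key:                        # indexed entry -> list
            base, _, _ = key.rpartition(".")
            instrument.setdefault(base, []).append(value)
        else:
            instrument[key] = value
    return instrument

EXAMPLE = """
name   = flute            # instrument identity
model  = flute.mesh       # geometry file
key.0  = 0.10 0.00 0.02   # positions of each key on the body
key.1  = 0.13 0.00 0.02
key.2  = 0.16 0.00 0.02
"""

if __name__ == "__main__":
    print(load_instrument(EXAMPLE))

Because the loader makes no assumptions about which keys appear, adding a new instrument amounts to writing a new description file rather than new code – the property that makes the architecture attractive for MIVI.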
MIVI's focus on just the instrument should also permit us to dedicate
considerably more of the available real-time computational power
to this problem, as well as allow us to research more elegant and
involved solutions in this area. Thus, for MIVI, the lines from
the MIDI file and Instrument definition boxes to the MIDI-to-movement
mapper, in figure 2.10 (b), should be emphasised in bold, indicating
the real-time nature of this flow.
Indeed, it would be interesting to see the fruits of the DIVA project
updated for today’s hardware. The progress, particularly in 3D graphics
hardware, during the last few years has been pronounced. Setting aside
the distributed and multi-processor advantages of the DIVA hardware,
the performance of high-end consumer systems, such as those used
in the development of MIVI, could represent a comparable level of
computational power. A combination of DIVA’s scene dynamics and
MIVI’s attention to detail could lead to one of the most realistic
real-time computer simulations yet.