
GOTO 2017 • Composing Bach Chorales Using Deep Learning • Feynman Liang


[Music]

Cool, thank you. Just so I know what level to speak at: raise your hands if you know who Bach is. Great. Raise your hand if you know what a neural network is. Oh, this is the perfect crowd, awesome. If you don't know, don't worry, I'm going to cover the very basics of both. So let's talk about Bach. I'm going to play you some music.

[Music]

Now, what you just heard is what's known as a chorale. There are four parts to it, soprano, alto, tenor, bass, playing at the exact same time, and there's very regular phrasing structure, where you have the beginning of a phrase, the termination of a phrase, followed by the next phrase. Except that wasn't Bach. Rather, that was a computer algorithm called BachBot, and that was one sample out of its outputs. If you don't believe me, it's on SoundCloud, it's called "sample one", go listen for yourself. So instead of talking about Bach today, I'm going to talk to you about BachBot. Hi, my name is Feynman, and it's a pleasure to be here in Amsterdam. Today we'll talk about automatic stylistic composition using long short-term memory.

composition using long short term memory so then a background about myself I’m

so then a background about myself I’m

so then a background about myself I’m currently a software engineer at gigster

currently a software engineer at gigster

currently a software engineer at gigster where I walk at work on interesting

where I walk at work on interesting

where I walk at work on interesting automation problems regarding I’m taking

automation problems regarding I’m taking

automation problems regarding I’m taking contracts divided them into sub

contracts divided them into sub

contracts divided them into sub contracts and then freelancing them out

contracts and then freelancing them out

contracts and then freelancing them out the work on Bach bot was done as part of

the work on Bach bot was done as part of

the work on Bach bot was done as part of my master’s thesis where which I did at

my master’s thesis where which I did at

my master’s thesis where which I did at the University of Cambridge with

the University of Cambridge with

the University of Cambridge with Microsoft Research Cambridge in line

Microsoft Research Cambridge in line

Microsoft Research Cambridge in line with the track here I do not have a PhD

with the track here I do not have a PhD

with the track here I do not have a PhD and so and I still can do machine

and so and I still can do machine

and so and I still can do machine learning so this is the fact this is a

learning so this is the fact this is a

learning so this is the fact this is a fact you can do machine learning without

fact you can do machine learning without

fact you can do machine learning without a PhD for those of you who just want to

For those of you who just want to know what's going to happen and then get out of here because it's not interesting, here is the executive summary. I'm going to talk to you about how to train, end to end, starting from dataset preparation all the way to model tuning and deployment, a deep recurrent neural network for music. This neural network is capable of polyphony, multiple simultaneous voices at the same time. It's capable of automatic composition, generating a composition completely from scratch, as well as harmonization: given some fixed parts, such as the soprano line of the melody, generate the remaining supporting parts. This model learns music theory without being told to do so, providing empirical validation of what music theorists have been using for centuries. And finally, it's evaluated on an online musical Turing test, where out of 1,700 participants only nine percent were able to distinguish actual Bach from BachBot.

When I set off on this research, there were three primary goals. The first question I wanted to answer was: what is the frontier of computational creativity? Now, creativity is something we take to be innately human, innately special; in some sense, computers ought not to be able to replicate this about us. Is this actually true? Can we have computers generate art that is convincingly human?

generate art that is convincingly human the second question I wanted to answer

the second question I wanted to answer

the second question I wanted to answer was how much does deep learning impacted

was how much does deep learning impacted

was how much does deep learning impacted automatic music composition now

automatic music composition now

automatic music composition now automatic music composition is a special

automatic music composition is a special

automatic music composition is a special field it has been dominated by symbolic

field it has been dominated by symbolic

field it has been dominated by symbolic methods which utilize things like formal

methods which utilize things like formal

methods which utilize things like formal grammars or context-free grammars such

grammars or context-free grammars such

grammars or context-free grammars such as this parse tree we’ve seen

as this parse tree we’ve seen

as this parse tree we’ve seen connectionist methods in the early 19th

connectionist methods in the early 19th

connectionist methods in the early 19th century however we have it however they

century however we have it however they

century however we have it however they have they followed in popularity and

have they followed in popularity and

have they followed in popularity and most recent systems have used symbolic

most recent systems have used symbolic

most recent systems have used symbolic methods with the work here I wanted to

methods with the work here I wanted to

methods with the work here I wanted to see did the new advances in deep

see did the new advances in deep

see did the new advances in deep learning in the last 10 years can they

learning in the last 10 years can they

learning in the last 10 years can they be transferred over to this particular

be transferred over to this particular

be transferred over to this particular problem domain and finally the last

problem domain and finally the last

problem domain and finally the last question I wanted to look at is how do

question I wanted to look at is how do

question I wanted to look at is how do we evaluate these generative models I

we evaluate these generative models I

we evaluate these generative models I mean we’ve seen we’ve seen in the

mean we’ve seen we’ve seen in the

mean we’ve seen we’ve seen in the previous talk a lot of a lot of models

previous talk a lot of a lot of models

previous talk a lot of a lot of models they generate art we look at it and as

they generate art we look at it and as

they generate art we look at it and as the author we say oh that’s convincing

the author we say oh that’s convincing

the author we say oh that’s convincing but oh that’s beautiful and great that

but oh that’s beautiful and great that

but oh that’s beautiful and great that might be a perfectly valid use case but

might be a perfectly valid use case but

might be a perfectly valid use case but it’s not sufficient for publication to

it’s not sufficient for publication to

it’s not sufficient for publication to publish something we need to establish a

publish something we need to establish a

publish something we need to establish a standardized benchmark and we need to be

standardized benchmark and we need to be

standardized benchmark and we need to be able to evaluate all of our models about

able to evaluate all of our models about

able to evaluate all of our models about it so we can objectively say which model

it so we can objectively say which model

it so we can objectively say which model is better than the other now if you’re

Now, if you're still here, I'm assuming you're interested. This is the outline. We'll start with a quick primer on music theory, giving you just the basic terminology you need to understand the remainder of this presentation. We'll talk about how to prepare a dataset of Bach chorales. We'll then give a primer on recurrent neural networks, which is the actual deep learning model architecture used to build BachBot. We'll talk about the BachBot model itself, and the tips, tricks, and techniques that we used in order to train it, have it run successfully, as well as deploy it. And then we'll show the results. We'll show how this model is able to capture statistical regularities in Bach's musical style, and we'll provide (we won't prove, but we'll provide) very convincing evidence that music theory does have empirical justification. And finally, I'll show the results of the musical Turing test, which was our proposed evaluation methodology for saying: yes, the task of automatically composing convincing Bach chorales is more of a closed problem than an open one as a result of BachBot. And if you're a hands-on type of learner, we've containerized the entire deployment, so if you go to my website here, I have a copy of the slides, which have all of these instructions. You run these eight lines of code and it runs this entire end-to-end pipeline right here, where it takes the chorales, preprocesses them, puts them into a data store, trains the deep learning model, samples the deep learning model, and produces outputs that you can listen to.

Let's start with basic music theory. Now, when people think of music, this is usually what you think about: you've got these bar lines, you've got notes, and these notes are at different horizontal and vertical positions. Some of them have interesting ties, some of them have dots, and there's this interesting little weird hat-looking thing. We don't need all of this; we need three fundamental concepts. The first is pitch. Pitch is often described as how low or how high a note is. So if I play this, we can distinguish that some notes are lower and some notes are higher in frequency, and that corresponds to the vertical axis here: as the notes sound ascending, they appear ascending on the bar lines. The second attribute we need is duration, and this is really how long a note is. So this one note, these two notes, these four, and these eight all have equal total duration, but each is half the length of the one before, so if we take a listen...

The general intuition is: the more bars there are on these ties, the faster the notes appear. With just those two concepts, this is starting to make a little bit more sense. This right here is twice as fast as this note; we can see this note is higher than this note; and you can generalize this to the remainder of the piece. But there's still this funny hat-looking thing; we'll get to the hat in a sec. With pitch and duration, we can rewrite the music like so: rather than representing it using notes, which may be kind of cryptic, we show it here as a matrix, where on the x-axis we have time, so the duration, and on the y-axis we have pitch, how high or low in frequency that note is. And what we've done is we've taken the symbolic representation of music and turned it into a digital, computable format that we can train models on.

train models on back to the hat looking

train models on back to the hat looking thing this is called a Fermata and Bach

thing this is called a Fermata and Bach

thing this is called a Fermata and Bach used it to denote the ends of phrases we

used it to denote the ends of phrases we

used it to denote the ends of phrases we had originally said about this research

had originally said about this research

had originally said about this research completely neglecting for modest and we

completely neglecting for modest and we

completely neglecting for modest and we found that the phrases generated by the

found that the phrases generated by the

found that the phrases generated by the model just kind of wandered they never

model just kind of wandered they never

model just kind of wandered they never seem to end there was no sense of

seem to end there was no sense of

seem to end there was no sense of resolution or conclusion and that was

resolution or conclusion and that was

resolution or conclusion and that was unrealistic but by adding these four

unrealistic but by adding these four

unrealistic but by adding these four modest all of a sudden the model turned

modest all of a sudden the model turned

modest all of a sudden the model turned around and we and we suddenly found

around and we and we suddenly found

around and we and we suddenly found realistic phrasing structure cool and

realistic phrasing structure cool and

realistic phrasing structure cool and that’s all the music you need to know

that’s all the music you need to know

that’s all the music you need to know the rest of it is machine learning now

the rest of it is machine learning now

the rest of it is machine learning now the biggest part of a machine learning

the biggest part of a machine learning

the biggest part of a machine learning engineer’s job is preparing their data

engineer’s job is preparing their data

engineer’s job is preparing their data sets this is a very painful task usually

sets this is a very painful task usually

sets this is a very painful task usually have to scour the internet or find some

have to scour the internet or find some

have to scour the internet or find some standardized data set that you train and

standardized data set that you train and

standardized data set that you train and evaluate your models on that usually

evaluate your models on that usually

evaluate your models on that usually these data sets have to be pre processed

these data sets have to be pre processed

these data sets have to be pre processed and massaged into a format that’s

and massaged into a format that’s

and massaged into a format that’s amenable for learning upon and for us it

amenable for learning upon and for us it

amenable for learning upon and for us it was no different box works however

was no different box works however

was no different box works however fortunately over the years have been

fortunately over the years have been

fortunately over the years have been transcribed into excuse my German Bach

transcribed into excuse my German Bach

transcribed into excuse my German Bach worka Vera – Nix BW sorry

worka Vera – Nix BW sorry

worka Vera – Nix BW sorry dwv is how I’ve been referring to this

dwv is how I’ve been referring to this

dwv is how I’ve been referring to this corpus it contains about all 438

corpus it contains about all 438

corpus it contains about all 438 harmonizations of Bach

harmonizations of Bach

harmonizations of Bach Corral’s and conveniently it is

Corral’s and conveniently it is

Corral’s and conveniently it is available through the software package

available through the software package

available through the software package called music21

called music21

called music21 this is a Python package that you can

this is a Python package that you can

this is a Python package that you can just tip install and then import it and

just tip install and then import it and

just tip install and then import it and now you have an iterator over a

now you have an iterator over a

now you have an iterator over a collection of music the first

collection of music the first

collection of music the first pre-processing step we did is we took

pre-processing step we did is we took

pre-processing step we did is we took the music the original music here and we

the music the original music here and we

the music the original music here and we did two things we transposed it and then

did two things we transposed it and then

did two things we transposed it and then we quantize it in time now you can

we quantize it in time now you can

we quantize it in time now you can notice the transposition by looking at

notice the transposition by looking at

notice the transposition by looking at these accidentals right here these two

these accidentals right here these two

these accidentals right here these two little funny backwards or forwards B’s

little funny backwards or forwards B’s

little funny backwards or forwards B’s and then they’re absent over here

and then they’re absent over here

and then they’re absent over here furthermore that note has shifted up by

furthermore that note has shifted up by

furthermore that note has shifted up by half a line that’s a little hard to see

half a line that’s a little hard to see

half a line that’s a little hard to see but it’s happening and the reason why we

but it’s happening and the reason why we

but it’s happening and the reason why we did this is we didn’t want to learn key

did this is we didn’t want to learn key

did this is we didn’t want to learn key signature key signature is usually

signature key signature is usually

signature key signature is usually something decided by the author before

something decided by the author before

something decided by the author before the pieces even begun to compose and so

the pieces even begun to compose and so

the pieces even begun to compose and so we can and so key signature itself can

we can and so key signature itself can

we can and so key signature itself can be injected as a pre-processing step

be injected as a pre-processing step

be injected as a pre-processing step where we sample over all the keys Bach

where we sample over all the keys Bach

where we sample over all the keys Bach did use so we removed key fingers from

did use so we removed key fingers from

did use so we removed key fingers from the equation through transposition and

the equation through transposition and

the equation through transposition and I’ll justify why that’s an okay thing to

I’ll justify why that’s an okay thing to

I’ll justify why that’s an okay thing to do in the next slide this first measure

do in the next slide this first measure

do in the next slide this first measure is written is is a progression of five

is written is is a progression of five

is written is is a progression of five notes written in C major and then what I

notes written in C major and then what I

notes written in C major and then what I did in the next measure is I just moved

did in the next measure is I just moved

did in the next measure is I just moved it up by five whole steps

it up by five whole steps

it up by five whole steps [Music]

So yeah, the pitch did change: it's relatively higher, it's absolutely higher on all accounts, but the relations between the notes didn't change, and the sensations, the motifs that the music is bringing out, those still remain fairly constant even after transposition. Quantization, however, is a different story. If I go back to the slides, you'll notice quantization took this thirty-second note and turned it into a sixteenth note by removing that second bar: we've distorted time. Is that a problem? It's not perfect, but it's a very minor problem. Over here I've plotted a histogram of all of the durations inside of the chorale corpus, and this quantization affects only 0.2% of all the notes that we're training on. The reason that we do it is that by quantizing in time we're able to get discrete representations in both time as well as in pitch. Working on a continuous time axis is something computers, which are discrete, cannot do; the continuous representation has to be quantized into a digital format somehow.

a digital format somehow the last

a digital format somehow the last challenge polyphony so polysemy is the

challenge polyphony so polysemy is the

challenge polyphony so polysemy is the presence of multiple simultaneous voices

presence of multiple simultaneous voices

presence of multiple simultaneous voices so far the examples that I’ve shown you

so far the examples that I’ve shown you

so far the examples that I’ve shown you you’ve just heard a single voice playing

you’ve just heard a single voice playing

you’ve just heard a single voice playing at any given time but a Corral has four

at any given time but a Corral has four

at any given time but a Corral has four voices the soprano the alto the tenor

voices the soprano the alto the tenor

voices the soprano the alto the tenor the bass and so here’s a question for

the bass and so here’s a question for

the bass and so here’s a question for you if I have four voices and they can

you if I have four voices and they can

you if I have four voices and they can each represent 128 different pitches

each represent 128 different pitches

each represent 128 different pitches that’s the constraint in MIDI

that’s the constraint in MIDI

that’s the constraint in MIDI representation of music how many

representation of music how many

representation of music how many different chords can I construct very

different chords can I construct very

different chords can I construct very good yes 128 ^ 4 that’s correct

good yes 128 ^ 4 that’s correct

good yes 128 ^ 4 that’s correct I put a Big O because some like some

I put a Big O because some like some

I put a Big O because some like some like you can rearrange the ordering but

like you can rearrange the ordering but

like you can rearrange the ordering but more or less yeah that’s correct and why

more or less yeah that’s correct and why

more or less yeah that’s correct and why is this a problem well this is the

is this a problem well this is the

is this a problem well this is the problem because most of these chords are

problem because most of these chords are

problem because most of these chords are actually never seen especially after you

actually never seen especially after you

actually never seen especially after you transposed a C major a minor in fact

transposed a C major a minor in fact

transposed a C major a minor in fact looking at the data set we can see that

looking at the data set we can see that

looking at the data set we can see that just the first 20 chords or 20

just the first 20 chords or 20

just the first 20 chords or 20 notes rather occupy almost 90% of the

notes rather occupy almost 90% of the

notes rather occupy almost 90% of the entire dataset so if we were to

entire dataset so if we were to

entire dataset so if we were to represent all of these we would have a

represent all of these we would have a

represent all of these we would have a ton of symbols in our vocabulary which

ton of symbols in our vocabulary which

ton of symbols in our vocabulary which we had never seen before the way we deal

we had never seen before the way we deal

we had never seen before the way we deal with this problem is by serializing so

with this problem is by serializing so

with this problem is by serializing so that is instead of representing all four

that is instead of representing all four

that is instead of representing all four notes as an individual symbol we

notes as an individual symbol we

notes as an individual symbol we represent each individual note as a

represent each individual note as a

represent each individual note as a symbol itself and we serialized in

symbol itself and we serialized in

symbol itself and we serialized in soprano alto tenor bass order and so

soprano alto tenor bass order and so

soprano alto tenor bass order and so what you end up getting is a reduction

what you end up getting is a reduction

what you end up getting is a reduction from 128 to the 4th all possible chords

from 128 to the 4th all possible chords

from 128 to the 4th all possible chords into just 128 possible pitches now this

into just 128 possible pitches now this

into just 128 possible pitches now this may seem a little unjustified but this

may seem a little unjustified but this

may seem a little unjustified but this is actually done all the time with

is actually done all the time with

is actually done all the time with sequence processing if you took like

sequence processing if you took like

sequence processing if you took like take a look at traditional on language

take a look at traditional on language

take a look at traditional on language models you can represent them either at

models you can represent them either at

models you can represent them either at the character level or at the word level

the character level or at the word level

the character level or at the word level similarly you can represent music either

similarly you can represent music either

similarly you can represent music either at the note level or at the chord level

at the note level or at the chord level

at the note level or at the chord level after serializing the the data looks

after serializing the the data looks

after serializing the the data looks like this we have assembled a noting the

like this we have assembled a noting the

like this we have assembled a noting the start of a piece and this is used to

start of a piece and this is used to

start of a piece and this is used to initialize our model we then have the

initialize our model we then have the

initialize our model we then have the four chords soprano alto tenor bass

four chords soprano alto tenor bass

four chords soprano alto tenor bass followed by a delimiter indicating the

followed by a delimiter indicating the

followed by a delimiter indicating the end of this frame and time has advanced

end of this frame and time has advanced

end of this frame and time has advanced one in the future followed by another

one in the future followed by another

one in the future followed by another soprano alto tenor bass we also have

soprano alto tenor bass we also have

soprano alto tenor bass we also have these funny-looking dot things which I

these funny-looking dot things which I

these funny-looking dot things which I came up with to denote the self firmata

came up with to denote the self firmata

came up with to denote the self firmata so that we can encode when the end of a

so that we can encode when the end of a

so that we can encode when the end of a phrases in our input training data after

phrases in our input training data after

phrases in our input training data after all of our pre-processing our final

all of our pre-processing our final

all of our pre-processing our final corpus looks like this there’s only 108

corpus looks like this there’s only 108

corpus looks like this there’s only 108 symbols left so not a hundred all

symbols left so not a hundred all

symbols left so not a hundred all hundred 28 pitches are used in Bach’s

hundred 28 pitches are used in Bach’s

hundred 28 pitches are used in Bach’s works and there’s about I would say four

works and there’s about I would say four

works and there’s about I would say four hundred thousand total where we split

hundred thousand total where we split

hundred thousand total where we split three hundred and eighty thousand or

three hundred and eighty thousand or

three hundred and eighty thousand or three hundred and eighty thousand into a

three hundred and eighty thousand into a

three hundred and eighty thousand into a training set and forty thousand into a

training set and forty thousand into a

training set and forty thousand into a validation set we split between training

validation set we split between training

validation set we split between training and validation in order to prevent

and validation in order to prevent

and validation in order to prevent overfitting we don’t want to just

overfitting we don’t want to just

overfitting we don’t want to just memorize box Corral’s rather we want to

memorize box Corral’s rather we want to

memorize box Corral’s rather we want to be able to produce very similar samples

be able to produce very similar samples

be able to produce very similar samples which are not exact identical and that’s

which are not exact identical and that’s

which are not exact identical and that’s it with that you have the training set

it with that you have the training set

it with that you have the training set and it’s encapsulated by the first three

and it’s encapsulated by the first three

and it’s encapsulated by the first three commands on that slide I showed earlier

commands on that slide I showed earlier

commands on that slide I showed earlier with Bach

with Bach

with Bach make data set Bach bot extract

make data set Bach bot extract

make data set Bach bot extract vocabulary the next step is to train the

vocabulary the next step is to train the

vocabulary the next step is to train the recurrent neural network to talk about

recurrent neural network to talk about

recurrent neural network to talk about recurrent neural networks let’s break

recurrent neural networks let’s break

recurrent neural networks let’s break the word down recurrent neural network

the word down recurrent neural network

the word down recurrent neural network I’m going to start with neuro neural

I’m going to start with neuro neural

I’m going to start with neuro neural neural just means that we have very

neural just means that we have very

neural just means that we have very basic building blocks called neurons

basic building blocks called neurons

basic building blocks called neurons which look like this they take a

which look like this they take a

which look like this they take a d-dimensional input x1 XD these are

d-dimensional input x1 XD these are

d-dimensional input x1 XD these are numbers like 0.9 0.2 and they’re all

numbers like 0.9 0.2 and they’re all

numbers like 0.9 0.2 and they’re all added together with a linear combination

added together with a linear combination

added together with a linear combination so what you end up getting is this

so what you end up getting is this

so what you end up getting is this activation Z which is just the sum of

activation Z which is just the sum of

activation Z which is just the sum of these inputs weighted by WS so if a

these inputs weighted by WS so if a

these inputs weighted by WS so if a neuron really cares about say X 2 W 2 W

neuron really cares about say X 2 W 2 W

neuron really cares about say X 2 W 2 W 1 and the rest will be zeros and so this

1 and the rest will be zeros and so this

1 and the rest will be zeros and so this lets the neuron preferentially select

lets the neuron preferentially select

lets the neuron preferentially select which of its inputs that cares more

which of its inputs that cares more

which of its inputs that cares more about and allows to specialize for

about and allows to specialize for

about and allows to specialize for certain parts of its input this

certain parts of its input this

certain parts of its input this activation is passed through this X

activation is passed through this X

activation is passed through this X shaped thing called an on called an

shaped thing called an on called an

shaped thing called an on called an activation function commonly a sigmoid

activation function commonly a sigmoid

activation function commonly a sigmoid but all it does is it introduces a

but all it does is it introduces a

but all it does is it introduces a non-linearity into the network and

non-linearity into the network and

non-linearity into the network and allows you to explore expressive on the

allows you to explore expressive on the

allows you to explore expressive on the types of functions you can approximate

types of functions you can approximate

types of functions you can approximate and we have the output called Y you take

and we have the output called Y you take

and we have the output called Y you take these neurons you stack them

these neurons you stack them

these neurons you stack them horizontally and you get what’s called a

horizontally and you get what’s called a

horizontally and you get what’s called a lair so here I’m just showing four

lair so here I’m just showing four

lair so here I’m just showing four neurons in this layer three neurons in

neurons in this layer three neurons in

neurons in this layer three neurons in this layer two neurons on this top layer

this layer two neurons on this top layer

this layer two neurons on this top layer and I represented the network like this

and I represented the network like this

and I represented the network like this here we take the input X so this bottom

here we take the input X so this bottom

here we take the input X so this bottom part we multiply by a matrix now because

part we multiply by a matrix now because

part we multiply by a matrix now because we’ve replicated the neurons

we’ve replicated the neurons

we’ve replicated the neurons horizontally and what w’s represents the

horizontally and what w’s represents the

horizontally and what w’s represents the weights we pass it through this sigmoid

weights we pass it through this sigmoid

weights we pass it through this sigmoid activation function to get these first

activation function to get these first

activation function to get these first layer outputs this is recursively done

layer outputs this is recursively done

layer outputs this is recursively done through all the layers until you get to

through all the layers until you get to

through all the layers until you get to the very top where we have the final

the very top where we have the final

the very top where we have the final outputs of the model the W’s here the

outputs of the model the W’s here the

outputs of the model the W’s here the weights those are the parameters of the

weights those are the parameters of the

weights those are the parameters of the network and these are the things that we

network and these are the things that we

network and these are the things that we need to learn in order to train the

need to learn in order to train the

need to learn in order to train the neural network great

neural network great
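In matrix form, the stacked layers look something like this sketch; the layer sizes (4, then 3, then 2 neurons) mirror the figure, and the random weights are placeholders standing in for learned parameters:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 3))   # input (4 values) -> first layer (3 neurons)
W2 = rng.normal(size=(3, 2))   # first layer (3)  -> top layer (2 neurons)

x = rng.normal(size=4)         # the input x
h = sigmoid(x @ W1)            # first-layer outputs
y = sigmoid(h @ W2)            # final outputs of the model
```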

neural network great we know that feed-forward neural

we know that feed-forward neural

we know that feed-forward neural networks now let’s introduce the word

networks now let’s introduce the word

networks now let’s introduce the word recurrent recurrent just means that the

recurrent recurrent just means that the

recurrent recurrent just means that the previous input or the previous hidden

previous input or the previous hidden

previous input or the previous hidden states are used in the next time step

states are used in the next time step

states are used in the next time step the prediction so what I’m showing here

the prediction so what I’m showing here

the prediction so what I’m showing here is again if you just pay attention to

is again if you just pay attention to

is again if you just pay attention to this input area

this input area

this input area and this layer right here and this

and this layer right here and this

and this layer right here and this output this part right here is the same

output this part right here is the same

output this part right here is the same thing as this thing right here however

thing as this thing right here however

thing as this thing right here however we’ve added this funny little loop

we’ve added this funny little loop

we’ve added this funny little loop coming back with this is electrical

coming back with this is electrical

coming back with this is electrical engineering notation for a unit time

engineering notation for a unit time

engineering notation for a unit time delay and what this is saying is take

delay and what this is saying is take

delay and what this is saying is take the hidden state from time T minus 1 and

the hidden state from time T minus 1 and

the hidden state from time T minus 1 and also include it as input into the next

also include it as input into the next

also include it as input into the next into the prime T predictions in

into the prime T predictions in

into the prime T predictions in equations it looks like this

equations it looks like this

equations it looks like this the current hidden state is equal to the

the current hidden state is equal to the

the current hidden state is equal to the act or the previous inputs plus the free

act or the previous inputs plus the free

act or the previous inputs plus the free or an activation of the previous inputs

or an activation of the previous inputs

or an activation of the previous inputs waited plus the the weighted activations

waited plus the the weighted activations

waited plus the the weighted activations of the previous hidden states and the

of the previous hidden states and the

of the previous hidden states and the outputs is only a function of just the

outputs is only a function of just the

outputs is only a function of just the current hidden states we can take this

current hidden states we can take this
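A minimal NumPy sketch of that recurrence; the tanh nonlinearity and the sizes are illustrative assumptions:

```python
import numpy as np

D, H = 3, 5                       # input and hidden sizes
rng = np.random.default_rng(0)
Wx = rng.normal(size=(H, D))      # input-to-hidden weights
Wh = rng.normal(size=(H, H))      # hidden-to-hidden (the feedback loop)
Wy = rng.normal(size=(D, H))      # hidden-to-output weights

h = np.zeros(H)                   # initial hidden state
for x in np.eye(D):               # a toy three-step input sequence
    h = np.tanh(Wx @ x + Wh @ h)  # h_t = sigma(Wx x_t + Wh h_{t-1})
    y = Wy @ h                    # y_t depends only on the current h_t
```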

current hidden states we can take this loop right here

loop right here

loop right here oh sorry before I go there um this is

oh sorry before I go there um this is

oh sorry before I go there um this is called a Elmen type recurrent neural

called a Elmen type recurrent neural

called a Elmen type recurrent neural network this memory cell is very basic

network this memory cell is very basic

network this memory cell is very basic it’s just doing the exact same thing a

it’s just doing the exact same thing a

it’s just doing the exact same thing a normal neural network would do it turns

normal neural network would do it turns

normal neural network would do it turns out there’s some problems with just

out there’s some problems with just

out there’s some problems with just using the basic architecture and so the

using the basic architecture and so the

using the basic architecture and so the architecture that the field has been

architecture that the field has been

architecture that the field has been converging towards is known as long

converging towards is known as long

converging towards is known as long short-term memory

short-term memory

short-term memory it looks really complicated it’s not you

it looks really complicated it’s not you

it looks really complicated it’s not you take the inputs and the hidden states

take the inputs and the hidden states

take the inputs and the hidden states and you put them into three spots right

and you put them into three spots right

and you put them into three spots right here the inputs an input gate a forget

here the inputs an input gate a forget

here the inputs an input gate a forget gate and output gate and the point of

gate and output gate and the point of

gate and output gate and the point of adding all this art complexity is to

adding all this art complexity is to

adding all this art complexity is to solve a problem known as the vanishing

solve a problem known as the vanishing

solve a problem known as the vanishing gradient problem where this constant

gradient problem where this constant

gradient problem where this constant error carousel of the hidden state being

error carousel of the hidden state being

error carousel of the hidden state being fed back to itself over and over and

fed back to itself over and over and

fed back to itself over and over and over results in signals converging

over results in signals converging

over results in signals converging toward zero or diverging to infinity

toward zero or diverging to infinity

toward zero or diverging to infinity this is fortunately this is usually

this is fortunately this is usually

this is fortunately this is usually available as just a black box

available as just a black box

available as just a black box implementation in most software packages

implementation in most software packages

implementation in most software packages you just specify I want to use an LS TM

you just specify I want to use an LS TM

you just specify I want to use an LS TM and all of this is abstracted away from

and all of this is abstracted away from

and all of this is abstracted away from you now here if you squint you can kind

you now here if you squint you can kind

you now here if you squint you can kind of see that the memory cell that I’ve

of see that the memory cell that I’ve

of see that the memory cell that I’ve shown previously where we have the

shown previously where we have the

shown previously where we have the inputs the hidden States hidden facing

inputs the hidden States hidden facing

inputs the hidden States hidden facing back to itself to generate an output I

back to itself to generate an output I

back to itself to generate an output I distract it away like this and I’ve

distract it away like this and I’ve

distract it away like this and I’ve stacked it up on top of each other so

stacked it up on top of each other so

stacked it up on top of each other so rather than just having the outputs come

rather than just having the outputs come

rather than just having the outputs come out of this H right here I’ve actually

out of this H right here I’ve actually

out of this H right here I’ve actually made it the inputs to get another memory

made it the inputs to get another memory

made it the inputs to get another memory cell

cell

cell this is where the word deep comes from

This is where the word "deep" comes from. Deep networks are just networks that have a lot of layers, and by stacking I get to use the word "deep" inside of my deep LSTM model. But I'll show you later that I'm not just doing it for the buzzword; depth actually matters, as you'll see in the results.

Another operation that's important for LSTMs is unrolling. What unrolling does is take this unit time delay and replicate the LSTM units over time. So rather than showing the delay like this, I've shown the (t-1)st hidden unit passing state into the t-th hidden unit, which passes state into the (t+1)st hidden unit. Your input is of variable length, and to train the network, what you do is expand this graph: you unroll the LSTM to the same length as your variable-length input, in order to get these predictions up at the top.

input and in order to get these predictions up at the top great we know

predictions up at the top great we know

predictions up at the top great we know all we need to know about music and rnns

all we need to know about music and rnns
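
To make unrolling concrete, here is a minimal NumPy sketch (my own illustration, not BachBot's code; the dimensions and weight names are made up for the example) of a vanilla recurrent cell replicated once per time step of a variable-length input:

```python
import numpy as np

input_dim, hidden_dim = 8, 16  # illustrative sizes only
rng = np.random.default_rng(0)
W_xh = rng.normal(size=(hidden_dim, input_dim))   # input-to-hidden weights
W_hh = rng.normal(size=(hidden_dim, hidden_dim))  # hidden-to-hidden weights (the unit time delay)
b_h = np.zeros(hidden_dim)

def unroll(xs):
    """Replicate the recurrent cell once per time step of a variable-length input."""
    h = np.zeros(hidden_dim)   # initial hidden state, all zeros
    states = []
    for x_t in xs:             # one copy of the cell per time step
        h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)
        states.append(h)
    return states

sequence = rng.normal(size=(5, input_dim))  # a length-5 "piece"
print(len(unroll(sequence)))                # 5 hidden states, one per input step
```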

Great, we now know all we need to know about music and RNNs, so let's move on to how BachBot works. To train BachBot, we apply a sequential prediction criterion. I've borrowed this illustration from Andrej Karpathy's GitHub, but the principles are the same.

Suppose we're given the input characters "hello" and we want to model them using a recurrent neural network. The training criterion is: given the current input character and the previous hidden state, predict the next character. So notice down here I have "h" and I'm trying to predict "e"; I have "e" and I'm trying to predict "l"; I have "l" and I'm trying to predict "l"; and I have "l" and I'm trying to predict "o". If we take this analogy to music, I have all of the notes I've seen up until this point in time and I'm trying to predict the next note, and I can iterate this process forwards to generate compositions.
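
As a tiny illustration of that criterion (my sketch, not code from the talk), the training pairs for "hello" are just the characters lined up against the same string shifted by one position:

```python
# Build (input, target) pairs for next-character prediction on "hello".
text = "hello"
pairs = list(zip(text[:-1], text[1:]))
print(pairs)  # [('h', 'e'), ('e', 'l'), ('l', 'l'), ('l', 'o')]
```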

The criterion we want to use is the following: the output layer here is actually a probability distribution. So, taking the previous slide, I now put it on top of my unrolled network. Given the initial hidden state (which we just initialize to all zeros, because we have a unique start symbol used to initialize our pieces) and the RNN dynamics, this is the probability distribution over the next state given the current state. This y_t stands for that distribution, and it's a function of the current input x_t as well as the previous hidden state from t minus 1.
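
Written out in symbols (my notation, which may differ slightly from the slide's), the recurrence and the predictive distribution are:

```latex
h_t = \tanh(W_{xh} x_t + W_{hh} h_{t-1} + b_h)
y_t = \mathrm{softmax}(W_{hy} h_t + b_y) \approx P(x_{t+1} \mid x_1, \ldots, x_t)
```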

We need to choose the RNN parameters, these weight matrices, the weights of all the connections between all the neurons, in order to maximize this probability right here: the probability of the real Bach chorale. So down here we have all the notes of the real Bach chorale, and up here we have the next notes of those. In an ideal world, if we just initialized it with some Bach chorale, it would just memorize and return the remainder, and that would do great on this prediction criterion, but that's not exactly what we want.

Nevertheless, once we have this criterion, the way the model is actually trained is by using the chain rule from calculus, where we take partial derivatives. Up here we have an error signal: I know this is the real note that Bach used, and this is the thing my model is predicting, and OK, they're a little bit different. How do I change the parameters, this weight matrix between the hidden state and the outputs, this weight matrix between the previous hidden state and the current hidden state, and this weight matrix between the hidden state and the inputs? How do I wiggle those to make this output up here closer to what Bach actually produced? This training criterion can be formalized by taking gradients using calculus and iterating, an optimization known as stochastic gradient descent; when applied to neural networks, it's an algorithm called backpropagation, well, backpropagation through time if you want to get nitty-gritty, because we've unrolled the neural network over time. But again, this is an abstraction that need not concern you, because it's usually provided for you as a black box inside common frameworks such as TensorFlow and Keras.
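
Schematically, one stochastic gradient descent step just nudges every weight against its partial derivative of the loss. Here is a self-contained toy illustration (nothing BachBot-specific; squared error stands in for the real criterion):

```python
import numpy as np

w = np.zeros(3)                      # toy "weights"
x = np.array([1.0, 2.0, 0.5])        # one training input
target = 2.0                         # the "real note", schematically

for step in range(100):
    pred = w @ x                     # forward pass
    grad = 2 * (pred - target) * x   # dLoss/dw for squared error, via the chain rule
    w -= 0.05 * grad                 # wiggle the weights against the gradient
print(w @ x)                         # ~2.0: the output has moved toward the target
```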

We now have the BachBot model, but there are a couple of parameters we still need to look at: I haven't told you exactly how deep BachBot is, nor have I told you how big these layers are.

Before we start, when optimizing models, this is a very important learning, and it's probably obvious by now: GPUs are very important for rapid experimentation. I did a quick benchmark and found that a GPU delivers an 8x performance speedup, taking my training time down from 256 minutes to just 28 minutes. So if you want to iterate quickly, getting a GPU will make you roughly eight times more productive.

Did I just put the word "deep" onto my neural network because it was a good buzzword? It turns out no: depth actually matters. What I'm showing you here are the training losses as well as the validation losses as I change the depth. The training loss is how well my model is doing on the training data set, which I let it see and let it tune its parameters to do better on; the validation loss is how well my model is doing on data that I didn't let it see, so how well it generalizes beyond just memorizing its inputs. What we notice here is that with just one layer the validation error is quite high; as we increase to two layers it gets you down here; three gets you this red curve, which is as low as it goes; and if you keep going with four, it goes back up. Should this be surprising? It shouldn't, and the reason is that as you add more layers you're adding more expressive power. Notice that here with four layers you're actually doing just as well as the red curve on the training set, but because your model is now so expressive, you're memorizing the inputs, and so you generalize more poorly.

A similar story can be told about the hidden state size, that is, how wide those memory cells are, how many units we have in them. As we increase the hidden state size, we get improvements in generalization, from this blue curve all the way down until we get to 256 hidden units, this green curve. After that we see the same kind of behavior: the training set error goes lower and lower, but because you're memorizing the inputs, because your model is now too powerful, your generalization error actually gets worse.

Finally, LSTMs: they're pretty complicated, and the reason I introduced them is that they're actually so critical for performance. The basic vanilla-type recurrent network, which just reuses the standard recurrent neural network architecture for the memory cell, is shown here in this green curve, and it actually doesn't do too badly. But by using long short-term memory, you get this yellow curve at the very bottom, doing the best out of all the memory-cell architectures we looked at. Gated recurrent units are a simpler variant of LSTMs; they haven't been used as much, and so there's less literature about them, but on this task they also appear to do quite well.

Cool. After all of this experimentation and all of this manual grid search, we finally arrived at a final architecture: notes are first embedded into a 32-dimensional real vector, and then a three-layer stacked long short-term memory recurrent neural network processes these note sequences over time. We trained it using standard gradient descent with a couple of tricks. We use this thing called dropout, with a setting of 30%, and what this means is that between subsequent connections between layers we randomly turn 30% of the neurons off.

That seems a little bit counterintuitive; why might you want to do that? It turns out that by turning off neurons during training, you actually force the neurons to learn more robust features that are independent of each other. If those connections are not always reliably available, then neurons can't learn to depend on combining each other's outputs; without dropout, you end up getting correlated features, where two neurons are actually learning the exact same feature. With dropout, as we'll actually show on the next slide, generalization improves as we increase this number, up to a certain point.

We also use something called batch normalization, which basically just takes your data, re-centers it around zero, and rescales the variance, so that you don't have to worry about floating-point number overflows or underflows.

And we use 128-time-step truncated backpropagation through time, again something that your optimizer will handle for you. At a high level, rather than unrolling the entire network over the entire input sequence, which could be tens of thousands of notes long, we only unroll it 128 steps and truncate the error signals: we basically say that whatever you do more than 128 time steps away is not going to affect the future too much.
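
Putting this recipe together, here is a hedged sketch of such an architecture in modern tf.keras (not BachBot's actual implementation, which predates this API; the vocabulary size is an assumption, and batch normalization is omitted for brevity):

```python
import tensorflow as tf

vocab_size = 128  # assumed size of the note-token vocabulary

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, 32),                      # 32-dimensional note embedding
    tf.keras.layers.LSTM(256, return_sequences=True, dropout=0.3),  # three stacked LSTM layers,
    tf.keras.layers.LSTM(256, return_sequences=True, dropout=0.3),  # 256 units each,
    tf.keras.layers.LSTM(256, return_sequences=True, dropout=0.3),  # with 30% dropout
    tf.keras.layers.Dense(vocab_size, activation="softmax"),        # next-note distribution
])
model.compile(optimizer="sgd", loss="sparse_categorical_crossentropy")
```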

Here's my promised slide about dropout. Counterintuitively, as we start dropping out, turning off random neurons or random neuron connections, we actually generalize better. We see that without dropout the model starts to overfit dramatically: it gets worse and worse and worse at generalizing, because it's got so many connections it can learn so much. If you turn dropout up to 0.3, you get this purple curve at the bottom, where you've turned off just the right amount, so that the features the model learns are robust and can generalize independently of other features. And if you turn it up too high, you're now dropping out so much that you're injecting more noise than you are regularizing your model, and you actually don't generalize that well. The story on the training side is also consistent: as we increase dropout, you do strictly worse on training, and that makes sense too, because this isn't generalization, this is just how well the model can memorize its input data, and if you turn inputs off, you won't memorize as well.

Great. With the trained model, we can do many things: we can compose, and we can harmonize. The way we compose is the following. We have the hidden states, we have the inputs, and we have the model weights, and so we can use the model weights to form this predictive distribution: what is the probability of my current note given all of the previous notes I've seen before? From this probability distribution we've just written down, we pick out a note according to how that distribution is parameterized; up here, I think "l" has the highest weight. Then, after we sample it, we just set x_t equal to whatever we sampled out of there, and we treat it as truth: we assume that whatever the output was right there is now the input for the next time step, and we iterate this process forwards. So, starting with no notes at all, you feed in the start symbol and then just keep going until you sample the end symbol.
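
A minimal sketch of that sampling loop (illustrative names only; predict_distribution stands in for whatever function wraps the trained network and returns next-note probabilities):

```python
import numpy as np

START, END = 0, 1  # assumed token ids for the unique start and end symbols

def compose(predict_distribution, rng=np.random.default_rng()):
    """Sample a piece note-by-note until the end symbol appears."""
    piece = [START]
    while piece[-1] != END:
        probs = predict_distribution(piece)                 # P(next note | all notes so far)
        piece.append(int(rng.choice(len(probs), p=probs)))  # treat the sample as truth
    return piece[1:-1]  # strip the start and end symbols
```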

In that way, we're able to generate novel automatic compositions. Harmonization is actually a generalization of composition. In composition, what we basically did was say: I've got a start symbol, fill in the rest. Harmonization is where you say: I've got the melody, I've got the bass line, or I've got these certain notes; fill in the parts that I didn't specify. For this we actually proposed a suboptimal strategy. I'm going to let alpha denote the stuff that we're given, so alpha could be, like, {1, 3, 7}, the points in time where the notes are fixed, and the harmonization problem is that we need to choose the notes that aren't fixed: we need to choose the entire sequence x_1 through x_N, the whole composition, such that the notes that we're given, x_alpha, stay fixed. So our decision variables are the positions that are not in alpha, and we need to maximize this probability distribution. My kind-of-greedy solution, which I've received a lot of criticism for, is: OK, you're at this point in time, just sample the most likely thing at the next point in time. The reason this gets criticized is that if you greedily choose without looking at what influence this decision could have on your future, you might choose something that sounds really good right now but just doesn't make any sense in the future harmonic context; it's kind of like acting without thinking about the consequences of your actions. But the testament to how well this actually performs is not how bad it could be theoretically, it's how well it does empirically: is it still convincing? We'll find out soon.
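
Here is a hedged sketch of that greedy strategy (my own illustration, not the thesis code; fixed maps the time indices in alpha to their given notes, and the most likely note is taken instead of sampling):

```python
import numpy as np

def harmonize_greedy(predict_distribution, fixed, length, start_token=0):
    """Fill in the unfixed positions step by step, keeping x_alpha as given."""
    piece = [start_token]
    for t in range(1, length + 1):
        if t in fixed:                           # position in alpha: keep the given note
            piece.append(fixed[t])
        else:                                    # otherwise take the locally most likely note,
            probs = predict_distribution(piece)  # ignoring its effect on the future
            piece.append(int(np.argmax(probs)))
    return piece[1:]

# e.g. fixed = {1: 60, 3: 64, 7: 67} encodes alpha = {1, 3, 7} with the notes given there
```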

But before we go there, let's uncover the black box. I've been talking about neural networks as just this thing which you can optimize by throwing data at it, and it'll learn things; let's take a look inside and see what's actually going on. What I've done here is take the various memory cells of my recurrent neural network and unroll them over time, so on the x-axis you see time, and on the y-axis I'm showing you the activations of all of the hidden units: this is neuron number one through number 32, this is neuron number one through neuron number 256 in the first hidden layer, and similarly this is neuron number one through neuron number 256 in the second hidden layer. See any pattern there? I don't... I mean, I kind of do: there's this little smear right here, and it seems to show up everywhere, as well as right here, but there's not too much intuitive sense that I can make out of this image. And this is a common criticism of deep neural networks: they're like black boxes, where we don't know how they really work on the inside, but they seem to do awfully well.

As we get closer to the output, things start to make a little bit more sense. So, I previously was showing the hidden units of the first and second layers; now I'm showing the third layer, as well as a linear combination of the third layer, and finally the outputs of the model. As you get towards the end, you start seeing this little dotty pattern: this almost looks like a piano roll. If you remember the representation of music I showed earlier, where we had time on the x-axis and pitch on the y-axis, this looks awfully similar to that, and this isn't surprising either. Recall we trained the neural network to predict the next note given the current note, or all the previous notes; if the network were doing perfectly, we would expect to just see the input here delayed by a single time step, and so it's unsurprising that we do see something that resembles the input. But it's not quite exactly the input: sometimes we see multiple predictions at one point in time, and this is really representing the uncertainty inside of our predictions. Since we output a probability distribution, we're not just saying "the next note is this"; rather, we're saying we're pretty sure the next note is this with this probability, but it could also be that with that probability. I call this the probabilistic piano roll; I don't know if that's standard terminology.
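
If you wanted to render such a probabilistic piano roll yourself, a minimal matplotlib sketch (with made-up probabilities, purely illustrative) could look like this:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
# Fake P(pitch | history) columns: 48 pitches over 64 time steps.
probs = rng.dirichlet(np.ones(48), size=64).T

plt.imshow(probs, origin="lower", aspect="auto", cmap="gray_r")
plt.xlabel("time step")
plt.ylabel("pitch")
plt.title("probabilistic piano roll (illustrative)")
plt.show()
```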

Here's one of the most interesting insights that I found from this model: it appears to actually be learning music theory concepts. What I'm showing here is some input that I provided to the model, and here I picked out some neurons. And no, these neurons were randomly selected; I didn't go fishing for the ones that looked good. Rather, I just ran a random number generator, got eight of them out, and then handed them off to my music theorist collaborator: hey, is there anything there? And here are the notes he made for me. He said that neuron 64, this one, and layer one neuron 138, this one, appear to be picking out perfect cadences with root-position chords in the tonic key; more music theory than I can understand, but if I look up here, it's like: that shape right there on the piano roll looks like that shape on the piano roll, looks like that shape on the piano roll. Interesting. Layer one neuron 151, I believe that is this one, A minor cadences ending phrases two and four... no, that's this one, sorry. And again I look up here: OK, yeah, that kind of chord right there looks kind of like that chord right there; they seem to be specializing in picking out specific types of chords. OK, so it's learning Roman numeral analysis and tonics and root-position chords and cadences. And the last one: layer one neuron 87 and layer two neuron 37, I believe that's this one and this one, they're picking out I6 chords. I have no idea what that means.

So, I showed you automatic composition at the beginning of the presentation, when I took some BachBot music and allegedly claimed it was Bach. I'll now show you what harmonization sounds like, and this is with the suboptimal strategy that I proposed. We take a melody such as

[Music]

and we tell the model: this has to be the soprano line; what are the others likely to be?

That's kind of convincing; it's almost like a baroque C major chord progression. What's really interesting, though, is that not only can we harmonize simple melodies like that, we can actually take popular tunes such as this

[Music]

and we can generate a novel baroque harmonization of what Bach might have done had he heard Twinkle Twinkle Little Star during his lifetime.

Now I'm going down the track of "oh, this is my model, it looks so good, it sounds so realistic", but that's exactly what I was criticizing at the beginning of the talk. My third research goal was actually: how can we determine a standardized way to quantitatively assess the performance of generative models? For this particular task, and one which I recommend for all of automatic composition, the answer is to do a subjective listening experiment. So what we did is we built bachbot.com, and it looks like this: it's got a splash page, and it's kind of trying to go viral; it's asking, can you tell the difference between Bach and a computer? (They used to say "man versus machine".) The interface is simple: you're given two choices, one of them is Bach and one of them is BachBot, and you're asked to distinguish which one was the actual Bach. We put this up on the Internet and got around nineteen hundred participants from all around the world.

Participants tended to be within the eighteen-to-forty-five age group, and we got a surprisingly large number of expert users who decided to contribute. We defined an expert as a researcher, someone who has published, or a teacher, someone with professional accreditation as a music teacher; advanced as someone who has studied in a degree program for music; and intermediate as someone who plays an instrument.

And here's how they did. I've coded these with SATB to represent the parts that the model was asked to harmonize: this one is "given the alto, tenor, and bass, harmonize the soprano"; this here was "given just the soprano, harmonize the rest"; "given the soprano and bass, harmonize the middle two"; and this is "compose everything, I'm going to give you nothing". This is the result that I've been quoting this entire talk: participants were only able to distinguish Bach from BachBot 7% better than random chance. But there are some other interesting findings in here.

Well, I guess this one isn't too surprising: if you delete the soprano line, then BachBot has to create a convincing melody, and it doesn't do too well; whereas if you delete the bass line, BachBot does a lot better. Now, I think this is actually a consequence of the way I chose to deal with polyphony, in the sense that I serialized the music as soprano, alto, tenor, bass. So by the time BachBot got to figuring out what the bass note might be, it had already seen the soprano, alto, and tenor notes within that time instant, and so it already had a very strong harmonic context for what note might sound good. Whereas when BachBot generates the soprano note, it has no idea what the alto, tenor, and bass notes might be, and so it's going to make a guess that could be totally out of place.

To validate this hypothesis, which is work left for the future, you could serialize in a different order, such as bass, tenor, alto, soprano, and run the experiment again; you would expect the plot to shift accordingly if the hypothesis is true, and differently if not.
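To make the serialization idea concrete, here is a minimal sketch of frame-by-frame serialization with the voice order as a parameter, so the bass-first variant proposed above is a one-argument change. This is not BachBot's actual preprocessing code, and the `serialize` helper is my own illustration; it assumes music21 is installed and that the chorale's parts are named Soprano, Alto, Tenor, and Bass, as they are in music21's Bach corpus.

```python
# Minimal sketch (not BachBot's actual preprocessing) of serializing a
# four-part chorale into one token stream, reading each sixteenth-note
# frame in a configurable voice order.
from music21 import corpus

SATB = ["Soprano", "Alto", "Tenor", "Bass"]

def serialize(score, order=SATB, step=0.25):
    """Flatten a chorale into (midi_pitch, is_new_onset) tokens per frame."""
    parts = {p.partName: list(p.flatten().notes) for p in score.parts}
    tokens = []
    t = 0.0
    while t < score.highestTime:
        for voice in order:
            # Find the note sounding in this voice at time t, if any.
            sounding = [n for n in parts[voice]
                        if n.offset <= t < n.offset + n.quarterLength]
            if sounding:
                n = sounding[0]
                tokens.append((n.pitch.midi, n.offset == t))  # False = held note
            else:
                tokens.append(("REST", True))
        tokens.append("|||")  # delimiter between sixteenth-note frames
        t += step
    return tokens

chorale = corpus.parse("bach/bwv269")
print(serialize(chorale)[:9])                    # soprano-first: BachBot's order
print(serialize(chorale, order=SATB[::-1])[:9])  # bass-first: the proposed experiment
```

With the order reversed, the model would see the bass before the melody at every time step, which is exactly the condition the hypothetical follow-up experiment needs.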

Here I've taken the exact same plot as before, except I've now broken it down by music experience. Unsurprisingly, you see a curve where people do better as they get more experienced: the novices are only about three percent better than chance, while the experts are sixteen percent better. They probably know Bach, they've practically got him memorized, so they can tell the difference. But the interesting one is here: the experts do significantly worse than random chance when comparing Bach against BachBot on bass harmonizations. I actually don't have a good reason why, but it's surprising to me; it seems that the experts find BachBot more convincing than actual Bach.

So, in conclusion: I've presented a deep long short-term memory generative model for composing, completing, and generating polyphonic music. And this model isn't just research that I'm talking about and that no one ever gets to use: it's actually open source, it's on my GitHub, and moreover Google Brain's Magenta project has already integrated it. So if you use the polyphonic recurrent neural network model in the Magenta and TensorFlow projects, you'll be using the BachBot model.
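If you want to try the Magenta version, the polyphony RNN ships with a command-line generator; the sketch below drives it from Python. The flags follow Magenta's polyphony_rnn documentation, while the bundle path and output directory are placeholders to adjust for your own setup.

```python
# Rough sketch of invoking Magenta's polyphony RNN generator (the model
# descended from BachBot). Flags follow Magenta's polyphony_rnn docs;
# paths are placeholders.
import subprocess

subprocess.run([
    "polyphony_rnn_generate",
    "--bundle_file=/tmp/polyphony_rnn.mag",       # pre-trained checkpoint bundle
    "--output_dir=/tmp/polyphony_rnn/generated",  # generated MIDI files land here
    "--num_outputs=10",             # how many samples to draw
    "--num_steps=128",              # total length in sixteenth-note steps
    "--primer_pitches=[67,64,60]",  # seed the model with a C-major chord
    "--condition_on_primer=true",
], check=True)
```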

The model appears to learn music theory without any prior knowledge: we didn't tell it "this is a chord", "this is a cadence", "this is a tonic"; it just decided to figure that out on its own in order to optimize performance on an automatic composition task. To me, this suggests that music theory, with all of its rules and all of its formalisms, actually is useful for composing; in fact, it's so useful that a machine trained to optimize composition decided to specialize on these concepts. Finally, we conducted the largest musical Turing test to date, with 1,700 participants, who could distinguish Bach from BachBot only 7% better than random chance.

Obligatory note for my employer, Gigster: we do freelance outsourcing, so if you need a development team, let me know. Other than that, thank you so much for your attention; it was a pleasure speaking to you all.

[Applause]
