
GOTO 2017 • Composing Bach Chorales Using Deep Learning • Feynman Liang


00:00:06 [Music]

00:00:08 Cool, thank you. Just so I know what level to speak at: raise your hands if you know who Bach is. Great. Raise your hand if you know what a neural network is. Oh, this is the perfect crowd, awesome. If you don't know, don't worry, I'm going to cover the very basics of both. So let's talk about Bach. I'm going to play you some music.

00:00:37 [Music]

00:00:49 Now, what you just heard is what's known as a chorale. There are four parts to it, soprano, alto, tenor, and bass, playing at the exact same time, and there's a very regular phrasing structure, where you have the beginning of a phrase, the termination of a phrase, followed by the next phrase. Except that wasn't Bach. Rather, that was a computer algorithm called BachBot, and that was one sample of its output. If you don't believe me, it's on SoundCloud, it's called "sample one", go listen for yourself. So instead of talking about Bach today, I'm going to talk to you about BachBot. Hi, my name is Feynman, and it's a pleasure to be here in Amsterdam. Today we'll talk about automatic stylistic composition using long short-term memory.

00:01:37 A bit of background about myself: I'm currently a software engineer at Gigster, where I work on interesting automation problems around taking contracts, dividing them into subcontracts, and then freelancing them out. The work on BachBot was done as part of my master's thesis, which I did at the University of Cambridge with Microsoft Research Cambridge. In line with the track here, I do not have a PhD, and I can still do machine learning, so this is a fact: you can do machine learning without a PhD.

00:02:14 For those of you who just want to know what's going to happen and then get out of here, here is the executive summary. I'm going to talk to you about how to train, end to end, starting from dataset preparation all the way through model tuning and deployment, a deep recurrent neural network for music. This neural network is capable of polyphony, multiple simultaneous voices at the same time. It's capable of automatic composition, generating a composition completely from scratch, as well as harmonization: given some fixed parts, such as the soprano line of the melody, generate the remaining supporting parts. This model learns music theory without being told to do so, providing empirical validation of what music theorists have been using for centuries. Finally, it's evaluated on an online musical Turing test, where out of 1,700 participants only nine percent were able to distinguish actual Bach from BachBot.

00:03:09 When I set out on this research there were three primary goals. The first question I wanted to answer was: what is the frontier of computational creativity? Creativity is something we take to be innately human, innately special in some sense; computers ought not to be able to replicate this about us. Is this actually true? Can we have computers generate art that is convincingly human? The second question I wanted to answer was: how much does deep learning impact automatic music composition? Automatic music composition is a special field; it has been dominated by symbolic methods, which utilize things like formal grammars or context-free grammars, such as this parse tree. We saw connectionist methods in the early nineties, however they fell in popularity and most recent systems have used symbolic methods. With the work here I wanted to see whether the new advances in deep learning over the last ten years can be transferred to this particular problem domain. Finally, the last question I wanted to look at is: how do we evaluate these generative models? We've seen in the previous talk a lot of models that generate art; we look at it and, as the author, we say "oh, that's convincing" or "oh, that's beautiful", and great, that might be a perfectly valid use case, but it's not sufficient for publication. To publish something we need to establish a standardized benchmark, and we need to be able to evaluate all of our models against it, so we can objectively say which model is better than the other.

00:04:42 Now, if you're still here, I'm assuming you're interested. This is the outline. We'll start with a quick primer on music theory, giving you just the basic terminology you need to understand the remainder of this presentation. We'll talk about how to prepare a dataset of Bach chorales. We'll then give a primer on recurrent neural networks, which is the actual deep learning model architecture used to build BachBot. We'll talk about the BachBot model itself, and the tips, tricks, and techniques that we used to train it, have it run successfully, and deploy it, and then we'll show the results. We'll show how this model is able to capture statistical regularities in Bach's musical style, and we'll provide, not prove, very convincing evidence that music theory does have empirical justification. Finally, I'll show the results of the musical Turing test, which was our proposed evaluation methodology for saying: yes, this model solves our research goal; the task of automatically composing convincing Bach chorales is more closed than open of a problem as a result of BachBot. And if you're a hands-on type of learner, we've containerized the entire deployment, so if you go to my website here, I have a copy of the slides with all of these instructions: you run these eight lines of code and it runs this entire pipeline right here, where it takes the chorales, preprocesses them, puts them into a data store, trains the deep learning model, samples the deep learning model, and produces outputs that you can listen to.

00:06:12 Let's start with basic music theory. When people think of music, this is usually what you think about: you've got these bar lines, you've got notes, and these notes are at different horizontal and vertical positions; some of them have interesting ties, some of them have dots, and there's this interesting little weird hat-looking thing. We don't need all of this; we need three fundamental concepts. The first is pitch. Pitch is often described as how low or how high a note is, so if I play this, we can distinguish that some notes are lower and some notes are higher in frequency, and that corresponds to the vertical axis here: as the notes sound ascending, they appear ascending on the staff. The second attribute we need is duration, and this is really how long a note is. So this one note, these two notes, these four, and these eight all have equal total duration, but each group is a halving of the previous one. If we take a listen, the general intuition is: the more bars there are on the note stems, the faster the notes are. With just those two concepts this is starting to make a little bit more sense: this right here is twice as fast as this note, we can see this note is higher than this note, and you can generalize this to the remainder of the piece. But there's still this funny hat-looking thing; we'll get to the hat in a sec.

00:07:49 But with pitch and duration we can rewrite the music like so. Rather than representing it using notes, which may be kind of cryptic, we show it here as a matrix, where on the x-axis we have time, so the duration, and on the y-axis we have pitch, how high or low in frequency that note is. What we've done is we've taken the symbolic representation of music and turned it into a digital, computable format that we can train models on.
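As a concrete illustration of that matrix ("piano roll") encoding, here is a minimal sketch; the note list, the 128-pitch MIDI range, and the sixteenth-note grid are my own illustrative assumptions, not the exact BachBot representation.

```python
import numpy as np

def piano_roll(notes, num_steps, num_pitches=128):
    """notes: list of (midi_pitch, start_step, duration_steps) on a sixteenth-note grid."""
    roll = np.zeros((num_pitches, num_steps), dtype=np.uint8)  # pitch on y, time on x
    for pitch, start, dur in notes:
        roll[pitch, start:start + dur] = 1
    return roll

# e.g. middle C (MIDI 60) for four sixteenths, then the E above it (64) for four more
print(piano_roll([(60, 0, 4), (64, 4, 4)], num_steps=8))
```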

00:08:20 Back to the hat-looking thing: this is called a fermata, and Bach used it to denote the ends of phrases. We had originally set about this research completely neglecting fermatas, and we found that the phrases generated by the model just kind of wandered; they never seemed to end, there was no sense of resolution or conclusion, and that was unrealistic. But by adding these fermatas, all of a sudden the model turned around and we suddenly found realistic phrasing structure. Cool, and that's all the music you need to know.

00:08:51 The rest of it is machine learning. Now, the biggest part of a machine learning engineer's job is preparing their dataset. This is a very painful task: you usually have to scour the internet or find some standardized dataset that you train and evaluate your models on, and usually these datasets have to be preprocessed and massaged into a format that's amenable to learning. For us it was no different. Bach's works, however, have fortunately been transcribed over the years into, excuse my German, the Bach-Werke-Verzeichnis, or BWV, which is how I've been referring to this corpus. It contains all 438 harmonizations of Bach chorales, and conveniently it is available through a software package called music21. This is a Python package that you can just pip install and then import, and now you have an iterator over a collection of music.
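As a rough sketch of what that looks like in practice (assuming `pip install music21`; this uses one of the iteration helpers music21 provides and is not the BachBot code itself):

```python
from music21 import corpus

# Iterate over the Bach chorale corpus bundled with music21; each item is a Score
for chorale in corpus.chorales.Iterator():
    print(chorale.metadata.title, len(chorale.parts))  # typically 4 parts: SATB
```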

00:09:50 The first preprocessing step we did is we took the original music here and we did two things: we transposed it, and then we quantized it in time. You can notice the transposition by looking at these accidentals right here, these two little funny backwards-or-forwards b's, which are absent over here; furthermore, that note has shifted up by half a line. That's a little hard to see, but it's happening. The reason we did this is that we didn't want to learn key signatures. The key signature is usually something decided by the author before the piece has even begun to be composed, and so the key signature itself can be injected as a preprocessing step where we sample over all the keys Bach did use. So we removed key signatures from the equation through transposition, and I'll justify why that's an okay thing to do in the next slide. This first measure is a progression of five notes written in C major, and then what I did in the next measure is I just moved it up by five whole steps.

00:10:50 [Music]

00:10:57 So yes, the pitch did change; it's relatively higher, it's absolutely higher on all accounts, but the relations between the notes didn't change, and the sensation, the motifs that the music brings out, still remain fairly constant even after transposition. Quantization, however, is a different story. If I go back to the slides, you'll notice quantization took this thirty-second note and turned it into a sixteenth note by removing that second bar; we've distorted time. Is that a problem? It's not perfect, but it's a very minor problem. Over here I've plotted a histogram of all of the durations inside the chorale corpus, and this quantization affects only 0.2% of all the notes that we're training on. The reason we do it is that by quantizing in time we're able to get discrete representations in both time and pitch. Computers are discrete and are unable to operate on a continuous time axis; the continuous representation has to be quantized into a digital format somehow.
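Here is a rough music21 sketch of those two preprocessing steps, transposition to C major or A minor and quantization to a sixteenth-note grid. It is an illustration of the idea under my own assumptions (the example chorale and the grid choice), not the exact BachBot pipeline:

```python
from music21 import corpus, interval, pitch

chorale = corpus.parse('bach/bwv66.6')                  # one example chorale
key = chorale.analyze('key')                            # estimate its key
target = pitch.Pitch('C') if key.mode == 'major' else pitch.Pitch('A')
chorale = chorale.transpose(interval.Interval(key.tonic, target))  # to C major / A minor

# snap offsets and durations to a sixteenth-note grid (quarterLength / 4)
chorale.quantize(quarterLengthDivisors=(4,), processOffsets=True,
                 processDurations=True, inPlace=True)
```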

00:12:12 The last challenge: polyphony. Polyphony is the presence of multiple simultaneous voices. So far, in the examples I've shown you, you've just heard a single voice playing at any given time, but a chorale has four voices: the soprano, the alto, the tenor, and the bass. So here's a question for you: if I have four voices, and they can each represent 128 different pitches (that's the constraint in the MIDI representation of music), how many different chords can I construct? Very good, yes, 128^4, that's correct. I put a big O there because you can rearrange the ordering, but more or less, yes, that's correct. And why is this a problem? Well, it's a problem because most of these chords are actually never seen, especially after you've transposed to C major or A minor. In fact, looking at the dataset, we can see that just the first 20 chords, or 20 notes rather, occupy almost 90% of the entire dataset. So if we were to represent all of these chords, we would have a ton of symbols in our vocabulary which we had never seen before. The way we deal with this problem is by serializing: instead of representing all four notes as one symbol, we represent each individual note as a symbol itself, and we serialize in soprano, alto, tenor, bass order. What you end up getting is a reduction from 128 to the 4th possible chords down to just 128 possible pitches. Now, this may seem a little unjustified, but it's actually done all the time in sequence processing: if you take a look at traditional language models, you can represent them either at the character level or at the word level; similarly, you can represent music either at the note level or at the chord level.

00:14:05 After serializing, the data looks like this. We have a symbol denoting the start of a piece, and this is used to initialize our model. We then have the four notes, soprano, alto, tenor, bass, followed by a delimiter indicating the end of this frame, meaning time has advanced one step into the future, followed by another soprano, alto, tenor, bass. We also have these funny-looking dot things, which I came up with to denote the fermata, so that we can encode when the end of a phrase occurs in our input training data.
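A toy sketch of that serialized encoding follows; the token names and the exact pairing of each note with a fermata flag are my own illustration of the idea, not BachBot's literal token format:

```python
START, END_FRAME = 'START', '|||'    # '|||' delimits a frame, i.e. time advances one step

def serialize(frames):
    """frames: list of ((soprano, alto, tenor, bass) MIDI pitches, phrase_end) per time step."""
    tokens = [START]
    for (s, a, t, b), phrase_end in frames:
        for note in (s, a, t, b):              # always soprano, alto, tenor, bass order
            tokens.append((note, phrase_end))  # pair each note with the fermata marker
        tokens.append(END_FRAME)
    return tokens

print(serialize([((67, 64, 60, 48), False), ((69, 65, 62, 50), True)]))
```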

00:14:40 After all of our preprocessing, our final corpus looks like this. There are only 108 symbols left, so not all 128 pitches are used in Bach's works, and there are about four hundred thousand tokens total, which we split into roughly three hundred and eighty thousand for a training set and forty thousand for a validation set. We split between training and validation in order to prevent overfitting: we don't want to just memorize Bach's chorales; rather, we want to be able to produce very similar samples which are not exactly identical. And that's it: with that you have the training set, and it's encapsulated by the first three commands on that slide I showed earlier, BachBot's dataset-creation and vocabulary-extraction commands. The next step is to train the recurrent neural network.

00:15:35 To talk about recurrent neural networks, let's break the phrase down: recurrent, neural, network. I'm going to start with "neural". Neural just means that we have very basic building blocks called neurons, which look like this. They take a d-dimensional input, x1 through xd; these are numbers like 0.9 or 0.2, and they're all added together in a linear combination. What you end up getting is this activation z, which is just the sum of these inputs weighted by the w's. So if a neuron really cares about, say, x2, then w2 will be one and the rest will be zeros, and this lets the neuron preferentially select which of its inputs it cares more about and allows it to specialize for certain parts of its input. This activation is passed through this S-shaped thing called an activation function, commonly a sigmoid; all it does is introduce a non-linearity into the network and expand the types of functions you can approximate. And then we have the output, called y. You take these neurons and stack them horizontally, and you get what's called a layer. So here I'm showing four neurons in this layer, three neurons in this layer, two neurons in this top layer, and I represent the network like this: we take the input x, this bottom part, and multiply it by a matrix, because we've replicated the neurons horizontally and W represents the weights; we pass it through this sigmoid activation function to get the first layer's outputs. This is done recursively through all the layers until you get to the very top, where we have the final outputs of the model. The W's here, the weights, are the parameters of the network, and these are the things we need to learn in order to train the neural network. Great.
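As a minimal sketch of that feed-forward computation (toy layer sizes and random weights, purely illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def feedforward(x, weight_matrices):
    """Apply each layer: multiply by its weight matrix, then the sigmoid nonlinearity."""
    h = x
    for W in weight_matrices:
        h = sigmoid(W @ h)        # z = W h, then y = sigmoid(z)
    return h

rng = np.random.default_rng(0)
layers = [rng.normal(size=(3, 4)), rng.normal(size=(2, 3))]  # 4 inputs -> 3 hidden -> 2 outputs
print(feedforward(rng.normal(size=4), layers))
```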

00:17:21 We now know about feed-forward neural networks; let's introduce the word "recurrent". Recurrent just means that the previous inputs, or the previous hidden states, are used in the next time step's prediction. What I'm showing here, if you just pay attention to this input area, this layer right here, and this output, is that this part is the same thing as the network before; however, we've added this funny little loop coming back, which is electrical engineering notation for a unit time delay. What this is saying is: take the hidden state from time t-1 and also include it as input to the time t prediction. In equations it looks like this: the current hidden state is an activation of the weighted inputs plus the weighted previous hidden state, and the output is a function of just the current hidden state. We can take this loop right here...
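In code, one step of that simple recurrent update looks roughly like this (the weight names and the tanh nonlinearity are my choices for the sketch):

```python
import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, W_hy):
    """One Elman-style step: new hidden state from the current input plus the previous hidden state."""
    h_t = np.tanh(W_xh @ x_t + W_hh @ h_prev)  # recurrent update
    y_t = W_hy @ h_t                           # output depends only on the current hidden state
    return h_t, y_t
```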

00:18:28 Oh, sorry, before I go there: this is called an Elman-type recurrent neural network. This memory cell is very basic; it's just doing the exact same thing a normal neural network would do. It turns out there are some problems with using this basic architecture, and so the architecture that the field has been converging towards is known as long short-term memory. It looks really complicated; it's not. You take the inputs and the hidden states and you put them into three spots right here: an input gate, a forget gate, and an output gate. The point of adding all this complexity is to solve a problem known as the vanishing gradient problem, where this constant error carousel of the hidden state being fed back to itself over and over and over results in signals converging toward zero or diverging to infinity. Fortunately, this is usually available as a black-box implementation in most software packages: you just specify that you want to use an LSTM, and all of this is abstracted away from you.
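For example, in Keras (one such framework, not necessarily what BachBot itself was built with), requesting an LSTM layer is a one-liner and all of the gating is handled internally:

```python
import tensorflow as tf

lstm = tf.keras.layers.LSTM(256, return_sequences=True)  # gates and cell state handled for you
```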

00:19:25 Now, here, if you squint, you can kind of see the memory cell that I showed previously, where we have the inputs and the hidden state feeding back into itself to generate an output; I've abstracted it away like this, and I've stacked it up on top of itself. So rather than just having the outputs come out of this h right here, I've actually made them the inputs of yet another memory cell. This is where the word "deep" comes from: deep networks are just networks that have a lot of layers, and by stacking I get to use the word "deep" inside of my deep LSTM model. But I'll show you later that I'm not just doing it for the buzzword; depth actually matters, as we'll see in the results.

00:20:10 Another operation that's important for LSTMs is unrolling. What unrolling does is take this unit time delay and replicate the LSTM units over time. So rather than showing the delay like this, I've shown the (t-1)-th hidden unit passing state into the t-th hidden unit, which passes state into the (t+1)-th hidden unit. Your input is of variable length, and to train the network you expand this graph: you unroll the LSTM to the same length as your variable-length input in order to get these predictions up at the top. Great, we know all we need to know about music and RNNs; let's move on to how BachBot works.

00:20:48 To train BachBot, we apply a sequential prediction criterion. Now, I've stolen this figure from Andrej Karpathy's GitHub, but the principles are the same. Suppose we're given the input characters "hello" and we want to model it using a recurrent neural network. The training criterion is: given the current input character and the previous hidden state, predict the next character. So notice down here I have an 'h' and I'm trying to predict 'e'; I have 'e' and I'm trying to predict 'l'; I have 'l' and I'm trying to predict 'l'; and I have 'l' and I'm trying to predict 'o'. If we carry this analogy to music: I have all of the notes I've seen up until this point in time, and I'm trying to predict the next note. I can iterate this process forwards to generate compositions. The output layer here is actually a probability distribution. So, taking the previous slide and putting it on top of my unrolled network: given the initial hidden state, which we just initialize to all zeros because we have a unique start symbol used to initialize our pieces, and the RNN dynamics, this is the probability distribution over the next token given the current state. This y_t stands for that, and it's a function of the current input x_t as well as the previous hidden state from t-1. We need to choose the RNN parameters, these weight matrices, the weights of all the connections between all the neurons, in order to maximize this probability right here: the probability of the real Bach chorale. So down here we have all the notes of the real Bach chorale, and up here we have the next notes of those. In an ideal world, if we just initialized it with some Bach chorale, it would just memorize and return the remainder, and that would do great on this prediction criterion, but that's not exactly what we want.
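Conceptually, the training criterion is just the negative log-likelihood of the real chorale under the model's per-step distributions; a small sketch (illustrative only):

```python
import numpy as np

def sequence_nll(predicted_dists, true_next_tokens):
    """predicted_dists: (T, vocab) array, each row a distribution over the next token;
    true_next_tokens: (T,) integer array of the notes Bach actually wrote next."""
    probs = predicted_dists[np.arange(len(true_next_tokens)), true_next_tokens]
    return -np.sum(np.log(probs))   # minimizing this maximizes P(real chorale)
```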

00:22:46 Nevertheless, once we have this criterion, the way the model is actually trained is by using the chain rule from calculus, where we take partial derivatives. Up here we have an error signal: I know this is the real note that Bach used, and this is the thing my model is predicting; okay, they're a little bit different. How do I change the parameters, this weight matrix between the hidden state and the outputs, this weight matrix between the previous hidden state and the current hidden state, and this weight matrix between the inputs and the hidden state? How do I wiggle those to make this output up here closer to what Bach actually produced? This training criterion can be formalized by taking gradients using calculus and iterating, in an optimization known as stochastic gradient descent; applied to neural networks it's an algorithm called backpropagation, well, backpropagation through time if you want to get nitty-gritty, because we've unrolled the neural network over time. But again, this is an abstraction that need not concern you, because it is usually provided for you as a black box inside of common frameworks such as TensorFlow and Keras.

00:23:51 We now have the BachBot model, but there are a couple of parameters we need to look at: I haven't told you exactly how deep BachBot is, nor have I told you how big these layers are. Before we start, when optimizing models, this is a very important learning, and it's probably obvious by now: GPUs are very important for rapid experimentation. I did a quick benchmark and found that a GPU delivers an 8x speedup, taking my training time down from 256 minutes to just 28 minutes. So if you want to iterate quickly, getting a GPU will make you roughly eight times more productive.

00:24:51my model doing on the training data set

00:24:53which I’m letting it see and letting it

00:24:54tune its parameters to do better on and

00:24:56the validation loss is how well is my

00:24:59model doing on data that I didn’t let it

00:25:01see so how well is it generalizing

00:25:02beyond just memorizing its inputs and

00:25:06what we notice here is that with just

00:25:07one layer the validation error is quite

00:25:09high and as we increase layers – it gets

00:25:12you down here three gets you this red

00:25:14curve which is as low as it goes and if

00:25:17you keep going for with four it goes

00:25:18back up should this be surprising it

00:25:22shouldn’t and the reason why it

00:25:24shouldn’t is because as you add more

00:25:26layers you’re adding more expressive

00:25:28power notice that we’re here with four

00:25:30layers you’re actually doing just as

00:25:32good as the red curve so you’re doing

00:25:33great on the training set but because

00:25:35your model is now so expressive you’re

00:25:37memorizing the inputs and so you

00:25:38generalize more poorly so a similar

00:25:43 A similar story can be told about the hidden state size, that is, how wide those memory cells are, how many units we have in them. As we increase the hidden state size we get improvements in generalization, from this blue curve all the way down until we get to 256 hidden units, this green curve. After that we see the same kind of behavior, where the training set error goes lower and lower, but because you're memorizing the inputs, because your model is now too powerful, your generalization error actually gets worse. Finally, LSTMs: they're pretty complicated, and the reason I introduced them is that the memory cell is actually critical for performance. The basic Elman-type recurrent neural network, which just reuses the standard recurrent neural network architecture for the memory cell, is shown here as this green curve, which actually doesn't do too badly; but by using long short-term memory you get this yellow curve at the very bottom, doing the best out of all the memory-cell architectures we looked at. Gated recurrent units are a simpler variant of LSTMs; they haven't been used as much, so there's less literature about them, but on this task they also appear to do quite well. Cool.

00:27:01 After all of this experimentation and all of this manual grid search, we finally arrived at a final architecture, where notes are first embedded into a 32-dimensional real-valued vector, and then we have a three-layer stacked long short-term memory recurrent neural network which processes these note sequences over time. We trained it using standard gradient descent with a couple of tricks. We use this thing called dropout, with a setting of 30%, and what this means is that, in between subsequent connections between layers, we randomly turn 30% of the neurons off. That seems a little bit counterintuitive; why might you want to do that? It turns out that by turning off neurons during training, you force the neurons to learn more robust features that are independent of each other. If those connections were always reliably available, neurons might learn to combine two features together, and you end up getting correlated features where two neurons are actually learning the exact same feature. With dropout, we'll show on the next slide that generalization improves as we increase this number, up to a certain point. We also use something called batch normalization, which basically just re-centers your data around zero and rescales the variance, so that you don't have to worry about floating-point overflows or underflows. And we use 128-time-step truncated backpropagation through time, again another thing that your optimizer will handle for you, but at a high level, rather than unrolling the entire network over the entire input sequence, which could be tens of thousands of notes long, we only unroll it 128 steps and we truncate the error signals; we basically say that after 128 time steps, whatever you do over here is not going to affect the future too much.
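Putting those numbers together, here is a Keras-style approximation of the described architecture: a 32-dimensional embedding, three stacked LSTM layers of 256 units, 30% dropout, and a 108-symbol vocabulary. This is a sketch based on the figures stated in the talk, not necessarily the original BachBot implementation, and the optimizer details are simplified:

```python
import tensorflow as tf

VOCAB = 108
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(VOCAB, 32),                          # note -> 32-dim vector
    tf.keras.layers.LSTM(256, return_sequences=True, dropout=0.3),
    tf.keras.layers.LSTM(256, return_sequences=True, dropout=0.3),
    tf.keras.layers.LSTM(256, return_sequences=True, dropout=0.3),
    tf.keras.layers.Dense(VOCAB, activation='softmax'),            # P(next token | history)
])
model.compile(optimizer='sgd', loss='sparse_categorical_crossentropy')
```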

00:29:00 Here's my promised slide about dropout. Counterintuitively, as we start dropping out, turning off random neuron connections, we actually generalize better. We see that without dropout the model starts to overfit dramatically: it gets better at generalizing, then worse and worse and worse, because it's got so many connections it can learn so much. You turn dropout up to 0.3 and you get this purple curve at the bottom, where you've turned off just the right amount, so that the features the model learns are robust and can generalize independently of other features. And if you turn it up too high, you're dropping out so much that you're injecting more noise than you are regularizing your model, and you actually don't generalize that well. The story on the training side is also consistent: as we increase dropout, you do strictly worse on training, and that makes sense too, because this isn't generalization, this is just how well the model can memorize its input data, and if you turn inputs off, you won't memorize as well.

00:29:59 Great. With the trained model we can do many things: we can compose and we can harmonize. The way we compose is the following. We have the hidden states, we have the inputs, and we have the model weights, and we can use the model weights to form this predictive distribution: what is the probability of my current note given all of the previous notes I've seen before? From this probability distribution we've just written down, we pick out a note according to how that distribution is parameterized; so up here, I think 'l' has the highest weight. After we sample it, we just set x_t equal to whatever we sampled and treat it as truth: we assume that whatever the output was right there is now the input for the next time step, and then we iterate this process forwards. So, starting with no notes at all, you feed in the start symbol and you just keep going until you sample the end symbol, and in that way we're able to generate novel automatic compositions.
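A sketch of that sampling loop (illustrative only; `predict_next` stands in for the trained model's predictive distribution, and the token values are assumed to be integer indices):

```python
import numpy as np

def compose(predict_next, start_token, end_token, rng=np.random.default_rng()):
    """predict_next(tokens) -> probability vector over the vocabulary for the next token."""
    tokens = [start_token]
    while tokens[-1] != end_token:
        probs = predict_next(tokens)                          # P(x_t | x_1..x_{t-1})
        tokens.append(int(rng.choice(len(probs), p=probs)))   # sample, then treat it as truth
    return tokens
```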

00:31:04 Harmonization is actually a generalization of composition. In composition, what we basically did was: I've got a start symbol, fill in the rest. Harmonization is where you say: I've got the melody, I've got the bass line, or I've got these certain notes; fill in the parts that I didn't specify. For this we actually proposed a suboptimal strategy. I'm going to let alpha denote the stuff that we're given, so alpha could be like {1, 3, 7}, the points in time where the notes are fixed, and the harmonization problem is: we need to choose the notes that aren't fixed, that is, choose the entire sequence x_1 through x_T such that the notes we're given, x_alpha, stay fixed. So our decision variables are the positions that are not in alpha, and we need to maximize this probability distribution. My kind-of-greedy solution, which I've received a lot of criticism for, is: okay, you're at this point in time, just sample the most likely thing at the next point in time. The reason this gets criticized is that if you greedily choose without looking at what influence this decision could have on your future, you might choose something that doesn't make any sense in the future harmonic context but sounds really good right now; it's kind of like acting without thinking about the consequences of your actions. But the testament to how well this actually performs is not how bad it could be theoretically; it's how well it does empirically. Is it still convincing? We'll find out soon.

00:32:43 But before we go there, let's uncover the black box. I've been talking about neural networks as just this thing which you can optimize, throw data at, and it'll learn things. Let's take a look inside and see what's actually going on. What I've done here is I've taken the various memory cells of my recurrent neural network and I've unrolled them over time, so on the x-axis you see time and on the y-axis I'm showing you the activations of all of the hidden units. So this is neuron number one through neuron number 32; this is neuron number one through neuron number 256 in the first hidden layer; and similarly this is neuron number one through neuron number 256 in the second hidden layer. Do you see any pattern there? I don't. I mean, I kind of do: I see there's this little smear right here, and it seems to show up everywhere, as well as right here, but there's not too much intuitive sense I can make out of this image. This is a common criticism of deep neural networks: they're like black boxes, where we don't know how they really work on the inside, but they seem to do awfully well.

00:33:49 As we get closer to the output, things start to make a little bit more sense. Previously I was showing the hidden units of the first and second layer; now I'm showing the third layer, as well as a linear combination of the third layer, and finally the outputs of the model. As you get towards the end you start seeing this little dotty pattern, and it almost looks like a piano roll. If you remember the representation of music I showed earlier, where we had time on the x-axis and pitch on the y-axis, this looks awfully similar to that, and this isn't surprising either. Recall we trained the neural network to predict the next note given the current note, or all the previous notes; if the network were doing perfectly, we would expect to just see the input here, delayed by a single time step, so it's unsurprising that we do see something resembling the input. But it's not quite exactly the input: sometimes we see multiple predictions at one point in time, and this is really representing the uncertainty inside our predictions. Because it's a probability distribution, we're not just saying the next note is this; rather, we're saying we're pretty sure the next note is this with this probability, but it could also be that with that probability. I call this the probabilistic piano roll; I don't know if that's standard terminology.

00:35:09 Here's one of the most interesting insights I found from this model: it appears to actually be learning music theory concepts. What I'm showing here is some input that I provided to the model, and here I've picked out some neurons. Oh, and these neurons were randomly selected, so I didn't go and fish for the ones I liked; rather, I just ran a random number generator, got eight of them out, and handed them off to my music theorist collaborator: hey, is there anything there? Here are the notes he made for me. He said that neuron 64, this one, and layer 1 neuron 138, this one, appear to be picking out perfect cadences with root-position chords in the tonic key. More music theory than I can understand, but if I look up here, that shape right there on the piano roll looks like that shape on the piano roll, looks like that shape on the piano roll. Interesting. Layer 1 neuron 151, I believe that is this one, no, that's this one, sorry: A minor cadences ending phrases 2 and 4. And again I look up here and, okay, yeah, that kind of chord right there looks kind of like that chord right there; they seem to be specializing in picking out specific types of chords. Okay, so it's learning Roman numeral analysis and tonics and root-position chords and cadences. And the last one: layer 1 neuron 87 and layer 2 neuron 37, I believe that's this one and this one, are picking out I6 chords. I have no idea what that means.

00:36:42 So, I showed you automatic composition at the beginning of the presentation, when I took some BachBot music and allegedly claimed it was Bach. I'll now show you what harmonization sounds like, and this is with the suboptimal strategy that I proposed. So we take a melody such as this:

00:36:58 [Music]

00:37:02 We tell the model this has to be the soprano line; what are the others likely to be? That's kind of convincing; it's almost like a baroque C major chord progression. What's really interesting, though, is that not only can we harmonize simple melodies like that, we can actually take popular tunes such as this:

00:37:26 [Music]

00:37:50 We can generate a novel baroque harmonization of what Bach might have done had he heard "Twinkle Twinkle Little Star" during his lifetime.

00:38:01 Now I'm going down the track of "oh, this is my model, it looks so good, it sounds so realistic", which is exactly what I was criticizing at the beginning of the talk. My third research goal was: how can we determine a standardized way to quantitatively assess the performance of generative models? For this particular task, and one which I recommend for all of automatic composition, the answer is to do a subjective listening experiment. So what we did is we built bachbot.com, and it looks like this: it's got a splash page, and it's kind of trying to go viral, asking "can you tell the difference between Bach and a computer?" It used to say man versus machine. The interface is simple: you're given two choices, one of them is Bach, one of them is BachBot, and you're asked to distinguish which one was the actual Bach. We put this up on the Internet.

00:38:53 We got around nineteen hundred participants from all around the world. Participants tended to be within the eighteen to forty-five age group, and we got a surprisingly large number of expert users who decided to contribute. We defined "expert" as a researcher, someone who has published, or a teacher, someone with professional accreditation as a music teacher; "advanced" as someone who has studied in a degree program for music; and "intermediate" as someone who plays an instrument. Here's how they did.

00:39:25 I've coded these with S, A, T, B to represent the parts that were asked to be harmonized. So this one is: given the alto, tenor, and bass, harmonize the soprano; this one was: given just the soprano and bass, harmonize the middle two; and this one is: compose everything, I'm going to give you nothing. This is the result that I've been quoting this entire talk: participants were only able to distinguish Bach from BachBot 7% better than random chance. But there are some other interesting findings in here. Well, I guess this isn't too surprising: if you delete the soprano line, then BachBot has to create a convincing melody, and it doesn't do too well; whereas if you delete the bass line, BachBot does a lot better. I think this is actually a consequence of the way I chose to deal with polyphony, in the sense that I serialized the music in soprano, alto, tenor, bass order, and so by the time BachBot got to figuring out what the bass note might be, it had already seen the soprano, alto, and tenor notes within that time instant, and so it already had a very strong harmonic context for what note might sound good. Whereas when it has to generate the soprano note, BachBot has no idea what the alto, tenor, and bass notes might be, so it's going to make a random guess that could be totally out of place. To validate this hypothesis, which is work left for the future, you could serialize in a different order, such as bass, tenor, alto, soprano, run this experiment again, and you would expect to see the results go down like this if the hypothesis is true, and differently if not.

00:41:06 Here I've taken the exact same plot from the previous slide, except I've now broken it down by music experience. Unsurprisingly, you see this curve where people do better as they get more experienced: the novices are only about three percent better than chance, whereas the experts are sixteen percent better; they probably know Bach, they've got it memorized, so they can tell the difference. But the interesting one is here: the experts do significantly worse than random chance when comparing Bach versus BachBot bass harmonizations. I actually don't have a good reason why, but it's surprising to me; it seems that the experts think BachBot is more convincing than actual Bach.

00:41:48 So, in conclusion, I've presented a deep long short-term memory generative model for composing, completing, and generating polyphonic music. This model isn't just research that I'm talking about that no one ever gets to use: it's actually open source, it's on my GitHub, and moreover, Google Brain's Magenta project has already integrated it, so if you use the polyphonic recurrent neural network model in the Magenta and TensorFlow projects, you'll be using the BachBot model. The model appears to learn music theory without any prior knowledge: we didn't tell it this is a chord, this is a cadence, this is a tonic; it just decided to figure that out on its own in order to optimize performance on an automatic composition task. To me this suggests that music theory, with all of its rules and all of its formalisms, actually is useful for composing; in fact, it's so useful that a machine trained to optimize composition decided to specialize on these concepts. Finally, we conducted the largest musical Turing test to date, with 1,700 participants, only 7% of whom performed better than random chance. Obligatory note about my employer, Gigster: we do freelance outsourcing, so if you need a development team, let me know. Other than that, thank you so much for your attention; it was a pleasure speaking to you all.

00:43:13 [Applause]