[Music]
Cool, thank you. Just so I know what level to speak at: raise your hand if you know who Bach is. Great. Raise your hand if you know what a neural network is. Oh, this is the perfect crowd, awesome. If you don't know, don't worry; I'm going to cover the very basics of both. So let's talk about Bach. I'm going to play you some music.
[Music]
Now, what you just heard is what's known as a chorale. There are four parts to it, a soprano, alto, tenor, and bass, playing at the exact same time, and there's very regular phrasing structure, where you have the beginning of a phrase, the termination of a phrase, followed by the next phrase. Except that wasn't Bach. Rather, that was a computer algorithm called BachBot, and that was one sample of its outputs. If you don't believe me, it's on SoundCloud; it's called "sample one", go listen for yourself. So instead of talking about Bach today, I'm going to talk to you about BachBot. Hi, my name is Feynman, and it's a pleasure to be here in Amsterdam. Today we'll talk about automatic stylistic composition using long short-term memory.
A bit of background about myself: I'm currently a software engineer at Gigster, where I work on interesting automation problems around taking contracts, dividing them into subcontracts, and then freelancing them out. The work on BachBot was done as part of my master's thesis, which I did at the University of Cambridge with Microsoft Research Cambridge. In line with the track here, I do not have a PhD, and I can still do machine learning; it's a fact that you can do machine learning without a PhD.
For those of you who just want to know what's going to happen and then get out of here, here is the executive summary. I'm going to talk to you about how to train, end to end, starting from dataset preparation all the way to model tuning and deployment, a deep recurrent neural network for music. This neural network is capable of polyphony, multiple simultaneous voices at the same time. It's capable of automatic composition, generating a composition completely from scratch, as well as harmonization: given some fixed parts, such as the soprano line of the melody, generate the remaining supporting parts. The model learns music theory without being told to do so, providing empirical validation of what music theorists have been using for centuries. And finally, it's evaluated on an online musical Turing test where, out of 1,700 participants, only nine percent were able to distinguish actual Bach from BachBot.
When I set off on this research, there were three primary goals. The first question I wanted to answer was: what is the frontier of computational creativity? Creativity is something we take to be innately human, innately special in some sense; computers ought not to be able to replicate this about us. Is this actually true? Can we have computers generate art that is convincingly human?
The second question I wanted to answer was: how much has deep learning impacted automatic music composition? Automatic music composition is a special field; it has been dominated by symbolic methods, which utilize things like formal grammars or context-free grammars, such as this parse tree. We saw connectionist methods in the 1980s and 1990s; however, they fell in popularity, and most recent systems have used symbolic methods. With the work here, I wanted to see whether the advances in deep learning over the last ten years can be transferred over to this particular problem domain.
And finally, the last question I wanted to look at is: how do we evaluate these generative models? We've seen, in the previous talk, a lot of models that generate art; we look at it, and as the author we say "oh, that's convincing" or "oh, that's beautiful". And great, that might be a perfectly valid use case, but it's not sufficient for publication. To publish something, we need to establish a standardized benchmark, and we need to be able to evaluate all of our models against it, so we can objectively say which model is better than the other.
Now, if you're still here, I'm assuming you're interested. This is the outline. We'll start with a quick primer on music theory, giving you just the basic terminology you need to understand the remainder of this presentation. We'll talk about how to prepare a dataset of Bach chorales. We'll then give a primer on recurrent neural networks, which is the actual deep learning model architecture used to build BachBot. We'll talk about the BachBot model itself, the tips, tricks, and techniques that we used in order to train it, have it run successfully, and deploy it. And then we'll show the results.
We'll show how this model is able to capture statistical regularities in Bach's musical style, and we'll provide (we won't prove, but we'll provide) very convincing evidence that music theory does have empirical justification. And finally, I'll show the results of the musical Turing test, which was our proposed evaluation methodology for saying: yes, this model has met our research goal; the task of automatically composing convincing Bach chorales is more closed than open as a result of BachBot.
If you're a hands-on type of learner, we've containerized the entire deployment. If you go to my website, I have a copy of the slides with all of these instructions: you run these eight lines of code and it runs this entire pipeline right here, where it takes the chorales, preprocesses them, puts them into a data store, trains the deep learning model, samples from the deep learning model, and produces outputs that you can listen to.
Let's start with basic music theory. When people think of music, this is usually what you think about: you've got these bar lines, you've got notes, and these notes are in different horizontal and vertical positions. Some of them have interesting ties, some of them have dots, and there's this interesting, weird, little hat-looking thing. We don't need all of this; we need three fundamental concepts. The first is pitch. Pitch is often described as how low or how high a note is, so if I play this, we can distinguish that some notes are lower and some notes are higher in frequency, and that corresponds to the vertical axis here: as the notes sound ascending, they appear ascending on the staff.
The second attribute we need is duration, and this is really how long a note is. So this one note, these two notes, these four, and these eight all have equal total duration, but they are halvings of each other. If we take a listen, the general intuition is: the more bars there are on the note stems, the faster the notes. With just those two concepts, this is starting to make a little bit more sense. This note right here is twice as fast as this note, we can see this note is higher than this note, and you can generalize this to the remainder of the piece. But there's still this funny hat-looking thing; we'll get to the hat in a second. With pitch and duration, we can rewrite the music like so. Rather than representing it using notation, which may be kind of cryptic, we show it here as a matrix, where on the x-axis we have time, so duration, and on the y-axis we have pitch, how high or low in frequency that note is. What we've done is take the symbolic representation of music and turn it into a digital, computable format that we can train models on.
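To make that matrix representation concrete, here is a minimal numpy sketch (the toy note list, the sixteenth-note grid, and all names are illustrative assumptions, not taken from the talk):

```python
# A minimal piano-roll sketch: rows are MIDI pitches, columns are time steps.
import numpy as np

# (midi_pitch, start_in_quarter_notes, duration_in_quarter_notes): a made-up fragment
notes = [(60, 0.0, 1.0), (64, 1.0, 0.5), (67, 1.5, 0.5), (72, 2.0, 2.0)]

STEPS_PER_QUARTER = 4                               # quantize time to sixteenth notes
total_steps = 4 * STEPS_PER_QUARTER                 # this fragment is 4 quarter notes long
roll = np.zeros((128, total_steps), dtype=np.int8)  # 128 MIDI pitches x time

for pitch, start, dur in notes:
    begin = int(round(start * STEPS_PER_QUARTER))
    end = int(round((start + dur) * STEPS_PER_QUARTER))
    roll[pitch, begin:end] = 1                      # mark the note as sounding

print(roll.shape)                                   # (128, 16)
```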
Back to the hat-looking thing: this is called a fermata, and Bach used it to denote the ends of phrases. We had originally set out on this research completely neglecting fermatas, and we found that the phrases generated by the model just kind of wandered; they never seemed to end, there was no sense of resolution or conclusion, and that was unrealistic. By adding these fermatas, all of a sudden the model turned around and we suddenly found realistic phrasing structure. Cool.
And that's all the music you need to know; the rest of it is machine learning. Now, the biggest part of a machine learning engineer's job is preparing their datasets. This is a very painful task: you usually have to scour the internet or find some standardized dataset that you train and evaluate your models on, and usually these datasets have to be preprocessed and massaged into a format that's amenable to learning. For us it was no different. Bach's works, however, have fortunately over the years been catalogued in (excuse my German) the Bach-Werke-Verzeichnis, the BWV, which is how I've been referring to this corpus. It contains all 438 harmonizations of Bach chorales, and conveniently it is available through a software package called music21. This is a Python package that you can just pip install and then import, and now you have an iterator over a collection of music.
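As a rough illustration of that iterator (a sketch assuming a standard music21 install; BachBot's own preprocessing code may use different calls):

```python
# Iterate over the Bach chorale corpus bundled with music21 (pip install music21).
from music21 import corpus

for chorale in corpus.chorales.Iterator():       # yields music21 Score objects
    part_names = [p.id for p in chorale.parts]   # typically Soprano, Alto, Tenor, Bass
    print(chorale.metadata.title, part_names)
    break                                        # just peek at the first chorale
```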
The first preprocessing step we did was to take the original music here and do two things: we transposed it, and then we quantized it in time. You can notice the transposition by looking at these accidentals right here, these two funny little backwards-or-forwards b's (flats), which are absent over here; furthermore, that note has shifted up by half a line. It's a little hard to see, but it's happening. The reason we did this is that we didn't want to learn key signature. Key signature is usually something decided by the author before the piece has even begun to be composed, and so the key signature itself can be injected as a separate step where we sample over all the keys Bach did use. So we removed key signature from the equation through transposition, and I'll justify why that's an okay thing to do in the next slide. This first measure is a progression of five notes written in C major, and then what I did in the next measure is I just moved it up by five whole steps.
[Music]
So yes, the pitch did change; it's relatively higher, it's absolutely higher on all accounts. But the relations between the notes didn't change, and the sensation, the motifs that the music brings out, those still remain fairly constant even after transposition. Quantization, however, is a different story. If I go back to the slides, you'll notice quantization took this thirty-second note and turned it into a sixteenth note by removing that second bar; we've distorted time. Is that a problem? It's not perfect, but it's a very minor problem. Over here I've plotted a histogram of all of the durations inside the chorale corpus, and this quantization affects only 0.2% of all the notes that we're training on. The reason we do it is that by quantizing in time we're able to get discrete representations in both time and pitch, whereas working on a continuous time axis is harder: computers are discrete and unable to operate on a continuous representation; it has to be quantized into a digital format somehow.
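A rough sketch of those two preprocessing steps using music21 (the key analysis, the C major / A minor targets, and the sixteenth-note grid are my reading of the talk, not the original code):

```python
# Transpose a chorale so key is factored out, then quantize to a sixteenth-note grid.
from music21 import corpus, interval, pitch

chorale = next(corpus.chorales.Iterator())

key = chorale.analyze('key')                     # estimate the key of the piece
target = 'C' if key.mode == 'major' else 'A'     # C major or A minor after transposition
shift = interval.Interval(key.tonic, pitch.Pitch(target))
normalized = chorale.transpose(shift)

# Snap offsets and durations to sixteenth notes (4 divisions per quarter note).
normalized.quantize(quarterLengthDivisors=(4,), processOffsets=True,
                    processDurations=True, inPlace=True)
```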
The last challenge is polyphony. Polyphony is the presence of multiple simultaneous voices. So far, in the examples I've shown you, you've just heard a single voice playing at any given time, but a chorale has four voices: the soprano, the alto, the tenor, and the bass. So here's a question for you: if I have four voices and they can each represent 128 different pitches (that's the constraint in the MIDI representation of music), how many different chords can I construct? Very good, yes: 128^4, that's correct. I put a big O on it because you can quibble about rearranging the ordering, but more or less that's correct. And why is this a problem? Well, it's a problem because most of these chords are actually never seen, especially after you transpose to C major or A minor. In fact, looking at the dataset, we can see that just the first 20 chords, or 20 notes rather, occupy almost 90% of the entire dataset. So if we were to represent all of these chords, we would have a ton of symbols in our vocabulary which we had never seen before.
The way we deal with this problem is by serializing. That is, instead of representing all four notes as one symbol, we represent each individual note as a symbol itself, and we serialize in soprano, alto, tenor, bass order. What you end up getting is a reduction from 128^4 possible chords down to just 128 possible pitches. This may seem a little unjustified, but it's actually done all the time in sequence processing: if you take a look at traditional language models, you can represent them either at the character level or at the word level; similarly, you can represent music either at the note level or at the chord level.
After serializing, the data looks like this. We have a symbol denoting the start of a piece, which is used to initialize our model. We then have the four notes, soprano, alto, tenor, bass, followed by a delimiter indicating the end of this frame, meaning time has advanced one step into the future, followed by another soprano, alto, tenor, bass. We also have these funny-looking dot symbols, which I came up with to denote the fermata, so that we can encode where the ends of phrases are in our input training data.
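To make that encoding concrete, here is a hypothetical serializer in the spirit of what was just described (the token spellings START, END, the frame delimiter, and the fermata marker are my stand-ins, not necessarily BachBot's exact symbols):

```python
# Serialize chorale frames into a flat token sequence: START, then for each time step
# an optional fermata marker, the four SATB pitches, and a frame delimiter, then END.
START, END, FRAME_END, FERMATA = "START", "END", "|||", "(.)"

def serialize(frames):
    """frames: list of (satb, has_fermata), where satb is a 4-tuple of MIDI pitches
    in soprano, alto, tenor, bass order."""
    tokens = [START]
    for satb, has_fermata in frames:
        if has_fermata:
            tokens.append(FERMATA)           # this frame ends a phrase
        tokens.extend(str(p) for p in satb)  # one symbol per note, SATB order
        tokens.append(FRAME_END)             # time advances by one step
    tokens.append(END)
    return tokens

print(serialize([((67, 64, 60, 48), False), ((69, 65, 62, 50), True)]))
```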
After all of our preprocessing, our final corpus looks like this. There are only 108 symbols left, so not all 128 pitches are used in Bach's works, and there are about four hundred thousand tokens total, which we split into three hundred and eighty thousand for a training set and forty thousand for a validation set. We split between training and validation in order to prevent overfitting: we don't want to just memorize Bach's chorales; rather, we want to be able to produce very similar samples which are not exact copies. And that's it. With that you have the training set, and it's encapsulated by the first few commands on that slide I showed earlier: the BachBot steps that make the dataset and extract the vocabulary. The next step is to train the recurrent neural network.
To talk about recurrent neural networks, let's break the phrase down: recurrent, neural, network. I'm going to start with "neural". Neural just means that we have very basic building blocks called neurons, which look like this. They take a D-dimensional input, x_1 through x_D; these are numbers like 0.9 or 0.2, and they're combined in a linear combination. What you end up getting is this activation z, which is just the sum of these inputs weighted by the w's. So if a neuron really cares about, say, x_2, then w_2 will be large and the rest will be near zero; this lets the neuron preferentially select which of its inputs it cares more about and allows it to specialize for certain parts of its input. This activation is passed through this S-shaped thing called an activation function, commonly a sigmoid; all it does is introduce a non-linearity into the network, which expands the types of functions you can approximate.
And then we have the output, called y. You take these neurons, you stack them horizontally, and you get what's called a layer. Here I'm showing four neurons in this layer, three neurons in this layer, and two neurons in this top layer, and I can represent the network like this: we take the input x, this bottom part, and multiply it by a matrix W (because we've replicated the neurons horizontally, the weights now form a matrix), then pass it through the sigmoid activation function to get the first layer's outputs. This is done recursively through all the layers until you get to the very top, where we have the final outputs of the model. The W's here, the weights, are the parameters of the network, and these are the things we need to learn in order to train the neural network.
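A minimal numpy sketch of that forward pass (the layer widths loosely follow the slide's 4-3-2 stack; the random weights are purely illustrative):

```python
# Feed-forward pass: each layer computes y = sigmoid(W x + b).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
layer_sizes = [4, 4, 3, 2]                 # input dimension, then the 4-3-2 stack of neurons
weights = [rng.normal(size=(n_out, n_in))
           for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]
biases = [np.zeros(n_out) for n_out in layer_sizes[1:]]

x = np.array([0.9, 0.2, 0.1, 0.7])         # a D-dimensional input
for W, b in zip(weights, biases):
    x = sigmoid(W @ x + b)                 # apply each layer in turn
print(x)                                   # the final 2-dimensional output
```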
Great, we now know feed-forward neural networks; let's introduce the word "recurrent". Recurrent just means that the previous hidden state is used in the next time step's prediction. What I'm showing here, if you just pay attention to this input, this layer right here, and this output, is the same thing as before, except we've added this funny little loop coming back; this is electrical engineering notation for a unit time delay. What it's saying is: take the hidden state from time t-1 and also include it as input to the time-t prediction. In equations it looks like this: the current hidden state is a weighted activation of the current input plus a weighted activation of the previous hidden state, h_t = sigma(W_xh x_t + W_hh h_{t-1}), and the output is a function only of the current hidden state, y_t = sigma(W_hy h_t).
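Here is a tiny numpy sketch of that recurrence (sizes and random weights are arbitrary, just to show the update rule):

```python
# One Elman-style recurrence: h_t = sigmoid(W_xh x_t + W_hh h_{t-1}), y_t = sigmoid(W_hy h_t).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
D, H, O = 3, 5, 2                            # input, hidden, and output sizes
W_xh = rng.normal(size=(H, D))
W_hh = rng.normal(size=(H, H))
W_hy = rng.normal(size=(O, H))

h = np.zeros(H)                              # hidden state starts at zero
for x_t in [np.array([1.0, 0.0, 0.0]), np.array([0.0, 1.0, 0.0])]:
    h = sigmoid(W_xh @ x_t + W_hh @ h)       # previous hidden state feeds back in
    y = sigmoid(W_hy @ h)                    # output depends only on the current hidden state
    print(y)
```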
Before going further: this is called an Elman-type recurrent neural network. Its memory cell is very basic; it's doing the exact same thing a normal feed-forward layer would do. It turns out there are some problems with just using this basic architecture, and so the architecture the field has been converging towards is known as long short-term memory.
It looks really complicated, but it's not: you take the inputs and the hidden state and you feed them into three spots right here, an input gate, a forget gate, and an output gate. The point of adding all this complexity is to solve a problem known as the vanishing gradient problem, where repeatedly feeding the hidden state back through the same weights causes error signals to converge toward zero or diverge to infinity; the LSTM's gated cell (its "constant error carousel") keeps that error flow stable. Fortunately, this is usually available as a black-box implementation in most software packages: you just specify that you want to use an LSTM, and all of this is abstracted away from you.
Now, if you squint, you can kind of see the memory cell that I showed previously, where we have the inputs and the hidden state feeding back into itself to generate an output; I've abstracted it away like this, and I've stacked copies on top of each other. So rather than just having the outputs come out of this h right here, I've actually made them the inputs to yet another memory cell. This is where the word "deep" comes from: deep networks are just networks that have a lot of layers, and by stacking I get to use the word "deep" in my deep LSTM model. But I'll show you later that I'm not just doing it for the buzzword; depth actually matters, as we'll see in the results. Another operation that's important for LSTMs is unrolling. What unrolling does is take this unit time delay and replicate the LSTM units over time: rather than showing the delay as a loop, I show the t-1 hidden unit passing state into the t hidden unit, which passes state into the t+1 hidden unit. Your input is variable length, and to train the network you expand this graph, you unroll the LSTM to the same length as your variable-length input, in order to get these predictions up at the top.
Great, we know all we need to know about music and RNNs; let's move on to how BachBot works. To train BachBot, we apply a sequential prediction criterion. Now, I've stolen this figure from Andrej Karpathy's GitHub, but the principles are the same. Suppose we're given the input characters "hello" and we want to model them using a recurrent neural network. The training criterion is: given the current input character and the previous hidden state, predict the next character. So notice, down here I have "h" and I'm trying to predict "e"; I have "e" and I'm trying to predict "l"; I have "l" and I'm trying to predict "l"; and I have "l" and I'm trying to predict "o". If we take this analogy to music: I have all of the notes I've seen up until this point in time, and I'm trying to predict the next note. I can iterate this process forwards to generate compositions.
The criterion we want to use, then: the output layer here is actually a probability distribution. So take the previous slide and now put it on top of my unrolled network. We're given the initial hidden state, which we just initialize to all zeros because we have a unique start symbol used to initialize our pieces, and the RNN dynamics. The output y_t is the probability distribution over the next symbol given the current state; it's a function of the current input x_t as well as the previous hidden state from t-1. We need to choose the RNN parameters, these weight matrices, the weights of all the connections between all the neurons, in order to maximize this probability right here: the probability of the real Bach chorales. So down here we have all the notes of the real Bach chorale, and up here we have the next notes of those. In an ideal world, if we just initialized it with some Bach chorale, it would just memorize and return the remainder, and that would do great on this prediction criterion, but that's not exactly what we want.
Nevertheless, once we have this criterion, the way the model is actually trained is by using the chain rule from calculus, where we take partial derivatives. Up here we have an error signal: I know this is the real note that Bach used, and this is the thing my model is predicting, and they're a little bit different. How do I change the parameters, this weight matrix between the hidden state and the outputs, this weight matrix between the previous hidden state and the current hidden state, and this weight matrix between the inputs and the hidden state? How do I wiggle those to make this output up here closer to what Bach actually produced? This training criterion can be formalized by taking gradients using calculus and iterating, an optimization known as stochastic gradient descent; when applied to neural networks it's an algorithm called backpropagation, or backpropagation through time if you want to get nitty-gritty, because we've unrolled the neural network over time. But again, this is an abstraction that need not concern you, because it's usually provided for you as a black box inside common frameworks such as TensorFlow and Keras.
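As a toy illustration of that training criterion (the four-symbol vocabulary and the "model" probabilities below are made up): the quantity being minimized is the negative log-likelihood of each true next token under the predicted distribution.

```python
# Toy next-token negative log-likelihood, the loss that SGD/backpropagation minimizes.
import numpy as np

vocab = ["START", "60", "64", "END"]
sequence = [0, 1, 2, 3]                        # token indices: START, 60, 64, END

# Pretend model outputs: a distribution over the next token at each of the 3 steps.
predicted = np.array([
    [0.01, 0.80, 0.14, 0.05],                  # after START, mostly predicts "60"
    [0.01, 0.10, 0.80, 0.09],                  # after "60",  mostly predicts "64"
    [0.02, 0.08, 0.10, 0.80],                  # after "64",  mostly predicts END
])

targets = sequence[1:]                         # the true next tokens
nll = -np.mean(np.log(predicted[np.arange(len(targets)), targets]))
print(f"mean negative log-likelihood: {nll:.3f}")
```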
We now have the BachBot model, but there are a couple of parameters that we need to look at: I haven't told you exactly how deep BachBot is, nor have I told you how big these layers are. Before we start: when optimizing models, this is a very important lesson, and it's probably obvious by now, but GPUs are very important for rapid experimentation. I did a quick benchmark and found that a GPU delivers roughly an 8x speedup, taking my training time down from 256 minutes to just 28 minutes. So if you want to iterate quickly, getting a GPU will make you about eight times more productive.
Did I just put the word "deep" onto my neural network because it was a good buzzword? It turns out no: depth actually matters. What I'm showing you here are the training losses as well as the validation losses as I change the depth. The training loss is how well my model is doing on the training dataset, which I'm letting it see and letting it tune its parameters to do better on; the validation loss is how well my model is doing on data that I didn't let it see, so how well it generalizes beyond just memorizing its inputs. What we notice here is that with just one layer the validation error is quite high; two layers gets you down here; three gets you this red curve, which is as low as it goes; and if you keep going with four, it goes back up. Should this be surprising? It shouldn't, and the reason is that as you add more layers, you're adding more expressive power. Notice that here with four layers you're actually doing just as well as the red curve on the training set, but because your model is now so expressive, you're memorizing the inputs, and so you generalize more poorly.
A similar story can be told about the hidden state size, so how wide those memory cells are, how many units are in them. As we increase the hidden state size, we get improvements in generalization, from this blue curve all the way down until we get to 256 hidden units, this green curve. After that we see the same kind of behavior: the training set error goes lower and lower, but because you're memorizing the inputs, because your model is now too powerful, your generalization error actually gets worse.
Finally, LSTMs: they're pretty complicated, and the reason I introduced them is that they're actually critical for performance. The basic Elman-type recurrent neural network, which just reuses the standard feed-forward architecture for the memory cell, is shown here in this green curve, which actually doesn't do too badly. But by using long short-term memory you get this yellow curve at the very bottom; it does the best out of all the memory-cell architectures we looked at. Gated recurrent units are a simpler variant of LSTMs; they haven't been used as much, so there's less literature about them, but on this task they also appear to do quite well. Cool.
After all of this experimentation and all of this manual grid search, we finally arrived at a final architecture, where notes are first embedded into a 32-dimensional real-valued vector, and then a three-layer stacked long short-term memory recurrent neural network processes these note sequences over time.
We trained it using standard gradient descent with a couple of tricks. We use this thing called dropout, with a setting of 30%, which means that in the connections between subsequent layers we randomly turn 30% of the neurons off. That seems a little counterintuitive; why would you want to do that? It turns out that by turning off neurons during training you force the neurons to learn more robust features that are independent of each other. If those connections are always reliably available, two neurons may learn to combine their inputs in the same way, and you end up with correlated features, two neurons effectively learning the exact same feature. With dropout, we'll actually show in the next slide that generalization improves as we increase this number, up to a certain point.
We also use something called batch normalization, which basically takes your data, re-centers it around zero, and rescales the variance, so that you don't have to worry about floating-point overflows or underflows. And we use 128-step truncated backpropagation through time, again something your optimizer will handle for you, but at a high level what this does is: rather than unrolling the entire network over the entire input sequence, which could be tens of thousands of notes long, we only unroll it 128 steps and we truncate the error signals. We basically say that after 128 time steps, whatever you do over here is not going to affect the future too much.
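Putting the pieces together, here is a rough Keras approximation of the architecture just described: a 32-dimensional note embedding feeding a three-layer stack of 256-unit LSTMs with 30% dropout, predicting a distribution over the 108-symbol vocabulary. This is my sketch for illustration; the original BachBot implementation and its exact options (for example, how it applies batch normalization) differ.

```python
# A rough Keras sketch of a BachBot-like model (not the original implementation).
import tensorflow as tf

VOCAB_SIZE = 108        # symbols left after preprocessing
SEQ_LEN = 128           # truncated backpropagation-through-time window

model = tf.keras.Sequential([
    tf.keras.Input(shape=(SEQ_LEN,), dtype="int32"),
    tf.keras.layers.Embedding(VOCAB_SIZE, 32),                      # 32-dim note embedding
    tf.keras.layers.LSTM(256, return_sequences=True, dropout=0.3),  # three stacked LSTM layers,
    tf.keras.layers.LSTM(256, return_sequences=True, dropout=0.3),  # 256 hidden units each,
    tf.keras.layers.LSTM(256, return_sequences=True, dropout=0.3),  # with 30% dropout
    tf.keras.layers.Dense(VOCAB_SIZE, activation="softmax"),        # next-token distribution
])
model.compile(optimizer="sgd", loss="sparse_categorical_crossentropy")
model.summary()
```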
Here's my promised slide about dropout. Counter-intuitively, as we start dropping out, turning off random neuron connections, we actually generalize better. We see that without dropout the model starts to overfit dramatically: it gets worse and worse at generalizing, because it has so many connections it can learn so much. If you turn dropout up to 0.3 you get this purple curve at the bottom, where you've turned off just the right amount, so that the features the model learns are robust and generalize independently of other features. And if you turn it up too high, now you're dropping out so much that you're injecting more noise than you are regularizing the model, and you actually don't generalize as well. The story on the training side is also consistent: as we increase dropout, you do strictly worse on training, and that makes sense too, because this isn't generalization, this is just how well the model can memorize its input data, and if you turn inputs off, you won't memorize as well.
Great. With the trained model we can do many things: we can compose and we can harmonize. The way we compose is the following. We have the hidden state, we have the inputs, and we have the model weights, and so we can use the model weights to form this predictive distribution: what is the probability of the current note given all of the previous notes I've seen? From this probability distribution we pick out a note according to how the distribution is parameterized; so up here, I think "l" has the highest weight. After we sample, we just set x_t equal to whatever we sampled and treat it as truth: we assume that whatever the output was right there is now the input for the next time step, and we iterate this process forwards. So starting with no notes at all, you feed in the start symbol and you just keep going until you sample the end symbol, and in that way we're able to generate novel automatic compositions.
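A minimal sketch of that sampling loop; `predict_next` below is a dummy stand-in for the trained network's next-token distribution, and the token names are illustrative:

```python
# Ancestral sampling: feed the model's own sampled output back in as the next input.
import numpy as np

rng = np.random.default_rng(0)
VOCAB = ["START", "60", "64", "67", "END"]

def predict_next(history):
    """Return a distribution over the next token given the history so far.
    A real version would run the trained RNN; this dummy just favors END later on."""
    p = np.ones(len(VOCAB))
    p[0] = 0.0                                   # never emit START again
    p[-1] = 5.0 if len(history) > 12 else 0.1    # make END likely once the piece is long enough
    return p / p.sum()

tokens = ["START"]
while tokens[-1] != "END":
    probs = predict_next(tokens)
    tokens.append(VOCAB[rng.choice(len(VOCAB), p=probs)])   # sample, then treat it as truth
print(tokens)
```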
Harmonization is actually a generalization of composition. In composition, what we basically did was: I've got a start symbol, fill in the rest. Harmonization is where you say: I've got the melody, or I've got the bass line, or I've got these certain notes; fill in the parts that I didn't specify. For this we actually proposed a suboptimal strategy. I'm going to let alpha denote the positions we're given, so alpha could be, say, {1, 3, 7}, the points in time where the notes are fixed, and the harmonization problem is to choose the notes that aren't fixed: we need to choose the entire sequence x_1 through x_T such that the notes we're given, x_alpha, stay fixed. So our decision variables are the positions that are not in alpha, and we need to maximize this probability distribution.
My kind-of-greedy solution, which I've received a lot of criticism for, is: at each point in time, just sample the most likely thing at the next point in time. The reason this gets criticized is that if you greedily choose without looking at the influence this decision could have on the future, you might choose something that sounds really good right now but doesn't make any sense in the future harmonic context. It's kind of like acting without thinking about the consequences of your actions. But the testament to how well this actually performs is not how bad it could be theoretically; it's how well it does empirically. Is it still convincing? We'll find out soon.
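A sketch of that greedy fill-in strategy; as above, `predict_next` is a dummy stand-in for the trained model, and the template format is my own illustration of the idea:

```python
# Greedy harmonization: keep the given notes (the positions in alpha), fill in the rest
# by repeatedly taking the most likely next token.
import numpy as np

VOCAB = ["START", "60", "62", "64", "65", "67", "END"]

def predict_next(history):
    """Dummy next-token distribution; a real version would run the trained RNN."""
    p = np.ones(len(VOCAB))
    p[0] = 0.0
    return p / p.sum()

def harmonize(template):
    """template: list where fixed positions hold a token (e.g. a soprano note)
    and free positions hold None."""
    tokens = ["START"]
    for fixed in template:
        if fixed is not None:
            tokens.append(fixed)                         # fixed note: keep it as given
        else:
            probs = predict_next(tokens)
            tokens.append(VOCAB[int(np.argmax(probs))])  # greedy: most likely next token
    return tokens

print(harmonize(["67", None, None, None, "65", None, None, None]))
```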
But before we go there, let's uncover the black box. I've been talking about neural networks as just this thing which you can optimize, throw data at, and it'll learn things; let's take a look inside and see what's actually going on. What I've done here is take the various memory cells of my recurrent neural network and unroll them over time, so on the x-axis you see time, and on the y-axis I'm showing you the activations of all of the hidden units. So this is neuron number one through neuron number 32, this is neuron number one through neuron number 256 in the first hidden layer, and similarly this is neuron number one through neuron number 256 in the second hidden layer. See any pattern there? I don't. I mean, I kind of do: there's this little smear right here, and it seems to show up everywhere, as well as right here, but there's not too much intuitive sense I can make out of this image. This is a common criticism of deep neural networks: they're like black boxes where we don't know how they really work on the inside, but they seem to do awfully well.
As we get closer to the output, things start to make a little bit more sense. I was previously showing the hidden units of the first and second layers; now I'm showing the third layer, as well as a linear combination of the third layer, and finally the outputs of the model. As you get towards the end you start seeing this little dotted pattern; it almost looks like a piano roll. If you remember the representation of music I showed earlier, where we had time on the x-axis and pitch on the y-axis, this looks awfully similar to that. And this isn't surprising either: recall that we trained the neural network to predict the next note given the current note, or all the previous notes. If the network were doing perfectly, we would expect to just see the input here, delayed by a single time step, so it's unsurprising that we do see something that resembles the input. But it's not quite exactly the input: sometimes we see multiple predictions at one point in time, and this is really representing the uncertainty in our predictions. Because the output is a probability distribution, we're not just saying "the next note is this"; rather, we're saying "we're pretty sure the next note is this, with this probability, but it could also be that, with that probability." I call this the probabilistic piano roll; I don't know if that's standard terminology.
Here's one of the most interesting insights I found from this model: it appears to actually be learning music theory concepts. What I'm showing here is some input that I provided to the model, and here I've picked out some neurons. These neurons are randomly selected; I didn't go fishing for the ones I liked, I just ran a random number generator, got eight of them, and handed them off to my music theorist collaborator, asking, "hey, is there anything here?" Here are the notes he made for me. He said that layer 1, neuron 64 (this one) and layer 1, neuron 138 (this one) appear to be picking out perfect cadences with root-position chords in the tonic key. That's more music theory than I can understand, but if I look up here: that shape right there on the piano roll looks like that shape on the piano roll, and like that shape on the piano roll. Interesting. Layer 1, neuron 151, I believe that's this one, sorry, this one: A minor cadences ending phrases 2 and 4. And again, I look up here and, okay, that kind of chord right there looks kind of like that chord right there; they seem to be specializing in picking out specific types of chords. So it's learning Roman numeral analysis, and tonics, and root-position chords, and cadences. And the last one: layer 1, neuron 87 and layer 2, neuron 37, I believe that's this one and this one, are picking out I6 chords. I have no idea what that means.
So, I showed you automatic composition at the beginning of the presentation, when I took some BachBot music and allegedly claimed it was Bach. I'll now show you what harmonization sounds like, and this is with the suboptimal strategy that I proposed. We take a melody such as this:
[Music]
We tell the model: this has to be the soprano line; what are the others likely to be? That's kind of convincing; it's almost like a baroque C major chord progression. What's really interesting, though, is that not only can we harmonize simple melodies like that, we can actually take popular tunes such as this:
[Music]
and generate a novel baroque harmonization of what Bach might have done had he heard "Twinkle Twinkle Little Star" during his lifetime.
Now I'm going down the track of "oh, this is my model, it looks so good, it sounds so realistic", but I was just criticizing that at the beginning of the talk. My third research goal was actually: how can we determine a standardized way to quantitatively assess the performance of generative models for this particular task? And the approach I recommend for all of automatic composition is a subjective listening experiment. So what we did is we built bachbot.com, and it looks like this. It's got a splash page, and it's kind of trying to go viral: it asks, "can you tell the difference between Bach and a computer?" It used to say "man versus machine". The interface is simple: you're given two choices, one of them is Bach, one of them is BachBot, and you're asked to distinguish which one was the actual Bach. We put this up on the internet.
We got around nineteen hundred participants from all around the world. Participants tended to be within the eighteen-to-forty-five age group, and we got a surprisingly large number of expert users who decided to contribute. We defined "expert" as a researcher, someone who has published, or a teacher, someone with professional accreditation as a music teacher; "advanced" as someone who has studied in a degree program for music; and "intermediate" as someone who plays an instrument. And here's how they did.
I've coded these with S, A, T, B to represent the parts the model was asked to harmonize. So this one is: given the alto, tenor, and bass, harmonize the soprano; this one was: given just the soprano and bass, harmonize the middle two; and this one is: compose everything, I'm going to give you nothing. This is the result I've been quoting this entire talk: participants were only able to distinguish Bach from BachBot 7% better than random chance.
But there are some other interesting findings in here. Well, I guess this isn't too surprising: if you delete the soprano line, then BachBot has to create a convincing melody, and it doesn't do too well; whereas if you delete the bass line, BachBot does a lot better. I think this is actually a consequence of the way I chose to deal with polyphony, in the sense that I serialized the music in soprano, alto, tenor, bass order. By the time BachBot gets to figuring out what the bass note might be, it has already seen the soprano, alto, and tenor notes within that time instant, and so it already has a very strong harmonic context for what note might sound good. Whereas when it has to generate the soprano note, BachBot has no idea what the alto, tenor, or bass notes might be, so it's going to make a guess that could be totally out of place. To validate this hypothesis, which is work left for the future, you could serialize in a different order, such as bass, tenor, alto, soprano, and run this experiment again; you would expect to see this result drop if the hypothesis is true, and behave differently if not.
Here I've taken the exact same plot as before, except I've now broken it down by music experience. Unsurprisingly, you see this curve where people do better as they get more experienced: the novices are only about three percent better than chance, while the experts are sixteen percent better; they probably know Bach, they've got it memorized, so they can tell the difference. But the interesting one is here: the experts do significantly worse than random chance when comparing Bach versus BachBot bass harmonizations. I actually don't have a good explanation, but it's surprising to me; it seems that the experts think BachBot is more convincing than actual Bach.
So, in conclusion, I've presented a deep long short-term memory generative model for composing, completing, and generating polyphonic music. And this model isn't just research that I'm talking about that no one ever gets to use: it's actually open source, it's on my GitHub, and moreover Google Brain's Magenta project has already integrated it, so if you use the polyphonic recurrent neural network model in the Magenta and TensorFlow projects, you'll be using the BachBot model. The model appears to learn music theory without any prior knowledge: we didn't tell it "this is a chord, this is a cadence, this is a tonic"; it decided to figure that out on its own in order to optimize performance on an automatic composition task. To me, this suggests that music theory, with all of its rules and all of its formalisms, actually is useful for composing; in fact, it's so useful that a machine trained to optimize composition decided to specialize on these concepts. Finally, we conducted the largest musical Turing test to date, with 1,700 participants, only 7% of whom performed better than random chance.
Obligatory note for my employer: we do freelance outsourcing, so if you need a development team, let me know. Other than that, thank you so much for your attention; it was a pleasure speaking to you all.
[Applause]