GOTO 2016 • Deep Learning – What is it and What It Can Do For You • Diogo Moitinho de Almeida

cool thank you I'm Diogo as a bit of

background I have a very math and

computer background which is very good

for deep learning list of achievements

on why you should listen to me is that I

currently work at a super cool company

that uses deep learning to make

faster and more accurate medical

diagnosis and in past lives I've won

a whole lot of like international math

competitions and some programming

competitions as well and today I'm going

to chat about what deep learning is and

what it can do for you feel free to ask

questions at any time if I can ask that

right they can ask that right Simon cool

so yeah feel free to ask questions any

time just yell them out especially if

you think I'm lying to you so what is

deep learning gonna start with a

disclaimer deep learning is actually

pretty complicated it's hard to be very

general about everything and be correct

so when in doubt I'm going to favor

generality so if you're familiar with

deep learning already it's going to

sound like I'm lying a lot but in

reality like this is just to like kind

of give the really high level of it

and also whenever possible I'm going to

favor some of the shortcuts that might

not be a hundred percent correct but

should give the correct mental model of

how these things work but if you have

any questions feel free to ask so from

super high level there's a

lot of like different levels of

hierarchy here in the ecosystem there's

artificial intelligence which is a

superset of like everything an example

would be IBM Watson where lots of hand

coded rules and uses extremely large

amounts of expert manpower built to do a

specific task there's machine learning

which is a subset of that an example of

this would be like Google ad click

prediction and how you do this is rather

than using tons and tons of hard coded

rules you start using more examples to

figure out how to combine some hand

coded statistics to predict the

probability of for example an ad click

at a slightly deeper level you have

representation learning which is

sometimes seen as one layer deep

learning so sometimes referred like

these levels are called shallow learning

if you're trying to start a fight

and an example of this would be netflix

movie recommendation where the

statistics of what you even know about

each movie is learned from data but

you're still learning a simple

combination of how these features go

together and after a few levels you get

into deep learning so this would be an

example of figuring out diseases from

images where instead of you know having

a layer of manual statistics that are

learned and then combining together you

might learn all of these statistics at

the same time in tens hundreds or even

thousands of steps which is what some

people use nowadays this is probably not

a common view of what deep learning is

but I think the easiest view of how to

see it is deep learning is an interface

and this interface has roughly two

methods as the first method you have a

forward pass and this is definitely the

easy part given arbitrary input make

arbitrary output anyone can do this part

this is really easy the trick that

makes it work is the backwards pass so

given a desired change in the output you

want to be able to transform this into a

desired change in the input and once you

have these you can make arbitrarily

complex things by chaining them up

into a directed acyclic

graph and if this sounds too good to be

true especially with how we design the

forward pass because if you just say

arbitrary input and arbitrary output of

course you can do anything you want but

the hard part is how you define that

backwards pass because as you make your

forward pass more and more complicated

like if you have like some really crazy

function it becomes hard to define how

to map the outputs back into inputs so

by keeping these things simple and

combining them together it gives us this

almost composable language of modules

that allow us to do the things we want

to do so once you have this interface

you just can build up from this once you

once you have that you could have a

bunch of these modules that satisfy this

interface as a side note a bunch of

these modules will be parametric which

means that they have parameters which

roughly means that they're stateful and

being stateful means that the

state changes and it's this change in

state that allows you to change this

function from something that you just

cobbled together to something that

gets closer and closer to what you want

to do and once you have a general

language of what you want to do now you

can start doing the tasks that you care

about in deep learning you always define a

loss or a cost depending on how you want

to define this and this is something you

want to minimize for reasons that I'd

happily explain this always has to be a

scalar so it can't be several costs at the

same time you have to squash it down

into a single thing that you care about

and once you squish like

everything you care about in the world into

a single number now you can start using

deep learning to optimize this you

create an architecture which is the

function that you want to do and this

would be how you compose together these

modules that I talked about so the way

you connect them together changes the

function that you have and the kind of

representation power it has and that

becomes the hard part after that you

initialize the parameters and you train

this architecture by repeatedly updating

the parameters to minimize the cost so

you go forward through the network to

get the things you care about and you go

backwards through the network to change

the parameters to be slightly better for

your costs and you repeat these many

times until you get a function that

you're really really happy with and

solves whatever problem you want and at

the end you just use that function just

the forward pass how to implement the

backwards pass is in general we almost

always use the chain rule this is really

nice because it makes implementing the

backwards pass easy how this works is if

you have your output as some function f of

some x and you have the partial

derivative with respect to your output dL/df you can

get dL/dx by simply multiplying by the

partial derivative df/dx and the nice

part about this is that dL/df is gotten

from the rest of your network and df/dx

is gotten just from your module so this

allows you to chain these things

together in a way that only requires

local information in order to get this

backward pass it's very nice there's

theoretical reasons of why this is a

good way to do this and perhaps the best

part about this is some frameworks make

this completely automatic by

defining a forward pass using automatic

differentiation you can figure out how

to make a backward pass automatically so

it becomes basically as easy as defining

arbitrary functions so you actually

do get this benefit of just define

arbitrary things return arbitrary things

as long as all the operations you do are

differentiable you can just make it work

like magic and optimize it and this is

literally how people do this in practice
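As a minimal sketch of the interface and chain rule just described (not any particular framework's API), each module below exposes a forward pass and a backward pass, and the backward pass only needs its own local derivative df/dx plus the dL/df handed to it by the rest of the chain. The class names `Square` and `Sin` and the helpers `run_chain` and `backprop_chain` are made up for illustration.

```python
import math


class Square:
    """Module computing f(x) = x * x."""

    def forward(self, x):
        self.x = x  # cache the input; the backward pass needs it
        return x * x

    def backward(self, dL_df):
        # Chain rule: dL/dx = dL/df * df/dx, where df/dx = 2x is purely local.
        return dL_df * 2.0 * self.x


class Sin:
    """Module computing f(x) = sin(x)."""

    def forward(self, x):
        self.x = x
        return math.sin(x)

    def backward(self, dL_df):
        return dL_df * math.cos(self.x)


def run_chain(modules, x):
    """Forward pass: feed the input through every module in order."""
    for m in modules:
        x = m.forward(x)
    return x


def backprop_chain(modules, dL_dout=1.0):
    """Backward pass: push the gradient through the modules in reverse order."""
    grad = dL_dout
    for m in reversed(modules):
        grad = m.backward(grad)
    return grad


modules = [Square(), Sin()]          # computes sin(x^2)
x = 0.5
out = run_chain(modules, x)
grad = backprop_chain(modules)       # analytically cos(x^2) * 2x

# Sanity check against a central finite difference.
eps = 1e-6
numeric = (run_chain(modules, x + eps) - run_chain(modules, x - eps)) / (2 * eps)
print(out, grad, numeric)
```

Neither module knows what comes before or after it in the chain, which is the "only requires local information" property from the talk.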

updating the parameters these are just

minor details to like get an

understanding of how this works is that

once you have your existing parameters

you get your gradient and you take a

step in the opposite direction of the

gradient and the partial derivatives

tell us how to change the parameters to

increase or decrease the cost that we

care about an important word to

know though that people always

use and I think kind of makes it more

complicated it's a big word is

backpropagation or backprop for short this

has a longer name called reverse mode

automatic differentiation which sounds

pretty complicated but this is just the

chain rule plus dynamic programming I

just talked about the chain

rule and I assume some people are familiar with

dynamic programming but this is just

caching and the idea would be when you

have a computation graph this is a very

simple computation graph y = c * d

where c = a + b and d = b + 1

the idea would be um you traverse

the graph from the top to the bottom and

by doing it from the top to the bottom

instead of the bottom to the top you can

00:08:18cache the intermediates that are used

00:08:20many times in the graph and by caching

00:08:22these intermediates you get something

00:08:23that’s much more efficient than if you

00:08:25were going to do a naive solution and

00:08:26this allows you to get gradients that

00:08:29are computable in linear time in the

00:08:32size of your graph so you basically

00:08:33evaluate each node once and this is a

00:08:35really nice property this makes it all

00:08:38really efficient and that’s basically it

00:08:40for the basics from a high level deep

00:08:43learning is just composing optimizable

00:08:45subcomponents optimizable

00:08:47almost always means differentiable

00:08:48differentiable means that you can do

00:08:50backprop backprop is just the chain rule

00:08:52and dynamic programming once you get

00:08:56to practical deep learning normally

00:08:57you have to combine this with gradient

00:08:59descent software and a data set that you

00:09:01care about and the space of software is

00:09:05there’s a very rich space of software

00:09:07that I’ll talk a little bit about in the

00:09:09future but these things are

00:09:11solved for you so you can do deep

00:09:12learning without even knowing how to

00:09:14calculate the gradient yourself so while

00:09:18we can do arbitrarily complicated things

00:09:20there are a few standard modules that

00:09:22are the main workhorse of deep learning

00:09:23today and the goal of this section is

00:09:26going to be to get a high level

00:09:28understanding of each since all of them

00:09:29can be very incredibly nuanced but these

00:09:32standard modules will cover almost all

00:09:34of what’s happening in papers the

00:09:37simplest of them is perhaps the simplest

00:09:40is just matrix multiplication it has

00:09:42many names the fully connected layer

00:09:44sometimes shortened to FC sometimes

00:09:47called dense because you have lots of

00:09:48connections linear layer because it’s a

00:09:50linear transformation or affine because

00:09:52sometimes there’s a bias and the matrix

00:09:55multiplication is basically every time

00:09:56you have a neural network diagram all of

00:09:58these arrows correspond to the matrix

00:10:00multiplication so when you have a

00:10:02diagram that looks complicated that’s

00:10:03from that’s from this kind of thing and

00:10:05you can interpret this as a weight from

00:10:07every input to every output so if you

00:10:09have m inputs and you have n outputs

00:10:12you have m by n weights that transform

00:10:14your inputs to outputs and its

00:10:16implementation is literally a matrix

00:10:18multiplication and W in this case it’s

00:10:21generally a parameter which means you

00:10:23learn the connections from inputs to

00:10:25outputs this on its own is not powerful

00:10:29enough so you need at least one more

00:10:31thing which is a non-linearity the

00:10:35original non-linearity is called a

00:10:36sigmoid it’s just this function it has

00:10:39the nice property that it maps reals

00:10:42into the interval (0, 1) and it can be

00:10:46interpreted as a probability but that’s

00:10:49not as important as just being

00:10:50nonlinear and the reason the

00:10:52non-linearity is important is if you

00:10:54have this kind of like neural network

00:10:56when you stack up the layers

00:10:57back-to-back if you had no non-linearity

00:11:00in the middle this would just be two

00:11:02matrix multiplies back-to-back and what

00:11:04would happen is you could just combine

00:11:05this into a single matrix multiply so if

00:11:07you have a 100 layer purely linear

00:11:10network of just matrix multiplications

00:11:11while this thing is pretty complicated

00:11:13and you do all the work of a real neural

00:11:15network you could actually flatten it

00:11:18into a single weight matrix because a

00:11:20composition of linear functions is linear so this

00:11:23was the original one people like it

00:11:25because it’s very similar to what people

00:11:27used before they really got how machine

00:11:29learning works which was just a binary

00:11:32threshold that just fired back in the

00:11:35day and the cool part is with just those

00:11:39two units you know how to make a neural

00:11:41network you can just simply do you get

00:11:43your input you apply the matrix multiply

00:11:45you apply a sigmoid you apply another

00:11:47matrix multiply and you have one and

00:11:50these are called multi-layer perceptrons

00:11:52when you only have matrix multiplies and

00:11:54nonlinearities and the cool part is that

00:11:57there’s a theorem on this that this simple

00:11:59architecture like literally three

00:12:01functions can solve can approximate

00:12:04arbitrary functions which means it can

00:12:06solve any problem that you care about
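A rough sketch of the three-function network just described (matrix multiply, sigmoid, matrix multiply), written with plain Python lists so it runs without any library. The weight values and the helper names `matmul_vec`, `matmul`, `sigmoid` and `mlp` are invented for illustration, and the convention of one row per output is an assumption. The second half checks the earlier point that two stacked linear layers with no non-linearity in between collapse into a single weight matrix.

```python
import math


def matmul_vec(W, x):
    """Fully connected layer: W has one row per output, one column per input."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def matmul(A, B):
    """Plain matrix product A @ B for lists of lists."""
    return [
        [sum(A[i][k] * B[k][j] for k in range(len(B))) for j in range(len(B[0]))]
        for i in range(len(A))
    ]


def sigmoid(v):
    """Elementwise sigmoid, mapping every real number into (0, 1)."""
    return [1.0 / (1.0 + math.exp(-a)) for a in v]


def mlp(x, W1, W2):
    """Multi-layer perceptron: matrix multiply, sigmoid, matrix multiply."""
    return matmul_vec(W2, sigmoid(matmul_vec(W1, x)))


W1 = [[0.5, -1.0], [1.5, 2.0], [-0.5, 0.25]]  # 2 inputs -> 3 hidden units
W2 = [[1.0, -2.0, 0.5]]                       # 3 hidden units -> 1 output
x = [1.0, 2.0]

y = mlp(x, W1, W2)
hidden = sigmoid(matmul_vec(W1, x))

# Without the sigmoid, the two layers are equivalent to one matrix W2 @ W1.
stacked = matmul_vec(W2, matmul_vec(W1, x))
collapsed = matmul_vec(matmul(W2, W1), x)
print(y, stacked, collapsed)
```

In a real network W1 and W2 would be learned parameters rather than hand-picked numbers.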

00:12:08there’s a cool theorem on this the idea

00:12:12is that if you make the middle big

00:12:14enough you can calculate basically any

00:12:16function the downside is that just

00:12:19because it can it doesn’t mean it will

00:12:20and a single layer multi-layer

00:12:23perceptron often causes more problems

00:12:25than it solves so this is why there was

00:12:27an AI winter in the 90s just because

00:12:30these things were kind of terrible

00:12:33but people have gotten a lot better at

00:12:35it and now neural networks

00:12:36are cool and as a disclaimer these are

00:12:40these are neural networks when you have

00:12:43so this is would be a multi-layer

00:12:45perceptron these are neural networks

00:12:47everything I’m talking about today is

00:12:48still a neural network but these are

00:12:51specifically when you talk about the

00:12:53multi-layer perceptron that is what this

00:12:54is so since then people have made better

00:12:59nonlinearities this is probably the majority

00:13:03of the improvement between 1990 and 2012

00:13:06unfortunately which is you have like a

00:13:09kind of smarter non-linearity so instead

00:13:11of taking this weird squiggly function

00:13:13you just do a threshold so anything

00:13:16that’s negative you just turn it into 0

00:13:19this is actually the most popular

00:13:21non-linearity nowadays it does

00:13:23incredibly well

00:13:25it has some really nice optimization properties

00:13:27in particular when you’re above zero

00:13:32this non-linearity is very linear so it works

00:13:34very well with the chain rule and this

00:13:36thing is used almost everywhere nowadays

00:13:38especially in the middle of a neural

00:13:40network there is a softmax which you can

00:13:45think of as converting a bunch of

00:13:47numbers into a discrete probability

00:13:48distribution so the math of it is p

00:13:53equals you exponentiate your input and

00:13:56then you divide it by the sum of the

00:13:57exponentiated inputs you can think of the exponentiation

00:13:59as turning all the numbers

00:14:01positive and the dividing by the sum is

00:14:03a normalization term there’s some very

00:14:04nice properties about this and it’s used

00:14:07as the final layer for classification

00:14:08problems and it’s used in almost every

00:14:11neural network cool that was the easy

00:14:14part this gets complicated I like feel

00:14:20free to ask questions during this I

00:14:22normally explain this with a whiteboard

00:14:23and it normally is complicated even with

00:14:25a whiteboard but i’ll try to go through

00:14:28this so a convolution is the main

00:14:31workhorse for deep learning on images

00:14:34and deep learning on images is

00:14:35basically it’s kind of where this

00:14:37revolution started so it’s very very

00:14:39important it’s probably the place where

00:14:41deep learning is the most advanced so

00:14:42it’s a very important primitive and I

00:14:44think a very cool primitive to

00:14:45understand because you really realize

00:14:48like how beautiful the framework is when

00:14:49you see like wow this thing sounds

00:14:51pretty complicated but it you can just

00:14:53plug it in and you don’t need to

00:14:54know how it works when someone has coded

00:14:56it up for you which is what I do so this is

00:15:01a linear operation for 2d images so once

00:15:03you have a multi-layer perceptron you

00:15:05have a mapping from every input to every

00:15:07output but in the case of images your

00:15:09inputs are structured so you have like

00:15:11this spatial relationship between your

00:15:13inputs and if you have a mapping from

00:15:15every input to every output you kind of

00:15:16throw away the spatial relationship so

00:15:19the idea would be what if rather than

00:15:21having a connection from every input to

00:15:23every output what if the

00:15:25output looked like an image as well and

00:15:27every output was only locally connected

00:15:29to the things that it corresponds to so

00:15:31that is insight number one so local

00:15:33connections and insight number two is

00:15:36every output is

00:15:38a local function of its input what if

00:15:41instead of having every output be its

00:15:43own function which would be the general

00:15:45case what if every output was the same

00:15:46function of its input so what this then

00:15:49becomes is equivalent to like a

00:15:51well-known function in computer vision

00:15:52which is a convolution which is you have

00:15:55a kernel which is you can think of it

00:15:58like a local weight matrix so it’s

00:16:00represented yeah oh cool my mouse is

00:16:04here so it’s often represented as the

00:16:06square in an image like thing which

00:16:09means that they’re capturing that local

00:16:11input you just do an elementwise multiply

00:16:13between all of the weights of the kernel

00:16:15which would be something like this you

00:16:18do that multiply for everything in the

00:16:20local region you sum up the results so

00:16:22this is just a dot product and then you

00:16:24do that at every single location at the

00:16:26input so it’s kind of like tiling your

00:16:28input with the same function or it can

00:16:30be interpreted as extracting the same

00:16:32features at every location which is the

00:16:34more common way to interpret it this is

00:16:37it’s very powerful it’s very parameter

00:16:41efficient because you have a lot of

00:16:42weight sharing between the parameters

00:16:44and you can end up having much larger

00:16:48outputs than you can have with a normal

00:16:50matrix multiplication and you also don’t

00:16:52lose spatial information which is a very

00:16:54important structure of images so these

00:16:57are some really nice properties and as a

00:17:00side effect you might think that this

00:17:01thing is really complicated how do I

00:17:04take a gradient of it because maybe

00:17:06the whole thing is kind of complicated

00:17:08but this is actually equivalent to a

00:17:11very constrained matrix multiplication

00:17:13so if you take your input image and you

00:17:16unroll it because with a matrix multiply

00:17:18you lose that spatial structure and you

00:17:20unroll your input you basically have a

00:17:22few connections so

00:17:25every output is connected to maybe like

00:17:27nine of your inputs and that just

00:17:30becomes equivalent to really like

00:17:32the old diagram with lots of arrows

00:17:33but most of the arrows being zero or

00:17:35missing so this is still completely

00:17:37differentiable and still fits very

00:17:40nicely into this framework that you can

00:17:41plug in with all the other

00:17:42nonlinearities cool it’s going to get a

00:17:47little bit harder
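Before moving on, here is a minimal sketch of the convolution described above: one small kernel of weights, a dot product with each local region, and the same kernel reused at every location. The function name `conv2d`, the toy image and the edge-detecting kernel are made up for illustration. Like most deep learning libraries, this does not flip the kernel (strictly speaking it is a cross-correlation), and it uses no padding.

```python
def conv2d(image, kernel):
    """Slide `kernel` over `image` and take a dot product at each location."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    out = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            acc = 0.0
            # Multiply the kernel weights with the local region and sum up:
            # the same weights are shared by every output location.
            for di in range(kh):
                for dj in range(kw):
                    acc += image[i + di][j + dj] * kernel[di][dj]
            row.append(acc)
        out.append(row)
    return out


# Toy 4x4 image: dark on the left half, bright on the right half.
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]

# 2x2 kernel that responds where brightness increases from left to right.
kernel = [
    [-1, 1],
    [-1, 1],
]

edges = conv2d(image, kernel)
for row in edges:
    print(row)
```

The output is itself a small image (3x3 here) and only the column where the dark-to-bright edge sits is non-zero, while the whole layer has just four parameters.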

00:17:49another very fundamental building block

00:17:52is called a recurrent neural network I

00:17:54don’t know why the building block is

00:17:56called a network when everything else is

00:17:57called layer but that’s just kind of

00:18:01convention and this is solving a

00:18:04problem that basically has not been

00:18:06solved in machine learning before which

00:18:08is we want functions to take in

00:18:11variable size input but they can only

00:18:13take in fixed size input and this

00:18:15becomes a problem when your function

00:18:17is parametric like a fully connected

00:18:20layer is because if you want a

00:18:21connection from every input to every

00:18:22output but your input size changes that

00:18:25means the number of weights you

00:18:26have changes and that means that if you

00:18:28get a longer example at the inference

00:18:31time you now don’t know what to do with

00:18:32it and this also might be inefficient

00:18:34because you might have like really

00:18:36really big a really large number of

00:18:38inputs and you might not need all of the

00:18:40power of having like every connection

00:18:41there so a recurrent neural network is a

00:18:44way to solve this problem and the

00:18:47solution to this problem is recursion so

00:18:49what you have is an initial state which

00:18:51would just be let’s just call it h in

00:18:53this example and you have a bunch of

00:18:55these inputs X and there’s a variable

00:18:57number of them so you don’t really know

00:18:59what like this capital T is and you can

00:19:02make a function that takes in a fixed

00:19:03size and because each X is fixed size

00:19:06you can make that function take in

00:19:07both h and x and now you can recurse

00:19:10through this list by saying h of t

00:19:13equals the function of the previous

00:19:15state sorry the new state is a function

00:19:16of the previous state and the current

00:19:18input and then you just return the final

00:19:19one and what this allows you to do is it

00:19:22allows you to with a fixed size input

00:19:25you can have it operate on sorry with a

00:19:28fixed function that takes fixed size

00:19:29input you can now turn it into a

00:19:31function that takes in a variable sized

00:19:32input by applying that function a

00:19:34variable number of times this is not the

00:19:37this is like a pretty obvious insight

00:19:40and you could do that with any kind of

00:19:42machine learning algorithm you could

00:19:43like apply a random forest like an

00:19:45arbitrary number of times but the cool

00:19:47part about this is that because this

00:19:49function is differentiable this

00:19:50recursive function is also

00:19:52differentiable so you can take the

00:19:53derivatives of each of the inputs you

00:19:55can take even take the derivative of the

00:19:57weight matrices you use at each step you

00:19:59can use that f at each step and you get a


00:20:03it looks kind of like this now you can

00:20:06think of it as applying an FC layer for

00:20:08each input that takes the input and the

00:20:09state so far and this diagram might not

00:20:13be very clear but there are many

00:20:15different diagrams for RNNs and

00:20:16they’re all equally confusing if you’re

00:20:18unfamiliar with them so this is kind of

00:20:21the one on the left is my favorite one

00:20:23because you can kind of think of it as a

00:20:25stateful function except the state

00:20:29only lasts for the duration of your

00:20:31input but the unrolled version is the

00:20:33version you use if you’re taking

00:20:35gradients so this is equivalent to just

00:20:37passing the gradients through this very

00:20:39long graph a last complicated slide long

00:20:43short term memory units this puts me in

00:20:46a really hard position because I can’t

00:20:48not talk about them because they are so

00:20:50big but they’re also extremely

00:20:51complicated and they take more

00:20:55building blocks than I’ve even

00:20:56explained but there is this great blog

00:20:58post I think slides will be published so

00:21:00you don’t have to worry about that this

00:21:02great great blog post tries to explain

00:21:04it but I’m going to try to give like a

00:21:05high-level intuition of them just so

00:21:07like even higher than what I’ve said so

00:21:09far um just so that you can kind of

00:21:11understand where where it’s coming from

00:21:16when I talk about these things being

00:21:17used and the idea would be it’s kind of

00:21:21like an RNN and in practice no one uses

00:21:24the RNN that I’ve just described it’s a

00:21:26very simple function and there’s much

00:21:28more complicated versions it’s an RNN

00:21:30where the function is just really

00:21:32complicated so this entire thing here is

00:21:35a representation of that function I’m

00:21:38not going to get into the details of it

00:21:40but it involves a lot of different

00:21:44mechanisms in order to make optimization

00:21:46easier and the idea is that

00:21:50if you design this function well

00:21:53the function is applied at each time

00:21:54step it can make the problem much much

00:21:56easier to optimize and you can have like

00:21:57a much much more powerful function and

00:21:59the key is that by having a path which

00:22:02is relatively simple so this is what’s

00:22:04represented by the top path with very

00:22:06few operations being done to it it

00:22:08makes it easier to stack these things

00:22:10back to back and that makes it easier

00:22:13to learn long-term relationships between

00:22:15the functions
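As a rough sketch of the plain recurrence from earlier (the simple RNN, not the LSTM): one fixed-size step function takes the previous state and the current input, and applying it once per input lets the same weights handle sequences of any length. The names `rnn_step` and `rnn`, the tanh non-linearity and the weight values are assumptions for illustration, not something taken from the slides.

```python
import math


def rnn_step(h, x, W_h, W_x):
    """New state from the previous state h and the current input x.

    This is a fully connected layer over both h and x followed by tanh.
    """
    return [
        math.tanh(
            sum(wh * hj for wh, hj in zip(W_h[i], h))
            + sum(wx * xj for wx, xj in zip(W_x[i], x))
        )
        for i in range(len(h))
    ]


def rnn(xs, h0, W_h, W_x):
    """Apply the same step function once per input and return the final state."""
    h = h0
    for x in xs:
        h = rnn_step(h, x, W_h, W_x)
    return h


W_h = [[0.5, 0.0], [0.0, 0.5]]   # state -> state weights (2x2)
W_x = [[1.0], [-1.0]]            # input -> state weights (2x1)
h0 = [0.0, 0.0]

# The same weights work for a sequence of length 1 and a sequence of length 4.
short = rnn([[1.0]], h0, W_h, W_x)
long = rnn([[1.0], [0.5], [-0.25], [2.0]], h0, W_h, W_x)
print(short, long)
```

An LSTM keeps this outer loop and replaces `rnn_step` with a more elaborate function.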

00:22:18whoo okay that was that was the

00:22:22complicated part you now know

00:22:24ninety-five percent of the building

00:22:25blocks that everyone uses for

00:22:27state-of-the-art deep learning with just

00:22:29these building blocks you could probably do

00:22:30new state-of-the-art things on new

00:22:32domains so congratulations you ready for

00:22:35the next part um so in this part I want

00:22:40to talk about what deep learning is really

00:22:41good at and what you should use it on

00:22:43the answer is a whole lot so I’m going

00:22:46to cover just the rough themes of where

00:22:47deep learning really shines but there’s

00:22:49just much much more to it which i think

00:22:51is part of the awesomeness because it’s

00:22:53all falls under this extremely simple

00:22:55framework that I’ve just described I

00:22:57don’t think that you could like describe

00:22:58any framework as simple as what I’ve

00:23:00just done and have it solve this many

00:23:02complicated unsolved tasks before 2012

00:23:06basically so convolutional neural

00:23:09networks this is a general architecture

00:23:11commonly referred to as CNNs this

00:23:14actually means a network in this case

00:23:16and not just a layer the idea is that

00:23:19you take your image you apply

00:23:20convolution you apply your ReLU your

00:23:24rectified linear unit you apply

00:23:25convolution you apply ReLU and you

00:23:27basically repeat this conv ReLU until

00:23:29you solve all the problems in computer

00:23:31vision that isn’t quite true since at

00:23:34the end you need to tack on some sort of

00:23:36output layer and the output layer

00:23:37depends on what kind of problem you’re

00:23:39trying to solve a kind of like really

00:23:41old school task is face

00:23:46recognition trying to determine like

00:23:48whose face this is and this is a really

00:23:51cool task because it makes the

00:23:52representations very visual and you can

00:23:54see how the network learns over time so

00:23:57at the first layer you when you start

00:23:59with the pixels at the first layer your

00:24:01filters tend to just match for edges and

00:24:05very simple things so convolutions can

00:24:07match edges and other very simple shapes

00:24:08and as you get deeper and deeper into

00:24:10network you learn more complicated

00:24:11functions of the input so after that you

00:24:14can start combining edges into corners

00:24:16or blobs so this is still extremely

00:24:19simple but after you get to another

00:24:21layer somehow like combining two corners

00:24:23the right way becomes kind of like an

00:24:24eye-like shape or if you have like two

00:24:26corners and a blob that becomes more

00:24:28eye-like and you can build up from

00:24:30edges to corners to object parts and

00:24:33eventually into the objects you care

00:24:34about and as you get really really deep

00:24:36networks you actually have intermediates

00:24:38that are extremely semantic objects for

00:24:41example people have made a lot of tools

00:24:43for visualization of neural networks

00:24:44where they visualize what the intermediates of

00:24:48the neural networks learn and you have

00:24:51for example if you have a neural network

00:24:52that doesn’t learn to classify books at

00:24:54all but learns to classify bookshelves

00:24:55some of the intermediate features

00:24:57actually become book classifiers which

00:24:59is really interesting like it can learn

00:25:01like a hierarchical representation

00:25:04of your input space such that these are

00:25:07useful things to combine together in

00:25:10order to make a robust classifier and by

00:25:12combining so maybe if you combine like

00:25:15three books together as well as a square

00:25:16this becomes a bookshelf so these are

00:25:18kind of like what the local operations

00:25:20do with each neural network and the

00:25:23beauty of it is that it’s all learned

00:25:24automatically for you you don’t need to

00:25:25program like I have a bookshelf

00:25:27bookshelves normally have books they have

00:25:29books sorry they have like square stuff

00:25:31maybe they’re often beside flowers this

00:25:33all can like happen in a data set

00:25:35automatically for you and these

00:25:37convolutional neural networks are

00:25:39absolutely amazing I

00:25:41wasn’t joking when I said they solve basically

00:25:43all of computer vision right now it all

00:25:46started with ImageNet this was in 2012

00:25:50this is when deep learning actually the

00:25:52entire hype train started where you had

00:25:56traditional machine learning solving

00:25:58this very hard very large computer

00:26:00vision data set and it was kind of

00:26:01plateauing over the years and all of a

00:26:03sudden deep learning comes in and it

00:26:05just blows everything away and ever

00:26:08since then everything has been

00:26:10everything in computer vision has been

00:26:12deep learning like nothing can even

00:26:14compare and recently we’ve even

00:26:16been able to get superhuman results

00:26:18which is pretty impressive because

00:26:21humans are pretty good at seeing things

00:26:23it’s kind of what we’ve evolved to do

00:26:26and the same architectures can do all

00:26:28sorts of really interesting structured

00:26:31tasks so using almost the same

00:26:33architecture you can use a convnet to

00:26:36determine you know like you can break up

00:26:39your input space into a what’s called a

00:26:40semantic segmentation of like all of the

00:26:43relevant parts that you

00:26:44have and using basically the same

00:26:46architecture as well you can do crazy

00:26:49things like super resolution where you

00:26:51take in like a low-resolution image and

00:26:52make it you can fill in the details

00:26:54which is pretty

00:26:59incredible even though it

00:27:00sounds pretty easy it’s incredible that

00:27:04like that you can use the same

00:27:05architecture that takes an image and

00:27:08tells you whether or not there’s a dog

00:27:09in it to take an image and return like a

00:27:12new higher resolution image and this is

00:27:14basically the same library the same

00:27:15components it’s just very very

00:27:19composable and that’s really good

00:27:20awesome you can also use this to solve

00:27:23really hard medical tasks tasks that

00:27:25people could not solve before here we’re

00:27:27detecting classifying lung cancer in CT

00:27:30scans these are the kinds of things that

00:27:32I like to work on and it’s not only

00:27:36limited to vision there’s been a lot of

00:27:38work in language understanding so

00:27:41something that deep learning is really

00:27:42good at is language modeling roughly

00:27:45this means how probable or how much

00:27:50sense a statement makes in a certain

00:27:52language so it might have to do with a

00:27:54question response how are you I’m fine

00:27:56it might have other things such as what

00:27:59would be a weird thing my laptop is

00:28:02squishy might be a very improbable

00:28:04sentence to say so a neural network

00:28:06could probably determine squishy is a

00:28:08very bad adjective for a laptop this is

00:28:11a very improbable sentence but if I said

00:28:13my laptop is hot that would probably be

00:28:16a much more likely sentence and this

00:28:18already has some human-like feel to it

00:28:21because language was designed for humans

00:28:23and being able to have like if you can

00:28:26do language understanding as in

00:28:28determining the probability of like any

00:28:30sentence given a context you can and if

00:28:32you do this perfectly you can solve

00:28:33basically any task and this is a it’s

00:28:36really interesting domain where it’s

00:28:38being applied because previously if you

00:28:41look at how language understanding

00:28:42was done before deep learning was around

00:28:44it was just incredibly simplistic tons

00:28:47and tons of rules no robustness to data

00:28:49sets you’d have to make custom rules for

00:28:51every language and now you can

00:28:54use the same tricks for English as you can


00:28:58for Chinese characters as you can for

00:28:59byte code so that is just pretty

00:29:02incredible there’ve obviously been much

00:29:05more complicated tasks a pretty popular

00:29:09use for deep learning that

00:29:12people are really putting a lot of

00:29:14effort into is end-to-end language

00:29:16understanding from scratch so the idea

00:29:18is you use an RNN to compress a sentence

00:29:23in your source language into a vector

00:29:25like I described in the RNN section and

00:29:28then you use a different RNN to decode

00:29:32it into a target language and while it’s

00:29:36not surprising that you can design a

00:29:37neural network that plausibly can output

00:29:40this it is quite surprising that it

00:29:41works so well and you’ve been able to

00:29:44have neural networks that in

00:29:49the span of a few grad student months

00:29:51match the performance of systems that

00:29:54people have spent decades engineering

00:29:57and that happened nowadays I think that

00:30:00end-to-end

00:30:05deep learning systems are not what’s used

00:30:06for this right now but they’re a very

00:30:08important component so people still use

00:30:09a bit of hard coded stuff but it’s only

00:30:12a matter of time and the beauty is that

00:30:13if we have a new task or a new

00:30:15language now it can just automatically

00:30:17work like what if we you know we find

00:30:21out some lost language from a thousand

00:30:25years ago and we have like a good amount

00:30:27of their texts can we actually learn how

00:30:31to translate it or understand it without

00:30:34any knowledge of this and it seems like

00:30:36purely from data we can and that’s

00:30:38really cool we don’t need an

00:30:39understanding of something in order to

00:30:41we don’t need a understanding prior to

00:30:44applying our machine learning models in

00:30:45order to have an understanding

00:30:45afterwards and that is just really

00:30:49really awesome I’ve actually been

00:30:50chatting with the people at SETI the

00:30:52search for extraterrestrial intelligence

00:30:54and one of the tasks that they’re doing

00:30:57is trying to understand dolphins the

00:31:02rationale is that if dolphins have

00:31:05language aliens might have language if

00:31:08we see alien communication we

00:31:10probably won’t understand it

00:31:12perhaps we can use dolphins as a proxy

00:31:14for aliens to try and understand them so

00:31:17there’s some really cool tasks that are

00:31:18happening there it’s not limited to that

00:31:21there’s some really cool things being

00:31:23done with art in deep learning actually

00:31:25I think that companies have started up

00:31:27that their entire business model is

00:31:30creating awesome deep learning art and

00:31:31they seem to be doing well from what

00:31:34I’ve heard in this case this is a

00:31:37hallucination purely from a convnet

00:31:39trained to do image classification so an

00:31:43image convnet you know something

00:31:44that takes an image tells you like what

00:31:46breed of dog it is or what objects are in it

00:31:47you can use it with a few tricks to

00:31:50create this kind of crazy art and this

00:31:53was a pretty big splash it’s a very

00:31:56unintuitive that a neural network that

00:31:58isn’t even trained to make art

00:32:01actually can turn out making this kind

00:32:02of thing there’ve been more popular use

00:32:06cases such as style transfer the idea

00:32:09would be you can take a neural network

00:32:10still train for classification the idea

00:32:13would be classification has some priors

00:32:15about images some priors about the

00:32:19natural world so what you do then is

00:32:21you say i want my image to kind of match

00:32:24the distribution from a different image

00:32:27and then you get this kind of style

00:32:29transfer where you can mix together

00:32:32these kinds of components and while this

00:32:35is actually pretty ugly example there’s

00:32:38there’s some good ones i promise there’s

00:32:41some much more complicated things you

00:32:42can do it’s not just like taking two

00:32:43images together and merging them

00:32:45together you can do things like

00:32:47transforming a perhaps not super great

00:32:51drawing something that you could

00:32:53probably do in paint fairly quickly into

00:32:56something that looks like an artist did

00:32:59or something that’s really awesome and

00:33:01the idea would be that you can actually

00:33:03take these arbitrary doodles and convert

00:33:06them into these things that look like

00:33:07paintings and this kind of stuff is

00:33:10really awesome and I think it’s just the

00:33:11beginning of the kind of stuff that we

00:33:12can do with neural network art but after

00:33:16basically less than a year of work on

00:33:18this you’re making applications that are

00:33:20already very tangible very awesome very

00:33:25this is already something that if I made

00:33:27this I would probably hang up in my

00:33:29living room and this has only been one

00:33:31year of work imagine what happens that

00:33:32in 10 years I saved the best for last in

00:33:36terms of art we can combine our pictures

00:33:38with those of Pokemon so clearly the future

00:33:40is here um this is one of my crowning

00:33:43achievements I think primarily because

00:33:47I’ve done this with like dozens of

00:33:48people and only mine turned out well but

00:33:52yeah the I think this is really awesome

00:33:54there’s like just so many things to do

00:33:57here and so few people are working on

00:33:59it that the sky is really the limit

00:34:01so it’s just really exciting the

00:34:05kinds of stuff that can be created

00:34:07here there’ve been other huge achievements

00:34:11game playing has been really big if

00:34:13anyone saw DeepMind’s 500 million dollar

00:34:17acquisition in 2013 roughly the only

00:34:21paper that they had at the time was

00:34:23learning to play Atari games from pixels

00:34:26which might be harder than it sounds

00:34:28because humans have a prior of how to

00:34:31play the game right like they have a

00:34:33prior that this is maybe a ball and

00:34:36that’s a paddle and I want to destroy

00:34:37certain things or they have a prior

00:34:38that a key opens doors or that roads are

00:34:42something I want to stay on in a driving

00:34:44game but the neural network’s not given any

00:34:47of these priors it’s literally only

00:34:49given the pixels given these images it

00:34:51learns to play at what is on median a

00:34:54superhuman level and the techniques have

00:34:57been continuing to get better and this

00:34:59kind of stuff very similar tricks have

00:35:01been applied to the much more recent

00:35:04result of Google DeepMind’s AlphaGo

00:35:09network which was not that huge of a

00:35:11deal in the West but if you ever talk to

00:35:13people from the more Eastern world you

00:35:16can talk to them about here are the

00:35:18achievements of deep learning you talk

00:35:19about smart inbox and they’re like oh

00:35:22that’s pretty okay you talk about image

00:35:25search yeah that’s pretty okay and then

00:35:27you tell them about like oh yeah

00:35:29it also beat the world champion at Go

00:35:31and they’re like whoa it plays Go

00:35:33that’s amazing and people predicted that

00:35:37even beating humans at

00:35:39Go would probably be depending on the

00:35:42expert 10 to 100 years off and it

00:35:44happened it just happened it’s already

00:35:47done it’s already that like humans have

00:35:49lost at Go and as a side effect Go has

00:35:53also caused more fear over AI safety

00:35:56than any other neural network I believe

00:35:58and this is probably a good

00:36:02representation of that I don’t know how

00:36:03yes that’s medium clear this is an XKCD

00:36:07of like how hard people used to think

00:36:09these games were and you can see go as

00:36:12basically being the last on the level of

00:36:16computers still lose to top humans and

00:36:18then not all of these are solved but

00:36:21that is just pretty incredible that

00:36:24that’s now solved people have been

00:36:26trying to ask like if it can do this

00:36:28what can’t it do because go is a task

00:36:30that requires a lot of reasoning and

00:36:34these kinds of achievements have been

00:36:37being transferred into the physical

00:36:39world as well Google has like

00:36:42a farm with like a bunch of robots that

00:36:44have learned on their own to grasp

00:36:46objects and basically robotics control

00:36:50is usually pretty hard especially when

00:36:51you’re trying to make it generalize and

00:36:53they’ve been able to do that just by you

00:36:56know throwing the robots into a dark

00:36:57warehouse having them train for a while

00:36:59designing a cute objective function and

00:37:01it just learned to grasp things better

00:37:04than their hand-designed controllers did

00:37:06which was pretty awesome and more

00:37:10recently actually I think there was a

00:37:11video like that came out last week of

00:37:13nvidia using just deep learning for

00:37:16self-driving cars so the idea was like

00:37:18with just a single camera in front of

00:37:20your car now your car can learn to drive

00:37:22can can drive itself from learning from

00:37:25how other people drove and this is a

00:37:28very interesting result because even

00:37:30google has been working for i don’t know

00:37:32if it might have been a decade already

00:37:34that they’ve been working on

00:37:35self-driving cars using you know lidar

00:37:37and slam and all of that stuff and

00:37:40Nvidia’s by some measures caught up to

00:37:45them entirely within i think it’s been

00:37:47less than a year since they’ve been

00:37:48investing in this so it

00:37:51seems to be changing a lot of things

00:37:53especially these kinds of perception

00:37:55tasks because research is moving so fast

00:37:59I also have to spend some time on

00:38:01things that are not yet practical but

00:38:03may very well soon be as a disclaimer

00:38:06I’ve been traveling this weekend so i’m

00:38:07not sure if some of these things belong

00:38:09in the already solved category

00:38:12generation is a big one there’s tons and

00:38:14tons of stuff happening in generation so i

00:38:15definitely can’t do it justice there’s

00:38:17really cool stuff and like just

00:38:18generating images from scratch and

00:38:20generating arbitrary other domains from

00:38:22scratch images are just the most visual

00:38:24so i have them here but some of the

00:38:27coolest and perhaps most practical

00:38:28examples are conditional generation

00:38:31something i’m really excited about is

00:38:33image to text so the idea is you

00:38:37take in an input image and the output is

00:38:39not like yes or no whether or not the

00:38:41dog’s in it but you output a description

00:38:44of the image and that’s like an

00:38:45extremely human task it’d be extremely

00:38:48useful if you do this task right it

00:38:51seems like this opens a whole ton of

00:38:53possibilities I’m very excited about

00:38:55like taking in a medical image and like

00:38:57outputting like a complete report of it

00:38:58which would be really awesome and some

00:39:01people that are really excited about

00:39:02this that has applications in the very

00:39:04short term is I don’t know the right way

00:39:07to say it but like the poor eyesight

00:39:08community so web pages nowadays have

00:39:12been pretty bad about stuff for people

00:39:16with disabilities and imagine if you had

00:39:18a neural network that can just describe

00:39:20an image for you describe a page for you

00:39:22tell you what’s on the page in a very

00:39:23semantic summarized way and there’s also

00:39:27a really cool opposite problem which is

00:39:29instead of taking an image and

00:39:31outputting a description you take in a

00:39:32description and output an image which

00:39:35as a terrible artist I’m probably a

00:39:37bit more excited about because instead

00:39:39of like I can describe pictures I can’t

00:39:41really draw them and like these are much

00:39:43better already than I can draw but

00:39:45that’s probably a low bar but in this

00:39:48kind of network you actually take in

00:39:49like a sentence as text and all of these

00:39:52images are generated from that network

00:39:53and that’s pretty incredible some of

00:39:56them are not super great but like these

00:39:58birds are actually there’s I believe

00:40:02they’re real the flower is not the

00:40:05purple ones

00:40:06but they actually seem close like if

00:40:10it was zoomed out enough I could

00:40:12see this as being pretty real and can

00:40:16you imagine in a future where instead of

00:40:18having to spend millions of dollars in a

00:40:20movie you just like type it up and then

00:40:22a neural network just generates the

00:40:23movie for you we’re quite a way from that

00:40:26but perhaps not that far away especially

00:40:29like with some focused work and this

00:40:33could enable like all sorts of like new

00:40:34forms of creativity that people don’t

00:40:36even know about while language

00:40:39understanding does quite well there is a

00:40:42deeper language understanding which we

00:40:45can kind of solve on toy tasks but it’s

00:40:47kind of harder for real tasks so QA so

00:40:51question answering that requires

00:40:52more complicated reasoning such as if

00:40:54you have like a story here and you ask

00:40:56some complicated question like

00:40:58where is the football then you have to go

00:41:00back in the story and figure out where

00:41:02that kind of thing happened that thing’s

00:41:03kind of complicated people are very good

00:41:06at this task models can solve these

00:41:09simple ones quite well but they can’t

00:41:11really do real question answering yet

00:41:13which is unfortunate but something

00:41:15people really care about and we’re not

00:41:18quite there yet but I also love how

00:41:20awesome this problem sounds is that like

00:41:22our machines that we have like basically

00:41:25spent no work on only automatically

00:41:27learn a shallow level of reasoning like

00:41:30that’s like such a first-world problem

00:41:33while there’s like language

00:41:35understanding there’s also visual

00:41:37understanding that is um kind of

00:41:39unsolved there is a there’s some awesome

00:41:42data set that involves images and

00:41:44questions and the goal is to find an

00:41:46answer and the models there are models

00:41:49that can do pretty okay at this task but

00:41:51still very not good and still like fare

00:41:56significantly worse than people do so

00:41:59this kind of thing is something that we

00:42:01can’t do just yet while game playing is

00:42:06solved harder game playing is still an

00:42:08open problem and you might think harder

00:42:12game playing my five-year-old brother

00:42:14can play minecraft and he almost

00:42:16certainly can’t beat the world champion

00:42:17at go

00:42:19the harder in this case means stateful

00:42:21it turns out that humans are really good

00:42:24at remembering something while neural

00:42:26networks have some difficulty with it so

00:42:28the neural networks that people have

00:42:30been using for playing games have been

00:42:32completely stateless so when you have a

00:42:34partially observed world like Minecraft

00:42:36where you like only have one direction

00:42:37that you’re looking at if you like look

00:42:38to the left it forgets what was on the

00:42:40right and this is something that people

00:42:42are still working to solve it’s the same

00:42:44thing with doom and the work has been

00:42:47done a bit but it’s far from being a

00:42:49solved problem and I do believe

00:42:50they’re still subhuman at this task
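
A minimal Python sketch of the stateless-versus-stateful point made above. The two-direction scene, the class names and the dictionary-based memory are all hypothetical illustrations and are not taken from the talk; a real agent would use a feed-forward network versus a recurrent one rather than dictionaries.

```python
# Toy partially observed world: the agent sees only the direction it faces.
WORLD = {"left": "tree", "right": "door"}  # hypothetical scene


def observe(facing):
    """Return only what is visible in the direction currently faced."""
    return {facing: WORLD[facing]}


class StatelessAgent:
    """Answers from the current frame only, like a feed-forward network."""

    def __init__(self):
        self.current = {}

    def see(self, observation):
        self.current = observation  # the previous frame is overwritten

    def recall(self, direction):
        return self.current.get(direction)  # None if not in view right now


class StatefulAgent:
    """Accumulates observations over time, a stand-in for recurrent memory."""

    def __init__(self):
        self.memory = {}

    def see(self, observation):
        self.memory.update(observation)  # keep everything seen so far

    def recall(self, direction):
        return self.memory.get(direction)


stateless, stateful = StatelessAgent(), StatefulAgent()
for facing in ["right", "left"]:  # look right, then turn to the left
    frame = observe(facing)
    stateless.see(frame)
    stateful.see(frame)

print("stateless remembers right:", stateless.recall("right"))
print("stateful remembers right:", stateful.recall("right"))
```

After turning left, the stateless agent has no record of what was on the right, while the stateful one still does, which is the gap the talk describes for games like Minecraft.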

00:42:54there’s some really cool stuff with

00:42:56automatically discovering hierarchical

00:42:58structure so in language the

00:43:01hierarchical structures may be clear to

00:43:02us because we use language like

00:43:04words are made of characters

00:43:06sentences are made of words paragraphs are

00:43:08made of sentences this is like this

00:43:10semantic hierarchy which makes it easy

00:43:12to break down a problem into simpler

00:43:13problems but this is not the case in

00:43:17many domains and there have been people

00:43:19who’ve designed neural networks that can

00:43:20actually automatically discover this

00:43:22hierarchy and this could be really

00:43:24useful for tasks where we don’t know how

00:43:26to interpret that so something I’ve

00:43:28worked a bit on is genomics and we

00:43:30really don’t even know how to read

00:43:32genomics right but if a neural network

00:43:33can automatically break it up into like

00:43:35this part goes together with that part

00:43:36you know there’s connections between

00:43:38here and here this could actually help a

00:43:41whole lot with all sorts of different

00:43:42kinds of scientific tasks just purely

00:43:45from data this is when it gets a little

00:43:48bit computery but these are things that

00:43:50I’m excited about as a computer scientist

00:43:52there’s this model called neural turing

00:43:54machines which learns to use like a big

00:43:56memory buffer which is very cool so you

00:43:58can actually see like how the network

00:44:00reads and writes and reads in order to

00:44:02copy an input there’s ways to implement

00:44:06differentiable data structures so things

00:44:10that you thought where instead of having

00:44:13like this black box of like arbitrary

00:44:15activations with matrix multiplies you

00:44:17can actually plug in a data structure

00:44:18into a network and now your network can

00:44:20learn to do things like pushing and

00:44:22popping to a stack you know getting from

00:44:24both ends of a queue and all of these

00:44:26kinds of things and this could

00:44:28potentially enable all sorts of very

00:44:30cool use cases

00:44:31as learning to program people have done

00:44:34some work where you can create models

00:44:37that not only can like have simple input

00:44:40output mappings but as an intermediate

00:44:43in this input output mapping they can

00:44:44learn subroutines and play with pointers

00:44:46and this actually makes them a very very

00:44:48general computing like it potentially

00:44:51could do all the problems we care about

00:44:53if you can learn subroutines and play

00:44:55with pointers it’s like that could learn

00:44:57abstraction automatically for you and by

00:45:00putting these things together people

00:45:01have been able to do things like

00:45:03learning to actually execute code so the

00:45:06idea would be given like code as a

00:45:09string and targets for that code like

00:45:11what the output is you can actually

00:45:12learn an interpreter for that language
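
As a rough illustration of the training data this describes, the Python sketch below generates (code string, target output) pairs. The one-line arithmetic programs, the function name `make_example` and the fixed seed are assumptions made for illustration; the work the talk refers to may use a different program distribution, and no model is trained here.

```python
import random


def make_example(rng):
    """Build one (source, target) pair: a tiny program and its printed output."""
    a, b = rng.randint(0, 99), rng.randint(0, 99)
    op = rng.choice(["+", "-", "*"])
    source = f"print({a}{op}{b})"
    # The target is computed here by ordinary arithmetic; a learned model
    # would only ever see the two strings, never this rule.
    value = {"+": a + b, "-": a - b, "*": a * b}[op]
    return source, str(value)


rng = random.Random(0)  # fixed seed so the sketch is repeatable
dataset = [make_example(rng) for _ in range(5)]
for source, target in dataset:
    print(repr(source), "->", repr(target))
```

A sequence model trained on many such pairs is asked to map the left string to the right one, so whatever it learns has to behave like an interpreter for this tiny language.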

00:45:14and this is really exciting to me as a

00:45:17programming language guy like maybe I

00:45:19could design a programming language not

00:45:21by implementing it but by just showing a

00:45:23whole bunch of examples and the

00:45:25implementation automatically happening

00:45:27for me or perhaps I could just write the

00:45:28test cases for the language and a neural

00:45:31network can generate an efficient

00:45:32language for me and something else that

00:45:37is related to all of this stuff is this

00:45:40is really early but I think a lot of

00:45:41people are really excited about that

00:45:42which are neural module networks where

00:45:46instead of having a single architecture

00:45:47that you play with you can have

00:45:50architectures that are you can have a

00:45:53library of components and for

00:45:56every single example you make a custom

00:45:57architecture and you output it so for

00:46:00example if you have the question

00:46:02answering task and you have an image and

00:46:04you have a question where is the dog

00:46:05instead of using an arbitrary network

00:46:07that takes in the question and the

00:46:10image you actually convert this

00:46:11question into a custom neural network

00:46:13which combines a dog module with a where

00:46:17module and outputs the answer and this

00:46:22kind of thing is very early but really

00:46:24promising so that that’s it for the

00:46:28future of it hopefully you guys are

00:46:30pumped to deep learn some problems

00:46:33there’s a lot of software to help you

00:46:35I’m not going to talk about that right

00:46:37now because there’s a lot of tutorials

00:46:38out there and I think the high level

00:46:39understanding is much more important my

00:46:43recommendation is that if you want to

00:46:44customize a lot of things

00:46:45Theano and TensorFlow are the best because

00:46:47it allows you to get this automatic

00:46:49differentiation that I was talking about

00:46:51then you never have to worry about the

00:46:52backward pass basically and if you want

00:46:55to just use like the modules that I

00:46:57talked about as well as a few others

00:46:58Keras can solve that and you can do a

00:47:00lot of these things with Keras if you

00:47:03want to do this there’s a lot more

00:47:05learning to do and the devil’s really in

00:47:08the details so I was super high level

00:47:09with lots of the stuff but all like

00:47:12there’s so many little things that you

00:47:13need to know such as how do

00:47:15you perform the updates in a way that

00:47:18doesn’t cause your parameters to grow

00:47:20too large how do you initialize the

00:47:22parameters to not be a trivial

00:47:23function how do you not overfit your

00:47:26training set so there’s a lot of

00:47:28resources out there my favorite one is

00:47:31this Stanford class by Andrej Karpathy

00:47:34CS231n it is specifically on convnets

00:47:38but it’s constantly updated with

00:47:40state-of-the-art stuff and it’s

00:47:42generally very high quality so I think

00:47:44it’s very approachable for anyone like

00:47:46beginner to very advanced and if you

00:47:50want to do this you probably need a GPU or

00:47:5250 I think that’s it for time so sorry I

00:47:57was rushing at the end any

00:47:58questions also I have these slides which

00:48:01slide should I leave it on some

00:48:04questions here go so one is how can we

00:48:08avoid that autonomous cars pick up the

00:48:11human bad habits how can we avoid that

00:48:14autonomous cars pick up human bad habits

00:48:16that is a very interesting question it’s

00:48:18very dependent on how the cars are

00:48:20trained so if you train a car to copy

00:48:23the human bad habits so if you train a car

00:48:26to copy humans which is by far the

00:48:27easiest thing to do it’s not the most

00:48:30correct thing to do because the most

00:48:33correct thing to do would be to learn

00:48:34how to drive optimally from scratch that

00:48:37unfortunately involves trial and error

00:48:39but you probably don’t want that in

00:48:41self-driving cars so we can skip that or

00:48:43hard coded rules so what can happen

00:48:47is if you’re training it to learn from

00:48:49humans it’ll mimic those humans but the

00:48:51idea is that if humans make mistakes you

00:48:54like let’s hope let’s say you want to

00:48:56make mistakes and let’s say humans don’t

00:48:58make consistent mistakes if they don’t

00:49:00make consistent mistakes and different

00:49:01humans make different kinds of mistakes

00:49:03or the same human only

00:49:05makes mistakes sometimes and you have a

00:49:07neural network that neural network can

00:49:08predict the expectation of what the

00:49:11human can do rather than the worst-case

00:49:13scenario so you can think

00:49:16of humans as an ensemble in this case

00:49:17that if you’re predicting what the

00:49:18average of a bunch of humans can do you

00:49:20can drive better than a human can but if

00:49:22humans are consistently make if humans

00:49:24consistently make mistakes then there’s

00:49:26nothing you can do about that other than

00:49:28get more data I think we have time for

00:49:32one more what do you think about

00:49:34chatbots is it possible to build them only with

00:49:36deep learning yes there’s actually many

00:49:38startups that are doing this right now

00:49:40so this seems to be the next wave in

00:49:44startups are like the hot thing right

00:49:46now where people are trying to use

00:49:49chatbots to do all sorts of things for like

00:49:52very specific domains it has some really

00:49:54nice properties from a business point of

00:49:56view because your goal is to replace

00:49:59humans who chat so it’s very easy to

00:50:03like replace them with an algorithm

00:50:04because the humans when you have if you

00:50:06have a bunch of them they generate a

00:50:07bunch of data so it is very plausible

00:50:09it’s still hard for chatbots it’s kind

00:50:13of like the game-playing problem where

00:50:14it’s hard for chatbots to have a memory

00:50:16of what you said so if you talk about

00:50:18like oh try you know opening this menu

00:50:21and you know go here and here and here

00:50:23and you might have five

00:50:24sentences later the chatbot might say

00:50:26the same thing because the neural networks

00:50:28still have memory issues cool I think

00:50:33that’s it so please remember to vote and

00:50:36let’s yeah give a big applause to Diogo

00:50:43thank you