GOTO 2017 • Improving Business Decision Making with Bayesian Artificial Intelligence • Michael Green


00:00:07[Music]

00:00:08[Applause]

00:00:14so hi everyone as you heard my name is

00:00:18Michael Green and I’m indeed here to

00:00:22tell you about a different approach to

00:00:25building algorithms and building

00:00:27machine learning methods really I’m also

00:00:30going to argue that they are

00:00:32fundamentally the same thing and you’ll

00:00:35see that a little bit later in my talk

00:00:37but let’s get cracking basically

00:00:41I’ll talk about the overview of

00:00:44AI and machine learning and I’m not the

00:00:46first one to do this and there are lots

00:00:47of people who have their take on it

00:00:49but this will be my take I’ll also try

00:00:52to extend to you the idea and concept of

00:00:55why this is not enough we are very good

00:00:58at telling ourselves that we have come

00:00:59really far in AI and I would actually

00:01:03tend to disagree with that I think we’re

00:01:05playing around in the paddling

00:01:06pool and it’s simply not good enough we

00:01:10need to innovate this area we need to be

00:01:12better I will also talk about how

00:01:17perception versus inference can work in

00:01:20a computer I will make a short note

00:01:22about our Bayesian brains because that’s

00:01:25fundamentally how we reason as

00:01:27people at least from a macroscopic

00:01:29perspective I’ll also talk a little bit

00:01:32about probabilistic programming and why

00:01:34I see that as a very key point to

00:01:37marrying two very different fields or

00:01:40differentiated fields today and in the

00:01:42end I’ll tie all of it together so that

00:01:43you can see how you can actually

00:01:44practically deploy a solution like this

00:01:51but basically if we just go back to

00:01:53basics so I know a lot of different

00:01:55definitions of artificial intelligence

00:01:57there are a lot of them out there

00:01:59and none of them says the ability to

00:02:02drive a car while not crashing that’s

00:02:05simply not artificial intelligence that

00:02:07is something that solves a

00:02:09domain-specific problem that is

00:02:10challenging yes but it’s not AI neither

00:02:15is diagnosing a disease in a

00:02:18patient

00:02:19that comes into the ER that’s also not

00:02:21AI neither is actually well what I do in

00:02:24my company that’s also not AI all of

00:02:27those are examples of narrow AI where we

00:02:29try to use machines to do more clever

00:02:33things than an individual person could

00:02:34do at the same task but my definition of

00:02:38AI is basically that it’s sort of the

00:02:40behavior as shown by an agent that you

00:02:42stuff into an environment and that

00:02:45behavior in itself seems to optimize the

00:02:47concept of future freedom now that is

00:02:50the closest definition to artificial

00:02:53intelligence that I can come to

00:02:55because that doesn’t say anything you

00:02:57know like optimize the least squares

00:03:00error do backpropagation to

00:03:03make sure that the cross-entropy error

00:03:04looks good all of those things are

00:03:06man-made and I assure you our brains do

00:03:09not do backpropagation it’s simply not

00:03:13true

00:03:13no one is telling our children how to

00:03:17stand up they’re not getting smacked on

00:03:18the hands for failing my son he failed

00:03:21several times this morning but he

00:03:23actually succeeded when I left the room

00:03:25so without my encouragement he actually

00:03:27did better that might say something

00:03:29about my pedagogical skills or the fact

00:03:32that he doesn’t need my training to do

00:03:33these things so there’s a fundamental

00:03:35thing that’s missing there’s a missing

00:03:37piece in our understanding of how

00:03:39knowledge is represented accumulated and

00:03:42acted upon and that is what fascinates

00:03:45me more than anything I’m sure you’ve

00:03:50seen this before it’s just a definition

00:03:53of what AI is today so there’s a lot of

00:03:56things but basically we are in the

00:03:59top level there every single application

00:04:02you have ever seen or heard of today is in

00:04:05this field artificial narrow

00:04:07intelligence there is no such thing as

00:04:10artificial general intelligence it

00:04:12doesn’t exist today and if someone says

00:04:14they have it they’re lying because we

00:04:16don’t have the representation of how to

00:04:19capture knowledge no one has that you

00:04:22simply cannot express this in Python or

00:04:24R or whatever language you want it

00:04:26doesn’t exist we need to figure out how

00:04:29to represent this

00:04:30so artificial general intelligence that

00:04:32is really the task of saying how could

00:04:35we actually take an AI that knows how to

00:04:38drive a car stuff that into a different

00:04:41environment and make it utilize the

00:04:43skills that it gained learning how to

00:04:45drive the car and apply that to a

00:04:46completely different field that is

00:04:48domain transfer and that is something that

00:04:51no AI can do today

00:04:53now artificial superintelligence and the

00:04:55only reason I’m mentioning this is

00:04:56because it’s really really far away

00:04:59the only thing super about this is how

00:05:00super far away it is into the future and

00:05:02there’s been a lot of people you

00:05:04know battling about this one

00:05:06of the famous guys Elon Musk he is more

00:05:10of a doomsday kind of guy with respect

00:05:12to this and he should be because

00:05:13that gets him money into his company so

00:05:15it’s a very smart

00:05:17move that he says that AI is going to

00:05:19destroy the world so I’m creating a

00:05:20start-up that’s going to sort of

00:05:22regulate that so imagine how hard it was

00:05:24to raise money for that venture there

00:05:28are other things to consider about super

00:05:30intelligence and that’s that it is

00:05:33conceptually possible it is something

00:05:36that sooner or later if we do capture

00:05:38how to represent knowledge how to

00:05:40transfer knowledge how to accumulate

00:05:42knowledge if we know that then there is

00:05:45no stopping us from deploying this into

00:05:48the world and for all practical purposes

00:05:51now sounding a lot like Musk what we

00:05:54released at that time would basically be

00:05:56a god to us and the whole thing and the

00:05:59scary part about that is will it be a

00:06:01nice God nobody knows but then again

00:06:05there’s very little proof in history

00:06:08that intelligence feeds violence so if

00:06:13anything the world is a safer place than

00:06:14it’s ever been before and I would

00:06:16like to see that as an evolution of our

00:06:18intelligence as an evolution of our

00:06:20compassion I don’t see intelligence

00:06:23being a necessity for murderous robots

00:06:25so I’m not very afraid of that scenario

00:06:30I know we won’t be the smartest cookies

00:06:33anymore in the world but maybe that’s

00:06:34not so bad

00:06:35that was always going to happen and

00:06:37evolution will make sure that no matter

00:06:38what

00:06:40but basically the landscape looks like

00:06:42this so you know you have this

00:06:43term artificial intelligence that’s

00:06:45sort of ubiquitous and describes

00:06:47everything from doing a linear

00:06:48regression in Excel to a self-driving

00:06:50car to identifying melanoma on a cell

00:06:54phone and all of these things

00:06:56are not artificial intelligence but

00:06:59it’s just become a buzzword just like

00:07:00big data I very much agree with the

00:07:02previous speakers about this the way I

00:07:05see it is that AI today is two things

00:07:07it’s perception machines and there’s

00:07:09inference machines and by inference I

00:07:11don’t only mean forecasting or sort of

00:07:13prediction I mean really inference where

00:07:15you actually predict without actually

00:07:17having any data now under the perception

00:07:21part we’ve come a long way perception

00:07:23machines are everywhere those are the

00:07:25machines that know how to drive

00:07:26a car those are the machines that know

00:07:27how to identify the kites in the

00:07:29images that we saw all of those deep

00:07:32learning applications they’re

00:07:33basically perception machines they can

00:07:36conceptualize something that they

00:07:38actually get as input either through

00:07:40visual stimuli or auditory stimuli they

00:07:43can sort of categorize it but they

00:07:46cannot make sense of it and I’ll show

00:07:47you examples of that and that’s why I

00:07:50reason that we need more we need to

00:07:53move into proper inference where we

00:07:54actually have a causal understanding a

00:07:57representation of the world that we’re

00:07:59living in and only then can we actually

00:08:01talk about pure intelligence but we can

00:08:04get you know closer and I’ll show you

00:08:06how to do that the biggest problems in

00:08:09data science today which is also another

00:08:12term for applied artificial intelligence

00:08:14is that data is actually not as

00:08:17ubiquitous and available as you might

00:08:19think

00:08:20for many interesting domains there is

00:08:22simply no data and the data that’s there

00:08:24is exceedingly noisy it might be a

00:08:27flat-out lie it might be based on

00:08:29surveys and we know that people lie in

00:08:30surveys that’s also a problem structure

00:08:35the problem with structure is also

00:08:37that how do you represent the concept in

00:08:39the mathematical structure not

00:08:41necessarily in parameter space but just

00:08:43structurally how do you construct your

00:08:45layers in a neural network for example

00:08:49identifiability what I mean by that is

00:08:51that for any given data set there are

00:08:54millions of models that fit that data

00:08:58set and generalize from that data set

00:09:00equally well and many of them do not

00:09:04correspond to the physical reality that

00:09:06we’re living in

00:09:07so there are statistical truths

00:09:10parameter truths and there are physical

00:09:12realities and they’re not the same thing
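A minimal sketch of the identifiability problem described above. The model `predict`, the parameter names `a` and `b`, and the toy data are illustrative assumptions, not taken from the talk: two different parameter settings fit the same data exactly, so the data alone cannot say which one matches reality.

```python
# Hypothetical model y = a * b * x: only the product a * b is pinned down by data.
def predict(x, a, b):
    return a * b * x

def sum_squared_error(xs, ys, a, b):
    # Standard sum of squared residuals between predictions and observations.
    return sum((predict(x, a, b) - y) ** 2 for x, y in zip(xs, ys))

xs = [0.0, 0.5, 1.0, 1.5, 2.0]
ys = [6.0 * x for x in xs]  # toy data generated with a * b = 6

# Two different "physical" explanations of the same observations.
error_first = sum_squared_error(xs, ys, a=2.0, b=3.0)
error_second = sum_squared_error(xs, ys, a=6.0, b=1.0)

print(error_first, error_second)
```

Both settings give the same error, which is the sense in which a statistical truth need not be a physical one. A prior on `a` or `b` is one way to break the tie.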

00:09:16that’s why my previous field theoretical

00:09:18physics is sometimes problematic because

00:09:21quantum theory that I sort of

00:09:24specialized in has many different

00:09:27interpretations and then nobody really

00:09:29knows what’s going on but we know we can

00:09:31calculate stuff from it so it makes

00:09:33sense in the math but as soon as we push

00:09:35this button but what’s really happening

00:09:38then you know well we’re basically

00:09:41screwed because no one knows and a lot

00:09:44of people like to pretend that they know

00:09:46and then there are some people like the

00:09:48Copenhagen interpretation that says that

00:09:50well just shut up and do the math which

00:09:53is basically don’t ask the question

00:09:54because it cannot be answered

00:09:56Hawking adheres to this school by the

00:09:58way he’s also one of the guys

00:10:00who’s super scared of super intelligence

00:10:03funnily enough because he’s a clever

00:10:06cookie there’s also the thing about

00:10:09priors so every time that you

00:10:12address a problem as a human whatever

00:10:14problem I give you as an individual you

00:10:17will have a lot of prior knowledge

00:10:18you’ll have a half or whole life

00:10:22depending on how old you are of

00:10:23knowledge that you’ve accumulated this

00:10:25knowledge might transfer from another

00:10:27person that they just told you about

00:10:28something but you can apply this

00:10:30knowledge to the problem at hand you can

00:10:32represent that knowledge in the domain

00:10:35of the problem that you’re trying to

00:10:36solve and that is something that we also

00:10:39can actually mimic today through the

00:10:40concept of priors and that is

00:10:42basically the way of encoding an idea or

00:10:46a sort of knowledge as a statistical

00:10:49prior as a statistical distribution

00:10:51that can be put on par with data I’ll

00:10:54show you later how to do that as well

00:10:55the last part but not the least

00:10:58important one is uncertainty I cannot

00:11:01stress

00:11:01enough how important uncertainty is to

00:11:05do optimal decision-making you basically

00:11:08cannot make optimal decisions without

00:11:10knowing what you don’t know and I will

00:11:13stress that point several times during

00:11:14this talk during the remaining thirty

00:11:17nine minutes of it it’s really great I

00:11:18can actually see how little time I have

00:11:20left so I will now show you more

00:11:25equations and it’s not because

00:11:27I’m particularly fond of them but they

00:11:29do help express ideas so in the top

00:11:32level that’s basically a

00:11:36compact way of describing any problem

00:11:38that you might approach it’s basically a

00:11:40probability distribution over the data

00:11:41that you’re fed those are the X’s the

00:11:44Y’s those are the things that you want

00:11:45to be able to explain and the Thetas

00:11:47they represent all of the different

00:11:49parameters of your model stuff you don’t

00:11:51know it can also be latent variables

00:11:53concepts that you know exist but that

00:11:54you don’t have observational data for

00:11:56all that is the definition of a problem

00:11:59space now what machine learning has

00:12:01traditionally done ever since Fisher

00:12:04is basically that they

00:12:08looked at this with a question that

00:12:11everybody knew was wrong they basically

00:12:13said that what is the probability

00:12:15distribution of the data that I got

00:12:18pretending that is random given a fixed

00:12:20hypothesis that I don’t know that I’m

00:12:23actually searching for so then the

00:12:25problem actually became for all

00:12:26machine learning applications which sort of

00:12:28hypothesis could I generate that’s the

00:12:31most consistent with the data set that

00:12:33looks like my data set but that’s really

00:12:34not my data set and you can ask

00:12:38the question is that a reasonable

00:12:40question and then I will tell you it is

00:12:41not it is poppycock that question is not

00:12:45worth asking why because you’re

00:12:49basically just trying to find

00:12:50explanations to fit your truth that is

00:12:53not science ladies and gentlemen there

00:12:55is only one way to do science you

00:12:56postulate an idea and then you observe

00:12:59data to see if you can verify that idea

00:13:02or disregard it you cannot look at a

00:13:05data set then generate a hypothesis that

00:13:07best explains it and think that that

00:13:08somehow has any physical representation

00:13:10in this world because it doesn’t

00:13:12and that’s why a lot of

00:13:17machine learning approaches a lot of

00:13:18statistical approaches have actually

00:13:19figured out after you know

00:13:22several years of hardcore science they

00:13:25found out that the biggest risk for

00:13:27dying from coronary artery disease is

00:13:30actually going to the hospital yeah

00:13:34that’s just not true and you know nobody

00:13:37nobody stopped and said you know

00:13:40why did this happen is it because the

00:13:43researchers are brain damaged could

00:13:46have been the reason but it

00:13:47wasn’t it was the methodology it was

00:13:50they were asking the wrong question

00:13:52because if you ask that question I can

00:13:55assure you that before you died at the

00:13:58hospital you had to go there so this

00:14:00makes perfect sense but it has no

00:14:02representation of the problem you’re

00:14:03trying to solve what you should have

00:14:05said is given that you’re sick and you

00:14:09go to the hospital and given that you actually

00:14:11have something that’s worth visiting

00:14:13the hospital for now that is predictive

00:14:15of you being actually disposed to dying

00:14:19from coronary artery disease so how do we

00:14:23fix this we fix this by doing what we

00:14:26should have been doing from the

00:14:27beginning and this is not new this

00:14:28formula down here below asks a different

00:14:31question what does it ask it asks what

00:14:33is the probability distribution of the

00:14:35parameters of my model that I don’t know

00:14:37by the way given that I have observed a

00:14:40data set that is real it is not fake it

00:14:42is not random it is a data set that has been

00:14:44observed what is the probability

00:14:46distribution of my parameters now that

00:14:48is an interesting question to ask and

00:14:50that is a scientific question to ask but

00:14:52what does that require it requires you

00:14:54to state your mind the last part of the

00:14:57numerator which is the P theta given X

00:14:59that says what do you believe is true

00:15:02about your parameters given the data set

00:15:04that you have that’s very very important

00:15:07ladies and gentlemen because this is the

00:15:09difference between something great and

00:15:12something completely insane
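The slide formulas are not part of this transcript. A reconstruction consistent with the spoken description, assuming x denotes the inputs, y the quantities to be explained and θ the unknown parameters and latent variables, might look like this. The first line is the joint problem description, the second the likelihood-only question, and the third the Bayesian question built from the likelihood p(y | θ, x), the prior p(θ | x) and the normalizing integral:

```latex
\begin{align}
  \text{problem space:} \quad & p(y, \theta \mid x) \\
  \text{likelihood only:} \quad & \hat{\theta} = \arg\max_{\theta}\; p(y \mid \theta, x) \\
  \text{posterior:} \quad & p(\theta \mid y, x)
    = \frac{p(y \mid \theta, x)\, p(\theta \mid x)}
           {\int p(y \mid \theta, x)\, p(\theta \mid x)\, d\theta}
\end{align}
```

The exact symbols on the original slide may differ; only the roles of the three terms are taken from the talk.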

00:15:14now then you might ask but okay why

00:15:17didn’t we do this because it couldn’t be

00:15:19done we simply didn’t have the

00:15:22computational power to do this and it’s

00:15:24not because of the guy to the right hand

00:15:26side there

00:15:26it’s also not the guy on the left

00:15:28hand side in the numerator and you can

00:15:30see that the guy on the left hand side

00:15:32in the numerator is exactly what machine

00:15:33learning is doing today now why is that

00:15:36it’s because of the fact that they knew

00:15:38that the guy in the denominator that

00:15:41is an integral from hell and it cannot

00:15:44be solved it looks at every single

00:15:46value of every single parameter that you

00:15:48have and sums that out now this will end

00:15:52up in a scenario where we have to calculate a

00:15:55lot more things than the number of

00:15:57atoms in the universe and there are a

00:15:58lot of atoms in the universe even in the

00:16:01part of it that we can see but

00:16:04that basically meant that all of this is

00:16:06out of the question so someone realized

00:16:07hey I don’t need to calculate that

00:16:09I don’t know I don’t care about

00:16:11probabilities you know I can just say

00:16:13that the point that is the maximum will

00:16:16be the same because the other thing is

00:16:17just a normalizing factor it’s a

00:16:18constant okay good enough we remove that

00:16:21so done deal and then they said but the

00:16:24prior however what if I don’t know anything

00:16:26what if I I don’t want to say anything I

00:16:29don’t want to you know state my mind and

00:16:31you know put my knowledge into the

00:16:34problem so that’s just the uniform

00:16:36distribution over minus infinity and

00:16:38infinity and whoopty-doo this equation

00:16:42here has been reduced to only the

00:16:43likelihood but you made a lot of

00:16:46assumptions there but people just forgot

00:16:47that these assumptions are not true and

00:16:52also in maximum likelihood which is

00:16:55you know a horrible way of doing things

00:16:57it’s basically because you assume that

00:16:59everything is independent you assume

00:17:01that even when you’re doing time series

00:17:03regression that observation one is

00:17:04independent of observation two that’s

00:17:07like saying you know

00:17:09last year I was not one year younger

00:17:11than I am today of course I was and

00:17:13that’s important
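A small sketch of the shortcut described above, under assumptions that are not from the talk: a coin-flip model, a grid over the heads probability, and an illustrative prior proportional to θ(1 − θ). With a flat prior the most probable parameter is just the maximum likelihood estimate; stating a prior and keeping the normalizing sum gives a full distribution, including how uncertain the answer is.

```python
def grid_posterior(heads, flips, prior, n_grid=101):
    """Posterior over a coin's heads probability on an evenly spaced grid."""
    thetas = [i / (n_grid - 1) for i in range(n_grid)]
    # Numerator of Bayes' rule: prior times likelihood at every grid point.
    unnormalized = [prior(t) * t**heads * (1.0 - t) ** (flips - heads) for t in thetas]
    # Denominator: the normalizing sum (an integral in the continuous case).
    evidence = sum(unnormalized)
    return thetas, [u / evidence for u in unnormalized]

def posterior_mode(thetas, posterior):
    # Grid point with the highest posterior probability.
    return max(zip(posterior, thetas))[1]

def flat_prior(theta):
    return 1.0  # "I don't want to state my mind"

def informed_prior(theta):
    return theta * (1.0 - theta)  # assumed belief: coins are roughly fair

heads, flips = 3, 3  # tiny data set: three heads in three flips

thetas, flat_post = grid_posterior(heads, flips, flat_prior)
_, informed_post = grid_posterior(heads, flips, informed_prior)

mle = posterior_mode(thetas, flat_post)               # flat prior mode = maximum likelihood
map_estimate = posterior_mode(thetas, informed_post)  # pulled away from the extreme
p_biased = sum(p for t, p in zip(thetas, informed_post) if t > 0.5)

print(mle, map_estimate, p_biased)
```

In one dimension the normalizing sum is trivial; the talk's point is that it explodes with the number of parameters, which is what sampling-based probabilistic programming tools are meant to handle.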

00:17:15all of those things that are temporally

00:17:18related are extremely important and the

00:17:20reason why I’m saying this today is that

00:17:22there’s no need to cheat anymore there’s

00:17:25no need for these crazy statistical-only

00:17:27results you can state your mind you

00:17:30can do the inference and all of it can

00:17:32be done with probabilistic programming

00:17:33and there are many frameworks for this

00:17:35today including in Python and also

00:17:38building on top of TensorFlow by the

00:17:40way there’s really no excuse not to do this

00:17:42and the best thing about it is that it’s

00:17:45actually easier than adhering to

00:17:48normal statistics because in normal

00:17:50statistics you were taught tools they

00:17:52said that if you have two populations

00:17:55and they are sort of varying together

00:17:57then you use this magical tool if they

00:18:00are independent then you use another

00:18:01magical tool nobody really understood

00:18:03why they just said in here use the t-test

00:18:05in this one it’s a paired t-test and

00:18:07this one is the Wilcoxon in this one you

00:18:09should do a general logistic regression

00:18:13in this one you should just do a normal

00:18:15linear regression in this one use a

00:18:16support vector machine they are all the

00:18:19same thing they are not different there

00:18:22are different assumptions in the

00:18:24likelihood functions there are different

00:18:25assumptions in your priors there are

00:18:27different assumptions in the physical

00:18:30structure of your model that is all

00:18:32there is no other difference all of it

00:18:35comes back to probabilistic modeling and

00:18:38if you can learn how to make these

00:18:40assumptions explicitly then you have a

00:18:43modeling language without limitations

00:18:45then you don’t have to know the

00:18:46difference between logistic regressions

00:18:48and linear regressions because there is

00:18:49none it is exactly the same thing and

00:18:53that’s perhaps the most important thing

00:18:55now wait the most important thing that

00:18:57I’m gonna say today given that you think

00:19:00it’s important is that you cannot do

00:19:02science without assumptions that is

00:19:03impossible just you know this is not

00:19:05my belief this is just hardcore facts

00:19:08you cannot do science without assumptions

00:19:10and don’t rest your minds until you

00:19:14understand this so without actually

00:19:16risking something you can get no answers

00:19:20so let’s have a look at neural networks

00:19:24I’m sure how many of you have taken a

00:19:25neural networks class in their days ok

00:19:28then most of you have solved this

00:19:30problem I’m sure how many people have

00:19:33solved this problem before ok a few guys

00:19:35and girls so basically this problem is

00:19:39highly nonlinear it’s a

00:19:42classification task your job is to

00:19:44separate the blue dots from the red

00:19:46dots by some line you can see this is

00:19:48sort of a spiral that’s non

00:19:50stationary it’s

00:19:52quite nasty isn’t it and a neural network

00:19:54well how many hidden nodes do you think

00:19:56I have to have in a one-layer neural network

00:19:58to solve this 10 20 50 100 let’s see

00:20:10well with ten hidden nodes I can learn how

00:20:12to separate this not great but there is

00:20:17some signal there if you use up here

00:20:20thirty hidden nodes you can do a lot

00:20:22better not surprising but still it’s

00:20:25still not good because we know that this

00:20:26problem can be solved exactly right so

00:20:28with a hundred hidden nodes you almost

00:20:32have perfect classification right and if

00:20:35you look at the accuracy table you will

00:20:37see that the area under the curve is

00:20:39100% with the 100 nodes now what is the

00:20:44problem with this and this

00:20:46is on a test data set mind you now the

00:20:49problem with this is that this looks

00:20:51great

00:20:52this looks amazing I mean your job is

00:20:54done right okay so let’s look at the

00:20:57decision surfaces that were generated

00:20:59from these guys now to the left-hand

00:21:02side you have the decision surface based

00:21:03on 10 hidden neurons and on the right

00:21:06hand side you have the decision surfaces

00:21:08based on 100 hidden nodes now you can

00:21:11see here do those decision surfaces

00:21:14look good to you does it look like they

00:21:16actually have captured what you wanted

00:21:18them to capture no they did not and this

00:21:22is exactly how neural networks work they

00:21:25are over parameterised very flexible

00:21:28mathematical models that will do

00:21:30everything they can to minimize that sum of

00:21:34squares or the cross-entropy error so

00:21:37there’s no penalisation for finding

00:21:39statistical only results and what is the

00:21:42worst thing with this the worst thing

00:21:43here is that you see the regions in the

00:21:45in the outskirts that are colored red

00:21:47that is a signal that the neural network

00:21:50is sure exists there was no data

00:21:53out there at all

00:21:54but it knows that that has a

00:21:56differentiated class now this might not

00:22:00be a problem if you’re if you’re trying

00:22:02to classify you know

00:22:04maybe if it will rain extra much

00:22:08tomorrow but what if you have a droid

00:22:12with one target kill insurgents let

00:22:17civilians live what if it identifies one

00:22:20of those as you know one of those

00:22:23outer regions that just makes sense

00:22:26that was never part of the training set

00:22:28this is a truth that has been learned by

00:22:32a network where data never actually

00:22:35showed it this and there’s no

00:22:37penalisation for this and the reason why

00:22:39I’m saying this is not to say you know

00:22:41don’t use AI or don’t use machine

00:22:43learning in fact I’m saying the opposite

00:22:45but what I want to say here is that be

00:22:47responsible

00:22:48every time you deploy a machine learning

00:22:51algorithm you have to understand exactly

00:22:54what it does because lack of

00:22:56understanding is the most dangerous

00:22:58thing that can exist today and it

00:22:59doesn’t have to be artificial

00:23:01superintelligence all it requires is a

00:23:04screw-up by the engineer or the

00:23:07scientist who built this network and it can

00:23:08have dramatic consequences especially

00:23:12today in the time of self-driving

00:23:17cars and all these things and this

00:23:19here I will show you another example of

00:23:21why I think that this is interesting so

00:23:23this is just a representation and mind

00:23:25you this is only a single layer neural

00:23:27network by the way no you know super

00:23:29deep structures where you would have even

00:23:31more parameters so I just want to

00:23:37show you that this problem here

00:23:40represented in Cartesian coordinates is

00:23:42what was being fed to the neural network

00:23:45and what the neural network should have

00:23:46realized is that in polar coordinates it

00:23:50looks a lot simpler doesn’t it now I

00:23:53know that problem I can separate that

00:23:55with just one hidden node and this

00:23:58is my point you can over parameterize

00:24:01and throw a lot of data at things but if

00:24:03you start to think about the problem at

00:24:05hand and if we teach machines to learn

00:24:07how to think how to reason how to look

00:24:10at data instead of just number crunching

00:24:12and this is why today I’m not scared of

00:24:17artificial intelligence artificial

00:24:18superintelligence because I could have

00:24:20solved this in half a second you know

00:24:22even if you don’t have a degree in

00:24:24physics you should realize that

00:24:25these are just two sine functions with

00:24:27increasing radius it’s not hard but

00:24:32a neural network would never get this

00:24:35nor would any other machine learning

00:24:37algorithm by the way impossible because

00:24:40they don’t work that way that’s not

00:24:41their goal
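To illustrate the polar-coordinate argument above, here is a rough sketch. The spiral generator, its noise level and the hand-picked feature cos(θ − r) are all assumptions made for this example, since the exact data set on the slide is not in the transcript. Two interleaved spirals that are hard to separate in Cartesian coordinates become separable by a single threshold once radius and angle are used.

```python
import math
import random

def make_spirals(n_per_class=200, noise=0.05, seed=0):
    """Two interleaved spirals whose radius grows with the angle (assumed data)."""
    rng = random.Random(seed)
    points = []
    for label in (0, 1):
        for i in range(n_per_class):
            t = 0.5 + 3.0 * math.pi * i / (n_per_class - 1)
            angle = t + math.pi * label  # second spiral is rotated by half a turn
            x = t * math.cos(angle) + rng.gauss(0.0, noise)
            y = t * math.sin(angle) + rng.gauss(0.0, noise)
            points.append((x, y, label))
    return points

def polar_feature(x, y):
    # Move to polar coordinates, then compare the angle with the radius.
    r = math.hypot(x, y)
    theta = math.atan2(y, x)
    return math.cos(theta - r)  # close to +1 on one spiral, close to -1 on the other

def classify(x, y):
    # A single threshold unit on one feature, the analogue of one hidden node.
    return 0 if polar_feature(x, y) > 0.0 else 1

data = make_spirals()
accuracy = sum(classify(x, y) == label for x, y, label in data) / len(data)
print(accuracy)
```

The point is not that this feature is clever but that it encodes knowledge about the structure of the problem, which is exactly what the overparameterized network never had.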

00:24:42by the way we can’t be angry at

00:24:44them for not solving that I just want to

00:24:47show you a take on probabilistic programming

00:24:50with this and also explain to you

00:24:52what probabilistic programming is it’s

00:24:54basically an attempt to unify

00:24:56general-purpose programming and by

00:24:58general purpose I mean like Turing

00:25:00complete programs that we all like

00:25:01because they can basically compute

00:25:03anything and marrying that with

00:25:05probabilistic modeling which is what

00:25:07everyone should be doing everyone

00:25:09whatever model you are creating you are

00:25:11doing probabilistic modeling you just

00:25:13accepted a lot of assumptions that you

00:25:14didn’t make and that is a

00:25:17realization that even though you

00:25:19can choose not to care about it you have

00:25:22to know about it you have to know the

00:25:24assumptions behind the algorithms that

00:25:26you’re using and that’s why even though

00:25:28it’s very tempting to fire up your

00:25:31favorite programming language load

00:25:33scikit-learn or TensorFlow or you know

00:25:36whatever framework you’re using MXNet

00:25:38doesn’t matter it’s still important to

00:25:40understand the cost you don’t have to be

00:25:42an expert in the math behind it that’s

00:25:44not what I’m saying but you have to

00:25:45understand conceptually what they do and

00:25:47more importantly what they don’t do

00:25:50because that makes all the difference so

00:25:55this is just to say that you could have

00:25:56written this model a lot easier now this

00:25:59is also a breaking point of the

00:26:02HTML5 presentations by the way this is

00:26:05actually really supposed to be on the

00:26:06right hand side so thank you Windows

00:26:09even so that bit of code up there is

00:26:14basically a probabilistic way of

00:26:16specifying the model that solves it

00:26:18exactly and this can be expressed in a

00:26:21probabilistic programming language the

00:26:22neural network I wrote to fix that took

00:26:24a lot more coding I can assure you

00:26:31so the take-home message here is that

00:26:34if you view things if you go back to

00:26:38basics and view them as what they are

00:26:39probabilistic statements about data

00:26:41about concepts about what you’re trying

00:26:43to model you gain basically a generative

00:26:47model you gain an understanding of what

00:26:49is actually happening and and that also

00:26:51means that you don’t get any crazy

00:26:53statistical only solutions due to

00:26:56identifiability problems and this is

00:26:59something we really have to get away

00:27:00from identifiability is something that

00:27:03will be problematic so I’m not going to

00:27:06talk about deep learning I just want to

00:27:08show you what it is but I think you’ve

00:27:11had enough talks about that so max

00:27:13pooling and all of that we can I’m

00:27:15pretty sure we can skip what I do want

00:27:18to say though is that neural networks

00:27:20by default are degenerate and what I

00:27:23mean by that is that the energy

00:27:24landscape that they’re running around in

00:27:26where they are trying to optimize things

00:27:27there are multiple locations in this

00:27:30energy landscape corresponding to the

00:27:31parameters that minimize the error

00:27:33and they’re equivalent but they

00:27:36correspond to very different physical

00:27:38realities so how’s the neural

00:27:41network supposed to know and this is

00:27:43not something that you know that we

00:27:45can design our way out of because the

00:27:48whole idea with the neural network is

00:27:50this degeneracy because the optimization

00:27:52is such a problematic space and

00:27:54I just want to visualize with the simple

00:27:57neural network here why this happens you

00:27:59can see these two networks describe

00:28:01exactly the same thing they solve

00:28:03exactly the same problem but the

00:28:05parameters are different and that’s why

00:28:08if you go from x1 to

00:28:10hidden 1 and hidden 2 you can either

00:28:13have weight 1 1 be equal to 5 and go to

00:28:15hidden node 1 or you can have that same

00:28:18weight on the path to hidden node 2 so if you

00:28:21basically turn this on its

00:28:23head and shift around these weights you

00:28:25get exactly the same solution now this

00:28:27is one source of degeneracy and there

00:28:30are many of those so just imagine now

00:28:31that you’re stacking a lot of layers on

00:28:33top of each other you’re having hundreds

00:28:34of neurons how many permutations do you

00:28:36think you will be able to reach a lot is

00:28:39the answer I didn’t do it I didn’t do

00:28:41the math but just

00:28:42trust me it’s a lot so in in energy

00:28:46space in one dimension it looks like the

00:28:48one on the left-hand side you see two

00:28:49distinct points they are equivalent in

00:28:52the solution space and you cannot

00:28:53differentiate between them this is also

00:28:55why regularization is such a good idea

00:28:57in neural networks because it basically

00:28:59forces you to enter one of those

00:29:01attractors and in two-dimensional space

00:29:04you can see that it corresponds to these

00:29:05two attractors in this colorized plot
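The weight-swapping argument can be checked numerically. Below is a minimal Python sketch, assuming a toy network with two inputs, three tanh hidden units and one output; the weights are made-up illustrative numbers, not the ones on the slide. Reordering the hidden units, together with their incoming and outgoing weights, leaves every prediction unchanged, so the two parameter vectors sit at different points in the energy landscape with the same error.

```python
import math

def forward(x, w_in, b_in, w_out, b_out):
    """One-hidden-layer tanh network: x -> hidden -> scalar output."""
    hidden = [
        math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(w_in, b_in)
    ]
    return sum(w * h for w, h in zip(w_out, hidden)) + b_out

def permute_hidden(w_in, b_in, w_out, order):
    """Relabel the hidden units: same function, different parameter vector."""
    return ([w_in[i] for i in order],
            [b_in[i] for i in order],
            [w_out[i] for i in order])

# Hypothetical weights chosen only for illustration.
W_IN = [[5.0, -1.0], [0.5, 2.0], [-3.0, 0.25]]
B_IN = [0.1, -0.2, 0.3]
W_OUT = [1.5, -2.0, 0.75]
B_OUT = 0.05

W_IN_P, B_IN_P, W_OUT_P = permute_hidden(W_IN, B_IN, W_OUT, order=[2, 0, 1])

for x in ([0.0, 0.0], [1.0, -1.0], [0.3, 2.5]):
    y_a = forward(x, W_IN, B_IN, W_OUT, B_OUT)
    y_b = forward(x, W_IN_P, B_IN_P, W_OUT_P, B_OUT)
    print(x, y_a, y_b)  # the two outputs agree for every input

# A layer with n hidden units has n! such equivalent orderings.
print(math.factorial(100))
```

A single layer with n hidden units has n! equivalent relabelings of this kind, which is one way to put a number on "a lot" before even stacking layers.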

00:29:09and then if you visualize this in in all

00:29:12the dimensions that the neural network

00:29:13is actually operating in which is

00:29:14typically a very large number of dimensions then

00:29:18you can just imagine how many of those

00:29:20attractors you have and different depths

00:29:22of those attractors so I want to end my

00:29:27point if you missed my point I tried to

00:29:29state it several times but sometimes I’m

00:29:31very clumsy in the way I state things so

00:29:33I’m gonna be very blunt this is one of

00:29:36the best neural networks as of 2016

00:29:41or 2015 it was a version of LeNet

00:29:44that was trained to recognize digits and

00:29:48it does that perfectly like we said

00:29:51before we’ve come so far in this area

00:29:53of perception that we don’t have to

00:29:55worry about not being able to do it it’s

00:29:57actually done and

00:30:00it’s much better than humans at

00:30:01recognizing these things okay so let’s

00:30:04put it to the test shall we let’s

00:30:06generate some random noise images and

00:30:08ask it what is this and in every single

00:30:11image here you see the network is 99%

00:30:15sure that it’s a 1 or a 2 all the way

00:30:19up to 9 so for all the 4 images under the 0

00:30:21it is convinced with a probability of

00:30:2499% that this is a 0 can you in any way

00:30:29understand why this is a zero I can’t

00:30:33and nor can the network because it

00:30:37was never penalized based on the fact

00:30:40that you’re not allowed to find

00:30:41structures that do not sort of exist in

00:30:46your data it has no briefing that it has

00:30:49to stay true to some sort of physical

00:30:51reality and this happens
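One mechanical reason a classifier can be so sure about noise: a softmax output layer must spread a total probability of 1 over the ten digit classes, so it has no way to say "none of these". The minimal Python sketch below assumes a made-up linear classifier with random weights (this is not the LeNet from the paper). It shows that a pure-noise input still receives a full set of class probabilities, and that larger logits push the top class toward certainty.

```python
import math
import random

def softmax(logits):
    """Numerically stable softmax: subtract the max before exponentiating."""
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def classify(pixels, weights, scale=1.0):
    """Linear 'digit classifier': one weight vector per class, then softmax."""
    logits = [scale * sum(w * p for w, p in zip(row, pixels)) for row in weights]
    return softmax(logits)

rng = random.Random(0)            # fixed seed so the run is repeatable
N_PIXELS, N_CLASSES = 64, 10      # hypothetical 8x8 image, digits 0-9

# Stand-in weights; a trained network would have learned these from digits.
WEIGHTS = [[rng.gauss(0.0, 1.0) for _ in range(N_PIXELS)] for _ in range(N_CLASSES)]
noise_image = [rng.random() for _ in range(N_PIXELS)]   # pure random noise

for scale in (0.1, 1.0, 10.0):
    probs = classify(noise_image, WEIGHTS, scale)
    best = max(range(N_CLASSES), key=lambda k: probs[k])
    print(f"scale={scale}: predicts {best} with p={probs[best]:.3f}, sum={sum(probs):.3f}")
```

Nothing in this setup penalizes confidence on inputs that look nothing like the training data, which is the gap the talk is pointing at.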

00:30:54now back to my point what if it’s not

00:30:57the number zero

00:30:58what if it’s recognizing the

00:31:01face of a known terrorist with a you

00:31:03know kill on sight command and this is

00:31:06just numbers ladies and gentlemen imagine the

00:31:08complexity of faces so this is the entry

00:31:13point exactly how dangerous this

00:31:16technology is if you don’t respect it

00:31:19and it’s not about you know the machines

00:31:21being too intelligent it’s about us not

00:31:23being stupid that is that is really

00:31:26important to remember we have a

00:31:27responsibility to build applications

00:31:30that do not have this confirmation bias

00:31:33in them and that is something I hope

00:31:37that all of you will think of when you

00:31:38go out and build the next awesome

00:31:40machine learning application because I

00:31:42can’t see any numbers in these images

00:31:44anywhere and if you want to you can read

00:31:47the paper by these guys as I said you

00:31:50get the slides afterwards and it’s a

00:31:51very interesting paper they’ve basically

00:31:54tried all they could to see how the

00:31:57network could generalize with things

00:31:59that it hadn’t seen before and in different

00:32:02areas of what it was supposed to see

00:32:05another thing I want to say is that

00:32:07events are not temporally independent

00:32:09everything that you do today everything

00:32:11that you see today hear perceive think

00:32:13about is affected by what you saw

00:32:15yesterday and it’s the same in data data

00:32:19is not independent you cannot assume

00:32:21that two data points are independent

00:32:23that is a wild and crazy assumption that

00:32:25we have been allowed to do for far too

00:32:27long

00:32:27and this is just a small visualization

00:32:29from the domain that I that I was

00:32:31working in where we’re trying to solve

00:32:33how TV exposure affects the purchasing

00:32:36behavior of people moving into the

00:32:38future and of course if you see a TV

00:32:39commercial today it might affect you to

00:32:41buy something far into the future and it

00:32:44might affect no one to do something

00:32:46today and those are causal or temporal

00:32:49dependencies that also need to be

00:32:51taken into account if you think about

00:32:56causal dependencies and if you think

00:32:59about concepts if you really think about

00:33:01structure of things then you end up with

00:33:04something that looks like a deep

00:33:05learning neural network but where you

00:33:07actually have

00:33:08structure that is inherent to the

00:33:10problem at hand and that’s basically you

00:33:12forging connections between concepts

00:33:14between variables between parameters

00:33:16that sort of solve the problem at hand

00:33:19but that don’t have this

00:33:20overparameterization this is a visualization

00:33:22of one of the models that

00:33:24we’re running at Blackwood for one

00:33:26of our clients and

00:33:29this is sort of the complexity that you

00:33:30need to have to solve the everyday

00:33:32problems every node that you see here is

00:33:34basically a representation of a variable

00:33:36or a latent variable and the

00:33:37relationships between them are basically

00:33:39edges and basically there’s no point in

00:33:45this thing spinning I just thought it

00:33:46looked cool and it helped me raise money

00:33:49back in the days

00:33:50actually the spinning I think was the

00:33:52differentiator because in one of the

00:33:54pitches I did it didn’t spin

00:33:56and we didn’t get that money and then

00:33:58all of a sudden it was spinning and we

00:34:00got that money I don’t know if that’s

00:34:01you know all the reason but the spinning

00:34:05in my mind helped so but

00:34:08there’s no visual improvement based on

00:34:11that how many people have seen this

00:34:12before

00:34:14okay well that’s that’s just no fun okay

00:34:18but before before I saw it the first

00:34:21time interesting enough I had not seen

00:34:23it so the problem here is that you’re

00:34:26supposed to judge whether a and B the

00:34:29squares there are of the same hue or not

00:34:31and from my point of view they are

00:34:33extremely different they look very

00:34:35different but the problem is that

00:34:38they’re not they’re actually the same

00:34:39and the reason why a lot of people

00:34:42think that they are

00:34:44different is because we are predicting

00:34:47based on the shadow that is being cast

00:34:49from a light source that we know where

00:34:51it is because we have recognized this

00:34:53pattern earlier in our lives that is

00:34:54also a kind of confirmation bias but

00:34:57it’s a good one

00:34:57because that’s that’s what allows us to

00:34:59actually live our lives and sometimes we

00:35:01are wrong like in these contorted

00:35:03images but it does prove a point

00:35:05because our brains are very

00:35:08biased based on what we know already and

00:35:11we make predictions based on

00:35:13what we know

00:35:16so basically probabilistic programming

00:35:18what that is

00:35:20it basically allows us to specify any

00:35:23kind of model that we want you don’t

00:35:25have to think about layers you don’t

00:35:26have to think about the pooling you

00:35:28don’t have to think about all the

00:35:29wording all you have to think about is

00:35:31that you specify how variables might

00:35:33relate to each other and you specify

00:35:35which parameters that might be there and

00:35:37how they are relating to the variables

00:35:39at hand and if you have that freedom

00:35:41then there’s nothing you cannot model

00:35:43the problem with this is that you cannot

00:35:45fit that with maximum likelihood you

00:35:47cannot do that because you can’t

00:35:49assume independent observations you

00:35:52can’t assume that everything is

00:35:54uniform you can’t assume well you can

00:35:57but it’s not very smart you can’t assume

00:35:59that any given parameter has a possible

00:36:01value of minus infinity or plus infinity

00:36:03now this this in general just makes no

00:36:06sense just just think about the fact

00:36:08that you’re supposed to predict the

00:36:12house prices for example if you allow

00:36:15your model to predict something which is

00:36:17negative then you have something that

00:36:19might make sense in statistical

00:36:21space because there’s no reason why you

00:36:22shouldn’t be able to mirror things right

00:36:23you just look at the positive part but

00:36:26what about the part of your model

00:36:28that says that negative sales prices are

00:36:29also possible that’s just nonsense

00:36:32and these things you shouldn’t allow

00:36:35so that’s why you should specify your

00:36:37priors and the concept of your models

00:36:40very rigorously
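The house-price point can be made concrete with a small Python sketch. The numbers are invented for illustration (a typical price of 300,000 with a spread of 200,000, and an assumed log-scale spread of 0.6). The comparison is between an unconstrained Gaussian prior, which puts real probability on negative prices, and a log-normal prior, which cannot produce one.

```python
import math
import random
from statistics import NormalDist

# Hypothetical prior beliefs about a house price (currency units are arbitrary).
PRIOR_MEAN = 300_000.0
PRIOR_SD = 200_000.0

# Unconstrained Gaussian prior: the probability mass below zero is not zero.
gaussian_prior = NormalDist(mu=PRIOR_MEAN, sigma=PRIOR_SD)
p_negative = gaussian_prior.cdf(0.0)
print(f"Gaussian prior: P(price < 0) = {p_negative:.3f}")

# Log-normal prior with its median at PRIOR_MEAN (LOG_SD is an assumed spread
# on the log scale): every draw is exp(something), hence strictly positive.
LOG_MEDIAN = math.log(PRIOR_MEAN)
LOG_SD = 0.6

rng = random.Random(42)
draws = [rng.lognormvariate(LOG_MEDIAN, LOG_SD) for _ in range(10_000)]
print(f"Log-normal prior: smallest of {len(draws)} draws = {min(draws):,.0f}")
```

Choosing the prior's support to match what is physically possible is the kind of knowledge the talk asks you to state up front rather than leave to the data.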

00:36:41and the best thing about probabilistic

00:36:43programming is that we no longer have to

00:36:45be experts in Markov chain Monte Carlo

00:36:47before you had to do that but today you

00:36:49don’t you know you don’t have to

00:36:50understand what what a Hamiltonian is in

00:36:52this space you don’t have to understand

00:36:54quantum mechanics you just have to learn

00:36:56how to program a probabilistic

00:36:58programming language which is very easy

00:37:01by the way super easy if you know Python

00:37:05or R or Julia or C++ or C or Java

00:37:09learning how to program a probabilistic

00:37:11programming language is a walk in the

00:37:12park and it’s still Turing complete

00:37:14mind you there are a lot of different

00:37:17things we get out of this we can get the

00:37:20full Bayesian inference with Markov

00:37:23chain Monte Carlo through algorithms such

00:37:25as Hamiltonian Markov chain Monte Carlo

00:37:27and the No-U-Turn Sampler that’s what

00:37:29you really want to do the problem with

00:37:30this is that still today it takes

00:37:32it takes some time there’s a there’s

00:37:35another emerging tool that’s called

00:37:37automatic differentiation variational

00:37:39inference which is just a lot of

00:37:41different words that say turn the

00:37:43inference problem into a maximization

00:37:45problem and they have gotten

00:37:48somewhere with that which makes these

00:37:50inference machines a lot easier to fit

00:37:53the best thing is that also the math

00:37:56library already has this automatic

00:37:59differentiation so you don’t have to be

00:38:00an expert in that either again all you

00:38:02have to do is learn a probabilistic

00:38:04programming language or learn a

00:38:05framework in in Python that supports it

00:38:08like Edward for example there are many

00:38:11other frameworks that do the same thing

00:38:12a note about uncertainty now what if I

00:38:19gave you a task your task right now is

00:38:22to take 1 million American dollars

00:38:24and you’re going to invest them in

00:38:26either a radio campaign or a TV campaign

00:38:29now I’m going to tell you that the

00:38:31average performance of each campaign has

00:38:34been 0.5 so the return of investment for

00:38:37an average radio campaign has been 0.5

00:38:39the return on investment on an average

00:38:40TV campaign has also been 0.5 now my

00:38:43question to you is how would you invest

00:38:45does it matter well based on this

00:38:49information I would say I would just

00:38:51split it 50/50 I mean why not they have

00:38:54the same performance right but what if I

00:38:57also told you that actually if you look

00:39:02at ROI as a distribution if you look

00:39:04over all the different radio campaigns

00:39:06that have been run and all the different

00:39:08TV campaigns that have been run if you

00:39:10look beyond the average and look at the

00:39:12individual results what do you have then

00:39:14well then you have that radio for

00:39:16example and TV they both have had

00:39:18historically a minimum return on investment of 0

00:39:22which basically means it didn’t work

00:39:23that could be like

00:39:26some of the commercials you see on TV

00:39:28sometimes that are less than good you

00:39:30know sometimes you see these these naked

00:39:32gnomes running on a grass field and

00:39:34they’re trying to sell cell phone

00:39:35subscriptions and I never understood

00:39:37the connection but that didn’t work I’m

00:39:40sure I didn’t quantify that but but it

00:39:43didn’t work on me

00:39:45then I’m going to tell you that the

00:39:47maximum radio and TV performance that

00:39:50has been observed is that radio has had

00:39:51in its history a return on investment of 9.3

00:39:54meanwhile TV has only

00:39:55had 1.4 how would you invest

00:39:59now would you still split it fifty-fifty

00:40:03I wouldn’t

00:40:07now what if I tell you that this is

00:40:10probably not the the real solution

00:40:12either in order to answer this question

00:40:16you have to ask another question in

00:40:18return you have to ask the question what

00:40:21is the probability of me realizing a

00:40:24return on investment greater than for

00:40:25example 0.3 let’s just take that that is

00:40:28what I want to to achieve now now we

00:40:30have specified what our question is

00:40:32and then we can give it a probabilistic

00:40:34answer and then the answer to this

00:40:36question is that it’s about 40 percent

00:40:41probable for radio to get a return on

00:40:44investment for any given instance above

00:40:460.3 but it’s it’s about 90% for TV how

00:40:52does that go hand-in-hand with the fact

00:40:53that radio has outperformed TV historically

00:40:55as a maximum and they have the same

00:40:57average well it’s because of the fact

00:41:00that things are distributions things are

00:41:03distributions and they are not Gaussian

00:41:06now this here is the source of failure

00:41:10of every statistical method that you

00:41:13probably have tried before because it

00:41:14assumes that everything is symmetric and

00:41:16Gaussian nature makes no such promise it

00:41:20has never said thou shalt not use Cauchy

00:41:22never has that been part of any sort of

00:41:26commandment or information given to us

00:41:28by nature there is nothing special about

00:41:32the Gaussian distribution there are a few

00:41:34things special about it but you know

00:41:36let’s just ignore the central limit

00:41:39theorem for now because of the fact that

00:41:40we don’t have enough data to actually

00:41:42approach that anyway so let’s just

00:41:43ignore that for now now the point here

00:41:46is that the distribution of radio looks

00:41:47like this and the distribution for TV

00:41:50looks like the one below and here you

00:41:52can see they have the same average very

00:41:54different minima and Maxima and very

00:41:56different skewness and

00:41:58this is why you cannot make optimal

00:42:02decisions without knowing what you don’t

00:42:03know you cannot make optimal decisions

00:42:06without knowing the uncertainty even

00:42:08if you knew the average performance

00:42:09average performance is such a huge

00:42:13culprit in bad science and bad inference
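The radio-versus-TV argument can be reproduced with two made-up distributions. The sketch below assumes both channels have log-normally distributed ROI with the same mean of 0.5 but different spread. The spread values are chosen by hand so the answers land near the roughly 40% and 90% quoted in the talk; they are not the speaker's data.

```python
import math
from statistics import NormalDist

STANDARD_NORMAL = NormalDist()  # mean 0, standard deviation 1

def lognormal_mu(mean, sigma):
    """Log-scale location that gives a log-normal the requested mean."""
    return math.log(mean) - sigma ** 2 / 2.0

def prob_roi_above(threshold, mean, sigma):
    """P(ROI > threshold) for a log-normal ROI with the given mean and log-sd."""
    mu = lognormal_mu(mean, sigma)
    z = (math.log(threshold) - mu) / sigma
    return 1.0 - STANDARD_NORMAL.cdf(z)

MEAN_ROI = 0.5       # same average for both channels, as in the talk
SIGMA_RADIO = 1.30   # assumed: heavy right tail, many poor campaigns
SIGMA_TV = 0.35      # assumed: results cluster around the average
TARGET = 0.3         # the question asked: P(ROI > 0.3)

p_radio = prob_roi_above(TARGET, MEAN_ROI, SIGMA_RADIO)
p_tv = prob_roi_above(TARGET, MEAN_ROI, SIGMA_TV)
print(f"radio: P(ROI > {TARGET}) = {p_radio:.2f}")
print(f"tv:    P(ROI > {TARGET}) = {p_tv:.2f}")
```

Both channels share the same average by construction, yet the probability of clearing the 0.3 target differs by more than a factor of two, which is the information the average alone hides.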

00:42:16I cannot state this enough and that’s

00:42:18also why you should never ever ever ever

00:42:21ever treat the parameters of your model

00:42:23as if they were constants because they

00:42:26are not it’s also not interesting to ask

00:42:29the question how uncertain is my data

00:42:32about this parameter about this fixed

00:42:35parameter also a nonsense question not

00:42:37interesting and that is why we have to

00:42:41go back to basics and do it right

00:42:43because until we do we will never get

00:42:46further so if I can tie this all

00:42:49together I I created sort of a a way for

00:42:56for you to start playing around with

00:42:57this I made a Docker image

00:43:00basically which is called R Bayesian

00:43:02R is the host language but you can

00:43:04basically use whatever language you want

00:43:05it doesn’t really matter what I want to

00:43:07show here is basically how easy it is to

00:43:10deploy a docker container with a

00:43:13Bayesian inference engine that can model

00:43:15any problem known to man there is

00:43:18nothing you cannot do with this

00:43:21framework nothing it is more general

00:43:24than anything that you have ever tried

00:43:26because it can simulate everything that

00:43:28you have ever tried and most of the

00:43:30things you have ever tried comes from

00:43:32probability theory and this is just a

00:43:34pure application of probability theory

00:43:36so this is a very easy way to just snap

00:43:40that docker container and the best thing

00:43:42is that the functions you write

00:43:44in there are automatically

00:43:46converted to a REST API that you can

00:43:48expose through this Docker service so

00:43:50you have a REST API ready inference

00:43:52machine that is very much true to the

00:43:55scientific principle with no limitations

00:43:57and the only thing you have to pay for

00:43:59it is that you have to think twice now

00:44:03for those of you who don’t like R I can

00:44:06make a version with Python or Julia or

00:44:08whatever it’s not about the

00:44:09language

00:44:11what I really want to convey is that

00:44:13modeling needs to be rebooted we need to

00:44:16think again on how we define our models

00:44:19how we specify our models how we think

00:44:20about our models how we relate to our

00:44:22models we can never ever relate to our

00:44:25models without uncertainty we will

00:44:28always fail that’s why I think that

00:44:31playing around with this is it’s a good

00:44:34way to to learn more about these things

00:44:38this is just an example of how you would

00:44:40actually use this so I wrote a very very

00:44:42stupid container that it’s called the

00:44:45stupid weather and it’s stupid because

00:44:47it always gives you the same answer so

00:44:49no matter what you send in as parameter

00:44:51it always gives you something stupid so

00:44:55that that’s just to show you how you

00:44:57write a function it’s not supposed to

00:44:58convey any intelligence it’s just a

00:45:00placeholder it’s just boilerplate code

00:45:02for you to ingest your algorithm but it

00:45:05shows neatly how you’re transforming

00:45:07this to a REST API and it’s as simple as

00:45:10this just docker run and then you have

00:45:12it so even if you’re not you know a

00:45:17back-end developer or a full-stack

00:45:19developer it’s still easy to deploy and

00:45:22run your own solutions and you know a

00:45:24Docker container can run anywhere in the

00:45:25cloud it can run on Google it can run on

00:45:27Amazon I think it can even run on

00:45:30Microsoft’s cloud sure probably I didn’t

00:45:35try that but I would assume that

00:45:37they can run Docker containers
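The talk's container exposes R functions as REST endpoints. As a rough analogue only, here is a minimal Python sketch using the standard library's `http.server`. The function name `stupid_weather`, the port, and the response fields are all hypothetical, and this is not how the speaker's Docker image works internally.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import parse_qs, urlparse

def stupid_weather(params):
    """Placeholder model: ignores its input and always gives the same answer."""
    return {"forecast": "rain", "probability": 0.5}

class ModelHandler(BaseHTTPRequestHandler):
    """Maps GET /stupid_weather?city=... onto the function above."""

    def do_GET(self):
        url = urlparse(self.path)
        if url.path != "/stupid_weather":
            self.send_error(404, "unknown endpoint")
            return
        params = parse_qs(url.query)   # e.g. {"city": ["Copenhagen"]}
        body = json.dumps(stupid_weather(params)).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

def serve(port=8080):
    """Blocks and serves requests until interrupted; call this to start."""
    HTTPServer(("", port), ModelHandler).serve_forever()
```

Calling `serve()` and requesting `/stupid_weather?city=Copenhagen` on the chosen port should return the same JSON whatever the parameter is; replacing the body of `stupid_weather` with a real model is the only change the boilerplate needs.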

00:45:41so if I can leave you with one

00:45:44conclusion it is basically think again

00:45:48about everything that you were ever

00:45:49taught every statistics class you had

00:45:52every applied machine learning class all

00:45:55of it

00:45:56rethink it reevaluate it be critical of

00:46:01whatever you were told because I

00:46:04can assure you that in most cases it was

00:46:05a flat-out lie and that lie didn’t

00:46:10happen because of the fact that people

00:46:11wanted to lie to you it’s based on

00:46:13ignorance and it’s based on you know

00:46:15decades of malpractice in this field

00:46:19because computation has caught up with

00:46:21us before it was ok to do what

00:46:24was done because we had no other choice

00:46:26today it’s no longer okay we have all the

00:46:29choices in the world it’s not hard

00:46:31getting a computational cluster with 200

00:46:34gigabytes of RAM and 64 CPUs or even

00:46:385000 GPUs those things are at our

00:46:41disposal we don’t need to take the same

00:46:44shortcuts as we did dangerous shortcuts

00:46:47no less so I hope you will think about

00:46:53that another thing is that whenever

00:46:58you’re solving a problem I would like

00:47:00you to think about that whatever problem

00:47:02you’re solving whatever machine learning

00:47:04application you’re writing it is an

00:47:06application of the scientific principle

00:47:09please stay true to that there’s a

00:47:12reason why we have it science is a way

00:47:15for us to not be biased science is a way

00:47:18for us to discover truths about the

00:47:20world that we live in this should not be

00:47:23ignored or taken lightly and that’s why

00:47:27you know crazy people like Trump can get

00:47:29away with saying that there is no such

00:47:30thing as global warming because he does

00:47:33not adhere to the scientific principle

00:47:35so you know you can either be Trump or

00:47:38you can stay true to the scientific

00:47:39principle and those two are the only

00:47:42extremes my friends so another thing

00:47:46that I want to say is always state your

00:47:49mind whatever you know about the problem

00:47:52I assure you that that knowledge is

00:47:54critical and important do not pretend

00:47:57and fall into this trap of I want to

00:48:00do unbiased research there’s no such

00:48:02thing no such thing understand this

00:48:05there is no bias free research there is

00:48:09no scientific result that can be

00:48:12achieved without assumption you are free

00:48:14to evaluate your assumptions again

00:48:17restate them that’s good

00:48:19that’s progress that is science but

00:48:23before you’re observing data state your

00:48:25mind and you have to because otherwise

00:48:28you’ve got nothing you got a result out but

00:48:31that was just picked out of thin air

00:48:33there’s nothing special about those

00:48:35coefficients that came out

00:48:36nothing at all and until people realize

00:48:40this we will still have applications

00:48:44that believe that Central Park is the

00:48:45red light and it is not even though it

00:48:49might look like it from a different

00:48:51scale we need to do better and we can

00:48:54do better and maybe the most important

00:48:57thing of all is that with this framework

00:49:00and with this principle of thinking you

00:49:03are able to be free you are able to be

00:49:05creative and most of all you are able to

00:49:08have so much more fun building your

00:49:10models because you are not forced into a

00:49:13paradigm that someone else defined for

00:49:15you because it made the math nice thanks

00:49:27I think we have time for one question

00:49:30somebody asked where can I read more

00:49:33about this any good resources yes there

00:49:38are a few great books that I can warmly

00:49:41recommend and I will do them in

00:49:42mathematical requirement order so if

00:49:45you’re a hardcore mathematician or a

00:49:47theoretical physicist or anyone with a

00:49:49computational background with a deep

00:49:50understanding of mathematics then you

00:49:52can go directly to read a book called

00:49:55the handbook of Markov chain Monte Carlo

00:49:57that is a very technical book and it

00:50:00describes the processes behind the

00:50:02probabilistic modeling if you are a

00:50:05little bit less mathematical but still

00:50:07have quite a bit of mathematics you

00:50:09should read the section about graphical

00:50:11models by Bishop in a book

00:50:15called Pattern Recognition and Machine

00:50:16Learning but the most important book

00:50:19of all perhaps to read is

00:50:22a book called Statistical Rethinking and

00:50:24that book explains a lot of the concepts

00:50:27that I’ve been badgering on about now that you

00:50:29know somewhere along the line we just

00:50:30got lost that has both text that you

00:50:35know is consumable by by people and it

00:50:37has a little bit of math so you can sort

00:50:39of put it in context those are really

00:50:42the books I would recommend in this okay

00:50:47thank you and I’ll tweet the resources

00:50:50to the GOTO hashtag #gotocph okay

00:50:55thank you Michael thank you

00:50:56[Applause]