00:00:08[Applause]
00:00:14 So, hi everyone. As you heard, my name is Michael Green, and I'm indeed here to tell you about a different approach to building algorithms and building machine learning methods. Really, I'm also going to argue that they are fundamentally the same thing, and you'll see that a little bit later in my talk. But let's get cracking. Basically, I'll give you an overview of AI and machine learning; I'm not the first one to do this, and lots of people have their take on it, but this will be my take. I'll also try to convey to you why this is not enough. We are very good at telling ourselves that we have come really far in AI, and I would actually tend to disagree with that: I think we're playing around in the paddling pool, and it's simply not good enough. We need to innovate in this area; we need to be better. I will also talk about how perception versus inference can work in a computer, and I'll make a short note about our Bayesian brains, because that's fundamentally how we reason as people, at least from a macroscopic perspective. I'll also talk a little bit about probabilistic programming, and why I see that as a key point for marrying two fields that are very differentiated today. In the end I'll tie all of it together, so that you can see how you can actually, practically, deploy a solution like this.
00:01:51 But basically, if we just go back to basics: I know a lot of different definitions of artificial intelligence, there are a lot of them out there, and none of them says "the ability to drive a car while not crashing." That's simply not artificial intelligence; that is something that solves a domain-specific problem. Challenging, yes, but it's not AI. Neither is diagnosing a disease in a patient that comes into the ER; that's also not AI. Neither, actually, is what I do in my company; that's also not AI. All of those are examples of narrow AI, where we try to use machines to do more clever things than an individual person could do at the same task. My definition of AI is basically that it is the behavior shown by an agent that you put into an environment, where that behavior in itself seems to optimize the concept of future freedom. That is the closest definition of artificial intelligence that I can come to, because it doesn't say anything like "optimize the least-squares error" or "do backpropagation to make sure the cross-entropy error looks good." All of those things are man-made, and I assure you our brains do not do backpropagation. It's simply not true. No one is telling our children how to stand up; they're not getting smacked on the hands for failing. My son failed several times this morning, but he actually succeeded when I left the room, so without my encouragement he did better. That might say something about my pedagogical skills, or about the fact that he doesn't need my training to do these things. So there's a fundamental piece missing in our understanding of how knowledge is represented, accumulated, and acted upon, and that is what fascinates me more than anything.
00:03:50 I'm sure you've seen this slide before; it's just a breakdown of what AI is today. There are a lot of terms, but basically we are at the top level there: every single application you have ever seen or heard of today is in this field of artificial narrow intelligence. There is no such thing as artificial general intelligence. It doesn't exist today, and if someone says they have it, they're lying, because we don't have a representation of how to capture knowledge. No one has that. You simply cannot express this in Python or R or whatever language you want; it doesn't exist, and we need to figure out how to represent it. Artificial general intelligence is really the task of saying: how could we take an AI that knows how to drive a car, put it into a different environment, and make it use the skills it learned while driving and apply them to a completely different field? That is domain transfer, and that is something no AI can do today. Now, artificial superintelligence: the only reason I'm mentioning this is because it's really, really far away; the only thing super about it is how super far away in the future it is. There have been a lot of people battling about this. One of the famous guys, Elon Musk, is more of a doomsday kind of guy with respect to this, and he should be, because that gets money into his company. It's a very smart move to say that AI is going to destroy the world, "so I'm creating a start-up that's going to regulate that." Imagine how hard it was to raise money for that venture. There are other things to consider about superintelligence, and one is that it is conceptually possible. Sooner or later, if we do capture how to represent knowledge, how to transfer knowledge, how to accumulate knowledge, then there is no stopping us from deploying this into the world, and for all practical purposes (now sounding a lot like Musk) what we released at that point would basically be a god to us. The scary part is: will it be a nice god? Nobody knows. But then again, there is very little evidence in history that intelligence breeds violence. If anything, the world is a safer place than it has ever been, and I would like to see that as an evolution of our intelligence, an evolution of our compassion. I don't see intelligence being a prerequisite for murderous robots, so I'm not very afraid of that scenario. I know we won't be the smartest cookies in the world anymore, but maybe that's not so bad. That was always going to happen, and evolution will make sure of it no matter what.
00:06:40 But basically, the landscape looks like this. You have this term "artificial intelligence" that has become ubiquitous and describes everything from doing a linear regression in Excel, to a self-driving car, to identifying melanoma on a cell phone, and all of these things are not artificial intelligence. It has just become a buzzword, just like big data; I very much agree with the previous speakers about this. The way I see it, AI today is two things: there are perception machines and there are inference machines. And by inference I don't just mean forecasting or prediction; I mean real inference, where you can actually predict without having any data. On the perception side we've come a long way. Perception machines are everywhere: those are the machines that know how to drive a car, those are the machines that know how to identify the kites in the images we saw. All of those deep learning applications are basically perception machines. They can categorize something they get as input, either through visual stimuli or auditory stimuli, but they cannot make sense of it, and I'll show you examples of that. That's why I argue that we need more: we need to move into proper inference, where we actually have a causal understanding, a representation of the world we're living in, and only then can we talk about real intelligence. But we can get closer, and I'll show you how to do that. The biggest problems in data science today, which is just another term for applied artificial intelligence, start with the fact that data is actually not as ubiquitous and available as you might think.
00:08:20 For many interesting domains there is simply no data, and the data that does exist is exceedingly noisy. It might be a flat-out lie; it might be based on surveys, and we know that people lie in surveys. Structure is also a problem: how do you represent a concept in a mathematical structure, not necessarily in parameter space, but just structurally? How do you construct the layers in a neural network, for example? Then there is identifiability. What I mean by that is that for any given data set there are millions of models that fit that data set, and generalize from it, equally well, and many of them do not correspond to the physical reality we're living in. So there are statistical truths, parameter truths, and there are physical realities, and they are not the same thing. That's why my previous field, theoretical physics, is sometimes problematic: quantum theory, which I sort of specialized in, has many different interpretations, and nobody really knows what's going on, but we know we can calculate things with it. It makes sense in the math, but as soon as we push on the question "what's really happening?", well, we're basically screwed, because no one knows. A lot of people like to pretend that they know, and then there are some people, like the Copenhagen interpretation, who say "just shut up and do the math," which basically means "don't ask the question, because it cannot be answered." Hawking adheres to this school, by the way; he's also one of the guys who is super scared of superintelligence, funnily enough, because he's a clever cookie.
00:10:09 There is also the matter of priors. Every time you address a problem as a human, whatever problem I give you as an individual, you bring a lot of prior knowledge. You have half a life or a whole life, depending on how old you are, of knowledge that you've accumulated. That knowledge might have been transferred from another person who just told you about something, but you can apply it to the problem at hand; you can represent that knowledge in the domain of the problem you're trying to solve. And that is something we can actually mimic today through the concept of priors: a way of encoding an idea, or a piece of knowledge, as a statistical prior, a distribution that can be put on par with data. I'll show you later how to do that as well. The last part, but not the least important one, is uncertainty. I cannot stress enough how important uncertainty is for optimal decision-making. You basically cannot make optimal decisions without knowing what you don't know, and I will stress that point several times during this talk, during the remaining thirty-nine minutes of it; it's really great that I can actually see how little time I have left.
00:11:25 So I will now show you some equations, and it's not because I'm particularly fond of them, but they do help express ideas. At the top level, this is a compact way of describing any problem you might approach: a probability distribution over the data that you're fed. The x's and the y's are the things you want to be able to explain, and the thetas represent all of the different parameters of your model, the stuff you don't know; they can also be latent variables, concepts that you know exist but that you have no observational data for. All of that is the definition of a problem space. Now, what machine learning has traditionally done, ever since Fisher, is to look at this through a question that everybody knew was wrong. It basically asks: what is the probability distribution of the data that I got, pretending it is random, given a fixed hypothesis that I don't know and am actually searching for? So for most machine learning applications the problem became: which hypothesis could I generate that is most consistent with a data set that looks like my data set, but is not really my data set? You can ask whether that is a reasonable question, and I will tell you it is not; it is poppycock. That question is not worth asking. Why? Because you're basically just trying to find explanations to fit your truth, and that is not science, ladies and gentlemen. There is only one way to do science: you postulate an idea, and then you observe data to see whether you can verify that idea or discard it. You cannot look at a data set, generate the hypothesis that best explains it, and think that it somehow has a physical representation in this world, because it doesn't.
00:13:12 And that's why a lot of machine learning approaches and a lot of statistical approaches, after several years of hardcore science, have "discovered" things like this: that the biggest risk factor for dying from coronary artery disease is going to the hospital. That's just not true, and nobody stopped to ask why this happened. Is it because the researchers are brain-damaged? That could have been the reason, but it wasn't; it was the methodology. They were asking the wrong question, because if you ask that question, I can assure you that before you died at the hospital, you had to go there. So it makes perfect sense statistically, but it has no representation of the problem you're trying to solve. What you should have asked is: given that you're sick enough to go to the hospital, and given that you actually have something worth visiting the hospital for, now that is predictive of you being disposed to dying from coronary artery disease.
00:14:23 So how do we fix this? We fix it by doing what we should have been doing from the beginning, and this is not new. The formula down below asks a different question. What does it ask? It asks: what is the probability distribution of the parameters of my model, which I don't know, by the way, given that I have observed a data set that is real? It is not fake, it is not random; it is the data set that has actually been observed. What is the probability distribution of my parameters? Now that is an interesting question to ask, and that is a scientific question to ask. But what does it require? It requires you to state your mind. The prior term, the p of theta, says what you believe is true about your parameters before you see the data, and that is very, very important, ladies and gentlemen, because this is the difference between something great and something completely insane.
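The slide formulas themselves are not reproduced in the transcript, so here is a reconstruction, in standard notation, of the two questions being contrasted, together with the chain of simplifications he walks through next:

```latex
% The question classical machine learning asks: how probable is a data set
% like mine, pretending it is random, under a fixed hypothesis?
p(X \mid \theta)

% The question he argues we should ask: how probable are the parameters,
% given the data that was actually observed?
p(\theta \mid X) \;=\; \frac{p(X \mid \theta)\, p(\theta)}{p(X)}
             \;=\; \frac{p(X \mid \theta)\, p(\theta)}{\int p(X \mid \theta')\, p(\theta')\, \mathrm{d}\theta'}

% Dropping the evidence (a constant in theta) and assuming a flat prior
% recovers plain maximum likelihood: the shortcut he describes next.
\arg\max_{\theta}\, p(\theta \mid X)
  \;=\; \arg\max_{\theta}\, p(X \mid \theta)\, p(\theta)
  \;\longrightarrow\; \arg\max_{\theta}\, p(X \mid \theta) \quad \text{when } p(\theta) \text{ is flat}
```

Here p(X | theta) is the likelihood, p(theta) the prior he calls "stating your mind", and the denominator is the evidence, the "integral from hell" discussed in a moment.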
00:15:14 Now you might ask: okay, but why didn't we do this before? Because it couldn't be done; we simply didn't have the computational power. And it's not because of the term on the right-hand side there, nor the one on the left of the numerator; you can see that the left term in the numerator is exactly what machine learning is computing today. It's because of the term in the denominator: that is an integral from hell, and it cannot be solved. It looks at every single value of every single parameter you have and sums them all out, and that ends up in a scenario where we have to calculate far more things than there are atoms in the universe, and there are a lot of atoms in the universe, even just in the part we can see. That basically meant all of this was out of the question. So someone realized: hey, I don't need to calculate that; I don't care about the probabilities themselves. The point that is the maximum will be the same, because that other term is just a normalizing factor, a constant. Okay, good enough, we remove it, done deal. And then they said: but the prior, what if I don't know anything, what if I don't want to say anything, don't want to state my mind and put my knowledge into the problem? So that just becomes a uniform distribution from minus infinity to infinity, and voilà, the equation has been reduced to only the likelihood. But you made a lot of assumptions there, and people just forgot that those assumptions are not true. It also lands you in maximum likelihood, which is a horrible way of doing things, basically because you assume that everything is independent. You assume, even when you're doing time-series regression, that observation one is independent of observation two. That's like saying that last year I was not one year younger than I am today; of course I was, and that's important. All of those things that are temporally related are extremely important. And the reason I'm saying this today is that there is no need to cheat anymore. There's no need for these crazy statistical-only results; you can state your mind, you can do the inference, and all of it can be done with probabilistic programming. There are many frameworks for this today, including in Python, some of them built on top of TensorFlow, by the way, so there's really no excuse not to do this. And the best thing about it is that it's actually easier than adhering to classical statistics.
00:17:48 Because in the classical statistics you were taught tools: if you have two populations and they vary together, then you use this magical tool; if they are independent, then you use another magical tool. Nobody really understood why. Here it's the t-test, in this case a paired t-test, here it's the Wilcoxon, in this situation you should do a logistic regression, in this one a plain linear regression, in this one a support vector machine. They are all the same thing. They are not different; there are just different assumptions in the likelihood functions, different assumptions in your priors, and different assumptions in the structure of your model. That is all; there is no other difference. All of it comes back to probabilistic modeling, and if you learn how to make those assumptions explicit, then you have a modeling language without limitations. Then you don't have to memorize the difference between logistic regression and linear regression, because in this framework there is none; it is the same model with different likelihood assumptions. And that is perhaps the most important thing I'm going to say today, given that you think it's important: you cannot do science without assumptions. That is impossible. This is not my belief; it is just a hard fact. You cannot do science without assumptions, and don't rest your minds until you understand this. Without actually risking something, you can get no answers.
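As a minimal sketch of what "same model, different likelihood" means in a probabilistic programming language (PyMC3 here, one of the Python frameworks he mentions later; the function and variable names are made up for illustration):

```python
import pymc3 as pm

# X: an (n, k) design matrix; y: the observed outcomes (continuous or 0/1)
def regression(X, y, likelihood="normal"):
    with pm.Model() as model:
        # The structural part is identical in both cases: a linear predictor
        # with priors on its coefficients.
        beta = pm.Normal("beta", mu=0.0, sd=1.0, shape=X.shape[1])
        intercept = pm.Normal("intercept", mu=0.0, sd=1.0)
        mu = intercept + pm.math.dot(X, beta)

        if likelihood == "normal":          # "linear regression"
            sigma = pm.HalfNormal("sigma", sd=1.0)
            pm.Normal("y", mu=mu, sd=sigma, observed=y)
        else:                               # "logistic regression"
            pm.Bernoulli("y", p=pm.math.sigmoid(mu), observed=y)

        trace = pm.sample(1000, tune=1000)
    return trace
```

Only the observation model changes between the two named methods; the structure and the priors, the assumptions he insists on making explicit, stay the same.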
00:19:20 So let's have a look at neural networks. How many of you have taken a neural networks class in your day? Okay, then most of you have solved this problem. How many people have solved this exact problem before? Okay, a few. So basically this problem is highly nonlinear. It's a classification task: your job is to separate the blue dots from the red dots with some boundary, and you can see it's a spiral, so it's quite nasty, isn't it? Now, how many hidden nodes do you think a one-layer neural network needs to solve this: 10, 20, 50, 100? Let's see. Well, with ten hidden nodes I can learn to separate this; not great, but there is some signal there. With thirty hidden nodes you can do a lot better, not surprisingly, but it's still not good, because we know this problem can be solved exactly. With a hundred hidden nodes you have almost perfect classification, and if you look at the accuracy table you will see that the area under the curve is 100% with the hundred nodes, and this is on a test data set, mind you. Now, what is the problem with this? The problem is that it looks great, it looks amazing; I mean, your job is done, right?

Okay, so let's look at the decision surfaces that were generated by these networks. On the left-hand side you have the decision surface based on 10 hidden neurons, and on the right-hand side the one based on 100 hidden nodes. Do those decision surfaces look good to you? Does it look like they captured what you wanted them to capture? No, they did not. And this is exactly how neural networks work: they are over-parameterized, very flexible mathematical models that will do everything they can to minimize that sum of squares or cross-entropy error, and there is no penalization for finding statistical-only results. And what is the worst thing here? The worst thing is the regions in the outskirts that are colored red. That is a signal the neural network is sure exists; there was no data out there at all, but it "knows" that those regions belong to a particular class. Now, this might not be a problem if you're trying to classify whether it will rain extra much tomorrow. But what if you have a drone with one objective, kill insurgents and let civilians live, and it identifies a match in one of those outer regions, regions that were never part of the training set? That is a "truth" learned by a network in places the data never actually covered, and there is no penalization for it. The reason I'm saying this is not to tell you "don't use AI" or "don't use machine learning"; in fact I'm saying the opposite. What I want to say is: be responsible. Every time you deploy a machine learning algorithm you have to understand exactly what it does, because lack of understanding is the most dangerous thing that exists today. It doesn't take artificial superintelligence; all it takes is a screw-up by the engineer or scientist who built the network, and it can have dramatic consequences, especially today, in the time of self-driving cars and all of these things.
00:23:19 Here I will show you another example of why I think this is interesting. This is just a representation, and mind you, this is only a single-layer neural network, no super-deep structures, which would have even more parameters. I want to show you that this problem, represented in Cartesian coordinates, is what was being fed to the neural network, and what the network should have realized is that in polar coordinates it looks a lot simpler, doesn't it? Now I know that problem; I can separate it with just one hidden node. And this is my point: you can over-parameterize and throw a lot of data at things, but it pays to start thinking about the problem at hand, and to teach machines how to think, how to reason, how to look at data, instead of just number-crunching. This is also why, today, I'm not scared of artificial superintelligence: you could have solved this in half a second. Even if you don't have a degree in physics, you would realize that these are just two sine functions with increasing radius; it's not hard. But a neural network would never get this, nor would any other machine learning algorithm, by the way. Impossible, because they don't work that way; that's not their goal, and in a way we can't be angry at them for not solving it.
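As a rough illustration of how trivial the problem becomes in the right representation, here is a minimal sketch; the toy spiral data, the assumption that the radius grows by about one unit per radian, and the phase threshold are all my own choices for illustration, not anything shown in the talk:

```python
import numpy as np

# Generate a two-arm spiral like the one on the slide (a toy stand-in for it).
rng = np.random.default_rng(0)
n = 500
t = rng.uniform(0.5, 3.0 * np.pi, size=n)         # how far along the spiral each point is
arm = rng.integers(0, 2, size=n)                   # which arm each point belongs to
angle = t + arm * np.pi                            # the second arm is half a turn behind
x = t * np.cos(angle) + rng.normal(0, 0.1, n)      # Cartesian coordinates: what the network saw
y = t * np.sin(angle) + rng.normal(0, 0.1, n)

# The same data in polar coordinates: the spiral becomes a simple phase rule.
r = np.hypot(x, y)
theta = np.arctan2(y, x)
phase = (theta - r) % (2 * np.pi)                  # unwind the spiral (radius grows ~1 per radian here)
predicted_arm = ((phase > np.pi / 2) & (phase < 3 * np.pi / 2)).astype(int)

print("accuracy:", (predicted_arm == arm).mean())  # ~1.0 without fitting anything
```

One line of representation change does what a hundred hidden nodes were struggling to approximate.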
00:24:47 I just want to show you a take on probabilistic programming with this, and also explain what probabilistic programming is. It's basically an attempt to unify general-purpose programming, and by general-purpose I mean Turing-complete programs, which we all like because they can compute basically anything, and marry that with probabilistic modeling, which is what everyone should be doing. Everyone: whatever model you are creating, you are doing probabilistic modeling; you have just accepted a lot of assumptions that you didn't make yourself. And that is a realization that, even if you choose not to care about it, you have to know about. You have to know the assumptions behind the algorithms you're using. That's why, even though it's very tempting to fire up your favorite programming language and load scikit-learn or TensorFlow or MXNet or whatever framework you're using, it doesn't matter which, it's still important to understand the cost. You don't have to be an expert in the math behind it, that's not what I'm saying, but you have to understand conceptually what these methods do and, more importantly, what they don't do, because that makes all the difference.

So this is just to say that you could have written this model a lot more easily. This is also a breaking point of HTML5 presentations, by the way; this is actually supposed to be on the right-hand side, so thank you, Windows. Even so, that little piece of code up there is basically a probabilistic way of specifying the model that solves this exactly, and it can be expressed in a probabilistic programming language. The neural network I wrote to fit this took a lot more code, I can assure you.
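The slide with his model is not reproduced in the transcript, but a probabilistic program in this spirit might look roughly like the following sketch (PyMC3 syntax; the phase-shifted logistic parameterization in polar coordinates is my guess at the idea, not his actual code):

```python
import numpy as np
import pymc3 as pm

# r, theta, arm: the polar coordinates and labels from the sketch above
with pm.Model() as spiral_model:
    winding = pm.Normal("winding", mu=1.0, sd=1.0)      # how fast the spiral winds
    offset = pm.Normal("offset", mu=0.0, sd=np.pi)      # global phase offset
    sharpness = pm.HalfNormal("sharpness", sd=5.0)      # how crisp the boundary is

    # The two arms sit half a turn apart, so cos(theta - winding*r + offset)
    # flips sign between them; a sigmoid turns that into a class probability.
    logit = sharpness * pm.math.cos(theta - winding * r + offset)
    pm.Bernoulli("arm", p=pm.math.sigmoid(logit), observed=arm)

    trace = pm.sample(1000, tune=1000)
```

A handful of lines, each one an explicit, inspectable assumption about how the data came to be.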
00:26:31 So the take-home message here is that if you go back to basics and view things as what they are, probabilistic statements about data, about concepts, about what you're trying to model, then you gain a generative model. You gain an understanding of what is actually happening, and that also means you don't get crazy statistical-only solutions due to identifiability problems, and that is something we really have to get away from; identifiability will keep being a problem otherwise. I'm not going to talk about deep learning itself, I just wanted to show what it is, but I think you've had enough talks about that, so max pooling and all of that I'm pretty sure we can skip. What I do want to say, though, is that neural networks by default are degenerate. What I mean by that is that the energy landscape they move around in while they optimize contains multiple locations, multiple parameter settings, that minimize the error equally well and yet correspond to very different physical realities. So how is the neural network supposed to know which one is right? And this is not something we can design our way out of, because the whole idea of the neural network relies on this degeneracy; that is how problematic the optimization space is. I just want to visualize, with a simple network, why this happens. You can see these two networks describe exactly the same thing; they solve exactly the same problem, but the parameters are different. Looking at the path from x1 into the hidden layer, the weight going into hidden node 1 can be 5 and the weight going into hidden node 2 can be 4, or the other way around: if you swap the two hidden nodes and shift their weights around accordingly, you get exactly the same solution. This is one source of degeneracy, and there are many of them. Just imagine stacking a lot of layers on top of each other, with hundreds of neurons: how many permutations do you think you can reach? A lot is the answer. I didn't do the math, but trust me, it's a lot.
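A minimal sketch of that permutation symmetry, with a toy network and made-up random weights (not anything from the talk):

```python
import numpy as np

def mlp(x, W1, b1, W2, b2):
    h = np.tanh(W1 @ x + b1)        # hidden layer
    return W2 @ h + b2              # linear output

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 2)), rng.normal(size=4)
W2, b2 = rng.normal(size=(1, 4)), rng.normal(size=1)
x = rng.normal(size=2)

perm = np.array([2, 0, 3, 1])       # relabel the hidden units in any order
permuted = mlp(x, W1[perm], b1[perm], W2[:, perm], b2)

# Different parameter vectors, identical function: one source of degeneracy.
print(np.allclose(mlp(x, W1, b1, W2, b2), permuted))   # True
```

Every such relabeling is a different point in parameter space that computes exactly the same function, which is the degeneracy being described.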
00:28:46 In energy space, in one dimension, it looks like the plot on the left-hand side: you see two distinct points that are equivalent in the solution space, and you cannot differentiate between them. This is also why regularization is such a good idea in neural networks: it basically forces you into one of those attractors. In two-dimensional space you can see that it corresponds to the two attractors in this colorized plot, and if you then picture this in all the dimensions the neural network actually operates in, which is typically thousands of dimensions, you can imagine how many of those attractors there are, and how different their depths can be.
00:29:27 I want to drive my point home; if you missed it, I've tried to state it several times, but sometimes I'm clumsy in the way I say things, so I'm going to be very blunt. This is one of the best-performing neural networks as of 2015 or 2016, a version of LeNet that was trained to recognize digits, and it does that essentially perfectly. Like we said before, we're so far along in perception that we don't have to worry about not being able to do it; it's done, and it's much better than humans at recognizing these things. Okay, so let's put it to the test, shall we? Let's generate some random-noise images and ask it: what is this? In every single image here, the network is at least 99% sure that it's a 1, or a 2, all the way up to 9. For all four images under the 0, it is convinced, with 99% confidence, that this is a 0. Can you in any way understand why this is a zero? I can't, and neither can the network, because it was never penalized for finding structure that the data does not support. It has no obligation to stay true to any sort of physical reality, and this happens.
00:30:54 Now back to my point: what if it's not the number zero? What if it's recognizing the face of a known terrorist, with a kill-on-sight command attached? And this is just digits, ladies and gentlemen; imagine the complexity of faces. So this is the entry point to exactly how dangerous this technology is if you don't respect it. It's not about the machines being too intelligent; it's about us not being stupid. That is really important to remember. We have a responsibility to build applications that do not have this kind of confirmation bias in them, and that is something I hope all of you will think about when you go out and build the next awesome machine learning application, because I can't see any numbers in these images anywhere. If you want, you can read the paper by these authors; as I said, you'll get the slides afterwards. It's a very interesting paper: they basically tried everything they could to see how the network would generalize to things it hadn't seen before, in regions outside what it was trained on.
00:32:05 Another thing I want to say is that events are not temporally independent. Everything that you do today, everything you see, hear, perceive, and think about, is affected by what you saw yesterday, and it's the same with data. Data is not independent; you cannot assume that two data points are independent. That is a wild and crazy assumption that we have been allowed to make for far too long. This is just a small visualization from the domain I was working in, where we try to understand how a TV exposure affects people's purchasing behavior going into the future. Of course, if you see a TV commercial today, it might affect you to buy something far into the future, and it might affect no one to do anything today; those are temporal dependencies that also need to be taken into account.

If you think about causal dependencies, if you think about concepts, if you really think about the structure of things, then you end up with something that looks like a deep learning neural network, but where the structure is actually inherent to the problem at hand. You are forging connections between concepts, between variables, between parameters, that solve the problem at hand, but without that over-parameterization. This is a visualization of one of the models we run at Blackwood for one of our clients, and this is roughly the complexity you need to solve everyday problems. Every node you see here is a representation of a variable or a latent variable, and the relationships between them are the edges. And there's basically no point in this thing spinning; I just thought it looked cool, and it helped me raise money back in the day. Actually, I think the spinning was the differentiator, because in one of the pitches I did, it didn't spin, and we didn't get the money; then all of a sudden it was spinning, and we got the money. I don't know if that was the whole reason, but in my mind the spinning helped. There is, however, no visual improvement based on it. How many people have seen this before?
00:34:14 Okay, well, that's just no fun. But before I saw it the first time, interestingly enough, I had not seen it. The problem here is that you're supposed to judge whether the squares A and B are of the same hue or not, and from my point of view they look extremely different. The problem is that they're not; they are actually the same. The reason a lot of people think they are different is that we are predicting based on the shadow being cast by a light source whose position we know, because we have recognized this pattern earlier in our lives. That is also a kind of confirmation bias, but it's a good one, because it's what allows us to live our lives, and sometimes we're wrong, like in these contrived images. But it does prove a point: our brains are very biased by what we already know, and we make predictions based on what we know.
00:35:16 So, basically, what is probabilistic programming? It allows us to specify any kind of model we want. You don't have to think about layers, you don't have to think about pooling, you don't have to think about all the wiring. All you have to think about is specifying how variables might relate to each other, which parameters might be there, and how they relate to the variables at hand. If you have that freedom, there is nothing you cannot model. The catch is that you cannot fit such models with maximum likelihood. You can't, because you can't assume independent observations, and you can't assume that everything is uniform; well, you can, but it's not very smart. You can't assume that any given parameter can take any value from minus infinity to plus infinity; in general that just makes no sense. Just think about predicting house prices, for example. If you allow your model to predict something negative, it might still make sense in statistical space, because there's no reason you shouldn't be able to mirror things and just look at the positive part, but what about the part of your model that says negative sale prices are also possible? That is just nonsense, and these are things you shouldn't allow. That's why you should specify your priors, and the concepts of your models, very rigorously.
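As a small sketch of what "stating your mind" can look like for that house-price case, here is a hedged PyMC3 example; the variable names, and the particular choice of a half-normal prior with a lognormal likelihood, are mine, and the only point is that the model assigns zero probability to negative prices:

```python
import pymc3 as pm

# area: house sizes in square meters; price: observed sale prices (both hypothetical arrays)
with pm.Model() as house_model:
    # Priors that state your mind: a price per square meter cannot be negative.
    price_per_m2 = pm.HalfNormal("price_per_m2", sd=5000.0)
    noise = pm.HalfNormal("noise", sd=0.5)

    # A lognormal likelihood puts zero probability mass on negative prices.
    mu = pm.math.log(price_per_m2 * area)
    pm.Lognormal("price", mu=mu, sd=noise, observed=price)

    trace = pm.sample(1000, tune=1000)
```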
00:36:41 And the best thing about probabilistic programming is that we no longer have to be experts in Markov chain Monte Carlo. Before, you had to be; today you don't. You don't have to understand what a Hamiltonian is in this space, you don't have to understand quantum mechanics, you just have to learn a probabilistic programming language, which is very easy, by the way, super easy, if you know Python or R or Julia or C++ or C or Java. Learning a probabilistic programming language is a walk in the park, and it's still Turing complete, mind you. There are a lot of different things we get out of this. We can get full Bayesian inference with Markov chain Monte Carlo, through algorithms such as Hamiltonian Monte Carlo and the No-U-Turn Sampler; that is what you really want to do. The problem is that, still today, it takes some time. There is another emerging tool called automatic differentiation variational inference, which is just a lot of words for turning the inference problem into a maximization problem, and they have gotten somewhere with that, which makes these inference machines a lot easier to fit. The best thing is that the underlying math libraries already do the automatic differentiation, so you don't have to express that yourself either. Again, all you have to do is learn a probabilistic programming language, or learn a framework in Python that supports it, like Edward for example; there are many other frameworks that do the same thing.
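For concreteness, here is roughly what those two inference routes look like in PyMC3, reusing any model such as the spiral sketch above (a hedged illustration of the general pattern, not his code):

```python
import pymc3 as pm

with spiral_model:
    # Full Bayesian inference with Hamiltonian Monte Carlo / the No-U-Turn Sampler.
    trace_nuts = pm.sample(1000, tune=1000)

    # Automatic differentiation variational inference: inference as optimization,
    # usually much faster at the cost of an approximate posterior.
    approx = pm.fit(n=20000, method="advi")
    trace_advi = approx.sample(1000)
```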
00:38:12 A note about uncertainty now. What if I gave you a task: your task right now is to take one million American dollars and invest them in either a radio campaign or a TV campaign. I'm going to tell you that the average performance of each channel has been 0.5, so the return on investment of an average radio campaign has been 0.5, and the return on investment of an average TV campaign has also been 0.5. Now my question to you is: how would you invest? Does it even matter? Well, based on this information I would say: just split it 50/50, I mean, why not, they have the same performance, right? But what if I also told you to look at the ROI as a distribution: if you look over all the different radio campaigns that have been run and all the different TV campaigns that have been run, if you look beyond the average at the individual results, what do you have then? Well, then you see that both radio and TV have historically had campaigns with a return on investment of 0, which basically means they didn't work. That could be some of the commercials you see on TV that are less than good; sometimes you see naked gnomes running across a grass field trying to sell cell phone subscriptions, and you're somehow supposed to see the connection, but that didn't work. I'm sure; well, I didn't quantify it, but it didn't work on me. Then I'm going to tell you that the maximum radio and TV performance ever observed is that radio has, in its history, had a return on investment of 9.3, while TV has only had 1.4. How would you invest now? Would you still split it fifty-fifty? I wouldn't.
00:40:07 Now, what if I tell you that this is probably not the real answer either? In order to answer this question, you have to ask another question in return: what is the probability of me realizing a return on investment greater than, for example, 0.3? Let's just take that as what I want to achieve. Now we have specified what our question is, and then we can give it a probabilistic answer. The answer to this question is that it's about 40 percent probable for radio to reach a return on investment above 0.3 in any given campaign, but about 90 percent for TV. How does that go hand in hand with the fact that radio has outperformed TV historically at the maximum, and that they have the same average? Well, it's because things are distributions. Things are distributions, and they are not Gaussian. This is the source of failure of pretty much every statistical method you have probably tried before, because it assumes that everything is symmetric and Gaussian. Nature makes no such promise. It has never said "thou shalt not use Cauchy"; never has that been part of any commandment or information given to us by nature. There is nothing special about the Gaussian distribution. Well, there are a few things special about it, but let's just ignore the central limit theorem for now, because we don't have enough data to approach that regime anyway. The point here is that the distribution for radio looks like this, and the distribution for TV looks like the one below, and here you can see: they have the same average, very different minima and maxima, and very different skewness.
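To make the radio-versus-TV point concrete, here is a small Monte Carlo sketch with illustrative distributions (not his data), chosen only so that both channels have a mean ROI near 0.5 while the shapes, maxima, and P(ROI > 0.3) come out roughly like the numbers in the talk:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical ROI draws: radio is heavily right-skewed (rare huge wins, many duds),
# TV is tight and roughly symmetric around 0.5.
radio = rng.lognormal(mean=-1.6, sigma=1.4, size=100_000)
tv = rng.beta(a=8, b=8, size=100_000)

print("mean ROI:      radio %.2f   tv %.2f" % (radio.mean(), tv.mean()))
print("max ROI:       radio %.1f   tv %.1f" % (radio.max(), tv.max()))
print("P(ROI > 0.3):  radio %.2f   tv %.2f" % ((radio > 0.3).mean(), (tv > 0.3).mean()))
```

Same average, completely different answers to the question you actually care about.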
00:41:58 And this is why you cannot make optimal decisions without knowing what you don't know. You cannot make optimal decisions without knowing the uncertainty, even if you know the average performance. Average performance is such a huge culprit in bad science and bad inference; I cannot state this enough. That is also why you should never, ever treat the parameters of your model as if they were constants, because they are not. It's also not interesting to ask the question "how uncertain is my data about this fixed parameter?"; that is a nonsense question, not interesting. And that is why we have to go back to basics and do it right, because until we do, we will never get further.
00:42:46 So, if I can tie this all together: I created a way for you to start playing around with this. I made a Docker image, basically, called r-bayesian; R is the host language in it, but you can basically use whatever language you want, it doesn't really matter. What I want to show here is how easy it is to deploy a Docker container with a Bayesian inference engine that can model any problem known to man. There is nothing you cannot do with this framework, nothing. It is more general than anything you have ever tried, because it can simulate everything you have ever tried, and most of the things you have ever tried come from probability theory; this is just a pure application of probability theory. So this is a very easy way to just spin up that Docker container, and the best thing is that the functions you write in there are automatically converted to REST APIs exposed through the Docker service. So you have a REST-API-ready inference machine that is very much true to the scientific principle, with no limitations, and the only thing you have to pay for it is that you have to think twice. Now, for those of you who don't like R, I can make a version with Python or Julia or whatever; it's not about the language at all.
00:44:11 What I really want to convey is that modeling needs to be rebooted. We need to think again about how we define our models, how we specify our models, how we think about our models, how we relate to our models. We can never relate to our models without uncertainty; we will always fail. That's why I think playing around with this is a good way to learn more about these things. This is just an example of how you would actually use it. I wrote a very, very stupid container called "stupid weather", and it's stupid because it always gives you the same answer: no matter what you send in as a parameter, it always gives you something stupid. That is just to show you how you write a function; it's not supposed to convey any intelligence, it's just a placeholder, boilerplate code for you to plug your own algorithm into. But it shows neatly how this gets transformed into a REST API, and it's as simple as that: run the Docker container, and you have it.
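His image is R-based, with functions auto-exposed as REST endpoints; as a rough Python stand-in for the same idea (not his code), a minimal Flask app that you would then wrap in a standard Dockerfile could look like this:

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

# A placeholder endpoint in the spirit of his "stupid weather" function: it ignores
# its input and always returns the same answer, just to show the plumbing.
@app.route("/stupid-weather")
def stupid_weather():
    city = request.args.get("city", "anywhere")
    return jsonify({"city": city, "forecast": "sunny, as always"})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)
```

Swap the placeholder body for a call into your fitted probabilistic model and the container becomes the inference service he describes.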
00:45:12 So even if you're not a back-end developer or a full-stack developer, it's still easy to deploy and run your own solutions, and a Docker container can run anywhere in the cloud: it can run on Google, it can run on Amazon, and I think it can even run on Microsoft's cloud. Probably; I didn't try that, but I would assume they can run Docker containers.
00:46:41 So if I can leave you with one conclusion, it is basically this: think again about everything you were ever taught, every statistics class you had, every applied machine learning class, all of it. Rethink it, re-evaluate it, be critical of whatever you were told, because I can assure you that in most cases it was a flat-out lie. And that lie didn't happen because people wanted to lie to you; it's based on ignorance, and on decades of malpractice in this field, because computation has only now caught up with us. Before, it was okay to do what was done, because we had no other choice. Today it is no longer okay; we have all the choices in the world. It's not hard to get a computational cluster with 200 gigabytes of RAM and 64 CPUs, or even 5,000 GPUs. Those things are at our disposal. We don't need to take the same shortcuts we did, and they were dangerous shortcuts, no less. So I hope you will think about that.
00:47:53 Another thing: whenever you're solving a problem, I would like you to remember that whatever machine learning application you're writing, it is an application of the scientific principle. Please stay true to that. There's a reason why we have it: science is a way for us to not be biased, a way for us to discover truths about the world we live in. This should not be ignored or taken lightly, and that's why crazy people like Trump can get away with saying that there is no such thing as global warming: because he does not adhere to the scientific principle. So you can either be Trump, or you can stay true to the scientific principle, and those two are the only extremes, my friends.
00:48:46 So another thing I want to say is: always state your mind. Whatever you know about the problem, I assure you that knowledge is critical and important. Do not pretend, and do not fall into this trap of "I want to do unbiased research." There is no such thing, no such thing; understand this. There is no bias-free research; there is no scientific result that can be achieved without assumptions. You are free to evaluate your assumptions and restate them; that's good, that's progress, that is science. But before you observe data, state your mind, and you have to, because otherwise you've got nothing. You get a result out, but it was just picked out of thin air; there is nothing special about those coefficients, nothing at all. And until people realize this, we will still have applications that believe Central Park is a red light, and it is not, even though it might look like it at a different scale. We need to do better, and we can do better. And maybe the most important thing of all is that with this framework, and with this principle of thinking, you are able to be free, you are able to be creative, and most of all you are able to have so much more fun building your models, because you are not forced into a paradigm that someone else defined for you because it made the math nice. Thanks.

[Moderator] I think we have time for one question. Somebody asked: where can I read more about this, any good resources?

[Michael Green] Yes, there are a few great books I can recommend, and I will give them in order of mathematical requirements. If you're a hardcore mathematician or a theoretical physicist, or anyone with a computational background and a deep understanding of mathematics, you can go directly to a book called the Handbook of Markov Chain Monte Carlo. That is a very technical book, and it describes the machinery behind probabilistic modeling. If you are a little less mathematical, but still want quite a bit of mathematics, you should read the section about graphical models by Bishop, in the book called Pattern Recognition and Machine Learning. But perhaps the most important book of all to read is the one called Statistical Rethinking; that book explains a lot of the concepts I've been badgering on about, and how somewhere along the line we just got lost. It has both text that is consumable by people and a little bit of math, so you can put it in context. Those are really the books I would recommend for this.

[Moderator] Okay, thank you, and I'll tweet these resources to the GOTO conference hashtag. Okay, thank you, Michael, thank you.
00:50:56[Applause]