
GOTO 2017 • Cloud-Native Data Science • Phil Winder


[Music]

Thank you very much. This is a talk about applying cloud-native principles to data science. In 2016, Microsoft made a very gutsy move and released a new breed of chatbot into the public domain.

The company's website claimed that it had been built using relevant public data that had been modelled, cleaned and filtered. You may have heard of it: it was called Tay. The purpose of the bot was to respond to tweets in a humanistic manner. You could send it questions on Twitter using its handle, and it did a really good job of answering like, well, like a youth actually. I say a youth because I didn't understand a lot of the acronyms it used. When it was released, everything was going swimmingly and it worked remarkably well; it really did sound like a human, at least at first.

But when a big tech company like this releases a product like this, usually the first users of the service are engineers. Given that you're all engineers in the room: would you (a) test out the service, appreciate it for what it is, and ask it sensible questions, or (b) would you try to break it, sending it the most horrific things you could think of in order to try to force an answer out of it? Well, engineers are a sadistic bunch, and you can guess which option they chose.

The bot went from a mild-mannered, well-behaved chatbot to a sexist, racist, genocidal Nazi in about 24 hours. There's a collection of tweets you can see there, where it started off looking quite good, and we ended up with Hitler. If you end up with Hitler, you know it's gone wrong.

One of my favourite tweets was actually about a British comedian called Ricky Gervais. It had a decent-enough question: is Ricky Gervais an atheist? The response: Ricky Gervais learned totalitarianism from Adolf Hitler, the inventor of atheism. Now, for all I know about Hitler, I don't think that's his most famous trait, but I'll give it ten out of ten for, you know, imagination. And ultimately, the result of this wonderful experiment: 24 hours later it was dead and gone.

Although that's quite a hilarious story, I'm actually quite impressed with Microsoft. It was a very gutsy move to allow this to happen, and they managed to deliver something that was really quite impressive. But I think, and this is just speculation, that some of Microsoft's traditional organisational structure got in the way. If people had been able to spot these problems, and had been in a position to do something about them, they could have stopped it before something like this happened. And that's really what this talk is about today.

In normal life, tradition is a fantastic and important part of culture, a cultural meme. But in engineering it's actually a harbour of bad habits: if we stick to traditions, we tend to repeat the same mistakes.

I used to work in both the data science field and in software engineering, and what I used to do was go away, write my models, do my research, and then the only thing everybody else would see was this mass of code, which I would throw over to the software engineers and say: there you go, software engineers, I've finished my job, now it's your turn, you implement it. And obviously, most of the time that just didn't work; some of the time it partially worked, but it never worked as well as it should have done.

I actually spoke to a client the other day who was worried about a project his company had paid for. They had sent some data off to a research arm, and he was worried that the work these researchers were doing wasn't really applicable in real life. "A bit too academic" were the words he used, and what he meant was that the things they were coming up with weren't realistic or relevant to modern-day industrial software. So yes, tradition is a bit of a problem.

But traditionally, data scientists have worked towards a certain type of model. This one is called CRISP-DM, the Cross-Industry Standard Process for Data Mining, and it's the nearest thing we've got to a process in data science. You'll see that there are lots of loops in this process, and that's indicative of the fact that most data science is open-ended and continuous; it never really stops.

The problem is that pretty much all of these steps are very individual, and they don't scale very well. The first problem is the deployment phase. As I just said, when I was a data scientist I would throw my models over to the software engineers and then never see them again, and I suspect this is something like what happened at Microsoft: we take software engineers who have not been trained in data science, give them poorly documented, uninterpretable code, and expect them to understand it and implement it efficiently. It's never going to happen.

Then we start going through the other parts of the model. The first is data understanding. This is a major issue in data science, because the data is the most important part of the problem, and for the data understanding part we rely on domain experts to interpret the data. We had a great talk earlier on about working in the music domain: if you weren't there, the speaker was working with classical music, and a normal data scientist would know nothing about the terminology used in music. Over his years of doing this research he finally became a domain expert, but it took years and years of work. That's a good example of the problem.

Data preparation is another problem area, because it's often done by one person, and it's only done once. What happens is you get one person who really delves into the data, and they're the one who really understands it, but it's hard to reproduce, because only one person understands it, and the output is, again, just this amorphous blob of software which takes messy data and spits out good data.

Finally, modelling. This is a little bit easier to reason about as more and more people understand the types of models used and how to do modelling. But the issue with modelling is that, again, it's a very one-person process, because it usually involves trying lots of different things and basically picking the best one, and that trial and error is never recorded anywhere; the only output is the model, that is the answer. So if we want to scale this past one person, what usually happens is that the second person repeats all the same mistakes and actually ends up at a different result, because their biases and their preferred algorithms usually lead to a different model.

So that whole side of the model is kind of like a murky canal, like a mucky Amsterdam canal: the ships can go up and down it, but you wouldn't want to jump in and follow it.

Then, at the end of all that, we've got the operations side, the deployment side, and we've got the vast majority of the data science research phase that's not going well. Actually, the vast majority of projects fail because there's a lack of business understanding, either because the business doesn't understand the technical implications of what they're proposing, or because the tech people don't understand the business problem well enough. So: a whole host of problems.

What I'm going to do now is ignore the business side a little bit, because that is actually a separate problem in itself, and you're all tech guys and gals. So I'm just going to stick to three distinct phases: the research phase, which is the bit about understanding the data, massaging the data and producing the data and the model; the build phase, where we try to prove that what we're doing is correct; and then the actual deployment phase, the bit that we want to rush into production.

So, the research phase consists of the initial data science. That can be anything from performing experiments, gathering more data, data preparation, data cleaning, modelling, all that good stuff. It's called the research phase because it is a very scientific process, and the biggest problem with that is that it's inherently open-ended and therefore very high-risk. There is a high probability of failure at this point, because you might find that either you don't have the data to do the job properly, or you just can't do the job because it's intractable for some reason.

So, stepping back a bit. Believe it or not, Britain actually had a very rich motoring heritage. You might not think it these days, you might think of Germany or somewhere like that, but there's a manufacturing plant near Oxford which started in 1913. This is a picture from that same manufacturing plant in 1943; from about then until the 1970s it was owned by a company called British Leyland, and this is a picture of their manufacturing line building Cromwell tanks for World War II. By the time it got to the 70s it was building this little car you probably all recognise. But during the 70s things started going wrong, and the ultimate reason they went wrong was that there were better, cheaper alternatives available: other companies were investing in the automation of these lines in order to produce better-quality, cheaper products.

And when we talk about software engineering, engineering is just converting a process into code so that we can automate it; that's all it is. Today, I think that data science is actually the automation of the data. So we're starting to see a three-tier hierarchy here: we've got data science at the bottom, which is taking all the data and automating things based upon that data, feeding into the process.

So I think data science and software engineering actually make a very good fit; they go together very well, because the data and the science feed into the software, which then feeds into the value that you're trying to provide. This is a picture of the same manufacturing plant in 2013, a hundred years after it opened, and you can see that it's now far more automated; there are basically no humans there, and that allows this company to build better, more reliable cars. As you probably know, this company is now owned by BMW, so the great golden English company was eaten up by German manufacturers. Damn it.

Anyway, my point is that software engineers are actually in a really good position to push ourselves into data science, not the other way around, because we come armed with all of the things we've learned during this more traditional automation phase, and we can start applying them to data science. Because the fact is, at the moment, none of this happens in data science.

And this leads me to data science's dirty little secret. The secret is that the vast majority of your effort, time and engineering skill as a data scientist goes into the data: just messing around with the data, incessantly. We are fixing problems with the data, we are imputing missing values, we are removing invalid data, and so on and so on. And the vast majority of the final performance of the model comes from how much you can improve that data, not from the model.

We've had two great, fantastic talks this morning, all about deep learning, all about very sexy technologies, but that's a very Silicon Valley problem. No offence to the Silicon Valley guys, but for everybody else outside of Silicon Valley, we're still living in a world where it's the simple techniques that really make a difference. It doesn't have to be a complex model; simple things can go a long way.
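(As a rough illustration of the kind of simple data-fixing work described above, here is a minimal sketch in pandas; the column names and rules are hypothetical, not taken from any dataset in the talk.)

```python
import pandas as pd

# Hypothetical example of "simple techniques": impute missing values and
# drop obviously invalid rows. Column names and rules are made up.
def clean(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    df = df[df["price"] >= 0]                       # remove invalid data
    for col in df.select_dtypes("number").columns:
        df[col] = df[col].fillna(df[col].median())  # impute missing values
    return df

raw = pd.DataFrame({"age": [12, None, 15], "price": [40.0, 55.0, -1.0]})
print(clean(raw))
```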

One of the biggest issues with this process is that this discovery, this fixing of the data, this understanding of the data, is, as I said, only done by one person. So I think this is really a problem of visibility: there is very little visibility within data science. Generally only one person is working on a problem at a time, and it's very difficult, or at least very inefficient, to scale.

Over the years, software engineering has done a really great job of improving this, because we had exactly the same problems: we could distribute binaries quite effectively, but when it came to source code we went through decades of trying to improve the visibility and the resiliency of our code. Thankfully, data science is finally starting to get there, and two tools in particular have been very prolific. You probably all know the first one, Git, so I'm not going to talk about that.

But the second is the notebook. Some of you have probably come across it, but maybe not, so I'll just introduce it. This is Jupyter Notebooks, an evolution of IPython notebooks. The idea is that inside the notebook there is a series of cells, and each cell can be either Markdown or code. What this has done is single-handedly improve visibility from pretty much zero all the way to almost as good as it gets; I think it's actually probably better than software engineering in terms of visibility.

What it means is that when I first get the data and I'm doing my analysis, I can document everything I do, even the mistakes. I can write the code, I can write words if I need to, and whenever anybody else wants to repeat that process they can just come along and read it like a document. If they want to, they can come in and actually start playing with the code; they can start doing tests. If you think "I think your model's rubbish, I'm going to try another model", or "I'm going to try some different parameters", it's very easy for someone to just come in, change something and run it. So this is a very iterative, very visual way of doing data science, and it's made a huge impact.

And then, when you team it up with Git, I think we've got the holy grail: repeatability from Git, visibility from Jupyter notebooks. And this is something we take for granted: when we're looking at code, normal software code, we use GitHub and GitLab and so on, just the online viewers, to view code far more often than we think we do, and that alone is super helpful for visibility.

That's all very good, and a huge advancement for individual developers, but how do we scale it to multiple developers? We do that with another project from Jupyter called JupyterHub. It's quite a simple architecture, as you can see. The main parts comprise an HTTP proxy and the individual notebooks (the notebook part is the bit I've just explained, the Jupyter notebook), and then there are a couple of user-database pieces on the left-hand side to handle the multi-tenant instances.

But the most interesting thing is the spawner, because we can override the spawner and plug in a whole range of tools. We can plug in Docker and start spinning up Docker containers; we can start spinning up Mesos containers with an orchestrator; we could spin it up in some sort of cloud-based environment. It's incredibly useful. Possibly my favourite is that we can start Kubernetes jobs, starting pods with our own containers in them, and as software engineers we know that this provides a huge amount of flexibility. We can simply scale out when we need to: if you've got more developers working on a different problem, just add more pods; if you need bigger machines, just scale out the number of machines; and we can select our machines, whether we want GPUs or CPUs. It's all years ahead of data science.
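(To make the spawner idea concrete, here is a minimal sketch of a JupyterHub configuration that swaps in a container-based spawner; the image names are hypothetical and this is not the setup used in the talk.)

```python
# jupyterhub_config.py -- a minimal sketch, not the talk's actual setup.
# `c` is the configuration object provided by JupyterHub's config loader.

# Launch each user's notebook server in its own Docker container
# (requires the dockerspawner package).
c.JupyterHub.spawner_class = "dockerspawner.DockerSpawner"
c.DockerSpawner.image = "jupyter/scipy-notebook:latest"  # hypothetical image choice

# Or, on a Kubernetes cluster, launch each user as a pod instead
# (requires the jupyterhub-kubespawner package):
# c.JupyterHub.spawner_class = "kubespawner.KubeSpawner"
# c.KubeSpawner.image = "jupyter/scipy-notebook:latest"
```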

So: we've got the visibility, and we've now started to containerise the process. That's two core tenets of cloud native, containers and visibility. Now let's build on that and move on to the build phase.

We touched on this at the start, but in the past, and still today, data scientists, and I mean that as a very general term, not necessarily people with PhDs working with high-level tools and deep learning, just normal people working with normal data, come up with an idea, maybe a simple model, and throw it over to the software engineers. This is completely analogous to where software engineering was about ten years ago: software engineers would take their binary, their software, and throw it over to the ops guys, and we combated that with the idea of DevOps.

So I think there's an equivalent shift that needs to happen with data scientists: they need to become more integrated with the software engineers, and ultimately more integrated with the ops people as well; DataOps, if you will. If we don't have that, at best we get inefficient models, but at worst, and what's more likely to happen, things don't happen at all. And if you don't get that transition right, you just end up with poor products.

And talking of AI and robots, I love that video, the lipstick robot by Simone Giertz. Hilarious. Ah, you're all boring; I find it funny, so I'm going to laugh.

So how do we improve this? Well, as we saw from the DevOps transition, from dev and ops, much of the problem is actually a people problem. It's about getting people to accept that their role is changing, and that it's changing for the better, for the benefit of everyone, and that's okay. But that's a bit boring.

What we can do technically is start to enforce quality, and we enforce quality with, surprise surprise, continuous integration and continuous delivery. This is a classic continuous delivery pipeline; I'm sure you all know it: the engineer commits their code, it goes into a build server, it runs through a pipeline and gets deployed into production. Now, the pipeline is possibly the most important part of this entire process, and it needs to be customised to your domain and your problem.

I always like talking about the testing triangle. This is quite common in the CI literature. If you haven't seen it before, it's an image where on the x-axis we've got the number of tests and on the y-axis we've got the scope, or depth, of the tests. At the bottom we've got unit tests, where we have very large numbers of tests, each testing very small bits of code, all the way up to the top, where we have very few tests, acceptance tests, but each one is testing a huge amount of code.
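(As a rough sketch of what the two ends of that triangle might look like for a data pipeline, using pytest; impute_mean and run_pipeline are hypothetical stand-ins, not code from the talk.)

```python
# A sketch of the testing triangle for a data pipeline, using pytest.
# `impute_mean` and `run_pipeline` are hypothetical stand-ins; in a real
# project they would be imported from your own package.
import statistics

def impute_mean(values):
    """Replace None entries with the mean of the known values."""
    known = [v for v in values if v is not None]
    mean = statistics.mean(known)
    return [mean if v is None else v for v in values]

def run_pipeline(raw_rows):
    """Placeholder for the whole clean -> train -> predict flow."""
    return ["some-prediction"]

# Bottom of the triangle: many small, fast unit tests.
def test_impute_mean_fills_gaps():
    assert impute_mean([1.0, None, 3.0]) == [1.0, 2.0, 3.0]

# Top of the triangle: a few broad acceptance tests over the whole flow.
def test_pipeline_produces_predictions():
    assert run_pipeline(raw_rows=[]) != []
```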

That testing process is possibly the most important part of the build phase. If you don't test your models, you end up with something like this. This is my colleague: he was trying to book a flight from Amsterdam to Prague, I think, and Kayak kindly recommended him the flight you can see there, which was a direct, raw output of one of their recommendation models. So this guy obviously couldn't book a flight.

They lost his revenue, they lost his money, and I dread to imagine how many other people were using the site at the same time and received the same thing; they must have lost a lot of money. If anything is a clear indication that the data science people need to be more integrated into the operations of their actual software, it's this, because they're the only ones who know how best to implement monitoring and how to fix it if it goes wrong.

Then we get on to the deploy phase. This is a bit more difficult to talk about because it's more domain-specific, very tech-stack specific; it depends what technology stack you're using. But I can generalise it a little by talking about containers, because ultimately the goals are exactly the same: we want our software to be reactive, resilient and reproducible. We want it to be reactive so that when there are changes in the outside world we can scale up and scale down accordingly; we want it to be resilient so that if it ever fails it automatically repairs itself; and we want it to be reproducible, so that we can quickly reproduce our cluster in another location, for testing or whatever, which improves testability.

And all of that is represented by this tiny little arrow in the build pipeline. Even in continuous delivery this is often overlooked; it's always represented by a little arrow, as if it's this simple thing where you just push to production, flowers, smiley faces, done. It's never like that. It's a bit more difficult and far more specific, and there's a lot of engineering effort spent trying to push this out to production.

In data science land, one of the easiest things we can do is bring in containers again. How do you do that? Well, you have some sort of model, and you can quite easily stuff that into a container, and if you've just got interfaces and routers then they're all pretty standardised. Once you've got to that point, it becomes much easier not only to make sure it runs on your machine and works the same way in production, but also for other people to reason about, because you're reducing the domain that people have to understand in order to use your service.
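(As a minimal sketch of what "a model behind a standard interface, stuffed into a container" could look like; the route, port and placeholder model are assumptions for illustration, not the talk's actual code.)

```python
# A minimal sketch: a model exposed behind a small HTTP interface, ready to
# be baked into a container image. The model is a placeholder; swap in
# whatever you actually have (scikit-learn, TensorFlow, ...).
from flask import Flask, jsonify, request

app = Flask(__name__)

def recommend(item):
    """Placeholder model: a real one would load its data/weights at startup."""
    return ["example-a", "example-b"]

@app.route("/recommendations")
def recommendations():
    item = request.args.get("item", "")
    return jsonify({"input": item, "recommendations": recommend(item)})

if __name__ == "__main__":
    # Inside the container this would typically sit behind a WSGI server.
    app.run(host="0.0.0.0", port=5000)
```

A Dockerfile would then just copy this in and run it, so the same artefact behaves the same way on a laptop and in production.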

And that model can be anything: it could be just a simple Python model, it could be Theano or TensorFlow or whatever. And if you're into more streaming technologies, we can easily apply those here as well: if we just package up whatever it is in your particular streaming framework, as a source or a Spark executor, it's still perfectly reasonable to do that.

And that fits really nicely into the testing triangle, because we can build the container as part of our delivery pipeline and start testing the container itself, as opposed to just testing the code. It's all fairly standard stuff that everybody aims for, but it's amazing how much it doesn't happen in real life in data science.
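(A rough sketch of what "testing the container rather than the code" could mean in practice; it assumes an image like the earlier sketch has been started locally, and the port, endpoint and whisky name are illustrative assumptions.)

```python
# Acceptance-style test that talks to the *running container*, not the source,
# e.g. after the pipeline has done: docker run -d -p 5000:5000 my-model:latest
# The endpoint and port carry over from the earlier sketch and are assumptions.
import requests

def test_container_returns_recommendations():
    resp = requests.get(
        "http://localhost:5000/recommendations", params={"item": "Macallan"}
    )
    assert resp.status_code == 200
    body = resp.json()
    assert body["recommendations"], "expected at least one recommendation"
    assert "Macallan" not in body["recommendations"]  # shouldn't recommend the input back
```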

Then, finally, we can simply push that container into production however you want, using some sort of orchestrator, or some sort of streaming-based system, selecting GPUs or CPUs. It's the ultimate in flexibility: if it works on your laptop, it works there; it doesn't matter where it runs.

And just to finally push this home, this one is from a slightly different domain, and I know there are a few ThoughtWorks people here today, so I've got to be a little bit careful. They're a great company, an amazing company, but their marketing department, I think, also needs to be integrated into production, because they sent out this email last week, and I would be really interested in finding out what ThoughtWorks' "seismic shits" are. I find that really fascinating. Actually, I think this was a genius move by the marketing department, because so many people were talking about it in the office; I think it's done far more for ThoughtWorks than anything else they could have sent out. So well done to the marketing person who made that.

Okay, so now I have a quick demo demonstrating all of these concepts together. I've tried to think of a simple example, and my example is a whisky shop. My business requirement is that I have a client which is a whisky shop (because I like whisky), and they have come to me because they want a USP: being able to recommend better whiskies than anybody else. The problem is they want this to scale; they can't really afford to employ whisky experts in every single one of their shops, so it's much more efficient to write an algorithm to do it for them. Their requirements are: they want somebody to pass a favourite whisky in and get recommendations out; they want to start off with a limited set of whiskies, but they want to be able to update their data and their model in the future.

It's all available in my Git repository; you can go and get it, it's all open source, and it's pretty simple. The algorithm I'm using for this is nothing fancy, and it uses the kind-of-famous standard whisky dataset. Just to cover it a little: it's a simple nearest-neighbour algorithm. All whiskies are characterised by a set of numbers, where each number corresponds to a particular feature of that whisky; the features might be smokiness, sweetness, toffee, things like that. What happens is that it calculates the distance between someone's chosen whisky and all of the other whiskies, and then we pick the top five or ten or whatever recommendations based upon that. Pretty simple, but it works remarkably effectively.
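(A minimal sketch of that nearest-neighbour idea; the flavour vectors below are invented for illustration rather than taken from the real dataset.)

```python
# A minimal sketch of the nearest-neighbour recommendation described above.
# The flavour vectors are invented; the real dataset scores each whisky on
# features like smokiness, sweetness and toffee.
import numpy as np

WHISKIES = {
    "Macallan":  [1, 3, 2],   # [smoky, sweet, toffee] -- made-up numbers
    "Laphroaig": [4, 1, 0],
    "Glenlivet": [1, 3, 1],
    "Talisker":  [3, 2, 1],
}

def recommend(favourite, top_n=2):
    names = list(WHISKIES)
    matrix = np.array([WHISKIES[n] for n in names], dtype=float)
    target = np.array(WHISKIES[favourite], dtype=float)
    distances = np.linalg.norm(matrix - target, axis=1)   # Euclidean distance
    ranked = [names[i] for i in np.argsort(distances)]
    # Drop the favourite itself (distance zero) so we don't recommend it back.
    return [n for n in ranked if n != favourite][:top_n]

print(recommend("Macallan"))  # ['Glenlivet', 'Talisker'] with these made-up numbers
```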

But the key thing here is that it's got a full continuous delivery pipeline, so all of those stages have been implemented, with unit tests, mock data and real data, and acceptance tests. I've used a Jupyter notebook for the initial analysis, and we're able to insert new data simply by committing it to Git and then watching it flow through the pipeline.

So hopefully this is going to play... it is, excellent. I've made a video here because, as you probably know, a lot of this takes a lot of time. So here I'm just messing around with Terraform, creating my new infrastructure for this project, and we're going at about ten times speed at the moment, blah blah blah; you're probably all used to seeing this. And the end result is a working server in the cloud with some initial software deployed. Oh dear me, can you see the bottom of that screen? Oh, you can; it's just this monitor, it's okay.

So what I've just done there is fix a bug, because it didn't work, and finally we've got our algorithm actually working. This is running out of the container, and when I curl the container I get my recommendations back. So it's a simple REST API: testing it, I passed in Macallan, that's my favourite whisky, and I get these recommendations back here. Awesome.

So, first job: as a software engineer, I've figured out that there's maybe a little bug in my code. You can see I've asked for Macallan there, and it's actually returned Macallan as one of the recommendations, which is a bit pointless. So that's my first bug, and I'm going to go and try to fix it. Now I'm inside the code and I'm just going to edit it: I'm basically going to ignore that first value when I output my recommendations. Then we write that back, push it to the repository, and there we go.

Then we're going to watch our pipeline. This is quite cool: we've got a pipeline here where all of the tests run in parallel; if those tests pass, we go into a registry step, which pushes the built container to a registry, and then we'll talk about the deploy in a little bit. All of those stages are just implemented with a simple YAML script.

But the beauty is that we're actually using realistic data to test this software, which isn't something that happens much in real life, I've noticed. What tends to happen is that the data scientist implements it manually and then tests it manually, and then the software engineers have some sort of dummy data they use in their tests, with some expected output, but it's a very small set; it's usually mock data, it's usually not realistic, and it's certainly not real. Then, at the end of the process, the data scientist comes to the software engineer and manually tests the software to see if it's okay. It's a hugely manual and very poorly managed process, so if we can stuff all of that into a pipeline, like we've done here: thumbs up.

Okay, so all our tests have passed, and it's now being pushed to the registry; once it gets pushed to the registry it will be deployed to the server. For this, all I've done, in a really hacky way, is SSH into the server and do a docker pull and docker run, which isn't great; if I had more time I'd probably deploy it to Kubernetes or something, but it works.

And it demonstrates it quite well. I'm just going to watch that container on the server now, and in a minute we'll see it; there you go, it's just been deleted and it's going to be recreated there. That's the deployment phase in action. If we now go back and actually test this new service, hopefully we should see a better output: I search for the same thing again, Macallan, and you can see we've removed Macallan from the first entry there. Fantastic.

And that's kind of a traditional software task; it's something that a software engineer would normally do, not a data scientist. So again, the focus here is to try to get the data scientist involved in the software engineering, or vice versa.

Now say we have a second data scientist, or another engineer, who says: you know what, I don't like some of your data, I'm going to change the model. So I'm going into the IPython notebook and looking at what the previous person has done, and now you're just going to see me hacking around trying to get something working. The idea here is that this is the process an engineer would normally go through when trying to implement a new model, or, in this case, trying to insert some new data. Lots of errors, lots of errors, finally figure out how to do it, yep, still not right, data's wrong, how do I do it... there we go. There were a few minutes there where I was on Stack Overflow; that's why the pause was. There we go, it's worked.

So I've generated some new data, I've pushed it to the repository, and now we're going through the build pipeline again. This is the same build pipeline, but with a change of data; we haven't changed the model this time. It's important to have the data almost as part of the model if you can (maybe the data is too big, but if you can, it's really useful to have it in there), because you can catch bugs like this: I think what we just saw there was some of the tests failing because the data had got into such a state that it wasn't giving the output it should have done.
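(One hedged sketch of what "data as part of the build" might look like: lightweight checks that run as ordinary tests in the pipeline, so a bad data drop fails the build before it reaches the model. The file path, column names and ranges are assumptions, not the demo's real schema.)

```python
# A sketch of keeping the data under test in the pipeline. Everything here
# (path, column names, valid ranges) is an assumption for illustration.
import pandas as pd

def test_whisky_data_is_sane():
    df = pd.read_csv("data/whiskies.csv")             # hypothetical path
    feature_cols = ["smoky", "sweet", "toffee"]

    assert {"name", *feature_cols}.issubset(df.columns)
    assert df["name"].is_unique                        # no duplicate whiskies
    assert df[feature_cols].notna().all().all()        # no missing scores
    in_range = (df[feature_cols] >= 0) & (df[feature_cols] <= 4)
    assert in_range.all().all()                        # scores within expected bounds
```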

Now, this time, instead of adding data, I'm going to remove some data. So I've removed a whisky and re-run it, and what I find now is that I've actually caused some of my unit tests to fail, by removing one of the whiskies that was in my unit test, so I'm just having to fix that. There you go: there's a commit saying "it's really working now, smiley face". And there we go: our tests are finally passing, and once again we go through the registry and we get to deploy it.

There it is... come on, do it, do it, do it... it'll get there eventually. And the result is the finally deployed model, there we go, with the new data, all without touching the model, but still going through the pipeline to guarantee that not only is our model valid, but the data makes sense, and that when we throw different data and new data at it, it still makes sense.

So you can imagine applying this to your own work: if you have requirements, like accuracy requirements, you could make that a hard-and-fast rule in your pipeline, failing the build when your model accuracy drops below a certain point.
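(For example, such a gate could be expressed as an ordinary test that the pipeline runs; train_model, evaluate and the 0.80 threshold below are all hypothetical.)

```python
# A sketch of an accuracy gate: the pipeline fails if the retrained model
# drops below an agreed threshold. `train_model` and `evaluate` are
# hypothetical stand-ins for your own training and evaluation code.
ACCURACY_THRESHOLD = 0.80  # made-up number, agreed with the business

def train_model(train_rows):
    return {"dummy": True}          # placeholder for real training

def evaluate(model, holdout_rows):
    return 0.85                     # placeholder: score on held-out data

def test_model_accuracy_does_not_regress():
    model = train_model(train_rows=[])
    accuracy = evaluate(model, holdout_rows=[])
    assert accuracy >= ACCURACY_THRESHOLD, (
        f"model accuracy {accuracy:.2f} fell below {ACCURACY_THRESHOLD:.2f}"
    )
```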

And that's it. I think that entire process was probably about an hour, and I've sped it up to about ten minutes, but that's probably just down to my poor software engineering more than anything. If you'd like to take a look, just go to the link, have a look at the slides, or come and see me; basically, search for Winder Research and you'll find it. And with that, I'd like to say thank you very much.

[Applause]