Press "Enter" to skip to content

GOTO 2015 • Scalable Data Science and Deep Learning with H2O • Arno Candel


00:00:05so yes I spent a lot of years in physics

00:00:08in high performance computing for

00:00:11particle physics on the largest

00:00:13supercomputers of the world at SLAC

00:00:15working together with CERN, that was my

00:00:18background and then i switched into

00:00:19machine learning startups I’ve been

00:00:21doing this for the last three and a half

00:00:22years or so last year I got nominated

00:00:26and called a Big Data All-Star by

00:00:28Fortune magazine so that was a nice

00:00:31surprise and you can follow me at @ArnoCandel

00:00:33here and if anybody would be

00:00:35willing to take a picture and tweet it

00:00:37to me that will be great thanks so much

00:00:40so yes, today we're going to introduce

00:00:42h2o and then talk about deep learning a

00:00:45little bit in more detail and then there

00:00:48will be a lot of live demos as much as

00:00:51time allows I will go through all these

00:00:53different things so we’ll look at

00:00:54different data sets different api’s and

00:00:57i’ll make sure that you have a good

00:01:00impression about what h2o can do for you

00:01:02and how it would look like and that you

00:01:06definitely get an idea of what we can do

00:01:08here so h2o is an in-memory machine

00:01:12learning platform it’s written in Java

00:01:14it’s open source it distributes across

00:01:18your cluster it sends the code around

00:01:20not the data so your data can stay on

00:01:22the cluster and you have a large data

00:01:26set right and then you want to build

00:01:28models on the entire data set you don’t

00:01:30want to down sample and lose accuracy

00:01:31that way but usually the problem is that

00:01:35the tools don’t allow you to scale to

00:01:37all the big data sets especially for

00:01:39building machine learning models we’re

00:01:42not just talking about summing up stuff

00:01:44for computing aggregates you’re talking

00:01:46about sophisticated models like gradient

00:01:48boosting machines or neural networks and

00:01:52h2o allows you to do this and you get the

00:01:54the scalability and the accuracy from

00:01:57this big data set at scale and as I

00:02:00mentioned earlier we have a lot of APIs

00:02:02that you’ll get to see today we also

00:02:05have a scoring engine which is kind of a

00:02:07key point of the product we are about 35

00:02:11people right now we had our first h2o

00:02:14world conference last year in the fall

00:02:17and it was a

00:02:18huge success and Sri Satish Ambati is

00:02:21our CEO, he has a great great

00:02:24mindset and culture, culture is

00:02:26everything to him so he likes to do meet

00:02:29ups every week even twice a week to get

00:02:32feedback from customers and so on so we

00:02:35are very much community driven even

00:02:37though we write most of the code at this

00:02:39point so you can see here the growth

00:02:43machine learning is really trending and

00:02:45we think it’s the next SQL and

00:02:48prediction is the next search there’s

00:02:50not just predictive analytics there’s

00:02:51also prescriptive analytics where you’re

00:02:53trying to not just say what’s going to

00:02:54happen tomorrow but you’re going to tell

00:02:56the customers what to do such that they

00:02:58can affect tomorrow so you can see the

00:03:00growth here lots and lots of companies

00:03:02are now using h2o and why is that well

00:03:06because it’s a distributed system built

00:03:10by the experts in house we have Cliff

00:03:12Click, he's our CTO, he wrote

00:03:15basically the Java compiler JIT, right, large

00:03:17parts of it in every cell phone of yours

00:03:19there’s parts of his code that are

00:03:21executed all the time so he architected

00:03:24the whole framework it’s a distributed

00:03:27memory key value store based on a

00:03:29non-blocking hash map it has a MapReduce

00:03:33paradigm built in, our own MapReduce

00:03:35which is fine-grained and makes sure that

00:03:38all the threads are working at all times

00:03:40if you’re processing your data and of

00:03:42course all the nodes are working in

00:03:44parallel as you’ll see you later and we

00:03:45also compress the data similar to the

00:03:47Parquet data format and so you can really

00:03:50store only the data you need and it’s

00:03:52much cheaper to decompress on the fly in

00:03:54the registers of the CPU than to send

00:03:56the numbers across the wire and once you

00:04:00have this framework in place you can

00:04:01write algorithms that are using this

00:04:02MapReduce paradigm and you can also do

00:04:05less than an algorithm you can just say

00:04:07compute aggregates for example it’s like

00:04:10a mini algorithm if you want.
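For instance, computing an aggregate over a distributed frame is a one-liner from the Python API (a rough sketch; the file path and column name here are just placeholders, not the ones from the talk):

```python
import h2o

h2o.init()                                   # connect to the running H2O cluster
frame = h2o.import_file("airlines.csv")      # illustrative path
print(frame["Distance"].mean())              # the mean is computed in parallel on the cluster
```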

00:04:12So you can do all these things and in the end you

00:04:14end up with a model that makes a

00:04:15prediction of the future right, as you do

00:04:17with machine learning and that code can

00:04:21then be exported then I’ll show you that

00:04:22in a minute and of course we can suck in

00:04:24data from pretty much anywhere and you

00:04:26can talk to it from R or Python via JSON, from a

00:04:31web browser

00:04:32I routinely check the status of my jobs

00:04:34from my cell phone for example so

00:04:40there’s a bunch of customers using us

00:04:41right now these that are referenceable

00:04:42at this point there’s a lot more that we

00:04:44can't talk about at this moment but you'll

00:04:48hear about them soon they’re basically

00:04:51doing big data right hundreds of

00:04:54gigabytes dozens of nodes and they’re

00:04:57processing data all the time and they

00:04:59have faster turnaround times they’re

00:05:01saving millions by

00:05:03deploying these models such as this

00:05:06fraud detection model which has saved

00:05:09PayPal millions in fraud so it's very

00:05:16easy to download you just go to h2o.ai

00:05:18and you can find the download button you

00:05:20download it, once it's downloaded you

00:05:22unzip that file and you go in there

00:05:25and type java -jar, right, that's it, h2o

00:05:29will be running on your system there’s

00:05:31no dependencies it’s just one single

00:05:32file that you need and you’re basically

00:05:35running and you can do the same thing on

00:05:36a cluster, you copy the file everywhere

00:05:38and you launch it, that would be a bare

00:05:40bones installation if you don't want to

00:05:42do bare bones you can do Hadoop you can

00:05:44do YARN, Spark, you can launch it from R

00:05:48and from Python as well so let’s do a

00:05:53quick demo here this is glm so i’m going

00:05:58to a cluster here this cluster has my

00:06:01name on it you got a dedicated cluster

00:06:04for this demo so let's see,

00:06:06this cluster is an eight-

00:06:10node cluster on ec2 it has I think 30

00:06:15gigabytes of heap per machine yep here

00:06:19and basically it’s just there waiting

00:06:22for me to tell it what to do so one

00:06:24thing I did earlier as I parse this

00:06:26Airlines data set I’m going to do this

00:06:28again the airlines data set has all the

00:06:30flights from 2007 all the way back to

00:06:32nineteen eighty seven and it’s parsing

00:06:36this right now and let’s go look at the

00:06:38cpu usage here you can see that all the

00:06:43nodes are active right now sucking in

00:06:44the data

00:06:46parsing it tokenizing it compressing it

00:06:48into these these reduced representations

00:06:51that are lossless of course so when we

00:06:53have numbers like 7, 19 and 120 then you

00:06:57know that that fits into one byte so you

00:06:58make a one-byte column right once you

00:07:00see that there are numbers with more

00:07:02dynamic range than just one byte then

00:07:04you take two bytes and so on you

00:07:06basically just store what you need.
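As a loose illustration of that per-column width selection (not H2O's actual code), picking the smallest integer type that covers a column's range might look like this:

```python
import numpy as np

def compress_column(values):
    # pick the narrowest integer width that covers the column's dynamic range
    lo, hi = min(values), max(values)
    for dtype in (np.int8, np.int16, np.int32, np.int64):
        info = np.iinfo(dtype)
        if info.min <= lo and hi <= info.max:
            return np.array(values, dtype=dtype)
    return np.array(values)  # fall back to the default width

col = compress_column([7, 19, 120])
print(col.dtype)  # int8 -- one byte per value
```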

00:07:09okay so now it parsed this file in 35

00:07:11seconds let’s go look at the file

00:07:13there’s a frame summary that I’m

00:07:15expecting it from the server and the

00:07:17server now returns this and says here

00:07:19160 million rows can you see this

00:07:24there’s 160 million rows 30 columns

00:07:27about 4 gigabytes compressed space you

00:07:30see all these different columns here

00:07:32they have like a summary a cardinality

00:07:34some of them are categorical here so in

00:07:36effect it's about 700-odd predictors in this

00:07:39data set and we’re trying to predict

00:07:40whether the plane is delayed or not

00:07:43based on its like departure origin and

00:07:47destination airport and so on so if i

00:07:50wanted to do this i will just click here

00:07:52build model i will say generalized

00:07:55linear model that’s one that is fast and

00:07:58the training frame is chosen here and i

00:08:01will now choose some columns to use i’ll

00:08:03first ignore all of them because there’s

00:08:05a lot of columns i don’t want to use and

00:08:07then i'll add year, month, day of the

00:08:11month, the day of the week, let's see we

00:08:13want to know the departure time maybe

00:08:16the carrier not the flight number that

00:08:19doesn’t mean much maybe the origin and

00:08:22destination and then all we really care

00:08:26about is whether it’s the late or not so

00:08:29that will be my response everything else

00:08:30you don’t need because it would give

00:08:32away the answer right so departure

00:08:34delay is what I'm going to try to

00:08:37predict and it’s a binomial problem so

00:08:39yes or no is the answer and now I just

00:08:42have to press go and it’s building this

00:08:46model as we as we speak and I can go to

00:08:49the water meter to see the cpu usage and

00:08:52you can see that all the nodes are busy

00:08:54computing this model right now

00:08:58and in a few seconds it will be done you

00:09:01see the objective value doesn’t change

00:09:03anymore yep so it’s done in 19 seconds

00:09:06and I can look at the model and I can

00:09:09see that we have an AUC of 0.65 it's a

00:09:14little more than point five right it’s

00:09:16not just random we have variable

00:09:19importances here we can see that certain

00:09:22airlines like Eastern Airlines has a

00:09:26negative correlation with the response

00:09:28which means if you take

00:09:30this carrier you’re not going to be

00:09:32delayed that’s because it didn’t have a

00:09:34schedule it was always on time by

00:09:37definition for example so this is like

00:09:38one bit that comes out of this model

00:09:40another thing is that Chicago and

00:09:42Atlanta are often delayed when you start

00:09:44there right when your journey starts

00:09:46there as you know or for example San

00:09:49Francisco if you want to fly to San

00:09:52Francisco there’s a lot of people who

00:09:54want to do that so that’s why it’s also

00:09:56often delayed and as I mentioned earlier

00:09:59the accuracy here flatlined after the

00:10:02first few iterations so the model could

00:10:04have been done even faster if you’re

00:10:06looking at the metrics here for example

00:10:08you can see that there’s a mean square

00:10:10error reported, an r-squared value reported,

00:10:12all this data science stuff, an AUC value

00:10:14of point 65 and so on and there’s even a

00:10:19POJO that we can look at you know what a

00:10:21POJO is a plain old java object it’s

00:10:24basically Java code that’s the scoring

00:10:26code that you can take into production

00:10:27that actually scores your flights in

00:10:30real time and you could say okay if

00:10:32you’re this airline and if you’re at

00:10:33this time of day then you’re going to

00:10:35have this probability to be delayed or

00:10:37not and this is the optimal threshold

00:10:40computed from the ROC curve that curve

00:10:43that you saw earlier that tells you

00:10:45where best to pick your

00:10:47operating regime to say delayed or not

00:10:49based on the false positives and

00:10:52true positives and so on that you're

00:10:53balancing right so all the standard data

00:10:55science it’s all baked in for you you

00:10:57get the answer right away so this was on

00:10:59160 million rows and we just did this

00:11:02live
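For readers following along outside the Flow UI, here is a rough sketch of the same kind of GLM in the current h2o Python API (the path, column names and API version are assumptions; the talk drove this from Flow on an older release):

```python
import h2o
from h2o.estimators.glm import H2OGeneralizedLinearEstimator

h2o.init()  # connect to a local or remote H2O cluster

# Illustrative path and column names, not the exact ones from the demo
airlines = h2o.import_file("hdfs://path/to/airlines_all.csv")
predictors = ["Year", "Month", "DayofMonth", "DayOfWeek",
              "CRSDepTime", "UniqueCarrier", "Origin", "Dest"]
response = "IsDepDelayed"

glm = H2OGeneralizedLinearEstimator(family="binomial")
glm.train(x=predictors, y=response, training_frame=airlines)

print(glm.auc())         # training AUC, roughly 0.65 in the talk
h2o.download_pojo(glm)   # prints the plain-old-Java-object scoring code
```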

00:11:06so as you saw the pojo scoring code

00:11:09there’s there’s more models that you can

00:11:11build in the Flow user interface, the GUI

00:11:16that you saw earlier, there's a Help

00:11:18button on the right side here to bring

00:11:21this back up there’s help I go down and

00:11:24I can see here packs so there’s a bunch

00:11:28of example packs that come with it so if

00:11:31I click on this here I’ll do this

00:11:33actually on my laptop now I’ll show you

00:11:39how to run this on a laptop so I just

00:11:40downloaded the the package from the

00:11:43website and it only contains two files

00:11:45one is an r package and one is the

00:11:48actual java jar file I’m going to start

00:11:51this on my laptop and I’m going to check

00:11:56the browser localhost at port five four

00:12:01three two one that’s our default port

00:12:03and now I’m connected to this java JVM

00:12:07that I just launched right and I can ask

00:12:10it this is a little too big now let’s

00:12:15make it smaller here we go I can look at

00:12:17the cluster status, yep, it's a one-node

00:12:18cluster, I gave it 8 gigs of heap you

00:12:23can see that and it’s all ready to go so

00:12:25now I’m going to launch this this flow

00:12:28from this example pack this million

00:12:31songs flow I’m going to load that

00:12:33notebook and you can see this is the

00:12:36million song binary classification demo

00:12:39we basically have data set with 500,000

00:12:41observations 90 numerical columns and

00:12:44we’re going to split that and store the

00:12:48next three well that’s done you already

00:12:50have those files ready for you so now we

00:12:52just have to parse them in here and I

00:12:54put them already on my laptop so I can

00:12:56just say download and import into the h2o

00:13:00cluster I'll take the non-zipped version

00:13:03because that’s faster so this this file

00:13:05is a few hundred megabytes it’s done in

00:13:07three seconds and this one here is the

00:13:11test set I’m also going to parse this

00:13:14and you can see that you can even

00:13:16specify the column types if you wanted

00:13:18to turn a number into an enum for

00:13:20classification you can do this here

00:13:22explicitly if you’re not happy with the

00:13:24default behavior of the parser but the

00:13:26parser is very robust and can

00:13:28usually handle that so if you have

00:13:29missing values if you have all kinds of

00:13:31categoricals ugly strings stuff that’s

00:13:33wrong we’ll handle it it’s very robust

00:13:36it’s really made for enterprise-grade

00:13:37datasets it’ll it’ll go through your

00:13:39dirty data and just spit something out

00:13:41that's usually pretty good.
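As a hedged aside, forcing a column type at parse time looks roughly like this in the Python API (the file and column names are placeholders):

```python
import h2o

h2o.init()
# Treat the first column as categorical (enum) instead of numeric,
# overriding the parser's default guess
test = h2o.import_file("milsongs-cls-test.csv.gz",
                       col_types={"C1": "enum"})
```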

00:13:45Okay so now we have these data sets and let's see

00:13:47what else we have here so let me go back

00:13:49out here to give you a view, you can click on

00:13:51outline on the right and you can see all

00:13:53these cells that I pre-populated here

00:13:55and one of them says build a random

00:13:57forest, one says build a gradient

00:13:59boosting machine one says build a linear

00:14:01model logistic regression and one says

00:14:03build a deep learning model right and I

00:14:05can just say okay fine let’s build one

00:14:07let’s say let’s go down to the GBM cell

00:14:09and say execute this cell now it’s

00:14:11building a gradient boosting machine on

00:14:13this data set you can see the progress

00:14:15bar here and while it's building I can

00:14:18say hey how do you look right now let me

00:14:20see how you’re doing so right now it’s

00:14:22already giving me to scoring history

00:14:25points where the error went down it’s

00:14:27already got an ROC curve with

00:14:29an AUC of something like 0.7

00:14:33I would hope, yes, 0.7 AUC

00:14:37already right in just seconds that’s

00:14:39pretty good for this data set if I do it

00:14:41again it’s already down here the error

00:14:45just keeps going down and you can keep

00:14:47looking at that model feature

00:14:48importances for which which variables

00:14:51matter the most all in real time and I

00:14:53can also look at the POJO again this

00:14:55time it’s a tree model not a logistic

00:14:57regression model so you would expect

00:14:59some decisions in this tree structure if

00:15:02I go down there’s all these classes that

00:15:08this is all like Java code I think the tree

00:15:11should be somewhere let me see I might

00:15:17have to refresh this model

00:15:28oh here we go so these are all the

00:15:30forests here you see that there’s a lot

00:15:32of forests that are being scored and now

00:15:34we just have to find this function

00:15:35somewhere down there and up here it is

00:15:39so here you can see that this is

00:15:41decision tree logic right if your data

00:15:44is less than 4,000 in this column and

00:15:47less than this endless and then in the

00:15:50end your prediction will be so and so

00:15:52much otherwise it will be this number so

00:15:54basically this is the scoring code of

00:15:57this model that you can put right into

00:15:59production in Storm or any other API

00:16:01that you want to use your own basically

00:16:04that’s just Java code without any

00:16:06dependencies and you can build the same

00:16:09thing with deep learning right you can

00:16:11build a deep learning model on the same

00:16:12data set at the same time that the other

00:16:14one is building, you can build a

00:16:16random forest model here also at the

00:16:18same time or a glm and this is all on my

00:16:24laptop right now so I’m building

00:16:26different models at the same time and I

00:16:28can ask hey what’s the status of them I

00:16:31can just go to the right here in the

00:16:32outline and click on my deep

00:16:35learning model oh it’s already done

00:16:37let’s see how well we’re doing here also

00:16:40a good auc right and feature importances

00:16:45and the scoring history and the metrics

00:16:48and you can even get a list of optimal

00:16:52metrics like what's the best precision i

00:16:53can get what’s the best accuracy i can

00:16:55get and then at what threshold so this

00:16:57is all geared towards the data scientist

00:17:00understanding what’s happening all right
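Sketched in the current h2o Python API, building a couple of those models side by side on a train/validation split might look like this (the file names, the response column and the default parameters are all assumptions, not the exact demo settings):

```python
import h2o
from h2o.estimators.gbm import H2OGradientBoostingEstimator
from h2o.estimators.deeplearning import H2ODeepLearningEstimator

h2o.init()

train = h2o.import_file("milsongs-cls-train.csv")   # illustrative paths
test  = h2o.import_file("milsongs-cls-test.csv")
response = train.columns[0]                          # assume the first column is the binary label
train[response] = train[response].asfactor()
test[response]  = test[response].asfactor()

gbm = H2OGradientBoostingEstimator()
gbm.train(y=response, training_frame=train, validation_frame=test)

dl = H2ODeepLearningEstimator()
dl.train(y=response, training_frame=train, validation_frame=test)

print("GBM AUC:", gbm.auc(valid=True))
print("DL  AUC:", dl.auc(valid=True))
```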

00:17:04so while my laptop is churning out some

00:17:07more models we can continue here and

00:17:09talk about deep learning in more detail

00:17:11so deep learning as you all know is

00:17:14basically just connected neurons right

00:17:17and it’s similar to logistic regression

00:17:19except that there’s more multiplications

00:17:22going on you take your feature times the

00:17:26weight you get a number and then you add

00:17:28it up and you do this for all these

00:17:31connections, each connection is a

00:17:33product of the weight times the input

00:17:36gives you some output and then you apply

00:17:38a nonlinear function like a tanh

00:17:40which is like a smooth

00:17:42step function and you do this again

00:17:45and again and again and at the end you

00:17:46have like a hierarchy of nonlinear

00:17:49transformations which will lead to very

00:17:51complex nonlinearities in your model so

00:17:53you can describe really weird stuff that

00:17:56you would otherwise not be able to with

00:17:57say a linear model or a simple random

00:17:59forest that doesn’t go as deep to to

00:18:02make up all these nonlinearities between

00:18:04all these features so this is basically

00:18:06the machinery you need for

00:18:08nonlinearities in your data set.
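As a concrete illustration of that weighted-sum-plus-nonlinearity step, here is a tiny sketch of one hidden layer in plain numpy (not H2O code; shapes and values are made up):

```python
import numpy as np

def hidden_layer(x, W, b):
    # each output neuron is tanh(sum_i weight_i * input_i + bias)
    return np.tanh(W @ x + b)

x = np.array([0.5, -1.2, 3.0])      # input features
W = np.random.randn(4, 3) * 0.1     # 4 neurons, 3 inputs each
b = np.zeros(4)
h = hidden_layer(x, W, b)           # stack more layers like this for depth
print(h)
```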

00:18:11And we do this in a distributed way again

00:18:13because we’re using the MapReduce we’re

00:18:15doing this again on all the threads

00:18:16right as you saw earlier for glm and

00:18:18everything was Green deep learning is

00:18:20also green it’s known to be green I

00:18:22usually burn up the whole cluster when I'm

00:18:25running my models and everybody else has

00:18:27to step back well of course there’s the

00:18:31Linux scheduler that takes care of that

00:18:33but still some claim it’s not

00:18:35necessarily fair if I’m running some big

00:18:37model so I haven’t done that lately and

00:18:39that's why I'm using these EC2

00:18:41clusters now or maybe my laptop from

00:18:43time to time but anyway you can see here

00:18:45we have a lot of little details built in

00:18:48right it works automatically on

00:18:49categorical data, it automatically

00:18:51standardizes your data you

00:18:54don't need to worry about that it

00:18:55automatically imputes missing values it

00:18:58automatically does regularization for

00:19:01you if you specify the option it does a

00:19:04checkpointing, load balancing, everything

00:19:06you just need to say go and that’s it so

00:19:08it should be like super easy for anyone

00:19:10to just run it and if you want to know

00:19:13how it works in the detail architecture

00:19:16here it’s basically just distributing

00:19:18the data set first, right, onto the

00:19:21whole cluster let’s say you have a

00:19:22terabyte of data and 10 nodes, every node

00:19:25will get 100 gigabytes of different data

00:19:27and then you’re saying okay I’ll make an

00:19:29initial deep learning model that’s a

00:19:31bunch of weights and bias values all

00:19:33just numbers and i’ll put that into some

00:19:36place in the store and then i spread

00:19:39that to all the nodes, all my 10 nodes

00:19:40get a copy of the same model and then i

00:19:44say train on your local data so then all

00:19:46the the models will get trained on their

00:19:49local data with multi-threading so there

00:19:53are some race conditions here that makes

00:19:54this not reproducible

00:19:55but in the end you will have n models in

00:19:58this case four, or on the cluster that

00:20:01I've just mentioned with 10, you will have 10

00:20:03such models that have each been built on a

00:20:06part of these hundred gigabytes that you

00:20:08have you don’t have to process all the

00:20:09hundred gigabytes you can just sample

00:20:11some of it right and then when you’re

00:20:14done with that you reduce it basically

00:20:16automatically it will get averaged back into

00:20:18one model and that one model is the one

00:20:20that you look at from your browser from

00:20:22our from Python and then you do this

00:20:25again and every pass is a fraction of

00:20:27the data that you’re passing through or

00:20:30all of the data or more than all of your

00:20:32data you can just keep iterating without

00:20:34communicating, you can tell each node to

00:20:36just run for six weeks and then

00:20:38communicate but by default it’s done in

00:20:41a way that you spend about two percent

00:20:43of your time communicating on the

00:20:44cluster and ninety-eight percent

00:20:45computing and this is all automatically

00:20:48done so you don’t need to worry about

00:20:49anything you just say go and it’ll

00:20:51basically process the data in parallel

00:20:53and make a good model.
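A very rough sketch of that train-locally-then-average loop, in plain Python rather than H2O's actual implementation (the node count, the data and the toy gradient are all made up):

```python
import numpy as np

def train_local(weights, local_data, lr=0.01):
    # one pass of toy stochastic gradient descent on one node's shard
    w = weights.copy()
    for x, y in local_data:
        grad = (w @ x - y) * x          # stand-in gradient, for illustration only
        w -= lr * grad
    return w

def distributed_pass(global_weights, shards):
    # every node starts from the same copy of the model ...
    local_models = [train_local(global_weights, shard) for shard in shards]
    # ... and the results are averaged back into one model
    return np.mean(local_models, axis=0)

rng = np.random.default_rng(0)
shards = [[(rng.normal(size=3), rng.normal()) for _ in range(100)]
          for _ in range(10)]           # 10 "nodes", 100 rows each
weights = np.zeros(3)
for _ in range(5):                      # repeat: train locally, then average
    weights = distributed_pass(weights, shards)
print(weights)
```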

00:20:55This averaging of models, this scheme works, there's a

00:20:58paper about it but I’m also working on a

00:21:00new scheme that's called consensus ADMM

00:21:04where you basically have penalty how far

00:21:07you drift from the average but you keep

00:21:09your local model and that keeps

00:21:11everybody kind of going on their own

00:21:13path in optimization land without

00:21:15averaging all the time you just you know

00:21:18that you’re drifting too far so you get

00:21:19pulled back a little but you still have

00:21:21your own model so this is going to be

00:21:23promising upgrade soon that you can look

00:21:25forward to already as it is it works

00:21:29fairly well so this is MNIST right

00:21:31the digits 0 to 9, handwritten digits, 784

00:21:35grayscale pixels you need to know which

00:21:37one is it right from the grayscale pixel

00:21:39values and with a couple of lines

00:21:41here in R you can get world class

00:21:44results, actually the actual world record, no one

00:21:46has published a better number in this

00:21:48without using convolutional layers or

00:21:51any other distortions this is purely on

00:21:54the 60,000 training samples no

00:21:56distortions no convolutions and you can

00:22:00see here all the other implementations

00:22:01Geoff Hinton's and Microsoft's, 0.83

00:22:05is the world record of course you could

00:22:07say the last digit is not quite

00:22:08statistically

00:22:09significant because you only have ten

00:22:10thousand test set points but still

00:22:13it’s good to get down there so now let’s

00:22:16do a little demo here this is anomaly

00:22:19detection I'll show you how we can

00:22:21detect the ugly digits in this MNIST

00:22:22data set on my laptop in a few seconds

00:22:25so I just have this instance up and

00:22:28running here from before so I’m going to

00:22:30go into R, in R I have this R unit test

00:22:37this runs every day right every time we

00:22:39commit something these tests are being

00:22:40run so you can definitely check those

00:22:42out from your github web page right now

00:22:45if you want but still this is saying

00:22:50build an autoencoder model which is

00:22:54learning what’s normal so it connects to

00:23:00my cluster right now it learns what’s

00:23:02normal what is a normal digit without

00:23:04knowing what the digits are, it just says

00:23:06look at all the data and learn what’s

00:23:08normal and how does it do that well it

00:23:11takes the 784 pixels it compresses them

00:23:14into in this case 50 neurons 50 numbers

00:23:17and then tries to make it back into 784

00:23:21so it’s learning the identity function

00:23:23of this data set in a compressed way

00:23:26right so if you can somehow represent

00:23:28the data with these 50 numbers and you

00:23:32know the weights connecting in and out

00:23:33then these 50 numbers they mean

00:23:35something that’s what it takes to

00:23:37represent those 10 digits let’s say

00:23:39that's roughly five numbers per digit

00:23:41and those five numbers are enough to say

00:23:43there’s an edge here as a round thing

00:23:44here as a hole here something like that

00:23:46like the features and with these 50

00:23:48numbers in the middle and of course the

00:23:50connectivity that make up the

00:23:53reconstruction and the basically the

00:23:55encoding and the decoding you can now

00:23:59say what’s normal or not so because now

00:24:02I’ll take the test set I let it go

00:24:03through this network and I see what

00:24:05comes out of the other side if it

00:24:07doesn’t look like the original input

00:24:08then it didn’t match my vision of what

00:24:10this should look like right so I’m going

00:24:12to let the test set go through this

00:24:14model first I need to train the model so

00:24:19right now it’s building this model on my

00:24:20laptop 50 hidden neurons

00:24:23tanh activation function and autoencoder

00:24:27is set to true and I had a couple of

00:24:29extra options but that’s just to say

00:24:31don’t drop any of the constant columns

00:24:33that are all zero because I want to plot it

00:24:36at the end okay so now let’s look at the

00:24:39outlier-ness of every point we just

00:24:42scored the test set and computed the

00:24:44reconstruction error so how how

00:24:46different is the outcome from the income

00:24:48how bad is my identity mapping that I

00:24:51learned for the test set points and for

00:24:53those points that are kind of ugly they

00:24:56won’t match to what’s normal in the

00:24:58training data right that’s an intuitive

00:25:00thing all right so now let’s plot the

00:25:05ones that match the best top 25 that’s

00:25:08the reconstruction and now let’s look at

00:25:12the actual ones well the same thing

00:25:14right they match the best so they have to

00:25:16look the same, these are the ones that

00:25:18are the easiest to learn to represent in

00:25:20your identity function just take the

00:25:23middle ones and say keep them basically

00:25:26now let’s look at the ones in the middle

00:25:28out of 10,000 that’s the the ones the

00:25:31median reconstruction error so these are

00:25:34still reasonably good you can tell that

00:25:36they’re digits but they’re already not

00:25:38quite as pretty anymore and now let’s

00:25:41look at the ugliest outliers so to speak

00:25:43in the test set so these are all digits

00:25:46that are coming out of my network but

00:25:50they’re not really like digit anymore

00:25:52right so something went wrong basically

00:25:55the reconstruction failed the model said

00:25:57these are ugly if you look at them they

00:25:58are kind of ugly some of them are almost

00:26:01not digits anymore right cut off or the

00:26:04top right one for example is ugly and

00:26:06you can tell that if you remember the

00:26:07bottom line like in the optics test the

00:26:09vision exam 6 40 35 right let’s go look

00:26:14at my slides totally different so every

00:26:18time I run it it’s different because its

00:26:21neural nets with multithreading I can

00:26:24turn it on to be reproducible but then i

00:26:27have to say use one thread, don't do any

00:26:29of this Hogwild race condition updates

00:26:32of the weight matrix by multiple threads

00:26:35at the same time just run one

00:26:37right through and give a seed and then

00:26:39just wait until that one thread is done

00:26:41and then it will be reproducible but in

00:26:43this case I chose not to do this because

00:26:45it's faster this way and the results are

00:26:47fine anyway every time you run it you’ll

00:26:49get something like this you will not get

00:26:51the ugly digits to be the good ones

00:26:53right so this shows you basically that

00:26:58this is a robust thing.
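A rough sketch of that anomaly-detection workflow in the current h2o Python API (the talk drove it from R, and the file names, epochs and exact options here are illustrative):

```python
import h2o
from h2o.estimators.deeplearning import H2OAutoEncoderEstimator

h2o.init()
train = h2o.import_file("mnist_train.csv")   # assume 784 pixel columns, no label
test  = h2o.import_file("mnist_test.csv")

ae = H2OAutoEncoderEstimator(
    activation="Tanh",
    hidden=[50],                  # compress 784 pixels into 50 numbers
    ignore_const_cols=False,      # keep all-zero pixels so we can plot them later
    epochs=1)
ae.train(x=train.columns, training_frame=train)

# per-row reconstruction error: the largest values are the "ugly" digits
errors = ae.anomaly(test)
print(errors.as_data_frame().describe())
```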

00:27:03And again, here is the network topology, so I can

00:27:03also go back to the browser now go to

00:27:06localhost and say here clean up

00:27:09everything by the way here this just ran

00:27:11all the model so if I say get models I

00:27:14should see all the models that were

00:27:16built, so the last four are the models

00:27:18that were built on the million song data

00:27:20set earlier and the top one is the one I

00:27:23built from R, the autoencoder, and you

00:27:26can see the auto encoder reconstruction

00:27:28error started at point zero eight mean

00:27:31square error and now it’s at point zero

00:27:33two so it got it down it improved from

00:27:36random noise, for autoencoders you

00:27:38always want to check this convergence it

00:27:41has to learn something right the

00:27:44identity mapping and you can also see

00:27:47here the status of the neuron layers the

00:27:49thing I showed you earlier and of course

00:27:51you can also get a POJO again here in

00:27:54this case it’s a neural net so you would

00:27:56expect some weights here and some here

00:27:59what is this oh that’s the neurons here

00:28:04we go I would expect the model to show

00:28:07up somewhere see there’s a lot of

00:28:10declarations you have to make to know

00:28:12all these 784 features so if this is too

00:28:15little for the preview then we have to

00:28:18look at the other model we have yeah

00:28:20let’s go back to get models and click on

00:28:21the other deep learning model be made

00:28:23earlier on the million song data set and

00:28:25look at its pojo that should be smaller

00:28:29because there were only 90 predictors

00:28:31okay here we go so now you should see

00:28:34that the deep learning math actually

00:28:35printed out in plain text so you can

00:28:38always check here activation something

00:28:43with numerical something with

00:28:45categoricals if you had any in this case

00:28:47there are none and then it will say

00:28:49weights, activation, biases and

00:28:51they will do this matrix vector

00:28:52multiplication, so a times x plus y, this is the

00:28:58matrix vector multiplication that’s

00:29:00inside of the deep learning model and

00:29:02you can see here we do some partial some

00:29:05tricks to be faster to basically allow

00:29:07the CPU to do more additions and

00:29:09multiplications at the same time so all

00:29:11of this is optimized for speed and this

00:29:14is as fast as any c++ implementation or

00:29:17anything because we don’t really have GC

00:29:20issues here all the arrays are allocated

00:29:22one time and then just process all right

00:29:27so now let’s get back to the bigger

00:29:29problems deep learning and higgs boson

00:29:32who has seen this data set before okay

00:29:35great so this is physics right, the 13

00:29:38billion dollar biggest-ever

00:29:39scientific experiment, this data set has

00:29:4210 million rows, they're detector events

00:29:45each detector event has 21 numbers

00:29:48coming out saying this is what I

00:29:49measured for certain things and then the

00:29:51physicists come up with seven more

00:29:53numbers that they compute from those 21

00:29:56something like square root of this

00:29:58squared minus that square or something

00:30:00and those formulas or formulae actually

00:30:05help and you can see this down there if

00:30:10you take just the low-level numbers this

00:30:11is the AUC you get so point 5 is random

00:30:14and one would be perfect and now it goes

00:30:16up by something like 10 basis points

00:30:18almost if you add those extra features

00:30:21so it’s very valuable to have physicists

00:30:23around to tell you like what to do right

00:30:25but CERN basically had this baseline

00:30:28here of 81 that was how good it was

00:30:31working for them, they used gradient

00:30:33boosted trees and neural networks with

00:30:36one hidden layer, so

00:30:39their baseline was 81 AUC and this paper

00:30:42came along last summer saying we can do

00:30:43better than that with deep learning and

00:30:46they publish some numbers and now we are

00:30:48going to run the same thing and see what

00:30:50we can do so I’m going back to my

00:30:53cluster, my EC2 8-node cluster, and I'll say

00:31:00get frames

00:31:03and I have the Higgs data set there

00:31:05already because I parsed it earlier you

00:31:07can see here 11 million rows and 29

00:31:13columns 2 gigabytes compressed is not

00:31:15much to compress because it’s all

00:31:16doubles and now I’m going to run a deep

00:31:20learning model so I already saved the

00:31:22flow for that so this flow says take the

00:31:30the split data set, I split it

00:31:32into ninety percent and five, five percent

00:31:34so ten million and half a million each

00:31:37take the training data and the

00:31:40validation data and tell me how you’re

00:31:41doing along the way so go and it builds

00:31:45a three layer Network and uses a

00:31:48rectifier activation everything else is

00:31:50default and now it's running.
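Roughly the same model in the current h2o Python API might look like this (the file path, response column name and layer sizes are assumptions; the talk configured it in Flow):

```python
import h2o
from h2o.estimators.deeplearning import H2ODeepLearningEstimator

h2o.init()
higgs = h2o.import_file("higgs.csv")                # illustrative path
higgs["response"] = higgs["response"].asfactor()    # assumed label column

# 90% train / 5% validation / 5% test, as in the talk
train, valid, test = higgs.split_frame(ratios=[0.90, 0.05], seed=1234)

dl = H2ODeepLearningEstimator(
    activation="Rectifier",
    hidden=[100, 100, 100],   # three small layers, roughly what the demo used
    epochs=10)
dl.train(y="response", training_frame=train, validation_frame=valid)
print(dl.auc(valid=True))
```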

00:31:55So let's go look at the Water Meter, ok here we

00:31:57go deep learning is taking over the

00:31:59cluster and now it’s communicating and

00:32:02now it’s sending that back out and then

00:32:05computing again, this might be the initial

00:32:07phase where it first has to

00:32:10rebalance the data set or something

00:32:11usually you’ll see it up down up down so

00:32:14let’s wait until the next communication

00:32:17but you’ll see that all the CPUs are

00:32:19busy updating weights with stochastic

00:32:20gradient descent which means it takes a

00:32:23point it trains goes through the network

00:32:27makes a prediction says how wrong it is

00:32:30and corrects the weights all the weights

00:32:32that are affected get fixed basically by

00:32:35every single point there’s no mini batch

00:32:37or anything every single point updates

00:32:39the whole model and that’s done by all

00:32:41the threads in parallel so you'll have

00:32:43eight threads in parallel changing those

00:32:45weights and I read, you write, I read, you

00:32:48write, whatever, we just compete but

00:32:50usually we write different weights right

00:32:52there's millions of weights so you don't

00:32:53need to overwrite too often but someone

00:32:56else is reading at the time or something

00:32:57so you can see here it’s mostly busy

00:32:59computing if you wanted to know what

00:33:04it’s exactly doing it can also click on

00:33:05the profiler here and it will show you a

00:33:07stack trace and sorted stack trace by

00:33:11count what’s happening so this was

00:33:13basically just communicating let’s do

00:33:15this again

00:33:17now it’s going to be slightly different

00:33:20oh I see so now it was saying these are

00:33:27basically idle because we have eight

00:33:29nodes and there are seven others and

00:33:30there's one for read and one for write

00:33:33so we got 14 threads actively listening

00:33:36for communication here f 289 are in the

00:33:40back propagation some of them are in the

00:33:43forward propagation so you can see all

00:33:45these exact things that are going on

00:33:48with at any moment in time for every node

00:33:50right you can go to a different node and

00:33:52you can see the same behavior so they’re

00:33:54all just busy computing so by this model

00:33:56is building we can ask how well is it

00:33:59doing, remember that 81 baseline with the

00:34:03human features let’s let’s see what we

00:34:08have here on the validation data set

00:34:09it’s already at 79 this already beat all

00:34:14the random forests and gradient boosted

00:34:16methods and neural nets methods that

00:34:19they had at CERN for many years so these

00:34:23models there on the left that had 75 76

00:34:27already beaten by this deep learning

00:34:29model we just ran and this wasn’t even a

00:34:32good model it was just a small like a

00:34:34hundred neurons each layer right so this

00:34:36is very powerful and by the time we

00:34:39finish we'll actually get to over 87

00:34:42AUC, that's what the paper reported

00:34:45they have an 88 they trained this for

00:34:47weeks on a GPU and of course they had

00:34:50only this data set and nothing else to

00:34:52worry about and this is a small data set

00:34:54still but you can see the power of deep

00:34:56learning right especially if you feed it

00:34:58more data and you give it more neurons

00:34:59it’ll train and learn everything it’s

00:35:01like a brain that’s trying to learn like

00:35:03a baby’s brain it’s just sucking up all

00:35:06the information and after 40 minutes

00:35:09you'll get an 84 AUC which is pretty

00:35:12impressive right it beats all the other

00:35:13baseline methods even with the human

00:35:15features and this is without using the

00:35:18human features you don’t need to know

00:35:19anything you just take the sensor data

00:35:20out of your machine and say go all right

00:35:26another use case was deep learning used

00:35:28for crime detection

00:35:30and this is actually Chicago who can

00:35:33recognize this pattern so my colleagues

00:35:35Alex and Michal, they wrote an article

00:35:39actually that you can read here on

00:35:41Datanami just a few days ago and they're

00:35:44using spark and h2o together to take

00:35:48three different data sets and turn them

00:35:50into something that you can use to

00:35:52predict whether a crime is going to be

00:35:56leading to an arrest or not so you take

00:35:59the crime data set you take the census

00:36:03data set to know something about the

00:36:04socioeconomic factors and you take the

00:36:06weather because the weather might have

00:36:08impact on what’s happening and you put

00:36:10them all together in spark first you

00:36:13parse them in h2o because we know that

00:36:15the parser works and it’s it’s fine so

00:36:17in our demo we just suck it all into

00:36:20h2o we send it over to spark in the same

00:36:24jvm and then we do an SQL join and

00:36:30once you’re done we split it again in

00:36:32h2o and then we build a deep learning

00:36:34model and for example GBM model i think

00:36:37these two are being built by the demo

00:36:38script that’s available so again both

00:36:43h2o and sparks memory is shared it’s the

00:36:50same jvm there's no Tachyon layer or

00:36:52anything they are basically able to

00:36:55transparently go from one side to the

00:36:57other

00:37:02and the product of course is called

00:37:04sparkling water which was a brilliant

00:37:07idea I think all right so this is the

00:37:12place and github where you would find

00:37:14this this example so you would download

00:37:17sparkling water from our download page

00:37:19and then you would go into that

00:37:21directory set to environment variables

00:37:24pointing to spark and saying how many

00:37:26nodes you want and then you would start

00:37:28the sparkling shell and then copy paste

00:37:31this code into it for example if you

00:37:33want to do it interactively so you can

00:37:37see here there’s a couple of imports you

00:37:39import deep learning in GBM and some

00:37:41spark stuff and then you basically

00:37:45connect to the h2o cluster we parse

00:37:48datasets this way this is just a

00:37:50function definition that gets used by

00:37:52these other functions that actually do

00:37:55the work to load the data and then you

00:37:56can drop some columns and do some simple

00:37:59munging in this case here we do some

00:38:01date manipulations to standardize the

00:38:03three datasets to have the same date

00:38:05format so that we can join on it later

00:38:07and you basically just take these three

00:38:10datasets these are just small for a demo

00:38:11but in reality they of course use the

00:38:13whole data set on a cluster and then

00:38:16once you have these three datasets in

00:38:19memory as h2o objects we just converted

00:38:21to a SchemaRDD with this call here and

00:38:24now they become Spark RDDs on which

00:38:28you can just call like a select

00:38:31statement in SQL and then some join and

00:38:33another join and all that it’s very nice

00:38:35right this is a nice well understood API

00:38:38the people can use and h2o does not have

00:38:41this at this point but we’re working on

00:38:43that so at some point we’ll have more

00:38:45munging capabilities but for now you

00:38:47can definitely benefit from the whole

00:38:49spark ecosystem to do what it’s good for

00:38:53so here in this case we say

00:38:56here's a crime weather data set that we,

00:38:58after it's split, I think, we

00:39:01bring it back into h2o, yes, this is an

00:39:03h2o helper function to split and now we

00:39:09have basically a joint data set that

00:39:12knows all about the socioeconomic

00:39:13factors and about the weather

00:39:14for a given time at a given place and

00:39:19then we can build a deep learning model

00:39:22just like you would do this in Java

00:39:24Scala is very similar right you don’t

00:39:26need to do much porting it’s just the

00:39:28same members that you’re setting and

00:39:30then you say run train model that gets

00:39:33basically and that that at the end you

00:39:35have a model available that you can use

00:39:37to make predictions and it’s very simple

00:39:39and you can definitely follow the

00:39:41tutorials in the interest of time I’ll

00:39:46just show you the sparkling shell start

00:39:49here I’m basically able to do this on my

00:39:52laptop as well while the other one is

00:39:54still running here you see spark is

00:39:57being launched and now it’s scheduling

00:39:59those three worker nodes to come up once

00:40:01it’s ready I can copy paste some code in

00:40:03there and the code I would get from the

00:40:08website here Chicago Crime demo it’s all

00:40:14on github

00:40:28so in the sparkling water github

00:40:31project under examples there are some

00:40:34scripts and so I can just take this

00:40:35stuff here and just copy paste it all

00:40:39oops I’m sure you believe me that this

00:40:43is all doable right so here spark is now

00:40:46ready and I just copy paste this in and

00:40:47here it goes so that’s how easy it is to

00:40:50do spark and h2o together and then also

00:40:57once you have something in your memory

00:40:58in the h2o cluster right the model

00:41:01for example or some data sets you can

00:41:03just ask flow to visualize it you can

00:41:05just type this this JavaScript or

00:41:07CoffeeScript rather expression and plot

00:41:09anything you want against anything and

00:41:11you’ll see these interactive plots but

00:41:14you can mouse-over and it will show you

00:41:15what it is and so on so it’s very cool

00:41:17you can plot for example the arrest rate

00:41:19versus the relative occurrence of an

00:41:21arrest for example gambling is always

00:41:24arrested why is that well because

00:41:27otherwise you wouldn’t know that the

00:41:28gambling person was cheating or

00:41:30something so you basically have to

00:41:32arrest them right otherwise you don't know

00:41:34what’s happening some of the things are

00:41:36undetected but the theft for example

00:41:39it’s not always arrested because someone

00:41:41knows that it was stolen without the

00:41:42person actually being caught so you have

00:41:44to be careful about all this data

00:41:46science stuff but basically can plot

00:41:48whatever you want against whatever you

00:41:49want and that’s pretty powerful and we

00:41:53have our data.table now in house so

00:41:56Matt Dowle joined us recently, he

00:41:58wrote the fastest data table

00:42:00processing engine in R and this is

00:42:04used for financial institutions that

00:42:06like to do aggregates a lot so just what

00:42:08you saw on the previous slide will soon

00:42:10have all this in h2o in a scalable

00:42:12way that we can do fast joins aggregates

00:42:15and so on and the same thing of course

00:42:18goes for Python you have ipython

00:42:20notebooks and there’s an example to do

00:42:23something for the Citi Bike company in

00:42:25New York City where you want to know how

00:42:27many bikes do you need for stations such

00:42:29that you don’t run out of bikes so let’s

00:42:31say you have 10 million rows of

00:42:34historical data and you have some weather

00:42:36data, you would imagine you can join

00:42:38those two and then basically based on

00:42:40location

00:42:41in time and weather you can predict how

00:42:44many bikes you’ll need right so if I

00:42:45know today it’s going to be or tomorrow

00:42:47is going to be that weather, I know I need

00:42:49250 bikes at that station or something

00:42:51and Cliff, our CTO, who wrote a JIT

00:42:55basically also wrote this data science

00:42:57example here so you can see there’s a

00:42:59group by the top from ipython notebooks

00:43:02and to show you that this is also

00:43:04possible live, here I do this, I'll type

00:43:07ipython notebook citibike small and up

00:43:10pops up my browser with ipython

00:43:13notebook I will delete all the output

00:43:16cells so we don’t cheat and I say go and

00:43:18now it’s connecting to the cluster that

00:43:21I started 30 minutes ago this means i

00:43:23still have a little bit of time left i

00:43:26will load some data here, there we go, and

00:43:30then let’s look at the data describe it

00:43:33you can see here some some mean max and

00:43:40so on whatever this is like a

00:43:42distribution of the chunk of the frame

00:43:44how many rows out of each machine in

00:43:46this case is only one machine oops

00:43:47there’s only one machine basically some

00:43:50statistics that tells you how is the

00:43:52data distributed across the cluster what

00:43:54kinds of columns do I have what is their

00:43:59mean max and so on all available from

00:44:01from Python then you can do a group by

00:44:05you don’t need to know all that but

00:44:07basically just you want to know like at

00:44:08what time of the day or what day, how

00:44:11many bikes are at each station and so on

00:44:12you can see that there’s a big

00:44:13distribution here that’s some some

00:44:16places only need 9 bikes, others basically

00:44:18a hundred bikes or even more and so on

00:44:21right and you can do quantiles you see

00:44:24the quantiles here from one percent all

00:44:26the way to ninety-nine percent and you

00:44:28see that there’s some pretty big numbers

00:44:31here you can make new features, say if it's

00:44:36the weekend and so on, you can build models so

00:44:41this is the fun part, we build a

00:44:42GBM we build a random forest we build a

00:44:44glm and we build a deep learning model

00:44:46all on the same data that was joined

00:44:48earlier and so now let’s say do this go

00:44:51so now it’s building a GBM

00:44:54all of my laptop so if I went to my

00:44:57laptop right now I could say get models

00:44:59and these models would just magically

00:45:00pop up and this is deep learning and now

00:45:06we can see how well they’re doing and

00:45:11you get the idea right so now we get a

00:45:1292 AUC by deep learning and a 93

00:45:15AUC by GBM, but deep learning even took a

00:45:17little less time than GBM so you could

00:45:19say that both are very powerful methods

00:45:21they beat the random forests and the

00:45:23linear models here but of course nothing

00:45:26beats the linear model in terms of time

00:45:28Oh point one second to get an 81 and you

00:45:30see it’s pretty remarkable it’s 50 times

00:45:33faster than a random forest all right so

00:45:37you believe me that iPython works as

00:45:38well, once you join the weather data with

00:45:41a simple merge command here in the

00:45:43middle somewhere then you get a little

00:45:46lift here because then you can even

00:45:48predict better whether you need bikes or not

00:45:49based on weather right, makes sense if it

00:45:51rains you might need fewer bikes.
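A loose sketch of that kind of munging in the h2o Python API, assuming hypothetical bikes and weather frames with a shared Date column (the real notebook has more steps and different column names):

```python
import h2o

h2o.init()
bikes   = h2o.import_file("citibike_small.csv")   # illustrative paths
weather = h2o.import_file("weather.csv")

# bikes per start station per day, like the group-by shown in the notebook
per_day = bikes.group_by(["start station name", "Date"]).count().get_frame()

# quantiles of the daily counts, 1% .. 99%
print(per_day["nrow"].quantile(prob=[0.01, 0.5, 0.99]))

# join the weather data on the shared Date column, then model on the result
joined = per_day.merge(weather)
```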

00:45:55So, anything you might wonder what to do

00:45:57with GBM linear models with deep

00:46:00learning there’s booklets for that and

00:46:02we’re currently rewriting them to the

00:46:04new version of h2o which will have

00:46:05slightly updated api’s and stuff for

00:46:08consistency across our Python Scala JSON

00:46:12and so on so it’s going to be very nice

00:46:13and rewritten everything from scratch a

00:46:16major effort but now we’re basically

00:46:18going to be ready for release I think

00:46:21this week actually so and another highlight is

00:46:27that we’re currently number one at this

00:46:29Kaggle challenge, Mark Landry, who just

00:46:33joined us, who has been on team h2o

00:46:36for a while, he was at the h2o world

00:46:38last fall he is actually going to work

00:46:40full-time almost half his time on Kaggle

00:46:42challenges using h2o so we’ll be excited

00:46:45to see this go across the finish line

00:46:48and they will share how we did this or

00:46:52rather he will share how he did it

00:46:54because so far mark did most of the work

00:46:57next week at h2o in Mountain View and

00:47:00they’ll be live-streamed as well so if

00:47:02you can make it be sure to listen in and

00:47:05these are some examples of other Kaggle

00:47:07applications

00:47:08we have demo scripts that are posted

00:47:09that are available and for example this

00:47:13one I had posted, maybe a

00:47:15month ago or so I posted this example

00:47:17GBM random parameter tuning logic where

00:47:23you basically just make ten models with

00:47:25random parameters and see which one is

00:47:26the best that sometimes useful

00:47:28especially if you have many dimensions

00:47:30to optimize over and we don’t have

00:47:33Bayesian optimization yet but this might

00:47:35be more efficient than just a brute

00:47:36force grid search because the machine

00:47:39gets luckier more than you tell it to be

00:47:41lucky if you want that’s why montecarlo

00:47:43integration works in four and higher

00:47:45dimensions the same thing is true with

00:47:47hyper parameter finding so don’t shy

00:47:50away from these random approaches

00:47:51they're very powerful.
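A minimal sketch of that random-search idea with the current h2o Python API (the parameter ranges, file name and number of models are arbitrary here; h2o also ships a proper H2OGridSearch with a RandomDiscrete strategy):

```python
import random
import h2o
from h2o.estimators.gbm import H2OGradientBoostingEstimator

h2o.init()
train = h2o.import_file("train.csv")   # illustrative path
response = "label"                     # illustrative response column
train[response] = train[response].asfactor()

best_auc, best_model = 0.0, None
for _ in range(10):                    # ten models with random parameters
    gbm = H2OGradientBoostingEstimator(
        ntrees=random.randint(20, 200),
        max_depth=random.randint(2, 10),
        learn_rate=random.uniform(0.01, 0.3))
    gbm.train(y=response, training_frame=train)
    auc = gbm.auc()                    # better: score on a validation frame
    if auc > best_auc:
        best_auc, best_model = auc, gbm
print(best_auc)
```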

00:47:55So this is the outlook, lots of stuff to do for data

00:47:57science now that they have this

00:47:58machinery in place that can scale to big

00:48:01data sets, customers are saying well how

00:48:03do i find the right parameters, yeah

00:48:05sure, automatic hyperparameter tuning

00:48:07is great, we'll do that for you, soon

00:48:09you’ll have ensembles like a framework

00:48:12that you can use in the GUI, to properly

00:48:15define what you want to blend together

00:48:17in what way, non-negative least squares to stack

00:48:21models of different kinds like a random

00:48:23forest and the GBM and so on all on the

00:48:25holdout sets and so on then we want to

00:48:28have convolutional layers for deep

00:48:29learning for example for people who want

00:48:31to do more image related stuff but all

00:48:35these things are on a to-do list right

00:48:37we have to prioritize those based on

00:48:39customer demand so that’s what our

00:48:40customers get to do the paying customers

00:48:42get to tell us basically what they want

00:48:44and they’ll take that into account

00:48:46natural language processing is high up

00:48:48there especially now that you have this

00:48:50framework we can characterize each

00:48:52string as an integer and then process

00:48:54all that fast and we have a new method

00:48:57called generalized low-rank model which

00:48:59comes right out of Stanford brand new it

00:49:02can do all these methods, PCA, SVD,

00:49:05k-means matrix factorization of course

00:49:08all this stuff fixing missing values for

00:49:11you based on like a Taylor expansion of

00:49:13your data set very powerful stuff can

00:49:16also be used for recommender systems and

00:49:18we have lots and lots of other JIRA

00:49:20tickets and

00:49:21stuff to work on so if you’re interested

00:49:23in joining the effort please do and I

00:49:26hope I left you with an impression of

00:49:28what you can do with h2o and what the

00:49:30state of the art is right now in machine

00:49:32learning on big data sets and thank you

00:49:34for your attention