So, yes, I spent a lot of years in physics, in high-performance computing for particle physics, on the largest supercomputers of the world at SLAC, working together with CERN. That was my background, and then I switched into machine learning startups; I've been doing this for the last three and a half years or so. Last year I got nominated and called a Big Data All-Star by Fortune magazine, so that was a nice surprise. You can follow me at @ArnoCandel, and if anybody would be willing to take a picture and tweet it to me, that would be great. Thanks so much.
So today we're going to introduce H2O and then talk about deep learning in a little more detail, and then there will be a lot of live demos, as much as time allows. I will go through all these different things: we'll look at different data sets and different APIs, and I'll make sure that you get a good impression of what H2O can do for you and what it looks like, so that you definitely get an idea of what we can do here.
So: H2O is an in-memory machine learning platform. It's written in Java, it's open source, and it distributes across your cluster. It sends the code around, not the data, so your data can stay on the cluster. Say you have a large data set and you want to build models on the entire data set; you don't want to down-sample and lose accuracy that way. But usually the problem is that the tools don't allow you to scale to these big data sets, especially for building machine learning models. We're not just talking about summing up stuff or computing aggregates; we're talking about sophisticated models like gradient boosting machines or neural networks, and H2O allows you to do this, so you get the scalability and the accuracy from this big data set, at scale. As I mentioned earlier, we have a lot of APIs that you'll get to see today. We also have a scoring engine, which is kind of a key point of the product. We are about 35 people right now. We had our first H2O World conference last year in the fall, and it was a huge success.
Sri Satish Ambati is our CEO; he has a great mindset, and culture is everything to him, so he likes to do meetups every week, even twice a week, to get feedback from customers and so on. We are very much community-driven, even though we write most of the code at this point. You can see here the growth: machine learning is really trending, and we think it's the next SQL, and prediction is the next search. There's not just predictive analytics; there's also prescriptive analytics, where you're trying not just to say what's going to happen tomorrow, but to tell the customers what to do so that they can affect tomorrow. So you can see the growth here; lots and lots of companies are now using H2O. And why is that?
Well, because it's a distributed system built by the experts in house. We have Cliff Click, our CTO; he wrote large parts of the Java JIT compiler, so in every cell phone of yours there are parts of his code being executed all the time. He architected the whole framework: it's a distributed in-memory key-value store based on a non-blocking hash map, and it has a MapReduce paradigm built in, our own MapReduce, which is fine-grained and makes sure that all the threads are working at all times when you're processing your data, and of course all the nodes are working in parallel, as you'll see later.
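The fine-grained MapReduce idea described here can be sketched in miniature. This is a toy Python illustration only (H2O's actual implementation is in Java and distributes over nodes as well as threads); the chunked column and the mean aggregate are made up for the example:

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce

# A column split into chunks, the way a distributed frame would store it.
chunks = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]

def map_chunk(chunk):
    # "map": each thread computes a partial aggregate over its own chunk
    return (sum(chunk), len(chunk))

def reduce_pair(a, b):
    # "reduce": partial results are combined pairwise
    return (a[0] + b[0], a[1] + b[1])

with ThreadPoolExecutor() as pool:
    partials = list(pool.map(map_chunk, chunks))

total, count = reduce(reduce_pair, partials)
print(total / count)  # mean of the whole column: 5.0
```

The same map/reduce pair also expresses the "mini-algorithm" aggregates mentioned later: anything that can be computed per chunk and merged pairwise fits this shape.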
We also compress the data, similar to the Parquet data format, so you really only store the data you need, and it's much cheaper to decompress on the fly in the registers of the CPU than to send the numbers across the wire.
Once you have this framework in place, you can write algorithms that use this MapReduce paradigm, and you can also do less than a full algorithm: you can just compute aggregates, for example, like a mini-algorithm if you want. So you can do all these things, and in the end you end up with a model that makes a prediction about the future, right? That's what you do with machine learning. And that code can then be exported; I'll show you that in a minute. Of course we can suck in data from pretty much anywhere, and you can talk to it from R and Python, via JSON, or from a web browser.
I routinely check the status of my jobs from my cell phone, for example. There's a bunch of customers using us right now; these are the ones that are referenceable at this point. There are a lot more that we can't talk about at this moment, but you'll hear about them soon. They're basically doing big data, right: hundreds of gigabytes, dozens of nodes, and they're processing data all the time. They have faster turnaround times, and they're saving millions by deploying these models, such as this fraud detection model, which has saved PayPal millions in fraud.
It's very easy to download: you just go to h2o.ai and you can find the download button. You download it, and once it's downloaded, you unzip that file, you go in there, and you type java -jar. That's it; H2O will be running on your system. There are no dependencies; it's just one single file that you need, and you're basically running. You can do the same thing on a cluster: you copy the file to every node and you launch it. That would be a bare-bones installation. If you don't want to do bare-bones, you can do Hadoop, you can do YARN or Spark, and you can launch it from R and from Python as well.
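The bare-bones install described here looks roughly like the following; the exact zip and jar names change per release, so treat these file names as placeholders and check h2o.ai for the current download:

```shell
# Single machine: unzip, run the jar, done -- no other dependencies.
# The UI then comes up on localhost:54321.
unzip h2o.zip && cd h2o-*
java -jar h2o.jar

# Bare-bones cluster: copy the same jar to every node, list the node
# addresses in a flatfile, and start the jar on each node.
java -Xmx30g -jar h2o.jar -flatfile flatfile.txt
```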
So let's do a quick demo here; this is GLM. I'm going to a cluster here; this cluster has my name on it, I got a dedicated cluster for this demo. Let's see what this says: this cluster is an eight-node cluster on EC2, and it has, I think, 30 gigabytes of heap per machine. Yep, here, and basically it's just sitting there, waiting for me to tell it what to do. One thing I did earlier is parse this airlines data set; I'm going to do this again. The airlines data set has all the flights from 2007 all the way back to 1987, and it's parsing this right now. Let's go look at the CPU usage here: you can see that all the nodes are active right now, sucking in the data.
They're parsing it, tokenizing it, compressing it into these reduced representations, which are lossless, of course. So when we have numbers like 7, 19, and 120, we know that fits into one byte, so you make a one-byte column, right? Once you see that there are numbers with more dynamic range than just one byte, you take two bytes, and so on; you basically just store what you need.
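The smallest-width idea behind that column compression can be sketched like this; a toy Python illustration of picking a byte width from a column's dynamic range, not H2O's actual compression scheme:

```python
def column_width_bytes(values):
    """Pick the smallest number of bytes that can represent the column's
    dynamic range (values stored as offsets from the column minimum)."""
    span = max(values) - min(values)  # dynamic range of the column
    for width in (1, 2, 4, 8):
        if span < 2 ** (8 * width):
            return width
    raise ValueError("range too large")

print(column_width_bytes([7, 19, 120]))  # fits in one byte -> 1
print(column_width_bytes([0, 70000]))    # too wide for two bytes -> 4
```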
Okay, so now it parsed this file in 35 seconds. Let's go look at the file.
There's a frame summary that I'm expecting from the server, and the server now returns it and says: 160 million rows. Can you see this? There are 160 million rows and 30 columns, about 4 gigabytes in compressed space. You see all these different columns here; they have a summary and a cardinality, and some of them are categorical, so in effect there are about 700 predictors in this data set. We're trying to predict whether the plane is delayed or not, based on things like its departure time, origin and destination airport, and so on.
So if I wanted to do this, I would just click here, Build Model, and I would say generalized linear model; that's one that is fast. The training frame is chosen here, and I will now choose some columns to use. I'll first ignore all of them, because there are a lot of columns I don't want to use, and then I'll add year, month, day of the month, and day of the week. Let's see: we want to know the departure time, maybe the carrier; not the flight number, that doesn't mean much; maybe the origin and destination. And then all we really care about is whether it's delayed or not, so that will be my response. Everything else you don't need, because it would give away the answer, right? So the departure delay is what I'm going to try to predict, and it's a binomial problem, so yes or no is the answer.
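A generalized linear model with a binomial response, as chosen here, is logistic regression. Here is a minimal sketch of that idea on made-up toy data, as plain-Python gradient descent; this is nothing like H2O's distributed solver, just the underlying model:

```python
import math

# Toy rows: (departure_hour, is_delayed) -- made-up numbers for illustration
data = [(6, 0), (7, 0), (9, 0), (15, 1), (18, 1), (21, 1)]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Fit weight and bias by gradient descent on the logistic loss;
# the hour is centered to keep the steps well-behaved.
w, b, lr = 0.0, 0.0, 0.1
for _ in range(2000):
    gw = gb = 0.0
    for hour, y in data:
        x = hour - 12.0
        p = sigmoid(w * x + b)
        gw += (p - y) * x
        gb += (p - y)
    w -= lr * gw / len(data)
    b -= lr * gb / len(data)

# Probability that an 8pm departure is delayed: near 1 for this toy data
print(sigmoid(w * (20 - 12.0) + b))
```

The yes/no response comes out as a probability, and the threshold that turns it into a yes or a no is exactly the ROC-curve business discussed a bit later.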
Now I just have to press Go, and it's building this model as we speak. I can go to the Water Meter to see the CPU usage, and you can see that all the nodes are busy computing this model right now, and in a few seconds it will be done; you see the objective value doesn't change anymore. Yep, so it's done in 19 seconds, and I can look at the model. I can see that we have an AUC of 0.65; it's a little more than 0.5, right, so it's not just random. We have variable importances here; we can see that certain airlines, like Eastern Airlines, have a negative correlation with the response, which means if you take this carrier, you're rarely going to be delayed. That's because it didn't have a schedule; it was always on time by definition, for example. So this is like one bit of insight that comes out of this model.
Another thing is that Chicago and Atlanta are often delayed when you start there, right, when your journey starts there, as you know. Or, for example, San Francisco: if you want to fly to San Francisco, there are a lot of people who want to do that, so that's why it's also often delayed. As I mentioned earlier, the accuracy here flatlined after the first few iterations, so the model could have been done even faster. If you're looking at the metrics here, for example, you can see that there's a mean squared error reported, an R-squared value reported, all this data science stuff, an AUC value of 0.65, and so on.
And there's even a POJO that we can look at. You know what a POJO is: a plain old Java object. It's basically Java code, the scoring code that you can take into production, that actually scores your flights in real time. You could say, okay, if you're this airline, and if you're at this time of day, then you're going to have this probability of being delayed or not. And this is the optimal threshold computed from the ROC curve, that curve you saw earlier, which tells you where best to pick your operating regime to say delayed or not, based on the false positives and true positives and so on that you're balancing, right? So all the data science is baked in for you; you get the answer right away. This was on 160 million rows, and we just did this live.
So, as you saw, there's the POJO scoring code, and there are more models that you can build in Flow, the web UI that you saw earlier. There's a Help button on the right side here; to bring this back up, there's Help, I go down, and I can see here: packs. So there's a bunch of example packs that come with it. If I click on this here, I'll do this actually on my laptop now; I'll show you how to run this on a laptop. I just downloaded the package from the website, and it only contains two files: one is an R package and one is the actual Java jar file. I'm going to start this on my laptop, and I'm going to check the browser at localhost, port 54321; that's our default port.
three two one that’s our default port and now I’m connected to this java JVM
and now I’m connected to this java JVM
and now I’m connected to this java JVM that I just launched right and I can ask
I can ask it (this is a little too big, let's make it smaller, here we go), I can look at the cluster status: it's a one-node cluster, I gave it 8 gigs of heap, you can see that, and it's all ready to go. So now I'm going to launch a flow from this example pack, this million songs flow. I'm going to load that notebook, and you can see this is the million-song binary classification demo. We basically have a data set with 500,000 observations and 90 numerical columns, and we're going to split that and store the pieces; well, that's already done, you already have those files ready, so now we just have to parse them in. I put them on my laptop already, so I can just say download and import into the H2O cluster. I'll take the non-zipped version because that's faster.
This file is a few hundred megabytes and it's done in three seconds, and this one here is the test set, which I'm also going to parse. You can see that you can even specify the column types: if you wanted to turn a number into an enum for classification, you can do that here explicitly if you're not happy with the default behavior of the parser. But the parser is very robust and can usually handle that. If you have missing values, if you have all kinds of categoricals, ugly strings, stuff that's wrong, we'll handle it. It's very robust; it's really made for enterprise-grade datasets. It'll go through your dirty data and just spit something out that's usually pretty good.
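As a rough illustration of what that kind of parser tolerance means, here is a toy Python sketch (not H2O's actual parser; the coercion rules and junk tokens here are made up): numbers stay numeric, unparseable entries become missing values, and mostly-text columns are kept as categorical levels.

```python
# Toy sketch of robust column parsing (illustrative only, not H2O's parser).
def coerce_column(values):
    parsed = []
    for v in values:
        v = v.strip()
        if v in ("", "NA", "?"):          # treat common junk as missing
            parsed.append(None)
            continue
        try:
            parsed.append(float(v))       # numeric if it parses
        except ValueError:
            parsed.append(v)              # otherwise keep the string for now
    numeric = [x for x in parsed if isinstance(x, float)]
    present = [x for x in parsed if x is not None]
    if len(numeric) >= 0.5 * len(present):
        # mostly numeric column: stray strings become missing values
        return [x if isinstance(x, float) else None for x in parsed]
    # otherwise treat the column as categorical (enum) levels
    return parsed

print(coerce_column(["1.5", "NA", "2", "oops"]))   # → [1.5, None, 2.0, None]
```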
Okay, so now we have these data sets; let's see what else we have here. Let me go back out: to get an overview, you can click on Outline on the right, and you can see all these cells that I pre-populated. One of them says build a random forest, one says build a gradient boosting machine, one says build a linear model (logistic regression), and one says build a deep learning model. And I can just say: okay, fine, let's build one. Let's go down to the GBM cell and execute it. Now it's building a gradient boosting machine on this data set, you can see the progress bar here, and while it's building I can say: hey, how do you look right now? Let me see how you're doing. Right now it's already giving me two scoring-history points where the error went down, and there's already an ROC curve with an AUC of something like 0.7, I would hope... yes, 0.78 AUC already, in just seconds. That's pretty good for this data set. If I ask again, it's already further along; the error keeps going down, and you can keep looking at that model, feature importances for which variables matter the most, all in real time.
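That steadily dropping error is the essence of boosting: each stage fits a small model to the residuals of the ensemble so far. A minimal sketch of the idea (illustrative Python, not H2O's GBM; the stump fitter and toy data are made up):

```python
# Sketch of gradient boosting on squared error with depth-1 "trees" (stumps).
def fit_stump(xs, residuals):
    """Best single-split constant predictor for the current residuals."""
    best = None
    for split in xs:
        left  = [r for x, r in zip(xs, residuals) if x <= split]
        right = [r for x, r in zip(xs, residuals) if x > split]
        if not left or not right:
            continue
        lmean, rmean = sum(left) / len(left), sum(right) / len(right)
        err = sum((r - (lmean if x <= split else rmean)) ** 2
                  for x, r in zip(xs, residuals))
        if best is None or err < best[0]:
            best = (err, split, lmean, rmean)
    _, s, lm, rm = best
    return lambda x: lm if x <= s else rm

def boost(xs, ys, n_stages=10, learn_rate=0.5):
    stumps, preds = [], [0.0] * len(xs)
    for _ in range(n_stages):
        residuals = [y - p for y, p in zip(ys, preds)]   # what is still wrong
        stump = fit_stump(xs, residuals)                 # fit the leftovers
        stumps.append(stump)
        preds = [p + learn_rate * stump(x) for p, x in zip(preds, xs)]
    return lambda x: sum(learn_rate * s(x) for s in stumps)

xs = [1, 2, 3, 4, 5, 6]
ys = [0, 0, 0, 1, 1, 1]
model = boost(xs, ys)
print([round(model(x), 2) for x in xs])   # close to [0, 0, 0, 1, 1, 1]
```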
And I can also look at the POJO. This time it's a tree model, not a logistic regression model, so you would expect some decisions in this tree structure. If I go down, there are all these classes, this is all Java code, and I think the tree should be somewhere; let me see, I might have to refresh this model. Oh, here we go: these are all the forests here, you can see there are a lot of forests being scored, and now we just have to find this function somewhere down there... and here it is. So here you can see that this is decision-tree logic: if your data is less than 4,000 in this column, and less than this, and less than that, then in the end your prediction will be so-and-so much; otherwise it will be this other number. Basically, this is the scoring code of this model, and you can put it right into production in Storm or any other API that you want to use; it's just Java code without any dependencies.
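For a sense of what such exported scoring code boils down to: the real POJO is generated Java, but structurally it is just nested threshold checks ending in leaf values, as in this hypothetical sketch (the column names, cut-offs, and leaf scores are invented, not from the real million-song model):

```python
# Hypothetical decision-tree scoring function in the style of generated
# model code: nothing but nested threshold comparisons, no dependencies.
def score(loudness, tempo, duration):
    if loudness < 4000.0:
        if tempo < 120.0:
            return 0.12   # leaf value (a probability-like score)
        return 0.47
    if duration < 250.0:
        return 0.63
    return 0.91

print(score(3500.0, 90.0, 300.0))   # → 0.12
```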
And you can build the same thing with deep learning: you can build a deep learning model on the same data set at the same time the other one is building, you can build a random forest model at the same time too, or a GLM, and this is all on my laptop right now. So I'm building different models at the same time, and I can ask: hey, what's their status? I can just go to the outline on the right and click on my deep learning model. Oh, it's already done; let's see how well we're doing. Here too, a good AUC, plus feature importances, the scoring history, and the metrics. You can even get a list of optimal metrics, like what's the best precision I can get, what's the best accuracy I can get, and at what threshold, so this is all geared towards the data scientist understanding what's happening.
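A sketch of what "best accuracy, and at what threshold" means: sweep a classification threshold over the model's scores and keep the one that maximizes the metric (toy scores and labels here, not actual H2O output):

```python
# Find the threshold on predicted scores that maximizes accuracy.
def best_accuracy(scores, labels):
    best_acc, best_thr = 0.0, 0.0
    for thr in sorted(set(scores)):
        preds = [1 if s >= thr else 0 for s in scores]
        acc = sum(p == y for p, y in zip(preds, labels)) / len(labels)
        if acc > best_acc:
            best_acc, best_thr = acc, thr
    return best_acc, best_thr

scores = [0.1, 0.4, 0.35, 0.8, 0.7, 0.2]   # model outputs
labels = [0,   0,   1,    1,   1,   0]     # true classes
print(best_accuracy(scores, labels))       # → (0.8333333333333334, 0.35)
```

The same sweep with precision, F1, or any other confusion-matrix metric gives the per-metric optimal thresholds shown in the model output.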
All right: while my laptop is churning out some more models, we can continue here and talk about deep learning in more detail. Deep learning, as you all know, is basically just connected neurons, and it's similar to logistic regression except that there are more multiplications going on. You take your feature times the weight, you get a number, and then you add it all up; each connection is a product of the weight times the input, that gives you some output, and then you apply a nonlinear function like a tanh, something like a smooth step function. You do this again and again and again, and at the end you have a hierarchy of nonlinear transformations, which leads to very complex nonlinearities in your model. So you can describe really weird stuff that you would otherwise not be able to with, say, a linear model, or a simple random forest that doesn't go deep enough to make up all these nonlinearities between all these features. This is basically the machinery you need for the nonlinearities in your data set.
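The per-neuron arithmetic just described can be written down in a few lines (the weights here are arbitrary numbers, just to show the weighted sum plus tanh applied layer by layer):

```python
import math

# One layer: each neuron takes a weighted sum of its inputs plus a bias,
# then applies tanh; stacking layers composes the nonlinear transforms.
def layer(inputs, weights, biases):
    return [math.tanh(sum(w * x for w, x in zip(ws, inputs)) + b)
            for ws, b in zip(weights, biases)]

x = [0.5, -1.2]                                    # two input features
hidden = layer(x, [[0.8, -0.3], [0.2, 0.9]], [0.1, -0.1])   # two hidden neurons
output = layer(hidden, [[1.5, -1.0]], [0.0])       # one output neuron
print(output)
```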
And we do this in a distributed way again, because we're using MapReduce; we're doing this on all the threads, as you saw earlier for GLM, when everything was green. Deep learning is also green; it's known to be green. I used to burn up the whole cluster running my models, and everybody else had to step back. Well, of course there's the Linux scheduler that takes care of that, but still, some claim it's not necessarily fair if I'm running some big model, so I haven't done that lately, and that's why I'm using these EC2 clusters now, or maybe my laptop from time to time. Anyway, you can see there are a lot of little details built in: it works automatically on categorical data, it automatically standardizes your data so you don't need to worry about that, it automatically imputes missing values, it automatically does regularization for you if you specify the option, and it does checkpointing, load balancing, everything. You just need to say go, and that's it; it should be super easy for anyone to just run it.
If you want to know how it works in detail, the architecture is basically this: first we distribute the data set onto the whole cluster. Let's say you have a terabyte of data and 10 nodes; every node will get 100 gigabytes of different data. Then you say: okay, I'll make an initial deep learning model, which is a bunch of weights and bias values, all just numbers, and I'll put that into some place in the store, and then I spread that to all the nodes. All my 10 nodes get a copy of the same model, and then I say: train on your local data. So all the models get trained on their local data with multi-threading, which means there are some race conditions here that make this not reproducible. But in the end you will have n models (in this case four, or on the cluster I just mentioned, ten) that have each been trained on a part of the hundred gigabytes their node holds. You don't have to process all hundred gigabytes; you can just sample some of it. And then when you're done with that, you reduce: the models basically get averaged back automatically into one model, and that one model is the one that you look at from your browser, from R, from Python. Then you do this again, and every pass can go through a fraction of your data, all of your data, or more than all of your data; you can just keep iterating without communicating. You could tell each node to just run for six weeks and then communicate, but by default it's done in a way that you spend about two percent of your time communicating on the cluster and ninety-eight percent computing, and this is all automatic. You don't need to worry about anything; you just say go, and it'll basically process the data in parallel and make a good model.
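The map-and-average loop can be sketched like this (a toy stand-in, not H2O's implementation: "training" here just nudges each weight toward its shard's mean, so the averaging structure is visible without real gradient descent):

```python
# Toy sketch of the distributed train-and-average scheme described above.
def local_train(weights, shard, lr=0.1):
    # Made-up stand-in for training: nudge weights toward the shard's mean.
    target = sum(shard) / len(shard)
    return [w + lr * (target - w) for w in weights]

def distributed_pass(weights, shards):
    local_models = [local_train(weights, shard) for shard in shards]  # map step
    n = len(local_models)
    return [sum(ws) / n for ws in zip(*local_models)]                 # reduce step

shards = [[1.0, 2.0], [3.0, 5.0], [10.0, 0.0]]   # data split across 3 "nodes"
weights = [0.0, 0.0]                             # shared initial model
for _ in range(5):                               # five communication rounds
    weights = distributed_pass(weights, shards)
print(weights)                                   # converging toward 3.5
```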
And this averaging of models, this scheme, works; there's a paper about it. But I'm also working on a new scheme called consensus ADMM, where you basically have a penalty on how far you drift from the average, but you keep your local model. That keeps everybody going on their own path in optimization land without averaging all the time; you just know when you're drifting too far, so you get pulled back a little, but you still have your own model.
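The consensus idea can be sketched roughly like this (a hypothetical update rule for illustration, not the exact ADMM equations): each node follows its own gradient but pays a penalty proportional to its distance from the current average, so local models stay loosely coupled instead of being overwritten by the mean.

```python
# Rough sketch of a consensus-style update: own gradient step plus a pull
# back toward the average model (penalty strength rho is made up).
def consensus_step(local_weights, own_gradients, rho=0.1, lr=0.5):
    n = len(local_weights)
    avg = [sum(ws) / n for ws in zip(*local_weights)]
    new_models = []
    for weights, grads in zip(local_weights, own_gradients):
        new_models.append([w - lr * g - rho * (w - a)
                           for w, g, a in zip(weights, grads, avg)])
    return new_models

models = [[1.0], [3.0], [8.0]]                    # three nodes, one weight each
models = consensus_step(models, [[0.0], [0.0], [0.0]])
print(models)   # each weight moved slightly toward the average (4.0)
```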
So that's going to be a promising upgrade soon that you can look forward to; already, as it is, it works fairly well. So this is MNIST: the digits 0 to 9, handwritten digits, 784 grayscale pixels, and you need to tell which digit it is from the grayscale pixel values. With a couple of lines here in R you can get what is actually the world record: no one has published a better number on this without using convolutional layers or other distortions. This is purely on the 60,000 training samples, no distortions, no convolutions, and you can see here all the other implementations, Geoff Hinton's and Microsoft's; 0.83 is the world record. Of course, you could say the last digit is not quite statistically significant, because you only have ten thousand test set points, but still, it's good to get down there.
So now let's do a little demo here: anomaly detection. I'll show you how we can detect the ugly digits in this MNIST data set on my laptop in a few seconds. I just have this instance up and running here from before, so I'm going to go into R. In R I have this R unit test; it runs every day, every time we commit something these tests are run, so you can definitely check them out from our GitHub page right now if you want. What this is saying is: build an autoencoder model, which is learning what's normal. So it connects to my cluster right now, and it learns what a normal digit is without knowing what the digits are; it just looks at all the data and learns what's normal.
And how does it do that? Well, it takes the 784 pixels and compresses them into, in this case, 50 neurons, 50 numbers, and then tries to turn that back into 784. So it's learning the identity function of this data set in a compressed way. If you can somehow represent the data with these 50 numbers, and you know the weights connecting in and out, then those 50 numbers mean something: that's what it takes to represent those 10 digits. Let's say that's roughly five numbers per digit, and those five numbers are enough to say there's an edge here, a round thing here, a hole here, something like that: the features. With these 50 numbers in the middle, and of course the connectivity that makes up the reconstruction, basically the encoding and the decoding, you can now say what's normal or not.
say what’s normal or not so because now
Now I'll take the test set, I let it go through this network, and I see what comes out on the other side; if it doesn't look like the original input, then it didn't match my vision of what this should look like. Right, so I'm going to let the test set go through this model. First I need to train the model, so right now it's building this model on my laptop: 50 hidden neurons, tanh activation function, and autoencoder set to true. And I had a couple of extra options, but those are just to say: don't drop any of the constant columns that are all zero, because I want to plot them at the end.
Okay, so now let's look at the outlierness of every point. We just scored the test set and computed the reconstruction error: how different is the output from the input, how bad is the identity mapping that I learned, for the test set points? And those points that are kind of ugly won't match what's normal in the training data. Right, that's an intuitive thing.
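That scoring step can be sketched in a few lines of plain Python (my own toy illustration, not H2O's code; the random `X` and `X_hat` merely stand in for test images and their reconstructions):

```python
import random

def reconstruction_error(row, reconstructed_row):
    """Mean squared error between one input row and its reconstruction."""
    return sum((a - b) ** 2 for a, b in zip(row, reconstructed_row)) / len(row)

random.seed(42)
# stand-ins for 10 test images of 784 pixels and their reconstructions
X = [[random.random() for _ in range(784)] for _ in range(10)]
X_hat = [[p + random.gauss(0, 0.05) for p in row] for row in X]

# score every row, then rank rows from best-reconstructed to worst
errors = [reconstruction_error(x, xh) for x, xh in zip(X, X_hat)]
ranked = sorted(range(len(errors)), key=errors.__getitem__)
best, median, worst = ranked[0], ranked[len(ranked) // 2], ranked[-1]
```

Plotting the rows at `best`, `median`, and `worst` is exactly the comparison in the demo: the rows the compressed identity function reproduces well are "normal", and the ones with the largest error are the outliers.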
All right, so now let's plot the ones that match the best, the top 25. That's the reconstruction; and now let's look at the actual ones: well, the same thing, right? They match the best, so they have to look the same. These are the ones that are the easiest to represent with your identity function: just take the middle numbers and basically say, keep them. Now let's look at the ones in the middle out of the 10,000, the ones with the median reconstruction error. These are still reasonably good: you can tell that they're digits, but they're already not quite as pretty anymore. And now let's look at the ugliest outliers, so to speak, in the test set. These are all digits that are coming out of my network, but they're not really digits anymore, right? So something went wrong; basically, the reconstruction failed. The model said these are ugly, and if you look at them, they are kind of ugly: some of them are almost not digits anymore, cut off, or the top right one, for example, is ugly. And you can tell that if you remember the bottom line, like in the optics test, the vision exam: 6, 40, 35.
Right, let's go look at my slides: totally different. So every time I run it, it's different, because it's neural nets with multithreading. I can turn it on to be reproducible, but then I have to say: use one thread, don't do any of this Hogwild race-condition updating of the weight matrix by multiple threads at the same time, just run one thread straight through with a given seed, and then just wait until that one thread is done; then it will be reproducible. But in this case I chose not to do that because it's faster this way, and the results are fine anyway: every time you run it you'll get something like this, and you will not get the ugly digits showing up as the good ones.
Right, so this shows you basically that this is a robust thing. And again, here, this is the network topology. So I can also go back to the browser now, go to localhost, and clean everything up. By the way, this here just ran all the models, so if I say get models, I should see all the models that were built. The last four R models were built on the million song data set earlier, and the top one is the one I built from R, the autoencoder. And you can see the autoencoder reconstruction error started at 0.08 mean squared error and now it's at 0.02, so it got it down; it improved from random noise. For autoencoders you always want to check this convergence: it has to learn something, right, the identity mapping. And you can also see here the status of the neuron layers, the thing I showed you earlier.
And of course you can also get a POJO again. In this case it's a neural net, so you would expect some weights here and some there. What is this? Oh, that's the neurons; here we go. I would expect the model to show up somewhere; see, there are a lot of declarations you have to make to hold all these 784 features. So if this is too little for the preview, then we have to look at the other model we have. Yeah, let's go back to get models and click on the other deep learning model we made earlier, on the million song data set, and look at its POJO; that should be smaller, because there were only 90 predictors.
Okay, here we go. So now you should see the deep learning math actually printed out in plain text, so you can always check it: the activation, something for the numericals, something for the categoricals if you had any (in this case there are none), and then it has the weights, the activations, the biases, and it does this matrix-vector multiplication, the a-times-x-plus-y. This is the matrix-vector multiplication that's inside of the deep learning model, and you can see here we do some partial-sum tricks to be faster, to basically allow the CPU to do more additions and multiplications at the same time. So all of this is optimized for speed, and this is as fast as any C++ implementation or anything, because we don't really have GC issues here: all the arrays are allocated one time and then just processed.
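The partial-sum trick is easy to sketch outside the generated Java (a toy illustration of the idea, not the POJO's actual code): accumulating into several independent sums breaks the serial dependency chain, which is what lets the CPU overlap its additions and multiplications.

```python
def matvec_partial_sums(A, x):
    """y = A*x with four independent partial sums per row.

    Separate accumulators break the serial add-dependency chain, which is
    what lets a CPU overlap the multiply-adds; in Python this only shows
    the arithmetic, not the speedup."""
    y = []
    for row in A:
        s0 = s1 = s2 = s3 = 0.0
        c = 0
        while c + 4 <= len(x):
            s0 += row[c] * x[c]
            s1 += row[c + 1] * x[c + 1]
            s2 += row[c + 2] * x[c + 2]
            s3 += row[c + 3] * x[c + 3]
            c += 4
        for k in range(c, len(x)):  # leftover columns
            s0 += row[k] * x[k]
        y.append(s0 + s1 + s2 + s3)
    return y

A = [[1.0, 2.0, 3.0, 4.0, 5.0], [0.5, 0.5, 0.5, 0.5, 0.5]]
x = [1.0, 1.0, 1.0, 1.0, 1.0]
print(matvec_partial_sums(A, x))  # -> [15.0, 2.5]
```

A compiled loop (C++, or the JIT-compiled POJO) is where the unrolling actually pays off; the Python version just makes the bookkeeping visible.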
All right, so now let's get back to the bigger problems: deep learning and the Higgs boson. Who has seen this data set before? Okay, great. So this is physics, right: the 13-billion-dollar biggest scientific experiment ever. This data set has about 10 million rows; they're detector events, and each detector event has 21 numbers coming out, saying this is what I measured for certain things. And then the physicists come up with seven more numbers that they compute from those 21, something like the square root of this squared minus that squared, and those formulas (or formulae) actually help. And you can see it down there: if you take just the low-level numbers, this is the AUC you get, where 0.5 is random and 1 would be perfect, and it goes up by almost something like 10 points if you add those extra features. So it's very valuable to have physicists around to tell you what to do, right?
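A quick reminder of the metric (my own sketch, not H2O's implementation): AUC is the probability that a randomly chosen signal event is scored above a randomly chosen background event, which is why 0.5 is random guessing and 1.0 is perfect.

```python
def auc(labels, scores):
    """AUC via the rank-sum view: the probability that a random positive
    outranks a random negative; ties count half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

print(auc([0, 0, 1, 1], [0.1, 0.2, 0.8, 0.9]))  # perfect ranking -> 1.0
print(auc([0, 1, 0, 1], [0.5, 0.5, 0.5, 0.5]))  # constant scores -> 0.5
```

The quadratic pairwise loop is only for clarity; real scorers sort once and use ranks.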
But CERN basically had this baseline here of 81; that was how well it was working for them. They used gradient boosted trees and neural networks with one hidden layer, so their baseline was 81 AUC. And this paper came along last summer saying: we can do better than that with deep learning; and they published some numbers, and now we are going to run the same thing and see what we can do. So I'm going back to my cluster, my EC2 8-node cluster, and I'll say: get frames.
And I have the Higgs data set there already, because I parsed it earlier; you can see here 11 million rows and 29 columns, 2 gigabytes compressed (there is not much to compress because it's all doubles). And now I'm going to run a deep learning model; I already saved the flow for that. So this flow says: take the split data set (I split it into ninety percent and five and five percent, so ten million and half a million each), take the training data and the validation data, and tell me how you're doing along the way. So: go. It builds a three-layer network and uses the rectifier activation, everything else is default, and now it's running.
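The split he describes can be sketched like this (a scaled-down toy under assumed names, not H2O's split-frame implementation): shuffle the row indices once, then cut them by fraction.

```python
import random

def split_rows(n_rows, fractions, seed=1234):
    """Shuffle row indices once, then cut them into len(fractions) pieces."""
    idx = list(range(n_rows))
    random.Random(seed).shuffle(idx)
    pieces, start = [], 0
    for f in fractions[:-1]:
        size = int(n_rows * f)
        pieces.append(idx[start:start + size])
        start += size
    pieces.append(idx[start:])  # remainder goes to the last piece
    return pieces

# scaled down: 11,000 rows instead of 11 million
train, valid, test = split_rows(11_000, [0.90, 0.05, 0.05])
print(len(train), len(valid), len(test))  # -> 9900 550 550
```

Training sees the 90% piece, the validation piece drives the "tell me how you're doing along the way" scoring, and the last piece stays untouched for a final check.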
So let's go look at the water meter. Okay, here we go: deep learning is taking over the cluster. And now it's communicating, and now it's sending that back out, and then computing again; this might be the initial phase where it first has to rebalance the data set or something. Usually you'll see it go up, down, up, down, so let's wait until the next communication. But you'll see that all the CPUs are busy updating weights with stochastic gradient descent, which means it takes a point, it trains: it goes through the network, makes a prediction, says how wrong it is, and corrects the weights; all the weights that are affected get fixed, basically, by every single point. There's no mini-batch or anything: every single point updates the whole model.
And that's done by all the threads in parallel, so you'll have eight threads in parallel changing those weights. And if I read your write, whatever, we just compete; but usually we write different weights, right? There are millions of weights, so you don't overwrite each other too often, or have someone else reading at the same time, or something.
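A minimal sketch of that Hogwild-style loop (my own illustration, not H2O's implementation): several threads hammer one shared weight vector with no locks at all, accepting the occasional lost update because, with millions of weights, collisions are rare; the reproducible mode is just one seeded thread run to completion.

```python
import random
import threading

# shared model: one flat weight vector (millions of entries in the real thing)
weights = [0.0] * 1000

def sgd_worker(seed, steps, lr=0.01):
    """Unsynchronized SGD: read-modify-write shared weights with no locks."""
    rng = random.Random(seed)
    for _ in range(steps):
        i = rng.randrange(len(weights))   # which weight this data point touches
        grad = rng.uniform(-1.0, 1.0)     # stand-in for a real gradient
        weights[i] -= lr * grad           # racy update; a collision loses one update

# eight threads changing the same weight vector in parallel
threads = [threading.Thread(target=sgd_worker, args=(s, 10_000)) for s in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# reproducible mode would instead be one seeded worker, run to completion:
# sgd_worker(seed=42, steps=80_000)
```

The trade-off is exactly the one in the talk: lock-free parallel updates are much faster, at the cost of run-to-run reproducibility.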
So you can see here it's mostly busy computing. If you want to know what exactly it's doing, you can also click on the profiler here, and it will show you a stack trace, the stack traces sorted by count, of what's happening. So this was basically just communicating; let's do this again, now it's going to be slightly different. Oh, I see: so now it was saying these are basically idle, because we have eight nodes and there are seven others, and there's one thread for reads and one for writes, so we've got 14 threads actively listening for communication. Here some are in the back propagation, some of them are in the forward propagation, so you can see the exact things that are going on at any moment in time, for every node; right, you can go to a different node and you can see the same behavior. So they're all just busy computing.
So while this model is building, we can ask how well it's doing. Remember that 81 baseline with the human features; let's see what we have here on the validation data set. It's already at 79: this already beat all the random forest, gradient boosted, and neural net methods that they had at CERN for many years. So those models there on the left, the ones that had 75, 76: already beaten by this deep learning model we just ran, and this wasn't even a good model; it was just small, like a hundred neurons in each layer, right? So this is very powerful, and by the time we finish it will actually get to over 87 AUC, which is what the paper reported; they have an 88. They trained theirs for weeks on a GPU, and of course they had only this data set and nothing else to worry about, and this is still a small data set. But you can see the power of deep learning, right? Especially if you feed it more data and you give it more neurons, it'll train and learn everything; it's like a brain that's trying to learn, like a baby's brain, it's just sucking up all the information. And after 40 minutes you'll get an 84 AUC, which is pretty impressive, right? It beats all the other baseline methods, even the ones with the human features, and this is without using the human features: you don't need to know anything, you just take the sensor data out of your machine and say: go.
Another use case was deep learning used for crime detection, and this is actually Chicago; who can recognize this pattern? So my colleagues Alex and Michal, they wrote an article that you can read here, on Datanami, just a few days ago, and they're using Spark and h2o together to take three different data sets and turn them into something that you can use to predict whether a crime is going to lead to an arrest or not. So you take the crime data set, you take the census data set to know something about the socioeconomic factors, and you take the weather data set, because the weather might have an impact on what's happening, and you put them all together in Spark. First you parse them in h2o, because we know that the parser works and it's fine; so in our demo we just suck it all into h2o, we send it over to Spark in the same JVM, and then we do an SQL join, and once you're done we split it again in h2o and then we build a deep learning model and, for example, a GBM model; I think these two are being built by the demo script that's available. So again, both h2o's and Spark's memory is shared, it's the same JVM, there's no Tachyon layer or anything, they are basically able to transparently go from one side to the other.
And the product of course is called Sparkling Water, which was a brilliant name, I think. All right, so this is the place on GitHub where you would find this example. You would download Sparkling Water from our download page, and then you would go into that directory, set two environment variables pointing to Spark and saying how many nodes you want, and then you would start the sparkling shell and copy-paste this code into it, for example if you want to do it interactively.
want to do it interactively so you can see here there’s a couple of imports you
see here there’s a couple of imports you
see here there’s a couple of imports you import deep learning in GBM and some
import deep learning in GBM and some
import deep learning in GBM and some spark stuff and then you basically
spark stuff and then you basically
spark stuff and then you basically connect to the h2o cluster we parse
connect to the h2o cluster we parse
connect to the h2o cluster we parse datasets this way this is just a
datasets this way this is just a
datasets this way this is just a function definition that gets used by
function definition that gets used by
function definition that gets used by these other functions that actually do
these other functions that actually do
these other functions that actually do the work to load the data and then you
the work to load the data and then you
the work to load the data and then you can drop some columns and do some simple
can drop some columns and do some simple
can drop some columns and do some simple munging in this case here we do some
munging in this case here we do some
munging in this case here we do some date manipulations to standardize the
date manipulations to standardize the
date manipulations to standardize the three datasets to have the same date
three datasets to have the same date
three datasets to have the same date format so that we can join on it later
format so that we can join on it later
format so that we can join on it later and you basically just take these three
and you basically just take these three
and you basically just take these three datasets these are just small for a demo
datasets these are just small for a demo
datasets these are just small for a demo but in reality they of course use the
but in reality they of course use the
but in reality they of course use the whole data set on a cluster and then
whole data set on a cluster and then
whole data set on a cluster and then once you have these three datasets in
once you have these three datasets in
once you have these three datasets in memory as h2o objects we just converted
memory as h2o objects we just converted
memory as h2o objects we just converted to a schema led with this call here and
to a schema led with this call here and
to a schema led with this call here and now to become spark or disease for which
now to become spark or disease for which
now to become spark or disease for which you can just call like a select
you can just call like a select
you can just call like a select statement in SQL and then some join and
statement in SQL and then some join and
statement in SQL and then some join and another join and all that it’s very nice
another join and all that it’s very nice
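That date standardization is what makes the SQL joins line up. A minimal pure-Python sketch of the same idea, with hypothetical column names and date formats rather than the actual demo code:

```python
from datetime import datetime

# Hypothetical source formats: each data set writes dates differently.
CRIME_FMT = "%m/%d/%Y %I:%M:%S %p"
WEATHER_FMT = "%Y-%m-%d"

def standardize(date_str, fmt):
    """Parse a date string and re-emit it in one canonical format."""
    return datetime.strptime(date_str, fmt).strftime("%Y-%m-%d")

crime = [{"date": standardize("03/15/2015 04:30:00 PM", CRIME_FMT), "type": "THEFT"}]
weather = [{"date": standardize("2015-03-15", WEATHER_FMT), "temp_f": 41}]

# A tiny hash join on the canonical date key, like the SQL join in the demo.
weather_by_date = {row["date"]: row for row in weather}
joined = [{**c, **weather_by_date[c["date"]]}
          for c in crime if c["date"] in weather_by_date]
```

Once every source emits the same canonical key, the join itself is trivial, which is exactly why the munging step comes first.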
another join and all that it’s very nice right this is a nice well understood API
right this is a nice well understood API
right this is a nice well understood API the people can use and h2o does not have
the people can use and h2o does not have
the people can use and h2o does not have this at this point but we’re working on
this at this point but we’re working on
this at this point but we’re working on that so at some point we’ll have more
that so at some point we’ll have more
that so at some point we’ll have more managing capabilities but for now you
managing capabilities but for now you
managing capabilities but for now you can definitely benefit from the whole
can definitely benefit from the whole
can definitely benefit from the whole spark ecosystem to do what it’s good for
spark ecosystem to do what it’s good for
spark ecosystem to do what it’s good for so here in this case but is this we say
so here in this case but is this we say
so here in this case but is this we say here’s a crime better data set that we
here’s a crime better data set that we
here’s a crime better data set that we after be splitted I think we spent we
after be splitted I think we spent we
after be splitted I think we spent we bring it back into h2o yes this is an
bring it back into h2o yes this is an
bring it back into h2o yes this is an HTML helper function to split and now we
HTML helper function to split and now we
HTML helper function to split and now we have basically a joint data set that
have basically a joint data set that
have basically a joint data set that knows all about the socioeconomic
knows all about the socioeconomic
knows all about the socioeconomic factors about the way
factors about the way
factors about the way for a given time at a given place and
for a given time at a given place and
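That split helper is just a seeded random partition of rows into train and validation pieces. A plain-Python sketch of the idea; the 75/25 ratio and the function shape are illustrative assumptions, not h2o's API:

```python
import random

def split_frame(rows, ratios=(0.75, 0.25), seed=42):
    """Randomly partition rows into len(ratios) disjoint splits."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    splits, start = [], 0
    for r in ratios:
        end = start + round(r * len(rows))
        splits.append(shuffled[start:end])
        start = end
    return splits

train, valid = split_frame(list(range(100)))
```

The fixed seed makes the split reproducible, which matters when you want to compare several models on the same holdout.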
for a given time at a given place and then we can build a deep learning model
then we can build a deep learning model
then we can build a deep learning model just like you would do this in Java
just like you would do this in Java
just like you would do this in Java Scala is very similar right you don’t
Scala is very similar right you don’t
Scala is very similar right you don’t need to do much porting it’s just the
need to do much porting it’s just the
need to do much porting it’s just the same members that you’re setting and
same members that you’re setting and
same members that you’re setting and then you say run train model that gets
then you say run train model that gets
then you say run train model that gets basically and that that at the end you
basically and that that at the end you
basically and that that at the end you have a model available that you can use
have a model available that you can use
have a model available that you can use to make predictions and it’s very simple
to make predictions and it’s very simple
to make predictions and it’s very simple and you can definitely follow the
and you can definitely follow the
and you can definitely follow the tutorials in the interest of time I’ll
tutorials in the interest of time I’ll
tutorials in the interest of time I’ll just show you the sparkling she’ll start
just show you the sparkling she’ll start
just show you the sparkling she’ll start here I’m basically able to do this on my
here I’m basically able to do this on my
here I’m basically able to do this on my laptop as well while the other one is
laptop as well while the other one is
laptop as well while the other one is still running here you see spark is
still running here you see spark is
still running here you see spark is being launched and now it’s scheduling
being launched and now it’s scheduling
being launched and now it’s scheduling those three worker nodes to come up once
those three worker nodes to come up once
those three worker nodes to come up once it’s ready I can copy paste some code in
it’s ready I can copy paste some code in
it’s ready I can copy paste some code in there and the code I would get from the
there and the code I would get from the
there and the code I would get from the website here Chicago Crime demo it’s all
website here Chicago Crime demo it’s all
website here Chicago Crime demo it’s all on github
So in the Sparkling Water GitHub project, under examples, there are some scripts, and so I can just take this stuff here and copy-paste it all. Oops. I'm sure you believe me that this is all doable, right? So here Spark is now ready, and I just copy-paste this in, and here it goes. So that's how easy it is to do Spark and h2o together. And then also, once you have something in memory in the h2o cluster, right, the model for example, or some data sets, you can just ask Flow to visualize it. You can just type this JavaScript, or CoffeeScript rather, expression and plot anything you want against anything, and you'll see these interactive plots where you can mouse over and it will show you what it is and so on. So it's very cool.
what it is and so on so it’s very cool you can plot for example the arrest rate
you can plot for example the arrest rate
you can plot for example the arrest rate versus the relative occurrence of an
versus the relative occurrence of an
versus the relative occurrence of an arrest for example gambling is always
arrest for example gambling is always
arrest for example gambling is always arrested why is that well because
arrested why is that well because
arrested why is that well because otherwise you wouldn’t know that the
otherwise you wouldn’t know that the
otherwise you wouldn’t know that the gambling person was cheating or
gambling person was cheating or
gambling person was cheating or something so so you basically have to
something so so you basically have to
something so so you basically have to rest them right otherwise you don’t know
rest them right otherwise you don’t know
rest them right otherwise you don’t know what’s happening some of the things are
what’s happening some of the things are
what’s happening some of the things are undetected but the theft for example
undetected but the theft for example
undetected but the theft for example it’s not always arrested because someone
it’s not always arrested because someone
it’s not always arrested because someone knows that it was stolen without the
knows that it was stolen without the
knows that it was stolen without the person actually being caught so you have
person actually being caught so you have
person actually being caught so you have to be careful about all this data
to be careful about all this data
to be careful about all this data science stuff but basically can plot
science stuff but basically can plot
science stuff but basically can plot whatever you want against whatever you
whatever you want against whatever you
whatever you want against whatever you want and that’s pretty powerful and we
want and that’s pretty powerful and we
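The quantity behind those plots, the arrest rate per crime type, is a one-line aggregation. A plain-Python sketch with made-up records (hypothetical crime types and outcomes; Flow computes this from the real frame):

```python
from collections import Counter

# Hypothetical (crime_type, arrested) records standing in for the Chicago data.
crimes = [("GAMBLING", True), ("GAMBLING", True),
          ("THEFT", False), ("THEFT", True), ("THEFT", False)]

totals = Counter(t for t, _ in crimes)          # occurrences per crime type
arrests = Counter(t for t, a in crimes if a)    # arrests per crime type

# Arrest rate per crime type, the quantity plotted against occurrence.
arrest_rate = {t: arrests[t] / totals[t] for t in totals}
```

With the invented records above, gambling comes out at rate 1.0 and theft at 1/3, mirroring the caveat in the talk: the rate reflects how the crime gets detected, not just how often it happens.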
want and that’s pretty powerful and we have our state up table now in house so
have our state up table now in house so
have our state up table now in house so Matt Dowell joined us recently he he
Matt Dowell joined us recently he he
Matt Dowell joined us recently he he wrote the fastest data table a
wrote the fastest data table a
wrote the fastest data table a processing engine in our and this is
processing engine in our and this is
processing engine in our and this is used for financial institutions that
used for financial institutions that
used for financial institutions that like to do aggregates a lot so just what
like to do aggregates a lot so just what
like to do aggregates a lot so just what you saw on the previous slide will soon
you saw on the previous slide will soon
you saw on the previous slide will soon have all this in H to go in a scalable
have all this in H to go in a scalable
have all this in H to go in a scalable way that we can do fast joins aggregates
way that we can do fast joins aggregates
way that we can do fast joins aggregates and so on and the same thing of course
and so on and the same thing of course
and so on and the same thing of course goes for Python you have ipython
goes for Python you have ipython
goes for Python you have ipython notebooks and there’s an example to do
notebooks and there’s an example to do
notebooks and there’s an example to do something for the city bike company in
something for the city bike company in
something for the city bike company in New York City where you want to know how
New York City where you want to know how
New York City where you want to know how many bikes do you need for stations such
many bikes do you need for stations such
many bikes do you need for stations such that you don’t run out of bikes so let’s
that you don’t run out of bikes so let’s
that you don’t run out of bikes so let’s say you have 10 million rows of
say you have 10 million rows of
say you have 10 million rows of historical data and you have some better
historical data and you have some better
historical data and you have some better data you would imagine it you can join
data you would imagine it you can join
data you would imagine it you can join those two and then basically based on
those two and then basically based on
those two and then basically based on location
location
location in time and better you can predict how
in time and better you can predict how
in time and better you can predict how many bikes you’ll need right so if I
many bikes you’ll need right so if I
many bikes you’ll need right so if I know today it’s going to be or tomorrow
know today it’s going to be or tomorrow
know today it’s going to be or tomorrow is going to be that better I know I need
is going to be that better I know I need
is going to be that better I know I need 250 bikes at that station or something
250 bikes at that station or something
250 bikes at that station or something and cliff our CTO who-who wrote a jit
and cliff our CTO who-who wrote a jit
and cliff our CTO who-who wrote a jit basically also wrote this data science
basically also wrote this data science
basically also wrote this data science example here so you can see there’s a
example here so you can see there’s a
example here so you can see there’s a group by the top from ipython notebooks
group by the top from ipython notebooks
group by the top from ipython notebooks and to show you that this is also life
and to show you that this is also life
and to show you that this is also life impossible here I do this here I’ll type
impossible here I do this here I’ll type
impossible here I do this here I’ll type ipython notebook citibike small and up
ipython notebook citibike small and up
ipython notebook citibike small and up pops up my my my browser with ipython
pops up my my my browser with ipython
pops up my my my browser with ipython notebook I will delete all the output
notebook I will delete all the output
notebook I will delete all the output cells so we don’t cheat and I say go and
cells so we don’t cheat and I say go and
cells so we don’t cheat and I say go and now it’s connecting to the cluster that
now it’s connecting to the cluster that
now it’s connecting to the cluster that I started 30 minutes ago this means i
I started 30 minutes ago this means i
I started 30 minutes ago this means i still have a little bit of time left i
still have a little bit of time left i
still have a little bit of time left i will load some data here it up we go and
will load some data here it up we go and
will load some data here it up we go and then let’s look at the data describe it
then let’s look at the data describe it
then let’s look at the data describe it you can see here some some mean max and
you can see here some some mean max and
you can see here some some mean max and so on whatever this is like a
so on whatever this is like a
so on whatever this is like a distribution of the chunk of the frame
distribution of the chunk of the frame
distribution of the chunk of the frame how many rows out of each machine in
how many rows out of each machine in
how many rows out of each machine in this case is only one machine oops
this case is only one machine oops
this case is only one machine oops there’s only one machine basically some
there’s only one machine basically some
there’s only one machine basically some statistics that tells you how is the
statistics that tells you how is the
statistics that tells you how is the data distributed across the cluster what
data distributed across the cluster what
data distributed across the cluster what kinds of columns do I have what is their
kinds of columns do I have what is their
kinds of columns do I have what is their mean max and so on all available from
mean max and so on all available from
mean max and so on all available from from Python then you can do a group by
from Python then you can do a group by
from Python then you can do a group by you don’t need to know all that but
you don’t need to know all that but
you don’t need to know all that but basically just you want to know like at
basically just you want to know like at
basically just you want to know like at what time of the day or what they how
what time of the day or what they how
what time of the day or what they how many bikes are bitch station and so on
many bikes are bitch station and so on
many bikes are bitch station and so on you can see that there’s a big
you can see that there’s a big
you can see that there’s a big distribution here that’s some some
distribution here that’s some some
distribution here that’s some some places only need 9 bikes on basically
places only need 9 bikes on basically
places only need 9 bikes on basically the under bikes or even more and so on
the under bikes or even more and so on
the under bikes or even more and so on right and you can do quantiles you see
right and you can do quantiles you see
right and you can do quantiles you see the quantiles here from one percent all
the quantiles here from one percent all
the quantiles here from one percent all the way to ninety-nine percent and you
the way to ninety-nine percent and you
the way to ninety-nine percent and you see that there’s some pretty big numbers
see that there’s some pretty big numbers
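Those group-by and quantile summaries can be sketched in plain Python. The station names and ride counts below are invented for illustration, and the quantile uses a simple nearest-rank rule rather than whatever interpolation h2o applies:

```python
from collections import defaultdict

# Hypothetical (station, hour, rides) records standing in for the Citi Bike data.
rides = [("Lafayette St", 8, 120), ("Lafayette St", 9, 95),
         ("W 52 St", 8, 9), ("W 52 St", 9, 14)]

# Group-by: total rides per station, like the notebook's group-by cell.
per_station = defaultdict(int)
for station, hour, n in rides:
    per_station[station] += n

def quantile(values, q):
    """Nearest-rank quantile of a list of numbers (0 < q <= 1)."""
    ordered = sorted(values)
    idx = max(0, round(q * len(ordered)) - 1)
    return ordered[idx]

counts = [n for _, _, n in rides]
p99 = quantile(counts, 0.99)  # the big numbers in the 99% tail
```

The 99th-percentile count is what you would size the stations for if you never want to run out of bikes.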
see that there’s some pretty big numbers here you can make new features stay if
here you can make new features stay if
here you can make new features stay if the weekends on you can build models so
the weekends on you can build models so
the weekends on you can build models so this is the fun part we have a bill to
this is the fun part we have a bill to
this is the fun part we have a bill to GBM we build a random forest we build a
GBM we build a random forest we build a
GBM we build a random forest we build a glm and we build a deep learning model
glm and we build a deep learning model
glm and we build a deep learning model all on the same data that was joined
all on the same data that was joined
all on the same data that was joined earlier and so now let’s say do this go
earlier and so now let’s say do this go
earlier and so now let’s say do this go so now it’s building a GBM
so now it’s building a GBM
so now it’s building a GBM all of my laptop so if I went to my
all of my laptop so if I went to my
all of my laptop so if I went to my laptop right now I could say get models
laptop right now I could say get models
laptop right now I could say get models and these models would just magically
and these models would just magically
and these models would just magically pop up and this is deep learning and now
pop up and this is deep learning and now
pop up and this is deep learning and now we can see how well they’re doing and
we can see how well they’re doing and
we can see how well they’re doing and you get the idea right so now we get a
you get the idea right so now we get a
you get the idea right so now we get a 92 AAC by deep learning but the 93 a or
92 AAC by deep learning but the 93 a or
92 AAC by deep learning but the 93 a or c by GBM but deep learning even took a
c by GBM but deep learning even took a
c by GBM but deep learning even took a little less time than GBM so you could
little less time than GBM so you could
little less time than GBM so you could say that both are very powerful methods
say that both are very powerful methods
say that both are very powerful methods they beat the random forests and the
they beat the random forests and the
they beat the random forests and the linear models here but of course nothing
linear models here but of course nothing
linear models here but of course nothing beats the linear model in terms of time
beats the linear model in terms of time
beats the linear model in terms of time Oh point one second to get an 81 and you
Oh point one second to get an 81 and you
Oh point one second to get an 81 and you see it’s pretty remarkable it’s 50 times
see it’s pretty remarkable it’s 50 times
see it’s pretty remarkable it’s 50 times faster and a random forest all right so
faster and a random forest all right so
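The AUC numbers being compared have a simple meaning: AUC is the probability that a randomly chosen positive example gets a higher score than a randomly chosen negative one. A minimal pure-Python version, just to show what a 0.92 or 0.93 means (h2o computes this for you):

```python
def auc(labels, scores):
    """AUC via pairwise comparison: P(score_pos > score_neg), ties count half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# A perfect ranker gets 1.0, a random one about 0.5.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(auc(labels, scores))  # → 0.75
```

This pairwise form is quadratic in the number of examples; real implementations sort by score instead, but the value is the same.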
faster and a random forest all right so you believe me that I Python works as
you believe me that I Python works as
you believe me that I Python works as well once you join the better data with
well once you join the better data with
well once you join the better data with a simple merge command here in the
a simple merge command here in the
a simple merge command here in the middle somewhere then you get a little
middle somewhere then you get a little
middle somewhere then you get a little lift here because then you can even
lift here because then you can even
lift here because then you can even predict better you need bikes are not
predict better you need bikes are not
predict better you need bikes are not based on better right make sense if it
based on better right make sense if it
based on better right make sense if it rains you might need fewer bikes so any
rains you might need fewer bikes so any
rains you might need fewer bikes so any anything you might wonder what to do
anything you might wonder what to do
anything you might wonder what to do with GBM linear models with deep
with GBM linear models with deep
with GBM linear models with deep learning there’s booklets for that and
learning there’s booklets for that and
learning there’s booklets for that and we’re currently rewriting them to the
we’re currently rewriting them to the
we’re currently rewriting them to the new version of h2o which will have
new version of h2o which will have
new version of h2o which will have slightly updated api’s and stuff for
slightly updated api’s and stuff for
slightly updated api’s and stuff for consistency across our Python Scala JSON
consistency across our Python Scala JSON
consistency across our Python Scala JSON and so on so it’s going to be very nice
and so on so it’s going to be very nice
and so on so it’s going to be very nice and rewritten everything from scratch a
and rewritten everything from scratch a
and rewritten everything from scratch a major effort but now we’re basically
major effort but now we’re basically
major effort but now we’re basically going to be ready for release I think
going to be ready for release I think
going to be ready for release I think this week actually so and another ! is
this week actually so and another ! is
this week actually so and another ! is that we’re currently number one at this
that we’re currently number one at this
that we’re currently number one at this caracal challenge Marc Landry who just
caracal challenge Marc Landry who just
caracal challenge Marc Landry who just joined us who has been on teammates to
joined us who has been on teammates to
joined us who has been on teammates to go for a while he was at the h2o world
go for a while he was at the h2o world
go for a while he was at the h2o world last fall he is actually going to work
last fall he is actually going to work
last fall he is actually going to work full-time almost half his time on Kaggle
full-time almost half his time on Kaggle
full-time almost half his time on Kaggle challenges using h2o so we’ll be excited
challenges using h2o so we’ll be excited
challenges using h2o so we’ll be excited to see this go across the finish line
to see this go across the finish line
to see this go across the finish line and they will share how we did this or
and they will share how we did this or
and they will share how we did this or rather he will share how he did it
rather he will share how he did it
rather he will share how he did it because so far mark did most of the work
because so far mark did most of the work
because so far mark did most of the work next week at h2o in Mountain View and
next week at h2o in Mountain View and
next week at h2o in Mountain View and they’ll be live-streamed as well so if
they’ll be live-streamed as well so if
they’ll be live-streamed as well so if you can make it be sure to listen in and
you can make it be sure to listen in and
you can make it be sure to listen in and these are some examples of other caracal
these are some examples of other caracal
these are some examples of other caracal applications
applications
applications we have demo scripts that are posted
we have demo scripts that are posted
we have demo scripts that are posted that are available and for example this
that are available and for example this
that are available and for example this one I had hosted a few other maybe a
one I had hosted a few other maybe a
one I had hosted a few other maybe a month ago or so I posted this example
month ago or so I posted this example
month ago or so I posted this example GBM random parameter tooling logic where
GBM random parameter tooling logic where
GBM random parameter tooling logic where you basically just make ten models with
you basically just make ten models with
you basically just make ten models with random parameters and see which one is
random parameters and see which one is
random parameters and see which one is the best that sometimes useful
the best that sometimes useful
the best that sometimes useful especially if you have many dimensions
especially if you have many dimensions
especially if you have many dimensions to optimize over and we don’t have
to optimize over and we don’t have
to optimize over and we don’t have Beijing optimization yet but this might
Beijing optimization yet but this might
Beijing optimization yet but this might be more efficient than just a brute
be more efficient than just a brute
be more efficient than just a brute force grid search because the machine
force grid search because the machine
force grid search because the machine gets luckier more than you tell it to be
gets luckier more than you tell it to be
gets luckier more than you tell it to be lucky if you want that’s why montecarlo
lucky if you want that’s why montecarlo
lucky if you want that’s why montecarlo integration works in higher and four
integration works in higher and four
integration works in higher and four dimensions the same thing is true with
dimensions the same thing is true with
dimensions the same thing is true with hyper parameter finding so don’t shy
hyper parameter finding so don’t shy
hyper parameter finding so don’t shy away from these random approaches
away from these random approaches
away from these random approaches they’re very powerful so this is the
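The random-search idea described here can be sketched in a few lines of plain Python. This is not the H2O demo script itself; the GBM-style parameter names and the toy scoring function below are made up for illustration:

```python
import random

def random_search(param_space, score_fn, n_models=10, seed=42):
    """Draw n_models random parameter combinations and keep the best one."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_models):
        # Sample one value per hyperparameter dimension.
        params = {name: rng.choice(values) for name, values in param_space.items()}
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Hypothetical GBM-style search space and a toy scoring function
# (higher is better; peaks at max_depth=5, learn_rate=0.05).
space = {
    "ntrees": [50, 100, 200],
    "max_depth": [3, 5, 7, 9],
    "learn_rate": [0.01, 0.05, 0.1],
}
toy_score = lambda p: -abs(p["max_depth"] - 5) - abs(p["learn_rate"] - 0.05)
best, score = random_search(space, toy_score)
```

The point of the talk holds even in this toy: ten random draws often land near the best region of a 36-point grid for a fraction of the cost of evaluating all of it.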
So this is the outlook: lots of stuff to do for data science now that we have this machinery in place that can scale to big data sets. Customers are saying, well, do I need to find parameters myself? Yeah, sure; automatic hyperparameter tuning is great, and we'll do that for you. Soon you'll have ensembles, like a framework where you can, in the GUI, properly define what you want to blend together and in what way, for example non-negative least squares to stack models of different kinds, like a random forest and a GBM and so on, all on the holdout sets and so on.
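The stacking idea mentioned here, blending different model families using their holdout predictions, can be sketched with a crude one-parameter stand-in for non-negative least squares: restrict the blend to a convex combination of two base models and grid over the single weight. The model names and numbers below are invented for illustration:

```python
def blend_weight(pred_a, pred_b, y, steps=100):
    """Pick the convex-combination weight w in [0, 1] that minimizes the
    squared error of w*pred_a + (1-w)*pred_b on a holdout set.
    A crude stand-in for non-negative least squares stacking."""
    def mse(w):
        return sum((w * a + (1 - w) * b - t) ** 2
                   for a, b, t in zip(pred_a, pred_b, y)) / len(y)
    return min((i / steps for i in range(steps + 1)), key=mse)

# Toy holdout predictions from two hypothetical base models
# (say a random forest and a GBM) against the true targets.
rf_pred  = [1.0, 2.0, 3.0, 4.0]
gbm_pred = [1.2, 1.8, 3.3, 3.9]
truth    = [1.1, 1.9, 3.15, 3.95]
w = blend_weight(rf_pred, gbm_pred, truth)
```

Fitting the weights on holdout predictions rather than training predictions is what keeps the stack from simply rewarding whichever base model overfit hardest.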
Then we want to have convolutional layers for deep learning, for example for people who want to do more image-related stuff. But all these things are on a to-do list, right? We have to prioritize them based on customer demand; that's what our customers get to do. The paying customers get to tell us basically what they want, and we'll take that into account.
Natural language processing is high up there, especially now that we have this framework: we can characterize each string as an integer and then process all of that fast. And we have a new method called the generalized low-rank model, which comes right out of Stanford, brand new. It can do all these methods: PCA, SVD, k-means, matrix factorization, of course, all this stuff, fixing missing values for you based on, like, a Taylor expansion of your data set. Very powerful stuff; it can also be used for recommender systems.
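The missing-value-fixing idea behind a generalized low-rank model can be illustrated with a toy: fit a rank-1 factorization u·vᵀ to the observed entries by alternating least squares, then read the missing entries off the factorization. This is a deliberately minimal sketch in plain Python, not H2O's GLRM implementation (which handles arbitrary ranks, regularizers, and loss functions):

```python
def rank1_impute(X, iters=50):
    """Fill missing entries (None) of a matrix with a rank-1 fit u * v^T,
    obtained by alternating least squares on the observed entries only."""
    m, n = len(X), len(X[0])
    u = [1.0] * m
    v = [1.0] * n
    for _ in range(iters):
        # Update each u[i] from the observed entries in row i.
        for i in range(m):
            num = sum(v[j] * X[i][j] for j in range(n) if X[i][j] is not None)
            den = sum(v[j] ** 2 for j in range(n) if X[i][j] is not None)
            if den:
                u[i] = num / den
        # Update each v[j] from the observed entries in column j.
        for j in range(n):
            num = sum(u[i] * X[i][j] for i in range(m) if X[i][j] is not None)
            den = sum(u[i] ** 2 for i in range(m) if X[i][j] is not None)
            if den:
                v[j] = num / den
    # Keep observed values; fill the gaps from the low-rank model.
    return [[X[i][j] if X[i][j] is not None else u[i] * v[j]
             for j in range(n)] for i in range(m)]

# An exactly rank-1 matrix (entry = (row+1)*(col+1)) with one value missing;
# the factorization recovers the hole as 2 * 2 = 4.
X = [[1.0, 2.0, 3.0],
     [2.0, None, 6.0],
     [3.0, 6.0, 9.0]]
filled = rank1_impute(X)
```

The same factor matrices double as a recommender: a user's row factor times an item's column factor predicts the unobserved rating, which is why the two use cases travel together.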
And we have lots and lots of other open tickets and stuff to work on.
So if you're interested in joining the effort, please do. I hope I left you with an impression of what you can do with H2O and what the state of the art is right now in machine learning on big data sets. Thank you for your attention.