So, yes, I spent a lot of years in physics, in high-performance computing for particle physics, on the largest supercomputers of the world at SLAC, working together with CERN. That was my background, and then I switched into machine learning startups; I've been doing this for the last three and a half years or so. Last year I got nominated and called a Big Data All-Star by Fortune magazine, so that was a nice surprise. You can follow me at @ArnoCandel, and if anybody would be willing to take a picture and tweet it to me, that would be great. Thanks so much.
So today we're going to introduce H2O and then talk about deep learning in a little more detail, and then there will be a lot of live demos, as much as time allows. I will go through all these different things: we'll look at different data sets and different APIs, and I'll make sure that you get a good impression of what H2O can do for you and what it looks like, so that you definitely get an idea of what we can do here.
So: H2O is an in-memory machine learning platform. It's written in Java, it's open source, and it distributes across your cluster. It sends the code around, not the data, so your data can stay on the cluster. Say you have a large data set and you want to build models on the entire data set; you don't want to down-sample and lose accuracy that way. But usually the problem is that the tools don't allow you to scale to these big data sets, especially for building machine learning models. We're not just talking about summing up stuff or computing aggregates; we're talking about sophisticated models like gradient boosting machines or neural networks, and H2O allows you to do this, so you get the scalability and the accuracy from this big data set, at scale. As I mentioned earlier, we have a lot of APIs that you'll get to see today. We also have a scoring engine, which is kind of a key point of the product. We are about 35 people right now. We had our first H2O World conference last year in the fall, and it was a huge success.
Sri Satish Ambati is our CEO; he has a great mindset, and culture is everything to him, so he likes to do meetups every week, even twice a week, to get feedback from customers and so on. We are very much community-driven, even though we write most of the code at this point. You can see here the growth: machine learning is really trending, and we think it's the next SQL, and prediction is the next search. There's not just predictive analytics; there's also prescriptive analytics, where you're trying not just to say what's going to happen tomorrow, but to tell the customers what to do so that they can affect tomorrow. So you can see the growth here; lots and lots of companies are now using H2O. And why is that?
Well, because it's a distributed system built by the experts in house. We have Cliff Click, our CTO; he wrote large parts of the Java JIT compiler, so in every cell phone of yours there are parts of his code being executed all the time. He architected the whole framework: it's a distributed in-memory key-value store based on a non-blocking hash map, and it has a MapReduce paradigm built in, our own MapReduce, which is fine-grained and makes sure that all the threads are working at all times when you're processing your data, and of course all the nodes are working in parallel, as you'll see later.
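The fine-grained MapReduce idea described here can be sketched in miniature. This is a toy Python illustration only (H2O's actual implementation is in Java and distributes over nodes as well as threads); the chunked column and the mean aggregate are made up for the example:

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce

# A column split into chunks, the way a distributed frame would store it.
chunks = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]

def map_chunk(chunk):
    # "map": each thread computes a partial aggregate over its own chunk
    return (sum(chunk), len(chunk))

def reduce_pair(a, b):
    # "reduce": partial results are combined pairwise
    return (a[0] + b[0], a[1] + b[1])

with ThreadPoolExecutor() as pool:
    partials = list(pool.map(map_chunk, chunks))

total, count = reduce(reduce_pair, partials)
print(total / count)  # mean of the whole column: 5.0
```

The same map/reduce pair also expresses the "mini-algorithm" aggregates mentioned later: anything that can be computed per chunk and merged pairwise fits this shape.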
We also compress the data, similar to the Parquet data format, so you really only store the data you need, and it's much cheaper to decompress on the fly in the registers of the CPU than to send the numbers across the wire.
Once you have this framework in place, you can write algorithms that use this MapReduce paradigm, and you can also do less than a full algorithm: you can just compute aggregates, for example, like a mini-algorithm if you want. So you can do all these things, and in the end you end up with a model that makes a prediction about the future, right? That's what you do with machine learning. And that code can then be exported; I'll show you that in a minute. Of course we can suck in data from pretty much anywhere, and you can talk to it from R and Python, via JSON, or from a web browser.
I routinely check the status of my jobs from my cell phone, for example. There's a bunch of customers using us right now; these are the ones that are referenceable at this point. There are a lot more that we can't talk about at this moment, but you'll hear about them soon. They're basically doing big data, right: hundreds of gigabytes, dozens of nodes, and they're processing data all the time. They have faster turnaround times, and they're saving millions by deploying these models, such as this fraud detection model, which has saved PayPal millions in fraud.
It's very easy to download: you just go to h2o.ai and you can find the download button. You download it, and once it's downloaded, you unzip that file, you go in there, and you type java -jar. That's it; H2O will be running on your system. There are no dependencies; it's just one single file that you need, and you're basically running. You can do the same thing on a cluster: you copy the file to every node and you launch it. That would be a bare-bones installation. If you don't want to do bare-bones, you can do Hadoop, you can do YARN or Spark, and you can launch it from R and from Python as well.
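The bare-bones install described here looks roughly like the following; the exact zip and jar names change per release, so treat these file names as placeholders and check h2o.ai for the current download:

```shell
# Single machine: unzip, run the jar, done -- no other dependencies.
# The UI then comes up on localhost:54321.
unzip h2o.zip && cd h2o-*
java -jar h2o.jar

# Bare-bones cluster: copy the same jar to every node, list the node
# addresses in a flatfile, and start the jar on each node.
java -Xmx30g -jar h2o.jar -flatfile flatfile.txt
```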
So let's do a quick demo here; this is GLM. I'm going to a cluster here; this cluster has my name on it, I got a dedicated cluster for this demo. Let's see what this says: this cluster is an eight-node cluster on EC2, and it has, I think, 30 gigabytes of heap per machine. Yep, here, and basically it's just sitting there, waiting for me to tell it what to do. One thing I did earlier is parse this airlines data set; I'm going to do this again. The airlines data set has all the flights from 2007 all the way back to 1987, and it's parsing this right now. Let's go look at the CPU usage here: you can see that all the nodes are active right now, sucking in the data.
They're parsing it, tokenizing it, compressing it into these reduced representations, which are lossless, of course. So when we have numbers like 7, 19, and 120, we know that fits into one byte, so you make a one-byte column, right? Once you see that there are numbers with more dynamic range than just one byte, you take two bytes, and so on; you basically just store what you need.
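The smallest-width idea behind that column compression can be sketched like this; a toy Python illustration of picking a byte width from a column's dynamic range, not H2O's actual compression scheme:

```python
def column_width_bytes(values):
    """Pick the smallest number of bytes that can represent the column's
    dynamic range (values stored as offsets from the column minimum)."""
    span = max(values) - min(values)  # dynamic range of the column
    for width in (1, 2, 4, 8):
        if span < 2 ** (8 * width):
            return width
    raise ValueError("range too large")

print(column_width_bytes([7, 19, 120]))  # fits in one byte -> 1
print(column_width_bytes([0, 70000]))    # too wide for two bytes -> 4
```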
Okay, so now it parsed this file in 35 seconds. Let's go look at the file.
There's a frame summary that I'm expecting from the server, and the server now returns it and says: 160 million rows. Can you see this? There are 160 million rows and 30 columns, about 4 gigabytes in compressed space. You see all these different columns here; they have a summary and a cardinality, and some of them are categorical, so in effect there are about 700 predictors in this data set. We're trying to predict whether the plane is delayed or not, based on things like its departure time, origin and destination airport, and so on.
So if I wanted to do this, I would just click here, Build Model, and I would say generalized linear model; that's one that is fast. The training frame is chosen here, and I will now choose some columns to use. I'll first ignore all of them, because there are a lot of columns I don't want to use, and then I'll add year, month, day of the month, and day of the week. Let's see: we want to know the departure time, maybe the carrier; not the flight number, that doesn't mean much; maybe the origin and destination. And then all we really care about is whether it's delayed or not, so that will be my response. Everything else you don't need, because it would give away the answer, right? So the departure delay is what I'm going to try to predict, and it's a binomial problem, so yes or no is the answer.
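A generalized linear model with a binomial response, as chosen here, is logistic regression. Here is a minimal sketch of that idea on made-up toy data, as plain-Python gradient descent; this is nothing like H2O's distributed solver, just the underlying model:

```python
import math

# Toy rows: (departure_hour, is_delayed) -- made-up numbers for illustration
data = [(6, 0), (7, 0), (9, 0), (15, 1), (18, 1), (21, 1)]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Fit weight and bias by gradient descent on the logistic loss;
# the hour is centered to keep the steps well-behaved.
w, b, lr = 0.0, 0.0, 0.1
for _ in range(2000):
    gw = gb = 0.0
    for hour, y in data:
        x = hour - 12.0
        p = sigmoid(w * x + b)
        gw += (p - y) * x
        gb += (p - y)
    w -= lr * gw / len(data)
    b -= lr * gb / len(data)

# Probability that an 8pm departure is delayed: near 1 for this toy data
print(sigmoid(w * (20 - 12.0) + b))
```

The yes/no response comes out as a probability, and the threshold that turns it into a yes or a no is exactly the ROC-curve business discussed a bit later.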
Now I just have to press Go, and it's building this model as we speak. I can go to the Water Meter to see the CPU usage, and you can see that all the nodes are busy computing this model right now, and in a few seconds it will be done; you see the objective value doesn't change anymore. Yep, so it's done in 19 seconds, and I can look at the model. I can see that we have an AUC of 0.65; it's a little more than 0.5, right, so it's not just random. We have variable importances here; we can see that certain airlines, like Eastern Airlines, have a negative correlation with the response, which means if you take this carrier, you're rarely going to be delayed. That's because it didn't have a schedule; it was always on time by definition, for example. So this is like one bit of insight that comes out of this model.
Another thing is that Chicago and Atlanta are often delayed when you start there, right, when your journey starts there, as you know. Or, for example, San Francisco: if you want to fly to San Francisco, there are a lot of people who want to do that, so that's why it's also often delayed. As I mentioned earlier, the accuracy here flatlined after the first few iterations, so the model could have been done even faster. If you're looking at the metrics here, for example, you can see that there's a mean squared error reported, an R-squared value reported, all this data science stuff, an AUC value of 0.65, and so on.
And there's even a POJO that we can look at. You know what a POJO is: a plain old Java object. It's basically Java code, the scoring code that you can take into production, that actually scores your flights in real time. You could say, okay, if you're this airline, and if you're at this time of day, then you're going to have this probability of being delayed or not. And this is the optimal threshold computed from the ROC curve, that curve you saw earlier, which tells you where best to pick your operating regime to say delayed or not, based on the false positives and true positives and so on that you're balancing, right? So all the data science is baked in for you; you get the answer right away. This was on 160 million rows, and we just did this live.
So, as you saw, there's the POJO scoring code, and there are more models that you can build in Flow, the web UI that you saw earlier. There's a Help button on the right side here; to bring this back up, there's Help, I go down, and I can see here: packs. So there's a bunch of example packs that come with it. If I click on this here, I'll do this actually on my laptop now; I'll show you how to run this on a laptop. I just downloaded the package from the website, and it only contains two files: one is an R package and one is the actual Java jar file. I'm going to start this on my laptop, and I'm going to check the browser at localhost, port 54321; that's our default port.
three two one that’s our default port and now I’m connected to this java JVM
and now I’m connected to this java JVM
and now I’m connected to this java JVM that I just launched right and I can ask
I can ask it (this is a little too big, let's make it smaller, here we go), I can look at the cluster status: it's a one-node cluster, I gave it 8 gigs of heap, you can see that, and it's all ready to go. So now I'm going to launch a flow from this example pack, this million songs flow. I'm going to load that notebook, and you can see this is the million-song binary classification demo. We basically have a data set with 500,000 observations and 90 numerical columns, and we're going to split that and store the pieces; well, that's already done, you already have those files ready, so now we just have to parse them in. I put them on my laptop already, so I can just say download and import into the H2O cluster. I'll take the non-zipped version because that's faster.
This file is a few hundred megabytes and it's done in three seconds, and this one here is the test set, which I'm also going to parse. You can see that you can even specify the column types: if you wanted to turn a number into an enum for classification, you can do that here explicitly if you're not happy with the default behavior of the parser. But the parser is very robust and can usually handle that. If you have missing values, if you have all kinds of categoricals, ugly strings, stuff that's wrong, we'll handle it. It's very robust; it's really made for enterprise-grade datasets. It'll go through your dirty data and just spit something out that's usually pretty good.
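As a rough illustration of what that kind of parser tolerance means, here is a toy Python sketch (not H2O's actual parser; the coercion rules and junk tokens here are made up): numbers stay numeric, unparseable entries become missing values, and mostly-text columns are kept as categorical levels.

```python
# Toy sketch of robust column parsing (illustrative only, not H2O's parser).
def coerce_column(values):
    parsed = []
    for v in values:
        v = v.strip()
        if v in ("", "NA", "?"):          # treat common junk as missing
            parsed.append(None)
            continue
        try:
            parsed.append(float(v))       # numeric if it parses
        except ValueError:
            parsed.append(v)              # otherwise keep the string for now
    numeric = [x for x in parsed if isinstance(x, float)]
    present = [x for x in parsed if x is not None]
    if len(numeric) >= 0.5 * len(present):
        # mostly numeric column: stray strings become missing values
        return [x if isinstance(x, float) else None for x in parsed]
    # otherwise treat the column as categorical (enum) levels
    return parsed

print(coerce_column(["1.5", "NA", "2", "oops"]))   # → [1.5, None, 2.0, None]
```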
Okay, so now we have these data sets; let's see what else we have here. Let me go back out: to get an overview, you can click on Outline on the right, and you can see all these cells that I pre-populated. One of them says build a random forest, one says build a gradient boosting machine, one says build a linear model (logistic regression), and one says build a deep learning model. And I can just say: okay, fine, let's build one. Let's go down to the GBM cell and execute it. Now it's building a gradient boosting machine on this data set, you can see the progress bar here, and while it's building I can say: hey, how do you look right now? Let me see how you're doing. Right now it's already giving me two scoring-history points where the error went down, and there's already an ROC curve with an AUC of something like 0.7, I would hope... yes, 0.78 AUC already, in just seconds. That's pretty good for this data set. If I ask again, it's already further along; the error keeps going down, and you can keep looking at that model, feature importances for which variables matter the most, all in real time.
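That steadily dropping error is the essence of boosting: each stage fits a small model to the residuals of the ensemble so far. A minimal sketch of the idea (illustrative Python, not H2O's GBM; the stump fitter and toy data are made up):

```python
# Sketch of gradient boosting on squared error with depth-1 "trees" (stumps).
def fit_stump(xs, residuals):
    """Best single-split constant predictor for the current residuals."""
    best = None
    for split in xs:
        left  = [r for x, r in zip(xs, residuals) if x <= split]
        right = [r for x, r in zip(xs, residuals) if x > split]
        if not left or not right:
            continue
        lmean, rmean = sum(left) / len(left), sum(right) / len(right)
        err = sum((r - (lmean if x <= split else rmean)) ** 2
                  for x, r in zip(xs, residuals))
        if best is None or err < best[0]:
            best = (err, split, lmean, rmean)
    _, s, lm, rm = best
    return lambda x: lm if x <= s else rm

def boost(xs, ys, n_stages=10, learn_rate=0.5):
    stumps, preds = [], [0.0] * len(xs)
    for _ in range(n_stages):
        residuals = [y - p for y, p in zip(ys, preds)]   # what is still wrong
        stump = fit_stump(xs, residuals)                 # fit the leftovers
        stumps.append(stump)
        preds = [p + learn_rate * stump(x) for p, x in zip(preds, xs)]
    return lambda x: sum(learn_rate * s(x) for s in stumps)

xs = [1, 2, 3, 4, 5, 6]
ys = [0, 0, 0, 1, 1, 1]
model = boost(xs, ys)
print([round(model(x), 2) for x in xs])   # close to [0, 0, 0, 1, 1, 1]
```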
And I can also look at the POJO. This time it's a tree model, not a logistic regression model, so you would expect some decisions in this tree structure. If I go down, there are all these classes, this is all Java code, and I think the tree should be somewhere; let me see, I might have to refresh this model. Oh, here we go: these are all the forests here, you can see there are a lot of forests being scored, and now we just have to find this function somewhere down there... and here it is. So here you can see that this is decision-tree logic: if your data is less than 4,000 in this column, and less than this, and less than that, then in the end your prediction will be so-and-so much; otherwise it will be this other number. Basically, this is the scoring code of this model, and you can put it right into production in Storm or any other API that you want to use; it's just Java code without any dependencies.
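For a sense of what such exported scoring code boils down to: the real POJO is generated Java, but structurally it is just nested threshold checks ending in leaf values, as in this hypothetical sketch (the column names, cut-offs, and leaf scores are invented, not from the real million-song model):

```python
# Hypothetical decision-tree scoring function in the style of generated
# model code: nothing but nested threshold comparisons, no dependencies.
def score(loudness, tempo, duration):
    if loudness < 4000.0:
        if tempo < 120.0:
            return 0.12   # leaf value (a probability-like score)
        return 0.47
    if duration < 250.0:
        return 0.63
    return 0.91

print(score(3500.0, 90.0, 300.0))   # → 0.12
```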
And you can build the same thing with deep learning: you can build a deep learning model on the same data set at the same time the other one is building, you can build a random forest model at the same time too, or a GLM, and this is all on my laptop right now. So I'm building different models at the same time, and I can ask: hey, what's their status? I can just go to the outline on the right and click on my deep learning model. Oh, it's already done; let's see how well we're doing. Here too, a good AUC, plus feature importances, the scoring history, and the metrics. You can even get a list of optimal metrics, like what's the best precision I can get, what's the best accuracy I can get, and at what threshold, so this is all geared towards the data scientist understanding what's happening.
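A sketch of what "best accuracy, and at what threshold" means: sweep a classification threshold over the model's scores and keep the one that maximizes the metric (toy scores and labels here, not actual H2O output):

```python
# Find the threshold on predicted scores that maximizes accuracy.
def best_accuracy(scores, labels):
    best_acc, best_thr = 0.0, 0.0
    for thr in sorted(set(scores)):
        preds = [1 if s >= thr else 0 for s in scores]
        acc = sum(p == y for p, y in zip(preds, labels)) / len(labels)
        if acc > best_acc:
            best_acc, best_thr = acc, thr
    return best_acc, best_thr

scores = [0.1, 0.4, 0.35, 0.8, 0.7, 0.2]   # model outputs
labels = [0,   0,   1,    1,   1,   0]     # true classes
print(best_accuracy(scores, labels))       # → (0.8333333333333334, 0.35)
```

The same sweep with precision, F1, or any other confusion-matrix metric gives the per-metric optimal thresholds shown in the model output.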
All right: while my laptop is churning out some more models, we can continue here and talk about deep learning in more detail. Deep learning, as you all know, is basically just connected neurons, and it's similar to logistic regression except that there are more multiplications going on. You take your feature times the weight, you get a number, and then you add it all up; each connection is a product of the weight times the input, that gives you some output, and then you apply a nonlinear function like a tanh, something like a smooth step function. You do this again and again and again, and at the end you have a hierarchy of nonlinear transformations, which leads to very complex nonlinearities in your model. So you can describe really weird stuff that you would otherwise not be able to with, say, a linear model, or a simple random forest that doesn't go deep enough to make up all these nonlinearities between all these features. This is basically the machinery you need for the nonlinearities in your data set.
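The per-neuron arithmetic just described can be written down in a few lines (the weights here are arbitrary numbers, just to show the weighted sum plus tanh applied layer by layer):

```python
import math

# One layer: each neuron takes a weighted sum of its inputs plus a bias,
# then applies tanh; stacking layers composes the nonlinear transforms.
def layer(inputs, weights, biases):
    return [math.tanh(sum(w * x for w, x in zip(ws, inputs)) + b)
            for ws, b in zip(weights, biases)]

x = [0.5, -1.2]                                    # two input features
hidden = layer(x, [[0.8, -0.3], [0.2, 0.9]], [0.1, -0.1])   # two hidden neurons
output = layer(hidden, [[1.5, -1.0]], [0.0])       # one output neuron
print(output)
```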
And we do this in a distributed way again, because we're using MapReduce; we're doing this on all the threads, as you saw earlier for GLM, when everything was green. Deep learning is also green; it's known to be green. I used to burn up the whole cluster running my models, and everybody else had to step back. Well, of course there's the Linux scheduler that takes care of that, but still, some claim it's not necessarily fair if I'm running some big model, so I haven't done that lately, and that's why I'm using these EC2 clusters now, or maybe my laptop from time to time. Anyway, you can see there are a lot of little details built in: it works automatically on categorical data, it automatically standardizes your data so you don't need to worry about that, it automatically imputes missing values, it automatically does regularization for you if you specify the option, and it does checkpointing, load balancing, everything. You just need to say go, and that's it; it should be super easy for anyone to just run it.
If you want to know how it works in detail, the architecture is basically this: first we distribute the data set onto the whole cluster. Let's say you have a terabyte of data and 10 nodes; every node will get 100 gigabytes of different data. Then you say: okay, I'll make an initial deep learning model, which is a bunch of weights and bias values, all just numbers, and I'll put that into some place in the store, and then I spread that to all the nodes. All my 10 nodes get a copy of the same model, and then I say: train on your local data. So all the models get trained on their local data with multi-threading, which means there are some race conditions here that make this not reproducible. But in the end you will have n models (in this case four, or on the cluster I just mentioned, ten) that have each been trained on a part of the hundred gigabytes their node holds. You don't have to process all hundred gigabytes; you can just sample some of it. And then when you're done with that, you reduce: the models basically get averaged back automatically into one model, and that one model is the one that you look at from your browser, from R, from Python. Then you do this again, and every pass can go through a fraction of your data, all of your data, or more than all of your data; you can just keep iterating without communicating. You could tell each node to just run for six weeks and then communicate, but by default it's done in a way that you spend about two percent of your time communicating on the cluster and ninety-eight percent computing, and this is all automatic. You don't need to worry about anything; you just say go, and it'll basically process the data in parallel and make a good model.
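The map-and-average loop can be sketched like this (a toy stand-in, not H2O's implementation: "training" here just nudges each weight toward its shard's mean, so the averaging structure is visible without real gradient descent):

```python
# Toy sketch of the distributed train-and-average scheme described above.
def local_train(weights, shard, lr=0.1):
    # Made-up stand-in for training: nudge weights toward the shard's mean.
    target = sum(shard) / len(shard)
    return [w + lr * (target - w) for w in weights]

def distributed_pass(weights, shards):
    local_models = [local_train(weights, shard) for shard in shards]  # map step
    n = len(local_models)
    return [sum(ws) / n for ws in zip(*local_models)]                 # reduce step

shards = [[1.0, 2.0], [3.0, 5.0], [10.0, 0.0]]   # data split across 3 "nodes"
weights = [0.0, 0.0]                             # shared initial model
for _ in range(5):                               # five communication rounds
    weights = distributed_pass(weights, shards)
print(weights)                                   # converging toward 3.5
```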
And this averaging of models, this scheme, works; there's a paper about it. But I'm also working on a new scheme called consensus ADMM, where you basically have a penalty on how far you drift from the average, but you keep your local model. That keeps everybody going on their own path in optimization land without averaging all the time; you just know when you're drifting too far, so you get pulled back a little, but you still have your own model.
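The consensus idea can be sketched roughly like this (a hypothetical update rule for illustration, not the exact ADMM equations): each node follows its own gradient but pays a penalty proportional to its distance from the current average, so local models stay loosely coupled instead of being overwritten by the mean.

```python
# Rough sketch of a consensus-style update: own gradient step plus a pull
# back toward the average model (penalty strength rho is made up).
def consensus_step(local_weights, own_gradients, rho=0.1, lr=0.5):
    n = len(local_weights)
    avg = [sum(ws) / n for ws in zip(*local_weights)]
    new_models = []
    for weights, grads in zip(local_weights, own_gradients):
        new_models.append([w - lr * g - rho * (w - a)
                           for w, g, a in zip(weights, grads, avg)])
    return new_models

models = [[1.0], [3.0], [8.0]]                    # three nodes, one weight each
models = consensus_step(models, [[0.0], [0.0], [0.0]])
print(models)   # each weight moved slightly toward the average (4.0)
```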
So that's going to be a promising upgrade soon that you can look forward to; already, as it is, it works fairly well. So this is MNIST: the digits 0 to 9, handwritten digits, 784 grayscale pixels, and you need to tell which digit it is from the grayscale pixel values. With a couple of lines here in R you can get what is actually the world record: no one has published a better number on this without using convolutional layers or other distortions. This is purely on the 60,000 training samples, no distortions, no convolutions, and you can see here all the other implementations, Geoff Hinton's and Microsoft's; 0.83 is the world record. Of course, you could say the last digit is not quite statistically significant, because you only have ten thousand test set points, but still, it's good to get down there.
So now let's do a little demo here: anomaly detection. I'll show you how we can detect the ugly digits in this MNIST data set on my laptop in a few seconds. I just have this instance up and running here from before, so I'm going to go into R. In R I have this R unit test; it runs every day, every time we commit something these tests are run, so you can definitely check them out from our GitHub page right now if you want. What this is saying is: build an autoencoder model, which is learning what's normal. So it connects to my cluster right now, and it learns what a normal digit is without knowing what the digits are; it just looks at all the data and learns what's normal.
And how does it do that? Well, it takes the 784 pixels and compresses them into, in this case, 50 neurons, 50 numbers, and then tries to turn that back into 784. So it's learning the identity function of this data set in a compressed way. If you can somehow represent the data with these 50 numbers, and you know the weights connecting in and out, then those 50 numbers mean something: that's what it takes to represent those 10 digits. Let's say that's roughly five numbers per digit, and those five numbers are enough to say there's an edge here, a round thing here, a hole here, something like that: the features. With these 50 numbers in the middle, and of course the connectivity that makes up the reconstruction, basically the encoding and the decoding, you can now say what's normal or not.
say what’s normal or not so because now
Now I'll take the test set, I let it go through this network, and I see what comes out on the other side; if it doesn't look like the original input, then it didn't match my vision of what this should look like. Right, so I'm going to let the test set go through this model. First I need to train the model, so right now it's building this model on my laptop: 50 hidden neurons, tanh activation function, and autoencoder set to true. And I had a couple of extra options, but those are just to say: don't drop any of the constant columns that are all zero, because I want to plot them at the end.
Okay, so now let's look at the outlierness of every point. We just scored the test set and computed the reconstruction error: how different is the output from the input, how bad is the identity mapping that I learned, for the test set points? And those points that are kind of ugly won't match what's normal in the training data. Right, that's an intuitive thing.
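That scoring step can be sketched in a few lines of plain Python (my own toy illustration, not H2O's code; the random `X` and `X_hat` merely stand in for test images and their reconstructions):

```python
import random

def reconstruction_error(row, reconstructed_row):
    """Mean squared error between one input row and its reconstruction."""
    return sum((a - b) ** 2 for a, b in zip(row, reconstructed_row)) / len(row)

random.seed(42)
# stand-ins for 10 test images of 784 pixels and their reconstructions
X = [[random.random() for _ in range(784)] for _ in range(10)]
X_hat = [[p + random.gauss(0, 0.05) for p in row] for row in X]

# score every row, then rank rows from best-reconstructed to worst
errors = [reconstruction_error(x, xh) for x, xh in zip(X, X_hat)]
ranked = sorted(range(len(errors)), key=errors.__getitem__)
best, median, worst = ranked[0], ranked[len(ranked) // 2], ranked[-1]
```

Plotting the rows at `best`, `median`, and `worst` is exactly the comparison in the demo: the rows the compressed identity function reproduces well are "normal", and the ones with the largest error are the outliers.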
All right, so now let's plot the ones that match the best, the top 25. That's the reconstruction; and now let's look at the actual ones: well, the same thing, right? They match the best, so they have to look the same. These are the ones that are the easiest to represent with your identity function: just take the middle numbers and basically say, keep them. Now let's look at the ones in the middle out of the 10,000, the ones with the median reconstruction error. These are still reasonably good: you can tell that they're digits, but they're already not quite as pretty anymore. And now let's look at the ugliest outliers, so to speak, in the test set. These are all digits that are coming out of my network, but they're not really digits anymore, right? So something went wrong; basically, the reconstruction failed. The model said these are ugly, and if you look at them, they are kind of ugly: some of them are almost not digits anymore, cut off, or the top right one, for example, is ugly. And you can tell that if you remember the bottom line, like in the optics test, the vision exam: 6, 40, 35.
Right, let's go look at my slides: totally different. So every time I run it, it's different, because it's neural nets with multithreading. I can turn it on to be reproducible, but then I have to say: use one thread, don't do any of this Hogwild race-condition updating of the weight matrix by multiple threads at the same time, just run one thread straight through with a given seed, and then just wait until that one thread is done; then it will be reproducible. But in this case I chose not to do that because it's faster this way, and the results are fine anyway: every time you run it you'll get something like this, and you will not get the ugly digits showing up as the good ones.
Right, so this shows you basically that this is a robust thing. And again, here, this is the network topology. So I can also go back to the browser now, go to localhost, and clean everything up. By the way, this here just ran all the models, so if I say get models, I should see all the models that were built. The last four R models were built on the million song data set earlier, and the top one is the one I built from R, the autoencoder. And you can see the autoencoder reconstruction error started at 0.08 mean squared error and now it's at 0.02, so it got it down; it improved from random noise. For autoencoders you always want to check this convergence: it has to learn something, right, the identity mapping. And you can also see here the status of the neuron layers, the thing I showed you earlier.
And of course you can also get a POJO again. In this case it's a neural net, so you would expect some weights here and some there. What is this? Oh, that's the neurons; here we go. I would expect the model to show up somewhere; see, there are a lot of declarations you have to make to hold all these 784 features. So if this is too little for the preview, then we have to look at the other model we have. Yeah, let's go back to get models and click on the other deep learning model we made earlier, on the million song data set, and look at its POJO; that should be smaller, because there were only 90 predictors.
Okay, here we go. So now you should see the deep learning math actually printed out in plain text, so you can always check it: the activation, something for the numericals, something for the categoricals if you had any (in this case there are none), and then it has the weights, the activations, the biases, and it does this matrix-vector multiplication, the a-times-x-plus-y. This is the matrix-vector multiplication that's inside of the deep learning model, and you can see here we do some partial-sum tricks to be faster, to basically allow the CPU to do more additions and multiplications at the same time. So all of this is optimized for speed, and this is as fast as any C++ implementation or anything, because we don't really have GC issues here: all the arrays are allocated one time and then just processed.
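The partial-sum trick is easy to sketch outside the generated Java (a toy illustration of the idea, not the POJO's actual code): accumulating into several independent sums breaks the serial dependency chain, which is what lets the CPU overlap its additions and multiplications.

```python
def matvec_partial_sums(A, x):
    """y = A*x with four independent partial sums per row.

    Separate accumulators break the serial add-dependency chain, which is
    what lets a CPU overlap the multiply-adds; in Python this only shows
    the arithmetic, not the speedup."""
    y = []
    for row in A:
        s0 = s1 = s2 = s3 = 0.0
        c = 0
        while c + 4 <= len(x):
            s0 += row[c] * x[c]
            s1 += row[c + 1] * x[c + 1]
            s2 += row[c + 2] * x[c + 2]
            s3 += row[c + 3] * x[c + 3]
            c += 4
        for k in range(c, len(x)):  # leftover columns
            s0 += row[k] * x[k]
        y.append(s0 + s1 + s2 + s3)
    return y

A = [[1.0, 2.0, 3.0, 4.0, 5.0], [0.5, 0.5, 0.5, 0.5, 0.5]]
x = [1.0, 1.0, 1.0, 1.0, 1.0]
print(matvec_partial_sums(A, x))  # -> [15.0, 2.5]
```

A compiled loop (C++, or the JIT-compiled POJO) is where the unrolling actually pays off; the Python version just makes the bookkeeping visible.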
All right, so now let's get back to the bigger problems: deep learning and the Higgs boson. Who has seen this data set before? Okay, great. So this is physics, right: the 13-billion-dollar biggest scientific experiment ever. This data set has about 10 million rows; they're detector events, and each detector event has 21 numbers coming out, saying this is what I measured for certain things. And then the physicists come up with seven more numbers that they compute from those 21, something like the square root of this squared minus that squared, and those formulas (or formulae) actually help. And you can see it down there: if you take just the low-level numbers, this is the AUC you get, where 0.5 is random and 1 would be perfect, and it goes up by almost something like 10 points if you add those extra features. So it's very valuable to have physicists around to tell you what to do, right?
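A quick reminder of the metric (my own sketch, not H2O's implementation): AUC is the probability that a randomly chosen signal event is scored above a randomly chosen background event, which is why 0.5 is random guessing and 1.0 is perfect.

```python
def auc(labels, scores):
    """AUC via the rank-sum view: the probability that a random positive
    outranks a random negative; ties count half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

print(auc([0, 0, 1, 1], [0.1, 0.2, 0.8, 0.9]))  # perfect ranking -> 1.0
print(auc([0, 1, 0, 1], [0.5, 0.5, 0.5, 0.5]))  # constant scores -> 0.5
```

The quadratic pairwise loop is only for clarity; real scorers sort once and use ranks.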
But CERN basically had this baseline here of 81; that was how well it was working for them. They used gradient boosted trees and neural networks with one hidden layer, so their baseline was 81 AUC. And this paper came along last summer saying: we can do better than that with deep learning; and they published some numbers, and now we are going to run the same thing and see what we can do. So I'm going back to my cluster, my EC2 8-node cluster, and I'll say: get frames.
And I have the Higgs data set there already, because I parsed it earlier; you can see here 11 million rows and 29 columns, 2 gigabytes compressed (there is not much to compress because it's all doubles). And now I'm going to run a deep learning model; I already saved the flow for that. So this flow says: take the split data set (I split it into ninety percent and five and five percent, so ten million and half a million each), take the training data and the validation data, and tell me how you're doing along the way. So: go. It builds a three-layer network and uses the rectifier activation, everything else is default, and now it's running.
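The split he describes can be sketched like this (a scaled-down toy under assumed names, not H2O's split-frame implementation): shuffle the row indices once, then cut them by fraction.

```python
import random

def split_rows(n_rows, fractions, seed=1234):
    """Shuffle row indices once, then cut them into len(fractions) pieces."""
    idx = list(range(n_rows))
    random.Random(seed).shuffle(idx)
    pieces, start = [], 0
    for f in fractions[:-1]:
        size = int(n_rows * f)
        pieces.append(idx[start:start + size])
        start += size
    pieces.append(idx[start:])  # remainder goes to the last piece
    return pieces

# scaled down: 11,000 rows instead of 11 million
train, valid, test = split_rows(11_000, [0.90, 0.05, 0.05])
print(len(train), len(valid), len(test))  # -> 9900 550 550
```

Training sees the 90% piece, the validation piece drives the "tell me how you're doing along the way" scoring, and the last piece stays untouched for a final check.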
So let's go look at the water meter. Okay, here we go: deep learning is taking over the cluster. And now it's communicating, and now it's sending that back out, and then computing again; this might be the initial phase where it first has to rebalance the data set or something. Usually you'll see it go up, down, up, down, so let's wait until the next communication. But you'll see that all the CPUs are busy updating weights with stochastic gradient descent, which means it takes a point, it trains: it goes through the network, makes a prediction, says how wrong it is, and corrects the weights; all the weights that are affected get fixed, basically, by every single point. There's no mini-batch or anything: every single point updates the whole model.
And that's done by all the threads in parallel, so you'll have eight threads in parallel changing those weights. And if I read your write, whatever, we just compete; but usually we write different weights, right? There are millions of weights, so you don't overwrite each other too often, or have someone else reading at the same time, or something.
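A minimal sketch of that Hogwild-style loop (my own illustration, not H2O's implementation): several threads hammer one shared weight vector with no locks at all, accepting the occasional lost update because, with millions of weights, collisions are rare; the reproducible mode is just one seeded thread run to completion.

```python
import random
import threading

# shared model: one flat weight vector (millions of entries in the real thing)
weights = [0.0] * 1000

def sgd_worker(seed, steps, lr=0.01):
    """Unsynchronized SGD: read-modify-write shared weights with no locks."""
    rng = random.Random(seed)
    for _ in range(steps):
        i = rng.randrange(len(weights))   # which weight this data point touches
        grad = rng.uniform(-1.0, 1.0)     # stand-in for a real gradient
        weights[i] -= lr * grad           # racy update; a collision loses one update

# eight threads changing the same weight vector in parallel
threads = [threading.Thread(target=sgd_worker, args=(s, 10_000)) for s in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# reproducible mode would instead be one seeded worker, run to completion:
# sgd_worker(seed=42, steps=80_000)
```

The trade-off is exactly the one in the talk: lock-free parallel updates are much faster, at the cost of run-to-run reproducibility.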
So you can see here it's mostly busy computing. If you want to know what exactly it's doing, you can also click on the profiler here, and it will show you a stack trace, the stack traces sorted by count, of what's happening. So this was basically just communicating; let's do this again, now it's going to be slightly different. Oh, I see: so now it was saying these are basically idle, because we have eight nodes and there are seven others, and there's one thread for reads and one for writes, so we've got 14 threads actively listening for communication. Here some are in the back propagation, some of them are in the forward propagation, so you can see the exact things that are going on at any moment in time, for every node; right, you can go to a different node and you can see the same behavior. So they're all just busy computing.
So while this model is building, we can ask how well it's doing. Remember that 81 baseline with the human features; let's see what we have here on the validation data set. It's already at 79: this already beat all the random forest, gradient boosted, and neural net methods that they had at CERN for many years. So those models there on the left, the ones that had 75, 76: already beaten by this deep learning model we just ran, and this wasn't even a good model; it was just small, like a hundred neurons in each layer, right? So this is very powerful, and by the time we finish it will actually get to over 87 AUC, which is what the paper reported; they have an 88. They trained theirs for weeks on a GPU, and of course they had only this data set and nothing else to worry about, and this is still a small data set. But you can see the power of deep learning, right? Especially if you feed it more data and you give it more neurons, it'll train and learn everything; it's like a brain that's trying to learn, like a baby's brain, it's just sucking up all the information. And after 40 minutes you'll get an 84 AUC, which is pretty impressive, right? It beats all the other baseline methods, even the ones with the human features, and this is without using the human features: you don't need to know anything, you just take the sensor data out of your machine and say: go.
Another use case was deep learning used for crime detection, and this is actually Chicago; who can recognize this pattern? So my colleagues Alex and Michal, they wrote an article that you can read here, on Datanami, just a few days ago, and they're using Spark and h2o together to take three different data sets and turn them into something that you can use to predict whether a crime is going to lead to an arrest or not. So you take the crime data set, you take the census data set to know something about the socioeconomic factors, and you take the weather data set, because the weather might have an impact on what's happening, and you put them all together in Spark. First you parse them in h2o, because we know that the parser works and it's fine; so in our demo we just suck it all into h2o, we send it over to Spark in the same JVM, and then we do an SQL join, and once you're done we split it again in h2o and then we build a deep learning model and, for example, a GBM model; I think these two are being built by the demo script that's available. So again, both h2o's and Spark's memory is shared, it's the same JVM, there's no Tachyon layer or anything, they are basically able to transparently go from one side to the other.
And the product of course is called Sparkling Water, which was a brilliant name, I think. All right, so this is the place on GitHub where you would find this example. You would download Sparkling Water from our download page, and then you would go into that directory, set two environment variables pointing to Spark and saying how many nodes you want, and then you would start the sparkling shell and copy-paste this code into it, for example if you want to do it interactively.
want to do it interactively so you can see here there’s a couple of imports you
see here there’s a couple of imports you
see here there’s a couple of imports you import deep learning in GBM and some
import deep learning in GBM and some
import deep learning in GBM and some spark stuff and then you basically
spark stuff and then you basically
spark stuff and then you basically connect to the h2o cluster we parse
connect to the h2o cluster we parse
connect to the h2o cluster we parse datasets this way this is just a
datasets this way this is just a
datasets this way this is just a function definition that gets used by
function definition that gets used by
function definition that gets used by these other functions that actually do
these other functions that actually do
these other functions that actually do the work to load the data and then you
the work to load the data and then you
the work to load the data and then you can drop some columns and do some simple
can drop some columns and do some simple
can drop some columns and do some simple munging in this case here we do some
munging in this case here we do some
munging in this case here we do some date manipulations to standardize the
date manipulations to standardize the
date manipulations to standardize the three datasets to have the same date
three datasets to have the same date
three datasets to have the same date format so that we can join on it later
format so that we can join on it later
format so that we can join on it later and you basically just take these three
and you basically just take these three
and you basically just take these three datasets these are just small for a demo
datasets these are just small for a demo
datasets these are just small for a demo but in reality they of course use the
but in reality they of course use the
but in reality they of course use the whole data set on a cluster and then
whole data set on a cluster and then
whole data set on a cluster and then once you have these three datasets in
once you have these three datasets in
once you have these three datasets in memory as h2o objects we just converted
memory as h2o objects we just converted
memory as h2o objects we just converted to a schema led with this call here and
to a schema led with this call here and
to a schema led with this call here and now to become spark or disease for which
now to become spark or disease for which
now to become spark or disease for which you can just call like a select
you can just call like a select
you can just call like a select statement in SQL and then some join and
statement in SQL and then some join and
statement in SQL and then some join and another join and all that it’s very nice
another join and all that it’s very nice
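That date standardization is what makes the SQL joins line up. A minimal pure-Python sketch of the same idea, with hypothetical column names and date formats rather than the actual demo code:

```python
from datetime import datetime

# Hypothetical source formats: each data set writes dates differently.
CRIME_FMT = "%m/%d/%Y %I:%M:%S %p"
WEATHER_FMT = "%Y-%m-%d"

def standardize(date_str, fmt):
    """Parse a date string and re-emit it in one canonical format."""
    return datetime.strptime(date_str, fmt).strftime("%Y-%m-%d")

crime = [{"date": standardize("03/15/2015 04:30:00 PM", CRIME_FMT), "type": "THEFT"}]
weather = [{"date": standardize("2015-03-15", WEATHER_FMT), "temp_f": 41}]

# A tiny hash join on the canonical date key, like the SQL join in the demo.
weather_by_date = {row["date"]: row for row in weather}
joined = [{**c, **weather_by_date[c["date"]]}
          for c in crime if c["date"] in weather_by_date]
```

Once every source emits the same canonical key, the join itself is trivial, which is exactly why the munging step comes first.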
another join and all that it’s very nice right this is a nice well understood API
right this is a nice well understood API
right this is a nice well understood API the people can use and h2o does not have
the people can use and h2o does not have
the people can use and h2o does not have this at this point but we’re working on
this at this point but we’re working on
this at this point but we’re working on that so at some point we’ll have more
that so at some point we’ll have more
that so at some point we’ll have more managing capabilities but for now you
managing capabilities but for now you
managing capabilities but for now you can definitely benefit from the whole
can definitely benefit from the whole
can definitely benefit from the whole spark ecosystem to do what it’s good for
spark ecosystem to do what it’s good for
spark ecosystem to do what it’s good for so here in this case but is this we say
so here in this case but is this we say
so here in this case but is this we say here’s a crime better data set that we
here’s a crime better data set that we
here’s a crime better data set that we after be splitted I think we spent we
after be splitted I think we spent we
after be splitted I think we spent we bring it back into h2o yes this is an
bring it back into h2o yes this is an
bring it back into h2o yes this is an HTML helper function to split and now we
HTML helper function to split and now we
HTML helper function to split and now we have basically a joint data set that
have basically a joint data set that
have basically a joint data set that knows all about the socioeconomic
knows all about the socioeconomic
knows all about the socioeconomic factors about the way
factors about the way
factors about the way for a given time at a given place and
for a given time at a given place and
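That split helper is just a seeded random partition of rows into train and validation pieces. A plain-Python sketch of the idea; the 75/25 ratio and the function shape are illustrative assumptions, not h2o's API:

```python
import random

def split_frame(rows, ratios=(0.75, 0.25), seed=42):
    """Randomly partition rows into len(ratios) disjoint splits."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    splits, start = [], 0
    for r in ratios:
        end = start + round(r * len(rows))
        splits.append(shuffled[start:end])
        start = end
    return splits

train, valid = split_frame(list(range(100)))
```

The fixed seed makes the split reproducible, which matters when you want to compare several models on the same holdout.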
for a given time at a given place and then we can build a deep learning model
then we can build a deep learning model
then we can build a deep learning model just like you would do this in Java
just like you would do this in Java
just like you would do this in Java Scala is very similar right you don’t
Scala is very similar right you don’t
Scala is very similar right you don’t need to do much porting it’s just the
need to do much porting it’s just the
need to do much porting it’s just the same members that you’re setting and
same members that you’re setting and
same members that you’re setting and then you say run train model that gets
then you say run train model that gets
then you say run train model that gets basically and that that at the end you
basically and that that at the end you
basically and that that at the end you have a model available that you can use
have a model available that you can use
have a model available that you can use to make predictions and it’s very simple
to make predictions and it’s very simple
to make predictions and it’s very simple and you can definitely follow the
and you can definitely follow the
and you can definitely follow the tutorials in the interest of time I’ll
tutorials in the interest of time I’ll
tutorials in the interest of time I’ll just show you the sparkling she’ll start
just show you the sparkling she’ll start
just show you the sparkling she’ll start here I’m basically able to do this on my
here I’m basically able to do this on my
here I’m basically able to do this on my laptop as well while the other one is
laptop as well while the other one is
laptop as well while the other one is still running here you see spark is
still running here you see spark is
still running here you see spark is being launched and now it’s scheduling
being launched and now it’s scheduling
being launched and now it’s scheduling those three worker nodes to come up once
those three worker nodes to come up once
those three worker nodes to come up once it’s ready I can copy paste some code in
it’s ready I can copy paste some code in
it’s ready I can copy paste some code in there and the code I would get from the
there and the code I would get from the
there and the code I would get from the website here Chicago Crime demo it’s all
website here Chicago Crime demo it’s all
website here Chicago Crime demo it’s all on github
So in the Sparkling Water GitHub project, under examples, there are some scripts, and so I can just take this stuff here and copy-paste it all. Oops. I'm sure you believe me that this is all doable, right? So here Spark is now ready, and I just copy-paste this in, and here it goes. So that's how easy it is to do Spark and h2o together. And then also, once you have something in memory in the h2o cluster, right, the model for example, or some data sets, you can just ask Flow to visualize it. You can just type this JavaScript, or CoffeeScript rather, expression and plot anything you want against anything, and you'll see these interactive plots where you can mouse over and it will show you what it is and so on. So it's very cool.
what it is and so on so it’s very cool you can plot for example the arrest rate
you can plot for example the arrest rate
you can plot for example the arrest rate versus the relative occurrence of an
versus the relative occurrence of an
versus the relative occurrence of an arrest for example gambling is always
arrest for example gambling is always
arrest for example gambling is always arrested why is that well because
arrested why is that well because
arrested why is that well because otherwise you wouldn’t know that the
otherwise you wouldn’t know that the
otherwise you wouldn’t know that the gambling person was cheating or
gambling person was cheating or
gambling person was cheating or something so so you basically have to
something so so you basically have to
something so so you basically have to rest them right otherwise you don’t know
rest them right otherwise you don’t know
rest them right otherwise you don’t know what’s happening some of the things are
what’s happening some of the things are
what’s happening some of the things are undetected but the theft for example
undetected but the theft for example
undetected but the theft for example it’s not always arrested because someone
it’s not always arrested because someone
it’s not always arrested because someone knows that it was stolen without the
knows that it was stolen without the
knows that it was stolen without the person actually being caught so you have
person actually being caught so you have
person actually being caught so you have to be careful about all this data
to be careful about all this data
to be careful about all this data science stuff but basically can plot
science stuff but basically can plot
science stuff but basically can plot whatever you want against whatever you
whatever you want against whatever you
whatever you want against whatever you want and that’s pretty powerful and we
want and that’s pretty powerful and we
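The quantity behind those plots, the arrest rate per crime type, is a one-line aggregation. A plain-Python sketch with made-up records (hypothetical crime types and outcomes; Flow computes this from the real frame):

```python
from collections import Counter

# Hypothetical (crime_type, arrested) records standing in for the Chicago data.
crimes = [("GAMBLING", True), ("GAMBLING", True),
          ("THEFT", False), ("THEFT", True), ("THEFT", False)]

totals = Counter(t for t, _ in crimes)          # occurrences per crime type
arrests = Counter(t for t, a in crimes if a)    # arrests per crime type

# Arrest rate per crime type, the quantity plotted against occurrence.
arrest_rate = {t: arrests[t] / totals[t] for t in totals}
```

With the invented records above, gambling comes out at rate 1.0 and theft at 1/3, mirroring the caveat in the talk: the rate reflects how the crime gets detected, not just how often it happens.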
want and that’s pretty powerful and we have our state up table now in house so
have our state up table now in house so
have our state up table now in house so Matt Dowell joined us recently he he
Matt Dowell joined us recently he he
Matt Dowell joined us recently he he wrote the fastest data table a
wrote the fastest data table a
wrote the fastest data table a processing engine in our and this is
processing engine in our and this is
processing engine in our and this is used for financial institutions that
used for financial institutions that
used for financial institutions that like to do aggregates a lot so just what
like to do aggregates a lot so just what
like to do aggregates a lot so just what you saw on the previous slide will soon
you saw on the previous slide will soon
you saw on the previous slide will soon have all this in H to go in a scalable
have all this in H to go in a scalable
have all this in H to go in a scalable way that we can do fast joins aggregates
way that we can do fast joins aggregates
way that we can do fast joins aggregates and so on and the same thing of course
and so on and the same thing of course
and so on and the same thing of course goes for Python you have ipython
goes for Python you have ipython
goes for Python you have ipython notebooks and there’s an example to do
notebooks and there’s an example to do
notebooks and there’s an example to do something for the city bike company in
something for the city bike company in
something for the city bike company in New York City where you want to know how
New York City where you want to know how
New York City where you want to know how many bikes do you need for stations such
many bikes do you need for stations such
many bikes do you need for stations such that you don’t run out of bikes so let’s
that you don’t run out of bikes so let’s
that you don’t run out of bikes so let’s say you have 10 million rows of
say you have 10 million rows of
say you have 10 million rows of historical data and you have some better
historical data and you have some better
historical data and you have some better data you would imagine it you can join
data you would imagine it you can join
data you would imagine it you can join those two and then basically based on
those two and then basically based on
those two and then basically based on location
location
location in time and better you can predict how
in time and better you can predict how
in time and better you can predict how many bikes you’ll need right so if I
many bikes you’ll need right so if I
many bikes you’ll need right so if I know today it’s going to be or tomorrow
know today it’s going to be or tomorrow
know today it’s going to be or tomorrow is going to be that better I know I need
is going to be that better I know I need
is going to be that better I know I need 250 bikes at that station or something
250 bikes at that station or something
250 bikes at that station or something and cliff our CTO who-who wrote a jit
and cliff our CTO who-who wrote a jit
and cliff our CTO who-who wrote a jit basically also wrote this data science
basically also wrote this data science
basically also wrote this data science example here so you can see there’s a
example here so you can see there’s a
example here so you can see there’s a group by the top from ipython notebooks
group by the top from ipython notebooks
group by the top from ipython notebooks and to show you that this is also life
and to show you that this is also life
and to show you that this is also life impossible here I do this here I’ll type
impossible here I do this here I’ll type
impossible here I do this here I’ll type ipython notebook citibike small and up
ipython notebook citibike small and up
ipython notebook citibike small and up pops up my my my browser with ipython
pops up my my my browser with ipython
pops up my my my browser with ipython notebook I will delete all the output
notebook I will delete all the output
notebook I will delete all the output cells so we don’t cheat and I say go and
cells so we don’t cheat and I say go and
cells so we don’t cheat and I say go and now it’s connecting to the cluster that
now it’s connecting to the cluster that
now it’s connecting to the cluster that I started 30 minutes ago this means i
I started 30 minutes ago this means i
I started 30 minutes ago this means i still have a little bit of time left i
still have a little bit of time left i
still have a little bit of time left i will load some data here it up we go and
will load some data here it up we go and
will load some data here it up we go and then let’s look at the data describe it
then let’s look at the data describe it
then let’s look at the data describe it you can see here some some mean max and
you can see here some some mean max and
you can see here some some mean max and so on whatever this is like a
so on whatever this is like a
so on whatever this is like a distribution of the chunk of the frame
distribution of the chunk of the frame
distribution of the chunk of the frame how many rows out of each machine in
how many rows out of each machine in
how many rows out of each machine in this case is only one machine oops
this case is only one machine oops
this case is only one machine oops there’s only one machine basically some
there’s only one machine basically some
there’s only one machine basically some statistics that tells you how is the
statistics that tells you how is the
statistics that tells you how is the data distributed across the cluster what
data distributed across the cluster what
data distributed across the cluster what kinds of columns do I have what is their
kinds of columns do I have what is their
kinds of columns do I have what is their mean max and so on all available from
mean max and so on all available from
mean max and so on all available from from Python then you can do a group by
from Python then you can do a group by
from Python then you can do a group by you don’t need to know all that but
you don’t need to know all that but
you don’t need to know all that but basically just you want to know like at
basically just you want to know like at
basically just you want to know like at what time of the day or what they how
what time of the day or what they how
what time of the day or what they how many bikes are bitch station and so on
many bikes are bitch station and so on
many bikes are bitch station and so on you can see that there’s a big
you can see that there’s a big
you can see that there’s a big distribution here that’s some some
distribution here that’s some some
distribution here that’s some some places only need 9 bikes on basically
places only need 9 bikes on basically
places only need 9 bikes on basically the under bikes or even more and so on
the under bikes or even more and so on
the under bikes or even more and so on right and you can do quantiles you see
right and you can do quantiles you see
right and you can do quantiles you see the quantiles here from one percent all
the quantiles here from one percent all
the quantiles here from one percent all the way to ninety-nine percent and you
the way to ninety-nine percent and you
the way to ninety-nine percent and you see that there’s some pretty big numbers
see that there’s some pretty big numbers
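Those group-by and quantile summaries can be sketched in plain Python. The station names and ride counts below are invented for illustration, and the quantile uses a simple nearest-rank rule rather than whatever interpolation h2o applies:

```python
from collections import defaultdict

# Hypothetical (station, hour, rides) records standing in for the Citi Bike data.
rides = [("Lafayette St", 8, 120), ("Lafayette St", 9, 95),
         ("W 52 St", 8, 9), ("W 52 St", 9, 14)]

# Group-by: total rides per station, like the notebook's group-by cell.
per_station = defaultdict(int)
for station, hour, n in rides:
    per_station[station] += n

def quantile(values, q):
    """Nearest-rank quantile of a list of numbers (0 < q <= 1)."""
    ordered = sorted(values)
    idx = max(0, round(q * len(ordered)) - 1)
    return ordered[idx]

counts = [n for _, _, n in rides]
p99 = quantile(counts, 0.99)  # the big numbers in the 99% tail
```

The 99th-percentile count is what you would size the stations for if you never want to run out of bikes.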
see that there’s some pretty big numbers here you can make new features stay if
here you can make new features stay if
here you can make new features stay if the weekends on you can build models so
the weekends on you can build models so
the weekends on you can build models so this is the fun part we have a bill to
this is the fun part we have a bill to
this is the fun part we have a bill to GBM we build a random forest we build a
GBM we build a random forest we build a
GBM we build a random forest we build a glm and we build a deep learning model
glm and we build a deep learning model
glm and we build a deep learning model all on the same data that was joined
all on the same data that was joined
all on the same data that was joined earlier and so now let’s say do this go
earlier and so now let’s say do this go
earlier and so now let’s say do this go so now it’s building a GBM
so now it’s building a GBM
so now it’s building a GBM all of my laptop so if I went to my
all of my laptop so if I went to my
all of my laptop so if I went to my laptop right now I could say get models
laptop right now I could say get models
laptop right now I could say get models and these models would just magically
and these models would just magically
and these models would just magically pop up and this is deep learning and now
pop up and this is deep learning and now
pop up and this is deep learning and now we can see how well they’re doing and
we can see how well they’re doing and
we can see how well they’re doing and you get the idea right so now we get a
you get the idea right so now we get a
you get the idea right so now we get a 92 AAC by deep learning but the 93 a or
92 AAC by deep learning but the 93 a or
92 AAC by deep learning but the 93 a or c by GBM but deep learning even took a
c by GBM but deep learning even took a
c by GBM but deep learning even took a little less time than GBM so you could
little less time than GBM so you could
little less time than GBM so you could say that both are very powerful methods
say that both are very powerful methods
say that both are very powerful methods they beat the random forests and the
they beat the random forests and the
they beat the random forests and the linear models here but of course nothing
linear models here but of course nothing
linear models here but of course nothing beats the linear model in terms of time
beats the linear model in terms of time
beats the linear model in terms of time Oh point one second to get an 81 and you
Oh point one second to get an 81 and you
Oh point one second to get an 81 and you see it’s pretty remarkable it’s 50 times
see it’s pretty remarkable it’s 50 times
see it’s pretty remarkable it’s 50 times faster and a random forest all right so
faster and a random forest all right so
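The AUC numbers being compared have a simple meaning: AUC is the probability that a randomly chosen positive example gets a higher score than a randomly chosen negative one. A minimal pure-Python version, just to show what a 0.92 or 0.93 means (h2o computes this for you):

```python
def auc(labels, scores):
    """AUC via pairwise comparison: P(score_pos > score_neg), ties count half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# A perfect ranker gets 1.0, a random one about 0.5.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(auc(labels, scores))  # → 0.75
```

This pairwise form is quadratic in the number of examples; real implementations sort by score instead, but the value is the same.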
faster and a random forest all right so you believe me that I Python works as
you believe me that I Python works as
you believe me that I Python works as well once you join the better data with
well once you join the better data with
well once you join the better data with a simple merge command here in the
a simple merge command here in the
a simple merge command here in the middle somewhere then you get a little
middle somewhere then you get a little
middle somewhere then you get a little lift here because then you can even
lift here because then you can even
lift here because then you can even predict better you need bikes are not
predict better you need bikes are not
predict better you need bikes are not based on better right make sense if it
based on better right make sense if it
based on better right make sense if it rains you might need fewer bikes so any
rains you might need fewer bikes so any
rains you might need fewer bikes so any anything you might wonder what to do
anything you might wonder what to do
anything you might wonder what to do with GBM linear models with deep
with GBM linear models with deep
with GBM linear models with deep learning there’s booklets for that and
learning there’s booklets for that and
learning there’s booklets for that and we’re currently rewriting them to the
we’re currently rewriting them to the
we’re currently rewriting them to the new version of h2o which will have
new version of h2o which will have
new version of h2o which will have slightly updated api’s and stuff for
slightly updated api’s and stuff for
slightly updated api’s and stuff for consistency across our Python Scala JSON
consistency across our Python Scala JSON
consistency across our Python Scala JSON and so on so it’s going to be very nice
and so on so it’s going to be very nice
and so on so it’s going to be very nice and rewritten everything from scratch a
and rewritten everything from scratch a
and rewritten everything from scratch a major effort but now we’re basically
major effort but now we’re basically
major effort but now we’re basically going to be ready for release I think
going to be ready for release I think
going to be ready for release I think this week actually so and another ! is
this week actually so and another ! is
this week actually so and another ! is that we’re currently number one at this
that we’re currently number one at this
that we’re currently number one at this caracal challenge Marc Landry who just
caracal challenge Marc Landry who just
caracal challenge Marc Landry who just joined us who has been on teammates to
joined us who has been on teammates to
joined us who has been on teammates to go for a while he was at the h2o world
go for a while he was at the h2o world
go for a while he was at the h2o world last fall he is actually going to work
last fall he is actually going to work
last fall he is actually going to work full-time almost half his time on Kaggle
full-time almost half his time on Kaggle
full-time almost half his time on Kaggle challenges using h2o so we’ll be excited
challenges using h2o so we’ll be excited
challenges using h2o so we’ll be excited to see this go across the finish line
to see this go across the finish line
to see this go across the finish line and they will share how we did this or
and they will share how we did this or
and they will share how we did this or rather he will share how he did it
rather he will share how he did it
rather he will share how he did it because so far mark did most of the work
because so far mark did most of the work
because so far mark did most of the work next week at h2o in Mountain View and
next week at h2o in Mountain View and
next week at h2o in Mountain View and they’ll be live-streamed as well so if
they’ll be live-streamed as well so if
they’ll be live-streamed as well so if you can make it be sure to listen in and
you can make it be sure to listen in and
you can make it be sure to listen in and these are some examples of other caracal
these are some examples of other caracal
these are some examples of other caracal applications
applications
applications we have demo scripts that are posted
we have demo scripts that are posted
we have demo scripts that are posted that are available and for example this
that are available and for example this
that are available and for example this one I had hosted a few other maybe a
one I had hosted a few other maybe a
one I had hosted a few other maybe a month ago or so I posted this example
month ago or so I posted this example
month ago or so I posted this example GBM random parameter tooling logic where
GBM random parameter tooling logic where
GBM random parameter tooling logic where you basically just make ten models with
you basically just make ten models with
you basically just make ten models with random parameters and see which one is
random parameters and see which one is
random parameters and see which one is the best that sometimes useful
the best that sometimes useful
the best that sometimes useful especially if you have many dimensions
especially if you have many dimensions
especially if you have many dimensions to optimize over and we don’t have
to optimize over and we don’t have
to optimize over and we don’t have Beijing optimization yet but this might
Beijing optimization yet but this might
Beijing optimization yet but this might be more efficient than just a brute
be more efficient than just a brute
be more efficient than just a brute force grid search because the machine
force grid search because the machine
force grid search because the machine gets luckier more than you tell it to be
gets luckier more than you tell it to be
gets luckier more than you tell it to be lucky if you want that’s why montecarlo
lucky if you want that’s why montecarlo
lucky if you want that’s why montecarlo integration works in higher and four
integration works in higher and four
integration works in higher and four dimensions the same thing is true with
dimensions the same thing is true with
dimensions the same thing is true with hyper parameter finding so don’t shy
hyper parameter finding so don’t shy
hyper parameter finding so don’t shy away from these random approaches
away from these random approaches
away from these random approaches they’re very powerful so this is the
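The random-search idea described here can be sketched in a few lines of plain Python. This is not the H2O demo script itself; the GBM-style parameter names and the toy scoring function below are made up for illustration:

```python
import random

def random_search(param_space, score_fn, n_models=10, seed=42):
    """Draw n_models random parameter combinations and keep the best one."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_models):
        # Sample one value per hyperparameter dimension.
        params = {name: rng.choice(values) for name, values in param_space.items()}
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Hypothetical GBM-style search space and a toy scoring function
# (higher is better; peaks at max_depth=5, learn_rate=0.05).
space = {
    "ntrees": [50, 100, 200],
    "max_depth": [3, 5, 7, 9],
    "learn_rate": [0.01, 0.05, 0.1],
}
toy_score = lambda p: -abs(p["max_depth"] - 5) - abs(p["learn_rate"] - 0.05)
best, score = random_search(space, toy_score)
```

The point of the talk holds even in this toy: ten random draws often land near the best region of a 36-point grid for a fraction of the cost of evaluating all of it.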
So this is the outlook: lots of stuff to do for data science now that we have this machinery in place that can scale to big data sets. Customers are saying, well, do I need to find parameters myself? Yeah, sure; automatic hyperparameter tuning is great, and we'll do that for you. Soon you'll have ensembles, like a framework where you can, in the GUI, properly define what you want to blend together and in what way, for example non-negative least squares to stack models of different kinds, like a random forest and a GBM and so on, all on the holdout sets and so on.
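The stacking idea mentioned here, blending different model families using their holdout predictions, can be sketched with a crude one-parameter stand-in for non-negative least squares: restrict the blend to a convex combination of two base models and grid over the single weight. The model names and numbers below are invented for illustration:

```python
def blend_weight(pred_a, pred_b, y, steps=100):
    """Pick the convex-combination weight w in [0, 1] that minimizes the
    squared error of w*pred_a + (1-w)*pred_b on a holdout set.
    A crude stand-in for non-negative least squares stacking."""
    def mse(w):
        return sum((w * a + (1 - w) * b - t) ** 2
                   for a, b, t in zip(pred_a, pred_b, y)) / len(y)
    return min((i / steps for i in range(steps + 1)), key=mse)

# Toy holdout predictions from two hypothetical base models
# (say a random forest and a GBM) against the true targets.
rf_pred  = [1.0, 2.0, 3.0, 4.0]
gbm_pred = [1.2, 1.8, 3.3, 3.9]
truth    = [1.1, 1.9, 3.15, 3.95]
w = blend_weight(rf_pred, gbm_pred, truth)
```

Fitting the weights on holdout predictions rather than training predictions is what keeps the stack from simply rewarding whichever base model overfit hardest.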
Then we want to have convolutional layers for deep learning, for example for people who want to do more image-related stuff. But all these things are on a to-do list, right? We have to prioritize them based on customer demand; that's what our customers get to do. The paying customers get to tell us basically what they want, and we'll take that into account.
Natural language processing is high up there, especially now that we have this framework: we can characterize each string as an integer and then process all of that fast. And we have a new method called the generalized low-rank model, which comes right out of Stanford, brand new. It can do all these methods: PCA, SVD, k-means, matrix factorization, of course, all this stuff, fixing missing values for you based on, like, a Taylor expansion of your data set. Very powerful stuff; it can also be used for recommender systems.
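The missing-value-fixing idea behind a generalized low-rank model can be illustrated with a toy: fit a rank-1 factorization u·vᵀ to the observed entries by alternating least squares, then read the missing entries off the factorization. This is a deliberately minimal sketch in plain Python, not H2O's GLRM implementation (which handles arbitrary ranks, regularizers, and loss functions):

```python
def rank1_impute(X, iters=50):
    """Fill missing entries (None) of a matrix with a rank-1 fit u * v^T,
    obtained by alternating least squares on the observed entries only."""
    m, n = len(X), len(X[0])
    u = [1.0] * m
    v = [1.0] * n
    for _ in range(iters):
        # Update each u[i] from the observed entries in row i.
        for i in range(m):
            num = sum(v[j] * X[i][j] for j in range(n) if X[i][j] is not None)
            den = sum(v[j] ** 2 for j in range(n) if X[i][j] is not None)
            if den:
                u[i] = num / den
        # Update each v[j] from the observed entries in column j.
        for j in range(n):
            num = sum(u[i] * X[i][j] for i in range(m) if X[i][j] is not None)
            den = sum(u[i] ** 2 for i in range(m) if X[i][j] is not None)
            if den:
                v[j] = num / den
    # Keep observed values; fill the gaps from the low-rank model.
    return [[X[i][j] if X[i][j] is not None else u[i] * v[j]
             for j in range(n)] for i in range(m)]

# An exactly rank-1 matrix (entry = (row+1)*(col+1)) with one value missing;
# the factorization recovers the hole as 2 * 2 = 4.
X = [[1.0, 2.0, 3.0],
     [2.0, None, 6.0],
     [3.0, 6.0, 9.0]]
filled = rank1_impute(X)
```

The same factor matrices double as a recommender: a user's row factor times an item's column factor predicts the unobserved rating, which is why the two use cases travel together.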
And we have lots and lots of other open tickets and stuff to work on.
So if you're interested in joining the effort, please do. I hope I left you with an impression of what you can do with H2O and what the state of the art is right now in machine learning on big data sets. Thank you for your attention.