00:00:05 So yes, I spent a lot of years in physics, in high-performance computing for particle physics, on the largest supercomputers of the world at SLAC, working together with CERN. That was my background, and then I switched into machine learning startups; I've been doing this for the last three and a half years or so. Last year I got nominated and called a Big Data All-Star by Fortune magazine, so that was a nice surprise. You can follow me at @ArnoCandel here, and if anybody would be willing to take a picture and tweet it to me, that would be great, thanks so much.
00:00:40 So today we're going to introduce H2O and then talk about deep learning in a bit more detail, and then there will be a lot of live demos; as much as time allows, I will go through all these different things. We'll look at different data sets and different APIs, and I'll make sure that you get a good impression of what H2O can do for you and how it looks, and that you definitely get an idea of what we can do here.

00:01:08 So H2O is an in-memory machine learning platform. It's written in Java, it's open source, and it distributes across your cluster; it sends the code around, not the data, so your data can stay on the cluster. Say you have a large data set, right, and you want to build models on the entire data set; you don't want to downsample and lose accuracy that way. But usually the problem is that the tools don't allow you to scale to the big data sets, especially for building machine learning models. We're not just talking about summing up stuff or computing aggregates; we're talking about sophisticated models like gradient boosting machines or neural networks, and H2O allows you to do this, so you get both the scalability and the accuracy from this big data set, at scale. As I mentioned earlier, we have a lot of APIs that you'll get to see today, and we also have a scoring engine, which is kind of a key point of the product. We are about 35 people right now. We had our first H2O World conference last year in the fall, and it was a huge success. Sri Satish Ambati is our CEO; he has a great mindset, and culture is everything to him, so he likes to do meetups every week, even twice a week, to get feedback from customers and so on. So we are very much community driven, even though we write most of the code at this point.
00:02:43 So you can see here the growth: machine learning is really trending, and we think it's the next SQL, and prediction is the next search. There's not just predictive analytics, there's also prescriptive analytics, where you're trying to not just say what's going to happen tomorrow but tell the customers what to do such that they can affect tomorrow. So you can see the growth here; lots and lots of companies are now using H2O. And why is that? Well, because it's a distributed system built by experts in house. We have Cliff Click, he's our CTO; he basically wrote large parts of the Java JIT compiler, right, so in every cell phone of yours there are parts of his code that are executed all the time. He architected the whole framework: it's a distributed in-memory key-value store based on a non-blocking hash map, and it has a MapReduce paradigm built in, our own MapReduce, which is fine-grained and makes sure that all the threads are working at all times when you're processing your data, and of course all the nodes are working in parallel, as you'll see later. We also compress the data, similar to the Parquet data format, so you really store only the data you need, and it's much cheaper to decompress on the fly in the registers of the CPU than to send the numbers across the wire. And once you have this framework in place, you can write algorithms that use this MapReduce paradigm, and you can also do less than a full algorithm; you can just compute aggregates, for example, like a mini-algorithm if you want. So you can do all these things, and in the end you end up with a model that makes a prediction about the future, right, which is where you stand with machine learning, and that code can then be exported; I'll show you that in a minute. And of course we can suck in data from pretty much anywhere, and you can talk to it from R or Python, or via JSON from a web browser; I routinely check the status of my jobs from my cell phone, for example.
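To make the map/reduce idea above concrete, here is a rough Python sketch, illustrative only: H2O's real implementation is fine-grained Java MapReduce, and every name below is made up. Each worker reduces its local chunk to a tiny partial result, and only those partials travel, not the data:

```python
from concurrent.futures import ThreadPoolExecutor

def map_chunk(chunk):
    # map step: each node/thread reduces its local chunk to a tiny partial
    return (sum(chunk), len(chunk))

def reduce_pair(a, b):
    # reduce step: combine two partials; only these small tuples travel
    return (a[0] + b[0], a[1] + b[1])

def distributed_mean(chunks):
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(map_chunk, chunks))
    total, count = partials[0]
    for p in partials[1:]:
        total, count = reduce_pair((total, count), p)
    return total / count

print(distributed_mean([[1.0, 2.0], [3.0, 4.0], [5.0]]))  # 3.0
```

The point is that the reduce step only ever sees small (sum, count) pairs, which is why the data itself never has to leave its node.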
00:05:40 There's a bunch of customers using us right now; these are the ones that are referenceable at this point. There are a lot more that we can't talk about at this moment, but you'll hear about them soon. They're basically doing big data, right: hundreds of gigabytes, dozens of nodes, and they're processing data all the time, and they have faster turnaround times; they're saving millions by deploying these models, such as this fraud detection model that has saved PayPal millions in fraud. So it's very easy to download: you just go to h2o.ai and you can find the download button. You download it, and once it's downloaded you unzip that file, you go in there, and you type java -jar, right, that's it, H2O will be running on your system. There are no dependencies; it's just one single file that you need, and you're basically running. And you can do the same thing on a cluster: you copy the file everywhere and you launch it. That would be the bare-bones installation; if you don't want to do bare bones, you can do Hadoop, you can do YARN, Spark, and you can launch it from R and from Python as well.
00:05:53 So let's do a quick demo here; this is GLM. I'm going to a cluster here; this cluster has my name on it, I got a dedicated cluster for this demo. So let's see what this says: this cluster is an eight-node cluster on EC2; it has, I think, 30 gigabytes of heap per machine, yep, here, and basically it's just there waiting for me to tell it what to do. So one thing I did earlier is parse this airlines data set, and I'm going to do this again; the airlines data set has all the flights from 2007 all the way back to 1987. It's parsing this right now, and let's go look at the CPU usage here: you can see that all the nodes are active right now, sucking in the data, parsing it, tokenizing it, compressing it into these reduced representations, which are lossless of course. So when we have numbers like 7, 19 and 120, then you know that fits into one byte, so you make a one-byte column, right; once you see that the numbers have more dynamic range than one byte, then you take two bytes, and so on; you basically just store what you need.
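A rough sketch of that width decision, illustrative only, not H2O's actual compression code: store each column as offsets from its minimum, using the smallest integer width that covers the observed range:

```python
def bytes_needed(values):
    # store values as offsets from the column minimum, then pick the
    # smallest width whose range covers the observed span
    span = max(values) - min(values)
    for width in (1, 2, 4, 8):
        if span < 2 ** (8 * width):
            return width
    return 8

print(bytes_needed([7, 19, 120]))    # fits in 1 byte
print(bytes_needed([0, 70000]))      # needs 4 bytes
print(bytes_needed([1000, 1100]))    # offset encoding makes this 1 byte too
```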
00:07:09 OK, so now it parsed this file in 35 seconds. Let's go look at the file: there's a frame summary that I'm expecting from the server, and the server now returns it and says: here, 160 million rows. Can you see this? There are 160 million rows, 30 columns, about 4 gigabytes in compressed space. You see all these different columns here; they have a summary, a cardinality; some of them are categorical, so in effect there are about 700 predictors in this data set. And we're trying to predict whether the plane is delayed or not, based on things like its departure time and its origin and destination airport and so on.
00:07:50 So if I wanted to do this, I would just click here: Build Model, and I'd say Generalized Linear Model, that's one that is fast. The training frame is chosen here, and I will now choose some columns to use: I'll first ignore all of them, because there are a lot of columns I don't want to use, and then I'll add year, month, day of the month, day of the week; let's see, we want to know the departure time, maybe the carrier, not the flight number, that doesn't mean much, maybe the origin and destination. And then all we really care about is whether it's delayed or not, so that will be my response; everything else you don't need, because it would give away the answer, right. So departure delayed is what I'm going to try to predict, and it's a binomial problem, so yes or no is the answer. And now I just have to press Go, and it's building this model as we speak, and I can go to the Water Meter to see the CPU usage, and you can see that all the nodes are busy computing this model right now.
00:08:58 And in a few seconds it will be done; you see the objective value doesn't change anymore, yep, so it's done in 19 seconds. And I can look at the model, and I can see that we have an AUC of .65; it's a little more than point five, right, it's not just random. We have variable importances here; we can see that a certain airline, Eastern Airlines, has a negative correlation with the response, which means if you take this carrier you're rarely going to be delayed; that's because it didn't have a schedule, it was always on time by definition, for example. So this is one bit that comes out of this model. Another thing is that Chicago and Atlanta are often delayed when you start there, right, when your journey starts there, as you know. Or, for example, San Francisco: if you want to fly to San Francisco, there are a lot of people who want to do that, so that's why it's also often delayed. And as I mentioned earlier, the accuracy here flatlined after the first few iterations, so the model could have been done even faster. If you're looking at the metrics here, for example, you can see that there's a mean squared error reported, an R-squared value reported, all this data science stuff, an AUC value of point 65, and so on. And there's even a POJO that we can look at; you know what a POJO is: a plain old Java object. It's basically Java code, the scoring code that you can take into production, that actually scores your flights in real time; you could say, OK, if you're this airline and if you're at this time of day, then you're going to have this probability of being delayed or not. And this is the optimal threshold computed from the ROC curve, that curve that you saw earlier, which tells you where best to pick your operating regime, to say delayed or not, based on the false positives and true positives and so on that you're balancing, right. So all the data science is baked in for you; you get the answer right away. So this was on 160 million rows, and we just did this live.
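What that exported scoring code does for a binomial GLM can be sketched roughly like this; the weights, bias, and threshold below are invented for illustration, not the real coefficients of the demo model:

```python
import math

def predict_delayed(features, weights, bias, threshold):
    # linear part: bias plus feature-times-weight sums
    z = bias + sum(w * x for w, x in zip(weights, features))
    p = 1.0 / (1.0 + math.exp(-z))   # logistic link: probability of delay
    return p, p >= threshold         # threshold comes from the ROC curve

# made-up coefficients for two indicator features (e.g. carrier, origin)
p, delayed = predict_delayed([1.0, 0.0], weights=[0.8, -1.2], bias=-0.4,
                             threshold=0.5)
print(round(p, 3), delayed)  # 0.599 True
```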
00:11:06 So as you saw the POJO scoring code, there are more models that you can build in the Flow user interface that you saw earlier. There's a Help button on the right side here to bring this back up: there's Help, I go down, and I can see here: example packs. So there's a bunch of example packs that come with it, and if I click on this here... I'll do this actually on my laptop now, so I'll show you how to run this on a laptop. I just downloaded the package from the website, and it only contains two files: one is an R package and one is the actual Java jar file. I'm going to start this on my laptop, and I'm going to check the browser: localhost at port 54321, that's our default port. And now I'm connected to this Java JVM that I just launched, right, and I can ask it... this is a little too big, let's make it smaller, here we go.
00:12:17 I can look at the cluster status: yep, it's a one-node cluster, I gave it 8 gigs of heap, you can see that, and it's all ready to go. So now I'm going to launch this flow from the example pack, this million songs flow; I'm going to load that notebook, and you can see this is the million-song binary classification demo. We basically have a data set with 500,000 observations and 90 numerical columns, and we're going to split that; well, that's done, you already have those files ready for you, so now we just have to parse them in here. I put them on my laptop already, so I can just import them into the h2o cluster; I'll take the non-zipped version because that's faster. So this file is a few hundred megabytes, and it's done in three seconds. And this one here is the test set; I'm also going to parse this, and you can see that you can even specify the column types: if you wanted to turn a number into an enum for classification, you can do this here explicitly if you're not happy with the default behavior of the parser. But the parser is very robust and can usually handle that: if you have missing values, if you have all kinds of categoricals, ugly strings, stuff that's wrong, we'll handle it. It's very robust, it's really made for enterprise-grade datasets; it'll go through your dirty data and just spit something out that's usually pretty good.
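The kind of type inference such a parser does can be sketched roughly like this; the NA strings and the 90 percent cutoff are invented for illustration, not H2O's actual rules:

```python
def infer_column_type(tokens, na_strings=("", "NA", "?")):
    # decide whether a raw text column is numeric or categorical,
    # tolerating missing entries and dirty tokens
    numeric = 0
    seen = 0
    for t in tokens:
        t = t.strip()
        if t in na_strings:
            continue                   # missing value: ignored for typing
        seen += 1
        try:
            float(t)
            numeric += 1
        except ValueError:
            pass                       # dirty token, counts against numeric
    if seen and numeric / seen > 0.9:  # mostly numbers: call it numeric
        return "numeric"
    return "categorical"               # otherwise treat it as an enum

print(infer_column_type(["1.5", "NA", "2", "3.7"]))   # numeric
print(infer_column_type(["SFO", "ORD", "", "ATL"]))   # categorical
```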
00:13:45 OK, so now we have these data sets, and let's see what else we have here. Let me go back out here: to give you a view, you can click on Outline on the right, and you can see all these cells that I pre-populated here. One of them says build a random forest, one says build a gradient boosting machine, one says build a linear model, logistic regression, and one says build a deep learning model, right. And I can just say, OK, fine, let's build one: let's go down to the GBM cell and say execute this cell. Now it's building a gradient boosting machine on this data set; you can see the progress bar here, and while it's building it I can say, hey, how do you look right now, let me see how you're doing. So right now it's already giving me two scoring-history points where the error went down, and there's already an ROC curve with an AUC of something like, let's see, point seven, I would hope; yes, point seven AUC already, right, in just seconds. That's pretty good for this data set. If I do it again, it's already down here; the error keeps going down, and you can keep looking at that model: feature importances, which variables matter the most, all in real time. And I can also look at the POJO again; this time it's a tree model, not a logistic regression model, so you would expect some decisions in this tree structure. If I go down, there are all these classes; this is all Java code; I think the tree should be somewhere, let me see, I might have to refresh this model.
00:15:28 Oh, here we go. So these are all the forests here; you see that there are a lot of forests that are being scored, and now we just have to find this function somewhere down there. And here it is: here you can see that this is decision-tree logic, right; if your data is less than 4,000 in this column, and less than this, and less than that, then in the end your prediction will be so-and-so much, otherwise it will be this number. So basically this is the scoring code of this model that you can put right into production, in Storm or any other API that you want to use; basically that's just Java code without any dependencies. And you can build the same thing with deep learning, right?
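In Python-flavored form, the exported tree logic boils down to something like this; the split values, column names, and leaf values are made up, not the ones from the demo model:

```python
def score_tree(distance, departure_hour):
    # decision-tree logic as plain nested threshold comparisons,
    # with no runtime dependencies (the real POJO is Java)
    if distance < 4000.0:
        if departure_hour < 18.0:
            return 0.21   # leaf: this tree's contribution to the score
        return 0.47
    return 0.62

# an ensemble just sums many such trees' outputs (one tree shown here)
print(score_tree(distance=700.0, departure_hour=9.0))  # 0.21
```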
00:16:11 You can build a deep learning model on the same data set at the same time that the other one is building; you can build a random forest model here, also at the same time, or a GLM, and this is all on my laptop right now. So I'm building different models at the same time, and I can ask, hey, what's the status of them; I can just go to the right here in the outline and click to bring up my deep learning model; oh, it's already done. Let's see how well we're doing here: also a good AUC, right, and feature importances, and the scoring history, and the metrics, and you can even get a list of optimal metrics, like what's the best precision I can get, what's the best accuracy I can get, and at what threshold. So this is all geared towards the data scientist understanding what's happening. All right, so while my laptop is churning out some more models, we can continue here and talk about deep learning in more detail.
00:17:11 So deep learning, as you all know, is basically just connected neurons, right, and it's similar to logistic regression except that there are more multiplications going on. You take your feature times the weight, you get a number, and then you add it up, and you do this for all these connections; each connection is a product of the weight times the input, which gives you some output, and then you apply a nonlinear function like a tanh, which is like a smooth step function. And you do this again and again and again, and at the end you have a hierarchy of nonlinear transformations, which leads to very complex nonlinearities in your model. So you can describe really weird stuff that you would otherwise not be able to with, say, a linear model, or a simple random forest that doesn't go as deep to make up all these nonlinearities between all these features. So this is basically the machinery you need for nonlinearities in your data set.
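That forward pass can be sketched in a few lines of Python; the weights and biases here are arbitrary numbers for illustration:

```python
import math

def layer(inputs, weights, biases):
    # one layer: each neuron sums weight-times-input plus its bias,
    # then applies the tanh nonlinearity (a smooth step function)
    out = []
    for row, b in zip(weights, biases):          # one weight row per neuron
        z = b + sum(w * x for w, x in zip(row, inputs))
        out.append(math.tanh(z))
    return out

x = [0.5, -1.0]                                              # two features
h = layer(x, weights=[[0.1, 0.4], [-0.3, 0.2]], biases=[0.0, 0.1])  # hidden
y = layer(h, weights=[[0.7, -0.5]], biases=[0.2])                   # output
print(len(h), len(y))  # 2 1
```

Stacking more `layer` calls is what gives the hierarchy of nonlinear transformations the talk describes.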
00:18:11 We do this in a distributed way, again, because we're using MapReduce; we're doing this on all the threads, right, as you saw earlier for GLM, where everything was green. Deep learning is also green; it's known to be green. I usually burn up the whole cluster when I'm running my models and everybody else has to step back; well, of course there's the Linux scheduler that takes care of that, but still, some claim it's not necessarily fair if I'm running some big model, so I haven't done that lately, and that's why I'm using these EC2 clusters now, or maybe my laptop from time to time. But anyway, you can see here we have a lot of little details built in, right: it works automatically on categorical data, it automatically standardizes your data, so you don't need to worry about that; it automatically imputes missing values; it automatically does regularization for you if you specify the option; it does checkpointing, load balancing, everything. You just need to say go, and that's it, so it should be super easy for anyone to just run it.
00:19:16 And if you want to know how it works in detail, the architecture is basically this: first it distributes the data set, right, onto the whole cluster. Let's say you have a terabyte of data and 10 nodes: every node will get 100 gigabytes of different data. And then you say, OK, I'll make an initial deep learning model, which is a bunch of weights and bias values, all just numbers, and I'll put that into some place in the store, and then I spread that to all the nodes; all my 10 nodes get a copy of the same model. And then I say: train on your local data. So all the models get trained on their local data with multi-threading, so there are some race conditions here that make this not reproducible, but in the end you will have n models, in this case four, or, on the cluster I just mentioned, 10 such models, each of which has been built on a part of the hundred gigabytes its node has. You don't have to process all the hundred gigabytes; you can just sample some of it, right. And then when you're done with that, you reduce it: the models basically automatically get averaged back into one model, and that one model is the one that you look at from your browser, from R, from Python. And then you do this again, and every pass is a fraction of the data that you're passing through, or all of the data, or more than all of your data; you can just keep iterating without communicating, you can tell each node to just run for six weeks and then communicate, but by default it's done in a way that you spend about two percent of your time communicating on the cluster and ninety-eight percent computing. And this is all automatically done, so you don't need to worry about anything; you just say go, and it'll basically process the data in parallel and make a good model. This averaging of models, this scheme, works; there's a paper about it. But I'm also working on a new scheme called consensus ADMM, where you basically have a penalty on how far you drift from the average, but you keep your local model, and that keeps everybody kind of going on their own path in optimization land without averaging all the time; you just know when you're drifting too far, so you get pulled back a little, but you still have your own model. So this is going to be a promising upgrade soon that you can look forward to. Already, as it is, it works fairly well.
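The plain model-averaging reduce step described above, before the ADMM refinement, can be sketched like this; each "model" is reduced to a bare weight vector for illustration:

```python
def average_models(models):
    # reduce step: element-wise average of the node-local weight vectors,
    # producing the single model you then see in the browser / R / Python
    n = len(models)
    return [sum(ws) / n for ws in zip(*models)]

# four nodes' locally trained weights (made-up numbers)
local = [[0.2, 0.4], [0.4, 0.6], [0.0, 0.5], [0.2, 0.1]]
print([round(w, 6) for w in average_models(local)])  # [0.2, 0.4]
```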
00:21:31 So this is MNIST, right: the digits 0 to 9, handwritten digits, 784 grayscale pixels, and you need to know which one it is from the grayscale pixel values. And with a couple of lines here in R you can get world-class results; it's actually an actual world record: no one has published a better number on this without using convolutional layers or any other distortions. This is purely on the 60,000 training samples, no distortions, no convolutions, and you can see here all the other implementations, Geoff Hinton's and Microsoft's; point 83 is the world record. Of course you could say the last digit is not quite statistically significant, because you only have ten thousand test set points, but still, it's good to get down there. So now let's do a little demo here; this is anomaly
detection: I'll show you how we can detect the ugly digits in this MNIST data set, on my laptop, in a few seconds. So I just have this instance up and running here from before, so I'm going to go into R. In R I have these R unit tests; these run every day, right, every time we commit something these tests are being run, so you can definitely check those out from our GitHub page right now if you want. But still, this is saying: build an autoencoder model, which is learning what's normal. So it connects to my cluster right now, and it learns what's normal, what is a normal digit, without knowing what the labels are; it just says, look at all the data and learn what's normal. And how does it do that? Well, it takes the 784 pixels and compresses them into, in this case, 50 neurons, 50 numbers, and then tries to make it back into 784. So it's learning the identity function of this data set in a compressed way, right. So if you can somehow represent the data with these 50 numbers, and you know the weights connecting in and out, then these 50 numbers mean something; that's what it takes to represent those 10 digits. Let's say that's roughly five numbers per digit, and those five numbers are enough to say there's an edge here, a round thing here, a hole here, something like that, like the features. And with these 50 numbers in the middle, and of course the connectivity that makes up the encoding and the decoding, you can now say what's normal or not. Because now I'll take the test set, I let it go through this network, and I see what comes out on the other side; if it doesn't look like the original input, then it didn't match my vision of what this should look like, right. So I'm going to let the test set go through this model; first I need to train the model, so right now it's building this model on my laptop: 50 hidden neurons, tanh activation function, and autoencoder set to true. And I had a couple of extra options, but that's just to say don't drop any of the constant columns that are all zero, because I want to plot it at the end. OK, so now let's look at the outlierness of every point: we just scored the test set and computed the reconstruction error. So how different is the outcome from the input, how bad is my identity mapping for the test set points; and those points that are kind of ugly won't match what's normal in the training data, right; that's an intuitive thing.
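The scoring mechanics can be sketched as follows; the "autoencoder" here is a stand-in function rather than a trained network, so the point is only how rows get ranked by reconstruction error:

```python
def reconstruct(row):
    # stand-in for a trained autoencoder's encode/decode pair: it damps
    # extreme pixel values, so unusual rows reconstruct poorly
    return [min(max(v, 0.1), 0.9) for v in row]

def reconstruction_error(row):
    # mean squared difference between input and its reconstruction
    rec = reconstruct(row)
    return sum((a - b) ** 2 for a, b in zip(row, rec)) / len(row)

normal = [0.5, 0.5, 0.5]   # mid-range "pixels" reconstruct perfectly
odd = [0.0, 1.0, 0.0]      # extreme pixels get damped, so higher error
print(reconstruction_error(normal) < reconstruction_error(odd))  # True
```

Sorting the test rows by this error is exactly what separates the pretty digits from the ugly outliers in the demo.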
00:25:05 All right, so now let's plot the ones that match best, the top 25: that's the reconstruction. And now let's look at the actual ones; well, it's the same thing, right, they match the best, so they have to look the same. These are the ones that are the easiest to learn, to represent with your identity function: just take the middle ones and keep them, basically. Now let's look at the ones in the middle out of 10,000, the ones with the median reconstruction error. So these are still reasonably good; you can tell that they're digits, but they're already not quite as pretty anymore. And now let's look at the ugliest ones, the outliers so to speak, in the test set. So these are all digits that are coming out of my network, but they're not really like digits anymore, right; something went wrong, basically, the reconstruction failed. The model said these are ugly, and if you look at them, they are kind of ugly; some of them are almost not digits anymore, right, cut off, or the top right one, for example, is ugly, and you can tell, if you remember the bottom line, like in the optics test, the vision exam: 6, 40, 35, right? Let's go look at my slides: totally different. So every time I run it, it's different, because it's neural nets with multithreading. I can turn it on to be reproducible, but then I have to say: use one thread, don't do any of this Hogwild-style race-condition updating of the weight matrix by multiple threads at the same time, just run one thread through, and give a seed, and then just wait until that one thread is done, and then it will be reproducible. But in this case I chose not to do this, because it's faster this way, and the results are fine anyway; every time you run it you'll get something like this, you will not get the ugly digits to be the good ones.
00:26:53 Right, so this shows you basically that this is a robust thing, and again, here, this is the network topology. So I can also go back to the browser now, go to localhost, and say: clean up everything. By the way, this just ran all the models, so if I say get models I should see all the models that were built: the last four are the models that were built on the million-song data set earlier, and the top one is the one I built from R, the autoencoder. And you can see the autoencoder reconstruction error started at point zero eight mean squared error and now it's at point zero two, so it got it down; it improved from random noise. For autoencoders you always want to check this convergence; it has to learn something, right, the identity mapping. And you can also see here the status of the neuron layers, the thing I showed you earlier, and of course you can also get a POJO again; in this case it's a neural net, so you would expect some weights here. And what is this? Oh, that's the neurons, here we go. I would expect the model to show up somewhere; see, there are a lot of declarations you have to make to hold all these 784 features, so if this is too long for the preview, then we have to look at the other model we have. Yeah, let's go back to get models and click on the other deep learning model we made earlier on the million-song data set and look at its POJO; that should be smaller, because there were only 90 predictors.
00:28:31okay here we go so now you should see
00:28:34that the deep learning math actually
00:28:35printed out in plain text so you can
00:28:38always check here activation something
00:28:43with numerical something with
00:28:45categoricals if you had any in this case
00:28:47there are none and then it will save
00:28:49aids activation biases and
00:28:51they will do this matrix vector
00:28:52multiplication so ax + y v 1 this is the
00:28:58matrix vector multiplication that’s
00:29:00inside of the deep learning model and
00:29:02you can see here we do some partial some
00:29:05tricks to be faster to basically allow
00:29:07the CPU to do more additions and
00:29:09multiplications at the same time so all
00:29:11of this is optimized for speed and this
00:29:14is as fast as any c++ implementation or
00:29:17anything because we don’t really have GC
00:29:20issues here all the arrays are allocated
00:29:22one time and then just process all right
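the partial-sum trick described above can be illustrated with a hedged sketch in plain Python (not H2O's actual Java POJO code): splitting a dot product into several independent accumulators gives the CPU independent chains of additions and multiplications it can overlap.

```python
# Illustrative sketch only: computing one output neuron's activation as a
# dot product, with the loop unrolled into four partial sums so the CPU can
# overlap independent additions and multiplications.
def dot_partial_sums(w, x):
    s0 = s1 = s2 = s3 = 0.0
    n = len(w) - len(w) % 4
    for i in range(0, n, 4):          # four independent accumulation chains
        s0 += w[i]     * x[i]
        s1 += w[i + 1] * x[i + 1]
        s2 += w[i + 2] * x[i + 2]
        s3 += w[i + 3] * x[i + 3]
    for i in range(n, len(w)):        # leftover elements
        s0 += w[i] * x[i]
    return s0 + s1 + s2 + s3
```

the real generated code does this over preallocated arrays, which is why there is no garbage-collection pressure during scoring.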
00:29:27so now let’s get back to the bigger
00:29:29problems deep learning and higgs boson
00:29:32who has seen this data set before okay
00:29:35great so this is physics right a 13
00:29:38billion dollar project the biggest
00:29:39scientific experiment ever this data set has
00:29:4210 million rows they’re detector events
00:29:45each detector event has 21 numbers
00:29:48coming out saying this is what I
00:29:49measured for certain things and then the
00:29:51physicists come up with seven more
00:29:53numbers that they compute from those 21
00:29:56something like square root of this
00:29:58squared minus that square or something
00:30:00and those formulas or formulae actually
00:30:05help and you can see this down there if
00:30:10you take just the low-level numbers this
00:30:11is the AUC you get so point five is random
00:30:14and one would be perfect and now it goes
00:30:16up by something like ten points
00:30:18almost if you add those extra features
00:30:21so it’s very valuable to have physicists
00:30:23around to tell you like what to do right
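as a hedged aside on what those AUC numbers mean: AUC is the probability that a randomly chosen signal event is scored higher than a randomly chosen background event, so 0.5 is random guessing and 1.0 is a perfect ranking. a minimal sketch:

```python
# Sketch of AUC as a rank statistic: fraction of (signal, background) pairs
# where the signal event gets the higher score (ties count half).
def auc(scores, labels):
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```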
00:30:25but CERN basically had this baseline
00:30:28here of 81 that was how good it was
00:30:31working for them they used gradient
00:30:33boosted trees and neural networks with
00:30:36one hidden layer so
00:30:39their baseline was 81 AUC and this paper
00:30:42came along last summer saying we can do
00:30:43better than that with deep learning and
00:30:46they publish some numbers and now we are
00:30:48going to run the same thing and see what
00:30:50we can do so I’m going back to my
00:30:53cluster my EC2 eight node cluster and I’ll say
00:31:00get frames
00:31:03and I already have the Higgs data set there
00:31:05because I parsed it earlier you
00:31:07can see here 11 million rows and 29
00:31:13columns two gigabytes compressed there’s not
00:31:15much to compress because it’s all
00:31:16doubles and now I’m going to run a deep
00:31:20learning model so I already saved the
00:31:22flow for that so this flow says take the
00:31:30split data set I split it into
00:31:32ninety percent and five percent five percent
00:31:34so ten million and half a million each
00:31:37take the training data and the
00:31:40validation data and tell me how you’re
00:31:41doing along the way so go and it builds
00:31:45a three layer Network and uses a
00:31:48rectifier activation everything else is
00:31:50default and now it’s running so let’s go
00:31:55look at the water meter ok here we
00:31:57go deep learning is taking over the
00:31:59cluster and now it’s communicating and
00:32:02now it’s sending that back out and then
00:32:05computing again this might be the initial
00:32:07phase where it first has to
00:32:10rebalance the data set or something
00:32:11usually you’ll see it up down up down so
00:32:14let’s wait until the next communication
00:32:17but you’ll see that all the CPUs are
00:32:19busy updating weights with stochastic
00:32:20gradient descent which means it takes a
00:32:23point it trains goes through the network
00:32:27makes a prediction says how wrong it is
00:32:30and corrects the weights all the weights
00:32:32that are affected get fixed basically by
00:32:35every single point there’s no mini batch
00:32:37or anything every single point updates
00:32:39the whole model and that’s done by all
00:32:41the threads in parallel so you’ll have
00:32:43eight threads in parallel changing those
00:32:45weights and they read and write read and
00:32:48write whatever they just compete but
00:32:50usually we write different weights right
00:32:52there’s millions of weights so you don’t
00:32:53need to overwrite too often while someone
00:32:56else is reading at the same time or something
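the per-point update described above can be sketched as follows; this is a hedged single-threaded illustration on a linear model, not H2O's actual implementation, where several threads run the same loop against shared weights without locking and collisions stay rare because there are millions of weights.

```python
# Minimal sketch of per-example SGD (no mini-batch): every single point
# makes a prediction, measures how wrong it is, and immediately corrects
# all the affected weights.
def sgd_epoch(w, data, lr=0.1):
    for x, y in data:                       # one point at a time
        pred = sum(wi * xi for wi, xi in zip(w, x))
        err = pred - y                      # how wrong the prediction is
        for j in range(len(w)):             # correct the affected weights
            w[j] -= lr * err * x[j]
    return w
```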
00:32:57so you can see here it’s mostly busy
00:32:59computing if you wanted to know what
00:33:04it’s exactly doing you can also click on
00:33:05the profiler here and it will show you a
00:33:07stack trace and sorted stack trace by
00:33:11count what’s happening so this was
00:33:13basically just communicating let’s do
00:33:15this again
00:33:17now it’s going to be slightly different
00:33:20oh I see so now it was saying these are
00:33:27basically idle because we have eight
00:33:29nodes and there are seven others and
00:33:30there’s one for read and one for write
00:33:33so we got 14 threads actively listening
00:33:36for communication here a few are in the
00:33:40back propagation some of them are in the
00:33:43forward propagation so you can see all
00:33:45these exact things that are going on
00:33:48at any moment in time for every node
00:33:50right you can go to a different node and
00:33:52you can see the same behavior so they’re
00:33:54all just busy computing so while this model
00:33:56is building we can ask how well is it
00:33:59doing remember that 81 baseline with the
00:34:03human features let’s see what we
00:34:08have here on the validation data set
00:34:09it’s already at 79 this already beat all
00:34:14the random forests and gradient boosted
00:34:16methods and neural nets methods that
00:34:19they had at CERN for many years so these
00:34:23models there on the left that had 75 76
00:34:27already beaten by this deep learning
00:34:29model we just ran and this wasn’t even a
00:34:32good model it was just a small like a
00:34:34hundred neurons each layer right so this
00:34:36is very powerful and by the time we
00:34:39finish we’ll actually get up to over 87
00:34:42AUC that’s what the paper reported
00:34:45they have an 88 they trained this for
00:34:47weeks on a GPU and of course they had
00:34:50only this data set and nothing else to
00:34:52worry about and this is a small data set
00:34:54still but you can see the power of deep
00:34:56learning right especially if you feed it
00:34:58more data and you give it more neurons
00:34:59it’ll train and learn everything it’s
00:35:01like a brain that’s trying to learn like
00:35:03a baby’s brain it’s just sucking up all
00:35:06the information and after 40 minutes
00:35:09you’ll get an 84 AUC which is pretty
00:35:12impressive right it beats all the other
00:35:13baseline methods even with the human
00:35:15features and this is without using the
00:35:18human features you don’t need to know
00:35:19anything you just take the sensor data
00:35:20out of your machine and say go all right
00:35:26another use case was deep learning used
00:35:28for crime detection
00:35:30and this is actually Chicago who can
00:35:33recognize this pattern so my colleagues
00:35:35Alex and Michal they wrote an article
00:35:39actually that you can read here on
00:35:41Datanami just a few days ago and they’re
00:35:44using spark and h2o together to take
00:35:48three different data sets and turn them
00:35:50into something that you can use to
00:35:52predict whether a crime is going to
00:35:56lead to an arrest or not so you take
00:35:59the crime data set you take the census
00:36:03data set to know something about the
00:36:04socioeconomic factors and you take the
00:36:06weather because the weather might have an
00:36:08impact on what’s happening and you put
00:36:10them all together in spark first you
00:36:13parse them in h2o because we know that
00:36:15the parser works and it’s fine so
00:36:17in our demo we just suck it all into
00:36:20h2o we send it over to spark in the same
00:36:24jvm and then we say run an SQL join and
00:36:30once you’re done we split it again in
00:36:32h2o and then we build a deep learning
00:36:34model and for example a GBM model i think
00:36:37these two are being built by the demo
00:36:38script that’s available so again both
00:36:43h2o and spark’s memory is shared it’s the
00:36:50same jvm there’s no tachyon layer or
00:36:52anything they are basically able to
00:36:55transparently go from one side to the
00:36:57other
00:37:02and the product of course is called
00:37:04sparkling water which was a brilliant
00:37:07idea I think all right so this is the
00:37:12place and github where you would find
00:37:14this this example so you would download
00:37:17sparkling water from our download page
00:37:19and then you would go into that
00:37:21directory set to environment variables
00:37:24pointing to spark and saying how many
00:37:26nodes you want and then you would start
00:37:28the sparkling shell and then copy paste
00:37:31this code into it for example if you
00:37:33want to do it interactively so you can
00:37:37see here there’s a couple of imports you
00:37:39import deep learning in GBM and some
00:37:41spark stuff and then you basically
00:37:45connect to the h2o cluster we parse
00:37:48datasets this way this is just a
00:37:50function definition that gets used by
00:37:52these other functions that actually do
00:37:55the work to load the data and then you
00:37:56can drop some columns and do some simple
00:37:59munging in this case here we do some
00:38:01date manipulations to standardize the
00:38:03three datasets to have the same date
00:38:05format so that we can join on it later
00:38:07and you basically just take these three
00:38:10datasets these are just small for a demo
00:38:11but in reality they of course use the
00:38:13whole data set on a cluster and then
00:38:16once you have these three datasets in
00:38:19memory as h2o objects we just convert
00:38:21them to a SchemaRDD with this call here and
00:38:24now they become spark RDDs for which
00:38:28you can just call like a select
00:38:31statement in SQL and then some join and
00:38:33another join and all that it’s very nice
00:38:35right this is a nice well understood API
00:38:38that people can use and h2o does not have
00:38:41this at this point but we’re working on
00:38:43that so at some point we’ll have more
00:38:45munging capabilities but for now you
00:38:47can definitely benefit from the whole
00:38:49spark ecosystem to do what it’s good for
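to make the join step above concrete, here is a hedged stand-in for the Spark SQL part in plain Python with sqlite (the real demo uses Scala on Spark, and the table and column names here are invented for the sketch): once the three datasets share a standardized date and area key, the combined training frame is just a couple of joins.

```python
import sqlite3

# Toy stand-in for the Spark SQL join: crime, weather, and census tables
# joined on a standardized date and community-area key.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE crime  (date TEXT, area INTEGER, arrest INTEGER);
    CREATE TABLE weather(date TEXT, temp REAL);
    CREATE TABLE census (area INTEGER, income REAL);
""")
con.execute("INSERT INTO crime  VALUES ('2015-02-01', 7, 1)")
con.execute("INSERT INTO weather VALUES ('2015-02-01', -3.5)")
con.execute("INSERT INTO census VALUES (7, 41000.0)")

rows = con.execute("""
    SELECT c.date, c.area, c.arrest, w.temp, s.income
    FROM crime c
    JOIN weather w ON c.date = w.date
    JOIN census  s ON c.area = s.area
""").fetchall()
```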
00:38:53so here in this case we say
00:38:56here’s the crime weather data set that
00:38:59after we split it I think we
00:39:01bring it back into h2o yes this is an
00:39:03h2o helper function to split and now we
00:39:09have basically a joint data set that
00:39:12knows all about the socioeconomic
00:39:13factors and about the weather
00:39:14for a given time at a given place and
00:39:19then we can build a deep learning model
00:39:22just like you would do this in Java
00:39:24Scala is very similar right you don’t
00:39:26need to do much porting it’s just the
00:39:28same members that you’re setting and
00:39:30then you say run train model
00:39:33and basically at the end you
00:39:35have a model available that you can use
00:39:37to make predictions and it’s very simple
00:39:39and you can definitely follow the
00:39:41tutorials in the interest of time I’ll
00:39:46just show you the sparkling shell start
00:39:49here I’m basically able to do this on my
00:39:52laptop as well while the other one is
00:39:54still running here you see spark is
00:39:57being launched and now it’s scheduling
00:39:59those three worker nodes to come up once
00:40:01it’s ready I can copy paste some code in
00:40:03there and the code I would get from the
00:40:08website here Chicago Crime demo it’s all
00:40:14on github
00:40:28so in the sparkling water github
00:40:31project under examples there are some
00:40:34scripts and so I can just take this
00:40:35stuff here and just copy paste it all
00:40:39oops I’m sure you believe me that this
00:40:43is all doable right so here spark is now
00:40:46ready and I just copy paste this in and
00:40:47here it goes so that’s how easy it is to
00:40:50do spark and h2o together and then also
00:40:57once you have something in memory
00:40:58in the h2o cluster right the model
00:41:01for example or some data sets you can
00:41:03just ask flow to visualize it you can
00:41:05just type this JavaScript or
00:41:07rather CoffeeScript expression and plot
00:41:09anything you want against anything and
00:41:11you’ll see these interactive plots but
00:41:14you can mouse-over and it will show you
00:41:15what it is and so on so it’s very cool
00:41:17you can plot for example the arrest rate
00:41:19versus the relative occurrence of a
00:41:21crime for example gambling always leads to an
00:41:24arrest why is that well because
00:41:27otherwise you wouldn’t know that the
00:41:28gambling person was cheating or
00:41:30something so you basically have to
00:41:32arrest them right otherwise you don’t know
00:41:34what’s happening some of the things are
00:41:36undetected but theft for example
00:41:39doesn’t always lead to an arrest because someone
00:41:41knows that it was stolen without the
00:41:42person actually being caught so you have
00:41:44to be careful about all this data
00:41:46science stuff but basically can plot
00:41:48whatever you want against whatever you
00:41:49want and that’s pretty powerful and we
00:41:53have R’s data dot table now in house so
00:41:56Matt Dowle joined us recently he
00:41:58wrote the fastest data table
00:42:00processing engine in R and this is
00:42:04used for financial institutions that
00:42:06like to do aggregates a lot so just what
00:42:08you saw on the previous slide will soon
00:42:10have all this in h2o in a scalable
00:42:12way that we can do fast joins aggregates
00:42:15and so on and the same thing of course
00:42:18goes for Python you have ipython
00:42:20notebooks and there’s an example to do
00:42:23something for the Citi Bike company in
00:42:25New York City where you want to know how
00:42:27many bikes do you need for stations such
00:42:29that you don’t run out of bikes so let’s
00:42:31say you have 10 million rows of
00:42:34historical data and you have some weather
00:42:36data you would imagine you can join
00:42:38those two and then basically based on
00:42:40location
00:42:41time and weather you can predict how
00:42:44many bikes you’ll need right so if I
00:42:45know today or tomorrow
00:42:47is going to have that weather I know I need
00:42:49250 bikes at that station or something
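counting historical demand per station and time slot is at heart a group-by; a toy stdlib sketch (the field layout here is invented for illustration, the real demo does this on an H2OFrame):

```python
from collections import defaultdict

# Toy group-by: from historical trip rows, count trip starts per
# (station, hour) so you can size each station's bike fleet.
def starts_per_station_hour(trips):
    counts = defaultdict(int)
    for station, hour in trips:
        counts[(station, hour)] += 1
    return dict(counts)
```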
00:42:51and Cliff our CTO who wrote a JIT
00:42:55basically also wrote this data science
00:42:57example here so you can see there’s a
00:42:59group by at the top of the ipython notebook
00:43:02and to show you that this is also
00:43:04possible live here I’ll type
00:43:07ipython notebook citibike small and up
00:43:10pops my browser with ipython
00:43:13notebook I will delete all the output
00:43:16cells so we don’t cheat and I say go and
00:43:18now it’s connecting to the cluster that
00:43:21I started 30 minutes ago this means i
00:43:23still have a little bit of time left i
00:43:26will load some data and here we go and
00:43:30then let’s look at the data describe it
00:43:33you can see here some mean max and
00:43:40so on whatever this is like a
00:43:42distribution of the chunk of the frame
00:43:44how many rows out of each machine in
00:43:46this case it’s only one machine oops
00:43:47there’s only one machine basically some
00:43:50statistics that tells you how is the
00:43:52data distributed across the cluster what
00:43:54kinds of columns do I have what is their
00:43:59mean max and so on all available from
00:44:01from Python then you can do a group by
00:44:05you don’t need to know all that but
00:44:07basically just you want to know like at
00:44:08what time of the day or what day how
00:44:11many bikes are at each station and so on
00:44:12you can see that there’s a big
00:44:13distribution here some
00:44:16places only need 9 bikes others basically
00:44:18a hundred bikes or even more and so on
00:44:21right and you can do quantiles you see
00:44:24the quantiles here from one percent all
00:44:26the way to ninety-nine percent and you
00:44:28see that there’s some pretty big numbers
00:44:31here you can make new features day of
00:44:36the week weekend and so on you can build models so
00:44:41this is the fun part we build a
00:44:42GBM we build a random forest we build a
00:44:44glm and we build a deep learning model
00:44:46all on the same data that was joined
00:44:48earlier and so now let’s do this go
00:44:51so now it’s building a GBM
00:44:54all on my laptop so if I went to my
00:44:57laptop right now I could say get models
00:44:59and these models would just magically
00:45:00pop up and this is deep learning and now
00:45:06we can see how well they’re doing and
00:45:11you get the idea right so now we get a
00:45:1292 AUC by deep learning and a 93
00:45:15AUC by GBM and deep learning even took a
00:45:17little less time than GBM so you could
00:45:19say that both are very powerful methods
00:45:21they beat the random forests and the
00:45:23linear models here but of course nothing
00:45:26beats the linear model in terms of time
00:45:28oh point one seconds to get an 81 and you
00:45:30see it’s pretty remarkable it’s 50 times
00:45:33faster than a random forest all right so
00:45:37you believe me that ipython works as
00:45:38well once you join the weather data with
00:45:41a simple merge command here in the
00:45:43middle somewhere then you get a little
00:45:46lift here because then you can even better
00:45:48predict whether you need bikes or not
00:45:49based on weather right makes sense if it
00:45:51rains you might need fewer bikes so if you
00:45:55wonder what to do
00:45:57with GBM linear models or deep
00:46:00learning there’s booklets for that and
00:46:02we’re currently rewriting them to the
00:46:04new version of h2o which will have
00:46:05slightly updated api’s and stuff for
00:46:08consistency across our Python Scala JSON
00:46:12and so on so it’s going to be very nice
00:46:13and rewritten everything from scratch a
00:46:16major effort but now we’re basically
00:46:18going to be ready for release I think
00:46:21this week actually so and another highlight is
00:46:27that we’re currently number one at this
00:46:29kaggle challenge Mark Landry who just
00:46:33joined us who has been on team h2o
00:46:36for a while he was at the h2o world
00:46:38last fall he is actually going to work
00:46:42full-time spending almost half his time on kaggle
00:46:42challenges using h2o so we’ll be excited
00:46:45to see this go across the finish line
00:46:48and they will share how we did this or
00:46:52rather he will share how he did it
00:46:54because so far mark did most of the work
00:46:57next week at h2o in Mountain View and
00:47:00they’ll be live-streamed as well so if
00:47:02you can make it be sure to listen in and
00:47:05these are some examples of other kaggle
00:47:07applications
00:47:08we have demo scripts that are posted
00:47:09that are available and for example this
00:47:13one I had posted maybe a
00:47:15month ago or so I posted this example
00:47:17GBM random parameter tuning logic where
00:47:23you basically just make ten models with
00:47:25random parameters and see which one is
00:47:26the best that’s sometimes useful
00:47:28especially if you have many dimensions
00:47:30to optimize over and we don’t have
00:47:33bayesian optimization yet but this might
00:47:35be more efficient than just a brute
00:47:36force grid search because the machine
00:47:39gets lucky more often than you could tell it to be
00:47:41lucky if you want that’s why monte carlo
00:47:43integration works in higher than four
00:47:45dimensions the same thing is true with
00:47:47hyper parameter finding so don’t shy
00:47:50away from these random approaches
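such a random search might be sketched like this; the parameter names and ranges are invented for illustration, and `train_and_score` is a hypothetical stand-in for building a GBM and reading back its validation score.

```python
import random

# Sketch of random hyperparameter search: draw ten random parameter
# combinations, train a model for each, and keep the best one — often
# competitive with a full grid search when there are many dimensions.
def random_search(train_and_score, n_models=10, seed=42):
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_models):
        params = {
            "ntrees":     rng.choice([50, 100, 200]),
            "max_depth":  rng.randint(3, 10),
            "learn_rate": rng.uniform(0.01, 0.3),
        }
        score = train_and_score(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score
```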
00:47:51they’re very powerful so this is the
00:47:55outlook lots of stuff to do for data
00:47:57science now that they have this
00:47:58machinery in place that can scale to big
00:48:01data sets customers are saying well
00:48:03do i need to find parameters right yeah
00:48:05sure automatic hyper parameter tuning
00:48:07is great we’ll do that for you soon
00:48:09you’ll have ensembles like a framework
00:48:12where you can in the GUI properly
00:48:15define what you want to blend together
00:48:17in what way non-negative least squares to stack
00:48:21models of different kinds like a random
00:48:23forest and a GBM and so on all on the
00:48:25holdout sets and so on then we want to
00:48:28have convolutional layers for deep
00:48:29learning for example for people who want
00:48:31to do more image related stuff but all
00:48:35these things are on a to-do list right
00:48:37we have to prioritize those based on
00:48:39customer demand so that’s what our
00:48:40customers get to do the paying customers
00:48:42get to tell us basically what they want
00:48:44and they’ll take that into account
00:48:46natural language processing is high up
00:48:48there especially now that you have this
00:48:50framework we can represent each
00:48:52string as an integer and then process
00:48:54all that fast and we have a new method
00:48:57called generalized low-rank model which
00:48:59comes right out of Stanford brand new it
00:49:02can do all these methods PCA SVD
00:49:05k-means matrix factorization of course
00:49:08all this stuff fixing missing values for
00:49:11you based on like a Taylor expansion of
00:49:13your data set very powerful stuff can
00:49:16also be used for recommender systems and
00:49:18we have lots and lots of other jira
00:49:20tickets and
00:49:21stuff to work on so if you’re interested
00:49:23in joining the effort please do and I
00:49:26hope I left you with an impression of
00:49:28what you can do with h2o and what the
00:49:30state of the art is right now in machine
00:49:32learning on big data sets and thank you
00:49:34for your attention