Hello, and thank you for joining us. I'm part of a small start-up located here in Denmark, in Aarhus, called UNSILO. We work with big scientific publishers to process article information and to make tools for researchers. Maybe I should start by explaining the mission we set up four years ago when we started the company: our idea was to build a system of discovery services that could make it easy to find patterns across a lot of unstructured text. Today,
or until a couple of years ago, the way things were linked when you looked at an article and tried to find something similar was through human-annotated editorial keywords; that is how you find related articles in science. The big challenges we saw with that system were, first, that scientific language is constantly evolving and growing, and new things are being discovered, so it is impossible to keep up with hand curation of content. Second, such a system has to be omniscient: presently it is an author and an editor who look at a paper and try to decide what its important aspects are, and sometimes really interesting discoveries only become apparent in hindsight. So you need an automated system that can correlate a new article to tons of other things currently going on, and figure out whether, say, people in China are doing something very similar to what you are trying to do. Then, finally, it has
to be unbiased, because right now we have a problem: most of the automated recommenders and concept curation today are based on collaborative filtering, like the "people who bought this also bought that" suggestions you see on Amazon. That tends to lead us all down the same path, and it tends to make researchers walk straight past the most interesting material, because that is what everyone else does too. So we need an unbiased approach that does not rely on some kind of popularity ranking like PageRank or collaborative filtering. (Is the sound a little odd? Is it okay? I don't have one of those fancy clickers.) So, the core
technology that we have built is based on a lot of open-source components, or at least three: a document processing pipeline built around Apache UIMA and Ruta, a fairly standard natural language processing pipeline and tools on top of that, and common languages like Python for prototyping, Java, and a lot of libraries from what I guess is the data scientist's toolbox.
The key challenge in what we are trying to do is that unstructured knowledge, text, basically does not compute. As I said before, there is too much going on for humans to be involved in this process, and even when humans are involved at a higher level, building ontologies to represent the knowledge we have of a certain discipline, it is not going fast enough. All the interesting things found out yesterday, or last month, or even six months ago, have not made it into a curated ontology yet. So if you really want to be at the forefront, where the money is and where things matter in research, you need a more dynamic approach. Even where dictionaries or reference works exist, they are simply not comprehensive
enough. And then the second big problem is that people are way too creative. They do not use just one name for a given phenomenon; they have many different variations, and they often add descriptive detail in their own language that makes absolutely no sense to a computer and makes it really difficult to figure out what they are actually talking about. There is no single right way to describe anything in the world, and we somehow have to figure out what people mean. Finally, as I said, there is all the data that people consider obvious. That is probably the biggest problem for analytics today, or for AI in general: all the things people consider obvious and therefore fail to include in a description of anything. Those are the key problems we are trying to solve. Here's a
piece of text, the abstract of an article from 2006 (the real article is probably ten times as long). If you throw it at a regular full-text search or some standard search engine, it is really difficult to see what this text is actually about, and if I read it, how do I figure out which other articles talk about the same things? Today we use computers to annotate the words whose meaning we know, that is, the words found in the common dictionaries and ontologies of this area. At our company we have developed a much more comprehensive way of looking at this: we dynamically, statistically derive longer phrases that mean something, and we figure out which of them mean approximately the same thing. As I think I said in the remarks for this talk, I am also going to talk a little about where we want to take things and what we are currently working on. As you can see, we are trying to cover all of the information that is actually in an article, to map it out and make it searchable, make it findable.
We are presently working on all of the actions and relationships between these things, because when you find material that talks about A and B, the most relevant article is probably the one that talks about A and B in approximately the same context, or the same sentence, or that even describes how A is related to B. Today you can approximate this with a traditional search engine by limiting the number of words in between, but the thing is, when you are working with text, sometimes the words in between cross a paragraph boundary, or it is the caption of the image right next to that really interesting other thing you were looking for. Other times, the thing you are interested in is mentioned up here together with some third thing, and down there the other thing is mentioned with that same third thing, so the two are actually closely connected even though they sit at odd ends of the article. You need a better understanding of this, and we actually use graph analytics to understand the proximity and the centrality of things in an article. So the first
step we perform is regular natural language processing. Some of you may be familiar with this; the simplest part, the thing you can do without too much computation, is part-of-speech tagging: basically assigning a word class to each word. Is this a verb, or is it a noun in this context? Is that an adjective? Once we have the part-of-speech tags, we can actually find a lot of candidates for potential things in the sentence. Here is a sentence from the abstract you just saw: "methods for measuring sodium concentration in serum by indirect sodium-selective electrode potentiometry". I have highlighted underneath, for those who do not read articles on a daily basis, the four things and one action, if you will, in common speech. If we extract all of the things here, they seem pretty straightforward, right? So what's the beef?
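The candidate-extraction step described above can be sketched in a few lines of Python (the language mentioned for prototyping). This is a minimal sketch: the part-of-speech tags below are hand-written for illustration rather than produced by a real tagger, and the actual pipeline's chunking rules are surely richer than a single adjective/noun run.

```python
# Sketch: collect maximal runs of adjective/noun tokens as phrase candidates.
NOUN_PHRASE_TAGS = {"JJ", "NN", "NNS", "NNP"}  # Penn Treebank adjective/noun tags

def candidate_phrases(tagged_tokens):
    """Return maximal adjective/noun runs as candidate 'things'."""
    candidates, current = [], []
    for word, tag in tagged_tokens:
        if tag in NOUN_PHRASE_TAGS:
            current.append(word)
        else:
            if current:
                candidates.append(" ".join(current))
            current = []
    if current:
        candidates.append(" ".join(current))
    return candidates

# Hand-annotated tags for the example sentence from the abstract.
sentence = [
    ("methods", "NNS"), ("for", "IN"), ("measuring", "VBG"),
    ("sodium", "NN"), ("concentration", "NN"), ("in", "IN"),
    ("serum", "NN"), ("by", "IN"), ("indirect", "JJ"),
    ("sodium-selective", "JJ"), ("electrode", "NN"), ("potentiometry", "NN"),
]
print(candidate_phrases(sentence))
# -> ['methods', 'sodium concentration', 'serum',
#     'indirect sodium-selective electrode potentiometry']
```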
It turns out you can say these things in many different ways, and if you want to find other content that is closely related to this article, you cannot just look at documents that include those exact words; you also need the ones that mention the same things in different ways. So basically, we have to deduplicate. We work with Springer Nature, one of the larger scientific publishers in the world. They have given us all of their content, and sifting through it we found on the other side of a hundred million "things". After processing that in various ways we deduplicate it down to maybe two or three million different things, and even at two or three million you still have separation between things that a human reader would find to be mostly the same thing. So there is a lot of deduplication to do, if you look at the examples here. So,
"concentration of sodium" can be mapped back to "sodium concentration". You can have sentences like "the electrode potentiometry was indirect", which is obviously the same as "indirect electrode potentiometry". Some people like to call things a "methodology" rather than a "method", and sometimes people use "sera", the plural, rather than "serum". These are what we call morphological or syntactic variations, basically the things that depend on the grammar. We also try to reduce the lexical and semantic variations, which is when authors use synonyms or hypernyms, the more generic, general terms for the same thing. For parts of our pipeline we also do that sort of abstraction: whenever someone says "method", we might map it back to a more generic term like "mechanism". A serum sample is actually a type of blood sample (serum is blood with something filtered out, but that is not my primary business), and for "serum sodium concentration", well, sodium is, I guess, the American term for natrium, and both are used. "Indirect electrode potentiometry", which we have now seen a couple of times, is actually a type of electroanalysis. So when we look at longer sentences or longer phrases, we go in and replace each of the tokens with a more generic term to figure out whether this is actually a variation of something we have seen before. All of this has really nothing to do with machine learning; it is just hard-coded understanding of linguistic variation.
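A toy sketch of this kind of hard-coded normalisation, under the assumption that it is rule- and table-driven as described. The rewrite rule and the lemma table here are invented for illustration; the real system obviously covers far more patterns.

```python
# Sketch: rule-based syntactic/morphological normalisation of short phrases.
# The lemma table and the single "X of Y" rewrite rule are illustrative only.
LEMMAS = {"methodology": "method", "sera": "serum"}

def normalise(phrase):
    """Map a phrase to a canonical variant via lemmas and one rewrite rule."""
    words = [LEMMAS.get(w, w) for w in phrase.lower().split()]
    # Rewrite "concentration of sodium" -> "sodium concentration".
    if len(words) == 3 and words[1] == "of":
        words = [words[2], words[0]]
    return " ".join(words)

print(normalise("concentration of sodium"))  # -> sodium concentration
print(normalise("sera methodology"))         # -> serum method
```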
We have compound paraphrases, adjectival modifiers, and coordinations, where a mention like "the concentration of sodium and magnesium" can be expanded into "concentration of magnesium" and "concentration of sodium". There are all these tedious rules we actually need to apply before we can do any kind of aggregated understanding.
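The coordination expansion just described can be sketched like this; it is a minimal illustration that assumes the single pattern "X of A and B", nothing more.

```python
def expand_coordination(phrase):
    """Expand 'concentration of sodium and magnesium' into one phrase
    per conjunct. Illustrative only: handles just the 'X of A and B' shape."""
    head, _, rest = phrase.partition(" of ")
    if not rest:
        return [phrase]  # no coordination pattern found; keep as-is
    conjuncts = [c.strip() for c in rest.split(" and ")]
    return [f"{head} of {c}" for c in conjuncts]

print(expand_coordination("concentration of sodium and magnesium"))
# -> ['concentration of sodium', 'concentration of magnesium']
```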
Then, finally, a couple more things: often we are looking at fragments of something else, or at something that contains a fragment which is itself more interesting. Sometimes it is "indirect potentiometry", and no one else in the world has ever put "sodium-selective" in between, so we have to identify that and take such author-specific variations out of the question, because they mean absolutely nothing to anyone else in the world. Here we also come to the matter of added descriptive detail that can really get in the way of understanding what is going on: "clinically implemented indirect something", or "error-prone indirect ion-selective whatever". These are all things that get in the way of understanding what is really being spoken about. Once we have deduplicated all of this, these tons of things really, we look at different types of features.
The local features within a document include how many times a thing is mentioned and what it is connected to. We actually calculate a position in a document graph: we connect all the things mentioned in the document with the relationships that connect them, and then do regular graph analysis to figure out what is central and what is peripheral to what is being talked about. Something can be mentioned only once but be really central because it is connected to that one very central thing, and other things out on the edge may be mentioned a couple of times but always in relation to non-central material. And then of course we run other types of analytics that use the textual context, the words right before and right after a piece of text.
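The document-graph idea can be sketched with plain Python; the concepts and edges below are invented stand-ins, and I use simple degree centrality where the real pipeline presumably uses richer graph measures.

```python
from collections import defaultdict

# Hypothetical concept co-mention edges extracted from one article.
edges = [
    ("sodium concentration", "serum"),
    ("sodium concentration", "electrode potentiometry"),
    ("sodium concentration", "measurement method"),
    ("electrode potentiometry", "electroanalysis"),
]

def degree_centrality(edges):
    """Rank concepts by how many distinct concepts they connect to."""
    neighbours = defaultdict(set)
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)
    return sorted(neighbours, key=lambda c: len(neighbours[c]), reverse=True)

print(degree_centrality(edges)[0])  # -> sodium concentration
```

Even this crude ranking captures the point made above: a concept mentioned once can still dominate the graph if everything else connects to it.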
The global features we use include occurrence counts, the number of documents that contain a given phrase. We run various fancy algorithms to figure out the most common variation: if you have an n-gram, if you will, a phrase of several words, what is its most commonly used variation? If you add an adjective in front, what is the most commonly used adjective, or what are the two most common, and are they sufficiently different to count as two different things? Then of course we also calculate tf-idf, which I guess many of you are familiar with: basically the deviation in frequency from a norm. If something occurs more often than it does on average, it is probably a significant phrase.
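For concreteness, here is the textbook tf-idf computation on a toy corpus of phrase lists (the documents are invented; real inputs would be full articles):

```python
import math

# Toy corpus: each document is a list of extracted phrases.
docs = [
    ["sodium concentration", "serum", "sodium concentration"],
    ["serum", "blood sample"],
    ["gene expression", "serum"],
]

def tf_idf(phrase, doc, docs):
    """Term frequency in `doc` weighted by inverse document frequency."""
    tf = doc.count(phrase) / len(doc)
    df = sum(1 for d in docs if phrase in d)
    idf = math.log(len(docs) / df)
    return tf * idf

# "sodium concentration" is rare across the corpus, so it outweighs
# the ubiquitous "serum" in the first document.
print(tf_idf("sodium concentration", docs[0], docs) >
      tf_idf("serum", docs[0], docs))  # -> True
```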
Then we look at distribution across the corpus. A thing can be mentioned very few times overall, but whenever someone uses it they mention it over and over again within the same document, which means it probably carries some significance, even though it may look insignificant if you just count the number of documents it occurs in globally. So we have a concentration score, which basically tells us: when the phrase occurs in a document at all, how likely is it to occur more than once? We also do an analysis comparing the distribution against domain-specific sub-corpora, to detect things that are very common but only within a certain domain. All of these signals are fed into our learning algorithms and ranking models.
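One plausible reading of the concentration score, sketched under the assumption that it is the conditional probability of repeat occurrence (the per-document counts below are invented):

```python
# Sketch: given that a phrase occurs in a document at all, how likely
# is it to occur more than once there?
def concentration(counts_per_doc):
    containing = [c for c in counts_per_doc if c > 0]
    if not containing:
        return 0.0
    return sum(1 for c in containing if c > 1) / len(containing)

# Mentioned in few documents, but repeatedly whenever it appears:
print(concentration([0, 0, 5, 0, 4, 0, 3]))  # -> 1.0
# Mentioned only once in passing wherever it appears:
print(concentration([1, 0, 1, 1, 0, 1]))     # -> 0.0
```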
We also use the aggregated textual context, and I am going to get back to this in a little while: this is the word2vec, or word embeddings, model that the previous speaker also mentioned. If we look at all the occurrences of a given phrase across the entire corpus, that tells us something about what it means, or what other things might mean the same thing. And then, of course, the biggest thing when you are training a model is the data you train it on.
We have two types of data we can train on. First, human training data. This could be the articles themselves: if we have hypothesized that a given concept is very central to an article, we can compare and see whether we actually found it in the abstract. If it is in the abstract or in the title, there is a high likelihood that the author also considers it important. That is one data point, and aggregated over thousands or millions of articles it can actually tell us how good we are at selecting the things that authors find important. Of course, if we think we can do better than the authors, that is a lousy way to measure it, so we also use other kinds of human training data: behavioral data from the companies we work with, who kindly allow us access to usage patterns. When we present something to users, which of the things we extract do they actually click on and find interesting? And when presented with a list of related articles, in the sidebar for instance, which of these were found most interesting, or clicked upon, by users? It turns out, of course, that it is the ones with the promising titles that get clicked on, not necessarily the ones that are most similar, so sometimes you need to make adjustments just to create some link bait.
The other type of data we use is synthetic data. We can actually construct an artificial corpus, train our models on that, and try to improve our models using the principles we used to create the synthetic data. It is slightly more complex, but it works: if any of you have tried the word2vec demo, the corpus they create there is actually completely synthetic. You can also build partially synthetic data sets. One approach we have tried, which was also used in the word2vec work, is to use a different search engine to create your artificial corpus: you search for two different concepts, two different words, mix the results together, and remove all traces of the words you searched for, so that the only thing left is everything else in the documents. Then you try to figure out whether you can still classify what was what and sort things into the right pile.
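The partially synthetic recipe just described can be sketched as follows. The "search results" here are hard-coded stand-ins for two queries; a real run would fetch documents from an actual search engine and feed the labelled, query-stripped texts to a classifier.

```python
# Sketch: build a partially synthetic labelled corpus by stripping the
# query terms from two sets of "search results" and mixing them.
results_a = ["sodium levels in serum samples", "serum sodium and diet"]
results_b = ["gene expression in yeast", "yeast gene regulation"]

def strip_terms(doc, terms):
    """Remove every trace of the words that were searched for."""
    return " ".join(w for w in doc.split() if w not in terms)

corpus = (
    [(strip_terms(d, {"sodium"}), "A") for d in results_a] +
    [(strip_terms(d, {"gene"}), "B") for d in results_b]
)
for text, label in corpus:
    print(label, "->", text)
```

A classifier trained on `corpus` can then be tested on whether it still separates the two piles without ever seeing the original query words.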
A little bit about word embeddings, then, which the previous speaker mentioned. Here is an example. Basically, you build a vector, or actually a tensor, a combination of vectors: each word, token, or phrase (we work on phrases) in our corpus is defined in this vector space by an aggregation of the vectors of the things it commonly co-occurs with. The traditional word2vec algorithm treats every token in the text as its own vector, and only a few things get concatenated because they belong together. We preprocess the text quite a lot instead: after we have deduplicated those hundred million things down to a few million, the phrases actually have decent occurrence counts. The big problem when you look at larger selections of text is that, statistically, they are much rarer than each word on its own. A phrase like "hyperaemic flow" does not necessarily occur that many times even in a million or ten million documents; it is so specific that you may only have a few hundred occurrences, so it is important to capture all of them, even when an author calls it something different. After all that deduplication we end up with a corpus on which we can actually run, or generate, a vector model, and then we use other things on top. We know that "coronary vasodilation" is actually defined in an ontology and related to all these different things, so we combine the vector model with the structured knowledge of the domain to refine it further, and that has worked really well for us.
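The distributional idea behind this can be sketched in miniature: represent each phrase by the words around it, then compare phrases by the similarity of those context vectors. The sentences below are invented, multi-word phrases are pre-merged into single underscore-joined tokens (a common trick, and roughly the preprocessing described above), and a raw co-occurrence count stands in for a trained word2vec model.

```python
from collections import Counter
import math

# Tiny invented corpus; phrases are pre-merged into single tokens.
sentences = [
    "samples were rinsed in deionized_water before analysis",
    "samples were rinsed in ultrapure_water before analysis",
    "plasmid dna was stored at low temperature",
]

def context_vector(phrase, sentences, window=2):
    """Count the words within `window` tokens of each occurrence of `phrase`."""
    vec = Counter()
    for s in sentences:
        toks = s.split()
        for i, t in enumerate(toks):
            if t == phrase:
                for j in range(max(0, i - window), min(len(toks), i + window + 1)):
                    if j != i:
                        vec[toks[j]] += 1
    return vec

def cosine(u, v):
    dot = sum(u[k] * v[k] for k in u)
    norm = lambda w: math.sqrt(sum(x * x for x in w.values()))
    return dot / (norm(u) * norm(v))

sim = cosine(context_vector("deionized_water", sentences),
             context_vector("ultrapure_water", sentences))
print(round(sim, 2))  # identical contexts -> 1.0
```

The two water phrases share exactly the same surroundings here, so their vectors coincide, which is the mechanism behind the nearest-neighbour lists shown next.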
Here is a little data dump from a test a while ago. What you see are phrases and their occurrence counts in a test corpus of, I think, a million articles. The first line, "deionized water", is actually part of a set that extends further to the right: you can see that "deionized water" has the same, or a very similar, vector as "bidistilled water", "ultrapure water", "DI water", "de-ionized water", and "double distilled water". It is important to notice that this is the output of a vector model where, for each concept in the first column, we find the nearest concepts, the ones that appear in the most similar contexts. The algorithm actually does not even look at the letters; each phrase just has an ID, and the model only knows the IDs of the things around it. So it is pretty obvious that this works purely from the hypothesis that words that mean approximately the same thing are used in approximately similar contexts: the ten words, or five words, before and after, aggregated over a million documents, will be very similar for things that, although they are different phrases, mean more or less the same thing. And you can see that when things are used interchangeably, that is very much the case: in row 60 or so, "crucial role" is more or less interchangeable with "prominent role", "vital role", "fundamental role", "pivotal role", and "essential role". Sounds about right.
And again, it is a great validation. People often work with data sets where they rarely see anything other than floating-point values; here you can actually look at the output and see that it makes sense. And if you are in doubt, when we do some limited QA to check whether things have become garbled by a bug introduced somewhere, you can always just look a phrase up on Wikipedia or somewhere and see whether it makes sense. "Pivotal role", "key player", "essential role": yes, it actually works. It is possible to run this even on phrases, which I think we have been the first to do. So, the upshot: what
have we done? We have created human-readable fingerprints. For any given text, regardless of the type of language used, we can extract phrases whose meaning we know and map them to the most commonly used definition or phrase that means the same thing. For a person skilled in the art, as they say, it suddenly becomes kind of easy to see what an article is about. We can rank the phrases and tell you the five or ten things that are most important in an article. And, if you look at the graph up there, when some author mentions insulin insensitivity in obese children, we will know that the article written a couple of years ago about overweight girls and reduced hormone response is actually talking about the exact same thing. That is a very big leap in the way we recommend text, in science
or indeed anywhere. To recap: traditional document similarity relies, as I said, on the words whose meaning we already know, and words can be ambiguous, which is a big problem. Hence what we call the phrase hypothesis, which is what we are working on: when you have a longer selection of words that stack together in the same fashion, they rarely have a different meaning; they usually have one very precise meaning. The ability to capture those phrases dynamically is basically what we do. So once you
have these fingerprints, you can actually produce all kinds of features that make researchers' lives easier. What we have delivered to the partners we work with is, first, the ability to highlight the principal components of an article. This is an article page; some of you may have seen one. If you search on Google for an article title, you get bounced to a publisher's web page where that article is presented, and we help make that page better. We help make it easier for readers to understand what is going on, we can pull out key sentences, and we can recommend things. We can tell the user, "this is where they mention that thing you are interested in; they use some different words, but it is about the same thing." We can provide related content, basically articles that talk about the same things, and when we do, we not only provide a related article but actually tell you how it overlaps with what you are currently looking at: we can show you the concepts that occur here and also occur in the article you are presently reading. We have also done an interactive version that allows the user to drill down and explore further: it has to contain this, and this, and then you get a recommendation. So we work very
closely with Springer Nature, Scientific American, Macmillan, many of the largest publishers, and we produce things like this. I guess the highlights are a little difficult to see, but in essence this is the non-schematic version of what I just showed you: on the right side we have related content, and you can click on any of the things you are interested in to get a filtered list of the most similar articles that also contain that thing. We also do other types of visualizations with related content. We can use our technology to find definitions of things: many of these scientific publishers have a large back catalogue of reference works, teaching books if you will, that define different concepts, so users can click on something like "RNA editing" and we can pick out the best definition we can find in the publisher's literature, rather than just relying on what is on Wikipedia. More interestingly, we are
also working on building tools that allow researchers to see more of the history that the material they are interested in is part of. Here is a tool we call Timeline. For a given article, the selected one here, from somewhere in the past, I guess around 2003, we use the citation data, forward and backward citations, to figure out which things were cited by this paper and which papers cited this paper, so forwards and backwards in time. That is a very, very large set, because a single article often cites 10, 20, or 50 other papers, each of which cites another 10, 50, or 100 papers, so it is a huge tree. What we do is basically prune that tree to just the branches that have articles talking about the same thing. That allows you fairly easily to identify an article from last year which talks about the same thing and, through a couple of links, actually cites the article you are presently looking at. Or, if you are looking at a recent article, you can ask who was the first author in this citation tree to actually combine this concept and that one in a paper.
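The pruning step can be sketched as a walk over a citation graph that keeps only concept-overlapping papers. All the data structures below (paper IDs, the `cites` map, the `concepts` map) are invented for illustration.

```python
# Sketch: forward-citation walk from a seed paper, pruned to papers
# whose extracted concepts overlap the seed's.
cites = {  # paper -> papers it cites
    "P2019": ["P2010"],
    "P2010": ["P2003", "P2001"],
    "P2015": ["P2003"],
}
concepts = {
    "P2003": {"sodium concentration", "electrode potentiometry"},
    "P2001": {"yeast genetics"},
    "P2010": {"sodium concentration"},
    "P2019": {"electrode potentiometry", "sodium concentration"},
    "P2015": {"protein folding"},
}

def citers_of(paper):
    return [p for p, refs in cites.items() if paper in refs]

def pruned_timeline(seed):
    """Papers citing `seed` (transitively) that share a concept with it."""
    keep, frontier = [], [seed]
    while frontier:
        paper = frontier.pop()
        for citer in citers_of(paper):
            if concepts[citer] & concepts[seed]:
                keep.append(citer)
                frontier.append(citer)
    return keep

print(sorted(pruned_timeline("P2003")))  # -> ['P2010', 'P2019']
```

P2015 also cites the seed, but talks about something else entirely, so the pruning drops that whole branch, which is exactly the point of the Timeline tool.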
The value we provide to researchers, and this is something we are kind of proud of, is that we accelerate the path to successful discovery by pointing directly to what is relevant in an article, and we can provide more relevant suggestions because they are much more precise than competing technologies. And then our little company actually also provides end-user features, because we believe that understanding the algorithms used, and how different algorithms favor different things, is important for the feature you are trying to construct and for how you rank results; it is actually very dependent on the type of use case you are trying to solve. Our clients, the publishers, are really happy that they can roll out a feature across many different types of content. In biomedicine, for instance gene research, or drugs and diseases, there is a lot of structured documentation, a lot of ontologies. All gene names, at least those discovered until fairly recently, are logged in an open-access ontology, and the documentation is really, really good in that small field of science. But everywhere outside of it, things are much, much worse. If you look at the humanities in general, there are rarely any official ontologies available that tell you which words are important or which things are synonyms of what. So what we do is actually very important for developing this type of service and recommendation for all the other disciplines. So: future
directions. As I said, we are currently working on understanding the relationships between all these things that we extract. There are so many different ways you can say a given thing, and when you talk about the relationship between two things, there is an equal number of different ways to say it. Just the fact that serum consists mostly of water can be said in so many different ways, and the same goes for "thin-film coated gold nanoparticles" (we are currently working on a product for the nano industry with a partner). What is interesting, of course, is that when these relationship phrasings stack up, we can replace the two things, the subject and the object, and then have a general understanding of how the relationship itself can be described. So a big challenge for us is trying to normalize and reduce the types of relationships between things in the corpus. Another big forward-looking
00:32:27feature is to provide our services to other companies that are trying to solve problems and have access to unstructured text but no ability to process it. So we're working with a couple of large companies to basically make large text collections computable. Much of what we do can be applied to any large collection of text, and you can do all sorts of really interesting analytics on it once you know what's what, what's similar, and what the important aspects of a text are. And then, ultimately, where we want to go is to do reasoning at scale.
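To make the relationship-normalization idea concrete, here is a minimal sketch; it is purely illustrative, not the company's actual pipeline, and the phrase lists, predicate names, and triple format are all invented for the example. The point is that many surface phrasings collapse onto one canonical predicate, so triples extracted from different sentences become comparable.

```python
# Toy sketch of relation normalization: many surface phrasings of the same
# relationship map onto one canonical predicate, so extracted
# (subject, predicate, object) triples line up across sentences.
# All phrase lists and predicate names here are invented for illustration.

CANONICAL_RELATIONS = {
    "consists_of": ["consists mostly of", "is composed of", "is made up of"],
    "coated_with": ["coated with", "covered by a thin film of"],
}

def normalize_relation(surface_form: str) -> str:
    """Map a surface phrasing onto a canonical predicate, if one is known."""
    text = surface_form.lower().strip()
    for predicate, variants in CANONICAL_RELATIONS.items():
        if any(v in text for v in variants):
            return predicate
    return "unknown"

# Two sentences that say the same thing in different words yield one triple:
t1 = ("serum", normalize_relation("consists mostly of"), "water")
t2 = ("serum", normalize_relation("is composed of"), "water")
assert t1 == t2 == ("serum", "consists_of", "water")
```

Once relations are reduced to a small canonical set like this, chaining them across a corpus is what makes reasoning at scale even thinkable.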
00:33:10That's really what you need in order to augment scientific research most efficiently: you need to be able to reason about what something is, what the causal chain of events is, and whether a fact is disputed. Does everyone agree that this is how things are, or are there perhaps long chains of causality that go unnoticed and can only really be uncovered by massive analytics? I guess the ultimate prize there is the cure for cancer.
00:33:46So, I guess: we have a small team, actually located in the second-largest town in Denmark. We're 18 people now, I think, and all of them have worked at big international companies and have chosen to come work with us for measly salaries and life in the suburbs, because we're so excited about the promise of assisting science. We have no Danish clients; we work with international publishers. And yes, we are hiring, so feel free to apply; we're growing right now and would love to receive applications from you guys. So I think that concludes my speech,
00:34:35and I'd love to answer questions; there's a ton of detail that I left out. There really are many questions; you've been busy asking them, so thanks for that. The first one is: is clickstream analysis used to analyze behavioral data, such as hyperlinks between articles, and do you use Spark for this? Yes, I think we do use Spark. A confession: even though I grew up with a computer and coded demos on my C64 in my parents' bedroom in the '80s, I don't actually work as a developer in our company; I'm one of the founders and I sell the vision, so I can of course answer accurately: we
00:35:22do look at clickstream data, but it's mostly limited to profile building, not session analysis, because there's a lot of noise and people get distracted. If you have subsequent clicks through a corpus, that really just tells you something about what the user is interested in, not necessarily that the things they click on are related, because people get distracted. So yes, we use clicks, but not really streams.
00:35:56The next question: if you do optimize for clicks like that, isn't that manipulation? Well, we were actually asked to do this. When you're working with big corporations you have different layers of management, and they have different key performance indicators, and the people who work on the front end would like to see a feature used, so you need to optimize the data for a feature to be used. I guess the reason I can still fall asleep at night is that I think what we're doing is vastly superior to the traditional co-download statistics that are normally used in science: the things that get recommended across scientific publishers are the things that other people downloaded in the same session. And I think
00:36:52one of the biggest problems with that, just to take a little diversion here, is that when you only look at behavioral data you have absolutely no way of recommending the new article that came out yesterday, because you have no behavioral data attached to it. It's what we call the cold-start problem: unless you can identify that this article is very similar to another article that does have behavioral data, you actually can't recommend a recent article until, by accident, people stumble across it and someone does something with it. So what we do here, obviously, is a Jekyll-and-Hyde thing, and the best solution is always a combination of the two factors.
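The "combination of the two factors" can be sketched as a simple blended ranker; this is an invented toy example with made-up numbers, not the company's actual scoring, but it shows why the content term matters for the cold-start case:

```python
# Toy sketch of blending the "two factors": content similarity plus a
# behavioral co-download signal. All numbers are invented. A brand-new
# article has no behavioral data yet (the cold-start case), so only the
# content term can surface it.

content_sim = {           # similarity of each candidate to the current article
    "classic_paper": 0.60,
    "new_paper": 0.85,    # published yesterday, textually very similar
}
co_download = {           # normalized same-session co-download score
    "classic_paper": 0.90,
    "new_paper": 0.0,     # nobody has read it yet
}

def blended_score(article: str, w_content: float = 0.5) -> float:
    """Weighted mix of content similarity and behavioral signal."""
    return (w_content * content_sim[article]
            + (1.0 - w_content) * co_download.get(article, 0.0))

# A purely behavioral ranker (w_content = 0) can never recommend new_paper;
# the blended ranker can, depending on the weight.
assert blended_score("new_paper", w_content=0.0) == 0.0
assert blended_score("new_paper", w_content=1.0) > blended_score("classic_paper", w_content=1.0)
```

The weight is the knob: pure behavior buries new articles, pure content ignores what readers actually do, and the useful range lies in between.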
00:37:35How do you make rules for classifying words or phrases that are very domain-specific, across the many different research domains? There are actually very few phrases that are syntactically identical across domains but carry very different meanings, and most of that problem we've circumnavigated by looking at longer phrases and by filtering out the stuff that has ambivalence. So you will actually see that we try not to mention things that, when mentioned alone, can mean different things; when you add an additional token in front of such a phrase, it often becomes much less ambiguous, and we then prefer that longer form. That's simply an algorithmic solution, not something we hard-code: we look at the phrases that have ambiguity and try to pick longer phrases, supersets that include them. Do you
00:38:45do any kind of personalization? We don't have a product for personalization, because it's a big hot potato in science: people are really afraid of being tracked, because they think they have the cure for cancer, and search history is a complete no-go for most of the clients we work with. So we don't have a product yet. We think it's incredibly interesting and we'd love to do it, but we don't have a partner to do it with, and it will probably be outside of science. What is the scale of data
00:39:22used in your processing, and how much data did it take to train your models? So, that's another thing. For the first two years of our startup we were trying to build a Google Scholar competitor: a destination site where users could come and search in full-text articles, not see the full-text articles, but we would index them for the publishers and then link out to the real content. We spoke to many different scientific publishers, and they all said: that's a brilliant idea. They had so many meetings with us over two years, saying: oh, here's another test sample of our content that you can have, and once we're ready to go you'll have this hard drive with a ton of articles, it will be no problem, everybody will be happy. Then, after two years with only a few thousand articles from each publisher, and a ton of meetings where they asked about our technology in depth and detail, we went out one night, I was in London, I remember, and one of the product managers, actually someone at VP level at one of those publishers, over a beer
00:40:29said: you know, it's never going to happen; they're just keeping you close because they want to know what kind of technology you're developing. A few months after that we pivoted into a different business plan: in view of how little open-access material there is, we decided to work within the framework of the publishers and be their friends. So now we're providing services that are primarily focused on using one publisher's data to perform services for that same publisher's clients. The larger publishers have 10 to 15 million articles; some of the aggregators have more, but most of our clients have less than 10 million documents. With each document being, I don't know, a few hundred kilobytes of plain ASCII, that's not a crazy amount of data: it's a few terabytes for a larger publisher. So, as Aaron Swartz found out, it could easily be dumped anywhere on the internet, but everyone would be sued. Okay.
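As a sanity check on those numbers, a quick back-of-envelope calculation; the 300 KB average is an assumed figure picked from the "few hundred kilobytes" range mentioned:

```python
# Back-of-envelope check of the corpus-size claim: ~10 million documents
# at a few hundred kilobytes of plain text each. The 300 KB average is an
# assumption within the range mentioned in the talk.
documents = 10_000_000
avg_doc_bytes = 300 * 1024              # ~300 KB per document
total_tb = documents * avg_doc_bytes / 1024**4
print(f"~{total_tb:.1f} TB")            # about 2.8 TB: "a few terabytes"
```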
00:41:47Would it make sense to pretty-print an article, normalize it, and republish it along with the original, and do you have a tool for that? No, we don't; we cannot provide access to the full text. We work with publishers, and it's a very tightly controlled business: their primary business asset, at least until open access becomes more dominant, is the content that they own and control, so we really can't do much with it except behind closed doors. When we worked with Elsevier last year, the forms we had to fill out for security compliance were crazy; I think it was a hundred and forty-seven tabs in an Excel sheet, with a hundred questions in each, and that was just the preliminary survey before they send a person over. So yeah, they're really, really serious about security. Are you using the Lambda Architecture,
00:42:52and can you talk about that? I'm not familiar with the Lambda Architecture; I know lambda coefficients, but no. Or, who knows, maybe we are using it. Okay:
00:43:11What is the most interesting finding you have made in your data, a cure for cancer? We haven't found that yet, and I guess we would have published it. We're a service provider, so we work with what the industry calls subject-matter experts, or SMEs. We have models where we validate the quality of what we do, and the error rates and so on; those are all automated tests. Then of course we run the results by a selected panel of real scientists who know the content we've processed and can tell if there's an error somewhere, a word that we left out that was important, which we can't really evaluate ourselves. So we know that the editors at the scientific publishers we work with say we have the best extraction algorithms, producing the best and most usable phrases and results. That's what we go by; we actually don't know what it ends up being used for. Okay:
00:44:20What about articles published in the public domain, on open platforms: are you indexing and presenting articles from these external sources? Yes, we are working with a couple of open-access publishers. The open-access model has sort of turned publishing inside out: traditional publishers publish your article for free as long as you sign over copyright, whereas for open access you have to pay for the peer-review process and the publishing. That cost has come down a lot from a few years ago, but you still pay around 2,000 euros to publish an article, and that puts a little damper on the growth of open access. But we do work with some of
00:45:13the open-access providers. We had this idea, when we started our company, that we would just aggregate all of open access. Well, good luck if you want to try, because the only people who have succeeded in doing anything vaguely resembling that are just aggregating the metadata. It turns out that people publish their articles in a gazillion different formats on a gazillion different websites, where sometimes the download button is behind some kind of "I'm not a robot" captcha, and it's really, really hard to get at the content. The biggest mistake the open-access community has made is not agreeing on a submission standard that allows the text to be mined. I just don't see why no one has come out and said: this is how you do it, this is the format, give us a JATS XML file, right here on an FTP server, dump it there, and let the community do the rest. But it hasn't been done, so it's not a task for startups; it's incredibly time-consuming to deal with thousands of different submission formats and PDFs. You may think PDF is a nice format, but it turns out that sometimes the renderer will swap the order of sentences around, and it's impossible to figure out where a sentence is completed, or you don't want to know. So we have to have someone else take care of that, and then we can do open access
00:46:44in a few years. Do you have some kind of best practice for running a deduplication process where different deep-learning methods could be applied? I'm not sure I understand the question, but we do have a pipeline; that's the key value-add, and I'm sorry I can't share the source code for free, we're trying to build a business, so if you want to work with it you should come to us. The pipeline we're building is iterative: we pipe in stuff that we've learned elsewhere, we work internally as a team, we write white papers, we give talks to each other, and it's a wonderful setup. Please come join us. Does this apply
00:47:35well to computer science papers? Oh yes, arXiv: we've indexed arXiv once, but we haven't set it up for re-indexing, and I think we should; it's the whole eat-your-own-dog-food thing. We should get that up and running again when we get around to it; we have these other jobs that pay money that we have to do first. Does your technology work
00:48:04for languages other than English? No, we haven't found anyone willing to pay for it yet. Most of what we do can be transferred to other languages; I'm not myself fluent in German, but I think possibly some rules would have to be adapted for its grammar. There's nothing fundamentally preventing it from being ported to other languages. We've been asked to do Chinese for IP analysis, patent analysis, but the tools everyone else is using are basically some kind of auto-translation with text analytics applied afterwards, which is probably inferior but makes more sense from a cost perspective. Unfortunately, I think that's it; a lot of questions, thanks for that, and let's say thank you to our speaker. Thank you.