hello and thank you for joining us so
hello and thank you for joining us so I’m part of a small start-up located
I’m part of a small start-up located
I’m part of a small start-up located here in Denmark he knows called on silo
here in Denmark he knows called on silo
here in Denmark he knows called on silo we work with big scientific publishers
we work with big scientific publishers
we work with big scientific publishers to process article information and to
to process article information and to
to process article information and to make tools for researchers and so maybe
make tools for researchers and so maybe
make tools for researchers and so maybe I should start with explaining the
I should start with explaining the
I should start with explaining the mission that we set up four years ago we
mission that we set up four years ago we
mission that we set up four years ago we started the company so our idea was to
started the company so our idea was to
started the company so our idea was to build a system of discovery services
build a system of discovery services
build a system of discovery services that could make it easy to find patterns
that could make it easy to find patterns
that could make it easy to find patterns across a lot of unstructured text today
across a lot of unstructured text today
across a lot of unstructured text today or a couple of years ago the way things
or a couple of years ago the way things
or a couple of years ago the way things were linked when you looked at an
were linked when you looked at an
were linked when you looked at an article and try to find something
article and try to find something
article and try to find something similar was using human annotated
similar was using human annotated
similar was using human annotated editors keywords that’s how you find
editors keywords that’s how you find
editors keywords that’s how you find related articles in science and the big
related articles in science and the big
related articles in science and the big challenges that we saw with the system
challenges that we saw with the system
challenges that we saw with the system as it was was that because scientific
as it was was that because scientific
as it was was that because scientific language is constantly evolving and
language is constantly evolving and
language is constantly evolving and growing and new things are being
growing and new things are being
growing and new things are being discovered it’s impossible to keep up
discovered it’s impossible to keep up
discovered it’s impossible to keep up with sort of hand curation of content it
with sort of hand curation of content it
with sort of hand curation of content it also a system also has to be omniscient
also a system also has to be omniscient
also a system also has to be omniscient because presently it’s as an author and
because presently it’s as an author and
because presently it’s as an author and an editor that looks at a paper and
an editor that looks at a paper and
an editor that looks at a paper and tries to decide what’s the important
tries to decide what’s the important
tries to decide what’s the important aspects of this article and sometimes
aspects of this article and sometimes
aspects of this article and sometimes really interesting discoveries are
really interesting discoveries are
really interesting discoveries are really only apparent in hindsight so you
really only apparent in hindsight so you
really only apparent in hindsight so you need an automated system that can
need an automated system that can
need an automated system that can correlate a new article two tons of
correlate a new article two tons of
correlate a new article two tons of other things that are currently going on
other things that are currently going on
other things that are currently going on figure out if stuff people in China
figure out if stuff people in China
figure out if stuff people in China doing something exactly similar to what
doing something exactly similar to what
doing something exactly similar to what you’re trying to do then finally it has
you’re trying to do then finally it has
you’re trying to do then finally it has to be unbiased because right now we have
to be unbiased because right now we have
to be unbiased because right now we have this problem most of the sort of
this problem most of the sort of
this problem most of the sort of recommenders and the concept curation
recommenders and the concept curation
recommenders and the concept curation that’s automated today is based on so
that’s automated today is based on so
that’s automated today is based on so the collaborative filtering or like the
the collaborative filtering or like the
the collaborative filtering or like the stuff you see on Amazon people who
stuff you see on Amazon people who
stuff you see on Amazon people who bought this also bought that it tends to
bought this also bought that it tends to
bought this also bought that it tends to lead us down the same path and it tends
lead us down the same path and it tends
lead us down the same path and it tends to make researchers trying to do
to make researchers trying to do
to make researchers trying to do something you walk straight by the most
something you walk straight by the most
something you walk straight by the most interesting stuff because that’s what
interesting stuff because that’s what
interesting stuff because that’s what everyone else also does so we need an
everyone else also does so we need an
everyone else also does so we need an unbiased approach that doesn’t rely on
unbiased approach that doesn’t rely on
unbiased approach that doesn’t rely on some kind of popularity ranking like
some kind of popularity ranking like
some kind of popularity ranking like page rank or collaborative filtering the
page rank or collaborative filtering the
page rank or collaborative filtering the sound is a little odd is it okay I don’t
sound is a little odd is it okay I don’t
sound is a little odd is it okay I don’t have that fancy clicker so the core
have that fancy clicker so the core
have that fancy clicker so the core technology that we’ve built is based on
technology that we’ve built is based on
technology that we’ve built is based on a lot of open source components or at
a lot of open source components or at
a lot of open source components or at least three components we have a
least three components we have a
least three components we have a document processing pipeline built
document processing pipeline built
document processing pipeline built around batchi you eema and bruta and
around batchi you eema and bruta and
around batchi you eema and bruta and we’ve run sort of standard natural
we’ve run sort of standard natural
we’ve run sort of standard natural language processing pipeline and tools
language processing pipeline and tools
language processing pipeline and tools on top of that and then we use common
on top of that and then we use common
on top of that and then we use common languages like Python for prototyping
languages like Python for prototyping
languages like Python for prototyping Java and a lot of libraries and stuff in
Java and a lot of libraries and stuff in
Java and a lot of libraries and stuff in the I guess the data scientists toolbox
the key challenge is what we’re trying
the key challenge is what we’re trying to do is that unstructured knowledge
to do is that unstructured knowledge
to do is that unstructured knowledge text basically does not compute as I
text basically does not compute as I
text basically does not compute as I said before there’s too much stuff going
said before there’s too much stuff going
said before there’s too much stuff going on for humans to be involved in this
on for humans to be involved in this
on for humans to be involved in this process and even when humans are
process and even when humans are
process and even when humans are involved on a higher level in building
involved on a higher level in building
involved on a higher level in building ontology is to represent the knowledge
ontology is to represent the knowledge
ontology is to represent the knowledge that we have of a certain discipline
that we have of a certain discipline
that we have of a certain discipline it’s not going fast enough all the
it’s not going fast enough all the
it’s not going fast enough all the interesting stuff that was found out
interesting stuff that was found out
interesting stuff that was found out yesterday or last month or even six
yesterday or last month or even six
yesterday or last month or even six months ago has not made it into a a
months ago has not made it into a a
months ago has not made it into a a curated ontology yet so if you really
curated ontology yet so if you really
curated ontology yet so if you really want to be at the forefront where the
want to be at the forefront where the
want to be at the forefront where the money is and and and where things matter
money is and and and where things matter
money is and and and where things matter in research you really need a more
in research you really need a more
in research you really need a more dynamic approach so even when there are
dynamic approach so even when there are
dynamic approach so even when there are dictionaries or reference works it’s
dictionaries or reference works it’s
dictionaries or reference works it’s it’s not simply not comprehensible
it’s not simply not comprehensible
it’s not simply not comprehensible enough and then the second big problem
enough and then the second big problem
enough and then the second big problem that we have is that people are way too
that we have is that people are way too
that we have is that people are way too creative they don’t use
creative they don’t use
creative they don’t use just one name for a certain phenomenon
just one name for a certain phenomenon
just one name for a certain phenomenon they have many different variations and
they have many different variations and
they have many different variations and they often add descriptive detail in
they often add descriptive detail in
they often add descriptive detail in their own language that makes absolutely
their own language that makes absolutely
their own language that makes absolutely no sense to a computer and makes it
no sense to a computer and makes it
no sense to a computer and makes it really difficult to figure out what
really difficult to figure out what
really difficult to figure out what they’re actually talking about there is
they’re actually talking about there is
they’re actually talking about there is no right way to describe anything in the
no right way to describe anything in the
no right way to describe anything in the world and and we somehow have to figure
world and and we somehow have to figure
world and and we somehow have to figure out what people are talking about so as
out what people are talking about so as
out what people are talking about so as I said finally there’s all this all the
I said finally there’s all this all the
I said finally there’s all this all the data that people consider in obvious
data that people consider in obvious
data that people consider in obvious that’s probably the biggest problem for
that’s probably the biggest problem for
that’s probably the biggest problem for for analytics today or for computer AI
for analytics today or for computer AI
for analytics today or for computer AI in general all the stuff that people
in general all the stuff that people
in general all the stuff that people consider obvious and then fail to
consider obvious and then fail to
consider obvious and then fail to include in a description of anything so
include in a description of anything so
include in a description of anything so those are the key problems that were
those are the key problems that were
those are the key problems that were that we’re trying to solve here’s a
that we’re trying to solve here’s a
that we’re trying to solve here’s a piece of text it’s an article from 2006
piece of text it’s an article from 2006
piece of text it’s an article from 2006 and if you use a regular sort of
and if you use a regular sort of
and if you use a regular sort of full-text search or some kind of
full-text search or some kind of
full-text search or some kind of standard search engine and you throw
standard search engine and you throw
standard search engine and you throw this it’s an abstract of an article the
this it’s an abstract of an article the
this it’s an abstract of an article the real article is probably ten times as
real article is probably ten times as
real article is probably ten times as long then it’s really difficult to see
long then it’s really difficult to see
long then it’s really difficult to see what this text is really about and if I
what this text is really about and if I
what this text is really about and if I read this how do I figure out what other
read this how do I figure out what other
read this how do I figure out what other articles talk about the same things so
articles talk about the same things so
articles talk about the same things so today we use computers to annotate the
today we use computers to annotate the
today we use computers to annotate the words that we know what means so these
words that we know what means so these
words that we know what means so these are the words that are found in common
are the words that are found in common
are the words that are found in common sort of dictionaries and ontology of
sort of dictionaries and ontology of
sort of dictionaries and ontology of this of this area and we have at our
this of this area and we have at our
this of this area and we have at our company developed a much more
company developed a much more
company developed a much more comprehensive way of looking at this and
comprehensive way of looking at this and
comprehensive way of looking at this and dynamically statistically deriving
dynamically statistically deriving
dynamically statistically deriving longer phrases that mean stuff and we
longer phrases that mean stuff and we
longer phrases that mean stuff and we figure out which mean approximately
figure out which mean approximately
figure out which mean approximately which of them mean approximately the
which of them mean approximately the
which of them mean approximately the same thing and right now as I said in
same thing and right now as I said in
same thing and right now as I said in the I think of the remarks for the talk
the I think of the remarks for the talk
the I think of the remarks for the talk I’m also going to try to talk a little
I’m also going to try to talk a little
I’m also going to try to talk a little bit about where we want to take things
bit about where we want to take things
bit about where we want to take things and what we’re currently working on and
and what we’re currently working on and
and what we’re currently working on and it’s not just as you can see we’re
it’s not just as you can see we’re
it’s not just as you can see we’re trying to cover all of the information
trying to cover all of the information
trying to cover all of the information is actually in an article try to map
is actually in an article try to map
is actually in an article try to map that out and make it searchable make it
that out and make it searchable make it
that out and make it searchable make it findable
findable
findable and we’re presently working on all of
and we’re presently working on all of
and we’re presently working on all of the actions and relationships between
the actions and relationships between
the actions and relationships between these things so that when you find stuff
these things so that when you find stuff
these things so that when you find stuff that talks about a and B the most
that talks about a and B the most
that talks about a and B the most relevant article is probably the one
relevant article is probably the one
relevant article is probably the one that talks about a and B and
that talks about a and B and
that talks about a and B and approximately the same context or the
approximately the same context or the
approximately the same context or the same sentence or even talks about how a
same sentence or even talks about how a
same sentence or even talks about how a is related to B today you can also do
is related to B today you can also do
is related to B today you can also do this with sort of distance number of
this with sort of distance number of
this with sort of distance number of words in-between when you use a
words in-between when you use a
words in-between when you use a traditional search engine but the thing
traditional search engine but the thing
traditional search engine but the thing is when you’re working with checks then
is when you’re working with checks then
is when you’re working with checks then sometimes the number of words in-between
sometimes the number of words in-between
sometimes the number of words in-between cross a paragraph boundary or sometimes
cross a paragraph boundary or sometimes
cross a paragraph boundary or sometimes it’s the image text that’s right next to
it’s the image text that’s right next to
it’s the image text that’s right next to that really interesting other thing that
that really interesting other thing that
that really interesting other thing that you were looking for so and and other
you were looking for so and and other
you were looking for so and and other times actually the thing that you’re
times actually the thing that you’re
times actually the thing that you’re interested in is mentioned up here with
interested in is mentioned up here with
interested in is mentioned up here with that third thing and down here the other
that third thing and down here the other
that third thing and down here the other thing is mentioned with that third thing
thing is mentioned with that third thing
thing is mentioned with that third thing so they’re actually really closely
so they’re actually really closely
so they’re actually really closely connected but they’re just that odd ends
connected but they’re just that odd ends
connected but they’re just that odd ends of the article so you need a better
of the article so you need a better
of the article so you need a better understanding of this and we actually
understanding of this and we actually
understanding of this and we actually use graph analytics to understand the
use graph analytics to understand the
use graph analytics to understand the proximity of things and the centrality
proximity of things and the centrality
proximity of things and the centrality of things in an article so the first
of things in an article so the first
of things in an article so the first step we we perform is a regular natural
step we we perform is a regular natural
step we we perform is a regular natural language processing some of you may be
language processing some of you may be
language processing some of you may be familiar with this but the simplest part
familiar with this but the simplest part
familiar with this but the simplest part of a natural language processing the
of a natural language processing the
of a natural language processing the thing that you do without too much
thing that you do without too much
thing that you do without too much computation is the power of speech
computation is the power of speech
computation is the power of speech tagging basically assigning word classes
tagging basically assigning word classes
tagging basically assigning word classes to each word is this a verb or is it a
to each word is this a verb or is it a
to each word is this a verb or is it a noun in this context is that an
noun in this context is that an
noun in this context is that an adjective and and once we have the part
adjective and and once we have the part
adjective and and once we have the part of speech tagging we actually can find a
of speech tagging we actually can find a
of speech tagging we actually can find a lot of candidates for potential things
lot of candidates for potential things
lot of candidates for potential things in the sentence so as you can see here
in the sentence so as you can see here
in the sentence so as you can see here we have a sentence from the abstract you
we have a sentence from the abstract you
we have a sentence from the abstract you just saw methods for measuring sodium
just saw methods for measuring sodium
just saw methods for measuring sodium concentration in serum by indirect
concentration in serum by indirect
concentration in serum by indirect sodium select row selective electrode
sodium select row selective electrode
sodium select row selective electrode potentiometry so I’ve highlighted
potentiometry so I’ve highlighted
potentiometry so I’ve highlighted underneath for those who don’t read
underneath for those who don’t read
underneath for those who don’t read articles on the daily basis there are
articles on the daily basis there are
articles on the daily basis there are four things here in an action if you
four things here in an action if you
four things here in an action if you will in come and speak and if we extract
will in come and speak and if we extract
will in come and speak and if we extract all of the things here they seem pretty
all of the things here they seem pretty
all of the things here they seem pretty straightforward right so so what’s the
straightforward right so so what’s the
straightforward right so so what’s the beef
beef
beef so turns out you can say these things in
so turns out you can say these things in
so turns out you can say these things in many different ways and if you want to
many different ways and if you want to
many different ways and if you want to see other content that is closely
see other content that is closely
see other content that is closely related to this article you need to dock
related to this article you need to dock
related to this article you need to dock not just look at the ones that include
not just look at the ones that include
not just look at the ones that include those exact words you need to also look
those exact words you need to also look
those exact words you need to also look at the ones that mention these same
at the ones that mention these same
at the ones that mention these same things in different ways so we have to
things in different ways so we have to
things in different ways so we have to deduplicate basically so we work with
deduplicate basically so we work with
deduplicate basically so we work with Springer nature which is one of the
Springer nature which is one of the
Springer nature which is one of the larger scientific publishers in the
larger scientific publishers in the
larger scientific publishers in the world they’ve given us all of their
world they’ve given us all of their
world they’ve given us all of their content and we’ve sifted through it we
content and we’ve sifted through it we
content and we’ve sifted through it we found on the other side of a hundred
found on the other side of a hundred
found on the other side of a hundred million things in their content and we
million things in their content and we
million things in their content and we then after processing that in various
then after processing that in various
then after processing that in various ways deduplicate that down to maybe two
ways deduplicate that down to maybe two
ways deduplicate that down to maybe two or three million different things and
or three million different things and
or three million different things and even when you’re down at two or three
even when you’re down at two or three
even when you’re down at two or three two or three million different things
two or three million different things
two or three million different things you still have separation between things
you still have separation between things
you still have separation between things that may be a human reader would find to
that may be a human reader would find to
that may be a human reader would find to be mostly the same thing so there’s a
be mostly the same thing so there’s a
be mostly the same thing so there’s a lot of deduplication you need to do if
lot of deduplication you need to do if
lot of deduplication you need to do if you can look at the examples here so
you can look at the examples here so
you can look at the examples here so concentration of sodium can be mapped
concentration of sodium can be mapped
concentration of sodium can be mapped back to sodium concentration you can
back to sodium concentration you can
back to sodium concentration you can also have like sentences like the
also have like sentences like the
also have like sentences like the electro potentiometry was indirect well
electro potentiometry was indirect well
electro potentiometry was indirect well obviously that’s the same as indirect
obviously that’s the same as indirect
obviously that’s the same as indirect electrode present geometry you can talk
electrode present geometry you can talk
electrode present geometry you can talk some people like to call things a
some people like to call things a
some people like to call things a methodology rather than a method and
methodology rather than a method and
methodology rather than a method and sometimes people talk about zero and
sometimes people talk about zero and
sometimes people talk about zero and plural rather than serum so these are
plural rather than serum so these are
plural rather than serum so these are what we call morphological or
what we call morphological or
what we call morphological or syntactical variations basically the
syntactical variations basically the
syntactical variations basically the things that depend on the grammar we
things that depend on the grammar we
things that depend on the grammar we also try to reduce the lexical and
also try to reduce the lexical and
also try to reduce the lexical and semantic variations that’s when authors
semantic variations that’s when authors
semantic variations that’s when authors use synonyms or hypo names which are
use synonyms or hypo names which are
use synonyms or hypo names which are like more generic general terms for the
like more generic general terms for the
like more generic general terms for the same thing so for four parts of our
same thing so for four parts of our
same thing so for four parts of our pipeline we actually also do that sort
pipeline we actually also do that sort
pipeline we actually also do that sort of abstraction so whenever someone says
of abstraction so whenever someone says
of abstraction so whenever someone says method we might map that back to a more
method we might map that back to a more
method we might map that back to a more generic term called mechanism serum
generic term called mechanism serum
generic term called mechanism serum sample it’s actually a type of blood
sample it’s actually a type of blood
sample it’s actually a type of blood sample like the serum is the blood with
sample like the serum is the blood with
sample like the serum is the blood with something filtered out that’s not my
something filtered out that’s not my
something filtered out that’s not my primary
primary
primary business and serum sodium concentration
business and serum sodium concentration
business and serum sodium concentration well sodium actually is the i guess the
well sodium actually is the i guess the
well sodium actually is the i guess the American term for nature or it’s also
American term for nature or it’s also
American term for nature or it’s also use that sometimes and indirect
use that sometimes and indirect
use that sometimes and indirect electrode function geometry that we’ve
electrode function geometry that we’ve
electrode function geometry that we’ve now see you in a couple of times it’s
now see you in a couple of times it’s
now see you in a couple of times it’s actually a type of electroanalysis so
actually a type of electroanalysis so
actually a type of electroanalysis so when we look at longer sentences or
when we look at longer sentences or
when we look at longer sentences or longer phrases we actually go in and
longer phrases we actually go in and
longer phrases we actually go in and replace each of the tokens with a more
replace each of the tokens with a more
replace each of the tokens with a more generic term to figure out if this is
generic term to figure out if this is
generic term to figure out if this is actually a variation of something that
actually a variation of something that
actually a variation of something that we’ve seen before all of this is really
we’ve seen before all of this is really
we’ve seen before all of this is really nothing to do with machine learning this
nothing to do with machine learning this
nothing to do with machine learning this is just hard coded understanding of
is just hard coded understanding of
is just hard coded understanding of linguistic variations so we have
linguistic variations so we have
linguistic variations so we have compound paraphrases and ejectable
compound paraphrases and ejectable
compound paraphrases and ejectable modifiers and coordinations where you
modifiers and coordinations where you
modifiers and coordinations where you mention things like the concentration of
mention things like the concentration of
mention things like the concentration of sodium and magnesium can be expanded
sodium and magnesium can be expanded
sodium and magnesium can be expanded into concentration of magnesium and
into concentration of magnesium and
into concentration of magnesium and concentration of sodium and all of these
concentration of sodium and all of these
concentration of sodium and all of these tedious rules that we actually need to
tedious rules that we actually need to
tedious rules that we actually need to perform before you can do any type of
perform before you can do any type of
perform before you can do any type of sort of aggregated understanding and
sort of aggregated understanding and
sort of aggregated understanding and then final to a couple of things there
then final to a couple of things there
then final to a couple of things there often we’re looking at fragments of
often we’re looking at fragments of
often we’re looking at fragments of something else or we’re looking at
something else or we’re looking at
something else or we’re looking at something that contains a fragment which
something that contains a fragment which
something that contains a fragment which is more interesting so sometimes it’s
is more interesting so sometimes it’s
is more interesting so sometimes it’s the indirect potentiometry and no one
the indirect potentiometry and no one
the indirect potentiometry and no one else in the world has ever put sodium
else in the world has ever put sodium
else in the world has ever put sodium selective in between there so we have to
selective in between there so we have to
selective in between there so we have to identify that and take sort of author
identify that and take sort of author
identify that and take sort of author specific variations out of the question
specific variations out of the question
specific variations out of the question because they mean absolutely nothing to
because they mean absolutely nothing to
because they mean absolutely nothing to anyone else in the world and here we
anyone else in the world and here we
anyone else in the world and here we come to also this matter of adding
come to also this matter of adding
come to also this matter of adding additional descriptive detail that can
additional descriptive detail that can
additional descriptive detail that can really be in the way of understanding
really be in the way of understanding
really be in the way of understanding what’s going on so clinically
what’s going on so clinically
what’s going on so clinically implemented indirect something or
implemented indirect something or
implemented indirect something or error-prone indirect ion selective
error-prone indirect ion selective
error-prone indirect ion selective whatever whatever these are all things
whatever whatever these are all things
whatever whatever these are all things that get in the way of understanding
that get in the way of understanding
that get in the way of understanding what’s really being spoken about then
what’s really being spoken about then
what’s really being spoken about then once we have deduplicated all this these
once we have deduplicated all this these
once we have deduplicated all this these tons of things really we look at
tons of things really we look at
tons of things really we look at different types of features
different types of features
different types of features so the local features in the document
so the local features in the document
so the local features in the document include how many times it’s mentioned
include how many times it’s mentioned
include how many times it’s mentioned what’s it connected to we actually
what’s it connected to we actually
what’s it connected to we actually calculate a position in a document graph
calculate a position in a document graph
calculate a position in a document graph we connect all the things in the mention
we connect all the things in the mention
we connect all the things in the mention in the document with the relationships
in the document with the relationships
in the document with the relationships that connect them and then do regular
that connect them and then do regular
that connect them and then do regular sort of graph analysis to figure out
sort of graph analysis to figure out
sort of graph analysis to figure out what’s central and what sort of a
what’s central and what sort of a
what’s central and what sort of a peripheral to what’s being talked about
peripheral to what’s being talked about
peripheral to what’s being talked about so you can have something that’s only
so you can have something that’s only
so you can have something that’s only mentioned once but really central
mentioned once but really central
mentioned once but really central because it’s connected to that very one
because it’s connected to that very one
because it’s connected to that very one central thing and you can have stuff out
central thing and you can have stuff out
central thing and you can have stuff out here that may be mentioned a couple of
here that may be mentioned a couple of
here that may be mentioned a couple of times but always in relation to stuff
times but always in relation to stuff
times but always in relation to stuff that’s non central and and then of
that’s non central and and then of
that’s non central and and then of course we run these other types of
course we run these other types of
course we run these other types of analytics that use the textual context
analytics that use the textual context
analytics that use the textual context so the words right before and right
so the words right before and right
so the words right before and right after a piece of text the global
after a piece of text the global
after a piece of text the global features that we use are also sort of
features that we use are also sort of
features that we use are also sort of occurrence count the number of documents
occurrence count the number of documents
occurrence count the number of documents that contain the given phrase and we run
that contain the given phrase and we run
that contain the given phrase and we run various sort of fancy algorithms to
various sort of fancy algorithms to
various sort of fancy algorithms to figure out what the most common
figure out what the most common
figure out what the most common variation if you have a set of an Engram
variation if you have a set of an Engram
variation if you have a set of an Engram if you will a phrase the words what’s
if you will a phrase the words what’s
if you will a phrase the words what’s the most common used a variation if you
the most common used a variation if you
the most common used a variation if you add an additional adjective in front
add an additional adjective in front
add an additional adjective in front what’s the most commonly used adjective
what’s the most commonly used adjective
what’s the most commonly used adjective or what are the two most common things
or what are the two most common things
or what are the two most common things and are they sufficiently different
and are they sufficiently different
and are they sufficiently different different to be two different things
different to be two different things
different to be two different things then of course we also calculate I guess
then of course we also calculate I guess
then of course we also calculate I guess many of you probably also familiar with
many of you probably also familiar with
many of you probably also familiar with the tf-idf which is basically deviation
the tf-idf which is basically deviation
the tf-idf which is basically deviation in frequency from from a norm so if
in frequency from from a norm so if
in frequency from from a norm so if things occur more often than they do on
things occur more often than they do on
things occur more often than they do on average that’s that’s a significant
average that’s that’s a significant
average that’s that’s a significant phrase probably and then we look at
phrase probably and then we look at
phrase probably and then we look at distribution across the corpus so things
distribution across the corpus so things
distribution across the corpus so things a thing can be mentioned very few times
a thing can be mentioned very few times
a thing can be mentioned very few times but whenever someone uses that thing
but whenever someone uses that thing
but whenever someone uses that thing they mention it over and over again in
they mention it over and over again in
they mention it over and over again in the same document so that means it’s
the same document so that means it’s
the same document so that means it’s probably got some significance but if
probably got some significance but if
probably got some significance but if you look at it globally and just count
you look at it globally and just count
you look at it globally and just count the number of documents it occurs in it
the number of documents it occurs in it
the number of documents it occurs in it may seem insignificant so we have this
may seem insignificant so we have this
may seem insignificant so we have this concentration score which pay
concentration score which pay
concentration score which pay sickly tells us when it when it occurs
sickly tells us when it when it occurs
sickly tells us when it when it occurs in the document how likely is it occur
in the document how likely is it occur
in the document how likely is it occur to occur more than once and then we also
to occur more than once and then we also
to occur more than once and then we also do an analysis comparing the
do an analysis comparing the
do an analysis comparing the distribution to domain regions to figure
distribution to domain regions to figure
distribution to domain regions to figure out that this is something that’s very
out that this is something that’s very
out that this is something that’s very common but only in a certain domain and
common but only in a certain domain and
common but only in a certain domain and all of these things are affected into
all of these things are affected into
all of these things are affected into the our learning algorithms or ranking
the our learning algorithms or ranking
the our learning algorithms or ranking models we also use the aggregated
models we also use the aggregated
models we also use the aggregated textual context and this is the I’m
textual context and this is the I’m
textual context and this is the I’m going to get back to that in a little
going to get back to that in a little
going to get back to that in a little while this is the word to Vic or word
while this is the word to Vic or word
while this is the word to Vic or word embeddings models that the previous
embeddings models that the previous
embeddings models that the previous speaker also mentioned so if we look at
speaker also mentioned so if we look at
speaker also mentioned so if we look at all the occurrences of the given phrase
all the occurrences of the given phrase
all the occurrences of the given phrase across the entire corpus that tells us
across the entire corpus that tells us
across the entire corpus that tells us something about what it means or what
something about what it means or what
something about what it means or what other things might mean the same thing
other things might mean the same thing
other things might mean the same thing and then of course the biggest thing
and then of course the biggest thing
and then of course the biggest thing when you’re trying to train a model is
when you’re trying to train a model is
when you’re trying to train a model is the thing that you’re training it on so
the thing that you’re training it on so
the thing that you’re training it on so we have two types of things that we can
we have two types of things that we can
we have two types of things that we can train on we have human training data
train on we have human training data
train on we have human training data this could be the articles themselves we
this could be the articles themselves we
this could be the articles themselves we figure out if we have and I parthis
figure out if we have and I parthis
figure out if we have and I parthis purposes that a given concept is very
purposes that a given concept is very
purposes that a given concept is very central to an article we can compare it
central to an article we can compare it
central to an article we can compare it and see if we actually found it in the
and see if we actually found it in the
and see if we actually found it in the abstract so if it’s in the abstract or
abstract so if it’s in the abstract or
abstract so if it’s in the abstract or in the title as a high likelihood that
in the title as a high likelihood that
in the title as a high likelihood that the author also considers it important
the author also considers it important
the author also considers it important so that’s one data point and then
so that’s one data point and then
so that’s one data point and then aggregated over thousands or millions of
aggregated over thousands or millions of
aggregated over thousands or millions of articles that actually can tell us how
articles that actually can tell us how
articles that actually can tell us how good we are at selecting the things that
good we are at selecting the things that
good we are at selecting the things that authors find important of course if we
authors find important of course if we
authors find important of course if we think we can do better than the authors
think we can do better than the authors
think we can do better than the authors that’s a lousy way to measure it so we
that’s a lousy way to measure it so we
that’s a lousy way to measure it so we also use other types of human training
also use other types of human training
also use other types of human training data behavioral data from the companies
data behavioral data from the companies
data behavioral data from the companies we work with they kindly allow us access
we work with they kindly allow us access
we work with they kindly allow us access to usage patterns when we present
to usage patterns when we present
to usage patterns when we present something to users which of these things
something to users which of these things
something to users which of these things that we extract did they actually click
that we extract did they actually click
that we extract did they actually click on find interesting and and which
on find interesting and and which
on find interesting and and which articles when presented with a list of
articles when presented with a list of
articles when presented with a list of articles related articles
articles related articles
articles related articles the sidebar for instance which of these
the sidebar for instance which of these
the sidebar for instance which of these were found to be most interesting or
were found to be most interesting or
were found to be most interesting or clicked upon by by users turns out of
clicked upon by by users turns out of
clicked upon by by users turns out of course is the ones with the promising
course is the ones with the promising
course is the ones with the promising titles that get clicked on not
titles that get clicked on not
titles that get clicked on not necessarily the ones that are most
necessarily the ones that are most
necessarily the ones that are most similar so sometimes you need to make
similar so sometimes you need to make
similar so sometimes you need to make adjustments just to to create some link
adjustments just to to create some link
adjustments just to to create some link bait so the other type of synthetic data
bait so the other type of synthetic data
bait so the other type of synthetic data that we use is the data that we use is
that we use is the data that we use is
that we use is the data that we use is synthetic data so we can actually
synthetic data so we can actually
synthetic data so we can actually construct an artificial corpus and and
construct an artificial corpus and and
construct an artificial corpus and and train our models on that and try to
train our models on that and try to
train our models on that and try to improve our models using the principles
improve our models using the principles
improve our models using the principles that we that we use to create the
that we that we use to create the
that we that we use to create the synthetic data it’s slightly more
synthetic data it’s slightly more
synthetic data it’s slightly more complex but you can actually that’s
complex but you can actually that’s
complex but you can actually that’s that’s how the demo if any of you have
that’s how the demo if any of you have
that’s how the demo if any of you have tried war tyvek the demo that they
tried war tyvek the demo that they
tried war tyvek the demo that they create there is actually completely
create there is actually completely
create there is actually completely synthetic and you can also build
synthetic and you can also build
synthetic and you can also build partially synthetic data sets one that
partially synthetic data sets one that
partially synthetic data sets one that we’ve tried and that actually it was
we’ve tried and that actually it was
we’ve tried and that actually it was also used on what to work was to use a
also used on what to work was to use a
also used on what to work was to use a different search engine to create your
different search engine to create your
different search engine to create your artificial corpus so you search for
artificial corpus so you search for
artificial corpus so you search for something maybe two different concepts
something maybe two different concepts
something maybe two different concepts two different words and then you mix
two different words and then you mix
two different words and then you mix them together and you remove all traces
them together and you remove all traces
them together and you remove all traces of the worst that you searched for so
of the worst that you searched for so
of the worst that you searched for so the only thing that’s left or everything
the only thing that’s left or everything
the only thing that’s left or everything else in the document and then you try to
else in the document and then you try to
else in the document and then you try to figure out if you can still classify
figure out if you can still classify
figure out if you can still classify what was what and and and dump things in
what was what and and and dump things in
what was what and and and dump things in the right pile so a little bit about
the right pile so a little bit about
the right pile so a little bit about word embeddings so the previous author
word embeddings so the previous author
word embeddings so the previous author mentioned it here’s an example basically
mentioned it here’s an example basically
mentioned it here’s an example basically what you do is you build a lecture or
what you do is you build a lecture or
what you do is you build a lecture or it’s actually a tensor it’s a
it’s actually a tensor it’s a
it’s actually a tensor it’s a combination of vectors so each each word
combination of vectors so each each word
combination of vectors so each each word or token or phrase we work on phrases in
or token or phrase we work on phrases in
or token or phrase we work on phrases in our corpus is actually defined in this
our corpus is actually defined in this
our corpus is actually defined in this lecture space by an aggregation of
lecture space by an aggregation of
lecture space by an aggregation of vectors that it commonly co-occurs with
vectors that it commonly co-occurs with
vectors that it commonly co-occurs with so the traditional word tyvek algorithm
so the traditional word tyvek algorithm
so the traditional word tyvek algorithm will just work on create treat all text
will just work on create treat all text
will just work on create treat all text as a token every token as its own
as a token every token as its own
as a token every token as its own vector and then only a few things get
vector and then only a few things get
vector and then only a few things get concatenated because they belong
concatenated because they belong
concatenated because they belong together so we we pre process the text
together so we we pre process the text
together so we we pre process the text quite a lot and figure out after we’ve
quite a lot and figure out after we’ve
quite a lot and figure out after we’ve deduplicated all these hundred million
deduplicated all these hundred million
deduplicated all these hundred million things we’re down to so few million
things we’re down to so few million
things we’re down to so few million things that they actually have decent
things that they actually have decent
things that they actually have decent recurrent occurrence counts there’s the
recurrent occurrence counts there’s the
recurrent occurrence counts there’s the big problem when you’re looking at
big problem when you’re looking at
big problem when you’re looking at larger selections of text is that
larger selections of text is that
larger selections of text is that they’re kind of statistics be more
they’re kind of statistics be more
they’re kind of statistics be more unlikely than the word each word on its
unlikely than the word each word on its
unlikely than the word each word on its own so you have a problem with for
own so you have a problem with for
own so you have a problem with for instance hyper amick flow doesn’t
instance hyper amick flow doesn’t
instance hyper amick flow doesn’t necessarily occur that many times even
necessarily occur that many times even
necessarily occur that many times even when you have a million documents or 10
when you have a million documents or 10
when you have a million documents or 10 million documents it’s still something
million documents it’s still something
million documents it’s still something so specific that you only have a few
so specific that you only have a few
so specific that you only have a few hundred occurrences so it’s important to
hundred occurrences so it’s important to
hundred occurrences so it’s important to capture all of them even when the author
capture all of them even when the author
capture all of them even when the author calls it something different but after
calls it something different but after
calls it something different but after we’ve done all that deduplication we
we’ve done all that deduplication we
we’ve done all that deduplication we actually end up with a corpus what we
actually end up with a corpus what we
actually end up with a corpus what we can run a vector model or generate a
can run a vector model or generate a
can run a vector model or generate a vector model and then we use other
vector model and then we use other
vector model and then we use other things on top so we know that coronary
things on top so we know that coronary
things on top so we know that coronary vasodilation actually is defined in an
vasodilation actually is defined in an
vasodilation actually is defined in an ontology it’s related to all these
ontology it’s related to all these
ontology it’s related to all these different things and then we combine
different things and then we combine
different things and then we combine things using our so the structured
things using our so the structured
things using our so the structured knowledge of that domain to further
knowledge of that domain to further
knowledge of that domain to further refine the vector model and and that’s
refine the vector model and and that’s
refine the vector model and and that’s work really well for us here’s this is
work really well for us here’s this is
work really well for us here’s this is just a little data dump from a test a
just a little data dump from a test a
just a little data dump from a test a while ago but what you see here are
while ago but what you see here are
while ago but what you see here are phrases and a current accounts in the
phrases and a current accounts in the
phrases and a current accounts in the test corpus of I think are a million
test corpus of I think are a million
test corpus of I think are a million articles and here you can see like the
articles and here you can see like the
articles and here you can see like the first line deionized water it’s actually
first line deionized water it’s actually
first line deionized water it’s actually part of a set it extends further to the
part of a set it extends further to the
part of a set it extends further to the right but the first line you can see do
right but the first line you can see do
right but the first line you can see do ionized water is actually the same or
ionized water is actually the same or
ionized water is actually the same or has a similar vector as by distilled
has a similar vector as by distilled
has a similar vector as by distilled water ultrapure water di water tea /
water ultrapure water di water tea /
water ultrapure water di water tea / ionized water or double distilled water
ionized water or double distilled water
ionized water or double distilled water and these are important to notice that
and these are important to notice that
and these are important to notice that this is the output from a vector model
this is the output from a vector model
this is the output from a vector model where we basically for each concept in
where we basically for each concept in
where we basically for each concept in the first column we find the nearest
the first column we find the nearest
the first column we find the nearest concepts the most the concepts that
concepts the most the concepts that
concepts the most the concepts that appear in the most
appear in the most
appear in the most similar context so the algorithm
similar context so the algorithm
similar context so the algorithm actually does not even look at the
actually does not even look at the
actually does not even look at the letters it just has an ID and then it
letters it just has an ID and then it
letters it just has an ID and then it knows the ID of the things around it and
knows the ID of the things around it and
knows the ID of the things around it and so it’s pretty obvious you that it is
so it’s pretty obvious you that it is
so it’s pretty obvious you that it is actually possible just from the
actually possible just from the
actually possible just from the hypothesis is that words that mean
hypothesis is that words that mean
hypothesis is that words that mean approximately the same are used in
approximately the same are used in
approximately the same are used in approximately similar context so the 10
approximately similar context so the 10
approximately similar context so the 10 words or five words before and after
words or five words before and after
words or five words before and after over a million documents will be very
over a million documents will be very
over a million documents will be very similar for things that although they
similar for things that although they
similar for things that although they are different phrases mean more or less
are different phrases mean more or less
are different phrases mean more or less the same thing so you can see when
the same thing so you can see when
the same thing so you can see when things are used interchangeably that is
things are used interchangeably that is
things are used interchangeably that is very much the case so for instance row I
very much the case so for instance row I
very much the case so for instance row I guess 60 so crucial role actually is the
guess 60 so crucial role actually is the
guess 60 so crucial role actually is the more or less the interchangeably used
more or less the interchangeably used
more or less the interchangeably used with prominent role vital role
with prominent role vital role
with prominent role vital role fundamental role pivotal role or
fundamental role pivotal role or
fundamental role pivotal role or essential role sounds about right and
essential role sounds about right and
essential role sounds about right and again it’s it’s a great validation
again it’s it’s a great validation
again it’s it’s a great validation sometimes people work with data sets and
sometimes people work with data sets and
sometimes people work with data sets and they rarely ever see like anything else
they rarely ever see like anything else
they rarely ever see like anything else than floating point values here you can
than floating point values here you can
than floating point values here you can actually look at it and see that does
actually look at it and see that does
actually look at it and see that does actually make sense and if you’re in
actually make sense and if you’re in
actually make sense and if you’re in doubt you when we do sort of limited QA
doubt you when we do sort of limited QA
doubt you when we do sort of limited QA to see if things have become garbled by
to see if things have become garbled by
to see if things have become garbled by some bug introduced somewhere you can
some bug introduced somewhere you can
some bug introduced somewhere you can always just like look it up on Wikipedia
always just like look it up on Wikipedia
always just like look it up on Wikipedia or something see does it make sense and
or something see does it make sense and
or something see does it make sense and I think him so pivotal role key player
I think him so pivotal role key player
I think him so pivotal role key player essential role yeah so it actually works
essential role yeah so it actually works
essential role yeah so it actually works it’s possible to run this even on
it’s possible to run this even on
it’s possible to run this even on phrases which I think we have been the
phrases which I think we have been the
phrases which I think we have been the first to do so the upshot of this what
first to do so the upshot of this what
first to do so the upshot of this what have we done we’ve created human
have we done we’ve created human
have we done we’ve created human readable fingerprints so we’ve for any
readable fingerprints so we’ve for any
readable fingerprints so we’ve for any given text regardless of the type of
given text regardless of the type of
given text regardless of the type of language used we can extract some some
language used we can extract some some
language used we can extract some some phrases that we know what they mean and
phrases that we know what they mean and
phrases that we know what they mean and we can map them to the most commonly
we can map them to the most commonly
we can map them to the most commonly used definition or phrase that means the
used definition or phrase that means the
used definition or phrase that means the same thing and for a person skilled in
same thing and for a person skilled in
same thing and for a person skilled in the arts as they say it’s kind of easy
the arts as they say it’s kind of easy
the arts as they say it’s kind of easy to suddenly see what an article is about
to suddenly see what an article is about
to suddenly see what an article is about we can rank them and we can tell you the
we can rank them and we can tell you the
we can rank them and we can tell you the 5 10 things that are most important and
5 10 things that are most important and
5 10 things that are most important and an Arctic
an Arctic
an Arctic and when people say if you look at the
and when people say if you look at the
and when people say if you look at the graph up there when when when some
graph up there when when when some
graph up there when when when some author mentions insulin insensitivity
author mentions insulin insensitivity
author mentions insulin insensitivity and obese children we will know that
and obese children we will know that
and obese children we will know that that article that was written a couple
that article that was written a couple
that article that was written a couple of years ago about oh wait girls and
of years ago about oh wait girls and
of years ago about oh wait girls and reduced hormone response is actually
reduced hormone response is actually
reduced hormone response is actually talking about the exact same thing and
talking about the exact same thing and
talking about the exact same thing and that’s a that’s a that’s a very big leap
that’s a that’s a that’s a very big leap
that’s a that’s a that’s a very big leap in the way we recommend text in science
in the way we recommend text in science
in the way we recommend text in science or indeed anywhere so traditional
or indeed anywhere so traditional
or indeed anywhere so traditional document similarity relies on as I said
document similarity relies on as I said
document similarity relies on as I said to recap the words that we know what
to recap the words that we know what
to recap the words that we know what means sometimes word can be words can be
means sometimes word can be words can be
means sometimes word can be words can be ambiguous and that’s a big problem so
ambiguous and that’s a big problem so
ambiguous and that’s a big problem so there’s what we call the phrase
there’s what we call the phrase
there’s what we call the phrase hypothesis which is what we’re working
hypothesis which is what we’re working
hypothesis which is what we’re working on when you have a longer selection of
on when you have a longer selection of
on when you have a longer selection of words that stack together in the same
words that stack together in the same
words that stack together in the same fashion they rarely have a different
fashion they rarely have a different
fashion they rarely have a different meaning they often have a very precise
meaning they often have a very precise
meaning they often have a very precise meaning and and that’s the ability to
meaning and and that’s the ability to
meaning and and that’s the ability to capture those races dynamically is
capture those races dynamically is
capture those races dynamically is basically one what we do so once you
basically one what we do so once you
basically one what we do so once you have these fingerprints you can actually
have these fingerprints you can actually
have these fingerprints you can actually produce all kinds of different features
produce all kinds of different features
produce all kinds of different features that make it easier for researchers make
that make it easier for researchers make
that make it easier for researchers make life easier so what we’ve delivered to
life easier so what we’ve delivered to
life easier so what we’ve delivered to to the partners that we work with our
to the partners that we work with our
to the partners that we work with our inability to first as I said highlight
inability to first as I said highlight
inability to first as I said highlight the things that are most the principal
the things that are most the principal
the things that are most the principal components of an article so this is an
components of an article so this is an
components of an article so this is an article page some of you may have seen
article page some of you may have seen
article page some of you may have seen one if you search on Google for an
one if you search on Google for an
one if you search on Google for an article title you get bounced to a
article title you get bounced to a
article title you get bounced to a publisher’s webpage where that article
publisher’s webpage where that article
publisher’s webpage where that article is presented and so we helped make that
is presented and so we helped make that
is presented and so we helped make that page better we helped make it easier for
page better we helped make it easier for
page better we helped make it easier for readers to understand what’s going on
readers to understand what’s going on
readers to understand what’s going on and we can pull out key sentences and we
and we can pull out key sentences and we
and we can pull out key sentences and we can recommend stuff we can tell the user
can recommend stuff we can tell the user
can recommend stuff we can tell the user this is where they mentioned that thing
this is where they mentioned that thing
this is where they mentioned that thing you’re interested in they use some
you’re interested in they use some
you’re interested in they use some different words but it’s about the same
different words but it’s about the same
different words but it’s about the same thing and we can provide related content
thing and we can provide related content
thing and we can provide related content basically articles that are talking
basically articles that are talking
basically articles that are talking about the same things and when we do
about the same things and when we do
about the same things and when we do that we not only just provide a related
that we not only just provide a related
that we not only just provide a related article we’d actually tell you what it
article we’d actually tell you what it
article we’d actually tell you what it is how this overlaps with what you’re
is how this overlaps with what you’re
is how this overlaps with what you’re currently looking at so we can actually
currently looking at so we can actually
currently looking at so we can actually show you oh these are the concepts the
show you oh these are the concepts the
show you oh these are the concepts the current here that also occurs in the
current here that also occurs in the
current here that also occurs in the article you’re presently looking at and
article you’re presently looking at and
article you’re presently looking at and we can actually also we’ve done an
we can actually also we’ve done an
we can actually also we’ve done an interactive version that allows the user
interactive version that allows the user
interactive version that allows the user to drill down and further explore it has
to drill down and further explore it has
to drill down and further explore it has to contain this than this and then get a
to contain this than this and then get a
to contain this than this and then get a recommendation here so we work very
recommendation here so we work very
recommendation here so we work very closely with Springer nature Scientific
closely with Springer nature Scientific
closely with Springer nature Scientific American McMillan many of the largest
American McMillan many of the largest
American McMillan many of the largest publishers and we produce things like
publishers and we produce things like
publishers and we produce things like this so I guess the little difficult to
this so I guess the little difficult to
this so I guess the little difficult to see the highlights here but in essence
see the highlights here but in essence
see the highlights here but in essence this is the non schematic version of
this is the non schematic version of
this is the non schematic version of what I just saw told you on the right
what I just saw told you on the right
what I just saw told you on the right side we have related content you can
side we have related content you can
side we have related content you can click any of the things you’re
click any of the things you’re
click any of the things you’re interested in then get a filtered list
interested in then get a filtered list
interested in then get a filtered list of the most similar articles that also
of the most similar articles that also
of the most similar articles that also contain this thing you’re interested in
contain this thing you’re interested in
contain this thing you’re interested in we also do also other types of
we also do also other types of
we also do also other types of visualizations with related content we
visualizations with related content we
visualizations with related content we can use our technology to find
can use our technology to find
can use our technology to find definitions of things so many of these
definitions of things so many of these
definitions of things so many of these scientific publishers have a large back
scientific publishers have a large back
scientific publishers have a large back catalogue of reference works or teaching
catalogue of reference works or teaching
catalogue of reference works or teaching books if you will that define different
books if you will that define different
books if you will that define different concepts so users can can click on
concepts so users can can click on
concepts so users can can click on something like RNA editing and we can
something like RNA editing and we can
something like RNA editing and we can pick up the best definition we can find
pick up the best definition we can find
pick up the best definition we can find in in the publishers literature and not
in in the publishers literature and not
in in the publishers literature and not just rely on the stuff that’s on
just rely on the stuff that’s on
just rely on the stuff that’s on Wikipedia and more interesting we’re
Wikipedia and more interesting we’re
Wikipedia and more interesting we’re also working on building tools that
also working on building tools that
also working on building tools that allow researchers to to see more of the
allow researchers to to see more of the
allow researchers to to see more of the history that the stuff they’re
history that the stuff they’re
history that the stuff they’re interested in is sort of a part of so
interested in is sort of a part of so
interested in is sort of a part of so here is a tool that we call timeline
here is a tool that we call timeline
here is a tool that we call timeline that for a given article here in
that for a given article here in
that for a given article here in sometime in the past I guess around 2003
sometime in the past I guess around 2003
sometime in the past I guess around 2003 the selected article there we use the
the selected article there we use the
the selected article there we use the reference the citation data forwards and
reference the citation data forwards and
reference the citation data forwards and backwards citations to figure out which
backwards citations to figure out which
backwards citations to figure out which things were cited by this paper and
things were cited by this paper and
things were cited by this paper and which papers psyched this paper so
which papers psyched this paper so
which papers psyched this paper so forwards and backwards in time but
forwards and backwards in time but
forwards and backwards in time but that’s a very very large set because
that’s a very very large set because
that’s a very very large set because when you have a
when you have a
when you have a single article they often cite 10 20 50
single article they often cite 10 20 50
single article they often cite 10 20 50 other papers each of which site another
other papers each of which site another
other papers each of which site another 10 50 100 papers so it’s a very huge
10 50 100 papers so it’s a very huge
10 50 100 papers so it’s a very huge tree and then what we do is that we
tree and then what we do is that we
tree and then what we do is that we basically prune that tree to just look
basically prune that tree to just look
basically prune that tree to just look at the branches that have articles that
at the branches that have articles that
at the branches that have articles that talk about the same thing and that
talk about the same thing and that
talk about the same thing and that allows you to fairly easily identify an
allows you to fairly easily identify an
allows you to fairly easily identify an article from last year which talks about
article from last year which talks about
article from last year which talks about the same thing and actually through a
the same thing and actually through a
the same thing and actually through a couple of links cites the article that
couple of links cites the article that
couple of links cites the article that you’re presently looking at or if you’re
you’re presently looking at or if you’re
you’re presently looking at or if you’re looking at a recent article you can say
looking at a recent article you can say
looking at a recent article you can say who is the first author in this citation
who is the first author in this citation
who is the first author in this citation tree to actually combine this and that
tree to actually combine this and that
tree to actually combine this and that and in a paper so the value that we’re
and in a paper so the value that we’re
and in a paper so the value that we’re providing to researchers and this is
providing to researchers and this is
providing to researchers and this is we’re kind of proud of that is that we
we’re kind of proud of that is that we
we’re kind of proud of that is that we accelerate the path to successful
accelerate the path to successful
accelerate the path to successful discovery by pointing directly to what
discovery by pointing directly to what
discovery by pointing directly to what is relevant in an article and we can
is relevant in an article and we can
is relevant in an article and we can also provide more relevant suggestions
also provide more relevant suggestions
also provide more relevant suggestions because they’re much more precise than
because they’re much more precise than
because they’re much more precise than competing technologies and then we
competing technologies and then we
competing technologies and then we provide so our little company actually
provide so our little company actually
provide so our little company actually also provides end user features because
also provides end user features because
also provides end user features because we believe that it’s that understanding
we believe that it’s that understanding
we believe that it’s that understanding of the algorithms used and how they
of the algorithms used and how they
of the algorithms used and how they actually how different algorithms will
actually how different algorithms will
actually how different algorithms will favor different things and and that
favor different things and and that
favor different things and and that actually is important for the feature
actually is important for the feature
actually is important for the feature you’re trying to construct what how how
you’re trying to construct what how how
you’re trying to construct what how how you’re going to rank these and it’s
you’re going to rank these and it’s
you’re going to rank these and it’s actually very dependent on the type of
actually very dependent on the type of
actually very dependent on the type of use cases that we were trying to solve
use cases that we were trying to solve
use cases that we were trying to solve and for our our clients the publishers
and for our our clients the publishers
and for our our clients the publishers they’re really happy that they can roll
they’re really happy that they can roll
they’re really happy that they can roll out a feature across many different
out a feature across many different
out a feature across many different types of context content even so in
types of context content even so in
types of context content even so in biomedical for instance gene research or
biomedical for instance gene research or
biomedical for instance gene research or drugs diseases there’s a lot of
drugs diseases there’s a lot of
drugs diseases there’s a lot of structured documentation a lot of
structured documentation a lot of
structured documentation a lot of ontology zahl gene names at least
ontology zahl gene names at least
ontology zahl gene names at least discovered until fairly recently or
discovered until fairly recently or
discovered until fairly recently or logged in an open access ontology and
logged in an open access ontology and
logged in an open access ontology and and documentation is really really good
and documentation is really really good
and documentation is really really good in that small field of science but
in that small field of science but
in that small field of science but everywhere outside of that it’s much
everywhere outside of that it’s much
everywhere outside of that it’s much much worse if you look at humanities and
much worse if you look at humanities and
much worse if you look at humanities and Jen
Jen
Jen well there are rarely any any official
well there are rarely any any official
well there are rarely any any official ontology is available that tell you
ontology is available that tell you
ontology is available that tell you which words are important or which
which words are important or which
which words are important or which things is a synonym of what and and so
things is a synonym of what and and so
things is a synonym of what and and so what we do is actually is very important
what we do is actually is very important
what we do is actually is very important to do to developing this type of
to do to developing this type of
to do to developing this type of services or recommendations for for all
services or recommendations for for all
services or recommendations for for all the other disciplines so future
the other disciplines so future
the other disciplines so future directions well as I said we’re
directions well as I said we’re
directions well as I said we’re currently working on understanding the
currently working on understanding the
currently working on understanding the relationships between all these features
relationships between all these features
relationships between all these features of things that we extract there are so
of things that we extract there are so
of things that we extract there are so many different ways that you can say a
many different ways that you can say a
many different ways that you can say a given thing and when you talk about the
given thing and when you talk about the
given thing and when you talk about the relationship between two things there’s
relationship between two things there’s
relationship between two things there’s an equal amount of different ways you
an equal amount of different ways you
an equal amount of different ways you can say things so just the fact that
can say things so just the fact that
can say things so just the fact that serum consists mostly of water can be
serum consists mostly of water can be
serum consists mostly of water can be said in so many different ways and and
said in so many different ways and and
said in so many different ways and and the thing thin film coated gold
the thing thin film coated gold
the thing thin film coated gold nanoparticles we’re currently working on
nanoparticles we’re currently working on
nanoparticles we’re currently working on a nano product for the nano industry
a nano product for the nano industry
a nano product for the nano industry with a partner that can also be said in
with a partner that can also be said in
with a partner that can also be said in a number of different ways but what’s
a number of different ways but what’s
a number of different ways but what’s interesting is of course that these
interesting is of course that these
interesting is of course that these relationships when they stack up we can
relationships when they stack up we can
relationships when they stack up we can replace the two things the subject and
replace the two things the subject and
replace the two things the subject and the object and then have a general
the object and then have a general
the object and then have a general understanding of how this relationship
understanding of how this relationship
understanding of how this relationship can be described and so we’re trying to
can be described and so we’re trying to
can be described and so we’re trying to that’s a big challenge for us is trying
that’s a big challenge for us is trying
that’s a big challenge for us is trying to normalize and reduce the types of
to normalize and reduce the types of
to normalize and reduce the types of relationships between things and the
relationships between things and the
relationships between things and the corpus another big forward-looking
corpus another big forward-looking
corpus another big forward-looking feature is to provide our services to
feature is to provide our services to
feature is to provide our services to other companies that are trying to solve
other companies that are trying to solve
other companies that are trying to solve problems and have access to unstructured
problems and have access to unstructured
problems and have access to unstructured text but no ability to process it so
text but no ability to process it so
text but no ability to process it so we’re working with a couple of large
we’re working with a couple of large
we’re working with a couple of large companies to to make basically make
companies to to make basically make
companies to to make basically make large text collections computable so so
large text collections computable so so
large text collections computable so so much of what we do can be applied on any
much of what we do can be applied on any
much of what we do can be applied on any given sort of large collection of text
given sort of large collection of text
given sort of large collection of text and and you can do all sorts of really
and and you can do all sorts of really
and and you can do all sorts of really interesting analytics on it once you
interesting analytics on it once you
interesting analytics on it once you know what’s what and what’s similar and
know what’s what and what’s similar and
know what’s what and what’s similar and what’s the important aspects of text and
what’s the important aspects of text and
what’s the important aspects of text and then ultimately why we want to go is is
then ultimately why we want to go is is
then ultimately why we want to go is is to do reasoning at scale
to do reasoning at scale
to do reasoning at scale that’s really what you need in order to
that’s really what you need in order to
that’s really what you need in order to to augment scientific research most
to augment scientific research most
to augment scientific research most efficiently you need to be able to
efficiently you need to be able to
efficiently you need to be able to reason what is this how what’s the
reason what is this how what’s the
reason what is this how what’s the causal chain of events here and is is
causal chain of events here and is is
causal chain of events here and is is this a disputed fact does everyone say
this a disputed fact does everyone say
this a disputed fact does everyone say that this is how things are or the
that this is how things are or the
that this is how things are or the things that that may be long chains of
things that that may be long chains of
things that that may be long chains of of course ality that go unnoticed that
of course ality that go unnoticed that
of course ality that go unnoticed that can only really be uncovered by massive
can only really be uncovered by massive
can only really be uncovered by massive analytics so I guess the the ultimate
analytics so I guess the the ultimate
analytics so I guess the the ultimate price there is the cure for cancer so so
price there is the cure for cancer so so
price there is the cure for cancer so so I guess we have a small team we’re
I guess we have a small team we’re
I guess we have a small team we’re actually located in in almost in second
actually located in in almost in second
actually located in in almost in second town of Denmark were 18 people I think
town of Denmark were 18 people I think
town of Denmark were 18 people I think now and and all of them have worked at
now and and all of them have worked at
now and and all of them have worked at large big big international companies
large big big international companies
large big big international companies and basically chosen to come to work
and basically chosen to come to work
and basically chosen to come to work with us four measly salaries and living
with us four measly salaries and living
with us four measly salaries and living in the suburbs because we’re so excited
in the suburbs because we’re so excited
in the suburbs because we’re so excited about the promise of assisting science
about the promise of assisting science
about the promise of assisting science we have no Danish clients we all work
we have no Danish clients we all work
we have no Danish clients we all work with international publishers so and yes
with international publishers so and yes
with international publishers so and yes we are hiring and so feel free to apply
we are hiring and so feel free to apply
we are hiring and so feel free to apply where we’re growing right now and would
where we’re growing right now and would
where we’re growing right now and would love to receive applications for you
love to receive applications for you
love to receive applications for you guys so I think that concludes my speech
guys so I think that concludes my speech
guys so I think that concludes my speech and I’d love to answer questions there’s
and I’d love to answer questions there’s
and I’d love to answer questions there’s a ton of detail that I left out that if
a ton of detail that I left out that if
a ton of detail that I left out that if you have any sort of there are really
you have any sort of there are really
you have any sort of there are really many questions who you’ve been exid I’ve
many questions who you’ve been exid I’ve
many questions who you’ve been exid I’ve they’re asking questions with that so
they’re asking questions with that so
they’re asking questions with that so the first one is is kick stream analysis
the first one is is kick stream analysis
the first one is is kick stream analysis used to analyze behavioral data such as
used to analyze behavioral data such as
used to analyze behavioral data such as hyperlinks between articles and do you
hyperlinks between articles and do you
hyperlinks between articles and do you use spark for this yes I think we do you
use spark for this yes I think we do you
use spark for this yes I think we do you spark so I’m confession even though I
spark so I’m confession even though I
spark so I’m confession even though I grew up with a computer and a frog coded
grew up with a computer and a frog coded
grew up with a computer and a frog coded demos on my c64 and in my parents
demos on my c64 and in my parents
demos on my c64 and in my parents bedroom in the 80s I actually do not
bedroom in the 80s I actually do not
bedroom in the 80s I actually do not work as a developer in our company i’m
work as a developer in our company i’m
work as a developer in our company i’m one of the founders and i sell the
one of the founders and i sell the
one of the founders and i sell the vision so i can actually answer
vision so i can actually answer
vision so i can actually answer accurately we
accurately we
accurately we do look at clickstream data but mostly
do look at clickstream data but mostly
do look at clickstream data but mostly it’s not it’s limited to profile
it’s not it’s limited to profile
it’s not it’s limited to profile building not sort of session analysis
building not sort of session analysis
building not sort of session analysis because we we do there’s a lot of noise
because we we do there’s a lot of noise
because we we do there’s a lot of noise and people get distracted so if you have
and people get distracted so if you have
and people get distracted so if you have subsequent clicks through a corpus it
subsequent clicks through a corpus it
subsequent clicks through a corpus it really just attributes that tells you
really just attributes that tells you
really just attributes that tells you something about what the users
something about what the users
something about what the users interested in not necessarily that the
interested in not necessarily that the
interested in not necessarily that the things that they click on related
things that they click on related
things that they click on related because people get distracted so so yes
because people get distracted so so yes
because people get distracted so so yes we use clicks but not really streams and
we use clicks but not really streams and
we use clicks but not really streams and if you use if you do keep bait isn’t
if you use if you do keep bait isn’t
if you use if you do keep bait isn’t that minute manipulation all right we
that minute manipulation all right we
that minute manipulation all right we were actually asked to do this so yeah
were actually asked to do this so yeah
were actually asked to do this so yeah so I think it’s a there you’re always
so I think it’s a there you’re always
so I think it’s a there you’re always when you’re working with big
when you’re working with big
when you’re working with big corporations you have different layers
corporations you have different layers
corporations you have different layers of management and they have this
of management and they have this
of management and they have this different sort of key performance
different sort of key performance
different sort of key performance indicators and and the people that work
indicators and and the people that work
indicators and and the people that work in the front end would like to see a
in the front end would like to see a
in the front end would like to see a feature used so you need to optimize the
feature used so you need to optimize the
feature used so you need to optimize the data for a feature to be used I think
data for a feature to be used I think
data for a feature to be used I think it’s in the app I guess at the reason I
it’s in the app I guess at the reason I
it’s in the app I guess at the reason I can still fall asleep at night is that I
can still fall asleep at night is that I
can still fall asleep at night is that I think what we’re doing is vastly
think what we’re doing is vastly
think what we’re doing is vastly superior to the traditional sort of code
superior to the traditional sort of code
superior to the traditional sort of code download statistics that are used in
download statistics that are used in
download statistics that are used in science normally the things that get
science normally the things that get
science normally the things that get recommended across scientific publishers
recommended across scientific publishers
recommended across scientific publishers are the things that other people
are the things that other people
are the things that other people download it the same session and I think
download it the same session and I think
download it the same session and I think one of the biggest problems with that
one of the biggest problems with that
one of the biggest problems with that just to do a little diversion here is
just to do a little diversion here is
just to do a little diversion here is that when you only look at behavioral
that when you only look at behavioral
that when you only look at behavioral data that you have absolutely no way of
data that you have absolutely no way of
data that you have absolutely no way of recommending that new article that came
recommending that new article that came
recommending that new article that came out yesterday because you have no
out yesterday because you have no
out yesterday because you have no behavioral data attached to it and it’s
behavioral data attached to it and it’s
behavioral data attached to it and it’s a what we call the Coast our problem
a what we call the Coast our problem
a what we call the Coast our problem unless you can identify that this
unless you can identify that this
unless you can identify that this article is very similar to this other
article is very similar to this other
article is very similar to this other article that has behavioral data you can
article that has behavioral data you can
article that has behavioral data you can actually not make a recent
actually not make a recent
actually not make a recent recommendation until by accident people
recommendation until by accident people
recommendation until by accident people stumble across it and you know who
stumble across it and you know who
stumble across it and you know who actually did something with it so so I
actually did something with it so so I
actually did something with it so so I think what we do here obviously this is
think what we do here obviously this is
think what we do here obviously this is a Jekyll and Hyde thing then the best
a Jekyll and Hyde thing then the best
a Jekyll and Hyde thing then the best solution is always a combination of the
solution is always a combination of the
solution is always a combination of the two factors
two factors
two factors how do you make rules for classifying
how do you make rules for classifying
how do you make rules for classifying words or phrases that are very
words or phrases that are very
words or phrases that are very domain-specific across the many
domain-specific across the many
domain-specific across the many different research domains so there’s
different research domains so there’s
different research domains so there’s some actually very few phrases that are
some actually very few phrases that are
some actually very few phrases that are exactly similar across I have very
exactly similar across I have very
exactly similar across I have very different meanings but I’ve chef
different meanings but I’ve chef
different meanings but I’ve chef syntactic very similar across domains
syntactic very similar across domains
syntactic very similar across domains and most of that problem we’ve actually
and most of that problem we’ve actually
and most of that problem we’ve actually sort of circum navigated by looking at
sort of circum navigated by looking at
sort of circum navigated by looking at longer phrases and by filtering out this
longer phrases and by filtering out this
longer phrases and by filtering out this stuff that head that has ambivalence so
stuff that head that has ambivalence so
stuff that head that has ambivalence so you will actually see that we try to not
you will actually see that we try to not
you will actually see that we try to not mention things that when mentioned alone
mention things that when mentioned alone
mention things that when mentioned alone can mean different things than we add an
can mean different things than we add an
can mean different things than we add an additional token in front of it often
additional token in front of it often
additional token in front of it often times it becomes much less ambiguous and
times it becomes much less ambiguous and
times it becomes much less ambiguous and we then prefer that one and that’s
we then prefer that one and that’s
we then prefer that one and that’s simply ash and algorithmic solution is
simply ash and algorithmic solution is
simply ash and algorithmic solution is not something that we hard code but we
not something that we hard code but we
not something that we hard code but we actually look at the the ones that have
actually look at the the ones that have
actually look at the the ones that have ambiguity and try to pick longer phrases
ambiguity and try to pick longer phrases
ambiguity and try to pick longer phrases that are super sets that included do you
that are super sets that included do you
that are super sets that included do you do any kind of personalization we don’t
do any kind of personalization we don’t
do any kind of personalization we don’t have a product for personalization
have a product for personalization
have a product for personalization because it’s not it’s a big hot potato
because it’s not it’s a big hot potato
because it’s not it’s a big hot potato in science people are really afraid of
in science people are really afraid of
in science people are really afraid of being tracked because they think they
being tracked because they think they
being tracked because they think they have the cure for cancer and they don’t
have the cure for cancer and they don’t
have the cure for cancer and they don’t want like search history is a complete
want like search history is a complete
want like search history is a complete no go and for most of the clients that
no go and for most of the clients that
no go and for most of the clients that we work with so we haven’t we don’t have
we work with so we haven’t we don’t have
we work with so we haven’t we don’t have a product yet we think it’s incredibly
a product yet we think it’s incredibly
a product yet we think it’s incredibly interesting and we’d love to do it but
interesting and we’d love to do it but
interesting and we’d love to do it but we don’t have a partner to do it with
we don’t have a partner to do it with
we don’t have a partner to do it with and probably it’s going to be outside of
and probably it’s going to be outside of
and probably it’s going to be outside of science and what is the scale of data
science and what is the scale of data
science and what is the scale of data used in your processing how much states
used in your processing how much states
used in your processing how much states had words to train your model so so
had words to train your model so so
had words to train your model so so that’s another thing of the first two
that’s another thing of the first two
that’s another thing of the first two years of our of our startup we’re trying
years of our of our startup we’re trying
years of our of our startup we’re trying to build a school google scholar
to build a school google scholar
to build a school google scholar competitor we wanted to build a
competitor we wanted to build a
competitor we wanted to build a destination site where users could come
destination site where users could come
destination site where users could come search in full-text articles not see the
search in full-text articles not see the
search in full-text articles not see the full text articles but we would like it
full text articles but we would like it
full text articles but we would like it makes them for publishers and then link
makes them for publishers and then link
makes them for publishers and then link out to
out to
out to real constant and we spoke to many
real constant and we spoke to many
real constant and we spoke to many different scientific publishers and they
different scientific publishers and they
different scientific publishers and they all said that’s a brilliant idea and
all said that’s a brilliant idea and
all said that’s a brilliant idea and they had so many meetings with us for
they had so many meetings with us for
they had so many meetings with us for two years and they said oh here’s
two years and they said oh here’s
two years and they said oh here’s another test sample that you can have of
another test sample that you can have of
another test sample that you can have of our content and they said and once we’re
our content and they said and once we’re
our content and they said and once we’re ready to go you’ll have this hard drive
ready to go you’ll have this hard drive
ready to go you’ll have this hard drive with a ton of articles and it will be no
with a ton of articles and it will be no
with a ton of articles and it will be no problem everybody will be happy and then
problem everybody will be happy and then
problem everybody will be happy and then after two years and only a few thousand
after two years and only a few thousand
after two years and only a few thousand articles from each publisher and a ton
articles from each publisher and a ton
articles from each publisher and a ton of meetings where they asked about our
of meetings where they asked about our
of meetings where they asked about our technology and depth and detail we went
technology and depth and detail we went
technology and depth and detail we went out and one night I’m in London I
out and one night I’m in London I
out and one night I’m in London I remember and one of the product managers
remember and one of the product managers
remember and one of the product managers or it was actually a V VP level in one
or it was actually a V VP level in one
or it was actually a V VP level in one of those sawtooth publishes over a beer
of those sawtooth publishes over a beer
of those sawtooth publishes over a beer said you know it’s never going to happen
said you know it’s never going to happen
said you know it’s never going to happen they’re just keeping you close because
they’re just keeping you close because
they’re just keeping you close because they want to know what kind of
they want to know what kind of
they want to know what kind of technology you’re developing and I think
technology you’re developing and I think
technology you’re developing and I think a few months after that we pivoted into
a few months after that we pivoted into
a few months after that we pivoted into a different business plan where we
a different business plan where we
a different business plan where we provide our value in lieu of too little
provide our value in lieu of too little
provide our value in lieu of too little open access material we decided to work
open access material we decided to work
open access material we decided to work within the framework of the publishers
within the framework of the publishers
within the framework of the publishers and be their friends and so now what
and be their friends and so now what
and be their friends and so now what we’re providing our services services
we’re providing our services services
we’re providing our services services that are primarily focused on on using
that are primarily focused on on using
that are primarily focused on on using one publishers data to perform services
one publishers data to perform services
one publishers data to perform services for that one publishers clients and so
for that one publishers clients and so
for that one publishers clients and so clients the larger publishers have 10 to
clients the larger publishers have 10 to
clients the larger publishers have 10 to 15 million articles some of the
15 million articles some of the
15 million articles some of the aggregators have more but most most of
aggregators have more but most most of
aggregators have more but most most of our clients have less than 10 million
our clients have less than 10 million
our clients have less than 10 million documents so with each document being I
documents so with each document being I
documents so with each document being I don’t know a few hundred K in simple a
don’t know a few hundred K in simple a
don’t know a few hundred K in simple a ski that it’s not crazy amounts of data
ski that it’s not crazy amounts of data
ski that it’s not crazy amounts of data it’s a few terabytes for a larger
it’s a few terabytes for a larger
it’s a few terabytes for a larger publisher so as jonathan schwartz found
publisher so as jonathan schwartz found
publisher so as jonathan schwartz found out it could easily be dumped anywhere
out it could easily be dumped anywhere
out it could easily be dumped anywhere in the internet but everyone would be
in the internet but everyone would be
in the internet but everyone would be sued okay
sued okay
sued okay would it make sense to pretty print an
would it make sense to pretty print an
would it make sense to pretty print an article normalize it and republish it
article normalize it and republish it
article normalize it and republish it along with the original and did do you
along with the original and did do you
along with the original and did do you have a tool for that so no we don’t we
have a tool for that so no we don’t we
have a tool for that so no we don’t we cannot provide access to the full text
cannot provide access to the full text
cannot provide access to the full text we work with publishers and they are
we work with publishers and they are
we work with publishers and they are it’s a very tightly controlled business
it’s a very tightly controlled business
it’s a very tightly controlled business they their primary business asset at
they their primary business asset at
they their primary business asset at least until open access becomes more
least until open access becomes more
least until open access becomes more dominant is the concept that they own
dominant is the concept that they own
dominant is the concept that they own and control so so we really can’t do
and control so so we really can’t do
and control so so we really can’t do much with it except behind closed doors
much with it except behind closed doors
much with it except behind closed doors we had when we worked with elsevier last
we had when we worked with elsevier last
we had when we worked with elsevier last year like the forms we had to fill out
year like the forms we had to fill out
year like the forms we had to fill out for compliance of security were crazy I
for compliance of security were crazy I
for compliance of security were crazy I think a hundred and forty seven pages
think a hundred and forty seven pages
think a hundred and forty seven pages tabs in an excel sheet with a hundred
tabs in an excel sheet with a hundred
tabs in an excel sheet with a hundred questions in each so that was just the
questions in each so that was just the
questions in each so that was just the lien and they are the survey questions
lien and they are the survey questions
lien and they are the survey questions before they send a person over so yeah
before they send a person over so yeah
before they send a person over so yeah they’re really really crazy about
they’re really really crazy about
they’re really really crazy about security I using dump the architecture
security I using dump the architecture
security I using dump the architecture and can you talk about that I’m not
and can you talk about that I’m not
and can you talk about that I’m not familiar with lambda architecture I know
familiar with lambda architecture I know
familiar with lambda architecture I know like lambda lambda coefficients but no
like lambda lambda coefficients but no
like lambda lambda coefficients but no no probably maybe we are who knows okay
no probably maybe we are who knows okay
no probably maybe we are who knows okay what is the most interesting finding you
what is the most interesting finding you
what is the most interesting finding you had done in your data for cancer that’s
had done in your data for cancer that’s
had done in your data for cancer that’s our we haven’t found that yet and I
our we haven’t found that yet and I
our we haven’t found that yet and I guess we would have published it so
guess we would have published it so
guess we would have published it so we’re a service provider so we work with
we’re a service provider so we work with
we’re a service provider so we work with what the industry called subject matter
what the industry called subject matter
what the industry called subject matter experts or SMEs and so we have models
experts or SMEs and so we have models
experts or SMEs and so we have models what we validate the quality of what we
what we validate the quality of what we
what we validate the quality of what we do and and then the error rates etc they
do and and then the error rates etc they
do and and then the error rates etc they all automated tests and then of course
all automated tests and then of course
all automated tests and then of course we run it by some selection a panel of
we run it by some selection a panel of
we run it by some selection a panel of real scientists that can look at it and
real scientists that can look at it and
real scientists that can look at it and then know the content that we’ve
then know the content that we’ve
then know the content that we’ve processed and can tell if there’s an
processed and can tell if there’s an
processed and can tell if there’s an error somewhere a word that we left out
error somewhere a word that we left out
error somewhere a word that we left out that was important but we can’t really
that was important but we can’t really
that was important but we can’t really evaluate ourselves
evaluate ourselves
evaluate ourselves so we know that the scientific
so we know that the scientific
so we know that the scientific publishers we work with the editors
publishers we work with the editors
publishers we work with the editors there say that we have the best
there say that we have the best
there say that we have the best extraction algorithms that produce the
extraction algorithms that produce the
extraction algorithms that produce the best and most usable phrases and results
best and most usable phrases and results
best and most usable phrases and results so that that’s what we go at we actually
so that that’s what we go at we actually
so that that’s what we go at we actually don’t know what is being used for okay
don’t know what is being used for okay
don’t know what is being used for okay what about articles published in the
what about articles published in the
what about articles published in the public domain published on open
public domain published on open
public domain published on open platforms I am indexing and presenting
platforms I am indexing and presenting
platforms I am indexing and presenting articles on these and turns it the
articles on these and turns it the
articles on these and turns it the sources yes we are working with a couple
sources yes we are working with a couple
sources yes we are working with a couple of open access publishers and sorry
of open access publishers and sorry
of open access publishers and sorry about that and so the open access model
about that and so the open access model
about that and so the open access model has sort of turned publishing inside out
has sort of turned publishing inside out
has sort of turned publishing inside out where traditionally traditional
where traditionally traditional
where traditionally traditional publishers actually publish your thing
publishers actually publish your thing
publishers actually publish your thing for free as long as you sign over
for free as long as you sign over
for free as long as you sign over copyright for open access you have to
copyright for open access you have to
copyright for open access you have to pay for the peer to peer review process
pay for the peer to peer review process
pay for the peer to peer review process and the publishing of course that cost
and the publishing of course that cost
and the publishing of course that cost has come down a lot from a few years ago
has come down a lot from a few years ago
has come down a lot from a few years ago but you still pay around 2,000 euros to
but you still pay around 2,000 euros to
but you still pay around 2,000 euros to publish an article and that sort of puts
publish an article and that sort of puts
publish an article and that sort of puts a little damper on the growth of open
a little damper on the growth of open
a little damper on the growth of open access but but we do work with some of
access but but we do work with some of
access but but we do work with some of the open access providers and we have
the open access providers and we have
the open access providers and we have this idea when we started our company
this idea when we started our company
this idea when we started our company that we would just aggregate all of open
that we would just aggregate all of open
that we would just aggregate all of open source and that’s fine good luck if you
source and that’s fine good luck if you
source and that’s fine good luck if you want to try because the only people that
want to try because the only people that
want to try because the only people that have succeeded in doing anything vaguely
have succeeded in doing anything vaguely
have succeeded in doing anything vaguely resembling that are just aggregating the
resembling that are just aggregating the
resembling that are just aggregating the metadata because it turns out that
metadata because it turns out that
metadata because it turns out that people publish their their articles in
people publish their their articles in
people publish their their articles in it in a gazillion different formats on a
it in a gazillion different formats on a
it in a gazillion different formats on a gazillion different websites where
gazillion different websites where
gazillion different websites where sometimes the download boredness behind
sometimes the download boredness behind
sometimes the download boredness behind some kind of I’m not a robot capture and
some kind of I’m not a robot capture and
some kind of I’m not a robot capture and it’s really really hard to get at the
it’s really really hard to get at the
it’s really really hard to get at the content it’s the biggest mistake that
content it’s the biggest mistake that
content it’s the biggest mistake that the open access community has done is
the open access community has done is
the open access community has done is not agreeing on some submission standard
not agreeing on some submission standard
not agreeing on some submission standard that allows that data to go there text
that allows that data to go there text
that allows that data to go there text to be mined and I just don’t see why no
to be mined and I just don’t see why no
to be mined and I just don’t see why no one has come up and said this is how you
one has come up and said this is how you
one has come up and said this is how you do it this is the format give us a Jets
do it this is the format give us a Jets
do it this is the format give us a Jets xml file right here on an ftp server
xml file right here on an ftp server
xml file right here on an ftp server dump it there and and let the community
dump it there and and let the community
dump it there and and let the community do the rest but it hasn’t been done so
do the rest but it hasn’t been done so
do the rest but it hasn’t been done so it’s not the
it’s not the
it’s not the it’s not a task for startups it’s
it’s not a task for startups it’s
it’s not a task for startups it’s incredibly time-consuming to deal with
incredibly time-consuming to deal with
incredibly time-consuming to deal with thousands of different submission
thousands of different submission
thousands of different submission forfeits and PDFs I mean you may think
forfeits and PDFs I mean you may think
forfeits and PDFs I mean you may think PDF is a nice format but it just turns
PDF is a nice format but it just turns
PDF is a nice format but it just turns out that sometimes the renderer will
out that sometimes the renderer will
out that sometimes the renderer will swap the the order of sentences around
swap the the order of sentences around
swap the the order of sentences around and it’s impossible to figure out which
and it’s impossible to figure out which
and it’s impossible to figure out which sentence is completed over here or you
sentence is completed over here or you
sentence is completed over here or you don’t want to know so so we we have to
don’t want to know so so we we have to
don’t want to know so so we we have to have someone else take care of that and
have someone else take care of that and
have someone else take care of that and then we can do open source open access
then we can do open source open access
then we can do open source open access in a few years do you have some kind of
in a few years do you have some kind of
in a few years do you have some kind of best practice to run ad to plication
best practice to run ad to plication
best practice to run ad to plication process where different deep learning
process where different deep learning
process where different deep learning methods could be applied I’m not sure I
methods could be applied I’m not sure I
methods could be applied I’m not sure I understand the question but we do have
understand the question but we do have
understand the question but we do have so that’s the key value add and I’m
so that’s the key value add and I’m
so that’s the key value add and I’m sorry I can’t share the source code it’s
sorry I can’t share the source code it’s
sorry I can’t share the source code it’s free we’re trying to build a business if
free we’re trying to build a business if
free we’re trying to build a business if you want to work with it you should come
you want to work with it you should come
you want to work with it you should come to us we do have like the pipeline that
to us we do have like the pipeline that
to us we do have like the pipeline that we’re building is about this and it’s
we’re building is about this and it’s
we’re building is about this and it’s iterative we pipe stuff in that we’ve
iterative we pipe stuff in that we’ve
iterative we pipe stuff in that we’ve learned elsewhere and and we basically
learned elsewhere and and we basically
learned elsewhere and and we basically we have we work internally in the team
we have we work internally in the team
we have we work internally in the team we write white papers we have give talks
we write white papers we have give talks
we write white papers we have give talks to each other and it’s a wonderful set
to each other and it’s a wonderful set
to each other and it’s a wonderful set up please come to honest does this apply
up please come to honest does this apply
up please come to honest does this apply well to computer science papers oh yes
well to computer science papers oh yes
well to computer science papers oh yes archive we we’ve indexed archive once
archive we we’ve indexed archive once
archive we we’ve indexed archive once but we haven’t set it up for re-indexing
but we haven’t set it up for re-indexing
but we haven’t set it up for re-indexing and I think we should it’s the whole eat
and I think we should it’s the whole eat
and I think we should it’s the whole eat your own dog food thing so we should get
your own dog food thing so we should get
your own dog food thing so we should get that up and running again when we get
that up and running again when we get
that up and running again when we get around to it right we have these other
around to it right we have these other
around to it right we have these other jobs that pay money that we have to do
jobs that pay money that we have to do
jobs that pay money that we have to do first did you try our technology work
first did you try our technology work
first did you try our technology work for languages other than English no we
for languages other than English no we
for languages other than English no we haven’t found anyone willing to pay for
haven’t found anyone willing to pay for
haven’t found anyone willing to pay for it yet most of what we do can be
it yet most of what we do can be
it yet most of what we do can be transferred to to to other languages and
transferred to to to other languages and
transferred to to to other languages and not myself fluent in German but I think
not myself fluent in German but I think
not myself fluent in German but I think possibly there are some rules that would
possibly there are some rules that would
possibly there are some rules that would have to be
have to be
have to be for their grammar but there’s nothing
for their grammar but there’s nothing
for their grammar but there’s nothing basically preventing it from being
basically preventing it from being
basically preventing it from being ported to other languages we’ve we’ve
ported to other languages we’ve we’ve
ported to other languages we’ve we’ve been asked to do Chinese for IP analysis
been asked to do Chinese for IP analysis
been asked to do Chinese for IP analysis of patent analysis but the tools that
of patent analysis but the tools that
of patent analysis but the tools that everyone else is using is basically some
everyone else is using is basically some
everyone else is using is basically some kind of auto translation and then
kind of auto translation and then
kind of auto translation and then applying text analytics afterwards which
applying text analytics afterwards which
applying text analytics afterwards which is probably inferior but makes more
is probably inferior but makes more
is probably inferior but makes more sense on a cost perspective
sense on a cost perspective
sense on a cost perspective unfortunately we I think that’s it a lot
unfortunately we I think that’s it a lot
unfortunately we I think that’s it a lot of questions thanks for that and let’s
of questions thanks for that and let’s
of questions thanks for that and let’s say thank you to to mess thank you
Be First to Comment