
GOTO 2016 • Discovering Research Ideas Using Semantic Vectors & Machine Learning • Mads Rydahl


Hello, and thank you for joining us. I'm part of a small start-up located here in Denmark, in Aarhus, called UNSILO. We work with big scientific publishers to process article information and to make tools for researchers.

So maybe I should start with explaining the mission that we set up four years ago when we started the company. Our idea was to build a system of discovery services that could make it easy to find patterns across a lot of unstructured text.

Today, or a couple of years ago, the way things were linked, when you looked at an article and tried to find something similar, was using human-annotated editorial keywords; that's how you find related articles in science. And the big challenge that we saw with the system as it was, was that because scientific language is constantly evolving and growing, and new things are being discovered, it's impossible to keep up with hand curation of content.

A system also has to be omniscient, because presently it's an author and an editor that look at a paper and try to decide what the important aspects of the article are, and sometimes really interesting discoveries are really only apparent in hindsight. So you need an automated system that can correlate a new article to tons of other things that are currently going on, and figure out if people in China are doing something exactly similar to what you're trying to do.

Then finally, it has to be unbiased, because right now we have this problem: most of the recommenders and the automated concept curation today are based on collaborative filtering, like the stuff you see on Amazon: "people who bought this also bought that". It tends to lead us down the same path, and it tends to make researchers trying to do something walk straight past the most interesting stuff, because that's what everyone else also does. So we need an unbiased approach that doesn't rely on some kind of popularity ranking like PageRank or collaborative filtering. (The sound is a little odd, is it okay? I don't have that fancy clicker.)

So, the core technology that we've built is based on a lot of open-source components, or at least three components. We have a document processing pipeline built around Apache UIMA and Ruta, we run a sort of standard natural language processing pipeline and tools on top of that, and then we use common languages like Python for prototyping, Java, and a lot of libraries and stuff from, I guess, the data scientist's toolbox.

The key challenge in what we're trying to do is that unstructured knowledge, text, basically does not compute. As I said before, there's too much stuff going on for humans to be involved in this process. And even when humans are involved on a higher level, in building ontologies to represent the knowledge that we have of a certain discipline, it's not going fast enough.

All the interesting stuff that was found out yesterday, or last month, or even six months ago, has not made it into a curated ontology yet. So if you really want to be at the forefront, where the money is and where things matter in research, you really need a more dynamic approach. So even when there are dictionaries or reference works, they're simply not comprehensive enough.

And then the second big problem that we have is that people are way too creative. They don't use just one name for a certain phenomenon; they have many different variations, and they often add descriptive detail in their own language that makes absolutely no sense to a computer and makes it really difficult to figure out what they're actually talking about. There is no right way to describe anything in the world, and we somehow have to figure out what people are talking about.

So, as I said, finally there's all the data that people consider obvious; that's probably the biggest problem for analytics today, or for computer AI in general: all the stuff that people consider obvious and then fail to include in a description of anything. So those are the key problems that we're trying to solve.

Here's a piece of text; it's an article from 2006. And if you use a regular full-text search or some kind of standard search engine and you throw this at it (it's an abstract of an article; the real article is probably ten times as long), then it's really difficult to see what this text is really about. And if I read this, how do I figure out what other articles talk about the same things?

So today we use computers to annotate the words whose meaning we know; these are the words that are found in common dictionaries and ontologies of this area. And we have, at our company, developed a much more comprehensive way of looking at this, dynamically and statistically deriving longer phrases that carry meaning, and we figure out which of them mean approximately the same thing.

And right now, as I said in, I think, the remarks for the talk, I'm also going to try to talk a little bit about where we want to take things and what we're currently working on. As you can see, we're trying to cover all of the information that's actually in an article, to map that out and make it searchable, make it findable.

And we're presently working on all of the actions and relationships between these things, so that when you find stuff that talks about A and B, the most relevant article is probably the one that talks about A and B in approximately the same context, or the same sentence, or that even talks about how A is related to B. Today you can also do this with a sort of distance, the number of words in between, when you use a traditional search engine.

But the thing is, when you're working with text, sometimes the span of words in between crosses a paragraph boundary, or sometimes it's the image caption that's right next to that really interesting other thing that you were looking for. And other times, the thing that you're interested in is mentioned up here with a third thing, and down here the other thing is mentioned with that same third thing, so they're actually really closely connected, but they're just at opposite ends of the article. So you need a better understanding of this, and we actually use graph analytics to understand the proximity of things and the centrality of things in an article.
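The proximity-and-centrality idea can be sketched with a toy co-mention graph (the terms and data here are hypothetical illustrations of mine, not the actual UNSILO implementation): terms mentioned in the same unit of text are linked, centrality falls out of how connected a term is, and proximity becomes graph distance rather than word distance.

```python
from collections import defaultdict, deque

# Toy "document": each entry is the set of terms co-mentioned in one unit
# (a sentence, a paragraph, or an image caption). Hypothetical data.
units = [
    {"sodium concentration", "serum"},
    {"sodium concentration", "potentiometry"},
    {"sodium concentration", "blood sample"},
    {"potentiometry", "electroanalysis"},
    {"serum", "blood sample"},
]

# Build an undirected co-mention graph.
graph = defaultdict(set)
for unit in units:
    for a in unit:
        for b in unit:
            if a != b:
                graph[a].add(b)

# Degree centrality: how connected a term is within the document.
centrality = {term: len(nbrs) for term, nbrs in graph.items()}

# Proximity: shortest-path distance (BFS), so two terms linked only
# through a shared third term still count as close.
def distance(src, dst):
    seen, queue = {src}, deque([(src, 0)])
    while queue:
        node, d = queue.popleft()
        if node == dst:
            return d
        for nbr in graph[node]:
            if nbr not in seen:
                seen.add(nbr)
                queue.append((nbr, d + 1))
    return None

print(max(centrality, key=centrality.get))  # sodium concentration
print(distance("serum", "electroanalysis"))  # 3
```

Note how "serum" and "electroanalysis" never co-occur in any unit, yet the graph still assigns them a finite distance: that is the mechanism that connects things sitting at opposite ends of an article.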

So the first step we perform is regular natural language processing; some of you may be familiar with this. The simplest part of natural language processing, the thing that you can do without too much computation, is part-of-speech tagging: basically assigning word classes to each word. Is this a verb, or is it a noun in this context? Is that an adjective? And once we have the part-of-speech tagging, we can actually find a lot of candidates for potential things in the sentence.
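As a rough illustration of that step, here is a sketch of candidate extraction over an already-tagged sentence. The tags and the simple "run of adjectives/nouns ending in a noun" pattern are my own illustrative assumptions; the real pipeline uses a full NLP toolkit for tagging.

```python
import re

# POS tags for the example sentence, as an upstream tagger might emit
# them (hypothetical tagging; "sodium-selective" is kept as one token).
tagged = [
    ("methods", "NOUN"), ("for", "ADP"), ("measuring", "VERB"),
    ("sodium", "NOUN"), ("concentration", "NOUN"), ("in", "ADP"),
    ("serum", "NOUN"), ("by", "ADP"), ("indirect", "ADJ"),
    ("sodium-selective", "ADJ"), ("electrode", "NOUN"),
    ("potentiometry", "NOUN"),
]

# Encode the tag sequence as a string and find maximal noun-phrase
# candidates: any run of adjectives/nouns that ends in a noun.
code = {"NOUN": "N", "ADJ": "A"}
tags = "".join(code.get(tag, "x") for _, tag in tagged)

candidates = []
for m in re.finditer(r"[AN]*N", tags):
    words = [w for w, _ in tagged[m.start():m.end()]]
    candidates.append(" ".join(words))

print(candidates)
# ['methods', 'sodium concentration', 'serum',
#  'indirect sodium-selective electrode potentiometry']
```

The four candidates that fall out of this toy pattern are exactly the four "things" the talk highlights in the example sentence.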

So, as you can see here, we have a sentence from the abstract you just saw: "methods for measuring sodium concentration in serum by indirect sodium-selective electrode potentiometry". I've highlighted them underneath, for those who don't read articles on a daily basis: there are four things here, and an action, if you will. And if we extract all of the things here, they seem pretty straightforward, right? So what's the beef?

So it turns out you can say these things in many different ways, and if you want to see other content that is closely related to this article, you need to not just look at the articles that include those exact words; you need to also look at the ones that mention these same things in different ways. So we have to deduplicate, basically.

So we work with Springer Nature, which is one of the larger scientific publishers in the world. They've given us all of their content and we've sifted through it; we found upwards of a hundred million things in their content, and we then, after processing that in various ways, deduplicate that down to maybe two or three million different things. And even when you're down at two or three million different things, you still have separation between things that a human reader would find to be mostly the same thing. So there's a lot of deduplication you need to do.

If you look at the examples here: "concentration of sodium" can be mapped back to "sodium concentration". You can also have sentences like "the electrode potentiometry was indirect"; well, obviously that's the same as "indirect electrode potentiometry". Some people like to call things a "methodology" rather than a "method", and sometimes people talk about "sera", the plural, rather than "serum". These are what we call morphological or syntactical variations: basically, the things that depend on the grammar.
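These grammar-driven mappings can be sketched with a couple of hand-coded rules. This is illustrative only (the rules and lookup tables are my own toy versions); the real pipeline covers many more variation types.

```python
import re

# Toy normalization tables of the kind the talk describes.
LEMMAS = {"sera": "serum", "methodologies": "methodology"}
SYNONYMS = {"methodology": "method"}

def normalize(phrase: str) -> str:
    # Syntactic variation: "concentration of sodium" -> "sodium concentration"
    phrase = re.sub(r"^(\w+) of (.+)$", r"\2 \1", phrase)
    words = []
    for w in phrase.lower().split():
        w = LEMMAS.get(w, w)      # morphological: plural -> singular
        w = SYNONYMS.get(w, w)    # lexical: variant -> canonical term
        words.append(w)
    return " ".join(words)

print(normalize("concentration of sodium"))  # sodium concentration
print(normalize("sera"))                     # serum
print(normalize("Methodology"))              # method
```

Mapping every surface variant onto one canonical form is what lets two differently-worded mentions count as the same "thing" during deduplication.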

We also try to reduce the lexical and semantic variations; that's when authors use synonyms or hypernyms, which are more generic, general terms for the same thing. So for parts of our pipeline we actually also do that sort of abstraction: whenever someone says "method", we might map that back to a more generic term called "mechanism". A "serum sample" is actually a type of blood sample (serum is blood with something filtered out, but that's not my primary business).

And in "serum sodium concentration", well, sodium is, I guess, the American term; "natrium" is also used sometimes. And "indirect electrode potentiometry", which we've now seen a couple of times, is actually a type of electroanalysis. So when we look at longer sentences or longer phrases, we actually go in and replace each of the tokens with a more generic term to figure out if this is actually a variation of something that we've seen before.
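That token-by-token abstraction step might look roughly like this (the hypernym table is a toy one of my own; the talk implies a much larger, curated and statistically derived one):

```python
# Hypernym abstraction: replace tokens with more generic terms and check
# whether the abstracted phrase matches something already seen.
HYPERNYMS = {
    "method": "mechanism",
    "methods": "mechanism",
    "serum": "blood sample",
    "potentiometry": "electroanalysis",
}

def abstract(phrase: str) -> str:
    return " ".join(HYPERNYMS.get(tok, tok) for tok in phrase.lower().split())

# Phrases already known to the system, stored in abstracted form.
seen = {abstract("indirect electrode potentiometry")}

# A new phrase matches a known one once both are abstracted.
print(abstract("indirect electrode electroanalysis") in seen)  # True
```

Because both phrases collapse to the same abstracted form, the new mention is recognized as a variation of something seen before rather than a brand-new concept.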

All of this really has nothing to do with machine learning; this is just hard-coded understanding of linguistic variations. So we have compound paraphrases, adjectival modifiers, and coordinations, where you mention things like "the concentration of sodium and magnesium", which can be expanded into "concentration of magnesium" and "concentration of sodium"; all of these tedious rules that we actually need to apply before you can do any type of aggregated understanding.
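A naive version of that coordination-expansion rule could look like this (a deliberately simplistic regex sketch of mine, not the production grammar, which handles far more cases):

```python
import re

# Coordination expansion: "concentration of sodium and magnesium"
# -> ["concentration of sodium", "concentration of magnesium"]
def expand_coordination(phrase: str) -> list[str]:
    m = re.match(r"^(.+ of )(\w+) and (\w+)$", phrase)
    if not m:
        return [phrase]  # no coordination found: leave the phrase alone
    head, first, second = m.groups()
    return [head + first, head + second]

print(expand_coordination("concentration of sodium and magnesium"))
# ['concentration of sodium', 'concentration of magnesium']
```

Expanding the coordinated phrase into its two underlying mentions is what lets each of them be deduplicated and counted separately downstream.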

Then finally, a couple of things: often we're looking at fragments of something else, or we're looking at something that contains a fragment which is more interesting. So sometimes it's the "indirect potentiometry", and no one else in the world has ever put "sodium selective" in between there, so we have to identify that and take author-specific variations out of the equation, because they mean absolutely nothing to anyone else in the world.

anyone else in the world and here we

anyone else in the world and here we come to also this matter of adding

come to also this matter of adding

come to also this matter of adding additional descriptive detail that can

additional descriptive detail that can

additional descriptive detail that can really be in the way of understanding

really be in the way of understanding

really be in the way of understanding what’s going on so clinically

what’s going on so clinically

what’s going on so clinically implemented indirect something or

implemented indirect something or

implemented indirect something or error-prone indirect ion selective

error-prone indirect ion selective

error-prone indirect ion selective whatever whatever these are all things

whatever whatever these are all things

whatever whatever these are all things that get in the way of understanding

that get in the way of understanding

that get in the way of understanding what’s really being spoken about then

what’s really being spoken about then

what’s really being spoken about then once we have deduplicated all this these

once we have deduplicated all this these

once we have deduplicated all this these tons of things really we look at

tons of things really we look at

tons of things really we look at different types of features

different types of features
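The conjunction expansion mentioned above can be sketched roughly as follows. This is an illustrative toy rule, not Unsilo's actual normalization pipeline, and the function name is made up:

```python
# Toy sketch of one "tedious rule": expanding a conjunction such as
# "concentration of sodium and magnesium" into the two phrases it abbreviates.
def expand_conjunction(phrase: str) -> list[str]:
    """Split 'X of A and B' into ['X of A', 'X of B']; pass other phrases through."""
    head, sep, tail = phrase.partition(" of ")
    if not sep or " and " not in tail:
        return [phrase]                      # nothing to expand
    return [f"{head} of {part.strip()}" for part in tail.split(" and ")]

expanded = expand_conjunction("concentration of sodium and magnesium")
# expanded == ["concentration of sodium", "concentration of magnesium"]
```

A real system needs many such rules (and exceptions to them) before phrases from different authors can be aggregated at all.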

So the local features in the document include how many times a concept is mentioned and what it's connected to. We actually calculate a position in a document graph: we connect all the things mentioned in the document with the relationships that connect them, and then do regular graph analysis to figure out what's central and what's peripheral to what's being talked about. So you can have something that's only mentioned once but is really central, because it's connected to the one very central thing, and you can have stuff out on the periphery that may be mentioned a couple of times, but always in relation to stuff that's non-central. And then of course we run other types of analytics that use the textual context: the words right before and right after a piece of text.
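The document-graph idea can be sketched like this. The talk doesn't say which centrality measure is actually used, so PageRank here is a stand-in, and the example graph is invented:

```python
# Sketch: mentions are nodes, extracted relationships are edges, and a
# centrality measure separates central from peripheral concepts.
from collections import defaultdict

def pagerank(edges, damping=0.85, iterations=50):
    nodes = {n for e in edges for n in e}
    neighbors = defaultdict(set)
    for a, b in edges:                      # undirected mention graph
        neighbors[a].add(b)
        neighbors[b].add(a)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        new = {}
        for n in nodes:
            incoming = sum(rank[m] / len(neighbors[m]) for m in neighbors[n])
            new[n] = (1 - damping) / len(nodes) + damping * incoming
        rank = new
    return rank

# A concept mentioned only once can still rank high if it is linked
# to the hub the document is about.
edges = [("diabetes", "insulin"), ("diabetes", "obesity"),
         ("diabetes", "children"), ("obesity", "children")]
rank = pagerank(edges)
central = max(rank, key=rank.get)           # "diabetes", the hub
```

The point of using a graph measure rather than raw counts is exactly the one made above: mention frequency and centrality are different signals.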

The global features we use include the occurrence count, the number of documents that contain a given phrase, and we run various fancy algorithms to figure out the most common variation: if you have an n-gram, a phrase, what's its most commonly used variation? If you add an additional adjective in front, what's the most commonly used adjective? Or what are the two most common ones, and are they sufficiently different to be two different things?
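The "most common adjective in front" statistic can be sketched by counting the word immediately preceding each occurrence of a phrase. The corpus and phrase below are made up for illustration; this is not Unsilo's actual algorithm:

```python
# Count which word most often appears immediately before a given phrase.
from collections import Counter

def preceding_words(phrase: str, corpus: str) -> Counter:
    counts = Counter()
    words = corpus.split()
    n = len(phrase.split())
    for i in range(1, len(words) - n + 1):
        if " ".join(words[i:i + n]) == phrase:
            counts[words[i - 1]] += 1
    return counts

corpus = ("plays a crucial role in metabolism ; "
          "a pivotal role for insulin ; a crucial role again")
counts = preceding_words("role", corpus)
# counts.most_common(1) -> [("crucial", 2)]
```

Comparing the top one or two variations, and how different their contexts are, is what decides whether they should be treated as distinct phrases.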

Then of course we also calculate, as I guess many of you are probably familiar with, tf-idf, which is basically deviation in frequency from a norm: if something occurs more often than it does on average, it's probably a significant phrase. Then we look at distribution across the corpus. A thing can be mentioned very few times, but whenever someone uses it they mention it over and over again in the same document, so it's probably got some significance; yet if you look at it globally and just count the number of documents it occurs in, it may seem insignificant. So we have a concentration score, which basically tells us: when it occurs in a document, how likely is it to occur more than once? And then we also do an analysis comparing the distribution across domains, to figure out that something is very common, but only in a certain domain. All of these things are fed into our learning algorithms, or ranking models.
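The two statistics just described can be sketched on a toy corpus. The tf-idf form below is the textbook one; the concentration score is my reading of the description (given that a phrase occurs in a document, how often does it occur more than once?), since the talk doesn't give exact formulas:

```python
# Toy documents as token lists; real input would be deduplicated phrases.
import math

docs = [
    ["hyperemic", "flow", "flow", "flow"],
    ["water", "flow"],
    ["hyperemic", "hyperemic", "response"],
]

def tf_idf(term, doc, docs):
    tf = doc.count(term) / len(doc)                 # frequency in this document
    df = sum(term in d for d in docs)               # documents containing the term
    return tf * math.log(len(docs) / df)            # deviation from the corpus norm

def concentration(term, docs):
    """Of the documents containing `term`, what fraction repeat it?"""
    containing = [d for d in docs if term in d]
    repeated = [d for d in containing if d.count(term) > 1]
    return len(repeated) / len(containing)

score = tf_idf("hyperemic", docs[0], docs)
conc = concentration("hyperemic", docs)             # repeated in 1 of 2 docs -> 0.5
```

A phrase with a low document count but a high concentration score is exactly the "rare but significant" case described above.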

We also use the aggregated textual context, and I'm going to get back to that in a little while: this is the word2vec, or word-embedding, model that the previous speaker also mentioned. If we look at all the occurrences of a given phrase across the entire corpus, that tells us something about what it means, or what other things might mean the same thing.

And then of course the biggest thing when you're training a model is the data you're training it on. We have two types of things we can train on. First, human training data: this could be the articles themselves. If we have a hypothesis that a given concept is very central to an article, we can compare and see whether we actually found it in the abstract. If it's in the abstract or in the title, there's a high likelihood that the author also considers it important. That's one data point, and aggregated over thousands or millions of articles it can tell us how good we are at selecting the things that authors find important. Of course, if we think we can do better than the authors, that's a lousy way to measure it, so we also use other types of human training data: behavioral data from the companies we work with. They kindly allow us access to usage patterns: when we present something to users, which of the things we extract did they actually click on and find interesting? And when presented with a list of related articles, in the sidebar for instance, which of these were found to be most interesting, or clicked on, by users? It turns out, of course, that it's the ones with the promising titles that get clicked on, not necessarily the ones that are most similar, so sometimes you need to make adjustments just to create some link bait.

similar so sometimes you need to make adjustments just to to create some link

adjustments just to to create some link

adjustments just to to create some link bait so the other type of synthetic data

bait so the other type of synthetic data

bait so the other type of synthetic data that we use is the data that we use is

that we use is the data that we use is

that we use is the data that we use is synthetic data so we can actually

synthetic data so we can actually

synthetic data so we can actually construct an artificial corpus and and

construct an artificial corpus and and

construct an artificial corpus and and train our models on that and try to

train our models on that and try to

train our models on that and try to improve our models using the principles

improve our models using the principles

improve our models using the principles that we that we use to create the

that we that we use to create the

that we that we use to create the synthetic data it’s slightly more

synthetic data it’s slightly more

synthetic data it’s slightly more complex but you can actually that’s

complex but you can actually that’s

complex but you can actually that’s that’s how the demo if any of you have

that’s how the demo if any of you have

that’s how the demo if any of you have tried war tyvek the demo that they

tried war tyvek the demo that they

tried war tyvek the demo that they create there is actually completely

create there is actually completely

create there is actually completely synthetic and you can also build

synthetic and you can also build

synthetic and you can also build partially synthetic data sets one that

partially synthetic data sets one that

partially synthetic data sets one that we’ve tried and that actually it was

we’ve tried and that actually it was

we’ve tried and that actually it was also used on what to work was to use a

also used on what to work was to use a

also used on what to work was to use a different search engine to create your

different search engine to create your

different search engine to create your artificial corpus so you search for

artificial corpus so you search for

artificial corpus so you search for something maybe two different concepts

something maybe two different concepts

something maybe two different concepts two different words and then you mix

two different words and then you mix

two different words and then you mix them together and you remove all traces

them together and you remove all traces

them together and you remove all traces of the worst that you searched for so

of the worst that you searched for so

of the worst that you searched for so the only thing that’s left or everything

the only thing that’s left or everything

the only thing that’s left or everything else in the document and then you try to

else in the document and then you try to

else in the document and then you try to figure out if you can still classify

figure out if you can still classify

figure out if you can still classify what was what and and and dump things in

what was what and and and dump things in

what was what and and and dump things in the right pile so a little bit about

the right pile so a little bit about
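That partially synthetic evaluation can be sketched end to end: retrieve documents for two queries, delete every trace of the query words, shuffle the pools together, and check whether a classifier can still sort them. The documents and the trivial keyword-overlap classifier below are stand-ins; in practice the classifier is the model under test:

```python
# Build a partially synthetic benchmark from two query result sets.
import random

def strip_terms(doc: str, terms: set) -> list:
    return [w for w in doc.split() if w not in terms]

docs_a = ["sodium concentration measured by potentiometry",
          "sodium levels in serum samples"]
docs_b = ["magnesium deficiency in cardiac patients",
          "magnesium intake and cardiac outcomes"]

# Remove the query words themselves, keep everything else, then mix.
pool = ([(strip_terms(d, {"sodium"}), "A") for d in docs_a]
        + [(strip_terms(d, {"magnesium"}), "B") for d in docs_b])
random.shuffle(pool)

# Trivial stand-in classifier: vote by overlap with one class's residual vocabulary.
vocab_a = {"potentiometry", "serum", "levels", "concentration"}
def classify(words: list) -> str:
    return "A" if set(words) & vocab_a else "B"

accuracy = sum(classify(words) == label for words, label in pool) / len(pool)
```

Because the labels are known by construction, accuracy on this corpus directly measures how much signal survives once the obvious query terms are gone.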

So, a little bit about word embeddings. The previous speaker mentioned it; here's an example. Basically what you do is build a vector, or actually a tensor, a combination of vectors, so that each word, token, or phrase (we work on phrases) in our corpus is defined in this vector space by an aggregation of the vectors of the things it commonly co-occurs with. The traditional word2vec algorithm will just treat all text as tokens, every token as its own vector, and only a few things get concatenated because they belong together. We pre-process the text quite a lot: after we've deduplicated all these hundred million things, we're down to a few million things that actually have decent occurrence counts. The big problem when you're looking at larger selections of text is that, statistically, they are more unlikely than each word on its own. So you have a problem with, for instance, "hyperemic flow": it doesn't necessarily occur that many times even when you have a million, or ten million, documents; it's still something so specific that you only have a few hundred occurrences. So it's important to capture all of them, even when the author calls it something different. But after we've done all that deduplication, we end up with a corpus on which we can generate a vector model, and then we use other things on top. We know that "coronary vasodilation" is actually defined in an ontology and is related to all these different things, and we combine things using that structured knowledge of the domain to further refine the vector model. And that's worked really well for us.

refine the vector model and and that’s work really well for us here’s this is

work really well for us here’s this is

work really well for us here’s this is just a little data dump from a test a

just a little data dump from a test a

just a little data dump from a test a while ago but what you see here are

while ago but what you see here are

while ago but what you see here are phrases and a current accounts in the

phrases and a current accounts in the

phrases and a current accounts in the test corpus of I think are a million

test corpus of I think are a million

test corpus of I think are a million articles and here you can see like the

articles and here you can see like the

articles and here you can see like the first line deionized water it’s actually

first line deionized water it’s actually

first line deionized water it’s actually part of a set it extends further to the

part of a set it extends further to the

part of a set it extends further to the right but the first line you can see do

right but the first line you can see do

right but the first line you can see do ionized water is actually the same or

ionized water is actually the same or

ionized water is actually the same or has a similar vector as by distilled

has a similar vector as by distilled

has a similar vector as by distilled water ultrapure water di water tea /

water ultrapure water di water tea /

water ultrapure water di water tea / ionized water or double distilled water

ionized water or double distilled water

ionized water or double distilled water and these are important to notice that

and these are important to notice that

and these are important to notice that this is the output from a vector model

this is the output from a vector model

this is the output from a vector model where we basically for each concept in

where we basically for each concept in

where we basically for each concept in the first column we find the nearest

the first column we find the nearest

the first column we find the nearest concepts the most the concepts that

concepts the most the concepts that

concepts the most the concepts that appear in the most

appear in the most

appear in the most similar context so the algorithm

similar context so the algorithm

similar context so the algorithm actually does not even look at the

actually does not even look at the

actually does not even look at the letters it just has an ID and then it

letters it just has an ID and then it

letters it just has an ID and then it knows the ID of the things around it and

knows the ID of the things around it and

knows the ID of the things around it and so it’s pretty obvious you that it is

so it’s pretty obvious you that it is

so it’s pretty obvious you that it is actually possible just from the

actually possible just from the

actually possible just from the hypothesis is that words that mean

hypothesis is that words that mean

hypothesis is that words that mean approximately the same are used in

approximately the same are used in

approximately the same are used in approximately similar context so the 10

approximately similar context so the 10

approximately similar context so the 10 words or five words before and after

words or five words before and after

words or five words before and after over a million documents will be very

over a million documents will be very

over a million documents will be very similar for things that although they

similar for things that although they

similar for things that although they are different phrases mean more or less

are different phrases mean more or less

are different phrases mean more or less the same thing so you can see when

the same thing so you can see when

the same thing so you can see when things are used interchangeably that is

things are used interchangeably that is

things are used interchangeably that is very much the case so for instance row I

very much the case so for instance row I

very much the case so for instance row I guess 60 so crucial role actually is the

guess 60 so crucial role actually is the

guess 60 so crucial role actually is the more or less the interchangeably used

more or less the interchangeably used

more or less the interchangeably used with prominent role vital role

with prominent role vital role

with prominent role vital role fundamental role pivotal role or

fundamental role pivotal role or

fundamental role pivotal role or essential role sounds about right and

essential role sounds about right and

essential role sounds about right and again it’s it’s a great validation

again it’s it’s a great validation

again it’s it’s a great validation sometimes people work with data sets and

sometimes people work with data sets and

sometimes people work with data sets and they rarely ever see like anything else

they rarely ever see like anything else

they rarely ever see like anything else than floating point values here you can

than floating point values here you can

than floating point values here you can actually look at it and see that does

actually look at it and see that does

actually look at it and see that does actually make sense and if you’re in

actually make sense and if you’re in

actually make sense and if you’re in doubt you when we do sort of limited QA

doubt you when we do sort of limited QA

doubt you when we do sort of limited QA to see if things have become garbled by

to see if things have become garbled by

to see if things have become garbled by some bug introduced somewhere you can

some bug introduced somewhere you can

some bug introduced somewhere you can always just like look it up on Wikipedia

always just like look it up on Wikipedia

always just like look it up on Wikipedia or something see does it make sense and

or something see does it make sense and

or something see does it make sense and I think him so pivotal role key player

I think him so pivotal role key player

I think him so pivotal role key player essential role yeah so it actually works

essential role yeah so it actually works

essential role yeah so it actually works it’s possible to run this even on

it’s possible to run this even on

it’s possible to run this even on phrases which I think we have been the

phrases which I think we have been the

phrases which I think we have been the first to do so the upshot of this what

first to do so the upshot of this what

first to do so the upshot of this what have we done we’ve created human

have we done we’ve created human

have we done we’ve created human readable fingerprints so we’ve for any

readable fingerprints so we’ve for any

readable fingerprints so we’ve for any given text regardless of the type of

given text regardless of the type of

given text regardless of the type of language used we can extract some some

language used we can extract some some

language used we can extract some some phrases that we know what they mean and

phrases that we know what they mean and

phrases that we know what they mean and we can map them to the most commonly

we can map them to the most commonly

we can map them to the most commonly used definition or phrase that means the

used definition or phrase that means the

used definition or phrase that means the same thing and for a person skilled in

same thing and for a person skilled in

same thing and for a person skilled in the arts as they say it’s kind of easy

the arts as they say it’s kind of easy

the arts as they say it’s kind of easy to suddenly see what an article is about

to suddenly see what an article is about

to suddenly see what an article is about we can rank them and we can tell you the

we can rank them and we can tell you the

we can rank them and we can tell you the 5 10 things that are most important and

5 10 things that are most important and

5 10 things that are most important and an Arctic

an Arctic

an Arctic and when people say if you look at the

and when people say if you look at the

and when people say if you look at the graph up there when when when some

graph up there: when some author mentions insulin insensitivity and obese children, we will know that an article that was written a couple of years ago about overweight girls and reduced hormone response is actually talking about the exact same thing, and that's a very big leap in the way we recommend text in science, or indeed anywhere.

So, to recap, traditional document similarity relies on the words whose meaning we know, and sometimes words can be ambiguous, which is a big problem. So there's what we call the phrase hypothesis, which is what we're working on: when you have a longer selection of words that stack together in the same fashion, they rarely have a different meaning; they often have a very precise meaning. The ability to capture those phrases dynamically is basically what we do. So once you have these fingerprints, you can produce all kinds of different features that make researchers' lives easier.
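The talk gives no implementation details for these fingerprints, so here is a minimal sketch of the idea behind the phrase hypothesis: characterize a document by its recurring multi-word phrases rather than by single, potentially ambiguous words. The stopword list, the n-gram extraction rule, and all names are illustrative assumptions, not UNSILO's actual pipeline.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption, not a real pipeline's list).
STOPWORDS = {"the", "a", "an", "of", "in", "and", "or", "is", "are", "we", "to", "that"}

def candidate_phrases(text, sizes=(2, 3)):
    """Yield n-grams containing no stopwords -- a crude stand-in for the
    noun-phrase chunking a real system would use."""
    tokens = re.findall(r"[a-z]+", text.lower())
    for size in sizes:
        for i in range(len(tokens) - size + 1):
            gram = tokens[i:i + size]
            if not any(t in STOPWORDS for t in gram):
                yield " ".join(gram)

def fingerprint(text, top_k=5):
    """A document 'fingerprint': its most frequent multi-word phrases."""
    return [p for p, _ in Counter(candidate_phrases(text)).most_common(top_k)]

doc = ("Insulin insensitivity in obese children is linked to reduced "
       "hormone response. We study insulin insensitivity and hormone "
       "response in obese children.")
print(fingerprint(doc, top_k=3))
```

Because the phrases are multi-word, two documents sharing several of them are very likely about the same topic even if their single-word vocabularies differ.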
So what we've delivered to the partners that we work with is, first, as I said, the ability to highlight the principal components of an article. This is an article page; some of you may have seen one. If you search on Google for an article title, you get bounced to a publisher's webpage where that article is presented, and we helped make that page better. We helped make it easier for readers to understand what's going on: we can pull out key sentences, and we can recommend things. We can tell the user, this is where they mentioned that thing you're interested in; they use some different words, but it's about the same thing. And we can provide related content, basically articles that are talking about the same things. When we do that, we not only provide a related article, we actually tell you how it overlaps with what you're currently looking at, so we can show you: these are the concepts that occur here that also occur in the article you're presently looking at. We've also done an interactive version that allows the user to drill down and further explore (it has to contain this and this) and then get a recommendation.
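As a sketch of how overlap-aware related content might be ranked, here is a toy formulation using Jaccard similarity over concept fingerprints, which also returns the shared concepts so the overlap can be shown to the reader. This is my own illustration, not necessarily the ranking UNSILO uses.

```python
def related(query_concepts, candidates):
    """Rank candidate articles by concept overlap with the query article,
    keeping the shared concepts so they can be displayed to the reader."""
    scored = []
    for title, concepts in candidates.items():
        shared = query_concepts & concepts
        if shared:
            jaccard = len(shared) / len(query_concepts | concepts)
            scored.append((jaccard, title, sorted(shared)))
    return sorted(scored, reverse=True)

query = {"insulin insensitivity", "obese children", "hormone response"}
corpus = {
    "Overweight girls and reduced hormone response":
        {"overweight girls", "hormone response", "insulin insensitivity"},
    "RNA editing in plants": {"rna editing", "chloroplast"},
}
for score, title, shared in related(query, corpus):
    print(f"{score:.2f}  {title}  shared={shared}")
```

Filtering by a clicked concept, as in the interactive version described above, would just mean keeping only candidates whose shared set contains that concept.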
We work very closely with Springer Nature, Scientific American, Macmillan, many of the largest publishers, and we produce things like this. I guess it's a little difficult to see the highlights here, but in essence this is the non-schematic version of what I just told you. On the right side we have related content; you can click any of the things you're interested in and get a filtered list of the most similar articles that also contain the thing you're interested in. We also do other types of visualizations with related content. We can use our technology to find definitions of things: many of these scientific publishers have a large back catalogue of reference works, or teaching books if you will, that define different concepts, so users can click on something like RNA editing and we can pick up the best definition we can find in the publisher's literature, and not just rely on the stuff that's on Wikipedia. More interestingly, we're also working on building tools that allow researchers to see more of the history that the stuff they're interested in is a part of.
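A crude sketch of definition mining of this kind might pattern-match copula sentences in the reference corpus; the regex and the take-the-first-hit strategy below are illustrative assumptions only, and a real system would rank candidates.

```python
import re

def find_definition(term, corpus_sentences):
    """Return a candidate definition of `term`: the first sentence matching
    a simple 'TERM is/are/refers to ...' pattern."""
    pattern = re.compile(rf"\b{re.escape(term)}\b\s+(?:is|are|refers to)\s+", re.I)
    for sentence in corpus_sentences:
        if pattern.search(sentence):
            return sentence
    return None

sentences = [
    "We applied RNA editing to the samples.",
    "RNA editing is a molecular process that alters nucleotide "
    "sequences after transcription.",
]
print(find_definition("RNA editing", sentences))
```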
So here is a tool that we call Timeline. For a given article, sometime in the past, I guess around 2003 for the selected article there, we use the reference data, the forwards and backwards citations, to figure out which things were cited by this paper and which papers cited this paper, so forwards and backwards in time. But that's a very, very large set, because a single article often cites 10, 20, 50 other papers, each of which cites another 10, 50, 100 papers, so it's a very huge tree. What we do is basically prune that tree to look only at the branches with articles that talk about the same thing. That allows you to fairly easily identify an article from last year which talks about the same thing and actually, through a couple of links, cites the article that you're presently looking at. Or, if you're looking at a recent article, you can ask who was the first author in this citation tree to actually combine this and that in a paper.
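The pruning step can be sketched as a graph walk that only follows topically similar neighbors. The data structures, the bidirectional edge map, and the overlap threshold below are assumptions for illustration, not the actual Timeline implementation.

```python
from collections import deque

def prune_citation_tree(seed, edges, concepts, min_shared=1):
    """Walk the citation graph out from `seed`, keeping only branches whose
    articles share at least `min_shared` concepts with the seed article.
    `edges` maps a paper to its forward and backward citations."""
    keep, queue = {seed}, deque([seed])
    while queue:
        paper = queue.popleft()
        for nxt in edges.get(paper, []):
            if nxt in keep:
                continue
            shared = concepts[seed] & concepts.get(nxt, set())
            if len(shared) >= min_shared:  # prune topically unrelated branches
                keep.add(nxt)
                queue.append(nxt)
    return keep

edges = {"A": ["B", "C"], "B": ["D"], "C": ["E"]}
concepts = {
    "A": {"insulin", "obesity"},
    "B": {"insulin"},          # on-topic: branch explored further
    "C": {"crystallography"},  # off-topic: branch pruned, E never visited
    "D": {"obesity"},
    "E": {"insulin"},
}
print(prune_citation_tree("A", edges, concepts))
```

Pruning whole branches this way is what keeps the tree manageable: once an intermediate paper is off-topic, everything reachable only through it is skipped.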
So the value that we're providing to researchers, and this is something we're kind of proud of, is that we accelerate the path to successful discovery by pointing directly to what is relevant in an article, and we can also provide more relevant suggestions, because they're much more precise than competing technologies. Our little company actually also provides end-user features, because we believe that understanding the algorithms used, and how different algorithms will favor different things, is important for the feature you're trying to construct and for how you're going to rank things; it's actually very dependent on the type of use cases we're trying to solve. And our clients, the publishers, are really happy that they can roll out a feature across many different types of content. In biomedical science, for instance gene research, or drugs and diseases, there's a lot of structured documentation and a lot of ontologies; all gene names, at least those discovered until fairly recently, are logged in an open-access ontology, and documentation is really, really good in that small field of science. But everywhere outside of that it's much, much worse. If you look at the humanities,
well, there are rarely any official ontologies available that tell you which words are important or which things are synonyms of what, and so what we do is actually very important to developing this type of services or recommendations for all the other disciplines.

So, future directions. Well, as I said, we're currently working on understanding the relationships between all these features that we extract. There are so many different ways that you can say a given thing, and when you talk about the relationship between two things, there's an equal number of ways you can say it. Just the fact that serum consists mostly of water can be said in so many different ways. And thin-film-coated gold nanoparticles (we're currently working on a nano product for the nano industry with a partner) can also be described in a number of different ways. But what's interesting is of course that when these relationships stack up, we can replace the two things, the subject and the object, and then have a general understanding of how this relationship can be described. So a big challenge for us is trying to normalize and reduce the types of relationships between things in the corpus.
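One hedged way to picture this normalization is a lookup that collapses surface relation phrases into canonical predicates, so different phrasings of the same fact produce the same (subject, relation, object) triple. In practice such groupings would be learned from the corpus; the table below is purely hypothetical.

```python
# Hypothetical surface-form table; a real system would learn these
# groupings from the corpus rather than hard-code them.
CANONICAL = {
    "consists mostly of": "has_main_component",
    "is largely made up of": "has_main_component",
    "is primarily composed of": "has_main_component",
    "are coated with": "coated_with",
}

def normalize(subject, relation, obj):
    """Reduce a surface statement to a canonical triple, so different
    phrasings of the same fact collapse into one."""
    rel = CANONICAL.get(relation.lower())
    if rel is None:
        return None  # unknown phrasing: leave for later analysis
    return (subject.lower(), rel, obj.lower())

print(normalize("Serum", "consists mostly of", "water"))
print(normalize("serum", "is largely made up of", "Water"))
```

Both calls yield the same triple, which is exactly the point: the subject and object slots are interchangeable, and the relation itself has been generalized.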
Another big forward-looking feature is to provide our services to other companies that are trying to solve problems and have access to unstructured text but no ability to process it, so we're working with a couple of large companies to basically make large text collections computable. Much of what we do can be applied to any given large collection of text, and you can do all sorts of really interesting analytics on it once you know what's what, what's similar, and what the important aspects of a text are. And then, ultimately, where we want to go is to do reasoning at scale.
That's really what you need in order to augment scientific research most efficiently. You need to be able to reason: what is this, what's the causal chain of events here, is this a disputed fact, does everyone say that this is how things are? Or there may be long chains of causality that go unnoticed, which can only really be uncovered by massive analytics. So I guess the ultimate prize there is the cure for cancer.
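A toy illustration of such chains: if individual papers each assert a single causal link, transitively composing the extracted (cause, effect) pairs can surface a chain that no single paper states. The facts below are invented for illustration.

```python
def causal_chains(facts, start, target, max_depth=4):
    """Chain pairwise (cause, effect) facts transitively to surface long
    causal paths from `start` to `target`."""
    def walk(node, path):
        if node == target:
            yield path
        elif len(path) <= max_depth:
            for cause, effect in facts:
                if cause == node and effect not in path:  # avoid cycles
                    yield from walk(effect, path + [effect])
    return list(walk(start, [start]))

# Each pair stands for a claim extracted from a different paper.
facts = [("gene X", "protein Y"),
         ("protein Y", "inflammation"),
         ("inflammation", "insulin insensitivity")]
print(causal_chains(facts, "gene X", "insulin insensitivity"))
```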
We have a small team; we're actually located in the second city of Denmark. We're 18 people now, I think, and all of them have worked at big international companies and basically chosen to come work with us for measly salaries, living in the suburbs, because we're so excited about the promise of assisting science. We have no Danish clients; we all work with international publishers. And yes, we are hiring, so feel free to apply; we're growing right now and would love to receive applications from you guys. So I think that concludes my talk, and I'd love to answer questions. There's a ton of detail that I left out, so if you have any sort of...
There are really many questions; you've been very active asking them. The first one is: is clickstream analysis used to analyze behavioral data, such as hyperlinks between articles, and do you use Spark for this?

Yes, I think we do use Spark. A confession: even though I grew up with a computer and coded demos on my C64 in my parents' bedroom in the 80s, I actually do not work as a developer in our company; I'm one of the founders and I sell the vision, so I can't answer that part accurately. We do look at clickstream data, but mostly it's limited to profile building, not session analysis, because there's a lot of noise and people get distracted. If you have subsequent clicks through a corpus, that really just tells you something about what the user is interested in, not necessarily that the things they click on are related, because people get distracted. So yes, we use clicks, but not really streams.
And if you do use clickbait, isn't that manipulation?

Right, we were actually asked to do this. So I think, when you're working with big corporations, you have different layers of management with different key performance indicators, and the people that work on the front end would like to see a feature used, so you need to optimize the data for a feature to be used. I guess the reason I can still fall asleep at night is that I think what we're doing is vastly superior to the traditional co-download statistics that are used in science. Normally, the things that get recommended across scientific publishers are the things that other people downloaded in the same session, and I think one of the biggest problems with that, just to do a little diversion here, is that when you only look at behavioral data, you have absolutely no way of recommending the new article that came out yesterday, because you have no behavioral data attached to it. It's what we call the cold-start problem: unless you can identify that this article is very similar to another article that does have behavioral data, you actually cannot make a decent recommendation until, by accident, people stumble across it and you know somebody actually did something with it. So obviously this is a Jekyll-and-Hyde thing; the best solution is always a combination of the two factors.
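The "combination of the two factors" can be sketched as a weighted blend of behavioral and content-based scores, so that a brand-new article with no download history still receives a content score instead of being invisible. The weights and data below are illustrative assumptions, not a description of a production system.

```python
def hybrid_score(candidate, co_downloads, content_sim, alpha=0.5):
    """Blend behavioral and content-based evidence. A brand-new article has
    no co-download data (cold start) but still gets a content score."""
    behavioral = co_downloads.get(candidate, 0.0)
    return alpha * behavioral + (1 - alpha) * content_sim[candidate]

co_downloads = {"old_paper": 0.9}  # only older items have click history
content_sim = {"old_paper": 0.4, "new_paper": 0.8}

ranked = sorted(content_sim,
                key=lambda p: hybrid_score(p, co_downloads, content_sim),
                reverse=True)
print(ranked)
```

With purely behavioral scoring, `new_paper` would score zero; the blend at least keeps it in the ranking from day one.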
How do you make rules for classifying words or phrases that are very domain-specific, across the many different research domains?

So there are actually very few phrases that are syntactically exactly the same across domains but have very different meanings, and most of that problem we've sort of circumnavigated by looking at longer phrases and by filtering out the stuff that has ambiguity. You will actually see that we try not to mention things that, when mentioned alone, can mean different things; when we add an additional token in front, oftentimes the phrase becomes much less ambiguous, and we then prefer that one. And that's simply an algorithmic solution, not something that we hard-code.
actually look at the the ones that have

actually look at the the ones that have ambiguity and try to pick longer phrases

ambiguity and try to pick longer phrases

ambiguity and try to pick longer phrases that are super sets that included do you

that are super sets that included do you
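The "prefer the longer, less ambiguous superset phrase" heuristic could be sketched like this; the ambiguity list and the candidate phrases are invented for illustration:

```python
# Hypothetical sketch of the superset heuristic: if a candidate
# phrase is known to be ambiguous, keep a longer candidate that
# contains it instead of the short phrase itself.

AMBIGUOUS = {"cell"}  # e.g. biology vs. telecom vs. spreadsheets

def prefer_supersets(candidates, ambiguous=AMBIGUOUS):
    """Drop ambiguous phrases when a longer candidate contains them."""
    kept = []
    for phrase in candidates:
        if phrase in ambiguous and any(
            phrase != other and f" {phrase} " in f" {other} "
            for other in candidates
        ):
            continue  # a superset phrase will carry the meaning
        kept.append(phrase)
    return kept

print(prefer_supersets(["cell", "stem cell", "cell membrane"]))
# -> ['stem cell', 'cell membrane']
```

The bare ambiguous term is dropped because longer, unambiguous supersets ("stem cell", "cell membrane") are available; if no superset exists, the short phrase is kept as-is.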

Do you do any kind of personalization?

We don't have a product for personalization, because it's a big hot potato in science. People are really afraid of being tracked, because they think they have the cure for cancer; search history is a complete no-go for most of the clients that we work with. So we don't have a product yet. We think it's incredibly interesting and we'd love to do it, but we don't have a partner to do it with, and it's probably going to be outside of science.

What is the scale of data used in your processing? How much data did you have to train your model?

So that's another thing. For the first two years of our startup we were trying to build a Google Scholar competitor. We wanted to build a destination site where users could come and search in full-text articles; not see the full-text articles, but we would index them for the publishers and then link out to the real content. And we spoke to many different scientific publishers, and they all said "that's a brilliant idea," and they had so many meetings with us over two years, and they said, "oh, here's another test sample that you can have of our content," and "once we're ready to go, you'll have this hard drive with a ton of articles, and it will be no problem, everybody will be happy." And then, after two years with only a few thousand articles from each publisher, and a ton of meetings where they asked about our technology in depth and detail, we went out one night in London, I remember, and one of the product managers, actually someone at VP level at one of those big publishers, said over a beer: "you know, it's never going to happen; they're just keeping you close because they want to know what kind of technology you're developing." And I think a few months after that we pivoted into a different business plan. With too little open access material to provide our value on, we decided to work within the framework of the publishers and be their friends.

So now we're providing services that are primarily focused on using one publisher's data to perform services for that one publisher's clients. The larger publishers have 10 to 15 million articles; some of the aggregators have more, but most of our clients have less than 10 million documents. With each document being, I don't know, a few hundred kilobytes in plain ASCII, it's not crazy amounts of data: a few terabytes for a larger publisher. So, as Aaron Swartz found out, it could easily be dumped somewhere on the internet, but everyone would be sued. Okay.
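A quick back-of-the-envelope check of the sizes mentioned; the per-document figure is only the speaker's rough guess:

```python
# Rough corpus-size arithmetic: ~10 million documents at a few
# hundred kilobytes of plain text each.
docs = 10_000_000
kb_per_doc = 300  # "a few hundred K", speaker's rough figure

total_tb = docs * kb_per_doc / 1_000_000_000  # KB -> TB (decimal)
print(total_tb)  # -> 3.0, i.e. a few terabytes, as stated
```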

Would it make sense to pretty-print an article, normalize it, and republish it along with the original, and do you have a tool for that?

No, we don't; we cannot provide access to the full text. We work with publishers, and it's a very tightly controlled business. Their primary business asset, at least until open access becomes more dominant, is the content that they own and control, so we really can't do much with it except behind closed doors. When we worked with Elsevier last year, the forms we had to fill out for security compliance were crazy: I think a hundred and forty-seven tabs in an Excel sheet, with a hundred questions in each. And those were just the survey questions before they send a person over. So yeah, they're really, really crazy about security.

Are you using a lambda architecture, and can you talk about that?

I'm not familiar with lambda architecture; I know lambda coefficients, but no. Or probably, maybe we are, who knows. Okay.

What is the most interesting finding you have made in your data?

The cure for cancer? We haven't found that yet, and I guess we would have published it. We're a service provider, so we work with what the industry calls subject matter experts, or SMEs. We have models with which we validate the quality of what we do, and the error rates and so on are all automated tests. And then, of course, we run it by a panel of real scientists who can look at it, who know the content that we've processed, and can tell if there's an error somewhere, a word we left out that was important. But we can't really evaluate it ourselves. So we know that at the scientific publishers we work with, the editors say that we have the best extraction algorithms, the ones that produce the best and most usable phrases and results. That's what we go on; we actually don't know what it's being used for. Okay.

What about articles published in the public domain, published on open platforms? Are you indexing and presenting articles from these sources?

Yes, we are working with a couple of open access publishers. The open access model has sort of turned publishing inside out: traditional publishers actually publish your work for free as long as you sign over the copyright, whereas for open access you have to pay for the peer review process and the publishing. Of course that cost has come down a lot from a few years ago, but you still pay around 2,000 euros to publish an article, and that puts a little damper on the growth of open access. But we do work with some of the open access providers.

We had this idea when we started our company that we would just aggregate all of open access. That's fine, good luck if you want to try, because the only people that have succeeded in doing anything vaguely resembling that are just aggregating the metadata. It turns out that people publish their articles in a gazillion different formats on a gazillion different websites, where sometimes the download button is behind some kind of "I'm not a robot" captcha, and it's really, really hard to get at the content. The biggest mistake the open access community has made is not agreeing on some submission standard that allows that text to be mined. I just don't see why no one has come up and said: this is how you do it, this is the format, give us a JATS XML file right here on an FTP server, dump it there, and let the community do the rest. But it hasn't been done, so it's not a task for startups: it's incredibly time-consuming to deal with thousands of different submission formats and PDFs. You may think PDF is a nice format, but it turns out that sometimes the renderer will swap the order of sentences around, and it's impossible to figure out where a sentence is completed, or you don't want to know. So we have to have someone else take care of that, and then we can do open access in a few years.

Do you have some kind of best practice for running a deduplication process where different deep learning methods could be applied?

I'm not sure I understand the question, but we do have one. That's the key value-add, and I'm sorry, I can't share the source code for free; we're trying to build a business. If you want to work with it, you should come to us. The pipeline that we're building is iterative: we pipe in stuff that we've learned elsewhere, we work internally in the team, we write white papers, we give talks to each other, and it's a wonderful setup. Please come to UNSILO.

Does this apply well to computer science papers?

Oh yes, arXiv. We've indexed arXiv once, but we haven't set it up for re-indexing, and I think we should; it's the whole "eat your own dog food" thing. So we should get that up and running again when we get around to it, right? We have these other jobs that pay money that we have to do first.

Did you try whether your technology works for languages other than English?

No, we haven't found anyone willing to pay for it yet. Most of what we do can be transferred to other languages. I'm not myself fluent in German, but I think possibly there are some rules that would have to be adapted for its grammar; basically, though, there's nothing preventing it from being ported to other languages. We've been asked to do Chinese for IP analysis, for patent analysis, but the tools that everyone else is using are basically some kind of auto-translation and then applying text analytics afterwards, which is probably inferior but makes more sense from a cost perspective, unfortunately.

I think that's it. A lot of questions, thanks for that, and let's say thank you to Mads. Thank you.
