Press "Enter" to skip to content

GOTO 2016 • Discovering Research Ideas Using Semantic Vectors & Machine Learning • Mads Rydahl


Hello, and thank you for joining us. I'm part of a small start-up located here in Aarhus, Denmark, called UNSILO. We work with big scientific publishers to process article information and to make tools for researchers. So maybe I should start by explaining the mission we set up four years ago, when we started the company: our idea was to build a system of discovery services that could make it easy to find patterns across a lot of unstructured text.

Today — or a couple of years ago — the way things were linked, when you looked at an article and tried to find something similar, was through human-annotated editorial keywords. That's how you find related articles in science. The big challenges we saw with the system as it was were these. First, because scientific language is constantly evolving and growing and new things are being discovered, it's impossible to keep up with hand curation of content. Second, a system also has to be omniscient: presently it's an author and an editor who look at a paper and try to decide what the important aspects of the article are, and sometimes really interesting discoveries are only apparent in hindsight. So you need an automated system that can correlate a new article with tons of other things currently going on — figure out whether people in China are doing something exactly similar to what you're trying to do. Finally, it has to be unbiased. Most of the recommenders and automated concept curation today are based on collaborative filtering — like the stuff you see on Amazon: people who bought this also bought that. It tends to lead us down the same path, and it tends to make researchers walk straight past the most interesting stuff, because that's what everyone else also does. So we need an unbiased approach that doesn't rely on some kind of popularity ranking like PageRank or collaborative filtering.

(The sound is a little odd — is it okay? I don't have that fancy clicker.) The core technology we've built is based on a lot of open-source components — at least three. We have a document processing pipeline built around Apache UIMA and Ruta; we run a fairly standard natural language processing pipeline and tools on top of that; and then we use common languages like Python for prototyping, Java, and a lot of libraries from, I guess, the data scientist's toolbox.

The key challenge in what we're trying to do is that unstructured knowledge — text — basically does not compute. As I said before, there's too much going on for humans to be involved in this process, and even when humans are involved at a higher level, building ontologies to represent the knowledge we have of a certain discipline, it's not going fast enough. All the interesting stuff that was found out yesterday, or last month, or even six months ago has not made it into a curated ontology yet. So if you really want to be at the forefront — where the money is and where things matter in research — you need a more dynamic approach: even where dictionaries or reference works exist, they're simply not comprehensive enough. The second big problem is that people are way too creative. They don't use just one name for a given phenomenon; they have many different variations, and they often add descriptive detail in their own language that makes absolutely no sense to a computer and makes it really difficult to figure out what they're actually talking about. There is no single right way to describe anything in the world, and we somehow have to figure out what people are talking about. And finally there's all the data that people consider obvious. That's probably the biggest problem for analytics today, or for AI in general: all the stuff that people consider obvious and then fail to include in a description of anything. Those are the key problems we're trying to solve.

Here's a piece of text — an abstract of an article from 2006; the real article is probably ten times as long. If you throw this at a regular full-text search or some kind of standard search engine, it's really difficult to see what the text is really about, and if I read it, how do I figure out what other articles talk about the same things? Today we use computers to annotate the words whose meaning we know — the words found in the common dictionaries and ontologies of this area. At our company we've developed a much more comprehensive way of looking at this: dynamically, statistically deriving longer phrases that mean something, and figuring out which of them mean approximately the same thing.

As promised in the remarks for the talk, I'm also going to talk a little about where we want to take things and what we're currently working on. As you can see, we're trying to cover all of the information that is actually in an article — map it out, make it searchable, make it findable — and we're presently working on all the actions and relationships between these things. When you find something that talks about A and B, the most relevant article is probably the one that talks about A and B in approximately the same context, or the same sentence, or that even talks about how A is related to B. Today you can also do this with a distance — the number of words in between — in a traditional search engine. But the thing is, when you're working with text, sometimes the words in between cross a paragraph boundary, or sometimes it's the image caption sitting right next to that really interesting other thing you were looking for. And other times the thing you're interested in is mentioned up here with some third thing, and down here the other thing is mentioned with that same third thing — so they're actually really closely connected, but they're at opposite ends of the article. You need a better understanding of this, and we actually use graph analytics to understand the proximity and the centrality of things in an article.

The first step we perform is regular natural language processing. Some of you may be familiar with this, but the simplest part of natural language processing — the thing you can do without too much computation — is part-of-speech tagging: basically assigning word classes to each word. Is this a verb, or is it a noun in this context? Is that an adjective? Once we have the part-of-speech tags, we can find a lot of candidates for potential things in the sentence. As you can see here, we have a sentence from the abstract you just saw: "methods for measuring sodium concentration in serum by indirect sodium-selective electrode potentiometry". I've highlighted underneath, for those who don't read articles on a daily basis, the four things — and an action, if you will — in the sentence. If we extract all of the things here, they seem pretty straightforward, right? So what's the beef?
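The candidate-extraction step can be sketched in a few lines (a toy illustration of mine, not UNSILO's actual code): given part-of-speech tags, collect maximal runs of adjectives and nouns as phrase candidates.

```python
# Toy noun-phrase candidate extractor over (token, POS) pairs.
# Tags follow the Penn Treebank convention (JJ = adjective, NN* = noun).
tagged = [
    ("methods", "NNS"), ("for", "IN"), ("measuring", "VBG"),
    ("sodium", "NN"), ("concentration", "NN"), ("in", "IN"),
    ("serum", "NN"), ("by", "IN"), ("indirect", "JJ"),
    ("sodium-selective", "JJ"), ("electrode", "NN"),
    ("potentiometry", "NN"),
]

def candidate_phrases(tagged):
    """Collect maximal runs of adjectives/nouns as phrase candidates."""
    phrases, current = [], []
    for token, pos in tagged:
        if pos.startswith("NN") or pos == "JJ":
            current.append(token)
        else:
            if current:
                phrases.append(" ".join(current))
            current = []
    if current:
        phrases.append(" ".join(current))
    return phrases

print(candidate_phrases(tagged))
# ['methods', 'sodium concentration', 'serum',
#  'indirect sodium-selective electrode potentiometry']
```

Real pipelines use richer chunking grammars, but this is enough to recover the four "things" in the example sentence.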

It turns out you can say these things in many different ways, and if you want to see other content closely related to this article, you can't just look at the documents that include those exact words — you also need to look at the ones that mention the same things in different ways. So basically we have to deduplicate. We work with Springer Nature, one of the larger scientific publishers in the world. They've given us all of their content, and we've sifted through it and found upwards of a hundred million "things" in it. After processing that in various ways, we deduplicate it down to maybe two or three million different things — and even then you still have separation between things that a human reader would consider mostly the same thing. So there's a lot of deduplication to do.

Look at the examples here: "concentration of sodium" can be mapped back to "sodium concentration". You can also have sentences like "the electrode potentiometry was indirect" — obviously that's the same as "indirect electrode potentiometry". Some people like to call things a "methodology" rather than a "method", and sometimes people talk about "sera", the plural, rather than "serum". These are what we call morphological or syntactic variations — basically the ones that depend on the grammar. We also try to reduce the lexical and semantic variations; that's when authors use synonyms or hypernyms, which are more generic, general terms for the same thing. For parts of our pipeline we also do that sort of abstraction: whenever someone says "method", we might map that back to a more generic term, "mechanism". A "serum sample" is actually a type of blood sample — serum is blood with something filtered out; that's not my primary business. As for "serum sodium concentration", well, "sodium" is, I guess, the American term for "natrium", which is also sometimes used. And "indirect electrode potentiometry", which we've now seen a couple of times, is actually a type of electroanalysis.

So when we look at longer sentences or longer phrases, we go in and replace each of the tokens with a more generic term, to figure out whether this is actually a variation of something we've seen before. All of this really has nothing to do with machine learning; it's just hard-coded understanding of linguistic variations. We handle compound paraphrases, adjectival modifiers, and coordinations — where a mention like "the concentration of sodium and magnesium" can be expanded into "concentration of magnesium" and "concentration of sodium" — all the tedious rules we need to apply before you can do any kind of aggregated understanding.
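One such rule can be sketched like this (my own toy version, not the production pipeline): expand a coordination of the form "X of A and B" into one phrase per conjunct.

```python
def expand_coordination(phrase):
    """Expand 'X of A and B' into ['X of A', 'X of B'] (toy rule)."""
    head, sep, tail = phrase.partition(" of ")
    if not sep or " and " not in tail:
        return [phrase]  # nothing to expand
    conjuncts = [c.strip() for c in tail.split(" and ")]
    return [f"{head} of {c}" for c in conjuncts]

print(expand_coordination("concentration of sodium and magnesium"))
# ['concentration of sodium', 'concentration of magnesium']
```

A real system needs a parse tree to avoid splitting "trial and error"-style fixed expressions, which is part of what makes these rules tedious.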

And then, finally, a couple of things: often we're looking at fragments of something else, or at something that contains a fragment which is more interesting. Sometimes it's "indirect potentiometry", and no one else in the world has ever put "sodium-selective" in between — so we have to identify that and take author-specific variations out of the question, because they mean absolutely nothing to anyone else in the world. Here we also come to the matter of added descriptive detail, which can really get in the way of understanding what's going on: "clinically implemented indirect" something, or "error-prone indirect ion-selective" whatever — these are all things that obscure what's really being spoken about.

Once we have deduplicated all of this — tons of things, really — we look at different types of features. The local features in a document include how many times a thing is mentioned and what it's connected to. We actually calculate a position in a document graph: we connect all the things mentioned in the document with the relationships that connect them, and then do regular graph analysis to figure out what's central and what's peripheral to what's being talked about. You can have something that's only mentioned once but is really central, because it's connected to the one very central thing; and you can have stuff out at the edge that may be mentioned a couple of times, but always in relation to stuff that's non-central. And then of course we run the other types of analytics that use the textual context — the words right before and right after a piece of text.
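The "mentioned once but still central" effect falls out of eigenvector-style centrality — a node is central if its neighbours are central. A toy sketch (my own assumptions; the talk doesn't name the exact centrality measure used):

```python
# Toy document graph: concepts as nodes, co-mention relations as edges.
edges = [
    ("sodium concentration", "serum"),
    ("sodium concentration", "potentiometry"),
    ("sodium concentration", "method"),
    ("potentiometry", "electrode"),
]

def eigen_centrality(edges, iterations=50):
    """Power iteration: a node's score is fed by its neighbours' scores
    (plus its own, which guarantees convergence on bipartite graphs)."""
    nodes = {n for e in edges for n in e}
    neighbours = {n: [] for n in nodes}
    for a, b in edges:
        neighbours[a].append(b)
        neighbours[b].append(a)
    score = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        new = {n: score[n] + sum(score[m] for m in neighbours[n])
               for n in nodes}
        norm = max(new.values())
        score = {n: v / norm for n, v in new.items()}
    return score

scores = eigen_centrality(edges)
print(max(scores, key=scores.get))
# sodium concentration
```

Here "electrode" is peripheral even though it has an edge, while "sodium concentration" dominates because everything else connects through it.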

The global features we use include the occurrence count — the number of documents that contain a given phrase — and we run various fancy algorithms to find the most common variation: if you have an n-gram, a phrase of words, what's its most commonly used variation? If you add an additional adjective in front, what's the most commonly used adjective, or what are the two most common ones, and are they sufficiently different to be two different things? Then of course we also calculate tf-idf, which many of you are probably familiar with — basically deviation in frequency from a norm: if something occurs more often than it does on average, it's probably a significant phrase.
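As a reminder, the standard tf-idf formulation (nothing specific to the talk) weighs a term's frequency in one document against how many documents contain it:

```python
import math

def tf_idf(term, doc, corpus):
    """tf-idf: term frequency scaled by inverse document frequency."""
    tf = doc.count(term) / len(doc)
    df = sum(1 for d in corpus if term in d)
    idf = math.log(len(corpus) / df)  # assumes the term occurs somewhere
    return tf * idf

corpus = [
    ["sodium", "concentration", "in", "serum"],
    ["serum", "sample", "analysis"],
    ["gene", "expression", "profile"],
]
doc = corpus[0]
# "sodium" appears in 1 of 3 documents -> high idf;
# "serum" appears in 2 of 3 -> lower idf.
print(tf_idf("sodium", doc, corpus) > tf_idf("serum", doc, corpus))
# True
```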

Then we look at distribution across the corpus. A thing can be mentioned very few times, but whenever someone uses it, they mention it over and over again in the same document — which means it's probably got some significance, even though if you just count the number of documents it occurs in globally, it may seem insignificant. So we have a concentration score, which basically tells us: when it occurs in a document, how likely is it to occur more than once? We also do an analysis comparing the distribution across domains, to discover that something is very common, but only in a certain domain. All of these things are fed into our learning algorithms and ranking models.
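The concentration score can be read as a simple conditional probability (my own reading of the description, not UNSILO's exact formula): of the documents containing the phrase, what fraction contain it more than once?

```python
def concentration_score(counts_per_doc):
    """Given per-document occurrence counts of a phrase, estimate
    P(occurs more than once | occurs at all)."""
    containing = [c for c in counts_per_doc if c > 0]
    if not containing:
        return 0.0
    return sum(1 for c in containing if c > 1) / len(containing)

# A phrase found in only 3 documents, but mentioned repeatedly there:
print(concentration_score([0, 0, 5, 0, 4, 0, 3]))  # 1.0
# A phrase scattered once per document:
print(concentration_score([1, 0, 1, 1, 0, 1]))     # 0.0
```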

We also use the aggregated textual context — I'll get back to this in a little while; these are the word2vec, or word-embedding, models that the previous speaker also mentioned. If we look at all the occurrences of a given phrase across the entire corpus, that tells us something about what it means — or what other things might mean the same thing.
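The distributional idea — phrases used in similar contexts mean similar things — can be sketched with raw co-occurrence vectors and cosine similarity (a toy illustration; word2vec itself learns dense vectors with a neural objective):

```python
import math
from collections import Counter

def context_vector(target, tokens, window=2):
    """Count the words appearing within `window` positions of `target`."""
    vec = Counter()
    for i, tok in enumerate(tokens):
        if tok == target:
            lo, hi = max(0, i - window), i + window + 1
            vec.update(t for t in tokens[lo:hi] if t != target)
    return vec

def cosine(a, b):
    dot = sum(a[k] * b[k] for k in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

tokens = ("rinse the sample with deionized water before analysis "
          "rinse the sample with distilled water before analysis "
          "store the sample in a dark cupboard after analysis").split()
v1 = context_vector("deionized", tokens)
v2 = context_vector("distilled", tokens)
v3 = context_vector("cupboard", tokens)
print(cosine(v1, v2) > cosine(v1, v3))  # True: shared contexts
```

"deionized" and "distilled" share their surrounding words, so their vectors are close; "cupboard" shares none of them.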

And then, of course, the biggest thing when you're training a model is what you train it on. We have two types of training data. First, human training data. This could be the articles themselves: if our hypothesis is that a given concept is very central to an article, we can check whether we actually found it in the abstract. If it's in the abstract or the title, there's a high likelihood that the author also considers it important — that's one data point, and aggregated over thousands or millions of articles, it can tell us how good we are at selecting the things authors find important. Of course, if we think we can do better than the authors, that's a lousy way to measure it. So we also use another type of human training data: behavioral data from the companies we work with, who kindly allow us access to usage patterns. When we present something to users, which of the things we extract did they actually click on or find interesting? And when presented with a list of related articles — in the sidebar, for instance — which of these were found most interesting, or clicked on, by users? It turns out, of course, that it's the ones with the promising titles that get clicked, not necessarily the ones that are most similar — so sometimes you need to make adjustments just to create some link bait.

The other type is synthetic data: we can construct an artificial corpus, train our models on it, and try to improve them using the principles we used to create the synthetic data. It's slightly more complex, but it works — if any of you have tried word2vec, the demo they provide is actually completely synthetic. You can also build partially synthetic data sets. One approach we've tried — which was also used in the word2vec work — is to use a different search engine to create the artificial corpus: you search for two different concepts, two different words, mix the results together, and remove all traces of the words you searched for, so the only thing left is everything else in each document. Then you try to figure out whether you can still classify which was which, and dump things into the right pile.
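That construction can be sketched like this (my own toy version): pull two result sets, mask the query terms, and keep the source label for later classification.

```python
def make_synthetic_corpus(results_a, results_b, term_a, term_b):
    """Label documents by source query, then mask the query terms so a
    classifier must rely on the surrounding language alone."""
    corpus = []
    for label, docs, term in (("A", results_a, term_a),
                              ("B", results_b, term_b)):
        for doc in docs:
            masked = " ".join(t for t in doc.split() if t != term)
            corpus.append((label, masked))
    return corpus

docs_a = ["serum sodium measured by potentiometry"]
docs_b = ["gene expression measured by sequencing"]
for label, text in make_synthetic_corpus(docs_a, docs_b, "sodium", "gene"):
    print(label, "->", text)
# A -> serum measured by potentiometry
# B -> expression measured by sequencing
```

If a classifier trained on the masked text still separates A from B, the surrounding context carries the signal — which is exactly the distributional hypothesis being tested.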

So, a little bit about word embeddings, which the previous speaker mentioned. Here's an example. Basically, you build a vector — actually a tensor, a combination of vectors. Each word, token, or phrase (we work on phrases) in our corpus is defined in this vector space by an aggregation of the vectors it commonly co-occurs with. The traditional word2vec algorithm just treats all text as tokens — every token gets its own vector, and only a few things get concatenated because they belong together. We pre-process the text quite a lot: after we've deduplicated those hundred million things down to a few million, they actually have decent occurrence counts. The big problem when you look at larger selections of text is that their statistics are much sparser than each word's on its own. "Hyperaemic flow", for instance, doesn't necessarily occur that many times — even with a million or ten million documents, it's still so specific that you only have a few hundred occurrences, so it's important to capture all of them, even when an author calls it something different. But after all that deduplication, we end up with a corpus on which we can generate a vector model, and then we use other things on top. We know that "coronary vasodilation" is defined in an ontology and related to all these different things, so we combine that structured knowledge of the domain to further refine the vector model — and that has worked really well for us.

Here's a little data dump from a test a while ago. What you see are phrases and their occurrence counts in a test corpus of, I think, a million articles. Look at the first line: "deionized water" — it's actually part of a set that extends further to the right — has the same, or a similar, vector as "bidistilled water", "ultrapure water", "DI water", "de-/ionized water", and "double-distilled water". It's important to notice that this is the output of a vector model where, for each concept in the first column, we find the nearest concepts — the ones that appear in the most similar contexts. The algorithm doesn't even look at the letters: it just has an ID, and it knows the IDs of the things around it. So it's pretty obvious that this is possible purely from the hypothesis that words which mean approximately the same thing are used in approximately similar contexts — the ten words, or five words, before and after, aggregated over a million documents, will be very similar for things that, although they are different phrases, mean more or less the same thing. When things are used interchangeably, that is very much the case. Row 60, I guess: "crucial role" is used more or less interchangeably with "prominent role", "vital role", "fundamental role", "pivotal role", and "essential role". Sounds about right.

And again, it's a great validation. Sometimes people work with data sets and rarely ever see anything other than floating-point values; here you can actually look at it and see that it makes sense. If you're in doubt — we do limited QA to see whether things have become garbled by some bug introduced somewhere — you can always just look it up on Wikipedia or somewhere and see whether it makes sense. And I think it does: "pivotal role", "key player", "essential role" — yes. So it actually works, and it's possible to run this even on phrases, which I think we were the first to do.
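The "nearest concepts" lookup itself is straightforward once every phrase has a vector (toy data and dense lists are my own assumptions):

```python
import math

def cosine(a, b):
    """Cosine similarity of two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# Toy phrase vectors; real models use hundreds of dimensions.
vectors = {
    "deionized water":        [0.9, 0.1, 0.0],
    "double-distilled water": [0.8, 0.2, 0.1],
    "crucial role":           [0.1, 0.9, 0.3],
}

def nearest(phrase, vectors):
    """Return the other phrase whose vector is most similar."""
    v = vectors[phrase]
    others = (p for p in vectors if p != phrase)
    return max(others, key=lambda p: cosine(v, vectors[p]))

print(nearest("deionized water", vectors))
# double-distilled water
```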

So what's the upshot? We've created human-readable fingerprints. For any given text, regardless of the type of language used, we can extract phrases whose meaning we know and map them to the most commonly used definition, or phrase, that means the same thing. For a person skilled in the arts, as they say, it's suddenly kind of easy to see what an article is about. We can rank the phrases and tell you the five or ten things that are most important in an article. And if you look at the graph up there: when some author mentions "insulin insensitivity" and "obese children", we will know that an article written a couple of years ago about "overweight girls" and "reduced hormone response" is actually talking about the exact same thing. That's a very big leap in the way we recommend text in science — or indeed anywhere. Traditional document similarity relies, to recap, on the words whose meaning we already know, and words can be ambiguous, which is a big problem. So there's what we call the phrase hypothesis, which is what we're working on: when you have a longer selection of words that stack together in the same fashion, they rarely have a different meaning — they often have a very precise meaning — and the ability to capture those phrases dynamically is basically what we do.

Once you have these fingerprints, you can produce all kinds of different features that make life easier for researchers. What we've delivered to the partners we work with is, first, the ability to highlight the principal components of an article, as I said. This is an article page — some of you may have seen one: if you search Google for an article title, you get bounced to a publisher's web page where the article is presented. We helped make that page better — easier for readers to understand what's going on. We can pull out key sentences, and we can recommend stuff: we can tell the user, here is where they mention that thing you're interested in — they use some different words, but it's about the same thing. And we can provide related content, basically articles that talk about the same things. When we do that, we don't just provide a related article; we actually tell you how it overlaps with what you're currently looking at — we can show you the concepts that occur here and also occur in the article you're presently viewing. We've also done an interactive version that allows the user to drill down and explore further — it has to contain this and this — and then get a recommendation.

We work very closely with Springer Nature, Scientific American, Macmillan — many of the largest publishers — and we produce things like this. I guess the highlights are a little difficult to see here, but in essence this is the non-schematic version of what I just told you. On the right side we have related content: you can click any of the things you're interested in and get a filtered list of the most similar articles that also contain that thing. We also do other types of visualizations with related content. And we can use our technology to find definitions of things: many of these scientific publishers have a large back catalogue of reference works — teaching books, if you will — that define different concepts. So users can click on something like "RNA editing", and we can pick out the best definition we can find in the publisher's own literature, and not just rely on what's on Wikipedia.

More interestingly, we're also working on building tools that allow researchers to see more of the history that the stuff they're interested in is part of. Here is a tool we call Timeline. For a given article — the selected article here, sometime in the past, I guess around 2003 — we use the citation data, forwards and backwards, to figure out which things were cited by this paper and which papers cited it: forwards and backwards in time. That's a very, very large set, because a single article often cites 10, 20, 50 other papers, each of which cites another 10, 50, 100 papers — it's a huge tree. So what we do is basically prune that tree down to just the branches with articles that talk about the same thing. That allows you to fairly easily identify an article from last year which talks about the same thing and, through a couple of links, actually cites the article you're presently looking at — or, if you're looking at a recent article, to ask who was the first author in this citation tree to actually combine this and that in a paper.
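The pruning step can be sketched as a filtered traversal of the citation graph (my own toy version): keep only papers whose fingerprint overlaps the seed article's, and never descend into pruned branches.

```python
from collections import deque

def pruned_citation_tree(seed, cites, topics, min_overlap=1):
    """Breadth-first walk of `cites` (paper -> cited papers), keeping
    only papers sharing at least `min_overlap` topics with the seed."""
    keep, queue = {seed}, deque([seed])
    while queue:
        paper = queue.popleft()
        for ref in cites.get(paper, []):
            overlap = topics[seed] & topics.get(ref, set())
            if ref not in keep and len(overlap) >= min_overlap:
                keep.add(ref)
                queue.append(ref)
    return keep

cites = {"A": ["B", "C"], "B": ["D"], "C": ["E"]}
topics = {
    "A": {"sodium", "serum"},
    "B": {"sodium"},          # relevant branch, kept
    "C": {"genomics"},        # off-topic branch, pruned
    "D": {"serum"},
    "E": {"genomics"},
}
print(sorted(pruned_citation_tree("A", cites, topics)))
# ['A', 'B', 'D']
```

The same walk works in the forward direction by inverting the `cites` map.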

00:29:19and in a paper so the value that we’re

00:29:25providing to researchers and this is

00:29:27we’re kind of proud of that is that we

00:29:30accelerate the path to successful

00:29:35discovery by pointing directly to what

00:29:38is relevant in an article and we can

00:29:41also provide more relevant suggestions

00:29:44because they’re much more precise than

00:29:47competing technologies and then we

00:29:49provide so our little company actually

00:29:51also provides end user features because

00:29:54we believe that it’s that understanding

00:29:56of the algorithms used and how they

00:29:59actually how different algorithms will

00:30:04favor different things and and that

00:30:06actually is important for the feature

00:30:07you’re trying to construct what how how

00:30:10you’re going to rank these and it’s

00:30:11actually very dependent on the type of

00:30:13use cases that we were trying to solve

00:30:15and for our our clients the publishers

00:30:19they’re really happy that they can roll

00:30:21out a feature across many different

00:30:22types of context content even so in

00:30:26biomedical for instance gene research or

00:30:30drugs diseases there’s a lot of

00:30:33structured documentation a lot of

00:30:36ontologies all gene names at least

00:30:38those discovered until fairly recently are

00:30:40logged in an open access ontology and

00:30:44documentation is really really good

00:30:47in that small field of science but

00:30:50everywhere outside of that it’s much

00:30:52much worse if you look at humanities

00:30:56well

00:30:56there are rarely any official

00:30:59ontologies available that tell you

00:31:01which words are important or which

00:31:03thing is a synonym of what and so

00:31:07what we do is actually very important

00:31:10to developing this type of

00:31:12service or recommendation for all

00:31:16the other disciplines so future

00:31:19directions well as I said we’re

00:31:21currently working on understanding the

00:31:24relationships between all these features

00:31:26of things that we extract there are so

00:31:28many different ways that you can say a

00:31:30given thing and when you talk about the

00:31:33relationship between two things there’s

00:31:35an equally large number of ways you

00:31:37can say it so just the fact that

00:31:41serum consists mostly of water can be

00:31:44said in so many different ways and

00:31:46the same with thin film coated gold

00:31:50nanoparticles we’re currently working on

00:31:52a nano product for the nano industry

00:31:55with a partner that can also be said in

00:31:58a number of different ways but what’s

00:32:00interesting is of course that these

00:32:02relationships when they stack up we can

00:32:04replace the two things the subject and

00:32:08the object and then have a general

00:32:10understanding of how this relationship

00:32:12can be described and so

00:32:15that's a big challenge for us trying

00:32:17to normalize and reduce the types of

00:32:20relationships between things in the

00:32:24corpus another big forward-looking

00:32:27feature is to provide our services to

00:32:31other companies that are trying to solve

00:32:33problems and have access to unstructured

00:32:35text but no ability to process it so

00:32:39we’re working with a couple of large

00:32:40companies to basically make

00:32:44large text collections computable so

00:32:48much of what we do can be applied on any

00:32:51given sort of large collection of text

00:32:53and you can do all sorts of really

00:32:56interesting analytics on it once you

00:32:58know what’s what and what’s similar and

00:33:00what’s the important aspects of text and

00:33:04then ultimately where we want to go is

00:33:08to do reasoning at scale

00:33:10that’s really what you need in order to

00:33:12to augment scientific research most

00:33:16efficiently you need to be able to

00:33:18reason about what's the

00:33:20causal chain of events here and is

00:33:24this a disputed fact does everyone say

00:33:26that this is how things are or are

00:33:29there maybe long chains of

00:33:32causality that go unnoticed that

00:33:35can only really be uncovered by massive

00:33:38analytics so I guess the ultimate

00:33:41prize there is the cure for cancer so

00:33:46I guess we have a small team we’re

00:33:47actually located in almost the second

00:33:51town of Denmark we're 18 people I think

00:33:55now and all of them have worked at

00:33:59big international companies

00:34:02and basically chosen to come to work

00:34:04with us for measly salaries and living

00:34:07in the suburbs because we’re so excited

00:34:10about the promise of assisting science

00:34:13we have no Danish clients we all work

00:34:15with international publishers and yes

00:34:20we are hiring and so feel free to apply

00:34:24we're growing right now and would

00:34:28love to receive applications from you

00:34:31guys so I think that concludes my speech

00:34:35and I’d love to answer questions there’s

00:34:36a ton of detail that I left out so if

00:34:40you have any questions there are really

00:34:44many questions you've been busy

00:34:46asking questions so

00:34:49the first one is is clickstream analysis

00:34:51used to analyze behavioral data such as

00:34:54hyperlinks between articles and do you

00:34:57use spark for this yes I think we do use

00:35:02spark so a confession even though I

00:35:05grew up with a computer and coded

00:35:08demos on my c64 in my parents

00:35:12bedroom in the 80s I actually do not

00:35:15work as a developer in our company I'm

00:35:16one of the founders and I sell the

00:35:18vision so I can actually answer

00:35:21accurately we

00:35:22do look at clickstream data but mostly

00:35:25it's limited to profile

00:35:28building not sort of session analysis

00:35:31because there's a lot of noise

00:35:35and people get distracted so if you have

00:35:38subsequent clicks through a corpus it

00:35:40really just tells you

00:35:43something about what the user is

00:35:44interested in not necessarily that the

00:35:47things that they click on are related

00:35:49because people get distracted so yes

00:35:52we use clicks but not really streams

00:35:56and if you do clickbait isn't

00:36:00that manipulation all right we

00:36:03were actually asked to do this so yeah

00:36:06so I think you're always

00:36:09when you’re working with big

00:36:10corporations you have different layers

00:36:13of management and they have this

00:36:14different sort of key performance

00:36:17indicators and the people that work

00:36:20in the front end would like to see a

00:36:22feature used so you need to optimize the

00:36:25data for a feature to be used and

00:36:28I guess the reason I

00:36:31can still fall asleep at night is that I

00:36:34think what we’re doing is vastly

00:36:36superior to the traditional sort of

00:36:39co-download statistics that are used in

00:36:41science normally the things that get

00:36:44recommended across scientific publishers

00:36:46are the things that other people

00:36:48downloaded in the same session and I think

00:36:52one of the biggest problems with that

00:36:54just to do a little diversion here is

00:36:56that when you only look at behavioral

00:36:59data that you have absolutely no way of

00:37:01recommending that new article that came

00:37:03out yesterday because you have no

00:37:05behavioral data attached to it and it’s

00:37:07what we call the cold start problem

00:37:09unless you can identify that this

00:37:11article is very similar to this other

00:37:13article that has behavioral data you can

00:37:16actually not make a

00:37:18recommendation until by accident people

00:37:21stumble across it and you know who

00:37:23actually did something with it so I

00:37:26think what we do here obviously this is

00:37:29a Jekyll and Hyde thing but the best

00:37:31solution is always a combination of the

00:37:33two factors
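A minimal sketch of that combination, with made-up vectors and weights (illustrative only, not the actual system): content similarity covers the cold start, and behavioral co-download counts add weight once they exist.

```python
import math

def cosine(a, b):
    # Cosine similarity between two sparse term-weight dicts.
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def hybrid_score(seed_vec, cand_vec, co_downloads, alpha=0.5):
    # Blend content similarity with a squashed co-download count (0..1).
    behavioral = co_downloads / (1 + co_downloads)
    return alpha * cosine(seed_vec, cand_vec) + (1 - alpha) * behavioral

seed = {"serum": 0.7, "albumin": 0.5, "water": 0.3}
old_paper = {"serum": 0.6, "albumin": 0.4}                # 120 co-downloads
new_paper = {"serum": 0.7, "albumin": 0.5, "water": 0.2}  # published yesterday, 0

# The day-old article still gets a usable score from content alone,
# where a purely behavioral recommender would score it zero.
print(hybrid_score(seed, old_paper, 120))
print(hybrid_score(seed, new_paper, 0))
```

A purely behavioral system returns nothing for the new paper until someone stumbles across it; the content term is what breaks that circularity.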

00:37:35how do you make rules for classifying

00:37:38words or phrases that are very

00:37:40domain-specific across the many

00:37:42different research domains so there are

00:37:47actually very few phrases that are

00:37:53syntactically very similar across

00:37:55domains but have very

00:37:57different meanings and

00:38:00most of that problem we've actually

00:38:02sort of circumnavigated by looking at

00:38:06longer phrases and by filtering out the

00:38:10stuff that has ambiguity so

00:38:14you will actually see that we try to not

00:38:16mention things that when mentioned alone

00:38:19can mean different things when we add an

00:38:23additional token in front of it often

00:38:26times it becomes much less ambiguous and

00:38:29we then prefer that one and that’s

00:38:32simply an algorithmic solution it's

00:38:34not something that we hard code but we

00:38:36actually look at the ones that have

00:38:38ambiguity and try to pick longer phrases

00:38:40that are supersets that include it do you

00:38:45do any kind of personalization we don’t

00:38:49have a product for personalization

00:38:51because it's a big hot potato

00:38:54in science people are really afraid of

00:38:58being tracked because they think they

00:39:00have the cure for cancer and they don’t

00:39:03want to be tracked search history is a complete

00:39:05no go for most of the clients that

00:39:08we work with so we don't have

00:39:11a product yet we think it’s incredibly

00:39:14interesting and we’d love to do it but

00:39:15we don’t have a partner to do it with

00:39:17and probably it’s going to be outside of

00:39:19science and what is the scale of data

00:39:22used in your processing how much data

00:39:24was used to train your model so

00:39:29that's another thing for the first two

00:39:30years of our startup we were trying

00:39:35to build a Google Scholar

00:39:36competitor we wanted to build a

00:39:38destination site where users could come

00:39:40search in full-text articles not see the

00:39:43full text articles but we would index

00:39:44them for publishers and then link

00:39:46out to

00:39:47real content and we spoke to many

00:39:50different scientific publishers and they

00:39:52all said that’s a brilliant idea and

00:39:54they had so many meetings with us for

00:39:56two years and they said oh here’s

00:39:58another test sample that you can have of

00:40:00our content and they said and once we’re

00:40:02ready to go you’ll have this hard drive

00:40:04with a ton of articles and it will be no

00:40:06problem everybody will be happy and then

00:40:08after two years and only a few thousand

00:40:11articles from each publisher and a ton

00:40:13of meetings where they asked about our

00:40:14technology in depth and detail we went

00:40:18out and one night I’m in London I

00:40:20remember and one of the product managers

00:40:22or it was actually at VP level in one

00:40:27of those publishers over a beer

00:40:29said you know it’s never going to happen

00:40:32they’re just keeping you close because

00:40:34they want to know what kind of

00:40:35technology you’re developing and I think

00:40:37a few months after that we pivoted into

00:40:40a different business plan where we

00:40:43provide our value in light of too little

00:40:46open access material we decided to work

00:40:50within the framework of the publishers

00:40:52and be their friends and so now what

00:40:55we're providing is services

00:40:58that are primarily focused on using

00:41:01one publisher's data to perform services

00:41:04for that one publisher's clients so

00:41:07the larger publishers have 10 to

00:41:1315 million articles some of the

00:41:15aggregators have more but most of

00:41:20our clients have less than 10 million

00:41:22documents so with each document being I

00:41:26don't know a few hundred K in plain

00:41:30ASCII so it's not crazy amounts of data

00:41:33it’s a few terabytes for a larger

00:41:35publisher so as jonathan schwartz found

00:41:39out it could easily be dumped anywhere

00:41:41on the internet but everyone would be

00:41:44sued okay
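The arithmetic there checks out; a quick back-of-envelope, taking 300 KB as a stand-in for "a few hundred K" per article (my assumption, not a figure from the talk):

```python
docs = 10_000_000  # articles at a large publisher, per the talk
avg_kb = 300       # "a few hundred K" of plain text per article (assumed)
total_tb = docs * avg_kb / 1024**3  # KB -> TB
print(f"{total_tb:.1f} TB")  # ~2.8 TB, i.e. "a few terabytes"
```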

00:41:47would it make sense to pretty print an

00:41:51article normalize it and republish it

00:41:53along with the original and do you

00:41:56have a tool for that so no we don’t we

00:42:00cannot provide access to the full text

00:42:03we work with publishers and

00:42:05it's a very tightly controlled business

00:42:10their primary business asset at

00:42:13least until open access becomes more

00:42:16dominant is the content that they own

00:42:18and control so we really can't do

00:42:21much with it except behind closed doors

00:42:24we had when we worked with Elsevier last

00:42:26year the forms we had to fill out

00:42:28for compliance and security were crazy I

00:42:31think a hundred and forty seven

00:42:34tabs in an excel sheet with a hundred

00:42:37questions in each and that was just the

00:42:40preliminary survey questions

00:42:42before they send a person over so yeah

00:42:45they’re really really crazy about

00:42:47security are you using lambda architecture

00:42:52and can you talk about that I’m not

00:42:57familiar with lambda architecture I know

00:42:59like lambda coefficients but no

00:43:04probably maybe we are who knows okay

00:43:11what is the most interesting finding you

00:43:13had in your data a cure for cancer

00:43:18that we haven't found yet and I

00:43:20guess we would have published it so

00:43:22we’re a service provider so we work with

00:43:26what the industry called subject matter

00:43:29experts or SMEs and so we have models

00:43:35where we validate the quality of what we

00:43:38do and the error rates etc they're

00:43:42all automated tests and then of course

00:43:44we run it by a panel of

00:43:46real scientists that can look at it and

00:43:50then know the content that we’ve

00:43:51processed and can tell if there’s an

00:43:53error somewhere a word that we left out

00:43:55that was important but we can’t really

00:43:59evaluate ourselves

00:44:00so we know that the scientific

00:44:03publishers we work with the editors

00:44:05there say that we have the best

00:44:08extraction algorithms that produce the

00:44:11best and most usable phrases and results

00:44:13so that's what we go by we actually

00:44:15don't know what it is being used for okay

00:44:20what about articles published in the

00:44:22public domain published on open

00:44:25platforms are you indexing and presenting

00:44:27articles from these alternative

00:44:28sources yes we are working with a couple

00:44:31of open access publishers and sorry

00:44:34about that and so the open access model

00:44:42has sort of turned publishing inside out

00:44:44where traditional

00:44:47publishers actually publish your thing

00:44:49for free as long as you sign over

00:44:51copyright for open access you have to

00:44:54pay for the peer to peer review process

00:44:56and the publishing of course that cost

00:44:58has come down a lot from a few years ago

00:45:00but you still pay around 2,000 euros to

00:45:04publish an article and that sort of puts

00:45:06a little damper on the growth of open

00:45:09access but but we do work with some of

00:45:13the open access providers and we have

00:45:15this idea when we started our company

00:45:16that we would just aggregate all of open

00:45:18access and that's fine good luck if you

00:45:22want to try because the only people that

00:45:24have succeeded in doing anything vaguely

00:45:26resembling that are just aggregating the

00:45:29metadata because it turns out that

00:45:30people publish their articles

00:45:35in a gazillion different formats on a

00:45:38gazillion different websites where

00:45:39sometimes the download button is behind

00:45:41some kind of I'm not a robot captcha and

00:45:44it’s really really hard to get at the

00:45:47content it's the biggest mistake that

00:45:49the open access community has made is

00:45:51not agreeing on some submission standard

00:45:53that allows that data their text

00:45:55to be mined and I just don’t see why no

00:45:58one has come up and said this is how you

00:46:00do it this is the format give us a JATS

00:46:03XML file right here on an ftp server

00:46:07dump it there and and let the community

00:46:09do the rest but it hasn’t been done so

00:46:12it's not

00:46:15a task for startups it's

00:46:15incredibly time-consuming to deal with

00:46:17thousands of different submission

00:46:19formats and PDFs I mean you may think

00:46:22PDF is a nice format but it just turns

00:46:24out that sometimes the renderer will

00:46:27swap the order of sentences around

00:46:31and it’s impossible to figure out which

00:46:34sentence continues where or you

00:46:37don’t want to know so so we we have to

00:46:40have someone else take care of that and

00:46:42then we can do open access

00:46:44in a few years do you have some kind of

00:46:48best practice to run a deduplication

00:46:50process where different deep learning

00:46:52methods could be applied I’m not sure I

00:46:56understand the question but we do have

00:46:58so that’s the key value add and I’m

00:47:02sorry I can't share the source code for

00:47:04free we're trying to build a business if

00:47:07you want to work with it you should come

00:47:08to us we do have like the pipeline that

00:47:12we’re building is about this and it’s

00:47:13iterative we pipe stuff in that we’ve

00:47:17learned elsewhere and we basically

00:47:20work internally in the team

00:47:24we write white papers we give talks

00:47:28to each other and it’s a wonderful set

00:47:30up please come to Unsilo does this apply

00:47:35well to computer science papers oh yes

00:47:38arXiv we've indexed arXiv once

00:47:43but we haven’t set it up for re-indexing

00:47:45and I think we should it’s the whole eat

00:47:48your own dog food thing so we should get

00:47:52that up and running again when we get

00:47:55around to it right we have these other

00:47:57jobs that pay money that we have to do

00:47:59first does your technology work

00:48:04for languages other than English no we

00:48:08haven’t found anyone willing to pay for

00:48:10it yet most of what we do can be

00:48:15transferred to other languages and

00:48:19I'm not myself fluent in German but I think

00:48:22possibly there are some rules that would

00:48:24have to be adapted

00:48:25for their grammar but there’s nothing

00:48:29basically preventing it from being

00:48:31ported to other languages we've

00:48:34been asked to do Chinese for IP analysis

00:48:38that is patent analysis but the tools that

00:48:43everyone else is using are basically some

00:48:46kind of auto translation and then

00:48:48applying text analytics afterwards which

00:48:51is probably inferior but makes more

00:48:54sense on a cost perspective

00:48:56unfortunately I think that's it a lot

00:49:00of questions thanks for that and let’s

00:49:01say thank you to Mads thank you