Building a repository of biomedical ontologies with Neo4j

Simon Jupp, Samples, Phenotypes and Ontologies Team, European Bioinformatics Institute, Cambridge, UK

Outline Why we care about ontologies in biology Why we need a repository of ontologies Building a new Ontology Lookup Service (OLS) at the EBI Index OWL ontologies in Neo4j OLS Infrastructure Challenges with Neo4j Neo4j and Linked Open Data

What is EMBL-EBI? Part of the European Molecular Biology Laboratory International, non-profit research institute Europe’s hub for biological data services and research Based in Hinxton, Cambridge

Data resources at EMBL-EBI
Genes, genomes & variation: Ensembl, Ensembl Genomes, European Nucleotide Archive, 1000 Genomes, European Genome-phenome Archive, Metagenomics portal
Gene, protein & metabolite expression: ArrayExpress, Expression Atlas, MetaboLights, PRIDE
Protein sequences, families & motifs: InterPro, Pfam, UniProt
Molecular structures: Protein Data Bank in Europe, Electron Microscopy Data Bank
Chemical biology: ChEMBL, ChEBI
Reactions, interactions & pathways: IntAct, Reactome, MetaboLights
Systems: BioModels, Enzyme Portal, BioSamples
Literature & ontologies: Europe PubMed Central, Gene Ontology, Experimental Factor Ontology

Biological data heavily interlinked [Figure: genome, proteome and metabolome data from tissue samples, measured by mRNA array, miRNA array, antibody array, CE-MS and LC-MS/MS (spectrum of intensity against m/z), linked to pathways, protein interactions and drug targets]

We have a lot of data silos A lot of public data Heterogeneous semantics, formats, identifiers EBI and other institutes invest heavily in cross-linking resources

We need terminology standards Canine Dog Different Words Same Concept

One Identity for each entity Mouse or Mus or mice = NCBITaxon_10088 …but not all mice are equal

Building ontologies Put things into categories Helps organise the data Allows us to generalise over data Capture the relations between things Anatomical parts Biopolymer Nucleic Acid Polypeptide Enzyme DNA RNA tRNA mRNA smRNA

Web Ontology Language (OWL) W3C standard vocabulary for describing ontologies OWL is based on a description logic We can use it to describe sets of things based on their properties A subclassOf B implies all things of type A are also things of type B “heart” part-of “Cardiovascular System” Powerful knowledge representation ‘mitochondrial chromosome’ ‘equivalent to’ chromosome and ‘part of’ some mitochondrion

Using a DL reasoner to infer classification Relatively flat asserted view Inferred polyhierarchy OWL reasoner

Ontologies for life sciences
Areas covered: Genotype, Phenotype, Sequence, Proteins, Gene products, Transcript, Pathways, Cell type, BRENDA tissue / enzyme source, Development, Anatomy, Plasmodium life cycle
Sequence and gene products: Sequence types and features; Genetic Context; Molecule role; Molecular Function; Biological process; Cellular component; Protein covalent bond; Protein domain; UniProt taxonomy
Pathways and interactions: Pathway ontology; Event (INOH pathway ontology); Systems Biology; Protein-protein interaction
Development: Arabidopsis development; Cereal plant development; Plant growth and developmental stage; C. elegans development; Drosophila development (FBdv)
Anatomy: Human developmental anatomy, abstract version; Human developmental anatomy, timed version; Mosquito gross anatomy; Mouse adult gross anatomy; Mouse gross anatomy and development; C. elegans gross anatomy; Arabidopsis gross anatomy; Cereal plant gross anatomy; Drosophila gross anatomy; Dictyostelium discoideum anatomy; Fungal gross anatomy (FAO); Plant structure; Maize gross anatomy; Medaka fish anatomy and development; Zebrafish anatomy and development
Phenotype and disease: NCI Thesaurus; Mouse pathology; Human disease; Cereal plant trait; PATO (attribute and value); Mammalian phenotype; Habronattus courtship; Loggerhead nesting; Animal natural history and life history; eVOC (Expressed Sequence Annotation for Humans)

We do a lot of tagging CL:CL_0000071 (blood vessel endothelial cell) obo:CHEBI_39867 (valproic acid) NCBITaxon:NCBITaxon_9606 (Homo sapiens)

Ontologies add value Smarter searching Data visualisation Data analysis Data integration

Summary so far… Ontologies provide a “semantic glue” for integrating biological data There are a lot of ontologies about The biological community needs ontology infrastructure and services Ontologies can be complex Ontologies can be big Ontologies can change

Ontologies as Graphs OWL ontologies aren’t graphs, but… … can be represented as an RDF graph … people want to use them as graphs Plenty of RDF databases around But incomplete w.r.t. OWL semantics SPARQL is an acquired taste

Ontology repository use-cases Search for ontology terms labels, synonyms, descriptions Querying the structure Get parent/child terms Querying transitive closure Get ancestor/descendant terms Querying across relations Partonomy or development stages A graph database and search index should satisfy these requirements
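The difference between the structure queries (parent/child) and the transitive closure queries (ancestor/descendant) can be sketched in a few lines of plain Java. This is only an illustration over an in-memory map, not OLS code: the class name, method names and the example hierarchy in the usage note are assumptions made for this sketch.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical in-memory stand-in for the ontology graph; not OLS code.
final class TermHierarchySketch {
    // term -> direct parents (a term may have several parents: a polyhierarchy)
    private final Map<String, List<String>> parents = new HashMap<>();

    void addParents(String term, String... directParents) {
        parents.put(term, Arrays.asList(directParents));
    }

    // Structure query: direct parents only.
    List<String> parentsOf(String term) {
        return parents.getOrDefault(term, Collections.emptyList());
    }

    // Transitive closure query: every ancestor, found by a breadth-first walk.
    Set<String> ancestorsOf(String term) {
        Set<String> seen = new LinkedHashSet<>();
        Deque<String> queue = new ArrayDeque<>(parentsOf(term));
        while (!queue.isEmpty()) {
            String current = queue.removeFirst();
            if (seen.add(current)) {          // skip terms already visited
                queue.addAll(parentsOf(current));
            }
        }
        return seen;
    }
}
```

With an illustrative chain mRNA, RNA, Nucleic Acid, Biopolymer, parentsOf("mRNA") returns only RNA, while ancestorsOf("mRNA") returns all three terms above it. A graph database answers the second kind of query without the application walking the hierarchy itself.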

The old Ontology Lookup Service EBI has been hosting a repository of over 100 biomedical ontologies for the past 10 years SOAP services for programmatic access Up to 25 million requests per month (mostly API). http://www.ebi.ac.uk/ontology-lookup

Why we need a new OLS Old codebase (10+ years in places) Updated to work with OWL (not OBO) Uses Oracle RDBMS and SQL for querying ontology structure (suboptimal) Ditch SOAP/XML in favour of REST/JSON

OLS 3.0 Rebuilt from scratch Polls ontologies by URL Server side checksum for detecting changes in files Uses Java OWL API for loading (still supports OBO) Infer relations with reasoner RESTful API built with Spring Data Multiple indexes for scalable querying SOLR server – text queries Embedded Neo4j – graph queries (drives REST API) Virtuoso server – SPARQL for Advanced users

OLS 3 beta is now live http://www.ebi.ac.uk/ols/beta/ 140 ontologies Neo4j version 2.2 Runs in embedded mode Inside Tomcat container 7 million nodes 11 million edges ~10 GB on disk Generic ontology infrastructure Can load any OWL or SKOS file Built with standard technologies Solr, Neo4j, Spring IO, Thymeleaf, Bootstrap, jQuery Includes stand-alone Spring Boot app for loading ontologies into Neo4j Open-source project https://github.com/EBISPOT/OLS

REST API
Search across any field in one or more ontologies (SOLR)
/search
Get ontology and term metadata (Neo4j)
/ontologies
/ontologies/{name}
/ontologies/{name}/terms
/ontologies/{name}/terms/{termid}
Get related terms and navigate ontology structure (Neo4j)
/ontologies/{name}/terms/{termid}/parent
/ontologies/{name}/terms/{termid}/children
/ontologies/{name}/terms/{termid}/descendants
/ontologies/{name}/terms/{termid}/ancestors
/ontologies/{name}/terms/{termid}/{relation} e.g. part_of
Get JSON for common visualisation libraries (Neo4j)
/ontologies/{name}/terms/{termid}/tree
/ontologies/{name}/terms/{termid}/graph
http://www.ebi.ac.uk/ols/beta/api
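A client might assemble these resource paths along the following lines. This is a minimal sketch: the class and method names are invented for illustration, and the slides do not say what form {termid} takes, so the sketch simply URL-encodes whatever identifier it is given (a short form or a full IRI). Check the API documentation for the encoding the service actually expects.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper for building OLS resource paths; not part of OLS itself.
final class OlsPathSketch {
    // Base URL taken from the slide above.
    static final String BASE = "http://www.ebi.ac.uk/ols/beta/api";

    // Percent-encode one path segment so characters such as ':' and '/' survive.
    static String encode(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always available", e);
        }
    }

    // suffix is e.g. "children", "ancestors", "graph", or "" for the term itself.
    static String termPath(String ontology, String termId, String suffix) {
        String path = BASE + "/ontologies/" + encode(ontology) + "/terms/" + encode(termId);
        return suffix.isEmpty() ? path : path + "/" + suffix;
    }
}
```

For example, termPath("uberon", "UBERON_0000948", "ancestors") yields the ancestors resource for that identifier under the uberon ontology. The ontology name and identifier form used here are assumptions for the example.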

OWL to Neo4j schema Label every node by type (e.g. class, property or individual) and ontology id Label every relation by name Include an additional index for “special relations” like partonomy and subsets

Nightly Neo4j build process Nightly crawl of all >140 registered ontologies Download file and create checksum If the file is new: drop ontology from Neo4j index, use the Java OWL API and reasoner to classify ontology (get the inferred classification), then use Neo4j BatchInserter to update the Neo4j index
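The checksum step can be sketched as below. The slides do not say which hash OLS uses or where checksums are stored, so SHA-256 and the in-memory map are assumptions, as are the class and method names.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.Map;

// Hypothetical change detector; OLS may store and compare checksums differently.
final class ChangeDetectorSketch {
    // ontology id -> checksum of the file seen on the previous crawl
    private final Map<String, String> lastChecksum = new HashMap<>();

    // Hex-encoded SHA-256 digest of the downloaded file (assumed hash choice).
    static String sha256Hex(byte[] content) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            StringBuilder hex = new StringBuilder();
            for (byte b : digest.digest(content)) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 should be available on every Java platform", e);
        }
    }

    // True when the ontology has not been seen before or its file has changed.
    // Records the new checksum as a side effect.
    boolean needsReload(String ontologyId, byte[] downloadedFile) {
        String checksum = sha256Hex(downloadedFile);
        String previous = lastChecksum.put(ontologyId, checksum);
        return !checksum.equals(previous);
    }
}
```

Only ontologies for which needsReload returns true would go on to the drop, classify and batch-insert steps, which is consistent with most nightly updates being much shorter than a full rebuild.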

OLS 3.0 Infrastructure 2 x Load balanced Tomcat servers Two data centers Data center 1 (8GB VM) Data center 2 (8GB VM)

Why Neo4j? Our primary use-case required a graph store OWL mapping to RDF graph is complex (lots of blank nodes) We wanted Spring Data and Spring Data Rest Less code for us to maintain Didn’t want to write our own DAO using SPARQL (We’ve tried this on another project) We wanted something that we could rely on with community behind it Neo4j was quick to pick up 1 day GraphAware course 4 months ago Working pilot for new OLS + Neo4j 1 month later

Powerful yet simple queries Get the transitive closure for “heart” following parent and partonomy relations from the UBERON anatomy ontology
MATCH path = (n:Class)-[r:SUBCLASSOF|RelatedTree*]->(parent)<-[r2:SUBCLASSOF|RelatedTree]-(sibling:Class)
WHERE n.ontology_name = {0} AND n.iri = {1}

Generating visualisations
MATCH path = (n:Class)-[r:SUBCLASSOF|Related]-(parent)
WHERE n.ontology_name = {0} AND n.iri = {1}
RETURN {nodes: collect(distinct {iri: p.iri, label: p.label}), edges: collect(distinct {source: startNode(r1).iri, target: endNode(r1).iri, label: r1.label, uri: r1.uri})} as result
Generating common JSON representations directly from Cypher is very powerful

Challenges Wanted to utilise Spring for our REST API We had a REST resource hierarchy that we wanted
api/ontologies/{name}/terms/{termid}/parents
api/ontologies/{name}/terms/{termid}/children
Too hard to get this to work using just an object model and SDN alone No matter what we tried always ended up sending Neo4j into a spin
@NodeEntity
@TypeAlias(value = "Class")
public class Term {
    @RelatedToVia(direction = Direction.OUTGOING, type = "SUBCLASSOF")
    @Fetch
    Set<Term> parents;

    @RelatedToVia(direction = Direction.INCOMING, type = "SUBCLASSOF")
    @Fetch
    Set<Term> children;
}

…but it was easy enough to achieve what we wanted with some Spring magic Repository interface with custom Cypher Define our own controllers Custom Resource Assemblers for HAL links

Challenges We need dynamic fields Neo4j is driving the REST API Each ontology term has metadata where we don’t know the field names up front (e.g. ‘created by’ or ‘comment’) To get the right set of dependencies we currently use SDN 3.4.0 Dynamic fields not supported in SDN 4.0 We are forced to run in embedded mode Is this true? Scaling tips for running inside Tomcat please

Challenges Full index rebuild takes up to 20 hours Most nights the update runs in ~2 hours We have one master Neo4j db If an ontology needs updating we take it out and then reload Built on machine with 128GB memory + SSD There’s always a chance we might trash the entire index We’d like to build an index for each ontology independently. Have a final stage where we merge all the successfully built indexes Other suggestions?
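The "build each ontology independently, then merge the successes" idea can be sketched as follows. This is an illustration of the control flow only: the Builder interface, the use of plain maps in place of Neo4j stores, and all names are assumptions, and merging real Neo4j indexes would involve far more than a map copy.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical build-then-merge flow; maps stand in for per-ontology indexes.
final class IsolatedBuildSketch {
    // Builds the index for one ontology; may fail (bad file, reasoner error, ...).
    interface Builder {
        List<String> build(String ontologyId) throws Exception;
    }

    // Returns a new master in which only successfully built ontologies are replaced.
    static Map<String, List<String>> buildAll(List<String> ontologyIds,
                                              Builder builder,
                                              Map<String, List<String>> master) {
        Map<String, List<String>> built = new HashMap<>();
        for (String id : ontologyIds) {
            try {
                built.put(id, builder.build(id));   // each build is isolated
            } catch (Exception e) {
                // A failed build is skipped; the previous copy stays in the master.
                System.err.println("Skipping " + id + ": " + e.getMessage());
            }
        }
        Map<String, List<String>> merged = new HashMap<>(master); // master is never mutated
        merged.putAll(built);                                     // final merge stage
        return merged;
    }
}
```

Because the existing master is copied rather than modified, a failure in any single build cannot trash the whole index, which is the risk the slide describes.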

Things we’d like to do Extract subsets from a graph Some nodes are tagged as being in a subset Helps to give a broad overview of annotated datasets May require us to infer relations Master graph Extracted subset graph
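One possible reading of "may require us to infer relations" is that a subset term should be linked to its nearest tagged ancestor even when untagged terms sit in between. The sketch below implements that reading over an in-memory parent map. It is an assumption about the intended behaviour, not a description of OLS, and all names are made up for the example.

```java
import java.util.ArrayDeque;
import java.util.Collections;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical subset extraction over a term -> direct parents map.
final class SubsetExtractorSketch {
    // For each tagged term, find the first tagged ancestor on every upward path.
    static Map<String, Set<String>> extract(Map<String, List<String>> parents, Set<String> subset) {
        Map<String, Set<String>> inferred = new TreeMap<>();
        for (String term : subset) {
            Set<String> nearest = new TreeSet<>();
            Set<String> seen = new HashSet<>();
            Deque<String> queue =
                new ArrayDeque<>(parents.getOrDefault(term, Collections.emptyList()));
            while (!queue.isEmpty()) {
                String current = queue.removeFirst();
                if (!seen.add(current)) {
                    continue;                     // already visited via another path
                }
                if (subset.contains(current)) {
                    nearest.add(current);         // tagged: record and stop climbing here
                } else {
                    queue.addAll(parents.getOrDefault(current, Collections.emptyList()));
                }
            }
            inferred.put(term, nearest);
        }
        return inferred;
    }
}
```

If only mRNA and Biopolymer are tagged in an illustrative chain mRNA, RNA, Nucleic Acid, Biopolymer, the extracted graph contains a single inferred edge from mRNA to Biopolymer.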

Calculating shortest paths? [Figure: three annotated gene sets] B cells: IGJ, IGHA1, LRRN3, SYT11, DSC1, SVIL, IGLC3, DPP4, MAN1C1 liver cancer: GNA01, CEP57, ASB1, PNPLA4, FA2H, NR4A1, IFNA2, TNPO1 epithelial cells: DST, FBLN1, BCL2, WDR1, METTL7A, CYB561, FGFR2, SPARC, EMC1 Where do these nodes intersect? How can we enrich these datasets using the ontologies?
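A shortest path between two annotated nodes can be found with a breadth-first search, sketched here over an undirected in-memory graph. The node names in the usage note are placeholders rather than real annotations, and in practice Cypher's built-in shortestPath function would probably be the first thing to try inside Neo4j.

```java
import java.util.ArrayDeque;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.LinkedList;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical undirected graph of genes and ontology terms; not OLS code.
final class ShortestPathSketch {
    private final Map<String, Set<String>> neighbours = new HashMap<>();

    // Add an undirected edge between two nodes.
    void link(String a, String b) {
        neighbours.computeIfAbsent(a, k -> new LinkedHashSet<>()).add(b);
        neighbours.computeIfAbsent(b, k -> new LinkedHashSet<>()).add(a);
    }

    // Breadth-first search: returns one shortest path, or an empty list if none exists.
    List<String> shortestPath(String from, String to) {
        Map<String, String> previous = new HashMap<>(); // node -> node it was reached from
        Deque<String> queue = new ArrayDeque<>();
        previous.put(from, null);
        queue.addLast(from);
        while (!queue.isEmpty()) {
            String current = queue.removeFirst();
            if (current.equals(to)) {
                LinkedList<String> path = new LinkedList<>();
                for (String step = to; step != null; step = previous.get(step)) {
                    path.addFirst(step);            // walk back to the start
                }
                return path;
            }
            for (String next : neighbours.getOrDefault(current, Collections.emptySet())) {
                if (!previous.containsKey(next)) {  // not visited yet
                    previous.put(next, current);
                    queue.addLast(next);
                }
            }
        }
        return Collections.emptyList();
    }
}
```

Linking geneA to term1, term1 to term2 and term2 to geneB, then asking for the path from geneA to geneB, returns the four nodes in order. The intermediate terms on such a path are candidates for where two datasets intersect.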

Recap The EBI Ontology Lookup Service provides access to the ontologies for biological researchers and database curators Main priority is providing a scalable API for external services to develop against Pilot of Neo4j quickly turned into our primary index for driving the REST API There is no one-size-fits-all solution for the backend, always some compromise So we make the most of frameworks like Spring Data Solr and Spring Data Neo4j to make creating multiple indexes simpler Neo4j has been easy to get to grips with and scaled well for our setup with pretty much out-of-the-box configuration

A word on Linked Data We have many years’ experience working with RDF and Semantic Web technologies The EBI RDF platform: EBI data that has been converted to RDF (billions of triples) The ontologies and the data in one big federated graph http://www.ebi.ac.uk/rdf - powerful data integration platform Semantic Web technologies have struggled to get mainstream adoption Reasons: Hype, Complexity, Baggage, Poor implementations Remain relevant in the life sciences A lot of public data out there that needs to be integrated

Life sciences rely on Linked Open Data Linked data is a rebranding of the Semantic Web Core principles address our data integration needs Use URIs to identify things Type things with ontology terms Make sure URIs resolve (self describing documents) Link documents together We see some major wins if Neo4j was more linked data friendly This doesn’t have to mean supporting SPARQL A general feeling of tension between Neo4j and the RDF community

Final thoughts – Neo4j and JSON-LD? A lot of frameworks now make it trivial to produce good APIs What’s currently missing is how to integrate data from two or more independent APIs Hard to crawl independent datasets for connections without a human to interpret semantics Still a need to express a schema alongside the data W3C standards like RDF/RDFS/SKOS/OWL provide the basic vocabularies and semantics for expressing data schemas JSON-LD is bridging the gap from JSON to RDF

Be open We are committed to making life science data public and freely available Likewise the tools and software we develop to work with the data are open We always strive to use products that are open and freely available We can only use Neo4j while it continues to be made available in this model Vendor lock-in for our products is very bad for us Graph databases have great potential for biology But we need open standards for these databases

Acknowledgements Samples, Phenotypes and Ontologies Team - Tony Burdett, James Malone, Dani Welter, Catherine Leroy, Sira Sarntivijai, Ilinca Tudose, Helen Parkinson Matt Pearce – Flax (BioSOLR project) Michal Bachman and GraphAware team Funding European Molecular Biology Laboratory (EMBL) European Union projects: DIACHRON, BioMedBridges and CORBEL
