"Extraction and formalization of knowledge from text"
Leader: Claire Nédellec
The Bibliome team develops Natural Language Processing (NLP) and Machine Learning (ML) methods to extract information from text in the biology domain.
We work on specific information extraction (IE) tasks such as entity recognition, entity normalization (linking) and relation extraction. We focus on methods that combine linguistic information, ML and domain knowledge (ontologies and taxonomies) and are able to handle a small number of training examples.
We apply our methods to a wide range of biological applications, from microbial diversity to plant biology and epidemiological surveillance.
An important part of our activity is also to promote the development and evaluation of IE systems by organizing shared tasks.
Omnicrobe: development of a database of information on microbial habitats and phenotypes from text. CRD ANSES. 2022-2023.
HoloOligo: Structure diversity, functionality and modulation of milk oligosaccharides in monogastric livestock species: towards optimal development of rabbit and pig holobionts. Project ANR-21-CE20-0045 - Biology of animals, photosynthetic organisms and microorganisms.
TIERS - ESV. Traitement de l’Information et Expertise des Risques Sanitaires pour l’Epidémiosurveillance en Santé Végétale (Information Processing and Expertise on Health Risks for Epidemiological Surveillance in Plant Health). IB2021 Departments INRAE MathNum and SPE.
TyDI: Terminology Design Interface. DiBiSO, Université Paris-Saclay, INIST-CNRS, BIA-INRAE and MaIAGE-INRAE. 2021-2023.
Beyond - ANR Priority Research Programme "Cultiver et protéger autrement" (Growing and Protecting Crops Differently). Building epidemiological surveillance and prophylaxis with observations both near and distant. Project IA-20-PCPA-0002.
D2KAB: Data to Knowledge in Agriculture and Biodiversity. ANR AAPG 2018. Challenge: B.7. Axis: 4.
ENovFood: Linking a phenotypic and a network food microbe database: an application to food microbial ecology and food innovation. Metaprogramme MEM INRA. 2018-2020.
OntoBedding. Improving word embeddings with ontologies to adapt them to specialized domains, with LIMSI. Project funded by DIM RFSI. 2019.
Visa TM (Towards an advanced infrastructure in text-mining), CoSO project (2017-2019)
OpenMinTeD (Open Mining Infrastructure for Text and Data) Infrastructure H2020 project (2015-2018)
D-ONT, Optimized exploitation of phenotypic databases - Ontologies for information sharing, ACI Phase 2016-2018.
IMSV, Institut de modélisation des systèmes vivants (Institute for Modelling Living Systems), Lidex of Université Paris-Saclay (2014-2016)
SeeDev, Regulations in the development of Arabidopsis thaliana seed (Challenge Lidex CDS) (2015)
OntoBiotope: Metaprogramme INRA MEM (Metagenomics of microbial ecosystems). (2012-2013).
Triphase: Semantic information system for publications in animal physiology and agricultural systems. PHASE department (2013-2014).
Quaero: Automatic multimedia content processing. Oséo. (2008-2013).
FSOV SAM Blé: Selection of wheat by genetic markers. Fonds de soutien à l'obtention végétale (2010-2013).
Working group Labex DigiCosme D2K (from Data to Knowledge)
BioNLP-Open Shared Task 2019: annotated corpora and online evaluation services
BioNLP-Shared Task (2011, 2013, 2016): annotated corpora and online evaluation services
LLL, Learning Language in Logic (2005)
Claire Nédellec, Principal Investigator, head of the Bibliome group
Robert Bossy, Permanent position, Research Engineer, coordinator of the Alvis suite
Louise Deléger, Permanent position, Researcher
Arnaud Ferré, Permanent position, Researcher
Anfu Tang, PhD student
Mariya Borovikova, PhD student
Elisa Lubrini, R&D
Clara Sauvion, R&D
Mouhamadou Ba, Postdoc, OpenMinTeD project
Estelle Chaix, Postdoc, OpenMinTeD project
Philippe Bessières, Research Director
Dialekti Valsamou, PhD student, IDEX IDI
- Alvis NLP/ML is a pipeline for the semantic annotation of text documents. It integrates Natural Language Processing (NLP) tools for sentence and word segmentation, named-entity recognition, term analysis, semantic typing and relation extraction. These tools rely on resources such as terminologies or ontologies to adapt to the application domain. Alvis NLP/ML contains several tools for the (semi-)automatic acquisition of these resources using Machine Learning (ML) techniques. New components can be easily integrated into the pipeline. Part of this work has been funded by the European project Alvis and the French project Quaero. See Nédellec et al., Handbook on Ontologies, 2009, for an overview.
- AlvisAE (Alvis Annotation Editor) is an online annotation editor for the collaborative editing and visualisation of annotations of entities, relations and groups. It includes a workflow for annotation campaign management. The annotations of the text entities are defined in an ontology that can be revised in parallel. AlvisAE also includes a tool for the detection and resolution of annotation conflicts. Part of this work has been funded by the European project Alvis and the French project Quaero. See Bossy et al., LAW VI 2012, for more details.
- AlvisIR (Alvis Information Retrieval) is an online generic semantic search engine; only a few hours are needed to create a new instance for a given document collection and ontology. A user query expressed with ontology concepts retrieves all documents that mention those concepts, whether as more specific concepts or as synonyms. The AlvisIR semantic search engine also handles relational queries. See, for example, the search on biotopes of microorganisms. Part of this work has been funded by the European project Alvis and the French project Quaero.
- BioYaTeA is an extension of the YaTeA term extractor that deals with prepositional attachments and adjectival participles. It extracts terms from documents in French and in English. Its distribution includes post-filtering of irrelevant terms. It is publicly available as a CPAN module. Part of this work has been funded by the European project Alvis and the French project Quaero. See Golik et al., CICLing 2013, for more details.
- TyDI (Terminology Design Interface) is a collaborative tool for the manual validation and structuring of terms, either originating from terminologies or extracted from a training corpus of text documents. It is used on the output of term extractors (such as BioYaTeA), which identify candidate terms (e.g. compound nouns). With TyDI, a user can validate candidate terms and specify synonymy and hyperonymy relations. These annotations can then be exported in several formats and used in other natural language processing tools. Part of this work has been funded by the French project Quaero. See Golik et al., EKAW 2010, for more details.
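The ontology-based retrieval described for AlvisIR above (a query concept matching documents through its more specific concepts and synonyms) can be illustrated with a minimal Python sketch. This is not AlvisIR's implementation or API: the toy ontology, the documents, and the function names `expand` and `search` are all hypothetical, and plain substring matching stands in for real linguistic analysis.

```python
# Hypothetical toy ontology: concept -> direct, more specific sub-concepts.
SUBCONCEPTS = {
    "dairy product": ["cheese", "milk"],
    "cheese": ["soft cheese"],
}

# Hypothetical synonyms attached to concepts.
SYNONYMS = {
    "dairy product": ["milk product"],
}

# Hypothetical document collection: id -> text.
DOCUMENTS = {
    "d1": "Listeria was isolated from soft cheese.",
    "d2": "Strains found in milk product samples.",
    "d3": "Bacteria living in marine sediment.",
}


def expand(concept):
    """Return the labels of a concept, its descendants, and their synonyms."""
    labels = set()
    visited = set()
    stack = [concept]
    while stack:
        current = stack.pop()
        if current in visited:
            continue
        visited.add(current)
        labels.add(current)
        labels.update(SYNONYMS.get(current, []))
        stack.extend(SUBCONCEPTS.get(current, []))
    return labels


def search(concept, documents):
    """Return sorted ids of documents mentioning any expanded label."""
    labels = expand(concept)
    return sorted(
        doc_id
        for doc_id, text in documents.items()
        if any(label in text.lower() for label in labels)
    )


print(search("dairy product", DOCUMENTS))  # ['d1', 'd2']
```

In this sketch a query on "dairy product" retrieves one document through a sub-concept ("soft cheese") and another through a synonym ("milk product"), even though neither contains the query label itself.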
Semantic search engines based on the AlvisIR technology
- The Biotope relational search engine indexes all PubMed references on the habitats and phenotypes of microorganisms (2.3 million references) with Alvis Suite technology and the OntoBiotope ontology. Funded by OpenMinTeD, the Quaero project and the MEM metaprogramme.
- SamBlé indexes a large set of references on genetic markers and phenotypes in bread wheat with Alvis Suite technology and the Wheat Trait Ontology. Funded by the FSOV SamBlé project and OpenMinTeD.
- SeeDev indexes a large set of references on the molecular mechanisms involved in seed development using Alvis Suite technology. Supported by the UPSay CDS and IMSV projects and OpenMinTeD.
- TriPhas’IR indexes the publications of the PHASE scientific department (2010-2014) with the TriPhase termino-ontology.
- AnimalIR indexes Animal Journal articles with the ATOL ontology
- Omnicrobe is an online database that integrates information on microbe habitats and phenotypes from articles and databases, BRC, and genetic databases.