We conduct research on natural language processing, both written and spoken, to build tools for the automated processing of linguistic content in multilingual environments or in settings where human language is the preferred form of interaction. The technologies we develop enable:
- Mass analysis of texts to extract opinions, sentiment and data, for the purposes of building user-profiling systems and generating hybrid recommendations, as well as grouping and categorising textual content.
- Proofreading and style guides, both for native speakers and for those learning a second language.
- Development of standardisation systems for texts, filtering/moderation of content and the automatic generation of content and summaries.
- Automatic translation between two languages and cross-language information retrieval.
- Synthesis of a bilingual Catalan-Spanish voice with natural levels of expressiveness, based on the Cereproc© synthesis engine.
- Processing of sign language and the development of applications with integrated signing avatars.
Natural language processing
We research and develop robust, portable technologies in the field of natural language processing; specifically, we focus on semantic annotation, named-entity recognition and classification (NERC), language modelling, semantic analysis, grouping and classification, and factuality analysis.
These technologies study, model and characterise texts using both linguistic and statistical approaches. The former is based on an understanding of language through rules, dictionaries, ontologies and the like; i.e. understanding the dependencies and relationships between words. The latter, in contrast, infers knowledge by learning from examples. A hybrid approach combines the advantages of both, enabling us to “understand” automatically or semi-automatically what a set of texts is saying, who is saying it and how they are saying it. In other words, structured information can be extracted from unstructured text.
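As a rough illustration of the contrast described above, the following minimal sketch classifies the polarity of short texts. It is not the team's system: the lexicon, the negation rule, the training sentences and every name in it (`LEXICON`, `NEGATORS`, `NaiveBayes`, `hybrid_predict`) are hypothetical, and naive Bayes simply stands in for "learning from examples". The hybrid step here trusts the rules when they fire and falls back to the learned model otherwise, which is only one of many possible ways to combine the two.

```python
import math
import re
from collections import Counter, defaultdict

# Linguistic side: a hand-written polarity lexicon and one negation rule.
# Both are toy assumptions made for this sketch.
LEXICON = {"excellent": 1, "good": 1, "poor": -1, "terrible": -1}
NEGATORS = {"not", "never"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z']+", text.lower())


def rule_score(tokens):
    """Sum lexicon polarities, flipping a word that directly follows a negator."""
    score = 0
    for i, tok in enumerate(tokens):
        polarity = LEXICON.get(tok, 0)
        if i > 0 and tokens[i - 1] in NEGATORS:
            polarity = -polarity
        score += polarity
    return score


class NaiveBayes:
    """Statistical side: multinomial naive Bayes with add-one smoothing."""

    def fit(self, examples):
        self.word_counts = defaultdict(Counter)
        self.label_counts = Counter()
        for text, label in examples:
            self.label_counts[label] += 1
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def log_prob(self, tokens, label):
        total = sum(self.word_counts[label].values())
        lp = math.log(self.label_counts[label] / sum(self.label_counts.values()))
        for tok in tokens:
            count = self.word_counts[label][tok]
            lp += math.log((count + 1) / (total + len(self.vocab)))
        return lp

    def predict(self, tokens):
        return max(self.label_counts, key=lambda label: self.log_prob(tokens, label))


def hybrid_predict(model, text):
    """Use the rules when they give a non-zero score; otherwise ask the model."""
    tokens = tokenize(text)
    score = rule_score(tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return model.predict(tokens)


# Tiny, made-up training set for the statistical component.
TRAIN = [
    ("the battery lasts all day", "positive"),
    ("fast delivery and friendly staff", "positive"),
    ("the screen broke after a week", "negative"),
    ("slow delivery and rude staff", "negative"),
]
model = NaiveBayes().fit(TRAIN)

if __name__ == "__main__":
    print(hybrid_predict(model, "not good at all"))  # decided by the negation rule
    print(hybrid_predict(model, "friendly staff"))   # no lexicon hit, decided by the model
```

The first call is resolved purely by the hand-written rule, while the second contains no lexicon word and is resolved from the labelled examples. In both cases a structured label is extracted from unstructured text.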
Specifically, our research into natural language processing is mainly focused on:
- Semantic annotation – Named entity recognition and classification (NERC)
- Language modelling
- Semantic analysis
- Grouping and classification
- Factuality analysis
Linguistic technologies are highly dependent on the language and the type of writing. The research team currently works with Catalan, Spanish and English, covering formal writing (from news articles or blogs), user-generated content (reviews and short texts such as those from Facebook and Twitter) and automatic transcriptions. The team is also studying how information is treated across more than one language.
Prosody for voice synthesis
We are working on automating the voice-creation process and adapting these technologies to specific fields. Our research efforts are therefore concentrated mainly on developing phonetic and prosodic models of language, models that make synthetic voices sound more natural, and models that enable the generation of synthetic voices with emotion, in addition to rule-based linguistic processing and the generation of dictionaries and vocabularies.