About Our Data

About IOS Press Data

We have spent considerable time cleaning up our data and constructing a conversion pipeline to transform all IOS Press journal and book metadata to RDF-based linked data.

We describe our data using a custom vocabulary and web standards to make it more discoverable, accessible, linkable, and interoperable with other datasets. Affiliations are geocoded, and both authors and affiliations are disambiguated using our co-reference resolution script. With the help of machine learning techniques, the conversion pipeline keeps improving as more data are added.

LD Connect was developed in collaboration with the STKO Lab at UCSB in Santa Barbara, CA, USA, and the co-reference resolution with the DaSe Lab at Wright State University in Dayton, OH, USA, and Kansas State University in Manhattan, KS, USA.*

Our datasets include, for example, metadata of journal articles and book chapters, authors, affiliations, countries, volumes, issues, series, pre-press and publication dates, ISSNs, DOIs accessibility, keywords, pages and abstracts.


Explore

LD Connect lets you explore its knowledge graph by browsing or by searching with expert-level semantic search.

Download

Download and connect our data or embeddings to your research output or application and unleash the potential of unsiloed data.

Semantic Vectors: More Powerful, Intelligent Search and Retrieval

Searching LD Connect goes beyond exact-keyword queries. AI-based semantic vectors, generated from embedding models, enable far more sophisticated searches: a query is automatically matched against all semantically similar terms, including terms the user may not even have considered, as well as truncated forms, resulting in more accurate and comprehensive results.

For example, searching for “Artificial Intelligence” also retrieves results drawn from the full text, including variations such as “AI” and related terms such as “Machine Learning” or “Model-Based Reasoning”.
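The idea behind this kind of retrieval can be illustrated with cosine similarity over embedding vectors. The sketch below uses made-up 3-dimensional toy vectors (the real models are 200-dimensional), so the terms and values are purely illustrative:

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy 3-d vectors standing in for learned 200-d embeddings;
# the values are illustrative, not taken from the real model.
vectors = {
    "artificial intelligence": np.array([0.9, 0.8, 0.1]),
    "machine learning":        np.array([0.85, 0.75, 0.2]),
    "gardening":               np.array([0.1, 0.05, 0.9]),
}

# Rank all terms by similarity to the query vector.
query = vectors["artificial intelligence"]
ranked = sorted(vectors, key=lambda t: cosine(query, vectors[t]), reverse=True)
```

Semantically related terms end up close together in the vector space, so “machine learning” ranks far above an unrelated term even though it shares no keywords with the query.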

Pre-Trained Doc2Vec Models

The two files linked below contain pre-trained Doc2Vec models covering all English journal articles and book chapters published by IOS Press over the years, based on their full-text content, not just abstracts. In total, the dataset comprises more than 132,000 papers, all of which are matched to entities in the IOS Knowledge Graph. The corresponding word embedding model has a vocabulary of 105,839 words, and both models have an embedding dimension of 200. The Doc2Vec model was trained using the Python gensim 3.3.0 library.

  • IOS-Doc2Vec.zip can be loaded directly into the gensim library.
  • IOS-Doc2Vec-TXT.zip contains the Doc2Vec model and its corresponding Word2Vec model as plain-text files (“doc2vec.txt”, “w2v.txt”). The file “doc2vec_voc.txt” lists all the paper entity URLs of the Doc2Vec model, and “w2v_voc.txt” lists the word vocabulary of the corresponding Word2Vec model. This version can therefore be used for work that requires direct integration with the IOS Knowledge Graph.
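The plain-text files follow the common word2vec text layout: a header line with the vector count and dimension, then one line per entry holding an identifier followed by its values. A minimal, self-contained sketch of reading that format (the sample URLs and 3-d vectors below are made up; the real files hold 200-d vectors, and the binary model in IOS-Doc2Vec.zip would instead be loaded with gensim's `Doc2Vec.load`):

```python
# Made-up sample in the word2vec text layout:
# header "<count> <dimension>", then "<id> <v1> <v2> ..." per line.
sample = """2 3
https://example.org/paper/1 0.1 0.2 0.3
https://example.org/paper/2 0.4 0.5 0.6
"""

def load_text_embeddings(text):
    """Parse a word2vec-style text file into {identifier: vector}."""
    lines = text.strip().splitlines()
    count, dim = map(int, lines[0].split())
    vectors = {}
    for line in lines[1:]:
        parts = line.split()
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    # Sanity-check against the declared header.
    assert len(vectors) == count
    assert all(len(v) == dim for v in vectors.values())
    return vectors

embeddings = load_text_embeddings(sample)
```

Because the identifiers in “doc2vec.txt” are entity URLs, vectors loaded this way can be joined directly against the IOS Knowledge Graph.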

For questions, please contact Krzysztof Janowicz at janowicz-at-ucsb.edu.

IOS Knowledge Graph Embedding

The IOS Knowledge Graph (KG) Embedding files are trained on the IOS Knowledge Graph using the TransE algorithm. The algorithm uses every triple whose predicate is an object property to train an embedding for each entity and each predicate in the KG. For a triple <s, p, o>, TransE learns k-dimensional embeddings for the entities s and o as well as the relation p so that s + p is approximately equal to o (equivalently, s + p − o is close to zero).
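The translation idea can be shown in a few lines. The toy 4-dimensional vectors below are invented for illustration (the real IOS KG embeddings are 50-dimensional); a well-trained model gives a small residual ||s + p − o|| for true triples and a large one for false triples:

```python
import numpy as np

# Toy 4-d embeddings; the real IOS KG embeddings are 50-d.
s = np.array([0.2, 0.1, 0.0, 0.3])   # subject entity
p = np.array([0.1, 0.2, 0.4, -0.1])  # predicate (relation)
o = np.array([0.3, 0.3, 0.4, 0.2])   # object entity of a true triple

# TransE trains embeddings so that s + p ≈ o,
# i.e. the translation residual ||s + p - o|| is near zero.
residual = np.linalg.norm(s + p - o)

# An unrelated entity leaves a much larger residual,
# which is what lets the model score candidate triples.
o_wrong = np.array([0.9, -0.5, 0.0, 0.1])
wrong_residual = np.linalg.norm(s + p - o_wrong)
```

Ranking candidate objects by this residual is how TransE-style models answer link-prediction queries over the graph.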

  • entity_sameAs_merge_mapping_iri.txt: Before training TransE on the IOS KG, we first conflate entities based on the owl:sameAs relation: for all entities connected by owl:sameAs, a single entity URL is selected to represent the corresponding entity. This file maps each original entity URL (first column) to its conflated URL (second column).
  • TransE_ent.txt: The embedding for every conflated entity in the IOS KG. The first line gives the number of entities and the embedding dimension (50). Each subsequent line begins with an entity URL followed by its embedding.
  • TransE_relation.txt: The embedding for every predicate in the IOS KG. The file format is the same as TransE_ent.txt.

Note: TransE_ent.txt and TransE_relation.txt follow the word embedding format defined by Python’s gensim package.
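To look up the TransE vector of an arbitrary entity URL, the URL first has to be resolved through the sameAs mapping, since only conflated URLs appear in TransE_ent.txt. A minimal sketch, with a made-up two-column mapping in place of the real entity_sameAs_merge_mapping_iri.txt:

```python
# Hypothetical two-column mapping (original URL -> conflated URL),
# mimicking entity_sameAs_merge_mapping_iri.txt; the URLs are made up.
mapping_text = """https://example.org/author/42 https://example.org/author/7
https://example.org/author/7 https://example.org/author/7
"""

mapping = dict(line.split() for line in mapping_text.strip().splitlines())

def conflated(url):
    """Resolve an entity URL to its conflated representative.

    URLs that were never merged fall back to themselves,
    so the result can always be looked up in TransE_ent.txt.
    """
    return mapping.get(url, url)
```

Since TransE_ent.txt follows gensim's word-embedding text format, the resolved URL can then be used as a key in a model loaded with `gensim.models.KeyedVectors.load_word2vec_format`.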


* For insights into how the LD Connect embeddings and Toolbox were developed by the team, read: Gengchen Mai, Krzysztof Janowicz, Bo Yan. “Combining Text Embedding and Knowledge Graph Embedding Techniques for Academic Search Engines.” In: Proceedings of the SemDeep-4 Workshop, co-located with ISWC 2018, October 8-12, 2018, Monterey, CA, USA.