Open Source @ IBM Blog

Access trusted, curated open source data sets


The IBM® Data Asset eXchange (DAX) is an online hub where developers and data scientists can find carefully curated, free data sets released under open data licenses. The exchange focuses in particular on data sets released under the Community Data License Agreement (CDLA). Since launching the exchange in 2019, the CODAIT team has steadily added new data sets.

IBM Project Debater data sets

The recent additions include a group of data sets related to IBM Project Debater, an IBM Research project aiming to create an artificial intelligence (AI) system that can debate competitively with humans on complex topics. Project Debater has achieved significant milestones, including a live debate against an expert at THINK 2019. As part of the project, IBM Research has created open data sets covering a range of use cases relevant to building an AI debating system.

We’re proud to host these Debater data sets on the Data Asset eXchange. A few of the highlights include:

  • A set of recorded debates from experts, making up three data sets: Recorded Debating #1, #2, and #3
  • Labeled Emphasized Words in Speech: A data set created to train and evaluate a system that receives a written argumentative speech and predicts which words should be emphasized by the Text-to-Speech component.
  • Sentiment Lexicon of IDiomatic Expressions (SLIDE): 5,000 frequently occurring idioms with sentiment annotation
  • Claim Sentences Search: A set of sentences from Wikipedia, together with their topic. The aim of the Claim Sentence Search task is to detect sentences containing claims in a large corpus, given a debatable topic or motion.

Data sets for document analysis and extraction

Also emanating from IBM Research, the PubTabNet and PubLayNet data sets relate to document layout analysis and information extraction. PubTabNet is a large data set for image-based table recognition, containing over 568,000 images of tabular data, annotated with the corresponding HTML representation of the tables. PubLayNet contains document images, with each document’s layout annotated with both bounding boxes and polygonal segmentations. Both data sets are based on documents from the PubMed Central Open Access Subset. These data sets are released under the CDLA – Permissive license.
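
As a rough illustration of working with bounding-box layout annotations like those described above, the sketch below counts annotated regions per category and finds the largest region on a page. It assumes a COCO-style JSON layout (images, categories, and annotations with `[x, y, width, height]` boxes); that layout, the category names, and the sample values are assumptions made for illustration and are not taken from the data set itself.

```python
from collections import Counter

# Hypothetical, hand-written sample in a COCO-style layout.
# This is NOT real PubLayNet data; field names and values are assumptions.
sample = {
    "images": [{"id": 1, "file_name": "page_0001.png", "width": 600, "height": 800}],
    "categories": [
        {"id": 1, "name": "text"},
        {"id": 2, "name": "title"},
        {"id": 3, "name": "table"},
    ],
    "annotations": [
        {"image_id": 1, "category_id": 2, "bbox": [50, 40, 500, 30]},
        {"image_id": 1, "category_id": 1, "bbox": [50, 90, 500, 300]},
        {"image_id": 1, "category_id": 1, "bbox": [50, 400, 500, 120]},
        {"image_id": 1, "category_id": 3, "bbox": [50, 540, 500, 200]},
    ],
}


def count_regions_by_category(dataset):
    """Return a Counter mapping category name to number of annotated regions."""
    names = {category["id"]: category["name"] for category in dataset["categories"]}
    return Counter(names[ann["category_id"]] for ann in dataset["annotations"])


def bbox_area(bbox):
    """Area of a box given as [x, y, width, height]."""
    _x, _y, width, height = bbox
    return width * height


counts = count_regions_by_category(sample)
largest = max(sample["annotations"], key=lambda ann: bbox_area(ann["bbox"]))

print(counts)           # regions per layout category
print(largest["bbox"])  # box of the largest annotated region
```

For a real download, the `sample` dictionary would be replaced by the parsed annotation file, after checking the data set's own documentation for the actual field names.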

Video action understanding

A final data set to highlight is the Video-Text Compliance data set, which contains over 1.2 million video frames of atomic activities, along with text instructions and compliance labels. Importantly, the data set creators followed privacy-preserving safeguards when generating this data set, illustrating best practices for addressing data privacy concerns while still creating useful data sets for real-world applications. This data set is released under the CDLA – Sharing license.

Exploratory data analysis notebooks

To make it easier to use data sets on the Data Asset eXchange, we’ve introduced interactive notebooks hosted on Watson Studio that illustrate the first steps of exploratory data analysis. So far, we’ve added notebooks for several data sets, including Fashion-MNIST, JFK Weather, PubTabNet, and PubLayNet.
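
To give a flavor of what a first exploratory pass can look like, here is a minimal sketch that summarizes numeric columns of a small CSV using only the Python standard library. It is not taken from the hosted notebooks; the inline data, the column names (`temperature_f`, `humidity_pct`), and the `summarize` helper are all hypothetical and stand in for a file downloaded from the exchange.

```python
import csv
import io
import statistics

# Hypothetical stand-in for a downloaded CSV file.
# Column names and values are invented for illustration only.
raw_csv = """date,temperature_f,humidity_pct
2020-01-01,41,60
2020-01-02,38,
2020-01-03,45,72
2020-01-04,50,65
"""


def summarize(rows, column):
    """Basic summary of one numeric column, skipping empty cells."""
    values = [float(row[column]) for row in rows if row[column] != ""]
    return {
        "count": len(values),
        "missing": len(rows) - len(values),
        "mean": statistics.mean(values),
        "min": min(values),
        "max": max(values),
    }


rows = list(csv.DictReader(io.StringIO(raw_csv)))
for column in ("temperature_f", "humidity_pct"):
    print(column, summarize(rows, column))
```

Checking counts, missing values, and value ranges per column is a common starting point before any cleansing or modeling work.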

We’re working on more content related to data cleansing, exploratory analysis, and machine learning with data sets from the Data Asset eXchange, so watch this space! We encourage you to check out these recent data sets and notebooks, as well as the other data sets on the exchange.

If you have any comments or feedback on the Data Asset eXchange, please get in touch with us on GitHub or Slack. We look forward to an exciting year ahead!