
RapidMiner Extensions

Our contributions for RapidMiner can be found here.

aspects-DB-Dataset for Focus Aspect Value (FAV) model for Explainable Subjective Interpretation

Dataset Description
aspects-DB The dataset contains Aspects and the corresponding image URLs. Please follow the README for further information.

Datasets for Image Captioning

Dataset Description
Crowdsourcing annotations of image captions from the Yahoo Flickr Creative Commons 100 Million Dataset (YFCC100M). The dataset contains responses with respect to subjectivity, visibility, appeal and intent of around 2.2k image titles.

Datasets for Document Analysis

Here you will find the datasets that have been generated at MADM for research purposes. Detailed information about each dataset is available on its specific page.

Topic Name Description
Document security Doctor bills The data set contains genuine and forged doctor bills. Forgeries were made by re-engineering genuine documents.
Document security MIC dataset The data set contains print-outs from color laser printers and copiers that show Machine Identification Codes (MIC), also known as “yellow dots” or counterfeit protection system codes.
Document security StaVer dataset The data set contains scanned invoices with color logos, color text and various kinds of stamps.
Document security Scan Distortion dataset The dataset contains grayscale invoices from the same source as well as copies of genuine invoices to detect and measure the scanning distortions.
Document security Distorted Text-Lines dataset The dataset contains synthetic grayscale document images with single-column text where the last paragraph is either rotated or misaligned. Different fonts and font sizes are used.
Document security DFKI Printing Technique dataset This dataset contains documents printed on 7 inkjet and 13 laser printers.

Datasets for Image and Video Analysis

Dataset Description
YouTube-22concepts A dataset of YouTube video clips tagged with 22 different concepts for experiments with automatic video annotation.

Datasets for Audio Analysis

Dataset Description
AudioPairBank A Large-Scale Tag-Pair-Based Audio Dataset (385.5 hours, 1116 classes)

Datasets for Machine Learning

Dataset Generator Description Python code for generating synthetic datasets with known Bayes error rate and defined statistical properties.
EuroSAT (RGB color space images) EuroSAT: A land use and land cover classification dataset based on Sentinel-2 satellite images.
EuroSAT (all 13 bands) EuroSAT: A land use and land cover classification dataset based on Sentinel-2 satellite images.
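
To illustrate what a synthetic dataset with a known Bayes error rate means, here is a minimal sketch. It is not the Dataset Generator listed above; the function names and the choice of two equally likely one-dimensional Gaussian classes with a shared standard deviation are assumptions made for illustration. For that setup the Bayes error has the closed form Phi(-|mu1 - mu0| / (2 * sigma)), so a sampled dataset can be checked against it.

```python
import math
import random


def bayes_error_two_gaussians(mu0, mu1, sigma):
    """Exact Bayes error for two equally likely 1-D Gaussian classes sharing sigma."""
    # The optimal boundary is the midpoint of the means; each class is
    # misclassified with probability Phi(-d / 2), where d = |mu1 - mu0| / sigma.
    d = abs(mu1 - mu0) / sigma
    return 0.5 * (1.0 + math.erf((-d / 2.0) / math.sqrt(2.0)))


def generate(n, mu0, mu1, sigma, seed=0):
    """Sample n (value, label) pairs with equal class priors (illustrative helper)."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        label = rng.randint(0, 1)
        mean = mu1 if label == 1 else mu0
        data.append((rng.gauss(mean, sigma), label))
    return data


def empirical_error(data, mu0, mu1):
    """Error rate of the midpoint-threshold (Bayes-optimal) classifier on the sample."""
    threshold = (mu0 + mu1) / 2.0
    upper_class = 1 if mu1 > mu0 else 0
    wrong = 0
    for value, label in data:
        predicted = upper_class if value > threshold else 1 - upper_class
        if predicted != label:
            wrong += 1
    return wrong / len(data)


theoretical = bayes_error_two_gaussians(0.0, 2.0, 1.0)
sample = generate(20000, 0.0, 2.0, 1.0)
observed = empirical_error(sample, 0.0, 2.0)
print(f"theoretical Bayes error: {theoretical:.4f}, observed: {observed:.4f}")
```

No classifier can beat the theoretical value in expectation on such data, which is what makes datasets of this kind useful as a benchmark.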

Datasets for Unsupervised Anomaly Detection

Datasets for unsupervised anomaly detection are listed below. The outlier label must not be used for detection, only for evaluation. The first row of each file contains the column names. For the UCI datasets, permission for republication has been granted. For more information please refer to

More unsupervised anomaly detection datasets for evaluation can now be found on the Harvard Dataverse:

Dataset Records Dimensions % outliers Description
3000 2 1.23 Artificial test data set with 4 normal distributions (one of which has low density), a micro cluster and local anomalies.
breast-cancer-unsupervised.csv 367 30 2.72 Modified “Breast Cancer Wisconsin (Diagnostic)” dataset from the UCI machine learning repository. Original version available here.
pen-local-unsupervised.csv 6724 16 0.15 Modified “Pen-Based Recognition of Handwritten Digits” dataset from the UCI machine learning repository. Original version available here.
pen-global-unsupervised.csv 809 16 11.1 Modified “Pen-Based Recognition of Handwritten Digits” dataset from the UCI machine learning repository. Original version available here.
shuttle-unsupervised.csv 46464 9 1.89 Modified “Statlog (Shuttle)” dataset from the UCI machine learning repository. Original version available here.
satellite-unsupervised.csv 5100 36 1.49 Modified “Statlog (Landsat Satellite)” dataset from the UCI machine learning repository. Original version available here.
annthyroid-unsupervised.csv 6916 21 3.61 Modified “Thyroid Disease” dataset from the UCI machine learning repository. See version “ann-thyroid”. Original version available here.
kdd99-unsupervised-ad.csv 620089 38 0.17 Modified “KDD Cup 1999” dataset from the UCI machine learning repository. Only HTTP connections selected. Original version available here.
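
The usage rule stated above the table (the outlier label is for evaluation only, and the first row holds the column names) can be sketched as follows. This is an illustration under assumptions, not a prescribed workflow: the label column name `outlier`, its values `o` and `n`, its position as the last column, and the tiny in-memory sample are all hypothetical stand-ins for a real file, and the k-nearest-neighbor distance score is just one simple unsupervised detector chosen for the example.

```python
import csv
import io
import math

# Hypothetical miniature file in the assumed layout:
# header row, numeric feature columns, label column last ("o" = outlier, "n" = normal).
SAMPLE = """x1,x2,outlier
0.0,0.1,n
0.1,0.0,n
0.2,0.1,n
0.1,0.2,n
0.0,0.0,n
5.0,5.0,o
"""


def load(handle, label_column="outlier"):
    """Read the header row via DictReader and split features from the label."""
    features, labels = [], []
    for row in csv.DictReader(handle):
        labels.append(row.pop(label_column) == "o")
        features.append([float(value) for value in row.values()])
    return features, labels


def knn_scores(points, k=2):
    """Unsupervised score: mean distance to the k nearest other points."""
    scores = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(sum(dists[:k]) / k)
    return scores


def roc_auc(scores, labels):
    """ROC AUC as the fraction of (outlier, normal) pairs ranked correctly."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


features, labels = load(io.StringIO(SAMPLE))
scores = knn_scores(features)      # detection: the labels are never touched here
auc = roc_auc(scores, labels)      # evaluation: the only place the labels are used
print(f"ROC AUC: {auc:.3f}")
```

To try this on one of the listed files, replace `io.StringIO(SAMPLE)` with an opened file handle and adjust `label_column` and the label values to match the actual header; the quadratic-time scoring loop is only practical for the smaller datasets.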