Dataset module

Recommendation datasets and related utilities

Recommendation datasets

Amazon Reviews

Amazon Reviews dataset consists of reviews from Amazon. The data span a period of 18 years, including ~35 million reviews up to March 2013. Reviews include product and user information, ratings, and a plaintext review.

Citation:J. McAuley and J. Leskovec, “Hidden factors and hidden topics: understanding rating dimensions with review text”, RecSys, 2013.
recommenders.datasets.amazon_reviews.data_preprocessing(reviews_file, meta_file, train_file, valid_file, test_file, user_vocab, item_vocab, cate_vocab, sample_rate=0.01, valid_num_ngs=4, test_num_ngs=9, is_history_expanding=True)[source]

Create data for training, validation, and testing from the original dataset

Parameters:
  • reviews_file (str) – Path to the reviews dataset downloaded in a previous step.
  • meta_file (str) – Path to the meta dataset downloaded in a previous step.
recommenders.datasets.amazon_reviews.download_and_extract(name, dest_path)[source]

Downloads and extracts Amazon reviews and meta datafiles if they don’t already exist

Parameters:
  • name (str) – Category of reviews.
  • dest_path (str) – File path for the downloaded file.
Returns:

File path for the extracted file.

Return type:

str

recommenders.datasets.amazon_reviews.get_review_data(reviews_file)[source]

Downloads the Amazon review data (only), prepares it in the required format, and stores it in the same location

Parameters:reviews_file (str) – Filename for downloaded reviews dataset.
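
A minimal end-to-end sketch combining the helpers above; the category and all file paths below are illustrative placeholders, not values prescribed by the library:

# Hypothetical category and output paths
from recommenders.datasets.amazon_reviews import download_and_extract, data_preprocessing

reviews_file = download_and_extract('reviews_Movies_and_TV_5.json', 'data/reviews_Movies_and_TV_5.json')
meta_file = download_and_extract('meta_Movies_and_TV.json', 'data/meta_Movies_and_TV.json')

data_preprocessing(
    reviews_file, meta_file,
    'data/train', 'data/valid', 'data/test',
    'data/user_vocab.pkl', 'data/item_vocab.pkl', 'data/cate_vocab.pkl',
    sample_rate=0.01, valid_num_ngs=4, test_num_ngs=9,
)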

CORD-19

COVID-19 Open Research Dataset (CORD-19) is a full-text and metadata dataset of COVID-19 and coronavirus-related scholarly articles optimized for machine readability and made available for use by the global research community.

In response to the COVID-19 pandemic, the Allen Institute for AI has partnered with leading research groups to prepare and distribute the COVID-19 Open Research Dataset (CORD-19), a free resource of over 47,000 scholarly articles, including over 36,000 with full text, about COVID-19 and the coronavirus family of viruses for use by the global research community.

This dataset is intended to mobilize researchers to apply recent advances in natural language processing to generate new insights in support of the fight against this infectious disease.

Citation:Wang, L.L., Lo, K., Chandrasekhar, Y., Reas, R., Yang, J., Eide, D., Funk, K., Kinney, R., Liu, Z., Merrill, W. and Mooney, P. “Cord-19: The COVID-19 Open Research Dataset.”, 2020.
recommenders.datasets.covid_utils.clean_dataframe(df)[source]

Clean up the dataframe.

Parameters:df (pandas.DataFrame) – Pandas dataframe.
Returns:Cleaned pandas dataframe.
Return type:df (pandas.DataFrame)
recommenders.datasets.covid_utils.get_public_domain_text(df, container_name, azure_storage_account_name='azureopendatastorage', azure_storage_sas_token='')[source]

Get all public domain text.

Parameters:
  • df (pandas.DataFrame) – Metadata dataframe for public domain text.
  • container_name (str) – Azure storage container name.
  • azure_storage_account_name (str) – Azure storage account name.
  • azure_storage_sas_token (str) – Azure storage SAS token.
Returns:

Dataframe with select metadata and full article text.

Return type:

df_full (pandas.DataFrame)

recommenders.datasets.covid_utils.load_pandas_df(azure_storage_account_name='azureopendatastorage', azure_storage_sas_token='', container_name='covid19temp', metadata_filename='metadata.csv')[source]

Loads the Azure Open Research COVID-19 dataset as a pd.DataFrame.

The Azure COVID-19 Open Research Dataset may be found at https://azure.microsoft.com/en-us/services/open-datasets/catalog/covid-19-open-research/

Parameters:
  • azure_storage_account_name (str) – Azure storage account name.
  • azure_storage_sas_token (str) – Azure storage SAS token.
  • container_name (str) – Azure storage container name.
  • metadata_filename (str) – Name of file containing top-level metadata for the dataset.
Returns:

Metadata dataframe.

Return type:

metadata (pandas.DataFrame)

recommenders.datasets.covid_utils.remove_duplicates(df, cols)[source]

Remove duplicated entries.

Parameters:
  • df (pd.DataFrame) – Pandas dataframe.
  • cols (list of str) – Names of the columns in which to look for duplicates.
Returns:

Pandas dataframe with duplicate rows dropped.

Return type:

df (pandas.DataFrame)

recommenders.datasets.covid_utils.remove_nan(df, cols)[source]

Remove rows with NaN values in the specified columns.

Parameters:
  • df (pandas.DataFrame) – Pandas dataframe.
  • cols (list of str) – Names of the columns in which to look for NaN values.
Returns:

Pandas dataframe with invalid rows dropped.

Return type:

df (pandas.DataFrame)

recommenders.datasets.covid_utils.retrieve_text(entry, container_name, azure_storage_account_name='azureopendatastorage', azure_storage_sas_token='')[source]

Retrieve body text from article of interest.

Parameters:
  • entry (pd.Series) – A single row from the dataframe (df.iloc[n]).
  • container_name (str) – Azure storage container name.
  • azure_storage_account_name (str) – Azure storage account name.
  • azure_storage_sas_token (str) – Azure storage SAS token.
Returns:Full text of the blob as a single string.
Return type:text (str)
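
A sketch of a typical metadata-to-full-text flow with these utilities; the column names passed to the cleaning helpers ('cord_uid', 'title') are assumptions about the CORD-19 metadata schema:

from recommenders.datasets.covid_utils import (
    load_pandas_df, clean_dataframe, remove_duplicates, remove_nan, get_public_domain_text,
)

# Load the top-level metadata from the Azure Open Datasets container (defaults as above).
metadata = load_pandas_df()
metadata = clean_dataframe(metadata)

# Drop duplicate and incomplete records before fetching full text.
metadata = remove_duplicates(metadata, cols=['cord_uid'])
metadata = remove_nan(metadata, cols=['cord_uid', 'title'])

df_full = get_public_domain_text(metadata, container_name='covid19temp')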

Criteo

Criteo dataset, released by Criteo Labs, is an online advertising dataset that contains feature values and click feedback for millions of display ads. Every ad has 40 attributes: the first attribute is the label, where a value of 1 indicates the ad was clicked and 0 indicates it was not. The remaining attributes consist of 13 integer columns and 26 categorical columns.

recommenders.datasets.criteo.download_criteo(size='sample', work_directory='.')[source]

Download criteo dataset as a compressed file.

Parameters:
  • size (str) – Size of criteo dataset. It can be “full” or “sample”.
  • work_directory (str) – Working directory.
Returns:

Path of the downloaded file.

Return type:

str

recommenders.datasets.criteo.extract_criteo(size, compressed_file, path=None)[source]

Extract Criteo dataset tar.

Parameters:
  • size (str) – Size of Criteo dataset. It can be “full” or “sample”.
  • compressed_file (str) – Path to compressed file.
  • path (str) – Path to extract the file.
Returns:

Path to the extracted file.

Return type:

str

recommenders.datasets.criteo.get_spark_schema(header=['label', 'int00', 'int01', 'int02', 'int03', 'int04', 'int05', 'int06', 'int07', 'int08', 'int09', 'int10', 'int11', 'int12', 'cat00', 'cat01', 'cat02', 'cat03', 'cat04', 'cat05', 'cat06', 'cat07', 'cat08', 'cat09', 'cat10', 'cat11', 'cat12', 'cat13', 'cat14', 'cat15', 'cat16', 'cat17', 'cat18', 'cat19', 'cat20', 'cat21', 'cat22', 'cat23', 'cat24', 'cat25'])[source]

Get Spark schema from header.

Parameters:header (list) – Dataset header names.
Returns:Spark schema.
Return type:pyspark.sql.types.StructType
recommenders.datasets.criteo.load_pandas_df(size='sample', local_cache_path=None, header=['label', 'int00', 'int01', 'int02', 'int03', 'int04', 'int05', 'int06', 'int07', 'int08', 'int09', 'int10', 'int11', 'int12', 'cat00', 'cat01', 'cat02', 'cat03', 'cat04', 'cat05', 'cat06', 'cat07', 'cat08', 'cat09', 'cat10', 'cat11', 'cat12', 'cat13', 'cat14', 'cat15', 'cat16', 'cat17', 'cat18', 'cat19', 'cat20', 'cat21', 'cat22', 'cat23', 'cat24', 'cat25'])[source]

Loads the Criteo DAC dataset as pandas.DataFrame. This function downloads, untars, and loads the dataset.

The dataset consists of a portion of Criteo’s traffic over a period of 24 days. Each row corresponds to a display ad served by Criteo and the first column indicates whether this ad has been clicked or not.

There are 13 features taking integer values (mostly count features) and 26 categorical features. The values of the categorical features have been hashed onto 32 bits for anonymization purposes.

The schema is:

<label> <integer feature 1> ... <integer feature 13> <categorical feature 1> ... <categorical feature 26>

More details (need to accept user terms to see the information): http://labs.criteo.com/2013/12/download-terabyte-click-logs/

Parameters:
  • size (str) – Dataset size. It can be “sample” or “full”.
  • local_cache_path (str) – Path where to cache the tar.gz file locally
  • header (list) – Dataset header names.
Returns:

Criteo DAC sample dataset.

Return type:

pandas.DataFrame
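
For example, a minimal sketch that loads the small sample set (download and extraction are handled internally):

from recommenders.datasets.criteo import load_pandas_df

df = load_pandas_df(size='sample')
print(df.shape)  # one label column, 13 integer columns, 26 categorical columns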

recommenders.datasets.criteo.load_spark_df(spark, size='sample', header=['label', 'int00', 'int01', 'int02', 'int03', 'int04', 'int05', 'int06', 'int07', 'int08', 'int09', 'int10', 'int11', 'int12', 'cat00', 'cat01', 'cat02', 'cat03', 'cat04', 'cat05', 'cat06', 'cat07', 'cat08', 'cat09', 'cat10', 'cat11', 'cat12', 'cat13', 'cat14', 'cat15', 'cat16', 'cat17', 'cat18', 'cat19', 'cat20', 'cat21', 'cat22', 'cat23', 'cat24', 'cat25'], local_cache_path=None, dbfs_datapath='dbfs:/FileStore/dac', dbutils=None)[source]

Loads the Criteo DAC dataset as pySpark.DataFrame.

The dataset consists of a portion of Criteo’s traffic over a period of 24 days. Each row corresponds to a display ad served by Criteo and the first column indicates whether this ad has been clicked or not.

There are 13 features taking integer values (mostly count features) and 26 categorical features. The values of the categorical features have been hashed onto 32 bits for anonymization purposes.

The schema is:

<label> <integer feature 1> ... <integer feature 13> <categorical feature 1> ... <categorical feature 26>

More details (need to accept user terms to see the information): http://labs.criteo.com/2013/12/download-terabyte-click-logs/

Parameters:
  • spark (pySpark.SparkSession) – Spark session.
  • size (str) – Dataset size. It can be “sample” or “full”.
  • local_cache_path (str) – Path where to cache the tar.gz file locally.
  • header (list) – Dataset header names.
  • dbfs_datapath (str) – Where to store the extracted files on Databricks.
  • dbutils (Databricks.dbutils) – Databricks utility object.
Returns:

Criteo DAC training dataset.

Return type:

pyspark.sql.DataFrame

MIND

MIcrosoft News Dataset (MIND) is a large-scale dataset for news recommendation research. It was collected from anonymized behavior logs of the Microsoft News website.

MIND contains about 160k English news articles and more than 15 million impression logs generated by 1 million users. Every news article contains rich textual content including title, abstract, body, category and entities. Each impression log contains the click events, non-clicked events and historical news click behaviors of this user before this impression. To protect user privacy, each user was de-linked from the production system when securely hashed into an anonymized ID.

Citation:Fangzhao Wu, Ying Qiao, Jiun-Hung Chen, Chuhan Wu, Tao Qi, Jianxun Lian, Danyang Liu, Xing Xie, Jianfeng Gao, Winnie Wu and Ming Zhou, “MIND: A Large-scale Dataset for News Recommendation”, ACL, 2020.
recommenders.datasets.mind.download_and_extract_glove(dest_path)[source]

Download and extract the Glove embedding

Parameters:dest_path (str) – Destination directory path for the downloaded file
Returns:File path where Glove was extracted.
Return type:str
recommenders.datasets.mind.download_mind(size='small', dest_path=None)[source]

Download MIND dataset

Parameters:
  • size (str) – Dataset size. One of [“small”, “large”]
  • dest_path (str) – Download path. If None, the dataset is downloaded to a temporary path.
Returns:

Path to train and validation sets.

Return type:

str, str

recommenders.datasets.mind.extract_mind(train_zip, valid_zip, train_folder='train', valid_folder='valid', clean_zip_file=True)[source]

Extract MIND dataset

Parameters:
  • train_zip (str) – Path to train zip file
  • valid_zip (str) – Path to valid zip file
  • train_folder (str) – Destination folder for the train set
  • valid_folder (str) – Destination folder for the validation set
Returns:

Train and validation folders

Return type:

str, str
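
A minimal sketch of fetching and unpacking the small variant of MIND; the destination path is a placeholder:

from recommenders.datasets.mind import download_mind, extract_mind

train_zip, valid_zip = download_mind(size='small', dest_path='data/mind')
train_dir, valid_dir = extract_mind(train_zip, valid_zip)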

recommenders.datasets.mind.generate_embeddings(data_path, news_words, news_entities, train_entities, valid_entities, max_sentence=10, word_embedding_dim=100)[source]

Generate embeddings.

Parameters:
  • data_path (str) – Data path.
  • news_words (dict) – News word dictionary.
  • news_entities (dict) – News entity dictionary.
  • train_entities (str) – Train entity file.
  • valid_entities (str) – Validation entity file.
  • max_sentence (int) – Max sentence size.
  • word_embedding_dim (int) – Word embedding dimension.
Returns:

File paths to news, word and entity embeddings.

Return type:

str, str, str

recommenders.datasets.mind.get_train_input(session, train_file_path, npratio=4)[source]

Generate train file.

Parameters:
  • session (list) – List of user session with user_id, clicks, positive and negative interactions.
  • train_file_path (str) – Path to file.
  • npratio (int) – Ratio for negative sampling.
recommenders.datasets.mind.get_user_history(train_history, valid_history, user_history_path)[source]

Generate user history file.

Parameters:
  • train_history (list) – Train history.
  • valid_history (list) – Validation history
  • user_history_path (str) – Path to file.
recommenders.datasets.mind.get_valid_input(session, valid_file_path)[source]

Generate validation file.

Parameters:
  • session (list) – List of user session with user_id, clicks, positive and negative interactions.
  • valid_file_path (str) – Path to file.
recommenders.datasets.mind.get_words_and_entities(train_news, valid_news)[source]

Load words and entities

Parameters:
  • train_news (str) – News train file.
  • valid_news (str) – News validation file.
Returns:

Words and entities dictionaries.

Return type:

dict, dict

recommenders.datasets.mind.load_glove_matrix(path_emb, word_dict, word_embedding_dim)[source]

Load the pretrained embedding matrix for the words in word_dict

Parameters:
  • path_emb (string) – Folder path of downloaded glove file
  • word_dict (dict) – word dictionary
  • word_embedding_dim (int) – dimension of the word embedding vectors
Returns:

Pretrained word embedding matrix and the list of words found in the GloVe files.

Return type:

numpy.ndarray, list
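
A sketch combining the GloVe helpers above; the destination path and the toy word dictionary are placeholders (in practice word_dict maps each word to its vocabulary index), and word_embedding_dim=100 assumes a 100-dimensional GloVe file is present in the extracted folder:

from recommenders.datasets.mind import download_and_extract_glove, load_glove_matrix

glove_path = download_and_extract_glove('data/mind')
word_dict = {'news': 1, 'recommendation': 2}  # toy vocabulary for illustration
embedding_matrix, exist_words = load_glove_matrix(glove_path, word_dict, word_embedding_dim=100)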

recommenders.datasets.mind.read_clickhistory(path, filename)[source]

Read click history file

Parameters:
  • path (str) – Folder path
  • filename (str) – Filename
Returns:

  • A list of user session with user_id, clicks, positive and negative interactions.
  • A dictionary with user_id click history.

Return type:

list, dict

recommenders.datasets.mind.word_tokenize(sent)[source]

Tokenize a sentence

Parameters:sent (str) – the sentence to be tokenized
Returns:words in the sentence
Return type:list

MovieLens

The MovieLens datasets, first released in 1998, describe people’s expressed preferences for movies. These preferences take the form of <user, item, rating, timestamp> tuples, each the result of a person expressing a preference (a 0-5 star rating) for a movie at a particular time.

It comes with several sizes:

  • MovieLens 100k: 100,000 ratings from 1000 users on 1700 movies.
  • MovieLens 1M: 1 million ratings from 6000 users on 4000 movies.
  • MovieLens 10M: 10 million ratings from 72000 users on 10000 movies.
  • MovieLens 20M: 20 million ratings from 138000 users on 27000 movies
Citation:F. M. Harper and J. A. Konstan. “The MovieLens Datasets: History and Context”. ACM Transactions on Interactive Intelligent Systems (TiiS) 5, 4, Article 19, DOI=http://dx.doi.org/10.1145/2827872, 2015.
class recommenders.datasets.movielens.MockMovielensSchema[source]

Mock dataset schema to generate fake data for testing purposes. This schema is configured to mimic the MovieLens dataset

https://files.grouplens.org/datasets/movielens/ml-100k/

Dataset schema and generation is configured using pandera. Please see https://pandera.readthedocs.io/en/latest/schema_models.html for more information.

class Config
classmethod get_df(size: int = 3, seed: int = 100, keep_first_n_cols: Optional[int] = None, keep_title_col: bool = False, keep_genre_col: bool = False) → pandas.core.frame.DataFrame[source]

Return fake movielens dataset as a Pandas Dataframe with specified rows.

Parameters:
  • size (int) – number of rows to generate
  • seed (int, optional) – seeding the pseudo-number generation. Defaults to 100.
  • keep_first_n_cols (int, optional) – keep the first n default movielens columns.
  • keep_title_col (bool) – remove the title column if False. Defaults to False.
  • keep_genre_col (bool) – remove the genre column if False. Defaults to False.
Returns:

a mock dataset

Return type:

pandas.DataFrame
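
For example, a small fake dataset for unit tests:

from recommenders.datasets.movielens import MockMovielensSchema

# Ten reproducible fake rows, keeping the title and genre columns.
mock_df = MockMovielensSchema.get_df(size=10, seed=123, keep_title_col=True, keep_genre_col=True)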

classmethod get_spark_df(spark, size: int = 3, seed: int = 100, keep_title_col: bool = False, keep_genre_col: bool = False, tmp_path: Optional[str] = None)[source]

Return fake movielens dataset as a Spark Dataframe with specified rows

Parameters:
  • spark (SparkSession) – spark session to load the dataframe into
  • size (int) – number of rows to generate
  • seed (int) – seeding the pseudo-number generation. Defaults to 100.
  • keep_title_col (bool) – remove the title column if False. Defaults to False.
  • keep_genre_col (bool) – remove the genre column if False. Defaults to False.
  • tmp_path (str, optional) – path to store files for serialization purposes when transferring data from Python to Java. If None, a temporary path is used instead.
Returns:

a mock dataset

Return type:

pyspark.sql.DataFrame

recommenders.datasets.movielens.download_movielens(size, dest_path)[source]

Downloads MovieLens datafile.

Parameters:
  • size (str) – Size of the data to load. One of (“100k”, “1m”, “10m”, “20m”).
  • dest_path (str) – File path for the downloaded file
recommenders.datasets.movielens.extract_movielens(size, rating_path, item_path, zip_path)[source]

Extract MovieLens rating and item datafiles from the MovieLens raw zip file.

To extract all files instead of just rating and item datafiles, use ZipFile’s extractall(path) instead.

Parameters:
  • size (str) – Size of the data to load. One of (“100k”, “1m”, “10m”, “20m”).
  • rating_path (str) – Destination path for rating datafile
  • item_path (str) – Destination path for item datafile
  • zip_path (str) – zipfile path
recommenders.datasets.movielens.load_item_df(size='100k', local_cache_path=None, movie_col='itemID', title_col=None, genres_col=None, year_col=None)[source]

Loads Movie info.

Parameters:
  • size (str) – Size of the data to load. One of (“100k”, “1m”, “10m”, “20m”).
  • local_cache_path (str) – Path (directory or a zip file) to cache the downloaded zip file. If None, all the intermediate files will be stored in a temporary directory and removed after use.
  • movie_col (str) – Movie id column name.
  • title_col (str) – Movie title column name. If None, the column will not be loaded.
  • genres_col (str) – Genres column name. Genres are ‘|’ separated string. If None, the column will not be loaded.
  • year_col (str) – Movie release year column name. If None, the column will not be loaded.
Returns:

Movie information data, such as title, genres, and release year.

Return type:

pandas.DataFrame

recommenders.datasets.movielens.load_pandas_df(size='100k', header=None, local_cache_path=None, title_col=None, genres_col=None, year_col=None)[source]

Loads the MovieLens dataset as pd.DataFrame.

Download the dataset from https://files.grouplens.org/datasets/movielens, unzip, and load. To load movie information only, you can use load_item_df function.

Parameters:
  • size (str) – Size of the data to load. One of (“100k”, “1m”, “10m”, “20m”, “mock100”).
  • header (list or tuple or None) – Rating dataset header. If size is set to any of ‘MOCK_DATA_FORMAT’, this parameter is ignored and data is rendered using the ‘DEFAULT_HEADER’ instead.
  • local_cache_path (str) – Path (directory or a zip file) to cache the downloaded zip file. If None, all the intermediate files will be stored in a temporary directory and removed after use. If size is set to any of ‘MOCK_DATA_FORMAT’, this parameter is ignored.
  • title_col (str) – Movie title column name. If None, the column will not be loaded.
  • genres_col (str) – Genres column name. Genres are ‘|’ separated string. If None, the column will not be loaded.
  • year_col (str) – Movie release year column name. If None, the column will not be loaded. If size is set to any of ‘MOCK_DATA_FORMAT’, this parameter is ignored.
Returns:

Movie rating dataset.

Return type:

pandas.DataFrame

Examples

# To load just user-id, item-id, and ratings from MovieLens-1M dataset,
df = load_pandas_df('1m', ('UserId', 'ItemId', 'Rating'))

# To load rating's timestamp together,
df = load_pandas_df('1m', ('UserId', 'ItemId', 'Rating', 'Timestamp'))

# To load movie's title, genres, and released year info along with the ratings data,
df = load_pandas_df('1m', ('UserId', 'ItemId', 'Rating', 'Timestamp'),
    title_col='Title',
    genres_col='Genres',
    year_col='Year'
)
recommenders.datasets.movielens.load_spark_df(spark, size='100k', header=None, schema=None, local_cache_path=None, dbutils=None, title_col=None, genres_col=None, year_col=None)[source]

Loads the MovieLens dataset as pyspark.sql.DataFrame.

Download the dataset from https://files.grouplens.org/datasets/movielens, unzip, and load as pyspark.sql.DataFrame.

To load movie information only, you can use load_item_df function.

Parameters:
  • spark (pyspark.SparkSession) – Spark session.
  • size (str) – Size of the data to load. One of (“100k”, “1m”, “10m”, “20m”, “mock100”).
  • header (list or tuple) – Rating dataset header. If schema is provided or size is set to any of ‘MOCK_DATA_FORMAT’, this argument is ignored.
  • schema (pyspark.StructType) – Dataset schema. If size is set to any of ‘MOCK_DATA_FORMAT’, data is rendered in the ‘MockMovielensSchema’ instead.
  • local_cache_path (str) – Path (directory or a zip file) to cache the downloaded zip file. If None, all the intermediate files will be stored in a temporary directory and removed after use.
  • dbutils (Databricks.dbutils) – Databricks utility object. If size is set to any of ‘MOCK_DATA_FORMAT’, this parameter is ignored.
  • title_col (str) – Title column name. If None, the column will not be loaded.
  • genres_col (str) – Genres column name. Genres are ‘|’ separated string. If None, the column will not be loaded.
  • year_col (str) – Movie release year column name. If None, the column will not be loaded. If size is set to any of ‘MOCK_DATA_FORMAT’, this parameter is ignored.
Returns:

Movie rating dataset.

Return type:

pyspark.sql.DataFrame

Examples

# To load just user-id, item-id, and ratings from MovieLens-1M dataset:
spark_df = load_spark_df(spark, '1m', ('UserId', 'ItemId', 'Rating'))

# The schema can be defined as well:
schema = StructType([
    StructField(DEFAULT_USER_COL, IntegerType()),
    StructField(DEFAULT_ITEM_COL, IntegerType()),
    StructField(DEFAULT_RATING_COL, FloatType()),
    StructField(DEFAULT_TIMESTAMP_COL, LongType()),
    ])
spark_df = load_spark_df(spark, '1m', ('UserId', 'ItemId', 'Rating'), schema=schema)

# To load rating's timestamp together:
spark_df = load_spark_df(spark, '1m', ('UserId', 'ItemId', 'Rating', 'Timestamp'))

# To load movie's title, genres, and released year info along with the ratings data:
spark_df = load_spark_df(spark, '1m', ('UserId', 'ItemId', 'Rating', 'Timestamp'),
    title_col='Title',
    genres_col='Genres',
    year_col='Year'
)

# On DataBricks, pass the dbutils argument as follows:
spark_df = load_spark_df(spark, dbutils=dbutils)

Download utilities

recommenders.datasets.download_utils.download_path(path=None)[source]

Return a path to download data. If path=None, it yields a temporary path that is eventually deleted; otherwise it yields the real path of the input.

Parameters:path (str) – Path to download data.
Returns:Real path where the data is stored.
Return type:str

Examples

>>> with download_path() as path:
...     maybe_download(url="http://example.com/file.zip", work_directory=path)
recommenders.datasets.download_utils.maybe_download(url, filename=None, work_directory='.', expected_bytes=None)[source]

Download a file if it is not already downloaded.

Parameters:
  • filename (str) – File name.
  • work_directory (str) – Working directory.
  • url (str) – URL of the file to download.
  • expected_bytes (int) – Expected file size in bytes.
Returns:

File path of the file downloaded.

Return type:

str

recommenders.datasets.download_utils.unzip_file(zip_src, dst_dir, clean_zip_file=False)[source]

Unzip a file

Parameters:
  • zip_src (str) – Zip file.
  • dst_dir (str) – Destination folder.
  • clean_zip_file (bool) – Whether or not to remove the zip file after extraction.
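
A short sketch chaining the three helpers above; the URL and archive name are placeholders:

from recommenders.datasets.download_utils import download_path, maybe_download, unzip_file

with download_path() as path:  # temporary directory, removed on exit
    zip_file = maybe_download(url='http://example.com/data.zip', work_directory=path)
    unzip_file(zip_file, dst_dir=path, clean_zip_file=False)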

Cosmos CLI utilities

Pandas dataframe utilities

class recommenders.datasets.pandas_df_utils.LibffmConverter(filepath=None)[source]

Converts an input dataframe to another dataframe in libffm format. A text file of the converted Dataframe is optionally generated.

Note

The input dataframe is expected to represent the feature data in the following schema:

|field-1|field-2|...|field-n|rating|
|feature-1-1|feature-2-1|...|feature-n-1|1|
|feature-1-2|feature-2-2|...|feature-n-2|0|
...
|feature-1-i|feature-2-j|...|feature-n-k|0|

where:

1. Each field-* is the column name of the dataframe (the label/rating column is excluded).

2. Each feature-*-* can be either a string or a numerical value, representing the categorical variable or the actual numerical variable of the feature value in the field, respectively.

3. If there are ordinal variables represented in int types, users should make sure these columns are properly converted to string type.

The above data will be converted to the libffm format by following the convention as explained in this paper.

i.e. <field_index>:<field_feature_index>:1 or <field_index>:<field_feature_index>:<field_feature_value>, depending on the data type of the features in the original dataframe.

Parameters:filepath (str) – path to save the converted data.
field_count

count of fields in the libffm format data

Type:int
feature_count

count of features in the libffm format data

Type:int
filepath

file path where the output is stored - it can be None or a string

Type:str or None

Examples

>>> import pandas as pd
>>> df_feature = pd.DataFrame({
        'rating': [1, 0, 0, 1, 1],
        'field1': ['xxx1', 'xxx2', 'xxx4', 'xxx4', 'xxx4'],
        'field2': [3, 4, 5, 6, 7],
        'field3': [1.0, 2.0, 3.0, 4.0, 5.0],
        'field4': ['1', '2', '3', '4', '5']
    })
>>> converter = LibffmConverter().fit(df_feature, col_rating='rating')
>>> df_out = converter.transform(df_feature)
>>> df_out
    rating field1 field2   field3 field4
0       1  1:1:1  2:4:3  3:5:1.0  4:6:1
1       0  1:2:1  2:4:4  3:5:2.0  4:7:1
2       0  1:3:1  2:4:5  3:5:3.0  4:8:1
3       1  1:3:1  2:4:6  3:5:4.0  4:9:1
4       1  1:3:1  2:4:7  3:5:5.0  4:10:1
__init__(filepath=None)[source]

Initialize self. See help(type(self)) for accurate signature.

fit(df, col_rating='rating')[source]

Fit the dataframe for libffm format. This method does nothing but check the validity of the input columns

Parameters:
  • df (pandas.DataFrame) – input Pandas dataframe.
  • col_rating (str) – rating of the data.
Returns:

the instance of the converter

Return type:

object

fit_transform(df, col_rating='rating')[source]

Do fit and transform in a single step

Parameters:
  • df (pandas.DataFrame) – input Pandas dataframe.
  • col_rating (str) – rating of the data.
Returns:

Output libffm format dataframe.

Return type:

pandas.DataFrame

get_params()[source]

Get parameters (attributes) of the libffm converter

Returns:A dictionary that contains parameters field count, feature count, and file path.
Return type:dict
transform(df)[source]

Transform an input dataset with the same schema (column names and dtypes) to libffm format by using the fitted converter.

Parameters:df (pandas.DataFrame) – input Pandas dataframe.
Returns:Output libffm format dataframe.
Return type:pandas.DataFrame
class recommenders.datasets.pandas_df_utils.PandasHash(pandas_object)[source]

Wrapper class to allow pandas objects (DataFrames or Series) to be hashable

__init__(pandas_object)[source]

Initialize class

Parameters:pandas_object (pandas.DataFrame|pandas.Series) – pandas object
recommenders.datasets.pandas_df_utils.filter_by(df, filter_by_df, filter_by_cols)[source]

From the input DataFrame df, remove the records whose values in the target columns filter_by_cols exist in the filter-by DataFrame filter_by_df.

Parameters:
  • df (pandas.DataFrame) – Source dataframe.
  • filter_by_df (pandas.DataFrame) – Filter dataframe.
  • filter_by_cols (iterable of str) – Filter columns.
Returns:

Dataframe filtered by filter_by_df on filter_by_cols.

Return type:

pandas.DataFrame
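
A small illustration of the filtering semantics:

import pandas as pd
from recommenders.datasets.pandas_df_utils import filter_by

df = pd.DataFrame({'userID': [1, 1, 2, 3], 'itemID': [10, 11, 10, 12]})
seen = pd.DataFrame({'userID': [1], 'itemID': [10]})

# Drops the rows of df whose (userID, itemID) pair appears in `seen`.
filtered = filter_by(df, seen, ['userID', 'itemID'])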

recommenders.datasets.pandas_df_utils.has_columns(df, columns)[source]

Check if DataFrame has necessary columns

Parameters:
  • df (pandas.DataFrame) – DataFrame
  • columns (list(str)) – columns to check for
Returns:

True if DataFrame has specified columns.

Return type:

bool

recommenders.datasets.pandas_df_utils.has_same_base_dtype(df_1, df_2, columns=None)[source]

Check if specified columns have the same base dtypes across both DataFrames

Parameters:
  • df_1 (pandas.DataFrame) – first DataFrame
  • df_2 (pandas.DataFrame) – second DataFrame
  • columns (list(str)) – columns to check, None checks all columns
Returns:

True if DataFrames columns have the same base dtypes.

Return type:

bool

recommenders.datasets.pandas_df_utils.lru_cache_df(maxsize, typed=False)[source]

Least-recently-used cache decorator for pandas Dataframes.

Decorator to wrap a function with a memoizing callable that saves up to the maxsize most recent calls. It can save time when an expensive or I/O bound function is periodically called with the same arguments.

Inspired by the lru_cache function.

Parameters:
  • maxsize (int|None) – max size of cache, if set to None cache is boundless
  • typed (bool) – arguments of different types are cached separately
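
A sketch of the decorator in use; the decorated function is hypothetical:

import pandas as pd
from recommenders.datasets.pandas_df_utils import lru_cache_df

@lru_cache_df(maxsize=2)
def expensive_transform(df):
    # Stand-in for an expensive or I/O bound computation.
    return df.assign(squared=df['value'] ** 2)

data = pd.DataFrame({'value': [1, 2, 3]})
first = expensive_transform(data)   # computed
second = expensive_transform(data)  # served from the cache
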
recommenders.datasets.pandas_df_utils.negative_feedback_sampler(df, col_user='userID', col_item='itemID', col_label='label', col_feedback='feedback', ratio_neg_per_user=1, pos_value=1, neg_value=0, seed=42)[source]

Utility function to sample negative feedback from user-item interaction dataset. This negative sampling function will take the user-item interaction data to create binarized feedback, i.e., 1 and 0 indicate positive and negative feedback, respectively.

Negative sampling is used in the literature frequently to generate negative samples from a user-item interaction data.

See for example the neural collaborative filtering paper.

Parameters:
  • df (pandas.DataFrame) – input data that contains user-item tuples.
  • col_user (str) – user id column name.
  • col_item (str) – item id column name.
  • col_label (str) – label column name in df.
  • col_feedback (str) – feedback column name in the returned data frame; it is used for the generated column of positive and negative feedback.
  • ratio_neg_per_user (int) – ratio of negative feedback w.r.t. the number of positive feedback samples for each user. If the requested number exceeds the total number of possible negative feedback samples, it is reduced to the number of all possible samples.
  • pos_value (float) – value of positive feedback.
  • neg_value (float) – value of negative feedback.
  • inplace (bool) –
  • seed (int) – seed for the random state of the sampling function.
Returns:

Data with negative feedback.

Return type:

pandas.DataFrame

Examples

>>> import pandas as pd
>>> df = pd.DataFrame({
    'userID': [1, 2, 3],
    'itemID': [1, 2, 3],
    'rating': [5, 5, 5]
})
>>> df_neg_sampled = negative_feedback_sampler(
    df, col_user='userID', col_item='itemID', ratio_neg_per_user=1
)
>>> df_neg_sampled
userID  itemID  feedback
1   1   1
1   2   0
2   2   1
2   1   0
3   3   1
3   1   0
recommenders.datasets.pandas_df_utils.user_item_pairs(user_df, item_df, user_col='userID', item_col='itemID', user_item_filter_df=None, shuffle=True, seed=None)[source]

Get all pairs of users and items data.

Parameters:
  • user_df (pandas.DataFrame) – User data containing unique user ids and maybe their features.
  • item_df (pandas.DataFrame) – Item data containing unique item ids and maybe their features.
  • user_col (str) – User id column name.
  • item_col (str) – Item id column name.
  • user_item_filter_df (pd.DataFrame) – User-item pairs to be used as a filter.
  • shuffle (bool) – If True, shuffles the result.
  • seed (int) – Random seed for shuffle
Returns:

All pairs of user-item from user_df and item_df, except the pairs in user_item_filter_df.

Return type:

pandas.DataFrame

Splitter utilities

recommenders.datasets.python_splitters.numpy_stratified_split(X, ratio=0.75, seed=42)[source]

Split the user/item affinity matrix (sparse matrix) into train and test set matrices while maintaining local (i.e. per user) ratios.

Main points :

1. In a typical recommender problem, different users rate a different number of items, so the user/item affinity matrix has a sparse structure with a variable number of zeroes (unrated items) per row (user). Cutting a fixed total number of ratings would result in a non-homogeneous distribution between the train and test sets, i.e. some test users may have many ratings while others may have very few, if any.

2. In an unsupervised learning problem, no explicit answer is given. For this reason the split needs to be implemented in a different way than in supervised learning. In the latter, one typically splits the dataset by rows (by examples), ending up with the same number of features but a different number of examples in the train/test sets. This scheme does not work in the unsupervised case, as part of the rated items needs to be used as a test set for a fixed number of users.

Solution:

1. Instead of cutting a total percentage, for each user we cut a relative ratio of the rated items. For example, if user1 has rated 4 items and user2 has rated 10, cutting 25% corresponds to 1 and 2.5 ratings in the test set, which are then rounded to integer counts. In this way, the 0.75 ratio is approximately satisfied both locally and globally, preserving the original distribution of ratings across the train and test sets.

2. It is easy (and fast) to satisfy these requirements by creating the test set via element subtraction from the original dataset X. We first create two copies of X; for each user we select a random sample of local size ratio (point 1) and erase the remaining ratings, obtaining in this way the train set matrix Xtr. The test set matrix Xtst is obtained in the opposite way.

Parameters:
  • X (numpy.ndarray, int) – a sparse matrix to be split
  • ratio (float) – fraction of the entire dataset to constitute the train set
  • seed (int) – random seed
Returns:

  • Xtr: The train set user/item affinity matrix.
  • Xtst: The test set user/item affinity matrix.

Return type:

numpy.ndarray, numpy.ndarray
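
A minimal sketch on a toy affinity matrix:

import numpy as np
from recommenders.datasets.python_splitters import numpy_stratified_split

# Rows are users, columns are items, 0 marks an unrated item.
X = np.array([
    [5, 3, 0, 1],
    [4, 0, 0, 1],
    [1, 1, 0, 5],
])
Xtr, Xtst = numpy_stratified_split(X, ratio=0.75, seed=42)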

recommenders.datasets.python_splitters.python_chrono_split(data, ratio=0.75, min_rating=1, filter_by='user', col_user='userID', col_item='itemID', col_timestamp='timestamp')[source]

Pandas chronological splitter.

This function splits data in a chronological manner. That is, for each user / item, the split function takes proportions of ratings which are specified by the split ratio(s). The split is stratified.

Parameters:
  • data (pandas.DataFrame) – Pandas DataFrame to be split.
  • ratio (float or list) – Ratio for splitting data. If it is a single float number it splits data into two halves and the ratio argument indicates the ratio of training data set; if it is a list of float numbers, the splitter splits data into several portions corresponding to the split ratios. If a list is provided and the ratios are not summed to 1, they will be normalized.
  • seed (int) – Seed.
  • min_rating (int) – minimum number of ratings for user or item.
  • filter_by (str) – either “user” or “item”, depending on which of the two is to filter with min_rating.
  • col_user (str) – column name of user IDs.
  • col_item (str) – column name of item IDs.
  • col_timestamp (str) – column name of timestamps.
Returns:

Splits of the input data as pandas.DataFrame.

Return type:

list

recommenders.datasets.python_splitters.python_random_split(data, ratio=0.75, seed=42)[source]

Pandas random splitter.

The splitter randomly splits the input data.

Parameters:
  • data (pandas.DataFrame) – Pandas DataFrame to be split.
  • ratio (float or list) – Ratio for splitting data. If it is a single float number it splits data into two halves and the ratio argument indicates the ratio of training data set; if it is a list of float numbers, the splitter splits data into several portions corresponding to the split ratios. If a list is provided and the ratios are not summed to 1, they will be normalized.
  • seed (int) – Seed.
Returns:

Splits of the input data as pandas.DataFrame.

Return type:

list
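
For example, a 70/15/15 split of a small ratings frame:

import pandas as pd
from recommenders.datasets.python_splitters import python_random_split

data = pd.DataFrame({
    'userID': [1, 1, 2, 2, 3, 3],
    'itemID': [1, 2, 1, 3, 2, 3],
    'rating': [5, 4, 3, 2, 5, 4],
})
# A list of ratios yields one split per ratio; ratios are normalized if they do not sum to 1.
train, valid, test = python_random_split(data, ratio=[0.70, 0.15, 0.15], seed=42)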

recommenders.datasets.python_splitters.python_stratified_split(data, ratio=0.75, min_rating=1, filter_by='user', col_user='userID', col_item='itemID', seed=42)[source]

Pandas stratified splitter.

For each user / item, the split function takes proportions of ratings which are specified by the split ratio(s). The split is stratified.

Parameters:
  • data (pandas.DataFrame) – Pandas DataFrame to be split.
  • ratio (float or list) – Ratio for splitting data. If it is a single float number it splits data into two halves and the ratio argument indicates the ratio of training data set; if it is a list of float numbers, the splitter splits data into several portions corresponding to the split ratios. If a list is provided and the ratios are not summed to 1, they will be normalized.
  • seed (int) – Seed.
  • min_rating (int) – minimum number of ratings for user or item.
  • filter_by (str) – either “user” or “item”, depending on which of the two is to filter with min_rating.
  • col_user (str) – column name of user IDs.
  • col_item (str) – column name of item IDs.
Returns:

Splits of the input data as pandas.DataFrame.

Return type:

list

recommenders.datasets.spark_splitters.spark_chrono_split(data, ratio=0.75, min_rating=1, filter_by='user', col_user='userID', col_item='itemID', col_timestamp='timestamp', no_partition=False)[source]

Spark chronological splitter.

This function splits data in a chronological manner. That is, for each user / item, the split function takes proportions of ratings which are specified by the split ratio(s). The split is stratified.

Parameters:
  • data (pyspark.sql.DataFrame) – Spark DataFrame to be split.
  • ratio (float or list) – Ratio for splitting data. If it is a single float number it splits data into two sets and the ratio argument indicates the ratio of training data set; if it is a list of float numbers, the splitter splits data into several portions corresponding to the split ratios. If a list is provided and the ratios are not summed to 1, they will be normalized.
  • min_rating (int) – minimum number of ratings for user or item.
  • filter_by (str) – either “user” or “item”, depending on which of the two is to filter with min_rating.
  • col_user (str) – column name of user IDs.
  • col_item (str) – column name of item IDs.
  • col_timestamp (str) – column name of timestamps.
  • no_partition (bool) – set to enable more accurate and less efficient splitting.
Returns:

Splits of the input data as pyspark.sql.DataFrame.

Return type:

list

recommenders.datasets.spark_splitters.spark_random_split(data, ratio=0.75, seed=42)[source]

Spark random splitter.

Randomly split the data into several splits.

Parameters:
  • data (pyspark.sql.DataFrame) – Spark DataFrame to be split.
  • ratio (float or list) – Ratio for splitting data. If it is a single float number it splits data into two halves and the ratio argument indicates the ratio of training data set; if it is a list of float numbers, the splitter splits data into several portions corresponding to the split ratios. If a list is provided and the ratios are not summed to 1, they will be normalized.
  • seed (int) – Seed.
Returns:

Splits of the input data as pyspark.sql.DataFrame.

Return type:

list

recommenders.datasets.spark_splitters.spark_stratified_split(data, ratio=0.75, min_rating=1, filter_by='user', col_user='userID', col_item='itemID', seed=42)[source]

Spark stratified splitter.

For each user / item, the split function takes proportions of ratings which are specified by the split ratio(s). The split is stratified.

Parameters:
  • data (pyspark.sql.DataFrame) – Spark DataFrame to be split.
  • ratio (float or list) – Ratio for splitting data. If it is a single float number it splits data into two sets and the ratio argument indicates the ratio of the training data set; if it is a list of float numbers, the splitter splits data into several portions corresponding to the split ratios. If a list is provided and the ratios are not summed to 1, they will be normalized.
  • seed (int) – Seed.
  • min_rating (int) – minimum number of ratings for user or item.
  • filter_by (str) – either “user” or “item”, depending on which of the two is to filter with min_rating.
  • col_user (str) – column name of user IDs.
  • col_item (str) – column name of item IDs.
Returns:

Splits of the input data as pyspark.sql.DataFrame.

Return type:

list

recommenders.datasets.spark_splitters.spark_timestamp_split(data, ratio=0.75, col_user='userID', col_item='itemID', col_timestamp='timestamp')[source]

Spark timestamp based splitter.

The splitter splits the data into sets by timestamps without stratification on either user or item. The ratios are applied on the timestamp column which is divided accordingly into several partitions.

Parameters:
  • data (pyspark.sql.DataFrame) – Spark DataFrame to be split.
  • ratio (float or list) – Ratio for splitting data. If it is a single float number it splits data into two sets and the ratio argument indicates the ratio of training data set; if it is a list of float numbers, the splitter splits data into several portions corresponding to the split ratios. If a list is provided and the ratios are not summed to 1, they will be normalized. Earlier indexed splits will have earlier times (e.g. the latest time in split[0] <= the earliest time in split[1])
  • col_user (str) – column name of user IDs.
  • col_item (str) – column name of item IDs.
  • col_timestamp (str) – column name of timestamps. Float number represented in seconds since the Epoch.
Returns:

Splits of the input data as pyspark.sql.DataFrame.

Return type:

list

recommenders.datasets.split_utils.filter_k_core(data, core_num=0, col_user='userID', col_item='itemID')[source]

Filter rating dataframe for minimum number of users and items by repeatedly applying min_rating_filter until the condition is satisfied.

recommenders.datasets.split_utils.min_rating_filter_pandas(data, min_rating=1, filter_by='user', col_user='userID', col_item='itemID')[source]

Filter rating DataFrame for each user with minimum rating.

Filtering the rating data frame by a minimum number of ratings per user/item is usually useful to generate a new data frame with warm users/items. The warmth is defined by the min_rating argument. For example, a user is called warm if they have rated at least 4 items.

Parameters:
  • data (pandas.DataFrame) – DataFrame of user-item tuples. Columns of user and item should be present in the DataFrame while other columns like rating, timestamp, etc. can be optional.
  • min_rating (int) – minimum number of ratings for user or item.
  • filter_by (str) – either “user” or “item”, depending on which of the two is to filter with min_rating.
  • col_user (str) – column name of user ID.
  • col_item (str) – column name of item ID.
Returns:

DataFrame with at least columns of user and item that has been filtered by the given specifications.

Return type:

pandas.DataFrame
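
For example, keeping only warm users with at least two ratings:

import pandas as pd
from recommenders.datasets.split_utils import min_rating_filter_pandas

ratings = pd.DataFrame({
    'userID': [1, 1, 1, 2, 3],
    'itemID': [10, 11, 12, 10, 11],
    'rating': [5, 4, 3, 2, 1],
})
# Users 2 and 3 have a single rating each and are filtered out.
warm = min_rating_filter_pandas(ratings, min_rating=2, filter_by='user')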

recommenders.datasets.split_utils.min_rating_filter_spark(data, min_rating=1, filter_by='user', col_user='userID', col_item='itemID')[source]

Filter rating DataFrame for each user with minimum rating.

Filtering the rating data frame by a minimum number of ratings per user/item is usually useful to generate a new data frame with warm users/items. The warmth is defined by the min_rating argument. For example, a user is called warm if they have rated at least 4 items.

Parameters:
  • data (pyspark.sql.DataFrame) – DataFrame of user-item tuples. Columns of user and item should be present in the DataFrame while other columns like rating, timestamp, etc. can be optional.
  • min_rating (int) – minimum number of ratings for user or item.
  • filter_by (str) – either “user” or “item”, depending on which of the two is to filter with min_rating.
  • col_user (str) – column name of user ID.
  • col_item (str) – column name of item ID.
Returns:

DataFrame with at least columns of user and item that has been filtered by the given specifications.

Return type:

pyspark.sql.DataFrame

recommenders.datasets.split_utils.process_split_ratio(ratio)[source]

Generate split ratio lists.

Parameters:
  • ratio (float or list) – a float number that indicates the split ratio, or a list of float numbers that indicate split ratios.
Returns:

  • bool: A boolean variable multi that indicates if the splitting is multi or single.
  • list: A list of normalized split ratios.

Return type:

tuple
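
For example (the return values shown in the comment follow the description above and are indicative):

from recommenders.datasets.split_utils import process_split_ratio

# A list that does not sum to 1 is normalized; multi indicates a multi-way split.
multi, ratios = process_split_ratio([3, 1, 1])  # multi == True, ratios == [0.6, 0.2, 0.2]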

recommenders.datasets.split_utils.split_pandas_data_with_ratios(data, ratios, seed=42, shuffle=False)[source]

Helper function to split pandas DataFrame with given ratios

Note

Implementation referenced from this source.

Parameters:
  • data (pandas.DataFrame) – Pandas data frame to be split.
  • ratios (list of floats) – list of ratios for split. The ratios have to sum to 1.
  • seed (int) – random seed.
  • shuffle (bool) – whether data will be shuffled when being split.
Returns:

List of pd.DataFrame split by the given specifications.

Return type:

list

Sparse utilities

class recommenders.datasets.sparse.AffinityMatrix(df, items_list=None, col_user='userID', col_item='itemID', col_rating='rating', col_pred='prediction', save_path=None)[source]

Generate the user/item affinity matrix from a pandas dataframe and vice versa

__init__(df, items_list=None, col_user='userID', col_item='itemID', col_rating='rating', col_pred='prediction', save_path=None)[source]

Initialize class parameters

Parameters:
  • df (pandas.DataFrame) – a dataframe containing the data
  • items_list (numpy.ndarray) – a list of unique items to use (if provided)
  • col_user (str) – default name for user column
  • col_item (str) – default name for item column
  • col_rating (str) – default name for rating columns
  • save_path (str) – default path to save item/user maps
gen_affinity_matrix()[source]

Generate the user/item affinity matrix.

As a first step, two new columns are added to the input DF, containing the index maps generated by the gen_index() method. The new indices, together with the ratings, are then used to generate the user/item affinity matrix using scipy’s sparse matrix method coo_matrix; for reference see: https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.coo_matrix.html. The input format is: coo_matrix((data, (rows, columns)), shape=(rows, columns))

Returns:User/item affinity matrix of dimensions (Nusers, Nitems). Unrated items are assigned a value of 0.
Return type:scipy.sparse.coo_matrix
map_back_sparse(X, kind)[source]

Map the user/item affinity matrix back to a pandas dataframe

Parameters:
  • X (numpy.ndarray, int32) – user/item affinity matrix
  • kind (string) – specify if the output values are ratings or predictions
Returns:

the generated pandas dataframe

Return type:

pandas.DataFrame
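
A sketch of the round trip from a ratings dataframe to the affinity matrix and back, following the return types documented above (if your version also returns the user/item index maps from gen_affinity_matrix, adjust the unpacking accordingly):

import pandas as pd
from recommenders.datasets.sparse import AffinityMatrix

ratings = pd.DataFrame({
    'userID': [1, 1, 2, 3],
    'itemID': [10, 11, 10, 12],
    'rating': [5.0, 3.0, 4.0, 2.0],
})
am = AffinityMatrix(df=ratings, col_user='userID', col_item='itemID', col_rating='rating')
X = am.gen_affinity_matrix()                     # (Nusers, Nitems) affinity matrix, 0 for unrated pairs
df_back = am.map_back_sparse(X, kind='ratings')  # kind selects whether values are ratings or predictions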

Knowledge graph utilities

recommenders.datasets.wikidata.find_wikidata_id(name, limit=1, session=None)[source]

Find the entity ID in wikidata from a title string.

Parameters:
  • name (str) – A string with search terms (eg. “Batman (1989) film”)
  • limit (int) – Number of results to return
  • session (requests.Session) – requests session to reuse connections
Returns:

wikidata entityID corresponding to the title string. ‘entityNotFound’ will be returned if no page is found

Return type:

str

recommenders.datasets.wikidata.get_session(session=None)[source]

Get session object

Parameters:session (requests.Session) – request session object
Returns:request session object
Return type:requests.Session
recommenders.datasets.wikidata.query_entity_description(entity_id, session=None)[source]

Query entity wikidata description from entityID

Parameters:
  • entity_id (str) – A wikidata page ID.
  • session (requests.Session) – requests session to reuse connections
Returns:

Wikidata short description of the entityID. ‘descriptionNotFound’ will be returned if no description is found

Return type:

str

recommenders.datasets.wikidata.query_entity_links(entity_id, session=None)[source]

Query all linked pages from a wikidata entityID

Parameters:
  • entity_id (str) – A wikidata entity ID
  • session (requests.Session) – requests session to reuse connections
Returns:

Dictionary with linked pages.

Return type:

json

recommenders.datasets.wikidata.read_linked_entities(data)[source]

Obtain lists of linked entities (IDs and names) from the dictionary

Parameters:data (json) – dictionary with linked pages
Returns:
  • List of linked entityIDs.
  • List of linked entity names.
Return type:list, list
recommenders.datasets.wikidata.search_wikidata(names, extras=None, describe=True, verbose=False)[source]

Create DataFrame of Wikidata search results

Parameters:
  • names (list[str]) – List of names to search for
  • extras (dict(str: list)) – Optional extra items to assign to results for corresponding name
  • describe (bool) – Optional flag to include description of entity
  • verbose (bool) – Optional flag to print out intermediate data
Returns:

Wikidata results for all names with found entities

Return type:

pandas.DataFrame
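
A short sketch of a typical lookup flow; the example titles are arbitrary:

from recommenders.datasets.wikidata import find_wikidata_id, query_entity_description, search_wikidata

entity_id = find_wikidata_id('The Matrix (1999 film)')
description = query_entity_description(entity_id)

# Batch search returning a pandas DataFrame of results.
results = search_wikidata(['The Matrix', 'Inception'], describe=True)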