Recommender algorithms module

Recommender system algorithms and utilities.

Cornac

recommenders.models.cornac.cornac_utils.predict(model, data, usercol='userID', itemcol='itemID', predcol='prediction')[source]

Computes predictions of a recommender model from Cornac on the data. Can be used for computing rating metrics like RMSE.

Parameters:
  • model (cornac.models.Recommender) – A recommender model from Cornac
  • data (pandas.DataFrame) – The data on which to predict
  • usercol (str) – Name of the user column
  • itemcol (str) – Name of the item column
  • predcol (str) – Name of the prediction column
Returns:

Dataframe with usercol, itemcol, predcol

Return type:

pandas.DataFrame
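
As a rough illustration of how such predictions feed a rating metric like RMSE, the sketch below scores a set of (user, item) pairs and compares the result with known ratings. It does not call Cornac: `predict_pairs`, `rmse`, and `constant_model` are hypothetical stand-ins, and rows are plain tuples rather than a pandas.DataFrame.

```python
import math


def predict_pairs(score_fn, pairs):
    """Return (user, item, prediction) rows for the given (user, item) pairs."""
    return [(user, item, score_fn(user, item)) for user, item in pairs]


def rmse(true_ratings, predictions):
    """Root mean squared error over the (user, item) pairs found in both inputs."""
    predicted = {(u, i): p for u, i, p in predictions}
    squared_errors = [
        (rating - predicted[(u, i)]) ** 2
        for u, i, rating in true_ratings
        if (u, i) in predicted
    ]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def constant_model(user, item):
    """Hypothetical stand-in for a trained model: always predicts 3.0."""
    return 3.0


ratings = [("u1", "i1", 4.0), ("u1", "i2", 2.0), ("u2", "i1", 3.0)]
preds = predict_pairs(constant_model, [(u, i) for u, i, _ in ratings])
print(rmse(ratings, preds))
```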

recommenders.models.cornac.cornac_utils.predict_ranking(model, data, usercol='userID', itemcol='itemID', predcol='prediction', remove_seen=False)[source]

Computes predictions of a recommender model from Cornac on all users and items in data. It can be used for computing ranking metrics like NDCG.

Parameters:
  • model (cornac.models.Recommender) – A recommender model from Cornac
  • data (pandas.DataFrame) – The data from which to get the users and items
  • usercol (str) – Name of the user column
  • itemcol (str) – Name of the item column
  • predcol (str) – Name of the prediction column
  • remove_seen (bool) – Flag to remove (user, item) pairs seen in the training data
Returns:

Dataframe with usercol, itemcol, predcol

Return type:

pandas.DataFrame
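
The idea of scoring every user against every item, optionally dropping pairs already seen, can be sketched as follows. This is only an illustration under simplifying assumptions: `rank_all_pairs` and `score_fn` are hypothetical names, the interactions are plain tuples instead of a DataFrame, and the actual Cornac-based implementation may differ.

```python
from itertools import product


def rank_all_pairs(score_fn, interactions, remove_seen=False):
    """Score every (user, item) combination found in `interactions`.

    `interactions` is a list of (user, item) tuples. When `remove_seen`
    is True, pairs that already appear in `interactions` are skipped.
    """
    users = sorted({user for user, _ in interactions})
    items = sorted({item for _, item in interactions})
    seen = set(interactions)
    rows = []
    for user, item in product(users, items):
        if remove_seen and (user, item) in seen:
            continue
        rows.append((user, item, score_fn(user, item)))
    return rows


def toy_score(user, item):
    """Hypothetical scoring function used only for this example."""
    return float(len(user) + len(item))


observed = [("u1", "a"), ("u2", "b")]
all_rows = rank_all_pairs(toy_score, observed)
unseen_rows = rank_all_pairs(toy_score, observed, remove_seen=True)
```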

DeepRec

Base model

class recommenders.models.deeprec.models.base_model.BaseModel(hparams, iterator_creator, graph=None, seed=None)[source]

Base class for models

__init__(hparams, iterator_creator, graph=None, seed=None)[source]

Initialize the model. Create the common logic needed by all DeepRec models, such as the loss function and parameter set.

Parameters:
  • hparams (object) – A tf.contrib.training.HParams object that holds the entire set of hyperparameters.
  • iterator_creator (object) – An iterator to load the data.
  • graph (object) – An optional graph.
  • seed (int) – Random seed.
eval(sess, feed_dict)[source]

Evaluate the data in feed_dict with current model.

Parameters:
  • sess (object) – The model session object.
  • feed_dict (dict) – Feed values for evaluation. This is a dictionary that maps graph elements to values.
Returns:

A list of evaluated results, including total loss value, data loss value, predicted scores, and ground-truth labels.

Return type:

list

fit(train_file, valid_file, test_file=None)[source]

Fit the model with train_file. Evaluate the model on valid_file per epoch to observe the training status. If test_file is not None, evaluate it too.

Parameters:
  • train_file (str) – training data set.
  • valid_file (str) – validation set.
  • test_file (str) – test set.
Returns:

An instance of self.

Return type:

object

group_labels(labels, preds, group_keys)[source]

Divide labels and preds into several groups according to the values in group_keys.

Parameters:
  • labels (list) – ground truth label list.
  • preds (list) – prediction score list.
  • group_keys (list) – group key list.
Returns:

  • Labels after group.
  • Predictions after group.

Return type:

list, list
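
A minimal sketch of this kind of grouping is shown below. It is an illustrative reimplementation, not the library code; the function name `group_by_key` is hypothetical, and groups are returned in order of first appearance, which is an assumption of this example.

```python
def group_by_key(labels, preds, group_keys):
    """Split labels and preds into per-group lists, keyed by group_keys."""
    grouped = {}
    for label, pred, key in zip(labels, preds, group_keys):
        # Each group holds a pair of lists: (labels, predictions).
        group_labels, group_preds = grouped.setdefault(key, ([], []))
        group_labels.append(label)
        group_preds.append(pred)
    all_labels = [pair[0] for pair in grouped.values()]
    all_preds = [pair[1] for pair in grouped.values()]
    return all_labels, all_preds


labels_by_group, preds_by_group = group_by_key(
    labels=[1, 0, 0, 1],
    preds=[0.9, 0.2, 0.4, 0.7],
    group_keys=["imp1", "imp1", "imp2", "imp2"],
)
```

Grouping in this way is what makes per-impression metrics such as group_auc possible, since each group can then be scored independently.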

infer(sess, feed_dict)[source]

Given feature data (in feed_dict), get predicted scores with current model.

Parameters:
  • sess (object) – The model session object.
  • feed_dict (dict) – Instances to predict. This is a dictionary that maps graph elements to values.
Returns:

Predicted scores for the given instances.

Return type:

list

load_model(model_path=None)[source]

Load an existing model.

Parameters:model_path – model path.
Raises:IOError – if the restore operation failed.
predict(infile_name, outfile_name)[source]

Make predictions on the given data, and output predicted scores to a file.

Parameters:
  • infile_name (str) – Input file name, format is same as train/val/test file.
  • outfile_name (str) – Output file name, each line is the predict score.
Returns:

An instance of self.

Return type:

object

run_eval(filename)[source]

Evaluate the given file and return some evaluation metrics.

Parameters:filename (str) – A file name that will be evaluated.
Returns:A dictionary that contains evaluation metrics.
Return type:dict
train(sess, feed_dict)[source]

Go through the optimization step once with training data in feed_dict.

Parameters:
  • sess (object) – The model session object.
  • feed_dict (dict) – Feed values to train the model. This is a dictionary that maps graph elements to values.
Returns:

A list of values, including update operation, total loss, data loss, and merged summary.

Return type:

list

DKN

class recommenders.models.deeprec.models.dkn.DKN(hparams, iterator_creator)[source]

DKN model (Deep Knowledge-Aware Network)

Citation:H. Wang, F. Zhang, X. Xie and M. Guo, “DKN: Deep Knowledge-Aware Network for News Recommendation”, in Proceedings of the 2018 World Wide Web Conference on World Wide Web, 2018.
__init__(hparams, iterator_creator)[source]

Initialization steps for DKN. Compared with the BaseModel, DKN requires two different pre-computed embeddings, i.e. word embedding and entity embedding. After creating these two embedding variables, BaseModel’s __init__ method will be called.

Parameters:
  • hparams (object) – Global hyper-parameters.
  • iterator_creator (object) – DKN data loader class.
infer_embedding(sess, feed_dict)[source]

Infer document embedding in feed_dict with current model.

Parameters:
  • sess (object) – The model session object.
  • feed_dict (dict) – Feed values for evaluation. This is a dictionary that maps graph elements to values.
Returns:

News embedding in a batch.

Return type:

list

run_get_embedding(infile_name, outfile_name)[source]

Infer document embeddings with the current model.

Parameters:
  • infile_name (str) – Input file name, format is [Newsid] [w1,w2,w3…] [e1,e2,e3…]
  • outfile_name (str) – Output file name, format is [Newsid] [embedding]
Returns:

An instance of self.

Return type:

object

DKN item-to-item

class recommenders.models.deeprec.models.dkn_item2item.DKNItem2Item(hparams, iterator_creator)[source]

Class for item-to-item recommendations using DKN. See https://github.com/microsoft/recommenders/blob/main/examples/07_tutorials/KDD2020-tutorial/step4_run_dkn_item2item.ipynb

eval(sess, feed_dict)[source]

Evaluate the data in feed_dict with current model.

Parameters:
  • sess (object) – The model session object.
  • feed_dict (dict) – Feed values for evaluation. This is a dictionary that maps graph elements to values.
Returns:

A tuple with predictions and labels arrays.

Return type:

numpy.ndarray, numpy.ndarray

run_eval(filename)[source]

Evaluate the given file and return some evaluation metrics.

Parameters:filename (str) – A file name that will be evaluated.
Returns:A dictionary containing evaluation metrics.
Return type:dict

LightGCN

class recommenders.models.deeprec.models.graphrec.lightgcn.LightGCN(hparams, data, seed=None)[source]

LightGCN model

Citation:He, Xiangnan, Kuan Deng, Xiang Wang, Yan Li, Yongdong Zhang, and Meng Wang. “LightGCN: Simplifying and Powering Graph Convolution Network for Recommendation.” arXiv preprint arXiv:2002.02126, 2020.
__init__(hparams, data, seed=None)[source]

Initialize the model. Create parameters, placeholders, embeddings and loss function.

Parameters:
  • hparams (object) – A tf.contrib.training.HParams object that holds the entire set of hyperparameters.
  • data (object) – A recommenders.models.deeprec.DataModel.ImplicitCF object, load and process data.
  • seed (int) – Seed.
fit()[source]

Fit the model on self.data.train. If eval_epoch is not -1, evaluate the model on self.data.test every eval_epoch epochs to observe the training status.

infer_embedding(user_file, item_file)[source]

Export user and item embeddings to csv files.

Parameters:
  • user_file (str) – Path of file to save user embeddings.
  • item_file (str) – Path of file to save item embeddings.
load(model_path=None)[source]

Load an existing model.

Parameters:model_path – Model path.
Raises:IOError – if the restore operation failed.
recommend_k_items(test, top_k=10, sort_top_k=True, remove_seen=True, use_id=False)[source]

Recommend top K items for all users in the test set.

Parameters:
  • test (pandas.DataFrame) – Test data.
  • top_k (int) – Number of top items to recommend.
  • sort_top_k (bool) – Flag to sort top k results.
  • remove_seen (bool) – Flag to remove items seen in training from recommendation.
Returns:

Top k recommendation items for each user.

Return type:

pandas.DataFrame
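
The selection step behind a top-K recommendation can be sketched in plain Python as follows. This is not the LightGCN implementation: `top_k_items`, the `scores` mapping, and the `seen` mapping are hypothetical, and the example assumes scores are already available for every candidate item.

```python
import heapq


def top_k_items(scores, seen, top_k=10, remove_seen=True):
    """Pick the top_k highest-scoring items per user.

    `scores` maps user -> {item: score}; `seen` maps user -> set of items
    the user interacted with during training.
    """
    rows = []
    for user, item_scores in scores.items():
        already_seen = seen.get(user, set()) if remove_seen else set()
        candidates = [
            (item, score)
            for item, score in item_scores.items()
            if item not in already_seen
        ]
        # heapq.nlargest returns the results sorted from best to worst.
        best = heapq.nlargest(top_k, candidates, key=lambda pair: pair[1])
        rows.extend((user, item, score) for item, score in best)
    return rows


scores = {"u1": {"a": 0.9, "b": 0.5, "c": 0.7}}
seen = {"u1": {"a"}}
with_filter = top_k_items(scores, seen, top_k=1)
without_filter = top_k_items(scores, seen, top_k=1, remove_seen=False)
```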

run_eval()[source]

Run evaluation on self.data.test.

Returns:Results of all metrics in self.metrics.
Return type:dict
score(user_ids, remove_seen=True)[source]

Score all items for test users.

Parameters:
  • user_ids (np.array) – Users to test.
  • remove_seen (bool) – Flag to remove items seen in training from recommendation.
Returns:

Value of interest of all items for the users.

Return type:

numpy.ndarray

xDeepFM

class recommenders.models.deeprec.models.xDeepFM.XDeepFMModel(hparams, iterator_creator, graph=None, seed=None)[source]

xDeepFM model

Citation:J. Lian, X. Zhou, F. Zhang, Z. Chen, X. Xie, G. Sun, “xDeepFM: Combining Explicit and Implicit Feature Interactions for Recommender Systems”, in Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD 2018, London, 2018.

Sequential models

Sequential base model

class recommenders.models.deeprec.models.sequential.sequential_base_model.SequentialBaseModel(hparams, iterator_creator, graph=None, seed=None)[source]

Base class for sequential models

__init__(hparams, iterator_creator, graph=None, seed=None)[source]

Initialize the model. Create the common logic needed by all sequential models, such as the loss function and parameter set.

Parameters:
  • hparams (object) – A tf.contrib.training.HParams object that holds the entire set of hyperparameters.
  • iterator_creator (object) – An iterator to load the data.
  • graph (object) – An optional graph.
  • seed (int) – Random seed.
fit(train_file, valid_file, valid_num_ngs, eval_metric='group_auc')[source]

Fit the model with train_file. Evaluate the model on valid_file per epoch to observe the training status.

Parameters:
  • train_file (str) – training data set.
  • valid_file (str) – validation set.
  • valid_num_ngs (int) – The number of negative instances paired with each positive instance in the validation data.
  • eval_metric (str) – The metric that controls early stopping, e.g. “auc”, “group_auc”, etc.
Returns:

An instance of self.

Return type:

object

predict(infile_name, outfile_name)[source]

Make predictions on the given data, and output predicted scores to a file.

Parameters:
  • infile_name (str) – Input file name.
  • outfile_name (str) – Output file name.
Returns:

An instance of self.

Return type:

object

run_eval(filename, num_ngs)[source]

Evaluate the given file and return some evaluation metrics.

Parameters:
  • filename (str) – A file name that will be evaluated.
  • num_ngs (int) – The number of negative samples per positive instance.
Returns:

A dictionary that contains evaluation metrics.

Return type:

dict

A2SVD

class recommenders.models.deeprec.models.sequential.asvd.A2SVDModel(hparams, iterator_creator, graph=None, seed=None)[source]

A2SVD Model (Attentive Asynchronous Singular Value Decomposition)

It extends ASVD with an attention module.

Citation:

ASVD: Y. Koren, “Factorization Meets the Neighborhood: a Multifaceted Collaborative Filtering Model”, in Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 426–434, ACM, 2008.

A2SVD: Z. Yu, J. Lian, A. Mahmoody, G. Liu and X. Xie, “Adaptive User Modeling with Long and Short-Term Preferences for Personalized Recommendation”, in Proceedings of the 28th International Joint Conferences on Artificial Intelligence, IJCAI’19, Pages 4213-4219, AAAI Press, 2019.

Caser

class recommenders.models.deeprec.models.sequential.caser.CaserModel(hparams, iterator_creator, seed=None)[source]

Caser Model

Citation:J. Tang and K. Wang, “Personalized top-n sequential recommendation via convolutional sequence embedding”, in Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining, ACM, 2018.
__init__(hparams, iterator_creator, seed=None)[source]

Initialization of variables for Caser.

Parameters:
  • hparams (object) – A tf.contrib.training.HParams object that holds the entire set of hyperparameters.
  • iterator_creator (object) – An iterator to load the data.

GRU4Rec

class recommenders.models.deeprec.models.sequential.gru4rec.GRU4RecModel(hparams, iterator_creator, graph=None, seed=None)[source]

GRU4Rec Model

Citation:B. Hidasi, A. Karatzoglou, L. Baltrunas, D. Tikk, “Session-based Recommendations with Recurrent Neural Networks”, ICLR (Poster), 2016.

NextItNet

class recommenders.models.deeprec.models.sequential.nextitnet.NextItNetModel(hparams, iterator_creator, graph=None, seed=None)[source]

NextItNet Model

Citation:Yuan, Fajie, et al. “A Simple Convolutional Generative Network for Next Item Recommendation”, in Web Search and Data Mining, 2019.

Note

It requires a dataset with strong sequential patterns.

RNN Cells

Module implementing RNN Cells.

This module provides a number of basic commonly used RNN cells, such as LSTM (Long Short Term Memory) or GRU (Gated Recurrent Unit), and a number of operators that allow adding dropouts, projections, or embeddings for inputs. Constructing multi-layer cells is supported by the class MultiRNNCell, or by calling the rnn ops several times.

class recommenders.models.deeprec.models.sequential.rnn_cell_implement.Time4ALSTMCell(num_units, use_peepholes=False, cell_clip=None, initializer=None, num_proj=None, proj_clip=None, num_unit_shards=None, num_proj_shards=None, forget_bias=1.0, state_is_tuple=True, activation=None, reuse=None)[source]
__init__(num_units, use_peepholes=False, cell_clip=None, initializer=None, num_proj=None, proj_clip=None, num_unit_shards=None, num_proj_shards=None, forget_bias=1.0, state_is_tuple=True, activation=None, reuse=None)[source]
call(inputs, state)[source]

This is where the layer’s logic lives.

Parameters:
  • inputs – Input tensor, or list/tuple of input tensors.
  • **kwargs – Additional keyword arguments.
Returns:

A tensor or list/tuple of tensors.

output_size

size of outputs produced by this cell.

Type:Integer or TensorShape
state_size

size(s) of state(s) used by this cell.

It can be represented by an Integer, a TensorShape or a tuple of Integers or TensorShapes.

class recommenders.models.deeprec.models.sequential.rnn_cell_implement.Time4LSTMCell(num_units, use_peepholes=False, cell_clip=None, initializer=None, num_proj=None, proj_clip=None, num_unit_shards=None, num_proj_shards=None, forget_bias=1.0, state_is_tuple=True, activation=None, reuse=None)[source]
__init__(num_units, use_peepholes=False, cell_clip=None, initializer=None, num_proj=None, proj_clip=None, num_unit_shards=None, num_proj_shards=None, forget_bias=1.0, state_is_tuple=True, activation=None, reuse=None)[source]
call(inputs, state)[source]

This is where the layer’s logic lives.

Parameters:
  • inputs – Input tensor, or list/tuple of input tensors.
  • **kwargs – Additional keyword arguments.
Returns:

A tensor or list/tuple of tensors.

output_size

size of outputs produced by this cell.

Type:Integer or TensorShape
state_size

size(s) of state(s) used by this cell.

It can be represented by an Integer, a TensorShape or a tuple of Integers or TensorShapes.

SUM

class recommenders.models.deeprec.models.sequential.sum.SUMModel(hparams, iterator_creator, graph=None, seed=None)[source]

Sequential User Matrix Model

Citation:Lian, J., Batal, I., Liu, Z., Soni, A., Kang, E. Y., Wang, Y., & Xie, X., “Multi-Interest-Aware User Modeling for Large-Scale Sequential Recommendations”, arXiv preprint arXiv:2102.09211, 2021.
class recommenders.models.deeprec.models.sequential.sum_cells.SUMCell(num_units, slots, attention_size, input_size, activation=None, reuse=None, kernel_initializer=None, bias_initializer=None, name=None, dtype=None, **kwargs)[source]

Cell for Sequential User Matrix

__init__(num_units, slots, attention_size, input_size, activation=None, reuse=None, kernel_initializer=None, bias_initializer=None, name=None, dtype=None, **kwargs)[source]
call(inputs, state)[source]

The real operations for SUM cell to process user behaviors.

Parameters:
  • inputs – (a batch of) user behaviors at time T
  • state – (a batch of) user states at time T-1
Returns:
  • Output: (a batch of) new user states at time T, after processing the user behavior at time T
  • State: (a batch of) new user states at time T, after processing the user behavior at time T
Return type:state, state
get_config()[source]

Returns the config of the layer.

A layer config is a Python dictionary (serializable) containing the configuration of a layer. The same layer can be reinstantiated later (without its trained weights) from this configuration.

The config of a layer does not include connectivity information, nor the layer class name. These are handled by Network (one layer of abstraction above).

Returns:Python dictionary.
output_size

size of outputs produced by this cell.

Type:Integer or TensorShape
state_size

size(s) of state(s) used by this cell.

It can be represented by an Integer, a TensorShape or a tuple of Integers or TensorShapes.

class recommenders.models.deeprec.models.sequential.sum_cells.SUMV2Cell(num_units, slots, attention_size, input_size, activation=None, reuse=None, kernel_initializer=None, bias_initializer=None, name=None, dtype=None, **kwargs)[source]

A variant of SUM cell, which upgrades the writing attention

call(inputs, state)[source]

The real operations for SUMV2 cell to process user behaviors.

Parameters:
  • inputs – (a batch of) user behaviors at time T
  • state – (a batch of) user states at time T-1
Returns:

  • Output: (a batch of) new user states at time T, after processing the user behavior at time T
  • State: (a batch of) new user states at time T, after processing the user behavior at time T

Return type:

state, state

SLIRec

class recommenders.models.deeprec.models.sequential.sli_rec.SLI_RECModel(hparams, iterator_creator, graph=None, seed=None)[source]

SLI Rec model

Citation:Z. Yu, J. Lian, A. Mahmoody, G. Liu and X. Xie, “Adaptive User Modeling with Long and Short-Term Preferences for Personalized Recommendation”, in Proceedings of the 28th International Joint Conferences on Artificial Intelligence, IJCAI’19, Pages 4213-4219, AAAI Press, 2019.

Iterators

class recommenders.models.deeprec.io.iterator.BaseIterator[source]

Abstract base iterator class

gen_feed_dict(data_dict)[source]

Abstract method. Construct a dictionary that maps graph elements to values.

Parameters:data_dict (dict) – A dictionary that maps string name to numpy arrays.
load_data_from_file(infile)[source]

Abstract method. Read and parse data from a file.

Parameters:infile (str) – Text input file. Each line in this file is an instance.
parser_one_line(line)[source]

Abstract method. Parse one string line into feature values.

Parameters:line (str) – A string indicating one instance.
class recommenders.models.deeprec.io.iterator.FFMTextIterator(hparams, graph, col_spliter=' ', ID_spliter='%')[source]

Data loader for FFM format based models, such as xDeepFM. Iterator will not load the whole data into memory. Instead, it loads data into memory per mini-batch, so that large files can be used as input data.
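
The per-mini-batch loading pattern described above can be sketched with a generator. This is a generic illustration, not the FFMTextIterator code: `iter_minibatches` is a hypothetical helper, and an in-memory `io.StringIO` stands in for a large input file.

```python
import io


def iter_minibatches(lines, batch_size):
    """Yield lists of at most `batch_size` non-empty lines.

    `lines` can be any iterable of strings, such as an open file handle,
    so only one mini-batch is held in memory at a time.
    """
    batch = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        batch.append(line)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        # Emit the final, possibly smaller, mini-batch.
        yield batch


data = io.StringIO("a\nb\nc\nd\ne\n")
batches = list(iter_minibatches(data, batch_size=2))
```

With a real file, the same function could be called on the handle returned by `open(path)`, which reads lines lazily.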

__init__(hparams, graph, col_spliter=' ', ID_spliter='%')[source]

Initialize an iterator. Create the necessary placeholders for the model.

Parameters:
  • hparams (object) – Global hyper-parameters. Some key settings such as #_feature and #_field are there.
  • graph (object) – The running graph. All created placeholder will be added to this graph.
  • col_spliter (str) – column splitter in one line.
  • ID_spliter (str) – ID splitter in one line.
gen_feed_dict(data_dict)[source]

Construct a dictionary that maps graph elements to values.

Parameters:data_dict (dict) – A dictionary that maps string name to numpy arrays.
Returns:A dictionary that maps graph elements to numpy arrays.
Return type:dict
load_data_from_file(infile)[source]

Read and parse data from a file.

Parameters:infile (str) – Text input file. Each line in this file is an instance.
Returns:An iterator that yields parsed results, in the format of graph feed_dict.
Return type:object
parser_one_line(line)[source]

Parse one string line into feature values.

Parameters:line (str) – A string indicating one instance.
Returns:Parsed results, including label, features and impression_id.
Return type:list
class recommenders.models.deeprec.io.dkn_iterator.DKNTextIterator(hparams, graph, col_spliter=' ', ID_spliter='%')[source]

Data loader for the DKN model. DKN requires a special type of data format, where each instance contains a label, the candidate news article, and user’s clicked news article. Articles are represented by title words and title entities. Words and entities are aligned.

Iterator will not load the whole data into memory. Instead, it loads data into memory per mini-batch, so that large files can be used as input data.

__init__(hparams, graph, col_spliter=' ', ID_spliter='%')[source]

Initialize an iterator. Create necessary placeholders for the model.

Parameters:
  • hparams (object) – Global hyper-parameters. Some key settings such as #_feature and #_field are there.
  • graph (object) – The running graph. All created placeholder will be added to this graph.
  • col_spliter (str) – Column splitter in one line.
  • ID_spliter (str) – ID splitter in one line.
gen_feed_dict(data_dict)[source]

Construct a dictionary that maps graph elements to values.

Parameters:data_dict (dict) – a dictionary that maps string name to numpy arrays.
Returns:A dictionary that maps graph elements to numpy arrays.
Return type:dict
gen_infer_feed_dict(data_dict)[source]

Construct a dictionary that maps graph elements to values.

Parameters:data_dict (dict) – a dictionary that maps string name to numpy arrays.
Returns:A dictionary that maps graph elements to numpy arrays.
Return type:dict
load_data_from_file(infile)[source]

Read and parse data from a file.

Parameters:

infile (str) – text input file. Each line in this file is an instance.

Yields:

obj, list, int

  • An iterator that yields parsed results, in the format of graph feed_dict.
  • Impression id list.
  • Size of the data in a batch.
load_infer_data_from_file(infile)[source]

Read and parse data from a file for inferring document embeddings.

Parameters:

infile (str) – text input file. Each line in this file is an instance.

Yields:

obj, list, int

  • An iterator that yields parsed results, in the format of graph feed_dict.
  • Impression id list.
  • Size of the data in a batch.
parser_one_line(line)[source]

Parse one string line into feature values.

Parameters:line (str) – a string indicating one instance
Returns:Parsed results including label, candidate_news_index, click_news_index, candidate_news_entity_index, click_news_entity_index, impression_id.
Return type:list
class recommenders.models.deeprec.io.dkn_item2item_iterator.DKNItem2itemTextIterator(hparams, graph)[source]
__init__(hparams, graph)[source]

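This iterator is for the item-to-item recommendation version of DKN. A tutorial is available in this notebook: https://github.com/microsoft/recommenders/blob/main/examples/07_tutorials/KDD2020-tutorial/step4_run_dkn_item2item.ipynb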

Compared with user-to-item recommendations, we don’t need the user behavior module. So the placeholder can be simplified from the original DKNTextIterator.

Parameters:
  • hparams (object) – Global hyper-parameters.
  • graph (object) – The running graph.
load_data_from_file(infile)[source]

This function will return a mini-batch of data with features, by looking up news_word_index dictionary and news_entity_index dictionary according to the news article’s ID.

Parameters:

infile (str) – File path. Each line of infile is a news article’s ID.

Yields:

dict, list, int

  • A dictionary that maps graph elements to numpy arrays.
  • A list with news article’s ID.
  • Size of the data in a batch.
class recommenders.models.deeprec.io.nextitnet_iterator.NextItNetIterator(hparams, graph, col_spliter='\t')[source]

Data loader for the NextItNet model.

NextItNet requires a special type of data format. In the training stage, each instance produces (sequence_length * train_num_ngs) target items and labels, so that NextItNet outputs predictions for every item in a sequence rather than only for the last item.

__init__(hparams, graph, col_spliter='\t')[source]

Initialize an iterator. Create necessary placeholders for the model. This differs from the sequential iterator.

Parameters:
  • hparams (object) – Global hyper-parameters. Some key settings such as #_feature and #_field are there.
  • graph (object) – The running graph. All created placeholder will be added to this graph.
  • col_spliter (str) – Column splitter in one line.
class recommenders.models.deeprec.io.sequential_iterator.SequentialIterator(hparams, graph, col_spliter='\t')[source]
__init__(hparams, graph, col_spliter='\t')[source]

Initialize an iterator. Create necessary placeholders for the model.

Parameters:
  • hparams (object) – Global hyper-parameters. Some key settings such as #_feature and #_field are there.
  • graph (object) – The running graph. All created placeholder will be added to this graph.
  • col_spliter (str) – Column splitter in one line.
gen_feed_dict(data_dict)[source]

Construct a dictionary that maps graph elements to values.

Parameters:data_dict (dict) – A dictionary that maps string name to numpy arrays.
Returns:A dictionary that maps graph elements to numpy arrays.
Return type:dict
load_data_from_file(infile, batch_num_ngs=0, min_seq_length=1)[source]

Read and parse data from a file.

Parameters:
  • infile (str) – Text input file. Each line in this file is an instance.
  • batch_num_ngs (int) – The number of negative samples drawn within the batch. 0 means that no negative sampling is needed.
  • min_seq_length (int) – The minimum sequence length. Sequences shorter than min_seq_length will be ignored.
Yields:

object – An iterator that yields parsed results, in the format of graph feed_dict.

parse_file(input_file)[source]

Parse the file into a list ready to be used for downstream tasks.

Parameters:input_file – One of train, valid or test file which has never been parsed.
Returns:A list with parsing result.
Return type:list
parser_one_line(line)[source]

Parse one string line into feature values.

Parameters:line (str) – a string indicating one instance. This string contains tab-separated values including: label, user_hash, item_hash, item_cate, operation_time, item_history_sequence, item_cate_history_sequence, and time_history_sequence.
Returns:Parsed results including label, user_id, item_id, item_cate, item_history_sequence, cate_history_sequence, current_time, time_diff, time_from_first_action, time_to_now.
Return type:list
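
To make the input layout concrete, the sketch below splits one tab-separated line into the eight fields listed above. It is a simplified illustration rather than the library's parser: `parse_sequential_line` is a hypothetical name, the comma used to separate history entries is an assumption, and the derived time features (time_diff, time_from_first_action, time_to_now) are not computed here.

```python
def parse_sequential_line(line, col_spliter="\t", seq_spliter=","):
    """Split one instance line into named fields.

    `seq_spliter` (the separator inside history sequences) is an
    assumption of this example.
    """
    fields = line.rstrip("\n").split(col_spliter)
    (label, user, item, cate, op_time,
     item_hist, cate_hist, time_hist) = fields
    return {
        "label": int(label),
        "user": user,
        "item": item,
        "item_cate": cate,
        "operation_time": float(op_time),
        "item_history": item_hist.split(seq_spliter),
        "cate_history": cate_hist.split(seq_spliter),
        "time_history": [float(t) for t in time_hist.split(seq_spliter)],
    }


example_line = "1\tu1\ti9\tc2\t100\ti1,i2\tc1,c1\t10,50\n"
parsed = parse_sequential_line(example_line)
```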

Data processing utilities

class recommenders.models.deeprec.DataModel.ImplicitCF.ImplicitCF(train, test=None, adj_dir=None, col_user='userID', col_item='itemID', col_rating='rating', col_prediction='prediction', seed=None)[source]

Data processing class for GCN models which use implicit feedback.

Initialize train and test set, create normalized adjacency matrix and sample data for training epochs.

__init__(train, test=None, adj_dir=None, col_user='userID', col_item='itemID', col_rating='rating', col_prediction='prediction', seed=None)[source]

Constructor

Parameters:
  • adj_dir (str) – Directory to save / load adjacency matrices. If it is None, adjacency matrices will be created and will not be saved.
  • train (pandas.DataFrame) – Training data with at least columns (col_user, col_item, col_rating).
  • test (pandas.DataFrame) – Test data with at least columns (col_user, col_item, col_rating). test can be None, if so, we only process the training data.
  • col_user (str) – User column name.
  • col_item (str) – Item column name.
  • col_rating (str) – Rating column name.
  • seed (int) – Seed.
create_norm_adj_mat()[source]

Create normalized adjacency matrix.

Returns:Normalized adjacency matrix.
Return type:scipy.sparse.csr_matrix
get_norm_adj_mat()[source]

Load normalized adjacency matrix if it exists, otherwise create (and save) it.

Returns:Normalized adjacency matrix.
Return type:scipy.sparse.csr_matrix
train_loader(batch_size)[source]

Sample training data for each batch. One positive item and one negative item are sampled for each user.

Parameters:batch_size (int) – Batch size of users.
Returns:
  • Sampled users.
  • Sampled positive items.
  • Sampled negative items.
Return type:numpy.ndarray, numpy.ndarray, numpy.ndarray
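
One common way to draw such (user, positive item, negative item) triplets is rejection sampling, sketched below with the standard library only. This is an assumption-laden illustration, not the ImplicitCF code: `sample_triplets` is a hypothetical name, users are drawn with replacement, and every user is assumed to have at least one item they have not interacted with.

```python
import random


def sample_triplets(user_pos_items, all_items, batch_size, seed=None):
    """Sample (user, positive item, negative item) triplets.

    `user_pos_items` maps each user to the set of items they interacted with.
    """
    rng = random.Random(seed)
    users = sorted(user_pos_items)
    item_pool = sorted(all_items)
    triplets = []
    for _ in range(batch_size):
        user = rng.choice(users)
        positives = user_pos_items[user]
        pos_item = rng.choice(sorted(positives))
        # Rejection sampling: redraw until the item is not a known positive.
        neg_item = rng.choice(item_pool)
        while neg_item in positives:
            neg_item = rng.choice(item_pool)
        triplets.append((user, pos_item, neg_item))
    return triplets


train = {"u1": {"a", "b"}, "u2": {"c"}}
batch = sample_triplets(train, all_items={"a", "b", "c", "d"}, batch_size=4, seed=0)
```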

Utilities

recommenders.models.deeprec.deeprec_utils.cal_metric(labels, preds, metrics)[source]

Calculate metrics.

Available options are: auc, rmse, logloss, acc (accuracy), f1, mean_mrr, ndcg (format like: ndcg@2;4;6;8), hit (format like: hit@2;4;6;8), group_auc.

Parameters:
  • labels (array-like) – Labels.
  • preds (array-like) – Predictions.
  • metrics (list) – List of metric names.
Returns:

Metrics.

Return type:

dict

Examples

>>> cal_metric(labels, preds, ["ndcg@2;4;6", "group_auc"])
{'ndcg@2': 0.4026, 'ndcg@4': 0.4953, 'ndcg@6': 0.5346, 'group_auc': 0.8096}
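
The compact `name@k1;k2;...` notation expands into one result key per cutoff, as the example output suggests. A possible way to perform that expansion is sketched below; `expand_metric_names` is a hypothetical helper written for illustration and is not part of deeprec_utils.

```python
def expand_metric_names(metrics):
    """Expand entries like 'ndcg@2;4;6' into ['ndcg@2', 'ndcg@4', 'ndcg@6']."""
    expanded = []
    for metric in metrics:
        if "@" in metric:
            name, cutoffs = metric.split("@", 1)
            expanded.extend(f"{name}@{k}" for k in cutoffs.split(";"))
        else:
            # Metrics without cutoffs are kept as they are.
            expanded.append(metric)
    return expanded


names = expand_metric_names(["ndcg@2;4;6", "group_auc"])
```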
recommenders.models.deeprec.deeprec_utils.check_nn_config(f_config)[source]

Check neural networks configuration.

Parameters:f_config (dict) – Neural network configuration.
Raises:ValueError – If the parameters are not correct.
recommenders.models.deeprec.deeprec_utils.check_type(config)[source]

Check that the config parameters are the correct type

Parameters:config (dict) – Configuration dictionary.
Raises:TypeError – If the parameters are not the correct type.
recommenders.models.deeprec.deeprec_utils.create_hparams(flags)[source]

Create the model hyperparameters.

Parameters:flags (dict) – Dictionary with the model requirements.
Returns:Hyperparameter object in TF.
Return type:tf.contrib.training.HParams
recommenders.models.deeprec.deeprec_utils.dcg_score(y_true, y_score, k=10)[source]

Computing dcg score metric at k.

Parameters:
  • y_true (np.ndarray) – Ground-truth labels.
  • y_score (np.ndarray) – Predicted labels.
Returns:

dcg scores.

Return type:

np.ndarray

recommenders.models.deeprec.deeprec_utils.download_deeprec_resources(azure_container_url, data_path, remote_resource_name)[source]

Download resources.

Parameters:
  • azure_container_url (str) – URL of Azure container.
  • data_path (str) – Path to download the resources.
  • remote_resource_name (str) – Name of the resource.
recommenders.models.deeprec.deeprec_utils.flat_config(config)[source]

Flatten a config loaded from a YAML file into a flat dict.

Parameters:config (dict) – Configuration loaded from a yaml file.
Returns:Configuration dictionary.
Return type:dict
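
A minimal sketch of such flattening is shown below, assuming the loaded configuration has exactly two levels (section name, then settings). The function name `flatten_one_level` and the sample keys are hypothetical; the library's own behavior for deeper nesting or duplicate keys is not covered here.

```python
def flatten_one_level(config):
    """Merge the settings of every top-level section into one flat dict.

    Assumes each top-level value is itself a dict. If two sections share a
    key, the later section overwrites the earlier one.
    """
    flat = {}
    for section_values in config.values():
        for key, value in section_values.items():
            flat[key] = value
    return flat


nested = {
    "data": {"col_spliter": "\t"},
    "model": {"dim": 10, "layer_sizes": [100, 100]},
}
flat = flatten_one_level(nested)
```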
recommenders.models.deeprec.deeprec_utils.hit_score(y_true, y_score, k=10)[source]

Computing hit score metric at k.

Parameters:
  • y_true (np.ndarray) – ground-truth labels.
  • y_score (np.ndarray) – predicted labels.
Returns:

hit score.

Return type:

np.ndarray
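
A hit metric at k is commonly defined as 1 when at least one relevant item appears among the k highest-scored items, and 0 otherwise. The sketch below follows that common definition using plain lists; `hit_at_k` is a hypothetical name, and labels are assumed to be binary with 1 marking a relevant item.

```python
def hit_at_k(y_true, y_score, k=10):
    """Return 1 if any relevant item is ranked in the top k, else 0."""
    # Indices of the k highest scores, best first.
    order = sorted(range(len(y_score)), key=lambda i: y_score[i], reverse=True)
    top_k = order[:k]
    return 1 if any(y_true[i] == 1 for i in top_k) else 0


labels = [0, 1, 0]
scores = [0.9, 0.5, 0.1]
hit_top1 = hit_at_k(labels, scores, k=1)
hit_top2 = hit_at_k(labels, scores, k=2)
```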

recommenders.models.deeprec.deeprec_utils.load_dict(filename)[source]

Load the vocabularies.

Parameters:filename (str) – Filename of user, item or category vocabulary.
Returns:A saved vocabulary.
Return type:dict
recommenders.models.deeprec.deeprec_utils.load_yaml(filename)[source]

Load a yaml file.

Parameters:filename (str) – Filename.
Returns:Dictionary.
Return type:dict
recommenders.models.deeprec.deeprec_utils.mrr_score(y_true, y_score)[source]

Computing mrr score metric.

Parameters:
  • y_true (np.ndarray) – Ground-truth labels.
  • y_score (np.ndarray) – Predicted labels.
Returns:

mrr scores.

Return type:

numpy.ndarray
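
One widely used formulation of this metric sorts the labels by descending score, sums label / rank over all positions, and divides by the number of relevant items. The sketch below implements that formulation with plain lists; it is an illustration under the assumption of binary labels, and `mrr` is a hypothetical name rather than the library function.

```python
def mrr(y_true, y_score):
    """Mean reciprocal rank of the relevant items in one ranked list."""
    order = sorted(range(len(y_score)), key=lambda i: y_score[i], reverse=True)
    ranked_labels = [y_true[i] for i in order]
    # Ranks start at 1, so position 0 contributes label / 1.
    reciprocal_sum = sum(
        label / (position + 1) for position, label in enumerate(ranked_labels)
    )
    return reciprocal_sum / sum(y_true)


# The only relevant item is ranked second, so the result is 1/2.
value = mrr([0, 1, 0], [0.9, 0.5, 0.1])
```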

recommenders.models.deeprec.deeprec_utils.ndcg_score(y_true, y_score, k=10)[source]

Computing ndcg score metric at k.

Parameters:
  • y_true (np.ndarray) – Ground-truth labels.
  • y_score (np.ndarray) – Predicted labels.
Returns:

ndcg scores.

Return type:

numpy.ndarray
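
NDCG divides the DCG of the predicted ranking by the DCG of the ideal ranking. The following plain-Python sketch assumes exponential gain and a log2 discount; `ndcg_at_k` and `_dcg` are hypothetical helper names, not library functions.

```python
import math


def _dcg(relevances):
    # DCG of a list that is already in ranked order.
    return sum((2 ** r - 1) / math.log2(i + 2) for i, r in enumerate(relevances))


def ndcg_at_k(y_true, y_score, k=10):
    """DCG of the predicted order divided by the DCG of the ideal order."""
    order = sorted(range(len(y_score)), key=lambda i: y_score[i], reverse=True)
    ranked = [y_true[i] for i in order[:k]]
    ideal = sorted(y_true, reverse=True)[:k]
    best = _dcg(ideal)
    return _dcg(ranked) / best if best > 0 else 0.0


perfect = ndcg_at_k([1, 0, 0], [0.9, 0.1, 0.2])
worst_case = ndcg_at_k([0, 0, 1], [0.9, 0.5, 0.1], k=3)
```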

recommenders.models.deeprec.deeprec_utils.prepare_hparams(yaml_file=None, **kwargs)[source]

Prepare the model hyperparameters and check that all have the correct value.

Parameters:yaml_file (str) – YAML file as configuration.
Returns:Hyperparameter object in TF.
Return type:tf.contrib.training.HParams

FastAI

recommenders.models.fastai.fastai_utils.cartesian_product(*arrays)[source]

Compute the Cartesian product in fastai algo. This is a helper function.

Parameters:arrays (tuple of numpy.ndarray) – Input arrays
Returns:product
Return type:numpy.ndarray
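
For intuition, a Cartesian product of user and item ids enumerates every (user, item) pair that could be scored. The sketch uses the standard library's `itertools.product` rather than NumPy; `cartesian_rows` is a hypothetical name.

```python
from itertools import product


def cartesian_rows(*arrays):
    """Return every combination of one element per input array, as rows."""
    return [list(combo) for combo in product(*arrays)]


users = [1, 2]
items = [10, 20, 30]
pairs = cartesian_rows(users, items)
```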
recommenders.models.fastai.fastai_utils.hide_fastai_progress_bar()[source]

Hide fastai progress bar

recommenders.models.fastai.fastai_utils.score(learner, test_df, user_col='userID', item_col='itemID', prediction_col='prediction', top_k=None)[source]

Score all users+items provided and reduce to top_k items per user if top_k>0

Parameters:
  • learner (object) – Model.
  • test_df (pandas.DataFrame) – Test dataframe.
  • user_col (str) – User column name.
  • item_col (str) – Item column name.
  • prediction_col (str) – Prediction column name.
  • top_k (int) – Number of top items to recommend.
Returns:

Result of recommendation

Return type:

pandas.DataFrame

GeoIMC

Module maintaining the IMC problem.

class recommenders.models.geoimc.geoimc_algorithm.IMCProblem(dataPtr, lambda1=0.01, rank=10)[source]

Implements the IMC problem.

__init__(dataPtr, lambda1=0.01, rank=10)[source]

Initialize parameters

Parameters:
  • dataPtr (DataPtr) – An object that contains the X, Z side features and the target matrix Y.
  • lambda1 (uint) – Regularizer.
  • rank (uint) – rank of the U, B, V parametrization.
reset()[source]

Reset the model.

solve(*args)[source]

Main solver of the IMC model

Parameters:
  • max_opt_time (uint) – Maximum time (in secs) for optimization
  • max_opt_iter (uint) – Maximum iterations for optimization
  • verbosity (uint) – The level of verbosity for Pymanopt logs
class recommenders.models.geoimc.geoimc_data.DataPtr(data, entities)[source]

Holds data and its respective indices

__init__(data, entities)[source]

Initialize a data pointer

Parameters:
  • data (csr_matrix) – The target data matrix.
  • entities (Iterator) – An iterator (of 2 elements (ndarray)) containing the features of row, col entities.
get_data()[source]
Returns:Target matrix (based on the data_indices filter)
Return type:csr_matrix
get_entity(of='row')[source]

Get entity

Parameters:of (str) – The entity, either ‘row’ or ‘col’
Returns:Entity matrix (based on the entity_indices filter)
Return type:numpy.ndarray
class recommenders.models.geoimc.geoimc_data.Dataset(name, features_dim=0, normalize=False, target_transform='')[source]

Base class that holds necessary (minimal) information needed

__init__(name, features_dim=0, normalize=False, target_transform='')[source]

Initialize parameters

Parameters:
  • name (str) – Name of the dataset
  • features_dim (uint) – Dimension of the features. If not 0, PCA is performed on the features as the dimensionality reduction technique
  • normalize (bool) – Normalize the features
  • target_transform (str) – Transform the target values. Current options are ‘normalize’ (Normalize the values), ‘’ (Do nothing), ‘binarize’ (convert the values using a threshold defined per dataset)
generate_train_test_data(data, test_ratio=0.3)[source]

Generate train, test split. The split is performed on the row entities. So, this essentially becomes a cold start row entity test.

Parameters:
  • data (csr_matrix) – The entire target matrix.
  • test_ratio (float) – Ratio of test split.
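
Splitting on row entities means whole rows of the target matrix are held out, so test rows are never seen in training. A minimal sketch of that idea follows; the shuffling, the rounding, and the name `split_rows` are assumptions rather than the library's behavior.

```python
import random


def split_rows(n_rows, test_ratio=0.3, seed=0):
    """Assign whole row indices to either the train or the test split."""
    rows = list(range(n_rows))
    random.Random(seed).shuffle(rows)
    n_test = int(n_rows * test_ratio)
    return sorted(rows[n_test:]), sorted(rows[:n_test])


train_rows, test_rows = split_rows(10, test_ratio=0.3)
```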
normalize()[source]

Normalizes the entity features

reduce_dims()[source]

Reduces the dimensionality of entity features.

class recommenders.models.geoimc.geoimc_data.ML_100K(**kwargs)[source]

Handles MovieLens-100K

__init__(**kwargs)[source]

Initialize parameters

Parameters:
  • name (str) – Name of the dataset
  • features_dim (uint) – Dimension of the features. If not 0, PCA is performed on the features as the dimensionality reduction technique
  • normalize (bool) – Normalize the features
  • target_transform (str) – Transform the target values. Current options are ‘normalize’ (Normalize the values), ‘’ (Do nothing), ‘binarize’ (convert the values using a threshold defined per dataset)
df2coo(df)[source]

Convert the input dataframe into a coo matrix

Parameters:df (pandas.DataFrame) – DataFrame containing the target matrix information.
load_data(path)[source]

Load dataset

Parameters:
  • path (str) – Path to the directory containing ML100K dataset
  • e1_path (str) – Path to the file containing row (user) features of ML100K dataset
  • e2_path (str) – Path to the file containing col (movie) features of ML100K dataset
class recommenders.models.geoimc.geoimc_predict.Inferer(method='dot', k=10, transformation='')[source]

Holds necessary (minimal) information needed for inference

__init__(method='dot', k=10, transformation='')[source]

Initialize parameters

Parameters:
  • method (str) – The inference method. Currently ‘dot’ (Dot product) is supported.
  • k (uint) – k for ‘topk’ transformation.
  • transformation (str) – Transform the inferred values into a different scale. Currently ‘mean’ (Binarize the values using mean of inferred matrix as the threshold), ‘topk’ (Pick Top-K inferred values per row and assign them 1, setting rest of them to 0), ‘’ (No transformation) are supported.
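
The ‘mean’ and ‘topk’ transformations described above can be sketched on a small score matrix as follows. The function names are hypothetical, and the strict comparison against the mean is an assumption.

```python
def binarize_by_mean(matrix):
    """Set entries above the mean of the whole matrix to 1, the rest to 0."""
    values = [v for row in matrix for v in row]
    threshold = sum(values) / len(values)
    return [[1 if v > threshold else 0 for v in row] for row in matrix]


def binarize_topk(matrix, k):
    """Set the k largest entries of each row to 1, the rest to 0."""
    out = []
    for row in matrix:
        top = sorted(range(len(row)), key=lambda j: row[j], reverse=True)[:k]
        out.append([1 if j in top else 0 for j in range(len(row))])
    return out


scores = [[0.1, 0.9, 0.5], [0.7, 0.2, 0.3]]
by_mean = binarize_by_mean(scores)
by_topk = binarize_topk(scores, k=1)
```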
infer(dataPtr, W, **kwargs)[source]

Main inference method

Parameters:
  • dataPtr (DataPtr) – An object containing the X, Z features needed for inference
  • W (iterable) – An iterable containing the U, B, V parametrized matrices.
class recommenders.models.geoimc.geoimc_predict.PlainScalarProduct(X, Y, **kwargs)[source]

Module that implements plain scalar product as the retrieval criterion

__init__(X, Y, **kwargs)[source]
Parameters:
  • X – numpy matrix of shape (users, features)
  • Y – numpy matrix of shape (items, features)
sim(**kwargs)[source]

Calculate the similarity score

recommenders.models.geoimc.geoimc_utils.length_normalize(matrix)[source]

Length normalize the matrix

Parameters:matrix (np.ndarray) – Input matrix that needs to be normalized
Returns:Normalized matrix
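
A sketch of length normalization, assuming "length" means the L2 norm of each row and that all-zero rows are left unchanged. `l2_normalize_rows` is a hypothetical name.

```python
import math


def l2_normalize_rows(matrix):
    """Scale every row to unit L2 norm; all-zero rows stay as they are."""
    result = []
    for row in matrix:
        norm = math.sqrt(sum(v * v for v in row))
        norm = norm if norm > 0 else 1.0  # avoid dividing by zero
        result.append([v / norm for v in row])
    return result


normalized = l2_normalize_rows([[3.0, 4.0], [0.0, 0.0]])
```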
recommenders.models.geoimc.geoimc_utils.mean_center(matrix)[source]

Performs mean centering across axis 0

Parameters:matrix (np.ndarray) – Input matrix that needs to be mean centered
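
Mean centering across axis 0 subtracts each column's mean from that column. The sketch below returns a new matrix; whether the library function modifies its input in place is not stated above, and `center_columns` is a hypothetical name.

```python
def center_columns(matrix):
    """Subtract the column mean from every entry of that column."""
    n_rows = len(matrix)
    means = [sum(col) / n_rows for col in zip(*matrix)]
    return [[v - m for v, m in zip(row, means)] for row in matrix]


centered = center_columns([[1.0, 10.0], [3.0, 20.0]])
```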
recommenders.models.geoimc.geoimc_utils.reduce_dims(matrix, target_dim)[source]

Reduce dimensionality of the data using PCA.

Parameters:
  • matrix (np.ndarray) – Matrix of the form (n_samples, n_features)
  • target_dim (uint) – Dimension to which n_features should be reduced to.

LightFM

recommenders.models.lightfm.lightfm_utils.compare_metric(df_list, metric='prec', stage='test')[source]

Function to combine and prepare list of dataframes into tidy format.

Parameters:
  • df_list (list) – List of dataframes
  • metric (str) – name of metric to be extracted, optional
  • stage (str) – name of model fitting stage to be extracted, optional
Returns:

Metrics

Return type:

pandas.DataFrame

recommenders.models.lightfm.lightfm_utils.model_perf_plots(df)[source]

Function to plot model performance metrics.

Parameters:df (pandas.DataFrame) – Dataframe in tidy format, with [‘epoch’,’level’,’value’] columns
Returns:matplotlib axes
Return type:object
recommenders.models.lightfm.lightfm_utils.prepare_all_predictions(data, uid_map, iid_map, interactions, model, num_threads, user_features=None, item_features=None)[source]

Function to prepare all predictions for evaluation.

Parameters:
  • data (pandas df) – dataframe of all users, items and ratings as loaded
  • uid_map (dict) – Keys to map internal user indices to external ids.
  • iid_map (dict) – Keys to map internal item indices to external ids.
  • interactions (np.float32 coo_matrix) – user-item interaction
  • model (LightFM instance) – fitted LightFM model
  • num_threads (int) – number of parallel computation threads
  • user_features (np.float32 csr_matrix) – User weights over features
  • item_features (np.float32 csr_matrix) – Item weights over features
Returns:

all predictions

Return type:

pandas.DataFrame

recommenders.models.lightfm.lightfm_utils.prepare_test_df(test_idx, uids, iids, uid_map, iid_map, weights)[source]

Function to prepare test df for evaluation

Parameters:
  • test_idx (slice) – slice of test indices
  • uids (numpy.ndarray) – Array of internal user indices
  • iids (numpy.ndarray) – Array of internal item indices
  • uid_map (dict) – Keys to map internal user indices to external ids.
  • iid_map (dict) – Keys to map internal item indices to external ids.
  • weights (numpy.float32 coo_matrix) – user-item interaction
Returns:

user-item selected for testing

Return type:

pandas.DataFrame

recommenders.models.lightfm.lightfm_utils.similar_items(item_id, item_features, model, N=10)[source]

Function to return top N similar items based on https://github.com/lyst/lightfm/issues/244#issuecomment-355305681

Parameters:
  • item_id (int) – id of item to be used as reference
  • item_features (scipy sparse CSR matrix) – item feature matrix
  • model (LightFM instance) – fitted LightFM model
  • N (int) – Number of top similar items to return
Returns:

top N most similar items with score

Return type:

pandas.DataFrame

recommenders.models.lightfm.lightfm_utils.similar_users(user_id, user_features, model, N=10)[source]

Function to return top N similar users based on https://github.com/lyst/lightfm/issues/244#issuecomment-355305681

Parameters:
  • user_id (int) – id of user to be used as reference
  • user_features (scipy sparse CSR matrix) – user feature matrix
  • model (LightFM instance) – fitted LightFM model
  • N (int) – Number of top similar users to return
Returns:top N most similar users with score
Return type:pandas.DataFrame
recommenders.models.lightfm.lightfm_utils.track_model_metrics(model, train_interactions, test_interactions, k=10, no_epochs=100, no_threads=8, show_plot=True, **kwargs)[source]

Function to record model’s performance at each epoch, formats the performance into tidy format, plots the performance and outputs the performance data.

Parameters:
  • model (LightFM instance) – fitted LightFM model
  • train_interactions (scipy sparse COO matrix) – train interactions set
  • test_interactions (scipy sparse COO matrix) – test interaction set
  • k (int) – number of recommendations, optional
  • no_epochs (int) – Number of epochs to run, optional
  • no_threads (int) – Number of parallel threads to use, optional
  • **kwargs – other keyword arguments to be passed down
Returns:

  • Performance traces of the fitted model
  • Fitted model
  • Side effect of the method

Return type:

pandas.DataFrame, LightFM model, matplotlib axes

LightGBM

class recommenders.models.lightgbm.lightgbm_utils.NumEncoder(cate_cols, nume_cols, label_col, threshold=10, thresrate=0.99)[source]

Encode all the categorical features into numerical ones by sequential label encoding, sequential count encoding, and binary encoding. Additionally, it also filters the low-frequency categories and fills the missing values.

__init__(cate_cols, nume_cols, label_col, threshold=10, thresrate=0.99)[source]

Constructor.

Parameters:
  • cate_cols (list) – The columns of categorical features.
  • nume_cols (list) – The columns of numerical features.
  • label_col (object) – The column of Label.
  • threshold (int) – The categories whose frequency is lower than the threshold will be filtered (be treated as “<LESS>”).
  • thresrate (float) – The (1.0 - thresrate, default 1%) lowest-frequency categories will also be filtered.
fit_transform(df)[source]

Input a training set (pandas.DataFrame) and return the converted 2 numpy.ndarray (x,y).

Parameters:df (pandas.DataFrame) – Input dataframe
Returns:New features and labels.
Return type:numpy.ndarray, numpy.ndarray
transform(df)[source]

Input a testing / validation set (pandas.DataFrame) and return the converted 2 numpy.ndarray (x,y).

Parameters:df (pandas.DataFrame) – Input dataframe
Returns:New features and labels.
Return type:numpy.ndarray, numpy.ndarray
recommenders.models.lightgbm.lightgbm_utils.unpackbits(x, num_bits)[source]

Convert a decimal value numpy.ndarray into multi-binary value numpy.ndarray ([1,2]->[[0,1],[1,0]])

Parameters:
  • x (numpy.ndarray) – Decimal array.
  • num_bits (int) – The max length of the converted binary value.
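
The binary expansion can be sketched with plain bit operations. The bit order below follows the example in the description above ([1,2]->[[0,1],[1,0]], most significant bit first); the library's actual ordering may differ, and `to_bits` is a hypothetical name.

```python
def to_bits(values, num_bits):
    """Expand each non-negative integer into a list of num_bits binary digits."""
    return [
        [(v >> shift) & 1 for shift in range(num_bits - 1, -1, -1)]
        for v in values
    ]


bits = to_bits([1, 2], num_bits=2)
```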

NCF

class recommenders.models.ncf.dataset.Dataset(train, test=None, n_neg=4, n_neg_test=100, col_user='userID', col_item='itemID', col_rating='rating', col_timestamp='timestamp', binary=True, seed=None)[source]

Dataset class for NCF

__init__(train, test=None, n_neg=4, n_neg_test=100, col_user='userID', col_item='itemID', col_rating='rating', col_timestamp='timestamp', binary=True, seed=None)[source]

Constructor

Parameters:
  • train (pandas.DataFrame) – Training data with at least columns (col_user, col_item, col_rating).
  • test (pandas.DataFrame) – Test data with at least columns (col_user, col_item, col_rating). test can be None, if so, we only process the training data.
  • n_neg (int) – Number of negative samples for training set.
  • n_neg_test (int) – Number of negative samples for test set.
  • col_user (str) – User column name.
  • col_item (str) – Item column name.
  • col_rating (str) – Rating column name.
  • col_timestamp (str) – Timestamp column name.
  • binary (bool) – If true, set rating > 0 to rating = 1.
  • seed (int) – Seed.
negative_sampling()[source]

Sample n_neg negative items per positive item, this function should be called every epoch.
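
A sketch of the idea: for every positive (user, item) pair, draw n_neg items the user has not interacted with and label them 0. The data layout and the name `sample_negatives` are hypothetical, and sampling with replacement is an assumption.

```python
import random


def sample_negatives(user_pos_items, all_items, n_neg=4, seed=None):
    """Return (user, item, label) rows with n_neg negatives per positive."""
    rng = random.Random(seed)
    rows = []
    for user, pos_items in user_pos_items.items():
        # Items this user never interacted with (must not be empty).
        candidates = sorted(set(all_items) - set(pos_items))
        for item in pos_items:
            rows.append((user, item, 1))
            for neg in rng.choices(candidates, k=n_neg):
                rows.append((user, neg, 0))
    return rows


interactions = {1: [10, 11], 2: [12]}
catalog = [10, 11, 12, 13, 14, 15]
rows = sample_negatives(interactions, catalog, n_neg=2, seed=0)
```

Resampling every epoch, as the description recommends, exposes the model to different negatives over the course of training.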

test_loader()[source]

Feed leave-one-out data for every user.

Generate a test batch for every positive test instance. For example, if [1, 2, 1] is a positive user & item pair in the test set (in [userID, itemID, rating] form), this function returns a batch like [[1, 2, 1], [1, 3, 0], [1, 6, 0], …], following our leave-one-out evaluation protocol.

Returns:userID list, itemID list, rating list. The public data loader returns userID and itemID values consistent with the raw data. The first (userID, itemID, rating) is the positive one.
Return type:list
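
The batch layout from the example above can be reproduced with a small sketch. `leave_one_out_batch` is a hypothetical helper, and the negative items are passed in rather than sampled.

```python
def leave_one_out_batch(user, pos_item, neg_items):
    """Build parallel user, item and rating lists with the positive pair first."""
    users = [user] * (1 + len(neg_items))
    items = [pos_item] + list(neg_items)
    ratings = [1] + [0] * len(neg_items)
    return users, items, ratings


users, items, ratings = leave_one_out_batch(1, 2, [3, 6])
triples = [list(t) for t in zip(users, items, ratings)]
```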
train_loader(batch_size, shuffle=True)[source]

Feed train data every batch.

Parameters:
  • batch_size (int) – Batch size.
  • shuffle (bool) – If true, train data will be shuffled.
Yields:

list – A list of userID list, itemID list, and rating list. Public data loader returns the userID, itemID consistent with raw data.

class recommenders.models.ncf.ncf_singlenode.NCF(n_users, n_items, model_type='NeuMF', n_factors=8, layer_sizes=[16, 8, 4], n_epochs=50, batch_size=64, learning_rate=0.005, verbose=1, seed=None)[source]

Neural Collaborative Filtering (NCF) implementation

Citation:He, Xiangnan, Lizi Liao, Hanwang Zhang, Liqiang Nie, Xia Hu, and Tat-Seng Chua. “Neural collaborative filtering.” In Proceedings of the 26th International Conference on World Wide Web, pp. 173-182. International World Wide Web Conferences Steering Committee, 2017. Link: https://www.comp.nus.edu.sg/~xiangnan/papers/ncf.pdf
__init__(n_users, n_items, model_type='NeuMF', n_factors=8, layer_sizes=[16, 8, 4], n_epochs=50, batch_size=64, learning_rate=0.005, verbose=1, seed=None)[source]

Constructor

Parameters:
  • n_users (int) – Number of users in the dataset.
  • n_items (int) – Number of items in the dataset.
  • model_type (str) – Model type.
  • n_factors (int) – Dimension of latent space.
  • layer_sizes (list) – Sizes of the layers for MLP.
  • n_epochs (int) – Number of epochs for training.
  • batch_size (int) – Batch size.
  • learning_rate (float) – Learning rate.
  • verbose (int) – Whether to show the training output or not.
  • seed (int) – Seed.
fit(data)[source]

Fit model with training data

Parameters:data (NCFDataset) – initialized Dataset in ./dataset.py
load(gmf_dir=None, mlp_dir=None, neumf_dir=None, alpha=0.5)[source]

Load model parameters for further use.

GMF model –> load parameters in gmf_dir

MLP model –> load parameters in mlp_dir

NeuMF model –> load parameters in neumf_dir or in gmf_dir and mlp_dir

Parameters:
  • gmf_dir (str) – Directory name for GMF model.
  • mlp_dir (str) – Directory name for MLP model.
  • neumf_dir (str) – Directory name for neumf model.
  • alpha (float) – the concatenation hyper-parameter for gmf and mlp output layer.
Returns:

Load parameters in this model.

Return type:

object

predict(user_input, item_input, is_list=False)[source]

Predict function of this trained model

Parameters:
  • user_input (list or element of list) – userID or userID list
  • item_input (list or element of list) – itemID or itemID list
  • is_list (bool) – If true, the input is of list type. Note that list-wise prediction is faster than element-wise prediction.
Returns:

A list of predicted rating or predicted rating score.

Return type:

list or float

save(dir_name)[source]

Save model parameters in dir_name

Parameters:dir_name (str) – Directory name, which should be a folder name rather than a file name. A new directory is created if it does not exist.

NewsRec

class recommenders.models.newsrec.io.mind_all_iterator.MINDAllIterator(hparams, npratio=-1, col_spliter='\t', ID_spliter='%')[source]

Train data loader for NAML model. The model requires a special data format, where each instance contains a label, impression id, user id, the candidate news articles and the user’s clicked news articles. Articles are represented by title words, body words, verts and subverts.

Iterator will not load the whole data into memory. Instead, it loads data into memory per mini-batch, so that large files can be used as input data.

col_spliter

column spliter in one line.

Type:str
ID_spliter

ID spliter in one line.

Type:str
batch_size

the samples num in one batch.

Type:int
title_size

max word num in news title.

Type:int
body_size

max word num in news body (abstract used in MIND).

Type:int
his_size

max clicked news num in user click history.

Type:int
npratio

negative and positive ratio used in negative sampling. -1 means no negative sampling is needed.

Type:int
__init__(hparams, npratio=-1, col_spliter='\t', ID_spliter='%')[source]

Initialize an iterator. Create necessary placeholders for the model.

Parameters:
  • hparams (object) – Global hyper-parameters. Some key settings such as head_num and head_dim are there.
  • graph (object) – the running graph. All created placeholder will be added to this graph.
  • col_spliter (str) – column spliter in one line.
  • ID_spliter (str) – ID spliter in one line.
init_behaviors(behaviors_file)[source]

Init behavior logs given behaviors file.

Parameters:behaviors_file (str) – path of behaviors file
init_news(news_file)[source]

Init news information given news file, such as news_title_index, news_abstract_index.

Parameters:news_file – path of news file
load_data_from_file(news_file, behavior_file)[source]

Read and parse data from a file.

Parameters:
  • news_file (str) – A file containing news information.
  • behavior_file (str) – A file containing information of user impressions.
Yields:

object – An iterator that yields parsed results, in the format of graph feed_dict.

load_dict(file_path)[source]

Load pickled file

Parameters:file_path (str) – File path
Returns:pickle load obj
Return type:object
load_impression_from_file(behaivors_file)[source]

Read and parse impression data from behaviors file.

Parameters:behaivors_file (str) – A file containing behavior information.
Yields:object – An iterator that yields parsed impression data, in the format of dict.
load_news_from_file(news_file)[source]

Read and parse news data from news file.

Parameters:news_file (str) – A file containing news information.
Yields:object – An iterator that yields parsed news feature, in the format of dict.
load_user_from_file(news_file, behavior_file)[source]

Read and parse user data from news file and behavior file.

Parameters:
  • news_file (str) – A file containing news information.
  • behavior_file (str) – A file containing information of user impressions.
Yields:

object – An iterator that yields parsed user feature, in the format of dict.

parser_one_line(line)[source]

Parse one string line into feature values.

Parameters:line (str) – a string indicating one instance.
Yields:list – Parsed results including label, impression id, user id, candidate_title_index, clicked_title_index, candidate_ab_index, clicked_ab_index, candidate_vert_index, clicked_vert_index, candidate_subvert_index, clicked_subvert_index.
class recommenders.models.newsrec.io.mind_iterator.MINDIterator(hparams, npratio=-1, col_spliter='\t', ID_spliter='%')[source]

Train data loader for NAML model. The model requires a special data format, where each instance contains a label, impression id, user id, the candidate news articles and the user’s clicked news articles. Articles are represented by title words, body words, verts and subverts.

Iterator will not load the whole data into memory. Instead, it loads data into memory per mini-batch, so that large files can be used as input data.

col_spliter

column spliter in one line.

Type:str
ID_spliter

ID spliter in one line.

Type:str
batch_size

the samples num in one batch.

Type:int
title_size

max word num in news title.

Type:int
his_size

max clicked news num in user click history.

Type:int
npratio

negative and positive ratio used in negative sampling. -1 means no negative sampling is needed.

Type:int
__init__(hparams, npratio=-1, col_spliter='\t', ID_spliter='%')[source]

Initialize an iterator. Create necessary placeholders for the model.

Parameters:
  • hparams (object) – Global hyper-parameters. Some key settings such as head_num and head_dim are there.
  • npratio (int) – negative and positive ratio used in negative sampling. -1 means no negative sampling is needed.
  • col_spliter (str) – column spliter in one line.
  • ID_spliter (str) – ID spliter in one line.
init_behaviors(behaviors_file)[source]

Init behavior logs given behaviors file.

Parameters:behaviors_file (str) – path of behaviors file

init_news(news_file)[source]

Init news information given news file, such as news_title_index and nid2index.

Parameters:news_file – path of news file

load_data_from_file(news_file, behavior_file)[source]

Read and parse data from news file and behavior file.

Parameters:
  • news_file (str) – A file containing news information.
  • behavior_file (str) – A file containing information of user impressions.
Yields:

object – An iterator that yields parsed results, in the format of dict.

load_dict(file_path)[source]

Load pickled file

Parameters:file_path (str) – File path
Returns:pickle loaded object
Return type:object
load_impression_from_file(behaivors_file)[source]

Read and parse impression data from behaviors file.

Parameters:behaivors_file (str) – A file containing behavior information.
Yields:object – An iterator that yields parsed impression data, in the format of dict.
load_news_from_file(news_file)[source]

Read and parse news data from news file.

Parameters:news_file (str) – A file containing news information.
Yields:object – An iterator that yields parsed news feature, in the format of dict.
load_user_from_file(news_file, behavior_file)[source]

Read and parse user data from news file and behavior file.

Parameters:
  • news_file (str) – A file containing news information.
  • behavior_file (str) – A file containing information of user impressions.
Yields:

object – An iterator that yields parsed user feature, in the format of dict.

parser_one_line(line)[source]

Parse one behavior sample into feature values. If npratio is larger than 0, return the negative-sampled result.

Parameters:line (int) – sample index.
Yields:list – Parsed results including label, impression id, user id, candidate_title_index, clicked_title_index.
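
When npratio is larger than 0, each clicked (positive) news item in an impression is paired with npratio non-clicked items. The sketch below illustrates one way to do this; the function name, the output layout, and padding with index 0 when too few negatives exist are all assumptions.

```python
import random


def sample_impression(pos_news, neg_news, npratio, seed=None):
    """Pair each positive news index with npratio sampled negative indices."""
    rng = random.Random(seed)
    samples = []
    for pos in pos_news:
        if npratio > len(neg_news):
            # Assumed behavior: pad with index 0 when negatives run short.
            negs = list(neg_news) + [0] * (npratio - len(neg_news))
        else:
            negs = rng.sample(list(neg_news), npratio)
        samples.append({"candidates": [pos] + negs,
                        "labels": [1] + [0] * npratio})
    return samples


sampled = sample_impression([7], [3, 4, 5, 6], npratio=2, seed=1)
padded = sample_impression([7], [3], npratio=2)
```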
class recommenders.models.newsrec.models.base_model.BaseModel(hparams, iterator_creator, seed=None)[source]

Basic class of models

hparams

A tf.contrib.training.HParams object, hold the entire set of hyperparameters.

Type:object
train_iterator

An iterator to load the data in training steps.

Type:object
test_iterator

An iterator to load the data in testing steps.

Type:object
graph

An optional graph.

Type:object
seed

Random seed.

Type:int
__init__(hparams, iterator_creator, seed=None)[source]

Initializing the model. Create common logics which are needed by all deeprec models, such as loss function, parameter set.

Parameters:
  • hparams (object) – A tf.contrib.training.HParams object, hold the entire set of hyperparameters.
  • iterator_creator (object) – An iterator to load the data.
  • graph (object) – An optional graph.
  • seed (int) – Random seed.
eval(eval_batch_data)[source]

Evaluate the data in feed_dict with current model.

Parameters:
  • sess (object) – The model session object.
  • feed_dict (dict) – Feed values for evaluation. This is a dictionary that maps graph elements to values.
Returns:

A list of evaluated results, including total loss value, data loss value, predicted scores, and ground-truth labels.

Return type:

list

fit(train_news_file, train_behaviors_file, valid_news_file, valid_behaviors_file, test_news_file=None, test_behaviors_file=None)[source]

Fit the model with train_file. Evaluate the model on valid_file per epoch to observe the training status. If test_news_file is not None, evaluate it too.

Parameters:
  • train_file (str) – training data set.
  • valid_file (str) – validation set.
  • test_news_file (str) – test set.
Returns:

An instance of self.

Return type:

object

group_labels(labels, preds, group_keys)[source]

Divide labels and preds into several groups according to values in group keys.

Parameters:
  • labels (list) – ground truth label list.
  • preds (list) – prediction score list.
  • group_keys (list) – group key list.
Returns:

  • Keys after group.
  • Labels after group.
  • Preds after group.

Return type:

list, list, list
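
Grouping is typically needed so that ranking metrics can be computed per impression. A plain-Python sketch of the grouping step follows; `group_by_key` is a hypothetical name and sorting the keys is an assumption.

```python
def group_by_key(labels, preds, group_keys):
    """Split labels and preds into one list per distinct group key."""
    keys = sorted(set(group_keys))
    grouped_labels = {k: [] for k in keys}
    grouped_preds = {k: [] for k in keys}
    for label, pred, key in zip(labels, preds, group_keys):
        grouped_labels[key].append(label)
        grouped_preds[key].append(pred)
    return (keys,
            [grouped_labels[k] for k in keys],
            [grouped_preds[k] for k in keys])


keys, labels_by_group, preds_by_group = group_by_key(
    [1, 0, 1, 0], [0.9, 0.2, 0.4, 0.6], ["a", "a", "b", "b"]
)
```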

run_eval(news_filename, behaviors_file)[source]

Evaluate the given file and returns some evaluation metrics.

Parameters:filename (str) – A file name that will be evaluated.
Returns:A dictionary that contains evaluation metrics.
Return type:dict
train(train_batch_data)[source]

Go through the optimization step once with training data in feed_dict.

Parameters:
  • sess (object) – The model session object.
  • feed_dict (dict) – Feed values to train the model. This is a dictionary that maps graph elements to values.
Returns:

A list of values, including update operation, total loss, data loss, and merged summary.

Return type:

list

class recommenders.models.newsrec.models.layers.AttLayer2(dim=200, seed=0, **kwargs)[source]

Soft alignment attention implement.

dim

attention hidden dim

Type:int
__init__(dim=200, seed=0, **kwargs)[source]

Initialization steps for AttLayer2.

Parameters:dim (int) – attention hidden dim
build(input_shape)[source]

Initialization for variables in AttLayer2. There are three variables in AttLayer2, i.e. W, b and q.

Parameters:input_shape (object) – shape of input tensor.
call(inputs, mask=None, **kwargs)[source]

Core implementation of soft attention

Parameters:inputs (object) – input tensor.
Returns:weighted sum of input tensors.
Return type:object
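
To make the "weighted sum of input tensors" concrete, the sketch below implements a common additive soft-attention form with the three variables W, b and q: score = q · tanh(xW + b), followed by a softmax over positions. That this matches the layer's exact computation is an assumption; masking is omitted and `soft_attention` is a hypothetical name.

```python
import math


def soft_attention(inputs, W, b, q):
    """inputs: T vectors of size d; W: d x h matrix; b and q: vectors of size h."""
    scores = []
    for x in inputs:
        hidden = [
            math.tanh(sum(x[i] * W[i][j] for i in range(len(x))) + b[j])
            for j in range(len(b))
        ]
        scores.append(sum(h * qj for h, qj in zip(hidden, q)))
    # Softmax over the T positions (shifted by the max for stability).
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted sum of the input vectors.
    dim = len(inputs[0])
    pooled = [sum(w * x[i] for w, x in zip(weights, inputs)) for i in range(dim)]
    return pooled, weights


pooled, weights = soft_attention(
    inputs=[[1.0, 0.0], [0.0, 1.0]],
    W=[[0.0, 0.0], [0.0, 0.0]],
    b=[0.0, 0.0],
    q=[1.0, 1.0],
)
```

With an all-zero W every position gets the same score, so the weights are uniform and the output is the mean of the inputs.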
compute_mask(input, input_mask=None)[source]

Compute output mask value

Parameters:
  • input (object) – input tensor.
  • input_mask – input mask
Returns:

output mask.

Return type:

object

compute_output_shape(input_shape)[source]

Compute shape of output tensor

Parameters:input_shape (tuple) – shape of input tensor.
Returns:shape of output tensor.
Return type:tuple
class recommenders.models.newsrec.models.layers.ComputeMasking(**kwargs)[source]

Compute whether inputs contain zero values.

Returns:True for values not equal to zero.
Return type:bool tensor
__init__(**kwargs)[source]
call(inputs, **kwargs)[source]

This is where the layer’s logic lives.

Parameters:
  • inputs – Input tensor, or list/tuple of input tensors.
  • **kwargs – Additional keyword arguments.
Returns:

A tensor or list/tuple of tensors.

compute_output_shape(input_shape)[source]

Computes the output shape of the layer.

If the layer has not been built, this method will call build on the layer. This assumes that the layer will later be used with inputs that match the input shape provided here.

Parameters:input_shape – Shape tuple (tuple of integers) or list of shape tuples (one per output tensor of the layer). Shape tuples can include None for free dimensions, instead of an integer.
Returns:An input shape tuple.
class recommenders.models.newsrec.models.layers.OverwriteMasking(**kwargs)[source]

Set values at specific positions to zero.

Parameters:inputs (list) – value tensor and mask tensor.
Returns:tensor after setting values to zero.
Return type:object
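
The two masking layers can be pictured with nested lists instead of tensors: one step marks the non-zero (non-padding) positions, the other zeroes out values wherever the mask is false. The function names are hypothetical, and the sketch ignores tensor shapes and dtypes.

```python
def compute_mask(values):
    """True where the value is not zero, i.e. not padding."""
    return [[v != 0 for v in row] for row in values]


def overwrite_with_mask(values, mask):
    """Keep a value where the mask is True and set it to 0 elsewhere."""
    return [
        [v if keep else 0 for v, keep in zip(row, mask_row)]
        for row, mask_row in zip(values, mask)
    ]


token_ids = [[3, 0, 5]]
mask = compute_mask(token_ids)
masked_scores = overwrite_with_mask([[0.2, 0.7, 0.1]], mask)
```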
__init__(**kwargs)[source]
build(input_shape)[source]

Creates the variables of the layer (optional, for subclass implementers).

This is a method that implementers of subclasses of Layer or Model can override if they need a state-creation step in-between layer instantiation and layer call.

This is typically used to create the weights of Layer subclasses.

Parameters:input_shape – Instance of TensorShape, or list of instances of TensorShape if the layer expects a list of inputs (one instance per input).
call(inputs, **kwargs)[source]

This is where the layer’s logic lives.

Parameters:
  • inputs – Input tensor, or list/tuple of input tensors.
  • **kwargs – Additional keyword arguments.
Returns:

A tensor or list/tuple of tensors.

compute_output_shape(input_shape)[source]

Computes the output shape of the layer.

If the layer has not been built, this method will call build on the layer. This assumes that the layer will later be used with inputs that match the input shape provided here.

Parameters:input_shape – Shape tuple (tuple of integers) or list of shape tuples (one per output tensor of the layer). Shape tuples can include None for free dimensions, instead of an integer.
Returns:An input shape tuple.
recommenders.models.newsrec.models.layers.PersonalizedAttentivePooling(dim1, dim2, dim3, seed=0)[source]

Soft alignment attention implement.

recommenders.models.newsrec.models.layers.dim1

first dimension of value shape.

Type:int
recommenders.models.newsrec.models.layers.dim2

second dimension of value shape.

Type:int
recommenders.models.newsrec.models.layers.dim3

shape of query

Type:int
Returns:weighted summary of inputs value.
Return type:object
class recommenders.models.newsrec.models.layers.SelfAttention(multiheads, head_dim, seed=0, mask_right=False, **kwargs)[source]

Multi-head self attention implement.

Parameters:
  • multiheads (int) – The number of heads.
  • head_dim (object) – Dimension of each head.
  • mask_right (boolean) – whether to mask right words.
Returns:

Weighted sum after attention.

Return type:

object

Mask(inputs, seq_len, mode='add')[source]

Mask operation used in multi-head self attention

Parameters:
  • seq_len (object) – sequence length of inputs.
  • mode (str) – mode of mask.
Returns:

tensors after masking.

Return type:

object

__init__(multiheads, head_dim, seed=0, mask_right=False, **kwargs)[source]

Initialization steps for SelfAttention.

Parameters:
  • multiheads (int) – The number of heads.
  • head_dim (object) – Dimension of each head.
  • mask_right (boolean) – whether to mask right words.
build(input_shape)[source]

Initialization for variables in SelfAttention. There are three variables in SelfAttention, i.e. WQ, WK and WV. WQ is used for linear transformation of query. WK is used for linear transformation of key. WV is used for linear transformation of value.

Parameters:input_shape (object) – shape of input tensor.
call(QKVs)[source]

Core logic of multi-head self attention.

Parameters:QKVs (list) – inputs of multi-head self attention, i.e. query, key and value.
Returns:output tensors.
Return type:object
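As a rough illustration of what call computes, the following NumPy sketch applies the three projections WQ, WK and WV described under build and then scaled dot-product attention per head. It is not the library's Keras implementation: the function name, the weight shapes, the scaling by the square root of head_dim, and the omission of masking are all assumptions made for this example.

```python
import numpy as np

def multihead_self_attention(q, k, v, wq, wk, wv, multiheads, head_dim):
    """Hypothetical NumPy sketch of multi-head attention (no masking)."""
    batch, seq_len, _ = q.shape

    def project(x, w):
        # Linear transformation, then split the last axis into heads:
        # (batch, seq, heads * head_dim) -> (batch, heads, seq, head_dim)
        y = x @ w
        y = y.reshape(batch, -1, multiheads, head_dim)
        return y.transpose(0, 2, 1, 3)

    q_h, k_h, v_h = project(q, wq), project(k, wk), project(v, wv)

    # Scaled dot-product scores: (batch, heads, seq_q, seq_k)
    scores = q_h @ k_h.transpose(0, 1, 3, 2) / np.sqrt(head_dim)
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)

    # Weighted sum of values, then merge the heads back together
    out = weights @ v_h  # (batch, heads, seq_q, head_dim)
    return out.transpose(0, 2, 1, 3).reshape(batch, seq_len, multiheads * head_dim)

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 5, 8))  # (batch, seq_len, input_dim)
heads, dim = 4, 3
wq, wk, wv = (rng.normal(size=(8, heads * dim)) for _ in range(3))
out = multihead_self_attention(x, x, x, wq, wk, wv, heads, dim)
```

In this sketch the output keeps the batch and sequence axes of the query and has a last axis of size multiheads * head_dim.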
compute_output_shape(input_shape)[source]

Compute shape of output tensor.

Returns:output shape tuple.
Return type:tuple
get_config()[source]

Add multiheads, head_dim and mask_right into the layer config.

Returns:config of SelfAttention layer.
Return type:dict
class recommenders.models.newsrec.models.lstur.LSTURModel(hparams, iterator_creator, seed=None)[source]

LSTUR model (Neural News Recommendation with Long- and Short-term User Representations)

Mingxiao An, Fangzhao Wu, Chuhan Wu, Kun Zhang, Zheng Liu and Xing Xie: Neural News Recommendation with Long- and Short-term User Representations, ACL 2019

word2vec_embedding

Pretrained word embedding matrix.

Type:numpy.ndarray
hparam

Global hyper-parameters.

Type:object
__init__(hparams, iterator_creator, seed=None)[source]

Initialization steps for LSTUR. Compared with the BaseModel, LSTUR needs a word embedding. After the word embedding matrix is created, BaseModel’s __init__ method is called.

Parameters:
  • hparams (object) – Global hyper-parameters. Some key settings such as type and gru_unit are there.
  • iterator_creator_train (object) – LSTUR data loader class for train data.
  • iterator_creator_test (object) – LSTUR data loader class for test and validation data
class recommenders.models.newsrec.models.naml.NAMLModel(hparams, iterator_creator, seed=None)[source]

NAML model (Neural News Recommendation with Attentive Multi-View Learning)

Chuhan Wu, Fangzhao Wu, Mingxiao An, Jianqiang Huang, Yongfeng Huang and Xing Xie, Neural News Recommendation with Attentive Multi-View Learning, IJCAI 2019

word2vec_embedding

Pretrained word embedding matrix.

Type:numpy.ndarray
hparam

Global hyper-parameters.

Type:object
__init__(hparams, iterator_creator, seed=None)[source]

Initialization steps for NAML. Compared with the BaseModel, NAML needs a word embedding. After the word embedding matrix is created, BaseModel’s __init__ method is called.

Parameters:
  • hparams (object) – Global hyper-parameters. Some key settings such as filter_num are there.
  • iterator_creator_train (object) – NAML data loader class for train data.
  • iterator_creator_test (object) – NAML data loader class for test and validation data
class recommenders.models.newsrec.models.npa.NPAModel(hparams, iterator_creator, seed=None)[source]

NPA model (Neural News Recommendation with Personalized Attention)

Chuhan Wu, Fangzhao Wu, Mingxiao An, Jianqiang Huang, Yongfeng Huang and Xing Xie: NPA: Neural News Recommendation with Personalized Attention, KDD 2019, ADS track.

word2vec_embedding

Pretrained word embedding matrix.

Type:numpy.ndarray
hparam

Global hyper-parameters.

Type:object
__init__(hparams, iterator_creator, seed=None)[source]

Initialization steps for NPA. Compared with the BaseModel, NPA needs a word embedding. After the word embedding matrix is created, BaseModel’s __init__ method is called.

Parameters:
  • hparams (object) – Global hyper-parameters. Some key settings such as filter_num are there.
  • iterator_creator_train (object) – NPA data loader class for train data.
  • iterator_creator_test (object) – NPA data loader class for test and validation data
class recommenders.models.newsrec.models.nrms.NRMSModel(hparams, iterator_creator, seed=None)[source]

NRMS model (Neural News Recommendation with Multi-Head Self-Attention)

Chuhan Wu, Fangzhao Wu, Suyu Ge, Tao Qi, Yongfeng Huang,and Xing Xie, “Neural News Recommendation with Multi-Head Self-Attention” in Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

word2vec_embedding

Pretrained word embedding matrix.

Type:numpy.ndarray
hparam

Global hyper-parameters.

Type:object
__init__(hparams, iterator_creator, seed=None)[source]

Initialization steps for NRMS. Compared with the BaseModel, NRMS needs a word embedding. After the word embedding matrix is created, BaseModel’s __init__ method is called.

Parameters:
  • hparams (object) – Global hyper-parameters. Some key settings such as head_num and head_dim are there.
  • iterator_creator_train (object) – NRMS data loader class for train data.
  • iterator_creator_test (object) – NRMS data loader class for test and validation data
recommenders.models.newsrec.newsrec_utils.check_nn_config(f_config)[source]

Check neural networks configuration.

Parameters:f_config (dict) – Neural network configuration.
Raises:ValueError – If the parameters are not correct.
recommenders.models.newsrec.newsrec_utils.check_type(config)[source]

Check that the config parameters are the correct type

Parameters:config (dict) – Configuration dictionary.
Raises:TypeError – If the parameters are not the correct type.
recommenders.models.newsrec.newsrec_utils.create_hparams(flags)[source]

Create the model hyperparameters.

Parameters:flags (dict) – Dictionary with the model requirements.
Returns:Hyperparameter object in TF (tf.contrib.training.HParams).
Return type:object
recommenders.models.newsrec.newsrec_utils.get_mind_data_set(type)[source]

Get MIND dataset address

Parameters:type (str) – type of mind dataset, must be in [‘large’, ‘small’, ‘demo’]
Returns:data url and train valid dataset name
Return type:list
recommenders.models.newsrec.newsrec_utils.newsample(news, ratio)[source]

Sample ratio items from the news list. If the news list has fewer than ratio items, pad with zeros.

Parameters:
  • news (list) – input news list
  • ratio (int) – sample number
Returns:

output of sample list.

Return type:

list
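The sampling-with-padding behaviour described for newsample can be sketched as follows. This is an illustrative re-implementation under assumptions, not necessarily the library's code: the name newsample_sketch is hypothetical, and sampling without replacement is assumed.

```python
import random

def newsample_sketch(news, ratio):
    """Hypothetical re-implementation of the described behaviour."""
    if ratio > len(news):
        # Not enough items: keep them all and pad with zeros
        return news + [0] * (ratio - len(news))
    # Enough items: draw `ratio` of them without replacement (assumption)
    return random.sample(news, ratio)

padded = newsample_sketch([11, 12], 4)
sampled = newsample_sketch([1, 2, 3, 4, 5], 3)
```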

recommenders.models.newsrec.newsrec_utils.prepare_hparams(yaml_file=None, **kwargs)[source]

Prepare the model hyperparameters and check that all have the correct value.

Parameters:yaml_file (str) – YAML file as configuration.
Returns:Hyperparameter object in TF (tf.contrib.training.HParams).
Return type:object
recommenders.models.newsrec.newsrec_utils.word_tokenize(sent)[source]

Split sentence into word list using regex.

Parameters:sent (str) – Input sentence.

Returns:word list
Return type:list
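A regex-based tokenizer of this kind might look like the sketch below. The pattern, the lowercasing step, and the empty-list result for non-string input are assumptions chosen for illustration; the name word_tokenize_sketch is hypothetical.

```python
import re

# Assumed pattern: runs of word characters, or a single punctuation mark
_TOKEN_PATTERN = re.compile(r"[\w]+|[.,!?;|]")

def word_tokenize_sketch(sent):
    """Hypothetical regex tokenizer; non-string input yields an empty list."""
    if isinstance(sent, str):
        return _TOKEN_PATTERN.findall(sent.lower())
    return []

tokens = word_tokenize_sketch("Hello, world! It's 2019.")
```

With this pattern the apostrophe is not a token, so "it's" is split into "it" and "s".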

RBM

class recommenders.models.rbm.rbm.RBM(hidden_units=500, keep_prob=0.7, init_stdv=0.1, learning_rate=0.004, minibatch_size=100, training_epoch=20, display_epoch=10, sampling_protocol=[50, 70, 80, 90, 100], debug=False, with_metrics=False, seed=42)[source]

Restricted Boltzmann Machine

__init__(hidden_units=500, keep_prob=0.7, init_stdv=0.1, learning_rate=0.004, minibatch_size=100, training_epoch=20, display_epoch=10, sampling_protocol=[50, 70, 80, 90, 100], debug=False, with_metrics=False, seed=42)[source]

Implementation of a multinomial Restricted Boltzmann Machine for collaborative filtering in numpy/pandas/tensorflow

Based on the article by Ruslan Salakhutdinov, Andriy Mnih and Geoffrey Hinton https://www.cs.toronto.edu/~rsalakhu/papers/rbmcf.pdf

In this implementation we use multinomial units instead of the one-hot-encoded units used in the paper. This means that the weights are rank 2 (matrices) instead of rank 3 tensors.

Basic mechanics:

1) A computational graph is created when the RBM class is instantiated. For an item-based recommender this consists of:

  • visible units: the number Nv of visible units equals the number of items
  • hidden units: their number is a hyperparameter, fixed during training

2) Gibbs sampling:

2.1) For each training epoch, the visible units are first clamped on the data.

2.2) The activation probability of the hidden units, given a linear combination of the visibles, is evaluated as P(h=1|phi_v). The latter is then used to sample the value of the hidden units.

2.3) The probability P(v=l|phi_h) is evaluated, where l=1,..,r are the ratings (e.g. r=5 for the MovieLens dataset). In general, this is a multinomial distribution, from which we sample the value of v.

2.4) This step is repeated k times, where k increases as optimization converges. It is essential to keep the originally unrated items fixed to zero during the whole learning process.

3) Optimization: The free energy of the visible units given the hidden is evaluated at the beginning (F_0) and after k steps of Bernoulli sampling (F_k). The weights and biases are updated by minimizing the difference F_0 - F_k.

4) Inference: Once the joint probability distribution P(v,h) is learned, this is used to generate ratings for unrated items for all users.

accuracy(vp)[source]

Train/Test Mean average precision

Evaluates MAP over the train/test set in online mode. Note that this needs to be evaluated on the rated items only.

\(acc = \frac{1}{m} \sum_{\mu=1}^{m} \sum_{i=1}^{Nv} \frac{1}{s(i)} I(v - vp = 0)_{\mu,i}\)

where m = Nusers, Nv = number of items = number of visible units and s(i) is the number of non-zero elements per row.

Parameters:vp (tf.Tensor, float32) – Inferred output (Network prediction)
Returns:accuracy.
Return type:tf.Tensor
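To make the formula concrete, here is a small NumPy sketch that evaluates the same quantity on rated items only. It is not the TensorFlow code used by the class; the function name is hypothetical, and s is taken to be the number of non-zero entries in each user's row, as stated above.

```python
import numpy as np

def accuracy_sketch(v, vp):
    """Per-user fraction of rated items predicted exactly, averaged over users."""
    v = np.asarray(v, dtype=float)
    vp = np.asarray(vp, dtype=float)
    rated = v != 0                    # mask of rated entries
    correct = (v == vp) & rated       # indicator I(v - vp = 0) on rated items
    s = rated.sum(axis=1)             # number of non-zero elements per row
    per_user = correct.sum(axis=1) / np.maximum(s, 1)  # avoid division by zero
    return per_user.mean()

v = np.array([[5, 0, 3], [0, 4, 0]])
vp = np.array([[5, 2, 1], [3, 4, 1]])
acc = accuracy_sketch(v, vp)
```

Here the first user has one of two rated items predicted exactly and the second user has one of one, so the average is 0.75.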
batch_training(num_minibatches)[source]

Perform training over input minibatches. If self.with_metrics is False, no online metrics are evaluated.

Parameters:num_minibatches (scalar, int32) – Number of training minibatches.
Returns:Training error per single epoch. If self.with_metrics is False, this is zero.
Return type:float
binomial_sampling(pr)[source]

Binomial sampling of hidden units activations using a rejection method.

Basic mechanics:

1) Extract a random number from a uniform distribution (g) and compare it with the unit’s probability (pr)

2) Choose 0 if pr<g, 1 otherwise. It is convenient to implement this condition using the relu function.

Parameters:
  • pr (tf.Tensor, float32) – Input conditional probability.
  • g (numpy.ndarray, float32) – Uniform probability used for comparison.
Returns:

Float32 tensor of sampled units. The value is 1 if pr>g and 0 otherwise.

Return type:

tf.Tensor
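The relu-based comparison described above can be illustrated in NumPy as follows. This is a sketch under assumptions (a NumPy random generator stands in for the uniform draw, and the function name is hypothetical), not the class's TensorFlow code.

```python
import numpy as np

def binomial_sampling_sketch(pr, rng):
    """Sample 0/1 activations: 1 where pr > g, 0 otherwise."""
    g = rng.uniform(size=pr.shape)  # uniform reference numbers in [0, 1)
    # relu(sign(pr - g)) maps positive differences to 1 and the rest to 0
    return np.maximum(np.sign(pr - g), 0.0)

rng = np.random.default_rng(42)
pr = np.array([[0.0, 1.0, 0.5], [0.9, 0.1, 1.0]])
h = binomial_sampling_sketch(pr, rng)
```

Units with probability 0 are never activated and units with probability 1 always are; the others are activated at random.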

data_pipeline()[source]

Define the data pipeline

display_metrics(Rmse_train, precision_train, precision_test)[source]

Display training/test metrics and plot the RMSE as a function of the training epochs.

Parameters:
  • Rmse_train (list, float32) – Per epoch rmse on the train set.
  • precision_train (float) – Precision on the train set.
  • precision_test (float) – Precision on the test set.
eval_out()[source]

Implement multinomial sampling from a trained model

fit(xtr, xtst)[source]

Fit method

Training in generative models takes place in two steps:

  1. Gibbs sampling
  2. Gradient evaluation and parameters update

This estimate is later used in the weight update step by minimizing the distance between the model and the empirical free energy. Note that while the unit’s configuration space is sampled, the weights are determined via maximum likelihood (saddle point).

Main component of the algo; once instantiated, it generates the computational graph and performs model training

Parameters:
  • xtr (numpy.ndarray, integers) – the user/affinity matrix for the train set
  • xtst (numpy.ndarray, integers) – the user/affinity matrix for the test set
Returns:

elapsed time during training

Return type:

float

free_energy(x)[source]

Free energy of the visible units given the hidden units. Since the sum is over the hidden units’ states, the functional form of the visible units Free energy is the same as the one for the binary model.

Parameters:x (tf.Tensor) – This can be either the sampled value of the visible units (v_k) or the input data
Returns:Free energy of the model.
Return type:tf.Tensor
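For reference, the free energy of an RBM with binary hidden units can be written as a visible-bias term plus a softplus term summed over the hidden units. The NumPy sketch below assumes that standard form, the parameter shapes listed under init_parameters, and a sum over the whole batch; the function name is hypothetical.

```python
import numpy as np

def free_energy_sketch(x, w, bv, bh):
    """F(x) = -x . bv - sum_j softplus(x W + bh)_j, summed over the batch."""
    bias_term = -np.sum(x @ bv.T)                  # visible-bias contribution
    phi = x @ w + bh                               # input to each hidden unit
    hidden_term = -np.sum(np.logaddexp(0.0, phi))  # softplus = log(1 + exp(phi))
    return bias_term + hidden_term

x = np.array([[1.0, 0.0, 2.0]])  # one user, Nv = 3
w = np.zeros((3, 2))             # shape (Nv, Nh)
bv = np.zeros((1, 3))
bh = np.zeros((1, 2))
f0 = free_energy_sketch(x, w, bv, bh)
```

With all parameters at zero, each hidden unit contributes log 2, so the result is -2 log 2 for two hidden units.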
generate_graph()[source]

Call the different RBM modules to generate the computational graph

gibbs_protocol(i)[source]

Gibbs protocol.

Basic mechanics:

If the current epoch i is in the interval specified in the training protocol, the number of steps in Gibbs sampling (k) is incremented by one and gibbs_sampling is updated accordingly.

Parameters:i (int) – Current epoch in the loop
gibbs_sampling()[source]

Gibbs sampling: Determines an estimate of the model configuration via sampling. In the binary RBM we need to impose that unseen movies stay as such, i.e. the sampling phase should not modify the elements where v=0.

Parameters:
  • k (scalar, integer) – iterator. Number of sampling steps.
  • v (tf.Tensor, float32) – visible units.
Returns:

  • h_k: The sampled value of the hidden unit at step k, float32.
  • v_k: The sampled value of the visible unit at step k, float32.

Return type:

tf.Tensor, tf.Tensor

init_gpu()[source]

Config GPU memory

init_metrics()[source]

Initialize metrics

init_parameters()[source]

Initialize the parameters of the model.

This is a single layer model with two biases. So we have a rectangular matrix w_{ij} and two bias vectors to initialize.

Parameters:
  • Nv (int) – number of visible units (input layer)
  • Nh (int) – number of hidden units (latent variables of the model)
Returns:

  • w of size (Nv, Nh): correlation matrix initialized by sampling from a normal distribution with zero mean and standard deviation init_stdv.
  • bv of size (1, Nvisible): visible units’ bias, initialized to zero.
  • bh of size (1, Nhidden): hidden units’ bias, initialized to zero.

Return type:

tf.Tensor, tf.Tensor, tf.Tensor

init_training_session(xtr)[source]

Initialize the TF session on training data

Parameters:xtr (numpy.ndarray, int32) – The user/affinity matrix for the train set.
losses(vv)[source]

Loss functions.

Parameters:
  • v (tf.Tensor, float32) – empirical input
  • v_k (tf.Tensor, float32) – sampled visible units at step k
Returns:

  • Objective function of Contrastive divergence: the difference between the free energy clamped on the data (v) and the model Free energy (v_k).

Return type:

object

multinomial_distribution(phi)[source]

Probability that unit v has value l given phi: P(v=l|phi)

Parameters:
  • phi (tf.Tensor) – linear combination of values of the previous layer
  • r (float) – rating scale, corresponding to the number of classes
Returns:

  • A tensor of shape (r, m, Nv): This needs to be reshaped as (m, Nv, r) in the last step to allow for faster sampling when used in the multinomial function.

Return type:

tf.Tensor
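One way to realise P(v=l|phi) is a softmax over the rating values, with unnormalized weight exp(l * phi) for rating l. The NumPy sketch below assumes that form and shows the reshaping from (r, m, Nv) to (m, Nv, r) mentioned above; the function name and the explicit ratings argument are hypothetical.

```python
import numpy as np

def multinomial_distribution_sketch(phi, ratings):
    """P(v = l | phi) proportional to exp(l * phi), for l in ratings."""
    # Stack one unnormalized term per rating value: shape (r, m, Nv)
    numerator = np.stack([np.exp(l * phi) for l in ratings])
    prob = numerator / numerator.sum(axis=0)
    # Move the rating axis last: (r, m, Nv) -> (m, Nv, r)
    return np.transpose(prob, (1, 2, 0))

phi = np.array([[0.0, 0.5], [-0.2, 0.1], [0.3, 0.0]])  # m = 3 users, Nv = 2 items
pr = multinomial_distribution_sketch(phi, ratings=[1, 2, 3, 4, 5])
```

Each (user, item) slice of pr sums to one, and phi = 0 gives a uniform distribution over the ratings.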

multinomial_sampling(pr)[source]

Multinomial Sampling of ratings

Basic mechanics: For r classes, we sample r binomial distributions using the rejection method. This is possible since each class is statistically independent from the other. Note that this is the same method used in numpy’s random.multinomial() function.

1) extract a size r array of random numbers from a uniform distribution (g). As pr is normalized, we need to normalize g as well.

2) For each user and item, compare pr with the reference distribution. Note that the latter needs to be the same for ALL the user/item pairs in the dataset, as by assumptions they are sampled from a common distribution.

Parameters:
  • pr (tf.Tensor, float32) – A distributions of shape (m, n, r), where m is the number of examples, n the number of features and r the number of classes. pr needs to be normalized, i.e. sum_k p(k) = 1 for all m, at fixed n.
  • f (tf.Tensor, float32) – Normalized, uniform probability used for comparison.
Returns:

An (m,n) float32 tensor of sampled ratings from 1 to r.

Return type:

tf.Tensor
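The two steps above can be sketched in NumPy as shown below. This is an illustration under assumptions rather than the class's TensorFlow code: the function name is hypothetical, and when no class exceeds the reference, argmax falls back to the first rating.

```python
import numpy as np

def multinomial_sampling_sketch(pr, rng):
    """Return an (m, n) array of sampled ratings in 1..r."""
    r = pr.shape[2]
    g = rng.uniform(size=r)
    f = g / g.sum()                          # normalized reference, shared by all pairs
    samp = np.maximum(np.sign(pr - f), 0.0)  # 1 where pr exceeds the reference
    # argmax returns the first class that exceeds the reference; add 1 for ratings
    return (np.argmax(samp, axis=2) + 1).astype(np.float32)

rng = np.random.default_rng(0)
pr = np.full((4, 3, 5), 0.2)  # uniform distribution over r = 5 ratings
v = multinomial_sampling_sketch(pr, rng)
```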

placeholder()[source]

Initialize the placeholders for the visible units

predict(x, maps)[source]

Returns the inferred ratings. This method is similar to recommend_k_items() with the exceptions that it returns all the inferred ratings

Basic mechanics:

The method samples new ratings from the learned joint distribution, together with their probabilities. The input x must have the same number of columns as the one used for training the model, i.e. the same number of items, but it can have an arbitrary number of rows (users).

Parameters:
  • x (numpy.ndarray, int32) – Input user/affinity matrix. Note that this can be a single vector, i.e. the ratings of a single user.
Returns:

  • A matrix with the inferred ratings.
  • The elapsed time for prediction.

Return type:

numpy.ndarray, float

recommend_k_items(x, top_k=10, remove_seen=True)[source]

Returns the top-k items ordered by a relevancy score.

Basic mechanics:

The method samples new ratings from the learned joint distribution, together with their probabilities. The input x must have the same number of columns as the one used for training the model (i.e. the same number of items) but it can have an arbitrary number of rows (users).

A recommendation score is evaluated by taking the element-wise product between the ratings and the associated probabilities. For example, we could have the following situation:

        rating     probability     score
item1     5           0.5          2.5
item2     4           0.8          3.2

then item2 will be recommended.

Parameters:
  • x (numpy.ndarray, int32) – input user/affinity matrix. Note that this can be a single vector, i.e. the ratings of a single user.
  • top_k (scalar, int32) – the number of items to recommend.
Returns:

  • A sparse matrix containing the top_k elements ordered by their score.
  • The time taken to recommend k items.

Return type:

numpy.ndarray, float
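The scoring rule in the table above (score = rating × probability, then keep the highest-scoring items) can be illustrated with a short NumPy sketch. The function name and the dense-array return values are assumptions for this example; the method itself is documented as returning a sparse matrix.

```python
import numpy as np

def top_k_by_score(ratings, probabilities, top_k):
    """Rank items per user by rating * probability; return item indices and scores."""
    score = ratings * probabilities                # element-wise product
    order = np.argsort(-score, axis=1)[:, :top_k]  # best items first
    return order, np.take_along_axis(score, order, axis=1)

# One user, two items, using the numbers from the table above
ratings = np.array([[5.0, 4.0]])
probabilities = np.array([[0.5, 0.8]])
items, scores = top_k_by_score(ratings, probabilities, top_k=1)
```

The selected index is 1, i.e. item2, in line with the example in the table.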

rmse(vp)[source]

Root Mean Square Error

Note that this needs to be evaluated on the rated items only

Parameters:vp (tf.Tensor, float32) – Inferred output (Network prediction)
Returns:root mean square error.
Return type:tf.Tensor
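The restriction to rated items can be illustrated with the NumPy sketch below, which computes a conventional RMSE over the non-zero entries of v. The function name is hypothetical, and the exact normalization used inside the class may differ from this textbook definition.

```python
import numpy as np

def rmse_rated_only(v, vp):
    """RMSE between v and vp restricted to entries where v is non-zero."""
    v = np.asarray(v, dtype=float)
    vp = np.asarray(vp, dtype=float)
    rated = v != 0  # unrated items are encoded as zero
    squared_error = (v[rated] - vp[rated]) ** 2
    return float(np.sqrt(squared_error.mean()))

v = np.array([[5, 0, 3], [0, 4, 0]])
vp = np.array([[4, 2, 3], [1, 2, 5]])
err = rmse_rated_only(v, vp)
```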
sample_hidden_units(vv)[source]

Sampling: In RBM we use Contrastive divergence to sample the parameter space. In order to do that we need to initialize the two conditional probabilities:

P(h|phi_v) –> returns the probability that the i-th hidden unit is active

P(v|phi_h) –> returns the probability that the i-th visible unit is active

Sample hidden units given the visibles. This can be thought of as a Forward pass step in a FFN

Parameters:vv (tf.Tensor, float32) – visible units
Returns:
  • phv: The activation probability of the hidden unit.
  • h_: The sampled value of the hidden unit from a Bernoulli distributions having success probability phv.
Return type:tf.Tensor, tf.Tensor
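As a minimal sketch of this forward step, the NumPy code below computes the activation probability with a sigmoid and then draws Bernoulli samples from it. It assumes the parameter shapes from init_parameters and leaves out any dropout the class may apply; the function name is hypothetical.

```python
import numpy as np

def sample_hidden_units_sketch(vv, w, bh, rng):
    """Return (phv, h_): activation probabilities and Bernoulli samples."""
    phi_v = vv @ w + bh                 # linear combination of the visibles
    phv = 1.0 / (1.0 + np.exp(-phi_v))  # sigmoid activation probability
    h_ = (rng.uniform(size=phv.shape) < phv).astype(np.float32)
    return phv, h_

rng = np.random.default_rng(1)
vv = np.array([[5.0, 0.0, 3.0], [0.0, 4.0, 0.0]])  # 2 users, Nv = 3
w = np.zeros((3, 4))                               # shape (Nv, Nh)
bh = np.zeros((1, 4))
phv, h_ = sample_hidden_units_sketch(vv, w, bh, rng)
```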
sample_visible_units(h)[source]

Sample the visible units given the hiddens. This can be thought of as a Backward pass in a FFN (negative phase). Each visible unit can take values in [1,rating], while the zero is reserved for missing data; as such the value of the visible unit is sampled from a multinomial distribution.

Basic mechanics:

1) For every training example we first sample Nv Multinomial distributions. The result is of the form [0,1,0,0,0,…,0] where the index of the 1 element corresponds to the rth rating. The index is extracted using the argmax function and we need to add 1 at the end since array indices start from 0.

2) Selects only those units that have been sampled. During the training phase it is important to not use the reconstructed inputs, so we need to enforce a zero value in the reconstructed ratings in the same position as the original input.

Parameters:h (tf.Tensor, float32) – hidden units.
Returns:
  • pvh: The activation probability of the visible unit given the hidden.
  • v_: The sampled value of the visible unit from a Multinomial distributions having success probability pvh.
Return type:tf.Tensor, tf.Tensor
time()[source]

Time a particular section of the code - call this once to set the state somewhere in the code, then call it again to return the elapsed time since last call. Call again to set the time and so on…

Returns:If the timer was started, the time in seconds since the last call to time().
Return type:float
train_test_precision(xtst)[source]

Evaluates precision on the train and test set

Parameters:xtst (numpy.ndarray, integer32) – The user/affinity matrix for the test set
Returns:Precision on the train and test sets.
Return type:float, float

RLRMC

class recommenders.models.rlrmc.RLRMCalgorithm.RLRMCalgorithm(rank, C, model_param, initialize_flag='random', max_time=1000, maxiter=100, seed=42)[source]

RLRMC algorithm implementation.

__init__(rank, C, model_param, initialize_flag='random', max_time=1000, maxiter=100, seed=42)[source]

Initialize parameters.

Parameters:
  • rank (int) – rank of the final model. Should be a positive integer.
  • C (float) – regularization parameter. Should be a positive real number.
  • model_param (dict) – contains model parameters such as number of rows & columns of the matrix as well as the mean rating in the training dataset.
  • initialize_flag (str) – flag to set the initialization step of the algorithm. Current options are ‘random’ (which is random initialization) and ‘svd’ (which is a singular value decomposition based initialization).
  • max_time (int) – maximum time (in seconds), for which the algorithm is allowed to execute.
  • maxiter (int) – maximum number of iterations, for which the algorithm is allowed to execute.
fit(RLRMCdata, verbosity=0, _evaluate=False)[source]

The underlying fit method for RLRMC

Parameters:
  • RLRMCdata (RLRMCdataset) – the RLRMCdataset object.
  • verbosity (int) – verbosity of Pymanopt. Possible values are 0 (least verbose), 1, or 2 (most verbose).
  • _evaluate (bool) – flag to compute the per iteration statistics in train (and validation) datasets.
fit_and_evaluate(RLRMCdata, verbosity=0)[source]

Main fit and evaluate method for RLRMC. In addition to fitting the model, it also computes the per iteration statistics in train (and validation) datasets.

Parameters:
  • RLRMCdata (RLRMCdataset) – the RLRMCdataset object.
  • verbosity (int) – verbosity of Pymanopt. Possible values are 0 (least verbose), 1, or 2 (most verbose).
predict(user_input, item_input, low_memory=False)[source]

Predict function of this trained model

Parameters:
  • user_input (list or element of list) – userID or userID list
  • item_input (list or element of list) – itemID or itemID list
Returns:

list of predicted rating or predicted rating score.

Return type:

list or float

class recommenders.models.rlrmc.RLRMCdataset.RLRMCdataset(train, validation=None, test=None, mean_center=True, col_user='userID', col_item='itemID', col_rating='rating', col_timestamp='timestamp')[source]

RLRMC dataset implementation. Creates sparse data structures for RLRMC algorithm.

__init__(train, validation=None, test=None, mean_center=True, col_user='userID', col_item='itemID', col_rating='rating', col_timestamp='timestamp')[source]

Initialize parameters.

Parameters:
  • train (pandas.DataFrame) – training data with at least columns (col_user, col_item, col_rating)
  • validation (pandas.DataFrame) – validation data with at least columns (col_user, col_item, col_rating). validation can be None, if so, we only process the training data
  • mean_center (bool) – flag to mean center the ratings in train (and validation) data
  • col_user (str) – user column name
  • col_item (str) – item column name
  • col_rating (str) – rating column name
  • col_timestamp (str) – timestamp column name
class recommenders.models.rlrmc.conjugate_gradient_ms.ConjugateGradientMS(beta_type=2, orth_value=inf, linesearch=None, *args, **kwargs)[source]

Module containing conjugate gradient algorithm based on conjugategradient.m from the manopt MATLAB package.

__init__(beta_type=2, orth_value=inf, linesearch=None, *args, **kwargs)[source]

Instantiate gradient solver class.

Parameters:
  • beta_type (object) – Conjugate gradient beta rule used to construct the new search direction.
  • orth_value (float) – Parameter for Powell’s restart strategy. An infinite value disables this strategy. See the formula in the code for the specific criterion used.
  • linesearch (object) – The linesearch method to use.
solve(problem, x=None, reuselinesearch=False, compute_stats=None)[source]

Perform optimization using nonlinear conjugate gradient method with linesearch.

This method first computes the gradient of obj w.r.t. arg, and then optimizes by moving in a direction that is conjugate to all previous search directions.

Parameters:
  • problem (object) – Pymanopt problem setup using the Problem class, this must have a .manifold attribute specifying the manifold to optimize over, as well as a cost and enough information to compute the gradient of that cost.
  • x (numpy.ndarray) – Optional parameter. Starting point on the manifold. If none then a starting point will be randomly generated.
  • reuselinesearch (bool) – Whether to reuse the previous linesearch object. Allows information from a previous solve run to be reused.
Returns:

Local minimum of obj, or if algorithm terminated before convergence x will be the point at which it terminated.

Return type:

numpy.ndarray

SAR

class recommenders.models.sar.sar_singlenode.SARSingleNode(col_user='userID', col_item='itemID', col_rating='rating', col_timestamp='timestamp', col_prediction='prediction', similarity_type='jaccard', time_decay_coefficient=30, time_now=None, timedecay_formula=False, threshold=1, normalize=False)[source]

Simple Algorithm for Recommendations (SAR) implementation

SAR is a fast scalable adaptive algorithm for personalized recommendations based on user transaction history and items description. The core idea behind SAR is to recommend items like those that a user already has demonstrated an affinity to. It does this by 1) estimating the affinity of users for items, 2) estimating similarity across items, and then 3) combining the estimates to generate a set of recommendations for a given user.

__init__(col_user='userID', col_item='itemID', col_rating='rating', col_timestamp='timestamp', col_prediction='prediction', similarity_type='jaccard', time_decay_coefficient=30, time_now=None, timedecay_formula=False, threshold=1, normalize=False)[source]

Initialize model parameters

Parameters:
  • col_user (str) – user column name
  • col_item (str) – item column name
  • col_rating (str) – rating column name
  • col_timestamp (str) – timestamp column name
  • col_prediction (str) – prediction column name
  • similarity_type (str) – [‘cooccurrence’, ‘jaccard’, ‘lift’] option for computing item-item similarity
  • time_decay_coefficient (float) – number of days till ratings are decayed by 1/2
  • time_now (int | None) – current time for time decay calculation
  • timedecay_formula (bool) – flag to apply time decay
  • threshold (int) – item-item co-occurrences below this threshold will be removed
  • normalize (bool) – option for normalizing predictions to scale of original ratings
compute_affinity_matrix(df, rating_col)[source]

Affinity matrix.

The user-affinity matrix can be constructed by treating the users and items as indices in a sparse matrix, and the events as the data. Here, we’re treating the ratings as the event weights. We convert between different sparse-matrix formats to de-duplicate user-item pairs, otherwise they will get added up.

Parameters:
  • df (pandas.DataFrame) – Indexed df of users and items
  • rating_col (str) – Name of column to use for ratings
Returns:

Affinity matrix in Compressed Sparse Row (CSR) format.

Return type:

sparse.csr

compute_coocurrence_matrix(df)[source]

Co-occurrence matrix.

The co-occurrence matrix is defined as \(C = U^T * U\)

where U is the user_affinity matrix with 1’s as values (instead of ratings).

Parameters:df (pandas.DataFrame) – DataFrame of users and items
Returns:Co-occurrence matrix
Return type:numpy.ndarray
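The definition C = U^T * U can be checked on a toy matrix with the NumPy sketch below. The affinity values are made up for illustration. The Jaccard and lift formulas are the usual SAR definitions derived from C and are included as an assumption about how similarity_type is applied; the co-occurrence threshold is not modelled.

```python
import numpy as np

# Rows are users, columns are items; non-zero means the user interacted with the item
affinity = np.array([
    [5, 3, 0],
    [4, 0, 0],
    [0, 2, 1],
])

U = (affinity > 0).astype(int)  # replace ratings with 1's
C = U.T @ U                     # co-occurrence counts, shape (items, items)

diag = np.diag(C).astype(float)  # number of users per item
with np.errstate(divide="ignore", invalid="ignore"):
    jaccard = C / (diag[:, None] + diag[None, :] - C)
    lift = C / (diag[:, None] * diag[None, :])
```

The diagonal of C holds the number of users who interacted with each item, and the off-diagonal entries count the users who interacted with both items.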
compute_time_decay(df, decay_column)[source]

Compute time decay on provided column.

Parameters:
  • df (pandas.DataFrame) – DataFrame of users and items
  • decay_column (str) – column to decay
Returns:

DataFrame with the column decayed.

Return type:

pandas.DataFrame
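Since time_decay_coefficient is described as the number of days until ratings are decayed by 1/2, the decay can be sketched as an exponential half-life. The pure-Python example below assumes timestamps in seconds and a hypothetical helper name; it handles a single rating rather than a DataFrame column.

```python
def decay_rating(rating, timestamp, time_now, half_life_days=30):
    """Halve the rating's weight for every `half_life_days` that have elapsed."""
    half_life_seconds = half_life_days * 24 * 60 * 60
    return rating * 2 ** (-(time_now - timestamp) / half_life_seconds)

now = 1_700_000_000                                   # arbitrary reference time (seconds)
fresh = decay_rating(4.0, now, now)                   # no time elapsed
month_old = decay_rating(4.0, now - 30 * 86400, now)  # exactly one half-life ago
```

A rating made exactly one half-life before time_now keeps half of its weight.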

fit(df)[source]

Main fit method for SAR.

Parameters:df (pandas.DataFrame) – User item rating dataframe
get_item_based_topk(items, top_k=10, sort_top_k=True)[source]

Get top K similar items to provided seed items based on similarity metric defined. This method will take a set of items and use them to recommend the most similar items to that set based on the similarity matrix fit during training. This allows recommendations for cold users (unseen during training); note that the model is not updated.

The following options are possible based on information provided in the items input:

  1. Single user or seed of items: only item column (ratings are assumed to be 1)
  2. Single user or seed of items w/ ratings: item column and rating column
  3. Separate users or seeds of items: item and user column (user ids are only used to separate item sets)
  4. Separate users or seeds of items with ratings: item, user and rating columns provided

Parameters:
  • items (pandas.DataFrame) – DataFrame with item, user (optional), and rating (optional) columns
  • top_k (int) – number of top items to recommend
  • sort_top_k (bool) – flag to sort top k results
Returns:

sorted top k recommendation items

Return type:

pandas.DataFrame

get_popularity_based_topk(top_k=10, sort_top_k=True)[source]

Get top K most frequently occurring items across all users.

Parameters:
  • top_k (int) – number of top items to recommend.
  • sort_top_k (bool) – flag to sort top k results.
Returns:

top k most popular items.

Return type:

pandas.DataFrame

predict(test)[source]

Output SAR scores for only the users-items pairs which are in the test set

Parameters:test (pandas.DataFrame) – DataFrame that contains users and items to test
Returns:DataFrame contains the prediction results
Return type:pandas.DataFrame
recommend_k_items(test, top_k=10, sort_top_k=True, remove_seen=False)[source]

Recommend top K items for all users which are in the test set

Parameters:
  • test (pandas.DataFrame) – users to test
  • top_k (int) – number of top items to recommend
  • sort_top_k (bool) – flag to sort top k results
  • remove_seen (bool) – flag to remove items seen in training from recommendation
Returns:

top k recommendation items for each user

Return type:

pandas.DataFrame

score(test, remove_seen=False)[source]

Score all items for test users.

Parameters:
  • test (pandas.DataFrame) – user to test
  • remove_seen (bool) – flag to remove items seen in training from recommendation
Returns:

Value of interest of all items for the users.

Return type:

numpy.ndarray
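Following the class description (combine user affinity with item similarity), scoring can be pictured as a matrix product. The NumPy sketch below makes that assumption explicit, and it assumes that masking seen items with negative infinity is one reasonable way to implement remove_seen; the function name and the toy matrices are hypothetical.

```python
import numpy as np

def score_sketch(user_affinity, item_similarity, remove_seen=False):
    """Score every item for every user as affinity @ similarity."""
    scores = user_affinity @ item_similarity
    if remove_seen:
        # Push items the user already interacted with to the bottom of any ranking
        scores = np.where(user_affinity > 0, -np.inf, scores)
    return scores

affinity = np.array([[5.0, 0.0, 0.0]])  # one user who rated only the first item
similarity = np.array([
    [1.0, 0.5, 0.1],
    [0.5, 1.0, 0.2],
    [0.1, 0.2, 1.0],
])
all_scores = score_sketch(affinity, similarity)
unseen_scores = score_sketch(affinity, similarity, remove_seen=True)
```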

set_index(df)[source]

Generate continuous indices for users and items to reduce memory usage.

Parameters:df (pandas.DataFrame) – dataframe with user and item ids

Surprise

recommenders.models.surprise.surprise_utils.compute_ranking_predictions(algo, data, usercol='userID', itemcol='itemID', predcol='prediction', remove_seen=False)[source]

Computes predictions of an algorithm from Surprise on all users and items in data. It can be used for computing ranking metrics like NDCG.

Parameters:
  • algo (surprise.prediction_algorithms.algo_base.AlgoBase) – an algorithm from Surprise
  • data (pandas.DataFrame) – the data from which to get the users and items
  • usercol (str) – name of the user column
  • itemcol (str) – name of the item column
  • remove_seen (bool) – flag to remove (user, item) pairs seen in the training data
Returns:

Dataframe with usercol, itemcol, predcol

Return type:

pandas.DataFrame

recommenders.models.surprise.surprise_utils.predict(algo, data, usercol='userID', itemcol='itemID', predcol='prediction')[source]

Computes predictions of an algorithm from Surprise on the data. Can be used for computing rating metrics like RMSE.

Parameters:
  • algo (surprise.prediction_algorithms.algo_base.AlgoBase) – an algorithm from Surprise
  • data (pandas.DataFrame) – the data on which to predict
  • usercol (str) – name of the user column
  • itemcol (str) – name of the item column
Returns:

Dataframe with usercol, itemcol, predcol

Return type:

pandas.DataFrame

recommenders.models.surprise.surprise_utils.surprise_trainset_to_df(trainset, col_user='uid', col_item='iid', col_rating='rating')[source]

Converts a surprise.Trainset object to pandas.DataFrame

More info: https://surprise.readthedocs.io/en/stable/trainset.html

Parameters:
  • trainset (object) – A surprise.Trainset object.
  • col_user (str) – User column name.
  • col_item (str) – Item column name.
  • col_rating (str) – Rating column name.
Returns:

A dataframe with user column (str), item column (str), and rating column (float).

Return type:

pandas.DataFrame

TF-IDF

class recommenders.models.tfidf.tfidf_utils.TfidfRecommender(id_col, tokenization_method='scibert')[source]

Term Frequency - Inverse Document Frequency (TF-IDF) Recommender

This class provides content-based recommendations using TF-IDF vectorization in combination with cosine similarity.

__init__(id_col, tokenization_method='scibert')[source]

Initialize model parameters

Parameters:
  • id_col (str) – Name of column containing item IDs.
  • tokenization_method (str) – Tokenization method, one of ‘none’, ‘nltk’, ‘bert’, ‘scibert’.
clean_dataframe(df, cols_to_clean, new_col_name='cleaned_text')[source]

Clean the text within the columns of interest and return a dataframe with cleaned and combined text.

Parameters:
  • df (pandas.DataFrame) – Dataframe containing the text content to clean.
  • cols_to_clean (list of str) – List of columns to clean by name (e.g., [‘abstract’, ‘full_text’]).
  • new_col_name (str) – Name of the new column that will contain the cleaned text.
Returns:

Dataframe with cleaned text in the new column.

Return type:

pandas.DataFrame

fit(tf, vectors_tokenized)[source]

Fit TF-IDF vectorizer to the cleaned and tokenized text.

Parameters:
  • tf (TfidfVectorizer) – sklearn.feature_extraction.text.TfidfVectorizer object defined in .tokenize_text().
  • vectors_tokenized (pandas.Series) – Each row contains tokens for respective documents separated by spaces.
get_stop_words()[source]

Return the stop words excluded in the TF-IDF vectorizer.

Returns:Frozenset of stop words used by the TF-IDF vectorizer (can be converted to list).
Return type:list
get_tokens()[source]

Return the tokens generated by the TF-IDF vectorizer.

Returns:Dictionary of tokens generated by the TF-IDF vectorizer.
Return type:dict
get_top_k_recommendations(metadata, query_id, cols_to_keep=[], verbose=True)[source]

Return the top k recommendations with useful metadata for each recommendation.

Parameters:
  • metadata (pandas.DataFrame) – Dataframe holding metadata for all public domain papers.
  • query_id (str) – ID of item of interest.
  • cols_to_keep (list of str) – List of columns from the metadata dataframe to include (e.g., [‘title’, ‘authors’, ‘journal’, ‘publish_time’, ‘url’]). By default, all columns are kept.
  • verbose (boolean) – Set to True if you want to print the table.
Returns:

Stylized dataframe holding recommendations and associated metadata just for the item of interest (can access as normal dataframe by using df.data).

Return type:

pandas.Styler

recommend_top_k_items(df_clean, k=5)[source]

Recommend k number of items similar to the item of interest.

Parameters:
  • df_clean (pandas.DataFrame) – Dataframe with cleaned text.
  • k (int) – Number of recommendations to return.
Returns:

Dataframe containing id of top k recommendations for all items.

Return type:

pandas.DataFrame
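
The underlying idea, TF-IDF vectors compared by cosine similarity, can be sketched without any library. The version below is a simplified illustration (raw term counts, smoothed IDF, whitespace tokens). It is not the class's implementation, which relies on scikit-learn's TfidfVectorizer, and all function names and documents are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document (count * smoothed IDF)."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}
    return [
        {term: count * idf[term] for term, count in Counter(tokens).items()}
        for tokens in tokenized
    ]


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k_similar(docs, k):
    """For each document index, list the k most similar other documents."""
    vectors = tfidf_vectors(docs)
    result = {}
    for i, vec in enumerate(vectors):
        scored = [(cosine(vec, other), j) for j, other in enumerate(vectors) if j != i]
        # Highest similarity first; ties broken by document index.
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        result[i] = [j for _, j in scored[:k]]
    return result


docs = [
    "neural collaborative filtering",
    "collaborative filtering with matrix factorization",
    "protein folding simulation",
]
neighbours = top_k_similar(docs, k=1)
```

The first two documents share the terms "collaborative" and "filtering", so each is the other's nearest neighbour; the third shares no terms with either.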

tokenize_text(df_clean, text_col='cleaned_text', ngram_range=(1, 3), min_df=0)[source]

Tokenize the input text. For more details on the TfidfVectorizer, see https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html

Parameters:
  • df_clean (pandas.DataFrame) – Dataframe with cleaned text in the new column.
  • text_col (str) – Name of column containing the cleaned text.
  • ngram_range (tuple of int) – The lower and upper boundary of the range of n-values for different n-grams to be extracted.
  • min_df (int) – When building the vocabulary ignore terms that have a document frequency strictly lower than the given threshold.
Returns:

  • Scikit-learn TfidfVectorizer object defined in .tokenize_text().
  • Each row contains tokens for respective documents separated by spaces.

Return type:

TfidfVectorizer, pandas.Series
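
The effect of ngram_range and min_df can be illustrated in plain Python. The sketch below is a simplification of what a vectorizer does when building its vocabulary, with hypothetical helper names: it extracts word n-grams for every n in the range, then drops terms that occur in fewer than min_df documents.

```python
from collections import Counter


def word_ngrams(tokens, ngram_range):
    """All contiguous n-grams for n in [low, high], joined with spaces."""
    low, high = ngram_range
    return [
        " ".join(tokens[start:start + n])
        for n in range(low, high + 1)
        for start in range(len(tokens) - n + 1)
    ]


def build_vocabulary(docs, ngram_range=(1, 3), min_df=0):
    """Keep n-grams whose document frequency is at least min_df."""
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(word_ngrams(doc.split(), ngram_range)))
    return sorted(term for term, df in doc_freq.items() if df >= min_df)


docs = ["deep learning model", "deep learning"]
vocab_all = build_vocabulary(docs, ngram_range=(1, 2), min_df=0)
vocab_shared = build_vocabulary(docs, ngram_range=(1, 2), min_df=2)
```

With min_df=0 every unigram and bigram is kept; with min_df=2 only the terms present in both documents survive.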

VAE

class recommenders.models.vae.multinomial_vae.AnnealingCallback(beta, anneal_cap, total_anneal_steps)[source]

This class is used for updating the value of β during the annealing process. When β reaches the value of anneal_cap, it stops increasing.

__init__(beta, anneal_cap, total_anneal_steps)[source]

Constructor

Parameters:
  • beta (float) – current value of beta.
  • anneal_cap (float) – maximum value that beta can reach.
  • total_anneal_steps (int) – total number of annealing steps.
get_data()[source]

Returns a list of the beta values per epoch.

on_batch_end(epoch, logs={})[source]

At the end of each batch, beta is updated until it reaches the value of anneal_cap.
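
A minimal sketch of such a schedule, assuming β grows linearly with the number of processed batches and is clipped at anneal_cap. The exact update rule is not stated here, so the linear form and the function name annealed_beta are assumptions.

```python
def annealed_beta(step, anneal_cap, total_anneal_steps):
    """Linear warm-up of beta, capped at anneal_cap (assumed schedule)."""
    return min(anneal_cap * step / total_anneal_steps, anneal_cap)


# Beta after each of 6 batches, ramping up to the cap over 4 steps.
schedule = [
    annealed_beta(step, anneal_cap=0.2, total_anneal_steps=4)
    for step in range(1, 7)
]
```

The values rise in equal increments for the first four batches and then stay at the cap.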

on_epoch_end(epoch, logs={})[source]

At the end of each epoch save the value of beta in _beta list.

on_train_begin(logs={})[source]

Initialise a list in which the beta value will be saved at the end of each epoch.

class recommenders.models.vae.multinomial_vae.LossHistory[source]

This class is used for saving the validation loss and the training loss per epoch.

on_epoch_end(epoch, logs={})[source]

Save the loss of training and validation set at the end of each epoch.

on_train_begin(logs={})[source]

Initialise the lists where the loss of training and validation will be saved.

class recommenders.models.vae.multinomial_vae.Metrics(model, val_tr, val_te, mapper, k, save_path=None)[source]

Callback used to calculate the NDCG@k metric of the validation set at the end of each epoch. The weights of the model with the highest NDCG@k value are saved.

__init__(model, val_tr, val_te, mapper, k, save_path=None)[source]

Initialize the class parameters.

Parameters:
  • model – trained model for validation.
  • val_tr (numpy.ndarray, float) – the click matrix for the validation set training part.
  • val_te (numpy.ndarray, float) – the click matrix for the validation set testing part.
  • mapper (AffinityMatrix) – the mapper for converting click matrix to dataframe.
  • k (int) – number of top k items per user (optional).
  • save_path (str) – Default path to save weights.
get_data()[source]

Returns the list of validation-set NDCG@k values calculated at the end of each epoch.

on_epoch_end(batch, logs={})[source]

At the end of each epoch calculate NDCG@k of the validation set.

If the model performance has improved, the model weights are saved. The obtained value is appended to the list of validation NDCG@k values.
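
For reference, NDCG@k with binary relevance can be computed as below. This is a self-contained sketch for a single user, assuming a ranked list of recommended item IDs and a set of held-out relevant items. It is not the evaluation routine the callback calls, and the names are illustrative.

```python
import math


def ndcg_at_k(ranked_items, relevant_items, k):
    """NDCG@k for one user with binary relevance."""
    # Discounted gain of the relevant items that appear in the top k.
    dcg = sum(
        1.0 / math.log2(rank + 2)
        for rank, item in enumerate(ranked_items[:k])
        if item in relevant_items
    )
    # Best achievable DCG: all relevant items placed at the top.
    ideal_hits = min(len(relevant_items), k)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


perfect = ndcg_at_k(["a", "b", "c"], {"a", "b"}, k=3)
partial = ndcg_at_k(["c", "a", "b"], {"a", "b"}, k=3)
```

A ranking that places all relevant items first scores 1.0; pushing them down the list lowers the score.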

on_train_begin(logs={})[source]

Initialise the list for validation NDCG@k.

recommend_k_items(x, k, remove_seen=True)[source]

Returns the top-k items ordered by relevance score. The predicted probabilities are used as the recommendation score.

Parameters:
  • x (numpy.ndarray, int32) – input click matrix.
  • k (scalar, int32) – the number of items to recommend.
Returns:

A sparse matrix containing the top_k elements ordered by their score.

Return type:

numpy.ndarray
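
The selection step can be sketched with NumPy as follows. The score matrix is a hypothetical stand-in for the model's predicted probabilities, and the function illustrates the described behaviour (mask seen items, keep the k highest scores per user, zero out the rest) rather than reproducing the library's code.

```python
import numpy as np


def top_k_scores(scores, seen, k, remove_seen=True):
    """Keep the k largest scores per row; all other entries become 0."""
    scores = np.asarray(scores, dtype=float).copy()
    if remove_seen:
        # Items the user already clicked can never be recommended.
        scores[np.asarray(seen) > 0] = -np.inf
    # Column indices of the k best items in each row (unordered within the k).
    top_idx = np.argpartition(-scores, k - 1, axis=1)[:, :k]
    result = np.zeros_like(scores)
    rows = np.arange(scores.shape[0])[:, None]
    result[rows, top_idx] = scores[rows, top_idx]
    # Guard against -inf leaking in when fewer than k unseen items exist.
    result[np.isneginf(result)] = 0.0
    return result


clicks = np.array([[1, 0, 0, 0], [0, 0, 1, 0]])
probs = np.array([[0.9, 0.1, 0.6, 0.3], [0.2, 0.5, 0.8, 0.4]])
recommended = top_k_scores(probs, clicks, k=2)
```

In this toy input, each user's highest-scoring item has already been clicked, so it is excluded and the next two items are returned instead.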

class recommenders.models.vae.multinomial_vae.Mult_VAE(n_users, original_dim, intermediate_dim=200, latent_dim=70, n_epochs=400, batch_size=100, k=100, verbose=1, drop_encoder=0.5, drop_decoder=0.5, beta=1.0, annealing=False, anneal_cap=1.0, seed=None, save_path=None)[source]

Multinomial Variational Autoencoders (Multi-VAE) for Collaborative Filtering implementation

Citation:Liang, Dawen, et al. “Variational autoencoders for collaborative filtering.” Proceedings of the 2018 World Wide Web Conference. 2018. https://arxiv.org/pdf/1802.05814.pdf
__init__(n_users, original_dim, intermediate_dim=200, latent_dim=70, n_epochs=400, batch_size=100, k=100, verbose=1, drop_encoder=0.5, drop_decoder=0.5, beta=1.0, annealing=False, anneal_cap=1.0, seed=None, save_path=None)[source]

Constructor

Parameters:
  • n_users (int) – Number of unique users in the train set.
  • original_dim (int) – Number of unique items in the train set.
  • intermediate_dim (int) – Dimension of intermediate space.
  • latent_dim (int) – Dimension of latent space.
  • n_epochs (int) – Number of epochs for training.
  • batch_size (int) – Batch size.
  • k (int) – number of top k items per user.
  • verbose (int) – Whether to show the training output or not.
  • drop_encoder (float) – Dropout percentage of the encoder.
  • drop_decoder (float) – Dropout percentage of the decoder.
  • beta (float) – a constant parameter β in the ELBO function, when you are not using annealing (annealing=False)
  • annealing (bool) – option of using annealing method for training the model (True) or not using annealing, keeping a constant beta (False)
  • anneal_cap (float) – maximum value that beta can take during annealing process.
  • seed (int) – Seed.
  • save_path (str) – Default path to save weights.
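
For orientation, the role of β can be written out. Following the cited paper (Liang et al., 2018), the per-user objective is the evidence lower bound with the KL term scaled by β, and the multinomial log-likelihood sums over items i with click counts x_ui and predicted item probabilities π_i(z_u). The notation below is the paper's, not taken from this module's code.

```latex
\mathcal{L}_{\beta}(x_u; \theta, \phi)
  = \mathbb{E}_{q_\phi(z_u \mid x_u)}\!\left[\log p_\theta(x_u \mid z_u)\right]
  - \beta \cdot \mathrm{KL}\!\left(q_\phi(z_u \mid x_u) \,\|\, p(z_u)\right),
\qquad
\log p_\theta(x_u \mid z_u) \overset{c}{=} \sum_{i} x_{ui} \log \pi_i(z_u)
```

With annealing=False, β stays at the constant beta; with annealing=True, it is increased during training up to anneal_cap.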
display_metrics()[source]

Plots: 1) loss per epoch for both the training and validation sets; 2) NDCG@k per epoch for the validation set.

fit(x_train, x_valid, x_val_tr, x_val_te, mapper)[source]

Fit model with the train sets and validate on the validation set.

Parameters:
  • x_train (numpy.ndarray) – the click matrix for the train set.
  • x_valid (numpy.ndarray) – the click matrix for the validation set.
  • x_val_tr (numpy.ndarray) – the click matrix for the validation set training part.
  • x_val_te (numpy.ndarray) – the click matrix for the validation set testing part.
  • mapper (object) – the mapper for converting click matrix to dataframe. It can be AffinityMatrix.
get_optimal_beta()[source]

Returns the value of the optimal beta.

ndcg_per_epoch()[source]

Returns the list of NDCG@k at each epoch.

nn_batch_generator(x_train)[source]

Splits the dataset into batches.

Parameters:x_train (numpy.ndarray) – The click matrix for the train set, with float values.
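
A generator of this kind might look like the sketch below, which assumes that batches are drawn from a shuffled row order and that each batch serves as both input and reconstruction target (the usual autoencoder setup). Both points are assumptions about the behaviour, and batch_size and seed are passed explicitly here rather than read from the model.

```python
import numpy as np


def batch_generator(x_train, batch_size, seed=None):
    """Yield (input, target) batches forever, reshuffling rows on every pass."""
    rng = np.random.default_rng(seed)
    n_rows = x_train.shape[0]
    while True:
        order = rng.permutation(n_rows)
        for start in range(0, n_rows, batch_size):
            batch = x_train[order[start:start + batch_size]].astype(float)
            # An autoencoder reconstructs its own input.
            yield batch, batch


x_train = np.eye(5)
gen = batch_generator(x_train, batch_size=2, seed=0)
first_pass = [next(gen)[0] for _ in range(3)]  # batches of 2, 2 and 1 rows
```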
recommend_k_items(x, k, remove_seen=True)[source]

Returns the top-k items ordered by relevance score. The predicted probabilities are used as the recommendation score.

Parameters:
  • x (numpy.ndarray, int32) – input click matrix.
  • k (scalar, int32) – the number of items to recommend.
Returns:

A sparse matrix containing the top_k elements ordered by their score.

Return type:

numpy.ndarray, float

class recommenders.models.vae.standard_vae.AnnealingCallback(beta, anneal_cap, total_anneal_steps)[source]

This class is used for updating the value of β during the annealing process. When β reaches the value of anneal_cap, it stops increasing.

__init__(beta, anneal_cap, total_anneal_steps)[source]

Constructor

Parameters:
  • beta (float) – current value of beta.
  • anneal_cap (float) – maximum value that beta can reach.
  • total_anneal_steps (int) – total number of annealing steps.
get_data()[source]

Returns a list of the beta values per epoch.

on_batch_end(epoch, logs={})[source]

At the end of each batch, beta is updated until it reaches the value of anneal_cap.

on_epoch_end(epoch, logs={})[source]

At the end of each epoch save the value of beta in _beta list.

on_train_begin(logs={})[source]

Initialise a list in which the beta value will be saved at the end of each epoch.

class recommenders.models.vae.standard_vae.LossHistory[source]

This class is used for saving the validation loss and the training loss per epoch.

on_epoch_end(epoch, logs={})[source]

Save the loss of training and validation set at the end of each epoch.

on_train_begin(logs={})[source]

Initialise the lists where the loss of training and validation will be saved.

class recommenders.models.vae.standard_vae.Metrics(model, val_tr, val_te, mapper, k, save_path=None)[source]

Callback used to calculate the NDCG@k metric of the validation set at the end of each epoch. The weights of the model with the highest NDCG@k value are saved.

__init__(model, val_tr, val_te, mapper, k, save_path=None)[source]

Initialize the class parameters.

Parameters:
  • model – trained model for validation.
  • val_tr (numpy.ndarray, float) – the click matrix for the validation set training part.
  • val_te (numpy.ndarray, float) – the click matrix for the validation set testing part.
  • mapper (AffinityMatrix) – the mapper for converting click matrix to dataframe.
  • k (int) – number of top k items per user (optional).
  • save_path (str) – Default path to save weights.
get_data()[source]

Returns the list of validation-set NDCG@k values calculated at the end of each epoch.

on_epoch_end(batch, logs={})[source]

At the end of each epoch, calculate NDCG@k of the validation set. If the model performance has improved, the model weights are saved. The obtained value is appended to the list of validation NDCG@k values.

on_train_begin(logs={})[source]

Initialise the list for validation NDCG@k.

recommend_k_items(x, k, remove_seen=True)[source]

Returns the top-k items ordered by relevance score. The predicted probabilities are used as the recommendation score.

Parameters:
  • x (numpy.ndarray, int32) – input click matrix.
  • k (scalar, int32) – the number of items to recommend.
Returns:

A sparse matrix containing the top_k elements ordered by their score.

Return type:

numpy.ndarray

class recommenders.models.vae.standard_vae.StandardVAE(n_users, original_dim, intermediate_dim=200, latent_dim=70, n_epochs=400, batch_size=100, k=100, verbose=1, drop_encoder=0.5, drop_decoder=0.5, beta=1.0, annealing=False, anneal_cap=1.0, seed=None, save_path=None)[source]

Standard Variational Autoencoders (VAE) for Collaborative Filtering implementation.

__init__(n_users, original_dim, intermediate_dim=200, latent_dim=70, n_epochs=400, batch_size=100, k=100, verbose=1, drop_encoder=0.5, drop_decoder=0.5, beta=1.0, annealing=False, anneal_cap=1.0, seed=None, save_path=None)[source]

Initialize class parameters.

Parameters:
  • n_users (int) – Number of unique users in the train set.
  • original_dim (int) – Number of unique items in the train set.
  • intermediate_dim (int) – Dimension of intermediate space.
  • latent_dim (int) – Dimension of latent space.
  • n_epochs (int) – Number of epochs for training.
  • batch_size (int) – Batch size.
  • k (int) – number of top k items per user.
  • verbose (int) – Whether to show the training output or not.
  • drop_encoder (float) – Dropout percentage of the encoder.
  • drop_decoder (float) – Dropout percentage of the decoder.
  • beta (float) – a constant parameter β in the ELBO function, when you are not using annealing (annealing=False)
  • annealing (bool) – option of using annealing method for training the model (True) or not using annealing, keeping a constant beta (False)
  • anneal_cap (float) – maximum value that beta can take during annealing process.
  • seed (int) – Seed.
  • save_path (str) – Default path to save weights.
display_metrics()[source]

Plots: 1) loss per epoch for both the training and validation sets; 2) NDCG@k per epoch for the validation set.

fit(x_train, x_valid, x_val_tr, x_val_te, mapper)[source]

Fit model with the train sets and validate on the validation set.

Parameters:
  • x_train (numpy.ndarray) – The click matrix for the train set.
  • x_valid (numpy.ndarray) – The click matrix for the validation set.
  • x_val_tr (numpy.ndarray) – The click matrix for the validation set training part.
  • x_val_te (numpy.ndarray) – The click matrix for the validation set testing part.
  • mapper (object) – The mapper for converting click matrix to dataframe. It can be AffinityMatrix.
get_optimal_beta()[source]

Returns the value of the optimal beta.

ndcg_per_epoch()[source]

Returns the list of NDCG@k at each epoch.

nn_batch_generator(x_train)[source]

Splits the dataset into batches.

Parameters:x_train (numpy.ndarray) – The click matrix for the train set with float values.
recommend_k_items(x, k, remove_seen=True)[source]

Returns the top-k items ordered by a relevancy score.

Obtained probabilities are used as recommendation score.

Parameters:
  • x (numpy.ndarray) – Input click matrix, with int32 values.
  • k (scalar) – The number of items to recommend.
Returns:

A sparse matrix containing the top_k elements ordered by their score.

Return type:

numpy.ndarray

Wide & Deep

recommenders.models.wide_deep.wide_deep_utils.build_feature_columns(users, items, user_col='userID', item_col='itemID', item_feat_col=None, crossed_feat_dim=1000, user_dim=8, item_dim=8, item_feat_shape=None, model_type='wide_deep')[source]

Build wide and/or deep feature columns for TensorFlow high-level API Estimator.

Parameters:
  • users (iterable) – Distinct user ids.
  • items (iterable) – Distinct item ids.
  • user_col (str) – User column name.
  • item_col (str) – Item column name.
  • item_feat_col (str) – Item feature column name for ‘deep’ or ‘wide_deep’ model.
  • crossed_feat_dim (int) – Crossed feature dimension for ‘wide’ or ‘wide_deep’ model.
  • user_dim (int) – User embedding dimension for ‘deep’ or ‘wide_deep’ model.
  • item_dim (int) – Item embedding dimension for ‘deep’ or ‘wide_deep’ model.
  • item_feat_shape (int or an iterable of integers) – Item feature array shape for ‘deep’ or ‘wide_deep’ model.
  • model_type (str) – Model type, either ‘wide’ for a linear model, ‘deep’ for a deep neural network, or ‘wide_deep’ for a combination of a linear model and a neural network.
Returns:

  • The wide feature columns
  • The deep feature columns. If only the wide model is selected, the deep column list is empty, and vice versa.

Return type:

list, list
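
To give a feel for crossed_feat_dim, the sketch below hashes a (user, item) pair into a fixed number of buckets, which is how crossed features are commonly represented in wide models. That this function builds its cross this way is an assumption, and the helper crossed_bucket, its key format, and the use of SHA-256 are purely illustrative.

```python
import hashlib


def crossed_bucket(user_id, item_id, crossed_feat_dim=1000):
    """Deterministically hash a (user, item) pair into one of crossed_feat_dim buckets."""
    key = f"{user_id}_X_{item_id}".encode("utf-8")
    digest = hashlib.sha256(key).hexdigest()
    return int(digest, 16) % crossed_feat_dim


bucket = crossed_bucket(42, "movie_7")
same_bucket = crossed_bucket(42, "movie_7")
small = crossed_bucket(42, "movie_7", crossed_feat_dim=8)
```

A smaller dimension means fewer parameters in the wide part but more pairs colliding in the same bucket.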

recommenders.models.wide_deep.wide_deep_utils.build_model(model_dir='model_checkpoints', wide_columns=(), deep_columns=(), linear_optimizer='Ftrl', dnn_optimizer='Adagrad', dnn_hidden_units=(128, 128), dnn_dropout=0.0, dnn_batch_norm=True, log_every_n_iter=1000, save_checkpoints_steps=10000, seed=None)[source]

Build wide-deep model.

To generate a wide model, pass wide_columns only. To generate a deep model, pass deep_columns only. To generate a wide_deep model, pass both wide_columns and deep_columns.

Parameters:
  • model_dir (str) – Model checkpoint directory.
  • wide_columns (list of tf.feature_column) – Wide model feature columns.
  • deep_columns (list of tf.feature_column) – Deep model feature columns.
  • linear_optimizer (str or tf.train.Optimizer) – Wide model optimizer name or object.
  • dnn_optimizer (str or tf.train.Optimizer) – Deep model optimizer name or object.
  • dnn_hidden_units (list of int) – Deep model hidden units. E.g., [10, 10, 10] is three layers of 10 nodes each.
  • dnn_dropout (float) – Deep model’s dropout rate.
  • dnn_batch_norm (bool) – Deep model’s batch normalization flag.
  • log_every_n_iter (int) – Log the training loss for every n steps.
  • save_checkpoints_steps (int) – Model checkpoint frequency.
  • seed (int) – Random seed.
Returns:

Model

Return type:

tf.estimator.Estimator
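
The column-based dispatch described above can be summarised in a few lines. The sketch returns only a label for the chosen model family. Which concrete estimator class is instantiated is not stated here, so the labels and the error raised when both lists are empty are assumptions.

```python
def select_model_type(wide_columns=(), deep_columns=()):
    """Pick the model family from which column lists are non-empty."""
    has_wide = len(wide_columns) > 0
    has_deep = len(deep_columns) > 0
    if has_wide and has_deep:
        return "wide_deep"
    if has_wide:
        return "wide"
    if has_deep:
        return "deep"
    raise ValueError("provide wide_columns, deep_columns, or both")
```

For example, select_model_type(wide_columns=["w"]) returns "wide", and passing both lists returns "wide_deep".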