Common utilities module¶

General utilities¶

recommenders.utils.general_utils.get_number_processors()[source]¶

Get the number of processors in a CPU.

Returns:	Number of processors.
Return type:	int

recommenders.utils.general_utils.get_physical_memory()[source]¶

Get the physical memory in GBs.

Returns:	Physical memory in GBs.
Return type:	float

recommenders.utils.general_utils.invert_dictionary(dictionary)[source]¶

Invert a dictionary.

Note

If the dictionary has unique keys and unique values, the inversion is exact. However, if some values are repeated, only one of the keys that share a value is kept in the inverted dictionary.

Parameters:	dictionary (dict) – A dictionary
Returns:	inverted dictionary
Return type:	dict
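
The note above can be illustrated with a minimal sketch. `invert_dictionary_sketch` is a hypothetical stand-in written with a dict comprehension, not the library's code; it assumes that, for repeated values, the key seen last is the one that is kept.

```python
def invert_dictionary_sketch(dictionary):
    """Hypothetical stand-in: swap the keys and values of a dict."""
    # With repeated values, later keys overwrite earlier ones.
    return {value: key for key, value in dictionary.items()}


unique = invert_dictionary_sketch({"a": 1, "b": 2})
repeated = invert_dictionary_sketch({"a": 1, "b": 1})
print(unique)    # {1: 'a', 2: 'b'}
print(repeated)  # {1: 'b'}
```

The second call shows why the inversion is lossy: both keys map to 1, so only one of them can appear in the result.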

GPU utilities¶

recommenders.utils.gpu_utils.clear_memory_all_gpus()[source]¶: Clear memory of all GPUs.

recommenders.utils.gpu_utils.get_cuda_version()[source]¶

Get the CUDA version.

Returns:	Version of the library.
Return type:	str

recommenders.utils.gpu_utils.get_cudnn_version()[source]¶

Get the cuDNN version.

Returns:	Version of the library.
Return type:	str

recommenders.utils.gpu_utils.get_gpu_info()[source]¶

Get information of GPUs.

Returns:	List of dictionaries, one per GPU, each with device_name, total_memory (in MB) and free_memory (in MB). Returns an empty list if no CUDA device is available.
Return type:	list

recommenders.utils.gpu_utils.get_number_gpus()[source]¶

Get the number of GPUs in the system.

Returns:	Number of GPUs.
Return type:	int

Kubernetes utilities¶

recommenders.utils.k8s_utils.nodes_to_replicas(n_cores_per_node, n_nodes=3, cpu_cores_per_replica=0.1)[source]¶

Provide a rough estimate of the number of replicas supported by a given number of nodes with n_cores_per_node cores each.

Parameters:

n_cores_per_node (int) – Total number of cores per node within an AKS cluster that you want to use
n_nodes (int) – Number of nodes (i.e. VMs) used in the AKS cluster
cpu_cores_per_replica (float) – Cores assigned to each replica. This can be fractional and corresponds to the cpu_cores argument passed to AksWebservice.deploy_configuration()

Returns:	Total number of replicas supported by the configuration
Return type:	int
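
The estimate can be sketched as dividing the usable cores by the cores per replica and rounding down. `nodes_to_replicas_sketch` and its `reserved_cores_per_node` parameter are hypothetical; the overhead value is an assumption made for illustration, not a documented constant of the library.

```python
from math import floor


def nodes_to_replicas_sketch(
    n_cores_per_node,
    n_nodes=3,
    cpu_cores_per_replica=0.1,
    reserved_cores_per_node=0.5,  # assumed system overhead, hypothetical
):
    """Hypothetical estimate of how many replicas fit on a cluster."""
    usable_cores = (n_cores_per_node - reserved_cores_per_node) * n_nodes
    # Round down: a partial replica cannot be scheduled.
    return floor(usable_cores / cpu_cores_per_replica)


estimate = nodes_to_replicas_sketch(4, n_nodes=3, cpu_cores_per_replica=0.5)
print(estimate)  # 21
```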

recommenders.utils.k8s_utils.qps_to_replicas(target_qps, processing_time, max_qp_replica=1, target_utilization=0.7)[source]¶

Provide a rough estimate of the number of replicas to support a given load (queries per second)

Parameters:

target_qps (int) – target queries per second that you want to support
processing_time (float) – the estimated amount of time (in seconds) your service call takes
max_qp_replica (int) – maximum number of concurrent queries per replica
target_utilization (float) – proportion of CPU utilization you think is ideal

Returns:	Number of estimated replicas required to support a target number of queries per second.
Return type:	int

recommenders.utils.k8s_utils.replicas_to_qps(num_replicas, processing_time, max_qp_replica=1, target_utilization=0.7)[source]¶

Provide a rough estimate of the queries per second supported by a number of replicas

Parameters:

num_replicas (int) – number of replicas
processing_time (float) – the estimated amount of time (in seconds) your service call takes
max_qp_replica (int) – maximum number of concurrent queries per replica
target_utilization (float) – proportion of CPU utilization you think is ideal

Returns:	queries per second supported by the number of replicas
Return type:	int
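
The two load estimates above are inverses of each other. The following is a minimal sketch of that relationship, assuming the number of concurrent queries is the target rate times the processing time, inflated by the target utilization. The `_sketch` functions are hypothetical and may differ from the library's implementation.

```python
from math import ceil, floor


def qps_to_replicas_sketch(
    target_qps, processing_time, max_qp_replica=1, target_utilization=0.7
):
    # Queries in flight at once, inflated so replicas run below full load.
    concurrent_queries = target_qps * processing_time / target_utilization
    return ceil(concurrent_queries / max_qp_replica)


def replicas_to_qps_sketch(
    num_replicas, processing_time, max_qp_replica=1, target_utilization=0.7
):
    # Inverse of the estimate above, rounded down to stay conservative.
    return floor(
        num_replicas * max_qp_replica * target_utilization / processing_time
    )


replicas = qps_to_replicas_sketch(40, 0.25, target_utilization=0.5)
qps = replicas_to_qps_sketch(replicas, 0.25, target_utilization=0.5)
print(replicas, qps)  # 20 40
```

Rounding up for replicas and down for queries per second keeps both estimates on the safe side.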

Notebook utilities¶

recommenders.utils.notebook_utils.is_databricks()[source]¶

Check if the module is running on Databricks.

Returns:	True if the module is running on Databricks notebook, False otherwise.
Return type:	bool

recommenders.utils.notebook_utils.is_jupyter()[source]¶

Check if the module is running on Jupyter notebook/console.

Returns:	True if the module is running on Jupyter notebook or Jupyter console, False otherwise.
Return type:	bool
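
A check of this kind can be sketched by inspecting the IPython shell class. `is_jupyter_sketch` is a hypothetical illustration that assumes a Jupyter kernel exposes a shell of class `ZMQInteractiveShell`; the library's own check may differ.

```python
def is_jupyter_sketch():
    """Hypothetical check: True only inside a Jupyter (ZMQ-based) kernel."""
    try:
        # get_ipython is injected into the namespace by IPython and is
        # undefined in a plain Python interpreter.
        shell_name = get_ipython().__class__.__name__  # noqa: F821
    except NameError:
        return False
    return shell_name == "ZMQInteractiveShell"


print(is_jupyter_sketch())
```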

recommenders.utils.notebook_memory_management.pre_run_cell()[source]¶: Capture the current time before executing the current command.

recommenders.utils.notebook_memory_management.start_watching_memory()[source]¶: Register memory profiling tools to IPython instance.

recommenders.utils.notebook_memory_management.stop_watching_memory()[source]¶: Unregister memory profiling tools from IPython instance.

recommenders.utils.notebook_memory_management.watch_memory()[source]¶: Bring in the global memory usage value from the previous iteration.

Python utilities¶

recommenders.utils.python_utils.binarize(a, threshold)[source]¶

Binarize the values.

Parameters:

a (numpy.ndarray) – Input array that needs to be binarized.
threshold (float) – Threshold below which all values are set to 0, else 1.

Returns:	Binarized array.
Return type:	numpy.ndarray

recommenders.utils.python_utils.cosine_similarity(cooccurrence)[source]¶

Helper method to calculate the Cosine similarity of a matrix of co-occurrences.

Cosine similarity can be interpreted as the cosine of the angle between the i-th and j-th item vectors.

Parameters:	cooccurrence (numpy.ndarray) – The symmetric matrix of co-occurrences of items.
Returns:	The matrix of cosine similarity between any two items.
Return type:	numpy.ndarray
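
For binary user-item vectors, the dot product of items i and j is the co-occurrence count c_ij and the squared norm of item i is the diagonal entry c_ii, so the cosine is c_ij / sqrt(c_ii * c_jj). The following pure-Python sketch applies that formula to nested lists. `cosine_similarity_sketch` is a hypothetical name, and the sketch assumes a non-zero diagonal.

```python
from math import sqrt


def cosine_similarity_sketch(cooccurrence):
    """Pure-Python sketch: s_ij = c_ij / sqrt(c_ii * c_jj)."""
    n = len(cooccurrence)
    return [
        [
            cooccurrence[i][j] / sqrt(cooccurrence[i][i] * cooccurrence[j][j])
            for j in range(n)
        ]
        for i in range(n)
    ]


# Two items, each seen by 4 users, 2 of whom saw both.
similarity = cosine_similarity_sketch([[4, 2], [2, 4]])
print(similarity)  # [[1.0, 0.5], [0.5, 1.0]]
```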

recommenders.utils.python_utils.exponential_decay(value, max_val, half_life)[source]¶

Compute decay factor for a given value based on an exponential decay.

For values greater than max_val, the decay factor is set to 1.

Parameters:

value (numeric) – Value to calculate decay factor
max_val (numeric) – Value at which decay factor will be 1
half_life (numeric) – Value at which decay factor will be 0.5

Returns:	Decay factor
Return type:	float
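
One way to obtain a factor with these properties is sketched below, assuming the factor halves for every half_life that value lies below max_val and is capped at 1 above it. `exponential_decay_sketch` is hypothetical and may not match the library's formula; under this assumption the factor 0.5 is reached at max_val - half_life.

```python
def exponential_decay_sketch(value, max_val, half_life):
    """Hypothetical decay: 1 at max_val, halving every half_life below it."""
    # min() caps the factor at 1 for values above max_val.
    return min(1.0, 0.5 ** ((max_val - value) / half_life))


print(exponential_decay_sketch(10, 10, 5))  # 1.0
print(exponential_decay_sketch(5, 10, 5))   # 0.5
print(exponential_decay_sketch(15, 10, 5))  # 1.0
```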

recommenders.utils.python_utils.get_top_k_scored_items(scores, top_k, sort_top_k=False)[source]¶

Extract top K items from a matrix of scores for each user-item pair, optionally sort results per user.

Parameters:

scores (numpy.ndarray) – Score matrix (users x items).
top_k (int) – Number of top items to recommend.
sort_top_k (bool) – Flag to sort top k results.

Returns:

Indices into score matrix for each user’s top items.
Scores corresponding to top items.

Return type:

numpy.ndarray, numpy.ndarray
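
The selection can be sketched in pure Python on nested lists. `get_top_k_scored_items_sketch` is a hypothetical stand-in that always returns each user's items sorted by descending score, so it corresponds to the sort_top_k=True case only; it is not the library's NumPy implementation.

```python
def get_top_k_scored_items_sketch(scores, top_k):
    """Pure-Python sketch: per-user top-k item indices and their scores."""
    top_indices, top_scores = [], []
    for user_scores in scores:
        # Rank item indices by this user's scores, highest first.
        ranked = sorted(
            range(len(user_scores)),
            key=lambda i: user_scores[i],
            reverse=True,
        )
        best = ranked[:top_k]
        top_indices.append(best)
        top_scores.append([user_scores[i] for i in best])
    return top_indices, top_scores


indices, values = get_top_k_scored_items_sketch(
    [[0.1, 0.9, 0.5], [0.7, 0.2, 0.3]], top_k=2
)
print(indices)  # [[1, 2], [0, 2]]
print(values)   # [[0.9, 0.5], [0.7, 0.3]]
```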

recommenders.utils.python_utils.inclusion_index(cooccurrence)[source]¶

Helper method to calculate the Inclusion Index of a matrix of co-occurrences.

Inclusion index measures the overlap between items.

Parameters:	cooccurrence (numpy.ndarray) – The symmetric matrix of co-occurrences of items.
Returns:	The matrix of inclusion index between any two items.
Return type:	numpy.ndarray

recommenders.utils.python_utils.jaccard(cooccurrence)[source]¶

Helper method to calculate the Jaccard similarity of a matrix of co-occurrences. When comparing Jaccard with count co-occurrence and lift similarity, count favours predictability, meaning that the most popular items will be recommended most of the time. Lift, by contrast, favours discoverability/serendipity, meaning that an item that is less popular overall but highly favoured by a small subset of users is more likely to be recommended. Jaccard is a compromise between the two.

Parameters:	cooccurrence (numpy.ndarray) – the symmetric matrix of co-occurrences of items.
Returns:	The matrix of Jaccard similarities between any two items.
Return type:	numpy.ndarray

recommenders.utils.python_utils.lexicographers_mutual_information(cooccurrence)[source]¶

Helper method to calculate the Lexicographers Mutual Information of a matrix of co-occurrences.

Due to the bias of mutual information for low frequency items, lexicographers mutual information corrects the formula by multiplying it by the co-occurrence frequency.

Parameters:	cooccurrence (numpy.ndarray) – The symmetric matrix of co-occurrences of items.
Returns:	The matrix of lexicographers mutual information between any two items.
Return type:	numpy.ndarray

recommenders.utils.python_utils.lift(cooccurrence)[source]¶

Helper method to calculate the Lift of a matrix of co-occurrences. In comparison with basic co-occurrence and Jaccard similarity, lift favours discoverability and serendipity, as opposed to co-occurrence that favours the most popular items, and Jaccard that is a compromise between the two.

Parameters:	cooccurrence (numpy.ndarray) – The symmetric matrix of co-occurrences of items.
Returns:	The matrix of Lifts between any two items.
Return type:	numpy.ndarray
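
The trade-off between Jaccard and lift can be seen in a small pure-Python sketch. It assumes the usual definitions, Jaccard as c_ij / (c_ii + c_jj - c_ij) and lift as c_ij / (c_ii * c_jj), where the diagonal of the co-occurrence matrix holds item occurrence counts. The `_sketch` functions are hypothetical and operate on nested lists rather than NumPy arrays.

```python
def jaccard_sketch(cooccurrence):
    """s_ij = c_ij / (c_ii + c_jj - c_ij): co-occurrence over the union."""
    c = cooccurrence
    n = len(c)
    return [
        [c[i][j] / (c[i][i] + c[j][j] - c[i][j]) for j in range(n)]
        for i in range(n)
    ]


def lift_sketch(cooccurrence):
    """s_ij = c_ij / (c_ii * c_jj): penalizes items that are popular overall."""
    c = cooccurrence
    n = len(c)
    return [[c[i][j] / (c[i][i] * c[j][j]) for j in range(n)] for i in range(n)]


# Two items, each seen by 4 users, 2 of whom saw both.
counts = [[4, 2], [2, 4]]
jaccard_values = jaccard_sketch(counts)
lift_values = lift_sketch(counts)
print(jaccard_values)
print(lift_values)  # [[0.25, 0.125], [0.125, 0.25]]
```

Because lift divides by the product of both occurrence counts, a pair of very popular items scores lower than under Jaccard, which is what shifts recommendations toward less popular items.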

recommenders.utils.python_utils.mutual_information(cooccurrence)[source]¶

Helper method to calculate the Mutual Information of a matrix of co-occurrences.

Mutual information measures the amount of information shared between the i-th and j-th item column vectors.

Parameters:	cooccurrence (numpy.ndarray) – The symmetric matrix of co-occurrences of items.
Returns:	The matrix of mutual information between any two items.
Return type:	numpy.ndarray

recommenders.utils.python_utils.rescale(data, new_min=0, new_max=1, data_min=None, data_max=None)[source]¶

Rescale/normalize the data to be within the range [new_min, new_max]. If data_min and data_max are explicitly provided, they will be used as the old min/max values instead of being taken from the data.

Note

This is the same as scikit-learn's MinMaxScaler, except that the min/max of the old scale can be overridden.

Parameters:

data (numpy.ndarray) – 1d scores vector or 2d score matrix (users x items).
new_min (int|float) – The minimum of the newly scaled data.
new_max (int|float) – The maximum of the newly scaled data.
data_min (None|number) – The minimum of the passed data [if omitted it will be inferred].
data_max (None|number) – The maximum of the passed data [if omitted it will be inferred].

Returns:	The newly scaled/normalized data.
Return type:	numpy.ndarray
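
The rescaling is a linear map from the old range to the new one. The following pure-Python sketch shows it for a 1d list. `rescale_sketch` is a hypothetical stand-in that assumes the old minimum and maximum differ; it is not the library's NumPy implementation.

```python
def rescale_sketch(data, new_min=0, new_max=1, data_min=None, data_max=None):
    """Pure-Python sketch of min/max rescaling for a 1d list of numbers."""
    # Infer the old range from the data unless it is given explicitly.
    old_min = min(data) if data_min is None else data_min
    old_max = max(data) if data_max is None else data_max
    scale = (new_max - new_min) / (old_max - old_min)
    return [(x - old_min) * scale + new_min for x in data]


inferred = rescale_sketch([1, 2, 3])
overridden = rescale_sketch([1, 2, 3], data_min=0, data_max=4)
print(inferred)    # [0.0, 0.5, 1.0]
print(overridden)  # [0.25, 0.5, 0.75]
```

Overriding the old range is useful when the data at hand does not span the full scale, for example ratings known to lie in a fixed interval.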

Spark utilities¶

recommenders.utils.spark_utils.start_or_get_spark(app_name='Sample', url='local[*]', memory='10g', config=None, packages=None, jars=None, repositories=None)[source]¶

Start Spark if it is not already started.

Parameters:

app_name (str) – set name of the application
url (str) – URL for spark master
memory (str) – size of memory for spark driver. This will be ignored if spark.driver.memory is set in config.
config (dict) – dictionary of configuration options
packages (list) – list of packages to install
jars (list) – list of jar files to add
repositories (list) – list of maven repositories

Returns:	Spark context.
Return type:	object

TensorFlow utilities¶

class recommenders.utils.tf_utils.MetricsLogger[source]¶

Metrics logger

__init__()[source]¶: Initializer

get_log()[source]¶

Getter

Returns:	Log metrics.
Return type:	dict

log(metric, value)[source]¶

Log metrics. Each metric’s log will be stored in the corresponding list.

Parameters:

metric (str) – Metric name.
value (float) – Value.
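
The logging behaviour described above, one list of values per metric name, can be sketched as follows. `MetricsLoggerSketch` is a hypothetical stand-in built on a plain dict and is not the library's class.

```python
class MetricsLoggerSketch:
    """Hypothetical stand-in: collect metric values in per-metric lists."""

    def __init__(self):
        self._log = {}

    def log(self, metric, value):
        # Create the list on first use, then append to it.
        self._log.setdefault(metric, []).append(value)

    def get_log(self):
        return self._log


logger = MetricsLoggerSketch()
logger.log("rmse", 0.9)
logger.log("rmse", 0.8)
print(logger.get_log())  # {'rmse': [0.9, 0.8]}
```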

recommenders.utils.tf_utils.build_optimizer(name, lr=0.001, **kwargs)[source]¶

Get an optimizer for TensorFlow high-level API Estimator.

Available options are: adadelta, adagrad, adam, ftrl, momentum, rmsprop or sgd.

Parameters:

name (str) – Optimizer name.
lr (float) – Learning rate
kwargs – Optimizer arguments as key-value pairs

Returns:	TensorFlow optimizer.
Return type:	tf.train.Optimizer

recommenders.utils.tf_utils.evaluation_log_hook(estimator, logger, true_df, y_col, eval_df, every_n_iter=10000, model_dir=None, batch_size=256, eval_fns=None, **eval_kwargs)[source]¶

Evaluation log hook for TensorFlow high-level API Estimator.

Note

A TensorFlow Estimator model uses the last checkpoint weights for evaluation or prediction. In order to get the most up-to-date evaluation results while training, set the model’s save_checkpoints_steps to be equal to or greater than the hook’s every_n_iter.

Parameters:

estimator (tf.estimator.Estimator) – Model to evaluate.
logger (Logger) – Custom logger to log the results. E.g., define a subclass of Logger for AzureML logging.
true_df (pd.DataFrame) – Ground-truth data.
y_col (str) – Label column name in true_df.
eval_df (pd.DataFrame) – Evaluation data without label column.
every_n_iter (int) – Evaluation frequency (steps).
model_dir (str) – Model directory to save the summaries to. If None, does not record.
batch_size (int) – Number of samples fed into the model at a time. Note that the batch size does not affect the evaluation results.
eval_fns (iterable of functions) – List of evaluation functions that have signature of (true_df, prediction_df, eval_kwargs)->(float). If None, loss is calculated on true_df.
**eval_kwargs – Evaluation function’s keyword arguments. Note that the prediction column name should be ‘prediction’.

Returns:	Session run hook to evaluate the model while training.
Return type:	tf.train.SessionRunHook

recommenders.utils.tf_utils.export_model(model, train_input_fn, eval_input_fn, tf_feat_cols, base_dir)[source]¶

Export TensorFlow estimator (model).

Parameters:

model (tf.estimator.Estimator) – Model to export.
train_input_fn (function) – Training input function to create data receiver spec.
eval_input_fn (function) – Evaluation input function to create data receiver spec.
tf_feat_cols (list(tf.feature_column)) – Feature columns.
base_dir (str) – Base directory to export the model.

Returns:	Exported model path
Return type:	str

recommenders.utils.tf_utils.pandas_input_fn(df, y_col=None, batch_size=128, num_epochs=1, shuffle=False, seed=None)[source]¶

Pandas input function for TensorFlow high-level API Estimator. This function returns a tf.data.Dataset function.

Note

tf.estimator.inputs.pandas_input_fn cannot handle array/list columns properly. For more information, see https://www.tensorflow.org/api_docs/python/tf/estimator/inputs/numpy_input_fn

Parameters:

df (pandas.DataFrame) – Data containing features.
y_col (str) – Label column name if df has it.
batch_size (int) – Batch size for the input function.
num_epochs (int) – Number of epochs to iterate over data. If None, it will run forever.
shuffle (bool) – If True, shuffles the data queue.
seed (int) – Random seed for shuffle.

Returns:	Function.
Return type:	tf.data.Dataset

recommenders.utils.tf_utils.pandas_input_fn_for_saved_model(df, feat_name_type)[source]¶

Pandas input function for TensorFlow SavedModel.

Parameters:

df (pandas.DataFrame) – Data containing features.
feat_name_type (dict) – Feature name and type spec. E.g. {‘userID’: int, ‘itemID’: int, ‘rating’: float}

Returns:	Input function
Return type:	func

Timer¶

class recommenders.utils.timer.Timer[source]¶

Timer class.

Original code.

Examples

>>> import time
>>> from recommenders.utils.timer import Timer
>>> t = Timer()
>>> t.start()
>>> time.sleep(1)
>>> t.stop()
>>> t.interval >= 1
True
>>> with Timer() as t:
...   time.sleep(1)
>>> t.interval >= 1
True
>>> "Time elapsed {}".format(t) #doctest: +ELLIPSIS
'Time elapsed 1...'

__init__()[source]¶: Initializer

interval¶

Get time interval in seconds.

Returns:	Seconds.
Return type:	float

start()[source]¶: Start the timer.

stop()[source]¶: Stop the timer. Calculate the interval in seconds.

Plot utilities¶

recommenders.utils.plot.line_graph(values, labels, x_guides=None, x_name=None, y_name=None, x_min_max=None, y_min_max=None, legend_loc=None, subplot=None, plot_size=(5, 5))[source]¶

Plot line graph(s).

Parameters:

values (list(list(float or tuple)) or list(float or tuple)) – List of graphs or a single graph to plot. E.g. a graph = list(y) or list((y,x))
labels (list(str) or str) – List of labels or a label for the graph. If labels is a string, this function assumes that values is a single graph.
x_guides (list(int)) – List of guidelines (a vertical dotted line)
x_name (str) – x axis label
y_name (str) – y axis label
x_min_max (list or tuple) – Min and max value of the x axis
y_min_max (list or tuple) – Min and max value of the y axis
legend_loc (str) – legend location
subplot (list or tuple) – matplotlib.pyplot.subplot format. E.g. to draw 1 x 2 subplot, pass (1,2,1) for the first subplot and (1,2,2) for the second subplot.
plot_size (list or tuple) – Plot size (width, height)