Common utilities module

General utilities

recommenders.utils.general_utils.get_number_processors()[source]

Get the number of processors in a CPU.

Returns:Number of processors.
Return type:int
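
As a rough standard-library stand-in for this kind of query, the sketch below counts logical processors with os.cpu_count(). The helper name count_processors is hypothetical, and this is not necessarily how the library computes its value.

```python
import os


def count_processors():
    """Return the number of logical processors, or 1 if it cannot be determined."""
    # os.cpu_count() may return None on platforms where the count is unknown.
    return os.cpu_count() or 1


n_procs = count_processors()
print(f"Logical processors: {n_procs}")
```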
recommenders.utils.general_utils.get_physical_memory()[source]

Get the physical memory in GBs.

Returns:Physical memory in GBs.
Return type:float
recommenders.utils.general_utils.invert_dictionary(dictionary)[source]

Invert a dictionary

Note

If every value in the dictionary is unique, the inversion is exact. If values are repeated, however, each repeated value can keep only one of its original keys, and which key is kept may vary.

Parameters:dictionary (dict) – A dictionary
Returns:inverted dictionary
Return type:dict
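
The effect of repeated values can be seen in a minimal sketch. The helper invert below is hypothetical and uses a plain dict comprehension, in which the last key seen for a repeated value wins; the library's choice of surviving key may differ.

```python
def invert(dictionary):
    """Swap keys and values; on repeated values the last key seen is kept."""
    return {value: key for key, value in dictionary.items()}


unique = {"a": 1, "b": 2}
repeated = {"a": 1, "b": 1}

inverted_unique = invert(unique)      # every value maps back to its only key
inverted_repeated = invert(repeated)  # only one entry survives for the value 1
```

With unique values the round trip is lossless; with repeated values the inverted dictionary is shorter than the original.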

GPU utilities

recommenders.utils.gpu_utils.clear_memory_all_gpus()[source]

Clear memory of all GPUs.

recommenders.utils.gpu_utils.get_cuda_version()[source]

Get CUDA version

Returns:Version of the library.
Return type:str
recommenders.utils.gpu_utils.get_cudnn_version()[source]

Get the CuDNN version

Returns:Version of the library.
Return type:str
recommenders.utils.gpu_utils.get_gpu_info()[source]

Get information of GPUs.

Returns:List of dictionaries, one per GPU, each with device_name, total_memory (in MB) and free_memory (in MB). Returns an empty list if no CUDA device is available.
Return type:list
recommenders.utils.gpu_utils.get_number_gpus()[source]

Get the number of GPUs in the system.

Returns:Number of GPUs.
Return type:int

Kubernetes utilities

recommenders.utils.k8s_utils.nodes_to_replicas(n_cores_per_node, n_nodes=3, cpu_cores_per_replica=0.1)[source]

Provide a rough estimate of the number of replicas supported by a given number of nodes with n_cores_per_node cores each

Parameters:
  • n_cores_per_node (int) – Total number of cores per node within an AKS cluster that you want to use
  • n_nodes (int) – Number of nodes (i.e. VMs) used in the AKS cluster
  • cpu_cores_per_replica (float) – Cores assigned to each replica. This can be fractional and corresponds to the cpu_cores argument passed to AksWebservice.deploy_configuration()
Returns:

Total number of replicas supported by the configuration

Return type:

int
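
The estimate can be pictured as usable cores divided by cores per replica. The sketch below is an assumption-laden illustration: estimate_replicas_from_nodes and its reserved_cores argument are hypothetical, and the library may subtract its own allowance for system services.

```python
from math import floor


def estimate_replicas_from_nodes(n_cores_per_node, n_nodes=3,
                                 cpu_cores_per_replica=0.1, reserved_cores=0.0):
    """Rough replica capacity of a cluster.

    reserved_cores is a hypothetical knob for cores held back for system
    services; it is not a parameter of nodes_to_replicas.
    """
    usable_cores = n_cores_per_node * n_nodes - reserved_cores
    # Only whole replicas can be scheduled, so round down.
    return max(0, floor(usable_cores / cpu_cores_per_replica))


capacity = estimate_replicas_from_nodes(4, n_nodes=3, cpu_cores_per_replica=0.25)
capacity_reserved = estimate_replicas_from_nodes(
    4, n_nodes=3, cpu_cores_per_replica=0.25, reserved_cores=2.0
)
```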

recommenders.utils.k8s_utils.qps_to_replicas(target_qps, processing_time, max_qp_replica=1, target_utilization=0.7)[source]

Provide a rough estimate of the number of replicas to support a given load (queries per second)

Parameters:
  • target_qps (int) – target queries per second that you want to support
  • processing_time (float) – the estimated amount of time (in seconds) your service call takes
  • max_qp_replica (int) – maximum number of concurrent queries per replica
  • target_utilization (float) – proportion of CPU utilization you think is ideal
Returns:

Number of estimated replicas required to support a target number of queries per second.

Return type:

int
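
One plausible way to arrive at such an estimate is sketched below, assuming the number of queries in flight is target_qps * processing_time, inflated by the target utilization. The function name estimate_replicas is hypothetical and the formula is an illustration, not a statement of the library's source.

```python
from math import ceil


def estimate_replicas(target_qps, processing_time, max_qp_replica=1,
                      target_utilization=0.7):
    """Rough number of replicas needed for a target load (illustrative sketch)."""
    # Queries in flight at once, scaled up so replicas run at the target utilization.
    concurrent_queries = target_qps * processing_time / target_utilization
    # Each replica absorbs up to max_qp_replica of them; round up to whole replicas.
    return ceil(concurrent_queries / max_qp_replica)


replicas = estimate_replicas(target_qps=100, processing_time=0.1)
replicas_parallel = estimate_replicas(
    target_qps=100, processing_time=0.1, max_qp_replica=2
)
```

Under these assumptions, 100 queries per second at 0.1 s each need 15 single-query replicas, or 8 replicas that each handle two concurrent queries.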

recommenders.utils.k8s_utils.replicas_to_qps(num_replicas, processing_time, max_qp_replica=1, target_utilization=0.7)[source]

Provide a rough estimate of the queries per second supported by a number of replicas

Parameters:
  • num_replicas (int) – number of replicas
  • processing_time (float) – the estimated amount of time (in seconds) your service call takes
  • max_qp_replica (int) – maximum number of concurrent queries per replica
  • target_utilization (float) – proportion of CPU utilization you think is ideal
Returns:

queries per second supported by the number of replicas

Return type:

int
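
The inverse estimate can be sketched the same way, assuming each replica serves max_qp_replica queries at a time and is kept at the target utilization. The name estimate_qps is hypothetical and the formula is an illustration only.

```python
from math import floor


def estimate_qps(num_replicas, processing_time, max_qp_replica=1,
                 target_utilization=0.7):
    """Rough queries per second supported by a number of replicas (sketch)."""
    # Usable concurrent slots across all replicas at the target utilization.
    usable_slots = num_replicas * max_qp_replica * target_utilization
    # Each slot completes 1 / processing_time queries per second; round down.
    return floor(usable_slots / processing_time)


qps = estimate_qps(num_replicas=10, processing_time=0.3)
qps_parallel = estimate_qps(num_replicas=10, processing_time=0.3, max_qp_replica=2)
```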

Notebook utilities

recommenders.utils.notebook_utils.is_databricks()[source]

Check if the module is running on Databricks.

Returns:True if the module is running on Databricks notebook, False otherwise.
Return type:bool
recommenders.utils.notebook_utils.is_jupyter()[source]

Check if the module is running on Jupyter notebook/console.

Returns:True if the module is running on Jupyter notebook or Jupyter console, False otherwise.
Return type:bool
recommenders.utils.notebook_memory_management.pre_run_cell()[source]

Capture current time before we execute the current command

recommenders.utils.notebook_memory_management.start_watching_memory()[source]

Register memory profiling tools to IPython instance.

recommenders.utils.notebook_memory_management.stop_watching_memory()[source]

Unregister memory profiling tools from IPython instance.

recommenders.utils.notebook_memory_management.watch_memory()[source]

Bring in the global memory usage value from the previous iteration

Python utilities

recommenders.utils.python_utils.binarize(a, threshold)[source]

Binarize the values.

Parameters:
  • a (numpy.ndarray) – Input array that needs to be binarized.
  • threshold (float) – Threshold below which all values are set to 0, else 1.
Returns:

Binarized array.

Return type:

numpy.ndarray
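
A dependency-free sketch of thresholding follows, with a plain list standing in for numpy.ndarray. The helper binarize_values is hypothetical, and treating a value exactly equal to the threshold as 1 is an assumption read from the parameter description; check the library's behaviour at the boundary before relying on it.

```python
def binarize_values(values, threshold):
    """Map each value to 0.0 or 1.0 by comparing it with the threshold."""
    # Assumption: a value exactly equal to the threshold maps to 1.0.
    return [1.0 if v >= threshold else 0.0 for v in values]


ratings = [0.2, 3.5, 4.0, 1.0]
binary = binarize_values(ratings, threshold=3.0)
```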

recommenders.utils.python_utils.cosine_similarity(cooccurrence)[source]

Helper method to calculate the Cosine similarity of a matrix of co-occurrences.

Cosine similarity can be interpreted as the cosine of the angle between the i-th and j-th item vectors.

Parameters:cooccurrence (numpy.ndarray) – The symmetric matrix of co-occurrences of items.
Returns:The matrix of cosine similarity between any two items.
Return type:numpy.ndarray
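
For a co-occurrence matrix built from binary user-item interactions, the diagonal holds each item's occurrence count, so cosine similarity reduces to c_ij / sqrt(c_ii * c_jj). The sketch below uses nested lists in place of numpy.ndarray; cosine_from_cooccurrence is a hypothetical name, and every item is assumed to occur at least once.

```python
from math import sqrt


def cosine_from_cooccurrence(cooccurrence):
    """Cosine similarity c_ij / sqrt(c_ii * c_jj) for a square co-occurrence matrix."""
    n = len(cooccurrence)
    diag = [cooccurrence[i][i] for i in range(n)]  # per-item occurrence counts
    return [
        [cooccurrence[i][j] / sqrt(diag[i] * diag[j]) for j in range(n)]
        for i in range(n)
    ]


cooccurrence = [
    [4, 2, 0],
    [2, 9, 3],
    [0, 3, 4],
]
similarity = cosine_from_cooccurrence(cooccurrence)
```

Each item has similarity 1 with itself, and items that never co-occur (0 and 2 here) have similarity 0.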
recommenders.utils.python_utils.exponential_decay(value, max_val, half_life)[source]

Compute decay factor for a given value based on an exponential decay.

For values greater than max_val, the decay factor is set to 1.

Parameters:
  • value (numeric) – Value to calculate decay factor
  • max_val (numeric) – Value at which decay factor will be 1
  • half_life (numeric) – Distance from max_val at which the decay factor will be 0.5
Returns:

Decay factor

Return type:

float
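
The behaviour described above is consistent with a factor of 0.5 raised to the number of half-lives between value and max_val, capped at 1. The sketch below assumes that form; decay_factor is a hypothetical scalar helper, not the library function.

```python
def decay_factor(value, max_val, half_life):
    """Exponential decay factor in (0, 1], assuming 0.5 ** ((max_val - value) / half_life)."""
    # Values at or beyond max_val would give a factor >= 1, so cap at 1.0.
    return min(1.0, 0.5 ** ((max_val - value) / half_life))


at_max = decay_factor(100, max_val=100, half_life=10)
one_half_life = decay_factor(90, max_val=100, half_life=10)
two_half_lives = decay_factor(80, max_val=100, half_life=10)
beyond_max = decay_factor(120, max_val=100, half_life=10)
```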

recommenders.utils.python_utils.get_top_k_scored_items(scores, top_k, sort_top_k=False)[source]

Extract top K items from a matrix of scores for each user-item pair, optionally sorting the results per user.

Parameters:
  • scores (numpy.ndarray) – Score matrix (users x items).
  • top_k (int) – Number of top items to recommend.
  • sort_top_k (bool) – Flag to sort top k results.
Returns:

  • Indices into score matrix for each user’s top items.
  • Scores corresponding to top items.

Return type:

numpy.ndarray, numpy.ndarray
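
The shape of the two return values can be illustrated with a pure-Python sketch. The helper top_k_per_user is hypothetical and always returns items in descending score order, which corresponds to sort_top_k=True; the library itself operates on numpy arrays.

```python
import heapq


def top_k_per_user(scores, top_k):
    """Return (indices, scores) of each user's top_k items, best first."""
    top_indices, top_scores = [], []
    for user_scores in scores:
        # Pick the item indices with the largest scores for this user.
        best = heapq.nlargest(
            top_k, range(len(user_scores)), key=user_scores.__getitem__
        )
        top_indices.append(best)
        top_scores.append([user_scores[i] for i in best])
    return top_indices, top_scores


scores = [
    [0.1, 0.9, 0.5, 0.3],  # user 0
    [0.8, 0.2, 0.4, 0.6],  # user 1
]
indices, values = top_k_per_user(scores, top_k=2)
```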

recommenders.utils.python_utils.inclusion_index(cooccurrence)[source]

Helper method to calculate the Inclusion Index of a matrix of co-occurrences.

Inclusion index measures the overlap between items.

Parameters:cooccurrence (numpy.ndarray) – The symmetric matrix of co-occurrences of items.
Returns:The matrix of inclusion index between any two items.
Return type:numpy.ndarray
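
One common definition of the inclusion index divides the co-occurrence count by the smaller of the two item counts, c_ij / min(c_ii, c_jj). The sketch below assumes that definition; inclusion_from_cooccurrence is a hypothetical name and nested lists stand in for numpy.ndarray.

```python
def inclusion_from_cooccurrence(cooccurrence):
    """Inclusion index c_ij / min(c_ii, c_jj) for a square co-occurrence matrix."""
    n = len(cooccurrence)
    diag = [cooccurrence[i][i] for i in range(n)]  # per-item occurrence counts
    return [
        [cooccurrence[i][j] / min(diag[i], diag[j]) for j in range(n)]
        for i in range(n)
    ]


# Item 1 occurs 4 times, always together with item 0 (which occurs 10 times).
cooccurrence = [
    [10, 4],
    [4, 4],
]
inclusion = inclusion_from_cooccurrence(cooccurrence)
```

Here the index between the two items is 1 because every occurrence of the rarer item is included in the occurrences of the more popular one.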
recommenders.utils.python_utils.jaccard(cooccurrence)[source]

Helper method to calculate the Jaccard similarity of a matrix of co-occurrences. When comparing Jaccard with count co-occurrence and lift similarity, count favours predictability, meaning that the most popular items will be recommended most of the time. Lift, by contrast, favours discoverability/serendipity, meaning that an item that is less popular overall but highly favoured by a small subset of users is more likely to be recommended. Jaccard is a compromise between the two.

Parameters:cooccurrence (numpy.ndarray) – the symmetric matrix of co-occurrences of items.
Returns:The matrix of Jaccard similarities between any two items.
Return type:numpy.ndarray
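
The trade-off between count, Jaccard and lift can be seen on a small hand-made matrix. The sketch assumes the usual definitions, Jaccard as c_ij / (c_ii + c_jj - c_ij) and lift as c_ij / (c_ii * c_jj); the function names and the data are hypothetical, and nested lists stand in for numpy.ndarray.

```python
def jaccard_from_cooccurrence(c):
    """Jaccard similarity c_ij / (c_ii + c_jj - c_ij)."""
    n = len(c)
    return [[c[i][j] / (c[i][i] + c[j][j] - c[i][j]) for j in range(n)]
            for i in range(n)]


def lift_from_cooccurrence(c):
    """Lift c_ij / (c_ii * c_jj)."""
    n = len(c)
    return [[c[i][j] / (c[i][i] * c[j][j]) for j in range(n)]
            for i in range(n)]


# Item 0 is popular, item 1 is moderately popular, item 2 is a niche item
# whose few users all interacted with item 0 as well.
cooccurrence = [
    [100, 20, 5],
    [20, 50, 5],
    [5, 5, 5],
]
jaccard_matrix = jaccard_from_cooccurrence(cooccurrence)
lift_matrix = lift_from_cooccurrence(cooccurrence)
```

Relative to item 0, the raw count ranks the more popular item 1 above the niche item 2, lift reverses that order, and Jaccard keeps the count ordering while narrowing the gap between the two.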
recommenders.utils.python_utils.lexicographers_mutual_information(cooccurrence)[source]

Helper method to calculate the Lexicographers Mutual Information of a matrix of co-occurrences.

Due to the bias of mutual information for low frequency items, lexicographers mutual information corrects the formula by multiplying it by the co-occurrence frequency.

Parameters:cooccurrence (numpy.ndarray) – The symmetric matrix of co-occurrences of items.
Returns:The matrix of lexicographers mutual information between any two items.
Return type:numpy.ndarray
recommenders.utils.python_utils.lift(cooccurrence)[source]

Helper method to calculate the Lift of a matrix of co-occurrences. In comparison with basic co-occurrence and Jaccard similarity, lift favours discoverability and serendipity, as opposed to co-occurrence that favours the most popular items, and Jaccard that is a compromise between the two.

Parameters:cooccurrence (numpy.ndarray) – The symmetric matrix of co-occurrences of items.
Returns:The matrix of Lifts between any two items.
Return type:numpy.ndarray
recommenders.utils.python_utils.mutual_information(cooccurrence)[source]

Helper method to calculate the Mutual Information of a matrix of co-occurrences.

Mutual information measures how much information the i-th item column vector explains about the j-th item column vector.

Parameters:cooccurrence (numpy.ndarray) – The symmetric matrix of co-occurrences of items.
Returns:The matrix of mutual information between any two items.
Return type:numpy.ndarray
recommenders.utils.python_utils.rescale(data, new_min=0, new_max=1, data_min=None, data_max=None)[source]

Rescale/normalize the data to be within the range [new_min, new_max]. If data_min and data_max are explicitly provided, they will be used as the old min/max values instead of being taken from the data.

Note

This is the same as scikit-learn's MinMaxScaler, except that the min/max of the old scale can be overridden.

Parameters:
  • data (numpy.ndarray) – 1d scores vector or 2d score matrix (users x items).
  • new_min (int|float) – The minimum of the newly scaled data.
  • new_max (int|float) – The maximum of the newly scaled data.
  • data_min (None|number) – The minimum of the passed data [if omitted it will be inferred].
  • data_max (None|number) – The maximum of the passed data [if omitted it will be inferred].
Returns:

The newly scaled/normalized data.

Return type:

numpy.ndarray
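
Min-max rescaling with an overridable old range can be sketched for a 1d list as follows. The helper rescale_values is hypothetical, works on plain lists rather than numpy arrays, and assumes the old minimum and maximum differ.

```python
def rescale_values(data, new_min=0, new_max=1, data_min=None, data_max=None):
    """Linearly map data from [old_min, old_max] to [new_min, new_max]."""
    # Infer the old range from the data unless it is given explicitly.
    old_min = min(data) if data_min is None else data_min
    old_max = max(data) if data_max is None else data_max
    scale = (new_max - new_min) / (old_max - old_min)
    return [(x - old_min) * scale + new_min for x in data]


ratings = [1, 2, 3]
inferred = rescale_values(ratings)                          # old range taken from data
overridden = rescale_values(ratings, data_min=0, data_max=4)  # old range fixed to [0, 4]
```

Overriding the old range is useful when the data at hand does not span the full scale, for example ratings on a 0 to 4 scale where only 1 to 3 were observed.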

Spark utilities

recommenders.utils.spark_utils.start_or_get_spark(app_name='Sample', url='local[*]', memory='10g', config=None, packages=None, jars=None, repositories=None)[source]

Start Spark if not started

Parameters:
  • app_name (str) – set name of the application
  • url (str) – URL for spark master
  • memory (str) – size of memory for spark driver. This will be ignored if spark.driver.memory is set in config.
  • config (dict) – dictionary of configuration options
  • packages (list) – list of packages to install
  • jars (list) – list of jar files to add
  • repositories (list) – list of maven repositories
Returns:

Spark context.

Return type:

object
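
The precedence rule for the memory argument can be pictured without Spark. The helper resolve_spark_options below is hypothetical and only mirrors the documented rule that memory is ignored when spark.driver.memory is already present in config; it does not start a Spark session.

```python
def resolve_spark_options(memory="10g", config=None):
    """Merge the memory argument into config without overriding an explicit setting."""
    options = dict(config or {})
    # An explicit spark.driver.memory in config takes precedence over memory.
    options.setdefault("spark.driver.memory", memory)
    return options


default_options = resolve_spark_options(memory="10g")
explicit_options = resolve_spark_options(
    memory="10g", config={"spark.driver.memory": "4g"}
)
```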

Tensorflow utilities

class recommenders.utils.tf_utils.MetricsLogger[source]

Metrics logger

__init__()[source]

Initializer

get_log()[source]

Getter

Returns:Log metrics.
Return type:dict
log(metric, value)[source]

Log metrics. Each metric’s log will be stored in the corresponding list.

Parameters:
  • metric (str) – Metric name.
  • value (float) – Value.
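
The logging behaviour described above, one list of values per metric name, can be sketched with a defaultdict. SketchMetricsLogger is a hypothetical stand-in written for illustration, not the library class.

```python
from collections import defaultdict


class SketchMetricsLogger:
    """Minimal stand-in that stores each metric's values in its own list."""

    def __init__(self):
        self._log = defaultdict(list)

    def log(self, metric, value):
        # Append the value to the list kept for this metric name.
        self._log[metric].append(value)

    def get_log(self):
        # Return a plain dict so callers cannot create keys by accident.
        return dict(self._log)


logger = SketchMetricsLogger()
logger.log("rmse", 1.2)
logger.log("rmse", 0.9)
logger.log("mae", 0.7)
history = logger.get_log()
```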
recommenders.utils.tf_utils.build_optimizer(name, lr=0.001, **kwargs)[source]

Get an optimizer for TensorFlow high-level API Estimator.

Available options are: adadelta, adagrad, adam, ftrl, momentum, rmsprop or sgd.

Parameters:
  • name (str) – Optimizer name.
  • lr (float) – Learning rate
  • kwargs – Optimizer arguments as key-value pairs
Returns:

Tensorflow optimizer.

Return type:

tf.train.Optimizer

recommenders.utils.tf_utils.evaluation_log_hook(estimator, logger, true_df, y_col, eval_df, every_n_iter=10000, model_dir=None, batch_size=256, eval_fns=None, **eval_kwargs)[source]

Evaluation log hook for TensorFlow high-level API Estimator.

Note

A TensorFlow Estimator model uses the weights of the last checkpoint for evaluation or prediction. To get the most up-to-date evaluation results while training, set the model’s save_checkpoints_steps to be equal to or greater than the hook’s every_n_iter.

Parameters:
  • estimator (tf.estimator.Estimator) – Model to evaluate.
  • logger (Logger) – Custom logger to log the results. E.g., define a subclass of Logger for AzureML logging.
  • true_df (pd.DataFrame) – Ground-truth data.
  • y_col (str) – Label column name in true_df
  • eval_df (pd.DataFrame) – Evaluation data without label column.
  • every_n_iter (int) – Evaluation frequency (steps).
  • model_dir (str) – Model directory to save the summaries to. If None, does not record.
  • batch_size (int) – Number of samples fed into the model at a time. Note that the batch size does not affect the evaluation results.
  • eval_fns (iterable of functions) – List of evaluation functions with the signature (true_df, prediction_df, **eval_kwargs)->(float). If None, loss is calculated on true_df.
  • eval_kwargs – Evaluation function’s keyword arguments. Note that the prediction column name should be ‘prediction’.
Returns:

Session run hook to evaluate the model while training.

Return type:

tf.train.SessionRunHook

recommenders.utils.tf_utils.export_model(model, train_input_fn, eval_input_fn, tf_feat_cols, base_dir)[source]

Export TensorFlow estimator (model).

Parameters:
  • model (tf.estimator.Estimator) – Model to export.
  • train_input_fn (function) – Training input function to create data receiver spec.
  • eval_input_fn (function) – Evaluation input function to create data receiver spec.
  • tf_feat_cols (list(tf.feature_column)) – Feature columns.
  • base_dir (str) – Base directory to export the model.
Returns:

Exported model path

Return type:

str

recommenders.utils.tf_utils.pandas_input_fn(df, y_col=None, batch_size=128, num_epochs=1, shuffle=False, seed=None)[source]

Pandas input function for TensorFlow high-level API Estimator. This function returns an input function that produces a tf.data.Dataset.

Note

tf.estimator.inputs.pandas_input_fn cannot handle array/list columns properly. For more information, see https://www.tensorflow.org/api_docs/python/tf/estimator/inputs/numpy_input_fn

Parameters:
  • df (pandas.DataFrame) – Data containing features.
  • y_col (str) – Label column name if df has it.
  • batch_size (int) – Batch size for the input function.
  • num_epochs (int) – Number of epochs to iterate over data. If None, it will run forever.
  • shuffle (bool) – If True, shuffles the data queue.
  • seed (int) – Random seed for shuffle.
Returns:

Function.

Return type:

tf.data.Dataset

recommenders.utils.tf_utils.pandas_input_fn_for_saved_model(df, feat_name_type)[source]

Pandas input function for TensorFlow SavedModel.

Parameters:
  • df (pandas.DataFrame) – Data containing features.
  • feat_name_type (dict) – Feature name and type spec. E.g. {‘userID’: int, ‘itemID’: int, ‘rating’: float}
Returns:

Input function

Return type:

func

Timer

class recommenders.utils.timer.Timer[source]

Timer class.

Original code.

Examples

>>> import time
>>> t = Timer()
>>> t.start()
>>> time.sleep(1)
>>> t.stop()
>>> t.interval >= 1
True
>>> with Timer() as t:
...   time.sleep(1)
>>> t.interval >= 1
True
>>> "Time elapsed {}".format(t) #doctest: +ELLIPSIS
'Time elapsed 1...'
__init__()[source]

Initialize self. See help(type(self)) for accurate signature.

interval

Get time interval in seconds.

Returns:Seconds.
Return type:float
start()[source]

Start the timer.

stop()[source]

Stop the timer. Calculate the interval in seconds.
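
A timer with this interface (start, stop, interval, context-manager use and string formatting) can be sketched with the standard library as follows. SketchTimer is a hypothetical stand-in built on timeit.default_timer; the library class may differ in its details, such as error handling.

```python
from timeit import default_timer


class SketchTimer:
    """Minimal stand-in for a start/stop timer usable as a context manager."""

    def __init__(self):
        self._start = None
        self._interval = 0.0

    def __enter__(self):
        self.start()
        return self

    def __exit__(self, *exc_info):
        self.stop()

    def __str__(self):
        # Format the elapsed seconds with four decimal places.
        return "{:0.4f}".format(self.interval)

    def start(self):
        self._start = default_timer()

    def stop(self):
        if self._start is None:
            raise ValueError("Timer was never started; call start() first.")
        self._interval = default_timer() - self._start

    @property
    def interval(self):
        return self._interval


with SketchTimer() as t:
    total = sum(range(100_000))  # some work to time
elapsed = t.interval
```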

Plot utilities

recommenders.utils.plot.line_graph(values, labels, x_guides=None, x_name=None, y_name=None, x_min_max=None, y_min_max=None, legend_loc=None, subplot=None, plot_size=(5, 5))[source]

Plot line graph(s).

Parameters:
  • values (list(list(float or tuple)) or list(float or tuple)) – List of graphs or a single graph to plot. E.g. a graph = list(y) or list((y, x))
  • labels (list(str) or str) – List of labels or a label for graph. If labels is a string, this function assumes that values is a single graph.
  • x_guides (list(int)) – List of guidelines (vertical dotted lines)
  • x_name (str) – x axis label
  • y_name (str) – y axis label
  • x_min_max (list or tuple) – Min and max value of the x axis
  • y_min_max (list or tuple) – Min and max value of the y axis
  • legend_loc (str) – legend location
  • subplot (list or tuple) – matplotlib.pyplot.subplot format. E.g. to draw 1 x 2 subplot, pass (1,2,1) for the first subplot and (1,2,2) for the second subplot.
  • plot_size (list or tuple) – Plot size (width, height)