Common utilities module

General utilities

recommenders.utils.general_utils.get_number_processors()
Get the number of processors in a CPU.
Returns: Number of processors.
Return type: int

recommenders.utils.general_utils.get_physical_memory()
Get the physical memory in GBs.
Returns: Physical memory in GBs.
Return type: float

recommenders.utils.general_utils.invert_dictionary(dictionary)
Invert a dictionary.
Note
If the dictionary has unique keys and unique values, the inversion is exact. If some values are repeated, each repeated value can map back to only one of its original keys, so the inversion loses information.
Parameters: dictionary (dict) – A dictionary.
Returns: Inverted dictionary.
Return type: dict
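
The effect of repeated values can be seen in a minimal sketch. The helper `invert` below is a hypothetical stand-in written as a dict comprehension; it is not taken from the library source and only assumes the behaviour described in the note above.

```python
def invert(dictionary):
    """Swap keys and values; on repeated values the last key seen wins."""
    return {value: key for key, value in dictionary.items()}


unique = {"a": 1, "b": 2}
repeated = {"a": 1, "b": 1}

print(invert(unique))    # {1: 'a', 2: 'b'}
print(invert(repeated))  # {1: 'b'}  (the key 'a' is lost)
```

If every original key must be preserved, collecting keys into a list per value is one alternative, at the cost of a different return shape.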

GPU utilities

recommenders.utils.gpu_utils.get_cuda_version()
Get the CUDA version.
Returns: Version of the library.
Return type: str

recommenders.utils.gpu_utils.get_cudnn_version()
Get the CuDNN version.
Returns: Version of the library.
Return type: str

Kubernetes utilities

recommenders.utils.k8s_utils.nodes_to_replicas(n_cores_per_node, n_nodes=3, cpu_cores_per_replica=0.1)
Provide a rough estimate of the number of replicas supported by a given number of nodes with n_cores_per_node cores each.
Parameters:
- n_cores_per_node (int) – Total number of cores per node within an AKS cluster that you want to use.
- n_nodes (int) – Number of nodes (i.e. VMs) used in the AKS cluster.
- cpu_cores_per_replica (float) – Cores assigned to each replica. This can be fractional and corresponds to the cpu_cores argument passed to AksWebservice.deploy_configuration().
Returns: Total number of replicas supported by the configuration.
Return type:

recommenders.utils.k8s_utils.qps_to_replicas(target_qps, processing_time, max_qp_replica=1, target_utilization=0.7)
Provide a rough estimate of the number of replicas needed to support a given load (queries per second).
Parameters:
- target_qps (int) – Target queries per second that you want to support.
- processing_time (float) – The estimated amount of time (in seconds) your service call takes.
- max_qp_replica (int) – Maximum number of concurrent queries per replica.
- target_utilization (float) – Proportion of CPU utilization you think is ideal.
Returns: Number of estimated replicas required to support a target number of queries per second.
Return type:
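
As a rough illustration of how these four parameters can combine, the sketch below uses Little's law (in-flight queries = arrival rate x time per query) and then adds utilization headroom. The function name `estimate_replicas` and the exact formula are assumptions made for this example; the library's own calculation may differ.

```python
import math


def estimate_replicas(target_qps, processing_time, max_qp_replica=1, target_utilization=0.7):
    """Hypothetical estimate; not the library implementation."""
    # Average number of queries being processed at any moment.
    concurrent_queries = target_qps * processing_time
    # Inflate the load so each replica runs at about target_utilization.
    required_capacity = concurrent_queries / target_utilization
    # Each replica handles up to max_qp_replica concurrent queries.
    return math.ceil(required_capacity / max_qp_replica)


# 25 queries/s at 0.1 s each is 2.5 in-flight queries; with 70% utilization that needs 4 replicas.
print(estimate_replicas(target_qps=25, processing_time=0.1))  # 4
```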

Notebook utilities

recommenders.utils.notebook_utils.is_databricks()
Check if the module is running on Databricks.
Returns: True if the module is running in a Databricks notebook, False otherwise.
Return type: bool

recommenders.utils.notebook_utils.is_jupyter()
Check if the module is running in a Jupyter notebook or console.
Returns: True if the module is running in a Jupyter notebook or Jupyter console, False otherwise.
Return type: bool

recommenders.utils.notebook_memory_management.pre_run_cell()
Capture the current time before the current command is executed.

recommenders.utils.notebook_memory_management.start_watching_memory()
Register memory profiling tools with the IPython instance.

Python utilities

recommenders.utils.python_utils.binarize(a, threshold)
Binarize the values.
Parameters:
- a (numpy.ndarray) – Input array that needs to be binarized.
- threshold (float) – Threshold below which all values are set to 0, else 1.
Returns: Binarized array.
Return type: numpy.ndarray
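
A minimal sketch of thresholding with NumPy, following the parameter description above. The name `binarize_sketch` is hypothetical, and how the library treats values exactly equal to the threshold is not stated here, so the example avoids that case.

```python
import numpy as np


def binarize_sketch(a, threshold):
    """Hypothetical stand-in: values below threshold become 0, the rest 1."""
    return np.where(a < threshold, 0.0, 1.0)


scores = np.array([[0.1, 0.8], [0.6, 0.3]])
binary = binarize_sketch(scores, threshold=0.5)
print(binary)
# [[0. 1.]
#  [1. 0.]]
```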

recommenders.utils.python_utils.cosine_similarity(cooccurrence)
Helper method to calculate the cosine similarity of a matrix of co-occurrences.
Cosine similarity can be interpreted as the cosine of the angle between the i-th and j-th item vectors.
Parameters: cooccurrence (numpy.ndarray) – The symmetric matrix of co-occurrences of items.
Returns: The matrix of cosine similarity between any two items.
Return type: numpy.ndarray
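
One way to read this is sketched below, assuming that the diagonal of the co-occurrence matrix holds each item's own occurrence count (as it does for binary user-item interactions). The function name `cosine_from_cooccurrence` is hypothetical and the code is not taken from the library.

```python
import numpy as np


def cosine_from_cooccurrence(cooccurrence):
    """Hypothetical sketch: c_ij / sqrt(c_ii * c_jj)."""
    diag = np.diag(cooccurrence).astype(float)
    # Outer product gives c_ii * c_jj for every pair (i, j).
    norms = np.sqrt(np.outer(diag, diag))
    return cooccurrence / norms


cooccurrence = np.array([[4, 2], [2, 9]])
similarity = cosine_from_cooccurrence(cooccurrence)
print(similarity)  # diagonal is 1.0, off-diagonal is 2 / sqrt(4 * 9) = 1/3
```

Items that never occur would produce a zero denominator, so a real implementation needs to handle that case.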

recommenders.utils.python_utils.exponential_decay(value, max_val, half_life)
Compute decay factor for a given value based on an exponential decay.
Values greater than max_val will be set to 1.
Parameters:
- value (numeric) – Value to calculate decay factor.
- max_val (numeric) – Value at which decay factor will be 1.
- half_life (numeric) – Value at which decay factor will be 0.5.
Returns: Decay factor.
Return type:
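
The sketch below shows one common form of such a decay, written under the assumption that the factor halves for every half_life units that value falls below max_val. The name `decay_factor` and the formula are illustrative, not confirmed against the library source.

```python
def decay_factor(value, max_val, half_life):
    """Hypothetical sketch: 0.5 ** ((max_val - value) / half_life), capped at 1."""
    return min(1.0, 0.5 ** ((max_val - value) / half_life))


print(decay_factor(10, max_val=10, half_life=5))  # 1.0
print(decay_factor(5, max_val=10, half_life=5))   # 0.5
print(decay_factor(0, max_val=10, half_life=5))   # 0.25
print(decay_factor(12, max_val=10, half_life=5))  # 1.0 (capped)
```

Under this reading, half_life is a distance below max_val rather than an absolute value.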

recommenders.utils.python_utils.get_top_k_scored_items(scores, top_k, sort_top_k=False)
Extract top K items from a matrix of scores for each user-item pair, optionally sorting the results per user.
Parameters:
Returns:
- Indices into score matrix for each user's top items.
- Scores corresponding to top items.
Return type: numpy.ndarray, numpy.ndarray
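
A minimal sketch of per-user top-K selection with NumPy, assuming scores is a 2d array of shape (users, items). The helper `top_k_sketch` is hypothetical; it uses `np.argpartition` so that a full sort is only done when sort_top_k is requested.

```python
import numpy as np


def top_k_sketch(scores, top_k, sort_top_k=False):
    """Hypothetical stand-in returning (item indices, scores) per user."""
    # The top_k largest entries of each row end up in the last top_k columns, unordered.
    idx = np.argpartition(scores, -top_k, axis=1)[:, -top_k:]
    top_scores = np.take_along_axis(scores, idx, axis=1)
    if sort_top_k:
        # Order each user's top items by descending score.
        order = np.argsort(-top_scores, axis=1)
        idx = np.take_along_axis(idx, order, axis=1)
        top_scores = np.take_along_axis(top_scores, order, axis=1)
    return idx, top_scores


scores = np.array([[0.1, 0.9, 0.5], [0.7, 0.2, 0.4]])
items, values = top_k_sketch(scores, top_k=2, sort_top_k=True)
print(items)   # [[1 2], [0 2]]
print(values)  # [[0.9 0.5], [0.7 0.4]]
```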

recommenders.utils.python_utils.inclusion_index(cooccurrence)
Helper method to calculate the inclusion index of a matrix of co-occurrences.
Inclusion index measures the overlap between items.
Parameters: cooccurrence (numpy.ndarray) – The symmetric matrix of co-occurrences of items.
Returns: The matrix of inclusion index between any two items.
Return type: numpy.ndarray

recommenders.utils.python_utils.jaccard(cooccurrence)
Helper method to calculate the Jaccard similarity of a matrix of co-occurrences.
When comparing Jaccard with count co-occurrence and lift similarity, count favours predictability, meaning that the most popular items will be recommended most of the time. Lift, by contrast, favours discoverability/serendipity, meaning that an item that is less popular overall but highly favoured by a small subset of users is more likely to be recommended. Jaccard is a compromise between the two.
Parameters: cooccurrence (numpy.ndarray) – The symmetric matrix of co-occurrences of items.
Returns: The matrix of Jaccard similarities between any two items.
Return type: numpy.ndarray
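
The trade-off between count, lift and Jaccard can be made concrete with a small worked example. The formulas below are the textbook definitions, applied under the assumption that the diagonal of the co-occurrence matrix holds each item's own occurrence count; the numbers are invented for illustration and the code is not the library implementation.

```python
import numpy as np

# Item 0 is the reference item, item 1 is popular, item 2 is niche.
cooccurrence = np.array(
    [
        [25.0, 20.0, 4.0],
        [20.0, 100.0, 1.0],
        [4.0, 1.0, 5.0],
    ]
)
diag = np.diag(cooccurrence)

# Jaccard: c_ij / (c_ii + c_jj - c_ij)
jaccard = cooccurrence / (diag[:, None] + diag[None, :] - cooccurrence)
# Lift: c_ij / (c_ii * c_jj)
lift = cooccurrence / (diag[:, None] * diag[None, :])

print(cooccurrence[0, 1:])  # count strongly prefers the popular item
print(lift[0, 1:])          # lift prefers the niche item
print(jaccard[0, 1:])       # Jaccard ranks the two much closer together
```

Here count ranks the popular item five times higher, lift ranks the niche item four times higher, and Jaccard gives the two items similar scores.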

recommenders.utils.python_utils.lexicographers_mutual_information(cooccurrence)
Helper method to calculate the Lexicographers Mutual Information of a matrix of co-occurrences.
Due to the bias of mutual information for low frequency items, lexicographers mutual information corrects the formula by multiplying it by the co-occurrence frequency.
Parameters: cooccurrence (numpy.ndarray) – The symmetric matrix of co-occurrences of items.
Returns: The matrix of lexicographers mutual information between any two items.
Return type: numpy.ndarray

recommenders.utils.python_utils.lift(cooccurrence)
Helper method to calculate the Lift of a matrix of co-occurrences.
In comparison with basic co-occurrence and Jaccard similarity, lift favours discoverability and serendipity, as opposed to co-occurrence, which favours the most popular items, and Jaccard, which is a compromise between the two.
Parameters: cooccurrence (numpy.ndarray) – The symmetric matrix of co-occurrences of items.
Returns: The matrix of Lifts between any two items.
Return type: numpy.ndarray

recommenders.utils.python_utils.mutual_information(cooccurrence)
Helper method to calculate the Mutual Information of a matrix of co-occurrences.
Mutual information measures the amount of information shared between the i-th and j-th item column vectors.
Parameters: cooccurrence (numpy.ndarray) – The symmetric matrix of co-occurrences of items.
Returns: The matrix of mutual information between any two items.
Return type: numpy.ndarray

recommenders.utils.python_utils.rescale(data, new_min=0, new_max=1, data_min=None, data_max=None)
Rescale/normalize the data to be within the range [new_min, new_max]. If data_min and data_max are explicitly provided, they will be used as the old min/max values instead of being taken from the data.
Note
This is the same as scikit-learn's MinMaxScaler, with the exception that the min/max of the old scale can be overridden.
Parameters:
- data (numpy.ndarray) – 1d scores vector or 2d score matrix (users x items).
- new_min (int|float) – The minimum of the newly scaled data.
- new_max (int|float) – The maximum of the newly scaled data.
- data_min (None|number) – The minimum of the passed data [if omitted it will be inferred].
- data_max (None|number) – The maximum of the passed data [if omitted it will be inferred].
Returns: The newly scaled/normalized data.
Return type: numpy.ndarray
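
The effect of overriding the old scale can be sketched with standard min-max scaling. The helper `rescale_sketch` is a hypothetical stand-in that mirrors the documented parameters; it is not copied from the library.

```python
import numpy as np


def rescale_sketch(data, new_min=0, new_max=1, data_min=None, data_max=None):
    """Hypothetical min-max scaling with an optional override of the old range."""
    data_min = data.min() if data_min is None else data_min
    data_max = data.max() if data_max is None else data_max
    return (data - data_min) / (data_max - data_min) * (new_max - new_min) + new_min


inferred = rescale_sketch(np.array([1.0, 3.0, 5.0]))
print(inferred)  # [0.  0.5 1. ]

# The batch only contains ratings 2 and 4, but the rating scale is known to be 1 to 5.
overridden = rescale_sketch(np.array([2.0, 4.0]), data_min=1, data_max=5)
print(overridden)  # [0.25 0.75]
```

Passing data_min and data_max keeps scores comparable across batches that do not each span the full original range.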

Spark utilities

recommenders.utils.spark_utils.start_or_get_spark(app_name='Sample', url='local[*]', memory='10g', config=None, packages=None, jars=None, repositories=None)
Start Spark if not started.
Parameters:
- app_name (str) – Name of the application.
- url (str) – URL for the Spark master.
- memory (str) – Size of memory for the Spark driver. This will be ignored if spark.driver.memory is set in config.
- config (dict) – Dictionary of configuration options.
- packages (list) – List of packages to install.
- jars (list) – List of jar files to add.
- repositories (list) – List of Maven repositories.
Returns: Spark context.
Return type:

TensorFlow utilities

recommenders.utils.tf_utils.build_optimizer(name, lr=0.001, **kwargs)
Get an optimizer for TensorFlow high-level API Estimator.
Available options are: adadelta, adagrad, adam, ftrl, momentum, rmsprop or sgd.
Parameters:
Returns: TensorFlow optimizer.
Return type: tf.train.Optimizer

recommenders.utils.tf_utils.evaluation_log_hook(estimator, logger, true_df, y_col, eval_df, every_n_iter=10000, model_dir=None, batch_size=256, eval_fns=None, **eval_kwargs)
Evaluation log hook for TensorFlow high-level API Estimator.
Note
A TensorFlow Estimator model uses the last checkpoint weights for evaluation or prediction. In order to get the most up-to-date evaluation results while training, set the model's save_checkpoints_steps to be equal to or greater than the hook's every_n_iter.
Parameters:
- estimator (tf.estimator.Estimator) – Model to evaluate.
- logger (Logger) – Custom logger to log the results. E.g., define a subclass of Logger for AzureML logging.
- true_df (pd.DataFrame) – Ground-truth data.
- y_col (str) – Label column name in true_df.
- eval_df (pd.DataFrame) – Evaluation data without label column.
- every_n_iter (int) – Evaluation frequency (steps).
- model_dir (str) – Model directory to save the summaries to. If None, does not record.
- batch_size (int) – Number of samples fed into the model at a time. Note that the batch size does not affect the evaluation results.
- eval_fns (iterable of functions) – List of evaluation functions that have the signature (true_df, prediction_df, **eval_kwargs) -> float. If None, loss is calculated on true_df.
- eval_kwargs – Evaluation function's keyword arguments. Note that the prediction column name should be 'prediction'.
Returns: Session run hook to evaluate the model while training.
Return type: tf.train.SessionRunHook

recommenders.utils.tf_utils.export_model(model, train_input_fn, eval_input_fn, tf_feat_cols, base_dir)
Export TensorFlow estimator (model).
Parameters:
- model (tf.estimator.Estimator) – Model to export.
- train_input_fn (function) – Training input function to create data receiver spec.
- eval_input_fn (function) – Evaluation input function to create data receiver spec.
- tf_feat_cols (list(tf.feature_column)) – Feature columns.
- base_dir (str) – Base directory to export the model.
Returns: Exported model path.
Return type:

recommenders.utils.tf_utils.pandas_input_fn(df, y_col=None, batch_size=128, num_epochs=1, shuffle=False, seed=None)
Pandas input function for TensorFlow high-level API Estimator. This function returns a tf.data.Dataset function.
Note
tf.estimator.inputs.pandas_input_fn cannot handle array/list columns properly. For more information, see https://www.tensorflow.org/api_docs/python/tf/estimator/inputs/numpy_input_fn
Parameters:
- df (pandas.DataFrame) – Data containing features.
- y_col (str) – Label column name if df has it.
- batch_size (int) – Batch size for the input function.
- num_epochs (int) – Number of epochs to iterate over data. If None, it will run forever.
- shuffle (bool) – If True, shuffles the data queue.
- seed (int) – Random seed for shuffle.
Returns: Function.
Return type: tf.data.Dataset

recommenders.utils.tf_utils.pandas_input_fn_for_saved_model(df, feat_name_type)
Pandas input function for TensorFlow SavedModel.
Parameters:
- df (pandas.DataFrame) – Data containing features.
- feat_name_type (dict) – Feature name and type spec. E.g. {'userID': int, 'itemID': int, 'rating': float}
Returns: Input function.
Return type: func

Timer

class recommenders.utils.timer.Timer
Timer class.
Examples
>>> import time
>>> t = Timer()
>>> t.start()
>>> time.sleep(1)
>>> t.stop()
>>> t.interval >= 1
True
>>> with Timer() as t:
...     time.sleep(1)
>>> t.interval >= 1
True
>>> "Time elapsed {}".format(t) #doctest: +ELLIPSIS
'Time elapsed 1...'
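
For readers curious how a timer with this interface can work, here is a minimal sketch built on `time.perf_counter`. The class `SimpleTimer` is hypothetical and only mimics the start/stop, context-manager and string-formatting behaviour shown in the examples above; it is not the library's Timer.

```python
import time


class SimpleTimer:
    """Hypothetical stand-in for a start/stop timer usable as a context manager."""

    def __init__(self):
        self.interval = 0.0
        self._start = None

    def start(self):
        self._start = time.perf_counter()

    def stop(self):
        if self._start is None:
            raise RuntimeError("Timer has not been started")
        self.interval = time.perf_counter() - self._start
        self._start = None

    def __enter__(self):
        self.start()
        return self

    def __exit__(self, *exc_info):
        self.stop()

    def __str__(self):
        # Render the elapsed seconds with four decimal places.
        return "{:.4f}".format(self.interval)


with SimpleTimer() as t:
    time.sleep(0.05)

print("Time elapsed {}".format(t))
```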

Plot utilities

recommenders.utils.plot.line_graph(values, labels, x_guides=None, x_name=None, y_name=None, x_min_max=None, y_min_max=None, legend_loc=None, subplot=None, plot_size=(5, 5))
Plot line graph(s).
Parameters:
- values (list(list(float or tuple)) or list(float or tuple)) – List of graphs or a single graph to plot. E.g. a graph = list(y) or list((y, x)).
- labels (list(str) or str) – List of labels or a label for the graph. If labels is a string, this function assumes that values is a single graph.
- x_guides (list(int)) – List of guidelines (vertical dotted lines).
- x_name (str) – x axis label.
- y_name (str) – y axis label.
- x_min_max (list or tuple) – Min and max value of the x axis.
- y_min_max (list or tuple) – Min and max value of the y axis.
- legend_loc (str) – Legend location.
- subplot (list or tuple) – matplotlib.pyplot.subplot format. E.g. to draw a 1 x 2 subplot, pass (1,2,1) for the first subplot and (1,2,2) for the second subplot.
- plot_size (list or tuple) – Plot size (width, height).