ARTS打卡 - 20191014~20191027

Published: October 20, 2019 by Forrest Yang

Categories:
ARTS打卡 6

这个系列是 ARTS 打卡计划，什么是ARTS，参看这里 https://time.geekbang.org/column/article/85839

Algorithm

Review

Recommendation Systems: Collaborative Filtering using Matrix Factorization — Simplified

Collaborative filtering methods are based on collecting and analyzing a large amount of information on user behaviors, activities or preferences and predicting what users will like based on their similarity to other users.

"协同过滤方法基于收集和分析大量用户行为，活动，偏好信息以及用他们之间的相似性来预测用户可能喜欢的物品。"

文中这句话强调了用户之间的相似性，描述稍微有些片面。不光用户之间有相似性，物品之间也有相似性。矩阵分解既不能说是user-based ，也不能说是item-based ，其实是model-based 。

“HOW DO WE FIND THE RIGHT FACTORIZATION FOR A MATRIX?”

Machine Learning is the solution. The model learns to find latent factors to factorize the rating matrix. To arrive at the best approximation of the factors, RMSE(root mean squared error) is the cost function to be minimized. After Matrix factorization is done, squared error is calculated for every movie rating in the rating matrix and the root value of the mean of squared error values are minimized. In order to minimize RMSE to learn the factors, Gradient Descent and Alternating Least Squares are the two most used techniques.

MF(矩阵分解)的方法主要有两种：SGD (随机梯度下降)和ALS(交替最小二乘法)

工业界常用的是交替最小二乘法，代价函数为：

\(\min _{x_{\star}, y_{\star}} \sum_{u, i} \left(r_{u i}-x_{u}^{T} y_{i}\right)^{2}+\lambda\left(\sum_{u}\left\|x_{u}\right\|^{2}+\sum_{i}\left\|y_{i}\right\|^{2}\right)\) (其中\(\lambda\)为正则化系数)

详细说明可参看原论文 [Large-scale Parallel Collaborative Filtering for the Netflix Prize]

Tip

看论文的时候，经常需要把文中的数学公式拷贝出来，自己重写Latex代码又费神费事。这个时候就有个偷懒的办法。神奇的工具-mathpix，它会识别屏幕截图中的公式。还有个小技巧，通过观察输出结果底部的进度条，可以看出机器对此次识别结果的confidence。一般来讲，当confidence较小的时候，就可能需要人为去校对一下了。

Hive 的3种Ranking函数

Algorithm

Review

Tip

Share