ARTS Check-in - 20191111~20191124
This series is my ARTS check-in log. For what ARTS is, see https://time.geekbang.org/column/article/85839
Algorithm
Review
Building a Movie Recommendation Service with Apache Spark & Flask - Part 1
Spark's MLlib library provides scalable data analytics through a rich set of methods. Its Alternating Least Squares implementation for Collaborative Filtering is one that fits perfectly in a recommendation engine. Due to its very nature, collaborative filtering is a costly procedure since it requires updating its model when new user preferences arrive. Therefore, having a distributed computation engine such as Spark to perform model computation is a must in any real-world recommendation engine like the one we have built here.
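Spark MLlib supports large-scale distributed data analysis through a rich set of methods. Its ALS implementation for collaborative filtering is a natural fit for a recommendation engine. Collaborative filtering is inherently expensive, because the model has to be updated whenever new user preferences arrive. Any real-world recommendation engine, like the one built in the article, therefore needs a distributed computing platform such as Spark to do the model computation.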
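To make the "alternating" part of ALS concrete, here is a minimal sketch in plain NumPy. It is not Spark's implementation: it assumes a small, fully observed, dense ratings matrix (real ratings are sparse, and MLlib distributes the work), and the function name `als_dense` and the toy data are made up for illustration. Each step fixes one factor matrix and solves a ridge-regularized least-squares problem for the other.

```python
import numpy as np

def als_dense(ratings, rank=2, reg=0.1, iters=10, seed=0):
    """Toy ALS on a dense, fully observed ratings matrix (illustrative only)."""
    rng = np.random.default_rng(seed)
    n_users, n_items = ratings.shape
    user_f = rng.normal(size=(n_users, rank))
    item_f = rng.normal(size=(n_items, rank))
    ridge = reg * np.eye(rank)
    for _ in range(iters):
        # Fix item factors, solve the regularized least squares for user factors.
        user_f = np.linalg.solve(item_f.T @ item_f + ridge, item_f.T @ ratings.T).T
        # Fix user factors, solve the regularized least squares for item factors.
        item_f = np.linalg.solve(user_f.T @ user_f + ridge, user_f.T @ ratings).T
    return user_f, item_f

# Hypothetical toy data: 4 users x 5 movies, built to have exactly rank 2.
user_taste = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]])
movie_traits = np.array([[5.0, 1.0], [4.0, 1.0], [1.0, 5.0], [1.0, 4.0], [3.0, 3.0]])
ratings = user_taste @ movie_traits.T

user_f, item_f = als_dense(ratings)
rel_error = np.linalg.norm(ratings - user_f @ item_f.T) / np.linalg.norm(ratings)
print(f"relative reconstruction error: {rel_error:.4f}")
```

When a new rating arrives, the factors have to be re-solved, which is the cost the article is pointing at.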
Tip
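An introduction to the three ways NumPy arrays are (or are not) copied, with usage tips. Pay particular attention to this: whenever you do not want to modify the original array, be sure to use a deep copy.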
1. No copy at all
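import numpy as np
a = np.arange(12)
# a and b are the same ndarray object; plain assignment creates nothing new
b = a
b is a
True
# changing the shape through b also changes the shape of a
b.shape = 3, 4
a
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])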
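The same "no copy" behavior applies when an array is passed to a function: Python passes a reference to the same object, so in-place changes inside the function are visible to the caller. A small sketch, where `scale_in_place` is a hypothetical helper written just for this note:

```python
import numpy as np

def scale_in_place(arr, factor):
    """Hypothetical helper: multiplies arr by factor in place and returns it."""
    arr *= factor  # in-place operation, modifies the caller's array
    return arr

a = np.arange(4)
result = scale_in_place(a, 10)

print(result is a)  # True: no copy was made at any point
print(a.tolist())   # [0, 10, 20, 30]
```

If the caller's array must stay untouched, the function (or the caller) has to copy explicitly.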
2. View or shallow copy
# Different array objects can share the same data (for example, arrays with different shapes may share the same data)
c = a.view()
c
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])
c is a
False
c.base is a
True
c.flags.owndata
False
c.shape = 2, 6
c
array([[ 0,  1,  2,  3,  4,  5],
       [ 6,  7,  8,  9, 10, 11]])
# the shape of a does not change
a
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])
# a and c have different shapes and are different array objects, but they share the same data
c[:] = 100
a
array([[100, 100, 100, 100],
       [100, 100, 100, 100],
       [100, 100, 100, 100]])
# slicing an array also returns a view
s = a[:, 1:3]
s
array([[100, 100],
       [100, 100],
       [100, 100]])
s[:] = 10
s
array([[10, 10],
       [10, 10],
       [10, 10]])
a
array([[100,  10,  10, 100],
       [100,  10,  10, 100],
       [100,  10,  10, 100]])
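One pitfall worth checking for: basic slicing returns a view, but indexing with a list or integer array ("fancy" indexing) returns a copy. A quick way to tell them apart is `np.shares_memory`. The variable names below are just for illustration:

```python
import numpy as np

a = np.arange(12).reshape(3, 4)

sliced = a[:, 1:3]     # basic slicing: a view onto a's data
picked = a[:, [1, 2]]  # integer-list indexing: a new array with its own data

print(np.shares_memory(a, sliced))  # True
print(np.shares_memory(a, picked))  # False

# Writing into the fancy-indexed result leaves a untouched.
picked[:] = -1
print(a[0].tolist())  # [0, 1, 2, 3]
```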
3. Deep copy
d = a.copy()
d is a
False
d.base is a
False
d[:] = 10
a
array([[100,  10,  10, 100],
       [100,  10,  10, 100],
       [100,  10,  10, 100]])
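When an intermediate result is huge and only a small part of it is still needed, copy that part and delete the original so its memory can be released: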
a = np.arange(int(1e8))
b = a[:100].copy()
del a
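Why the `.copy()` matters here: a slice is a view, and a view keeps a reference to the whole original buffer through its `.base`, so deleting the name `a` alone would not free anything. A small sketch of the difference, using a much smaller array than `1e8` to keep it light (the variable names are mine):

```python
import numpy as np

big = np.arange(1_000_000)

head_view = big[:100]         # view: still tied to big's full buffer
head_copy = big[:100].copy()  # copy: owns just its 100 elements

print(head_view.base is big)    # True
print(head_copy.base is None)   # True
print(head_copy.flags.owndata)  # True

# After this, only head_view keeps the large buffer alive;
# head_copy does not depend on it at all.
del big
```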