ARTS Check-in - 20191111~20191124

Published: by Creative Commons Licence

This series is part of the ARTS check-in program. For what ARTS is, see https://time.geekbang.org/column/article/85839

Algorithm

LeetCode Solution - 873 (Python implementation)

Review

Building a Movie Recommendation Service with Apache Spark & Flask - Part 1

Spark's MLlib library provides scalable data analytics through a rich set of methods. Its Alternating Least Squares implementation for Collaborative Filtering is one that fits perfectly in a recommendation engine. Due to its very nature, collaborative filtering is a costly procedure since it requires updating its model when new user preferences arrive. Therefore, having a distributed computation engine such as Spark to perform model computation is a must in any real-world recommendation engine like the one we have built here.

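Takeaway: Spark MLlib supports large-scale distributed data analysis through a rich set of methods, and its ALS implementation of collaborative filtering is a natural fit for a recommendation engine. Collaborative filtering is expensive by nature, because the model has to be updated whenever new user preferences arrive. Any real-world recommendation engine, like the one built in the article, therefore needs a distributed computing platform such as Spark to compute the model.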
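To make the cost argument concrete, here is a minimal single-machine sketch of the alternating least squares idea in plain NumPy. It is not Spark's implementation and not the article's code. The rating matrix, the function names `als` and `als_objective`, and the `rank`, `reg` and `iters` values are all assumptions chosen for illustration.

```python
import numpy as np

def als_objective(ratings, mask, users, items, reg):
    """Regularized squared error over the observed ratings only."""
    err = (ratings - users @ items.T)[mask]
    return float(err @ err + reg * ((users ** 2).sum() + (items ** 2).sum()))

def als(ratings, mask, rank=2, reg=0.1, iters=10, seed=0):
    """Toy ALS: alternately solve a small ridge regression per user and per item."""
    rng = np.random.default_rng(seed)
    n_users, n_items = ratings.shape
    users = rng.normal(scale=0.1, size=(n_users, rank))
    items = rng.normal(scale=0.1, size=(n_items, rank))
    ridge = reg * np.eye(rank)
    for _ in range(iters):
        for u in range(n_users):      # item factors fixed, solve for each user
            obs = mask[u]
            y = items[obs]
            users[u] = np.linalg.solve(y.T @ y + ridge, y.T @ ratings[u, obs])
        for i in range(n_items):      # user factors fixed, solve for each item
            obs = mask[:, i]
            x = users[obs]
            items[i] = np.linalg.solve(x.T @ x + ridge, x.T @ ratings[obs, i])
    return users, items

# Hypothetical 4-user x 5-movie rating matrix; 0 marks "not rated".
ratings = np.array([
    [5, 3, 0, 1, 4],
    [4, 0, 0, 1, 5],
    [1, 1, 0, 5, 0],
    [0, 1, 5, 4, 1],
], dtype=float)
mask = ratings > 0

users, items = als(ratings, mask)
predicted = users @ items.T  # also contains scores for the unrated cells
```

Every sweep solves one small linear system per user and one per item, and new ratings mean running the sweeps again. With millions of users and items, those per-row solves are the kind of work a distributed engine can spread across machines.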

Tip

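This tip covers the three ways NumPy handles copying an array, and how to use each. In particular, whenever you do not want the original array to change, be sure to use a deep copy.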

1. No copy at all

import numpy as np

a = np.arange(12)
# a and b are the same ndarray object
b = a
b is a 
True
b.shape = 3, 4
a
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])
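Function calls behave the same way as plain assignment: Python passes the array object itself, so no copy is made. A small sketch, where `set_first` is a hypothetical helper written only for this illustration:

```python
import numpy as np

def set_first(x):
    """Hypothetical helper: overwrite the first element in place."""
    x[0] = -1
    return x

a = np.arange(5)
result = set_first(a)

print(result is a)   # True: the function received and returned the same object
print(a.tolist())    # [-1, 1, 2, 3, 4]: the caller's array was modified
```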

2. View or shallow copy

# Different array objects can share the same data (for example, arrays with different shapes may share the same data)
c = a.view()
c
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])
c is a 
False
c.base is a 
True
c.flags.owndata
False
c.shape = 2, 6
c
array([[ 0,  1,  2,  3,  4,  5],
       [ 6,  7,  8,  9, 10, 11]])
# a's shape does not change
a
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])
# a and c have different shapes and are different array objects, but they share the same data
c[:] = 100
a
array([[100, 100, 100, 100],
       [100, 100, 100, 100],
       [100, 100, 100, 100]])
# Slicing an array also returns a view
s = a[:, 1:3]
s
array([[100, 100],
       [100, 100],
       [100, 100]])
s[:] = 10
s
array([[10, 10],
       [10, 10],
       [10, 10]])
a
array([[100,  10,  10, 100],
       [100,  10,  10, 100],
       [100,  10,  10, 100]])
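Not every indexing operation returns a view. Basic slicing does, but indexing with a list or array of integers (advanced indexing) returns a copy. When in doubt, `np.shares_memory` can check whether two arrays overlap. The array contents below are made up for the example:

```python
import numpy as np

a = np.arange(12).reshape(3, 4)
sliced = a[:, 1:3]       # basic slicing: a view on a's data
picked = a[:, [1, 2]]    # integer-list (advanced) indexing: a copy

print(np.shares_memory(a, sliced))   # True
print(np.shares_memory(a, picked))   # False

picked[:] = -1           # does not touch a
sliced[:] = 0            # writes through to a
print(a.tolist())        # [[0, 0, 0, 3], [4, 0, 0, 7], [8, 0, 0, 11]]
```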

3. Deep copy

d = a.copy()
d is a
False
d.base is a
False
d[:] = 10
a
array([[100,  10,  10, 100],
       [100,  10,  10, 100],
       [100,  10,  10, 100]])
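One caveat on the word "deep": for arrays with `dtype=object`, `copy()` duplicates the array of references but not the Python objects they point to. A short sketch of that case, with `copy.deepcopy` as the fully independent alternative (the sample values are arbitrary):

```python
import copy
import numpy as np

a = np.array([1, "m", [2, 3, 4]], dtype=object)
b = a.copy()             # new array, but b[2] is the same list object as a[2]
c = copy.deepcopy(a)     # the nested list is copied as well

b[2][0] = 10             # mutates the list shared by a and b

print(a[2])              # [10, 3, 4]
print(c[2])              # [2, 3, 4]
```

For ordinary numeric arrays this distinction does not arise, and `copy()` gives fully independent data.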

When an intermediate result is huge and only a small part of it is needed, the usual pattern is:

a = np.arange(int(1e8))
b = a[:100].copy()
del a
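The copy matters because a slice is only a view: it keeps a reference to the original array through `.base`, so deleting the name `a` would not release the large buffer. A smaller-scale sketch (the size of one million elements is arbitrary):

```python
import numpy as np

big = np.arange(1_000_000)
view = big[:100]           # view: still references big's whole buffer
small = big[:100].copy()   # independent array holding only 100 elements

print(view.base is big)    # True: the view keeps big alive
print(small.base is None)  # True: the copy owns its own data
print(small.nbytes < view.base.nbytes)  # True
```

After `del big`, the memory can be reclaimed only once `view` is gone too, while `small` never depended on it.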

Share

Common similarity measures in machine learning, with code implementations