Two years ago, to make it easier to browse the articles I'd saved across various sites, I consolidated them all into Pinbox, and then promptly forgot about it. Recently the need to save articles came back, so I opened it up again: apart from a new paid tier, basically nothing had changed. Then I discovered that an invite code let me create nested collections, so I kept freeloading with a clear conscience. Not long after, they apparently caught on: even though I was nowhere near the collection limit, I could no longer create new collections. A year's membership isn't expensive, but given Pinbox's rather bare-bones interface and two years of near-zero feature changes, allow me to decline.
So began the search for a Pinbox replacement. At first I considered self-hosting a note-taking app, but the features were too thin and I don't have a server, so I dropped the idea. Then I found Notion. It's a note app, but it fits my needs perfectly. There are plenty of Notion feature write-ups online, so I'll skip the introduction; to me its biggest strengths are the generous free tier and the modular, block-based design.
## Notion API
Normally the story would end there: I just switched apps. But one day, lying in bed scrolling Douban, I stumbled on the fact that Notion has an API, and sat bolt upright. Ever since trying Notion I'd wanted to import my 花瓣网 (Huaban) images and my NetEase Cloud Music playlists into it, and once I saw there was an API, the tinkering journey began.
The Notion API can be driven from Python. I'd never written Python before, but it turned out to look a lot like a Node.js crawler, so relying on my meager JS knowledge and limited wits, I ~~copied~~ borrowed code from the article "Notion → 支付宝&微信 → 账单". After losing a few more hairs, I ~~grudgingly~~ happily set off on the crawling journey.
A problem showed up on the very first run, as problems do when writing code; `pip install requests` fixed it.
Then came a Notion API oddity: image URLs are required to have a file extension, while Huaban serves images from extension-less links. So much for playing nicely. Then I remembered Notion can import Markdown files. The workaround: first save each image link into a .md file, import those into Notion, then update the links inside via the API. After testing that this works, I started with the Huaban crawler.
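The first half of that workaround, one Markdown file per image, boils down to a tiny helper. A minimal sketch; the `_fw236/format/webp` suffix is Huaban's thumbnail variant (as used in the crawler below), and `"abc123"` is a made-up CDN key, not a real pin:

```python
def pin_to_markdown(key):
    """Markdown body for one pin: thumbnail line, then the full-size image."""
    base = "https://hbimg.huabanimg.com/" + key
    return "![](" + base + "_fw236/format/webp)\n![](" + base + ")\n"

# One .md file per pin; "abc123" is a placeholder key.
print(pin_to_markdown("abc123"))
```

Notion's importer accepts these because the `.md` extension sidesteps the URL check; the real image links get patched in later.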
## 花瓣网 (Huaban)
Huaban has tags, but no way to exclude keywords from a search, which makes finding images a chore.
### 1. Collecting image links and metadata
In short: save each image link into its own .md file, and write each image's tags, Huaban link, and source URL into a summary file, 花瓣汇总.csv.
Code:
```python
import requests
import re

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36',
    'X-Request': 'JSON',
    "cookie": "cookie"  # your Huaban cookie
}
req = requests.get(url="画板链接", headers=headers)  # board URL
htmlPage = req.content
for p in range(1, 25):  # crawl 24 batches of the board
    # The pin data is embedded in the page as a JSON-ish "pins" blob
    prog = re.compile(r'"pins".*')
    appPins = prog.findall(htmlPage.decode("utf-8"))
    # Map JSON literals to Python names so eval() can parse the blob
    null = None
    true = True
    false = False
    result = eval(appPins[0][7:-2])
    images = []
    for i in result:
        # Thumbnail plus full-size image, in Markdown image syntax
        info = ("![](https://hbimg.huabanimg.com/" + str(i["file"]["key"]) + "_fw236/format/webp)"
                + "\n"
                + "![](https://hbimg.huabanimg.com/" + str(i["file"]["key"]) + ")")
        with open("E:/office/py/爬虫/" + str(i['pin_id']) + ".md", "a", encoding="utf-8", newline="") as f:
            f.write(str(info) + "\n")
        # Pad/trim the tag string so every row has exactly five tag fields
        tagnull = str(i["tags"])
        if tagnull.count(",") > 4:
            tagnull = tagnull[0:24]
        while tagnull.count(",") < 4:
            tagnull = tagnull + ",null"
        with open("E:/office/py/爬虫/花瓣汇总.csv", "a", encoding="utf-8", newline="") as fo:
            linknull = str(i["link"])
            if linknull == "":
                linknull = "None"
            fo.write(str(i['pin_id']) + ","
                     + tagnull.replace("[]", "null").replace("[", "").replace("]", "").replace("'", "") + ","
                     + "https://huaban.com/pins/" + str(i['pin_id']) + ","
                     + linknull + "\n")
        images.append(i['pin_id'])
    # Next batch: the load-more URL is keyed on the last pin id seen
    htmlPage = requests.get(url="加载链接前缀" + str(images[-1]) + "&limit=20&wfl=1", headers=headers).content  # load-more URL prefix
```
With that written, run it.
If all goes well, the output files look like this:
### 2. Processing the data

After importing the .md files into Notion, each page's page-id is needed, and this is another Notion API quirk: there's no way to fetch every page-id through the API. No crawler needed, though; a bit of JS in the browser console does the job:
```javascript
function notion() {
    var a = 0;
    while (a < 206) {  // 206 = number of pages in the collection
        // Each collection item links to its page; slice(41) strips the
        // notion.so prefix, leaving just the page-id
        var links = document.getElementsByClassName('notion-selectable notion-page-block notion-collection-item')[a]
            .firstElementChild.href.slice(41);
        a += 1;
        console.log(links);
    }
}
```
Then match the page-ids against the summary file.
Code:
```python
import csv

filepath = "E:/office/py/date/花瓣汇总.csv"
filepath2 = "E:/office/py/date/notionlink.csv"
names1 = []
tags = []
names2 = []
links = []
fo = open("E:/office/py/date/test.txt", "w", encoding="utf-8")
# Summary file: pin id, then the five tag fields plus the two links
with open(filepath, "r", encoding="utf-8", newline="") as f:
    csvreader = csv.reader(f)
    for row1 in csvreader:
        names1.append(row1[0])
        tags.append(row1[1] + "," + row1[2] + "," + row1[3] + "," + row1[4] + ","
                    + row1[5] + "," + row1[6] + "," + row1[7])
# Page-id file: pin id, then the Notion page-id from the console script
with open(filepath2, "r", encoding="utf-8", newline="") as f:
    csvreader1 = csv.reader(f)
    for row2 in csvreader1:
        names2.append(row2[0])
        links.append(row2[1])
a = 0
while a < 数目:  # 数目 = total number of rows
    pageid = names2.index(names1[a])
    seq = [links[pageid], tags[a]]
    fo.write(str(seq).replace("[", "").replace("]", "").replace("'", "") + "\n")
    a += 1
fo.close()
```
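As an aside, `list.index()` rescans the whole page-id list for every row; building a dict first does the same matching in a single pass. A small sketch under the same assumptions (pin id in the first column of both files; the rows below are made up, not real pins):

```python
# Rows mimic the two CSVs: (pin_id, joined tags) and (pin_id, notion page-id).
huaban_rows = [("111", "cat,art"), ("222", "dog,photo")]
notion_rows = [("222", "page-bbb"), ("111", "page-aaa")]

page_by_pin = dict(notion_rows)  # pin_id -> page-id, built once
matched = [(page_by_pin[pin], tags) for pin, tags in huaban_rows]
print(matched)  # [('page-aaa', 'cat,art'), ('page-bbb', 'dog,photo')]
```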
After matching:
### 3. Updating the pages
After all the detours, the pages can finally be updated directly from 路径.csv.
Code:
```python
import csv
import requests

filepath2 = "E:/office/py/date/路径.csv"
pageid = []
tag1 = []
tag2 = []
tag3 = []
tag4 = []
tag5 = []
links = []
link2 = []
with open(filepath2, "r", encoding="utf-8", newline="") as f:
    csvreader1 = csv.reader(f)
    for row2 in csvreader1:
        pageid.append(row2[0])
        tag1.append(row2[1])
        tag2.append(row2[2])
        tag3.append(row2[3])
        tag4.append(row2[4])
        tag5.append(row2[5])
        links.append(row2[6])
        link2.append(row2[7])

class notionDemo():
    def add_bill(a):
        # Patch one page: five tags, the Huaban link, and the source URL
        body = {
            "properties": {
                "Tags": {"multi_select": [
                    {"name": tag1[a]},
                    {"name": tag2[a]},
                    {"name": tag3[a]},
                    {"name": tag4[a]},
                    {"name": tag5[a]},
                ]},
                "link": {"url": links[a]},
                "源站": {"url": link2[a]},
            },
        }
        r = requests.request(
            "PATCH",
            "https://api.notion.com/v1/pages/" + pageid[a],
            json=body,
            headers={"Authorization": "Bearer " + "自己token",  # your integration token
                     "Notion-Version": "2021-05-13"},
        )
        print(r.text)

a = 0
while a < 数目:  # 数目 = number of rows in 路径.csv
    notionDemo.add_bill(a)
    a += 1
```
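One caveat when looping over hundreds of pages: Notion's API rate-limits integrations (its docs cite an average of about three requests per second), so it pays to pause between calls. A sketch of a throttle helper; the 0.4-second delay is my own choice, not something from the original:

```python
import time

def throttled(calls, delay=0.4):
    """Run each zero-argument callable in order, sleeping between calls."""
    results = []
    for call in calls:
        results.append(call())
        time.sleep(delay)
    return results

# e.g. wrap the update loop:
# throttled([lambda a=a: notionDemo.add_bill(a) for a in range(rows)])
```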
## 网易云音乐 (NetEase Cloud Music)
Never mind that NetEase playlists can't be browsed by album cover; you can't tag individual songs either. For playlists like mine, mostly English and Japanese, finding a song means listening through them one by one.
### 1. Collecting the data
Cover image URLs do have extensions, so they can be added directly. The official NetEase API has aggressive anti-crawling measures, though, so it's better to self-host an API instance (GitHub repo).
Code:
```python
import json
import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36',
    'X-Request': 'JSON',
    "cookie": "自己的cookies"  # your NetEase cookies
}
# Self-hosted API instance + playlist id
req = requests.get(url="https://自部署api/playlist/track/all?id=歌单链接", headers=headers).content.decode('utf-8')
response_dict = json.loads(req)
song = response_dict["songs"]
for i in song:
    with open("E:/office/py/爬虫/网易云/我喜欢的音乐.csv", "a", encoding="utf-8", newline="") as fo:
        # There's a bracket in here I couldn't figure out how to deal with;
        # this could probably just be written as i["ar"][0]["name"]
        ar = str(i['ar']).replace("[{'id': ", "").replace("'name':", "").replace("''':", "").replace(" 'tns': [], 'alias': []}]", "").replace("'", "").replace(" ", "")
        fo.write(str(i["name"]).replace(",", "-") + ","
                 + "https://music.163.com/#/song?id=" + str(i["id"]) + ","
                 + str(i['al']['name']).replace(",", "-") + ","
                 + "https://music.163.com/#/album?id=" + str(i['al']['id']) + ","
                 + str(i['al']['picUrl']) + ","
                 + ar + "\n")
```
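Incidentally, the string surgery on `i['ar']` can indeed be avoided: `ar` is a list of artist objects, so it can be indexed directly. A sketch with a made-up song dict shaped like the API's `songs` entries:

```python
# `song` mimics a subset of one element of the API's "songs" array.
song = {"ar": [{"id": 1, "name": "ArtistA"}, {"id": 2, "name": "ArtistB"}]}

# First artist only, as the comment in the crawler suggests:
first = song["ar"][0]["name"]
# Or every artist, joined:
everyone = ",".join(a["name"] for a in song["ar"])
print(first, everyone)  # ArtistA ArtistA,ArtistB
```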
That yields the content to be added:
### 2. Adding to Notion
One thing to note: the template has to be created first. If you want to add, say, an album field, you must first create a 专辑 property in the Notion database.
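Before inserting anything, you can sanity-check the schema: retrieving the database (`GET https://api.notion.com/v1/databases/{id}`, with the same `Authorization` and `Notion-Version` headers as the other calls here) returns its `properties` map, which can be compared against the fields you plan to write. A minimal sketch of the comparison step, with a made-up fragment of such a response:

```python
def missing_properties(schema, wanted):
    """Names from `wanted` that are absent from a database schema's properties."""
    props = schema.get("properties", {})
    return [name for name in wanted if name not in props]

# Made-up fragment of a "retrieve database" response:
sample = {"properties": {"Name": {}, "专辑": {}, "封面": {}}}
print(missing_properties(sample, ["Name", "专辑", "歌手", "封面"]))  # ['歌手']
```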
Code:
```python
import csv
import requests

class notionDemo():
    def add_bill(a, b, c, d, e, f, g):
        # a=song name, b=song URL, c=album name, d=album URL,
        # e=cover URL, f=artist id, g=artist name
        body = {
            "parent": {"database_id": "自己的database"},  # your database id
            "properties": {
                "Name": {"title": [{"type": "text", "text": {"content": b}}]},
                "专辑": {"type": "rich_text",
                         "rich_text": [{
                             "type": "text",
                             "text": {"content": c, "link": {"url": d}},
                         }]},
                "name": {"type": "rich_text",
                         "rich_text": [{
                             "type": "text",
                             "text": {"content": a, "link": {"url": b}},
                         }]},
                "歌手": {"type": "rich_text",
                         "rich_text": [{
                             "type": "text",
                             "text": {"content": g,
                                      "link": {"url": "https://music.163.com/#/artist?id=" + f}},
                         }]},
                "封面": {
                    "type": "files",
                    "files": [{
                        "name": e,
                        "type": "external",
                        "external": {"url": e},
                    }]
                },
            },
        }
        r = requests.request(
            "POST",
            "https://api.notion.com/v1/pages",
            json=body,
            headers={"Authorization": "Bearer " + "自己token",  # your integration token
                     "Notion-Version": "2021-05-13"},
        )
        print(r.text)

filepath = "E:/office/py/爬虫/网易云/我喜欢的音乐.csv"
with open(filepath, "r", encoding="utf-8", newline="") as f:
    csvreader1 = csv.reader(f)
    for row2 in csvreader1:
        a = row2[0]
        b = row2[1]
        c = row2[2]
        d = row2[3]
        e = row2[4]
        f = row2[5]
        g = row2[6]
        notionDemo.add_bill(a, b, c, d, e, f, g)
```
What follows is the familiar routine again.
The page after everything is added:
That about wraps it up, though it seems this could be hooked up to GitHub Actions so the pages update automatically. It isn't much use to me: for one, I listen on NetEase far less now, so there are no new songs to add; for another, Huaban image links still can't be added to pages directly. So I'll leave it here.
Many thanks to the following articles, whose code I ~~copied~~ learned from: