xin053

Shell Programming

2017-03-10T04:38:10.000Z

Hello World

#!/bin/bash
# this is a comment
echo 'Hello World!'
exit

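Save the file as hello.sh, then change its permissions:

$ chmod 755 hello.sh

Finally, run it:

$ ./hello.sh
Hello World!

exit is not required, but every command returns an exit status to its parent process: 0 means success, and a non-zero value is usually treated as an error code. Well-written scripts end with exit. When a script ends with exit and no argument, its exit status is that of the last command it executed.

echo $? shows the exit status of the previous command.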

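Assignment

Use = for assignment, with no spaces on either side of it. To read a variable's value, put $ in front of its name.

$ a=1 # with a = 1, the shell would instead run a command named a with the arguments = and 1
$ echo $a
1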

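Variables

hello="a b  c   d"
echo $hello      # a b c d          unquoted expansion
echo "$hello"    # a b  c   d       partial quoting
echo "${hello}"  # a b  c   d
echo '$hello'    # $hello           full quoting

As shown, an unquoted expansion undergoes word splitting, so runs of whitespace collapse into single spaces, while full quoting disables every special character. To simply print a variable's value, the "${}" form is recommended.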

Variable types in bash

a=2334        # integer
b=${a/23/BB}  # b is now the string BB34
c=${b/BB/23}  # c is back to the integer 2334

In other words, bash variables are untyped.

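Special variables

$ ./scriptname 1 2 3 4 5 6 7 8 9 10

1 2 3 4 5 6 7 8 9 10 are ten arguments passed on the command line. $0 is the script name, $1 is the first argument, ${10} is the tenth argument, $# is the number of positional parameters, and $* is all positional parameters (treated as a single word when written as "$*").

Each call to shift moves all positional parameters forward by one position and discards the original first parameter.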

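Internal variables

$BASH - path to the bash binary

$FUNCNAME - name of the current function

$GROUPS - groups the current user belongs to

$HOME - the user's home directory

$HOSTNAME - host name

$IFS - internal field separator; determines how bash recognizes field or word boundaries when interpreting strings

$LINENO - the line number of the line in the script where this variable appears

$OSTYPE - operating system type

$PPID - the pid of the process's parent

$PWD - current working directory

$SECONDS - number of seconds the script has been running

$SHLVL - shell nesting level

$UID - user ID

$$ - pid of the script's own process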

Getting variable names

${!prefix*}
${!prefix@}

Both expansions return the names of existing variables that begin with prefix.

Here Documents

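A here document is a form of redirection:

command << token
text
token

Here command is any command that accepts standard input, and token is a string that marks the end of the embedded text. The construct feeds text to command as its standard input.

Changing << to <<- makes the shell ignore leading tab characters in text, so the text can be indented for readability.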

cat <<- _EOF_
	hello
	world
	!!!!!
_EOF_

This is commonly used instead of echo to print multi-line output.

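Reading user input

Use read to get input from the user.

read a stores the user's input in the variable a. If no variable name is given, the default variable REPLY holds the input.

read supports the following options:

-a array - assign the input to the array array, starting at index 0

-n num - read num characters of input rather than a whole line

-p prompt - display a prompt for the input

-r - raw mode; backslashes are not interpreted as escape characters

-s - silent mode; the typed characters are not echoed to the screen

-t seconds - timeout; if no input arrives within seconds seconds, read returns a non-zero exit status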

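Giving variables default values

${parameter:-word}

If parameter is unset or empty, the expansion yields word; otherwise it yields the value of parameter.

${parameter:=word}

If parameter is unset or empty, the expansion yields word and also assigns word to parameter; otherwise it yields the value of parameter.

${parameter:?word}

If parameter is unset or empty, the expansion makes the script exit with an error and sends word to standard error; otherwise it yields the value of parameter.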

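Functions

Defining functions

A function can be defined in two ways:

function name(){
  commands
  return
}

or

name(){
  commands
  return
}

To call a function, write just its name, without parentheses. The function must be defined before it is called.

#!/bin/bash
function hello(){
  echo "Hello World!"
  return
}
hello   # function call

Local variables

Use the local keyword inside a function to define a local variable:

function funcname(){
  local test=1
  echo $test
  return
}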

if

x=5
if [ $x == 5 ]; then          # note the space after [, the space before ], and the spaces around ==
	echo "x equals 5"
else
	echo "x does not equal 5"
fi

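Tests

Every test works by checking a command's exit status: 0 means the command succeeded, so the condition is true; non-zero means false.

File expressions

-d file - file exists and is a directory

-e file - file exists

-f file - file exists and is a regular file

-s file - file exists and has a length greater than 0

-r file - file exists and is readable

-w file - file exists and is writable

-x file - file exists and is executable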

#!/bin/bash
FILE=~/.bashrc
if [ -f "$FILE" ]; then
	echo "$FILE is a file"
fi
exit

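String expressions

-n string - the length of string is greater than 0

-z string - the length of string is 0

string1 == string2 - string1 equals string2

string1 > string2 - string1 sorts after string2 (inside [ ], write \> so the > is not taken as a redirection)

Other tests

[[ expression ]]

Similar to test, and it additionally supports:

string =~ regex

True if string matches the regular expression regex.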

while

#!/bin/bash
count=1
while [ "${count}" -le 5 ]; do
	echo "${count}"
	count=$((count + 1))
done
echo "finished!"
exit

continue and break can be used inside loops.

Reading data in a loop

#!/bin/bash
while read para1 para2 para3; do
	...
done < test.txt

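#!/bin/bash
sort -k 1,1 -k 2n test.txt | while read para1 para2 para3; do
	...
done

read returns exit status 0 after each line it reads; at end of file it returns a non-zero status, which ends the while loop.

In the pipeline form, the loop runs in a subshell, so any variables created or assigned inside it are gone once the loop terminates.

until

Similar to while: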

#!/bin/bash
count=1
until [ "${count}" -gt 5 ]; do
	echo "${count}"
	count=$((count + 1))
done
echo "finished!"
exit

case

read -p "Enter selection [0-3]"
case $REPLY in
	0)	echo "Program terminated."
		exit
		;;
	1)	echo "Hostname: $HOSTNAME"
		uptime
		;;
	2)	df -h
		;;
	3)	echo "Hello"
		;;
	*)	echo "Invalid entry" >&2
		exit 1
		;;
esac

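Patterns

a) - matches if the word is a

a|A) - matches if the word is a or A

[[:alpha:]]) - matches if the word is a single alphabetic character

???) - matches if the word is exactly three characters long

*.txt) - matches if the word ends with .txt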

for

for i in A B C D; do
	echo "$i"
done

for i in {A..D}; do
	echo "$i"
done

for i in cloud*.txt; do
	echo "$i"
done

C-style syntax is also available:

for (( expression1; expression2; expression3 )); do
	commands
done

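String operations

${#parameter}

Expands to the length of the string contained in parameter.

${parameter:offset}         # extract from offset to the end of the string
${parameter:offset:length}  # extract length characters starting at offset

Removing substrings

${parameter#pattern}   # remove the shortest match of pattern from the beginning of parameter
${parameter##pattern}  # remove the longest match of pattern from the beginning of parameter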

$ foo=file.txt.zip
$ echo ${foo#*.}
txt.zip
$ echo ${foo##*.}
zip

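${parameter%pattern}
${parameter%%pattern}

These work like # and ##, except that they match from the end of the string.

String replacement

${parameter/pattern/string}  # replace the first match of pattern with string
${parameter//pattern/string} # replace every match
${parameter/#pattern/string} # replace only if the match is at the beginning of the string
${parameter/%pattern/string} # replace only if the match is at the end of the string

The value of parameter itself is unchanged.

Changing case

${parameter,,}   # expand the value of parameter in all lowercase
${parameter,}    # lowercase only the first character
${parameter^^}   # expand the value of parameter in all uppercase
${parameter^}    # uppercase only the first character

The value of parameter itself is unchanged.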

Arrays

$ declare -a array  # declare array as an array
$ array[0]=0
$ array[1]=1
$ echo ${array[0]}
0
$ echo ${array[1]}
1

Assigning multiple values

$ test=(a b c d)
$ echo ${test[0]}
a

Printing the whole array

$ animals=("a dog" "a cat" "a fish")
$ for i in "${animals[*]}"; do echo $i; done
a dog a cat a fish
$ for i in "${animals[@]}"; do echo $i; done
a dog
a cat
a fish

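The subscripts * and @ give access to every element of the array. Inside double quotes, "${animals[*]}" yields a single word while "${animals[@]}" yields one word per element, as the output above shows.

Associative arrays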

$ declare -A colors
$ colors["red"]="#ff0000"
$ colors["green"]="#00ff00"
$ colors["blue"]="#0000ff"
$ echo ${colors["blue"]}
#0000ff

Finding the subscripts in use

bash allows gaps in array subscripts, so it is sometimes useful to determine which elements actually exist:

${!array[*]}
${!array[@]}

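Group commands and subshells

Group command

{ command1; command2; [command3; ...] } # note the spaces next to the braces

Subshell

(command1; command2; [command3; ...])

Both group commands and subshells are used to manage redirection:

{ ls -l; echo "test"; cat foo.txt; } > output.txt

This combines the output of the three commands and redirects it to output.txt. The last command in a group must be followed by a semicolon or a newline before the closing brace.

A group command runs all of its commands in the current shell, whereas a subshell runs them in a child shell; changes made there to environment variables and the like disappear when the subshell exits. In most cases a group command is what we want.

$ echo "foo" | read
$ echo $REPLY

REPLY is always empty here, because commands in a pipeline are always run in subshells. bash provides process substitution to work around this.

Process substitution

<(list) - for processes that produce standard output

>(list) - for processes that take standard input

read < <(echo "foo")
echo $REPLY

Process substitution lets us treat the output of a subshell as an ordinary file for redirection purposes; it is in fact a form of expansion.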

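Learning Linux Commands

2017-03-08T05:26:10.000Z

Learning Linux Commands

Common commands

Show disk capacity

$ df -h

Show memory information

$ free -h

Determine a file's type

file FILENAME

Both less and more page through a file, but less can page both backward and forward while more can only page forward.

Open the file manager as administrator

$ sudo nautilus

Show how a command name is interpreted

type COMMAND

Get a brief description of a command

whatis COMMAND

Both help and man show command documentation, but help covers shell builtins.

Print the first N lines of a file

head -n N FILENAME

Print the last N lines of a file

tail -n N FILENAME

Clear the screen (same as ctrl+l)

clear

Show the history list

history

Show the running status of all services

$ service --status-all

Show the running status of a single service, for example ssh

$ service ssh status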

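Special symbols

; command separator; lets several commands be written on one line

"" partial quoting; suppresses some special characters

'' full quoting; suppresses all special characters

` backquote; command substitution

? test operator; in parameter substitution it can test whether a variable is set

$? exit status variable

$$ process ID variable; holds the process ID of the running script

File operations

cp - copy files and directories

mv - move/rename files and directories

mkdir - create directories

rm - remove files and directories

ln - create hard links and symbolic links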

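Commands

A command can be one of four things:

An executable program, like the files we have seen in /usr/bin. Programs in this category may be compiled binaries, such as programs written in C and C++, or programs written in a scripting language such as shell, perl, python, ruby, and so on.
A command built into the shell itself. bash supports a number of commands internally, called shell builtins. The cd command, for example, is a shell builtin.
A shell function. These are miniature shell scripts incorporated into the environment. Later sections cover configuring the environment and writing shell functions; for now, just be aware that they exist.
An alias. We can define our own commands, built from other commands.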

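Redirection

> truncates the file and then writes the output to it, while >> appends to the end of the file.

Standard input, standard output, and standard error are redirected independently; the shell refers to them internally as file descriptors 0, 1, and 2.

$ ls -l /bin/usr 2>> ls-error.txt

The command above appends the error stream to ls-error.txt.

To redirect both standard output and standard error to the same file:

$ ls -l /bin/usr > ls-output.txt 2>&1

This first redirects standard output to the file, then redirects standard error to standard output.

The order of the redirections matters: the redirection of standard error must always come after the redirection of standard output, or it does not work.

Recent versions of bash also support a more compact way to redirect both standard output and standard error to the same file:

$ ls -l /bin/usr &> ls-output.txt

Sometimes we do not want a command's output at all and just want to throw it away. For that there is a special device, /dev/null (something like a trash can):

$ ls -l /bin/usr 2> /dev/null

The command above discards the standard error stream.

$ cat /dev/null > filename

This empties the file, creating it if it does not exist. It is equivalent to:

$ : > filename

: is the null command.

The pipe operator | redirects the standard output of one command to the standard input of another.

For example:

$ ll | less

makes it much easier to browse all the files in the current directory.

The tee command reads from standard input and writes to both standard output and a file.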

Brace expansion

$ echo {1..5}
1 2 3 4 5
$ echo {z..a}
z y x w v u t s r q p o n m l k j i h g f e d c b a

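Command substitution

Command substitution lets us use the output of a command as an expansion:

$ ll $(which cp)
-rwxr-xr-x 1 root root 151024 Feb 18 2016 /bin/cp*

Backquotes can be used instead of the dollar sign and parentheses:

$ ll `which cp`
-rwxr-xr-x 1 root root 151024 Feb 18 2016 /bin/cp*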

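Special permissions

setuid

When applied to an executable file, it sets the effective user ID from that of the real user (the user actually running the program) to that of the program's owner.

setgid

Like the setuid bit, it changes the effective group ID from the real user's group ID to that of the file's group owner.

sticky

Linux ignores the sticky bit on files, but when it is set on a directory it prevents users from deleting or renaming files in it unless the user is the owner of the directory, the owner of the file, or the superuser.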

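Processes

ps shows the processes associated with the current terminal (TTY, the controlling terminal of a process); ps x shows all of our processes regardless of which terminal controls them; ps aux additionally shows each process's owner, CPU and memory usage, and so on.

Process states

R - running
S - sleeping
D - uninterruptible sleep; the process is waiting for I/O
T - stopped
Z - zombie process
< - high-priority process
N - low-priority process

ps is only a snapshot of the processes, whereas top displays continuously updated information about them (by default it refreshes every 3 seconds). pstree prints the process list as a tree.

Process control

Append & to a command to run it in the background immediately:

$ xlogo &
[1] 28236

jobs lists the jobs running in the background of the current terminal and their status.

A background process is immune to all keyboard input, and it cannot be interrupted with ctrl+c.

Use fg to bring a process back to the foreground:

$ xlogo &
[1] 55692
$ fg %1  # %1 here is called a jobspec

Sometimes we want to stop (pause) a process rather than terminate it. Pressing ctrl+z stops the foreground process and leaves it waiting in the background. A stopped process can be resumed in the foreground with fg or moved to run in the background with bg.

kill PID or kill jobspec terminates a process.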

vim

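Common commands:

yy - copy (yank) the current line
5yy - copy the current line and the next four lines
y0 - copy from the cursor position to the beginning of the current line
y$ - copy from the cursor position to the end of the current line
p - paste
d - delete/cut text

Text processing

cat -A FILENAME shows the special characters in a file.

cat -n FILENAME prints the file's contents with line numbers.

sort sorts the contents of standard input, or of one or more files named on the command line, and sends the result to standard output.

cut extracts sections of text from lines and writes them to standard output.

paste does the opposite of cut: rather than extracting columns of text from a file, it adds one or more columns to it. It reads multiple files and combines the fields from each into a single stream on standard output.

sed edits a stream of text and is most often used for substitutions.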

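What's New in Python 3.6

2016-12-23T11:15:12.000Z

Python3.6

At around 6:30 pm Beijing time on December 23, 2016, python.org published the final release of Python 3.6.0. After installing it, you can see that the Windows build was compiled at 8:06 am on December 23, 2016. Python 3.6 spent a long time in testing before this release, and the core team has said that bug-fix releases (the 3.6.x series) will start in early 2017. For what changed between 3.5 and 3.6, see What's New In Python 3.6.
This post covers how to move a working environment from Python 3.5 to Python 3.6 and introduces the new features of Python 3.6.

Working environment

Each Python version, such as 3.5 and 3.6, installs into its own directory (on Windows). If third-party libraries are installed under the Python installation directory, then switching to 3.6 means adding the 3.6 installation directory to the PATH environment variable and reinstalling a large number of third-party libraries under it. That leaves multiple copies of those libraries on the machine. I could delete everything related to 3.5, but reinstalling the commonly used libraries is tedious. So I use a Python virtual environment as my working environment: I created one in F:\pythonVE and install all third-party libraries into it. Having just installed Python 3.6, all I need to run in cmd is:

python -m venv --upgrade F:\pythonVE

Note that python here is the python.exe from 3.6. The --upgrade option upgrades the Python in the virtual environment to this Python version (3.6).

So only the virtual environment's path needs to be on PATH. After that it is a matter of gradually updating the third-party packages, since they need time to support 3.6, though that will no doubt happen quickly. jupyter's ipython-qtconsole.exe does not work right now because pyqt does not support 3.6 yet (3.6 only came out today, after all), but I expect it will work within a few days. Python 3 is clearly the direction things are going, so please do not tell me your main working environment is Python 2 (although Python 2.7.13 was released on December 17).

Some packages still have to be updated by hand. For example, lxml cannot be compiled on Windows, so it is usually installed from a prebuilt download. The lxml I had downloaded was built for Python 3.5, so I now have to uninstall it and manually download and install a build for 3.6. Some packages report encoding problems when installed with pip; the simple fix is to download them from Unofficial Windows Binaries for Python Extension Packages and install them directly.

This is just my own setup. I currently use Python only as a tool, so I do not worry about version compatibility the way a library developer would. In general, though, the recommendation is still to keep commonly used packages in the Python installation directory and to build a virtual environment per project, installing packages that match the project's Python version.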

What’s New In Python 3.6

Main changes:

PEP 468 - Preserving the order of **kwargs in a function
PEP 487 - Simpler customization of class creation
PEP 495 - Local Time Disambiguation
PEP 498 - Literal String Formatting
PEP 506 - Adding A Secrets Module To The Standard Library
PEP 509 - Add a private version to dict
PEP 515 - Underscores in Numeric Literals
PEP 519 - Adding a file system path protocol
PEP 520 - Preserving Class Attribute Definition Order
PEP 523 - Adding a frame evaluation API to CPython
PEP 524 - Make os.urandom() blocking on Linux (during system startup)
PEP 525 - Asynchronous Generators (provisional)
PEP 526 - Syntax for Variable Annotations (provisional)
PEP 528 - Change Windows console encoding to UTF-8
PEP 529 - Change Windows filesystem encoding to UTF-8
PEP 530 - Asynchronous Comprehensions

PEP 498: Formatted string literals

>>> import decimal
>>> name = "Fred"
>>> f"He said his name is {name}."
'He said his name is Fred.'
>>> width = 10
>>> precision = 4
>>> value = decimal.Decimal("12.34567")
>>> f"result: {value:{width}.{precision}}"  # nested fields
'result:      12.35'

Prefixing a string with f marks it as a formatted string literal, much like calling str.format() on it. It really is convenient.

PEP 526: Syntax for variable annotations

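Adds a syntax for annotating the types of variables, including class variables and instance variables (function parameters could already be annotated).

from typing import Dict, List
primes: List[int] = []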
captain: str  # Note: no initial value!
class Starship:
    stats: Dict[str, int] = {}

>>> class Starship:
...     stats: str
...
>>> Starship.__annotations__
{'stats': <class 'str'>}

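Python remains a dynamic language, of course, so these annotations are simply stored in the __annotations__ attribute of the class or module and are not checked at runtime; they serve only as hints. The feature is still genuinely useful. For the type annotation syntax itself, see PEP 484.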
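To make the "hints only" point concrete, here is a minimal sketch. The class and attribute names are reused from the example above purely for illustration; it shows that an annotation without a value is recorded in __annotations__ but creates no attribute, and that nothing stops a mismatched assignment at runtime.

```python
from typing import Dict, List


class Starship:
    # Annotated with a default value: stored as a normal class attribute.
    stats: Dict[str, int] = {}
    # Annotated without a value: recorded in __annotations__ only.
    captain: str


# Module-level annotation; the interpreter does not enforce it.
primes: List[int] = []
primes = "not a list"  # no error is raised at runtime

annotated_names = sorted(Starship.__annotations__)
has_captain_attr = hasattr(Starship, "captain")
```

A static checker such as mypy would flag the second assignment to primes; the interpreter itself does not.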

PEP 515: Underscores in Numeric Literals

Underscores can be placed between digits to improve readability.

>>> 1_000_000_000_000_000
1000000000000000
>>> type(1_000_000_000_000_000)
<class 'int'>
>>> 0x_FF_FF_FF_FF
4294967295

String formatting also supports underscores as a grouping option:

>>> '{:_}'.format(1000000)
'1_000_000'
>>> '{:_x}'.format(0xFFFFFFFF)
'ffff_ffff'
>>> '{:_X}'.format(0xFFfFFFFF)
'FFFF_FFFF'

The binary (b) and octal (o) presentation types work as well.
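A small sketch of the same grouping option with other presentation types; the variable names are only for illustration. For b and o (like x and X) the underscore is inserted every four digits, while for decimal it is every three.

```python
# Binary and octal output are grouped four digits at a time.
binary = '{:_b}'.format(0b1111_0000_1010)
octal = '{:_o}'.format(0o7777_0000)

# Decimal output is grouped three digits at a time.
decimal_grouped = format(1234567, '_d')
```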

PEP 525: Asynchronous Generators

Asynchronous generators: in Python 3.6, await and yield can be used in the same function body.

import asyncio
class Ticker:
    """Yield numbers from 0 to `to` every `delay` seconds."""
    def __init__(self, delay, to):
        self.delay = delay
        self.i = 0
        self.to = to
    def __aiter__(self):
        return self
    async def __anext__(self):
        i = self.i
        if i >= self.to:
            raise StopAsyncIteration
        self.i += 1
        if i:
            await asyncio.sleep(self.delay)
        return i

The code above can now be shortened to:

async def ticker(delay, to):
    """Yield numbers from 0 to `to` every `delay` seconds."""
    for i in range(to):
        yield i
        await asyncio.sleep(delay)
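For completeness, a runnable sketch of consuming such an asynchronous generator with async for. The collect() helper and the small delay are my own additions, and asyncio.run() assumes Python 3.7 or later (on 3.6 you would drive the coroutine with an event loop's run_until_complete() instead).

```python
import asyncio


async def ticker(delay, to):
    """Yield numbers from 0 up to (but not including) `to`, pausing `delay` seconds between them."""
    for i in range(to):
        yield i
        await asyncio.sleep(delay)


async def collect():
    # An async generator is consumed with `async for`, not a plain for loop.
    seen = []
    async for value in ticker(0.01, 3):
        seen.append(value)
    return seen


ticks = asyncio.run(collect())
```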

PEP 530: Asynchronous Comprehensions

async for and await can now be used in list, set, and dict comprehensions and in generator expressions.

result = []
async for i in aiter():
    if i % 2:
        result.append(i)

can be shortened to:

result = [i async for i in aiter() if i % 2]

An example using await:

result = [await fun() for fun in funcs if await condition()]
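The snippets above leave aiter(), funcs, and condition() undefined, so here is a self-contained sketch with stand-ins of my own (numbers() and double() are hypothetical helpers). As before, asyncio.run() assumes Python 3.7 or later.

```python
import asyncio


async def numbers(n):
    """Hypothetical async source standing in for aiter() above."""
    for i in range(n):
        await asyncio.sleep(0)
        yield i


async def double(x):
    """Hypothetical coroutine standing in for fun() above."""
    await asyncio.sleep(0)
    return x * 2


async def main():
    # `async for` inside a list comprehension.
    odds = [i async for i in numbers(6) if i % 2]
    # `await` inside a list comprehension.
    doubled = [await double(i) for i in odds]
    return odds, doubled


odds, doubled = asyncio.run(main())
```

Both forms are only valid inside an async def function, which is why everything is wrapped in main().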

PEP 487: Simpler customization of class creation

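Subclass creation can now be customized without using a metaclass.

When a subclass is created, the __init_subclass__() class method of the base class is called.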

class PluginBase:
    subclasses = []
    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        cls.subclasses.append(cls)
class Plugin1(PluginBase):
    pass
class Plugin2(PluginBase):
    pass
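A quick check of what the hook does, as a self-contained sketch that repeats the class definitions (the registered variable is just for illustration): every subclass ends up in the shared list on the base class, in definition order, with no metaclass involved.

```python
class PluginBase:
    subclasses = []

    def __init_subclass__(cls, **kwargs):
        # Runs once for every class that inherits from PluginBase.
        super().__init_subclass__(**kwargs)
        cls.subclasses.append(cls)


class Plugin1(PluginBase):
    pass


class Plugin2(PluginBase):
    pass


registered = [c.__name__ for c in PluginBase.subclasses]
```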

PEP 487: Descriptor Protocol Enhancements

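Descriptors gain a new __set_name__() method. It is called when the owning class is created, once for each descriptor defined in the class body, and receives the owner class and the attribute name.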

class IntField:
    def __get__(self, instance, owner):
        return instance.__dict__[self.name]
    def __set__(self, instance, value):
        if not isinstance(value, int):
            raise ValueError(f'expecting integer in {self.name}')
        instance.__dict__[self.name] = value
    # this is the new initializer:
    def __set_name__(self, owner, name):
        self.name = name
class Model:
    int_field = IntField() # __set_name__() is called with name='int_field' when Model is created
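Here is a self-contained sketch of the descriptor in use. The instance m, the error_message variable, and the `instance is None` guard in __get__ are my own additions for illustration; the rest mirrors the class above.

```python
class IntField:
    def __set_name__(self, owner, name):
        # Called once while the owning class is being created.
        self.name = name

    def __get__(self, instance, owner):
        if instance is None:
            # Accessed on the class itself: return the descriptor.
            return self
        return instance.__dict__[self.name]

    def __set__(self, instance, value):
        if not isinstance(value, int):
            raise ValueError(f'expecting integer in {self.name}')
        instance.__dict__[self.name] = value


class Model:
    int_field = IntField()


m = Model()
m.int_field = 42

try:
    m.int_field = "oops"
except ValueError as exc:
    error_message = str(exc)
```

The descriptor never had to be told its own attribute name; it picked up 'int_field' from __set_name__.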

PEP 519: Adding a file system path protocol

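Most people think of a path as a string or bytes object, which is one reason the standard library's pathlib has seen relatively little use. Python 3.6 adds the os.PathLike interface: any object that implements the __fspath__() method represents a path, and os.fspath(), os.fsdecode(), or os.fsencode() can be used to obtain its string or bytes representation.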

>>> import pathlib
>>> with open(pathlib.Path("README")) as f:
...     contents = f.read()
...
>>> import os.path
>>> os.path.splitext(pathlib.Path("some_file.txt"))
('some_file', '.txt')
>>> os.path.join("/a/b", pathlib.Path("c"))
'/a/b/c'
>>> import os
>>> os.fspath(pathlib.Path("some_file.txt"))
'some_file.txt'
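The protocol is not limited to pathlib. As a minimal sketch, the ProjectFile class below is a made-up example (not part of any library) that becomes path-like simply by implementing __fspath__():

```python
import os


class ProjectFile:
    """Hypothetical path-like object, used only to illustrate the protocol."""

    def __init__(self, root, name):
        self.root = root
        self.name = name

    def __fspath__(self):
        # Must return str or bytes.
        return os.path.join(self.root, self.name)


p = ProjectFile("data", "report.txt")
as_text = os.fspath(p)            # str form of the path
as_bytes = os.fsencode(p)         # bytes form, using the filesystem encoding
stem, ext = os.path.splitext(p)   # os.path functions accept it directly
```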

PEP 529: Change Windows filesystem encoding to UTF-8

Python 3.6 lets us use paths given as bytes objects correctly on Windows without data loss. Such bytes are encoded with sys.getfilesystemencoding(), which is now UTF-8.

PEP 528: Change Windows console encoding to UTF-8

The default console on Windows will now accept all Unicode characters and provide correctly read str objects to Python code. sys.stdin, sys.stdout and sys.stderr now default to utf-8 encoding.

Honestly, this is great news: no more worrying about garbled console output.

PEP 520: Preserving Class Attribute Definition Order

The order in which attributes are defined in a class body is preserved in the class's __dict__.
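A tiny sketch of what that means in practice; the Config class is invented for the example, and dunder entries are filtered out because the interpreter adds some of its own.

```python
class Config:
    host = "localhost"
    port = 8080
    debug = False


# Iterating over the class __dict__ yields attributes in definition order.
defined = [name for name in Config.__dict__ if not name.startswith("__")]
```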

PEP 468: Preserving Keyword Argument Order

**kwargs in a function signature is now guaranteed to be an insertion-order-preserving mapping.
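As a short illustration (the record() function is hypothetical), the keys of **kwargs come out in the order the caller wrote them:

```python
def record(**kwargs):
    # kwargs preserves the order in which keyword arguments were passed.
    return list(kwargs)


order = record(z=1, a=2, m=3)
```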

New dict implementation

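The new dict implementation uses 20% to 25% less memory than the old one and also preserves insertion order, so dict is now effectively ordered. So what is OrderedDict for? At the time, the documentation described the ordering as an implementation detail that should not be relied upon; since Python 3.7, insertion order is guaranteed by the language.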

>>> b = {'one': 1, 'two': 2, 'three': 3}
>>> b
{'one': 1, 'two': 2, 'three': 3}
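Note that the order is insertion order, not sorted order. A small sketch (variable names are arbitrary): deleting a key and adding it back moves it to the end.

```python
d = {}
d['one'] = 1
d['two'] = 2
d['three'] = 3

# Removing and re-inserting a key puts it last.
del d['two']
d['two'] = 22

keys = list(d)
```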

Other changes

The secrets module was added.
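A brief sketch of the kind of thing secrets is meant for, namely generating tokens that are safe for security purposes, unlike the random module. The variable names are mine.

```python
import secrets

# 16 random bytes rendered as 32 hexadecimal characters.
token = secrets.token_hex(16)

# URL-safe text token built from 16 random bytes.
url_token = secrets.token_urlsafe(16)

# Constant-time comparison, to avoid leaking information through timing.
same = secrets.compare_digest(token, token)
```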

The re module was improved with support for modifier spans in regular expressions. Examples: '(?i:p)ython' matches 'python' and 'Python', but not 'PYTHON'; '(?i)g(?-i:v)r' matches 'GvR' and 'gvr', but not 'GVR'.
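The two examples can be checked directly. This sketch uses fullmatch() so that the whole string has to match; the result lists are just a compact way to show all three cases for each pattern.

```python
import re

# Case-insensitivity applies only to the "p".
p1 = re.compile(r'(?i:p)ython')
r1 = [bool(p1.fullmatch(s)) for s in ('python', 'Python', 'PYTHON')]

# Case-insensitive overall, except for the "v".
p2 = re.compile(r'(?i)g(?-i:v)r')
r2 = [bool(p2.fullmatch(s)) for s in ('GvR', 'gvr', 'GVR')]
```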

For more details, see What's New In Python 3.6 on the official site.

References

What’s New In Python 3.6

Using the cryptography Encryption Library in Detail

2016-12-20T12:59:43.000Z

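About cryptography

The cryptography module has two main layers. One is the high-level cryptographic recipes, where we only need to know how to use the API and not the details of the encryption itself; this is what we use most of the time. The other is the low-level cryptographic primitives; without a solid understanding of cryptography, building your own scheme out of these primitives is dangerous. This post covers the high-level symmetric encryption API and the low-level asymmetric public/private keys and certificates.

Using cryptography

Fernet (symmetric encryption)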

from cryptography.fernet import Fernet
key = Fernet.generate_key()
key  # A URL-safe base64-encoded 32-byte key
# b'7A7idpk7MjmvTWqZf4_vWwvXwAJmmi4SFRnomqKTrB8='
f = Fernet(key)
token = f.encrypt(b"my deep dark secret")
token
# b'gAAAAABYWUWYZywJx9l3UrSUMGa5OS3dlz15NpUuOu-Wk6UNsLnQmtDx2hGdRRhwe62EhzT7OuvLafjzwjf7fASFRLMBQPhq3fa2U_WsFcEUzCFR0ZcxJC8='
f.decrypt(token)
# b'my deep dark secret'

Using passwords with Fernet

>>> import base64
>>> import os
>>> from cryptography.fernet import Fernet
>>> from cryptography.hazmat.backends import default_backend
>>> from cryptography.hazmat.primitives import hashes
>>> from cryptography.hazmat.primitives.kdf.pbkdf2 import PBKDF2HMAC
>>> password = b"password"
>>> salt = os.urandom(16)
>>> kdf = PBKDF2HMAC(
...     algorithm=hashes.SHA256(),
...     length=32,
...     salt=salt,
...     iterations=100000,
...     backend=default_backend()
... )
>>> key = base64.urlsafe_b64encode(kdf.derive(password))
>>> f = Fernet(key)
>>> token = f.encrypt(b"Secret message!")
>>> token
b'...'
>>> f.decrypt(token)
b'Secret message!'

To derive the same key from the password later, and so decrypt the token, the salt must be stored.
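Why the salt matters can be shown without the cryptography package at all. The sketch below swaps in the standard library's hashlib.pbkdf2_hmac for PBKDF2HMAC (same algorithm, different API); derive_key() is a helper I made up for the illustration, and the iteration count simply mirrors the example above.

```python
import base64
import hashlib
import os


def derive_key(password, salt, iterations=100_000):
    """Derive a 32-byte key with PBKDF2-HMAC-SHA256 and encode it the way Fernet expects."""
    raw = hashlib.pbkdf2_hmac("sha256", password, salt, iterations, dklen=32)
    return base64.urlsafe_b64encode(raw)


salt = os.urandom(16)

# Same password and same salt: the key can be reproduced later.
key_first = derive_key(b"password", salt)
key_again = derive_key(b"password", salt)

# Same password but a fresh salt: a different key, so old tokens cannot be decrypted.
key_other_salt = derive_key(b"password", os.urandom(16))
```

The salt is not secret, so it can be stored next to the token; only the password needs protecting.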

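X.509 (the digital certificate standard)

A digital certificate is an electronic document, signed by a CA, that contains a server's public key along with other information about the site. It shows that the server (site) is the genuine (official) one and not a forgery.

This relies on asymmetric cryptography, that is, a public and private key pair (RSA): the private key signs and the public key verifies.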

Creating a Certificate Signing Request (CSR)

When obtaining a certificate from a certificate authority (CA), the usual flow is:

You generate a private/public key pair.
You create a request for a certificate, which is signed by your key (to prove that you own that key).
You give your CSR to a CA (but not the private key).
The CA validates that you own the resource (e.g. domain) you want a certificate for.
The CA gives you a certificate, signed by them, which identifies your public key, and the resource you are authenticated for.
You configure your server to use that certificate, combined with your private key, to serve traffic.

So the first step is to generate a key pair:

from cryptography.hazmat.backends import default_backend
from cryptography.hazmat.primitives import serialization
from cryptography.hazmat.primitives.asymmetric import rsa
key = rsa.generate_private_key(
    public_exponent=65537,
    key_size=2048,
    backend=default_backend()
)

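For generating the certificate signing request itself, see the official documentation. The CSR can then be sent to the CA; once the CA has processed it, it returns a certificate signed by them, which is also the certificate users rely on to verify our site.

Common RSA operations

Generation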

from cryptography.hazmat.backends import default_backend
from cryptography.hazmat.primitives.asymmetric import rsa
private_key = rsa.generate_private_key(
    public_exponent=65537,
    key_size=2048,
    backend=default_backend()
)

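This produces an RSAPrivateKey object. The parameters above can be used as they are; see the official documentation for what each one means.

Private and public keys are generated as a pair, so the RSAPublicKey object can be obtained from the RSAPrivateKey object returned by generate_private_key:

public_key = private_key.public_key()

The reverse is of course impossible: an RSAPrivateKey cannot be obtained from an RSAPublicKey.

Loading from a PEM file

An RSAPrivateKey object can also be loaded from a PEM-format file.

Regarding the PEM format, the official documentation notes:

A PEM block which starts with -----BEGIN CERTIFICATE----- is not a public or private key, it's an X.509 Certificate. You can load it using load_pem_x509_certificate() and extract the public key with Certificate.public_key

The file may also be encrypted. The following loads an RSAPrivateKey object from a PEM file: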

from cryptography.hazmat.backends import default_backend
from cryptography.hazmat.primitives import serialization
with open("path/to/key.pem", "rb") as key_file:
    private_key = serialization.load_pem_private_key(
        key_file.read(),
        password=None,
        backend=default_backend()
    )

Private or public keys can likewise be loaded from cer files and SSH-format files.

Serialization

Both RSAPrivateKey and RSAPublicKey objects can be serialized to PEM:

from cryptography.hazmat.primitives import serialization
pem = private_key.private_bytes(
   encoding=serialization.Encoding.PEM,
   format=serialization.PrivateFormat.PKCS8,
   encryption_algorithm=serialization.BestAvailableEncryption(b'mypassword')
)
pem.splitlines()
# [b'-----BEGIN ENCRYPTED PRIVATE KEY-----',
#  b'MIIFHzBJBgkqhkiG9w0BBQ0wPDAbBgkqhkiG9w0BBQwwDgQI4LyuGo+hDoACAggA',
#  b'MB0GCWCGSAFlAwQBKgQQGuA8UxHCt7qLEF29noqffQSCBNBH0rZH59FTTWaPWEV/',
#  ......
#  b'Y6Dt0ACOPHcd8Z2Y9MTJ0QFY8A==',
#  b'-----END ENCRYPTED PRIVATE KEY-----']

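It is strongly recommended to encrypt the private key with your own password when serializing it, so that the key is never fully exposed.

We call this serialization rather than saving the private key because the PEM data contains more than the key itself: it also carries important information about the key. See the relevant documentation for the PEM format. In practice there is no need to parse PEM by hand; the library's API does it.

Serializing without encryption is also possible, by changing this argument:

encryption_algorithm=serialization.NoEncryption()

The public key is serialized like this: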

from cryptography.hazmat.primitives import serialization
public_key = private_key.public_key()
pem = public_key.public_bytes(
   encoding=serialization.Encoding.PEM,
   format=serialization.PublicFormat.SubjectPublicKeyInfo
)
pem.splitlines()
# [b'-----BEGIN PUBLIC KEY-----',
#  b'MIIBIjANBgkqhkiG9w0BAQEFAAOCAQ8AMIIBCgKCAQEAtboyGrCz1JIVru4+eoKG',
#  b'n/adEsavPDb2FQ6/UkIum392ni/Q9H27chliPXEZWZmEorbJvWeHupuL0ld3IWXi',
#  ......
#  b'LwIDAQAB',
#  b'-----END PUBLIC KEY-----']

Signing

The private key can sign a message, and anyone can then verify it with the public key.

from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import padding
signer = private_key.signer(
    padding.PSS(
        mgf=padding.MGF1(hashes.SHA256()),
        salt_length=padding.PSS.MAX_LENGTH
    ),
    hashes.SHA256()
)
message = b"A message I want to sign"
signer.update(message)
signature = signer.finalize()
signature
# b'\x19\x87!5\xc0\xe3s\x01M\xa5-\xf3......\xce\xf5\x03=F\xb3\xd5\xd1\xf9\xc2\xf2\xbak'

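padding here refers to the signature padding scheme (PSS): the message digest is padded out to the size of the RSA modulus before the private-key operation is applied. See the PSS specification (RFC 8017) for why this is required.

A simpler one-call form is also available (newer releases of cryptography deprecate the signer()/verifier() style above in favor of it):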

message = b"A message I want to sign"
signature = private_key.sign(
    message,
    padding.PSS(
        mgf=padding.MGF1(hashes.SHA256()),
        salt_length=padding.PSS.MAX_LENGTH
    ),
    hashes.SHA256()
)

Verification

public_key = private_key.public_key()
verifier = public_key.verifier(
    signature,
    padding.PSS(
        mgf=padding.MGF1(hashes.SHA256()),
        salt_length=padding.PSS.MAX_LENGTH
    ),
    hashes.SHA256()
)
verifier.update(message)
verifier.verify()

If verification fails, an exception is raised. There is likewise a simpler way to verify:

public_key.verify(
    signature,
    message,
    padding.PSS(
        mgf=padding.MGF1(hashes.SHA256()),
        salt_length=padding.PSS.MAX_LENGTH
    ),
    hashes.SHA256()
)

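Encryption

Encrypting a message with the private key is pointless, because the whole world has your public key; it is public, after all. (And if you do not publish your public key, that defeats the purpose even more.) So encryption means encrypting with the public key, and decrypting with the private key.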

message = b"encrypted data"
ciphertext = public_key.encrypt(
    message,
    padding.OAEP(
        mgf=padding.MGF1(algorithm=hashes.SHA1()),
        algorithm=hashes.SHA1(),
        label=None
    )
)
ciphertext
# b'J\x95\xadC\xa9......\x18\xbb\\\xa3\xb3\x13f_N\x89\x07`\xa1'

Decryption

plaintext = private_key.decrypt(
    ciphertext,
    padding.OAEP(
        mgf=padding.MGF1(algorithm=hashes.SHA1()),
        algorithm=hashes.SHA1(),
        label=None
    )
)
plaintext
# b'encrypted data'

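As you can see, most operations on public and private keys work fine with a fixed set of parameters, so they can be wrapped up further, which is how this project came about.

Sending Email with the yagmail Library in Detail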

2016-12-17T08:26:07.000Z

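About yagmail

Handling email with the Python standard library is fairly involved, which is why yagmail exists. yagmail currently only sends mail over SMTP; it cannot read mail and does not support other mail protocols, but that is enough for ordinary use.

Using yagmail

The first step is to create a client with yagmail.SMTP(). To avoid exposing the password in the script file, yagmail uses the keyring module to store it in the system keyring service.

For what a keyring is, see: What does a Keyring do?

In the official documentation,

yagmail.register('mygmailusername', 'mygmailpassword')

is really a wrapper around keyring.set_password('yagmail', 'mygmailusername', 'mygmailpassword').

SMTP() reads the .yagmail file in the user's home directory, but the call above does not create that file, so you have to create it yourself and write your email address into it.

For example, during my testing the .yagmail file contained:

810620174@qq.com

I had already stored that mailbox's password in the system keyring with register(), so the SMTP client can now be initialized.

Also note that, in my testing, 163 mail readily flags messages as spam, which makes sending fail, and QQ mail requires mail protection to be turned off. I did not test other providers. I recommend QQ mail here.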

SMTP server addresses and ports for common mail providers

sina.com:
POP3 server: pop3.sina.com.cn (port 110)
SMTP server: smtp.sina.com.cn (port 25)
sinaVIP:
POP3 server: pop3.vip.sina.com (port 110)
SMTP server: smtp.vip.sina.com (port 25)
sohu.com:
POP3 server: pop3.sohu.com (port 110)
SMTP server: smtp.sohu.com (port 25)
126 mail:
POP3 server: pop.126.com (port 110)
SMTP server: smtp.126.com (port 25)
139 mail:
POP3 server: POP.139.com (port 110)
SMTP server: SMTP.139.com (port 25)
163.com:
POP3 server: pop.163.com (port 110)
SMTP server: smtp.163.com (port 25)
QQ mail:
POP3 server: pop.qq.com (port 110)
SMTP server: smtp.qq.com (port 25)
QQ enterprise mail:
POP3 server: pop.exmail.qq.com (SSL enabled, port 995)
SMTP server: smtp.exmail.qq.com (SSL enabled, port 587/465)
yahoo.com:
POP3 server: pop.mail.yahoo.com
SMTP server: smtp.mail.yahoo.com
yahoo.com.cn:
POP3 server: pop.mail.yahoo.com.cn (port 995)
SMTP server: smtp.mail.yahoo.com.cn (port 587)
HotMail:
POP3 server: pop3.live.com (port 995)
SMTP server: smtp.live.com (port 587)
gmail (google.com):
POP3 server: pop.gmail.com (SSL enabled, port 995)
SMTP server: smtp.gmail.com (SSL enabled, port 587)
263.net:
POP3 server: pop3.263.net (port 110)
SMTP server: smtp.263.net (port 25)
263.net.cn:
POP3 server: pop.263.net.cn (port 110)
SMTP server: smtp.263.net.cn (port 25)
x263.net:
POP3 server: pop.x263.net (port 110)
SMTP server: smtp.x263.net (port 25)
21cn.com:
POP3 server: pop.21cn.com (port 110)
SMTP server: smtp.21cn.com (port 25)
Foxmail:
POP3 server: POP.foxmail.com (port 110)
SMTP server: SMTP.foxmail.com (port 25)
china.com:
POP3 server: pop.china.com (port 110)
SMTP server: smtp.china.com (port 25)
tom.com:
POP3 server: pop.tom.com (port 110)
SMTP server: smtp.tom.com (port 25)
etang.com:
POP3 server: pop.etang.com
SMTP server: smtp.etang.com

yagmail.SMTP() uses Gmail's SMTP service by default, so for QQ mail the SMTP client is initialized like this:

yag = yagmail.SMTP('810620174@qq.com', host='smtp.qq.com', port='25')

Then the mail can be sent:

yag.send('13207130066.cool@163.com', 'Mail subject', 'This is the mail body')

That sends a message to 13207130066.cool@163.com.

Note the definition of send():

def send(self, to=None, subject=None, contents=None, attachments=None, cc=None, bcc=None,preview_only=False, validate_email=True, throw_invalid_exception=False, headers=None)

If to is not given, the mail is sent to yourself. If to is a list, the mail is sent to every address in it. attachments is the attachment; it can be a list, to send several attachments.

For the contents parameter, the official documentation says:

If it is a dictionary it will assume the key is the content and the value is an alias (only for images currently!) e.g. {‘/path/to/image.png’ : ‘MyPicture’}
It will try to see if the content (string) can be read as a file locally, e.g. ‘/path/to/image.png’
if impossible, it will check if the string is valid html e.g. <h1>This is a big title</h1>
if not, it must be text. e.g. ‘Hi Dorika!’

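Key Computer Science Questions

2016-12-10T08:10:12.000Z

Introduction

Key questions in the computing field that need to be understood in depth. Updated continuously.

Blocking vs. non-blocking, synchronous vs. asynchronous, and concurrency vs. parallelism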

Using the Scrapy Crawling Library in Detail

2016-12-10T04:36:04.000Z

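About Scrapy

Scrapy sends requests asynchronously and filters out duplicate URLs by default. It can parse HTML/XML, export data in several formats, and has a powerful plugin system.

scrapy (1.2.2) currently supports Python 3, but the official documentation notes that Python 3 on Windows is not supported, because Twisted, a core dependency of scrapy, does not yet support Python 3 on Windows. Some people on Zhihu therefore recommend Python 2.7 together with the Visual C++ Compiler for Python 2.7, which also works on Windows 10. But according to the Python developer's guide, Python 2.7 will only be maintained until 2020, and Python's future is Python 3. Essentially all mainstream libraries support Python 3, and many have begun dropping Python 2, so I still want to use Python 3 here.

As for why Windows is unsupported: scrapy's dependencies lxml and Twisted cannot be compiled on Windows. We can, however, download prebuilt whl packages and install them with pip. For details, see this blog post: Installing python 3.5 + scrapy1.2 on Windows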

Using Scrapy

Creating a project

scrapy startproject test_scrapy

This creates a test_scrapy folder in the current working directory with the following contents:

test_scrapy/
    scrapy.cfg            # deploy configuration file
    test_scrapy/             # project's Python module, you'll import your code from here
        __init__.py
        items.py          # project items definition file
        middlewares.py    # Define here the models for your spider middleware
        pipelines.py      # project pipelines file
        settings.py       # project settings file
        spiders/          # a directory where you'll later put your spiders
            __init__.py

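The first spider

A spider class must inherit from scrapy.Spider and define its initial requests, and its file should be placed in the spiders directory.

Create quotes_spider.py in the spiders directory: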

import scrapy
class QuotesSpider(scrapy.Spider):
    name = "quotes"
    def start_requests(self):
        urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)
    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = 'quotes-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename)

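name is the spider's name; it must be unique within a project.

start_requests() must return an iterable of Requests (a list of Requests or a generator); these are the requests the spider starts crawling from. scrapy offers a shortcut to implementing start_requests(): define a start_urls list, which is automatically turned into Requests behind the scenes with the default callback parse().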

import scrapy
class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
        'http://quotes.toscrape.com/page/2/',
    ]
    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = 'quotes-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)

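parse() is the default callback. A Request can specify the callback to run once its response arrives.

Running the spider

From the project's root directory, run:

scrapy crawl quotes

quotes is the spider's name.

The output looks like this: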

...
2016-12-11 14:39:27 [scrapy] INFO: Spider opened
2016-12-11 14:39:27 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-12-11 14:39:27 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-12-11 14:39:28 [scrapy] DEBUG: Crawled (404)  (referer: None)
2016-12-11 14:39:28 [scrapy] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/page/1/> (referer: None)
2016-12-11 14:39:28 [quotes] DEBUG: Saved file quotes-1.html
2016-12-11 14:39:29 [scrapy] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/page/2/> (referer: None)
2016-12-11 14:39:29 [quotes] DEBUG: Saved file quotes-2.html
2016-12-11 14:39:29 [scrapy] INFO: Closing spider (finished)
2016-12-11 14:39:29 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 675,
 'downloader/request_count': 3,
 'downloader/request_method_count/GET': 3,
 'downloader/response_bytes': 5976,
 'downloader/response_count': 3,
 'downloader/response_status_count/200': 2,
 'downloader/response_status_count/404': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2016, 12, 11, 6, 39, 29, 492581),
 'log_count/DEBUG': 6,
 'log_count/INFO': 7,
 'response_received_count': 3,
 'scheduler/dequeued': 2,
 'scheduler/dequeued/memory': 2,
 'scheduler/enqueued': 2,
 'scheduler/enqueued/memory': 2,
 'start_time': datetime.datetime(2016, 12, 11, 6, 39, 27, 724826)}
2016-12-11 14:39:29 [scrapy] INFO: Spider closed (finished)

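This also creates quotes-1.html and quotes-2.html in the root directory.

Parsing pages

HTML/XML is parsed with CSS selectors; scrapy also supports XPath expressions.

>>> response.css('title')
[<Selector xpath='descendant-or-self::title' data='<title>Quotes to Scrape</title>'>]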
>>> response.css('title::text').extract()
['Quotes to Scrape']
>>> response.css('title').extract()
['<title>Quotes to Scrape</title>']
>>> response.css('li.next a').extract_first()
'<a href="/page/2/">Next <span aria-hidden="true">→</span></a>'
>>> response.css('li.next a::attr(href)').extract_first()
'/page/2/'

response.css() returns a list-like object. To get the first result:

>>> response.css('title::text').extract_first()
'Quotes to Scrape'
>>> response.css('title::text')[0].extract()
'Quotes to Scrape'

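The first form is recommended: if response.css() returns an empty list, the former returns None while the latter raises an exception.

Besides extract() and extract_first(), re() can be used to extract data with a regular expression: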

>>> response.css('title::text').re(r'Quotes.*')
['Quotes to Scrape']
>>> response.css('title::text').re(r'Q\w+')
['Quotes']
>>> response.css('title::text').re(r'(\w+) to (\w+)')
['Quotes', 'Scrape']

Following links

import scrapy
class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
    ]
    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.css('span small::text').extract_first(),
                # 'author': quote.xpath('span/small/text()').extract_first(),
                'tags': quote.css('div.tags a.tag::text').extract(),
            }
        next_page = response.css('li.next a::attr(href)').extract_first()
        if next_page is not None:
            next_page = response.urljoin(next_page) # urljoin() builds the full URL
            yield scrapy.Request(next_page, callback=self.parse)

import scrapy
class AuthorSpider(scrapy.Spider):
    name = 'author'
    start_urls = ['http://quotes.toscrape.com/']
    def parse(self, response):
        # follow links to author pages
        for href in response.css('.author+a::attr(href)').extract():
            yield scrapy.Request(response.urljoin(href),
                                 callback=self.parse_author)
        # follow pagination links
        next_page = response.css('li.next a::attr(href)').extract_first()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)
    def parse_author(self, response):
        def extract_with_css(query):
            return response.css(query).extract_first().strip()
        yield {
            'name': extract_with_css('h3.author-title::text'),
            'birthdate': extract_with_css('.author-born-date::text'),
            'bio': extract_with_css('.author-description::text'),
        }

Command-line tool

C:\WINDOWS\system32>scrapy
Scrapy 1.2.2 - no active project
Usage:
  scrapy <command> [options] [args]
Available commands:
  bench         Run quick benchmark test
  commands
  fetch         Fetch a URL using the Scrapy downloader
  genspider     Generate new spider using pre-defined templates
  runspider     Run a self-contained spider (without creating a project)
  settings      Get settings values
  shell         Interactive scraping console
  startproject  Create new project
  version       Print Scrapy version
  view          Open URL in browser, as seen by Scrapy
  [ more ]      More commands available when run from project directory
Use "scrapy <command> -h" to see more info about a command

For more commands and their detailed usage, see the official documentation.

CrawlSpider

Besides scrapy.Spider, another commonly used base class is scrapy.spiders.CrawlSpider, which builds on the former by adding Rule objects for following links.

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
class MySpider(CrawlSpider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com']
    rules = (
        # Extract links matching 'category.php' (but not matching 'subsection.php')
        # and follow links from them (since no callback means follow=True by default).
        Rule(LinkExtractor(allow=('category\.php', ), deny=('subsection\.php', ))),
        # Extract links matching 'item.php' and parse them with the spider's method parse_item
        Rule(LinkExtractor(allow=('item\.php', )), callback='parse_item'),
    )
    def parse_item(self, response):
        self.logger.info('Hi, this is an item page! %s', response.url)
        item = {}  # plain dict: a bare scrapy.Item() declares no fields, so item['id'] would raise KeyError
        item['id'] = response.xpath('//td[@id="item_id"]/text()').re(r'ID: (\d+)')
        item['name'] = response.xpath('//td[@id="item_name"]/text()').extract()
        item['description'] = response.xpath('//td[@id="item_description"]/text()').extract()
        return item

SitemapSpider

scrapy.spiders.SitemapSpider crawls a site based on its sitemaps and robots.txt

from scrapy.spiders import SitemapSpider
class MySpider(SitemapSpider):
    sitemap_urls = ['http://www.example.com/robots.txt']
    sitemap_rules = [
        ('/shop/', 'parse_shop'),
    ]
    sitemap_follow = ['/sitemap_shops']
    def parse_shop(self, response):
        pass # ... scrape shop here ...

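The rule means that URLs containing /shop/ are handled by the parse_shop callback; sitemap_follow means that only sitemaps whose URL contains /sitemap_shops are followed.

Item

Python's built-in dict has no notion of a fixed structure with declared fields, so Scrapy provides the Item class.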

import scrapy
class Product(scrapy.Item):
    name = scrapy.Field()
    price = scrapy.Field()
    stock = scrapy.Field()
    last_updated = scrapy.Field(serializer=str)

>>> product = Product(name='Desktop PC', price=1000)
>>> print product
Product(name='Desktop PC', price=1000)
>>> product['name']
Desktop PC
>>> product.get('name')
Desktop PC
>>> product['price']
1000
>>> product.keys()
['price', 'name']
>>> product.items()
[('price', 1000), ('name', 'Desktop PC')]

An Item Loader offers a more convenient way to populate an Item with data from the response

from scrapy.loader import ItemLoader
from myproject.items import Product
def parse(self, response):
    l = ItemLoader(item=Product(), response=response)
    l.add_xpath('name', '//div[@class="product_name"]')
    l.add_xpath('name', '//div[@class="product_title"]')
    l.add_xpath('price', '//p[@id="price"]')
    l.add_css('stock', 'p#stock')
    l.add_value('last_updated', 'today') # you can also use literal values
    return l.load_item()

Item Pipeline

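After an Item is scraped it is sent to the pipeline for processing. A pipeline is usually a class that only needs to implement process_item; it can also implement open_spider() (called when the spider is opened) and close_spider() (called when the spider is closed).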

import pymongo
class MongoPipeline(object):
    collection_name = 'scrapy_items'
    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db
    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DATABASE', 'items')
        )
    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]
    def close_spider(self, spider):
        self.client.close()
    def process_item(self, item, spider):
        self.db[self.collection_name].insert_one(dict(item))  # insert() is deprecated in pymongo 3 and removed in pymongo 4
        return item

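That covers the basics of Scrapy. For more, such as logging and sending email, see the official documentation.

References

Official Scrapy documentation

A detailed guide to the re regular expression library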

2016-12-01T08:00:45.000Z

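Introduction to re

Python compiles a regular expression into bytecode that is executed by a matching engine written in C, so searching with one is usually faster than an equivalent search written in pure Python. The same content can be matched by many different regular expressions, however, and their efficiency varies.

Special characters

. ^ $ * + ? { } [ ] \ | ( )

To match any of these characters literally, escape it with \

`.`

Matches any single character except a newline; with the DOTALL flag it matches any character, newline included. Note that . matches exactly one character.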

import re
re.findall(r'.', '\r\nabc')
# ['\r', 'a', 'b', 'c']
re.findall(r'.', '\r\nabc', flags=re.DOTALL)
# ['\r', '\n', 'a', 'b', 'c']

`^`

Matches the start of the string; with the MULTILINE flag it also matches at the start of every line

re.findall(r'ab.', 'abcdefabhy')
# ['abc', 'abh']
re.findall(r'^ab.', 'abcdefabhy')
# ['abc']
re.findall(r'^ab.', 'abcd\nabcd\nacd\nabcd')
# ['abc']
re.findall(r'^ab.', 'abcd\nabcd\nacd\nabcd', flags=re.MULTILINE)
# ['abc', 'abc', 'abc']

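`$`

Matches the end of the string (or just before a trailing newline); with the MULTILINE flag it also matches at the end of every line, just before the newline

re.findall(r'.ab$', 'aabcbab')
# ['bab']
re.findall(r'ab.$', 'aabcbab')
# []
re.findall(r'ab.$', 'aabcbab1\n') # the trailing newline is not the end; $ matches just before it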
# ['ab1']
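
The difference between `$` with and without MULTILINE, and the stricter `\Z`, can be sketched as follows. This is a minimal illustration in Python 3; the sample string and variable names are made up for the example.

```python
import re

text = "foo1\nfoo2\n"

# Without MULTILINE, $ matches only at the end of the string
# (or just before a single trailing newline).
plain = re.findall(r"foo.$", text)

# With MULTILINE, $ also matches just before every newline.
multi = re.findall(r"foo.$", text, flags=re.MULTILINE)

# \Z matches only at the absolute end, so the trailing newline blocks it.
strict = re.findall(r"foo.\Z", text)

print(plain, multi, strict)
```

Here `plain` contains only the last line's match, `multi` contains one match per line, and `strict` is empty.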

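`*`

* matches 0 or more repetitions of the preceding character or group

re.findall(r'ab*c', 'ac.abc.abbbbc')
# ['ac', 'abc', 'abbbbc']

`+`

+ matches 1 or more repetitions of the preceding character or group

re.findall(r'ab+c', 'ac.abc.abbbbc')
# ['abc', 'abbbbc']

`?`

? matches 0 or 1 repetition of the preceding character or group

re.findall(r'ab?c', 'ac.abc.abbbbc')
# ['ac', 'abc']

`*?` `+?` `??`

*, + and ? are all greedy: they match as much text as possible

re.findall(r'<.*>', '<a> b <c>')
# ['<a> b <c>']

Adding ? after one of these operators makes it non-greedy, so it matches as little text as possible

re.findall(r'<.*?>', '<a> b <c>')
# ['<a>', '<c>']

`{m}`

{m} matches exactly m repetitions of the preceding character or group

re.findall(r'a{3}b', 'aabaaabaaaab')
# ['aaab', 'aaab']

`{m,n}`

{m,n} matches from m to n repetitions of the preceding character or group. Note: there must be no space after the comma.

re.findall(r'a{2,3}b', 'aabaaabaaaab')
# ['aab', 'aaab', 'aaab']

Omitting m means a lower bound of 0; omitting n means no upper bound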

re.findall(r'a{,3}b', 'babaabaaabaaaab')
# ['b', 'ab', 'aab', 'aaab', 'aaab']
re.findall(r'a{2,}b', 'babaabaaabaaaab')
# ['aab', 'aaab', 'aaaab']

`{m,n}?`

{m,n} matches as many repetitions as possible; adding ? makes it match as few as possible

re.findall(r'a{2,4}', 'aaaa')
# ['aaaa']
re.findall(r'a{2,4}?', 'aaaa')
# ['aa', 'aa']

`[]`

[] specifies a set of characters

re.findall(r'[a-z]', 'adfzADFZ059')
# ['a', 'd', 'f', 'z']
re.findall(r'[a-zA-Z0-9]', 'adfzADFZ059')
# ['a', 'd', 'f', 'z', 'A', 'D', 'F', 'Z', '0', '5', '9']

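Many special characters lose their special meaning inside []; the rest still need escaping:

re.findall(r'[.$*+?{}|()]', '.^$*+?{}[]\\|()')
# ['.', '$', '*', '+', '?', '{', '}', '|', '(', ')']

A ^ at the start of [] negates the set; [^^] matches every character except ^: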

re.findall(r'[^5]', '1359')
# ['1', '3', '9']
re.findall(r'[^^]', '1359^')
# ['1', '3', '5', '9']

`|`

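| means "or". It short-circuits: alternatives are tried from left to right, and once one matches the rest are not tried. Inside [] the | is a literal character.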

re.findall(r'a|bc', 'acbcabc')
# ['a', 'bc', 'a', 'bc']
re.findall(r'[a|b]c', 'acbcabc')
# ['ac', 'bc', 'bc']

`(...)`

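Matches whatever regular expression is inside the parentheses and marks the start and end of a group. The contents of a group can be retrieved after a match. To match a literal ( or ), escape it or use [(] and [)].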
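
As a small illustration of retrieving group contents and of matching literal parentheses, here is a sketch in Python 3. The date string and variable names are invented for the example.

```python
import re

m = re.search(r"(\d{4})-(\d{2})-(\d{2})", "released on 2016-12-01")

whole = m.group(0)   # group 0 is the entire match
year = m.group(1)    # groups are numbered from 1, left to right
parts = m.groups()   # all captured groups as a tuple

# Escaped parentheses match literal ( and ) instead of opening a group.
literal = re.findall(r"\(\w+\)", "f(x) and g(y)")

print(whole, year, parts, literal)
```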

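`(?aiLmsux)`

One or more letters from a, i, L, m, s, u, x. The construct matches no characters itself; it sets the corresponding flags for the expression: re.A (ASCII-only matching), re.I (ignore case), re.L (locale dependent), re.M (multi-line mode), re.S (. matches every character), re.U (Unicode dependent), re.X (verbose mode)

re.findall(r'(?i)ab', 'abABAbaB')
# ['ab', 'AB', 'Ab', 'aB']

`(?P<name>...)`

Like regular parentheses, but the substring matched by the group can also be retrieved by the given name. The group name must be a valid Python identifier and must be unique within the expression. A named group is still numbered like any other group, so the name is simply an extra way to refer to it.

m = re.match(r'(?P<name>\w+)', 'zzx:22')
m.group('name')
# 'zzx'
m.group(1)
# 'zzx'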

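Special sequences

`\number`

Matches the contents of the group with that number

re.match(r'(.+) \1 (abc) \2', '55 55 abc abc')
# <_sre.SRE_Match object; span=(0, 13), match='55 55 abc abc'>

`\A`

Matches only at the start of the string

re.findall(r'\Aabc', 'abcabc')
# ['abc']

`\b`

Matches the empty string at a word boundary, that is, between a \w character and a \W character or the start/end of the string. It consumes no characters itself.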

re.findall(r'\babc\b', 'abc.')
# ['abc']
re.findall(r'\babc\b', 'abc!')
# ['abc']
re.findall(r'\babc\b', 'abca')
# []

`\B`

The opposite of \b: matches the empty string only where there is no word boundary

re.findall(r'py\B', 'python')
# ['py']
re.findall(r'py\B', 'py.')
# []

`\s`

Matches whitespace characters, i.e. [ \t\n\r\f\v]

re.findall(r'aa\s+bb', 'aa \n\t bb')
# ['aa \n\t bb']

`\S`

The opposite of \s

re.findall(r'aa\S+bb', 'aahg.!bb')
# ['aahg.!bb']
re.findall(r'aa\S+bb', 'aa bb')
# []

`\w`

Matches letters, digits and the underscore

re.findall(r'\w+', 'aa3bb 45AS')
# ['aa3bb', '45AS']

`\W`

The opposite of \w

re.findall(r'\W+', 'aa3bb .! 45AS')
# [' .! ']

`\Z`

Matches only at the end of the string

re.findall(r'ab\Z', 'abab')
# ['ab']

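`re` module functions

`re.compile(pattern, flags=0)`

Compiles a regular expression into a pattern object, which can then be reused to match strings

`re.search(pattern, string, flags=0)`

Scans through the string and returns a Match object for the first location where the pattern matches, or None if there is no match

`re.match(pattern, string, flags=0)`

Matches only at the beginning of the string; returns a Match object, or None if the string does not start with a match

`re.fullmatch(pattern, string, flags=0)`

Succeeds only if the whole string matches the pattern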
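
The three matching functions are easy to confuse, so here is a minimal side-by-side sketch in Python 3. The sample string is invented for the example.

```python
import re

s = "abc123"

# match() anchors at the start: the string does not begin with digits.
match_digits = re.match(r"\d+", s)

# search() scans the whole string and finds the digits further in.
search_digits = re.search(r"\d+", s).group()

# match() succeeds when the pattern fits at position 0.
match_letters = re.match(r"[a-z]+", s).group()

# fullmatch() requires the pattern to cover the entire string.
full_letters = re.fullmatch(r"[a-z]+", s)
full_both = re.fullmatch(r"[a-z]+\d+", s).group()

print(match_digits, search_digits, match_letters, full_letters, full_both)
```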

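`re.split(pattern, string, maxsplit=0, flags=0)`

Splits the string at each occurrence of the pattern

`re.findall(pattern, string, flags=0)`

Returns all non-overlapping matches as a list of strings. If the pattern contains one group, the list holds that group's text instead; with several groups, it holds tuples of the groups

`re.finditer(pattern, string, flags=0)`

Like findall(), but returns an iterator of Match objects

`re.sub(pattern, repl, string, count=0, flags=0)`

Returns the string with matches of the pattern replaced by repl; count limits the number of replacements (0 means replace all)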
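
The functions above have no examples in this section, so here is a short sketch of how they behave on one input, in Python 3. The `key=value` text and the variable names are assumptions made for illustration.

```python
import re

text = "k1=v1; k2=v2;k3=v3"
pair = re.compile(r"(\w+)=(\w+)")   # compile once, reuse below

fields = re.split(r";\s*", text)    # split on ';' plus optional spaces
pairs = pair.findall(text)          # two groups -> list of tuples
keys = re.findall(r"(\w+)=\w+", text)             # one group -> list of that group
spans = [m.span() for m in pair.finditer(text)]   # Match objects carry positions
swapped = pair.sub(r"\2=\1", text)  # backreferences in the replacement

print(fields, pairs, keys, spans, swapped, sep="\n")
```

Compiling the pattern once is mainly a convenience here; the module-level functions accept the same pattern string and behave identically.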

Match Objects

Functions such as match() and search() return a Match object; see the official documentation for its attributes and methods.

Note that group 0 is the entire matched string

a = re.match(r'\babc\b', 'abc!')
a.group()
# 'abc'

References

Python descriptors

2016-11-29T10:40:14.000Z

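Introduction

Python descriptors demystified

Original author: Chris Beaumont. Chinese translation: 极客范 - 慕容老匹夫

Reposted from: http://www.geekfan.net/7862/

Python has many built-in language features that keep code concise and easy to understand, including list/set/dictionary comprehensions, properties, and decorators. Most of these "intermediate" features are well documented and easy to learn.

Descriptors are the exception. For me at least, descriptors were the core language feature that stayed confusing the longest, for several reasons:

The official documentation on descriptors is rather hard to follow and lacks good examples of why you would write one (in Raymond Hettinger's defense, his other Python articles and talks have helped me a great deal)
The syntax for writing descriptors is a little odd
Custom descriptors are perhaps the least-used feature in Python, so good examples are hard to find in open-source projects

Once you understand them, though, descriptors do have their uses. This article explains what descriptors can do and why they deserve your attention.

In a nutshell: descriptors are reusable properties

Here is the point: fundamentally, descriptors are properties that can be reused. That is, descriptors let you write code like this: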

f = Foo()
b = f.bar
f.bar = c
del f.bar

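When the interpreter runs this code and sees an attribute access (b = f.bar), an assignment (f.bar = c), or a deletion (del f.bar), it calls custom methods that you define.

Let's first look at why disguising function calls as attribute access is so useful.

property: disguising function calls as attribute access

Imagine you are writing code to manage movie information. The Movie class you end up with might look like this: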

class Movie(object):
    def __init__(self, title, rating, runtime, budget, gross):
        self.title = title
        self.rating = rating
        self.runtime = runtime
        self.budget = budget
        self.gross = gross
 
    def profit(self):
        return self.gross - self.budget

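You start using this class elsewhere in the project, and then you realize: what if a movie is accidentally given a negative budget? You decide this is wrong and want the Movie class to prevent it. Your first idea is to change the Movie class like this: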

class Movie(object):
    def __init__(self, title, rating, runtime, budget, gross):
        self.title = title
        self.rating = rating
        self.runtime = runtime
        self.gross = gross
        if budget < 0:
            raise ValueError("Negative value not allowed: %s" % budget)
        self.budget = budget
 
    def profit(self):
        return self.gross - self.budget

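But this does not work. Other parts of the code assign to the attribute directly (m.budget = ...), so the modified class only catches bad data in __init__ and does nothing for instances that already exist. If someone runs m.budget = -100, nothing stops them. As a Python programmer and a movie fan, what do you do?

Fortunately, Python's property solves this problem. If you have never seen property in use, here is an example: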

class Movie(object):
    def __init__(self, title, rating, runtime, budget, gross):
        self._budget = None
 
        self.title = title
        self.rating = rating
        self.runtime = runtime
        self.gross = gross
        self.budget = budget
 
    @property
    def budget(self):
        return self._budget
 
    @budget.setter
    def budget(self, value):
        if value < 0:
            raise ValueError("Negative value not allowed: %s" % value)
        self._budget = value
 
    def profit(self):
        return self.gross - self.budget
 
m = Movie('Casablanca', 97, 102, 964000, 1300000)
print m.budget       # calls m.budget(), returns result
try:
    m.budget = -100  # calls budget.setter(-100), and raises ValueError
except ValueError:
    print "Woops. Not allowed"
 
964000
Woops. Not allowed

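We use the @property decorator to define a getter and the @budget.setter decorator to define a setter. Once we do, Python automatically calls the matching getter or setter whenever someone accesses the budget attribute. For example, m.budget = value automatically calls the setter.

Take a moment to appreciate how elegant this is. Without property, we would have to hide every instance attribute and provide lots of explicit methods such as get_budget and set_budget. Code using such a class would be a stream of getter/setter calls and would look like bloated Java. Worse, if we skipped that style and accessed instance attributes directly, there would be no clean way to add a non-negative check later: we would have to create set_budget and then search the whole project, replacing every m.budget = value with m.set_budget(value). What a pain!

So property lets us attach custom code to reading and setting a variable while keeping a simple attribute-access interface for the class. Nicely done!

Shortcomings of property

The biggest drawback of properties is that they cannot be reused. For example, suppose you also want non-negative checks on rating, runtime and gross. Here is the modified class: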

class Movie(object):
    def __init__(self, title, rating, runtime, budget, gross):
        self._rating = None
        self._runtime = None
        self._budget = None
        self._gross = None
 
        self.title = title
        self.rating = rating
        self.runtime = runtime
        self.gross = gross
        self.budget = budget
 
    #nice
    @property
    def budget(self):
        return self._budget
 
    @budget.setter
    def budget(self, value):
        if value < 0:
            raise ValueError("Negative value not allowed: %s" % value)
        self._budget = value
 
    #ok    
    @property
    def rating(self):
        return self._rating
 
    @rating.setter
    def rating(self, value):
        if value < 0:
            raise ValueError("Negative value not allowed: %s" % value)
        self._rating = value
 
    #uhh...
    @property
    def runtime(self):
        return self._runtime
 
    @runtime.setter
    def runtime(self, value):
        if value < 0:
            raise ValueError("Negative value not allowed: %s" % value)
        self._runtime = value        
 
    #is this forever?
    @property
    def gross(self):
        return self._gross
 
    @gross.setter
    def gross(self, value):
        if value < 0:
            raise ValueError("Negative value not allowed: %s" % value)
        self._gross = value        
 
    def profit(self):
        return self.gross - self.budget

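The code has grown a lot, and so has the duplicated logic. Properties make a class look clean from the outside, but they cannot keep it equally clean on the inside.

Enter descriptors (the ultimate weapon)

This is the problem descriptors solve. A descriptor is an upgraded property that lets you move repeated property logic into a separate class. The example below shows how descriptors work (do not worry about the implementation of NonNegative yet):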

from weakref import WeakKeyDictionary
 
class NonNegative(object):
    """A descriptor that forbids negative values"""
    def __init__(self, default):
        self.default = default
        self.data = WeakKeyDictionary()
 
    def __get__(self, instance, owner):
        # we get here when someone calls x.d, and d is a NonNegative instance
        # instance = x
        # owner = type(x)
        return self.data.get(instance, self.default)
 
    def __set__(self, instance, value):
        # we get here when someone calls x.d = val, and d is a NonNegative instance
        # instance = x
        # value = val
        if value < 0:
            raise ValueError("Negative value not allowed: %s" % value)
        self.data[instance] = value
 
class Movie(object):
 
    #always put descriptors at the class-level
    rating = NonNegative(0)
    runtime = NonNegative(0)
    budget = NonNegative(0)
    gross = NonNegative(0)
 
    def __init__(self, title, rating, runtime, budget, gross):
        self.title = title
        self.rating = rating
        self.runtime = runtime
        self.budget = budget
        self.gross = gross
 
    def profit(self):
        return self.gross - self.budget
 
m = Movie('Casablanca', 97, 102, 964000, 1300000)
print m.budget  # calls Movie.budget.__get__(m, Movie)
m.rating = 100  # calls Movie.rating.__set__(m, 100)
try:
    m.rating = -1   # calls Movie.rating.__set__(m, -1)
except ValueError:
    print "Woops, negative value"
 
964000
Woops, negative value

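Some new syntax appears here, so let's go through it piece by piece:

NonNegative is a descriptor because it defines at least one of the __get__, __set__ or __delete__ methods.

The Movie class now looks very clean. We create four descriptors at the class level and then use them like ordinary instance attributes. The descriptors do the non-negative checks for us.

Accessing a descriptor

When the interpreter encounters print m.budget, it recognizes budget as a descriptor with a __get__ method, calls Movie.budget.__get__ and prints the return value, rather than printing a value stored directly on m. This is like accessing a property: Python calls a method automatically and returns its result.

__get__ receives two arguments: the instance to the left of the dot (the m in m.budget) and the type of that instance, Movie. In some Python documentation Movie is called the owner of the descriptor. If we access Movie.budget, Python calls Movie.budget.__get__(None, Movie). So the first argument is either an instance of the owner or None. These arguments may look strange, but they tell the descriptor which object it is being accessed through. It all makes sense once we look at the implementation of NonNegative.

Assigning to a descriptor

When the interpreter sees m.rating = 100, Python recognizes rating as a descriptor with a __set__ method and calls Movie.rating.__set__(m, 100). As with __get__, the first argument to __set__ is the instance to the left of the dot (the m in m.rating = 100). The second argument is the value being assigned (100).

Deleting a descriptor

For completeness, a word on deletion. If you run del m.budget, Python calls Movie.budget.__delete__(m).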
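
NonNegative as shown does not define __delete__, so here is a hedged sketch of a descriptor that does, written for Python 3. The class name `Resettable` and its behavior on deletion (falling back to the default) are assumptions made for this illustration, not part of the original article.

```python
from weakref import WeakKeyDictionary


class Resettable:
    """Hypothetical descriptor: per-instance values, `del` restores the default."""

    def __init__(self, default):
        self.default = default
        self.data = WeakKeyDictionary()

    def __get__(self, instance, owner):
        if instance is None:       # accessed on the class itself
            return self
        return self.data.get(instance, self.default)

    def __set__(self, instance, value):
        self.data[instance] = value

    def __delete__(self, instance):
        # `del obj.attr` lands here; drop the stored value if there is one.
        self.data.pop(instance, None)


class Movie:
    budget = Resettable(0)


m = Movie()
m.budget = 964000
before = m.budget
del m.budget          # calls Resettable.__delete__(descriptor, m)
after = m.budget
print(before, after)
```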

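How does the NonNegative class work?

With those questions in mind, we can finally reveal how NonNegative works. Each NonNegative instance maintains a dictionary that maps owner instances to their data. When we access m.budget, __get__ looks up the data associated with m and returns it (or a default value if there is none). __set__ works the same way but adds the non-negative check. We use a WeakKeyDictionary instead of a regular dictionary to prevent memory leaks: we do not want an unused instance kept alive just because it is a key in the descriptor's dictionary.

Using descriptors is a little awkward. Because they live at the class level, every instance of the class shares the same descriptor. The descriptor therefore has to manage state for different instances by hand, and the instance is passed explicitly as the first argument to __get__, __set__ and __delete__.

I hope this example shows what descriptors are good for: they provide a way to isolate property logic in a separate class. If you find yourself repeating the same logic across different properties, this article may be a hint that refactoring with descriptors is worth a try.

Tips and pitfalls

Put descriptors at the class level

For descriptors to work, they must be defined at the class level. Otherwise Python will not call __get__ and __set__ for you automatically.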

class Broken(object):
    y = NonNegative(5)
    def __init__(self):
        self.x = NonNegative(0)  # NOT a good descriptor
 
b = Broken()
print "X is %s, Y is %s" % (b.x, b.y)
 
X is <__main__.NonNegative object at 0x10432c250>, Y is 5

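As you can see, accessing the class-level descriptor y calls __get__ automatically, while accessing the instance-level descriptor x just returns the descriptor object itself.

Make sure instance data belongs only to the instance

You might be tempted to write the NonNegative descriptor like this: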

class BrokenNonNegative(object):
    def __init__(self, default):
        self.value = default
 
    def __get__(self, instance, owner):
        return self.value
 
    def __set__(self, instance, value):
        if value < 0:
            raise ValueError("Negative value not allowed: %s" % value)
        self.value = value
 
class Foo(object):
    bar = BrokenNonNegative(5) 
 
f = Foo()
try:
    f.bar = -1
except ValueError:
    print "Caught the invalid assignment"
 
Caught the invalid assignment

This appears to work. The problem is that all instances of Foo share the same bar, which leads to painful results:

class Foo(object):
    bar = BrokenNonNegative(5) 
 
f = Foo()
g = Foo()
 
print "f.bar is %s\ng.bar is %s" % (f.bar, g.bar)
print "Setting f.bar to 10"
f.bar = 10
print "f.bar is %s\ng.bar is %s" % (f.bar, g.bar)  #ouch
f.bar is 5
g.bar is 5
Setting f.bar to 10
f.bar is 10
g.bar is 10

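This is why NonNegative uses a data dictionary. The first argument to __get__ and __set__ tells us which instance we are dealing with. NonNegative uses that argument as the dictionary key and so keeps a separate value for each Foo instance.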

class Foo(object):
    bar = NonNegative(5)
 
f = Foo()
g = Foo()
print "f.bar is %s\ng.bar is %s" % (f.bar, g.bar)
print "Setting f.bar to 10"
f.bar = 10
print "f.bar is %s\ng.bar is %s" % (f.bar, g.bar)  #better
f.bar is 5
g.bar is 5
Setting f.bar to 10
f.bar is 10
g.bar is 5

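This is the most awkward part of descriptors. (Frankly, I do not understand why Python does not let you define descriptors at the instance level and always dispatches the real work to __get__ and __set__. There must be a reason why that would not work.)

Beware of unhashable descriptor owners

NonNegative uses a dictionary to store instance-specific data. That is usually fine, unless the owner instances are unhashable: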

class MoProblems(list):  #you can't use lists as dictionary keys
    x = NonNegative(5)
 
m = MoProblems()
print m.x  # womp womp
 
TypeError
Traceback (most recent call last)
 in ()
      3 
      4 m = MoProblems()
----> 5 print m.x  # womp womp
 
 in __get__(self, instance, owner)
      9         # instance = x
     10         # owner = type(x)
---> 11         return self.data.get(instance, self.default)
     12 
     13     def __set__(self, instance, value):
 
TypeError: unhashable type: 'MoProblems'

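Because instances of MoProblems (a subclass of list) are unhashable, they cannot be used as keys in the data dictionary behind MoProblems.x. There are a few ways around this, none of them perfect. The best is probably to label your descriptors.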

class Descriptor(object):
 
    def __init__(self, label):
        self.label = label
 
    def __get__(self, instance, owner):
        print '__get__', instance, owner
        return instance.__dict__.get(self.label)
 
    def __set__(self, instance, value):
        print '__set__'
        instance.__dict__[self.label] = value
 
class Foo(list):
    x = Descriptor('x')
    y = Descriptor('y')
 
f = Foo()
f.x = 5
print f.x
 
__set__
__get__ [] <class '__main__.Foo'>
5

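This approach relies on Python's attribute lookup order (the article calls it the method resolution order, MRO; more precisely, a data descriptor found on the class takes precedence over the instance __dict__). We give each descriptor in Foo a label that matches the variable name it is assigned to, for example x = Descriptor('x'). The descriptor then stores instance-specific data in f.__dict__['x']. That dictionary entry is what Python would normally return for f.x. Because Foo.x is a descriptor, however, Python does not use f.__dict__['x'] directly, and the descriptor can safely store its data there. Just remember not to use the same label anywhere else.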

class Foo(object):
    x = Descriptor('y')
 
f = Foo()
f.x = 5
print f.x
 
f.y = 4    #oh no!
print f.x
__set__
__get__ <__main__.Foo object at 0x10432c810> <class '__main__.Foo'>
5
__get__ <__main__.Foo object at 0x10432c810> <class '__main__.Foo'>
4

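I do not like this approach, because the code is fragile and full of subtleties. But it is widely used and it works for unhashable owner classes. David Beazley uses it in his book.

Using labeled descriptors with a metaclass

Because a descriptor's label is the same as the variable name it is assigned to, some people use a metaclass to handle this bookkeeping automatically.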

class Descriptor(object):
    def __init__(self):
        #notice we aren't setting the label here
        self.label = None
 
    def __get__(self, instance, owner):
        print '__get__. Label = %s' % self.label
        return instance.__dict__.get(self.label, None)
 
    def __set__(self, instance, value):
        print '__set__'
        instance.__dict__[self.label] = value
 
class DescriptorOwner(type):
    def __new__(cls, name, bases, attrs):
        # find all descriptors, auto-set their labels
        for n, v in attrs.items():
            if isinstance(v, Descriptor):
                v.label = n
        return super(DescriptorOwner, cls).__new__(cls, name, bases, attrs)
 
class Foo(object):
    __metaclass__ = DescriptorOwner
    x = Descriptor()
 
f = Foo()
f.x = 10
print f.x
 
__set__
__get__. Label = x
10

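I will not explain the details of metaclasses here; David Beazley's article in the references covers them clearly. The point is that the metaclass labels each descriptor automatically, using the name of the variable it is assigned to.

This solves the problem of labels drifting out of sync with variable names, but at the cost of introducing a metaclass. I am skeptical, but you can judge for yourself whether it is worth it.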
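
On Python 3.6 and later there is another option the article does not cover: the `__set_name__` hook, which Python calls on each descriptor when the owning class is created. The sketch below is Python 3 and uses the hypothetical class name `Labeled`; it is an alternative to the metaclass shown above, not the author's method.

```python
class Labeled:
    """Hypothetical labeled descriptor that learns its own name automatically."""

    def __set_name__(self, owner, name):
        # Called once at class-creation time with the attribute name.
        self.label = name

    def __get__(self, instance, owner):
        if instance is None:
            return self
        return instance.__dict__.get(self.label)

    def __set__(self, instance, value):
        instance.__dict__[self.label] = value


class Foo(list):  # unhashable owner, as in the example above
    x = Labeled()
    y = Labeled()


f = Foo()
f.x = 10
f.y = 20
print(Foo.x.label, f.x, f.y, f.__dict__)
```

Because the label always comes from the assignment itself, it cannot drift out of sync with the variable name.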

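Accessing a descriptor's methods

Descriptors are just classes, so you may want to add methods to them. For example, descriptors are a good way to implement callback properties: suppose we want to be notified as soon as some part of a class's state changes. Most of the code below does just that: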

class CallbackProperty(object):
    """A property that will alert observers when upon updates"""
    def __init__(self, default=None):
        self.data = WeakKeyDictionary()
        self.default = default
        self.callbacks = WeakKeyDictionary()
 
    def __get__(self, instance, owner):
        return self.data.get(instance, self.default)
 
    def __set__(self, instance, value):        
        for callback in self.callbacks.get(instance, []):
            # alert callback function of new value
            callback(value)
        self.data[instance] = value
 
    def add_callback(self, instance, callback):
        """Add a new function to call everytime the descriptor updates"""
        #but how do we get here?!?!
        if instance not in self.callbacks:
            self.callbacks[instance] = []
        self.callbacks[instance].append(callback)
 
class BankAccount(object):
    balance = CallbackProperty(0)
 
def low_balance_warning(value):
    if value < 100:
        print "You are poor"
 
ba = BankAccount()
 
# will not work -- try it
#ba.balance.add_callback(ba, low_balance_warning)

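This is an attractive pattern: we can attach callbacks that respond to state changes in a class without modifying the class's code at all. Now all we need is to call ba.balance.add_callback(ba, low_balance_warning) so that low_balance_warning runs whenever balance changes.

But how do we do that? Whenever we try to access a descriptor, __get__ is called, so add_callback seems to be out of reach. The key is a special case: when the descriptor is accessed through the class, the first argument to __get__ is None.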

class CallbackProperty(object):
    """A property that will alert observers when upon updates"""
    def __init__(self, default=None):
        self.data = WeakKeyDictionary()
        self.default = default
        self.callbacks = WeakKeyDictionary()
 
    def __get__(self, instance, owner):
        if instance is None:
            return self        
        return self.data.get(instance, self.default)
 
    def __set__(self, instance, value):
        for callback in self.callbacks.get(instance, []):
            # alert callback function of new value
            callback(value)
        self.data[instance] = value
 
    def add_callback(self, instance, callback):
        """Add a new function to call everytime the descriptor within instance updates"""
        if instance not in self.callbacks:
            self.callbacks[instance] = []
        self.callbacks[instance].append(callback)
 
class BankAccount(object):
    balance = CallbackProperty(0)
 
def low_balance_warning(value):
    if value < 100:
        print "You are now poor"
 
ba = BankAccount()
BankAccount.balance.add_callback(ba, low_balance_warning)
 
ba.balance = 5000
print "Balance is %s" % ba.balance
ba.balance = 99
print "Balance is %s" % ba.balance
Balance is 5000
You are now poor
Balance is 99

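Personal summary

A descriptor masquerades as a class attribute; when an instance accesses it with the dot operator, Python actually calls one of the descriptor's three methods
Attribute lookup does not start with the instance's own storage. Python first looks on the class and its base classes to see whether the name is a data descriptor (a disguised attribute). If it is, the descriptor's __get__(), __set__() or __delete__() is called instead of reading or writing the instance __dict__; only otherwise is the instance __dict__ consulted
Because a descriptor must be a class attribute, all instances of the class share the same descriptor object. A common approach is therefore to create a dictionary in the descriptor's __init__(), using the instance (the instance parameter in the examples) as the key and that instance's data as the value
The first parameter of an ordinary method is self: when the method is looked up through an instance, the instance is bound to it automatically, producing a bound method. Functions attached to an instance dynamically are not bound this way, so they do not receive self automatically
It can help to think about classes and objects at a low level: both are just regions of memory. Books say that classes are objects too, which can be hard to grasp, but at this level it is easier. A region is allocated and filled with data, and that is the class. Instantiating the class allocates another region and fills it in (to save space it does not copy the class's attributes and methods; it only records which class the object is an instance of), and that is the instance. A class attribute is one whose value lives only in the class's memory, not in the instance's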
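
The lookup order described in the summary can be checked directly. The sketch below (Python 3, with made-up class names) writes into the instance __dict__ by hand and shows that a data descriptor, one that defines __set__, still wins, while a non-data descriptor, one with only __get__, is shadowed by the instance entry.

```python
class DataDesc:
    """Defines __get__ and __set__: a data descriptor."""

    def __get__(self, instance, owner):
        return "from data descriptor"

    def __set__(self, instance, value):
        raise AttributeError("read-only")


class NonDataDesc:
    """Defines only __get__: a non-data descriptor."""

    def __get__(self, instance, owner):
        return "from non-data descriptor"


class Demo:
    a = DataDesc()
    b = NonDataDesc()


d = Demo()
# Bypass the descriptors and write straight into the instance dictionary.
d.__dict__["a"] = "from instance dict"
d.__dict__["b"] = "from instance dict"

a_value = d.a   # data descriptor takes precedence over the instance dict
b_value = d.b   # instance dict takes precedence over a non-data descriptor
print(a_value, b_value, sep="\n")
```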

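References

Introduction to commonly used os module functions

2016-11-29T01:27:27.000Z

Introduction to os

Operating-system-dependent functionality; some functions are only available on Unix

Common os functions

environ and getenv

Get environment variables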

import os
os.environ["PYTHON_HOME"]
# 'F:\\pythonVE'
os.getenv('PYTHON_HOME')
# 'F:\\pythonVE'
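
The two lookups above differ when the variable is missing: indexing os.environ raises KeyError, while os.getenv returns None or a supplied default. A minimal Python 3 sketch, using a made-up variable name:

```python
import os

name = "DEMO_VAR_FOR_THIS_EXAMPLE"   # hypothetical variable name
os.environ.pop(name, None)           # make sure it is unset

missing = os.getenv(name)            # None when unset
fallback = os.getenv(name, "default")

try:
    os.environ[name]
    raised = False
except KeyError:
    raised = True                    # indexing an unset variable raises

os.environ[name] = "1"               # set it for this process and its children
now = os.getenv(name)
print(missing, fallback, raised, now)
```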

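Users and groups

Functions for getting the user and group of the current process or of a given pid are Unix-only; see the os documentation for details

One that also works on Windows:

Get the currently logged-in user:

os.getlogin()
# 'zzx'

chdir and getcwd

Change and get the current working directory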

os.getcwd()
# 'F:\\pythonVE\\Scripts'
os.chdir('..')
os.getcwd()
# 'F:\\pythonVE'

listdir and scandir

List the contents of a directory; if the path argument is omitted, the current directory is used

os.listdir()
# ['Include', 'Lib', 'pip-selfcheck.json', 'pyvenv.cfg', 'Scripts', 'share']
os.listdir('.')
# ['Include', 'Lib', 'pip-selfcheck.json', 'pyvenv.cfg', 'Scripts', 'share']

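scandir() serves the same purpose as listdir(), but returns an iterator of DirEntry objects instead of a list of names

a = os.scandir()
a
# a ScandirIterator object
next(a)
# a DirEntry object for the first entry
next(a)
# a DirEntry object for the second entry

A DirEntry object carries file-related attributes; for details see os.DirEntry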
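
To make the DirEntry attributes concrete, here is a small Python 3 sketch that runs in a temporary directory, so the file and directory names are invented for the example. It uses `name`, `is_dir()`, `is_file()` and `stat()` from the documented DirEntry interface.

```python
import os
import tempfile

listing = {}
with tempfile.TemporaryDirectory() as root:
    os.mkdir(os.path.join(root, "sub"))
    with open(os.path.join(root, "a.txt"), "w") as fh:
        fh.write("hello")

    # scandir() can be used as a context manager (Python 3.6+).
    with os.scandir(root) as entries:
        for entry in entries:
            kind = "dir" if entry.is_dir() else "file"
            size = entry.stat().st_size if entry.is_file() else None
            listing[entry.name] = (kind, size)

print(listing)
```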

File system operations

mkdir() creates a directory
remove() deletes a file
rmdir() deletes an empty directory
rename() renames a file or directory
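
The four functions can be tried safely inside a temporary directory. This Python 3 sketch uses invented file names and cleans up after itself.

```python
import os
import tempfile

root = tempfile.mkdtemp()
work = os.path.join(root, "work")
os.mkdir(work)                       # create a directory

old = os.path.join(work, "old.txt")
new = os.path.join(work, "new.txt")
with open(old, "w") as fh:
    fh.write("x")

os.rename(old, new)                  # rename the file
after_rename = sorted(os.listdir(work))

os.remove(new)                       # delete the file
os.rmdir(work)                       # rmdir() only removes empty directories
os.rmdir(root)
exists_after = os.path.exists(root)
print(after_rename, exists_after)
```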

stat

Get information about a file

>>> import os
>>> statinfo = os.stat('somefile.txt')
>>> statinfo
os.stat_result(st_mode=33188, st_ino=7876932, st_dev=234881026,
st_nlink=1, st_uid=501, st_gid=501, st_size=264, st_atime=1297230295,
st_mtime=1297230027, st_ctime=1297230027)
>>> statinfo.st_size
264

startfile

Opens the given file with the system's default application (Windows only)

Separators and line endings

>>> os.curdir
'.'
>>> os.pardir
'..'
>>> os.sep
'\\'
>>> os.altsep
'/'
>>> os.extsep
'.'
>>> os.pathsep
';'
>>> os.defpath
'.;C:\\bin'
>>> os.linesep
'\r\n'

References

Official os documentation

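Common VS Code keyboard shortcuts

2016-11-15T02:17:52.000Z

Common VS Code keyboard shortcuts

F1 Open the Command Palette

Ctrl+X Cut the current line or the selection

Ctrl+C Copy the current line or the selection

Alt + ↓ / ↑ Move the current line down/up

Shift+Alt + ↓ / ↑ Copy the current line down/up

Ctrl+Enter Insert a line below and move the cursor there

Ctrl+Shift+Enter Insert a line above and move the cursor there

Home Go to the beginning of the current line

End Go to the end of the current line

Ctrl+Home Go to the beginning of the file

Ctrl+End Go to the end of the file

Ctrl+↑ / ↓ Scroll the view up/down

Ctrl+G Go to line

Ctrl+P Go to file

Ctrl+Shift+O Go to symbol

Alt+ ← / → Go back/forward, like the back/forward buttons on a mouse

Ctrl+M Toggle whether Tab moves focus

Alt+Click Insert a cursor

Ctrl+U Undo the last cursor operation

Ctrl+F2 Add a cursor after every occurrence of the selected word

Shift+Alt+ → / ← Expand/shrink the selection

Code completion

The default shortcut is Ctrl+Space, but it conflicts with the system input-method switch, and I was used to Alt+/ for code completion from Java development, so I rebound code completion to Alt+/

Trigger parameter hints

The default shortcut is Ctrl+Shift+Space; because of the same conflict I changed it to Alt+Shift+/

F12 Go to definition; same as Ctrl+Click

Alt + F12 Peek definition

Ctrl + Alt + Click Open the definition to the side

Same effect as Ctrl+K F12

Shift+F12 Show References

F11 Toggle full screen

These are the commonly used VS Code shortcuts, not including shortcuts provided by extensions. For other shortcuts, see the references

References

Official keyboard shortcut reference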

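A detailed guide to BeautifulSoup, an HTML and XML parsing library

2016-11-14T07:38:10.000Z

Introduction to BeautifulSoup

BeautifulSoup 3 only supports Python 2 and is no longer developed; BeautifulSoup 4 supports both Python 2 and 3. The usage below follows the 4.4 documentation

Using BeautifulSoup

Parser comparison

Parser	Usage	Advantages	Disadvantages
Python standard library	`BeautifulSoup(markup, "html.parser")`	Built into Python; moderate speed; lenient with malformed documents	Less lenient in versions before Python 2.7.3 / 3.2.2
lxml HTML parser	`BeautifulSoup(markup, "lxml")`	Very fast; lenient with malformed documents	Requires an external C library
lxml XML parser	`BeautifulSoup(markup, "lxml-xml")` or `BeautifulSoup(markup, "xml")`	Very fast; the only parser that supports XML	Requires an external C library
html5lib	`BeautifulSoup(markup, "html5lib")`	Most lenient; parses pages the way a browser does; produces valid HTML5	Very slow; requires an external Python dependency

If no parser is specified, BeautifulSoup picks the best parser installed in the current environment, so results can differ between machines

Kinds of objects

Beautiful Soup turns a complex HTML document into a tree of Python objects. All of them fall into four kinds: Tag, NavigableString, BeautifulSoup and Comment.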

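Tag

A Tag object corresponds to a tag in the original XML or HTML document:

from bs4 import BeautifulSoup
soup = BeautifulSoup('<b class="boldest">Extremely bold</b>')
tag = soup.b
type(tag)
# <class 'bs4.element.Tag'>
str(tag)
# '<b class="boldest">Extremely bold</b>'

Every tag has a name and attributes:

tag.name
# 'b'
tag.attrs
# {'class': ['boldest']}
tag['class']
# ['boldest']

You can add or change a tag's name and attributes by direct assignment:

tag.name = "blockquote"
tag
# <blockquote class="boldest">Extremely bold</blockquote>
tag['class'] = 'verybold'
tag['id'] = 1
tag
# <blockquote class="verybold" id="1">Extremely bold</blockquote>

Use del to remove attributes:

del tag['class']
del tag['id']
tag
# <blockquote>Extremely bold</blockquote>
print(tag.get('class'))
# None

A multi-valued attribute is returned as a list, so check whether you are getting a list or a string:

css_soup = BeautifulSoup('<p class="body strikeout"></p>')
css_soup.p['class']
# ["body", "strikeout"]
css_soup = BeautifulSoup('<p class="body"></p>')
css_soup.p['class']
# ["body"]

If the document is parsed as XML, there are no multi-valued attributes:

xml_soup = BeautifulSoup('<p class="body strikeout"></p>', 'xml')
xml_soup.p['class']
# 'body strikeout'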

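NavigableString

Strings are usually contained in tags. Beautiful Soup uses the NavigableString class to wrap the string inside a tag:

tag.string
# 'Extremely bold'
type(tag.string)
# <class 'bs4.element.NavigableString'>

The string inside a tag cannot be edited in place, but it can be replaced with another string using replace_with():

tag.string.replace_with("No longer bold")
tag
# <blockquote>No longer bold</blockquote>

To use a NavigableString outside Beautiful Soup, convert it to a plain string with unicode() (str() in Python 3). Otherwise the string keeps a reference to the whole parse tree even after you are done with Beautiful Soup, which wastes memory.

BeautifulSoup

The BeautifulSoup object represents the document as a whole. Most of the time you can treat it as a Tag object; it supports most of the methods described under navigating and searching the tree.

Because the BeautifulSoup object is not a real HTML or XML tag, it has no name or attributes of its own. It is sometimes handy to look at its .name, though, so the BeautifulSoup object has a special .name with the value "[document]"

soup.name
# '[document]'

Comment

markup = "<b><!--Hey, buddy. Want to buy a used parser?--></b>"
soup = BeautifulSoup(markup)
comment = soup.b.string
type(comment)
# <class 'bs4.element.Comment'>

A Comment object is a special type of NavigableString:

comment
# 'Hey, buddy. Want to buy a used parser?'

But when it appears in an HTML document, a Comment is output with special formatting:

print(soup.b.prettify())
# <b>
#  <!--Hey, buddy. Want to buy a used parser?-->
# </b>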

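Navigating the tree

The document used for testing:

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')

Accessing a tag name as an attribute only returns the first tag with that name:

soup.body.b
# <b>The Dormouse's story</b>
soup.a
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

Use find_all() to get all matching tags:

soup.find_all('a')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

A tag's .contents attribute returns its children as a list:

head_tag = soup.head
head_tag
# <head><title>The Dormouse's story</title></head>
head_tag.contents
# [<title>The Dormouse's story</title>]
title_tag = head_tag.contents[0]
title_tag
# <title>The Dormouse's story</title>
title_tag.contents
# ["The Dormouse's story"]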

A string has no .contents attribute, because a string has no children:

text = title_tag.contents[0]
text.contents
# AttributeError: 'NavigableString' object has no attribute 'contents'

Use a tag's .children generator to loop over its children:

for child in title_tag.children:
    print(child)
    # The Dormouse's story

The .descendants attribute iterates recursively over all of a tag's descendants

The BeautifulSoup object has only one direct child (the <html> tag), but many descendants:

len(list(soup.children))
# 1
len(list(soup.descendants))
# 25

Print all strings with surrounding whitespace stripped:

for string in soup.stripped_strings:
    print(repr(string))
    # u"The Dormouse's story"
    # u"The Dormouse's story"
    # u'Once upon a time there were three little sisters; and their names were'
    # u'Elsie'
    # u','
    # u'Lacie'
    # u'and'
    # u'Tillie'
    # u';\nand they lived at the bottom of a well.'
    # u'...'

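Use the .parent attribute to get an element's parent.

Use the .parents attribute to iterate over all of an element's ancestors

Use .next_sibling and .previous_sibling to move between siblings in the tree

Use .next_siblings and .previous_siblings to iterate over a node's siblings

.next_element points to whatever was parsed immediately afterwards (a string or a tag). It may be the same as .next_sibling, but usually it is different.

The .next_elements and .previous_elements iterators move forward or backward through the document as it was parsed

Searching the tree

Besides plain strings, find_all() also accepts regular expressions:

import re
for tag in soup.find_all(re.compile("^b")):
    print(tag.name)
# body
# b

The following code finds all <a> tags and all <b> tags in the document:

soup.find_all(["a", "b"])
# [<b>The Dormouse's story</b>,
#  <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]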
True matches anything. The following code finds every tag in the document but none of the text strings:

for tag in soup.find_all(True):
    print(tag.name)
# html
# head
# title
# body
# p
# b
# p
# a
# a
# a
# p
如果包含一个名字为 id 的参数,Beautiful Soup会搜索每个tag的”id”属性.

1
2
soup.find_all(id='link2')
# [Lacie]

如果传入 href 参数,Beautiful Soup会搜索每个tag的”href”属性:

1
2
soup.find_all(href=re.compile("elsie"))
# [Elsie]

下面的例子在文档树中查找所有包含 id 属性的tag,无论 id 的值是什么:

1
2
3
4
soup.find_all(id=True)
# [Elsie,
# Lacie,
# Tillie]

Pass several keyword arguments to filter on multiple attributes at once:

soup.find_all(href=re.compile("elsie"), id='link1')
# [three]

The string argument searches the text content of the document. Like the name argument, string accepts a string, a regular expression, a list, a function, or True. For example:

soup.find_all(string="Elsie")
# [u'Elsie']
soup.find_all(string=["Tillie", "Elsie", "Lacie"])
# [u'Elsie', u'Lacie', u'Tillie']
soup.find_all(string=re.compile("Dormouse"))
# [u"The Dormouse's story", u"The Dormouse's story"]
def is_the_only_string_within_a_tag(s):
    """Return True if this string is the only child of its parent tag."""
    return (s == s.parent.string)
soup.find_all(string=is_the_only_string_within_a_tag)
# [u"The Dormouse's story", u"The Dormouse's story", u'Elsie', u'Lacie', u'Tillie', u'...']

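Although string is meant for searching text, it can be combined with other arguments to filter tags: Beautiful Soup finds the tags whose .string matches the string value. The following code finds the <a> tags whose .string is "Elsie":

soup.find_all("a", string="Elsie")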
# [Elsie]

Use limit to cap the number of results returned:

soup.find_all("a", limit=2)
# [Elsie,
# Lacie]

The following two lines are equivalent:

soup.find_all("a")
soup("a")

These two lines are also equivalent:

soup.title.find_all(string=True)
soup.title(string=True)

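find_all() and find() search only a node's descendants: its children, their children, and so on. find_parents() and find_parent() search a node's ancestors instead.

find_next_siblings() returns all following siblings that match; find_next_sibling() returns only the first matching following sibling.

find_previous_siblings() returns all preceding siblings that match; find_previous_sibling() returns only the first matching preceding sibling.

find_all_next() returns all matching nodes that come after the current node in the document; find_next() returns only the first match.

find_all_previous() returns all matching nodes that come before the current node in the document; find_previous() returns only the first match.

CSS selectors: developers who already know CSS selectors may find select() the easiest way to search:

soup.select("title")
# [The Dormouse's story]
soup.select("p:nth-of-type(3)")
# [<p class="story">...</p>]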

Search through nested tags, level by level:

soup.select("body a")
# [Elsie,
# Lacie,
# Tillie]
soup.select("html head title")
# [The Dormouse's story]

Find tags that are direct children of another tag:

soup.select("head > title")
# [The Dormouse's story]
soup.select("p > a")
# [Elsie,
# Lacie,
# Tillie]
soup.select("p > a:nth-of-type(2)")
# [Lacie]
soup.select("p > #link1")
# [Elsie]
soup.select("body > a")
# []

Find sibling tags:

soup.select("#link1 ~ .sister")
# [Lacie,
# Tillie]
soup.select("#link1 + .sister")
# [Lacie]

Find tags by CSS class:

soup.select(".sister")
# [Elsie,
# Lacie,
# Tillie]
soup.select("[class~=sister]")
# [Elsie,
# Lacie,
# Tillie]

Find tags by id:

soup.select("#link1")
# [Elsie]
soup.select("a#link2")
# [Lacie]

Match any of several selectors at once:

soup.select("#link1,#link2")
# [Elsie,
# Lacie]

Test for the existence of an attribute:

soup.select('a[href]')
# [Elsie,
# Lacie,
# Tillie]

Find tags by attribute value:

soup.select('a[href="http://example.com/elsie"]')
# [Elsie]
soup.select('a[href^="http://example.com/"]')
# [Elsie,
# Lacie,
# Tillie]
soup.select('a[href$="tillie"]')
# [Tillie]
soup.select('a[href*=".com/el"]')
# [Elsie]

Return only the first element that matches:

soup.select_one(".sister")
# Elsie

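Modifying the tree
Tag.insert() works like Tag.append(), except that the new element is not necessarily added at the end of the parent's .contents. It is inserted at whatever numeric position you specify, just like .insert() on a Python list:

markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup, "html.parser")
tag = soup.a
tag.insert(1, "but did not endorse ")
tag
# <a href="http://example.com/">I linked to but did not endorse <i>example.com</i></a>
tag.contents
# [u'I linked to ', u'but did not endorse ', <i>example.com</i>]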
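
Since the text compares Tag.insert() to list.insert(), here is the plain-list behaviour it refers to. This is a minimal sketch on an ordinary Python list standing in for .contents; the strings are taken from the example above, with "<i>example.com</i>" represented as a plain string purely for illustration.

```python
# A plain list standing in for tag.contents (illustrative only).
contents = ["I linked to ", "<i>example.com</i>"]

# append() always adds at the end ...
appended = contents + ["(appended)"]

# ... while insert(position, item) places the item before the given index.
contents.insert(1, "but did not endorse ")

print(contents)
print(appended)
```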

Tag.clear() removes the contents of a tag:

markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup, "html.parser")
tag = soup.a
tag.clear()
tag
# <a href="http://example.com/"></a>

PageElement.extract() removes a tag or string from the tree and returns the element that was removed:

markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup, "html.parser")
a_tag = soup.a
i_tag = soup.i.extract()
a_tag
# <a href="http://example.com/">I linked to</a>
i_tag
# <i>example.com</i>
print(i_tag.parent)
# None

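At this point you effectively have two document trees: one rooted at the BeautifulSoup object used to parse the original document, and one rooted at the tag that was extracted. You can go on to call extract() on a child of the extracted element:

my_string = i_tag.string.extract()
my_string
# u'example.com'
print(my_string.parent)
# None
i_tag
# <i></i>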

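Tag.decompose() removes a node from the tree and then destroys it completely, along with its contents:

markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup, "html.parser")
a_tag = soup.a
soup.i.decompose()
a_tag
# <a href="http://example.com/">I linked to</a>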

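PageElement.replace_with() removes a piece of the tree and puts a new tag or string in its place:

markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup, "html.parser")
a_tag = soup.a
new_tag = soup.new_tag("b")
new_tag.string = "example.net"
a_tag.i.replace_with(new_tag)
a_tag
# <a href="http://example.com/">I linked to <b>example.net</b></a>

replace_with() returns the tag or string that was replaced, so you can examine it or add it back to another part of the tree.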

PageElement.wrap() wraps an element in the tag you specify and returns the new wrapper:

soup = BeautifulSoup("<p>I wish I was bold.</p>", "html.parser")
soup.p.string.wrap(soup.new_tag("b"))
# <b>I wish I was bold.</b>
soup.p.wrap(soup.new_tag("div"))
# <div><p><b>I wish I was bold.</b></p></div>

This method was added in Beautiful Soup 4.0.5.

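Tag.unwrap() is the opposite of wrap(): it replaces a tag with whatever is inside that tag. It is handy for stripping out markup:

markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup, "html.parser")
a_tag = soup.a
a_tag.i.unwrap()
a_tag
# <a href="http://example.com/">I linked to example.com</a>

Like replace_with(), unwrap() returns the tag that was removed.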

References

Official documentation

Chinese translation of the 4.4 documentation

furl: a detailed guide to the URL parsing library

2016-11-13T13:53:46.000Z

Introduction to furl
scheme://username:password@host:port/path?query#fragment

scheme is the scheme string (all lowercase) or None. None means no scheme. An empty string means a protocol relative URL, like //www.google.com.

username is the username string for authentication.

password is the password string for authentication with username.

host is the domain name, IPv4, or IPv6 address as a string. Domain names are all lowercase.

port is an integer or None. A value of None means no port specified and the default port for the given scheme should be inferred, if possible.

path is a Path object comprised of path segments.

query is a Query object comprised of query arguments.

fragment is a Fragment object comprised of a Path and Query object separated by an optional ? separator.

>>> from furl import furl
>>> f = furl('http://user:pass@www.google.com:99/')
>>> f.scheme, f.username, f.password, f.host, f.port
('http', 'user', 'pass', 'www.google.com', 99)
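
For comparison, the same URL anatomy can be inspected with the standard library alone. This is a sketch using urllib.parse.urlsplit rather than furl; the sample URL is made up for the example, and urlsplit returns plain strings where furl returns Path, Query and Fragment objects.

```python
from urllib.parse import urlsplit

# Hypothetical URL exercising every component of
# scheme://username:password@host:port/path?query#fragment
parts = urlsplit("http://user:pass@www.google.com:99/a/path?one=1#frag")

print(parts.scheme, parts.username, parts.password, parts.hostname, parts.port)
print(parts.path, parts.query, parts.fragment)
```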

Using furl
Port
furl infers the default port from the scheme. At the time of writing, only ftp, ssh, http and https are recognized:

>>> f = furl('https://secure.google.com/')
>>> f.port
443
>>> f = furl('unknown://www.google.com/')
>>> print(f.port)
None

netloc
>>> furl('http://www.google.com/').netloc
'www.google.com'
>>> furl('http://www.google.com:99/').netloc
'www.google.com:99'
>>> furl('http://user:pass@www.google.com:99/').netloc
'user:pass@www.google.com:99'
>>> furl('http://www.baidu.com?username=zzx').netloc
'www.baidu.com'

origin
>>> furl('http://www.google.com/').origin
'http://www.google.com'
>>> furl('http://www.google.com:99/').origin
'http://www.google.com:99'
>>> furl('http://www.baidu.com?username=zzx').origin
'http://www.baidu.com'

Path
>>> f = furl('http://www.google.com/a/large%20ish/path')
>>> f.path
Path('/a/large ish/path')
>>> f.path.segments
['a', 'large ish', 'path']
>>> str(f.path)
'/a/large%20ish/path'
>>> f.path.segments = ['a', 'new', 'path', '']
>>> str(f.path)
'/a/new/path/'
>>> f.path = 'o/hi/there/with%20some%20encoding/'
>>> f.path.segments
['o', 'hi', 'there', 'with some encoding', '']
>>> str(f.path)
'/o/hi/there/with%20some%20encoding/'
>>> f.url
'http://www.google.com/o/hi/there/with%20some%20encoding/'
>>> f.path.segments = ['segments', 'are', 'maintained', 'decoded', '^`<>[]"#/?']
>>> str(f.path)
'/segments/are/maintained/decoded/%5E%60%3C%3E%5B%5D%22%23%2F%3F'

Note that a trailing / in the URL is parsed as a final empty segment (''), because the path is then treated as a directory:

>>> f = furl('http://www.google.com/a/directory/')
>>> f.path.isdir
True
>>> f.path.isfile
False
>>> f = furl('http://www.google.com/a/file')
>>> f.path.isdir
False
>>> f.path.isfile
True

Normalizing a path:

>>> f = furl('http://www.google.com////a/./b/lolsup/../c/')
>>> f.path.normalize()
>>> f.url
'http://www.google.com/a/b/c/'
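
What normalization does to the path can be approximated with the standard library. The sketch below uses posixpath.normpath and is not furl's implementation; normalize_path is a hypothetical helper name, and the trailing-slash handling is an assumption added so that a directory path stays a directory.

```python
import posixpath

def normalize_path(path: str) -> str:
    """Collapse repeated slashes and resolve '.' and '..' segments."""
    normalized = posixpath.normpath(path)
    # normpath drops a trailing slash; put it back so '/a/b/c/' stays a directory.
    if path.endswith("/") and not normalized.endswith("/"):
        normalized += "/"
    return normalized

print(normalize_path("////a/./b/lolsup/../c/"))
```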

Query parameters
>>> f = furl('http://www.google.com/?one=1&two=2')
>>> f.query
Query('one=1&two=2')
>>> f.query.params
omdict1D([('one', '1'), ('two', '2')])
>>> str(f.query)
'one=1&two=2'
>>> f = furl('http://www.google.com/?one=1&two=2')
>>> f.query.params
omdict1D([('one', '1'), ('two', '2')])
>>> f.args
omdict1D([('one', '1'), ('two', '2')])
>>> f.args is f.query.params
True

Examples of working with the query attribute, including keys that have several values:

>>> f = furl('http://www.google.com/?space=jams&space=slams')
>>> f.args['space']
'jams'
>>> f.args.getlist('space')
['jams', 'slams']
>>> f.args.addlist('repeated', ['1', '2', '3'])
>>> str(f.query)
'space=jams&space=slams&repeated=1&repeated=2&repeated=3'
>>> f.args.popvalue('space')
'slams'
>>> f.args.popvalue('repeated', '2')
'2'
>>> str(f.query)
'space=jams&repeated=1&repeated=3'
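
The idea of a key carrying several values is not specific to furl. As a rough standard-library counterpart (an illustration, not furl's omdict1D), urllib.parse can parse a query string into ordered pairs or into lists per key, and rebuild it afterwards:

```python
from urllib.parse import parse_qs, parse_qsl, urlencode

query = "space=jams&space=slams"

# Ordered (key, value) pairs, duplicates preserved.
pairs = parse_qsl(query)

# Values grouped per key; roughly what getlist('space') returns in furl.
grouped = parse_qs(query)

# Add a key with three values, similar in spirit to addlist('repeated', [...]).
pairs.extend(("repeated", value) for value in ["1", "2", "3"])
rebuilt = urlencode(pairs)

print(grouped["space"])
print(rebuilt)
```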

An empty string ('') and None behave differently as parameter values: '' keeps the trailing =, while None omits it:

>>> f = furl('http://sprop.su')
>>> f.args['param'] = ''
>>> f.url
'http://sprop.su/?param='
>>> f = furl('http://sprop.su')
>>> f.args['param'] = None
>>> f.url
'http://sprop.su/?param'

Fragment
>>> f = furl('http://www.google.com/#/fragment/path?with=params')
>>> f.fragment
Fragment('/fragment/path?with=params')
>>> f.fragment.path
Path('/fragment/path')
>>> f.fragment.query
Query('with=params')
>>> f.fragment.separator
True
>>> f = furl('http://www.google.com/#/fragment/path?with=params')
>>> str(f.fragment)
'/fragment/path?with=params'
>>> f.fragment.path.segments.append('file.ext')
>>> str(f.fragment)
'/fragment/path/file.ext?with=params'
>>> f = furl('http://www.google.com/#/fragment/path?with=params')
>>> str(f.fragment)
'/fragment/path?with=params'
>>> f.fragment.args['new'] = 'yep'
>>> str(f.fragment)
'/fragment/path?new=yep&with=params'

Within a fragment, the separator between the path and the query is ?.

Encoding
>>> f = furl('http://www.google.com/')
>>> f.path = 'some encoding here'
>>> f.args['and some encoding'] = 'here, too'
>>> f.url
'http://www.google.com/some%20encoding%20here?and+some+encoding=here,+too'
>>> f.set(host=u'ドメイン.テスト', path=u'джк', query=u'☃=☺')
>>> f.url
'http://xn--eckwd4c7c.xn--zckzah/%D0%B4%D0%B6%D0%BA?%E2%98%83=%E2%98%BA'
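
The two encoding styles visible above (percent-encoding with %20 in the path, + for spaces in the query) can be reproduced with urllib.parse. This is a sketch of the general technique, not of furl's internals; passing safe="," is an assumption made here so that the comma stays unescaped as it does in the furl output.

```python
from urllib.parse import quote, urlencode

# Path segments use percent-encoding: a space becomes %20.
encoded_path = quote("some encoding here")

# Query strings use form encoding: a space becomes '+'.
encoded_query = urlencode({"and some encoding": "here, too"}, safe=",")

# Non-ASCII text is encoded as UTF-8 bytes, then percent-encoded.
encoded_cyrillic = quote("джк")

print(encoded_path)
print(encoded_query)
print(encoded_cyrillic)
```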

Inline manipulation
>>> from furl import furl
>>> f = furl('http://www.google.com/?one=1&two=2')
>>> f.args['three'] = '3'
>>> del f.args['one']
>>> f.url
'http://www.google.com/?two=2&three=3'
>>> furl('http://www.google.com/?one=1').add({'two':'2'}).url
'http://www.google.com/?one=1&two=2'
>>> furl('http://www.google.com/?one=1&two=2').set({'three':'3'}).url
'http://www.google.com/?three=3'
>>> furl('http://www.google.com/?one=1&two=2').remove(['one']).url
'http://www.google.com/?two=2'
>>> url = 'http://www.google.com/#fragment'
>>> furl(url).add(args={'example':'arg'}).set(port=99).remove(fragment=True).url
'http://www.google.com:99/?example=arg'
>>> f = furl().set(
... scheme='https', host='secure.google.com', port=99, path='index.html',
... args={'some':'args'}, fragment='great job')
>>> f.url
'https://secure.google.com:99/index.html?some=args#great%20job'
>>> url = 'https://secure.google.com:99/a/path/?some=args#great job'
>>> furl(url).remove(args=['some'], path='path/', fragment=True, port=True).url
'https://secure.google.com/a/'

copy
copy() creates and returns a new furl object with an identical URL.

>>> f = furl('http://www.google.com')
>>> f.copy().set(path='/new/path').url
'http://www.google.com/new/path'
>>> f.url
'http://www.google.com'

join
>>> f = furl('http://www.google.com')
>>> f.join('new/path').url
'http://www.google.com/new/path'
>>> f.join('replaced').url
'http://www.google.com/new/replaced'
>>> f.join('../parent').url
'http://www.google.com/parent'
>>> f.join('path?query=yes#fragment').url
'http://www.google.com/path?query=yes#fragment'
>>> f.join('unknown://www.yahoo.com/new/url/').url
'unknown://www.yahoo.com/new/url/'
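
In the session above each join() builds on the result of the previous one, because join() modifies f in place. The same relative-reference resolution is available in the standard library as urllib.parse.urljoin, which is a pure function, so in this sketch the intermediate result is passed along by hand:

```python
from urllib.parse import urljoin

base = "http://www.google.com"

step1 = urljoin(base, "new/path")      # relative path appended to the host
step2 = urljoin(step1, "replaced")     # replaces the last path segment
step3 = urljoin(step2, "../parent")    # '..' climbs one directory

for url in (step1, step2, step3):
    print(url)
```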

References

Official API documentation

GitHub

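Redis study notes

2016-11-12T03:20:36.000Z

Introduction to Redis
Redis is a fully open-source, free, high-performance key-value database released under the BSD license.

Features:

Redis keeps its entire data set in memory and uses the disk only for persistence.

Redis supports more than plain key-value strings: it also stores lists, sets, sorted sets (zset) and hashes.

Redis supports data replication in master-slave mode.

Advantages:

Very fast: Redis can perform roughly 110,000 SET operations and 81,000 GET operations per second.

Rich data types: Redis supports most of the data types developers already know, such as lists, sets, sorted sets and hashes. This makes many problems easy to solve, because it is usually clear which data type suits which problem.

Atomic operations: every Redis operation is atomic, so when two clients access the server concurrently, each one sees a consistent, up-to-date value.

Multi-purpose: Redis can be used as a cache or a message queue (it natively supports publish/subscribe), and for any short-lived application data such as web sessions or page-hit counters.

Redis natively supports only Linux, which is why the port at https://github.com/MSOpenTech/redis exists for Windows. Download and run the installer; you can set the maximum amount of memory Redis may use, and the configuration file in the installation directory documents further options.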

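Using Redis
Connecting to a Redis server
C:\WINDOWS\system32>redis-cli -h 127.0.0.1 -p 6379 -a "123" -n 0
127.0.0.1:6379>

-a is followed by the password and -n selects the database number; database 0 is the default.

If the server runs on the local machine on the default port 6379 with no password, you can connect with just:

C:\Users\zzx>redis-cli
127.0.0.1:6379>

Type quit to exit.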

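Data types
String
Strings are binary safe, meaning a Redis string can hold any kind of data, such as a JPEG image or a serialized object.

String is the most basic Redis data type; a single string value can be at most 512 MB.

127.0.0.1:6379> SET name 'zzx'
OK
127.0.0.1:6379> GET name
"zzx"

SET and GET may be written in lowercase, but uppercase is conventional because it makes Redis commands easy to spot.

Other commands: http://www.runoob.com/redis/redis-strings.html

For more commands, see the references at the end.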

Hash
A Redis hash is a map from string fields to string values, which makes it a good fit for storing objects.

127.0.0.1:6379> HMSET user:1 username zzx password 123 age 22
OK
127.0.0.1:6379> HGETALL user:1
1) "username"
2) "zzx"
3) "password"
4) "123"
5) "age"
6) "22"

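Here user:1 is the key.

Each hash can hold up to 2^32-1 field-value pairs (more than 4 billion).

Other commands: http://www.runoob.com/redis/redis-hashes.html

For more commands, see the references at the end.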

List
127.0.0.1:6379> LPUSH test_list this is
(integer) 2
127.0.0.1:6379> LPUSH test_list a
(integer) 3
127.0.0.1:6379> LPUSH test_list test for list
(integer) 6
127.0.0.1:6379> LRANGE test_list 0 6
1) "list"
2) "for"
3) "test"
4) "a"
5) "is"
6) "this"
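
The reversed order in the LRANGE output follows from LPUSH inserting each value at the head of the list, one at a time. A small sketch with collections.deque (an in-memory stand-in, not a Redis client; lpush is a hypothetical helper) reproduces the session:

```python
from collections import deque

test_list = deque()

def lpush(target, *values):
    """Push each value onto the head of the list, left to right; return the new length."""
    for value in values:
        target.appendleft(value)
    return len(target)

lengths = [
    lpush(test_list, "this", "is"),
    lpush(test_list, "a"),
    lpush(test_list, "test", "for", "list"),
]

print(lengths)
print(list(test_list))
```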

A list can hold up to 2^32-1 elements.

Other commands: http://www.runoob.com/redis/redis-lists.html

For more commands, see the references at the end.

Set
Sets are implemented with a hash table: members are unique and their order is not guaranteed.

127.0.0.1:6379> SADD test_set this is a
(integer) 3
127.0.0.1:6379> SADD test_set test for set
(integer) 3
127.0.0.1:6379> SADD test_set this
(integer) 0
127.0.0.1:6379> SMEMBERS test_set
1) "test"
2) "this"
3) "set"
4) "for"
5) "is"
6) "a"

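SADD returns the number of members actually added, so adding a member that is already in the set returns 0.

Other commands: http://www.runoob.com/redis/redis-sets.html

For more commands, see the references at the end.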

zset (sorted set)
Members are unique, as in a Set, but each member also carries a score, which can be thought of as a weight. Members are kept ordered by score rather than by insertion order:

127.0.0.1:6379> ZADD test_zset 1 this 2 is 3 a 4 test 0 for 7 zset
(integer) 6
127.0.0.1:6379> ZRANGE test_zset 0 7
1) "for"
2) "this"
3) "is"
4) "a"
5) "test"
6) "zset"
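
The ZRANGE result above is simply the members sorted by score. The following sketch models that with a plain dict; zrange here is a hypothetical helper that handles only non-negative indexes, and it breaks ties between equal scores by member name, which is how Redis orders them.

```python
# member -> score, as given to ZADD above
test_zset = {"this": 1, "is": 2, "a": 3, "test": 4, "for": 0, "zset": 7}

def zrange(zset, start, stop):
    """Return members ordered by (score, member); stop is inclusive, as in Redis."""
    ordered = sorted(zset, key=lambda member: (zset[member], member))
    return ordered[start:stop + 1]

print(zrange(test_zset, 0, 7))
```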

Other commands: http://www.runoob.com/redis/redis-sorted-sets.html

For more commands, see the references at the end.

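redis key
In the examples above, name, user:1, test_list, test_set and test_zset are all keys.

Keys can be deleted with the DEL command. Commonly used key commands:

No. Command: description

1 DEL key: Deletes key if it exists.

2 DUMP key: Serializes the value stored at key and returns the serialized value.

3 EXISTS key: Checks whether key exists.

4 EXPIRE key seconds: Sets a time to live on key, in seconds.

5 EXPIREAT key timestamp: Like EXPIRE, sets an expiry on key, but takes an absolute UNIX timestamp instead of a number of seconds.

6 PEXPIRE key milliseconds: Sets a time to live on key, in milliseconds.

7 PEXPIREAT key milliseconds-timestamp: Sets the expiry of key as a UNIX timestamp in milliseconds.

8 KEYS pattern: Finds all keys matching the given pattern.

9 MOVE key db: Moves key from the current database to database db.

10 PERSIST key: Removes the expiry from key, so that the key persists.

11 PTTL key: Returns the remaining time to live of key, in milliseconds.

12 TTL key: Returns the remaining time to live (TTL) of key, in seconds.

13 RANDOMKEY: Returns a random key from the current database.

14 RENAME key newkey: Renames key to newkey.

15 RENAMENX key newkey: Renames key to newkey only if newkey does not already exist.

16 TYPE key: Returns the type of the value stored at key.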

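Transactions

No. Command: description

1 DISCARD: Cancels the transaction, discarding all commands queued in the transaction block.

2 EXEC: Executes all commands queued in the transaction block.

3 MULTI: Marks the start of a transaction block.

4 UNWATCH: Stops watching all keys previously watched with WATCH.

5 WATCH key [key …]: Watches one or more keys; if any of them is modified by another command before the transaction executes, the transaction is aborted.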

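Redis server
The INFO command returns information and statistics about the Redis server.

Redis backup and restore
127.0.0.1:6379> save
OK

This command creates a dump.rdb file in the Redis directory.

To restore the data, move the backup file dump.rdb into that directory and start the server. The directory can be found with the CONFIG command:

127.0.0.1:6379> CONFIG GET dir
1) "dir"
2) "D:\\Redis"

A backup can also be created with BGSAVE, which runs in the background.

Deleting data
FLUSHDB clears the current database; FLUSHALL clears every database on the Redis server.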

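Redis pipelining
Redis is a TCP service based on a client-server model and a request/response protocol. Normally a request goes through the following steps:

The client sends a query to the server and waits on the socket, usually in blocking mode, for the server's response.

The server processes the command and sends the result back to the client.

With pipelining, the client can keep sending requests without waiting for the earlier responses, and then read all of the server's responses in one go.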
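
On the wire, pipelining amounts to writing several encoded commands back to back before reading any reply. The sketch below only builds such a payload in the Redis serialization protocol (RESP) and opens no connection; encode_command and the sample commands are illustrative names chosen for this example. In practice a client library such as redis-py does this for you through its pipeline() object.

```python
def encode_command(*args: str) -> bytes:
    """Encode one command as a RESP array of bulk strings."""
    parts = [b"*%d\r\n" % len(args)]
    for arg in args:
        data = arg.encode("utf-8")
        parts.append(b"$%d\r\n%s\r\n" % (len(data), data))
    return b"".join(parts)

# Hypothetical batch of commands to send in a single write.
commands = [("SET", "name", "zzx"), ("GET", "name"), ("INCR", "hits")]

# Without pipelining: one write and one blocking read per command.
# With pipelining: one write carrying all commands, then read all replies.
payload = b"".join(encode_command(*command) for command in commands)

print(payload)
```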

redis-py
redis-py is a Redis client implemented in Python. For usage, see:

https://github.com/andymccurdy/redis-py#scan-iterators

https://redis-py.readthedocs.io/en/latest/

References

https://github.com/MSOpenTech/redis

redis-py

Memory Configuration For Redis 3.0: https://github.com/MSOpenTech/redis/wiki/Memory-Configuration-For-Redis-3.0

Installing and using Redis on Windows

Redis tutorial

Redis quick start

Redis command reference

Command reference - Redis

PyMongo: a detailed guide to the MongoDB driver

2016-11-09T14:09:36.000Z

Introduction to PyMongo
PyMongo is MongoDB's official library for Python. It acts as the database client, so a MongoDB server must be installed as well; on Windows, follow the instructions in Install MongoDB Community Edition on Windows.

Then run the following in a cmd window with administrator privileges:

"D:\MongoDB\Server\3.2\bin\mongod.exe" --config "D:\MongoDB\Server\3.2\mongod.cfg" --install --serviceName "MongoDB"

This registers MongoDB as a system service. The contents of mongod.cfg:

systemLog:
    destination: file
    path: F:\cookies\MongoDB\log\mongod.log
storage:
    dbPath: F:\cookies\MongoDB\database

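Using PyMongo
These notes target PyMongo 3.3.1.

Creating a client
>>> from pymongo import MongoClient
>>> client = MongoClient()

If no host and port are given, the client defaults to port 27017 on localhost, which is equivalent to:

>>> client = MongoClient('localhost', 27017)

Creating a database
>>> db = client.test_database

Attribute access returns the test_database database. If it does not exist yet, it is created, though only once data is first written to it. Dictionary-style access works too, and is needed when the name is not a valid Python attribute name:

>>> db = client['test-database']

Creating a collection
>>> collection = db.my_collection
>>> collection = db['my-collection']

This works the same way as for databases.

Listing collections
>>> db.collection_names()
['my_collection']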

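Inserting documents
None of the operations so far create any data files. Only when a document is inserted does PyMongo contact the server and the corresponding data files get created:

>>> db.my_collection.insert_one({"x": 10})
<pymongo.results.InsertOneResult object at 0x2248b7f3af8>

You can also capture the id of the inserted document:

>>> my_document_id = db.my_collection.insert_one({"x": 10}).inserted_id
>>> my_document_id
ObjectId('582404cff67a2f29f4cb8565')

insert_many() inserts several documents at once.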

Finding documents
find_one() returns the first document that matches the query:

>>> db.my_collection.find_one({"x":10})
{'_id': ObjectId('582404adf67a2f29f4cb8564'), 'x': 10}

Look a document up by its id:

>>> db.my_collection.find_one({"_id":my_document_id})
{'_id': ObjectId('582404cff67a2f29f4cb8565'), 'x': 10}

Or construct the ObjectId explicitly:

>>> from bson.objectid import ObjectId
>>> db.my_collection.find_one({"_id":ObjectId('582404cff67a2f29f4cb8565')})

Print every document that matches:

for document in db.my_collection.find({"x":10}).sort("_id"):
    print(document)

{'x': 10, '_id': ObjectId('582404adf67a2f29f4cb8564')}
{'x': 10, '_id': ObjectId('582404cff67a2f29f4cb8565')}

sort("_id") sorts the results by the _id field.

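Counting
Get the number of documents in a collection:

>>> db.my_collection.count()
2

Creating indexes
First change x to 11 in the document with ObjectId('582404adf67a2f29f4cb8564'), so that the values of x are unique.

Then create a unique index on x:

>>> import pymongo
>>> db.my_collection.create_index([('x', pymongo.ASCENDING)],unique=True)
'x_1'

List all indexes:

>>> list(db.my_collection.index_information())
['_id_', 'x_1']

The '_id_' index is created automatically on _id.

Other basic operations such as update and delete follow the same syntax as the mongo shell and are not repeated here.

That covers the basics of PyMongo. For more advanced operations, see:

http://api.mongodb.com/python/3.3.1/examples/aggregation.html

API reference:

http://api.mongodb.com/python/3.3.1/api/index.html

References

PyMongo github README

PyMongo 3.3.1 doc

Introduction to MongoDB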

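MongoDB study notes

2016-11-09T04:08:29.000Z

Introduction to MongoDB

MongoDB is a document-oriented database. In a relational database such as MySQL the table schema is fixed: to attach additional attributes you typically add another table and link it with a foreign key, so queries need joins across tables, which is slower. The more complex the structure, the more tables there are, and the more tightly coupled and harder to follow they become. A document database instead treats each record as a document stored in a JSON-like format, and each document may have a different structure. A document holds all the information related to its record, so in object-oriented terms it is an object; a collection of documents corresponds to a set of relational records, that is, a table.

Using MongoDB
Creating a database
use DATABASE_NAME

Dropping a database
db.dropDatabase()

The db variable refers to the database currently in use.

Creating a collection
db.createCollection(name, options)

Here name is the name of the collection to create, and options is a document specifying the collection's configuration.

Dropping a collection
db.COLLECTION_NAME.drop()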

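Data types

String: the most commonly used type for storing data. Strings in MongoDB must be valid UTF-8.

Integer: stores numeric values. Integers can be 32-bit or 64-bit depending on your server.

Boolean: stores a boolean (true/false) value.

Double: stores floating-point values.

Min/Max keys: used to compare a value against the lowest and highest BSON (binary JSON) elements.

Arrays: stores arrays, lists or multiple values under a single key.

Timestamp: useful for recording when a document was modified or added.

Object: used for embedded documents.

Null: stores a null value.

Symbol: used like a string, but generally reserved for languages that have a specific symbol type.

Date: stores the current date or time in UNIX time format. You can specify your own date and time by creating a Date object and passing the year, month and day into it.

Object ID: stores a document's ID.

Binary data: stores binary data.

Code: stores JavaScript code in a document.

Regular expression: stores a regular expression.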

Inserting documents
db.COLLECTION_NAME.insert(document)

db.mycol.insert({
    _id: ObjectId(7df78ad8902c),
    title: 'MongoDB Overview',
    description: 'MongoDB is no sql database',
    by: 'tutorials point',
    url: 'http://www.tutorialspoint.com',
    tags: ['mongodb', 'database', 'NoSQL'],
    likes: 100
})

If the inserted document does not specify an _id, MongoDB assigns the document a unique ID automatically.

Querying documents
db.COLLECTION_NAME.find()

db.mycol.find().pretty()
{
    "_id": ObjectId(7df78ad8902c),
    "title": "MongoDB Overview",
    "description": "MongoDB is no sql database",
    "by": "tutorials point",
    "url": "http://www.tutorialspoint.com",
    "tags": ["mongodb", "database", "NoSQL"],
    "likes": 100
}

pretty() displays the results in a formatted way. Besides find(), there is also findOne(), which returns only one document.

Operation | Syntax | Example | RDBMS equivalent

Equal | {<key>: <value>} | db.mycol.find({"by":"tutorials point"}).pretty() | where by = 'tutorials point'

Less than | {<key>: {$lt: <value>}} | db.mycol.find({"likes":{$lt:50}}).pretty() | where likes < 50

Less than or equal | {<key>: {$lte: <value>}} | db.mycol.find({"likes":{$lte:50}}).pretty() | where likes <= 50

Greater than | {<key>: {$gt: <value>}} | db.mycol.find({"likes":{$gt:50}}).pretty() | where likes > 50

Greater than or equal | {<key>: {$gte: <value>}} | db.mycol.find({"likes":{$gte:50}}).pretty() | where likes >= 50

Not equal | {<key>: {$ne: <value>}} | db.mycol.find({"likes":{$ne:50}}).pretty() | where likes != 50

AND conditions are expressed by separating the keys with commas:

db.mycol.find({key1:value1, key2:value2}).pretty()

OR conditions use the $or operator:

db.mycol.find(
    {
        $or: [
            {key1: value1}, {key2:value2}
        ]
    }
).pretty()
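
To make the semantics of these query documents concrete, here is a toy matcher in Python. It is only a sketch of how the comparison operators, the implicit AND and $or combine; it is not how MongoDB evaluates queries, and the matches function, the sample documents and the treatment of missing fields are assumptions made for the illustration.

```python
import operator

# Comparison operators from the table above, mapped to Python functions.
COMPARISONS = {
    "$lt": operator.lt,
    "$lte": operator.le,
    "$gt": operator.gt,
    "$gte": operator.ge,
    "$ne": operator.ne,
}

def matches(doc, query):
    """Return True if doc satisfies every condition in query (implicit AND)."""
    for key, condition in query.items():
        if key == "$or":
            # $or takes a list of sub-queries; at least one must match.
            if not any(matches(doc, sub) for sub in condition):
                return False
        elif isinstance(condition, dict):
            value = doc.get(key)
            for op, operand in condition.items():
                if value is None:
                    # Simplification: a missing field only satisfies $ne.
                    if op != "$ne":
                        return False
                elif not COMPARISONS[op](value, operand):
                    return False
        elif doc.get(key) != condition:
            return False
    return True

# Hypothetical sample documents.
docs = [
    {"title": "MongoDB Overview", "by_user": "tutorials point", "likes": 100},
    {"title": "NoSQL Overview", "by_user": "tutorials point", "likes": 10},
    {"title": "Neo4j Overview", "by_user": "Neo4j", "likes": 750},
]

cheap = [d["title"] for d in docs if matches(d, {"likes": {"$lt": 50}})]
either = [d["title"] for d in docs
          if matches(d, {"$or": [{"by_user": "Neo4j"}, {"likes": {"$lte": 10}}]})]
both = [d["title"] for d in docs
        if matches(d, {"by_user": "tutorials point", "likes": {"$gte": 50}})]
print(cheap, either, both)
```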

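Updating documents
db.COLLECTION_NAME.update(SELECTION_CRITERIA, UPDATED_DATA)

Both update() and save() can modify documents in a collection. update() changes values in an existing document, while save() replaces the existing document with the document passed to it.

By default MongoDB updates only a single document; to update several documents, set the multi parameter to true.

db.mycol.update({'title':'MongoDB Overview'},{$set:{'title':'New MongoDB Tutorial'}},{multi:true})

db.mycol.update({title:'Seven'}, {$inc:{likes:2}})

$inc increments the value of likes by 2.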
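
The effect of the two update operators used above can be sketched on a plain Python dict. apply_update is a hypothetical helper written for this illustration; it covers only $set and $inc and returns a modified copy rather than changing anything in a database.

```python
def apply_update(doc, update):
    """Return a copy of doc with $set and $inc applied."""
    updated = dict(doc)
    # $set overwrites (or creates) the listed fields.
    for field, value in update.get("$set", {}).items():
        updated[field] = value
    # $inc adds to a numeric field; a missing field starts from 0.
    for field, amount in update.get("$inc", {}).items():
        updated[field] = updated.get(field, 0) + amount
    return updated

movie = {"title": "Seven", "likes": 5}
after_inc = apply_update(movie, {"$inc": {"likes": 2}})
after_set = apply_update(movie, {"$set": {"title": "Se7en"}})
print(after_inc, after_set)
```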

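Deleting documents
db.COLLECTION_NAME.remove(DELETION_CRITERIA)

If several records match and you only want to delete the first one, set the justOne parameter of remove():

db.COLLECTION_NAME.remove(DELETION_CRITERIA,1)

If no deletion criteria are specified, MongoDB deletes every document in the collection.

db.COLLECTION_NAME.remove()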

Projection
find(), introduced above, returns every field of each matching document by default. To limit the output, pass a second argument that lists fields with a value of 1 or 0: 1 shows a field and 0 hides it.

db.COLLECTION_NAME.find({},{KEY:1})

Suppose the mycol collection contains the following data:

{ "_id" : ObjectId(5983548781331adf45ec5), "title":"MongoDB Overview"}
{ "_id" : ObjectId(5983548781331adf45ec6), "title":"NoSQL Overview"}
{ "_id" : ObjectId(5983548781331adf45ec7), "title":"Tutorials Point Overview"}

db.mycol.find({},{"title":1,_id:0})
{"title":"MongoDB Overview"}
{"title":"NoSQL Overview"}
{"title":"Tutorials Point Overview"}

Note: find() always includes the _id field unless you explicitly set it to 0.

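Limiting records
db.COLLECTION_NAME.find().limit(NUMBER)

Sorting records
Documents are sorted with the sort() method. It takes a document listing the fields to sort by, with 1 for ascending and -1 for descending order.

db.COLLECTION_NAME.find().sort({KEY:1})

Indexes
db.COLLECTION_NAME.ensureIndex({KEY:1})

Here KEY is the name of the field to index; 1 indexes the field values in ascending order and -1 in descending order. Since MongoDB 3.0, ensureIndex() is a deprecated alias for createIndex().

Get index information:

db.mycol.getIndexes()

This returns all indexes, including their names.

Drop an index:

db.mycol.dropIndex('index_name')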

Aggregation
Roughly equivalent to GROUP BY in a relational database.

db.COLLECTION_NAME.aggregate(AGGREGATE_OPERATION)

Suppose the collection contains:

{
    _id: ObjectId(7df78ad8902c),
    title: 'MongoDB Overview',
    description: 'MongoDB is no sql database',
    by_user: 'tutorials point',
    url: 'http://www.tutorialspoint.com',
    tags: ['mongodb', 'database', 'NoSQL'],
    likes: 100
},
{
    _id: ObjectId(7df78ad8902d),
    title: 'NoSQL Overview',
    description: 'No sql database is very fast',
    by_user: 'tutorials point',
    url: 'http://www.tutorialspoint.com',
    tags: ['mongodb', 'database', 'NoSQL'],
    likes: 10
},
{
    _id: ObjectId(7df78ad8902e),
    title: 'Neo4j Overview',
    description: 'Neo4j is no sql database',
    by_user: 'Neo4j',
    url: 'http://www.neo4j.com',
    tags: ['neo4j', 'database', 'NoSQL'],
    likes: 750
}

To produce a list showing how many tutorials each user has written, call aggregate() like this:

db.mycol.aggregate([{$group : {_id : "$by_user", num_tutorial : {$sum : 1}}}])
{
    "result" : [
        {
            "_id" : "tutorials point",
            "num_tutorial" : 2
        },
        {
            "_id" : "Neo4j",
            "num_tutorial" : 1
        }
    ],
    "ok" : 1
}
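
What the $group stage computes here is an ordinary group-and-count. As a sketch in plain Python (not the aggregation framework itself), the same numbers fall out of a Counter keyed on by_user, and $sum over "$likes" corresponds to adding up likes per group. The documents are trimmed copies of the three shown above.

```python
from collections import Counter, defaultdict

docs = [
    {"by_user": "tutorials point", "likes": 100},
    {"by_user": "tutorials point", "likes": 10},
    {"by_user": "Neo4j", "likes": 750},
]

# {$group: {_id: "$by_user", num_tutorial: {$sum: 1}}}
num_tutorial = Counter(doc["by_user"] for doc in docs)

# {$group: {_id: "$by_user", total_likes: {$sum: "$likes"}}}
total_likes = defaultdict(int)
for doc in docs:
    total_likes[doc["by_user"]] += doc["likes"]

print(dict(num_tutorial))
print(dict(total_likes))
```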

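Expression | Description | Example

$sum | Sums the given value over all documents in each group | db.mycol.aggregate([{$group : {_id : "$by_user", num_tutorial : {$sum : "$likes"}}}])

$avg | Averages the given value over all documents in each group | db.mycol.aggregate([{$group : {_id : "$by_user", num_tutorial : {$avg : "$likes"}}}])

$min | Returns the minimum of the corresponding values over all documents in each group | db.mycol.aggregate([{$group : {_id : "$by_user", num_tutorial : {$min : "$likes"}}}])

$max | Returns the maximum of the corresponding values over all documents in each group | db.mycol.aggregate([{$group : {_id : "$by_user", num_tutorial : {$max : "$likes"}}}])

$push | Appends the value to an array in the resulting document | db.mycol.aggregate([{$group : {_id : "$by_user", url : {$push: "$url"}}}])

$addToSet | Appends the value to an array in the resulting document, without creating duplicates | db.mycol.aggregate([{$group : {_id : "$by_user", url : {$addToSet : "$url"}}}])

$first | Takes the value from the first document in each group. This is only meaningful if the documents were previously ordered by a $sort stage. | db.mycol.aggregate([{$group : {_id : "$by_user", first_url : {$first : "$url"}}}])

$last | Takes the value from the last document in each group. This is only meaningful if the documents were previously ordered by a $sort stage. | db.mycol.aggregate([{$group : {_id : "$by_user", last_url : {$last : "$url"}}}])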

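Transactions (atomic operations)
db.mycol.findAndModify(
    {
        query:{'title':'Forrest Gump'},
        update:{$inc:{likes:10}}
    }
)

query selects the matching document, just as in find(), and update modifies it, here by incrementing likes. Note that the MongoDB versions covered by these notes only support atomic operations on a single document (multi-document transactions arrived later, in MongoDB 4.0), so if query matches more than one document, only the first is modified.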

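Regular expressions
db.mycol.find({title:/.*b$/}).pretty()

The match above is case sensitive. To make it case insensitive:

db.mycol.find({title:{$regex:'fight.*b',$options:'i'}}).pretty()

The i option stands for case-insensitive, so a title containing "Fight" matches as well as one containing lowercase "fight".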
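
The same distinction exists in Python's re module, which makes for a quick way to try a pattern before using it in a query. This is a sketch on made-up titles; re.search is used because, like $regex, it matches anywhere in the string unless the pattern is anchored.

```python
import re

# Hypothetical titles for the illustration.
titles = ["Fight Club", "fight club", "Forrest Gump"]

case_sensitive = re.compile(r"fight.*b")
case_insensitive = re.compile(r"fight.*b", re.IGNORECASE)

strict = [t for t in titles if case_sensitive.search(t)]
relaxed = [t for t in titles if case_insensitive.search(t)]

print(strict)
print(relaxed)
```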

References

A minimal hands-on introduction to MongoDB

Jikexueyuan MongoDB tutorial

https://university.mongodb.com/

MongoDB 3.2 documentation (Chinese translation)

MongoDB Tutorials

Install MongoDB Community Edition on Windows

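dataset: a detailed guide to the simple database toolkit

2016-11-08T05:33:15.000Z

Introduction to dataset
dataset bills itself as "databases for lazy people". Its documentation points out that many programmers store data in CSV or JSON files, which are hard to query and update, rather than in a database, mainly because database code tends to be complicated. dataset addresses exactly this problem by giving programmers a more convenient way to work with databases.

import dataset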
db = dataset.connect('sqlite:///:memory:')
table = db['sometable']
table.insert(dict(name='John Doe', age=37))
table.insert(dict(name='Jane Doe', age=34, gender='female'))
john = table.find_one(name='John Doe')

Features:

Automatic schema: If a table or column is written that does not exist in the database, it will be created automatically.

Upserts: Records are either created or updated, depending on whether an existing version can be found.

Query helpers for simple queries such as all rows in a table or all distinct values across a set of columns.

Compatibility: Being built on top of SQLAlchemy, dataset works with all major databases, such as SQLite, PostgreSQL and MySQL.

Scripted exports: Data can be exported based on a scripted configuration, making the process easy and replicable.

Using dataset
Connecting to a database
import dataset
# connecting to a SQLite database
db = dataset.connect('sqlite:///mydatabase.db')
# connecting to a MySQL database with user and password
db = dataset.connect('mysql://user:password@localhost/mydatabase')
# connecting to a PostgreSQL database
db = dataset.connect('postgresql://scott:tiger@localhost:5432/mydatabase')

Inserting data
dataset creates tables and columns automatically based on the data you insert:

# get a reference to the table 'user'
table = db['user']
# table = db.get_table('user')
# Insert a new record.
table.insert(dict(name='John Doe', age=46, country='China'))
# dataset will create "missing" columns any time you insert a dict with an unknown key
table.insert(dict(name='Jane Doe', age=37, country='France', gender='female'))

This produces the following table (the primary key id is generated automatically):

id  country  name      age  gender
1   China    John Doe  46
2   France   Jane Doe  37   female

Updating records
table.update(dict(name='John Doe', age=47), ['name'])

The second argument plays the role of the WHERE clause in an SQL UPDATE statement: it lists the columns used to select the records to update.

Transactions
Transactions can be used simply through a context manager; if an exception is raised, the transaction is rolled back:

with dataset.connect() as tx:
    tx['user'].insert(dict(name='John Doe', age=46, country='China'))

This is equivalent to:

db = dataset.connect()
db.begin()
try:
    db['user'].insert(dict(name='John Doe', age=46, country='China'))
    db.commit()
except:
    db.rollback()

Transactions can also be nested:

db = dataset.connect()
with db as tx1:
    tx1['user'].insert(dict(name='John Doe', age=46, country='China'))
    with db as tx2:
        tx2['user'].insert(dict(name='Jane Doe', age=37, country='France', gender='female'))
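
The commit-or-rollback behaviour of the context manager can be seen in isolation with the standard-library sqlite3 module. This is a sketch of the same pattern, not of dataset itself; the table layout mirrors the user table above, and the RuntimeError is raised on purpose to simulate a failure.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE user (name TEXT, age INTEGER, country TEXT)")
conn.commit()

def count_users():
    return conn.execute("SELECT COUNT(*) FROM user").fetchone()[0]

# 'with conn' commits if the block succeeds and rolls back if it raises.
try:
    with conn:
        conn.execute("INSERT INTO user VALUES (?, ?, ?)", ("John Doe", 46, "China"))
        raise RuntimeError("simulated failure inside the transaction")
except RuntimeError:
    pass
count_after_rollback = count_users()

with conn:
    conn.execute("INSERT INTO user VALUES (?, ?, ?)", ("Jane Doe", 37, "France"))
count_after_commit = count_users()

print(count_after_rollback, count_after_commit)
```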

Other operations
>>> print(db)
>>> print(db.tables)
['user']
>>> print(db['user'].columns)
['id', 'country', 'name', 'age', 'gender']
>>> print(len(db['user']))
2
>>> table = db['user']
>>> table
>>> table.table
Table('user', MetaData(bind=Engine(sqlite:///mydatabase.db)), Column('id', INTEGER(), table=, primary_key=True, nullable=False), Column('country', TEXT(), table=), Column('name', TEXT(), table=), Column('age', INTEGER(), table=), Column('gender', TEXT(), table=), schema=None)

Reading data from a table
>>> users = db['user'].all()
>>> users
<dataset.persistence.util.ResultIter object at 0x157c27ef978>
>>> for user in db['user']:
...     print(user)
...
OrderedDict([('id', 1), ('country', 'China'), ('name', 'John Doe'), ('age', 47), ('gender', None)])
OrderedDict([('id', 2), ('country', 'France'), ('name', 'Jane Doe'), ('age', 37), ('gender', 'female')])
>>> chinese_users = table.find(country='China')
>>> chinese_users
<dataset.persistence.util.ResultIter object at 0x157c2816978>
>>> john = table.find_one(name='John Doe')
>>> john
OrderedDict([('id', 1),
             ('country', 'China'),
             ('name', 'John Doe'),
             ('age', 47),
             ('gender', None)])
>>> elderly_users = table.find(table.table.columns.age >= 70)

Getting distinct values

# Get one user per country
db['user'].distinct('country')

Deleting records
table.delete(place='Berlin')

Running SQL statements
result = db.query('SELECT country, COUNT(*) c FROM user GROUP BY country')
for row in result:
    print(row['country'], row['c'])

Exporting data
# export all users into a single JSON
result = db['users'].all()
dataset.freeze(result, format='json', filename='users.json')

References

dataset official documentation

PyMySQL: a detailed usage guide

2016-11-06T12:19:12.000Z

Introduction to PyMySQL
PyMySQL is a convenient Python library for connecting to MySQL. The example on the official site is very simple, but the source code shows there is a lot more to the library. Many functions are undocumented, so you end up reading the source when you need them. Judging by the number of stars the project has on GitHub, it is a well-known library.

Using PyMySQL
import pymysql.cursors

# Connect to the database
connection = pymysql.connect(host='localhost',
                             user='user',
                             password='passwd',
                             db='db',
                             charset='utf8mb4',
                             cursorclass=pymysql.cursors.DictCursor)

try:
    with connection.cursor() as cursor:
        # Create a new record
        sql = "INSERT INTO `users` (`email`, `password`) VALUES (%s, %s)"
        cursor.execute(sql, ('webmaster@python.org', 'very-secret'))

    # connection is not autocommit by default. So you must commit to save
    # your changes.
    connection.commit()

    with connection.cursor() as cursor:
        # Read a single record
        sql = "SELECT `id`, `password` FROM `users` WHERE `email`=%s"
        cursor.execute(sql, ('webmaster@python.org',))
        result = cursor.fetchone()
        print(result)
finally:
    connection.close()

Result:

{'password': 'very-secret', 'id': 1}

References

PyMySQL official documentation

geopy: a detailed guide to the geocoding library

2016-11-06T06:39:39.000Z

Introduction to geopy
geopy can look up addresses, countries, cities and landmarks. It obtains geographic information from third-party geocoders (including Google Maps, Bing Maps and Nominatim) and other data sources.

Each geolocation service you might use, such as Google Maps, Bing Maps, or Yahoo BOSS, has its own class in geopy.geocoders abstracting the service's API. Geocoders each define at least a geocode method, for resolving a location from a string, and may define a reverse method, which resolves a pair of coordinates to an address.

Using geopy
Getting a Location object from an address string
>>> from geopy.geocoders import Nominatim
>>> geolocator = Nominatim()
>>> location = geolocator.geocode("175 5th Avenue NYC")
>>> print(location.address)
Flatiron Building, 175, 5th Avenue, Flatiron, New York, NYC, New York, ...
>>> print((location.latitude, location.longitude))
(40.7410861, -73.9896297241625)
>>> print(location.raw)
{'place_id': '9167009604', 'type': 'attraction', ...}

Getting a Location object from latitude and longitude
>>> from geopy.geocoders import Nominatim
>>> geolocator = Nominatim()
>>> location = geolocator.reverse("52.509669, 13.376294")
>>> print(location.address)
Potsdamer Platz, Mitte, Berlin, 10117, Deutschland, European Union
>>> print((location.latitude, location.longitude))
(52.5094982, 13.3765983)
>>> print(location.raw)
{'place_id': '654513', 'osm_type': 'node', ...}

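Calculating the distance between two points
geopy can compute either the Vincenty distance or the great-circle distance. The examples here use the geopy 1.x API; vincenty was removed in geopy 2.0 in favor of geodesic.

>>> from geopy.distance import vincenty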
>>> newport_ri = (41.49008, -71.312796)
>>> cleveland_oh = (41.499498, -81.695391)
>>> print(vincenty(newport_ri, cleveland_oh).miles)
538.3904451566326

>>> from geopy.distance import great_circle
>>> newport_ri = (41.49008, -71.312796)
>>> cleveland_oh = (41.499498, -81.695391)
>>> print(great_circle(newport_ri, cleveland_oh).miles)
537.1485284062816
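
To make the great-circle figure less of a black box, here is a minimal sketch of the haversine formula in plain Python. It is not geopy's implementation: the function name great_circle_km and the mean Earth radius of 6371.009 km are assumptions chosen for this illustration, so the result should only land close to the value printed above.

```python
import math

EARTH_RADIUS_KM = 6371.009  # assumed mean Earth radius for this sketch
KM_PER_MILE = 1.609344


def great_circle_km(p1, p2):
    """Haversine distance between two (lat, lon) pairs given in degrees."""
    lat1, lon1 = map(math.radians, p1)
    lat2, lon2 = map(math.radians, p2)
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    # Haversine of the central angle between the two points
    a = (math.sin(dlat / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


newport_ri = (41.49008, -71.312796)
cleveland_oh = (41.499498, -81.695391)
miles = great_circle_km(newport_ri, cleveland_oh) / KM_PER_MILE
print(round(miles, 1))  # roughly 537 miles
```

The Vincenty method models the Earth as an ellipsoid instead of a sphere, which is why its result differs by about a mile over this distance.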

Third-party geocoding service APIs
ArcGIS

class geopy.geocoders.ArcGIS(username=None, password=None, referer=None, token_lifetime=60, scheme='https', timeout=1, proxies=None, user_agent=None)

Parameter details

Baidu

class geopy.geocoders.Baidu(api_key, scheme='http', timeout=1, proxies=None, user_agent=None)

Parameter details

Bing

class geopy.geocoders.Bing(api_key, format_string='%s', scheme='https', timeout=1, proxies=None, user_agent=None)

Parameter details

DataBC

class geopy.geocoders.DataBC(scheme='https', timeout=1, proxies=None, user_agent=None)

Parameter details

GeocodeFarm

class geopy.geocoders.GeocodeFarm(api_key=None, format_string='%s', timeout=1, proxies=None, user_agent=None)

Parameter details

GeocoderDotUS

class geopy.geocoders.GeocoderDotUS(username=None, password=None, format_string='%s', timeout=1, proxies=None, user_agent=None)

Parameter details

GeoNames

class geopy.geocoders.GeoNames(country_bias=None, username=None, timeout=1, proxies=None, user_agent=None)

Parameter details

GoogleV3

class geopy.geocoders.GoogleV3(api_key=None, domain='maps.googleapis.com', scheme='https', client_id=None, secret_key=None, timeout=1, proxies=None, user_agent=None)

Parameter details

IGNFrance

class geopy.geocoders.IGNFrance(api_key, username=None, password=None, referer=None, domain='wxs.ign.fr', scheme='https', timeout=1, proxies=None, user_agent=None)

Parameter details

LiveAddress

class geopy.geocoders.LiveAddress(auth_id, auth_token, candidates=1, scheme='https', timeout=1, proxies=None, user_agent=None)

Parameter details

NaviData

class geopy.geocoders.NaviData(api_key=None, domain='api.navidata.pl', timeout=1, proxies=None, user_agent=None)

Parameter details

Nominatim

class geopy.geocoders.Nominatim(format_string='%s', view_box=None, country_bias=None, timeout=1, proxies=None, domain='nominatim.openstreetmap.org', scheme='https', user_agent=None)

Parameter details

OpenCage

class geopy.geocoders.OpenCage(api_key, domain='api.opencagedata.com', scheme='https', timeout=1, proxies=None, user_agent=None)

Parameter details

OpenMapQuest

class geopy.geocoders.OpenMapQuest(api_key=None, format_string='%s', scheme='https', timeout=1, proxies=None, user_agent=None)

Parameter details

Photon

class geopy.geocoders.Photon(format_string='%s', scheme='https', timeout=1, proxies=None, domain='photon.komoot.de')

Parameter details

YahooPlaceFinder

class geopy.geocoders.YahooPlaceFinder(consumer_key, consumer_secret, timeout=1, proxies=None, user_agent=None)

Parameter details

What3Words

class geopy.geocoders.What3Words(api_key, format_string='%s', scheme='https', timeout=1, proxies=None, user_agent=None)

Parameter details

Yandex

class geopy.geocoders.Yandex(api_key=None, lang=None, timeout=1, proxies=None, user_agent=None)

Parameter details

References

geopy official documentation

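A detailed guide to moviepy, the video processing library

2016-11-05T07:36:48.000Z

Introduction to moviepy
moviepy can cut, concatenate, and add titles to audio, video, and GIF images, and it supports many formats.

moviepy is built on ffmpeg. If ffmpeg is not installed, moviepy downloads and installs it automatically the first time moviepy is used. If ffmpeg is already installed on the machine, it is recommended to set FFMPEG_BINARY = 'auto-detect' in config_defaults.py.

The other tools are optional, so install them only if you need the corresponding feature: adding text requires ImageMagick, and previewing audio and video requires PyGame.

Using moviepy
The core objects in moviepy are clips, which are either AudioClips or VideoClips.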

create clips
# VIDEO CLIPS
clip = VideoClip(make_frame, duration=4) # for custom animations (see below)
clip = VideoFileClip("my_video_file.mp4") # or .avi, .webm, .gif ...
clip = ImageSequenceClip(['image_file1.jpeg', ...], fps=24)
clip = ImageClip("my_picture.png") # or .jpeg, .tiff, ...
clip = TextClip("Hello !", font="Amiri-Bold", fontsize=70, color="black")
clip = ColorClip(size=(460,380), color=[R,G,B])
# AUDIO CLIPS
clip = AudioFileClip("my_audiofile.mp3") # or .ogg, .wav... or a video !
clip = AudioArrayClip(numpy_array, fps=44100) # from a numerical array
clip = AudioClip(make_frame, duration=3) # uses a function make_frame(t)

VideoClip
VideoClip is the base class for all the other video clips in MoviePy. If all you want is to edit video files, you will never need it. This class is practical when you want to make animations from frames that are generated by another library. All you need is to define a function make_frame(t) which returns an HxWx3 numpy array (of 8-bit integers) representing the frame at time t. Here is an example with the graphics library Gizeh:

import gizeh
import moviepy.editor as mpy

W, H = 128, 128  # frame width and height in pixels

def make_frame(t):
    surface = gizeh.Surface(W, H)  # width, height
    radius = W * (1 + (t * (2 - t)) ** 2) / 6  # the radius varies over time
    circle = gizeh.circle(radius, xy=(W / 2, H / 2), fill=(1, 0, 0))
    circle.draw(surface)
    return surface.get_npimage()  # returns an 8-bit RGB array

clip = mpy.VideoClip(make_frame, duration=2)  # 2 seconds
clip.write_gif("circle.gif", fps=15)

ImageSequenceClip
This is a clip made from a series of images. You call it with:

clip = ImageSequenceClip(images_list, fps=25)

where images_list can be either a list of image names (that will be played in that order), a folder name (in which case all the image files in the folder will be played in alphanumerical order), or a list of frames (Numpy arrays), obtained for instance from other clips.

TextClip
Generating a TextClip requires ImageMagick to be installed and (for Windows users) linked to MoviePy.

Exporting video clips
my_clip.write_videofile("movie.mp4") # default codec: 'libx264', 24 fps
my_clip.write_videofile("movie.mp4",fps=15)
my_clip.write_videofile("movie.webm") # webm format
my_clip.write_videofile("movie.webm",audio=False) # don't render audio.

Sometimes it is impossible for MoviePy to guess the duration attribute of the clip (keep in mind that some clips, like ImageClips displaying a picture, have a priori an infinite duration). Then, the duration must be set manually with clip.set_duration:

# Make a video showing a flower for 5 seconds
my_clip = ImageClip("flower.jpeg") # has infinite duration
my_clip.write_videofile("flower.mp4") # Will fail ! NO DURATION !
my_clip.set_duration(5).write_videofile("flower.mp4") # works !

To write your video as an animated GIF, use

my_clip.write_gif('test.gif', fps=12)

You can write a frame to an image file with

myclip.save_frame("frame.png") # by default the first frame is extracted
myclip.save_frame("frame.jpeg", t='01:00:00') # frame at time t=1h
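
MoviePy accepts times as seconds, as (min, sec) or (hour, min, sec) tuples, or as an 'hh:mm:ss' string like the one above. The sketch below shows one way such values can be normalized to seconds. The helper to_seconds is hypothetical and written for illustration only; it is not MoviePy's own conversion code.

```python
def to_seconds(t):
    """Convert a number, a (min, sec) or (hour, min, sec) tuple,
    or an 'hh:mm:ss.ss' string to a float number of seconds."""
    if isinstance(t, str):
        t = tuple(float(part) for part in t.split(":"))
    if isinstance(t, (int, float)):
        return float(t)
    seconds = 0.0
    for part in t:
        # Each step shifts the running total by one base-60 place
        seconds = seconds * 60 + part
    return seconds


print(to_seconds("01:00:00"))  # 3600.0
print(to_seconds((1, 30)))     # 90.0
print(to_seconds(12.5))        # 12.5
```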

concatenating clips
from moviepy.editor import VideoFileClip, concatenate_videoclips
clip1 = VideoFileClip("myvideo.mp4")
clip2 = VideoFileClip("myvideo2.mp4").subclip(50,60)
clip3 = VideoFileClip("myvideo3.mp4")
final_clip = concatenate_videoclips([clip1,clip2,clip3])
final_clip.write_videofile("my_concatenation.mp4")

CompositeVideoClip can also combine clips:

video = CompositeVideoClip([clip1,clip2,clip3], size=(720,460))

Clips transformations and effects
from moviepy.editor import *
clip = (VideoFileClip("myvideo.avi")
        .fx(vfx.resize, width=460)  # resize (keep aspect ratio)
        .fx(vfx.speedx, 2)  # double the speed
        .fx(vfx.colorx, 0.5))  # darken the picture

Example Scripts
https://zulko.github.io/moviepy/examples/examples.html

References

moviepy official documentation

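No.	Command and description
1	DEL key: Deletes the key if it exists.
2	DUMP key: Serializes the value stored at key and returns the serialized value.
3	EXISTS key: Checks whether the given key exists.
4	EXPIRE key seconds: Sets an expiration time, in seconds, on the given key.
5	EXPIREAT key timestamp: Like EXPIRE, sets an expiration time on the key; the difference is that EXPIREAT takes a UNIX timestamp as its time argument.
6	PEXPIRE key milliseconds: Sets the key's expiration time in milliseconds.
7	PEXPIREAT key milliseconds-timestamp: Sets the key's expiration time as a UNIX timestamp in milliseconds.
8	KEYS pattern: Finds all keys that match the given pattern.
9	MOVE key db: Moves the key from the current database to the given database db.
10	PERSIST key: Removes the key's expiration time so that the key persists.
11	PTTL key: Returns the key's remaining time to live in milliseconds.
12	TTL key: Returns the remaining time to live (TTL) of the given key in seconds.
13	RANDOMKEY: Returns a random key from the current database.
14	RENAME key newkey: Renames the key.
15	RENAMENX key newkey: Renames key to newkey only if newkey does not exist.
16	TYPE key: Returns the type of the value stored at key.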
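
The expiry-related commands in the table (EXPIRE, TTL, PERSIST) are easier to follow with a small model. The sketch below is a toy in-memory keyspace in plain Python, not a Redis client: the class TinyKeyspace and its injectable clock are assumptions made for illustration. It mirrors the Redis convention that TTL returns -2 for a missing key and -1 for a key without a timeout.

```python
import time


class TinyKeyspace:
    """Toy stand-in for a Redis keyspace (illustration only)."""

    def __init__(self, clock=time.time):
        self._clock = clock
        self._data = {}     # key -> value
        self._expires = {}  # key -> absolute expiry time in seconds

    def _purge(self, key):
        # Lazily drop the key once its expiry time has passed
        deadline = self._expires.get(key)
        if deadline is not None and self._clock() >= deadline:
            self._data.pop(key, None)
            self._expires.pop(key, None)

    def set(self, key, value):
        self._data[key] = value
        self._expires.pop(key, None)  # a plain SET clears any timeout

    def exists(self, key):
        self._purge(key)
        return key in self._data

    def expire(self, key, seconds):
        if not self.exists(key):
            return 0
        self._expires[key] = self._clock() + seconds
        return 1

    def ttl(self, key):
        if not self.exists(key):
            return -2  # key does not exist
        if key not in self._expires:
            return -1  # key exists but has no timeout
        return int(round(self._expires[key] - self._clock()))

    def persist(self, key):
        if self.exists(key) and key in self._expires:
            del self._expires[key]
            return 1
        return 0


# A fake clock keeps the example deterministic
now = [1000.0]
db = TinyKeyspace(clock=lambda: now[0])
db.set("session", "abc")
print(db.ttl("session"))     # -1
db.expire("session", 10)
now[0] += 4
print(db.ttl("session"))     # 6
db.persist("session")
print(db.ttl("session"))     # -1
db.expire("session", 10)
now[0] += 100
print(db.exists("session"))  # False
print(db.ttl("session"))     # -2
```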

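No.	Command and description
1	DISCARD: Cancels the transaction and discards all commands queued in the transaction block.
2	EXEC: Executes all commands in the transaction block.
3	MULTI: Marks the start of a transaction block.
4	UNWATCH: Cancels the WATCH command's monitoring of all keys.
5	WATCH key [key …]: Watches one or more keys; if any of these keys is modified by another command before the transaction executes, the transaction is aborted.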
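
The WATCH row describes optimistic locking: a transaction only runs if none of the watched keys changed in the meantime. The sketch below models that rule in plain Python. It is a toy, not a Redis client: TinyStore, Transaction, and the per-key version counters are assumptions made for illustration, and the model does not enforce that WATCH is issued before MULTI.

```python
class TinyStore:
    """Toy key-value store that counts writes per key."""

    def __init__(self):
        self.data = {}
        self.versions = {}  # key -> number of writes seen so far

    def set(self, key, value):
        self.data[key] = value
        self.versions[key] = self.versions.get(key, 0) + 1


class Transaction:
    """Toy model of WATCH / MULTI / EXEC / DISCARD."""

    def __init__(self, store):
        self.store = store
        self.watched = {}  # key -> version observed at WATCH time
        self.queue = []    # commands queued after MULTI

    def watch(self, *keys):
        for key in keys:
            self.watched[key] = self.store.versions.get(key, 0)

    def queue_set(self, key, value):
        self.queue.append((key, value))

    def discard(self):
        self.queue.clear()
        self.watched.clear()

    def execute(self):
        # Abort if any watched key was written since WATCH
        dirty = any(self.store.versions.get(key, 0) != version
                    for key, version in self.watched.items())
        # EXEC always clears the queue and unwatches every key
        queued, self.queue, self.watched = self.queue, [], {}
        if dirty:
            return None
        for key, value in queued:
            self.store.set(key, value)
        return ["OK"] * len(queued)


store = TinyStore()
store.set("balance", 100)

tx = Transaction(store)
tx.watch("balance")
tx.queue_set("balance", 90)
store.set("balance", 50)      # another client writes first
print(tx.execute())           # None: the transaction was aborted
print(store.data["balance"])  # 50

tx.watch("balance")
tx.queue_set("balance", 40)
print(tx.execute())           # ['OK']
print(store.data["balance"])  # 40
```

Returning None for an aborted transaction mirrors the nil reply that EXEC gives when a watched key has changed.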

Operation	Format	Example	Equivalent RDBMS clause
Equal	`{<key>:<value>}`	`db.mycol.find({"by":"tutorials point"}).pretty()`	`where by = 'tutorials point'`
Less than	`{<key>:{$lt:<value>}}`	`db.mycol.find({"likes":{$lt:50}}).pretty()`	`where likes < 50`
Less than or equal	`{<key>:{$lte:<value>}}`	`db.mycol.find({"likes":{$lte:50}}).pretty()`	`where likes <= 50`
Greater than	`{<key>:{$gt:<value>}}`	`db.mycol.find({"likes":{$gt:50}}).pretty()`	`where likes > 50`
Greater than or equal	`{<key>:{$gte:<value>}}`	`db.mycol.find({"likes":{$gte:50}}).pretty()`	`where likes >= 50`
Not equal	`{<key>:{$ne:<value>}}`	`db.mycol.find({"likes":{$ne:50}}).pretty()`	`where likes != 50`
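
To show how these query documents are interpreted, here is a minimal sketch of a matcher over plain Python dicts. It is not how MongoDB evaluates queries internally: the functions matches and find and the sample documents are made up for illustration, and only the operators from the table are handled.

```python
import operator

# Comparison operators from the table, mapped to Python functions
OPERATORS = {
    "$lt": operator.lt,
    "$lte": operator.le,
    "$gt": operator.gt,
    "$gte": operator.ge,
    "$ne": operator.ne,
}


def matches(doc, query):
    """Return True if doc satisfies every condition in query."""
    for field, condition in query.items():
        value = doc.get(field)
        if isinstance(condition, dict):
            for op, operand in condition.items():
                # A missing field can only satisfy $ne
                if value is None and op != "$ne":
                    return False
                if not OPERATORS[op](value, operand):
                    return False
        elif value != condition:  # plain {key: value} means equality
            return False
    return True


def find(collection, query):
    return [doc for doc in collection if matches(doc, query)]


# Hypothetical sample data
docs = [
    {"title": "MongoDB Overview", "by": "tutorials point", "likes": 100},
    {"title": "NoSQL Overview", "by": "tutorials point", "likes": 10},
    {"title": "Neo4j Overview", "by": "Neo4j", "likes": 750},
]

print([d["title"] for d in find(docs, {"likes": {"$lt": 50}})])
# ['NoSQL Overview']
print([d["title"] for d in find(docs, {"by": "tutorials point", "likes": {"$gte": 50}})])
# ['MongoDB Overview']
```

Listing several fields in one query document combines them with AND, which is why the second query returns a single title.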

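Expression	Description	Example
`$sum`	Sums the specified value over all documents in the collection	`db.mycol.aggregate([{$group : {_id : "$by_user", num_tutorial : {$sum : "$likes"}}}])`
`$avg`	Averages the specified value over all documents in the collection	`db.mycol.aggregate([{$group : {_id : "$by_user", num_tutorial : {$avg : "$likes"}}}])`
`$min`	Returns the minimum of the corresponding values over all documents in the collection	`db.mycol.aggregate([{$group : {_id : "$by_user", num_tutorial : {$min : "$likes"}}}])`
`$max`	Returns the maximum of the corresponding values over all documents in the collection	`db.mycol.aggregate([{$group : {_id : "$by_user", num_tutorial : {$max : "$likes"}}}])`
`$push`	Appends the value to an array in the resulting document	`db.mycol.aggregate([{$group : {_id : "$by_user", url : {$push: "$url"}}}])`
`$addToSet`	Appends the value to an array in the resulting document without creating duplicates	`db.mycol.aggregate([{$group : {_id : "$by_user", url : {$addToSet : "$url"}}}])`
`$first`	Takes the first document of each group from the source documents. Only meaningful when a `$sort` pipeline stage has been applied beforehand.	`db.mycol.aggregate([{$group : {_id : "$by_user", first_url : {$first : "$url"}}}])`
`$last`	Takes the last document of each group from the source documents. Only meaningful when a `$sort` pipeline stage has been applied beforehand.	`db.mycol.aggregate([{$group : {_id : "$by_user", last_url : {$last : "$url"}}}])`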
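
The accumulators above all follow the same pattern: split the documents into groups by `_id`, then reduce one field per group. The sketch below reproduces that pattern on plain Python dicts. The function group, its (operator, field) argument format, and the sample documents are assumptions made for illustration; this is not the MongoDB driver API.

```python
from collections import defaultdict


def group(docs, key_field, accumulators):
    """Group docs by key_field and apply {output_name: (op, field)} accumulators."""
    buckets = defaultdict(list)
    for doc in docs:
        buckets[doc[key_field]].append(doc)

    results = []
    for key, members in buckets.items():
        row = {"_id": key}
        for name, (op, field) in accumulators.items():
            values = [member[field] for member in members]
            if op == "$sum":
                row[name] = sum(values)
            elif op == "$avg":
                row[name] = sum(values) / len(values)
            elif op == "$min":
                row[name] = min(values)
            elif op == "$max":
                row[name] = max(values)
            elif op == "$push":
                row[name] = values
            elif op == "$addToSet":
                # Drop duplicates while keeping first-seen order
                row[name] = list(dict.fromkeys(values))
            elif op == "$first":
                row[name] = values[0]
            elif op == "$last":
                row[name] = values[-1]
            else:
                raise ValueError(f"unsupported accumulator: {op}")
        results.append(row)
    return results


# Hypothetical sample data
docs = [
    {"by_user": "tutorials point", "likes": 100, "url": "a"},
    {"by_user": "tutorials point", "likes": 10, "url": "a"},
    {"by_user": "Neo4j", "likes": 750, "url": "b"},
]

summary = group(docs, "by_user", {
    "num_tutorial": ("$sum", "likes"),
    "avg_likes": ("$avg", "likes"),
    "urls": ("$addToSet", "url"),
})
for row in summary:
    print(row)
```

As the table notes for `$first` and `$last`, the result depends on document order, so in this sketch the input list would need to be sorted before calling group.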