Category Archives: Python-Advance

[repost ]Test-Driven Web Development with Python

original:http://chimera.labs.oreilly.com/books/1234000000754/index.html

Table of Contents

[repost ]A guide to analyzing Python performance

original:http://www.huyng.com/posts/python-performance-analysis/

While not every Python program you write will require a rigorous performance analysis, it is reassuring to know that there is a wide variety of tools in Python's ecosystem that one can turn to when the time arises.

Analyzing a program’s performance boils down to answering 4 basic questions:

  1. How fast is it running?
  2. Where are the speed bottlenecks?
  3. How much memory is it using?
  4. Where is memory leaking?

Below, we’ll dive into the details of answering these questions using some awesome tools.

Coarse grain timing with time

Let's begin by using a quick and dirty method of timing our code: the good old unix utility time.

$ time python yourprogram.py

real    0m1.028s
user    0m0.001s
sys     0m0.003s

The meanings of the three output measurements are detailed in this stackoverflow article, but in short:

  • real – refers to the actual elapsed time
  • user – refers to the amount of CPU time spent outside of the kernel
  • sys – refers to the amount of CPU time spent inside kernel-specific functions

You can get a sense of how many CPU cycles your program used up, regardless of other programs running on the system, by adding together the sys and user times.

If the sum of the sys and user times is much less than the real time, then you can guess that most of your program's performance issues are related to IO waits.

Fine grain timing with a timing context manager

Our next technique involves direct instrumentation of the code to get access to finer grain timing information. Here’s a small snippet I’ve found invaluable for making ad-hoc timing measurements:

timer.py

import time

class Timer(object):
    def __init__(self, verbose=False):
        self.verbose = verbose

    def __enter__(self):
        self.start = time.time()
        return self

    def __exit__(self, *args):
        self.end = time.time()
        self.secs = self.end - self.start
        self.msecs = self.secs * 1000  # millisecs
        if self.verbose:
            print 'elapsed time: %f ms' % self.msecs

In order to use it, wrap blocks of code that you want to time with Python’s with keyword and this Timer context manager. It will take care of starting the timer when your code block begins execution and stopping the timer when your code block ends.

Here’s an example use of the snippet:

from timer import Timer
from redis import Redis
rdb = Redis()

with Timer() as t:
    rdb.lpush("foo", "bar")
print "=> elapsed lpush: %s s" % t.secs

with Timer() as t:
    rdb.lpop("foo")
print "=> elapsed lpop: %s s" % t.secs

I’ll often log the outputs of these timers to a file in order to see how my program’s performance evolves over time.
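Here's a minimal sketch of what that logging can look like (the file name, log format and the work being timed are placeholders of mine, not part of the original snippet):

import logging
from timer import Timer

logging.basicConfig(filename='timings.log', level=logging.INFO,
                    format='%(asctime)s %(message)s')

with Timer() as t:
    total = sum(xrange(1000000))   # stand-in for the block being measured
logging.info('sum loop took %f ms', t.msecs)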

Line-by-line timing and execution frequency with a profiler

Robert Kern has a nice project called line_profiler which I often use to see how fast and how often each line of code is running in my scripts.

To use it, you’ll need to install the python package via pip:

$ pip install line_profiler

Once installed you’ll have access to a new module called “line_profiler” as well as an executable script “kernprof.py”.

To use this tool, first modify your source code by decorating the function you want to measure with the @profile decorator. Don't worry, you don't have to import anything in order to use this decorator. The kernprof.py script automatically injects it into your script's runtime during execution.

primes.py

@profile
def primes(n): 
    if n==2:
        return [2]
    elif n<2:
        return []
    s=range(3,n+1,2)
    mroot = n ** 0.5
    half=(n+1)/2-1
    i=0
    m=3
    while m <= mroot:
        if s[i]:
            j=(m*m-3)/2
            s[j]=0
            while j<half:
                s[j]=0
                j+=m
        i=i+1
        m=2*i+3
    return [2]+[x for x in s if x]
primes(100)

Once you’ve gotten your code setup with the @profile decorator, use kernprof.py to run your script.

$ kernprof.py -l -v primes.py

The -l option tells kernprof to inject the @profile decorator into your script's builtins, and -v tells kernprof to display timing information once your script finishes. Here's what the output should look like for the above script:

Wrote profile results to primes.py.lprof
Timer unit: 1e-06 s

File: primes.py
Function: primes at line 2
Total time: 0.00019 s

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
     2                                           @profile
     3                                           def primes(n): 
     4         1            2      2.0      1.1      if n==2:
     5                                                   return [2]
     6         1            1      1.0      0.5      elif n<2:
     7                                                   return []
     8         1            4      4.0      2.1      s=range(3,n+1,2)
     9         1           10     10.0      5.3      mroot = n ** 0.5
    10         1            2      2.0      1.1      half=(n+1)/2-1
    11         1            1      1.0      0.5      i=0
    12         1            1      1.0      0.5      m=3
    13         5            7      1.4      3.7      while m <= mroot:
    14         4            4      1.0      2.1          if s[i]:
    15         3            4      1.3      2.1              j=(m*m-3)/2
    16         3            4      1.3      2.1              s[j]=0
    17        31           31      1.0     16.3              while j<half:
    18        28           28      1.0     14.7                  s[j]=0
    19        28           29      1.0     15.3                  j+=m
    20         4            4      1.0      2.1          i=i+1
    21         4            4      1.0      2.1          m=2*i+3
    22        50           54      1.1     28.4      return [2]+[x for x in s if x]

Look for lines with a high number of hits or a large amount of time spent. These are the areas where optimizations can yield the greatest improvements.
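As one hedged illustration of acting on this output (my sketch, not part of the original article): the inner while loop on lines 17-19, which accounts for roughly 46% of the time above, can be collapsed into a single slice assignment, and re-running kernprof.py on the modified function lets you verify the effect:

@profile
def primes(n):
    if n == 2:
        return [2]
    elif n < 2:
        return []
    s = range(3, n+1, 2)
    mroot = n ** 0.5
    half = (n+1)/2 - 1
    i = 0
    m = 3
    while m <= mroot:
        if s[i]:
            j = (m*m - 3)/2
            # mark every multiple of m in one step instead of looping over them
            s[j:half:m] = [0] * len(s[j:half:m])
        i = i + 1
        m = 2*i + 3
    return [2] + [x for x in s if x]
primes(100)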

How much memory does it use?

Now that we have a good grasp on timing our code, let's move on to figuring out how much memory our programs are using. Fortunately for us, Fabian Pedregosa has implemented a nice memory profiler modeled after Robert Kern's line_profiler.

First install it via pip:

$ pip install -U memory_profiler
$ pip install psutil

(Installing the psutil package here is recommended because it greatly improves the performance of the memory_profiler).

Like line_profiler, memory_profiler requires that you decorate your function of interest with an @profile decorator like so:

@profile
def primes(n): 
    ...
    ...

To see how much memory your function uses run the following:

$ python -m memory_profiler primes.py

You should see output that looks like this once your program exits:

Filename: primes.py

Line #    Mem usage  Increment   Line Contents
==============================================
     2                           @profile
     3    7.9219 MB  0.0000 MB   def primes(n): 
     4    7.9219 MB  0.0000 MB       if n==2:
     5                                   return [2]
     6    7.9219 MB  0.0000 MB       elif n<2:
     7                                   return []
     8    7.9219 MB  0.0000 MB       s=range(3,n+1,2)
     9    7.9258 MB  0.0039 MB       mroot = n ** 0.5
    10    7.9258 MB  0.0000 MB       half=(n+1)/2-1
    11    7.9258 MB  0.0000 MB       i=0
    12    7.9258 MB  0.0000 MB       m=3
    13    7.9297 MB  0.0039 MB       while m <= mroot:
    14    7.9297 MB  0.0000 MB           if s[i]:
    15    7.9297 MB  0.0000 MB               j=(m*m-3)/2
    16    7.9258 MB -0.0039 MB               s[j]=0
    17    7.9297 MB  0.0039 MB               while j<half:
    18    7.9297 MB  0.0000 MB                   s[j]=0
    19    7.9297 MB  0.0000 MB                   j+=m
    20    7.9297 MB  0.0000 MB           i=i+1
    21    7.9297 MB  0.0000 MB           m=2*i+3
    22    7.9297 MB  0.0000 MB       return [2]+[x for x in s if x]

IPython shortcuts for line_profiler and memory_profiler

A little known feature of line_profiler and memory_profiler is that both programs have shortcut commands accessible from within IPython. All you have to do is type the following within an IPython session:

%load_ext memory_profiler
%load_ext line_profiler

Upon doing so you’ll have access to the magic commands %lprun and %mprun which behave similarly to their command-line counterparts. The major difference here is that you won’t need to decorate your to-be-profiled functions with the @profile decorator. Just go ahead and run the profiling directly within your IPython session like so:

In [1]: from primes import primes
In [2]: %mprun -f primes primes(1000)
In [3]: %lprun -f primes primes(1000)

This can save you a lot of time and effort since none of your source code needs to be modified in order to use these profiling commands.

Where’s the memory leak?

The CPython interpreter uses reference counting as its main method of keeping track of memory. This means that every object contains a counter, which is incremented when a reference to the object is stored somewhere and decremented when a reference to it is deleted. When the counter reaches zero, the CPython interpreter knows that the object is no longer in use, so it deletes the object and deallocates the occupied memory.
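To make that counter visible, here is a tiny sketch of my own (not the author's) using sys.getrefcount; note that the reported value is always one higher than you might expect, because the call itself temporarily holds a reference:

import sys

x = []
print sys.getrefcount(x)   # 2: the name x, plus the temporary reference held by getrefcount
y = [x, {"a": x}]          # store two more references to the same list
print sys.getrefcount(x)   # 4
del y                      # dropping y releases both of those references
print sys.getrefcount(x)   # back to 2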

A memory leak can often occur in your program if references to objects are held even though the object is no longer in use.

The quickest way to find these "memory leaks" is to use an awesome tool called objgraph, written by Marius Gedminas. This tool allows you to see the number of objects in memory and also locate all the different places in your code that hold references to these objects.

To get started, first install objgraph:

pip install objgraph

Once you have this tool installed, insert into your code a statement to invoke the debugger:

import pdb; pdb.set_trace()

Which objects are the most common?

At run time, you can inspect the top 20 most prevalent objects in your program by running:

(pdb) import objgraph
(pdb) objgraph.show_most_common_types()

MyBigFatObject             20000
tuple                      16938
function                   4310
dict                       2790
wrapper_descriptor         1181
builtin_function_or_method 934
weakref                    764
list                       634
method_descriptor          507
getset_descriptor          451
type                       439

Which objects have been added or deleted?

We can also see which objects have been added or deleted between two points in time:

(pdb) import objgraph
(pdb) objgraph.show_growth()
.
.
.
(pdb) objgraph.show_growth()   # this only shows objects that have been added or deleted since the last show_growth() call

traceback                4        +2
KeyboardInterrupt        1        +1
frame                   24        +1
list                   667        +1
tuple                16969        +1

What is referencing this leaky object?

Continuing down this route, we can also see where references to any given object are being held. Let's take the simple program below as an example:

x = [1]
y = [x, [x], {"a":x}]
import pdb; pdb.set_trace()

To see what is holding a reference to the variable x, run the objgraph.show_backrefs() function:

(pdb) import objgraph
(pdb) objgraph.show_backrefs([x], filename="/tmp/backrefs.png")

The output of that command should be a PNG image stored at /tmp/backrefs.png and it should look something like this:

[image: back references graph generated by objgraph]

The box at the bottom with red lettering is our object of interest. We can see that it’s referenced by the symbol x once and by the list y three times. If x is the object causing a memory leak, we can use this method to see why it’s not automatically being deallocated by tracking down all of its references.
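As a small companion sketch of my own (not from the article), objgraph.count() lets you watch what happens when those references are dropped: once the last reference is gone, reference counting frees the object immediately.

import objgraph

x = [1]
y = [x, [x], {"a": x}]
before = objgraph.count('list')   # total number of list objects currently alive

del y                             # releases the references held by y and by the inner [x]
del x                             # releases the last remaining reference
after = objgraph.count('list')

print before - after              # should be 3: x, y and the inner [x] have all been freed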

So to review, objgraph allows us to:

  • show the top N objects occupying our python program’s memory
  • show what objects have been deleted or added over a period of time
  • show all references to a given object in our script

Effort vs precision

In this post, I’ve shown you how to use several tools to analyze a python program’s performance. Armed with these tools and techniques you should have all the information required to track down most memory leaks as well as identify speed bottlenecks in a Python program.

As with many other topics, running a performance analysis means balancing the tradeoffs between effort and precision. When in doubt, implement the simplest solution that will suit your current needs.


[repost ]A Guide to Analyzing Python Performance (Chinese translation)

original:http://www.oschina.net/translate/python-performance-analysis


[repost ]Bloom Filter (Python version)

original:http://hi.baidu.com/ruclin/item/8f78b786be2f672a110ef369

Bloom Filter (Python version)
1. My base Python version is 2.6
2. Download the BitVector module from http://RVL4.ecn.purdue.edu/~kak/dist/BitVector-2.0.tar.gz?download and install it

References:
http://files.cnblogs.com/asashina/Bloom%20Filter.pdf
http://blog.csdn.net/jiaomeng/archive/2007/01/27/1495500.aspx
http://stackoverflow.com/questions/311202/modern-high-performance-bloom-filter-in-python

I studied Bloom filters today. The main takeaways were:
[1] the basic concept of a Bloom filter
[2] how hashing is done in Python
[3] operator overloading in Python

Question:
[1] I did not overload the in operator, so why does membership testing still return the correct result? (see the note below)
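A short note on that question, based on standard Python semantics rather than anything in the original post: the in operator is dispatched to the __contains__ special method, and the BloomFilter class below does define __contains__, so expressions like 'ten' in f work without any further overloading. A minimal sketch of the mechanism:

class Box(object):
    def __init__(self, items):
        self.items = set(items)

    def __contains__(self, key):
        # 'key in box' is translated by Python into box.__contains__(key)
        return key in self.items

box = Box(['a', 'b'])
print 'a' in box   # True  -- dispatched to Box.__contains__
print 'z' in box   # False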

#coding:utf-8
from BitVector import BitVector
from random import Random
# get hashes from http://www.partow.net/programming/hashfunctions/index.html
from hashes import RSHash, JSHash, PJWHash, ELFHash, DJBHash

#
# / www.asciiarmor.com
#
# copyright (c) 2008, ryan cox
# all rights reserved
# BSD license: http://www.opensource.org/licenses/bsd-license.php
#

class BloomFilter(object):
    def __init__(self, n=None, m=None, k=None, p=None, bits=None):
        # n (expected element count) and p (error rate) are accepted but not used below
        self.m = m
        if k > 4 or k < 1:
            raise Exception('Must specify value of k between 1 and 4')
        self.k = k
        if bits:
            self.bits = bits
        else:
            self.bits = BitVector(size=m)
        self.rand = Random()
        self.hashes = []
        self.hashes.append(RSHash)
        self.hashes.append(JSHash)
        self.hashes.append(PJWHash)
        self.hashes.append(DJBHash)

        # switch between hashing techniques
        self._indexes = self._rand_indexes
        #self._indexes = self._hash_indexes

    def __contains__(self, key):
        # this is what makes the 'in' operator work
        for i in self._indexes(key):
            if not self.bits[i]:
                return False
        return True

    def add(self, key):
        # returns True if the key was (probably) already present
        dupe = True
        bits = []
        for i in self._indexes(key):
            if dupe and not self.bits[i]:
                dupe = False
            self.bits[i] = 1
            bits.append(i)
        return dupe

    def __and__(self, filter):
        if (self.k != filter.k) or (self.m != filter.m):
            raise Exception('Must use bloom filters created with equal k / m parameters for bitwise AND')
        return BloomFilter(m=self.m, k=self.k, bits=(self.bits & filter.bits))

    def __or__(self, filter):
        if (self.k != filter.k) or (self.m != filter.m):
            raise Exception('Must use bloom filters created with equal k / m parameters for bitwise OR')
        return BloomFilter(m=self.m, k=self.k, bits=(self.bits | filter.bits))

    def _hash_indexes(self, key):
        ret = []
        for i in range(self.k):
            ret.append(self.hashes[i](key) % self.m)
        return ret

    def _rand_indexes(self, key):
        # seed the PRNG with the key's hash so the same key always maps to the same bits
        self.rand.seed(hash(key))
        ret = []
        for i in range(self.k):
            ret.append(self.rand.randint(0, self.m - 1))
        return ret

if __name__ == '__main__':
    e = BloomFilter(m=100, k=4)
    e.add('one')
    e.add('two')
    e.add('three')
    e.add('four')
    e.add('five')
    e.add('你好')

    f = BloomFilter(m=100, k=4)
    f.add('three')
    f.add('four')
    f.add('five')
    f.add('six')
    f.add('seven')
    f.add('eight')
    f.add('nine')
    f.add("ten")

    # test check for dupe on add
    assert not f.add('eleven')
    assert f.add('eleven')

    # test membership operations
    assert 'ten' in f
    assert 'one' in e
    assert '你好' in e
    assert 'ten' not in e
    assert 'one' not in f

    # test set based operations
    union = f | e
    intersection = f & e

    assert 'ten' in union
    assert 'one' in union
    assert 'three' in intersection
    assert 'ten' not in intersection
    assert 'one' not in intersection

 

[repost ] Python Socket Programming Tutorial

original:http://www.oschina.net/question/12_76126

This is a guide and tutorial for quickly learning socket programming in Python. Socket programming in Python is very similar to socket programming in C.

For the official Python documentation on sockets, see http://docs.python.org/library/socket.html

Fundamentally, sockets are the basis of every kind of computer network communication. For example, when you type www.oschina.net into your browser's address bar, the browser opens a socket, connects to www.oschina.net, reads the page in the response and displays it. Chat clients such as gtalk and skype work the same way. All network communication goes through a socket.

Before we begin

This tutorial assumes you already have some basic knowledge of Python programming.

Let's get started with socket programming.

Creating a socket

The first thing to do is create a socket. The socket module's socket function does exactly that:

#Socket client example in python

import socket   #for sockets

#create an AF_INET, STREAM socket (TCP)
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)

print 'Socket Created'

The socket.socket function creates a socket and returns a socket descriptor that can be used in other socket-related functions.

The code above creates the socket with the following two properties:

Address family: AF_INET (IPv4)
Type: SOCK_STREAM (TCP, the Transmission Control Protocol)

Error handling

If the socket function fails, Python raises a socket.error exception, which must be handled:

#handling errors in python socket programs

import socket   #for sockets
import sys  #for exit

try:
    #create an AF_INET, STREAM socket (TCP)
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
except socket.error, msg:
    print 'Failed to create socket. Error code: ' + str(msg[0]) + ' , Error message : ' + msg[1]
    sys.exit();

print 'Socket Created'

OK, so assume we have successfully created a socket. What next? Next we will use this socket to connect to a server.

Note

The counterpart of SOCK_STREAM is SOCK_DGRAM, which is used for the UDP protocol. UDP sockets are connectionless. In this article we only cover SOCK_STREAM, i.e. TCP.
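
For completeness, here is a minimal UDP sketch of my own (it is not part of the original tutorial, and 127.0.0.1:9999 is just a hypothetical local service): datagram sockets skip connect/listen/accept entirely and carry the peer address with every packet.

import socket

s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
s.settimeout(2.0)                          # don't block forever if nothing answers
s.sendto("ping", ("127.0.0.1", 9999))      # fire a datagram at a hypothetical local service
try:
    data, addr = s.recvfrom(4096)
    print 'received %r from %s' % (data, addr)
except (socket.timeout, socket.error):
    print 'no reply (nothing is listening on that port)'
s.close()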

Connecting to a server

Connecting to a server requires the server's address and a port number. Here we use www.oschina.net and port 80.

First, get the IP address of the remote host

Before connecting to a remote host we need its IP address. Getting an IP address in Python is simple:

import socket   #for sockets
import sys  #for exit

try:
    #create an AF_INET, STREAM socket (TCP)
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
except socket.error, msg:
    print 'Failed to create socket. Error code: ' + str(msg[0]) + ' , Error message : ' + msg[1]
    sys.exit();

print 'Socket Created'

host = 'www.oschina.net'

try:
    remote_ip = socket.gethostbyname( host )

except socket.gaierror:
    #could not resolve
    print 'Hostname could not be resolved. Exiting'
    sys.exit()

print 'Ip address of ' + host + ' is ' + remote_ip

Now that we have the IP address, we need to specify the port to connect to.

Code:

import socket   #for sockets
import sys  #for exit

try:
    #create an AF_INET, STREAM socket (TCP)
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
except socket.error, msg:
    print 'Failed to create socket. Error code: ' + str(msg[0]) + ' , Error message : ' + msg[1]
    sys.exit();

print 'Socket Created'

host = 'www.oschina.net'
port = 80

try:
    remote_ip = socket.gethostbyname( host )

except socket.gaierror:
    #could not resolve
    print 'Hostname could not be resolved. Exiting'
    sys.exit()

print 'Ip address of ' + host + ' is ' + remote_ip

#Connect to remote server
s.connect((remote_ip , port))

print 'Socket Connected to ' + host + ' on ip ' + remote_ip

Now run the program:

$ python client.py
Socket Created
Ip address of www.oschina.net is 61.145.122.155
Socket Connected to www.oschina.net on ip 61.145.122.155

This program creates a socket and connects it. Try connecting to a port where nothing is listening (81, for example) and see what happens; this is essentially the logic a port scanner is built on.

Now that we are connected, the next step is to send some data to the server.

A free tip

Only SOCK_STREAM/TCP sockets have the concept of a "connection". A connection is a reliable stream-oriented communication mechanism; there can be multiple streams at the same time, and you can think of each one as a pipe whose data does not interfere with the others. Another important point: packets are sent and received in order.

Other sockets such as UDP, ICMP and ARP have no concept of a "connection". They use connectionless communication, meaning you can send packets to, and receive packets from, anyone.

Sending data

The sendall function is used to simply send data. Let's send some data to oschina:

import socket   #for sockets
import sys  #for exit

try:
    #create an AF_INET, STREAM socket (TCP)
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
except socket.error, msg:
    print 'Failed to create socket. Error code: ' + str(msg[0]) + ' , Error message : ' + msg[1]
    sys.exit();

print 'Socket Created'

host = 'www.oschina.net'
port = 80

try:
    remote_ip = socket.gethostbyname( host )

except socket.gaierror:
    #could not resolve
    print 'Hostname could not be resolved. Exiting'
    sys.exit()

print 'Ip address of ' + host + ' is ' + remote_ip

#Connect to remote server
s.connect((remote_ip , port))

print 'Socket Connected to ' + host + ' on ip ' + remote_ip

#Send some data to remote server
message = "GET / HTTP/1.1\r\n\r\n"

try :
    #Set the whole string
    s.sendall(message)
except socket.error:
    #Send failed
    print 'Send failed'
    sys.exit()

print 'Message send successfully'

The example above first connects to the target server and then sends the string "GET / HTTP/1.1\r\n\r\n". This is an HTTP command that fetches the content of the site's home page.

Next we need to read the data that the server sends back.

Receiving data

The recv function is used to receive data from a socket:

#Socket client example in python

import socket   #for sockets
import sys  #for exit

#create an INET, STREAMing socket
try:
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
except socket.error:
    print 'Failed to create socket'
    sys.exit()

print 'Socket Created'

host = 'oschina.net';
port = 80;

try:
    remote_ip = socket.gethostbyname( host )

except socket.gaierror:
    #could not resolve
    print 'Hostname could not be resolved. Exiting'
    sys.exit()

#Connect to remote server
s.connect((remote_ip , port))

print 'Socket Connected to ' + host + ' on ip ' + remote_ip

#Send some data to remote server
message = "GET / HTTP/1.1\r\nHost: oschina.net\r\n\r\n"

try :
    #Set the whole string
    s.sendall(message)
except socket.error:
    #Send failed
    print 'Send failed'
    sys.exit()

print 'Message send successfully'

#Now receive data
reply = s.recv(4096)

print reply

Here is the result of running the program above:

$ python client.py
Socket Created
Ip address of oschina.net is 61.145.122.
Socket Connected to oschina.net on ip 61.145.122.155
Message send successfully
HTTP/1.1 301 Moved Permanently
Server: nginx
Date: Wed, 24 Oct 2012 13:26:46 GMT
Content-Type: text/html
Content-Length: 178
Connection: keep-alive
Keep-Alive: timeout=20
Location: http://www.oschina.net/

oschina.net responded with the content for the URL we requested. Simple enough. Now that the data has been received, we can close the socket.

Closing the socket

The close function is used to close the socket:

s.close()

And that's that.

Let's recap

In the examples above we learned how to:

1. Create a socket
2. Connect to a remote server
3. Send data
4. Receive the reply

The same thing happens when you open www.oschina.net in your browser. There are two roles, client and server: the client connects to a server and reads data, while the server uses sockets to accept incoming connections and serve data. So here www.oschina.net is the server and your browser is the client.

Next, let's do some coding on the server side.

Server-side programming

Server-side programming mainly consists of the following steps:

1. Open a socket
2. Bind it to an address and port
3. Listen for incoming connections
4. Accept a connection
5. Read and write data

We already know how to open a socket; next comes binding it to a specific address and port.

Binding a socket

The bind function binds a socket to a particular address and port. It takes the same kind of (host, port) address tuple that connect does (the Python equivalent of C's sockaddr_in structure).

Example code:

import socket
import sys

HOST = ''   # Symbolic name meaning all available interfaces
PORT = 8888 # Arbitrary non-privileged port

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
print 'Socket created'

try:
    s.bind((HOST, PORT))
except socket.error , msg:
    print 'Bind failed. Error Code : ' + str(msg[0]) + ' Message ' + msg[1]
    sys.exit()

print 'Socket bind complete'

Once the socket is bound, it needs to start listening for connections. Obviously, you cannot bind two different sockets to the same port.

Listening for connections

After binding a socket we can start listening for connections by switching the socket into listening mode. The socket's listen function does this:

s.listen(10)
print 'Socket now listening'

The parameter passed to listen is called the backlog. It controls how many connections can be kept waiting while the program is busy. We pass 10 here, which means that if 10 connections are already waiting to be processed, an 11th connection will be refused. This will become clearer after we look at accept.

Accepting connections

Example code:

import socket
import sys

HOST = ''   # Symbolic name meaning all available interfaces
PORT = 8888 # Arbitrary non-privileged port

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
print 'Socket created'

try:
    s.bind((HOST, PORT))
except socket.error , msg:
    print 'Bind failed. Error Code : ' + str(msg[0]) + ' Message ' + msg[1]
    sys.exit()

print 'Socket bind complete'

s.listen(10)
print 'Socket now listening'

#wait to accept a connection - blocking call
conn, addr = s.accept()

#display client information
print 'Connected with ' + addr[0] + ':' + str(addr[1])

Output

Running this program will display:

$ python server.py
Socket created
Socket bind complete
Socket now listening

The program is now waiting for incoming connections on port 8888. Don't close it; let's test it with the telnet program.

Open a terminal window and type:

$ telnet localhost 8888

It will immediately show
$ telnet localhost 8888
Trying 127.0.0.1...
Connected to localhost.
Escape character is '^]'.
Connection closed by foreign host.

Meanwhile, the server-side window shows:

$ python server.py
Socket created
Socket bind complete
Socket now listening
Connected with 127.0.0.1:59954

We can see that the client has successfully connected to the server.

In the example above we accept a connection and then immediately close it, which isn't of much practical value. Once a connection is established there is usually plenty of work to do, so let's send the client some kind of reply.

The sendall function can be used to send data to the client through the socket:

import socket
import sys

HOST = ''   # Symbolic name meaning all available interfaces
PORT = 8888 # Arbitrary non-privileged port

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
print 'Socket created'

try:
    s.bind((HOST, PORT))
except socket.error , msg:
    print 'Bind failed. Error Code : ' + str(msg[0]) + ' Message ' + msg[1]
    sys.exit()

print 'Socket bind complete'

s.listen(10)
print 'Socket now listening'

#wait to accept a connection - blocking call
conn, addr = s.accept()

print 'Connected with ' + addr[0] + ':' + str(addr[1])

#now keep talking with the client
data = conn.recv(1024)
conn.sendall(data)

conn.close()
s.close()

Keep the code above running, then open another terminal window and enter the following command:

$ telnet localhost 8888
Trying 127.0.0.1...
Connected to localhost.
Escape character is '^]'.
happy
happy
Connection closed by foreign host.

You can see that the client receives the reply sent back by the server.

As before, the server replies and then exits right away, whereas real servers like www.oschina.net run continuously, accepting connection requests all the time.

In other words, the server should stay running; otherwise it isn't much of a "service". So let's keep the server running. The simplest way is to put the accept call inside a loop.

A server that keeps running

With a small change to the code above:

import socket
import sys

HOST = ''   # Symbolic name meaning all available interfaces
PORT = 8888 # Arbitrary non-privileged port

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
print 'Socket created'

try:
    s.bind((HOST, PORT))
except socket.error , msg:
    print 'Bind failed. Error Code : ' + str(msg[0]) + ' Message ' + msg[1]
    sys.exit()

print 'Socket bind complete'

s.listen(10)
print 'Socket now listening'

#now keep talking with the client
while 1:
    #wait to accept a connection - blocking call
    conn, addr = s.accept()
    print 'Connected with ' + addr[0] + ':' + str(addr[1])

    data = conn.recv(1024)
    reply = 'OK...' + data
    if not data:
        break

    conn.sendall(reply)

conn.close()
s.close()

Very simple: we just added a while 1 loop.

Keep the server running, then open three more terminal windows. In each window, connect to the server with telnet:

$ telnet localhost 8888
Trying 127.0.0.1...
Connected to localhost.
Escape character is '^]'.
happy
OK .. happy
Connection closed by foreign host.

The terminal window running the server shows:

$ python server.py
Socket created
Socket bind complete
Socket now listening
Connected with 127.0.0.1:60225
Connected with 127.0.0.1:60237
Connected with 127.0.0.1:60239

See, the server no longer quits. Now shut it down with Ctrl+C; all of the telnet terminals will then display "Connection closed by foreign host."

This is already quite good, but the communication is still inefficient: the server uses a single loop to accept connections and send replies, so it can handle at most one client request at a time, while we want the server to handle several requests simultaneously.

Handling multiple connections

To handle multiple connections we need a separate piece of handling code that runs whenever the main server accepts a connection. One approach is to use threads: the server accepts a connection, creates a thread to handle sending and receiving data on that connection, and then goes back to accepting new connections.

Here is the server using threads to handle connection requests:

import socket
import sys
from thread import *

HOST = ''   # Symbolic name meaning all available interfaces
PORT = 8888 # Arbitrary non-privileged port

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
print 'Socket created'

#Bind socket to local host and port
try:
    s.bind((HOST, PORT))
except socket.error , msg:
    print 'Bind failed. Error Code : ' + str(msg[0]) + ' Message ' + msg[1]
    sys.exit()

print 'Socket bind complete'

#Start listening on socket
s.listen(10)
print 'Socket now listening'

#Function for handling connections. This will be used to create threads
def clientthread(conn):
    #Sending message to connected client
    conn.send('Welcome to the server. Type something and hit enter\n') #send only takes string

    #infinite loop so that function do not terminate and thread do not end.
    while True:

        #Receiving from client
        data = conn.recv(1024)
        reply = 'OK...' + data
        if not data:
            break

        conn.sendall(reply)

    #came out of loop
    conn.close()

#now keep talking with the client
while 1:
    #wait to accept a connection - blocking call
    conn, addr = s.accept()
    print 'Connected with ' + addr[0] + ':' + str(addr[1])

    #start new thread takes 1st argument as a function name to be run, second is the tuple of arguments to the function.
    start_new_thread(clientthread ,(conn,))

s.close()

Run the server program above, then open three terminal windows as before and run the telnet command:

$ telnet localhost 8888
Trying 127.0.0.1...
Connected to localhost.
Escape character is '^]'.
Welcome to the server. Type something and hit enter
hi
OK...hi
asd
OK...asd
cv
OK...cv

The terminal window running the server outputs the following:

$ python server.py
Socket created
Socket bind complete
Socket now listening
Connected with 127.0.0.1:60730
Connected with 127.0.0.1:60731

The threads take over the connections and return the appropriate data to the clients.

That wraps up our introduction to server-side programming.

Conclusion

You have now learned the basics of socket programming in Python. Try writing some examples of your own to reinforce what you've learned.

You may run into an error such as "Bind failed. Error Code : 98 Message Address already in use"; when that happens, simply change the server port.
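Another option, sketched below (my addition, not from the original tutorial), is to set SO_REUSEADDR before bind(). This helps when the port is merely lingering in TIME_WAIT after a previous run of your server; if another process is actively listening on the port, you still need to pick a different one.

import socket

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
# allow re-binding to a port left in TIME_WAIT by a previous run of the server
s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
s.bind(('', 8888))
s.listen(10)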

[project ]PyPy

original:http://pypy.org/

PyPy is a fast, compliant alternative implementation of the Python language (2.7.1). It has several advantages and distinct features:

  • Speed: thanks to its Just-in-Time compiler, Python programs often run faster on PyPy. (What is a JIT compiler?)
  • Memory usage: large, memory-hungry Python programs might end up taking less space than they do in CPython.
  • Compatibility: PyPy is highly compatible with existing python code. It supports ctypes and can run popular python libraries like twisted and django.
  • Sandboxing: PyPy provides the ability to run untrusted code in a fully secure way.
  • Stackless: PyPy can be configured to run in stackless mode, providing micro-threads for massive concurrency.
  • As well as other features.

Download and try out the PyPy release 1.7!

Want to know more? A good place to start is our detailed speed and compatibility reports!

[repost ] How To Use Linux epoll with Python (translated)

original:http://blog.csdn.net/ani_di/article/details/6438130

Original article: http://scotdoyle.com/python-epoll-howto.html?

How To Use Linux epoll with Python

Contents

Introduction

Python 2.6 added an API for the Linux epoll facility. This article briefly describes that API and how to use it, with examples written in Python 3. Questions and feedback are welcome.

Blocking Socket Programming Examples

Example 1 is a simple Python 3 server that listens for HTTP requests on port 8080, prints each request to the console, and then sends a response back to the client.

  • Line 9: Create the server socket.
  • Line 10: Set the SO_REUSEADDR option so that the program can re-bind to the same port right away after a previous run, instead of conflicting with a socket that is still timing out.
  • Line 11: Bind to port 8080 on all available IPv4 addresses.
  • Line 12: Listen for connections on this socket.
  • Line 14: The program blocks here until a new connection arrives. The accept call returns a socket object for that connection along with the address of the requester.
  • Lines 15-17: Assemble the complete request. For background on the HTTP protocol, see HTTP Made Easy.
  • Line 18: Print the request.
  • Line 19: Send the response back to the client.
  • Lines 22-22: Close the connection to the client as well as the listening server socket.

The official Socket Programming HOWTO has more detail on socket programming.

Example 1 (All examples use Python 3)

  1. import socket
  2. EOL1 = b'\n\n'
  3. EOL2 = b'\n\r\n'
  4. response  = b'HTTP/1.0 200 OK\r\nDate: Mon, 1 Jan 1996 01:01:01 GMT\r\n'
  5. response += b'Content-Type: text/plain\r\nContent-Length: 13\r\n\r\n'
  6. response += b'Hello, world!'
  7. serversocket = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
  8. serversocket.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
  9. serversocket.bind(('0.0.0.0', 8080))
  10. serversocket.listen(1)
  11. connectiontoclient, address = serversocket.accept()
  12. request = b''
  13. while EOL1 not in request and EOL2 not in request:
  14.    request += connectiontoclient.recv(1024)
  15. print(request.decode())
  16. connectiontoclient.send(response)
  17. connectiontoclient.close()
  18. serversocket.close()

 

 

Example 2 adds a loop at line 15 so that it keeps handling connections until interrupted by the user. It also makes it clearer that the server socket is never used to exchange data with a client: instead, accept() creates a new socket for each connection, and that new socket is what exchanges the data.

The finally clause on lines 23-24 ensures that the server socket is always closed, even if an exception occurs (such as a user interrupt).

 

Example 2

  1. import socket
  2. EOL1 = b'\n\n'
  3. EOL2 = b'\n\r\n'
  4. response  = b'HTTP/1.0 200 OK\r\nDate: Mon, 1 Jan 1996 01:01:01 GMT\r\n'
  5. response += b'Content-Type: text/plain\r\nContent-Length: 13\r\n\r\n'
  6. response += b'Hello, world!'
  7. serversocket = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
  8. serversocket.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
  9. serversocket.bind(('0.0.0.0', 8080))
  10. serversocket.listen(1)
  11. try:
  12.    while True:
  13.       connectiontoclient, address = serversocket.accept()
  14.       request = b''
  15.       while EOL1 not in request and EOL2 not in request:
  16.           request += connectiontoclient.recv(1024)
  17.       print('-'*40 + '\n' + request.decode()[:-2])
  18.       connectiontoclient.send(response)
  19.       connectiontoclient.close()
  20. finally:
  21.    serversocket.close()

Benefits of Asynchronous Sockets and Linux epoll

The sockets used in Example 2 are called blocking sockets. The accept() call on line 16 blocks until a client connects. The recv() call on line 19 likewise blocks until data has been received from the client (or the buffer is full). And the send() call on line 21 blocks until the message has been handed to the outgoing queue.

When using blocking sockets, a server typically handles each connection in its own thread (or process). The main program only listens for connections: it accepts one request at a time and hands the newly created socket off to a child thread. All of the blocking then happens in the child threads, and connections from different clients do not interfere with one another.

Blocking sockets are very intuitive and easy to understand, but they have a number of drawbacks: coordinating resources across threads is difficult, and this model is not very efficient on a single-core CPU.

The C10K Problem page lists several approaches to handling concurrent sockets. One of them is asynchronous sockets. These sockets do not block waiting for an event to happen; instead, the program is notified when an operation succeeds or fails, and the information in that notification helps it decide what to do next. A server using asynchronous sockets can handle its concurrent connections in a single thread, and use extra threads only when it needs to access blocking resources such as a database.

Linux 2.6 has several mechanisms for managing asynchronous sockets, three of which are exposed in the Python API: select, poll and epoll. epoll and poll are better than select because with select the program has to check every socket itself for events of interest, whereas with the other two the operating system notifies the application when those events occur. And epoll is better than poll: instead of the operating system being told about every socket on every call, Linux records events as they happen and, when asked, returns the queue of sockets that have pending events. This makes epoll much more efficient and scalable for large numbers (thousands) of concurrent connections (see these graphs).
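
To make the registration difference concrete, here is a minimal sketch of my own (not from the original article) that performs the same one-second readiness check on a listening socket, first with select and then with epoll:

import select, socket

server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
server.bind(('0.0.0.0', 8080))
server.listen(5)
server.setblocking(0)

# select: the full list of sockets is passed on every call and it
# returns the subset that is ready right now
readable, writable, errored = select.select([server], [], [], 1)

# epoll: interest is registered once; poll() then returns only the
# file descriptors that have pending events
ep = select.epoll()
ep.register(server.fileno(), select.EPOLLIN)
events = ep.poll(1)

ep.unregister(server.fileno())
ep.close()
server.close()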

Asynchronous Socket Programming Examples with epoll

The programs below illustrate the general pattern for using epoll:

  1. Create an epoll object
  2. Tell the epoll object to monitor specific events on specific sockets
  3. Ask the epoll object which sockets may have had the specified events since the last query
  4. Perform some action on those sockets
  5. Tell the epoll object to modify the list of sockets and/or events to monitor
  6. Repeat steps 3 through 5 until finished
  7. Destroy the epoll object

Example 3 is the asynchronous-socket version of Example 2. It is somewhat more complicated than the original.

  • Line 1: The select module is imported; the epoll calls live there.
  • Line 13: Sockets are blocking by default, so the server socket is switched to non-blocking (asynchronous) mode.
  • Line 15: Create an epoll object.
  • Line 16: Register interest in read events (EPOLLIN – available for read) on the server socket.
  • Line 19: Set up a connections dictionary that maps file descriptors (integers) to their network connection objects.
  • Line 21: Ask the epoll object whether any events of interest have occurred; the parameter "1" is a one-second timeout. If any events have occurred, the call returns immediately with a list of them.
  • Line 22: The event list contains (fileno, event code) pairs; fileno is the file descriptor, a unique integer.
  • Line 26: Register interest in read events on the new client socket.
  • Line 31: If a read event arrives, read the data the client socket has sent.
  • Line 33: Once a complete request has been read, change the registration for this socket to the write-ready event (EPOLLOUT), which is used later to write the response back.
  • Line 34: Print the complete request.
  • Line 35: If the client socket is ready for writing...
  • Lines 36-38: Send the response data to the client.
  • Line 39: Once the response has been sent in full, stop monitoring events on this socket.
  • Line 40: A shutdown call tells both ends that the connection is finished.
  • Line 41: The HUP (hang-up) event indicates that the connection was dropped. There is no need to register for HUP events; the epoll object always reports them.
  • Line 42: Unregister the socket whose connection was lost.
  • Line 43: Close the socket connection.
  • Lines 18-45: Make sure the epoll object is closed down properly.
  • Lines 46-48: Close the server socket. (Python automatically closes all open files/sockets when the program exits.)

Example 3

  1. import socket, select
  2. EOL1 = b'\n\n'
  3. EOL2 = b'\n\r\n'
  4. response  = b'HTTP/1.0 200 OK\r\nDate: Mon, 1 Jan 1996 01:01:01 GMT\r\n'
  5. response += b'Content-Type: text/plain\r\nContent-Length: 13\r\n\r\n'
  6. response += b'Hello, world!'
  7. serversocket = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
  8. serversocket.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
  9. serversocket.bind(('0.0.0.0', 8080))
  10. serversocket.listen(1)
  11. serversocket.setblocking(0)
  12. epoll = select.epoll()
  13. epoll.register(serversocket.fileno(), select.EPOLLIN)
  14. try:
  15.    connections = {}; requests = {}; responses = {}
  16.    while True:
  17.       events = epoll.poll(1)
  18.       for fileno, event in events:
  19.          if fileno == serversocket.fileno():
  20.             connection, address = serversocket.accept()
  21.             connection.setblocking(0)
  22.             epoll.register(connection.fileno(), select.EPOLLIN)
  23.             connections[connection.fileno()] = connection
  24.             requests[connection.fileno()] = b''
  25.             responses[connection.fileno()] = response
  26.          elif event & select.EPOLLIN:
  27.             requests[fileno] += connections[fileno].recv(1024)
  28.             if EOL1 in requests[fileno] or EOL2 in requests[fileno]:
  29.                epoll.modify(fileno, select.EPOLLOUT)
  30.                print('-'*40 + '\n' + requests[fileno].decode()[:-2])
  31.          elif event & select.EPOLLOUT:
  32.             byteswritten = connections[fileno].send(responses[fileno])
  33.             responses[fileno] = responses[fileno][byteswritten:]
  34.             if len(responses[fileno]) == 0:
  35.                epoll.modify(fileno, 0)
  36.                connections[fileno].shutdown(socket.SHUT_RDWR)
  37.          elif event & select.EPOLLHUP:
  38.             epoll.unregister(fileno)
  39.             connections[fileno].close()
  40.             del connections[fileno]
  41. finally:
  42.    epoll.unregister(serversocket.fileno())
  43.    epoll.close()
  44.    serversocket.close()

epoll has two modes of operation, edge-triggered and level-triggered. In edge-triggered mode, a read or write event on a socket is reported by epoll.poll() only once, so the program has to process all of the data available on that socket; nothing further is reported until something new happens on it. When the data from a particular event is exhausted, additional attempts to operate on the socket will cause an exception. In level-triggered mode, by contrast, the event is reported repeatedly until all of the data has been handled.

For example, suppose a server socket has registered for the EPOLLIN event. In edge-triggered mode the program must keep calling accept() until a socket.error exception occurs, whereas in level-triggered mode it can proceed the way Example 3 does.

Example 3 uses level-triggered mode, which is also the default. Example 4 shows how to use edge-triggered mode: lines 25, 36 and 45 add loops that must repeat until an exception occurs, and lines 16, 28, 41 and 51 are where the switch to edge-triggered mode is made.

Example 4

  1. import socket, select
  2. EOL1 = b'\n\n'
  3. EOL2 = b'\n\r\n'
  4. response  = b'HTTP/1.0 200 OK\r\nDate: Mon, 1 Jan 1996 01:01:01 GMT\r\n'
  5. response += b'Content-Type: text/plain\r\nContent-Length: 13\r\n\r\n'
  6. response += b'Hello, world!'
  7. serversocket = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
  8. serversocket.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
  9. serversocket.bind(('0.0.0.0', 8080))
  10. serversocket.listen(1)
  11. serversocket.setblocking(0)
  12. epoll = select.epoll()
  13. epoll.register(serversocket.fileno(), select.EPOLLIN | select.EPOLLET)
  14. try:
  15.    connections = {}; requests = {}; responses = {}
  16.    while True:
  17.       events = epoll.poll(1)
  18.       for fileno, event in events:
  19.          if fileno == serversocket.fileno():
  20.             try:
  21.                while True:
  22.                   connection, address = serversocket.accept()
  23.                   connection.setblocking(0)
  24.                   epoll.register(connection.fileno(), select.EPOLLIN | select.EPOLLET)
  25.                   connections[connection.fileno()] = connection
  26.                   requests[connection.fileno()] = b''
  27.                   responses[connection.fileno()] = response
  28.             except socket.error:
  29.                pass
  30.          elif event & select.EPOLLIN:
  31.             try:
  32.                while True:
  33.                   requests[fileno] += connections[fileno].recv(1024)
  34.             except socket.error:
  35.                pass
  36.             if EOL1 in requests[fileno] or EOL2 in requests[fileno]:
  37.                epoll.modify(fileno, select.EPOLLOUT | select.EPOLLET)
  38.                print('-'*40 + '\n' + requests[fileno].decode()[:-2])
  39.          elif event & select.EPOLLOUT:
  40.             try:
  41.                while len(responses[fileno]) > 0:
  42.                   byteswritten = connections[fileno].send(responses[fileno])
  43.                   responses[fileno] = responses[fileno][byteswritten:]
  44.             except socket.error:
  45.                pass
  46.             if len(responses[fileno]) == 0:
  47.                epoll.modify(fileno, select.EPOLLET)
  48.                connections[fileno].shutdown(socket.SHUT_RDWR)
  49.          elif event & select.EPOLLHUP:
  50.             epoll.unregister(fileno)
  51.             connections[fileno].close()
  52.             del connections[fileno]
  53. finally:
  54.    epoll.unregister(serversocket.fileno())
  55.    epoll.close()
  56.    serversocket.close()

Level-triggered mode is typically used when porting a program that previously used select or poll, while edge-triggered mode is for programmers who prefer not to have the operating system help manage event state.

In addition to these two modes, a socket can also be registered with the EPOLLONESHOT flag. A socket registered this way is reported by epoll.poll() only once, after which it is automatically removed from the set of monitored sockets.
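
A minimal sketch of my own (not from the original article) of how EPOLLONESHOT behaves: after the first report the descriptor is effectively muted until it is explicitly re-armed with epoll.modify():

import select, socket

server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
server.bind(('0.0.0.0', 8080))
server.listen(5)
server.setblocking(0)

ep = select.epoll()
ep.register(server.fileno(), select.EPOLLIN | select.EPOLLONESHOT)

events = ep.poll(1)
for fileno, event in events:
    if fileno == server.fileno():
        connection, address = server.accept()
        connection.close()
        # re-arm the server socket, otherwise no further events are reported for it
        ep.modify(server.fileno(), select.EPOLLIN | select.EPOLLONESHOT)

ep.unregister(server.fileno())
ep.close()
server.close()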

Performance Considerations

Listen Backlog Queue Size

Each of Examples 1-4 calls listen() on line 12. The parameter passed in is the size of the listen backlog queue, which tells the operating system how many TCP/IP connections that have not yet been accepted may be kept waiting in the queue. Every call to accept() frees up a slot in the queue for the next connection; if the queue is full, new connections are simply dropped so that clients are not left waiting unnecessarily. A real server handles hundreds or thousands of concurrent connections, so passing 1 is not appropriate; for example, when benchmarking with ab, any backlog smaller than 50 hurts performance.

TCP Options

The TCP_CORK option can be used to "bottle up" messages until they are complete before sending them. Lines 34 and 40 of Example 5 show how to use it; it may be useful for servers that use HTTP/1.1 pipelining.

Example 5
  1. import socket, select
  2. EOL1 = b'\n\n'
  3. EOL2 = b'\n\r\n'
  4. response  = b'HTTP/1.0 200 OK\r\nDate: Mon, 1 Jan 1996 01:01:01 GMT\r\n'
  5. response += b'Content-Type: text/plain\r\nContent-Length: 13\r\n\r\n'
  6. response += b'Hello, world!'
  7. serversocket = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
  8. serversocket.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
  9. serversocket.bind(('0.0.0.0', 8080))
  10. serversocket.listen(1)
  11. serversocket.setblocking(0)
  12. epoll = select.epoll()
  13. epoll.register(serversocket.fileno(), select.EPOLLIN)
  14. try:
  15.    connections = {}; requests = {}; responses = {}
  16.    while True:
  17.       events = epoll.poll(1)
  18.       for fileno, event in events:
  19.          if fileno == serversocket.fileno():
  20.             connection, address = serversocket.accept()
  21.             connection.setblocking(0)
  22.             epoll.register(connection.fileno(), select.EPOLLIN)
  23.             connections[connection.fileno()] = connection
  24.             requests[connection.fileno()] = b''
  25.             responses[connection.fileno()] = response
  26.          elif event & select.EPOLLIN:
  27.             requests[fileno] += connections[fileno].recv(1024)
  28.             if EOL1 in requests[fileno] or EOL2 in requests[fileno]:
  29.                epoll.modify(fileno, select.EPOLLOUT)
  30.                connections[fileno].setsockopt(socket.IPPROTO_TCP, socket.TCP_CORK, 1)
  31.                print('-'*40 + '\n' + requests[fileno].decode()[:-2])
  32.          elif event & select.EPOLLOUT:
  33.             byteswritten = connections[fileno].send(responses[fileno])
  34.             responses[fileno] = responses[fileno][byteswritten:]
  35.             if len(responses[fileno]) == 0:
  36.                connections[fileno].setsockopt(socket.IPPROTO_TCP, socket.TCP_CORK, 0)
  37.                epoll.modify(fileno, 0)
  38.                connections[fileno].shutdown(socket.SHUT_RDWR)
  39.          elif event & select.EPOLLHUP:
  40.             epoll.unregister(fileno)
  41.             connections[fileno].close()
  42.             del connections[fileno]
  43. finally:
  44.    epoll.unregister(serversocket.fileno())
  45.    epoll.close()
  46.    serversocket.close()

Conversely, the TCP_NODELAY option tells the operating system that data passed to socket.send() should be sent immediately. Line 14 of Example 6 shows this usage, which programs like SSH and other "real-time" applications might use.

Example 6

  1. import socket, select
  2. EOL1 = b'\n\n'
  3. EOL2 = b'\n\r\n'
  4. response  = b'HTTP/1.0 200 OK\r\nDate: Mon, 1 Jan 1996 01:01:01 GMT\r\n'
  5. response += b'Content-Type: text/plain\r\nContent-Length: 13\r\n\r\n'
  6. response += b'Hello, world!'
  7. serversocket = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
  8. serversocket.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
  9. serversocket.bind(('0.0.0.0', 8080))
  10. serversocket.listen(1)
  11. serversocket.setblocking(0)
  12. serversocket.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
  13. epoll = select.epoll()
  14. epoll.register(serversocket.fileno(), select.EPOLLIN)
  15. try:
  16.    connections = {}; requests = {}; responses = {}
  17.    while True:
  18.       events = epoll.poll(1)
  19.       for fileno, event in events:
  20.          if fileno == serversocket.fileno():
  21.             connection, address = serversocket.accept()
  22.             connection.setblocking(0)
  23.             epoll.register(connection.fileno(), select.EPOLLIN)
  24.             connections[connection.fileno()] = connection
  25.             requests[connection.fileno()] = b''
  26.             responses[connection.fileno()] = response
  27.          elif event & select.EPOLLIN:
  28.             requests[fileno] += connections[fileno].recv(1024)
  29.             if EOL1 in requests[fileno] or EOL2 in requests[fileno]:
  30.                epoll.modify(fileno, select.EPOLLOUT)
  31.                print('-'*40 + '\n' + requests[fileno].decode()[:-2])
  32.          elif event & select.EPOLLOUT:
  33.             byteswritten = connections[fileno].send(responses[fileno])
  34.             responses[fileno] = responses[fileno][byteswritten:]
  35.             if len(responses[fileno]) == 0:
  36.                epoll.modify(fileno, 0)
  37.                connections[fileno].shutdown(socket.SHUT_RDWR)
  38.          elif event & select.EPOLLHUP:
  39.             epoll.unregister(fileno)
  40.             connections[fileno].close()
  41.             del connections[fileno]
  42. finally:
  43.    epoll.unregister(serversocket.fileno())
  44.    epoll.close()
  45.    serversocket.close()

Source Code

The examples on this page are in the public domain and available for download.