The Apache 2 Filter Architecture is the major innovation that sets it apart from other webservers, including Apache 1.x, as a uniquely powerful and versatile applications platform. But this power comes at a price: there is a bit of a learning curve to harnessing it. Apart from understanding the architecture itself, the crux of the matter is to get to grips with Buckets and Brigades, the building blocks of a filter.
In this article, we introduce buckets and brigades, taking you to the point where you should have a basic working knowledge. In the process, we develop a simple but useful filter module that works by manipulating buckets and brigades directly.
This direct manipulation is the lowest-level API for working with buckets and brigades, and probably the hardest to use. But because it is low level, it serves to demonstrate what's going on. In other articles we will discuss related subjects including debugging, resource management, and alternative ways to work with the data.
The basic concepts we are dealing with are the bucket and the brigade. Let us first introduce them, before moving on to why and how to use them.
A bucket is a container for data. Buckets can contain any type of data. Although the most common case is a block of memory, a bucket may instead contain a file on disc, or even be fed a data stream from a dynamic source such as a separate program. Different bucket types exist to hold different kinds of data and the methods for handling it. In OOP terms, the apr_bucket is an abstract base class from which actual bucket types are derived.
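To make the "abstract base class" idea concrete, here is a minimal sketch in plain C of how one bucket structure can front several kinds of data through a table of function pointers. It is a toy model only: toy_bucket, toy_bucket_type and the two concrete types are hypothetical names invented for illustration, not the APR definitions.

```c
#include <stddef.h>
#include <string.h>

/* Toy model only: these names are hypothetical and are not the APR API. */
typedef struct toy_bucket toy_bucket;

/* The "abstract base class": a table of operations each bucket type supplies. */
typedef struct {
    const char *name;
    /* Copy up to cap bytes of the bucket's data into out; return bytes copied. */
    size_t (*read)(const toy_bucket *b, char *out, size_t cap);
} toy_bucket_type;

struct toy_bucket {
    const toy_bucket_type *type;  /* which concrete type this bucket is */
    const void *data;             /* type-specific payload */
    size_t length;                /* number of bytes the bucket represents */
};

/* Concrete type 1: a block of memory. */
static size_t mem_read(const toy_bucket *b, char *out, size_t cap)
{
    size_t n = b->length < cap ? b->length : cap;
    memcpy(out, b->data, n);
    return n;
}

/* Concrete type 2: data generated on demand (here, one byte repeated). */
static size_t fill_read(const toy_bucket *b, char *out, size_t cap)
{
    size_t n = b->length < cap ? b->length : cap;
    memset(out, *(const char *)b->data, n);
    return n;
}

const toy_bucket_type toy_type_mem  = { "MEM",  mem_read  };
const toy_bucket_type toy_type_fill = { "FILL", fill_read };

/* Callers use one entry point and never need to know the concrete type. */
size_t toy_bucket_read(const toy_bucket *b, char *out, size_t cap)
{
    return b->type->read(b, out, cap);
}
```

Code that consumes buckets calls toy_bucket_read() and lets the type table decide where the bytes come from, which is the property that lets a filter treat memory, files and generated data alike.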
There are several different types of data bucket, as well as metadata buckets. We will describe these at the end of this article.
In normal use, there is no such thing as a freestanding bucket: they are contained in bucket brigades. A brigade is a container that may hold any number of buckets in a ring structure. The brigade serves to enable flexible and efficient manipulation of data, and is the unit that gets passed to and from your filter.
So, why do we need buckets and brigades? Can't we just keep it simple and pass simple blocks of data? Maybe a void* with a length, or a C++ string?
Well, the first part of the answer we've seen already: buckets are more than just data: they are an abstraction that unifies fundamentally different types of data. But even so, how do they justify the additional complexity over simple buffers and ad-hoc use of other data sources in exceptional cases?
The second motivation for buckets and brigades is that they enable efficient manipulation of blocks of memory, typical of many filtering applications. We will demonstrate a simple but typical example of this: a filter to display plain text documents prettified as an HTML page, with header and footer supplied by the webmaster in the manner of a fancy directory listing.
Now, HTML can of course include blocks of plain text, enclosing them in <pre> to preserve spacing and formatting. So the main task of a text->html filter is to pass the text straight through. But certain special characters need to be escaped. To be safe both with the HTML spec and browsers, we will escape the four characters <, >, &, and " as &lt;, &gt;, &amp;, and &quot; respectively.
Because the replacement &lt; is longer by three bytes than the original, we cannot just replace the character. If we are using a simple buffer, we either have to extend it with realloc() or equivalent, or copy the whole thing interpolating the replacement. Repeat this a few times and it rapidly gets very inefficient. A better solution is a two-pass scan of the buffer: the first pass simply computes the length of the new buffer, after which we allocate the memory and copy the data with the required replacements. But even that is by no means efficient.
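For comparison, the two-pass approach on a simple buffer might look like the sketch below. The helper name escape_two_pass and its malloc-based interface are assumptions made for illustration; the point is that every byte of the input is copied once, however few characters actually need escaping.

```c
#include <stdlib.h>
#include <string.h>

/* Return the entity for c, or NULL if c needs no escaping. */
static const char *entity_for(char c)
{
    switch (c) {
    case '<': return "&lt;";
    case '>': return "&gt;";
    case '&': return "&amp;";
    case '"': return "&quot;";
    default:  return NULL;
    }
}

/* Hypothetical helper: escape len bytes of src into a new malloc'd,
 * NUL-terminated string. Returns NULL on allocation failure; caller frees. */
char *escape_two_pass(const char *src, size_t len)
{
    size_t i, out_len = 0;
    char *out, *q;

    /* Pass 1: compute the exact size of the escaped output. */
    for (i = 0; i < len; i++) {
        const char *e = entity_for(src[i]);
        out_len += e ? strlen(e) : 1;
    }

    out = malloc(out_len + 1);
    if (out == NULL)
        return NULL;

    /* Pass 2: copy, substituting entities as we go. */
    q = out;
    for (i = 0; i < len; i++) {
        const char *e = entity_for(src[i]);
        if (e) {
            size_t n = strlen(e);
            memcpy(q, e, n);
            q += n;
        } else {
            *q++ = src[i];
        }
    }
    *q = '\0';
    return out;
}
```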
By using buckets and brigades in place of a simple buffer, we can simply replace the characters in situ, without allocating or copying any big blocks of memory. Provided the number of characters replaced is small in comparison to the total document size, this is much more efficient. In outline:
1. We encounter a character that needs replacing in the bucket.
2. We split the bucket before and after the character. Now we have three buckets: the character itself, and all data before and after it.
3. We drop the character, leaving the before and after buckets.
4. We create a new bucket containing the replacement, and insert it where the character was.
Now instead of moving/copying big blocks of data, we are just manipulating pointers into an existing block. The only actual data to change are the single character removed and the few bytes that replace it.
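The four steps can be sketched with a toy ring of buckets in plain C. Everything here (tbucket, tbrigade and the helper functions) is a hypothetical model written for this illustration and is not the APR API; each bucket is just a pointer and a length into text that is never copied.

```c
#include <stdlib.h>
#include <string.h>

/* Toy model only: hypothetical names, not the APR API. Each bucket is a
 * (pointer, length) view; it never owns or copies the text it points at. */
typedef struct tbucket {
    struct tbucket *prev, *next;
    const char *data;
    size_t len;
} tbucket;

/* A brigade is a ring: the sentinel's next is the first bucket and its prev
 * is the last. An empty brigade has the sentinel pointing at itself. */
typedef struct { tbucket sentinel; } tbrigade;

void tbrigade_init(tbrigade *bb)
{
    bb->sentinel.prev = bb->sentinel.next = &bb->sentinel;
    bb->sentinel.data = NULL;
    bb->sentinel.len = 0;
}

/* Link a new view of (data, len) into the ring just before pos. */
tbucket *tbucket_insert_before(tbucket *pos, const char *data, size_t len)
{
    tbucket *b = malloc(sizeof *b);
    if (b == NULL)
        return NULL;
    b->data = data;
    b->len = len;
    b->prev = pos->prev;
    b->next = pos;
    pos->prev->next = b;
    pos->prev = b;
    return b;
}

/* Split b at offset: b keeps the first `at` bytes, a new bucket after it
 * views the rest. No text is copied. Returns the new bucket. */
tbucket *tbucket_split(tbucket *b, size_t at)
{
    tbucket *rest = tbucket_insert_before(b->next, b->data + at, b->len - at);
    if (rest != NULL)
        b->len = at;
    return rest;
}

/* Unlink b from the ring and free the node (not the text). */
void tbucket_delete(tbucket *b)
{
    b->prev->next = b->next;
    b->next->prev = b->prev;
    free(b);
}

/* Replace the single byte at offset `at` of bucket b with `entity`.
 * Returns the bucket holding what remains after the replaced byte,
 * or NULL on allocation failure. */
tbucket *tbucket_replace_char(tbucket *b, size_t at, const char *entity)
{
    tbucket *ch, *after;
    ch = tbucket_split(b, at);          /* before | char + after */
    if (ch == NULL) return NULL;
    after = tbucket_split(ch, 1);       /* before | char | after */
    if (after == NULL) return NULL;
    tbucket_delete(ch);                 /* before | after */
    if (tbucket_insert_before(after, entity, strlen(entity)) == NULL)
        return NULL;                    /* before | entity | after */
    return after;
}

/* Concatenate every bucket into out (capacity cap); returns bytes written. */
size_t tbrigade_flatten(const tbrigade *bb, char *out, size_t cap)
{
    size_t n = 0;
    const tbucket *b;
    for (b = bb->sentinel.next; b != &bb->sentinel; b = b->next) {
        if (n + b->len > cap) break;
        memcpy(out + n, b->data, b->len);
        n += b->len;
    }
    return n;
}
```

Note that the "before" and "after" buckets still point into the original text; only the small ring nodes are allocated, which is why the cost depends on the number of replacements rather than on the size of the document.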
mod_txt is a simple output filter module to display plain text files as HTML (or XHTML) with a header and footer. When a text file is requested, it escapes the text as required for HTML, and displays it between the header and the footer.
It works by direct manipulation of buckets (the lowest-level API), and demonstrates both insertion of file data and substitution of characters, without any allocation or moving of big blocks.
Firstly we introduce two functions to deal with the data insertions: one for the files, one for the simple entity replacements:
Creating a File bucket requires an open filehandle and a byte range within the file. Since we're transmitting the entire file, we just stat its size to set the byte range. We open it with a shared lock and with sendfile enabled for maximum performance.
static apr_bucket* txt_file_bucket(request_rec* r, const char* fname) {
  apr_file_t* file = NULL ;
  apr_finfo_t finfo ;
  /* stat the file to get its size, which sets the byte range */
  if ( apr_stat(&finfo, fname, APR_FINFO_SIZE, r->pool) != APR_SUCCESS ) {
    return NULL ;
  }
  /* open with a shared lock, and with sendfile enabled */
  if ( apr_file_open(&file, fname, APR_READ|APR_SHARELOCK|APR_SENDFILE_ENABLED,
        APR_OS_DEFAULT, r->pool ) != APR_SUCCESS ) {
    return NULL ;
  }
  if ( ! file ) {
    return NULL ;
  }
  return apr_bucket_file_create(file, 0, finfo.size, r->pool,
        r->connection->bucket_alloc) ;
}
To create the simple text replacements, we can just make a bucket of an inline string. The appropriate bucket type for such data is transient:
static apr_bucket* txt_esc(char c, apr_bucket_alloc_t* alloc ) {
  switch (c) {
    case '<': return apr_bucket_transient_create("&lt;", 4, alloc) ;
    case '>': return apr_bucket_transient_create("&gt;", 4, alloc) ;
    case '&': return apr_bucket_transient_create("&amp;", 5, alloc) ;
    case '"': return apr_bucket_transient_create("&quot;", 6, alloc) ;
    default: return NULL ;	/* shut compilers up */
  }
}
Actually this is not the most efficient way to do this. We will discuss alternative formulations later in this article.
Now the main filter itself is broadly straightforward, but there are a number of interesting and unexpected points to consider. Since this is a little longer than the above utility functions, we'll comment it inline instead. Note that the Header and Footer file buckets are set in a filter_init function (omitted for brevity).
static int txt_filter(ap_filter_t* f, apr_bucket_brigade* bb) {
  apr_bucket* b ;
  txt_ctxt* ctxt = (txt_ctxt*)f->ctx ;
  if ( ctxt == NULL ) {
    txt_filter_init(f) ;
    ctxt = f->ctx ;
  }
Main Loop: This construct is typical for iterating over the incoming data
  for ( b = APR_BRIGADE_FIRST(bb);
        b != APR_BRIGADE_SENTINEL(bb);
        b = APR_BUCKET_NEXT(b) ) {
    const char* buf ;
    size_t bytes ;
As in any filter, we need to check for EOS. When we encounter it, we insert the footer in front of it. We shouldn't get more than one EOS, but just in case we do, we record that the footer has already been inserted. That makes the filter error-tolerant.
    if ( APR_BUCKET_IS_EOS(b) ) {
      /* end of input file - insert footer if any */
      if ( ctxt->foot && ! (ctxt->state & TXT_FOOT ) ) {
        ctxt->state |= TXT_FOOT ;
        APR_BUCKET_INSERT_BEFORE(b, ctxt->foot);
      }
The main case is a bucket containing data. We can get it as a simple buffer with its size in bytes:
    } else if ( apr_bucket_read(b, &buf, &bytes, APR_BLOCK_READ) == APR_SUCCESS ) {
      /* We have a bucket full of text. Just escape it where necessary */
      size_t count = 0 ;
      const char* p = buf ;
Now we can search for characters that need replacing, and replace them
      while ( count < bytes ) {
        size_t sz = strcspn(p, "<>&\"") ;
        count += sz ;
Here comes the tricky bit: replacing a single character inline.
        if ( count < bytes ) {
          apr_bucket_split(b, sz) ;	/* Split off before buffer */
          b = APR_BUCKET_NEXT(b) ;	/* Skip over before buffer */
          APR_BUCKET_INSERT_BEFORE(b, txt_esc(p[sz],
                f->r->connection->bucket_alloc)) ;	/* Insert the replacement */
          apr_bucket_split(b, 1) ;	/* Split off the char to remove */
          APR_BUCKET_REMOVE(b) ;	/* ... and remove it */
          b = APR_BUCKET_NEXT(b) ;	/* Move cursor on to what-remains
                                 	   so that it stays in sequence with
                                 	   our main loop */
          count += 1 ;
          p += sz + 1 ;
        }
      }
    }
  }
Now we insert the Header if it hasn't already been inserted.
Note:
(a) This has to come after the main loop, to avoid the header itself getting into the parse.
(b) It works because we can insert a bucket anywhere in the brigade, and in this case put it at the head.
(c) As with the footer, we save state to avoid inserting it more than once.
  if ( ctxt->head && ! (ctxt->state & TXT_HEAD ) ) {
    ctxt->state |= TXT_HEAD ;
    APR_BRIGADE_INSERT_HEAD(bb, ctxt->head);
  }
Now we've finished manipulating data, we just pass it down the filter chain.
  return ap_pass_brigade(f->next, bb) ;
}
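One caveat about the scan in the filter above: strcspn() expects a NUL-terminated string, whereas the buffer returned from a bucket read comes with an explicit length and, as far as we are aware, is not guaranteed to be NUL-terminated. A length-bounded scan avoids relying on a terminator. The helper below, span_not_in, is a hypothetical name and only a sketch of that idea in plain C; it is not part of the original module.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: like strcspn(p, reject), but examines at most len
 * bytes and does not stop at a NUL byte in the data. Returns the number of
 * leading bytes of p that are not in reject (len if none are). */
size_t span_not_in(const char *p, size_t len, const char *reject)
{
    size_t i;
    for (i = 0; i < len; i++) {
        /* strchr would also match the terminating NUL of reject,
         * so a NUL byte in the data is explicitly treated as ordinary. */
        if (p[i] != '\0' && strchr(reject, p[i]) != NULL)
            break;
    }
    return i;
}
```

In the main loop this would be called with the number of bytes still unscanned, for example span_not_in(p, bytes - count, "<>&\"").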
Note that we created a new bucket every time we replaced a character. Couldn't we have prepared four buckets in advance - one for each of the characters to be replaced - and then re-used them whenever the character occurred?
The problem here is that each bucket is linked to its neighbours. So if we re-use the same bucket, we lose the links, so that the brigade now jumps over any data between the two instances of it. Hence we do need a new bucket every time. That means this technique becomes inefficient when a high proportion of input data has to be changed. We will show alternative techniques for such cases in other articles.
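The effect of re-using a linked bucket can be reproduced with a toy ring in plain C. The node type and functions below are hypothetical and written only for this illustration (they are not the APR ring macros), but the failure mode is the one described: inserting the same node a second time overwrites its links, and the data between the two insertions drops out of the forward walk.

```c
#include <stddef.h>

/* Toy ring node: hypothetical, not the APR API. */
typedef struct node {
    struct node *prev, *next;
    const char *label;
} node;

void ring_init(node *sentinel)
{
    sentinel->prev = sentinel->next = sentinel;
    sentinel->label = NULL;
}

/* Append n at the tail. This overwrites n's own links, so it is only
 * correct if n is not already in the ring. */
void ring_append(node *sentinel, node *n)
{
    n->prev = sentinel->prev;
    n->next = sentinel;
    sentinel->prev->next = n;
    sentinel->prev = n;
}

/* Count the nodes reachable by walking forward from the head. */
size_t ring_count(const node *sentinel)
{
    size_t n = 0;
    const node *p;
    for (p = sentinel->next; p != sentinel; p = p->next)
        n++;
    return n;
}

/* Build "a", entity, "b", then the same entity node again,
 * and count what a forward walk still sees. */
size_t reuse_demo(void)
{
    node s, a = {0}, b = {0}, entity = {0};
    ring_init(&s);
    a.label = "a";
    b.label = "b";
    entity.label = "&lt;";
    ring_append(&s, &a);
    ring_append(&s, &entity);
    ring_append(&s, &b);
    ring_append(&s, &entity);   /* re-use: entity.next now points at the sentinel */
    return ring_count(&s);      /* walks a, entity, then stops: b is skipped */
}
```

Four appends were made, but reuse_demo() returns 2: the walk reaches "a" and the entity, whose next pointer now leads straight back to the sentinel, so "b" is lost.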
In the above, we used two data bucket types: file and transient, and the eos metadata bucket type. There are several other bucket types suitable for different kinds of data and metadata.
When we created transient buckets above, we were inserting a chunk of memory in the output stream. But we noted that this bucket was not the most efficient way to escape a character. The reason for this is that the transient memory has to be copied internally to prevent it going out of scope. We could instead have used memory that's guaranteed never to go out of scope, by replacing
  case '<': return apr_bucket_transient_create("&lt;", 4, alloc) ;
with
  static const char* lt = "&lt;" ;
  ...
  case '<': return apr_bucket_immortal_create(lt, 4, alloc) ;
When we create an immortal bucket, we guarantee that the memory won't go out of scope during the lifetime of the bucket, so the APR never needs to copy it internally.
A third variant on the same principle is the pool bucket. This refers to memory allocated on a pool, and will have to be copied internally if and only if the pool is destroyed within the lifetime of the bucket.
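The difference between these in-memory types comes down to who guarantees the lifetime of the memory. The sketch below models that in plain C: a view that borrows transient memory has to take a private copy before the original goes away, while a view of immortal memory needs nothing. The view type, the lifetime enum and both functions are hypothetical names for illustration; this is a model of the principle, not APR's implementation.

```c
#include <stdlib.h>
#include <string.h>

/* Toy model only: hypothetical names, not the APR API. */
typedef enum { LIFE_TRANSIENT, LIFE_IMMORTAL, LIFE_HEAP } lifetime;

typedef struct {
    lifetime kind;
    const char *data;   /* borrowed for transient/immortal, owned for heap */
    size_t len;
} view;

/* Make the view safe to keep after the caller's buffer has gone away.
 * Immortal (and already-owned) data needs nothing; transient data must be
 * copied into memory the view owns. Returns 0 on success, -1 on failure. */
int view_setaside(view *v)
{
    char *copy;
    if (v->kind != LIFE_TRANSIENT)
        return 0;                 /* nothing to do: no copy, no allocation */
    copy = malloc(v->len ? v->len : 1);
    if (copy == NULL)
        return -1;
    memcpy(copy, v->data, v->len);
    v->data = copy;
    v->kind = LIFE_HEAP;
    return 0;
}

/* Release memory the view owns, if any. */
void view_destroy(view *v)
{
    if (v->kind == LIFE_HEAP)
        free((void *)v->data);
    v->data = NULL;
    v->len = 0;
}
```

A pool-backed view would sit between the two cases: it would need the copy only if its pool were destroyed while the view was still alive.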
The heap bucket is another form of in-memory bucket. But its usage is rather different from any of the above. We rarely if ever need to create a heap bucket explicitly: rather they are managed internally when we use the stdio-like API to write data to the next filter. This API is discussed in other articles.
The File bucket, as we have already seen, enables us to insert a file (or part of a file) into the data stream. Although we had to stat it to find its length, we didn't have to read it. If sendfile is enabled, the operating system (through the APR) can optimise sending the file.
The mmap bucket type is similar, and is appropriate to mmapped files. APR may convert file buckets to mmap internally if we (or a later filter) read the data.
Two other more unusual bucket types are the pipe and the socket, which enable us to insert data from an external source via IPC.
In addition to data buckets, there are two metadata types. The EOS bucket is crucial: it must be sent at the end of a data stream, and it serves to signal the end of incoming data. The other metadata type is the rarely-used FLUSH bucket, which may occasionally be required, but is not guaranteed to be propagated by every filter.
Further reading
Cliff Woolley gave a talk on buckets and brigades at ApacheCon 2002. His notes are very readable, and go into more depth than this article.