\section{\module{codecs} ---
         Codec registry and base classes}

\declaremodule{standard}{codecs}
\modulesynopsis{Encode and decode data and streams.}
\moduleauthor{Marc-Andre Lemburg}{mal@lemburg.com}
\sectionauthor{Marc-Andre Lemburg}{mal@lemburg.com}


\index{Unicode}
\index{Codecs}
\indexii{Codecs}{encode}
\indexii{Codecs}{decode}
\index{streams}
\indexii{stackable}{streams}
					
						
This module defines base classes for standard Python codecs (encoders
and decoders) and provides access to the internal Python codec
registry, which manages the codec lookup process.

It defines the following functions:
					
						
\begin{funcdesc}{register}{search_function}
Register a codec search function. Search functions are expected to
take one argument, the encoding name in all lower case letters, and
return a tuple of functions \code{(\var{encoder}, \var{decoder},
\var{stream_reader}, \var{stream_writer})}. The members of the tuple
are expected to behave as follows:

  \var{encoder} and \var{decoder}: These must be functions or methods
  which have the same interface as the \method{encode()} and
  \method{decode()} methods of Codec instances (see Codec Interface).
  The functions/methods are expected to work in a stateless mode.

  \var{stream_reader} and \var{stream_writer}: These have to be
  factory functions providing the following interface:

	\code{factory(\var{stream}, \var{errors}='strict')}

  The factory functions must return objects providing the interfaces
  defined by the base classes \class{StreamReader} and
  \class{StreamWriter}, respectively. Stream codecs can maintain
  state.

  Possible values for \var{errors} are \code{'strict'} (raise an
  exception in case of an encoding error), \code{'replace'} (replace
  malformed data with a suitable replacement marker, such as
  \character{?}) and \code{'ignore'} (ignore malformed data and
  continue without further notice).

If a search function cannot find a given encoding, it should
return \code{None}.
\end{funcdesc}
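
As a rough illustration of the contract described above, the following
sketch registers a search function for a single made-up encoding name.
The name \code{demoutf8} is hypothetical and exists only for this
example; the sketch answers it by handing back the four functions of
the existing UTF-8 codec. It assumes a current Python 3 interpreter,
where the encoder accepts a string and returns bytes.

```python
import codecs

def search_function(name):
    # 'demoutf8' is a hypothetical name used only for this example.
    # The registry passes the requested name in lower case.
    if name != 'demoutf8':
        # Unknown encoding: tell the registry to keep looking.
        return None
    # Reuse the functions of the existing UTF-8 codec.
    encoder, decoder, stream_reader, stream_writer = codecs.lookup('utf-8')
    return (encoder, decoder, stream_reader, stream_writer)

codecs.register(search_function)

# The mixed-case spelling reaches the search function as 'demoutf8'.
encoder, decoder, stream_reader, stream_writer = codecs.lookup('DemoUTF8')
data, consumed = encoder('abc')
```

Returning \code{None} for every other name lets the registry continue
with the remaining search functions.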
					
						
\begin{funcdesc}{lookup}{encoding}
Look up a codec tuple in the Python codec registry and return the
function tuple as defined above.

Encodings are first looked up in the registry's cache. If not found,
the list of registered search functions is scanned. If no codec tuple
is found, a \exception{LookupError} is raised. Otherwise, the codec
tuple is stored in the cache and returned to the caller.
\end{funcdesc}
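
A minimal sketch of a lookup, assuming a current Python 3 interpreter
and the standard \code{utf-8} codec: the returned tuple is unpacked
into its four members, the stateless encoder and decoder are called
directly, and an unknown name is shown to raise
\exception{LookupError}. The name \code{no_such_encoding_xyz} is made
up for the example.

```python
import codecs

# Unpack the function tuple returned by the registry.
encoder, decoder, stream_reader, stream_writer = codecs.lookup('utf-8')

# Both functions return a pair: (result, amount of input consumed).
encoded, length = encoder('caf\u00e9')
decoded, consumed = decoder(encoded)

# A name that no search function recognizes raises LookupError.
try:
    codecs.lookup('no_such_encoding_xyz')
except LookupError:
    missing = True
else:
    missing = False
```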
					
						
To simplify working with encoded files or streams, the module
also defines these utility functions:
					
						
\begin{funcdesc}{open}{filename, mode\optional{, encoding\optional{,
                       errors\optional{, buffering}}}}
Open an encoded file using the given \var{mode} and return
a wrapped version providing transparent encoding/decoding.

\strong{Note:} The wrapped version will only accept the object format
defined by the codecs, i.e.\ Unicode objects for most built-in
codecs.  Output is also codec-dependent and will usually be Unicode as
well.

\var{encoding} specifies the encoding which is to be used for the
file.

\var{errors} may be given to define the error handling. It defaults
to \code{'strict'}, which causes a \exception{ValueError} to be raised
in case an encoding error occurs.

\var{buffering} has the same meaning as for the built-in
\function{open()} function.  It defaults to line buffered.
\end{funcdesc}
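
The following sketch shows one way the wrapped file might be used. It
assumes a current Python 3 interpreter; the temporary directory and the
file names \code{demo.txt} and \code{broken.txt} are choices made for
this example only. Text is written through the wrapper, read back
through it, and then compared with the raw bytes on disk. The last part
shows the default \code{'strict'} error handling rejecting malformed
input.

```python
import codecs
import os
import tempfile

# Hypothetical scratch files used only for this example.
directory = tempfile.mkdtemp()
path = os.path.join(directory, 'demo.txt')
bad_path = os.path.join(directory, 'broken.txt')

# Text written through the wrapper is encoded on its way to disk.
with codecs.open(path, 'w', 'utf-8') as f:
    f.write('gr\u00fc\u00df')

# Reading through the wrapper decodes the stored bytes again.
with codecs.open(path, 'r', 'utf-8') as f:
    text = f.read()

# The file itself holds the encoded form.
with open(path, 'rb') as f:
    raw = f.read()

# A byte that is not valid UTF-8 triggers the 'strict' error handling.
with open(bad_path, 'wb') as f:
    f.write(b'\xff')
try:
    with codecs.open(bad_path, 'r', 'utf-8') as f:
        f.read()
except ValueError:
    strict_failed = True
else:
    strict_failed = False
```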
					
						
\begin{funcdesc}{EncodedFile}{file, input\optional{,
                              output\optional{, errors}}}
Return a wrapped version of \var{file} which provides transparent
encoding translation.

Strings written to the wrapped file are interpreted according to the
given \var{input} encoding and then written to the original file as
strings using the \var{output} encoding. The intermediate encoding will
usually be Unicode but depends on the specified codecs.

If \var{output} is not given, it defaults to \var{input}.

\var{errors} may be given to define the error handling. It defaults to
\code{'strict'}, which causes \exception{ValueError} to be raised in case
an encoding error occurs.
\end{funcdesc}
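
As a small sketch of the translation described above, assuming a
current Python 3 interpreter (where the data written to the wrapper is
a bytes object) and using an in-memory \code{io.BytesIO} buffer in
place of a real file: Latin-1 data written to the wrapper arrives in
the underlying buffer as UTF-8. The choice of the two encodings is
arbitrary and made only for this example.

```python
import codecs
import io

# An in-memory stand-in for the original file.
raw = io.BytesIO()

# Data written to the wrapper is taken to be Latin-1 (the input
# encoding) and stored in the underlying file as UTF-8 (the output
# encoding).
wrapped = codecs.EncodedFile(raw, 'latin-1', 'utf-8')
wrapped.write(b'caf\xe9')

stored = raw.getvalue()
```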
					
						
...XXX document codec base classes...
					
						
The module also provides the following constants, which are useful
for reading and writing platform-dependent files:

\begin{datadesc}{BOM}
\dataline{BOM_BE}
\dataline{BOM_LE}
\dataline{BOM32_BE}
\dataline{BOM32_LE}
\dataline{BOM64_BE}
\dataline{BOM64_LE}
These constants define the byte order marks (BOM) used in data
streams to indicate the byte order used in the stream or file.
\constant{BOM} is either \constant{BOM_BE} or \constant{BOM_LE}
depending on the platform's native byte order, while the others
represent big-endian (\samp{_BE} suffix) and little-endian
(\samp{_LE} suffix) byte order using 32-bit and 64-bit encodings.
\end{datadesc}
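
One possible use of these constants is to inspect the first bytes of a
data stream. The sketch below assumes a current Python 3 interpreter,
where the constants are bytes objects; the helper
\code{detect_byte_order()} is hypothetical and not part of the module.
It only looks at \constant{BOM_BE} and \constant{BOM_LE}.

```python
import codecs
import sys

def detect_byte_order(data):
    # Hypothetical helper: report the byte order announced by a
    # leading byte order mark, or None if there is no such mark.
    if data.startswith(codecs.BOM_BE):
        return 'big'
    if data.startswith(codecs.BOM_LE):
        return 'little'
    return None

# BOM is whichever of the two marks matches the native byte order.
native = detect_byte_order(codecs.BOM)
```

On any given platform, \code{native} is expected to agree with
\code{sys.byteorder}.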
					
						